info@techdevops.com | 437-991-3573 | Data Engineering Services
TechDevOps.com
Resources Tools
Experts in Microsoft SQL Server on Windows, Linux, Containers | Clusters, Always On, FCI | Migrations, Cloud, Performance



SQL Server Distribution Agent is retrying after an error
by BF (Principal Consultant; Architecture; Engineering)
2016-07-17








Replication Monitor Error Message:


"Agent SQLNYC01-Onvoices-PubInvoicesInvoices-SQLNYC01-1353 is retrying after an error. 211 retries attempted. See agent job history in the Jobs folder for more details."



Question #1: How many times will the Distribution Agent(s) SQL Job(s) retry with Transactional Replication set to continuous delivery?



2147483647 times!!!

...which means it will retry 2147483647 times at 1-minute retry interval. It it kept failing continuously it would take 4085 years for
that many retry's to happen.
...2147483647 = max size of data type INT. int


To check Distribution Agent retry errors logged in last 24 hours:

Select * from Distribution.dbo.MSdistribution_history where [Time] > dateadd(hh,-24,getdate()) and comments like '%is retrying after an error%' order by [Time] desc

or

Select t1.[Time] as 'Logged Time', UPPER(t2.name) as 'Disribution Agent Name',
LTRIM(RTRIM((Replace(Substring(t1.comments, charindex('.',t1.comments), charindex('retries',t1.comments)-charindex('.',t1.comments)),'.','')))) as 'Total Retries Attempted',
Substring(t1.comments, charindex('VM',t1.comments), 4) as 'Subscriber'
from Distribution.dbo.MSdistribution_history t1
Inner Join Distribution.dbo.MSdistribution_agents t2 on t1.agent_id = t2.id
where comments like '%retries attempted%'
and [Time] > dateadd(hh,-24,getdate())
order by 4, t1.[Time] desc



Question #2: Will the Distribution Agent ever fail in a situation with 100's of continuous retry's occurring? (ex. Issues with intermittent Network Connectivity)

The Distribution Agent would continue to retry 2147483647 times in that case. However, after the below setting for Transaction Retention is passed, the system will react and cause it to fail. (Find the below screen by navigating to SSMS..Replication Properties)




This is called the "Distribution Retention Period". If the system went past 72 hours it would likely fail with:

"18854 One or more subscriptions have been marked inactive. Drop and re-create all subscriptions for this node that are failing with this error."
"The subscription(s) have been marked inactive and must be reinitialized. NoSync subscriptions will need to be dropped and recreated. (Source: MSSQLServer, Error number: 21074"

Note the actual changes occur when this SQL Job executes: "Distribution clean up: distribution"



Question #3: How to monitor for breaking errors occurring in the Replication system?

First, those errors above are obviously something you do not want to occur. You need to actively monitor for the various issues with Transactional Replication and below are two errors to watch out for. It a good idea to setup SQL Jobs to monitor, log AND alert on certain errors and to fix those immediately to avoid a Subscription becoming inactive.

20598 - The row was not found at the Subscriber when applying the replicated command.
2627 - Violation of PRIMARY KEY constraint ‘PK_TBL’. Cannot insert duplicate key in object ‘dbo.TBL’.

Here are two T-SQL statements to setup inside a SQL Jobs to monitor for errors 20598 and 2627:

Use Distribution
go

Declare @RepErrorsNew int --check last 5 mins; sql job executes every 5 mins

Select @RepErrorsNew = (select top 1 ID from Distribution.dbo.MSrepl_errors with (nolock)
where error_text like '%The row was not found at the Subscriber when applying the replicated command.%'
and [time] > dateadd(mi,-15,getdate()) order by [time] desc)

If @RepErrorsNew IS NOT NULL and @RepErrorsNew > 0
Begin
RAISERROR('Transactional Replication Error Alert - Row Not Found!', 16, 1)
End


Use Distribution
go

Declare @RepErrorsNew int --check last 5 mins; sql job executes every 5 mins

Select @RepErrorsNew = (select top 1 ID from Distribution.dbo.MSrepl_errors with (nolock)
where error_text like '%Violation of PRIMARY KEY constraint%'
and [time] > dateadd(mi,-5,getdate()) order by [time] desc)

If @RepErrorsNew IS NOT NULL and @RepErrorsNew > 0
Begin
RAISERROR('Transactional Replication Error Alert - Violation of PRIMARY KEY constraint!', 16, 1)
End



Question #4: Are there any other Retention periods in Transactional Replication?

Yes! And it is called "Publication Retention Period". To find this setting navigate to SSMS, Replication, Local Publications, right-click on a Publication and select Properties.
Then the General tab find Subscription expiration section. The below screenshot explains it really well.





Here is the guidance from Microsoft on both of these Replication Retention periods here:

"Transactional replication uses the maximum distribution retention period (the @max_distretention parameter of sp_adddistributiondb (Transact-SQL)) and the publication retention period (the @retention parameter of sp_addpublication (Transact-SQL)):

If a subscription is not synchronized within the maximum distribution retention period (default of 72 hours) and there are changes in the distribution database that have not been delivered to the Subscriber, the subscription will be marked deactivated by the Distribution clean up job that runs on the Distributor. The subscription must be reinitialized.

If a subscription is not synchronized within the publication retention period (default of 336 hours), the subscription will expire and be dropped by the Expired subscription clean up job that runs on the Publisher. The subscription must be recreated and synchronized.

If a push subscription expires, it is completely removed, but pull subscriptions are not. You must clean up pull subscriptions at the Subscriber. For more information, see Delete a Pull Subscription."