Search This Blog

Wednesday, January 18, 2012

Replication: Latency but no Latency

I am sure those of you who have been through replication latency issues know that they are not easy to crack. Gaining expertise in replication, especially in latency issues, is not something that can be learned or taught in a classroom.

Mostly, latency issues are due to:

1: Blocking on DIST/PUB/SUB
2: High resource consumption: CPU, storage, memory
3: Network issues
4: A huge distribution database (msrepl_transactions and msrepl_commands)
5: A huge publisher transaction log (too many VLFs) causing log reader latency.
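For cause 5, the VLF count is easy to check: DBCC LOGINFO returns one row per virtual log file in the current database's log. This is a sketch (the database name below is hypothetical), not part of the original troubleshooting:

```sql
-- One row per VLF in the current database's transaction log.
-- A count in the thousands is a common warning sign for log reader latency.
USE PublisherDB;   -- hypothetical publisher database name
DBCC LOGINFO;
```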

I am sharing a case with you where there was actually no latency due to any of the causes mentioned above:
- There was no blocking.
- CPU, memory and disks were doing fine.
- There was no network issue.
- We shrunk the publisher database, which did not help us at all.
- The cleanup job was running fine. We ran update statistics with full scan on msrepl_transactions and msrepl_commands, which did not help either.

When latency started piling up, we first had a look at the log reader and distribution agent history in the distribution database, but could not get much information due to another issue (out of scope of this post): the log reader history was not showing up, and the distribution agent history was showing that almost no data was reaching the subscriber. Normally, every 5 minutes the throughput of the log reader and distribution agent threads (reader and writer threads) is written into the MSlogreader_history and MSdistribution_history system tables. But we were getting false entries there.
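As a sketch of the kind of check this involves, the throughput snapshots can be read straight from those history tables in the distribution database (adjust the TOP clause as needed):

```sql
-- Latest throughput snapshots written by the log reader agent.
SELECT TOP (20) [time], delivery_rate, delivered_commands, comments
FROM distribution.dbo.MSlogreader_history
ORDER BY [time] DESC;

-- Latest throughput snapshots written by the distribution agent.
SELECT TOP (20) [time], delivery_rate, delivered_commands, comments
FROM distribution.dbo.MSdistribution_history
ORDER BY [time] DESC;
```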

We had no choice but to take verbose logs for the distribution and log reader agents. From the logs it was clear that the log reader agent was delivering more than 1000 commands/sec while the distribution agent was delivering less than 100 commands/sec.

We were totally clueless because:
- The log reader was running at its usual speed, so there was no issue with the publisher, nor any issue in pumping the data to the distributor. The log reader verbose log made that clear, so this was ruled out.

- Data flow from distributor to subscriber was slow, but we were not able to find out whether it was slow in reading the data from the distribution database or slow in pumping the data to the subscriber database. What we could clearly see was that there was almost no activity on the subscriber.

- There were no resource bottlenecks. Something was simply stopping the data from flowing from distributor to subscriber. It was like an hourglass with a lot of sand at the top but a hole so small that very little can pass through.

I then went through the distribution agent verbose logs again and found very frequent entries related to committing transactions. The same sentence repeated over 5 continuous lines was not very clear, but I could sense that something related to committing transactions was going on. These entries repeated in groups of 5-6 very frequently, and each spell of 5-6 entries took around 2 seconds.

This gave us a hint, and we immediately jumped to the distribution agent profile to see if someone had modified a setting that might be causing the issue. We made a mistake here, because we just right-clicked Replication >> Distributor Properties >> Agent Profiles >> Distribution Agent. That looked fine.

However, one of our colleagues opened the profile of the specific distribution agent that was giving us the issue. This was the right way, guys; this is what we needed to do. We found that the -CommitBatchSize (default value 100) and -CommitBatchThreshold (default value 1000) parameters had both been changed to 10.

Because of this, the distribution agent was committing after almost every command, which throttled the delivery rate to roughly one command per transaction. We changed the values back to the defaults and recycled the distribution agent. That's it; the story ends here.
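For reference, the agent profile parameters can also be inspected and reset from T-SQL on the distributor. The @profile_id value below is illustrative; take the real one from the first result set:

```sql
-- List the profiles defined for the Distribution Agent (agent_type 3).
EXEC sp_help_agent_profile @agent_type = 3;

-- Show the parameters of the profile in use (profile_id = 1 is illustrative).
EXEC sp_help_agent_parameter @profile_id = 1;

-- Put the two parameters back to their documented defaults.
EXEC sp_change_agent_parameter @profile_id = 1,
     @parameter_name = N'CommitBatchSize',      @parameter_value = N'100';
EXEC sp_change_agent_parameter @profile_id = 1,
     @parameter_name = N'CommitBatchThreshold', @parameter_value = N'1000';
```

The agent has to be restarted for a profile change to take effect, which is why we recycled it.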

Even though it was a small setting that cost us hours of slogging, the experience of working on a replication latency issue when there was actually no resource bottleneck (I call it false latency) was amazing.

Cheers and Happy Learning ..

Saturday, January 14, 2012

Can restoring a database to another instance reduce index fragmentation of the underlying tables?

The answer is NO, but to come to this conclusion I had to spend some time. One of my colleagues came to me with this question. My instant answer was a clear NO, but then I asked him, with some curiosity, the reason for asking. As per him, the nightly index reorg job that should run for a very long time finished in just 3 hours. Also, we could not simply check the index fragmentation, since that takes around 2 hours; our tables are huge.


I thought of 3 reasons :


1) Since there is logic in our job to do a reorg only if there is a certain level of fragmentation, that day there might have been no index falling into the rebuild category. This possibility was low but could not be ruled out.


2) The restore actually reshuffled the pages and, in that attempt, cleared out some leaf-level fragmentation. I remembered that restores take more time than backups, so I started believing this.


3) There might have been some reorg activity happening while the backup was running. The backup might have captured a copy of well-reorganized pages, resulting in a less fragmented database.


I first started off with point 2 and soon realized that I was not correct; this did not take me much time. For point 1, we added logging to the job so that, as the job finishes, it creates a log we can read. But this would take time to generate.


Now I was left with option 3. I had a table with > 99% fragmentation and 24,085,822 rows; the table size was around 4 GB. The DBCC SHOWCONTIG output is shared below:



DBCC SHOWCONTIG scanning 'stats_test' table...
Table: 'stats_test' (711673583); index ID: 1, database ID: 6
TABLE level scan performed.
- Pages Scanned................................: 268007
- Extents Scanned..............................: 33569
- Extent Switches..............................: 268005
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 12.50% [33501:268006]
- Logical Scan Fragmentation ..................: 99.22%
- Extent Scan Fragmentation ...................: 0.01%
- Avg. Bytes Free per Page.....................: 3334.3
- Avg. Page Density (full).....................: 58.81%
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
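On SQL Server 2005 and later, the same numbers can be read from sys.dm_db_index_physical_stats, which can be much cheaper to run than a full SHOWCONTIG scan. A sketch against this test table (the database name is an assumption):

```sql
-- avg_fragmentation_in_percent corresponds to Logical Scan Fragmentation,
-- avg_page_space_used_in_percent to Avg. Page Density (full).
SELECT index_id,
       avg_fragmentation_in_percent,
       avg_page_space_used_in_percent,
       page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(N'AdventureWorks'),        -- assumed database name
         OBJECT_ID(N'dbo.stats_test'),
         NULL, NULL, 'SAMPLED');          -- 'LIMITED' is faster but omits page density
```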

I then ran an index reorg and watched the percent_complete column in the sys.dm_exec_requests DMV. When it reached 46%, I kicked off a backup of the database.
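The progress check can be sketched like this (percent_complete is populated for ALTER INDEX REORGANIZE, among other commands):

```sql
-- Watch the running reorganize; rerun until percent_complete passes the
-- point at which you want to start the backup.
SELECT session_id, command, percent_complete, estimated_completion_time
FROM sys.dm_exec_requests
WHERE percent_complete > 0;
```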

Now let me restore the backup .....fingers crossed :) ....

DBCC SHOWCONTIG scanning 'stats_test' table...
Table: 'stats_test' (711673583); index ID: 1, database ID: 6
TABLE level scan performed.
- Pages Scanned................................: 158633
- Extents Scanned..............................: 19862
- Extent Switches..............................: 19866
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 99.81% [19830:19867]
- Logical Scan Fragmentation ..................: 0.15%
- Extent Scan Fragmentation ...................: 3.47%
- Avg. Bytes Free per Page.....................: 51.2
- Avg. Page Density (full).....................: 99.37%
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Mystery solved ....

Happy learning ..

Friday, January 13, 2012

Replicating SP execution: an issue in SQL Server 2005 SP3 CU2 that works fine in SQL Server 2008 SP1

Posting after a long gap, and I might still not have posted had I not learned yesterday how smart replication is.


Brief summary:
We have a very large OLTP environment where millions of small queries do inserts and updates (no deletes), and the same is replicated to other subscribers. There is so much data that most of the time we are firefighting latency. Because of the data volume we started archiving, which also started deleting data in batches from the OLTP environment and replicating those deletes to the subscribers. This further added to the latency, for obvious reasons.


So, to reduce latency, we started thinking of replicating the execution of the stored procedure that deletes the rows in batches. No, I am not trying to say that replication is smart just because we can replicate SP execution; this feature is quite old now and you may all be aware of it already.


The issue we thought we might face:
We already had all the required tables added as articles in the respective publications. If we now added the stored procedures to the publication, we thought replication would try to apply the changes twice. For example, say there are 2 articles in the publication: the first is a table (say REPL_TAB) and the second is an SP (say REPL_SP), and REPL_SP deletes x rows from REPL_TAB.


Now, if we execute REPL_SP, we thought it would affect the subscriber table twice: once when the replicated SP deletes the rows, and again because the rows being deleted from REPL_TAB would also be replicated. So we thought this might not work. We then thought of creating another publication with this SP added as an article, but had the same reservations.


But logically, I thought the log reader agent should pick up the command from the transaction log and be smart enough to replicate it only once. I mean, if I run EXEC XYZ, which deletes 10 rows in a table ABC, then it should replicate only EXEC XYZ and not the delete commands, even though the rows are being deleted from a table that is also an article in the same or, for that matter, a different publication.


I first tested this on SQL Server 2005 SP3 CU2 and got it partially working. If both articles are in the same publication, executing the stored procedure fills msrepl_commands and msrepl_transactions with 1 row each. But if there are 2 publications with one article each and I execute the SP to delete x rows, it is replicated twice: first 1 command and 1 transaction, and then x commands and 1 transaction. The distribution agent history confirms the same.


I then tested the same on SQL Server 2008 SP1 and it worked like a charm. Below is the proof of concept for your reference:



Test 1:
We have 2 publications on the AdventureWorks database. One publishes a table and the other a stored procedure execution (by default, stored procedure execution replication is not enabled). The stored procedure del_stats_scan deletes the top 10 rows from the table dbo.Stats_State in the publisher database (AdventureWorks).
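Enabling SP execution replication amounts to publishing the procedure with @type = 'proc exec'. The call below follows this test setup, but treat the exact parameter values as a sketch:

```sql
-- Publish the EXECUTION of the procedure, not the rows it modifies.
-- Run at the publisher, in the published database.
EXEC sp_addarticle
     @publication   = N'ADV_SP',
     @article       = N'del_stats_scan',
     @source_owner  = N'dbo',
     @source_object = N'del_stats_scan',
     @type          = N'proc exec';
```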


SQL Server version:
Microsoft SQL Server 2008 (SP1) - 10.0.2531.0 Evaluation Edition on Windows XP SP3


Publisher database : Adventureworks
|_Table :dbo.Stats_State
|_SP:dbo.del_stats_scan
|_Publication :ADV_Table
| |_Article :dbo.Stats_State
|_Publication:ADV_SP
|_Article :dbo.del_stats_scan


Subscriber Database : ADV_SUB
|_Table :dbo.Stats_State
|_SP:dbo.del_stats_scan 


Jobs :
Log Reader Agents :1
Snapshot Agent     :2
Distribution Agent   :2 

Replication is in synchronization :
Queries fired on publisher DB :
Query 1
delete top (10) from dbo.stats_test
Results:
The data was replicated only once; the second distribution agent did not do anything. The reason you see 2 transactions from the log reader is that both are the same image; there is only ONE log reader per database.

Fire these queries to find out what is being replicated :
select * from distribution.dbo.MSrepl_commands
select * from distribution.dbo.MSrepl_transactions


Query 2 : 

exec del_stats_scan

Results: The data was replicated only once. Only the SP execution was replicated; the other distribution agent did not do anything. The reason you see 2 transactions from the log reader is that both are the same image; there is only ONE log reader per database.


Run these queries to find out what has been replicated
select COUNT(*) 'No. of rows in Repl_cmds' from distribution.dbo.MSrepl_commands
select COUNT(*) 'No. of rows in Repl_Trans' from distribution.dbo.MSrepl_transactions

Test 2:
We have 1 publication on the AdventureWorks database with 2 articles. One article publishes a table and the other publishes a stored procedure execution (by default, stored procedure execution replication is not enabled).
The stored procedure del_stats_scan deletes the top 10 rows from the table dbo.Stats_State in the publisher database (AdventureWorks).

SQL Server version:
Microsoft SQL Server 2008 (SP1) - 10.0.2531.0 Evaluation Edition on Windows XP SP3
Publisher database : 
Adventureworks
|_Table :dbo.Stats_State
|_SP:dbo.del_stats_scan
|_Publication :ADV_Table
|_Article :dbo.Stats_State
|_Article :dbo.del_stats_scan

Subscriber Database : ADV_SUB
|_Table :dbo.Stats_State
|_SP:dbo.del_stats_scan 

Jobs :
Log Reader Agents :1
Snapshot Agent     :1
Distribution Agent   :1 

Repeat the queries from Test 1 and compare the results.

Conclusion:
SQL Server 2008 replication (the log reader) is smart enough to replicate the data only once from the transaction log to the distributor; the distributor then distributes the command to the subscriber. There is a bug in SQL Server 2005 SP3 CU2 where the second test works fine but the first does not: the data is replicated twice if we execute the SP. You will have to find out which 2005 CU fixed this, or you might want to directly apply SP4.

Happy Learning .