Recently, a client came to us with a problem. They had migrated their primary databases from AWS to Azure, but their analytics warehouse remained in AWS, and they used AWS DMS tasks to sync multiple source schemas into a single schema in the target analytics database.
Many important tables regularly failed to sync, making the analytics data unreliable. In this post we’ll dig into each of those cases and how we resolved them.
AWS DMS excels in one-time database migrations, yet it poses significant challenges for ongoing data replication and high-frequency change data capture (CDC). Many new AWS users sign up with ongoing data replication in mind, only to face problems like sync lag, CDC performance setbacks, scaling issues during transaction spikes, intricate setup and maintenance, and unexpectedly high operational costs.
Despite the initial affordability of AWS DMS, the combined impact of these challenges results in exorbitant total ownership costs. Organizations also suffer significant business losses when their AWS DMS instance encounters downtime.
Scenario 1:
Problem: Our initial challenge was extremely slow full table loads, caused by the significant network latency that came with the geographical distance between our replication instance’s region and the customer’s source database region. This latency hindered efficient data transfer.
Solution and Approach:
To address this, we strategically relocated our replication instance to the same region as the source database. Despite being on different clouds, this proximity remarkably accelerated table loads. Additionally, we encountered an obstacle with the instance’s changing IP during mandatory weekly maintenance. To resolve this issue and ensure consistent access, we placed our instance in a Virtual Private Cloud (VPC) with internet access, utilizing an Elastic IP for a stable, unchanging IP address. This solution provided a reliable and uninterrupted connection to the customer’s database server.
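For reference, here is a minimal boto3 sketch of that setup. The region, subnet IDs, and identifiers are placeholders, and the VPC routing (private subnets whose outbound traffic goes through a NAT gateway holding the Elastic IP) is assumed to already exist; this is an illustration of the approach, not the exact scripts we ran.

```python
import boto3

REGION = "eu-west-1"  # placeholder: the AWS region geographically closest to the source DB

ec2 = boto3.client("ec2", region_name=REGION)
dms = boto3.client("dms", region_name=REGION)

# Allocate an Elastic IP and attach it to a NAT gateway so all outbound traffic
# from the replication instance leaves through one stable, unchanging address.
eip = ec2.allocate_address(Domain="vpc")
ec2.create_nat_gateway(SubnetId="subnet-public-123", AllocationId=eip["AllocationId"])

# Place the replication instance in private subnets close to the source database.
dms.create_replication_subnet_group(
    ReplicationSubnetGroupIdentifier="dms-near-source-subnets",
    ReplicationSubnetGroupDescription="Subnets co-located with the source DB region",
    SubnetIds=["subnet-private-123", "subnet-private-456"],
)
dms.create_replication_instance(
    ReplicationInstanceIdentifier="replication-near-source",
    ReplicationInstanceClass="dms.c5.xlarge",
    ReplicationSubnetGroupIdentifier="dms-near-source-subnets",
    PubliclyAccessible=False,
)
```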
Scenario 2:
Problem: Stabilizing continuous replication (CDC) of the tables proved difficult. High transaction volumes led to significant delays, failing to meet the client’s requirement for near real-time updates.
Solution and Approach:
The root cause was that many DMS sessions were consuming a large share of CPU on the source database server. These long-running sessions never completed, throttling throughput and slowing data transfer. At the same time, routines in the ERP system were making heavy updates to records, competing for the same resources and delaying the source capture process even further.
Examining the DMS task metrics revealed a surge in the “CDCLatencySource” metric, indicating a delay in the source capture process. This metric measures the interval, in seconds, between the last event captured from the source endpoint and the current system timestamp. At the same time, the replication instance metrics showed the “WriteIOPS” metric capped at 300 I/O operations per second (IOPS). Given that our instance had a 100 GB disk, it became evident that raising IOPS required expanding the disk, since the storage delivers a baseline of 3 IOPS per GB.
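As a rough illustration, both metrics can be pulled programmatically from CloudWatch. The identifiers below are placeholders, and note that the task dimension expects the task’s resource ID (from its ARN) rather than the friendly name.

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.utcnow()

# Source-capture latency for the task (how many seconds behind the source we are).
latency = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "replication-near-source"},
        {"Name": "ReplicationTaskIdentifier", "Value": "EXAMPLETASKRESOURCEID"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)

# Write throughput of the replication instance's storage volume.
write_iops = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="WriteIOPS",
    Dimensions=[{"Name": "ReplicationInstanceIdentifier", "Value": "replication-near-source"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)

print(latency["Datapoints"], write_iops["Datapoints"])
```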
Collaborating with the customer’s database team confirmed that increasing IOPS was pivotal. Because baseline IOPS scale with storage size, the fix was to increase the allocated storage: growing the disk from 100 GB to 1 TB raised the baseline from 300 to 3,000 IOPS, and the data replication process regained momentum. This headroom allowed the instance to absorb the routine’s heavy daily transaction load without bottlenecks.
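The change itself is a single API call; the ARN below is a placeholder, and the 1 TB figure simply follows the 3-IOPS-per-GB baseline described above.

```python
import boto3

dms = boto3.client("dms")

# Growing the allocated storage from 100 GB to 1 TB raises the baseline
# from roughly 300 to roughly 3,000 IOPS (3 IOPS per GB).
dms.modify_replication_instance(
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:EXAMPLE",
    AllocatedStorage=1000,
    ApplyImmediately=True,
)
```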
Scenario 3:
Problem: As the volume of data grew over time, data replication experienced recurring delays and crashes, impacting the efficiency and reliability of the process.
Solution and Approach:
During our analysis of the replication setup, we noticed the “SwapUsage” metric, which shows how much data the replication instance spills to disk when it runs out of memory. To keep replication running smoothly, the AWS DMS documentation recommends keeping as much of the workload as possible in memory. We also found that swap usage can be reduced by tuning task settings such as “MemoryLimitTotal,” which caps the memory a task’s transactions can occupy, and “MemoryKeepTime,” which limits how long each transaction stays in memory before being written to disk.
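As a rough sketch, these settings live under “ChangeProcessingTuning” in the task settings JSON and can be adjusted via the ModifyReplicationTask API. The ARN and values below are placeholders rather than tuned recommendations, and the sketch assumes partial settings are merged into the existing task settings; if not, supply the full settings document.

```python
import json
import boto3

dms = boto3.client("dms")

# Illustrative values only: MemoryLimitTotal is the memory budget (MB) a task's
# transactions may occupy before spilling to disk; MemoryKeepTime is how long
# (seconds) each transaction may stay in memory before being written to disk.
dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:eu-west-1:123456789012:task:EXAMPLE",
    ReplicationTaskSettings=json.dumps({
        "ChangeProcessingTuning": {
            "MemoryLimitTotal": 2048,
            "MemoryKeepTime": 120,
        }
    }),
)
```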
After this analysis, we decided to boost the replication setup by increasing its memory resources. We switched to an R5 class instance, as recommended in the AWS documentation. R5 instances are memory-optimized, allowing a large number of transactions to be held in memory and preventing issues during replication. This change effectively reset the “SwapUsage” metric to zero.
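Resizing the instance class is likewise one call; the ARN is a placeholder, and applying the change immediately will briefly interrupt running tasks, so a maintenance window is the safer assumption.

```python
import boto3

dms = boto3.client("dms")

# Move to a memory-optimized class so in-flight transactions stay in memory
# instead of swapping to disk.
dms.modify_replication_instance(
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:EXAMPLE",
    ReplicationInstanceClass="dms.r5.xlarge",
    ApplyImmediately=True,
)
```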
Scenario 4:
Problem: The client’s source table settings were not configured to support CDC (Change Data Capture).
Solution and Approach:
After a deep dive into the source table settings, it was discovered that they were not set up to facilitate CDC operations. The solution was two-fold: Firstly, we worked with the client’s database team to modify the source table settings to be compatible with CDC. Secondly, we implemented an automatic configuration check mechanism that validates whether a source table is CDC-enabled before initializing replication. This proactive measure ensures that the system does not run into replication failures due to misconfigured tables.
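The post doesn’t name the source engine, so as one concrete illustration, here is what such a pre-flight check could look like for a SQL Server source, where CDC must be enabled at both the database and the table level. The connection string and function name are hypothetical.

```python
import pyodbc  # assuming a SQL Server source; the actual engine isn't stated in the post


def table_is_cdc_ready(conn_str: str, schema: str, table: str) -> bool:
    """Return True only if CDC is enabled on the database and on the table."""
    with pyodbc.connect(conn_str) as conn:
        cur = conn.cursor()
        # Database-level CDC must be enabled first.
        cur.execute("SELECT is_cdc_enabled FROM sys.databases WHERE name = DB_NAME()")
        if not cur.fetchone()[0]:
            return False
        # The individual table must also be tracked by CDC.
        cur.execute(
            "SELECT t.is_tracked_by_cdc FROM sys.tables t "
            "JOIN sys.schemas s ON s.schema_id = t.schema_id "
            "WHERE s.name = ? AND t.name = ?",
            schema, table,
        )
        row = cur.fetchone()
        return bool(row and row[0])


# Example: flag any table that is not ready before the replication task starts.
# if not table_is_cdc_ready(CONN_STR, "erp", "orders"): raise RuntimeError("CDC not enabled")
```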
Scenario 5:
Problem: The management of schema changes became a significant obstacle, leading to frequent full loads.
Solution and Approach:
Schema changes can be dynamic and frequent in databases. The consistent need for full loads due to schema modifications was a significant setback. To counter this, we introduced a schema evolution detection mechanism. This mechanism identifies and adapts to schema changes in real-time, allowing for more seamless data migration. As a result, the reliance on full loads was significantly reduced, streamlining the replication process.
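A minimal sketch of what such a detection step can look like, assuming column metadata is pulled periodically from each side’s information_schema; the function and names are hypothetical, not the exact mechanism we ran.

```python
def detect_schema_drift(known: dict[str, str], current: dict[str, str]) -> dict[str, dict]:
    """Compare the column set last replicated against the current source columns.

    Both arguments map column name -> data type, e.g. pulled on a schedule from
    information_schema.columns on the source and target.
    """
    return {
        "added": {c: t for c, t in current.items() if c not in known},
        "removed": {c: t for c, t in known.items() if c not in current},
        "retyped": {c: (known[c], t) for c, t in current.items()
                    if c in known and known[c] != t},
    }


# Additive drift (new nullable columns) can usually be applied to the target with
# ALTER TABLE, avoiding a full reload; destructive drift gets escalated instead.
drift = detect_schema_drift({"id": "bigint", "amount": "numeric"},
                            {"id": "bigint", "amount": "numeric", "currency": "text"})
print(drift["added"])  # {'currency': 'text'}
```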
Scenario 6:
Problem: Table locks in the target database were causing synchronization disruptions.
Solution and Approach:
Target table locks can severely impede the synchronization process. To tackle this, we integrated an alert system that promptly notifies the team when a table lock is detected. Additionally, we adopted an optimistic concurrency control strategy that reduces the likelihood of encountering table locks during data replication. This approach ensures that data is synchronized efficiently without running into lock-induced stalls.
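For illustration, a lock monitor of this kind can be a small script that looks for blocked sessions on the target and pushes an SNS notification. This sketch assumes a PostgreSQL-compatible target and an existing SNS topic; every identifier is a placeholder.

```python
import boto3
import psycopg2  # assuming a PostgreSQL-compatible target warehouse


def alert_on_blocked_sessions(dsn: str, topic_arn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Any session currently blocked by another session's lock.
        cur.execute("""
            SELECT pid, left(query, 200)
            FROM pg_stat_activity
            WHERE cardinality(pg_blocking_pids(pid)) > 0
        """)
        blocked = cur.fetchall()
    if blocked:
        boto3.client("sns").publish(
            TopicArn=topic_arn,
            Subject="Table locks detected on analytics target",
            Message="\n".join(f"pid={pid}: {query}" for pid, query in blocked),
        )
```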
Scenario 7:
Problem: Full loads resulted in the frequent dropping of tables along with their keys in the analytics warehouse.
Solution and Approach:
Every instance of a full load resulted in the analytics warehouse tables being dropped, and consequently, the loss of persistent keys. To overcome this, we transitioned to a differential loading approach. Instead of dropping tables, we only updated the modified records. This method retains the table structure and the persistent keys, ensuring the data’s integrity and consistency in the analytics warehouse.
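In DMS terms, one way to express this is through the task’s full-load preparation mode: “DO_NOTHING” leaves the target table, its data, and its keys in place, while ongoing replication applies only the changed records. The ARN is a placeholder, and the sketch again assumes partial task settings are accepted.

```python
import json
import boto3

dms = boto3.client("dms")

# Stop dropping and recreating target tables on reload; keep structure and keys
# in place and let ongoing replication apply only the changed records.
dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:eu-west-1:123456789012:task:EXAMPLE",
    ReplicationTaskSettings=json.dumps({
        "FullLoadSettings": {"TargetTablePrepMode": "DO_NOTHING"}  # default is DROP_AND_CREATE
    }),
)
```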
Scenario 8:
Problem: The dependency on full load raised costs as it mandated the use of a read replica to prevent performance degradation on the actual application database.
Solution and Approach:
Full loads not only strained the primary database but also necessitated the creation of a read replica, further escalating costs. To alleviate this, we shifted our focus to a CDC setup. Unlike the full load process, CDC allows for real-time data availability without placing undue pressure on the primary database. This transition resulted in significant cost savings, as the need for read replicas was eliminated. Additionally, it ensured that the primary application database remained performance-efficient.
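A CDC-only task is created with MigrationType set to “cdc”; the ARNs and table mapping below are placeholders for the client’s actual endpoints and schemas.

```python
import json
import boto3

dms = boto3.client("dms")

dms.create_replication_task(
    ReplicationTaskIdentifier="erp-to-warehouse-cdc",
    SourceEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:EXAMPLE",
    MigrationType="cdc",  # ongoing replication only; no repeated full loads
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-erp-schema",
            "object-locator": {"schema-name": "erp", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```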
In summary, we overcame the client’s data replication challenges and made the pipeline reliable and efficient. With smart solutions and expertise from Kaliper.io, we can help transform data replication into a cost-effective, dependable system for your business needs.