We have a Spring Batch-based data migration program that reads data from Oracle RDB and writes it to MarkLogic. We have a 3-node ML cluster that's now on 9.0-1.1 (it was on 8.0-6.3 before).
We've been using XCC and a custom "pooling" approach where we create an XCC ContentSource for each host. Each batch of documents to be written to ML is then handed off to a ContentSource in round-robin fashion. A new Session is then created from the ContentSource, and session.insertContent is called with an array of Content objects. Very simple, nothing fancy. We use a ContentCreateOptions object for each document, but all we do is set the format to XML, set the collections and permissions, and set the repair level to
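For concreteness, the round-robin handoff can be sketched like this. This is a minimal stand-alone sketch, not the tool's actual code: the generic type stands in for XCC's ContentSource, and the comments mark where newSession() and session.insertContent() would be called in the real thing.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin pool: one member per cluster host, batches handed to the
// next member in rotation. In the real migration tool, T would be XCC's
// ContentSource; each pick() would be followed by newSession() on the
// returned source and session.insertContent(contentArray).
public class RoundRobinPool<T> {
    private final List<T> members;               // one entry per ML host
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinPool(List<T> members) {
        this.members = members;
    }

    // Return the next member in round-robin order; safe to call from
    // multiple writer threads because the counter is atomic.
    public T pick() {
        int i = Math.floorMod(next.getAndIncrement(), members.size());
        return members.get(i);
    }
}
```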
We now have DMSDK support in the migration tool, alongside the XCC support (which one is used is determined by a command line argument). Our DMSDK code is very simple too, basically this:
databaseClient = DatabaseClientFactory.newClient(host, port, user, password, DatabaseClientFactory.Authentication.BASIC);
dataMovementManager = databaseClient.newDataMovementManager();
writeBatcher = dataMovementManager.newWriteBatcher().withBatchSize(batchSize).withThreadCount(threadCount);
jobTicket = dataMovementManager.startJob(writeBatcher);
When the migration program gets a batch of documents to write, it then just calls this for each document in the batch:
writeBatcher.add(doc.getUri(), doc.getMetadata(), doc.getContent());
Once all the batches are written, we have the following cleanup code:
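(The cleanup code itself didn't make it into the post. For reference, a typical DMSDK teardown, assuming the variable names from the setup code above, would look like the following; this is a hypothetical sketch, not necessarily our exact code.)

```java
// Hypothetical DMSDK cleanup sequence (assumed, not the tool's actual code):
writeBatcher.flushAndWait();              // write any final partial batch and wait for it to finish
dataMovementManager.stopJob(jobTicket);   // stop the job started with startJob()
databaseClient.release();                 // release the client's connection resources
```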
Functionally, everything works fine, but we’re consistently getting better results with our custom XCC approach. Details on the migration that we’re testing with:
- The migration inserts 2,948,131 documents
- Each document is small - just 7 elements, with each element value having less than 100 characters
Details on our 3-node cluster - each node has the same specs: 48 GB memory; 24 X5660 CPUs at 2.80 GHz; and 500 GB of disk storage, with plenty of free space (we're testing against an empty database). The migration program is being run on a separate machine with 145 GB memory; 24 X5675 CPUs at 3.07 GHz; and 500 GB of storage.
Here are the 4 test runs I did, all with a thread count of 64:
|| Batch Size || Total Time (ms) || % of DMSDK ||
I’m going to do a few more runs, but these are consistent with all the other runs I’ve done.
I’ve included an export of Monitoring History during the time frame where I did the above 4 runs (they were done in that order too). Interesting notes about what’s in there:
- CPU is significantly higher during the DMSDK runs - around 50% compared to 25%
- There are lots of expanded tree cache (ETC) hits during the DMSDK runs, but none during the XCC runs. This confuses me.
- There are lots more list cache hits during the DMSDK runs, which I figure is for the same reason as the ETC hits (there are plenty of duplicate values across the nearly 3 million documents).