The Tool

I’ve been working as part of the team developing the new Azure Cosmos DB Data Migration Tool, available to download now from GitHub. More about the tool itself and the basics of usage can be found in the announcement blog post. In addition to the basic features for doing 1-to-1 transfers between a single source and sink, we also added a more advanced configuration to transfer multiple data sets at once.

Configuring Multiple Operations

A normal migrationsettings.json file contains a single set of SourceSettings and a single set of SinkSettings to define the transfer operation. There are scenarios where it may be more convenient to do multiple operations without having to build separate files and run the app multiple times.

  • Transfer multiple data sets to fully populate a database with a single command. For example, a development team may use individual copies of a Cosmos account which use multiple databases and containers. Distributing a set of JSON files and a single settings file as part of the project’s source control would allow each developer to fully populate baseline test data with a single command.
  • Consolidate data spread out over multiple data stores into a single shared store. There are lots of potential use cases here including denormalizing data for read optimization, building new partitioning for a set of data spread across multiple collections, consolidating data sharded across multiple instances, etc.
  • Push copies of the same data to multiple target locations.

To support these kinds of scenarios, an array of SourceSettings/SinkSettings pairs can be added to the Operations property in the settings file. These operations will each run sequentially, but require only one command to execute and can also take advantage of shared settings to reduce the amount of individual settings in the file. One current restriction is that the Source and Sink types must be the same for all operations but this may change in the future.

Example: Multiple JSON Files Into Cosmos Containers

This example reads multiple JSON files and writes them into different databases and containers within the same Cosmos account. Notice that the individual operations don’t need to specify the required ConnectionString and PartitionKeyPath because they can inherit those values from the parent settings, in this case including an environment variable for the connection string. Notice also that while 3 of the operations use the /id partition key path set at the top, one uses a different value which overrides the inherited parent value.

{
  "Source": "json",
  "Sink": "cosmos-nosql",
  "SinkSettings": {
    //"ConnectionString": "<FROM ENV VAR>",
    "PartitionKeyPath": "/id"
  },
  "Operations": [
    {
      "SourceSettings": {
        "FilePath": "companies.json"
      },
      "SinkSettings": {
        "Database": "clients",
        "Container": "company"
      }
    },
    {
      "SourceSettings": {
        "FilePath": "people.json"
      },
      "SinkSettings": {
        "Database": "clients",
        "Container": "person",
        "PartitionKeyPath": "/companyId"
      }
    },
    {
      "SourceSettings": {
        "FilePath": "marchSales.json"
      },
      "SinkSettings": {
        "Database": "transactions",
        "Container": "sales"
      }
    },
    {
      "SourceSettings": {
        "FilePath": "aprilSales.json"
      },
      "SinkSettings": {
        "Database": "transactions",
        "Container": "sales"
      }
    }
  ]
}

Example: Copying Cosmos Data By Querying

Here data is being copied from a single partition in a Cosmos container out to two different containers representing subsets of the original data. By using shared settings the only differences on the individual operations are the queries for filtering the data and the containers each will write to.

{
  "Source": "cosmos-nosql",
  "Sink": "cosmos-nosql",
  "SourceSettings": {
    "ConnectionString": "AccountEndpoint=https://...",
    "Database": "operations",
    "Container": "transactions",
    "PartitionKeyValue": "april"
  },
  "SinkSettings": {
    "ConnectionString": "AccountEndpoint=https://...",
    "Database": "sales",
    "PartitionKeyPath": "/id"
  },
  "Operations": [
    {
      "SourceSettings": {
        "Query": "SELECT * FROM c WHERE c.type='sale'"
      },
      "SinkSettings": {
        "Container": "orders"
      }
    },
    {
      "SourceSettings": {
        "Query": "SELECT * FROM c WHERE c.type='refund'"
      },
      "SinkSettings": {
        "Container": "returns"
      }
    }
  ]
}

More Possibilities

Depending on your needs there are a lot of different ways to set up this type of configuration to do multiple operations at once. These examples should at least give you an idea of what’s possible and how little configuration can be needed to do a lot at once.

Code Versions

Example configurations are supported in the 2.0.1 Release of the Azure Cosmos DB Desktop Data Migration Tool