Blazingly Fast ETL Dataflow with .NET and actionETL

The dataflow capabilities in actionETL have been heavily optimized and are fast – as in really fast. The high-performance, low-overhead architecture provides a great foundation for implementing complex and large-volume ETL requirements.

Benchmark numbers below, and full sources on GitHub. Why not use the free trial and run it on your Windows or Linux machine?

tl;dr

My old desktop from 2013 can pump up to 180 million rows per second between two dataflow workers, some two orders of magnitude faster than most traditional data sources can deliver on this machine, and up to 340 million rows per second in aggregate link throughput across its 4 cores:

[Chart: Aggregate Link Throughput (million rows/s) vs. Total Links for the three benchmark scenarios]

It’s also achieved with a quite small buffer size of 1024 rows, which conserves memory. The default buffer size of 256 rows is almost as fast on this benchmark, and using e.g. tiny 32-row or even smaller buffers when you have particularly wide rows is perfectly viable. You can even give different links in the same dataflow different buffer sizes.
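As a sketch of what that per-link configuration could look like, here's an example; note that the BufferCapacity property name is my placeholder assumption, not confirmed actionETL API, so consult the library documentation for the actual setting:

// Hypothetical sketch: "BufferCapacity" is a placeholder name, not confirmed
// actionETL API; consult the library documentation for the actual setting.
var source = new RepeatRowsSource<MyRow>(ws, "Source", 1000, new MyRow(42));
source.Output.BufferCapacity = 32; // tiny buffer for a link with very wide rows

var transform = source.Output.Link.MulticastTransform("Transform", 1);
transform.TypedOutputs[0].BufferCapacity = 1024; // larger buffer on the next link
transform.TypedOutputs[0].Link.TrashTarget("Target");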

With no data sources to slow the dataflow down, just pumping rows saturates my 4 CPU cores once about 3 links are running. Beyond that point, the aggregate link throughput stays essentially constant even with over a thousand dataflow source, transform, and target workers all pumping rows simultaneously, which shows great scalability.

This means you can use large numbers of workers to solve complex problems while retaining excellent performance.

In any real application, the high efficiency of the dataflow leaves almost all CPU cycles available for user logic and data source drivers.

With the powerful API, it only takes a few lines of code to create, connect, and run all the workers for all the benchmark cases.

Just the Facts, Ma’am

In the above chart, the blue “Many Sources, No Transforms” shows running many source workers, with a single target worker connected to each source:
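    Source 1 → Target 1
    Source 2 → Target 2
    ...
    Source N → Target N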

The red “One Source, Many Transforms” shows running a single source worker, with many downstream transforms, terminated by a single target:
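    Source → Transform 1 → Transform 2 → ... → Transform N-1 → Target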

And the yellow “Multiple Sources and Transforms” shows running multiple source workers, where each source feeds as many downstream workers (transforms terminated by a target) as there are sources in total:
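    Source 1 → Transform → ... → Transform → Target 1
    ...
    Source N → Transform → ... → Transform → Target N

    (each source is followed by N-1 transforms and a final target, where N is the number of sources)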

The specific workers used are RepeatRowsSource, MulticastTransform, and TrashTarget.

In these benchmarks, we specifically test the movement of rows between workers, without interacting with external data sources or allocating new rows (I’ll come back to that in later articles).

  • The source worker or workers send a row 1 billion times over one link, 500 million times over two links, and so on down to roughly 1 million rows over each of 1024 links, to the downstream workers. This way, the aggregate number of rows over all links is always 1 billion (give or take a few hundred rows of rounding, as the output below shows).
  • Each transform receives the upstream rows and sends them downstream.
  • Each individual flow is terminated by a target that receives the upstream rows and throws them away.
  • Aggregate Link Throughput is the sum of the number of rows passed through each link (1 billion), divided by the run duration (see the worked example right after this list).
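As a quick worked example of that last definition, here's the arithmetic for the first result row below (my own back-of-the-envelope check, not part of the benchmark code):

long aggregateLinkRows = 1_000_000_000; // rows over 1 link
double durationSeconds = 5.547;         // rounded duration from the output below
double millionRowsPerSecond = aggregateLinkRows / durationSeconds / 1e6;
Console.WriteLine($"{millionRowsPerSecond:F2} Million Rows/s");
// Prints 180.28; the output below shows 180.27, computed from the unrounded duration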

That old desktop from 2013 that runs the benchmarks? It has an Intel Core i7-4770K at 3.5GHz with 4 physical cores and hyper-threading disabled, plus 16GB of memory, and now runs Windows 10 Pro (full specification). And we’re running .NET Core 3.1.

Here’s the output from a run:

AggregateLinkRows, Workers, Sources, DownstreamLinksPerSource, TotalLinks, Duration (s), AggregateLinkThroughput (Million Rows/s)
 1000000000,      2,      1,      1,      1,   5.547,    180.27
 1000000000,      4,      2,      1,      2,   2.933,    340.92
 1000000002,      6,      3,      1,      3,   3.071,    325.62
 1000000000,      8,      4,      1,      4,   3.140,    318.50
 1000000000,     32,     16,      1,     16,   3.146,    317.90
 1000000000,    128,     64,      1,     64,   3.163,    316.19
 1000000000,    512,    256,      1,    256,   3.179,    314.61
 1000000512,   2048,   1024,      1,   1024,   3.267,    306.12
 1000000000,      3,      1,      2,      2,   4.933,    202.71
 1000000002,      4,      1,      3,      3,   3.439,    290.79
 1000000000,      5,      1,      4,      4,   3.494,    286.22
 1000000000,     17,      1,     16,     16,   3.089,    323.74
 1000000000,     65,      1,     64,     64,   3.064,    326.42
 1000000000,    257,      1,    256,    256,   3.009,    332.29
 1000000512,   1025,      1,   1024,   1024,   3.241,    308.53
 1000000000,      6,      2,      2,      4,   4.550,    219.77
 1000000000,     20,      4,      4,     16,   3.730,    268.11
 1000000000,     72,      8,      8,     64,   3.364,    297.30
 1000000000,    272,     16,     16,    256,   3.507,    285.12
 1000000512,   1056,     32,     32,   1024,   3.149,    317.59

Use the Source, Luke

Coding this is very easy. Here’s how to create, connect, and run all the workers for all the test cases – a good demonstration of the power of the library API:

const long desiredAggregateLinkRows = 1_000_000_000;
var testCases = new (int sources, int downstreamLinksPerSource)[]
{
	(1,1) // Warm up the JIT etc. to get consistent timings
	, (1,1), (2,1), (3,1), (4,1), (16,1), (64,1), (256,1)
		, (1024,1) // Many Sources, No Transforms
	, (1,2), (1,3), (1,4), (1,16), (1,64), (1,256)
		, (1,1024) // Single Source, Many Transforms 
	, (2,2), (4,4), (8,8), (16,16), (32,32) // Multiple Sources and Transforms 
};

foreach (var (numberOfSources, downstreamLinksPerSource) in testCases)
{
	int numberOfLinks = numberOfSources * downstreamLinksPerSource;
	long rowsPerSource = (long)Math.Ceiling(
		(double)desiredAggregateLinkRows / numberOfLinks);
	long actualAggregateLinkRows = numberOfSources * rowsPerSource 
		* downstreamLinksPerSource; // Avoid round-off errors

	var workerSystem = new WorkerSystem()
		.Root(ws =>
		{
			for (int s = 0; s < numberOfSources; s++)
			{
				var rrs = new RepeatRowsSource<MyRow>(ws, $"Source {s}"
					, rowsPerSource, new MyRow(42))
						{ SendTemplateRows = true };
				var output = rrs.Output;

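				// Create (downstreamLinksPerSource - 1) multicast transforms per
				// source; the TrashTarget below completes the
				// downstreamLinksPerSource links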
				for (int l = 1; l < downstreamLinksPerSource; l++)
					output = output.Link.MulticastTransform(
						$"Transform {s} - {l}", 1).TypedOutputs[0];

				output.Link.TrashTarget($"Target {s}");
			}
		});

	(await workerSystem.StartAsync()) // Run the worker system
		.ThrowOnFailure();
}
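One thing the listing doesn't show is the MyRow type. Here's a minimal sketch of what a compatible row type could look like; the actual definition is in the GitHub sources, so treat this single-field layout as an assumption:

// Minimal sketch of a row type for the benchmark. The real definition is in
// the GitHub sources; this single-field layout is my assumption for illustration.
public class MyRow
{
	public int Value;
	public MyRow(int value) => Value = value;
}

A single narrow column keeps per-row work to a minimum, which is what you want when benchmarking the row-passing machinery itself.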

Winding Down

I think the key takeaways are that actionETL can move a massive number of rows through the dataflow without any noticeable slow-downs, and do it with memory-conserving small buffers.

Some future topics to look at include memory allocation and consumption, as well as data source and transform worker performance.

As mentioned, full sources on GitHub; try it out with the free trial and let me know in the comments below what throughput numbers you get!
