Blazingly Fast ETL Dataflow with .NET and actionETL

The dataflow capabilities in actionETL have been heavily optimized and are fast – as in really fast. The high performance/low overhead architecture provides a great foundation for implementing complex and large volume ETL requirements.

Benchmark numbers below, and full sources on GitHub. Why not use the free trial and run it on your Windows or Linux machine?

tl;dr

My 7-year-old desktop from 2013 can pump up to 180 million rows per second between two dataflow workers, some two orders of magnitude faster than most traditional data sources can keep up with on this machine, and up to 340 million rows per second in aggregate link throughput across its 4 cores:

It’s also achieved with quite small buffer sizes of 1024 rows, which conserves memory. The 256 rows default buffer size is on this benchmark almost as fast, and using e.g. tiny 32-row or even smaller buffers when you have particularly wide rows is perfectly viable. You can even give different links in the same dataflow different buffer sizes.

With no data sources to slow the dataflow down, just pumping rows does bottleneck my 4 cores on CPU after about 3 links. After that, the aggregate link throughput stays constant even with over a thousand dataflow source, transform, and target workers all pumping rows simultaneously, showing great scalability.

You can use large numbers of workers to solve complex problems well while retaining excellent performance.

In any real application, the high efficiency of the dataflow leaves almost all CPU cycles available for user logic and data source drivers.

With the powerful API, it only takes a few lines of code to create, connect, and run all the workers for all the benchmark cases.

Just the Facts, Ma’am

In the above chart, the blue “Many Sources, No Transforms” shows running many source workers, with a single target worker connected to each source:

The red “One Source, Many Transforms” shows running a single source worker, with many downstream transforms, terminated by a single target:

And the yellow “Multiple Sources and Transforms” shows running multiple source workers, and an equal number of transform and target workers per source as there are sources in total:

The specific workers used are:

In these benchmarks, we specifically test the movement of rows between workers, without interacting with external data sources or allocating new rows (I’ll come back to that in later articles).

  • The source worker or workers send a row 1 billion times over one link, or 500 million times over two links, all the way to 1 million rows over 1024 links, to the downstream workers. This way, the aggregate number of rows over all links are always 1 billion.
  • Each transform receives the upstream rows and sends them downstream
  • Each individual flow is terminated by a target that receives the upstream rows and throws them away
  • Aggregate Link Throughput is the sum of the number of rows passed through each link (1 billion), divided by the time

That old desktop from 2013 that runs the benchmarks? It has an Intel Core i7 4770K, 3.5GHz, 4 physical cores, hyper-threading disabled, 16GB of memory, and now Windows 10 Pro (full specification). And we’re running .NET Core 3.1.

Here’s the output from a run:

AggregateLinkRows, Workers, Sources, DownstreamLinksPerSource, TotalLinks, Duration (s), AggregateLinkThroughput (Million Rows/s)
 1000000000,      2,      1,      1,      1,   5.547,    180.27
 1000000000,      4,      2,      1,      2,   2.933,    340.92
 1000000002,      6,      3,      1,      3,   3.071,    325.62
 1000000000,      8,      4,      1,      4,   3.140,    318.50
 1000000000,     32,     16,      1,     16,   3.146,    317.90
 1000000000,    128,     64,      1,     64,   3.163,    316.19
 1000000000,    512,    256,      1,    256,   3.179,    314.61
 1000000512,   2048,   1024,      1,   1024,   3.267,    306.12
 1000000000,      3,      1,      2,      2,   4.933,    202.71
 1000000002,      4,      1,      3,      3,   3.439,    290.79
 1000000000,      5,      1,      4,      4,   3.494,    286.22
 1000000000,     17,      1,     16,     16,   3.089,    323.74
 1000000000,     65,      1,     64,     64,   3.064,    326.42
 1000000000,    257,      1,    256,    256,   3.009,    332.29
 1000000512,   1025,      1,   1024,   1024,   3.241,    308.53
 1000000000,      6,      2,      2,      4,   4.550,    219.77
 1000000000,     20,      4,      4,     16,   3.730,    268.11
 1000000000,     72,      8,      8,     64,   3.364,    297.30
 1000000000,    272,     16,     16,    256,   3.507,    285.12
 1000000512,   1056,     32,     32,   1024,   3.149,    317.59

Use the Source, Luke

Coding this is very easy, here’s how to create, connect, and run all the workers for all the test cases – a good demonstration of the power of the library API:

const long desiredAggregateLinkRows = 1_000_000_000;
var testCases = new (int sources, int downstreamLinksPerSource)[]
{
	(1,1) // Warm up the JIT etc. to get consistent timings
	, (1,1), (2,1), (3,1), (4,1), (16,1), (64,1), (256,1)
		, (1024,1) // Many Sources, No Transforms
	, (1,2), (1,3), (1,4), (1,16), (1,64), (1,256)
		, (1,1024) // Single Source, Many Transforms 
	, (2,2), (4,4), (8,8), (16,16), (32,32) // Multiple Sources and Transforms 
};

foreach (var (numberOfSources, downstreamLinksPerSource) in testCases)
{
	int numberOfLinks = numberOfSources * downstreamLinksPerSource;
	long rowsPerSource = (long)Math.Ceiling(
		(double)desiredAggregateLinkRows / numberOfLinks);
	long actualAggregateLinkRows = numberOfSources * rowsPerSource 
		* downstreamLinksPerSource; // Avoid round-off errors

	var workerSystem = new WorkerSystem()
		.Root(ws =>
		{
			for (int s = 0; s < numberOfSources; s++)
			{
				var rrs = new RepeatRowsSource<MyRow>(ws, $"Source {s}"
					, rowsPerSource, new MyRow(42))
						{ SendTemplateRows = true };
				var output = rrs.Output;

				for (int l = 1; l < downstreamLinksPerSource; l++)
					output = output.Link.MulticastTransform(
						$"Transform {s} - {l}", 1).TypedOutputs[0];

				output.Link.TrashTarget($"Target {s}");
			}
		});

	(await workerSystem.StartAsync()) // Run the worker system
		.ThrowOnFailure();
}

Winding Down

I think the key takeaways are that actionETL can move around a massive amount of rows in the dataflow without any noticeable slow-downs, and do it with memory conserving small buffers.

Some future topics to look at include memory allocation and consumption, as well as data source and transform worker performance.

As mentioned, full sources on GitHub; try it out with the free trial and let me know in the comments below what throughput numbers you get!

Reduce .NET ETL Code Size by 23 Times with actionETL

In a fully documented example, actionETL required only 9kB of code to create a high performance and reusable custom Slowly Changing Dimension (SCD) worker, 23 times less than the 209kB used by a SQL Server® Integration Services (SSIS) implementation with similar functionality. What caused this stark difference?

Modern AppDev Techniques

The actionETL library is designed from the ground up to make ETL development very productive. By using well-known application development techniques, it provides excellent:

  • Reusability
  • Composability
  • Encapsulation
  • Testability
  • Extensibility
  • Refactoring
  • Source control and Continuous Integration/Continuous Delivery

In the SCD example, actionETL composability pays a huge dividend, where existing ‘control flow’ and dataflow workers are easily combined to create a new high performance and reusable custom worker:

Unlike with SSIS, actionETL ‘control flow’ and dataflow workers can be freely mixed and matched, and can also be used to create new custom workers.

Visual Designer

In contrast, SSIS cannot use existing control flow tasks or dataflow components when creating new tasks or components, not even via C++, and must therefore implement all required functionality from scratch. Most SSIS custom tasks and components also require significant UI code. Both aspects heavily inflate the amount of code that must be written and maintained.

Virtually all traditional ETL tools have the same heavy focus on their drag&drop visual designer as SSIS has. While this certainly helps in some ways, like initially getting up to speed on the tool, they pay a very heavy price in terms of poor support for some or all of the above modern AppDev traits.

Whitepaper

If you want to learn more about how actionETL compares to SSIS and traditional ETL tools, check out the Thirteen Factors Crippling ETL Productivity whitepaper.

VirtualBox Hyper-Threading Benchmark Surprise

I needed to ensure good CPU and memory performance in a VirtualBox virtual machine running on a 4-core desktop, and googling didn’t provide any clear guidance. After some benchmarking, the surprise came in the shape of consistently getting the best result when ignoring VirtualBox’ warning on oversubscribing processors:

VirtualBox warns if the guest is configured with more processors than there are physical cores on the host
VirtualBox warns if the guest is configured with more processors than there are physical cores on the host

As always, these results are only valid for my particular configuration and on my chosen benchmarks, including the assumptions that the physical host is idling while the single virtual machine is running flat out – do test with your own systems and tasks.

Physical Host

Desktop system where the CPU and motherboard were released in AD 2013:

CPUSingle Intel Core i7 4770K, 3.5GHz, 4 physical cores, 8 logical cores (when Hyper-Threading is enabled)
  • Note: All cores run at 3.7GHz during multi-threaded benchmarks; for single-threaded benchmarks, the core runs between 3.7 and 3.9GHz
MemoryG Skill F3-2400C10D-16GTX, Trident X Series, 2x8GB, PC3-19200, DDR3 2400MHz
  • Dual Channel, DRAM:1200MHz, CL:10 tRCD:12 tRP:12 tRAS:31 tRFC:363 CR:2T
MotherboardASRock Z87 Extreme9/ac, BIOS 2.30
DiskSamsung 840 EVO 1TB SSD
OSWindows 7 Ultimate SP1 64-bit

Note: The results in this article are likely not applicable to NUMA systems with physical processors in multiple sockets, since these have very different memory, cache, and thread scheduling characteristics.

VirtualBox

Version4.3.26
Guest Processors4 or 8
Guest Memory8GB
Guest OSWindows Server 2008 R2 Standard SP1 64-bit
Guest SettingsIO-APIC, 100% Execution Cap, PAE/NX, VT-X / AMD-V, Nested Paging all enabled

Benchmarks

For my particular requirements, I chose mainly multi-threaded CPU and memory bound benchmarks, with some disk benchmarks to guard against IO regressions – do follow the links for specifics on the individual benchmarks:

y-cruncherMulti-threaded. Calculates Pi. Mainly CPU and thread communication limited. Requires and uses SSE.
PassMarkPerformanceTest 8.0 (Build 1025) 64-bit  
  • “Preferences > Number of processes” set to number of logical processors on host (i.e. 4 or 8)
  • All CPU and memory benchmarks have been included
  • Disk benchmarks were also executed, but detailed results are not included due to the variability in IO systems:
    • Enabling/disabling Hyper-Threading did as expected not have any impact on disk performance
    • With Samsung RAPID mode disabled, disk performance in the VirtualBox guest ranged from 12% slower to 4% faster than the physical host
    • Enabling RAPID mode (which uses main memory as cache for SSD) improved VirtualBox guest disk performance with about 40%, and improved physical host disk performance with a whopping 9.5x – real life mileage will of course vary wildly

Results

To aid digestion, I’m presenting the data as speed-up or slowdown of different configurations vs. the on average fastest configuration, which was to run on the physical host with Hyper-Threading enabled.

The “Overall Average” section at the top of the chart is the average slowdown of all the actual benchmarks further down. Comparing to physical host with Hyper-Threading enabled, we see that running on:

  • VB 4 NoHT (VirtualBox with 4 processors, host has Hyper-Threading disabled) is on average the slowest at -22%, with individual benchmarks ranging from -2% to -55% slower
  • VB 4 HT (VirtualBox with 4 processors, host has Hyper-Threading enabled) is on average -22% slower, with individual benchmarks ranging from 2% faster to -44% slower
  • Phys 4 NoHT (Physical host, Hyper-Threading disabled) is on average only -10% slower, with individual benchmarks ranging from 1% faster to -50% slower
  • VB 8 HT (VirtualBox with 8 processors, host has Hyper-Threading enabled) is on average only -9% slower, with individual benchmarks ranging from -2% to -27%

Conclusions

Given my set-up, requirements and assumptions, I find that:

  • Disabling Hyper-Threading makes both the physical host and the virtual machine on average quite a bit slower – I’ll leave it enabled.
  • Following VirtualBox’ recommendation of limiting virtual processors to number of physical cores brings a slowdown of -22% (VB 4 HT above). I’ll instead configure as many virtual VirtualBox processors as there are logical (Hyper-Threaded) cores (VB 8 HT above), giving only a -9% slowdown.

Finally, a small warning: if you configure VirtualBox to use more processors than there are logical (Hyper-Threaded) cores (e.g 16 virtual processors on my 4770K) , it can run an order of magnitude slower than normal – simply ensure that you have no more VirtualBox processors configured than there are logical (Hyper-Threaded) cores available.

Hope it helps!

GOTCHA: SQL Server changes query plan without changing plan_handle

Or how to efficiently get at old query plans

Performance Tuning Mission

While on a SQL Server 2008 Data Warehouse performance tuning mission, this aspect (or undocumented gotcha as it felt like at the time) of the query plan Dynamic Management Functions caused some head-scratching, until I sussed that:

A query plan can change without its plan_handle changing

Or, putting it in Data Warehouse speak:

query_plan in sys.dm_exec_query_plan(plan_handle) is a Type 1 slowly changing attribute with respect to the plan_handle key, which means query_plan changes overwrite history for the same plan_handle

This mattered since I was tracking down intermittent but significant slow-downs of the Data Warehouse load by sampling query plans for monitoring purposes – I suspected the culprit was intermittent query plan changes which meant that:

  • Detecting when and how the query plans had changed was critical
  • The intermittent nature of the symptoms meant that actual (as opposed to estimated) query plans were needed and that monitoring had to run for several days at a time
  • Due to the storage size for query plans, and the desire to capture ‘all’ query plans, deduplicating the query plans (i.e. only storing changes) during sampling was required
    • E.g. tracing ‘all’ query plans and detecting changes during analysis would have taken too much storage space, and made for very slow analysis

I needed to detect query plan changes, but as stated above, using only plan_handle changes to detect query_plan changes does however not work, let’s demonstrate this.

T-SQL Script and Results

  • Create a test table with an index in tempdb
  • SELECT from the empty table
  • Display the plan_handle for the SELECT that just executed
  • INSERT 100000 records into the table
  • Run exactly the same SELECT statement as above from the now populated table
  • Display the plan_handle for the second SELECT that just executed

Do enable “Include Actual Execution Plan” before running the script.

USE tempdb
GO

CREATE TABLE dbo.TestQueryPlan
(
	Id INT IDENTITY(1,1),
	Pad CHAR(1000),
)
GO

CREATE INDEX IX_TestQueryPlan_Id
	ON dbo.TestQueryPlan (Id)
GO

SELECT Id, Pad
FROM dbo.TestQueryPlan
WHERE Id &amp;amp;amp;amp;amp;amp;lt;= 1
GO

SELECT plan_handle, text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle)
WHERE text LIKE '%TestQueryPlan%WHERE Id%'
	AND text NOT LIKE '%CROSS APPLY%'
GO

INSERT INTO dbo.TestQueryPlan
SELECT TOP 100000 'Pad me'
FROM master.sys.columns a
CROSS APPLY master.sys.columns b
GO

SELECT Id, Pad
FROM dbo.TestQueryPlan
WHERE Id &amp;amp;amp;amp;amp;amp;lt;= 1
GO

SELECT plan_handle, text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle)
WHERE text LIKE '%TestQueryPlan%WHERE Id%'
	AND text NOT LIKE '%CROSS APPLY%'
GO

DROP TABLE dbo.TestQueryPlan
GO

Here are the query plans for the two identical SELECT statements:

Simple query plan before insert

Query plan changed after insert

The query plans are quite different due to running SELECT on an empty vs. a populated table. Next up are the plan_handles:

Identical plan_handles

We see that the plan_handles are identical, despite the query plans above being quite different, and therein lies the gotcha – I need a better way of detecting query plan changes.

Also note that while the script uses ad-hoc SELECT queries, the same effect also applies to stored procedures.

BOL?

One thing that had led me astray was BOL in several places saying that plan_handle is unique, e.g. sys.dm_exec_query_plan notes that:

Each cached query plan is identified by a unique identifier called a plan handle.

and

plan_handle Uniquely identifies a query plan for a batch that is cached or is currently executing.

I initially took uniqueness to mean that if a query plan changes, its plan_handle will also change, which as shown above is not the case.

One way to look at it is that the BOL definition of plan_handle uniqueness:

  • Only applies to currently cached or executing query plans
  • Does not apply to plans that have been dropped from the cache
  • Does not apply to plans that have finished executing and were never stored in the cache

Pros and Cons

Looking at plan_handle from other perspectives I can certainly see reasons why it is implemented this way:

  • When a statement is recompiled and the query plan ends up different, SQL Server
    • Has no use for the old plan and throws it away
    • Does not have to update anything that pointed to the old plan using the plan_handle, since the plan_handle stays the same
  • A plan_handle encodes many things (SET options, objectid…) within its 64 bytes, so it’s not an arbitrary identifier (see sys.dm_exec_plan_attributes)

On the flip side, it’s a bit of a shame that despite a 64-byte ostensibly unique plan_handle, and the query_plan_hash from sys.dm_exec_query_stats (which is designed to generate duplicates), there is no obvious and fast way of knowing if a query plan has changed or not. Tracing all query (re)compiles would provide the necessary data, but lightweight it is not.

A second aspect of this design is that there is no way to get hold of an old query plan from DMFs etc. after it has changed, making troubleshooting harder unless of course you were explicitly capturing query plans.

Way Forward

Realising the above, I calculated, stored, and compared my own hash of the query_plan attribute, i.e. hashing the full XML text. This guaranteed detecting even the smallest query plan change, and furthermore I only had to process the often large captured query_plans once, with subsequent comparisons done on the computed hashes.

Hashing was fairly easy since I already had a CLR hashing stored procedure that could handle large inputs.

Given thousands of query plans though, sometimes megabytes in size, it would still be preferable to have a more direct way of detecting query plan changes than hashing the full text or running a continuous trace – do let me know if you have any ideas on this.

query_plan hashing worked well for my requirements and sampling facility; other approaches could also work well, especially for ad-hoc investigations (as opposed to longer-term monitoring) where the size of the logged data would not have time to become prohibitively large, and the impact to the server could be kept short, including:

  • Periodically dump the full query plans with context into tables, and check for changes (i.e. deduplicate) during analysis instead of during sampling
  • Use SQL Profiler or a Server-side trace to capture query plans using e.g. the Showplan XML Statistics Profile event, and again check for changes during analysis. Note that especially SQL Profiler “can have a significant performance overhead”
  • Use new Extended Events in SQL Server Denali that include the actual query plan, but like tracing they carry the “can have a significant performance overhead” caveat

In Conclusion

  • Long term logging of actual query plans is very useful for troubleshooting intermittent and unexpected query plan changes
  • Do take monitoring duration, storage size and impact on server into account when selecting how to collect query plans (i.e. Management Studio, SQL Profiler, Server side trace, Deduplicated sampling as described here…)
  • Be aware that query plans change without their associated plan_handle changing, so use the full query_plan XML text to detect query plan changes, and
  • Deduplicating query plans during sampling dramatically reduces storage space and simplifies analysis; do however use your own hash of the query_plan field to reduce the CPU processing required to detect changes

Hope it helps!

Kristian