
Blazingly Fast ETL Dataflow with .NET and actionETL

The dataflow capabilities in actionETL have been heavily optimized and are fast – as in really fast. The high performance/low overhead architecture provides a great foundation for implementing complex and large volume ETL requirements.

Benchmark numbers below, and full sources on GitHub. Why not use the free trial and run it on your Windows or Linux machine?

tl;dr

My 7-year-old desktop from 2013 can pump up to 180 million rows per second between two dataflow workers, some two orders of magnitude faster than most traditional data sources can keep up with on this machine, and up to 340 million rows per second in aggregate link throughput across its 4 cores.

This is also achieved with quite small buffer sizes of 1024 rows, which conserves memory. The default buffer size of 256 rows is almost as fast on this benchmark, and using e.g. tiny 32-row or even smaller buffers when you have particularly wide rows is perfectly viable. You can even give different links in the same dataflow different buffer sizes.
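As a rough back-of-the-envelope illustration (the 100-byte row size is just an assumption for the arithmetic): with 1024-row buffers and rows of around 100 bytes each, a single buffer holds on the order of 100 kB of row data, while dropping to 32-row buffers shrinks that to roughly 3 kB.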

With no data sources to slow the dataflow down, just pumping rows does saturate my 4 CPU cores once there are about 3 links. Beyond that, the aggregate link throughput stays constant even with over a thousand dataflow source, transform, and target workers all pumping rows simultaneously, showing great scalability.

You can use large numbers of workers to solve complex problems well while retaining excellent performance.

In any real application, the high efficiency of the dataflow leaves almost all CPU cycles available for user logic and data source drivers.

With the powerful API, it only takes a few lines of code to create, connect, and run all the workers for all the benchmark cases.

Just the Facts, Ma’am

In the above chart, the blue “Many Sources, No Transforms” shows running many source workers, with a single target worker connected to each source.

The red “One Source, Many Transforms” shows running a single source worker, with many downstream transforms, terminated by a single target.

And the yellow “Multiple Sources and Transforms” shows running multiple source workers, with as many downstream workers (transforms terminated by a target) per source as there are sources in total.

The specific workers used are RepeatRowsSource, MulticastTransform, and TrashTarget, as seen in the code further down.

In these benchmarks, we specifically test the movement of rows between workers, without interacting with external data sources or allocating new rows (I’ll come back to that in later articles).

  • The source worker or workers send a row to the downstream workers 1 billion times over one link, 500 million times each over two links, and so on down to roughly 1 million rows per link over 1024 links. This way, the aggregate number of rows over all links is always 1 billion (within rounding, as the output below shows).
  • Each transform receives the upstream rows and sends them downstream.
  • Each individual flow is terminated by a target that receives the upstream rows and throws them away.
  • Aggregate Link Throughput is the sum of the number of rows passed through each link (1 billion), divided by the duration.
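As a quick sanity check against the first row of the output below: 1,000,000,000 aggregate link rows / 5.547 seconds ≈ 180.27 million rows per second.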

That old desktop from 2013 that runs the benchmarks? It has an Intel Core i7 4770K, 3.5GHz, 4 physical cores, hyper-threading disabled, 16GB of memory, and now Windows 10 Pro (full specification). And we’re running .NET Core 3.1.

Here’s the output from a run:

AggregateLinkRows, Workers, Sources, DownstreamLinksPerSource, TotalLinks, Duration (s), AggregateLinkThroughput (Million Rows/s)
 1000000000,      2,      1,      1,      1,   5.547,    180.27
 1000000000,      4,      2,      1,      2,   2.933,    340.92
 1000000002,      6,      3,      1,      3,   3.071,    325.62
 1000000000,      8,      4,      1,      4,   3.140,    318.50
 1000000000,     32,     16,      1,     16,   3.146,    317.90
 1000000000,    128,     64,      1,     64,   3.163,    316.19
 1000000000,    512,    256,      1,    256,   3.179,    314.61
 1000000512,   2048,   1024,      1,   1024,   3.267,    306.12
 1000000000,      3,      1,      2,      2,   4.933,    202.71
 1000000002,      4,      1,      3,      3,   3.439,    290.79
 1000000000,      5,      1,      4,      4,   3.494,    286.22
 1000000000,     17,      1,     16,     16,   3.089,    323.74
 1000000000,     65,      1,     64,     64,   3.064,    326.42
 1000000000,    257,      1,    256,    256,   3.009,    332.29
 1000000512,   1025,      1,   1024,   1024,   3.241,    308.53
 1000000000,      6,      2,      2,      4,   4.550,    219.77
 1000000000,     20,      4,      4,     16,   3.730,    268.11
 1000000000,     72,      8,      8,     64,   3.364,    297.30
 1000000000,    272,     16,     16,    256,   3.507,    285.12
 1000000512,   1056,     32,     32,   1024,   3.149,    317.59

Use the Source, Luke

Coding this is very easy. Here’s how to create, connect, and run all the workers for all the test cases – a good demonstration of the power of the library API:

const long desiredAggregateLinkRows = 1_000_000_000;
var testCases = new (int sources, int downstreamLinksPerSource)[]
{
	(1,1) // Warm up the JIT etc. to get consistent timings
	, (1,1), (2,1), (3,1), (4,1), (16,1), (64,1), (256,1)
		, (1024,1) // Many Sources, No Transforms
	, (1,2), (1,3), (1,4), (1,16), (1,64), (1,256)
		, (1,1024) // Single Source, Many Transforms 
	, (2,2), (4,4), (8,8), (16,16), (32,32) // Multiple Sources and Transforms 
};

foreach (var (numberOfSources, downstreamLinksPerSource) in testCases)
{
	int numberOfLinks = numberOfSources * downstreamLinksPerSource;
	long rowsPerSource = (long)Math.Ceiling(
		(double)desiredAggregateLinkRows / numberOfLinks);
	long actualAggregateLinkRows = numberOfSources * rowsPerSource 
		* downstreamLinksPerSource; // Avoid round-off errors

	var workerSystem = new WorkerSystem()
		.Root(ws =>
		{
			for (int s = 0; s < numberOfSources; s++)
			{
				var rrs = new RepeatRowsSource<MyRow>(ws, $"Source {s}"
					, rowsPerSource, new MyRow(42))
						{ SendTemplateRows = true };
				var output = rrs.Output;

				for (int l = 1; l < downstreamLinksPerSource; l++)
					output = output.Link.MulticastTransform(
						$"Transform {s} - {l}", 1).TypedOutputs[0];

				output.Link.TrashTarget($"Target {s}");
			}
		});

	(await workerSystem.StartAsync()) // Run the worker system
		.ThrowOnFailure();
}
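The MyRow type isn’t shown in the snippet above; it’s defined in the full sources on GitHub. As a purely hypothetical sketch of the shape it could take (a simple single-column row matching the new MyRow(42) call above):

// Hypothetical sketch only – the actual MyRow definition is in the full sources on GitHub
public class MyRow
{
    public int Value;

    public MyRow(int value)
    {
        Value = value;
    }
}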

Winding Down

I think the key takeaways are that actionETL can move a massive number of rows through the dataflow without any noticeable slowdowns, and do it with small, memory-conserving buffers.

Some future topics to look at include memory allocation and consumption, as well as data source and transform worker performance.

As mentioned, full sources on GitHub; try it out with the free trial and let me know in the comments below what throughput numbers you get!

New Cross-platform .NET ETL library on Windows and Linux – actionETL

actionETL is a cross-platform .NET library for easily creating high-performance ETL applications.

Try it out with the free trial

A public release is now available that adds .NET Standard 2.0 and Linux, with overall support for:

  • Windows
  • Linux
  • .NET Framework 4.6.1+
  • .NET Standard 2.0+
  • .NET Core 2.1+

It also adds nuget.org distribution and dotnet new project templates, making it very easy to get up and running – it only takes these three commands to install the templates, then create and run your first actionETL application – try it now!

dotnet new --install actionETL.templates.dotnet
dotnet new actionetl.console --output my-app
dotnet run --project my-app

The actionetl.console template creates this trivial but useful starting point, where you can replace and add your own ETL logic:

using actionETL;
using System.Threading.Tasks;

namespace actionETL.console.csharp
{
    static class Program
    {
        static async Task Main()
        {
            // Test if file exists
            var outcomeStatus = await new WorkerSystem()
                .Root(ws =>
                {
                    // Check filename loaded from "actionetl.aconfig.json"
                    new FileExistsWorker(ws, "File exists", ws.Config["TriggerFile"]);
                })
                .StartAsync();

            // Exit with success or failure code
            outcomeStatus.Exit();
        }
    }
}

Next, get a license with our free trial, continue with the getting started article, use the extensive documentation, and read about actionETL features. If you have any questions, please contact us.

PostgreSQL Support Now Added to actionETL .NET ETL library


PostgreSQL® is a hugely popular and capable database engine and is now supported by actionETL via a dedicated provider.

PostgreSQL supports a very wide range of data types, and actionETL has excellent support for both its provider-independent (Boolean, Double, String…) and provider-specific (NpgsqlDbType.Circle, NpgsqlDbType.Point…) types.

actionETL wraps the official PostgreSQL .NET provider – Npgsql. Check out the details in our PostgreSQL documentation.

We now have dedicated database providers for MariaDB™, MySQL™, PostgreSQL®, SQLite, and SQL Server®.

Other databases are accessed via the ODBC provider.

SQLite Support Now Added to actionETL


SQLite is the most used database engine in the world and is now supported by actionETL. Being a fast, installation-free local database in the public domain, it can be a great tool to combine with actionETL.

You might use SQLite as:

  • Your main (local) database
  • A complement to a traditional database server, e.g. to offload work from an expensive host, and to avoid the network performance overhead
  • Temporary SQL processing, e.g. to:
    • Reduce memory consumption by performing a large sort using the disk backed SQLite
    • Execute queries where the SQL set-based approach is a better fit than the dataflow row-by-row approach (and where using LINQ is not appropriate)

actionETL wraps the official SQLite .NET provider – System.Data.SQLite. Check out the details in our SQLite documentation, as well as the full list of supported databases.
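As a rough illustration of the “temporary SQL processing” idea above, here’s a minimal plain ADO.NET sketch (not using actionETL workers) that lets a disk-backed SQLite file do a large sort. It assumes the System.Data.SQLite package, and the table and file names are made up for the example:

using System;
using System.Data.SQLite;

static class SqliteSortSketch
{
    static void Main()
    {
        // Scratch database file on disk; SQLite creates it if it doesn't exist
        using var connection = new SQLiteConnection("Data Source=sort-scratch.db");
        connection.Open();

        using (var create = new SQLiteCommand(
            "CREATE TABLE IF NOT EXISTS Staging (Id INTEGER, Name TEXT)", connection))
        {
            create.ExecuteNonQuery();
        }

        // ... bulk insert the rows to be sorted here, ideally inside a transaction ...

        // Read the rows back in sorted order; SQLite performs the sort on disk,
        // keeping the .NET process memory footprint small
        using var select = new SQLiteCommand(
            "SELECT Id, Name FROM Staging ORDER BY Name", connection);
        using var reader = select.ExecuteReader();
        while (reader.Read())
            Console.WriteLine($"{reader.GetInt64(0)}: {reader.GetString(1)}");
    }
}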

Reduce .NET ETL Code Size by 23 Times with actionETL

In a fully documented example, actionETL required only 9kB of code to create a high performance and reusable custom Slowly Changing Dimension (SCD) worker, 23 times less than the 209kB used by a SQL Server® Integration Services (SSIS) implementation with similar functionality. What caused this stark difference?

Modern AppDev Techniques

The actionETL library is designed from the ground up to make ETL development very productive. By using well-known application development techniques, it provides excellent:

  • Reusability
  • Composability
  • Encapsulation
  • Testability
  • Extensibility
  • Refactoring
  • Source control and Continuous Integration/Continuous Delivery

In the SCD example, actionETL composability pays a huge dividend: existing ‘control flow’ and dataflow workers are easily combined to create a new high performance and reusable custom worker.

Unlike with SSIS, actionETL ‘control flow’ and dataflow workers can be freely mixed and matched, and can also be used to create new custom workers.

Visual Designer

In contrast, SSIS cannot use existing control flow tasks or dataflow components when creating new tasks or components, not even via C++, and must therefore implement all required functionality from scratch. Most SSIS custom tasks and components also require significant UI code. Both aspects heavily inflate the amount of code that must be written and maintained.

Virtually all traditional ETL tools have the same heavy focus on their drag&drop visual designer as SSIS has. While this certainly helps in some ways, like initially getting up to speed on the tool, they pay a very heavy price in terms of poor support for some or all of the above modern AppDev traits.

Whitepaper

If you want to learn more about how actionETL compares to SSIS and traditional ETL tools, check out the Thirteen Factors Crippling ETL Productivity whitepaper.

ETL Using .NET – Introducing actionETL

actionETL is a .NET library and NuGet package for easily creating high-performance ETL applications.

actionETL worker hierarchy

It combines familiar ETL tool features such as ‘control flow’ and dataflow with modern application development techniques, which together provide a highly productive environment for tackling both small and large ETL projects.

The combination is easy to learn and powerful to use, targeting users ranging from ETL developers with limited .NET experience, to seasoned .NET developers with limited ETL experience.

Check out the actionETL features, the conceptual documentation and examples, and if you’re interested, consider trying it out with the free trial!

VirtualBox Hyper-Threading Benchmark Surprise

I needed to ensure good CPU and memory performance in a VirtualBox virtual machine running on a 4-core desktop, and googling didn’t provide any clear guidance. After some benchmarking, the surprise came in the shape of consistently getting the best result when ignoring VirtualBox’ warning on oversubscribing processors:

VirtualBox warns if the guest is configured with more processors than there are physical cores on the host

As always, these results are only valid for my particular configuration and on my chosen benchmarks, including the assumptions that the physical host is idling while the single virtual machine is running flat out – do test with your own systems and tasks.

Physical Host

Desktop system where the CPU and motherboard were released in AD 2013:

CPU: Single Intel Core i7 4770K, 3.5GHz, 4 physical cores, 8 logical cores (when Hyper-Threading is enabled)
  • Note: All cores run at 3.7GHz during multi-threaded benchmarks; for single-threaded benchmarks, the core runs between 3.7 and 3.9GHz
Memory: G.Skill F3-2400C10D-16GTX, Trident X Series, 2x8GB, PC3-19200, DDR3 2400MHz
  • Dual Channel, DRAM:1200MHz, CL:10 tRCD:12 tRP:12 tRAS:31 tRFC:363 CR:2T
Motherboard: ASRock Z87 Extreme9/ac, BIOS 2.30
Disk: Samsung 840 EVO 1TB SSD
OS: Windows 7 Ultimate SP1 64-bit

Note: The results in this article are likely not applicable to NUMA systems with physical processors in multiple sockets, since these have very different memory, cache, and thread scheduling characteristics.

VirtualBox

Version: 4.3.26
Guest Processors: 4 or 8
Guest Memory: 8GB
Guest OS: Windows Server 2008 R2 Standard SP1 64-bit
Guest Settings: IO-APIC, 100% Execution Cap, PAE/NX, VT-x / AMD-V, Nested Paging all enabled

Benchmarks

For my particular requirements, I chose mainly multi-threaded CPU and memory bound benchmarks, with some disk benchmarks to guard against IO regressions – do follow the links for specifics on the individual benchmarks:

y-cruncher: Multi-threaded. Calculates Pi. Mainly CPU and thread communication limited. Requires and uses SSE.
PassMark PerformanceTest 8.0 (Build 1025) 64-bit:
  • “Preferences > Number of processes” set to number of logical processors on host (i.e. 4 or 8)
  • All CPU and memory benchmarks have been included
  • Disk benchmarks were also executed, but detailed results are not included due to the variability in IO systems:
    • Enabling/disabling Hyper-Threading had, as expected, no impact on disk performance
    • With Samsung RAPID mode disabled, disk performance in the VirtualBox guest ranged from 12% slower to 4% faster than the physical host
    • Enabling RAPID mode (which uses main memory as cache for the SSD) improved VirtualBox guest disk performance by about 40%, and improved physical host disk performance by a whopping 9.5x – real life mileage will of course vary wildly

Results

To aid digestion, I’m presenting the data as the speed-up or slowdown of the different configurations vs. the configuration that was fastest on average, which was running on the physical host with Hyper-Threading enabled.

The “Overall Average” section at the top of the chart is the average slowdown of all the actual benchmarks further down. Compared to the physical host with Hyper-Threading enabled, we see that running on:

  • VB 4 NoHT (VirtualBox with 4 processors, host has Hyper-Threading disabled) is on average the slowest at -22%, with individual benchmarks ranging from -2% to -55% slower
  • VB 4 HT (VirtualBox with 4 processors, host has Hyper-Threading enabled) is on average -22% slower, with individual benchmarks ranging from 2% faster to -44% slower
  • Phys 4 NoHT (Physical host, Hyper-Threading disabled) is on average only -10% slower, with individual benchmarks ranging from 1% faster to -50% slower
  • VB 8 HT (VirtualBox with 8 processors, host has Hyper-Threading enabled) is on average only -9% slower, with individual benchmarks ranging from -2% to -27% slower

Conclusions

Given my set-up, requirements and assumptions, I find that:

  • Disabling Hyper-Threading makes both the physical host and the virtual machine on average quite a bit slower – I’ll leave it enabled.
  • Following VirtualBox’ recommendation of limiting virtual processors to the number of physical cores brings a slowdown of -22% (VB 4 HT above). I’ll instead configure as many VirtualBox virtual processors as there are logical (Hyper-Threaded) cores (VB 8 HT above), giving only a -9% slowdown.

Finally, a small warning: if you configure VirtualBox to use more processors than there are logical (Hyper-Threaded) cores (e.g. 16 virtual processors on my 4770K), it can run an order of magnitude slower than normal – simply ensure that you have no more VirtualBox processors configured than there are logical (Hyper-Threaded) cores available.

Hope it helps!

Tip: Open QlikView Without Data in Windows Explorer

I often use the Open ‘MyQVW’ Without Data option in Recently Opened Documents.

This is especially useful for quickly looking at the script or UI of very large QVWs, without having to wait for all the data to load (if you can open it at all away from your server!).

My trivial (but I find handy) tip is to add a QlikView Without Data option to the Windows SendTo context menu, so that I can right-click and open any QVW this way, even if it’s not in the Recently Opened Documents list.

To accomplish this, simply:

  • Navigate to the SendTo directory (on Windows 7 for instance, type shell:sendto into the start search field and hit Enter to open it)
  • Create a QlikView Without Data.cmd file in the SendTo directory, containing the following two lines:
start "QlikView" QV.exe /nodata %*
if errorlevel 1 pause

And presto! My humongous QVW opens in less than a second.

One tiny tweak would be to remove the .cmd extension from the context menu to make it more similar to the other items in the menu. One can for instance move the .cmd file to a different directory, and then create a shortcut (which doesn’t need an extension) in the SendTo directory that points to the .cmd file.

Hope it helps!

QlikView Horizontal Table Issue with 40+ Fields

I discovered the following issue with Horizontal tables (Straight Table > Properties > Presentation > Horizontal) in the AJAX ZFC client:

The chart only displays the first 40 fields; any fields added beyond the first 40 are not displayed. Fields beyond 40 are exported to Excel OK, and also work perfectly fine in the Windows desktop client, but I couldn’t find any way to get more than 40 to display in the AJAX client.

This was tested on QlikView Server 64-bit Edition (x64), 10.00.9088.7, with English (United Kingdom) settings, running on Windows Server 2008 R2 Standard (64 bit edition), using the QlikView web server, and using several different browsers.

QlikTech has logged this as issue 44912, which (at least according to the QV10/11 release notes) hasn’t been fixed yet.

UPDATE 2012-07-01: QV10 Service Release 5 includes a fix for this issue!

As a temporary workaround, I ended up using two tables, the second one displaying fields 41 to 70 and with a Layout > Show > Conditional set to:

=GetPossibleCount(PrimaryKeyField)=1

so that the second table only displays when there is a single record selected. This avoids the two tables potentially showing data from different records through scrolling.

Do let me know if you find a better workaround, or if this is already working in your environment.

SMO + LINQ for Entity Framework Code First Database Modifications

Database Aspects of Code First

Using Entity Framework (EF) 4.1 with the Code First approach seems like a good choice for my greenfield Web App – during development I can focus on the Web and C# .Net paradigms, and postpone the bulk of the database work until my application and data model have stabilised. That said, there are several database aspects I must still address up front, including:

  • Decide and implement the high-level mapping between .Net classes and database tables. Morteza Manavi (blog | twitter) has a great series on this.
  • Set field properties such as string length, required vs. optional etc., which will get reflected in the database.

Since I plan to continue using Code First beyond the initial project setup (populating with test data, etc.), it becomes desirable to make further changes to the database beyond what EF Data Annotations and the Fluent API support natively. These changes are usually put in the Seed() method of the inherited DbContext class; this method gets executed immediately after the database has been created.

EF 4.1 does not support adding unique keys, so I create them with a call to ExecuteSqlCommand() as follows:

context.Database.ExecuteSqlCommand(@"
  ALTER TABLE [WebApp].[MyTable] ADD CONSTRAINT 
    [UK_MyTable_Name] UNIQUE(Name); 
  ALTER TABLE [WebApp].[MySecondTable] ADD CONSTRAINT 
    [UK_MySecondTable_Name] UNIQUE(Name); 
  ");

For straightforward SQL I prefer the above approach, since it will allow me to later (if and when the database takes on a life of its own) add the SQL text to the database creation script generated by EF.

Key Names

While foreign keys generated by EF have names that can also be overridden, I notice that EF has not given the primary keys any names, and there is also no EF way of setting their names. SQL Server in turn auto generates primary key names such as PK__MyTable__0376AC45EB0148EF. Apart from the ugly GUID in the name, the MyTable part unfortunately gets truncated at 8 characters, so I decide to add proper names to the primary keys.

Ladislav Mrnka (stackoverflow) was kind enough to point me to his sample code for dropping and recreating primary keys using SQL. Since a database “with a life of its own” would not need this drop-create SQL code, I prefer in this case to stay in .Net land and use SQL Server Management Objects (SMO) for implementing the changes:

SMO + LINQ + EF = True

Using SMO requires adding a few references to the project as described here. Discovering and altering primary keys is quite straightforward, and I elect to use Language-Integrated Query (LINQ) instead of nested foreach statements, which makes for nicely concise code. Here’s the complete Seed() method:

// Seed() method from the MyDbContext : DbContext class
public void Seed(WebAppEntities context)
{
  // Add unique keys
  context.Database.ExecuteSqlCommand(@"
    ALTER TABLE [WebApp].[MyTable] ADD CONSTRAINT
      [UK_MyTable_Name] UNIQUE(Name);
    ALTER TABLE [WebApp].[MySecondTable] ADD CONSTRAINT
      [UK_MySecondTable_Name] UNIQUE(Name);
    ");


  // Get a Server object from the existing connection
  var server = new Microsoft.SqlServer.Management.Smo.Server(
    new Microsoft.SqlServer.Management.Common.ServerConnection
      (context.Database.Connection as System.Data.SqlClient.SqlConnection));
  // Get a database object. Qualify names to avoid EF name clashes.
  Microsoft.SqlServer.Management.Smo.Database database =
    server.Databases[context.Database.Connection.Database];

  // Rename auto generated primary key names
  (
    from Microsoft.SqlServer.Management.Smo.Table table in database.Tables
    where table.Schema == "WebApp"
    from Microsoft.SqlServer.Management.Smo.Index index in table.Indexes
    where index.IndexKeyType ==
      Microsoft.SqlServer.Management.Smo.IndexKeyType.DriPrimaryKey
    // Create pairs of a name string and an Index object
    select new { table.Name, index }
  // ToList() separates reading the object model above from modifying it below
  ).ToList()
  // Use extension and anonymous methods to rename the primary keys
  .ForEach(primaryKey =>
    {
      // Name primary keys "PK_TableName"
      primaryKey.index.Rename(
        "PK_" + primaryKey.Name.Replace("[", "").Replace("]", ""));
      primaryKey.index.Alter();
    });
}
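For completeness, here’s a hedged sketch (illustrative, not from the original code) of how Seed logic like the above is commonly wired up in EF 4.1 Code First, via a database initializer registered at application start-up:

using System.Data.Entity;

// Illustrative wiring only: WebAppInitializer is a made-up name, and WebAppEntities is
// the DbContext type used in the Seed() signature above.
public class WebAppInitializer : DropCreateDatabaseIfModelChanges<WebAppEntities>
{
    protected override void Seed(WebAppEntities context)
    {
        // Run the unique-key and primary-key-rename logic shown above from here,
        // so it executes right after the database has been (re)created
        base.Seed(context);
    }
}

// Registered once at application start-up:
// Database.SetInitializer(new WebAppInitializer());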

Summing Up

When my project has more of a .Net coding focus than a database focus, I often prefer using the power, simplicity and IntelliSense of SMO to perform my database manipulations; maybe you’ll consider doing the same ;-)