actionETL Community edition is completely free, including for commercial use. It supports unlimited data volumes and is geared towards ETL solutions of low complexity.
.NET 5 Support
actionETL is now also supported on .NET 5, which brings new features and improved performance to the .NET ecosystem. Check out the full platform details.
The dataflow capabilities in actionETL have been heavily optimized and are fast – as in really fast. The high performance/low overhead architecture provides a great foundation for implementing complex and large volume ETL requirements.
Benchmark numbers below, and full sources on GitHub. Why not use the free trial and run it on your Windows or Linux machine?
tl;dr
My 7-year-old desktop from 2013 can pump up to 180 million rows per second between two dataflow workers, some two orders of magnitude faster than most traditional data sources can keep up with on this machine, and up to 340 million rows per second in aggregate link throughput across its 4 cores:
It’s also achieved with quite small buffer sizes of 1024 rows, which conserves memory. The default buffer size of 256 rows is almost as fast in this benchmark, and using tiny 32-row or even smaller buffers when you have particularly wide rows is perfectly viable. You can even give different links in the same dataflow different buffer sizes.
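To get a feel for why small buffers conserve memory, here's some back-of-the-envelope arithmetic. This is illustrative only, not actionETL's actual allocation strategy, and the 100 bytes per row figure is a hypothetical example:

```csharp
using System;

// Rough per-buffer memory estimate: rows per buffer times row width.
// Illustrative arithmetic only; not actionETL's actual allocator.
static long BufferBytes(int bufferRows, int bytesPerRow) =>
    (long)bufferRows * bytesPerRow;

// With hypothetical 100-byte rows:
Console.WriteLine(BufferBytes(1024, 100)); // 102400 bytes per buffer
Console.WriteLine(BufferBytes(32, 100));   // 3200 bytes per buffer
```

Even a dataflow with many links therefore stays cheap on memory when each link only holds a handful of small buffers in flight.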
With no data sources to slow the dataflow down, just pumping rows saturates my 4 CPU cores after about 3 links. Beyond that, the aggregate link throughput stays constant even with over a thousand dataflow source, transform, and target workers all pumping rows simultaneously, showing great scalability.
You can use large numbers of workers to solve complex problems well while retaining excellent performance.
In any real application, the high efficiency of the dataflow leaves almost all CPU cycles available for user logic and data source drivers.
With the powerful API, it only takes a few lines of code to create, connect, and run all the workers for all the benchmark cases.
Just the Facts, Ma’am
In the above chart, the blue “Many Sources, No Transforms” shows running many source workers, with a single target worker connected to each source:
The red “One Source, Many Transforms” shows running a single source worker, with many downstream transforms, terminated by a single target:
And the yellow “Multiple Sources and Transforms” shows running multiple source workers, and an equal number of transform and target workers per source as there are sources in total:
In these benchmarks, we specifically test the movement of rows between workers, without interacting with external data sources or allocating new rows (I’ll come back to that in later articles).
The source worker or workers send a row 1 billion times over one link, or 500 million times over two links, all the way down to roughly 1 million rows over 1024 links, to the downstream workers. This way, the aggregate number of rows over all links is always 1 billion.
Each transform receives the upstream rows and sends them downstream.
Each individual flow is terminated by a target that receives the upstream rows and throws them away.
Aggregate Link Throughput is the sum of the number of rows passed through each link (1 billion), divided by the elapsed time.
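To make the sizing concrete, here's the arithmetic for one test case (4 sources, 4 downstream links per source), matching the per-case calculation in the full listing:

```csharp
using System;

const long desiredAggregateLinkRows = 1_000_000_000;
int numberOfSources = 4, downstreamLinksPerSource = 4;

// 4 sources x 4 links each = 16 links in total
int numberOfLinks = numberOfSources * downstreamLinksPerSource;

// Each source must send ceil(1,000,000,000 / 16) = 62,500,000 rows
long rowsPerSource = (long)Math.Ceiling(
    (double)desiredAggregateLinkRows / numberOfLinks);

// Every row a source emits crosses each of its downstream links once,
// so the rows summed over all links add back up to the 1 billion target
long aggregateLinkRows =
    numberOfSources * rowsPerSource * downstreamLinksPerSource;

Console.WriteLine(aggregateLinkRows); // 1000000000
```

Dividing that aggregate row count by the elapsed wall-clock time gives the Aggregate Link Throughput plotted in the chart.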
Coding this is very easy; here’s how to create, connect, and run all the workers for all the test cases – a good demonstration of the power of the library API:
const long desiredAggregateLinkRows = 1_000_000_000;

var testCases = new (int sources, int downstreamLinksPerSource)[]
{
    (1,1) // Warm up the JIT etc. to get consistent timings
    , (1,1), (2,1), (3,1), (4,1), (16,1), (64,1), (256,1)
    , (1024,1) // Many Sources, No Transforms
    , (1,2), (1,3), (1,4), (1,16), (1,64), (1,256)
    , (1,1024) // Single Source, Many Transforms
    , (2,2), (4,4), (8,8), (16,16), (32,32) // Multiple Sources and Transforms
};

foreach (var (numberOfSources, downstreamLinksPerSource) in testCases)
{
    int numberOfLinks = numberOfSources * downstreamLinksPerSource;
    long rowsPerSource = (long)Math.Ceiling(
        (double)desiredAggregateLinkRows / numberOfLinks);
    long actualAggregateLinkRows = numberOfSources * rowsPerSource
        * downstreamLinksPerSource; // Avoid round-off errors

    var workerSystem = new WorkerSystem()
        .Root(ws =>
        {
            for (int s = 0; s < numberOfSources; s++)
            {
                var rrs = new RepeatRowsSource<MyRow>(ws, $"Source {s}"
                    , rowsPerSource, new MyRow(42))
                    { SendTemplateRows = true };
                var output = rrs.Output;
                for (int l = 1; l < downstreamLinksPerSource; l++)
                    output = output.Link.MulticastTransform(
                        $"Transform {s} - {l}", 1).TypedOutputs[0];
                output.Link.TrashTarget($"Target {s}");
            }
        });

    (await workerSystem.StartAsync()) // Run the worker system
        .ThrowOnFailure();
}
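The listing references a MyRow class whose definition is in the full sources on GitHub rather than shown here. As a minimal stand-in sketch (assumed, not the original definition), a single-column row could look like:

```csharp
// Minimal stand-in for the benchmark's row type. The real MyRow is in
// the GitHub sources; this sketch only assumes a single int column,
// matching the new MyRow(42) template row in the listing.
public class MyRow
{
    public int Value;
    public MyRow(int value) => Value = value;
}
```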
Winding Down
I think the key takeaways are that actionETL can move a massive number of rows through the dataflow without any noticeable slow-downs, and do it with memory-conserving small buffers.
Some future topics to look at include memory allocation and consumption, as well as data source and transform worker performance.
As mentioned, full sources are on GitHub; try it out with the free trial and let me know in the comments below what throughput numbers you get!
A public release is now available that adds .NET Standard 2.0 and Linux, with overall support for:
Windows
Linux
.NET Framework 4.6.1+
.NET Standard 2.0+
.NET Core 2.1+
It also adds nuget.org distribution and dotnet new project templates, making it very easy to get up and running – it only takes these three commands to install the templates, then create and run your first actionETL application – try it now!
dotnet new --install actionETL.templates.dotnet
dotnet new actionetl.console --output my-app
dotnet run --project my-app
The actionetl.console template created this trivial but useful starting point, where you can replace and add your own ETL logic:
using actionETL;
using System.Threading.Tasks;

namespace actionETL.console.csharp
{
    static class Program
    {
        static async Task Main()
        {
            // Test if file exists
            var outcomeStatus = await new WorkerSystem()
                .Root(ws =>
                {
                    // Check filename loaded from "actionetl.aconfig.json"
                    new FileExistsWorker(ws, "File exists", ws.Config["TriggerFile"]);
                })
                .StartAsync();

            // Exit with success or failure code
            outcomeStatus.Exit();
        }
    }
}
In a fully documented example, actionETL required only 9kB of code to create a high performance and reusable custom Slowly Changing Dimension (SCD) worker, 23 times less than the 209kB used by a SQL Server® Integration Services (SSIS) implementation with similar functionality. What caused this stark difference?
Modern AppDev Techniques
The actionETL library is designed from the ground up to make ETL development very productive. By using well-known application development techniques, it provides excellent:
Reusability
Composability
Encapsulation
Testability
Extensibility
Refactoring
Source control and Continuous Integration/Continuous Delivery
In the SCD example, actionETL composability pays a huge dividend, where existing ‘control flow’ and dataflow workers are easily combined to create a new high performance and reusable custom worker:
Unlike with SSIS, actionETL ‘control flow’ and dataflow workers can be freely mixed and matched, and can also be used to create new custom workers.
Visual Designer
In contrast, SSIS cannot use existing control flow tasks or dataflow components when creating new tasks or components, not even via C++, and must therefore implement all required functionality from scratch. Most SSIS custom tasks and components also require significant UI code. Both aspects heavily inflate the amount of code that must be written and maintained.
Virtually all traditional ETL tools share the same heavy focus on a drag & drop visual designer as SSIS has. While this certainly helps in some ways, such as initially getting up to speed on the tool, these tools pay a very heavy price in terms of poor support for some or all of the modern AppDev traits above.
actionETL is a .NET library and NuGet package for easily creating high-performance ETL applications.
It combines familiar ETL tool features such as ‘control flow’ and dataflow with modern application development techniques, which together provide a highly productive environment for tackling both small and large ETL projects.
The combination is easy to learn and powerful to use, targeting users ranging from ETL developers with limited .NET experience, to seasoned .NET developers with limited ETL experience.