Benchmarking Bulk & Batch Insert with .NET and actionETL

UPDATE: actionETL 0.41.0 onward uses Microsoft.Data.SqlClient 4.x or later, and avoids the SqlClient Batch Insert performance issue mentioned below.

In this article, we look at the great database insert performance you can achieve with the actionETL dataflow and .NET. In particular, we examine how the choice of SQL database, database provider, and bulk insert vs. batch insert affects performance.

Setup

  • The benchmark generates rows with the RepeatRowsSource worker
  • Batch Insert puts multiple rows in each insert statement. The AdbInsertTarget worker supports all actionETL database providers.
    • Insert statements use (named or positional) parameters; each parameter inserts one column value in one row of the table (see the sketch after this list).
  • Bulk Insert requires a database-specific facility and is available for MySQL-compatible and SQL Server databases via the AdbMySqlConnectorBulkInsertTarget and AdbSqlClientBulkInsertTarget workers.
  • The target table is unindexed with 10 columns (4 integer, 4 text, 2 decimal).
  • All databases and the actionETL .NET 5 benchmark run locally on the same Windows 10 host, an old 4-core Intel Core i7 4770K, 3.5GHz desktop PC from 2013 with a Samsung 840 EVO 1TB SSD.
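To make the Batch Insert mechanism above concrete, here is a minimal sketch using plain Microsoft.Data.SqlClient (not actionETL internals) of a single insert statement carrying two rows, where each parameter supplies one column value of one row. The dbo.StagingRows table, its columns, and the connection string are hypothetical:

using Microsoft.Data.SqlClient;

const string connectionString = "...your connection string...";

using var connection = new SqlConnection(connectionString);
connection.Open();

// One batch = one INSERT statement with multiple row value lists;
// @p0..@p3 each carry one column value of one row.
using var command = new SqlCommand(
    "INSERT INTO dbo.StagingRows (Id, Name) VALUES (@p0, @p1), (@p2, @p3);",
    connection);
command.Parameters.AddWithValue("@p0", 1);
command.Parameters.AddWithValue("@p1", "first row");
command.Parameters.AddWithValue("@p2", 2);
command.Parameters.AddWithValue("@p3", "second row");
command.ExecuteNonQuery();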

Throughput

Here we generate and insert 1 million rows into the target table, and report the number of values inserted per second. 

Note: At larger volumes and in this benchmark, insert queries are mainly bottlenecked by the number of values to insert, which is why we report that number. To get the number of rows instead for our particular test, divide by 10.

We measure:

  • Batch Insert Default, which uses 256 values per batch and 16,384 values per transaction.
  • Batch Insert Tuned, which uses tuned values for batch and transaction sizes.
    • Note: This often uses the largest possible settings, but it must be done with caution, since it can overload the server if used with very large data volumes.
  • Bulk Insert, where supported.
[Chart: Throughput]

Several things stand out:

  • The Bulk Insert tests, as well as the (embedded database) SQLite Batch Insert Tuned test, are 2.5 to 15 times faster than the other tests and databases, and should (from a strict throughput standpoint) be used whenever possible.
  • Tuning Batch Insert settings provides a 14% to 54% improvement for non-SQLite tests and a 443% improvement for the SQLite test.
  • UPDATE: actionETL 0.41.0 onward uses Microsoft.Data.SqlClient 4.x or later, and avoids this performance issue:
    • The SqlClient Batch Insert tests have unexpectedly low performance. In fact, our benchmark discovered a previously unknown SQL Server bottleneck. The SqlClient team is developing a workaround, and when it’s released we’ll take advantage of it in actionETL.
      • In preliminary testing with the unreleased workaround, SqlClient Batch Insert achieves roughly 12% better performance than ODBC SQL Server, which would make it the fastest non-embedded Batch Insert in our tests.
      • With actionETL, you can however already use either SqlClient Bulk Insert or the ODBC SQL Server provider combination, which both deliver great performance.

Throughput Scaling

In these tests we generate and insert between 1 row and 1 million rows, and again report the number of values inserted per second. The Batch Insert tests use default batch and transaction size settings.

We see that:

  • There is an order-of-magnitude performance difference between the fastest and slowest database-provider-API combination, both for small and large volumes. 
  • SqlClient Bulk is the fastest across all volumes.
  • Even though the database etc. is warmed up with an insert before measurements are started, the overhead of opening a connection, initiating the inserts, etc. in the measured run is large enough to completely limit the throughput up to 1,000 values (100 rows), and mostly also up to 10,000 values (1,000 rows).
  • For small volumes, Bulk Insert (on those databases that support it) has the same performance as Batch Insert.

Here is the same data in a table:

Rows Inserted | 1 | 10 | 100 | 1,000 | 10,000 | 100,000 | 1,000,000
MySqlClient Batch | 652 | 5,686 | 52,268 | 209,848 | 280,920 | 294,748 | 294,147
MySqlConnector Batch | 136 | 1,339 | 12,860 | 109,279 | 331,885 | 430,429 | 432,934
MySqlConnector Bulk | 137 | 1,371 | 13,058 | 126,161 | 733,025 | 1,751,191 | 2,095,514
Npgsql Batch | 1,358 | 12,842 | 96,534 | 266,640 | 318,638 | 329,748 | 330,148
ODBC SQL Server Batch | 1,780 | 17,150 | 128,562 | 396,670 | 497,223 | 517,184 | 534,700
SqlClient Batch | 1,838 | 17,364 | 102,841 | 204,082 | 225,870 | 229,410 | 232,350
SqlClient Bulk | 1,956 | 19,464 | 181,010 | 1,289,693 | 3,569,541 | 4,382,258 | 4,382,097
SQLite Batch | 501 | 4,849 | 42,224 | 378,911 | 652,616 | 732,463 | 814,030

In Summary

If you are not bottlenecked by other factors (slow network, slow inserts due to indexing, etc.), then your choice of database, provider and insert API combination can make a huge performance difference. 

One common way to realize these performance advantages is to perform record-level processing in the dataflow, load data into unindexed staging tables, and finish up with set-based processing in the database.
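As a hedged illustration of that pattern (plain Microsoft.Data.SqlClient rather than an actionETL worker), the set-based finishing step might look like this once the dataflow has loaded the staging table; the dbo.StagingSales and dbo.FactSales tables and the connection string are hypothetical:

using Microsoft.Data.SqlClient;

const string connectionString = "...your connection string...";

using var connection = new SqlConnection(connectionString);
connection.Open();

// Set-based processing in the database after the dataflow has bulk-loaded
// the unindexed staging table.
using var command = new SqlCommand(@"
    INSERT INTO dbo.FactSales (CustomerId, Amount)
    SELECT s.CustomerId, SUM(s.Amount)
    FROM dbo.StagingSales AS s
    GROUP BY s.CustomerId;", connection);
command.ExecuteNonQuery();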

Do use Bulk Insert where available for best performance.

Do benchmark your own use cases with different APIs, providers, and settings to pick the best combination.

And certainly, actionETL helps you by providing both Batch Insert and Bulk Insert across the various databases, and by making it easy to switch between them. Try it out yourself with the free Community edition!

Nitty-gritty

Database and Provider Details

Provider | Database | Tuned Rows per Batch | Tuned Rows per Transaction
MySqlClient Batch | MariaDB 10.3 | 200 | long.MaxValue
MySqlConnector Batch | MariaDB 10.3 | 200 | long.MaxValue
MySqlConnector Bulk | MariaDB 10.3 | N/A | N/A
Npgsql Batch | PostgreSQL 12 | 200 | long.MaxValue
System.Data.Odbc Batch | SQL Server 2019 | 99 | long.MaxValue
Microsoft.Data.SqlClient Batch | SQL Server 2019 | 13 | long.MaxValue
Microsoft.Data.SqlClient Bulk | SQL Server 2019 | N/A | N/A
System.Data.SQLite Batch | SQLite.Core 1.0.113.6 | 25 | long.MaxValue

actionETL Code

The benchmark was run with actionETL 0.40.0 on .NET 5.

Dataflow row type:

class MyRow
{
    public long Data1 { get; }
    public long Data2 { get; } = 2000;
    public long Data3 { get; } = 3000;
    public long Data4 { get; } = 4000;
    public string Text1 { get; } = "Text5";
    public string Text2 { get; } = "Text567890";
    public string Text3 { get; } = "Text56789012345";
    public string Text4 { get; } = "Text5678901234567890";
    public decimal Number1 { get; } = 3.14M;
    public decimal Number2 { get; } = 7.92M;

    public MyRow(long data) => Data1 = data;
}

Dataflow code for batch insert:

var source = new RepeatRowsSource<MyRow>(aw, "Generate Rows"
    , numberOfRows, new MyRow(42)) { SendTemplateRows = true };

source.Output.Link.AdbInsertTarget("Insert Rows", cmca => cmca.AutoName()
    , connectionString.CreateConnectionBuilder(), tableName);
source.SetRowLimits(rowsPerBatch, rowsPerTransaction); // Optional

Dataflow target for MySQL bulk insert (replacing AdbInsertTarget above):

source.Output.Link.AdbMySqlConnectorBulkInsertTarget("Insert Rows"
    , cmca => cmca.AutoName(), connectionString.CreateConnectionBuilder()
    , tableName);

Dataflow target for SQL Server bulk insert (replacing AdbInsertTarget above):

source.Output.Link.AdbSqlClientBulkInsertTarget("Insert Rows"
    , cmca => cmca.AutoName(), connectionString.CreateConnectionBuilder()
    , SqlBulkCopyOptions.TableLock, tableName);

Enhanced Database Support with New actionETL Release

Database

The new 0.38.0 release brings enhanced database support across all database providers, for both on-premises and cloud deployments.

Check out all the actionETL features and try it out today with the Free 30-day Trial!

PostgreSQL Support Now Added to actionETL .NET ETL library


PostgreSQL® is a hugely popular and capable database engine and is now supported by actionETL via a dedicated provider.

PostgreSQL uses a very wide set of data types, and actionETL has excellent support for both its provider-independent (Boolean, Double, String…) and provider-specific (NpgsqlDbType.Circle, NpgsqlDbType.Point…) types.

actionETL wraps the official PostgreSQL .NET provider – Npgsql. Check out the details in our PostgreSQL documentation.
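For illustration, here is a minimal sketch using the wrapped Npgsql provider directly (rather than the actionETL workers) to insert a provider-specific NpgsqlDbType.Point value; the places table and connection string are hypothetical:

using Npgsql;
using NpgsqlTypes;

const string connectionString = "Host=localhost;Database=demo;Username=demo;Password=demo";

using var connection = new NpgsqlConnection(connectionString);
connection.Open();

using var command = new NpgsqlCommand(
    "INSERT INTO places (name, location) VALUES (@name, @location)", connection);
command.Parameters.AddWithValue("name", "Origin");
// Provider-specific type: a PostgreSQL point column.
command.Parameters.Add(new NpgsqlParameter("location", NpgsqlDbType.Point)
{
    Value = new NpgsqlPoint(0, 0)
});
command.ExecuteNonQuery();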

We now have dedicated database providers for MariaDB™, MySQL™, PostgreSQL®, SQLite, and SQL Server®.

Other databases are accessed via the ODBC provider.

SQLite Support Now Added to actionETL


SQLite is the most used database engine in the world and is now supported by actionETL. With SQLite being a fast and installation-free local database in the public domain, it can be a great tool to combine with actionETL.

You might use SQLite as:

  • Your main (local) database
  • A complement to a traditional database server, e.g. to offload work from an expensive host, and to avoid the network performance overhead
  • Temporary SQL processing, e.g. to:
    • Reduce memory consumption by performing a large sort using the disk-backed SQLite
    • Execute queries where the SQL set-based approach is a better fit than the dataflow row-by-row approach (and where using LINQ is not appropriate)

actionETL wraps the official SQLite .NET provider – System.Data.SQLite. Check out the details in our SQLite documentation, as well as the full list of supported databases.
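As a hedged sketch of the temporary SQL processing idea above, here is System.Data.SQLite used directly (not via the actionETL workers) as a local, disk-backed scratch database; the scratch.db file and Scratch table are hypothetical:

using System.Data.SQLite;

using var connection = new SQLiteConnection("Data Source=scratch.db");
connection.Open();

using (var create = new SQLiteCommand(
    "CREATE TABLE IF NOT EXISTS Scratch (Id INTEGER, Payload TEXT)", connection))
{
    create.ExecuteNonQuery();
}

using (var insert = new SQLiteCommand(
    "INSERT INTO Scratch (Id, Payload) VALUES (@id, @payload)", connection))
{
    insert.Parameters.AddWithValue("@id", 1);
    insert.Parameters.AddWithValue("@payload", "example row");
    insert.ExecuteNonQuery();
}

// Let SQLite do the set-based work (here a simple ordered read-back).
using var query = new SQLiteCommand(
    "SELECT Id, Payload FROM Scratch ORDER BY Id", connection);
using var reader = query.ExecuteReader();
while (reader.Read())
{
    System.Console.WriteLine($"{reader.GetInt64(0)}: {reader.GetString(1)}");
}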

Reduce .NET ETL Code Size by 23 Times with actionETL

In a fully documented example, actionETL required only 9kB of code to create a high performance and reusable custom Slowly Changing Dimension (SCD) worker, 23 times less than the 209kB used by a SQL Server® Integration Services (SSIS) implementation with similar functionality. What caused this stark difference?

Modern AppDev Techniques

The actionETL library is designed from the ground up to make ETL development very productive. By using well-known application development techniques, it provides excellent:

  • Reusability
  • Composability
  • Encapsulation
  • Testability
  • Extensibility
  • Refactoring
  • Source control and Continuous Integration/Continuous Delivery

In the SCD example, actionETL composability pays a huge dividend, where existing ‘control flow’ and dataflow workers are easily combined to create a new high performance and reusable custom worker.

Unlike with SSIS, actionETL ‘control flow’ and dataflow workers can be freely mixed and matched, and can also be used to create new custom workers.

Visual Designer

In contrast, SSIS cannot use existing control flow tasks or dataflow components when creating new tasks or components, not even via C++, and must therefore implement all required functionality from scratch. Most SSIS custom tasks and components also require significant UI code. Both aspects heavily inflate the amount of code that must be written and maintained.

Virtually all traditional ETL tools have the same heavy focus on their drag&drop visual designer as SSIS has. While this certainly helps in some ways, like initially getting up to speed on the tool, they pay a very heavy price in terms of poor support for some or all of the above modern AppDev traits.

Whitepaper

If you want to learn more about how actionETL compares to SSIS and traditional ETL tools, check out the Thirteen Factors Crippling ETL Productivity whitepaper.

SMO + LINQ for Entity Framework Code First Database Modifications

Database Aspects of Code First

Using Entity Framework (EF) 4.1 with the Code First approach seems like a good choice for my greenfield Web App – during development I can focus on the Web and C# .Net paradigms, and postpone the bulk of the database work until such time as my application and data model have stabilised. That said, there are several database aspects I must still address up front, including:

  • Decide and implement the high-level mapping between .Net classes and database tables. Morteza Manavi (blog | twitter) has a great series on this.
  • Set field properties such as string length, required vs. optional, etc., which will get reflected in the database (illustrated below).
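For illustration, such field properties might be set with Data Annotations like this (a hypothetical MyTable entity; the equivalent can also be configured with the Fluent API):

using System.ComponentModel.DataAnnotations;

public class MyTable
{
    public int Id { get; set; }

    [Required]            // generates a NOT NULL column
    [StringLength(100)]   // nvarchar(100) instead of nvarchar(max)
    public string Name { get; set; }

    public string Description { get; set; }   // optional, nullable column
}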

Since I plan to continue using Code First beyond the initial project setup, populating with test data, etc., it becomes desirable to make further changes to the database beyond what EF Data Annotations and the Fluent API support natively. These changes are usually put in the Seed() method of the inherited DbContext class; this method gets executed immediately after the database has been created.

EF 4.1 does not support adding unique keys, so I create them with a call to ExecuteSqlCommand() as follows:

context.Database.ExecuteSqlCommand(@"
  ALTER TABLE [WebApp].[MyTable] ADD CONSTRAINT 
    [UK_MyTable_Name] UNIQUE(Name); 
  ALTER TABLE [WebApp].[MySecondTable] ADD CONSTRAINT 
    [UK_MySecondTable_Name] UNIQUE(Name); 
  ");

For straightforward SQL I prefer the above approach, since it will allow me to later (if and when the database takes on a life of its own) add the SQL text to the database creation script generated by EF.

Key Names

While foreign keys generated by EF have names that can also be overridden, I notice that EF has not given the primary keys any names, and there is also no EF way of setting their names. SQL Server in turn auto generates primary key names such as PK__MyTable__0376AC45EB0148EF. Apart from the ugly GUID in the name, the MyTable part unfortunately gets truncated at 8 characters, so I decide to add proper names to the primary keys.

Ladislav Mrnka (stackoverflow) was kind enough to point me to his sample code for dropping and recreating primary keys using SQL. Since a database “with a life of its own” would not need this drop-create SQL code, I prefer in this case to stay in .Net land and use SQL Server Management Objects (SMO) for implementing the changes:

SMO + LINQ + EF = True

Using SMO requires adding a few references to the project as described here. Discovering and altering primary keys is quite straightforward, and I elect to use Language-Integrated Query (LINQ) instead of nested foreach statements, which makes for nicely concise code. Here’s the complete Seed() method:

// Seed() method from the MyDbContext : DbContext class
public void Seed(WebAppEntities context)
{
  // Add unique keys
  context.Database.ExecuteSqlCommand(@"
    ALTER TABLE [WebApp].[MyTable] ADD CONSTRAINT
      [UK_MyTable_Name] UNIQUE(Name);
    ALTER TABLE [WebApp].[MySecondTable] ADD CONSTRAINT
      [UK_MySecondTable_Name] UNIQUE(Name);
    ");


  // Get a Server object from the existing connection
  var server = new Microsoft.SqlServer.Management.Smo.Server(
    new Microsoft.SqlServer.Management.Common.ServerConnection
      (context.Database.Connection as System.Data.SqlClient.SqlConnection));
  // Get a database object. Qualify names to avoid EF name clashes.
  Microsoft.SqlServer.Management.Smo.Database database =
    server.Databases[context.Database.Connection.Database];

  // Rename auto generated primary key names
  (
    from Microsoft.SqlServer.Management.Smo.Table table in database.Tables
    where table.Schema == "WebApp"
    from Microsoft.SqlServer.Management.Smo.Index index in table.Indexes
    where index.IndexKeyType ==
      Microsoft.SqlServer.Management.Smo.IndexKeyType.DriPrimaryKey
    // Create pairs of a name string and an Index object
    select new { table.Name, index }
  // ToList() separates reading the object model above from modifying it below
  ).ToList()
  // Use extension and anonymous methods to rename the primary keys
  .ForEach(primaryKey =>
    {
      // Name primary keys "PK_TableName"
      primaryKey.index.Rename(
        "PK_" + primaryKey.Name.Replace("[", "").Replace("]", ""));
      primaryKey.index.Alter();
    });
}

Summing Up

When my project has more .Net coding than database focus, I often prefer using the power, simplicity and IntelliSense of SMO to perform my database manipulations; maybe you’ll consider doing the same ?-)

GOTCHA: SQL Server changes query plan without changing plan_handle

Or how to efficiently get at old query plans

Performance Tuning Mission

While on a SQL Server 2008 Data Warehouse performance tuning mission, this aspect (or undocumented gotcha as it felt like at the time) of the query plan Dynamic Management Functions caused some head-scratching, until I sussed that:

A query plan can change without its plan_handle changing

Or, putting it in Data Warehouse speak:

query_plan in sys.dm_exec_query_plan(plan_handle) is a Type 1 slowly changing attribute with respect to the plan_handle key, which means query_plan changes overwrite history for the same plan_handle

This mattered since I was tracking down intermittent but significant slow-downs of the Data Warehouse load by sampling query plans for monitoring purposes – I suspected the culprit was intermittent query plan changes which meant that:

  • Detecting when and how the query plans had changed was critical
  • The intermittent nature of the symptoms meant that actual (as opposed to estimated) query plans were needed and that monitoring had to run for several days at a time
  • Due to the storage size for query plans, and the desire to capture ‘all’ query plans, deduplicating the query plans (i.e. only storing changes) during sampling was required
    • E.g. tracing ‘all’ query plans and detecting changes during analysis would have taken too much storage space, and made for very slow analysis

I needed to detect query plan changes, but as stated above, using plan_handle changes alone to detect query_plan changes does not work. Let’s demonstrate this.

T-SQL Script and Results

  • Create a test table with an index in tempdb
  • SELECT from the empty table
  • Display the plan_handle for the SELECT that just executed
  • INSERT 100000 records into the table
  • Run exactly the same SELECT statement as above from the now populated table
  • Display the plan_handle for the second SELECT that just executed

Do enable “Include Actual Execution Plan” before running the script.

USE tempdb
GO

CREATE TABLE dbo.TestQueryPlan
(
	Id INT IDENTITY(1,1),
	Pad CHAR(1000)
)
GO

CREATE INDEX IX_TestQueryPlan_Id
	ON dbo.TestQueryPlan (Id)
GO

SELECT Id, Pad
FROM dbo.TestQueryPlan
WHERE Id <= 1
GO

SELECT plan_handle, text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle)
WHERE text LIKE '%TestQueryPlan%WHERE Id%'
	AND text NOT LIKE '%CROSS APPLY%'
GO

INSERT INTO dbo.TestQueryPlan (Pad)
SELECT TOP 100000 'Pad me'
FROM master.sys.columns a
CROSS APPLY master.sys.columns b
GO

SELECT Id, Pad
FROM dbo.TestQueryPlan
WHERE Id <= 1
GO

SELECT plan_handle, text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle)
WHERE text LIKE '%TestQueryPlan%WHERE Id%'
	AND text NOT LIKE '%CROSS APPLY%'
GO

DROP TABLE dbo.TestQueryPlan
GO

Here are the query plans for the two identical SELECT statements:

Simple query plan before insert

Query plan changed after insert

The query plans are quite different due to running SELECT on an empty vs. a populated table. Next up are the plan_handles:

Identical plan_handles

We see that the plan_handles are identical, despite the query plans above being quite different, and therein lies the gotcha – I need a better way of detecting query plan changes.

Also note that while the script uses ad-hoc SELECT queries, the same effect also applies to stored procedures.

BOL?

One thing that had led me astray was BOL in several places saying that plan_handle is unique, e.g. sys.dm_exec_query_plan notes that:

Each cached query plan is identified by a unique identifier called a plan handle.

and

plan_handle Uniquely identifies a query plan for a batch that is cached or is currently executing.

I initially took uniqueness to mean that if a query plan changes, its plan_handle will also change, which as shown above is not the case.

One way to look at it is that the BOL definition of plan_handle uniqueness:

  • Only applies to currently cached or executing query plans
  • Does not apply to plans that have been dropped from the cache
  • Does not apply to plans that have finished executing and were never stored in the cache

Pros and Cons

Looking at plan_handle from other perspectives I can certainly see reasons why it is implemented this way:

  • When a statement is recompiled and the query plan ends up different, SQL Server
    • Has no use for the old plan and throws it away
    • Does not have to update anything that pointed to the old plan using the plan_handle, since the plan_handle stays the same
  • A plan_handle encodes many things (SET options, objectid…) within its 64 bytes, so it’s not an arbitrary identifier (see sys.dm_exec_plan_attributes)

On the flip side, it’s a bit of a shame that despite a 64-byte ostensibly unique plan_handle, and the query_plan_hash from sys.dm_exec_query_stats (which is designed to generate duplicates), there is no obvious and fast way of knowing if a query plan has changed or not. Tracing all query (re)compiles would provide the necessary data, but lightweight it is not.

A second aspect of this design is that there is no way to get hold of an old query plan from DMFs etc. after it has changed, making troubleshooting harder unless of course you were explicitly capturing query plans.

Way Forward

Realising the above, I calculated, stored, and compared my own hash of the query_plan attribute, i.e. hashing the full XML text. This guaranteed detecting even the smallest query plan change, and furthermore I only had to process the often large captured query_plans once, with subsequent comparisons done on the computed hashes.

Hashing was fairly easy since I already had a CLR hashing stored procedure that could handle large inputs.
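As a minimal sketch (not the exact CLR routine I used), a SQL CLR function along these lines can hash the full query_plan XML text, so that change detection compares small hashes instead of megabytes of XML:

using System.Data.SqlTypes;
using System.Security.Cryptography;
using System.Text;
using Microsoft.SqlServer.Server;

public static class QueryPlanHashing
{
    [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
    public static SqlBinary HashQueryPlan(SqlXml queryPlan)
    {
        if (queryPlan == null || queryPlan.IsNull)
            return SqlBinary.Null;

        using (var sha256 = SHA256.Create())
        {
            // Hash the full XML text; any plan change produces a new hash.
            byte[] planBytes = Encoding.Unicode.GetBytes(queryPlan.Value);
            return new SqlBinary(sha256.ComputeHash(planBytes));
        }
    }
}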

Given thousands of query plans though, sometimes megabytes in size, it would still be preferable to have a more direct way of detecting query plan changes than hashing the full text or running a continuous trace – do let me know if you have any ideas on this.

query_plan hashing worked well for my requirements and sampling facility; other approaches could also work well, especially for ad-hoc investigations (as opposed to longer-term monitoring) where the size of the logged data would not have time to become prohibitively large, and the impact to the server could be kept short, including:

  • Periodically dump the full query plans with context into tables, and check for changes (i.e. deduplicate) during analysis instead of during sampling
  • Use SQL Profiler or a Server-side trace to capture query plans using e.g. the Showplan XML Statistics Profile event, and again check for changes during analysis. Note that especially SQL Profiler “can have a significant performance overhead”
  • Use new Extended Events in SQL Server Denali that include the actual query plan, but like tracing they carry the “can have a significant performance overhead” caveat

In Conclusion

  • Long term logging of actual query plans is very useful for troubleshooting intermittent and unexpected query plan changes
  • Do take monitoring duration, storage size and impact on server into account when selecting how to collect query plans (i.e. Management Studio, SQL Profiler, Server side trace, Deduplicated sampling as described here…)
  • Be aware that query plans change without their associated plan_handle changing, so use the full query_plan XML text to detect query plan changes, and
  • Deduplicating query plans during sampling dramatically reduces storage space and simplifies analysis; do however use your own hash of the query_plan field to reduce the CPU processing required to detect changes

Hope it helps!

Kristian

Previously on Kristian Wedberg: Data Warehouse and BI articles

Just to have things nicely gathered in a single place, I’ll kick my new blog off with a few links to my earlier blog & articles, where the content is still relevant:

SQL Server Central articles

Conchango / EMC Consulting blog

Update: With EMC subsumed by Dell, these articles are no more:

  • Database Snapshots and SSIS
  • SSIS lookup transform gotcha (verified on SQL Server 2005)
  • SSIS and “Cursor Operation Conflict” (verified on SQL Server 2005)

To be continued…