Japanese


Welcome to the pg_bulkload Project Home Page


pg_bulkload is a high speed data loading utility for PostgreSQL. Here is the pgFoundry page, where you can find downloads, documentation, bug reports, mailing lists, and a whole lot more.

日本語ページはこちら


Documentation


Performance Results

Here are comparisons between COPY and pg_bulkload. Performance was measured with basic-tuned PostgreSQL server.

Table definition
Customer table in DBT-2 benchmark, that is an implementation of TPC-C.
Index definition
There are 2 indexes. The first one is a primary key with one ascending integer column. The second one is a non-unique index with one random integer column.

There are the following measurement patterns.

  1. Initial data loading to an empty table for 4GB of data
  2. Appended data loading to a table with 4GB of data for 1GB of new data
  3. Performance efficiencies by PARALLEL and FILTER features

Result 1: Initial data loading

COPY has better performance on initial data loading if the destination table is TRUNCATEd in the same transaction with COPY. In addition, performance of COPY is improved when the data are loaded without indexes and create indexes after the loading. However, pg_bulkload has completed in 85% of loading time compared with the tuned COPY.

Initial: TRUNCATE+COPY vs pg_bulkload
Item version (postgres + bulkload)
8.4.4 + 2.4 9.0b2 + 3.0
COPY with indexes 1133.4 sec 1105.8 sec
COPY without indexes
+ CREATE INDEX
717.9 sec 705.3 sec
pg_bulkload (DIRECT)
with indexes
603.2 sec 598.9 sec
Duration comparison 84.0 % 84.9 %

Result 2: Appended data loading

The TRUNCATE hack is not available on appended data loading. Also, creating indexes after loading is not always faster than loading with indexes. Therefore, pg_bulkload has completed in 35% of loading time compared with COPY on appended data loading.

Appended: COPY vs pg_bulkload
Item version (postgres + bulkload)
8.4.4 + 2.4 9.0b2 + 3.0
COPY with indexes 520.4 sec 549.3 sec
COPY without indexes
+ CREATE INDEX
805.3 sec 799.6 sec
pg_bulkload (DIRECT)
with indexes
185.2 sec 191.7 sec
Duration comparison 35.6 % 34.9 %

Result 3: PARALLEL and FILTER features

The parallel loader is used when WRITER = PARALLEL is specified. Performance would be improved on multi-CPU server because reading an input file and writing rows to a table are done with two processes. The actual measurement shows the loading time is shortened to 70%.

FILTER feature transforms input data in various operations, but it's not free. The actual measurement shows the loading time is increased to 250-300% with SQL functions and 150% with C functions.

PARALLEL and FILTER features
Item Initial (4GB) Appended (1GB)
pg_bulkload (DIRECT) 598.9 sec 191.7 sec
pg_bulkload (PARALLEL) 413.5 sec 133.0 sec
Duration comparison 69.0 % 69.4 %
pg_bulkload (SQL-FILTER) 1813.9 sec 484.6 sec
Duration comparison 302.9 % 252.7 %
pg_bulkload (C-FILTER) 918.4 sec 263.7 sec
Duration comparison 153.3 % 137.6 %

PostgreSQL 9.0b2 + pg_bulkload 3.0b1 was used for all measurements, with indexes.

Conditions

ItemValue
ServerDell PowerEdge 1900
CPUDual Core Xeon 5050 (3.0GHz)
Hyper-Threadingoff
Memory2GB
Storage SubsystemDell PowerVault 221S
DisksSCSI 7x146GB (RAID 0)
RAID ControllerPERC 4e/DC DRAM=128MB
OSCentOS 5.5 (64bit)
shared_buffers256MB
checkpoint_segments300
checkpoint_timeout5min
Table definitionDBT-2 customer table
Indexed columnsc_id (PRIMARY KEY)
c_d_id (non-unique B-Tree)
ConstraintsNOT NULL for all columns
Input file formatCSV