For the last couple of weeks I have been working on a NIF interface for DuckDB. The library has some rough edges and no documentation yet, but it is starting to work pretty nicely.
DuckDB is an in-process SQL OLAP database management system (https://duckdb.org). It looked like a nice addition to have, and I was interested to see how their C API with the appender interface would work out for adding bulk stats data. I compared it to ets and to their normal prepared statement interface, and was pleasantly surprised.
The benchmark: inserting 100_000 records with 10 integers into ets and retrieving them (with tab2list), versus inserting the same 100_000 records into an in-memory DuckDB database via the appender and retrieving them (with the new extract API):
2> educkdb_bench:appender().
100_000 inserts took: 43.119 milliseconds
Select of 98 chunks and 100000 rows took: 14.553 milliseconds
Compared to the same in-memory DuckDB database using the normal prepared statement interface:
3> educkdb_bench:main().
100_000 inserts took: 3870.924 milliseconds
Select of 98 chunks and 100000 rows took: 53.696 milliseconds
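For anyone curious what the appender flow looks like, here is a minimal sketch (three columns instead of ten for brevity). It assumes educkdb exposes functions that mirror DuckDB's C appender API (duckdb_appender_create, duckdb_append_int32, duckdb_appender_end_row, duckdb_appender_flush); the exact names and arities in educkdb may differ.

%% Sketch only: function names assume a one-to-one mapping onto
%% DuckDB's C appender API and may not match educkdb exactly.
{ok, Db} = educkdb:open(":memory:"),
{ok, Conn} = educkdb:connect(Db),
{ok, _} = educkdb:query(Conn, "create table stats(a integer, b integer, c integer);"),
{ok, Appender} = educkdb:appender_create(Conn, undefined, <<"stats">>),
lists:foreach(
    fun(I) ->
        ok = educkdb:append_int32(Appender, I),
        ok = educkdb:append_int32(Appender, I * 2),
        ok = educkdb:append_int32(Appender, I * 3),
        %% end_row finishes one record; nothing is visible until flush.
        ok = educkdb:appender_end_row(Appender)
    end,
    lists:seq(1, 100000)),
ok = educkdb:appender_flush(Appender).

The speed comes from skipping SQL parsing and binding entirely: values go straight into DuckDB's columnar buffers.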
At the moment all simple data types, such as the various integers, timestamps, varchars, and blobs, can be extracted. I'm currently working on extracting complex data types like lists, structs, and maps.
I have just recently been looking at DuckDB for some use cases and was very happy to find your project. At least one or two of those hex downloads are me. I'm primarily interested in using it to interact with parquet files; however, I didn't have any luck loading one using educkdb. Maybe the extension isn't present? I also tried :educkdb.query(conn, "INSTALL parquet"), but there appears to be an issue with duckdb's hosting of extensions. Have you seen similar, or do parquet files work out of the box? Any luck with the https extension?
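For reference, this is roughly the sequence I tried, written in Erlang-shell form (the parquet file name is a placeholder):

%% INSTALL is the step that needs to download the extension from
%% duckdb's extension host, and that is where it failed for me.
{ok, Db} = educkdb:open(":memory:"),
{ok, Conn} = educkdb:connect(Db),
{ok, _} = educkdb:query(Conn, "INSTALL parquet;"),
{ok, _} = educkdb:query(Conn, "LOAD parquet;"),
{ok, _} = educkdb:query(Conn, "SELECT * FROM read_parquet('data.parquet');").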
Anyways, glad to see all the progress you’ve made!
Don’t worry. It is nice to have somebody interested in it.
What would be good extensions to have enabled by default? I was thinking about FTS, Parquet, and JSON. Maybe the httpfs extension as well? I think I will make it configurable so users can enable and disable the extensions they want by setting an environment variable. Hopefully it will still build with GitHub Actions when those extensions are enabled by default; compiling takes quite a bit of main memory.
My use case is importing/exporting data on a Nerves device and importing/exporting data on a cloud server to S3, so the parquet and httpfs extensions are of interest to me.
Having the extensions configurable via an environment variable makes a lot of sense. Don’t know how much work that is, though.
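For the S3 side, the SQL I have in mind looks something like this. A sketch only: bucket, region, and credentials are placeholders, it assumes the httpfs and parquet extensions load successfully, and Conn is a connection obtained via educkdb:open/1 plus educkdb:connect/1.

{ok, _} = educkdb:query(Conn, "INSTALL httpfs;"),
{ok, _} = educkdb:query(Conn, "LOAD httpfs;"),
{ok, _} = educkdb:query(Conn, "SET s3_region='eu-west-1';"),
{ok, _} = educkdb:query(Conn, "SET s3_access_key_id='***';"),
{ok, _} = educkdb:query(Conn, "SET s3_secret_access_key='***';"),
%% Export a table to S3 as parquet; the bucket path is a placeholder.
{ok, _} = educkdb:query(Conn,
    "COPY (SELECT * FROM stats) TO 's3://my-bucket/stats.parquet' (FORMAT PARQUET);").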
That is already in the rebar.config file, I think. What is missing are defaults which make more sense. FTS, JSON, Parquet, and httpfs are probably sensible defaults.
I’ve asked the duckdb people how to build the extensions so they are included with the amalgamation build. The compile flags point to some .hpp files, but it was not clear how those .hpp files are generated.
There is currently no support for inlining the extensions into the amalgamation build.
There are some scripts which they use to collect the sources needed for their R and Python clients, though, and those include the extensions. So it will take a little more time than I expected. To be continued…
Thanks for putting in the link. Totally forgot to do that. (Oops)
The last release was built against DuckDB v0.7.0. The C API of DuckDB has been kept remarkably stable: I only had to change one thing, and that one thing was a workaround anyway.
FYI: DuckDB's appender API is really nice for storing statistics from a live system. It is really fast.
The parquet extension is downloaded to ~/.duckdb/extensions from extensions.duckdb.org. It is possible to load it from elsewhere, but this is definitely something to be aware of when you want to run this in production.
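A sketch of the two ways around the default download, assuming DuckDB's custom_extension_repository setting and direct LOAD-from-path work through educkdb:query/2 the same way they do in the DuckDB shell (paths are placeholders, Conn an already-open connection):

%% 1. Point INSTALL at a self-hosted mirror instead of extensions.duckdb.org.
{ok, _} = educkdb:query(Conn, "SET custom_extension_repository='http://extensions.example.com';"),
{ok, _} = educkdb:query(Conn, "INSTALL parquet;"),
{ok, _} = educkdb:query(Conn, "LOAD parquet;"),
%% 2. Or load a vendored extension file directly from disk.
{ok, _} = educkdb:query(Conn, "LOAD '/opt/app/duckdb/parquet.duckdb_extension';").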
Yes. Just like ClickHouse, DuckDB is a column-oriented database, which makes it better suited to analytical processing. And like SQLite it is an embedded database which you access directly via an API (no network). DuckDB has more datatypes than SQLite, and is also more strict.
For data ingestion there are normal SQL inserts; via extensions you can use parquet, csv, json, S3, etc. And there is an appender interface which you can use for fast bulk inserts.
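For instance, the SQL-based ingestion paths look roughly like this (a sketch: file names are placeholders, read_json_auto needs the json extension loaded, and Conn is an already-open connection):

%% Create a table straight from a CSV file (read_csv_auto infers the schema).
{ok, _} = educkdb:query(Conn,
    "CREATE TABLE readings AS SELECT * FROM read_csv_auto('data.csv');"),
%% Append more rows from a JSON file with a matching structure.
{ok, _} = educkdb:query(Conn,
    "INSERT INTO readings SELECT * FROM read_json_auto('more.json');").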
Yes, indeed: DuckDB has no over-the-wire protocol; it is intended to be used as an in-process database system. With Erlang you can of course run it on a dedicated DB node if you are worried about crashing your main node.
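A minimal sketch of that idea: run educkdb on a dedicated node and call it over distribution. The node name and the db_worker wrapper module are hypothetical.

%% On the main node; 'duckdb@dbhost' and db_worker are hypothetical.
%% A NIF crash then only takes down the dedicated DB node, not this one.
Rows = erpc:call('duckdb@dbhost', db_worker, query,
                 ["select count(*) from stats"]).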