For the last couple of weeks I have been working on a NIF interface for DuckDB. The library has some rough edges and no documentation yet, but it is starting to work pretty nicely.
DuckDB is an in-process SQL OLAP database management system (https://duckdb.org). It looked like a nice addition to have, and I was interested to see how their C API with the appender interface would work out for adding bulk stats data. I compared it to ets and to their normal prepared statement interface, and was pleasantly surprised.
The benchmark: inserting 100_000 records with 10 integers into ets and retrieving them (with tab2list), versus inserting the same 100_000 records into an in-memory DuckDB database via the appender and retrieving them (with the new extract API):
2> educkdb_bench:appender().
100_000 inserts took: 43.119 milliseconds
Select of 98 chunks and 100000 rows took: 14.553 milliseconds
Compared to the same in-memory DuckDB database using the normal prepared statement interface:
3> educkdb_bench:main().
100_000 inserts took: 3870.924 milliseconds
Select of 98 chunks and 100000 rows took: 53.696 milliseconds
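For anyone curious what the appender flow looks like, here is a minimal sketch (three columns instead of ten for brevity). It assumes educkdb exposes functions that mirror DuckDB's C appender API (duckdb_appender_create, duckdb_append_int32, duckdb_appender_end_row, duckdb_appender_flush); the exact names and arities in educkdb may differ.

%% Sketch only: function names assume a one-to-one mapping onto
%% DuckDB's C appender API and may not match educkdb exactly.
{ok, Db} = educkdb:open(":memory:"),
{ok, Conn} = educkdb:connect(Db),
{ok, _} = educkdb:query(Conn, "create table stats(a integer, b integer, c integer);"),
{ok, Appender} = educkdb:appender_create(Conn, undefined, <<"stats">>),
lists:foreach(
    fun(I) ->
        ok = educkdb:append_int32(Appender, I),
        ok = educkdb:append_int32(Appender, I * 2),
        ok = educkdb:append_int32(Appender, I * 3),
        %% end_row finishes one record; nothing is visible until flush.
        ok = educkdb:appender_end_row(Appender)
    end,
    lists:seq(1, 100000)),
ok = educkdb:appender_flush(Appender).

The speed comes from skipping SQL parsing and binding entirely: values go straight into DuckDB's columnar buffers.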
At the moment all simple data types, such as the various integers, timestamps, varchars, and blobs, can be extracted. I'm currently working on extracting complex data types like lists, structs, and maps.
I have just recently been looking at DuckDB for some use cases and was very happy to find your project. At least one or two of those hex downloads are me. I'm primarily interested in using it to interact with parquet files; however, I didn't have any luck loading one using educkdb. Maybe the extension isn't present? I also tried :educkdb.query(conn, "INSTALL parquet"), but there appears to be an issue with duckdb's hosting of extensions. Have you seen similar, or do parquet files work out of the box? Any luck with the https extension?
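For reference, this is roughly the sequence I tried, written in Erlang-shell form (the parquet file name is a placeholder):

%% INSTALL is the step that needs to download the extension from
%% duckdb's extension host, and that is where it failed for me.
{ok, Db} = educkdb:open(":memory:"),
{ok, Conn} = educkdb:connect(Db),
{ok, _} = educkdb:query(Conn, "INSTALL parquet;"),
{ok, _} = educkdb:query(Conn, "LOAD parquet;"),
{ok, _} = educkdb:query(Conn, "SELECT * FROM read_parquet('data.parquet');").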
Anyways, glad to see all the progress you’ve made!
Don’t worry. It is nice to have somebody interested in it.
What would be good extensions to have enabled by default? I was thinking about FTS, Parquet, and JSON. Maybe the httpfs extension as well? I think I will make it configurable so users can enable and disable the extensions they want by setting an environment variable. Hopefully it will still build with GitHub Actions when those extensions are enabled by default; compiling takes quite a bit of main memory.
My use case is importing/exporting data on a Nerves device and importing/exporting data on a cloud server to S3, so the parquet and httpfs extensions are of interest to me.
Having the extensions configurable via an environment variable makes a lot of sense. Don’t know how much work that is, though.
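For the S3 side, the SQL I have in mind looks something like this. A sketch only: bucket, region, and credentials are placeholders, it assumes the httpfs and parquet extensions load successfully, and Conn is a connection obtained via educkdb:open/1 plus educkdb:connect/1.

{ok, _} = educkdb:query(Conn, "INSTALL httpfs;"),
{ok, _} = educkdb:query(Conn, "LOAD httpfs;"),
{ok, _} = educkdb:query(Conn, "SET s3_region='eu-west-1';"),
{ok, _} = educkdb:query(Conn, "SET s3_access_key_id='***';"),
{ok, _} = educkdb:query(Conn, "SET s3_secret_access_key='***';"),
%% Export a table to S3 as parquet; the bucket path is a placeholder.
{ok, _} = educkdb:query(Conn,
    "COPY (SELECT * FROM stats) TO 's3://my-bucket/stats.parquet' (FORMAT PARQUET);").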
That is already in the rebar.config file, I think. What is missing are defaults which make more sense. FTS, JSON, Parquet, and httpfs are probably sensible defaults.
I’ve asked the duckdb people how to build the extensions so they are included with the amalgamation build. The compile flags point to some .hpp files, but it was not clear how those .hpp files are generated.
There is currently no support for inlining the extensions into the amalgamation build.
There are some scripts which they use to collect the sources needed for their R and Python clients, though, and those include the extensions. So it will take a little more time than I expected. To be continued…
Thanks for putting in the link. Totally forgot to do that. (Oops)
The last release was built against DuckDB v0.7.0. The C API of DuckDB has been kept remarkably stable: I only had to change one thing, and that one thing was a workaround anyway.
FYI: DuckDB's appender API is really nice for storing statistics from a live system. It is really fast.
The parquet extension is downloaded to ~/.duckdb/extensions from extensions.duckdb.org. It is possible to load it from elsewhere, but this is definitely something to be aware of when you want to run this in production.
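A sketch of the two ways around the default download, assuming DuckDB's custom_extension_repository setting and direct LOAD-from-path work through educkdb:query/2 the same way they do in the DuckDB shell (paths are placeholders, Conn an already-open connection):

%% 1. Point INSTALL at a self-hosted mirror instead of extensions.duckdb.org.
{ok, _} = educkdb:query(Conn, "SET custom_extension_repository='http://extensions.example.com';"),
{ok, _} = educkdb:query(Conn, "INSTALL parquet;"),
{ok, _} = educkdb:query(Conn, "LOAD parquet;"),
%% 2. Or load a vendored extension file directly from disk.
{ok, _} = educkdb:query(Conn, "LOAD '/opt/app/duckdb/parquet.duckdb_extension';").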
Yes. Just like ClickHouse, DuckDB is a column-oriented database, which makes it better suited to analytical processing. And like SQLite it is an embedded database which you access directly via an API (no network). DuckDB has more datatypes than SQLite, and is also more strict.
For data ingestion there are normal SQL inserts; via extensions you can use parquet, csv, json, S3, etc. And there is an appender interface which you can use for fast bulk inserts.
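For instance, the SQL-based ingestion paths look roughly like this (a sketch: file names are placeholders, read_json_auto needs the json extension loaded, and Conn is an already-open connection):

%% Create a table straight from a CSV file (read_csv_auto infers the schema).
{ok, _} = educkdb:query(Conn,
    "CREATE TABLE readings AS SELECT * FROM read_csv_auto('data.csv');"),
%% Append more rows from a JSON file with a matching structure.
{ok, _} = educkdb:query(Conn,
    "INSERT INTO readings SELECT * FROM read_json_auto('more.json');").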
Yes, indeed: DuckDB has no over-the-wire protocol; it is intended to be used as an in-process database system. With Erlang you can of course run it on a dedicated DB node if you are worried about crashing your main node.
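A minimal sketch of that idea: run educkdb on a dedicated node and call it over distribution. The node name and the db_worker wrapper module are hypothetical.

%% On the main node; 'duckdb@dbhost' and db_worker are hypothetical.
%% A NIF crash then only takes down the dedicated DB node, not this one.
Rows = erpc:call('duckdb@dbhost', db_worker, query,
                 ["select count(*) from stats"]).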