I built a DuckDB extension to handle chemistry data without pandas or RDKit

DEV Community·nk_Enuke·about 1 month ago

Idea

While reproducing top solutions from a chemistry data competition, I started building a DuckDB community extension for handling chemistry data directly in SQL.

What it can do

- Parse SMILES, InChI, PDB and other chemistry formats directly, with no pandas and no RDKit on the side
- Plug into DuckDB's native CSV/Parquet/Iceberg/S3/HTTP readers, so ingestion and light preprocessing happen in one query

Background

What is chemistry data, anyway? One of the canonical forms is SMILES, a notation that encodes a molecular structure as a plain string. The standard library for reading and processing SMILES is RDKit. For example, ethanol is CCO in SMILES, and RDKit will give you the molecular formula C2H6O from it. RDKit doesn't stop there: it also covers molecular weight, fingerprints, descriptors, substructure search, and so on. It's basically the must-have library for chemistry data work. (Internally, the algorithms come from a bunch of different papers...)

So you might say: do we even need a new extension?…
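To make the CCO to C2H6O example from the Background concrete, here is a toy sketch in plain Python. It is not RDKit's algorithm and not the extension's code. The names `toy_formula`, `VALENCE`, `BOND_ORDER` and `TOKEN` are hypothetical, and the parser assumes a tiny SMILES subset (no rings, charges, bracket atoms, or aromatic atoms). It only shows the idea that hydrogens are implicit in SMILES and can be filled in from default valences.

```python
import re
from collections import Counter

# Default valences for a few organic-subset atoms (toy table, not exhaustive).
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}
TOKEN = re.compile(r"Cl|Br|[CNOF]|[-=#()]")


def toy_formula(smiles: str) -> str:
    """Return a Hill-order formula for a very small SMILES subset."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this toy parser: {smiles!r}")

    atoms = []       # element symbol of each atom, in order of appearance
    bonds_used = []  # sum of bond orders already attached to each atom
    stack = []       # atoms to return to when a branch closes
    prev = None      # index of the atom the next atom will bond to
    pending = 1      # bond order for the next bond (single by default)

    for tok in tokens:
        if tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()
        elif tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
        else:
            atoms.append(tok)
            bonds_used.append(0)
            idx = len(atoms) - 1
            if prev is not None:
                bonds_used[prev] += pending
                bonds_used[idx] += pending
            prev = idx
            pending = 1

    counts = Counter(atoms)
    # Implicit hydrogens fill whatever valence the explicit bonds left open.
    counts["H"] = sum(VALENCE[a] - b for a, b in zip(atoms, bonds_used))

    # Hill order: C, then H, then the rest alphabetically (all alphabetical if no C).
    if "C" in counts:
        order = ["C", "H"] + sorted(e for e in counts if e not in ("C", "H"))
    else:
        order = sorted(counts)
    return "".join(
        f"{e}{counts[e] if counts[e] > 1 else ''}" for e in order if counts[e] > 0
    )


print(toy_formula("CCO"))  # ethanol
```

For real work, a full toolkit such as RDKit handles rings, aromaticity, charges, and stereochemistry, all of which this sketch deliberately ignores.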
