News and Notes on the Structured Query Language·/u/komal_rajput·3 days ago

We're building a data pipeline that processes FEC (Federal Election Commission) financial filing data. Each filing contains a parent record and thousands of itemization rows (individual transactions). We're inserting these into PostgreSQL via an Airflow pipeline in batches.

Current schema (simplified):

    CREATE TABLE silver_fec_efiling_filings (
        id SERIAL PRIMARY KEY,
        filing_id VARCHAR UNIQUE,
        form_type VARCHAR,
        header_json JSONB,
        filing_json JSONB,
        created_at TIMESTAMPTZ
    );

    CREATE TABLE silver_fec_efiling_itemizations (
        id SERIAL PRIMARY KEY,
        efiling_id INTEGER REFERENCES silver_fec_efiling_filings(id),
        record_type VARCHAR(20),
        record_data JSONB,
        created_at TIMESTAMPTZ,
        UNIQUE (efiling_id, record_data)
    );

How we insert: We read .fec files in batches of 5,000 lines and use psycopg2's execute_values to bulk insert itemizations with ON CONFLICT (efiling_id, record_data) DO NOTHING for idempotency - the pipeline can be re-run and we don't want duplicates.…
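For readers who want to see the insert pattern described above in code, here is a rough sketch, not the poster's actual pipeline. The table, columns, 5,000-line batch size, and ON CONFLICT target come from the post. The helper names (`batched`, `to_rows`, `insert_batch`), the shape of the parsed records, and the use of `now()` for `created_at` are assumptions made for illustration.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Tuple

BATCH_SIZE = 5000  # batch size mentioned in the post

# Statement shape for psycopg2's execute_values: a single "VALUES %s" placeholder.
INSERT_SQL = """
    INSERT INTO silver_fec_efiling_itemizations
        (efiling_id, record_type, record_data, created_at)
    VALUES %s
    ON CONFLICT (efiling_id, record_data) DO NOTHING
"""


def batched(items: Iterable[Any], size: int = BATCH_SIZE) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def to_rows(
    efiling_id: int, records: Iterable[Tuple[str, Dict[str, Any]]]
) -> List[Tuple[int, str, str]]:
    """Turn already-parsed (record_type, data) pairs into insert tuples.

    How a .fec line becomes a (record_type, data) pair is not shown in the
    post, so parsing is assumed to happen before this step.
    """
    return [(efiling_id, rtype, json.dumps(data)) for rtype, data in records]


def insert_batch(cur: Any, rows: List[Tuple[int, str, str]]) -> None:
    """Bulk insert one batch; rows that hit the unique constraint are skipped."""
    if not rows:
        return
    # Imported here so the helpers above can be used without psycopg2 installed.
    from psycopg2.extras import execute_values

    execute_values(
        cur,
        INSERT_SQL,
        rows,
        # Cast the JSON text to jsonb; created_at = now() is an assumption.
        template="(%s, %s, %s::jsonb, now())",
        page_size=len(rows),  # send the whole batch as one statement
    )
```

Usage would look roughly like `for batch in batched(parsed_records): insert_batch(cur, to_rows(efiling_id, batch))`, followed by a commit on the connection, where `parsed_records` and `efiling_id` come from whatever parsing and parent-row lookup the pipeline already does.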
