Parquet

Apache Parquet is a columnar storage file format optimised for analytical workloads. It is widely used in data engineering and data science because it compresses well and delivers fast, column-oriented queries over large datasets.

Key Features

  • Columnar storage: Read only needed columns
  • Compression: Highly efficient (often 10x smaller than CSV)
  • Schema: Self-describing with embedded metadata (see the sketch after this list)
  • Predicate pushdown: Skip irrelevant data during reads
  • Nested data: Support for complex structures
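
Because the schema and per-row-group statistics live in the file footer, they can be inspected without scanning any data. A minimal sketch with pyarrow (the filename data.parquet is just a placeholder):

import pyarrow.parquet as pq

# Read only the embedded schema and footer metadata; no data pages are scanned
schema = pq.read_schema('data.parquet')
meta = pq.ParquetFile('data.parquet').metadata

print(schema)               # column names and types
print(meta.num_rows)        # row count taken from the footer
print(meta.num_row_groups)  # row groups carry the per-column min/max statistics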

Why Columnar Matters

When querying SELECT AVG(price) FROM sales:

Row format (CSV):  Read entire file
[date, product, price, quantity, customer, ...]

Column format: Read only price column
[price1, price2, price3, ...]

Result: analytical queries like this are often 10-100x faster, because only the bytes for the needed columns are read.
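
The same idea in code, as a small sketch with pyarrow (the filename sales.parquet and the price column are assumptions):

import pyarrow.parquet as pq
import pyarrow.compute as pc

# Only the 'price' column chunks are read from disk; all other columns are skipped
table = pq.read_table('sales.parquet', columns=['price'])
print(pc.mean(table['price']).as_py())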

Common Use Cases

  • Data lakes (S3 + Athena)
  • Data warehouses (Snowflake, BigQuery, Redshift)
  • Machine learning pipelines
  • ETL pipelines
  • Log analytics
  • Time-series data

Reading/Writing Parquet

Python (pandas/pyarrow)

import pandas as pd

# Example data
df = pd.DataFrame({'name': ['a', 'b'], 'price': [1.5, 2.0], 'quantity': [3, 4]})

# Write (Snappy is the default codec for the pyarrow engine)
df.to_parquet('data.parquet', compression='snappy')

# Read the whole file
df = pd.read_parquet('data.parquet')

# Read specific columns only (column pruning)
df = pd.read_parquet('data.parquet', columns=['name', 'price'])
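
Predicate pushdown is available through the same API when the pyarrow engine is used: row groups whose min/max statistics cannot match the filter are skipped. A sketch (the price column and threshold are assumptions):

# Filter rows while reading; row groups that cannot contain price > 100 are skipped
df = pd.read_parquet('data.parquet', filters=[('price', '>', 100)])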

JavaScript (parquet-wasm)

import { readParquet } from 'parquet-wasm';

// buffer is the raw Parquet file as a Uint8Array (here fetched over HTTP)
const buffer = new Uint8Array(await (await fetch('data.parquet')).arrayBuffer());
const data = await readParquet(buffer);

Parquet vs Other Formats

Format  | Type     | Compression | Query Speed      | Use Case
--------|----------|-------------|------------------|--------------
Parquet | Columnar | Excellent   | Fast (analytics) | Data lakes
ORC     | Columnar | Excellent   | Fast (Hive)      | Hadoop
CSV     | Row      | None        | Slow             | Data exchange
JSON    | Row      | None        | Slow             | APIs
Avro    | Row      | Good        | Fast (streaming) | Kafka

What We Like

  • Query performance: Orders of magnitude faster for analytics
  • Compression: Dramatically smaller files
  • Ecosystem: Supported everywhere (Spark, pandas, Athena, etc.)
  • Schema evolution: Add columns without rewriting
  • Partitioning: Natural fit for date-partitioned data

What We Don't Like

  • Not human-readable: Binary format requires tools
  • Write overhead: Slower to write than CSV
  • Small files: Overhead not worth it for tiny datasets
  • Row-level operations: Poor for single-row lookups

Best Practices

  1. Use for analytics: Not for transactional workloads
  2. Partition wisely: By date, region, or common filters (see the sketch after this list)
  3. Right-size files: 128 MB - 1 GB per file is ideal
  4. Use Snappy compression: Good balance of speed and ratio
  5. Include schema: Always embed metadata
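
A minimal sketch of a partitioned, Snappy-compressed write with pandas/pyarrow (the sales_partitioned/ directory and the year and month columns are assumptions):

import pandas as pd

df = pd.DataFrame({
    'year': [2024, 2024, 2025],
    'month': [1, 2, 1],
    'price': [9.99, 19.99, 4.99],
})

# Writes Hive-style directories such as sales_partitioned/year=2024/month=1/...
# so engines like Spark or Athena can prune partitions on filtered queries
df.to_parquet('sales_partitioned/', partition_cols=['year', 'month'], compression='snappy')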

AWS Integration

S3 (Parquet files) → Glue Data Catalog (schema) → Athena (SQL queries)

This combination provides a serverless, pay-per-query analytics platform.
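
As a sketch, pandas can write Parquet straight to S3 (this needs the s3fs package installed; the bucket name and layout below are assumptions), after which a Glue crawler or a CREATE EXTERNAL TABLE statement registers the schema and Athena queries the files in place:

import pandas as pd

df = pd.DataFrame({'event_date': ['2024-01-01'], 'price': [9.99]})

# Writes partitioned Parquet objects under s3://my-bucket/sales/
df.to_parquet('s3://my-bucket/sales/', partition_cols=['event_date'], compression='snappy')

# Once the table is registered in the Glue Data Catalog, Athena can run e.g.:
#   SELECT AVG(price) FROM sales WHERE event_date = '2024-01-01';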