Parquet
Apache Parquet is a columnar storage file format optimised for analytical workloads. It's widely used in data engineering and data science for its excellent compression and query performance with large datasets.
Key Features
- Columnar storage: Read only needed columns
- Compression: Highly efficient (often 10x smaller than CSV)
- Schema: Self-describing with embedded metadata
- Predicate pushdown: Skip irrelevant data during reads (see the sketch after this list)
- Nested data: Support for complex structures
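To illustrate predicate pushdown, here is a minimal pandas sketch (the sales.parquet file and the price filter are hypothetical). With the pyarrow engine, the filter is handed to the Parquet reader, which can skip row groups whose column statistics cannot match it:
import pandas as pd
# Only row groups whose min/max statistics can satisfy the filter are read.
df = pd.read_parquet(
    'sales.parquet',                  # hypothetical file
    engine='pyarrow',
    filters=[('price', '>', 100.0)],  # pushed down to the Parquet reader
)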
Why Columnar Matters
When querying SELECT AVG(price) FROM sales:
Row format (CSV): Read the entire file
[date, product, price, quantity, customer, ...]
Column format: Read only the price column
[price1, price2, price3, ...]
Result: analytical queries like this are often 10-100x faster than on row formats, as in the sketch below.
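A minimal pyarrow sketch of that query, assuming a hypothetical sales.parquet file; only the price column is read from disk:
import pyarrow.parquet as pq
import pyarrow.compute as pc
# Equivalent of SELECT AVG(price) FROM sales: load only the price column.
table = pq.read_table('sales.parquet', columns=['price'])
avg_price = pc.mean(table['price']).as_py()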
Common Use Cases
- Data lakes (S3 + Athena)
- Data warehouses (Snowflake, BigQuery, Redshift)
- Machine learning pipelines
- ETL pipelines
- Log analytics
- Time-series data
Reading/Writing Parquet
Python (pandas/pyarrow)
import pandas as pd
df = pd.DataFrame({'name': ['widget', 'gadget'], 'price': [9.99, 19.99]})
# Write (snappy is pandas' default compression codec)
df.to_parquet('data.parquet', compression='snappy')
# Read
df = pd.read_parquet('data.parquet')
# Read specific columns
df = pd.read_parquet('data.parquet', columns=['name', 'price'])
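Because the schema and per-column statistics live in the file footer, you can inspect them without reading any data. A small pyarrow sketch against the data.parquet file written above:
import pyarrow.parquet as pq
pf = pq.ParquetFile('data.parquet')
print(pf.schema_arrow)             # embedded, self-describing schema
print(pf.metadata.num_row_groups)  # row groups are the unit of predicate pushdown
print(pf.metadata.row_group(0).column(0).statistics)  # per-column min/max statistics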
JavaScript (parquet-wasm)
import { readParquet } from 'parquet-wasm';
const data = await readParquet(buffer);
Parquet vs Other Formats
| Format | Type | Compression | Query Speed | Use Case |
|---|---|---|---|---|
| Parquet | Columnar | Excellent | Fast (analytics) | Data lakes |
| ORC | Columnar | Excellent | Fast (Hive) | Hadoop |
| CSV | Row | None | Slow | Data exchange |
| JSON | Row | None | Slow | APIs |
| Avro | Row | Good | Slow (row-oriented) | Kafka, streaming |
What We Like
- Query performance: Orders of magnitude faster for analytics
- Compression: Dramatically smaller files
- Ecosystem: Supported everywhere (Spark, pandas, Athena, etc.)
- Schema evolution: Add columns without rewriting existing files (see the sketch after this list)
- Partitioning: Natural fit for date-partitioned data
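For schema evolution, one option is pyarrow's dataset API with an explicit unified schema. This is a sketch with assumed file and column names, where the older file was written before the discount column existed:
import pyarrow as pa
import pyarrow.dataset as ds
schema = pa.schema([
    ('product', pa.string()),
    ('price', pa.float64()),
    ('discount', pa.float64()),  # added later; sales_2023.parquet lacks it
])
dataset = ds.dataset(['sales_2023.parquet', 'sales_2024.parquet'],
                     schema=schema, format='parquet')
table = dataset.to_table()  # columns missing from a file are filled with nulls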
What We Don't Like
- Not human-readable: Binary format requires tools
- Write overhead: Slower to write than CSV
- Small files: Overhead not worth it for tiny datasets
- Row-level operations: Poor for single-row lookups
Best Practices
- Use for analytics: Not for transactional workloads
- Partition wisely: By date, region, or common filters (see the sketch after this list)
- Right-size files: 128 MB - 1 GB per file is ideal
- Use Snappy compression: Good balance of speed and ratio
- Keep the schema deliberate: Parquet embeds its schema automatically, so write columns with explicit types (dates as dates, not strings) rather than relying on inference
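Putting the partitioning and compression advice together, a sketch with a hypothetical sales DataFrame; pandas (via pyarrow) writes one Hive-style subdirectory per partition value:
import pandas as pd
df = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'product': ['widget', 'gadget', 'widget'],
    'price': [9.99, 19.99, 9.99],
})
# Produces sales/date=2024-01-01/... and sales/date=2024-01-02/...
df.to_parquet('sales/', engine='pyarrow', compression='snappy', partition_cols=['date'])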
AWS Integration
S3 (Parquet files)
↓
Glue Data Catalog (schema)
↓
Athena (SQL queries)
This combination provides a serverless, pay-per-query analytics platform.
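One way to drive this stack from Python is the awswrangler library (the AWS SDK for pandas). A sketch, assuming hypothetical bucket, database, and table names and reusing the partitioned DataFrame from the earlier example:
import awswrangler as wr
# Write a partitioned dataset to S3 and register its schema in the Glue Data Catalog.
wr.s3.to_parquet(
    df=df,
    path='s3://my-data-lake/sales/',
    dataset=True,
    partition_cols=['date'],
    database='analytics',
    table='sales',
)
# Query the same files with Athena, paying only for the data each query scans.
result = wr.athena.read_sql_query('SELECT AVG(price) AS avg_price FROM sales', database='analytics')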