A practical guide to choosing between Python and SQL for data transformations. Learn when to use each tool, common antipatterns to avoid, and decision frameworks that work.
Burak Karakan
Co-founder & CEO
This debate comes up constantly in data teams. A new engineer joins and wants to rewrite everything in their preferred language. The "SQL people" clash with the "Python people." Projects get delayed, technical debt accumulates, and it's hard to reach consensus.
Here's what I've observed: both sides often have valid points.
The challenge isn't determining which tool is objectively better. It's that teams frequently choose based on comfort rather than the actual requirements. This can lead to maintenance challenges, performance issues, and team friction.
Let's explore a more systematic way to make this decision.
Recently, I reviewed code from a team that had spent three weeks building a 500-line Python script to process 10GB of data. It took two hours to run and consumed $300/month in compute costs.
We rewrote it as a SQL query. Runtime dropped to 30 seconds with negligible costs.
But the opposite happens too: I've seen a 1,000-line SQL file with nested CTEs 15 levels deep, attempting to implement complex conditional business logic. It was difficult to maintain, test, and debug.
There's a pattern here: teams often choose their tool based on what they know, rather than what the job requires. This can have real costs:
Infrastructure costs: Processing in Python what should be in SQL can cost thousands monthly
Engineering time: More code to write, test, and maintain
Team friction: Debates in code reviews about tool choice
Here's a simple decision framework we'll expand on:
Use SQL if you're transforming entire datasets with standard operations.
Use Python if you need procedural logic, external integrations, or custom algorithms.
Use both when you can clearly separate concerns.
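The three rules above can be sketched as a toy helper. This is purely illustrative; the boolean flags are hypothetical inputs, not a real API:

```python
# Illustrative only: the decision framework above as a tiny function.
def choose_tool(set_based: bool, needs_procedural_logic: bool,
                needs_external_integration: bool) -> str:
    """Return a rough tool suggestion for a data transformation."""
    needs_python = needs_procedural_logic or needs_external_integration
    if set_based and needs_python:
        return "both"    # separate concerns: SQL prep, Python logic
    if needs_python:
        return "python"
    return "sql"         # the default for whole-dataset transforms
```

The rest of this post expands on what each branch means in practice.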
SQL excels at set-based operations. When you need to operate on entire tables, SQL is purpose-built for the job:
-- Building a daily user activity table
SELECT
  DATE(event_timestamp) as date,
  user_id,
  COUNT(*) as event_count,
  COUNT(DISTINCT session_id) as session_count,
  MAX(event_timestamp) as last_activity
FROM events
WHERE DATE(event_timestamp) = CURRENT_DATE()
GROUP BY 1, 2
This kind of transformation is SQL's sweet spot. It's declarative: you describe what you want, not how to get it, and the query optimizer figures out the most efficient execution path.
Dimension tables are all about joins and aggregations. This is what SQL was designed for:
CREATE TABLE dim_users AS
WITH first_touch AS (
  -- Window functions can't be mixed directly with GROUP BY aggregates,
  -- so compute first-order attribution separately
  SELECT DISTINCT
    user_id,
    FIRST_VALUE(utm_source) OVER (
      PARTITION BY user_id
      ORDER BY created_at
    ) as acquisition_source
  FROM orders
)
SELECT
  u.user_id,
  u.email,
  u.created_at,
  COUNT(DISTINCT o.order_id) as lifetime_orders,
  SUM(o.total_amount) as lifetime_value,
  MAX(o.created_at) as last_order_date,
  f.acquisition_source
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
LEFT JOIN first_touch f ON u.user_id = f.user_id
GROUP BY u.user_id, u.email, u.created_at, f.acquisition_source
If you've read our Firebase analytics post, you know daily aggregation tables are critical. SQL makes them trivial:
CREATE TABLE users_daily AS
SELECT
  user_pseudo_id,
  event_date,
  COUNT(*) as total_events,
  COUNTIF(event_name = 'session_start') as sessions,
  COUNTIF(event_name = 'purchase') as purchases,
  SUM(CASE WHEN event_name = 'purchase'
    THEN (SELECT value.int_value FROM UNNEST(event_params)
          WHERE key = 'value')
  END) as revenue
FROM `project.analytics_XXXXXX.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE())
GROUP BY 1, 2
SQL's set-based nature makes quality checks clean and fast:
-- Find duplicates
SELECT user_id, COUNT(*)
FROM users
GROUP BY user_id
HAVING COUNT(*) > 1
-- Find orphaned records
SELECT o.order_id
FROM orders o
LEFT JOIN users u ON o.user_id = u.user_id
WHERE u.user_id IS NULL
Data warehouses are optimized for incremental processing:
-- Only process new data
INSERT INTO daily_aggregates
SELECT
date,
metric,
value
FROM source_data
WHERE date > (SELECT MAX(date) FROM daily_aggregates)
The warehouse handles partitioning, clustering, and optimization. You just describe the logic.
Python excels at procedural logic and integration. When you need to do things step-by-step, make decisions, or interact with external systems, Python is a natural fit.
Python has extensive ML libraries. Feature engineering, model training, and predictions are all straightforward in Python:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# SQL for data prep (`warehouse` is whatever query client you use)
data = warehouse.query("""
    SELECT user_id, feature1, feature2, churned
    FROM ml_training_data
""")

# Python for ML
X = data[['feature1', 'feature2']]
y = data['churned']
model = RandomForestClassifier()
model.fit(X, y)

# new_data: a DataFrame with the same feature columns
predictions = model.predict(new_data)
Custom validation rules that go beyond simple SQL checks:
def validate_order_data(orders_df):
    issues = []

    # Check for suspiciously high order values
    for idx, row in orders_df.iterrows():
        if row['total_amount'] > row['expected_range_max'] * 2:
            # Call fraud detection API
            fraud_score = check_fraud(row)
            if fraud_score > 0.8:
                issues.append({
                    'order_id': row['order_id'],
                    'issue': 'potential_fraud',
                    'score': fraud_score
                })

    # Check for invalid product combinations
    for idx, row in orders_df.iterrows():
        products = set(row['product_ids'])
        if 'warranty' in products and 'digital_product' in products:
            issues.append({
                'order_id': row['order_id'],
                'issue': 'invalid_product_combination'
            })

    return issues
Working with deeply nested or irregular JSON structures:
def flatten_firebase_params(event_params):
    """Extract parameters from Firebase's nested structure.

    Each param looks like {'key': 'value', 'value': {'int_value': 42}}:
    the actual value is nested one level down, in one of several fields.
    """
    result = {}
    for param in event_params:
        key = param['key']
        value = param['value']
        # Only one of these fields is populated per param
        if 'string_value' in value:
            result[key] = value['string_value']
        elif 'int_value' in value:
            result[key] = value['int_value']
        elif 'float_value' in value:
            result[key] = value['float_value']
        elif 'double_value' in value:
            result[key] = value['double_value']
    return result
A startup with a strong ML team decided everything should be Python. "We're a Python shop," they said.
Here's what they built for daily user aggregations:
import pandas as pd
from google.cloud import bigquery
def aggregate_user_metrics(date):
    client = bigquery.Client()

    # Read entire events table into memory
    query = f"""
        SELECT * FROM events
        WHERE DATE(timestamp) = '{date}'
    """
    events_df = client.query(query).to_dataframe()

    # Group by user
    results = []
    for user_id in events_df['user_id'].unique():
        user_events = events_df[events_df['user_id'] == user_id]
        session_starts = len(user_events[user_events['event_name'] == 'session_start'])
        purchases = len(user_events[user_events['event_name'] == 'purchase'])

        revenue = 0
        for idx, row in user_events[user_events['event_name'] == 'purchase'].iterrows():
            for param in row['event_params']:
                if param['key'] == 'value':
                    revenue += param['int_value']

        results.append({
            'user_id': user_id,
            'date': date,
            'session_starts': session_starts,
            'purchases': purchases,
            'revenue': revenue
        })

    results_df = pd.DataFrame(results)
    results_df.to_gbq('analytics.users_daily', if_exists='append')
This code:
Loads 10GB of data into memory (often crashes)
Processes row-by-row instead of in sets
Takes 2 hours to run
Costs $300/month in compute
The SQL version:
INSERT INTO analytics.users_daily
SELECT
  user_id,
  DATE(timestamp) as date,
  COUNTIF(event_name = 'session_start') as session_starts,
  COUNTIF(event_name = 'purchase') as purchases,
  SUM(CASE WHEN event_name = 'purchase'
    THEN (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'value')
  END) as revenue
FROM events
WHERE DATE(timestamp) = CURRENT_DATE()
GROUP BY 1, 2
The real cost wasn't just the $300/month. It was the time spent optimizing Python, the incidents when it crashed, and the debugging effort for memory issues.
On the other hand, I've worked with teams that tried to use SQL exclusively, even for complex procedural logic.
One team needed to implement customer churn prediction with complex logic:
WITH customer_features AS (
  SELECT
    user_id,
    -- 50 more feature calculations
  FROM users
),
behavior_scores AS (
  SELECT
    user_id,
    CASE
      WHEN feature1 > threshold1 AND feature2 < threshold2 THEN
        CASE
          WHEN feature3 IN (SELECT value FROM lookup1) THEN
            CASE
              WHEN feature4 > (SELECT AVG(feature4) FROM customer_features) THEN 0.8
              ELSE 0.6
            END
          ELSE
            CASE
              WHEN feature5 BETWEEN value1 AND value2 THEN 0.4
              ELSE 0.2
            END
        END
      ELSE
        -- 10 more levels of CASE WHEN
    END as churn_score
  FROM customer_features
)
-- 500 more lines of nested CTEs
SELECT * FROM final_scores
This 1,000-line SQL file had two fundamental problems:
Difficult to test (limited unit testing options for SQL logic)
Hard to debug (which of 15 nested CTEs had the issue?)
Often the best answer is to use both, with a clear split. Example: daily aggregations with anomaly detection.
-- daily_metrics.sql
-- Use SQL for the heavy lifting
SELECT
  date,
  metric_name,
  metric_value,
  AVG(metric_value) OVER (
    PARTITION BY metric_name
    ORDER BY date
    ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
  ) as avg_last_7_days,
  STDDEV(metric_value) OVER (
    PARTITION BY metric_name
    ORDER BY date
    ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
  ) as stddev_last_7_days
FROM metrics
WHERE date >= CURRENT_DATE() - 30
# anomaly_detection.py
# Use Python for the complex logic
def detect_anomalies(metrics_df):
    anomalies = []
    for idx, row in metrics_df.iterrows():
        if row['stddev_last_7_days'] > 0:
            z_score = (row['metric_value'] - row['avg_last_7_days']) / row['stddev_last_7_days']
            if abs(z_score) > 3:
                # Complex anomaly logic
                severity = classify_anomaly(row, z_score)
                if should_alert(row['metric_name'], severity):
                    anomalies.append({
                        'date': row['date'],
                        'metric': row['metric_name'],
                        'severity': severity
                    })
    return anomalies
SQL does what it's good at (aggregations, window functions). Python does what it's good at (complex conditional logic, integration with alerting systems).
SQL to filter and join → Python for complex logic → SQL for final aggregation
-- Step 1: SQL preprocessing
-- Narrow down to relevant data
CREATE TEMP TABLE relevant_orders AS
SELECT
  o.order_id,
  o.user_id,
  o.items,
  u.segment,
  u.lifetime_value
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.created_at >= CURRENT_DATE() - 7
  AND o.status = 'completed'
  AND u.segment IN ('vip', 'frequent_buyer')
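The middle step applies per-order logic in Python. A sketch of what that might look like; the scoring rules, the `recommend_for_order` function, and the field names are illustrative assumptions, not from a real system:

```python
# Step 2 (Python, sketch): apply per-order recommendation logic.
# Each `order` dict corresponds to one row of relevant_orders above.
def recommend_for_order(order):
    """Return recommendation dicts based on segment-specific rules."""
    recs = []
    for item in order["items"]:
        # Hypothetical business rules, hard to express cleanly in SQL
        if order["segment"] == "vip" and item["category"] == "electronics":
            recs.append({"user_id": order["user_id"],
                         "product_id": f"premium-{item['product_id']}"})
        elif order["segment"] == "frequent_buyer":
            recs.append({"user_id": order["user_id"],
                         "product_id": f"bundle-{item['product_id']}"})
    return recs
```

The results would then be written to a `recommendations` table for the final SQL rollup.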
-- Step 3: SQL aggregation
-- Roll up results for reporting
INSERT INTO daily_recommendation_stats
SELECT
  date,
  product_id,
  COUNT(*) as recommendation_count,
  COUNT(DISTINCT user_id) as user_count,
  SUM(CASE WHEN clicked THEN 1 ELSE 0 END) as clicks
FROM recommendations
GROUP BY 1, 2
Here's something people don't talk about enough: the right tooling makes this decision easier.
If you're running SQL in one place and Python in another, deploying separately and monitoring separately, you're going to default to using one tool for everything. The friction of using both is too high.
This is why we built Bruin to run both SQL and Python in the same pipeline, with the same tooling.
If your current setup makes it painful to use both, you'll keep forcing everything into one tool. That's when you end up with 500-line Python scripts doing GROUP BYs or 1,000-line SQL files implementing procedural logic.
Based on experience with many teams, here are some principles that tend to work well:
The best data engineers are bilingual. They understand both SQL and Python, and they choose thoughtfully between them.
Start with SQL unless you have a specific reason not to. For data transformation, SQL is often a good default. Consider whether Python adds enough value to justify the switch.
Profile and benchmark; don't assume. Measure actual performance before committing to an approach.
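A minimal benchmark sketch makes this concrete (synthetic data; absolute numbers vary by machine, but the gap between row-by-row and set-based processing is consistently large):

```python
import time
import numpy as np
import pandas as pd

# Synthetic events: count purchases per user two ways
df = pd.DataFrame({
    "user_id": np.random.randint(0, 1_000, size=50_000),
    "event_name": np.random.choice(["session_start", "purchase"], size=50_000),
})

start = time.perf_counter()
counts_loop = {}
for _, row in df.iterrows():  # row-by-row: slow
    if row["event_name"] == "purchase":
        counts_loop[row["user_id"]] = counts_loop.get(row["user_id"], 0) + 1
loop_s = time.perf_counter() - start

start = time.perf_counter()
counts_vec = (df[df["event_name"] == "purchase"]
              .groupby("user_id").size())  # set-based: fast
vec_s = time.perf_counter() - start

print(f"loop: {loop_s:.2f}s, vectorized: {vec_s:.4f}s")
```

Both compute identical results; only the execution model differs. The same gap shows up, magnified, between a Python loop and a warehouse GROUP BY.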
Make it easy to use both. Tool friction can drive suboptimal decisions. When your platform makes both SQL and Python easy to use, you're more likely to make the right choice.
Optimize for maintainability and performance. Consider the long-term implications of your tool choice, not just immediate comfort.
The Python vs SQL debate often presents a false choice.
The key question isn't "which is better?" but rather "which is right for this specific transformation?"
Teams sometimes default to the tool they know best, rather than the tool best suited to the task. This can lead to performance issues, higher costs, and maintenance challenges.
A better approach is to:
Understand what each tool does well
Recognize the patterns that favor one over the other
Build competency in both to make informed choices
Use tooling that makes both options accessible
Your data platform should make it easy to use SQL when SQL is appropriate and Python when Python is appropriate, without unnecessary friction.
This is one of the goals we had when building Bruin.
Consider experimenting with your next data transformation by implementing it both ways.
Write it in SQL. Write it in Python. Compare:
Development time
Execution time
Code maintainability
Performance at scale
This hands-on comparison can be illuminating.
If you're interested in how Bruin enables running both SQL and Python in the same pipeline with unified tooling, you can check out our docs or join our Slack community to discuss your specific use cases.
The best pipeline uses the right tool for each job.