8  Tabular Experimental Data

Most synthetic biology projects do not begin with a model.

They begin with a table.

You grow strains, induce cultures, measure fluorescence, record optical density, annotate constructs, and export the results from an instrument or a spreadsheet. Very quickly, your work becomes a problem of tabular data management.

That is why pandas is one of the most useful Python libraries for synthetic biology. It gives us a way to read, clean, reshape, summarize, and save experimental data without leaving Python.

In this chapter, we will introduce pandas through the kinds of tables that appear constantly in biology: plate-reader exports, induction measurements, and construct metadata.

This chapter also introduces one of the most important habits in scientific computing: organizing tables as tidy data.

Once we introduce tidy data, we will use it as the default tabular format for the rest of the book.

8.1 Why tidy data matters

A table can be easy for a human to read but hard for code to analyze.

That tension appears everywhere in biology. Instrument exports, manually assembled spreadsheets, and presentation-ready summary tables are often arranged for human inspection rather than computation.

A tidy table follows a simple idea:

  • each row is one observation
  • each column is one variable
  • each type of observational unit gets its own table

For a fluorescence induction experiment, one observation might be:

  • one strain
  • at one inducer concentration
  • at one time point
  • for one replicate

In a tidy table, those properties become columns.

That might sound abstract, so let us start with a non-tidy example.

8.2 A wide table is often the first thing you get

Suppose a plate reader exports fluorescence values like this:

import pandas as pd

wide_data = pd.DataFrame(
    {
        "time_h": [0, 2, 4],
        "WT_rep1": [120, 135, 140],
        "WT_rep2": [118, 132, 141],
        "Sensor_rep1": [125, 410, 980],
        "Sensor_rep2": [130, 430, 1020],
    }
)

wide_data
time_h WT_rep1 WT_rep2 Sensor_rep1 Sensor_rep2
0 0 120 118 125 130
1 2 135 132 410 430
2 4 140 141 980 1020

A human can read this table easily, but the column names mix three different variables:

  • strain identity
  • replicate identity
  • measurement value

That makes downstream analysis awkward.

For example, imagine asking any of these questions:

  • what is the mean fluorescence for each strain at each time point?
  • how much variation is there across replicates?
  • can we merge this table with construct metadata?
  • can we plot all strains with a common grammar later?

Those operations become easier when the table is tidy.
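As a preview of why, here is the first of those questions answered on a tidy slice of the same measurements. This is a standalone sketch; grouped summaries are covered properly in section 8.9.

```python
import pandas as pd

# A tidy slice of the same plate-reader data, with strain and replicate
# pulled out of the column names and into their own columns.
tidy = pd.DataFrame(
    {
        "time_h": [0, 0, 0, 0, 2, 2],
        "strain": ["WT", "WT", "Sensor", "Sensor", "WT", "WT"],
        "replicate": [1, 2, 1, 2, 1, 2],
        "fluorescence": [120, 118, 125, 130, 135, 132],
    }
)

# One line answers "mean fluorescence for each strain at each time point".
means = tidy.groupby(["strain", "time_h"])["fluorescence"].mean().reset_index()
```

In the wide layout, the same question requires picking out and averaging pairs of columns by name; in the tidy layout it is a single grouped operation.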

8.3 Reshaping wide data into tidy data

We can use melt() to convert the wide table into a longer, tidier format.

tidy_from_wide = wide_data.melt(
    id_vars="time_h",
    var_name="sample",
    value_name="fluorescence",
)

tidy_from_wide
time_h sample fluorescence
0 0 WT_rep1 120
1 2 WT_rep1 135
2 4 WT_rep1 140
3 0 WT_rep2 118
4 2 WT_rep2 132
5 4 WT_rep2 141
6 0 Sensor_rep1 125
7 2 Sensor_rep1 410
8 4 Sensor_rep1 980
9 0 Sensor_rep2 130
10 2 Sensor_rep2 430
11 4 Sensor_rep2 1020

Now each row is closer to a single observation, but the sample column still combines two variables: strain and replicate.

We can separate them.

parsed = tidy_from_wide["sample"].str.extract(r"(?P<strain>.+)_rep(?P<replicate>\d+)")

tidy_from_wide = pd.concat([tidy_from_wide, parsed], axis=1)
tidy_from_wide["replicate"] = tidy_from_wide["replicate"].astype(int)

tidy_from_wide = tidy_from_wide[["time_h", "strain", "replicate", "fluorescence"]]

tidy_from_wide
time_h strain replicate fluorescence
0 0 WT 1 120
1 2 WT 1 135
2 4 WT 1 140
3 0 WT 2 118
4 2 WT 2 132
5 4 WT 2 141
6 0 Sensor 1 125
7 2 Sensor 1 410
8 4 Sensor 1 980
9 0 Sensor 2 130
10 2 Sensor 2 430
11 4 Sensor 2 1020

This new table is much better for computation:

  • one row is one measurement
  • time_h, strain, and replicate are explicit variables
  • fluorescence is the measured value

From this point onward in the book, we will prefer this style.

When we talk about experimental tables, assume that we want them in tidy format unless there is a strong reason not to.

8.4 Building a tidy experiment table

Let us create a slightly richer dataset. This time, each row will represent one observation from an induction experiment.

experiment = pd.DataFrame(
    [
        {"strain": "WT", "inducer_mM": 0.0, "time_h": 4, "replicate": 1, "od600": 0.82, "fluorescence": 145},
        {"strain": "WT", "inducer_mM": 0.0, "time_h": 4, "replicate": 2, "od600": 0.80, "fluorescence": 150},
        {"strain": "WT", "inducer_mM": 1.0, "time_h": 4, "replicate": 1, "od600": 0.81, "fluorescence": 152},
        {"strain": "WT", "inducer_mM": 1.0, "time_h": 4, "replicate": 2, "od600": 0.79, "fluorescence": 149},
        {"strain": "Sensor", "inducer_mM": 0.0, "time_h": 4, "replicate": 1, "od600": 0.77, "fluorescence": 210},
        {"strain": "Sensor", "inducer_mM": 0.0, "time_h": 4, "replicate": 2, "od600": 0.75, "fluorescence": 220},
        {"strain": "Sensor", "inducer_mM": 1.0, "time_h": 4, "replicate": 1, "od600": 0.78, "fluorescence": 920},
        {"strain": "Sensor", "inducer_mM": 1.0, "time_h": 4, "replicate": 2, "od600": 0.76, "fluorescence": 980},
        {"strain": "Amplifier", "inducer_mM": 0.0, "time_h": 4, "replicate": 1, "od600": 0.73, "fluorescence": 260},
        {"strain": "Amplifier", "inducer_mM": 0.0, "time_h": 4, "replicate": 2, "od600": 0.72, "fluorescence": 255},
        {"strain": "Amplifier", "inducer_mM": 1.0, "time_h": 4, "replicate": 1, "od600": 0.74, "fluorescence": 1380},
        {"strain": "Amplifier", "inducer_mM": 1.0, "time_h": 4, "replicate": 2, "od600": 0.71, "fluorescence": 1415},
    ]
)

experiment
strain inducer_mM time_h replicate od600 fluorescence
0 WT 0.0 4 1 0.82 145
1 WT 0.0 4 2 0.80 150
2 WT 1.0 4 1 0.81 152
3 WT 1.0 4 2 0.79 149
4 Sensor 0.0 4 1 0.77 210
5 Sensor 0.0 4 2 0.75 220
6 Sensor 1.0 4 1 0.78 920
7 Sensor 1.0 4 2 0.76 980
8 Amplifier 0.0 4 1 0.73 260
9 Amplifier 0.0 4 2 0.72 255
10 Amplifier 1.0 4 1 0.74 1380
11 Amplifier 1.0 4 2 0.71 1415

This is the kind of structure that ages well.

A tidy table like this is easy to:

  • filter by strain or condition
  • normalize measurements
  • group replicates
  • compute summaries
  • merge with metadata
  • write back to disk

8.5 Inspecting a DataFrame

A pandas table is called a DataFrame.

Some of the first things we usually check are the size, column names, and data types.

experiment.shape
(12, 6)
experiment.columns.tolist()
['strain', 'inducer_mM', 'time_h', 'replicate', 'od600', 'fluorescence']
experiment.dtypes
strain               object
inducer_mM      float64
time_h            int64
replicate         int64
od600           float64
fluorescence      int64
dtype: object

And for a quick preview:

experiment.head()
strain inducer_mM time_h replicate od600 fluorescence
0 WT 0.0 4 1 0.82 145
1 WT 0.0 4 2 0.80 150
2 WT 1.0 4 1 0.81 152
3 WT 1.0 4 2 0.79 149
4 Sensor 0.0 4 1 0.77 210

This may feel routine, but it is part of good scientific hygiene. A surprising amount of debugging is simply discovering that a column has the wrong type or an unexpected name.
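When a dtype is wrong, the usual culprit is a stray text entry that silently turned a numeric column into strings. A minimal sketch of the problem and a common fix, using a hypothetical miscoded export:

```python
import pandas as pd
from io import StringIO

# A hypothetical export where one od600 entry was typed as "pending",
# which makes pandas store the entire column as strings (dtype object).
raw = pd.read_csv(StringIO("strain,od600\nWT,0.82\nSensor,pending\n"))

# pd.to_numeric with errors="coerce" recovers the numbers and turns the
# unparseable entry into NaN, so it can be handled explicitly later.
raw["od600"] = pd.to_numeric(raw["od600"], errors="coerce")
```

Checking `dtypes` before and after a conversion like this is a cheap way to catch the problem early rather than deep inside an analysis.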

8.6 Selecting columns

Column selection is straightforward.

experiment["fluorescence"]
0      145
1      150
2      152
3      149
4      210
5      220
6      920
7      980
8      260
9      255
10    1380
11    1415
Name: fluorescence, dtype: int64

To select multiple columns, pass a list of names.

experiment[["strain", "inducer_mM", "fluorescence"]]
strain inducer_mM fluorescence
0 WT 0.0 145
1 WT 0.0 150
2 WT 1.0 152
3 WT 1.0 149
4 Sensor 0.0 210
5 Sensor 0.0 220
6 Sensor 1.0 920
7 Sensor 1.0 980
8 Amplifier 0.0 260
9 Amplifier 0.0 255
10 Amplifier 1.0 1380
11 Amplifier 1.0 1415

This is useful when you want to focus on a subset of variables without modifying the original table.

8.7 Filtering rows

Because our table is tidy, filtering becomes very natural.

sensor_only = experiment[experiment["strain"] == "Sensor"]
sensor_only
strain inducer_mM time_h replicate od600 fluorescence
4 Sensor 0.0 4 1 0.77 210
5 Sensor 0.0 4 2 0.75 220
6 Sensor 1.0 4 1 0.78 920
7 Sensor 1.0 4 2 0.76 980

We can combine conditions too.

induced_sensor = experiment[
    (experiment["strain"] == "Sensor") & (experiment["inducer_mM"] == 1.0)
]

induced_sensor
strain inducer_mM time_h replicate od600 fluorescence
6 Sensor 1.0 4 1 0.78 920
7 Sensor 1.0 4 2 0.76 980

That one line already expresses a biologically meaningful question: show me the induced measurements for the sensor strain.

Sorting is equally common.

experiment.sort_values(["strain", "inducer_mM", "replicate"])
strain inducer_mM time_h replicate od600 fluorescence
8 Amplifier 0.0 4 1 0.73 260
9 Amplifier 0.0 4 2 0.72 255
10 Amplifier 1.0 4 1 0.74 1380
11 Amplifier 1.0 4 2 0.71 1415
4 Sensor 0.0 4 1 0.77 210
5 Sensor 0.0 4 2 0.75 220
6 Sensor 1.0 4 1 0.78 920
7 Sensor 1.0 4 2 0.76 980
0 WT 0.0 4 1 0.82 145
1 WT 0.0 4 2 0.80 150
2 WT 1.0 4 1 0.81 152
3 WT 1.0 4 2 0.79 149

8.8 Adding derived columns

Raw fluorescence is often less informative than fluorescence normalized by culture density.

A common first pass is to divide by OD600.

experiment = experiment.assign(
    norm_fluorescence=experiment["fluorescence"] / experiment["od600"]
)

experiment[["strain", "inducer_mM", "replicate", "norm_fluorescence"]]
strain inducer_mM replicate norm_fluorescence
0 WT 0.0 1 176.829268
1 WT 0.0 2 187.500000
2 WT 1.0 1 187.654321
3 WT 1.0 2 188.607595
4 Sensor 0.0 1 272.727273
5 Sensor 0.0 2 293.333333
6 Sensor 1.0 1 1179.487179
7 Sensor 1.0 2 1289.473684
8 Amplifier 0.0 1 356.164384
9 Amplifier 0.0 2 354.166667
10 Amplifier 1.0 1 1864.864865
11 Amplifier 1.0 2 1992.957746

This is a good example of why tidy data helps. Each row already contains the variables required for the calculation, so adding a derived column is simple and transparent.

8.9 Grouping and summarizing replicates

Biological experiments almost always involve replicate measurements.

A tidy table makes grouped summaries very convenient.

summary = (
    experiment.groupby(["strain", "inducer_mM"])
    .agg(
        n=("norm_fluorescence", "size"),
        mean_norm_fluorescence=("norm_fluorescence", "mean"),
        sd_norm_fluorescence=("norm_fluorescence", "std"),
        mean_od600=("od600", "mean"),
    )
    .reset_index()
)

summary
strain inducer_mM n mean_norm_fluorescence sd_norm_fluorescence mean_od600
0 Amplifier 0.0 2 355.165525 1.412599 0.725
1 Amplifier 1.0 2 1928.911306 90.575345 0.725
2 Sensor 0.0 2 283.030303 14.570685 0.760
3 Sensor 1.0 2 1234.480432 77.772203 0.770
4 WT 0.0 2 182.164634 7.545347 0.810
5 WT 1.0 2 188.130958 0.674066 0.800

This summary table is also tidy.

Each row now represents one summarized condition rather than one raw replicate. That is still perfectly tidy, because the observational unit has changed. The important thing is that the rows and columns remain explicit.

This distinction is worth remembering:

  • a replicate-level tidy table has one row per measurement
  • a condition-level tidy summary has one row per summarized condition

Both are tidy. They simply describe different units of analysis.

8.10 Quantifying induction

Once we have grouped summaries, we can compute simple biological metrics such as fold induction.

baseline = summary[summary["inducer_mM"] == 0.0][
    ["strain", "mean_norm_fluorescence"]
].rename(columns={"mean_norm_fluorescence": "baseline_norm_fluorescence"})

summary_with_baseline = summary.merge(baseline, on="strain")
summary_with_baseline["fold_induction"] = (
    summary_with_baseline["mean_norm_fluorescence"]
    / summary_with_baseline["baseline_norm_fluorescence"]
)

summary_with_baseline
strain inducer_mM n mean_norm_fluorescence sd_norm_fluorescence mean_od600 baseline_norm_fluorescence fold_induction
0 Amplifier 0.0 2 355.165525 1.412599 0.725 355.165525 1.000000
1 Amplifier 1.0 2 1928.911306 90.575345 0.725 355.165525 5.431021
2 Sensor 0.0 2 283.030303 14.570685 0.760 283.030303 1.000000
3 Sensor 1.0 2 1234.480432 77.772203 0.770 283.030303 4.361655
4 WT 0.0 2 182.164634 7.545347 0.810 182.164634 1.000000
5 WT 1.0 2 188.130958 0.674066 0.800 182.164634 1.032752

Now we have a compact summary that answers a real synthetic biology question: how strongly does each strain respond to inducer?

8.11 Missing values happen

Real experiments are never perfectly complete. Wells fail, cultures become contaminated, measurements are dropped, and metadata goes missing.

pandas uses special missing values such as NaN to represent absent entries.

with_missing = experiment.copy()
with_missing.loc[3, "od600"] = None
with_missing.loc[8, "fluorescence"] = None

with_missing
strain inducer_mM time_h replicate od600 fluorescence norm_fluorescence
0 WT 0.0 4 1 0.82 145.0 176.829268
1 WT 0.0 4 2 0.80 150.0 187.500000
2 WT 1.0 4 1 0.81 152.0 187.654321
3 WT 1.0 4 2 NaN 149.0 188.607595
4 Sensor 0.0 4 1 0.77 210.0 272.727273
5 Sensor 0.0 4 2 0.75 220.0 293.333333
6 Sensor 1.0 4 1 0.78 920.0 1179.487179
7 Sensor 1.0 4 2 0.76 980.0 1289.473684
8 Amplifier 0.0 4 1 0.73 NaN 356.164384
9 Amplifier 0.0 4 2 0.72 255.0 354.166667
10 Amplifier 1.0 4 1 0.74 1380.0 1864.864865
11 Amplifier 1.0 4 2 0.71 1415.0 1992.957746

We can ask how many missing values appear in each column.

with_missing.isna().sum()
strain               0
inducer_mM           0
time_h               0
replicate            0
od600                1
fluorescence         1
norm_fluorescence    0
dtype: int64

Sometimes the correct action is to remove incomplete rows.

complete_cases = with_missing.dropna(subset=["od600", "fluorescence"])
complete_cases
strain inducer_mM time_h replicate od600 fluorescence norm_fluorescence
0 WT 0.0 4 1 0.82 145.0 176.829268
1 WT 0.0 4 2 0.80 150.0 187.500000
2 WT 1.0 4 1 0.81 152.0 187.654321
4 Sensor 0.0 4 1 0.77 210.0 272.727273
5 Sensor 0.0 4 2 0.75 220.0 293.333333
6 Sensor 1.0 4 1 0.78 920.0 1179.487179
7 Sensor 1.0 4 2 0.76 980.0 1289.473684
9 Amplifier 0.0 4 2 0.72 255.0 354.166667
10 Amplifier 1.0 4 1 0.74 1380.0 1864.864865
11 Amplifier 1.0 4 2 0.71 1415.0 1992.957746

Other times we may want to keep the rows but mark that some analysis cannot yet be performed.

The important point is not to ignore missingness. Missing values are part of the data-generating process and often reflect something biologically or experimentally meaningful.
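One simple pattern for keeping incomplete rows while making their status explicit is a completeness flag. A small standalone sketch (the column names mirror our experiment table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "strain": ["WT", "WT", "Sensor"],
        "od600": [0.82, np.nan, 0.75],
        "fluorescence": [145.0, 149.0, np.nan],
    }
)

# Mark rows where every measurement column is present, instead of dropping
# the incomplete ones. The missing rows stay visible in the table.
df["complete"] = df[["od600", "fluorescence"]].notna().all(axis=1)
```

Downstream code can then filter on `complete` where needed, while the record of what went missing is preserved.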

8.12 Reading tidy data from a CSV file

In practice, we usually load data from files rather than typing them directly into Python.

Here is a small CSV example using an in-memory text buffer.

from io import StringIO

csv_text = """strain,inducer_mM,time_h,replicate,od600,fluorescence
WT,0.0,4,1,0.82,145
WT,0.0,4,2,0.80,150
Sensor,1.0,4,1,0.78,920
Sensor,1.0,4,2,0.76,980
"""

loaded = pd.read_csv(StringIO(csv_text))
loaded
strain inducer_mM time_h replicate od600 fluorescence
0 WT 0.0 4 1 0.82 145
1 WT 0.0 4 2 0.80 150
2 Sensor 1.0 4 1 0.78 920
3 Sensor 1.0 4 2 0.76 980

On disk, the equivalent workflow would look like this:

from pathlib import Path

example_path = Path("data") / "induction_results.csv"
example_path
PosixPath('data/induction_results.csv')

If the file exists, we would read it with:

# pd.read_csv(example_path)

In a real project, the most important thing is that the CSV itself should already be organized as a tidy table whenever possible.

8.13 Merging measurement tables with metadata

Experiments often involve more than one table.

For example, one table may contain measurements, while another contains construct metadata.

metadata = pd.DataFrame(
    {
        "strain": ["WT", "Sensor", "Amplifier"],
        "plasmid": ["pControl", "pSensor", "pAmp"],
        "reporter": ["none", "GFP", "GFP"],
        "host": ["E. coli", "E. coli", "E. coli"],
    }
)

metadata
strain plasmid reporter host
0 WT pControl none E. coli
1 Sensor pSensor GFP E. coli
2 Amplifier pAmp GFP E. coli

Because both tables have a strain column, we can merge them.

annotated = experiment.merge(metadata, on="strain", how="left")
annotated
strain inducer_mM time_h replicate od600 fluorescence norm_fluorescence plasmid reporter host
0 WT 0.0 4 1 0.82 145 176.829268 pControl none E. coli
1 WT 0.0 4 2 0.80 150 187.500000 pControl none E. coli
2 WT 1.0 4 1 0.81 152 187.654321 pControl none E. coli
3 WT 1.0 4 2 0.79 149 188.607595 pControl none E. coli
4 Sensor 0.0 4 1 0.77 210 272.727273 pSensor GFP E. coli
5 Sensor 0.0 4 2 0.75 220 293.333333 pSensor GFP E. coli
6 Sensor 1.0 4 1 0.78 920 1179.487179 pSensor GFP E. coli
7 Sensor 1.0 4 2 0.76 980 1289.473684 pSensor GFP E. coli
8 Amplifier 0.0 4 1 0.73 260 356.164384 pAmp GFP E. coli
9 Amplifier 0.0 4 2 0.72 255 354.166667 pAmp GFP E. coli
10 Amplifier 1.0 4 1 0.74 1380 1864.864865 pAmp GFP E. coli
11 Amplifier 1.0 4 2 0.71 1415 1992.957746 pAmp GFP E. coli

This is one of the major reasons to preserve tidy structure. Joins become much easier when variables are explicit and consistently named.
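Two merge() options are worth knowing for exactly this situation: validate= raises an error if the join is not the shape you expect, and indicator=True records where each row found a match. A small standalone sketch:

```python
import pandas as pd

measurements = pd.DataFrame(
    {"strain": ["WT", "Sensor", "Sensor"], "fluorescence": [145, 920, 980]}
)
metadata = pd.DataFrame(
    {"strain": ["WT", "Sensor"], "plasmid": ["pControl", "pSensor"]}
)

# validate="many_to_one" fails loudly if metadata ever contains a duplicate
# strain; the _merge column shows "both" or "left_only" for every row.
annotated = measurements.merge(
    metadata, on="strain", how="left", validate="many_to_one", indicator=True
)
```

A row marked "left_only" in `_merge` is a measurement whose strain is missing from the metadata table, which is usually worth investigating before continuing.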

8.14 Selecting one table shape for downstream work

Once you begin analyzing tidy data, it is tempting to keep making presentation-friendly versions of the table. That is fine for slides or papers, but for computation it is better to keep a canonical tidy table and derive other forms when needed.

For example, if you ever need a wide table for reporting, you can create it from the tidy version.

wide_summary = summary.pivot(
    index="strain",
    columns="inducer_mM",
    values="mean_norm_fluorescence",
)

wide_summary
inducer_mM 0.0 1.0
strain
Amplifier 355.165525 1928.911306
Sensor 283.030303 1234.480432
WT 182.164634 188.130958

That is a useful display, but it is not the format we want to keep as the primary analytical table.

A good default workflow is:

  1. clean the raw data
  2. convert to tidy format
  3. do all analysis in tidy format
  4. create wide or presentation-specific views only at the end

We will follow that pattern throughout the rest of this book.

8.15 Saving processed data

After cleaning and summarizing a dataset, it is often helpful to save the result for later chapters, figures, or reports.

output_dir = Path("results")
output_dir.mkdir(exist_ok=True)

output_path = output_dir / "induction_summary.csv"
summary_with_baseline.to_csv(output_path, index=False)

output_path
PosixPath('results/induction_summary.csv')

That small step turns an analysis from a one-off calculation into a reproducible workflow artifact.
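A quick way to confirm the saved artifact is trustworthy is to read it straight back and compare it to what was written. A self-contained sketch using a temporary directory:

```python
import tempfile
from pathlib import Path

import pandas as pd

summary = pd.DataFrame(
    {"strain": ["WT", "Sensor"], "fold_induction": [1.03, 4.36]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "summary.csv"
    summary.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded table should match the one we saved, column for column.
assert reloaded.equals(summary)
```

CSV round-trips can quietly change dtypes (for example, integer columns with missing values come back as floats), so a check like this is worth the extra two lines.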

8.16 A small end-to-end example

Let us combine the main ideas of the chapter into a compact workflow.

workflow_result = (
    experiment
    .assign(norm_fluorescence=lambda df: df["fluorescence"] / df["od600"])
    .groupby(["strain", "inducer_mM"])
    .agg(mean_norm_fluorescence=("norm_fluorescence", "mean"))
    .reset_index()
    .sort_values(["strain", "inducer_mM"])
)

workflow_result
strain inducer_mM mean_norm_fluorescence
0 Amplifier 0.0 355.165525
1 Amplifier 1.0 1928.911306
2 Sensor 0.0 283.030303
3 Sensor 1.0 1234.480432
4 WT 0.0 182.164634
5 WT 1.0 188.130958

This pipeline works cleanly because the underlying table is tidy.

That is the theme to carry forward.

As datasets become larger and models become more sophisticated, tidy organization continues to pay off.

8.17 Recap

In this chapter, we learned how to:

  • represent biological experiments as pandas DataFrames
  • distinguish wide tables from tidy tables
  • reshape data into tidy format
  • inspect columns, dimensions, and data types
  • filter rows and select variables
  • create derived columns such as normalized fluorescence
  • summarize replicate-level data by condition
  • handle missing values explicitly
  • merge measurements with metadata
  • save processed outputs for reproducible analysis

Most importantly, we established a convention for the rest of the book:

From here onward, tabular experimental data should be assumed to be in tidy format unless stated otherwise.

8.18 Exercises

  1. Add a new column called log10_norm_fluorescence using base-10 logarithms. numpy's log10 works elementwise on a Series; the math module's log10 only accepts scalars.

  2. Extend the experiment table by adding a second time point, such as time_h = 8, and compute condition summaries grouped by both time and inducer.

  3. Create a metadata table that includes a promoter column and merge it with the experiment table.

  4. Starting from a wide table with columns like Sensor_0mM_rep1 and Sensor_1mM_rep1, convert it into a tidy table with separate columns for strain, inducer, and replicate.

  5. Save both the replicate-level tidy table and the condition-level summary table to separate CSV files in a results/ directory.