8  Tabular Experimental Data

Most synthetic biology projects do not begin with a model.

They begin with a table.

You grow strains, induce cultures, measure fluorescence, record optical density, annotate constructs, and export the results from an instrument or a spreadsheet. Very quickly, your work becomes a problem of tabular data management.

That is why pandas is one of the most useful Python libraries for synthetic biology. It gives us a way to read, clean, reshape, summarize, and save experimental data without leaving Python.

In this chapter, we will introduce pandas through the kinds of tables that appear constantly in biology: plate-reader exports, induction measurements, and construct metadata.

This chapter also introduces one of the most important habits in scientific computing: organizing tables as tidy data.

Once we introduce tidy data, we will use it as the default tabular format for the rest of the book.

8.1 Why tidy data matters

A table can be easy for a human to read but hard for code to analyze.

That tension appears everywhere in biology. Instrument exports, manually assembled spreadsheets, and presentation-ready summary tables are often arranged for human inspection rather than computation.

A tidy table follows a simple idea:

  • each row is one observation
  • each column is one variable
  • each type of observational unit gets its own table

For a fluorescence induction experiment, one observation might be:

  • one strain
  • at one inducer concentration
  • at one time point
  • for one replicate

In a tidy table, those properties become columns.

That might sound abstract, so let us start with a non-tidy example.

8.2 A wide table is often the first thing you get

Suppose a plate reader exports fluorescence values like this:

import pandas as pd

wide_data = pd.DataFrame(
    {
        "time_h": [0, 2, 4],
        "WT_rep1": [120, 135, 140],
        "WT_rep2": [118, 132, 141],
        "Sensor_rep1": [125, 410, 980],
        "Sensor_rep2": [130, 430, 1020],
    }
)

wide_data
time_h WT_rep1 WT_rep2 Sensor_rep1 Sensor_rep2
0 0 120 118 125 130
1 2 135 132 410 430
2 4 140 141 980 1020

A human can read this table easily, but the column names mix three different variables:

  • strain identity
  • replicate identity
  • measurement value

That makes downstream analysis awkward.

For example, imagine asking any of these questions:

  • what is the mean fluorescence for each strain at each time point?
  • how much variation is there across replicates?
  • can we merge this table with construct metadata?
  • can we plot all strains with a common grammar later?

Those operations become easier when the table is tidy.
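As a preview of why, here is the first of those questions answered on a tidy slice of the same measurements. This is a standalone sketch; grouped summaries are covered properly in section 8.9.

```python
import pandas as pd

# A tidy slice of the same plate-reader data, with strain and replicate
# pulled out of the column names and into their own columns.
tidy = pd.DataFrame(
    {
        "time_h": [0, 0, 0, 0, 2, 2],
        "strain": ["WT", "WT", "Sensor", "Sensor", "WT", "WT"],
        "replicate": [1, 2, 1, 2, 1, 2],
        "fluorescence": [120, 118, 125, 130, 135, 132],
    }
)

# One line answers "mean fluorescence for each strain at each time point".
means = tidy.groupby(["strain", "time_h"])["fluorescence"].mean().reset_index()
```

In the wide layout, the same question requires picking out and averaging pairs of columns by name; in the tidy layout it is a single grouped operation.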

8.3 Reshaping wide data into tidy data

We can use melt() to convert the wide table into a longer, tidier format.

tidy_from_wide = wide_data.melt(
    id_vars="time_h",
    var_name="sample",
    value_name="fluorescence",
)

tidy_from_wide
time_h sample fluorescence
0 0 WT_rep1 120
1 2 WT_rep1 135
2 4 WT_rep1 140
3 0 WT_rep2 118
4 2 WT_rep2 132
5 4 WT_rep2 141
6 0 Sensor_rep1 125
7 2 Sensor_rep1 410
8 4 Sensor_rep1 980
9 0 Sensor_rep2 130
10 2 Sensor_rep2 430
11 4 Sensor_rep2 1020

Now each row is closer to a single observation, but the sample column still combines two variables: strain and replicate.

We can separate them.

parsed = tidy_from_wide["sample"].str.extract(r"(?P<strain>.+)_rep(?P<replicate>\d+)")

tidy_from_wide = pd.concat([tidy_from_wide, parsed], axis=1)
tidy_from_wide["replicate"] = tidy_from_wide["replicate"].astype(int)

tidy_from_wide = tidy_from_wide[["time_h", "strain", "replicate", "fluorescence"]]

tidy_from_wide
time_h strain replicate fluorescence
0 0 WT 1 120
1 2 WT 1 135
2 4 WT 1 140
3 0 WT 2 118
4 2 WT 2 132
5 4 WT 2 141
6 0 Sensor 1 125
7 2 Sensor 1 410
8 4 Sensor 1 980
9 0 Sensor 2 130
10 2 Sensor 2 430
11 4 Sensor 2 1020

This new table is much better for computation:

  • one row is one measurement
  • time_h, strain, and replicate are explicit variables
  • fluorescence is the measured value

From this point onward in the book, we will prefer this style.

When we talk about experimental tables, assume that we want them in tidy format unless there is a strong reason not to.

8.4 Building a tidy experiment table

Let us create a slightly richer dataset. This time, each row will represent one observation from an induction experiment.

experiment = pd.DataFrame(
    [
        {"strain": "WT", "inducer_mM": 0.0, "time_h": 4, "replicate": 1, "od600": 0.82, "fluorescence": 145},
        {"strain": "WT", "inducer_mM": 0.0, "time_h": 4, "replicate": 2, "od600": 0.80, "fluorescence": 150},
        {"strain": "WT", "inducer_mM": 1.0, "time_h": 4, "replicate": 1, "od600": 0.81, "fluorescence": 152},
        {"strain": "WT", "inducer_mM": 1.0, "time_h": 4, "replicate": 2, "od600": 0.79, "fluorescence": 149},
        {"strain": "Sensor", "inducer_mM": 0.0, "time_h": 4, "replicate": 1, "od600": 0.77, "fluorescence": 210},
        {"strain": "Sensor", "inducer_mM": 0.0, "time_h": 4, "replicate": 2, "od600": 0.75, "fluorescence": 220},
        {"strain": "Sensor", "inducer_mM": 1.0, "time_h": 4, "replicate": 1, "od600": 0.78, "fluorescence": 920},
        {"strain": "Sensor", "inducer_mM": 1.0, "time_h": 4, "replicate": 2, "od600": 0.76, "fluorescence": 980},
        {"strain": "Amplifier", "inducer_mM": 0.0, "time_h": 4, "replicate": 1, "od600": 0.73, "fluorescence": 260},
        {"strain": "Amplifier", "inducer_mM": 0.0, "time_h": 4, "replicate": 2, "od600": 0.72, "fluorescence": 255},
        {"strain": "Amplifier", "inducer_mM": 1.0, "time_h": 4, "replicate": 1, "od600": 0.74, "fluorescence": 1380},
        {"strain": "Amplifier", "inducer_mM": 1.0, "time_h": 4, "replicate": 2, "od600": 0.71, "fluorescence": 1415},
    ]
)

experiment
strain inducer_mM time_h replicate od600 fluorescence
0 WT 0.0 4 1 0.82 145
1 WT 0.0 4 2 0.80 150
2 WT 1.0 4 1 0.81 152
3 WT 1.0 4 2 0.79 149
4 Sensor 0.0 4 1 0.77 210
5 Sensor 0.0 4 2 0.75 220
6 Sensor 1.0 4 1 0.78 920
7 Sensor 1.0 4 2 0.76 980
8 Amplifier 0.0 4 1 0.73 260
9 Amplifier 0.0 4 2 0.72 255
10 Amplifier 1.0 4 1 0.74 1380
11 Amplifier 1.0 4 2 0.71 1415

This is the kind of structure that ages well.

A tidy table like this is easy to:

  • filter by strain or condition
  • normalize measurements
  • group replicates
  • compute summaries
  • merge with metadata
  • write back to disk

8.5 Inspecting a DataFrame

A pandas table is called a DataFrame.

Some of the first things we usually check are the size, column names, and data types.

experiment.shape
(12, 6)
experiment.columns.tolist()
['strain', 'inducer_mM', 'time_h', 'replicate', 'od600', 'fluorescence']
experiment.dtypes
strain               object
inducer_mM      float64
time_h            int64
replicate         int64
od600           float64
fluorescence      int64
dtype: object

And for a quick preview:

experiment.head()
strain inducer_mM time_h replicate od600 fluorescence
0 WT 0.0 4 1 0.82 145
1 WT 0.0 4 2 0.80 150
2 WT 1.0 4 1 0.81 152
3 WT 1.0 4 2 0.79 149
4 Sensor 0.0 4 1 0.77 210

This may feel routine, but it is part of good scientific hygiene. A surprising amount of debugging is simply discovering that a column has the wrong type or an unexpected name.
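When a dtype is wrong, the usual culprit is a stray text entry that silently turned a numeric column into strings. A minimal sketch of the problem and a common fix, using a hypothetical miscoded export:

```python
import pandas as pd
from io import StringIO

# A hypothetical export where one od600 entry was typed as "pending",
# which makes pandas store the entire column as strings (dtype object).
raw = pd.read_csv(StringIO("strain,od600\nWT,0.82\nSensor,pending\n"))

# pd.to_numeric with errors="coerce" recovers the numbers and turns the
# unparseable entry into NaN, so it can be handled explicitly later.
raw["od600"] = pd.to_numeric(raw["od600"], errors="coerce")
```

Checking `dtypes` before and after a conversion like this is a cheap way to catch the problem early rather than deep inside an analysis.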

8.6 Selecting columns

Column selection is straightforward.

experiment["fluorescence"]
0      145
1      150
2      152
3      149
4      210
5      220
6      920
7      980
8      260
9      255
10    1380
11    1415
Name: fluorescence, dtype: int64

To select multiple columns, pass a list of names.

experiment[["strain", "inducer_mM", "fluorescence"]]
strain inducer_mM fluorescence
0 WT 0.0 145
1 WT 0.0 150
2 WT 1.0 152
3 WT 1.0 149
4 Sensor 0.0 210
5 Sensor 0.0 220
6 Sensor 1.0 920
7 Sensor 1.0 980
8 Amplifier 0.0 260
9 Amplifier 0.0 255
10 Amplifier 1.0 1380
11 Amplifier 1.0 1415

This is useful when you want to focus on a subset of variables without modifying the original table.

8.7 Filtering rows

Because our table is tidy, filtering becomes very natural.

sensor_only = experiment[experiment["strain"] == "Sensor"]
sensor_only
strain inducer_mM time_h replicate od600 fluorescence
4 Sensor 0.0 4 1 0.77 210
5 Sensor 0.0 4 2 0.75 220
6 Sensor 1.0 4 1 0.78 920
7 Sensor 1.0 4 2 0.76 980

We can combine conditions too.

induced_sensor = experiment[
    (experiment["strain"] == "Sensor") & (experiment["inducer_mM"] == 1.0)
]

induced_sensor
strain inducer_mM time_h replicate od600 fluorescence
6 Sensor 1.0 4 1 0.78 920
7 Sensor 1.0 4 2 0.76 980

That one line already expresses a biologically meaningful question: show me the induced measurements for the sensor strain.

Sorting is equally common.

experiment.sort_values(["strain", "inducer_mM", "replicate"])
strain inducer_mM time_h replicate od600 fluorescence
8 Amplifier 0.0 4 1 0.73 260
9 Amplifier 0.0 4 2 0.72 255
10 Amplifier 1.0 4 1 0.74 1380
11 Amplifier 1.0 4 2 0.71 1415
4 Sensor 0.0 4 1 0.77 210
5 Sensor 0.0 4 2 0.75 220
6 Sensor 1.0 4 1 0.78 920
7 Sensor 1.0 4 2 0.76 980
0 WT 0.0 4 1 0.82 145
1 WT 0.0 4 2 0.80 150
2 WT 1.0 4 1 0.81 152
3 WT 1.0 4 2 0.79 149

8.8 Adding derived columns

Raw fluorescence is often less informative than fluorescence normalized by culture density.

A common first pass is to divide by OD600.

experiment = experiment.assign(
    norm_fluorescence=experiment["fluorescence"] / experiment["od600"]
)

experiment[["strain", "inducer_mM", "replicate", "norm_fluorescence"]]
strain inducer_mM replicate norm_fluorescence
0 WT 0.0 1 176.829268
1 WT 0.0 2 187.500000
2 WT 1.0 1 187.654321
3 WT 1.0 2 188.607595
4 Sensor 0.0 1 272.727273
5 Sensor 0.0 2 293.333333
6 Sensor 1.0 1 1179.487179
7 Sensor 1.0 2 1289.473684
8 Amplifier 0.0 1 356.164384
9 Amplifier 0.0 2 354.166667
10 Amplifier 1.0 1 1864.864865
11 Amplifier 1.0 2 1992.957746

This is a good example of why tidy data helps. Each row already contains the variables required for the calculation, so adding a derived column is simple and transparent.

8.9 Grouping and summarizing replicates

Biological experiments almost always involve replicate measurements.

A tidy table makes grouped summaries very convenient.

summary = (
    experiment.groupby(["strain", "inducer_mM"])
    .agg(
        n=("norm_fluorescence", "size"),
        mean_norm_fluorescence=("norm_fluorescence", "mean"),
        sd_norm_fluorescence=("norm_fluorescence", "std"),
        mean_od600=("od600", "mean"),
    )
    .reset_index()
)

summary
strain inducer_mM n mean_norm_fluorescence sd_norm_fluorescence mean_od600
0 Amplifier 0.0 2 355.165525 1.412599 0.725
1 Amplifier 1.0 2 1928.911306 90.575345 0.725
2 Sensor 0.0 2 283.030303 14.570685 0.760
3 Sensor 1.0 2 1234.480432 77.772203 0.770
4 WT 0.0 2 182.164634 7.545347 0.810
5 WT 1.0 2 188.130958 0.674066 0.800

This summary table is also tidy.

Each row now represents one summarized condition rather than one raw replicate. That is still perfectly tidy, because the observational unit has changed. The important thing is that the rows and columns remain explicit.

This distinction is worth remembering:

  • a replicate-level tidy table has one row per measurement
  • a condition-level tidy summary has one row per summarized condition

Both are tidy. They simply describe different units of analysis.

8.10 Quantifying induction

Once we have grouped summaries, we can compute simple biological metrics such as fold induction.

baseline = summary[summary["inducer_mM"] == 0.0][
    ["strain", "mean_norm_fluorescence"]
].rename(columns={"mean_norm_fluorescence": "baseline_norm_fluorescence"})

summary_with_baseline = summary.merge(baseline, on="strain")
summary_with_baseline["fold_induction"] = (
    summary_with_baseline["mean_norm_fluorescence"]
    / summary_with_baseline["baseline_norm_fluorescence"]
)

summary_with_baseline
strain inducer_mM n mean_norm_fluorescence sd_norm_fluorescence mean_od600 baseline_norm_fluorescence fold_induction
0 Amplifier 0.0 2 355.165525 1.412599 0.725 355.165525 1.000000
1 Amplifier 1.0 2 1928.911306 90.575345 0.725 355.165525 5.431021
2 Sensor 0.0 2 283.030303 14.570685 0.760 283.030303 1.000000
3 Sensor 1.0 2 1234.480432 77.772203 0.770 283.030303 4.361655
4 WT 0.0 2 182.164634 7.545347 0.810 182.164634 1.000000
5 WT 1.0 2 188.130958 0.674066 0.800 182.164634 1.032752

Now we have a compact summary that answers a real synthetic biology question: how strongly does each strain respond to inducer?

8.11 Missing values happen

Real experiments are never perfectly complete. Wells fail, cultures become contaminated, measurements are dropped, and metadata goes missing.

pandas uses special missing values such as NaN to represent absent entries.

with_missing = experiment.copy()
with_missing.loc[3, "od600"] = None
with_missing.loc[8, "fluorescence"] = None

with_missing
strain inducer_mM time_h replicate od600 fluorescence norm_fluorescence
0 WT 0.0 4 1 0.82 145.0 176.829268
1 WT 0.0 4 2 0.80 150.0 187.500000
2 WT 1.0 4 1 0.81 152.0 187.654321
3 WT 1.0 4 2 NaN 149.0 188.607595
4 Sensor 0.0 4 1 0.77 210.0 272.727273
5 Sensor 0.0 4 2 0.75 220.0 293.333333
6 Sensor 1.0 4 1 0.78 920.0 1179.487179
7 Sensor 1.0 4 2 0.76 980.0 1289.473684
8 Amplifier 0.0 4 1 0.73 NaN 356.164384
9 Amplifier 0.0 4 2 0.72 255.0 354.166667
10 Amplifier 1.0 4 1 0.74 1380.0 1864.864865
11 Amplifier 1.0 4 2 0.71 1415.0 1992.957746

We can ask how many missing values appear in each column.

with_missing.isna().sum()
strain               0
inducer_mM           0
time_h               0
replicate            0
od600                1
fluorescence         1
norm_fluorescence    0
dtype: int64

Sometimes the correct action is to remove incomplete rows.

complete_cases = with_missing.dropna(subset=["od600", "fluorescence"])
complete_cases
strain inducer_mM time_h replicate od600 fluorescence norm_fluorescence
0 WT 0.0 4 1 0.82 145.0 176.829268
1 WT 0.0 4 2 0.80 150.0 187.500000
2 WT 1.0 4 1 0.81 152.0 187.654321
4 Sensor 0.0 4 1 0.77 210.0 272.727273
5 Sensor 0.0 4 2 0.75 220.0 293.333333
6 Sensor 1.0 4 1 0.78 920.0 1179.487179
7 Sensor 1.0 4 2 0.76 980.0 1289.473684
9 Amplifier 0.0 4 2 0.72 255.0 354.166667
10 Amplifier 1.0 4 1 0.74 1380.0 1864.864865
11 Amplifier 1.0 4 2 0.71 1415.0 1992.957746

Other times we may want to keep the rows but mark that some analysis cannot yet be performed.

The important point is not to ignore missingness. Missing values are part of the data-generating process and often reflect something biologically or experimentally meaningful.
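One simple pattern for keeping incomplete rows while making their status explicit is a completeness flag. A small standalone sketch (the column names mirror our experiment table):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "strain": ["WT", "WT", "Sensor"],
        "od600": [0.82, np.nan, 0.75],
        "fluorescence": [145.0, 149.0, np.nan],
    }
)

# Mark rows where every measurement column is present, instead of dropping
# the incomplete ones. The missing rows stay visible in the table.
df["complete"] = df[["od600", "fluorescence"]].notna().all(axis=1)
```

Downstream code can then filter on `complete` where needed, while the record of what went missing is preserved.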

8.12 Reading tidy data from a CSV file

In practice, we usually load data from files rather than typing them directly into Python.

Here is a small CSV example using an in-memory text buffer.

from io import StringIO

csv_text = """strain,inducer_mM,time_h,replicate,od600,fluorescence
WT,0.0,4,1,0.82,145
WT,0.0,4,2,0.80,150
Sensor,1.0,4,1,0.78,920
Sensor,1.0,4,2,0.76,980
"""

loaded = pd.read_csv(StringIO(csv_text))
loaded
strain inducer_mM time_h replicate od600 fluorescence
0 WT 0.0 4 1 0.82 145
1 WT 0.0 4 2 0.80 150
2 Sensor 1.0 4 1 0.78 920
3 Sensor 1.0 4 2 0.76 980

On disk, the equivalent workflow would look like this:

from pathlib import Path

example_path = Path("data") / "induction_results.csv"
example_path
PosixPath('data/induction_results.csv')

If the file exists, we would read it with:

# pd.read_csv(example_path)

In a real project, the most important thing is that the CSV itself should already be organized as a tidy table whenever possible.

8.13 Merging measurement tables with metadata

Experiments often involve more than one table.

For example, one table may contain measurements, while another contains construct metadata.

metadata = pd.DataFrame(
    {
        "strain": ["WT", "Sensor", "Amplifier"],
        "plasmid": ["pControl", "pSensor", "pAmp"],
        "reporter": ["none", "GFP", "GFP"],
        "host": ["E. coli", "E. coli", "E. coli"],
    }
)

metadata
strain plasmid reporter host
0 WT pControl none E. coli
1 Sensor pSensor GFP E. coli
2 Amplifier pAmp GFP E. coli

Because both tables have a strain column, we can merge them.

annotated = experiment.merge(metadata, on="strain", how="left")
annotated
strain inducer_mM time_h replicate od600 fluorescence norm_fluorescence plasmid reporter host
0 WT 0.0 4 1 0.82 145 176.829268 pControl none E. coli
1 WT 0.0 4 2 0.80 150 187.500000 pControl none E. coli
2 WT 1.0 4 1 0.81 152 187.654321 pControl none E. coli
3 WT 1.0 4 2 0.79 149 188.607595 pControl none E. coli
4 Sensor 0.0 4 1 0.77 210 272.727273 pSensor GFP E. coli
5 Sensor 0.0 4 2 0.75 220 293.333333 pSensor GFP E. coli
6 Sensor 1.0 4 1 0.78 920 1179.487179 pSensor GFP E. coli
7 Sensor 1.0 4 2 0.76 980 1289.473684 pSensor GFP E. coli
8 Amplifier 0.0 4 1 0.73 260 356.164384 pAmp GFP E. coli
9 Amplifier 0.0 4 2 0.72 255 354.166667 pAmp GFP E. coli
10 Amplifier 1.0 4 1 0.74 1380 1864.864865 pAmp GFP E. coli
11 Amplifier 1.0 4 2 0.71 1415 1992.957746 pAmp GFP E. coli

This is one of the major reasons to preserve tidy structure. Joins become much easier when variables are explicit and consistently named.
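Two merge() options are worth knowing for exactly this situation: validate= raises an error if the join is not the shape you expect, and indicator=True records where each row found a match. A small standalone sketch:

```python
import pandas as pd

measurements = pd.DataFrame(
    {"strain": ["WT", "Sensor", "Sensor"], "fluorescence": [145, 920, 980]}
)
metadata = pd.DataFrame(
    {"strain": ["WT", "Sensor"], "plasmid": ["pControl", "pSensor"]}
)

# validate="many_to_one" fails loudly if metadata ever contains a duplicate
# strain; the _merge column shows "both" or "left_only" for every row.
annotated = measurements.merge(
    metadata, on="strain", how="left", validate="many_to_one", indicator=True
)
```

A row marked "left_only" in `_merge` is a measurement whose strain is missing from the metadata table, which is usually worth investigating before continuing.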

8.14 Selecting one table shape for downstream work

Once you begin analyzing tidy data, it is tempting to keep making presentation-friendly versions of the table. That is fine for slides or papers, but for computation it is better to keep a canonical tidy table and derive other forms when needed.

For example, if you ever need a wide table for reporting, you can create it from the tidy version.

wide_summary = summary.pivot(
    index="strain",
    columns="inducer_mM",
    values="mean_norm_fluorescence",
)

wide_summary
inducer_mM 0.0 1.0
strain
Amplifier 355.165525 1928.911306
Sensor 283.030303 1234.480432
WT 182.164634 188.130958

That is a useful display, but it is not the format we want to keep as the primary analytical table.

A good default workflow is:

  1. clean the raw data
  2. convert to tidy format
  3. do all analysis in tidy format
  4. create wide or presentation-specific views only at the end

We will follow that pattern throughout the rest of this book.

8.15 Saving processed data

After cleaning and summarizing a dataset, it is often helpful to save the result for later chapters, figures, or reports.

output_dir = Path("results")
output_dir.mkdir(exist_ok=True)

output_path = output_dir / "induction_summary.csv"
summary_with_baseline.to_csv(output_path, index=False)

output_path
PosixPath('results/induction_summary.csv')

That small step turns an analysis from a one-off calculation into a reproducible workflow artifact.
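A quick way to confirm the saved artifact is trustworthy is to read it straight back and compare it to what was written. A self-contained sketch using a temporary directory:

```python
import tempfile
from pathlib import Path

import pandas as pd

summary = pd.DataFrame(
    {"strain": ["WT", "Sensor"], "fold_induction": [1.03, 4.36]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "summary.csv"
    summary.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded table should match the one we saved, column for column.
assert reloaded.equals(summary)
```

CSV round-trips can quietly change dtypes (for example, integer columns with missing values come back as floats), so a check like this is worth the extra two lines.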

8.16 A small end-to-end example

Let us combine the main ideas of the chapter into a compact workflow.

workflow_result = (
    experiment
    .assign(norm_fluorescence=lambda df: df["fluorescence"] / df["od600"])
    .groupby(["strain", "inducer_mM"])
    .agg(mean_norm_fluorescence=("norm_fluorescence", "mean"))
    .reset_index()
    .sort_values(["strain", "inducer_mM"])
)

workflow_result
strain inducer_mM mean_norm_fluorescence
0 Amplifier 0.0 355.165525
1 Amplifier 1.0 1928.911306
2 Sensor 0.0 283.030303
3 Sensor 1.0 1234.480432
4 WT 0.0 182.164634
5 WT 1.0 188.130958

This pipeline works cleanly because the underlying table is tidy.

That is the theme to carry forward.

As datasets become larger and models become more sophisticated, tidy organization continues to pay off.

8.17 Recap

In this chapter, we learned how to:

  • represent biological experiments as pandas DataFrames
  • distinguish wide tables from tidy tables
  • reshape data into tidy format
  • inspect columns, dimensions, and data types
  • filter rows and select variables
  • create derived columns such as normalized fluorescence
  • summarize replicate-level data by condition
  • handle missing values explicitly
  • merge measurements with metadata
  • save processed outputs for reproducible analysis

Most importantly, we established a convention for the rest of the book:

From here onward, tabular experimental data should be assumed to be in tidy format unless stated otherwise.

8.18 Exercises

  1. Add a new column called log10_norm_fluorescence using base-10 logarithms. numpy's log10 works elementwise on a Series; the math module's log10 only accepts scalars.

  2. Extend the experiment table by adding a second time point, such as time_h = 8, and compute condition summaries grouped by both time and inducer.

  3. Create a metadata table that includes a promoter column and merge it with the experiment table.

  4. Starting from a wide table with columns like Sensor_0mM_rep1 and Sensor_1mM_rep1, convert it into a tidy table with separate columns for strain, inducer, and replicate.

  5. Save both the replicate-level tidy table and the condition-level summary table to separate CSV files in a results/ directory.