10  Modeling Gene Expression

Synthetic biology is not only about storing data and drawing circuits.

It is also about asking quantitative questions.

Those are modeling questions.

A model is a simplified mathematical description of a biological system.

A good model does not capture everything. It captures the things that matter for the question we are asking.

In this chapter, we will build simple models of gene expression in Python.

Our goals are practical:

This chapter is not a full course in dynamical systems.

It is a working introduction for synthetic biologists who want to think quantitatively and write code that supports that thinking.

10.1 Why model gene expression?

Experiments tell us what happened.

Models help us reason about what could happen.

That matters because synthetic biology often involves design choices before an experiment is run.

For example, you may want to compare:

  • a strong promoter vs a weak promoter
  • a stable protein vs a destabilized one
  • a low-copy plasmid vs a high-copy plasmid
  • a tightly repressed promoter vs a leaky promoter
  • a steep response curve vs a shallow one

A model lets us explore those alternatives quickly.

It does not replace experiments.

Instead, it helps with:

  • intuition building
  • experimental planning
  • parameter sensitivity analysis
  • identifying unrealistic expectations
  • communicating assumptions clearly

In synthetic biology, even simple models can be extremely useful.

10.2 Variables, parameters, and rates

A model usually contains three kinds of ingredients.

10.2.1 Variables

Variables change over time.

Examples:

  • mRNA concentration
  • protein concentration
  • inducer concentration
  • cell density

10.2.2 Parameters

Parameters are fixed for one simulation.

Examples:

  • transcription rate
  • translation rate
  • degradation rate
  • Hill coefficient
  • dissociation constant

10.2.3 Rules of change

Rules of change tell us how variables evolve.

For gene expression, a common idea is:

  • molecules are produced
  • molecules are removed or diluted

That is the core of many useful models.

10.3 The simplest expression model: production and loss

Let P be the amount of protein.

A minimal model says:

[ = - P ]

where:

  • \beta is the production rate
  • \gamma is the loss rate
  • P is the protein amount

The biological story is simple.

Protein is continuously produced, but it is also lost through degradation or dilution during growth.

This model already teaches something important.

At steady state, production and loss balance each other.

If:

[ = 0 ]

then:

[ P^* = ]

So the steady-state level depends on both production and loss.

A stronger promoter can increase expression, but so can a lower loss rate.

10.4 Computing the analytical solution

For the simple production-loss model, there is a closed-form solution.

If the initial amount is zero, then:

[ P(t) = (1 - e^{-t}) ]

Let us compute that in Python.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

beta = 4.0
gamma = 0.5

time = np.linspace(0, 12, 121)
protein = beta / gamma * (1 - np.exp(-gamma * time))

analytic_df = pd.DataFrame(
    {
        "time_h": time,
        "species": "protein",
        "value": protein,
        "model": "analytic",
        "condition": "baseline",
    }
)

analytic_df.head()
time_h species value model condition
0 0.0 protein 0.000000 analytic baseline
1 0.1 protein 0.390165 analytic baseline
2 0.2 protein 0.761301 analytic baseline
3 0.3 protein 1.114336 analytic baseline
4 0.4 protein 1.450154 analytic baseline

This is a tidy time-course table.

Each row is one observation of one variable at one time under one condition.

That is the convention we will keep using.

Even simulated data should be stored in tidy format when possible.

10.5 Visualizing the approach to steady state

plt.figure(figsize=(7, 4))
plt.plot(analytic_df["time_h"], analytic_df["value"])
plt.axhline(beta / gamma, linestyle="--")
plt.xlabel("Time (h)")
plt.ylabel("Protein amount")
plt.title("Protein approaches a steady state")
plt.show()

The dashed line is the steady-state value \beta / \gamma.

The curve rises quickly at first, then slows as loss catches up to production.

That shape appears everywhere in biology.

10.6 Simulating the same model with Euler’s method

Many biological models do not have neat analytical solutions.

So we need a numerical method.

A very common first method is Euler’s method.

The idea is simple.

If the current state is P, and the rate of change is:

[ = f(P) ]

then over a small time step dt, we approximate:

[ P_{next} = P + dt f(P) ]

For our model:

[ P_{next} = P + dt(- P) ]

Let us implement that.

def simulate_protein(beta, gamma, t_end=12, dt=0.1, p0=0.0, condition="baseline"):
    times = np.arange(0, t_end + dt, dt)
    p = p0
    records = []

    for t in times:
        records.append(
            {
                "time_h": t,
                "species": "protein",
                "value": p,
                "condition": condition,
                "model": "euler",
            }
        )
        dp_dt = beta - gamma * p
        p = p + dt * dp_dt

    return pd.DataFrame(records)

euler_df = simulate_protein(beta=4.0, gamma=0.5)
euler_df.head()
time_h species value condition model
0 0.0 protein 0.00000 baseline euler
1 0.1 protein 0.40000 baseline euler
2 0.2 protein 0.78000 baseline euler
3 0.3 protein 1.14100 baseline euler
4 0.4 protein 1.48395 baseline euler

We can compare the numerical and analytical solutions.

comparison_df = pd.concat([analytic_df, euler_df], ignore_index=True)
comparison_df.head()
time_h species value model condition
0 0.0 protein 0.000000 analytic baseline
1 0.1 protein 0.390165 analytic baseline
2 0.2 protein 0.761301 analytic baseline
3 0.3 protein 1.114336 analytic baseline
4 0.4 protein 1.450154 analytic baseline
plt.figure(figsize=(7, 4))

for model_name, subset in comparison_df.groupby("model"):
    plt.plot(subset["time_h"], subset["value"], label=model_name)

plt.xlabel("Time (h)")
plt.ylabel("Protein amount")
plt.title("Analytical and Euler solutions are close")
plt.legend()
plt.show()

For a small enough dt, Euler’s method does a good job here.

That is one of the first practical lessons of modeling:

  • the biology matters
  • the mathematics matters
  • the numerical method matters too

10.7 Time scales matter

The parameter \gamma controls how quickly the system responds.

If \gamma is large, the system adjusts quickly.

If \gamma is small, the system changes slowly.

We can compare several degradation or dilution rates.

rate_sweep = pd.concat(
    [
        simulate_protein(beta=4.0, gamma=0.2, condition="gamma=0.2"),
        simulate_protein(beta=4.0, gamma=0.5, condition="gamma=0.5"),
        simulate_protein(beta=4.0, gamma=1.0, condition="gamma=1.0"),
    ],
    ignore_index=True,
)

rate_sweep.head()
time_h species value condition model
0 0.0 protein 0.000000 gamma=0.2 euler
1 0.1 protein 0.400000 gamma=0.2 euler
2 0.2 protein 0.792000 gamma=0.2 euler
3 0.3 protein 1.176160 gamma=0.2 euler
4 0.4 protein 1.552637 gamma=0.2 euler
plt.figure(figsize=(7, 4))
for condition, subset in rate_sweep.groupby("condition"):
    plt.plot(subset["time_h"], subset["value"], label=condition)

plt.xlabel("Time (h)")
plt.ylabel("Protein amount")
plt.title("Loss rate changes both speed and steady state")
plt.legend()
plt.show()

Notice what changed.

When \gamma increases:

  • the system responds faster
  • the steady-state level falls because \beta / \gamma becomes smaller

That is a powerful design idea.

Destabilizing a protein can make a system faster, but often at the cost of lower signal.

10.8 Building a tidy summary table

Because our simulations are already tidy, summarizing them is straightforward.

Let us extract the final simulated value for each condition.

final_values = (
    rate_sweep.sort_values("time_h")
    .groupby(["condition", "species"], as_index=False)
    .tail(1)
    .loc[:, ["condition", "species", "value"]]
    .rename(columns={"value": "final_value"})
    .reset_index(drop=True)
)

final_values
condition species final_value
0 gamma=1.0 protein 3.999987
1 gamma=0.5 protein 7.983021
2 gamma=0.2 protein 18.229243

This is exactly why tidy data is useful after Chapter 5.

We do not need a special case for simulated results.

We can filter, group, summarize, and merge them the same way we handled experimental data.

10.9 A two-stage model: mRNA and protein

The one-variable model is useful, but gene expression usually has at least two conceptual stages:

  1. transcription produces mRNA
  2. translation produces protein from mRNA

A simple two-stage model is:

[ = - _m m ]

[ = k_{tl} m - _p p ]

where:

  • m is mRNA amount
  • p is protein amount
  • \alpha is transcription rate
  • \delta_m is mRNA loss rate
  • k_{tl} is translation rate
  • \delta_p is protein loss rate

This model helps us separate fast RNA dynamics from slower protein accumulation.

10.10 Simulating the two-stage model

def simulate_mrna_protein(
    alpha,
    delta_m,
    k_tl,
    delta_p,
    t_end=12,
    dt=0.05,
    m0=0.0,
    p0=0.0,
    condition="baseline",
):
    times = np.arange(0, t_end + dt, dt)
    m = m0
    p = p0
    records = []

    for t in times:
        records.append({"time_h": t, "species": "mRNA", "value": m, "condition": condition})
        records.append({"time_h": t, "species": "protein", "value": p, "condition": condition})

        dm_dt = alpha - delta_m * m
        dp_dt = k_tl * m - delta_p * p

        m = m + dt * dm_dt
        p = p + dt * dp_dt

    return pd.DataFrame(records)

two_stage_df = simulate_mrna_protein(
    alpha=6.0,
    delta_m=2.0,
    k_tl=3.0,
    delta_p=0.4,
)

two_stage_df.head()
time_h species value condition
0 0.00 mRNA 0.00 baseline
1 0.00 protein 0.00 baseline
2 0.05 mRNA 0.30 baseline
3 0.05 protein 0.00 baseline
4 0.10 mRNA 0.57 baseline

Again, this is tidy data.

Each row contains:

  • one time point
  • one species
  • one value
  • one condition

That structure is flexible enough for plotting and analysis.

plt.figure(figsize=(7, 4))
for species, subset in two_stage_df.groupby("species"):
    plt.plot(subset["time_h"], subset["value"], label=species)

plt.xlabel("Time (h)")
plt.ylabel("Amount")
plt.title("mRNA rises and saturates before protein")
plt.legend()
plt.show()

Typically, mRNA responds faster because its turnover is faster.

Protein often lags behind.

That delay matters when designing dynamic circuits.

10.11 Comparing promoter strengths

Suppose we want to compare a weak, medium, and strong promoter.

In this simple model, we can do that by changing \alpha, the transcription rate.

promoter_sweep = pd.concat(
    [
        simulate_mrna_protein(2.5, 2.0, 3.0, 0.4, condition="weak promoter"),
        simulate_mrna_protein(6.0, 2.0, 3.0, 0.4, condition="medium promoter"),
        simulate_mrna_protein(10.0, 2.0, 3.0, 0.4, condition="strong promoter"),
    ],
    ignore_index=True,
)

protein_only = promoter_sweep[promoter_sweep["species"] == "protein"]
protein_only.head()
time_h species value condition
1 0.00 protein 0.000000 weak promoter
3 0.05 protein 0.000000 weak promoter
5 0.10 protein 0.018750 weak promoter
7 0.15 protein 0.054000 weak promoter
9 0.20 protein 0.103733 weak promoter
plt.figure(figsize=(7, 4))
for condition, subset in protein_only.groupby("condition"):
    plt.plot(subset["time_h"], subset["value"], label=condition)

plt.xlabel("Time (h)")
plt.ylabel("Protein amount")
plt.title("Promoter strength changes protein dynamics")
plt.legend()
plt.show()

This is a very common modeling use case.

Before building three constructs, we can explore whether the expected response differences are likely to be large enough to matter.

10.12 Induction and Hill functions

So far, production has been constant.

But many synthetic biology systems are regulated.

A promoter may respond to an inducer, a repressor, or an activator.

A common approximation is the Hill function.

10.12.1 Activation

A simple activation function is:

[ (x) = ]

where:

  • x is inducer or activator concentration
  • K is the half-response concentration
  • n is the Hill coefficient

10.12.2 Repression

A simple repression function is:

[ (x) = ]

These functions are not perfect descriptions of every promoter.

But they are extremely useful approximations.

10.13 Coding Hill functions

def hill_activation(x, K, n):
    x = np.asarray(x, dtype=float)
    return (x**n) / (K**n + x**n)


def hill_repression(x, K, n):
    x = np.asarray(x, dtype=float)
    return (K**n) / (K**n + x**n)

inducer = np.linspace(0, 100, 201)
activation = hill_activation(inducer, K=20, n=2)
repression = hill_repression(inducer, K=20, n=2)

hill_df = pd.DataFrame(
    {
        "inducer_uM": np.concatenate([inducer, inducer]),
        "response": np.concatenate([activation, repression]),
        "relationship": ["activation"] * len(inducer) + ["repression"] * len(inducer),
    }
)

hill_df.head()
inducer_uM response relationship
0 0.0 0.000000 activation
1 0.5 0.000625 activation
2 1.0 0.002494 activation
3 1.5 0.005594 activation
4 2.0 0.009901 activation
plt.figure(figsize=(7, 4))
for relationship, subset in hill_df.groupby("relationship"):
    plt.plot(subset["inducer_uM"], subset["response"], label=relationship)

plt.xlabel("Inducer concentration (uM)")
plt.ylabel("Normalized response")
plt.title("Hill activation and repression")
plt.legend()
plt.show()

A Hill coefficient of n = 1 gives a gentler curve.

A larger n makes the response steeper.

That can matter a lot when designing threshold-like behavior.

10.14 Linking induction to protein production

We can plug a Hill activation function into our expression model.

Suppose a promoter has a maximum transcription rate alpha_max, and induction controls what fraction of that maximum is active.

Then we might write:

[ (x) = _{max} ]

Let us compute a steady-state dose-response curve for protein.

In the simple production-loss model, steady state is P^* = \beta / \gamma.

If we treat \beta as an inducer-dependent production term, we can sweep across inducer values.

alpha_max = 12.0
gamma = 0.6
inducer_values = np.linspace(0, 100, 41)

response_records = []
for x in inducer_values:
    beta_x = alpha_max * hill_activation(x, K=20, n=2)
    p_ss = beta_x / gamma
    response_records.append(
        {
            "inducer_uM": x,
            "steady_state_protein": p_ss,
            "K": 20,
            "hill_n": 2,
        }
    )

dose_response_df = pd.DataFrame(response_records)
dose_response_df.head()
inducer_uM steady_state_protein K hill_n
0 0.0 0.000000 20 2
1 2.5 0.307692 20 2
2 5.0 1.176471 20 2
3 7.5 2.465753 20 2
4 10.0 4.000000 20 2
plt.figure(figsize=(7, 4))
plt.plot(dose_response_df["inducer_uM"], dose_response_df["steady_state_protein"])
plt.xlabel("Inducer concentration (uM)")
plt.ylabel("Steady-state protein")
plt.title("Dose-response curve from a Hill-regulated promoter")
plt.show()

This is a classic synthetic biology use case.

You can think of the model as a way to connect:

  • inducer concentration
  • promoter activity
  • expression level

10.15 Parameter sweeps should also be tidy

From this point onward, parameter sweeps should be stored tidily too.

Let us compare different Hill coefficients.

parameter_sweep_records = []

for n_value in [1, 2, 4]:
    for x in inducer_values:
        beta_x = alpha_max * hill_activation(x, K=20, n=n_value)
        p_ss = beta_x / gamma
        parameter_sweep_records.append(
            {
                "inducer_uM": x,
                "steady_state_protein": p_ss,
                "hill_n": n_value,
                "K": 20,
                "parameter": "hill_n",
            }
        )

parameter_sweep_df = pd.DataFrame(parameter_sweep_records)
parameter_sweep_df.head()
inducer_uM steady_state_protein hill_n K parameter
0 0.0 0.000000 1 20 hill_n
1 2.5 2.222222 1 20 hill_n
2 5.0 4.000000 1 20 hill_n
3 7.5 5.454545 1 20 hill_n
4 10.0 6.666667 1 20 hill_n
plt.figure(figsize=(7, 4))
for hill_n, subset in parameter_sweep_df.groupby("hill_n"):
    plt.plot(subset["inducer_uM"], subset["steady_state_protein"], label=f"n={hill_n}")

plt.xlabel("Inducer concentration (uM)")
plt.ylabel("Steady-state protein")
plt.title("Hill coefficient changes response steepness")
plt.legend()
plt.show()

Now the sweep lives in a tidy table, so we can summarize it easily.

For example, what inducer concentration first reaches at least half-maximal output?

half_max_summary = []
max_output = parameter_sweep_df["steady_state_protein"].max()

for hill_n, subset in parameter_sweep_df.groupby("hill_n"):
    threshold = subset["steady_state_protein"].max() / 2
    above = subset[subset["steady_state_protein"] >= threshold]
    first_crossing = above["inducer_uM"].min()
    half_max_summary.append({"hill_n": hill_n, "half_max_inducer_uM": first_crossing})

half_max_df = pd.DataFrame(half_max_summary)
half_max_df
hill_n half_max_inducer_uM
0 1 15.0
1 2 20.0
2 4 20.0

10.16 Simulating an induction time course

A dose-response curve tells us steady-state behavior.

But sometimes we care about dynamics after induction.

We can simulate a protein model where the production rate depends on inducer concentration.

def simulate_induced_protein(inducer_uM, alpha_max, K, n, gamma, t_end=12, dt=0.1, condition=None):
    if condition is None:
        condition = f"inducer={inducer_uM}"

    beta = alpha_max * hill_activation(inducer_uM, K=K, n=n)
    return simulate_protein(beta=beta, gamma=gamma, t_end=t_end, dt=dt, condition=condition)

induction_timecourse_df = pd.concat(
    [
        simulate_induced_protein(0, 12.0, 20, 2, 0.6, condition="0 uM"),
        simulate_induced_protein(10, 12.0, 20, 2, 0.6, condition="10 uM"),
        simulate_induced_protein(30, 12.0, 20, 2, 0.6, condition="30 uM"),
        simulate_induced_protein(100, 12.0, 20, 2, 0.6, condition="100 uM"),
    ],
    ignore_index=True,
)

induction_timecourse_df.head()
time_h species value condition model
0 0.0 protein 0.0 0 uM euler
1 0.1 protein 0.0 0 uM euler
2 0.2 protein 0.0 0 uM euler
3 0.3 protein 0.0 0 uM euler
4 0.4 protein 0.0 0 uM euler
plt.figure(figsize=(7, 4))
for condition, subset in induction_timecourse_df.groupby("condition"):
    plt.plot(subset["time_h"], subset["value"], label=condition)

plt.xlabel("Time (h)")
plt.ylabel("Protein amount")
plt.title("Induction level changes dynamic trajectories")
plt.legend()
plt.show()

This kind of result helps answer practical design questions.

For example:

  • is basal expression acceptably low?
  • does the induced state separate enough from the uninduced state?
  • how long must we wait before measuring fluorescence?

10.17 A note on deterministic vs stochastic models

Everything we have done so far is deterministic.

That means the model gives the same answer every time for the same parameters and initial conditions.

Real gene expression is often noisy.

At low copy number, stochastic effects can matter a lot.

So why start with deterministic models?

Because they are often the right first step.

They help us understand:

  • the average behavior
  • the role of parameters
  • the structure of a design
  • whether a system is even plausible before adding more realism

Later, you may want stochastic simulation, Bayesian inference, or parameter estimation from data.

But deterministic models are the right foundation.

10.18 Choosing the right level of model complexity

Not every project needs the same model.

A useful rule of thumb is:

  • start with the simplest model that can answer your question
  • add complexity only when the simpler model fails meaningfully

For example:

Use a one-variable production-loss model when you care about:

  • rough accumulation times
  • steady-state scaling
  • promoter strength comparisons

Use a two-stage mRNA-protein model when you care about:

  • transcription vs translation separately
  • delays between RNA and protein
  • RNA instability

Use regulated production with Hill functions when you care about:

  • induction curves
  • repression curves
  • threshold behavior
  • dose-response tuning

A model is not better because it has more equations.

A model is better when it is better matched to the question.

10.19 Exercises

  1. Change the degradation rate gamma in the one-variable model and measure how long it takes to reach 90% of steady state.
  2. Modify the two-stage model so that the initial mRNA amount is nonzero. How does that change early protein accumulation?
  3. Compare two proteins with the same production rate but different degradation rates. Which one is faster? Which one reaches a higher steady state?
  4. Repeat the Hill-function sweep with different values of K. How does K shift the response curve?
  5. Build a tidy parameter sweep over both K and n, and summarize the final steady-state protein values.
  6. Add a constant leak term to the induced production model and examine basal expression.
  7. Save one of your simulated tidy tables to CSV for later analysis.

10.20 Recap

In this chapter, we introduced practical modeling of gene expression in Python.

The most important ideas are:

  • models turn biological stories into quantitative rules
  • production-loss models already explain steady states and response times
  • Euler’s method lets us simulate models numerically
  • simulated outputs should be stored in tidy data format just like experimental data
  • two-stage mRNA-protein models add biological realism while staying manageable
  • Hill functions are a practical way to model activation and repression
  • tidy parameter sweeps make it easy to compare design choices systematically

From this point onward, we will treat simulated tables the same way we treat experimental ones:

  • each row should represent one observation
  • variables should live in columns
  • conditions and parameters should be explicit
  • tidy data remains the default format for downstream analysis

That consistency is important.

It means the same Python tools can support both modeling and experiments, which is exactly what we want in synthetic biology.