CourtShadow · CS109

CourtShadow: Reading what courtrooms leave in the shadows.

CourtShadow is a logistic regression model trained on thousands of segments from real criminal trial transcripts. It aims to address two critical questions:

How do linguistic and case-type patterns in courtroom transcripts differ across cases pre-coded into different defendant groups?

What might these observations reveal about institutional disparities in the justice system?

CourtShadow does not measure individual prejudice or make legal recommendations. Rather, it surfaces institutional and procedural patterns reflected in courtroom language to support further qualitative and legal analysis.

Project snapshot · Interpretability-first

  • Transcripts: 42 (train + test)
  • Segments: 1,915 chunked transcript units
  • Accuracy: 87.5% (chunk-level, on the test set)
  • ROC AUC: 0.95 (discrimination)

Chunk-level view: Linguistic Environment Scores (LES) decomposed into text, context, and role features.

What CourtShadow reveals: Institutional Signals

  • Institutional differences across defendant groups
  • How speaker roles shape patterns
  • Cases with strongest environment scores
Methods

Data & Methods

CourtShadow combines manually collected criminal trial transcripts with case-level group coding. Each transcript is transformed into a sequence of linguistically interpretable segments, featurized, and passed through a logistic regression model to produce case-level Linguistic Environment Scores.

Dataset construction

  • Collected 42 publicly available criminal trial transcripts.
  • Applied case-level group coding (POC-coded vs white-coded) using external research sources.
  • Split transcripts into smaller linguistic segments based on speaker turns and text length (see the sketch after this list).
  • Merged in speaker roles (judge, prosecutor, defense, witness, defendant).
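
As an illustration of the chunking step, the following is a minimal sketch of how a transcript could be split into segments by speaker turn and length. The turn format, the regular expression, and the MAX_TOKENS cap are assumptions for illustration, not the exact rules used in the project.

    import re

    MAX_TOKENS = 120  # assumed per-segment length cap (illustrative)

    # Assumed turn format: "SPEAKER: text", e.g. "PROSECUTOR: He clearly knew ..."
    TURN_RE = re.compile(r"^([A-Z][A-Z .]+):\s*(.+)$")

    def chunk_transcript(lines):
        """Split a transcript into segments based on speaker turns and a length cap."""
        segments = []
        for line in lines:
            match = TURN_RE.match(line.strip())
            if not match:
                continue  # skip stage directions, page headers, and other non-turn lines
            speaker, text = match.group(1).strip(), match.group(2).strip()
            tokens = text.split()
            # Long turns are broken into multiple segments; short turns stay whole.
            for start in range(0, len(tokens), MAX_TOKENS):
                segments.append({
                    "speaker": speaker,
                    "text": " ".join(tokens[start:start + MAX_TOKENS]),
                })
        return segments
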
Modeling the environment

Model pipeline

CourtShadow trains a logistic regression model on individual transcript segments. Each segment is mapped to a probability that it comes from a case pre-coded as POC-coded vs white-coded. These segment scores are then aggregated into case-level Linguistic Environment Scores and decomposed by feature family.

  1. Chunk transcripts: Split each trial transcript into smaller segments based on speaker turns and length constraints.
  2. Featurize segments: Turn each segment into a 38-dimensional feature vector capturing text, speaker roles, and contextual structure.
  3. Train logistic regression: Fit a segment-level logistic regression classifier using only training cases with known group coding.
  4. Predict on test cases: Generate segment-level probabilities for held-out test cases, treating group coding as the target label.
  5. Aggregate & decompose: Aggregate segment predictions into case-level Linguistic Environment Scores (LES) and decompose contributions by feature family (text, context, roles/meta).

From transcript to score

The model never sees race directly. It only reads who speaks, how they speak, and what is on the record — then learns which environments tend to belong to POC-coded vs white-coded cases.

Transcript segments
JUDGE “Mr. Smith, when did you first see the firearm?”
PROSECUTOR “He clearly knew what he was doing that night.”
DEFENSE “Your Honor, given his background and remorse, we ask for leniency.”
WITNESS “I saw officers recover a firearm and narcotics from the car.”
Segment features
tokens = 32 · sentences = 1 · question_marks = 1
politeness ↑ · mitigation ↑ · harshness ↓
first-person rate · second-person rate
weapons topic · drugs topic · police interaction
Model signal
Segment-level P(Group = POC-coded) = 0.78

Segment scores are averaged and adjusted across all segments in a case to produce a Linguistic Environment Score at the case level.
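
As a minimal sketch of this aggregation step, the snippet below averages segment-level probabilities per case; the column names and values are hypothetical, and the additional adjustment mentioned above is not shown.

    import pandas as pd

    # Hypothetical table of segment-level predictions: one row per segment
    segments = pd.DataFrame({
        "case_id":   ["wilmington_ten"] * 3 + ["us_v_cohen"] * 2,
        "p_segment": [0.78, 0.91, 0.85, 0.30, 0.41],
    })

    # Case-level Linguistic Environment Score: mean segment probability per case
    les = segments.groupby("case_id")["p_segment"].mean().rename("LES")
    print(les)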

Understanding the Score

What comprises the Linguistic Environment Score (LES)?

CourtShadow's LES can be decomposed into 38 linguistically interpretable features per segment. The sections below walk through how each feature family works, with examples and rationale.

Feature overview

Each segment is converted into a compact vector of 38 features that capture style, stance, and criminal context. To prevent data leakage, group labels are used only as the target during training and never as inputs to the model.

  • Length & structure (3)
    How long the segment is, and how it’s punctuated.
  • Discursive framing (10)
    Politeness, harshness, mitigation, certainty, uncertainty (totals + rates).
  • Pronouns & voice (4)
    First- vs. second-person usage.
  • Topic indicators (21)
    Binary flags for crime/context keywords (guns, drugs, fraud, police, injury, etc.).

Together, these families allow CourtShadow to focus on how cases are structured and described, rather than on any explicit mention of group identity.
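
To make these families concrete, here is a minimal featurizer sketch covering a handful of representative features; the lexicons and names below are simplified assumptions, and the full model uses 38 features per segment.

    # Illustrative (assumed) lexicons; the real feature set uses richer word lists.
    POLITE = {"please", "sir", "honor"}
    HARSH = {"liar", "dangerous", "criminal"}
    FIRST_PERSON = {"i", "me", "my", "we", "our"}
    SECOND_PERSON = {"you", "your", "yours"}
    WEAPON_WORDS = {"gun", "firearm", "knife", "shooting"}
    DRUG_WORDS = {"drugs", "narcotics", "cocaine", "heroin", "meth"}

    def featurize(text):
        """Map one segment to a small dict of interpretable features (a subset of the 38)."""
        words = [w.strip(".,!?\"'").lower() for w in text.split()]
        n = max(len(words), 1)
        return {
            # Length & structure
            "chunk_tokens": len(words),
            "chunk_question_marks": text.count("?"),
            # Discursive framing (totals and rates)
            "polite_total": sum(w in POLITE for w in words),
            "polite_rate": sum(w in POLITE for w in words) / n,
            "harsh_total": sum(w in HARSH for w in words),
            # Pronouns & voice
            "first_person_rate": sum(w in FIRST_PERSON for w in words) / n,
            "second_person_rate": sum(w in SECOND_PERSON for w in words) / n,
            # Topic indicators (binary flags)
            "topic_weapons": int(any(w in WEAPON_WORDS for w in words)),
            "topic_drugs": int(any(w in DRUG_WORDS for w in words)),
        }

    print(featurize("Officers recovered a firearm and narcotics from the vehicle."))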


Length & structure (3 features)

These features capture how dense and structurally complex each segment is. Rather than looking at what is being said, these features examine how text is packaged in the transcript:

  • chunk_tokens — Number of word tokens in the segment (rough length of the turn).
  • chunk_sentences — Approximate sentence count, using ., !, and ? markers as boundaries.
  • chunk_question_marks — Number of question marks, as a proxy for direct questioning.

Short Q&A turn

“Mr. Smith, when did you first see the firearm?”

tokens: 9 sentences: 1 questions: 1

Longer narrative turn

“I went to the store, talked to the clerk, and then drove home before the officers arrived.”

tokens: 32 sentences: 1 questions: 0

CourtShadow doesn’t “read” these exchanges as a narrative. It instead treats them as structured turns whose lengths and punctuation patterns help to signal presentation style.

Why include this family?
Prior work on courtroom presentation suggests that the way speech is structured (e.g., how long a witness or lawyer talks, how often they ask questions) can offer insight into control, pressure, and credibility. These simple length and punctuation features let CourtShadow analyze the structural side of the transcript without using any information about the defendant's identity.


Discursive framing (10 features)

These features track how segments frame people and events using politeness, harshness, mitigation, and (un)certainty:

  • Politeness: counts and rates of words like “please”, “sir”, “your honor”.
  • Harshness: words like “liar”, “dangerous”, “criminal”.
  • Mitigation: words like “background”, “circumstances”, “remorse”, “treatment”, “nonviolent”.
  • Certainty/uncertainty: “clearly”, “obviously” vs. “maybe”, “might”, “unclear”.

Polite / mitigating

“Your Honor, given his background and remorse, we ask for leniency.”

politeness ↑ mitigation ↑ harshness ↓

Harsh / certain

“He is a dangerous criminal who clearly knew what he was doing.”

harshness ↑ certainty ↑ politeness ↓

CourtShadow doesn’t decide which framing is “right” — it simply notes how often different framings appear across cases.

Why include this family?
Research on legal and courtroom discourse shows that politeness, impoliteness, and hedging strategies are central to the ways in which power and credibility are framed in trials. These 'framing' features allow CourtShadow to summarize whether segments tend to soften, intensify, or justify the ways in which defendants and events are described.


Pronouns & voice (4 features)

These features count first- and second-person pronouns and normalize by segment length:

  • first_person_total / rate: “I, me, my, we, our…”
  • second_person_total / rate: “You, your, yours…”

Defendant voice

“I didn’t think I was doing anything wrong.”

first-person rate ↑ second-person rate ↓

Prosecutor voice

“You had multiple chances to comply, and you chose not to.”

second-person rate ↑ first-person rate ↓

CourtShadow uses these pronoun patterns, combined with speaker roles, as one way of capturing who is being centered or addressed in the discourse.

Why include this family?
Studies of pronoun use in institutional and courtroom discourse show that “I/we/you” choices are tightly linked to power, alignment, and “othering”. Pronoun rates help CourtShadow summarize whose perspective is dominant in different segments, as well as who is being directly addressed.


Topic indicators (21 features)

These are binary features that fire when a segment contains keywords tied to common criminal-legal contexts:

  • Violence & weapons: Guns, firearms, knives, shooting, killing, assault, murder, manslaughter.
  • Police interaction: Police, officer, cop, flee, arrest, comply, resist.
  • Financial crime: Fraud, scheme, investment, bank, account, wire transfer.
  • Drugs: Drugs, narcotics, cocaine, heroin, meth.
  • Harm & injury: Victim, injury, hospital.

Example segment

“Officers recovered a firearm and narcotics from the vehicle after the traffic stop.”

Weapons topic: on Drugs topic: on Police interaction: on

Each topic feature is a simple on/off flag, but aggregated across thousands of segments it helps show which kinds of criminal contexts dominate different cases.

Why include this family?
Prior research on sentencing and disparity shows that charge type and add-on offenses (e.g., drug, weapons, resisting arrest) are strongly tied to outcomes. Topic indicators allow CourtShadow to control for these case-type differences when interpreting linguistic patterns.

Understanding the Math Behind the Model

Mathematical snapshot

A high-level overview of the probabilistic and optimization framework behind CourtShadow’s segment-level classifier and case-level aggregation.

Bernoulli Modeling

Each segment label \( y_i \in \{0,1\} \) is treated as a Bernoulli random variable indicating whether segment \(i\) comes from a case pre-coded into Group 1 (e.g., POC-coded) or Group 0 (white-coded). The model estimates a probability: \[ p_i = P(y_i = 1 \mid x_i), \] where \(x_i\) is the 38-dimensional feature vector for segment \(i\). This defines the Bernoulli likelihood: \[ L(\theta) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1-y_i}. \]

Logistic Regression

CourtShadow uses a logistic link function to map features to probabilities: \[ p_i = \sigma(\theta^\top x_i) = \frac{1}{1 + e^{-\theta^\top x_i}}. \] Training minimizes the negative log-likelihood (cross-entropy) loss: \[ J(\theta) = -\sum_{i=1}^{n} \left(y_i \log p_i + (1-y_i)\log(1-p_i)\right), \] with gradient: \[ \nabla_\theta J(\theta) = \sum_{i=1}^n (p_i - y_i)x_i. \]
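
The snippet below is a minimal NumPy sketch of this training step, following the loss and gradient above; the learning rate, iteration count, and synthetic data are illustrative rather than the project's actual settings.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic_regression(X, y, lr=0.1, n_iters=5000):
        """Fit theta by gradient descent on the negative log-likelihood J(theta)."""
        n, d = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])   # prepend a bias column
        theta = np.zeros(d + 1)
        for _ in range(n_iters):
            p = sigmoid(Xb @ theta)            # p_i = sigma(theta^T x_i)
            grad = Xb.T @ (p - y)              # sum_i (p_i - y_i) x_i
            theta -= lr * grad / n             # averaged gradient step
        return theta

    # Toy example with random 38-dimensional features, mirroring CourtShadow's setup
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 38))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)
    theta = train_logistic_regression(X, y)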

Case-level Aggregation

For each case, segment-level probabilities are aggregated into a single case-level Linguistic Environment Score: \[ \bar{p}_{\text{case}} = \frac{1}{m}\sum_{j=1}^m p_j, \] where \(m\) is the number of segments in that case. To understand how different feature families contribute, the log-odds are decomposed as: \[ \theta^\top x = \sum_{k \in \text{family}} \theta_k x_k, \] which lets us summarize how text, context, and roles/meta each push the score up or down.
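
A minimal sketch of this decomposition is shown below; the family-to-index mapping is an assumption for illustration, and theta and x stand for a fitted weight vector and one segment's features.

    import numpy as np

    def decompose_by_family(theta, x, families):
        """Split the log-odds theta^T x into per-family contributions.

        theta    : weights including the bias at index 0
        x        : 38-dimensional feature vector (no bias entry)
        families : dict mapping family name -> list of feature indices into x
        """
        contributions = {"bias": float(theta[0])}
        for name, idx in families.items():
            idx = np.asarray(idx)
            contributions[name] = float(theta[1:][idx] @ x[idx])
        return contributions

    # Illustrative grouping of the 38 features into families (indices are assumed)
    families = {
        "structure": list(range(0, 3)),
        "framing":   list(range(3, 13)),
        "pronouns":  list(range(13, 17)),
        "topics":    list(range(17, 38)),
    }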

Model Evaluation

The Receiver Operating Characteristic (ROC) area under the curve (AUC) measures the model’s ranking ability: \[ \text{AUC} = P(\text{score}_{\text{positive}} > \text{score}_{\text{negative}}), \] i.e., the probability that a randomly chosen positive case (Group 1) receives a higher score than a randomly chosen negative case (Group 0). Calibration is assessed by comparing predicted probabilities to empirical frequencies via a reliability curve: \[ \text{reliability}(\text{bin}) = \mathbb{E}[Y \mid \hat{p} \in \text{bin}], \] where \(\hat{p}\) are the predicted probabilities grouped into bins.
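
A minimal evaluation sketch using scikit-learn is shown below; the label and score arrays are placeholders standing in for the eight case-level labels and their predicted LES values.

    import numpy as np
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import roc_auc_score

    # Placeholder case-level labels (1 = Group 1-coded) and predicted LES values
    y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
    les = np.array([0.95, 0.92, 0.88, 0.74, 0.32, 0.28, 0.41, 0.19])

    # Ranking quality: probability that a positive case outscores a negative one
    auc = roc_auc_score(y_true, les)

    # Calibration: empirical positive rate within bins of predicted probability
    frac_pos, mean_pred = calibration_curve(y_true, les, n_bins=4)

    print(f"AUC = {auc:.2f}")
    print("bin means:", mean_pred, "empirical rates:", frac_pos)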

What drives CourtShadow’s scores?

CourtShadow uses logistic regression, which allows each prediction to be decomposed into feature-level contributions. For the eight held-out test cases, I use this structure to explain which feature families push a case toward higher or lower Linguistic Environment Scores (LES).

High-LES POC-coded case

The Wilmington Ten

A held-out test case with a very high LES: P(Group = POC-coded) ≈ 0.90–0.98. The model is highly confident that its environment resembles POC-coded cases in the training data.

Connie Tyndall, one of the Wilmington Ten, in prison in Wagram, North Carolina (1976). Photo by Nancy Shia, via Wikimedia Commons, licensed under CC BY-SA 4.0.

Case snapshot. A group of civil rights activists prosecuted in Wilmington, North Carolina, in the early 1970s on arson and conspiracy charges linked to local protests.

Group coding. POC-coded (Black defendants), used as a high-LES example in this project.

At the segment level, CourtShadow assigns scores to each turn of talk. For this case, I average contributions across all segments to obtain a case-level explanation of its LES.

Structure

Long, multi-sentence segments with frequent questioning patterns push this case toward higher LES.

Framing

Elevated politeness, harshness, and mitigation rates contribute positive log-odds, signaling a charged rhetorical environment.

Topics

Topic flags for injury, assault, drugs, and police interaction set this case in a high-stakes criminal context that resembles POC-coded training cases.

Pronouns & address

Frequent second-person address (“you”) and moderate use of first-person voice also push the score upward, reflecting who is being directly addressed in the record.

In combination, these feature families yield a case whose overall structure, topics, and framing look much more like POC-coded environments in the training data than white-coded ones.

Low-LES white-coded case

United States v. Cohen

A held-out test case with a lower LES: P(Group = POC-coded) ≈ 0.32. The model reads its environment as closer to white-coded cases.

Michael D. Cohen, former personal attorney to Donald Trump. Photo by IowaPolitics.com, via Wikimedia Commons, licensed under CC BY-SA 2.0.

Case snapshot. Federal prosecution of Michael Cohen on financial and campaign-finance-related offenses, including payments made during the 2016 election cycle.

Group coding. White-coded defendant, used as a low-LES example in this project.

Using the same decomposition, I aggregate feature contributions for United States v. Cohen. Here, some features push upward toward POC-coded, while others pull the score down, toward white-coded environments.

Structure

A large number of question marks and dense Q&A turns contribute negatively, pulling the LES toward the white-coded side.

Framing

Totals for politeness and harshness act in the opposite direction from their rates, illustrating that the same family can push toward either group depending on how it appears across segments.

Topics

Keywords for drugs, weapons, and financial crime do appear, but in this case their contributions are often offset by structural and framing cues that pull the LES downward.

Pronouns & address

Second-person totals push modestly toward POC-coded, but second-person rates and other features counteract this, contributing to a lower overall LES.

The result is a case that still involves serious topics, but whose overall pattern of structure, pronouns, and framing resembles white-coded training cases more than POC-coded ones.

Patterns across all test cases

  • Recurrent families. Across all eight test cases, structural features (length, sentences, questions), framing features (politeness, harshness, mitigation), topic indicators (weapons, drugs, fraud, injury, police), and pronoun patterns repeatedly appear among the top contributors.
  • No single “magic” feature. High- and low-LES cases do not hinge on a single keyword. Instead, their scores emerge from multi-feature patterns that blend structure, topics, and framing over many segments.
  • Nuance within families. Even within one family, totals and rates can push in different directions. For example, polite_total may pull a case toward white-coded environments, while polite_rate pushes another toward POC-coded, depending on how politeness is distributed in the transcript.

Together, these explanations support the project’s central claim: CourtShadow is picking up institutional environments encoded in courtroom records, not just anchoring on a single word or superficial cue.

Model Evaluation & Key Results

On the held-out test set, CourtShadow shows strong discrimination between POC-coded and white-coded cases, especially once we aggregate from individual segments to the case level.

Overall performance

Chunk-level accuracy
87.5%
56 / 64 segments correctly classified on the test set.
Chunk-level ROC AUC
0.95
Area under the Receiver Operating Characteristic (ROC) curve for segment scores.
Case-level accuracy
100%
8 / 8 held-out test cases correctly classified (4 POC-coded, 4 white-coded). This is encouraging but based on a small number of cases.
Case-level ROC AUC
1.00
Perfect separation between POC-coded and white-coded cases on the test set. Again, this should be interpreted with caution given only 8 cases.

Calibration snapshot

At the case level, predicted Linguistic Environment Scores (LES) align well with observed frequencies in this small test set: low predicted scores correspond to white-coded cases, high scores to POC-coded cases, and mid-range scores mix both groups. The calibration curve below visualizes how predicted LES compares to empirical positive rates across bins.

How to read these numbers

CourtShadow is trained on individual transcript segments, but many of the most important inferences happen at the case level.

  • Chunk-level accuracy (87.5%). Evaluates whether each segment is classified as coming from a Group 1-coded or Group 0-coded case. This tells us the model sees systematic differences across individual turns of talk, but chunks are noisy and context-dependent.
  • Case-level accuracy (100%). Aggregates segment probabilities for each case and asks whether the overall environment looks more like a Group 1-coded or Group 0-coded case. This reflects the project’s focus: environmental differences across entire trials.
  • ROC AUC. Measures how well the model ranks Group 1-coded above Group 0-coded cases. A value near 1.0 means Group 1-coded cases tend to receive higher LES scores than Group 0-coded cases, even before choosing a threshold.
  • Calibration. Asks whether predicted LES values match empirical frequencies: among cases assigned an LES near 0.9, do we actually see roughly 90% Group 1-coded cases? On this small test set, calibration is reasonably aligned.

These results suggest that courtroom transcripts contain non-trivial, systematic signals related to defendant group coding. Because the test set is small, the numbers are best read as a proof of concept rather than a final measurement.

Chunk vs. case-level confusion

Comparing segment-level and case-level predictions on the held-out test set.

Chunk-level confusion matrix (64 test segments).
Case-level confusion matrix (8 test cases).

Case-level calibration

How predicted LES scores compare to observed Group 1-coded frequencies across bins.

Case-level calibration curve
The diagonal line shows perfect calibration. CourtShadow’s case-level LES points track this trend closely on the small test set.

LES distribution (test cases)

Distribution of Linguistic Environment Scores across held-out test cases.

Histogram of LES scores across test cases
Each bar aggregates test cases by their LES. Higher scores correspond to environments that resemble Group 1-coded cases in the training data.

Model Evaluation: Baselines & L2 Regularization

To interpret CourtShadow’s performance responsibly, I compare it to simple baselines and check that improvements are not just artifacts of class imbalance or overfitting.

Baseline comparisons

Majority-class baseline (chunks)

A trivial classifier that always predicts the more common group in the training data achieves an accuracy of roughly 67.4%.

In the training data, 1,261 of 1,870 segments belong to the majority Group 1-coded class. Always predicting Group 1-coded therefore gets 1,261 / 1,870 ≈ 0.674 of segments “correct” by construction.

CourtShadow’s chunk-level accuracy of 87.5% is therefore substantially above what we would expect from simply memorizing the majority group.

Random-chance intuition

With two groups, a perfectly uninformed classifier would hover near 50% accuracy if classes were balanced.

In this dataset, Group 1-coded segments are more common, which pushes the majority-class baseline above 50%. This makes it especially important to compare against the 67.4% majority baseline, not a naive 50% guess.

CourtShadow’s performance should be read relative to this stronger baseline, which it clearly exceeds at the chunk level.

Case-level performance vs. baselines

At the case level, always predicting the majority label would still misclassify a non-trivial share of cases.

By contrast, CourtShadow correctly classifies 8 / 8 held-out test cases, using case-level Linguistic Environment Scores that aggregate segment predictions.

This suggests that the case-level LES captures meaningful structure beyond simple label imbalance: CourtShadow is not just echoing the marginal distribution of labels, but leveraging systematic differences in how cases are structured and described.

L2 regularization: stabilizing a small-N model

Simply turning up the learning rate or number of gradient steps on an unregularized logistic regression did not improve performance on this dataset; if anything, it tended to overfit noisy segment-level patterns. Adding an L2 penalty gives the model a way to control complexity directly.

CourtShadow uses 38 interpretable features per segment but only a modest number of training cases. Without regularization, gradient descent can push some weights to extreme values that fit idiosyncratic chunks rather than general patterns.

This is exactly the regime where L2 regularization is useful: it nudges the model toward smaller, more stable coefficients instead of trying to memorize every training fluctuation.

Instead of minimizing only the negative log-likelihood \(J(\theta)\), the L2-regularized objective adds a penalty on the squared weights (excluding the bias term):

\[ J_{\text{L2}}(\theta) = J(\theta) + \lambda \sum_{j=1}^{d} \theta_j^2, \]

where \(j = 1, \dots, d\) indexes the non-bias features. The gradient becomes:

\[ \nabla_{\theta_j} J_{\text{L2}}(\theta) = \sum_{i=1}^n (p_i - y_i)x_{ij} + 2\lambda \theta_j \quad \text{for } j \ge 1, \]

and the bias coordinate (\(j = 0\)) is left unpenalized. This term gently pulls large coefficients back toward zero, discouraging the model from relying too heavily on any single feature.
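
A minimal sketch of the regularized update is shown below; it mirrors the gradient above, penalizing every coordinate except the bias, and the lambda grid at the end is illustrative.

    import numpy as np

    def l2_gradient_step(theta, Xb, y, lam, lr=0.1):
        """One gradient-descent step on the L2-regularized objective J_L2(theta).

        Xb is assumed to include a leading bias column; theta[0] is left unpenalized.
        """
        p = 1.0 / (1.0 + np.exp(-(Xb @ theta)))
        grad = Xb.T @ (p - y)                 # gradient of the unregularized loss
        grad[1:] += 2.0 * lam * theta[1:]     # add 2 * lambda * theta_j for j >= 1
        return theta - lr * grad / len(y)

    # Illustrative sweep over regularization strengths
    lambdas = [0.0, 1e-4, 1e-3, 1e-2, 1e-1]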

I sweep over a range of regularization strengths and plot performance as a function of \(\lambda\).

Very small penalties (\(\lambda \approx 0\)) behave like the unregularized model and can overfit; very large penalties underfit by flattening all weights.

A moderate value, around \(\lambda = 10^{-3}\), yields the best trade-off on held-out data: improved chunk-level accuracy and case-level calibration, with weights that remain interpretable and stable across splits.

With L2 regularization turned on, CourtShadow’s logistic regression becomes more robust than the unregularized model that merely tweaks learning rate and iteration count.

The regularized model better matches held-out performance, while preserving the clean feature-family structure used for Linguistic Environment Score explanations.

Together, these checks suggest that CourtShadow’s performance is not just a byproduct of class imbalance or overfitting: the model meaningfully outperforms simple baselines, and L2 regularization helps keep its logistic regression weights in a regime where case-level explanations remain stable and interpretable on a small test set.

Effect of L2 regularization

Model performance as a function of L2 strength \(\lambda\), sweeping from almost-unregularized to heavily-regularized regimes.

Plot showing model performance as a function of L2 regularization strength
In this project, I vary \(\lambda\) across several orders of magnitude and track held-out performance. Very small penalties overfit noisy segment-level patterns, while very large penalties underfit. The curve peaks around a moderate \(\lambda\) (on the order of \(10^{-3}\)), which is the value used for the final CourtShadow model.

Ethics & Limitations

CourtShadow is a proof-of-concept, not a diagnostic tool. This section outlines where the model is limited, how it should (and should not) be used, and why different kinds of errors matter differently in the context of the justice system.

Model limitations

How to read CourtShadow responsibly

Small-N, skewed sample

CourtShadow is trained on 42 criminal trial transcripts and 1,900+ segments, collected from publicly available cases. This is a small, convenience sample: it is not representative of all jurisdictions, case types, or defendant experiences.

Any patterns the model finds should be read as hypothesis-generating, not as population-wide estimates. Scaling this work would require much larger, systematically sampled corpora.

What the model can and cannot see

CourtShadow only “sees” what is preserved in the transcript: who speaks, how long, how they frame events, and what topics appear. It does not observe:

  • Plea negotiations or cases that never go to trial
  • Charging decisions, policing practices, or jury selection
  • Non-verbal signals, tone, or local courtroom culture

As a result, CourtShadow approximates an institutional linguistic environment, not the full process that produces case outcomes.

Not a diagnostic or decision tool

This model is designed strictly for research and illustration. It should never be used to make or recommend decisions about individual defendants, judges, prosecutors, or cases.

Instead, the intended use is to:

  • Highlight aggregate patterns in courtroom language
  • Suggest places where qualitative and legal analysis might look next
  • Support conversations about how institutional records encode disparities

In short, CourtShadow provides structured questions, not answers.

Error tradeoffs

When CourtShadow is wrong, who pays the price?

CourtShadow never predicts an individual’s race. Instead, it classifies linguistic environments into two groups (A/B-coded) and we study how often those environments line up with real-world disparities. Even so, different kinds of mistakes hurt different people.

More tolerant of possible bias (more false positives) vs. more cautious about flagging bias (more false negatives).

A false positive happens when the model flags a transcript as living in a “Group A-coded” linguistic environment when the overall pattern of language is actually closer to Group B.

  • Researchers may overestimate how often Group A-coded language appears.
  • Courts or watchdogs might chase “ghost” disparities that are not really there.
  • Communities associated with Group B could feel wrongly labeled or pathologized.

In confusion matrix terms: \( \hat{y} = 1 \) (Group A-coded) but \( y = 0 \) (Group B-coded).

A false negative happens when the model fails to flag a transcript whose language looks like other Group A-coded environments.

  • Real disparities in how people are treated may never be investigated.
  • Communities harmed by the pattern see “data-driven” proof that nothing is wrong.
  • Institutional bias can remain invisible under a veneer of objectivity.

Here, \( y = 1 \) (Group A-coded) but \( \hat{y} = 0 \) (Group B-coded).

CourtShadow is not deployed in any real courtroom. All thresholds, error tradeoffs, and groups are used in a retrospective, research-only setting to study structural patterns in language, not to make decisions about individuals.

Main takeaways

CourtShadow is a proof-of-concept that courtroom transcripts alone contain enough structured signal for a simple, interpretable model to distinguish environments associated with different defendant groups. The goal is not to prove causality, but to show that language and procedure leave a measurable imprint related to defendant race.

1. Court records encode institutional patterns

At the case level, CourtShadow correctly classifies all held-out test cases, with 100% accuracy and AUC ≈ 1.0. This goes far beyond a majority-class or random baseline and shows that who speaks, what they talk about, and how they talk differ systematically across the groups in this project.

In other words: if we only see the transcript and a 38-dimensional feature vector per segment—no race variable at all—the model can still reliably tell which environments belong to which defendant group.

2. Structure & procedure matter as much as wording

Decomposing the logistic regression weights shows that case-level Linguistic Environment Scores are driven by families of features:

  • Structure: Length, sentences, and questions per segment.
  • Framing: Politeness, harshness, mitigation, (un)certainty.
  • Topics: Weapons, drugs, fraud, police interaction, injury.
  • Pronouns & address: “I/we/you” rates by role.

The signal comes from patterns across many turns, not from a single loaded word. This supports the project’s focus on institutional environments rather than individual prejudice.

3. Case-level “shadows” highlight where to look next

Some cases with defendants of color receive very high CourtShadow scores, while many white-defendant cases cluster lower. Rather than treating the score as a verdict, the project uses it as a triage tool:

  • High-score cases flag environments whose structure, topics, and framing most resemble POC-coded trials in the training data.
  • Low-score cases provide contrast, showing which institutional patterns are more common in white-coded environments.

These shadows are an invitation for deeper qualitative and legal work, not a substitute for it.

CS109 concepts in action

CourtShadow intentionally leans on the core toolkit of CS109 rather than a black-box model, to keep every step mathematically transparent:

  • Data pipeline: manual collection of 42 criminal transcripts, cleaning, chunking into 1,900+ segments, and merging in speaker-role metadata.
  • Feature engineering: 38 interpretable features per segment spanning structure, framing, topics, and pronouns.
  • Probabilistic modeling: Bernoulli likelihood \(P(y=1 \mid x)\) with a logistic link, trained by gradient descent.
  • Model evaluation: held-out test split, majority-class and random baselines, accuracy, ROC AUC, and calibration curves.
  • Regularization: L2-penalized objective \(J_{\text{L2}}(\theta) = J(\theta) + \lambda \sum_j \theta_j^2\), tuned over a grid of \(\lambda\) values to balance bias–variance.
  • Interpretability: decomposing \(\theta^\top x\) into family-level contributions to build case-level explanations, not just a single score.

The result is a model that is statistically grounded, regularized, and explainable end-to-end, in the spirit of CS109.

An Intentional Design: beyond a single number

A major design choice in CourtShadow is to treat interpretability as a feature, not an afterthought:

  • Building Linguistic Environment Scores by aggregating segment-level probabilities into case-level views.
  • Presenting side-by-side case studies (e.g., The Wilmington Ten vs. United States v. Cohen) to show how feature families behave in concrete trials.
  • Using custom visualizations—confusion matrices, calibration curves, and contribution bars—to make abstract metrics legible to non-technical audiences.

Together, the modeling decisions and visuals turn a logistic regression into a narrative tool for thinking about institutional bias.

Where this project could go next

CourtShadow is deliberately small-N and interpretable. With more time and data, natural extensions would include:

  • Scaling to larger, more diverse transcript corpora and jurisdictions.
  • Comparing interpretable models to richer text encoders (e.g., embeddings) while keeping case-level explanations central.
  • Integrating formal fairness diagnostics and domain-expert annotation to evaluate how different modeling choices impact conclusions.

The current version is a first pass that shows it is possible to model courtroom “shadows” with transparent tools—and that doing so raises questions worth exploring far beyond a single CS109 project.

Real-world Applications

While CourtShadow is a research demonstration rather than any kind of diagnostic or legal decision tool, its approach intersects meaningfully with ongoing policy conversations—especially those shaped by California’s Racial Justice Act (RJA).

Think of this section as a policy orbit: how a small, interpretable CS109 model touches large debates about racial justice, evidence, and courtroom language.

Policy spotlight

The Racial Justice Act: a shift in how bias is proven

Enacted in 2020 and expanded in 2022, California’s Racial Justice Act (RJA) transformed how defendants may demonstrate racial bias in criminal proceedings. Historically, challenges invoking racial disparities ran into the high evidentiary bar established by McCleskey v. Kemp (1987), where the U.S. Supreme Court ruled that statistical evidence of systemic disparities was insufficient without proof of intentional discrimination in an individual case.

The RJA changes that standard. Under the Act, defendants may now seek relief—including reduced sentences—when they can show that racial bias (explicit or implicit) “was a factor” in charging, convictions, sentencing, or courtroom conduct— without needing to prove purposeful discrimination by a specific actor.

For a detailed overview, see Stanford Law’s analysis of the RJA.

Research trajectory

How CourtShadow could support future RJA-related research

Because the RJA explicitly allows defendants to use patterns in courtroom language, procedure, and treatment as evidence, projects like CourtShadow gesture toward a potential research direction: studying whether the institutional environment of courtroom transcripts differs systematically across groups. CourtShadow does not measure bias or make case-level legal claims, but it illustrates how linguistic and structural features of the record can be quantified, aggregated, and compared.

In principle, models of this type could someday help:

  • identify systematic patterns in courtroom speech that merit deeper qualitative or legal review;
  • assist researchers examining how procedural treatment varies across groups over many transcripts;
  • support policy discussions on how institutional language reflects broader disparities.

Any such application would require far larger datasets, domain-expert review, and safeguards against misuse. CourtShadow’s role is illustrative, not diagnostic.

Historical arc

A historical note: McCleskey & CourtShadow

The RJA was crafted in direct response to the jurisprudential barrier erected by McCleskey v. Kemp, which held that even striking statewide racial disparities in death-sentencing data did not meet the burden of showing intentional discrimination in an individual case. One interesting connection: the McCleskey case itself is included in CourtShadow’s training set.

Although the model does not know the defendant’s race and uses no explicit racial variables, it is trained on segments that come from historically significant proceedings—including one at the center of the RJA’s legislative origins.

This underscores a broader point: the language of the courtroom, captured in transcripts, is part of the historical record that shapes present-day reforms. A model like CourtShadow helps highlight that this language is quantifiable, interpretable, and connected to longstanding debates about structural equity.

CourtShadow stays firmly in the realm of research—but the shadows it maps align with live legal and policy questions about how courts speak, record, and remember.

Project Documents

Below are downloadable and viewable versions of the final write-up and all appendices (CS109 Challenge Materials).

Final Write-Up

Full report summarizing goals, methods, results, and implications.

Download PDF

Appendix A — Dataset Construction & Features

Segmentation, cleaning, and normalization pipeline alongside feature definitions.

Download PDF

Appendix B — Mathematical Details and Derivations

All equations and derivations used in constructing and evaluating the model.

Download PDF

Appendix C — Model Evaluation

Accuracy, ROC curve, confusion matrices, LES calibration, and L2 Regularization plots.

Download PDF

Appendix D — Preprocessing and Segmentation Algorithms

In-depth description of the preprocessing pipeline through which raw text was transformed.

Download PDF

Appendix E — Ethical Considerations and Racial Justice Act Context

Important information regarding the ethical and appropriate usage of this tool.

Download PDF
Creator

Diya Karan

nadiya@stanford.edu

I’m an undergraduate student at Stanford studying the subtle and institutional structures that shape courtroom language. CourtShadow is part of my broader work on transparency, interpretability, and the responsible use of computational methods in legal contexts.

Sophomore • CS + Political Science • Stanford University