What are Word Embeddings?
A dictionary explains a word using other words—a human-readable representation. A word embedding represents a word using a list of numbers—a machine-readable representation. Rather than hand-coding these numbers, we learn them from data, so that words with similar meanings end up with similar vectors.
The learning principle behind word embeddings is the distributional hypothesis (Harris, 1954): a word's meaning can be inferred from the words it co-occurs with. Co-occurrence counting is one approach; others, like Word2vec (Mikolov et al., 2013), learn by predicting context words directly, and contextual models like BERT (Devlin et al., 2019) and GPT produce different vectors for the same word depending on context. To learn embeddings, we start with a text corpus and construct a word co-occurrence matrix that counts how often pairs of words appear together within a given window.
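The counting step can be sketched in a few lines of Python. This is a toy illustration, not the pipeline behind the GloVe vectors used in this article; the corpus and vocabulary here are made up:

```python
import numpy as np

def cooccurrence(tokens, vocab, window=2):
    """Count how often vocabulary words appear within `window` tokens of each other."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        # Look at every position within `window` tokens of position i.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j and w in idx and tokens[j] in idx:
                X[idx[w], idx[tokens[j]]] += 1
    return X

corpus = "the cat sat on the mat the cat ate".split()
vocab = ["the", "cat", "sat", "on", "mat", "ate"]
X = cooccurrence(corpus, vocab)
```

Because every pair is counted from both sides, the resulting matrix is symmetric, and frequent neighbors like "the" and "cat" accumulate larger counts.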
Each row is a sparse, high-dimensional representation of a word. Words in similar contexts get similar rows, so the raw counts capture similarity—but not the kind of geometric structure that supports analogies (more on analogies below).
GloVe (Pennington et al., 2014) learns embeddings by approximately factorizing the log co-occurrence matrix. Given a co-occurrence matrix $\mathbf{X} \in \mathbb{R}^{V \times V}$, where $\mathbf{X}_{ij}$ counts how often words $i$ and $j$ appear near each other, the goal is to find a low-dimensional vector $\overrightarrow{w}_i$ for each word such that dot products between vectors, after adjusting for per-word biases, reconstruct the log counts:
$$\overrightarrow{w}_i \cdot \overrightarrow{w}_j + b_i + b_j \approx \log \mathbf{X}_{ij}$$
where $b_i$ and $b_j$ are per-word bias terms. Why the bias terms? Without them, a word like "the" that co-occurs frequently with everything would need a large vector magnitude to produce large dot products with every other word—distorting the geometry that encodes meaning. The biases $b_i$ absorb each word's baseline co-occurrence rate, freeing the vectors to focus on which words co-occur more or less than expected. The parameters are fit by weighted least squares over all observed word pairs. This factorization does more than compress the data. Because the objective is a dot product, relationships between words become linear operations on the vectors. The result is a space with approximately linear structure—which is why analogies work and the subspace methods in this article are possible.
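A toy sketch of this objective may make it concrete. Two simplifications relative to the original GloVe model: it uses a single vector per word (GloVe trains separate word and context vectors and sums them), and the co-occurrence counts here are random stand-ins rather than corpus statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 6, 5                                          # toy vocabulary size, embedding dim
X = rng.integers(1, 50, size=(V, V)).astype(float)   # pretend co-occurrence counts
X = (X + X.T) / 2                                    # co-occurrence is symmetric

W = rng.normal(scale=0.1, size=(V, D))               # word vectors to be learned
b = np.zeros(V)                                      # per-word biases

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe's weighting: down-weight rare pairs, cap frequent ones at 1."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(W, b, X):
    """Weighted squared error between dot products (plus biases) and log counts."""
    pred = W @ W.T + b[:, None] + b[None, :]
    err = pred - np.log(X)
    return np.sum(weight(X) * err ** 2)

loss = glove_loss(W, b, X)
```

Training amounts to minimizing `glove_loss` over `W` and `b` by gradient descent (the GloVe paper uses AdaGrad).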
In this article, we use 100-dimensional GloVe vectors trained on 6 billion tokens of text, normalized to unit length. All computations run in your browser.
Structure in the Embedding Space
Do embeddings capture meaningful structure? We can check by visualizing groups of related words.
Since our vectors are 100-dimensional, we project them to lower dimensions for plotting using MDS (multidimensional scaling), which finds a layout that best preserves pairwise distances. Use the eigenvalue bars to the right of each plot to switch between 1D, 2D, and 3D views. The percentage shows how much of the original distance structure is captured.
Let $S = \{\overrightarrow{w}_1, \dots, \overrightarrow{w}_n\}$ be the $D$-dimensional word vectors (for GloVe, $D = 100$) for $n$ chosen words, with pairwise Euclidean distances $d_{ij} \defeq \|\overrightarrow{w}_i - \overrightarrow{w}_j\|$. The goal of MDS is to find low-dimensional coordinates $(\mathbf{x}_1, \dots, \mathbf{x}_n)$ such that $\|\mathbf{x}_i - \mathbf{x}_j\| \approx d_{ij}$.
When the distances come from Euclidean vectors (as ours do), there is a closed-form solution known as classical MDS, which is equivalent to PCA on the centered vectors: center the data, compute the covariance matrix, and take the top eigenvectors as coordinate axes. The eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots$ measure how much distance structure each dimension captures. The fraction of variance explained by the top $m$ dimensions is $\sum_{j=1}^{m} \lambda_j \big/ \sum_{j} \lambda_j$—this is the percentage shown next to each plot.
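Classical MDS is short enough to sketch directly. The points below are random stand-ins for word vectors:

```python
import numpy as np

def classical_mds(vectors, m=2):
    """Classical MDS on Euclidean vectors: PCA on the centered data."""
    Xc = vectors - vectors.mean(axis=0)                 # center
    cov = Xc.T @ Xc / len(Xc)                           # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)              # eigh returns ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending
    coords = Xc @ eigvecs[:, :m]                        # project onto top-m axes
    explained = eigvals[:m].sum() / eigvals.sum()       # fraction of variance captured
    return coords, explained

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 10))
coords, explained = classical_mds(points, m=2)
```

With `m` equal to the full dimensionality, the projection is an orthogonal change of basis, so all pairwise distances are preserved exactly; smaller `m` trades distance fidelity for plottability, as the explained-variance fraction reports.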
An important caveat: each plot below runs MDS on a small, hand-picked set of words—not the full 50,000-word vocabulary. The low-dimensional structure looks clean precisely because we are projecting a curated subset. Running MDS on the entire vocabulary would spread the variance across many more directions, and these tidy patterns would be harder to see. The structure is real—it exists in the full 100-dimensional space—but the clarity of these plots is partly a consequence of selecting words that share a common axis of variation.
Superlatives
Adjective forms trace out parallel paths:
Numbers
Digits and their word forms occupy distinct but parallel regions:
The same ordering emerges for number words:
Written and numeric forms linked together:
Word Analogies
The superlatives already demonstrate analogies—poor is to poorer as rich is to richer—and the pattern generalizes. Analogous pairs share roughly the same vector offset. The classic example: "man is to woman as king is to queen" corresponds to:
$$\overrightarrow{\text{man}} - \overrightarrow{\text{woman}} \approx \overrightarrow{\text{king}} - \overrightarrow{\text{queen}}$$
More generally, $a : b :: c : d$ means $\overrightarrow{a} - \overrightarrow{b} \approx \overrightarrow{c} - \overrightarrow{d}$. Rearranging, we can solve for an unknown fourth word: $\overrightarrow{d} \approx \overrightarrow{c} - \overrightarrow{a} + \overrightarrow{b}$.
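A minimal nearest-neighbor implementation of this, using a hand-built 2-D toy embedding rather than real GloVe vectors (the words and coordinates are illustrative only). Excluding the three query words from the search is the standard trick; without it, the nearest neighbor is often $\overrightarrow{c}$ itself:

```python
import numpy as np

def unit(v):
    return np.asarray(v, float) / np.linalg.norm(v)

# Hypothetical toy embedding: x-axis roughly "gender", y-axis roughly "royalty".
emb = {
    "man":   unit([ 1,  0]),
    "woman": unit([-1,  0]),
    "king":  unit([ 1,  1]),
    "queen": unit([-1,  1]),
    "apple": unit([ 0, -1]),
}

def analogy(a, b, c, emb):
    """Solve a : b :: c : ? as the nearest neighbor of c - a + b (cosine similarity)."""
    target = unit(emb[c] - emb[a] + emb[b])
    candidates = {w: v @ target for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)
```

On this toy embedding, `analogy("man", "woman", "king", emb)` returns `"queen"`.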
Why does this work? "King" and "queen" appear in many of the same contexts—royalty, thrones, crowns—so their co-occurrence patterns are similar. They differ mainly in gendered contexts, and that difference is the same one that separates "man" from "woman." Because GloVe embeds words so that dot products approximate log co-occurrence counts, these shared and differing patterns become geometric structure: the vector offset $\overrightarrow{\text{king}} - \overrightarrow{\text{queen}}$ points in roughly the same direction as $\overrightarrow{\text{man}} - \overrightarrow{\text{woman}}$.
Try your own
Identifying Subspaces
The consistency of these analogies suggests that the embedding space contains interpretable subspaces—directions we can find systematically, not just stumble across in individual word pairs. The "gender direction" isn't a coincidence among a few word pairs—it's a direction that organizes many words across the space.
Consider the difference vectors for several gendered pairs: $\overrightarrow{\text{woman}} - \overrightarrow{\text{man}}$, $\overrightarrow{\text{she}} - \overrightarrow{\text{he}}$, $\overrightarrow{\text{queen}} - \overrightarrow{\text{king}}$. If gender is captured by a low-dimensional subspace, these difference vectors should all point in roughly the same direction. And they do—they are far more aligned with each other than random pairs of vectors in 100-dimensional space would be.
Gender is not the only interpretable subspace. Here are word pairs that differ primarily in size:
To find this subspace systematically, we collect many such pairs and compute a scatter matrix from their difference vectors. Given $p$ pairs $(w_i^+, w_i^-)$, let $\mathbf{d}_i \defeq \overrightarrow{w}_i^+ - \overrightarrow{w}_i^-$ be the difference vector for the $i^{\text{th}}$ pair. The scatter matrix $\mathbf{C} \defeq \sum_{i=1}^{p} \mathbf{d}_i \, \mathbf{d}_i^\top$ is a $D \times D$ positive semidefinite matrix whose eigenvectors point in the directions of greatest variation among the difference vectors. Taking the eigendecomposition $\mathbf{C} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$ and keeping the top-$k$ columns of $\mathbf{U}$ gives an orthonormal basis $\mathcal{B} = \{\overrightarrow{b}_1, \dots, \overrightarrow{b}_k\}$ for the subspace that best explains the pair differences.
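In code, the construction is a stack, a matrix product, and an eigendecomposition. The pairs below are synthetic, with differences concentrated along the first coordinate axis, so the recovered basis has a known answer:

```python
import numpy as np

def subspace_basis(pairs, emb, k):
    """Top-k eigenvectors of the scatter matrix of pair difference vectors."""
    diffs = np.stack([emb[p] - emb[m] for p, m in pairs])  # (p, D) differences
    C = diffs.T @ diffs                                    # D x D scatter matrix
    eigvals, eigvecs = np.linalg.eigh(C)                   # ascending order
    return eigvecs[:, ::-1][:, :k]                         # top-k as columns

# Synthetic pairs: each "plus" word is its "minus" partner shifted along axis 0,
# plus a little noise -- so the concept direction is (1, 0, 0, 0) by construction.
rng = np.random.default_rng(0)
D = 4
emb, pairs = {}, []
for i, base in enumerate(rng.normal(size=(3, D))):
    emb[f"plus{i}"] = base + np.array([1.0, 0, 0, 0]) + 0.01 * rng.normal(size=D)
    emb[f"minus{i}"] = base
    pairs.append((f"plus{i}", f"minus{i}"))

B = subspace_basis(pairs, emb, k=1)
```

The leading eigenvector comes back aligned (up to sign) with the planted axis.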
Since $\mathbf{C}$ is a sum of $p$ rank-one matrices, it has rank at most $p$—so with 10 word pairs, at most 10 eigenvalues are nonzero. Truncating to $k \le p$ focuses on the dominant directions and discards weaker ones. Choosing $k$ involves a tradeoff: too small and you miss secondary directions of the concept (e.g., gender correlates with both pronouns and names, which may not be collinear); too large and you start removing variation unrelated to the concept, distorting the embedding. In practice, examining the eigenvalue spectrum of $\mathbf{C}$ helps—a sharp drop after the first few eigenvalues suggests the concept is low-dimensional. Bolukbasi et al. (2016) used $k = 1$; in this article, we use $k = 10$, the full rank of our 10 defining pairs. With 10 pairs in a 100-dimensional space, this removes only 10% of the embedding dimensions, leaving the remaining structure intact.
Steering by Subspace Projection
Once we have a subspace, we can steer the embeddings by projecting it out. (This operation is called "debiasing" in Bolukbasi et al., 2016; we use "steering" to emphasize that the same technique applies to any subspace, not just bias-related ones.)
The idea is simple: to remove a concept, subtract each word's projection onto that subspace and renormalize.
$$\overrightarrow{w}_{\text{steered}} \defeq \frac{\overrightarrow{w} - \overrightarrow{w}_{\mathcal{B}}}{\|\overrightarrow{w} - \overrightarrow{w}_{\mathcal{B}}\|} \quad \text{where} \quad \overrightarrow{w}_{\mathcal{B}} \defeq \sum_{j=1}^{k} (\overrightarrow{w} \cdot \overrightarrow{b}_j) \, \overrightarrow{b}_j$$
This removes each word's component along the chosen subspace while preserving everything else. The renormalization step is not inherent to the projection—it is a consequence of working with unit-length vectors. Since we measure similarity by cosine similarity (i.e., dot products between unit vectors), we renormalize after projection to stay on the unit sphere.
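The projection and renormalization can be sketched as follows. The basis here is a random orthonormal stand-in rather than one learned from gendered pairs:

```python
import numpy as np

def steer(w, B):
    """Remove w's component in the subspace spanned by B's orthonormal columns."""
    w_B = B @ (B.T @ w)               # projection onto the subspace
    v = w - w_B                       # subtract that component
    return v / np.linalg.norm(v)      # renormalize back onto the unit sphere

rng = np.random.default_rng(0)
D, k = 100, 10
B, _ = np.linalg.qr(rng.normal(size=(D, k)))  # random orthonormal basis (stand-in)
w = rng.normal(size=D)
w /= np.linalg.norm(w)
w_steered = steer(w, B)
```

After steering, the vector is unit length and exactly orthogonal to every basis direction, which is the defining property of the projection.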
The animation below walks through the full pipeline: starting from gendered word pairs, translating their difference vectors to a common origin to reveal a shared direction, then projecting that direction out. A caveat: the animation performs the projection in the 2D space you see on screen, not in the original 100-dimensional space. If the embeddings really were two-dimensional, the animation would be exact. In 100 dimensions, the direction, the projection, and the renormalization all happen in a space we cannot visualize directly—MDS gives us a faithful summary of distances, but not of directions.
The plot below shows the real result: steering computed in the full 100-dimensional space, then projected to 2D via joint MDS over both the original and steered positions. Faded dots mark where each word started; trails show how it moved:
Steering in Action
Occupation Analogies: Before vs. After
Before steering, the analogy man : woman :: doctor : ? yields "nurse": the nearest neighbor to $\overrightarrow{\text{doctor}} + (\overrightarrow{\text{woman}} - \overrightarrow{\text{man}})$ reflects a gendered association baked into the embeddings:
After removing the gender subspace, the analogy no longer finds a gendered counterpart. Both man→woman and woman→man yield "physician":
The doctor example is not an isolated case. The table below runs the same analogy—man : woman :: occupation—across several professions, before and after steering:
Which Words Changed Most?
The plot below shows each occupation after steering, with faded dots at the original positions:
The bar chart ranks occupations by how far their embedding moved when the gender subspace was removed—measured as $\|\overrightarrow{w}_{\text{steered}} - \overrightarrow{w}\|$, the Euclidean distance between the original and steered vectors. Words with large shifts had a large gender component; words with small shifts were already mostly orthogonal to the gender direction.
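This shift measurement is a one-liner once steering is in place. The occupation vectors below are random stand-ins, so the resulting ranking is arbitrary; with real GloVe vectors, heavily gendered words like "nurse" would be expected to rank near the top:

```python
import numpy as np

rng = np.random.default_rng(1)
D, k = 100, 10

def unit(v):
    return v / np.linalg.norm(v)

B, _ = np.linalg.qr(rng.normal(size=(D, k)))   # stand-in for the learned gender basis

def steer(w, B):
    """Project out the subspace spanned by B's columns, then renormalize."""
    return unit(w - B @ (B.T @ w))

# Hypothetical occupation list; the vectors are random, not real GloVe rows.
words = ["nurse", "doctor", "engineer", "teacher"]
emb = {w: unit(rng.normal(size=D)) for w in words}

shifts = {w: np.linalg.norm(steer(v, B) - v) for w, v in emb.items()}
ranked = sorted(shifts, key=shifts.get, reverse=True)
```

Since both vectors are unit length, each shift lies between 0 (already orthogonal to the subspace) and 2 (entirely inside it).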
Discussion
Subspace projection is a surgical intervention: it removes a specific, interpretable direction from the embedding space while leaving everything else largely intact. But it has limitations worth noting.
Projecting out a subspace discards information indiscriminately: if some gender-correlated information is genuinely useful for a downstream task—distinguishing "actress" from "actor," for example—it will be lost along with the unwanted stereotypical associations. The method also assumes the concept lives in a linear subspace, which is an approximation; real-world concepts may have nonlinear structure that a flat projection cannot fully capture.
More recent work has explored alternatives: Zhao et al. (2018) proposed learning gender-neutral embeddings during training rather than correcting them post-hoc, and Ravfogel et al. (2020) proposed iteratively projecting out every direction a classifier can exploit, removing a concept more thoroughly than a single projection. These methods address some of the limitations here, but the core idea—that embeddings contain interpretable subspaces we can identify and manipulate—remains foundational. The same linear structure that makes analogies possible also makes steering possible: the geometry that encodes meaning is the geometry we can edit.
Explore
Try it yourself: type word groups below (one group per line). Words on the same line are plotted together and connected by arrows. Special line prefixes:
- `word - word`: steering pair (projects out the subspace)
- `? a : b :: c`: analogy (solves for the fourth word)
- `> label: a - b, c - d`: direction arrow (SVD of pair differences)
- `> label: w1 w2 w3`: direction arrow (PCA of word sequence)
Acknowledgments
This article grew out of a homework assignment co-written with David Mueller for a machine learning course at Johns Hopkins University.
References
- Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NeurIPS.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
- Harris, Z. S. (1954). Distributional structure. Word, 10(2-3), 146-162.
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.
- Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In EMNLP.
- Ravfogel, S., Elazar, Y., Gonen, H., Twiton, M., & Goldberg, Y. (2020). Null it out: Guarding protected attributes by iterative nullspace projection. In ACL.
- Zhao, J., Zhou, Y., Li, Z., Wang, W., & Chang, K.-W. (2018). Learning gender-neutral word embeddings. In EMNLP.