Eighth Article — Embeddings
Hook: Beyond Simple Keyword Matching
Google doesn't just match keywords — it knows "movie tickets" and "cinema passes" mean the same thing. This magic? Embeddings. Likewise, when you Google "best credit card for rewards," the results aren't just matching "credit card" and "rewards." Instead, the search engine understands you're looking for a high-ROI points program or cash-back benefits.
These numerical fingerprints turn words, images, and sounds into a language machines understand, powering everything from Netflix recommendations to ChatGPT's wit.
Why This Matters:
Embeddings are how AI bridges the gap between human language and machine logic. Without them, AI would be stuck in a world of keyword bingo, unable to understand contextual nuance like "credit risk" vs. "credit line." Embeddings are the secret sauce that makes AI truly 'get' your data — unlocking insights in trading, compliance, wealth management, and more.
What Are Embeddings?
Simple Definition:
Embeddings are numerical representations (vectors) of data — like words, images, or sounds — that capture their meaning and relationships. Similar items (e.g., "stock" and "equity") cluster closer in this mathematical space.
Analogy: Plotting Cities
Imagine plotting cities based on culture and language rather than pure geography. Paris and Brussels end up near each other; Tokyo and São Paulo sit far apart. Embeddings do this for financial terms, documents, and customer profiles — highlighting similarities and differences beyond just matching individual words.
Key Components
Three foundational pillars emerge:
1. Vector Representations
- Converting data (e.g., "mortgage rates" or "derivative contracts") into arrays of numbers (e.g., [0.25, -0.1, 0.7]).
2. Semantic Search
- Retrieving results based on meaning, not just keywords. E.g., finding "retirement investment options" when a user types "401(k) alternatives."
3. Similarity Matching
- Measuring how similar two pieces of content are — vital in detecting related transactions, flagging compliance issues, or recommending financial products.
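Similarity matching is most often implemented as cosine similarity: the angle between two vectors, with 1.0 meaning the vectors point the same way. Here is a minimal sketch in plain Python; the three-dimensional vectors are made up purely for illustration (real embeddings have hundreds of dimensions).

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vectors' magnitudes; 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (invented for this example):
stock = [0.9, 0.1, 0.3]
equity = [0.85, 0.15, 0.35]
weather = [-0.2, 0.8, -0.5]

print(cosine_similarity(stock, equity))   # close to 1.0 -> related terms
print(cosine_similarity(stock, weather))  # much lower -> unrelated terms
```

In production systems the same comparison runs over millions of vectors, which is where libraries like FAISS come in.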
How It Works
Step 1: Tokenization
Break text into words or subwords for processing:
"ChatGPT"→["Chat", "G", "PT"]"Balance sheet"→["Balance", "sheet"]"Buy side analyst report"→["Buy", "side", "analyst", "report"]
Step 2: Convert to Vectors
Use pre-trained models like Word2Vec, GloVe, or BERT:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
king_vector = model['king']  # 300-dimensional vector
For finance-specific tasks, you might load a domain-adapted model (e.g., FinBERT) that understands "yield curve," "bond spreads," and "mortgage-backed securities."
Step 3: Semantic Relationships
Math reveals meaning:
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
In banking terms:
bond - interest_rate + inflation ≈ inflation_adjusted_bond (hypothetical example)
By visualizing these vectors, you'll see synonyms cluster (e.g., "loan" and "credit"). One caveat: antonyms like "profit" and "loss" often sit close together too, because they appear in similar contexts — embeddings capture relatedness, not agreement.
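The famous king - man + woman arithmetic can be demonstrated with hand-crafted toy vectors — here an invented 2-D space where the first dimension encodes gender and the second encodes royalty. Real models learn hundreds of dimensions, none individually interpretable.

```python
import numpy as np

# Toy 2-D embeddings (made up for illustration):
# dimension 0 ~ gender, dimension 1 ~ royalty.
vectors = {
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "bond":  np.array([ 0.0, -1.0]),
}

def nearest(target, exclude):
    # Return the vocabulary word whose vector is closest (Euclidean)
    # to the target, skipping the words used to build the query.
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

result = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```

With a real Word2Vec model, the equivalent call is model.most_similar(positive=['king', 'woman'], negative=['man']).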
Real-World Applications
- Recommendation Systems: Netflix uses embeddings to suggest shows based on your viewing habits, not just genres.
- Search Engines: Google's BERT model improves results by understanding query intent (e.g., "best hiking trails for kids" vs. "extreme hiking").
- Investment Recommendation Engines: Brokerage platforms embed user profiles, portfolio preferences, and news articles to suggest opportunities (e.g., "tech stocks with moderate risk") by finding close matches in vector space.
- Anti-Money Laundering (AML): Embeddings can represent transaction descriptions, customer histories, and geographic data. Unusual or suspicious patterns (like repeated small deposits to offshore accounts) stand out as anomalies.
- Document Search & Compliance: Large institutions handle thousands of contracts, disclosure statements, and regulatory filings. With embeddings, a compliance officer can search semantically (e.g., "show me all documents related to LIBOR transitions"), not just exact matches.
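The compliance-search scenario above reduces to ranking documents by similarity to a query embedding. The sketch below uses invented pre-computed vectors (in practice they would come from a model such as sentence-transformers or FinBERT); the document titles and the query vector are assumptions for illustration.

```python
import numpy as np

# Toy pre-computed document embeddings (invented; a real pipeline would
# generate these with an embedding model).
docs = {
    "LIBOR transition addendum":        np.array([0.9, 0.1, 0.0]),
    "SOFR fallback language memo":      np.array([0.6, 0.3, 0.2]),
    "Quarterly earnings press release": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_search(query_vec, top_k=2):
    # Rank documents by cosine similarity to the query embedding.
    ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)
    return ranked[:top_k]

# Hypothetical embedding for "documents related to LIBOR transitions":
query = np.array([0.85, 0.15, 0.05])
print(semantic_search(query))
```

Note that the SOFR memo ranks highly even though it never contains the word "LIBOR" — exactly the behavior keyword search cannot deliver.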
Challenges & Best Practices
Pitfalls:
- Out-of-Vocabulary Words: Rare terms like "Zylithium" get generic vectors.
- Bias: Embeddings trained on biased data inherit stereotypes (e.g., "nurse" → female, "CEO" → male).
Pro Tips:
- Use Pre-Trained Models: Start with GloVe, BERT, or OpenAI's embeddings.
- Fine-Tune for Domains: Retrain embeddings on domain-specific text — e.g., regulatory filings for finance or clinical notes for healthcare apps.
- Dimensionality Reduction: Tools like PCA or UMAP simplify high-dimensional vectors for visualization.
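The core of PCA — center the data, then project onto the top singular vectors — fits in a few lines of NumPy. This is a minimal sketch for visualization purposes; the random 5-D "embeddings" are stand-ins for real model output, and a library like scikit-learn's PCA or UMAP would be the practical choice.

```python
import numpy as np

def pca_2d(X):
    # Reduce the rows of X to 2 dimensions via SVD (the core of PCA):
    # center the data, then project onto the top two right singular vectors.
    X_centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

# Toy 5-dimensional "embeddings" for four terms (randomly generated
# stand-ins for real model vectors).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
coords = pca_2d(X)
print(coords.shape)  # (4, 2) -> ready to scatter-plot
```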
Tools & Resources
- Hugging Face Sentence Transformers: Generate embeddings with minimal code — try sentence-transformers/all-MiniLM-L6-v2 for quick results.
- TensorFlow Embedding Projector: Visually explore embeddings in 3D — great for analysts needing to interpret clustering patterns.
- FAISS (Meta): Efficient similarity search for massive datasets. Perfect for large banks analyzing millions of transactions or documents.
Conclusion
Embeddings are the Rosetta Stone of AI — translating messy human language into structured math. By mastering them, you unlock smarter search, richer recommendations, and AI that truly understands.
Next Up:
"Memory Limits: How AI Forgets (and Remembers)" (Article 9). Explore context windows and the art of keeping AI on-task!