Financial NLP service

NLPInsight Analyzer¶

Classify financial text sentiment — and understand why domain-specific pre-training matters more than model size.

Model quality: 80.6% accuracy on noisy financial tweets.

Coverage: 98% across 74 tests, enforced by a CI coverage threshold.

Latency: 5 ms on the TF-IDF path, which stays small, fast and explainable.

Runtime size: 267 MB, with no heavy transformer dependency in the default image.

The Problem¶

Financial markets generate 10,000+ news articles/day. Manual sentiment review costs $50–100/hour per analyst. Automated classification must handle domain nuance: "revenue declined less than expected" is positive in financial context — a pattern that bag-of-words models consistently misclassify.
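To see why word order matters for this kind of text, the sketch below compares unigram and bigram counts for two illustrative headlines (not taken from the dataset) that use the same words with opposite financial meaning. The `ngram_counts` helper is hypothetical and written only for this illustration.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count word n-grams after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Illustrative headlines: same words, opposite financial meaning.
good = "costs rose less than revenue"
bad = "revenue rose less than costs"

same_unigrams = ngram_counts(good, 1) == ngram_counts(bad, 1)
same_bigrams = ngram_counts(good, 2) == ngram_counts(bad, 2)

print(same_unigrams)  # True: a unigram bag cannot tell the two apart
print(same_bigrams)   # False: bigrams such as "than revenue" differ
```

A pure unigram representation gives both headlines identical features, so any classifier built on it must assign them the same label.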

Business Translation¶

Problem

Financial text is noisy¶

Market language carries domain nuance, abbreviations and class imbalance that a generic sentiment demo can hide.

Decision

Use a lightweight production path¶

TF-IDF + Logistic Regression keeps inference fast, small and explainable for resource-constrained deployment.

Impact

Useful under cost constraints¶

The service trades some accuracy upside for latency, image size and operational simplicity.

Trade-off

FinBERT stays optional¶

The heavier transformer path is documented, but not forced into the default runtime without GPU and cost justification.

Why Accuracy Works Here (and F1-Macro as Guard Rail)¶

The model is trained on Twitter Financial News Sentiment (11,931 real tweets): noisy, informal text with stock tickers and abbreviations. The dataset has 3 classes: 58.0% neutral, 26.9% positive, 15.1% negative. Because the classes are imbalanced, F1-macro (0.748) serves as the guard rail for the minority negative class, which is the highest-value signal for risk management.

Metric	TF-IDF + LogReg (production)	FinBERT (GPU)	Why It Matters
Accuracy	80.6%	~85-88%*	Honest metric on hard, noisy tweets
F1 (weighted)	0.810	~0.85*	Overall system performance weighted by class frequency
F1 (macro)	0.748	~0.82*	Safety guard: ensures minority negative class isn't neglected

* FinBERT fine-tuning requires a GPU; these figures are estimated from published benchmarks.

80.6% on real financial tweets (vs 97% on the easier Financial PhraseBank) is an honest, defensible metric. The dataset upgrade from curated sentences to noisy tweets better demonstrates real-world NLP capability.
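As a worked check on why macro F1 acts as the guard rail, the sketch below scores a hypothetical baseline that always predicts `neutral`, using only the class shares quoted above. The helper name `f1` and the baseline itself are illustrative assumptions, not part of the service.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class shares from the dataset description.
shares = {"neutral": 0.580, "positive": 0.269, "negative": 0.151}

# Hypothetical baseline: predict "neutral" for every tweet.
# Accuracy equals the neutral share; neutral precision is that share, recall is 1.
accuracy = shares["neutral"]
per_class_f1 = {
    "neutral": f1(precision=shares["neutral"], recall=1.0),
    "positive": 0.0,  # never predicted
    "negative": 0.0,  # never predicted
}

# Macro F1 averages classes equally; weighted F1 weights them by class share.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
weighted_f1 = sum(shares[c] * per_class_f1[c] for c in shares)

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

Under these assumptions the baseline reaches 58.0% accuracy but only about 0.245 macro F1, so accuracy alone would overstate a model that ignores the minority classes.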

Architecture¶

flowchart TD
    A[Text Input] --> B[SentimentPredictor\nAuto-detect]
    B --> C{model.joblib\nexists?}
    C -->|Yes| D[TF-IDF Vectorizer\nmax 10K features]
    D --> E[LogisticRegression\nclass_weight=balanced]
    C -->|No| F{config.json\nexists?}
    F -->|Yes| G[FinBERT Tokenizer\nmax 256 tokens]
    G --> H[ProsusAI/FinBERT\n110M params]
    H --> I[Classification Head\n+ Softmax]
    E --> J[negative / neutral / positive\n+ confidence scores]
    I --> J

    style D fill:#2d6a4f,color:#fff
    style E fill:#2d6a4f,color:#fff
    style G fill:#7b2cbf,color:#fff
    style H fill:#7b2cbf,color:#fff
    style I fill:#7b2cbf,color:#fff

Green = Production path (TF-IDF, 5ms, 267 MB) · Purple = GPU path (FinBERT, 87ms, 1.4 GB)
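The auto-detect branch in the flowchart can be sketched as a file check. This is a minimal illustration, not the actual `SentimentPredictor` code: the function name `detect_backend`, the returned strings, and the error raised when neither file exists are assumptions (the flowchart does not show that case).

```python
import tempfile
from pathlib import Path


def detect_backend(model_dir: Path) -> str:
    """Pick a backend from the files present, mirroring the flowchart order."""
    if (model_dir / "model.joblib").exists():
        return "tfidf"    # TF-IDF + LogisticRegression path
    if (model_dir / "config.json").exists():
        return "finbert"  # FinBERT path
    # Assumption: the diagram does not define this case, so fail loudly.
    raise FileNotFoundError(f"no model.joblib or config.json in {model_dir}")


# Usage: once both files are present the TF-IDF path wins,
# because model.joblib is checked first.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "config.json").write_text("{}")
    first = detect_backend(model_dir)
    (model_dir / "model.joblib").write_bytes(b"")
    second = detect_backend(model_dir)

print(first, second)  # finbert tfidf
```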

Why TF-IDF in production: TF-IDF runs in 5ms (in-pod) with a 267 MB image vs FinBERT's 87ms with a 1.4 GB image. For latency-critical pipelines, the accuracy trade-off (80.6% vs ~88%) is acceptable. The training pipeline supports FinBERT fine-tuning when GPU is available.
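A minimal sketch of the production path using scikit-learn, assuming the two settings shown in the diagram (`max_features=10000`, `class_weight="balanced"`). Everything else (the step names, `max_iter`, and the six toy training sentences) is an illustrative assumption; the real training code and data are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline() -> Pipeline:
    """TF-IDF features feeding a class-balanced logistic regression."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(max_features=10_000)),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])


# Toy data for illustration only; far too small to produce a useful model.
texts = [
    "Revenue beat expectations",
    "Profit surges on strong demand",
    "Stock crashed after earnings miss",
    "Shares tumble amid inflation concerns",
    "Markets closed flat today",
    "Company schedules annual meeting",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = build_pipeline().fit(texts, labels)

# Map each class label to its predicted probability for one input.
classes = [str(c) for c in model.named_steps["clf"].classes_]
probs = model.predict_proba(["Earnings beat forecasts"])[0]
scores = {label: float(p) for label, p in zip(classes, probs)}
print(scores)
```

With `class_weight="balanced"`, scikit-learn reweights each class inversely to its frequency, which is one way to keep the 15.1% negative class from being drowned out by neutral tweets.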

Engineering Trade-Off¶

Chosen: small, fast, explainable default model. Rejected: making the heaviest model the default before the serving cost and GPU requirement are justified.

This follows the same operating principle used in the BankChurn debugging deep dive: production ML choices should be measured against runtime behavior, not only model score.

Code Review Shortcuts¶

FastAPI app · Dockerfile · Training code · Tests · K8s manifest

Operational¶

Metric	Value	Context
Test Coverage	98% (74 tests)	CI threshold: 85%
Docker Image	267 MB	`nlpinsight:v3.6.0` on Artifact Registry (no torch dependency)
Model Size	~5 MB (TF-IDF+LogReg)	Downloaded via Init Container from GCS
P50 / P95 Latency	78ms / 140ms (GCP), 100ms / 120ms (AWS)	Through ingress, Locust smoke test (6 users)

Responsible AI¶

Fairness: Per-class F1 parity monitored against a 0.90 per-class target; at a macro F1 of 0.748, at least one class is currently below that target
Drift: Sentiment distribution monitored via Prometheus (nlpinsight_predictions_total{sentiment}); shift alerts calibrated as relative change from 7-day baseline (not absolute — a market crisis legitimately shifts the distribution)
Validation: Pandera schemas for input text and label format
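The relative-change rule for drift can be sketched as follows. This is an illustration under stated assumptions: the function names, the 0.5 alert threshold, and the sample counts are hypothetical, and in the deployed service the counts would come from the `nlpinsight_predictions_total{sentiment}` counter rather than a dict.

```python
def class_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw prediction counts into per-class shares."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def relative_shift(baseline: dict[str, int], current: dict[str, int]) -> dict[str, float]:
    """Relative change in each class share versus the baseline window."""
    base = class_shares(baseline)
    now = class_shares(current)
    return {label: (now[label] - base[label]) / base[label] for label in base}


# Hypothetical counts: a 7-day baseline window and the latest day.
baseline_7d = {"negative": 1500, "neutral": 5800, "positive": 2700}
today = {"negative": 300, "neutral": 500, "positive": 200}

THRESHOLD = 0.5  # assumed: alert when a share moves by more than 50% of its baseline
shifts = relative_shift(baseline_7d, today)
alerts = sorted(label for label, s in shifts.items() if abs(s) > THRESHOLD)
print(alerts)  # ['negative']
```

Here the negative share doubles from 15% to 30%, a relative shift of 1.0, while the neutral and positive shares move by less than 30% of their baselines and stay under the assumed threshold.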

Live Prediction¶

Screenshots: Swagger UI · Sentiment Prediction

Try It¶

Single Text · Batch (up to 500) · Health Check

curl -s -X POST http://localhost:8003/predict \
  -H "Content-Type: application/json" \
  -d '{"text":"Fed raises interest rates amid inflation concerns, markets tumble"}' \
  | python3 -m json.tool

Expected: sentiment (negative), confidence (~0.7+), probabilities per class.

curl -s -X POST http://localhost:8003/predict/batch \
  -H "Content-Type: application/json" \
  -d '{"texts":["Revenue beat expectations","Stock crashed after earnings miss","Markets closed flat today"]}' \
  | python3 -m json.tool

Expected: Array of 3 predictions (positive, negative, neutral).

curl -s http://localhost:8003/health | python3 -m json.tool
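For callers who prefer Python over curl, the sketch below builds the same batch request with the standard library and splits long inputs to respect the 500-text limit. The helper names `chunked`, `build_batch_request` and `predict_all` are hypothetical; the URL, path and `texts` field are taken from the curl example above, and responses are returned as decoded JSON without assuming their exact schema.

```python
import json
import urllib.request
from typing import Iterator

BASE_URL = "http://localhost:8003"  # same host and port as the curl examples
MAX_BATCH = 500                     # batch limit stated above


def chunked(texts: list[str], size: int = MAX_BATCH) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def build_batch_request(texts: list[str]) -> urllib.request.Request:
    """Build a POST request equivalent to the /predict/batch curl call."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/predict/batch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_all(texts: list[str]) -> list:
    """Send every chunk and collect the decoded JSON responses."""
    responses = []
    for chunk in chunked(texts):
        with urllib.request.urlopen(build_batch_request(chunk), timeout=10) as resp:
            responses.append(json.load(resp))
    return responses


# Requires the service to be running locally:
# predict_all(["Revenue beat expectations", "Markets closed flat today"])
```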

📄 Full Model Card — includes metric rationale, performance benchmarks, and production decision narrative.

Last Updated: April 2026 — v3.6.0