Case Study
№ 01

Customer Review Intelligence System for Fintech Apps.

Why sentiment analysis fails on technical complaints - and the severity-scoring system that fixes it.

10,386 Reviews / 4 Apps / VADER + BERTopic / Severity Scoring

Artifacts

Built end-to-end as a full case study with production-ready artifacts: cleaned dataset, preprocessing pipeline, feature list, model outputs, and scoring functions.

02 Business Problem

Fintech reviews are noisy.

Some customers write long emotional complaints about minor issues. Others quietly report severe problems - frozen accounts, stuck transfers, unresolved fraud - in short, calm language that any standard sentiment tool reads as neutral or even positive.

For a fintech company, that is a product-risk failure, not a sentiment error. The goal of this project was to build a system that separates emotional noise from real operational impact, so engineering and support teams can prioritize what actually hurts users and the business.

Missing a complaint about “account locked” is not a sentiment error. It is a product-risk failure.

03 TL;DR

Problem: Standard sentiment tools (VADER, star ratings, keyword rules) miss high-severity fintech complaints because they score tone, not operational impact.
Solution: A severity-scoring NLP system that separates emotional noise from real operational issues and ranks complaints by business impact.
How: VADER baseline → confusion matrix analysis → BERTopic unsupervised topic modeling on hidden negatives → domain-specific severity scoring framework → competitive benchmarking across 4 apps.
Impact: Helps fintech teams prioritize engineering fixes, reduce support load, benchmark against competitors, and act on the complaints most likely to drive churn.
GitHub: johnkirima/Fintech-Sentiment-Intelligence-Analysis

04 What I Actually Built

A six-stage analytical system.

Step 01
Data collection & cleaning
Pulled 10K+ reviews across Venmo, Cash App, Chime, and PayPal. De-duped, normalized, and filtered non-English and empty records.
Step 02
VADER baseline
Established a tone-based sentiment baseline to quantify how often standard tooling agrees with the user's own 1–5 star rating.
Step 03
Confusion matrix
Cross-tabbed model sentiment against star rating to isolate hidden negatives: low-rated reviews scored positive or neutral.
Step 04
BERTopic
Ran unsupervised topic discovery on hidden negatives to surface the actual complaint categories (fraud, fund holds, disputes, support).
Step 05
Severity scoring
Applied a domain-specific 1–5 severity scale weighted by impact terms - fraud, lockout, dispute, outage - to rank complaints by business risk, not tone.
Step 06
Competitive benchmarking
Compared hidden-negative rate, severity profile, and topic mix across the four platforms to expose where each one is quietly failing.
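
Steps 02 and 03 can be sketched in a few lines. This is a minimal illustration, not the project's actual code. It assumes each review already carries a VADER compound score (for example from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in the `vaderSentiment` package), uses the ±0.05 cutoffs commonly recommended for that score, and treats 1–2 stars as "low-rated". The function names, field names, and toy rows are hypothetical.

```python
from collections import Counter

# Assumed cutoffs: the thresholds commonly recommended for VADER's compound score.
POS_CUTOFF = 0.05
NEG_CUTOFF = -0.05
LOW_RATING_MAX = 2  # assumption: 1-2 stars counts as a low-rated review


def vader_label(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse tone label."""
    if compound >= POS_CUTOFF:
        return "positive"
    if compound <= NEG_CUTOFF:
        return "negative"
    return "neutral"


def is_hidden_negative(stars: int, compound: float) -> bool:
    """Low star rating, but the tone model did not call it negative."""
    return stars <= LOW_RATING_MAX and vader_label(compound) != "negative"


def confusion_counts(reviews):
    """Cross-tab of (rating bucket, tone label) -> count."""
    table = Counter()
    for r in reviews:
        bucket = "low" if r["stars"] <= LOW_RATING_MAX else "other"
        table[(bucket, vader_label(r["compound"]))] += 1
    return table


def miss_rate(reviews) -> float:
    """Share of low-rated reviews that the tone model failed to flag."""
    low = [r for r in reviews if r["stars"] <= LOW_RATING_MAX]
    if not low:
        return 0.0
    hidden = sum(is_hidden_negative(r["stars"], r["compound"]) for r in low)
    return hidden / len(low)


# Toy rows with made-up scores, for illustration only.
sample = [
    {"stars": 1, "compound": 0.00},   # low rating, neutral tone -> hidden negative
    {"stars": 1, "compound": -0.70},  # low rating, negative tone -> caught
    {"stars": 2, "compound": 0.40},   # low rating, positive tone -> hidden negative
    {"stars": 5, "compound": 0.80},   # high rating, not part of the miss rate
]
```

On the toy rows, two of the three low-rated reviews are hidden negatives. The miss rate reported in this case study is the same ratio computed over the real dataset.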

05 What Worked / Key Findings

The gap between perceived sentiment and actual user experience is measurable - and uneven across providers.

Severity of Hidden Negative Reviews / VADER Miss Rate by Severity

Fig. 01: Hidden negatives - reviews the model failed to flag - cluster at the higher end of the severity scale, where impact on user trust is greatest.

Competitive Gap Heatmap / Hidden Negative Counts by App and Severity

Fig. 02: Each provider exhibits a distinct gap signature between rating-based and model-based sentiment, exposing where monitoring blind spots concentrate.

Competitive Intelligence / VADER Failure Analysis by App

Fig. 03: Benchmarking across Venmo, Cash App, Chime, and PayPal surfaces relative exposure: no platform is uniformly best, and weaknesses differ in kind, not only degree.
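
The heatmap of hidden-negative counts by app and severity is, at its core, a pivot of hidden negatives over two keys. A minimal sketch of that pivot follows. The field names, the `gap_table` and `high_severity_share` helpers, and the toy rows are hypothetical and do not reproduce the project's data.

```python
from collections import Counter

SEVERITY_LEVELS = (1, 2, 3, 4, 5)


def gap_table(hidden_negatives):
    """Return {app: [count at severity 1, ..., count at severity 5]}."""
    counts = Counter((r["app"], r["severity"]) for r in hidden_negatives)
    apps = sorted({r["app"] for r in hidden_negatives})
    return {app: [counts[(app, s)] for s in SEVERITY_LEVELS] for app in apps}


def high_severity_share(row, threshold=4):
    """Fraction of one app's hidden negatives at or above `threshold`."""
    total = sum(row)
    if total == 0:
        return 0.0
    # Severity s lives at index s - 1, so slice from threshold - 1 onward.
    return sum(row[threshold - 1:]) / total


# Toy rows, for illustration only.
toy = [
    {"app": "Venmo", "severity": 5},
    {"app": "Venmo", "severity": 2},
    {"app": "Chime", "severity": 4},
    {"app": "Chime", "severity": 5},
]
table = gap_table(toy)
```

Comparing `high_severity_share` across rows is one simple way to turn the table into a per-app gap signature. Normalizing by each app's review volume would be a reasonable refinement when the apps have very different sample sizes.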

06 Reality Check - What Failed

What didn't work, and what I had to fix.

VADER: Consistently scored calm, factual complaints ('my transfer is stuck', 'account locked', 'can't verify identity') as neutral. Useful as a baseline but unusable as a decision signal: it missed 26% of 1–2★ reviews.
Star ratings alone: Misleading. Many 3–4★ reviews described severe operational problems: 'Great UI but my money's been stuck for 2 days'. Rating ≠ complaint severity.
Keyword severity scoring: The first-pass severity score used keyword matching ('fraud', 'scam', 'locked'). It failed on slang and non-standard phrasing ('my card got cooked', 'app is chalked', 'they finessed me'), so I switched to topic-cluster-based scoring.
BERTopic (first pass): Produced noisy micro-topics dominated by app names and generic words; very short reviews ('Trash') didn't cluster meaningfully. Fixed with custom stopword lists, min_topic_size=30, and manual merging of similar clusters.
Sarcasm: VADER interprets 'Great, now my account is locked' or 'Love waiting 3 weeks for my money' as positive. Flagged as a known limitation, not solved; it would require dedicated sarcasm detection.
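
The BERTopic fix described above can be sketched as follows. This is an illustration under assumptions, not the project's configuration: only `min_topic_size=30` and the idea of a custom stopword list come from the write-up. The specific stopwords, the `build_stopwords` and `build_topic_model` helper names, and the use of scikit-learn's `CountVectorizer` to apply the list are assumptions. The `bertopic` and `scikit-learn` imports are deferred so the stopword helper runs without them.

```python
# Assumed stopwords: app names and generic filler of the kind that can
# dominate topic keywords in app-store reviews.
APP_NAME_TOKENS = ["venmo", "cashapp", "chime", "paypal"]
GENERIC_TOKENS = ["app", "apps", "use", "just", "like", "really"]


def build_stopwords(extra=()):
    """Combine app names, generic words, and any extras into one sorted list."""
    words = set(APP_NAME_TOKENS) | set(GENERIC_TOKENS)
    words.update(w.lower() for w in extra)
    return sorted(words)


def build_topic_model(stop_words, min_topic_size=30):
    """Create a BERTopic model whose topic keywords ignore the stopwords."""
    # Imported here so this module loads even when the libraries are absent.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    # The vectorizer only shapes the keyword representation of each topic;
    # the document embeddings used for clustering are left untouched.
    vectorizer = CountVectorizer(stop_words=stop_words)
    return BERTopic(vectorizer_model=vectorizer, min_topic_size=min_topic_size)
```

With the libraries installed, usage would look like `model = build_topic_model(build_stopwords())` followed by `topics, probs = model.fit_transform(docs)`, where `docs` is the list of hidden-negative review texts. Manual merging of near-duplicate clusters could then go through BERTopic's `merge_topics(docs, topics_to_merge)`; which clusters to merge remains a judgment call.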

07 Why I Built It This Way

Key decisions and the reasoning behind them.

Why VADER first: Fast, interpretable, no training data required. It is an industry-standard baseline that makes results relatable, and the 26% miss rate becomes a concrete, quantifiable business case.
Why not stop at VADER: Sentiment polarity doesn't tell you what the complaint is about. Product teams need specific issues (transfers, verification, support), not just 'negative'.
Why BERTopic over LDA / NMF: BERT embeddings capture semantic meaning better than TF-IDF on short, informal review text. HDBSCAN picks the topic count automatically; UMAP gives visual validation of cluster coherence.
Why not fine-tune a transformer: There was no labeled severity data, so fine-tuning would have required manually labeling thousands of reviews, and time and compute were limited for a portfolio project. Unsupervised BERTopic + rule-based severity got strong results without labels. Fine-tuning is on the roadmap as a next step.
Sentiment ≠ severity: Emotional tone is not business impact. A calm complaint about a failed transfer is more severe than an angry complaint about UI colors. Framing severity as a distinct output changes which errors count as failures.
Why focus on complaints, not positives: Negative reviews drive churn and revenue loss. Positive reviews are less actionable. With limited project time, I concentrated on the highest-impact analysis.
Why multi-app analysis: Comparative context shows which issues are app-specific vs. industry-wide. It enables competitive positioning and confirms the methodology generalizes across providers.
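
The "sentiment ≠ severity" decision can be made concrete with a small scoring sketch. Everything numeric here is hypothetical: the topic keys, the base severities, the impact-term list, and the one-level bump per matched term are illustrative assumptions, not the weights used in the project. The point is structural: tone never enters the function.

```python
# Hypothetical base severity per topic cluster
# (1 = cosmetic, 5 = money or account at risk).
TOPIC_BASE_SEVERITY = {
    "fraud_scam": 5,
    "fund_holds_card": 4,
    "account_disputes": 4,
    "support_bot": 3,
    "app_performance": 2,
}
DEFAULT_SEVERITY = 1  # assumption for reviews outside any known cluster

# Hypothetical impact terms; each distinct match bumps severity one level.
IMPACT_TERMS = ("fraud", "locked", "dispute", "outage", "stuck")


def severity_score(topic: str, text: str) -> int:
    """Score 1-5 from the topic cluster plus impact terms, ignoring tone."""
    base = TOPIC_BASE_SEVERITY.get(topic, DEFAULT_SEVERITY)
    lowered = text.lower()
    bump = sum(term in lowered for term in IMPACT_TERMS)
    return max(1, min(5, base + bump))  # clip to the 1-5 scale


calm = severity_score("fund_holds_card", "My transfer is stuck.")
angry = severity_score("app_performance", "I HATE the new colors!!!")
```

Under these assumed weights the calm transfer complaint outranks the angry UI complaint, which is the ordering a tone-based score tends to get backwards.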

08 Topic Discovery

BERTopic surfaces the categories sentiment alone cannot see.

/ Fraud & Scam Protection Failures (highest severity)
/ AI Support & Bot Frustration
/ Venmo Transaction Friction
/ PayPal Account Disputes
/ App Performance & Support
/ Fund Holds & Card Issues
/ Chime Banking System Issues

The 'Why' Behind the Miss / Complaint Themes per App (Hidden Negatives Only)

Fig. 04: Topic concentration varies sharply by platform: the shape of a provider's complaint mix is itself a competitive signal.

Average Severity per Complaint Theme

Fig. 05: Fraud-adjacent topics carry the highest mean severity, regardless of how the underlying review was scored by tone.

The failure is not just misclassification: it is missing the categories that matter.

09 Business Impact & Decision Use

From sentiment metric to risk instrument.

Severity scoring helps fintech teams prioritize engineering fixes and customer support resources on the issues that actually move trust and retention, instead of the loudest ones.

Prioritization: Rank incoming complaints by severity, not volume or tone, so engineering picks up what hurts users first.
Support load: Route high-severity themes (fraud, lockouts, disputes) straight to specialist queues instead of general triage.
Competitive intelligence: See where each competitor is quietly failing - a benchmarking layer beyond aggregate star ratings.
Revenue recovery: Surface the complaint categories most correlated with churn language, so retention teams can intervene earlier.
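
The prioritization and routing ideas above could be wired up roughly as follows. This is a sketch under assumptions: the queue names, the topic keys, and the escalation threshold of 4 are hypothetical and would need to match a real support organization.

```python
# Hypothetical queue names and threshold; not taken from the project.
SPECIALIST_QUEUES = {
    "fraud_scam": "fraud-team",
    "account_lockout": "account-recovery",
    "account_disputes": "disputes-team",
}
ESCALATION_THRESHOLD = 4


def route(complaint: dict) -> str:
    """Send high-severity complaints to a specialist queue when one exists."""
    if complaint["severity"] >= ESCALATION_THRESHOLD:
        return SPECIALIST_QUEUES.get(complaint["topic"], "senior-triage")
    return "general-triage"


def prioritize(complaints):
    """Order by severity, highest first; input order breaks ties."""
    return sorted(complaints, key=lambda c: c["severity"], reverse=True)


# Toy inbox, for illustration only.
inbox = [
    {"id": 1, "topic": "app_performance", "severity": 2},
    {"id": 2, "topic": "fraud_scam", "severity": 5},
    {"id": 3, "topic": "account_lockout", "severity": 4},
    {"id": 4, "topic": "fund_holds_card", "severity": 5},
]
ordered_ids = [c["id"] for c in prioritize(inbox)]
```

Neither function looks at tone or review volume, so a short, calm fraud report is handled ahead of a long, angry complaint about layout.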

10 What I'd Improve Next

Honest next steps.

Supervised severity classifier: Label a few thousand reviews on the 1–5 severity scale and fine-tune a transformer directly on severity instead of scoring it after the fact.
Causal modeling: Move from correlation ('these topics appear in churn-adjacent reviews') to causal estimates of complaint categories on retention.
Dashboard: Ship the outputs as a lightweight internal dashboard: severity trend per app, top rising topics, week-over-week deltas.
Multi-language coverage: Extend preprocessing and models to Spanish, Portuguese, and French to match real fintech user bases.
Social media signal: Fuse app-store reviews with Twitter/Reddit complaints for earlier detection of emerging issues.
Sarcasm detection: Add a dedicated sarcasm layer - the single biggest source of remaining misclassification on positive-tone / high-severity reviews.

11 Links & Resources
