GenAI vs Crypto Scammers: Which LLM Wins

The Growing Threat of Cryptocurrency Romance Scams

Cryptocurrency romance scams have evolved into a sophisticated, multibillion-dollar criminal enterprise. These operations combine emotional manipulation with financial fraud, often lasting weeks or months before the final money extraction.

The perpetrators follow detailed scripts, systematically building trust through fake personas before introducing “investment opportunities.”

To combat this threat, I embarked on a unique research project: infiltrating scammer networks to collect real conversation data and using LLMs to automatically classify their tactics.

Methodology

Data Collection: 175 Scammer Conversations

Over 3 years, I personally interacted with 175 cryptocurrency scammers across multiple dating platforms, collecting 15,913 messages of real scam conversations.

This unprecedented dataset captures the full spectrum of scam tactics, from initial contact to final money extraction attempts.

For this analysis, I focused on the first 50 scammer conversations, totaling 3,946 messages (3,697 Japanese, 249 English, excluding emoji-only messages). Each message was manually reviewed and assigned to one of 11 scam strategy categories, identified from recurring behavioral patterns and scammer training materials, or labeled “none” for legitimate conversation.
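
To make the labeling concrete, here is the shape of a single annotated record as one might represent it; the field names below are illustrative placeholders, not the dataset’s actual schema:

```python
# One labeled message, as a plain dict. Field names are illustrative
# placeholders, not the dataset's actual schema.
labeled_message = {
    "conversation_id": 17,            # which of the 50 conversations
    "sender": "scammer",              # who sent the message
    "lang": "ja",                     # "ja" or "en"
    "text": "今日はどんな一日でしたか？",   # raw message text ("How was your day?")
    "label": "Personal Profiling",    # one of the 11 categories, or "none"
}
```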

The 11 Scam Categories

Based on analysis of scammer training materials and behavioral patterns, I identified 11 scam strategy categories:

Grooming Phase

1. Emotional Bonding – Building romantic connections, isolation tactics

2. Financial Baiting – Displaying wealth to generate interest

3. Fake Persona Building – Creating believable background stories

4. Manipulative Care – Fake concern and compliments

5. Excuse Avoidance – Avoiding video calls and meetings

Profiling Phase

6. Personal Profiling – Gathering lifestyle and family information

7. Financial Inquiry – Probing income, assets, and financial capacity

Persuasion Phase

8. Financial Education – Teaching crypto and investment “lessons”

9. Investment Pitch – Promoting fake investment schemes

10. Urgency or Pressure – Creating time-sensitive manipulation

Exploitation Phase

11. Money Extraction – Direct requests for funds
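
For anyone reproducing the pipeline, one straightforward way to encode this taxonomy in Python could look like the following; the structure is my own sketch, not the exact code from the repository:

```python
# Scam strategy taxonomy: four phases, eleven categories, plus "none"
# for legitimate conversation. A sketch of one possible encoding.
PHASES = {
    "Grooming": [
        "Emotional Bonding",
        "Financial Baiting",
        "Fake Persona Building",
        "Manipulative Care",
        "Excuse Avoidance",
    ],
    "Profiling": [
        "Personal Profiling",
        "Financial Inquiry",
    ],
    "Persuasion": [
        "Financial Education",
        "Investment Pitch",
        "Urgency or Pressure",
    ],
    "Exploitation": [
        "Money Extraction",
    ],
}
CATEGORIES = [c for phase in PHASES.values() for c in phase] + ["none"]
```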

Model Testing Setup

Automated LLM Classification with Secure Protocols

To evaluate the LLMs, I developed a single script that ran each of the four models (GPT-3.5, GPT-4, Claude 3 Haiku, and Gemini 1.5 Pro) over the scam messages independently, under identical and secure conditions.

A critical aspect of this setup was preventing data leakage: when an LLM classified a message, it had access to the message text alone, never the manual labels or the categorization results of the other LLMs.

You can find the complete Python script in my GitHub repository.
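
The repository holds the full multi-model script; below is a minimal sketch of what the leakage-safe loop looks like for a single provider. The prompt wording and function name are my own, and only the OpenAI client is shown for brevity:

```python
# Minimal sketch of the leakage-safe classification loop for one provider.
# The prompt wording and function name are illustrative, not the exact
# code from the repository.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You label dating-app messages sent by suspected scammers. "
    "Reply with exactly one of: Emotional Bonding, Financial Baiting, "
    "Fake Persona Building, Manipulative Care, Excuse Avoidance, "
    "Personal Profiling, Financial Inquiry, Financial Education, "
    "Investment Pitch, Urgency or Pressure, Money Extraction, or none."
)

def classify(message_text: str, model: str = "gpt-4") -> str:
    # The model sees only the raw message text: no manual labels,
    # no other models' outputs, no surrounding metadata.
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for reproducible evaluation
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message_text},
        ],
    )
    return response.choices[0].message.content.strip()
```

The same wrapper pattern applies to the Anthropic and Google clients; the invariant is that nothing beyond the raw message text ever enters the prompt.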

Results

Model Accuracy

| Model | Japanese 🇯🇵 (3,697 messages) | English 🇬🇧 (249 messages) |
| --- | --- | --- |
| GPT-4 | 1️⃣ 70.8% | 2️⃣ 76.7% |
| GPT-3.5 | 2️⃣ 66.9% | 4️⃣ 70.3% |
| Claude 3 Haiku | 3️⃣ 66.2% | 1️⃣ 78.3% |
| Gemini 1.5 Pro | 4️⃣ 62.0% | 3️⃣ 72.3% |

Accuracy by Category (Japanese)

| Category | Messages | GPT-4 | GPT-3.5 | Claude | Gemini |
| --- | --- | --- | --- | --- | --- |
| none | 2,504 | 75.8% | 75.0% | 75.1% | 62.6% |
| Personal Profiling | 417 | 69.5% | 56.1% | 53.5% | 76.0% |
| Emotional Bonding | 273 | 75.5% | 72.5% | 63.7% | 51.6% |
| Fake Persona Building | 194 | 57.2% | 18.6% | 23.7% | 68.0% |
| Investment Pitch | 96 | 5.2% | 13.5% | 49.0% | 11.5% |
| Manipulative Care | 59 | 78.0% | 49.2% | 45.8% | 98.3% |
| Financial Education | 38 | 71.1% | 63.2% | 44.7% | 60.5% |
| Financial Baiting | 35 | 28.6% | 71.4% | 31.4% | 31.4% |
| Money Extraction | 35 | 2.9% | 2.9% | 5.7% | 2.9% |
| Financial Inquiry | 32 | 62.5% | 81.2% | 46.9% | 71.9% |
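
Both tables above reduce to straightforward aggregations once every model’s predictions are stored alongside the manual labels; here is a sketch with pandas, where the file name and column names are placeholders:

```python
# Sketch of how both accuracy tables can be derived with pandas, assuming
# one row per (message, model) prediction. File and column names are
# placeholders.
import pandas as pd

df = pd.read_csv("predictions.csv")  # columns: lang, label, model, prediction
df["correct"] = df["prediction"] == df["label"]

# Overall accuracy per model and language (first table).
overall = df.groupby(["model", "lang"])["correct"].mean().unstack("lang")

# Per-category accuracy on Japanese messages (second table).
japanese = df[df["lang"] == "ja"]
by_category = japanese.groupby(["label", "model"])["correct"].mean().unstack("model")

print(overall.round(3))
print(by_category.round(3))
```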

Key Insights and Discoveries

1. GPT-4 Wins Overall, But Category Performance Varies Dramatically

While GPT-4 achieved the highest overall accuracy on Japanese messages, no single model dominated all categories. Each model showed distinct strengths.

  • GPT-4: Best at emotional bonding
  • GPT-3.5: Excelled at financial baiting and financial inquiry detection
  • Claude: Best performer on investment pitches
  • Gemini: Surprisingly strong at personal profiling, fake persona building, and manipulative care

2. All Models Struggle with Money Extraction

All models performed poorly at detecting direct money extraction attempts – the final and most critical scam phase. The best performer, Claude, reached only 5.7% accuracy.

3. Language Matters: English vs Japanese Performance

Claude showed a remarkable 12-point accuracy improvement on English messages (78.3% vs 66.2% for Japanese), while other models showed smaller gaps. This suggests significant language-specific bias in model training and performance.

4. The “None” Category Challenge

With 67.7% of Japanese messages labeled as legitimate conversation (“none”), accurately distinguishing scam tactics from normal chat proved crucial.

GPT-4 and GPT-3.5 performed best here, while Gemini struggled, over-classifying innocent messages as scam attempts.

5. Cultural Context Blindness: The Japanese Politeness Problem

LLMs demonstrated a critical weakness in understanding Japanese cultural communication norms, with Gemini suffering most severely from this issue.

In Japanese culture, expressions of care, concern, and attentiveness are standard politeness markers, not necessarily signs of manipulation.

Gemini’s 62.6% accuracy on legitimate conversation (“none” category) compared to ~75% for other models reveals systematic over-classification of polite Japanese phrases.

The model frequently misclassified routine politeness such as “お疲れ様でした” (roughly, “thank you for your hard work”) as “Manipulative Care.”

This cultural blindness explains why Gemini paradoxically achieved the highest accuracy on actual “Manipulative Care” detection (98.3%) – it was flagging both genuine manipulation and normal Japanese politeness as the same category.

Conclusion

Large language models (LLMs) have emerged as powerful tools in the fight against digital deception—but this study reveals both their potential and their current limitations.

Through the classification of nearly 4,000 real scam messages from 50 cryptocurrency romance scam conversations, we found that no single model consistently outperformed the others across all scam tactics.

While GPT-4 achieved the highest overall accuracy, other models like Claude and Gemini demonstrated unique strengths in niche categories like investment pitches or fake persona detection.

However, the most sobering insight is this: every model performed poorly on detecting direct money extraction attempts, the final and most dangerous step in the scam process.

Even the best model (Claude) only achieved 5.7% accuracy in this critical category—underscoring the difficulty of identifying explicit fraud when cloaked in emotionally manipulative language.

We also uncovered significant language-specific performance gaps, with all models performing better in English than Japanese. This points to the need for more diverse training data and language-aware model tuning.

Ultimately, this benchmark serves as both a progress report and a call to action. GenAI tools are promising allies in scam detection, but relying on them blindly is not enough. Human oversight, diverse training data, and continued evaluation are essential for building safer systems.

Real-World Implications

The findings have immediate practical applications:

For Platforms:

  • Deploy hybrid detection: No single model works—combine GPT-4’s emotional detection with Claude’s investment pitch recognition (see the ensemble sketch after this list).
  • Address cultural misinterpretation: Current models often confuse standard Japanese politeness with manipulation, leading to false positives.
  • Rethink money extraction detection: 5.7% accuracy means current approaches are fundamentally broken.
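
To make the hybrid idea concrete, here is one possible ensemble: weight each model’s vote by its measured accuracy on the category it predicts. The snippet is a sketch; the model names, the sampled accuracy values (taken from the Japanese results table), and the default weight are assumptions:

```python
# Sketch of a hybrid ensemble that weights each model's vote by its
# measured per-category accuracy (sample values from the Japanese table).
from collections import defaultdict

CATEGORY_ACCURACY = {
    ("gpt-4", "Emotional Bonding"): 0.755,
    ("gpt-3.5", "Financial Baiting"): 0.714,
    ("claude-3-haiku", "Investment Pitch"): 0.490,
    ("gemini-1.5-pro", "Manipulative Care"): 0.983,
    # ... extend with the full per-category table ...
}

def ensemble_label(predictions: dict[str, str]) -> str:
    """predictions maps model name -> that model's predicted category."""
    scores: defaultdict[str, float] = defaultdict(float)
    for model, label in predictions.items():
        # A vote counts more when the voting model has proven reliable
        # on the category it is predicting; 0.5 is an assumed default.
        scores[label] += CATEGORY_ACCURACY.get((model, label), 0.5)
    return max(scores, key=scores.get)
```

A production version would also account for false positives: as discussed above, Gemini’s 98.3% on Manipulative Care came bundled with over-flagging of ordinary Japanese politeness, so per-category hit rates alone are not a sufficient weighting signal.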

For AI Researchers:

  • Train on real adversarial data: Synthetic scam data clearly isn’t cutting it.
  • Build conversation-aware models: Message-level classification misses scammer progression patterns.
  • Address cross-cultural training gaps: English-centric training creates dangerous blind spots.

What’s Next: Fine-Tuning for Superior Performance

The performance of general-purpose LLMs in this study is just the starting point. The real opportunity lies in transforming these models through fine-tuning—training them further on a domain-specific dataset to boost their precision, context awareness, and reliability.

With 15,913 manually reviewed messages from 175 real cryptocurrency scammers, this dataset offers an unparalleled foundation. Unlike synthetic or simulated text, these conversations capture authentic scammer psychology, cultural nuances, and evolving fraud tactics—elements no model has been exposed to at scale.

In my upcoming blog post, I’ll explore how to fine-tune open-source LLMs using this dataset.

Fine-tuning offers three key advantages (a minimal training sketch follows the list):

  • Boost Accuracy in Critical Categories: Address persistent weaknesses—especially in detecting “Money Extraction”—by training models on more representative, labeled examples.
  • Reduce Cultural Misclassification: Use culturally grounded Japanese data to help models distinguish between standard politeness and manipulative grooming, minimizing false positives.
  • Build Real-World-Ready Models: Adapt models for practical deployment, moving from generic language understanding to nuanced threat recognition in scam detection systems.
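
As a preview, here is roughly what a parameter-efficient (LoRA) fine-tune could look like. Everything below is a sketch: the base model, hyperparameters, and data file name are placeholder assumptions, not the recipe from the upcoming post:

```python
# Sketch of parameter-efficient fine-tuning for scam-tactic classification.
# Base model, hyperparameters, and file name are placeholder assumptions.
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["Emotional Bonding", "Financial Baiting", "Fake Persona Building",
          "Manipulative Care", "Excuse Avoidance", "Personal Profiling",
          "Financial Inquiry", "Financial Education", "Investment Pitch",
          "Urgency or Pressure", "Money Extraction", "none"]
label2id = {name: i for i, name in enumerate(LABELS)}

BASE = "xlm-roberta-base"  # multilingual, so Japanese and English share one model
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=len(LABELS))

# LoRA adapters on the attention projections; only a small fraction
# of the weights is trained.
model = get_peft_model(model, LoraConfig(
    task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
    lora_dropout=0.1, target_modules=["query", "value"],
))

# Expects one JSON object per line with "text" and "label" fields.
data = load_dataset("json", data_files="labeled_messages.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
                batched=True)
data = data.map(lambda ex: {"labels": label2id[ex["label"]]})
data = data.train_test_split(test_size=0.2, seed=42)

trainer = Trainer(
    model=model,
    args=TrainingArguments("scam-classifier", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```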

This next phase will detail dataset preparation, architecture selection, and performance evaluation. The goal is to show how targeted fine-tuning can turn general LLMs into expert-level fraud detectors—helping us build safer, smarter tools in the GenAI era.

By moving beyond off-the-shelf models and toward culturally aware, tactic-specific fine-tuning, we can take meaningful steps to reduce harm from AI-assisted scams worldwide.

Stay tuned for the next post, where I’ll share early results—and whether fine-tuned open-source LLMs can finally outsmart the scammers.

This research was conducted ethically with proper security measures. No real victims were involved, and all scammer interactions were conducted safely with appropriate protections.

Data Note: The complete dataset of 15,913 messages from 175 scammers represents one of the largest collections of real scam conversations available for research. Anonymized subsets will be made available once I complete manual review of all messages.

Technical Note: All model testing was conducted with identical prompts, security measures to prevent data leakage, and consistent evaluation criteria. Full methodology and code are available here.
