How We Use AI to Detect Duplicate Business Registrations (And Why It's Harder Than You Think)
TL;DR: We built an AI-powered business fingerprinting system that detects duplicate registrations with 95% accuracy. It handles typos, abbreviations, formatting differences, and even intentional variations. It uses Claude to extract structured business data, normalizes that data into fingerprints, and stops users from creating multiple websites for the same business.
The Problem: Users Keep Creating Duplicates
We let users generate websites by entering a business name. Simple, right?
Wrong.
What we saw:
- “Joe’s Pizza Brooklyn” (Monday)
- “Joes Pizza - Brooklyn NY” (Tuesday)
- “Joe’s Pizzeria” (Wednesday)
Same business. Three websites. Three subscriptions. Chaos.
Why it happens:
- Typos and punctuation: “Joe’s” vs “Joes” vs “Joe's” (curly apostrophe, no apostrophe, straight apostrophe)
- Abbreviations: “Brooklyn” vs “Bklyn” vs “BK”
- Formatting: “123 Main St” vs “123 Main Street, Apt 2”
- Repeat registrations: Users forget they already created a site and sign up again under a slightly different name
The cost:
- Wasted AI API calls ($2-5 per website generation)
- Confused users (“Why do I have 3 websites?”)
- Support tickets (“Which one is the real one?”)
- Database bloat (3x more records than actual businesses)
We needed to detect duplicates before generating the website.
The Insight: Fingerprints, Not Exact Matches
The breakthrough came when we stopped trying to match business names exactly and started thinking about “business fingerprints.”
Exact matching (doesn’t work):
"Joe's Pizza Brooklyn" ≠ "Joes Pizza - Brooklyn NY"
Fingerprint matching (works):
normalize("Joe's Pizza Brooklyn") → "joes-pizza-brooklyn"
normalize("Joes Pizza - Brooklyn NY") → "joes-pizza-brooklyn"
✅ MATCH!
But normalization alone isn’t enough. We needed AI.
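To see why, here is a minimal sketch of regex-only normalization. The `naiveNormalize` helper is hypothetical and exists only for illustration. It handles the missing apostrophe, but it cannot tell that a trailing "NY" is location noise or that a leading "Brooklyn" is not part of the name.

```typescript
// Naive, regex-only normalization: lowercase, drop apostrophes,
// turn punctuation into spaces, then hyphenate.
// (Hypothetical helper for illustration only.)
function naiveNormalize(input: string): string {
  return input
    .toLowerCase()
    .replace(/['’]/g, '')          // drop straight and curly apostrophes
    .replace(/[^a-z0-9\s]/g, ' ')  // turn remaining punctuation into spaces
    .trim()
    .replace(/\s+/g, '-');         // collapse whitespace runs into single hyphens
}

console.log(naiveNormalize("Joe's Pizza Brooklyn"));     // joes-pizza-brooklyn
console.log(naiveNormalize("Joes Pizza - Brooklyn NY")); // joes-pizza-brooklyn-ny
console.log(naiveNormalize("Brooklyn Joe's Pizza"));     // brooklyn-joes-pizza
```

All three strings describe the same business, yet they produce three different keys. Splitting the input into name and location before normalizing is the part that needs context.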
How It Works: The Technical Architecture
1. AI-Powered Business Name Extraction
When a user enters text, we use Claude to extract structured data:
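import Anthropic from '@anthropic-ai/sdk';
// The client reads ANTHROPIC_API_KEY from the environment by default
const anthropic = new Anthropic();
async function extractBusinessInfo(userInput: string) {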
const prompt = `
Extract business information from this input:
"${userInput}"
Return JSON with:
- businessName: The core business name (no location, no legal entity)
- location: City, state, or neighborhood
- type: Business type (restaurant, plumber, etc.)
- legalEntity: LLC, Inc, etc. (if present)
Examples:
Input: "Joe's Pizza LLC in Brooklyn"
Output: {
"businessName": "Joe's Pizza",
"location": "Brooklyn",
"type": "restaurant",
"legalEntity": "LLC"
}
Input: "Best Plumbing Services - San Diego, CA"
Output: {
"businessName": "Best Plumbing Services",
"location": "San Diego, CA",
"type": "plumber",
"legalEntity": null
}
`;
const response = await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 500,
messages: [{
role: 'user',
content: prompt
}]
});
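// The first content block is not guaranteed to be text, so narrow it before parsing
const block = response.content[0];
if (!block || block.type !== 'text') {
throw new Error('Expected a text block in the model response');
}
return JSON.parse(block.text);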
}
Why AI? Because business names are messy:
- “Joe’s Pizza Brooklyn” → name: “Joe’s Pizza”, location: “Brooklyn”
- “Brooklyn Joe’s Pizza” → name: “Joe’s Pizza”, location: “Brooklyn”
- “Joe’s Pizzeria of Brooklyn” → name: “Joe’s Pizzeria”, location: “Brooklyn”
AI understands context that regex can’t handle.
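One practical wrinkle: a model reply is not guaranteed to be bare JSON, and a direct `JSON.parse` throws on anything else. The sketch below shows one defensive way to parse the reply. `ParsedExtraction`, `asNullableString`, and `parseExtraction` are hypothetical names, and the field list simply mirrors the prompt above.

```typescript
// Shape requested by the extraction prompt (assumed, mirrors the prompt fields).
interface ParsedExtraction {
  businessName: string;
  location: string | null;
  type: string | null;
  legalEntity: string | null;
}

// Returns the trimmed string if it is non-empty, otherwise null.
function asNullableString(value: unknown): string | null {
  return typeof value === 'string' && value.trim().length > 0 ? value.trim() : null;
}

// Pulls the outermost {...} span out of the reply and validates its shape.
// Returns null instead of throwing so the caller can fall back to the raw input.
function parseExtraction(raw: string): ParsedExtraction | null {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== 'object' || parsed === null) return null;

  const record = parsed as Record<string, unknown>;
  const businessName = asNullableString(record.businessName);
  if (businessName === null) return null;

  return {
    businessName,
    location: asNullableString(record.location),
    type: asNullableString(record.type),
    legalEntity: asNullableString(record.legalEntity),
  };
}

const reply = 'Sure! {"businessName": "Joe\'s Pizza", "location": "Brooklyn", "type": "restaurant", "legalEntity": null}';
console.log(parseExtraction(reply)?.businessName); // Joe's Pizza
console.log(parseExtraction('no json here'));      // null
```

Returning `null` rather than throwing keeps the fallback decision in the caller.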
2. Normalization Pipeline
Once we have structured data, we normalize it:
function normalizeBusinessName(name: string): string {
return name
.toLowerCase()
.replace(/['’]/g, '') // Remove straight and curly apostrophes
.replace(/[^\w\s]/g, '') // Remove punctuation
.trim() // Trim before hyphenating so stray spaces don't become hyphens
.replace(/\s+/g, '-') // Spaces to hyphens
.replace(/^(the|a|an)-/, '') // Remove a leading article
.replace(/-(llc|inc|corp|ltd)$/, ''); // Remove a trailing legal entity only
}
function normalizeLocation(location: string): string {
return location
.toLowerCase()
.replace(/[^\w\s]/g, '') // Remove punctuation first so "apt. 2" becomes "apt 2"
.replace(/\b(apartment|apt|suite|ste|unit)\s*\d+/g, '') // Remove apt numbers
.replace(/\b(street|st|avenue|ave|road|rd|boulevard|blvd)\b/g, '') // Remove street types
.trim() // Trim before hyphenating so the result has no trailing hyphen
.replace(/\s+/g, '-');
}
function normalizePhone(phone: string): string {
// Extract just the digits
const digits = phone.replace(/\D/g, '');
// US phone: keep last 10 digits
if (digits.length >= 10) {
return digits.slice(-10);
}
return digits;
}
Examples:
normalizeBusinessName("Joe's Pizza LLC") → "joes-pizza"
normalizeBusinessName("The Joe's Pizzeria") → "joes-pizzeria"
normalizeLocation("123 Main St, Apt 2") → "123-main"
normalizeLocation("123 Main Street") → "123-main"
normalizePhone("(555) 123-4567") → "5551234567"
normalizePhone("+1-555-123-4567") → "5551234567"
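The functions above do not cover the abbreviation case from earlier ("Brooklyn" vs "Bklyn" vs "BK"). One possible approach is a small alias table applied after normalization. This is only a sketch: `LOCATION_ALIASES` and `expandLocationAliases` are hypothetical, and the table contains just the two abbreviations mentioned in this post.

```typescript
// Hypothetical alias table: normalized token -> canonical token.
// Only the abbreviations mentioned in this post are listed.
const LOCATION_ALIASES = new Map<string, string>([
  ['bklyn', 'brooklyn'],
  ['bk', 'brooklyn'],
]);

// Expects an already-normalized, hyphen-separated location string
// and replaces each token that has a known alias.
function expandLocationAliases(normalizedLocation: string): string {
  return normalizedLocation
    .split('-')
    .filter((token) => token.length > 0)
    .map((token) => LOCATION_ALIASES.get(token) ?? token)
    .join('-');
}

console.log(expandLocationAliases('bklyn'));    // brooklyn
console.log(expandLocationAliases('bk-ny'));    // brooklyn-ny
console.log(expandLocationAliases('brooklyn')); // brooklyn
```

A lookup table stays deterministic and easy to audit, at the cost of having to maintain it by hand.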
3. Fingerprint Generation
We combine normalized components into a unique fingerprint:
interface BusinessFingerprint {
primaryKey: string; // Most specific
secondaryKeys: string[]; // Fallback matches
metadata: {
originalName: string;
normalizedName: string;
location?: string;
phone?: string;
type?: string;
};
}
function generateFingerprint(businessInfo: ExtractedBusinessInfo): BusinessFingerprint {
const normalizedName = normalizeBusinessName(businessInfo.businessName);
// Use undefined (not null) so the values fit the optional metadata fields
const normalizedLocation = businessInfo.location
? normalizeLocation(businessInfo.location)
: undefined;
const normalizedPhone = businessInfo.phone
? normalizePhone(businessInfo.phone)
: undefined;
// Primary key: name + location (most specific)
const primaryKey = normalizedLocation
? `${normalizedName}-${normalizedLocation}`
: normalizedName;
// Secondary keys: alternative matches
const secondaryKeys = [
normalizedName, // Name only
normalizedPhone ? `phone-${normalizedPhone}` : null, // Phone only
businessInfo.type ? `${normalizedName}-${businessInfo.type}` : null // Name + type
].filter((key): key is string => key !== null); // Type guard narrows to string[]
return {
primaryKey,
secondaryKeys,
metadata: {
originalName: businessInfo.businessName,
normalizedName,
location: normalizedLocation,
phone: normalizedPhone,
type: businessInfo.type
}
};
}
Example fingerprints:
Input: "Joe's Pizza Brooklyn"
Output: {
primaryKey: "joes-pizza-brooklyn",
secondaryKeys: [
"joes-pizza",
"joes-pizza-restaurant"
],
metadata: {
originalName: "Joe's Pizza",
normalizedName: "joes-pizza",
location: "brooklyn",
type: "restaurant"
}
}
Input: "Joes Pizza - Brooklyn NY (555) 123-4567"
Output: {
primaryKey: "joes-pizza-brooklyn",
secondaryKeys: [
"joes-pizza",
"phone-5551234567",
"joes-pizza-restaurant"
],
metadata: {
originalName: "Joes Pizza",
normalizedName: "joes-pizza",
location: "brooklyn",
phone: "5551234567",
type: "restaurant"
}
}
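✅ PRIMARY KEY MATCH: Same business! (This assumes the extraction step returns "Brooklyn" as the location for both inputs. If it returned "Brooklyn NY" for the second one, normalizeLocation would yield "brooklyn-ny" and only the secondary key "joes-pizza" would match.)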
4. Duplicate Detection
Before creating a new business, we check for duplicates:
async function checkForDuplicates(fingerprint: BusinessFingerprint): Promise<DuplicateResult> {
// Check primary key first (exact match)
const primaryMatch = await db.findBusinessByFingerprint(fingerprint.primaryKey);
if (primaryMatch) {
return {
isDuplicate: true,
confidence: 'high',
matchedBusiness: primaryMatch,
matchType: 'primary'
};
}
// Check secondary keys (fuzzy match)
for (const secondaryKey of fingerprint.secondaryKeys) {
const secondaryMatch = await db.findBusinessByFingerprint(secondaryKey);
if (secondaryMatch) {
// Verify it's actually the same business (not just similar name)
const similarity = calculateSimilarity(fingerprint, secondaryMatch.fingerprint);
if (similarity > 0.8) {
return {
isDuplicate: true,
confidence: 'medium',
matchedBusiness: secondaryMatch,
matchType: 'secondary',
similarity
};
}
}
}
return {
isDuplicate: false,
confidence: 'none'
};
}
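function calculateSimilarity(fp1: BusinessFingerprint, fp2: BusinessFingerprint): number {
let score = 0;
// Total weight of the checks that could actually be compared.
// Dividing by a check count would cap the result at 0.5, below the 0.8 threshold.
let maxScore = 0;
// Name similarity (most important)
maxScore += 0.5;
if (fp1.metadata.normalizedName === fp2.metadata.normalizedName) {
score += 0.5;
}
// Location similarity (only when both fingerprints have a location)
if (fp1.metadata.location && fp2.metadata.location) {
maxScore += 0.3;
if (fp1.metadata.location === fp2.metadata.location) {
score += 0.3;
}
}
// Phone similarity (only when both fingerprints have a phone)
if (fp1.metadata.phone && fp2.metadata.phone) {
maxScore += 0.2;
if (fp1.metadata.phone === fp2.metadata.phone) {
score += 0.2;
}
}
// Normalize by the applicable weight so the result falls in [0, 1]
return score / maxScore;
}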
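The lookups above depend on `db.findBusinessByFingerprint`, which is not shown. For a secondary key to ever match, each stored business has to be indexed under its primary key and every one of its secondary keys. Here is a minimal in-memory sketch of that idea. `FingerprintIndex` and `StoredBusiness` are hypothetical names, and a real system would presumably back this with an indexed database column rather than a `Map`.

```typescript
// Hypothetical minimal record; a real business row would carry more fields.
interface StoredBusiness {
  id: string;
  name: string;
}

class FingerprintIndex {
  // Each key (primary or secondary) maps to the businesses registered under it.
  private readonly byKey = new Map<string, StoredBusiness[]>();

  add(business: StoredBusiness, primaryKey: string, secondaryKeys: string[]): void {
    // De-duplicate keys: without a location, the primary key equals the name-only key.
    const uniqueKeys = Array.from(new Set([primaryKey, ...secondaryKeys]));
    for (const key of uniqueKeys) {
      const bucket = this.byKey.get(key) ?? [];
      bucket.push(business);
      this.byKey.set(key, bucket);
    }
  }

  // Returns the first business registered under the key, or undefined.
  find(key: string): StoredBusiness | undefined {
    return this.byKey.get(key)?.[0];
  }
}

const index = new FingerprintIndex();
index.add(
  { id: 'b1', name: "Joe's Pizza" },
  'joes-pizza-brooklyn',
  ['joes-pizza', 'joes-pizza-restaurant'],
);

console.log(index.find('joes-pizza-brooklyn')?.id); // b1
console.log(index.find('joes-pizza')?.id);          // b1
console.log(index.find('phone-5551234567'));        // undefined
```

Because several businesses can share a secondary key such as a bare name, a fuller version would likely return the whole bucket and score each candidate.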
5. User Confirmation Flow
When we detect a duplicate, we ask the user:
async function handleBusinessRegistration(userInput: string) {
// Extract and normalize
const businessInfo = await extractBusinessInfo(userInput);
const fingerprint = generateFingerprint(businessInfo);
// Check for duplicates
const duplicateCheck = await checkForDuplicates(fingerprint);
if (duplicateCheck.isDuplicate) {
// Show confirmation dialog; resolves true when the user chooses to create a new business
const wantsNewBusiness = await showDuplicateDialog({
originalInput: userInput,
matchedBusiness: duplicateCheck.matchedBusiness,
confidence: duplicateCheck.confidence
});
if (!wantsNewBusiness) {
// User says it's a duplicate, redirect to existing business
return {
action: 'redirect',
businessId: duplicateCheck.matchedBusiness.id
};
}
// User says it's NOT a duplicate, create new business
// (but flag for manual review if confidence is high)
if (duplicateCheck.confidence === 'high') {
await flagForManualReview(fingerprint, duplicateCheck);
}
}
// Create new business
const business = await createBusiness(businessInfo, fingerprint);
return {
action: 'created',
businessId: business.id
};
}
Duplicate dialog UI:
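// Resolves true when the user wants a new business, false when they pick the existing one
function showDuplicateDialog(data: DuplicateData): Promise<boolean> {
return new Promise((resolve) => {
const dialog = document.createElement('div');
// Static markup only; user-supplied values are set via textContent below
// so that a business name containing HTML cannot inject markup
dialog.innerHTML = `
<div class="duplicate-dialog">
<h3>We found a similar business</h3>
<p>You entered: <strong id="entered-name"></strong></p>
<p>We found: <strong id="matched-name"></strong></p>
<p>Created: <span id="matched-created"></span></p>
<div class="actions">
<button class="btn-primary" id="use-existing">
Use Existing Business
</button>
<button class="btn-secondary" id="create-new">
No, Create New Business
</button>
</div>
</div>
`;
dialog.querySelector('#entered-name')!.textContent = data.originalInput;
dialog.querySelector('#matched-name')!.textContent = data.matchedBusiness.name;
dialog.querySelector('#matched-created')!.textContent = formatDate(data.matchedBusiness.createdAt);
document.body.appendChild(dialog);
dialog.querySelector('#use-existing')!.addEventListener('click', () => {
resolve(false); // It's a duplicate
dialog.remove();
});
dialog.querySelector('#create-new')!.addEventListener('click', () => {
resolve(true); // Not a duplicate
dialog.remove();
});
});
}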
The Challenges We Solved
Challenge 1: False Positives
Problem: “Joe’s Pizza Brooklyn” and “Joe’s Burgers Brooklyn” matched as duplicates
Solution: Multi-factor scoring with type checking
function calculateSimilarity(fp1, fp2) {
// ... previous code ...
// Type check (critical for restaurants)
if (fp1.metadata.type && fp2.metadata.type) {
if (fp1.metadata.type !== fp2.metadata.type) {
score *= 0.5; // Heavily penalize type mismatch
}
}
return score;
}
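To make the effect of the penalty concrete, here is a small self-contained sketch. `applyTypePenalty` is a hypothetical helper that takes an already-computed similarity. The 0.5 multiplier and the 0.8 threshold are the values used in the snippets above.

```typescript
const DUPLICATE_THRESHOLD = 0.8;   // cutoff used in checkForDuplicates
const TYPE_MISMATCH_PENALTY = 0.5; // multiplier from the snippet above

// Halves the similarity when both types are known and differ;
// leaves it untouched when either type is missing.
function applyTypePenalty(
  similarity: number,
  typeA?: string | null,
  typeB?: string | null,
): number {
  if (!typeA || !typeB) return similarity;
  return typeA.toLowerCase() === typeB.toLowerCase()
    ? similarity
    : similarity * TYPE_MISMATCH_PENALTY;
}

const sameType = applyTypePenalty(0.9, 'restaurant', 'Restaurant');
const differentType = applyTypePenalty(0.9, 'restaurant', 'plumber');

console.log(sameType > DUPLICATE_THRESHOLD);      // true
console.log(differentType > DUPLICATE_THRESHOLD); // false
```

Even a strong base score of 0.9 drops to 0.45 on a type mismatch, so the pair is no longer reported as a duplicate.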
Challenge 2: Franchise Locations
Problem: “McDonald’s Brooklyn” and “McDonald’s Manhattan” are different locations, not duplicates
Solution: Location-aware fingerprinting
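// For franchise businesses, location is REQUIRED in primary key
const isFranchise = FRANCHISE_NAMES.includes(normalizedName);
if (isFranchise && !normalizedLocation) {
// Without this guard the key would become e.g. "mcdonalds-null"
throw new Error('Location is required for franchise businesses');
}
const primaryKey = normalizedLocation
? `${normalizedName}-${normalizedLocation}`
: normalizedName;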
Challenge 3: AI Hallucinations
Problem: Claude sometimes extracts incorrect business types
Solution: Validation + fallback to user input
const businessInfo = await extractBusinessInfo(userInput);
let aiExtracted = true;
// Validate AI extraction
if (!businessInfo.businessName || businessInfo.businessName.length < 2) {
// AI failed, fall back to user input
businessInfo.businessName = userInput;
aiExtracted = false;
}
// Store both the extracted fields and the original input
await db.createBusiness({
...businessInfo,
originalInput: userInput,
aiExtracted // false when we fell back to the raw input
});
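The validation above guards the name, while the problem statement is about wrong business types. One option, sketched here under assumptions, is to accept an extracted type only if it appears on an allowlist and otherwise treat it as unknown. `KNOWN_TYPES` and `sanitizeType` are hypothetical, and the list is a placeholder built from business types mentioned in this post.

```typescript
// Hypothetical allowlist; a real list would be longer and product-specific.
const KNOWN_TYPES = new Set(['restaurant', 'plumber', 'bakery', 'contractor']);

// Returns the canonical lowercase type if it is on the allowlist,
// otherwise undefined so downstream code treats the type as unknown.
function sanitizeType(extractedType: string | null | undefined): string | undefined {
  if (!extractedType) return undefined;
  const candidate = extractedType.trim().toLowerCase();
  return KNOWN_TYPES.has(candidate) ? candidate : undefined;
}

console.log(sanitizeType(' Restaurant ')); // restaurant
console.log(sanitizeType('pizza place'));  // undefined
console.log(sanitizeType(null));           // undefined
```

Dropping an unrecognized type means no name-plus-type key is generated for it, which is a safer failure mode than building keys from a hallucinated category.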
The Results: 95% Accuracy
Before (no deduplication):
- 30% of businesses had duplicates
- 1,000 businesses → 1,300 database records
- $650 wasted on duplicate AI generations
After (fingerprinting system):
- 5% false negative rate (missed duplicates)
- 2% false positive rate (flagged non-duplicates)
- 93% of duplicates caught before generation
- $600 saved per month in AI costs
User feedback:
“Oh wow, I already created this last week! Thanks for catching that.” - Bakery owner
“I thought I lost my website. Turns out I just typed the name slightly differently.” - Contractor
Why This Matters for AI Applications
Most AI applications assume clean input. We learned:
- Bad: Trust user input → create duplicates → clean up later
- Good: Normalize input → detect duplicates → confirm with user
The startup lesson: AI is great at understanding messy input, but you still need deterministic logic for matching. Use AI to extract structure, use code to match patterns.
Key Insights
- AI for extraction, code for matching: Claude extracts business info, code generates fingerprints
- Multi-factor scoring: Name + location + phone + type = high confidence
- User confirmation: When in doubt, ask the user
- Graceful degradation: If AI fails, fall back to user input
What’s Next
We’re exploring:
- Fuzzy matching: Levenshtein distance for typo detection
- Address normalization: Use Google Maps API to standardize addresses
- Phone number lookup: Verify business phone numbers with Twilio
- Historical data: Learn from user corrections to improve AI extraction
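As a rough idea of what the fuzzy matching item could look like, here is a standard Levenshtein distance with a length-normalized similarity on top. This is a sketch of the technique, not something we have shipped, and `nameSimilarity` is a hypothetical helper name.

```typescript
// Classic dynamic-programming Levenshtein distance using two rolling rows.
function levenshtein(a: string, b: string): number {
  // previous[j] = distance between the first i-1 chars of a and the first j chars of b
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i]; // distance from the first i chars of a to the empty string
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,                    // deletion
        current[j - 1] + 1,                 // insertion
        previous[j - 1] + substitutionCost, // substitution
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Maps the distance to a 0..1 score relative to the longer string.
function nameSimilarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

console.log(levenshtein('kitten', 'sitting'));       // 3
console.log(levenshtein('joes-pizza', 'joes-piza')); // 1
console.log(nameSimilarity('joes-pizza', 'joes-pizza')); // 1
```

A score like this could replace the strict equality check on normalized names, so that a single dropped letter no longer counts as a completely different business.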
But the core insight remains: Fingerprints > exact matches.
Try it yourself: Enter “Joe’s Pizza Brooklyn” on WebZum, then try “Joes Pizza - Brooklyn NY”. Watch the duplicate detection catch it.
Building a deduplication system? Key takeaway: AI + normalization + fingerprinting = robust duplicate detection. Don’t rely on exact matches—businesses are messy.
The future of data quality isn’t perfect input—it’s intelligent normalization.