Keyword discovery — Ranket Docs

Keyword discovery is the first thing Ranket does after brand setup, and it runs again every Monday. The output is a pool of 15-25 keywords with low difficulty, real search volume, and high relevance to your product — the “unicorns” every SEO blog hopes to find.

What makes this different

Most AI SEO tools ask an LLM to brainstorm keyword variations and call it a day. That produces phrases with no real volume data and no way to know if a low-DR brand can actually rank for them.

Ranket’s discovery is grounded in DataForSEO search data for every candidate. Claude is used for judgment (relevance scoring, editorial picks) and, on thin sites, for proposing extra seed phrases, never for inventing the keywords that end up in the pool. Every keyword surfaced has been confirmed to exist as a real search query with real volume and known difficulty.

The 4 seed sources

Discovery starts from four real-data sources, in priority order:

1. Brand profile target keywords (highest priority)

When the brand was set up, Claude extracted 15 seed keywords from your scraped marketing pages. These are the brand’s curated topical territory. Example for a virtual-staging SaaS:

"virtual staging software real estate"
"AI photo enhancement real estate"
"furniture removal photo editor"
"day to dusk real estate photography"
"virtual tour software alternative to Matterport"
...10 more

These get priority because they’re the most explicitly on-brand. Many will be high-KD themselves (KD 40-60), but their variations via keyword_suggestions are usually rankable.

2. Self-ranked keywords

We call DataForSEO’s keywords_for_site on your domain. Every term your site already ranks for (anywhere in Google’s top 100, even position 87) becomes a seed. These are gold because Google has already decided your site is topically relevant for them.

For a brand-new site with no rankings, this source returns nothing — that’s why we have sources 1 and 3 as fallbacks.

3. Scraped page H1s / titles

Every H1 and <title> tag from the marketing pages we scraped becomes a seed. We normalise them first:

"Master Real Estate Video Editing in 2026: AI Workflow Step-by-Step"
   → drop "Master" leadword
   → strip year marker
   → split on colon
   → "real estate video editing"

The normaliser strips year markers, leadwords (interrogative ones such as “How to” and “What is”, and imperative ones such as “Master”), trailing punctuation, and subtitles after colons.
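A minimal Python sketch of such a normaliser, not Ranket's actual code. The leadword list and the year pattern are assumptions: the text above names only “How to”, “What is”, and the “Master” example. The steps run in a different order from the trace above but give the same result.

```python
import re

# Assumed leadword list; only these three appear in the docs above.
LEADWORDS = ("how to", "what is", "master")

def normalise_title(title: str) -> str:
    """Reduce a page H1 or <title> to a bare seed phrase."""
    text = title.split(":", 1)[0]                      # drop the subtitle after the first colon
    text = re.sub(r"\b(?:in\s+)?20\d{2}\b", "", text)  # strip year markers such as "in 2026"
    text = re.sub(r"\s+", " ", text).strip(" .,!?-").lower()
    for lead in LEADWORDS:
        if text.startswith(lead + " "):
            text = text[len(lead) + 1:]                # drop one leading leadword
            break
    return text

print(normalise_title("Master Real Estate Video Editing in 2026: AI Workflow Step-by-Step"))
# prints: real estate video editing
```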

4. Google Search Console queries (when connected)

If you’ve authorised GSC, the top queries you get impressions for in the last 28 days flow in as seeds. These are the strongest signal because Google has already shown your site for them.

Phase B — Claude supplement (only when thin)

If after combining all 4 real sources the total seed count is under 25, we run a single Sonnet call that generates 15-20 additional on-brand seed phrases based on the brand profile.

This kicks in for brand-new sites with 1-3 pages and no GSC data. Once the brand publishes a few articles, the real-data sources fill out and Phase B stops triggering.

Variation expansion

For each seed (up to 100, prioritised by source), we call DataForSEO keyword_suggestions/live:

seed: "virtual staging software"
returns up to 30 variations like:
  • "virtual staging software for realtors"
  • "best virtual staging software"
  • "free virtual staging software"
  • "virtual staging software cost"
  • "virtual staging software comparison"
  ...
each with searchVolume, keywordDifficulty, cpc already attached

Total raw candidate pool: typically 1,500-3,000 phrases.
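The seed cap can be sketched as follows. This is an illustration under assumptions, not the production code: the source keys (`profile`, `self_ranked`, `pages`, `gsc`) are hypothetical names, and the ordering simply follows the numbered list of seed sources above. With 100 seeds and up to 30 variations each, the raw pool tops out at 3,000 phrases, which matches the upper end of the typical range.

```python
from typing import Dict, List

# Hypothetical source keys, ordered as in the numbered list of seed sources.
SOURCE_PRIORITY = ["profile", "self_ranked", "pages", "gsc"]

def select_seeds(seeds_by_source: Dict[str, List[str]], max_seeds: int = 100) -> List[str]:
    """Merge seeds in source-priority order, drop case-insensitive duplicates, cap the total."""
    seen = set()
    selected: List[str] = []
    for source in SOURCE_PRIORITY:
        for seed in seeds_by_source.get(source, []):
            key = seed.strip().lower()
            if key and key not in seen:
                seen.add(key)
                selected.append(seed.strip())
    return selected[:max_seeds]

MAX_VARIATIONS_PER_SEED = 30
print(100 * MAX_VARIATIONS_PER_SEED)  # upper bound on the raw candidate pool: 3000
```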

Hard filters

We drop candidates that fail any of:

KD > 30 — too competitive for a typical brand to rank
Volume < 30 — not enough traffic to be worth an article
Branded — contains the brand’s own domain root (e.g. “bright-shot login” is dropped)
Garbage — contains URL-encoded characters or weird symbols
Volume > 10,000 with KD < 2 — almost always a navigational mega-term where DataForSEO couldn’t compute proper KD

Result: typically 50-200 candidates remain.
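The filters above could be expressed roughly like this. It is a sketch under assumptions: the `Candidate` shape is hypothetical, and the “garbage” rule is approximated as URL-encoded bytes or any character outside a small safe set, since the docs do not define “weird symbols” precisely.

```python
import re
from dataclasses import dataclass

@dataclass
class Candidate:
    keyword: str
    volume: int
    kd: float

# Assumption: "garbage" means URL-encoded bytes or characters outside a small safe set.
GARBAGE = re.compile(r"%[0-9A-Fa-f]{2}|[^\w\s'&.+-]")

def passes_hard_filters(c: Candidate, brand_root: str,
                        max_kd: float = 30, min_volume: int = 30) -> bool:
    kw = c.keyword.lower()
    if c.kd > max_kd:                      # too competitive
        return False
    if c.volume < min_volume:              # not enough traffic
        return False
    if brand_root.lower() in kw:           # branded query
        return False
    if GARBAGE.search(kw):                 # URL-encoded or odd symbols
        return False
    if c.volume > 10_000 and c.kd < 2:     # likely navigational mega-term
        return False
    return True
```

For example, `passes_hard_filters(Candidate("bright-shot login", 90, 3), "bright-shot")` returns `False`, while a volume of exactly 30 or a KD of exactly 30 still passes because both thresholds are exclusive.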

Word-set dedup

DataForSEO often returns the same concept multiple times with words in different order or filler words added:

"360 virtual tour software"
"virtual tour software 360"
"software for 360 virtual tour"
"virtual tour 360 software"
"virtual 360 tour software"
   ↓ word-set dedup (sort words, strip stop words)
"360 software tour virtual"
   → keep the highest-volume variant

The dedup key sorts words alphabetically AND strips stop words (of, for, to, your, the, etc.) so cosmetic variants collapse:

"real estate aerial photos"          → "aerial estate photos real"
"aerial photos of real estate"       → "aerial estate photos real"  ← match
"aerial photos for real estate"      → "aerial estate photos real"  ← match

Typical reduction: 50-200 → 30-150.
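A short Python sketch of the dedup key and the keep-highest-volume rule. The stop-word set contains only the five words named above (the full list is elided with “etc.”), and the function names are illustrative.

```python
from typing import Dict, Iterable, List, Tuple

# Only the stop words named in the docs; the real list is longer.
STOP_WORDS = {"of", "for", "to", "your", "the"}

def dedup_key(keyword: str) -> str:
    """Lowercase, drop stop words, sort the remaining words alphabetically."""
    words = [w for w in keyword.lower().split() if w not in STOP_WORDS]
    return " ".join(sorted(words))

def word_set_dedup(candidates: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep the highest-volume (keyword, volume) pair for each dedup key."""
    best: Dict[str, Tuple[str, int]] = {}
    for keyword, volume in candidates:
        key = dedup_key(keyword)
        if key not in best or volume > best[key][1]:
            best[key] = (keyword, volume)
    return list(best.values())

print(dedup_key("aerial photos of real estate"))  # prints: aerial estate photos real
```

Sorting a word list rather than building a true set means a repeated word still counts twice, so “tour tour software” and “tour software” would not collapse in this sketch.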

Claude relevance scoring

We send the surviving candidates to Claude Haiku in batches of 150, asking for a 0-1 relevance score against the brand profile. The prompt is calibrated to be honest — Haiku is told that a real estate SaaS shouldn’t rate “best laptop for travel” at 0.5 just because the audience uses laptops.

Candidates scoring below 0.6 get dropped. Typical floor pass: 30-150 → 12-30.
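The batching and floor logic might look like the sketch below. The `score_batch` callable is a hypothetical stand-in for the Haiku call; no real model API is shown or implied here.

```python
from typing import Callable, Dict, List

def filter_by_relevance(keywords: List[str],
                        score_batch: Callable[[List[str]], List[float]],
                        floor: float = 0.6,
                        batch_size: int = 150) -> Dict[str, float]:
    """Score keywords in batches and keep those at or above the relevance floor."""
    kept: Dict[str, float] = {}
    for start in range(0, len(keywords), batch_size):
        batch = keywords[start:start + batch_size]
        scores = score_batch(batch)  # hypothetical stand-in for the model call
        if len(scores) != len(batch):
            raise ValueError("scorer must return one score per keyword")
        for keyword, score in zip(batch, scores):
            if score >= floor:       # scores below the floor are dropped
                kept[keyword] = score
    return kept
```

A keyword scoring exactly 0.60 survives, which is consistent with the worked example further down where a 0.60 keyword appears in the final pool.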

Cascade fallback

If after relevance scoring the pool is under 15 keywords, we cascade:

1. Take the top 10 keywords from current pool as new seeds
2. Run keyword_suggestions on them
3. Apply hard filters + dedup to new candidates
4. Score relevance for new ones
5. Add to pool

Run at most 2 cascade rounds, stopping as soon as the pool reaches 15.

Each cascade round opens a new keyword neighborhood. “Best virtual staging software” as a seed surfaces variations like “virtual staging software comparison” and “virtual staging for realtors near me” — terms the original seeds didn’t reach.
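The cascade loop can be sketched as below. Two assumptions are baked in: “top 10” is taken to mean the 10 highest interim scores, and `expand` is a hypothetical callable that bundles steps 2 to 4 (suggestions, hard filters, dedup, relevance scoring) and returns only the survivors.

```python
from typing import Callable, Dict, List

def cascade(pool: Dict[str, float],
            expand: Callable[[List[str]], Dict[str, float]],
            min_target: int = 15,
            max_rounds: int = 2,
            seeds_per_round: int = 10) -> int:
    """Grow the pool in place until it reaches min_target; return the rounds run."""
    rounds = 0
    while len(pool) < min_target and rounds < max_rounds:
        # Assumption: "top" means highest interim score.
        seeds = sorted(pool, key=pool.get, reverse=True)[:seeds_per_round]
        for keyword, score in expand(seeds).items():
            pool.setdefault(keyword, score)  # existing keywords keep their score
        rounds += 1
    return rounds
```

The round cap matters most when `expand` keeps returning nothing: the loop then exits after 2 rounds instead of spending on DataForSEO indefinitely.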

SERP rank-gap validation

For the top 50 candidates by interim score, we call DataForSEO SERP and inspect the top 10. If the SERP is dominated by mega-domains (Wikipedia, Amazon, NYT), the rank gap is penalised. If the top 10 contains weaker domains that a low-DR brand could realistically outrank, the rank gap score is boosted.
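The docs do not give the rank-gap formula, so the following is purely an illustrative sketch of the idea, with every number invented: the mega-domain list, the “DR within 10 points of the brand” test for beatable domains, and the 0.7 to 1.3 multiplier range are all assumptions.

```python
from typing import List, Tuple

# Illustrative list only; the real mega-domain set is not documented.
MEGA_DOMAINS = {"wikipedia.org", "amazon.com", "nytimes.com"}

def rank_gap(top10: List[Tuple[str, int]], brand_dr: int) -> float:
    """Return a multiplier: below 1.0 when mega-domains dominate, above 1.0 when
    the SERP holds domains the brand could plausibly outrank.

    top10 holds (domain, domain_rating) pairs for the first results page.
    """
    if not top10:
        return 1.0
    mega = sum(1 for domain, _ in top10 if domain in MEGA_DOMAINS)
    # Assumption: "beatable" means a non-mega domain with DR at most brand_dr + 10.
    beatable = sum(1 for domain, dr in top10
                   if domain not in MEGA_DOMAINS and dr <= brand_dr + 10)
    return round(1.0 + 0.3 * (beatable - mega) / len(top10), 2)
```

Returning a multiplier centred on 1.0 fits the way `rankGap` enters the composite score as a factor, but the actual scale Ranket uses may differ.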

Composite opportunity score

Every candidate gets a final score 0-100:

score = composeScore(volume, kd, relevance, rankGap)

where:
  volC      = min(4.7, log10(volume + 1))            // capped volume
  kdC       = log10(max(5, kd) + 2)                  // floored KD
  raw       = (volC / kdC) × relevance × rankGap
  score     = clamp(0, 100, round(raw / 3 × 100))

The volume cap of 4.7 (≈ log10(50,000)) prevents generic mega-keywords from dominating. The KD floor of 5 stops candidates with KD 0 or 1 from exploding the score: without it, KD 0 would give a denominator of log10(2) ≈ 0.30 instead of log10(7) ≈ 0.85. The result is a balanced metric that rewards real opportunities over fake unicorns.
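The formula translates directly into Python. This is a transcription of the pseudo-code above rather than the shipped implementation; the one assumption is that `round` means round-half-up, which is written out explicitly because Python's built-in `round` rounds halves to even.

```python
import math

def compose_score(volume: int, kd: float, relevance: float, rank_gap: float) -> int:
    vol_c = min(4.7, math.log10(volume + 1))        # capped volume component
    kd_c = math.log10(max(5, kd) + 2)               # floored KD component
    raw = (vol_c / kd_c) * relevance * rank_gap
    rounded = math.floor(raw / 3 * 100 + 0.5)       # assumed round-half-up
    return max(0, min(100, rounded))                # clamp to 0..100

# "real estate hdr photography": volume 720, KD 5, relevance 0.75, neutral rank gap
print(compose_score(720, 5, 0.75, 1.0))  # prints: 85
```

With the floor in place, KD 0 and KD 5 produce the same score for otherwise identical inputs, and any volume of 100,000 or more hits the same cap.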

Editorial judgment pass (Opus)

For the final polish, we send the top 30 candidates to Claude Opus 4.7 with the brand profile and ask: which 5-10 of these would a senior content strategist write articles for FIRST?

Opus considers:

Intent alignment — does the searcher actually want what the brand sells?
Brand-defensibility — will the article naturally showcase the brand’s product?
Funnel position — problem-aware and solution-aware over fully top-of-funnel
Editorial freshness — skip near-duplicates that survived word-set dedup

Picked keywords get a +20 opportunity score boost so they surface at the very top of the pool. Each pick gets a short editorial reason (“strong commercial intent, low competition, fits virtual-staging buyer journey”).

Worked example: BrightShot first refresh

For bright-shot.com (DR 23, 100 scraped pages), one refresh surfaced:

Score	KD	Volume	Relevance	Keyword
83	1	1,300	0.60	decluttering house for sale
79	5	720	0.75	real estate hdr photography
77	6	880	0.75	real estate aerial photos
66	10	170	0.95	virtual real estate staging software
57	0	30	0.90	360 degree virtual tour software
52	19	140	0.85	virtual tour platforms
50	19	50	0.95	best virtual staging software for real estate
49	12	50	0.80	videotour ai reviews
…				…

18 keywords total. Distribution by source: 7 from competitor-ranked (in this run, the self-ranked keywords of the brand's own domain), 10 from variation, 1 from page-h1. All are on-brand and all have KD ≤ 30. The editorial pass marked 8 as top picks.

Cost and timing

Steady-state, established brand:

DataForSEO
  keywords_for_site × 1                            $0.013
  keyword_suggestions × 100 + items                $1.50
  SERP × 50 (rank-gap)                             $0.10
  bulk_search_volume × 1                           $0.075
Claude
  Haiku relevance × ~5 batches                     $0.04
  Opus editorial × 1                               $0.12
Embeddings (semantic dedup if enabled)             $0.0001
─────────────────────────────────────────────────────────
Total per refresh                                  ~$1.85

Duration: 30-60 seconds.

For thin brands (cascade triggers): +$0.50 per cascade round.
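The arithmetic behind the total can be checked with a few lines of Python. The dictionary keys are shorthand labels for the line items above, and the figures are copied from the table, not fetched from any pricing API.

```python
# Per-refresh line items in USD, copied from the table above.
LINE_ITEMS = {
    "keywords_for_site": 0.013,
    "keyword_suggestions": 1.50,
    "serp_rank_gap": 0.10,
    "bulk_search_volume": 0.075,
    "haiku_relevance": 0.04,
    "opus_editorial": 0.12,
    "embeddings": 0.0001,
}
CASCADE_ROUND_COST = 0.50

def refresh_cost(cascade_rounds: int = 0) -> float:
    """Estimated cost of one refresh, rounded to cents."""
    return round(sum(LINE_ITEMS.values()) + CASCADE_ROUND_COST * cascade_rounds, 2)

print(refresh_cost())    # prints: 1.85
print(refresh_cost(2))   # prints: 2.85
```

The keyword_suggestions calls account for about 80% of the steady-state total, which is why maxSeeds is the main lever on spend.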

Configuration

Default thresholds, overridable per-brand:

maxKeywordDifficulty: 30
minSearchVolume: 30
relevanceFloor: 0.6
maxSeeds: 100 (capped to control variation API spend)
cascadeMinTarget: 15
maxCascadeRounds: 2
editorialPickCount: 8

Higher-DR brands (DR > 50) can lift maxKeywordDifficulty to 40-50 to chase more competitive terms — at that authority, KD 45 is plausibly rankable.
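One way to model per-brand overrides is to merge them over the defaults, as sketched below. The threshold names and values come from the list above; the `brand_config` helper and its unknown-key check are illustrative and say nothing about how Ranket stores configuration.

```python
from typing import Dict, Optional

DEFAULTS: Dict[str, float] = {
    "maxKeywordDifficulty": 30,
    "minSearchVolume": 30,
    "relevanceFloor": 0.6,
    "maxSeeds": 100,
    "cascadeMinTarget": 15,
    "maxCascadeRounds": 2,
    "editorialPickCount": 8,
}

def brand_config(overrides: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Return the defaults with per-brand overrides applied; reject unknown keys."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown threshold(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

# A DR > 50 brand chasing more competitive terms:
high_dr = brand_config({"maxKeywordDifficulty": 45})
```

Rejecting unknown keys catches typos such as `maxKd`, which would otherwise be silently ignored and leave the default in force.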

Limits

Maximum 100 seeds expanded per refresh
Maximum 30 variations per seed
Maximum 50 SERP calls per refresh (rank-gap window)
Maximum 2 cascade rounds (prevents runaway cost on thin pools)

Diagnostics

Every refresh logs counts at each stage:

{
  "seedsFromPages": 128,
  "seedsFromSelfRanked": 36,
  "seedsFromProfile": 15,
  "seedsFromGsc": 16,
  "seedsFromClaudeSupplement": 0,
  "totalSeedsAfterDedupe": 195,
  "totalSeedsUsedForVariations": 100,
  "rawCandidatePool": 1832,
  "afterHardFilters": 287,
  "afterWordSetDedup": 198,
  "afterRelevance": 24,
  "cascadeRoundsRan": 0,
  "afterRankGap": 24,
  "editorialPicks": 8,
  "costUsd": 1.83,
  "requests": 152,
  "durationMs": 47000
}

The dashboard surfaces these so you can debug a thin pool (e.g. relevance scoring is too strict) or a failed refresh (e.g. DataForSEO 5xx).
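When debugging a thin pool from the raw log, a quick way to find the stage that discards the most candidates is to compute stage-to-stage retention. The sketch below uses the field names from the sample log above; the helper itself is illustrative and not part of the dashboard.

```python
from typing import Dict, List, Tuple

# Funnel stages in pipeline order, using the field names from the log above.
STAGES = ["rawCandidatePool", "afterHardFilters", "afterWordSetDedup",
          "afterRelevance", "afterRankGap"]

def stage_retention(diag: Dict[str, int]) -> List[Tuple[str, int, float]]:
    """Return (stage, count, fraction kept from the previous stage) per stage."""
    rows = []
    for prev, cur in zip(STAGES, STAGES[1:]):
        kept = diag[cur] / diag[prev] if diag[prev] else 0.0
        rows.append((cur, diag[cur], round(kept, 3)))
    return rows

sample = {"rawCandidatePool": 1832, "afterHardFilters": 287,
          "afterWordSetDedup": 198, "afterRelevance": 24, "afterRankGap": 24}
for stage, count, kept in stage_retention(sample):
    print(f"{stage:<20} {count:>5} {kept:.1%}")
```

In the sample log, relevance scoring keeps about 12% of what it receives, the steepest cut in the funnel, so relevanceFloor is the first threshold to inspect if the pool comes out too small.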