OSINT with LLMs: turning web search into verifiable evidence

How I design OSINT systems with LLMs and web search: sources, evidence, confidence, security, evaluation and human review.

Finding public information with an LLM looks simple: ask a question, let the model search the web and receive a polished answer. The problem is that a convincing answer is not necessarily a proven answer.

In AI and automation projects, I have worked with web-search tools integrated into LLMs to research public information and turn it into results that people or other systems can use. This kind of work is closely related to OSINT, or Open Source Intelligence: producing useful knowledge from open sources.

The difficult part is not finding related text. It is preserving the relationship between every claim, its source and the level of confidence it actually deserves.

Direct answer: how should an OSINT system with LLMs work?

A reliable OSINT system with LLMs should define the question, search across multiple public sources, extract facts with citations, compare contradictions, separate evidence from inference and deliver a reviewable result.

My reference architecture has seven steps:

turn the business need into verifiable questions;
plan queries and relevant sources;
retrieve public content responsibly;
extract structured claims with evidence;
cross-check and assign confidence;
apply security and privacy controls;
generate a report with links and explicit limitations.

The LLM helps search, read and organise. The evidence still lives in the sources.

What OSINT is and what changes when an LLM is added

OSINT is the collection and analysis of information legally obtained from public sources. These can include corporate websites, official registries, news, reports, repositories, academic publications or public professional profiles.

An LLM brings speed to tasks that previously required a great deal of manual work:

turning a broad question into focused queries;
summarising long documents;
extracting names, dates, relationships or events;
comparing how different sources describe the same fact;
translating content;
producing a readable first synthesis.

It also introduces new risks. The model can fill gaps with plausible information, mix dates, lose nuance, trust a weak source or interpret a malicious instruction embedded in a webpage. I would therefore never use an LLM as a primary source or accept a report without traceability.

From an ambiguous request to verifiable facts

A request like this is too broad:

Research this company and tell me whether it looks trustworthy.

"Trustworthy" can mean many things. Before searching, I would break the request into questions that can be answered with evidence:

Does the entity appear in official registries?
Who is publicly listed as responsible for it?
What products or services does it claim to provide?
Are there relevant recent news reports?
Do different sources agree on location, activity and dates?
What information could not be verified?

This decomposition reduces arbitrary interpretation. It also makes it possible to decide which sources are suitable for each question and when the system should admit that it has no answer.

Source hierarchy matters

Not every page found carries the same evidential weight. An initial hierarchy could be:

Level	Source type	Typical use
1	Registries, authorities and official documents	Confirm identity, dates or regulated information
2	Corporate websites, technical documentation and owned repositories	Understand what an entity says about itself
3	Established media, academic publications and industry reports	Add context and independent verification
4	Directories, aggregators, forums and social media	Discover leads that must be cross-checked

The highest-ranked source does not always contain the whole truth, but it changes the weight of a claim. A commercial description published by a company is useful for understanding its positioning; it does not prove every claimed result.

I would also retain the publication or retrieval date. The web changes, and a statement that was correct two years ago may be outdated today.

The central object is not the summary: it is the claim with evidence

Instead of asking the model only for a final paragraph, I would represent every important finding in a structured form:

{
  "claim": "The organisation states that it operates in two European markets.",
  "source_url": "https://example.com/about",
  "source_type": "official_website",
  "published_at": null,
  "retrieved_at": "2026-06-11",
  "evidence": "A short excerpt supporting the claim",
  "confidence": "medium",
  "status": "self_reported"
}

This structure forces the system to distinguish:

claim: what is believed to be true;
evidence: the exact part of a source that supports it;
provenance: where it came from and when it was retrieved;
confidence: how much support exists;
status: whether it is confirmed, self-reported or inferred.

A summary, table or alert can then be generated. But the readable result grows from auditable data, not the other way around.

How I would manage contradictions and confidence

If two sources contradict a date, name or figure, the system should not silently choose the version that seems most likely. It should record both versions, assess the quality and recency of each source and expose the conflict.

A confidence score can combine criteria such as:

authority and proximity of the source to the fact;
number of independent sources that agree;
recency of the information;
presence of direct textual evidence;
consistency with registries or official documentation;
absence of relevant contradictions.

I would not turn that score into mathematical truth. It helps prioritise review and communicate uncertainty; it should not hide it.

Confidence	Operational interpretation
High	Multiple reliable and independent sources agree
Medium	Useful evidence exists, but it depends on one source or questions remain
Low	It is a lead, inference or insufficiently corroborated detail

Low-confidence claims can still guide an investigation, but they should not automatically trigger important decisions.

Security: a webpage can also attack the agent

When an LLM reads external content, that content must be treated as untrusted data. A page may contain instructions aimed at the model, hidden text or content designed to manipulate the result. This is a form of indirect prompt injection.

I would apply controls such as:

clearly separating system instructions from retrieved content;
preventing a page from changing the objective or allowed tools;
limiting agent actions to an explicit allowlist;
validating URLs, file types and sizes;
never executing code found during research;
logging queries, sources and decisions;
requiring human confirmation before any consequential action.

The agent can suggest that a source deserves attention. It should not gain new permissions because a website asks it to.

Privacy and responsible use

Publicly accessible information is not automatically proportionate to process in every way. A responsible OSINT system needs a legitimate purpose, data minimisation and respect for applicable legal and ethical requirements.

In an enterprise setting, I would define the following before starting:

the precise question that needs to be answered;
authorised source categories;
personal data that must not be collected;
how long results and evidence will be retained;
who can access the report;
which decisions require human review;
how incorrect or outdated information can be corrected.

I would also avoid turning weak public signals into sensitive profiles or conclusions. The technical ability to collect information does not replace judgement about whether it should be collected.

Reference technical architecture

A generic flow could look like this:

Question and research policy
    -> query planner
        -> web-search provider
            -> document retrieval and cleaning
                -> claim and citation extraction
                    -> cross-checking, deduplication and confidence
                        -> human review or final report

I would separate responsibilities so components can be tested and replaced:

the planner turns objectives into queries;
the search provider returns candidates and metadata;
the retriever obtains only the necessary content;
the extractor produces structured claims;
the verifier compares sources and flags contradictions;
the generator writes without losing citations or limitations.

Depending on the context, I would use a workflow to orchestrate steps and backend services for logic that needs to be stable and testable. This follows the same criterion I explain in n8n, FastAPI or Spring: how I choose an architecture for AI automation.

How I would evaluate the system

A search system should not be evaluated only by whether its answer "sounds good". I would create a set of research tasks with known results and measure:

factual precision: how many claims are correct;
coverage: how many relevant facts it retrieves;
citation quality: whether the evidence actually supports the claim;
source quality: whether it prioritises suitable sources;
conflict detection: whether it identifies incompatible versions;
calibration: whether it expresses low confidence when evidence is weak;
recency: whether it distinguishes recent information from old content;
prompt-injection resistance: whether it ignores instructions from pages.

I would also review false positives. In OSINT, a name collision or overly quick inference can cause more harm than missing a secondary lead.

A practical example without sensitive data

Imagine that a team needs to prepare a public-information brief about a potential supplier. The goal is not to decide automatically whether to hire it, but to gather evidence for human review.

The system could:

confirm the legal identity through official sources;
locate the website and documentation declared by the supplier;
search for recent news and industry mentions;
extract published products, locations and responsible people;
identify differences between sources;
generate a cited timeline;
clearly state which points remain unverified.

The appropriate output would not be an opaque "trustworthy" or "untrustworthy" verdict. It would be a concise report with supported facts, visible contradictions and open questions that help a person make an informed decision.

What I have learned from building with web search and LLMs

The better a model writes, the more important it becomes to demand evidence. Fluency makes a report easier to read, but it can also hide that a conclusion depends on a single source or a fragile inference.

My principle for these systems is therefore simple:

An LLM can accelerate research, but every important conclusion should be traceable back to its source.

This approach connects with other projects where I have used automation to analyse large amounts of information and prioritise decisions, such as the system I used to analyse approximately 10,000 job listings. Technology provides speed; structure, evidence and review provide confidence.

When information comes from private documents rather than the web, I would use a different approach. In RAG, web search, fine-tuning or rules, I compare which technique fits each kind of knowledge and decision.

You can read more articles about AI and automation on the blog or contact me through the contact page.