Using a vector database with PHP
PHP has roughly 80 array functions: eight – zero. array_map, array_walk, array_reduce, array_filter, array_each… wait, that last one was deprecated and removed in PHP 7.2. Or not? And some new functions came up in PHP 8.5, last November.
The real problem is discoverability. You know the task, “find the first element matching a condition”, but you can’t remember if it’s array_find, array_search, array_filter, or something that doesn’t exist yet and you’ll have to write yourself. You end up on the eternal manual php.net, scrolling through the full list like it’s a 2003 API reference, which it kind of is, hoping that the right function will jump out of the dark at the best moment.
Vector databases solve this with a different approach entirely. Instead of matching text, they match meaning. You describe what you want, and you get back the semantically closest functions, regardless of the exact words. So, today we’ll build exactly that using Vektor, a pure PHP file-based vector database, and Ollama for local embeddings.
In a nutshell: no cloud; no API key; no monthly bill; no privacy invasion. Just PHP doing what PHP does: being surprisingly capable.
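To make the discoverability problem concrete, here is a small sketch of what plain keyword matching does with a task description. The helper name keywordMatches() and the short function list are made up for this illustration; nothing here comes from Vektor or the manual.

```php
<?php
// Naive keyword search: return the function names that contain any query word.
// keywordMatches() is a hypothetical helper written for this illustration.
function keywordMatches(string $query, array $functionNames): array
{
    $words = preg_split('/\W+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    $hits = [];
    foreach ($functionNames as $name) {
        foreach ($words as $word) {
            // Ignore very short words such as "a" or "the"
            if (strlen($word) > 3 && str_contains($name, $word)) {
                $hits[] = $name;
                break;
            }
        }
    }
    return $hits;
}

$names = ['array_find', 'array_search', 'array_filter', 'array_unique', 'asort'];

// The wording happens to match a function name: keyword search works.
print_r(keywordMatches('find the first element', $names));

// Same kind of task, different wording: nothing matches, although
// array_unique is the obvious answer.
print_r(keywordMatches('remove duplicate values', $names));
```

The second query is the case a vector search is meant to cover: the answer exists, but none of its words appear in the name.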
What we’re building
A command-line tool that:
- Reads the local PHP documentation from HTML files
- Extracts every array function name, signature, and description
- Converts each into a 768-dimensional embedding vector using Ollama
- Stores those vectors in Vektor
- Answers natural-language questions like “sort array while preserving key-value association”
The result is a local semantic search engine over PHP’s array API. Entirely file-based, running offline, and using a few hundred kilobytes on disk.
Prerequisites
- PHP 8.2+, and try it on PHP 8.6 if you want, it is out
- Composer
- Ollama installed and running locally
- The PHP documentation downloaded locally (more on this below)
Once Ollama is installed, pull the embedding model:
ollama pull nomic-embed-text
nomic-embed-text produces 768-dimensional vectors, runs entirely on-device, and is purpose-built for semantic text similarity. Verify that it is working:
curl -s http://localhost:11434/api/embeddings \
-d '{"model":"nomic-embed-text","prompt":"test"}' | \
php -r '$d=json_decode(file_get_contents("php://stdin"),true);echo count($d["embedding"]);'
# Expected: 768
Step 1: Download the PHP documentation
Here is the step where we resist the urge to write a scraper, which is the correct instinct. php.net provides the complete documentation as downloadable archives at https://www.php.net/download-docs.php.
Choose “Many HTML files”. This gives you one HTML file per manual page, which is exactly what we need for per-function parsing. Download and extract:
# The exact filename varies with date; find it on the download page
wget https://www.php.net/distributions/manual/php_manual_en.tar.gz
tar -xzf php_manual_en.tar.gz
mv php-chunked-xhtml php_manual_en
# You now have: php_manual_en/
We used the English manual, but you may also use any of the translations. Just update the language once we reach the queries. Naturellement!
Array functions live in files named function.array-*.html:
ls php_manual_en/function.array-*.html | wc -l # 79
Seventy-nine functions. One HTML file each. Let’s parse them.
Step 2: Project setup
mkdir php-array-search
cd php-array-search
composer require centamiv/vektor
mkdir data
# Symlink or copy the manual directory
ln -s /path/to/php_manual_en ./php_manual_en
In the end, we will have the following final structure:
php-array-search/
├── data/            # Vektor binary files (written automatically)
├── php_manual_en/   # Downloaded PHP docs
├── parse.php        # HTML extraction functions
├── parse_one.php    # HTML extraction of one file
├── embed.php        # Ollama embedding helper
├── index.php        # Indexing script (run once)
├── search.php       # Search CLI (run repeatedly)
├── optimize.php     # Maintenance
└── vendor/
Step 3: Parse the PHP documentation
The downloaded manual is machine-generated from DocBook XML, so the HTML structure is consistent and reliable. Each function page carries the function name, a one-line purpose, the signature, parameter descriptions, and the return value description. We want all of it.
<?php
// parse.php
function parsePhpFunctionFile(string $htmlFile): ?array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTMLFile($htmlFile);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);
    // Function name, e.g. "array_map". Match on the class alone: in current
    // builds of the manual the name sits in an <h1> or <span>, not in a <p>.
    $nameNodes = $xpath->query('//*[@class="refname"]');
    if ($nameNodes->length === 0) {
        return null;
    }
    $name = trim($nameNodes->item(0)->textContent);
    // One-line purpose, strip the repeated function name and the leading
    // em-dash that the manual includes
    $purposeNodes = $xpath->query('//*[@class="refpurpose"]');
    $purpose = '';
    if ($purposeNodes->length > 0) {
        $purpose = trim($purposeNodes->item(0)->textContent);
        $purpose = preg_replace('/^(?:' . preg_quote($name, '/') . ')?\s*[—\-]*\s*/u', '', $purpose);
    }
    // Function signature, grab the first methodsynopsis block
    $sigNodes = $xpath->query('//*[contains(@class,"methodsynopsis")]');
    $signature = '';
    if ($sigNodes->length > 0) {
        $signature = preg_replace('/\s+/', ' ', trim($sigNodes->item(0)->textContent));
    }
    // Parameter descriptions, first paragraph of each <dd> in the parameters section
    $paramNodes = $xpath->query(
        '//div[contains(@class,"refsect1") and contains(@class,"parameters")]//dl//dd//p[1]'
    );
    $paramTexts = [];
    foreach ($paramNodes as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $paramTexts[] = $text;
        }
    }
    // Return value, first paragraph of the returnvalues section
    $returnNodes = $xpath->query(
        '//div[contains(@class,"refsect1") and contains(@class,"returnvalues")]//p[1]'
    );
    $returnText = $returnNodes->length > 0
        ? trim($returnNodes->item(0)->textContent)
        : '';
    return [
        'name' => $name,
        'purpose' => $purpose,
        'signature' => $signature,
        'params' => implode(' ', $paramTexts),
        'returns' => $returnText,
    ];
}
function buildEmbedText(array $func): string
{
    // Richer context = better vectors. Never embed the function name alone.
    return sprintf(
        '%s: %s. Signature: %s. Parameters: %s. Returns: %s.',
        $func['name'],
        $func['purpose'],
        $func['signature'],
        $func['params'],
        $func['returns']
    );
}
The buildEmbedText function is more than optional decoration. Embedding just "array_map" gives you a vector that will cluster with "map" and "array". Embedding the full description, with its name, purpose, signature, what the parameters do, and what is returned, gives a vector that encodes what the function actually does. That semantic richness is what makes the search useful.
If you want to verify the parser on a single file before indexing everything, this snippet is your friend:
<?php
// parse_one.php
require __DIR__ . '/parse.php';
$func = parsePhpFunctionFile('php_manual_en/function.array-map.html');
print_r($func);
echo "\n--- Embed text ---\n";
echo buildEmbedText($func) . "\n";
Expected output (abbreviated):
Array
(
[name] => array_map
[purpose] => Applies the callback to the elements of the given arrays
[signature] => array array_map(?callable $callback, array $array, array ...$arrays)
[params] => A callable to run for each element in each array. ...
[returns] => Returns an array containing the results of applying the ...
)
--- Embed text ---
array_map: Applies the callback to the elements of the given arrays. Signature: ...
If the XPath selectors produce empty strings for your downloaded version, inspect one file with xmllint --html --xpath '//*[@class="refname"]' php_manual_en/function.array-map.html and adjust the selectors accordingly. The structure is consistent across the manual, so once one file works, all of them will.
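If xmllint is not at hand, the same inspection can be done from PHP. The sketch below counts every element/class pair in a page, which shows at a glance which selectors can match at all. listClasses() is a hypothetical helper, and the HTML fragment is made up so the snippet runs on its own; it is not copied from the manual.

```php
<?php
// Hypothetical diagnostic helper: count how often each element/class pair
// occurs in an HTML document, to see which selectors can match at all.
function listClasses(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $counts = [];
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[@class]') as $node) {
        $key = $node->nodeName . '.' . $node->getAttribute('class');
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    ksort($counts);
    return $counts;
}

// A made-up fragment for demonstration. For the real thing, pass
// file_get_contents('php_manual_en/function.array-map.html') instead.
$sample = '<div class="demo"><h1 class="title">x</h1>'
        . '<p class="note">a</p><p class="note">b</p></div>';
print_r(listClasses($sample));
```

On a real manual page, look for the entries that mention refname, refpurpose, methodsynopsis, parameters, and returnvalues, and compare them with the selectors in parse.php.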
Step 4: Embeddings with Ollama
Nothing exotic here. The code is mainly a curl call to the local Ollama API, which returns the embedding:
<?php
// embed.php
function getEmbedding(string $text): array
{
    $payload = json_encode([
        'model' => 'nomic-embed-text',
        'prompt' => $text,
    ]);
    $ch = curl_init('http://localhost:11434/api/embeddings');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $payload,
        CURLOPT_HTTPHEADER => ['Content-Type: application/json'],
    ]);
    $raw = curl_exec($ch);
    if ($raw === false) {
        // Connection-level failure, typically because Ollama is not running
        throw new RuntimeException('Ollama request failed: ' . curl_error($ch));
    }
    $response = json_decode($raw, true);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($httpCode !== 200 || empty($response['embedding'])) {
        throw new RuntimeException(
            "Ollama returned HTTP $httpCode: " . json_encode($response)
        );
    }
    return $response['embedding']; // 768 floats
}
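Since everything downstream depends on the vector having the right shape, a small guard between the embedding call and the index can save some head-scratching. This is a sketch, not part of Vektor or Ollama: assertEmbeddingShape() is a hypothetical helper, and 768 is simply the dimension of nomic-embed-text mentioned above.

```php
<?php
// Hypothetical guard: reject embeddings that do not have the expected shape
// before they reach the index.
function assertEmbeddingShape(array $vector, int $expectedDim = 768): void
{
    if (count($vector) !== $expectedDim) {
        throw new LengthException(
            sprintf('Expected %d dimensions, got %d', $expectedDim, count($vector))
        );
    }
    foreach ($vector as $i => $value) {
        // json_decode() may yield ints for values like 0 or 1, so accept both
        if ((!is_int($value) && !is_float($value)) || !is_finite((float) $value)) {
            throw new UnexpectedValueException("Component $i is not a finite number");
        }
    }
}

// A vector of the right size passes silently.
assertEmbeddingShape(array_fill(0, 768, 0.1));

// A vector from a different model (1024 dimensions) is rejected loudly.
try {
    assertEmbeddingShape(array_fill(0, 1024, 0.1));
} catch (LengthException $e) {
    echo $e->getMessage(), "\n";
}
```

Calling it right after getEmbedding() turns a silent mismatch into an exception with a readable message.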
Step 5: Indexing
This is where it all comes together. One important constraint to understand before writing a single line: Vektor’s binary file layout is baked to a fixed record width determined by the vector dimension.
The default is 1536, and it is sized for OpenAI embeddings. We are using 768-dimensional vectors from nomic-embed-text. If you forget to set the dimension before instantiating the Indexer, it will silently write vectors into a binary file with the wrong size. The reads then return garbage, and there will be no error message: just meaningless results, also known as false positives. We also say hallucinations nowadays, although the fashion is already fading away. Just set dimensions first.
<?php
// index.php
require __DIR__ . '/vendor/autoload.php';
require __DIR__ . '/parse.php';
require __DIR__ . '/embed.php';
use Centamiv\Vektor\Core\Config;
use Centamiv\Vektor\Services\Indexer;
// Dimension MUST be set before new Indexer().
// nomic-embed-text = 768. Vektor default = 1536. They are not compatible.
Config::setDataDir(__DIR__ . '/data');
Config::setDimensions(768);
$indexer = new Indexer();
$payloadFile = __DIR__ . '/data/payload.json';
$payload = file_exists($payloadFile) ? (json_decode(file_get_contents($payloadFile), true) ?? []) : [];
$docDir = __DIR__ . '/php_manual_en';
// This slice is meant to pick the array functions out of the 'standard'
// extension. The offsets 21 and 87 are tied to the PHP build used here:
// print the slice and adjust them if your version lists other functions.
$functions = array_slice(get_extension_funcs('standard'), 21, 87);
$files = array_map(fn ($name) => $docDir . '/function.' . str_replace('_', '-', $name) . '.html', $functions);
// initial collection of functions, a bit naive
//$files = glob($docDir . '/function.array-*.html');
if (empty($files)) {
    die("No HTML files found in $docDir. Check your symlink or path.\n");
}
$indexed = 0;
$skipped = 0;
foreach ($files as $file) {
    $func = parsePhpFunctionFile($file);
    if ($func === null || $func['name'] === '' || $func['purpose'] === '') {
        echo "Skipping (unparseable): " . basename($file) . "\n";
        $skipped++;
        continue;
    }
    // Function names are well under 36 chars, Vektor's ID limit.
    $id = $func['name'];
    $text = buildEmbedText($func);
    echo "Indexing {$func['name']}... ";
    try {
        $vector = getEmbedding($text);
        $indexer->insert($id, $vector);
        $payload[$id] = [
            'name' => $func['name'],
            'purpose' => $func['purpose'],
            'signature' => $func['signature'],
            'params' => $func['params'],
            'returns' => $func['returns'],
        ];
        echo "done\n";
        $indexed++;
    } catch (RuntimeException $e) {
        echo "FAILED: {$e->getMessage()}\n";
        $skipped++;
    }
}
file_put_contents($payloadFile, json_encode($payload, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE));
echo "\nIndexed : $indexed\n";
echo "Skipped : $skipped\n\n";
$stats = $indexer->getStats();
printf(
    "Vectors on disk : %d\nGraph nodes : %d\nStorage : %d KB\n",
    $stats['records']['vectors_total'],
    $stats['records']['graph_nodes'],
    intdiv(
        $stats['storage']['vector_file_bytes']
        + $stats['storage']['graph_file_bytes']
        + $stats['storage']['meta_file_bytes']
        + (file_exists($payloadFile) ? filesize($payloadFile) : 0),
        1024
    )
);
Run it:
php index.php
Note that index.php is usually the default entry point of a PHP web application. Here, the name is meant literally: the script builds an index.
Expected output:
Indexing array_chunk... done
Indexing array_column... done
Indexing array_combine... done
...
Indexing usort... done
Indexed : 59
Skipped : 0
Vectors on disk : 59
Graph nodes : 59
Storage : 200 KB
59 functions, ~200 KB on disk. A complete, self-contained knowledge base. The indexing run takes 30–60 seconds on a modern laptop, because Ollama generates the embeddings sequentially. You can speed that up later; it is not critical in our case, and you only do it once.
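One cheap way to take the sting out of repeated runs is to cache embeddings on disk, keyed by a hash of the text. It does nothing for the first run, but a re-index after a small change only pays for the texts that actually changed. The sketch below is an assumption-laden illustration: cachedEmbedding() and the cache directory are hypothetical, and the embedder is passed in as a callable so the example runs without Ollama.

```php
<?php
// Hypothetical caching wrapper: $embed is any callable that turns text into a
// vector (for example the getEmbedding() function from embed.php).
function cachedEmbedding(string $text, callable $embed, string $cacheDir): array
{
    if (!is_dir($cacheDir) && !mkdir($cacheDir, 0777, true) && !is_dir($cacheDir)) {
        throw new RuntimeException("Cannot create cache directory $cacheDir");
    }
    // The hash of the text is the cache key: edited text gets a new embedding.
    $file = $cacheDir . '/' . hash('sha256', $text) . '.json';
    if (is_file($file)) {
        $cached = json_decode(file_get_contents($file), true);
        if (is_array($cached)) {
            return $cached;
        }
    }
    $vector = $embed($text);
    file_put_contents($file, json_encode($vector));
    return $vector;
}

// Demonstration with a fake embedder that counts how often it is called.
$calls = 0;
$fake = function (string $text) use (&$calls): array {
    $calls++;
    return [strlen($text) / 10, 0.5];
};
$dir = sys_get_temp_dir() . '/embed-cache-demo-' . uniqid();
$first  = cachedEmbedding('array_map: applies a callback', $fake, $dir);
$second = cachedEmbedding('array_map: applies a callback', $fake, $dir);
echo "Embedder calls: $calls\n"; // the second lookup is served from disk
```

In index.php, this would wrap the getEmbedding() call, for instance as cachedEmbedding($text, 'getEmbedding', __DIR__ . '/data/cache').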
Peek at what Vektor created in data/:
ls -lh data/
# vector.bin  : raw float arrays, fixed record width
# graph.bin   : HNSW graph connections
# meta.bin    : ID-to-offset BST (no RAM map, pure disk seeks)
# payload.bin : serialized metadata (JSON blobs)
Four binary files, no external process, no daemon. The entire database is those four files.
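To see why a wrong dimension produces garbage instead of an error, here is a toy version of a fixed-width record file. It rests on assumptions of mine: 32-bit little-endian floats and no per-record header. Vektor's real layout may differ; the point is only that a record is located by multiplying an index by a width.

```php
<?php
// Toy illustration of a fixed-width record file. Assumptions: 32-bit
// little-endian floats and no per-record header. Vektor's real layout may differ.
const BYTES_PER_FLOAT = 4;

function packRecord(array $vector): string
{
    return pack('g*', ...$vector);
}

function readRecord(string $blob, int $index, int $dimension): array
{
    $width = $dimension * BYTES_PER_FLOAT;            // record width in bytes
    $chunk = substr($blob, $index * $width, $width);  // a pure offset computation
    return array_values(unpack('g*', $chunk));
}

// Write three 4-dimensional records back to back.
$blob = packRecord([1.0, 1.0, 1.0, 1.0])
      . packRecord([2.0, 2.0, 2.0, 2.0])
      . packRecord([3.0, 3.0, 3.0, 3.0]);

// Correct dimension: record 1 is the second vector (values 2, 2, 2, 2).
print_r(readRecord($blob, 1, 4));

// Wrong dimension (3): the read succeeds, but it straddles two records
// and returns the values 1, 2, 2.
print_r(readRecord($blob, 1, 3));
```

Nothing in the second read can tell that it is wrong: the bytes are valid floats, just not the ones you asked for. Under the same 4-byte assumption, a 768-dimensional record is 3072 bytes wide and a 1536-dimensional one is 6144.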
Step 6: Search
Now the good part:
<?php
// search.php
require __DIR__ . '/vendor/autoload.php';
require __DIR__ . '/embed.php';
use Centamiv\Vektor\Core\Config;
use Centamiv\Vektor\Services\Searcher;
Config::setDataDir(__DIR__ . '/data');
Config::setDimensions(768);
$searcher = new Searcher();
$payloadFile = __DIR__ . '/data/payload.json';
$payload = file_exists($payloadFile) ? (json_decode(file_get_contents($payloadFile), true) ?? []) : [];
$query = $argv[1] ?? 'find elements in common between two arrays';
$k = (int) ($argv[2] ?? 5);
echo "\nQuery : \"$query\"\n";
echo str_repeat('─', 62) . "\n";
$queryVector = getEmbedding($query);
$results = $searcher->search($queryVector, $k, includeVector: false);
foreach ($results as $result) {
    $meta = $payload[$result['id']] ?? ['name' => $result['id'], 'purpose' => '(no metadata)'];
    $score = number_format($result['score'], 4);
    $wrapped = wordwrap($meta['purpose'] ?? '', 54, "\n ", true);
    printf("[%s] %s\n %s\n\n", $score, $meta['name'], $wrapped);
}
Use it from the command line:
php search.php "find elements in common between two arrays"
php search.php "sort array while keeping key-value association"
php search.php "apply a function to transform every element"
php search.php "remove duplicate values"
php search.php "add element to the beginning"
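With only a few dozen vectors, it is easy to picture what the search is approximating: score every stored vector against the query and keep the best few. The sketch below does exactly that by brute force, assuming the score is a cosine similarity. The function names and the tiny 3-dimensional vectors are stand-ins of mine; Vektor's HNSW index reaches a similar ranking without visiting every vector.

```php
<?php
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(array $a, array $b): float
{
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction to compare
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Exhaustive search: score every stored vector, sort, keep the best $k.
function bruteForceSearch(array $query, array $vectorsById, int $k): array
{
    $scores = [];
    foreach ($vectorsById as $id => $vector) {
        $scores[$id] = cosineSimilarity($query, $vector);
    }
    arsort($scores);
    return array_slice($scores, 0, $k, true);
}

// Tiny 3-dimensional stand-ins for real 768-dimensional embeddings.
$vectors = [
    'asort'     => [0.9, 0.1, 0.0],
    'ksort'     => [0.7, 0.3, 0.1],
    'array_map' => [0.0, 0.2, 0.9],
];

// The query points along the first axis, so asort and ksort come out on top.
print_r(bruteForceSearch([1.0, 0.0, 0.0], $vectors, 2));
```

A brute-force pass like this is also a handy reference when results look suspicious: if it disagrees wildly with the index, the index is the thing to inspect.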
What the vectors reveal
Let’s look at real output and see what’s interesting about it.
“find elements in common between two arrays”
[0.6704] array_find
array_find — Returns the first element satisfying a
callback function
[0.6568] array_intersect
array_intersect — Computes the intersection of
arrays
[0.6506] array_combine
array_combine — Creates an array by using one array
for keys and another for its values
[0.6472] array_find_key
array_find_key — Returns the key of the first
element satisfying a callback function
[0.6470] array_intersect_key
array_intersect_key — Computes the intersection of
arrays using keys for comparison
Two members of the intersect family, array_intersect and array_intersect_key, make the top five, but array_find takes first place, presumably pulled up by the word “find” in the query. The model doesn’t know PHP, but it knows language. That result is not wrong; it is useful. And don’t let any superior AI engine tell you otherwise: what we’re doing here is a small-scope version of the global race that we see around us. Nothing more.
“sort array while keeping key-value association”
[0.7179] asort
asort — Sort an array in ascending order and
maintain index association
[0.6950] ksort
ksort — Sort an array by key in ascending order
[0.6896] sort
sort — Sort an array in ascending order
[0.6855] arsort
arsort — Sort an array in descending order and
maintain index association
[0.6774] uasort
uasort — Sort an array with a user-defined
comparison function and maintain index association
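asort leads, and the other association-preserving variants, arsort and uasort, also make the top five, next to ksort. Plain sort, which does not preserve keys, still slips in at #3, so the ranking is good rather than perfect. usort does not appear, and it explicitly does not preserve key association: nice. The model picks this up from the description text. This is the search working as intended.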
“apply a callback to an array”
[0.7834] array_map
array_map — Applies the callback to the elements of
the given arrays
[0.7469] array_reduce
array_reduce — Iteratively reduce the array to a
single value using a callback function
[0.7313] array_walk
array_walk — Apply a user supplied function to every
member of an array
[0.7220] array_all
array_all — Checks if all array elements satisfy a
callback function
[0.7218] array_filter
array_filter — Filters elements of an array using a
callback function
This one is philosophically interesting. array_map, array_walk, and array_reduce all “apply a function to elements of an array.” They are semantically nearly identical: the difference is behavioral. array_map returns a new array, array_walk modifies the array in place, and array_reduce collapses it into a single value. The three descriptions use similar language, so they cluster tightly.
array_filter also takes a callback, but its purpose is selection, not transformation: that could explain why it ranks last of the five. The search is surfacing a meaningful semantic grouping that the PHP manual’s alphabetical listing completely obscures.
“remove duplicate values”
[0.6298] array_unique
array_unique — Removes duplicate values from an
array
[0.5445] array_count_values
array_count_values — Counts the occurrences of each
distinct value in an array
[0.5281] array_unshift
array_unshift — Prepend one or more elements to the
beginning of an array
[0.5150] array_reduce
array_reduce — Iteratively reduce the array to a
single value using a callback function
[0.5080] array_combine
array_combine — Creates an array by using one array
for keys and another for its values
array_unique is first: well, but of course. array_count_values is second. The model has no knowledge of PHP idioms, yet array_count_values still ended up near array_unique in the vector space because both descriptions discuss array values and their uniqueness. Coincidence? Possibly. Useful? Definitely.
A word on scores
Cosine similarity in this range does not have universal absolute meaning: what matters is the relative ranking. Roughly:
| Score | Meaning |
|---|---|
| > 0.90 | Near-identical purpose |
| 0.80–0.90 | Same functional family |
| 0.70–0.80 | Related operations |
| < 0.70 | Loosely connected at best |
If all your results are clustering below 0.65, your embedding text is too sparse. Go back to buildEmbedText and add more context. The parameter descriptions have a large impact.
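Since the ranking matters more than the absolute values, it can help to look at each score relative to the best hit. The helper below is a hypothetical sketch that works on the same id/score pairs search.php prints; the sample scores are the first three from the “remove duplicate values” query above.

```php
<?php
// Hypothetical helper: express each score as its distance to the best hit,
// so the shape of the ranking is easier to read than the raw values.
function relativeScores(array $results): array
{
    if ($results === []) {
        return [];
    }
    $best = max(array_column($results, 'score'));
    $out = [];
    foreach ($results as $result) {
        $out[$result['id']] = round($result['score'] - $best, 4);
    }
    return $out;
}

// Scores taken from the "remove duplicate values" query.
$results = [
    ['id' => 'array_unique',       'score' => 0.6298],
    ['id' => 'array_count_values', 'score' => 0.5445],
    ['id' => 'array_unshift',      'score' => 0.5281],
];
print_r(relativeScores($results));
```

Here array_unique stands about 0.085 clear of the runner-up, a far more decisive answer than the “elements in common” query, where the top five sit within roughly 0.02 of each other.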
Step 7: Maintenance
A technical note here: Vektor uses soft deletes. When you call insert() with an existing ID, the old record is tombstoned rather than overwritten. Over multiple re-indexing runs, the binary files grow with dead records. The Optimizer compacts them:
<?php
// optimize.php
require __DIR__ . '/vendor/autoload.php';
use Centamiv\Vektor\Core\Config;
use Centamiv\Vektor\Services\Optimizer;
Config::setDataDir(__DIR__ . '/data');
Config::setDimensions(768);
echo "Running optimizer...\n";
(new Optimizer())->run();
echo "Done.\n";
php optimize.php
Run this after any bulk re-indexing. It rebuilds the HNSW graph from scratch and reclaims space from all the tombstoned records.
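If tombstones are a new idea, the toy class below shows the principle in a few lines: an insert with a known ID marks the old record as dead and appends a new one, and compaction drops the dead ones. It is an in-memory illustration only, with a made-up class name, and says nothing about how Vektor implements it on disk.

```php
<?php
// Toy append-only store with soft deletes. This only illustrates the idea of
// tombstones and compaction; it is not how Vektor is implemented.
final class ToyStore
{
    /** @var list<array{id: string, value: string, deleted: bool}> */
    private array $records = [];

    public function insert(string $id, string $value): void
    {
        // Mark any earlier record with the same ID as dead instead of removing it
        foreach ($this->records as &$record) {
            if ($record['id'] === $id) {
                $record['deleted'] = true;
            }
        }
        unset($record);
        $this->records[] = ['id' => $id, 'value' => $value, 'deleted' => false];
    }

    public function compact(): void
    {
        // Keep live records only and renumber them
        $this->records = array_values(
            array_filter($this->records, fn (array $r): bool => !$r['deleted'])
        );
    }

    public function size(): int
    {
        return count($this->records);
    }
}

$store = new ToyStore();
$store->insert('array_map', 'v1');
$store->insert('array_map', 'v2');   // re-index: v1 becomes a tombstone
$store->insert('asort', 'v1');
echo $store->size(), "\n";           // 3 records stored, 2 of them live
$store->compact();
echo $store->size(), "\n";           // 2
```

Every re-indexing run adds one dead record per function, which is why compaction is worth running after bulk updates.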
Going further
Expand to all PHP built-in functions. The manual has HTML files for ~1000 built-in functions. Replace the $files assignment in index.php with a glob over every function page:
<?php
$files = glob($docDir . '/function.*.html');
?>
Expect ~10 minutes of embedding time and a few MB of vector data. The search stays fast: HNSW lookups scale roughly logarithmically with the number of vectors, so doubling the index does not double query time.
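Those figures can be sanity-checked with a little arithmetic. The sketch below assumes 4 bytes per float, counts the vector data only, and assumes a constant time per embedding; the helper name is made up, and the per-embedding times are derived from the 59 functions in 30 to 60 seconds reported earlier.

```php
<?php
// Back-of-the-envelope estimate. Assumptions: 4 bytes per float, vector data
// only (graph and metadata come on top), constant time per embedding.
function estimateIndex(int $functions, int $dimension, float $secondsPerEmbedding): array
{
    return [
        'vector_megabytes'  => round($functions * $dimension * 4 / 1_000_000, 1),
        'embedding_minutes' => round($functions * $secondsPerEmbedding / 60, 1),
    ];
}

// 59 functions took 30 to 60 seconds, so roughly 0.5 to 1 second each.
print_r(estimateIndex(1000, 768, 0.5));
print_r(estimateIndex(1000, 768, 1.0));
```

That gives about 3 MB of vectors and somewhere between 8 and 17 minutes of embedding time for 1000 functions, which is in line with the estimate above.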
Run Vektor as a shared HTTP API. Vektor ships with a built-in controller for HTTP mode. Index once from a CLI script and query from anywhere:
cp .env.example .env
# Set VEKTOR_API_TOKEN= and VEKTOR_DIMENSIONS=768 in .env
php -S 0.0.0.0:8000 -t public
Think about security first: binding to 0.0.0.0 exposes the server on every network interface, and PHP’s built-in server is meant for development. That is not the same pond as a local experiment.
Then search via HTTP, useful for a team tool or a web UI:
curl -s http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{"vector": [...768 floats...], "k": 5, "include_metadata": true}'
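Nobody types 768 floats by hand, so the request body is better built in code. The sketch below only assembles the JSON, using the three field names from the curl example; buildSearchBody() is a hypothetical helper, and how the API token has to be sent is not shown here because it is not covered above.

```php
<?php
// Build the JSON body for the /search call. The field names come from the
// curl example; nothing else about the HTTP API is assumed here.
function buildSearchBody(array $vector, int $k = 5, bool $includeMetadata = true): string
{
    return json_encode(
        [
            'vector' => array_values($vector),
            'k' => $k,
            'include_metadata' => $includeMetadata,
        ],
        JSON_THROW_ON_ERROR
    );
}

// A 3-float stand-in; a real call would pass the 768 floats from getEmbedding().
$body = buildSearchBody([0.12, -0.5, 0.33], 5);
echo $body, "\n";
```

The resulting string can be handed to CURLOPT_POSTFIELDS, the same way embed.php sends its payload to Ollama.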
Try a larger embedding model. nomic-embed-text is fast and small. mxbai-embed-large produces 1024-dimensional vectors with generally better retrieval accuracy for English text. Remember: changing dimensions means deleting data/ and re-indexing from scratch, because the binary record width changes.
Embed richer content. The PHP manual pages include code examples. Appending a representative code snippet to the embed text often improves recall dramatically, because the example captures usage patterns that the prose description doesn’t state explicitly. array_walk and array_map sound similar in description; in practice, one mutates and the other doesn’t: a code example makes that difference visible to the embedding model.
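A possible shape for that, as a sketch: a variant of buildEmbedText() that appends an example when one is available. The 'example' key is an assumption, since parse.php as shown does not extract examples, and the function name and length cap are arbitrary choices of mine. Only name and purpose are included to keep the snippet short.

```php
<?php
// Hypothetical extension of buildEmbedText(): append a code example when the
// parser provides one. The 'example' key is an assumption; parse.php as shown
// does not extract it.
function buildEmbedTextWithExample(array $func, int $maxExampleChars = 500): string
{
    $text = sprintf('%s: %s.', $func['name'], $func['purpose']);
    $example = trim($func['example'] ?? '');
    if ($example !== '') {
        // Collapse whitespace and cap the length so the example
        // cannot drown out the description
        $example = preg_replace('/\s+/', ' ', $example);
        $example = function_exists('mb_substr')
            ? mb_substr($example, 0, $maxExampleChars)
            : substr($example, 0, $maxExampleChars);
        $text .= ' Example: ' . $example;
    }
    return $text;
}

echo buildEmbedTextWithExample([
    'name' => 'array_walk',
    'purpose' => 'Apply a user supplied function to every member of an array',
    'example' => "array_walk(\$fruits, function (&\$v) {\n    \$v = strtoupper(\$v);\n});",
]), "\n";
```

The by-reference &$v in the example is precisely the kind of detail the one-line purpose never mentions.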
Your mileage may vary
I took the values for this post while writing it. They apply to the datasets that I used and mentioned, in early July 2026. The results might vary a lot depending on the actual embedding model used (its version, type, and number of parameters), the available PHP functions and their documentation, etc. The scores may vary too.
The takeaway
PHP has had the pieces for this kind of tooling for a while. Vektor provides the vector storage and the HNSW index. Ollama provides the embeddings. The PHP manual provides the data.
You provide fifteen minutes of setup time.
The result is a local, zero-dependency semantic search over PHP’s own API. No cloud, no subscription, no rate limits: just four binary files on disk and a curl call to your laptop.
PHP has eighty array functions, most of them named in ways that made sense in 1997. A vector database doesn’t care. Ask it what you want in plain language, and it will find the function you are looking for, even the one you didn’t know existed.

