Using a vector database with PHP

PHP has roughly 80 array functions: eight – zero. array_map, array_walk, array_reduce, array_filter, array_each… wait, that last one was deprecated and removed in PHP 7.2. Or not? And some new functions came up in PHP 8.5, last November.

The real problem is discoverability. You know the task, “find the first element matching a condition”, but you can’t remember if it’s array_find, array_search, array_filter, or something that doesn’t exist yet and you’ll have to write yourself. You end up on the eternal manual php.net, scrolling through the full list like it’s a 2003 API reference, which it kind of is, hoping that the right function will jump out of the dark at the best moment.

Vector databases solve this with a different approach entirely. Instead of matching text, they match meaning. You describe what you want, and you get back the semantically closest functions, regardless of the exact words. So, today we’ll build exactly that using Vektor, a pure PHP file-based vector database, and Ollama for local embeddings.

In a nutshell: no cloud; no API key; no monthly bill; no privacy invasion. Just PHP doing what PHP does: being surprisingly capable.

What we’re building

A command-line tool that:

Reads the local PHP documentation from HTML files
Extracts every array function name, signature, and description
Converts each into a 768-dimensional embedding vector using Ollama
Stores those vectors in Vektor
Answers natural-language questions like “sort array while preserving key-value association”

The result is a local semantic search engine over PHP’s array API. Entirely file-based, running offline, and using a few hundred kilobytes on disk.

Prerequisites

PHP 8.2+ (and try it on PHP 8.6 if you want: it is out)
Composer
Ollama installed and running locally
The PHP documentation downloaded locally (more on this below)

Once Ollama is installed, pull the embedding model:

ollama pull nomic-embed-text

nomic-embed-text produces 768-dimensional vectors, runs entirely on-device, and is purpose-built for semantic text similarity. Verify that it is working:

curl -s http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"test"}' | \
  php -r '$d=json_decode(file_get_contents("php://stdin"),true);echo count($d["embedding"]);'
# Expected: 768

Step 1: Download the PHP documentation

Here is the step where we resist the urge to write a scraper, which is the correct instinct. php.net provides the complete documentation as downloadable archives at https://www.php.net/download-docs.php.

Choose “Many HTML files”. This gives you one HTML file per manual page, which is exactly what we need for per-function parsing. Download and extract:

# The exact filename varies with date; find it on the download page
wget https://www.php.net/distributions/manual/php_manual_en.tar.gz
tar -xzf php_manual_en.tar.gz
mv php-chunked-xhtml php_manual_en 
# You now have: php_manual_en/

We used the English manual, but you may also use any of the translations. Just update the language once we reach the queries. Naturellement!

Array functions live in files named function.array-*.html:

ls php_manual_en/function.array-*.html | wc -l
# 79

Seventy-nine functions. One HTML file each. Let’s parse them.

Step 2: Project setup

mkdir php-array-search
cd php-array-search
composer require centamiv/vektor
mkdir data
# Symlink or copy the manual directory
ln -s /path/to/php_manual_en ./php_manual_en

In the end, we will have the following final structure:

php-array-search/
├── data/               # Vektor binary files (written automatically)
├── php_manual_en/      # Downloaded PHP docs
├── parse.php           # HTML extraction functions
├── parse_one.php       # Parser check on a single file
├── embed.php           # Ollama embedding helper
├── index.php           # Indexing script (run once)
├── search.php          # Search CLI (run repeatedly)
├── optimize.php        # Maintenance
└── vendor/

Step 3: Parse the PHP documentation

The downloaded manual is machine-generated from DocBook XML, so the HTML structure is consistent and reliable. Each function page carries the function name, a one-line purpose, the signature, parameter descriptions, and the return value description. We want all of it.

<?php
// parse.php

function parsePhpFunctionFile(string $htmlFile): ?array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTMLFile($htmlFile);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    // Function name, e.g. "array_map". The tag carrying the class may differ
    // between manual builds, so match on the class alone.
    $nameNodes = $xpath->query('//*[@class="refname"]');
    if ($nameNodes->length === 0) {
        return null;
    }
    $name = trim($nameNodes->item(0)->textContent);

    // One-line purpose, strip the leading em-dash that the manual includes
    $purposeNodes = $xpath->query('//*[@class="refpurpose"]');
    $purpose = '';
    if ($purposeNodes->length > 0) {
        $purpose = trim($purposeNodes->item(0)->textContent);
        $purpose = preg_replace('/^\s*[—\-]+\s*/u', '', $purpose);
    }

    // Function signature, grab the first methodsynopsis block
    $sigNodes = $xpath->query('//*[contains(@class,"methodsynopsis")]');
    $signature = '';
    if ($sigNodes->length > 0) {
        $signature = preg_replace('/\s+/', ' ', trim($sigNodes->item(0)->textContent));
    }

    // Parameter descriptions, first paragraph of each <dd> in the parameters section
    $paramNodes = $xpath->query(
        '//div[contains(@class,"refsect1") and contains(@class,"parameters")]//dl//dd//p[1]'
    );
    $paramTexts = [];
    foreach ($paramNodes as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $paramTexts[] = $text;
        }
    }

    // Return value, first paragraph of the returnvalues section
    $returnNodes = $xpath->query(
        '//div[contains(@class,"refsect1") and contains(@class,"returnvalues")]//p[1]'
    );
    $returnText = $returnNodes->length > 0
        ? trim($returnNodes->item(0)->textContent)
        : '';

    return [
        'name'      => $name,
        'purpose'   => $purpose,
        'signature' => $signature,
        'params'    => implode(' ', $paramTexts),
        'returns'   => $returnText,
    ];
}

function buildEmbedText(array $func): string
{
    // Richer context = better vectors. Never embed the function name alone.
    return sprintf(
        '%s: %s. Signature: %s. Parameters: %s. Returns: %s.',
        $func['name'],
        $func['purpose'],
        $func['signature'],
        $func['params'],
        $func['returns']
    );
}

The buildEmbedText function is more than optional decoration. Embedding just "array_map" gives you a vector that will cluster with "map" and "array". Embedding the full description, with its name, purpose, signature, what the parameters do, and what is returned, gives a vector that encodes what the function actually does. That semantic richness is what makes the search useful.

If you want to verify the parser on a single file before indexing everything, this snippet is your friend:

<?php
// parse_one.php

require __DIR__ . '/parse.php';

$func = parsePhpFunctionFile('php_manual_en/function.array-map.html');
print_r($func);
echo "\n--- Embed text ---\n";
echo buildEmbedText($func) . "\n";

Expected output (abbreviated):

Array
(
    [name]      => array_map
    [purpose]   => Applies the callback to the elements of the given arrays
    [signature] => array array_map(?callable $callback, array $array, array ...$arrays)
    [params]    => A callable to run for each element in each array. ...
    [returns]   => Returns an array containing the results of applying the ...
)

--- Embed text ---
array_map: Applies the callback to the elements of the given arrays. Signature: ...

If the XPath selectors produce empty strings for your downloaded version, inspect one file with xmllint --html --xpath '//*[@class="refname"]' php_manual_en/function.array-map.html and adjust the selectors accordingly. The structure is consistent across the manual, so once one file works, all of them will.
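If xmllint is not at hand, the same check can be done from PHP by counting how many nodes each selector matches. This is a minimal sketch: countMatches() is a helper name invented for this example, and the HTML fragment is a hand-written approximation of a manual page, not a copy of one.

```php
<?php
// Hypothetical helper: count how many nodes each XPath expression matches.
function countMatches(string $html, array $queries): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($dom);
    $counts = [];
    foreach ($queries as $label => $query) {
        $counts[$label] = $xpath->query($query)->length;
    }
    return $counts;
}

// Hand-written sample: only an assumption about what a page looks like.
$sample = <<<'HTML'
<html><body>
<div class="refnamediv">
  <h1 class="refname">array_map</h1>
  <p class="refpurpose"><span class="refname">array_map</span> &mdash; <span class="dc-title">Applies the callback</span></p>
</div>
<div class="refsect1 parameters"><dl><dt>callback</dt><dd><p>A callable.</p></dd></dl></div>
</body></html>
HTML;

$counts = countMatches($sample, [
    'name (any tag)' => '//*[@class="refname"]',
    'name (p only)'  => '//p[@class="refname"]',
    'parameters'     => '//div[contains(@class,"refsect1") and contains(@class,"parameters")]//dl//dd//p[1]',
]);
print_r($counts);
```

A selector that reports 0 on a real page is the one to adjust. To probe a downloaded file instead of the sample, pass file_get_contents() of that file as the first argument.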

Step 4: Embeddings with Ollama

Nothing exotic here. The code is mainly a curl call to the local Ollama API, which returns the embedding:

<?php
// embed.php

function getEmbedding(string $text): array
{
    $payload = json_encode([
        'model'  => 'nomic-embed-text',
        'prompt' => $text,
    ]);

    $ch = curl_init('http://localhost:11434/api/embeddings');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $payload,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    ]);

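    $raw      = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // curl_exec() returns false when the request itself fails,
    // e.g. when Ollama is not running.
    if ($raw === false) {
        throw new RuntimeException('Cannot reach Ollama: ' . curl_error($ch));
    }

    $response = json_decode($raw, true);

    if ($httpCode !== 200 || empty($response['embedding'])) {
        throw new RuntimeException(
            "Ollama returned HTTP $httpCode: " . json_encode($response)
        );
    }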

    return $response['embedding']; // 768 floats
}
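Since the embedding call is the slow part, one possible refinement is to cache vectors on disk, keyed by a hash of the embedded text. The sketch below is not part of the original scripts: cachedEmbedding() and the cache directory are assumptions, and the embedder is passed in as a callable so that the demo runs without Ollama.

```php
<?php
// Hypothetical wrapper: cache embeddings on disk, keyed by a hash of the text.
function cachedEmbedding(string $text, callable $embedder, string $cacheDir): array
{
    if (!is_dir($cacheDir) && !mkdir($cacheDir, 0777, true) && !is_dir($cacheDir)) {
        throw new RuntimeException("Cannot create cache directory $cacheDir");
    }

    $file = $cacheDir . '/' . hash('sha256', $text) . '.json';

    if (is_file($file)) {
        $cached = json_decode((string) file_get_contents($file), true);
        if (is_array($cached)) {
            return $cached;            // cache hit: no call to the embedder
        }
    }

    $vector = $embedder($text);        // cache miss: compute, then store
    file_put_contents($file, json_encode($vector));

    return $vector;
}

// Demo with a stand-in embedder that counts how often it is called.
$calls = 0;
$fake  = function (string $text) use (&$calls): array {
    $calls++;
    return [strlen($text) / 100, 0.5];
};

$dir    = sys_get_temp_dir() . '/embed-cache-' . uniqid();
$first  = cachedEmbedding('array_map: applies a callback', $fake, $dir);
$second = cachedEmbedding('array_map: applies a callback', $fake, $dir);

echo "Embedder calls: $calls\n";
```

With the real helper, the call would look like cachedEmbedding($text, 'getEmbedding', __DIR__ . '/cache'). The cache has to be emptied when the embedding model changes, since the key only covers the text.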

Step 5: Indexing

This is where it all comes together. One important constraint to understand before writing a single line: Vektor’s binary file layout is baked to a fixed record width determined by the vector dimension.

The default is 1536, and it is sized for OpenAI embeddings. We are using 768-dimensional vectors from nomic-embed-text. If you forget to set the dimension before instantiating the Indexer, it will silently write vectors into a binary file with the wrong size. The reads then return garbage, and there will be no error message: just meaningless results, also known as false positives. We also say hallucinations nowadays, although the fashion is already fading away. Just set dimensions first.
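Because nothing complains when the sizes disagree, a cheap guard before each insert can turn the silent failure into a loud one. This is a minimal sketch: assertDimension() is a name made up for this example, not a Vektor function.

```php
<?php
// Hypothetical guard: refuse any vector whose size differs from the configured one.
function assertDimension(array $vector, int $expected): void
{
    $actual = count($vector);
    if ($actual !== $expected) {
        throw new LengthException(
            "Expected a {$expected}-dimensional vector, got {$actual}"
        );
    }
}

$vector = array_fill(0, 768, 0.0);    // stand-in for a nomic-embed-text vector

assertDimension($vector, 768);         // passes silently

try {
    assertDimension($vector, 1536);    // the default size mentioned above
    $message = 'no error';
} catch (LengthException $e) {
    $message = $e->getMessage();
}
echo $message . "\n";
```

Called right after getEmbedding(), with the same number that was given to Config::setDimensions(), it stops the run before a mismatched vector reaches the disk.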

<?php
// index.php

require __DIR__ . '/vendor/autoload.php';
require __DIR__ . '/parse.php';
require __DIR__ . '/embed.php';

use Centamiv\Vektor\Core\Config;
use Centamiv\Vektor\Services\Indexer;

// Dimension MUST be set before new Indexer().
// nomic-embed-text = 768. Vektor default = 1536. They are not compatible.
Config::setDataDir(__DIR__ . '/data');
Config::setDimensions(768);

$indexer  = new Indexer();
$payloadFile = __DIR__ . '/data/payload.json';
$payload  = file_exists($payloadFile) ? (json_decode(file_get_contents($payloadFile), true) ?? []) : [];

$docDir = __DIR__ . '/php_manual_en';
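// Array functions, as reported by the standard extension. The offset and
// length of the slice are specific to the PHP build used for this post:
// check them against your own get_extension_funcs('standard') output.
$functions = array_slice(get_extension_funcs('standard'), 21, 87);
$files = array_map(fn ($name) => $docDir . '/function.' . str_replace('_', '-', $name) . '.html', $functions);
// Initial, somewhat naive collection: only the array-* pages
//$files  = glob($docDir . '/function.array-*.html');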

if (empty($files)) {
    die("No HTML files found in $docDir. Check your symlink or path.\n");
}

$indexed = 0;
$skipped = 0;

foreach ($files as $file) {
    $func = parsePhpFunctionFile($file);

    if ($func === null || $func['name'] === '' || $func['purpose'] === '') {
        echo "Skipping (unparseable): " . basename($file) . "\n";
        $skipped++;
        continue;
    }

    // Function names are well under 36 chars, Vektor's ID limit.
    $id   = $func['name'];
    $text = buildEmbedText($func);

    echo "Indexing {$func['name']}... ";

    try {
        $vector = getEmbedding($text);
        $indexer->insert($id, $vector);
        $payload[$id] = [
            'name'      => $func['name'],
            'purpose'   => $func['purpose'],
            'signature' => $func['signature'],
            'params'    => $func['params'],
            'returns'   => $func['returns'],
        ];
        echo "done\n";
        $indexed++;
    } catch (RuntimeException $e) {
        echo "FAILED: {$e->getMessage()}\n";
        $skipped++;
    }
}

file_put_contents($payloadFile, json_encode($payload, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE));

echo "\nIndexed : $indexed\n";
echo "Skipped : $skipped\n\n";

$stats = $indexer->getStats();
printf(
    "Vectors on disk : %d\nGraph nodes     : %d\nStorage         : %d KB\n",
    $stats['records']['vectors_total'],
    $stats['records']['graph_nodes'],
    intdiv(
        $stats['storage']['vector_file_bytes']
        + $stats['storage']['graph_file_bytes']
        + $stats['storage']['meta_file_bytes']
        + (file_exists($payloadFile) ? filesize($payloadFile) : 0),
        1024
    )
);

Run it:

php index.php

Note that index.php is usually the default entry file of a PHP web application. Here, the name carries the literal meaning of the verb: it builds an index.

Expected output:

Indexing array_chunk... done
Indexing array_column... done
Indexing array_combine... done
...
Indexing usort... done

Indexed : 59
Skipped : 0

Vectors on disk : 59
Graph nodes     : 59
Storage         : 200 KB

59 functions, ~200 KB on disk. A complete, self-contained knowledge base. The indexing run takes 30–60 seconds on a modern laptop, because Ollama generates the embeddings sequentially. You can speed that up later; it is not critical in our case, and you only do it once.

Peek at what Vektor created in data/:

ls -lh data/
# vector.bin   : raw float arrays, fixed record width
# graph.bin    : HNSW graph connections
# meta.bin     : ID-to-offset BST (no RAM map, pure disk seeks)
# payload.bin  : serialized metadata (JSON blobs)
# payload.json : function metadata written by index.php (plain JSON)

Four binary files, no external process, no daemon. The entire database is those four files.
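The fixed record width makes the size of the raw vector data easy to estimate. The sketch below assumes that each component is stored as a 4-byte float with no per-record overhead; that is an assumption for the sake of the arithmetic, not a documented detail of the file layout.

```php
<?php
// Back-of-the-envelope size of the raw vector data.
// Assumption: each component is a 4-byte (32-bit) float,
// with no per-record overhead. The real file layout may differ.
function rawVectorBytes(int $count, int $dimensions, int $bytesPerFloat = 4): int
{
    return $count * $dimensions * $bytesPerFloat;
}

$bytes = rawVectorBytes(59, 768);
printf("%d bytes (about %d KB)\n", $bytes, intdiv($bytes, 1024));
// prints "181248 bytes (about 177 KB)"
```

Under that assumption, the vectors alone would account for most of the ~200 KB reported by the indexing script, with the graph, the ID index, and the metadata making up the rest.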

Step 6: Search

Now the good part:

<?php
// search.php

require __DIR__ . '/vendor/autoload.php';
require __DIR__ . '/embed.php';

use Centamiv\Vektor\Core\Config;
use Centamiv\Vektor\Services\Searcher;

Config::setDataDir(__DIR__ . '/data');
Config::setDimensions(768);

$searcher    = new Searcher();
$payloadFile = __DIR__ . '/data/payload.json';
$payload     = file_exists($payloadFile) ? (json_decode(file_get_contents($payloadFile), true) ?? []) : [];

$query = $argv[1] ?? 'find elements in common between two arrays';
$k     = (int) ($argv[2] ?? 5);

echo "\nQuery : \"$query\"\n";
echo str_repeat('─', 62) . "\n";

$queryVector = getEmbedding($query);
$results     = $searcher->search($queryVector, $k, includeVector: false);

foreach ($results as $result) {
    $meta    = $payload[$result['id']] ?? ['name' => $result['id'], 'purpose' => '(no metadata)'];
    $score   = number_format($result['score'], 4);
    $wrapped = wordwrap($meta['purpose'] ?? '', 54, "\n           ", true);
    printf("[%s] %s\n           %s\n\n", $score, $meta['name'], $wrapped);
}

Use it from the command line:

php search.php "find elements in common between two arrays"
php search.php "sort array while keeping key-value association"
php search.php "apply a function to transform every element"
php search.php "remove duplicate values"
php search.php "add element to the beginning"

What the vectors reveal

Let’s look at real output and see what’s interesting about it.

“find elements in common between two arrays”

[0.6704] array_find
           array_find — Returns the first element satisfying a
           callback function

[0.6568] array_intersect
           array_intersect — Computes the intersection of
           arrays

[0.6506] array_combine
           array_combine — Creates an array by using one array
           for keys and another for its values

[0.6472] array_find_key
           array_find_key — Returns the key of the first
           element satisfying a callback function

[0.6470] array_intersect_key
           array_intersect_key — Computes the intersection of
           arrays using keys for comparison

array_find takes the top spot, probably pulled there by the word “find” in the query. array_intersect, the function we were actually after, comes second, and array_intersect_key closes the list. array_combine slips in as well, perhaps because its description also talks about two arrays. The model doesn’t know PHP, but it knows language. That result is not wrong; it is useful. And don’t let any superior AI engine tell you otherwise: what we’re doing here is a small-scale version of the global race that we see around us. Nothing more.

“sort array while keeping key-value association”

[0.7179] asort
           asort — Sort an array in ascending order and
           maintain index association

[0.6950] ksort
           ksort — Sort an array by key in ascending order

[0.6896] sort
           sort — Sort an array in ascending order

[0.6855] arsort
           arsort — Sort an array in descending order and
           maintain index association

[0.6774] uasort
           uasort — Sort an array with a user-defined
           comparison function and maintain index association

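asort leads, and the two other functions whose description mentions index association, arsort and uasort, also make the top five. ksort and sort sit in between: sort does not preserve keys, but its description is almost word for word that of asort. usort does not appear at all: nice, since it explicitly does not preserve key association. The model works from the description text, not from any knowledge of PHP. This is the search mostly working as intended.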

“apply a callback to an array”

[0.7834] array_map
           array_map — Applies the callback to the elements of
           the given arrays

[0.7469] array_reduce
           array_reduce — Iteratively reduce the array to a
           single value using a callback function

[0.7313] array_walk
           array_walk — Apply a user supplied function to every
           member of an array

[0.7220] array_all
           array_all — Checks if all array elements satisfy a
           callback function

[0.7218] array_filter
           array_filter — Filters elements of an array using a
           callback function

This one is philosophically interesting. array_map, array_walk, and array_reduce all “apply a function to elements of an array.” They are semantically nearly identical: the difference is behavioral. array_map returns a new array, array_walk modifies the array in place, and array_reduce returns a single value. And the three descriptions use similar language, so they cluster tightly.

array_filter also takes a callback, but its purpose is selection, not transformation: that could explain why it only comes fifth, just behind array_all. The search is surfacing a meaningful semantic grouping that the PHP manual’s alphabetical listing completely obscures.

“remove duplicate values”

[0.6298] array_unique
           array_unique — Removes duplicate values from an
           array

[0.5445] array_count_values
           array_count_values — Counts the occurrences of each
           distinct value in an array

[0.5281] array_unshift
           array_unshift — Prepend one or more elements to the
           beginning of an array

[0.5150] array_reduce
           array_reduce — Iteratively reduce the array to a
           single value using a callback function

[0.5080] array_combine
           array_combine — Creates an array by using one array
           for keys and another for its values

array_unique is first: well, but of course. array_count_values is second. The model has no knowledge of PHP idioms, yet array_count_values still ended up near array_unique in the vector space because both descriptions discuss array values and their uniqueness. Coincidence? Possibly. Useful? Definitely.

A word on scores

Cosine similarity in this range does not have universal absolute meaning: what matters is the relative ranking. Roughly:

Score	Meaning
> 0.90	Near-identical purpose
0.80–0.90	Same functional family
0.70–0.80	Related operations
< 0.70	Loosely connected at best

If all your results are clustering below 0.65, your embedding text is too sparse. Go back to buildEmbedText and add more context. The parameter descriptions have a large impact.
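For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it on toy 3-dimensional vectors that are made up for illustration; it is not how Vektor computes its scores internally, only the textbook formula.

```php
<?php
// Cosine similarity: dot product divided by the product of the norms.
function cosineSimilarity(array $a, array $b): float
{
    if (count($a) !== count($b) || $a === []) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length');
    }
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        throw new InvalidArgumentException('A zero vector has no direction');
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Toy vectors, made up for illustration.
$query     = [1.0, 0.0, 1.0];
$sameWay   = [2.0, 0.0, 2.0];   // same direction, different length
$unrelated = [0.0, 1.0, 0.0];   // orthogonal to the query

$close = cosineSimilarity($query, $sameWay);
$far   = cosineSimilarity($query, $unrelated);

printf("%.4f\n", $close);
printf("%.4f\n", $far);
```

Only the direction counts: the vector twice as long as the query scores 1, and the orthogonal one scores 0. Real embeddings of related texts land somewhere in between, which is why the ranking says more than the absolute value.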

Step 7: Maintenance

A technical note here. Vektor uses soft deletes: when you call insert() with an existing ID, the old record is tombstoned rather than overwritten. Over multiple re-indexing runs, the binary files grow with dead records. The Optimizer compacts them:

<?php
// optimize.php

require __DIR__ . '/vendor/autoload.php';

use Centamiv\Vektor\Core\Config;
use Centamiv\Vektor\Services\Optimizer;

Config::setDataDir(__DIR__ . '/data');
Config::setDimensions(768);

echo "Running optimizer...\n";
(new Optimizer())->run();
echo "Done.\n";

php optimize.php

Run this after any bulk re-indexing. It rebuilds the HNSW graph from scratch and reclaims space from all the tombstoned records.

Going further

Expand to all PHP built-in functions. The manual has HTML files for ~1000 built-in functions. Change the glob:

<?php
    $files = glob($docDir . '/function.*.html');
?>

Expect ~10 minutes of embedding time and a few MB of vector data. The search stays fast: HNSW query time grows roughly logarithmically with the number of vectors, so doubling the index does not double query time.

Run Vektor as a shared HTTP API. Vektor ships with a built-in controller for HTTP mode. Index once from a CLI script and query from anywhere:

cp .env.example .env
# Set VEKTOR_API_TOKEN= and VEKTOR_DIMENSIONS=768 in .env
php -S 0.0.0.0:8000 -t public

Just think about security: you’ll be on the internet, not just running a local experiment. Not the same pond.

Then search via HTTP, useful for a team tool or a web UI:

curl -s http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"vector": [...768 floats...], "k": 5, "include_metadata": true}'
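From PHP, the same request body can be built with json_encode(). In this sketch, buildSearchBody() is a hypothetical helper; the field names are taken from the curl call above, and authentication is left out because the header expected for VEKTOR_API_TOKEN is not shown here.

```php
<?php
// Hypothetical helper: build the JSON body for the /search call shown above.
function buildSearchBody(array $vector, int $k = 5, bool $includeMetadata = true): string
{
    return json_encode([
        'vector'           => array_values($vector),  // force a JSON list
        'k'                => $k,
        'include_metadata' => $includeMetadata,
    ], JSON_THROW_ON_ERROR);
}

$body = buildSearchBody([0.12, -0.5, 0.33], 3);   // 3 floats only, for readability
echo $body . "\n";

// Sending it with the curl extension, assuming the server from the previous step:
// $ch = curl_init('http://localhost:8000/search');
// curl_setopt_array($ch, [
//     CURLOPT_RETURNTRANSFER => true,
//     CURLOPT_POST           => true,
//     CURLOPT_POSTFIELDS     => $body,
//     CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
// ]);
// $response = curl_exec($ch);
```

In a real query, the vector would be the 768 floats returned by getEmbedding() for the search text.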

Try a larger embedding model. nomic-embed-text is fast and small. mxbai-embed-large produces 1024-dimensional vectors with generally better retrieval accuracy for English text. Remember: changing dimensions means calling Config::setDimensions(1024), deleting data/, and re-indexing from scratch, because the binary record width changes.

Embed richer content. The PHP manual pages include code examples. Appending a representative code snippet to the embed text can improve recall, because the example captures usage patterns that the prose description doesn’t state explicitly. array_walk and array_map sound similar in description; in practice, one mutates and the other doesn’t, and a code example makes that difference visible to the embedding model.
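Extracting such a snippet could look like the sketch below. firstExampleCode() is a hypothetical helper, and the class names it looks for ("example" and "phpcode") are an assumption about the manual's markup: check them against one downloaded page before relying on them. The sample HTML is hand-written.

```php
<?php
// Hypothetical helper: return the text of the first code example in a manual page.
// Assumption: examples sit in an element whose class contains "example",
// with the code itself inside an element whose class contains "phpcode".
function firstExampleCode(string $html, int $maxLength = 500): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query(
        '//*[contains(@class,"example")]//*[contains(@class,"phpcode")]'
    );
    if ($nodes === false || $nodes->length === 0) {
        return '';
    }

    // Collapse whitespace and cap the length, so that one long example
    // does not drown out the description.
    $code = preg_replace('/\s+/', ' ', trim($nodes->item(0)->textContent));
    return substr($code, 0, $maxLength);
}

// Hand-written sample page fragment.
$sample = <<<'HTML'
<html><body>
<div class="example">
  <div class="example-contents">
    <div class="phpcode"><code>$squares = array_map(fn($n) =&gt; $n * $n, [1, 2, 3]);</code></div>
  </div>
</div>
</body></html>
HTML;

$example = firstExampleCode($sample);
echo $example . "\n";
```

The returned string can then be appended to the result of buildEmbedText(), for instance after an " Example: " label. The length cap is a design choice: without it, a page with a very long example would be embedded mostly as code and hardly as description.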

Your mileage may vary

I collected the values for this post while writing it, in the early days of July 2026. They apply to the datasets that I used and mentioned. The results might vary a lot depending on the embedding model used (version, type, and number of parameters), the available PHP functions and their documentation, etc. The scores may vary too.

The takeaway

PHP has had the pieces for this kind of tooling for a while. Vektor provides the vector storage and the HNSW index. Ollama provides the embeddings. The PHP manual provides the data.

You provide fifteen minutes of setup time.

The result is a local, zero-dependency semantic search over PHP’s own API. No cloud, no subscription, no rate limits: just four binary files on disk and a curl call to your laptop.

PHP has eighty array functions, most of them named in ways that made sense in 1997. A vector database doesn’t care. Ask it what you want in plain language, and it will find the function you are looking for, even the one you didn’t know existed.
