---
title: "Using a vector database with PHP"
url: https://www.exakat.io/using-a-vector-database-with-php/
date: 2026-07-03
modified: 2026-07-03
author: "dams"
description: "Using a vector database with PHP PHP has roughly 80 array functions: eight - zero. array_map, array_walk, array_reduce,array_filter, array_each... wait, that last one was deprecated and removed in PHP 7.2...."
categories:
  - "Technology"
tags:
  - "AI"
  - "tutorial"
  - "vector"
image: https://www.exakat.io/wp-content/uploads/2026/07/vector.320.png
word_count: 2978
---

# Using a vector database with PHP


PHP has roughly 80 array functions: eight-zero. `array_map`, `array_walk`, `array_reduce`, `array_filter`, `array_each`... wait, that last one was deprecated and removed in PHP 7.2. Or not? And some new functions arrived in PHP 8.5, last November.

The real problem is discoverability. You know the task, "find the first element matching a condition", but you can't remember if it's `array_find`, `array_search`, `array_filter`, or something that doesn't exist yet and you'll have to write yourself. You end up on the eternal manual [php.net](https://www.php.net/manual), scrolling through the full list like it's a 2003 API reference, which it kind of is, hoping that the right function will jump out of the dark at the best moment.

Vector databases solve this with a different approach entirely. Instead of matching text, they match meaning. You describe what you want, and you get back the semantically closest functions, regardless of the exact words. So, today we'll build exactly that using [Vektor](https://github.com/centamiv/vektor), a pure PHP file-based vector database, and [Ollama](https://ollama.com/) for local embeddings.

In a nutshell: no cloud; no API key; no monthly bill; no privacy invasion. Just PHP doing what PHP does: being surprisingly capable.

## What we're building

A command-line tool that:

- Reads the local PHP documentation from HTML files
- Extracts every array function name, signature, and description
- Converts each into a 768-dimensional embedding vector using Ollama
- Stores those vectors in Vektor
- Answers natural-language questions like "sort array while preserving key-value association"

The result is a local semantic search engine over PHP's array API. Entirely file-based, running offline, and using a few hundred kilobytes on disk.

## Prerequisites

- PHP 8.2+, and try it on PHP 8.6 if you want, it is out
- [Composer](https://getcomposer.org/)
- [Ollama](https://ollama.com/) installed and running locally
- The PHP documentation downloaded locally (more on this below)

Once Ollama is installed, pull the embedding model:

```bash
ollama pull nomic-embed-text
```

`nomic-embed-text` produces 768-dimensional vectors, runs entirely on-device, and is purpose-built for semantic text similarity. Verify that it is working:

```bash
curl -s http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":"test"}' | \
  php -r '$d=json_decode(file_get_contents("php://stdin"),true);echo count($d["embedding"]);'
# Expected: 768
```

## Step 1: Download the PHP documentation

Here is the step where we resist the urge to write a scraper, which is the correct instinct. php.net provides the complete documentation as downloadable archives at [https://www.php.net/download-docs.php](https://www.php.net/download-docs.php).

Choose "Many HTML files". This gives you one HTML file per manual page, which is exactly what we need for per-function parsing. Download and extract:

```bash
# The exact filename varies with date; find it on the download page
wget https://www.php.net/distributions/manual/php_manual_en.tar.gz
tar -xzf php_manual_en.tar.gz
mv php-chunked-xhtml php_manual_en
# You now have: php_manual_en/
```

We used the English manual, but you may also use any of the translations. Just update the language once we reach the queries. Naturellement!

Array functions live in files named `function.array-*.html`:

```bash
ls php_manual_en/function.array-*.html | wc -l
# 79
```

Seventy-nine functions. One HTML file each. Let's parse them.

## Step 2: Project setup

```bash
mkdir php-array-search
cd php-array-search
composer require centamiv/vektor
mkdir data
# Symlink or copy the manual directory
ln -s /path/to/php_manual_en ./php_manual_en
```

In the end, we will have the following final structure:

```text
php-array-search/
├── data/            # Vektor binary files (written automatically)
├── php_manual_en/   # Downloaded PHP docs
├── parse.php        # HTML extraction functions
├── parse_one.php    # HTML extraction of one
├── embed.php        # Ollama embedding helper
├── index.php        # Indexing script (run once)
├── search.php       # Search CLI (run repeatedly)
├── optimize.php     # Maintenance
└── vendor/
```

## Step 3: Parse the PHP documentation

The downloaded manual is machine-generated from DocBook XML, so the HTML structure is consistent and reliable. Each function page carries the function name, a one-line purpose, the signature, parameter descriptions, and the return value description. We want all of it.

The `buildEmbedText` function is more than optional decoration. Embedding just `"array_map"` gives you a vector that will cluster with `"map"` and `"array"`. Embedding the full description (name, purpose, signature, what the parameters do, what is returned) gives a vector that encodes what the function actually does. That semantic richness is what makes the search useful.

If you want to verify the parser on a single file before indexing everything, this snippet is your friend:
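Here is a minimal, self-contained sketch of such a check. It assumes the manual's markup: only the `refname` span is confirmed by the `xmllint` command further down, and the other XPath selectors are guesses to adjust. The names `parseFunctionPage` and `SAMPLE_PAGE` and the "Parameters" and "Return value" labels are hypothetical. Pass a manual page as the first argument, or run it bare to parse the tiny inline stand-in page.

```php
<?php
declare(strict_types=1);

// Tiny stand-in page, used when no file path is given on the command line.
const SAMPLE_PAGE = <<<'HTML'
<div class="refentry">
  <p class="refpurpose"><span class="refname">array_map</span> -
    <span class="dc-title">Applies the callback to the elements of the given arrays</span></p>
  <div class="refsect1 description">
    <div class="methodsynopsis dc-description">array_map(?callable $callback, array $array, array ...$arrays): array</div>
  </div>
  <div class="refsect1 parameters"><dl><dt>callback</dt>
    <dd><p>A callable to run for each element in each array.</p></dd></dl></div>
  <div class="refsect1 returnvalues"><p>Returns an array containing the results.</p></div>
</div>
HTML;

/** Extract name, purpose, signature, parameters and return value from one manual page. */
function parseFunctionPage(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep markup warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html); // prefix forces UTF-8 decoding
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($dom);

    // Whitespace-normalised text of the first node matching $query, or ''.
    $text = static function (string $query) use ($xpath): string {
        $node = $xpath->query($query)->item(0);
        return $node === null ? '' : trim((string) preg_replace('/\s+/', ' ', $node->textContent));
    };

    return [
        'name'      => $text('//span[@class="refname"]'),
        // Assumed selectors from here on: adjust them to your copy of the manual.
        'purpose'   => $text('//span[@class="dc-title"]'),
        'signature' => $text('//div[contains(@class, "methodsynopsis")]'),
        'params'    => $text('//div[contains(@class, "parameters")]//dd'),
        'returns'   => $text('//div[contains(@class, "returnvalues")]//p'),
    ];
}

/** Join the non-empty fields into the single string that gets embedded. */
function buildEmbedText(array $doc): string
{
    $parts = ["{$doc['name']}: {$doc['purpose']}."];
    $labels = ['signature' => 'Signature', 'params' => 'Parameters', 'returns' => 'Return value'];
    foreach ($labels as $key => $label) {
        if (($doc[$key] ?? '') !== '') {
            $parts[] = "{$label}: {$doc[$key]}";
        }
    }
    return implode(' ', $parts);
}

$file = $argv[1] ?? '';
$html = is_file($file) ? file_get_contents($file) : SAMPLE_PAGE;
$doc  = parseFunctionPage($html);
print_r($doc);
echo "\n--- Embed text ---\n", buildEmbedText($doc), "\n";
```

Empty fields are left out of the embed text, so a page with a missing section still produces a usable, if shorter, string.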

Expected output (abbreviated):

```text
Array
(
    [name] => array_map
    [purpose] => Applies the callback to the elements of the given arrays
    [signature] => array array_map(?callable $callback, array $array, array ...$arrays)
    [params] => A callable to run for each element in each array. ...
    [returns] => Returns an array containing the results of applying the ...
)

--- Embed text ---
array_map: Applies the callback to the elements of the given arrays. Signature: ...
```

If the XPath selectors produce empty strings for your downloaded version, inspect one file with `xmllint --html --xpath '//span[@class="refname"]' php_manual_en/function.array-map.html` and adjust the selectors accordingly. The structure is consistent across the manual, so once one file works, all of them will.

## Step 4: Embeddings with Ollama

Nothing exotic here. The code is mainly a curl call to the local Ollama API, which returns the embedding:
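As a rough sketch, the helper could look like the following. The endpoint, the `model` and `prompt` fields, and the `embedding` response key are the ones used by the curl check earlier in this post. The function names, the 30-second timeout, and the dimension check are hypothetical additions.

```php
<?php
declare(strict_types=1);

const OLLAMA_URL   = 'http://localhost:11434/api/embeddings';
const OLLAMA_MODEL = 'nomic-embed-text';

/** Decode an Ollama response body and return the embedding as a list of floats. */
function parseEmbeddingResponse(string $body, int $expectedDim = 768): array
{
    $data   = json_decode($body, true, 512, JSON_THROW_ON_ERROR);
    $vector = is_array($data) ? ($data['embedding'] ?? null) : null;
    if (!is_array($vector) || count($vector) !== $expectedDim) {
        throw new RuntimeException('Unexpected embedding payload from Ollama');
    }
    return array_map('floatval', $vector);
}

/** POST the text to the local Ollama server and return its embedding vector. */
function embed(string $text): array
{
    $ch = curl_init(OLLAMA_URL);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
        CURLOPT_POSTFIELDS     => json_encode(['model' => OLLAMA_MODEL, 'prompt' => $text], JSON_THROW_ON_ERROR),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30, // assumed value; raise it on slow machines
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Ollama request failed: ' . curl_error($ch));
    }
    return parseEmbeddingResponse($body);
}
```

Checking the vector length at this point is cheap, and it catches a wrong model name before anything is written to disk.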

## Step 5: Indexing

This is where it all comes together. One important constraint to understand before writing a single line: Vektor's binary file layout is baked to a fixed record width determined by the vector dimension.

The default is 1536, and it is sized for OpenAI embeddings. We are using 768-dimensional vectors from `nomic-embed-text`. If you forget to set the dimension before instantiating the `Indexer`, it will silently write vectors into a binary file with the wrong size. The reads then return garbage, and there will be no error message: just meaningless results, also known as false positives. We also say hallucinations nowadays, although the fashion is already fading away. Just set dimensions first.
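The indexing loop itself can be sketched without committing to Vektor's exact API. In this hypothetical `indexAll` function, the embedder and the storage call are passed in as callables: `$insert` is where the `Indexer` call would go, once the dimension has been set. The loop refuses any vector whose length differs from the expected dimension, so a mismatch becomes a visible skip instead of silent garbage.

```php
<?php
declare(strict_types=1);

const DIMENSIONS = 768; // nomic-embed-text

/**
 * Embed and store every document, returning [indexed, skipped].
 *
 * @param iterable $docs   items shaped like ['name' => ..., 'text' => ..., 'purpose' => ...]
 * @param callable $embed  fn(string $text): array, returns the embedding
 * @param callable $insert fn(string $id, array $vector, array $metadata): void
 */
function indexAll(iterable $docs, callable $embed, callable $insert, int $dim = DIMENSIONS): array
{
    $indexed = 0;
    $skipped = 0;
    foreach ($docs as $doc) {
        echo "Indexing {$doc['name']}... ";
        try {
            $vector = $embed($doc['text']);
            if (count($vector) !== $dim) {
                throw new LengthException('got ' . count($vector) . " dimensions, expected {$dim}");
            }
            $insert($doc['name'], $vector, ['purpose' => $doc['purpose'] ?? '']);
            $indexed++;
            echo "done\n";
        } catch (Throwable $e) {
            // One bad page should not abort the whole run.
            $skipped++;
            echo "skipped ({$e->getMessage()})\n";
        }
    }
    return [$indexed, $skipped];
}
```

Keeping the two collaborators as callables also makes the loop easy to try with stubs before Ollama or the database is involved.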

Run it:

```bash
php index.php
```

Note that `index.php` is usually the default entry file of an online PHP application. Here, the name carries its literal meaning: it builds the index.

Expected output:

```text
Indexing array_chunk... done
Indexing array_column... done
Indexing array_combine... done
...
Indexing usort... done

Indexed : 59
Skipped : 0

Vectors on disk : 59
Graph nodes : 59
Storage : 200 KB
```

59 functions, ~200 KB on disk. A complete, self-contained knowledge base. The indexing run takes 30–60 seconds on a modern laptop, because Ollama generates the embeddings sequentially. You can speed that up later; it is not critical in our case, and you only do it once.

Peek at what Vektor created in `data/`:

```bash
ls -lh data/
# vector.bin : raw float arrays, fixed record width
# graph.bin : HNSW graph connections
# meta.bin : ID-to-offset BST (no RAM map, pure disk seeks)
# payload.bin : serialized metadata (JSON blobs)
```

Four binary files, no external process, no daemon. The entire database is those four files.

## Step 6: Search

Now the good part:
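A minimal sketch of the search flow: embed the question, ask the index for the nearest neighbours, print each hit with its score. The lookup is passed in as a callable because the exact Vektor call is not assumed here. The hit shape (`id`, `score`, `metadata`) and the `purpose` metadata key are hypothetical, and the demo at the bottom runs on stubs.

```php
<?php
declare(strict_types=1);

/**
 * Embed the query, fetch the $k nearest neighbours and render them as text.
 *
 * @param callable $embed   fn(string $text): array, returns the query vector
 * @param callable $nearest fn(array $vector, int $k): array of ['id', 'score', 'metadata']
 */
function search(string $query, callable $embed, callable $nearest, int $k = 5): string
{
    $lines = [];
    foreach ($nearest($embed($query), $k) as $hit) {
        $lines[] = sprintf('[%.4f] %s', $hit['score'], $hit['id']);
        // Wrap long descriptions and indent them under the function name.
        $lines[] = '  ' . wordwrap($hit['metadata']['purpose'] ?? '', 50, "\n  ");
        $lines[] = '';
    }
    return implode("\n", $lines);
}

// Demo with stubbed collaborators; swap in the real embedder and index lookup.
$fakeEmbed   = static fn (string $text): array => [0.0, 1.0];
$fakeNearest = static fn (array $vector, int $k): array => [
    ['id' => 'array_unique', 'score' => 0.6298, 'metadata' => ['purpose' => 'Removes duplicate values from an array']],
];
echo search($argv[1] ?? 'remove duplicate values', $fakeEmbed, $fakeNearest);
```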

Use it from the command line:

```bash
php search.php "find elements in common between two arrays"
php search.php "sort array while keeping key-value association"
php search.php "apply a function to transform every element"
php search.php "remove duplicate values"
php search.php "add element to the beginning"
```

## What the vectors reveal

Let's look at real output and see what's interesting about it.

### "find elements in common between two arrays"

```text
[0.6704] array_find
  array_find — Returns the first element satisfying a
  callback function

[0.6568] array_intersect
  array_intersect — Computes the intersection of
  arrays

[0.6506] array_combine
  array_combine — Creates an array by using one array
  for keys and another for its values

[0.6472] array_find_key
  array_find_key — Returns the key of the first
  element satisfying a callback function

[0.6470] array_intersect_key
  array_intersect_key — Computes the intersection of
  arrays using keys for comparison
```

Two members of the intersect family, `array_intersect` and `array_intersect_key`, make the top five. `array_find` and `array_find_key` rank alongside them, most likely because the query starts with "find", and `array_combine` rides along on "two arrays". The model doesn't know PHP, but it knows language. That result is not wrong; it is useful. And don't let any superior AI engine tell you otherwise: what we're doing here is a small-scope version of the global race that we see around us. Nothing more.

### "sort array while keeping key-value association"

```text
[0.7179] asort
  asort — Sort an array in ascending order and
  maintain index association

[0.6950] ksort
  ksort — Sort an array by key in ascending order

[0.6896] sort
  sort — Sort an array in ascending order

[0.6855] arsort
  arsort — Sort an array in descending order and
  maintain index association

[0.6774] uasort
  uasort — Sort an array with a user-defined
  comparison function and maintain index association
```

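`asort` leads, and the other two association-preserving sorts, `arsort` and `uasort`, also make the top five. `ksort` and plain `sort` slip in between on the strength of their wording, even though `sort` reindexes the array. `usort` does not appear, and it explicitly does not preserve key association: nice. The model picks this up from the description text. This is the search mostly working as intended.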

### "apply a callback to an array"

```text
[0.7834] array_map
  array_map — Applies the callback to the elements of
  the given arrays

[0.7469] array_reduce
  array_reduce — Iteratively reduce the array to a
  single value using a callback function

[0.7313] array_walk
  array_walk — Apply a user supplied function to every
  member of an array

[0.7220] array_all
  array_all — Checks if all array elements satisfy a
  callback function

[0.7218] array_filter
  array_filter — Filters elements of an array using a
  callback function
```

This one is philosophically interesting. `array_map`, `array_walk`, and `array_reduce` all "apply a function to elements of an array." They are semantically nearly identical: the difference is behavioral. `array_map` returns a new array, `array_walk` modifies the array in place, and `array_reduce` folds it down to a single value. The three descriptions use similar language, so they cluster tightly.

`array_filter` also takes a callback, but its purpose is selection, not transformation: that could explain why it only comes fifth in that list, right behind `array_all`. The search is surfacing a meaningful semantic grouping that the PHP manual's alphabetical listing completely obscures.

### "remove duplicate values"

```text
[0.6298] array_unique
  array_unique — Removes duplicate values from an
  array

[0.5445] array_count_values
  array_count_values — Counts the occurrences of each
  distinct value in an array

[0.5281] array_unshift
  array_unshift — Prepend one or more elements to the
  beginning of an array

[0.5150] array_reduce
  array_reduce — Iteratively reduce the array to a
  single value using a callback function

[0.5080] array_combine
  array_combine — Creates an array by using one array
  for keys and another for its values
```

`array_unique` is first: well, but of course. `array_count_values` is second. The model has no knowledge of PHP idioms, yet `array_count_values` still ended up near `array_unique` in the vector space because both descriptions discuss array values and their uniqueness. Coincidence? Possibly. Useful? Definitely.

## A word on scores

Cosine similarity in this range does not have universal absolute meaning: what matters is the relative ranking. Roughly:

| Score | Meaning |
| ----- | ------- |
| > 0.90 | Near-identical purpose |
| 0.80–0.90 | Same functional family |
| 0.70–0.80 | Related operations |
| < 0.70 | Loosely connected at best |

If all your results are clustering below 0.65, your embedding text is too sparse. Go back to `buildEmbedText` and add more context. The parameter descriptions have a large impact.
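For reference, this is what the score measures. Below is a plain-PHP sketch of cosine similarity and a brute-force top-k ranking. An HNSW index approximates this same ranking without comparing the query to every stored vector. The function names are hypothetical, and this is not how Vektor is implemented internally.

```php
<?php
declare(strict_types=1);

/** Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|). */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length');
    }
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }
    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0; // a zero vector has no direction to compare
    }
    return $dot / sqrt($normA * $normB);
}

/** Score every candidate against the query and keep the best $k, highest first. */
function topK(array $query, array $candidates, int $k): array
{
    $scores = [];
    foreach ($candidates as $id => $vector) {
        $scores[$id] = cosineSimilarity($query, $vector);
    }
    arsort($scores); // sort by score, descending, keeping the IDs as keys
    return array_slice($scores, 0, $k, true);
}
```

With 59 vectors the brute-force version is instant. The graph index only starts to pay off as the collection grows.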

## Step 7: Maintenance

A technical note here: Vektor uses soft deletes. When you call `insert()` with an existing ID, the old record is tombstoned rather than overwritten, so over multiple re-indexing runs the binary files grow with dead records. The `Optimizer` compacts them:

```bash
php optimize.php
```

Run this after any bulk re-indexing. It rebuilds the HNSW graph from scratch and reclaims space from all the tombstoned records.
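To see why compaction matters, here is a toy in-memory model of soft deletes. It is a hypothetical `ToyStore` class written for illustration, not Vektor's code or file format: re-inserting an ID flags the old record as dead and appends a new one, and only `compact()` makes the store shrink again.

```php
<?php
declare(strict_types=1);

/** Append-only store: inserting an existing ID tombstones the old record. */
final class ToyStore
{
    /** @var list<array{id: string, vector: array, deleted: bool}> */
    private array $records = [];

    public function insert(string $id, array $vector): void
    {
        foreach ($this->records as &$record) {
            if ($record['id'] === $id) {
                $record['deleted'] = true; // soft delete: the record stays in place
            }
        }
        unset($record);
        $this->records[] = ['id' => $id, 'vector' => $vector, 'deleted' => false];
    }

    /** Drop the tombstoned records and keep only the live ones. */
    public function compact(): void
    {
        $this->records = array_values(
            array_filter($this->records, static fn (array $r): bool => !$r['deleted'])
        );
    }

    /** Number of records held, dead ones included. */
    public function size(): int
    {
        return count($this->records);
    }
}

$store = new ToyStore();
$store->insert('array_map', [0.1, 0.2]);
$store->insert('array_map', [0.3, 0.4]); // re-index: the first record becomes a tombstone
$store->insert('asort', [0.5, 0.6]);
echo $store->size(), " records before compaction\n";
$store->compact();
echo $store->size(), " records after compaction\n";
```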

## Going further

Expand to all PHP built-in functions. The manual has HTML files for ~1000 built-in functions. Change the glob:
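A sketch of what that change could look like, with a hypothetical `listFunctionPages` helper. The default pattern `function.*.html` picks up every function page, while `function.array-*.html` restricts the list to the array pages counted earlier. The name derived from the filename is only a convenience; the name parsed from the page itself stays the reference.

```php
<?php
declare(strict_types=1);

/** List the manual pages matching $pattern, keyed by the function name in the filename. */
function listFunctionPages(string $manualDir, string $pattern = 'function.*.html'): array
{
    $pages = [];
    foreach (glob(rtrim($manualDir, '/') . '/' . $pattern) ?: [] as $path) {
        // function.array-map.html -> array_map
        $slug = substr(basename($path, '.html'), strlen('function.'));
        $pages[str_replace('-', '_', $slug)] = $path;
    }
    ksort($pages);
    return $pages;
}

$pages = listFunctionPages($argv[1] ?? 'php_manual_en');
echo count($pages), " function pages found\n";
```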

Expect ~10 minutes of embedding time and a few MB of vector data. The search stays fast: HNSW is a logarithmic-time algorithm; doubling the index does not double query time.

Run Vektor as a shared HTTP API. Vektor ships with a built-in controller for HTTP mode. Index once from a CLI script and query from anywhere:

```bash
cp .env.example .env
# Set VEKTOR_API_TOKEN= and VEKTOR_DIMENSIONS=768 in .env
php -S 0.0.0.0:8000 -t public
```

Just think about security: binding to `0.0.0.0` exposes the server beyond your own machine, so this is no longer a local experiment. Not the same pond.

Then search via HTTP, useful for a team tool or a web UI:

```bash
curl -s http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"vector": [...768 floats...], "k": 5, "include_metadata": true}'
```

Try a larger embedding model. `nomic-embed-text` is fast and small. `mxbai-embed-large` produces 1024-dimensional vectors with generally better retrieval accuracy for English text. Remember: changing dimensions means deleting `data/` and re-indexing from scratch: the binary record width changes.
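The reason a dimension change invalidates existing files can be shown with a toy fixed-width layout. This is an illustration only, assuming vectors stored back to back as 32-bit floats. It is not Vektor's actual file format, and the helper names are hypothetical.

```php
<?php
declare(strict_types=1);

/** Pack vectors back to back as little-endian 32-bit floats, one record each. */
function writeRecords(array $vectors): string
{
    $bytes = '';
    foreach ($vectors as $vector) {
        $bytes .= pack('g*', ...$vector);
    }
    return $bytes;
}

/** Read record number $index, assuming every record holds $dim floats. */
function readRecord(string $bytes, int $index, int $dim): array
{
    $width = $dim * 4; // 4 bytes per 32-bit float
    return array_values(unpack("g{$dim}", substr($bytes, $index * $width, $width)));
}

// Four 2-dimensional vectors written with a record width of 2 floats.
$file = writeRecords([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]);

print_r(readRecord($file, 1, 2)); // right width: the second vector, [3, 4]
print_r(readRecord($file, 1, 4)); // wrong width: vectors three and four glued together
```

Nothing fails in the second read. The offsets simply land in the wrong place, which is why the mismatch shows up as nonsense results rather than as an error.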

Embed richer content. The PHP manual pages include code examples. Appending a representative code snippet to the embed text often improves recall dramatically, because the example captures usage patterns that the prose description doesn't state explicitly. `array_walk` and `array_map` sound similar in description; in practice, one mutates and the other doesn't: a code example makes that difference visible to the embedding model.

## Your mileage may vary

I took the values for this post while writing it. They apply to the datasets that I used and mentioned, in the early days of July 2026. The results might vary a lot depending on the actual LLM used (its version, type, and number of parameters), the available PHP functions and their documentation, etc. The scores may vary too.

## The takeaway

PHP has had the pieces for this kind of tooling for a while. Vektor provides the vector storage and the HNSW index. Ollama provides the embeddings. The PHP manual provides the data.

You provide fifteen minutes of setup time.

The result is a local, zero-dependency semantic search over PHP's own API. No cloud, no subscription, no rate limits: just four binary files on disk and a curl call to your laptop.

PHP has eighty array functions, most of them named in ways that made sense in 1997. A vector database doesn't care. Ask it what you want in plain language, and it will find the function you are looking for, even the one you didn't know existed.