Largest PHP code base

The largest PHP code bases are close to 10 million tokens

When testing the exakat static analysis engine, we need to run it on real code: even better, on the largest PHP code bases available. Open Source projects are a real blessing there, since they come in all shapes and stripes. Some projects date back to PHP 3 and have evolved until now, some are PHP 7 only; some are full OOP, while others are fully functional; some apply the ‘East programming’ paradigm, others use the ‘bazaar way’. Some are in weird languages…

Nowadays, code bases tend to be small. Components are the norm, and fewer full-stack applications are published. But the exakat engine needs to run on larger code bases. This is an excellent test on many levels: poorly optimized analyses show up immediately, the code offers more variety in syntax and situations, and general speed is monitored too.


Largest PHP code base

Project          LOC     Tokens
limesurvey    662229    8841624
dolibarr      380909    8820590
moodle        902284    8114632
kaltura      1364790    7449004
webtrees      179576    6676142
magento2      825624    6396977
magento       803192    6228889
tcpdf          43796    5640300
ez            848497    5445200
claroline     547976    4934730

LOC: lines of code

Usually, the size of a project is measured in LOC: lines of code. You may measure your project simply with phploc, or by summing the results of the ‘wc’ command over its files.

Exakat doesn’t pay any attention to comments. Comments, white space, commented-out code, phpdoc and such are all dropped at tokenizing time. Exakat focuses on code and its meaning, not on documentation. Although phpdoc would be useful, most other comments are totally unusable: they are just not machine readable. So, why keep them?

Counting tokens

With comments out of the way, the only way to compare project sizes is to count tokens. Tokens are the atoms of a PHP script: they may be single characters such as ; + and , or double characters such as :: => -> and ?? or long literals.
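As a rough illustration of this kind of count, here is a minimal sketch built on PHP's own tokenizer, token_get_all(). It is not Exakat's code: the function name countCodeTokens() is made up for the example, and the choice to skip only comments, doc comments and white space is an assumption, since the exact list of tokens Exakat drops is not given here.

```php
<?php
declare(strict_types=1);

/**
 * Count the tokens of a PHP source string, skipping comments and white space.
 * Assumption: only these three token types are dropped; Exakat's real
 * filtering rules may differ.
 */
function countCodeTokens(string $source): int
{
    $skipped = [T_COMMENT, T_DOC_COMMENT, T_WHITESPACE];
    $count = 0;

    foreach (token_get_all($source) as $token) {
        // Single-character tokens such as ';' or '=' come back as plain strings.
        if (is_string($token)) {
            $count++;
            continue;
        }
        // Other tokens are arrays: [token id, text, line number].
        if (!in_array($token[0], $skipped, true)) {
            $count++;
        }
    }

    return $count;
}

$code = "<?php\n// a comment\n\$a = \$b ?? 1; /* another */\n";
echo countCodeTokens($code), "\n";
```

In this sketch the opening tag is counted as a token too, so the sample string yields 7: the open tag, $a, =, $b, ??, 1 and the final semicolon.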

There may be several tokens on one line. My usual rule is 10 tokens per line, on average. That varies from 123 tokens per line, as in TCPDF, which has long lists of arrays full of integers, down to 3 tokens per line, as in the purl or domainparser projects, which use a very relaxed syntax. Over 930 projects, the average is 8.28 and the median is 8, so 10 tokens per line is a good approximation.
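The tokens-per-line ratio is simply the token count divided by the LOC. The sketch below computes it for two rows copied from the table above; tokensPerLine() is a name chosen for this example, not part of any tool.

```php
<?php
declare(strict_types=1);

/** Average number of tokens per line of code. */
function tokensPerLine(int $loc, int $tokens): float
{
    if ($loc <= 0) {
        throw new InvalidArgumentException('LOC must be positive');
    }
    return $tokens / $loc;
}

// LOC and token counts copied from the table above.
$projects = [
    'moodle'  => ['loc' => 902284,  'tokens' => 8114632],
    'kaltura' => ['loc' => 1364790, 'tokens' => 7449004],
];

foreach ($projects as $name => $size) {
    printf("%-8s %.2f tokens/line\n", $name, tokensPerLine($size['loc'], $size['tokens']));
}
```

Both projects land below the 10 tokens per line rule of thumb, at roughly 9 and 5.5 tokens per line.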

Excluding common libraries

Exakat only counts tokens that it will later process. For example, including the copy of tcpdf found in the phpmyadmin code would change the token count from 617k to 6.2 M. Of course, there is no need to analyze tcpdf every time phpmyadmin is tested, so tcpdf is ignored. There is a short list of frameworks and commonly found libraries (PDF classes, for example) that are detected and omitted by exakat when counting tokens. This helps focus on local code and exclude external contributions.

This also means that any common library found in OSS code is excluded, which reduces the reported size of the project.
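One simple way to picture this exclusion is a filter applied to the file list before any token is counted. The sketch below is only an illustration: excludeLibraries(), the ignore list and the file paths are all hypothetical, and how Exakat actually detects bundled libraries is not described here.

```php
<?php
declare(strict_types=1);

/**
 * Keep only the paths that do not sit inside an ignored directory.
 * $ignoredDirs is a hypothetical list; Exakat's real list and its
 * detection rules are not described in this article.
 */
function excludeLibraries(array $paths, array $ignoredDirs): array
{
    $ignored = array_map('strtolower', $ignoredDirs);

    return array_values(array_filter(
        $paths,
        static function (string $path) use ($ignored): bool {
            // Compare each directory segment of the path, case-insensitively.
            $segments = explode('/', strtolower(dirname($path)));
            return array_intersect($segments, $ignored) === [];
        }
    ));
}

// Hypothetical file list for a project that bundles tcpdf.
$files = [
    'index.php',
    'libraries/tcpdf/tcpdf.php',
    'libraries/Util.php',
];

print_r(excludeLibraries($files, ['tcpdf']));
```

Only index.php and libraries/Util.php remain, so the bundled library no longer inflates the count.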

Even larger PHP code base?

Exakat needs to identify other large PHP projects so as to add them to the top 10. Any suggestion may be directed to @exakat on Twitter!