What is the largest PHP code base?

When testing the exakat static analysis engine, we need to run it on real code: even better, on the largest PHP code bases available. Open Source projects are a real blessing there, since they come in all shapes and stripes. Some projects date back to PHP 3 and have evolved ever since, some are PHP 7 only from the start; some are full OOP, while others are fully functional; some apply the ‘East programming’ paradigm, others use the ‘bazaar way’. Some are in weird languages…

Nowadays, code bases tend to be small. Components are the norm, and fewer full-stack applications are published. But the exakat engine needs to run on larger code bases. This is an excellent test on many levels: poorly optimized analyses show up immediately; the code offers more variety in syntax and situations; and general speed can be monitored.

Largest PHP code bases

Project	LOC	Tokens
limesurvey	662229	8841624
dolibarr	380909	8820590
moodle	902284	8114632
kaltura	1364790	7449004
webtrees	179576	6676142
magento2	825624	6396977
magento	803192	6228889
tcpdf	43796	5640300
ez	848497	5445200
claroline	547976	4934730

LOC: lines of code

Usually, the size of a project is measured in LOC: lines of code. You may measure your project simply with phploc, or by summing the results of the ‘wc’ command.

Exakat doesn’t pay any attention to comments. Comments, white space, commented-out code, phpdoc and such are all dropped at tokenizing time. Exakat focuses on code and its meaning, not on documentation. Although phpdoc would be useful, most other comments are totally unusable: they are just not machine readable. So, why keep them?
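As a rough illustration of the ‘wc’ approach, here is a minimal Python sketch that sums the physical line counts of every .php file under a directory. The function name count_loc and the choice of the .php suffix are assumptions made for this example; phploc reports more refined figures, and a final line without a trailing newline is counted here although ‘wc -l’ would not count it.

```python
from pathlib import Path


def count_loc(root, suffix=".php"):
    """Sum the physical line counts of every file under root ending in suffix.

    Comments and blank lines are included, as with a plain line count.
    """
    total = 0
    for path in Path(root).rglob(f"*{suffix}"):
        if path.is_file():
            # Read as bytes so unusual encodings do not raise decoding errors.
            with path.open("rb") as handle:
                total += sum(1 for _ in handle)
    return total
```

Calling count_loc("path/to/project") returns a single integer for the whole tree.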

Counting tokens

With comments out of the way, the only way to compare project size is to count tokens. Tokens are the atoms of a PHP script: they may be single characters, like ; + or , two-character operators like :: => -> ?? or long literals.

There may be several tokens on one line. My usual rule is 10 tokens per line, on average. That varies from 123 tokens per line in TCPDF, which has long lists of arrays full of integers, down to 3 tokens per line in the purl or domainparser projects, which use a very relaxed syntax. Over 930 projects, the average is 8.28 and the median is 8, so 10 tokens per line is a good approximation.
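To make the idea concrete, here is a deliberately simplified token counter in Python. It is not PHP's lexer and not exakat's implementation: the regular expression below is an assumption written for this example, and it ignores heredocs, string interpolation, attributes and many operators. PHP's own token_get_all() function is the authoritative way to tokenize a script. The sketch only shows the principle: whitespace and comments are skipped, and everything else counts as one token, whether it is one character or a long literal.

```python
import re

# Simplified patterns, tried left to right. NOT PHP's real lexer.
TOKEN_RE = re.compile(
    r"""
    (?P<skip>\s+|//[^\n]*|\#[^\n]*|/\*.*?\*/)        # whitespace and comments
  | (?P<tag><\?php|\?>)                              # opening and closing tags
  | (?P<string>'(?:\\.|[^'\\])*'|"(?:\\.|[^"\\])*")  # quoted literals
  | (?P<number>\d+(?:\.\d+)?)                        # integers and decimals
  | (?P<name>\$?[A-Za-z_][A-Za-z0-9_]*)              # variables, keywords, names
  | (?P<op>::|=>|->|\?\?|.)                          # a few operators, else 1 char
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(source):
    """Return the list of token strings, with whitespace and comments dropped."""
    return [m.group(0) for m in TOKEN_RE.finditer(source) if m.lastgroup != "skip"]


def count_tokens(source):
    """Count the tokens kept by tokenize()."""
    return len(tokenize(source))
```

With this sketch, a long string literal and a lone semicolon each count as exactly one token, which is why token counts and line counts can diverge so much.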
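The tokens-per-line ratio is simple arithmetic, sketched below in Python with the LOC and token figures copied from the table above. The names PROJECTS, tokens_per_line and estimate_tokens are made up for this example. Since the table does not say how its LOC column was measured, ratios computed this way will not necessarily match the per-project figures quoted in the text.

```python
# (LOC, tokens) pairs copied from the top-10 table.
PROJECTS = {
    "limesurvey": (662229, 8841624),
    "dolibarr": (380909, 8820590),
    "moodle": (902284, 8114632),
    "kaltura": (1364790, 7449004),
    "webtrees": (179576, 6676142),
    "magento2": (825624, 6396977),
    "magento": (803192, 6228889),
    "tcpdf": (43796, 5640300),
    "ez": (848497, 5445200),
    "claroline": (547976, 4934730),
}


def tokens_per_line(loc, tokens):
    """Average number of tokens per line of code."""
    return tokens / loc


def estimate_tokens(loc, ratio=10):
    """Rule-of-thumb estimate: about 10 tokens per line."""
    return loc * ratio


if __name__ == "__main__":
    for name, (loc, tokens) in PROJECTS.items():
        print(f"{name:12s} {tokens_per_line(loc, tokens):7.2f} tokens/line")
```

The estimate_tokens helper applies the 10 tokens per line rule to a known line count, which is handy when only LOC is available.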

Excluding common libraries

Exakat only counts tokens that it will later process. For example, counting the copy of tcpdf found in the phpmyadmin code would change the token count from 617k to 6.2M. Of course, there is no need to analyze tcpdf every time phpmyadmin is tested, so tcpdf is ignored. There is a short list of frameworks and commonly found libraries (PDF classes, for example) that exakat detects and omits when counting tokens. This helps focus on local code and excludes external contributions.

This also means that any common library found in an OSS code base is excluded, which reduces the measured size of the project.
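One way to picture this exclusion is the Python sketch below, which skips any file located under a directory whose name appears in an ignore list. This is an assumption made for illustration only: the article does not describe how exakat detects libraries, and IGNORED_LIBRARY_DIRS is a hypothetical list containing just the tcpdf example mentioned above.

```python
from pathlib import Path

# Hypothetical ignore list; exakat's actual list is not given in the article.
IGNORED_LIBRARY_DIRS = {"tcpdf"}


def local_php_files(root, ignored=IGNORED_LIBRARY_DIRS):
    """Yield .php files under root, skipping those inside an ignored directory.

    Matching is done on directory names only (case-insensitive), which is an
    assumption of this sketch rather than exakat's detection method.
    """
    root = Path(root)
    for path in sorted(root.rglob("*.php")):
        parent_dirs = path.relative_to(root).parts[:-1]
        if any(part.lower() in ignored for part in parent_dirs):
            continue
        yield path
```

Only the files yielded by local_php_files would then be counted, so a bundled copy of a library no longer inflates the size of the project.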

Even larger PHP code base?

Exakat needs to identify other large PHP projects so as to add them to the top 10. Any suggestion may be sent to contact@exakat.io or @exakat on Twitter!
