From PHP code to static analysis

Static analysis is the review of PHP code without executing it. In short, it means reading the code and understanding it, just as developers do when they modify a piece of PHP. It is the same process, with different tools. Here is the path from PHP code to static analysis.

Manual code review

Static analysis: role distribution

Manual review is a slow process, where a reviewer reads the code and uses experience and know-how to discover poorly coded classes. Such reviews report high-level errors, close to the architecture and design levels, as well as common errors at the line-of-code level.

Static analysis finds its place in the simple reviews. It also aims at handling the systematic and repetitive ones. Once automated, a review may be applied over and over to the same code base, following its evolution. Humans, on the other hand, are better at higher levels of abstraction and at detecting new patterns.

The Static Load

Static analysis is distinct from dynamic analysis: the latter requires executing the code and, for that, needs working code and a valid architecture. Unit testing, life cycle testing and log monitoring are all forms of dynamic analysis.

In short, dynamic analysis runs the code and works with concrete data, whether real or fake. It observes the actual reactions of the code across a large selection of situations. This includes both data and PHP behavior.

Here, you can see the different stages of PHP code execution, and how static analysis differs from it.

PHP starts from the text file: you might call it code or source, but it is initially nothing more than a text file. PHP applies the tokenizer to it: this is the part of PHP that breaks the text into PHP tokens, which are basically PHP words. Then, it checks the syntax: these are the rules that organize the tokens into meaningful sentences. This is similar to a natural language.

At that point, PHP starts its own process: optimize the code for execution, maybe cache some of it, then go to the execution phase. The execution has its own constraints: for example, PHP will not run an infinite loop indefinitely. It will stop at the infamous ‘Max execution time’ of 30s (by default), or it will reach a final condition. Even on a long-running server with an infinite loop, it will eventually stop when the hardware breaks down. No infinite loop for PHP.

On the other hand, static analysis handles infinite loops quite well. Since the code is not executed, but merely understood, a while(true) {} is just another loop. This is a significant advantage over the PHP engine.

The other advantage of static analysis is the time constraint. PHP works hard to produce the expected content in the smallest amount of time possible, while static analysis has a far larger time frame. It runs at development time, not at production time. Any interesting feedback is worth a reasonable amount of time.
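
As a minimal sketch of that point, the snippet below reads a script containing an infinite loop as plain text and looks for its while token. It relies on token_get_all() and the T_WHILE constant from ext/tokenizer; the sample source and the variable names are only illustrative.

```php
<?php
// The script under review is plain text: it is read, never executed.
$source = '<?php while (true) { } echo "never reached";';

// token_get_all() comes from ext/tokenizer, which ships with PHP.
$tokens = token_get_all($source);

// Count the while loops: for the analyzer, this is just another structure.
$loops = 0;
foreach ($tokens as $token) {
    if (is_array($token) && $token[0] === T_WHILE) {
        $loops++;
    }
}

echo "Found $loops while loop(s) in ", count($tokens), " tokens\n";
```

The analysis terminates right away, whereas running the reviewed script would never reach its echo.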

Processing code with static analysis

For static analysis to understand code, it has to follow several steps. We have just seen the tokenization process, which turns the source into PHP elements.

The path from text file to code audit

The process starts from the code itself. The files are tokenized, then given to the parser, which builds an AST: an Abstract Syntax Tree. This is an advanced representation of the code. Then, the code representation is stored in a central database. That database may be as simple as memory itself, or more sophisticated, like a SQL or a graph database. The important aspect here is to have a query system: a way to search that large database to extract patterns. The patterns themselves are derived from coding references: security recommendations, best practices, migration guides, etc. They are turned into a query or a visitor pattern, to visit all the situations in the code. Then, any spotted issue is reported as a result.
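
The steps above can be sketched in a few lines, with strong simplifications: the "database" here is just the in-memory token array (no AST is built), and the single rule, its title and the sample source are hypothetical. A real engine would query a much richer representation.

```php
<?php
// The code under review, as text.
$source = <<<'PHP'
<?php
$code = $_GET['code'];
eval($code);
echo 'done';
PHP;

// Step 1: store a representation of the code. Here, a plain array of tokens.
$database = token_get_all($source);

// Step 2: each rule pairs a coding reference with a query on the database.
$rules = [
    'Avoid eval()' => function (array $tokens): array {
        $lines = [];
        foreach ($tokens as $token) {
            if (is_array($token) && $token[0] === T_EVAL) {
                $lines[] = $token[2]; // line number of the match
            }
        }
        return $lines;
    },
];

// Step 3: run every query and collect the issues.
$report = [];
foreach ($rules as $title => $query) {
    foreach ($query($database) as $line) {
        $report[] = "$title (line $line)";
    }
}

print_r($report);
```

Adding a new review then means adding a new entry to the rules, without touching the rest of the pipeline.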

The whole process is quite simple, yet it includes some technical situations that are worth detailing.

The Atomic Tokens

The tokenization process is the first process. It is also the oldest process of PHP, and has been around since PHP 2. It has grown over the years and has been modernized several times, in particular with the addition of the next step: the AST. Yet, it holds a number of interesting features.

The PHP tokenizer mechanism is available thanks to the ext/tokenizer extension. You may use the token_get_all() function on the content of a PHP file, and get a long array of tokens, just like the one below. The original script is on the left, and a part of the tokens is presented on the right.

Tokenizer: turning PHP code to tokens

In great PHP tradition, the array is polymorphous: it expresses some of the tokens as a single string (like the parentheses), and others as an array of three elements: index 0 holds the token identifier, index 1 holds the code as written in the text file, and index 2 holds the line number in the PHP file. You may identify the 'EXT' string, or the define function name, in the tokens quite easily.

Tokens are usually designated by their name: T_STRING, T_EVAL, T_VARIABLE, etc. Since they are constants, they appear with their constant value in the token array. You may use token_name() to access their name. Such values change from version to version: since only PHP and static analyzer authors use them, this is not a problem. We’ll only refer to tokens by their name from now on.
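
Here is a small sketch of that mechanism, using token_get_all() and token_name() on a one-line script. The sample source and the '(char)' label for single-character tokens are choices made for this illustration only.

```php
<?php
// A tiny script to tokenize; it is treated as plain text.
$source = '<?php $x = strtolower("EXT");';

$rows = [];
foreach (token_get_all($source) as $token) {
    if (is_array($token)) {
        // Array form: [token identifier, text as written, line number]
        [$id, $text, $line] = $token;
        $rows[] = [token_name($id), $text, $line];
    } else {
        // String form: single-character tokens such as ( ) ; =
        $rows[] = ['(char)', $token, null];
    }
}

// Display the token name next to the original text.
foreach ($rows as [$name, $text, $line]) {
    printf("%-28s %s\n", $name, json_encode($text));
}
```

Calling token_name() hides the numeric values, so the output stays readable across PHP versions even though the values themselves may change.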

Not All Tokens Are Equal

It is easy to end up with a million tokens from a given PHP file. Tokens cover every aspect of the PHP script: variables, constants, comments, spaces, delimiters. Usually, one third of the tokens are white space and comments: PHP simply gets rid of them at execution time. Static analysis keeps some of them, such as the PHPDoc comments, but gets rid of most of the others.

PHP source code also includes a lot of delimiters, such as ( ) { } [ ] ' " etc. Those are important for PHP to understand the organisation of the code; yet, they are not important beyond the parser, and are removed too. All in all, this means that two thirds of any PHP script are useless for execution. This is quite a surprising realization.
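
To get a feel for those proportions, the sketch below sorts the tokens of a small sample into three buckets. The sample source, the bucket names and the list of delimiters are assumptions made for the illustration, and the proportions on such a short script will not match those of a real code base.

```php
<?php
$source = <<<'PHP'
<?php
// A comment that PHP discards
/** A PHPDoc comment that static analysis may keep */
function foo($a) {
    return strtolower($a);
}
PHP;

$counts = ['whitespace and comments' => 0, 'delimiters' => 0, 'other' => 0];
$ignorable = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
$delimiters = ['(', ')', '{', '}', '[', ']', ';', ','];

foreach (token_get_all($source) as $token) {
    if (is_array($token)) {
        // Named tokens: either ignorable, or carrying actual meaning.
        $bucket = in_array($token[0], $ignorable, true) ? 'whitespace and comments' : 'other';
    } else {
        // Single-character tokens: mostly delimiters.
        $bucket = in_array($token, $delimiters, true) ? 'delimiters' : 'other';
    }
    $counts[$bucket]++;
}

print_r($counts);
```

Running the same count over a whole project gives a quick estimate of how much of the text survives past the parsing stage.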

One PHP AST

The delimiters are not kept, per se, for the execution. Obviously, they are important for PHP to understand the code. So, while those delimiters are removed, their meaning is preserved. For example, the curly braces of a block in a function separate the code in the block from the surrounding context.

One AST for PHP

In the code above, you can see a simple piece of PHP code, broken down into its AST. Delimiters such as curly braces are turned into a ‘link’ called ‘EXPRESSION’. The parentheses are converted into ‘CONDITION’ (for if/then), or ‘ARGUMENT’ (for function parameters). PHP reuses the same tokens for multiple situations: the actual meaning of a token depends on its context.

One of the most reused tokens is T_VARIABLE, which, as you would expect, is used for variables ($x = 1), but also for parameters (function foo($parameter)), property definitions (private C $property = 1), global definitions, and static member designations (A::$property). There is no way to distinguish them at the token level: it all depends on the context.

Nowadays, the AST is directly available from PHP itself. There is a C extension, aptly called ext/ast, which skips all the token work. It speeds up and normalizes the development of static analysis tools.
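
A short sketch makes this visible: the sample class below (hypothetical, and only tokenized, never executed) uses a property, a parameter, a global and a local variable, and the tokenizer reports every one of them as T_VARIABLE.

```php
<?php
$source = <<<'PHP'
<?php
class A {
    private static $property = 1;
    function foo($parameter) {
        global $config;
        $x = A::$property;
    }
}
PHP;

// Collect the text of every T_VARIABLE token, whatever its role in the code.
$variables = [];
foreach (token_get_all($source) as $token) {
    if (is_array($token) && $token[0] === T_VARIABLE) {
        $variables[] = $token[1];
    }
}

print_r($variables);
```

Telling a property definition from a static member access requires looking at the neighbouring tokens, which is exactly the context the AST provides.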

Let’s play in the trees

While the AST is a milestone in understanding code, it is definitely not the last step. The current state describes the code at the line level (as in Line Of Code). As developers, we understand a lot more of the code by noticing that a function call (foo($a)) and a function definition (function foo($b)) are conceptually linked: one is the definition, and the other one is the usage.

This is a meta notion in the programming world. PHP uses it heavily, although the way it is implemented depends heavily on the definition-usage pair. Here is a short list of them, to give you an idea.

  • Variables are created by assignment (explicit assignments, parameter passing), and they are used by being summoned.
  • Classes are created with the ‘class’ keyword, and used with the ‘new’ keyword
  • Traits are created with the ‘trait’ keyword, and used with the ‘use’ keyword (sic)
  • Includes are created with each new file (this is even external to PHP), and used with ‘include’ and its cousins
  • Functions are created with the ‘function’ keyword, and used by their name, with parentheses or not (callbacks)
  • Constants, properties, methods, anonymous classes…

It is possible to establish a link between any of the usages mentioned above and their definition. This creates a shortcut in the code, which is crucial to understanding the flow of the execution.

During the execution, PHP uses numerous hashmaps to jump from one part of the code to another. When the jump is not possible, it may either produce a warning (undefined variable) or a fatal error (undefined function). This error processing is a bit harsh, but it keeps the engine fast.

This is also the moment where static analysis has to part ways with PHP behavior: static analysis searches for those missing definitions. It has to differentiate between the ‘foo()’ function call and an erroneous call to ‘goo()’: one has a definition, and the other does not.
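
The foo() and goo() situation can be sketched with tokens alone. This is a deliberately naive illustration: it only handles plain functions in a single file, and it ignores methods, namespaces, callbacks and native functions, all of which a real engine has to deal with. The sample source and variable names are hypothetical.

```php
<?php
$source = <<<'PHP'
<?php
function foo($b) { return $b; }
foo(1);
goo(2);
PHP;

// Drop white space and comments, so that neighbours are meaningful tokens.
$ignored = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
$tokens = array_values(array_filter(
    token_get_all($source),
    function ($token) use ($ignored) {
        return !is_array($token) || !in_array($token[0], $ignored, true);
    }
));

$definitions = []; // function name => line of definition
$calls = [];       // list of [function name, line of call]
foreach ($tokens as $i => $token) {
    if (!is_array($token) || $token[0] !== T_STRING) {
        continue;
    }
    $previous = $tokens[$i - 1] ?? null;
    $next = $tokens[$i + 1] ?? null;
    if (is_array($previous) && $previous[0] === T_FUNCTION) {
        // 'function' keyword followed by a name: this is a definition.
        $definitions[strtolower($token[1])] = $token[2];
    } elseif ($next === '(') {
        // A name followed by an opening parenthesis: this is a usage.
        $calls[] = [strtolower($token[1]), $token[2]];
    }
}

// Link each usage to its definition, and keep the ones left without any.
$undefined = [];
foreach ($calls as [$name, $line]) {
    if (!isset($definitions[$name])) {
        $undefined[] = [$name, $line];
    }
}

print_r($undefined);
```

Function names are lowercased before comparison because PHP function names are case-insensitive.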

AST with definition links added

As you can see, the AST has been distorted by the addition of ‘DEFINITION’ links, which bring a function closer to its usage, and parameters closer to their usage inside the function block. Extended to the full size of the application, we now have a dense cloud of objects. Actually, it is not reasonable to display it in this article, beyond simple examples such as the one above.

The Missing Definitions

The cloud of PHP code gets even denser when we start noticing that some of the functions (or methods, or classes…) are missing a definition. You heard that right: a function that has no definition, yet works flawlessly with PHP code. There are two kinds of such functions: the native functions, and the external components.

PHP code relies on components and native PHP functions

Native functions are functions such as strtolower, which are part of the default PHP distribution. The specific example of strtolower is always compiled, although it may be disabled in the php.ini file. There are other native functions, such as mb_strtolower(), which are also considered native, but depend on the actual configuration. Extensions from PECL and from beyond (such as Xdebug) all bring constants, functions, classes and directives that have to be recognized in the code.

For those, static analysis has to make use of a local database of definitions to recognize those structures. The signature of a function is often sufficient to understand that pow(1,2,3) is actually calling a native PHP function, with too many arguments. Each extension may be documented to recreate any expected function.

The same strategy is applicable to external components. Here, components is an umbrella word that covers any external code that is used, but not invented here. That means frameworks, composer components, libraries, plugins and other modules. Those are usually excluded from the analysis, as there is no point in reviewing the code of an independent team until it is plagued by a bug that leaks into your own code.

The same strategy of stub files, which describe the code of the component yet skip the implementation details, is a good way to keep static analysis focused on the matter of the day: your own code.
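
The pow(1,2,3) case can be sketched with PHP's own Reflection API standing in for the local database of definitions. This is an assumption made for the example: an actual analyzer would rather ship its own definitions, so that the result does not depend on the PHP binary running the analysis. The checkArgumentCount() helper is hypothetical.

```php
<?php
/**
 * Returns a message when a native function is called with too many arguments,
 * or null when the call looks fine or the function is unknown.
 */
function checkArgumentCount(string $name, int $argumentCount): ?string
{
    if (!function_exists($name)) {
        return null; // not a function known to this PHP build
    }

    $function = new ReflectionFunction($name);
    if ($function->isVariadic()) {
        return null; // any number of arguments is accepted
    }

    $max = $function->getNumberOfParameters();
    if ($argumentCount > $max) {
        return "$name() accepts at most $max argument(s), $argumentCount given";
    }

    return null;
}

// pow() is declared with two parameters: a call with three is suspicious.
echo checkArgumentCount('pow', 3), "\n";
```

The same signature lookup extends to minimum argument counts with getNumberOfRequiredParameters().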

Ready to Navigate the Code

You may measure the distance we have walked together. We started from the text files that are the source code, tokenized them, parsed them, linked the result and complemented it with external knowledge. This is quite similar to the work PHP does every time we hit a web server and it executes some code. It is also a lot more work than just executing it: as we have seen, this new representation of the code is able to detect undefined code and allows us to fix it. This is a major advantage, as we are getting ready to navigate this ocean of expressions and operators, detecting mental overload, impossible code and accumulated complexity.

You may see all this in action, by running the Exakat engine. Take a look at the public reports, check the Exakat engine and take us for a ride with Exakat cloud.

Happy PHP code reviews!
