Moving from array to class

Arrays forever

Ever since I started using PHP, arrays have been my friend. They are versatile, they come with a wide range of functions, and they are easy to use. I kept using them version after version, and even with PHP 7.2, I still rely on them a lot. Over the years, classes have also made their way into my toolset. They have a different usage: classes are for complex data structures and business logic, while simple data structures get an array. Until we tried what seemed impossible: moving from an array to a class.

Classes are more efficient than arrays in PHP

I ran into yet another tweet claiming that classes are getting better than arrays in PHP 7. This came from Nikita Popov, so it is serious. I took a closer look, and found an interesting article from the same author. Arrays and classes are basically the same internally: a list of properties. Classes contain this list of properties and a class name, for definitions. On the other hand, arrays define their own keys and values. They do it for each occurrence, while classes share their property definitions and do it only once.

Basically, when creating ['a' => 1] 10 times, PHP stores the key 'a' 10 times, once per array. Instantiating class A { public $a; } 10 times means that 'a' is stored only once, in the class definition. Later, PHP identifies the property as 'number 0', and each object saves the room taken by the property name. To make this example obvious, replace 'a' by 'a_really_long_but_still_valid_php_name_for_a_property' and the savings become clear.

Replacing an array by a class

How can I use this feature of PHP to my advantage? Apparently, replacing an array by a class would save memory. The number of properties must be significant. Reading the graph from Nikita, any size offers a gain, but the smaller the array, the smaller the gain. So small arrays, of 5 properties or fewer, are excluded. Also, and maybe more important, the usage of the array must be high: ideally, an array created in a loop is a prime candidate.

My first check was a quick one, with a simple micro-benchmark script. How do 10000 array creations compare to 10000 object instantiations? Check the script on https://3v4l.org/AiZN2. The results are simple: objects are usually 30 % faster, and they use half the memory. Both results are interesting and significant. They also vary a bit from one run to the next.
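For readers who want to reproduce that kind of comparison locally, here is a minimal sketch. It is not the script from the link above: the Token class, its six property names and the measure() helper are all made up for illustration, and the values are chosen so that PHP has to build each array at run time.

```php
<?php
// Hypothetical property-only class: the six property names are illustrative.
class Token
{
    public $token;
    public $code;
    public $fullcode;
    public $line;
    public $rank;
    public $count;
}

// Builds $count items with $factory, and reports the time and memory it took.
function measure(callable $factory, int $count): array
{
    $memoryBefore = memory_get_usage();
    $timeBefore   = microtime(true);

    $items = [];
    for ($i = 0; $i < $count; ++$i) {
        $items[] = $factory($i);
    }

    // $items is still alive here, so its memory is part of the measure.
    return [
        'count'   => count($items),
        'seconds' => microtime(true) - $timeBefore,
        'bytes'   => memory_get_usage() - $memoryBefore,
    ];
}

// Array version: the keys are repeated in every single array.
$makeArray = function (int $i): array {
    return [
        'token' => 'T_VARIABLE', 'code' => '$a', 'fullcode' => '$a',
        'line'  => $i, 'rank' => $i, 'count' => $i,
    ];
};

// Class version: the property names live once, in the class definition.
$makeObject = function (int $i): Token {
    $token = new Token();
    $token->token    = 'T_VARIABLE';
    $token->code     = '$a';
    $token->fullcode = '$a';
    $token->line     = $i;
    $token->rank     = $i;
    $token->count    = $i;
    return $token;
};

$arrayStats  = measure($makeArray, 10000);
$objectStats = measure($makeObject, 10000);

printf("arrays  : %.4f s, %d bytes\n", $arrayStats['seconds'], $arrayStats['bytes']);
printf("objects : %.4f s, %d bytes\n", $objectStats['seconds'], $objectStats['bytes']);
```

The exact figures depend on the PHP version and the platform, so treat them as an indication rather than a reference.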

Target in the code

Now, all we need is a good candidate: an array, with more than 5 properties, created inside a loop. Those criteria exclude a large number of option arrays, which are created somewhere and used once, later, for configuration. They also exclude arrays used as a base for loops: those should be turned into a generator.

At the heart of the Exakat engine, the Loader is responsible for reviewing all the tokens and putting them in the Gremlin database. It turns a PHP script into an array of tokens, and prepares those tokens for insertion. At the same time, it collects several pieces of information that are provided indirectly by the PHP tokenizer. PHP may report that a token is a variable, while it is in fact a property definition (private $var), an argument (foo($arg)), or a variable name ($$var). Loader collects that kind of information for the graph to be built.

About two thirds of the tokens read by the PHP tokenizer are discarded. A good third of the tokens are actually useless, like whitespace, comments, etc. The second third are delimiters, like () or ;, which are important to parse and understand PHP code, but are not useful to run static analysis. The last third of the tokens is the interesting one. A project like WordPress has about 2 million such tokens, eZ Publish has 4.4 million, and the largest of all (Wikia) has 11 million.

Array and how it appears in the code

Currently, the token that will be inserted in the graph is handled by an array. Initially, that's how PHP provides the token descriptions when calling token_get_all(). This function returns an array of tokens. Some tokens are described as a plain string, like parentheses or curly brackets, but most of them are an array containing the token identifier (an integer constant), the actual token text and the line number.

Tokens are processed, and get completed with other information. Line number and code are always provided for every token, along with 'fullcode', which is a rebuild of the current structure with its supporting tokens. For example, you want to see $array['index'] in a fullcode, while the current token only holds an uninformative '$array'.

Many properties are calculated depending on the type of token. For example, strings get their delimiter spotted: it may be ' or " or none. This doesn't apply to variable names. Global variables get their name stripped of the leading $, so as to be found in the $GLOBALS array. And classes get their fully namespaced name calculated. All those properties depend on the token, and they are just omitted when they are not needed. This is also the strength of the array: it is schema-less.

This leads to a lot of isset() tests. Sometimes, the properties are read by the parent token. For example, when checking whether an expression is constant, the parent token has to check that all its children have a constant property, and that this property is set to true. As a matter of fact, the rising usage of isset() was a sign that the data structure was not adapted.
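To make the two shapes returned by token_get_all() concrete, here is a small sketch that normalizes them into one array per token. The keys 'token', 'code' and 'line' are chosen for illustration only, they are not necessarily the ones used in the Loader.

```php
<?php
// token_get_all() mixes two shapes: arrays for named tokens,
// and plain strings for single-character tokens such as ( ) = ;
$tokens = token_get_all('<?php $x = foo($y);');

$normalized = [];
$line = 1;
foreach ($tokens as $token) {
    if (is_array($token)) {
        // [identifier, text, line number]
        [$id, $text, $line] = $token;
        $normalized[] = ['token' => token_name($id), 'code' => $text, 'line' => $line];
    } else {
        // Plain string: no line number is provided, so the last known one is reused.
        $normalized[] = ['token' => $token, 'code' => $token, 'line' => $line];
    }
}

foreach ($normalized as $item) {
    printf("%-12s %-8s line %d\n", $item['token'], trim($item['code']), $item['line']);
}
```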

Replacing the array by a class

Replacing the array by a class was quite easy. The array acts as a container, with very little processing related to its internal data. The only processing needed is the check for existence with isset(), and this may be handled with default values. So, the replacing class has neither methods nor constants: only public properties. The property names were the indexes of the array.

Collecting all the property names with __set()

The main problem at that stage was to find all the properties. The class needs to be defined first, then instantiated. Of course, the schema-less nature of the arrays meant that no central repository of all the indexes was available. Lazy as I am, I collected them on the fly. I used the magic method __set() to help me: __set() is called when writing to a property that is not defined. I wrote it so it would warn me of undefined properties in the replacing class. Running the unit tests now meant collecting the missing property names.

One of the advantages of using a class instead of an array is that a class allows for default values. The arrays are created empty, and one needs to explicitly set the default values. By moving to a class, with default values for properties, every line that sets an array index to a default value can be dropped. I could drop around 50 lines in a class of 4k LOC. That's a net gain, and always good to do.

The other good reduction of code was the conversion of $array['fullcode'] to $array->fullcode. This looks innocuous too, but the first expression is 4 tokens, and the second is 3. The code is lighter to read, shorter to write, and probably faster to parse for PHP. It's only 'probably' here, as this was not measured, and it doesn't mean much compared to the processing time.
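A sketch of that collecting trick could look like the following. The Atom class name, its two declared properties and the static $missing list are assumptions made for the example: only the use of __set() comes from the text.

```php
<?php
// Hypothetical array-class, with only two properties declared so far.
class Atom
{
    public $code = '';
    public $line = 0;

    // Names of the properties that were written but never declared.
    public static $missing = [];

    // Called only when writing to a property that is not declared.
    public function __set($name, $value)
    {
        self::$missing[$name] = true;
        echo "$name is not set in " . __CLASS__ . "\n";
        // The value is dropped on purpose: the goal is to collect names, not to store data.
    }
}

$atom = new Atom();
$atom->code      = '$array';           // declared: __set() is not called
$atom->fullcode  = '$array["index"]';  // undeclared: __set() reports it
$atom->delimiter = "'";                // undeclared: __set() reports it

print_r(array_keys(Atom::$missing));
```

Once the list stops growing while the unit tests run, the reported names can be declared in the class and the __set() method removed.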
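As an illustration of the lines that disappear, compare the two versions in this sketch. The property names and their default values are hypothetical, picked to look like the ones described above.

```php
<?php
// Array version: the defaults are written again every time a token is created.
function newTokenArray(string $code, int $line): array
{
    $token = [];
    $token['code']      = $code;
    $token['line']      = $line;
    $token['fullcode']  = '';     // default
    $token['constant']  = false;  // default
    $token['delimiter'] = '';     // default
    return $token;
}

// Class version: the defaults are written once, in the declaration.
class Atom
{
    public $code      = '';
    public $line      = 0;
    public $fullcode  = '';
    public $constant  = false;
    public $delimiter = '';
}

$array = newTokenArray('$array', 3);

$atom = new Atom();
$atom->code = '$array';
$atom->line = 3;

var_dump($array['constant'], $atom->constant);
```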

Replacing isset with comparison with constants

All isset() calls were now obsolete, as the class would always have the properties set, useful or not. So, the isset() calls were hunted down and replaced with comparisons. Instead of isset($array['intval']), I could now read $array->intval !== self::NO_INTVAL. Again, the code was made a lot more readable, although it was longer.
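Here is a minimal sketch of that replacement. The value of NO_INTVAL is an assumption: any value that can never be a real intval would do, and null is used here. The hasIntval() helper is also hypothetical.

```php
<?php
// Hypothetical array-class with a sentinel constant as default value.
class Atom
{
    const NO_INTVAL = null; // assumed sentinel: never a valid intval

    public $code   = '';
    public $intval = self::NO_INTVAL;

    public function hasIntval(): bool
    {
        return $this->intval !== self::NO_INTVAL;
    }
}

// Array version: the index may be missing, so it has to be tested with isset().
$array = ['code' => '$a'];
$arrayHasIntval = isset($array['intval']);

// Class version: the property always exists, so it is compared to its default.
$atom = new Atom();
$atom->code = '$a';
$atomHasIntval = $atom->intval !== Atom::NO_INTVAL;

$literal = new Atom();
$literal->code   = '12';
$literal->intval = 12;

var_dump($arrayHasIntval, $atomHasIntval, $literal->hasIntval());
```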

Final caveat

For a short moment, I regretted all the promising new code when I reached this line :

<?php
  fputcsv($fp, $array);
?>

fputcsv() is a native PHP function that requires an array. There is no way to give it an object. The first workaround felt like treason: it is a cast to (array), and it worked all right. It was even better than with the array. In fact, with all the properties being defined a priori, they keep the same order, whatever the way they are processed. This meant that some sorting code, which was used to make sure the right headers were pushed in the right CSV file, could now go.

Finally, the casting was moved to a method of the array-class: some of the CSV exports require a short list of columns and not all of them. A method was the perfect place to convert the class to some specific format. Initially, the array-class was built to be an alternative to an array, and method-less. It appeared to be quite handy to be able to extract consistent information when it was needed. It is also a good place to store the constants used as default. And although it doesn’t apply here, adding setters to every property could also add an extra layer of data-checking to the class.
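The two kinds of export may be sketched as follows. The Atom class, its properties and the method names toArray() and toColumns() are hypothetical, as is the choice of writing to an in-memory stream. The delimiter, enclosure and escape arguments of fputcsv() are passed explicitly, with their historical default values.

```php
<?php
// Hypothetical array-class: property names are illustrative only.
class Atom
{
    public $id       = 0;
    public $token    = '';
    public $code     = '';
    public $fullcode = '';
    public $line     = 0;

    // Full export: the cast keeps the declaration order of the public properties.
    public function toArray(): array
    {
        return (array) $this;
    }

    // Partial export: only the requested columns, in the requested order.
    public function toColumns(array $columns): array
    {
        $row = [];
        foreach ($columns as $column) {
            $row[$column] = $this->$column;
        }
        return $row;
    }
}

$atom = new Atom();
$atom->line = 12;       // assigned out of order on purpose
$atom->code = '$array';
$atom->id   = 1;

$full  = $atom->toArray();
$short = $atom->toColumns(['id', 'line']);

// Write the header and the row to an in-memory CSV file.
$fp = fopen('php://memory', 'w+');
fputcsv($fp, array_keys($full), ',', '"', '\\');
fputcsv($fp, $full, ',', '"', '\\');
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);

echo $csv;
```

Whatever the order of the assignments, the columns come out in the order of the property declarations, which is what makes the sorting code unnecessary.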

Readability

First, the PHP code is more readable and shorter. The whole class lost about 10 % of its code size, thanks to the shorter syntax: from 208k to 187k. This includes the move from isset() to comparisons. The code itself didn't change at all: no new feature, no cleaning. Just smaller code. That's a first win.

Performance

Running on a small PHP project, the Exakat engine took 20s to load, and it now takes 18s, so 2 seconds less. That's a 10 % speed gain, on a real-world application. We didn't measure the actual gain on the class alone, but within the whole application, including all the other collaborating classes. Except for the Loader, those classes didn't get any refactoring and saw no gain.

Memory

Memory consumption went down from 45 MB to 25 MB. That is roughly a 45 % reduction in memory usage. This is even better with larger loads, and less interesting with small loads: there is some constant overhead that won't go away. Still, the refactoring was worth it.

Replacing an array by a class

Replacing arrays with classes was an interesting move, surprisingly so, at several levels. Here is a list of things worth remembering from this experiment:

  • After the constant-only interfaces, the static method-only traits, the property-only class is a tool to know about. Avoid abusing it.
  • A properties-only class is a worthy replacement for an array. A good array candidate for such replacement should not be processed with array_* PHP functions (and sort, and implode…). Ideally, it is created in a loop, and has many properties.
  • Moving to a class means setting a default for every value. This is another valid case for the move: replacing isset() with readable comparisons, and avoiding explicitly setting default values every time the array is created.
  • The magic method __set() is a good tool, during development, to warn about undefined properties. This is the dynamic analysis equivalent to searching for undefined properties with static analysis. With a simple function __set($name, $value) { echo "$name is not set in ".__CLASS__."\n"; } in a class, you may use __set() with a goal opposite to its initial purpose.
  • Not all arrays should be turned into a class. Rarely used arrays and small arrays are probably not worth turning into a class. It takes time, and won't show in the performance.
  • Turning an array into a class is the first step toward extending it with methods and constants. Do it for the memory, stay for the features.

The ideas presented here were applied successfully in the Exakat engine, the smart auditing tool for PHP. Download it now and follow us on twitter.