Upgraded Exakat: new power under the hood

There is an upgraded Exakat version available. Since May, we’ve been working hard on a major upgrade of the exakat code base. After two years of growing and adding features, the initial architecture was showing signs of aging and needed an overhaul. For example, structure definitions, such as classes, functions, constants, etc., were kept in an index. This allowed for easy lookup of definitions, but made it harder to count usages. Definitions are now connected to their usages by a normal link in the graph, which may be followed in both directions.
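
As a rough illustration of why a two-way link helps, here is a minimal Python sketch. It is not exakat's code: the `CodeGraph` class, its method names and the sample strings are all hypothetical, and a pair of dictionaries stands in for the graph database.

```python
from collections import defaultdict


class CodeGraph:
    """Toy stand-in for the graph: each definition link is stored both ways."""

    def __init__(self):
        self.definition_of = {}             # usage -> definition
        self.usages_of = defaultdict(list)  # definition -> list of usages

    def link(self, definition, usage):
        # One logical link, recorded in both directions.
        self.definition_of[usage] = definition
        self.usages_of[definition].append(usage)

    def find_definition(self, usage):
        # Usage -> definition: the lookup an index already made easy.
        return self.definition_of.get(usage)

    def count_usages(self, definition):
        # Definition -> usages: the direction an index made awkward.
        return len(self.usages_of.get(definition, []))


graph = CodeGraph()
graph.link("function foo", "foo() at a.php:10")
graph.link("function foo", "foo() at b.php:3")
graph.link("class Bar", "new Bar at a.php:12")
```

With the link stored in both directions, `graph.find_definition(...)` and `graph.count_usages(...)` are equally cheap, which is the property the paragraph above describes.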

Moving from Gremlin 2 to 3

The second upgrade of importance is the change from Gremlin 2.0 to Gremlin 3.0. The new version brings many incompatibilities, and changes the way the graph is navigated. In particular, the Gremlin 3.0 API covers more ground, which reduces the need for lambda functions. This is also the trend we’ve been following while writing exakat’s code: staying more and more within the Gremlin API, and avoiding leaks into Java/Groovy or procedural code. This is an improvement and, hopefully, it will also give the Gremlin engine and the underlying database more freedom to optimize the queries. The main consequence was a total rewrite of the ‘tokenizer’ phase of exakat, and an upgrade of the analyzer.

Rewriting the AST phase of Exakat

Up to 0.6.7, exakat loaded raw tokens in the graph, and used Gremlin to build the AST (Abstract Syntax Tree). The import was fast, and the build was slow. It was also based on 827 queries, more or less dynamically built, which proved to be unmanageable to upgrade. After upgrading 80 of them, we decided to find another approach.

The new approach was to change the load process: rely on PHP to build the AST, and then import it into the database. This phase is built as a one-pass analysis: the list of tokens, obtained from the PHP tokenizer itself, is now read and processed into a tree in one loop. This makes the code an order of magnitude faster than previously. Later, the organized tokens are loaded in the database with their correct label, and immediately indexed. Fewer tokens are loaded, and only useful ones: the new graph is cleaner, and ready to run the analysis.
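
To give an idea of what a one-pass build looks like, here is a minimal Python sketch. It is an illustration only, not exakat's implementation (which is written in PHP): the `build_tree` function, the node layout and the tiny token list are assumptions, and only braces and whitespace are handled.

```python
def build_tree(tokens):
    """Turn a flat token list into a nested tree in a single loop."""
    root = {"label": "File", "children": []}
    stack = [root]  # currently open nodes, innermost last
    for token in tokens:
        if token == "{":
            # Opening brace: start a new block under the current node.
            block = {"label": "Block", "children": []}
            stack[-1]["children"].append(block)
            stack.append(block)
        elif token == "}":
            # Closing brace: the innermost block is complete.
            if len(stack) == 1:
                raise ValueError("unbalanced closing brace")
            stack.pop()
        elif token.isspace():
            # Whitespace carries no meaning here, so it is never stored.
            continue
        else:
            stack[-1]["children"].append(
                {"label": "Token", "code": token, "children": []}
            )
    if len(stack) != 1:
        raise ValueError("unclosed block")
    return root


# Hypothetical, already-split token list for a tiny function body.
tokens = ["function", " ", "foo", "{", "return", " ", "1", "}", "\n"]
tree = build_tree(tokens)
```

Each token is visited exactly once, and tokens that are not useful (whitespace in this sketch) never reach the tree, so they would never reach the database either.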

Exakat’s DSL

The analysis phase of exakat also relies on Gremlin queries to analyze the code but, unlike the tokenizer phase, it has a decoupling layer. To keep the analyzers’ queries close to PHP semantics, a wide range of methods translate PHP concepts such as ‘all non-anonymous classes’ into Gremlin such as ‘g.V().hasLabel("Class").where( __.out("NAME").hasLabel("Void").is(eq(0)) )’.

There are 174 of them, so converting them all to Gremlin 3 is a long job. However, the number of raw Gremlin queries dropped from 764 to 392. Just as with Gremlin 3 itself, more processing is being done within the exakat API, and less with raw Gremlin scripts, which can’t be upgraded in a generic way. That is a good step in the right direction. The next target will not be to remove all of those ‘raw’ calls, but rather to continue reducing them: after all, innovation often starts there, until it gets reused, and then generalized in the API.
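
The idea of such a decoupling layer can be sketched in a few lines of Python. This is not exakat's DSL: the `Query` class and its method names are hypothetical, and the generated query uses the Gremlin 3 `not()` step, which is one possible way to express the condition rather than the exact query quoted above.

```python
class Query:
    """Hypothetical fluent builder: each method appends Gremlin steps."""

    def __init__(self):
        self.steps = ["g.V()"]

    def atom_is(self, label):
        # Low-level helper: keep only vertices with this label.
        self.steps.append(f'hasLabel("{label}")')
        return self

    def has_no_out(self, edge, label):
        # Low-level helper: reject vertices linked to `label` through `edge`.
        self.steps.append(f'not(__.out("{edge}").hasLabel("{label}"))')
        return self

    def non_anonymous_classes(self):
        # PHP-level concept, expressed with the low-level helpers.
        return self.atom_is("Class").has_no_out("NAME", "Void")

    def to_gremlin(self):
        return ".".join(self.steps)


query = Query().non_anonymous_classes().to_gremlin()
```

An analyzer written against `non_anonymous_classes()` does not contain any Gremlin itself, so a change of Gremlin version only requires updating the helper methods.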

A few lessons learned during the process

  • Unit tests are crucial during the rewrite. They kept the new code in line with the previous one, and allowed comparisons. They confirm that the current version still needs some work, but that more than half the features are correct: this is good for morale.
  • Once the rewrite is started, any trust acquired in the code over months of usage is lost. And this trust can only be rebuilt once the unit tests are passing again (or at least, enough of them). We need tools to help build trust, as there are currently none, except running the code as often as possible.
  • Having a community, even a small one, around a project helps during times of uncertainty. It’s always encouraging to take a break from coding, and discuss a feature or a future usage of the new code. Many thanks to all who helped, and continue to help.
  • APIs leak, so when writing a new API, there should be a ‘raw’ method that gives access to all the internal parts of the underlying platform. There should also be monitoring of ‘raw’ usage, so as to remove such calls as much as possible (but no more). In fact, once there are several use cases, it is possible to see trends, and move them into the API.
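
The last lesson can be sketched as follows, in Python. This is a hypothetical illustration and not exakat's API: the `MonitoredQuery` class, its methods and the `abstract` property are assumptions. The `raw()` escape hatch accepts any Gremlin fragment and counts it, so that recurring fragments stand out as candidates for a proper API method.

```python
from collections import Counter


class MonitoredQuery:
    """Hypothetical query builder with a monitored 'raw' escape hatch."""

    raw_usage = Counter()  # raw fragment -> number of uses, shared by all queries

    def __init__(self):
        self.steps = ["g.V()"]

    def has_label(self, label):
        # Regular API method: nothing to monitor.
        self.steps.append(f'hasLabel("{label}")')
        return self

    def raw(self, gremlin):
        # Escape hatch: accept the fragment as-is, but record its use.
        MonitoredQuery.raw_usage[gremlin] += 1
        self.steps.append(gremlin)
        return self

    def to_gremlin(self):
        return ".".join(self.steps)


q1 = MonitoredQuery().has_label("Class").raw('has("abstract", true)')
q2 = MonitoredQuery().has_label("Method").raw('has("abstract", true)')
q3 = MonitoredQuery().has_label("Function").raw("count()")

# Most frequent raw fragments first: these are the trends worth generalizing.
trends = MonitoredQuery.raw_usage.most_common()
```

Here the same fragment appears in two queries, so it comes first in `trends`, which suggests it could become a dedicated method while the one-off `count()` call stays raw.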

Download it now

Exakat is currently at version 0.7.4: we produced a series of quick-fix releases last week. We’ll work on stability and total conformity with 0.6.7 in the 0.7 versions, then start adding more features in 0.8. Download exakat now!