Project (Stage 2)

First, I’d like to discuss what I have been looking at up until this point. The GNU Go source tree is rather large, and finding a proper way to benchmark the Zobrist hash it implements was exhausting, as many functions make up the overall hashing algorithm. I reached out to the mailing address listed on the game’s development website to ask about a recommended path for testing the hash, but received no response (possibly the holidays are to blame for this).

As originally discussed with my professor, optimizing the hash algorithm itself would probably yield minimal improvements, if any. With this in mind I shifted my attention to how the transposition table is structured: Go positions are the keys, and the results of the reading performed by certain functions are the data. This seemed like the best route for optimizing around the Zobrist hash, using TBL/TBX instructions on the data for quicker searches. These instructions read each value from the vector elements of an index source SIMD and FP register, use each result as an index into a table of bytes described by one to four source table SIMD and FP registers, place the lookup results in a vector, and write that vector to the destination SIMD and FP register. For those of you who don’t know what a transposition table does in the game of Go, I found this paper to have the clearest explanation…

“A transposition is the new occurrence of a state previously processed during the execution of the search algorithm. Transpositions in Go can occur due to situations in which, for example, pieces are captured and the board state returns to a previous configuration. The objective of a TT is to avoid that already processed data are reprocessed unnecessarily. As it is possible in a game of Go the capture of a single piece, as well as a group of pieces, after a capture move is executed, the board state may return to a configuration that is differentiated from the previous state, by one, two or many more pieces.

The TT is implemented as a hash table which is a data structure that associates keys to values. Each key represents a state on the Go game-board and is associated with a given piece of information such as, board evaluation and already explored tree depth. The representation of the board state in the form of hash key is carried out using a technique described by Zobrist.”

Now I want to see how specific bits in the transposition table are arranged, as this relates to the overall effectiveness of the hashing algorithm. My current thinking is that the best-case scenario would be for these bits to be used in a lookup table amenable to SIMD/vectorized instructions such as TBL/TBX. To that end, I will dig into the source code to figure out how the compiler options were set up at compile time and see whether we can pass the compiler any vectorization flags that aren’t already used (if any at all). Now let’s check out this transposition table in some more detail…

***Update for above: I have found the game engine was compiled using the -O2 flag, which gives me hope that some further optimization will occur if I switch to -O3.***

According to the documentation in the source files, the files that are of relevance to us are located in the engine directory, which looks like this…

Screen Shot 2018-01-07 at 11.29.43 PM.png

After peeking into all the files in this directory, I found that cache.h/cache.c, together with hash.h, implement the hashing of Go positions using the Zobrist algorithm. When you go into cache.h, the first thing you will notice is the definition of a Hashnode, which is a node stored in the transposition table…

…The official GNU Go documentation also has some great reading on the organization of the hash table and the hash structures that are used in the game for the more curious…

Screen Shot 2018-01-07 at 11.41.52 PM.png

From this we can see, by breaking down the 32 bits the unsigned int data holds, what each particular bit is allocated for. Following this, it defines what a Hashentry is and also performs some right and left bitwise shifts on the data…

Screen Shot 2018-01-07 at 11.48.03 PM.png

You will also notice the definition of a transposition table following the bit-shifting operations. After looking at the header file, let’s dive into the source file cache.c to check out the transposition table implementation…

Screen Shot 2018-01-07 at 11.59.12 PM.png

We can also notice a function that uses bitwise XOR operations to calculate a hash value for the transposition table. Peering deeper into this file, we can find a function that initializes a transposition table as well as one that clears it…

Screen Shot 2018-01-08 at 12.06.20 AM

It is interesting to note the use of malloc() to allocate memory for the transposition table. There are two functions that I believe are most worth investigating further for optimization purposes: tt_get() and tt_update(). The former looks up a result and move and returns a value, while the latter updates a transposition table entry.

tt_get():

Screen Shot 2018-01-08 at 12.27.06 AM.png

It’s interesting to note the function parameters, as they are the data whose particular bits we investigated earlier.

tt_update():

Screen Shot 2018-01-08 at 12.37.50 AM.png

As we can see above, it gets a combined hash value for the transposition table, then gets the hash entries and nodes, and then searches for the newest and deepest nodes in the hash table.

If you recall from above, inside the engine directory where our cache and hash files are located there is also an object file called cache.o. This is great to see, as now we can use an old and trusted command to look under the hood at the assembly code for what is happening on our AArch64 server. Using objdump -d, let’s take a look at the instructions…

Screen Shot 2018-01-08 at 1.12.46 AM.png

tt_get():

Screen Shot 2018-01-08 at 12.48.19 AM.png

From the above we can see that there is indeed no vectorization taking place for any of the hash table code. This makes me wonder if there may be some hope after all that we can use the table vector lookup instructions to improve the instruction mix and function calls. I am also curious about the optimization level used at compile time and whether vectorization was even attempted at all. Nonetheless, we can notice some un-vectorized instructions such as adrp, ldr, mov, add, lsl, and the exclusive-or operation eor being used. Also, if you recall from the function, there are three things taking place here: getting the combined hash value, getting the correct entry and node, and returning the data.

tt_update():

Screen Shot 2018-01-08 at 1.10.40 AM.png

Screen Shot 2018-01-08 at 1.11.03 AM.png

Here we are getting a combined hash value, getting the hash entry and nodes, and checking whether we found an already-existing node (newest/deepest); if newest is deeper than deepest, we switch them, otherwise we replace the newest hash value. We can also notice that no vectorization or TBL instructions are taking place, which we may hopefully be able to change at compile time for better search efficiency on the platform.

I would also like to look at the hash.c file to examine another important part of the Zobrist hash algorithm. If we can take advantage of compile-time optimization, maybe there are some tweaks to be found in this file’s output. This file contains some interesting code: it initially gets a random Hashvalue in which all the bits are used, then fills an array with random numbers for Zobrist hashing, and then goes on to initialize the board hash system…

Screen Shot 2018-01-08 at 2.27.49 AM.png

There are two noticeable hashing functions in this file. The first is hashdata_recalc(), which calculates the hash value from scratch using exclusive-or operations…

Screen Shot 2018-01-08 at 2.42.37 AM.png

The other is a function that calculates a transformation-invariant hash value, called hashdata_calc_orientation_invariant(). An invariant in mathematics is a property that remains unchanged when transformations of a certain type are applied to the objects. You can notice the use of exclusive-or operations here as well…

Screen Shot 2018-01-08 at 2.48.12 AM.png

Now let’s take a look at the objdump -d output for these two functions from the hash.o object file…

<hashdata_recalc>:

Screen Shot 2018-01-08 at 2.56.47 AM.png

<hashdata_calc_orientation_invariant>:

Screen Shot 2018-01-08 at 2.58.25 AM.png

What stands out, aside from the absence of vectorization, is all the branching going on. I am assuming this is due to the exclusive-or operations taking place inside the hash functions.

Now that we’ve learned what’s going on with the implementation of the Zobrist hash algorithm and the use of transposition tables, let’s move on to look at how these files were compiled originally.

To figure out how the engine was compiled, I first ran make clean and make to see what was happening during the build process (I will look inside the Makefile next)…

Screen Shot 2018-01-08 at 1.41.07 PM.png

original compilation of cache.o:
Screen Shot 2018-01-08 at 1.44.04 PM.png

original compilation of hash.o:

Screen Shot 2018-01-08 at 1.45.54 PM

From the above we can see a compile-time flag of -O2 on the cache and hash object files. Now let’s look into the Makefile to see what exactly is going on…

Based on the compiler output during the make above, I can see that the -O2 flag lives in a variable called CFLAGS…

Screen Shot 2018-01-08 at 2.08.49 PM.png

We can also notice some other flags being used, as well as the compiler warnings being enabled. For our purpose we are concerned with the CFLAGS variable, which we will be changing from -O2 to -O3. According to the GNU make manual this is called overriding: each time we run make we can override the previous CFLAGS value, producing new object files based on that option. After reading more documentation on optimization flags, I came across an interesting one, -march=name, which specifies the name of the target ARM architecture. GCC uses this name to determine what kind of instructions it can emit when generating assembly code. This seems like a perfect candidate to include as a compiler option. We will use -march=native, which causes the compiler to auto-detect the architecture of the build machine. OK, let’s introduce these flags into the Makefile…

Screen Shot 2018-01-08 at 2.28.22 PM
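The override mechanism itself can be sketched with a tiny throwaway Makefile; the file name demo.mk and its contents are purely illustrative, not GNU Go’s build files.

```shell
# A tiny Makefile whose CFLAGS default to -O2, mirroring the game's build.
printf 'CFLAGS = -O2\nshow:\n\t@echo $(CFLAGS)\n' > demo.mk

make -f demo.mk show                             # prints: -O2
make -f demo.mk show CFLAGS="-O3 -march=native"  # command line wins: -O3 -march=native
```

A variable assigned on the make command line overrides the value assigned inside the Makefile, which is why no edit is strictly required to experiment with new flags.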

Now let’s do a make clean and a make again…

After running the commands we can now see that our flags are being used to produce the new object files…

Screen Shot 2018-01-08 at 2.30.51 PM.png

Screen Shot 2018-01-08 at 2.32.02 PM.png

With these optimization flags in place, let’s go back, run objdump on the new object files, and take a look at the functions discussed earlier.

Sample of optimized <tt_update> assembly instructions (located in cache.c):

Screen Shot 2018-01-08 at 2.35.40 PM.png

 

Sample of optimized <hashdata_calc_orientation_invariant> assembly instructions (located in hash.c):

Screen Shot 2018-01-08 at 2.40.04 PM.png

Unfortunately for us, as we can see from the above instructions, there is still no vectorization and no vectorized table lookup instructions being used, even with the updated compiler flags. Out of interest, I wanted to see whether any other files or functions gained some auto-vectorization from the flag updates, so I targeted the montecarlo.o file. In Monte Carlo Go, the engine plays random games to the end, generating moves from a pattern database within the context of the UCT algorithm (upper confidence bounds applied to trees). In the montecarlo.c file I located a function that seemed well suited for these lookup purposes: uct_genmove(), which searches the hash table and plays simulations to calculate the next best potential move on the board. A sample of the function looks like this…

Screen Shot 2018-01-08 at 3.47.22 PM.png

First I compiled using the original optimization flags and looked at the objdump of the object file. I then used the optimization flags -O3 and -march=native and ran make again to produce a new object file. After comparing the disassembly of uct_genmove() at both optimization levels side by side, I noticed a difference: the new flags gave us vectorization!

Here is a comparison from uct_genmove():

We will use the command objdump -dS to see the source code alongside the disassembly output. At the line in the function that reads

gg_assert(tree.hashtable_even);

we notice the first instance of vectorization being utilized. I will show the original un-vectorized output, then the vectorized output with the movi instructions below it…

un-vectorized:

Screen Shot 2018-01-08 at 7.48.10 PM.png

vectorized:

Screen Shot 2018-01-08 at 7.50.10 PM.png

We also notice that (tree.hashtable_odd) is now only one instruction. The lines being vectorized are those searching the nodes for winning moves and tallying the sum of the scores, which previously did not use vector instructions.
The next line I found that now utilizes vector instructions is
mc->move_partition_lists_white[k] = 1;

un-vectorized:

Screen Shot 2018-01-08 at 8.14.59 PM.png

vectorized:

Screen Shot 2018-01-08 at 8.15.35 PM

With the line mc->move_value_sum_white = 0.0; we can notice that it is now a single vectorized move instruction, unlike the add operation it was before…

un-vectorized:

Screen Shot 2018-01-08 at 8.20.10 PM.png

vectorized:

Screen Shot 2018-01-08 at 8.20.52 PM.png

The line starting_position.consecutive_passes = 0; now has a vectorized move instruction, while previously it had none…

un-vectorized:

Screen Shot 2018-01-08 at 8.25.14 PM.png

vectorized:

Screen Shot 2018-01-08 at 8.26.07 PM

Under the line starting_position.settled[pos] = forbidden_moves[pos]; we can see many vectorized instructions being utilized, including many xtn/xtn2 instructions.

un-vectorized:

Screen Shot 2018-01-08 at 8.31.20 PM.png

vectorized:

Screen Shot 2018-01-08 at 8.32.00 PM.png

Wrap-up:

Having successfully vectorized some parts of the game engine that use the Zobrist hashing algorithm, specifically the Monte Carlo simulation function uct_genmove(), it would be interesting to see how this improves lookup speed when many games are simulated at a time. I wish I had focused on the Monte Carlo simulations more closely a lot sooner, as there are many AI fundamentals being relied upon there. I believe that if I had spent time searching for a shorter program with a clearer hashing algorithm, I would have been able to produce much more useful test output that may have been easier to act on in the end. I was intrigued by the game of Go originally, and this project has intrigued me even more; I believe I will continue testing the game’s source files in the future. If there is one thing I wish I had done to make the overall logic of the program easier to follow, it would be to understand the game more thoroughly, as it is quite complicated for someone with zero prior knowledge outside of Chess. I will end the post here as it is rather long. I hope you enjoyed the read!

 

***UPDATE***

After looking into how to do any testing relevant to our previous optimizations, I came across the only option I could find. Under the game’s debugging command-line options there is one that prints statistics about the game’s moves relevant to our Monte Carlo optimizations. Unfortunately it does not have a lookup-time option, although it does display the depth of node searches it makes in the table lookup for its next move. Now let’s see if there are any changes at all affecting this output…

We will be using the command line option:

./gnugo --mc-games-per-level 10 -S

This sets the default number of Monte Carlo simulations used to consider a move. At level 10, GNU Go simulates 80,000 games in order to generate a move. Cool, let’s see the results we receive with a couple of test moves…

Without optimization simulation results:

A bit clearer look at the lookup results:

For the Monte Carlo simulation to make its best predicted move, it searched over 16,172 nodes before deciding to move to Q3. I then made a move to J10, and this was the result of the simulation’s next best decision, to move to D4…

Let’s see if the optimized Monte Carlo file generated any different output for the nodes-searched results…

Exactly the same as the previous un-vectorized version of searching through the nodes after I made the initial move to D15. Let’s try my next move of J10 again to see if it’s exactly the same as well after the simulation (of interest: it moves to Q4 this time instead of Q3)…

For the second simulated decision of the game, based on the same previous moves, it searched around two thousand fewer nodes and generated a move to C3 instead of D4. It seems there may be a minimal lookup-speed effect in nodes searched, if anything.

This may be related to what we noticed during the vectorization of the function that generates the Monte Carlo simulations next move …

Here we can recall that there are vectorized move instructions for node->wins and node->sum_scores used in the function. Based on this, the vectorization seems to have had a minimal effect so far on searching through the nodes in the table. I would be curious to play a whole game with tough decisions to see if there is more of a deviation in results. I also wish there were a way to time the lookup speed in the developer debugging tool options.

I attempted one other test, located in the regression-testing directory. This basically makes sure we haven’t introduced a bug in our changed program and that it outputs the same results, although here I’m more curious about execution time. I also made sure to check whether anyone else was using the server, which could have had a side effect on these results; luckily I was the only one on.

Running the command with no vector instructions in the game engine:

time ./test.sh blunder.tst

1st run:

I do this two more times and get these results:

2nd run:

3rd run:

Let’s see if there are any changes after compiling the engine to utilize vectorization:

1st run:

2nd run:

3rd run:

As we can see, there was a small improvement in the milliseconds here: a bit over 12.8 for the unoptimized version and a bit over 12.5 for the optimized. I will be curious to keep playing around with the game files, as it is a really interesting game.


Project (Stage 1…)

After reading more of the GNU Go documentation, I found a possible way to speed up the game’s overall hash performance. By default, GNU Go allocates an 8-megabyte cache in RAM for its internal use. The cache is used to store intermediate results during its analysis of the position. Increasing the cache size may give a modest speed improvement to the overall hash speed. The game stores the results of its reading calculations in a hash table. If the hash table fills up, it is emptied and the reading continues, but some reading may have to be repeated.

According to the Arm programmer’s guide a cache is “a small, fast block of memory that sits between the core and main memory. It holds copies of items in main memory. Accesses to the cache memory occur significantly faster than those to main memory. Whenever the core reads or writes a particular address, it first looks for it in the cache. If it finds the address in the cache, it uses the data in the cache, rather than performing an access to main memory. This significantly increases the potential performance of the system, by reducing the effect of slow external memory access times. It also reduces the power consumption of the system, by avoiding the need to drive external signals.” With this in mind if we have more room in the cache to hold the games position analysis we may see a performance boost from the default level.

In a von Neumann architecture, a single cache is used for instructions and data (a unified cache). A modified Harvard architecture has separate instruction and data buses, which means there are two caches: an instruction cache (I-cache) and a data cache (D-cache). In ARMv8 processors, there are distinct instruction and data L1 caches backed by a unified L2 cache. Since our Aarchie system has a lot of RAM and can handle a larger cache, we will be able to set a custom cache size via the ./configure script. This will look something like:

./configure --enable-cache-size=48

Project (Stage 1)

For the SPO600 project we are to find an open-source software package that includes a hash function (MurmurHash, SHA, etc.) implemented in a language that compiles to machine code (C, C++, or Assembler). From there we will benchmark performance on AArch64 systems and then identify one or several potential optimization methods that we will attempt to implement in Stage 2 of the project.

After spending some time searching, to not much avail, for a strong hash-function candidate, I reverted back to a program I looked into at the start of the semester. My blog post from a couple of months back focused on building an open-source software package, the gnugo game, and after looking into its source code again I found that the game engine uses a hash function. This is what I will be looking into further from now on, unless I find a stronger candidate that may be easier to test optimizations with.

The gnugo game engine uses a hashing method called Zobrist hashing, which is used in computer programs that play board games such as Chess and Go to implement transposition tables: special hash tables indexed by board position, used to avoid analyzing the same position more than once.

Here is a look at how the code is implemented within the hash.c file of the game engine…

***Zobrist hashing starts by randomly generating bit strings for each possible element of a board game***

Screen Shot 2017-12-16 at 11.25.23 PM

And here we can see how it calculates a transformation invariant hash value…

Screen Shot 2017-12-16 at 11.37.30 PM.png

For those curious to look more into the code of the game itself it can be found here.

At the moment I will not be running any performance/optimization tests as I am still pondering the most appropriate way to go about this for efficient results.

One method to optimize the current implementation for AArch64 systems, on the advice of my professor, is to possibly use table lookup instructions (TBL and TBX) to accelerate some parts of this hash. These instructions look up single-byte values in the range of 0-127 in a table, fetching the results for up to 16 input values with a single instruction (0-63) or two instructions (0-127).

Over the course of the next week I will be looking more into this hashing algorithm and trying to isolate its performance on the AArch64 systems. Stay tuned!

Autovectorization in GCC

Today we will take a plunge off the deep end and see what vectorization is all about. Imagine being able to speed up a program you created many times over; who would not want that? If this is an option you have at your disposal, you would be quite foolish not to take full advantage of the potential performance improvements… and this is where vectorization walks into your life.

Vectorization can be thought of as an optimization technique that applies loop unrolling to your code and utilizes the SIMD (Single Instruction, Multiple Data) functionality of your processor. If you’re more mathematically inclined, this refers to the code being converted from a scalar implementation (processing single operands per cycle) to a vector implementation (processing multiple pairs of operands per cycle). When this occurs automagically by the compiler, it is called autovectorization.

An example of a simple vectorized loop may look something like this:

non-vectorized loop:

Screen Shot 2017-10-15 at 9.50.58 PM.png

vectorized loop:

Screen Shot 2017-10-15 at 9.51.04 PM.png

This displays the program rewritten to use vector operations on an entire array, as opposed to operations on individual array elements.

An analogy to help envision why you would be crazy not to take some time to apply vectorization to your code is how we use highway lanes in daily life. If we have a highway consisting of four lanes yet a thousand cars sit in just one lane, that is inefficient use of the lanes (a scalar implementation) and slows down traffic, as opposed to utilizing every lane (vectorization) to speed up the flow of traffic (data). This is how it works when we leave unused space (code not vectorized) in our SIMD registers that could be filled with more data elements. If we have a 128-bit SIMD register (V0-V31) and only use it for one 32-bit integer, we are wasting space that could hold three additional integers.

A great article that discusses many of the loops that can be vectorized and drawbacks to vectorization (plus much more) can be found here. If you’d rather sit back and watch a great lecture on these capabilities, James Reinders (who I stole the traffic analogy from) of Intel gives an enlightening one.

So, that all sounds great, but how can I tell the compiler to vectorize my code? The flag -ftree-vectorize can be used, and vectorization is also turned on by default at the -O3 optimization level. If we want information on which loops were vectorized and which were not, we can use the flag -ftree-vectorizer-verbose. For the geeks who are really curious, you can find a list of all SIMD vector instructions here.

Now let’s get to an example using our aarchie (AArch64) system…

We are going to write a program that creates two 1000-element integer arrays and fills them with random numbers in the range -1000 to +1000, then sums those two arrays element-by-element to a third array, and finally sums the third array and prints the result.

After we do this, we will use our old trusted objdump -d command to see if the compiler is autovectorizing our code for us. And for the sake of testing, let’s try a couple of different versions of the code.

Let’s get to it…

For the first version I have implemented the instructions in a single loop…

Screen Shot 2017-10-20 at 8.44.09 PM.png

Now we will compile using gcc -O3 -o vector1 vector1.c 

I will post the <main> segment of the disassembly for us to see if any vectorization has occurred (along with my comments in green)…

Screen Shot 2017-10-20 at 9.36.16 PM.png

Although we used the -O3 flag, which automatically triggers the compiler to vectorize when possible, we can see from the above that the code has not been vectorized. This may be because the compiler favours a simple loop for vectorization, as opposed to one that attempts too many operations at once.

Let’s see if we can learn any more useful info before we change our source code…

Hmmm, it appears that after trying -ftree-vectorizer-verbose at various levels there was no output to display. Checking some suggestions, I tried using -ftree-vectorize -maltivec -ftree-vectorizer-verbose, which gave me the same result… nothing to display. This isn’t good news if we want the compiler to tell us what’s wrong with our loop that prevents it from being vectorized…

After some more searching I found a flag that seems to work and provides us with some information as to why this loop didn’t get vectorized…

-ftree-vectorize -fopt-info-vec-all

Screen Shot 2017-10-20 at 9.49.08 PM.png

Here we can see that it complains “loop contains function calls or data references that cannot be analyzed”.

Let’s give it another try with a tweak in our code, this time not encapsulating all the instructions in a single loop.

Screen Shot 2017-10-20 at 10.30.43 PM.png

and the objdump

1stvec

2ndvec.png

It looks like we have some partial vectorization happening. This is much better than no vectorization at all, but let’s check what is happening from the compiler’s perspective with -ftree-vectorize -fopt-info-vec-all…

Whoa… there’s tons of information pertaining to many instructions within the code, way too much to put up here. I will post the most significant piece of information, though, which shows us which loop was vectorized…

Screen Shot 2017-10-21 at 1.28.46 AM.png

As we can see, the second for loop was vectorized but not the first. Let’s see if we can dig deeper with -ftree-vectorize -fopt-info-vec-missed…

Screen Shot 2017-10-21 at 1.30.27 AM.png

It appears the calls to rand() may be to blame for the initial loop not vectorizing properly. rand() is an external function call with hidden internal state, so the compiler cannot analyze it or prove the loop iterations are independent, which rules out vectorizing that loop. As you can see, there are also some “misalign” notes among others.

After playing around with some suggestions for leading the compiler towards further vectorizing the code, with __builtin_assume_aligned and more simplification of the loops, nothing would vectorize beyond the previous example. I broke the work into three separate for loops: two loops calling rand() to load the arrays, and a final one calculating the sum. When I looked at the missed vectorization opportunities, I learned that the last for loop (the sum calculation) vectorized properly, while the two loops with the call to rand() did not. This further supports my theory that the random number generator is to blame for the code not being vectorized, as does the statement…

Screen Shot 2017-10-21 at 1.58.27 AM.png

Although the path to vectorization may seem complicated and painstaking, the opportunities for performance gains make it an admirable goal. Personally, the most tedious task was looking up the many different assembly instructions to see what they were doing at the machine level, as this is not very human-intuitive code. Apart from that, vectorization can lead you down a path of continuous knowledge and learning, an endless rabbit hole it seems… which is something I like getting lost in from time to time. Vectorization will be on my list of concepts to spend further time learning, and will hopefully lead me to be able to utilize it in my own code soon.

Build and Testing glibc

For our next build process, we are going to test out the GNU C Library (glibc), which provides the core libraries for the GNU system and many other systems that use Linux as the kernel.

First we must find and install the source code on our system. A quick glance at their website informs me the latest release is version 2.26 (2017-08-02), so this is what we will download now…

As usual, with our wget and tar commands…

wget:

tar.png

Extraction complete! As with our last open-source build (gnugo), this one also has an INSTALL file, so let’s take a look at it and see what we have to do to build this beast…

INSTALL.png

A bit more confusing than building the gnugo source, as it talks about configuring into different directory paths…

First we must make a new directory entitled “glibc-build” at the same directory level as where we downloaded the source files. Once that is complete, from our new build directory we can issue a configure command using the mandatory --prefix option with our current build directory appended. Finally, once this is done, we can issue a make command to build the source. Let’s give it a try now…

configure.png

Seems as though everything has configured properly; on to the make!…

make.png

After a long 14-minute process, it finally finished…

After another lengthy stretch of wondering why I couldn’t locate the .c files and could only see .o files, I realized I had to navigate back into the directory of the newly downloaded version.

We are now asked to test the library that we have built by introducing a bug into the behaviour of a function. After browsing many functions, it seems the simplest one to test is rand(), found in rand.c in the stdlib folder. So let's make a simple program that uses the function…

random program.png
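A minimal sketch of such a program (the function name is my own):

```c
#include <stdio.h>
#include <stdlib.h>

/* Print count pseudo-random numbers from rand(), one per line. */
void print_randoms(int count) {
    for (int i = 0; i < count; i++)
        printf("%d\n", rand());
}
```

Called from a small main as print_randoms(10), it prints the ten numbers shown below.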

This will generate ten random numbers as follows…

expected rand output

Notice the use of the testrun.sh script provided to us in our glibc-build directory, which makes our program use our custom-built library functions. Let's now introduce a small bug in the rand file…

rand bug.png
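The change itself is tiny; here is a sketch of the kind of bug introduced (glibc's real rand() simply delegates to its internal generator, so the details are illustrative):

```c
/* stdlib/rand.c (sketch): instead of delegating to the generator,
   always return 1 -- every "random" number is now 1. */
int rand(void) {
    return 1;
}
```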

Now we will have to reconfigure and make our library…

The build took much less time this go around thankfully (sub 60 seconds).

Let's see what happens when we run our random program again…

rand bug output.png

Our bug has worked! It only prints the number “1”!

For now, it seems there are many functions in this library that may be fun to explore: implementing even more bugs, or possibly even finding some real ones to fix!

 

Building an Open Source Software Package

Today we’re going to be building an open source software package from the Free Software Foundation’s GNU Project. How exciting!! Now let's pick a package to build…

Initially, I’m very surprised by the vast number of packages offered to developers, a total of 395 by my count, although one package in particular caught my eye as I navigated the list. I have been a lifelong fan and player of the game of Chess, and had never really looked into the game of Go, which seemed to have some similarities, yet there the package was … labelled “gnugo”. The game caught mainstream attention this year when Google’s AlphaGo AI won three matches against the world’s best Go player.

With the selection made, let's download the source code for the software (we will not be installing it on the system)…

We will use wget to download the file and then use tar to extract it…
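In command form this looks something like the following (a sketch, assuming gnugo 3.8 was the release available at the time):

```shell
# Download the gnugo source archive from the GNU mirror and extract it
wget https://ftp.gnu.org/gnu/gnugo/gnugo-3.8.tar.gz
tar -xzf gnugo-3.8.tar.gz
cd gnugo-3.8
```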

wget.png

Seems like it worked! Now let's look into the INSTALL file and see the recommended path for building this package…

install.png

We will first need to run the “configure” command to prepare this package for building…

configure

Now for the “make“…

make

For some deeper learning, you can read more about these commands and what they do here.

Since everything seemed to work, let's try running the “make check” command…

make check.png

After approximately two minutes of waiting for the command to finish, it's interesting to note that it goes into all the directories and reports back that there “is nothing to be done”. Let's continue…

We have now downloaded and built the package and it is time to do a test run to make sure we can actually use this program…

For gnugo, the developers recommend running it with a graphical user interface called CGoban. However, it is possible to run the game using the ASCII interface, and that is the route we will take…

After looking into the directory structure I found the executable in the interface folder…

gnugo

Sweet!! We have the game up and running on the system! Let's see what happens if we attempt to make a move…

moves.png

Now that is cool! I'm also fond of the “GNU Go is thinking. . .” string, which makes it seem more AI-like. After it has selected a move, we are asked to make one in return. Since I do not know Go strategy at all, I will stop here and read some documentation on gameplay strategies so that I can have an epic encounter with this program in the future!

As a final aside, it was also very interesting to be able to go inside the game's engine directory and look at all the C files…

engine.png

an example from “dragon.c“…

dragon.png

Pretty sweet! This has left me very excited to learn more about building open source software packages, and hopefully to contribute to some in the near future.

Assembler Lab

For the purposes of this endeavor we will be building off of the last Compiled C Lab post. This time, instead of just using compiler options to look at changes in the assembly code, we will actually be writing an assembly program on two different architectures! “Cool” (only the geeks say)! Let's get going…

Before we begin writing our program, we were provided three different versions of the same “Hello World” C program to inspect on both architectures.

The 3 versions of “Hello World” and their objdump <main> output on x86_64:

printf() version:

Screen Shot 2017-09-29 at 10.47.54 PM

Screen Shot 2017-09-29 at 10.58.44 PM

write() version:

Screen Shot 2017-09-29 at 10.53.14 PM

Screen Shot 2017-09-29 at 11.00.13 PM

syscall() version:

Screen Shot 2017-09-29 at 10.54.39 PM

Screen Shot 2017-09-29 at 11.01.06 PM

Example of a partial view of objdump -d on the syscall() version (notice the total amount of code):

Screen Shot 2017-09-29 at 11.05.27 PM.png

Now we will look at two versions of “Hello World” written in different assembly syntax and their corresponding objdump -d output…

GNU AS(GAS) syntax (AT&T style):

Screen Shot 2017-09-29 at 11.39.51 PM

This is the full output of the objdump (notice how small it is compared to its C counterpart):

Screen Shot 2017-09-29 at 11.45.39 PM.png

NASM syntax (Intel style):

Screen Shot 2017-09-29 at 11.41.49 PM.png

Again, notice how small the output is compared to the C version…

Screen Shot 2017-09-29 at 11.47.48 PM.png

A very interesting discussion and comparison of the two Linux assemblers can be found here.

Now let's look at the three different versions of the “Hello World” C program and their objdump -d output on aarch64…

/*
It is interesting to note how both architectures handle the program instructions.
Although they appear quite similar we can notice variations in how they interact with their registers (which we learned about in the last post!).
*/

printf() version:

Screen Shot 2017-09-29 at 11.56.23 PM.png

write() version:

Screen Shot 2017-09-29 at 11.57.26 PM.png

syscall() version:

Screen Shot 2017-09-29 at 11.58.18 PM.png

Now let's inspect a “Hello World” assembly program and its objdump -d output on aarch64…

Screen Shot 2017-09-30 at 12.11.46 AM.png

Very similar to its x86_64 counterpart, although we can notice a slight variation in how they interact with their registers.

Now it's time to implement an assembly program with a loop that prints the values 0-9, as follows:

Screen Shot 2017-09-30 at 12.45.43 AM.png

Wait! This may not be as easy as it seems! We will have to print the word “loop” each time it loops, with its index value beside it. To print the index value we must convert an integer to a digit character. For those curious, you can refer to the manpage for ascii on how to properly gather the values.
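The conversion trick is easier to see in a higher-level language first: the ASCII codes for '0' through '9' are contiguous (48-57), so adding the integer to the character '0' yields the matching digit. A sketch in C (the function name is my own):

```c
/* Convert a single-digit integer (0-9) to its ASCII character.
   '0' is 48 in ASCII, so e.g. 5 becomes '5' (code 53). */
char digit_to_ascii(int value) {
    return (char)('0' + value);
}
```

In the assembly this is just an add of the constant '0' (48) to the loop index before printing.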

An example using x86_64:

Screen Shot 2017-09-30 at 12.34.14 AM

A bit more confusing than a simple loop in C I’d say.

We are now asked to extend the code to loop from 00-30, then to suppress the high digit when it is 0, in the end printing the values 0-30. Let's give it a try…

Wait! This seems tricky… and it might be even trickier than it seems!

We will need to take the loop index and convert it to a 2-digit decimal number by dividing by 10. To perform this operation we will use the div instruction, which takes the dividend from rax and the divisor from the register supplied as an argument. The quotient is placed in rax and the remainder in rdx.
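In C terms, the div step and the leading-zero suppression amount to the following sketch (the function and buffer are my own; the assembly does the same work with rax and rdx):

```c
/* Format an index 0-30 as decimal into buf, suppressing a leading zero.
   Returns buf for convenience. */
char *format_index(int i, char buf[3]) {
    int quotient  = i / 10;   /* in the assembly: div leaves this in rax */
    int remainder = i % 10;   /* ...and the remainder in rdx */
    char *p = buf;
    if (quotient != 0)
        *p++ = (char)('0' + quotient);  /* high digit, only when non-zero */
    *p++ = (char)('0' + remainder);     /* low digit, always printed */
    *p = '\0';
    return buf;
}
```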

I initially ran into a couple of errors and had to spend a bit of time on Google figuring out what was happening…

Screen Shot 2017-09-30 at 1.51.06 AM.png

It seems we were using gcc to link, which by default adds the C libraries; these expect a main and already contain a _start that invokes main…
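One common way around this is to skip gcc and invoke the assembler and linker directly, since our source defines its own _start (file names here are illustrative):

```shell
# Assemble the GAS source, then link with ld directly --
# no C runtime is pulled in, so our _start is the entry point
as -o loop.o loop.s
ld -o loop loop.o
```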

So I attempted a suggested fix…

Screen Shot 2017-09-30 at 1.51.50 AM.png

Interesting error log, although I can see that adding -m32 makes it a 32-bit executable, which won't work on this architecture. So, alas, I attempted to compile without the -m32 flag…

Screen Shot 2017-09-30 at 1.52.03 AM.png

The output…

Screen Shot 2017-09-30 at 1.58.47 AM.png

Sweet! It works!

Link to the x86_64 code.

We were then asked to implement the same loop on aarch64. The source code can be found here.

Overall, I found that writing and debugging in assembly is far more difficult to initially grasp than a high-level programming language such as C (especially as a newcomer to assembly and to how these specific architectures handle their registers). I believe this is due to the rather unintuitive nature of programming the instructions and interacting directly with the underlying CPU registers. Although it may seem a bit discouraging at first, given the unfamiliar instructions and mnemonics involved, I can see the value in learning it. With assembly we can access any memory location, control the generated machine code more precisely, and manipulate bits more easily than in high-level languages. Comparing the two architectures: although aarch64 feels like it has a simpler instruction set, if I were forced to choose a personal preference, I'd side with x86_64 for now. They are very similar, but I do prefer the explicit register names such as rbp and the shell-script-like $ prefix for immediate values. I will spend more time learning about assembly language and these architectures, and hopefully reach a greater understanding of how everything interacts in the near future.

For deeper learning, refer to these great references:

x86_64

aarch64

And if you are still feeling a bit discouraged with assembly, here is a good discussion as to why it is still important today!