Josef Söntgen

Experimental porting of the etnaviv Mesa driver


In this series of posts I am going to elaborate on porting the etnaviv driver to Genode and what this effort entailed. This second post is about dealing with the Mesa driver and briefly skims over the process.

Introduction

(Disclaimer: I only touch upon topics like Gallium vs. the old DRI architecture and DRI2 vs. DRI3 and in general simplify things greatly, so I beg your pardon for one or the other technical inaccuracy.)

In the last post we talked about porting the Linux etnaviv DRM driver to Genode. At the end we were left with a working Drm session. Now that in itself is merely the first step towards something more useful, in particular running 3D-accelerated applications on Genode. As the driver is only one side of the coin, we need the other side as well, namely the piece of software that translates the commands issued by the application into something the driver, and by extension the device itself, understands.

Long story short, the most prominent open-source implementation of such a piece of software is the Mesa 3D graphics library. On the application-facing side it provides, among other things, implementations of various rendering APIs (OpenGL, Vulkan, …) and on the driver- or rather back-end-facing side it allows, among other things, for utilizing the DRM API - nice, that's our way in. In between sits the state tracker that manages the overall interaction between the rendering API and the device. Additionally there is the code that integrates the library into the surrounding environment - read: the OS.
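
To put the application-facing side into perspective, the following is a minimal sketch of what a client of this stack looks like (standard EGL/OpenGL ES boilerplate; how the native window handle is obtained is platform-specific and elided here). Everything triggered below these calls - state tracking, command-stream generation, buffer handling - is what Mesa and the DRM driver take care of.

 /*
  * Minimal EGL/OpenGL ES client sketch - obtaining the native window
  * handle is platform-specific and elided here.
  */
 #include <EGL/egl.h>
 #include <GLES2/gl2.h>

 int main(void)
 {
    EGLDisplay dpy = eglGetDisplay(EGL_DEFAULT_DISPLAY);
    eglInitialize(dpy, NULL, NULL);

    static EGLint const cfg_attrs[] = {
       EGL_SURFACE_TYPE,    EGL_WINDOW_BIT,
       EGL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT,
       EGL_NONE };
    EGLConfig cfg;
    EGLint    num_cfg;
    eglChooseConfig(dpy, cfg_attrs, &cfg, 1, &num_cfg);

    eglBindAPI(EGL_OPENGL_ES_API);
    static EGLint const ctx_attrs[] = { EGL_CONTEXT_CLIENT_VERSION, 2, EGL_NONE };
    EGLContext ctx = eglCreateContext(dpy, cfg, EGL_NO_CONTEXT, ctx_attrs);

    EGLNativeWindowType win = 0; /* platform-specific window handle, elided */
    EGLSurface surf = eglCreateWindowSurface(dpy, cfg, win, NULL);
    eglMakeCurrent(dpy, surf, surf, ctx);

    /* rendering-API calls are turned into GPU command streams by Mesa ... */
    glClearColor(0.2f, 0.3f, 0.4f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT);

    /* ... which reach the device via the DRM back end on buffer swap */
    eglSwapBuffers(dpy, surf);
    return 0;
 }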

Architecture

As we had already fooled around with Mesa in the past, we were off to a smooth start. Since the currently used Mesa 11.2 version was too old, we updated it to version 21.0.0. One observation was that the amount of code that needs to be generated during the build of the library has increased substantially.

For that we used the usual approach: generate the files once, put them into their own repository, and check them out during port preparation. The generate commands are part of the .port file but will only be executed when explicitly called for, e.g.:

 ./tool/ports/prepare_port mesa-21 GENERATE_FILES=1

For one, this has the benefit that not everybody who prepares the port has to perform the generation; for another, incorporating the files into the hash sum of the port leads to a stable result. (Different host systems with different run-time versions might lead to differently generated files.)

Sebastian already added most of the files when enabling the softpipe driver and fortunately etnaviv did not require that many more files.

With the needed source files in place, we went on to enable the first driver: swrast using the softpipe back end. Now, in contrast to the current Mesa-11 and i965 port, we were back to using gallium. Briefly speaking, gallium modularizes the inner workings of Mesa and loosens the dependency of the upper layers of the Mesa library on the actual device driver. Enabling support for a new device boils down to implementing the proper (back end) pipe interface only.
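
To give a rough idea of what implementing that pipe interface means, here is a heavily simplified, hypothetical rendition of the kind of hooks a gallium driver fills in - the real interface lives in Mesa's p_screen.h and p_context.h and is considerably larger:

 /*
  * Heavily simplified, hypothetical rendition of the gallium pipe interface -
  * the real structs in Mesa's p_screen.h/p_context.h contain many more hooks.
  */
 struct pipe_context_sketch {
    void (*draw_vbo)(void *vertex_buffers, unsigned count);
    void (*flush)(void);
    void (*destroy)(void);
 };

 struct pipe_screen_sketch {
    char const *(*get_name)(void);
    int         (*get_param)(int capability);
    void       *(*resource_create)(unsigned width, unsigned height, unsigned format);
    struct pipe_context_sketch *(*context_create)(void);
 };

 /*
  * A driver like etnaviv essentially provides these entry points and wires
  * them up to its command-stream builder and the DRM ioctl interface.
  */
 struct pipe_screen_sketch *etnaviv_screen_create_sketch(int drm_fd);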

In our initial Mesa port in 2010, gallium was also used with the 'i915' driver, but as Intel kept using the DRI interface afterwards (used by the i965 driver), we removed the gallium-based drivers. Now we are back again - the circle is closed.

One way to integrate a rendering API is by using EGL. Again, that is old news: we already added Genode as a platform to the EGL library, with dri2_initialize_genode as its init function. The application is linked against the egl library, which in turn will eventually load the proper back end, the so-called egl_drv library, via dlopen("egl_drv.lib.so"). We use the routing mechanism to select whether to load the software renderer egl_swrast.lib.so or egl_i965.lib.so:

 […]
 <route>
  <service name="ROM" label="egl_drv.lib.so">
    <parent label="egl_swrast.lib.so"/>
  </service>
  […]
 </route>
 […]
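
On the code level, loading the back end then boils down to something along the following lines - a condensed sketch, not the literal Mesa code; the subsequent symbol lookup and error handling are abbreviated:

 /* condensed sketch of the back-end loading - not the literal Mesa code */
 #include <dlfcn.h>

 void *load_egl_driver(void)
 {
    /*
     * Which library the ROM request for "egl_drv.lib.so" resolves to is
     * decided solely by the ROM route configured for the component.
     */
    void *handle = dlopen("egl_drv.lib.so", RTLD_NOW | RTLD_GLOBAL);
    if (!handle)
       return NULL;

    /*
     * Afterwards the DRI driver's extension table is looked up via dlsym()
     * and handed to the DRI2 back end.
     */
    return handle;
 }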

The various EGL driver libraries contain, as the name suggests, the driver code. In this case the drivers implement the DRI API. The rendering API, on the other hand, is provided by the mesa.lib.so library. At the moment we are mostly interested in OpenGL as the rendering API, where we act according to the supported version of a given device, i.e., OpenGL vs. OpenGL ES.

So we basically ended up with the following way of integration:

  1. The application is linked directly against egl and mesa

  2. The EGL DRI2 back end loads egl_drv.lib.so

  3. We patched Mesa to load the driver again via its loader mechanism rather than trying to load, e.g., i965_dri.so

  4. Rendering API calls will end up in mesa as the various glFunc stubs are resolved already.

With the new Mesa, we could re-use some of the EGL platform related code and tried to keep the integration the same. (From the outside a gallium-based driver also plugs into the DRI driver infrastructure.)

Interfacing changes

Now, when it comes to etnaviv, the first prominent difference between it and the old i965 driver is the notion of render nodes. Since we took some liberties in how we access the front buffer, or rather the render target, we had to adapt the code.

We do not share the buffer with the GUI server but map the back buffer in the EGL back end directly and copy the pixels into a buffer provided by the Gui session, i.e., we employ software blitting. To access the front buffer from the CPU in a linear fashion, we resorted to an i965-internal function that maps the corresponding buffer object and de-tiles the memory in software.
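
Conceptually, the software blit is nothing more than a stride-aware row-by-row copy from the CPU-mapped back buffer into the buffer provided by the Gui session - a simplified sketch, ignoring pixel-format conversion and clipping:

 /*
  * Simplified sketch of the software blit from the CPU-mapped back buffer
  * into the Gui-session buffer - pixel-format handling and clipping omitted.
  */
 #include <stdint.h>
 #include <string.h>

 void blit_back_buffer(uint8_t const *src, unsigned src_stride,
                       uint8_t       *dst, unsigned dst_stride,
                       unsigned width, unsigned height, unsigned bytes_per_pixel)
 {
    unsigned const row_bytes = width * bytes_per_pixel;

    for (unsigned y = 0; y < height; y++)
       memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
 }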

If you recall the previous post, we implemented the Drm session as a slim wrapper around Linux' DRM API. Originally, whenever Mesa called the DRI2 loader extension getBuffersWithFormat() to get access to the back buffer, our own back_bo_to_dri_buffer() function would query the properties of the back buffer. In particular it would acquire the name (or rather handle) by calling

 dri2_dpy->image->queryImage(image, __DRI_IMAGE_ATTRIB_NAME, &name);

To keep the journey short, we eventually ended up with a DRM_IOCTL_GEM_FLINK I/O control request that we passed along to the DRM driver. The driver, however, refused to cooperate, as we apparently were not allowed to perform this particular I/O control on a render node. At this point we gathered information about DRI2 vs. DRI3, DRM primary vs. render nodes, and the DRM PRIME API.

TL;DR: render nodes provide a subset of the functionality of primary nodes and are used to execute unprivileged operations such as (off-screen) rendering. Simply from an API point of view, using render nodes looks quite appealing. After all, they are more in line with how we already treated the Intel GPU in the past.

So instead of querying the name (handle), we used __DRI_IMAGE_ATTRIB_FD.
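
In terms of the DRI image extension, that change merely amounts to querying a different attribute - analogous to the call shown above, the buffer is now referenced by a PRIME file descriptor rather than a GEM name:

 /* query a PRIME file descriptor for the buffer instead of a GEM name */
 int fd = -1;
 dri2_dpy->image->queryImage(image, __DRI_IMAGE_ATTRIB_FD, &fd);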

Now, let us just say that following Mesa around on how to cope with DRI3 was not so much a journey as an odyssey, fueled by ignorance and one or the other wrongly pointed (no pun intended) function pointer. We enabled some debug flags - we will come back to that in a bit - to get a more thorough picture of how things are supposed to interact. (The comparison to Linux is somewhat moot in that regard, as there, depending on the platform, the back buffer object is allocated by another entity, e.g., the display via DRM-KMS.)

In between, when running eglgears, EGL would successfully create a window, and it would happily keep on drawing the gears. At least that is what we assumed - the displayed buffer was purely black. Looking at the list of buffer objects, armed with the information from the debug messages, we could easily identify the render related buffer object(s).

Easily as in “our front buffer is 600 x 600 x 32-bit which equals 1440000 bytes and there is a BO which is 1835008 bytes large - close enough”. Simple math should count for something, right?

As a hack we added a Gui session directly to the etnaviv driver component. Whenever the driver would get a completion interrupt, we would dump the BO in question.

Nope, still no gears in sight.

Maybe simple math was wrong after all? Besides the obvious choice there were more buffer objects that could be render targets. Eventually we found a buffer object that contained a bunch of red pixels that - if you squint enough - could resemble a gear. That, however, was to be expected as the GPU probably would not store intermediate render results in a linear fashion.

Okay, apparently the GPU does its job, but for some reason we just do not get the result back. Following the etnaviv gallium driver, you will see that it instructs the GPU to copy its results from the intermediate render buffer to the supplied back buffer if certain constraints are upheld. Setting some debug flag upset those constraints. Yikes.

Needless to say, after some minor fiddling the gears were now displayed in all their glory.

On a side note, it is nice that this GPU's use of per-process page tables allows Mesa to manage the GPU virtual addresses on its own, so we do not have to deal with BO relocations anymore. Hopefully that will be similar when we replace i965 with iris.
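
To illustrate what managing the GPU virtual addresses in user space means: with per-process page tables the driver can simply hand out GPU virtual addresses itself - for example with a trivial bump allocator as sketched below (a hypothetical illustration, not the actual Mesa code) - and reference them in the command stream directly, instead of letting the kernel pick addresses and patching relocations afterwards.

 /*
  * Hypothetical sketch of user-space GPU-VA management: the driver assigns
  * the virtual address of a BO itself and can reference it in the command
  * stream directly - no relocation step needed.
  */
 #include <stdint.h>

 static uint64_t next_gpu_va = 0x100000; /* arbitrary start of the VA space */

 /* 'align' must be a power of two */
 uint64_t gpu_va_alloc(uint64_t size, uint64_t align)
 {
    uint64_t const va = (next_gpu_va + align - 1) & ~(align - 1);
    next_gpu_va = va + size;
    return va;
 }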

Nicely rendered scenes by glmark2 but subpar performance

With the old but trusted gears test running fine, we tried something more exciting. As already foreshadowed in the previous post, we turned our attention to glmark2. Making it run with our etnaviv port was not that big of a deal - a welcome situation for a change.

However, the performance was unexpected. To make good on last time's teasing, it indeed revolves around uncached memory as one shortcoming. All buffer objects are allocated by the driver component and, as you recall, there the BOs are backed by NC (non-cached) memory. This includes the back buffer that gets blitted in software.

To illustrate the problem, let's look at our fb_bench test.

First the cached memory throughput:

 [init -> test-fb_bench] TEST 1: byte-wise memcpy from RAM to RAM
 [init -> test-fb_bench] 
 [init -> test-fb_bench] throughput: 646 MiB/sec
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 1 finished
 [init -> test-fb_bench] 
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 2: byte-wise memcpy from RAM to FB
 [init -> test-fb_bench] 
 [init -> test-fb_bench] throughput: 648 MiB/sec
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 2 finished
 [init -> test-fb_bench] 
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 3: copy via blit library from RAM to FB
 [init -> test-fb_bench] 
 [init -> test-fb_bench] throughput: 1377 MiB/sec
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 3 finished
 [init -> test-fb_bench] 
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 4: unaligned copy via blit library from RAM to FB
 [init -> test-fb_bench] 
 [init -> test-fb_bench] throughput: 884 MiB/sec
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 4 finished

and now the uncached one:

 [init -> test-fb_bench] TEST 1: byte-wise memcpy from RAM to RAM
 [init -> test-fb_bench] 
 [init -> test-fb_bench] throughput: 5 MiB/sec
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 1 finished
 [init -> test-fb_bench] 
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 2: byte-wise memcpy from RAM to FB
 [init -> test-fb_bench] 
 [init -> test-fb_bench] throughput: 5 MiB/sec
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 2 finished
 [init -> test-fb_bench] 
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 3: copy via blit library from RAM to FB
 [init -> test-fb_bench] 
 [init -> test-fb_bench] throughput: 44 MiB/sec
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 3 finished
 [init -> test-fb_bench] 
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 4: unaligned copy via blit library from RAM to FB
 [init -> test-fb_bench] 
 [init -> test-fb_bench] throughput: 39 MiB/sec
 [init -> test-fb_bench] 
 [init -> test-fb_bench] TEST 4 finished

In particular, TEST 3 is of interest as it covers the use case we have in the graphics stack. Again, turning to our friend simple math: at ~2 MiB per back-buffer blit and around 44 MiB/sec of available throughput, we end up at roughly 22 blits per second - which matches the ~20 FPS we get when running glmark2:

 [init -> glmark2] [build] use-vbo=false: FPS: 19 FrameTime: 52.632 ms
 [init -> glmark2] [build] use-vbo=true: FPS: 20 FrameTime: 50.000 ms
 [init -> glmark2] [texture] texture-filter=nearest: FPS: 20 FrameTime: 50.000 ms
 [init -> glmark2] [texture] texture-filter=linear: FPS: 20 FrameTime: 50.000 ms
 [init -> glmark2] [texture] texture-filter=mipmap: FPS: 20 FrameTime: 50.000 ms
 [init -> glmark2] [shading] shading=gouraud: FPS: 20 FrameTime: 50.000 ms
 [init -> glmark2] [shading] shading=blinn-phong-inf: FPS: 20 FrameTime: 50.000 ms
 [init -> glmark2] [shading] shading=phong: FPS: 20 FrameTime: 50.000 ms
 [init -> glmark2] [shading] shading=cel: FPS: 20 FrameTime: 50.000 ms
 [init -> glmark2] [bump] bump-render=high-poly: FPS: 17 FrameTime: 58.824 ms
 [init -> glmark2] [bump] bump-render=normals: FPS: 12 FrameTime: 83.333 ms

Coincidence? Let's find out.

Instead of allocating the backing memory of each BO as uncached memory, we go with cached memory:

 [init -> glmark2] [build] use-vbo=false: FPS: 167 FrameTime: 5.988 ms
 [init -> glmark2] [build] use-vbo=true: FPS: 203 FrameTime: 4.926 ms
 [init -> glmark2] [texture] texture-filter=nearest: FPS: 178 FrameTime: 5.618 ms
 [init -> glmark2] [texture] texture-filter=linear: FPS: 181 FrameTime: 5.525 ms
 [init -> glmark2] [texture] texture-filter=mipmap: FPS: 189 FrameTime: 5.291 ms
 [init -> glmark2] [shading] shading=gouraud: FPS: 181 FrameTime: 5.525 ms
 [init -> glmark2] [shading] shading=blinn-phong-inf: FPS: 170 FrameTime: 5.882 ms
 [init -> glmark2] [shading] shading=phong: FPS: 150 FrameTime: 6.667 ms
 [init -> glmark2] [shading] shading=cel: FPS: 135 FrameTime: 7.407 ms
 [init -> glmark2] [bump] bump-render=high-poly: FPS: 118 FrameTime: 8.475 ms
 [init -> glmark2] [bump] bump-render=normals: FPS: 137 FrameTime: 7.299 ms

That makes a difference indeed. Unfortunately, besides graphical glitches it is not robust:

 [init -> imx8q_gpu_drv] MMU fault status 0x00000001
 [init -> imx8q_gpu_drv] MMU 0 fault addr 0x040c1800

Of course, simply flicking the switch for all BOs, without making sure that cache maintenance is done properly or rather implemented at all, has its consequences.

For good measure, we try something different: revert back to uncached memory and omit the blitting in dri2_genode_etnaviv_put_image():

 [init -> glmark2] [build] use-vbo=false: FPS: 345 FrameTime: 2.899 ms
 [init -> glmark2] [build] use-vbo=true: FPS: 517 FrameTime: 1.934 ms

Okay… and the same again with cached memory:

 [init -> glmark2] [build] use-vbo=false: FPS: 345 FrameTime: 2.899 ms
 [init -> glmark2] [build] use-vbo=true: FPS: 516 FrameTime: 1.938 ms

Interesting - apparently it does not make a difference… (Looking at the flags used by Mesa when allocating BOs, this is somewhat obvious.)

More to come

Obviously, the experiment is not over yet but merely at a stage where we can take a closer look and start to build something that may be used productively on Sculpt. Besides the stabilization work, using a proper Gpu session rather than the ad-hoc Drm session is high on the to-do list.

Those endeavors, however, warrant another post…

PS: I know, for a post regarding computer graphics there are a lot less images than one would expect - I might update the post with some in the future ☺.