Ballmerpeak devblog

Software development, related stuff and others

Debugging mesa and the linux 3D graphics stack

June 11, 2019 — Richárd Thier

Have you ever wanted to troubleshoot or even contribute to the linux graphics stack? Were you ever interested in doing a deep dive? Read on if you dare, but do not fear - I was a rookie on the topic myself before I went hunting for a 50-1000% slowdown after some updates and a distro change.

I wanted to document the process of delving deep into open source drivers and put things together, as it might be valuable for others wanting to do similar things - I was, and still am, a rookie at these things after all.

The power of open source: We can fix a slowdown on more than decade-old hardware just as easily as on brand new hardware! Fuck you, planned obsolescence, haha!

Introduction

Sorry if I was not PC above, but I like the original style of Linus Torvalds, and planned obsolescence is something I fight against with everything possible, so I just do not care what others think about a well-placed, but rare, foul word.

Lately I changed my linux distribution from a not-much-updated Ubuntu 16.04 to a completely new Arch Linux 32 install. Everything went as smoothly as it can with a distro that needs more hand customization, but 3D performance became really slow for some reason.

This post summarizes my hunt for the problem - from analysing config files all the way to adding my own quickfix to the mesa source code! Yes, you read that right. This post is also good reading for anyone who wants to start contributing to open source, as this was my introduction to the deeper topics as well.

Can you believe you can fix something that actually causes a 1000% slowdown in games and applications like Urban Terror?

Urban Terror 4.1

Not only that, but this post tries to encourage and introduce even more people to quickfixing - or even properly fixing - similar issues. Despite the fact that this took me literally weeks and a lot of other people's help, in the end the effort pays off - and I was not knowledgeable about GPU drivers, or about contributing to and debugging bigger open source codebases, so if you just do not give up, you can do something similar.

If you get lost in all the details of the journey, just keep in mind that I went through all of this over weeks, not as fast as you might read my words. If you take the time, it will come to you too, I believe ;-)

Symptoms

After setting up my system, installing an X server, mesa drivers for my card and generally doing all the things that you need to do yourself after an arch linux installation, I immediately noticed that there were problems with 3D performance and GPU acceleration. The slowdown was huge - around 50-1000% at least!

On the old system I could play Mount&Blade: Warband without a problem, despite my laptop being from 2007 with much of its hardware from 2005. I could also play NFS: Underground, Heroes 4, Arx Fatalis (Libertatis) and Urban Terror, among others.

The 2D gaming performance was close to what I had earlier, so Heroes was running as well as before, but anything that did 3D took a ridiculous performance hit. Even Extreme Tux Racer had low FPS but was still a bit playable, while Urban Terror had 3-4 FPS at most where it used to have at least 30-60 FPS. Also, Urban Terror and Tux Racer are native linux games, so we can be sure this problem is not only affecting wine games...

The linux graphics stack

This is how I understand the linux graphics stack:

        +-------------------------------+
        |          Applications         |
        +----+--------+------------+----+
             |        |(dri)       |
        +----+----+   |  +---------+-+
        |  Wine   |---|--|   Xorg    |
        +----+----+   |  +--+----+---+
        (dri)|   +----+     |    |
        wd3d |   |  +-------+    |
             |   |  |            |
        +----+------+---++-------+------+
        |    Mesa (gl)  ||xf86-video-ati|
        +----+----------++-------+------+
        (3D) |           (2D)    |
             |                   |
        +----+-------------+     |
        |     Userland     |     |
        |      driver      +-----+
        | (gallium | dri)  |
        +--------+---------+
                 |                userland
        =========|========================
                 |(ioctl)        kerneland
                 |
        +--------+---------+ +-----------+
        |                  | |  Kernel   |
        |radeon driver drm | |    core   |
        |   kernel module  | |(sched,etc)|
        +-------+----------+ +-----------+
                | (ring buffers or I/O)
        +-------+------------------------+
        |          3D card (hw)          |
        +--------------------------------+

Of course there might be some changes or missed details, but this is how I look at the whole graphics stack. The applications are surely at the top and the 3D card is surely at the bottom of the stack.

On linux we have OpenGL and of course we (usually) do not have DirectX; OpenGL is served by "Mesa". Xorg is really simplified in this image for convenience, but it is good to know that there are code paths that go through it, and that there is a way to render 3D directly without going through X: this is called DRI, the "Direct Rendering Infrastructure".

So a 3D app can go directly to the mesa implementation of OpenGL, or go through the X server, which is slower. When running windows applications with wine, it of course adds an extra layer. It is not an emulator, however, but a "shim": the calls an application (an EXE) makes are translated to the equivalent linux calls. Here wined3d is what translates Direct3D calls to their OpenGL equivalents.

The GPU driver is basically split in two: userland and kernel modules. Everything possible is implemented in userland, but of course the I/O calls and similar things need elevated privileges, so some parts of the GPU driver live in the same git repository as the whole linux kernel. The kernel part is either compiled in directly or loaded as a module, and the userland mesa must exist on your system as shared libraries in the places your apps search for them.

The original, older userland drivers are the "dri" drivers: every card had its own separate driver implementation exposing a radically different API towards the Mesa GL libraries. This became a big burden, so later drivers are "Gallium" drivers, which expose a standardised "Gallium" API to anyone using them.

Not only mesa can use this Gallium driver API: there is 2D acceleration and video acceleration like VAAPI, and nowadays there is "Gallium nine", which is basically a native Direct3D 9 implementation to be used with wine. The latter is completely separate from wined3d: gallium nine does not translate DX9 calls to their corresponding OpenGL calls (adding boilerplate or hacks where there is no directly corresponding GL call), but literally implements DX9 on top of the Gallium API. Do not confuse Gallium nine with Gallium itself though!

The userland and the kernel-side drivers communicate through syscalls and ioctl calls; the 3D card and the kernel modules likely communicate through ring buffers (circular buffers read by both and pulled by the GPU) or direct I/O (usually a push from the CPU). Everything else communicates through plain calls or calls into shared objects, which resemble what DLLs are on windows.

Quite a complicated thing and I am sure I do not understand all of it myself, but this is a rough picture we can use for cornering the issue.

I have also added the linux kernel core to the picture, because kernel changes and its configuration can affect the overall system performance a lot too!

It seems like a big and spread thing, but some of these belong together:

  • Wine source control: Contains wined3d and wine itself and its config stuff.
  • Xorg source control: contains a lot of X-org stuff and its config stuff.
  • Mesa source control: contains userland drivers and mesa + gallium9 and others.
  • Kernel source control: contains the kernel core and most kernel modules.

So basically these are "only" four kinds of places, and their configuration, that we need to look at and analyse.

3D performance checklist

This is the order in which I went to check things:

  1. Check what kind of apps slow down and their version
  2. Check drivers and direct rendering
  3. Check xorg.conf and other relevant configuration changes
  4. Corner the problem: Set up a state where things are still fast
  5. Corner the problem: what component is causing the issue?

We will move through the above steps in the following chapters. First we look for the easy-peasy things that should be checked anyways and then move to the more hardcore coding and debugging in the end.

To be honest I did not do everything in this very order, but for writing up my experience and know-how about the issue, the above order seems better.

If you want to see how I approached the problem (including dumb questions), most of that is documented in the following forums:

bbs.archlinux32.org

www.phoronix.com

bugs.freedesktop.org

The links are in order of increasing relevance, with most of the information provided in the middle link actually.

Check what kinds of apps slow down and their version

This is really simple but necessary! You want to know what kind of applications are slow, because if you realize only wine applications are slow, it might be a problem with wine, the wine configuration or registry values in wine! Similarly, you can check whether 2D or video acceleration is also slow, whether the whole system is slow in general, or whether it is only 3D performance.

These are necessary to know! You should also note how much slower things are, because if some of them are really slow and others are slower but okay, then you might be facing multiple issues in different directions.

In my case I found that 3D performance was hit by a real slowdown, as mentioned before around 50-1000%, which literally made most games and applications unplayable after the update. I thought it was not a problem with wine, because even Extreme Tux Racer was lagging a bit (10-15 FPS) and Urban Terror (a native OpenGL game) became an unplayable slideshow. As the native linux games slowed down too, it was clear the problem was not (only) the wine configuration.

In the meantime, memory and CPU usage actually became lighter after switching to the arch32 system. Earlier an idle system used around 90-150 MB of RAM (out of my 1384 MB) and now the idle usage is around 50-70 MB. The CPU is also generally close to 0-3% usage when idle, so general system performance is not much slower. I saw a minor slowdown when watching youtube videos in the Pale Moon browser, but that might be a different problem, like missing codecs.

It might be obvious to say, but of course you should also take note of the version of the application / game itself, especially if things do not seem to be a system-wide slowdown, as you might be comparing a new version against an old one and then the problem might be in the app itself!

Check drivers and direct rendering

To see if there is hardware acceleration try glxinfo | less:

    name of display: :0
    display: :0  screen: 0
    direct rendering: Yes 
    server glx vendor string: SGI 
    server glx version string: 1.4 
    server glx extensions:
    ...
    Extended renderer info (GLX_MESA_query_renderer):
        Vendor: X.Org R300 Project (0x1002)
        Device: ATI RC410 (0x5a62)
        Version: 19.2.0
        Accelerated: yes
        Video memory: 128MB
        Unified memory: no
        Preferred profile: compat (0x2)
        Max core profile version: 0.0
        Max compat profile version: 2.1
        Max GLES1 profile version: 1.1
        Max GLES[23] profile version: 2.0
    OpenGL vendor string: X.Org R300 Project
    OpenGL renderer string: ATI RC410
    OpenGL version string: 2.1 Mesa 19.2.0-devel (git-2947e49f75)
    OpenGL shading language version string: 1.20
    ...

You should see "direct rendering: Yes" otherwise usually there is no real 3D acceleration enabled in most cases because most hardware accelerated drivers get you to use direct rendering despite theoretically one can write one that has dri disabled and still use the 3d card directly. If you have a 3D card but you see here LLVMPipe or swrast or anything similar here in the OpenGL render string that is also a bad sign for you.

If you have no direct rendering, something is usually wrong. In that case I would do some simple troubleshooting to see what is going on:

  • Either there are no userland drivers (or badly configured)
  • or there are no kernel modules for your card
  • or there are random errors in the logs so you run on fallback
  • or there is a chance that you actually have everything set up well, but have multiple libGL.so files and you are using the wrong one.

A good idea is to look at /var/log/Xorg.0.log, dmesg and all kinds of logs to see if there is any visible error, or even warnings, relating to your 3D card, PCI/AGP buses, X server etc. You can also run lspci -v or similar commands like lshw to tell whether the video card is seen by linux, and issue lsmod | grep radeon to see if you have the kernel modules you want (grepping for i915 instead, of course, if you have that card).

To see whether your apps are searching for and finding the right libraries, you can run things like ldd /usr/bin/glxgears and see what paths are printed for the graphics shared objects. Most package managers also have a way to show you what files their packages have installed and where.
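
For example something like this (the grep pattern is only my guess at the interesting names - adjust it to your setup):

    ldd /usr/bin/glxgears | grep -i -E 'gl|dri|mesa'
    # prints which libGL.so and friends actually get loaded, and from where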

For example on arch linuxes you can use pacman like this:

    pacman -Ql mesa # to see what files mesa installed
       ...
    sudo pacman -Fy # update filename search database
       ...
    pacman -Fs libGL.so
      extra/libglvnd 1.1.1-1.0
        usr/lib/libGL.so
      extra/nvidia-340xx-utils 340.107-3.5
        usr/lib/nvidia/libGL.so

Similar commands are available for all kinds of package managers usually.

I also need to add that there are two kinds of drivers: gallium and plain dri mesa drivers. Gallium is the newer one, while the dri drivers lived on beside it with a mostly untouched codebase, so they do not have the latest stuff. Early on it was easy to tell from the OpenGL renderer string that you were using Gallium, because it said "Gallium 0.4 on ATI RC410" (the second part being your card, of course), and I immediately saw this was not present anymore.

Many blogs and pages still tell you this is supposed to be there, and no one seemed to know whether it just got removed, so at first I started to think that for some reason I might be using the older dri driver for my r300 card, which might have explained the slowdown.

As it turns out, this is not a good indicator anymore: "Gallium" is not printed even if you are using a Gallium driver - I think mostly because non-gallium drivers have largely died out, and the one for my card does not even exist in the mesa codebase anymore.

As you can see, in my case I had direct rendering enabled and everything seemed to be "perfect" for accelerated 3D, but the performance felt as slow as software rendering.

I also used GALLIUM_HUD to see the CPU usage while doing 3D stuff on the screen, and this also hinted that I indeed had Gallium on:

    vblank_mode=0 GALLIUM_HUD="cpu,fps" glxgears

CPU usage seemed high to me, but according to top it was only 25-30%, and it later turned out to be normal actually, so I am only mentioning this for reference.

I had around 320-350 FPS here because of vblank_mode=0 and because the option called SwapBuffersWait was turned off too in my xorg.conf setup.

Just to be sure there was hardware acceleration, I tried forcing software rendering using LLVMpipe in mesa to see how that behaves:

    vblank_mode=0 LIBGL_ALWAYS_SOFTWARE=1 glxgears
    vblank_mode=0 LIBGL_ALWAYS_SOFTWARE=1 etr
    LIBGL_ALWAYS_SOFTWARE=1 glxinfo | less

These performed much worse, so surely there was "some" amount of hardware GPU acceleration going on, just not in an optimal way.

Check xorg.conf and other relevant configuration changes

The xorg.conf file configures a lot of things, not only for the code paths that go through the X server but also for the direct rendering paths, so the figure above can be a bit misleading in this sense.

I did not have my old original system anymore, but I had saved my own custom xorg.conf files that I use for various purposes.

They mostly differ only in the bit depth, because I can tell you some games run much better when not only the game is set to use 16 bits, but the whole X server is also running at 16 bit depth.

My basic setup was this though:

    Section "Device"
        Identifier      "Configured Video Device"
        Driver          "radeon"
        # Not really necessary, just here to be sure I am using gallium
        Option          "ForceGallium" "True"
        Option          "AGPMode" "8" 
        Option          "AGPFastWrite" "True"
        Option          "EnablePageFlip" "True"
        Option          "SwapbuffersWait" "False"
        Option          "DRI" "2" 
        Option          "ColorTiling" "on"
        # The glamor way does not start for me because it complains about missing EXA stuff and hangs...
        #Option          "AccelMethod" "glamor"
        Option          "AccelMethod" "EXA"
        # This is only for r600+ cards it seems :-( 
        #Option          "ColorTiling2D" "on"
        # Turns off radeon-specific acceleration
        # According to https://ubuntuforums.org/showthread.php?t=2339861
        # This was said to be able to help, but adding the radeon.agpmode=8
        # parameter at boot should outweigh the loss of this, as memory
        # bandwidth can grow from 133MB/s in PCI mode to 1000-2000MB/s with AGP!
        # Maybe the parameter should be added at boot time now instead 
        # of here in the xorg.conf and that is the change that makes 
        # things slow now? Or I need the "NoAccel" in order for AGP to work?
        #Option          "NoAccel" "True" # This however removes DRI2 :-( 
    EndSection

    Section "Monitor"
        Identifier      "Configured Monitor"
    EndSection

    Section "Screen"
        Identifier      "Default Screen"
        Monitor         "Configured Monitor"
        Device          "Configured Video Device"
        DefaultDepth    24  
        SubSection "Display"
            Depth    24  
            Modes    "1024x768"
        EndSubSection
    EndSection

    Section "Extensions"
        Option "Composite" "Disable"
    EndSection

Using DRI2 instead of DRI3 can both gain you performance and remove glitches, but I think forcing it is unnecessary here, because with my card there is no way to properly use DRI3 anyway.

The AGP things I am still unsure about, but originally they seemed to play a role. Now I see they are completely ignored. Actually my card is not even AGP, so maybe I had wrong memories from another machine and just set these accordingly. Still, I went a long way in my analysis in that direction, as if AGP had been accelerated earlier and not anymore - but wrong turns happen while analysing :-)

EnablePageFlip and SwapBuffersWait are good to set up like this, however, and you had better disable compositing on lower end machines - you do not want to run compiz and the like just to make your apps a bit slower, do you?

I was lucky to still have my original settings from the time when things were fast, but what if you do not have such a backup, or never had an xorg.conf before?

    man radeon # gives you config values for radeon!

Corner the problem: Set up a state where things are still fast

Okay, to be honest the best way to do these kinds of things is to already be using a rolling release system with everything up-to-date. This way you will be among the first to know if something breaks your system and can tell the devs straight away - or fix the issue straight away in case you have dev skills.

In my case I made a big change: I ditched my old ubuntu system after it crashed while updating to 18.04, knowing it would not support 32 bit anymore anyway.

All I knew was that I once had a 16.04 Ubuntu with a not-quite-latest mesa and xorg and not the latest kernel (not even having the spectre and meltdown fixes), and that now, running software from years later, things were suddenly crap-slow.

But when did things go wrong? Did it really get slower, or was it my hallucination? And how do I get a system that I can test old things on once again?

Old live cd releases

Thank goodness I could find old releases:

old-releases.ubuntu.com

Actually most distributions offer older releases, but what was best for me is that the ubuntu live CD came with the radeon driver properly added and configured, so I could just download a live CD and experiment to see whether it is broken at that version or not. I found out, for example, that things are already slow on 16.04 if I use its latest live CD release, but not if I use the original first one!

This meant it was not an issue with the new arch32 distribution or its config, but some issue in something else! Hurray! We knew something at least!

But does it not take ages to keep making bootable live USB drives from the various live CDs? Yes it does, but it can still be faster than trying all the various old versions of libraries and software when the time window is as big as it was in my case: more than two years between the last known working setup and the latest one that does not work!

Package archives

If I had already been using a rolling release and seen things break, the time period in which the slowdown-causing change could lie would likely be much smaller, and there would be more sense in "bisecting" separate small components of the system - like reinstalling the previous mesa or xorg or kernel using the package manager and its caches. The word "bisect" just means that you basically binary search in time for the cause of the problem.

Many people have written about how to revert a package to an older version on the various distros when the problem is recent. Most likely things are still in the cache after the update, so you do not even need to download them.

See this guide for example

What do you do if you do not have the cached packages, because you are on a newly installed system or erased the caches earlier? For arch linux - and even for the 32 bit arch32 community version - there is an arch archive!

Arch-Linux-Archive

The repository for arch32 is at a different place:

archive.archlinux32.org

After doing everything the official arch wiki says about the archive, there were still some things to do specifically for arch32 and for my case.

In /etc/pacman.conf:

    XferCommand = /usr/bin/curl -L -C - -f -o %o %u
    ...
    SigLevel    = Never

These are better uncommented/added. The first one is needed because the default download command (wget?) generally failed against the old archive servers, which respond much more slowly than the proper mirrors.

The latter is needed because at some point the package signing scheme was changed, so there were a lot of packages that could not be verified and updated properly.

I also had to downgrade pacman once, but that was a bit scary: it can brick your otherwise working system if you end up with a non-working pacman, with no easy way to make it work again!

For these it is best to ask your distro forums directly if you have problems:

bbs.archlinux32.org

Using the package archives it is possible to install old mesa versions, and although the arch forums tell you that "partial upgrades/downgrades" are not supported, pacman is pretty much capable of doing them (and of shooting you in the foot if you are not paying attention).

So despite everything saying it is a bad idea, I could try old stuff with pacman -Sy, not adding the "u" that would make it a proper full upgrade (pacman -Syu).

This way I could just set the old mirrors in the config files, refresh the package databases from that mirror, and install the old packages. As the arch forums keep telling you, this is dangerous!

I also had to change the architecture in pacman.conf to i686 instead of auto, because auto meant searching for "pentium4" packages, which did not exist in the archives. The "pentium4" architecture is quite new in arch32 and means you get SSE2-optimized packages (you should really use those normally!).
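
To make it concrete, here is a sketch of the kind of settings involved - the archive URL below is only my assumption about the layout, so browse archive.archlinux32.org for the real paths and dates before using it:

    # /etc/pacman.conf (sketch)
    Architecture = i686
    SigLevel = Never
    XferCommand = /usr/bin/curl -L -C - -f -o %o %u

    # /etc/pacman.d/mirrorlist - point at one archived snapshot instead of a live mirror
    # (hypothetical path, check the archive index for the real layout!)
    Server = https://archive.archlinux32.org/repos/2019/01/01/$repo/os/$arch

    # then the dangerous partial downgrade described above:
    sudo pacman -Sy
    sudo pacman -S mesa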

Installing older kernels

Installing a not-so-ancient kernel is as straightforward as a kernel install or kernel compilation can be in itself, so although it is a huge topic that might deserve a completely separate post, there is not much to say about it specific to this problem.

If you are only going back a bit with the kernel, your job is really not hard. Things will not break, and your package manager might even have the latest LTS kernel easily installable.

For example, I could easily try a 4.x kernel while at the time of writing the mainline is 5.x:

    [prenex@prenex-laptop ~]$ pacman -Ss kernel
    ...
    core/linux-lts 4.19.37-1.0 [installed]
        The Linux-lts kernel and modules
    core/linux-lts-docs 4.19.37-1.0
        Kernel hackers manual - HTML documentation that comes with the Linux-lts kernel
    core/linux-lts-headers 4.19.37-1.0 [installed]
        Header files and scripts for building modules for Linux-lts kernel
    ...

This command does not install anything; it just searches package info containing the "kernel" string. A similar search can be done on every major system.

If you cannot get an old-enough kernel like this, you can also try the package archives, or just build your own. If you go really far back in the past, keeping all the other system components new becomes unrealistically hard, but for 1-2 years you usually have it really easy!
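
On arch-like systems this boils down to something like the following (a sketch, assuming grub as the boot loader):

    sudo pacman -S linux-lts linux-lts-headers
    sudo grub-mkconfig -o /boot/grub/grub.cfg   # regenerate the boot menu
    # reboot and pick the -lts entry from the menu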

Bisecting using git

There is another way to corner the problem in time, and that is to build your own versions of the suspect software from its git repository.

If you can build an old version that still works and a new one that does not, you can not only speculate and manually binary search for the commit that makes the slowdown or the bug appear - there is actually very good support for this kind of thing in git already. It is called git bisect.

You might ask now: why the need for bisect support when I can do it by hand?

  • Because it is safer, semi-automatic and generally much less error prone?
  • Because it works together very well with git stash for local changes?
  • Because it supports branches much better than you would do manually?

These count a lot!

For example, by the time you start a bisecting session you should already see which component, app or software is causing the problem (or have a good idea). To get to that point you might already have a lot of information, and maybe know it was last working on 17.x mesa versions. What if that became a branch that got some more commits, and there is no direct ancestry path between your version and it? You do not need to test more by hand (like I did, and guessed wrong) - you'd better git bisect!

You might say git stash is not a big deal? Oh, so maybe you want to copy-paste, by hand, every change needed to make an old mesa build on a new system, after every checkout? Even with bisecting it took weeks to find the problems I am talking about, so every bit of automation counts, believe me!

Anyway, bisecting using git seems to be much better than the archive or live CD methods, so why bother with those at all? For me, mostly because I had no freaking idea where the problem lay, just a rough assumption of when it last worked, and I deliberately wanted a working, fast system so that it would be easier to compare logs, command outputs and profiling outputs!
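
For completeness, a mesa bisect session roughly goes like this (a sketch; the good/bad refs are only examples - use whatever you verified to be fast and slow on your own machine):

    git bisect start
    git bisect bad HEAD                # the current version is slow
    git bisect good mesa-17.1.0        # an old tag you verified to be fast
    # git now checks out a commit halfway; build it, test it, then answer:
    git bisect good                    # ...or "git bisect bad", as measured
    # repeat build+test+answer until git prints "<sha> is the first bad commit"
    git bisect reset                   # return to where you started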

Corner the problem: what component was causing the issue?

I had various ideas about what the problem may be and what might cause it. Now that we have a toolbox to experiment with when the problem appeared, we should look at where it might be.

The non-gallium driver idea

As seen in my forum posts, the first thing that came to my mind was that maybe I was running with the non-gallium, old r300 dri driver. Knowing a lot less about the linux graphics stack back then, I had a hard time figuring out whether this was the case or not.

To be honest, now I can answer my own question that no one seemed to answer for me on the forums:

    Q: How to know if a driver is gallium or not?

    A: The easiest is to go to the source tree of mesa and see if your card has a dri driver at all or not!
       You can check this with `meson configure build/` and look at what it says; if you are still not sure, look at the source!

This is what meson configure build/ says for me in the current mesa source tree:

    dri-drivers [swrast]   [auto, i915, i965, r100, r200, nouveau, swrast]
        List of dri drivers to build. If this is set to auto all drivers 
        applicable to the target OS/architecture will be built
    ...
    gallium-drivers [r300] [auto, kmsro, radeonsi, r300, r600, nouveau, freedreno,
                swrast, v3d, vc4, etnaviv, tegra, i915, svga, virgl, swr,
                panfrost, iris, lima]
        List of gallium drivers to build. If this is set to auto all drivers 
        applicable to the target OS/architecture will be built

If you build mesa yourself (the way you can see in my earlier blog post), you can be pretty sure you only have the drivers listed here and know whether they are gallium or not; but even if you are not building it yourself, you can see from this listing whether your driver is in the list of possibilities at all!

As you can see, the dri-drivers list does not contain r300 anywhere anymore, so my idea that the problem might be that I was running with an old, non-gallium dri driver was completely false. This also means that you can no longer rely on the string "Gallium on" being present in the glxinfo output.
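
For reference, building a minimal mesa with only the r300 gallium driver goes roughly like this (a sketch based on the option names in the listing above; exact option values can differ between mesa versions):

    git clone https://gitlab.freedesktop.org/mesa/mesa.git
    cd mesa
    meson build/ -Dgallium-drivers=r300
    ninja -C build/
    # then either "ninja -C build/ install" or point LD_LIBRARY_PATH and
    # LIBGL_DRIVERS_PATH at the build output when starting your test app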

Kernel config and versioning idea

After the user "debianxfce" told me on phoronix forums which is a really good to visit place for similar troubleshooting that kernel configuration can play a lot in overall system performance maybe, I went in the direction to see if there are possible changes in the kernel config that might affect my case.

He told me:

        "Disable dri 3, remove unused systemd services,
         use a non debug 1000 Hz timer kernel."

This was a helpful message, even though my friend told me that many complain this guy goes off-topic a lot, which is why most of his messages were removed from the thread if you look at my link above. I did not find him utterly off-topic and had nothing against him - I could learn a few things - so it was not me who asked for his messages to disappear; likely someone just decided it was too much.

In any case this much is true: the kernel is below everything, and it got upgraded a lot between the last working version and the current one - or so I thought. Could it not be the problem? Sure it could! Also, kernels can be configured differently in different linux distributions, so the configuration can count too.

After deciphering what he said: there are various kernel config parameters, and you can see them with less /proc/config.gz - or, in the case of Ubuntu and other linuxes whose kernels are not built with support for exposing them there, you can usually find a similar config file under /boot (something like config-<kernel version>).
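
For a quick look at the parameters discussed below, something like this works on most systems:

    zgrep 'CONFIG_HZ\|CONFIG_PREEMPT\|CONFIG_DEBUG_KERNEL' /proc/config.gz
    # or, on distros without /proc/config.gz:
    grep 'CONFIG_HZ\|CONFIG_PREEMPT' /boot/config-$(uname -r)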

Knowing this, I could tell I was running a 300 Hz kernel:

    # CONFIG_HZ_100 is not set
    # CONFIG_HZ_250 is not set
    CONFIG_HZ_300=y
    # CONFIG_HZ_1000 is not set
    CONFIG_HZ=300

This means that the timer interrupt fires 300 times per second, each time giving the scheduler a chance to do some work in the kernel, maybe context switch, and all kinds of similar stuff. For servers you would rather lower this value to 100, but for cutting-edge interactivity you want the highest possible value, which is currently 1000. You basically choose among these few values.

It is not as simple as saying "1000 Hz is always better for desktop", because a bigger number means bigger overhead when there is "nothing to do", while it surely helps when, for example, you get a load of 300 and your mouse cannot even move on the screen anymore or you cannot type in the terminal. So if some process is eating up the CPU, you might still get some interactivity.

Actually I have found that maybe this is more likely to affect me:

    # CONFIG_PREEMPT_VOLUNTARY is not set
    CONFIG_PREEMPT=y

The former (voluntary preemption) means that you can still be interrupted at that 300 Hz rate I had as my setup, but the context switch will not happen until some predefined point or area in the source code is reached.

Much earlier there was no way at all to switch out a process or thread that was currently running in the kernel because a syscall was taking place; voluntary preemption was added so that it can happen at some of these predefined points. Later full preemption was added, which became the new default.

The upside of this is that you can survive an infinite loop happening in the kernel itself (though I have yet to see anything like that); the downside is that there can be a bit more overhead, for example when there is only one app needing around 100% CPU and pretty much all it wants is to finally finish that very ioctl or kernel syscall and get right back to work, while no one else really wants the CPU for anything.

Similarly, CONFIG_DEBUG_KERNEL is a kernel option that enables some extra debugging and is on for a lot of systems just to help find problems.

See here: https://cateee.net/lkddb/web-lkddb/DEBUG-KERNEL.html

You can access a nice ncurses based menu to change these when you are building your very own kernel from the kernel sources.

All you need to do is to:

    make menuconfig

I also chose to build the "zen-kernel", a special version geared towards "high end desktop usage" that comes with the MUQSS scheduler and some other nice stuff. So I chose this instead of the regular kernel as the base for my trickery.

Later I also disabled CONFIG_RETPOLINE (it was =y) and added CONFIG_SCHED_MUQSS=y along with CONFIG_MPENTIUMM=y, because I know for sure that I have a Pentium M processor, so why not use code directly optimized for it?

It is also useful to run make localmodconfig, as that creates a config with modules only for the things loaded at the moment you run it. If you cannot start everything you need or plug in every device - or of course if you are building a generic kernel for others - this is the wrong way to go.

Later I was randomly walking through the menu and turned off all kernel support for multiprocessors, as the help message told me that doing so on a single core CPU gains a small amount of performance.

You should probably never toggle things on and off yourself in the config file; always use the make menuconfig command, as there are several dependencies and it is also the only proper way to see the help messages for the options.

A kernel config file looks like this

I also checked whether my user belongs to the video group, just in case it might be needed, and ran ls -lat /dev/dri to see which group card0 belongs to.

I also went on and disabled every spectre, meltdown and similar mitigation, because I once tested my CPU with attack code myself and had no problems with the unpatched setup - I could never manage to make meltdown work even after playing with it a bit. So while I am not immune to spectre, I remember having to fiddle a lot for it to print anything barely useful, while the other attacks did not seem to affect me at all, as the CPU is just too old.

To know what mitigations you can turn off, the best way is to look here:

Linux admin-guide/kernel-parameters

and if you want you can just search for the word "mitigation" on the page ;-)
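
On new enough kernels a single switch covers most of them, while older ones take individual flags; here is a sketch of how it would look with grub (which flags your kernel understands depends on its version, so check the parameter list above first):

    # /etc/default/grub - an example, not a recommendation for every machine!
    GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"
    # older kernels take individual flags instead, e.g. nopti nospectre_v2
    # then regenerate the config and reboot:
    #   sudo grub-mkconfig -o /boot/grub/grub.cfg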

With a custom kernel and everything made as cutting edge as it can be, I got a very small FPS rise in Extreme Tux Racer, but it was still 15-18 FPS at most, and 18 only while barely anything was on screen.

None of this solved my issue, but it was good to customize the kernel, because now I have my own cutting-edge, personalized version.

Using a high power profile

I found out (for the first time in my life) that I can actually change the so-called "power profile" of my GPU on linux too! Back in the days when I had Windows XP running on this machine there was a tool for it, but I never knew the proper way to change this on linux.

According to the radeon feature table there may be no memory reclocking implemented for my card, but I still see some gains from a higher powered GPU otherwise.

To make the change, I do one of these for my card:

    # note: a plain "sudo echo high > ..." would fail, because the
    # redirection runs as your own user - tee (or a root shell) is needed
    echo low  | sudo tee /sys/class/drm/card0/device/power_profile
    echo mid  | sudo tee /sys/class/drm/card0/device/power_profile
    echo high | sudo tee /sys/class/drm/card0/device/power_profile

There are other power management methods, but this profile-based one is what the driver supports for my card, so this is what I used. With low you get lower performance and lower consumption, and with high you get higher performance while eating more energy too.

Before blindly doing what I did, read up on what works for your case:

wiki.archlinux.org/index.php/ATI

Actually this page is a worthy read by itself too!

Profiling the problem

You might be thinking that we should use profiling tools to find the issue - and that was indeed an early idea of my friend 'rlblaster', who is a real top-range coder. I had used perf myself before, but as it turned out, it was a good idea to use it here too.

I have used perf for system-wide profiling for 45 seconds:

    perf record -F 99 -a -g -- sleep 45

The outputs can be found here (both the data and its text form):

etr-perf.data etr-perf-report.txt
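
The text form can be produced from the recorded data with something like:

    perf report -i etr-perf.data --stdio > etr-perf-report.txt
    # or browse the same data interactively:
    perf report -i etr-perf.data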

To understand what is written you can ask man perf-report:

    The overhead can be shown in two columns as Children and Self when perf 
    collects callchains. The self overhead is simply calculated by adding all 
    period values of the entry - usually a function (symbol). This is the value 
    that perf shows traditionally and sum of all the self overhead values should 
    be 100%. The children overhead is calculated by adding all period values of 
    the child functions so that it can show the total overhead of the higher level 
    functions even if they don’t directly execute much. Children here means functions 
    that are called from another (parent) function.

This is the call trace that shows up the most:

    __kernel_vsyscall
            |          entry_SYSENTER_32
            |          do_fast_syscall_32
            |          sys_ioctl
            |          do_vfs_ioctl
            |          radeon_drm_ioctl
            |          drm_ioctl
            |          drm_ioctl_kernel
            |          radeon_gem_create_ioctl
            |          radeon_gem_object_create
            |          radeon_bo_create
            |          ttm_bo_init
            |          ttm_bo_init_reserved
            |          ttm_bo_validate
            |          ttm_bo_handle_move_mem
            |          ttm_tt_bind
            |          radeon_ttm_tt_populate
            |          ttm_populate_and_map_pages
            |          ttm_pool_populate
            |          __alloc_pages_nodemask
            |          get_page_from_freelist

This accounts for around 30-65% of the CPU time and tells me that radeon_bo_create is called in a whole lot of cases. Knowing that "bo" means buffer object, one can speculate that something is being created here - like a texture or some vertex buffer being sent to the GPU.

From this output you cannot see where you are in Extreme Tux Racer (the app that was running in the background), because all you can see is the trace after the switch to the elevated kernel level - so the kernel side of the syscall. This both helps and keeps us away from the core of the problem: we have no idea where we are in the userland part of the driver, but we do see that an ioctl gets called a lot of times.

Not long after testing this, I already had the old but working 16.04, so I could compare the results from the old live USB with the new system.

On the old system I saw nothing special like this call trace, so I became pretty sure this call trace contains the problem!

You can see a whole lot of other perf outputs here:

devblog/attachments/mesa/

There you can see most of the logs and outputs from the various test systems I had at the time, so feel free to look around in case you want to compare them to your own.

Now that I had already built my own customized kernel, I could see what these functions really are, as the source code for the kernel modules was right at hand! Why not check what radeon_ttm_tt_populate is, for example?

    static int radeon_ttm_tt_populate(struct ttm_tt *ttm,
            struct ttm_operation_ctx *ctx)
    {
        struct radeon_ttm_tt *gtt = radeon_ttm_tt_to_gtt(ttm);
        struct radeon_device *rdev;
        bool slave = !!(ttm->page_flags & TTM_PAGE_FLAG_SG);

        if (gtt && gtt->userptr) {
            ttm->sg = kzalloc(sizeof(struct sg_table), GFP_KERNEL);
            if (!ttm->sg)
                return -ENOMEM;

            ttm->page_flags |= TTM_PAGE_FLAG_SG;
            ttm->state = tt_unbound;
            return 0;
        }

        if (slave && ttm->sg) {
            drm_prime_sg_to_page_addr_arrays(ttm->sg, ttm->pages,
                    gtt->ttm.dma_address, ttm->num_pages);
            ttm->state = tt_unbound;
            return 0;
        }

        rdev = radeon_get_rdev(ttm->bdev);
    #if IS_ENABLED(CONFIG_AGP)
        if (rdev->flags & RADEON_IS_AGP) {
            return ttm_agp_tt_populate(ttm, ctx);
        }
    #endif

    #ifdef CONFIG_SWIOTLB
        if (rdev->need_swiotlb && swiotlb_nr_tbl()) {
            return ttm_dma_populate(&gtt->ttm, rdev->dev, ctx);
        }
    #endif

        return ttm_populate_and_map_pages(rdev->dev, &gtt->ttm, ctx);
    }

From the call trace you can see that the next level in the call stack is the ttm_populate_and_map_pages(..) call, so we go neither into the SWIOTLB DMA path nor in the ttm_agp_tt_populate direction.

The "AGP" acceleration idea

Wait! Did I not remember setting AGP acceleration stuff in xorg.conf earlier? Wow! Maybe I just do not have accelerated AGP for some reason - even though the card should maybe have it in hardware?

So I thought maybe the ttm operation (ttm is a low level radeon memory manager) became CPU intensive, which is why it shows up in the perf output, while earlier it went down some other code path and was so fast it did not show up!

Then I found here an example of what dmesg should print for AGP acceleration:

    ...
    Sep 19 11:29:54 Debian-G5 kernel: pmac_zilog: 0.6 (Benjamin Herrenschmidt <benh@kernel.crashing.org>)
    Sep 19 11:29:54 Debian-G5 kernel: Linux agpgart interface v0.103
    Sep 19 11:29:54 Debian-G5 kernel: agpgart-uninorth 0000:f0:0b.0: Apple U3 chipset
    Sep 19 11:29:54 Debian-G5 kernel: agpgart-uninorth 0000:f0:0b.0: configuring for size idx: 64
    Sep 19 11:29:54 Debian-G5 kernel: agpgart-uninorth 0000:f0:0b.0: AGP aperture is 256M @ 0x0
    ...

While my dmesg only said this:

    ...
    [   10.422564] battery: ACPI: Battery Slot [BAT0] (battery present)
    [   10.707533] Linux agpgart interface v0.103
    [   10.901497] asus_laptop: Asus Laptop Support version 0.42
    ...

and then... nothing else!

I went and looked where the information should have been printed out:

    // drivers/char/agp/ati-agp.c
    static int agp_ati_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
    {
        struct agp_device_ids *devs = ati_agp_device_ids;
        struct agp_bridge_data *bridge;
        u8 cap_ptr;
        int j;

        cap_ptr = pci_find_capability(pdev, PCI_CAP_ID_AGP);
        if (!cap_ptr)
            return -ENODEV;

        /* probe for known chipsets */
        for (j = 0; devs[j].chipset_name; j++) {
            if (pdev->device == devs[j].device_id)
                goto found;
        }

        dev_err(&pdev->dev, "unsupported Ati chipset [%04x/%04x])\n",
            pdev->vendor, pdev->device);
        return -ENODEV;

    found:
        bridge = agp_alloc_bridge();
        if (!bridge)
            return -ENOMEM;

        bridge->dev = pdev;
        bridge->capndx = cap_ptr;

        bridge->driver = &ati_generic_bridge;

        /* LOOK AT HERE *************************************/
        dev_info(&pdev->dev, "Ati %s chipset\n", devs[j].chipset_name);

        /* Fill in the mode register */
        pci_read_config_dword(pdev,
            bridge->capndx+PCI_AGP_STATUS,
            &bridge->mode);

        pci_set_drvdata(pdev, bridge);
        return agp_add_bridge(bridge);
    }

Also, the later log line about the AGP aperture size should come from drivers/char/agp/backend.c, from the agp_add_bridge(...) call. That is not printed either. This means the code is not even reaching these areas.

After adding more logging to the kernel source code, I found that the pci_find_capability call does not find the AGP capability for my card:

    [ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-prenex-hackerman root=UUID=8e....0f6 rw radeon.agp_mode=8 agp=try_unsupported
    [ 0.000000] Trying unsupported AGP cards too......
    ...
    [    0.932692] Linux agpgart interface v0.103
    [    0.932694] agpgart: Registering agp-ati driver.
    [    0.932708] agpgart-ati 0000:00:00.0: probing Ati chipset [1002/5a31]) for agp capabilities
    ...
    [   15.399353] radeon: unknown parameter 'agp_mode' ignored

These are the parts I changed in the kernel source tree to get that output (my additions are the lines marked with +):

    #ifndef MODULE
    static __init int agp_setup(char *s)
    {
        if (!strcmp(s,"off"))
            agp_off = 1;
        if (!strcmp(s,"try_unsupported")) {
    +       printk(KERN_INFO "Trying unsupported AGP cards too...");
            agp_try_unsupported_boot = 1;
        }
        return 1;
    }
    __setup("agp=", agp_setup);
    #endif
    ...
    static int __init agp_ati_init(void)
    {
        if (agp_off)
            return -EINVAL;
    +   printk(KERN_INFO PFX "Registering agp-ati driver.\n");
        return pci_register_driver(&agp_ati_pci_driver);
    }
    ...
    static int agp_ati_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
    {
        struct agp_device_ids *devs = ati_agp_device_ids;
        struct agp_bridge_data *bridge;
        u8 cap_ptr;
        int j;

    +   dev_info(&pdev->dev, "probing Ati chipset [%04x/%04x]) for agp capabilities\n",
    +       pdev->vendor, pdev->device);

        cap_ptr = pci_find_capability(pdev, PCI_CAP_ID_AGP);
        if (!cap_ptr)
            return -ENODEV;

    +   dev_info(&pdev->dev, "found agp capability on Ati chipset [%04x/%04x])\n",
    +       pdev->vendor, pdev->device);

        /* probe for known chipsets */
        for (j = 0; devs[j].chipset_name; j++) {
            if (pdev->device == devs[j].device_id)
                goto found;
        }

        dev_err(&pdev->dev, "unsupported Ati chipset [%04x/%04x])\n",
            pdev->vendor, pdev->device);
        return -ENODEV;

    found:
        bridge = agp_alloc_bridge();
        if (!bridge)
            return -ENOMEM;

        bridge->dev = pdev;
        bridge->capndx = cap_ptr;

        bridge->driver = &ati_generic_bridge;

        dev_info(&pdev->dev, "Ati %s chipset\n", devs[j].chipset_name);

        /* Fill in the mode register */
        pci_read_config_dword(pdev,
            bridge->capndx+PCI_AGP_STATUS,
            &bridge->mode);

        pci_set_drvdata(pdev, bridge);
        return agp_add_bridge(bridge);
    }

The thing is that when I got to this point, I was actually still short of having an earlier working system. I found the proper "old releases" for ubuntu and arch much later - I am not necessarily blogging this in the order it happened to me, but trying to put together a more logical order, while still keeping some kind of timeline that someone can follow.

In the end AGP seemed not to be the reason for the slowdown, but there might still be some minor issues after the update, because a lot of small differences were spotted.

One that I spotted in the dmesg output was this:

    - original variable MTRRs
    ...
    - reg 2, base: 1408MB, range: 128MB, type UC
    ...
    - new variable MTRRs
    ...
    - reg 2, base: 1280MB, range: 128MB, type WB
    ...

UC means "uncached" and WB means write-back caching here.

Interestingly, the video card is the one owning these 128 megs, and it seems that the older kernels could change the variable MTRRs to a different location, while the new ones either cannot do this or just do not print it anymore.

Logs from the working system

All the logs from the old and working system can be found here:

glxinfo-16-04-old.txt

lspci-16-04-old.txt

dmesg-16-04-old.txt

config-16-04-old.txt

perf-report-16-04-old.txt

perf-16-04-old.data

uname-16-04-old.txt

These seem to show that AGP prints the same thing as on the new system, while 3D rendering is already much faster there, so I went in a different direction after this.

Actually, I am still not completely convinced that I should not maybe have the AGP capability (even though I had been living with a system not having it), or whether the WB vs. UC change should be there for some reason, but as it turned out, the main problem was neither of these!

These might still be worth further analysis though!

Downgrading mesa from the arch archives

The reason it was hard to downgrade mesa from the arch32 and old arch archives is that nowadays there is this "libglvnd", which serves libGL.so instead of the drivers serving their own like earlier.

This is actually good because it lets us choose the driver at runtime, even between proprietary and open drivers or between multiple cards, but because of it, going back in time and installing an old mesa while removing this libglvnd was a bit hard. After recursively removing every dependency on it with pacman, I finally managed to run an old-enough mesa on my new arch linux system and see the fast 3D performance again.

This was a painful thing to do, so after learning that it is quite easy to just build mesa and copy the relevant files, like its own libGL.so, to the places where apps search for them, I switched to testing that way, and it was a much, much better way to do so.

Doing an strace profiling - finding the real issue

So the problem was in mesa, and I filed a bug report saying that the issue is somewhere between 13.x and 19.x. Quite a big range, one could say!

But then I profiled things once again, this time not with perf but with strace. The strace tool traces syscalls towards the kernel, so it shows exactly what is happening at the boundary we saw in the perf call trace.
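
An invocation roughly like this produces the kind of output below (a sketch - the exact flags I used may have differed; -T prints how long each syscall took, -k prints a stack trace after each one, and -e filters for ioctl only):

    strace -f -T -k -e trace=ioctl -o trace.log glxgears
    less trace.log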

This is what I got (bad case):

      ...
      ioctl(6, DRM_IOCTL_RADEON_GEM_CREATE, 0xbfafd880) = 0 <0.000068>
       > [vdso]() [0x891]
      ioctl(6, DRM_IOCTL_RADEON_CS, 0xafe2404c) = 0 <0.000102>
       > [vdso]() [0x891]
      ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfafd9c4) = 0 <0.000030>
       > [vdso]() [0x891]
      ioctl(6, DRM_IOCTL_GEM_CLOSE, 0xbfafd99c) = 0 <0.000043>
       > [vdso]() [0x891]
      ioctl(6, DRM_IOCTL_RADEON_GEM_CREATE, 0xbfafd880) = 0 <0.000070>
       > [vdso]() [0x891]
      ioctl(6, DRM_IOCTL_RADEON_CS, 0xafe380e0) = 0 <0.000088>
       > [vdso]() [0x891]
      ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfafd9c4) = 0 <0.000029>
       > [vdso]() [0x891]
      ioctl(6, DRM_IOCTL_GEM_CLOSE, 0xbfafd99c) = 0 <0.000047>
       > [vdso]() [0x891]
      ...

While this is what I got on the fast, working older versions:

      ...
      ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfcf9f04) = 0 <0.000055>
      ioctl(6, DRM_IOCTL_RADEON_GEM_BUSY, 0xbfcf9d44) = 0 <0.000022>
      ioctl(6, DRM_IOCTL_RADEON_CS, 0xb307d03c) = 0 <0.000089>
      ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfcf9f04) = 0 <0.000053>
      ioctl(6, DRM_IOCTL_RADEON_GEM_BUSY, 0xbfcf9d44) = 0 <0.000023>
      ioctl(6, DRM_IOCTL_RADEON_CS, 0xb30910d0) = 0 <0.000095>
      ioctl(6, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, 0xbfcf9f04) = 0 <0.000054>
      ioctl(6, DRM_IOCTL_RADEON_GEM_BUSY, 0xbfcf9d44) = 0 <0.000023>
      ioctl(6, DRM_IOCTL_RADEON_CS, 0xb307d03c) = 0 <0.000090>
      ...

What is this? I saw it and immediately guessed right: a change happened in how some buffers or resources get reused! Earlier (the bottom output) we only created buffer objects (bo) a few times and always reused them, while in the later, bad versions no reuse was happening.

Some relevant stats showing how bad the issue was:

    10 seconds of glxgears on my default 19.0.4 (bad):

    -> 12466 (any) ioctl calls
    -> 3111 DRM_IOCTL_RADEON_GEM_CREATE
    -> 3112 DRM_IOCTL_GEM_CLOSE

    10 seconds of glxgears on 17.2.0-devel (git-33236a306d) (good):

    -> 1783 (any) ioctl calls
    -> 7 DRM_IOCTL_RADEON_GEM_CREATE
    -> 8 DRM_IOCTL_GEM_CLOSE
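
Counts like these can be gathered with something along these lines (a sketch; your numbers will of course differ per machine and run):

    timeout 10 strace -f -e trace=ioctl -o trace.log glxgears
    grep -c 'ioctl(' trace.log                      # all ioctl calls
    grep -c DRM_IOCTL_RADEON_GEM_CREATE trace.log   # buffer object creations
    grep -c DRM_IOCTL_GEM_CLOSE trace.log           # buffer object destructions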

Once I knew for sure that the main problem is in the mesa source tree, and that the issue can be seen easily both with strace and visibly when running Tux Racer, I went on git bisecting to find the commit that slows things down.

The first slowdown commit

The cause of the slowdown was found by bisecting:

    8b3a257851905ff444d981e52938cbf2b36ba830 is the first bad commit
    commit 8b3a257851905ff444d981e52938cbf2b36ba830
    Author: Marek Olšák <marek.olsak@amd.com>
    Date:   Tue Jul 18 16:08:44 2017 -0400

        radeonsi: set a per-buffer flag that disables inter-process sharing (v4)

        For lower overhead in the CS ioctl.
        Winsys allocators are not used with interprocess-sharable resources.

        v2: It should not crash anymore, but the kernel will reject the new flag.
        v3 (christian): Rename the flag, avoid sending those buffers in the BO list.
        v4 (christian): Remove setting the kernel flag for now

        Reviewed-by: Marek Olšák <marek.olsak@amd.com>

Looking at the commit, I found a way to first do a dirty quickfix to get the performance back, and later Marek himself came to the bug report and gave a possible fix for us to try out (by this point some others had gravitated to the report or written at various places).

It was basically a change that needed RADEON_FLAG_NO_INTERPROCESS_SHARING to be added as a flag to the buffer creation calls - something the r600 code did, but the r300 code did not. The gallium driver has a lot of shared code paths, which helps developers avoid code duplication, but it seems barely anyone really tested the r300 code paths beyond basic working conditions, so the slowdown went unnoticed: it came from the fact that the shared code was changed and the r300-only paths were not updated properly for it.

First I tried Marek's proposal and my hotfix only by applying them on top of the first bad (slow) commit of the 17.x mesa versions. This worked and it became fast again (great again), but when I tried the same patch on top of the latest 19.x version from git, I got a runtime error and nothing but failure.

    glxinfo: ../src/gallium/winsys/radeon/drm/radeon_drm_bo.c:1023: 
      radeon_winsys_bo_create: Assertion 'heap >= 0 && heap < RADEON_MAX_CACHED_HEAPS'
      failed.

After adding a lot of log messages in various places, I found that:

    radeon_get_heap_index(6, 1): -1

while this was meant to be a shared GTT (GART) and VRAM domain:

    domain(6) == RADEON_DOMAIN_GTT(2) | RADEON_DOMAIN_VRAM(4)

The domain basically tells where the buffer object can end up - in GART memory, in video memory, or maybe someplace else, in other words. But as it turns out, the new get_heap_index always returns -1 here, as support for shared heaps got removed at some point in time and such a domain is not valid anymore.

I added a quickfix once again that basically chooses VRAM if that bit is set, and GTT only if the GTT bit is set but VRAM is not; if neither bit is set, it does whatever the user code wants. This prefers VRAM and is never able to put things into any other kind of storage unless the user code directly asks for it. At this point it is hard to tell whether there is space in VRAM or not, and maybe it actually always worked just like this before.

Marek went on and made the next possible "proper" patch by requesting only VRAM at a much more appropriate place, instead of putting a hacky if into shared code like my quick and dirty second fix did.

Looking at the strace output again, there still seemed to be around 700 GEM_CREATE (buffer object creation) ioctl calls, which is still higher than necessary.

To track that down I put an intentional segfault here:

    /* Shared resources do not use cached heaps. */
    if (use_reusable_pool) {
        int heap = radeon_get_heap_index(domain, flags);
        assert(heap >= 0 && heap < RADEON_MAX_CACHED_HEAPS);
        usage = 1 << heap; /* Only set one usage bit for each heap. */

        pb_cache_bucket = radeon_get_pb_cache_bucket_index(heap);
        assert(pb_cache_bucket < ARRAY_SIZE(ws->bo_cache.buckets));

        bo = radeon_bo(pb_cache_reclaim_buffer(&ws->bo_cache, size, alignment,
                                               usage, pb_cache_bucket));
        if (bo)
            return &bo->base;
    } else {
        /* At first there were only these log lines here... */
        fprintf(stderr, "1 KULAKVA!!!\n");
        fprintf(stderr, "2 KULAKVA!!!\n");
        fprintf(stderr, "3 KULAKVA!!!\n");
        /* ...and later this deliberate null-pointer write, so gdb stops
         * exactly on the code path that never sets the flag. */
        uint kula = 0;
        int *kulakva = (int *)kula;
        *kulakva = 42;
    }

This way I could just run gdb glxgears and issue bt at the segfault to see how I got there and which code path fails to set the flag properly.

This is what I got (had to compile mesa with -g for that):

    #0  0xb75a8892 in radeon_winsys_bo_create (rws=0x45e100, size=4096, alignment=4096, domain=RADEON_DOMAIN_GTT, flags=(unknown: 0))
        at radeon_drm_bo.c:993
    #1  0xb75a972b in radeon_cs_create_fence (rcs=rcs@entry=0xb06a4010) at radeon_drm_cs.c:753
    #2  0xb75aa327 in radeon_drm_cs_flush (rcs=0xb06a4010, flags=2, pfence=0xbffff7d8) at radeon_drm_cs.c:593
    #3  0xb757845a in r300_flush_and_cleanup (r300=r300@entry=0x446090, flags=flags@entry=2, fence=fence@entry=0xbffff7d8) at r300_flush.c:56
    #4  0xb757860f in r300_flush (pipe=0x446090, flags=2, fence=0xbffff7d8) at r300_flush.c:82
    #5  0xb731508f in st_context_flush (stctxi=0x55ed20, flags=3, fence=0xbffff7d8) at state_tracker/st_manager.c:635
    #6  0xb74511f8 in dri_flush (cPriv=0x445fd0, dPriv=0x5760d0, flags=3, reason=__DRI2_THROTTLE_SWAPBUFFER) at dri_drawable.c:568
    #7  0xb7f9dbe8 in dri2Flush (psc=psc@entry=0x4165d0, ctx=<optimized out>, draw=draw@entry=0x5590e0, flags=3, 
        throttle_reason=__DRI2_THROTTLE_SWAPBUFFER) at dri2_glx.c:553
    #8  0xb7f9e09d in dri2SwapBuffers (pdraw=0x5590e0, target_msc=0, divisor=0, remainder=0, flush=1) at dri2_glx.c:845
    #9  0xb7f73049 in glXSwapBuffers (dpy=0x407160, drawable=18874370) at glxcmds.c:843
    #10 0x00401742 in ?? ()
    #11 0xb7b5c669 in __libc_start_main () from /usr/lib/libc.so.6
    #12 0x00401cd5 in ?? ()

So the fence creation call was still missing the proper flags, which is why we did not end up with use_reusable_pool above. Sometimes the easiest way to get a nice stack trace is a deliberate error at exactly the point you care about.
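
If you want to reuse this trick, there are slightly tidier ways than a hand-rolled null-pointer write to stop at exactly the branch you are interested in; any of these gives gdb a clean frame to run bt from (plain C, nothing mesa-specific, function name made up):

    #include <signal.h>
    #include <stdlib.h>

    /* Drop one of these into the branch you care about, run the program under
     * gdb, and when it stops just type `bt` to see how you got there. */
    void stop_here_for_gdb(void)
    {
        raise(SIGTRAP);        /* breakpoint-like stop; you can even continue */
        /* abort();               also works, but kills the process           */
        /* __builtin_trap();      gcc/clang builtin, tiniest possible "crash" */
    }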

I changed the relevant part to add the flag:

    /* Create a fence, which is a dummy BO. */
    fence = cs->ws->base.buffer_create(&cs->ws->base, 1, 1,
                       RADEON_DOMAIN_GTT,
                       RADEON_FLAG_NO_SUBALLOC
                       | RADEON_FLAG_NO_INTERPROCESS_SHARING);

and the currently final v3 patch was born.

It is also worth looking around the bug report: later Rui seems to have implemented shader disk caching for r300 cards too, and you can follow any further progress there along with possible test results.

After the final change most of the speed came back and the ioctl counts are now what they were before, which is really nice! I still get a bigger slowdown in Mount&Blade than I was used to, but maybe that is because of a much newer wine or something else - if not the write-back vs. uncached change above.

In any case the heaviest problems are likely gone and maybe nothing else is left now, except perhaps minor configuration issues. Urban Terror for example runs really, really smoothly now, just like every native game and most wine games.

I also found that I can enable HyperZ for measurable speed gains, but with bad glitches on my r300 card - so in the open source world there is always a way to squeeze more speed out of this old hardware.

Mount&Blade: Warband still slow?

After all the changes pretty much everything became fast, but some of the wine games - especially Mount&Blade: Warband - were still borderline unplayable.

In the worst case it ran at 3 FPS and in the best case at around 19 FPS. The bad part is that there were game arenas where you constantly get pretty much the minimum FPS, so playability was really hurt.

I went on to run perf against this game - this is the command I would advise:

    i=0; while [ $i -lt 240 ]; do 
        sudo echo $i; 
        sleep 1;
        let i=$i+1;
    done; 
    sudo perf record --call-graph dwarf -F 99 -a -g -- sleep 300

This way I can go through the menu and all the loading, so by the time recording starts (after 240 seconds) I am already in the arena. The trick is to keep doing something with sudo (the echoing) so that the sudo for perf record still runs with the password I typed in before starting the game.

I must also say that --call-graph dwarf helps a lot: you get gigantic files (200+ megs), but also a much more complete call trace!

This is what I saw (okay this was still without the dwarf stuff):

    -   48,67%     0,00%  mb_warband.exe   [unknown]                   [k] 000000
       -
          - 29,70% call_thread_exit_func
           0x7bcb0b59
         + call_thread_func_wrapper
          + 3,87% 0x40caee
          + 3,76% 0x40cf88
          + 2,25% 0x44bf74
          + 0,60% 0x44baa6
        0,59% 0
    +   29,70%     0,00%  mb_warband.exe   ntdll.dll.so                [.] call_thread_exit_func
    +   29,70%     0,00%  mb_warband.exe   ntdll.dll.so                [.] 0x7bcb0b59
    +   29,70%     0,00%  mb_warband.exe   ntdll.dll.so                [.] call_thread_func_wrapper
    +   22,29%     6,07%  mb_warband.exe   wined3d.dll.so              [.] wined3d_resource_map
    +   16,09%     0,00%  mb_warband.exe   wined3d.dll.so              [.] 0x7d587c16
    +   15,29%     0,00%  mb_warband.exe   wined3d.dll.so              [.] 0x7d585ef2
    +   15,02%    15,02%  mb_warband.exe   wined3d.dll.so              [.] 0x00057ef2
    +   13,31%     0,00%  mb_warband.exe   wined3d.dll.so              [.] wined3d_resource_unmap
    +   13,31%     0,00%  mb_warband.exe   wined3d.dll.so              [.] 0x7d587c9d
    +   10,86%     0,00%  mb_warband.exe   d3d9.dll.so                 [.] 0x7d69cf32
    +    8,77%     0,00%  mb_warband.exe   [unknown]                   [.] 0xe98b5502
    +    8,45%     0,00%  mb_warband.exe   d3d9.dll.so                 [.] 0x7d69d032
    +    8,18%     0,00%  mb_warband.exe   [JIT] tid 363               [.] 0x004228b0
    +    7,36%     0,00%  mb_warband.exe   d3d9.dll.so                 [.] 0x7d69cdc3
    +    6,57%     0,00%  mb_warband.exe   wined3d.dll.so              [.] 0x7d585ef8
    +    6,47%     6,47%  mb_warband.exe   wined3d.dll.so              [.] 0x00057ef8
    +    6,37%     0,00%  mb_warband.exe   [JIT] tid 363               [.] 0x0047c76c
    +    6,37%     0,00%  mb_warband.exe   d3dx9_42.dll.so             [.] D3DXCreateTextureFromFileExA
    +    6,37%     0,00%  mb_warband.exe   d3dx9_42.dll.so             [.] D3DXCreateTextureFromFileExW
    +    6,36%     0,00%  mb_warband.exe   d3dx9_42.dll.so             [.] D3DXCreateTextureFromFileInMemoryEx
    +    5,45%     0,00%  mb_warband.exe   d3dx9_42.dll.so             [.] D3DXLoadSurfaceFromMemo

Basically most of the time is spent in wined3d-related things and - pretty interestingly - in call_thread_exit_func.

Because of this I went and looked around winecfg and various wine hacks.

Then I found that this was enabled in the Staging tab:

The CSMT checkbox in winecfg's Staging tab

"Enable CSMT for better graphics performance (deprecated)"

Read here what CSMT is about

After ticking it off, even Mount&Blade: Warband works as before: the frame rate goes from a worst case of 12 up to highs like 120 FPS, and is usually over 30!

Basically all the problems are solved, but I think fixing up HyperZ might still be a very good hobby project to do together with anyone who wants to help. It seems to give a 10-20% FPS increase in glxgears - alongside the glitches.

The problems are completely solved for 99% of my apps now, and we also helped everyone else who has an r300 card and was hit by the slowdown! Hurray!

Closing words

There are still new directions to move forward in, but the main things got solved. Along the way a lot of new and friendly professionals got in contact and helped me, which was a real joy and great for levelling up my programming skills. The community seemed so cool and good that it would be worth doing this for them alone, but of course it was mostly about the technical question and about learning.

Interestingly both of the problems have one thing in common: multithreading!

The first issue came when someone added the new flag to the mesa driver so that buffer objects can be reused if RADEON_FLAG_NO_INTERPROCESS_SHARING is set, but not otherwise. Of course if you have a resource shared between processes, it is pretty hard to reuse existing buffers. But my machine is a single-core one, so most of the time I gain nothing from interprocess sharing anyway - it is very much a change made lately for multi-processor systems that need these kinds of things. Without enough testing resources it slipped through that the old r300 code path never sets this flag, so it was always slow - while still rendering correctly.

Then, because the r300 code paths spent years never running with this flag, not only were they slow for the 1-2 years while I (and probably others) stuck with an older system instead of testing, but r300 also got some "bitrot": changes like the removal of the combined domain did not fail anywhere, because they only affected the fast path which was never in use. This way the slowdown bug hid the later one, and there was no way to test it since the affected code paths were simply never exercised!

For wine and CSMT: it is also a threading-related change! Not that similar, but still related, because "command stream multi-threading" is about having a separate thread that accepts rendering commands to keep proper ordering, the way it is done on Windows. Moving the command-stream renderer into a different thread from the application not only helps keep the ordering, but is supposed to help performance. For some games it hurts a lot though, and that is exactly the case on single-processor systems! I could see from the perf output that the overhead of this threading was huge compared to how fast the game could feed it commands, so it resulted in more waiting than usefully spent cycles.
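
To show what such a separate command thread looks like in principle, here is a minimal pthread sketch - my own toy, not wined3d code, with all names made up: the "game" thread only pushes commands into a queue and a second thread replays them in order, and every push costs a lock plus a possible wake-up, which is pure overhead when a single core has to run both threads anyway.

    /* Toy sketch of a "command stream thread": the producer pushes commands,
     * the consumer replays them in order. Queue-full handling is skipped to
     * keep the example short. Build with: gcc -pthread csmt_toy.c */
    #include <pthread.h>
    #include <stdio.h>

    #define QUEUE_LEN 64

    static int queue[QUEUE_LEN];
    static int head, tail, done;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    static void submit(int cmd)            /* called by the "game" thread */
    {
        pthread_mutex_lock(&lock);
        queue[tail++ % QUEUE_LEN] = cmd;
        pthread_cond_signal(&cond);        /* wake the renderer thread */
        pthread_mutex_unlock(&lock);
    }

    static void *render_thread(void *arg)  /* replays commands in order */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (head == tail && !done)
                pthread_cond_wait(&cond, &lock);
            if (head == tail && done) {
                pthread_mutex_unlock(&lock);
                return NULL;
            }
            int cmd = queue[head++ % QUEUE_LEN];
            pthread_mutex_unlock(&lock);
            printf("executing command %d\n", cmd); /* stand-in for real work */
        }
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, render_thread, NULL);
        for (int i = 0; i < 10; i++)
            submit(i);
        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        pthread_join(tid, NULL);
        return 0;
    }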

Actually I have seen many times while programming at work that multi-threading can literally slow things down a lot when only one CPU can work at a time. This is of course common sense, but given how often the gain from threads is minor, I have always wondered whether we overdo this thing a bit.

Sadly, if we throw 8 cores at a problem and gain only 1.5x more speed, then we waste a lot of battery and energy-grid power in useless ways. This is also the main reason why I prefer browsers like Pale Moon and other Gecko-based things: as far as I can see, their multithreading overhead is lower than that of any WebKit browser. On a single-core machine you should always choose software that does lightweight threading and does not rely on a ton of cores working overtime just to gain a few percent. We need to think more when developing multithreaded algorithms and solutions, I feel. But it is just hard to do it right - I know... I know...

Special thanks

Special thanks go to the following people and to anyone I have accidentally missed:

  • Marek Olšák and Rui Salvaterra for working directly on the bugfix and other r300 speedups!
  • Everyone helping on the Phoronix forums (duby229, debianxfce, bridgman)
  • Bertrand Sylvain for keeping faith in me and sharing information through gmail exchanges.
  • My friend rlblaster for his ideas through gmail and for reading my very long status reports even though none of it affected him.
  • Everyone on bbs.archlinux32.org whom I cannot name because at the time of writing I cannot access the page.
  • My friend Zsolti, to whom I wrote a lot of bullshit details even though he was not affected at all either.
  • All family and friends who are okay with me going into the deep waters for weeks and talking less :-)

I cannot help but put up the "hackerman" picture, just because this took me long weeks!

Hackerman

Tags: mesa, arch, 32bit, linux, debug, slowdown, contribute, open-source, graphics, stack, analysis, 3D, optimization, tutorial, system, internals
