Addressing Ubuntu NVIDIA Issues

From edegan.com

This page provides information on how to address NVIDIA driver issues under Ubuntu 18.04. However, the objective is not to provide a step-by-step set of instructions to address one particular issue. Instead, it lays out the general process and commonly-used command line tools to debug video (and other) driver issues.

The Boot Process

The following things happen at boot:

  • BIOS enumerates/checks hardware -- make sure Secure Boot is disabled!
  • The MBR (if you have one) contains GRUB, which is called next. We shouldn't be using UEFI with Secure Boot if we want unsigned NVIDIA drivers to work.
    • Edit /etc/default/grub and run update-grub to alter /boot/grub/grub.cfg (note that there is also /etc/grub.d)
  • Then it's the Kernel's turn! Kernel images are in /boot.
    • Find which one is being used by running uname -r or doing cat /proc/version or dmesg | grep ubuntu.
  • The Kernel loads initramfs first (a mini OS), which is also stored in /boot. It can be updated with update-initramfs.
    • initramfs decides which kernel modules are going to be loaded. Use modprobe to alter the list then update-initramfs.
    • /etc/modules-load.d/modules.conf can provide a list of modules or can be blank
    • lsmod lists loaded modules, depmod -n lists module dependencies, and modinfo provides info on a module. The loaded modules are also listed in /proc/modules.
  • Kernel messages are in /var/log/kern.log (as well as in /var/log/syslog).
    • View them with dmesg, cat /var/log/kern.log, and journalctl -b (messages since last boot). Note that cat /proc/kmsg shows nothing and /var/log/syslog is a log for rsyslog.
  • Init, or rather systemd, then takes over and the machine goes from runlevel to runlevel, eventually bringing up your X Window System, if you have one.
    • /etc/rc.local is called last. For later versions of NVIDIA drivers you'll need /usr/bin/nvidia-persistenced --verbose here.

The boot process matters because video drivers are loaded in the kernel phase. If you manually change which modules are loaded, you'll need to run update-initramfs afterwards (dpkg, apt-get and most scripts will do it for you).
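
As a rough sketch of the check implied here, the following shell function tests whether an initramfs image exists for a given kernel release. The function name initrd_for_kernel and its optional second argument (the boot directory) are made up for illustration; the initrd.img-<release> file naming is the one that update-grub reports on this page.

```shell
# Print the initramfs path for a kernel release, or fail if it is missing.
# Usage: initrd_for_kernel [release] [boot_dir]
# (hypothetical helper, not part of any package)
initrd_for_kernel() {
    release="${1:-$(uname -r)}"   # default: the running kernel
    boot_dir="${2:-/boot}"        # default: the usual location
    image="$boot_dir/initrd.img-$release"
    if [ -f "$image" ]; then
        printf '%s\n' "$image"
    else
        printf 'no initramfs for %s in %s\n' "$release" "$boot_dir" >&2
        return 1
    fi
}
```

Calling initrd_for_kernel with no arguments checks the running kernel. If the image exists but is stale, update-initramfs -u (as root) refreshes it.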

Useful Commands

Here's a list of useful commands to diagnose your issues:

vi /etc/default/grub
update-grub
uname -r
cat /proc/version 
update-initramfs
cat /var/log/kern.log
dmesg
cat /proc/kmsg
less /var/log/syslog
journalctl -b 
lsmod | grep video
modinfo asus_wmi
find /lib/modules/$(uname -r) -type f -name '*.ko'
modprobe nvidiafb
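
When scripting these checks, grepping lsmod can match more than you want (a search for nvidia also matches nvidia_drm, for example). A minimal sketch of an exact-name test against /proc/modules follows; the name module_loaded and the optional file argument are assumptions made for illustration.

```shell
# Succeed if the named kernel module appears in a /proc/modules-style file.
# Usage: module_loaded name [modules_file]
# (hypothetical helper; the file argument exists only to make testing easy)
module_loaded() {
    name="$1"
    modules_file="${2:-/proc/modules}"
    # The first whitespace-separated field of each line is the module name.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}
```

For example, module_loaded nouveau && echo "nouveau is loaded".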

Finding hardware

If your video card is installed in a PCI slot and visible to the OS, it will show up in:

lspci -vk

and in

lshw -c video

Example Hardware Check

Checking that the hardware is being seen on the first build of the DIGITS DevBox:

lspci -vk

05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1) (prog-if 00 [VGA controller])
       Subsystem: NVIDIA Corporation GP102 [TITAN Xp]
       Flags: bus master, fast devsel, latency 0, IRQ 78, NUMA node 0
       Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
       Memory at c0000000 (64-bit, prefetchable) [size=256M]
       Memory at d0000000 (64-bit, prefetchable) [size=32M]
       I/O ports at d000 [size=128]
       Expansion ROM at 000c0000 [disabled] [size=128K]
       Capabilities: [60] Power Management version 3
       Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
       Capabilities: [78] Express Legacy Endpoint, MSI 00
       Capabilities: [100] Virtual Channel
       Capabilities: [250] Latency Tolerance Reporting
       Capabilities: [128] Power Budgeting <?>
       Capabilities: [420] Advanced Error Reporting
       Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
       Capabilities: [900] #19
       Kernel driver in use: nouveau
       Kernel modules: nvidiafb, nouveau
 
06:00.0 VGA compatible controller: NVIDIA Corporation Device 1e02 (rev a1) (prog-if 00 [VGA controller])
       Subsystem: NVIDIA Corporation Device 12a3
       Flags: fast devsel, IRQ 24, NUMA node 0
       Memory at f8000000 (32-bit, non-prefetchable) [size=16M]
       Memory at a0000000 (64-bit, prefetchable) [size=256M]
       Memory at b0000000 (64-bit, prefetchable) [size=32M]
       I/O ports at c000 [size=128]
       Expansion ROM at f9000000 [disabled] [size=512K]
       Capabilities: [60] Power Management version 3
       Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
       Capabilities: [78] Express Legacy Endpoint, MSI 00
       Capabilities: [100] Virtual Channel
       Capabilities: [250] Latency Tolerance Reporting
       Capabilities: [258] L1 PM Substates
       Capabilities: [128] Power Budgeting <?>
       Capabilities: [420] Advanced Error Reporting
       Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
       Capabilities: [900] #19
       Capabilities: [bb0] #15
       Kernel modules: nvidiafb, nouveau

This looks good. The second card is the Titan RTX (see https://devicehunt.com/view/type/pci/vendor/10DE/device/1E02).
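
With several cards the lspci output gets long, so it can help to boil it down to one line per card. The following is only a sketch: vga_drivers is a made-up name, and it matches only the "VGA compatible controller" class shown above.

```shell
# Read `lspci -vk` output on stdin and print one line per VGA controller:
#   <slot> <kernel driver in use, or "none">
# (hypothetical helper for illustration)
vga_drivers() {
    awk '
        # A device record starts at column 0 with its slot address.
        /^[0-9a-f]/ {
            if (slot != "") print slot, driver
            slot = ""; driver = "none"
            if ($0 ~ /VGA compatible controller/) slot = $1
        }
        # Indented detail lines belong to the current record.
        /Kernel driver in use:/ && slot != "" { driver = $NF }
        END { if (slot != "") print slot, driver }
    '
}
```

Run it as lspci -vk | vga_drivers. For the output above it would report nouveau for 05:00.0 and none for 06:00.0.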

Altering GRUB

If you didn't use the HWE version of Ubuntu, you might get the dreaded screen freeze or black screen on boot when the video drivers are loaded. Fix that by booting to a terminal and doing:

vi /etc/default/grub
 GRUB_DEFAULT=0
 GRUB_TIMEOUT_STYLE=hidden
 GRUB_TIMEOUT=2
 GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
 GRUB_CMDLINE_LINUX_DEFAULT="nvidia-drm.modeset=1"
 GRUB_CMDLINE_LINUX=""
update-grub
 Sourcing file `/etc/default/grub'
 Generating grub configuration file ...
 Found linux image: /boot/vmlinuz-4.18.0-25-generic
 Found initrd image: /boot/initrd.img-4.18.0-25-generic
 Found linux image: /boot/vmlinuz-4.18.0-20-generic
 Found initrd image: /boot/initrd.img-4.18.0-20-generic
 Found linux image: /boot/vmlinuz-4.18.0-18-generic
 Found initrd image: /boot/initrd.img-4.18.0-18-generic
 Found linux image: /boot/vmlinuz-4.15.0-54-generic
 Found initrd image: /boot/initrd.img-4.15.0-54-generic
 Found memtest86+ image: /boot/memtest86+.elf
 Found memtest86+ image: /boot/memtest86+.bin
 device-mapper: reload ioctl on osprober-linux-nvme0n1p1  failed: Device or resource busy
  Command failed
 done

See https://askubuntu.com/questions/1048274/ubuntu-18-04-stopped-working-with-nvidia-drivers
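
After update-grub and a reboot, it is worth confirming that the parameter actually reached the kernel. A minimal sketch, assuming the running kernel's command line is readable from /proc/cmdline; the function name cmdline_has and its optional file argument are illustrative only.

```shell
# Succeed if a parameter (e.g. nvidia-drm.modeset=1) is on the kernel
# command line. Usage: cmdline_has param [cmdline_file]
# (hypothetical helper; the file argument exists only to make testing easy)
cmdline_has() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Split the single line into one word per line, then match exactly.
    tr -s ' \t' '\n\n' < "$cmdline_file" | grep -qxF -- "$param"
}
```

For example, cmdline_has nvidia-drm.modeset=1 || echo "parameter missing, check /etc/default/grub".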

Secure Boot

Don't use UEFI and don't use Secure Boot, unless you can use only signed production level drivers (which isn't going to happen). To check that the kernel isn't configured to enforce module signatures, run the following in /boot:

grep CONFIG_MODULE_SIG_ALL config-4.18.0-25-generic
 CONFIG_MODULE_SIG_ALL=y
grep CONFIG_MODULE_SIG_FORCE config-4.18.0-25-generic
 # CONFIG_MODULE_SIG_FORCE is not set

Check the kernel log to see if an unsigned module is tainting the kernel and if that's an issue:

vi /var/log/kern.log
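
The config check can be wrapped up so it follows whichever kernel is running. This is a sketch under the assumption that the config file sits at /boot/config-<release>, as it does in the grep commands above; module_sig_forced is a made-up name.

```shell
# Succeed if the given kernel config enforces module signatures.
# Usage: module_sig_forced [config_file]
# (hypothetical helper for illustration)
module_sig_forced() {
    config_file="${1:-/boot/config-$(uname -r)}"
    # "# CONFIG_MODULE_SIG_FORCE is not set" does not match this pattern.
    grep -qx 'CONFIG_MODULE_SIG_FORCE=y' "$config_file"
}
```

Note that this only inspects the kernel build configuration. If the mokutil package is installed, mokutil --sb-state reports whether the firmware currently has Secure Boot enabled.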




Drivers

Remember that Tensorflow uses CUDA, which in turn uses the video driver, so you should fix things in order!

If you want to install the latest driver from the PPA, you might need to update your repo list. Current repos are in

/etc/apt/sources.list.d/

If the Launchpad PPA was already added as a repo, then you can inspect it with:

cat /etc/apt/sources.list.d/graphics-drivers-ubuntu-ppa-bionic.list

See drivers from all of your repos with:

ubuntu-drivers devices
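
If you are not sure what the PPA's list file is called on your machine, you can search the sources directory instead. A small sketch; graphics_ppa_files is a made-up name, and matching on the string graphics-drivers is an assumption based on the file name shown above.

```shell
# List apt source files with an active entry for the graphics-drivers PPA.
# Usage: graphics_ppa_files [sources_dir]
# (hypothetical helper for illustration)
graphics_ppa_files() {
    sources_dir="${1:-/etc/apt/sources.list.d}"
    # -l prints only file names; -s hides errors for missing files.
    # Anchoring on "deb" skips commented-out entries.
    grep -lsE '^[[:space:]]*deb.*graphics-drivers' "$sources_dir"/*.list
}
```

No output means no active entry was found, in which case the PPA probably still needs to be added.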

Module Diagnostics

This section contains some notes from the depth of my problems:

lspci -vk shows Kernel modules: nvidiafb, nouveau and no Kernel driver in use.

It looks like nouveau is still blacklisted in /etc/modprobe.d/blacklist-nouveau.conf and /usr/bin/nvidia-persistenced --verbose is still being called in /etc/rc.local. ubuntu-drivers devices returns exactly what it did before we installed CUDA 10.1 too...

There is no /proc/driver/nvidia folder, and therefore no /proc/driver/nvidia/version file found. We get the following:

/usr/bin/nvidia-persistenced --verbose
nvidia-persistenced failed to initialize. Check syslog for more details.
tail /var/log/syslog
...Jul  9 13:35:56 bastard kernel: [ 5314.526960] pcieport 0000:00:02.0:    [12] Replay Timer Timeout
...Jul  9 13:35:56 bastard nvidia-persistenced: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
ls /dev/ 
...reveals no nvidia devices

nvidia-smi
...NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
.../etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
.../etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer

modprobe --resolve-alias nvidiafb
modinfo $(modprobe --resolve-alias nvidiafb)

lsof +D /usr/lib/xorg/modules/drivers/
COMMAND  PID USER  FD   TYPE DEVICE SIZE/OFF     NODE NAME
Xorg    2488 root mem    REG   8,49    23624 26346422 /usr/lib/xorg/modules/drivers/fbdev_drv.so
Xorg    2488 root mem    REG   8,49    90360 26347089 /usr/lib/xorg/modules/drivers/modesetting_drv.so
Xorg    2488 root mem    REG   8,49   217104 26346424 /usr/lib/xorg/modules/drivers/nouveau_drv.so
Xorg    2488 root mem    REG   8,49  7813904 26346043 /usr/lib/xorg/modules/drivers/nvidia_drv.so
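
Since lspci lists nvidiafb and nouveau as candidate modules but neither is in use, it is useful to see exactly which modules are blacklisted and where. The sketch below searches the same two directories as the grep above; the function name blacklist_entries is made up, and it assumes plain module names without regex metacharacters.

```shell
# Print every "blacklist <module>" line found in modprobe config directories.
# Succeeds if at least one entry was found.
# Usage: blacklist_entries module [dir ...]
# (hypothetical helper for illustration)
blacklist_entries() {
    module="$1"; shift
    # Default to the directories used elsewhere on this page.
    [ "$#" -gt 0 ] || set -- /etc/modprobe.d /lib/modprobe.d
    found=1
    for dir in "$@"; do
        # -H prefixes each hit with its file name; -s hides missing files.
        if grep -HsE "^[[:space:]]*blacklist[[:space:]]+$module[[:space:]]*$" "$dir"/*.conf; then
            found=0
        fi
    done
    return "$found"
}
```

For example, blacklist_entries nouveau shows whether the installer's blacklist is still in place.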


Determine and Set Driver

You can use the application NVIDIA prime:

apt-get install nvidia-prime
prime-select nvidia
 Info: the nvidia profile is already set

Paths

You can also use ldconfig to add the LD Library Path:

vi /etc/ld.so.conf.d/cuda.conf
 /usr/local/cuda-10.0/lib64 
ldconfig 

Export paths by setting the global environment

vi /etc/environment
 PATH="/home/researcher/anaconda3/bin:/home/researcher/anaconda3/condabin:/usr/local/cuda-10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
 LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64"

Now there is no need to activate a custom environment! Alternatively:

export PATH=/usr/local/cuda-10.0/bin:/usr/local/cuda-10.0${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export PYTHONPATH=/your/tensorflow/path:$PYTHONPATH
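
The ${VAR:+:${VAR}} form in these exports appends a colon and the old value only when the variable is already set and non-empty, which avoids a stray trailing colon (an empty entry in LD_LIBRARY_PATH is treated as the current directory). A minimal sketch of the same idea as a reusable function that also skips duplicates; prepend_path is a made-up name.

```shell
# Print a colon-separated path list with dir prepended, unless dir is
# already present. Usage: prepend_path dir current_value
# (hypothetical helper for illustration)
prepend_path() {
    dir="$1"
    current="$2"
    case ":$current:" in
        *":$dir:"*) printf '%s\n' "$current" ;;                 # already there
        *)          printf '%s\n' "$dir${current:+:$current}" ;; # add in front
    esac
}
```

Usage would look like export PATH="$(prepend_path /usr/local/cuda-10.0/bin "$PATH")", which is safe to run repeatedly from a shell startup file.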

Persistenced

For CUDA 10.1, we need nvidia-persistenced to be run at boot, so:

vi /etc/rc.local
 #!/bin/sh -e
 /usr/bin/nvidia-persistenced --verbose
 exit 0
chmod +x /etc/rc.local

For CUDA 10.0, it runs as a service and is set up and launched by the script.

/usr/bin/nvidia-persistenced --verbose
 nvidia-persistenced failed to initialize. Check syslog for more details.

 Jul 11 21:08:20 bastard sshd[3708]: Did not receive identification string from 94.190.53.14 port 3135
 Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Verbose syslog connection opened
 Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Directory /var/run/nvidia-persistenced will not be removed on exit
 Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Failed to lock PID file: Resource temporarily unavailable
 Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Shutdown (3714)

Failing to lock the PID file suggests that another instance is already running. To check:

ps aux | grep persistenced
 nvidia-+  2183  0.0  0.0  17324  1552 ?        Ss   10:09   0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
 root      3539  0.0  0.0  14428  1000 pts/0    S+   10:10   0:00 grep --color=auto persistenced

And to see it is a service:

systemctl list-units --type service --all | grep nvidia
 nvidia-persistenced.service           loaded    active   running NVIDIA Persistence Daemon
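
The two nvidia-persistenced failures seen on this page point in different directions, so a quick filter over syslog can save some reading. This is only a sketch: persistenced_hints is a made-up name, and the hints are our reading of the messages rather than anything from NVIDIA's documentation.

```shell
# Read syslog lines on stdin and print a one-line hint for each known
# nvidia-persistenced failure message.
# (hypothetical helper; the hint wording is an interpretation)
persistenced_hints() {
    awk '
        /nvidia-persistenced/ && /Failed to lock PID file/ {
            print "another instance probably holds the PID file (already running?)"
        }
        /nvidia-persistenced/ && /Failed to query NVIDIA devices/ {
            print "no usable /dev/nvidia* device files (driver not loaded?)"
        }
    '
}
```

Run it as tail -n 200 /var/log/syslog | persistenced_hints.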

XWindows

To diagnose X Window System issues:

cat /var/log/Xorg.0.log
 ] (II) LoadModule: "nvidia"
 [    29.047] (II) Loading /usr/lib/xorg/modules/drivers/nvidia_drv.so
 [    29.047] (II) Module nvidia: vendor="NVIDIA Corporation"
 [    29.047]    compiled for 4.0.2, module version = 1.0.0
 [    29.047]    Module class: X.Org Video Driver
 [    29.047] (II) NVIDIA dlloader X Driver  410.48  Thu Sep  6 06:27:34 CDT 2018
 [    29.047] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
 [    29.047] (II) Loading sub module "fb"
 [    29.047] (II) LoadModule: "fb"
 [    29.047] (II) Loading /usr/lib/xorg/modules/libfb.so
 [    29.047] (II) Module fb: vendor="X.Org Foundation"
 [    29.047]    compiled for 1.19.6, module version = 1.0.0
 [    29.047]    ABI class: X.Org ANSI C Emulation, version 0.4
 [    29.047] (II) Loading sub module "wfb"
 [    29.047] (II) LoadModule: "wfb"
 [    29.047] (II) Loading /usr/lib/xorg/modules/libwfb.so
 [    29.048] (II) Module wfb: vendor="X.Org Foundation"
 [    29.048]    compiled for 1.19.6, module version = 1.0.0
 [    29.048]    ABI class: X.Org ANSI C Emulation, version 0.4
 [    29.048] (II) Loading sub module "ramdac"
 [    29.048] (II) LoadModule: "ramdac"
 [    29.048] (II) Module "ramdac" already built-in
 [    29.095] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
 [    29.095] (EE) NVIDIA:     system's kernel log for additional error messages and
 [    29.095] (EE) NVIDIA:     consult the NVIDIA README for details.
 [    29.095] (EE) No devices detected.
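
The log is long, and usually only the (EE) lines matter. A minimal sketch for pulling them out; xorg_errors is a made-up name and the default path is the one used above.

```shell
# Print only the timestamped error lines from an Xorg log.
# Usage: xorg_errors [log_file]
# (hypothetical helper for illustration)
xorg_errors() {
    log_file="${1:-/var/log/Xorg.0.log}"
    # Requiring the "[ time ]" prefix skips the marker legend near the top
    # of the log, which also contains the text "(EE)".
    grep -E '^[[:space:]]*\[[^]]*] \(EE\)' "$log_file"
}
```

An empty result with a non-zero exit status means no error lines were found.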

To use nvidia-settings, you have to be on the box. Doing it over SSH will get you:

nvidia-settings --query FlatpanelNativeResolution
 Unable to init server: Could not connect: Connection refused

Tensorflow in Conda

It turns out that tensorflow 1.13.1 doesn't work with CUDA 10.1! But there is a workaround, which is to install CUDA 10.0 in conda only (see https://github.com/tensorflow/tensorflow/issues/26182). We are also going to leave CUDA 10.1 installed because tensorflow will catch up at some point.

Still as your user account (and in the venv):

conda install cudatoolkit
conda install cudnn
conda install tensorflow-gpu
export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

Remove Everything

Sometimes the best option is to completely remove everything and start again.

To see what is currently installed (your output may look different):

dpkg -l|grep nvidia
 ii  libnvidia-container-tools                  1.0.2-1                                      amd64        NVIDIA container runtime library (command-line tools)
 ii  libnvidia-container1:amd64                 1.0.2-1                                      amd64        NVIDIA container runtime library
 ii  nvidia-container-runtime                   2.0.0+docker18.09.7-3                        amd64        NVIDIA container runtime
 ii  nvidia-container-runtime-hook              1.4.0-1                                      amd64        NVIDIA container runtime hook
 ii  nvidia-docker2                             2.0.3+docker18.09.7-3                        all          nvidia-docker CLI wrapper
 ii  nvidia-prime                               0.8.8.2                                      all          Tools to enable NVIDIA's Prime
 ii  nvidia-settings                            418.56-0ubuntu0~gpu18.04.1                   amd64        Tool for configuring the NVIDIA graphics driver
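
Before purging anything, it can help to reduce that listing to bare package names so you can see exactly what is installed. A sketch, with installed_matching as a made-up name:

```shell
# Read `dpkg -l` output on stdin and print the names of installed ("ii")
# packages whose name contains the given string (default: nvidia).
# (hypothetical helper for illustration; it removes nothing)
installed_matching() {
    pattern="${1:-nvidia}"
    awk -v p="$pattern" '$1 == "ii" && index($2, p) { print $2 }'
}
```

Run it as dpkg -l | installed_matching nvidia, or with cuda as the argument to list CUDA packages.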

To completely remove all things nvidia (substitute the correct version numbers and don't run the bits after the hash marks):

cd /usr/local/cuda-10.0/bin
./uninstall_cuda_10.0.pl
cd /usr/local
rm -r cuda-10.1/
nvidia-uninstall
apt-get purge 'nvidia*' #Removed nvidia-prime, nvidia-settings, nvidia-container-runtime, nvidia-container-runtime-hook, nvidia-docker2
apt-get purge '*nvidia*' #Removed libnvidia-container-tools (1.0.2-1), Removing libnvidia-container1:amd64
apt autoremove #Removed libllvm7 libvdpau1 libxnvctrl0 linux-headers-4.18.0-18 linux-headers-4.18.0-18-generic linux-image-4.18.0-18-generic linux-modules-4.18.0-18-generic linux-modules-extra-4.18.0-18-generic mesa-vdpau-drivers pkg-config screen-resolution-extra vdpau-driver-all
cd /home/ed
rm -r NVIDIA_CUDA-10.1_Samples/
rm -r NVIDIA_CUDA-10.0_Samples/
dpkg --remove libcudnn7
dpkg --remove libcudnn7-dev
dpkg --remove libcudnn7-doc

Finally, comment out the nvidia-persistenced lines in /etc/rc.local and reboot:

vi /etc/rc.local
shutdown -r now