This page provides information on how to address NVIDIA driver issues under Ubuntu 18.04. However, the objective is not to provide a step-by-step set of instructions to address one particular issue. Instead, it lays out the general process and commonly-used command line tools to debug video (and other) driver issues.
==The Boot Process==
The following things happen at boot:
*BIOS enumerates/checks hardware -- make sure Secure Boot is disabled!
*The MBR (if you have one) contains GRUB, which is called. We shouldn't be using UEFI if we want unsigned NVIDIA drivers to work.
**Edit /etc/default/grub and run update-grub to alter /boot/grub/grub.cfg (note that there is also /etc/grub.d)
*Then it's the Kernel's turn! Kernel images are in /boot.
**Find which one is being used by running uname -r, cat /proc/version, or dmesg | grep ubuntu.
*The Kernel loads initramfs first (a mini OS), which is also stored in /boot. It can be updated with update-initramfs.
**initramfs decides which kernel modules are going to be loaded. Use modprobe to alter the list, then run update-initramfs.
**/etc/modules-load.d/modules.conf can provide a list of modules or can be blank
**lsmod lists loaded modules, depmod -n lists module dependencies, modinfo provides info. Loaded modules are also listed in /proc/modules.
*Kernel messages are in /var/log/kern.log (as well as in /var/log/syslog).
**View them with dmesg, cat /var/log/kern.log, and journalctl -b (messages since the last boot). Note that cat /proc/kmsg shows nothing and /var/log/syslog is the log for rsyslog.
*Init, or rather systemd, then takes over and the machine goes from runlevel to runlevel, eventually bringing up your X window system, if you have one.
**/etc/rc.local is called last. For later versions of NVIDIA drivers you'll need /usr/bin/nvidia-persistenced --verbose here.
The boot process matters because video drivers are loaded in the kernel phase. If you change which modules are loaded manually (dpkg, apt-get, and most install scripts will otherwise do it for you), you'll need to run update-initramfs, as sketched below.
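A minimal check-and-rebuild cycle, assuming you are updating the initramfs for the currently booted kernel:
uname -r
lsmod | grep -e nouveau -e nvidia
update-initramfs -u -k $(uname -r)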
==Useful Commands==
Here's a list of useful commands to diagnose your issues:
vi /etc/default/grub
update-grub
uname -r
cat /proc/version
update-initramfs -u
cat /var/log/kern.log
dmesg
cat /proc/kmsg
less /var/log/syslog
journalctl -b
lsmod | grep video
modinfo asus_wmi
find /lib/modules/$(uname -r) -type f -name '*.ko'
modprobe nvidiafb
==Finding hardware==
If your video card is installed in a PCI slot and visible to the OS, it will show up in:
lspci -vk
and in
lshw -c video
===Example Hardware Check===
Checking that the hardware is being seen on the first build of the [[DIGITS DevBox]]:
lspci -vk
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation GP102 [TITAN Xp]
Flags: bus master, fast devsel, latency 0, IRQ 78, NUMA node 0
Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Memory at c0000000 (64-bit, prefetchable) [size=256M]
Memory at d0000000 (64-bit, prefetchable) [size=32M]
I/O ports at d000 [size=128]
Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
Capabilities: [900] #19
Kernel driver in use: nouveau
Kernel modules: nvidiafb, nouveau
06:00.0 VGA compatible controller: NVIDIA Corporation Device 1e02 (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 12a3
Flags: fast devsel, IRQ 24, NUMA node 0
Memory at f8000000 (32-bit, non-prefetchable) [size=16M]
Memory at a0000000 (64-bit, prefetchable) [size=256M]
Memory at b0000000 (64-bit, prefetchable) [size=32M]
I/O ports at c000 [size=128]
Expansion ROM at f9000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
Capabilities: [900] #19
Capabilities: [bb0] #15
Kernel modules: nvidiafb, nouveau
This looks good. The second card is the Titan RTX (see https://devicehunt.com/view/type/pci/vendor/10DE/device/1E02).
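A quicker way to pull out only the NVIDIA entries, including the [vendor:device] PCI IDs you can look up on devicehunt:
lspci -nnk | grep -iA3 nvidia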
==Altering GRUB==
If you didn't use the HWE version of Ubuntu, you might get the dreaded screen freeze or black screen on boot when the video drivers are loaded. Fix that by booting to a terminal and doing:
vi /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=2
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="nvidia-drm.modeset=1"
GRUB_CMDLINE_LINUX=""
update-grub
Sourcing file `/etc/default/grub'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.18.0-25-generic
Found initrd image: /boot/initrd.img-4.18.0-25-generic
Found linux image: /boot/vmlinuz-4.18.0-20-generic
Found initrd image: /boot/initrd.img-4.18.0-20-generic
Found linux image: /boot/vmlinuz-4.18.0-18-generic
Found initrd image: /boot/initrd.img-4.18.0-18-generic
Found linux image: /boot/vmlinuz-4.15.0-54-generic
Found initrd image: /boot/initrd.img-4.15.0-54-generic
Found memtest86+ image: /boot/memtest86+.elf
Found memtest86+ image: /boot/memtest86+.bin
device-mapper: reload ioctl on osprober-linux-nvme0n1p1 failed: Device or resource busy
Command failed
done
See https://askubuntu.com/questions/1048274/ubuntu-18-04-stopped-working-with-nvidia-drivers
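After rebooting, you can confirm that the kernel actually picked up the new parameters:
cat /proc/cmdline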
==Secure Boot==
Don't use UEFI and don't use Secure Boot, unless you can use only signed production-level drivers (which isn't going to happen). To check that you aren't being forced to use signed modules, look at the kernel config in /boot:
grep CONFIG_MODULE_SIG_ALL config-4.18.0-25-generic
CONFIG_MODULE_SIG_ALL=y
grep CONFIG_MODULE_SIG_FORCE config-4.18.0-25-generic
# CONFIG_MODULE_SIG_FORCE is not set
Check the kernel log to see if an unsigned module is tainting the kernel and whether that's actually an issue:
vi /var/log/kern.log
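Two other quick checks: mokutil (if installed) reports the firmware's Secure Boot state, and the kernel exposes its taint flags (a non-zero value may simply mean an out-of-tree or unsigned module was loaded):
mokutil --sb-state
cat /proc/sys/kernel/tainted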
==Drivers==
Remember that TensorFlow uses CUDA, which in turn uses the video driver, so you should fix things in order!
If you want to install the latest driver from the PPA, you might need to update your repo list. Current repos are in
/etc/apt/sources.list.d/
If the Launchpad PPA was already added as a repo then you can:
cat /etc/apt/sources.list.d/graphics-drivers-ubuntu-ppa-bionic.list
See drivers from all of your repos with:
ubuntu-drivers devices
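If the PPA isn't present yet, one way to add it (assuming you want the Launchpad graphics-drivers archive referenced above) is:
add-apt-repository ppa:graphics-drivers/ppa
apt-get update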
===Module Diagnostics===
This section contains some notes from the depths of my problems:
lspci -vk shows Kernel modules: nvidiafb, nouveau and no Kernel driver in use.
It looks like nouveau is still blacklisted in /etc/modprobe.d/blacklist-nouveau.conf and /usr/bin/nvidia-persistenced --verbose is still being called in /etc/rc.local. ubuntu-drivers devices returns exactly what it did before we installed CUDA 10.1 too...
There is no /proc/driver/nvidia folder, and therefore no /proc/driver/nvidia/version file. We get the following:
/usr/bin/nvidia-persistenced --verbose
nvidia-persistenced failed to initialize. Check syslog for more details.
tail /var/log/syslog
...Jul 9 13:35:56 bastard kernel: [ 5314.526960] pcieport 0000:00:02.0: [12] Replay Timer Timeout
...Jul 9 13:35:56 bastard nvidia-persistenced: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
ls /dev/
...reveals no nvidia devices
nvidia-smi
...NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
.../etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
.../etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer
modprobe --resolve-alias nvidiafb
modinfo $(modprobe --resolve-alias nvidiafb)
lsof +D /usr/lib/xorg/modules/drivers/
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
Xorg 2488 root mem REG 8,49 23624 26346422 /usr/lib/xorg/modules/drivers/fbdev_drv.so
Xorg 2488 root mem REG 8,49 90360 26347089 /usr/lib/xorg/modules/drivers/modesetting_drv.so
Xorg 2488 root mem REG 8,49 217104 26346424 /usr/lib/xorg/modules/drivers/nouveau_drv.so
Xorg 2488 root mem REG 8,49 7813904 26346043 /usr/lib/xorg/modules/drivers/nvidia_drv.so
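When a packaged driver is installed but lspci shows no Kernel driver in use, it can be worth loading the module by hand and seeing whether the driver registers (a sketch, assuming the nvidia module was built for the running kernel):
modprobe nvidia
dmesg | tail -n 20
cat /proc/driver/nvidia/version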
===Determine and Set Driver===
You can use the nvidia-prime application:
apt-get install nvidia-prime
prime-select nvidia
Info: the nvidia profile is already set
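prime-select can also report which profile is currently selected:
prime-select query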
==Paths==
You can also use ldconfig to add the CUDA libraries to the dynamic linker's search path:
vi /etc/ld.so.conf.d/cuda.conf
/usr/local/cuda-10.0/lib64
ldconfig
Export paths by setting the global environment
vi /etc/environment
PATH="/home/researcher/anaconda3/bin:/home/researcher/anaconda3/condabin:/usr/local/cuda-10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64"
Now there is no need to activate a custom environment! Alternatively:
export PATH=/usr/local/cuda-10.0/bin:/usr/local/cuda-10.0${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export PYTHONPATH=/your/tensorflow/path:$PYTHONPATH
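A quick sanity check that the paths took effect (assuming the toolkit lives under /usr/local/cuda-10.0):
which nvcc
nvcc --version
ldconfig -p | grep libcudart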
==Persistenced==
For CUDA 10.1, we need nvidia-persistenced to be run at boot, so:
vi /etc/rc.local
#!/bin/sh -e
/usr/bin/nvidia-persistenced --verbose
exit 0
chmod +x /etc/rc.local
For CUDA 10.0, it runs as a service that is set up and launched by the install scripts.
/usr/bin/nvidia-persistenced --verbose
nvidia-persistenced failed to initialize. Check syslog for more details.
Jul 11 21:08:20 bastard sshd[3708]: Did not receive identification string from 94.190.53.14 port 3135
Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Verbose syslog connection opened
Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Directory /var/run/nvidia-persistenced will not be removed on exit
Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Failed to lock PID file: Resource temporarily unavailable
Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Shutdown (3714)
To check whether it is already running (the "Failed to lock PID file" message above means another instance already holds the lock):
ps aux | grep persistenced
nvidia-+ 2183 0.0 0.0 17324 1552 ? Ss 10:09 0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
root 3539 0.0 0.0 14428 1000 pts/0 S+ 10:10 0:00 grep --color=auto persistenced
And to see that it is running as a service:
systemctl list-units --type service --all | grep nvidia
nvidia-persistenced.service loaded active running NVIDIA Persistence Daemon
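Rather than calling it from rc.local, on installs that ship the systemd unit shown above you may be able to manage it directly, for example:
systemctl status nvidia-persistenced
systemctl enable nvidia-persistenced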
==XWindows==
To diagnose X Window issues:
cat /var/log/Xorg.0.log
] (II) LoadModule: "nvidia"
[ 29.047] (II) Loading /usr/lib/xorg/modules/drivers/nvidia_drv.so
[ 29.047] (II) Module nvidia: vendor="NVIDIA Corporation"
[ 29.047] compiled for 4.0.2, module version = 1.0.0
[ 29.047] Module class: X.Org Video Driver
[ 29.047] (II) NVIDIA dlloader X Driver 410.48 Thu Sep 6 06:27:34 CDT 2018
[ 29.047] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[ 29.047] (II) Loading sub module "fb"
[ 29.047] (II) LoadModule: "fb"
[ 29.047] (II) Loading /usr/lib/xorg/modules/libfb.so
[ 29.047] (II) Module fb: vendor="X.Org Foundation"
[ 29.047] compiled for 1.19.6, module version = 1.0.0
[ 29.047] ABI class: X.Org ANSI C Emulation, version 0.4
[ 29.047] (II) Loading sub module "wfb"
[ 29.047] (II) LoadModule: "wfb"
[ 29.047] (II) Loading /usr/lib/xorg/modules/libwfb.so
[ 29.048] (II) Module wfb: vendor="X.Org Foundation"
[ 29.048] compiled for 1.19.6, module version = 1.0.0
[ 29.048] ABI class: X.Org ANSI C Emulation, version 0.4
[ 29.048] (II) Loading sub module "ramdac"
[ 29.048] (II) LoadModule: "ramdac"
[ 29.048] (II) Module "ramdac" already built-in
[ 29.095] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[ 29.095] (EE) NVIDIA: system's kernel log for additional error messages and
[ 29.095] (EE) NVIDIA: consult the NVIDIA README for details.
[ 29.095] (EE) No devices detected.
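A quick way to pull just the errors and warnings out of the X log:
grep -E '\(EE\)|\(WW\)' /var/log/Xorg.0.log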
To use nvidia-settings, you have to be on the box itself with a running X session. Doing it over plain SSH will get you:
nvidia-settings --query FlatpanelNativeResolution
Unable to init server: Could not connect: Connection refused
==Tensorflow in Conda==
It turns out that TensorFlow 1.13.1 doesn't work with CUDA 10.1! But there is a workaround, which is to install CUDA 10 inside conda only (see https://github.com/tensorflow/tensorflow/issues/26182). We are also going to leave the system-wide CUDA 10.1 installation in place because TensorFlow will catch up at some point.
Still as your user account (and in the venv):
conda install cudatoolkit
conda install cudnn
conda install tensorflow-gpu
export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
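To confirm the conda-installed TensorFlow actually sees the GPUs (TensorFlow 1.x API):
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"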
==Remove Everything==
Sometimes the best option is to completely remove everything and start again.
To see what is currently installed (your output may look different):
dpkg -l|grep nvidia
ii libnvidia-container-tools 1.0.2-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.2-1 amd64 NVIDIA container runtime library
ii nvidia-container-runtime 2.0.0+docker18.09.7-3 amd64 NVIDIA container runtime
ii nvidia-container-runtime-hook 1.4.0-1 amd64 NVIDIA container runtime hook
ii nvidia-docker2 2.0.3+docker18.09.7-3 all nvidia-docker CLI wrapper
ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 418.56-0ubuntu0~gpu18.04.1 amd64 Tool for configuring the NVIDIA graphics driver
To completely remove all things NVIDIA (substitute the correct version number, and don't run the bits after the hash marks):
cd /usr/local/cuda-10.0/bin
./uninstall_cuda_10.0.pl
cd /usr/local
rm -r cuda-10.1/
nvidia-uninstall
apt-get purge nvidia* #Removed nvidia-prime, nvidia-settings, nvidia-container-runtime, nvidia-container-runtime-hook, nvidia-docker2
apt-get purge *nvidia* #Removed libnvidia-container-tools (1.0.2-1), Removing libnvidia-container1:amd64
apt autoremove #Removed libllvm7 libvdpau1 libxnvctrl0 linux-headers-4.18.0-18 linux-headers-4.18.0-18-generic linux-image-4.18.0-18-generic linux-modules-4.18.0-18-generic linux-modules-extra-4.18.0-18-generic mesa-vdpau-drivers pkg-config screen-resolution-extra vdpau-driver-all
cd /home/ed
rm -r NVIDIA_CUDA-10.1_Samples/
rm -r NVIDIA_CUDA-10.0_Samples/
dpkg --remove libcudnn7
dpkg --remove libcudnn7-dev
dpkg --remove libcudnn7-doc
Comment out the lines we added earlier in /etc/rc.local:
vi /etc/rc.local
shutdown -r now
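After the reboot, a quick way to confirm nothing NVIDIA-related is left behind:
dpkg -l | grep -i nvidia
lsmod | grep nvidia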