Changes

4,541 bytes added ,  15:14, 9 July 2019
Theano v1 requires Python >=3.4 and <3.6, but we are currently running 3.7. If we decide to install Theano, we'll need to set up another version of Python and another virtual environment. See:
*http://deeplearning.net/software/theano/install_ubuntu.html
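
The version constraint above can be checked before attempting an install. The following is a minimal sketch and not part of the original setup: ver_num and py_version_ok are hypothetical helper names, and the comparison assumes version strings with at least a major.minor part.

```shell
# Hypothetical helpers (not from the original notes).
# ver_num turns "3.7" or "3.7.3" into a comparable integer (3007).
# Assumes the string has at least a major.minor part.
ver_num() {
    major="${1%%.*}"
    rest="${1#*.}"
    minor="${rest%%.*}"
    echo $(( major * 1000 + minor ))
}

# Succeeds if version $1 satisfies: $2 <= version < $3.
py_version_ok() {
    v=$(ver_num "$1")
    [ "$v" -ge "$(ver_num "$2")" ] && [ "$v" -lt "$(ver_num "$3")" ]
}

# Example: the version this box runs (3.7) against the range quoted above.
if py_version_ok 3.7 3.4 3.6; then
    echo "Python 3.7 is inside the supported range"
else
    echo "Python 3.7 is outside the supported range; a separate interpreter is needed"
fi
```

To test the live interpreter instead of a literal, the first argument could be replaced with the output of python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])'.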
 
==Video Driver Issue==
 
When we logged into the box some time later, the video drivers were no longer loading, presumably as a consequence of an update.
 
nvidia-settings --query FlatpanelNativeResolution
ERROR: NVIDIA driver is not loaded
 
cd /usr/local/cuda-10.1/samples/bin/x86_64/linux/release
./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 100
-> no CUDA-capable device is detected
Result = FAIL
 
./mnistCUDNN
cudnnGetVersion() : 7501 , CUDNN_VERSION from cudnn.h : 7501 (7.5.1)
Cuda failurer version : GCC 7.4.0
Error: no CUDA-capable device is detected
error_util.h:93
Aborting...
 
And as the researcher user:
cd /home/researcher/
source ./venv/bin/activate
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
... failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
...kernel driver does not appear to be running on this host (bastard): /proc/driver/nvidia/version does not exist
 
lspci -vk shows Kernel modules: nvidiafb, nouveau, but no Kernel driver in use.
 
It looks like nouveau is still blacklisted in /etc/modprobe.d/blacklist-nouveau.conf, and /usr/bin/nvidia-persistenced --verbose is still being called in /etc/rc.local. Running ubuntu-drivers devices also returns exactly what it did before we installed CUDA 10.1.
 
There is no /proc/driver/nvidia folder, and therefore no /proc/driver/nvidia/version file. Starting nvidia-persistenced by hand gives the following:
/usr/bin/nvidia-persistenced --verbose
nvidia-persistenced failed to initialize. Check syslog for more details.
tail /var/log/syslog
...Jul 9 13:35:56 bastard kernel: [ 5314.526960] pcieport 0000:00:02.0: [12] Replay Timer Timeout
...Jul 9 13:35:56 bastard nvidia-persistenced: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
ls /dev/
...reveals no nvidia devices
nvidia-smi
...NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
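
The manual checks above (the /proc/driver/nvidia/version file and the /dev/nvidia* device nodes) can be bundled into one quick test. This is only a sketch: nvidia_driver_status is a hypothetical function name, and its optional root-prefix argument is an assumption added so the check can be tried against a scratch directory rather than the real filesystem.

```shell
# Hypothetical helper (not from the original notes).
# Reports whether the NVIDIA kernel driver looks loaded, based on the same
# two symptoms checked by hand above. $1 is an optional root prefix
# (default: empty, meaning the real /proc and /dev).
nvidia_driver_status() {
    root="${1:-}"
    status=0
    if [ -r "$root/proc/driver/nvidia/version" ]; then
        echo "driver version file: present"
    else
        echo "driver version file: missing"
        status=1
    fi
    # Count device nodes such as /dev/nvidia0 and /dev/nvidiactl.
    count=0
    for node in "$root"/dev/nvidia*; do
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "device nodes: $count"
    if [ "$count" -eq 0 ]; then
        status=1
    fi
    return "$status"
}

nvidia_driver_status || echo "NVIDIA driver does not appear to be loaded"
```

A non-zero return here matches the state described in this section: no version file and no device nodes.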
 
grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
.../etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
.../etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer
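
A plain grep for nvidia matches comments and file names as well as real directives. A stricter check for actual blacklist lines might look like the sketch below; is_blacklisted is a hypothetical helper name, and it only looks at *.conf files in the directories passed to it.

```shell
# Hypothetical helper (not from the original notes).
# Succeeds if module $1 has an exact "blacklist <module>" line in any
# *.conf file under the directories given as the remaining arguments.
is_blacklisted() {
    module="$1"
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # -E extended regex, -q quiet, -s ignore unreadable or missing files.
        if grep -Eqs "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*$" "$dir"/*.conf; then
            return 0
        fi
    done
    return 1
}

for m in nouveau nvidiafb nvidia; do
    if is_blacklisted "$m" /etc/modprobe.d /lib/modprobe.d; then
        echo "$m: blacklisted"
    else
        echo "$m: not blacklisted"
    fi
done
```

Because the pattern is anchored to the whole line, blacklist nvidiafb does not count as a blacklist entry for nvidia itself.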
 
I am going to try uninstalling CUDA 10.1 and the current NVIDIA driver, and then reinstalling CUDA 10.0:
/usr/local/cuda-10.1/bin/cuda-uninstaller
nvidia-uninstall
 
WARNING: Your driver installation has been altered since it was initially installed; this may happen, for example, if
you have since installed the NVIDIA driver through a mechanism other than nvidia-installer (such as your
distribution's native package management system). nvidia-installer will attempt to uninstall as best it
can. Please see the file '/var/log/nvidia-uninstall.log' for details.
WARNING: Failed to delete some directories. See /var/log/nvidia-uninstall.log for details.
Uninstallation of existing driver: NVIDIA Accelerated Graphics Driver for Linux-x86_64 (418.67) is complete.
 
Then download cuda_10.0.130_410.48_linux.run from https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal, as well as cuda_10.0.130.1_linux.run.
 
sudo su
cd /bulk/install
./cuda_10.0.130_410.48_linux.run
Accept all defaults and install everything (including the bundled 410.48 NVIDIA driver, per the installer file name).
 
===========
Driver: Installed
Toolkit: Installed in /usr/local/cuda-10.0
Samples: Installed in /home/ed, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-10.0/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.
Logfile is /tmp/cuda_install_8524.log
 
Fix the paths:
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
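
The ${VAR:+:${VAR}} form in the exports above appends a colon and the old value only when the variable is already set and non-empty, which avoids a stray trailing colon (an empty entry in LD_LIBRARY_PATH is treated as the current directory). The sketch below shows the same expansion wrapped in a hypothetical helper, prepend_once, that also skips directories already listed, so re-running it does not create duplicates.

```shell
# Hypothetical helper (not from the original notes).
# Prints CURRENT with DIR prepended, unless DIR is already one of its
# colon-separated entries. Usage: prepend_once DIR CURRENT
prepend_once() {
    dir="$1"
    cur="${2:-}"
    case ":$cur:" in
        *":$dir:"*) printf '%s\n' "$cur" ;;              # already present
        *)          printf '%s\n' "$dir${cur:+:$cur}" ;; # same expansion as above
    esac
}

# Empty starting value: no trailing colon is produced.
prepend_once /usr/local/cuda-10.0/lib64 ""
# Existing value: the new directory goes in front.
prepend_once /usr/local/cuda-10.0/bin "/usr/bin:/bin"
# Already present: the value is returned unchanged.
prepend_once /usr/local/cuda-10.0/bin "/usr/local/cuda-10.0/bin:/usr/bin"
```

In a login script this could be used as PATH=$(prepend_once /usr/local/cuda-10.0/bin "$PATH"); export PATH.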
 
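Also add the library directory to /etc/ld.so.conf.d/cuda.conf (the file should contain the single line shown), then refresh the linker cache with ldconfig: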
vi /etc/ld.so.conf.d/cuda.conf
/usr/local/cuda-10.0/lib64
ldconfig
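
The edit in vi can also be done non-interactively. This is a sketch under assumptions: add_ld_conf_entry is a hypothetical helper, and the demonstration writes to a scratch file rather than the real /etc/ld.so.conf.d/cuda.conf, which needs root.

```shell
# Hypothetical helper (not from the original notes).
# Appends LIBDIR to CONF unless the file already contains that exact line.
# Usage: add_ld_conf_entry CONF LIBDIR
add_ld_conf_entry() {
    conf="$1"
    libdir="$2"
    # -x matches the whole line, -F treats the path as a literal string.
    if [ -f "$conf" ] && grep -qxF -- "$libdir" "$conf"; then
        return 0
    fi
    printf '%s\n' "$libdir" >> "$conf"
}

# Demonstration against a scratch file; calling twice leaves a single line.
scratch=$(mktemp)
add_ld_conf_entry "$scratch" /usr/local/cuda-10.0/lib64
add_ld_conf_entry "$scratch" /usr/local/cuda-10.0/lib64
cat "$scratch"
rm -f "$scratch"
```

On the real box, as root, the equivalent would be add_ld_conf_entry /etc/ld.so.conf.d/cuda.conf /usr/local/cuda-10.0/lib64 followed by ldconfig.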
 
Finally:
./cuda_10.0.130.1_linux.run
accept all defaults
