Changes

DIGITS DevBox (view source)

Revision as of 12:35, 17 July 2019

8,361 bytes removed , 12:35, 17 July 2019

no edit summary

Give the box a reboot!

===X Windows===

If you install the video driver before installing X Windows, you will need to edit the X Windows config files by hand, so install the X Window System now. The easiest way is:

tasksel

And choose your favorite. We used Ubuntu Desktop.

And reboot again to make sure that everything is working nicely.

===Video Drivers===

driver : xserver-xorg-video-nouveau - distro free builtin

Then blacklist the nouveau driver (see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau) and reboot (to a text terminal~~, if you have deviated from these instructions and already installed X Windows~~) so that it isn't loaded.

apt-get install build-essential

gcc --version

~~wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.168_418.67_linux.run~~

vi /etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau

update-initramfs -u

shutdown -r now
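The blacklist file can also be written without opening an editor. A minimal sketch, assuming a POSIX shell; the function name `write_nouveau_blacklist` is hypothetical, and the target path is a parameter only so that it can be tried on a scratch file first:

```shell
# Write the one-line nouveau blacklist file without an editor.
# On the box the target would be /etc/modprobe.d/blacklist-nouveau.conf.
write_nouveau_blacklist() {
    target="$1"
    printf 'blacklist nouveau\n' > "$target"
}

# Example (as root), followed by the same steps as above:
#   write_nouveau_blacklist /etc/modprobe.d/blacklist-nouveau.conf
#   update-initramfs -u
#   shutdown -r now
```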

Reboot to a text terminal

lspci -vk

Shows no kernel driver in use!
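A quicker check than reading the lspci output is to look for nouveau in the loaded module list. A minimal sketch, assuming a POSIX shell; the function name `nouveau_loaded` and its optional file argument are hypothetical (the argument exists so the check can be tried on a sample file):

```shell
# Return 0 if nouveau appears in a /proc/modules-style listing.
# Each line of /proc/modules starts with the module name followed by a space.
nouveau_loaded() {
    modules_file="${1:-/proc/modules}"
    grep -q '^nouveau ' "$modules_file"
}

if [ -r /proc/modules ]; then
    if nouveau_loaded; then
        echo "nouveau is still loaded; recheck the blacklist and the initramfs"
    else
        echo "nouveau is not loaded"
    fi
fi
```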

~~You can install the driver directly either now or after installing Xwindows. If you do it before installing Xwindows, you will need to manually edit the Xwindows config files.~~

Install the driver!

apt install nvidia-driver-430

Everything should be good at this point!
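One way to confirm that the driver really is loaded, sketched under the assumption that the NVIDIA kernel module exposes /proc/driver/nvidia/version while it is loaded. The function name `nvidia_driver_version` and its optional file argument are hypothetical:

```shell
# Print the first line of the driver's version file, or fail if it is missing.
# The optional argument is a testing hook; the default is the real path.
nvidia_driver_version() {
    version_file="${1:-/proc/driver/nvidia/version}"
    if [ -r "$version_file" ]; then
        head -n 1 "$version_file"
    else
        echo "NVIDIA kernel driver does not appear to be loaded" >&2
        return 1
    fi
}

# Example: only call nvidia-smi when the kernel module is present.
#   nvidia_driver_version && nvidia-smi
```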

~~===X Windows===~~

~~Now install the X window system. The easiest way is:~~

~~tasksel~~

~~And choose your favorite. We used Ubuntu Desktop.~~

~~And reboot again to make sure that everything is working nicely.~~

===Bcache===

This section follows https://developer.nvidia.com/rdp/digits-download. Install Docker CE first, following https://docs.docker.com/install/linux/docker-ce/ubuntu/

Then follow https://github.com/NVIDIA/nvidia-docker#quick-start to install nvidia-docker2, but change the last command to use CUDA 10.0

...

sudo apt-get install -y nvidia-docker2

sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image

docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi
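The image tag has to name a CUDA release that the installed driver supports, so it can help to keep the tag in one place. A minimal sketch, assuming a POSIX shell; the function name `cuda_smoke_cmd` and the `CUDA_TAG` variable are hypothetical, and the default tag `10.0-base` is an assumption based on the CUDA 10.0 packages used elsewhere on this page:

```shell
# Build the nvidia-smi smoke-test command for a given CUDA image tag.
# The tag comes from the first argument, then CUDA_TAG, then an assumed default.
cuda_smoke_cmd() {
    tag="${1:-${CUDA_TAG:-10.0-base}}"
    printf 'docker run --runtime=nvidia --rm nvidia/cuda:%s nvidia-smi\n' "$tag"
}

# Print the command for inspection; pipe it to sh to actually run it.
cuda_smoke_cmd
```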

Then pull DIGITS using docker (https://hub.docker.com/r/nvidia/digits/):

*https://developer.nvidia.com/digits

Note: you can remove stopped docker containers (along with unused networks, dangling images, and build cache) with

docker system prune

====cuDNN====

First, make an install directory in /bulk and copy the installation files over from the RDP (E:\installs\DIGITS DevBox). Then:

cd /bulk/install/

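~~dpkg -i libcudnn7_7.5.1.10-1+cuda10.1_amd64.deb~~

~~dpkg -i libcudnn7-dev_7.5.1.10-1+cuda10.1_amd64.deb~~

~~dpkg -i libcudnn7-doc_7.5.1.10-1+cuda10.1_amd64.deb~~

dpkg -i libcudnn7_7.6.1.34-1+cuda10.0_amd64.deb

dpkg -i libcudnn7-dev_7.6.1.34-1+cuda10.0_amd64.deb

dpkg -i libcudnn7-doc_7.6.1.34-1+cuda10.0_amd64.deb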
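The three packages go in a fixed order, runtime first, because the dev package depends on the runtime library. A minimal sketch that generates the file names in that order, assuming a POSIX shell; the function name `cudnn_debs` is hypothetical, and the version strings in the example are read from the commands above and should be checked against the actual files in /bulk/install/:

```shell
# List the three cuDNN packages in install order (runtime, dev, doc).
# Arguments: cuDNN package version and CUDA version.
cudnn_debs() {
    cudnn_ver="$1"
    cuda_ver="$2"
    for pkg in libcudnn7 libcudnn7-dev libcudnn7-doc; do
        printf '%s_%s+cuda%s_amd64.deb\n' "$pkg" "$cudnn_ver" "$cuda_ver"
    done
}

# As root, in /bulk/install/:
#   for deb in $(cudnn_debs 7.6.1.34-1 10.0); do dpkg -i "$deb"; done
cudnn_debs 7.6.1.34-1 10.0
```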

And test it:

pip install --upgrade tensorflow-gpu

python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

And this doesn't work. It turns out that tensorflow 1.13.1 doesn't work with CUDA 10.1! But there is a workaround, which is to install CUDA 10.0 inside conda only (see https://github.com/tensorflow/tensorflow/issues/26182). We are also going to leave CUDA 10.1 installed, because tensorflow will catch up at some point.

~~Still as researcher (and in the venv):~~

~~conda install cudatoolkit~~

~~conda install cudnn~~

~~conda install tensorflow-gpu~~

~~export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}~~

~~python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"~~

~~AND IT WORKS!~~

Note: to deactivate the virtual environment:

deactivate

Note that adding the anaconda path to /etc/environment makes the virtual environment redundant.

=====PyTorch and SciKit=====

*http://deeplearning.net/software/theano/install_ubuntu.html

~~==Video Driver Issue==~~

~~After logging into the box sometime later, it seems that the video drivers are no longer loading, presumably as a consequence of some update or something.~~

~~===Testing===~~

~~nvidia-settings --query FlatpanelNativeResolution~~

~~ERROR: NVIDIA driver is not loaded~~

~~cd /usr/local/cuda-10.1/samples/bin/x86_64/linux/release~~

~~./deviceQuery~~

~~CUDA Device Query (Runtime API) version (CUDART static linking)~~

~~cudaGetDeviceCount returned 100~~

~~-> no CUDA-capable device is detected~~

~~Result = FAIL~~

~~./mnistCUDNN~~

~~cudnnGetVersion() : 7501 , CUDNN_VERSION from cudnn.h : 7501 (7.5.1)~~

~~Cuda failurer version : GCC 7.4.0~~

~~Error: no CUDA-capable device is detected~~

~~error_util.h:93~~

~~Aborting...~~

~~And as researcher:~~

~~cd /home/researcher/~~

~~source ./venv/bin/activate~~

~~python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"~~

~~... failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected~~

~~...kernel driver does not appear to be running on this host (bastard): /proc/driver/nvidia/version does not exist~~

~~lspci -vk shows Kernel modules: nvidiafb, nouveau and no Kernel driver in use. It looks like nouveau is still blacklisted in /etc/modprobe.d/blacklist-nouveau.conf and /usr/bin/nvidia-persistenced --verbose is still being called in /etc/rc.local. ubuntu-drivers devices returns exactly what it did before we installed CUDA 10.1 too...~~

~~There is no /proc/driver/nvidia folder, and therefore no /proc/driver/nvidia/version file found. We get the following:~~

~~/usr/bin/nvidia-persistenced --verbose~~

~~nvidia-persistenced failed to initialize. Check syslog for more details.~~

~~tail /var/log/syslog~~

~~...Jul 9 13:35:56 bastard kernel: [ 5314.526960] pcieport 0000:00:02.0: [12] Replay Timer Timeout~~

~~...Jul 9 13:35:56 bastard nvidia-persistenced: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.~~

~~ls /dev/~~

~~...reveals no nvidia devices~~

~~nvidia-smi~~

~~...NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.~~

~~grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*~~

~~.../etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb~~

~~.../etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer~~

~~===Uninstall/Reinstall===~~

~~Am going to try uninstalling CUDA 10.1 and the current Nvidia driver, and then reinstalling CUDA 10.0~~

~~/usr/local/cuda-10.1/bin/cuda-uninstaller~~

~~nvidia-uninstall~~

~~WARNING: Your driver installation has been altered since it was initially installed; this may happen, for example, if you have since installed the NVIDIA driver through a mechanism other than nvidia-installer (such as your distribution's native package management system). nvidia-installer will attempt to uninstall as best it can. Please see the file '/var/log/nvidia-uninstall.log' for details.~~

~~WARNING: Failed to delete some directories. See /var/log/nvidia-uninstall.log for details.~~

~~Uninstallation of existing driver: NVIDIA Accelerated Graphics Driver for Linux-x86_64 (418.67) is complete.~~

~~Then download cuda_10.0.130_410.48_linux.run from https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal, as well as cuda_10.0.130.1_linux.run.~~

~~sudo su~~

~~cd /bulk/install~~

~~./cuda_10.0.130_410.48_linux.run~~

~~accept all defaults and install everything (including 410.something NVIDIA driver)~~

~~===========~~

~~Driver: Installed~~

~~Toolkit: Installed in /usr/local/cuda-10.0~~

~~Samples: Installed in /home/ed, but missing recommended libraries~~

~~Please make sure that~~

~~- PATH includes /usr/local/cuda-10.0/bin~~

~~- LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root~~

~~To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin~~

~~To uninstall the NVIDIA Driver, run nvidia-uninstall~~

~~Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.~~

~~Logfile is /tmp/cuda_install_8524.log~~

~~Fix the paths:~~

~~export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}~~

~~export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}~~

~~Also~~

~~vi /etc/ld.so.conf.d/cuda.conf~~

~~/usr/local/cuda-10.0/lib64~~

~~ldconfig~~

~~Finally:~~

~~./cuda_10.0.130.1.run~~

~~accept all defaults~~

~~Unfortunately this didn't work. After a reboot:~~

~~nvidia-settings --query FlatpanelNativeResolution~~

~~Unable to init server: Could not connect: Connection refused (message was same as before on the box, this was over ssh)~~

~~./deviceQuery Starting...~~

~~CUDA Device Query (Runtime API) version (CUDART static linking)~~

~~cudaGetDeviceCount returned 35~~

~~-> CUDA driver version is insufficient for CUDA runtime version~~

~~Result = FAIL~~

~~python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"~~

~~2019-07-09 15:20:40.085877: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected~~

~~2019-07-09 15:20:40.085978: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:148] kernel driver does not appear to be running on this host (bastard): /proc/driver/nvidia/version does not exist~~

~~/usr/bin/nvidia-persistenced --verbose~~

~~nvidia-persistenced failed to initialize. Check syslog for more details.~~

~~lspci -vk also returned the same as before. This is really frustrating!~~

~~Did the following:~~

~~apt-get install nvidia-prime~~

~~prime-select nvidia~~

~~Info: the nvidia profile is already set~~

~~update-initramfs -u~~

===VNC===

In order to use the graphical interface for Matlab and other applications, we need a VNC server.

~~For next time (as root):~~

~~lshw -c video~~

~~... shows configuration without driver~~

First, install a VNC client on the machine you will connect from. We use the standalone exe from TigerVNC.

~~modprobe --resolve-alias nvidiafb~~

~~modinfo $(modprobe --resolve-alias nvidiafb)~~

Now install TightVNC on the box, following the instructions at https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-vnc-on-ubuntu-18-04

~~lsof +D /usr/lib/xorg/modules/drivers/~~

~~COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME~~

~~Xorg 2488 root mem REG 8,49 23624 26346422 /usr/lib/xorg/modules/drivers/fbdev_drv.so~~

~~Xorg 2488 root mem REG 8,49 90360 26347089 /usr/lib/xorg/modules/drivers/modesetting_drv.so~~

~~Xorg 2488 root mem REG 8,49 217104 26346424 /usr/lib/xorg/modules/drivers/nouveau_drv.so~~

~~Xorg 2488 root mem REG 8,49 7813904 26346043 /usr/lib/xorg/modules/drivers/nvidia_drv.so~~

cd /root

apt-get install xfce4 xfce4-goodies

~~cat /var/log/Xorg.0.log~~

As user:

sudo apt-get install tightvncserver

vncserver

Set a password for the user when prompted, then stop the server and replace the startup script:

vncserver -kill :1

mv ~/.vnc/xstartup ~/.vnc/xstartup.bak

vi ~/.vnc/xstartup

#!/bin/bash

xrdb $HOME/.Xresources

startxfce4 &

Start the server again to check the new startup script:

vncserver

Then create a systemd unit file, replacing uname with the user name:

sudo vi /etc/systemd/system/vncserver@.service

[Unit]

Description=Start TightVNC server at startup

After=syslog.target network.target

[Service]

Type=forking

User=uname

Group=uname

WorkingDirectory=/home/uname

PIDFile=/home/uname/.vnc/%H:%i.pid

ExecStartPre=-/usr/bin/vncserver -kill :%i > /dev/null 2>&1

ExecStart=/usr/bin/vncserver -depth 24 -geometry 1280x800 :%i

ExecStop=/usr/bin/vncserver -kill :%i

[Install]

WantedBy=multi-user.target

~~] (II) LoadModule: "nvidia"~~

~~[ 29.047] (II) Loading /usr/lib/xorg/modules/drivers/nvidia_drv.so~~

~~[ 29.047] (II) Module nvidia: vendor="NVIDIA Corporation"~~

~~[ 29.047] compiled for 4.0.2, module version = 1.0.0~~

~~[ 29.047] Module class: X.Org Video Driver~~

~~[ 29.047] (II) NVIDIA dlloader X Driver 410.48 Thu Sep 6 06:27:34 CDT 2018~~

~~[ 29.047] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs~~

~~[ 29.047] (II) Loading sub module "fb"~~

~~[ 29.047] (II) LoadModule: "fb"~~

~~[ 29.047] (II) Loading /usr/lib/xorg/modules/libfb.so~~

~~[ 29.047] (II) Module fb: vendor="X.Org Foundation"~~

~~[ 29.047] compiled for 1.19.6, module version = 1.0.0~~

~~[ 29.047] ABI class: X.Org ANSI C Emulation, version 0.4~~

~~[ 29.047] (II) Loading sub module "wfb"~~

~~[ 29.047] (II) LoadModule: "wfb"~~

~~[ 29.047] (II) Loading /usr/lib/xorg/modules/libwfb.so~~

~~[ 29.048] (II) Module wfb: vendor="X.Org Foundation"~~

~~[ 29.048] compiled for 1.19.6, module version = 1.0.0~~

~~[ 29.048] ABI class: X.Org ANSI C Emulation, version 0.4~~

~~[ 29.048] (II) Loading sub module "ramdac"~~

~~[ 29.048] (II) LoadModule: "ramdac"~~

~~[ 29.048] (II) Module "ramdac" already built-in~~

~~[ 29.095] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the~~

~~[ 29.095] (EE) NVIDIA: system's kernel log for additional error messages and~~

~~[ 29.095] (EE) NVIDIA: consult the NVIDIA README for details.~~

~~[ 29.095] (EE) No devices detected.~~

Note that changing the color depth (the -depth 24 in ExecStart) breaks it!

~~vi /var/log/kern.log~~

~~... it looks like we are back to an unsigned module tainting the kernel.~~

To apply the unit file (or after any later edit to it):

sudo systemctl daemon-reload

sudo systemctl enable vncserver@2.service

vncserver -kill :2

sudo systemctl start vncserver@2

sudo systemctl status vncserver@2

~~vi /etc/default/grub~~

~~GRUB_DEFAULT=0~~

~~GRUB_TIMEOUT_STYLE=hidden~~

~~GRUB_TIMEOUT=2~~

~~GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`~~

~~GRUB_CMDLINE_LINUX_DEFAULT="nvidia-drm.modeset=1"~~

~~GRUB_CMDLINE_LINUX=""~~

Stop the server with:

sudo systemctl stop vncserver@2

~~update-grub~~

~~Sourcing file `/etc/default/grub'~~

~~Generating grub configuration file ...~~

~~Found linux image: /boot/vmlinuz-4.18.0-25-generic~~

~~Found initrd image: /boot/initrd.img-4.18.0-25-generic~~

~~Found linux image: /boot/vmlinuz-4.18.0-20-generic~~

~~Found initrd image: /boot/initrd.img-4.18.0-20-generic~~

~~Found linux image: /boot/vmlinuz-4.18.0-18-generic~~

~~Found initrd image: /boot/initrd.img-4.18.0-18-generic~~

~~Found linux image: /boot/vmlinuz-4.15.0-54-generic~~

~~Found initrd image: /boot/initrd.img-4.15.0-54-generic~~

~~Found memtest86+ image: /boot/memtest86+.elf~~

~~Found memtest86+ image: /boot/memtest86+.bin~~

~~device-mapper: reload ioctl on osprober-linux-nvme0n1p1 failed: Device or resource busy~~

~~Command failed~~

~~done~~

Note that we are using :2 because :1 is running our regular X Windows GUI.
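The display number also determines the TCP port the server listens on: by convention a VNC server on display :N uses port 5900 + N. A minimal sketch of that mapping, assuming a POSIX shell; the function name `vnc_port` is hypothetical:

```shell
# Convert a VNC display number to its TCP port (5900 + display).
vnc_port() {
    display="${1#:}"    # accept either "2" or ":2"
    echo $((5900 + display))
}

vnc_port :2
```

So the vncserver@2 instance is reached on port 5902, while the regular GUI on :1 would correspond to 5901.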

~~https://askubuntu.com/questions/1048274/ubuntu-18-04-stopped-working-with-nvidia-drivers/~~

Instructions on how to set up an SSH tunnel for VNC using PuTTY: https://helpdeskgeek.com/how-to/tunnel-vnc-over-ssh/
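From a machine with OpenSSH rather than PuTTY, the same tunnel can be opened with local port forwarding. A minimal sketch, assuming a POSIX shell and that the VNC server runs on display :2 as above; the function name `vnc_tunnel_cmd` and the host name `devbox.example` are hypothetical placeholders:

```shell
# Print an OpenSSH command that forwards a local port to the box's VNC port.
# Arguments: display number (2 or :2) and the user@host to log in to.
vnc_tunnel_cmd() {
    display="${1#:}"
    user_at_host="$2"
    port=$((5900 + display))
    # -N: no remote command, -L: forward local port to localhost:port on the box
    printf 'ssh -N -L %s:localhost:%s %s\n' "$port" "$port" "$user_at_host"
}

vnc_tunnel_cmd 2 uname@devbox.example
```

With the tunnel up, the VNC client connects to port 5902 on localhost instead of to the box directly.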

Ed

