Changes

4,085 bytes removed, 19:40, 13 November 2020

no edit summary

This page details the build of our [[DIGITS DevBox]]. There's also a page giving information on [[Using the DevBox]]. NVIDIA, famous for their incredibly poor supply-chain and inventory management, have been saying [https://developer.nvidia.com/devbo "Please note that we are sold out of our inventory of the DIGITS DevBox, and no new systems are being built"] since shortly after the [https://en.wikipedia.org/wiki/GeForce_10_series Titan X] was the latest and greatest thing (i.e., somewhere around 2016). But it's pretty straightforward to update [https://www.azken.com/download/DIGITS_DEVBOX_DESIGN_GUIDE.pdf their spec].

==Introduction==

===Specification===

<onlyinclude>[[File:Top1000.jpg|right|300px]] Our [[DIGITS DevBox]], affectionately named ~~"Bastard"~~ after Lois McMaster Bujold's fifth God, has a Xeon E5-2620 v3 processor, 256GB of DDR4 RAM, two GPUs (one Titan RTX and one Titan Xp) with room for two more, a 500GB SSD (mounting /), and an 8TB RAID5 array bcached with a 512GB M.2 drive (mounting the /bulk share, which is available over Samba). It runs Ubuntu 18.04, CUDA 10.0, cuDNN 7.6.1, Anaconda3-2019.03, Python 3.7, TensorFlow 1.13, DIGITS 6, and other useful machine learning tools/libraries.</onlyinclude>

===Documentation===

Give the box a reboot!

===X Windows===

If you install the video driver before installing X Windows, you will need to manually edit the X Windows config files. So install the X Window System now. The easiest way is:

tasksel

And choose your favorite. We used Ubuntu Desktop.

And reboot again to make sure that everything is working nicely.

===Video Drivers===

driver : xserver-xorg-video-nouveau - distro free builtin

Then blacklist the nouveau driver (see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau) and reboot (to a text terminal~~, if you have deviated from these instructions and already installed X Windows~~) so that it isn't loaded.

apt-get install build-essential

gcc --version

~~wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.168_418.67_linux.run~~

vi /etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau

update-initramfs -u

shutdown -r now

Reboot to a text terminal

lspci -vk

Shows no kernel driver in use!
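The check above can be scripted rather than read by eye. This is only a minimal sketch: nouveau_in_use is a hypothetical helper name, and it assumes the usual lspci -vk layout in which a bound driver is reported on a "Kernel driver in use:" line.

```shell
# Hypothetical helper (not part of the original build notes).
# Reads `lspci -vk` output on stdin and reports whether any device
# is still bound to the nouveau kernel driver.
nouveau_in_use() {
    if grep -q 'Kernel driver in use: nouveau'; then
        echo "nouveau is still loaded"
        return 1
    fi
    echo "nouveau is not in use"
}

# Usage on the box itself (needs pciutils installed):
#   lspci -vk | nouveau_in_use
```

A "Kernel modules: nvidiafb, nouveau" line on its own is fine: it only lists the modules that could drive the card, not the one actually bound to it.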

~~You can install the driver directly either now or after installing Xwindows. If you do it before installing Xwindows, you will need to manually edit the Xwindows config files.~~ Install the driver!

apt install nvidia-driver-430

And change into the sample directory and run the tests:

cd /usr/local/cuda-10.0/samples/bin/x86_64/linux/release

./deviceQuery

./bandwidthTest

Everything should be good at this point!
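To avoid scrolling through the sample output, the pass/fail line can be checked directly. A minimal sketch, assuming the samples finish with a "Result = PASS" (or "Result = FAIL") line; cuda_sample_passed is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the original build notes).
# Succeeds only if the sample output on stdin contains "Result = PASS".
cuda_sample_passed() {
    grep -q 'Result = PASS'
}

# Usage, from the samples release directory:
#   ./deviceQuery | cuda_sample_passed && echo "deviceQuery OK"
#   ./bandwidthTest | cuda_sample_passed && echo "bandwidthTest OK"
```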


===Bcache===

This section follows https://developer.nvidia.com/rdp/digits-download. Install Docker CE first, following https://docs.docker.com/install/linux/docker-ce/ubuntu/

Then follow https://github.com/NVIDIA/nvidia-docker#quick-start to install nvidia-docker2, but change the last command to use CUDA 10.0:

...

sudo apt-get install -y nvidia-docker2

sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image

docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi

Then pull DIGITS using docker (https://hub.docker.com/r/nvidia/digits/):

*https://developer.nvidia.com/digits

Note: you can clean up stopped docker containers (and other unused docker data) with

docker system prune

====cuDNN====

First, make an installs directory in bulk and copy the installation files over from the RDP (E:\installs\DIGITS DevBox). Then:

cd /bulk/install/

dpkg -i libcudnn7_7.6.1.34-1+cuda10.0_amd64.deb

dpkg -i libcudnn7-dev_7.6.1.34-1+cuda10.0_amd64.deb

dpkg -i libcudnn7-doc_7.6.1.34-1+cuda10.0_amd64.deb

And test it:

pip install --upgrade tensorflow-gpu

python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

And this doesn't work. It turns out that tensorflow 1.13.1 doesn't work with CUDA 10.1! But there is a workaround, which is to install CUDA 10 in conda only (see https://github.com/tensorflow/tensorflow/issues/26182). We are also going to leave the installation of CUDA 10.1 in place because tensorflow will catch up at some point.

~~Still as researcher (and in the venv):~~

~~conda install cudatoolkit~~

~~conda install cudnn~~

~~conda install tensorflow-gpu~~

~~export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}~~

~~python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"~~

~~AND IT WORKS!~~

Note: to deactivate the virtual environment:

deactivate

Note that adding the anaconda path to /etc/environment makes the virtual environment redundant.

=====PyTorch and SciKit=====

*http://deeplearning.net/software/theano/install_ubuntu.html

===VNC===

In order to use the graphical interface for Matlab and other applications, we need a VNC server.

First, install the VNC client remotely. We use the standalone exe from TigerVNC.

Now install TightVNC, following the instructions: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-vnc-on-ubuntu-18-04

cd /root

apt-get install xfce4 xfce4-goodies

As user:

sudo apt-get install tightvncserver

vncserver

(set a password for the user when prompted)

vncserver -kill :1

mv ~/.vnc/xstartup ~/.vnc/xstartup.bak

vi ~/.vnc/xstartup

#!/bin/bash

xrdb $HOME/.Xresources

startxfce4 &

vncserver

sudo vi /etc/systemd/system/vncserver@.service

(replace uname below with the user name, which is ed on our box)

[Unit]

Description=Start TightVNC server at startup

After=syslog.target network.target

[Service]

Type=forking

User=uname

Group=uname

WorkingDirectory=/home/uname

PIDFile=/home/uname/.vnc/%H:%i.pid

ExecStartPre=-/usr/bin/vncserver -kill :%i > /dev/null 2>&1

ExecStart=/usr/bin/vncserver -depth 24 -geometry 1280x800 :%i

ExecStop=/usr/bin/vncserver -kill :%i

[Install]

WantedBy=multi-user.target
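Rather than typing the unit file into vi, it could be generated. The following is a sketch only: render_vnc_unit is a hypothetical helper, and it assumes the user's group has the same name as the user and that the home directory is /home/<user>.

```shell
# Hypothetical helper (not part of the original build notes).
# Prints a vncserver@.service unit for the given user to stdout.
# Assumes group name == user name and home directory == /home/<user>.
render_vnc_unit() {
    user="$1"
    cat <<EOF
[Unit]
Description=Start TightVNC server at startup
After=syslog.target network.target

[Service]
Type=forking
User=${user}
Group=${user}
WorkingDirectory=/home/${user}
PIDFile=/home/${user}/.vnc/%H:%i.pid
ExecStartPre=-/usr/bin/vncserver -kill :%i > /dev/null 2>&1
ExecStart=/usr/bin/vncserver -depth 24 -geometry 1280x800 :%i
ExecStop=/usr/bin/vncserver -kill :%i

[Install]
WantedBy=multi-user.target
EOF
}

# Usage:
#   render_vnc_unit ed | sudo tee /etc/systemd/system/vncserver@.service
```

The %H and %i specifiers are left untouched for systemd to expand: %i becomes the display number given after the @ when the unit is started.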

Then download cuda_10.0.130_410.48_linux.run from https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal, as well as cuda_10.0.130.1_linux.run.

Note that changing the color depth breaks it!

To make changes (or after the edit):

sudo systemctl daemon-reload

sudo systemctl enable vncserver@2.service

vncserver -kill :2

sudo systemctl start vncserver@2

sudo systemctl status vncserver@2

Stop the server with:

sudo systemctl stop vncserver@2

Note that we are using :2 because :1 is running our regular Xwindows GUI.
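The display number decides which TCP port the server listens on: by the usual VNC convention, display :N is served on port 5900 + N, so :2 ends up on 5902. A tiny sketch, with vnc_port as a hypothetical helper name:

```shell
# Hypothetical helper (not part of the original build notes).
# Maps a VNC display number to its TCP port (5900 + display).
vnc_port() {
    echo $((5900 + $1))
}

# Usage:
#   vnc_port 2    # port for display :2
```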

Instructions on how to set up an SSH tunnel using PuTTY: https://helpdeskgeek.com/how-to/tunnel-vnc-over-ssh/
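For clients that have OpenSSH rather than PuTTY, the same kind of tunnel can be described with ssh -L. This is a sketch of an alternative, not what we actually used: tunnel_command is a hypothetical helper that only prints the command, and the login name in the usage line is a placeholder.

```shell
# Hypothetical helper (not part of the original build notes).
# Prints (does not run) an OpenSSH command that forwards a local port
# to a port on the remote box. -N asks ssh not to run a remote command.
tunnel_command() {
    local_port="$1"
    remote_port="$2"
    target="$3"    # user@host; the user name is a placeholder
    echo "ssh -N -L ${local_port}:localhost:${remote_port} ${target}"
}

# Usage: forward local 5901 to the :2 display (5902) on the box
#   tunnel_command 5901 5902 user@192.168.2.202
```

With the tunnel up, the VNC client would connect to port 5901 on the local machine instead of to the box directly.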

====Connection Issues====

Coming back to this, I had issues connecting. I set up the tunnel using the saved profile in PuTTY.exe and checked to see which local port was listening (it was 5901) and not firewalled using the listening ports tab under network on resmon.exe (it said allowed, not restricted under firewall status). VNC seemed to be running fine on Bastard, and I tried connecting to localhost::1 (that is 5901 on the localhost, through the tunnel to 5902 on Bastard) using VNC Connect by RealVNC. The connection was refused.

I checked it was listening and there was no firewall:

netstat -tlpn

tcp 0 0 0.0.0.0:5902 0.0.0.0:* LISTEN 2025/Xtightvnc

ufw status

Status: inactive
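The listening check can also be scripted. A minimal sketch, assuming the standard netstat -tln column layout (local address in column 4, state in column 6); port_is_listening is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the original build notes).
# Reads `netstat -tln` (or -tlpn) output on stdin and succeeds if some
# TCP socket is in LISTEN state on the given port.
port_is_listening() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $6 == "LISTEN" {
            # The port is whatever follows the last colon of the local address.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}

# Usage:
#   netstat -tln | port_is_listening 5902 && echo "VNC :2 is listening"
```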

The localhost port seems to be open and listening just fine:

Test-NetConnection 127.0.0.1 -p 5901

So, presumably, there must be something wrong with the tunnel itself.

'''Ignoring the SSH tunnel worked fine: Connect to 192.168.2.202::5902 using the TightVNC (or RealVNC, etc.) client.'''

===RDP===

I also installed xrdp:

apt install xrdp

adduser xrdp ssl-cert

#Check the status and that it is listening on 3389

systemctl status xrdp

netstat -utln

#It is listening...

vi /etc/xrdp/xrdp.ini

#See https://linux.die.net/man/5/xrdp.ini

systemctl restart xrdp

This gave a dead session (a flat light blue screen with nothing on it), which finally yielded a connection log which said "login successful for display 10, start connecting, connection problems, giving up, some problem."

cat /var/log/xrdp-sesman.log

There could be some conflict between VNC and RDP. systemctl status xrdp shows "xrdp_wm_log_msg: connection problem, giving up".

I tried without success:

gsettings set org.gnome.Vino require-encryption false

https://askubuntu.com/questions/797973/error-problem-connecting-windows-10-rdp-into-xrdp

vi /etc/X11/Xwrapper.config

allowed_users = anybody

This was promising as it was previously set to console.

https://www.linuxquestions.org/questions/linux-software-2/xrdp-under-debian-9-connection-problem-4175623357/#post5817508

apt-get install xorgxrdp-hwe-18.04

Couldn't find the package... This lead was promising as it applies to 18.04.02 HWE, which is what I'm running.

https://www.nakivo.com/blog/how-to-use-remote-desktop-connection-ubuntu-linux-walkthrough/

dpkg -l |grep xserver-xorg-core

ii xserver-xorg-core 2:1.19.6-1ubuntu4.3 amd64 Xorg X server - core server

Which seems ok, despite having a problem with XRDP and Ubuntu 18.04 HWE documented very clearly here: http://c-nergy.be/blog/?p=13972

There is clearly an issue with Ubuntu 18.04 and XRDP. The solution seems to be to downgrade xserver-xorg-core and some related packages, which can be done with an install script (https://c-nergy.be/blog/?p=13933) or manually. But I don't want to do that, so I removed xrdp and went back to VNC!

apt remove xrdp

==Other Software==

I installed the community edition of PyCharm:

snap install pycharm-community --classic

#Restart the local terminal so that it has updated paths (after a snap install, etc.)

/snap/pycharm-community/214/bin/pycharm.sh

On launch, you get some config options. I chose to install and enable:

*IdeaVim (a VI editor emulator)

*R

*AWS Toolkit

Make a launcher: In /usr/share/applications:

vi pycharm.desktop

[Desktop Entry]

Version=2020.2.3

Type=Application

Name=PyCharm

Icon=/snap/pycharm-community/214/bin/pycharm.png

Exec="/snap/pycharm-community/214/bin/pycharm.sh" %f

Comment=The Drive to Develop

Categories=Development;IDE;

Terminal=false

StartupWMClass=jetbrains-pycharm

Also, create a launcher on the desktop with the same info.
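Since the same entry is needed twice (in /usr/share/applications and on the desktop), it may be easier to generate it. A sketch only: render_pycharm_launcher is a hypothetical helper, and the snap revision and version string are passed in as arguments rather than hard-coded.

```shell
# Hypothetical helper (not part of the original build notes).
# Prints a PyCharm .desktop entry for the given snap revision and
# version string to stdout.
render_pycharm_launcher() {
    rev="$1"
    version="$2"
    bin="/snap/pycharm-community/${rev}/bin"
    cat <<EOF
[Desktop Entry]
Version=${version}
Type=Application
Name=PyCharm
Icon=${bin}/pycharm.png
Exec="${bin}/pycharm.sh" %f
Comment=The Drive to Develop
Categories=Development;IDE;
Terminal=false
StartupWMClass=jetbrains-pycharm
EOF
}

# Usage:
#   render_pycharm_launcher 214 2020.2.3 | sudo tee /usr/share/applications/pycharm.desktop
#   render_pycharm_launcher 214 2020.2.3 > ~/Desktop/pycharm.desktop
```

The revision directory (214 here) changes when the snap is refreshed. Passing current instead of a number should keep the launcher working across refreshes, since snapd maintains a current symlink next to the numbered directories, though we have not tested that on this box.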

Ed


DIGITS DevBox (view source)

Revision as of 19:40, 13 November 2020
