This page details the build of our [[DIGITS DevBox]]. There's also a page giving information on [[Using the DevBox]]. NVIDIA, famous for their incredibly poor supply-chain and inventory management, have been saying [https://developer.nvidia.com/devbox "Please note that we are sold out of our inventory of the DIGITS DevBox, and no new systems are being built"] since shortly after the [https://en.wikipedia.org/wiki/GeForce_10_series Titan X] was the latest and greatest thing (i.e., somewhere around 2016). But it's pretty straightforward to update [https://www.azken.com/download/DIGITS_DEVBOX_DESIGN_GUIDE.pdf their spec].
 
==Introduction==
===Specification===
<onlyinclude>[[File:Top1000.jpg|right|300px]] Our [[DIGITS DevBox]], affectionately named "Bastard" after Lois McMaster Bujold's fifth God, has a Xeon E5-2620 v3 processor, 256GB of DDR4 RAM, two GPUs - one Titan RTX and one Titan Xp - with room for two more, a 500GB SSD (mounting /), and an 8TB RAID 5 array bcached with a 512GB m.2 drive (mounting the /bulk share, which is available over samba). It runs Ubuntu 18.04, CUDA 10.0, cuDNN 7.6.1, Anaconda3-2019.03, python 3.7, tensorflow 1.13, digits 6, and other useful machine learning tools/libraries.</onlyinclude>
===Documentation===
 
[[Using the DevBox]] provides information on using, rather than building, the box. The documentation from NVIDIA is here: the [https://www.azken.com/download/DIGITS_DEVBOX_DESIGN_GUIDE.pdf DIGITS DevBox Design Guide].
We mostly followed the original hardware spec from NVIDIA, updating the capacity of the drives and other minor things, as we had many of these parts available as salvage from other boxes. We had to buy the ASUS X99-E WS motherboard (we got the ASUS X99-E WS/USB variant, as the original wasn't available and this one has USB 3.1), as well as some new drives, just for this project.
[[File:Front1000.jpg|right|300px]] We opted to use a Xeon E5-2620 v3 processor, rather than the Core i7-5930K. We had both available, and both provide 40 PCIe lanes, mount in the LGA 2011-v3 socket, have 6 cores, 15 MB caches, etc. Although the i7 has a faster clock speed, the Xeon takes registered (buffered) ECC DDR4 RDIMMs, which means we can put 256 GB on the board, rather than just 64 GB. For the GPUs, we have a Titan RTX and an older Titan Xp available to start, and we can add a 1080 Ti later, or buy some additional GPUs if needed. We also put the whole thing in a Rosewill RSV-L4000 case.
===Parts List===
Old notes on a prior look at a [[GPU Build]] are on the wiki too.
[[File:Back1000.jpg|right|300px]] There weren't any particularly noteworthy things about the hardware build. The GPUs need to go in slots 1 and 3, which means they sit tight on each other. We put the Titan Xp in slot 1 (and plugged the monitor into its HDMI port), because then the fans for the Titan RTX (which we expect will get heavier use) are in the clear for now. The case fans were set up in a push-and-pull arrangement, and the hot-swap bay was put in the center position to allow as much airflow past the GPUs as possible.
===BIOS===
Notes:
*We will do the RAID 5 array in software, rather than using the X99 chipset RAID through the BIOS
 
What's really crucial is that all the hardware is visible and that we are NOT booting via UEFI. With UEFI and secure boot, the NVIDIA drivers fail to load because the kernel modules are not properly signed.
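A quick way to double-check how the box actually booted, once Linux is up (the directory only exists on UEFI boots):
 [ -d /sys/firmware/efi ] && echo "UEFI" || echo "legacy BIOS"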
==Software==
===Main OS Install===
Install [http://cdimage.ubuntu.com/releases/18.04.2/release/ Ubuntu 18.04] (note that the original DIGITS DevBox ran 14.04), '''not the live version''', from a freshly burnt DVD. If you install the HWE version, you don't need to run apt-get install --install-recommends linux-generic-hwe-18.04 at the end.
====In the installer====
Choose the first network hardware option and make sure that the second (rightmost) network port is connected to a DHCP-broadcasting router.
Under partitions:[[File:Partitions1000.jpg|right|300px]]
# Put one large partition, formatted as ext4, mounted as /, bootable, on the 850 (the 500GB SSD)
# Partition each SATA drive as RAID
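If you finish the array from a shell rather than in the installer, a sketch with mdadm (assuming, say, three 4 TB drives partitioned as /dev/sdb1 through /dev/sdd1; adjust to match your drives):
 mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
 cat /proc/mdstat
 mdadm --detail --scan >> /etc/mdadm/mdadm.conf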
Give the box a reboot!
 
===X Windows===
 
If you install the video driver before installing X Windows, you will need to manually edit the X config files. So install the X window system now. The easiest way is:
tasksel
And choose your favorite. We used Ubuntu Desktop.
 
And reboot again to make sure that everything is working nicely.
===Video Drivers===
The first build of this box was done with an installation of CUDA 10.1, which automatically installed version 418.67 of the NVIDIA driver. We then installed CUDA 10.0 under conda to support Tensorflow 1.13. All went mostly well, and the history of this page contains the instructions. However, at some point, likely because of an OS update, the video driver(s) stopped working. This page now describes the second build (as if it were a build from scratch). [[Addressing Ubuntu NVIDIA Issues]] provides additional information.
====Hardware check====
Check that the hardware is being seen and what driver is being used with:
 lspci -vk
 05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation GP102 [TITAN Xp]
        Flags: bus master, fast devsel, latency 0, IRQ 78, NUMA node 0
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Memory at d0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at d000 [size=128]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
        Capabilities: [900] #19
        Kernel driver in use: nouveau
        Kernel modules: nvidiafb, nouveau
 06:00.0 VGA compatible controller: NVIDIA Corporation Device 1e02 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 12a3
        Flags: fast devsel, IRQ 24, NUMA node 0
        Memory at f8000000 (32-bit, non-prefetchable) [size=16M]
        Memory at a0000000 (64-bit, prefetchable) [size=256M]
        Memory at b0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at c000 [size=128]
        Expansion ROM at f9000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
        Capabilities: [900] #19
        Capabilities: [bb0] #15
        Kernel modules: nvidiafb, nouveau
This looks good. The second card is the Titan RTX (see https://devicehunt.com/view/type/pci/vendor/10DE/device/1E02).
Currently we are using the nouveau driver for the Xp, and have no driver loaded for the RTX.
driver : xserver-xorg-video-nouveau - distro free builtin
You could install the driver directly now using, say, apt install nvidia-driver-430. But don't yet! First install build-essential, which gets you gcc. Then blacklist the nouveau driver (see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau) and reboot to a text terminal (X Windows is already installed) so that it isn't loaded.
apt-get install build-essential
gcc --version
wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux.run
vi /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
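The NVIDIA guide linked above also has you disable kernel modesetting in the same file:
 options nouveau modeset=0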
update-initramfs -u
shutdown -r now
Reboot to a text terminal
lspci -vk
Shows no kernel driver in use!
Install the driver!
 apt install nvidia-driver-430
====CUDA====
Get CUDA 10.0, rather than 10.1. Although 10.1 is the latest version at the time of writing, it won't work with Tensorflow 1.13, so you'll just end up installing 10.0 under conda anyway.
*The installation instructions are here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
*You can download CUDA 10.0 from here: https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal
Run the installer script, and DO NOT install the bundled driver (don't worry about the warning, it will work fine!):
 sh cuda_10.0.130_410.48_linux.run
 Do you accept the previously read EULA? accept/decline/quit: accept
 Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48? (y)es/(n)o/(q)uit: n
 Install the CUDA 10.0 Toolkit? (y)es/(n)o/(q)uit: y
 Enter Toolkit Location [ default is /usr/local/cuda-10.0 ]:
 Do you want to install a symbolic link at /usr/local/cuda? (y)es/(n)o/(q)uit: y
 Install the CUDA 10.0 Samples? (y)es/(n)o/(q)uit: y
 Enter CUDA Samples Location [ default is /home/ed ]:
 Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
 Missing recommended library: libGLU.so
 Missing recommended library: libX11.so
 Missing recommended library: libXi.so
 Missing recommended library: libXmu.so
 Missing recommended library: libGL.so
 Installing the CUDA Samples in /home/ed ...
 Copying samples to /home/ed/NVIDIA_CUDA-10.0_Samples now...
 Finished copying samples.
 ===========
 = Summary =
 ===========
 Driver: Not Selected
 Toolkit: Installed in /usr/local/cuda-10.0
 Samples: Installed in /home/ed, but missing recommended libraries
 Please make sure that
 - PATH includes /usr/local/cuda-10.0/bin
 - LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root
 To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
 Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.
 ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 10.0 functionality to work. To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file: sudo <CudaInstaller>.run -silent -driver
 Logfile is /var/tmp/cuda_install_2807.log
Now fix the paths. To do this for a single user do:
 export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
 export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
But it is better to fix it for everyone by editing your environment file:
 vi /etc/environment
 PATH="/usr/local/cuda-10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
 LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64"
With CUDA 10.0, you don't need to edit rc.local to start the persistence daemon:
/usr/bin/nvidia-persistenced --verbose
Instead, it runs as a service. Verify the driver:
 cat /proc/driver/nvidia/version
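To confirm, you can query the unit directly (assuming the driver package's nvidia-persistenced service name):
 systemctl status nvidia-persistenced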
====Test the installation====
Make the samples:
 cd /usr/local/cuda-10.0/samples
make
Then change into the sample directory and run the tests:
 cd /usr/local/cuda-10.0/samples/bin/x86_64/linux/release
./deviceQuery
./bandwidthTest
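If all has gone well, deviceQuery lists both cards and ends with something like:
 Result = PASS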
And yes, it's a thing of beauty. Everything should be good at this point.
===Bcache===
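The exact bcache commands aren't recorded here, but a minimal sketch of caching the RAID 5 array with the m.2 drive (the device names /dev/md0 and /dev/nvme0n1 are assumptions; adjust to match your hardware):
 apt-get install bcache-tools
 make-bcache -C /dev/nvme0n1 -B /dev/md0
 mkfs.ext4 /dev/bcache0
 mkdir /bulk
 mount /dev/bcache0 /bulk
Add the mount to /etc/fstab to make it permanent.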
===DIGITS===
This section follows https://developer.nvidia.com/rdp/digits-download. Install Docker CE first, following https://docs.docker.com/install/linux/docker-ce/ubuntu/
Then follow https://github.com/NVIDIA/nvidia-docker#quick-start to install nvidia-docker2, but change the last command to use cuda 10.0
...
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi
Then pull DIGITS using docker (https://hub.docker.com/r/nvidia/digits/):
*https://developer.nvidia.com/digits
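A minimal sketch, assuming the nvidia/digits image and its default web UI port of 5000:
 docker pull nvidia/digits
 docker run --runtime=nvidia --name digits -d -p 5000:5000 nvidia/digits
Then browse to http://localhost:5000 on the box.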
Note: you can clean up stopped containers and other unused docker objects with
docker system prune
====cuDNN====
First, make an installs directory in bulk and copy the installation files over from the RDP (E:\installs\DIGITS DevBox). Then:
cd /bulk/installs/
dpkg -i libcudnn7_7.6.1.34-1+cuda10.0_amd64.deb
dpkg -i libcudnn7-dev_7.6.1.34-1+cuda10.0_amd64.deb
dpkg -i libcudnn7-doc_7.6.1.34-1+cuda10.0_amd64.deb
And test it:
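The libcudnn7-doc package ships the samples, so (following the steps in NVIDIA's cuDNN install guide) the standard check is to build and run mnistCUDNN:
 cp -r /usr/src/cudnn_samples_v7/ $HOME
 cd $HOME/cudnn_samples_v7/mnistCUDNN
 make clean && make
 ./mnistCUDNN
It should finish with "Test passed!".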
====Python Based====
Now install Anaconda, so that we have python 3, and can pip and conda install everything else. Instructions for installing Anaconda on Ubuntu 18.04 LTS (e.g., https://docs.anaconda.com/anaconda/install/linux/) all recommend using the shell script.
From https://www.anaconda.com/distribution/ the latest version is 3.7, so:
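The Anaconda3-2019.03 installer can be fetched from the Anaconda archive and run as a shell script. A sketch, where the environment name tf and the tensorflow pin are illustrative:
 wget https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
 bash Anaconda3-2019.03-Linux-x86_64.sh
 conda create -n tf python=3.7
 conda activate tf
 pip install tensorflow-gpu==1.13.1
Then test that tensorflow can see the GPUs: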
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
Note: tensorflow 1.13.1 doesn't work with CUDA 10.1, so if you installed 10.1 rather than 10.0, the workaround is to install cuda10 in conda only (see https://github.com/tensorflow/tensorflow/issues/26182). Still as researcher (and in the virtual environment):
 conda install cudatoolkit
 conda install cudnn
 conda install tensorflow-gpu
 export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
 python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
AND IT WORKS! When you are done, deactivate the virtual environment:
 deactivate
Note that adding the anaconda lib path to /etc/environment makes the LD_LIBRARY_PATH export above redundant.
=====PyTorch and SciKit=====
Theano v.1 requires python >=3.4 and <3.6. We are currently running 3.7. If we decide to install theano, we'll need to set up another version of python and another virtual environment. See:
*http://deeplearning.net/software/theano/install_ubuntu.html
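If we ever do, a separate environment pinned to an older python is straightforward (the env name theano and the version pin below are illustrative):
 conda create -n theano python=3.5
 conda activate theano
 pip install theano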
 
===VNC===
 
In order to use the graphical interface for Matlab and other applications, we need a VNC server.
 
First, install a VNC client on the machine you'll be connecting from. We use the standalone exe from TigerVNC.
 
Now install TightVNC, following the instructions: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-vnc-on-ubuntu-18-04
 
cd /root
apt-get install xfce4 xfce4-goodies
 
As user
sudo apt-get install tightvncserver
vncserver
set password for user
vncserver -kill :1
mv ~/.vnc/xstartup ~/.vnc/xstartup.bak
vi ~/.vnc/xstartup
#!/bin/bash
xrdb $HOME/.Xresources
startxfce4 &
vncserver
sudo vi /etc/systemd/system/vncserver@.service
[Unit]
Description=Start TightVNC server at startup
After=syslog.target network.target
[Service]
Type=forking
User=uname
Group=uname
WorkingDirectory=/home/uname
PIDFile=/home/uname/.vnc/%H:%i.pid
ExecStartPre=-/usr/bin/vncserver -kill :%i > /dev/null 2>&1
ExecStart=/usr/bin/vncserver -depth 24 -geometry 1280x800 :%i
ExecStop=/usr/bin/vncserver -kill :%i
[Install]
WantedBy=multi-user.target
 
Note that changing the color depth breaks it!
 
To make changes (or after the edit)
sudo systemctl daemon-reload
sudo systemctl enable vncserver@2.service
vncserver -kill :2
sudo systemctl start vncserver@2
sudo systemctl status vncserver@2
 
Stop the server with
sudo systemctl stop vncserver@2
 
Note that we are using :2 because :1 is running our regular Xwindows GUI.
 
Instructions on how to set up an SSH tunnel using PuTTY:
https://helpdeskgeek.com/how-to/tunnel-vnc-over-ssh/
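The same forward can be scripted with PuTTY's command-line plink (the ports and address match the setup described below; the username is illustrative):
 plink -N -L 5901:localhost:5902 ed@192.168.2.202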
 
====Connection Issues====
 
Coming back to this, I had issues connecting. I set up the tunnel using the saved profile in puTTY.exe and checked to see which local port was listening (it was 5901) and not firewalled using the listening ports tab under network on resmon.exe (it said allowed, not restricted under firewall status). VNC seemed to be running fine on Bastard, and I tried connecting to localhost::1 (that is 5901 on the localhost, through the tunnel to 5902 on Bastard) using VNC Connect by RealVNC. The connection was refused.
 
I checked it was listening and there was no firewall:
netstat -tlpn
tcp 0 0 0.0.0.0:5902 0.0.0.0:* LISTEN 2025/Xtightvnc
ufw status
Status: inactive
 
The localhost port seems to be open and listening just fine:
Test-NetConnection 127.0.0.1 -Port 5901
 
So, presumably, there must be something wrong with the tunnel itself.
 
'''Ignoring the SSH tunnel worked fine: Connect to 192.168.2.202::5902 using the TightVNC (or RealVNC, etc.) client.'''
 
===RDP===
 
I also installed xrdp:
apt install xrdp
adduser xrdp ssl-cert
#Check the status and that it is listening on 3389
systemctl status xrdp
netstat -tln
#It is listening...
vi /etc/xrdp/xrdp.ini
#See https://linux.die.net/man/5/xrdp.ini
systemctl restart xrdp
 
This gave a dead session (a flat light blue screen with nothing on it), which finally yielded a connection log that said "login successful for display 10, start connecting, connection problems, giving up, some problem."
cat /var/log/xrdp-sesman.log
 
There could be some conflict between VNC and RDP. systemctl status xrdp shows "xrdp_wm_log_msg: connection problem, giving up".
 
I tried without success:
gsettings set org.gnome.Vino require-encryption false
https://askubuntu.com/questions/797973/error-problem-connecting-windows-10-rdp-into-xrdp
vi /etc/X11/Xwrapper.config
allowed_users = anybody
This was promising, as it was previously set to console.
https://www.linuxquestions.org/questions/linux-software-2/xrdp-under-debian-9-connection-problem-4175623357/#post5817508
apt-get install xorgxrdp-hwe-18.04
Couldn't find the package... This lead was promising, as it applies to 18.04.2 HWE, which is what I'm running:
https://www.nakivo.com/blog/how-to-use-remote-desktop-connection-ubuntu-linux-walkthrough/
dpkg -l |grep xserver-xorg-core
ii xserver-xorg-core 2:1.19.6-1ubuntu4.3 amd64 Xorg X server - core server
Which seems OK, despite the problem with XRDP and Ubuntu 18.04 HWE being documented very clearly here: http://c-nergy.be/blog/?p=13972
 
There is clearly an issue with Ubuntu 18.04 and XRDP. The solution seems to be to downgrade xserver-xorg-core and some related packages, which can be done with an install script (https://c-nergy.be/blog/?p=13933) or manually. But I don't want to do that, so I removed xrdp and went back to VNC!
apt remove xrdp
 
===Other Software===
 
I installed the community edition of PyCharm:
snap install pycharm-community --classic
#Restart the local terminal so that it has updated paths (after a snap install, etc.)
/snap/pycharm-community/214/bin/pycharm.sh
 
On launch, you get some config options. I chose to install and enable:
*IdeaVim (a VI editor emulator)
*R
*AWS Toolkit
 
Make a launcher: In /usr/share/applications:
vi pycharm.desktop
[Desktop Entry]
Version=2020.2.3
Type=Application
Name=PyCharm
Icon=/snap/pycharm-community/214/bin/pycharm.png
Exec="/snap/pycharm-community/214/bin/pycharm.sh" %f
Comment=The Drive to Develop
Categories=Development;IDE;
Terminal=false
StartupWMClass=jetbrains-pycharm
 
Also, create a launcher on the desktop with the same info.
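One way to do that from a shell (the gio metadata::trusted step is what stops GNOME from flagging the launcher as untrusted; it assumes a GNOME desktop):
 cp /usr/share/applications/pycharm.desktop ~/Desktop/
 chmod +x ~/Desktop/pycharm.desktop
 gio set ~/Desktop/pycharm.desktop metadata::trusted true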
