Changes

553 bytes added , 17:07, 9 August 2024

This page details the build of our [[DIGITS DevBox]]. There's also a page giving information on [[Using the DevBox]]. nVIDIA, famous for their incredibly poor supply-chain and inventory management, have been saying [https://developer.nvidia.com/devbo "Please note that we are sold out of our inventory of the DIGITS DevBox, and no new systems are being built"] since shortly after the [https://en.wikipedia.org/wiki/GeForce_10_series Titan X] was the latest and greatest thing (i.e., somewhere around 2016). But it's pretty straightforward to update [https://www.azken.com/download/DIGITS_DEVBOX_DESIGN_GUIDE.pdf their spec].

==Introduction==

===Specification===

<onlyinclude>[[File:Top1000.jpg|right|300px]] Our [[DIGITS DevBox]], affectionately named ~~"Bastard"~~ after Lois McMaster Bujold's fifth God, has a Xeon E5-2620 v3 processor, 256GB of DDR4 RAM, two GPUs (one Titan RTX and one Titan Xp) with room for two more, a 500GB SSD (mounting /), and an 8TB RAID5 array bcached with a 512GB M.2 drive (mounting the /bulk share, which is available over Samba). It runs Ubuntu 18.04, CUDA 10.0, cuDNN 7.6.1, Anaconda3-2019.03, Python 3.7, TensorFlow 1.13, DIGITS 6, and other useful machine learning tools/libraries.</onlyinclude>

===Documentation===

The DevBox is currently unavailable from Amazon [https://www.amazon.com/Lambda-Deep-Learning-DevBox-Preinstalled/dp/B01BCDK1KC], and at around $15k buying one is prohibitive for most people. Some firms, including Lambda Labs [https://lambdalabs.com/deep-learning/workstations/4-gpu] and Bizon-tech [https://bizon-tech.com/us/bizon-g3000], are selling variants on them, but their prices are high too and the details on their specs are limited (the MoBo and config details are missing entirely).

But the parts' cost is perhaps $4-5k now for a massive update to the original spec! So this page goes through everything required to put one together and get it up and running.

==Hardware==

Give the box a reboot!

===X Windows===

If you install the video driver before installing X Windows, you will need to manually edit the X Windows config files. So, now install the X Window System. The easiest way is:

tasksel

And choose your favorite. We used Ubuntu Desktop.

And reboot again to make sure that everything is working nicely.

===Video Drivers===

The first build of this box was done with an installation of CUDA 10.1, which automatically installed version 418.67 of the NVIDIA driver. We then installed CUDA 10.0 under conda to support Tensorflow 1.13. All went mostly well, and the history of this page contains the instructions. However, at some point, likely because of an OS update, the video driver(s) stopped working. This page now describes the second build (as if it were a build from scratch). [[Addressing Ubuntu NVIDIA Issues]] provides additional information.

====Hardware and Drivers====

~~====Hardware check====~~

~~Check that the hardware is being seen:~~

 ~~lspci -vk~~
 ~~05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1) (prog-if 00 [VGA controller])~~
 ~~Subsystem: NVIDIA Corporation GP102 [TITAN Xp]~~
 ~~Flags: bus master, fast devsel, latency 0, IRQ 78, NUMA node 0~~
 ~~Memory at fa000000 (32-bit, non-prefetchable) [size=16M]~~
 ~~Memory at c0000000 (64-bit, prefetchable) [size=256M]~~
 ~~Memory at d0000000 (64-bit, prefetchable) [size=32M]~~
 ~~I/O ports at d000 [size=128]~~
 ~~Expansion ROM at 000c0000 [disabled] [size=128K]~~
 ~~Capabilities: [60] Power Management version 3~~
 ~~Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+~~
 ~~Capabilities: [78] Express Legacy Endpoint, MSI 00~~
 ~~Capabilities: [100] Virtual Channel~~
 ~~Capabilities: [250] Latency Tolerance Reporting~~
 ~~Capabilities: [128] Power Budgeting <?>~~
 ~~Capabilities: [420] Advanced Error Reporting~~
 ~~Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024~~
 ~~Capabilities: [900] #19~~
 ~~Kernel driver in use: nouveau~~
 ~~Kernel modules: nvidiafb, nouveau~~
 ~~06:00.0 VGA compatible controller: NVIDIA Corporation Device 1e02 (rev a1) (prog-if 00 [VGA controller])~~
 ~~Subsystem: NVIDIA Corporation Device 12a3~~
 ~~Flags: fast devsel, IRQ 24, NUMA node 0~~
 ~~Memory at f8000000 (32-bit, non-prefetchable) [size=16M]~~
 ~~Memory at a0000000 (64-bit, prefetchable) [size=256M]~~
 ~~Memory at b0000000 (64-bit, prefetchable) [size=32M]~~
 ~~I/O ports at c000 [size=128]~~
 ~~Expansion ROM at f9000000 [disabled] [size=512K]~~
 ~~Capabilities: [60] Power Management version 3~~
 ~~Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+~~
 ~~Capabilities: [78] Express Legacy Endpoint, MSI 00~~
 ~~Capabilities: [100] Virtual Channel~~
 ~~Capabilities: [250] Latency Tolerance Reporting~~
 ~~Capabilities: [258] L1 PM Substates~~
 ~~Capabilities: [128] Power Budgeting <?>~~
 ~~Capabilities: [420] Advanced Error Reporting~~
 ~~Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024~~
 ~~Capabilities: [900] #19~~
 ~~Capabilities: [bb0] #15~~
 ~~Kernel modules: nvidiafb, nouveau~~

~~This looks good. The second card is the Titan RTX (see https://devicehunt.com/view/type/pci/vendor/10DE/device/1E02).~~

Check the hardware is being seen and what driver is being used with:

 lspci -vk

Currently we are using the nouveau driver for the Xp, and have no driver loaded for the RTX.

driver : xserver-xorg-video-nouveau - distro free builtin
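The lspci -vk listing is long, so as a rough aid the sketch below boils it down to one line per graphics device with the kernel driver bound to it. The function name summarise_gpu_drivers is made up for this page, and the parsing assumes the usual lspci layout (a device line starting in column 0, followed by indented detail lines).

```shell
# Summarise `lspci -vk` output: one line per VGA/3D device, followed by the
# kernel driver in use, or "(none)" if no driver is bound.
# Reads lspci-style text on stdin, so it can be tried on saved output.
# (summarise_gpu_drivers is a hypothetical helper name, not a standard tool.)
summarise_gpu_drivers() {
  awk '
    function emit() { if (slot != "") print slot, (driver == "" ? "(none)" : driver) }
    /^[0-9a-fA-F:.]+ / {               # a new device record starts in column 0
      emit(); slot = ""; driver = ""
      if ($0 ~ /VGA compatible controller|3D controller/) slot = $1
      next
    }
    /Kernel driver in use:/ { if (slot != "") driver = $NF }
    END { emit() }
  '
}

# Typical use on the box itself:
#   lspci -vk | summarise_gpu_drivers
```

With the two cards described here, the Xp would be expected to show nouveau and the RTX no driver at this stage.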

~~You could install the driver directly now using, say, apt install nvidia-430. But don't!~~

~~====CUDA====~~

~~Get CUDA 10.1 and have it install its preferred driver (418.67):~~
*~~The installation instructions are here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html~~
*~~You can download CUDA from here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal~~

~~Essentially, first install build-essential, which gets you gcc.~~

Then blacklist the nouveau driver (see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau) and reboot (to a text terminal~~, if you have deviated from these instructions and already installed X Windows~~) so that it isn't loaded.

apt-get install build-essential

gcc --version

~~wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.168_418.67_linux.run~~

vi /etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau

update-initramfs -u

shutdown -r now

Reboot to a text terminal

lspci -vk

Shows no kernel driver in use!
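The blacklist step above can be sketched as a small script. This is only an illustration: the function names are invented for this page, the target directory is a parameter so it can be tried on a scratch directory first, and the second line of the file (options nouveau modeset=0) comes from NVIDIA's installation guide linked above rather than from the commands listed here.

```shell
# Write a nouveau blacklist file into the given modprobe.d directory.
# On the real box the directory would be /etc/modprobe.d (run as root),
# followed by `update-initramfs -u` and a reboot, as in the steps above.
# (write_nouveau_blacklist and nouveau_loaded are hypothetical helper names.)
write_nouveau_blacklist() {
  blacklist_dir=${1:-/etc/modprobe.d}
  printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' \
    > "$blacklist_dir/blacklist-nouveau.conf"
}

# Succeeds if an lsmod-style listing on stdin still shows nouveau loaded.
nouveau_loaded() {
  grep -q '^nouveau[[:space:]]'
}

# After the reboot, something like this should report that nouveau is gone:
#   lsmod | nouveau_loaded && echo "nouveau is still loaded"
```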

Install the driver!

 apt install nvidia-driver-430

====CUDA====

Get CUDA 10.0, rather than 10.1. Although 10.1 is the latest version at the time of writing, it won't work with Tensorflow 1.13, so you'll just end up installing 10.0 under conda anyway.
*The installation instructions are here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
*You can download CUDA 10.0 from here: https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

Essentially, first install build-essential, which gets you gcc. Then run the installer script and DO NOT install the driver (don't worry about the warning, it will work fine!):

 ~~sh cuda_10.1.168_418.67_linux.run~~
 sh cuda_10.0.130_410.48_linux.run
 
 Do you accept the previously read EULA?
 accept/decline/quit: accept
 
 Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
 (y)es/(n)o/(q)uit: n
 
 Install the CUDA 10.0 Toolkit?
 (y)es/(n)o/(q)uit: y
 
 Enter Toolkit Location
 [ default is /usr/local/cuda-10.0 ]:
 
 Do you want to install a symbolic link at /usr/local/cuda?
 (y)es/(n)o/(q)uit: y
 
 Install the CUDA 10.0 Samples?
 (y)es/(n)o/(q)uit: y
 
 Enter CUDA Samples Location
 [ default is /home/ed ]:
 
 Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
 Missing recommended library: libGLU.so
 Missing recommended library: libX11.so
 Missing recommended library: libXi.so
 Missing recommended library: libXmu.so
 Missing recommended library: libGL.so
 
 Installing the CUDA Samples in /home/ed ...
 Copying samples to /home/ed/NVIDIA_CUDA-10.0_Samples now...
 Finished copying samples.
 
 ===========
 = Summary =
 ===========
 
 Driver: Not Selected
 Toolkit: Installed in /usr/local/cuda-10.0
 Samples: Installed in /home/ed, but missing recommended libraries
 
 Please make sure that
 - PATH includes /usr/local/cuda-10.0/bin
 - LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root
 
 To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
 
 Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.
 
 ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 10.0 functionality to work.
 To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
 sudo <CudaInstaller>.run -silent -driver
 
 Logfile is /tmp/cuda_install_2807.log

 ~~===========~~
 ~~= Summary =~~
 ~~===========~~
 ~~Driver: Installed~~
 ~~Toolkit: Installed in /usr/local/cuda-10.1/~~
 ~~Samples: Installed in /home/ed/, but missing recommended libraries~~
 ~~Please make sure that~~
 ~~- PATH includes /usr/local/cuda-10.1/bin~~
 ~~- LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to /etc/ld.so.conf and run ldconfig as root~~
 ~~To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin~~
 ~~To uninstall the NVIDIA Driver, run nvidia-uninstall~~
 ~~Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.1/doc/pdf for detailed information on setting up CUDA.~~
 ~~Logfile is /var/log/cuda-installer.log~~

Now fix the paths. To do this for a single user do:

 ~~export PATH=/usr/local/cuda-10.1/bin:/usr/local/cuda-10.1/NsightCompute-2019.1${PATH:+:${PATH}}~~
 export PATH=/usr/local/cuda-10.0/bin:/usr/local/cuda-10.0${PATH:+:${PATH}}
 export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
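If the single-user exports end up in a shell startup file, they get re-applied every time the file is sourced and the paths grow duplicates. A minimal sketch of an idempotent version follows; prepend_once is a made-up helper name, and the CUDA_HOME default is simply the toolkit location reported by the installer above.

```shell
# Print VALUE with DIR prepended, unless DIR is already one of its
# colon-separated entries. Usage: prepend_once DIR CURRENT_VALUE
# (prepend_once is a hypothetical helper, not part of CUDA or the shell.)
prepend_once() {
  case ":$2:" in
    *":$1:"*) printf '%s\n' "$2" ;;          # already present: unchanged
    *)        printf '%s\n' "$1${2:+:$2}" ;; # prepend, avoiding a stray colon
  esac
}

CUDA_HOME=${CUDA_HOME:-/usr/local/cuda-10.0}   # assumed install prefix
PATH=$(prepend_once "$CUDA_HOME/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_once "$CUDA_HOME/lib64" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

Sourcing this twice leaves PATH and LD_LIBRARY_PATH the same as sourcing it once.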

But it is better to fix it for everyone by editing your environment file:

 vi /etc/environment
 PATH="/usr/local/cuda-10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
 LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64"

~~Start the persistence daemon:~~

With CUDA 10.0, you don't need to edit rc.local to start the persistence daemon:

/usr/bin/nvidia-persistenced --verbose

Instead, nvidia-persistenced runs as a service.

~~This should be run at boot, so:~~

 ~~vi /etc/rc.local~~
 ~~#!/bin/sh -e~~
 ~~/usr/bin/nvidia-persistenced --verbose~~
 ~~exit 0~~
 ~~chmod +x /etc/rc.local~~

~~Verify the driver:~~

 ~~cat /proc/driver/nvidia/version~~

====Test the installation====

Make the samples ~~in:~~...

 cd /usr/local/cuda-10.0/samples

make

And change into the sample directory and run the tests:

~~Change into the sample directory and run the tests:~~

 cd /usr/local/cuda-10.0/samples/bin/x86_64/linux/release

./deviceQuery

./bandwidthTest

~~And yes, it's a thing of beauty~~ Everything should be good at this point!

~~===X Windows===~~

~~Now install the X window system. The easiest way is:~~

 ~~tasksel~~

~~And choose your favorite. We used Ubuntu Desktop.~~

~~And reboot again to make sure that everything is working nicely.~~
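Both deviceQuery and bandwidthTest finish with a line reading either Result = PASS or Result = FAIL, so the check can be scripted rather than eyeballed. The helper below is a sketch; sample_passed is a name invented here, and it only looks for that final result line.

```shell
# Succeeds only if a CUDA sample's output (on stdin) contains a
# "Result = PASS" line. (sample_passed is a hypothetical helper name.)
sample_passed() {
  grep -q '^Result = PASS'
}

# Typical use from the samples' release directory:
#   ./deviceQuery   | sample_passed && echo "deviceQuery OK"
#   ./bandwidthTest | sample_passed && echo "bandwidthTest OK"
```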

===Bcache===

This section follows https://developer.nvidia.com/rdp/digits-download. Install Docker CE first, following https://docs.docker.com/install/linux/docker-ce/ubuntu/

Then follow https://github.com/NVIDIA/nvidia-docker#quick-start to install nvidia-docker2, but change the last command to use cuda 10.0

...

sudo apt-get install -y nvidia-docker2

sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image

docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi

Then pull DIGITS using docker (https://hub.docker.com/r/nvidia/digits/):

*https://developer.nvidia.com/digits

Note: you can remove stopped docker containers (and other unused docker data) with

docker system prune

====cuDNN====

First, make an installs directory in bulk and copy the installation files over from the RDP (E:\installs\DIGITS DevBox). Then:

cd /bulk/install/

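 ~~dpkg -i libcudnn7_7.5.1.10-1+cuda10.1_amd64.deb~~
 ~~dpkg -i libcudnn7-dev_7.5.1.10-1+cuda10.1_amd64.deb~~
 ~~dpkg -i libcudnn7-doc_7.5.1.10-1+cuda10.1_amd64.deb~~
 dpkg -i libcudnn7_7.6.1.34-1+cuda10.0_amd64.deb
 dpkg -i libcudnn7-dev_7.6.1.34-1+cuda10.0_amd64.deb
 dpkg -i libcudnn7-doc_7.6.1.34-1+cuda10.0_amd64.deb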

And test it:

pip install --upgrade tensorflow-gpu

python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

And this doesn't work. It turns out that tensorflow 1.13.1 doesn't work with CUDA 10.1! But there is a workaround, which is to install cuda10 in conda only (see https://github.com/tensorflow/tensorflow/issues/26182). We are also going to leave the installation of CUDA 10.1 in place because tensorflow will catch up at some point.

~~Still as researcher (and in the venv):~~

~~conda install cudatoolkit~~

~~conda install cudnn~~

~~conda install tensorflow-gpu~~

~~export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}~~

~~python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"~~

~~AND IT WORKS!~~

Note: to deactivate the virtual environment:

deactivate

Note that adding the anaconda path to /etc/environment makes the virtual environment redundant.

=====PyTorch and SciKit=====

*http://deeplearning.net/software/theano/install_ubuntu.html

~~==Video Driver Issue==~~

===VNC===

In order to use the graphical interface for Matlab and other applications, we need a VNC server. First, install the VNC client remotely. We use the standalone exe from TigerVNC.

Now install TightVNC, following the instructions: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-vnc-on-ubuntu-18-04

 cd /root
 apt-get install xfce4 xfce4-goodies

As user:

 sudo apt-get install tightvncserver
 vncserver
 #set password for user (ailia)
 vncserver -kill :1
 mv ~/.vnc/xstartup ~/.vnc/xstartup.bak
 vi ~/.vnc/xstartup
 #!/bin/bash
 xrdb $HOME/.Xresources
 startxfce4 &

Then:

 vncserver
 sudo vi /etc/systemd/system/vncserver@.service
 [Unit]
 Description=Start TightVNC server at startup
 After=syslog.target network.target
 
 [Service]
 Type=forking
 User=uname
 Group=uname
 WorkingDirectory=/home/uname
 
 PIDFile=/home/ed/.vnc/%H:%i.pid
 ExecStartPre=-/usr/bin/vncserver -kill :%i > /dev/null 2>&1
 ExecStart=/usr/bin/vncserver -depth 24 -geometry 1280x800 :%i
 ExecStop=/usr/bin/vncserver -kill :%i
 
 [Install]
 WantedBy=multi-user.target

Note that changing the color depth breaks it!

To make changes (or after the edit):

 sudo systemctl daemon-reload
 sudo systemctl enable vncserver@2.service
 vncserver -kill :2
 sudo systemctl start vncserver@2
 sudo systemctl status vncserver@2

Stop the server with:

 sudo systemctl stop vncserver@2

Note that we are using :2 because :1 is running our regular X Windows GUI.

Instructions on how to set up an IP tunnel using PuTTY: https://helpdeskgeek.com/how-to/tunnel-vnc-over-ssh/

====Connection Issues====

Coming back to this, I had issues connecting. I set up the tunnel using the saved profile in puTTY.exe and checked to see which local port was listening (it was 5901) and not firewalled using the listening ports tab under network on resmon.exe (it said allowed, not restricted under firewall status).

VNC seemed to be running fine on Bastard, and I tried connecting to localhost::1 (that is 5901 on the localhost, through the tunnel to 5902 on Bastard) using VNC Connect by RealVNC. The connection was refused. I checked it was listening and there was no firewall:

 netstat -tlpn
 tcp 0 0 0.0.0.0:5902 0.0.0.0:* LISTEN 2025/Xtightvnc
 ufw status
 Status: inactive

The localhost port seems to be open and listening just fine:

 Test-NetConnection 127.0.0.1 -p 5901

So, presumably, there must be something wrong with the tunnel itself. '''Ignoring the SSH tunnel worked fine: Connect to 192.168.2.202::5902 using the TightVNC (or RealVNC, etc.) client.'''

====Later Notes====

=====Change the resolution=====

I came back and changed the resolution to make it work on one of my portrait desktop monitors. See https://www.tightvnc.com/vncserver.1.php

As root:

 vi /etc/systemd/system/vncserver@.service

Change line:

 ExecStart=/usr/bin/vncserver -depth 24 -geometry 1440x2560 :%i

(Note that the size is 2160x3840 divided by 150%.) Leave the color depth, as it says elsewhere that changes are bad.

 systemctl daemon-reload
 systemctl enable vncserver@2.service

As Ed:

 vncserver -kill :2
 sudo systemctl start vncserver@2
 sudo systemctl status vncserver@2

~~After logging into the box sometime later, it seems that the video drivers are no longer loading, presumably as a consequence of some update or something~~

Exit full screen with ctrl-alt-shift-f.
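Two bits of arithmetic in the VNC notes are easy to get wrong: the TCP port for a display number (display :N listens on 5900+N, so :2 is 5902) and the geometry after desktop scaling (2160x3840 at 150% gives 1440x2560). A small sketch, with helper names made up for this page:

```shell
# TCP port that VNC display :N listens on (5900 + N).
# (vnc_port and scaled_geometry are hypothetical helper names.)
vnc_port() {
  echo $((5900 + $1))
}

# Geometry string for a WIDTH x HEIGHT panel at a given scaling percentage,
# using integer arithmetic. Usage: scaled_geometry WIDTH HEIGHT PERCENT
scaled_geometry() {
  echo "$(($1 * 100 / $3))x$(($2 * 100 / $3))"
}

# Examples matching the notes above:
#   vnc_port 2                     -> port for vncserver@2
#   scaled_geometry 2160 3840 150  -> value for -geometry in the unit file
```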

~~===Testing===~~

 ~~nvidia-settings --query FlatpanelNativeResolution~~
 ~~ERROR: NVIDIA driver is not loaded~~
 ~~cd /usr/local/cuda-10.1/samples/bin/x86_64/linux/release~~
 ~~./deviceQuery~~
 ~~CUDA Device Query (Runtime API) version (CUDART static linking)~~
 ~~cudaGetDeviceCount returned 100~~
 ~~-> no CUDA-capable device is detected~~
 ~~Result = FAIL~~
 ~~./mnistCUDNN~~
 ~~cudnnGetVersion() : 7501 , CUDNN_VERSION from cudnn.h : 7501 (7.5.1)~~
 ~~Cuda failurer version : GCC 7.4.0~~
 ~~Error: no CUDA-capable device is detected~~
 ~~error_util.h:93~~
 ~~Aborting...~~

~~And as researcher:~~

 ~~cd /home/researcher/~~
 ~~source ./venv/bin/activate~~
 ~~python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"~~
 ~~... failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected~~
 ~~...kernel driver does not appear to be running on this host (bastard): /proc/driver/nvidia/version does not exist~~

~~lspci -vk shows Kernel modules: nvidiafb, nouveau and no Kernel driver in use~~

=====Cut And Paste=====

Also, try to fix the cut-and-paste issue. See, for example, https://unix.stackexchange.com/questions/35030/how-can-i-copy-paste-data-to-and-from-the-windows-clipboard-to-an-opensuse-clipb

As root:

 apt-get install autocutsel
 vi ~/.vnc/xstartup
 #!/bin/bash
 xrdb $HOME/.Xresources
 autocutsel -fork
 startxfce4 &

Though this might have been working fine anyway. Just change the terminal and all will be well.

=====Use XFCE terminal=====

Change Settings: Preferred Applications -> Utilities -> Terminal to XFCE

Note that this seems to fix everything, but the instructions for customizing the menu are here: https://wiki.xfce.org/howto/customize-menu

 cat /etc/xdg/menus/xfce-applications.menu

~~It looks like nouveau is still blacklisted in /etc/modprobe.d/blacklist-nouveau.conf and /usr/bin/nvidia-persistenced --verbose is still being called in /etc/rc.local. ubuntu-drivers devices returns exactly what it did before we installed CUDA 10.1 too...~~

~~There is no /proc/driver/nvidia folder, and therefore no /proc/driver/nvidia/version file found. We get the following:~~

 ~~/usr/bin/nvidia-persistenced --verbose~~
 ~~nvidia-persistenced failed to initialize. Check syslog for more details.~~
 ~~tail /var/log/syslog~~
 ~~...Jul 9 13:35:56 bastard kernel: [ 5314.526960] pcieport 0000:00:02.0: [12] Replay Timer Timeout~~
 ~~...Jul 9 13:35:56 bastard nvidia-persistenced: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.~~
 ~~ls /dev/ reveals no nvidia devices~~
 ~~nvidia-smi~~
 ~~...NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.~~
 ~~grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*~~
 ~~.../etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb~~
 ~~.../etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer~~

~~===Uninstall/Reinstall===~~

~~Am going to try uninstalling CUDA 10.1 and the current Nvidia driver, and then reinstalling CUDA 10.0~~

 ~~/usr/local/cuda-10.1/bin/cuda-uninstaller~~
 ~~nvidia-uninstall~~
 ~~WARNING: Your driver installation has been altered since it was initially installed; this may happen, for example, if you have since installed the NVIDIA driver through a mechanism other than nvidia-installer (such as your distribution's native package management system). nvidia-installer will attempt to uninstall as best it can. Please see the file '/var/log/nvidia-uninstall.log' for details.~~
 ~~WARNING: Failed to delete some directories. See /var/log/nvidia-uninstall.log for details.~~
 ~~Uninstallation of existing driver: NVIDIA Accelerated Graphics Driver for Linux-x86_64 (418.67) is complete.~~

===RDP===

I also installed xrdp:

 apt install xrdp
 adduser xrdp ssl-cert
 #Check the status and that it is listening on 3389
 systemctl status xrdp
 netstat -tln
 #It is listening...
 vi /etc/xrdp/xrdp.ini
 #See https://linux.die.net/man/5/xrdp.ini
 systemctl restart xrdp

This gave a dead session (a flat light blue screen with nothing on it), which finally yielded a connection log which said "login successful for display 10, start connecting, connection problems, giving up, some problem."

 cat /var/log/xrdp-sesman.log

There could be some conflict between VNC and RDP. systemctl status xrdp shows "xrdp_wm_log_msg: connection problem, giving up".

I tried without success:

 gsettings set org.gnome.Vino require-encryption false

https://askubuntu.com/questions/797973/error-problem-connecting-windows-10-rdp-into-xrdp

 vi /etc/X11/Xwrapper.config
 allowed_users = anybody

This was promising as it was previously set to console.

https://www.linuxquestions.org/questions/linux-software-2/xrdp-under-debian-9-connection-problem-4175623357/#post5817508

 apt-get install xorgxrdp-hwe-18.04

Couldn't find the package... This lead was promising as it applies to 18.04.02 HWE, which is what I'm running.

https://www.nakivo.com/blog/how-to-use-remote-desktop-connection-ubuntu-linux-walkthrough/

 dpkg -l |grep xserver-xorg-core
 ii xserver-xorg-core 2:1.19.6-1ubuntu4.3 amd64 Xorg X server - core server

Which seems ok, despite having a problem with XRDP and Ubuntu 18.04 HWE documented very clearly here: http://c-nergy.be/blog/?p=13972

There is clearly an issue with Ubuntu 18.04 and XRDP. The solution seems to be to downgrade xserver-xorg-core and some related packages, which can be done with an install script (https://c-nergy.be/blog/?p=13933) or manually. But I don't want to do that, so I removed xrdp and went back to VNC!

 apt remove xrdp

~~Then download cuda_10.0.130_410.48_linux.run from https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal, as well as cuda_10.0.130.1_linux.run.~~

 ~~sudo su~~
 ~~cd /bulk/install~~
 ~~./cuda_10.0.130_410.48_linux.run~~
 ~~accept all defaults and install everything (including 410.something NVIDIA driver)~~
 ~~===========~~
 ~~Driver: Installed~~
 ~~Toolkit: Installed in /usr/local/cuda-10.0~~
 ~~Samples: Installed in /home/ed, but missing recommended libraries~~
 ~~Please make sure that~~
 ~~- PATH includes /usr/local/cuda-10.0/bin~~
 ~~- LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root~~
 ~~To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin~~
 ~~To uninstall the NVIDIA Driver, run nvidia-uninstall~~
 ~~Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.~~
 ~~Logfile is /tmp/cuda_install_8524.log~~

~~Fix the paths~~

 ~~export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}~~
 ~~export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}~~

~~Also~~

 ~~vi /etc/ld.so.conf.d/cuda.conf~~
 ~~/usr/local/cuda-10.0/lib64~~
 ~~ldconfig~~

~~Finally:~~

 ~~./cuda_10.0.130.1.run~~
 ~~accept all defaults~~

===Other Software===

I installed the community edition of PyCharm:

 snap install pycharm-community --classic
 #Restart the local terminal so that it has updated paths (after a snap install, etc.)
 cd /snap/pycharm-community/214/bin/
 ./pycharm.sh

On launch, you get some config options. I chose to install and enable:
*IdeaVim (a VI editor emulator)
*R
*AWS Toolkit

Make a launcher in /usr/share/applications:

 vi pycharm.desktop
 [Desktop Entry]
 Version=2020.2.3
 Type=Application
 Name=PyCharm
 Icon=/snap/pycharm-community/214/bin/pycharm.png
 Exec="/snap/pycharm-community/214/bin/pycharm.sh" %f
 Comment=The Drive to Develop
 Categories=Development;IDE;
 Terminal=false
 StartupWMClass=jetbrains-pycharm

Also, create a launcher on the desktop with the same info.

Note that when I came back to the box the launcher didn't work.
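One possible reason for the launcher breaking, though this is a guess and was not checked on the box, is that the launcher hard-codes the snap revision directory (214), which changes when the snap refreshes. snapd keeps a current symlink next to the revision directories, so a launcher can point at that instead. The sketch below writes the same desktop entry with the revision as a parameter; write_pycharm_launcher is a made-up name, and the target directory is passed in so it can be tried somewhere harmless first.

```shell
# Write a PyCharm .desktop launcher into DIR.
# Usage: write_pycharm_launcher DIR [SNAP_REVISION]
# SNAP_REVISION defaults to "current", the symlink snapd maintains under
# /snap/pycharm-community/. (Hypothetical helper; fields copied from above.)
write_pycharm_launcher() {
  launcher_dir=$1
  snap_bin=/snap/pycharm-community/${2:-current}/bin
  cat > "$launcher_dir/pycharm.desktop" <<EOF
[Desktop Entry]
Version=2020.2.3
Type=Application
Name=PyCharm
Icon=$snap_bin/pycharm.png
Exec="$snap_bin/pycharm.sh" %f
Comment=The Drive to Develop
Categories=Development;IDE;
Terminal=false
StartupWMClass=jetbrains-pycharm
EOF
}

# As root, for the system-wide launcher:
#   write_pycharm_launcher /usr/share/applications
```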

~~Unfortunately this didn't work. After a reboot:~~

 ~~nvidia-settings --query FlatpanelNativeResolution~~
 ~~Unable to init server: Could not connect: Connection refused (message was same as before on the box, this was over ssh)~~
 ~~./deviceQuery Starting...~~
 ~~CUDA Device Query (Runtime API) version (CUDART static linking)~~
 ~~cudaGetDeviceCount returned 35~~
 ~~-> CUDA driver version is insufficient for CUDA runtime version~~
 ~~Result = FAIL~~
 ~~python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"~~
 ~~2019-07-09 15:20:40.085877: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected~~
 ~~2019-07-09 15:20:40.085978: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:148] kernel driver does not appear to be running on this host (bastard): /proc/driver/nvidia/version does not exist~~

==== MATLAB ====

I installed MATLAB R2024a by downloading the zip, running

 sudo ./install

and using the defaults of /usr/local/MATLAB/R2024 etc. The license number is 41201644.

 ~~/usr/bin/nvidia-persistenced --verbose~~
 ~~nvidia-persistenced failed to initialize. Check syslog for more details.~~

~~lspci -vk also returned the same as before~~

~~This is really frustrating!~~

~~Did the following:~~

 ~~apt-get install nvidia-prime~~
 ~~prime-select nvidia~~
 ~~Info: the nvidia profile is already set~~
 ~~update-initramfs -u~~

~~For next time (as root)~~

 ~~lshw -c video~~
 ~~... shows configuration without driver~~
 ~~modprobe --resolve-alias nvidiafb~~
 ~~modinfo $(modprobe --resolve-alias nvidiafb)~~
 ~~lsof +D /usr/lib/xorg/modules/drivers/~~
 ~~COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME~~
 ~~Xorg 2488 root mem REG 8,49 23624 26346422 /usr/lib/xorg/modules/drivers/fbdev_drv.so~~
 ~~Xorg 2488 root mem REG 8,49 90360 26347089 /usr/lib/xorg/modules/drivers/modesetting_drv.so~~
 ~~Xorg 2488 root mem REG 8,49 217104 26346424 /usr/lib/xorg/modules/drivers/nouveau_drv.so~~
 ~~Xorg 2488 root mem REG 8,49 7813904 26346043 /usr/lib/xorg/modules/drivers/nvidia_drv.so~~
 ~~cat /var/log/Xorg.0.log~~
 ~~] (II) LoadModule: "nvidia"~~
 ~~[ 29.047] (II) Loading /usr/lib/xorg/modules/drivers/nvidia_drv.so~~
 ~~[ 29.047] (II) Module nvidia: vendor="NVIDIA Corporation"~~
 ~~[ 29.047] compiled for 4.0.2, module version = 1.0.0~~
 ~~[ 29.047] Module class: X.Org Video Driver~~
 ~~[ 29.047] (II) NVIDIA dlloader X Driver 410.48 Thu Sep 6 06:27:34 CDT 2018~~
 ~~[ 29.047] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs~~
 ~~[ 29.047] (II) Loading sub module "fb"~~
 ~~[ 29.047] (II) LoadModule: "fb"~~
 ~~[ 29.047] (II) Loading /usr/lib/xorg/modules/libfb.so~~
 ~~[ 29.047] (II) Module fb: vendor="X.Org Foundation"~~
 ~~[ 29.047] compiled for 1.19.6, module version = 1.0.0~~
 ~~[ 29.047] ABI class: X.Org ANSI C Emulation, version 0.4~~
 ~~[ 29.047] (II) Loading sub module "wfb"~~
 ~~[ 29.047] (II) LoadModule: "wfb"~~
 ~~[ 29.047] (II) Loading /usr/lib/xorg/modules/libwfb.so~~
 ~~[ 29.048] (II) Module wfb: vendor="X.Org Foundation"~~
 ~~[ 29.048] compiled for 1.19.6, module version = 1.0.0~~
 ~~[ 29.048] ABI class: X.Org ANSI C Emulation, version 0.4~~
 ~~[ 29.048] (II) Loading sub module "ramdac"~~
 ~~[ 29.048] (II) LoadModule: "ramdac"~~
 ~~[ 29.048] (II) Module "ramdac" already built-in~~
 ~~[ 29.095] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the~~
 ~~[ 29.095] (EE) NVIDIA: system's kernel log for additional error messages and~~
 ~~[ 29.095] (EE) NVIDIA: consult the NVIDIA README for details.~~
 ~~[ 29.095] (EE) No devices detected.~~
 ~~vi /var/log/kern.log~~
 ~~... it looks like we are back to an unsigned module tainting the kernel.~~

===Upgrading the nVIDIA Drivers===

In MATLAB, I ran:

 gpuDevice
 Error using gpuDevice (line 26)
 Graphics driver is out of date. Download and install the latest graphics driver for your GPU from NVIDIA.

Some quick checks showed that I was using driver version 430.26 on Ubuntu 18.04.02.

 nvidia-smi
 lsb_release -a

I couldn't quite get MATLAB to tell me what I needed:
* https://www.mathworks.com/help/parallel-computing/gpu-computing-requirements.html
* https://www.mathworks.com/help/parallel-computing/run-mex-functions-containing-cuda-code.html#mw_20acaa78-994d-4695-ab4b-bca1cfc3dbac

For MEX, I have 10.2 and need 12.2 of the CUDA toolkit:

 MATLAB Release   CUDA Toolkit Version
 R2024a           12.2
 ...
 R2020b           10.2

However:
* nVidia said the latest version was https://www.nvidia.com/Download/driverResults.aspx/230357/en-us/
* The repo said the highest version for 18.04 is 545: https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa

As root:

 runlevel
 #5
 systemctl get-default
 #graphical.target
 systemctl set-default multi-user.target
 systemctl reboot

As ed:

 vncserver -kill :2
 Killing Xtightvnc process ID 1844

As root:

 #sh ./NVIDIA-Linux-x86_64-550.107.02.run
 # The distribution-provided pre-install script failed!
 #cat /var/log/nvidia-installer.log
 apt-get update
 apt install nvidia-driver-545
 systemctl set-default graphical.target
 systemctl reboot

 ~~vi /etc/default/grub~~
 ~~GRUB_DEFAULT=0~~
 ~~GRUB_TIMEOUT_STYLE=hidden~~
 ~~GRUB_TIMEOUT=2~~
 ~~GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`~~
 ~~GRUB_CMDLINE_LINUX_DEFAULT="nvidia-drm.modeset=1"~~
 ~~GRUB_CMDLINE_LINUX=""~~
 ~~update-grub~~
 ~~Sourcing file `/etc/default/grub'~~
 ~~Generating grub configuration file ...~~
 ~~Found linux image: /boot/vmlinuz-4.18.0-25-generic~~
 ~~Found initrd image: /boot/initrd.img-4.18.0-25-generic~~
 ~~Found linux image: /boot/vmlinuz-4.18.0-20-generic~~
 ~~Found initrd image: /boot/initrd.img-4.18.0-20-generic~~
 ~~Found linux image: /boot/vmlinuz-4.18.0-18-generic~~
 ~~Found initrd image: /boot/initrd.img-4.18.0-18-generic~~
 ~~Found linux image: /boot/vmlinuz-4.15.0-54-generic~~
 ~~Found initrd image: /boot/initrd.img-4.15.0-54-generic~~
 ~~Found memtest86+ image: /boot/memtest86+.elf~~
 ~~Found memtest86+ image: /boot/memtest86+.bin~~
 ~~device-mapper: reload ioctl on osprober-linux-nvme0n1p1 failed: Device or resource busy~~
 ~~Command failed~~
 ~~done~~

~~https://askubuntu.com/questions/1048274/ubuntu-18-04-stopped-working-with-nvidia-drivers~~

Run MATLAB:

 gpuDevice
   Name: 'NVIDIA TITAN RTX'
   Index: 1
   ComputeCapability: '7.5'
   GraphicsDriverVersion: '545.29.06'
   ToolkitVersion: 12.2000
 
 gpuDevice(2)
   Name: 'NVIDIA TITAN Xp'
   Index: 2
   ComputeCapability: '6.1'
   SupportsDouble: 1
   GraphicsDriverVersion: '545.29.06'
   ToolkitVersion: 12.2000

The messages were:

 apt install nvidia-driver-545
 The following additional packages will be installed:
   libnvidia-cfg1-545 libnvidia-common-545 libnvidia-compute-545 libnvidia-compute-545:i386 libnvidia-decode-545 libnvidia-decode-545:i386 libnvidia-encode-545 libnvidia-encode-545:i386 libnvidia-extra-545 libnvidia-fbc1-545 libnvidia-fbc1-545:i386 libnvidia-gl-545 libnvidia-gl-545:i386 nvidia-compute-utils-545 nvidia-dkms-545 nvidia-firmware-545-545.29.06 nvidia-kernel-common-545 nvidia-kernel-source-545 nvidia-utils-545 xserver-xorg-video-nvidia-545
 The following packages will be REMOVED:
   libnvidia-cfg1-430 libnvidia-common-430 libnvidia-compute-430 libnvidia-compute-430:i386 libnvidia-decode-430 libnvidia-decode-430:i386 libnvidia-encode-430 libnvidia-encode-430:i386 libnvidia-fbc1-430 libnvidia-fbc1-430:i386 libnvidia-gl-430 libnvidia-gl-430:i386 libnvidia-ifr1-430 libnvidia-ifr1-430:i386 nvidia-compute-utils-430 nvidia-dkms-430 nvidia-driver-430 nvidia-kernel-common-430 nvidia-kernel-source-430 nvidia-utils-430 xserver-xorg-video-nvidia-430
 The following NEW packages will be installed:
   libnvidia-cfg1-545 libnvidia-common-545 libnvidia-compute-545 libnvidia-compute-545:i386 libnvidia-decode-545 libnvidia-decode-545:i386 libnvidia-encode-545 libnvidia-encode-545:i386 libnvidia-extra-545 libnvidia-fbc1-545 libnvidia-fbc1-545:i386 libnvidia-gl-545 libnvidia-gl-545:i386 nvidia-compute-utils-545 nvidia-dkms-545 nvidia-driver-545 nvidia-firmware-545-545.29.06 nvidia-kernel-common-545 nvidia-kernel-source-545 nvidia-utils-545 xserver-xorg-video-nvidia-545
 0 upgraded, 21 newly installed, 21 to remove and 2 not upgraded.
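Before and after an upgrade like this it is handy to compare the installed driver version against whatever minimum an application asks for. The sketch below does a dotted-version comparison with GNU sort -V; version_ge is a name made up here, and the minimum is left as a parameter because the exact driver version MATLAB R2024a needs is not stated on this page.

```shell
# Succeeds if dotted version A is greater than or equal to version B.
# Relies on GNU sort's -V (version sort) option.
# (version_ge is a hypothetical helper name.)
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# On the box, the installed driver version can be read with nvidia-smi:
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   version_ge "$installed" "$required" || echo "driver too old"
# where $required is whatever minimum the application documents.
```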

Ed


DIGITS DevBox (view source)

Revision as of 17:07, 9 August 2024
