Difference between revisions of "Tensorflow with gpu"
From ElphelWiki
Revision as of 16:39, 23 December 2019
Requirements
- Kubuntu 16.04 LTS
Setup (guide)
Just follow:
- The walkthrough at the bottom of this page is for CUDA 10.1, cuDNN 7.6.1, python3
- This guide (Ubuntu 16.04 64-bit, CUDA 9.2, cuDNN 7.1.4, python3)
- This guide (Ubuntu 16.04 64-bit, CUDA 9.1, cuDNN 7.1.2, python3)
Setup (some details)
- Check device
~$ lspci | grep NVIDIA
81:00.0 VGA compatible controller: NVIDIA Corporation GF119 [GeForce GT 610] (rev a1)
81:00.1 Audio device: NVIDIA Corporation GF119 HDMI Audio Controller (rev a1)
- Check driver version:
~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 387.26 Thu Nov 2 21:20:16 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
- Install cuda 9.2 with patch(es):
# https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=deblocal:
~$ sudo dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
~$ sudo apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
~$ sudo apt-get update
~$ sudo apt-get install cuda
# INSTALL THE PATCH(ES)
- The PC might need a reboot. If CUDA 9.2 was installed over another version, the NVIDIA tools will throw errors about mismatching driver versions. To check, try:
~$ nvidia-smi
Good looking output:
Wed Jun 13 15:55:44 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 00000000:01:00.0  On |                  N/A |
| 33%   36C    P8     1W /  46W |    229MiB /  2000MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1305      G   /usr/lib/xorg/Xorg                           136MiB |
|    0      3587      G   /usr/bin/krunner                               1MiB |
|    0      3590      G   /usr/bin/plasmashell                          67MiB |
|    0      3693      G   /usr/bin/plasma-discover                      20MiB |
+-----------------------------------------------------------------------------+
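To compare the two driver versions reported above without reading them by eye, the version strings can be pulled out of both outputs. This is only a sketch: the function names are made up for illustration, and the regular expressions assume the output formats shown on this page. When the versions really do mismatch, nvidia-smi may fail without printing a table at all, in which case the second function returns None.

```python
import re


def kernel_module_version(proc_text):
    """Return the version found in /proc/driver/nvidia/version text, or None."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def smi_driver_version(smi_text):
    """Return the 'Driver Version' printed in the nvidia-smi header, or None."""
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)+)", smi_text)
    return match.group(1) if match else None


def versions_match(proc_text, smi_text):
    """True only if both versions were found and are identical."""
    kernel = kernel_module_version(proc_text)
    smi = smi_driver_version(smi_text)
    return kernel is not None and kernel == smi


# Sample strings copied from the outputs on this page (taken at different times).
PROC_SAMPLE = "NVRM version: NVIDIA UNIX x86_64 Kernel Module 387.26 Thu Nov 2 21:20:16 PDT 2017"
SMI_SAMPLE = "| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |"

print(kernel_module_version(PROC_SAMPLE), smi_driver_version(SMI_SAMPLE))
```

On a live system the two texts would come from reading /proc/driver/nvidia/version and from the output of nvidia-smi.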
- Check out post installation docs:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions:
# Export paths
~$ export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
~$ export LD_LIBRARY_PATH=/usr/local/cuda-9.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
~$ export LD_LIBRARY_PATH=/usr/local/cuda-9.2/extras/CUPTI/lib64/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
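The ${VAR:+:${VAR}} form in the exports above appends ":" plus the old value only when the variable is already set and non-empty, so an empty variable does not leave a stray colon. A rough Python equivalent, with a hypothetical helper name, shows the behaviour:

```python
def prepend_path(new_dir, current):
    """Mimic NEW${VAR:+:${VAR}} from the shell.

    The ':' separator and the old value are added only when the old
    value is non-empty.
    """
    if current:
        return new_dir + ":" + current
    return new_dir


print(prepend_path("/usr/local/cuda-9.2/bin", ""))
print(prepend_path("/usr/local/cuda-9.2/bin", "/usr/bin:/bin"))
```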
- Install TensorFlow (build from sources for cuda 9.2):
link 1 (preferable guide): http://www.python36.com/install-tensorflow141-gpu/
link 2: https://www.tensorflow.org/install/install_sources
- [Optional] Install TensorFlow (prebuilt for cuda 9.0?):
# docs:
# - https://www.tensorflow.org/install/install_linux
# some instructions:
# - install cuDNN
~$ sudo apt-get install python3-pip # if it is not already installed
~$ sudo pip3 install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.7.0-cp35-cp35m-linux_x86_64.whl
Testing setup
- Supported card GeForce GTX 750 Ti (list of supported graphic cards):
~$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, World!')
>>> sess = tf.Session()
2018-04-26 18:14:05.427668: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-26 18:14:05.428033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 750 Ti major: 5 minor: 0 memoryClockRate(GHz): 1.1105
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.53GiB
2018-04-26 18:14:05.428061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-26 18:14:05.927106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-26 18:14:05.927149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-26 18:14:05.927163: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-26 18:14:05.927313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1289 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)
>>> print(sess.run(hello))
b'Hello, World!'
- Unsupported card GeForce GT 610
~$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, World!')
>>> sess = tf.Session()
2018-04-26 13:00:19.050625: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-26 13:00:19.181581: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GT 610 major: 2 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:81:00.0
totalMemory: 956.50MiB freeMemory: 631.69MiB
2018-04-26 13:00:19.181648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1394] Ignoring visible gpu device (device: 0, name: GeForce GT 610, pci bus id: 0000:81:00.0, compute capability: 2.1) with Cuda compute capability 2.1. The minimum required Cuda capability is 3.5.
2018-04-26 13:00:19.181669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-26 13:00:19.181683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-26 13:00:19.181695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
>>> print(sess.run(hello))
b'Hello, World!'
- As a quick fix, had to install cuDNN 7.0.5 instead of the latest version:
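Both sessions print 'Hello, World!', so the only sign that the GT 610 was skipped is the "Ignoring visible gpu device" log line. The sketch below pulls the compute capability out of such a log line and compares it with the 3.5 minimum quoted in that message. The helper names are hypothetical, and the minimum is taken from this particular TensorFlow build's log, so other builds may differ.

```python
import re

# Minimum taken from the log message above; an assumption for other builds.
MIN_CAPABILITY = (3, 5)


def compute_capability(log_text):
    """Return (major, minor) from a 'compute capability: X.Y' log fragment, or None."""
    match = re.search(r"compute capability:\s*(\d+)\.(\d+)", log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported(capability, minimum=MIN_CAPABILITY):
    """Tuples compare element by element, so (5, 0) >= (3, 5) is True."""
    return capability >= minimum


GTX_750_TI = "name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0, compute capability: 5.0)"
GT_610 = "name: GeForce GT 610, pci bus id: 0000:81:00.0, compute capability: 2.1)"

for line in (GTX_750_TI, GT_610):
    cap = compute_capability(line)
    print(cap, "supported" if is_supported(cap) else "ignored")
```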
https://stackoverflow.com/questions/49960132/cudnn-library-compatibility-error-after-loading-model-weights
- Print tensorflow version
>>> print(tf.__version__)
Problems
- [SOLVED] AttributeError: '_NamespacePath' object has no attribute 'sort'
# Notes:
After updating some packages probably. python3?
# How to reproduce:
1: ~$ python3
   >>> import tensorflow
2: ~$ virtualenv --system-site-packages -p python3
# Solution:
~$ sudo pip3 install setuptools --upgrade
Walkthrough for CUDA 10.1 (20190602)
Install CUDA
- In this guide there's a link to CUDA toolkit.
- That toolkit (CUDA Toolkit 10.1 update1 (May 2019)) also updated the system driver to 418.67
- Reboot
Install cuDNN
- Have to have an account with NVIDIA - downloaded cuDNN v7.6.1 (June 24, 2019), for CUDA 10.1
Option 1: installing tensorflow from source
Basically, this guide, some key notes:
- Install bazel - version 0.25.2 (newer will not work)
- To build, read this link:
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout r1.14
./configure
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
# 4-5 hours later
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
sudo pip3 install /tmp/tensorflow_pkg/tensorflow-[Tab]
- Testing:
~$ python3
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, World!')
>>> sess = tf.Session()
Option 2: using docker
Follow this guide. Key notes:
- The TensorFlow docker image requires the NVIDIA docker image, which requires apt install nvidia-docker2, which in turn requires apt install docker-ce:
- https://github.com/NVIDIA/nvidia-docker
- https://docs.docker.com/install/linux/docker-ce/ubuntu/
- Test run:
# Test 1: GPU support inside container:
sudo docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
# Test 2: Test all together
sudo docker pull tensorflow/tensorflow:latest-gpu-py3-jupyter
sudo docker run --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu-py3-jupyter python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
# Test 3: Run a local script (and include a local dir) in container:
https://www.tensorflow.org/install/docker
Setup walkthrough for CUDA 10.2 (Dec 2019)
Install CUDA
- In this guide there's a link to CUDA toolkit.
- The toolkit (CUDA Toolkit 10.2) also updated the system driver to 440.33.01
- Will have to reboot
Docker
Instructions
https://www.tensorflow.org/install/docker
Quote:
Docker is the easiest way to enable TensorFlow GPU support on Linux since only the NVIDIA® GPU driver is required on the host machine (the NVIDIA® CUDA® Toolkit does not need to be installed).
Docker images
Where to browse: https://hub.docker.com/r/tensorflow/tensorflow/:
TF version | Python major version | GPU support | NAME:TAG for Docker command |
---|---|---|---|
1.15 | 3 | yes | tensorflow/tensorflow:1.15.0-gpu-py3 |
2.0.0+ | 3 | yes | tensorflow/tensorflow:latest-gpu-py3 |
2.0.0+ | 2 | yes | tensorflow/tensorflow:latest-gpu |
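For scripted pulls, the table can be kept as a small lookup. This is a sketch only: the dictionary simply mirrors the rows above, and the function names are made up. Note that the "latest" tags move over time, so they will not stay at the versions listed here.

```python
# (TF version, Python major version) -> NAME:TAG, copied from the table above.
GPU_IMAGES = {
    ("1.15", 3): "tensorflow/tensorflow:1.15.0-gpu-py3",
    ("2.0.0+", 3): "tensorflow/tensorflow:latest-gpu-py3",
    ("2.0.0+", 2): "tensorflow/tensorflow:latest-gpu",
}


def image_for(tf_version, python_major):
    """Look up the image name; raises KeyError for combinations not in the table."""
    return GPU_IMAGES[(tf_version, python_major)]


def pull_command(tf_version, python_major):
    """Build the argument list for 'docker pull' without running it."""
    return ["docker", "pull", image_for(tf_version, python_major)]


print(" ".join(pull_command("1.15", 3)))
```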
nvidia-docker
Somehow it was already installed.
- Check NVIDIA docker version
~$ nvidia-docker version
- According to the docs, Docker 19.03+ supports GPUs natively through the nvidia-container-toolkit package and the --gpus flag. Older Docker versions need nvidia-docker2 and the --runtime=nvidia flag.
- It's not immediately clear whether nvidia-container-runtime has to be set up separately. nvidia-docker v1 & v2 should have already registered it.
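The commands on this page use both GPU options: --runtime=nvidia in the CUDA 10.1 walkthrough and --gpus all in the notes below. A minimal sketch of choosing between them from the Docker version follows, assuming the rule on the TensorFlow Docker install page (--gpus all from Docker 19.03 on, --runtime=nvidia with nvidia-docker2 before that). The function names and the sample version strings are illustrative.

```python
import re


def parse_docker_version(text):
    """Return (major, minor) from 'docker --version' style output, or None."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        return None
    # int("03") is 3, so "19.03" becomes (19, 3).
    return int(match.group(1)), int(match.group(2))


def gpu_flags(version):
    """Pick the GPU option for 'docker run' based on the Docker version."""
    if version >= (19, 3):
        return ["--gpus", "all"]
    return ["--runtime=nvidia"]


# Sample strings in the usual 'Docker version X, build Y' format (made-up builds).
for sample in ("Docker version 19.03.5, build 633a0ea838",
               "Docker version 18.09.7, build 2d0083d"):
    print(sample, "->", gpu_flags(parse_docker_version(sample)))
```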
Notes
- Can mount a local directory in a 'binding' mode - i.e., update files locally so they are updated in the docker container as well:
# this will bind-mount directory target located in $(pwd), which is a dir the command is run from
# to /app in the docker container
~$ docker run \
       -it \
       --rm \
       --name devtest \
       -p 0.0.0.0:6006:6006 \
       --mount type=bind,source="$(pwd)"/target,target=/app \
       --gpus all \
       tensorflow/tensorflow:latest-gpu-py3 \
       bash
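When the same container is started from a script, the command above can be assembled as an argument list. The sketch below only builds the list and does not start anything. The function name and its defaults are illustrative, with the default values copied from the command above.

```python
import os
import shlex


def docker_run_args(host_dir,
                    image="tensorflow/tensorflow:latest-gpu-py3",
                    container_dir="/app",
                    port=6006,
                    name="devtest",
                    command="bash"):
    """Build the 'docker run' argument list with a bind mount and GPU access."""
    source = os.path.abspath(host_dir)  # bind mounts need an absolute source path
    return [
        "docker", "run",
        "-it",
        "--rm",
        "--name", name,
        "-p", "0.0.0.0:{0}:{0}".format(port),
        "--mount", "type=bind,source={},target={}".format(source, container_dir),
        "--gpus", "all",
        image,
        command,
    ]


args = docker_run_args("/tmp/target")
print(" ".join(shlex.quote(arg) for arg in args))
```

Passing the list to subprocess.run(args) would start the container. A list avoids shell quoting problems when the host path contains spaces.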
- How to run tensorboard from the container:
# from https://briancaffey.github.io/2017/11/20/using-tensorflow-and-tensor-board-with-docker.html
# From the running container's command line (since it was run with 'bash' in the step above).
# set a correct --logdir
root@e9efee9e3fd3:/# tensorboard --bind_all --logdir=/app/log.txt # remove --bind_all for TF 1.15
# Then open a browser:
http://localhost:6006
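TensorBoard's --logdir expects a directory containing event files, so it can help to check the path before launching. The following is a sketch with a hypothetical helper that validates the directory and builds the command line. The bind_all switch follows the note above about dropping --bind_all for TF 1.15.

```python
import os


def tensorboard_command(logdir, bind_all=True, port=6006):
    """Build the tensorboard argument list after checking that logdir is a directory."""
    if not os.path.isdir(logdir):
        raise ValueError("--logdir must be an existing directory: {}".format(logdir))
    command = ["tensorboard"]
    if bind_all:
        # Per the note above, leave this out for TF 1.15.
        command.append("--bind_all")
    command.append("--logdir={}".format(logdir))
    command.append("--port={}".format(port))
    return command


# The current directory always exists, so it serves as a harmless demo value.
print(" ".join(tensorboard_command(os.getcwd())))
```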