在上次尝试失败之后,领悟到如果要运行这个测试用例,就需要构造出跟当初作者类似的环境,这个任务就交给conda了

部署conda

只需要安装miniconda即可,先获取脚本

root@mypc:~/conda# wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py38_4.9.2-Linux-x86_64.sh

添加执行权限,然后运行,一路 yes 加回车即可

root@mypc:~/conda# chmod +x ./Miniconda3-py38_4.9.2-Linux-x86_64.sh

root@mypc:~/conda# ./Miniconda3-py38_4.9.2-Linux-x86_64.sh

更换阿里的镜像源

conda config --add channels https://mirrors.aliyun.com/anaconda/pkgs/free

conda config --add channels https://mirrors.aliyun.com/anaconda/pkgs/main

conda config --add channels https://mirrors.aliyun.com/anaconda/pkgs/msys2

conda config --add channels https://mirrors.aliyun.com/anaconda/pkgs/r

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/Paddle

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/auto

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/biobakery

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/bioconda

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/c4aarch64

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/caffe2

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/conda-forge

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/deepmodeling

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/dglteam

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/fastai

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/fermi

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/idaholab

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/intel

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/matsci

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/menpo

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/mordred-descriptor

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/msys2

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/numba

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/ohmeta

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/omnia

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/plotly

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/psi4

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/pytorch

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/pytorch-test

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/pytorch3d

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/pyviz

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/qiime2

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/rapidsai

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/rdkit

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/simpleitk

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/stackless

conda config --add channels https://mirrors.aliyun.com/anaconda/cloud/ursky

conda config --set show_channel_urls yes

# 确认

root@mypc:~# conda config --show-sources

构造虚拟环境

生成python2.7的虚拟环境

root@mypc:~# conda create -n my_tensorflow python=2.7

确认

root@mypc:~# conda env list

# conda environments:

#

base /root/miniconda3

my_tensorflow /root/miniconda3/envs/my_tensorflow

root@mypc:~#

进入虚拟环境,注意命令提示符前面多了一个(my_tensorflow)前缀

root@mypc:~# conda activate my_tensorflow

(my_tensorflow) root@mypc:~#

接下来安装tensorflow 1.12,注意一定要先手动安装下面两个包

(my_tensorflow) root@mypc:~# pip install grpcio==1.24

(my_tensorflow) root@mypc:~# pip install protobuf==3.17.3

然后再安装tensorflow

(my_tensorflow) root@mypc:~# pip install tensorflow==1.12

确认版本号

(my_tensorflow) root@mypc:~# pip freeze | grep tensor

DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.

tensorboard==1.12.2

tensorflow==1.12.0

(my_tensorflow) root@mypc:~#

导出导入环境

既然是分布式,肯定还需要把环境导出到其它主机上。这里没有使用 export yaml 的方式,而是直接用 conda pack 打包

root@mypc:~# conda install -c conda-forge conda-pack

root@mypc:~# conda pack -n my_tensorflow -o my_tensorflow.tar.gz

Collecting packages...

Packing environment at '/root/miniconda3/envs/my_tensorflow' to 'my_tensorflow.tar.gz'

[########################################] | 100% Completed | 19.7s

root@mypc:~#

然后通过scp将my_tensorflow.tar.gz拷贝到其它主机上,接着解压缩到新主机的miniconda3/envs目录下即可

root@hpc-1:~# mkdir /root/miniconda3/envs/my_tensorflow

root@hpc-1:~# tar -zxvf my_tensorflow.tar.gz -C /root/miniconda3/envs/my_tensorflow

root@hpc-1:~# conda env list

# conda environments:

#

base * /root/miniconda3

my_tensorflow /root/miniconda3/envs/my_tensorflow

root@hpc-1:~#

通过这个操作,可以反向推导出conda的工作原理,就是在虚拟环境中,将对python的库调用“劫持”到env下的python库,从而实现独立的python库版本控制

运行脚本

本用例中通信的IP

mypc:5.5.5.254----作为ps

hpc-1:5.5.5.1-----作为worker0

hpc-2:5.5.5.2-----作为worker1

修改distributed.py,主要是修改data_dir和主机IP,放到三台主机上

root@hpc-2:~# git diff distributed.py_bak distributed.py

diff --git a/distributed.py_bak b/distributed.py

index d3a3285..b5087ea 100644

--- a/distributed.py_bak

+++ b/distributed.py

@@ -8,16 +8,16 @@ from tensorflow.examples.tutorials.mnist import input_data

flags = tf.app.flags

IMAGE_PIXELS = 28

# 定义默认训练参数和数据路径

-flags.DEFINE_string('data_dir', '/tmp/mnist-data', 'Directory for storing mnist data')

+flags.DEFINE_string('data_dir', '/root/tensorflow_share', 'Directory for storing mnist data')

flags.DEFINE_integer('hidden_units', 100, 'Number of units in the hidden layer of the NN')

flags.DEFINE_integer('train_steps', 10000, 'Number of training steps to perform')

flags.DEFINE_integer('batch_size', 100, 'Training batch size ')

flags.DEFINE_float('learning_rate', 0.01, 'Learning rate')

# 定义分布式参数

# 参数服务器parameter server节点

-flags.DEFINE_string('ps_hosts', '192.168.32.145:22221', 'Comma-separated list of hostname:port pairs')

+flags.DEFINE_string('ps_hosts', '5.5.5.254:22221', 'Comma-separated list of hostname:port pairs')

# 两个worker节点

-flags.DEFINE_string('worker_hosts', '192.168.32.146:22221,192.168.32.160:22221',

+flags.DEFINE_string('worker_hosts', '5.5.5.1:22221,5.5.5.2:22221',

'Comma-separated list of hostname:port pairs')

# 设置job name参数

flags.DEFINE_string('job_name', None, 'job name: worker or ps')

root@hpc-2:~#

三台主机都进入到my_tensorflow虚拟环境中,运行脚本。先启动ps的脚本,它会等待worker连上

(my_tensorflow) root@mypc:~# python distributed.py --job_name=ps --task_index=0

...

2022-11-05 08:19:49.930693: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

2022-11-05 08:19:49.936649: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> localhost:22221}

2022-11-05 08:19:49.936682: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> 5.5.5.1:22221, 1 -> 5.5.5.2:22221}

2022-11-05 08:19:49.937980: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:22221

启动worker0,会等待worker1

(my_tensorflow) root@hpc-1:~# python distributed.py --job_name=worker --task_index=0

...

Worker 0: Initailizing session...

2022-11-05 08:21:07.034039: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

2022-11-05 08:21:17.034288: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

启动worker1,两个worker开始跑起来了

(my_tensorflow) root@hpc-2:~# python distributed.py --job_name=worker --task_index=1

...

Instructions for updating:

Please switch to tf.train.MonitoredTrainingSession

Worker 1: Waiting for session to be initaialized...

2022-11-05 08:22:52.177682: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 1204b5ccec20d3ad with config:

Worker 1: Session initialization complete.

Traing begins @ 1667607772.224416

1667607772.598671: Worker 1: traing step 1 dome (global step:0)

1667607772.643606: Worker 1: traing step 2 dome (global step:1)

1667607772.680817: Worker 1: traing step 3 dome (global step:2)

1667607772.722659: Worker 1: traing step 4 dome (global step:3)

1667607772.789997: Worker 1: traing step 5 dome (global step:4)

直到运行结束

#hpc-1上

...

1667608163.500756: Worker 0: traing step 5016 dome (global step:10001)

Training ends @ 1667608163.500929

Training elapsed time:391.221547 s

After 10000 training step(s), validation cross entropy = 1044.76

(my_tensorflow) root@hpc-1:~#

#hpc-2上

...

1667608163.478347: Worker 1: traing step 4987 dome (global step:10000)

Training ends @ 1667608163.478575

Training elapsed time:391.254159 s

After 10000 training step(s), validation cross entropy = 1044.76

(my_tensorflow) root@hpc-2:~#

(my_tensorflow) root@hpc-2:~#

MNIST的数据集已经自动下载好了

(my_tensorflow) root@hpc-2:~# ls -l /root/tensorflow_share

total 11336

-rw-r--r-- 1 root root 1648877 11-р сар 5 08:19 t10k-images-idx3-ubyte.gz

-rw-r--r-- 1 root root 4542 11-р сар 5 08:19 t10k-labels-idx1-ubyte.gz

-rw-r--r-- 1 root root 9912422 11-р сар 5 08:19 train-images-idx3-ubyte.gz

-rw-r--r-- 1 root root 28881 11-р сар 5 08:19 train-labels-idx1-ubyte.gz

(my_tensorflow) root@hpc-2:~#

好文推荐

评论可见,请评论后查看内容,谢谢!!!
 您阅读本篇文章共花了: