Multi-GPU runs on an H20 server fail with gpu_partition errors, plus tmux errors

张开发
2026/4/18 20:17:34 · 15 min read


How to fix

Change the vcuda config, then start the local shim service inside a tmux session:

    # Back up the current config, then write the port and host (one per line).
    cp -f /usr/local/lib/inais/.vcuda_config /usr/local/lib/inais/.vcuda_config.bak_$(date +%Y%m%d_%H%M%S)
    printf '22586\n127.0.0.1\n' > /usr/local/lib/inais/.vcuda_config
    cat /usr/local/lib/inais/.vcuda_config

    # Clear the related variables from the tmux global environment and from the current shell.
    tmux set-environment -gu LOCAL_HOST_IP
    tmux set-environment -gu VCUDA_PORT
    tmux set-environment -gu INAIS_GPU_MEM_CONTAINER
    tmux set-environment -gu INAIS_GPU_MEM_DEV
    tmux set-environment -gu INAIS_GPU_MEM_POD
    unset LOCAL_HOST_IP VCUDA_PORT INAIS_GPU_MEM_CONTAINER INAIS_GPU_MEM_DEV INAIS_GPU_MEM_POD

    # Start the shim on the same host and port that were written to the config.
    source /opt/conda/etc/profile.d/conda.sh
    conda activate janusdna
    python /chenhaowen/hnu/mps/lora_deepseek_ocr_vision_DNA/script/rice_phenotype_benchmark/vcuda_pidmap_shim.py \
      --host 127.0.0.1 \
      --port 22586

After that, you can run your own commands.

Verifying the fix

Check that the shim process exists:

    pgrep -af vcuda_pidmap_shim.py

Run a minimal CUDA check:

    source /opt/conda/etc/profile.d/conda.sh
    conda activate janusdna
    python -u - <<'PY'
    import torch
    print("cuda_available", torch.cuda.is_available())
    torch.cuda.set_device(0)
    x = torch.zeros(1, device="cuda:0")
    print("alloc_ok", x.device)
    PY

tmux errors

Write the following to /root/.tmux.conf:

    cat > /root/.tmux.conf <<'EOF'
    # Start each new tmux pane/window with a clean runtime state.
    # This avoids inheriting stale CONDA_*/CUDA_*/NCCL_*/NVIDIA_* variables
    # from an older tmux server or a different container image.
    set -g default-shell /bin/bash
    set -g default-command "exec env -u CONDA_DEFAULT_ENV -u CONDA_EXE -u CONDA_PREFIX -u CONDA_PREFIX_1 -u CONDA_PROMPT_MODIFIER -u CONDA_PYTHON_EXE -u CONDA_SHLVL -u _CE_CONDA -u _CE_M -u LD_PRELOAD -u LD_LIBRARY_PATH -u CUDA_HOME -u CUDA_PATH -u CUDA_VERSION -u CUDA_DRIVER_VERSION -u CUDA_CACHE_DISABLE -u CUDA_VISIBLE_DEVICES -u CUDA_DEVICE_ORDER -u NCCL_VERSION -u NCCL_IB_DISABLE -u NCCL_SHARP_DISABLE -u NCCL_NET -u NCCL_P2P_DISABLE -u NCCL_CUMEM_ENABLE -u NCCL_DEBUG -u NVIDIA_VISIBLE_DEVICES -u NVIDIA_DISABLE_REQUIRE -u NVIDIA_DRIVER_CAPABILITIES -u NVIDIA_PRODUCT_NAME -u NVIDIA_PYTORCH_VERSION -u NVIDIA_BUILD_ID -u NVIDIA_REQUIRE_CUDA -u OMPI_MCA_coll_hcoll_enable /bin/bash -l"
    # Keep locale variables in sync so tmux treats attached clients as UTF-8.
    # Sync PATH so a freshly attached client can bring in the expected conda env.
    set -g update-environment "DISPLAY KRB5CCNAME SSH_ASKPASS SSH_AUTH_SOCK SSH_AGENT_PID SSH_CONNECTION WINDOWID XAUTHORITY LANG LANGUAGE LC_ALL LC_CTYPE PATH"
    set-environment -g LANG C.UTF-8
    set-environment -g LC_ALL C.UTF-8
    EOF
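
To confirm that a new tmux pane really starts clean, it can help to list any of the variables this post clears that are still set. The following is a minimal sketch, not part of the original setup. The function name `find_stale_vars` is hypothetical, and matching by prefix (CONDA_, CUDA_, NCCL_, NVIDIA_, INAIS_GPU_MEM_) is an assumption that is slightly broader than the exact names listed in the tmux config.

```python
import os

# Names cleared individually in this post (assumed to be stale if present
# in a freshly opened pane).
EXACT_NAMES = {
    "LOCAL_HOST_IP",
    "VCUDA_PORT",
    "LD_PRELOAD",
    "LD_LIBRARY_PATH",
    "OMPI_MCA_coll_hcoll_enable",
    "_CE_CONDA",
    "_CE_M",
}

# Prefix matching is an assumption: it is broader than the exact list
# of -u flags in the tmux config.
PREFIXES = ("CONDA_", "CUDA_", "NCCL_", "NVIDIA_", "INAIS_GPU_MEM_")


def find_stale_vars(environ):
    """Return the subset of environ whose names match the lists above."""
    return {
        name: value
        for name, value in sorted(environ.items())
        if name in EXACT_NAMES or name.startswith(PREFIXES)
    }


if __name__ == "__main__":
    stale = find_stale_vars(os.environ)
    if stale:
        for name, value in stale.items():
            print(f"still set: {name}={value}")
    else:
        print("no matching variables are set")
```

Run it in a freshly opened pane before `conda activate`. After activation, CONDA_* variables are expected to be set again, so seeing them at that point does not indicate a problem.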
