Fine-tuning command

CUDA_VISIBLE_DEVICES=0 python /aaa/LLaMA-Factory/src/train_bash.py \
    --stage sft \
    --model_name_or_path /aaa/LLaMA-Factory/models/chatglm2-6b \
    --do_train \
    --dataset bbbccc \
    --template chatglm2 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir output/dddeee/ \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 10 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss

The full model has already been downloaded from Hugging Face and its path configured correctly, and the custom dataset has been registered in dataset_info.json following the alpaca_gpt4_data_zh.json example (a sketch of such an entry is shown below). But running the command above still fails; the error output follows the sketch:
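For reference, a minimal sketch of the kind of dataset_info.json entry this refers to, for an alpaca-format dataset. The file name bbbccc.json and the column mapping are illustrative assumptions, not copied from the actual setup; only the key "bbbccc" is fixed, since it must match the --dataset flag above:

"bbbccc": {
  "file_name": "bbbccc.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output"
  }
}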

[INFO|training_args.py:1798] 2023-11-02 16:00:19,165 >> PyTorch: setting up devices
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:1760] 2023-11-02 16:00:19,402 >> ***** Running training *****
[INFO|trainer.py:1761] 2023-11-02 16:00:19,402 >> Num examples = 1,372
[INFO|trainer.py:1762] 2023-11-02 16:00:19,402 >> Num Epochs = 3
[INFO|trainer.py:1763] 2023-11-02 16:00:19,402 >> Instantaneous batch size per device = 4
[INFO|trainer.py:1766] 2023-11-02 16:00:19,402 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1767] 2023-11-02 16:00:19,403 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1768] 2023-11-02 16:00:19,403 >> Total optimization steps = 255
[INFO|trainer.py:1769] 2023-11-02 16:00:19,404 >> Number of trainable parameters = 1,949,696
  0%|          | 0/255 [00:00<?, ?it/s]
  warnings.warn(
Traceback (most recent call last):
  File "/aaa/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/aaa/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/aaa/LLaMA-Factory/src/llmtuner/tuner/tune.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/aaa/LLaMA-Factory/src/llmtuner/tuner/sft/workflow.py", line 67, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/transformers/trainer.py", line 1892, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/transformers/trainer.py", line 2776, in training_step
    loss = self.compute_loss(model, inputs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/transformers/trainer.py", line 2801, in compute_loss
    outputs = model(**inputs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/peft/peft_model.py", line 918, in forward
    return self.base_model(
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
    return self.model.forward(*args, **kwargs)
  File "/xxxcache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/xxxcache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 830, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/xxxcache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 631, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 451, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/xxxcache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/xxxcache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 376, in forward
    mixed_x_layer = self.query_key_value(hidden_states)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/envs/llama_factory_py310/lib/python3.10/site-packages/peft/tuners/lora.py", line 902, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
  0%|          | 0/255 [00:00<?, ?it/s]

While the command runs, the model appears to load successfully, and the error is raised right at the start of training the first epoch (step 0 of 255). Adding --fp16 to the command above and rerunning also fails with an error.

Record of the exchange with the open-source community: https://github.com/hiyouga/LLaMA-Factory/issues/1359

Cause: a broken CUDA environment. The installed PyTorch could not see the GPU, so the half-precision model silently ran on the CPU, and the CPU matmul kernel (addmm_impl_cpu_) has no Half implementation, which is exactly the RuntimeError above. Fix: pip install torch==2.0.1 (a CUDA-enabled build). Diagnosis: adding a log statement showed torch.cuda.is_available() returning False, confirming the CUDA environment was broken.
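A minimal sanity-check sketch along the same lines, runnable before launching training (nothing here is specific to LLaMA-Factory):

import torch

# A "+cpu" suffix on the version string indicates a CPU-only wheel.
print(torch.__version__)
# None on a CPU-only build; a version string like "11.8" on a CUDA build.
print(torch.version.cuda)
# False here reproduces the failure condition from the issue above.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

If torch.cuda.is_available() prints False, reinstall a CUDA-enabled wheel, e.g. pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118 (cu118 is an assumption here; pick the index matching the machine's CUDA driver).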
