2 min read • from Machine Learning
Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO [P]
So, a few days back I shared a post where I trained a tiny Qwen2.5-0.5B-Instruct model on smoltldr (a Reddit post summarization dataset of 2k rows) to output summaries of at most 64 tokens, using RLVR with GRPO. However, there was a catch!
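For context, the core of GRPO is that it replaces a learned value baseline with group statistics: sample several completions per prompt, score each with the reward functions, and normalize each reward by the mean and standard deviation of its group. A minimal sketch of that advantage computation (the function name and exact normalization details here are my own illustration, not the post's code):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for one prompt's sampled completions.

    GRPO's key idea: instead of a learned critic, each completion's
    advantage is its reward standardized against the group's mean/std.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    if sigma == 0:
        # all completions scored identically -> no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# e.g. two completions for one prompt, rewards 1.0 and 3.0
print(grpo_advantages([1.0, 3.0]))  # [-1.0, 1.0]
```

A degenerate case worth noting: if every sampled completion gets the same reward, the advantage collapses to zero, which is one reason reward design matters so much in these small-model runs.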
Hence the charts showed a sharp decline, converging to response lengths hovering around 15 tokens. I used two rewards: a length_penalty and a quality_reward (ROUGE-L, based on the longest common subsequence).
Trained for one full epoch with a max batch size of 2 (before hitting an OOM), the results were identical to the previous run, however with one crucial difference -
Anyways, next up:
Tagged with
#Qwen2.5-0.5B-Instruct
#GRPO
#reddit post summarization
#summarization dataset
#RLVR
#ROUGE-L
#length_penalty
#quality_reward
#tokens
#LCS
#max length