DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning in an open and accessible manner.

What makes DeepSeek-R1 especially exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cheap to run, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
The Essentials

The DeepSeek-R1 paper introduced several models, but the main ones are R1 and R1-Zero. Following these is a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 builds on two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.

2. Group Relative Policy Optimization (GRPO), a reinforcement learning technique that relies on comparing multiple model outputs per prompt, avoiding the need for a separate critic.
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a `<think>` tag before responding with a final summary.
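As a toy illustration of that format, here is a minimal Python sketch for splitting the reasoning trace from the final answer; the `<think>` tag name matches what R1 emits, but the sample response is made up.

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style response into (reasoning, final_answer).

    Assumes the reasoning is wrapped in <think>...</think> tags and
    the final summary follows the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()  # no explicit reasoning block found
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer

# Made-up example response
sample = "<think>2 + 2 = 4, and 4 * 3 = 12.</think>The answer is 12."
print(split_reasoning(sample))  # ('2 + 2 = 4, and 4 * 3 = 12.', 'The answer is 12.')
```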
R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.

R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing several languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is exceptionally interesting. It showcases how they produced such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they fixed them in the next stage.

It's interesting that their training pipeline differs from the usual one:

The typical training strategy: pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF

R1-Zero: Pretrained → RL

R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.

First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.

Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model (a minimal sketch of this step is shown below). They collected around 600k high-quality reasoning samples.

Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.

Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
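Here is a minimal sketch of the rejection-sampling step described above; `generate` and `is_correct` are hypothetical stand-ins for the RL checkpoint's sampler and a rule-based verifier, not DeepSeek's actual code.

```python
from typing import Callable

def rejection_sample_sft_data(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],  # hypothetical: n sampled completions per prompt
    is_correct: Callable[[str, str], bool],     # hypothetical: rule-based verifier
    samples_per_prompt: int = 16,
    keep_per_prompt: int = 1,
) -> list[dict]:
    """Build SFT data by keeping only completions that pass the verifier."""
    sft_data = []
    for prompt in prompts:
        completions = generate(prompt, samples_per_prompt)
        accepted = [c for c in completions if is_correct(prompt, c)]
        for completion in accepted[:keep_per_prompt]:
            sft_data.append({"prompt": prompt, "completion": completion})
    return sft_data
```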
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.

The teacher is typically a larger model than the student.
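In its simplest form, distillation here is just supervised fine-tuning of the student on teacher-generated reasoning traces. A minimal sketch, where `teacher_generate` and `student.train_step` are hypothetical helpers:

```python
def distill(student, teacher_generate, prompts, epochs: int = 1):
    """Hypothetical distillation loop: the teacher produces reasoning traces,
    and the student is fine-tuned to reproduce them (plain SFT)."""
    # 1. Teacher generates (prompt, trace) pairs once, offline.
    dataset = [(prompt, teacher_generate(prompt)) for prompt in prompts]

    # 2. Student is trained with ordinary supervised fine-tuning on those pairs.
    for _ in range(epochs):
        for prompt, trace in dataset:
            student.train_step(prompt, target=trace)  # hypothetical API
    return student
```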
Group Relative Policy Optimization (GRPO)

The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.

They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.

Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.

Instead of depending on expensive external models or human-graded examples as in standard RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected format, and if the language of the answer matches that of the prompt.

Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
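A minimal sketch of what such a rule-based reward could look like; the weights, tag names, and the `detect_language` helper are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def rule_based_reward(prompt: str, response: str, reference_answer: str,
                      detect_language=lambda text: "en") -> float:
    """Toy reward combining correctness, formatting, and language consistency."""
    reward = 0.0

    # 1. Correctness: does the final answer (after the reasoning) match the reference?
    final_answer = response.split("</think>")[-1].strip()
    if final_answer == reference_answer.strip():
        reward += 1.0

    # 2. Formatting: is the reasoning wrapped in <think>...</think> tags?
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.2

    # 3. Language consistency: does the response language match the prompt's?
    if detect_language(response) == detect_language(prompt):
        reward += 0.2

    return reward
```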
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates several different responses.

2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.

3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.

4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its initial behavior.
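The "group relative" part of step 3 boils down to standardizing each reward against the other samples for the same prompt. A minimal sketch of just that step (ignoring the clipping and KL terms of the full objective):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantages for one prompt's group of sampled responses:
    each reward is standardized against the group mean and std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four sampled responses to the same prompt, with scalar rewards
print(group_relative_advantages([1.0, 0.0, 0.2, 1.2]))
```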
A nice aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another great resource; a rough sketch of what that looks like follows below.

Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
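Here is the promised rough sketch of GRPO fine-tuning through TRL's GRPOTrainer; the model name and the toy reward are placeholders, and the exact argument names may differ between TRL versions, so treat this as a starting point rather than a recipe.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def brevity_reward(completions, **kwargs):
    # Toy rule-based reward: prefer shorter completions.
    return [-len(completion) / 100.0 for completion in completions]

# Tiny placeholder dataset; GRPOTrainer expects a "prompt" column.
dataset = Dataset.from_dict({"prompt": ["What is 13 * 7?", "Name a prime number above 100."]})

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder small model
    reward_funcs=brevity_reward,
    args=GRPOConfig(output_dir="grpo-demo"),
    train_dataset=dataset,
)
trainer.train()
```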
Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

These findings indicate that RL boosts the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

To put it simply, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
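One way to picture this: if you sample many completions from the base model, a correct answer is often already in there somewhere; RL mostly raises the chance that it is the top-ranked one. A toy pass@k-style measurement, with a hypothetical `sample_answers` helper:

```python
def solve_rate(prompts, sample_answers, is_correct, k: int = 1) -> float:
    """Fraction of prompts where at least one of k sampled answers is correct."""
    solved = 0
    for prompt in prompts:
        answers = sample_answers(prompt, k)  # hypothetical sampler
        if any(is_correct(prompt, answer) for answer in answers):
            solved += 1
    return solved / len(prompts)

# The paper's observation, roughly: after RL, solve_rate(..., k=1) improves a lot,
# while solve_rate(..., k=large) changes much less than you might expect.
```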
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.

Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there seems to be a fundamental ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
Running DeepSeek-R1

I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.

The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
671B via llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this setup.
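For anyone wanting to reproduce a similar setup from Python, here is a rough sketch of partial GPU offloading through the llama-cpp-python bindings; the GGUF filename is a placeholder and the other parameters are assumptions, not the exact flags used for the run above.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # placeholder: first GGUF shard
    n_gpu_layers=29,  # offload 29 layers to the GPU; the rest stay on the CPU
    n_ctx=2048,       # modest context window to keep memory usage down
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is 13 * 7? Think step by step."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```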
Performance:

A r/localllama user described being able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.

Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these huge models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.

We need to both maximize usefulness and minimize time-to-usefulness.
70B via Ollama

70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
GPU utilization shoots up here, as expected, compared to the mostly CPU-powered run of 671B that I showcased above.
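For completeness, a minimal sketch of calling the same model from Python through the ollama client library; it assumes the Ollama server is running and a tag like `deepseek-r1:70b` has already been pulled.

```python
import ollama  # assumes `ollama pull deepseek-r1:70b` has been run beforehand

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9?"}],
)
print(response["message"]["content"])
```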
Resources

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)

DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs

The Illustrated DeepSeek-R1 - by Jay Alammar

Explainer: What's R1 & Everything Else? - Tim Kellogg

DeepSeek R1 Explained to your grandma - YouTube
DeepSeek

- Try R1 at chat.deepseek.com.

GitHub - deepseek-ai/DeepSeek-R1

deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.

DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that matches the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.

DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.

DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and use a fill-in-the-blank task to improve code generation and infilling.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).

- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).

- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
Liked this post? Join the newsletter.