{"id":3047,"date":"2023-12-29T08:38:46","date_gmt":"2023-12-29T08:38:46","guid":{"rendered":"https:\/\/www.trinka.ai\/blog\/?p=3047"},"modified":"2026-04-29T11:26:00","modified_gmt":"2026-04-29T11:26:00","slug":"rlhf-for-grammar-error-correction","status":"publish","type":"post","link":"https:\/\/www.trinka.ai\/blog\/rlhf-for-grammar-error-correction\/","title":{"rendered":"RLHF for Grammar Error Correction"},"content":{"rendered":"<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_50 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\">\n<p class=\"ez-toc-title\">Table of Contents<\/p>\n<\/div>\n<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link 
ez-toc-heading-1\" href=\"https:\/\/www.trinka.ai\/blog\/rlhf-for-grammar-error-correction\/#Introduction\" title=\"Introduction\">Introduction<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.trinka.ai\/blog\/rlhf-for-grammar-error-correction\/#Technical_Background_Reinforcement_Learning_from_Human_Feedback\" title=\"Technical Background: Reinforcement Learning from Human Feedback\">Technical Background: Reinforcement Learning from Human Feedback<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.trinka.ai\/blog\/rlhf-for-grammar-error-correction\/#Step_1_Supervised_model_training\" title=\"Step 1: Supervised model training\">Step 1: Supervised model training<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.trinka.ai\/blog\/rlhf-for-grammar-error-correction\/#Step_2_Reward_model_training\" title=\"Step 2: Reward model training\">Step 2: Reward model training<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.trinka.ai\/blog\/rlhf-for-grammar-error-correction\/#Step_3_Supervised_model_optimization_with_the_reward_model_using_PPO_objective\" title=\"Step 3: Supervised model optimization with the reward model using PPO objective\">Step 3: Supervised model optimization with the reward model using PPO objective<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.trinka.ai\/blog\/rlhf-for-grammar-error-correction\/#Our_Use_Case\" title=\"Our Use Case\">Our Use Case<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.trinka.ai\/blog\/rlhf-for-grammar-error-correction\/#Evaluation_and_Results\" title=\"Evaluation and Results\">Evaluation and 
Results<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.trinka.ai\/blog\/rlhf-for-grammar-error-correction\/#Conclusion_and_Next_Steps\" title=\"Conclusion and Next Steps\">Conclusion and Next Steps<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.trinka.ai\/blog\/rlhf-for-grammar-error-correction\/#References\" title=\"References\">References<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Introduction\"><\/span>Introduction<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>At Trinka AI, we have built a Grammar Error Correction (GEC) model specifically designed to assist non-native speakers of English. Our model revises written language to make it accurate and fluent, particularly in academic contexts. In this blog, we highlight a key experiment: incorporation of Reinforcement Learning from Human Feedback into our workflow.<\/p>\n<p>Our GEC model covers a broad range of writing errors and provides accurate text revisions. However, we understand that some clients require tailored solutions. For instance, a legal professional might need corrections and style suggestions that differ from what a medical researcher would require. While a phrase like &#8220;<em>the patient exhibited no notable symptoms&#8221;<\/em> is standard in medical research, the phrase<em> &#8220;the transformed E. coli were reluctant to express the protein\u201d<\/em> might not be suitable.<\/p>\n<div style=\"border-radius: 16px; border: 1px solid #EB9FF1; background: linear-gradient(93deg, #570081 1.11%, #500073 32.32%, #890093 71.76%, #890093 100.27%); padding: 22px; color: white; font-weight: 600;\">Our experiment on RLHF is to systematically train a GEC model to replicate the style and language corrections of a particular journal. 
This would ensure that the GEC model, over time, moves closer to what a human editor would do by learning what\u2019s acceptable and what\u2019s not.<\/div>\n<h2><span class=\"ez-toc-section\" id=\"Technical_Background_Reinforcement_Learning_from_Human_Feedback\"><\/span>Technical Background: Reinforcement Learning from Human Feedback<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In the field of Natural Language Processing (NLP), Reinforcement Learning from Human Feedback (RLHF) has garnered significant attention following the emergence of ChatGPT [1]. RLHF was introduced around 2017 and later played a crucial role in the development of InstructGPT [2]. With the surge in ChatGPT&#8217;s popularity, the spotlight turned firmly to RLHF.<\/p>\n<p>RLHF is a methodology that incorporates human feedback directly into the training process of machine learning models. This approach enhances the model&#8217;s ability to respond in more human-like ways.<\/p>\n<p>For a detailed understanding of RLHF, we draw on the InstructGPT paper [2]. 
Additionally, there are numerous blogs that extensively discuss this innovative NLP technique [3], [4], [5].<\/p>\n<p>&nbsp;<\/p>\n<figure id=\"attachment_3052\" aria-describedby=\"caption-attachment-3052\" style=\"width: 738px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" class=\"wp-image-3052\" src=\"https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Card-300x150.png\" alt=\"illustrates the three steps of our method:\" width=\"738\" height=\"369\" srcset=\"https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Card-300x150.png 300w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Card-1024x511.png 1024w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Card-768x383.png 768w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Card-1536x767.png 1536w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Card-2048x1022.png 2048w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Card-150x75.png 150w\" sizes=\"(max-width: 738px) 100vw, 738px\" \/><figcaption id=\"caption-attachment-3052\" class=\"wp-caption-text\">Fig 1. illustrates the three steps of our method: (1) supervised fine-tuning (SFT), (2) reward model (RM) training, and (3) PPO Objective. This diagram is from paper [2].<\/figcaption><\/figure>\n<p>Figure 1 is derived from the InstructGPT paper [2]. 
It depicts the three-step process involved in the Reinforcement Learning from Human Feedback (RLHF) method.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Step_1_Supervised_model_training\"><\/span>Step 1: Supervised model training<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n<li>A model is fine-tuned on a single task or multiple tasks using a training dataset in which each example pairs an input prompt with its desired output.<\/li>\n<\/ul>\n<h2><span class=\"ez-toc-section\" id=\"Step_2_Reward_model_training\"><\/span>Step 2: Reward model training<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n<li>This is an important step of RLHF. The input prompt is shown to the user along with a variety of outputs generated under different configurations. The user then ranks these outputs from the most to the least valid.<\/li>\n<li>This ranked data is then used to train a reward model. During training, the reward model is shown two suggestions at a time and learns to assign a higher score to the higher-ranked suggestion.<\/li>\n<li>This data is also called preference data.<\/li>\n<li>Once trained, the reward model returns a score that indicates how good a given suggestion is.<\/li>\n<li>The loss function to train the reward model is given as follows [2]:<\/li>\n<\/ul>\n<p><img loading=\"lazy\" class=\"wp-image-3053 aligncenter\" src=\"https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-1-300x56.png\" alt=\"RLHF for Grammar Error Correction\" width=\"468\" height=\"87\" srcset=\"https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-1-300x56.png 300w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-1-1024x190.png 1024w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-1-768x143.png 768w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-1-1536x285.png 1536w, 
https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-1-2048x381.png 2048w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-1-150x28.png 150w\" sizes=\"(max-width: 468px) 100vw, 468px\" \/><\/p>\n<ul>\n<li>In this context, &#8220;<em>y<sub>w<\/sub><\/em>&#8221; represents the preferred suggestion, i.e., the one with the higher ranking, while &#8220;<em>y<sub>l<\/sub><\/em>&#8221; denotes the suggestion with the lower ranking. The reward function, &#8220;<em>r<sub>\u03b8<\/sub><\/em>&#8221;, provides a scalar score based on the prompt and its corresponding suggestion.<\/li>\n<li>A log-sigmoid function is applied to the difference between the two rewards. The resulting loss is low if the reward for the higher-ranked suggestion is greater than the reward for the lower-ranked suggestion, and high otherwise.<\/li>\n<\/ul>\n<h2><span class=\"ez-toc-section\" id=\"Step_3_Supervised_model_optimization_with_the_reward_model_using_PPO_objective\"><\/span>Step 3: Supervised model optimization with the reward model using PPO objective<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n<li>During this phase, a prompt is selected from the dataset and fed into the model, which then generates an output. The quality of the output is assessed using the reward model, which calculates its reward score. 
This calculated reward is then used to update the model\/policy using Proximal Policy Optimization (PPO).<\/li>\n<li>The objective used to train the PPO policy is given in the same paper as follows [2]:<\/li>\n<\/ul>\n<p><img loading=\"lazy\" class=\"wp-image-3054 aligncenter\" src=\"https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-2-300x34.png\" alt=\"\" width=\"768\" height=\"87\" srcset=\"https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-2-300x34.png 300w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-2-1024x116.png 1024w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-2-768x87.png 768w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-2-1536x174.png 1536w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-2-2048x233.png 2048w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Equation-2-150x17.png 150w\" sizes=\"(max-width: 768px) 100vw, 768px\" \/><\/p>\n<ul>\n<li>When training with the PPO objective, the &#8220;<em>\u03b3<\/em>&#8221; parameter, which weights the pretraining term in [2], is set to zero. The reward model computes a reward score for each prompt and the response generated by the policy model. A KL divergence [6] penalty between the policy model and the original model is then applied to keep training stable and to prevent the policy from diverging excessively from the original model.<\/li>\n<\/ul>\n<h2><span class=\"ez-toc-section\" id=\"Our_Use_Case\"><\/span>Our Use Case<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We developed a GEC model trained on millions of proprietary data points and capable of competitive accuracy and coverage. 
However, certain clients, particularly academic journals, require a solution tailored to their unique requirements and aligned with their specific style guidelines.<\/p>\n<p>To this end, we applied the RLHF technique to a prominent scientific journal and observed that the model effectively moved towards the content typically published in the journal.<\/p>\n<p>We had already performed <strong>step 1<\/strong>, i.e., training a supervised model, as our existing GEC model was already in place. This left us to focus primarily on <strong>steps 2 and 3<\/strong>, i.e., training a reward model and optimizing the policy with the PPO objective.<\/p>\n<p>To train the reward model, we had to prepare the training data. In our scenario, the &#8220;<em>prompt<\/em>&#8221; is essentially a sentence that requires editing. We have sentences edited by human experts that comply with the specific guidelines of the journal. We also have the GEC model outputs for the same sentences.<\/p>\n<p>In this context, &#8220;<em>x<\/em>&#8221; represents the sentence that requires edits, &#8220;<em>y<sub>w<\/sub><\/em>&#8221; is the version edited by human experts, and &#8220;<em>y<sub>l<\/sub><\/em>&#8221; is the output generated by our current GEC model. Our goal was to construct a reward model that assigned higher scores to sentences edited by humans and lower scores to those generated by the model. We did not include data points where the edits made by humans and the model were identical. Furthermore, we generated multiple suggestions, i.e., &#8220;<em>y<sub>l<\/sub><\/em>&#8221;, for the given &#8220;<em>x<\/em>&#8221; using different checkpoints and generation parameters.<\/p>\n<p>We used Hugging Face\u2019s Transformer Reinforcement Learning (TRL) library [7] to train the PPO objective. TRL is a full-stack library that offers a complete suite of tools to train transformer language models using reinforcement learning. TRL covers every stage, from the SFT and RM steps to the final PPO optimization stage. 
We used the <em>PPOTrainer class<\/em> from the TRL library to train the PPO objective, using the original policy, i.e., the GEC model, together with the reward model that we trained in the previous step.<\/p>\n<p><strong>NOTE:<\/strong> We tried optimizing the PPO objective with an adaptive KL coefficient, but the model started generating sub-optimal responses. Hence, we switched to a static KL coefficient for optimization. Note that setting the KL coefficient too low can cause the model to produce awkward responses.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Evaluation_and_Results\"><\/span>Evaluation and Results<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We evaluated the RLHF GEC model on our held-out dataset of 10,000 entries and observed noticeable improvements across various metrics. The models were assessed using metrics like M2 Score [8] and BLEU [9]. With regard to the M2 score, we observed a significant increase of 2.9% in the F0.5 score. A similar trend was observed in BLEU scores too, the details of which are depicted in Figure 2.<\/p>\n<p>&nbsp;<\/p>\n<figure id=\"attachment_3051\" aria-describedby=\"caption-attachment-3051\" style=\"width: 753px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" class=\"wp-image-3051\" src=\"https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Box-300x90.png\" alt=\"GEC model\" width=\"753\" height=\"226\" srcset=\"https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Box-300x90.png 300w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Box-1024x306.png 1024w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Box-768x229.png 768w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Box-1536x459.png 1536w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Box-2048x612.png 2048w, https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Box-150x45.png 150w\" sizes=\"(max-width: 753px) 100vw, 753px\" \/><figcaption 
id=\"caption-attachment-3051\" class=\"wp-caption-text\">Fig 2. illustrates the results on the test data for our GEC model and RLHF GEC model. The BLEU score shifts by 2.34% towards human editing; with more training data, we expect it to move even closer to human editing.<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p>It is evident that the journal-specific RLHF GEC model aligns more closely with the human experts&#8217; revisions. We expect that training the reward model on additional data will further narrow the gap between the model&#8217;s output and human-expert-level corrections.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion_and_Next_Steps\"><\/span>Conclusion and Next Steps<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Our team trained an RLHF model tailored to the specific needs of a prominent scientific journal publisher. This model, fine-tuned from our in-house GEC model, which is currently in use in our flagship writing app Trinka AI (<a href=\"https:\/\/www.trinka.ai\">www.trinka.ai<\/a>), demonstrated a 2.9% improvement in the F0.5 score. This brings its output closer to the quality of revisions made by human experts. We recognize that further significant improvements are possible by incorporating more specialized datasets, thus enhancing the model&#8217;s alignment with human-expert-level precision.<\/p>\n<p>Training an effective RLHF model is not without its challenges. A key difficulty lies in having to train two separate models: the reward model and the PPO policy. To streamline this complex process, we are now exploring Direct Preference Optimization (DPO) [10] as a potential solution.<\/p>\n<p>Looking ahead, we aim to expand the application of the RLHF technique to other publishers and journals. 
Our goal is to further analyze and understand the behavior of these models in different publishing contexts with an eye towards continuous refinement and adaptation.<\/p>\n<p>If you would like to see the examples of the trained RLHF model or if you are a publisher or a company looking to enhance business communications\/branding and would like to know how to create RLHF models that reflect your style and communication objectives, write to us at <a href=\"mailto:sales@trinka.ai\">sales@trinka.ai<\/a>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"References\"><\/span>References<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<div style=\"line-height: 15px;\">\n<p style=\"font-size: 14px; margin-bottom: 10px; margin-top: 0px; padding: 0px;\">[1] OpenAI, \u201cIntroducing ChatGPT,\u201d OpenAI, Nov. 30, 2022. <a href=\"https:\/\/openai.com\/blog\/chatgpt\">https:\/\/openai.com\/blog\/chatgpt<\/a><\/p>\n<p style=\"font-size: 14px; margin-bottom: 10px; margin-top: 0px; padding: 0px;\">[2] L. Ouyang et al., \u201cTraining language models to follow instructions with human feedback,\u201d Mar. 2022. Available: <a href=\"https:\/\/arxiv.org\/pdf\/2203.02155.pdf\">https:\/\/arxiv.org\/pdf\/2203.02155.pdf<\/a><\/p>\n<p style=\"font-size: 14px; margin-bottom: 10px; margin-top: 0px;\">[3] A. Thakur, \u201cUnderstanding Reinforcement Learning from Human Feedback (RLHF): Part 1,\u201d W&amp;B, Nov. 02, 2022. <a href=\"https:\/\/wandb.ai\/ayush-thakur\/RLHF\/reports\/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx#learning-to-summarize-with-human-feedback\">https:\/\/wandb.ai\/ayush-thakur\/RLHF\/reports\/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1&#8211;VmlldzoyODk5MTIx#learning-to-summarize-with-human-feedback<\/a> (accessed Dec. 21, 2023).<\/p>\n<p style=\"font-size: 14px; margin-bottom: 10px; margin-top: 0px;\">[4] N. 
Lambert, \u201cIllustrating Reinforcement Learning from Human Feedback (RLHF),\u201d huggingface.co, Dec. 09, 2022. <a href=\"https:\/\/huggingface.co\/blog\/rlhf\">https:\/\/huggingface.co\/blog\/rlhf<\/a><\/p>\n<p style=\"font-size: 14px; margin-bottom: 10px; margin-top: 0px;\">[5] &#8220;RLHF (Reinforcement Learning From Human Feedback): Overview + Tutorial,\u201d www.v7labs.com. <a href=\"https:\/\/www.v7labs.com\/blog\/rlhf-reinforcement-learning-from-human-feedback\">https:\/\/www.v7labs.com\/blog\/rlhf-reinforcement-learning-from-human-feedback<\/a> (accessed Dec. 21, 2023).<\/p>\n<p style=\"font-size: 14px; margin-bottom: 10px; margin-top: 3px;\">[6] Wikipedia Contributors, \u201cKullback\u2013Leibler divergence,\u201d Wikipedia, Apr. 16, 2019. <a href=\"https:\/\/en.wikipedia.org\/wiki\/Kullback%E2%80%93Leibler_divergence\">https:\/\/en.wikipedia.org\/wiki\/Kullback%E2%80%93Leibler_divergence<\/a><\/p>\n<p style=\"font-size: 14px; margin-bottom: 10px; margin-top: 3px;\">[7] \u201cTRL &#8211; Transformer Reinforcement Learning,\u201d huggingface.co. <a href=\"https:\/\/huggingface.co\/docs\/trl\/index\">https:\/\/huggingface.co\/docs\/trl\/index<\/a> (accessed Dec. 21, 2023).<\/p>\n<p style=\"font-size: 14px; margin-bottom: 10px; margin-top: 3px;\">[8] D. Dahlmeier and H. Ng, \u201cBetter Evaluation for Grammatical Error Correction,\u201d 2012. Available: <a href=\"https:\/\/aclanthology.org\/N12-1067.pdf\">https:\/\/aclanthology.org\/N12-1067.pdf<\/a><\/p>\n<p style=\"font-size: 14px; margin-bottom: 10px; margin-top: 3px;\">[9] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, \u201cBLEU: a Method for Automatic Evaluation of Machine Translation,\u201d Proceedings of the 40th Annual Meeting on Association for Computational Linguistics &#8211; ACL \u201902, 2001, doi: <a href=\"https:\/\/doi.org\/10.3115\/1073083.1073135\">https:\/\/doi.org\/10.3115\/1073083.1073135<\/a><\/p>\n<p style=\"font-size: 14px; margin-bottom: 10px; margin-top: 3px;\">[10] R. 
Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, \u201cDirect Preference Optimization: Your Language Model is Secretly a Reward Model,\u201d arXiv.org, May 29, 2023. <a href=\"https:\/\/arxiv.org\/abs\/2305.18290\">https:\/\/arxiv.org\/abs\/2305.18290<\/a><\/p>\n<\/div>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>Introduction At Trinka AI, we have built a Grammar Error Correction (GEC) model specifically designed to assist non-native speakers of English. Our model revises written language to make it accurate and fluent, particularly in academic contexts. In this blog, we highlight a key experiment: incorporation of Reinforcement Learning from Human Feedback into our workflow. Our [&hellip;]<!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":4,"featured_media":3065,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[239],"tags":[],"acf":[],"featured_image_url":"https:\/\/www.trinka.ai\/blog\/wp-content\/uploads\/2023\/12\/Version-X-Blog-Banner-1.png","_links":{"self":[{"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/posts\/3047"}],"collection":[{"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/comments?post=3047"}],"version-history":[{"count":63,"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/posts\/3047\/revisions"}],"predecessor-version":[{"id":4696,"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/posts\/3047\/revisions\/469
6"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/media\/3065"}],"wp:attachment":[{"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/media?parent=3047"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/categories?post=3047"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.trinka.ai\/blog\/wp-json\/wp\/v2\/tags?post=3047"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}