A llama, a gopher, and a chinchilla attending NeurIPS - image generated by DALL-E

It has always been challenging for me to keep up with fast-growing fields like foundation models, especially when it comes to reading the latest papers. This year, thanks to the NeurIPS reading list by Jerry Liu, it became much easier for me to follow the trends emerging at NeurIPS.

In this reading note, I try to capture the most important takeaways from each of the four NeurIPS best papers, and I hope it helps people navigate the relevant fields more easily. I recommend reading the original papers if you find any of them particularly intriguing.

For each paper, I first summarize it in one sentence that captures the most important message I took away. Then I dive deeper into the sub-arguments, components, or evidence that I found important while reading it.

Are Emergent Abilities of Large Language Models a Mirage?

One-Sentence Takeaway

Emergent abilities of LLMs claimed in prior work may be an artifact of the researchers' choice of metrics - specifically, metrics that scale nonlinearly or discontinuously with the model's per-token error rate - rather than something inherent in the task or model family.

Closer Look

Emergent abilities in LLMs are defined in this paper as having two properties:

  • Sharpness - the transition from not present to present is instantaneous
  • Unpredictability - it’s hard to foresee the model scales at which these abilities appear

The main argument of the paper is that the emergent abilities of LLMs are the result of the researchers' choice of metrics. Specifically, the paper makes the following major predictions about when emergent abilities appear:

  • While nonlinear and discontinuous metrics lead to emergent abilities, switching to linear and continuous metrics on the same model outputs yields smooth and predictable scaling curves (a toy illustration of this mechanism follows the list).
  • Emergent abilities may also be caused by insufficient resolution in the test data, and they disappear when resolution is increased by generating more test data.
  • Emergent abilities only appear under a few specific metrics (e.g. Multiple Choice Grade and Exact String Match), regardless of the task and model family.
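
To make the metric argument concrete, here is a toy simulation (my own sketch, not code from the paper) in which per-token accuracy improves smoothly with scale. A linear metric (mean per-token accuracy) then scales smoothly, while a nonlinear metric (exact match over a k-token answer, which requires every token to be correct) produces a sharp, seemingly emergent jump. The power-law relation between scale and per-token error below is an illustrative assumption.

```python
import numpy as np

# Illustrative assumption: per-token error decays smoothly (power law) with
# model scale. The exponent and constants are made up for this sketch.
scales = np.logspace(7, 11, 9)                     # "parameter counts" 1e7..1e11
per_token_acc = 1.0 - 0.4 * (scales / 1e7) ** -0.25

k = 10  # length of the target answer in tokens

# Linear / continuous metric: mean per-token accuracy -> smooth curve.
token_accuracy = per_token_acc

# Nonlinear / discontinuous metric: exact string match needs all k tokens
# correct, so it scales like accuracy**k and looks "emergent".
exact_match = per_token_acc ** k

for n, acc, em in zip(scales, token_accuracy, exact_match):
    print(f"scale={n:8.0e}  per-token acc={acc:.3f}  exact match={em:.3f}")
```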

In support of its main argument, the paper also shows that emergent abilities can be induced in vision models by changing the evaluation metric. The authors picked vision models because emergent abilities had not been observed in this class of models. Specifically, they induced an emergent reconstruction ability in shallow nonlinear autoencoders by switching from the mean squared reconstruction error to a nonlinear reconstruction metric.

Image from the paper

Scaling Data-Constrained Language Models

One-Sentence Takeaway

When the amount of unique data is constrained, it is beneficial to train the model for multiple epochs on repeated data, albeit with exponentially decaying returns.

Closer Look

Scaling laws are how people try to make scaling LLMs more predictable. This paper focuses on scaling LMs under data-constrained conditions. In particular, it quantifies the impact of multi-epoch training on LLMs compared with training on unique data for a single epoch, as recommended by prior work.

Image from the paper

The data-constrained scaling law states:

  • (Allocation) Under the same data constraint, allocating most of the additional compute to more epochs rather than more parameters results in a larger reduction in loss.
  • (Return) Repeating data brings meaningful gains for up to around 4 to 8 epochs (see the figure), after which returns diminish predictably (a toy sketch of this decay follows the list).
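
Here is a toy sketch of the decaying-return idea, assuming (as the paper's scaling law does) that the value of repeated tokens saturates exponentially in the number of repetitions. The function name, the 100B-token budget, and the constant r_star are my own illustrative stand-ins, not the paper's fitted values.

```python
import math

def effective_data(unique_tokens: float, epochs: int, r_star: float = 15.0) -> float:
    """Toy model of the diminishing value of repeated data.

    The first epoch counts in full; each repetition adds less, with the
    extra value saturating exponentially. r_star (roughly how many
    repetitions it takes for the value to decay away) is an illustrative
    stand-in, not the paper's fitted constant.
    """
    repetitions = max(epochs - 1, 0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repetitions / r_star)))

unique = 100e9  # 100B unique tokens (arbitrary budget for the sketch)
for epochs in (1, 2, 4, 8, 16, 64):
    seen = unique * epochs
    eff = effective_data(unique, epochs)
    print(f"{epochs:3d} epochs: {seen/1e9:6.0f}B tokens seen, ~{eff/1e9:5.0f}B effective")
```

With these illustrative numbers, 4 epochs of repetition still deliver roughly 90% of the value of fully unique data, while 64 epochs deliver only about a quarter, mirroring the qualitative picture above.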

To support further scaling, the paper also investigates complementary strategies for addressing the data constraint. Specifically, it finds that augmenting the training set with code roughly doubles the available data, which means the potential to scale roughly 2x further.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

One-Sentence Takeaway

DPO is an LM fine-tuning method that directly optimizes for human preferences in a single step, eliminating the need for a separate reward-modeling step and an RL-based policy-learning step as in RLHF.

Closer Look

LLMs like GPT-3.5 and GPT-4 have shown impressive success in following human instructions thanks to reinforcement learning from human feedback (RLHF). The standard RLHF pipeline involves three steps: 1) supervised fine-tuning (SFT) on instruction data; 2) reward modeling; and 3) reinforcement learning using the reward model from step 2.

Image from the paper

However, the RL training step is computationally expensive, unstable, and complicated to implement. To get around this complexity, the paper proposes DPO, which matches and even exceeds the performance of RL-based methods using a simple binary cross-entropy objective.

The main idea of DPO is that we can directly optimize the LM - i.e. the policy - to follow human preferences, rather than explicitly training a reward model and then using RL to optimize the policy.

Specifically, DPO takes the same RL objective used in RLHF methods and uses its optimal solution to express the reward model in terms of only the optimal and reference policies, as shown below.

Equation from the paper
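
Written out (for readers without the image), the reparameterized reward from the DPO paper takes the form

$$ r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x), $$

where $\pi_{\text{ref}}$ is the reference (SFT) policy, $\beta$ controls the strength of the KL penalty, and $Z(x)$ is a partition function that depends only on the prompt and cancels out in the preference model.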

This reparameterization is then substituted into the RL objective to obtain the DPO objective:

Equation from the paper
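
In text form, the resulting DPO objective is a binary cross-entropy loss over preference pairs:

$$ \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right], $$

where $y_w$ and $y_l$ are the preferred and dispreferred responses for prompt $x$.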

This objective lets DPO implicitly train the reward model and the policy together in a single step, bypassing the computationally expensive reward-modeling and RL training steps of RLHF methods.
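
As a concrete reference, here is a minimal sketch of that loss in PyTorch. It assumes you have already computed per-sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; the function and argument names are my own, not from the paper's released code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Binary cross-entropy on the difference of policy-vs-reference
    log-ratios for the chosen (preferred) and rejected responses.

    Each argument is a (batch,) tensor of summed token log-probabilities
    for one full response. beta controls the strength of the implicit KL
    penalty that keeps the policy close to the reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for y_l
    # -log sigmoid(beta * margin) is the DPO binary cross-entropy objective.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In practice, the per-sequence log-probabilities are sums of token log-probs over the response, computed with gradients enabled for the policy and disabled for the reference model.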

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

One-Sentence Takeaway

The paper presents a comprehensive evaluation of trustworthiness in GPT-3.5 and GPT-4, detailing the evaluation data, metrics, methods, and results, and suggesting that GPT models still have trustworthiness vulnerabilities that need to be addressed.

Closer Look

This paper is a thorough evaluation report on the trustworthiness of GPT-3.5 and GPT-4. In particular, it divides trustworthiness into 8 criteria. For each criterion, the paper provides detailed information on dataset construction, evaluation metrics, and results. The 8 trustworthiness criteria and the main conclusions are described below:

  • Toxicity - While GPT-3.5 and GPT-4 have much lower toxicity scores than previous models, they show almost 100% toxicity probability when given adversarial system prompts.
  • Stereotype bias - Both GPT-3.5 and GPT-4 show low agreeability (i.e. how often the model agrees with a stereotypical statement) when given untargeted prompts, but high agreeability when given targeted adversarial prompts, especially GPT-4.
  • Adversarial robustness - GPT-4 is more robust than GPT-3.5, but both models remain vulnerable to adversarial texts generated by recent autoregressive models (from the proposed AdvGLUE++ dataset).
  • Out-of-distribution robustness - GPT-4 is more robust than GPT-3.5 both when given texts with OOD styles and when asked about OOD knowledge, but both models remain vulnerable to less common styles and still generate made-up responses when asked about OOD knowledge.
  • Robustness to adversarial demonstrations - Both GPT-3.5 and GPT-4 benefit from counterfactual examples in demonstrations; GPT-3.5 is more vulnerable to spurious correlations in demonstrations, while GPT-4 is more vulnerable to backdoored demonstrations.
  • Privacy - GPT models can leak personally identifiable information (PII) from training data and from prior conversations; both GPT-3.5 and GPT-4 leak almost everything when provided with privacy-leakage demonstrations under in-context learning.
  • Machine ethics - Both GPT-3.5 and GPT-4 can be misled by jailbreaking prompts and evasive sentences; GPT-4 is more vulnerable to jailbreaking prompts, potentially due to its better instruction-following abilities.
  • Fairness - GPT-4 is more accurate on demographically balanced test data but shows higher unfairness scores on demographically unbalanced test data than GPT-3.5; the fairness of both models can be improved by providing a few demographically balanced few-shot examples.

I hope you enjoyed this article! Connect with me on LinkedIn or Twitter if you are also interested in AI, ML, LLMs, databases, and more.