Using GPT-4 to bootstrap few-shot CoT demonstrations for GPT-3.5

The Scoped Negation (ScoNe) benchmark of She et al. (2023) seeks to stress-test models on their ability to reason about negation. In the original paper, the text-davinci-002 and text-davinci-003 models were more or less at chance on the hardest ScoNe categories.

This notebook starts with a very simple Chain-of-Thought-based module for ScoNe. gpt-3.5-turbo is at chance on the “one scoping negation” category (one of the two hardest in ScoNe) using this simple program.

We figured that bootstrapping demonstrations would help, but turbo struggled to create good demonstrations that included CoT steps. When we switched to using GPT-4 Turbo (gpt-4-1106-preview) just to create these demonstrations, which involves under 50 calls to that model, turbo regularly achieved 85–90% accuracy. All of this happens in a single compilation step using dspy.BootstrapFewShotWithRandomSearch.

Set-up

import glob
import os
import pandas as pd
import random

import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join('.', 'cache')

# We'll rely on turbo for everything except bootstrapping CoT demos:

turbo = dspy.OpenAI(model='gpt-3.5-turbo-1106', max_tokens=250, model_type='chat')

dspy.settings.configure(lm=turbo)

# GPT-4 will be used only to bootstrap CoT demos:

gpt4T = dspy.OpenAI(model='gpt-4-1106-preview', max_tokens=350, model_type='chat')

# Setting this to True reruns the bootstrapping process. When it is
# False, the previously saved demonstrations are loaded instead, but
# turbo is still used to evaluate the zero-shot and full programs.
RUN_FROM_SCRATCH = False

ScoNe

!git clone https://github.com/selenashe/ScoNe.git

Cloning into 'ScoNe'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 77 (delta 42), reused 42 (delta 20), pack-reused 0
Receiving objects: 100% (77/77), 116.25 KiB | 1.21 MiB/s, done.
Resolving deltas: 100% (42/42), done.

Data loader

def load_scone(dirname):
    dfs = []
    for filename in glob.glob(dirname + "/*.csv"):
        df = pd.read_csv(filename, index_col=0)
        df['category'] = os.path.basename(filename).replace(".csv", "")
        dfs.append(df)
    data_df = pd.concat(dfs)

    def as_example(row):
        # The 'one_scoped' file is from an earlier dataset, MoNLI, and
        # so is formatted a bit differently:
        suffix = '' if row['category'] == 'one_scoped' else '_edited'
        # Reformat the hypothesis to be an embedded clause in a question:
        hkey = 'sentence2' + suffix
        question = row[hkey][0].lower() + row[hkey][1: ].strip(".")
        question = f"Can we logically conclude for sure that {question}?"
        # Binary task formulation:
        label = "Yes" if row['gold_label' + suffix] == 'entailment' else "No"
        return dspy.Example({
            "context": row['sentence1' + suffix],
            "question": question,
            "answer": label,
            "category": row['category']
        }).with_inputs("context", "question")

    return list(data_df.apply(as_example, axis=1).values)
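To make the reformatting inside as_example concrete, here is a minimal standalone sketch of how a hypothesis sentence becomes an embedded question and how the gold label becomes a Yes/No answer. The helper names hypothesis_to_question and label_to_answer are ours, introduced only for illustration; the string logic mirrors the loader above.

```python
def hypothesis_to_question(hypothesis):
    # Lowercase the first character and strip periods from the remainder,
    # as the loader does, so the sentence can be embedded in a question.
    body = hypothesis[0].lower() + hypothesis[1:].strip(".")
    return f"Can we logically conclude for sure that {body}?"


def label_to_answer(gold_label):
    # Binary task formulation: only 'entailment' maps to "Yes".
    return "Yes" if gold_label == "entailment" else "No"


print(hypothesis_to_question("The man is not steering a car."))
print(label_to_answer("neutral"))
```

Any label other than 'entailment' collapses to "No", which is what turns the NLI labels into a two-way decision.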

Train and dev samples

all_train = load_scone("ScoNe/scone_nli/train")

random.seed(1)
random.shuffle(all_train)

# 200 random train, 50 random dev:
train, dev = all_train[: 200], all_train[200: 250]

len(train), len(dev)

(200, 50)
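The split above is a seeded shuffle followed by two non-overlapping slices. The sketch below shows the same idea on a stand-in list of integers. The function name seeded_split is hypothetical, and it uses a local random.Random instance instead of the global seed, which is a design variation and not what the notebook does.

```python
import random


def seeded_split(items, n_train, n_dev, seed):
    # Copy so the caller's list is left untouched, then shuffle with a
    # local generator so the result depends only on `seed`.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Adjacent slices of one shuffled list cannot share positions.
    return shuffled[:n_train], shuffled[n_train:n_train + n_dev]


items = list(range(1000))  # stand-in for the loaded training examples
train_ids, dev_ids = seeded_split(items, n_train=200, n_dev=50, seed=1)
print(len(train_ids), len(dev_ids))
```

Calling seeded_split again with the same seed returns the same split, which keeps the train and dev samples stable across runs.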

Test

random.seed(1)

test = load_scone(dirname="ScoNe/scone_nli/test")

# We're developing a system for the full ScoNe benchmark, but we'll
# evaluate only on one of the hardest and most informative ScoNe
# categories for now -- examples with a single negation that plays
# a crucial role in the reasoning:
test = [ex for ex in test if ex.category == "one_scoped"]

pd.Series([ex.answer for ex in test]).value_counts()

No     100
Yes    100
dtype: int64
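Because the filtered test set is balanced, "at chance" here means 50% accuracy: a program that always gives the same answer gets exactly half the examples right. A small sketch of that baseline, using only the label counts shown above (the variable names are ours):

```python
from collections import Counter

# Label counts reported above for the one_scoped test examples.
answers = ["No"] * 100 + ["Yes"] * 100

counts = Counter(answers)
# Accuracy of always predicting the most frequent label.
majority_count = counts.most_common(1)[0][1]
majority_accuracy = 100 * majority_count / len(answers)
print(majority_accuracy)  # 50.0
```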

Evaluation tools

scone_accuracy = dspy.evaluate.metrics.answer_exact_match

evaluator = Evaluate(devset=test, num_threads=1, display_progress=True, display_table=0)
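The metric used here is DSPy's answer_exact_match, whose exact normalization rules are not shown in this notebook. As a rough illustration of what an exact-match accuracy computes, the sketch below uses a simplified stand-in: the names simple_exact_match and accuracy are hypothetical, and the normalization (case, surrounding whitespace, trailing period) is an assumption rather than DSPy's behavior.

```python
def simple_exact_match(gold, predicted):
    # Simplified stand-in, NOT DSPy's implementation: compare answers
    # ignoring case, surrounding whitespace, and a trailing period.
    def normalize(text):
        return text.strip().rstrip(".").strip().lower()

    return normalize(gold) == normalize(predicted)


def accuracy(pairs):
    # Percentage of (gold, predicted) pairs that match.
    correct = sum(simple_exact_match(gold, pred) for gold, pred in pairs)
    return 100 * correct / len(pairs)


pairs = [("Yes", "yes"), ("No", "No."), ("Yes", "No"), ("No", " no ")]
print(accuracy(pairs))  # 75.0
```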

Zero-shot CoT

class ScoNeSignature(dspy.Signature):
    ("""You are given some context (a premise) and a question (a hypothesis). """
    """You must indicate with Yes/No answer whether we can logically """
    """conclude the hypothesis from the premise.""")

    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Yes or No")

class ScoNeCoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(ScoNeSignature)

    def forward(self, context, question):
        return self.generate_answer(context=context, question=question)

cot_zeroshot = ScoNeCoT()

evaluator(cot_zeroshot, metric=scone_accuracy)

Average Metric: 100 / 200  (50.0): 100%|█████████████████████████| 200/200 [00:00<00:00, 733.75it/s]

Average Metric: 100 / 200  (50.0%)

50.0

Optimized few-shot with bootstrapped demonstrations

bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    num_candidate_programs=10,
    num_threads=8,
    metric=scone_accuracy,
    teacher_settings=dict(lm=gpt4T))

Going to sample between 1 and 8 traces per predictor.
Will attempt to train 10 candidate sets.

if RUN_FROM_SCRATCH:
    cot_fewshot = bootstrap_optimizer.compile(cot_zeroshot, trainset=train, valset=dev)
else:
    cot_fewshot = ScoNeCoT()
    cot_fewshot.load("scone-cot_fewshot-turbo-gpt4-demos.json")

Average Metric: 24 / 50  (48.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1096.32it/s]
Average Metric: 25 / 50  (50.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1034.71it/s]
  6%|███▎                                                         | 11/200 [00:00<00:00, 899.26it/s]
Average Metric: 27 / 50  (54.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1225.04it/s]
  4%|██▊                                                           | 9/200 [00:00<00:00, 815.06it/s]
Average Metric: 37 / 50  (74.0): 100%|█████████████████████████████| 50/50 [00:00<00:00, 884.47it/s]
  2%|█▏                                                            | 4/200 [00:00<00:00, 309.09it/s]
Average Metric: 28 / 50  (56.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1111.93it/s]
  0%|▎                                                             | 1/200 [00:00<00:00, 712.23it/s]
Average Metric: 31 / 50  (62.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1043.32it/s]
  2%|█▏                                                            | 4/200 [00:00<00:00, 837.65it/s]
Average Metric: 23 / 50  (46.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1104.00it/s]
  2%|█▏                                                            | 4/200 [00:00<00:00, 802.55it/s]
Average Metric: 34 / 50  (68.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1116.66it/s]
  2%|█▌                                                            | 5/200 [00:00<00:00, 855.28it/s]
Average Metric: 30 / 50  (60.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1148.03it/s]
  1%|▌                                                             | 2/200 [00:00<00:00, 723.34it/s]
Average Metric: 27 / 50  (54.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1109.09it/s]
  3%|█▊                                                            | 6/200 [00:00<00:00, 828.15it/s]
Average Metric: 28 / 50  (56.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1036.51it/s]
  2%|█▌                                                            | 5/200 [00:00<00:00, 790.78it/s]
Average Metric: 25 / 50  (50.0): 100%|████████████████████████████| 50/50 [00:00<00:00, 1128.36it/s]
  4%|██▍                                                           | 8/200 [00:00<00:00, 845.75it/s]
Average Metric: 31 / 50  (62.0): 100%|█████████████████████████████| 50/50 [00:00<00:00, 921.83it/s]

Average Metric: 24 / 50  (48.0%)
Score: 48.0 for set: [0]
New best score: 48.0 for seed -3
Scores so far: [48.0]
Best score: 48.0
Average Metric: 25 / 50  (50.0%)
Score: 50.0 for set: [8]
New best score: 50.0 for seed -2
Scores so far: [48.0, 50.0]
Best score: 50.0
Bootstrapped 8 full traces after 12 examples in round 0.
Average Metric: 27 / 50  (54.0%)
Score: 54.0 for set: [8]
New best score: 54.0 for seed -1
Scores so far: [48.0, 50.0, 54.0]
Best score: 54.0
Average of max per entry across top 1 scores: 0.54
Average of max per entry across top 2 scores: 0.7
Average of max per entry across top 3 scores: 0.76
Average of max per entry across top 5 scores: 0.76
Average of max per entry across top 8 scores: 0.76
Average of max per entry across top 9999 scores: 0.76
Bootstrapped 7 full traces after 10 examples in round 0.
Average Metric: 37 / 50  (74.0%)
Score: 74.0 for set: [8]
New best score: 74.0 for seed 0
Scores so far: [48.0, 50.0, 54.0, 74.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.78
Average of max per entry across top 3 scores: 0.86
Average of max per entry across top 5 scores: 0.92
Average of max per entry across top 8 scores: 0.92
Average of max per entry across top 9999 scores: 0.92
Bootstrapped 3 full traces after 5 examples in round 0.
Average Metric: 28 / 50  (56.0%)
Score: 56.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.8
Average of max per entry across top 3 scores: 0.82
Average of max per entry across top 5 scores: 0.92
Average of max per entry across top 8 scores: 0.92
Average of max per entry across top 9999 scores: 0.92
Bootstrapped 1 full traces after 2 examples in round 0.
Average Metric: 31 / 50  (62.0%)
Score: 62.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.86
Average of max per entry across top 3 scores: 0.9
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 0.94
Average of max per entry across top 9999 scores: 0.94
Bootstrapped 4 full traces after 5 examples in round 0.
Average Metric: 23 / 50  (46.0%)
Score: 46.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.86
Average of max per entry across top 3 scores: 0.9
Average of max per entry across top 5 scores: 0.94
Average of max per entry across top 8 scores: 0.96
Average of max per entry across top 9999 scores: 0.96
Bootstrapped 4 full traces after 5 examples in round 0.
Average Metric: 34 / 50  (68.0%)
Score: 68.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0
Bootstrapped 5 full traces after 6 examples in round 0.
Average Metric: 30 / 50  (60.0%)
Score: 60.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0
Bootstrapped 2 full traces after 3 examples in round 0.
Average Metric: 27 / 50  (54.0%)
Score: 54.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0
Bootstrapped 6 full traces after 7 examples in round 0.
Average Metric: 28 / 50  (56.0%)
Score: 56.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0
Bootstrapped 4 full traces after 6 examples in round 0.
Average Metric: 25 / 50  (50.0%)
Score: 50.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0, 50.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0
Bootstrapped 8 full traces after 9 examples in round 0.
Average Metric: 31 / 50  (62.0%)
Score: 62.0 for set: [8]
Scores so far: [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0, 68.0, 60.0, 54.0, 56.0, 50.0, 62.0]
Best score: 74.0
Average of max per entry across top 1 scores: 0.74
Average of max per entry across top 2 scores: 0.92
Average of max per entry across top 3 scores: 0.98
Average of max per entry across top 5 scores: 0.98
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0
13 candidate programs found.
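The log above suggests that each candidate program is scored on the dev set and the highest scorer is kept. Here is a minimal sketch of just that selection step, using the scores printed in the log. It is not DSPy's implementation, and the seed offset of 3 is inferred from the "seed -3" through "seed 0" lines rather than taken from the library.

```python
# Dev-set scores copied from the final "Scores so far" line above.
val_scores = [48.0, 50.0, 54.0, 74.0, 56.0, 62.0, 46.0,
              68.0, 60.0, 54.0, 56.0, 50.0, 62.0]

# Index of the highest-scoring candidate (first one wins on ties).
best_index = max(range(len(val_scores)), key=val_scores.__getitem__)

# Assumption: the first candidate is logged as seed -3, so seeds are
# offset by 3 from list positions.
best_seed = best_index - 3

print(len(val_scores), best_index, val_scores[best_index], best_seed)
```

The 13 scores are consistent with the 10 requested candidate programs plus three initial ones logged under seeds -3 to -1.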

evaluator(cot_fewshot, metric=scone_accuracy)

Average Metric: 171 / 200  (85.5): 100%|█████████████████████████| 200/200 [00:00<00:00, 557.50it/s]

Average Metric: 171 / 200  (85.5%)

85.5

cot_fewshot.save("scone-cot_fewshot-turbo-gpt4-demos.json")

Example prompt with prediction

turbo.inspect_history(n=1)





You are given some context (a premise) and a question (a hypothesis). You must indicate with Yes/No answer whether we can logically conclude the hypothesis from the premise.

---

Follow the following format.

Context: ${context}

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: Yes or No

---

Context: It is not true that there is not a single person walking in the city.

Question: Can we logically conclude for sure that it is not true that there is not a single celebrity walking in the city?

Reasoning: Let's think step by step in order to produce the answer. We know that the double negative in the context implies that there is at least one person walking in the city. However, the context does not provide any information about the status or occupation of the person walking in the city. Therefore, we cannot logically conclude that the person walking in the city is a celebrity.

Answer: No

---

Context: the boy, not girl, will play an trombone, but not for another week

Question: Can we logically conclude for sure that the boy, not girl, will play an instrument, but not for another week?

Reasoning: Let's think step by step in order to produce the answer. We know that the boy will play a trombone, which is a type of instrument. The context specifies that this will happen not for another week, which means it will happen in the future, but not immediately. The gender of the person is also specified as a boy, not a girl.

Answer: Yes

---

Context: A man is not holding anything in his hands.

Question: Can we logically conclude for sure that a man is not holding beverages in his hands?

Reasoning: Let's think step by step in order to produce the answer. We know that the man is not holding anything in his hands. Beverages are a subset of "anything." Therefore, if he is not holding anything, he is also not holding beverages.

Answer: Yes

---

Context: There is not a boat nearby.

Question: Can we logically conclude for sure that there is not a speedboat nearby?

Reasoning: Let's think step by step in order to produce the answer. We know that there is not a boat nearby. A speedboat is a type of boat. If there are no boats nearby, then there cannot be a speedboat nearby either, because a speedboat is included in the category of boats.

Answer: Yes

---

Context: The man is not listening to music.

Question: Can we logically conclude for sure that the man is not listening to rockabilly?

Reasoning: Let's think step by step in order to produce the answer. We know that the man is not listening to music. Rockabilly is a genre of music. If he is not listening to music at all, then he cannot be listening to rockabilly, which is a specific type of music.

Answer: Yes

---

Context: There is not a single person walking in the city.

Question: Can we logically conclude for sure that there is not a single mover walking in the city?

Reasoning: Let's think step by step in order to produce the answer. We know that there is not a single person walking in the city. A mover is a type of person. If there are no people walking in the city, then there cannot be a mover walking in the city either, because a mover is included in the broader category of people.

Answer: Yes

---

Context: a man walks not so confidently, but he does not drop eggplants

Question: Can we logically conclude for sure that a man walks not so confidently, but he does not drop produce?

Reasoning: Let's think step by step in order to produce the answer. We know that the man does not drop eggplants, which are a type of produce. However, the statement does not provide information about other types of produce. Therefore, we cannot conclude that he does not drop any type of produce, only that he does not drop eggplants specifically.

Answer: No

---

Context: The girl who is not here is not wearing any jewelry at all.
Question: Can we logically conclude for sure that the girl who is not here is not wearing any ringlets at all?
Answer: Yes

---

Context: The man is not steering a sedan

Question: Can we logically conclude for sure that the man is not steering a car?

Reasoning: Let's think step by step in order to produce the answer. We know that the man is not steering a sedan, which is a specific type of car. However, the context does not provide any information about whether he is steering any other type of car. Therefore, we cannot logically conclude that he is not steering a car at all.

Answer: No
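The prompt above interleaves two kinds of demonstrations: bootstrapped ones that carry a Reasoning field and a labeled one that has only Context, Question, and Answer. The sketch below reproduces that layout with plain string formatting. The functions render_demo and render_prompt are hypothetical and are not DSPy's template code, and the "Follow the following format" block is omitted for brevity.

```python
def render_demo(demo):
    # Bootstrapped demos include "reasoning"; labeled demos do not.
    lines = [f"Context: {demo['context']}", f"Question: {demo['question']}"]
    if "reasoning" in demo:
        lines.append(f"Reasoning: {demo['reasoning']}")
    lines.append(f"Answer: {demo['answer']}")
    # Mirror the spacing seen above: blank lines between fields only
    # when a reasoning field is present.
    separator = "\n\n" if "reasoning" in demo else "\n"
    return separator.join(lines)


def render_prompt(instructions, demos):
    # Instructions first, then each demo, separated by "---" rules.
    blocks = [instructions] + [render_demo(demo) for demo in demos]
    return "\n\n---\n\n".join(blocks)


labeled = {
    "context": "The girl who is not here is not wearing any jewelry at all.",
    "question": "Can we logically conclude for sure that the girl who is "
                "not here is not wearing any ringlets at all?",
    "answer": "Yes",
}
bootstrapped = {
    "context": "There is not a boat nearby.",
    "question": "Can we logically conclude for sure that there is not a "
                "speedboat nearby?",
    "reasoning": "Let's think step by step in order to produce the answer. "
                 "We know that there is not a boat nearby. A speedboat is "
                 "a type of boat.",
    "answer": "Yes",
}
print(render_prompt("Answer Yes or No.", [bootstrapped, labeled]))
```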