Creating OK Google With GPT-3

Vinter, gpt3

Yesterday I was showing ChatGPT to a friend of mine, and he asked me a simple question: "How long until everyone has something similar to Ok Google, but powered by an AI, on their phones?"

I thought about it for a second and my first answer was "I'd say 2-3 years", but then I realized that I had the tools to make it happen right now, and it shouldn't even take too much time. So I booted up my PC and started working on it. Small spoiler: it took me 30 minutes.

The Process

I analyzed the problem and divided it into 3 smaller components:

  1. Listen to some audio from the computer and convert it to text
  2. Use that text to query GPT-3 and get the answer back
  3. Convert the answer to audio and play it to the user
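
Before diving into each step, the three steps above can be wired together as a tiny pipeline. Here's a minimal sketch (the function name and structure are my own; the concrete step implementations are developed below):

```python
def run_assistant(listen, ask, speak):
    """Wire the three steps together: speech -> text -> GPT-3 answer -> speech.

    Each step is passed in as a function, so the pipeline itself stays trivial
    and each piece can be developed and tested on its own.
    """
    text = listen()     # 1. microphone audio -> text
    answer = ask(text)  # 2. text -> GPT-3 completion
    speak(answer)       # 3. answer -> synthesized audio
    return answer

# Dummy steps, just to show the data flow:
# run_assistant(lambda: "hello", lambda t: t.upper(), print)
```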

Let's start with the first one. Luckily for us, this is a very simple problem to solve, as there already exists a Python library that does exactly what we need: SpeechRecognition

import speech_recognition as sr

r = sr.Recognizer()

with sr.Microphone() as source:
    audio = r.listen(source, timeout=5)

text = r.recognize_google(audio)
print(text)

I defaulted to the Google speech recognition service for the sake of the demo, but the library offers other recognition backends that are potentially more accurate. The default one, however, works perfectly fine, so I kept it. (For increased accuracy one could use OpenAI's Whisper.)

Once that was done, it was time for the second part of the problem. Sadly, ChatGPT doesn't provide an API, so using it directly would've been a bit tricky. I decided to use the official OpenAI API with the text-davinci-003 model, which is close to what ChatGPT is based on, though theoretically, with a bit of hacking and Selenium, one could use chat.openai.com directly.

Here's the code to query OpenAI (the speech recognition part is omitted for clarity; the full code is at the end):

import openai

openai.api_key = "<KEY GOES HERE>"

completion = openai.Completion.create(
    engine="text-davinci-003",
    prompt=text,
    temperature=0.5,
    max_tokens=1024,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

answer = completion['choices'][0]['text']

print(answer)

With 2 out of 3 done, it was time to develop the TTS part. Once again, there's a library for that:

from time import sleep
from gtts import gTTS
import os
import pyglet

tts = gTTS(text=answer)

tts.save("out.mp3")

music = pyglet.media.load("out.mp3", streaming=False)
music.play()

sleep(music.duration)
os.remove("out.mp3")

There are a couple of things going on here:

  1. gTTS doesn't natively support streaming audio, so we need to save it to a file first
  2. Then we use pyglet (an audio and video playback library) to play it back
  3. Finally, we sleep for the duration of the audio so the program isn't killed mid-playback, then delete the temporary file

And now that everything was tested and confirmed to work individually, putting the pieces together was very simple:

import speech_recognition as sr
from time import sleep
from gtts import gTTS
import os
import pyglet
import openai

openai.api_key = "<KEY GOES HERE>"

r = sr.Recognizer()

with sr.Microphone() as source:
    audio = r.listen(source, timeout=5)

text = r.recognize_google(audio)

completion = openai.Completion.create(
    engine="text-davinci-003",
    prompt=text,
    temperature=0.5,
    max_tokens=1024,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

answer = completion['choices'][0]['text']

tts = gTTS(text=answer)
tts.save("out.mp3")

music = pyglet.media.load("out.mp3", streaming=False)
music.play()
sleep(music.duration)  # keep the program alive until playback finishes
os.remove("out.mp3")

When launched, the program listens for a phrase from the microphone (giving up if speech doesn't start within 5 seconds), queries GPT-3, and reads the answer back using TTS. And I will admit, it works surprisingly well. One thing it lacks compared to Ok Google is, of course, continuous listening and wake-word activation, but that seems easily solvable (at an MVP level) with the libraries we already used.
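
For instance, continuous listening could be approximated by transcribing short audio chunks in a loop and only reacting when a wake word appears. Here's a minimal sketch of the wake-word gate (the trigger phrase and function names are my own invention, not part of any library):

```python
WAKE_WORD = "ok gpt"  # hypothetical trigger phrase

def is_activation(transcript: str, wake_word: str = WAKE_WORD) -> bool:
    """True when the recognized speech begins with the wake word."""
    return transcript.lower().strip().startswith(wake_word)

def strip_wake_word(transcript: str, wake_word: str = WAKE_WORD) -> str:
    """Drop the wake word so only the actual query reaches GPT-3."""
    cleaned = transcript.lower().strip()
    if cleaned.startswith(wake_word):
        return cleaned[len(wake_word):].strip()
    return cleaned

# In a loop, the r.listen(...) + r.recognize_google(...) calls from above
# would produce transcripts; only those where is_activation() is True would
# be forwarded to GPT-3 and the TTS pipeline.
```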

The full code can be found at the following link: https://github.com/thevinter/okgpt

Entering Dangerous Territory

IMPORTANT NOTICE: In this section I allow GPT-3 to run unchecked code on my machine with only a semblance of safeguards. Please be aware that this is 1) potentially very dangerous and 2) a terrible idea if you are convinced of the importance of AI safety.

It is also presented as a proof of concept of what it would mean for an AI to suddenly become a very powerful actor, and how that could happen. Replicate with caution.

Here's a bonus section that came up while I was writing the article. What if I convinced GPT to give me Linux commands and then executed them? What's the potential of such a setup, and how hard is it to build?

Well, as for the last question: not very hard (at least at a very basic level).

[...]
text = r.recognize_google(audio)

completion = openai.Completion.create(
    engine="text-davinci-003",
    prompt=f"Return a linux command that solves the following problem: {text}. You will return the command and nothing else",
    temperature=0.5,
    max_tokens=1024,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

answer = completion['choices'][0]['text']
x = input(f"GPT is trying to run {answer.strip()}, want to proceed? [y/n] ")
if x == "y":
    os.system(answer.strip())
[...]

As you can see, we can add a directive to the prompt to have the API return Linux commands, and then use os.system() to execute them. This worked as a simple prototype: I could easily ask it to create files on my desktop, list the files in a folder, and even check the weather!

How to improve this? Well, the current limitation of this demo is that our model is not fine-tuned, but let's assume we could easily do that. We could then instruct GPT to add a specific pattern whenever it wants to run a command: "From now onwards, every time you suggest running a Linux command, start the message with 'cmd:'"

By doing this we could theoretically use that pattern to decide whether to run the answer or not:

import subprocess

answer = answer.strip()
if answer[:4] == "cmd:":
    # check_output returns bytes, so decode before handing the output to TTS
    answer = subprocess.check_output(answer[4:].strip(), shell=True).decode()
    if answer == "":
        answer = "Done!"
text_to_speech(answer)

In the example above I switched from os to subprocess to be able to capture a command's stdout and play it back through TTS.

There's another thing we could achieve by having it run Linux commands: connecting it to the internet. (N.B. This is just a proof of concept and I haven't properly implemented it, but it looks like, with some more work, it could behave as intended.)

GPT-3 famously has no internet access, but what's preventing us from running curl and feeding the output back in? Especially since ChatGPT clearly has a "memory": everything that gets passed in stays available in the conversation:

[...]
text = r.recognize_google(audio)

completion = openai.Completion.create(
    engine="text-davinci-003",
    prompt=f"Write a linux curl command to solve the following problem: {text}. Write just a curl command and nothing else",
    temperature=0.5,
    max_tokens=1024,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

answer = completion['choices'][0]['text']

data = subprocess.check_output(answer.strip(), shell=True).decode()

completion = openai.Completion.create(
    engine="text-davinci-003",
    prompt=f"You now know the following information: {data}. Given what you know tell me the answer to the following question: {text}",
    temperature=0.5,
    max_tokens=1024,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

answer = completion['choices'][0]['text']
[...]

It's of course very rough and very hardcoded, but it's crazy to think what we could achieve. It looks like right now the tools are "barely not enough" to make a big breakthrough, but what will happen when GPT-4 comes out?

© thevinter