The LLM Journey Begins?
I’ve started playing with LLMs. Finally feeling it’s time to see what I can see for myself. I’ve been inspired by reading Simon Willison’s blog – also the inspiration for getting these thoughts out of my skull and into a blog I can reference later.
Why now?
The post which convinced me to start experimenting is Things we learned about LLMs in 2024. In particular, reading “Some of those GPT-4 models run on my laptop” made me reframe how I feel.1 That models of this caliber have become efficient enough to run locally (and privately) on a laptop2 sparked my curiosity. I’m hopeful the trend of increasing efficiency continues. I know it’s far from solving the current energy consumption story, but at least it’s a step in the right direction.
How I’ve been thinking about their use is changing. Getting away from the snake-oil “this solves everything” hype view, as well as the ostrich-head “utterly useless” reactive view, leads me to: How can I use this new technology? Where are its edges? What can it do and not do? Is it possible to use it in a responsible way?
A problem to solve
The first thing I wanted to try was using it as a search engine replacement. The web is so incredibly broken I often cannot use a search engine to help solve technical problems.
Once upon a time, I’d expect to end up on some nerd’s personal blog where they solved something similar or explained the technical details of the thing I was working with, but that rarely happens now. Instead, I get 3 types of results:
- 🙄 Years-old Reddit posts3 with people running into the same problem (usually without a solution).
- 🥴 Pages that fool me into hoping they are relevant, but don’t even have the majority of my query terms on them. “I see you said ‘Windows’: here’s a popular selection of pages from Microsoft’s Knowledge Base that may or may not be relevant.”
- 💀 SEO-optimized trash inspiring me to replicate Walden.
It’s reached the point where I put much of the search query inside double quotes (to ensure results with those terms) and then get zero results. An empty page is at least honest and wastes less of my time. Trading away maddening for saddening.
The tooling
This is where I landed after various trials and errors.
I’m using llama3.2 (the 3B parameter version). The easiest way I found to get this going is:
brew install ollama
ollama serve
# open another terminal (first run will download the model)
ollama run llama3.2
This is what drove the majority of my experiments.
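If you’d rather script it than chat, `ollama serve` also exposes a local HTTP API (port 11434 by default). A minimal sketch with curl, assuming the llama3.2 model from above is already pulled (the question is just an example):
# query the local ollama server directly
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why would ssh offer a key I did not configure for a host?",
  "stream": false
}'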
As an additional experiment I installed Simon’s llm. After more trial and error, this worked best:
# install llm into its own environment
uvx --python 3.11 llm
# install the ollama plugin
uvx --python 3.11 llm install llm-ollama
# with `ollama serve` running
uvx --python 3.11 llm models default llama3.2
uvx --python 3.11 llm chat
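Once that’s set up, one-off prompts work without entering chat mode. A couple of sketches (the question is made up for illustration):
# confirm the ollama-backed models are visible
uvx --python 3.11 llm models
# ask a single question and get a single answer back
uvx --python 3.11 llm "Why would ssh offer a key I didn't configure for a host?"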
The experiment
Recently, I added a new host to my SSH config (snips.sh — it’s a fun idea) and made a new key to use with it. However, every time I connected I was prompted to use the wrong key before falling back to the correct one. Web searches for how to fix this or understand what was going on led me absolutely nowhere: it was a waste of time reading worthless Stack Overflow and Reddit results.
Time to fire up that local LLM and see what it can do! Although I got contradictory and confusing advice at times, it gave me actionable things to try and a direction to head. Pretty soon, I had solved my problem. Success!
While working with the tool, it felt like talking to a dimwitted but incredibly well-read coworker. They “know” a lot, but they’re bad at connecting the dots, mix things up, and sometimes unknowingly fabricate things. Although decoding these responses requires effort, I can work with it. I’m getting value. I got nothing but frustration and lost time from using a search engine.
It’s important to note this was successful because I have a lot of SSH config experience. Only with that skill was I able to figure out a path forward. If this was an area I knew little about, there would have been frustrating periods of going in circles. For example, in the “conversations” I had with the LLM, it would tell me to put the IdentitiesOnly config option in some places and then later tell me to remove it from the exact same places. LLMs don’t reason! You’re merely tickling different matching patterns. But, it worked.
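For the curious, the general shape of the fix is telling SSH exactly which key to offer for a given host. A minimal sketch of what that looks like in ~/.ssh/config (the key filename here is a placeholder, not necessarily what I ended up with):
Host snips.sh
  # offer only the identity listed here instead of every key the agent holds
  IdentitiesOnly yes
  IdentityFile ~/.ssh/snips_ed25519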
Compared to using an online LLM
I tried the same queries I was making about SSH config via DuckDuckGo’s chat tool. That let me try a few different models (GPT-4o mini, Claude 3 Haiku, Llama-3.1 70B). I found those weren’t any better at helping me with my SSH config issue than the local one I was running. So, in this case, the local model was as good as the free online ones!
Maybe I’ll eventually try a top-end model (like Claude 3.5) to see what that’s like, but I’m currently much more interested in what I can accomplish locally.
The hardware
This was all done on an M1 MacBook Air with 16GB of memory. It’s slower than the online tools, but only by a few extra seconds of waiting, which I found didn’t make a difference. Especially since you start reading the result before it’s done outputting.
I tried running the model on an iPhone 16 Pro via MLC Chat and it’s way faster4, though it gave me much worse results for the same prompt. I suspect that even though it’s llama3.2:3B, there might be some other differences, since the full name in MLC Chat is “Llama-3.2-3B-Instruct-q4f16_1-ML”. I don’t know what that means. Maybe the MLC-compiled version is different? Something to investigate another time.
Takeaway: a kind of cartomancy
Cartomancy is bullshit if you think of it as providing answers, truth, prediction, etc., but I think it has value: it’s a random number generator for reflection. In tarot, you shuffle the cards, lay them down in a particular way, and then the card faces, orientations, and positions become the basis for an interpretation. This can get you to think of your life or self in a way you hadn’t before. It’s been decades since I’ve done this, but I still recognize there is value in the process. It feels similar to the way Oblique Strategies work (which I do use from time to time). On-demand alternative perspectives can help get you past a stuck place, and that’s what happened in my experiment.
Appendix
Tooling tribulations
Here’s some tooling I tried to set up that failed. Free yourself of the pain.
- Installing llm via Homebrew. This went poorly because Homebrew is its own Python environment and that means you’re using whatever version of Python and its packages happen to be in the index. In particular, I couldn’t install the llm-ollama plugin due to the version of httpx it required being older than the one in my Homebrew environment. Using uvx to install llm means you can choose a specific Python version and install additional dependencies (the llm plugins) into a self-contained environment. Honestly, I should have done that in the first place. I ignored my instinct and put my finger in the plug socket. Sometimes you gotta remind yourself why you have the instincts you do.
- Using gpt4all with llm. I don’t know if this is user error or if gpt4all happens to be in a weird state right now, but the model list it comes with uses the same name for 2 different models! In particular, “Llama3” is used for both the 3B and 1B parameter models and which one got picked was inconsistent. The first time I used “Llama3” the 3B model was downloaded and used, but later it downloaded the 1B model and used that instead. 🤦🏼 I tried to debug this, but quickly gave up and went with the ollama server instead. I didn’t want my first foray to end in a yak shaving exercise.
- Trying to install the gpt4all macOS app. LOL. I got prompted to install Rosetta. In 2024. Noped right out of that. It sounds like it may be the installer needing Rosetta as an unfortunate side effect of the Qt Installer Framework. If you already have Rosetta installed or don’t care, go for it.
Hardware limit: llama3.2-vision
I gave the llama3.2-vision model a shot and that’s when things fell over.
# with `ollama serve` running
ollama run llama3.2-vision
This is an 8GB download and it uses 8GB of memory when loaded without doing anything. And boy does my laptop not do well with this. While trying to process an image I took with my phone, the laptop was using all the memory and starting to swap, though at least not thrashing.
Each query took almost 3 minutes to complete (GPU pinned). This made it unusable since I’m doing multiple queries to get what I want. Perhaps if I learn how to prompt better (reminds me of getting good at doing Google searches back in the day) I can one-shot these and the 3-minute wait would be fine, but as is, it’s not tenable.
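For reference, the ollama CLI will pick up an image path included in the prompt for vision models, so a query looks something like this (the path is a placeholder):
# with `ollama serve` running and the model already pulled
ollama run llama3.2-vision "Describe what is in this photo: ./photo-from-phone.jpg"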
Footnotes
1. I’m deeply aware of so many issues with the AI hype cycle that’s happening. It’s why I’ve been reluctant to dip my toe too much. The climate repercussions, the bullshit generation, how the training data was acquired, the impact on so many people’s livelihoods, and the power differential it is taking place within and looks optimized to serve, can be overwhelming.
2. Yes, Simon’s laptop is beefy: an M2 Pro with 64GB of memory, but going from something only able to run in a datacenter to now being able to run on a laptop in 21 months is remarkable! That said, it’s not even possible to configure an M4 Pro laptop with that much RAM. This upsell bullshit makes me wish Tim Cook would GTFO. You can get an M4 Pro with 64GB in a Mac mini! They are making these chips! You just aren’t allowed to have it in your laptop. 🖕 Maybe the next CEO will prioritize products for the customers instead of for Apple’s short-term stock price. But, I digress …
3. Since I’m not using Google, the only search engine paying Reddit to index it. The open web this is not.
4. Yes, I get why an A18 Pro is faster than an M1, but it’s still hilarious that my phone is faster than my laptop in this regard.