dlite.cc

Streamlining RAG evaluation

2023-10-04T11:00:00+00:00

My first weeks working with GPT-4 were magical. I was doing things that I previously thought were impossible. However, as it went from promising proof-of-concept to something I wanted to share, I fell into a trough of sorrow. Every live demo felt like a YOLO moment … who knew what would happen?

My app - a DevOps AI Assistant called OpsTower.ai - could perform a few mic-drop tasks, but getting it to reliably reproduce those results was a frustrating, non-deterministic nightmare. A public release was perpetually one week away.

Fast forward to today: OpsTower.ai is State of the Art (SOTA) in the three categories it competes in on the DevOps AI Assistant Open Leaderboard.

The DevOps AI Assistant Open Leaderboard is a set of evaluation datasets for AWS Services, AWS Cloudwatch metrics, AWS Billing, and kubectl commands. The evaluation procedure is open-source and available on GitHub. Disclaimer! I created this to evaluate DevOps AI tools ... and yes, OpsTower (the tool I created) is on top...

In this post, I share how I emerged from my AI trough of sorrow via a streamlined form of Eval Driven Development (EDD).

What is Eval Driven Development (EDD)?
How to make EDD fast? Eliminate human eval.
Dynamic ground truth
Model-based eval
Implementing my streamlined EDD flow
- Creating a dynamic ground truth dataset
- Model-based eval
Conclusion
EDD Resources

What is Eval Driven Development (EDD)?

I first saw this term in Eugene Yan’s seminal blog post Patterns for Building LLM-based Systems & Products. I define EDD as:

Eval Driven Development (EDD) is a process that uses an evaluation suite to guide which levers (prompt, context, model params) to pull (and how far) to improve accuracy.

How does EDD compare to ML evaluation and Test Driven Development (TDD)?

EDD combines elements of machine learning model evaluation and software Test Driven Development (TDD). In the table below, I’ve summarized how model eval and TDD compare across several aspects. I indicate which approach is most applicable to EDD via the ✅ EDD label:

Aspect	ML Evaluation	TDD
Nature	Experimental: involves preprocessing, training, and tuning.	Deterministic: involves writing tests, then the code, and getting immediate feedback. ✅ EDD
Feedback Type	Probabilistic: results can vary with slight changes. ✅ EDD	Deterministic: code either passes or fails the test.
Duration	Can be long, especially with large datasets or complex models.	Typically short, as unit tests are designed to be quick and focused. ✅ EDD
Infrastructure	Requires significant computational resources for complex models.	Minimal resources needed for most tests. ✅ EDD
Evaluation	Might involve multiple metrics and can be context-dependent. ✅ EDD	Immediate and binary: pass or fail.
Tooling	Advanced platforms available for prototyping but can be resource-intensive.	Wide range of tools for rapid development and continuous integration. ✅ EDD
Determinism	Results can vary between runs; uncertainty is inherent. ✅ EDD	Results are consistent; code behavior is expected to be deterministic.

An evaluation wrinkle for LLM-backed apps: external systems

LLM-backed apps - especially autonomous agents - are often deployed in environments where the underlying data they access is changing frequently. My autonomous agent, OpsTower.ai, interacts with AWS to retrieve real-time data about a customer’s cloud infrastructure. The data is not static and there are multiple approaches the LLM can take to assemble API calls to fetch information that can lead to the same result. It’s not feasible to build a static test suite or mock all of the possible API calls that the generated code may trigger.

Summarizing the key elements:

The EDD feedback cycle should be fast (like TDD) so you can quickly iterate on prompts, context retrieval, and model parameters.
The feedback is probabilistic (like ML evaluation) as the natural language responses you receive from an LLM may range from incorrect, partially correct, to correct.
The evaluation is dynamic (unique to LLM apps) as the data they access changes frequently and we can’t mock all of the approaches LLM-generated code may use to access it.

How to make EDD fast? Eliminate human eval.

The slowest part of EDD is human evaluation. It takes me 30 minutes to human evaluate a test run. Because it’s slow, most AI engineers revert to “vibe checks” for evaluation.

And less of this. Industry can't depend on "vibes." 🙃 https://t.co/arkfedt1kq
— Ian Cairns (@cairns) October 12, 2023

A vibe check is just running your LLM app and evaluating the result by hand. This is slow, unlikely to have good coverage, and gets tedious quickly.

Why do AI engineers resort to vibe checks when we likely come from backgrounds that value automated testing? It’s hard to come up with an automated system to evaluate an LLM-backed app. Here’s an example question-and-answer flow from OpsTower.ai:

Here’s why evaluating this is hard:

What’s the ground truth? If I run this now, the CPU utilization will be different. I can’t mock every possible API call the LLM provides via code generation that delivers a correct result.
How to evaluate the natural language response? Variations in the text are likely fine, but small variations in referenced metrics can be a big deal.

Let’s see how I’ve approached a solution for the above problems.

Dynamic ground truth

What if rather than using static ground truth like below:

Question: What is the average cpu utilization of our RDS instances over the past hour?
Ground Truth: The average CPU utilization of our RDS instances over the past hour is approximately 3.71%.

I instead reference a function that generates this context:

For this to work, we need to be confident that our LLM app will return a correct answer if we provide it with the correct context. Thankfully, this is a reasonable assumption. Here’s a slide from Colin Jarvis of OpenAI with a matrix of typical RAG evaluation results:

In this example, incorrect retrieval is responsible for 4x the number of incorrect answers vs. other sources.

Colin shows that only 5% of answers are incorrect when the retrieval is correct. While there are no guarantees with LLMs, the surest way I’ve found to get an inaccurate response is to feed the LLM bad context. For example, if you ask for the weather today but you actually insert the forecast for tomorrow into the context, the LLM will not magically change your context and fetch the weather for the correct date.

So, we can assemble dynamic ground truth like this:

User question
Execute reference function to generate context
Insert context into LLM prompt
LLM generates answer
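
As a rough sketch, the four steps above could look like this in Ruby. The reference function registry, the prompt wording, and the `fake_llm` stand-in are all assumptions made for illustration; none of them come from OpsTower.ai itself.

```ruby
# Sketch of the dynamic ground truth flow. All names here are hypothetical.

# A reference function returns fresh context every time it is called.
# A real one would query live infrastructure; this one is stubbed.
REFERENCE_FUNCTIONS = {
  "rds_avg_cpu_last_hour" => -> { "Average RDS CPU utilization over the past hour: 3.71%" }
}.freeze

# Build a prompt that asks the LLM to answer using only the supplied context.
def ground_truth_prompt(question, context)
  <<~PROMPT
    Answer the question using only the context below.
    Context: #{context}
    Question: #{question}
  PROMPT
end

# llm is any callable that takes a prompt string and returns a string.
def dynamic_ground_truth(question, reference_function_id, llm)
  context = REFERENCE_FUNCTIONS.fetch(reference_function_id).call # execute reference function
  prompt = ground_truth_prompt(question, context)                  # insert context into prompt
  llm.call(prompt)                                                 # LLM generates answer
end

# Stand-in for a real model call so the sketch runs offline.
fake_llm = ->(prompt) { "LLM answer based on: #{prompt[/Context: (.*)/, 1]}" }

puts dynamic_ground_truth(
  "What is the average cpu utilization of our RDS instances over the past hour?",
  "rds_avg_cpu_last_hour",
  fake_llm
)
```

Because the reference function runs at evaluation time, the ground truth tracks whatever the infrastructure looks like at that moment rather than a value frozen in a file.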

Next, we need to evaluate test answers versus our dynamic ground truth.

Model-based eval

When we’re working with an LLM, we’re evaluating natural language responses. Natural Language Processing (NLP) is a classical machine learning domain and these models have their own evaluation techniques and metrics like BLEU, ROUGE, BERTScore, and MoverScore. Why don’t we just use those metrics?

There’s actually poor correlation between these NLP evaluation metrics and human judgments. From Eugene Yan’s excellent post Patterns for Building LLM-based Systems & Products:

BLEU, ROUGE, and others have had negative correlation with how humans evaluate fluency. They also showed moderate to less correlation with human adequacy scores. In particular, BLEU and ROUGE have low correlation with tasks that require creativity and diversity.

For example, let’s compare two responses to the question “What is the average CPU utilization of our RDS instances over the past hour?”:

Ground truth: The average CPU utilization of our RDS instances over the past hour is approximately 3.71%.
Prediction: The average CPU utilization of our RDS instances over the past hour is approximately 37.1%.

I moved the decimal point in the prediction, resulting in a far different answer with a small change in the text. Here are the eval metrics:

Metric	Value/Sub-metric	Score
BLEU	BLEU-1	0.9333
	BLEU-2	0.9286
	BLEU-3	0.9231
	BLEU-4	0.9167
ROUGE-1	Recall	0.9333
	Precision	0.9333
	F1	0.9333
ROUGE-2	Recall	0.9286
	Precision	0.9286
	F1	0.9286
ROUGE-3	Recall	0.9231
	Precision	0.9231
	F1	0.9231
ROUGE-L	Recall	0.9333
	Precision	0.9333
	F1	0.9333
Cosine Similarity	-	0.8901

A small change in the text (moving a decimal point) has minimal impact on the eval metrics but a significant impact on human judgement.

The scores are close to 1, which indicates a high similarity between the reference and the prediction even though a human would judge the response as incorrect.
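
To see why the scores barely move, here is a minimal n-gram overlap calculation in Ruby. It is a simplified stand-in for ROUGE-N recall (whitespace tokenization, case-sensitive, no stemming), not the library that produced the table above. With 15 whitespace-separated tokens and only the final one changed, 14 of 15 unigrams, 13 of 14 bigrams, and 12 of 13 trigrams still match.

```ruby
# Whitespace tokenization, case-sensitive: a deliberate simplification.
def ngrams(text, n)
  text.split.each_cons(n).to_a
end

# Fraction of reference n-grams that also appear in the prediction
# (with clipped counts), in the spirit of ROUGE-N recall.
def ngram_recall(reference, prediction, n)
  ref_counts = ngrams(reference, n).tally
  pred_counts = ngrams(prediction, n).tally
  overlap = ref_counts.sum { |gram, count| [count, pred_counts.fetch(gram, 0)].min }
  overlap.to_f / ref_counts.values.sum
end

reference  = "The average CPU utilization of our RDS instances over the past hour is approximately 3.71%."
prediction = "The average CPU utilization of our RDS instances over the past hour is approximately 37.1%."

(1..3).each do |n|
  puts format("%d-gram recall: %.4f", n, ngram_recall(reference, prediction, n))
end
```

The metric only counts matching tokens, so a tenfold error in the one token that matters costs a single n-gram per window.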

Could an LLM fill in as a human evaluator?

It turns out, LLMs are good substitutes for human evaluators:

GPT-4 as an evaluator had a high Spearman correlation with human judgments (0.514), outperforming all previous methods. It also outperformed traditional metrics on aspects such as coherence, consistency, fluency, and relevance. On topical chat, it did better than traditional metrics such as ROUGE-L, BLEU-4, and BERTScore across several criteria such as naturalness, coherence, engagingness, and groundedness.

And:

Overall, they found that GPT-4 not only provided consistent scores but could also give detailed explanations for those scores. Under the single answer grading paradigm, GPT-4 had higher agreement with humans (85%) than the humans had amongst themselves (81%). This suggests that GPT-4’s judgment aligns closely with the human evaluators.

Personally, I saw almost identical scoring when switching from human eval to LLM eval:

Implementing my streamlined EDD flow

So these are the components we need to implement for a streamlined EDD flow:

Creating a dynamic ground truth dataset.
Implementing an LLM-based eval.

Creating a dynamic ground truth dataset

1. Generate dataset questions (can use an LLM to assist)

To start, I use ChatGPT to generate a few initial questions for a new evaluation dataset. I’m working on OpsTower.ai, a DevOps AI Assistant, so let’s create a dataset of questions about AWS CloudWatch Logs. Here is my transcript.

I use ChatGPT to generate dataset questions. This is for an AWS Cloudwatch Logs dataset to test OpsTower.ai.

I then paste these questions into an aws_cloudwatch_logs.csv file:

2. Programmatically generate responses for each question

Next I programmatically generate responses to each of these questions using the current AI agent. For my app, it looks like this:

demo_source = Eval::VendorTest.new.source.save!
test = AgentTest.create!(source: demo_source, dataset_file: "aws_cloudwatch_logs.csv")
test.run!

When the test completes, I view the answers in the UI:

Above is a screenshot of the custom evaluation results page within my app. It provides summary stats and details on each question-answer pair.

3. Save ground truth functions for generating context

I then review the results. Results generally fall into three buckets:

Works as-is - the agent generated valid context and answered the question correctly. I’ll reuse the functions that generated the context as ground truth.
Hybrid - the agent did not generate valid context, but the code it created to generate context is a solid starting point. I can modify the code then save it as a reference function.
Fully human-generated - the agent failed miserably at code generation. I’ll write a new function from scratch.

I like to go from easy to hard. The easiest ones are “works as-is” as I can simply copy and save the generated code.

To create a reference function from a “works as-is” result, I generate a saved method from the code the agent generated then reference that function by ID in the aws_cloudwatch_logs.csv file.

For example, the question “How many CloudWatch Log Groups do I have?” is correctly answered below:

I want to save the code the LLM generated, starting with the get_cloudwatch_log_groups_count method. In my app, I can do this by executing:

saved_methods = SavedMethod.create_from_chat!("482273fb-4c79-4cd2-bc4d-382945c38e42")
saved_methods.map(&:id)
["74294dde-8cd0-4803-a0c2-7c117b8b15de"]

I then paste the saved method ID into the aws_cloudwatch_logs.csv file:

I repeat this process for each “works as-is” result. Hybrid and fully human-generated results are handled similarly, but with more changes to the code.
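
For illustration, a dataset file that pairs each question with a saved method ID could be loaded like this. The column names and the second question are assumptions invented for this sketch; the real aws_cloudwatch_logs.csv layout may differ.

```ruby
require "csv"

# Hypothetical layout: one row per question, with an optional ID pointing at
# the saved reference function that generates ground truth context.
csv_text = <<~ROWS
  question,reference_function_id
  How many CloudWatch Log Groups do I have?,74294dde-8cd0-4803-a0c2-7c117b8b15de
  Which log group has the most stored bytes?,
ROWS

# Parse the CSV into plain hashes; an empty ID column comes back as nil.
def load_dataset(csv_text)
  CSV.parse(csv_text, headers: true).map do |row|
    { question: row["question"], reference_function_id: row["reference_function_id"] }
  end
end

dataset = load_dataset(csv_text)

# Split into questions that already have a reference function and those that do not.
ready, pending = dataset.partition { |row| row[:reference_function_id] }
puts "#{ready.size} question(s) with a reference function, #{pending.size} still pending"
```

Keeping the ID next to the question makes it easy to see which rows still need a reference function before they can be evaluated automatically.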

Ground Truth Prompt Template

I use the prompt below to generate the ground truth answer from the referenced function we saved earlier. The prompt template looks like this:

An evaluated prompt example:

This will return text similar to “You have 23 Cloudwatch Log groups in your AWS account.”

Model-based eval

The prompt template below is my evaluation prompt. It generates a confidence score, comparing the answer from the agent vs the response generated from the dynamic ground truth prompt:

Here’s an evaluated prompt example (just focusing on template variables):

When the eval is run, I’ll see output like the following for each question:

This shows the evaluation result on a single question-answer pair. In this case, the result did not pass evaluation.
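
A model-based eval of this shape can be sketched in a few lines of Ruby. The prompt wording, the "Confidence:" output format, and the pass threshold below are assumptions made for this sketch, not my actual evaluation prompt, and `fake_llm` returns a canned reply in place of a real model call.

```ruby
# Hypothetical evaluation prompt; the wording and output format are assumptions.
def eval_prompt(question, ground_truth, answer)
  <<~PROMPT
    You are grading an answer against a reference answer.
    Question: #{question}
    Reference answer: #{ground_truth}
    Answer to grade: #{answer}
    Pay close attention to numbers. Reply with a line of the form
    "Confidence: <0-100>" followed by a one-sentence explanation.
  PROMPT
end

# Pull the integer score out of the model's reply; nil if it is missing.
def parse_confidence(reply)
  match = reply.match(/Confidence:\s*(\d{1,3})/)
  match && match[1].to_i.clamp(0, 100)
end

# llm is any callable taking a prompt and returning text.
# pass_threshold is an arbitrary cut-off chosen for this sketch.
def evaluate(question, ground_truth, answer, llm, pass_threshold: 80)
  reply = llm.call(eval_prompt(question, ground_truth, answer))
  score = parse_confidence(reply)
  { score: score, passed: !score.nil? && score >= pass_threshold, reply: reply }
end

# Canned reply standing in for a real model call.
fake_llm = ->(_prompt) { "Confidence: 5\nThe answer reports 37.1% but the reference says 3.71%." }

result = evaluate(
  "What is the average CPU utilization of our RDS instances over the past hour?",
  "The average CPU utilization of our RDS instances over the past hour is approximately 3.71%.",
  "The average CPU utilization of our RDS instances over the past hour is approximately 37.1%.",
  fake_llm
)
puts "score=#{result[:score]} passed=#{result[:passed]}"
```

Treating a missing score as a failure keeps a malformed model reply from silently counting as a pass.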

Rinse and repeat

Once I’m getting acceptable accuracy on an evaluation dataset (ex: 80% or greater), I’ll repeat the process outlined above:

Add questions with ChatGPT.
Add reference functions.
Run evaluation, tweak the app, run evaluation, etc.

Conclusion

I’ve gone all-in on the LLM.

I’ve dramatically increased the accuracy, capabilities, and reliability of my LLM-backed app by leveraging a more automated, streamlined form of Eval Driven Development (EDD). My flavor of EDD leans on the LLM to generate the question dataset, uses human-evaluated reference functions to generate context, re-assembles ground truth answers via the LLM, and finally uses the LLM again to simulate human evaluation.

For my app, the resulting confidence scores of this approach are typically within 5% of human-eval scoring at a fraction of the time spent.

EDD Resources

To see example datasets, reference functions, and evaluation prompts, check out the DevOps AI Assistant Open Leaderboard on GitHub.

See Awesome Eval Driven Development on GitHub for a continually updated set of resources related to Eval Driven Development.

Implementing an LLM Agent to complete tasks using Google with Ruby

2023-05-08T11:00:00+00:00

Being lazy, I’m very interested in using agents powered by LLMs to accomplish tasks for me. In this post, I explore how this is done with Boxcars, a Ruby gem inspired by Langchain for building LLM apps.

Quick intro to Boxcars

Boxcars is a Ruby gem that makes it easy to build applications with LLMs. I’ve found it much easier to use than Langchain as it provides “just enough” abstractions to interact with an LLM and act on the output. See the getting started docs to get going on your own.

A single boxcar train for realtime weather

In this example, I’ll set up a train with a custom Boxcar, GoogleAnswerBox (source). GoogleAnswerBox returns the answer box in Google search results (the box at the top of the page that is displayed if Google can answer your question directly) as JSON.

boxcars = [GoogleAnswerBox.new]
train = Boxcars.train.new(boxcars: boxcars)
train.run("what is the temperature in Fort Collins?")
 => "The current temperature in Fort Collins is 45 degrees Fahrenheit." 

This matches the result when I run the same query directly on Google in my web browser:

How about a stock price?

train.run("what is the Tesla stock price?")
=> "The current Tesla stock price in USD is $169.36.\nNext Actions:\n1. What was the opening price of Tesla stock today?\n2. How has the Tesla stock price changed over the past week?\n3. What is the market capitalization of Tesla?" 

Or a holiday?

train.run("When is Memorial Day?")
 => "Memorial Day is on Monday, May 29, 2023.\nNext Actions: None, as the answer is straightforward." 

Or the time?

train.run("what time is it in Denver?")
 => "The current time in Denver is 08:41 AM. \nNext Actions: None, as the user's question has been answered." 

How does Boxcars take my query, interact with an external tool (Google Search), and generate an answer?

ReAct (Reason + Act) on Ruby

If I asked you for the current temperature, time, or score of an NBA playoff game, you would need an external tool to provide me with this information. It’s not stored in your brain, but your brain can determine which tool to use, interact with the tool, process the data displayed in the tool, and finally provide me with an answer.

Just like your brain, an LLM cannot provide you with information on current events, but you can give an LLM information on external tools it can use to fetch realtime data. Perhaps the most popular approach for having an LLM reason and use external tools is the ReAct (Reason + Act) framework, introduced in this paper (Yao et al., 2022). In the example above, Boxcars uses a Zero Shot (no training) ReAct prompt to provide answers.

Let’s walk through how Boxcars implements ReAct when using the GoogleAnswerBox tool.

First prompt

I’ll start by looking at the LLM prompt generated by the Boxcars Train:

>>>>>> Role: system <<<<<<
Answer the following questions as best you can. You have access to the following actions:

AnswerBox: useful for when you need to answer questions that require realtime data.You should ask targeted questions

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one from this list: [AnswerBox]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation sequence can repeat N times)
Thought: I know the final answer
Final Answer: the final answer to the original input question
Next Actions: Up to 3 logical suggested next questions for the user to ask after getting this answer.
Remember to start a line with "Final Answer:" to give me the final answer.
Begin!
>>>>>> Role: user <<<<<<
Question: what is the temperature in Fort Collins?
>>>>>> Role: assistant <<<<<<
Thought: 

Taking a step back: this is fascinating. There’s no training involved. It takes under 140 words of system instructions to get the answer to our question. The prompt is first broken down into three ChatGPT-specific roles:

>>>>>> Role: system <<<<<< - these instructions guide the model throughout the conversation.
>>>>>> Role: user <<<<<< - the person asking questions to ChatGPT.
>>>>>> Role: assistant <<<<<< - responses from ChatGPT to questions.

You can learn more about ChatGPT roles from their docs.

Note how Thought: (the last line) is empty. This is the start of the Thought/Action/Action Input/Observation we’re asking ChatGPT to complete.
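
To make the role structure concrete, here is a small Ruby sketch that assembles the three roles as an array of hashes and renders them in the same style as the logged prompt. The helper names are hypothetical and the system instructions are abbreviated; this is an illustration of the shape, not how Boxcars builds its prompt internally.

```ruby
# Assemble the three roles as an array of hashes, the shape used by chat-style APIs.
# The assistant message ends with an open "Thought: " for the model to complete.
def react_messages(system_instructions, question, scratchpad = "")
  [
    { role: "system", content: system_instructions },
    { role: "user", content: "Question: #{question}" },
    { role: "assistant", content: "#{scratchpad}Thought: " }
  ]
end

# Render the messages in the same style as the logged prompt.
def render(messages)
  messages.map { |m| ">>>>>> Role: #{m[:role]} <<<<<<\n#{m[:content]}" }.join("\n")
end

puts render(
  react_messages(
    "Answer the following questions as best you can. (instructions abbreviated)",
    "what is the temperature in Fort Collins?"
  )
)
```

On later turns, the scratchpad argument would carry the accumulated Thought/Action/Observation text, so the trailing empty Thought always sits at the end.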

ChatGPT responds with the following:

Thought: I need to use the AnswerBox action to get the current temperature in Fort Collins.
Observation: I need to use the AnswerBox action to get the current temperature in Fort Collins

ChatGPT generates this reasoning from the GoogleAnswerBox boxcar description that is included in the prompt:

AnswerBox: useful for when you need to answer questions that require realtime data.You should ask targeted questions

If I omit the actions portion of the prompt and send the prompt to ChatGPT, I’ll get a response like this:

Thought: I should look up the current temperature in Fort Collins.
Action: Use a search engine to find the current temperature in Fort Collins.
Action Input: "Fort Collins current temperature" in a search engine.
Observation: The current temperature in Fort Collins is displayed on a search results page.
Thought: I should provide the temperature to the user.
Final Answer: The current temperature in Fort Collins is [insert current temperature].

ChatGPT realizes it should use a search to collect current information, but it doesn’t have access to an action to fetch the current temperature.

Second prompt

The Boxcars train is now ready to continue the thought/action/action input/observation loop by sending a second prompt. For brevity, I’ve omitted the system and user roles which remain the same:

>>>>>> Role: assistant <<<<<<
Thought:  I need to use the AnswerBox action to get the current temperature in Fort Collins.
Observation: I need to use the AnswerBox action to get the current temperature in Fort Collins.
Thought:

ChatGPT responds with:

I should ask for the current temperature in Fort Collins.
Action: AnswerBox
Action Input: "What is the current temperature in Fort Collins?"

The Boxcars::Train object takes the ChatGPT response and parses out the Action and Action Input, mapping these to the available actions (just AnswerBox for now). GoogleAnswerBox#run is called with the Action Input, returning the text below:

Answer: {"type":"weather_result","temperature":"49","unit":"Fahrenheit","precipitation":"0%%","humidity":"65%%","wind":"3 mph","location":"Weather","date":"Monday 7:00 AM","weather":"Mostly sunny","thumbnail":"https://serpapi.com/searches/6458f883ce87f81e4d7973c2/images/dccefb93a84c042f2c5d64fd510927300f2ceedeb076f4ff.png","forecast":[{"day":"Monday","weather":"Partly cloudy","temperature":{"high":"73","low":"45"},"thumbnail":"https://serpapi.com/searches/6458f883ce87f81e4d7973c2/images/dccefb93a84c042f85705f9091acff95c6669fc2faa0664facd6d2464297ada0.png"},{"day":"Tuesday","weather":"Mostly sunny","temperature":{"high":"78","low":"48"},"thumbnail":"https://serpapi.com/searches/6458f883ce87f81e4d7973c2/images/dccefb93a84c042f85705f9091acff9581bc24fa381ee107ae5c17df0e61b10a.png"},{"day":"Wednesday","weather":"Scattered thunderstorms","temperature":{"high":"72","low":"50"},"thumbnail":"https://serpapi.com/searches/6458f883ce87f81e4d7973c2/images/dccefb93a84c042f85705f9091acff9502e1868124e15c11dc1beb

Yes, that is just an ugly truncated JSON representation of the Google answer box in the search result. GoogleAnswerBox does not parse the contents of the answer box: there are many variations of answer box formats. Why not let ChatGPT parse this for me?

Third and final prompt

The Train now sends off a prompt with the answer box JSON above in the Observation to ChatGPT:

>>>>>> Role: assistant <<<<<<
Thought:  I need to use the AnswerBox action to get the current temperature in Fort Collins.
Observation: I need to use the AnswerBox action to get the current temperature in Fort Collins.
Thought: I should ask for the current temperature in Fort Collins by using the AnswerBox action.
Action: AnswerBox
Action Input: "What is the current temperature in Fort Collins?"
Observation: Answer:{"type":"weather_result","temperature":"49","unit":"Fahrenheit","precipitation":"0%%","humidity":"65%%","wind":"3 mph","location":"Weather","date":"Monday 7:00 AM","weather":"Mostly sunny","thumbnail":"https://serpapi.com/searches/6458f883ce87f81e4d7973c2/images/dccefb93a84c042f2c5d64fd510927300f2ceedeb076f4ff.png","forecast":[{"day":"Monday","weather":"Partly cloudy","temperature":{"high":"73","low":"45"},"thumbnail":"https://serpapi.com/searches/6458f883ce87f81e4d7973c2/images/dccefb93a84c042f85705f9091acff95c6669fc2faa0664facd6d2464297ada0.png"},{"day":"Tuesday","weather":"Mostly sunny","temperature":{"high":"78","low":"48"},"thumbnail":"https://serpapi.com/searches/6458f883ce87f81e4d7973c2/images/dccefb93a84c042f85705f9091acff9581bc24fa381ee107ae5c17df0e61b10a.png"},{"day":"Wednesday","weather":"Scattered thunderstorms","temperature":{"high":"72","low":"50"},"thumbnail":"https://serpapi.com/searches/6458f883ce87f81e4d7973c2/images/dccefb93a84c042f85705f9091acff9502e1868124e15c11dc1beb
Thought:

ChatGPT returns:

Final Answer: The current temperature in Fort Collins is 49 degrees Fahrenheit.

Our ZeroShot train sees the magic Final Answer text, exits, and returns the result.

Note: the temperature increased from 45 to 49 degrees while writing this blog post.
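
The whole cycle (send the transcript, parse out an action, run it, append the observation, stop on a final answer) fits in a short loop. The Ruby below is a minimal illustration of that idea with scripted completions in place of ChatGPT; the method name, regular expressions, and error handling are my own assumptions and are not Boxcars' implementation.

```ruby
# Minimal ReAct-style loop. An illustration only, not Boxcars' implementation.
ACTION_PATTERN = /Action:\s*(?<action>.+?)\s*\nAction Input:\s*(?<input>.+)/
FINAL_PATTERN  = /Final Answer:\s*(?<answer>.+)/m

# tools maps an action name to a callable.
# llm maps the transcript so far to the next completion.
def run_react(question, tools, llm, max_steps: 5)
  transcript = "Question: #{question}\nThought: "
  max_steps.times do
    completion = llm.call(transcript)
    transcript += completion

    # Stop as soon as the model emits the "Final Answer:" marker.
    if (final = completion.match(FINAL_PATTERN))
      return final[:answer].strip
    end

    step = completion.match(ACTION_PATTERN)
    raise "No action or final answer found in: #{completion}" unless step

    # Look up the tool by name, strip surrounding quotes from its input, and run it.
    tool = tools.fetch(step[:action])
    input = step[:input].strip.delete_prefix('"').delete_suffix('"')
    transcript += "\nObservation: #{tool.call(input)}\nThought: "
  end
  raise "Gave up after #{max_steps} steps"
end

# Scripted completions stand in for the model so the sketch runs offline.
scripted = [
  "I should ask for the current temperature in Fort Collins.\n" \
  "Action: AnswerBox\nAction Input: \"What is the current temperature in Fort Collins?\"",
  "I know the final answer\n" \
  "Final Answer: The current temperature in Fort Collins is 49 degrees Fahrenheit."
]
llm = ->(_transcript) { scripted.shift }
tools = {
  "AnswerBox" => ->(_query) { '{"type":"weather_result","temperature":"49","unit":"Fahrenheit"}' }
}

puts run_react("what is the temperature in Fort Collins?", tools, llm)
```

The `max_steps` cap is there so a model that never produces a final answer cannot loop forever.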

Going deeper: a two boxcar train to query both a Rails DB and Google

Just like you’ll use multiple tools to accomplish a task, the ReAct framework can do the same. Let’s set up a Boxcar train with two boxcars, ActiveRecord for searching my Rails database and Google Answer Box. I’ll find out which country a user’s timezone belongs to:

boxcars = [Boxcars::ActiveRecord.new, GoogleAnswerBox.new]
train = Boxcars.train.new(boxcars: boxcars)
train.run("what country does the user derek@glass.io user's timezone belong to?")
 => "The timezone belonging to the user derek@glass.io belongs to the United States." 

First prompt => completion cycle highlights

There are now two actions listed in the prompt (note that Glass is the name of the Rails application):

You have access to the following actions:

Glass: useful for when you need to query a database for an application named .
AnswerBox: useful for when you need to answer questions that require realtime data.You should ask targeted questions
Use the following format:

ChatGPT returns:

We need to use a database to query the user's timezone and then determine the country it belongs to.

Second prompt => completion cycle highlights

The following is now appended to the second prompt:

>>>>>> Role: assistant <<<<<<
Thought:  We need to use a database to query the user's timezone and then determine the country it belongs to.
Observation: We need to use a database to query the user's timezone and then determine the country it belongs to.

ChatGPT responds with:

We can use Glass to query the database for the user's timezone and then use that information to determine the country it belongs to.
Action: Glass
Action Input: "Query timezone for derek@glass.io user"

Our ZeroShot Train parses the output above and calls Boxcars::Boxcar::ActiveRecord#run with the Action Input above. This uses ChatGPT to take a natural language query and convert it to ActiveRecord code (see this post for more on how the ActiveRecord boxcar works). ChatGPT returns:

ARCode: `User.where(email: 'derek@glass.io').pluck(:time_zone)`

The boxcar executes this code, returning:

America/Denver

Third prompt => completion cycle highlights

Our assistant role now looks like:

>>>>>> Role: assistant <<<<<<
Thought:  We need to use a database to query the user's timezone and then determine the country it belongs to.
Observation: We need to use a database to query the user's timezone and then determine the country it belongs to.
Thought: We can use Glass to query the database for the user's timezone and then use that information to determine the country it belongs to.
Action: Glass
Action Input: "Query timezone for derek@glass.io user"
Observation: America/Denver
Thought:

ChatGPT responds with:

Now that we have the user's timezone, we can use AnswerBox to determine the country it belongs to.
Action: AnswerBox
Action Input: "What country does the timezone America/Denver belong to?"

Our train now executes the GoogleAnswerBox boxcar, returning the following truncated JSON result:

{"type":"organic_result","title":"America/Denver - Time zone in United States - TRAVEL GUIDE","link":"https://www.travelguide.de/en/timezone/america-denver/#:~:text=America%%2FDenver%%20is%%20a%%20time,Mountain%%20Standard%%20Time%%20(MST).","displayed_link":"https://www.travelguide.de › Home › Time zones","hover_info":[{"text":"Mountain Standard Time","info":"United States Six states are fully contained in the Mountain Time Zone: Colorado. Montana. New Mexico.","title":"Mountain Time Zone - Wikipedia","link":"https://en.wikipedia.org/wiki/Mountain_Time_Zone","displayed_link":"https://en.wikipedia.org › wiki › Mountain_Time_Zone"}],"snippet":"America/Denver is a time zone in United States. It corresponds to the Mountain Standard Time (MST).","snippet_highlighted_words":["United States"],"thumbnail":"https://serpapi.com/searches/64590f407f836163a84eff66/images/6d20911aeb0d64ec26e1f63ef2e6c3f28ed8ae8265617e28.png","about_this_result":{"source":{"description":"travelguide.de was first indexed by

Fourth prompt => completion cycle highlights

We’re now most of the way through. The train appends the JSON result above as observation to the prompt:

>>>>>> Role: assistant <<<<<<
Thought:  We need to use a database to query the user's timezone and then determine the country it belongs to.
Observation: We need to use a database to query the user's timezone and then determine the country it belongs to.
Thought: We can use Glass to query the database for the user's timezone and then use that information to determine the country it belongs to.
Action: Glass
Action Input: "Query timezone for derek@glass.io user"
Observation: America/Denver
Thought: Now that we have the user's timezone, we can use AnswerBox to determine the country it belongs to.
Action: AnswerBox
Action Input: "What country does the timezone America/Denver belong to?"
Observation: Answer: {"type":"organic_result","title":"America/Denver - Time zone in United States - TRAVEL GUIDE","link":"https://www.travelguide.de/en/timezone/america-denver/#:~:text=America%%2FDenver%%20is%%20a%%20time,Mountain%%20Standard%%20Time%%20(MST).","displayed_link":"https://www.travelguide.de › Home › Time zones","hover_info":[{"text":"Mountain Standard Time","info":"United States Six states are fully contained in the Mountain Time Zone: Colorado. Montana. New Mexico.","title":"Mountain Time Zone - Wikipedia","link":"https://en.wikipedia.org/wiki/Mountain_Time_Zone","displayed_link":"https://en.wikipedia.org › wiki › Mountain_Time_Zone"}],"snippet":"America/Denver is a time zone in United States. It corresponds to the Mountain Standard Time (MST).","snippet_highlighted_words":["United States"],"thumbnail":"https://serpapi.com/searches/64590f407f836163a84eff66/images/6d20911aeb0d64ec26e1f63ef2e6c3f28ed8ae8265617e28.png","about_this_result":{"source":{"description":"travelguide.de was first indexed by
Thought:

ChatGPT responds with a final answer. The train sees the magic Final Answer: text and returns the result:

Based on the AnswerBox response, the timezone America/Denver belongs to the United States.
Final Answer: The timezone belonging to the user derek@glass.io belongs to the United States.

TL;DR

Just like how an LLM can generate text for a blog post (not this one though!), it can also generate a plan to answer a question that requires using external tools. The most popular framework for this is ReAct (Reason + Act), which we can use in Ruby via the Boxcars gem. Boxcars handles generating the ZeroShot ReAct prompt, parsing the model completions for actions, and running those actions.

Using Boxcars - the lightweight Ruby Langchain alternative - to query a Rails DB with natural language

2023-05-06T11:00:00+00:00

You may have heard of Langchain, the Python library for creating LLM-powered apps with nearly 35k GitHub stars. Despite the large following, Langchain can be difficult to use when you want to go deeper than “hello world” tutorials. This experience (and my background as a Rubyist) led me to Boxcars, a Langchain-inspired Ruby gem but with fewer abstractions.

You might be asking: why venture outside the Python ML ecosystem? Well, building apps with LLMs doesn’t require Ruby equivalents for NumPy, SciPy and Pandas. Rather than working with numbers, I find that most of my time is spent manipulating string templates and interacting with outside systems (like realtime search). The readability of Ruby is great for this use case.

In this post, I’ll use Boxcars to query my Rails database using natural language. I’ll take a look at how Boxcars creates ChatGPT prompts, handles errors, and how it compares to Langchain’s SQLDatabaseChain to solve the same problem.

Querying ActiveRecord with natural language

The Ruby ecosystem is Rails-centric, so it’s great to see that Boxcars plays well with Rails apps out of the box. Just add the boxcars gem to your Gemfile, set the OPENAI_ACCESS_TOKEN to your OpenAI API key, and you can start querying your database with natural language inside rails console:

boxcar = Boxcars::ActiveRecord.new
boxcar.run "How many users?"
{"status":"ok","answer":33,"explanation":"Answer: 33","code":"User.count"}
 => 33 

You may have used ChatGPT to generate code for you, but this goes a step beyond: it executes the code!

To see how the magic happens, I’ll call Boxcars.configuration.log_prompts = true and re-run the above code to inspect the generated ActiveRecord prompt:

>>>>>> Role: system <<<<<<                                                   
You are a Ruby on Rails Active Record code generator                         
>>>>>> Role: system <<<<<<                                                   
Given an input question, first create a syntactically correct Rails Active Record code to run, then look at the results of the code and return the answer. Unless the user specifies in her question a specific number of examples she wishes to obtain, limit your code to at most 5 results.
Never query for all the columns from a specific model, only ask for the relevant attributes given the question.
Also, pay attention to which attribute is in which model.                    
                                                                             
Use the following format:
Question: $
ARChanges: $ - Only add this line if the ARCode on the next line will make data changes.
ARCode: $ - make sure you use valid code
Answer: $

Only use the following Active Record models: []
Pay attention to use only the attribute names that you can see in the model description.
Do not make up variable or attribute names, and do not share variables between the code in ARChanges and ARCode
Be careful to not query for attributes that do not exist, and to use the format specified above.
Finally, try not to use print or puts in your code
>>>>>> Role: user <<<<<<
Question: How many users?

The model responds with:

ARCode: User.count

If you copy and paste the prompt above into ChatGPT you should see a very similar response.

How does the ActiveRecord Boxcar execute the query?

The ActiveRecord Boxcar checks to see if ARCode is in the response. If it is (and after some security checks) it executes the code returning the result of the ActiveRecord query.
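To make the parsing step concrete, here is a minimal sketch of pulling the code out of a completion. It is not the Boxcars implementation: the `extract_ar_code` helper and its regex are hypothetical names for illustration, and the sketch deliberately stops short of executing anything, since the real gem runs security checks first.

```ruby
# Hypothetical helper, not part of Boxcars: finds the line that starts with
# "ARCode:" and returns the code after it, with optional backticks removed.
AR_CODE_PATTERN = /^ARCode:\s*`?(?<code>.+?)`?\s*$/

# Returns the code string, or nil when the completion has no ARCode line.
def extract_ar_code(completion)
  match = completion.match(AR_CODE_PATTERN)
  match && match[:code]
end

puts extract_ar_code("ARCode: User.count").inspect
puts extract_ar_code("I apologize for that.").inspect
```

A nil result is the signal that the model's answer was not formatted as requested, which matters for the retry behavior covered next.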

What about adjusting queries if the first attempt is malformed?

Let’s say I ask Boxcar to run a query for an ActiveRecord model that does not exist. What does it do? Does it immediately exit, attempt to fix the issue, or just raise an exception?

boxcar.run "how many ClassDoesNotExist records were created this year?"

ChatGPT responds with a valid-looking query:

ARCode: `ClassDoesNotExist.where("created_at >= ?", Time.zone.now.beginning_of_year).count`

The Boxcar runs the query and captures the exception:

Error while running code: uninitialized constant Boxcars::ActiveRecord::ClassDoesNotExi ...

It will then retry up to 3 additional times. Notice how Boxcar appends (1) the code that was executed and (2) the error that resulted from running the query:

...
>>>>>> Role: user <<<<<<
Question: how many ClassDoesNotExist records  were created this year?
>>>>>> Role: assistant <<<<<<
ARCode: `ClassDoesNotExist.where("created_at >= ?", Time.zone.now.beginning_of_year).count`
>>>>>> Role: user <<<<<<
ARCode Error: uninitialized constant Boxcars::ActiveRecord::ClassDoesNotExist - please fix "ARCode:" to not have this error

ChatGPT then returns a response, but it lacks the ARCode section:

I apologize for that. It seems like there is no `ClassDoesNotExist` model in the list of available models. Please let me know which model you would like to use instead.

The boxcar appends this error to the prompt and re-runs:

>>>>>> Role: assistant <<<<<<
I apologize for that. It seems like there is no `ClassDoesNotExist` model in the list of available models. Please let me know which model you would like to use instead.
>>>>>> Role: user <<<<<<
Your answer wasn't formatted properly - try again. I expected your answer to start with "ARChanges:" or "ARCode:"

ChatGPT attempts to help, but we’re not going to get anywhere:

I apologize for the mistake. Here is the correct format:

ARCode: `ModelName.where("created_at >= ?", Time.zone.now.beginning_of_year).count`

Please replace `ModelName` with the name of the model you would like to use.

This is a smart approach: each retry updates the ChatGPT prompt with additional context about the previous error.
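The retry pattern in the transcript above can be sketched in a few lines of plain Ruby. This is a rough illustration under stated assumptions, not the Boxcars source: `run_with_feedback`, the `model` callable, and the `runner` callable are hypothetical stand-ins so the example runs without an API key or a database.

```ruby
MAX_ATTEMPTS = 4 # one initial try plus up to 3 retries, as described above

# `model` is any callable that takes the message list and returns a completion.
# `runner` is any callable that executes a code string and may raise.
def run_with_feedback(question, model:, runner:)
  messages = [{ role: "user", content: "Question: #{question}" }]

  MAX_ATTEMPTS.times do
    completion = model.call(messages)
    messages << { role: "assistant", content: completion }
    code = completion[/^ARCode:\s*`?(.+?)`?\s*$/, 1]

    if code.nil?
      # No ARCode line: tell the model its answer was malformed and retry.
      messages << { role: "user", content: "Your answer wasn't formatted properly - try again." }
      next
    end

    begin
      return { status: "ok", answer: runner.call(code), messages: messages }
    rescue StandardError => e
      # Feed the error back so the next completion can correct the code.
      messages << { role: "user", content: "ARCode Error: #{e.message} - please fix \"ARCode:\" to not have this error" }
    end
  end

  { status: "error", answer: nil, messages: messages }
end

# Scripted stand-ins: the first reply fails, the second succeeds.
replies = ["ARCode: Missing.count", "ARCode: User.count"]
fake_model = ->(_messages) { replies.shift }
fake_runner = lambda do |code|
  raise NameError, "uninitialized constant Missing" if code.start_with?("Missing")
  33
end

result = run_with_feedback("How many users?", model: fake_model, runner: fake_runner)
puts result[:status]
puts result[:answer]
```

Capping the attempts matters: as the ClassDoesNotExist example shows, some questions cannot be fixed by retrying, so the loop needs a way to give up.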

How does it do with complex queries?

At first, it struggled in my development environment and I almost wrote it off as another ML “hello world” demo that quickly fails when you try to take it farther. Then I realized that the default list of models and their attributes in the prompt is likely to be empty (or contain just a small number of columns) per this SO question. After running Rails.application.eager_load! and re-initializing my boxcar I was very impressed. It correctly executed queries like these:

Which org has the most users?
Order the orgs by the number of users in each org and show the orgs with the most users. list the org id, name, and number of users.
How many users were created by month?

How does the ActiveRecord Boxcar compare to Langchain’s SQL Toolkit?

This is a very small sample size, but the ActiveRecord Boxcar provided more accurate results for me than Langchain’s SQLDatabaseChain. For example, SQLDatabaseChain returned a result when I provided an invalid table name, and it returned an incorrect value for the second query above due to a missing join.

TL;DR

If you are a Rubyist, don’t let Langchain’s large following sway you away from trying Boxcars when creating an LLM application. If you’re like me, you’ll enjoy the smaller footprint, fewer abstractions, and a faster timeline to production usage (assuming you already have a deployed Rails app) that Boxcars offers.

Drains, sprinklers, and sidewalk edges: behind the development of Greenzie’s first ML Model

2022-05-22T11:00:00+00:00

Originally published on the Greenzie blog, this covers a project I led to deploy an image segmentation model on autonomous commercial lawnmowers.

A never-ending problem for mobile robotics is funneling a petabyte-dense visual world into just enough megabytes to help the robot act correctly in realtime. One example of this for autonomous mowing: identifying and navigating around small obstacles like sprinklers that could be damaged (or damage the mower).

Our team at Greenzie decided that the best way to identify small obstacles was to develop a custom image segmentation model. Here’s a look at how we’re developing and deploying this model to our fleet of autonomous lawnmowers.

The problem

Greenzie autonomous lawnmowers use a set of stereo cameras to generate a 3D point cloud of their surroundings. We group points into clusters to identify obstacles that the robot needs to avoid. However, it’s not easy to classify an object from a point cloud. For example, below is a side-by-side display of the depth cloud output and the image display from a stereo camera (source):

It’s not possible to identify the object in the depth display as a flower. It’s easy using the photo. In fact, off-the-shelf computer vision models can identify these as flowers. A point cloud cannot.

So, if Greenzie can already navigate around obstacles, why do we need an additional layer in our perception stack? Maybe we’re just trying to add “AI-powered™” stickers to our robots? Well, many obstacles are aligned close enough to the ground plane that they can be difficult to identify from a point cloud. Some of these objects could be damaged if the mower were to travel over them with the blades on. Some could damage the mower. For example, the objects in the image below can be difficult to detect in a 3D point cloud:

Adding the ability for our robots to sense additional objects via an image segmentation model gives us two big wins:

Avoid an additional class of collision events (impacts with small obstacles).
Increase the ROI our customers experience by increasing our robot’s confidence to navigate near small obstacles, reducing manual cleanup work.

Deciding on the ML model type

In the introduction, I quickly jumped to our decision to create an image segmentation model. Let’s take a quick look at the 3 types of models we considered.

1. Object detection

This approach generates a bounding box around identified objects. For example, we could train a model to identify obstacles and their classification much like how this model identifies dogs, bikes, and other objects:

Source

2. Image segmentation

This classifies the category of items at the pixel-level rather than using a bounding box. For example, here’s an image segmentation result showing the sky, grass, and not-grass:

3. Instance segmentation

This combines both of the above approaches, allowing the model to identify individual instances of items in an image at the pixel-level. For example, you could count the number of mulch beds.

Source

Why did we go with image segmentation?

We would like precise contours of “blades on”, “obstacle”, and “sky” regions, which rules out object detection. We don’t need the per-instance identification that an instance segmentation model offers.

With the problem defined and the ML model type chosen, development was ready to begin.

Step 1: pick the deployment platform (Luxonis OAK-D camera)

Our current platform uses a set of Intel Realsense cameras to capture images and depth data. However, we were concerned about the resource usage of running ML models on the host. We just happened to have a convenient mounting location for a Luxonis OAK-D camera. We can deploy the model to the OAK-D, keeping resources open on the computer for our other robotic work.

Step 2: create a training dataset for the image segmentation model

We used an iPhone to photograph areas similar to what our robotic workers mow. This included extensive photos of obstacles that could be within the mowing map and not seen by our depth cloud obstacle logic. These photos were sent to an image annotation service.

There’s no getting around it: assembling and reviewing annotation results is a tedious process. I felt more like a bookkeeper than a developer. That said, it was a valuable introspective process that showed where our annotation guidance was not clear. For example:

Do you label every piece of grass visible behind a chainlink fence as “blades on”? Or the entire fence itself as an obstacle?
Do you label grass visible behind a set of tree branches as “blades on”?
What about small dirt patches within a grass area?

Let’s look at an example of annotation instructions gone wrong. Our initial instructions to the annotation service asked them to classify images into three classes: sky, obstacles, and grass. Look at some of these initial model results in photos from the field:

In the above model inference results, the overlays show how worn-down, high-traffic areas of a soccer field are classified as “not grass”. This is technically correct, but would be annoying when operating the mower. The mower will stop and/or navigate around these areas. It’s OK to have blades on over these areas.

Based on this experience, we updated our instructions changing the “grass” classification to “blades on”. The classification “blades on” now includes dirt patches.

Step 3: deploy the DepthAI Robotics Code

Greenzie’s robotic lawnmowers use the Robot Operating System (ROS). We worked with the Luxonis team to integrate the sensor data from the OAK-D camera into our system via their ROS DepthAI package. To start, we deployed an off-the-shelf model (TinyYOLO4) to verify the end-to-end functionality.

Step 4: training a baseline image segmentation model

We again worked with Luxonis to develop the image segmentation model. Their team selected images for the training dataset, sent the images off for annotation, and trained the first several versions of the model. Once Luxonis completed a couple of ad-hoc training sessions, our team focused on building a reproducible training environment on AWS Deep Learning Images.

Step 5: basic image sampling from the field

Our training dataset was collected via an iPhone, not image data from robots in the field. This is convenient (and there isn’t a lot of mowing in the winter), but it was a risk. It’s not the same environment.

We developed sampling logic that saves an image from the OAK-D camera once per minute. These images are sent to the cloud, where we can analyze the results later.
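The sampling rule itself is simple. Here is a minimal sketch in Ruby (used only to match the code elsewhere on this blog; the production logic lives in the ROS codebase and is not shown here). The `FrameSampler` class and its 60-second default are hypothetical.

```ruby
# Hypothetical sampler: keeps at most one frame per interval.
class FrameSampler
  def initialize(interval_seconds: 60)
    @interval_seconds = interval_seconds
    @last_saved_at = nil
  end

  # `now` is a timestamp in seconds, e.g. from
  # Process.clock_gettime(Process::CLOCK_MONOTONIC).
  # Returns true (and records the time) when the frame should be saved.
  def sample?(now)
    return false if @last_saved_at && (now - @last_saved_at) < @interval_seconds

    @last_saved_at = now
    true
  end
end

# Simulate one frame per second for three minutes.
sampler = FrameSampler.new
saved = (0...180).count { |second| sampler.sample?(second) }
puts saved
```

Passing the timestamp in, instead of reading the clock inside the class, keeps the rule easy to test against simulated frames.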

Step 6: monitoring ML inference results

With image sampling in place, we developed a pipeline to transfer the images to S3 and generate side-by-side comparisons of the robot image with the model inference results. This identifies areas where the model was confused. For example, notice that the model struggled with identifying grass in blurred regions:

Now that mowing season is in full swing, the size of our monitoring dataset will increase by several orders of magnitude, and it comes from the source of truth: the production sensors, running real customer mowing jobs.

With monitoring of production ML inference in place, we’ll move faster refining the model to meet the criteria for acting on the results. This will help us address the rough edges faster. We’ll be cycling through previous steps: evaluating real-world inference results, augmenting our training dataset with new scenarios, and retraining the model.

Step 8: collecting data on false positives and false negatives

Once-per-minute sampling is great, but it comes with a low signal-to-noise ratio. The vast majority of the time there are no obstacles in the mowing area. To collect more fine-grained data, we implemented the following in our ROS codebase:

Collect false positives - save an image when the robot travels over an area that the ML model believes is an obstacle.
Collect false negatives - save an image when our depth-cloud logic identifies an obstacle but the ML model does not.
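The two capture rules above reduce to a small decision function. The sketch below is illustrative only: the `Observation` struct and its fields are hypothetical and do not correspond to real ROS message types.

```ruby
# Hypothetical per-frame observation, not a Greenzie or ROS type.
Observation = Struct.new(:ml_obstacle, :depth_obstacle, :robot_drove_over, keyword_init: true)

# Returns :false_positive, :false_negative, or nil (nothing worth saving).
def capture_reason(obs)
  # The model flagged an obstacle, yet the robot drove over the area.
  return :false_positive if obs.ml_obstacle && obs.robot_drove_over
  # The depth-cloud logic found an obstacle that the model missed.
  return :false_negative if obs.depth_obstacle && !obs.ml_obstacle

  nil
end

obs = Observation.new(ml_obstacle: false, depth_obstacle: true, robot_drove_over: false)
puts capture_reason(obs).inspect
```

Most frames return nil, which is the point: only the disagreements get saved.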

We’re at the beginning of this more nuanced approach to gathering data.

Step 9: setting a threshold for enabling the model

It’s important to define “good enough” model performance or the cycle of retraining would never end. The model won’t be perfect at identifying small obstacles, but neither is a human operator. We’ve established the following criteria that must be met before we enable the robot to act on the model inference results:

Fewer than 1 false positive per-hour of autonomous mowing - a false positive results in frustrating scenarios for operators: missed patches if the mower navigates around an obstacle or a full-stop if the mower is unable to find a path around an obstacle.
70% true positive rate - wait a minute…this means the robot will collide with 3 of every 10 obstacles! How is that acceptable? Well, we’re phasing in the robot’s actions. Initially, the ML model results only augment what we’re already doing via the point cloud-based obstacle logic. Many of these low-lying objects won’t be seen by the point cloud logic, so this means we’re significantly reducing collisions at this threshold.

Once these are met, we’ll set new metrics (or just update these thresholds) for enabling the mower to interact more precisely around the boundary areas between “blades on” and “obstacles” (such as mowing more precisely around a mulch bed).
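As a worked example of the two criteria, here is a minimal sketch that turns raw counts into the rates and checks them against the thresholds. The method name and the sample counts are made up for illustration; only the thresholds (fewer than 1 false positive per hour, at least a 70% true positive rate) come from the text.

```ruby
MAX_FALSE_POSITIVES_PER_HOUR = 1.0
MIN_TRUE_POSITIVE_RATE = 0.70

# Returns true when both criteria are met for the given counts.
def ready_to_enable?(false_positives:, mowing_hours:, true_positives:, false_negatives:)
  fp_per_hour = false_positives / mowing_hours.to_f
  # True positive rate: detected obstacles over all real obstacles.
  tpr = true_positives / (true_positives + false_negatives).to_f
  fp_per_hour < MAX_FALSE_POSITIVES_PER_HOUR && tpr >= MIN_TRUE_POSITIVE_RATE
end

# Illustrative numbers: 12 false positives over 40 hours is 0.3 per hour,
# and 45 of 60 obstacles detected is a 75% true positive rate.
puts ready_to_enable?(false_positives: 12, mowing_hours: 40,
                      true_positives: 45, false_negatives: 15)
```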

What’s next?

We’re beginning the final push: that last 10% of refinement that will take a bit of time to fully enable in production. These are possible areas we could explore next:

A more efficient model refinement flow - collecting images, noting poor classification, sending off new images for labeling, and re-training is a tedious process. We’ll look at ways to make this more enjoyable.
Exploring new areas - we are impressed with how well the model performed with our baseline training images. This gives us a lot of hope for other perception-related problems that could be solved with ML.

Effortlessly Investigating Robot Anomalies at Greenzie

2022-02-25T11:00:00+00:00

Originally published on the Greenzie blog, this covers a project I led to make it easier to debug problems on a fleet of autonomous commercial lawnmowers.

The life of a Greenzie-equipped autonomous mower is 97% boring punctuated by small anomalies. This presents two debugging challenges for our ROS developers:

There’s little need for rich data the vast majority of time, but when there is an anomaly, there’s a thirst for ALL the data.
Our mobile fleet of robotic workers are customer-operated (not managed by robotics technicians), do not dock to a fast Internet connection, and are frequently turned on and off. Getting this data from a robot to a developer’s laptop is a challenge.

Easy access to debugging data is important to us as we’re big believers in Kaizen. It should be easy for developers to continually improve our software, and a big part of that is making it effortless to view debugging data. We’ve recently rolled out an update to our developer tooling that makes obtaining and viewing high-fidelity ROS data silky-smooth. Let’s see how it works.

End-user experience overview

The end user of our anomaly data fetch system is either a ROS developer or a support engineer. Here’s how the system is typically used:

A support engineer reviews a robot job and notices an anomaly. For example: the robot generated an obstacle alert. Robots post alerts to our platform and these are rendered as markers on a satellite map. The user clicks the alert marker for details and the “Get ROS Bag” button.
A new Foxglove marker is added to the map along with the path of the robot over the duration of the ROS bag file.
The ROS bag data is uploaded to the cloud via rsync, then uploaded to Foxglove’s data platform. The end user is notified via email when the upload is complete. The user can click on a link from the marker popup to view the data on Foxglove.
The user views the data with Foxglove Studio. Foxglove Studio can be used in the browser or via their desktop apps. The ROS stack does not need to be installed on the user’s computer.

Technical Details

Delivering the end user experience above involves a couple pieces of custom work and one of our favorite new robotics developer tools, Foxglove. Here’s how it works:

Our custom admin web app running in the cloud creates a database record with the robot identifier, timestamp, and duration of the data fetch.
The app checks every minute if the robot is online. If so, any unsent requests for ROS data are sent to the robot.
The robot marks ROS bag files that were recorded during the requested timeframe for transfer to the cloud. Bag files are split in 30-second increments. An rsync transfer is started via Cron to send the requested ROS bags to the cloud. Cron is used so that partially transferred files will resume upload automatically when the robot is power cycled.
Back in the cloud, the app monitors for completed rsync file transfers. Once completed, files are uploaded to the Foxglove Data Platform via their API. The user that requested the ROS bag data is notified via email of the completed data request. They can then immediately view the data within Foxglove Studio.

Some notes on the technical details:

Keep on-robot logic simple - we try to limit the complexity of utilities that run on our robots in the field. The greatest complexity of this flow resides in the cloud. In the cloud, we can deploy fixes in minutes and easily debug issues (always-on fast internet, always-on servers, no disk space concerns, and a mature ecosystem of monitoring and debugging tools).
Limit resource usage on the robot - there are few cases where the ROS data is needed to debug an immediate problem. Instead, we use the data from these anomalies to prevent future problems. We limit the resource usage of the on-robot parts of this flow in several ways: flock to prevent a thundering herd of rsync processes, limits on rsync transfer size, deriving the ROS bags to collect based on the last modified time of the file rather than loading and inspecting files, etc.
Foxglove - this SaaS fulfills two important needs: (1) storage of ROS bag data (2) effortless viewing of ROS bag data (no ROS stack required). I’m very excited about their vision to continue improving both of these areas.
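The mtime-based bag selection mentioned above can be sketched as an interval-overlap check. This is an illustration, not the on-robot code: it assumes each bag covers the 30 seconds ending at its last-modified time, and `bags_for_request` is a hypothetical name. In practice the timestamps would come from something like `File.mtime(path).to_i`.

```ruby
BAG_DURATION_SECONDS = 30

# `bags` is a list of [path, last_modified_epoch_seconds] pairs. A bag's
# last-modified time is treated as the end of its recording window
# (an assumption of this sketch).
def bags_for_request(bags, requested_start, requested_duration)
  requested_end = requested_start + requested_duration
  overlapping = bags.select do |_path, mtime|
    bag_start = mtime - BAG_DURATION_SECONDS
    # Two intervals overlap when each starts before the other ends.
    bag_start < requested_end && mtime > requested_start
  end
  overlapping.map(&:first)
end

bags = [["a.bag", 30], ["b.bag", 60], ["c.bag", 90], ["d.bag", 120]]
# Request 30 seconds of data starting at t=50 (covers t=50 to t=80).
puts bags_for_request(bags, 50, 30).inspect
```

Because only file metadata is read, the check stays cheap even with many bags on disk.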

Potential for automated anomaly data collection

A developer must manually trigger the ROS data fetch: we’ll always need this ability. However, I’m excited that this flow for fetching data can work in an automated fashion too: these are just basic HTTP APIs. For example:

Anomaly detection in the cloud - we send a limited amount of data to the cloud every minute. This could be analyzed with an anomaly detection algorithm, automatically triggering a data fetch on an anomaly event.
Anomaly detection on the robot - it’s also easy for a ROS Node to trigger the API if it notices an anomaly. For example: trigger a data fetch if a computer vision ML Model appears to be confused.
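As one possible shape for the cloud-side idea, here is a minimal sketch that flags a per-minute sample when it sits far from the recent mean and builds a fetch request with the same fields described earlier (robot identifier, timestamp, duration). Everything here is an assumption for illustration: the z-score rule, the 3.0 threshold, the 60-second window, and the `FetchRequest` struct are not from the actual system.

```ruby
# Hypothetical request shape: robot identifier, timestamp, and duration.
FetchRequest = Struct.new(:robot_id, :timestamp, :duration_seconds, keyword_init: true)

def mean(values)
  values.sum / values.length.to_f
end

def std_dev(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / values.length)
end

# Returns a FetchRequest when `latest` is more than `threshold` standard
# deviations from the mean of `history`, otherwise nil.
def fetch_request_for(robot_id, history, latest, timestamp, threshold: 3.0)
  sigma = std_dev(history)
  return nil if sigma.zero?
  return nil if ((latest - mean(history)) / sigma).abs <= threshold

  # Ask for a 60-second window centered on the anomalous sample.
  FetchRequest.new(robot_id: robot_id, timestamp: timestamp - 30, duration_seconds: 60)
end

history = [10, 11, 9, 10, 10, 11, 9, 10]
request = fetch_request_for("robot-1", history, 25, 1_000)
puts request.inspect
```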

TL;DR

We’re excited to have a lean, “right-sized” stack for easily investigating incidents from the field on our robots. Web developers have leveraged these kinds of easy-to-use tools for years (ex: Sentry, Scout, DataDog, etc). It’s great to see more of these as mobile robots move beyond prototypes to the real world.

html-sketchapp: under-the-hood of an HTML to Sketch export solution

2020-11-15T11:00:00+00:00

Taking the output of html-sketchapp and importing into Sketch.

Designers and developers continue to work in entirely different mediums. As a result, without constant, manual effort to keep them in sync, our code and design assets are constantly drifting further and further apart.

Mark Dalgleish in Sketching in the Browser

Even though Web 2.0 dawned 15 years ago, we still can’t auto-translate HTML to our design tools. A huge number of improvements to the product development flow have made the lives of designers and devs easier, but the common task of going from code to design (and back) remains manual.

One NodeJS package that makes it easier to keep your coded components in sync with your design team is the excellent html-sketchapp. Released in 2018 and inspired by the more limited react-sketchapp, html-sketchapp lets you export HTML to Sketch (with some limitations). In this post, I’ll show how html-sketchapp works and how it can benefit DesignOps at your organization.

Is html-sketchapp for designers?

Despite the designer being the end-user of an HTML to Sketch solution, html-sketchapp is really a tool for developers. html-sketchapp provides the export engine and requires another tool to provide the user interface. I think it’s unlikely a designer will get much value out of downloading the html-sketchapp source code. If you are a designer, I suggest you check out one of the following for a more ready-to-go solution:

html-sketchapp-cli - A tool that allows you to export an HTML doc to Sketch from the command line. Distributed as a NodeJS package.
html-to-sketch-electron - An Electron app that provides a graphical interface, as an alternative to the command-line option provided by html-sketchapp-cli.

html-to-sketch-electron in action.

Note that both of these utilities don’t have a lot of recent commits (and I haven’t tested them). This post focuses more on how developers can use html-sketchapp within their own internal tools to help push coded components to design tools.

How does a developer leverage html-sketchapp?

If you are a developer creating a tool that uses html-sketchapp, your flow might look a bit like this:

Install the NodeJS package within your project: npm i @brainly/html-sketchapp.
Use puppeteer to launch a headless browser session opening a URL of your choice.
Feed the loaded document.body to html-sketchapp’s nodeTreeToSketchPage function.
Save the output file with an *.asketch.json extension.
Within Sketch, install the Almost Sketch to Sketch Plugin.
Use the Almost Sketch to Sketch Plugin to import the *.asketch.json file and create a new Sketch document of the HTML export.

See html-sketchapp-example for a complete example of steps 1-4.

Why can’t html-sketchapp just export a Sketch file?

You may have noticed that in the steps above, we’re able to export a JSON file from html-sketchapp but not an actual Sketch file. We take that file and use a Sketch plugin to take the export over the finish line, converting JSON to the final Sketch file format.

Why is there an extra moving part? At the time html-sketchapp was written, some parts of the Sketch file format stored data as a binary blob (like text styling information). This is not easy to generate from Javascript. Additionally, as the html-sketchapp functions run in the browser, it is limited by CORS and may not be able to access all images on an HTML page. While text information is no longer stored as a binary blob, sadly a Mac is still required in order to use NSAttributedString.

So, the almost-sketch file format remains as cocoascript is still required for the final conversion step. This final step is a bit of a pain: most recently, an issue with an Almost Sketch to Sketch Plugin dependency broke exports. That said, it doesn’t appear there’s anything html-sketchapp can do about the plugin requirement given the need to call NSAttributedString.

How does html-sketchapp generate the *.asketch.json file?

When nodeTreeToSketchPage is called, it creates a Sketch group representation of the node and its child nodes. This group is added to a Sketch page with a width and height set to the same dimensions as the root node.

The meat of the HTML to Sketch translation is in the nodeToSketchLayers function. This function is responsible for taking the style properties of an HTML element and mapping those to Sketch styles. The flow works a bit like this:

Call Window.getComputedStyle() to get an object that contains all of the CSS properties of the HTML element. Remember, html-sketchapp is run within a browser session so we’re able to call functions within the Javascript Web API.
Run a series of checks to determine if any layers should be created for this HTML node:
- Is the HTML node a descendent of a parent SVGElement? If so, don’t create any layers (more on this later).
- Is the node visible? It’s actually pretty complex to detect this, and the logic for determining visibility is in the isNodeVisible function. If the node isn’t visible, don’t create any layers.
Create a rectangle Sketch shape to represent this HTML node.
If the node is an image:
- Set the shape background color to the HTML node’s background color
- If the node is an HTML IMG element, apply an image fill using the url of the image. This image will need to be downloaded and can’t be loaded from the url dynamically (more later).
If the node has a box-shadow generate appropriate Sketch inner and outer shadows.
If the HTML node has borders, apply these using Sketch inner shadows as Sketch does not support side-specific borders.
Apply the opacity.
Create a new Sketch Rectangle shape, applying the HTML node border-radius to each corner. Note that only % values are supported.
If the HTML node has a background image, apply an image fill similar to the earlier step where the node is an actual HTML IMG element. If the background image doesn’t fit entirely within the HTML element, create a Sketch rectangle shape and use an image fill to correctly position the background image.
If the HTML node has a background image that is a linear gradient, apply a Sketch gradient fill.
If the node is an SVG element, create a Sketch SVG layer and generate the SVG path string by walking through the child SVGElement HTML nodes.
If the HTML element text is visible, iterate over the text nodes (see nodeType) and create a Sketch layer for each text node.

You can see what HTML properties are supported in the Sketch conversion on the html-sketchapp wiki.

How does html-sketchapp perform on real-world examples?

The easiest way to try html-sketchapp is to clone html-sketchapp-example, follow the setup instructions, and run the following:

npm run inject YOUR_URL

Below, I’m comparing a screenshot of npr.org with the html-sketchapp output. I’m impressed with the output:

Comparing the output of npr.org using html-sketchapp.

The most significant differences:

Fonts (I don’t have the NPR fonts installed)
Missing images - there are several missing images that are replaced with a red rectangle.

Are there hosted tools that can just do this HTML to Sketch export for me?

In Sketching in the Browser from 2018, Mark Dalgleish mentions a number of tools that are trying to bridge the code-to-design gap. At the end of 2021, only one of those tools appears to actually let you export some form of HTML (React components) into your editor: UXPin*.

Is there a sweet spot for HTML-Sketchapp?

Rather than converting entire HTML documents into Sketch pages, I think the sweet spot for HTML-Sketchapp is turning coded components into Sketch symbols. This eliminates the need for designers to maintain their own design libraries.

There are two relevant examples of this:

html-sketchapp-style-guide - Brainly’s tool for converting their styleguide into *.asketch.json files.
story2sketch - Convert Storybook stories into Sketch symbols.

Summary

HTML-Sketchapp is a library that developers can use to help automate the conversion of HTML into the Sketch file format. It’s a great way to remove the tedious manual work designers do today to maintain their own design library. HTML-Sketchapp works great for converting coded components into Sketch symbols.

* - disclaimer: I'm working at UXPin to help with Merge.

Understanding DesignOps coming from DevOps

2020-08-18T11:00:00+00:00

As a developer who lived through the advent of DevOps and the squabbling over its meaning, I experienced a bit of déjà vu recently when I stumbled across DesignOps. This post is the Cliff Notes™ version of my research into DesignOps.

Before I begin, here’s the Wikipedia definition of DevOps. I think it is commonly accepted today:

DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). It aims to shorten the systems development life cycle and provide continuous delivery with high software quality.

This can be shortened to “get shit done” or “release often”. DevOps encourages small, incremental releases that are easier to rollback and debug than the massive quarterly release cycles of yore.

What are common definitions of DesignOps?

Abstract says:

DesignOps is a dedicated person or team in an organization that focuses solely on enabling the design team to work as well as it possibly can.

Collin Whitehead, Head of Brand at Dropbox says:

“The job of the DesignOps team is to protect the time and headspace of everyone within the design organization – the designers, writers, researchers, and so on – which allows everyone to focus on their respective craft”.

Atlassian defines DesignOps as:

“Putting the appropriate tools, instrumentation and processes in place so that we get to ‘learn’ as quickly as possible.”

Adrian Cleave, Director of DesignOps @ AirBnB says this about their DesignOps team:

Our mission is to provide agility to the whole product organization through centralized tools, systems and services that enhance speed and quality of execution.

Finally, Almitra Inocenci says:

It’s a division of people and tasks pertaining to the planning, management, and execution of responsibilities and design process in order to get shit done, whatever the task-at-hand may be, particularly in a design organization.

My take: DevOps has a definition focused on releasing faster. The definitions of DesignOps tend to be more broad and abstract which makes the term harder to understand. I think part of the reason DevOps is more focused is because dev teams already had a term (agile development) that covered the creation portion of releasing software. DevOps depends on this style of development. I don’t see the same division for design, so many DesignOps definitions cover everything from the earliest stages (planning) all the way to release.

What is the origin story of DesignOps?

While it was likely practiced without a name for a while, AirBnB is one of the first brands to discuss the term starting around 2015.

Why DesignOps now?

According to the Design Value Index Study, the returns of design-centric public companies are more than 200% greater than the S&P 500. That’s a damn good reason to prioritize product design. With the increased focus on design, the ratio of developers to designers within an organization has been narrowing. For example, IBM’s developer-to-designer ratio target has changed from 72:1 to 8:1. This also doesn’t include the increased number of frontend developers that focus on implementing user experiences. It’s safe to say the ratio of backend devs to those that touch the user experience is closer than ever before.

More people and faster releases require more organization and processes, hence the growing importance of DesignOps.

See Sonja Krogius’ post on Why DesignOps? Why now? for a detailed look at the growth of DesignOps.

What are some of the tools AirBnB has developed to support DesignOps?

It’s helpful to look at the tooling AirBnB has developed to increase the speed of their design process. They are perhaps the earliest DesignOps evangelist and thus have fairly polished tools. The most significant open source tools AirBnB has shared are:

Lona - A tool for defining design systems and using them to generate cross-platform UI code, Sketch files, and other artifacts.
react-sketchapp - render React components to Sketch.

Both of these tools are focused on the chasm that causes the most friction between designs and their release: translating a visual design to code (and back).

Lona’s background doc explains how a Sketch-built design system requires manual translation to code for each platform AirBnB supports (web, iOS, Android, and React Native). This is “time consuming and error prone”. Lona encodes all of the detail needed to accurately translate from design to code.

react-sketchapp is different than other tools that cross the design-code chasm. Most tools try to go from design to code while react-sketchapp goes in the opposite direction. By working backwards from the source of truth (the design of the deployed app), design systems are able to stay in sync.

DesignOps should focus on the design-code chasm

“We’re investing in code as a design tool. Moving closer to working with assets that don’t only include layout and design, but also logic and data. This helps bridge the gap between engineers and designers, thus reducing the need for design specs–or redlines–and the steps between vision and reality”

-Alex Schleifer, head of design at AirBnB

Designers and developers have a chasm to cross: translating design to code (and back). In DevOps, it became possible to release high-quality software faster (the goal of DevOps) less because of a change in mindset and more because of dramatic enhancements in version control (Git), code review tools (GitHub), continuous integration products that automatically run tests, automated code quality products, and error monitoring. If what I saw in DevOps holds true for DesignOps, tools that reduce design/code friction likely outweigh the more abstract, softer side of DesignOps definitions.

Storybook - a beautiful library for your web components

2020-08-16T11:00:00+00:00

Like a set of Lego bricks, a web app’s UI is composed of individual components. Many of these components also have multiple states. For example, a navigation header may have multiple states:

Displaying the avatar of a user if logged in and a “sign in” link if not
Adding a banner if a free trial is coming to a close shortly
Adding a notice if parts of the service are not working correctly
Displaying additional features for admin users

To verify each state looks acceptable, I need to start a local version of the app, load a page in my browser, and override values (e.g., set admin = true, adjust the signup date of an account, etc.). The reality? I rarely test each state when making a change. It’s painful and awkward (and I’m a bit lazy). Enter Storybook, an open source UI component explorer that lets you develop UI components in isolation.

Here’s a look at my initial experience using Storybook.

Who is Storybook for?

Storybook is designed for frontend developers that are already using a JavaScript framework to create components and CSS to style them. However, there are several secondary users:

Designers - verify the design of UI components and their states are true to your designs.
Project Managers - quickly QA UI component changes without loading the entire web app locally.
Backend developers & ops - alleviate the need for frontend devs, designers, and project managers to keep a full local dev stack updated.

There’s not much of a learning curve when getting started with Storybook as you continue to develop components in your editor. You just view them within Storybook. I like that Storybook doesn’t try to own the editing experience. We’re all opinionated about our editors.

Setting up Storybook

Storybook is installed inside an existing application. In my test of Storybook, I used create-react-app to set up a simple React app (see a detailed React + Storybook tutorial for more info) then installed Storybook via npx -p @storybook/cli sb init. In addition to React, Storybook supports Vue, Angular, Ember, and more frameworks. However, the official docs and tutorials appear to be more extensive for React than other frameworks.

Once installed, you start the local Storybook server via yarn storybook. This command opens a browser tab at http://localhost:6006 showing a number of example React components and their associated stories (more on stories shortly). Checking out these examples is a good place to start. The Storybook app is well-designed with a clean esthetic.

Storybook has an addon system to extend its functionality. Starting with version 6.0.0, essential addons are pre-installed and I found the app to be immediately usable from the start.

Storybook doesn’t magically import your existing React components. However, it’s not a huge amount of work to have a component appear in Storybook. This is also a process you can perform incrementally, adding your most important components first.

Creating stories

When I first started Storybook, I expected to see my custom components immediately. This didn’t happen and I was disappointed. However, I realized this didn’t make much sense as most components require default parameter values to appear. For example, my alert component needs some text to render:

// src/components/Alert.js
import React from 'react';

export default function Alert({text, alertType}) {
  return (
    <div className={`alert ${alertType}`}>
      {text}
    </div>
  )
}

To view a component within Storybook you need to create at least one story. A story describes an interesting state of a component. For my alert box, I decided to start with Default and Error stories.

Importing a React component into Storybook

Importing a component into Storybook requires just two steps:

Create a src/components/[COMPONENT].stories.js file with at least one story. Stories follow the Component Story Format.
Restart the storybook server. Restarting is only required when adding a component, not updating an existing component.

My Alert stories file:

// src/components/Alert.stories.js
import React from 'react';
import Alert from './Alert';

export default {
  component: Alert,
  title: 'Alert',
  argTypes: {
    text: {
      description: "The text to display within the alert box",
      type: { name: 'string', required: true },
      defaultValue: "This is the alert text."
    },
  },
};

const Template = (args) => <Alert {...args} />;

export const Default = Template.bind({});

export const Error = Template.bind({});
Error.args = {
  alertType: 'error'
}

I like the separation of stories from the component. Storybook doesn’t force you to modify your component code. The additional functionality provided by Storybook is isolated to stories files. See Storybook’s Story docs for more information on creating stories. Additionally, no Storybook-specific libraries are required when creating a stories file.
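To see why each story starts from Template.bind({}), here is a framework-free sketch. It rests on a few assumptions that are not part of the stories file above: Alert is replaced by a stand-in that returns an HTML string rather than JSX (so the snippet runs in plain Node without React or Storybook), and the names ErrorStory and render are hypothetical helpers for illustration only.

```javascript
// Stand-in for the React component: returns a string instead of JSX so the
// sketch runs in plain Node. This is NOT how Storybook renders components.
function Alert({ text, alertType = '' }) {
  return `<div class="alert ${alertType}">${text}</div>`;
}

// Same shape as the Template in the stories file: args in, rendered output out.
const Template = (args) => Alert(args);

// bind({}) returns a new function each time, so every story is a separate
// object that can carry its own `args` without affecting the others.
const Default = Template.bind({});
Default.args = { text: 'This is the alert text.' };

// Named ErrorStory here (hypothetical) to avoid shadowing the global Error.
const ErrorStory = Template.bind({});
ErrorStory.args = { ...Default.args, alertType: 'error' };

// Hypothetical helper that mimics calling a story with its own args.
const render = (story) => story(story.args);

console.log(render(Default));
console.log(render(ErrorStory));
```

The point of the sketch is only that stories are plain functions with an args property attached, which is why adding another state is little more than one more bind call and a small args object.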

Autogenerating docs

I love great docs. One feature I love about Storybook is its ability to autogenerate docs from source code comments. This creates a single source of truth for docs.

Here are the docs for the Alert component:

Wait there’s more

Storybook has a robust addon ecosystem that helps you build out an automated DesignOps process. From importing dynamic data to visual testing, there are many ways to integrate Storybook into your existing tools.

TL;DR

Storybook is a polished, well-designed UI component browser that can be extended in many ways. Easy to install and incrementally integrate into your existing apps, Storybook provides a clean separation of Storybook-specific functionality and your existing components.

A quick look at Rookout, a real-time debugging & logging product

2020-08-05T11:00:00+00:00

The great thing about defining a new category is that there is no competition. The bad part is it can be difficult to explain what problem your product solves and how it fits alongside other tools. Rookout, described as “Rapid Debugging. Frictionless Logging” falls into this bucket. Recently I spent a couple hours digging into Rookout to better understand its place in an engineering team’s tool chest.

If you ask me to define a debugger, I go straight to an interactive debugger. For example, searching for Rails debugger brings up pry and byebug. These are both interactive debuggers where the application pauses at breakpoints and allows you to step through the execution (even running arbitrary code). This is not what Rookout does. Rookout does not pause a live application. I think a better description is that Rookout lets developers apply temporary logging to a live application without having to deploy a new version. Let me explain.

Let’s say your exception monitoring service (say Sentry) is reporting an elevated error rate and your APM product (say ScoutAPM) is showing a rapid increase in response times. You inspect the exception details and transaction traces but do not see an obvious culprit. Next step? Adding some additional log lines.

If you aren’t using Rookout, you typically do this by creating a new git branch, adding the new log lines, and deploying. This deployment cycle can be lengthy on large, critical applications. Additionally, it’s rare to capture what you need in the first commit+deploy cycle. This means more deploys, more logging, and a slow feedback cycle. Instead, what if you could tell your live, production app in real-time to begin logging additional information at a given file and line number? That’s what Rookout can do.

I took Rookout for a spin using their demo Python Django application. After following their setup instructions (and with some fast help from their support team) here’s how it works:

Open the “sources” tab. Navigate to the file and line number where you wish to add a breakpoint.
Click in the gutter next to the line number.
Wait a couple seconds for the confirmation that the breakpoint is active.

When the breakpoint is triggered, a new message will appear in the Rookout UI:

Click the message row for more details (like the local variables, stacktrace, and more). You can also define a custom log message and reference variables and their properties. For example, I modified the log message to print the todo description:

It’s likely you are using another destination for your logging output. From Slack to ElasticSearch to DataDog, Rookout can send messages to many targets.

A significant fuzzy area for me is how likely a breakpoint will persist across git commits and branches. What if you modify a file that contains existing breakpoints - do the breakpoints disappear on the next commit? Rookout support says they make a best-effort to preserve breakpoints by hashing each LOC referenced by a breakpoint, but that it isn’t always possible. Because of that, it seems safer to view Rookout as a temporary logger when actively debugging a problem. As a product, they would be in a better position if breakpoints were guaranteed to persist. Then, you could depend on Rookout to log everything. I can see this being a hard problem to solve.
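To make the "hash each line" idea concrete, here is a rough sketch of how a best-effort breakpoint lookup could work. Everything in it is an assumption for illustration: the function names (hashLine, createBreakpoint, relocateBreakpoint), the FNV-1a-style hash, and the sample file contents are hypothetical and are not Rookout's actual algorithm or API.

```javascript
// Hypothetical sketch: NOT Rookout's actual algorithm or API.
// FNV-1a-style 32-bit hash of a line, ignoring leading/trailing whitespace.
function hashLine(line) {
  let h = 0x811c9dc5;
  for (const ch of line.trim()) {
    h ^= ch.codePointAt(0);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Record a breakpoint as the hash of the line it points at (1-based number).
function createBreakpoint(lines, lineNumber) {
  return { lineNumber, hash: hashLine(lines[lineNumber - 1]) };
}

// After the file changes, look for the line whose hash still matches.
// Returns the new 1-based line number, or null if the line was edited,
// removed, or is now ambiguous (more than one match).
function relocateBreakpoint(breakpoint, newLines) {
  const matches = [];
  newLines.forEach((line, i) => {
    if (hashLine(line) === breakpoint.hash) matches.push(i + 1);
  });
  return matches.length === 1 ? matches[0] : null;
}

const before = ['def show(todo):', '    desc = todo.description', '    return desc'];
const bp = createBreakpoint(before, 2);

const afterInsert = ['import logging', ...before];
const afterEdit = ['def show(todo):', '    desc = todo.title', '    return desc'];

console.log(relocateBreakpoint(bp, afterInsert)); // line moved down by one
console.log(relocateBreakpoint(bp, afterEdit));   // line itself was rewritten
```

In this sketch the breakpoint survives when code is inserted above it, but is lost as soon as the referenced line itself changes. That is consistent with the "best effort, not guaranteed" behavior described above.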

In summary, Rookout is an interesting way to increase logging in a production app without requiring new deploys to add calls to a logger. I think it’s most valuable for teams that have larger apps where the deploy cycle is longer.

Disclaimer: I have not tested Rookout on a live, significant, production application.

Zwift + Wahoo Kickr Core Review

2020-08-03T11:00:00+00:00

As an almost 40 year-old raised playing competitive team sports, I still itch to take the field. That said, lots of things can make scheduling this time difficult: family commitments, work, and an aging body are just a few of the excuses. That’s why I’ve found group road cycling rides here in Fort Collins, Colorado to be an almost-perfect solution for my middle-aged competitive needs.

In my cycling bubble we have standing hard group rides during lunch on Tuesday and Thursday, Wednesday evening (two options), and Saturday morning. There’s no need to coordinate ad-hoc times with friends and the low-impact nature of cycling makes it a lot easier to do a couple of these hard rides each week.

However, you know what’s not fun about group rides? Hanging off the back, struggling for air. That was my default mode of operation during the spring and early summer as I usually spent the colder months off the bike. To make these rides consistently more fun, I decided to dive into a smart trainer setup. Here’s an overview of my setup, a review of the key parts, and an answer to the key question: does work on a smart trainer translate to real-world group rides?

My Indoor Training Supplies

Zwift - multiplayer online cycling training program ($14.99/mo)
Wahoo Kickr Core - smart bike trainer ($899.99)
Strava - training activity tracker ($60/yr)
MacBook Pro (2014)
Fan
Wahoo USB ANT+ Dongle & Extension Cable Kit - provides better connectivity than the Mac’s bluetooth ($39.99)
Wahoo TICKR - Heart rate monitor chest strap ($49.99)
Floor mat - collect the sweat
Wahoo Kickr Snap Wheel Block - keep the bike level ($19.99)

Some notes on my equipment:

Initially I used an optical heart rate monitor on my arm, the Wahoo TICKR FIT. However, the numbers didn’t pass the eye test: my heart rate would drop off a cliff at the end of a workout (when I’d expect it to be a bit higher than the beginning) and in general the numbers were always lower than I’d expect. The chest strap (Wahoo TICKR) reports the numbers I’d expect.
I started by connecting my devices (the Wahoo Kickr Core and Wahoo TICKR) to my computer via bluetooth. However, I experienced intermittent connectivity issues with the bluetooth connection. Connecting by ANT+ via the Wahoo USB ANT+ Dongle resolved all of my bluetooth dropouts.
A good fan is critical. I didn’t use one as the weather turned warmer and I’d drop five pounds of water in an hour workout. My workouts were pretty terrible. Adding a solid fan left me feeling much better.

Why try indoor training now versus ten years ago?

Indoor training for cycling has evolved considerably over the years making the act of riding your bike without moving more rewarding. I’d describe the evolution as three eras:

The Rocky V Era - just throw your bike on a set of rollers or basic trainer, turn on your favorite cycling video, and ride. Perhaps you use a heart rate monitor, but beyond that, you’re riding natural.
The Smart Solo Era - Connect your smart trainer to a computer so it can control the resistance and generate tailored workouts. Apps like Trainer Road ($19.99/mo) and The Sufferfest ($14.99/mo) provide this.
The Multi-Player Era - Zwift enters the scene, providing virtual worlds where cyclists use their smart trainers to race against each other in realtime.

Tangentially, Peloton (from $2,245 or $58/mo for 39 mos) emerged within the Multi-Player Era of indoor cycling training as well. Like Zwift, Peloton provides a real-time group environment. Peloton is targeted at in-person cycling class participants (think Soul Cycle) rather than avid outdoor cyclists.

Why has indoor training evolved this way? It’s all about motivation. In the Rocky V Era, it was up to you to define workouts and get your heart rate into the proper zone. That’s easy to let up on. The Smart Solo Era made it easier to stay motivated by setting power and interval lengths for you, and you see your metrics (power, heart rate, cadence, etc) following a workout. It’s motivating to hit the workout’s prescribed numbers. Finally, the Multi-Player Era taps into our innate competitive instinct: it’s hard to resist chasing a rider up the road even when they are just a collection of sprites on the screen. It’s easier than ever before to stay motivated on a trainer.

Setting up the Wahoo Kickr Core and Zwift

Connecting all of these parts isn’t too bad:

Remove the rear wheel of the bike. Connect the bike to the cassette on the Wahoo Kickr Core. You’ll need to purchase and attach your own cassette to the Kickr. I had the bike shop where I purchased the trainer do this for me.
Turn on your computer, power up Zwift.
Connect the USB ANT+ dongle.

Zwift should automatically connect to the trainer and heart rate monitor.

Zwift riding options

There are several ways to use Zwift. Here are the primary ways:

Free Ride - Just pick a world and ride. You’ll be in the virtual world with other riders across the globe. Zwift adjusts the trainer resistance as inclines increase on the virtual road and drops it down when you’re descending.
Workouts - Pick from many individual workouts, organized by their total time. Workout difficulty is based on power zones determined from your FTP.
Training Programs - In addition to individual workouts, Zwift also provides collections of workouts called training programs (ex: 8 week race prep).
Virtual Race - Race against others in a virtual world. Riders are grouped in one of 4 categories based on their FTP (you pick the category). There are typically multiple races every hour.
Group Training - Similar to a virtual race, you can also do a group training ride. There are two primary types of training rides: hour plus rides that attempt to ride at a constant watts/kg and interval rides.

I use the individual workouts and virtual races. I spend my soul riding time outdoors, so I don’t use the free ride option. I find it hard to stick to a long training program with many varied workouts as it is hard to gauge my progress when the workouts change so frequently. Finally, the group training felt like a bit of a mess. Because riding in a group is motivating - even in a virtual world - many folks end up pushing the pace and it ends up resembling a race anyway.

How does riding on a trainer feel versus the road?

While riding your bike on a stationary trainer still feels a lot like riding the same bike on the road, I’ve noticed some differences:

Keep Pedaling - Unlike riding on the road, your pedals are always moving when using a smart trainer. Other riders don’t coast in Zwift and when using ERG Mode in a workout (the default) the trainer sets the resistance for you. If you stop pedaling, it’s a bear to get the pedals moving again. I don’t think this behavior is a negative as it makes my trainer rides more efficient than training on the road. An hour on the trainer feels like a two-hour ride on real roads.
Harder to move from a seated to standing position - It feels like considerably more work shifting to a standing position than it does on the road. I believe this is because there are no inclines and you need to keep the pedals moving when in ERG mode.
Less shifting - I never use my small chain ring on the trainer and only shift between a couple positions on the cassette. Shifting is not as smooth as in the real world.

Because of the constant pedaling and less dynamic positioning on the bike, initially I was more sore than on an outdoor session. This got better over time.

My basic training program

I believe in consistency more than sticking to a workout program I don’t want to do (I’m not getting paid to ride my bike). I enjoy listening to the Velonews Fast Talk podcast, which regularly covers training. There were three theories that hit home for me:

Short sessions in the winter - On one episode, a coach from Toronto advocated for keeping indoor training sessions short (an hour or less) versus the traditional large endurance block in the off season. I live in a cold weather climate and long, cold endurance rides sound very demotivating. So does doing long trainer rides in my garage. Why not push these long rides to the warmer spring months?
Stick to a few basic workouts - Zwift has multi-week training programs where each workout is different than the last. On one episode, a coach advocated for keeping just a few workouts in rotation. This makes it far easier to track your progress. For me, I stick with 2x20’ and 5x5’ intervals as they each stress my fitness in different ways and work well on the trainer. I’ve found short intervals (say 30 seconds or less) to be really awkward on a trainer with ERG mode. Rotating just two workouts would get very repetitive, so I mix it up with some sweet spot rides and Zwift races.
Polarized training - It’s very hard to ride slow and do a proper recovery ride in the real world. With a smart trainer on ERG mode, this is easy. After an interval day I’ll do a recovery ride on the trainer.

A typical winter week might look like this for me:

MON - 50 minutes Sweet Spot
TUE - Zwift Race
WED - Recovery ride/jog & weights
THU - 2x20 intervals
FRI - Recovery ride/jog & weights
SAT - 5x5 intervals
SUN - Rest or Recovery ride/jog & weights

I’ll also substitute a Zwift race for a 2x20 interval (more on this below). If I feel like a big endurance day I’ll try to get out for a backcountry ski versus a cold day on the bike.

Zwift racing

Zwift’s bread-and-butter is the virtual racing environment. Zwift has several virtual worlds (three based on real-world locations and one fictional) that host races. There are typically races every hour of the day with varying distances (most take between 40 minutes and 1 hour 20 minutes) and types of courses (flat, hilly, long climbs). When you sign up for a race, you also pick your category (from A-D with A being an FTP of 4.0 w/kg or greater).

I was pleasantly surprised to find these races to be very motivating. Just like the real world, you ride in a pack and if you are on the front, you need to do more work. You can soft pedal to drop off the front. There are race dynamics: you can tell that at the end of the race other riders are trying to stay off the front while keeping the pace going. On inclines it’s all about your w/kg (factors like wind don’t come into play). There are also several rough sections (dirt, cobbles) where pure power wins.

Zwift races start very, very hard. After a couple of minutes, they settle into a pace. I’d classify races as an hour-long FTP session if you are going hard. Because of this feel, I’ll substitute a race for a 2x20 session if I feel inclined.

Just like the real world of sports, it’s not a level playing field in races. Since w/kg is vital for any race that includes climbs, it’s common to see riders stretching the truth on their actual weight. I’ve seen this by comparing race results with linked Strava profiles - somehow a rider who easily looks 180 lbs in a photo drops 40 lbs in Zwift.
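To put a number on how much an understated weight helps, here is a quick sketch of the w/kg arithmetic. The 180 lbs and 40 lbs figures come from the example above; the 250 watt power value and the function name wattsPerKg are hypothetical, picked only to illustrate the calculation.

```javascript
const KG_PER_LB = 0.45359237;

// Watts per kilogram for a rider, given power in watts and weight in pounds.
function wattsPerKg(watts, weightLbs) {
  return watts / (weightLbs * KG_PER_LB);
}

const power = 250; // hypothetical sustained power, not a figure from the post
const honest = wattsPerKg(power, 180); // rider at their real weight
const shaved = wattsPerKg(power, 140); // same rider "dropping" 40 lbs in Zwift

console.log(honest.toFixed(2), shaved.toFixed(2));
console.log(`${((shaved / honest - 1) * 100).toFixed(0)}% higher w/kg`);
```

Because power cancels out of the ratio, the boost depends only on the weights: 180 / 140 is about 1.29, so the same legs look roughly 29% stronger on every climb.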

Another factor that Zwift doesn’t account for is that your FTP drops as altitude increases. I ride at about 5k feet here in Colorado, which means my FTP is about 5.6% lower than at sea level. If my FTP is 300 watts here, my FTP at sea level would be nearly 318 watts. That’s a pretty big difference in a race.

I think the best mental strategy when doing a Zwift race is to focus less on your place and more on your numbers. There’s always a group to ride with and compete against. It feels the same whether or not they are truthful about their weight, and whatever altitude they live at.
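The altitude adjustment is simple arithmetic, sketched below. The 5.6% loss and the 300 watt FTP are the figures from the paragraph above; treating the loss as a fixed fraction of sea-level FTP, and the function names, are assumptions made for illustration.

```javascript
// Fraction of sea-level FTP lost at about 5k feet (figure quoted in the post).
const ALTITUDE_LOSS = 0.056;

// FTP measured at altitude -> estimated sea-level FTP.
// Assumes FTP at altitude = sea-level FTP * (1 - loss).
function seaLevelFtp(ftpAtAltitude, loss = ALTITUDE_LOSS) {
  return ftpAtAltitude / (1 - loss);
}

// Inverse: sea-level FTP -> expected FTP at altitude.
function altitudeFtp(ftpAtSeaLevel, loss = ALTITUDE_LOSS) {
  return ftpAtSeaLevel * (1 - loss);
}

const estimate = seaLevelFtp(300);
console.log(estimate.toFixed(1));                 // estimated sea-level FTP in watts
console.log(altitudeFtp(estimate).toFixed(1));    // round-trips back to 300.0
```

Dividing by (1 - 0.056) gives about 317.8 watts. Multiplying 300 by 1.056 instead gives 316.8 watts, so either way the gap is around 17 to 18 watts.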

Does it translate to group rides?

The reason I considered riding stationary, staring at my dusty garage wall, is to enjoy group rides for a greater portion of the year. While these COVID times prematurely aborted the group riding season, I felt better in the handful of group rides I was able to do in the spring of 2020. The one area that is difficult to replicate on a trainer is the short, high-power surges that occur on a group ride. I struggled with these initially but in a few weeks had adjusted to them. I’m OK with that - just don’t be disappointed if you notice the same problem.

Do I use an indoor trainer in the summer?

Surprisingly, I’ve found myself continuing to ride the trainer during the warm weather months. I’ve done this for three reasons:

Better for hard training rides - An interval ride is just as hard indoors or outside. When I do these indoors, I don’t have to worry about traffic, flats, or other issues.
Better for recovery rides - A smart trainer showed me how slow I really need to ride to do a recovery ride. I realized I likely never performed a proper recovery ride in the wild.
Outdoors is for the soul - Since my structured time is indoors, this frees my soul to ride outside by feel. Indoors is for structure, outdoors is for freedom.

Summary

Using the Wahoo Kickr and Zwift helped me build my FTP (or at least keep it reasonable) at the start of the group riding season in the spring without suffering through long cold weather rides. The virtual world of Zwift is almost as motivating as the real world and helped me ride consistently through the winter. I continue to use the indoor trainer during summer for short, high-intensity workouts and dedicate outdoor time to riding by feel.
