Chat with Anything | Issue #86

Creating LLM-powered chatbots on any and all kinds of data with embedchain

Sep 06, 2023

Hi there!

Remember last week when we ventured into the fascinating world of Private GPT models using langchain? Well, this week, let’s dive deep into something even more intriguing: Embedchain. For those looking to get a chatbot up and running in a jiffy, this is your magic potion!

🚀 Introducing Embedchain

I've stumbled upon a nifty framework called Embedchain, which builds on langchain and several other toolchains. (Chains are getting popular. ⛓️) In essence, it's a tool that simplifies creating LLM (Large Language Model) powered chatbots over any dataset. Think of it as a bridge that connects vast oceans of data to the precise answers you seek.

🧌 What makes Embedchain so cool?

Embedchain takes the tediousness out of the process. It handles data loading, chunking, embedding creation, and storage in a vector database. Add your dataset using the .add method, ask your questions with .query, and voila! You've got your answers. For instance, if I want a chatbot that knows everything about Naval Ravikant, from his YouTube videos to his books, I can simply add the respective links, and Embedchain crafts the bot for me. 🧙
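To make the add-then-query flow concrete, here is a minimal sketch of the calling pattern. It only assumes an object with .add and .query methods, as described above; the ChatApp protocol and the ask_over_sources helper are hypothetical names made up for this illustration and are not part of Embedchain.

```python
from typing import Iterable, Protocol


class ChatApp(Protocol):
    """Hypothetical interface: anything exposing .add and .query."""

    def add(self, source: str) -> object: ...

    def query(self, question: str) -> str: ...


def ask_over_sources(app: ChatApp, sources: Iterable[str], question: str) -> str:
    """Feed every source to the app, then ask a single question."""
    for source in sources:
        app.add(source)  # the app handles loading, chunking, embedding, storage
    return app.query(question)
```

With Embedchain installed and an OpenAI API key configured, passing an App() instance from the embedchain package as app should follow the same pattern, though exact method signatures may differ between versions.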

🔍 Deep Dive into How It Works

Behind the scenes, Embedchain runs a multi-step pipeline to find you the best answer. The first four steps happen when you add data, and the last three run each time you make a query:

Detect and load the data.
Chunk it into meaningful pieces.
Create embeddings for each chunk.
Store these chunks in a vector database.
When queried, it creates an embedding for your query.
Finds similar documents for the query from the vector database.
Passes these documents as context to the LLM to get the final answer.
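The steps above can be sketched in a few lines of plain Python. This is a toy illustration, not Embedchain's implementation: the bag-of-words "embedding" stands in for a real embedding model, the in-memory list stands in for a vector database, and every name here (chunk_text, embed, TinyVectorStore, build_prompt) is made up for the example. The final LLM call is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Step 2: split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Steps 3 and 5: toy embedding, a word-count vector (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Step 4: an in-memory stand-in for a vector database."""

    def __init__(self):
        self.rows = []  # list of (embedding, chunk) pairs

    def add(self, text):
        for chunk in chunk_text(text):
            self.rows.append((embed(chunk), chunk))

    def search(self, query, k=2):
        """Steps 5 and 6: embed the query and return the k most similar chunks."""
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, context_chunks):
    """Step 7: assemble the context and question that would be passed to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

A real setup would swap embed for a proper embedding model and TinyVectorStore for an actual vector database, but the shape of the flow stays the same.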

It uses the retrieval-based technique (often called retrieval-augmented generation) that I talked about previously. Embedchain automates most of these steps, which is what makes it so cool.

📋 Supported Data Formats

One of the highlights of Embedchain is its versatility in data formats. Whether it's a YouTube video, a PDF, a web page, or even a Notion page, Embedchain's got you covered.

It even automatically detects the data type based on the source argument, saving you the guesswork.

Here's a quick glimpse of the formats it supports:

YouTube Video: Just pass the video URL.
PDF File: URLs or direct paths, but no password-protected PDFs.
Web Page: Pretty straightforward, give it a URL.
Sitemap: Use this to add all web pages from an XML sitemap.
Doc File: Both .doc and .docx formats are supported.
CSV: Incorporates headers, so data is contextual.
Code Documentation: Want to embed a whole documentation site? No problem.
Notion: Just ensure you've got the dependencies installed and you’re good to go.
Local Data Types: From plain text to QnA pairs, Embedchain handles it.
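For a feel of how detecting the data type from the source argument might work, here is a rough sketch. The rules and the type labels below are assumptions made for illustration; Embedchain's real detection logic and label names may well differ.

```python
from urllib.parse import urlparse

# Hypothetical suffix-to-label table; order matters (".docx" before ".doc").
SUFFIX_TYPES = [
    (".pdf", "pdf_file"),
    ("sitemap.xml", "sitemap"),
    (".docx", "docx"),
    (".doc", "docx"),
    (".csv", "csv"),
]


def detect_data_type(source):
    """Guess a data type label from a URL, a file path, or raw text."""
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https")
    host = parsed.netloc.lower() if is_url else ""

    # Host-based rules first: some sources are recognisable by domain alone.
    if host.endswith("youtube.com") or host.endswith("youtu.be"):
        return "youtube_video"
    if host.endswith("notion.so"):
        return "notion"

    # Then fall back to the file suffix of the URL path or local path.
    target = (parsed.path if is_url else source).lower()
    for suffix, label in SUFFIX_TYPES:
        if target.endswith(suffix):
            return label

    # Any other URL is treated as a web page, anything else as plain text.
    return "web_page" if is_url else "text"
```

A toy like this will misclassify edge cases (for example, plain text that happens to end in ".pdf"), which is one reason an explicit override is handy when auto-detection guesses wrong.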

🎭 Different Flavours of Apps

Embedchain offers different app types, each catering to different needs.

App: The vanilla version using OpenAI's model.
Llama2App: Uses a Llama 2 model served through Replicate.
OpenSourceApp: Completely open-source, using Sentence Transformers for embeddings and gpt4all for answers.
CustomApp: For the tinkerers who love customization.
PersonApp: Craft bots for specific personalities.

🔧 Working Magic with Embedchain

The best way to understand something is to see it in action. Here are several examples from their gallery:

Harinbot: bot trained on personal messages to respond to texts
Databutton: create chatbot over any website (BYOK: Bring Your Own Key)
Jobo: Automated job applications
Chatdocs: Chat with all docs via UI
Chatbot trained on 1000+ videos of Esther Hicks, the co-author behind the famous book The Secret

If you have an idea and want to try, Replit has a simple template to get you started.

Closing remarks…

Embedchain is a game-changer for anyone looking to swiftly and efficiently build chatbots over diverse datasets. The ease with which it lets you tap into vast information sources and retrieve precise answers is really neat. The biggest limitation I found is the lack of good support for conversations. It can currently handle only up to five exchanges, but I’m sure it’ll get better over time.
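To picture what a five-exchange limit means in practice, here is a minimal sketch that assumes a simple sliding window over the conversation. How Embedchain actually enforces its limit is not covered here, and ChatMemory is a made-up name for the example.

```python
from collections import deque


class ChatMemory:
    """Keep only the most recent question/answer pairs (sliding window)."""

    def __init__(self, max_exchanges=5):
        # A deque with maxlen silently drops the oldest item when full.
        self.exchanges = deque(maxlen=max_exchanges)

    def record(self, question, answer):
        self.exchanges.append((question, answer))

    def as_context(self):
        """Render the remembered exchanges as text to prepend to the next prompt."""
        return "\n".join(f"User: {q}\nBot: {a}" for q, a in self.exchanges)
```

Once a sixth exchange is recorded, the first one falls out of the window, so the bot can no longer refer back to it.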

Give it a whirl, and as always, if you hit any roadblocks, drop those comments below!

Until next time!

— Harsh

Next — Today I Learnt About Data Science
