Khoj: An AI personal assistant for your digital brain (khoj.dev)
155 points by activatedgeek on July 8, 2023 | hide | past | favorite | 92 comments


- Seen a few of these. Are you all working on providing an easy way to maybe use LLMs for chatting/search without sending my data to OpenAI? If yes, how will you verify the quality is "reasonable"?

- How is this better than Rewind, Needl, Mem, etc all the personal search engine that have been doing the rounds lately from various knowledge bases? Is the selling point that it's Open-source? Also if Apple improves spotlight, I wonder how useful this will be.


Hello! One of the developers of Khoj here.

The way we see it, building in the open is going to be critical for creating an aligned, trustworthy AI assistant.

Note: while all LLM tools look similar on the surface these days, our specific approaches are fairly different. Give us a try and see what you think :-)


And yet you didn't answer them at all.


I can expand on that (I'm the other developer working on the project).

> Seen a few of these. Are you all working on providing an easy way to maybe use LLMs for chatting/search without sending my data to OpenAI? If yes, how will you verify the quality is "reasonable"?

We're working on building a helpful AI assistant, with or without OpenAI. We use offline SentenceTransformer models for search and OpenAI (currently) for chat.

To let users verify quality: with search, you have to look at the quality of the results returned. For chat, we pass the references (from your docs) used to generate the response. A lot more could be done; open to suggestions.

We also have our own chat quality test suite that "benchmarks" chat capabilities (via pytest).
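To make that concrete, here's a rough sketch of what one such pytest check could look like. The function names and response shape are hypothetical, for illustration only, not Khoj's actual suite:

```python
# Hypothetical shape of a chat-quality check, in the spirit of a pytest suite.
# `response` mimics a chat reply plus the document snippets cited for it.

def response_cites(response: dict, expected_terms: list) -> bool:
    """Pass if every expected term appears in the answer or its cited references."""
    haystack = (response["answer"] + " " + " ".join(response.get("references", []))).lower()
    return all(term.lower() in haystack for term in expected_terms)

def test_chat_grounds_answer_in_notes():
    # A real suite would call the chat endpoint; stubbed here for illustration.
    response = {
        "answer": "You sold your Toyota Corolla in March.",
        "references": ["2023-03-02: sold my car for $4,000"],
    }
    assert response_cites(response, ["Corolla", "sold my car"])
```

The useful property of checks like this is that they fail loudly when the model stops grounding its answers in the user's notes.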

> How is this better than Rewind, Needl, Mem, etc all the personal search engine that have been doing the rounds lately from various knowledge bases? Is the selling point that it's Open-source? Also if Apple improves spotlight, I wonder how useful this will be.

- I've tried Rewind. It's a neat project with a slick UI, no doubt about it. But 1. It has a cold boot problem (you can only search stuff you've opened since you installed Rewind) and 2. It's limited to Mac (M1+) machines. Khoj will index all supported files across your data sources and it can run on other machines easily.

- Needl, based on their homepage, seems to provide fuzzy/keyword based search. Khoj search works offline and supports natural language queries (e.g search for "sold my car for" and it'll find notes about your Toyota Corolla or Ferrari)

- Mem.ai is pretty neat as well. We'd love to add all the features they have. With Khoj you can self-host if you prefer or use Khoj cloud if you want to sync across devices. And it integrates into your existing tools (Emacs, Obsidian and Web)

In summary, Khoj being open-source is a critical differentiator for an AI assistant to be trustworthy (you can see what the code is doing). But the underlying approaches to AI assistance also differ.


>Are you all working on providing an easy way to maybe use LLMs for chatting/search without sending my data to OpenAI?

From a brief look at the GitHub repo, there seems to be a need to set up an OpenAI API key, so I'm not sure whether this can currently chat/search without sending data to, or needing access to, the OpenAI API?


Search does currently work 100% offline - none of your data would be sent to OpenAI if all you're doing is searching for your local documents. You could completely disable your internet connection and it would still work.

Chat currently is only integrated with OpenAI because it had the highest quality + lowest barrier to entry. We're experimenting with open source LLMs and hope to have an alternative available soon.


"The way we see it, building in the open is going to be critical for creating an aligned, trustworthy AI assistant."

Isn't this service just a very thin wrapper around ChatGPT? How on earth do you have any influence on alignment or trustworthiness? That's like saying your coffee cup makes your coffee fair trade.

This whole thread is very disingenuous, it's literally a simple interface for the OpenAI-API drenched in fake buzzwords boosted to the top of HN to scam investors.


You're being overly critical. You can definitely control the alignment of your assistant with prompt engineering and embeddings. They never say they control the underlying model.

It's an open source project and I don't see why you need to be so obnoxious about it.
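For example, a wrapper controls the system prompt and which retrieved snippets get injected as grounding. Roughly like this (illustrative names and message shape, not Khoj's actual code):

```python
def build_messages(user_query: str, snippets: list) -> list:
    """Compose an OpenAI-style chat payload. The wrapper, not the model vendor,
    decides the system prompt and which personal-note snippets ground the answer."""
    system = (
        "You are a personal assistant. Answer only from the user's notes below. "
        "If the notes don't cover it, say you don't know.\n\nNotes:\n"
        + "\n".join("- " + s for s in snippets)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_query},
    ]
```

Those two levers (instructions plus retrieved context) are exactly where a wrapper shapes behavior without touching the underlying model.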


Are they being obnoxious without cause though?

The Khoj website says, and I quote:

> Khoj's offline AI models allow you to find information using natural language queries. Search using terms that are similar to what you're looking for, rather than exact or fuzzy matches. Khoj search works offline. So if you self-host your data never leaves your machine and search works without internet.

Emphasis mine.

It seems somewhat disingenuous.

I get it, parts of it run offline, parts of it use the openai api… but that’s not what it says on the box.

Why is the project making a song and dance about self hosting and being open source when it’s just another openai app.

If it’s not just another openai wrapper, cut the openai part of it out and pitch it that way, sure.

…but as it stands, I’m pretty sceptical.

Lots of people are doing the “ai magic” tech demo stuff at the moment, but when you cut them off from the openai api the magic goes away and what’s left isn't very good or interesting.

Maybe this is different? …but it doesn’t look like it; and since they’re tied up with the openai api and you can’t use it without that, how would I even tell?


>> Khoj search works offline. So if you self-host your data never leaves your machine and search works without internet.

> Emphasis mine.

> It seems somewhat disingenuous.

I've been trying it. Khoj search does work offline. Khoj chat (they are literally separate functions in the app) requires an OpenAI key and, if you give it one, uses OpenAI.


Yes! It's a bit more than "somewhat disingenuous" to say a system built to use the OpenAI API makes sure "your data never leaves your machine".

That's like saying I invented a new form of transportation where your feet never leave the ground, but in actuality I'm just a travel agent sending you to the airport.


"your data never leaves your machine" is only mentioned in the Search section, where it is true. No one reading that would assume it meant everything, considering the two sentences just above, in the Chat section, explicitly say it's using OpenAI.

Really feels like people are nitpicking and hating on this project for no good reason. I feel sorry for the authors.


It feels like you are reading too much into this. Really don't understand all the bashing here. It's an open source software for building things using OpenAI. Do you think LangChain is similarly disingenuous? Or the Vercel AI SDK?


Neither of those things claim:

> So if you self-host your data never leaves your machine


You're quoting the paragraph under "Search", describing their search engine. I feel you're misrepresenting it.

Anyway, I definitely don't think this deserves to be described as "a simple interface for the OpenAI-API drenched in fake buzzwords boosted to the top of HN to scam investors" or "twitter get-rich-quick-guru level lousy and fake, and is clearly boosted to the top of HN".

Horrible reactions in this thread to open source software you can fork to use whatever you want. Really disappointing.


LangChain works completely offline with an appropriate LLM/API backend and vector store, if needed.


No one is preventing you from creating a PR or forking this project to add whatever backend you want. Did LangChain fully cover all backends on release? Are you not allowed to release a project that only supports OpenAI?

You really need to explain what you are hating on here.


STFU. All I said is that what they claim is not the state of their project today. Don't even get me started on their alignment BS.


It says open source AI personal assistant.

The AI isn't open source, and sending your data to a third party isn't really trustworthy or personal.


I understand your concerns, but let me zoom out a little here and talk about the nature of open source.

Open source means that the source code developed for a piece of software is fully open (i.e., anyone can read, fork, or modify the code they are installing).

According to your definition, it would be really hard to do anything that is fully, end-to-end, open source. We've developed the code on Macs, hosted the code on GitHub, written plugins for Obsidian and GitHub, and hosted the website on AWS. All of those are closed-source software.

https://www.redhat.com/en/topics/open-source/what-is-open-so...

That being said, we are planning to integrate an open source LLM soon. When we added chat, OpenAI just had the best one, but the space is changing so quickly. We're both super enthusiastic about seeing all the open source tooling for this stuff evolve.


The problem is not that there is "glue" to closed-source apps. It's that the essential core of your product, without which your product has no content or meaningful use — is _someone else's closed-source model._

If I market "a totally creative-commons blockbuster Hollywood movie", but my actual product is just a creative-commons-licensed set of driving directions to some nearby movie theater where you can buy tickets to see the same copyrighted movies anyone else is offering, then _the fundamental essence_ of what I'm offering is not, in fact, creative-commons. I sold people on a _movie_ with that license, and then failed to deliver.

That's what you've done here.

To be clear, the fatal flaw is that your marketing is dishonest about what your product currently is, not that your product is something nobody wants. I'd recommend either making your marketing honest, or else making your product live up to what your marketing promises.

_Then_ you do the PR push on HN.


> it would be really hard to do anything that is fully, end to end, open source. We've developed the code on Macs, hosted the code on GitHub, written plugins for Obsidian and GitHub, hosted the website on AWS.

Yeah, nobody did open source before those things existed...


I don't understand the presumption that the AI should be open source here. If I release an open source SDK for talking to an API, it's still open source even if the underlying API isn't.


> I don't understand the presumption that the AI should be open source here.

Because it literally says "open-source AI".


It says "open-source AI assistant".


Exactly, it doesn't say AI open source assistant. Everything mentioned after open source should be open source.


Be serious.


I'm dead serious. The assistant talking to the AI is open source. How else would you describe it? You guys are really doing everything in your power to discredit this. I really don't understand the hostile attitude.


Assistant _to the_ Regional Manager!

Seriously, this is such a misleading redefinition of a common phrase that it makes me suspicious of the whole project's trustworthiness. If they're playing shell games with "well, technically I didn't say that I meant this phrase the way it's most commonly used in this same industry", then what else are they redefining and misleading about?

I'd encourage you to look into how Siri and "hey Google" and even Cortana and Bixby all describe themselves: as "AI assistants". Nobody thinks "the thing that takes audio and throws it into an AI model is the assistant _to the_ AI". They think of the whole package as their AI assistant — that is, an assistant that is an AI.

That's how the phrase is most commonly used, and even if it were a new turn of phrase, parsing "adjective noun" out into "noun to the noun" is wildly unnatural.


So what should they call this to avoid getting hated on? People are saying this is "just a thin wrapper" but I don't think that's the case (and even if it was, what's the problem?). This is their architecture: https://github.com/khoj-ai/khoj/blob/master/docs/khoj_archit...


I'd call it "an OpenAI client that also has something like grep built in."


Would you consider an "open source database connector" to be a new, open source database that also has a connector?


That's a problem with the English language: it doesn't have compound words. A database connector is a connector to a database. An AI personal assistant is an AI-powered assistant that I can use to do tasks for me. If it's an assistant that helps me use an AI, that's something completely different.


You're just making up arbitrary rules that you apply inconsistently. My language has compound words and it wouldn't make any difference in this case. "An open source AI assistant" refers to the assistant in the same way "an inexperienced lab assistant" refers to the assistant being inexperienced and not the lab.


> You're just making up arbitrary rules that you apply inconsistently. My language has…

I’m not sure what your language has, but you definitely hit the nail on the head with your inadvertent description of the English language: (seemingly) arbitrary rules applied inconsistently.

You’re absolutely correct that an inexperienced lab assistant does not refer to the lab.

The GP isn’t wrong either.


Given how many people made the same reading (even to the point of the same joke), I'd argue they're arbitrary rules applied largely consistently, except when some marketing droid shows up.


No, because the common usage of "database connector" has consistently taught me enough context to expect "a connector to a database," much like the common usage of "AI assistant" has consistently taught me enough context to know that I should expect "an AI that acts as an assistant."


This thread reminds me of a repeated scene in The Office, where one man repeatedly calls himself the "assistant manager" and is constantly corrected to "assistant to the manager."

Open source AI assistant.

Open source assistant to the AI.


Open source Assistant for AI


Exactly my thoughts. This person is just gaming the system here.


> Seen a few of these. Are you all working on providing an easy way to maybe use LLMs for chatting/search without sending my data to OpenAI?

Curious: What informs reservations about the use of OpenAI models? Their API terms state explicitly that they do not use customer data for training and that they delete it after 30 days, anyway.

> Also if Apple improves spotlight, I wonder how useful this will be.

There are 3x more Android phones and PCs than iPhones and Macs. Just sayin'


> What informs reservations about the use of OpenAI models?

Three things. For one, I have no reason to take them at their word that they aren’t saving data to train on. Two is that OpenAI will shut down one day, and thus I would like any services I run to outlive them. Third and finally, I have hardware and it’d be a waste not to use it. As a bonus, I find it hypocritical a company that benefits so heavily from open source would hide away their models as closed source in fear of copycats.


> For one, I have no reason to take them at their word that they aren’t saving data to train on.

How are you able to trust cloud providers (even VPS or managed bare-metal ones)? I have seen the same sentiment among bigger companies that happily store all user data in the cloud.


I don’t. Any data I purposefully store in the cloud that has any significance I store encrypted. I also do my best to minimize my exposure to non-E2EE services for important purposes, and self-host when possible.


This industry has an atrocious track record of claiming to respect privacy, and then doing something entirely different. I have no reason to think OpenAI are lying, but it would still be wise to be extremely cautious of putting sensitive data in their hands.


Given the narrative and the place (HN) where you're saying it, I'm betting you don't use Google for storing your data either, but the vast majority of the world does. For someone who trusts Google, I'm almost at the same level of trust with OpenAI. It doesn't mean I think they're the good guys, just that I'm not that worried about the risk.


So, basically, you don’t care about the privacy of your digital data. That is fine, but it represents an extreme position regardless of how many people follow this path.


It's not an extreme position. It's a position shared by significant numbers of people worldwide, as is evident from the number of customers of these platforms you feel threatened by. It's only considered "extreme" in the echo chamber of HN.


Sure it is, on a 10-point scale:

0 - Full privacy, off the grid

9 - Brain implant with all data shared to the world

8.5 - Allowing Google to have, and to scan for ads and government perusal, all of your personal emails, written thoughts, location info, friends and accomplices, calendar, photographs, etc.


I think killing animals and eating them is an extreme position too but it’s considered obnoxious to say that, so how is this any different?


>they do not use customer data for training and that they delete it after 30 days, anyway.

I don't use X, just keep it around, 'just in case' for 30 days.


As someone who refers back to previous chats quite frequently, I'm glad they do this and would use a feature to extend that period of time.


It's API calls, not their user chat portal. You can't access the stored data; they say they keep it around for 30 days in case of abuse, so they can refer back to it to verify and take action.


> Also if Apple improves spotlight, I wonder how useful this will be.

Do you really not see the usefulness of a solution that caters to the remaining 88% (desktop/notebooks) of the market?


"Reasonable" from OpenAI is again subject to their whims and to changes in what they consider appropriate for you.

Haven't seen a roadmap for Spotlight to include semantic search across my entire local drive. Maybe if they integrate Journal/Freeform/Notes into one thing, it would be deliberate and work with things I explicitly want it to understand and help me work with, rather than the tools you've listed, which just help you find stuff.


To me, this makes a significant difference.

While I would prefer that I could run the LLM locally, being able to see the code that calls the api is a clear second best. At this point in time, I am not going to trust any black box that can read my data and run "AI" on it because I find the risk too big. If I can self-host something, I might just be willing to try it out.


Perhaps I'm in the minority, but seeing open-source used in the description made me think you were using or providing an openly available LLM in addition to the chat/search features. Instead it seems this is "merely" (I don't mean to undermine the level of effort involved) using OpenAI's GPT-4 API for its LLM.

This sort of reeks of a growth mindset, where you use "open-source" for the purposes of looking cool and gaining users, but are in fact trying to grow as quickly as possible to prove to investors that they should fund your next round.

I have no reason to believe that's the case for you in particular; just letting you know that some people may perceive things that way. Maybe you could make it clearer that it is a GPT-4 frontend of sorts?


1. Khoj has been around since early 2021, and both of us have been contributing to open source for several years. Being open source just makes sense for a project like this.

2. Search actually is 100% offline. It uses SentenceTransformer models from Hugging Face.

3. Chat uses OpenAI's model only because it's currently best in class (and easy to set up). Our plan (more of a 6-month view) is to host our own open source LLM for inference. See this issue for reference: https://github.com/khoj-ai/khoj/issues/201


No, you're not the minority. This is twitter get-rich-quick-guru level lousy and fake, and is clearly boosted to the top of HN.

This stuff is so incredibly tiring, because it's already all over social media and HN should be a safe space with actual products.


Jeez, take it easy with the unwarranted hostility! I get that you disagree with using OpenAI's APIs, but clearly this is an "actual product" in every sense of the word, not some snake oil.


Hey activatedgeek! Thanks for sharing Khoj. @110 and I are the developers.

Lots of great discussion going on in this thread. Two things we want to clarify:

1. Search works offline. Chat uses OpenAI.

2. We're working on adding open source LLM support for chat. We're evaluating quality and ease of setup for this.

If you find the project interesting, hop on our Discord and share your thoughts: https://discord.gg/BDgyabRM6e.

We very much want to hear about your experiences and how we can make something more useful for the community.


Ah, this comment puts a lot into context for me. Y'all didn't _intend_ this to be your big PR push here yet, and now you're caught "mid-flight" explaining why your marketing is still aspirational instead of true-to-current-state.

Feeling for y'all.


At this point I'm surprised nobody connects these tools to Gmail, GSuite, and/or a POSIX file structure. If it has to be my self-hosted AI assistant, I should be able to provide my documents to it, right?


We already index org, markdown, and PDF files on your file system. We're adding a text connector soon, which will allow you to index any plain text files you care to.

With that, you should be able to index Gmail over Maildir/POP/IMAP?


What about images, with OCR?


Ah, I always forget to mention. We do also index images, but haven't tested OCR.

Like you could search for "bike by the lake" to get relevant images, but not search for the text within an image.


Microsoft announced they were integrating GPT-4 into the Office suite, so I wouldn't be surprised if Google does something similar with Bard.


Google Docs added an AI a few weeks ago.


The example shown doesn't really fit what I associate with "personal assistant". Assistants do tasks, not answer questions like "where do good ideas come from?". I can ask that ChatGPT without any third-party middlemen.


Just had a look at the code. It’s a cool project that’s clearly had a lot of thought put into it.

If the devs are still around, I’d love to hear about your experiences with embeddings.


1. One of the reasons we created Khoj was being able to do natural language search with embeddings generated offline using open-source models!

2. We don't use any vector datastores (yet). You can do a lot in memory; it's faster and does exact matching (no approximate nearest-neighbor search).

Feel free to ask if you were looking for something more specific?
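On point 2, the brute-force in-memory approach is simple enough to sketch. This is illustrative, not Khoj's actual code, and assumes embeddings are already computed (e.g., by a SentenceTransformer model):

```python
import numpy as np

def exact_search(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3) -> list:
    """Rank every document by cosine similarity to the query.
    No ANN index: scores are exact, computed over the full matrix in memory."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    top = np.argsort(-scores)[:k]       # indices of the k best-scoring documents
    return [(int(i), float(scores[i])) for i in top]
```

For a personal corpus of tens of thousands of chunks, this exact scan is typically fast enough that an approximate index buys little.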


Thank you! I’d love to hear more about your experiences with:

1. content / question vector mismatch

2. what types of embedding you experimented with storing per-chunk (text only? Hypothetical question? Metadata?)

3. choice of embeddings model (eg OpenAI vs instructorEmbeddings or an alternative from the MTEB leaderboard)

It’s a great project, going to have a deeper dig today.


Here is a test to assess quality of these assistants.

(1) Upload the Bitcoin white paper. (2) Ask the question: "What is the contribution of R.C. Merkle to this research?"

The proper answer should mention “Merkle Trees”.


Khoj means 'search' in some Indian languages.


I wonder if there's a .do TLD. If so, they could've done khoj.do


It's made by Indians.


Nice work! I think one way I would definitely use it is if I could just ask questions about my Downloads folder :) on my Mac. If you are like me, you probably have papers, invoices, proofs of address, passports, and stuff like that inside. And would I be able to ask what's the passport number of ... so I can enter it into the web check-in for a plane? Or if I need to know what my last electricity bill was?


Those are cool use cases! PDFs with text should work. Maybe I should try and index my download folder too :-).


This uses ChatGPT, and the article makes no promise that our personal data will not be sent to ChatGPT.

No, thanks.


The GPT integration only works if you pass Khoj an OpenAI key in your settings, so it's a pretty explicit opt-in. Otherwise, there's no way for Khoj to send data to OpenAI. Does that make sense?


But what can Khoj do without that?


It still provides really good semantic search! The search will return snippets from your raw data itself, while the LLM gives you the experience of chatting with it.


That’s great to know, thank you!


What is the difference from, e.g., KnowledgeGPT?

https://news.ycombinator.com/item?id=34652921

I think I will have to test both solutions myself...


Quite cool! It looks like this tool is oriented around ephemeral sessions, while Khoj is meant to be personal and local to you.


Hi there! To the developers:

Is there a way to use a personally owned and hosted LLM? If not, is there an interest in developing such a feature?


Hi Mithril,

For search we already use an offline/self-hosted model from Hugging Face. And you can easily configure it to use other SentenceTransformer models from Hugging Face.

For chat, follow this issue to track when Khoj gains the ability to use offline/self-hosted chat models: https://github.com/khoj-ai/khoj/issues/201


It is impossible for me to read this site on my iPhone, because the header size keeps changing with the typing animation, so the text moves up and down every second.


Would like to see this support Word documents too. It doesn't sound like those are supported yet.


Relies on "ClosedAI"; what's the point of being the 47,373rd app that does so?


Notion plug-in would be fantastic




