Hi Michael,
Post by Michael Papet
I have a very difficult time understanding the need for a new layer on top of SQL. Can you please elaborate on why a new query language is required?
Sure thing. I made a start on explaining this at
http://www.bongo-project.org/Documentation/Bongo_Query_Language , but
let me try again here, and if it makes sense to you I can update that page :)
Let me say a little about the store first. It's effectively the heart of
Bongo in a lot of ways, being responsible for data storage and doing
useful things with the data, and it's simply a network daemon which
speaks a straightforward text protocol we inherited from Netmail called
"NMAP". Many of the commands it understands are designed to retrieve
data in some way; for example, finding the e-mails in the Inbox,
searching for contacts, that kind of thing.
As an example command, let's look at LIST:
http://www.bongo-project.org/Documentation/Store_Protocol#LIST
LIST just finds the documents in a collection, which could be (for
example) a list of e-mails in a given mailbox. You'll see it takes a
number of parameters: we can ask for documents with certain flags set
(such as "unread documents"), and we can ask for extra output like
document properties (for example, "Who sent the e-mail") for each
document in the list.
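To make that concrete, here's a rough Python sketch of the kind of filtering LIST does. The flag values, field names, and function are all invented for illustration; the real command syntax and parameters are on the Store_Protocol page above.

```python
# Hypothetical sketch of LIST-style filtering: flag values and document
# fields here are made up for illustration, not the real store's.
UNREAD = 0x01
DELETED = 0x02

documents = [
    {"id": 1, "flags": UNREAD, "from": "bob@example.com"},
    {"id": 2, "flags": 0, "from": "carol@example.com"},
    {"id": 3, "flags": UNREAD | DELETED, "from": "bob@example.com"},
]

def list_collection(docs, required_flags=0, forbidden_flags=0, props=()):
    """Return docs whose flags match, plus any requested extra properties."""
    results = []
    for doc in docs:
        # Keep only documents with all required flags set...
        if doc["flags"] & required_flags != required_flags:
            continue
        # ...and none of the forbidden flags.
        if doc["flags"] & forbidden_flags:
            continue
        row = {"id": doc["id"]}
        # Attach any extra output the client asked for, e.g. the sender.
        for prop in props:
            row[prop] = doc.get(prop)
        results.append(row)
    return results

# "Unread, not deleted, and tell me who sent each one":
unread = list_collection(documents, required_flags=UNREAD,
                         forbidden_flags=DELETED, props=("from",))
```

Each new kind of question needs another parameter bolted onto the command, which is exactly the inflexibility described below.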
The problem with this approach of adding parameters to commands is that
it's pretty inflexible: for example, I can't use LIST to find "all the
documents marked read and sent within the last four days".
Hula solved this problem by indexing document properties in a CLucene
search engine index, and store clients (such as the Dragonfly webmail
server) would pass CLucene search queries directly into the store to
refine results further. At query time, Hula would run an SQLite query
to retrieve the rough list of relevant documents, run the CLucene query
it had been given against the search engine, and then manually sort
through the results of both, throwing out documents returned by only
one query: effectively, in SQL-speak, doing a manual JOIN on the
results. Unsurprisingly, this doesn't turn out to be very efficient.
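In rough Python terms, the shape of that approach was something like this. The schema, search API, and names are all invented for illustration; this isn't Hula's actual code.

```python
import sqlite3

# A sketch of Hula's two-query approach; the schema and search-engine
# API are invented for illustration only.
def hula_style_search(conn, search_engine, collection, lucene_query):
    # Query 1: SQLite produces the rough list of relevant documents.
    sql_ids = {row[0] for row in conn.execute(
        "SELECT id FROM documents WHERE collection = ?", (collection,))}
    # Query 2: the client's CLucene query, passed through verbatim.
    index_ids = set(search_engine.search(lucene_query))
    # The manual "JOIN": throw out anything returned by only one query.
    return sql_ids & index_ids

# Demo with an in-memory database and a stub search engine:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER, collection TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(1, "inbox"), (2, "inbox"), (3, "sent")])

class StubIndex:
    def search(self, query):
        return [2, 3]   # pretend these documents matched the CLucene query

matches = hula_style_search(conn, StubIndex(), "inbox", "from:bob")
```

Every search costs two full queries plus an in-memory intersection, which is the inefficiency BQL avoids by reducing everything to a single SQL query.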
Another problem with that approach is that we're passing through the
query unthinkingly. If a client wanted to ask CLucene to do something
bad (intentionally or unintentionally), it has a direct line. The store
doesn't even understand the query syntax, and, as a consequence, we
never had it documented anywhere.
I wanted to solve all these problems with a single solution, and that
solution is BQL. It's badly named, but it's a simple, documented
language which we can reduce to SQL: there is no manual work to be done,
we only run the query once, we can make better guarantees about the
query being well-formed and sensible, and we can give clients an
interface where they can send very expressive queries and get back just
the data they want.
Now, I said it's badly named, and in retrospect I shouldn't have called
it a "query language": it's not. You can think of BQL as being mostly
concerned with the parts after the WHERE clause in SQL; in that sense
it's really better described as a filtering language. You can ask for
some extra data in results, such as document properties, but 90% of the
time you're really using it to cut down a dataset into a more specific
set: going from a generic "Give me a list of mail in my Inbox", for
example, to a much more specific "Give me a list of mail sent in the
last week from Bob".
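As a simplified illustration of the reduction idea, here's a sketch of turning a structured filter into a parameterized SQL WHERE clause. The filter representation, operators, and column names are invented for this sketch; the real grammar is in the Bongo_Query_Language docs.

```python
# A simplified illustration of reducing a BQL-style filter to SQL.
# The filter shape and column names are invented for this sketch.
def reduce_to_sql(filters):
    """Turn [(property, op, value), ...] into a parameterized query."""
    allowed_ops = {"=", "<", ">", "<=", ">="}   # whitelist: no raw SQL gets through
    clauses, params = [], []
    for prop, op, value in filters:
        # Reject anything malformed before it gets near the database.
        if op not in allowed_ops or not prop.isidentifier():
            raise ValueError("malformed filter")
        clauses.append("%s %s ?" % (prop, op))
        params.append(value)   # values travel as bound parameters
    sql = "SELECT id FROM documents WHERE " + " AND ".join(clauses)
    return sql, params

# "Mail sent in the last week from Bob" might reduce to:
sql, params = reduce_to_sql([("sender", "=", "bob@example.com"),
                             ("sent", ">", 1178000000)])
```

Because values are bound as parameters and operators are whitelisted, a malformed or malicious filter is rejected by the store instead of reaching SQLite, which is the well-formedness guarantee mentioned above.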
Could we have used SQL for this? Technically it would be possible,
although we would be back to sending opaque queries through the store
protocol and all the problems that brings (potential for SQL injection
attacks, abuse of privilege, etc.). It would also have meant that store
clients would have to become aware of the store's SQL schema; with BQL
they are completely abstracted from it, because the store does the work
of turning the BQL query into SQL.
Hopefully this all a. makes sense, and b. convinces you of the need for
this :)
Post by Michael Papet
My other suggestion is to please consider handling all of the queries through ODBC so the database of choice remains open and flexible. Yes, there's a sqlite odbc for all platforms I use, so ship it as the default. As an Admin in an environment where we have elaborate database infrastructure already, using another DB outside of our backup/SAN/etc is a non-starter.
That's definitely on the table, although probably not for a little while
yet. There is already a database API in place, so we haven't scattered
and embedded the SQLite stuff all over the store - it would be
relatively straightforward to put an ODBC link in there instead, and I
know that a lot of people would be interested in that option.
That said, I don't intend our use of SQLite to be particularly visible
to the end-user/administrator. This isn't going to be another database
to manage: it's closer to an index file, in much the same vein as any
other file-backed binary storage, and indeed standard file-based backup
systems would work absolutely fine. Many people, I suspect, won't know
or care that there's an SQL engine in there.
Thanks for your comments,
Alex.