Libdejector User's Guide

Robert J Hansen

Meredith L Patterson

Legal Notice

Libdejector is (c) 2005, Robert J. Hansen and Meredith L. Patterson. The code is hereby released under terms of the GNU General Public License, version 2 (or, at your discretion, any later version).

Independent implementors are warned that the techniques used in Libdejector are currently being explored for patentability. We disavow any and all patent claims which may be brought to bear against users of the GPLed code or its derivatives; however, commercial software interests are referred to Michael F. Williams, Esq. ().

The supporting documentation for Libdejector is released under terms of the GNU Free Documentation License, version 1.2 (or, at your discretion, any later version).

While we certainly hope this code and documentation will be useful and accurate, we cannot accept any liability for inaccuracies. Full disclaimers can be found in the terms of the license agreements.

The use of any trademark in this document is for instructional purpose only, and is not a challenge to the use of those marks.

These legal notices are declared Invariant Sections for purposes of the documentation license.

Abstract

Libdejector is a research project of Robert J. Hansen and Meredith L. Patterson, both Ph.D. students at the University of Iowa. The aim of Libdejector is to provide a simple, extensible C library to raise the difficulty of SQL injection attacks. At present, bindings are provided for the Python scripting language, but other languages should present no great difficulties.


Table of Contents

1. Introduction
What's SQL injection?
The Old Way of Defending
The New Way of Defending
2. Behind the Scenes
More Details
3. Compiling libdejector
Before You Begin...
Compiling libdejector
Installing libdejector
The C library
Python bindings
4. API Specification
libdejector
Python API
Objects
Exceptions
Marking up Exemplars

Chapter 1. Introduction

For a lot of businesses, database security is mission-critical. If your database gets corrupted from bad records, your DBAs will spend a lot of their time and your money fixing the problem. Worse, if malicious hackers use weaknesses in your database security to get access to client information, you can be put in the unenviable position of having to tell your clients you weren't able to protect their sensitive information.

For a long time, a certain kind of attack called SQL injection has been so difficult to defend against that many reputable programmers have basically thrown up their hands in frustration. Because it is so difficult to defend against injection attacks with existing tools, malicious hackers consider injection to be one of their foremost weapons.

Libdejector solves the problem. And when we say "solves the problem", we don't mean that it just works well. We mean the underlying theory has been mathematically proven to solve the problem. And neither do we mean that it's something that requires a Ph.D. to use; while it took two Ph.D. students to come up with it, the library itself can be used by pretty much any Web programmer.

Libdejector is not a silver bullet. You can't sprinkle a little libdejector here and there and be magically safe. Libdejector has to be used properly and integrated into a larger security plan. Used properly, it will give you excellent protection from one specific kind of attack, but you need to be aware that other attacks exist and you need to defend against them.

But relax. We're working on those, too.

What's SQL injection?

SQL is the Structured Query Language, an ANSI/ISO[1] specification for how human beings interact with databases. Interactions are described in terms of queries, which can be thought of as questions you ask the database. If you wanted to get everyone in a phone book whose last name is Smith, you might say, "select * from phonebook where last_name = 'Smith';".

We hope you noticed the semicolon at the end. All SQL queries end with a semicolon, just as English questions end with a question mark. From this, you can see that SQL has grammar, and by adjusting the grammar, the meaning of a query can be changed.

An example in English could be, "Woman! Without her, man is nothing!" Change the punctuation just a little bit and you get "Woman, without her man, is nothing." Grammar is power. SQL injection plays tricks with grammar in order to make databases do things you don't want them to do.

Let's say that you want to let anybody query the phone book for records. You have a Web page which takes in a user request, which gets stored in a variable called FOO. Then, you structure your SQL query such that it reads "select * from phonebook where last_name = 'FOO';". Your Web server will then substitute whatever the user input for FOO. If the user inputs Smith, then the query will look just like our previous example.

But what happens if a malicious hacker gives a string like "'; drop table phonebook; -- "? Now we've given the database three things to process instead of one query. The first is a query which searches for people with an empty last name; the second is a command which deletes the phone book; and the third, which begins with two dashes, is a comment which just says "ignore the rest of this line completely".
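To make the substitution concrete, here is a minimal Python sketch of the vulnerable pattern described above. The template and the two inputs come straight from the text; the names TEMPLATE and build_query are ours, chosen only for illustration.

```python
# The query template from the text; FOO is the placeholder for user input.
TEMPLATE = "select * from phonebook where last_name = 'FOO';"


def build_query(user_input):
    """Naively paste user input into the template (the vulnerable pattern)."""
    return TEMPLATE.replace("FOO", user_input)


benign = build_query("Smith")
hostile = build_query("'; drop table phonebook; -- ")

print(benign)   # the query the programmer intended
print(hostile)  # the same template, now carrying a second statement
```

Printing the two results side by side shows that nothing in the template changed; only the grammar of the finished query did.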

Suddenly, a malicious attacker has just deleted thousands (perhaps millions!) of records from your database. This is a nightmare! Whatever budget savings you gained from automating the phone book have just been wiped out from all the overtime the DBAs are racking up trying to fix the damage.

The Old Way of Defending

Previously, our defenses against injection attacks were haphazard and did not fit the problem very well. Databases were told not to trust the Web server when it said to insert or delete information, but that caused problems when Web servers had a legitimate need to do it. Web programmers used regex validation to try to spot bad inputs, but it was quickly discovered these techniques had a very high failure rate.

There were (and still are) two different ways regexes could fail. A regex could fail to spot a bad input, and thus let a malicious attacker in. A regex could also incorrectly flag a good input as bad, and thus keep a legitimate user out. The first one was bad, but the second one was far worse. If your Web site is getting a thousand hits each second, the majority of them will be legitimate requests. If a regex validator incorrectly rejects one percent of legitimate requests, that means your Web site is pushing away about ten people every second. This leads to people calling up your tech support lines, irate over how your "broken" Web server is getting in their way when they're trying to do something safe and reasonable.

Many database vendors are aware of the limitations of regex validation. Some vendors have introduced bound variables as a way of addressing many of these problems. There are two inherent limitations of bound variables, though. Number one, they only work on part of the SQL query. Some portions of the query cannot be protected via bound variables. Number two, they're typically provided by large database companies that want quite a lot of money. Our technique can be applied to commonly-available open-source databases for a very low cost. It also protects you from vendor lock-in. Using libdejector you can migrate your database from one vendor to the next without needing to worry about whether the next vendor will support bound variables in the same way as your old vendor, and how much of your code will have to be rewritten as a result.

The New Way of Defending

The major breakthrough came in January 2005 during unrelated research. We realized the trouble wasn't how regex validation was being used; it was regex validation's inherent inability to solve the problem.

By "inability to solve the problem", we mean that we sat down with the mathematics and formally proved that for any regex validator, we could construct either a safe query which would be flagged as dangerous, or a dangerous query which would be flagged as correct.

Most people would be despondent over this discovery. In fact, it was liberating. Our results told us what to look for in a real solution, and we found it in the realm of pushdown automata theory.

The problem with regex validation is simple: it doesn't keep track of context. A double dash in the command parts of an SQL query is very dangerous, but a double dash enclosed within quotation marks is completely harmless. So instead of just looking at user input and deciding from that if it's safe or dangerous, we look at the user input in the context in which it is used.

Regex validators can't look at context; that's mathematically provable. Since we need to look at context, we need a different tool and a different way of examining things, and pushdown automata give us both.
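The difference between judging input in isolation and judging it in context can be sketched in a few lines of Python. This is a toy illustration only: the blacklist pattern and both function names are our own assumptions, and the quote-tracking scanner is far simpler than the pushdown automaton libdejector is built on. It shows only that the same two dashes are harmless in one position and dangerous in another.

```python
import re

# A typical blacklist that inspects the user's input on its own (hypothetical).
NAIVE_BLACKLIST = re.compile(r"--|;|'")


def naive_is_safe(user_input):
    """Reject any input containing a 'dangerous' character sequence."""
    return NAIVE_BLACKLIST.search(user_input) is None


def dashes_outside_quotes(sql):
    """Return True if '--' appears outside a single-quoted string literal."""
    in_string = False
    i = 0
    while i < len(sql):
        if sql[i] == "'":
            if in_string and sql[i + 1:i + 2] == "'":
                i += 2  # a doubled quote is an escaped quote inside a literal
                continue
            in_string = not in_string
        elif not in_string and sql.startswith("--", i):
            return True
        i += 1
    return False


template = "select * from phonebook where last_name = '%s';"
harmless = template % "Smith--Jones"
hostile = template % "'; drop table phonebook; -- "

print(naive_is_safe("Smith--Jones"))    # the blacklist turns away a legitimate name
print(dashes_outside_quotes(harmless))  # in context, the dashes sit inside a literal
print(dashes_outside_quotes(hostile))   # in context, the dashes start a comment
```

The blacklist rejects the hyphenated name even though the finished query is harmless, which is exactly the false-positive failure described earlier.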

Libdejector is the end result of a year of research and development into the application of pushdown automata towards database security.



[1] American National Standards Institute / International Organization for Standardization

Chapter 2. Behind the Scenes

All SQL injection attacks have one thing in common: they modify the grammar of a regular query. SQL is a language just like English, and for that reason we're going to talk about it in English. Our anti-injection technique (which we call dejection) works on any language for which the grammar can be completely defined.

Imagine the sentence "John gives the book to Mary". If I tell you that you're allowed to change the names but nothing else, how can you tell that "Mary gives the book to John" is an acceptable sentence, but "The book gives John to Mary" is a bad input?

Let's start by diagramming the example sentence. Simplifying it immensely, you'd find that a person gives a book to a person.

In the first input, we diagram it out and discover that a person gives a book to a person. The names attached to those people vary, but that's okay: we've said that we're allowed to change the names of the people.

In the second input, we diagram it out and discover that a book is giving a person to a person. This has two major problems. The first one is that our giver is no longer a person, but a thing. We're allowed to change the name of the giver, but not what kind of thing is doing the giving. The second one is that our gift is no longer a book. We're not allowed to change anything about the gift, and now we've gone from giving a book to giving a human being.

Humans do this kind of context-sensitive analysis all the time. Children make it into a game called Mad Libs, which uses bizarre transformations to amuse kids while educating them about grammar.

The non-mathematical version of Dejector is this: we diagram SQL queries and mark each part of the query as being "changeable" or "fixed". If the programmer declares part of the query is changeable, then that part of the query can be altered by the user. Everything that's not marked as changeable is instead fixed, and any alteration to the fixed portions--no matter how mild--will cause the SQL query to be rejected.

Obviously, if we were to say "anything in the sentence 'John gives the book to Mary' can be changed", then we'd have no security against people deciding that the book should give John to Mary. Likewise, if people decide that "anything in this SQL query can be changed", then Dejector will provide no security against injection attacks. Dejector is a tool, first and foremost, and like all tools can be used incorrectly.

More Details

Each database speaks its own very slightly different dialect of SQL. MySQL[2] speaks a slightly different one than PostgreSQL[3] which speaks a slightly different one than SQL Server[4]. Since grammar is very important to the successful use of libdejector, it is absolutely paramount that you only use the version of libdejector compiled for your particular database!

Each database has what's called a "context-free grammar" (usually written down in a notation called BNF) specifying its dialect of SQL and how its own particular rules operate. Libdejector uses this context-free grammar plus an example query--what we call an exemplar--to diagram out what queries should look like. The programmer uses a simple markup notation to specify which parts of the diagram are changeable; everything else is fixed.

Once we have our exemplar, we transform it into XML. We then take the user's input, diagram it according to the rules of the database's SQL dialect, and transform that into XML.

Once that's done, we just walk through the two XML representations, comparing each XML node one at a time. If we find a node which is marked as changeable, then we skip all of that node's children and move to the next sibling node.

If we get to the end of the two XML representations and everything that's not marked changeable is identical, then we declare the user input to match the exemplar. But if at any time we find that they're out of step, we say that an injection is occurring.
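The walk described above can be sketched in Python using the standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not libdejector's actual code: the element names, the changeable="true" attribute, and the function name same_shape are all hypothetical, since this guide does not specify the XML format the library produces.

```python
import xml.etree.ElementTree as ET


def same_shape(exemplar, candidate):
    """Compare two parse trees node by node, skipping changeable subtrees."""
    if exemplar.tag != candidate.tag:
        return False  # the kind of node changed: out of step
    if exemplar.get("changeable") == "true":
        return True   # changeable node: skip its contents, move to next sibling
    if (exemplar.text or "").strip() != (candidate.text or "").strip():
        return False  # fixed text was altered
    ex_children, ca_children = list(exemplar), list(candidate)
    if len(ex_children) != len(ca_children):
        return False  # nodes were added or removed
    return all(same_shape(e, c) for e, c in zip(ex_children, ca_children))


# Hypothetical XML for: SELECT number FROM phonebook WHERE last_name = '{Smith}';
EXEMPLAR = ET.fromstring(
    "<query><select><column>number</column><table>phonebook</table>"
    "<where><column>last_name</column>"
    "<string changeable='true'>Smith</string></where></select></query>"
)
# The same query asked about Jones.
GOOD = ET.fromstring(
    "<query><select><column>number</column><table>phonebook</table>"
    "<where><column>last_name</column>"
    "<string>Jones</string></where></select></query>"
)
# An injected input: an empty string literal followed by a second statement.
BAD = ET.fromstring(
    "<query><select><column>number</column><table>phonebook</table>"
    "<where><column>last_name</column>"
    "<string></string></where></select>"
    "<drop><table>phonebook</table></drop></query>"
)

print(same_shape(EXEMPLAR, GOOD))
print(same_shape(EXEMPLAR, BAD))
```

Note that the changeable node must still be the same kind of node in both trees; only what lies beneath it is ignored.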



[2] MySQL is a trademark of MySQL A.B.

[3] PostgreSQL is a trademark of the PostgreSQL Global Development Group.

[4] SQL Server is a trademark of the Microsoft Corporation.

Chapter 3. Compiling libdejector

Before You Begin...

Warning

It is very important that you only use the version of libdejector meant for your particular database. Please do not use a libdejector meant for PostgreSQL 7 with PostgreSQL 8, much less use a libdejector meant for MySQL 3 with SQL Server!

Compiling libdejector from source isn't scary, but there are some important prerequisites to consider. The most important of them is a full set of development tools. We swear by the GNU Compiler Collection and its toolchain, but you may have success with others. We've successfully built libdejector with GCC 3.2, 3.3 and 4.0; other compilers may or may not work, depending on their level of ANSI conformance. Additionally, we use GNU Bison 2.1 and GNU Flex 2.5.31; later versions should also work. We use GDOME 0.8.0 for handling the XML transformations.

The preceding is what you need just to build the C libdejector library. To build bindings for scripting languages, you'll need SWIG 1.3.17 or later, along with the header files for your scripting language of choice.[5]

Compiling libdejector

Please see the file INSTALL in the libdejector source tree for detailed instructions. Generally speaking, it should be as simple as typing ./configure and then make.

Installing libdejector

The C library

By default, only libraries needed for scripting languages are built. This means that a C library for libdejector is not installed by default. However, it wouldn't be difficult to hack the Makefile to make it do this.

Python bindings

By default, the Python bindings are installed into your Python installation's site-packages directory. This makes a new module available to all Python invocations. The module is named Dejector_foo_bar, where "foo" is the database name and "bar" is the version of the database.[6]



[5] At present, only Python is officially supported. Bindings for .NET, Java and Perl are in development.

[6] E.g., libdejector for PostgreSQL 8.0.3 through 8.0.5 is Dejector_pg_80x.

Chapter 4. API Specification

libdejector

Warning

The C library is not built by default. This API reference will only be useful to you if you've hacked the Makefile to make a C library. We anticipate that this will be useful only to people porting the C back-end to other scripting languages.

The complete C API can be automagically generated from the source code using the tool Doxygen. We strongly recommend doing this over reading this guide. It's possible that you've been given a guide for a different version of libdejector than the one you're using on your site; but if you generate the documentation from the source code, you can be reasonably well-assured that the documentation is accurate.

int deject(const char* exemplar, const char* test)
      

This function is the heart of libdejector. Given an exemplar string exemplar and a test string test, it determines whether the test string is a syntactic (grammatical) match for the exemplar. It returns 1 if so and 0 if not. Various negative numbers are returned on internal failure conditions: -1 means we were unable to mark nodes as changeable/fixed, and -2 means at least one of the inputs was not a valid SQL query. A return value of -1 should never occur; a return value of -2 will happen on malformed inputs.
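Because failures are reported as negative numbers, a caller that merely tests the result for truth would treat -1 and -2 as a match. The following Python sketch shows one way a binding author might interpret the return value. The helper names and the wording of the messages are our own; only the four documented codes come from this guide, and any other value is treated as an undocumented failure.

```python
# Meanings of the documented return values of deject().
DEJECT_MEANINGS = {
    1: "test string matches the exemplar",
    0: "test string does not match the exemplar",
    -1: "internal failure: could not mark nodes as changeable/fixed",
    -2: "at least one input is not a valid SQL query",
}


def describe(code):
    """Translate a deject() return value into a human-readable message."""
    return DEJECT_MEANINGS.get(code, "undocumented internal failure")


def is_match(code):
    """Fail closed: only an explicit 1 counts as a match."""
    return code == 1


for code in (1, 0, -1, -2):
    print(code, is_match(code), describe(code))
```

Comparing against 1 explicitly, rather than writing a bare truth test, keeps malformed input from being mistaken for a match.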

Python API

It's almost easier to show Pyjector in use than it is to document it: there are only two classes and one method.

Objects

The two objects defined by Pyjector are Dejector and BadSQL. All the important work is done by Dejector.

Dejector

__init__(exemplar)
validate(candidate)
          

The constructor for Dejector takes a single argument; namely, the string to use as an exemplar. Validation takes place by calling Dejector.validate(candidate), which will throw an exception if the string candidate fails to validate against the exemplar.

BadSQL

BadSQL is a completely empty class. It exists only as a raw type so that programmers can catch injection attempts and treat them differently than other errors.

Exceptions

The Dejector.validate method will throw an exception of type BadSQL if the candidate string (the user input) does not match against the exemplar.
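A typical calling pattern might look like the following Python sketch. The function guarded_execute and its parameters are hypothetical names of ours, not part of the bindings. It takes the validator and the exception type as arguments, so it assumes only what is documented above: an object with a validate(candidate) method that raises BadSQL on a mismatch.

```python
def guarded_execute(dejector, bad_sql, query, execute):
    """Run `query` through `execute` only if `dejector` accepts it.

    dejector -- an object with a validate(candidate) method
    bad_sql  -- the exception type raised on an injection attempt
    execute  -- any callable that sends the query to the database
    Returns (True, result) on success, or (False, None) on rejection.
    """
    try:
        dejector.validate(query)
    except bad_sql:
        # An injection attempt: refuse the query. A real application
        # would probably also log the attempt here.
        return False, None
    return True, execute(query)


# With the real bindings installed, the wiring might look like this. The module
# name follows the Dejector_foo_bar convention; we assume, without confirmation
# from this guide, that it exports Dejector and BadSQL at the top level:
#
#   import Dejector_pg_80x as dj
#   d = dj.Dejector("SELECT number FROM phonebook WHERE last_name = '{Smith}';")
#   ok, rows = guarded_execute(d, dj.BadSQL, user_query, run_on_database)
```

Catching BadSQL separately from other exceptions lets the application treat an attack differently from an ordinary database error.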

Marking up Exemplars

In order to mark a section of an exemplar as mutable, enclose it in curly braces. For instance, SELECT number FROM phonebook WHERE last_name = 'Smith'; could be bracketed as SELECT {number} FROM phonebook WHERE last_name = 'Smith'; if you wanted users to be able to access other fields but only for people named Smith. Or if you wanted people to be able to find the phone number and only the phone number for any last name, you could bracket it as SELECT number FROM phonebook WHERE last_name = '{Smith}';.
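To see what the markup denotes, here is a small Python sketch which separates an exemplar into its plain query and its marked regions. The helper split_markup is our own illustration, not part of libdejector, and it assumes braces are never nested and never appear literally in the query.

```python
import re

# One marked region: an opening brace, any brace-free text, a closing brace.
MARKUP = re.compile(r"\{([^{}]*)\}")


def split_markup(exemplar):
    """Return (query_without_braces, list_of_marked_fragments)."""
    marked = MARKUP.findall(exemplar)
    plain = MARKUP.sub(r"\1", exemplar)
    return plain, marked


plain, marked = split_markup(
    "SELECT number FROM phonebook WHERE last_name = '{Smith}';"
)
print(plain)   # the exemplar as an ordinary query
print(marked)  # the fragments a user is allowed to change
```

With the braces removed, the exemplar is itself an ordinary query in the database's dialect, which is what allows it to be diagrammed like any other.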

Warning

Limit your markup!  So far we've skipped a lot of the underlying theory. One of the ideas we skipped is the concept of the lowest enclosing scope. What this basically means is that the more of a query you bracket, the more things will match it. Including, perhaps, things you don't want it to match.

The best advice to give, then, is to limit your markup. Enclose only the parts of a query which you want someone to be able to change. For instance, don't bracket {'Smith'} when '{Smith}' will do. The former says "you can substitute anything here", and the latter says "you can only substitute a string here".

Only one markup per query!  At present, libdejector is unreliable when attempting to mark up multiple distinct parts of a query as changeable. For instance, SELECT {number} FROM phonebook WHERE last_name = '{Smith}'; would be an unreliable query. We're working on getting past this limitation and expect the next release to address it.

It's not magic!  Libdejector is not magic. It needs to be used as part of an integrated Web security solution. If you're using libdejector but your root password is set to root, we cannot protect you.