Building a Search Web App with Dancer and Sphinx

In this article, we'll develop a basic search application using Dancer and Sphinx. Sphinx is an open source search engine that's fairly easy to use, but powerful enough to be deployed in high-traffic sites, such as Craigslist and Dailymotion.

In keeping with this year's Dancer Advent Calendar trend, the example app will be built on Dancer 2, but it should work just as well with Dancer 1.

Alright, let's get to work.

The Data

Our web application will be used to search through documents stored in a MySQL database. We'll use a simple table with the following structure:

CREATE TABLE documents (
    id int NOT NULL AUTO_INCREMENT,
    title varchar(200) NOT NULL,
    contents_text text NOT NULL,
    contents_html text NOT NULL,
    PRIMARY KEY (id)
);

Each document has an unique ID, a title, and contents, stored as both plain text and as HTML. We need the two formats for different purposes -- HTML will be used to display the document in the browser, while plain text will be fed to the indexing mechanism of the search engine (because we do not want to index the HTML tags, obviously).

We can populate the database with any kind of document data -- for my test version, I used a simple script to fill the database with POD documentation extracted from Dancer distribution. The script is included at the end of this article, in case you'd like to use it yourself.

Installation and Configuration of Sphinx

Sphinx can be installed pretty easily, using one of the pre-compiled .rpm or .deb packages, or the source tarball. These are available at the download page at SphinxSearch.com -- grab the one that suits you and follow the installation instructions.

When Sphinx is installed, it needs to be configured before we can play with it. Its main configuration file is usually located at /etc/sphinx/sphinx.conf. For our purposes, a very basic setup will do -- we'll put the following in the sphinx.conf file:

source documents
{
    type        = mysql
    sql_host    = localhost
    sql_user    = user
    sql_pass    = hunter1
    sql_db      = docs

    sql_query   = \
        SELECT id, title, contents_text FROM documents
}

index test
{
    source          = documents
    charset_type    = utf-8
    path            = /usr/local/sphinx/data/test
}

This defines one source, which is what Sphinx uses to gather data, and one index, which will be created by processing the collected data and will then be queried when we perform the searches. In our case, the source is the documents database that we just created. The sql_query directive defines the SELECT query that Sphinx will use to pull the data, and it includes all the fields from the documents table, except contents_html -- like we said, HTML is not supposed to be indexed.

That's all that we need to start using Sphinx. After we make sure the searchd daemon is running, we can proceed with indexing the data. We call indexer with the name of the index:

$ indexer test

It should spit out some information about the indexing operation, and when it's done, we can do our first search:

$ search "plugin"

index 'test': query 'plugin ': returned 8 matches of 8 total in 0.002 sec

displaying matches:
1. document=19, weight=2713
2. document=44, weight=2694
3. document=20, weight=1713
4. document=2, weight=1672
5. document=1, weight=1640
6. document=13, weight=1640
7. document=27, weight=1601
8. document=28, weight=1601

Apparently, there are 8 documents in the Dancer documentation with the word plugin, and the one with the ID of 19 is the highest ranking result. Let's see which document that is:

mysql> SELECT title FROM documents WHERE id = 19;
+----------------------------------------------------+
| title                                              |
+----------------------------------------------------+
| Dancer::Plugin - helper for writing Dancer plugins |
+----------------------------------------------------+

It's the documentation for Dancer::Plugin, and it makes total sense that this is the first result for the word plugin. Sphinx setup is thus ready and we can get to the web application part of our little project.

The Basic Application

We'll start with a simple web application (let's call it DancerSearch) that just shows a search form, and then we'll extend it with more features. It will be using Dancer 2.0, and the Dancer::Plugin::Database plugin (we'll use it to access the documents database). The code below is the initial lib/DancerSearch.pm file:

package DancerSearch;

use Dancer 2.0;
use Dancer::Plugin::Database;

get '/' => sub {
    template 'index';
};

1;

We're also going to need a little startup script, bin/app.pl:

#!/usr/bin/env perl

use Dancer 2.0;
use DancerSearch;

start;

And a simple layout -- views/layouts/main.tt:

<!doctype html>
<html>
    <head>
    <title>Dancer Search Engine</title>
    <link rel="stylesheet" href="css/style.css">
    </head>
    <body>
    <h1>Dancer Search Engine</h1>
    <div id="content">
        [% content %]
    </div>
    </body>
</html>

And, of course, a template for our index page. For now it'll just contain a search form -- views/index.tt:

<form action="/" method="get">
    Search query:
    <input type="text" name="query">
    <input type="submit" value="Search">
</form>

Last but not least, we need a configuration file to tell our app which layout we want to use, and how to connect to our documents database using the Dancer::Plugin::Database plugin. This goes into config.yml:

layout: main

plugins:
    Database:
        driver:     mysql
        host:       localhost
        database:   docs
        username:   user
        password:   hunter1

We can now launch the application, and it will greet us with a search form. Which, unsurprisingly, doesn't work yet. Let's wire it up to Sphinx.

The Sphinx::Search CPAN Module

There is a CPAN module called Sphinx::Search that provides a Perl interface to Sphinx, and we're going to use it in our app. We put use Sphinx::Search in DancerSearch.pm, and add the following piece of code before the get '/' route handler:

# Create a new Sphinx::Search instance
my $sph = Sphinx::Search->new;
# Match all words, sort by relevance, return the first 10 results
$sph->SetMatchMode(SPH_MATCH_ALL);
$sph->SetSortMode(SPH_SORT_RELEVANCE);
$sph->SetLimits(0, 10);

This creates a new instance of Sphinx::Search (which will be used to talk to the Sphinx daemon and do the searches), and sets up a few basic options, such as how many results should be returned and in what order. Now comes the most interesting part -- actually performing a search in our application. We insert this chunk of code at the beginning of the get '/' route handler:

if (my $phrase = params('query')->{'phrase'}) {
    # Send the search query to Sphinx
    my $results = $sph->Query($phrase);

    my $retrieved_count = 0;
    my $total_count;
    my $documents = [];

    if ($total_count = $results->{'total_found'}) {
        $retrieved_count = @{$results->{'matches'}};
        # Get the array of document IDs
        my @document_ids = map { $_->{'doc'} } @{$results->{'matches'}};
        # Join the IDs to use in SQL query (the IDs come from Sphinx, so we
        # can trust them to be safe)
        my $ids_joined = join ',', @document_ids;

        # Select documents, in the same order as returned by Sphinx
        # (the contents of $ids_joined comes from Sphinx)
        my $sth = database->prepare('SELECT id, title FROM documents ' .
            "WHERE id IN ($ids_joined) ORDER BY FIELD(id, $ids_joined)");
        $sth->execute;

        # Fetch all results as an arrayref of hashrefs
        $documents = $sth->fetchall_arrayref({});
    }

    # Show search results page
    return template 'index', {
        phrase => encode_entities($phrase),
        retrieved_count => $retrieved_count,
        total_count => $total_count,
        documents => $documents
    };
}

Let's go through what is happening here. First, we check if there was actually a search phrase in the query string (params('query')->{'phrase'}). If there was one, we pass it to the $sph->Query() method, which queries Sphinx and returns the search results (the returned data structure is briefly explained in the description of the Query method in Sphinx::Search documentation).

We then check the number of results ($results->{'total_found'}), and if it's greater than zero, it means we found something and we need to retrieve the documents data from the database. Sphinx only returns the IDs of the matching documents (as shown earlier in the test search that we did using the command line), so we need to send a query to the database to get the actual data, such as document titles that we want to display in the results (note that we're using the ORDER BY FIELD construct in the SELECT query to maintain the same order as the list returned by Sphinx).

When we have the documents data ready, we pass it along with other information (such as the total number of results) to be displayed in our index template. But, hold on a second -- the template is not yet ready to display the results, it only shows the search form. Let's fix that now -- below the search form, we add the following code:

[% IF phrase %]
    <p>Search results for <strong>&quot;[% phrase %]&quot;</strong></p>

    [% IF total_count %]
    <p>
        Found [% total_count %] hits.
        Showing results 1 - [% retrieved_count %].
    </p>

    <ol>
    [% FOREACH document IN documents %]
        <li>
            <a href="/document/[% document.id %]">[% document.title %]</a>
        </li>
    [% END %]
    </ol>
    [% ELSE %]    
    <p>
        No hits -- try again!
    </p>
    [% END %]
[% END %]

This displays the phrase that was submitted, the number of hits, and a list of results (or a "no hits" message if there weren't any).

And you know what? We're now ready to actually do a search in the browser:

Neat, we have a working search application! We're just missing one important thing, and that is being able to access a document that was found. The results link to /document/:document_id, but that route isn't recognized by our app. No worries, we can fix that easily:

# Get the document with the specified ID
get '/document/:id' => sub {
    my $sth = database->prepare('SELECT contents_html FROM documents ' .
        'WHERE id = ?');
    $sth->execute(params->{'id'});
        
    if (my $document = $sth->fetchrow_hashref) {
        return $document->{'contents_html'};
    }
    else {
        status 404;
        return "Document not found";
    }
};

This route handler is pretty straightforward, we grab the ID from the URL, use it in a SELECT query to the documents table, and return the HTML contents of the matching document (or a 404 page, if there's no document with that ID).

The Complete Application

This is the complete DancerSearch.pm file:

package DancerSearch;

use Dancer 2.0;
use Dancer::Plugin::Database;
use HTML::Entities qw( encode_entities );
use Sphinx::Search;

# Create a new Sphinx::Search instance
my $sph = Sphinx::Search->new;
# Match all words, sort by relevance, return the first 10 results
$sph->SetMatchMode(SPH_MATCH_ALL);
$sph->SetSortMode(SPH_SORT_RELEVANCE);
$sph->SetLimits(0, 10);

get '/' => sub {
    if (my $phrase = params('query')->{'phrase'}) {
        # Send the search query to Sphinx
        my $results = $sph->Query($phrase);

        my $retrieved_count = 0;
        my $total_count;
        my $documents = [];

        if ($total_count = $results->{'total_found'}) {
            $retrieved_count = @{$results->{'matches'}};
            # Get the array of document IDs
            my @document_ids = map { $_->{'doc'} } @{$results->{'matches'}};
            # Join the IDs to use in SQL query (the IDs come from Sphinx, so we
            # can trust them to be safe)
            my $ids_joined = join ',', @document_ids;

            # Select documents, in the same order as returned by Sphinx
            my $sth = database->prepare('SELECT id, title FROM documents ' .
                "WHERE id IN ($ids_joined) ORDER BY FIELD(id, $ids_joined)");
            $sth->execute;

            # Fetch all results as an arrayref of hashrefs
            $documents = $sth->fetchall_arrayref({});
        }

        # Show search results page
        return template 'index', {
            phrase => encode_entities($phrase),
            retrieved_count => $retrieved_count,
            total_count => $total_count,
            documents => $documents
        };
    }
    else {
        # No search phrase -- show just the search form
        template 'index';
    }
};

# Get the document with the specified ID
get '/document/:id' => sub {
    my $sth = database->prepare('SELECT contents_html FROM documents ' .
        'WHERE id = ?');
    $sth->execute(params->{'id'});
        
    if (my $document = $sth->fetchrow_hashref) {
        return $document->{'contents_html'};
    }
    else {
        status 404;
        return "Document not found";
    }
};

1;

Conclusion

What we've built is still a very basic application, lacking many features -- the most obvious one that's missing is pagination, and being able to access results further down the list, not just the first ten. However, the code can be easily extended, thanks to the flexibility and ease of use of both Dancer and Sphinx. With a bit of effort, it can be made into an useful search app for a knowledge base site, or a wiki.

I think this application is a good example of how Dancer benefits from being part of the Perl ecosystem, giving web developers the ability to make use of the thousands of modules in CPAN (like we just did with Sphinx::Search). This allows to build working prototypes of web applications and implement complex features in a very short time.

The POD Extraction Script

As promised, this is the script that I used to extract the POD from Dancer distribution and store it in the MySQL database:

#!/usr/bin/env perl

package MyParser;

use strict;
use vars qw(@ISA);

use Pod::Simple::PullParser ();
BEGIN { @ISA = ('Pod::Simple::PullParser') }

use DBI;
use File::Find;
use Pod::Simple::Text;
use Pod::Simple::HTML;

# Variables to hold the text and HTML produced by POD parsers
my ($text, $html);
# Create parser objects and tell them where their output will go
(my $parser_text = Pod::Simple::Text->new)->output_string(\$text);
(my $parser_html = Pod::Simple::HTML->new)->output_string(\$html);

# Initialize database connection
my $dbh = DBI->connect("dbi:mysql:dbname=docs;host=localhost", "user",
    "hunter1")
    or die $!;

sub run {
    my $self = shift;
    my (@tokens, $title);

    while (my $token = $self->get_token) {
        push @tokens, $token;

        # We're looking for a "=head1 NAME" section
        if (@tokens > 5) {
            if ($tokens[0]->is_start && $tokens[0]->tagname eq 'head1' &&
                $tokens[1]->is_text && $tokens[1]->text =~ /^name$/i &&
                $tokens[4]->is_text)
            {
                $title = $tokens[4]->text;
                # We have the title, so we can ignore the remaining tokens
                last;
            }

            shift @tokens;
        }
    }

    # No title means no POD -- we're done with this file
    return if !$title;

    print "Adding: $title\n";

    $parser_text->parse_file($self->source_filename);
    $parser_html->parse_file($self->source_filename);

    # Add the new document to the database
    $dbh->do("INSERT INTO documents (title, contents_text, " .
        "contents_html) VALUES(?, ?, ?)", undef, $title, $text, $html);

    # Clear the content variables and reinitialize parsers
    $text = $html = "";
    $parser_text->reinit;
    $parser_html->reinit;
}

my $parser = MyParser->new;

find({ wanted => sub {
    if (-f and /\.pm$|\.pod$/) {
        $parser->parse_file($File::Find::name);
        $parser->reinit;
    }
}, no_chdir => 1 }, shift || '.');

You can run it with one argument, which is the location of the directory that will be scanned (recursively) for .pm/.pod files, or with no arguments, in which case the script will work with the current directory.

(Note: The script makes use of Pod::Simple, which I'm not very familiar with, so it's possible that I'm doing something stupid with it -- if that's the case, please let me know.)

Author

Michal Wojciechowski, <odyniec@odyniec.net>

blog comments powered by Disqus