» Documentation

Freetext Documentation

Lucene integration overview

ScimoreDB encapsulates CLucene search engine, which is C++ port of Java Lucene text search engine API. For more information on CLucene go to: CLucene.

ScimoreDB has built-in SQL commands for querying and modifying Lucene full-text index. Current integration of full-text does not allow modifying Lucene index content directly, instead of Lucene index must be linked with DB table that will send all modifications as a single batch once transaction has been committed. Therefore, there can be a delay before full-text index reflects changes. The delay time depends of the amount of the changes made in DB table. For example: if SQL transaction updates 10 rows of single table, after commit, the Lucene index will receive a command batch of 10 deletes followed by 10 inserts, this will take less than a millisecond. If update batch is bigger, the Lucene indexing takes more time respectively. There are no limitations how many rows can be updated, since the batch for Lucene full-text index uses DB page pool for storage and cannot run out memory.

In order to perform full-text indexing you need:

  • Create full-text index
  • Link index with DB table

Create full-text index

Syntax:

CREATE FREETEXT INDEX index_name
(column_definition [, ...]);

column_definition syntax : 

column_name type_name
[STORE|UNSTORE]
[UNINDEX]
[KEYWORD]


type_name syntax : 
TINYINT | SMALLINT | INT | INTEGER | BIGINT | DOUBLE | FLOAT | DATETIME | CHAR | VARCHAR | TEXT | NCHAR | NVARCHAR | NTEXT 

Creates a new Lucene full-text index. index_name specify the name of the index in database. To specify the database

use "USE db" command before CREATE statement or set DB name in index name: database.index_name.

Column Definition

Each column must have at least a name and a type listed above.

  • STORE If specified and a column is not TEXT type, the value of the column will be indexed and stored in full-text

index. We not recommend to store all columns in index. Usually, to store unique fields, used uniquely identify table's row,

for example primary key field(s).

  • UNSTORE If specified, index the column values and allow search, sort. However, the full-text search results

contain NULL values for the field, since they was not stored.

  • UNINDEX The field will be stored in full-text index, but not indexed, therefore not searchable. Usually, we store

unique tables fields, for instance primary key fields. The fields have no interest in full-text search query, but they are

needed to join the free text results with the linked DB table. Read more in Querying full-text section.

  • KEYWORD When present, the value will not be tokenized / split. All not character fields is auto set to use

KEYWORD. The string http://www.server.com will only be searchable by entire string, www, http result in nothing. Sorting and range queries can only be applied on fields that has been specified with KEYWORD.

Type conversion

All data in Lucene internally stored as text. Therefore, the data types like: TINYINT | SMALLINT | INT | INTEGER | BIGINT | DOUBLE | FLOAT | DATETIME will be converted into a text representation that can be ordered alphabetically. Numbers will be padded with the leading zeros, so that the numbers has the same width and sort order is the same as the converted number's order. DATETIME are converted using: YYYYMMDD format.

Setting up stop words

Stop words allows skipping from indexing special words that is meaningless for the search output. Such words are: and, or, is, the, or just a single character. By default, full-text index uses English stop words. Developers can customize stop words. Custom stop words have to be enlisted in stopwords.txt file, where each word must be separated by newline. Copy the file to full-text index folder. The folder location can be obtained from system.systables tables column path. You must to re-index Lucene index after changing stop words. Although, it is not necessary to do so, if full-text index is empty. How to re-index the index, please read in [Freetext#Rebuilding Index] section.

An example:

use debate
create freetext index debate_search
(
postid int unindex,
threadid int unindex unstore,
dateline varchar unstore,
title varchar unstore,
pagetext text unstore
) 

Linking full-text index with a DB table

The full-text indexes data source is the table. All data modifications and inserts applied on the table will be transparently committed to the full-text index for indexing/deleting.

There are 2 ways to link the full-text index with the table.

First, during tables creation:

create table debate
(
postid int not null primary key,
threadid int,
dateline varchar,
title varchar,
pagetext text,
constraint debate_table_search freetext(postid,threadid,dateline,title,pagetext)
references debate_search(postid,threadid,dateline,title,pagetext)
) 

In the example above we create full-text constraint debate_table_search which maps the table's fields

(postid,threadid,dateline,title,pagetext) to Lucene index debate_search fields (postid,threadid,dateline,title,pagetext).

Note: Currently the table's number of free text searchable columns must match the number of columns in the full-text index; however, the names and types can be different. If data type does not match, SQL optimizer will automatically perform conversion.

Second, linking full-text index with the existing table:

alter table debate
(
add constraint debate_table_search freetext(postid,threadid,dateline,title,pagetext)
references debate_search(postid,threadid,dateline,title,pagetext)
) 

Once linking complete, the content of debate_search will be truncated and repopulated with the content from debate table.

Unlinking full-text index

To unlink full-text index, drop table's FREETEXT constraint. For instance, to unlink debate_search full-text index from

table debate:

alter table debate
(
drop constraint debate_table_search
) 

Unlink operation does not affect the content of the full-text index; however, new changes to the DB table will not be

populated to the index.

Removing full-text index

To remove full-text index, first, unlink the index from the table. Then, use DROP TABLE statement:

drop table debate_search 

Alter full-text index

Changing full-text index layout requires complete index re-creation. In order to do so, follow the sequence of the steps:

Truncate full-text index

Full-text index content truncate syntax is identical to the table's truncation command.

Syntax:

TRUNCATE TABLE fulltext_index_name 

The Lucene full-text index will be re-created and all indexed data lost. From this point, the current DB table's content

will not be found in a full-text search. However, new changes will appear in full-text index. The full-text truncation technique could be applied for performance gain to minimize the size of the index. For example, we truncate old data, which stays in the DB table as so-called archived objects, but is not indexed. Only new/fresh changes are indexed.

Full-text index rebuild

Lucene full-text index isn't recovery safe, which mean, that if during index modification, ScimoreDB server exists

unexpectedly, Lucene index might become corrupted? Luckily, the index can be re-indexed again from the source (linked DB

table). The rebuild index procedure combines unlinking and linking again the full-text index. The following example will

rebuild full-text index linked with debate table:

alter table debate
(
drop constraint debate_table_search,
add constraint debate_table_search freetext(postid,threadid,dateline,title,pagetext)
references debate_search(postid,threadid,dateline,title,pagetext)
) 

Syntax

EXECUTE FREETEXT (

SELECT {* | expression [[AS] alias] [, ...]]}
 FROM index_name
 [WHERE search_expression]
 [RESTRICT offset, limit]
 [ORDER BY expression [ASC | DESC] [, ...]]
)

search_expression syntax :
expression {=} {expression | (sub_select)}
expression BETWEEN lower_expression AND upper_expression
expression IN (sub_select)
search_expression {AND | OR} search_expression 

EXECUTE FREETEXT (¦) code block executes SQL SELECT statement where source is the full-text index referred by

index_name.

Search Clauses

The WHERE clause will narrow the results of the selected index to whever the search_expression is true. Only {=} comparison operator is allowed. The text value of the condition may have a string to match or Boolean expression using operators below we illustrate various search terms:

Condition

Action

title = "hello"

Match all documents containing string hello in indexed title field

title = "hello world"

Match all documents containing hello or world

title ="+hello +world"

Match all documents containing hello and world

title = "hello world"

Match all documents containing hello world phrase

title="+hello -world"

Match all documents containing hello but not word

It is possible to construct Boolean queries with OR, AND SQL operators. The SQL optimizer internally will convert SQL

Boolean expression to Lucene's syntax. For instance:

SQL expression

Converted Lucene expression

(title="scimore" or title="lucene") and title="DB"

(scimore lucene) and DB

The IN search term will return documents that matches at least single string in the IN set. For instance:

SQL expression

Equivalent Lucene expression

title IN("scimore","lucene")

scimore lucene

BETWEEN queries searches for the documents in the range between starting term through an ending term. The range query can be applied only on those fields that content has not been tokenized i.e. defined with KEYWORD prefix in the full-text index create command. If field type is number it will be padded with zeros to preserve correct sort order or if DATETIME will be converted to YYYYMMDD:

SQL expression

Equivalent Lucene expression

postID BETWEEN 0 AND 100

postID:[0000000000 TO 0000000100]

Restrict/Limit

The RESTRICT offset, limit clause is used to restrict the number of matched documents limit with an offset from the

beginning of offset rows.

Ordering

Ordering is allowed on fields that content has not been tokenized.

Total number of matches

Number of matched documents can be obtained using {#TotalCount} field in full-text index:

use debate
execute freetext
(
select postid,[#TotalCount] from debate_search where
pagetext='hello world'
restrict 0,10
order by dateline desc
) 

Get matched document's content

Full-text index does not store the content of TEXT (BLOB) fields, since it has been already stored in the linked DB table.

To access the data of those fields you must to join the matched documents with linked table's rows. Therefore, when designing the full-text index it is important to add (but not necessary index) primary key(s) of the linked table. In our debate example we use postId field, which is primary key of the debate table, to join full-text output with the linked table.

Create full-text index:

create database debate
go;
use debate
create freetext index debate_search
(
postid int unindex,               -- this is a primary key of debate table, do not index, but only store
threadid int unstore,
dateline varchar untore,
title varchar unstore,
pagetext text unstore
)
go;
create table debate
(
postid int not null primary key,
threadid int,
dateline varchar,
title varchar,
pagetext text,
constraint debate_search freetext(postid,threadid,dateline,title,pagetext)
references debate_search(postid,threadid,dateline,title,pagetext)
) 

Search and join matched documents with the linked table:

use debate
select t2.* from (
execute freetext
(
select postid from debate_search where
pagetext='hello world'
restrict 0,10
order by dateline desc
)) t1 inner join debate t2 on (t1.postid=t2.postid) 

Scale-out Lucene full-text

To increase both query and indexing performance, you may choose from 2 approaches of scaling out techniques:

Replication

Replicate the table's content among DB instances. The data is replicated and identical on each DB instance. The full-text query can be answered by any single DB instance reducing time and network resources. Using the replication approach you minimize the number of search requests to single server / db instance. However, the modifications have to be applied on all db instances, which mean that disk, memory and CPU resources are not used optimally. The approach is similar to master-slave replication. In order, to replicate the table do not set PARTITION keyword when creating DB table.

Data distribution/partitioning

If partitioning column set, the DB table's rows will be distributed uniformly. Then each full-text index on separate DB server maintains own set of documents. Disk, memory and CPU can benefit greatly from smaller dataset. Speed of data modifications will increase linear. The searches increases too.

Example

Indexing data from MySQL:

execute dql
begin
open_oledb 'Driver={MySQL ODBC 3.51 Driver}; option=1048576; Server=mysql; Database=debat; Uid=myuser; Pwd=Mypassword,

'select postid, threadid, dateline, title, pagetext from post'

to @Results
project by postid, threadid, title, pagetext

end 

option=1048576 declare forward only cursor, otherwise the MySQL ODBC object will run out of memory if dataset is too big.

Create schema and insert rows from MySQL:

begin tran
--create table and full-text index
create database debate
go;
use debate
create freetext index debate_search
(
postid int unindex,
threadid int,
dateline varchar unstore,
title varchar unstore,
pagetext text unstore
)
go;
create table debate
(
postid int not null primary key partition,
threadid int,
dateline varchar,
title varchar,
pagetext text,
constraint debate_search freetext(postid,threadid,dateline,title,pagetext)
references debate_search(postid,threadid,dateline,title,pagetext)
)
go;
-- insert and index rows from MySQL
execute dql
begin
open_oledb 'Driver={MySQL ODBC 3.51 Driver}; option=1048576; Server=mysql; Database=debat; Uid=user; Pwd=password,

'select postid,threadid, dateline,title,pagetext from post'

to @T1
project by postid,threadid, title ,pagetext

insert_bulk @T1,'debate.debate'
set postid = @T1.postid,
threadid = @T1.threadid,
dateline = @T1.dateline,
title = @T1.title,
pagetext = @T1.pagetext
end

commit 

Searching with full-text index:

use debate
select t2.postid,t2.threadid,t2.dateline,t2.title,t2.pagetext from (
execute freetext
(
select postid from debate_search where
pagetext='int'
restrict 0,10
order by dateline desc
)) t1 inner join debate t2 on (t1.postid=t2.postid) 

More examples

Indexing DVD movies