While doing some work with SQL Servers FULL-TEXT SEARCH recently I came across some interesting gotchas relating to "noise words" which I have not experienced before, which is probably because it is quite rare that I use these feature.
This query
select * from myTable
WHERE CONTAINS (myColumn, 'the NEAR good NEAR person')
would result in this error
Execution of a full-text operation failed. A clause of the query contained only ignored words.
The word "and" is considered a NOISE WORD. These are common words that MS Search feels should be ignored, and not be included in any Catalog. The full list of NOISE WORDS is located at \Program Files\Microsoft SQL Server\MSSQL\FTDATA\SQLServer\Config . The file representing your language can be opened and modified with a text editor. Any changes made will be reflected on the next Catalog population.
Now while the reason for these noise words existence is mainly good, to avoid overhead on database searches etc, there are situations where you might need to include them in your search. Unfortunately the only way to do this is to remove the words from the noise file.
In my situation I needed to do full-text search on all words being near to each other to return results like amazon, so "the good person" would return the same results as "person the good".
The problem is that in this scenario if ANY word is a NOISE WORD, the query fails.
Another scenario where this occurs is in a <simple_term> where all the words are NOISE WORDS and all get stripped.
E.G.
select * from myTable
WHERE CONTAINS(myTable, 'any can come')
As "any", "can" and "come" are all noise words, this results in an empty query, and you get the above ignored words error.
So how can one avoid these errors.
1) You can trap the errors and display a friendly message. This is not very elegant and wont really tell your users why their search is nor working.
2) Or you can strip out all the "NOISE WORDS" from your search string prior to using it in a query. This way you can avoid any errors and can tell users in advance if they have used ignored words.
Here is a little CFFUNCTION I wrote to do this.
<cfFunction name="stripNoiseWords" output="no" hint="removes SQL noise words from search string and changes the delimiter to |">
<cfargument name="searchString">
<cfset strNoiseWords = "1|2|3|4|5|6|7|8|9|0|$|!|@|##|$|%|^|&|*|(|)|-|_|+|=|[|]|{|}|about|after|all|also|an|and|another|any|are|as|at|be|because|been|before|being|between|both|but|by|came|can|come|could|did|do|does|each|else|for|from|get|got|has|had|he|have|her|here|him|himself|his|how|if|in|into|is|it|its|just|like|make|many|me|might|more|most|much|must|my|never|now|of|on|only|or|other|our|out|over|re|said|same|see|should|since|so|some|still|such|take|than|that|the|their|them|then|there|these|they|this|those|through|to|too|under|up|use|very|want|was|way|we|well|were|what|when|where|which|while|who|will|with|would|you|your|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z">
<cfset result = "">
<cfloop list="#arguments.searchString#" delimiters=" " index="word">
<cfif NOT Listfind(strNoiseWords, word, "|")>
<cfset result = Listappend(result,word,"|")>
</cfif>
</cfloop>
<cfreturn result>
</cffunction>
For those that may not yet have experienced the joys of FULL-TEXT search on SQL server, below are some addiitonal details
CONTAINS At its simplest level, the key word CONTAINS behaves similar to LIKE. To begin, we will use CONTAINS for precise string matching. The differences between CONTAINS and LIKE will not be very evident here. Using the pubs Catalog created in Part 1, issue the query:
SELECT title_id, title, notes
FROM titles
WHERE CONTAINS (notes, ' "computer" ')
The statement returns three rows, each containing the word "computer" in the notes field.
The Catalog created on the pubs database contained two fields: notes, and title. These are the only two fields allowed for CONTAINS searching. If a query is issued on a field not included in the Catalog, SQL will return an error. Evidenced by searching on the "type" field below:
SELECT title_id, title, notes
FROM titles
WHERE CONTAINS (type, ' "computer" ')
This statement returns:
The syntax of the CONTAINS is the Cataloged field name, followed by the word or phrase to search for. Although only fields Cataloged can be searched on, any field in the table can be returned. Our first statement returned the title_id field. The same is true of joining a CONTAINS statement to another table. It follows the usual TSQL JOIN statement rules. This next query joins the CONTAINS statement to the titleauthor table, useful if author names were needed.
SELECT titles.title_id, titles.title, titles.notes, titleauthor.au_id
FROM titles INNER JOIN titleauthor ON titles.title_id = titleauthor.title_ID
WHERE CONTAINS (titles.notes, ' "computer" ')
The statement functions like a standard join returning the following.
The first query seemed very similar to a regular statement that could have been rewritten using the LIKE key word. However, if we run a LIKE statement, some differences will appear:
SELECT title_id, title, notes
FROM titles
WHERE notes LIKE '%computer%'
This LIKE statement returns five records rather than the three returned by the CONTAINS:
The difference is the plural "computers," returned by LIKE. The CONTAINS statement was only looking for an exact text match of "computer." Plurals were ignored. As a side note, CONTAINS is case insensitive. It sees no difference between "COmputERS" and "computers." To include the plurals, the CONTAINS is changed from what is known as a simple term search, to a prefix term search. This is accomplished by using the * sign. Change the original search statement to:
SELECT title_id, title, notes
FROM titles
WHERE CONTAINS (notes, '"computer*" ')
The query will now return the same five records as the LIKE statement. This leads to a discussion about the use of single and double quotes. The double quotes indicate the asterisk should look for zero, one, or more characters after the root word. Without the double quotes, such as:
SELECT title_id, title, notes
FROM titles
WHERE CONTAINS (notes, 'computer*')
The asterisk will now be treated as literal character to be found, rather than a wild card. Only the three records with the singular use of "computer" will be returned with this statement.
FORMSOF
Another way to find variants of a word is change our prefix term search into a generation term search by the use of a new key word: FORMSOF. FORMSOF is used in conjunction with CONTAINS to look for variants of the search word. To use FORMSOF, the original statement is changed to:
SELECT title, notes
FROM titles
WHERE CONTAINS(notes, 'FORMSOF (INFLECTIONAL, computer)')
This will return all five records, those with the word "computer," and the plural "computers." Again, we could obtain the same results by using a LIKE statement, but the real power of FORMSOF becomes evident when the word "good" is searched for. This next LIKE statement will not return any matching records:
SELECT title, notes
FROM titles
WHERE notes LIKE '%good%'
The word "good" does not appear in the notes field. However, this next FORMSOF statement will return one record:
SELECT title, notes
FROM titles
WHERE CONTAINS(notes, 'FORMSOF (INFLECTIONAL, good)')
The difference is that in addition to returning singular and plural forms of a word, FORMSOF will also include gender and neutral forms of nouns, verbs, and adjectives. In the above example, "best" is returned as a variant of "good."
NOISE WORDS
If we change the above statement around, and look for variants of the word "and" by using the following statement:
SELECT title, notes
FROM titles
WHERE CONTAINS(notes, 'FORMSOF (INFLECTIONAL, and)')
An error will be returned rather than the previously returned record:
Server: Msg 7619, Level 16, State 1, Line 1 Execution of a full-text operation failed. A clause of the query contained only ignored words.
As discussed at the start of this article the word "and" is considered a NOISE WORD.
<simple_term>
Specifies a match for an exact word (one or more characters without spaces or punctuation in single-byte languages) or a phrase (one or more consecutive words separated by spaces and optional punctuation in single-byte languages). Examples of valid simple terms are "blue berry", blueberry, and "Microsoft SQL Server". Phrases should be enclosed in double quotation marks (""). Words in a phrase must appear in the same order as specified in <contains_search_condition> as they appear in the database column. The search for characters in the word or phrase is case insensitive. Noise words (such as a, and, or the) in full-text indexed columns are not stored in the full-text index. If a noise word is used in a single word search, SQL Server returns an error message indicating that only noise words are present in the query. SQL Server includes a standard list of noise words in the directory \Mssql\Ftdata\Sqlserver\Config.
Punctuation is ignored. Therefore, contains(testing, "computer failure") matches a row with the value, "Where is my computer? Failure to find it would be expensive."