Basic DART Search Syntax
This chart serves as a guide to the basic search syntax for DART.
DART Syntax |
Description |
Example Search Term |
Exact Match |
||
|
By default, the DART search matches the exact text. Searching for a word or a phrase returns exact matches of that word or phrase. Different forms of the same word or phrase, spelling corrections/variations, and conceptually related text are not matched. |
advertise => Matches the exact text "advertise" within a document. Does not match text such as "advertising," "advertised," "advertisement," "marketing," "sales," etc. |
Boolean Operators |
||
OR |
Returns documents that satisfy either or both of the elements on either side of the operator. Alternative syntax: brackets (see below) |
budget OR cost => Returns documents that contain just budget, just cost, or both budget and cost anywhere in the document. |
AND |
Returns documents that satisfy both elements on either side of the operator. |
budget AND cost => Returns documents that contain both budget and cost anywhere in the document. |
NOT |
Returns documents that satisfy the element on the left of NOT and that do not satisfy the element on the right of NOT. |
budget NOT proposal => Returns documents that contain budget and that do not contain proposal. Does not return documents that contain both budget and proposal (or that contain proposal and not budget). |
DART Syntax |
Description |
Example Search Terms |
Brackets (alternative syntax for OR) |
||
{ } |
Contains list of optional variants. Alternative syntax: OR |
bank{s} => Returns the same documents as the search term bank OR banks. Does not match banked or banking. |
[ ] |
Contains list of mandatory variants. Alternative syntax: OR |
sav[e, es, ed, ing] => Returns the same documents as the search term save OR saves OR saved OR saving. Does not match sav or savings. gr[a, e ] y => Returns the same documents as the search term gray OR grey. Does not match gry or graey.
[budget, cost, proposal] => Returns the same documents as the search term budget OR cost OR proposal. Does not match budgets or proposals. |
|
|
sav[e{s, d}, ing{s}] => Returns the same documents as the search term save OR saves OR saved OR saving OR savings. Does not match sav. { re } buil[ d, t ] => Returns the same documents as the search term build OR built OR rebuild OR rebuilt. Does not match buil or rebuilding. [budget { s } , cost { s } , proposal { s }] => Returns the same documents as the search term budget OR budgets OR cost OR costs OR proposal OR proposals. Does not match budgeting or costly. |
DART Syntax |
Description |
Example Search Terms |
Wildcard Characters and Stemming |
||
* |
Unlimited wildcard: Matches zero or more characters. May be used only at the end of a word that consists of four or more letters. |
bank* => Matches bank, banks, banked, banking, bankers, bankrupt, bankruptcies, bankable, bankofamerica.com, bank2000, Bankwitz, etc. |
? |
Single-character wildcard: Matches exactly one character (a letter or a number). May be used at the beginning or end of a word, or embedded within a word. |
s??n => Matches soon, seen, shun, etc. Does not match shining, sin, or sn. |
+ |
Digit wildcard: Matches exactly one digit. May be used at the beginning or end of a word, or embedded within a word. |
1++ => Matches 100, 122, 159, etc. Does not |
~ |
Stemming: Matches the predictable grammatical variations of a root word. |
bank~ => Matches bank, banks, banked, banking, and bankings. |
DART Syntax |
Description |
Example Search Terms |
Proximity Operators |
||
w/N |
|
[budget, cost*] w/10 proposal* => Matches text such as:
[budget*, cost*] w/10 proposal* => Same search term as above with alternative syntax. |
f/N |
Uni-directional proximity: Matches text where element A is followed by element B, within N or fewer tokens. |
[blue, white] f/3 card{s} => Matches text such as The blue and green cards are offered to preferred customers. |
window( ) |
Window proximity: Matches text where the specified elements occur within a window of N or fewer tokens. If you do not specify a window number, it is 100. |
window(size = 10, account{s}, status, David f/1 Lee) => Matches text such as What is the status of David M. Lee's account? |
Tokens
The proximity operators count tokens, not words. Words and tokens are similar, but they are not the same. The search index defines a token as a word containing letters and/or numbers. Tokens are separated from each other by white space, and the search index defines all non-alphanumeric characters (punctuation, symbols, special characters, etc.) as white space. The token count can be different from the word count within a string of text. For example, the following string of text contains 19 words:
"Hi! Please email this to John at jmartin@enron.com (he's out of the office now, but he'll be back soon)."
The search index tokenizes this text into 23 tokens for search:
"Hi Please email this to John at jmartin enron com he s out of the office now but he ll be back soon"
DART Syntax |
Description |
Example Search Terms |
Metadata Operator |
||
m/<field> <function><value> ---------------- m/<field> : <value> m/<field> = <value> m/<field> > <value> m/<field> >= <value> m/<field> < <value> m/<field> <= <value> m/<field> >< <value> m/<field> = empty m/<field> != empty |
Returns documents where the specified value matches the value in the specified field. Functions: contains (:), equals (=), greater than (>), greater than or equal to (>=), less than (<), less than or equal to (<=), between and including (><), is empty (= empty), is not empty (!= empty). |
m/title: budget => Returns documents that contain the text budget in the title field. Also returns documents that contain additional text in the title field, such as November budget or monthly budget meeting. m/title = budget => Returns documents where the full text in the title field is budget. Does not return documents that contain additional text in the title field, such as November budget or monthly budget meeting. m/text_size > 750 => Returns documents where the text_size is larger than 750 bytes. m/date_created >< 1/1/2010 and 12/31/2011 => Returns documents where the date_created is on or after Jan 1, 2010 and is on or before Dec 31, 2011. m/file_name = empty => Returns documents where the file_name field is empty. |
Quotes |
||
|
|
"forget me not my love" => Returns documents that contain the phrase forget me not my love. versus forget me not my love => Returns documents that contain the phrase forget me and that do not contain the phrase my love. versus m/title = empty => Returns documents where the title field is empty. |
|
References a previously executed search from the History tab in DSR. This can be run on its own, or it can be used as part of another search term. |
tax{es} /10 fil[e, ing*] NOT #14 => #14 represents the 14th search term in the DSR History tab. |
DART Syntax |
Description |
Example Search Terms |
Grouping |
||
|
|
(budget* AND cost*) OR proposal* => Runs as:
budget* AND (cost* OR proposal*) => Runs as:
|
Operator Precedence
The default precedence for DART operator syntax is:
- Grouping (parentheses)
- OR, brackets
- Uni-directional Proximity
- Bi-directional Proximity (smaller first, or if the same size, left to right)
- AND
- NOT
Everything that follows the metadata operator is included in the scope of the m/ operator, unless grouping is used. For example:
- m/author: Meyers OR Lee => Returns documents that contain "Meyers" or "Lee" in the author field.
versus
- (m/author: Meyers) OR Lee => Returns documents that contain "Meyers" in the author field, or that contain "Lee" anywhere else in the document.
As a best practice, we recommend using grouping every time a single search term contains more than one of the following operators: OR, AND, NOT, w/<n>, f/<n>, m/<field>. The default precedence for DART operations can be easy to forget. If you make a mistake regarding operator precedence, your search term may not perform the way you intend. This kind of mistake can easily go unnoticed, and it can have negative consequences for engagement workflow and deliverables. Grouping helps ensure that your search term performs the way you intend. It also helps another reader quickly understand the intention of your search term.
Search Term Expansions
Expansions represent all the different ways a search term can match or "hit" a document. For example:
(m/author: Meyers) OR Lee generates two expansions:
- m/author: Meyers
- Lee
budget{s} OR cost{s} OR financ* generates five expansions:
1. budget
2. budgets
3. cost
4. costs
5. financ*
Grouping can change the expansions generated by a term. This can change the way a search term hits a document, which can change the search results. For example:
market* w/10 (segment* w/5 young*) generates four expansions:
1. market* f/10 segment* f/5 young*
2. market* f/10 young* f/5 segment*
3. segment* f/5 young* f/10 market*
4. young* f/5 segment* f/10 market*
versus
(market* w/10 segment*) w/5 young* generates four slightly different expansions:
- market* f/10 segment* f/5 young*
- segment* f/10 market* f/5 young*
- young* f/5 segment* f/10 market*
- young* f/5 market* f/10 segment*
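The two listings above can be reproduced with a short sketch. It assumes, based only on the examples in this section, that each bi-directional operator (w/N) expands into the two uni-directional orderings (f/N) of its operands. The nested-tuple representation and the function name `expand` are illustrative choices and are not part of DART or DSR.

```python
def expand(term):
    """Expand nested bi-directional proximity into uni-directional orderings.

    A term is either a string (a leaf such as "market*") or a tuple
    (left, n, right) that stands for `left w/n right`.
    Assumption: each w/n yields "left f/n right" and "right f/n left".
    """
    if isinstance(term, str):
        return [term]
    left, n, right = term
    left_texts = expand(left)
    right_texts = expand(right)
    results = []
    # Orderings where the left operand comes first.
    for left_text in left_texts:
        for right_text in right_texts:
            results.append(f"{left_text} f/{n} {right_text}")
    # Orderings where the right operand comes first.
    for right_text in right_texts:
        for left_text in left_texts:
            results.append(f"{right_text} f/{n} {left_text}")
    return results


# market* w/10 (segment* w/5 young*)
for expansion in expand(("market*", 10, ("segment*", 5, "young*"))):
    print(expansion)

# (market* w/10 segment*) w/5 young*
for expansion in expand((("market*", 10, "segment*"), 5, "young*")):
    print(expansion)
```

As in the listings above, the printed expansions are flattened and no longer show the grouping, so the same text (for example market* f/10 segment* f/5 young*) can appear for both terms even though the grouping differs.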
Viewing Expansions in DSR
To view the expansions of a search term before you run it, write or paste the term into the DSR search bar, then press the Expand button under the search bar. A window will open that displays all expansions generated by the search term. This helps ensure you understand the actual expansions (which indicate how the search term will run) of your search term before you run it.
Note: Expansions are not generated for the window proximity operator. During search, this operator performs differently than the others in a manner that does not lend itself to expansions.
Word Variation: Boolean OR, Brackets, Wildcards, and Stemming
DART syntax provides four different ways to add variation to a word: the Boolean OR operator, brackets, wildcards, or stemming. There are advantages and disadvantages to each option.
Boolean OR
The Boolean OR operator provides a straightforward way to specify the exact variation you want to search for:
save OR saves OR saved OR saving
The main advantage of using Boolean OR is that the operator is easy to learn and use. However, with Boolean OR, search terms can quickly become long, time-consuming to write, and difficult to read.
Brackets
Brackets are an alternative to OR that allow you to specify the exact word variation you want to search for in a way that is more compact and less redundant than using Boolean ORs. Using bracket syntax, you can embed the variation directly within the word, and you can separate lists of variants by commas rather than ORs.
sav[e{s, d}, ing] versus save OR saves OR saved OR saving
[budget, cost, proposal] versus budget OR cost OR proposal
In most cases, brackets are recommended instead of Boolean OR because they are a faster, easier, and more reader-friendly way to add precise variation to a word or concept within a search term. Both types of brackets can be used anywhere in a word, and square brackets can be used for a list of words or phrases:
gr[e, a]y versus grey OR gray
{re}buil[d, t] versus build OR built OR rebuild OR rebuilt
[ad{s, vertisement{s}}, commercial{s}, market segment{s}] versus ad OR ads OR advertisement OR advertisements OR commercial OR commercials OR market segment OR market segments
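To make the bracket rules concrete, the following sketch expands a bracketed term into the literal words or phrases it stands for. It is not DART's parser. It assumes the compact writing style (no spaces between a word and its brackets), trims white space around each comma-separated variant, and treats a blank-only variant such as { } as a single space. The function names are hypothetical.

```python
def expand(term):
    """Return every literal string a bracketed term stands for, in order."""
    results, _ = _sequence(term, 0, closers="")
    return results


def _clean(variant):
    """Trim a variant; a blank-only variant such as { } stands for one space."""
    stripped = variant.strip()
    if stripped or not variant:
        return stripped
    return " "


def _sequence(text, pos, closers):
    """Read literal characters and bracket groups until a closer is reached."""
    results = [""]
    while pos < len(text) and text[pos] not in closers:
        char = text[pos]
        if char in "{[":
            close = "}" if char == "{" else "]"
            options, pos = _variants(text, pos + 1, close)
            if char == "{":
                options = [""] + options  # curly brackets: the variant may be absent
            results = [r + o for r in results for o in options]
        else:
            results = [r + char for r in results]
            pos += 1
    return results, pos


def _variants(text, pos, close):
    """Read a comma-separated list of variants up to the closing bracket."""
    options = []
    while True:
        parts, pos = _sequence(text, pos, closers="," + close)
        options.extend(_clean(part) for part in parts)
        if pos >= len(text):
            raise ValueError("unbalanced bracket in search term")
        if text[pos] == close:
            return options, pos + 1
        pos += 1  # skip the comma and read the next variant


print(expand("sav[e{s, d}, ing{s}]"))  # ['save', 'saves', 'saved', 'saving', 'savings']
print(expand("{re}buil[d, t]"))        # ['build', 'built', 'rebuild', 'rebuilt']
```

Running a term through a helper like this before searching is one way to double-check that a nested bracket expression lists exactly the variants you intended.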
Wildcards
A wildcard operator is a special symbol that matches one or more other characters in the text. Wildcards allow you to add variation to a word in a way that is faster and easier than using brackets or Boolean OR. However, wildcards do not let you specify or control the exact variation that is or is not captured, and they may capture unanticipated and/or undesired variation. For example:
sav* => save, saves, saved, saving, savior, saviors, savage, savages, Savannah, savvy, savory, SAV, etc.
In addition to capturing too much and/or unintended variation, wildcards (especially the unlimited wildcard) can be very computationally expensive. They can result in slow searches that drain system resources (see the Computationally Expensive Searches section below for more details).
It is important to exercise caution and use your discretion with wildcards, especially with the unlimited wildcard. Use the unlimited wildcard only at the end of words with four or more letters; in other situations, consider using brackets, stemming, the single-character wildcard, or the digit wildcard.
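One way to see how much a wildcard can capture is to translate it into a regular expression and test it against individual tokens. The sketch below does this for the three wildcard characters described in this article. The translation table is an approximation for illustration, not DART's implementation, and it assumes tokens consist of ASCII letters and digits.

```python
import re

# Approximate regex equivalents of the DART wildcard characters (assumption).
WILDCARDS = {
    "*": "[a-z0-9]*",  # unlimited wildcard: zero or more characters
    "?": "[a-z0-9]",   # single-character wildcard: one letter or number
    "+": "[0-9]",      # digit wildcard: exactly one digit
}


def to_regex(word):
    """Translate a search word containing wildcards into a compiled regex."""
    pattern = "".join(WILDCARDS.get(char, re.escape(char)) for char in word)
    return re.compile(pattern, re.IGNORECASE)


def matches(word, token):
    """Return True if the whole token is matched by the search word."""
    return to_regex(word).fullmatch(token) is not None


tokens = ["save", "saving", "Savannah", "savvy", "SAV", "salve"]
print([t for t in tokens if matches("sav*", t)])
# ['save', 'saving', 'Savannah', 'savvy', 'SAV']
```

Testing a wildcard against a sample word list in this way gives a rough preview of unintended matches, such as Savannah and savvy for sav*.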
Stemming
Stemming identifies the predictable grammatical variations associated with a root word. Like wildcards, stemming allows you to add variation to a word in a way that is faster and easier than using brackets or Boolean OR, but it does not let you specify or control the exact variation that is or is not captured.
save~ => save, saves, saved, saving, savings, savingly
Stemming uses an algorithm that predicts morphological variations of a root word. The algorithm is good, but it is not perfect. Stemming may not always include all desired morphological variations of a word. For example, irregular word forms may be missed, or noun forms may not be included for words deemed to be a verb and vice versa.
Use the Expand button under the DSR search bar to see which variations are included when adding stemming to a particular word. If the stemming variations do not meet the needs of the search at hand, consider using brackets to specify the exact word variation you want to search for.
Document Body vs. Metadata
A document can be thought of as having two different parts: body and metadata. The body contains the content of the document (it's often referred to or thought of as just "the document" or "the text"). The metadata contains information about the document, such as its author or file size ("data about data").
Both the body and metadata can be thought of as "fields" of a document. The body is one field, and the metadata consists of numerous distinct fields. These fields contain text which can be searched. Typical document fields include: body, document ID, author, recipient, filename, file-type, size, etc. For certain documents, certain fields may be empty. The naming and reliability of metadata fields will vary from project to project.
Note: In DSR, the metadata fields listed above the document body are normally an incomplete list of fields. To see all the metadata fields for a document, press the "Show Metadata" button at the top of the document.
The Default Search Stream
The default search stream is the set of document fields (body and metadata) that all search terms will run over by default, that is, all search terms that do not use a metadata operator.
The idea is that "by default" (when no metadata operator is used) we want most search terms to run over the document body, as well as some (but not all) metadata fields. For example, if you are searching for everything about "Project Troy", you probably care about documents where "Project Troy" is part of the filename, even if "Project Troy" isn't anywhere in the body of the document as well. On the other hand, there are many metadata fields we do not want most search terms to run over by default. For example, if you are searching for a key business deal with Microsoft, you do not want your search terms to return documents because they hit "Microsoft" in the file-type field.
The default search stream is configured at the beginning of an engagement, and it can differ from project to project. The typical default search stream will contain the following fields:
- Author/Sender
- Recipient/Copy/Blind Copy
- Filename/Title/Subject
- Attachment Name
- Document Body
The search term m/author: John Smith will only run over the author field. It will return only those documents that contain "John Smith" in the author field, regardless of whether "John Smith" appears in the body. However, the search term John Smith will run over the document body as well as fields such as author, recipient, copy, filename, etc., but it will not run over fields such as file-type, file-path, custodian, size, etc. It will return documents that contain "John Smith" in any of the fields in the default stream.
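The difference between a metadata search and a default-stream search can be sketched as follows. The field names, the sample document, and the `hits` function are hypothetical, and simple case-insensitive substring matching stands in for DART's token-based matching.

```python
# Hypothetical default search stream; the real one is configured per project.
DEFAULT_STREAM = ["author", "recipient", "filename", "attachment_name", "body"]


def hits(doc, phrase, field=None):
    """Return True if phrase occurs in the given field, or in any
    default-stream field when no field is given."""
    fields = [field] if field else DEFAULT_STREAM
    wanted = phrase.lower()
    return any(wanted in doc.get(name, "").lower() for name in fields)


doc = {
    "author": "Jane Doe",
    "recipient": "John Smith",
    "filename": "Project Troy.docx",
    "file_type": "Microsoft Word",
    "body": "Please review the attached budget.",
}

print(hits(doc, "John Smith"))                  # True: found in the recipient field
print(hits(doc, "John Smith", field="author"))  # False: the author is Jane Doe
print(hits(doc, "Microsoft"))                   # False: file_type is not in the default stream
```

In this sketch, passing `field` plays the role of the m/ operator: it narrows the search to one field, and it is also the only way to reach a field that sits outside the default stream.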
In DSR, to see which fields are in the default search stream for a given project, go to File in the DSR menu bar, choose Metadata Info, and sort the window that opens by the "Default Search Stream" column.
Case Sensitivity
By default, DART syntax is not case sensitive. For example, the following search terms do not contain case sensitive operators; therefore, they all match text without sensitivity to case, and they all return exactly the same results:
- BUDGET AND COST
- Budget AND Cost
- budget and cost
Similarly, DART search is not case sensitive. For example, the search term market* w/10 children will hit all of the following strings of text:
- Marketing regulations are needed to protect children.
- So many cHildrEn at the markeT today.
- MARKETSEGMENT CHILDREN
However, we encourage the use of some capitalization in search terms. This helps ensure your search terms are easy for others to quickly read and review. For example, it can be helpful to capitalize Boolean operators (they can easily blend in as a search word when not capitalized). It can also help to capitalize proper nouns and acronyms to better convey the intention of the search term to another reader:
- Tobacco AND [WHO, World Health Organization]
In situations where you need DART syntax to be case sensitive, the following operators allow you to run searches with sensitivity to case.
DART Syntax |
Description |
Example Search Terms |
allcaps( ) |
Matches when all letters in the word or phrase are upper case. |
allcaps(SAT) => Matches the text SAT. Does not match sat, Sat, SaT, etc. |
firstcap( ) |
Matches when the first letter in the word (or the first letter in each word of a phrase) is upper case and all other letters are lower case. This operator does not match words that contain only one letter, such as A or A30 (the allcaps operator must be used to capture A or A30). |
firstcap(Mark Jones) => Matches the text Mark Jones. Does not match mark jones, MARK JONES, maRk jOneS, etc. |
nocaps( ) |
Matches when all letters in the word or phrase are lower case. |
nocaps(abc100) => Matches the text abc100. Does not match Abc100, ABC100, aBc100, etc. |
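The three case-sensitive operators can be read as simple tests on the letters of the matched text. The sketch below follows the descriptions in the table above. It is an illustration, not DART's implementation, and it assumes that digits are ignored when checking case and that text with no letters at all never matches.

```python
def _letters(text):
    """Return only the alphabetic characters of text."""
    return [char for char in text if char.isalpha()]


def allcaps(text):
    """True when every letter is upper case (digits are ignored)."""
    letters = _letters(text)
    return bool(letters) and all(char.isupper() for char in letters)


def nocaps(text):
    """True when every letter is lower case (digits are ignored)."""
    letters = _letters(text)
    return bool(letters) and all(char.islower() for char in letters)


def firstcap(text):
    """True when each word starts with an upper-case letter followed only by
    lower-case letters. Words with a single letter (A, A30) never match."""
    words = text.split()
    if not words:
        return False
    for word in words:
        letters = _letters(word)
        if len(letters) < 2:
            return False
        if not letters[0].isupper():
            return False
        if not all(char.islower() for char in letters[1:]):
            return False
    return True


print(allcaps("SAT"), firstcap("Mark Jones"), nocaps("abc100"))  # True True True
print(firstcap("A30"), allcaps("A30"))                           # False True
```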
Punctuation and Special Characters
All non-alphanumeric characters in the text of a document are treated as a space during search. Therefore, they do not need to be accounted for in DART search terms. For example, the search term john smith will hit all of the following strings of text:
- john smith
- john_smith
- john.smith
- john/smith
- john@smith
- john(smith)
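The behavior above follows from how text is split into tokens. As a minimal sketch, assuming tokens are runs of ASCII letters and digits and that matching ignores case, tokenization and phrase matching could look like this. The function names are illustrative, not part of DART.

```python
import re


def tokenize(text):
    """Split text into lower-cased tokens: runs of letters and/or digits.
    Every other character acts as white space."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def contains_phrase(text, phrase):
    """True if the tokens of phrase occur consecutively in text."""
    tokens, wanted = tokenize(text), tokenize(phrase)
    if not wanted:
        return False
    last_start = len(tokens) - len(wanted)
    return any(tokens[i:i + len(wanted)] == wanted for i in range(last_start + 1))


variants = ["john smith", "john_smith", "john.smith", "john/smith", "john@smith", "john(smith)"]
print(all(contains_phrase(v, "john smith") for v in variants))  # True

sentence = ("Hi! Please email this to John at jmartin@enron.com "
            "(he's out of the office now, but he'll be back soon).")
print(len(sentence.split()), len(tokenize(sentence)))  # 19 23
```

The last line reproduces the word and token counts from the Tokens section: the email address and the two contractions account for the four extra tokens.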
The use of empty curly brackets to represent an optional space or special character is a good way to target hyphenated words. For example, the search term e{ }mail will hit all of the following strings of text:
- e mail
Be sure to include a space between the curly brackets in your search term.
Computationally Expensive Searches
A computationally "expensive" search is one that takes longer to run and requires more system resources (such as memory, database activity, heavy computation, etc.). It is important to try to avoid running expensive searches. Expensive searches will take a long time to run, and they can cause system slow-downs for you and for other DART users (and in extreme cases, they can crash the system).
Below are four attributes of an expensive search to watch out for:
1. Running searches over a large number of documents
As one would assume, if search terms are run over a large number of documents, they will take longer to run and will require more system resources. In general, a large number of documents for DART is over 500,000 documents. A small number is under 5,000.
However, there are many times where we are required to run our search terms over a very large number of documents. This is okay. The H5 system is designed to handle this, but you should expect searches to take longer to run. You also need to be more cautious and aware of the next three attributes (number of expansions, very common words, and wildcard operators) when running a search term over a large number of documents. It is okay to run non-expensive searches over a large number of documents. It is even okay to run expensive searches over a small number of documents. The issue arises when expensive searches are run over a large number of documents.
2. Number of expansions
System resources can become drained when a single search term generates a very high number of expansions (see the Search Term Expansions section above for an explanation and examples of "expansions"). A search term that generates 10,000 or fewer expansions is not considered expensive (although there are other factors besides the expansion count that can make the search term expensive). A search term that generates 10,000 to 250,000 expansions is considered somewhat expensive. A search term that generates over 250,000 expansions is considered expensive.
Always be aware of the number of expansions in your search terms (use the Expand button under the DSR search bar to quickly and easily ascertain this number). As a best practice, if your search term contains over 250,000 expansions, break it up into multiple search terms. This will make your search terms run faster, and it helps avoid overtaxing the system.
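The thresholds above can be summarized in a small helper. The function names are hypothetical, and the expansion count itself would come from the Expand button in DSR.

```python
def expansion_cost(count):
    """Classify a search term by its number of expansions,
    using the thresholds given in this section."""
    if count <= 10_000:
        return "not expensive"
    if count <= 250_000:
        return "somewhat expensive"
    return "expensive"


def should_split(count):
    """Best practice: split terms that generate over 250,000 expansions."""
    return count > 250_000


print(expansion_cost(5), expansion_cost(40_000), expansion_cost(300_000))
# not expensive somewhat expensive expensive
```

Keep in mind that a low expansion count does not guarantee a cheap search; the other attributes in this section still apply.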
3. Very common words
System resources can become drained when a search term generates a very high number of hits or potential matches during a run (hits across documents and multiple hits within the same document). Therefore, very common words tend to result in very expensive search terms. Search terms that simply contain very common words can be expensive as well, such as United States of America (the system looks up all instances of each word in order to calculate hits).
Avoid including common words in your search terms, unless absolutely necessary. Avoid commonly used English words such as articles (the, a, an, etc.), prepositions (to, for, of, etc.), pronouns/determiners (I, your, this, etc.), conjunctions (and, or, but, etc.), and common verbs (do, get, will, etc.). Be cautious when using contractions because the apostrophe is treated as a space during search, which can make the search term expensive. For example, searching for "Murphy's Law" will actually search for "Murphy s Law" which will search for the letter "s" (and every instance of "apostrophe s" across the documents).
You can also use the Ignore Stop Words feature in DSR. This feature will allow you to leave very common words in your search term, but the system will replace these words with a uni-directional proximity operator during search. For example, the system will run United States of America as United States f/1 America and will run Murphy's Law as Murphy f/1 Law. To use this feature, check the Ignore Stop Words checkbox below the search bar, and then execute your search.
And finally, avoid words that may not be very common in general but that you know to be common in the document population. These instances will not be handled for you by the Ignore Stop Words feature. For example, if you were searching through Enron's documents, the search term Enron could be very expensive because it can hit the email address "@enron.com" which could occur hundreds of thousands (or even millions) of times across a large number of documents.
Note: If it is absolutely necessary to use the common words "and", "or", or "not" in your search term, remember that you must use quotes around these words so that they do not perform as operators during the search.
4. Wildcard operators
Because system resources are drained when a very high number of hits or potential matches are generated during a run, the wildcard operators can be expensive when not used carefully, especially the unlimited wildcard. Imagine running a* as a search term. This could hit millions of times before it was even halfway through the run, significantly draining vital system resources such as memory. A search term like this is unreasonable and can crash the system. As a best practice, do not use the unlimited wildcard at the end of a word with three or fewer letters. Also, do not use the unlimited wildcard at the beginning or in the middle of a word because this will almost always result in a very expensive search.
When you need to use a wildcard for words of three or fewer letters, at the beginning of a word, or in the middle of a word, consider using the single-character or digit wildcards instead. To make these wildcards more closely match the search functionality of the unlimited wildcard, consider enclosing them in curly brackets (which will make them optional). For example, consider using ban{?} instead of ban*, or {?}smith instead of *smith. You can also use two optional question mark wildcards in a row, for example, j{?}{?}smith instead of j*smith.
However, do not use more than three (optional or mandatory) question mark wildcards in a row because this will likely result in an expensive search.
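The wildcard guidelines in this section can be turned into a simple check to run on a search word before searching. This is a sketch of the guidelines as stated here, not a DSR feature; the function name and warning texts are illustrative.

```python
import re


def lint_word(word):
    """Return a list of warnings for one search word, based on the
    wildcard guidelines in this section."""
    warnings = []
    if "*" in word[:-1]:
        warnings.append("unlimited wildcard at the beginning or in the middle of a word")
    elif word.endswith("*") and len(word) - 1 <= 3:
        warnings.append("unlimited wildcard after three or fewer letters")
    # Optional wildcards such as {?} count toward the limit as well.
    flat = word.replace("{?}", "?").replace("{+}", "+")
    if re.search(r"[?+]{4,}", flat):
        warnings.append("more than three single-character or digit wildcards in a row")
    return warnings


for word in ["bank*", "ban*", "*smith", "j{?}{?}smith", "s????n"]:
    print(word, lint_word(word))
```

In this example, bank* and j{?}{?}smith produce no warnings, while ban*, *smith, and s????n each produce one.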
Best Practices for Search Terms
Here is a review of the best practices to follow when running search terms in DART:
- Grouping: Use grouping every time your search term contains more than one of the following operators: OR, AND, NOT, w/<n>, f/<n>, m/<field>. This helps ensure you and other readers understand how the search term will perform.
- Number of expansions: If your search term contains over 250,000 expansions, break it up into multiple search terms. This will make your search terms run faster, and it helps avoid overtaxing the system.
- Very common words: Replace very common words in your search terms with a uni-directional proximity operator, or use the Ignore Stop Words feature in DSR. This will make your search terms run faster, and it helps avoid overtaxing the system.
- Wildcard characters: Do not use the unlimited wildcard at the end of a word with three or fewer letters. Do not use the unlimited wildcard at the beginning or in the middle of a word. Do not use more than three (optional or mandatory) single-character or digit wildcards in a row. All this will make your search terms run faster, and it helps avoid overtaxing the system.