Name Normalization and Language ID Search Syntax

IN THIS ARTICLE:

This article describes the DART syntax for Name Normalization and Language Identification searches.

Object Field Search

Objects are database structures that store information about particular entities or results. DART uses objects to store the results of name normalization and language identification, and those results are accessible through fields and sub-fields. Name Normalization results are automatically available in DART. Language Identification results, however, are generated by Basis (third-party software) and are available by request. Object searches use a special syntax that is similar to the syntax used elsewhere in DART but differs in a few ways:

  • The syntax must specify the object field and the sub-field separated by the @ sign (e.g., m/normalized_from@fullname = John Harrison).
  • Object searches always require the use of the “equals” operator (=). The “contains” operator (:) cannot be used because the searches query the database directly.
  • Object searches cannot be run in Lace Jobs because they query the database directly and are not part of the search index.
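
As an illustration of these rules, the following Python sketch assembles an object search string and rejects any operator other than "equals". The helper name object_search is hypothetical and is not part of DART; it only formats text in the field@sub-field pattern shown in the first bullet.

```python
def object_search(field: str, sub_field: str, value: str, operator: str = "=") -> str:
    """Format a DART object search as m/<field>@<sub_field> = <value>.

    Hypothetical helper for illustration only; DART does not provide it.
    """
    if operator != "=":
        # Object searches accept only the "equals" operator, never "contains" (:).
        raise ValueError("object searches require the '=' operator")
    # Accept the sub-field with or without its leading @ sign.
    return f"m/{field}@{sub_field.lstrip('@')} = {value}"


print(object_search("normalized_from", "fullname", "John Harrison"))
# prints: m/normalized_from@fullname = John Harrison
```

Passing operator=":" raises a ValueError, mirroring the rule that the "contains" operator cannot be used with object fields.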

Name Normalization

Name Normalization Fields:

  • normalized_from
  • normalized_to
  • normalized_cc
  • normalized_bcc
  • normalized_recipients
    • To + CC + BCC
  • normalized_email_participants_top
    • From + To + CC + BCC in the top-level header
  • normalized_email_participants_all
    • All people found in the From, To, CC, or BCC fields in any email header visible in the document.
  • etx_normalized_thread_network
    • All people present anywhere in a thread group across all branches, even if those people are not visible on that particular document/branch.
    • Replace ‘x’ with the email threading set letter (eta, etb, etc.)

Name Normalization Sub-Fields:

  • @first
    • A person’s first name
  • @middle
    • A person’s middle name
  • @last
    • A person’s last name
  • @fullname
    • The person’s first, middle, and last name

Examples of name normalization searches:

m/normalized_from@first = John
  • Returns documents where “John” is the first name of the person in the normalized_from field (e.g., John Harrison, John F. Brown, John James Roberts).
m/normalized_recipients@last = Harrison
  • Returns documents where “Harrison” is the last name of a person in the normalized_recipients field (e.g., John Harrison, Nancy A. Harrison, T. J. Harrison).
m/normalized_email_participants_top@fullname = John Harrison
  • Returns documents where “John Harrison” is the full name (first, middle, and last) of a person in the normalized_email_participants_top field.
  • Does not return documents where that person has a middle name (e.g., John M. Harrison, John Mark Harrison).
m/normalized_email_participants_top@fullname = John Mark Harrison
  • Returns documents where “John Mark Harrison” is the full name (first, middle, and last) of a person in the normalized_email_participants_top field.
  • Does not return documents where that person has no middle name or a different middle name (e.g., John Harrison, John James Harrison, John M Harrison).
To specify values for more than one sub-field of a single person, put a colon after the field name, followed by the sub-fields, each with its @ sign, in parentheses (the use of parentheses is optional):
m/normalized_recipients: (@first = John and @last = Harrison)
  • Returns documents where “John” is the first name and “Harrison” is the last name of the same person in the normalized_recipients field (e.g., John Harrison, John M. Harrison, John Mark Harrison).
Compare this with:
m/normalized_recipients@first = John AND m/normalized_recipients@last = Harrison
  • This can return documents where “John Harrison” is in the normalized_recipients field, but it can also return documents where “John Davis” and “Mary Harrison” both appear in the normalized_recipients field and “John Harrison” does not.
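
The distinction between matching both sub-fields on the same person and matching them through separate searches can be modeled in a few lines of Python. This is only a sketch of the matching logic as described here, not DART's implementation; the Person type and both function names are made up for the illustration.

```python
from typing import NamedTuple


class Person(NamedTuple):
    first: str
    last: str


def same_person_match(recipients: list[Person]) -> bool:
    # Grouped form: both conditions must hold for one and the same person.
    return any(p.first == "John" and p.last == "Harrison" for p in recipients)


def separate_searches_match(recipients: list[Person]) -> bool:
    # Two independent searches joined with AND: each condition may be
    # satisfied by a different person.
    return any(p.first == "John" for p in recipients) and any(
        p.last == "Harrison" for p in recipients
    )


doc = [Person("John", "Davis"), Person("Mary", "Harrison")]
print(same_person_match(doc))        # False
print(separate_searches_match(doc))  # True
```

The sample document contains no John Harrison, yet the separate searches still match it because John Davis satisfies one condition and Mary Harrison the other.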

Language Identification

Language Identification Fields:

  • ma_languages
    • All languages identified in the extracted text of a document.
  • ma_primary_language
    • The primary/most common language identified in the document (i.e., the language with the highest percentage).

Language Identification Sub-Fields:

  • @language
    • The name of the language identified
  • @percentage
    • The percentage of extracted text where this language is present

Examples of Language Identification searches:

m/ma_languages@language = german
  • Returns documents where German is identified as one of the languages present in the extracted text.
m/ma_primary_language@language = spanish
  • Returns documents where Spanish is identified as the primary language in the extracted text.
To specify a percentage for a language, put a colon after the field name, followed by the sub-fields, each with its @ sign, in parentheses (the use of parentheses is optional):
m/ma_languages: (@language = german and @percentage < 25)
  • Returns documents where German is identified as one of the languages present in the extracted text, and where German is present in less than 25% of the extracted text.
m/ma_primary_language: (@language = spanish and @percentage > 80)
  • Returns documents where Spanish is identified as the primary language present in the extracted text, and where Spanish is present in over 80% of the extracted text.
To search for more than one language in a single search using Boolean OR, use a colon after the field name followed by @ signs and sub-fields in parentheses (the use of parentheses is optional):
m/ma_languages: (@language = english or @language = spanish)
  • Returns documents where English or Spanish is identified as one of the languages present in the extracted text.
m/ma_primary_language: (@language = spanish or @language = italian or @language = japanese)
  • Returns documents where Spanish, Italian, or Japanese is identified as the primary language present in the extracted text.
  • Returns documents where Spanish is present in over 60% of the extracted text or where Turkish is present in over 60% of the extracted text.

NOTE: When using Boolean OR, the searches above can also be written as separate searches joined by OR, which return the same results:

  • m/ma_languages@language = english OR m/ma_languages@language = spanish
  • m/ma_primary_language@language = spanish OR m/ma_primary_language@language = italian OR m/ma_primary_language@language = japanese
  • m/ma_languages: (@language = spanish and @percentage > 60) OR m/ma_languages: (@language = turkish and @percentage > 60)
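
To make the equivalence concrete, here is a small Python sketch that produces both spellings of an OR search on one sub-field. The function names are hypothetical and not part of DART; the output simply follows the patterns shown in this article.

```python
def grouped_or(field: str, sub_field: str, values: list[str]) -> str:
    # Single search: one field name, alternatives joined with "or" in parentheses.
    terms = " or ".join(f"@{sub_field} = {v}" for v in values)
    return f"m/{field}: ({terms})"


def expanded_or(field: str, sub_field: str, values: list[str]) -> str:
    # Multiple searches: the full field@sub-field term is repeated and joined with OR.
    return " OR ".join(f"m/{field}@{sub_field} = {v}" for v in values)


langs = ["english", "spanish"]
print(grouped_or("ma_languages", "language", langs))
# prints: m/ma_languages: (@language = english or @language = spanish)
print(expanded_or("ma_languages", "language", langs))
# prints: m/ma_languages@language = english OR m/ma_languages@language = spanish
```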
To search for more than one language using Boolean AND, you must use multiple searches:
m/ma_languages@language = english AND m/ma_languages@language = spanish
  • Returns documents where both English and Spanish are identified as languages present in the extracted text.
Do not use Boolean AND in a single search. The following search will not return results because it is looking for a single language that is both English and Spanish:
  • m/ma_languages:(@language = english and @language = spanish)
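
Why the grouped AND can never match is easier to see if each identified language is pictured as its own record. The Python sketch below is an assumption about the matching logic based on the description in this article, not DART's code; the record layout and the sample percentages are invented.

```python
# Each identified language is one record with its own language and percentage.
doc_languages = [
    {"language": "english", "percentage": 70},
    {"language": "spanish", "percentage": 30},
]

# Grouped AND on the same sub-field: a single record would have to be
# English and Spanish at once, which is impossible.
grouped_and = any(
    r["language"] == "english" and r["language"] == "spanish" for r in doc_languages
)

# Separate searches joined with AND: each search may match a different record.
separate_and = any(r["language"] == "english" for r in doc_languages) and any(
    r["language"] == "spanish" for r in doc_languages
)

# Grouped AND across different sub-fields is fine: one record can be
# Spanish and also cover more than 25% of the text.
spanish_over_25 = any(
    r["language"] == "spanish" and r["percentage"] > 25 for r in doc_languages
)

print(grouped_and)      # False
print(separate_and)     # True
print(spanish_over_25)  # True
```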