Glossary

A

Acceptance testing or acceptance sampling: Procedure used to evaluate whether the coding on a set of documents is of acceptable quality, i.e. sufficiently accurate.

Active data: This term is often used to describe information currently displayed on a computer screen. The more technical usage refers to information stored on local storage media or a device visible to the operating system and/or application software with which it was created. Active data is accessible to users immediately and without modification or restoration.

Algorithm: A specific set of steps that, when accurately executed, leads to a specific, guaranteed outcome. Algorithms can be created for different processes, such as calculation, data processing, and automated reasoning. Compare to "heuristic."

Analytics: The discovery, interpretation, and communication of meaningful patterns in data; and the process of applying those patterns toward effective decision-making.

Architecture: “Open architecture” refers to a system or network whose specifications are public. This includes officially approved standards as well as privately designed architectures whose specifications are made public by the designers. The opposite of open is closed or proprietary. The great advantage of open architecture is that anyone can design add-on products or plug-ins for it.

Archival data: Information that is maintained for long-term storage and record-keeping purposes but is not immediately accessible. Archived data can be stored in many ways. For example, it can be written onto removable media, like a DVD or backup tape, or maintained on a system hard drive.

Attachment: This term most commonly refers to an electronic file attached to an email message. More generally, it refers to a file or record attached or associated with another, often for retention, transfer, processing, review, production and/or routine records management. Multiple attachments can be associated with a single file or record (referred to as the “parent” or “master” record).

Author or originator: The person or office that created or issued an item.

B

Backup data: Active electronically stored information (ESI) copied onto a second medium (like a CD, DVD or backup tape) in its exact form, often intended as a source for recovery should the first medium fail. Usually, backup data is stored separately from active data, and it differs from archival data (though it may be a copy of archival data) in the method and structure of its storage.

Batching: The process of gathering large amounts of electronically stored information together in batches. Typically this process is done so that documents can be allocated to reviewers for tagging.

Bates number: Sequential numbering used to track documents, images or production sets (as with productions made in native format), which often includes a suffix or prefix to help identify the producing party, case name or similar information.
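
Example: a minimal sketch of generating a Bates sequence by combining a prefix with a zero-padded counter. The prefix "ABC", the seven-digit width, and the function name bates_numbers are assumptions made for illustration, not a standard.

```python
def bates_numbers(prefix, start, count, width=7):
    """Return `count` sequential Bates numbers, e.g. prefix + zero-padded counter."""
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + count)]


# Hypothetical producing-party prefix "ABC", numbering from 1.
labels = bates_numbers("ABC", start=1, count=3)
print(labels)
```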

Boolean search: A Boolean search is a type of search allowing users to combine keywords with operators such as AND, NOT, and OR to produce more relevant results than a single-word search. For example, a Boolean search could be “hotel” AND “New York.” This limits the search results to only those documents containing the two keywords.
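
Example: a rough sketch of how the AND, OR, and NOT operators narrow or widen a result set. The three sample documents and the helper name contains are hypothetical, and real review platforms use indexed search rather than the simple substring test shown here.

```python
# Hypothetical mini-collection of documents.
documents = {
    "doc1": "Booked a hotel in New York for the conference.",
    "doc2": "The hotel in Boston was full.",
    "doc3": "Flights to New York are delayed.",
}


def contains(text, term):
    """Case-insensitive test for a keyword or phrase."""
    return term.lower() in text.lower()


# "hotel" AND "New York": both terms must appear.
and_hits = [d for d, t in documents.items()
            if contains(t, "hotel") and contains(t, "New York")]

# "hotel" OR "New York": either term may appear.
or_hits = [d for d, t in documents.items()
           if contains(t, "hotel") or contains(t, "New York")]

# "hotel" NOT "New York": the first term without the second.
not_hits = [d for d, t in documents.items()
            if contains(t, "hotel") and not contains(t, "New York")]

print(and_hits, or_hits, not_hits)
```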

Branches: Email conversation or thread that splits into different directions, generally where a message unit has multiple, different responses.

C

Cache: A high-speed storage mechanism utilized for frequently used data. Website contents, for example, often reside in cached storage locations on a hard drive.

Carving: The process of searching through the unused parts of a disk for files that haven’t been overwritten and recovering those files. Word to the wise: “deleted” does not mean gone. Deleting a file usually only unlinks it from your computer’s file system, and with the right software, deleted files can usually be recovered.

Chain of custody: Process of documenting and tracking possession, movement, handling and location of evidence. Chain of custody is tracked from the time evidence is obtained, until presentation in court or other submission. A clear chain of custody is important when issues of admissibility and authenticity arise, as it can establish that the evidence was not altered or tampered with in any way.

Checksum: A sequence of numbers and letters, computed from the contents of a file, that is essentially unique to those contents. Checksums come in several different types, including MD5 and SHA1. They are extremely useful for finding duplicates or identifying minor changes to documents. Also called a hash.
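
Example: a small sketch, using Python's standard hashlib module, of why checksums help with both duplicates and minor changes. The sample byte strings and the helper name checksum are made up for illustration.

```python
import hashlib


def checksum(data, algorithm="sha1"):
    """Return the hexadecimal checksum of a byte string."""
    return hashlib.new(algorithm, data).hexdigest()


original = b"The meeting is on Monday."
exact_copy = b"The meeting is on Monday."
edited = b"The meeting is on Tuesday."

# Identical content gives an identical checksum (a duplicate).
same = checksum(original) == checksum(exact_copy)

# Even a one-word edit gives a different checksum.
changed = checksum(original) != checksum(edited)

print(same, changed)
```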

Child document: Refers to a file that is attached to another document. An example would be an email attachment or a graph embedded in a word processing document. A child can be another email if it is included as an attached file rather than part of an email thread.

Clawback agreement: An agreement that protects against waiver of privilege and/or work product protections when inadvertent production occurs; requires automatic return or destruction of inadvertently disclosed records. See quick peek agreement.

Clawback: Document inadvertently produced that contains confidential or privileged information that must be returned to the producing party or destroyed.

Cloning: Making a copy of a drive, for example to put into another machine without having to install everything from scratch, or for backup purposes. Cloning programs are typically not configured to capture all areas of the drive. Clones also pose a problem for later authentication: there is no way to tell whether anything was deleted from or added to the clone after the day it was made. See mirroring.

Clustering: Organizing documents by similarity.

Coding: The method of entering information fields from a document and saving them in a format linked to that document within a database. There are two types of coding: objective and subjective. Objective coding records information that appears on the face of the document, such as the date on a scan, and can be applied by anyone who can read the document. Subjective coding requires knowledge of the underlying investigation. Also called tagging. See document review.

Collection: A group or set of documents.

Compression: The reduction in the size of data to save storage space and reduce the bandwidth necessary for access and transmission. “Lossless” compression preserves the integrity of the data (e.g., ZIP and RLE), while “lossy” compression does not (e.g., JPEG and MPEG).
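
Example: a minimal sketch of lossless compression using Python's standard zlib module. zlib is used here only as a convenient stand-in for formats such as ZIP, and the sample text is invented. The round trip restores the data exactly, which is what distinguishes lossless from lossy compression.

```python
import zlib

# Repetitive sample data compresses well.
text = b"electronically stored information " * 100

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(len(text), len(compressed), restored == text)
```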

Computer forensics: Computer investigation and analysis techniques to determine legal evidence. Applications include computer crime or misuse, theft of trade secrets, theft of or destruction of intellectual property, and fraud. Computer forensics specialists use many methods to capture computer system data, and recover deleted, encrypted, or damaged file information.

Concept search: A method of searching for files based not on keywords but on the subject matter of a document, paragraph, or sentence. This is different from keyword searching, which requires an exact keyword hit.

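Confidence interval: The range of values, estimated from a sample, within which the true value for the whole population is expected to fall; used as a measure of the uncertainty of a sample.

Confidence level: How often, across repeated random samples, the result is expected to fall within a specific range (the confidence interval), e.g. 95%.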
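
Example: a sketch of how a confidence interval might be estimated for the share of responsive documents in a random sample. It assumes the common normal approximation for a proportion, with z = 1.96 for a 95% confidence level. The sample figures (50 responsive documents out of 400 sampled) and the function name proportion_interval are hypothetical, and other interval methods exist.

```python
import math


def proportion_interval(hits, sample_size, z=1.96):
    """Normal-approximation interval for a sample proportion.

    z = 1.96 corresponds to a 95% confidence level.
    """
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    # Clamp to the valid range for a proportion.
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical sample: 50 of 400 sampled documents were responsive.
low, high = proportion_interval(hits=50, sample_size=400)
print(f"{low:.3f} to {high:.3f}")
```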

Container file: A single file that contains multiple other files or documents. A common container file is a zip file. Container files are typically used because of their considerably smaller file size. Extracted contents are usually anywhere from 50% to 250% larger than the original container file.

Corpus: A collection or body of documents.

Corrupted file: A corrupted file has been damaged and cannot be read by a computer in part or in whole. Common causes include viruses, hardware or software failures, and degradation due to the passage of time.

Cost shifting: Shifting the cost of responding to certain discovery requests from the responding party to the requesting party.

Culling: Reducing the size of the set of electronic documents using mutually defined criteria, e.g. dates, keywords, custodians, etc., to decrease the volume while increasing the relevancy of the information; processing a large set of data and removing the junk data so that it’s easier to search and less expensive to host or transfer. Culling techniques include de-duplication, near-de-duplication, email thread analysis, deNISTing, and filtering.

Custodian or data custodian: A person who has electronically stored information relevant to litigation collected for discovery. For example, the data custodian of an e-mail is the owner of the mailbox which contains the e-mail.

D

DART: The Data Analysis and Review Tool is the software platform used to carry out document review tasks.

Data extraction: Refers to the process of breaking down data from electronic documents to identify their metadata and body contents; parsing data from electronic documents into separate fields, such as “Date Created,” “Date Modified,” “Author,” etc. In a database, this allows for searches across data or by sorting respective fields.
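
Example: a simplified sketch of parsing an email into separate fields with Python's standard email module. The message and the field names in record are invented for illustration; processing tools extract many more fields and handle attachments.

```python
from email import message_from_string

# Hypothetical raw email: headers, a blank line, then the body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 forecast\n"
    "Date: Mon, 06 Jan 2020 09:30:00 +0000\n"
    "\n"
    "Please see the attached numbers.\n"
)

message = message_from_string(raw)

# Split the metadata and body contents into discrete, sortable fields.
record = {
    "Author": message["From"],
    "Recipient": message["To"],
    "Subject": message["Subject"],
    "Date Sent": message["Date"],
    "Body": message.get_payload().strip(),
}

print(record)
```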

Data filtering: Use of specified parameters to identify specific data.

Data mapping: Method used to capture information about how ESI is stored virtually and physically. A basic data map will include name and location information, while a more complex data map may include several, if not all, of the following: software and formatting information; description of backup procedures in place; interconnectivity and utilization of each type of ESI within the organization; accessibility, policies, and protocols for retention and management; and record custodian information.

Data mining: The process of extracting useful data from a volume of unstructured information; the process used to search for patterns and relationships in large data collections to extract other useful pieces of information.

Data processing: E-discovery workflow that formats collected ESI so that it can be culled and searched in a review tool. Processing can differ depending on the application being used. Typically processing includes the extraction of files from folders (.pst, and .zip formats), separation of attachments, conversion of files to formats the review tool can read and search, and extraction of text and metadata.

Data set: Named or defined collection of data.

Decompression: To expand or restore compressed data to its original size and format.

Deduplication or de-duping: A process that removes multiple copies of the identical file from a set of files in a document collection, leaving you with only one of the copies. Horizontal deduplication means removing all the duplicates across the board. Vertical deduplication means keeping a copy of a duplicate if it belongs to a different custodian. Deduplication can be by case, custodian, or production. Custodial deduplication removes all duplicate files within a single custodian’s collection, whereas global deduplication removes all duplicates across all custodians in a matter.
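
Example: a rough sketch of the difference between global and custodial deduplication, using a content hash to detect identical files. The custodians, filenames, and the function name deduplicate are hypothetical, and the first copy encountered is the one kept.

```python
import hashlib

# Hypothetical collection: (custodian, filename, content).
files = [
    ("alice", "memo.docx", b"Quarterly memo"),
    ("alice", "memo_copy.docx", b"Quarterly memo"),
    ("bob", "memo.docx", b"Quarterly memo"),
    ("bob", "notes.txt", b"Call notes"),
]


def deduplicate(files, per_custodian):
    """Keep the first copy of each file, judged by content hash.

    per_custodian=True  -> custodial: duplicates removed within each custodian.
    per_custodian=False -> global: duplicates removed across all custodians.
    """
    seen = set()
    kept = []
    for custodian, name, content in files:
        digest = hashlib.sha1(content).hexdigest()
        key = (custodian, digest) if per_custodian else digest
        if key not in seen:
            seen.add(key)
            kept.append((custodian, name))
    return kept


global_kept = deduplicate(files, per_custodian=False)
custodial_kept = deduplicate(files, per_custodian=True)
print(global_kept)
print(custodial_kept)
```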

Defragment or defragging: Refers to the use of a computer utility to reorganize files in a contiguous manner. Fragmentation occurs naturally when a hard drive or other storage medium is frequently used and will result in storing a file in noncontiguous clusters. The more places that need to be searched, the slower the data will be accessed. Defragmentation may be set up to run automatically and with little or no oversight by users and will result in the overwriting of information residing in unallocated space. Data preservation may require suspension of defragmentation and imaging to preserve such information.

Deleted data: Refers to live data deleted by a computer system or user activity. “Soft deletion” refers to data marked for deletion that may no longer be accessible to the user (such as the emptying of one’s “recycle bin”), but has not yet been physically removed or overwritten. Soft deleted data may be recoverable. Further, deleted data in general, may remain on storage media, in whole or in part, until it is overwritten or “wiped,” and, even after being wiped, it may be possible to recover information relating to the deleted data.

DeNISTing: Removing the operating system files, program files, and other non-user-created data. The NIST (National Institute of Standards and Technology) list contains more than 40 million known files. Using this list to filter custodian hard drive files can be effective because these files are usually irrelevant to a case but often make up a sizable portion of a collected set of electronically stored information (ESI). DeNISTing is one way of culling data, by taking a list of checksums for known junk files and removing any matching files from the data set.
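
Example: a minimal sketch of checksum-based filtering. The set known_hashes is a tiny stand-in for the NIST list, and the filenames and file contents are invented; a real workflow would load the published hash list instead.

```python
import hashlib


def sha1_hex(data):
    """Return the SHA-1 checksum of a byte string as hex text."""
    return hashlib.sha1(data).hexdigest()


# Stand-in for the NIST list: checksums of known system and program files.
known_hashes = {
    sha1_hex(b"system library bytes"),
    sha1_hex(b"installer bytes"),
}

# Hypothetical collected files: name -> content.
collected = {
    "kernel32.dll": b"system library bytes",
    "setup.exe": b"installer bytes",
    "contract_draft.docx": b"user-authored contract text",
}

# Keep only files whose checksum is not on the known list.
remaining = sorted(
    name for name, data in collected.items()
    if sha1_hex(data) not in known_hashes
)
print(remaining)
```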

Directory: An organizational unit or container used to organize folders and files into a hierarchical or tree-like structure. Some user interfaces use the term “folder” instead.

Discovery: The process of identifying, acquiring, and reviewing information that is potentially significant to the matter and producing information that can be utilized as evidence in litigation.

Diskwipe: A utility used to overwrite or erase existing data.

Document family: A group of documents that is connected to each other for purposes of communication; pages or files produced either in hard copy or through a software application, which constitute a logical single communication of information, e.g. an email and attachments. For document review purposes, the email is referred to as the parent, and the attachments are children. May also include other groups of documents, such as files found in the same .zip folder or a spreadsheet with embedded images.

Document review: Document review is a part of the discovery process in which each party to a case sorts through and analyzes the documents and data they possess (and later the documents and data supplied by their opponents through discovery) to determine which are sensitive or otherwise relevant to the case. Documents may be given coding or tags. Responsiveness coding is used to determine if a document is necessary to produce or not relevant to the case. Privilege coding is used where documents are expected to be withheld on the grounds of a legal privilege. Issue coding is used to identify documents relevant to specific topics.

Document type: A typical field used in coding, examples include “correspondence,” “memo,” “agreement,” etc.

Document: A specific file, such as an email or Word document. A document may be considered a parent if it has additional files (or children) embedded within it or attached to it. The entire group of documents is called a family.

Download: The process of moving data from another location to one’s own, typically over a network or the Internet.

Duty to preserve: Duty arising under state and federal law, upon reasonable anticipation of litigation, to preserve documents, electronic records and data, and any other evidence or information potentially relevant to a dispute. The duty also arises in the context of audits, government investigations and similar matters. The scope of the duty and what is required under a specific set of circumstances is determined by considerations of reasonableness and proportionality. Reasonable efforts to preserve include the suspension of routine deletion policies, issuance of adequate preservation instructions and oversight as appropriate.

E

Early Case Assessment (ECA): A variety of tools or methods for investigating and quickly learning about a document collection for the purposes of estimating the risks, costs, and time involved in pursuing a particular legal course of action. As opposed to linear review, effective ECA utilizes search terms, filters and/or additional criteria to cull the data set and enable a more efficient and focused review.

E-discovery or electronic discovery: A process of finding, identifying, locating, reviewing, and producing relevant electronically stored information for litigation purposes; process where the parties to litigation exchange electronic evidence. Also known as digital discovery, electronic digital discovery, electronic document discovery, electronic evidence discovery.

EDRM or electronic discovery reference model: A map of the electronic discovery process from data preservation and collection to production.

Electronic document management (EDM): This refers to the process utilized in the management of documents, whether hard copy or electronic. In the case of hard copy documents, it includes those steps necessary to make them available electronically, such as imaging, archiving, etc.

Electronically stored information (ESI): Information that is stored electronically (regardless of the media or original formatting) as opposed to paper.

Email archiving: The process of preserving and storing email.

Email or electronic mail: Digital messages from an author to one or more recipients. Email operates across the Internet or computer networks.

Email threading: Process using analytics to organize emails into conversational groups by associating the members of all related branches of an email thread including the attachments. Email threading allows for ease of review by removing the need to review fully subsumed, duplicative content.

Embedded metadata: Metadata embedded with content.

Emo filter: Set of terms used to target sentiment or heightened emotional language, either positive or negative, in text. See sentiment analysis.

Encryption: A procedure that makes the contents of a file or message scrambled and unintelligible to anyone not authorized to read it.

Exact match: Search results that correspond precisely to the query as specified. Compare to fuzzy match or wild cards.

Expanded data: See decompression.

Expansion: The process of adding to the terms in a query to provide a wider range of terms and increase the range of documents retrieved.

F

Fielded: A form of production (usually native, or nearly native) wherein the fields that hold discrete bits of information remain in place. For example, an email when converted to a PDF file is no longer fielded because the “to:” and “from:” fields of the email in a PDF document have the same status as any of the other text on the page. In contrast, when email is produced in a native or near-native format, the “to:” and “from:” fields retain their special status and are searchable.

File extension: A suffix to the name of a file, separated by a dot. Often an abbreviated version of the name of the program in which the file was created or saved, the suffix indicates the program that may be used to open the file.

File server: Refers to a computer attached to a network, of which the main purpose is to provide a centralized location for shared storage of computer files (such as documents and images) that can be accessed by the workstations attached to the same network. File servers are the heart of any server network. They can contain data for other programs or direct access to documents themselves.

File: A collection of data or information stored under a specific name, called a filename.

Filename: A unique identifier assigned to a specific file. Filenames can be descriptive (e.g., LtrToJohn) or cryptic (e.g., 5787720) and are followed by an extension.

Filtering: The process of using certain parameters to remove documents that do not fit within those parameters in order to reduce the volume of the data set.

Firewall: A system designed to prevent unauthorized access to a specific computer, server or private network.

Flash drive: A small, data storage device used to store files or transport them from one computer to another, also commonly referred to as a USB or thumb drive.

Forensics: A handling of electronically stored information in a way that confirms its authenticity, so that the information can be used as evidence in a court of law. Forensically sound collection refers to a manner of document collection that ensures the collected documents, including their metadata, are not altered in any way and the resulting collected documents are identical to the documents as they originally existed. One way to prove that the documents are unaltered is by assigning hash values to the collection of documents.

Form of production: The manner of producing documents and data, including file format (native format vs. TIFF or PDF) and method of production (electronic vs. paper).

Format: 1. (noun) The internal structure of a file that defines the way it is stored and the programs in which it can be used. 2. (verb) The act of preparing a storage medium ready for first use.

FRCP: The Federal Rules of Civil Procedure govern e-discovery and other elements of federal civil litigation. Most state courts have their own rules, which are generally based upon elements of the FRCP.

FTP or file transfer protocol: The protocol for transferring files over a network or the Internet.

Full-text search: When a data file can be searched for specific words and/or numbers.

Fuzzy search or fuzzy match: Search results that correspond approximately to the query as specified. Fuzzy searches may include wild cards, spelling variations, stemming or other operations that result in approximate matches.
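
Example: a small sketch of approximate matching using Python's standard difflib module, which scores string similarity. The vocabulary, the misspelled query "agreemnt", and the 0.8 similarity cutoff are assumptions for illustration; commercial search engines use their own matching rules.

```python
import difflib

# Hypothetical list of terms found in a document collection.
vocabulary = ["agreement", "agreements", "argument", "amendment", "invoice"]

# A misspelled query still finds terms that are close enough.
matches = difflib.get_close_matches("agreemnt", vocabulary, n=3, cutoff=0.8)
print(matches)
```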

H

Hard drive: Self-contained storage device with a high capacity, a read-write mechanism, and one or more hard disks.

Harvesting: Collection of ESI; the method of gathering electronic data for future use in your investigation or lawsuit, preferably while maintaining file and system metadata.

Hash coding: Method of coding that provides quick access to data items capable of being distinguished through the use of a key term, like a person's name. Each data item to be stored is associated with the key term, the hash function is applied to that term, and the resulting hash value may then be used as an index that permits users to select one of several “hash buckets” in a hash table. The table contains pointers to the original item.
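
Example: a simplified sketch of the bucket idea described above. The bucket count of 8, the use of CRC-32 as the hash function, and the names store and lookup are all assumptions chosen for illustration. In practice, Python's built-in dict already works this way internally.

```python
import zlib

NUM_BUCKETS = 8  # arbitrary size chosen for the sketch

# The hash table: a fixed list of "hash buckets".
buckets = [[] for _ in range(NUM_BUCKETS)]


def bucket_index(key):
    """Apply a hash function to the key term and map it to a bucket."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def store(key, item):
    """File a reference to the item under its key term."""
    buckets[bucket_index(key)].append((key, item))


def lookup(key):
    """Search only the one bucket the key hashes to."""
    return [item for k, item in buckets[bucket_index(key)] if k == key]


store("Jane Smith", "DOC-001")
store("John Doe", "DOC-002")
store("Jane Smith", "DOC-007")
print(lookup("Jane Smith"))
```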

Hash: A hash value (or hash) is an alphanumeric string generated by an algorithm that uniquely identifies the original data. It is useful for authenticating data (such as a file) for evidence admissibility in court, determining duplicate documents, and identifying alterations to documents. Common hashes are MD5 or SHA. An example of a hash value: d41d8cd98f00b204e9800998ecf8427e.
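
Example: the sample value above is the MD5 hash of empty input, which can be checked with Python's standard hashlib module. The helper file_hash is a hypothetical name; it sketches one way to hash a file in chunks so that large files need not be read into memory at once.

```python
import hashlib
import os
import tempfile

# MD5 of zero bytes of input.
empty_md5 = hashlib.md5(b"").hexdigest()
print(empty_md5)


def file_hash(path, algorithm="md5", chunk_size=65536):
    """Hash a file chunk by chunk and return the hex digest."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# An empty temporary file produces the same value.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name
empty_file_md5 = file_hash(path)
os.remove(path)
print(empty_file_md5 == empty_md5)
```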

Heuristic: A practical approach to problem-solving whose result is not guaranteed, such as a mental shortcut, rule of thumb, or general strategy.

Hidden data: Data not readily accessible or visible in a native file. For example, hidden rows of an Excel spreadsheet, tracked changes in a Word document, or comments in a PowerPoint presentation.

Hidden file: Files not readily accessible or visible. For example, in the case of many operating systems, critical files are “hidden” to prevent inexperienced users from accidentally deleting or altering them.

Hosting: A service provided by a third-party litigation support firm that provides access to documents relating to a particular matter within a review software platform. The platform can be accessed via the internet by logging in with a username and password.

Hot documents: Documents most likely to be used during the course of a lawsuit. Often contain particularly damaging or exculpatory information. Also referred to as "smoking guns" or "front page of the NY Times" documents.

Hyperlink: An element in an electronic document (usually appearing as an underlined word or image) that links to another place in the same document or to an entirely different document when clicked.

I

Image processing: Capturing an image, usually from data in its native format, so it can be entered into another computer system for processing and, often, manipulation.

Image: Refers to an exact replica. Image may refer to a document type, such as a .tif or .jpeg. To image a hard drive means to make an identical copy. Forensic imaging is a bit-for-bit copy and can be made at a logical or physical level, meaning a user copies the C drive or the D drive, or the unallocated space (which is where deleted data resides). The main advantage to forensic imaging is the checksums and verification by digital fingerprint contained in the image format, which show that the image has not been altered since its inception. If the image has been altered, the CRC values (checksums) and the digital fingerprint (such as the MD5 hash) will change and not match, and the image will not verify.

Import: To bring information or data to one environment or application from another.

Inactive record: Records that are no longer routinely referenced but must be retained, usually for audit or reporting purposes.

Inclusive: An email that contains unique content not included in any other email and thus must be reviewed. An email with no replies or forwards is, by definition, inclusive. The last email in a thread (also called the last in time) is inclusive by definition.

Index: In electronic discovery, index refers to database fields used to categorize, organize and identify each document or record.

Inline changes: A thread member that has changes made to an earlier message segment, e.g. when someone adds comments in another color to a previous email.

IRT or intelligent review technology: See predictive coding.

Issue coding or issue tagging: Identification during document review of specific subject areas or topics for which the document may prove useful during litigation.

J

Junk files: Documents with little to no evidentiary value, such as spam emails; may also include unreadable file types or computer-generated temporary files.

K

Key document identification (KDI): Identifying the most important documents in a population. See hot documents and issue coding.

Keyword search: In e-discovery, a keyword search is a process of examining electronic documents in a collection or system by matching a keyword or keywords with instances in different documents. Keyword searches can only be done on electronic files in their native format, in searchable PDF, or in files associated with an OCR text file. Standard keyword searches will return a positive result only if the exact keyword or a close derivative is specified. Search derivatives returned by litigation support search engines commonly include stemming.

Keywords: Words designated as being important for search purposes.

L

Legacy data: Data whose format has become obsolete, making it difficult to access or process.

Legal hold: See litigation hold.

Linear review: Linear review is the process of having a human, usually a lawyer, manually review, i.e. set eyes on, each of the documents in the potentially responsive set (usually based on documents hit by an agreed-upon set of keywords) before any of them are produced to the other side. Linear review is very expensive and time-consuming and results in a fair amount of human error or miscoded documents. Alternatives include technology-assisted review and creative use of keyword searches, selective review, and clawback agreements.

Litigation hold: Communications, issued upon notice of a duty to preserve, that instruct individuals and entities in the efforts required to ensure adequate preservation of potentially relevant evidence. A litigation hold is the temporary interruption of a company’s document retention and destruction policies for data that might be relevant to a lawsuit or data that is reasonably anticipated to be significant. Also called legal hold, hold order, hold notice, preservation order, suspension order, freeze notice, and stop destruction notice. See duty to preserve and preservation.

Load file: A file that is used to import data into an electronic discovery platform after processing. The load file is provided with the data files and provides additional information about the files, such as the directories they came from, metadata not contained in the files themselves, Bates numbers corresponding to the files, and information about the requests to which the files are supposed to be responsive.

M

Machine learning: The next generation of TAR beyond predictive coding that uses continuous active learning. With this advanced TAR, there is no seed set. Rather, human reviewers simply begin coding documents while the computer observes in the background, learning from their entries. The computer analyzes those tags and feeds the review team what it believes are the most important documents. As the team codes those documents, the computer integrates that information, improving its understanding of the data set. Continuous active learning TAR is still dependent on the quality of the human coding, but it improves continuously as the process continues. When the review team reaches a point where few or none of the results are relevant, the process is complete.

MD5: See checksum and hash.

Media or medium: An object on which data is stored. Examples include disks, backup tapes, servers and hard drives.

Meet & confer: A meeting at the beginning of a case for lawyers to talk about discovery and try to reach agreement on preliminary matters like forms of production and dates for depositions. Ideally, it is a good time for the parties to discuss e-discovery and the evidence that’s likely to be sought, and to formulate a game plan.

Metadata: Data, typically stored electronically, describing the characteristics of specific ESI, such as how, when and by whom a particular set of data was created, edited, formatted, and processed. Examples of metadata include: access date, file path, filename, file size, and blind copy (bcc) recipients to an email. Such information is lost when an electronic document is converted to paper form for production. Metadata comes in two main categories, embedded metadata and system metadata. Embedded metadata travels with the file, while system metadata does not, so it is often provided as part of the load file. Examples of system metadata are: directory paths, last-modified dates, and created dates.

Mirroring: Duplicating data or a disk in a manner that results in an exact copy. This is often done for backup purposes.

N

Native format: Refers to an electronic document’s original form, as defined by the application that was used to create the document. Sometimes documents are converted from their native format to an imaged format, such as TIFF or PDF. Once converted, the original metadata cannot be viewed.

Natural language search: A manner of searching that does not require formulas or special connectors, e.g. Boolean operators, but can be performed by using plain statements or questions.

Near dupe or near-duplicates: Documents that contain a high percentage of the same content are referred to as near-duplicates. Examples include: different revisions of a memo where a few typos were fixed or a few sentences were added; an original email and a thread member forwarding that email with minimal additional language; a native Word document and a scanned and OCRed hard copy version with a few words not matching due to OCR errors. For e-discovery we want to group near-duplicates together, rather than discarding near-dupes as we might discard exact duplicates, since the small differences between near-dupes of responsive documents might reveal important information. Part of e-discovery includes the process of identifying and culling documents that are nearly duplicate. Deduplication software can group near duplicate documents by percentage of similarity, so a user can determine a cut-off point where near duplicates are sufficiently different to warrant review. See deduplication.
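
Example: a rough sketch of scoring two documents by percentage of similarity. It assumes one common approach, Jaccard similarity over overlapping three-word sequences ("shingles"); near-duplicate tools use their own measures. The sample sentences and the function names shingles and similarity are invented for illustration.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


memo_v1 = "the board approved the budget for the next fiscal year on monday"
memo_v2 = "the board approved the budget for the next fiscal year on tuesday"
unrelated = "please order lunch for the team meeting tomorrow at noon"

# Two revisions of the same memo score high; unrelated text scores low.
print(f"{similarity(memo_v1, memo_v2):.0%}")
print(f"{similarity(memo_v1, unrelated):.0%}")
```

A reviewer could then pick a cut-off percentage below which documents are treated as different enough to warrant separate review.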

Near native: Functionally the same as native. Because some things can’t be produced in the application that created them, the next best thing is near-native. An example is an email generated in Gmail.

Network: Two or more computer systems that are linked together.

Non-inclusive: An email whose text and attachments are fully contained (or subsumed) in another, i.e. no unique content.

Non-reciprocal senders: Emails from addresses that do not receive messages, such as auto-notifications, newsletters, and spam senders. Documents identified as having non-reciprocal senders may be removed or de-prioritized for review because they are less likely to have responsive content.
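
A minimal Python sketch of this check, assuming each message has been reduced to a hypothetical dictionary with `from` and `to` fields: a sender is flagged when its address never appears as a recipient anywhere in the collection.

```python
def non_reciprocal_senders(messages):
    """Return sender addresses that never appear as a recipient."""
    senders = {message["from"] for message in messages}
    recipients = {address for message in messages for address in message["to"]}
    return senders - recipients


messages = [
    {"from": "alice@example.com", "to": ["bob@example.com"]},
    {"from": "bob@example.com", "to": ["alice@example.com"]},
    {"from": "no-reply@news.example.com", "to": ["alice@example.com"]},
]
flagged = non_reciprocal_senders(messages)
print(flagged)
```

In a real collection the result depends on how complete the data set is, so a flagged address is a candidate for de-prioritization rather than proof that nobody ever wrote back.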

Normalization: The process of reformatting data so that it can be stored in standardized format.

OCR or optical character recognition: The process of translating and converting printed images and copy into machine-encoded electronic text. A method of digitizing printed texts used so that documents can be converted from a non-searchable form, such as a scanned hard copy, into a form that can be electronically edited and searched.

Off-line storage: Storage of electronic records on a removable disc or other device for disaster-recovery purposes.

Operating system (OS): Provides the software platform that directs the overall activity of a computer, network or system. Common examples include UNIX, DOS, Windows, LINUX and Macintosh. The operating system is the foundation on which applications are built.

Overwrite: To copy or record new data over existing data, as with backup-tape recycling or when updating a file or directory. Overwritten data cannot be retrieved.

Parent document: A document, usually an email, to which other documents and files are attached.

PDA or personal digital assistant: Handheld devices with computing, Internet, phone/fax and similar capabilities, such as Blackberries. Technology mostly replaced by smartphones.

Peripheral: Refers to a device that attaches to a computer, such as a printer, modem or disk drive.

Potentially privileged data set (PP): Documents set aside to be reviewed by attorneys in case they can be withheld on grounds of privilege. Potentially privileged documents are usually identified by legal keywords, attorney names and email addresses, and law firm names and domains.

Precision: In search results analysis, precision measures how relevant the documents in the results set are to the query, i.e. the proportion of retrieved documents that are responsive.
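
The proportion can be worked through in a short Python sketch. The document ids are made up for illustration, and returning 0.0 for an empty results set is a convention assumed here.

```python
def precision(retrieved, responsive):
    """Proportion of retrieved documents that are responsive."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(responsive)) / len(retrieved)


retrieved = {"doc1", "doc2", "doc3", "doc4"}
responsive = {"doc1", "doc2", "doc5"}
score = precision(retrieved, responsive)  # 2 of the 4 retrieved are responsive
print(score)
```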

Predictive Coding: A method that uses artificial intelligence to predict which documents are more likely to contain responsive content. It is used to cull documents for production or review by reducing the number of non-responsive and irrelevant documents. Predictive coding uses a combination of machine learning, keyword search, filtering, and sampling to automate portions of document review. A predictive coding algorithm determines the relevance of documents based on linguistic and other properties and characteristics. It relies on the initial coding of a sample of documents by human reviewers, called a “seed set.” The computer analyzes the seed set and learns from it how to identify and evaluate the remaining documents, so the quality of the results depends heavily on the quality of the original seed set. If that seed set is sloppily coded or incomplete, the computer’s results are similarly flawed. Also referred to as IRT or TAR. See machine learning.
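
The dependence on the seed set can be illustrated with a toy Python sketch. It is not how any particular product works: the word-weight model (log-odds with add-one smoothing), the function names, and the sample documents are all assumptions made for this example.

```python
import math
from collections import Counter


def train(seed_set):
    """Learn a per-word weight from human-coded (text, is_responsive) pairs."""
    responsive, other = Counter(), Counter()
    for text, is_responsive in seed_set:
        target = responsive if is_responsive else other
        target.update(text.lower().split())
    vocabulary = set(responsive) | set(other)
    total_responsive = sum(responsive.values())
    total_other = sum(other.values())
    size = len(vocabulary)
    # Log-odds with add-one smoothing: positive weights suggest responsiveness.
    return {
        word: math.log((responsive[word] + 1) / (total_responsive + size))
        - math.log((other[word] + 1) / (total_other + size))
        for word in vocabulary
    }


def score(weights, text):
    """Sum the learned weights of the words in an unreviewed document."""
    return sum(weights.get(word, 0.0) for word in text.lower().split())


seed_set = [
    ("merger pricing discussion with counsel", True),
    ("draft merger agreement pricing terms", True),
    ("office lunch menu for friday", False),
    ("parking garage closed friday", False),
]
weights = train(seed_set)
unreviewed = ["revised merger pricing", "friday lunch plans"]
ranked = sorted(unreviewed, key=lambda text: score(weights, text), reverse=True)
print(ranked)
```

Everything the model knows comes from the four coded examples, so a miscoded or unrepresentative seed document shifts the weights and, with them, the ranking of every unreviewed document.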

Preservation demand: A letter or email to the opposing party demanding that he or she keep evidence safe and prevent it from being destroyed.

Preservation: The process of managing, identifying, and retaining documents and other data for legal purposes; maintaining documents in a usable form and preventing the destruction of data.

Private network: A network that is connected to the Internet, but limits access to only those persons operating within the private network.

Privilege data set: Documents withheld from production despite being relevant and/or responsive on the grounds of legal privilege. Parties are generally required to produce a privilege log, identifying enough information about each document so that the opposing party can determine whether or not to challenge the withholding (e.g., senders and recipients, creation date, general description of subject matter and privileges asserted).

Privilege log: A document that describes documents or other items withheld from production in a civil lawsuit under a claim that the documents are "privileged" from disclosure due to the attorney–client privilege, work product doctrine, joint defense doctrine, or some other privilege.

Privilege: A special and exclusive legal advantage or right, for example, attorney work product and certain communications between an individual and his or her attorney, which are protected from disclosure.

Production: The process of producing or making available for another party’s review the documents and/or other ESI deemed responsive to one or more discovery requests. The delivery of documents and ESI to the opposing counsel or requesting party; typically involves producing the documents as hard copies, on CD/DVDs, or on hard drives to the other party or parties.

Program: See application and software.

Protocol: A common format for transmitting data between two devices. TCP/IP is one of the most common protocols for networks.

Proximity: Nearness. Search proximity operators find documents that contain words that are near to each other.
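
A proximity operator can be sketched in Python as a check on word positions. The function name `within` and the whitespace-based word splitting are assumptions for illustration; real search systems define their own operator syntax and handle punctuation.

```python
def within(text, first, second, distance):
    """True if `first` and `second` occur within `distance` words of each other."""
    words = text.lower().split()
    first_positions = [i for i, word in enumerate(words) if word == first]
    second_positions = [i for i, word in enumerate(words) if word == second]
    return any(
        abs(i - j) <= distance
        for i in first_positions
        for j in second_positions
    )


text = "the contract was signed after the merger closed"
print(within(text, "contract", "merger", 5))
print(within(text, "contract", "merger", 3))
```

Here "contract" and "merger" are five words apart, so the first check succeeds and the narrower three-word check does not.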

PST: A personal storage table (.pst) is a file format used to store copies of messages, calendar events, and other items within Microsoft software such as Microsoft Exchange Client, Windows Messaging, and Microsoft Outlook; a file format for wrapping up large numbers of emails and attachments in a way that preserves their ability to be searched.

Quality control (QC): Efforts undertaken to ensure the quality of a product or task.

Query: A set of search terms and operators submitted to a search system that includes a description of the required information in terms allowed by the system. Also called a search string.

Quick peek agreement: An agreement allowing opposing parties to look at certain documents to facilitate an agreement on how to proceed.

Random sampling: Statistical process of choosing documents randomly, with each document having an equal chance of being selected.
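
A minimal Python sketch of drawing such a sample, assuming documents are identified by hypothetical control numbers. A fixed seed is used only so the draw can be reproduced and audited later.

```python
import random


def draw_sample(document_ids, sample_size, seed=None):
    """Draw documents without replacement; each has an equal chance."""
    rng = random.Random(seed)
    return rng.sample(list(document_ids), sample_size)


population = [f"DOC-{number:04d}" for number in range(1, 1001)]
sample = draw_sample(population, 50, seed=42)
print(len(sample), sample[:3])
```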

Recall: In search results analysis, recall is the percentage of all relevant documents in the collection that appear in the results set, i.e. the proportion of responsive documents in the entire collection that have been retrieved.
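
The proportion can be worked through in a short Python sketch. The document ids are made up for illustration, and returning 0.0 when the collection has no responsive documents is a convention assumed here.

```python
def recall(retrieved, responsive):
    """Proportion of all responsive documents that were retrieved."""
    responsive = set(responsive)
    if not responsive:
        return 0.0
    return len(set(retrieved) & responsive) / len(responsive)


retrieved = {"doc1", "doc2", "doc3", "doc4"}
responsive = {"doc1", "doc2", "doc5"}
score = recall(retrieved, responsive)  # 2 of the 3 responsive documents were found
print(round(score, 3))
```

In practice the full set of responsive documents is not known in advance, so recall is usually estimated from a reviewed sample.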

Recall-precision graph: Graph showing the trade-off between precision and recall. Typically the higher the recall, the lower the precision. In order to get more of the responsive documents, one may have to accept more non-responsive documents.

Record custodian: Individual responsible for the physical storage and protection of records. Custodian may also refer to various individuals with knowledge and/or possession of, or who created, sent, received and/or stored emails, documents and other data relevant to an ongoing or potential dispute.

Records manager: The person responsible for implementation of a records management program.

Records retention period: The length of time a given set or series of records must be maintained. The retention period is often expressed as a period of time (such as six years), an event or action (such as completion of an audit), or both (six years after completion of an audit).
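
The "six years after completion of an audit" case can be sketched in Python. The function name and the handling of 29 February (falling back to 28 February) are assumptions for illustration; an actual retention schedule may define the end date differently.

```python
from datetime import date


def retention_end(trigger_date, years):
    """Return the date `years` after the triggering event."""
    try:
        return trigger_date.replace(year=trigger_date.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        return trigger_date.replace(year=trigger_date.year + years, day=28)


audit_completed = date(2019, 6, 30)
keep_until = retention_end(audit_completed, 6)
print(keep_until.isoformat())
```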

Redaction: The intentional concealing of a portion of a document or image, done for the purpose of preventing its disclosure. To redact a document is to deliberately cover portions of the document that are considered privileged, proprietary, or confidential. This is usually done by “blacking-out” or “whiting-out” the copy that is to be concealed.

Relevance: Degree to which a document is related to a query.

Reliability: Statistical idea related to consistency and repeatability. The same measure taken in the same kind of situation should yield the same results in a reliable system.

Responsive review: Technology-assisted review process to identify responsive documents for production.

Responsiveness: A determination of whether a document or data record is relevant to a specific discovery request or RFP.

Restore: Transferring data from a backup medium to an active system. Data is often restored for the purpose of recovering the data after a problem, failure or disaster, or where the data is relevant and has not been preserved or cannot be accessed elsewhere.

Retrieval: Identification of documents in a collection that are potentially responsive.

Review platform: Software for examining electronic evidence, e.g. Relativity or Concordance.

Review: Process used to read or otherwise analyze documents in order to determine content, relevance or applicability of some other objective or subjective standard. See document review.

Rewritable technology: Storage devices that permit data to be written more than once, such as hard drives and floppy disks.

Sampling: Process of choosing a subset of documents in a population that are intended to represent the whole population. Usually refers to any process by which a large collection of ESI or a database is tested to determine the existence and/or frequency of specific data or types of information.

Search results: The list of documents returned as the result of a query.

Searchable text: The content of a document is made searchable when the text of the document is indexed. Then a search is performed by referencing the index to find which documents contain that search term. For that index to be created, the text of the document must first be obtained and recorded by extracting it from the document. See index and tokenization.
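
The index-then-search sequence described here can be sketched in Python. This is a simplified inverted index that assumes the text has already been extracted; the function names and sample documents are hypothetical, and words are split on whitespace only.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, term):
    """Look the term up in the index instead of rereading every document."""
    return sorted(index.get(term.lower(), set()))


documents = {
    "A": "Quarterly budget report",
    "B": "Budget meeting notes",
    "C": "Holiday schedule",
}
index = build_index(documents)
print(search(index, "budget"))
```

A document whose text was never extracted, such as a scanned image without OCR, contributes nothing to the index and so cannot be found by a term search.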

Seed set: The initial set of data/documents used in predictive coding. Learning algorithms are “trained” on the seed set so that they can cull data down to a potentially relevant set for reviewers to analyze for production or privilege. See Predictive Coding.

Sentiment analysis: Process of identifying sentiment in text, such as positive or negative emotional content, i.e. heightened language. See emo filter.

Server: A computer or device on a network that manages network resources. There are several different types of servers. Typically, a server will be dedicated, which means that it performs no task other than its server tasks (e.g., a file server stores files and a print server manages one or more printers).

Smart card: A credit card size device that contains a battery, memory and microprocessor.

Social discovery: Discovery of electronically stored information on the various social media sites used today, including but not limited to: Facebook, Twitter, YouTube, LinkedIn, and Instagram.

Social media: Internet applications that permit users to publicly and interactively share and communicate information, whether of a general or personal nature.

Social network: A group of people who utilize social media, typically based on a specific theme or interest. Facebook is an example of a popular social network.

Software: Programs used to direct the operations of a computer, which includes operating systems and software applications. See application.

Source: The place where electronic evidence lives, e.g. computer disks, smart phones, thumb drives, Dropbox.

Spoliation: The alteration, deletion or partial destruction of records which may be relevant to ongoing or anticipated litigation, government investigation or audit. Failure to preserve information that may become evidence is also spoliation.

SQL or structured query language: A database computer language used to manage data; the international standard language for querying databases.
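
A small example of an SQL query, run here from Python's built-in `sqlite3` module so that it is self-contained. The `documents` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE documents (doc_id TEXT, custodian TEXT, page_count INTEGER)"
)
connection.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [("DOC-1", "Smith", 3), ("DOC-2", "Jones", 12), ("DOC-3", "Smith", 7)],
)

# SQL query: list the documents held by one custodian.
rows = connection.execute(
    "SELECT doc_id, page_count FROM documents WHERE custodian = ? ORDER BY doc_id",
    ("Smith",),
).fetchall()
connection.close()
print(rows)
```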

Stand-alone computer: A computer not connected to a network or other computers, except possibly through use of a modem.

Stand-alone document: A document, generally a non-email, that is not attached to an email family or other group of documents.

Stemming: Search function allowing returns of grammatical variations on a word. Stemming removes prefixes and suffixes from words before indexing them and using them for query processing. For example a search for "related" would also have results "relating", "relates", and "relate".
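
The "related" example can be reproduced with a toy Python stemmer. It strips only a handful of suffixes and ignores prefixes, so it is a deliberately naive sketch and not a production stemming algorithm; the suffix list and function names are assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "e", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_search(term, documents):
    """Return ids of documents containing any word with the same stem."""
    target = stem(term)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(stem(word) == target for word in text.split())
    )


documents = {
    "A": "relating to the merger",
    "B": "relates back",
    "C": "unrelated matter",
}
print(stem_search("related", documents))
```

All of "related", "relating", "relates", and "relate" reduce to the same stem here, which is why a search on one finds the others; "unrelated" does not match because this sketch leaves prefixes alone.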

Stop words: Very common words that usually have very little value in a query, e.g. "the", "and", "I". Many systems ignore these words in a query or do not index them at all.
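
A minimal Python sketch of dropping stop words from a query before it is run. The stop list here is a small assumed sample; each system maintains its own list.

```python
# A tiny illustrative stop list; real systems ship much longer ones.
STOP_WORDS = {"the", "and", "i", "a", "of", "to"}


def remove_stop_words(query):
    """Return the query terms that are not stop words, in original order."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


terms = remove_stop_words("the merger and the audit")
print(terms)
```

One practical consequence is that an exact phrase containing stop words may not be searchable as written on a system that does not index them.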

Storage device: This usually refers to mass storage devices, such as disks and tape drives, but can be used to describe any device capable of storing ESI.

Structured data: Data stored in a structured format such as a database.

System admin or system administrator: The person in charge of keeping a network working.

System files: Electronic files, created by the computer rather than by its user, that are part of the operating system or other control program.

Tagging: The process of assigning classifications, such as by relevance or privilege, to one or more documents. See coding.

TAR or technology-assisted review: An approach within the document review phase of e-discovery that leverages computer algorithms to identify and tag potentially responsive documents based on keywords and other metadata to expedite the document review process. TAR can, if carefully trained and applied, return more accurate and complete results both faster and cheaper than any human review team. See also machine learning and predictive coding.

Temporary or temp file: Files created by applications and stored temporarily on a computer. Temp files enable increased processor speeds. In the case of temporary Internet files, for example, a browser stores website data so that the next time the same website is accessed it can be loaded directly from the temporary Internet file. Stored data may also be viewed even in the absence of an Internet connection.

Thread: Also called a chain or string. An email conversation consisting of multiple message units, including the initiating email and all emails related to it, such as the replies and forwards between senders and recipients in the chain. A thread may include multiple branches if the conversation veers off in multiple directions, e.g. if some message units have multiple, different replies.

TIFF: A tagged image file format (.tif) is a widely supported and utilized graphic file format used for storing bit map images and scanned hard copy documents. TIFFs are a static format, which means the data cannot be altered. Native documents might be converted to TIFF to prevent alteration; the conversion results in the loss of metadata but permits commonly used functions such as Bates labeling and redaction. TIFF images are produced with a load file to make up for the fact that the TIFF conversion process strips out most metadata contained in the original file.

Tokenization: The process of substituting a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no extrinsic or exploitable meaning or value. The token is a reference (i.e. identifier) that maps back to the sensitive data through a tokenization system.
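
The mapping between tokens and sensitive values can be sketched in Python as a small in-memory "vault". The class name `TokenVault`, the `tok_` prefix, and the sample value are hypothetical; a real tokenization system would add secure storage and access controls.

```python
import secrets


class TokenVault:
    """Toy tokenization system: swaps sensitive values for random tokens."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return the existing token for a value, or issue a new random one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Map a token back to the sensitive value it stands for."""
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("123-45-6789")
print(token, vault.detokenize(token))
```

The token is random, so it reveals nothing about the original value; only a party with access to the vault can map it back.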

Unallocated space: The area of computer media, such as a hard drive, that does not contain normally accessible data. Unallocated space frequently results from deletion, wherein data resides but is not generally accessible, until being overwritten, wiped or retrieved through utilization of forensic techniques.

Unicode: The code standard that provides for uniform representation of character sets for all languages. It is also referred to as double-byte language.

Unique content: A thread member that contains additional information not fully subsumed in another thread member.

Unitization: The process of splitting image files received in multiple page formats down into individual documents.

Unstructured data: Information that does not exist in the usual row-column database. These text and multimedia data files, such as webpages, audio files or videos, lack the ability to be organized effectively within a database.

USB or universal serial bus port: A socket on a computer or other device into which a USB cable or device can be inserted.

Vendor: An e-discovery company that provides data management, processing and/or hosting for e-discovery. May also host the review platform.

Virtual dupe: Documents that yield the same or similar hash values.

VPN or virtual private network: Secure networks that utilize mechanisms, such as encryption, to ensure access by authorized users only and prevent data interception.

W

Wild cards: Operators that allow query terms to be expanded, e.g. an asterisk used at the end of a word signaling that it can be completed in any way. See fuzzy match.
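
The trailing-asterisk case can be sketched in Python with the standard `fnmatch` module, which treats `*` as "any run of characters". The vocabulary is made up for illustration, and review platforms define their own wild card syntax, which may differ from the shell-style rules used here.

```python
from fnmatch import fnmatchcase


def expand(pattern, vocabulary):
    """Return the indexed words that a wild card pattern would match."""
    return sorted(word for word in vocabulary if fnmatchcase(word, pattern))


vocabulary = ["relate", "related", "relating", "relative", "unrelated"]
matches = expand("relat*", vocabulary)
print(matches)
```

The pattern also picks up "relative", which shows how wild cards can broaden a query beyond the variations the searcher had in mind.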

Z

Zip drive: A specific kind of removable disk storage device.

ZIP: A .zip is a common file format allowing fast and simple compressed storage for the purposes of archiving or transmitting large files.