Complete Data Solutions

Know Your Data

Validation & Standardization

Just as a software bug gets exponentially more expensive to fix the longer it goes uncorrected, bad data costs more to process and correct after ingestion. One of the best ways to improve accuracy on large data sets is to improve the data during ingestion. When massive amounts of data are ingested, however, identification, validation, and standardization are often performed only minimally or skipped completely, because these steps demand significant resources when the incoming data format is not clearly defined or the fields are not strongly typed.

To validate and standardize input fields, it is essential to recognize known prefixes, suffixes, or substrings. Typical approaches test subsets of the target string against known values using hash maps, tree-based maps, regular expressions, or remote databases. While functional, these approaches all force users to trade off run-time performance against initialization performance.
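As an illustration of the typical hash-map approach described above, here is a minimal Python sketch. The table contents and the function name `longest_known_prefix` are hypothetical, chosen only for this example. Note that the cost grows with the number of candidate lengths, since each one needs its own lookup.

```python
# Hypothetical table of known prefixes mapped to their standardized form.
KNOWN_PREFIXES = {
    "st": "Street",
    "st.": "Street",
    "ste": "Suite",
    "street": "Street",
}


def longest_known_prefix(text, table=KNOWN_PREFIXES):
    """Return (matched_prefix, metadata) for the longest known prefix, or None."""
    lowered = text.lower()
    max_len = min(len(lowered), max(map(len, table)))
    # One hash lookup per candidate length, longest first.
    for length in range(max_len, 0, -1):
        candidate = lowered[:length]
        if candidate in table:
            return candidate, table[candidate]
    return None
```

For example, `longest_known_prefix("Ste 317")` probes six substrings before it finds `"ste"`. This is the sequential lookup cost that a state machine avoids.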

To improve efficiency, we have developed a program called Fast Data Recognizer (FDR). FDR makes it possible to validate massive amounts of data with the performance required to scale to very large data sets, and it combines the best properties of all of the above methods. For a given set of known values and their corresponding metadata (the validated or standardized format), there is a one-time initialization in which all of the byte sequences (of which there can be millions) and their metadata are ‘compiled’ into a memory-mappable state machine file. This allows:

  • Nearly zero-cost initialization: simply map the file into memory and it is ready to use, with blocks paged in on demand and all processes on the machine sharing the same memory pages.
  • Lookup speed similar to that of a regular expression engine, which also uses a state machine.
  • Simultaneous matching of known values of all lengths from a given starting point in the target string, with no need for sequential hash or map lookups.
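The properties listed above can be sketched in Python as follows. This is not FDR's actual file format or API, which this page does not describe. The layout (a flat table of 257 32-bit integers per state, followed by a metadata blob) and the names `compile_table`, `open_table`, and `matches_at` are assumptions made for illustration only.

```python
import mmap
import struct

NODE_SIZE = 257 * 4  # 256 transitions + 1 metadata slot, 4 bytes each


def compile_table(known, path):
    """One-time step: write a byte-level trie for `known` (bytes -> bytes)."""
    nodes = [[0] * 256 + [-1]]  # 0 = no transition (the root is never a target)
    blob = bytearray()
    for key, meta in known.items():
        state = 0
        for byte in key:
            if nodes[state][byte] == 0:
                nodes.append([0] * 256 + [-1])
                nodes[state][byte] = len(nodes) - 1
            state = nodes[state][byte]
        nodes[state][256] = len(blob)  # offset of this key's metadata
        blob += struct.pack("<H", len(meta)) + meta
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(nodes)))
        for node in nodes:
            f.write(struct.pack("<257i", *node))
        f.write(blob)


def open_table(path):
    """Map the compiled file read-only; no parsing or rebuilding is needed."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def matches_at(table, data, start):
    """Yield (end, metadata) for every known value that begins at data[start]."""
    (node_count,) = struct.unpack_from("<I", table, 0)
    blob_base = 4 + node_count * NODE_SIZE
    state = 0
    for pos in range(start, len(data)):
        offset = 4 + state * NODE_SIZE + data[pos] * 4
        (state,) = struct.unpack_from("<i", table, offset)
        if state == 0:
            return  # no known value continues with this byte
        (meta_off,) = struct.unpack_from("<i", table, 4 + state * NODE_SIZE + 1024)
        if meta_off >= 0:
            (length,) = struct.unpack_from("<H", table, blob_base + meta_off)
            begin = blob_base + meta_off + 2
            yield pos + 1, bytes(table[begin:begin + length])
```

A single left-to-right pass from `start` reports known values of every length, and the operating system pages in only the parts of the file that the lookup touches. A production format would presumably use a far more compact state encoding than the 1 KB per state used here.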

When a match is found, a callback tells the client code where the match occurred and what was found, and passes along the metadata for the known sequence.
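The callback pattern might look like the following Python sketch. The names `build_trie`, `scan`, and `on_match` and the callback signature are hypothetical, and an in-memory dictionary trie stands in for the compiled state machine.

```python
def build_trie(known):
    """Build a nested-dict trie; the key None marks an accepting state."""
    root = {}
    for value, metadata in known.items():
        node = root
        for ch in value:
            node = node.setdefault(ch, {})
        node[None] = metadata
    return root


def scan(text, trie, on_match):
    """Call on_match(start, end, metadata) for every known value in text."""
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no known value continues from here
            if None in node:
                on_match(start, end + 1, node[None])
```

Client code supplies the callback and decides what to do with each match, for example collecting `(start, end, metadata)` tuples in a list or rewriting the matched span into its standardized form.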

Contact Us

Corporate Headquarters
7115 Leesburg Pike, Suite 317
Falls Church, VA 22043
Telephone: 703-536-3282
Facsimile: 703-536-3283

Email: Info@cdsllc.com

GSA Schedule GS-35F-0572U

© Complete Data Solutions