Susan Smith has worked as an editor and writer in the technology industry for over 16 years. As an editor she has been responsible for the launch of a number of technology trade publications, both in print and online. Currently, Susan is the Editor of GISCafe and AECCafe, as well as those sites’ newsletters and blogs. She writes on a number of topics, including but not limited to geospatial, architecture, engineering and construction. As many technologies evolve and occasionally merge, Susan finds herself uniquely situated to be able to cover diverse topics with facility.
DataFission’s New DUSE Search Engine Lets You Search Any Unstructured Data
July 29th, 2016 by Susan Smith
Dr. Harold Trease, DataFission’s Chief Scientist, spoke with GISCafe Voice about the new DataFission DataHunter, Digital Universe Search Engine (DUSE), a content-based search engine for digesting and searching unstructured data.
The new search engine allows users to search any digital data, including video, audio, network traffic, satellite images, radar data, malware, and images as well as unstructured text, all in one search. Unlike traditional search engines such as Yahoo, Google and Microsoft, which have been around for 15 years, DUSE requires no additional work to search whatever medium the user needs. It offers simple APIs and GUIs that provide rankings and answers to complex data queries for uniformed personnel and analysts, as well as tools for the advanced data scientist, such as deep access to live data structures.
GISCafe Voice: What types of input do you put into the search engine, as an example, to retrieve certain types of information?
DUSE, the DataFission Digital Universe Search Engine, is a content-based search/recommendation engine used to ingest, digest, index, and search unstructured data. DUSE is data format agnostic and ingests all forms of digital data in the Digital Universe, whether streaming or archived. This includes images (including WAMI and satellite imagery), video (aerial and terrestrial FMV), audio, cyber network traffic, IoT sensor data, and (unstructured/structured) text. DUSE was designed to exploit the inherent structure of the data in the Digital Universe by transforming the data into high-dimensional signatures and projecting these signatures into a search space. DUSE is the only search engine capable of truly “connecting the dots”, thus providing data fusion, by projecting all data into the same high-dimensional search space. This allows for connections that span time, location, media type, and media mode.
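The idea of projecting different kinds of data into one search space can be illustrated with a minimal sketch. It is not DataFission's algorithm: the `signature` function below (a normalized 256-bin byte histogram) and the cosine ranking are assumptions chosen only because they work on any byte stream regardless of format.

```python
import math
from collections import Counter

def signature(data):
    """Hypothetical format-agnostic signature: a normalized byte histogram."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, corpus, top_k=3):
    """Rank every item in the corpus by similarity to the query bytes."""
    q = signature(query)
    scored = [(name, cosine(q, signature(blob))) for name, blob in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy corpus: two text items and one non-text blob, all treated as raw bytes.
corpus = {
    "text_note": b"the quick brown fox jumps over the lazy dog",
    "text_memo": b"a quick brown dog jumps over the lazy fox",
    "binary_blob": bytes(range(200, 256)) * 4,
}
results = search(b"the lazy dog and the quick fox", corpus)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Because every item ends up as a vector of the same length, a single ranking function covers all of them. A real system would presumably use far richer signatures than a byte histogram.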
GISCafe Voice: From what sources are you going to locate this information?
DUSE is designed to index and search the Digital Universe’s Internet of Everything including open and dark Web content, Internet network traffic, enterprise data, Internet of Things sensors, etc. In short, any source that is “crawlable” can be indexed and made searchable by DUSE.
Applications to date that we have addressed and demonstrated with proven results for DOD/IC/DHS in various forms are as listed:
GISCafe Voice: Are there forecasting tools built into your solution?
Yes. The DataFission DUSE Platform employs tightly-integrated driven Memory Networks to accelerate and expand DUSE’s search capabilities to enable machine learning and artificial intelligence algorithms that incorporate learning and prediction into the search process. The DUSE auto-generated index tables effectively quantify and catalogue all of the information content into a form that can be queried with human-to-machine or machine-to-machine interfaces that search, compare, classify, and link/associate data. The techniques utilized by DUSE’s unstructured data search engine and deep data analytics platform are currently in use for video/image/audio/text analysis but are generally applicable to any type of unstructured data stream or archive.
GISCafe Voice: Does foreign language play a large part in your deductions?
DUSE is agnostic to foreign language because at its core DUSE is a very general pattern matching engine. If data has patterns (in the high-dimensional search space), then DUSE is able to match patterns to produce similarity search results. Because DUSE is data agnostic, any digitized spoken or written language is made searchable. Just as images and video are independent of foreign language because they are just visual patterns to humans, any form of digital data, to DUSE, is just a pattern that is searchable. DUSE is the ultimate data fusion engine, able to make connections and associations between types of data.
At this time DUSE is not aware of the cultural, political, or religious nuances and inflections of the data, but it will grow more capable and aware through further development of the machine learning and AI methods.
GISCafe Voice: Does DataFission analyze and give recommendations, or does the customer make those determinations?
Yes. Both. DataFission has built a search engine, actually a “Similarity Engine” to be more precise, not only to search the data but also to mine the data for relationships such as patterns-of-life, activity-based-intelligence, and cyclostatic phenomena that occur in any type of digital data. The DUSE search engine has two modes: supervised and unsupervised. With supervised search, a human, or maybe another machine, gives some example of what they are querying for and uses their own judgement to put the search results in context. With unsupervised search, DUSE decides what is important and interesting in the data (e.g., unique, anomalous, repetitive) and presents its summary of results. A user or another machine can then put these results into their context.
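The difference between the two modes can be sketched in a few lines. This is an illustration under stated assumptions, not DUSE's implementation: the two-dimensional vectors, the Euclidean distance, and the "farthest from the centroid" rule for finding anomalies are all hypothetical stand-ins.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def supervised_search(example, index, top_k=2):
    """Supervised mode: the caller supplies an example; rank items by closeness to it."""
    return sorted(index, key=lambda name: distance(example, index[name]))[:top_k]

def unsupervised_anomalies(index, top_k=1):
    """Unsupervised mode: no example; flag items farthest from the centroid of all data."""
    dims = len(next(iter(index.values())))
    centroid = [sum(v[i] for v in index.values()) / len(index) for i in range(dims)]
    return sorted(index, key=lambda name: distance(centroid, index[name]),
                  reverse=True)[:top_k]

# Hypothetical signatures already projected into a shared search space.
index = {
    "car_a": [1.0, 1.1],
    "car_b": [1.2, 0.9],
    "car_c": [0.8, 1.0],
    "plane": [8.0, 7.5],
}

print(supervised_search([1.0, 1.0], index))  # items most like the example
print(unsupervised_anomalies(index))         # item that stands out on its own
```

In both cases the output is only a ranked list. Deciding what the result means is still left to the user or to a downstream system, as the answer describes.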
DUSE transforms and projects all forms of digital data into a high-dimensional search space, and this space is then searched for patterns of interest. These patterns may be people, places, and things (vehicles, backpacks, airplanes, airports). These patterns may also represent activities and events that make up complex patterns of life.
The DUSE Similarity Engine does make recommendations based on criteria and determinations selected by the user of specific applications relative to actions or objects of interest. The user selects what to search for, and search analytics and machine learning return ranked recommendations with an extremely high degree of sensitivity. The system learns user/application preferences and increases in accuracy over time. The results often include very accurate and unexpected positives, insights that humans may not have recognized.
DUSE is also extremely useful as a data discovery engine. A user can supply the data to DUSE, and DUSE returns what “it” thinks are the important patterns that may represent objects (people, places, and things) or events/activities. The user can put these discovered facts into their specific context to judge the significance. These discovered facts can also be used to find important connections to build multi-data social networks, not just of people-to-people, but of everything-to-everything. Rare and infrequent events, such as anomalies and outliers, can also be revealed through the immense data discovery capability of DUSE.
The DUSE platform excels in processing high data rate/multiple streaming video analytics:
DUSE’s analytic capabilities in Patterns-of-Life lead DUSE into predictive analytics and event-based forecasting.
GISCafe Voice: What is the search engine particularly sensitive to?
In a technical sense, because DUSE is a general pattern matching search engine, any corruption in the data (i.e., patterns) produces fuzzy results based on the magnitude of the data corruption. In the DUSE world, we call any corruption of the data “noise”. There are many sources of noise in real data (resolution, size changes, orientation changes, non-optimal sensor environments, instrument collection noise, weather conditions, etc.).
As a similarity-based search engine, DUSE turns these sources of noise to its advantage. Any data collected in the wild is going to have noise associated with it. The DUSE algorithms are, by design, able to deal with it by returning results based on similarity matching. For example, similarity search allows the search engine to return results for a person’s face from different angles, under different lighting conditions, or when parts of the face are obscured or occluded. If a user is shopping for a pair of shoes and snaps a cell phone picture of a pair they like, the search engine would return shoes similar to the example, which may include some that the user did not consider. This allows for a “fuzzy” definition of what a match is, and in the discovery mode this can be very valuable.
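A small sketch can show how a similarity score degrades gradually as noise grows, and how a threshold turns that score into a “fuzzy” match. The vector values, the fixed perturbation pattern standing in for sensor noise, and the 0.9 threshold are all assumptions made for illustration. None of them come from DUSE.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def perturb(signal, pattern, magnitude):
    """Add a scaled, fixed perturbation to a signal (a deterministic stand-in for noise)."""
    return [s + magnitude * p for s, p in zip(signal, pattern)]

def is_match(query, candidate, threshold=0.9):
    """Fuzzy match: accept any candidate whose similarity clears the threshold."""
    return cosine(query, candidate) >= threshold

clean = [1.0, 2.0, 3.0, 4.0]            # hypothetical clean signature
noise_pattern = [1.0, -1.0, 1.0, -1.0]  # hypothetical corruption direction

scores = {}
for magnitude in (0.0, 0.5, 1.0, 2.0):
    noisy = perturb(clean, noise_pattern, magnitude)
    scores[magnitude] = cosine(clean, noisy)
    print(f"noise {magnitude}: similarity {scores[magnitude]:.3f}, "
          f"match={is_match(clean, noisy)}")
```

An exact-match lookup would fail at the first sign of corruption. A thresholded similarity score keeps returning the item until the noise becomes large.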
DUSE is also sensitive to the amount of data that it processes. For DUSE the more data it processes, the better. This is because more data means more patterns, which means its search space is better populated. Our goal at DataFission is to index and make searchable the entire Digital Universe.
Of course, DUSE is also sensitive to the amount and speed of computing hardware available. Similarity search of the Digital Universe doesn’t come cheap. IDC estimates the size of the Digital Universe currently at about 5 zettabytes, growing to greater than 45 zettabytes by 2020. Image and video represent about 88% of that data. Audio, communications, and network traffic comprise about 10%, and text, both unstructured and structured, represents the remaining 2%. Google is estimated to have indexed about 3-5% of this 2%. This means that Google is only indexing and searching a very small fraction of the Digital Universe. Yet Google’s infrastructure is composed of hundreds of thousands (probably millions) of servers, and this is just so that it can index and search a fraction of the Digital Universe. DUSE is designed to search the whole Digital Universe, which of course requires (using current technology) a much larger computer infrastructure than Google currently has. This leads DataFission to explore game-changing algorithmic and hardware methods to optimize the unstructured search process. This includes hybrid CPU/accelerator systems (GPUs, FPGAs), as well as even more exotic hybrid architectures such as a CPU + D-Wave quantum computer.
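The “very small fraction” can be worked out directly from the figures quoted above. The short calculation below uses only those numbers (the 88/10/2 split, the 3-5% estimate, and the 5 zettabyte total). The variable and function names are illustrative.

```python
# Shares of the Digital Universe quoted in the interview (fractions of all data).
shares = {"image_video": 0.88, "audio_comms_network": 0.10, "text": 0.02}

# Estimated portion of the text share that Google has indexed (3-5%).
google_text_low, google_text_high = 0.03, 0.05

# Estimated current size of the Digital Universe, in zettabytes.
universe_zb = 5.0

def indexed_fraction(text_share, indexed_share_of_text):
    """Fraction of the whole Digital Universe covered by indexing part of the text share."""
    return text_share * indexed_share_of_text

low = indexed_fraction(shares["text"], google_text_low)
high = indexed_fraction(shares["text"], google_text_high)

print(f"Indexed fraction: {low:.2%} to {high:.2%} of the Digital Universe")
# 1 zettabyte = 1000 exabytes
print(f"Indexed volume: about {low * universe_zb * 1000:.0f} to "
      f"{high * universe_zb * 1000:.0f} exabytes of {universe_zb:.0f} zettabytes")
```

On these estimates the indexed portion is between 0.06% and 0.1% of the total, roughly 3 to 5 exabytes out of 5 zettabytes.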