(Combination of Speaker, Speech, and Face Recognition with a single interface)
Embedded AudioVisual Recognition
RecoMadeEasy® Embedded AudioVisual Recognition is an embedded natural language voice and video recognition engine that offers comprehensive conversational voice interaction, voice biometrics, and facial recognition. The engine has a small memory footprint and is designed to run natively on devices that need unconstrained natural language interfaces with high recognition accuracy, even during service interruptions or when full, uninterrupted, and secure access to a cloud server is not guaranteed.
The RecoMadeEasy® AudioVisual Recognition engine comprises three distinct technologies: Speaker, Speech, and Facial Recognition, all of which have been developed in our research labs in New York. When presented with an audio, video, or audio-video stream, the engine returns the following through the API, in either XML or JSON:
- Speaker Segmentation of Incoming Audio, Video, or Both (including timestamps of the points where the speaker changes, with each audio, video, or combined segment tagged with the ID of the person speaking in that segment)
- Audio and/or Visual Identification of speaker(s)
- Audio and/or Visual Verification of speaker(s)
- Full Transcription of the audio stream
The engine may also be used standalone through a very simple C++ SDK and API, which is most useful for integrating it into current products and IVR systems.
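To illustrate how a client might consume the results described above, the following is a minimal Python sketch that parses a JSON payload and summarizes it per speaker. The payload layout and every field name in it (`segments`, `start`, `end`, `speaker_id`, `transcript`) are assumptions made for this example only; the actual RecoMadeEasy® result schema is not described on this page.

```python
import json
from collections import defaultdict

# Hypothetical payload: the field names below are illustrative assumptions,
# not the engine's documented schema.
SAMPLE_RESULT = """
{
  "segments": [
    {"start": 0.0, "end": 4.2, "speaker_id": "alice", "transcript": "hello there"},
    {"start": 4.2, "end": 9.7, "speaker_id": "bob", "transcript": "good morning"},
    {"start": 9.7, "end": 12.0, "speaker_id": "alice", "transcript": "shall we start"}
  ]
}
"""


def speaking_time(result_json: str) -> dict:
    """Sum the duration (in seconds) of the segments tagged with each speaker ID."""
    totals = defaultdict(float)
    for seg in json.loads(result_json)["segments"]:
        totals[seg["speaker_id"]] += seg["end"] - seg["start"]
    return dict(totals)


def transcript_by_turn(result_json: str) -> list:
    """Return (speaker_id, transcript) pairs ordered by segment start time."""
    segments = sorted(json.loads(result_json)["segments"], key=lambda s: s["start"])
    return [(s["speaker_id"], s["transcript"]) for s in segments]


if __name__ == "__main__":
    print(speaking_time(SAMPLE_RESULT))
    for speaker, text in transcript_by_turn(SAMPLE_RESULT):
        print(f"{speaker}: {text}")
```

A real integration would replace `SAMPLE_RESULT` with the response body returned by the API and adjust the field names to match the actual schema.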
The engine is built to allow users to speak naturally and be understood, even in a far-field, noisy environment. RecoMadeEasy® is available as an SDK with an included API that contains all the components necessary for full integration, so engineers can get started easily and without additional development work or cost.
The RecoMadeEasy® AudioVisual Recognition engine is also available as a server-side product and as a standalone product.
Language- and Text-Independence: The speaker recognition system is completely text- and language-independent. This means that a user may enroll his or her voice into the system in one language and be identified or verified in a completely different language. This allows the engine to handle authentication and identification processes across any number of languages.
Large-Vocabulary Speech Recognition
The speech recognition side of the engine provides one of the most accurate transcriptions for English, handling many different dialects and accents in a single large-vocabulary transcription engine. It is also capable of real-time processing in a small memory footprint.
The speech recognition uses a streaming interface in which the recognizer (in the form of listeners) and the client both run on the embedded device. Any lightweight generic client capable of using a websocket interface may stream audio/video to a listener and get back real-time transcription results, with optional alternative results and likelihood scores. The stream may use any codec that is supported by GStreamer-1.0, including MP3, Ogg Vorbis, Free Lossless Audio Codec (FLAC), MP4, Pulse Code Modulation (PCM), or other codecs such as those supported by the standard Waveform Audio File Format (WAVE).
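The client side of such a streaming exchange can be sketched in Python as follows. This is only an illustration under stated assumptions: the `send` callable stands in for a websocket client's binary send method, the 3200-byte chunk size (100 ms of 16 kHz, 16-bit mono PCM) is an arbitrary choice, and the result message layout (`alternatives`, `text`, `likelihood`) is hypothetical rather than the listener's documented format.

```python
import json
from typing import Callable, Iterator, List, Tuple


def iter_chunks(audio: bytes, chunk_size: int = 3200) -> Iterator[bytes]:
    """Split an audio buffer into fixed-size chunks (the last one may be shorter)."""
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]


def stream_audio(audio: bytes, send: Callable[[bytes], None],
                 chunk_size: int = 3200) -> int:
    """Push every chunk through `send` and return the number of chunks sent.

    `send` is a placeholder for a websocket client's binary send method.
    """
    count = 0
    for chunk in iter_chunks(audio, chunk_size):
        send(chunk)
        count += 1
    return count


def ranked_hypotheses(message: str) -> List[Tuple[str, float]]:
    """Parse one result message and sort its alternatives by likelihood, best first.

    The message layout used here is a hypothetical example.
    """
    alternatives = json.loads(message)["alternatives"]
    pairs = [(alt["text"], alt["likelihood"]) for alt in alternatives]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sent = []  # collects chunks instead of writing them to a real socket
    total = stream_audio(b"\x00" * 8000, sent.append)
    print(total, [len(chunk) for chunk in sent])
    msg = json.dumps({"alternatives": [
        {"text": "wreck a nice beach", "likelihood": -14.1},
        {"text": "recognize speech", "likelihood": -12.5},
    ]})
    print(ranked_hypotheses(msg)[0])
```

Keeping the transport behind a plain callable lets the chunking and result handling be exercised without a running listener; in a real client, `send` would be bound to the websocket connection.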
The facial recognition side of the engine provides face detection, face identification (open-set and closed-set), and facial verification from still images and video streams. It supports standard image and video formats such as PNG, JPEG, GIF, MP2, MP4, and MOV.
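The difference between closed-set identification, open-set identification, and verification can be shown with a short Python sketch. It assumes that some recognizer has already produced a similarity score for each enrolled identity (higher meaning a better match); the function names, the score scale, and the thresholds are hypothetical and are not taken from the RecoMadeEasy® API.

```python
from typing import Dict, Optional


def identify_closed_set(scores: Dict[str, float]) -> str:
    """Closed-set: the subject is assumed to be enrolled, so return the best match."""
    return max(scores, key=scores.get)


def identify_open_set(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Open-set: return the best match only if it clears the threshold, else None."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None


def verify(scores: Dict[str, float], claimed_id: str, threshold: float) -> bool:
    """Verification: accept or reject one claimed identity against the threshold."""
    return scores.get(claimed_id, float("-inf")) >= threshold


if __name__ == "__main__":
    # Hypothetical similarity scores for three enrolled identities.
    scores = {"alice": 0.41, "bob": 0.87, "carol": 0.33}
    print(identify_closed_set(scores))
    print(identify_open_set(scores, threshold=0.9))
    print(verify(scores, "bob", threshold=0.8))
```

Closed-set identification always names someone, which is only appropriate when every subject is known to be enrolled; open-set identification adds the option of rejecting an unknown subject, and verification checks a single claimed identity.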
Supported Operating Systems
The RecoMadeEasy® Embedded AudioVisual Recognition engine, including the C++ SDK, command-line interface, and web services, is available for the following operating systems:
Embedded Operating Systems:
- Android (aarch64 -- armv8a)(Latest)
- Ubuntu 18.04 Mate Linux (aarch64)(Latest)
(Language- and Text-Independent, aka: Speaker Biometrics, Voice Biometrics, or SIV)
Recipient: Frost & Sullivan Award 2011
Large-Vocabulary Speech Recognition
Available for English, Mandarin, Arabic, and German
Also Available in Bilingual Mandarin-English, Arabic-English, and German-English
(Customizable domain full transcription ~ 240,000+ word vocabulary)
(face detection and recognition)
Interactive Voice Response (IVR)
(Graph-based logic, easily configured)
Automatic Language Proficiency Rating (ALPR)
(Multi-lingual automated language proficiency rating)
Status: Advanced Development Stage
Status: Research Stage