Data that speaks volumes: AssemblyAI and AWS transform speech data into ML-driven insights

by AWS editorial team | August 22, 2024 | Thought leadership

Phone calls, virtual meetings, podcasts, webinars, videos – audio is everywhere, and AI is enabling companies to gain insights from that data like never before. AssemblyAI develops machine learning (ML) models that provide accurate speech recognition for voice data, as well as speaker recognition, sentiment analysis, chapter detection, redaction of personally identifiable information (PII), and more. Senior software engineer Ben Gotthold explains, “We are entirely focused on developing ML models that understand human speech with superhuman capabilities. In short, we have a complete AI system that helps customers get the most out of their audio data.”

ML-powered audio innovation

AssemblyAI offers a variety of ML models that support a wide range of use cases. For example, a podcast or video platform can use speech recognition, speaker diarization, and summarization models to make their content more searchable. Content moderation and topic detection models can also be used to categorize and label sensitive or harmful content. PII redaction, keyword detection, sentiment analysis, and entity recognition models can be used for conversation intelligence solutions in contact centers or to analyze sales call data and help managers train new team members faster.

To operate effectively and provide excellent customer service, AssemblyAI needed an architecture that excelled in three key areas:

  1. Scalability: With millions of requests coming in every day, AssemblyAI needed to scale flexibly to meet demand, optimize resource usage, and control costs.
  2. Easy deployment and iteration: AssemblyAI needed an architecture that made deploying and continuously improving ML models as easy as possible.
  3. Security and compliance: AssemblyAI wanted to build its architecture with services and technologies designed to protect data and meet the diverse compliance requirements of a global customer base.

AssemblyAI worked with Amazon Web Services to develop an architecture that was impressive on all levels.

Where audio becomes insight: inside AssemblyAI’s architecture

Customers first upload audio data to the AssemblyAI API or submit a reference to that data using cloud object storage services like Amazon Simple Storage Service (Amazon S3). Gotthold explains, “After a customer submits their data to our API, we download it, transcode it, and actually store it in Amazon S3. From there, we can send it to a variety of different models depending on the customer’s use case, including things like speaker labeling and sentiment analysis.”
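The submission step above can be sketched as follows. This is a minimal illustration, not AssemblyAI’s internal code: the helper function is hypothetical, though `audio_url` and optional flags like `speaker_labels` and `sentiment_analysis` mirror options documented for AssemblyAI’s public transcript API.

```python
import json

def build_transcript_request(audio_url, speaker_labels=False, sentiment_analysis=False):
    """Assemble the JSON payload for a transcription request,
    enabling optional models per the customer's use case."""
    payload = {"audio_url": audio_url}
    if speaker_labels:
        payload["speaker_labels"] = True
    if sentiment_analysis:
        payload["sentiment_analysis"] = True
    return payload

payload = build_transcript_request(
    "https://example.com/call-recording.mp3",
    speaker_labels=True,
    sentiment_analysis=True,
)
print(json.dumps(payload))

# The actual HTTP call (requires an API key; shown for illustration only):
# import os, requests
# resp = requests.post(
#     "https://api.assemblyai.com/v2/transcript",
#     headers={"authorization": os.environ["ASSEMBLYAI_API_KEY"]},
#     json=payload,
# )
```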

Once a customer request is validated and the required feature types are recorded, the process is handed off to AssemblyAI’s Orchestrator, which Gotthold calls the “brain of the operation.” The Orchestrator decides which models to invoke, and in what order, through an inference pipeline built on several AWS services, including Amazon Simple Queue Service (Amazon SQS), Amazon Elastic Container Service (Amazon ECS), and Amazon S3.

The Orchestrator sends messages to Amazon SQS, a fully managed message queuing service for microservices, distributed systems, and serverless applications. These messages drive the ML models running on Amazon ECS, a container orchestration service that enables AssemblyAI to efficiently deploy, manage, and scale its models.
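A sketch of how an orchestrator might enqueue a pipeline step; the message schema here is purely illustrative (AssemblyAI’s internal queue format is not public), and the boto3 dispatch is shown as a comment because it needs AWS credentials and a real queue URL.

```python
import json
import uuid

def build_inference_task(transcript_id, model, depends_on=None):
    """Build one step of the inference pipeline as an SQS message body.
    Field names are assumptions for illustration."""
    return {
        "task_id": str(uuid.uuid4()),
        "transcript_id": transcript_id,
        "model": model,            # e.g. "asr", "speaker_labels", "sentiment"
        "depends_on": depends_on,  # step that must complete before this one
    }

task = build_inference_task("t-123", "speaker_labels", depends_on="asr")
body = json.dumps(task)
print(body)

# Dispatching with boto3 (requires AWS credentials):
# import boto3
# sqs = boto3.client("sqs")
# sqs.send_message(
#     QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/speaker-labels",
#     MessageBody=body,
# )
```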

“We have dozens of models in use. We’re constantly iterating on them, deploying new versions and new models,” says Gotthold. Within Amazon ECS, AssemblyAI’s ML models are automatically scaled up and down based on customer demand.

Optimize resource utilization and control costs

AssemblyAI also uses Amazon CloudWatch to monitor and respond to performance changes and optimize resource usage. Gotthold explains, “Requests are constantly coming in, millions a day, and we record them in CloudWatch. Using the decision engine in our orchestrator, we know which models are needed and in what order they are called. So using signals like queue depth and other custom metrics, we can provision just the right number of model workers. Popular models scale up and down faster than less popular ones.”
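The queue-depth-driven provisioning Gotthold describes can be reduced to a small sizing rule. The function below is a simplified sketch under assumed parameters (tasks per worker, min/max bounds); in practice the queue depth would come from a CloudWatch metric such as `ApproximateNumberOfMessagesVisible` or a custom metric published with `put_metric_data`.

```python
import math

def desired_workers(queue_depth, tasks_per_worker, min_workers=1, max_workers=50):
    """Pick a worker count so the backlog drains at roughly
    tasks_per_worker per worker, clamped to a floor and ceiling."""
    needed = math.ceil(queue_depth / tasks_per_worker)
    return max(min_workers, min(needed, max_workers))

# Idle queue keeps a warm floor; a spike scales out up to the cap.
print(desired_workers(0, 10))       # floor of 1 warm worker
print(desired_workers(95, 10))      # 10 workers for a 95-task backlog
print(desired_workers(10_000, 10))  # clamped at max_workers
```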

“A good example of this would be a customer requesting speaker identification – that is, who is speaking and when. We know that this happens after the audio is converted to text, so we can scale this service in advance so that the capacity is there exactly when we need it.” In addition to efficiency, optimizing resource usage also brings savings. “In general, it is quite expensive to run these models on GPUs, so we place a lot of emphasis on good scaling to keep costs under control,” says Gotthold.

After the request is completed, Amazon Simple Notification Service (Amazon SNS) triggers AWS Lambda, a serverless, event-driven compute service, which notifies the customer that their transcription is ready.
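A minimal sketch of what such an SNS-triggered Lambda handler could look like. The event envelope follows the standard SNS-to-Lambda record format; the message fields and the webhook callback are assumptions for illustration, not AssemblyAI’s actual notification payload.

```python
import json

def handler(event, context):
    """Lambda entry point invoked by Amazon SNS when a transcription
    finishes; collects completed jobs and would notify each customer."""
    notified = []
    for record in event["Records"]:
        # SNS delivers the published message as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        if message.get("status") == "completed":
            notified.append(message["transcript_id"])
            # Hypothetical customer callback:
            # requests.post(message["webhook_url"], json=message)
    return {"notified": notified}
```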

Data security and responsible handling are a priority

AssemblyAI works with a global customer base and must adhere to strict compliance and data security standards. “There are a lot of non-functional requirements – compliance and things like that. We are SOC 2 Type 2 certified and place a lot of emphasis on following best practices for data storage,” says Gotthold.

AWS is designed to be the most secure global cloud infrastructure on which to build, migrate, and manage applications and workloads. AWS services like Amazon ECS and Amazon S3 enable users to securely manage data, detect potentially suspicious behavior, and mitigate risk. As Gotthold explains, “We have strict lifecycle policies in Amazon S3, so we only keep the data as long as it is useful to our orchestrator and ML pipeline.”
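A lifecycle policy like the one Gotthold describes can be expressed in a few lines. The rule structure below matches the Amazon S3 lifecycle configuration API, but the three-day window, prefix, and bucket name are assumptions, not AssemblyAI’s actual retention settings.

```python
# Expire uploaded audio once the pipeline no longer needs it.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-processed-audio",
            "Filter": {"Prefix": "audio/"},   # only transient audio objects
            "Status": "Enabled",
            "Expiration": {"Days": 3},        # illustrative retention window
        }
    ]
}

# Applying it (requires AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="assemblyai-audio-example",
#     LifecycleConfiguration=lifecycle_config,
# )
```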

Enabling audio data innovation

AssemblyAI continues to innovate on behalf of its customers and support them with new ML models. The company released LeMUR, a framework for applying large language models (LLMs) to speech data, in 2023. With just a few lines of code, LeMUR enables customers to create custom summaries for multiple audio files at once, ask questions of their data with natural language prompts, summarize action points from meeting recordings, and more.
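The “few lines of code” claim can be illustrated like this. The prompt helper is hypothetical, and the commented SDK calls are loosely based on AssemblyAI’s Python SDK, whose exact method names may differ across versions.

```python
def build_lemur_prompt(task, context=None):
    """Compose a natural-language prompt to run over transcript data."""
    prompt = task
    if context:
        prompt += f"\nContext: {context}"
    return prompt

prompt = build_lemur_prompt(
    "Summarize the action items from this meeting.",
    context="Weekly engineering sync",
)
print(prompt)

# Running it against LeMUR (requires the assemblyai package and an API key):
# import assemblyai as aai
# aai.settings.api_key = "..."
# transcript = aai.Transcriber().transcribe("https://example.com/meeting.mp3")
# result = transcript.lemur.task(prompt)
# print(result.response)
```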

By building its architecture on AWS, AssemblyAI can continue to develop innovative solutions like LeMUR and discover new ways to turn audio data into insights, while knowing it has the scalability, ease of deployment, and security features to effectively manage demand and provide exceptional service to its customers.

Learn more about how AWS gives your software or technology company the freedom to migrate, innovate, and scale. Contact us now to get started.
