
Department of Computer Science*, Department of Linguistics†
University of Colorado at Boulder
{noah,jurafsky}@colorado.edu

TOWARDS BETTER INTEGRATION OF SEMANTIC PREDICTORS IN STATISTICAL LANGUAGE MODELING

Noah Coccaro* and Daniel Jurafsky†

Abstract:

We introduce several techniques for integrating semantic knowledge with N-gram language models for automatic speech recognition. Specifically, these techniques allow us to integrate Latent Semantic Analysis (LSA), a word-similarity algorithm based on word co-occurrence information, with N-gram models. While LSA is good at predicting content words that are coherent with the rest of a text, it is a poor predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with N-grams. We show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with N-grams produce a more robust language model, which has a lower perplexity on a Wall Street Journal test set than a baseline N-gram model.
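
As a rough illustration of the difference between linear and geometric combination mentioned above, the following minimal sketch mixes two toy next-word distributions. The function names, the three-word vocabulary, the probabilities, and the per-word confidence values are all assumptions made for illustration; they are not the formulation or the data used in the paper.

```python
def linear_mix(p_ngram, p_lsa, lam):
    """Linear interpolation with a single global weight lam in [0, 1]."""
    return {w: lam * p_lsa[w] + (1.0 - lam) * p_ngram[w] for w in p_ngram}


def geometric_mix(p_ngram, p_lsa, conf):
    """Weighted geometric mean with a per-word weight conf[w] in [0, 1].

    A geometric mean of two distributions does not sum to one,
    so the scores are renormalized over the vocabulary.
    """
    scores = {
        w: (p_lsa[w] ** conf[w]) * (p_ngram[w] ** (1.0 - conf[w]))
        for w in p_ngram
    }
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}


# Toy distributions over a three-word vocabulary (hypothetical values).
p_ngram = {"the": 0.6, "stock": 0.3, "banana": 0.1}
p_lsa = {"the": 0.2, "stock": 0.7, "banana": 0.1}

# Hypothetical per-word confidence: zero for the frequent function word,
# so its score comes from the N-gram model alone.
conf = {"the": 0.0, "stock": 0.5, "banana": 0.5}

linear = linear_mix(p_ngram, p_lsa, 0.5)
geometric = geometric_mix(p_ngram, p_lsa, conf)
```

In this toy setting the confidence weight leaves the N-gram estimate for "the" untouched before renormalization, while the semantically supported content word "stock" gains probability mass relative to the N-gram model alone.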



 

Noah Coccaro
9/15/1998