Suhas Pai, CTO and Co-founder of Hudson Labs - Jul 20, 2022
Traditionally, computers have been programmed with step-by-step instructions to solve tasks. Certain skills, like processing images or text, are too complex to be described by a set of rules. Language, in particular, is highly ambiguous and contextual, and contains too many exceptions.
Machine learning is a computing paradigm where computers learn by example. Machine learning involves providing input-output pairs so that the machine learns how to solve the task by understanding the relationship between the input and output.
Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence. NLP deals with processing, understanding, and generating text. In recent years, this field has undergone rapid development and widespread adoption due to its embrace of machine-learning-based large language models (LLMs).
Modern language models are characterised by their adherence to the ‘distributional hypothesis’, best described by the adage “You shall know a word by the company it keeps”. The meaning of each word can be inferred from the words that surround it in context. This has been operationalized using the “self-attention” framework. Self-attention means each word “attends” to all other words in the sentence to generate its own representation - a vector (list of numbers) that encapsulates meaning.
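To make the idea of words "attending" to each other concrete, here is a minimal sketch of self-attention in plain Python. It is a simplification under stated assumptions: the word vectors are made-up toy numbers, and each word's vector serves directly as its query, key and value, whereas real language models learn separate projections for each and run many attention heads in parallel.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: each input vector acts as query, key and value."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score this word against every word in the sentence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted average of all word vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs

# Toy 2-dimensional vectors for a three-word sentence (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(sentence)
```

Each row of `contextual` is the representation of one word after it has mixed in information from the rest of the sentence.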
Read "Designing Large Language Model Applications" written by Bedrock AI founder, Suhas Pai, and published by O'Reilly Media.
LLM Adoption
Modern success stories in NLP include grammar correction tools like Grammarly, translation software like Google Translate, and the auto-complete used by Gmail, Outlook and other email providers. NLP is also increasingly used in customer-service chatbots and in search engines. State-of-the-art models include OpenAI’s GPT-3, Meta’s OPT, and the open-source BigScience model BLOOM, of which Bedrock AI’s CTO is a project co-chair.
Limitations of Language Models
Like all automation processes, language models still suffer from limitations. They are limited by the length of the input they can process at a time, typically less than 3,000 words, which restricts the contextual information they have access to while making a decision. Financial documents like annual reports usually run into hundreds of pages, making financial text processing a particularly challenging field.
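One common workaround for the input-length limit is to split a long document into overlapping windows that each fit within the model's budget. The sketch below illustrates that general idea only; it is not a description of how Hudson Labs processes filings. The function name and the `max_words` and `overlap` defaults are assumptions, and real models count tokens rather than whitespace-separated words.

```python
def split_into_windows(text, max_words=3000, overlap=200):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        # Stop once the current window reaches the end of the document.
        if start + max_words >= len(words):
            break
    return windows

# Tiny demonstration with a 10-word "document".
demo = " ".join(str(i) for i in range(10))
demo_windows = split_into_windows(demo, max_words=4, overlap=1)
```

The overlap means a sentence cut at a window boundary still appears whole in the next window, at the cost of processing some words twice.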
Language models are computationally prohibitive to train from scratch. The current approach in the field is to use open-source language models trained and published by Google, Meta, Microsoft, and other big-tech companies, and adapt or ‘fine-tune’ them according to the individual application’s needs. The base model has learned more general properties of language like grammar and the subsequent fine-tuning phase leverages this knowledge to help the model learn more fine-grained tasks.
Most open-source language models are primarily trained on web text, and struggle to adapt to text that has different characteristics from web text. Corporate disclosure is linguistically and semantically very different from web text. Market announcements and financial reports are characterised by repetitive boilerplate, financial jargon and legalese. The average sentence in an annual report of a public company is much longer than the average sentence on the web.
The Hudson Labs advantage - financial language modeling
At Hudson Labs, we went through a year-long financial NLP research phase before we launched our first product in April 2021. Our research team continues to innovate and bring forward new advances in NLP to improve the quality of our products.
We have developed several in-house techniques for effectively processing long-form financial text, which help us achieve the high quality of our products. A summary of our unique capabilities is listed below.
Domain Adaptation
We have developed techniques to adapt open-source language models to the domain of securities filings and complex financial text. The initial domain adaptation process involved the collection and processing of over 1.3 terabytes of financial data. This process enables our models to understand terms like ‘goodwill impairment’, a phrase not commonly seen on the web.
Boilerplate Model
Boilerplate sentences are linguistically very similar to interesting text, and readers without domain expertise often cannot tell the two apart. Our in-house boilerplate identification model correctly classifies more than 99 percent of sentences as boilerplate or not. Filtering out boilerplate reduces noise and improves the quality of our input data.
Representation Learning
A fundamental truism of data-oriented applications is the adage ‘garbage in, garbage out’. We have extensive processes to ensure we feed high-quality inputs to our models. Sentences are represented in vector form (a list of numbers that encode meaning, syntax and other relevant information about a sentence). The quality of the input vectors determines the extent to which a language model can be helpful in solving tasks. Our algorithms ensure the generated vectors are more amenable to modelling.
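As a small illustration of what it means for vectors to encode meaning, the sketch below compares sentence vectors using cosine similarity, a standard measure of how closely two vectors point in the same direction. The three-dimensional vectors are made-up numbers for illustration; real sentence vectors have hundreds of dimensions and are produced by a language model.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here: a zero vector is similar to nothing.
    return dot / (norm_a * norm_b)

# Hypothetical sentence vectors (made-up numbers).
impairment_a = [0.9, 0.1, 0.3]
impairment_b = [0.8, 0.2, 0.4]
boilerplate = [0.1, 0.9, 0.1]

similar = cosine_similarity(impairment_a, impairment_b)
dissimilar = cosine_similarity(impairment_a, boilerplate)
```

When the vectors are good, sentences with related meaning score close to 1 and unrelated sentences score much lower, which is what makes downstream tasks easier.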
Few-Shot Learning
Modern machine learning techniques are extremely data-hungry. They need a lot of labelled training examples to be effective. Labelled training data is expensive to acquire, especially if the labelling requires domain expertise, as is true in highly specialised domains like corporate disclosure. On average, it takes more than a minute to annotate each example.
In the past year, adopting a new paradigm called “few-shot learning” has helped alleviate this problem. High-performance models can now learn to solve select tasks using just a few examples. The few-shot learning algorithms that we developed for our core product allow us to extract 328 different types of red flags with just 1,625 labelled sentences (about five per type on average).
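One simple way to learn from only a handful of examples is a nearest-centroid (prototype) classifier: average the vectors of the few labelled sentences in each class, then assign a new sentence to the class whose average is closest. The sketch below shows that generic technique, not the few-shot algorithms developed at Hudson Labs. The labels and two-dimensional vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def build_prototypes(labelled):
    """Map each label to the centroid of its few example vectors."""
    return {label: centroid(vecs) for label, vecs in labelled.items()}

def classify(vector, prototypes):
    """Return the label whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(vector, prototypes[label]))

# Two labelled examples per class (made-up sentence vectors).
few_shot_examples = {
    "impairment": [[0.9, 0.1], [0.8, 0.2]],
    "boilerplate": [[0.1, 0.9], [0.2, 0.8]],
}
prototypes = build_prototypes(few_shot_examples)
predicted = classify([0.7, 0.3], prototypes)
```

An approach like this depends heavily on the quality of the sentence vectors: the better they separate classes, the fewer labelled examples are needed.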
Text Ranking
While our red flag extraction models solve the needle-in-a-haystack problem of finding interesting information in disclosure text, our ranking models order the extracted red flags by relative and absolute importance. Our ranking algorithms can make fine-grained distinctions even within a particular red flag category and produce an importance score for each identified red flag. The ranking model also takes into consideration each red flag's freshness, the time period the sentence refers to, and so on.
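To illustrate how an importance score and freshness might be combined into a single ordering, here is a purely hypothetical sketch. The `RedFlag` fields, the exponential decay and the one-year half-life are all assumptions made for illustration; the article does not describe the actual ranking formula.

```python
from dataclasses import dataclass

@dataclass
class RedFlag:
    text: str
    importance: float  # hypothetical model score in [0, 1]
    age_days: int      # days since the filing was published

def ranking_score(flag, half_life_days=365.0):
    """Discount importance by age: the score halves every half_life_days."""
    freshness = 0.5 ** (flag.age_days / half_life_days)
    return flag.importance * freshness

def rank_flags(flags, half_life_days=365.0):
    """Return flags sorted from highest to lowest ranking score."""
    return sorted(
        flags,
        key=lambda f: ranking_score(f, half_life_days),
        reverse=True,
    )

# Made-up red flags for demonstration.
flags = [
    RedFlag("Goodwill impairment recorded two years ago", 0.9, 730),
    RedFlag("Auditor resigned this quarter", 0.6, 0),
]
ranked = rank_flags(flags)
```

In this toy example the recent, moderately important flag outranks the older, more important one, because two half-lives reduce the older flag's score to a quarter of its importance.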
Training Set Selection
Current language models are susceptible to shortcut learning - a phenomenon where spurious characteristics of the training data are used as cues for making decisions. Consider an example where the model spuriously uses the word ‘banana’ as a cue for predicting whether a sentence is an impairment indicator, solely because the example sentences were disproportionately sourced from a banana producer’s corporate filings.
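A simple diagnostic for the 'banana' problem is to look for words that recur in positive training examples but only ever come from a single source. The sketch below is a hypothetical check along those lines, with made-up company names and sentences; it is not the method Hudson Labs uses.

```python
from collections import defaultdict

def single_source_cues(examples, min_count=2):
    """Find words seen in >= min_count positive examples from only one source.

    examples: list of (sentence, source, label) tuples, where label 1 is positive.
    """
    sources_by_word = defaultdict(set)
    count_by_word = defaultdict(int)
    for sentence, source, label in examples:
        if label != 1:
            continue
        # Count each word once per sentence.
        for word in set(sentence.lower().split()):
            sources_by_word[word].add(source)
            count_by_word[word] += 1
    return sorted(
        word
        for word, count in count_by_word.items()
        if count >= min_count and len(sources_by_word[word]) == 1
    )

# Made-up training examples: (sentence, source company, label).
training_examples = [
    ("banana goodwill impaired", "BananaCo", 1),
    ("banana assets impaired", "BananaCo", 1),
    ("goodwill impaired again", "SteelCo", 1),
    ("banana sales rose", "BananaCo", 0),
]
suspicious = single_source_cues(training_examples)
```

Here 'banana' is flagged because every positive example containing it comes from one company, while 'goodwill' and 'impaired' appear across sources and are not.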
We use our in-house algorithms for selecting training sets that reduce the chances of shortcut learning. Our algorithms select training examples that give the best bang-for-the-buck in terms of the number of real-world examples that they could help the model learn to classify correctly.
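The 'bang-for-the-buck' idea can be illustrated with a greedy coverage heuristic: repeatedly pick the candidate training example that helps with the largest number of real-world examples not yet covered. This is a generic sketch, not the in-house algorithm. How a candidate's coverage set is computed (for instance, by vector similarity) is left as an assumption, and the identifiers below are made up.

```python
def greedy_select(coverage, budget):
    """Greedily choose up to `budget` candidates that maximise coverage.

    coverage: dict mapping candidate id -> set of real-world example ids
              that labelling this candidate is assumed to help with.
    """
    selected = []
    covered = set()
    remaining = dict(coverage)
    while remaining and len(selected) < budget:
        # Pick the candidate that adds the most not-yet-covered examples.
        best = max(remaining, key=lambda c: len(remaining[c] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # No remaining candidate adds anything new.
        selected.append(best)
        covered |= gain
        del remaining[best]
    return selected, covered

# Hypothetical coverage sets for three candidate training sentences.
candidate_coverage = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}
chosen, covered_ids = greedy_select(candidate_coverage, budget=2)
```

In this example candidate "c" is never chosen: everything it covers is already covered by "a", so labelling it would add cost without adding coverage.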