Financial NLP and Large Language Models - The Hudson Labs Advantage

Traditionally, computers have been programmed with step-by-step instructions to solve tasks. Certain skills, like processing images or text, are too complex to be captured by a set of explicit rules. Language in particular is highly ambiguous and contextual, and contains too many exceptions.

Machine learning is a computing paradigm in which computers learn by example: given many input-output pairs, the machine learns to solve the task by inferring the relationship between input and output.

Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence that deals with processing, understanding, and generating text. In recent years, the field has undergone rapid development and widespread adoption thanks to its embrace of machine-learning-based large language models (LLMs).

Modern language models are characterised by their adherence to the ‘distributional hypothesis’, best captured by the adage “You shall know a word by the company it keeps”: the meaning of each word can be inferred from the words that surround it in context. This idea has been operationalized using the “self-attention” framework. Self-attention means each word “attends” to all other words in the sentence to generate its own representation - a vector (list of numbers) that encapsulates its meaning.
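To make self-attention concrete, here is a minimal NumPy sketch of scaled dot-product attention. The toy dimensions and random weights are purely illustrative - this shows the general mechanism, not any production model:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of word vectors.
    X has one row per token; each output row is a weighted mix of every
    token's value vector, so each word "attends" to all the others."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                # context-aware representations

rng = np.random.default_rng(0)
d = 8                                                 # toy embedding size
X = rng.normal(size=(5, d))                           # five "tokens"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (5, 8)
```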

Read "Designing Large Language Model Applications" written by Bedrock AI founder, Suhas Pai, and published by O'Reilly Media.

LLM Adoption

Modern success stories in NLP include grammar correction tools like Grammarly, translation software like Google Translate, and the auto-complete used by Gmail, Outlook, and other email providers. NLP is also increasingly used in customer-service chatbots and in search engines. State-of-the-art models include OpenAI’s GPT-3, Meta’s OPT, and the open-source BigScience model BLOOM, of which Hudson Labs’ CTO is a project co-chair.

Limitations of Language Models

Like all automated processes, language models still have limitations. A language model can only process a limited amount of input at a time (typically fewer than 3,000 words), which restricts the contextual information available to it when making a decision. Financial documents like annual reports often run to hundreds of pages, making financial text processing a particularly challenging field.
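A common workaround for this length limit - a sketch of the general technique, not necessarily Hudson Labs' pipeline - is to split a long filing into overlapping windows that each fit within the model's input budget:

```python
def chunk_document(words, window=500, overlap=50):
    """Split a long document into overlapping word windows so that each
    chunk fits a model's input limit. The window and overlap sizes here
    are illustrative, not tuned values."""
    step = window - overlap
    return [words[i:i + window]
            for i in range(0, max(len(words) - overlap, 1), step)]

filing = ("goodwill impairment " * 2000).split()   # stand-in for a long 10-K
chunks = chunk_document(filing)
print(len(chunks), len(chunks[0]))                 # 9 chunks of <= 500 words
```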

Language models are computationally prohibitive to train from scratch. The current approach in the field is to take an open-source language model trained and published by Google, Meta, Microsoft, or another big-tech company and adapt, or ‘fine-tune’, it to the individual application’s needs. The base model has already learned general properties of language such as grammar, and the subsequent fine-tuning phase leverages this knowledge to teach the model more fine-grained tasks.
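A minimal sketch of this recipe using the Hugging Face transformers library - the base model choice and the two-sentence "dataset" are our own illustrations, not Hudson Labs' actual setup:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from an open-source base model that already knows general English.
model_name = "bert-base-uncased"    # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy task data: red flag (1) vs. benign (0). Real fine-tuning uses far more.
texts = ["The Company recorded a goodwill impairment charge.",
         "Revenue grew 12% year over year."]
labels = [1, 0]
enc = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="out", num_train_epochs=1),
                  train_dataset=ToyDataset())
trainer.train()
```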

Most open-source language models are trained primarily on web text and struggle to adapt to text with different characteristics. Corporate disclosure is linguistically and semantically very different from web text: market announcements and financial reports are characterised by repetitive boilerplate, financial jargon, and legalese, and the average sentence in a public company’s annual report is much longer than the average sentence on the web.

The Hudson Labs advantage - financial language modeling

At Hudson Labs, we went through a year-long financial NLP research phase before we launched our first product in April 2021. Our research team continues to innovate and bring forward new advances in NLP to improve the quality of our products.

We have developed several in-house techniques for effectively processing long-form financial text, and these help us achieve the high quality of our products. A summary of our unique capabilities is given below.

Domain Adaptation

We have developed techniques to adapt open-source language models to the domain of securities filings and complex financial text. The initial domain adaptation process involved the collection and processing of over 1.3 terabytes of financial data. This process enables our models to understand terms like ‘goodwill impairment’, a phrase not commonly seen on the web.
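Mechanically, domain adaptation often amounts to continuing the base model's masked-language-model objective on in-domain text before any task-specific fine-tuning. A hedged sketch using Hugging Face's data collator (the base model and filing sentence are illustrative):

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative base
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Randomly mask 15% of the tokens in filing text; training the model to
# reconstruct them teaches it domain vocabulary like "goodwill impairment".
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
batch = collator([tokenizer("The Company recognized a goodwill impairment charge.")])
print(batch["input_ids"].shape, batch["labels"].shape)
# `model` would then be trained on batches like this one, at corpus scale.
```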

Boilerplate Model

Boilerplate sentences are linguistically very similar to interesting text, and are often indistinguishable even to human readers without domain expertise. Our in-house boilerplate identification model correctly classifies more than 99 percent of sentences as boilerplate or not. Filtering out boilerplate reduces noise and improves the quality of our input data.
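As a simplified illustration of the task (not our production model, whose accuracy this toy pipeline does not approach), a binary classifier can be trained on labelled boilerplate/substantive sentence pairs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training pairs: boilerplate (1) vs. substantive (0) sentences.
sentences = [
    "Forward-looking statements are subject to risks and uncertainties.",
    "Readers are cautioned not to place undue reliance on these statements.",
    "The Company recorded a $40 million goodwill impairment in Q3.",
    "The CFO resigned effective immediately, citing accounting disagreements.",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["These statements involve known and unknown risks."]))
```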

Representation Learning

A fundamental truism of data-oriented applications is the adage ‘garbage in, garbage out’, and we have extensive processes to ensure we feed high-quality inputs to our models. Sentences are represented in vector form - a list of numbers that encodes the meaning, syntax, and other relevant properties of a sentence. The quality of these input vectors determines how helpful a language model can be in solving tasks, and our algorithms ensure the generated vectors are amenable to modelling.
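For example (an illustrative sketch using the open-source sentence-transformers library, not our in-house encoder), sentences can be mapped to vectors whose cosine similarities reflect semantic relatedness:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative encoder
sents = ["The auditor issued a going-concern opinion.",
         "There is substantial doubt about the Company's ability to continue.",
         "Quarterly dividends were paid as scheduled."]
vecs = model.encode(sents, normalize_embeddings=True)

# Cosine similarity: semantically related sentences land near each other.
print(np.round(vecs @ vecs.T, 2))
```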

Few-Shot Learning

Modern machine learning techniques are extremely data-hungry: they need many labelled training examples to be effective. Labelled training data is expensive to acquire, especially when the labelling requires domain expertise, as it does in highly specialised domains like corporate disclosure. On average, it takes more than a minute to annotate each example.

In the past year, adopting a new paradigm called “few-shot learning” has helped alleviate this problem: high-performance models can now learn to solve select tasks from just a few examples. The few-shot learning algorithms we developed for our core product allow us to extract 328 different types of red flags with just 1,625 labelled sentences.
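One simple flavour of few-shot learning - a sketch of the general idea, not the proprietary algorithms described above - classifies a new sentence by comparing its embedding to the centroid of a handful of labelled examples per category:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# A handful of labelled examples per red-flag type (toy data).
examples = {
    "impairment": ["Goodwill was written down by $25 million.",
                   "An impairment charge was recognized on intangible assets."],
    "litigation": ["The Company is a defendant in a class-action lawsuit.",
                   "A securities fraud complaint was filed against management."],
}
centroids = {k: model.encode(v, normalize_embeddings=True).mean(axis=0)
             for k, v in examples.items()}

def classify(sentence):
    """Assign the category whose example centroid is most similar."""
    vec = model.encode([sentence], normalize_embeddings=True)[0]
    return max(centroids, key=lambda k: float(vec @ centroids[k]))

print(classify("We recorded a non-cash impairment of our reporting unit."))
```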

Text Ranking

While our red flag extraction models solve the needle-in-a-haystack problem of finding interesting information in disclosure text, our ranking models order those findings by relative and absolute importance. Our ranking algorithms can make fine-grained distinctions even within a particular red flag category and produce an importance score for each identified red flag. The ranking model also takes into account each flag’s freshness, the time period the sentence refers to, and other signals.
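To illustrate how severity and freshness might combine into a single score (the exponential decay and half-life below are our own hypothetical choices, not Hudson Labs' scoring model):

```python
from datetime import date

def importance(model_score, event_date, today=date(2024, 1, 1), half_life_days=365):
    """Blend a red flag's model-assigned severity with its freshness.
    Hypothetical scoring function for illustration only."""
    age = (today - event_date).days
    freshness = 0.5 ** (age / half_life_days)   # halves every half_life_days
    return model_score * freshness

flags = [("goodwill impairment", 0.9, date(2023, 11, 1)),
         ("auditor change", 0.7, date(2021, 3, 15))]
for name, score, when in sorted(flags, key=lambda f: -importance(f[1], f[2])):
    print(f"{name}: {importance(score, when):.2f}")
```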

Training Set Selection

Current language models are susceptible to shortcut learning - a phenomenon in which spurious characteristics of the training data are used as cues for making decisions. Consider a model that spuriously uses the word ‘banana’ as a cue for predicting whether a sentence is an impairment indicator, solely because its example sentences were disproportionately sourced from a banana producer’s corporate filings.

We use in-house algorithms to select training sets that reduce the chances of shortcut learning. Our algorithms select the training examples that give the best bang for the buck, measured by the number of real-world examples they help the model learn to classify correctly.
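One generic way to reduce such spurious clusters - a sketch of the broad idea, not our selection algorithm - is to pick training examples that are spread out in embedding space, so that no single issuer or phrasing dominates the training set:

```python
import numpy as np

def select_diverse(vectors, k):
    """Greedy farthest-point selection: repeatedly pick the example
    farthest from everything chosen so far, yielding a training set
    spread across embedding space rather than clumped in one cluster."""
    chosen = [0]
    dists = np.linalg.norm(vectors - vectors[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(vectors - vectors[nxt], axis=1))
    return chosen

rng = np.random.default_rng(1)
pool = rng.normal(size=(100, 16))     # stand-in for sentence embeddings
print(select_diverse(pool, 5))
```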



FAQs

What are the benefits of large language models?

They are extremely flexible because they can be trained to perform a variety of tasks, such as text generation, summarization, and translation. They are also scalable because they can be fine-tuned to specific tasks, which can improve their performance.

What is the purpose of a language model in NLP?

A language model in NLP is a probabilistic statistical model that determines the probability of a given sequence of words occurring in a sentence based on the previous words. It helps predict which word is more likely to appear next in the sentence.
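As a toy illustration of this definition (our own example, using simple bigram counts rather than a neural model), the probability of the next word can be estimated directly from a corpus:

```python
from collections import Counter, defaultdict

# A toy bigram language model: P(next word | previous word) from counts.
corpus = "the company reported a loss the company reported a profit".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def p_next(prev, word):
    counts = bigrams[prev]
    return counts[word] / sum(counts.values()) if counts else 0.0

print(p_next("reported", "a"))   # 1.0 -- "a" always follows "reported" here
print(p_next("a", "loss"))       # 0.5 -- "loss" and "profit" each occur once
```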

Is NLP a large language model?

A large language model (LLM) is a deep learning algorithm that can perform a variety of natural language processing (NLP) tasks. Large language models use transformer models and are trained using massive datasets - hence, large.

How accurate are large language models?

Research from MIT's CSAIL shows that getting a group of LLM systems to work together results in a more factually accurate answer.

What are the advantages and disadvantages of large language models?

Research shows that LLMs get more capable with an increase in investment (Bowman, 2023). The advantage of LLMs is that they automate complex tasks and improve creativity. However, small language models (SLMs) have lower operational and development costs, which makes them more accessible.

What is the difference between NLP and large language models?

NLP encompasses a broad range of models and techniques for processing human language, while large language models (LLMs) are one specific type of model within this domain. In practical terms, however, LLMs now cover a similar range of tasks to traditional NLP technology.

What is NLP and how does it work?

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI). It helps machines process and understand human language so that they can automatically perform repetitive tasks. Examples include machine translation, summarization, ticket classification, and spell check.

What is the main purpose of a language model?

Language models are useful for a variety of tasks, including speech recognition (helping prevent predictions of low-probability, e.g. nonsense, sequences), machine translation, natural language generation (generating more human-like text), optical character recognition, handwriting recognition, and grammar induction, among other applications.

Where are NLP models used?

You can also integrate NLP in customer-facing applications to communicate more effectively with customers. For example, a chatbot analyzes and sorts customer queries, responding automatically to common questions and redirecting complex queries to customer support.

What is the best language for NLP?

Python is undeniably the most popular programming language in the field of AI and NLP.

Are large language models really AI?

A large language model (LLM) is a type of artificial intelligence (AI) program that can recognize and generate text, among other tasks. LLMs are trained on huge sets of data - hence the name "large." LLMs are built on machine learning: specifically, a type of neural network called a transformer model.

Why do large language models make mistakes?

One major reason is a lack of contextual awareness. Despite their eye-opening abilities, AI language models face real language-comprehension challenges: they are not human, nor are they necessarily well trained, which means they can make some pretty unfortunate mistakes.

Is ChatGPT a large language model?

Yes. Large language models like ChatGPT are trained in phases: (1) pre-training, (2) instruction fine-tuning, and (3) reinforcement learning from human feedback (RLHF).

What are the benefits of an LLM?

  • Extensibility and adaptability. LLMs can serve as a foundation for customized use cases. ...
  • Flexibility. One LLM can be used for many different tasks and deployments across organizations, users, and applications.
  • Performance. ...
  • Accuracy. ...
  • Ease of training. ...
  • Efficiency.

What is the impact of large language models?

One of the most significant impacts of LLMs is their ability to enhance productivity and efficiency across various professions. By automating routine tasks such as data analysis, report generation, and customer service inquiries, LLMs allow professionals to focus on more complex and creative aspects of their work.

Why are language models important?

Essentially, language modeling helps computers learn what to expect when receiving language input. This allows the artificial intelligence software to accurately string together spoken language through natural language understanding.
