LinguifAI: Train Custom Models on Your Datasets

Published on August 25, 2024


Introduction

In this post, we’ll talk about LinguifAI, a desktop application that allows users to upload their own datasets, train machine learning models, and make predictions. The application pairs a React frontend with a Flask backend, providing a seamless experience for non-experts to train and deploy models such as Recurrent Neural Networks (RNNs) and Naive Bayes classifiers. Additionally, users can apply pretrained classifiers for tasks such as sentiment analysis or news classification.

Application link: LinguifAI

Application Overview

The application is designed with a user-friendly interface that simplifies the process of model training and prediction. Users can upload their datasets, select the model they wish to train, and specify the label and content columns. The application then handles data preprocessing, model training, and evaluation, displaying a progress bar and a loss graph to visualize training progress.

Users can also interact with a chatbot integrated into the application. The chatbot has access to a Retrieval-Augmented Generation (RAG) feature, allowing users to ask questions about their data and receive insightful responses.

Technical Details

Frontend: React

The frontend is built using React, a JavaScript library for building user interfaces. React’s component-based architecture enables the creation of reusable UI components, which makes the application both scalable and maintainable. Key features include:

  • A file upload component for dataset input.
  • Interactive forms to specify labels and content columns.
  • A progress bar component to visualize training progress.
  • A graphing component to display the loss curve after training.
  • A chatbot interface powered by the RAG feature.

Backend: Flask

The backend is implemented using Flask, a lightweight Python web framework. Flask serves as the API layer, handling requests from the React frontend and managing the machine learning workflows. Key responsibilities of the Flask backend include:

  • Handling dataset uploads and validating the input.
  • Data preprocessing, including text vectorization and train-test splitting.
  • Model training using Scikit-learn for Naive Bayes and TensorFlow for RNNs.
  • Saving and loading pretrained models for prediction tasks.
  • Serving predictions back to the frontend and allowing dataset downloads.
  • Managing the RAG-enabled chatbot interactions.
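To make the request handling concrete, here is a minimal sketch of what a training endpoint could look like. The route name, form field names, and response shape are illustrative assumptions, not the application's actual API.

    # Minimal sketch of a Flask training endpoint. The route, form field
    # names, and response shape are illustrative assumptions.
    from flask import Flask, request, jsonify
    import pandas as pd

    app = Flask(__name__)

    @app.route("/train", methods=["POST"])
    def train():
        file = request.files["dataset"]            # uploaded CSV file
        label_col = request.form["label_column"]   # target column name
        text_col = request.form["content_column"]  # input text column name

        df = pd.read_csv(file)
        missing = [c for c in (label_col, text_col) if c not in df.columns]
        if missing:
            return jsonify({"error": f"missing columns: {missing}"}), 400

        # The real backend would kick off Naive Bayes or RNN training here;
        # this sketch only reports basic dataset statistics.
        return jsonify({"rows": len(df), "classes": int(df[label_col].nunique())})

    if __name__ == "__main__":
        app.run(port=5000)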

Modeling Algorithms

Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed for sequence data, making them suitable for tasks like text classification and time series forecasting. Unlike traditional neural networks, RNNs have connections that form directed cycles, allowing them to maintain a 'memory' of previous inputs in the sequence. This memory enables RNNs to capture temporal dependencies, which is crucial for understanding context in sequences.

The core component of an RNN is the recurrent layer, where the output from the previous time step is fed back into the network along with the current input. This creates a loop, where the hidden state of the network evolves over time. However, vanilla RNNs suffer from vanishing gradient problems, making it difficult to learn long-term dependencies. To address this, more advanced variants like Long Short-Term Memory (LSTM) networks or Gated Recurrent Units (GRUs) are often used.

In our application, we use a basic LSTM model implemented with TensorFlow, which consists of:

  • An embedding layer to convert input text into dense vectors.
  • LSTM layers to process these vectors, capturing sequential dependencies.
  • A dense layer with a softmax activation function to output class probabilities.
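A minimal TensorFlow/Keras sketch of this architecture might look as follows; the vocabulary size, sequence length, layer widths, and number of classes are placeholder assumptions rather than the application's exact configuration.

    # Sketch of a small text classifier with an embedding layer, an LSTM
    # layer, and a softmax output. All hyperparameters are illustrative.
    import tensorflow as tf

    vocab_size = 10_000   # assumed vocabulary size
    max_len = 100         # assumed padded sequence length
    num_classes = 3       # assumed number of labels

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,)),
        tf.keras.layers.Embedding(vocab_size, 64),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.summary()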

Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes’ Theorem, which describes the probability of a class given a feature vector. The algorithm is termed "naive" because it assumes that the features are conditionally independent of each other given the class label. Despite this strong and often unrealistic assumption, Naive Bayes models are highly effective for text classification tasks, particularly when the input data is sparse.

The Naive Bayes classifier calculates the posterior probability of each class given the input features and assigns the label with the highest probability. In text classification, the features are typically word frequencies or TF-IDF values, and the model calculates the likelihood of a word appearing in a class.
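Concretely, under the conditional-independence assumption, the posterior for a class c given word features x_1, ..., x_n reduces to

    P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)

and the predicted label is the class that maximizes this quantity (in practice its logarithm, to avoid numerical underflow).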

Our application utilizes Scikit-learn's implementation of the Multinomial Naive Bayes classifier, which is particularly well-suited for discrete data like word counts. The key steps in training a Naive Bayes classifier include:

  • Vectorizing the text data using CountVectorizer or TfidfVectorizer.
  • Calculating the likelihood of each word in each class.
  • Applying Bayes' Theorem to predict the class of new data.
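As a rough sketch of these steps with Scikit-learn, using TF-IDF features and a handful of made-up example texts:

    # Sketch of the vectorize-then-classify steps described above, using
    # TF-IDF features and Multinomial Naive Bayes. The data is illustrative.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["great product, loved it", "terrible experience", "works as expected"]
    labels = ["positive", "negative", "positive"]

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(texts, labels)

    print(clf.predict(["absolutely loved it"]))  # e.g. ['positive']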

Using Pretrained Classifiers

In addition to custom model training, the application offers several pretrained classifiers that users can apply to their datasets. These models are trained on large, diverse datasets and include:

  • Sentiment Classifier: Determines the sentiment (positive, negative, neutral) of text data.
  • News Classifier: Categorizes news articles into topics such as politics, sports, or technology.
  • Hate Speech Classifier: Identifies whether a message contains hate speech or not.

Users can select one of these classifiers from the application’s interface and upload their dataset. The backend loads the corresponding model and applies it to the user's data, adding a new column with the predictions.
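Assuming, for illustration, that a pretrained classifier is a Scikit-learn pipeline serialized with joblib (the file path and column names below are hypothetical), applying it to an uploaded dataset could look like this:

    # Sketch of loading a saved classifier and adding a prediction column.
    # The file path, model format, and column names are assumptions.
    import joblib
    import pandas as pd

    df = pd.read_csv("uploaded.csv")                            # user's dataset
    model = joblib.load("models/sentiment_classifier.joblib")   # hypothetical path

    df["prediction"] = model.predict(df["content"])             # assumed text column
    df.to_csv("uploaded_with_predictions.csv", index=False)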

Application Workflow

1. Uploading Datasets

Users begin by uploading their dataset via the file upload component. The dataset should be a CSV file with separate columns for the labels and the text content.
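For example, a small sentiment dataset might look like the snippet below; the column names are illustrative, and the user simply points the application at whichever columns hold the labels and the text.

    label,content
    positive,"Great build quality, would buy again"
    negative,"Stopped working after a week"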

2. Selecting and Training Models

After the dataset is uploaded, users select whether they want to train an RNN or Naive Bayes model. They then specify the label column (the target variable) and the content column (the input features). Upon clicking the 'Train Model' button, the application starts the training process.

The progress bar component provides real-time feedback on the training process, showing the percentage completion. Once training is complete, a loss graph is displayed, visualizing the model's performance over time.
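One common way to surface this kind of progress from a Keras training loop is a custom callback that records the completed fraction and the running loss after each epoch. The sketch below is illustrative; the shared state and its keys are assumptions, not the application's actual wiring.

    # Sketch of a Keras callback that tracks training progress and loss so
    # the frontend can poll it. The shared dict and its keys are illustrative.
    import tensorflow as tf

    training_state = {"progress": 0.0, "losses": []}

    class ProgressReporter(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            logs = logs or {}
            training_state["progress"] = (epoch + 1) / self.params["epochs"]
            training_state["losses"].append(float(logs.get("loss", 0.0)))

    # Used as: model.fit(x, y, epochs=10, callbacks=[ProgressReporter()])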

3. Making Predictions

The trained model can now be used to make predictions on new data. Users upload a new dataset, and the application adds a prediction column based on the trained model or a selected pretrained classifier. The resulting dataset is then available for download.
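Serving the augmented dataset back to the user can be as simple as a Flask route that returns the CSV as an attachment; the route and file path below are assumptions for illustration.

    # Sketch of a download endpoint for the prediction results. The route
    # and file path are illustrative.
    from flask import Flask, send_file

    app = Flask(__name__)

    @app.route("/download/results")
    def download_results():
        # CSV with the added prediction column (hypothetical path).
        return send_file(
            "uploaded_with_predictions.csv",
            mimetype="text/csv",
            as_attachment=True,
            download_name="predictions.csv",
        )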

4. Chatbot Interaction

Users can interact with a chatbot integrated into the application to ask questions about their data or models. The chatbot uses a Retrieval-Augmented Generation (RAG) mechanism to provide accurate and context-aware responses.
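The post does not detail the retrieval stack, but a minimal RAG loop generally has the same shape: index the documents, retrieve the chunks most relevant to a question, and pass them to a language model as context. The sketch below uses TF-IDF retrieval and a placeholder prompt purely for illustration; the actual application may use different embeddings and a different generator.

    # Minimal illustration of the retrieve-then-generate pattern behind RAG.
    # Retrieval here is TF-IDF nearest-neighbour; the generation step is a
    # placeholder prompt, since the real LLM call depends on the provider.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "The dataset has three sentiment classes: positive, negative, neutral.",
        "The LSTM model was trained for ten epochs with a softmax output layer.",
    ]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)

    def retrieve(question, k=1):
        q_vec = vectorizer.transform([question])
        scores = cosine_similarity(q_vec, doc_vectors)[0]
        top = scores.argsort()[::-1][:k]
        return [documents[i] for i in top]

    question = "How many sentiment classes are there?"
    context = "\n".join(retrieve(question))
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
    print(prompt)  # this prompt would be sent to the chat model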

Conclusion

This desktop application offers a powerful yet accessible way for users to engage with machine learning. By combining a React frontend with a Flask backend, it enables seamless dataset uploads, model training, and predictions. Whether users want to train custom models or use pretrained classifiers, the application simplifies the process and provides valuable insights through its interactive features.

As AI continues to evolve, tools like this one will become increasingly important in democratizing access to machine learning, allowing users from all backgrounds to harness the power of AI for their specific needs.
