text processing

Add Git project

GIT repositories, category: text processing

Compact Language Detector 2 (c++)

CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. Legacy encodings must be converted to valid UTF-8 by the caller. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes

#text processing #c++

pretrained word embeddings (by Keras)

This script loads pre-trained word embeddings (GloVe embeddings) into a frozen Keras Embedding layer, and uses it to train a text classification model on the 20 Newsgroup dataset

#nlp #text processing #keras

CBoW.ipynb

CBoW with keras (example of implementation and using)

#text processing

Skip-gram-with-NS.ipynb

Skip-gram with negative sampling

#text processing #machine learning modelling

counsel-chat (counsel_chat.ipynb)

This repository holds the code for working with data from counselchat.com. The scarped data are from individiuals seeking assistance from licensed therapists and their associated responses.

#llm #nlp #text processing

fastText on IMDB-dataset (by Keras)

This example demonstrates the use of fasttext for text classification

#nlp #text processing #keras

LSTM stateful (by Keras)

Example demonstrate how to use a stateful LSTM model, stateful vs stateless LSTM performance comparison

#text processing #machine learning modelling #neural network

sequence-to-sequence translation (by Keras)

Implementation a basic character-level sequence-to-sequence model. Applied to translating short English sentences into short French sentences, character-by-character.

#text processing #keras

multi-label-text-classification

Holds code for collecting data from arXiv to build a multi-label text classification dataset and a simpler classifier on top of that.

#text processing #machine learning modelling

fastText

fastText is a library for efficient learning of word representations and sentence classification. 🖥️🤖📡💻🌐⌨️🖱️🌍⭐👩‍💻📱

#text processing #embeddings

mgpt

We introduce mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from linguistically diverse 25 language families using Wikipedia and C4 Corpus.

#llm #text processing #transformers

unstructured-text-modelling

Text Analytics (Unsupervised Clustering) and Neural Network Modelling

#text processing #machine learning modelling #neural network

LSTM text generation (by Keras)

Example script to generate text from Nietzsche's writings.

#nlp #text processing #machine learning modelling

openai-finetuning-example

This repository provides an example of fine-tuning OpenAI's GPT-4o-mini model for classifying customer service support tickets. Through fine-tuning, we are able to increase the classification accuracy from 69% to 94%.

#llm #nlp #text processing

miniGPT

Minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer), both training and inference

#llm #nlp #text processing

nanoGPT

Simplest, fastest repository for training/finetuning medium-sized GPTs.

#llm #nlp #ai