GIT repositories, category: text processing
Compact Language Detector 2 (c++)
CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. Legacy encodings must be converted to valid UTF-8 by the caller. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes.
#text processing #c++
pretrained word embeddings (by Keras)
This script loads pre-trained word embeddings (GloVe embeddings) into a frozen Keras Embedding layer, and uses it to train a text classification model on the 20 Newsgroup dataset.
#nlp #text processing #keras
Skip-gram-with-NS.ipynb
Skip-gram with negative sampling
#text processing #machine learning modelling
counsel-chat (counsel_chat.ipynb)
This repository holds the code for working with data from counselchat.com. The scraped data consist of questions from individuals seeking assistance from licensed therapists, together with the therapists' responses.
#llm #nlp #text processing
fastText on IMDB-dataset (by Keras)
This example demonstrates the use of fastText for text classification.
#nlp #text processing #keras
LSTM stateful (by Keras)
Example demonstrating how to use a stateful LSTM model, with a performance comparison of stateful vs. stateless LSTMs.
#text processing #machine learning modelling #neural network
sequence-to-sequence translation (by Keras)
Implementation of a basic character-level sequence-to-sequence model, applied to translating short English sentences into short French sentences, character by character.
#text processing #keras
multi-label-text-classification
Holds code for collecting data from arXiv to build a multi-label text classification dataset and a simple classifier on top of it.
#text processing #machine learning modelling
fastText
fastText is a library for efficient learning of word representations and sentence classification.
#text processing #embeddings
mgpt
We introduce mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from 25 linguistically diverse language families using Wikipedia and the C4 corpus.
#llm #text processing #transformers
unstructured-text-modelling
Text Analytics (Unsupervised Clustering) and Neural Network Modelling
#text processing #machine learning modelling #neural network
LSTM text generation (by Keras)
Example script to generate text from Nietzsche's writings.
#nlp #text processing #machine learning modelling
openai-finetuning-example
This repository provides an example of fine-tuning OpenAI's GPT-4o-mini model for classifying customer service support tickets. Through fine-tuning, we are able to increase the classification accuracy from 69% to 94%.
#llm #nlp #text processing
miniGPT
Minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer), covering both training and inference.
#llm #nlp #text processing