The work is done in Google Colab Pro with a V100 GPU and a High-RAM setting for the steps involving the LLM. The notebook is divided into self-contained sections, most of which can be executed independently, minimizing dependency on earlier steps. Data is saved after each section, allowing continuation in a new session if needed. Additionally, the parsed dataset and the Python modules are available in this Github repository.
I use a subset of the arXiv Dataset that is openly available on the Kaggle platform and primarily maintained by Cornell University. In a machine-readable format, it contains a repository of 1.7 million scholarly papers across STEM, with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more. It is updated regularly.
The dataset is clean and in an easy-to-use format, so we can focus on our task without spending too much time on data preprocessing. To further simplify the data preparation process, I built a Python module that performs the relevant steps. It can be found at utils/arxiv_parser.py if you want to take a peek at the code, otherwise follow along with the Google Colab:
- download the zipped arXiv file (1.2 GB) into the directory of your choice, labelled data_path,
- download arxiv_parser.py into the directory utils,
- import and initialize the module in your Google Colab notebook,
- unzip the file; this extracts a 3.7 GB file: archive-metadata-oai-snapshot.json,
- specify a general topic (I work with cs, which stands for computer science), so you have a more manageable data size,
- choose the features to keep (there are 14 features in the downloaded dataset),
- the abstracts can vary in length quite a bit, so I added the option of selecting entries for which the number of tokens in the abstract falls in a given interval, and used this feature to downsize the dataset,
- although I choose to work with the title feature, there is an option to take the more common approach of concatenating the title and the abstract into a single feature denoted corpus.

# Import the data parser module
from utils.arxiv_parser import *
# Initialize the data parser
parser = ArXivDataProcessor(data_path)
# Unzip the downloaded file to extract a json file in data_path
parser.unzip_file()
# Select a topic and extract the articles on that topic
topic = 'cs'
entries = parser.select_topic('cs')
# Build a pandas dataframe with specified selections
df = parser.select_articles(entries,              # extracted articles
                            cols=['id', 'title', 'abstract'],  # features to keep
                            min_length=100,        # min tokens an abstract should have
                            max_length=120,        # max tokens an abstract should have
                            keep_abs_length=False, # do not keep the abs_length column
                            build_corpus=False)    # do not build a corpus column
# Save the selected data to a csv file 'selected_{topic}.csv', uses data_path
parser.save_selected_data(df, topic)
With the options above I extract a dataset of 983 computer science articles. We are ready to move on to the next step.
If you want to skip the data processing steps, you can use the cs dataset available in the Github repository.
The Method
KeyBERT is a method that extracts keywords or keyphrases from text. It uses document and word embeddings to find the sub-phrases that are most similar to the document, via cosine similarity. KeyLLM is another minimal method for keyword extraction, but it is based on LLMs. Both methods are developed and maintained by Maarten Grootendorst.
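As a quick illustration of plain KeyBERT on its own, before any LLM is involved, here is a minimal sketch; the sample title is borrowed from the prompt example used later in this post:

# Sketch: vanilla KeyBERT keyword extraction
from keybert import KeyBERT

doc = "Semantics and Termination of Simply-Moded Logic Programs with Dynamic Scheduling"

kb = KeyBERT()  # defaults to the all-MiniLM-L6-v2 embedding model
candidates = kb.extract_keywords(doc,
                                 keyphrase_ngram_range=(1, 2),  # unigrams and bigrams
                                 top_n=5)                       # number of candidates to return
print(candidates)  # list of (keyword, similarity score) tuples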
The two methods can be combined for enhanced results. Keywords extracted with KeyBERT are fine-tuned through KeyLLM. Conversely, candidate keywords identified through traditional NLP techniques help ground the LLM, minimizing the generation of undesired outputs.
For details on different ways of using KeyLLM, see Maarten Grootendorst, Introducing KeyLLM — Keyword Extraction with LLMs.
Use KeyBERT [source] to extract keywords from each document. These are the candidate keywords provided to the LLM for fine-tuning:
- documents are embedded using Sentence Transformers to build a document-level representation,
- word embeddings are extracted for N-gram words/phrases,
- cosine similarity is used to find the words or phrases that are most similar to each document.
Use KeyLLM [source] to fine-tune the keywords extracted by KeyBERT via text generation with transformers [source]:
- the community detection method in Sentence Transformers [source] groups similar documents, so we extract keywords from only one document in each group,
- the candidate keywords are provided to the LLM, which fine-tunes the keywords for each cluster.
Besides Sentence Transformers, KeyBERT supports other embedding models, see [here].
Sentence Transformers facilitate community detection by using a specified threshold. When documents lack inherent clusters, clear groupings may not emerge. In my case, out of 983 titles, roughly 800 distinct communities were identified. More naturally clustered data tends to yield better-defined communities.
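For context, the grouping step that KeyLLM relies on can be reproduced directly with the community detection utility in the sentence-transformers library. Below is a minimal sketch using the threshold value of 0.5 adopted later in this post; it is an illustration, not the exact call KeyLLM makes internally:

# Sketch: community detection over the title embeddings
from sentence_transformers import SentenceTransformer, util

titles = df.title.tolist()  # the dataframe built in the data preparation step

st_model = SentenceTransformer("all-mpnet-base-v2")
title_embeddings = st_model.encode(titles, convert_to_tensor=True)

# Groups titles whose cosine similarity exceeds the threshold
communities = util.community_detection(title_embeddings, threshold=0.5, min_community_size=1)
print(f"{len(communities)} communities found among {len(titles)} titles")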
The Large Language Model
After experimenting with various smaller LLMs, I chose Zephyr-7B-Beta for this project. This model is based on Mistral-7B, and it is one of the first models fine-tuned with Direct Preference Optimization (DPO). It not only outperforms other models in its class but also surpasses Llama2-70B on some benchmarks. For more insights on this LLM take a look at Benjamin Marie, Zephyr 7B Beta: A Good Teacher Is All You Need. Although it is feasible to use the model directly on a Google Colab Pro, I opted to work with a GPTQ quantized version prepared by TheBloke.
Start by downloading the model and its tokenizer following the instructions provided in the model card:
# Required installs
!pip install transformers optimum accelerate
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
# Required imports
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Load the model and the tokenizer
model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"
llm = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                           device_map="auto",
                                           trust_remote_code=False,
                                           revision="main")  # change revision for a different branch

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
Additionally, build the text generation pipeline:
generator = pipeline(
    model=llm,
    tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1,
)
The Keyword Extraction Prompt
Experimentation is key in this step. Finding the optimal prompt requires some trial and error, and the performance depends on the chosen model. Let's not forget that LLMs are probabilistic, so it is not guaranteed that they will return the same output every time. To develop the prompt below, I relied on both experimentation and the following considerations:
# Zephyr-7B-Beta prompt format, as given in the model card
prompt = "Tell me about AI"
prompt_template = f'''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''
And here is the prompt I use to fine-tune the keywords extracted with KeyBERT:
prompt_keywords = """
<|system|>
I have the following document:
Semantics and Termination of Simply-Moded Logic Programs with Dynamic Scheduling
and five candidate keywords:
scheduling, logic, semantics, termination, moded

Based on the information above, extract the keywords or the keyphrases that best describe the topic of the text.
Follow the requirements below:
1. Make sure to extract only the keywords or keyphrases that appear in the text.
2. Provide five keywords or keyphrases! Do not number or label the keywords or the keyphrases!
3. Do not include anything else besides the keywords or the keyphrases! I repeat do not include any comments!

semantics, termination, simply-moded, logic programs, dynamic scheduling</s>

<|user|>
I have the following document:
[DOCUMENT]
and five candidate keywords:
[CANDIDATES]

Based on the information above, extract the keywords or the keyphrases that best describe the topic of the text.
Follow the requirements below:
1. Make sure to extract only the keywords or keyphrases that appear in the text.
2. Provide five keywords or keyphrases! Do not number or label the keywords or the keyphrases!
3. Do not include anything else besides the keywords or the keyphrases! I repeat do not include any comments!</s>

<|assistant|>
"""
Keyword Extraction and Parsing
We now have everything needed to proceed with the keyword extraction. Let me remind you that I work with the titles, so the input documents are short, staying well within the token limits for the BERT embeddings.
Start by creating a TextGeneration pipeline wrapper for the LLM and instantiate KeyBERT. Choose the embedding model: if none is specified, the default is all-MiniLM-L6-v2. In this case, I select the highest-performing pretrained model for sentence embeddings; see here for a complete list.
# Install the required packages
!pip install keybert
!pip install sentence-transformers
# The required imports
from keybert.llm import TextGeneration
from keybert import KeyLLM, KeyBERT
from sentence_transformers import SentenceTransformer
# KeyBERT TextGeneration pipeline wrapper
llm_tg = TextGeneration(generator, prompt=prompt_keywords)
# Instantiate KeyBERT and specify an embedding model
kw_model = KeyBERT(llm=llm_tg, model="all-mpnet-base-v2")
Recall that the dataset was prepared and saved as a pandas dataframe df. To process the titles, simply call the extract_keywords method:
# Retain the article titles only for analysis
titles_list = df.title.tolist()
# Process the documents and collect the results
titles_keys = kw_model.extract_keywords(titles_list, threshold=0.5)
# Add the results to df
df["titles_keys"] = titles_keys
The threshold parameter determines the minimum similarity required for documents to be grouped into the same community. A higher value will group nearly identical documents, while a lower value will cluster documents covering similar topics.
The choice of embeddings significantly influences the appropriate threshold, so it is advisable to consult the model card for guidance. I am grateful to Maarten Grootendorst for highlighting this aspect, as can be seen here.
It is important to note that my observations apply only to sentence transformers, as I have not experimented with other types of embeddings.
Let's take a look at some outputs:
Comments:
- In the second example shown here, we observe keywords or keyphrases that are not present in the original text. If this poses a problem in your case, consider enabling check_vocab=True as done [here]. However, it is important to remember that these results are highly influenced by the choice of LLM, and to a lesser extent by quantization and the construction of the prompt.
- With longer input documents, I noticed more deviations from the required output.
- One consistent observation is that the number of keywords extracted often deviates from five. It is common to encounter titles with fewer extracted keywords, especially when the input is brief. Conversely, some titles yield as many as 10 extracted keywords. Let's examine the distribution of keyword counts for this run:
These variations complicate the subsequent parsing steps. There are a few options for addressing this: we could investigate these cases in detail, ask the model to revise its output and either trim or regenerate the keywords, or simply set these instances aside and focus only on titles with exactly five keywords, as I decided to do for this project.
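As an illustration of that last option, here is a minimal pandas sketch that inspects the distribution of keyword counts and keeps only the titles with exactly five keywords; the file name is an assumption about where this subset is saved before being reloaded later:

# Sketch: inspect keyword counts and keep titles with exactly five keywords
counts = df["titles_keys"].apply(len)
print(counts.value_counts().sort_index())  # distribution of extracted keyword counts

df5 = df[counts == 5].reset_index(drop=True)
parsed_keys_file = "selected_keys5.csv"                       # assumed file name
df5.to_csv(data_path + parsed_keys_file, index=False)         # reloaded in the clustering section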
The next step is to cluster the keywords and keyphrases to reveal common topics across articles. To accomplish this I use two algorithms: UMAP for dimensionality reduction and HDBSCAN for clustering.
The Algorithms: HDBSCAN and UMAP
Hierarchical Density-Based Spatial Clustering of Applications with Noise, or HDBSCAN, is a highly performant unsupervised algorithm designed to find patterns in the data. It finds the optimal clusters based on their density and proximity. This is especially useful in cases where the number and shape of the clusters may be unknown or difficult to determine.
The results of the HDBSCAN clustering algorithm can vary if you run it multiple times with the same hyperparameters. This is because HDBSCAN is a stochastic algorithm, meaning that it involves some degree of randomness in the clustering process. Specifically, HDBSCAN uses a random initialization of the cluster hierarchy, which can result in different cluster assignments each time the algorithm is run.
However, the degree of variation between different runs depends on several factors, such as the dataset, the hyperparameters, and the seed value used for the random number generator. In some cases, the variation may be minimal, while in other cases it can be significant.
There are two clustering options with HDBSCAN.
- The first option, hard clustering, assigns each data point to a cluster or labels it as noise. This is a hard assignment; there are no mixed memberships. This approach can result in one large cluster categorized as noise (cluster labelled -1) and numerous smaller clusters. Fine-tuning the hyperparameters is crucial [see here], as is selecting an embedding model specifically tailored to the domain. Take a look at the associated Google Colab for the results of hard clustering on the project's dataset.
- Soft clustering, on the other hand, is a more recent feature of the HDBSCAN library. In this approach points are not assigned cluster labels; instead they are assigned a vector of probabilities. The length of the vector equals the number of clusters found, and the value at each entry is the probability that the point is a member of the corresponding cluster. This allows points to potentially be a mix of clusters. If you want to better understand how soft clustering works, please refer to How Soft Clustering for HDBSCAN Works. This approach is better suited to the present project, as it generates a larger set of clusters of relatively similar sizes. A small sketch contrasting the two options is shown right after this list.
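To make the distinction concrete, here is a small self-contained sketch on synthetic data; the parameter values are illustrative and not the ones used later for the project's keywords:

# Sketch: hard vs. soft clustering with HDBSCAN on toy data
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# Hard clustering: one label per point, -1 marks noise
hard_labels = clusterer.labels_

# Soft clustering: one membership vector per point, one probability per cluster found
memberships = hdbscan.all_points_membership_vectors(clusterer)
soft_labels = np.argmax(memberships, axis=1)

print(set(hard_labels))          # hard labels, possibly including -1 for noise
print(memberships[0].round(2))   # membership probabilities for the first point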
While HDBSCAN can perform well on low- to medium-dimensional data, its performance tends to decrease significantly as the dimension increases. Generally HDBSCAN performs best on data of up to around 50 dimensions, [see here].
Documents for clustering are typically embedded using an efficient transformer from the BERT family, resulting in a dataset with several hundred dimensions.
To reduce the dimension of the embedding vectors we use UMAP (Uniform Manifold Approximation and Projection), a non-linear dimension reduction algorithm and one of the best performing in its class. It seeks to learn the manifold structure of the data and to find a low-dimensional embedding that preserves the essential topological structure of that manifold.
UMAP has been shown to be highly effective at preserving the overall structure of high-dimensional data in lower dimensions, while also providing performance superior to other popular algorithms such as t-SNE and PCA.
Keyword Clustering
Install and import the required packages and libraries.

# Required installs
!pip install umap-learn
!pip install hdbscan
!pip install -U sentence-transformers
# General imports
import pandas as pd
import numpy as np
import re
import pickle
# Imports needed to generate the BERT embeddings
from sentence_transformers import SentenceTransformer
# Libraries for dimensionality reduction
import umap.umap_ as umap
# Import the clustering algorithm
import hdbscan
Prepare the dataset by aggregating all keywords and keyphrases from each title's individual quintet into a single list of unique keywords, saved as a pandas dataframe.

# Load the data if needed - titles with five extracted keywords
df5 = pd.read_csv(data_path + parsed_keys_file)
# Create a list of all sublists of keywords and keyphrases
df5_keys = df5.titles_keys.tolist()
# Flatten the list of sublists
flat_keys = [item for sublist in df5_keys for item in sublist]
# Create a list of unique keywords
flat_keys = list(set(flat_keys))
# Create a dataframe with the distinct keywords
keys_df = pd.DataFrame(flat_keys, columns=['key'])
I obtain almost 3,000 unique keywords and keyphrases from the 884 processed titles. Here is a sample: n-colorable graphs, experiments, constraints, tree structure, complexity, etc.
Generate 768-dimensional embeddings with Sentence Transformers.

# Instantiate the embedding model
model = SentenceTransformer('all-mpnet-base-v2')
# Embed the keywords and keyphrases into a 768-dimensional real vector space
keys_df['key_bert'] = keys_df['key'].apply(lambda x: model.encode(x))
Perform dimensionality reduction with UMAP.

# Reduce to 10-dimensional vectors and keep the local neighborhood at 15
embeddings = umap.UMAP(n_neighbors=15,   # balances local vs. global structure
                       n_components=10,  # dimension of reduced vectors
                       metric='cosine').fit_transform(list(keys_df.key_bert))
# Add the reduced embedding vectors to the dataframe
keys_df['key_umap'] = embeddings.tolist()
Cluster the 10-dimensional vectors with HDBSCAN. To keep this blog succinct, I will omit descriptions of the parameters that pertain more to hard clustering. For detailed information on each parameter, please refer to [Parameter Selection for HDBSCAN*].

# Initialize the clustering model
clusterer = hdbscan.HDBSCAN(algorithm='best',
                            prediction_data=True,
                            approx_min_span_tree=True,
                            gen_min_span_tree=True,
                            min_cluster_size=20,
                            cluster_selection_epsilon=.1,
                            min_samples=1,
                            p=None,
                            metric='euclidean',
                            cluster_selection_method='leaf')

# Fit the data
clusterer.fit(embeddings)
# Create soft clusters
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)
# Add the soft cluster information to the data
closest_clusters = [np.argmax(x) for x in soft_clusters]
keys_df['cluster'] = closest_clusters
Below is the distribution of keywords across clusters. Examination of the spread of keywords and keyphrases into soft clusters reveals a total of 60 clusters, with a fairly even distribution of elements per cluster, varying from about 20 to nearly 100.
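If you prefer to check this numerically rather than from a plot, a quick look at the cluster sizes is enough; this is just a convenience sketch:

# Sketch: number of keywords assigned to each soft cluster
cluster_sizes = keys_df['cluster'].value_counts().sort_index()
print(len(cluster_sizes))        # total number of clusters
print(cluster_sizes.describe())  # spread of cluster sizes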
Having clustered the keywords, we are now ready to use GenAI once more to enhance and refine our findings. At this step, we use an LLM to analyze each cluster, summarizing the keywords and keyphrases and assigning a brief label to the cluster.
While it is not necessary, I choose to continue with the same LLM, Zephyr-7B-Beta. Should you need to download the model, please consult the relevant section above. Notably, I adjust the prompt to suit the distinct nature of this task.
The following function is designed to extract a label and a description for a cluster, parse the output and integrate it into a pandas dataframe.
def extract_description(df: pd.DataFrame,
                        n: int) -> pd.DataFrame:
    """Use a custom prompt to send to a LLM
    to extract labels and descriptions for a list of keywords."""

    one_cluster = df[df['cluster'] == n]
    one_cluster_copy = one_cluster.copy()
    sample = one_cluster_copy.key.tolist()

    prompt_clusters = f"""<|system|>
I have the following list of keywords and keyphrases:
['encryption','attribute','firewall','security properties','network security','reliability','surveillance','distributed risk factors','still vulnerable','cryptographic','protocol','signaling','safe','adversary','message passing','input-determined guards','secure communication','vulnerabilities','value-at-risk','anti-spam','intellectual property rights','countermeasures','security implications','privacy','protection','mitigation strategies','vulnerability','secure networks','guards']

Based on the information above, first name the domain these keywords or keyphrases belong to, secondly give a brief description of the domain.
Do not use more than 30 words for the description!
Do not provide details!
Do not give examples of the contexts, do not say 'such as' and do not list the keywords or the keyphrases!
Do not start with a statement of the form 'These keywords belong to the domain of' or with 'The domain'.

Cybersecurity: Cybersecurity, emphasizing methods and strategies for safeguarding digital information and networks against unauthorized access and threats.</s>

<|user|>
I have the following list of keywords and keyphrases:
{sample}
Based on the information above, first name the domain these keywords or keyphrases belong to, secondly give a brief description of the domain.
Do not use more than 30 words for the description!
Do not provide details!
Do not give examples of the contexts, do not say 'such as' and do not list the keywords or the keyphrases!
Do not start with a statement of the form 'These keywords belong to the domain of' or with 'The domain'.
<|assistant|>"""

    # Generate the outputs
    outputs = generator(prompt_clusters,
                        max_new_tokens=120,
                        do_sample=True,
                        temperature=0.1,
                        top_k=10,
                        top_p=0.95)

    text = outputs[0]["generated_text"]

    # Pattern that marks the start of the model's answer
    pattern = "<|assistant|>\n"

    # Extract the output
    response = text.split(pattern, 1)[1].strip(" ")

    # Check if the output has the desired Label: Description format
    if len(response.split(":", 1)) == 2:
        label = response.split(":", 1)[0].strip(" ")
        description = response.split(":", 1)[1].strip(" ")
    else:
        label = description = response

    # Add the description and the label to the dataframe
    one_cluster_copy.loc[:, 'description'] = description
    one_cluster_copy.loc[:, 'label'] = label

    return one_cluster_copy
Now we can apply the above function to each cluster and collect the results:
import re
import pandas as pd
# Initialize an empty list to store the cluster dataframes
dataframes = []
clusters = len(set(keys_df.cluster))
# Iterate over the range of cluster values
for n in range(clusters-1):
    df_result = extract_description(keys_df, n)
    dataframes.append(df_result)
# Concatenate the individual dataframes
final_df = pd.concat(dataframes, ignore_index=True)
Let's take a look at a sample of outputs. For the full list of outputs please refer to the Google Colab.
We must remember that LLMs, with their inherent probabilistic nature, can be unpredictable. While they generally adhere to instructions, their compliance is not absolute. Even slight alterations in the prompt or the input text can lead to substantial variations in the output. In the extract_description() function, I included a feature that logs the response in both the label and description columns in those cases where the Label: Description format is not followed, as illustrated by the irregular output for cluster 7 above. The outputs for the entire set of 60 clusters are available in the accompanying Google Colab notebook.
A second observation is that each cluster is parsed independently by the LLM, so it is possible to get repeated labels. Additionally, there may be instances of recurring keywords extracted from the input list.
The effectiveness of the process relies heavily on the choice of the LLM, and issues are minimal with a highly performant model. The output also depends on the quality of the keyword clustering and on whether the cluster has an inherent topic.
Strategies to mitigate these challenges depend on the cluster count, the dataset characteristics and the accuracy required for the project. Here are two options:
- Manually rectify each issue, as I did in this project. With only 60 clusters and merely three faulty outputs, manual adjustments were made to correct the faulty outputs and to ensure unique labels for each cluster (a sketch for flagging such cases follows this list).
- Employ an LLM to make the corrections, although this method does not guarantee flawless results.
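For the manual route, a small pandas sketch like the one below can flag the clusters that need attention; it works on final_df and relies on the fact that extract_description() stores the full response in both columns whenever the Label: Description format is not followed:

# Sketch: flag clusters whose LLM output needs manual correction
cluster_info = final_df[['cluster', 'label', 'description']].drop_duplicates('cluster')

# Cases where the Label: Description format was not followed
bad_format = cluster_info[cluster_info['label'] == cluster_info['description']]

# Cases where two or more clusters received the same label
dup_labels = cluster_info[cluster_info['label'].duplicated(keep=False)]

print(bad_format[['cluster', 'label']])
print(dup_labels[['cluster', 'label']].sort_values('label'))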
Data to Load into the Graph
There are two csv files (or pandas dataframes, if working in a single session) to extract the data from.
- articles: contains a unique id for each article, title, abstract and titles_keys, which is the list of five extracted keywords or keyphrases;
- keywords: with columns key, cluster, description and label, where key contains the complete list of unique keywords or keyphrases, and the remaining features describe the cluster the keyword belongs to. A sketch of how both tables can be converted into rows for loading follows this list.
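Here is a minimal sketch of how the two tables might be read back and turned into lists of row dictionaries, the shape that the UNWIND $rows queries below iterate over; the file names and the exact input format expected by the custom loader are assumptions:

# Sketch: prepare the rows to be loaded into Neo4j (assumed file names)
import ast
import pandas as pd

articles_df = pd.read_csv(data_path + 'articles.csv')
keywords_df = pd.read_csv(data_path + 'keywords.csv')

# Lists stored in csv come back as strings, so parse titles_keys into Python lists
articles_df['titles_keys'] = articles_df['titles_keys'].apply(ast.literal_eval)

# One dictionary per row, passed later as the $rows parameter
articles = articles_df.to_dict('records')
keywords = keywords_df.to_dict('records')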
Neo4j Connection
To build a knowledge graph, we start by setting up a Neo4j instance, choosing from options such as Sandbox, AuraDB, or Neo4j Desktop. For this project, I am using AuraDB's free version. It is straightforward to launch a blank instance and download its credentials.
Next, establish a connection to Neo4j. For convenience, I use a custom Python module, which can be found at [utils/neo4j_conn.py](https://github.com/SolanaO/Blogs_Content/blob/master/keyllm_neo4j/utils/neo4j_conn.py). This module contains methods for connecting to and interacting with the graph database.
# Install neo4j
!pip install neo4j
# Import the connector
from utils.neo4j_conn import *
# Graph DB instance credentials
URI = 'neo4j+ssc://xxxxxx.databases.neo4j.io'
USER = 'neo4j'
PWD = 'your_password_here'
# Establish the connection to the Neo4j instance
graph = Neo4jGraph(url=URI, username=USER, password=PWD)
The graph we are about to build has a simple schema, consisting of three node types and two relationship types:
Building the graph is now straightforward with just two Cypher queries:
# Load Keyword and Topic nodes, and the HAS_TOPIC relationships
query_keywords_topics = """UNWIND $rows AS row
MERGE (k:Keyword {name: row.key})
MERGE (t:Topic {cluster: row.cluster, description: row.description, label: row.label})
MERGE (k)-[:HAS_TOPIC]->(t)
"""
graph.load_data(query_keywords_topics, keywords)
# Load Article nodes and the HAS_KEY relationships
query_articles = """UNWIND $rows AS row
MERGE (a:Article {id: row.id, title: row.title, abstract: row.abstract})
WITH a, row
UNWIND row.titles_keys AS key
MATCH (k:Keyword {name: key})
MERGE (a)-[:HAS_KEY]->(k)
"""
graph.load_data(query_articles, articles)
Question the Graph
Let's check the distribution of the nodes and relationships by type:
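One way to run such inspection queries directly from the notebook is with the official neo4j Python driver installed earlier; this sketch bypasses the custom module, and the queries themselves are only one possible formulation:

# Sketch: count nodes by label and relationships by type (illustrative queries)
from neo4j import GraphDatabase

driver = GraphDatabase.driver(URI, auth=(USER, PWD))

count_nodes = "MATCH (n) RETURN labels(n) AS node_type, count(*) AS total ORDER BY total DESC"
count_rels = "MATCH ()-[r]->() RETURN type(r) AS rel_type, count(*) AS total ORDER BY total DESC"

with driver.session() as session:
    for record in session.run(count_nodes):
        print(record["node_type"], record["total"])
    for record in session.run(count_rels):
        print(record["rel_type"], record["total"])

# The driver is reused in the next two sketches and closed after the last one.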
We can find which individual topics (or clusters) are most popular among our collection of articles, by counting the cumulative number of articles associated with the keywords they are linked to:
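A possible Cypher formulation of that count, reusing the driver opened in the previous sketch, is shown below; counting cumulative article-keyword associations rather than distinct articles is a modelling choice, so adjust as needed:

# Sketch: most popular topics by number of article-keyword associations
popular_topics = """
MATCH (a:Article)-[:HAS_KEY]->(:Keyword)-[:HAS_TOPIC]->(t:Topic)
RETURN t.label AS topic, count(a) AS article_mentions
ORDER BY article_mentions DESC
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(popular_topics):
        print(record["topic"], record["article_mentions"])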
Here is a snapshot of the node Semantics, which corresponds to cluster 58, and its linked keywords:
We can also identify commonly occurring words in titles, using the query below:
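Here is one possible version of such a query, again reusing the same driver; it tokenizes the stored titles with Cypher's split(), which is only a rough approximation of proper text processing:

# Sketch: most frequent words across article titles
common_title_words = """
MATCH (a:Article)
UNWIND split(toLower(a.title), ' ') AS word
WITH word WHERE size(word) > 3
RETURN word, count(*) AS freq
ORDER BY freq DESC
LIMIT 20
"""

with driver.session() as session:
    for record in session.run(common_title_words):
        print(record["word"], record["freq"])

driver.close()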
We saw how to structure and enrich a collection of seemingly unrelated short text entries. Using traditional NLP and machine learning, we first extract keywords and then cluster them. These results guide and ground the refinement process performed by Zephyr-7B-Beta. While some oversight of the LLM is still necessary, the initial output is significantly enriched. A knowledge graph is used to reveal the newly discovered connections in the corpus.
Our key takeaway is that no single method is perfect. However, by strategically combining different techniques, acknowledging their strengths and weaknesses, we can achieve superior results.
Google Colab Notebook and Code
Data
Technical Documentation
Blogs and Articles
- Maarten Grootendorst, Introducing KeyLLM — Keyword Extraction with LLMs, Towards Data Science, Oct 5, 2023.
- Benjamin Marie, Zephyr 7B Beta: A Good Teacher Is All You Need, Towards Data Science, Nov 10, 2023.
- The H4 Team, Zephyr: Direct Distillation of LM Alignment, Technical Report, arXiv:2310.16944, Oct 25, 2023.