Training & Testing
Train & Test validations validate the training and testing data and report what could be improved in the data to help with model training.
YAML Specifications
Currently the train-test suite specification through YAML has the following sections:
detect_duplicates
It is fine if your test data and training data overlap a little bit; that helps with building and testing an accurate model. Too much overlap, however, is not recommended. This feature in the train-test suite detects the overlap accurately and raises an error if there is too much of it.
To use detect_duplicates, define the section "detect_duplicates".
Next, pass the parameters under "params".
In params there are two attributes: preprocess_args and threshold.
preprocess_args - contains various attributes such as:
ignore_case - ignore the casing of sentences while comparing.
remove_punctuation - it is better to remove punctuation, as it is irrelevant for comparison.
normalize_unicode - data can contain Unicode variants; it is better to normalize them.
remove_stopwords - removing stop words results in more accurate calculations.
ignore_whitespace - ignoring whitespace also helps improve accuracy.
For example: {ignore_case: False, remove_punctuation: True, normalize_unicode: True, remove_stopwords: True}
threshold - attribute where we define the acceptable range, i.e., what percentage of overlap is allowed. It works the same as range-validate in the validation suite. It has four attributes: "gt" (greater than), "lt" (less than), "gte" (greater than or equal to), and "lte" (less than or equal to).
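A minimal sketch of a detect_duplicates entry in the rules file, assuming the section sits at the top level; the threshold value is illustrative:

```yaml
detect_duplicates:
  params:
    preprocess_args: {ignore_case: False, remove_punctuation: True, normalize_unicode: True, remove_stopwords: True, ignore_whitespace: True}
    threshold:
      lte: 10   # illustrative: raise an error if more than 10% of sentences overlap
```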
detect_unknown_tokens
Unknown tokens in the data can degrade the accuracy of the model, so it is important to know what percentage of our data consists of unknown tokens. This feature counts the Out of Vocabulary tokens in the test dataset.
The training dataset is also taken into consideration, for better accuracy and to build a vocabulary with a better domain perspective.
Define the detect_unknown_tokens section in the YAML file.
Then add params with the preprocess_args and threshold values.
preprocess_args
remove_punctuation - punctuation is not important for detecting unknown tokens.
remove_stopwords - stop words can create problems when detecting unknown tokens.
do_lemmatization - applies WordNetLemmatizer to the sentences to increase the accuracy of detecting unknown tokens.
threshold - same as defined in detect_duplicates.
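A matching sketch for detect_unknown_tokens, using the preprocess_args listed above; the threshold value is illustrative:

```yaml
detect_unknown_tokens:
  params:
    preprocess_args: {remove_punctuation: True, remove_stopwords: True, do_lemmatization: True}
    threshold:
      lte: 5   # illustrative: raise an error if more than 5% of test tokens are out of vocabulary
```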
text_complexity_distribution
This feature identifies the class into which each sentence falls.
It is very important to have a proper distribution of LLM data across both the training and test data.
Text Complexity can currently classify sentences into the classes below:
Very Easy
Easy
Fairly Easy
Standard
Fairly Difficult
Difficult
Very Confusing
The classes are based on the Flesch Reading Ease score.
To use it, define the text_complexity_distribution section in the YAML file.
Then add params with the preprocess_args and threshold values.
preprocess_args - currently no arguments are supported, so pass an empty dictionary.
threshold - same as defined in detect_duplicates. The complexity classes found in the training and test data are compared, the threshold is applied class by class, and if the difference violates the threshold the user is informed.
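A sketch for text_complexity_distribution; no preprocess_args are supported, so an empty dictionary is passed, and the threshold value is illustrative:

```yaml
text_complexity_distribution:
  params:
    preprocess_args: {}
    threshold:
      lte: 15   # illustrative: flag a class whose share differs by more than 15 percentage points between training and test data
```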
Example sentences
Very Easy (90-100): "The sun rises in the east."
Easy (80-89): "Cats are known for their independent nature, often wandering through the neighborhood."
Fairly Easy (70-79): "The process of photosynthesis is how plants convert sunlight into energy for their growth."
Standard (60-69): "Global warming is a topic of concern as it has led to significant changes in our planet's climate patterns."
Fairly Difficult (50-59): "The intricate relationship between economic policies and geopolitical stability becomes evident during times of trade negotiations."
Difficult (30-49): "Quantum physics, with its non-intuitive principles and probabilistic nature, challenges even the brightest minds in the scientific community."
Very Confusing (0-29): "The convoluted interplay of esoteric hermeneutics juxtaposed with ontological paradigms presents an epistemological enigma."
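The mapping from Flesch Reading Ease score to class can be reproduced with the textstat package; the classify helper below is a sketch written for this page, not part of the validation suite:

```python
import textstat  # pip install textstat

def classify(sentence: str) -> str:
    """Map a Flesch Reading Ease score to the complexity classes listed above."""
    score = textstat.flesch_reading_ease(sentence)
    if score >= 90:
        return "Very Easy"
    if score >= 80:
        return "Easy"
    if score >= 70:
        return "Fairly Easy"
    if score >= 60:
        return "Standard"
    if score >= 50:
        return "Fairly Difficult"
    if score >= 30:
        return "Difficult"
    return "Very Confusing"

print(classify("The sun rises in the east."))  # expected: Very Easy
```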
Sample Yaml File
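A sketch of a complete rules file combining the three sections described above; the nesting and threshold values are illustrative:

```yaml
detect_duplicates:
  params:
    preprocess_args: {ignore_case: False, remove_punctuation: True, normalize_unicode: True, remove_stopwords: True}
    threshold: {lte: 10}
detect_unknown_tokens:
  params:
    preprocess_args: {remove_punctuation: True, remove_stopwords: True, do_lemmatization: True}
    threshold: {lte: 5}
text_complexity_distribution:
  params:
    preprocess_args: {}
    threshold: {lte: 15}
```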
How to use
Simply pass the training and test sentences along with your rules file, or invoke it without one; see the sketch below.
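The package's actual entry point is not reproduced here; as an illustration only, the self-contained sketch below shows the kind of check the suite runs when given raw training and test sentences and a threshold:

```python
import string

def preprocess(sentence: str, ignore_case: bool = True, remove_punctuation: bool = True) -> str:
    """Minimal stand-in for preprocess_args: lowercase and strip punctuation."""
    if ignore_case:
        sentence = sentence.lower()
    if remove_punctuation:
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sentence.split())

def duplicate_percentage(train: list[str], test: list[str]) -> float:
    """Percentage of test sentences that also appear in the training data."""
    train_set = {preprocess(s) for s in train}
    overlap = sum(1 for s in test if preprocess(s) in train_set)
    return 100.0 * overlap / len(test)

train = ["The sun rises in the east.", "Cats are independent animals."]
test = ["The sun rises in the east!", "Global warming changes climate patterns."]

pct = duplicate_percentage(train, test)
print(f"{pct:.1f}% of test sentences overlap with the training data")  # 50.0%
assert pct <= 60, "too much overlap between training and test data"  # illustrative lte-style threshold
```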