💡Training & Testing

Train & Test validations check the training and test data and report what could be improved in the data to help model training.

YAML Specifications

The train-test suite is currently configured through three sections in the YAML specification:

detect_duplicates

It's better if your test data and training data overlap a little. That helps in building and testing a more accurate model. Too much overlap, however, is not recommended. This feature of the train-test suite detects the overlap and raises an error if there is too much of it.

To use detect_duplicates, define a "detect_duplicates" section.
Next, pass the parameters under "params".
params has two attributes: preprocess_args and threshold.
preprocess_args - contains the following attributes:
- ignore_case - ignore the casing of sentences while comparing.
- remove_punctuation - remove punctuation, which is irrelevant for the comparison.
- normalize_unicode - normalize Unicode characters that the data may contain.
- remove_stopwords - remove stop-words, which makes the calculation more accurate.
- ignore_whitespace - ignore whitespace, which also helps improve accuracy.
  Example: {ignore_case: False, remove_punctuation: True, normalize_unicode: True, remove_stopwords: True}
threshold - defines the allowed range for the overlap percentage. It works the same way as range-validate in the validation suite. It has four attributes: 'gt' (greater than), 'lt' (less than), 'gte' (greater than or equal to), and 'lte' (less than or equal to).
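
As an illustration of how the four threshold attributes can be read, here is a minimal sketch in Python. The function name threshold_violations is hypothetical and not part of the suite; it only shows one plausible way to interpret a threshold dictionary such as {lte: 20}.

```python
import operator

# Comparison used for each of the four threshold attributes.
_OPS = {
    "gt": operator.gt,   # value must be greater than the bound
    "lt": operator.lt,   # value must be less than the bound
    "gte": operator.ge,  # value must be greater than or equal to the bound
    "lte": operator.le,  # value must be less than or equal to the bound
}


def threshold_violations(value, threshold):
    """Return the threshold keys that `value` does not satisfy (hypothetical helper)."""
    failed = []
    for key, bound in threshold.items():
        if key not in _OPS:
            raise ValueError(f"unknown threshold key: {key}")
        if not _OPS[key](value, bound):
            failed.append(key)
    return failed


print(threshold_violations(23.53, {"lte": 20}))          # ['lte']
print(threshold_violations(10, {"gte": 5, "lte": 20}))   # []
```

Several attributes can be combined in one dictionary to express a range, for example {gte: 5, lte: 20}.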

detect_duplicates:
  - params: {preprocess_args: {ignore_case: False, remove_punctuation: True, normalize_unicode: True, remove_stopwords: True},threshold: {lte: 20}}
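
To make the idea concrete, the following is a rough sketch of how an overlap percentage could be computed. It is not the suite's actual implementation: the helper names preprocess and duplicate_percentage, the small STOPWORDS set, and the choice to report duplicates as a share of the test sentences are all assumptions made for illustration.

```python
import string
import unicodedata

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller one).
STOPWORDS = {"a", "an", "the", "is", "of", "to", "in", "and"}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(sentence, ignore_case=False, remove_punctuation=True,
               normalize_unicode=True, remove_stopwords=True):
    """Reduce a sentence to a comparison key (hypothetical helper)."""
    if normalize_unicode:
        sentence = unicodedata.normalize("NFKC", sentence)
    if ignore_case:
        sentence = sentence.lower()
    if remove_punctuation:
        sentence = sentence.translate(_PUNCT_TABLE)
    # Splitting on whitespace also collapses repeated or trailing spaces.
    tokens = sentence.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(tokens)


def duplicate_percentage(training, test, **preprocess_args):
    """Percentage of test sentences whose key also occurs in the training data."""
    if not test:
        return 0.0
    training_keys = {preprocess(s, **preprocess_args) for s in training}
    duplicates = sum(
        1 for s in test if preprocess(s, **preprocess_args) in training_keys
    )
    return 100.0 * duplicates / len(test)


training = ["It was a great day", "Playing games is fun"]
test = ["It was a great day!", "playing games is fun", "Something new", "NLP is part of AI"]
print(duplicate_percentage(training, test))                    # 25.0
print(duplicate_percentage(training, test, ignore_case=True))  # 50.0
```

The second call shows why the preprocessing flags matter: with ignore_case enabled, a sentence that differs only in casing is also counted as a duplicate.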

detect_unknown_tokens

Unknown tokens in the data can degrade the accuracy of a model, so it is important to know what percentage of the tokens are unknown. This feature counts the out-of-vocabulary (OOV) tokens in the test dataset.

The training dataset is also taken into consideration, for better accuracy and to build a vocabulary that reflects the domain.
Define the detect_unknown_tokens section in the YAML file.
Then add params with preprocess_args and threshold values.
preprocess_args - contains the following attributes:
- remove_punctuation - punctuation is not important for detecting unknown tokens.
- remove_stopwords - stop-words can interfere with the detection of unknown tokens.
- do_lemmatization - apply WordNetLemmatizer to the sentences to detect unknown tokens more accurately.
threshold - same as defined in detect_duplicates.

detect_unknown_tokens:
  - params: {preprocess_args : {remove_punctuation: True, remove_stopwords: True, do_lemmatization: False}, threshold: {lte: 20}}
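
The sketch below shows one way an out-of-vocabulary percentage could be computed. It is an illustration under stated assumptions, not the suite's code: tokenize, unknown_token_report, the base_vocabulary parameter, and the short STOPWORDS set are hypothetical, and lemmatization is left out to keep the example free of third-party dependencies.

```python
import string

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller one).
STOPWORDS = {"a", "an", "the", "is", "of", "to", "in", "and"}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(sentence, remove_punctuation=True, remove_stopwords=True):
    """Split a sentence into tokens after optional cleanup (hypothetical helper)."""
    if remove_punctuation:
        sentence = sentence.translate(_PUNCT_TABLE)
    tokens = sentence.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def unknown_token_report(training, test, base_vocabulary=(), **preprocess_args):
    """Return (percentage of unknown test tokens, sorted unique unknown tokens)."""
    # The vocabulary is a general word list extended with the training tokens.
    vocabulary = {w.lower() for w in base_vocabulary}
    for sentence in training:
        vocabulary.update(t.lower() for t in tokenize(sentence, **preprocess_args))

    test_tokens = [t for s in test for t in tokenize(s, **preprocess_args)]
    unknown = [t for t in test_tokens if t.lower() not in vocabulary]
    percentage = 100.0 * len(unknown) / len(test_tokens) if test_tokens else 0.0
    return percentage, sorted(set(unknown))


training = ["The weather is awesome"]
test = ["The weather is bad", "NLP is awesome"]
print(unknown_token_report(training, test))
# (50.0, ['NLP', 'bad'])
print(unknown_token_report(training, test, base_vocabulary={"bad"}))
# (25.0, ['NLP'])
```

Adding the training tokens to the vocabulary is what keeps domain-specific words that appear in both datasets from being reported as unknown.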

text_complexity_distribution

This feature identifies the complexity class into which each sentence falls.
It is very important to have a proper distribution of LLM data in both the training and test data.
Text Complexity can currently assign sentences to the following classes:
- Very Easy
- Easy
- Fairly Easy
- Standard
- Fairly Difficult
- Difficult
- Very Confusing
The classes are based on the Flesch Reading Ease score.
To use it, define the text_complexity_distribution section in the YAML file.
Then add params with preprocess_args and threshold values.
preprocess_args - no arguments are currently supported, so pass an empty dictionary.
threshold - same as defined in detect_duplicates. The complexity classes found in both the training and test data are compared, and the threshold is applied to the percentage difference for each of those classes. If a difference violates the threshold, the user is informed.

text_complexity_distribution:
  - params: {preprocess_args : {}, threshold: {lte: 1}}
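
The comparison described above can be sketched as follows. This is a simplified illustration, not the suite's implementation: it assumes each sentence has already been assigned a complexity class, and the function names class_percentages and distribution_differences as well as the label lists are made up for the example.

```python
from collections import Counter


def class_percentages(labels):
    """Share of each complexity class, in percent (hypothetical helper)."""
    total = len(labels)
    if total == 0:
        return {}
    return {cls: 100.0 * n / total for cls, n in Counter(labels).items()}


def distribution_differences(train_labels, test_labels):
    """Absolute percentage difference for classes present in both datasets."""
    train_pct = class_percentages(train_labels)
    test_pct = class_percentages(test_labels)
    shared = sorted(set(train_pct) & set(test_pct))
    return {cls: abs(train_pct[cls] - test_pct[cls]) for cls in shared}


# Made-up class labels, one per sentence.
train_labels = ["Very Easy"] * 2 + ["Fairly Easy"] * 2 + ["Easy"]
test_labels = ["Very Easy"] * 5 + ["Fairly Easy"] * 5 + ["Easy"] * 3 + ["Standard"] * 4

for cls, diff in distribution_differences(train_labels, test_labels).items():
    print(f"{cls}: {diff:.2f}%")
# Easy: 2.35%
# Fairly Easy: 10.59%
# Very Easy: 10.59%
```

The threshold from the YAML file would then be applied to each of these per-class differences. Classes that occur in only one of the two datasets, such as "Standard" here, are left out of the comparison in this sketch.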

Example sentences

Very Easy (90-100): "The sun rises in the east."

Easy (80-89): "Cats are known for their independent nature, often wandering through the neighborhood."

Fairly Easy (70-79): "The process of photosynthesis is how plants convert sunlight into energy for their growth."

Standard (60-69): "Global warming is a topic of concern as it has led to significant changes in our planet's climate patterns."

Fairly Difficult (50-59): "The intricate relationship between economic policies and geopolitical stability becomes evident during times of trade negotiations."

Difficult (30-49): "Quantum physics, with its non-intuitive principles and probabilistic nature, challenges even the brightest minds in the scientific community."

Very Confusing (0-29): "The convoluted interplay of esoteric hermeneutics juxtaposed with ontological paradigms presents an epistemological enigma."
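
For reference, the Flesch Reading Ease score is defined as 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below applies that formula and maps the result to the classes listed above. The syllable counter is a rough vowel-group heuristic chosen for this example, so its scores may differ from the ones the suite computes; all function names are hypothetical.

```python
import re

# Lower bound of each class, taken from the score ranges listed above.
BANDS = [
    (90, "Very Easy"),
    (80, "Easy"),
    (70, "Fairly Easy"),
    (60, "Standard"),
    (50, "Fairly Difficult"),
    (30, "Difficult"),
]


def count_syllables(word):
    """Approximate syllables by counting vowel groups (heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent, except in endings such as "-le".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Flesch Reading Ease score of a piece of English text."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))


def complexity_class(score):
    """Map a score to a complexity class using the lower bounds in BANDS."""
    for lower_bound, name in BANDS:
        if score >= lower_bound:
            return name
    return "Very Confusing"


score = flesch_reading_ease("The sun rises in the east.")
print(complexity_class(score))  # Very Easy
```

Note that very short, simple sentences can score above 100 and very dense ones below 0. This sketch places such scores in the first and last class respectively.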

Sample YAML File

rules.yaml

detect_duplicates:
  - params: {preprocess_args: {ignore_case: False, remove_punctuation: True, normalize_unicode: True, remove_stopwords: True},threshold: {lte: 20}}
detect_unknown_tokens:
  - params: {preprocess_args : {remove_punctuation: True, remove_stopwords: True, do_lemmatization: False}, threshold: {lte: 5}}
text_complexity_distribution:
  - params: {preprocess_args : {}, threshold: {gte: 8}}

How to use

Pass the training and test sentences along with your rules file. The validator can also be invoked without a rules file, in which case a default one is used. See the example below.

from censius.validation.training import TrainTestValidator
validator = TrainTestValidator("yaml_file_path")  # if no path is passed, a default YAML file is used
training_sentences = [
    "It was a great day",
    "Playing games has always been thought to be important",
    "of adults has never been researched that deeply. I believe ",
    "that playing games is every bit as important for adults ",
    "as for children. Not only is taking time out to play games ",
]

test_sentences = [
    "Deep learning is a subfield of machine learning.",
    "NLP is a part of AI.",
    "Natural language processing is a subfield of AI.",
    "The weather is awesome",
    "Space X is performing good",
    "Playing games has always been thought to be important to ",
    "the development of well-balanced and creative children; ",
    "however, what part, if any, they should play in the lives ",
    "of adults has never been researched that deeply. I believe ",
    "that playing games is every bit as important for adults ",
    "as for children. Not only is taking time out to play games ",
    "with our children and other adults valuable to building ",
    "interpersonal relationships but is also a wonderful way ",
    "to release built up tension.",
    "please come here ASAP",
    "This generator is not working",
    "Enjoy the warm hot water bath",
]
validator.validate(training_sentences, test_sentences)

Output

Text Complexity differences greater than {'gte': 8} %:
class          Training %     Test %    Diff %    
-------------------------------------------------------
Very Easy           40.00%        29.41%        10.59%
Fairly Easy         40.00%        29.41%        10.59%
*******************************************************
Text Complexity other classes
class          Training %     Test %       Diff %       
-------------------------------------------------------
Easy                20.00%        17.65%         2.35%

Information on Violations
{
    "total_unknown_tokens": [
        [
            "processing",
            "ASAP",
            "relationships",
            "wellbalanced",
            "NLP",
            "lives",
            "performing",
            "subfield"
        ]
    ]
}

Total Violations found
{
    "detect_duplicates": [
        "lte : 20, detect_duplicates rule violated got: 23.53"
    ],
    "detect_unknown_tokens": [
        "lte : 5, detect_unknown_tokens rule violated got: 13.79"
    ]
}


Last updated 2 years ago
