💡 Training & Testing

Train and test validations check the training and testing data and report what could be improved in the data to help model training.

YAML Specifications

The train-test suite specification currently supports the following sections in YAML:

detect_duplicates

A small overlap between your test data and training data is acceptable and can help with both model building and testing, but too much overlap is not recommended. This feature of the train-test suite detects the overlap and raises an error if there is too much of it.

  • To use detect_duplicates, define a "detect_duplicates" section in the YAML file.

  • Inside that section, pass the parameters under "params".

  • params has two attributes: preprocess_args and threshold.

  • preprocess_args - controls how sentences are normalized before they are compared. It contains attributes such as:

    • ignore_case - ignore the casing of sentences when comparing them.

    • remove_punctuation - strip punctuation, which is usually irrelevant for the comparison.

    • normalize_unicode - the data can contain Unicode characters, so it is better to normalize them.

    • remove_stopwords - remove stop words so that the overlap calculation is more accurate.

    • ignore_whitespace - ignore whitespace differences, which also helps improve accuracy.

      {ignore_case: False,
       remove_punctuation: True, 
       normalize_unicode: True, 
       remove_stopwords: True}
  • threshold - defines the allowed range for the overlap percentage. It works the same way as range-validate in the validation suite and has four attributes: 'gt' (greater than), 'lt' (less than), 'gte' (greater than or equal to), and 'lte' (less than or equal to).
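
The steps above can be sketched in plain Python. This is only an illustration of the idea, not the suite's actual implementation: the function names (preprocess, overlap_percentage, within_threshold), the tiny stop-word list, and the choice to measure overlap as the share of test sentences that also appear in the training data are all assumptions. Whitespace is always collapsed here, so ignore_whitespace is not modeled separately.

```python
import string
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "to"}


def preprocess(sentence, ignore_case=False, remove_punctuation=True,
               normalize_unicode=True, remove_stopwords=True):
    """Normalize a sentence according to the preprocess_args flags."""
    if normalize_unicode:
        sentence = unicodedata.normalize("NFKC", sentence)
    if ignore_case:
        sentence = sentence.lower()
    if remove_punctuation:
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    tokens = sentence.split()  # also collapses runs of whitespace
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(tokens)


def overlap_percentage(train, test, **preprocess_args):
    """Percentage of test sentences whose normalized form occurs in train."""
    train_keys = {preprocess(s, **preprocess_args) for s in train}
    hits = sum(1 for s in test if preprocess(s, **preprocess_args) in train_keys)
    return 100.0 * hits / len(test) if test else 0.0


def within_threshold(value, threshold):
    """Check a value against any combination of gt / lt / gte / lte bounds."""
    checks = {
        "gt": lambda v, bound: v > bound,
        "lt": lambda v, bound: v < bound,
        "gte": lambda v, bound: v >= bound,
        "lte": lambda v, bound: v <= bound,
    }
    return all(checks[key](value, bound) for key, bound in threshold.items())


train = ["The cat sat on the mat.", "Dogs bark loudly."]
test = ["the cat sat on the mat", "Birds sing."]
overlap = overlap_percentage(train, test, ignore_case=True)
print(overlap, within_threshold(overlap, {"lte": 20}))
```

In this sketch one of the two test sentences matches a training sentence after normalization, so a threshold of {"lte": 20} would be reported as violated.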

detect_unknown_tokens

Unknown tokens in the data can degrade model accuracy, so it is important to know what percentage of the tokens in the data are unknown. This feature counts the out-of-vocabulary (OOV) tokens in the test dataset.

  • The training dataset is also taken into account, which improves accuracy and builds a vocabulary that better reflects the domain.

  • Define the detect_unknown_tokens section in the YAML file.

  • Add params with preprocess_args and threshold values.

  • preprocess_args

    • remove_punctuation - punctuation is not important for detecting unknown tokens.

    • remove_stopwords - stop words can interfere with the detection of unknown tokens.

    • do_lemmatization - applies WordNetLemmatizer to the sentences to increase the accuracy of unknown-token detection.

  • threshold - same as defined in detect_duplicates.
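
As a rough illustration of the idea, the sketch below computes the percentage of test tokens that do not occur in a vocabulary. It makes several assumptions: the vocabulary is built from the training sentences only, the stop-word list is a tiny placeholder, tokens are lowercased, and lemmatization (do_lemmatization) is left out to keep the example dependency-free. The names tokenize and unknown_token_percentage are hypothetical and are not part of the suite.

```python
import string

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "to"}


def tokenize(sentence, remove_punctuation=True, remove_stopwords=True):
    """Split a sentence into lowercase tokens, applying the optional filters."""
    if remove_punctuation:
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    tokens = sentence.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def unknown_token_percentage(train, test, **preprocess_args):
    """Percentage of test tokens that are missing from the training vocabulary."""
    vocabulary = {t for s in train for t in tokenize(s, **preprocess_args)}
    test_tokens = [t for s in test for t in tokenize(s, **preprocess_args)]
    if not test_tokens:
        return 0.0
    unknown = [t for t in test_tokens if t not in vocabulary]
    return 100.0 * len(unknown) / len(test_tokens)


train = ["The cat sat on the mat.", "Dogs bark loudly."]
test = ["Dogs chase the cat loudly."]
print(unknown_token_percentage(train, test))
```

Here only "chase" is missing from the training vocabulary, so one of the four remaining test tokens counts as unknown. The resulting percentage would then be checked against the threshold.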

text_complexity_distribution

  • This feature identifies the complexity class that each sentence falls into.

  • It is very important to have a proper distribution of LLM data in both the training and test data.

  • Text complexity can currently classify sentences into the following classes:

    • Very Easy

    • Easy

    • Fairly Easy

    • Standard

    • Fairly Difficult

    • Difficult

    • Very Confusing

  • The classes are based on the Flesch Reading Ease score.

  • To use it, define the text_complexity_distribution section in the YAML file.

  • Add params with preprocess_args and threshold values.

  • preprocess_args - no arguments are currently supported, so pass an empty dictionary.

  • threshold - same as defined in detect_duplicates. The complexity classes found in the training and test data are compared class by class, and the threshold is applied to the difference for each class. If the difference violates the threshold, the user is informed.
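
The comparison described above can be sketched as follows. This is an assumption-laden illustration rather than the suite's implementation: it takes sentences that are already labeled with a complexity class, compares the share of each class in the two datasets, and uses a single hypothetical max_difference bound (in percentage points) in place of the full gt/lt/gte/lte threshold.

```python
from collections import Counter

CLASSES = [
    "Very Easy", "Easy", "Fairly Easy", "Standard",
    "Fairly Difficult", "Difficult", "Very Confusing",
]


def class_percentages(labels):
    """Share of each complexity class, as a percentage of all labels."""
    counts = Counter(labels)
    total = len(labels)
    return {c: (100.0 * counts[c] / total if total else 0.0) for c in CLASSES}


def distribution_violations(train_labels, test_labels, max_difference):
    """Classes whose train/test share differs by more than max_difference."""
    train_pct = class_percentages(train_labels)
    test_pct = class_percentages(test_labels)
    differences = {c: abs(train_pct[c] - test_pct[c]) for c in CLASSES}
    # max_difference is a hypothetical bound used only in this sketch.
    return {c: d for c, d in differences.items() if d > max_difference}


train_labels = ["Easy", "Easy", "Standard", "Standard"]
test_labels = ["Easy", "Standard", "Standard", "Standard"]
print(distribution_violations(train_labels, test_labels, max_difference=20))
```

In this example "Easy" makes up half of the training labels but only a quarter of the test labels, so both "Easy" and "Standard" would be reported with a bound of 20 percentage points.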

Example sentences

Very Easy (90-100): "The sun rises in the east."

Easy (80-89): "Cats are known for their independent nature, often wandering through the neighborhood."

Fairly Easy (70-79): "The process of photosynthesis is how plants convert sunlight into energy for their growth."

Standard (60-69): "Global warming is a topic of concern as it has led to significant changes in our planet's climate patterns."

Fairly Difficult (50-59): "The intricate relationship between economic policies and geopolitical stability becomes evident during times of trade negotiations."

Difficult (30-49): "Quantum physics, with its non-intuitive principles and probabilistic nature, challenges even the brightest minds in the scientific community."

Very Confusing (0-29): "The convoluted interplay of esoteric hermeneutics juxtaposed with ontological paradigms presents an epistemological enigma."
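
For reference, the Flesch Reading Ease score is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below applies that formula and maps the score to the bands listed above. The syllable counter is a crude vowel-group heuristic chosen for illustration, so the scores it produces may differ from those of a dedicated readability library and from the bands quoted for the example sentences. The function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Flesch Reading Ease score, using the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


# Lower bound of each band, taken from the ranges in the examples above.
BANDS = [
    (90, "Very Easy"), (80, "Easy"), (70, "Fairly Easy"),
    (60, "Standard"), (50, "Fairly Difficult"), (30, "Difficult"),
]


def complexity_class(score):
    """Map a score to its class; anything below 30 is 'Very Confusing'."""
    for lower_bound, name in BANDS:
        if score >= lower_bound:
            return name
    return "Very Confusing"


print(complexity_class(flesch_reading_ease("The sun rises in the east.")))
```

Scores can fall outside the 0-100 range for very short or very dense text. This sketch treats anything at or above 90 as "Very Easy" and anything below 30 as "Very Confusing".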

Sample YAML File

How to use

  • Pass your training and test sentences together with your rules file. The suite can also be invoked without a rules file.
