👁️‍🗨️ Validation Suite
The Validation Suite contains two sections: Prompt Validations and Training & Testing Validations.
Prerequisites
Basic understanding of YAML (required for writing rules).
Basic understanding of functions and JSON.
Basic understanding of Python.
Prompt Validations
Prompt Validations are a good way to validate your I/O data. Let's understand this with an example. Suppose you want to apply rules to the data you receive or pass to your model for prediction. The rules could be very simple, such as validating the length, range, or choice of a value, or a little more complex, such as checking text complexity or profanity in the data.
Prompt Validations let you do this by passing a rules.yaml file.
The rules file contains all the rules you want to apply to the data.
format-rules:
  - key-validate: [username, password, disease, quote]
validation-rules:
  - username:
    - datatype: string
    - len-validate: {gt: 0, lt: 20}
  - password:
    - datatype: string
    - len-validate: {gt: 0, lt: 30}
  - disease:
    - datatype: string
    - choice-validate: [headache, vomiting]
  - quote:
    - datatype: string
    - text-complexity-validate: {lt: 0.5}
    - profanity-validate: {lt: 20}
    - sentiment-polarity-validate: {lt: 0.3}

Let's see it in action.
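Below is a minimal sketch of how the rules above might be applied from Python. The validation_suite module, the PromptValidator class, and its validate method are assumptions for illustration only; the actual package and API names may differ.

from validation_suite import PromptValidator  # hypothetical import; actual module may differ

# Load the rules.yaml shown above (argument name is an assumption)
validator = PromptValidator(rules_path="rules.yaml")

# Sample input that deliberately breaks two of the rules
data = {
    "username": "john_doe",
    "password": "s3cret-pass",
    "disease": "fever",                              # not in [headache, vomiting]
    "quote": "An overly complex and profane quote ...",  # too complex / too profane
}

# Reports which keys passed and which failed (return shape is hypothetical)
result = validator.validate(data)
print(result)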
Validations failed for disease, because we passed an unexpected value, and for quote, because the text complexity was higher than expected and the profanity score was 25% against a rule of 20%.
Let's deep dive into Prompt Validations.
⚠️ Prompt
Training & Testing Validations
Training & Testing Validations help you analyse your data and improve it. They currently contain three sections:
detect_duplicates - detects how much of your data is overlapping.
detect_unknown_tokens - reports what percentage of the data is out of vocabulary. The vocabulary is built from a general-purpose corpus combined with the training data.
text_complexity_distribution - finds the available classes in the data and how they are distributed between the train and test sets. A proper distribution among classes is important for better training.
Without going into further detail, let's see how these work.
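Here is a minimal sketch of running the three checks, assuming a hypothetical TrainTestValidator class and pandas DataFrames for the train and test splits; the class, method, and file names below are illustrative, not the library's confirmed API.

import pandas as pd
from validation_suite import TrainTestValidator  # hypothetical import

# Illustrative file names; load your own train/test splits
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

validator = TrainTestValidator(train_data=train_df, test_data=test_df)

# Each call mirrors one of the sections described above (hypothetical methods)
print(validator.detect_duplicates())             # overlap found in the data
print(validator.detect_unknown_tokens())         # % of out-of-vocabulary tokens
print(validator.text_complexity_distribution())  # class distribution across train/test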
The above run shows how much overlap was found in the data, how many unknown tokens were detected, and the distribution of text complexity across the train and test classes.
Update YAML
If you want to pass your own YAML file, that is also possible with the code below.
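A minimal sketch of passing a custom YAML file, assuming the validator accepts a path argument; the yaml_path keyword and the run method are assumptions for illustration.

import pandas as pd
from validation_suite import TrainTestValidator  # hypothetical import

validator = TrainTestValidator(
    train_data=pd.read_csv("train.csv"),
    test_data=pd.read_csv("test.csv"),
    yaml_path="my_config.yaml",  # assumed keyword for your own YAML file
)
validator.run()  # hypothetical method that executes all configured checks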
Manipulate dataset
We can also update the loaded dataset and re-run the validations accordingly, as in the sketch below.
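A minimal sketch of modifying the loaded dataset before re-running the checks, assuming it is exposed as a pandas DataFrame attribute; the attribute and method names are assumptions.

import pandas as pd
from validation_suite import TrainTestValidator  # hypothetical import

validator = TrainTestValidator(
    train_data=pd.read_csv("train.csv"),
    test_data=pd.read_csv("test.csv"),
)

# Drop rows with missing text, then re-run the checks (assumed attribute/method)
validator.train_data = validator.train_data.dropna(subset=["text"])
validator.run()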
Let's deep dive into Training & Testing.
💡 Training & Testing