👁️‍🗨️ Validation Suite
The Validation Suite contains two sections: Prompt Validations and Training & Testing Validations.
Prerequisites
Basic understanding of YAML (required for writing rules).
Basic understanding of functions and JSON.
Basic understanding of Python.
Prompt Validations
Prompt Validations are a good way to validate your I/O data. Let's understand this with an example. Suppose you want to apply rules to the data you receive or pass to your model for prediction. The rules could be very simple, such as validating the length, range, or choice of a value, or a little more complex, such as checking text complexity or profanity in the data.
Prompt Validations let you do this by passing a rules.yaml file.
The rules file contains all the rules you want to apply to the data.
format-rules:
  - key-validate: [username, password, disease, quote]
validation-rules:
  - username:
    - datatype: string
    - len-validate: {gt: 0, lt: 20}
  - password:
    - datatype: string
    - len-validate: {gt: 0, lt: 30}
  - disease:
    - datatype: string
    - choice-validate: [headache, vomiting]
  - quote:
    - datatype: string
    - text-complexity-validate: {lt: 0.5}
    - profanity-validate: {lt: 20}
    - sentiment-polarity-validate: {lt: 0.3}

Let's see it in action.
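Below is a minimal sketch of how the rules above might be applied from Python. The validation_suite module, the PromptValidator class, and its validate method are assumptions for illustration only; the actual package and API names may differ.

from validation_suite import PromptValidator  # hypothetical import; actual module may differ

# Load the rules.yaml shown above (argument name is an assumption)
validator = PromptValidator(rules_path="rules.yaml")

# Sample input that deliberately breaks two of the rules
data = {
    "username": "john_doe",
    "password": "s3cret-pass",
    "disease": "fever",                              # not in [headache, vomiting]
    "quote": "An overly complex and profane quote ...",  # too complex / too profane
}

# Reports which keys passed and which failed (return shape is hypothetical)
result = validator.validate(data)
print(result)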
Validations failed for disease, because we passed an unexpected value, and for quote, because the text complexity was higher than expected and the profanity score was 25% against a rule of 20%.
Let's deep dive into Prompt Validations.
⚠️ Prompt
Training & Testing Validations
Training & Testing Validations help you analyse your data and improve it. They currently contain three sections:
detect_duplicates - detects how much of your data is overlapping.
detect_unknown_tokens - reports what percentage of the data is out of vocabulary. The vocabulary is built from a general-purpose corpus combined with the training data.
text_complexity_distribution - finds the available classes in the data and how they are distributed between the train and test sets. A proper distribution among classes is important for better training.
Without going into further detail, let's see how these work.
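Here is a minimal sketch of running the three checks, assuming a hypothetical TrainTestValidator class and pandas DataFrames for the train and test splits; the class, method, and file names below are illustrative, not the library's confirmed API.

import pandas as pd
from validation_suite import TrainTestValidator  # hypothetical import

# Illustrative file names; load your own train/test splits
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

validator = TrainTestValidator(train_data=train_df, test_data=test_df)

# Each call mirrors one of the sections described above (hypothetical methods)
print(validator.detect_duplicates())             # overlap found in the data
print(validator.detect_unknown_tokens())         # % of out-of-vocabulary tokens
print(validator.text_complexity_distribution())  # class distribution across train/test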
The above run shows how much overlap was found in the data, how many unknown tokens were detected, and the distribution of text complexity across the train and test classes.
Update YAML
If you want to pass your own YAML file, that is also possible with the code below.
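A minimal sketch of passing a custom YAML file, assuming the validator accepts a path argument; the yaml_path keyword and the run method are assumptions for illustration.

import pandas as pd
from validation_suite import TrainTestValidator  # hypothetical import

validator = TrainTestValidator(
    train_data=pd.read_csv("train.csv"),
    test_data=pd.read_csv("test.csv"),
    yaml_path="my_config.yaml",  # assumed keyword for your own YAML file
)
validator.run()  # hypothetical method that executes all configured checks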
Manipulate dataset
We can also update the loaded dataset and re-run the validations accordingly, as in the sketch below.
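A minimal sketch of modifying the loaded dataset before re-running the checks, assuming it is exposed as a pandas DataFrame attribute; the attribute and method names are assumptions.

import pandas as pd
from validation_suite import TrainTestValidator  # hypothetical import

validator = TrainTestValidator(
    train_data=pd.read_csv("train.csv"),
    test_data=pd.read_csv("test.csv"),
)

# Drop rows with missing text, then re-run the checks (assumed attribute/method)
validator.train_data = validator.train_data.dropna(subset=["text"])
validator.run()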
Let's deep dive into Training & Testing.
💡 Training & Testing