👁️‍🗨️ Validation Suite

The Validation Suite contains two sections: Prompt Validations and Training & Testing Validations.

Prerequisites

  • A basic understanding of YAML, required for writing rules.

  • A basic understanding of functions and JSON.

  • A basic understanding of Python.

Prompt Validations

Prompt Validations are a good way to validate your I/O data. Let's understand this with an example: suppose you want to apply rules to the data you receive from, or pass to, your model for prediction. The rules can be very simple, such as validating the length, range, or choice of a value, or a little more complex, such as measuring text complexity or checking the data for profanity.

  • Prompt Validations allow you to do this by passing a rules.yaml file.

  • The rules file contains all the rules you want to apply to the data.

rules.yaml

```yaml
format-rules:
  - key-validate: [username, password, disease, quote]
validation-rules:
  - username:
    - datatype: string
    - len-validate: {gt: 0, lt: 20}
  - password:
    - datatype: string
    - len-validate: {gt: 0, lt: 30}
  - disease:
    - datatype: string
    - choice-validate: [headache, vomiting]
  - quote:
    - datatype: string
    - text-complexity-validate: {lt: 0.5}
    - profanity-validate: {lt: 20}
    - sentiment-polarity-validate: {lt: 0.3}
```

Let's see it in action

Validation failed for disease because we passed an unexpected value, and for quote because the text complexity was higher than expected; the profanity score was also 25%, while the rule allowed at most 20%.
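To make the mechanics concrete, here is a minimal plain-Python sketch of how the simpler rules could be enforced. The validate() helper below is hypothetical, not the suite's actual API, and it skips the text-complexity, profanity, and sentiment rules, which require additional models:

```python
# Hypothetical illustration only -- not the suite's implementation.
import yaml  # PyYAML

def validate(data: dict, rules_path: str = "rules.yaml") -> list:
    with open(rules_path) as f:
        cfg = yaml.safe_load(f)
    errors = []

    # format-rules: every expected key must be present.
    for key in cfg["format-rules"][0]["key-validate"]:
        if key not in data:
            errors.append(f"{key}: missing key")

    # validation-rules: run each check configured for a key.
    for entry in cfg["validation-rules"]:
        for key, checks in entry.items():
            value = data.get(key)
            if value is None:
                continue
            for check in checks:
                if check.get("datatype") == "string" and not isinstance(value, str):
                    errors.append(f"{key}: expected a string")
                if "len-validate" in check:
                    lo = check["len-validate"].get("gt", -1)
                    hi = check["len-validate"].get("lt", float("inf"))
                    if not lo < len(value) < hi:
                        errors.append(f"{key}: length {len(value)} out of range")
                if "choice-validate" in check and value not in check["choice-validate"]:
                    errors.append(f"{key}: unexpected value {value!r}")
    return errors

sample = {"username": "jane", "password": "s3cret!", "disease": "fever",
          "quote": "An apple a day keeps the doctor away."}
print(validate(sample))  # ["disease: unexpected value 'fever'"]
```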

Let's deep dive into Prompt Validations

⚠️ Prompt

Training & Testing Validations

Training & Testing Validations help you analyse your data and make it better. They currently contain three sections:

  • detect_duplicates - detects how much overlap there is in your data.

  • detect_unknown_tokens - reports what percentage of the data is out of vocabulary. The vocabulary is built from a publicly available corpus, combined with the training data, for better coverage.

  • text_complexity_distribution - finds which classes are present in the data and how they are distributed across the train and test sets. A proper distribution across classes is very important for good training.

Without going into further detail, let's see how these work; a plain-Python sketch of the three checks follows.
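As an illustration only (these helpers are hypothetical, not the suite's implementation), the three statistics could be computed roughly like this:

```python
# Hypothetical helpers -- an illustration of the three statistics,
# not the suite's actual implementation.
from collections import Counter

def duplicate_overlap(train, test):
    """Share of test rows that also appear verbatim in the train set."""
    train_set = set(train)
    return sum(row in train_set for row in test) / len(test)

def unknown_token_pct(texts, vocabulary):
    """Percentage of tokens that fall outside the vocabulary."""
    tokens = [tok.lower() for text in texts for tok in text.split()]
    unknown = sum(tok not in vocabulary for tok in tokens)
    return 100 * unknown / len(tokens)

def class_distribution(labels):
    """Fraction of rows per class, for comparing train and test."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

train = ["the service was great", "terrible food", "the service was great"]
test = ["the service was great", "amazing ambience"]
vocab = {"the", "service", "was", "great", "terrible", "food"}

print(duplicate_overlap(train, test))                # 0.5
print(unknown_token_pct(test, vocab))                # 33.33 ('amazing', 'ambience')
print(class_distribution(["easy", "hard", "easy"]))  # {'easy': 0.67, 'hard': 0.33}
```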

The run above shows how much overlap was found in the data and how many unknown tokens were detected, along with the distribution of text complexity across the train and test classes.

Update YAML

  • If you want to pass your own YAML file, that is also possible, as sketched below.
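A minimal sketch, assuming the configuration is plain YAML loaded with PyYAML; the path and loader below are illustrative, not the suite's real entry point:

```python
# Illustrative only -- load a user-supplied YAML file in place of the
# bundled default; the suite's real entry point may differ.
import os
import yaml  # PyYAML

def load_rules(path="rules.yaml"):
    if not os.path.exists(path):
        raise FileNotFoundError(f"no rules file at {path}")
    with open(path) as f:
        return yaml.safe_load(f)

rules = load_rules("my_rules.yaml")  # your own YAML file
```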

Manipulate dataset

We can also update the loaded dataset and re-run the validations against it, as sketched below.
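For example, assuming the dataset is loaded as a pandas DataFrame (the file and column names below are illustrative):

```python
# Illustrative only -- clean up the loaded dataset before re-running
# the validations; file and column names are assumptions.
import pandas as pd

train = pd.read_csv("train.csv")

train = train.drop_duplicates(subset="text")   # remove exact repeats
train = train[train["text"].str.len() > 0]     # drop empty rows

train.to_csv("train_clean.csv", index=False)   # re-run the suite on this file
```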

Let's deep dive into Training & Testing

💡 Training & Testing
