Data Collection
  • 30 Jul 2024

Overview

Ushur Language Intelligence (LI) requires training data for its AI and ML models, particularly for the Ushur SmartMail solution. Data collection prepares email classification data from enterprise and production emails. Your team and the Ushur team carry out this work together, and it should begin at the start of the project.

Steps Involved

  1. Data Collation and Anonymization

  2. Data Preprocessing

  3. Data Analysis

  4. Data Labeling


Data Collation and Anonymization

How It Works

- Data Collation: Collect emails for the specified topics and indexed categories. Include emails from a range of dates and times so that the data captures natural variation.
- Anonymization: Mask or remove Personally Identifiable Information (PII) to maintain data privacy. Use the Ushur Anonymizer tool to remove all PII before sending the data to Ushur.
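To illustrate what masking PII means in practice, here is a minimal Python sketch. It is not the Ushur Anonymizer and does not reflect how that tool works internally. The function name `mask_pii`, the two regular expressions, and the `[EMAIL]` and `[PHONE]` placeholders are assumptions chosen for illustration. Real PII covers far more than these two patterns (names, addresses, account numbers, and so on).

```python
import re

# Illustrative patterns only (assumptions, not the Ushur Anonymizer's rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# US-style phone numbers such as 555-123-4567 or (555) 123-4567.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def mask_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(mask_pii("Contact jane.doe@example.com or 555-123-4567."))
```

Masking with a typed placeholder, rather than deleting the value, keeps the sentence structure intact, which may help when the text is later used for classification.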

Importance

Ensures that Ushur Language Intelligence can learn from the data without exposing sensitive information.

Tools and Techniques

Ushur Anonymizer (a Python-based tool).

Responsibilities

  • Your Role: Download, anonymize, and send the data.

  • Ushur Team: Provide guidance on data hygiene and best practices for achieving business accuracy goals.


Data Preprocessing

How It Works

The Ushur team assesses data quality and preprocesses the data to remove noise such as HTML tags, stray symbols, and other unwanted content. The team also identifies gaps in the email content and removes duplicates.
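The kind of cleanup described above can be sketched in a few lines of Python. This is an illustrative example only, not the Ushur team's actual pipeline. The function names `clean_email_body` and `dedupe` are hypothetical, and treating case-insensitive matches as duplicates is an assumption.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag matcher, adequate for a sketch
WS_RE = re.compile(r"\s+")


def clean_email_body(raw: str) -> str:
    """Strip HTML tags, decode HTML entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(no_tags)
    return WS_RE.sub(" ", decoded).strip()


def dedupe(bodies: list[str]) -> list[str]:
    """Clean each body and drop empty or repeated ones, keeping first occurrences."""
    seen: set[str] = set()
    unique: list[str] = []
    for body in bodies:
        cleaned = clean_email_body(body)
        key = cleaned.lower()  # assumption: case differences do not matter
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


print(dedupe(["<b>Hi</b> there", "hi  there", "<p>Claims &amp; billing</p>"]))
```

A regex-based tag stripper is a simplification. Heavily formatted emails may need a proper HTML parser.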

Importance

Reduces noise, enhancing the accuracy of email classification.

Responsibilities

  • Your Role: Send the data to Ushur via SFTP or another secure method.

  • Ushur Team: Perform data preprocessing.


Data Analysis

How It Works

The Ushur team analyzes the data file to ensure it is correctly formatted and contains the necessary information for training. The data file should be in .csv format with two columns: topic and phrase.
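Before sending the file, you may want to check its shape yourself. The sketch below uses Python's standard `csv` module. It assumes the file has a header row with exactly the column names `topic` and `phrase`, in that order. The article states the two columns but not the header details, so confirm them with the Ushur team. The function name `validate_training_csv` is hypothetical.

```python
import csv
import io

# Assumption: a header row with these exact names, in this order.
EXPECTED_COLUMNS = ["topic", "phrase"]


def validate_training_csv(handle) -> list[str]:
    """Return a list of problems found; an empty list means the file passed."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"expected header {EXPECTED_COLUMNS}, found {reader.fieldnames}"]
    problems = []
    for row in reader:
        # DictReader stores surplus fields under the key None.
        if None in row:
            problems.append(f"line {reader.line_num}: more than two columns")
            continue
        for column in EXPECTED_COLUMNS:
            # Missing trailing fields are filled with None, hence the `or ""`.
            if not (row[column] or "").strip():
                problems.append(f"line {reader.line_num}: empty {column}")
    return problems


sample = io.StringIO("topic,phrase\n,Where is my claim?\nclaims,status,extra\n")
print(validate_training_csv(sample))
```

To check a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to the function.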

Importance

Proper formatting is essential for effective training of the Ushur AI.

Responsibilities

  • Your Role: Ensure the .csv file is correctly formatted.

  • Ushur Team: Analyze the data file.


Data Labeling

How It Works

Label data with the appropriate topic (work type/classification topic).

Importance

Accurate labeling improves the precision of Ushur's AI in classifying topics.

Responsibilities

  • Your Role and Ushur Team: Ensure data categories are well separated and do not overlap, which improves AI training accuracy.
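One simple overlap check is to look for the same phrase labeled with more than one topic. The Python sketch below shows that check. It is illustrative only: the function name `find_conflicting_labels` is hypothetical, and it catches only identical text (after normalizing case and whitespace). Categories that overlap in meaning still need human review.

```python
from collections import defaultdict


def find_conflicting_labels(rows):
    """Return phrases that appear under more than one topic.

    `rows` is an iterable of (topic, phrase) pairs. Phrases are compared
    after lowercasing and collapsing whitespace.
    """
    topics_by_phrase = defaultdict(set)
    for topic, phrase in rows:
        key = " ".join(phrase.lower().split())
        topics_by_phrase[key].add(topic.strip())
    return {
        phrase: sorted(topics)
        for phrase, topics in topics_by_phrase.items()
        if len(topics) > 1
    }


rows = [
    ("billing", "Resend my invoice"),
    ("claims", "resend my  invoice"),
    ("claims", "Where is my claim?"),
]
print(find_conflicting_labels(rows))
```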

Note

Ushur engineers will provide recommendations to improve data quality and ensure business accuracy goals are met.
The collaboration between your team and Ushur is crucial for the successful implementation of the Ushur SmartMail solution.

