Information security is very important for any organization. Lost money is a minor problem, the serious one is that the enterprise system. However, fraud email and phishing email occupy a small set of data when comparing to normal email. Augmenting fraud and phishing email is a way to tackle this problem.

Image for post

Example of CEO fraud email (Regina et al., 2020)

Therefore, Regina et al. proposed three different approaches to generate synthetic data for model training. As synthetic data is a kind of “fake” data, some low-quality data may hurt model performance. Validations are needed to keep a high-quality synthetic data. Also, there are some assumptions which are:

  • Synthetic data should share the same label as the original text. For example, synthetic data should be change label from positive to negative (for binary classifier).
  • Synthetic data should not be redundant. In other words, the augmented text should not be almost identical to the original text.

Image for post

Above: Original Text. Below: Augmented Text (Regina et al., 2020)

Image for post

#data-augmentation #nlp #data-science #artificial-intelligence #machine-learning

Text Augmentation for detecting spear-phishing emails
1.35 GEEK