Automated Language detection

Overview

As a general rule, it is recommended that you specify the language of each text item you submit to the API. This helps ensure optimal accuracy and processing speed.

That said, there are situations where it is not possible to know what language a given text item was written in. In such cases, you can let the API automatically detect the language and perform profanity detection in the relevant language.

To do so, you just need to provide a list of target languages to the API. We recommend that you provide a list that is as short as possible. Including too many languages can decrease accuracy and speed.
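As a rough illustration of keeping the list short, the sketch below builds the lang value from a list of candidate language codes. The helper name build_lang_param and the max_languages threshold are hypothetical and not part of the API. The API itself only expects a comma-separated string.

```python
import warnings


def build_lang_param(candidates, max_languages=4):
    """Build a comma-separated lang value from candidate language codes.

    Hypothetical helper: max_languages is an arbitrary threshold chosen for
    illustration, not a limit documented by the API.
    """
    codes = []
    for code in candidates:
        code = code.strip().lower()
        # Drop empty entries and duplicates while preserving order
        if code and code not in codes:
            codes.append(code)
    if not codes:
        raise ValueError("at least one language code is required")
    if len(codes) > max_languages:
        # Long lists are allowed but may reduce accuracy and speed
        warnings.warn("long language lists can decrease accuracy and speed")
    return ",".join(codes)


lang = build_lang_param(["en", "fr", "EN", " it "])
```

The resulting string can be passed as the lang parameter of a request.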

Failure modes

If a string contains only non-script tokens, such as whitespace, punctuation or emojis, the API defaults to English.

If the actual language is very different from the languages in your list (for instance, a Chinese sentence when you specified only English and French), the API will fall back to your default language.
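Because strings without any script characters are treated as English, it can help to spot them on the client side before relying on the detected language. The following is a minimal sketch and not part of the API: has_script_characters is a hypothetical helper, and it assumes that "script tokens" roughly correspond to Unicode letters, which may not match the API's own rules exactly.

```python
import unicodedata


def has_script_characters(text: str) -> bool:
    """Return True if text contains at least one Unicode letter.

    Assumption: whitespace, punctuation, digits and emojis are treated as
    non-script characters, since none of them fall in a letter category.
    """
    # Unicode general categories starting with "L" are letters (Lu, Ll, Lo, ...)
    return any(unicodedata.category(ch).startswith("L") for ch in text)


for sample_text in ["va te faire foutre", "  ?!... \U0001F600", "\u4f60\u597d"]:
    print(repr(sample_text), has_script_characters(sample_text))
```

A string for which this returns False would fall under the first failure mode described above.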

Use Language detection for International Profanity Detection

Let's assume you want to moderate the following message, but don't know what language the user chose:

va te faire foutre

In such a case, set lang to a comma-separated list of target languages, for instance en,fr,it,sv. Here is a code example:


curl -X POST 'https://api.sightengine.com/1.0/text/check.json' \
  -F 'text=va te faire foutre' \
  -F 'lang=en,fr,it,sv' \
  -F 'mode=rules' \
  -F 'api_user={api_user}' \
  -F 'api_secret={api_secret}'


# this example uses requests
import requests
import json

data = {
  'text': 'va te faire foutre',
  'mode': 'rules',
  'lang': 'en,fr,it,sv',
  'api_user': '{api_user}',
  'api_secret': '{api_secret}'
}
r = requests.post('https://api.sightengine.com/1.0/text/check.json', data=data)

output = json.loads(r.text)


$params = array(
  'text' => 'va te faire foutre',
  'lang' => 'en,fr,it,sv',
  'mode' => 'rules',
  'api_user' => '{api_user}',
  'api_secret' => '{api_secret}',
);

// this example uses cURL
$ch = curl_init('https://api.sightengine.com/1.0/text/check.json');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $params);
$response = curl_exec($ch);
curl_close($ch);

$output = json_decode($response, true);


// this example uses axios and form-data
const axios = require('axios');
const FormData = require('form-data');

const data = new FormData();
data.append('text', 'va te faire foutre');
data.append('lang', 'en,fr,it,sv');
data.append('mode', 'rules');
data.append('api_user', '{api_user}');
data.append('api_secret', '{api_secret}');

axios({
  url: 'https://api.sightengine.com/1.0/text/check.json',
  method:'post',
  data: data,
  headers: data.getHeaders()
})
.then(function (response) {
  // on success: handle response
  console.log(response.data);
})
.catch(function (error) {
  // handle error
  if (error.response) console.log(error.response.data);
  else console.log(error.message);
});

Request parameter description

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | UTF-8 encoded text to moderate |
| mode | string | comma-separated list of modes. Modes are rules for the rule-based model or ml for ML models |
| categories | string | comma-separated list of categories to check. Possible values: profanity, personal, link, drug, weapon, violence, self-harm, medical, extremism, spam, content-trade, money-transaction (optional) |
| lang | string | comma-separated list of target languages |
| opt_countries | string | comma-separated list of target countries for phone number detection (optional) |
| list | string | id of a custom list to be used for rule-based moderation (optional) |
| api_user | string | your API user id |
| api_secret | string | your API secret |

The API correctly detects the language and applies the corresponding profanity detection model. The JSON response contains a description of detected profanity:


{
  "status": "success",
  "request": {
    "id": "req_6cujQglQPgGApjI5odv0P",
    "timestamp": 1471947033.92,
    "operations": 1
  },
  "profanity": {
    "matches": [
      {
        "type": "inappropriate",
        "intensity": "high",
        "match": "tefairefoutre",
        "start": 3,
        "end": 17
      }
    ]
  },
  "personal": {
    "matches": []
  },
  "link": {
    "matches": []
  }
}
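A minimal sketch of reading such a response in Python follows. The helper name profanity_spans is hypothetical, and only the field names come from the sample response. The code also assumes that start and end are inclusive character offsets into the submitted text, which is inferred from the sample values rather than stated by the documentation.

```python
def profanity_spans(output: dict, text: str) -> list:
    """Return (matched_substring, type, intensity) for each profanity match."""
    spans = []
    for m in output.get("profanity", {}).get("matches", []):
        # Assumption: "start" and "end" are inclusive offsets into the text
        matched = text[m["start"]:m["end"] + 1]
        spans.append((matched, m["type"], m["intensity"]))
    return spans


# Decoded response, reduced to the fields used here
sample = {
    "status": "success",
    "profanity": {
        "matches": [
            {
                "type": "inappropriate",
                "intensity": "high",
                "match": "tefairefoutre",
                "start": 3,
                "end": 17,
            }
        ]
    },
}

for matched, kind, intensity in profanity_spans(sample, "va te faire foutre"):
    print(f"{kind} ({intensity}): {matched}")
```

Using the offsets rather than the match field recovers the original substring, since the match value in the sample appears without spaces.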

Any other needs?

See our full list of Text models for details on other filters and checks you can run on your text content. You might also want to check our Image & Video models to moderate images and videos. This includes moderation of text in images/videos.
