Content Moderation Glossary

113 terms every trust & safety team should know

Clear definitions for the vocabulary of content moderation, from abuse detection to zero-shot classification.

3 Strikes Policy
A 3 strikes policy is a moderation rule that escalates consequences for repeated violations:…
Abuse Detection
Abuse detection is the process of identifying harmful or abusive user-generated content — such…
Active Learning
Active learning is a training strategy where the model itself selects the most informative…
Advance Fee Scam (419 Scam)
An advance fee scam tricks the victim into paying an upfront fee in exchange…
Age Verification
Age verification is the process of confirming a user's age before granting access to…
AI-Generated Content (AIGC)
AI-generated content (AIGC) is text, images, audio, or video produced by generative AI models…
AI Guardrails
AI guardrails are the rules, filters, and policies built around an AI system to…
AI Voice Cloning Scam
An AI voice cloning scam uses a few seconds of recorded speech — pulled…
AI Watermarking
AI watermarking is the practice of embedding imperceptible signals into AI-generated text, images, audio,…
Algorithmic Moderation
Algorithmic moderation is the use of rule-based or pattern-matching algorithms to automatically detect and…
Allowlist & Blocklist
An allowlist is a curated list of words, phrases, users, or domains that are…
Appeal Process
An appeal process is a mechanism that lets a user contest a moderation decision…
Artificial Intelligence (AI) Moderation
AI moderation is the use of machine learning, natural language processing, and computer vision…
Astroturfing
Astroturfing is a coordinated campaign disguised as a spontaneous grassroots movement, where paid or…
Automated Moderation
Automated moderation is the use of software tools — including rules engines, AI classifiers,…
Banning
Banning is the act of permanently revoking a user's access to a platform or…
Bot Detection
Bot detection is the identification of automated accounts and scripted traffic through a combination…
Brand Safety
Brand safety is the set of measures advertisers and platforms use to prevent a…
Business Email Compromise (BEC)
Business email compromise is a targeted fraud in which attackers impersonate an executive, employee,…
C2PA
C2PA is an open technical standard, published by the Coalition for Content Provenance and Authenticity, for attaching…
Catfishing
Catfishing is the practice of creating a fake online persona to deceive another person…
Chat Moderation
Chat moderation is the practice of monitoring and managing real-time conversations — in messaging…
Community Guidelines
Community guidelines are a set of rules and standards published by a platform that…
Confusion Matrix
A confusion matrix is a table (2x2 for a binary classifier) that breaks model predictions into true positives,…
Content Filtering
Content filtering is the process of screening incoming user-generated content against a set of…
Content Flagging
Content flagging is a feature that lets users report posts, comments, or media they…
Content Moderation
Content moderation is the practice of monitoring and managing user-generated content on a platform…
Content Review
Content review is the process of examining flagged or reported content to determine whether…
Contextual Analysis
Contextual analysis is the examination of a piece of content within its surrounding context…
Coordinated Inauthentic Behavior (CIB)
Coordinated inauthentic behavior is a term for networks of fake or compromised accounts that…
COPPA
The Children's Online Privacy Protection Act is a US federal law that restricts how…
Crypto Scam
A crypto scam is any fraud that exploits cryptocurrency rails to steal funds, including…
CSAM (Child Sexual Abuse Material)
CSAM stands for Child Sexual Abuse Material — any visual depiction of sexually explicit…
Cyberbullying
Cyberbullying is the use of digital communication tools — social media, messaging apps, comments,…
Dark Web
The dark web is a portion of the internet that is not indexed by…
Data Labeling
Data labeling is the process of annotating raw content with the correct categories so…
Deepfake
A deepfake is a piece of synthetic media — typically video, audio, or image…
Deepfake Scam
A deepfake scam uses AI-generated synthetic video or audio to impersonate a real person…
Digital Services Act (DSA)
The Digital Services Act (DSA) is a European Union regulation that sets binding rules…
Disinformation
Disinformation is false information that is deliberately created and spread to deceive, manipulate, or…
Doxxing
Doxxing is the act of publicly sharing someone's private personal information — such as…
Employment Scam
An employment scam is a fraud that uses fake job listings, recruiter outreach, or…
F1 Score
The F1 score is the harmonic mean of precision and recall, used to evaluate…
False Negative
A false negative in content moderation is an instance where harmful or policy-violating content…
False Positive
A false positive in content moderation is an instance where benign content is incorrectly…
Flagging
Flagging is the act of marking a piece of content for moderator review, typically…
Fraud Detection
Fraud detection is the process of identifying and preventing deceptive activity on a platform…
Government Impersonation Scam
A government impersonation scam is a fraud in which criminals pose as officials from…
Grooming Detection
Grooming detection is the conversation-level identification of patterns used by adults to build trust…
Ground Truth
Ground truth is the human-labeled reference set that a classifier is trained and evaluated…
Hash Matching
Hash matching is a detection technique that compares the cryptographic or perceptual fingerprint of…
Hate Speech
Hate speech is content that promotes violence, discrimination, or hostility toward individuals or groups…
Human in the Loop
Human in the loop is a moderation approach where AI handles bulk decisions but…
Human Moderation
Human moderation is the practice of having trained people — rather than algorithms —…
Imposter Scam
An imposter scam is a fraud in which the attacker pretends to be someone…
Investment Scam
An investment scam is a fraud that lures victims into fake trading platforms, nonexistent…
KOSA (Kids Online Safety Act)
KOSA is proposed US federal legislation that would impose a duty of care on…
LLM
An LLM (Large Language Model) is a neural network trained on huge volumes of…
LLM Hallucination
An LLM hallucination is a confident but factually incorrect or fabricated output produced by…
LLM Jailbreak
An LLM jailbreak is a prompt or sequence of prompts crafted to bypass a…
Machine Learning Moderation
Machine learning moderation is the use of supervised models trained on labeled examples of…
Manual Review
Manual review is the process of a human moderator examining a piece of content…
Misinformation
Misinformation is false or misleading information that is shared without the intent to deceive…
MLCommons Safety Categories
The MLCommons safety categories are a standardized taxonomy of 13 harm types — created…
Model Drift
Model drift is the gradual decay in a classifier's accuracy as the language, topics,…
Moderation Queue
A moderation queue is the prioritized list of flagged or reported content waiting for…
NCII (Non-Consensual Intimate Imagery)
Non-consensual intimate imagery refers to sexually explicit photos or videos shared without the subject's…
NLP
NLP (Natural Language Processing) is the branch of artificial intelligence that gives machines the…
NSFA (Not Safe for Ads)
NSFA stands for "Not Safe for Ads" and labels content that is unsafe to…
NSFW
NSFW stands for "Not Safe For Work" and is used to label content —…
Nudity Detection
Nudity detection is the use of computer vision models to identify images or video…
OCR (Optical Character Recognition)
Optical character recognition is the extraction of machine-readable text from images, scanned documents, and…
Offensive Content
Offensive content is user-generated material likely to upset or alienate readers — including hate…
Online Safety Act (UK)
The Online Safety Act is a UK law that imposes a legal "duty of…
Perspective API
Perspective API is a public toxicity classification service from Google Jigsaw that scores text…
Phishing
Phishing is a social engineering attack in which the attacker impersonates a trusted entity…
PhotoDNA
PhotoDNA is a hash-matching technology developed by Microsoft that creates a robust digital signature…
Pig Butchering Scam
A pig butchering scam is a long-con fraud in which a scammer builds a…
PII Detection
PII detection is the automated identification of personally identifiable information — such as names,…
Post Moderation
Post moderation is the practice of letting user-generated content go live immediately and reviewing…
Pre Moderation
Pre moderation is the practice of reviewing user-generated content before it goes live, blocking…
Precision
Precision is a moderation metric that measures what fraction of the items flagged by…
Proactive Moderation
Proactive moderation is the practice of detecting and acting on policy violations before users…
Profanity Filter
A profanity filter is a tool that scans user-generated text against a list of…
Prompt Injection
Prompt injection is an attack against an LLM-powered application in which adversarial instructions —…
Reactive Moderation
Reactive moderation is the practice of waiting for users to report violations and only…
Recall
Recall is a moderation metric that measures what fraction of all the actually harmful…
Red Teaming (AI)
AI red teaming is the practice of adversarially probing a machine learning system —…
Romance Scam
A romance scam is a fraud in which the attacker feigns a romantic relationship…
Rug Pull
A rug pull is a cryptocurrency exit scam in which the developers of a…
Section 230
Section 230 is the provision of the 1996 Communications Decency Act that shields US…
Self-Harm Detection
Self-harm detection is the identification of user-generated content that expresses suicidal ideation, self-injury, or…
Sentiment Analysis
Sentiment analysis is a natural language processing technique that classifies text according to its…
Sextortion
Sextortion is a form of online blackmail in which an attacker threatens to share…
Shadow Banning
Shadow banning is the practice of silently reducing the visibility of a user's posts…
SHAFT
SHAFT is a content moderation and advertising compliance acronym for Sex, Hate, Alcohol, Firearms,…
SIM Swap
A SIM swap is an attack in which a fraudster social-engineers a mobile carrier…
Smishing
Smishing is phishing delivered over SMS, where attackers send text messages impersonating a bank,…
Sock Puppet Account
A sock puppet is a fake online identity created to deceive other users, usually…
Spam
Spam is unsolicited, irrelevant, or repetitive content posted at scale, typically for advertising, link…
Takedown
A takedown is the removal of a piece of content from a platform after…
Tech Support Scam
A tech support scam is a fraud in which criminals impersonate well-known software or…
Terms of Service (ToS)
Terms of Service is the legal agreement between a platform and its users that…
Toxicity
Toxicity in content moderation describes language that is harmful, abusive, or disruptive to a…
Transparency Report
A transparency report is a regular public disclosure in which a platform reports on…
True Negative
A true negative in content moderation is an instance where benign content is correctly…
True Positive
A true positive in content moderation is an instance where harmful or policy-violating content…
Trust & Safety
Trust & Safety is the discipline within an online platform responsible for protecting users…
User-Generated Content (UGC)
User-generated content (UGC) is any text, image, video, audio, or comment created and published…
Vishing
Vishing is voice phishing, where an attacker calls the victim and impersonates a bank,…
Vision-Language Model (VLM)
A vision-language model is a multimodal model that understands images and text together, letting…
Zero Tolerance Policy
A zero tolerance policy is a moderation rule that triggers an immediate and severe…
Zero-Shot Classification
Zero-shot classification is the ability of a large language model to assign labels it…
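
Worked example: Confusion Matrix, Precision, Recall, and F1 Score
The four outcome terms (true positive, false positive, false negative, true negative) and the three metrics built on them are easier to see with numbers. The sketch below is illustrative only: the class name, the tally helper, and the sample counts are hypothetical and do not come from any particular moderation product. It assumes a binary classifier where True means "flagged as harmful".

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Counts for a binary moderation classifier (hypothetical helper)."""
    true_positives: int = 0   # harmful content correctly flagged
    false_positives: int = 0  # benign content incorrectly flagged
    false_negatives: int = 0  # harmful content that was missed
    true_negatives: int = 0   # benign content correctly left alone

    def precision(self) -> float:
        # Of everything flagged, what fraction was actually harmful?
        flagged = self.true_positives + self.false_positives
        return self.true_positives / flagged if flagged else 0.0

    def recall(self) -> float:
        # Of everything actually harmful, what fraction was flagged?
        harmful = self.true_positives + self.false_negatives
        return self.true_positives / harmful if harmful else 0.0

    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


def tally(ground_truth, predictions) -> ConfusionMatrix:
    """Build a confusion matrix from paired human labels and model predictions."""
    cm = ConfusionMatrix()
    for actual, predicted in zip(ground_truth, predictions):
        if actual and predicted:
            cm.true_positives += 1
        elif not actual and predicted:
            cm.false_positives += 1
        elif actual and not predicted:
            cm.false_negatives += 1
        else:
            cm.true_negatives += 1
    return cm


# Made-up counts for 1,000 reviewed items.
example = ConfusionMatrix(true_positives=80, false_positives=20,
                          false_negatives=10, true_negatives=890)
print(f"precision={example.precision():.2f}")  # precision=0.80
print(f"recall={example.recall():.2f}")        # recall=0.89
print(f"f1={example.f1():.2f}")                # f1=0.84
```

With these counts, 80 of the 100 flagged items were truly harmful (precision 0.80) and 80 of the 90 harmful items were caught (recall about 0.89). Which metric to favor is a policy choice: tuning for higher recall usually produces more false positives, and tuning for higher precision usually produces more false negatives.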
