Content Moderation Glossary
113 terms every trust & safety team should know
Clear definitions for the vocabulary of content moderation — from abuse detection to webhooks.
- 3 Strikes Policy
- A 3 strikes policy is a moderation rule that escalates consequences for repeated violations:…
- Abuse Detection
- Abuse detection is the process of identifying harmful or abusive user-generated content — such…
- Active Learning
- Active learning is a training strategy where the model itself selects the most informative…
- Advance Fee Scam (419 Scam)
- An advance fee scam tricks the victim into paying an upfront fee in exchange…
- Age Verification
- Age verification is the process of confirming a user's age before granting access to…
- AI Generated Content (AIGC)
- AI-generated content (AIGC) is text, images, audio, or video produced by generative AI models…
- AI Guardrails
- AI guardrails are the rules, filters, and policies built around an AI system to…
- AI Voice Cloning Scam
- An AI voice cloning scam uses a few seconds of recorded speech — pulled…
- AI Watermarking
- AI watermarking is the practice of embedding imperceptible signals into AI-generated text, images, audio,…
- Algorithmic Moderation
- Algorithmic moderation is the use of rule-based or pattern-matching algorithms to automatically detect and…
- Allowlist & Blocklist
- An allowlist is a curated list of words, phrases, users, or domains that are…
- Appeal Process
- An appeal process is a mechanism that lets a user contest a moderation decision…
- Artificial Intelligence (AI) Moderation
- AI moderation is the use of machine learning, natural language processing, and computer vision…
- Astroturfing
- Astroturfing is a coordinated campaign disguised as a spontaneous grassroots movement, where paid or…
- Automated Moderation
- Automated moderation is the use of software tools — including rules engines, AI classifiers,…
- Banning
- Banning is the act of permanently revoking a user's access to a platform or…
- Bot Detection
- Bot detection is the identification of automated accounts and scripted traffic through a combination…
- Brand Safety
- Brand safety is the set of measures advertisers and platforms use to prevent a…
- Business Email Compromise (BEC)
- Business email compromise is a targeted fraud in which attackers impersonate an executive, employee,…
- C2PA
- C2PA is an open technical standard, published by the Coalition for Content Provenance and Authenticity, for attaching…
- Catfishing
- Catfishing is the practice of creating a fake online persona to deceive another person…
- Chat Moderation
- Chat moderation is the practice of monitoring and managing real-time conversations — in messaging…
- Community Guidelines
- Community guidelines are a set of rules and standards published by a platform that…
- Confusion Matrix
- A confusion matrix is a 2x2 table that breaks model predictions into true positives,…
- Content Filtering
- Content filtering is the process of screening incoming user-generated content against a set of…
- Content Flagging
- Content flagging is a feature that lets users report posts, comments, or media they…
- Content Moderation
- Content moderation is the practice of monitoring and managing user-generated content on a platform…
- Content Review
- Content review is the process of examining flagged or reported content to determine whether…
- Contextual Analysis
- Contextual analysis is the examination of a piece of content within its surrounding context…
- Coordinated Inauthentic Behavior (CIB)
- Coordinated inauthentic behavior is a term for networks of fake or compromised accounts that…
- COPPA
- The Children's Online Privacy Protection Act is a US federal law that restricts how…
- Crypto Scam
- A crypto scam is any fraud that exploits cryptocurrency rails to steal funds, including…
- CSAM (Child Sexual Abuse Material)
- CSAM stands for Child Sexual Abuse Material — any visual depiction of sexually explicit…
- Cyberbullying
- Cyberbullying is the use of digital communication tools — social media, messaging apps, comments,…
- Dark Web
- The dark web is a portion of the internet that is not indexed by…
- Data Labeling
- Data labeling is the process of annotating raw content with the correct categories so…
- Deepfake
- A deepfake is a piece of synthetic media — typically video, audio, or image…
- Deepfake Scam
- A deepfake scam uses AI-generated synthetic video or audio to impersonate a real person…
- Digital Services Act (DSA)
- The Digital Services Act (DSA) is a European Union regulation that sets binding rules…
- Disinformation
- Disinformation is false information that is deliberately created and spread to deceive, manipulate, or…
- Doxxing
- Doxxing is the act of publicly sharing someone's private personal information — such as…
- Employment Scam
- An employment scam is a fraud that uses fake job listings, recruiter outreach, or…
- F1 score
- The F1 score is the harmonic mean of precision and recall, used to evaluate…
- False Negative
- A false negative in content moderation is an instance where harmful or policy-violating content…
- False Positive
- A false positive in content moderation is an instance where benign content is incorrectly…
- Flagging
- Flagging is the act of marking a piece of content for moderator review, typically…
- Fraud Detection
- Fraud detection is the process of identifying and preventing deceptive activity on a platform…
- Government Impersonation Scam
- A government impersonation scam is a fraud in which criminals pose as officials from…
- Grooming Detection
- Grooming detection is the conversation-level identification of patterns used by adults to build trust…
- Ground Truth
- Ground truth is the human-labeled reference set that a classifier is trained and evaluated…
- Hash Matching
- Hash matching is a detection technique that compares the cryptographic or perceptual fingerprint of…
- Hate Speech
- Hate speech is content that promotes violence, discrimination, or hostility toward individuals or groups…
- Human in the Loop
- Human in the loop is a moderation approach where AI handles bulk decisions but…
- Human Moderation
- Human moderation is the practice of having trained people — rather than algorithms —…
- Imposter Scam
- An imposter scam is a fraud in which the attacker pretends to be someone…
- Investment Scam
- An investment scam is a fraud that lures victims into fake trading platforms, nonexistent…
- KOSA (Kids Online Safety Act)
- KOSA is proposed US federal legislation that would impose a duty of care on…
- LLM
- An LLM (Large Language Model) is a neural network trained on huge volumes of…
- LLM Hallucination
- An LLM hallucination is a confident but factually incorrect or fabricated output produced by…
- LLM Jailbreak
- An LLM jailbreak is a prompt or sequence of prompts crafted to bypass a…
- Machine Learning Moderation
- Machine learning moderation is the use of supervised models trained on labeled examples of…
- Manual Review
- Manual review is the process of a human moderator examining a piece of content…
- Misinformation
- Misinformation is false or misleading information that is shared without the intent to deceive…
- MLCommons safety categories
- The MLCommons safety categories are a standardized taxonomy of 13 harm types — created…
- Model Drift
- Model drift is the gradual decay in a classifier's accuracy as the language, topics,…
- Moderation Queue
- A moderation queue is the prioritized list of flagged or reported content waiting for…
- NCII (Non-Consensual Intimate Imagery)
- Non-consensual intimate imagery refers to sexually explicit photos or videos shared without the subject's…
- NLP
- NLP (Natural Language Processing) is the branch of artificial intelligence that gives machines the…
- NSFA (Not Safe for Ads)
- NSFA stands for "Not Safe for Ads" and labels content that is unsafe to…
- NSFW
- NSFW stands for "Not Safe For Work" and is used to label content —…
- Nudity Detection
- Nudity detection is the use of computer vision models to identify images or video…
- OCR (Optical Character Recognition)
- Optical character recognition is the extraction of machine-readable text from images, scanned documents, and…
- Offensive Content
- Offensive content is user-generated material likely to upset or alienate readers — including hate…
- Online Safety Act (UK)
- The Online Safety Act is a UK law that imposes a legal "duty of…
- Perspective API
- Perspective API is a public toxicity classification service from Google Jigsaw that scores text…
- Phishing
- Phishing is a social engineering attack in which the attacker impersonates a trusted entity…
- PhotoDNA
- PhotoDNA is a hash-matching technology developed by Microsoft that creates a robust digital signature…
- Pig Butchering Scam
- A pig butchering scam is a long-con fraud in which a scammer builds a…
- PII Detection
- PII detection is the automated identification of personally identifiable information — such as names,…
- Post-Moderation
- Post-moderation is the practice of letting user-generated content go live immediately and reviewing…
- Pre-Moderation
- Pre-moderation is the practice of reviewing user-generated content before it goes live, blocking…
- Precision
- Precision is a moderation metric that measures what fraction of the items flagged by…
- Proactive Moderation
- Proactive moderation is the practice of detecting and acting on policy violations before users…
- Profanity Filter
- A profanity filter is a tool that scans user-generated text against a list of…
- Prompt Injection
- Prompt injection is an attack against an LLM-powered application in which adversarial instructions —…
- Reactive Moderation
- Reactive moderation is the practice of waiting for users to report violations and only…
- Recall
- Recall is a moderation metric that measures what fraction of all the actually harmful…
- Red Teaming (AI)
- AI red teaming is the practice of adversarially probing a machine learning system —…
- Romance Scam
- A romance scam is a fraud in which the attacker feigns a romantic relationship…
- Rug Pull
- A rug pull is a cryptocurrency exit scam in which the developers of a…
- Section 230
- Section 230 is the provision of the 1996 Communications Decency Act that shields US…
- Self-Harm Detection
- Self-harm detection is the identification of user-generated content that expresses suicidal ideation, self-injury, or…
- Sentiment Analysis
- Sentiment analysis is a natural language processing technique that classifies text according to its…
- Sextortion
- Sextortion is a form of online blackmail in which an attacker threatens to share…
- Shadow Banning
- Shadow banning is the practice of silently reducing the visibility of a user's posts…
- SHAFT
- SHAFT is a content moderation and advertising compliance acronym for Sex, Hate, Alcohol, Firearms,…
- SIM Swap
- A SIM swap is an attack in which a fraudster social-engineers a mobile carrier…
- Smishing
- Smishing is phishing delivered over SMS, where attackers send text messages impersonating a bank,…
- Sock Puppet Account
- A sock puppet is a fake online identity created to deceive other users, usually…
- Spam
- Spam is unsolicited, irrelevant, or repetitive content posted at scale, typically for advertising, link…
- Takedown
- A takedown is the removal of a piece of content from a platform after…
- Tech Support Scam
- A tech support scam is a fraud in which criminals impersonate well-known software or…
- Terms of Service (ToS)
- Terms of Service is the legal agreement between a platform and its users that…
- Toxicity
- Toxicity in content moderation describes language that is harmful, abusive, or disruptive to a…
- Transparency Report
- A transparency report is a regular public disclosure in which a platform reports on…
- True Negative
- A true negative in content moderation is an instance where benign content is correctly…
- True Positive
- A true positive in content moderation is an instance where harmful or policy-violating content…
- Trust & Safety
- Trust & Safety is the discipline within an online platform responsible for protecting users…
- User-Generated Content (UGC)
- User-generated content (UGC) is any text, image, video, audio, or comment created and published…
- Vishing
- Vishing is voice phishing, where an attacker calls the victim and impersonates a bank,…
- Vision-Language Model (VLM)
- A vision-language model is a multimodal model that understands images and text together, letting…
- Zero Tolerance Policy
- A zero tolerance policy is a moderation rule that triggers an immediate and severe…
- Zero-Shot Classification
- Zero-shot classification is the ability of a large language model to assign labels it…
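The 3 strikes policy entry above describes an escalation ladder for repeat violations. A minimal sketch in Python; the penalty names and three-step ladder are illustrative assumptions, not a prescribed policy:

```python
from dataclasses import dataclass

# Illustrative escalation ladder; real platforms tune the steps,
# durations, and strike-expiry rules to their own policies.
PENALTIES = ["warning", "temporary_suspension", "permanent_ban"]

@dataclass
class UserRecord:
    strikes: int = 0

def apply_strike(user: UserRecord) -> str:
    """Record one violation and return the penalty for this strike."""
    user.strikes += 1
    # Cap at the final rung once the ladder is exhausted.
    step = min(user.strikes, len(PENALTIES)) - 1
    return PENALTIES[step]

u = UserRecord()
print(apply_strike(u))  # warning
print(apply_strike(u))  # temporary_suspension
print(apply_strike(u))  # permanent_ban
```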
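The allowlist & blocklist and profanity filter entries describe list-based text screening. A minimal whole-word sketch, with tiny made-up word lists (real deployments maintain large, locale-specific lists and pair them with ML classifiers):

```python
import re

# Illustrative word lists only.
BLOCKLIST = {"scamcoin", "pillage"}
# Allowlist entries override the blocklist for contexts where the
# term is benign (e.g. a hypothetical history forum allowing "pillage").
ALLOWLIST = {"pillage"}

WORD_RE = re.compile(r"[a-z']+")

def violates_blocklist(text: str) -> bool:
    """Whole-word blocklist check with an allowlist override.

    Matching whole words (rather than substrings) avoids the classic
    'Scunthorpe problem' of blocking benign words that merely contain
    a blocked string.
    """
    for word in WORD_RE.findall(text.lower()):
        if word in ALLOWLIST:
            continue
        if word in BLOCKLIST:
            return True
    return False

print(violates_blocklist("Buy SCAMCOIN now!"))          # True
print(violates_blocklist("Vikings would pillage towns"))  # False
```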
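Several entries above (confusion matrix, precision, recall, F1 score, and the true/false positive/negative terms) describe one shared evaluation framework. A minimal sketch computing the metrics from confusion-matrix counts; the counts in the example are made up for illustration:

```python
def moderation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: harmful content correctly flagged (true positives)
    fp: benign content incorrectly flagged (false positives)
    fn: harmful content the classifier missed (false negatives)
    tn: benign content correctly left alone (true negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up counts: 80 harmful items flagged, 20 benign items wrongly
# flagged, 10 harmful items missed, 890 benign items correctly passed.
print(moderation_metrics(tp=80, fp=20, fn=10, tn=890))
```

A high-precision, low-recall classifier rarely flags benign content but misses harm (more false negatives); tuning the trade-off depends on the harm category's severity.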
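The hash matching and PhotoDNA entries describe fingerprint-based detection of known content. A minimal sketch using an exact cryptographic hash; note this is an assumption-laden simplification, since production systems such as PhotoDNA use perceptual hashes that survive resizing and re-encoding, which an exact hash does not:

```python
import hashlib

def sha256_fingerprint(data: bytes) -> str:
    """Cryptographic fingerprint of a piece of media (exact match only)."""
    return hashlib.sha256(data).hexdigest()

# A hypothetical blocklist of fingerprints of previously removed media.
known_bad = {sha256_fingerprint(b"previously-removed image bytes")}

def is_known_bad(data: bytes) -> bool:
    """Look up an upload's fingerprint in the known-bad set."""
    return sha256_fingerprint(data) in known_bad

print(is_known_bad(b"previously-removed image bytes"))  # True
print(is_known_bad(b"some new, unseen upload"))         # False
```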
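The moderation queue entry describes a prioritized backlog of items awaiting review. A minimal sketch using a heap-based priority queue; the severity scores and item names are illustrative:

```python
import heapq
import itertools

# Lower number = higher priority. The counter is a tie-breaker so
# items with equal priority are reviewed in arrival (FIFO) order.
_counter = itertools.count()

def push(queue: list, priority: int, item: str) -> None:
    """Add a flagged item to the queue with a severity priority."""
    heapq.heappush(queue, (priority, next(_counter), item))

def pop(queue: list) -> str:
    """Return the most urgent item for human review."""
    return heapq.heappop(queue)[2]

q: list = []
push(q, 2, "reported spam comment")
push(q, 0, "suspected CSAM upload")   # most severe, reviewed first
push(q, 1, "harassment report")
print(pop(q))  # suspected CSAM upload
```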
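The PII detection entry describes automated identification of personal data in text. A minimal regex-based sketch; the two patterns here are simplified assumptions, as production PII detection combines many pattern families with ML-based named-entity recognition:

```python
import re

# Illustrative, deliberately simplified patterns.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def find_pii(text: str) -> dict:
    """Return matches per PII category found in the text."""
    return {name: pattern.findall(text)
            for name, pattern in PII_PATTERNS.items()
            if pattern.findall(text)}

print(find_pii("Reach me at jane@example.com or 555-123-4567"))
```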