Pawb.Social Feedback @pawb.social Crashdoom (he/him) @pawb.social 1y ago

[RFC] Use of Automated Moderation Tools

Due to the recent spam waves affecting the Fediverse, we'd like to open requests for comment on the use of automated moderation tools across Pawb.Social services.

We have a few ideas on what we'd like to do, but want to make sure users would feel comfortable with this before we go ahead with anything.

For each of these, please let us know whether you consider the use-case acceptable or not, and if you feel like sharing additional info, we'd appreciate it.

1. Monitoring of Public Streaming Feed

We would like to set up a bot that monitors the public feed (all posts with Public visibility that appear in the Federated timeline) to flag any posts that meet our internally defined heuristic rules.

Flagged posts would be reported as normal from a special system-user account, but reports would not be forwarded to remote instances, so that any false positives stay with our own team.

These rules would be fixed and based on metadata from the posts (account indicators, mentions, links, etc.), not on the content of the posts themselves.
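As a rough illustration of what a metadata-only rule could look like, here is a minimal sketch. The field names (`account_age_days`, `follower_count`, `mention_count`, `link_domains`), the weights, the example domain, and the threshold are all hypothetical and are not our actual rules. Note that the post text is never inspected.

```python
from dataclasses import dataclass, field


@dataclass
class PostMetadata:
    """Hypothetical metadata for a public post; no post text is included."""
    account_age_days: int
    follower_count: int
    mention_count: int
    link_domains: list[str] = field(default_factory=list)


# Illustrative values only; real rules would be tuned by the moderation team.
SUSPICIOUS_DOMAINS = {"spam.example"}
FLAG_THRESHOLD = 3


def heuristic_score(post: PostMetadata) -> int:
    """Add points for each spam indicator found in the metadata."""
    score = 0
    if post.account_age_days < 1:
        score += 1  # brand-new account
    if post.follower_count == 0:
        score += 1  # nobody follows the account
    if post.mention_count >= 5:
        score += 2  # mass-mentioning users
    if any(domain in SUSPICIOUS_DOMAINS for domain in post.link_domains):
        score += 2  # links to a domain seen in earlier spam
    return score


def should_flag(post: PostMetadata) -> bool:
    """Decide whether to raise a report; no action is taken on the account."""
    return heuristic_score(post) >= FLAG_THRESHOLD


spammy = PostMetadata(account_age_days=0, follower_count=0, mention_count=8)
regular = PostMetadata(account_age_days=400, follower_count=120,
                       mention_count=1, link_domains=["patreon.com"])
print(should_flag(spammy), should_flag(regular))
```

A single indicator on its own (for example, an established account linking to a personal site) stays below the threshold in this sketch, which is the kind of behavior we would want from real rules.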

2. Building of a local AI spam-detection model

Taking this a step further, we would like to experiment with using TensorFlow Lite and Google Coral Edge TPUs to make a fully local model, trained on the existing decisions made by our moderation team. To stress, the model would be local only and would not share data with any third party, or service.

This model would analyze the contents of the post for known spam-style content and identifiers, and raise a report to the moderation team where it exceeds a given threshold.

However, we do recognize that this would result in us processing posts from remote instances and users, so we would commit to not using any remote posts for training unless they are identified as spam by our moderators.

3. Use of local posts for non-spam training

If we see support for #2, we'd also like to request permission from users, on a voluntary basis, to provide their posts as "ham" (non-spam / known good posts) to the spam-detection model.

While new posts would be run through the model, they would not be used for training unless you give us explicit permission to use them in that manner.

I'm hoping this method will allow users who feel comfortable with this to assist in development of the model, while not compelling anyone to provide permission where they dislike or are uncomfortable with the use of their data for AI training.
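To make the commitments in #2 and #3 concrete, here is a minimal sketch of the rule for deciding whether a post may enter the training set. The `Post` record and its field names are hypothetical and do not reflect Mastodon's schema; this is an illustration of the policy, not an implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    """Hypothetical record; field names are illustrative only."""
    is_local: bool
    moderator_marked_spam: bool
    author_opted_in: bool


def training_label(post: Post) -> Optional[str]:
    """Return "spam", "ham", or None if the post must not be trained on."""
    if post.moderator_marked_spam:
        # Local or remote, but only after a moderator decision.
        return "spam"
    if post.is_local and post.author_opted_in:
        # Known-good post from a user who gave explicit permission.
        return "ham"
    # Everything else may be scored by the model but is never trained on.
    return None
```

Under this sketch, a remote post that is not spam returns `None` regardless of any other flag, so it is never used for training.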

4. Temporarily limiting suspected spam accounts

If our heuristics and / or AI detection identify a significant risk or pattern of spammy behavior, we would like to be able to temporarily hide / suppress content from the offending account until a moderator is able to review it. We've also suggested an alternative idea to Glitch-SOC, the fork we run for furry.engineer and pawb.fun, to allow hiding a post until it can be reviewed.

Limiting the account would prevent anyone not following them from seeing posts or mentions by them, until their account restriction is lifted by a moderator.

In a false-positive scenario, an innocent user may not have their posts or replies seen by a user on furry.engineer / pawb.fun until their account restriction is lifted which may break existing conversations or prevent new ones.

We'll be leaving this Request for Comment open-ended to allow for evolving opinions over time, but are looking for initial feedback within the next few days for Idea #1, and before the end of the week for ideas #2 through #4.

24 comments

@crashdoom I'm generally against automated moderation, having been shadowbanned on other platforms for no reason I can identify. These scripts are never infallible, no matter how well intentioned. A computer can be trained to recognise keywords but it can never understand context.

Having said that, I do appreciate the urgency to do something. If you do go ahead with it, I would ask the following:

- Make sure the user is informed of any action, never use shadowbans.

- Make sure there is easy access to human review in the event mistakes do occur.
- @RavenLuni @crashdoom Yeah I agree. Automated moderation systems can cause a lot of problems when they ban or limit without human interaction.
  
  If they do though, they need to inform the user of the actions performed, and there needs to be an easy way to appeal them, so they aren't just baseless automated bans like on every mainstream service.
- I've upvoted this but I'd just like to chuck in that I think Raven makes a lot of sense here. I've had posts deleted or hidden by automod bots on other sites and even when they're restored they don't get as much traction as the posts which were left alone. So there's an effect even if the action can be "reversed" - and I say that in quotes because it's not like you can turn the clock back.
  
  Hard agree on the no use of shadowbans and keeping users informed, and the easy escalation to a human.
  
  My ideal would be some kind of system which looks at the public feed for keywords and raises anything of concern to an admin, and maybe the admin's response goes back in as 'training'. Something more like SpamAssassin's Bayesian ham/spam classifier perhaps.
  
  I don't think automated actions without a human in the loop is the right way to go - and I have grave concerns about biases creeping into the model over time. The poster child for this is pretty much Amazon's HR résumé review system, which ended up with racist biases. There's been a lot of good progress improving PoC/BIPOC/BAME/non-white acceptance and it'd be a shame if something like this accidentally ended up scarring or undoing some of that.
It all feels generally ok by me, but I have some thoughts.

With #4, I am worried about timing. Imagine someone makes some art, and puts it in the main furry hashtag. But they trigger your bot because they link to their personal website or Patreon or something. It could take some number of hours, depending on who's awake or not, for it to be "cleared". By that time, if it's inserted into the feed based on its date, nobody would see it. If you are going to release transparency statistics (which IMO is important), I'd like to see "time until released" be a metric.

If this system goes into place, will the limit on mastodon.social etc. be lifted if you can screen them for spam?

On a personal note, I'm a bit wary of moderation by AI because of false positives. I'm neurodivergent, so I speak and communicate in a "different" way to neurotypical people. So I'm worried that an AI will pick up on that and block me unfairly because of the lack of training data. But that shouldn't be a problem here if you guys are committed to keeping a person in the loop (unlike most other places with AI moderation...). Although, apparently over half the furry fediverse is neurodivergent, so it's probably the neurotypicals that'll need to be worried about that.

Lastly, and this may be a hot take so I'm curious about other pawb people's thoughts on this, but does it need to be local only? This may be a crazy idea, but since it's opt in anyway, why not throw up all the training data on github or somewhere? Maybe even work with similarly minded instance admins and create one giant set of data. It'd be an ethically sourced dataset for moderation that can also be audited by anyone (to make sure you don't have specific political parties "accidentally" marked as spam). Most of my (and I presume many other people's) hangups with AI moderation is how closed and secretive it is. Would be nice to break that trend with something open.

Might also increase uptake as well. I bet a lot of privacy focused Linux users still will give Valve their system information because they see where it's going, and the value of doing so (i.e. to beat those filthy Mac users).
I'm for automated flagging to help you with moderation but, especially with AI, you should review every action your flagger takes. There might be a big false positive percentage. And there should be a way for muted people to talk to an admin to get it resolved (maybe put a little link on the sidebar for that). Freezing posts so that no one can see them or comment on them seems fine even if it takes a day or so to resolve.
When someone gets muted they should get a clear message that informs them about this too. Like a whisper message maybe.
I like the idea, if I understood it correctly, that users can help train the AI by throwing typical spam at it. Maybe there could be a whole community just for that.
Other than that I'm fine with all points and I'm glad if you can future proof this server for protection against massive bot raids.
Thanks for your service :3
I think 1-3 are fine (since nothing really happens without a human involved), but 4 should come in after several months of testing the model to make sure its false positive rate is as close to 0 as possible.

I in general think that LLM/"AI" stuff is massively overblown when used for creating content, but when analyzing stuff, it's much more reasonable to employ as a referral to humans to make the final decision.

I guess I've just been lucky in that I've not gotten any spam yet on masto...
Appreciate the feedback so far, let me try to see if I can answer most / many of the questions:

What are the risks of #4?

Many users are worried about the risk of automated actions going wrong and are unsure what we mean by "pattern of spammy behavior."

For how we would identify the pattern of behavior that would allow for automated actions, we would review any major spam wave, such as the one we've been experiencing over the past few days.

We would then identify any indicators of the known spam, and create a heuristic ruleset that would limit or suspend those accounts while targeting only the accounts actively engaging in the spam, not those merely referring to it. There are additional safeguards we can add, such as preventing rules from being applied to a user who is followed by someone on our instances.

For the risk of automated actions going wrong, if we were using a limit (not a suspend) then the account would be hidden from public view but could still be viewed if specifically searched by name. It would also suppress all notifications from that user unless they are followed by you. (e.g. if they messaged you out of the blue, you wouldn't see it if you weren't following them.)

If a suspend was used, the account would be marked for deletion from our instances but all follower relationships would immediately break (e.g. if you were following them, the system would automatically unfollow when they are suspended). Typically, we can restore data within 30 days, but follower relationships are typically unrecoverable. So long as rules are appropriately limited in scope to only target those with a lot of spam indicators, no false-positives should occur.

What about appeals?

For local users (anyone registered on furry.engineer and pawb.fun), all actions against your account (except reports) can be appealed. If you have a post removed or are suspended, all actions can be appealed directly to the admin team.

For remote users, we can remove restrictions on remote accounts if we receive an appeal from any of our users, or by the affected account directly. This can be done via email, or just through a DM to one of the admins who can pass it to the team.

Would the AI model have oversight?

Yes. Where the team believe the filter has flagged sufficient content appropriately and maintains no false-positives, we may promote a model or ruleset to allow automated actions (limit / suspend).

We'll keep an eye on the actions of each ruleset by reviewing the daily / weekly actions taken to ensure they meet the criteria and have not misidentified any users or content, and we'll also start publicly tracking the statistics of the models / rulesets we create and use, including a count of false-positives or reversed decisions.
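As an illustration of the kind of statistics we could publish, here is a minimal sketch that summarizes a list of automated actions. The `AutomatedAction` record, its field names, and the "minutes until reviewed" metric are hypothetical; the exact figures we track may differ.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class AutomatedAction:
    """Hypothetical log entry for one automated report or limit."""
    ruleset: str
    reversed_by_moderator: bool
    minutes_until_reviewed: float


def summarize(actions: list[AutomatedAction]) -> dict:
    """Compute totals, reversed decisions, and the median review delay."""
    reversed_count = sum(a.reversed_by_moderator for a in actions)
    return {
        "total_actions": len(actions),
        "reversed": reversed_count,
        "reversal_rate": reversed_count / len(actions) if actions else 0.0,
        "median_minutes_until_reviewed": (
            median(a.minutes_until_reviewed for a in actions)
            if actions else None
        ),
    }
```

The median is used here rather than the mean so that one action left unreviewed overnight does not dominate the reported delay.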

Will you notify users?

Due to limitations in Mastodon, we can only notify local users (users on furry.engineer or pawb.fun) when actions are taken against their account. This process happens automatically when your post is removed, or your account is warned, limited, or suspended.

There's no easy way to notify remote users other than sending them a DM, but doing so could be seen as spammy or lead to inciting further abusive behavior by informing them of our activity. While we can have transparency with our users due to having an invite-only platform, other instances are frequently open-registration which can allow the abusive user to re-create an account to continue to harass our users. BUT, I'm open to suggestions on this.
- hmm, on the last point: If it's just a single user harassing then it shouldn't be too much trouble if they re-create an account. The anti-spam system should flag them again if they keep harassing. If it's a lot of bots then I would assume they already have methods to determine whether an account is suspended (like DM-ing each other maybe). Hence there wouldn't be any advantage to not informing them of being suspended.
  I might be completely wrong here and missing a key point as I don't really know anything about Mastodon or spam prevention really but it just feels wrong to censor someone without them knowing.
  If time is crucial you could inform people an hour/a day/etc. after their suspension.
  
  So, the issue lies in the fact that there's no technical way to notify the remote user (someone not on furry.engineer or pawb.fun) that they've been suspended on our end, without sending a message to them directly. If we suspend them on our end, that doesn't per se suspend them on their end, and they wouldn't know that their messages were no longer reaching our users. They would still be able to message other users on their instance, and users on other instances, but not our users.
  
  We're apprehensive about notifying remote accounts specifically because we don't often know the moderation practices of the remote instance (to know if they'll deal with it, or if they have open-registration allowing anyone to join without approval) and it may encourage further abusive behavior through ban evasions (creating new accounts on that instance or elsewhere to continue messaging) from the user being made aware that we're no longer receiving their messages.
I'm opposed to #4 on principle. ANY action taken against an account should ALWAYS be done by a person after direct review. It doesn't matter if it can be fixed afterwards or not, you're still potentially subjecting people to unfair treatment and profiling. You can have it notify moderators but the moderators should be the ones actually making the decision whether to limit an account for further investigation, not the auto-mod bot.

If you implement #4 as-is, I'm just flat-out not going to stick around.

EDIT: Also, I ran into an infinite loading bug when submitting this post.
@crashdoom

#1 makes sense; sure.

#2 seems fine to try but I am a little skeptical about the chances of success without domain knowledge. A Coral Edge TPU in particular feels quite unnecessary — most spam models are totally fine running on CPU. I am also a little surprised to see the first impression is to build rather than looking for existing local solutions.

#3 Sure, if it’s user by user opt in, that could be fine. I’d also ask — would false positives (flagged in an automated manner, reviewed by a human and found to be not spam) be entered as well to be trained on, or no?

#4 Seems reasonable, though I would hope that their posts would still be visible when directly viewing their profile page. I would also hope there is some mechanism in place such that, if automated techniques routinely misidentify a user, they are exempted from this after ~2 times. I would also be curious to see some stats on this in transparency reports.
- #2: I've had some light experience before specifically with TensorFlow Lite models during my degree program. For the Coral Edge TPU, we wanted to off-load the processing to try to get the speed as near to zero latency as possible, though admittedly, it would potentially be superfluous. I'm also looking into some existing models I could potentially use but hadn't found any that particularly stood out, but if anyone has any recommendations I'd love to check them out!
  
  #3: Good question. If the system flags a post automatically as potentially spam, and the team determine it's not spam, I would probably like to be able to train on that message as "ham" / not-spam to avoid future false positives. But that would be an extension of the scope of what we'd train on, so I'd very much like feedback on that too.
  
  #4: Yes, when a user is limited the profile will show a content warning before the contents of the profile. I believe the prompt is something like "This user has been hidden by the moderators of [instance name]". For repeated mis-identifications, yes: in edge cases like this we could approve the user and exempt them from future automated reports.
@crashdoom #1 and #4 i am entirely for. regarding ai, i think that using a fully local model is a great idea, but as @jackemled points out, bias is something that ai models tend to pick up very easily from training data, so that's something to be wary about. hopefully using only training data from this network's spam reports and voluntary non-spam data should produce accurate results, but i'd still be especially cautious. i have some faith that it could work, though, and would be happy to be part of the non-spam training data if you decide to go that route.
@crashdoom I think automated moderation tools are potentially problematic unless they can be made to take into account the cultural norms of the person speaking. For example, there's a traditional British food the name of which is a homophobic slur in American English. Another British slang term for a cigarette also falls foul of this. People who like this food have been auto-banned from other platforms for posting about it with no homophobic intent. I just don't think intelligent mod tools are sufficiently capable to pick up from the speaker's other posts or profile that what they said isn't prejudiced because of who said it or that someone using the n-word is black and therefore the standards for whether the speech should trigger discipline could be radically different than if a white person said it.

If auto moderation is introduced, I think it's important that the bot should message anyone it targets, tell them what its grounds were and how to appeal to a human if they think it was wrong.
First three are fine with me without changes, but I need more details on the specific implementation of #4, and especially regarding false positives, before I can comment on that one.

Maybe we could have a trial period of #4 that merely displays a banner next to flagged posts, notifying folks that the account is suspected of spam, until it has been sufficiently vetted to reliably prevent false positives. The banner could also ask folks to confirm whether or not it really is spam.
#2 seems to require #3 by definition -- the model can't know what spam is without knowing what ham is as well. In general a DSpam model would seem to be the right one -- all posts used to train ham, individual posts marked as spam are removed from the ham set and added to the spam set, and then a separate spam feed that could be monitored for false positives.
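The train-everything-as-ham, then move-to-spam scheme described above can be sketched as follows. This is a toy naive Bayes filter with add-one smoothing, swapped in for illustration; it is not DSPAM's actual algorithm, and the class and method names are hypothetical.

```python
import math
from collections import Counter


class HamSpamFilter:
    """Toy naive Bayes filter; not DSPAM's actual implementation."""

    def __init__(self) -> None:
        self.tokens = {"ham": Counter(), "spam": Counter()}
        self.posts = {"ham": 0, "spam": 0}

    @staticmethod
    def _tokenize(text: str) -> list[str]:
        return text.lower().split()

    def train_ham(self, text: str) -> None:
        """Every incoming post starts out in the ham set."""
        self.tokens["ham"].update(self._tokenize(text))
        self.posts["ham"] += 1

    def mark_spam(self, text: str) -> None:
        """Move a post that was previously trained as ham into the spam set."""
        words = self._tokenize(text)
        self.tokens["ham"].subtract(words)
        self.posts["ham"] -= 1
        self.tokens["spam"].update(words)
        self.posts["spam"] += 1

    def spam_probability(self, text: str) -> float:
        """Posterior probability of spam, using add-one smoothing."""
        vocab = set(self.tokens["ham"]) | set(self.tokens["spam"])
        total_posts = self.posts["ham"] + self.posts["spam"]
        log_score = {}
        for label in ("ham", "spam"):
            # Smoothed prior, so an empty class never yields log(0).
            log_p = math.log((self.posts[label] + 1) / (total_posts + 2))
            total = sum(self.tokens[label].values())
            for word in self._tokenize(text):
                count = self.tokens[label][word]
                log_p += math.log((count + 1) / (total + len(vocab) + 1))
            log_score[label] = log_p
        # Turn the two log scores into a probability for "spam".
        return 1 / (1 + math.exp(log_score["ham"] - log_score["spam"]))


spam_filter = HamSpamFilter()
for post in ["new art of my fursona", "lunch was great today",
             "buy cheap followers now", "buy cheap followers here"]:
    spam_filter.train_ham(post)
spam_filter.mark_spam("buy cheap followers now")
spam_filter.mark_spam("buy cheap followers here")
print(spam_filter.spam_probability("cheap followers") > 0.5)
```

Posts scoring above some threshold would go to the separate spam feed for a human to check, and anything a moderator clears would simply stay in the ham set.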

In general all of these approaches sound fine to me -- I hope that mastodon can develop a built-in spam suppression system but for now we have to rely on these bespoke approaches.
@crashdoom I don't like the idea of using any "ai" or neural network model for this. I have NEVER had a good experience with automated moderation that uses them, they're always either completely nonfunctional or biased against or for a demographic. If that is what you choose to do, I think your idea of making it completely local is very good.

I think 1 & 4 are great, & that 2 & 3 could be unreliable & miss everything or create too many false positives.
Number 1 seems like a good enough idea. Number 4 I'm not sure I can really comment on, because I only use the lemmy instance for this group of instances and my masto is on an unrelated one, so I don't feel I have much of a stake in something that doesn't look like it would impact the lemmy one much.

The AI business has me a bit concerned, just because I've heard AI to be inherently prone to bias even when one tries to avoid it, but if the AI is only used to flag things for moderator review rather than taking user affecting action itself, I don't see the harm, since it's just another user making reports in that case.
AFAIK, Pleroma and software forked from it (namely Akkoma and Soapbox) use what they call the Message Rewrite Facility (MRF), which basically does what #1 does anyway as a method of fine-tuned automatic moderation. It'd be nice for Mastodon (and Misskey) to use MRFs as the core of their moderation tooling. So I have no problems with #1. I'll also accept #4 as long as there's a method to report accounts as "not spam". I've had an issue with not getting notifications from a friend of mine because furry.engineer was limiting mastodon.social until recently, I believe.

I don't mind #2 and #3 either, tbh, thinking about it more. AI is quite the buzzword nowadays but there have been cases when AI-assisted moderation has been helpful. I'd limit it to public posts only though, as that data would already be out in the open anyway given how the fediverse works, though given the drama over Bridgy Fed from people who don't realise what public really means this might be a bit of an uphill battle.
Any system where the most severe outcome is "A moderator will look at it" is an easy sell for me, so I wouldn't have any problem with 1 or 2. And an opt-in system of nearly any kind is going to be okay by me so long as it doesn't stand to harm anyone who hasn't given informed consent, so 3 also sounds fine.

With 4, I'd definitely want more details on what is considered "a significant risk or pattern of spammy behavior" and on why the temporary suppression "may break existing conversations or prevent new ones" before being comfortable with such a system.
@crashdoom I'm generally very wary about any sort of automated system that can ban or limit accounts without human input. Perhaps an alternative system to give moderators time to respond would be something that limits accounts that are reported by multiple local users in a short time period? That does have the potential for abuse as well and I think we should carefully consider the avenues for it, but at our community's scale it seems feasible to me.
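A minimal sketch of that alternative, assuming a sliding time window and a count of distinct local reporters. The window length, the reporter threshold, and all names are hypothetical; the thread does not propose specific numbers.

```python
from collections import defaultdict, deque

# Illustrative values only.
REPORT_WINDOW_SECONDS = 15 * 60
MIN_DISTINCT_REPORTERS = 3


class ReportTracker:
    """Tracks recent reports per account within a sliding window."""

    def __init__(self) -> None:
        # account -> deque of (timestamp, reporter) in arrival order
        self._reports = defaultdict(deque)

    def add_report(self, account: str, reporter: str, now: float) -> bool:
        """Record a report; return True if the account should be limited
        until a moderator reviews it."""
        reports = self._reports[account]
        reports.append((now, reporter))
        # Drop reports that have fallen out of the window.
        while reports and now - reports[0][0] > REPORT_WINDOW_SECONDS:
            reports.popleft()
        # Count distinct reporters so one user cannot trigger a limit alone.
        distinct = {who for _, who in reports}
        return len(distinct) >= MIN_DISTINCT_REPORTERS
```

Counting distinct reporters addresses only the simplest form of abuse. A coordinated group of local users could still trigger a limit, which is the avenue the comment above suggests considering carefully.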
IMO this all seems basically fine, as long as no action is taken without human hands touching it. Flagging and the like are fine, I just don't want to see conversations get broken since that's literally the whole point of existing on fedi for me. I totally get trying to limit spam though, so it's balancing it that's important. I appreciate you actually caring about what people have to say.
@crashdoom I've had thoughts about this for a while.

Building a first-warning system (using e.g. TensorFlow) would be a good way of at least seeing possible issues ahead.

I'm curious whether, functionally, it would help to mark anything that ends up flagged as spam as the equivalent of "followers only" for some amount of time, until a human has had a chance to clear it. I would expect that to help with the shadowban issue.

New accounts, especially those from a server where none of our users follow anyone, are I think the biggest thing to look out for.

Honestly, I also take a fairly straightforward opinion on posts from other places: if it's public to the world, it's fair game, especially for classifier data. Generative models, no, but pure classifiers? Go ham. They put it on the Public Internet.
I support 1, 2, 3, 4.