I just developed and published a script to clear your pict-rs object storage from potential CSAM.

db0 · 3 years ago (edited)

I just developed and published a script to clear your pict-rs object storage from potential CSAM.

@FriendlyBeagleDog@lemmy.blahaj.zone · 3 years ago

I'm not well versed in the field, but I understand that large tech companies hosting user-generated content match the hashes of uploaded content against a list of known bad hashes as part of their strategy to detect and tackle such content.

Could a strategy like that be adopted as a first pass to improve detection and reduce the compute load of running every file through an AI model?
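To make the first-pass idea concrete, here is a minimal sketch of a hash lookup that only falls through to the expensive model when the hash is unknown. Everything here is an assumption for illustration: `screen`, `known_bad`, and the `classify` callback are hypothetical names, not part of the lemmy-safety script, and plain SHA-256 only catches byte-identical files.

```python
import hashlib
from typing import Callable


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def screen(data: bytes, known_bad: set[str], classify: Callable[[bytes], bool]) -> str:
    """Cheap hash lookup first; call the costly classifier only on a miss.

    `known_bad` is a hypothetical set of hex digests of known bad files.
    `classify` is a hypothetical stand-in for an AI model returning True
    when the content should be blocked.
    """
    if sha256_hex(data) in known_bad:
        return "blocked_by_hash"
    return "blocked_by_model" if classify(data) else "allowed"
```

With this shape, a file already on the list never reaches the model, so the compute saving scales with how much re-uploaded known content the instance sees.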

@dan@upvote.au · 3 years ago (edited)

“match the hashes”

It’s more than just basic hash matching, because it has to catch content even if it’s been resized, cropped, reduced in quality (lower JPEG quality with more artifacts), had its colour balance changed, etc.

@crunchpaste@lemmy.dbzer0.com · 3 years ago

Well, we have hashing algorithms that do exactly that, pHash for example.
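For anyone unfamiliar with the idea, here is a rough sketch of a perceptual hash. Note that this is the simpler "average hash", not pHash itself (which uses a DCT), and it assumes the image has already been downscaled to a small grayscale grid, given here as a list of rows of 0-255 integers. The function names are made up for the example.

```python
def average_hash(gray: list[list[int]]) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean.

    `gray` is assumed to be an already downscaled grayscale grid
    (for example 8x8), so the result fits in a 64-bit integer.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Two images are then treated as similar when the Hamming distance between their hashes is small. A uniform brightness shift leaves the hash unchanged, because every pixel and the mean move together.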

@dan@upvote.au · 3 years ago

Definitely. A lot of the good algorithms used by big services are proprietary though, unfortunately.

@crunchpaste@lemmy.dbzer0.com · 3 years ago

Can you point me to some of them? I’m quite interested in visual hashing.

@dan@upvote.au · 3 years ago (edited)

Microsoft’s PhotoDNA is probably the best known. Every major service that has user-generated content uses it. Last I checked, it wasn’t open-source. It was built for detecting CSAM, but it’s really just a general-purpose similarity hashing algorithm.

Meta has some algorithms that are open-source: https://about.fb.com/news/2019/08/open-source-photo-video-matching/

Google has CSAI Match for hash-matching of videos and Google Content Safety API for classification of new content, but both are proprietary.

db0 · 3 years ago (edited)

There are better approaches than hashing. To compare images, I calculate the “distance” between their tensors. This can match even when compression artifacts are involved or the images are slightly altered.
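As an illustration only, one common way to measure a "distance" between tensors is cosine distance between feature vectors. The sketch below assumes each image has already been turned into a flat list of floats by some model; the metric, the `is_match` helper, and the 0.1 threshold are assumptions for the example and may not be what the script actually uses.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def is_match(a: Sequence[float], b: Sequence[float], threshold: float = 0.1) -> bool:
    """Hypothetical helper: treat two images as the same below `threshold`."""
    return cosine_distance(a, b) <= threshold
```

Small edits such as recompression tend to move the vector only slightly, so the distance stays under the threshold, whereas a single flipped bit in a cryptographic hash gives no such notion of closeness.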

@FriendlyBeagleDog@lemmy.blahaj.zone · 3 years ago

Ah, of course - that’s unfortunate, but thanks for the pointer.

GitHub - Haidra-Org/lemmy-safety: A script that goes through a lemmy pict-rs object storage and tries to prevent illegal or unethical content