cross-posted from: https://lemmy.world/post/28546756

So I’ve completed the cosine similarity function, which means the script is now recommending videos in a raw way. Below is just a ranking of videos that match my watch history (all three are most likely videos I’ve already watched):

2: {shortUUID: "saKY2TWfwNYgPUQFkE4xsi", similarity: 0.4955}
3: {shortUUID: "kk7x8GAs7gNvkzaPs6EPiU", similarity: 0.4099}
4: {shortUUID: "uXeAyVfX1WEzqSPsDxtH3p", similarity: 0.2829}
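For anyone curious what the ranking step boils down to, here is a minimal Python sketch of cosine similarity over sparse word-weight vectors. It is not the actual script: the names `cosine_similarity` and `rank_videos`, and the idea of storing each video as a dict of word weights, are assumptions made for illustration.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse word -> weight vectors."""
    # Walk the smaller dict when computing the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_videos(profile: dict[str, float],
                videos: dict[str, dict[str, float]]) -> list[dict]:
    """Score every video against the watch-history profile, best first.

    `videos` maps a (hypothetical) shortUUID to that video's word vector.
    """
    scored = [
        {"shortUUID": uuid, "similarity": round(cosine_similarity(profile, vec), 4)}
        for uuid, vec in videos.items()
    ]
    return sorted(scored, key=lambda row: row["similarity"], reverse=True)
```

Calling `rank_videos(profile, videos)` gives a list of rows shaped like the ones above, highest similarity first.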

Getting to this point made me realize: there’s no such thing as a simple algorithm—just simple ways to collect data. The code currently has issues with collecting data properly, so that’s something that needs fixing. Hopefully, once the data collection in this script is improved, it can be reused for future Fediverse algorithms.

There are countless ways to process the data. Cosine similarity is a simple concept and easy to implement in code, but it has a flaw: content you’ve already watched tends to rank higher than anything new. So a basic “pick the highest cosine similarity” approach probably isn’t ideal. It either needs logic to remove already-watched videos, or to bias toward videos lower down in the ranking. But filtering out watched videos isn’t perfect either—people do like to rewatch things.
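One possible middle ground between "remove watched videos" and "leave them in" is to scale their score down instead of dropping them. The sketch below only illustrates that idea: `rerank` and `watched_penalty` are made-up names, and 0.5 is an arbitrary default, not a tuned value.

```python
def rerank(ranked: list[tuple[str, float]],
           watched: set[str],
           watched_penalty: float = 0.5) -> list[tuple[str, float]]:
    """Down-weight already-watched videos, then re-sort by adjusted score.

    ranked          -- (shortUUID, similarity) pairs from the raw ranking
    watched         -- shortUUIDs the user has already watched
    watched_penalty -- multiplier for watched videos: 0.0 removes them,
                       1.0 leaves the raw ranking unchanged
    """
    adjusted = []
    for uuid, similarity in ranked:
        if uuid in watched:
            if watched_penalty == 0.0:
                continue  # hard filter: never recommend rewatches
            similarity *= watched_penalty
        adjusted.append((uuid, similarity))
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)
```

With a penalty between 0 and 1, rewatches can still surface, but only when they match the profile much more strongly than the unwatched candidates.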

The algorithm currently just looks at how much time you spent watching unique segments of a video, then assigns a value in seconds to all the words in the title, description, and tags, and sums that over all videos.
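Roughly, that profile-building step could look like the Python sketch below. The field names (`segments`, `title`, `description`, `tags`), the tokenizer, and the choice to count each word once per video are all assumptions for illustration; the real script may differ on every one of them.

```python
import re
from collections import Counter


def unique_seconds(segments: list[tuple[float, float]]) -> float:
    """Length in seconds of the union of (start, end) watch segments,
    so rewatching the same part of a video is not counted twice."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(segments):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start  # close the previous merged run
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # overlapping: extend the current run
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def tokenize(text: str) -> list[str]:
    """Very rough tokenizer: lowercase ASCII letter/digit runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_profile(history: list[dict]) -> Counter:
    """Sum watched seconds onto every word of each watched video."""
    profile: Counter = Counter()
    for entry in history:
        seconds = unique_seconds(entry["segments"])
        text = " ".join([entry["title"], entry["description"], *entry["tags"]])
        for word in set(tokenize(text)):  # assumption: once per video
            profile[word] += seconds
    return profile
```

The resulting `Counter` is the kind of word-weight vector that could then be compared against each candidate video using cosine similarity.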

The algorithm is actually okay. Subjectively, it’s better than just sorting by date. I picked a few videos at random from the top 300 ranked by cosine similarity, and there was content interesting enough to watch for more than 30 seconds, as well as some that was just too weird for me. Here are a few examples:

Some of these links are across different instances because no single PeerTube instance has all the videos. I loaded metadata for over 6,000 videos across five instances during testing.

The question is: should the algorithm be scoped to a single instance (only looking at content on the user’s home instance), or should it recommend from any instance and take you there?

A funny thing to note is that there might be a Linux pipeline in this algo.

  • hendrik@palaver.p3x.de
    6 hours ago

    I think it needs to work across instances, since we’re concerned with the Fediverse and federation is one of its defining mechanics. Also, when I look at my subscriptions, they come from a variety of instances. So I don’t think a single-instance feature would be of any use to me.

    Sure. And with the cosine similarity, you’d obviously need to suppress already watched videos. Obviously I watched them and the algorithm knows, but I’d like it to recommend new videos to me.