Products in multiple categories. Find similar products.

This is a draft

– product model

– product can be in  more than one categories, on multiple levels of the tree (i can attach the product anywhere i want); no restriction on this what so ever.

– categories are organized as a nested set (logical organized as a tree)

Problem: need to find similar products based on this taxonomy; sorting from ‘most similar’ to ‘least similar’

Important concepts

Nearest ancestor (of two categories) –
Level of the nearest ancestor
Set with categories and categories’ ancestors



Between 2 product categories (of 2 different products):

– nearest ancestor level: higher is better

– nearest ancestor level: having 1 level-2 common ancestor is better than having 2 level-1 common ancestors

– levels scoring grows with an alpha factor > 2 (mine is 2.5)

In conclusion: level 0 nearest ancestor: 0p1:10p2:25p3:62p 4:155p etc

Between 2 products:

– Score: best scoring nearest common ancestor + 20% of other nearest common ancestors for any of the products’ categories


– for each product category of a specific product, take the category and all of its ancestors

– make a set S, for every product, with all categories and all of categories ancestors (this set can be cached)

– for each 2 products: compute  intersection: I12 = S1 AND S2

– for that intersection (which contains common ancestors, meaning common parent categories) – compute the score, as shown before


– this ideea works for a reasonable number of categories and products; don’t code it for amazon 😛 ; this implementation can of course be better optimised

– this alghorithm gives a better score  products that are both in a very specific category (EG: Peripherals -> keyboards -> multimedia –> wireless )

– the similarities must be recalculated (for ALL products) on changing a product’s categorization ; however, the categories and the categories’ ancestors can be cached

– optimisation ideea: enhance the score if products have similar words (also exclude stop words)


One Response

  1. Hi. ‘Stumbled upon’ your post through WordPress’ tag surfer and thought this was an interesting problem.

    Have you considered building an ‘index’ (of sorts) containing the distance (physical on the tree or logical based on external rules) between categories (nodes) on the tree?

    The base data could simply be the distance between parent-child, and from this distances between any other two nodes in the category tree can be derived.

    Then, to calculate similarity between two products simply sum the distances between all categories that product belongs to and lowest distance => most similar. The downside to a simple sum is products with more categories become more ‘distant’, so maybe average (sum of all distances between all categories divided by # of categories) or a more heuristic aggregate function.

    Anyway, this is just a mind fart late at night after too much wine so ignore me if I’m not making sense. 🙂

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: