Normalization of a multi-dimensional space: what algorithm is this?
I'm not a trained statistician, so I apologize for any incorrect terminology. I'm just trying to get good results from the Weka nearest-neighbor algorithms, and I'll explain with some redundancy to get the concept across:
Is there a way to normalize a multi-dimensional space so that the distance between any two instances is always proportional to their effect on the dependent variable?
In other words, I have a statistical data set and I want to use a "nearest neighbor" algorithm to find the instances most similar to a specified test instance. Unfortunately my initial results are useless, because attributes that are only weakly correlated with the dependent variable count just as heavily in the distance calculation as attributes that strongly predict it, which incorrectly biases the results.
For example, let's say you're trying to find the nearest neighbor of a given car in a database of cars described by make, model, year, color, engine size, and number of doors. We know intuitively that make, model, and year have a bigger effect on price than the number of doors. So a car that matches only on color and door count may not really be nearer than a car with a different color and door count but the same make, model, and year. What algorithm(s) can be used to set the weight of each independent variable in the nearest-neighbor distance calculation so that the distance is statistically proportional (correlated, whatever) to the dependent variable?
Application: this could power a more accurate "show me products similar to this one" feature on shopping websites. Back to the car example, cars of the same make and model would bubble up to the top, with year used as a tie-breaker, and within cars of the same year, the ones with the same number of cylinders (4 or 6) would sort ahead of the ones with the same number of doors (2 or 4). I'm looking for an algorithmic way to derive weights like the ones I know intuitively (make >> model >> year >> engine >> doors) and to assign them actual numerical values for use in the nearest-neighbor search for similar cars.
A more specific example:
Data set:
Blue,Honda,6-cylinder
Green,Toyota,4-cylinder
Blue,BMW,4-cylinder
Now find cars similar to:
Blue,Honda,4-cylinder
In this limited example, it would match the Green,Toyota,4-cylinder ahead of the Blue,Honda,6-cylinder, because the two brands are statistically almost interchangeable in price and cylinder count is a stronger determinant of price than color. The BMW would rank lower because that brand tends to double the price, i.e. it sits at a larger distance.
Final note: the prices are available while training the algorithm, but not at query time.
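For concreteness, here is a minimal sketch of the kind of weighting being asked for, in Python with scikit-learn rather than Weka. Everything concrete in it is hypothetical: the cars.csv file, its column names, and the price column. It learns a weight per attribute from the prices at training time (mutual information between each attribute and price) and then uses those weights in a Hamming-style distance at query time, when prices are no longer available.

```python
# Minimal sketch: learn attribute weights from prices at training time,
# then rank by a weighted Hamming distance at query time.
# ("cars.csv" and its columns are hypothetical stand-ins for real data.)
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import OrdinalEncoder

df = pd.read_csv("cars.csv")
X_raw = df[["make", "model", "year", "color", "engine", "doors"]]
y = df["price"]  # available during training only

# Encode the categorical attributes as integers so we can estimate how
# much information each attribute carries about the price.
enc = OrdinalEncoder()
X = enc.fit_transform(X_raw)

# Mutual information with price becomes each attribute's weight:
# attributes that say a lot about price dominate the distance.
weights = mutual_info_regression(X, y, discrete_features=True)

def weighted_distance(a, b):
    """Each mismatching attribute costs its learned weight."""
    return sum(w for w, u, v in zip(weights, a, b) if u != v)

# Example: rank the training cars by distance to the first car.
query = X[0]
ranked = sorted(X, key=lambda row: weighted_distance(row, query))
```

With this scheme, a mismatch on make (high mutual information with price) pushes a car much further away than a mismatch on color, which is exactly the intuitive ordering described above.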
Solution 1:[1]
Possibly you should look at Solr/Lucene for this. Solr provides similarity search based on field value frequencies, and it already has MoreLikeThis functionality for finding similar items.
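As a rough illustration, a MoreLikeThis query could look like the sketch below. The "cars" core, the document id, and the field names are assumptions; mlt.fl, mlt.mintf, and mlt.mindf are standard MoreLikeThis parameters.

```python
# Minimal sketch of querying Solr's MoreLikeThis handler from Python.
# The "cars" core, the document id, and the field names are hypothetical.
import requests

params = {
    "q": "id:123",                # the car to find neighbors for
    "mlt.fl": "make,model,year",  # fields used to judge similarity
    "mlt.mintf": 1,               # count terms that occur only once...
    "mlt.mindf": 1,               # ...and in only one document
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/cars/mlt", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc)
```

One caveat: MoreLikeThis weights terms by frequency statistics (tf-idf style), not by correlation with price, so it answers "similar" in a somewhat different sense than the question asks.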
Solution 2:[2]
Maybe nearest neighbor is not a good algorithm for this case? Since you want to compare discrete values, it can become quite hard to define reasonable distances. I think a C4.5-like algorithm may better suit the application you describe: at each step the algorithm optimizes the information entropy, so it always selects the feature that gives the most information.
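A minimal sketch of this idea, using scikit-learn's CART trees rather than C4.5 proper (the toy prices below are invented purely for illustration): the tree's feature importances reflect how much each attribute's splits reduce impurity, and those importances can double as attribute weights for a nearest-neighbor distance.

```python
# Minimal sketch of the decision-tree idea. scikit-learn implements CART
# rather than C4.5, but the information-gain intuition is the same.
# The prices below are invented purely for illustration.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

rows = [["Blue", "Honda", "6-cylinder"],
        ["Green", "Toyota", "4-cylinder"],
        ["Blue", "BMW", "4-cylinder"],
        ["Red", "Honda", "4-cylinder"]]
prices = [21000, 18500, 38000, 18900]  # hypothetical

X = OrdinalEncoder().fit_transform(rows)
tree = DecisionTreeRegressor(random_state=0).fit(X, prices)

# How much each attribute contributes to explaining the price; here
# "make" should dominate, since BMW roughly doubles the price.
print(dict(zip(["color", "make", "cylinders"], tree.feature_importances_)))
```

The two approaches can also combine rather than compete: derive the weights from the tree, then plug them into the nearest-neighbor distance.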
Solution 3:[3]
Found something on the IEEE website. The algorithm is called DKNDAW ("dynamic k-nearest-neighbor with distance and attribute weighted"). I couldn't locate the actual paper (it probably requires a paid subscription). It looks very promising, assuming the attribute weights are computed by the algorithm itself.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Author |
|---|---|
| Solution 1 | Alexander Kuznetsov |
| Solution 2 | Ivaylo Strandjev |
| Solution 3 | Glorfindel |
