Have you ever wondered how to calculate the similarity between two sets of observations from different random variables? For instance, let’s say you’re comparing the prices of a car part from different retailers. You have ‘n’ observations of prices from one retailer, and ‘m’ observations from another. Both sets of prices follow a log-normal distribution. But how do you define a distance that shows how close these two distributions are?
One approach could be to estimate a central value of each distribution (the mean, or the geometric mean, which is the natural choice for log-normal data) from the observations, and then measure the distance between these two estimates. But is this distance good enough? Does it capture the uncertainty that comes with having only a small number of observations?
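As a rough sketch of this naive approach (the arrays prices_a and prices_b below are made-up placeholders, not real retailer data), you can work on the log scale, where a log-normal sample becomes normal, and compare geometric means:

```python
import numpy as np

# Hypothetical price observations from two retailers (placeholder data).
prices_a = np.array([120.0, 135.5, 128.0, 142.3, 131.1])          # n observations
prices_b = np.array([118.0, 150.2, 139.9, 145.7, 133.4, 129.8])   # m observations

# For log-normal data, the geometric mean is a natural central value:
# it is exp(mean of the log-prices).
geo_mean_a = np.exp(np.mean(np.log(prices_a)))
geo_mean_b = np.exp(np.mean(np.log(prices_b)))

# Naive distance: the absolute gap between the two point estimates.
# Note that this single number says nothing about how uncertain each estimate is.
naive_distance = abs(geo_mean_a - geo_mean_b)
print(f"geometric means: {geo_mean_a:.2f} vs {geo_mean_b:.2f}, distance: {naive_distance:.2f}")
```

With only a handful of observations per retailer, this point-estimate distance can swing a lot from sample to sample, which is exactly the concern raised above.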
In this case, the objective is to estimate the similarity between two car models by comparing the distributions of their part prices, part by part. This raises an interesting question: how do you quantify the similarity between two distributions?
One possible solution is to use a distance that takes the uncertainty in the observations into account. For example, you could take a Bayesian approach: model each set of prices, then derive the posterior distribution of the distance between the two distributions (for instance, the distance between their log-scale means). Instead of a single number, you get a whole distribution over plausible distances, which gives you a more nuanced view of how similar the two really are.
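One deliberately simple way to sketch this, assuming a normal model for the log-prices with a flat prior, is to use the fact that the posterior of each log-scale mean is then a shifted, scaled Student-t; you can draw from both posteriors and look at the posterior of their distance. The data below are the same made-up placeholders as before.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical log-prices from the two retailers (placeholder data).
log_a = np.log(np.array([120.0, 135.5, 128.0, 142.3, 131.1]))
log_b = np.log(np.array([118.0, 150.2, 139.9, 145.7, 133.4, 129.8]))

def posterior_mu_samples(log_prices, size=20_000):
    """Posterior draws of the log-scale mean under a normal model with a
    flat prior: mu | data follows a Student-t with n-1 degrees of freedom,
    centered at the sample mean and scaled by the standard error."""
    n = len(log_prices)
    mean, sd = log_prices.mean(), log_prices.std(ddof=1)
    return stats.t.rvs(df=n - 1, loc=mean, scale=sd / np.sqrt(n),
                       size=size, random_state=rng)

mu_a = posterior_mu_samples(log_a)
mu_b = posterior_mu_samples(log_b)

# Posterior of the distance between the two log-scale means.
distance = np.abs(mu_a - mu_b)
print(f"posterior mean distance: {distance.mean():.3f}")
print(f"95% credible interval: {np.percentile(distance, [2.5, 97.5])}")
```

With few observations, the credible interval on the distance will be wide, which is precisely the uncertainty the point-estimate approach hides.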
Another approach could be to use a non-parametric two-sample test, such as Kolmogorov-Smirnov or Mann-Whitney U, to compare the two samples. This lets you test whether the two distributions differ without assuming a particular parametric form for the prices.
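As a sketch of this route (again on made-up placeholder prices), SciPy ships both of these tests out of the box:

```python
import numpy as np
from scipy import stats

prices_a = np.array([120.0, 135.5, 128.0, 142.3, 131.1])
prices_b = np.array([118.0, 150.2, 139.9, 145.7, 133.4, 129.8])

# Two-sample Kolmogorov-Smirnov test: compares the full empirical CDFs,
# with no log-normal (or other parametric) assumption.
ks_stat, ks_p = stats.ks_2samp(prices_a, prices_b)

# Mann-Whitney U test: a rank-based comparison of the two samples.
mw_stat, mw_p = stats.mannwhitneyu(prices_a, prices_b, alternative="two-sided")

print(f"Kolmogorov-Smirnov: statistic={ks_stat:.3f}, p-value={ks_p:.3f}")
print(f"Mann-Whitney U:     statistic={mw_stat:.1f}, p-value={mw_p:.3f}")
```

Keep in mind that a test answers "are these two samples plausibly from the same distribution?" rather than "how far apart are they?", so it complements, rather than replaces, a distance like the ones above.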
Ultimately, the choice of distance metric depends on the specific problem you’re trying to solve and the characteristics of the data. But by carefully considering the options, you can develop a more robust and meaningful measure of similarity between two distributions.