pearson.dist with flat spectra #344

cbeleites · 2021-07-14T11:59:48Z

When calculating pearson.dist() (which is basically a scaled correlation between rows/spectra) with a perfectly flat spectrum, the result is NaN.

This is caused by the standardization of the data matrix: the variance within the flat spectrum is 0, so a division by 0 occurs.

The behaviour is consistent with cor(x, y) which returns NA in this case.
OTOH, we may say that since the covariance with a flat spectrum is always 0, the correlation should also be 0 (and the Pearson distance 0.5).
Besides making it possible to work smoothly with flat spectra in pearson.dist(), this would allow users to distinguish the Pearson distance to a flat spectrum from situations where, e.g., NAs in the spectra cause the distance to be NA.

Opinions?

library(hyperSpec)

x <- flu - flu [3] + 200
plot(x)


pearson.dist (x)
#>              1            2            3            4            5
#> 2 0.0008858704                                                    
#> 3          NaN          NaN                                       
#> 4 0.9967988590 0.9950559547          NaN                          
#> 5 0.9984690049 0.9968275493          NaN 0.0014374616             
#> 6 0.9990662021 0.9977563018          NaN 0.0016757621 0.0006331176

cor (t(x[[1]]), t(x[[3]]))
#> Warning in cor(t(x[[1]]), t(x[[3]])): the standard deviation is zero
#>      [,1]
#> [1,]   NA
cov (t(x[[1]]), t(x[[3]]))
#>      [,1]
#> [1,]    0
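
The mechanism can also be reproduced without hyperSpec. The following is a minimal base-R sketch of the standardization step described above, using a made-up 3 x 5 matrix `spc`; it is an illustration under the documented definition D^2 = (1 - cor) / 2, not the package's actual implementation.

```r
# Base-R sketch of the standardization step (not hyperSpec's actual code).
# `spc` is made-up example data: one spectrum per row.
spc <- rbind(
  a    = c(1, 2, 3, 4, 5),
  b    = c(2, 3, 5, 4, 6),
  flat = rep(200, 5)                   # constant ("flat") spectrum
)

centered <- spc - rowMeans(spc)        # the flat row becomes all zeros
norms    <- sqrt(rowSums(centered^2))  # so its norm is exactly 0
scaled   <- centered / norms           # 0 / 0 gives NaN for the flat row

r <- tcrossprod(scaled)                # correlations between rows
d <- (1 - r) / 2                       # distance as defined in the docs

norms["flat"]                          # zero norm of the flat spectrum
d["flat", ]                            # every distance to it is NaN
```

The NaN therefore arises in the scaling step, before any correlation is formed, which is why every entry involving the flat spectrum is affected.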


GegznaV · 2021-07-14T12:54:39Z

By saying "flat spectrum", do you mean a spectrum in which all intensities are the same, i.e., constant?

cbeleites · 2021-07-14T13:01:27Z

yes

GegznaV · 2021-07-14T13:08:53Z

I see that in the documentation the distance is defined as D^2 = (1 - COR (x')) / 2. What does x' mean?
In the abstract of https://arxiv.org/abs/1908.06029, a slightly different formula for the Pearson distance is given.
How should we interpret the Pearson distance? What are the min and max boundaries and what are the "special points" (that show, e.g., min and max distance)?
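
As a small numerical check of the boundaries, assuming the documented definition D^2 = (1 - cor) / 2 (other sources define the distance differently), the three "special points" can be computed directly. The helper `pearson_d2()` and the example vectors are hypothetical and serve only as illustration.

```r
# Assumes the documented definition D^2 = (1 - cor) / 2; other sources differ.
# `pearson_d2()` is a hypothetical helper, not a hyperSpec function.
pearson_d2 <- function(x, y) (1 - cor(x, y)) / 2

x <- c(1, 2, 3, 4, 5)

d_same     <- pearson_d2(x, 2 * x + 1)          # same shape, cor = 1
d_mirrored <- pearson_d2(x, -x)                 # mirrored shape, cor = -1
d_uncorr   <- pearson_d2(x, c(1, 0, -2, 0, 1))  # uncorrelated, cor = 0

c(same = d_same, mirrored = d_mirrored, uncorrelated = d_uncorr)
```

Under this definition the value runs from 0 (identical shape, regardless of offset and positive scaling) through 0.5 (no correlation) to 1 (perfectly anti-correlated).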

GegznaV · 2021-07-14T13:11:45Z

Here (https://rdrr.io/cran/rdist/man/rdist.html) a slightly different formula is given as well.

GegznaV · 2021-07-14T13:15:57Z

I feel that I do not know enough about this distance yet.
@claudia, could you suggest some good freely available but reliable sources on this type of distance? (I'm not sure if those I referenced are reliable).

bryanhanson · 2021-07-14T13:34:36Z

@GegznaV I sent you a document on Slack that I have found very helpful.

GegznaV · 2021-07-16T11:26:55Z

Some thoughts:

If we think that a value should be returned instead of NaN for a "flat" spectrum, a warning must be issued in this case. I think a "flat" spectrum (saturated over the whole wavelength range, or with no signal) should be a telltale sign that something is wrong, and the user should know about the presence of spectra like this one.

We could have an argument for options:

for a regular algorithm that returns NaN;
for an algorithm that returns a number but with a warning;
for an algorithm that returns a number but without a warning (must not be the default);
...
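
One way such an argument could look is sketched below. This is only an illustration: the function `pearson_dist_flat()`, the argument `flat`, its values, and the `value` argument are all hypothetical names and not part of hyperSpec. The sketch also gives pairs of two flat spectra the same value, which is an open question rather than a decision.

```r
# Hypothetical wrapper; `pearson_dist_flat`, `flat` and `value` are
# illustrative names only, not part of hyperSpec.
pearson_dist_flat <- function(m, flat = c("nan", "warn", "silent"),
                              value = 0.5) {
  flat <- match.arg(flat)

  # A spectrum (row) counts as flat when all its intensities are identical.
  # Rows containing NA are not flagged, so they stay distinguishable.
  is_flat <- apply(m, 1, function(s) isTRUE(max(s) == min(s)))

  # cor() warns about zero standard deviation; flat rows are handled below.
  d <- (1 - suppressWarnings(cor(t(m)))) / 2

  if (any(is_flat)) {
    fill <- if (flat == "nan") NaN else value
    d[is_flat, ] <- fill
    d[, is_flat] <- fill
    if (flat == "warn") {
      warning(sum(is_flat), " flat spectra: distance set to ", value)
    }
  }
  as.dist(d)
}

# Made-up example data: the third spectrum is flat.
m <- rbind(c(1, 2, 3, 4, 5), c(5, 3, 4, 1, 2), rep(200, 5))

pearson_dist_flat(m)                   # default: NaN for the flat spectrum
pearson_dist_flat(m, flat = "silent")  # 0.5 instead, without a warning
```

Keeping "nan" as the default leaves the current behaviour unchanged unless the user opts in.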

In the case of the NaN result, I'm not convinced that the idea presented by @sangttruong is a "bad" one: an extremely small number could be added to a spectrum (e.g., at a single wavelength), which practically does not change the spectrum but helps to avoid the mathematical constraint and get a reasonable result.
I think we should consider this option, as it is common practice to add small numbers to matrices when a solution cannot be found in another way (we did that during our math lectures).
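
To see what this would give in practice, here is a minimal base-R sketch with made-up data, again assuming D^2 = (1 - cor) / 2. The offset `eps` and the helper `d_perturbed()` are hypothetical. Because correlation is invariant to scaling, the result does not depend on the size of the offset but on the wavelength at which it is added.

```r
x    <- c(1, 2, 3, 4, 5)  # made-up non-flat spectrum
flat <- rep(200, 5)       # flat spectrum
eps  <- 1e-6              # "extremely small" offset (arbitrary choice)

# Distance (1 - cor) / 2 after adding eps at wavelength index j.
# `d_perturbed()` is a hypothetical helper for this illustration.
d_perturbed <- function(j) {
  y <- flat
  y[j] <- y[j] + eps
  (1 - cor(x, y)) / 2
}

d_eps <- sapply(c(first = 1, middle = 3, last = 5), d_perturbed)
d_eps  # roughly 0.85, 0.50, 0.15
```

In this example the offset yields a finite distance, but the value ranges from about 0.15 to 0.85 depending on the chosen wavelength, so that choice would need to be fixed and documented.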

And what value should be issued for two flat spectra? Is 0.5 reasonable as well in this scenario? Or should it be 1 as the shape of two spectra is identical and this kind of measure measures similarity of shape?

bryanhanson · 2021-07-16T14:27:15Z

I think it is unwise to do anything other than what the functions naturally return (NaN). Users need to know there is an issue with their data. If the user wants to handle such data in a different way, it is not that hard to do so (but it probably depends on their final use of the information, which we don't know and shouldn't try to guess).

GegznaV · 2021-07-16T15:08:24Z

I agree that the original algorithm shouldn't be touched if it is widely accepted in that form, and that it is the user's responsibility to fix their data: we cannot think of all boundary conditions, and in some situations our "shortcut solution" could be even more unexpected.

Yet, we can create a section in the documentation on how to deal with situations like these and illustrate the situation with an example.
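
Such a documentation example might look like the following base-R sketch, which flags flat spectra before the distance is computed and leaves the decision to the user. The data are made up, and the distance is computed with `cor()` under the documented definition D^2 = (1 - cor) / 2 rather than with the package function.

```r
# Made-up example data: one spectrum per row, the last one is flat.
spc <- rbind(
  a    = c(1, 2, 3, 4, 5),
  b    = c(2, 3, 5, 4, 6),
  flat = rep(200, 5)
)

# Flag spectra whose intensities are all identical.
is_flat <- apply(spc, 1, function(s) isTRUE(max(s) == min(s)))

if (any(is_flat)) {
  message("Flat spectra found: ", paste(names(which(is_flat)), collapse = ", "))
}

# One possible user decision: exclude the flat spectra, then compute
# the distance on the remaining rows.
ok <- spc[!is_flat, , drop = FALSE]
d  <- as.dist((1 - cor(t(ok))) / 2)
d
```

Excluding the spectra is only one choice; depending on the analysis, a user might instead keep them and treat the resulting NaN entries explicitly.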

GegznaV · 2021-07-28T00:12:19Z

@cbeleites, what should we do with this issue? Is it OK to leave NA when the algorithm returns NA?

cbeleites · 2021-07-30T20:58:34Z

Yes, but I've been looking into the paper you linked and realized that "Pearson distance" is quite ambiguous.
Squared vs. root and so on.

I'd like to sort this out for the release. I can open a new branch for this, though.
So for the moment, please leave this issue open.

cbeleites added the Type: question ❔ Questions for all to consider. label Jul 14, 2021

GegznaV mentioned this issue Jul 28, 2021

Feature/19 rename to dist_pearson()/22 update visual unit tests r-hyperspec/hyperSpec#20

Merged

cbeleites self-assigned this Jul 30, 2021
