pearson.dist with flat spectra #344

Open
cbeleites opened this issue Jul 14, 2021 · 11 comments
@cbeleites
Owner

When calculating pearson.dist() (which is basically a scaled correlation between rows/spectra) with a perfectly flat spectrum, the result is NaN.

This is caused by the standardization of the data matrix: the variance within the flat spectrum is 0, so a division by 0 occurs.

  • The behaviour is consistent with cor(x, y) which returns NA in this case.
  • OTOH, one could argue that since the covariance with a flat spectrum is always 0, the correlation should also be 0 (and the Pearson distance 0.5).
    Besides making pearson.dist() work smoothly with flat spectra, this would let users distinguish the Pearson distance to a flat spectrum from situations where, e.g., NAs in the spectra cause the distance to be NA.

Opinions?

library(hyperSpec)

x <- flu - flu [3] + 200
plot(x)


pearson.dist (x)
#>              1            2            3            4            5
#> 2 0.0008858704                                                    
#> 3          NaN          NaN                                       
#> 4 0.9967988590 0.9950559547          NaN                          
#> 5 0.9984690049 0.9968275493          NaN 0.0014374616             
#> 6 0.9990662021 0.9977563018          NaN 0.0016757621 0.0006331176

cor (t(x[[1]]), t(x[[3]]))
#> Warning in cor(t(x[[1]]), t(x[[3]])): the standard deviation is zero
#>      [,1]
#> [1,]   NA
cov (t(x[[1]]), t(x[[3]]))
#>      [,1]
#> [1,]    0
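
The same behaviour can be reproduced without hyperSpec. The following is a minimal base-R sketch of the mechanism described above, assuming the distance is computed as (1 - r) / 2 from row-centered, row-scaled data; the toy matrix and variable names are made up for illustration and this is not meant as the package's actual code.

```r
# Toy data: rows are spectra; row 2 is perfectly flat (constant).
x <- rbind(
  c(1, 2, 4, 8),
  c(5, 5, 5, 5),
  c(8, 4, 2, 1)
)

# Center each row, then scale it to unit length.
centered <- x - rowMeans(x)
norms <- sqrt(rowSums(centered^2))  # zero for the flat row
scaled <- centered / norms          # 0 / 0 yields NaN in the flat row

# Row-wise correlations and the assumed (1 - r) / 2 form of the distance.
r <- tcrossprod(scaled)
d <- (1 - r) / 2

# Every entry involving the flat row is NaN ...
print(d)

# ... although the covariance with the flat row is well defined.
print(cov(x[1, ], x[2, ]))
```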
@cbeleites cbeleites added the Type: question ❔ Questions for all to consider. label Jul 14, 2021
@GegznaV
Collaborator

GegznaV commented Jul 14, 2021

By saying "flat spectrum", do you mean a spectrum in which all intensities are the same, i.e., constant?

@cbeleites
Owner Author

yes

@GegznaV
Collaborator

GegznaV commented Jul 14, 2021

  1. I see that in the documentation the distance is defined as D^2 = (1 - COR (x')) / 2. What does x' mean?

  2. In the abstract of https://arxiv.org/abs/1908.06029, a slightly different formula for the Pearson distance is given.

  3. How should we interpret the Pearson distance? What are the min and max boundaries and what are the "special points" (that show, e.g., min and max distance)?
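
On the question of bounds, here is a small base-R sketch. It assumes the (1 - r) / 2 form, i.e. the right-hand side of the documented definition quoted in point 1; whether that quantity or its square root should be called the distance is left open. `pearson_d` is a hypothetical helper written for this illustration, not a hyperSpec function.

```r
# Hypothetical helper: (1 - r) / 2 for two numeric vectors.
pearson_d <- function(a, b) (1 - cor(a, b)) / 2

a <- c(1, 2, 3, 4)

# Perfectly correlated (r = 1): lower bound, close to 0.
d_min <- pearson_d(a, 2 * a + 1)

# Uncorrelated (r = 0): midpoint, close to 0.5.
d_mid <- pearson_d(a, c(1, -1, -1, 1))

# Perfectly anti-correlated (r = -1): upper bound, close to 1.
d_max <- pearson_d(a, -a)

print(c(min = d_min, mid = d_mid, max = d_max))
```

Under this assumption the value lies between 0 and 1, and 0.5 corresponds to zero correlation, which is where the proposed value for a flat spectrum would sit.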

@GegznaV
Collaborator

GegznaV commented Jul 14, 2021

Here (https://rdrr.io/cran/rdist/man/rdist.html) there is also a slightly different formula:
[image: formula not preserved]

@GegznaV
Collaborator

GegznaV commented Jul 14, 2021

I feel that I do not have enough competence on this distance yet.
@claudia, could you suggest some good freely available but reliable sources on this type of distance? (I'm not sure if those I referenced are reliable).

@bryanhanson
Collaborator

@GegznaV I sent you a document on Slack that I have found very helpful.

@GegznaV
Collaborator

GegznaV commented Jul 16, 2021

Some thoughts:

If we think that a value should be returned instead of NaN for a "flat" spectrum, a warning must be issued in this case. I think a "flat" spectrum (saturated over the whole wavelength range, or with no signal) should be a tell-tale sign that something is wrong, and the user should know about the presence of spectra like this one.

We could have an argument for options:

  • for a regular algorithm that returns NaN;
  • for an algorithm that returns a number but with a warning;
  • for an algorithm that returns a number but without a warning (must not be the default);
  • ...
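
A rough sketch of how such an argument could look in base R. Everything here is hypothetical: the function name `pearson_dist_sketch`, the argument `flat`, and its values are made up for illustration, and the (1 - r) / 2 form is assumed. Treating the correlation with a flat spectrum as 0 also makes the distance between two flat spectra 0.5, which is one possible choice, not a settled one.

```r
# Hypothetical sketch, not the hyperSpec implementation.
# flat = "nan":    keep the natural NaN result for flat spectra
# flat = "warn":   treat the correlation with a flat spectrum as 0 and warn
# flat = "silent": same as "warn" but without the warning
pearson_dist_sketch <- function(x, flat = c("nan", "warn", "silent")) {
  flat <- match.arg(flat)
  x <- as.matrix(x)

  centered <- x - rowMeans(x)
  norms <- sqrt(rowSums(centered^2))

  # Exact comparison: only catches rows whose centered values are exactly 0.
  # Rows containing NA are not flagged, so they still propagate NA.
  is_flat <- !is.na(norms) & norms == 0

  scaled <- centered / norms
  if (flat != "nan" && any(is_flat)) {
    scaled[is_flat, ] <- 0  # correlation with a flat row becomes 0
    if (flat == "warn") {
      warning(sum(is_flat), " flat spectra found: correlation set to 0")
    }
  }

  as.dist((1 - tcrossprod(scaled)) / 2)
}

x <- rbind(c(1, 2, 4, 8), c(5, 5, 5, 5), c(8, 4, 2, 1))
print(pearson_dist_sketch(x))                   # NaN for the flat row
print(pearson_dist_sketch(x, flat = "silent"))  # 0.5 for the flat row
```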

As for the NaN result, I'm not convinced that the idea presented by @sangttruong is a "bad" one: an extremely small number could be added to a spectrum (e.g., at a single wavelength), which practically does not change the spectrum but avoids the mathematical constraint and gives a reasonable result.
I think we should consider this option, as it is common practice to add small numbers to matrices when a result cannot be computed in any other way (we did that during our math lectures).
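
To make the small-number idea concrete, here is a minimal base-R sketch under the assumed (1 - r) / 2 form. The value of `eps` and the helper `pearson_d` are arbitrary choices for illustration. It also shows one thing worth checking before adopting this approach: the result depends on which wavelength receives the small number.

```r
# Hypothetical helper: (1 - r) / 2 for two numeric vectors.
pearson_d <- function(a, b) (1 - cor(a, b)) / 2

a    <- c(1, 2, 4, 8)  # an ordinary spectrum
flat <- rep(5, 4)      # a perfectly flat spectrum
eps  <- 1e-8           # arbitrary "extremely small" number

# Add eps at the first wavelength ...
flat_first <- flat
flat_first[1] <- flat_first[1] + eps

# ... or at the last wavelength.
flat_last <- flat
flat_last[4] <- flat_last[4] + eps

d_first <- pearson_d(a, flat_first)
d_last  <- pearson_d(a, flat_last)

# Both are finite numbers, but they fall on opposite sides of 0.5.
print(c(first = d_first, last = d_last))
```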


And what value should be issued for two flat spectra? Is 0.5 reasonable as well in this scenario? Or should it be 1 as the shape of two spectra is identical and this kind of measure measures similarity of shape?

@bryanhanson
Collaborator

I think it is unwise to do anything other than what the functions naturally return (NaN). Users need to know there is an issue with their data. If the user wants to handle such data in a different way, it is not that hard to do so (but it probably depends on their final use of the information, which we don't know and shouldn't try to guess).

@GegznaV
Collaborator

GegznaV commented Jul 16, 2021

I agree that the original algorithm shouldn't be touched if it is widely accepted in that form, and that it is the user's responsibility to fix their data: we cannot think of all boundary conditions, and in some situations our "shortcut solution" could be even more unexpected.

Yet, we can create a section in the documentation on how to deal with situations like these and illustrate the situation with an example.
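
Such a documentation example might look roughly like this: a base-R sketch in which the user finds flat spectra first and then decides what to do with them (here they are simply dropped). The toy matrix, the row names, and the (1 - r) / 2 form are assumptions made for illustration.

```r
# Toy spectra matrix: rows are spectra, one of them is flat.
x <- rbind(
  a    = c(1, 2, 4, 8),
  flat = c(5, 5, 5, 5),
  b    = c(8, 4, 2, 1)
)

# A spectrum is flat if its minimum and maximum coincide.
is_flat <- apply(x, 1, function(s) diff(range(s)) == 0)

if (any(is_flat)) {
  message("Flat spectra: ", paste(names(is_flat)[is_flat], collapse = ", "))
}

# One possible way of handling them: exclude them before computing distances.
x_ok <- x[!is_flat, , drop = FALSE]
d <- as.dist((1 - cor(t(x_ok))) / 2)
print(d)
```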

@GegznaV
Collaborator

GegznaV commented Jul 28, 2021

@cbeleites, What should we do with this issue? Is it OK to leave NA, when the algorithm issues NA?

@cbeleites
Owner Author

Yes, but I've been looking into the paper you linked and realized that "Pearson distance" is quite ambiguous.
Squared vs. root and so on.

I'd like to sort this out for the release. I can open a new branch for this, though.
So for the moment, please leave this issue open.
