-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for mixed effects. #34
base: main
Are you sure you want to change the base?
Conversation
Hi @matthewwardrop, it's great news you're adding this to your Formulaic package! It's been an excellent source of inspiration for my formulae library. I'm happy to help you with this new feature. I'm not familiar with the codebase in Formulaic, but I'll try to give you some pointers that were useful to me when implementing this feature.
Zi = linalg.khatri_rao(Ji.T, Xi.T).T Suppose you have a term
This is much better explained in section 2.3 of Fitting Linear Mixed Effects Models using lme4.
library(lme4)
library(plot.matrix)
# Random effects matrices using `||`
f1 = ~ Days + (Days || Subject)
terms = mkReTrms(findbars(f1), model.frame(subbars(f1), data = sleepstudy))
dim(t(terms$Ztlist$`1 | Subject`)) # 180 18
dim(t(terms$Ztlist$`0 + Days | Subject`)) # 180 18
# Random effects matrices using `|`
f2 = ~ Days + (Days | Subject)
terms2 = mkReTrms(findbars(f2), model.frame(subbars(f2), data = sleepstudy))
dim(t(terms2$Ztlist$`Days | Subject`)) # 180 36
interleave_matrices = function(m1, m2) {
# m1 and m2 have the same number of columns
m = matrix(nrow = nrow(m1), ncol = 2 * ncol(m1))
for (j in seq(ncol(m1))) {
m[, j * 2 - 1] = m1[, j]
m[, j * 2] = m2[, j]
}
return(m)
}
m1 = interleave_matrices(
as.matrix(t(terms$Ztlist$`1 | Subject`)),
as.matrix(t(terms$Ztlist$`0 + Days | Subject`))
)
m2 = as.matrix(t(terms2$Ztlist$`Days | Subject`))
all(m1 == m2)
# TRUE
plot(as.matrix(m1), breaks = 10, col = viridis::viridis, main = "Double pipe ||")
plot(as.matrix(m2), breaks = 10, col = viridis::viridis, main = "Single pipe |")
# Conclusion:
# Using `||` gives separated matrices for the intercept and the slope. This means
# that b_0 ~ N and b_1 ~ N (the effects for the intercept and the slope are
# modeled as separated, uncorrelated, multivariate normal distributions).
# On the other hand, using `|` results in one matrix for both the intercept and
# the slope. Thus, (b_0, b_1) ~ N and the effects for the intercept and the slope
# have a joint multivariate normal distribution.
# In summary, using `||` does not change the columns of the design matrix, but
# how they are considered in terms of their correlation when fitting the model. Feel free to reach out to discuss further this new feature. It's great you're adding it to Formulaic! :D |
Thanks @tomicapretto for your prompt reply and discussion in this space! The Khatri Rao product is equivalent (unless I'm greatly mistaken) to the standard model matrix generation process (Cartesian product of colums generated by each factor vs. column-wise Kronecker product of the columns of the transposed matrices), which is nice because you don't need to do anything special. And yes... the Thanks again! And I'll ping you if/when I get something more solid! |
As far as I know, it's not necessary to enforce linear independence for the columns of the entire matrix Z, neither to enforce these to be independent of X. In fact, both lme4 and formulae can result in Z matrices where rank(Z) < ncol(Z). See the following example where there are two random effects for intercepts. model = lmer(diameter ~ 1 + (1|plate) + (1|sample), Penicillin)
mm = as.matrix(t(model@pp$`.->Zt`))
ncol(mm) # 30
qr(mm)$rank # 29 Next, we can see that the columns of Z for (1|plate) and the columns of Z for (1|sample) add up to a vector of ones, which is not a surprise since they are both matrices for random intercepts. f1 = ~ 1 + (1|plate) + (1|sample)
terms = mkReTrms(findbars(f1), model.frame(subbars(f1), data = Penicillin))
Z1 = as.matrix(t(terms$Ztlist$`1 | plate`)) # 144 24
Z2 = as.matrix(t(terms$Ztlist$`1 | sample`)) # 144 6
f1 = ~ 1 + (1|plate) + (1|sample)
terms = mkReTrms(findbars(f1), model.frame(subbars(f1), data = Penicillin))
Z1 = as.matrix(t(terms$Ztlist$`1 | plate`)) # 144 24
Z2 = as.matrix(t(terms$Ztlist$`1 | sample`)) # 144 6
all(apply(Z1, 1, sum) == 1) # TRUE
all(apply(Z1, 1, sum) == apply(Z1, 1, sum)) # TRUE
Also this can be useful. Note, however, the situation described in the post is not the same as the one described in my examle. In the example, we have two partitions of Z that are independent within (their columns are linearly independent) but dependent between (the sum of the columns of both blocks are equal). I'm sorry if this is not the right technical language, I hope I can transmit the idea. On the other hand, the example in MixedModels.jl docs uses a categorical variable that with full encoding (i.e. one column per level of the variable) without checking if that introduces linear dependencies. This last situation can't happen in formulae because it is prevented the same way as in the matrix of the fixed effects. Well, I hope this helps with your work. If I'm not very clear, please don't hesitate to ask ping me again 😄 |
03c05aa
to
e14be6c
Compare
Hi @tomicapretto , Finally circling back around to this. Formulaic has come a long way since this original draft PR, and now supports arbitrarily structured formulae. There are perhaps a few gaps, but this now works like: >>> from formulaic import Formula
>>> Formula("(a|g) + (b*c|g*h) + (d||i)")
root:
1
.group:
a|g + b|g + b|h + c|g + c|h + b:c|g + b:c|h + b|g:h + c|g:h + b:c|g:h
.group_independent:
d||i
>>> Formula("y ~ x + (a+b|g)")
.lhs:
y
.rhs:
root:
1 + x
.group:
a|g + b|g Compared to the prior draft, each of the group terms also preserve their rank under the The only thing left to do, which is perhaps optional given that the information about grouping is now in the model spec, is to cause the model matrices to preserve the '|' in the column names: >>> df = pandas.DataFrame({"a": [1,2,3,4], "g": list('wxyz')})
>>> mm = Formula("(a|g)").get_model_matrix(df)
>>> mm
root:
Intercept
0 1.0
1 1.0
2 1.0
3 1.0
.group:
g[T.w]:a g[T.x]:a g[T.y]:a g[T.z]:a
0 1 0 0 0
1 0 2 0 0
2 0 0 3 0
3 0 0 0 4
>>> mm.model_spec.group.terms
[a|g]
>>> mm.model_spec.group.term_indices
OrderedDict([(a|g, (0, 1, 2, 3))])
... Fixing this would be a little annoying and special-cased, but definitely doable. Thoughts? |
@matthewwardrop thanks for tagging me here! I'll provide feedback within this week! I'm excited about this change 😄 |
@matthewwardrop first of all, congrats because of all the progress with formulaic, it's definitely a fantastic library and I think I have never seen something even closer to this in the entire Python ecosystem. This s a wonderful solution. About the group-specific terms, the import pandas as pd
from formulaic import Formula
df = pd.DataFrame({"a": [1, 2, 3, 4], "g": list('wxyz')})
Formula("(a | g)").get_model_matrix(df)
Formula("(1 + a | g)").get_model_matrix(df)
On top of that, there are a couple of things about names that could be changed
I don't think these points would be critical for a developer trying to use formulaic since these are objects within the Another issue. Since the LHS of See the following examples import numpy as np
import pandas as pd
from formulaic import Formula
from formulae import design_matrices
df = pd.DataFrame({"g": np.repeat(list("ABCD"), 3), "x1": np.tile(list("xyz"), 4)}) Using formulae dm = design_matrices("(x1 | g)", df)
dm.group
dm.group.terms["x1|g"].labels
If we drop the intercept from the LHS of the group-specific term, then we have all the levels of dm = design_matrices("(0 + x1 | g)", df)
print(dm.group)
print(dm.group.terms["x1|g"].labels)
And if we use formulaic, this is still not happening Formula("(0 + x1 | g)").get_model_matrix(df).group.columns
Formula("(1 + x1 | g)").get_model_matrix(df).group.columns Index(['g[T.A]', 'g[T.B]', 'g[T.C]', 'g[T.D]', 'g[T.A]:x1[T.x]',
'g[T.A]:x1[T.y]', 'g[T.A]:x1[T.z]', 'g[T.B]:x1[T.x]', 'g[T.B]:x1[T.y]',
'g[T.B]:x1[T.z]', 'g[T.C]:x1[T.x]', 'g[T.C]:x1[T.y]', 'g[T.C]:x1[T.z]',
'g[T.D]:x1[T.x]', 'g[T.D]:x1[T.y]', 'g[T.D]:x1[T.z]'],
dtype='object') |
This would be the last example using > library(lme4)
>
> df <- data.frame(
+ g = rep(LETTERS[1:4], each = 3),
+ x1 = rep(letters[24:26], times = 4)
+ )
>
> print(df)
g x1
1 A x
2 A y
3 A z
4 B x
5 B y
6 B z
7 C x
8 C y
9 C z
10 D x
11 D y
12 D z
>
> f1 <- ~ (x1 | g)
> f2 <- ~ (0 + x1 | g)
>
> lme4_terms_f1 <- mkReTrms(findbars(f1), model.frame(subbars(f1), data = df))
> lme4_terms_f2 <- mkReTrms(findbars(f2), model.frame(subbars(f2), data = df))
>
> names(lme4_terms_f1$Ztlist)
[1] "x1 | g"
> names(lme4_terms_f2$Ztlist)
[1] "0 + x1 | g"
>
> x1_g_f1 <- as.matrix(t(lme4_terms_f1$Ztlist$`x1 | g`))
> x1_g_f2 <- as.matrix(t(lme4_terms_f2$Ztlist$`0 + x1 | g`))
>
> print(x1_g_f1)
A A A B B B C C C D D D
1 1 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0 0 0 0 0 0 0 0
3 1 0 1 0 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0 0 0
5 0 0 0 1 1 0 0 0 0 0 0 0
6 0 0 0 1 0 1 0 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0 0 0
8 0 0 0 0 0 0 1 1 0 0 0 0
9 0 0 0 0 0 0 1 0 1 0 0 0
10 0 0 0 0 0 0 0 0 0 1 0 0
11 0 0 0 0 0 0 0 0 0 1 1 0
12 0 0 0 0 0 0 0 0 0 1 0 1
> print(x1_g_f2)
A A A B B B C C C D D D
1 1 0 0 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0 0 0
8 0 0 0 0 0 0 0 1 0 0 0 0
9 0 0 0 0 0 0 0 0 1 0 0 0
10 0 0 0 0 0 0 0 0 0 1 0 0
11 0 0 0 0 0 0 0 0 0 0 1 0
12 0 0 0 0 0 0 0 0 0 0 0 1
>
> print(lme4_terms_f1$cnms)
$g
[1] "(Intercept)" "x1y" "x1z"
> print(lme4_terms_f2$cnms)
$g
[1] "x1x" "x1y" "x1z" Edit I'm more than happy to help with whatever I can help :D |
Thanks for your detailed comments above @tomicapretto , and apologies that it has taken me years (?!) to get around to processing it. Life has been pretty hectic! I'm going to take another stab at getting this merged in for version 1.1 that should be released later this month. I'll test against the above datasets, and formulae. |
Amazing! Ping me if you want me to test anything :) |
dd375ec
to
e7250bc
Compare
Hi @tomicapretto,
I stumbled across your cool formulae project recently, and it piqued my interest. One of the primary goals of
formulaic
was to be easily extensible to support new use-cases, much like that offormulae
. As such, I wanted to try my hand at a quick extension to Formulaic that would support your use-case, while keeping it general so that others can plug into it also. I don't necessarily want you to throw away your excellent work on Formulae, nor is this PR immediately ready for merging... but I'd love to get your take on this. Perhaps there is opportunity for synergy.With this PR, you can do:
For something like this to be tenable for your use-cases, what further additions would be desirable? One thing I think could definitely be added pretty easily is aliasing of the terms to match the "|" notation.