I would like to fit a model for each factor level and then look up those fitted models by name on the fly to predict new data at the matching factor level. My prediction step fails. Can someone guide me, given the case below?
Aa <- data.frame(amount=c(1,2,1,2,1,1,2,2,1,1,1,2,2,2,1), cat1=sample(letters[21:24], 15,rep=TRUE),cat2=sample(letters[11:18], 5,rep=TRUE),
card=c("a","b","c","a","c","b","a","c","b","a","b","c","a","c","a"), delay=sample(c(1,1,0,0,0),5,rep=TRUE))
ModelFit<-sapply(as.character(unique(Aa[["card"]])), function(x)glm(delay~amount+cat1+cat2, family = "binomial", data = subset(Aa, card==x)), simplify = FALSE, USE.NAMES = TRUE)
Bb<-Aa[-(which(names(Aa) %in% "delay"))]
sapply(unique(Aa[["card"]]), function(x,y) predict(seq_along(x=ModelFit), newdata=DataOPEN[DataOPEN$SubsidiaryName],type="response"))
I have made this into a loop for simplicity. The prediction throws a warning, but seems to work. Your DataOPEN dataset was not provided, so I just calculated the prediction using the original Aa (new column pred). A final rounded version of the prediction is shown in column pred.round.
Aa <- data.frame(amount=c(1,2,1,2,1,1,2,2,1,1,1,2,2,2,1), cat1=sample(letters[21:24], 15,rep=TRUE),cat2=sample(letters[11:18], 5,rep=TRUE),
card=c("a","b","c","a","c","b","a","c","b","a","b","c","a","c","a"), delay=sample(c(1,1,0,0,0),5,rep=TRUE))
ModelFit <- sapply(as.character(unique(Aa[["card"]])), function(x)glm(delay~amount+cat1+cat2, family = "binomial", data = subset(Aa, card==x)), simplify = FALSE, USE.NAMES = TRUE)
Aa$pred <- NaN # create a new variable for prediction
for(i in names(ModelFit)){ # levels(Aa$card) is NULL when card is a character column (the default since R 4.0)
newdat <- subset(Aa, subset=card==i)
newdat$pred <- predict(ModelFit[[i]], newdata=newdat,type="response")
Aa$pred[match(rownames(newdat), rownames(Aa))] <- newdat$pred
}
Aa$pred.round <- round(Aa$pred) # a rounded prediction
Aa
The output:
> Aa
amount cat1 cat2 card delay pred pred.round
1 1 u p a 0 1.170226e-09 0
2 2 x o b 1 1.000000e+00 1
3 1 x o c 0 2.143345e-11 0
4 2 w m a 0 1.170226e-09 0
5 1 v n c 0 2.143345e-11 0
6 1 x p b 0 5.826215e-11 0
7 2 u o a 1 5.000000e-01 0
8 2 x o c 0 2.143345e-11 0
9 1 w m b 0 5.826215e-11 0
10 1 w n a 0 1.170226e-09 0
11 1 w p b 0 5.826215e-11 0
12 2 w o c 1 1.000000e+00 1
13 2 u o a 0 5.000000e-01 0
14 2 u m c 0 2.143345e-11 0
15 1 w n a 0 1.170226e-09 0
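For scoring genuinely new data, the same idea can be sketched as follows: keep the fitted models in a list named by factor level, then route each new row to the model whose name matches its level. The data here (`dat`, `newdat`) and the simplified formula `delay ~ amount` are made up for illustration and are not the asker's DataOPEN.

```r
# Hypothetical training data: one numeric predictor, a grouping column, a binary response
set.seed(1)
dat <- data.frame(
  amount = runif(60, 1, 2),
  card   = rep(c("a", "b", "c"), each = 20),
  stringsAsFactors = FALSE
)
dat$delay <- rbinom(60, 1, plogis(dat$amount - 1.5))

# One model per card level; split() names the list elements by level
fits <- lapply(split(dat, dat$card), function(d)
  glm(delay ~ amount, family = binomial, data = d))

# Hypothetical new data to score
newdat <- data.frame(amount = c(1.2, 1.8, 1.5), card = c("a", "c", "b"))
newdat$pred <- NA_real_

# Only levels that have a fitted model are scored; other rows stay NA
for (lev in intersect(names(fits), unique(newdat$card))) {
  rows <- newdat$card == lev
  newdat$pred[rows] <- predict(fits[[lev]],
                               newdata = newdat[rows, , drop = FALSE],
                               type = "response")
}
newdat
```

Looking models up by name (`fits[[lev]]`) rather than by position keeps the pairing correct even if the new data contain the levels in a different order.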
Let's say I have this data:
df <- data.frame (RR_Code = c( "848140", "848180", "848190", "848310", "848360", "848410", "848490", "850131", "850132", "850133"),
Model = c("X1", "FG", "FD", "XR", "RT", "FG", "CV", "GH", "ER", "RF"))
RR_Code Model
1 848140 X1
2 848180 FG
3 848190 FD
4 848310 XR
5 848360 RT
6 848410 FG
7 848490 CV
8 850131 GH
9 850132 ER
10 850133 RF
Now I want to create a new column based on filter.df: if RR_Code is included in filter.df the value is 1, otherwise 0.
filter.df <- c("848410", "848490", "850131", "850132")
Expected outcome:
RR_Code Model filter
1 848140 X1 0
2 848180 FG 0
3 848190 FD 0
4 848310 XR 0
5 848360 RT 0
6 848410 FG 1
7 848490 CV 1
8 850131 GH 1
9 850132 ER 1
10 850133 RF 0
We can use %in% and convert the logical result to 0/1 with the unary +:
df$filter <- +(df$RR_Code %in% filter.df)
df
# RR_Code Model filter
#1 848140 X1 0
#2 848180 FG 0
#3 848190 FD 0
#4 848310 XR 0
#5 848360 RT 0
#6 848410 FG 1
#7 848490 CV 1
#8 850131 GH 1
#9 850132 ER 1
#10 850133 RF 0
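As a small illustration of why this works: `%in%` returns a logical vector, and the unary `+` coerces it to integer, which is the same as calling `as.integer()` explicitly. The column name `filter2` below is only there for the comparison.

```r
df <- data.frame(RR_Code = c("848140", "848180", "848190", "848310", "848360",
                             "848410", "848490", "850131", "850132", "850133"),
                 Model = c("X1", "FG", "FD", "XR", "RT", "FG", "CV", "GH", "ER", "RF"))
filter.df <- c("848410", "848490", "850131", "850132")

is_member <- df$RR_Code %in% filter.df   # logical vector: TRUE where the code is listed

df$filter  <- +is_member                 # unary plus coerces TRUE/FALSE to 1/0
df$filter2 <- as.integer(is_member)      # the explicit equivalent

identical(df$filter, df$filter2)
```

The explicit `as.integer()` form is a little longer but may be easier to read for someone who has not seen the unary plus idiom.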
There are several ways of doing it. My first intuition would be to use case_when from dplyr.
library(dplyr)
df$new_column <- case_when(df$RR_Code %in% filter.df ~ 1,
TRUE ~ 0)
Which would return:
RR_Code Model new_column
1 848140 X1 0
2 848180 FG 0
3 848190 FD 0
4 848310 XR 0
5 848360 RT 0
6 848410 FG 1
7 848490 CV 1
8 850131 GH 1
9 850132 ER 1
10 850133 RF 0
There is something I do not understand about model.matrix. When I enter a single binary variable without an intercept, it returns two columns.
> temp.data <- data.frame('x' = sample(c('A', 'B'), 1000, replace = TRUE))
> temp.data.table <- model.matrix( ~ 0 + x, data = temp.data)
> head(temp.data.table)
xA xB
1 1 0
2 0 1
3 0 1
4 0 1
5 1 0
6 0 1
However, when I add another binary variable, it creates only 3 columns. Why is that? What makes the behavior of the function suddenly different, and how can I avoid it?
> temp.data <- data.frame('x' = sample(c('A', 'B'), 1000, replace = TRUE),
+ 'y' = sample(c('J', 'D'), 1000, replace = TRUE))
> temp.data.table <- model.matrix( ~ 0 + x + y, data = temp.data)
> head(temp.data.table)
xA xB yJ
1 0 1 0
2 0 1 1
3 0 1 1
4 0 1 0
5 1 0 1
6 0 1 0
You need to work with factors and set the contrasts to FALSE. Try this:
n <- 10
temp.data <- data.frame('x'=sample(c('A', 'B'), n, replace=TRUE),
'y'=factor(sample(c('J', 'D'), n, replace=TRUE)))
model.matrix( ~ 0 + x + y, data=temp.data,
contrasts.arg=list(y=contrasts(temp.data$y, contrasts=FALSE)))
# xA xB yD yJ
# 1 0 1 1 0
# 2 1 0 0 1
# 3 0 1 1 0
# 4 1 0 0 1
# 5 0 1 0 1
# 6 1 0 1 0
# 7 1 0 1 0
# 8 0 1 1 0
# 9 0 1 0 1
# 10 0 1 1 0
# attr(,"assign")
# [1] 1 1 2 2
# attr(,"contrasts")
# attr(,"contrasts")$x
# [1] "contr.treatment"
#
# attr(,"contrasts")$y
# D J
# D 1 0
# J 0 1
To understand why this happens, try:
contrasts(temp.data$y)
# J
# D 0
# J 1
contrasts(temp.data$y, contrasts=FALSE)
# D J
# D 1 0
# J 0 1
With your x variable this happens automatically, because 0 + removes the intercept. (Strictly, x should also be coded as a factor.)
The reason is that in linear regression the levels of a factor are usually compared to a reference level (which you can change using relevel). In your model matrix, 0 + removes the intercept, so the first factor gets a column for every level, but each following factor is still coded relative to its reference level (try model.matrix( ~ 0 + y + x, data=temp.data), where you get two y columns but only one x column). This follows from the default contrasts setting, which uses treatment contrasts.
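A minimal sketch of this order dependence, using a tiny made-up data frame `d`: whichever factor comes first after `0 +` gets a column per level, and the other loses its reference level. The last lines show why the default drops a column: with full dummy coding for both factors, the four columns are linearly dependent.

```r
d <- data.frame(x = factor(c("A", "B", "A", "B")),
                y = factor(c("J", "D", "D", "J")))

colnames(model.matrix(~ 0 + x + y, data = d))
# "xA" "xB" "yJ"
colnames(model.matrix(~ 0 + y + x, data = d))
# "yD" "yJ" "xB"

# Full dummy coding for both factors: 4 columns, but xA + xB equals yD + yJ,
# so the matrix has rank 3 only
full <- model.matrix(~ 0 + x + y, data = d,
                     contrasts.arg = list(x = contrasts(d$x, contrasts = FALSE),
                                          y = contrasts(d$y, contrasts = FALSE)))
ncol(full)
qr(full)$rank
```

So the full set of dummies is fine for building a design matrix by hand, but a regression fitted on it cannot estimate all four coefficients separately.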
You may want to read a relevant post of Rose Maier (2015) explaining this in great detail:
Contrasts in R
You need to reset the contrasts of the factor variables. See this post.
temp.data <- data.frame('x' = sample(c('A', 'B'), 1000, replace = TRUE),
'y' = sample(c('J', 'D'), 1000, replace = TRUE),
stringsAsFactors = TRUE) # contrasts() only works on factors
dat <- model.matrix(~ -1 + ., data=temp.data, contrasts.arg = lapply(temp.data[,1:2], contrasts, contrasts=FALSE))
head(dat)
xA xB yD yJ
1 0 1 0 1
2 1 0 0 1
3 1 0 0 1
4 1 0 0 1
5 0 1 1 0
6 0 1 0 1
I was wondering if there was a generic way to sort a symmetrical matrix in R, whilst preserving the diagonal values.
For example, if I have a matrix like this:
# Create Matrix -----------------------------------------------------------
offdiag <- c(rep(1:6))
m <- matrix(NA, ncol = 4, nrow = 4,dimnames = list(c("A","B","C","D"), c("A","B","C","D")))
m[lower.tri(m)] <- offdiag
m[upper.tri(m)] <- t(m)[upper.tri(t(m))]
diag(m) <- 0
m
which produces this:
A B C D
A 0 1 2 3
B 1 0 4 5
C 2 4 0 6
D 3 5 6 0
In the above example, C and D share the largest value. So what I am trying to achieve is to reorder the matrix so that the largest value is in the top left of the upper triangle (whilst not altering the diagonal 0's).
So if I were to explicitly rearrange the matrix, by hand, the final result would be:
# Create sorted matrix by hand --------------------------------------------
A <- c(2,3,0,1)
B <- c(4,5,1,0)
C <- c(0,6,2,4)
D <- c(6,0,3,5)
matr <- cbind(C,D,A,B)
rownames(matr) <- c("C","D","A","B")
matr
which would produce:
C D A B
C 0 6 2 4
D 6 0 3 5
A 2 3 0 1
B 4 5 1 0
What I'm wondering is: is there a generic way to sort an (n x n) matrix like in my example?
Maybe you can try the code below
q <- which(colSums(m == max(m)) > 0) # columns that contain the largest value
o <- c(q, seq(ncol(m))[-q])
mout <- m[o,o]
such that
> mout
C D A B
C 0 6 2 4
D 6 0 3 5
A 2 3 0 1
B 4 5 1 0
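The same idea can be wrapped in a small function for any symmetric n x n matrix. This is only a sketch: the function name `reorder_by_max` is made up, it masks the diagonal before searching (an addition, so that a large diagonal entry cannot be mistaken for the maximum), and it only moves the rows and columns holding the largest off-diagonal value to the front rather than fully sorting the rest.

```r
# Move the rows/columns that hold the largest off-diagonal value to the front
reorder_by_max <- function(m) {
  off <- m
  diag(off) <- -Inf                              # ignore the diagonal when searching
  first <- which(colSums(off == max(off)) > 0)   # columns containing the maximum
  ord <- c(first, setdiff(seq_len(ncol(m)), first))
  m[ord, ord]                                    # same permutation for rows and columns
}

m <- matrix(c(0, 1, 2, 3,
              1, 0, 4, 5,
              2, 4, 0, 6,
              3, 5, 6, 0),
            nrow = 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
res <- reorder_by_max(m)
res
```

Because rows and columns are permuted together, the result stays symmetric and the diagonal entries stay on the diagonal.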
I'm trying to generate a model.matrix that puts dummy variables for a categorical variable if it exists in either of a pair of factors. Here is an example:
group1 <- factor(c("A","A","A","A","B",
"B","B","C","C","D"),
levels=c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C",
"D","E","D","E","E"),
levels=levels(group1))
set.seed(8)
val <- rnorm(10,1,.25)
control1 <- rnorm(10,2,.5)
df <- data.frame(group1,
group2,
val,
control1)
This results in 10 rows for the (5*(5-1)/2) pairs of (A,B,C,D,E):
df
group1 group2 val control1
1 A B 0.9788535 1.620103
2 A C 1.2101000 2.146025
3 A D 0.8841293 2.210699
4 A E 0.8622912 1.352755
5 B C 1.1840101 2.034643
6 B D 0.9730296 1.593481
7 B E 0.9574277 2.755427
8 C D 0.7279171 1.864196
9 C E 0.2472371 2.779127
10 D E 0.8517064 1.881325
I want to control for a fixed effect in a linear model when a particular level is in either group1 or group2. I can construct a model matrix for this:
tmp1 <- model.matrix(~ 0+group1,df)
tmp2 <- model.matrix(~ 0+group2,df)
tmp3 <- (tmp1|tmp2)*1
tmp3
group1A group1B group1C group1D group1E
1 1 1 0 0 0
2 1 0 1 0 0
3 1 0 0 1 0
4 1 0 0 0 1
5 0 1 1 0 0
6 0 1 0 1 0
7 0 1 0 0 1
8 0 0 1 1 0
9 0 0 1 0 1
10 0 0 0 1 1
A few questions:
Doing it this way does not leave me a lot of options in terms of other covariates. How can I construct such a dummy variable as represented by the model matrix tmp3 and then use it in a call to lm with other covariates such as control1?
The idea is that there is a fixed effect on whether an individual (A,B,C,D,E) is in either group1 or group2. This seems like a reasonable assumption, but I haven't found any references. Am I missing something obvious or does this have a common name in statistics?
Thanks for any help.
I am not sure if model.matrix provides any options for this, but at least in your example you can reconstruct the matrix you are after without too much effort.
model_mat <- data.frame(tmp3[,-1], val = df$val, control1 = df$control1)
lm(val ~ ., data = model_mat)
You need to remove one of the dummies. I have removed A, but you can of course pick any of the others as the reference category.
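Another option, sketched here under the same setup, is to leave the membership dummies as a matrix and put that matrix straight into the formula next to the other covariates, since lm accepts matrix terms. The object name `member` is made up for this example.

```r
group1 <- factor(c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D"),
                 levels = LETTERS[1:5])
group2 <- factor(c("B", "C", "D", "E", "C", "D", "E", "D", "E", "E"),
                 levels = levels(group1))
set.seed(8)
df <- data.frame(group1, group2,
                 val = rnorm(10, 1, .25),
                 control1 = rnorm(10, 2, .5))

# 1 if the level appears in either group1 or group2 for that row
member <- (model.matrix(~ 0 + group1, df) | model.matrix(~ 0 + group2, df)) * 1
colnames(member) <- levels(group1)

# Drop column A as the reference category; control1 enters as an ordinary covariate
fit <- lm(val ~ member[, -1] + control1, data = df)
coef(fit)
```

The coefficient names come out as `member[, -1]B` and so on, which is less tidy than separate data frame columns but avoids rebuilding the data frame each time a covariate is added.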
Here's a solution using akrun's idea:
group1 <- factor(c("A","A","A","A","B",
"B","B","C","C","D"),
levels=c("A","B","C","D","E"))
group2 <- factor(c("B","C","D","E","C",
"D","E","D","E","E"),
levels=levels(group1))
set.seed(8)
val <- rnorm(10,1,.25)
control1 <- rnorm(10,2,.5)
df <- data.frame(group1,
group2,
val,
control1)
tmpval <- as.data.frame(Reduce('|',lapply(df[1:2], function(group) model.matrix(~0+group)))*1)
indf <- cbind(df,tmpval)
mod1 <- lm(val ~ 0+groupA+groupB+groupC+groupD+groupE,
indf)
summary(mod1)
I am working with multiple binary vectors e.g., A,B,C,D,E,F,G,H.
I want to find the classification between them. I have tried the following:
log_data<-read.csv(choose.files(), as.is = T, header = T, blank.lines.skip = TRUE)
data<-log_data[2:ncol(log_data)]
data
TIME A B C D E F G
1 1 1 1 0 1 0 1 1
2 0 0 1 1 1 1 0 1
3 1 1 1 1 1 0 1 1
4 1 0 1 1 1 1 0 1
.....................
fit <- network(data)
fit.prior <- jointprior(fit)
fit <- getnetwork(learn(fit,rats,fit.prior))
Error in postc0c(node$condposterior[[1]]$mu, node$condposterior[[1]]$tau, :
NA/NaN/Inf in foreign function call (arg 1)
I seem to get this error because all variables are treated as continuous and mu is NULL.
How should I proceed in order to classify after creating a network?