I have a 1000 x 16 matrix from a simulation, with team names stored as characters. I want to count the number of occurrences of each team in each of the 16 columns.
I know I could do apply(test, 2, table), but that makes the data hard to work with afterward, since not all teams are included in every column.
If you have a vector of all the unique team names, you could do something like this. Note that the example counts per row (MARGIN = 1): each row holds only 16 values, so not every team (in this case, letter) can appear in it. Pass 2 instead of 1 to count per column.
set.seed(15)
letter_mat <- matrix(
sample(
LETTERS,
size = 1000*16,
replace = TRUE
),
ncol = 16,
nrow = 1000
)
output <- t(
apply(
letter_mat,
1,
function(x) table(factor(x, levels = LETTERS))
)
)
head(output)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
[1,] 1 2 0 1 1 1 1 0 0 0 1 0 0 0 0 1 1 1 1 1 0 1 1 0 0 1
[2,] 0 1 0 2 2 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 2 2 1
[3,] 1 1 0 0 1 0 1 2 1 0 0 0 0 0 1 0 1 0 1 1 0 0 3 0 1 1
[4,] 0 1 0 0 0 1 0 0 0 2 0 1 0 0 1 1 1 1 2 0 2 3 0 0 0 0
[5,] 2 1 0 0 0 0 0 2 0 2 1 1 1 0 0 2 0 2 1 0 0 1 0 0 0 0
[6,] 0 0 0 0 0 1 3 1 0 0 0 0 1 1 3 0 1 0 0 1 0 0 0 1 0 3
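For the per-column counts asked about in the question, a single table() call may be enough. This is only a sketch under the same assumption as above (LETTERS stands in for the vector of team names); the names counts, team and column are made up for illustration.

```r
set.seed(15)
letter_mat <- matrix(sample(LETTERS, size = 1000 * 16, replace = TRUE), ncol = 16)

# Cross-tabulate team against column index in one call.
# Fixing the factor levels keeps a row for every team, even if its count is 0.
counts <- table(
  team   = factor(as.vector(letter_mat), levels = LETTERS),
  column = as.vector(col(letter_mat))
)

dim(counts)          # one row per team, one column per matrix column
head(counts[, 1:4])
```

The result is an ordinary table with teams as rows, so counts["A", 3] is the number of times team "A" appears in column 3.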
Here is MATLAB code to form a matrix of logical values '0' and '1':
A=[1 2 3 4 5 6 7 8 9 10 ];
N = numel(A);
step = 2; % Set this to however many zeros you want to add each column
index = N:-step:1;
val = (1:N+step).' <= index;
Which results in
val=
1 1 1 1 1
1 1 1 1 1
1 1 1 1 0
1 1 1 1 0
1 1 1 0 0
1 1 1 0 0
1 1 0 0 0
1 1 0 0 0
1 0 0 0 0
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
How can I do the same task in R, particularly the step val = (1:N+step).' <= index?
One option is
i <- seq_len(ncol(m1))
sapply(rev(i), function(.i) {
m1[,.i][sequence(.i *2)] <- 1
m1[,.i]
})
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 1 1 1 1
# [2,] 1 1 1 1 1
# [3,] 1 1 1 1 0
# [4,] 1 1 1 1 0
# [5,] 1 1 1 0 0
# [6,] 1 1 1 0 0
# [7,] 1 1 0 0 0
# [8,] 1 1 0 0 0
# [9,] 1 0 0 0 0
#[10,] 1 0 0 0 0
#[11,] 0 0 0 0 0
#[12,] 0 0 0 0 0
Or vectorize it
i1 <- rep(i, rev(2*i))
m1[cbind(ave(i1, i1, FUN = seq_along), i1)] <- 1
m1
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 1 1 1 1
# [2,] 1 1 1 1 1
# [3,] 1 1 1 1 0
# [4,] 1 1 1 1 0
# [5,] 1 1 1 0 0
# [6,] 1 1 1 0 0
# [7,] 1 1 0 0 0
# [8,] 1 1 0 0 0
# [9,] 1 0 0 0 0
#[10,] 1 0 0 0 0
#[11,] 0 0 0 0 0
#[12,] 0 0 0 0 0
Or another option without creating a matrix beforehand
n <- 5
i1 <- seq(10, 2, by = -2)
r1 <- c(rbind(i1, rev(i1)))
matrix(rep(rep(c(1, 0), n), r1), ncol = n)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 1 1 1 1
# [2,] 1 1 1 1 1
# [3,] 1 1 1 1 0
# [4,] 1 1 1 1 0
# [5,] 1 1 1 0 0
# [6,] 1 1 1 0 0
# [7,] 1 1 0 0 0
# [8,] 1 1 0 0 0
# [9,] 1 0 0 0 0
#[10,] 1 0 0 0 0
#[11,] 0 0 0 0 0
#[12,] 0 0 0 0 0
data
m1 <- matrix(0, 12, 5)
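A more literal translation of the MATLAB comparison is also possible with outer(), which plays the role of MATLAB's implicit expansion of a column vector against a row vector. This is a sketch that reuses the variable names from the question (N, step, index, val); note that in MATLAB 1:N+step means 1:(N+step), so the row index runs from 1 to 12.

```r
N <- 10
step <- 2
index <- seq(N, 1, by = -step)   # same values as MATLAB's N:-step:1

# Compare every row number 1..(N + step) with every entry of index;
# multiplying by 1L turns the logical matrix into 0/1 integers.
val <- outer(seq_len(N + step), index, "<=") * 1L
val
```

Drop the * 1L if a logical matrix is preferred, which is closer to what MATLAB returns.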
I'm new to R and am stuck with an assignment. I would be really grateful if someone could help!
This is the task:
"Use a linear regression model to calculate the MLE(ß^) for the three ß parameters when using the linear regression model to relate each of the L genotype markers with the phenotype (i.e. your R code must include the formula for the MLE).
Plot histograms for each of the parameter estimates ß calculated for the L genotypes (i.e. three histograms!)"
I have 200 individuals and my Xd vector and Xa vector are:
> Xd
[1] 0 0 0 0 1 0 0 NA 0 0 0 0 0 0 0 NA 0 NA 0 1 0 0 0 1 0
[26] 0 0 0 0 0 0 0 0 -1 0 0 1 -1 0 NA -1 0 0 0 0 0 0 NA 0 1
[51] 0 0 0 -1 0 0 0 NA 0 0 -1 -1 0 0 0 0 -1 0 0 -1 0 0 0 0 0
[76] 0 1 0 0 0 0 1 0 0 0 NA NA 0 0 0 NA 0 0 NA 0 -1 0 0 0 -1
[101] 0 0 0 NA NA 0 NA 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0
[126] 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 NA 1 0 0
[151] -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 1 1 0 -1 0 -1
[176] 0 0 0 0 NA 0 0 0 1 0 0 0 0 0 0 0 -1 0 0 0 0 NA 0 0
> Xa
[1] 1 1 1 1 -1 1 1 NA 1 1 1 1 1 1 1 NA 1 NA 1 -1 1 1 1 -1 1
[26] 1 1 1 1 1 1 1 1 -1 1 1 -1 -1 1 NA -1 1 1 1 1 1 1 NA 1 -1
[51] 1 1 1 -1 1 1 1 NA 1 1 -1 -1 1 1 1 1 -1 1 1 -1 1 1 1 1 1
[76] 1 -1 1 1 1 1 -1 1 1 1 NA NA 1 1 1 NA 1 1 NA 1 -1 1 1 1 -1
[101] 1 1 1 NA NA 1 NA 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1
[126] 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 NA -1 1 1
[151] -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 NA 1 -1 -1 1 -1 1 -1
[176] 1 1 1 1 NA 1 1 1 -1 1 1 1 1 1 1 1 -1 1 1 1 1 NA 1 1
>
What I did was:
> Xmu=rep(1,200)
> X=cbind(Xmu, Xa, Xd)
Then I get the following warning:
In cbind(Xmu, Xa, Xd) :
number of rows of result is not a multiple of vector length (arg 2)
What does that mean? How do I calculate the MLE for my ß parameters? I would have proceeded like this:
Y <- 1 + Xa*1 + Xd*0 + rnorm(200,0,sqrt(1))
betas <- solve(t(X)%*% X) %*% t(X)%*% Y
beta_mu <- betas[1]
beta_a <- betas[2]
beta_d <- betas[3]
Also, the "your code must include the MLE formula" part confuses me. Thanks!
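As a hedged sketch of how this could be approached: the printed Xd and Xa appear to stop at element 199 (the last row starts at [176] and shows 24 values), while Xmu has 200, and cbind() warns when its arguments have lengths that are not multiples of each other. The NA entries are a second issue, because the closed-form estimate (X'X)^-1 X'Y cannot be computed with missing values in X or Y. The example below uses simulated, hypothetical genotype codings rather than the assignment data, and drops incomplete rows before applying the formula.

```r
set.seed(1)
n <- 200

# Hypothetical genotype codings; the real Xa and Xd come from the assignment data
Xa <- sample(c(-1, 0, 1), n, replace = TRUE)
Xd <- 1 - 2 * abs(Xa)
Xa[c(8, 16)] <- NA               # mimic missing genotypes
Xd[c(8, 16)] <- NA

Y <- 1 + 1 * Xa + 0 * Xd + rnorm(n, 0, 1)

# A length-1 value is recycled, so every column has the same length
X <- cbind(Xmu = 1, Xa = Xa, Xd = Xd)

# Keep only individuals with no missing values
keep <- complete.cases(X, Y)
Xc <- X[keep, , drop = FALSE]
Yc <- Y[keep]

# MLE formula for the betas: (X'X)^-1 X'Y
betas <- solve(t(Xc) %*% Xc) %*% t(Xc) %*% Yc
betas
```

Under normally distributed errors this is the same estimate that lm(Yc ~ Xc[, "Xa"] + Xc[, "Xd"]) reports, which makes lm() a convenient check on the hand-written formula.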
I have two factors. Factor A has 2 levels and factor B has 3 levels.
How do I create the following design matrix?
factorA1 factorA2 factorB1 factorB2 factorB3
[1,] 1 0 1 0 0
[2,] 1 0 0 1 0
[3,] 1 0 0 0 1
[4,] 0 1 1 0 0
[5,] 0 1 0 1 0
[6,] 0 1 0 0 1
You have a couple of options:
Use base and piece it together yourself:
(iris.dummy<-with(iris,model.matrix(~Species-1)))
(IRIS<-data.frame(iris,iris.dummy))
Or use the ade4 package as follows:
dummy <- function(df) {
require(ade4)
ISFACT <- sapply(df, is.factor)
FACTS <- acm.disjonctif(df[, ISFACT, drop = FALSE])
NONFACTS <- df[, !ISFACT,drop = FALSE]
data.frame(NONFACTS, FACTS)
}
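# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer converted to factors by default
dat <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"), x=rnorm(4), stringsAsFactors = TRUE)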
dummy(dat)
## x eggs.bar eggs.foo ham.blue ham.green ham.red
## 1 0.3365302 0 1 0 0 1
## 2 1.1341354 0 1 1 0 0
## 3 2.0489741 1 0 0 1 0
## 4 1.1019108 1 0 0 0 1
Assuming your data is in a data.frame called dat, let's say the two factors are given as in this example:
> dat <- data.frame(f1=sample(LETTERS[1:3],20,TRUE),f2=sample(LETTERS[4:5],20,TRUE),id=1:20,stringsAsFactors=TRUE)
> dat
f1 f2 id
1 C D 1
2 B E 2
3 B E 3
4 A D 4
5 C E 5
6 C E 6
7 C D 7
8 B E 8
9 C D 9
10 A D 10
11 B E 11
12 C E 12
13 B D 13
14 B E 14
15 A D 15
16 C E 16
17 C D 17
18 C D 18
19 B D 19
20 C D 20
> dat$f1
[1] C B B A C C C B C A B C B B A C C C B C
Levels: A B C
> dat$f2
[1] D E E D E E D E D D E E D E D E D D D D
Levels: D E
You can use outer to get a matrix as you showed, for each factor:
> F1 <- with(dat, outer(f1, levels(f1), `==`)*1)
> colnames(F1) <- paste("f1",sep="=",levels(dat$f1))
> F1
f1=A f1=B f1=C
[1,] 0 0 1
[2,] 0 1 0
[3,] 0 1 0
[4,] 1 0 0
[5,] 0 0 1
[6,] 0 0 1
[7,] 0 0 1
[8,] 0 1 0
[9,] 0 0 1
[10,] 1 0 0
[11,] 0 1 0
[12,] 0 0 1
[13,] 0 1 0
[14,] 0 1 0
[15,] 1 0 0
[16,] 0 0 1
[17,] 0 0 1
[18,] 0 0 1
[19,] 0 1 0
[20,] 0 0 1
Now do the same for the second factor:
> F2 <- with(dat, outer(f2, levels(f2), `==`)*1)
> colnames(F2) <- paste("f2",sep="=",levels(dat$f2))
And cbind them to get the final result:
> cbind(F1,F2)
model.matrix is the function that lm and others use in the background to do this conversion for you.
dat <- data.frame(f1=sample(LETTERS[1:3],20,T),f2=sample(LETTERS[4:5],20,T),id=1:20)
dat
model.matrix(~dat$f1 + dat$f2)
It creates the INTERCEPT variable as a column of 1's, but you can easily remove that if you need.
model.matrix(~dat$f1 + dat$f2)[,-1]
Edit: Now I see that this is essentially the same as one of the other comments, but more concise.
Expanding and generalizing @Ferdinand.kraft's answer:
dat <- data.frame(
f1 = sample(LETTERS[1:3], 20, TRUE),
f2 = sample(LETTERS[4:5], 20, TRUE),
row.names = paste0("id_", 1:20))
covariates <- c("f1", "f2") # in case you have other columns that you don't want to include in the design matrix
design <- do.call(cbind, lapply(covariates, function(covariate){
apply(outer(dat[[covariate]], unique(dat[[covariate]]), FUN = "=="), 2, as.integer)
}))
rownames(design) <- rownames(dat)
colnames(design) <- unlist(sapply(covariates, function(covariate) unique(dat[[covariate]])))
design <- design[, !duplicated(colnames(design))] # duplicated colnames happen sometimes
design
# C A B D E
# id_1 1 0 0 1 0
# id_2 0 1 0 1 0
# id_3 0 0 1 1 0
# id_4 1 0 0 1 0
# id_5 0 1 0 1 0
# id_6 0 1 0 0 1
# id_7 0 0 1 0 1
model.matrix() only gives what it calls "dummy" coding, with one column for every level, to the first factor in a formula.
If the intercept is present, it plays that role. To get the desired effect of a redundant index matrix (where you have a 1 in every column for the corresponding factor level and 0 elsewhere), you can lie to model.matrix() and pretend there's an extra level. Then trim off the intercept column.
> a=rep(1:2,3)
> b=rep(1:3,2)
> df=data.frame(A=a,B=b)
> # Lie and pretend there's a level 0 in each factor.
> df$A=factor(a,as.character(0:2))
> df$B=factor(b,as.character(0:3))
> mm=model.matrix (~A+B,df)
> mm
(Intercept) A1 A2 B1 B2 B3
1 1 1 0 1 0 0
2 1 0 1 0 1 0
3 1 1 0 0 0 1
4 1 0 1 1 0 0
5 1 1 0 0 1 0
6 1 0 1 0 0 1
attr(,"assign")
[1] 0 1 1 2 2 2
attr(,"contrasts")
attr(,"contrasts")$A
[1] "contr.treatment"
attr(,"contrasts")$B
[1] "contr.treatment"
> # mm has an intercept column not requested, so kill it
> dm=as.matrix(mm[,-1])
> dm
A1 A2 B1 B2 B3
1 1 0 1 0 0
2 0 1 0 1 0
3 1 0 0 0 1
4 0 1 1 0 0
5 1 0 0 1 0
6 0 1 0 0 1
> # You can also add interactions
> mm2=model.matrix (~A*B,df)
> dm2=as.matrix(mm2[,-1])
> dm2
A1 A2 B1 B2 B3 A1:B1 A2:B1 A1:B2 A2:B2 A1:B3 A2:B3
1 1 0 1 0 0 1 0 0 0 0 0
2 0 1 0 1 0 0 0 0 1 0 0
3 1 0 0 0 1 0 0 0 0 1 0
4 0 1 1 0 0 0 1 0 0 0 0
5 1 0 0 1 0 0 0 1 0 0 0
6 0 1 0 0 1 0 0 0 0 0 1
Things get complicated with model.matrix() again if we add a covariate x and interactions of x with factors.
a=rep(1:2,3)
b=rep(1:3,2)
x=1:6
df=data.frame(A=a,B=b,x=x)
# Lie and pretend there's a level 0 in each factor.
df$A=factor(a,as.character(0:2))
df$B=factor(b,as.character(0:3))
mm=model.matrix (~A + B + A:x + B:x,df)
print(mm)
(Intercept) A1 A2 B1 B2 B3 A0:x A1:x A2:x B1:x B2:x B3:x
1 1 1 0 1 0 0 0 1 0 1 0 0
2 1 0 1 0 1 0 0 0 2 0 2 0
3 1 1 0 0 0 1 0 3 0 0 0 3
4 1 0 1 1 0 0 0 0 4 4 0 0
5 1 1 0 0 1 0 0 5 0 0 5 0
6 1 0 1 0 0 1 0 0 6 0 0 6
So mm has an intercept, but now the A:x interaction terms have an unwanted level A0:x.
If we reintroduce x as a separate term, we will cancel that unwanted level:
mm2=model.matrix (~ x + A + B + A:x + B:x, df)
print(mm2)
(Intercept) x A1 A2 B1 B2 B3 x:A1 x:A2 x:B1 x:B2 x:B3
1 1 1 1 0 1 0 0 1 0 1 0 0
2 1 2 0 1 0 1 0 0 2 0 2 0
3 1 3 1 0 0 0 1 3 0 0 0 3
4 1 4 0 1 1 0 0 0 4 4 0 0
5 1 5 1 0 0 1 0 5 0 0 5 0
6 1 6 0 1 0 0 1 0 6 0 0 6
We can get rid of the unwanted intercept and the unwanted bare x term:
dm2=as.matrix(mm2[,c(-1,-2)])
print(dm2)
A1 A2 B1 B2 B3 x:A1 x:A2 x:B1 x:B2 x:B3
1 1 0 1 0 0 1 0 1 0 0
2 0 1 0 1 0 0 2 0 2 0
3 1 0 0 0 1 3 0 0 0 3
4 0 1 1 0 0 0 4 4 0 0
5 1 0 0 1 0 5 0 0 5 0
6 0 1 0 0 1 0 6 0 0 6
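An alternative to padding the factors with a fake level is to hand model.matrix() full indicator contrasts through its contrasts.arg argument. The following is a minimal sketch for the two-factor case without a covariate; the names full and dm are only illustrative.

```r
df <- data.frame(A = factor(rep(1:2, 3)), B = factor(rep(1:3, 2)))

# contrasts(f, contrasts = FALSE) returns an identity matrix,
# i.e. one indicator column per level instead of n - 1 columns
full <- lapply(df, contrasts, contrasts = FALSE)

# "- 1" suppresses the intercept; contrasts.arg keeps every level of every factor
dm <- model.matrix(~ A + B - 1, data = df, contrasts.arg = full)
dm
```

Every factor in df needs an entry in the list passed to contrasts.arg, so this form assumes that df contains only the factors of interest.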
Is there a native R function that will take an input vector and return the corresponding binary matrix where the matrix has the same number of columns as unique values in the input vector?
For example, given x <- 1:3, I want to return the following matrix:
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
The function contrasts gets close, but I can't seem to get around the n-1 columns returned:
> contrasts(as.factor(x))
2 3
1 0 0
2 1 0
3 0 1
Actually, contrasts is what you want.
contrasts(as.factor(1:3), contrasts=FALSE)
1 2 3
1 1 0 0
2 0 1 0
3 0 0 1
model.matrix() might help here, but you need to suppress the intercept:
> model.matrix(~ factor(1:3) - 1)
factor(1:3)1 factor(1:3)2 factor(1:3)3
1 1 0 0
2 0 1 0
3 0 0 1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(1:3)`
[1] "contr.treatment"
Something slightly more complex:
> set.seed(1)
> fac <- factor(sample(1:3, 10, replace = TRUE))
> model.matrix(~ fac - 1)
fac1 fac2 fac3
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 1 0 0
6 0 0 1
7 0 0 1
8 0 1 0
9 0 1 0
10 1 0 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$fac
[1] "contr.treatment"
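If the assign and contrasts attributes are not wanted, the same indicator matrix can be built by indexing the rows of an identity matrix with the factor's integer codes. This is a sketch rather than a built-in function; the name ind is made up for illustration.

```r
set.seed(1)
fac <- factor(sample(1:3, 10, replace = TRUE))

# Row i of the identity matrix is the indicator vector for level i,
# so picking rows by the integer codes gives one indicator row per observation
ind <- diag(nlevels(fac))[as.integer(fac), , drop = FALSE]
colnames(ind) <- levels(fac)
ind
```

The result is a plain numeric matrix with one column per level and exactly one 1 in each row.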