Own design matrix in R

Here is my data:
sub <- paste ("s", 1:6, sep = "")
mark1a <- c("A", "A", "B", "d1", "A", 2)
mark1b <- c("A", "B", "d1", 2, "d1", "A")
myd <- data.frame (sub, mark1a, mark1b)
myd
sub mark1a mark1b
1 s1 A A
2 s2 A B
3 s3 B d1
4 s4 d1 2
5 s5 A d1
6 s6 2 A
I want to create a design matrix from the pair of columns mark1a and mark1b. The design matrix should have one column for each value in unique(c(mark1a, mark1b)); each cell holds 1 or 2 depending on whether that value appears once or twice in the two columns for that row, and 0 otherwise. The expected output is a table like the ones shown in the answers below (not a figure).
I could not figure out how this can be done.

You could try something like this:
cbind(myd, t(apply(myd, 1, function(x) sapply(unique(unlist(myd[, 2:3])), function(y) sum(x==y)))))
  sub mark1a mark1b A B d1 2
1  s1      A      A 2 0  0 0
2  s2      A      B 1 1  0 0
3  s3      B     d1 0 1  1 0
4  s4     d1      2 0 0  1 1
5  s5      A     d1 1 0  1 0
6  s6      2      A 1 0  0 1
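For readability, here is the same computation broken into named steps (a sketch; the object names lvls, count_row and counts are my own, not from the answer above):
lvls <- unique(unlist(lapply(myd[, 2:3], as.character)))  # all distinct marks, in order of appearance
count_row <- function(r) sapply(lvls, function(lv) sum(r == lv))
counts <- t(apply(myd[, 2:3], 1, count_row))              # one row of counts per subject
cbind(myd, counts)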

First, make sure that the mark1a and mark1b columns share the same levels:
all.levels <- union(levels(myd$mark1a), levels(myd$mark1b))
myd$mark1a <- factor(myd$mark1a, levels = all.levels)
myd$mark1b <- factor(myd$mark1b, levels = all.levels)
Then you can compute the sum of two frequency tables and bind it to myd:
library(plyr)
cbind(myd, ddply(myd, "sub", function(x)table(x$mark1a) + table(x$mark1b))[,-1])
# sub mark1a mark1b 2 A B d1
# 1 s1 A A 0 2 0 0
# 2 s2 A B 0 1 1 0
# 3 s3 B d1 0 0 1 1
# 4 s4 d1 2 1 0 0 1
# 5 s5 A d1 0 1 0 1
# 6 s6 2 A 1 1 0 0
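If you prefer to avoid plyr, the same frequency-table addition can be done row by row in base R. This is a sketch; it assumes mark1a and mark1b are factors sharing the levels set above (note that since R 4.0, data.frame() no longer converts character columns to factors by default, so you may need stringsAsFactors = TRUE or explicit factor() calls):
counts <- t(sapply(seq_len(nrow(myd)),
                   function(i) table(myd$mark1a[i]) + table(myd$mark1b[i])))
cbind(myd, counts)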

I would say the solution from @jmsigner is the way to go for a one-liner, but I usually get confused by those nested apply (and relatives) solutions.
Here's a similar solution:
# Identify all the levels in `mark1a` and `mark1b`
mydLevels = unique(c(levels(myd$mark1a), levels(myd$mark1b)))
# Use these levels and an anonymous function with `lapply`
temp = data.frame(lapply(mydLevels,
function(x) rowSums(myd[-1] == x)+0))
colnames(temp) = mydLevels
# This gives you the correct output, but not in the order
# that you have in your original question.
cbind(myd, temp)
# sub mark1a mark1b 2 A B d1
# 1 s1 A A 0 2 0 0
# 2 s2 A B 0 1 1 0
# 3 s3 B d1 0 0 1 1
# 4 s4 d1 2 1 0 0 1
# 5 s5 A d1 0 1 0 1
# 6 s6 2 A 1 1 0 0


How to count values in columns in a data.frame?

I have a data.frame with vegetation in a presence-absence matrix and Ellenberg moisture values (values 1-9, plus the indicator symbols ! and =). Now I want to count the plants in every column (observation point) for each Ellenberg value.
T1-T4 are my observation points; when a plant is present the value is 1, when absent 0. F_Nr holds my Ellenberg values from 1 to 9, and F_sym the indicators ! and =. In my output I count the values, i.e. in T1 I have one plant with 4, two with 7, one with ! and one with =.
Here is some small example data:
set.seed(1)
df <- df2 <- data.frame(name=c("Acer campestre", "Acer negundo", "Achillea millefolium agg.", "Agrostis stolonifera", "Alnus glutinosa", "Alnus incana"),
T1=rbinom(6, 1, .5), T2=rbinom(6, 1, .5), T3=rbinom(6, 1, .5), T4=rbinom(6, 1, .5),
F_Nr=c(5,6,4,7,9,7), F_sym=c(NA, NA, NA, "!","=", "="))
I expect a matrix like this, so that I can create plots of the distribution of the values.
df_count <- data.frame(F_sum=c(1,2,3,4,5,6,7,8,9,"=", "!"),
T1=c(0,0,0,1,0,0,2,0,0,1,0),
T2=c(0,0,0,1,1,1,0,0,0,0,0),
T3=c(0,0,0,1,1,0,1,0,1,1,1),
T4=c(0,0,0,1,0,1,0,0,1,1,0))
Thanks for your help
We can use a combination of aggregate() and merge().
df2 <- read.table(text="
name T1 T2 T3 T4 F_Nr F_sym
'Acer campestre' 0 1 1 0 5 <NA>
'Acer negundo' 0 1 0 1 6 <NA>
'Achillea millefolium agg.' 1 1 1 1 4 <NA>
'Agrostis stolonifera' 1 0 0 0 7 !
'Alnus glutinosa' 0 0 1 1 9 =
'Alnus incana' 1 0 1 0 7 =",
header=TRUE, stringsAsFactors=FALSE)
fnr <- aggregate(df2[,2:5], list(df2$F_Nr), sum)
fsm <- aggregate(df2[,2:5], list(df2$F_sym), sum)
counts0 <- rbind(fnr, fsm)
dtf <- data.frame(F_sum=c(1:9, "=", "!"), stringsAsFactors=FALSE)
counts <- merge(dtf, counts0, by.x="F_sum", by.y="Group.1", all.x=TRUE)
counts[is.na(counts)] <- 0
counts[match(dtf$F_sum, counts$F_sum), ]
# F_sum T1 T2 T3 T4
# 3 1 0 0 0 0
# 4 2 0 0 0 0
# 5 3 0 0 0 0
# 6 4 1 1 1 1
# 7 5 0 1 1 0
# 8 6 0 1 0 1
# 9 7 2 0 1 0
# 10 8 0 0 0 0
# 11 9 0 0 1 1
# 2 = 1 0 2 1
# 1 ! 1 0 0 0
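A more compact alternative (a sketch, using the df defined in the question, where F_sym contains genuine NA values): stack F_Nr and F_sym into one grouping vector and let rowsum() add up the presence columns per group. The names grp, vals and keep are my own.
grp  <- c(as.character(df$F_Nr), as.character(df$F_sym))
vals <- rbind(df[, 2:5], df[, 2:5])   # presence/absence columns, repeated once per grouping variable
keep <- !is.na(grp)
rowsum(vals[keep, ], grp[keep])
Unlike the merge() approach above, values that never occur (e.g. 1, 2, 3, 8) do not get a row here.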

Generate pairwise movement data from sequence

I have a sequence which looks like this
SEQENCE
1 A
2 B
3 B
4 C
5 A
Now, from this sequence, I want to get a matrix like the one below, where the element in row i and column j denotes how many times a movement occurred from node i to node j:
A B C
A 0 1 0
B 0 1 1
C 1 0 0
How can I get this in R?
1) Use table like this:
s <- DF[, 1]
table(tail(s, -1), head(s, -1))
giving:
A B C
A 0 0 1
B 1 1 0
C 0 1 0
2) Or like this. Since embed does not work with factors, we convert the factor to character:
s <- as.character(DF[, 1])
do.call(table, data.frame(embed(s, 2)))
giving:
X2
X1 A B C
A 0 0 1
B 1 1 0
C 0 1 0
3) xtabs also works:
s <- as.character(DF[, 1])
xtabs(data = data.frame(embed(s, 2)))
giving:
X2
X1 A B C
A 0 0 1
B 1 1 0
C 0 1 0
Note: The input DF in reproducible form is:
Lines <- " SEQENCE
1 A
2 B
3 B
4 C
5 A"
DF <- read.table(text = Lines, header = TRUE)
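If you want the rows to be the "from" state and the columns the "to" state, exactly as in the expected matrix in the question, swap the two arguments of table (a minimal variant of approach 1):
s <- as.character(DF[, 1])
table(from = head(s, -1), to = tail(s, -1))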

How to find columns that fit a specific range (per individual) and fill with 1, else 0, using R

I have a data frame with three initial columns: ID, start and end positions. The rest of the columns are numeric chromosomal positions, and it looks like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4
ind2 1 3
ind3 5 7
What I want is to fill out the empty columns (1:n) based on the range for every individual (start:end). For example, for the first individual (ind1) the range goes from position 2 to 4, so the positions within that range are filled with one (1) and the positions outside it with zero (0). To simplify, the desired output should look like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4 0 1 1 1 0 0 0 ... 0
ind2 1 3 1 1 1 0 0 0 0 ... 0
ind3 5 7 0 0 0 0 1 1 1 ... 1
I would appreciate any comments.
Supposing you know the number of columns, you could use the between function from the data.table package:
cols <- paste0('c',1:7)
library(data.table)
setDT(DF)[, (cols) := lapply(1:7, function(x) +(between(x, start, end)))][]
which gives:
ID start end c1 c2 c3 c4 c5 c6 c7
1: ind1 2 4 0 1 1 1 0 0 0
2: ind2 1 3 1 1 1 0 0 0 0
3: ind3 5 7 0 0 0 0 1 1 1
Notes:
It is better not to name your columns with just numbers, therefore I added a c at the start of the column names.
Using + in +(between(x, start, end)) is a kind of trick. The more idiomatic way is using as.integer(between(x, start, end)).
Used data:
DF <- read.table(text="ID start end
ind1 2 4
ind2 1 3
ind3 5 7", header=TRUE)
If you were to begin with the data frame df, without the columns already added,
ID start end
1 ind1 2 4
2 ind2 1 3
3 ind3 5 7
you could do
mx <- max(df[-1])
M <- Map(function(x, y) replace(integer(mx), x:y, 1L), df$start, df$end)
cbind(df, do.call(rbind, M))
# ID start end 1 2 3 4 5 6 7
# 1 ind1 2 4 0 1 1 1 0 0 0
# 2 ind2 1 3 1 1 1 0 0 0 0
# 3 ind3 5 7 0 0 0 0 1 1 1
The number of new columns will equal the maximum of the start and end columns.
Data:
df <- structure(list(ID = structure(1:3, .Label = c("ind1", "ind2",
"ind3"), class = "factor"), start = c(2L, 1L, 5L), end = c(4L,
3L, 7L)), .Names = c("ID", "start", "end"), class = "data.frame", row.names = c(NA,
-3L))
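A vectorized base-R variant of the same idea (a sketch; pos and ind are my own names, and positions are assumed to run from 1 to max(df$end)):
pos <- seq_len(max(df$end))
ind <- +(outer(df$start, pos, "<=") & outer(df$end, pos, ">="))  # 1 if start <= pos <= end
colnames(ind) <- pos
cbind(df, ind)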

Creating subgroups from categorical data by using lapply in R

I was wondering if you kind folks could answer a question I have. In the sample data I've provided below, in column 1 I have a categorical variable, and in column 2 p-values.
x <- c(rep("A",0.1*10000),rep("B",0.2*10000),rep("C",0.65*10000),rep("D",0.05*10000))
categorical_data=as.matrix(sample(x,10000))
p_val=as.matrix(runif(10000,0,1))
combi=as.data.frame(cbind(categorical_data,p_val))
head(combi)
V1 V2
1 A 0.484525170875713
2 C 0.48046557046473
3 C 0.228440979029983
4 B 0.216991128632799
5 C 0.521497668232769
6 D 0.358560319757089
I now want to take one of the categories, let's say "C", and create another variable that is 1 (in column 3) if V1 is C and 0 if it isn't:
combi$NEWVAR[combi$V1=="C"] <- 1
combi$NEWVAR[combi$V1!="C"] <- 0
V1 V2 NEWVAR
1 A 0.484525170875713 0
2 C 0.48046557046473 1
3 C 0.228440979029983 1
4 B 0.216991128632799 0
5 C 0.521497668232769 1
6 D 0.358560319757089 0
I'd like to do this for each of the values in V1, looping over them with lapply:
variables=unique(combi$V1)
loopeddata=lapply(variables,function(x){
combi$NEWVAR[combi$V1==x] <-1
combi$NEWVAR[combi$V1!=x]<-0
}
)
My output however looks like this:
[[1]]
[1] 0
[[2]]
[1] 0
[[3]]
[1] 0
[[4]]
[1] 0
My desired output would be like the table in the second block of code, but when looping over the values, the third column would be A=1 while B, C, D=0; then B=1 while A, C, D=0; and so on.
If anyone could help me out that would be very much appreciated.
How about something like this:
model.matrix(~ -1 + V1, data=combi)
Then you can cbind it to combi if you desire:
combi <- cbind(combi, model.matrix(~ -1 + V1, data=combi))
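If you want the indicator columns named just A, B, C, D rather than the V1A, V1B, ... names that model.matrix produces, you can strip the prefix first (a small sketch; mm is my own name):
mm <- model.matrix(~ -1 + V1, data = combi)
colnames(mm) <- sub("^V1", "", colnames(mm))   # "V1A" -> "A", etc.
combi <- cbind(combi, mm)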
model.matrix is definitely the way to do this in R. You can, however, also consider using table.
Here's an example using the result I get when using set.seed(1) (always use a seed when sharing example problems with random data).
LoopedData <- table(sequence(nrow(combi)), combi$V1)
head(LoopedData)
#
# A B C D
# 1 0 1 0 0
# 2 0 0 1 0
# 3 0 0 1 0
# 4 0 0 1 0
# 5 0 1 0 0
# 6 0 0 1 0
## If you want to bind it back with the original data
combi <- cbind(combi, as.data.frame.matrix(LoopedData))
head(combi)
# V1 V2 A B C D
# 1 B 0.0647124934475869 0 1 0 0
# 2 C 0.676612401846796 0 0 1 0
# 3 C 0.735371692571789 0 0 1 0
# 4 C 0.111299667274579 0 0 1 0
# 5 B 0.0466546178795397 0 1 0 0
# 6 C 0.130910312291235 0 0 1 0
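For completeness, here is one way to repair the original lapply attempt (a sketch; indicators is my own intermediate name). The reason the loop returned a list of single zeros is that the value of an assignment such as combi$NEWVAR[...] <- 0 is just its right-hand side, and that is what each iteration returned. Returning the full indicator vector instead gives the desired columns:
variables  <- unique(combi$V1)
indicators <- lapply(variables, function(x) as.integer(combi$V1 == x))
names(indicators) <- variables
loopeddata <- cbind(combi, as.data.frame(indicators))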

How can I reshape my dataframe using reshape package?

I have a dataframe that looks like this:
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1
etc.
Because I need to correlate the variables a-g (their scores are in score1) with score2 at step 5 only, I think I first need to change my dataset into this:
a b c d e f g score2_step5
0 1 1 0 0 0
1 1 1 0 1 1
1 0
etc.
I am pretty sure that the reshape package should be able to help me do the job, but I haven't been able to make it work yet.
Can anyone help me? Many thanks in advance!
Here's another version. In case there is no step = 5, the value for score2_step5 is set to 0. Assuming your data.frame is df:
require(reshape2)
out <- do.call(rbind, lapply(seq(1, nrow(df), by=5), function(ix) {
  iy <- min(ix + 4, nrow(df))
  df.b <- df[ix:iy, ]
  tt <- dcast(df.b, 1 ~ var1, fill = 0, value.var = "score1", drop = FALSE)
  tt$score2_step5 <- 0
  if (any(df.b$step == 5)) {
    tt$score2_step5 <- df.b$score2[df.b$step == 5]
  }
  tt[, -1]
}))
> out
a b d e f g score2_step5
2 0 1 1 0 0 0 0
21 1 1 1 0 1 0 1
22 0 0 1 0 0 0 0
It looks like you want 7 correlations between the variables a-g and score2_step5--is that correct? First, you're going to need another variable. I'm assuming that step repeats continuously from 1 to 5; if not, this is going to be more complicated. I'm assuming your data is called df. I also prefer the newer reshape2 package, so I'm using that.
df$block <- rep(1:(nrow(df)/5),each=5)
df.molten <- melt(df,id.vars=c("var1", "step", "block"),measure.vars=c("score1"))
df2 <- dcast(df.molten, block ~ var1)
score2_step5 <- df$score2[df$step==5]
and then finally
cor(df2, score2_step5, use='pairwise')
There's an extra column (block) in df2 that you can get rid of or just ignore.
I added another row to your example data because my code doesn't work unless there is a step-5 observation in every block.
dat <- read.table(textConnection("
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1
5 a 1 0"),header=TRUE)
Like @JonathanChristensen, I made another variable (I called it rep instead of block), and I made var1 into a factor (since there are no c values in the example data set and I wanted a placeholder).
dat <- transform(dat, var1 = factor(var1, levels = letters[1:7]),
                 rep = cumsum(step == 1))
tapply makes the table of score1 values:
tab <- with(dat,tapply(score1,list(rep,var1),identity))
add the score2, step-5 values:
data.frame(tab,subset(dat,step==5,select=score2))
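To tie this back to the stated goal of correlating each of a-g with score2 at step 5, a final sketch (res and present are my own names; levels that are never observed, such as c, are dropped because their correlation is undefined):
res <- data.frame(tab, subset(dat, step == 5, select = score2))
present <- colSums(!is.na(res[letters[1:7]])) > 0
cor(res[letters[1:7]][present], res$score2, use = "pairwise.complete.obs")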
