Consider the following data set (named data).
library(DescTools)
v1 v2 v3 w1 w2 w3
1 0 0 0 1 0
0 1 0 0 0 1
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 0 1
0 0 1 1 0 0
My objective is to compute the contingency coefficient for every combination of (v1, v2, v3) and (w1, w2, w3), i.e. v1 & w1, v1 & w2, v1 & w3, and so on, using a for loop. For example, at the first iteration the loop would do the following:
tab1 <- table(data$v1, data$w1)
c1 <- ContCoef(tab1)
Any help is highly appreciated!
results <- list()
for (v_col in c("v1", "v2", "v3")) {
  for (w_col in c("w1", "w2", "w3")) {
    tab <- table(data[[v_col]], data[[w_col]])
    # name each entry "v_w" so it can be looked up as results[["v1_w2"]], etc.
    results[[paste(v_col, w_col, sep = "_")]] <- ContCoef(tab)
  }
}
# view individual results
results[["v1_w2"]]
results[["v3_w1"]]
I have a huge dataset and I want to compute the correlation of each item with the total score of the scale, excluding that item from the total. I could do it separately for each item, but I am trying to write a loop so that it is a bit easier.
Example dataset:
dat <- read.table(header=TRUE, text="
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5 ItemX6 ItemY1 ItemY2 ItemY3 ItemY4 ItemY5 ItemY6
1 1 0 1 0 1 1 1 0 1 0 1
0 1 0 0 0 1 0 1 0 0 0 1
1 1 0 1 0 1 1 1 0 1 0 0
1 0 1 0 0 1 1 0 1 0 0 1
1 1 0 1 1 1 1 1 0 1 1 0
0 0 1 1 0 0 0 0 1 1 0 0
")
library(dplyr)
xscore <- rowSums(select(dat, starts_with("ItemX")))
Now I could do it like the following, but as I have 107 items it is a bit much.
cor(dat$ItemX1,rowSums(select(dat, starts_with("ItemX") & -"ItemX1")),use="pairwise.complete.obs")
cor(dat$ItemX2,rowSums(select(dat, starts_with("ItemX") & -"ItemX2")),use="pairwise.complete.obs")
cor(dat$ItemX3,rowSums(select(dat, starts_with("ItemX") & -"ItemX3")),use="pairwise.complete.obs")
cor(dat$ItemX4,rowSums(select(dat, starts_with("ItemX") & -"ItemX4")),use="pairwise.complete.obs")
cor(dat$ItemX5,rowSums(select(dat, starts_with("ItemX") & -"ItemX5")),use="pairwise.complete.obs")
cor(dat$ItemX6,rowSums(select(dat, starts_with("ItemX") & -"ItemX6")),use="pairwise.complete.obs")
That's why I'm trying the following loop, but now I don't know how to specify that rowSums should be calculated without the item currently being used for the correlation.
variables <- names(dat)
names.item <- c(grep("ItemX", variables, value = TRUE))
item.diff.p <- data.frame(matrix(NA, ncol=2, nrow=(length(names.item)-1)))
names(item.diff.p) <- c("Item", "cor")
length(names.item)
for (i in 1:(length(names.item))-1) {
  item <- names.item[i]
  par <- cor(dat[, names(dat)[grepl("ItemX", names(dat))]],
             rowSums(select(dat, starts_with("ItemX"))), use = "pairwise.complete.obs")
  item.diff.p[i, c("cor")]
}
par
Thank you all!
You can iterate through the columns of a subsetted dataframe, and calculate:
X_dat <- dat[, grep("^ItemX", colnames(dat))]
res <- sapply(1:ncol(X_dat), function(i) {
  # correlate item i with the row sums of all the other items
  # ("p" is short for use = "pairwise.complete.obs")
  cor(X_dat[, i], rowSums(X_dat[, -i]), use = "p")
})
names(res) <- colnames(X_dat)
res
ItemX1 ItemX2 ItemX3 ItemX4 ItemX5 ItemX6
0.6324555 0.1250000 -0.7500000 0.1250000 0.4152274 0.2335497
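If you want the output as a data frame for reporting on the full 107-item scale, you can wrap the same result; a small sketch, assuming the X_dat and res objects from above:
# one row per item: the item name and its corrected item-total correlation
res_df <- data.frame(Item = colnames(X_dat), cor = unname(res))
res_df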
So I have a list that contains certain characters as shown below
list <- c("MY","GM+" ,"TY","RS","LG")
And I have a variable named "CODE" in the data frame as follows
code <- c("MY GM+","","LGTY", "RS","TY")
df <- data.frame(1:5,code)
df
code
1 MY GM+
2
3 LGTY
4 RS
5 TY
Now I want to create 5 new variables named "MY", "GM+", "TY", "RS" and "LG", each taking a binary value: 1 if there is a match in the code variable, 0 otherwise.
df
code MY GM+ TY RS LG
1 MY GM+ 1 1 0 0 0
2 0 0 0 0 0
3 LGTY 0 0 1 0 1
4 RS 0 0 0 1 0
5 TY 0 0 1 0 0
Really appreciate your help. Thank you.
Since you know how many values will be returned (5), and what you want their types to be (integer), you could use vapply() with grepl(). We can turn the resulting logical matrix into integer values by using integer() in vapply()'s FUN.VALUE argument.
cbind(df, vapply(List, grepl, integer(nrow(df)), df$code, fixed = TRUE))
# code MY GM+ TY RS LG
# 1 MY GM+ 1 1 0 0 0
# 2 0 0 0 0 0
# 3 LGTY 0 0 1 0 1
# 4 RS 0 0 0 1 0
# 5 TY 0 0 1 0 0
I think your original data has a couple of typos, so here's what I used:
List <- c("MY", "GM+" , "TY", "RS", "LG")
df <- data.frame(code = c("MY GM+", "", "LGTY", "RS", "TY"))
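One detail worth highlighting: fixed = TRUE matters here because "GM+" contains the regex metacharacter "+". Without it, grepl() would read the pattern as "G followed by one or more M", as this quick illustration shows:
grepl("GM+", "MY GMM")               # TRUE: the regex "GM+" matches "GMM"
grepl("GM+", "MY GMM", fixed = TRUE) # FALSE: the literal string "GM+" is absent
grepl("GM+", "MY GM+", fixed = TRUE) # TRUE: literal match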
I have a data frame (read from a .txt file) that looks like this. "dayX" is the day of death in a survival assay on fruit flies, and the numbers beneath are the number of flies that died in that treatment combination on that day. X or A is one treatment, m or f is another, and in the row labels the first number is the line and the second number is the block.
line day1 day2 day3 day4 day5
1 Xm1.1 0 0 0 2 0
2 Xm1.2 0 0 1 0 0
3 Xm2.1 1 1 0 0 0
4 Xm2.2 0 0 0 3 1
5 Xf1.1 0 3 0 0 1
6 Xf1.2 0 0 1 0 0
7 Xf2.1 2 0 2 0 0
8 Xf2.2 1 0 1 0 0
9 Am1.1 0 0 0 0 2
10 Am1.2 0 0 1 0 0
11 Am2.1 0 2 0 0 1
12 Am2.2 0 2 0 0 0
13 Af1.1 3 0 0 1 0
14 Af1.2 0 1 3 0 0
15 Af2.1 0 0 0 1 0
16 Af2.2 1 0 0 0 0
and I want it to become this, using R:
XA mf line block individual age
1 X m 1 1 1 4
2 X m 1 1 2 4
3 X m 1 2 1 3
and so on...
The resulting data frame collects the "age" value from the day each individual died, as scored in the upper data frame. For example, two flies died on day 4 in treatment Xm1.1, so R should create two rows: one with the information for the first individual, labelled individual "1", and another with the same information labelled individual "2". If a third individual had died in the same treatment on day 5, there would be a third row identical to the two above except that "age" would be 5 and "individual" would be 3. When the loop moves on to the next treatment row, in this case Xm1.2, the first individual to die within that treatment is again labelled individual "1" (here it dies on day 3). In my example there are 38 deaths in total, so I am trying to get R to build a data frame of 38 x 6 (excluding headers).
Is there a way to take my data frame (the real version is approx. 50*640, with approx. 50 individuals per unique combination of X/A, m/f, line (1:40) and block (1-4), i.e. ~32,000 individual deaths) to a final data frame of 6*~32,000 in an automated way?
Both of these example data frames can be built with the code below, if it helps you try out solutions:
test<-data.frame(1:16);colnames(test)=("line")
test$line=c("Xm1.1","Xm1.2","Xm2.1","Xm2.2","Xf1.1","Xf1.2","Xf2.1","Xf2.2","Am1.1","Am1.2","Am2.1","Am2.2","Af1.1","Af1.2","Af2.1","Af2.2")
test$day1=rep(0,16);test$day2=rep(0,16);test$day3=rep(0,16);test$day4=rep(0,16);test$day5=rep(0,16)
test$day4[1]=2;test$day3[2]=1;test$day2[3]=1;test$day4[4]=3;test$day5[5]=1;
test$day3[6]=1;test$day1[7]=2;test$day1[8]=1;test$day5[9]=3;test$day3[10]=1;
test$day2[11]=2;test$day2[12]=2;test$day4[13]=1;test$day3[14]=3;test$day4[15]=1;
test$day1[16]=1;test$day3[7]=2;test$day3[8]=1;test$day2[5]=3;test$day1[3]=1;
test$day5[11]=1;test$day5[9]=2;test$day5[4]=1;test$day1[13]=3;test$day2[14]=1;
test2=data.frame(rep(1:3),rep(1:3),rep(1:3),rep(1:3),rep(1:3),rep(1:3))
colnames(test2)=c("XA","mf","line","block","individual","age")
test2$XA[1]="X";test2$mf[1]="m";test2$line[1]=1;test2$block[1]=1;test2$individual[1]=1;test2$age[1]=4;
test2$XA[2]="X";test2$mf[2]="m";test2$line[2]=1;test2$block[2]=1;test2$individual[2]=2;test2$age[2]=4;
test2$XA[3]="X";test2$mf[3]="m";test2$line[3]=1;test2$block[3]=2;test2$individual[3]=1;test2$age[3]=3;
Apologies for the awfully long way of building this dummy dataset; I'm suffering from sleep deprivation and jetlag and haven't used R for months. If you run the code in R you will hopefully see more clearly what I am trying to do.
-------------------------------------------------------------------------------------
By Rg255:
I am currently stuck at the following, derived from @Arun's answer (I added the strsplit(as.character(dt$line), "") part to get around one error):
df=read.table("C:\\Users\\...\\data.txt",header=T)
require(data.table)
head(df[1:20])
dt <- as.data.table(df)
dt <- dt[, {dd <- unlist(.SD, use.names = FALSE);
list(individual = sequence(dd[dd>0]),
age = rep(which(dd>0), dd[dd>0])
)}, by=line]
out <- as.data.table(data.frame(do.call(rbind, strsplit(as.character(dt$line), ""))[, c(1:3,5)], stringsAsFactors=FALSE))
setnames(out, c("XA", "mf", "line", "block"))
out[, `:=`(line = as.numeric(line), block = as.numeric(block))]
out <- cbind(out, dt[, list(individual, age)])
Produces the following output:
> df=read.table("C:\\Users\\..\\data.txt",header=T)
> require(data.table)
> head(df[1:20])
line Day4 Day6 Day8 Day10 Day12 Day14 Day16 Day18 Day20 Day22 Day24 Day26 Day28 Day30 Day32 Day34 Day36 Day38 Day40
1 Xm1.1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 4 2
2 Xm2.1 0 0 0 0 0 0 0 0 0 2 0 0 0 1 2 1 0 2 0
3 Xm3.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1
4 Xm4.1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 2 3 8
5 Xm5.1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 3 3 3 6
6 Xm6.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
> dt <- as.data.table(df)
> dt <- dt[, {dd <- unlist(.SD, use.names = FALSE);
+ list(individual = sequence(dd[dd>0]),
+ age = rep(which(dd>0), dd[dd>0])
+ )}, by=line]
> out <- as.data.table(data.frame(do.call(rbind, strsplit(as.character(dt$line), ""))[, c(1:3,5)], stringsAsFactors=FALSE))
Warning message:
In function (..., deparse.level = 1) :
number of columns of result is not a multiple of vector length (arg 1)
> setnames(out, c("XA", "mf", "line", "block"))
> out[, `:=`(line = as.numeric(line), block = as.numeric(block))]
Error in `[.data.table`(out, , `:=`(line = as.numeric(line), block = as.numeric(block))) :
LHS of := must be a single column name, when with=TRUE. When with=FALSE the LHS may be a vector of column names or positions.
In addition: Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
> out <- cbind(out, dt[, list(individual, age)])
>
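A likely cause of the warning and the coercion NAs (an observation, not from the original posts): strsplit(dt$line, "") splits character by character, so line labels with two-digit line numbers give vectors of different lengths, and rbind() then recycles them with a warning. For example:
strsplit(c("Xm1.1", "Xm10.1"), "")
# [[1]] "X" "m" "1" "." "1"       (5 elements)
# [[2]] "X" "m" "1" "0" "." "1"   (6 elements)
# The regex split in the answer below handles multi-digit line numbers and avoids this.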
Here's a data.table solution. The line column must have unique values.
require(data.table)
df <- read.table("data.txt", header=TRUE, stringsAsFactors=FALSE)
dt <- as.data.table(df)
dt <- dt[, {dd <- unlist(.SD, use.names = FALSE);
list(individual = sequence(dd[dd>0]),
age = rep(which(dd>0), dd[dd>0])
)}, by=line]
out <- as.data.table(data.frame(do.call(rbind,
strsplit(gsub("([[:alpha:]])([[:alpha:]])([0-9]+)\\.([0-9]+)$",
"\\1 \\2 \\3 \\4", dt$line), " ")), stringsAsFactors=FALSE))
setnames(out, c("XA", "mf", "line", "block"))
out[, `:=`(line = as.numeric(line), block = as.numeric(block))]
out <- cbind(out, dt[, list(individual, age)])
This works on your data.txt file.
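To make the grouped j-expression easier to follow, here is what it computes for a single line, using the counts from row Xm1.1 of the example (two deaths on day 4) as a standalone illustration:
dd <- c(0, 0, 0, 2, 0)           # deaths per day for one line

sequence(dd[dd > 0])             # 1 2 -> individual number within each death day
rep(which(dd > 0), dd[dd > 0])   # 4 4 -> the day of death ("age"), repeated once per death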
Given data that looks like:
library(data.table)
DT <- data.table(x=rep(1:5, 2))
I would like to split this data into 5 boolean columns that indicate the presence of each number.
I can do this like this:
new.names <- sort(unique(DT$x))
DT[, paste0('col', new.names) := lapply(new.names, function(i) DT$x==i), with=FALSE]
But this uses a pesky lapply, which is probably slower than a data.table alternative, and this solution strikes me as not very "data.table-ish".
Is there a better and/or faster way to create these new columns?
How about model.matrix?
model.matrix(~factor(x)-1,data=DT)
factor(x)1 factor(x)2 factor(x)3 factor(x)4 factor(x)5
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
5 0 0 0 0 1
6 1 0 0 0 0
7 0 1 0 0 0
8 0 0 1 0 0
9 0 0 0 1 0
10 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(x)`
[1] "contr.treatment"
Apparently, you can put model.matrix into [.data.table to give the same results. Not sure if it would be faster:
DT[,model.matrix(~factor(x)-1)]
There is also nnet::class.ind
library(nnet)
cbind(DT, setnames(as.data.table(DT[, class.ind(x)]),paste0('col', unique(DT$x))))
library(data.table)
DT <- data.table(x=rep(1:5, 2))
# add column with id
DT[, id := seq.int(nrow(DT))]
# cast long table into wide
DT.wide <- dcast(DT, id ~ x, value.var = "x", fill = 0, fun.aggregate = function(x) 1)
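For completeness, the assignment from the question can also stay entirely in data.table syntax without with = FALSE by parenthesising the column names on the left-hand side. A sketch against the same DT, wrapping in as.integer() if you want 0/1 rather than TRUE/FALSE:
library(data.table)
DT <- data.table(x = rep(1:5, 2))

new.names <- sort(unique(DT$x))
# (cols) := list-of-vectors adds one indicator column per distinct value of x
DT[, (paste0("col", new.names)) := lapply(new.names, function(i) as.integer(x == i))]
DT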
I have a data frame with some boolean values (1/0) as follows (sorry, I couldn't work out how to format this as a proper table):
Flag1.Sam Flag2.Sam Flag3.Sam Flag1.Ted Flag2.Ted Flag3.Ted
probe1 0 1 0 1 0 0
probe2 0 0 0 0 0 0
probe3 1 0 0 0 0 0
probe4 0 0 0 0 0 0
probe5 1 1 0 1 0 0
I have 64 samples (Sam, Ted, etc.) whose names are in a vector called files, i.e.:
files <- c("Sam", "Ted", "Ann", ....)
And I would like to create a column summing the flag values for each sample, to get the following:
Sam Ted
probe1.flagsum 1 1
probe2.flagsum 0 0
probe3.flagsum 1 0
probe4.flagsum 0 0
probe5.flagsum 2 1
I am fairly new to R and learning on a need-to-know basis, but I have tried the following:
for (i in files) {
  FLAGS$i <- cbind(sapply(i, function(y) {
    # grepping columns to filter for one sample
    filter1 <- grep(names(filters), pattern = y)
    # print out the summed values for those columns
    FLAGS$y <- rowSums(filters[, (filter1)])
  }))
}
The above code does not work and I am a bit lost as to how to move forward.
Can anyone help me untangle this problem or point me in the right direction regarding the commands/tools to use?
Thank you.
This is easily doable with base R's reshape(), though using the reshape or reshape2 packages might be more intuitive.
Here's a solution in base R:
# Here's your data in its current form
dat = read.table(header=TRUE, text="Flag1.Sam Flag2.Sam Flag3.Sam Flag1.Ted Flag2.Ted Flag3.Ted
probe1 0 1 0 1 0 0
probe2 0 0 0 0 0 0
probe3 1 0 0 0 0 0
probe4 0 0 0 0 0 0
probe5 1 1 0 1 0 0")
# Generate an ID row
dat$id = row.names(dat)
# Reshape wide to long
r.dat = reshape(dat, direction="long",
timevar="probe",
varying=1:6, sep=".")
# Calculate row sums
r.dat$sum = rowSums(r.dat[3:5])
# Reshape back to wide format, dropping what you're not interested in
reshape(r.dat, direction="wide",
idvar="id", timevar="probe",
drop=3:5)
## id sum.Sam sum.Ted
## probe1.Sam probe1 1 1
## probe2.Sam probe2 0 0
## probe3.Sam probe3 1 0
## probe4.Sam probe4 0 0
## probe5.Sam probe5 2 1
More than one way to skin a cat
You can also whip up a function like this one:
myFun = function(data, varnames) {
  temp = vector("list", length(varnames))
  for (i in 1:length(varnames)) {
    # colSums(t(x)) is just rowSums(x): sum the flag columns for this sample
    temp[[i]] = colSums(t(data[grep(varnames[i], names(data))]))
    names(temp)[[i]] = varnames[i]
  }
  data.frame(temp)
}
Then, making use of the vector that you have of names:
files = c("Sam", "Ted")
myFun(dat, files)
## Sam Ted
## probe1 1 1
## probe2 0 0
## probe3 1 0
## probe4 0 0
## probe5 2 1
Enjoy!
If filters is your input matrix and FLAGS your desired output matrix then I would (naïvely) do something like this:
FLAGS <- matrix(0, nrow = nrow(filters), ncol = length(files))
for (i in 1:length(files)) {
  index <- grep(files[i], colnames(filters))
  FLAGS[, i] <- rowSums(filters[, index])
}
colnames(FLAGS) <- files
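The same idea also fits in a single sapply() call, which carries the sample names and probe row names through automatically; a sketch under the same assumptions (filters is the input data, files the vector of sample names):
# one column of row sums per sample; drop = FALSE guards the single-column case
FLAGS <- sapply(files, function(f) rowSums(filters[, grep(f, colnames(filters)), drop = FALSE]))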
Assuming your matrix is called input:
input <- matrix(rbinom(30, 1, 0.5), ncol = 6)
colnames(input) <- c("F1.S", "F2.S", "F3.S", "F1.T", "F2.T", "F3.T")
rownames(input) <- paste("probe", 1:5, sep = "")
input <- as.data.frame(input)
library(reshape)
input$probe <- rownames(input)
Molten <- melt(input, id.vars = "probe")
Molten$ID <- gsub("^.*\\.", "", levels(Molten$variable))[Molten$variable]
cast(probe ~ ID, data = Molten, fun = "sum")
Update, using the dat data frame from mrdwab:
dat = read.table(header=TRUE, text="Flag1.Sam Flag2.Sam Flag3.Sam Flag1.Ted Flag2.Ted Flag3.Ted
probe1 0 1 0 1 0 0
probe2 0 0 0 0 0 0
probe3 1 0 0 0 0 0
probe4 0 0 0 0 0 0
probe5 1 1 0 1 0 0")
library(reshape)
dat$probe <- rownames(dat)
Molten <- melt(dat, id.vars = "probe")
Molten$ID <- gsub("^.*\\.", "", levels(Molten$variable))[Molten$variable]
cast(probe ~ ID, data = Molten, fun = "sum")
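The gsub() step is what maps each melted column back to its sample: it strips everything up to the final dot, leaving just the sample name that cast() then uses as the column header. For example:
gsub("^.*\\.", "", c("Flag1.Sam", "Flag3.Ted"))
# [1] "Sam" "Ted"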