Create list of tables by loop in R

I am struggling to create a list of tables (objects of class table, not data.frame) in a loop in R.
The structure of my data is also a little complicated: sometimes the table function does not give a 2x2 table. How can I automatically pad tables with incomplete dimensions out to 2x2?
Sample data (the real dataset is much, much larger...):
my.data <- data.frame(y.var  = c(0,1,0,1,1,1,0,1,1,0),
                      sex    = rep(c("male","female"), times = 5),
                      apple  = c(0,1,1,0,0,0,1,0,0,0),
                      orange = c(1,0,1,1,0,1,1,1,0,0),
                      ananas = c(0,0,0,0,0,0,0,0,0,0))
#   y.var    sex apple orange ananas
# 1     0   male     0      1      0
# 2     1 female     1      0      0
# 3     0   male     1      1      0
Looking into creating the tables: for apple I get nice 2x2 tables.
table(my.data$y.var, my.data$apple)
#     0 1
#   0 2 2
#   1 5 1    .... OK, a nice 2x2 table.
table(my.data$y.var, my.data$apple, my.data$sex)
# , ,  = female
#     0 1
#   0 1 0
#   1 3 1
# , ,  = male
#     0 1
#   0 1 2
#   1 2 0    .... OK, nice 2x2 tables.
However, for ananas I get only 2x1 tables:
table(my.data$y.var, my.data$ananas)
#     0                                  #     0 1
#   0 4    .... NOT OK! I need a 2x2     #   0 4 0
#   1 6         table like this:         #   1 6 0
table(my.data$y.var, my.data$ananas, my.data$sex)
# , ,  = female
#     0                                  #     0 1
#   0 1    .... NOT OK! I need a 2x2     #   0 1 0
#   1 4         table like this:         #   1 4 0
# , ,  = male
#     0                                  #     0 1
#   0 3    .... NOT OK! I need a 2x2     #   0 3 0
#   1 2         table like this:         #   1 2 0
I can build the list manually like this, but it is not very practical:
my.list <- list(table(my.data$y.var, my.data$apple),
                table(my.data$y.var, my.data$apple, my.data$sex),
                table(my.data$y.var, my.data$orange),
                table(my.data$y.var, my.data$orange, my.data$sex),
                table(my.data$y.var, my.data$ananas),
                table(my.data$y.var, my.data$ananas, my.data$sex))
How can I write a loop that corrects the table dimensions automatically? I need this for the analyses that follow...

We can use lapply to loop over the columns of interest after converting them with factor so they all have the same levels; table then always returns the full 2x2 output, and we keep the results in a list:
my.data[-2] <- lapply(my.data[-2], factor, levels = 0:1)
lst1 <- lapply(my.data[3:5], function(x) table(my.data$y.var, x))
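With the columns converted to two-level factors, the same lapply pattern also pads the three-way (by-sex) tables the question asks for. A minimal sketch on the sample data (the second lapply call and the lst2 name are additions, not from the answer above):

```r
my.data <- data.frame(y.var  = c(0,1,0,1,1,1,0,1,1,0),
                      sex    = rep(c("male","female"), times = 5),
                      apple  = c(0,1,1,0,0,0,1,0,0,0),
                      orange = c(1,0,1,1,0,1,1,1,0,0),
                      ananas = c(0,0,0,0,0,0,0,0,0,0))

# Fix the levels so empty categories still get a row/column
my.data[-2] <- lapply(my.data[-2], factor, levels = 0:1)

# 2-way tables: always 2x2, even for all-zero ananas
lst1 <- lapply(my.data[3:5], function(x) table(my.data$y.var, x))

# 3-way tables stratified by sex: always 2x2x2
lst2 <- lapply(my.data[3:5], function(x) table(my.data$y.var, x, my.data$sex))

dim(lst1$ananas)  # 2 2
dim(lst2$ananas)  # 2 2 2
```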

Related

weird R issue with table()

I have a list of 503 obs x 637 vars that I'm trying to coerce into a frequency table. My list is of the form shown below (blanks are NAs), but it is spread out over many more columns and rows:
1       2        3         4        5
string  sting_2  string_2  string_3
When I run table(list), though, I get a super weird output that I can only really screenshot to show.
This doesn't happen on the data frame I use to generate this list, and I can't seem to figure out or search a way around it. Additionally, when I try to use table() on just one column of my list, as in table(list[2]), I get NULL even though I'd expect string 1, string_2 1. I'm very unsure what's happening, and I swear I had it working before in both cases. Please help!
Using table on a list (as you described in the comments of your question) produces one dimension for every vector/element within the list. For example,
table(list(c(1,1,2)))
# 1 2
# 2 1
x <- list(c(1,1,2),c(1,1,2),c(1,1,2))
table(x)
# , , x.3 = 1
# x.2
# x.1 1 2
# 1 2 0
# 2 0 0
# , , x.3 = 2
# x.2
# x.1 1 2
# 1 0 0
# 2 0 1
The first list (one element) produces a 1-d table; the second (three elements) produces a 3-d table array. So with your data's 637 vars (I'm inferring that each element within the list is a var), you will get a 637-dimensional result.
If you want a table of the contents, then you need to unlist it.
table(unlist(x))
# 1 2
# 6 3
As to why you're getting NULL, it seems likely that you have all NA in that element. For example,
table(c(x, list(c(NA,NA,NA))))
# < table of extent 2 x 2 x 2 x 0 >
(where we previously saw a 3-d table here). You can try to work around this with useNA="always" or useNA="ifany".
table(c(x, list(c(NA,NA,NA))), useNA="always")
# , , .3 = 1, .4 = NA
# .2
# .1 1 2 <NA>
# 1 2 0 0
# 2 0 0 0
# <NA> 0 0 0
# , , .3 = 2, .4 = NA
# .2
# .1 1 2 <NA>
# 1 0 0 0
# 2 0 1 0
# <NA> 0 0 0
# , , .3 = NA, .4 = NA
# .2
# .1 1 2 <NA>
# 1 0 0 0
# 2 0 0 0
# <NA> 0 0 0
which returns us closer to the "normal" output of your initial table(x), or
table(unlist(c(x, list(c(NA,NA,NA)))), useNA="always")
# 1 2 <NA>
# 6 3 3
in a simplified view.
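The seemingly-NULL single-column case can also be reproduced in isolation, along with both workarounds; a small sketch (the level names here are made up for illustration):

```r
v <- c(NA, NA, NA)

# An all-NA vector gives an empty table, which prints like nothing at all
length(table(v))  # 0

# useNA makes the NAs count
table(v, useNA = "ifany")  # <NA>: 3

# Forcing levels with factor yields named zero counts instead
table(factor(v, levels = c("string", "string_2")))
```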

making 1000 contingency tables in R

I have a vector called "combined" with 1's and 0's
combined
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I sampled twice from this vector, each with a sample size of 3, and put the result into a contingency table of counts as follows:
2 1
1 2
I want to repeat this sampling 1000 times so that I end up with 1000 contingency tables, each with counts of 1s and 0s from the sampling.
This is what I tried:
sample1 = as.vector(replicate(10000, sample(combined, 3)))
sample2 = as.vector(replicate(10000, sample(combined, 3)))
con_table = table(sample1,sample2)
but I ended up getting only 1 table instead of 10000. Hoping to get some help.
#  8109 7573
#  7306 7012
You need to wrap the entire expression, sample and table, inside replicate. Also convert to a factor to ensure you always get a 2x2 table. E.g., a simple version with 2 replications:
combined <- rep(0:1,each=10)
combined <- as.factor(combined)
replicate(2, table(sample(combined,3), sample(combined,3)), simplify=FALSE)
#[[1]]
#
# 0 1
# 0 0 1
# 1 1 1
#
#[[2]]
#
# 0 1
# 0 1 1
# 1 0 1
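Scaled up to the question's target, the same pattern yields the full set of tables; a sketch assuming a combined vector with 13 ones and 14 zeros as shown above (the `tables` name is mine):

```r
set.seed(42)
combined <- as.factor(rep(1:0, times = c(13, 14)))  # 13 ones, 14 zeros, as in the question

# 1000 independent 2x2 tables, each built from two fresh samples of size 3
tables <- replicate(1000,
                    table(sample(combined, 3), sample(combined, 3)),
                    simplify = FALSE)

length(tables)    # 1000
dim(tables[[1]])  # 2 2 -- the factor levels guarantee this even for all-0 samples
```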

Sub-setting or arrange the data in R

As I am new to R, this question may seem like a piece of cake to you.
I have a data in txt format. The first column has Cluster Number and the second column has names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data into the following format:
Cluster No.  org1  org2  org3  org4
0               0     0     0     1
1               1     0     0     0
I could not figure out how to do it in R.
Thanks
We could use table
out <- cbind(ClusterNo = seq_len(nrow(df1)),
             as.data.frame.matrix(table(seq_len(nrow(df1)),
                 factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
#   ClusterNo org1 org2 org3 org4
# 1         1    0    0    0    1
# 2         2    1    0    0    0
It is also possible that we need to use the first column to get the frequency:
out1 <- as.data.frame.matrix(table(df1[[1]],
            factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
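Run against the eleven rows of sample data, the first-column variant gives one row per cluster number; note that cluster 10, which appears twice, gets a count under both org1 and org4 (the df1 construction below just re-types the question's data):

```r
df1 <- data.frame(
  V1 = c(0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 10),
  V2 = c("org4|gene759", "org1|gene992", "org1|gene1101", "org4|gene757",
         "org1|gene1702", "org1|gene989", "org1|gene990", "org1|gene1699",
         "org1|gene1102", "org4|gene2439", "org1|gene1374"))

# Cross-tabulate cluster number against the org prefix
out1 <- as.data.frame.matrix(table(df1[[1]],
            factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))

out1["10", ]  # org1 = 1, org4 = 1
```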
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression (wrapped in as.integer, since gsub returns character and we take max() of this column later), and set this as a third column of our input:
input[, 3] <- as.integer(gsub('^org(.+)\\|.*', '\\1', input[, 2]))
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
  possibleOrgs %in% input[input[, 1] == x, 3], logical(4))
We can then format this result as we like, perhaps using t to transform its orientation, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:
result <- t(result) * 1
colnames (result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1

Splitting one Column to Multiple R and Giving logical value if true

I am trying to split one column in a data frame into multiple columns which take the values from the original column as their names. Then, if that value occurred in the original row, the new column gets a 1, or 0 if not. I realize this is not the best way to explain, so, for example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and would like to expand it to wide format, something like the following, with 1's and 0's (or T and F):
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr with the separate function and reshape2 with the cast function, but I seem to be getting hung up on producing the logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
         type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'; FUN.VALUE must be one logical per row of df
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
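An alternative sketch that splits first and then matches whole values rather than patterns; unlike grepl, %in% cannot be fooled by one level being a substring of another (the variable names here are mine):

```r
df <- data.frame(subject = 1:4,
                 Location = c("A", "A/B", "B/C/D", "A/B/C/D"))

# split each Location into its parts
s <- strsplit(as.character(df$Location), "/", fixed = TRUE)

# one indicator column per unique part
lvls <- sort(unique(unlist(s)))
m <- t(vapply(s, function(v) as.integer(lvls %in% v), integer(length(lvls))))
colnames(m) <- lvls

cbind(df["subject"], m)
#   subject A B C D
# 1       1 1 0 0 0
# 2       2 1 1 0 0
# 3       3 0 1 1 1
# 4       4 1 1 1 1
```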

Creating subgroups from categorical data by using lapply in R

I was wondering if you kind folks could answer a question I have. In the sample data provided below, column 1 holds a categorical variable and column 2 holds p-values.
x <- c(rep("A",0.1*10000),rep("B",0.2*10000),rep("C",0.65*10000),rep("D",0.05*10000))
categorical_data=as.matrix(sample(x,10000))
p_val=as.matrix(runif(10000,0,1))
combi=as.data.frame(cbind(categorical_data,p_val))
head(combi)
V1 V2
1 A 0.484525170875713
2 C 0.48046557046473
3 C 0.228440979029983
4 B 0.216991128632799
5 C 0.521497668232769
6 D 0.358560319757089
I now want to take one of the categorical values, let's say "C", and create another variable: 1 in column 3 if it is C, or 0 if it isn't.
combi$NEWVAR[combi$V1 == "C"] <- 1
combi$NEWVAR[combi$V1 != "C"] <- 0
V1 V2 NEWVAR
1 A 0.484525170875713 0
2 C 0.48046557046473 1
3 C 0.228440979029983 1
4 B 0.216991128632799 0
5 C 0.521497668232769 1
6 D 0.358560319757089 0
I'd like to do this for each of the variables in V1, and then loop over using lapply:
variables=unique(combi$V1)
loopeddata = lapply(variables, function(x) {
  combi$NEWVAR[combi$V1 == x] <- 1
  combi$NEWVAR[combi$V1 != x] <- 0
})
My output however looks like this:
[[1]]
[1] 0
[[2]]
[1] 0
[[3]]
[1] 0
[[4]]
[1] 0
My desired output would be like the table in the second block of code, but looping over the values: in the third column A=1 while B, C, D=0; then B=1 while A, C, D=0; etc.
If anyone could help me out that would be very much appreciated.
How about something like this:
model.matrix(~ -1 + V1, data=combi)
Then you can cbind it to combi if you desire:
combi <- cbind(combi, model.matrix(~ -1 + V1, data=combi))
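As to why the original lapply printed a list of 0s: the last expression in the anonymous function is a subassignment, and an assignment's value is its right-hand side (here the scalar 0), which is what lapply collects. Returning the indicator vector itself fixes it; a sketch on the question's data (the names(loopeddata) line is my addition):

```r
set.seed(1)
x <- c(rep("A", 0.1 * 10000), rep("B", 0.2 * 10000),
       rep("C", 0.65 * 10000), rep("D", 0.05 * 10000))
combi <- data.frame(V1 = sample(x), V2 = runif(10000),
                    stringsAsFactors = FALSE)

variables <- unique(combi$V1)

# return the 0/1 vector instead of assigning it, so lapply keeps it
loopeddata <- lapply(variables, function(v) as.integer(combi$V1 == v))
names(loopeddata) <- variables

str(loopeddata[["C"]])  # integer vector of length 10000, 1 where V1 == "C"
```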
model.matrix is definitely the way to do this in R. You can, however, also consider using table.
Here's an example using the result I get when using set.seed(1) (always use a seed when sharing example problems with random data).
LoopedData <- table(sequence(nrow(combi)), combi$V1)
head(LoopedData)
#
# A B C D
# 1 0 1 0 0
# 2 0 0 1 0
# 3 0 0 1 0
# 4 0 0 1 0
# 5 0 1 0 0
# 6 0 0 1 0
## If you want to bind it back with the original data
combi <- cbind(combi, as.data.frame.matrix(LoopedData))
head(combi)
# V1 V2 A B C D
# 1 B 0.0647124934475869 0 1 0 0
# 2 C 0.676612401846796 0 0 1 0
# 3 C 0.735371692571789 0 0 1 0
# 4 C 0.111299667274579 0 0 1 0
# 5 B 0.0466546178795397 0 1 0 0
# 6 C 0.130910312291235 0 0 1 0