Include zero frequencies in 2-way frequency/contingency table - r

I am trying to make a contingency (frequency) table using table() in R for two integer variables, but the default option in table does not include all the values in the range for each. For example:
a=c(1,2,3,5)
b=c(1,1,2,3)
table(a,b)
returns:
   b
a   1 2 3
  1 1 0 0
  2 1 0 0
  3 0 1 0
  5 0 0 1
I would like it to give:
   b
a   1 2 3
  1 1 0 0
  2 1 0 0
  3 0 1 0
  4 0 0 0
  5 0 0 1
This is a simple example where the value '4' is missing from one of the vectors. I know I could convert the table to an array and add a row of zeros, but I'm wondering if there's a simpler way to do this automatically when the variables span hundreds of (sparse) integer values.

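One way to get this is to convert each vector to a factor with explicit levels, because table() counts every factor level, including levels that never occur: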
a=factor(c(1,2,3,5), levels=1:5)
b=factor(c(1,1,2,3), levels=1:5)
table(a,b)
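Note that levels=1:5 for b produces five columns, while the output requested in the question has three. When the variables span many sparse integer values, the levels can be derived from the data instead of typed by hand. The following is a minimal sketch under that assumption; the helper name full_levels is hypothetical and not part of the original answer.

```r
a <- c(1, 2, 3, 5)
b <- c(1, 1, 2, 3)

# Hypothetical helper: turn an integer vector into a factor whose levels
# cover every integer between the observed minimum and maximum.
full_levels <- function(x) factor(x, levels = seq(min(x), max(x)))

# table() keeps one row/column per level, so the unobserved value 4
# appears as a row of zeros.
tab <- table(a = full_levels(a), b = full_levels(b))
print(tab)
```

This gives a 5 x 3 table. If both variables should share the same range, pass the same levels vector to both factor() calls instead.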

Related

Find rows of matrix that contain the value '1' no more than 2 times in R?

I have successfully imported my csv file into R. It is a 6 by 6 matrix.
0 0 0 0 0 0
0 1 0 0 0 0
0 1 1 0 0 0
0 1 0 0 0 1
0 1 0 1 0 0
1 1 1 1 1 1
I am looking for a function that will allow me to calculate which rows have the value '1' exactly twice.
I know 3 of the rows contain '1' exactly twice, so I would like to print '3'.
Is there any function that will allow me to achieve this?
We can use rowSums to get the sum of each row (for a 0/1 matrix, this is the number of 1s in the row), turn it into a logical vector with a comparison operator, and get the positions of the matching rows by wrapping it in which:
which(rowSums(m1) == 2)
If you need the count of such rows instead, use sum:
sum(rowSums(m1) == 2)
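For reference, here is a self-contained sketch that rebuilds the 6 by 6 matrix from the question (the name m1 follows the answer). It compares against 1 before summing, which is an assumption on my part that makes the count correct even if the matrix held values other than 0 and 1.

```r
# Rebuild the matrix shown in the question, row by row
m1 <- matrix(c(0, 0, 0, 0, 0, 0,
               0, 1, 0, 0, 0, 0,
               0, 1, 1, 0, 0, 0,
               0, 1, 0, 0, 0, 1,
               0, 1, 0, 1, 0, 0,
               1, 1, 1, 1, 1, 1),
             nrow = 6, byrow = TRUE)

ones_per_row <- rowSums(m1 == 1)       # number of 1s in each row
rows_with_two <- which(ones_per_row == 2)  # row indices with exactly two 1s
n_rows_with_two <- sum(ones_per_row == 2)  # how many such rows there are

print(rows_with_two)
print(n_rows_with_two)
```

For the "no more than 2 times" reading in the title, replace == 2 with <= 2.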

Exporting a list

I'm trying to export a list of 0's and 1's in R using the following code:
write(export, file="export.txt", ncol=1)
However, in the file "export.txt," there are 1's and 2's instead of 0's and 1's. How do I get the exported file to have 0's and 1's?
R List: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 1
This is what shows up in the file: 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 2 1 2 2
I suspect that export is a factor variable. write is a wrapper for cat, and cat doesn't handle factors gracefully: it prints the underlying integer codes instead of the labels:
x <- factor(0:1)
cat(x)
## 1 2
You can coerce to character to get the proper output:
cat(as.character(x), file="export.txt")
## 0 1
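The same fix can be applied to the original write call. This sketch assumes, as the answer suspects, that export is a factor; the vector below is rebuilt from the values listed in the question, and a temporary file stands in for export.txt.

```r
# Assumption: export is a factor holding the 0/1 values from the question
export <- factor(c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                   0, 0, 0, 0, 1, 0, 1, 0, 1, 1))

out_file <- tempfile(fileext = ".txt")  # stand-in for "export.txt"

# Convert the factor to its labels before writing, one value per line
write(as.character(export), file = out_file, ncolumns = 1)

written <- readLines(out_file)
print(written)
```

If numeric values are needed later in R, as.numeric(as.character(export)) recovers them; calling as.numeric(export) directly would return the codes 1 and 2 again.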

Random binary distribution

I would like to generate a random binary combination (row order) in my dataframe df:
bin
2
2
2
2
3
2
3
2
In this example I want to generate six 0s (as many as there are 2s) and two 1s (as many as there are 3s), in random order. I expect something like this:
bin
0
0
1
0
0
1
0
0
Any ideas? Thank you
So given a vector bin
bin<-c(2,2,2,2,3,2,3,2)
You would like to create a new vector that contains the same number of 0's as the number of 2's in bin, and the same number of 1's as the number of 3's in bin. Assuming that's correct, then
sample(rep(0:1, table(bin)))
should do the trick. Here are the results of running that command several times:
# 0 0 0 0 1 1 0 0
# 0 0 0 1 0 0 1 0
# 0 0 0 1 0 0 1 0
# 0 0 1 0 1 0 0 0
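To spell out why this works: table(bin) returns the counts of each distinct value in sorted order (six 2s, two 3s), rep repeats 0 and 1 that many times, and sample permutes the result. Below is a minimal sketch that stores the shuffled values next to the original column; the column name random_bin and the seed are illustrative assumptions, and the approach assumes bin holds exactly two distinct values.

```r
df <- data.frame(bin = c(2, 2, 2, 2, 3, 2, 3, 2))

counts <- as.vector(table(df$bin))   # counts per distinct value: 6 and 2
zeros_and_ones <- rep(0:1, times = counts)  # six 0s followed by two 1s

set.seed(42)                         # only to make the shuffle reproducible
df$random_bin <- sample(zeros_and_ones)     # random permutation

print(df)
```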

How to shuffle a dataframe column wise, but independent of rows?

In this answer: https://stackoverflow.com/a/11503439/651779 it is shown how to shuffle a dataframe row-wise and column-wise. I am interested in shuffling column-wise. From the original dataframe
> df1
a b c
1 1 1 0
2 1 0 0
3 0 1 0
4 0 0 0
shuffling column wise
> df3=df1[,sample(ncol(df1))]
> df3
c a b
1 0 1 1
2 0 1 0
3 0 0 1
4 0 0 0
However, I would like to shuffle each row on its columns independent of the other rows, instead of shuffling the complete columns, so that you could get something like
>df4
c a b
1 0 1 1
2 1 0 0
3 1 0 0
4 0 0 0
Now I do that by looping over each row, shuffling the row, and putting it in a dataframe. Is there an easy way to do this?
something like:
t(apply(df1, 1, function(x) { sample(x, length(x)) } ))
This will give you the result in matrix form. If the data frame has factors or a mix of numeric and character columns, be aware that apply converts it to a matrix first, so everything will be coerced to character.
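A self-contained sketch of the same idea, converting the result back to a data frame. The data is rebuilt from the df1 printed in the question, df4 follows the question's naming, and the seed is only there for reproducibility. Indexing with sample.int is a small defensive change of mine: sample(x) on a length-one numeric vector would sample from 1:x instead of returning x.

```r
df1 <- data.frame(a = c(1, 1, 0, 0),
                  b = c(1, 0, 1, 0),
                  c = c(0, 0, 0, 0))

set.seed(1)  # reproducible shuffle

# Shuffle the values within each row independently of the other rows.
# apply() returns one column per input row, hence the transpose.
shuffled <- t(apply(as.matrix(df1), 1, function(x) x[sample.int(length(x))]))

df4 <- as.data.frame(shuffled)
names(df4) <- names(df1)  # after shuffling, the column names are just positions

print(df4)
```

Each row of df4 holds the same values as the corresponding row of df1, only in a different order.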

r sum several columns by another column

I have a 39-column data frame (with upward of 100000 rows) whose last eleven columns look like this (the rest of the columns do not concern my question):
H3K27me3_gross_bin H3K4me3_gross_bin H3K4me1_gross_bin UtoP UtoM UPU UPP UPM UMU UMP UMM
cg00000029 3 3 6 1 1 0 0 0 0 0 0
cg00000321 6 1 5 1 0 0 1 0 0 0 0
cg00000363 6 1 1 1 0 1 0 0 0 0 0
cg00000622 1 2 1 0 0 0 0 0 0 0 0
cg00000714 2 5 6 1 0 0 0 0 0 0 0
cg00000734 2 6 2 0 0 0 0 0 0 0 0
I want to create a matrix that will:
a) count the number of rows in which the value of columns UPU, UPP or UPM is 1, grouped by each of the first three columns (H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin)
b) sum each row of the columns UPU, UPP, UPM by the first three columns
I came up with this incredibly cumbersome way of doing this:
UtoPFrac<-seq(6)
UtoPTotEvents<-seq(6)
for (j in 1:3){
y<-df[,28+j]
for (i in 1:3){
UtoPFrac<-cbind(UtoPFrac,tapply(df[which(is.na(y)==FALSE),33+i],y[which(is.na(y)==FALSE)], function(x) length(which(x==1))))
}
}
UtoPFrac<-UtoPFrac[,2:10]
UtoPEvents<-cbind(rowSums(UtoPFrac[,1:3]),rowSums(UtoPFrac[,4:6]),rowSums(UtoPFrac[,7:9]))
I am certain there is a more elegant way of doing this, probably by using aggregate() or ddply(), but I was unable to get it working.
I will appreciate any help doing this more efficiently.
Thanks in advance
Not tested:
library(plyr)
ddply(df, .(H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin), summarize, UPUl=length(UPU[which(UPU==1)]), UPPl=length(UPP[which(UPP==1)]), UPMl=length(UPM[which(UPM==1)]), mysum=sum(UPU + UPP + UPM))
P.S. If you dput the data and provide the expected output, I will test the above code
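The ddply call above groups by the combination of the three bin columns. If the counts are wanted per bin column separately, as in the question's loop, a base R sketch with aggregate could look like the following. It is only checked against toy data rebuilt from the six rows shown in the question; the helper name count_ones and the column name total are hypothetical.

```r
# Toy data: the six rows from the question, relevant columns only
df <- data.frame(
  H3K27me3_gross_bin = c(3, 6, 6, 1, 2, 2),
  H3K4me3_gross_bin  = c(3, 1, 1, 2, 5, 6),
  H3K4me1_gross_bin  = c(6, 5, 1, 1, 6, 2),
  UPU = c(0, 0, 1, 0, 0, 0),
  UPP = c(0, 1, 0, 0, 0, 0),
  UPM = c(0, 0, 0, 0, 0, 0)
)

bin_cols   <- c("H3K27me3_gross_bin", "H3K4me3_gross_bin", "H3K4me1_gross_bin")
event_cols <- c("UPU", "UPP", "UPM")

# Hypothetical helper: for one bin column, count the rows where each
# event column equals 1, then add the total across the three events.
count_ones <- function(bin) {
  is_one <- as.data.frame(df[event_cols] == 1)
  res <- aggregate(is_one, by = list(bin = df[[bin]]), FUN = sum)
  res$total <- rowSums(res[event_cols])
  res
}

# One summary data frame per bin column
counts <- lapply(setNames(bin_cols, bin_cols), count_ones)
print(counts[["H3K27me3_gross_bin"]])
```

aggregate drops rows whose grouping value is NA, which mirrors the is.na filtering in the question's loop.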
