I would like to generate a random binary sequence (in random row order) from a column of my data frame df:
bin
2
2
2
2
3
2
3
2
In this example I want to generate six 0s (as many as there are 2s) and two 1s (as many as there are 3s), placed in random order. I expect something like this:
bin
0
0
1
0
0
1
0
0
Any ideas? Thank you
So given a vector bin
bin<-c(2,2,2,2,3,2,3,2)
You would like to create a new vector that contains the same number of 0's as the number of 2's in bin, and the same number of 1's as the number of 3's in bin. Assuming that's correct, then
sample(rep(0:1, table(bin)))
Should do the trick. Here are the results of running that command several times:
# 0 0 0 0 1 1 0 0
# 0 0 0 1 0 0 1 0
# 0 0 0 1 0 0 1 0
# 0 0 1 0 1 0 0 0
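The one-liner above depends on table(bin) listing the counts in sorted order (2 before 3), so that 0 lines up with 2 and 1 with 3. As a minimal sketch of an alternative that states the mapping explicitly, one could recode the values first and then shuffle; the name shuffled and the seed are only illustrative:

```r
bin <- c(2, 2, 2, 2, 3, 2, 3, 2)

set.seed(42)  # arbitrary seed, only to make the example repeatable

# Recode explicitly (3 -> 1, everything else -> 0), then permute the result.
# The number of 0s and 1s is fixed by the data; only their order is random.
shuffled <- sample(as.integer(bin == 3))

table(shuffled)  # six 0s and two 1s, whatever the order
```

Because sample() without a size argument returns a permutation of its input, the counts never change between runs.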
I have successfully imported my csv file into R. It is a 6 by 6 matrix.
0 0 0 0 0 0
0 1 0 0 0 0
0 1 1 0 0 0
0 1 0 0 0 1
0 1 0 1 0 0
1 1 1 1 1 1
I am looking for a function that will tell me which rows contain the value '1' exactly twice.
I know 3 of the rows contain '1' exactly twice, so I would like to print '3'.
Is there a function that will let me do this?
We can use rowSums to get the sum of each row, turn it into a logical vector with a comparison operator, and get the positions of the matching rows by wrapping it in which
which(rowSums(m1) == 2)
If you need the count instead, use sum
sum(rowSums(m1) == 2)
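For reference, here is a small self-contained sketch that rebuilds the 6 by 6 matrix from the question (under the name m1, which the answer assumes) and runs both calls. Comparing with m1 == 1 before summing is an optional safeguard in case the matrix ever holds values other than 0 and 1:

```r
# The matrix from the question, entered row by row
m1 <- matrix(c(0, 0, 0, 0, 0, 0,
               0, 1, 0, 0, 0, 0,
               0, 1, 1, 0, 0, 0,
               0, 1, 0, 0, 0, 1,
               0, 1, 0, 1, 0, 0,
               1, 1, 1, 1, 1, 1),
             nrow = 6, byrow = TRUE)

# Number of 1s in each row
ones_per_row <- rowSums(m1 == 1)

# Positions of the rows with exactly two 1s
rows_with_two <- which(ones_per_row == 2)
rows_with_two
# [1] 3 4 5

# How many such rows there are
n_rows_with_two <- sum(ones_per_row == 2)
n_rows_with_two
# [1] 3
```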
I have a dataset (df) which looks like:
df
X24_TT X35_FTT X55_FTT X70_FTT X85_TT X86_FTT X90_FTT X96_FTT
0 1 0 0 0 1 0 1
1 0 0 1 1 1 0 1
1 0 0 0 0 1 0 1
0 1 0 0 0 0 1 1
0 0 1 1 0 1 0 0
1 0 0 0 0 0 0 0
I want to keep only those columns whose column sums are greater than the 20th percentile of all column sums, i.e.
a) Calculate colsum for each column in "df" i.e an array of colsums
b) Find 20th percentile value of that array of colsums
c) Keep columns whose colsum > 20th percentile value
The 20th percentile of the column sums of "df" comes out as 1.
Therefore the new dataset (nm) contains only those columns whose column sum is greater than 1, i.e.
nm
X24_TT X35_FTT X70_FTT X86_FTT X96_FTT
0 1 0 1 1
1 0 1 1 1
1 0 0 1 1
0 1 0 0 1
0 0 1 1 0
1 0 0 0 0
Please note that the number of columns is huge, so I need a dynamic solution that does not list columns by hand.
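Steps a) to c) can be sketched in base R roughly as follows. This assumes every column of df is numeric and uses quantile() with its default interpolation method; the names col_totals and cutoff are only illustrative:

```r
# Sample data from the question
df <- read.table(header = TRUE, text = "
X24_TT X35_FTT X55_FTT X70_FTT X85_TT X86_FTT X90_FTT X96_FTT
0 1 0 0 0 1 0 1
1 0 0 1 1 1 0 1
1 0 0 0 0 1 0 1
0 1 0 0 0 0 1 1
0 0 1 1 0 1 0 0
1 0 0 0 0 0 0 0
")

# a) column sums
col_totals <- colSums(df)

# b) 20th percentile of those sums (1 for this sample)
cutoff <- quantile(col_totals, probs = 0.2)

# c) keep the columns whose sum is strictly greater than the cutoff;
#    drop = FALSE keeps a data frame even if only one column survives
nm <- df[, col_totals > cutoff, drop = FALSE]

names(nm)
# [1] "X24_TT"  "X35_FTT" "X70_FTT" "X86_FTT" "X96_FTT"
```

Nothing here depends on the number or names of the columns, so the same three lines should apply to a much wider data frame.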
I'm looking for a function in R that can do a random permutation. For example, I have a vector with five 1s and ten 0s like this:
> status=c(rep(1,5),rep(0,10))
> status
[1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
Now I'd like to randomly permute the positions of these numbers while keeping the same number of 0s and 1s in the vector, to get a new series such as this:
1 1 0 1 0 1 0 0 0 0 0 1 0 0 0
or
1 0 0 0 0 0 0 1 1 0 0 1 0 1 0
I found that the function sample() can draw samples, but the way I used it the number of 1s and 0s is not the same each time. Do you know how I can do this in R? Thanks in advance.
We can use sample
sample(status)
#[1] 1 0 0 1 0 0 1 0 0 0 0 1 0 1 0
sample(status)
#[1] 0 0 0 0 1 1 0 0 1 1 0 0 0 1 0
If we call sample on the vector without a size argument, it returns a random permutation of the entire vector, so the frequency of each unique element stays the same
colSums(replicate(5, sample(status)))
#[1] 5 5 5 5 5
i.e. we get five 1s in every permutation, so the remaining ten elements are always 0s.
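The varying counts described in the question are what one would expect from drawing each element independently with replacement. The sketch below contrasts the two calls; the second call is only a guess at what was tried, and the names perm_counts and indep_counts are illustrative:

```r
status <- c(rep(1, 5), rep(0, 10))

set.seed(1)  # arbitrary seed, only to make the example repeatable

# Permutation: every draw is a reordering of status, so the count of 1s is fixed
perm_counts <- replicate(1000, sum(sample(status)))

# Independent draws with replacement: each position is 0 or 1 at random,
# so the count of 1s changes from draw to draw
indep_counts <- replicate(1000, sum(sample(0:1, length(status), replace = TRUE)))

range(perm_counts)   # both ends are 5
range(indep_counts)  # a spread of values, not a single count
```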
I have newly started to learn R, so my question may be utterly ridiculous. I have a data frame
data<- data.frame('number'=1:11, 'col1'=sample(10:20),'col2'=sample(10:20),'col3'=sample(10:20),'col4'=sample(10:20),'col5'=sample(10:20), 'date'= c('12-12-2014','12-11-2014','12-10-2014','12-09-2014', '12-08-2014','12-07-2014','12-06-2014','12-05-2014','12-04-2014', '12-04-2014', '12-03-2014') )
The number column is an 'id' column and the last column is a date.
I want to count how many times each number occurs across columns 2:6 taken together (over the whole block of data, not per column), and when each occurrence happened.
I am stuck on the first part having tried the following using data.table:
count <- function(){
i = 1
DT <-data.table(data[2:6])
for (i in 10:20){
DT[, .N, by =i]
i = i + 1
}
}
which gives an error that I don't begin to understand
Error in `[.data.table`(DT, , .N, by = i) :
The items in the 'by' or 'keyby' list are length (1). Each must be same length as rows in x or number of rows returned by i (11)
Can someone help, please? I also need help with the second part, which I have not even attempted yet, i.e. associating a date or a row number with each occurrence of a number.
Perhaps you may want this
library(reshape2)
table(melt(data[,-1], id.var='date')[,-2])
# value
#date 10 11 12 13 14 15 16 17 18 19 20
# 12-03-2014 0 0 1 0 0 1 0 0 1 2 0
# 12-04-2014 2 0 0 2 2 0 1 0 1 1 1
# 12-05-2014 0 0 0 0 0 0 1 1 2 0 1
# 12-06-2014 1 1 0 0 0 1 0 1 0 0 1
# 12-07-2014 0 1 0 1 0 1 1 1 0 0 0
# 12-08-2014 1 1 0 0 1 0 0 1 1 0 0
# 12-09-2014 0 0 2 0 1 2 0 0 0 0 0
# 12-10-2014 0 0 1 1 0 0 1 0 0 1 1
# 12-11-2014 0 1 1 0 0 0 1 0 0 1 1
# 12-12-2014 1 1 0 1 1 0 0 1 0 0 0
Or if you need a data.table solution (from @Arun's comments)
library(data.table)
dcast.data.table(melt(setDT(data),
id="date", measure=2:6), date ~ value)
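If no extra package is wanted, a base R sketch of the same idea is to flatten columns 2:6 into one vector and tabulate it, once on its own and once against the dates. The seed and the names vals, overall and by_date are only illustrative:

```r
set.seed(1)  # arbitrary seed; the question's data is random
data <- data.frame(
  number = 1:11,
  col1 = sample(10:20), col2 = sample(10:20), col3 = sample(10:20),
  col4 = sample(10:20), col5 = sample(10:20),
  date = c('12-12-2014', '12-11-2014', '12-10-2014', '12-09-2014',
           '12-08-2014', '12-07-2014', '12-06-2014', '12-05-2014',
           '12-04-2014', '12-04-2014', '12-03-2014')
)

# Flatten columns 2:6 column by column into a single vector
vals <- unlist(data[2:6], use.names = FALSE)

# Part 1: how often each number occurs over the whole block
overall <- table(vals)

# Part 2: the same counts split by date; the date column is repeated
# once per flattened column so that it lines up with vals
by_date <- table(date = rep(data$date, times = 5), value = vals)
```

Replacing data$date with data$number in the last call would give the counts per row id instead of per date.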
I have a data frame with 39 columns (and upward of 100000 rows) whose last columns look like this (the rest of the columns do not concern my question)
H3K27me3_gross_bin H3K4me3_gross_bin H3K4me1_gross_bin UtoP UtoM UPU UPP UPM UMU UMP UMM
cg00000029 3 3 6 1 1 0 0 0 0 0 0
cg00000321 6 1 5 1 0 0 1 0 0 0 0
cg00000363 6 1 1 1 0 1 0 0 0 0 0
cg00000622 1 2 1 0 0 0 0 0 0 0 0
cg00000714 2 5 6 1 0 0 0 0 0 0 0
cg00000734 2 6 2 0 0 0 0 0 0 0 0
I want to create a matrix that will:
a) count the number of rows in which the value of column UPU, UPP or UPM is 1, grouped by each of the first three columns (H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin)
b) sum the values of columns UPU, UPP and UPM, grouped by each of the first three columns
I came up with this incredibly cumbersome way of doing this:
UtoPFrac<-seq(6)
UtoPTotEvents<-seq(6)
for (j in 1:3){
y<-df[,28+j]
for (i in 1:3){
UtoPFrac<-cbind(UtoPFrac,tapply(df[which(is.na(y)==FALSE),33+i],y[which(is.na(y)==FALSE)], function(x) length(which(x==1))))
}
}
UtoPFrac<-UtoPFrac[,2:10]
UtoPEvents<-cbind(rowSums(UtoPFrac[,1:3]),rowSums(UtoPFrac[,4:6]),rowSums(UtoPFrac[,7:9]))
I am certain there is a more elegant way of doing this, probably by using aggregate() or ddply(), but I was unable to get it working.
I would appreciate any help doing this more efficiently.
Thanks in advance
Not tested:
library(plyr)
ddply(df, .(H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin), summarize,
      UPUl = length(UPU[which(UPU == 1)]),
      UPPl = length(UPP[which(UPP == 1)]),
      UPMl = length(UPM[which(UPM == 1)]),
      mysum = sum(UPU + UPP + UPM))
P.S. If you dput the data and provide the expected output, I will test the above code
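Note that the ddply call groups by the combination of the three bin columns. If the counts are wanted for each bin column separately, as in the question's loop, a base R sketch with rowsum() might look like the following. It uses only the six rows shown in the question, assumes there are no NA values in the bin columns, and the names bins, events, counts and totals are illustrative:

```r
# The six sample rows from the question (only the columns used here)
df <- data.frame(
  H3K27me3_gross_bin = c(3, 6, 6, 1, 2, 2),
  H3K4me3_gross_bin  = c(3, 1, 1, 2, 5, 6),
  H3K4me1_gross_bin  = c(6, 5, 1, 1, 6, 2),
  UPU = c(0, 0, 1, 0, 0, 0),
  UPP = c(0, 1, 0, 0, 0, 0),
  UPM = c(0, 0, 0, 0, 0, 0),
  row.names = c("cg00000029", "cg00000321", "cg00000363",
                "cg00000622", "cg00000714", "cg00000734")
)

bins   <- c("H3K27me3_gross_bin", "H3K4me3_gross_bin", "H3K4me1_gross_bin")
events <- c("UPU", "UPP", "UPM")

# a) for each bin column, count the rows where each event column equals 1;
#    adding 0L turns the logical matrix into the numeric one rowsum() needs
counts <- sapply(bins, function(b) {
  rowsum((df[events] == 1) + 0L, group = df[[b]])
}, simplify = FALSE)

# b) total number of events per bin value, for each bin column
totals <- lapply(counts, rowSums)

counts$H3K27me3_gross_bin
totals$H3K27me3_gross_bin
```

Each element of counts is a small matrix with one row per bin value and one column per event, so the three results can be bound together with cbind() if a single matrix is preferred.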