R: sum several columns by another column

I have a data frame with 39 columns and upwards of 100,000 rows. Its last eleven columns look like this (the remaining columns are not relevant to my question):
H3K27me3_gross_bin H3K4me3_gross_bin H3K4me1_gross_bin UtoP UtoM UPU UPP UPM UMU UMP UMM
cg00000029 3 3 6 1 1 0 0 0 0 0 0
cg00000321 6 1 5 1 0 0 1 0 0 0 0
cg00000363 6 1 1 1 0 1 0 0 0 0 0
cg00000622 1 2 1 0 0 0 0 0 0 0 0
cg00000714 2 5 6 1 0 0 0 0 0 0 0
cg00000734 2 6 2 0 0 0 0 0 0 0 0
I want to create a matrix that will:
a) count, for each of the first three columns (H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin), the number of rows in which UPU, UPP or UPM equals 1, grouped by the value of that column
b) sum the columns UPU, UPP and UPM within each group of the first three columns
I came up with this incredibly cumbersome way of doing it:
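UtoPFrac <- seq(6)  # placeholder column so that cbind() has something to bind to
UtoPTotEvents <- seq(6)
for (j in 1:3) {
  y <- df[, 28 + j]  # columns 29-31: the three *_gross_bin columns
  keep <- !is.na(y)  # skip rows where the bin value is missing
  for (i in 1:3) {
    # columns 34-36 (UPU, UPP, UPM): count the rows equal to 1 within each bin level
    UtoPFrac <- cbind(UtoPFrac, tapply(df[keep, 33 + i], y[keep], function(x) length(which(x == 1))))
  }
}
UtoPFrac <- UtoPFrac[, 2:10]  # drop the placeholder column
UtoPEvents <- cbind(rowSums(UtoPFrac[, 1:3]), rowSums(UtoPFrac[, 4:6]), rowSums(UtoPFrac[, 7:9]))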
I am certain there is a more elegant way of doing this, probably using aggregate() or ddply(), but I was unable to get it working.
I would appreciate any help doing this more efficiently.
Thanks in advance

Not tested:
library(plyr)
ddply(df, .(H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin), summarize, UPUl = length(UPU[which(UPU == 1)]), UPPl = length(UPP[which(UPP == 1)]), UPMl = length(UPM[which(UPM == 1)]), mysum = sum(UPU + UPP + UPM))
P.S. If you dput the data and provide the expected output, I will test the above code
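For the per-column grouping the question describes (each bin column on its own, rather than every combination of the three), a base R sketch using rowsum() might look like the following. The six rows are copied from the sample in the question; the names bin_cols, event_cols, counts and totals are made up for illustration, and the real data may need different handling.

```r
# Sample rows from the question (only the columns used here)
df <- data.frame(
  H3K27me3_gross_bin = c(3, 6, 6, 1, 2, 2),
  H3K4me3_gross_bin  = c(3, 1, 1, 2, 5, 6),
  H3K4me1_gross_bin  = c(6, 5, 1, 1, 6, 2),
  UPU = c(0, 0, 1, 0, 0, 0),
  UPP = c(0, 1, 0, 0, 0, 0),
  UPM = c(0, 0, 0, 0, 0, 0),
  row.names = c("cg00000029", "cg00000321", "cg00000363",
                "cg00000622", "cg00000714", "cg00000734")
)

bin_cols   <- c("H3K27me3_gross_bin", "H3K4me3_gross_bin", "H3K4me1_gross_bin")
event_cols <- c("UPU", "UPP", "UPM")

# a) For each bin column: a matrix (bin level x event column) counting rows equal to 1
counts <- lapply(bin_cols, function(b) {
  keep <- !is.na(df[[b]])                       # ignore rows with a missing bin
  is_one <- (df[keep, event_cols] == 1) * 1L    # logical matrix turned into 0/1
  rowsum(is_one, group = df[[b]][keep], na.rm = TRUE)
})
names(counts) <- bin_cols

# b) Total number of events per bin level, for each bin column
totals <- lapply(counts, rowSums)

print(counts$H3K27me3_gross_bin)
print(totals$H3K27me3_gross_bin)
```

Each element of counts is a small matrix whose row names are the bin levels present in the data, so levels with no rows do not appear.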

Related

Find rows of matrix that contain the value '1' no more than 2 times in R?

I have successfully imported my csv file into R. It is a 6 by 6 matrix.
0 0 0 0 0 0
0 1 0 0 0 0
0 1 1 0 0 0
0 1 0 0 0 1
0 1 0 1 0 0
1 1 1 1 1 1
I am looking for a function that will allow me to calculate which rows have the value '1' exactly twice.
I know that 3 of the rows contain '1' exactly twice, so I would like to print '3'.
Is there any function that will allow me to achieve this?
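We can use rowSums to get the sum of each row, compare it with 2 to get a logical vector, and wrap that in which to get the row positions. This works because the matrix contains only 0s and 1s, so each row sum equals the number of 1s in that row.
which(rowSums(m1) == 2)
If you want the count instead, use sum: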
sum(rowSums(m1) == 2)
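As a self-contained check, the matrix from the question can be rebuilt by hand (the name m1 follows the answer; in practice it would come from the imported csv). Counting with m1 == 1 rather than summing the raw values is a small variation that would also work if the matrix held values other than 0 and 1.

```r
# The 6 x 6 matrix from the question, entered row by row
m1 <- matrix(c(0, 0, 0, 0, 0, 0,
               0, 1, 0, 0, 0, 0,
               0, 1, 1, 0, 0, 0,
               0, 1, 0, 0, 0, 1,
               0, 1, 0, 1, 0, 0,
               1, 1, 1, 1, 1, 1),
             nrow = 6, byrow = TRUE)

ones_per_row <- rowSums(m1 == 1)           # number of 1s in each row

rows_with_two   <- which(ones_per_row == 2)  # positions: 3 4 5
n_rows_with_two <- sum(ones_per_row == 2)    # count: 3

print(rows_with_two)
print(n_rows_with_two)
```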

Compute percentage weights on rows when one column is not numeric

I have this data called out:
Dates Consumer Staples Energy Financials Health Care
1 12/31/99 0 0 0 0 0
2 03/31/00 0 0 0 0 0
3 06/30/00 0 0 0 0 0
4 09/30/00 0 0 0 0 0
5 12/31/00 0 0 0 0 0
6 03/31/01 1000 0 0 50 0
7 06/30/01 0 0 0 0 0
I would like to compute the weights for each category on each row,
but I need to avoid summing the first column, which is a date:
Weights <- round(out[2:6]/rowSums(out[2:6])*100, 2)
1/ Is there a way to keep the dates in the first column and compute
the weights of the next 5 columns in the same data set?
2/ When a date has only 0 data, how can I avoid the NAs?
Thank you for your help
outN <- out[,-1]
rownames(outN) <- out[,1]
Cap_Weights <- round(outN/rowSums(outN)*100, 2)
Cap_Weights[is.na(Cap_Weights)] <- 0
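If the dates should stay as a regular first column rather than become row names, one possible variation is to compute the weights on the numeric columns only and write them back in place. This is a sketch on a cut-down, made-up version of out (three rows, three hypothetical category columns); num_cols, totals and weights are illustrative names.

```r
# Small stand-in for `out`; the column names are hypothetical
out <- data.frame(
  Dates      = c("12/31/00", "03/31/01", "06/30/01"),
  Staples    = c(0, 1000, 0),
  Energy     = c(0, 0, 0),
  Financials = c(0, 50, 0),
  stringsAsFactors = FALSE
)

num_cols <- sapply(out, is.numeric)        # TRUE for every column except Dates
totals   <- rowSums(out[num_cols])         # row totals over the numeric columns

w <- round(out[num_cols] / totals * 100, 2)
w[is.na(w)] <- 0                           # 0/0 gives NaN on all-zero rows; set those to 0

weights <- out
weights[num_cols] <- w                     # Dates column is left untouched

print(weights)
```

Selecting the columns with is.numeric avoids hard-coding positions such as 2:6, at the cost of assuming that every numeric column is a category to be weighted.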

Selecting specific columns from dataset

I have a dataset which looks like this:
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
3 1 1 0 0 1 1 0
6 3 0 1 0 1 0 1
2 3 1 0 0 1 1 0
10 5 0 1 1 0 1 0
0 0 1 0 1 0 0 1
I want a new data frame (df) that contains only the columns whose names end with .1 (1.1, 2.1, 7.1), i.e.
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
0 1 0
1 1 1
0 1 0
1 0 0
0 0 1
I show only a few columns here, but the actual dataset contains more than 100 columns, so please provide a solution that works for any number of columns.
Thanks in advance.
I guess the pattern is that the column name ends in ".1"; you may need to adapt the code at that point.
The data I am using:
original_data
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
1 3 1 1 0 0 1 1 0
This selects every name that ends with "1" preceded by any character, because an unescaped "." in a regular expression matches any character:
df <- original_data[which(grepl(".1$", names(original_data)))]
For names ending with a literal ".1" you have to escape the dot:
df <- original_data[which(grepl("\\.1$", names(original_data)))]
For original_data both gave me the same result:
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
1 0 1 0
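The two patterns only give the same result when no column name ends in "1" without a dot before it. A small sketch with one extra, hypothetical column (X61) shows the difference; endsWith() from base R is included as a regex-free alternative, and loose and strict are illustrative names.

```r
# One-row stand-in for the data; X61 is a made-up column added to show the difference
original_data <- data.frame(
  A = 3, B = 1, X50_TT_1.0 = 1, X50_TT_1.1 = 0, X61 = 5, X100_L2V_7.1 = 0
)
nms <- names(original_data)

loose  <- nms[grepl(".1$", nms)]     # "." matches any character, so X61 is kept
strict <- nms[grepl("\\.1$", nms)]   # escaped dot: only names ending in ".1"

# Same selection as `strict`, using a plain suffix test instead of a regex
df <- original_data[endsWith(nms, ".1")]

print(loose)
print(strict)
print(df)
```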

Random binary distribution

I would like to generate a random binary combination (row order) in my dataframe df:
bin
2
2
2
2
3
2
3
2
In this example I intend to generate six 0s (as many as there are 2s) and two 1s (as many as there are 3s). I expect something like this:
bin
0
0
1
0
0
1
0
0
Any ideas? Thank you
So given a vector bin
bin<-c(2,2,2,2,3,2,3,2)
You would like to create a new vector that contains the same number of 0's as the number of 2's in bin, and the same number of 1's as the number of 3's in bin. Assuming that's correct, then
sample(rep(0:1, table(bin)))
Should do the trick. Here are the results of running that command several times:
# 0 0 0 0 1 1 0 0
# 0 0 0 1 0 0 1 0
# 0 0 0 1 0 0 1 0
# 0 0 1 0 1 0 0 0
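Broken into steps, the one-liner counts the distinct values with table(), repeats 0 and 1 that many times, and shuffles the result. The sketch below assumes bin holds exactly two distinct values, with the smaller one mapped to 0; set.seed() is there only to make the run reproducible, and counts, shuffled and recoded are illustrative names.

```r
set.seed(1)                                 # only for reproducibility of this sketch

bin <- c(2, 2, 2, 2, 3, 2, 3, 2)

counts <- table(bin)                        # 2 appears six times, 3 appears twice
zeros_then_ones <- rep(0:1, times = as.vector(counts))  # six 0s followed by two 1s
shuffled <- sample(zeros_then_ones)         # random permutation of those eight values

# An equivalent route under the same assumption: recode 3 as 1, then shuffle
recoded <- sample(as.integer(bin == 3))

print(shuffled)
print(recoded)
```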

Apply a function to elements of matrix on condition

I'm looking to change the value of a certain entry in a matrix based on the value of another entry. It's easiest to explain with an example:
Matrix
ABC-DEF 1 0 0 0
HIJ-KLM 0 0 0 0
NOP-QRS 1 0 0 0
KLM-HIJ 0 0 0 0
DEF-ABC 0 0 0 0
QRS-NOP 0 0 0 0
As you can see, each of the rows in the matrix above has a counterpart (e.g. ABC-DEF's counterpart is DEF-ABC).
Is there some way in which I can look to see which rows have a one in the first column and then place a 2 in the fourth column of its counterpart? In the above example then:
ABC-DEF 1 0 0 0
HIJ-KLM 0 0 0 0
NOP-QRS 1 0 0 0
KLM-HIJ 0 0 0 0
DEF-ABC 0 0 0 2
QRS-NOP 0 0 0 2
I'm quite stuck and would really appreciate any help!
Thanks!
Assuming your data is a data frame d with column names V1,...,V5, you can do something like this:
values <- d$V1[d$V2==1]
d$V5[d$V1 %in% gsub("(...)-(...)","\\2-\\1", values)] <- 2
Which will give :
V1 V2 V3 V4 V5
1 ABC-DEF 1 0 0 0
2 HIJ-KLM 0 0 0 0
3 NOP-QRS 1 0 0 0
4 KLM-HIJ 0 0 0 0
5 DEF-ABC 0 0 0 2
6 QRS-NOP 0 0 0 2
If, instead of a data frame, your data is a numeric matrix m with row names, you can do:
values <- rownames(m)[m[,1]==1]
m[rownames(m) %in% gsub("(...)-(...)","\\2-\\1", values),4] <- 2
EDIT: To understand what the code is doing, note that:
gsub("(...)-(...)","\\2-\\1", values)
replaces every character string of the form XXX-YYY in the values vector with YYY-XXX via regexp matching. The result is a character vector of the "counterparts" of values. We then use %in% to select every row whose name appears among these counterparts, and assign 2 to the fourth column of those rows.
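To try the matrix version end to end, the example from the question can be rebuilt as a numeric matrix with row names. The construction of m below is an assumption about how the data is stored, and counterparts is an illustrative name; the swap itself uses the same pattern as above, which expects three characters on each side of the hyphen.

```r
# Rebuild the example: a 6 x 4 numeric matrix with the pair labels as row names
m <- matrix(0, nrow = 6, ncol = 4,
            dimnames = list(c("ABC-DEF", "HIJ-KLM", "NOP-QRS",
                              "KLM-HIJ", "DEF-ABC", "QRS-NOP"), NULL))
m[c("ABC-DEF", "NOP-QRS"), 1] <- 1

values <- rownames(m)[m[, 1] == 1]                          # rows with a 1 in column 1
counterparts <- gsub("(...)-(...)", "\\2-\\1", values)      # swap the two halves

m[rownames(m) %in% counterparts, 4] <- 2                    # mark the counterpart rows

print(counterparts)
print(m)
```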
