For a sample dataframe:
df1 <- structure(list(i.d = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), group = c(1L,
1L, 2L, 1L, 3L, 3L, 2L, 2L, 1L), cat = c(0L, 0L, 1L, 1L, 0L,
0L, 1L, 0L, NA)), .Names = c("i.d", "group", "cat"), class = "data.frame", row.names = c(NA,
-9L))
I wish to add an additional column to my dataframe ("pc.cat") which records the percentage of 1s in column cat BY the group ID variable.
For example, there are four values in group 1 (i.d's a, b, d and i). The value for 'i' is NA, so it can be ignored for now. Only one of the three remaining values is 1, so the percentage would read 33.33 (to 2 dp). This value should be populated into column 'pc.cat' for every row in group 1 (including the row where cat is NA). The process would then be repeated for the other groups (2 and 3).
If anyone could help me with the code for this I would greatly appreciate it.
This can be accomplished with the ave function:
df1$pc.cat <- ave(df1$cat, df1$group, FUN=function(x) 100*mean(na.omit(x)))
df1
# i.d group cat pc.cat
# 1 a 1 0 33.33333
# 2 b 1 0 33.33333
# 3 c 2 1 66.66667
# 4 d 1 1 33.33333
# 5 e 3 0 0.00000
# 6 f 3 0 0.00000
# 7 g 2 1 66.66667
# 8 h 2 0 66.66667
# 9 i 1 NA 33.33333
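The question asks for the value to 2 dp, which the ave call above does not do on its own. A minimal sketch of one way to add the rounding, assuming the same sample data; pct_ones is a hypothetical helper name, not part of the original answer:

```r
# Same sample data as in the question, written out compactly.
df1 <- data.frame(
  i.d   = letters[1:9],
  group = c(1L, 1L, 2L, 1L, 3L, 3L, 2L, 2L, 1L),
  cat   = c(0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, NA)
)

# pct_ones() is a hypothetical helper: percentage of 1s among the
# non-NA values, rounded to 2 decimal places.
pct_ones <- function(x) round(100 * mean(x == 1, na.rm = TRUE), 2)

# ave() returns a vector as long as its input, so every row of a group
# (including the row where cat is NA) receives the group's value.
df1$pc.cat <- ave(df1$cat, df1$group, FUN = pct_ones)
df1
```

Using mean(x == 1, na.rm = TRUE) rather than mean(x) keeps the result correct even if cat ever holds values other than 0 and 1.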
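library(data.table)
setDT(df1)  # convert to a data.table by reference
# Per-group percentage of 1s among the non-NA values (one row per group)
df1[!is.na(cat), .(pc.cat = 100 * mean(cat)), by = group]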
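With data.table, adding the column by reference:
library(data.table)
DT <- data.table(df1)
# na.rm = TRUE divides by the number of non-NA values; dividing by length(cat) would also count the NA row
DT[, pc.cat := 100 * mean(cat, na.rm = TRUE), by = group]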
DF
ID B C D
1 A 1 1 3
2 B 2 3 1
3 C 1 1 1
4 D 3 1 1
5 E 1 0 0
Given a dataframe such as the one above, how can I quickly calculate the means for each row in one column and store them in another column of the dataframe? For example, the result for column B would be: 0.5, 1, 0.5, 1.5, 0.5.
And is it possible to have a function that does it automatically for several columns at once?
One option is to look up the row whose 'ID' matches the column name and divide the whole column by the value found there:
f1 <- function(dat, colNm) transform(dat,
newCol = dat[[colNm]]/dat[match(colNm, ID), colNm])
f1(DF, 'B')
# ID B C D newCol
#1 A 1 1 3 0.5
#2 B 2 3 1 1.0
#3 C 1 1 1 0.5
#4 D 3 1 1 1.5
#5 E 1 0 0 0.5
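For the second part of the question (several columns at once), a minimal sketch that applies the same lookup to each column in turn. scale_by_id_row and the "_scaled" suffix are hypothetical names chosen for illustration, and the sketch assumes every requested column name also appears in ID:

```r
DF <- data.frame(
  ID = c("A", "B", "C", "D", "E"),
  B  = c(1L, 2L, 1L, 3L, 1L),
  C  = c(1L, 3L, 1L, 1L, 0L),
  D  = c(3L, 1L, 1L, 1L, 0L)
)

# Hypothetical helper: for each column in `cols`, divide the column by
# the value it holds in the row whose ID equals the column name.
scale_by_id_row <- function(dat, cols) {
  for (col in cols) {
    divisor <- dat[[col]][match(col, dat$ID)]  # NA if col is not in ID
    dat[[paste0(col, "_scaled")]] <- dat[[col]] / divisor
  }
  dat
}

res <- scale_by_id_row(DF, c("B", "C", "D"))
res
```

If a column name has no matching ID, match() returns NA and the new column is all NA, which at least makes the problem visible.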
If it is to divide by a constant value, then just do
DF[-1] <- DF[-1]/2
data
DF <- structure(list(ID = c("A", "B", "C", "D", "E"), B = c(1L, 2L,
1L, 3L, 1L), C = c(1L, 3L, 1L, 1L, 0L), D = c(3L, 1L, 1L, 1L,
0L)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5"))
Hello, I have a data frame and I need to remove every row that contains the maximum value of any column.
Example
A B C
1 2 3 5
2 4 1 1
3 1 4 3
4 2 1 1
So the output is:
A B C
4 2 1 1
Is there any quick way to do this?
We can do this with %in%
df1[!seq_len(nrow(df1)) %in% sapply(df1, which.max),]
# A B C
#4 2 1 1
If there are ties for the maximum value within a column, then do
df1[!Reduce(`|`, lapply(df1, function(x) x== max(x))),]
df[-sapply(df, which.max),]
# A B C
#4 2 1 1
DATA
df = structure(list(A = c(2L, 4L, 1L, 2L), B = c(3L, 1L, 4L, 1L),
C = c(5L, 1L, 3L, 1L)), .Names = c("A", "B", "C"),
class = "data.frame", row.names = c(NA,-4L))
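To see why ties matter, here is a small sketch comparing the which.max approach with the x == max(x) approach. df_ties is made-up data for illustration (column A has its maximum in two rows); it is not from the original question:

```r
# Hypothetical data: the maximum of A (4) occurs in rows 1 and 2.
df_ties <- data.frame(
  A = c(4L, 4L, 1L, 2L),
  B = c(3L, 1L, 4L, 1L),
  C = c(5L, 1L, 3L, 1L)
)

# which.max() reports only the FIRST position of each column's maximum,
# so row 2 survives even though it holds a maximum of A.
first_only <- df_ties[!seq_len(nrow(df_ties)) %in% sapply(df_ties, which.max), ]

# Comparing against max(x) flags every tied row; Reduce(`|`, ...) then
# marks a row for removal if it holds the maximum of any column.
all_ties <- df_ties[!Reduce(`|`, lapply(df_ties, function(x) x == max(x))), ]

first_only
all_ties
```

Neither version handles NA values; with missing data, max(x, na.rm = TRUE) and an explicit treatment of the NA comparisons would be needed.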
Starting from a table of 372 columns and 12,000 rows in R, I need to create a new table in which each column holds the row-wise sum of a block of four consecutive columns of the original: columns 1:4, then 5:8, then 9:12, and so on up to column 372. Here is a short example:
Input:
m = structure(c(3L, 1L, 2L, 6L, 3L, 1L, 1L, 8L, 1L, 5L, 2L, 1L, 3L, 7L,
1L, 1L), .Dim = c(2L, 8L), .Dimnames = list(c("r1", "r2"), c("a", "b",
"c", "d", "e", "f", "g", "h")))
Which looks like this:
a b c d e f g h
r1 3 2 3 1 1 2 3 1
r2 1 6 1 8 5 1 7 1
Expected output:
A B
r1 9 7
r2 16 14
So, A = a+b+c+d, and B = e+f+g+h. Easy to do with a small table in Excel. Columns a-d correspond to a group, e-h to another, if that helps.
The question is currently underspecified, but supposing you have a matrix...
m = structure(c(3L, 1L, 2L, 6L, 3L, 1L, 1L, 8L, 1L, 5L, 2L, 1L, 3L,
7L, 1L, 1L), .Dim = c(2L, 8L), .Dimnames = list(c("r1", "r2"),
c("a", "b", "c", "d", "e", "f", "g", "h")))
Make your column mapping:
map = data.frame(old = colnames(m), new = rep(LETTERS, each=4, length.out=ncol(m)))
old new
1 a A
2 b A
3 c A
4 d A
5 e B
6 f B
7 g B
8 h B
And then rowsum by it:
res = rowsum(t(m), map$new)
r1 r2
A 9 16
B 7 14
We have to transpose the data with t here because R has rowsum but no colsum. You can transpose it back afterwards, like t(res).
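With 372 columns there are more blocks than letters, so a numeric group index may be more convenient than LETTERS. A minimal sketch of the same rowsum idea under that assumption (grp and res are names chosen here for illustration):

```r
m <- matrix(
  c(3L, 1L, 2L, 6L, 3L, 1L, 1L, 8L, 1L, 5L, 2L, 1L, 3L, 7L, 1L, 1L),
  nrow = 2,
  dimnames = list(c("r1", "r2"), letters[1:8])
)

# Block index for each column: 1 1 1 1 2 2 2 2 ...
grp <- (seq_len(ncol(m)) - 1) %/% 4 + 1

# rowsum() adds up rows that share a group label, so transpose first,
# sum the former columns by block, then transpose back.
res <- t(rowsum(t(m), grp))
res
```

The result keeps the original row names, and its columns are labelled with the block numbers.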
A base R solution, suppose df is your data frame:
ncols = ncol(df)  # number of columns; must be a multiple of 4
do.call(cbind, lapply(seq(1, ncols, 4), function(i) rowSums(df[i:(i+3)])))
# [,1] [,2]
# r1 9 7
# r2 16 14
Another way:
df <- data.frame(t(matrix(colSums(matrix(t(df), nrow=4)), nrow=ncol(df)/4)))  # nrow = number of 4-column blocks, so this also works when nrow(df) differs from ncol(df)/4
## X1 X2
##1 9 7
##2 16 14
1. First transpose the data to a 4 x (ncol(df)/4 * nrow(df)) matrix, where each column now holds one group of four columns for one row of the original data frame.
2. Sum each column using colSums.
3. Transpose the data back to a data frame with the original number of rows.
You can do this in a vectorised way if you transform your original data to a matrix with 4 columns, then use rowSums on that, and then transform it back to match the rows of the original data frame. Here it is in one long command:
df <- read.table(header = TRUE, text = "a b c d e f g h
3 2 3 1 1 2 3 1
1 6 1 8 5 1 7 1")
matrix(rowSums(matrix(as.vector(t(as.matrix(df))),
ncol = 4, byrow = TRUE)), ncol = ncol(df) / 4, byrow = TRUE)
# [,1] [,2]
#[1,] 9 7
#[2,] 16 14
Edit: To preserve the row names, if e.g. rownames(df) <- c("r1", "r2"), just apply them to the resulting matrix (the row order is preserved), i.e. run rownames(result) <- rownames(df).
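Since the small example happens to have as many rows as blocks, it can be worth checking the reshape on data where the two differ. A rough sketch, assuming a made-up 3 x 12 data frame and using a plain loop over blocks as the reference; the names reshaped and looped are illustrative:

```r
set.seed(1)
# Hypothetical test data: 3 rows, 12 columns (three blocks of four).
df <- as.data.frame(matrix(sample(0:9, 3 * 12, replace = TRUE), nrow = 3))

# Vectorised reshape: flatten row by row, regroup into rows of four,
# sum each, and fold the sums back into one row per original row.
reshaped <- matrix(
  rowSums(matrix(as.vector(t(as.matrix(df))), ncol = 4, byrow = TRUE)),
  ncol = ncol(df) / 4, byrow = TRUE
)
rownames(reshaped) <- rownames(df)  # row order is preserved

# Reference: sum each block of four columns explicitly.
looped <- sapply(seq(1, ncol(df), by = 4), function(i) rowSums(df[i:(i + 3)]))

all(reshaped == looped)
```

If the two disagree on your own data, the usual culprit is a column count that is not a multiple of four.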
I have a matrix(similar to a wig file) like this:
Position reference A C G T N sum(total read counts)
68773265 A 1 0 0 0 0 1
68773266 C 0 1 0 1 0 2
68773267 C 0 1 1 2 0 4
To achieve variant(non-reference) allele ratio,
I want to create this: (sum-reference sequence's count)/sum * 100 per position
Position reference frequency(%) sum(total read counts)
68773265 A 0 1
68773266 C 50 2
68773267 C 75 4
Please give me some advice on this problem. Thanks in advance!!
Using the subset of column names "nm1", match the "reference" column with the "nm1" to get the column index, cbind with 1:nrow(df1) for creating row/column index. Get the rowSums of "nm1" columns ("Sum1"), use this to create "frequencyPercent" based on the formula in the post.
nm1 <- c('A', 'C', 'G', 'T') # this could include `N` also
indx <- cbind(1:nrow(df1), match(df1$reference, nm1))
Sum1 <- rowSums(df1[nm1])
data.frame(df1[1:2], frequencyPercent=100*(Sum1-df1[nm1][indx])/Sum1,
SumTotalCounts=df1[,ncol(df1)])
Or use transform on the original dataset
transform(df1, frequencyPercent=100*(Sum1-df1[nm1][indx])/Sum1,
check.names=FALSE)[c(1:2,8:9)]
# Position reference sum(total read counts) frequencyPercent
#1 68773265 A 1 0
#2 68773266 C 2 50
#3 68773267 C 4 75
data
df1 <- structure(list(Position = 68773265:68773267, reference = c("A",
"C", "C"), A = c(1L, 0L, 0L), C = c(0L, 1L, 1L), G = c(0L, 0L,
1L), T = 0:2, N = c(0L, 0L, 0L), `sum(total read counts)` = c(1L,
2L, 4L)), .Names = c("Position", "reference", "A", "C", "G",
"T", "N", "sum(total read counts)"), class = "data.frame",
row.names = c(NA, -3L))
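The key step above is indexing with a two-column matrix of (row, column) pairs, which picks exactly one cell per row. A stripped-down sketch of just that step, using a reduced copy of the data (counts, bases and the other names here are chosen for illustration):

```r
counts <- data.frame(
  reference = c("A", "C", "C"),
  A = c(1L, 0L, 0L),
  C = c(0L, 1L, 1L),
  G = c(0L, 0L, 1L),
  T = 0:2
)
bases <- c("A", "C", "G", "T")

# One (row, column) pair per position: the column is the one named by
# that row's reference base.
idx <- cbind(seq_len(nrow(counts)), match(counts$reference, bases))

# Matrix indexing returns one cell per pair: the reference base's count.
ref_count <- as.matrix(counts[bases])[idx]

total <- rowSums(counts[bases])
variant_pct <- 100 * (total - ref_count) / total
variant_pct
```

A position with zero total reads would give NaN here, so real data may need a guard for that case.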
I have a data frame df with rows that are duplicates for the names column but not for the values column:
name value etc1 etc2
A 9 1 X
A 10 1 X
A 11 1 X
B 2 1 Y
C 40 1 Y
C 50 1 Y
I need to aggregate the duplicate names into one row, while calculating the mean over the values column. The expected output is as follows:
name value etc1 etc2
A 10 1 X
B 2 1 Y
C 45 1 Y
I have tried to use df[duplicated(df$name),] but of course this does not give me the mean over the duplicates. I would like to use aggregate(), but the problem is that the FUN part of this function will apply to all the other columns as well, and among other problems, it will not be able to compute char content. Since all the other columns have the same content over the "duplicates", I need them to be aggregated as is just like the name column. Any hints...?
Here is a data.table solution. It is general in the sense that it works even for a data.frame with 60 columns, because the data are grouped by every column other than value (see how keys is built below).
library(data.table)
dat <- read.table(text='name value etc1 etc2
A 9 1 X
A 10 1 X
A 11 1 X
B 2 1 Y
C 40 1 Y
C 50 1 Y',header=TRUE)
keys <- colnames(dat)[!grepl('value',colnames(dat))]
X <- as.data.table(dat)
X[,list(mm= mean(value)),keys]
name etc1 etc2 mm
1: A 1 X 10
2: B 1 Y 2
3: C 1 Y 45
EDIT extend to more than one value variable
If you have more than one numeric variable for which you want to compute the mean, for example if your data look like this:
name value etc1 etc2 value1
1 A 9 1 X 2.1763485
2 A 10 1 X -0.7954326
3 A 11 1 X -0.5839844
4 B 2 1 Y -0.5188709
5 C 40 1 Y -0.8300233
6 C 50 1 Y -0.7787496
The above solution can be extended like this :
X[,lapply(.SD,mean),keys]
name etc1 etc2 value value1
1: A 1 X 10 0.2656438
2: B 1 Y 2 -0.5188709
3: C 1 Y 45 -0.8043865
This computes the mean of every column that is not listed in keys.
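You can use the aggregate() function like below (the data argument belongs to the formula interface only, so it is dropped here; passing df["value"] keeps the column name):
aggregate(df["value"], by=list(name=df$name, etc1=df$etc1, etc2=df$etc2), FUN=mean)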
The code (written by Metrics) is almost working except in one place (.name). I slightly modified it:
sample<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L,
50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name",
"value", "etc1", "etc2"), class = "data.frame", row.names = c(NA,
-6L))
sample.m <- ddply(sample, 'name', summarize, value =mean(value), etc1=head(etc1,1), etc2=head(etc2,1))
sample.m
name value etc1 etc2
1 A 10 1 X
2 B 2 1 Y
3 C 45 1 Y
Assuming your dataframe is df.
install.packages("plyr")
library(plyr)
df<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L,
50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L,
1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name",
"value", "etc1", "etc2"), class = "data.frame", row.names = c(NA,
-6L))
df.m<-ddply(df,.(name),summarize, value=mean(value),etc1=head(etc1,1),etc2=head(etc2,1))
df.m
name value etc1 etc2
1 A 10 1 X
2 B 2 1 Y
3 C 45 1 Y
This simple one worked for me:
avg_data <- aggregate( . ~ name, df, mean)
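Using the "aggregate" function: apply the formula method ( x ~ y ) to all variables (.) grouped by the naming variable ("name"), within the data.frame "df", to perform the "mean" function. Note that mean is then applied to every column other than name, so this form is only appropriate when those columns are all numeric; a character or factor column such as etc2 will not come back with its original labels.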
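When the other columns are not numeric, one base R variant is to put only value on the left of the formula and let the dot stand for the grouping columns instead. A minimal sketch, assuming the example data from the question and that the remaining columns are constant within each name:

```r
df <- data.frame(
  name  = c("A", "A", "A", "B", "C", "C"),
  value = c(9, 10, 11, 2, 40, 50),
  etc1  = 1L,
  etc2  = c("X", "X", "X", "Y", "Y", "Y")
)

# value ~ . averages value within each combination of ALL other columns,
# so name, etc1 and etc2 are carried through unchanged.
res <- aggregate(value ~ ., data = df, FUN = mean)

# Restore the original column order and sort by name for readability.
res <- res[order(res$name), names(df)]
res
```

If a supposedly constant column actually varies within a name, that name will appear in more than one output row, which is a useful sanity check in itself.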