Average across some columns in R

I have not found a way to take an average across SOME columns in R when working with a data frame table. Basically, I want to take the average of the 3 controls (CTR_R1+CTR_R2+CTR_R3) and insert that value as another column right after CTR_R3 (see below). The same for the TRT.
Is there a way to take the average and insert it in a specific location?
GeneID|CTR_R1|CTR_R2|CTR_R3|CTR_AVG|TRT_R1| TRT_R2| TRT_R3|TRT_AVG|pValue

How about
df$CTR_AVG <- rowMeans(df[,2:4])
df$TRT_AVG <- rowMeans(df[,6:8])

This code should work for you, if your data.frame is named df:
df$CTR_AVG <- ( df$CTR_R1 + df$CTR_R2 + df$CTR_R3 ) / 3
That assumes the CTR_AVG column already exists, as shown in your question. If it does not, the code will add the column at the end of the data.frame. To move it to the right spot (after computing TRT_AVG the same way), select the columns in the desired order and assign the result back, like so:
df <- df[ , c( 'GeneID', 'CTR_R1', 'CTR_R2', 'CTR_R3', 'CTR_AVG', 'TRT_R1', 'TRT_R2', 'TRT_R3', 'TRT_AVG', 'pValue' ) ]

The code below should work even if there are many CTR or TRT columns (e.g. hundreds). But I am guessing @beginneR's solution is faster.
indx <- grep("^CTR", colnames(df1), value=TRUE)
indxT <- grep("^TRT", colnames(df1), value=TRUE)
df1[,c('CTR_Avg', 'TRT_Avg')] <- lapply(list(indx, indxT),
function(x) Reduce(`+`, df1[,x])/length(x))
or you can use rowMeans in the above step.
df2 <- df1[,c('GeneID', indx, 'CTR_Avg', indxT, 'TRT_Avg', 'pValue')]
head(df2,2)
# GeneID CTR_R1 CTR_R2 CTR_R3 CTR_Avg TRT_R1 TRT_R2 TRT_R3 TRT_Avg pValue
#1 1 6 2 10 6.000000 10 11 15 12 0.091
#2 2 5 12 8 8.333333 5 3 13 7 0.051
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(1:20,20*6, replace=TRUE), ncol=6))
colnames(df1) <- c("CTR_R1", "CTR_R2", "CTR_R3", "TRT_R1", "TRT_R2", "TRT_R3")
df1 <- cbind(GeneID=1:20, df1,
pValue=sample(seq(0.001, 0.10, by=0.01), 20, replace=TRUE))

make some dummy data
df=data.frame(CTR_R1=1:10,CTR_R2=1:10,CTR_R3=1:10,somethingelse=1:10)
get a new column
df$CTR_AVG=apply(df[c("CTR_R1","CTR_R2","CTR_R3")],1,mean)

Thanks so much for your replies. I am sorry I did not phrase my original question better. I meant to ask how to write one script to take the average and place that value in the right place. I do not have in my table the column that says "CTR_AVG", nor the column "TRT_AVG".
I was wondering if I could do it more 'elegantly' than what I did below (which works too).
Many thanks.
#
names (edgeR_table)
"GeneID" "CTR_R1" "CTR_R2" "CTR_R3" "TRT_R1" "TRT_R2" "TRT_R3" "logFC" "logCPM" "LR" "PValue" "FDR"
#
edgeR_table$CTR_AVG <- rowMeans(edgeR_table[,2:4])
edgeR_table$TRT_AVG <- rowMeans(edgeR_table[,5:7])
edgeR_table <- edgeR_table[, c(1,2,3,4,13,5,6,7,14,8,9,10,11,12)]
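One way to avoid hard-coded column positions is to find the replicate columns by name and splice the average in right after the last of them. The sketch below uses only base R. `add_avg_after` is a hypothetical helper name, the toy table is made up for illustration, and the approach assumes the replicate columns share a prefix such as `CTR_` or `TRT_` and sit next to each other.

```r
# Hypothetical helper: average all columns whose names start with `prefix`
# and insert the result immediately after the last of those columns.
add_avg_after <- function(df, prefix) {
  cols <- grep(paste0("^", prefix), names(df))   # positions of the replicate columns
  avg  <- rowMeans(df[, cols, drop = FALSE])     # row-wise mean of those columns
  last <- max(cols)
  left  <- df[, seq_len(last), drop = FALSE]
  right <- df[, setdiff(seq_along(df), seq_len(last)), drop = FALSE]
  out <- cbind(left, avg, right)
  names(out)[last + 1] <- paste0(prefix, "AVG")  # e.g. "CTR_" becomes "CTR_AVG"
  out
}

# Toy table (made up) with the same kind of layout as in the question.
edgeR_table <- data.frame(
  GeneID = c("g1", "g2"),
  CTR_R1 = c(6, 5),  CTR_R2 = c(2, 12), CTR_R3 = c(10, 8),
  TRT_R1 = c(10, 5), TRT_R2 = c(11, 3), TRT_R3 = c(15, 13),
  PValue = c(0.091, 0.051)
)

edgeR_table <- add_avg_after(edgeR_table, "CTR_")
edgeR_table <- add_avg_after(edgeR_table, "TRT_")
names(edgeR_table)
```

Because the columns are looked up by name, the same two calls keep working if replicates are added or the table gains extra columns.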

Related

How to run Chisq test for multiple rows FASTER in R?

I have managed to run a chisq-test using a loop in R, but it is very slow for large data, and I wonder if you could help me do it faster with something like dplyr. I've tried dplyr, but I keep getting an error and I am not sure of the reason.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is run a chisq-test between each row of df and cs, and get back the statistics and p-values along with the row names.
Here is my code for the loop:
value = matrix(nrow=ncol(df),ncol=3)
for (i in 1:ncol(df)) {
tst <- chisq.test(df[i,], cs)
value[i,1] <- tst$p.value
value[i,2] <- tst$statistic
value[i,3] <- rownames(df)[i]}
Thanks for your help.
I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
value <- data.frame(p_val = NA, stat = NA, exprs = colnames(df)) # one result row per column
for (i in seq_len(ncol(df))) {
# tbl <- table((df[i,]), cs) ### No use seen for this
# I changed the indexing in the next line to compare columns to the standard `cs`.
tst <- chisq.test(df[ ,i], cs) # chisq.test is not vectorized, so some sort of loop is needed
value[i, 1:2] <- tst[ c('p.value', 'statistic')] # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name since there is also a df function) to Biobase::exprs(PANCAN_w).
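The same per-column idea can also be written without pre-allocating a results object, by building one small data frame per test and binding them together. This is only a sketch on made-up categorical data (`dat` and its columns are assumptions, not the asker's expression matrix). It still calls chisq.test() once per column, so it is tidier rather than fundamentally faster.

```r
# Toy data (an assumption): three categorical columns and a reference
# vector `cs` of the same length as each column.
cs <- c(1, 1, 1, 2, 1, 2, 2, 1, 2, 3)
dat <- data.frame(
  a = rep(1:2, 5),
  b = rep(1:3, length.out = 10),
  c = c(1, 1, 2, 2, 1, 1, 2, 2, 1, 2)
)

# Run chisq.test() once per column and bind the pieces into one data frame,
# so numeric results stay numeric and labels stay character.
res <- do.call(rbind, lapply(names(dat), function(nm) {
  # Small expected counts trigger an approximation warning; silenced here.
  tst <- suppressWarnings(chisq.test(dat[[nm]], cs))
  data.frame(exprs = nm, p_val = tst$p.value, stat = unname(tst$statistic))
}))
res
```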

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that's been converted from a document-term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields - for example, the word hello can appear both in the positive and negative comment fields (and thus, represented as a column in my dataframe), so in my model, I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't had any progress on it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it renames by reference, so it doesn't create a copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
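Applied to the two cases in the question, the paste0() approach might look like the sketch below. The `pos` and `neg` tables are hypothetical stand-ins for the positive and negative term-count data frames. mtcars is copied first so the built-in dataset is left untouched.

```r
# Append "_sample" to every column name of a copy of mtcars.
cars <- mtcars
names(cars) <- paste0(names(cars), "_sample")
head(names(cars), 3)

# Hypothetical term-count tables for the positive and negative comment fields.
pos <- data.frame(hello = c(1, 0), great = c(2, 1))
neg <- data.frame(hello = c(0, 3), awful = c(1, 1))

# Tag each table's columns before combining, so "hello" stays distinguishable.
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))
features <- cbind(pos, neg)
names(features)
```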

Multiple one-to-many matching between vectors in R

I want to update a dataframe with values from a table of new values where there is a one-to-many relationship between the dataframe and table of new values. This code illustrates the intent:
df = data.frame(x=rep(letters[1:4],5,rep=T), y=1:20)
and new values..
eds = data.frame(x=c('c','d'), val=c(101, 102))
For a one-to-one relationship the following should work:
df$x[match(eds$x, df$x)] = eds$x[match(df$x, eds$x)]
But match only works with first match, so this throws the error number of items to replace is not a multiple of replacement length. Grateful for any tips on the most efficient way to approach this. I'm guessing some sapply wrapper but I can't think of the method.
Thanks in advance.
tmp <- eds$val[match(df$x, eds$x)] # Matching indices (with NAs for no match)
df$y <- ifelse(is.na(tmp), df$y, tmp) # Values at matches (leaving alone for NAs)
head(df, 5)
# x y
# 1 a 1
# 2 b 2
# 3 c 101
# 4 d 102
# 5 a 5
Note that this is not a very robust solution. It depends on your exact data structure here (the repeating 'c', 'd' pattern), but it works for this case:
df[df[["x"]] %in% eds[["x"]], "y"] = eds[[2]]
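A variant that does not rely on the repeating pattern is to index both sides with the result of match(). This is a sketch using the data from the question. `idx` and `hit` are just illustrative names.

```r
df  <- data.frame(x = rep(letters[1:4], 5), y = 1:20)
eds <- data.frame(x = c("c", "d"), val = c(101, 102))

idx <- match(df$x, eds$x)        # row of eds for each row of df (NA if none)
hit <- !is.na(idx)               # rows of df that have a replacement value
df$y[hit] <- eds$val[idx[hit]]   # every matching row gets its own lookup value
head(df, 8)
```

Each row of df is looked up individually, so the order and frequency of 'c' and 'd' in df$x no longer matter.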

R Using a for() loop to fill one dataframe with another

I have two dataframes and I wish to insert the values of one dataframe into another (let's call them DF1 and DF2).
DF1 consists of 2 columns 1 and 2. Column 1 (col1) contains characters a to z and col2 has values associated with each character (from a to z)
DF2 is a dataframe with 3 columns. The first two consist of every combination of DF1$col1 so: aa ab ac ad etc; where the first letter is in col1 and the second letter is in col2
I want to create a simple mathematical model utilizing the values in DF1$col2 to see the outcomes of every possible combination of objects in DF1$col1
The first step I wanted to do is to transfer values from DF1$col2 to DF2$col3 (values from DF2$col3 should be associated to values in DF2$col1), but that's where I'm stuck. I currently have
for(j in 1:length(DF2$col1))
{
## this part is to use the characters in DF2$col1 as an input
## to yield the output for DF2$col3--
input=c(DF2$col1)[j]
## This is supposed to use the values found in DF1$col2 to fill in DF2$col3
g=DF1[(DF1$col2==input),"pred"]
## This is so that the values will fill in DF2$col3--
DF2$col3=g
}
When I run this, DF2$col3 will be filled up with the same value for a specific character from DF1 (e.g. DF2$col3 will have all the rows filled with the value associated with character "a" from DF1)
What exactly am I doing wrong?
Thanks a bunch for your time
You should really use merge for this, as @Aaron suggested in his comment above, but if you insist on writing your own loop, then the problem is in your last line, where you assign the value of g to the whole col3 column. You should use the j index there as well, like:
for(j in 1:length(DF2$col1))
{
DF2$col3[j] = DF1[which(DF1$col2 == DF2$col1[j]), "pred"]
}
If this does not work, then please also post some sample data so we can help in more detail (I do not know, but have a guess, what "pred" could be).
It sounds like what you are trying to do is a simple join, that is, match DF1$col1 to DF2$col1 and copy the corresponding value from DF1$col2 into DF2$col3. Try this:
DF1 <- data.frame(col1=letters, col2=1:26, stringsAsFactors=FALSE)
DF2 <- expand.grid(col1=letters, col2=letters, stringsAsFactors=FALSE)
DF2$col3 <- DF1$col2[match(DF2$col1, DF1$col1)]
This uses the function match(), which, as the documentation states, "returns a vector of the positions of (first) matches of its first argument in its second." The values you have in DF1$col1 are unique, so there will not be any problem with this method.
As a side note, in R it is usually better to vectorize your work rather than using explicit loops.
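For comparison, the merge() route suggested earlier in the thread could be sketched like this, reusing the same toy data. `lookup` and `merged` are illustrative names. The value column is renamed first because both tables have a column called col2.

```r
DF1 <- data.frame(col1 = letters, col2 = 1:26, stringsAsFactors = FALSE)
DF2 <- expand.grid(col1 = letters, col2 = letters, stringsAsFactors = FALSE)

# Rename DF1's value column so it does not clash with DF2$col2.
lookup <- setNames(DF1, c("col1", "col3"))

# Join on col1; note that merge() sorts the result by the key by default.
merged <- merge(DF2, lookup, by = "col1")
head(merged, 3)
```

Unlike the match() version, merge() does not preserve the original row order of DF2, which may or may not matter for the model.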
Not sure I fully understood your question, but you can try this:
df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2
The last command uses recycling (it could be written as rep(df1$col2, 26) as well).
The results are shown below:
> head(df1, n=3)
col1 col2
1 a 68
2 b 73
3 c 45
> tail(df1, n=3)
col1 col2
24 x 22
25 y 4
26 z 17
> head(df2, n=3)
col1 col2 col3
1 a a 68
2 b a 73
3 c a 45
> tail(df2, n=3)
col1 col2 col3
674 x z 22
675 y z 4
676 z z 17

Last Observation Carried Forward In a data frame? [duplicate]

I wish to implement a "Last Observation Carried Forward" for a data set I am working on which has missing values at the end of it.
Here is a simple code to do it (question after it):
LOCF <- function(x)
{
# Last Observation Carried Forward (for a left to right series)
LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
x[LOCF:length(x)] <- x[LOCF]
return(x)
}
# example:
LOCF(c(1,2,3,4,NA,NA))
LOCF(c(1,NA,3,4,NA,NA))
Now this works great for simple vectors. But if I were to try to use it on a data frame:
a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
a
t(apply(a, 1, LOCF)) # will make a mess
It will turn my data frame into a character matrix.
Can you think of a way to do LOCF on a data.frame, without turning it into a matrix? (I could use loops and such to correct the mess, but would love for a more elegant solution)
This already exists:
library(zoo)
na.locf(data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA)))
If you do not want to load a big package like zoo just for the na.locf function, here is a short solution which also works if there are some leading NAs in the input vector.
na.locf <- function(x) {
v <- !is.na(x)
c(NA, x[v])[cumsum(v)+1]
}
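To use this on a data frame without turning it into a matrix, the function can be applied column by column. In the sketch below the helper is renamed to `locf_vec` (a hypothetical name, chosen to avoid masking zoo::na.locf), and the character column is created with stringsAsFactors = FALSE because the c(NA, ...) trick is not meant for factors.

```r
# Same logic as the short function above, under a different name.
locf_vec <- function(x) {
  v <- !is.na(x)
  c(NA, x[v])[cumsum(v) + 1]   # leading NAs stay NA, later NAs take the last value
}

locf_vec(c(NA, 1, NA, 2, NA))

a <- data.frame(id = rep("a", 4), v1 = 1:4, v2 = c(1, NA, NA, NA),
                stringsAsFactors = FALSE)

# Assigning into a[] keeps `a` a data frame and each column keeps its own type.
a[] <- lapply(a, locf_vec)
str(a)
```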
Adding the newer tidyr::fill() function, which carries forward the last observation in a column to fill in NAs:
library(tidyr) # also provides the %>% pipe used below
a <- data.frame(col1 = rep("a",4), col2 = 1:4,
col3 = 1:4, col4 = c(1,NA,NA,NA))
a
# col1 col2 col3 col4
# 1 a 1 1 1
# 2 a 2 2 NA
# 3 a 3 3 NA
# 4 a 4 4 NA
a %>% tidyr::fill(col4)
# col1 col2 col3 col4
# 1 a 1 1 1
# 2 a 2 2 1
# 3 a 3 3 1
# 4 a 4 4 1
There are a bunch of packages implementing exactly this functionality.
(with same basic functionality, but some differences in additional options)
spacetime::na.locf
imputeTS::na_locf
zoo::na.locf
xts::na.locf
tidyr::fill
Added a benchmark of these methods for @Alex:
I used the microbenchmark package and the tsNH4 time series, which has 4552 observations.
[Figure: benchmark results]
So for this case na_locf from imputeTS was the fastest, closely followed by na.locf0 from zoo. The other methods were significantly slower. But be careful: this is only a benchmark made with one specific time series (the code is included so that you can test your specific use case).
[Figure: benchmark results as a plot]
Here is the code, if you want to recreate the benchmark with a self selected time series:
library(microbenchmark)
library(imputeTS)
library(zoo)
library(xts)
library(spacetime)
library(tidyr)
# Create a data.frame from the tsNH4 series
df <- as.data.frame(tsNH4)
res <- microbenchmark(imputeTS::na_locf(tsNH4),
zoo::na.locf0(tsNH4),
zoo::na.locf(tsNH4),
tidyr::fill(df, everything()),
spacetime::na.locf(tsNH4),
times = 100)
ggplot2::autoplot(res)
plot(res)
# code just to show that each method produces correct output
spacetime::na.locf(tsNH4)
imputeTS::na_locf(tsNH4)
zoo::na.locf(tsNH4)
zoo::na.locf0(tsNH4)
tidyr::fill(df, everything())
This question is old but for posterity... the best solution is to use the data.table package with roll=TRUE.
I ended up solving this using a loop:
fillInTheBlanks <- function(S) {
L <- !is.na(S)
c(S[L][1], S[L])[cumsum(L)+1]
}
LOCF.DF <- function(xx)
{
# won't work well if the first observation is NA
orig.class <- lapply(xx, class)
new.xx <- data.frame(t( apply(xx,1, fillInTheBlanks) ))
for(i in seq_along(orig.class))
{
if(orig.class[[i]] == "factor") new.xx[,i] <- as.factor(new.xx[,i])
if(orig.class[[i]] == "numeric") new.xx[,i] <- as.numeric(new.xx[,i])
if(orig.class[[i]] == "integer") new.xx[,i] <- as.integer(new.xx[,i])
}
#t(na.locf(t(a)))
return(new.xx)
}
a <- data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA))
LOCF.DF(a)
Instead of apply() you can use lapply() and then transform the resulting list to data.frame.
LOCF <- function(x) {
# Last Observation Carried Forward (for a left to right series)
LOCF <- max(which(!is.na(x))) # the location of the Last Observation to Carry Forward
x[LOCF:length(x)] <- x[LOCF]
return(x)
}
a <- data.frame(rep("a",4), 1:4, 1:4, c(1, NA, NA, NA))
a
data.frame(lapply(a, LOCF))
