Using R to do row-wise aggregation like tapply on a matrix - r

I have a problem with a matrix computation and would appreciate some help.
Thank you very much in advance!
I have a data frame genderLocation and a matrix test; their rows correspond to each other by index.
genderLocation[,1:6]
scanner_gender cmall_gender wechat_gender scanner_location cmall_location wechat_location
156043 3 2 2 Guangzhou Shenzhen Shenzhen
156044 2 NA NA Shenzhen <NA>
156045 2 NA 2 Shenzhen <NA> Hongkong
156046 2 NA 2 Shenzhen <NA> Shenzhen
test
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.8 0.7 0.6 0.6 0.7 0.7
[2,] 0.8 1.0 1.0 0.6 0.7 0.7
[3,] 0.8 1.0 0.6 0.6 0.7 0.7
[4,] 0.8 1.0 0.6 0.6 0.7 0.7
Now I want to aggregate each row of genderLocation, computing the averages of the corresponding values in matrix test.
Taking row 156043 as an example, the result should be
2 3 Guangzhou Shenzhen
0.65 0.80 0.60 0.70
I don't know how to do it using the apply family (for loops are generally discouraged in R).
My attempt is
> apply(test,1,function(tst,genderLoc) print(tapply(tst,as.character(genderLoc),mean)),genderLocation)
but I cannot understand the results. Limited to the first 2 rows, the output is easier to read:
> apply(test[1:2,],1,function(tst,genderLoc) print(tapply(tst,as.character(genderLoc),mean)),genderLocation[1:2,])
c("2", NA) c("3", "2") c("Guangzhou", "Shenzhen") c("Shenzhen", "") c("Shenzhen", NA)
0.65 0.80 0.60 0.70 0.70
c("2", NA) c("3", "2") c("Guangzhou", "Shenzhen") c("Shenzhen", "") c("Shenzhen", NA)
1.0 0.8 0.6 0.7 0.7
[,1] [,2]
c("2", NA) 0.65 1.0
c("3", "2") 0.80 0.8
c("Guangzhou", "Shenzhen") 0.60 0.6
c("Shenzhen", "") 0.70 0.7
c("Shenzhen", NA) 0.70 0.7
##### FYI: code to reproduce the data
test <- matrix(c(0.8,0.8,0.8,0.8, 0.7,1,1,1, 0.6,1,0.6,0.6, 0.6,0.6,0.6,0.6, 0.7,0.7,0.7,0.7, 0.7,0.7,0.7,0.7), nrow=4, ncol=6, byrow=FALSE)
genderLocation <- data.frame(scanner_gender=c(3,2,2,2), cmall_gender=c(2,NA,NA,NA), wechat_gender=c(2,NA,2,2),
scanner_location=c("Guangzhou","Shenzhen","Shenzhen","Shenzhen"),
cmall_location=c("Shenzhen",NA,NA,NA),
wechat_location=c("Shenzhen","","Hongkong","Shenzhen"))
genderLocation1 <- cbind(genderLocation, test) # combined, because some apply functions accept only one input

The following works for your example data, but I don't know how robust it is on your full data. A problem may occur if some rows of df share no value with any other row. If you are happy to keep the output as a list, it should work without problems (that is, skip the Reduce step). Keeping that in mind...
--Your data--
test <- matrix(c(0.8,0.8,0.8,0.8,0.7,1,1,1,0.6,1,0.6,0.6,0.6,0.6,0.6,0.6,rep(0.7,8)), nrow=4)
df <- data.frame(scanner_gender=c(3,2,2,2),
cmall_gender=c(2,NA,NA,NA),
wechat_gender=c(2,NA,2,2),
scanner_location=c("Guanzhou","Shenzhen","Shenzhen","Shenzhen"),
cmall_location=c("Shenzhen",NA,NA,NA),
wechat_location=c("Shenzhen",NA,"Hongkong","Shenzhen"),
stringsAsFactors=FALSE)
rownames(df) <- c(156043,156044,156045,156046)
--Operation--
I combine map from purrr with other tidyverse verbs to 1) create a two-column data frame with the df row entries in the first column (A) and the test row entries in the second (B), 2) filter out rows where A is NA, 3) summarise the mean of B by group, and 4) spread the result into a one-row data frame using the values of A (the keys) as columns.
library(purrr); library(dplyr); library(tidyr)
L <- map(1:nrow(df), ~data.frame(A=unlist(df[.x,]), B=test[.x,]) %>%
filter(!is.na(A)) %>%
group_by(A) %>%
summarise(B=mean(B)) %>%
spread(A,B) )
I then reduce this list to a data frame using Reduce and full_join
newdf <- Reduce("full_join", L)
--Output--
`2` `3` Guanzhou Shenzhen Hongkong
1 0.65 0.8 0.6 0.70 NA
2 0.80 NA NA 0.60 NA
3 0.70 NA NA 0.60 0.7
4 0.70 NA NA 0.65 NA
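For comparison, here is a minimal base R sketch of the same row-wise aggregation, using the data from the question. The attempt in the question passes the whole genderLocation data frame to every call, and as.character() on a data frame appears to deparse each column, which would explain labels such as c("2", NA). The sketch instead pairs row i of test with row i of genderLocation. The name row_means and the choice to treat empty strings as missing are assumptions made for illustration, not part of the original posts.

```r
# Data as posted in the question.
test <- matrix(c(0.8,0.8,0.8,0.8, 0.7,1,1,1, 0.6,1,0.6,0.6,
                 0.6,0.6,0.6,0.6, 0.7,0.7,0.7,0.7, 0.7,0.7,0.7,0.7),
               nrow = 4, ncol = 6)
genderLocation <- data.frame(
  scanner_gender   = c(3, 2, 2, 2),
  cmall_gender     = c(2, NA, NA, NA),
  wechat_gender    = c(2, NA, 2, 2),
  scanner_location = c("Guangzhou", "Shenzhen", "Shenzhen", "Shenzhen"),
  cmall_location   = c("Shenzhen", NA, NA, NA),
  wechat_location  = c("Shenzhen", "", "Hongkong", "Shenzhen"))

# One named vector of group means per row (hypothetical helper result).
row_means <- lapply(seq_len(nrow(test)), function(i) {
  # Convert each cell of row i to a character key (works for factors too).
  keys <- vapply(genderLocation[i, ], as.character, character(1))
  keys[keys %in% ""] <- NA              # assumption: "" counts as missing
  out <- tapply(test[i, ], keys, mean)  # NA keys are dropped by tapply
  setNames(as.vector(out), names(out))
})

row_means[[1]]
```

The result is a list rather than a matrix because each row can contain a different set of keys.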

Related

Remove % of Items in Columns

I'm trying to drop columns in which more than 90% of the values are NA. I followed the linked post below, but I only get a vector of values in return, and I'm not sure what I'm doing wrong. I expected an actual data frame; I tried wrapping the result in as.data.frame, but that also gives the wrong result.
Linked Post: Delete columns/rows with more than x% missing
Example DF
gene cell1 cell2 cell3
A 0.4 0.1 NA
B NA NA 0.1
C 0.4 NA 0.5
D NA NA 0.5
E 0.5 NA 0.6
F 0.6 NA NA
Desired DF
gene cell1 cell3
A 0.4 NA
B NA 0.1
C 0.4 0.5
D NA 0.5
E 0.5 0.6
F 0.6 NA
Code
#Select Genes that have NA values for 90% of a given cell line
df_col <- df[,2:ncol(df)]
df_col <-df_col[, which(colMeans(!is.na(df_col)) > 0.9)]
df <- cbind(df[,1], df_col)
I would use dplyr here.
If you want to use select() with logical conditions, you are probably looking for the where() selection helper in dplyr.
It can be used like this: select(where(condition))
I used an 80% threshold because 90% would keep all columns and would therefore not illustrate the solution as well.
library(dplyr)
df %>% select(where(~mean(is.na(.))<0.8))
It can also be done with base R and colMeans:
df[, c(TRUE, colMeans(is.na(df[-1]))<0.8)]
or with purrr:
library(purrr)
df %>% keep(~mean(is.na(.))<0.8)
Output:
gene cell1 cell3
1 a 0.4 NA
2 b NA 0.1
3 c 0.4 0.5
4 d NA 0.5
5 e 0.5 0.6
6 f 0.6 NA
Data
df<-data.frame(gene=letters[1:6],
cell1=c(0.4, NA, 0.4, NA, 0.5, 0.6),
cell2=c(0.1, rep(NA, 5)),
cell3=c(NA, 0.1, 0.5, 0.5, 0.6, NA))
Well, cell2 has 83% NA values (5/6), but anyway you can do -
ignore <- 1
perc <- 0.8 #80 %
df <- cbind(df[ignore], df[-ignore][colMeans(is.na(df[-ignore])) < perc])
df
# gene cell1 cell3
#1 A 0.4 NA
#2 B NA 0.1
#3 C 0.4 0.5
#4 D NA 0.5
#5 E 0.5 0.6
#6 F 0.6 NA
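One plausible reason the attempt in the question returned plain values instead of a data frame is that subsetting with [, j] drops to a vector when exactly one column is selected. The sketch below wraps the colMeans approach from the answers in a function and uses drop = FALSE to guard against that. The function name drop_sparse_cols and its arguments max_na and ignore are hypothetical names chosen for this illustration.

```r
# Example data from the answer above.
df <- data.frame(gene  = letters[1:6],
                 cell1 = c(0.4, NA, 0.4, NA, 0.5, 0.6),
                 cell2 = c(0.1, rep(NA, 5)),
                 cell3 = c(NA, 0.1, 0.5, 0.5, 0.6, NA))

# Keep the `ignore` columns, plus every other column whose share of NA
# values is below `max_na`.
drop_sparse_cols <- function(df, max_na = 0.8, ignore = 1) {
  rest <- df[-ignore]
  keep <- colMeans(is.na(rest)) < max_na
  # drop = FALSE keeps a data frame even if only one column survives
  cbind(df[ignore], rest[, keep, drop = FALSE])
}

res <- drop_sparse_cols(df)
res
```

With the 80% threshold, cell2 (5 of 6 values missing) is removed while cell1 and cell3 are kept.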

Transform NA values based on first registration and nearest values

I already asked a similar question, but now I just want to restrict the new values that replace the NAs.
I have some data like this:
Date 1 Date 2 Date 3 Date 4 Date 5 Date 6
A NA 0.1 0.2 NA 0.3 0.2
B 0.1 NA NA 0.3 0.2 0.1
C NA NA NA NA 0.3 NA
D 0.1 0.2 0.3 NA 0.1 NA
E NA NA 0.1 0.2 0.1 0.3
I would like to change the NA values of my data based on the first date a value is registered. For example, for A the first registration is Date 2. I want the NA values in A before that registration to become 0, and the NA values after the first registration to become the mean of the nearest values (for Date 4, the mean of Date 3 and Date 5).
In case the last value is an NA, transform it into the last registered value (as in C and D). In the case of E all NA values will become 0.
Get something like this:
Date 1 Date 2 Date 3 Date 4 Date 5 Date 6
A 0 0.1 0.2 0.25 0.3 0.2
B 0.1 0.2 0.2 0.3 0.2 0.1
C 0 0 0 0 0.3 0.3
D 0.1 0.2 0.3 0.2 0.1 0.1
E 0 0 0.1 0.2 0.1 0.3
Can you help me? I'm not sure how to do it in R.
Here is a way using na.approx from the zoo package and apply with MARGIN = 1 (so this is probably not very efficient, but it gets the job done).
library(zoo)
df1 <- as.data.frame(t(apply(dat, 1, na.approx, method = "constant", f = .5, na.rm = FALSE)))
This results in
df1
# V1 V2 V3 V4 V5
#A NA 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C NA NA NA NA 0.3
#E NA NA 0.1 0.20 0.1
Replace NAs and rename columns.
df1[is.na(df1)] <- 0
names(df1) <- names(dat)
df1
# Date_1 Date_2 Date_3 Date_4 Date_5
#A 0.0 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C 0.0 0.0 0.0 0.00 0.3
#E 0.0 0.0 0.1 0.20 0.1
explanation
Given a vector
x <- c(0.1, NA, NA, 0.3, 0.2)
na.approx(x)
returns x with linear interpolated values
#[1] 0.1000000 0.1666667 0.2333333 0.3000000 0.2000000
But OP asked for constant values so we need the argument method = "constant" from the approx function.
na.approx(x, method = "constant")
# [1] 0.1 0.1 0.1 0.3 0.2
But this is still not what OP asked for because it carries the last observation forward while you want the mean for the closest non-NA values. Therefore we need the argument f (also from approx)
na.approx(x, method = "constant", f = .5)
# [1] 0.1 0.2 0.2 0.3 0.2 # looks good
From ?approx
f : for method = "constant" a number between 0 and 1 inclusive, indicating a compromise between left- and right-continuous step functions. If y0 and y1 are the values to the left and right of the point then the value is y0 if f == 0, y1 if f == 1, and y0*(1-f)+y1*f for intermediate values. In this way the result is right-continuous for f == 0 and left-continuous for f == 1, even for non-finite y values.
Lastly, if we don't want to replace the NAs at the beginning and end of each row we need na.rm = FALSE.
From ?na.approx
na.rm : logical. If the result of the (spline) interpolation still results in NAs, should these be removed?
data
dat <- structure(list(Date_1 = c(NA, 0.1, NA, NA), Date_2 = c(0.1, NA,
NA, NA), Date_3 = c(0.2, NA, NA, 0.1), Date_4 = c(NA, 0.3, NA,
0.2), Date_5 = c(0.3, 0.2, 0.3, 0.1)), .Names = c("Date_1", "Date_2",
"Date_3", "Date_4", "Date_5"), class = "data.frame", row.names = c("A",
"B", "C", "E"))
EDIT
If there are NAs in the last column we can replace these with the last non-NAs before we apply na.approx as shown above.
dat$Date_6[is.na(dat$Date_6)] <- dat[cbind(1:nrow(dat),
max.col(!is.na(dat), ties.method = "last"))][is.na(dat$Date_6)]
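The same idea can be sketched without zoo by calling approx from the stats package directly, since method = "constant" and f are approx arguments in the first place. This is only an illustration under stated assumptions: the helper name fill_row is hypothetical, rule = 2 is used so that trailing NAs take the last observed value, and rows that are entirely NA are assumed to become 0. The data are the six-date example from the question.

```r
# Six-date example from the question, as a matrix.
m <- matrix(c(NA, 0.1, 0.2, NA, 0.3, 0.2,
              0.1, NA,  NA, 0.3, 0.2, 0.1,
              NA,  NA,  NA, NA, 0.3, NA,
              0.1, 0.2, 0.3, NA, 0.1, NA,
              NA,  NA, 0.1, 0.2, 0.1, 0.3),
            nrow = 5, byrow = TRUE,
            dimnames = list(LETTERS[1:5], paste0("Date_", 1:6)))

fill_row <- function(x) {
  obs <- which(!is.na(x))
  if (length(obs) == 0) return(rep(0, length(x)))  # assumption: all-NA row -> 0
  # f = 0.5 gives the mean of the neighbouring observations for interior gaps;
  # rule = 2 repeats the nearest observation outside the observed range.
  out <- approx(obs, x[obs], xout = seq_along(x),
                method = "constant", f = 0.5, rule = 2)$y
  out[seq_len(obs[1] - 1)] <- 0  # values before the first registration are 0
  out
}

res <- t(apply(m, 1, fill_row))
colnames(res) <- colnames(m)
res
```

Because rule = 2 also fills the leading NAs with the first observation, the last line of fill_row overwrites them with 0 afterwards.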
This is another possible answer, using na.locf from the zoo package.
Edit: apply is actually not required; This solution fills in the last observed value if this value is missing.
# create the dataframe
Date1 <- c(NA,.1,NA,NA)
Date2 <- c(.1, NA,NA,NA)
Date3 <- c(.2,NA,NA,.1)
Date4 <- c(NA,.3,NA,.2)
Date5 <- c(.3,.2,.3,.1)
Date6 <- c(.1,NA,NA,NA)
df <- as.data.frame(cbind(Date1,Date2,Date3,Date4,Date5,Date6))
rownames(df) <- c('A','B','C','D')
> df
Date1 Date2 Date3 Date4 Date5 Date6
A NA 0.1 0.2 NA 0.3 0.1
B 0.1 NA NA 0.3 0.2 NA
C NA NA NA NA 0.3 NA
D NA NA 0.1 0.2 0.1 NA
# Load library
library(zoo)
df2 <- t(na.locf(t(df), na.rm = FALSE)) # last observation carried forward
df3 <- t(na.locf(t(df), na.rm = FALSE, fromLast = TRUE)) # next observation carried backward
df4 <- (df2 + df3)/2 # mean of the two filled matrices
df4 <- t(na.locf(t(df4), na.rm = FALSE)) # fill trailing NAs with the last value
df4[is.na(df4)] <- 0 # remaining (leading) NAs become 0
Date1 Date2 Date3 Date4 Date5 Date6
A 0.0 0.1 0.2 0.25 0.3 0.1
B 0.1 0.2 0.2 0.30 0.2 0.2
C 0.0 0.0 0.0 0.00 0.3 0.3
D 0.0 0.0 0.1 0.20 0.1 0.1
Here's another option with base R + rollmean from zoo (clearly easy to rewrite in base R for this case with window size k = 2).
t(apply(df, 1, function(x) {
means <- c(0, rollmean(na.omit(x), 2), tail(na.omit(x), 1))
replace(x, is.na(x), means[1 + cumsum(!is.na(x))[is.na(x)]])
}))
# Date1 Date2 Date3 Date4 Date5 Date6
# A 0.0 0.1 0.2 0.25 0.3 0.2
# B 0.1 0.2 0.2 0.30 0.2 0.1
# C 0.0 0.0 0.0 0.00 0.3 0.3
# D 0.1 0.2 0.3 0.20 0.1 0.1
# E 0.0 0.0 0.1 0.20 0.1 0.3
Explanation. Suppose that x is the first row of df:
# Date1 Date2 Date3 Date4 Date5 Date6
# A NA 0.1 0.2 NA 0.3 0.2
Then
means
# [1] 0.00 0.15 0.25 0.25 0.20
is a vector consisting of 0, the rolling means of each pair of consecutive non-NA elements, and the last non-NA element. Then all we need to do is replace the NA elements of x. We replace them with the elements of means at indices 1 + cumsum(!is.na(x))[is.na(x)]. That's the trickier part. Here
cumsum(!is.na(x))
# [1] 0 1 2 2 3 4
Meaning that the first element of x has seen 0 non-NA elements, while, say, the last one has seen 4 non-NA elements so far. Then
cumsum(!is.na(x))[is.na(x)]
# [1] 0 2
is about those NA elements in x that we want to replace. Notice that then
1 + cumsum(!is.na(x))[is.na(x)]
# [1] 1 3
corresponds to the elements of means that we want to use for replacement.
I find the function below rather complicated, but it works, so here it goes.
fun <- function(x){
  if(anyNA(x)){
    inx <- which(!is.na(x))
    # NAs before the first registered value become 0
    if(inx[1] > 1) x[seq_len(inx[1] - 1)] <- 0
    prev <- inx[1]
    for(i in inx[-1]){
      if(i - prev > 1){
        # fill only the gap between two observed values with their mean,
        # leaving the observed values themselves untouched
        m <- mean(c(x[i], x[prev]))
        x[(prev + 1):(i - 1)] <- m
      }
      prev <- i
    }
  }
  x
}
res <- t(apply(df1, 1, fun))
res <- as.data.frame(res)
res
# Date.1 Date.2 Date.3 Date.4 Date.5
#A 0.0 0.1 0.2 0.25 0.3
#B 0.1 0.2 0.2 0.30 0.2
#C 0.0 0.0 0.0 0.00 0.3
#E 0.0 0.0 0.1 0.20 0.1
Data.
df1 <- read.table(text = "
Date.1 Date.2 Date.3 Date.4 Date.5
A NA 0.1 0.2 NA 0.3
B 0.1 NA NA 0.3 0.2
C NA NA NA NA 0.3
E NA NA 0.1 0.2 0.1
", header = TRUE)

Find the nth largest values in the top row and omit the rest of the columns in R

I am trying to subset a data frame so that it keeps only those columns whose value in the first row is among the n largest.
For example, let's assume I want to keep only the columns whose value in row 1 is among the 2 largest.
dat1 = data.frame(a = c(0.1,0.2,0.3,0.4,0.5), b = c(0.6,0.7,0.8,0.9,0.10), c = c(0.12,0.13,0.14,0.15,0.16), d = c(NA, NA, NA, NA, 0.5))
a b c d
1 0.1 0.6 0.12 NA
2 0.2 0.7 0.13 NA
3 0.3 0.8 0.14 NA
4 0.4 0.9 0.15 NA
5 0.5 0.1 0.16 0.5
such that a and d are removed, because 0.1 and NA are not among the 2 largest values in
row 1. Here 0.6 and 0.12 (columns b and c) are larger than the 0.1 and NA in columns a and d respectively.
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
Is there a simple way to subset this? I do not want to order it, because that will create problems with other data frames I have that are related.
Complementing pieca's answer, you can encapsulate that into a function.
Also, this way, the returning data.frame won't be sorted.
get_nth <- function(df, n) {
df[] <- lapply(df, as.numeric) # edit
cols <- names(sort(df[1, ], na.last = NA, decreasing = TRUE))
cols <- cols[seq(n)]
df <- df[names(df) %in% cols]
return(df)
}
Hope this works for you.
Sort the first row of your data.frame, and then subset by names:
cols <- names(sort(dat1[1,], na.last = NA, decreasing = TRUE))
> dat1[,cols[1:2]]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
You can get an inverted rank of the first row and keep the top n columns:
> r <- rank(-dat1[1,], na.last=TRUE)
> r <- r <= 2
> dat1[,r]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
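The rank-based approach can also be written as a small function that works on a plain numeric vector, which avoids sorting or ranking a one-row data frame directly. This is a minimal sketch assuming all columns are numeric; the function name keep_top_n is hypothetical. Columns keep their original order because the data frame is subset with a logical vector rather than reordered.

```r
dat1 <- data.frame(a = c(0.1, 0.2, 0.3, 0.4, 0.5),
                   b = c(0.6, 0.7, 0.8, 0.9, 0.10),
                   c = c(0.12, 0.13, 0.14, 0.15, 0.16),
                   d = c(NA, NA, NA, NA, 0.5))

keep_top_n <- function(df, n) {
  first_row <- unlist(df[1, ])  # named numeric vector (assumes numeric columns)
  # Rank in decreasing order; NA entries stay NA and are never selected.
  r <- rank(-first_row, na.last = "keep", ties.method = "min")
  df[, !is.na(r) & r <= n, drop = FALSE]
}

keep_top_n(dat1, 2)
```

With ties.method = "min", tied values share the best rank, so more than n columns can be returned when the first row contains ties.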

replacing values in dataframe with another dataframe r

I have a dataframe of values that represent fold changes as such:
> df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
A B C
1 1.74 1.50 1.10
2 -1.30 0.90 3.01
3 3.10 0.71 1.40
And a dataframe of pvalues as such that matches rows and columns identically:
> df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
A B C
1 0.02 NA 0.01
2 0.01 0.01 0.01
3 0.80 0.06 0.03
What I want is to modify df1 so that it retains only the values whose corresponding p-value in df2 is < .05, and replaces all others with NA. Note that there are also NAs in df2.
> desired <- data.frame(A=c(1.74,-1.3,NA), B=c(NA,.9,NA), C=c(1.1,3.01,1.4))
> desired
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
I first tried to use vector syntax on these dataframes and that didn't work. Then I tried a for loop by columns and that also failed.
I don't think I understand how to index each i,j position and then replace values in df1 based on a logical test on df2.
Or if there is a better way in R.
You can try this:
df1[!df2 < 0.05 | is.na(df2)] <- NA
Out:
> df1
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
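To make the indexing step explicit: comparing a data frame with a number yields a logical matrix of the same shape, and a logical matrix can be used to index another data frame cell by cell. In the one-liner above, !df2 < 0.05 is parsed as !(df2 < 0.05). The sketch below builds the mask in a separate step; the names keep and masked are hypothetical and used only for illustration.

```r
df1 <- data.frame(A = c(1.74, -1.3, 3.1), B = c(1.5, .9, .71), C = c(1.1, 3.01, 1.4))
df2 <- data.frame(A = c(.02, .01, .8),    B = c(NA, .01, .06), C = c(.01, .01, .03))

# TRUE where the p-value is present and below the cutoff.
keep <- !is.na(df2) & df2 < 0.05

masked <- df1
masked[!keep] <- NA   # blank out every cell that fails the test
masked
```

Working on a copy leaves df1 untouched, which makes it easy to try a different cutoff.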
ifelse and as.matrix seem to do the trick.
df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
x1 <- as.matrix(df1)
x2 <- as.matrix(df2)
as.data.frame( ifelse( x2 >= 0.05 | is.na(x2), NA, x1) )
Result
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40

R: Filling Missing Values (NA) by Multiplying Two Separate Vectors

I'm having a brain-freeze.
This is what I have:
C <- c(C1, C2, C3) # A constant for every row in the data frame
r <- c(r1, r2, r3, r4) # A ratio for every column in the data frame
My data frame looks like this:
1 2 3 4
a 0.7 0.4 NA NA
b NA NA 0.3 NA
c NA 0.6 NA 0.4
I need to fill in the NA's with a multiplication of C and r so that it looks like this:
1 2 3 4
a 0.7 0.4 C1*r3 C1*r4
b C2*r1 C2*r2 0.3 C2*r4
c C3*r1 0.6 C3*r3 0.4
Notice that the multiplication is only done for the NA's and not for numbers that already exist. I know is.na is used to pick out the NA's, and it's probably just linear algebra, but my brain has quit for the day. Any help would be great.
Thanks.
If mm is your matrix, you can fill the missing values like this:
mm[is.na(mm)] <- outer(C,r)[is.na(mm)]
example with data :
mm <- read.table(text=' 1 2 3 4
a 0.7 0.4 NA NA
b NA NA 0.3 NA
c NA 0.6 NA 0.4')
C <- c(1, 1, 1) # A constant for every row in the data frame
r <- c(2, 2, 2, 2)
mm[is.na(mm)] <- outer(C,r)[is.na(mm)]
# X1 X2 X3 X4
# a 0.7 0.4 2.0 2.0
# b 2.0 2.0 0.3 2.0
# c 2.0 0.6 2.0 0.4
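Because every constant and ratio in the example above is the same, the output does not show which product lands in which cell. Here is a small sketch with distinct, made-up values for C and r (assumptions chosen only so that each product is recognisable), showing that outer(C, r)[i, j] equals C[i] * r[j] and that only the NA cells are overwritten.

```r
# The matrix from the question, filled column by column.
mm <- matrix(c(0.7, NA,  NA,
               0.4, NA,  0.6,
               NA,  0.3, NA,
               NA,  NA,  0.4),
             nrow = 3,
             dimnames = list(c("a", "b", "c"), 1:4))

C <- c(1, 2, 3)          # hypothetical constant per row
r <- c(10, 20, 30, 40)   # hypothetical ratio per column

fill <- outer(C, r)      # fill[i, j] is C[i] * r[j]
filled <- mm
filled[is.na(mm)] <- fill[is.na(mm)]
filled
```

For instance, cell (b, 1) becomes C[2] * r[1] = 20, while the existing value 0.7 in cell (a, 1) is left as it is.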
