Replacing values in a data frame based on another data frame in R

I have a dataframe of values that represent fold changes as such:
> df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
A B C
1 1.74 1.50 1.10
2 -1.30 0.90 3.01
3 3.10 0.71 1.40
And a data frame of p-values whose rows and columns match df1 exactly:
> df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
A B C
1 0.02 NA 0.01
2 0.01 0.01 0.01
3 0.80 0.06 0.03
What I want is to modify df1 so that it retains only the values whose corresponding p-value in df2 is < .05, and replaces the rest with NA. Note there are also NAs in df2.
> desired <- data.frame(A=c(1.74,-1.3,NA), B=c(NA,.9,NA), C=c(1.1,3.01,1.4))
> desired
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
I first tried to use vector syntax on these data frames and that didn't work. Then I tried a for loop over columns and that also failed.
I don't think I understand how to index each i,j position and then replace df1 values based on a logical test of df2.
Or is there a better way in R?

You can try this (note that !df2 < 0.05 parses as !(df2 < 0.05), since comparison binds tighter than negation):
df1[!df2 < 0.05 | is.na(df2)] <- NA
Out:
> df1
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
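For completeness, here is a minimal sketch of the explicit i,j double loop the question asked about, assuming the original df1 and df2 from the question (the vectorized one-liner above is preferable in practice):
for (i in seq_len(nrow(df1))) {
  for (j in seq_len(ncol(df1))) {
    p <- df2[i, j]
    if (is.na(p) || p >= 0.05) df1[i, j] <- NA  # blank out unless p-value < .05
  }
}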

ifelse and as.matrix seem to do the trick.
df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
x1 <- as.matrix(df1)
x2 <- as.matrix(df2)
as.data.frame( ifelse( x2 >= 0.05 | is.na(x2), NA, x1) )
Result
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40

Related

Find the nth largest values in the top row and omit the rest of the columns in R

I am trying to change a data frame so that I only include those columns where the value in the first row is among the n largest.
For example, let's assume I want to keep only the columns whose value in row 1 is among the 2 largest.
dat1 = data.frame(a = c(0.1,0.2,0.3,0.4,0.5), b = c(0.6,0.7,0.8,0.9,0.10), c = c(0.12,0.13,0.14,0.15,0.16), d = c(NA, NA, NA, NA, 0.5))
a b c d
1 0.1 0.6 0.12 NA
2 0.2 0.7 0.13 NA
3 0.3 0.8 0.14 NA
4 0.4 0.9 0.15 NA
5 0.5 0.1 0.16 0.5
so that a and d are removed, because 0.1 and NA are not among the 2 largest values in
row 1: 0.6 and 0.12 are larger than 0.1 (column a) and NA (column d), respectively.
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
Is there a simple way to subset this? I do not want to order it, because that will create problems with other data frames I have that are related.
Complementing pieca's answer, you can encapsulate it into a function.
This way, the returned data.frame also won't be sorted: the columns keep their original order.
get_nth <- function(df, n) {
  df[] <- lapply(df, as.numeric)  # coerce every column to numeric first
  cols <- names(sort(df[1, ], na.last = NA, decreasing = TRUE))
  cols <- cols[seq(n)]
  df <- df[names(df) %in% cols]
  return(df)
}
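For example, with the dat1 from the question, calling it as below should return the b and c columns in their original order:
get_nth(dat1, 2)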
Hope this works for you.
Sort the first row of your data.frame, and then subset by names:
cols <- names(sort(dat1[1,], na.last = NA, decreasing = TRUE))
> dat1[,cols[1:2]]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
You can get an inverted rank of the first row and take the top n columns:
> r <- rank(-dat1[1,], na.last=T)
> r <- r <= 2
> dat1[,r]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16

Defining a function that includes for loops

I have 2 data frames.
One (df1) has columns for slopes and intercepts, and the other (df2) has an index column (i.e., row numbers).
I wish to apply a function based on parameters from df1 to the entire index column in df2. I don't want the function to mix and match slopes and intercepts (i.e., I want to make sure that the function always uses the slope and intercept from the same row of df1).
I tried to do this
my_function <- function(x) {for (i in df1$slope) for (j in df1$intercept) {((i*x)+j)}}
df3 <- for (k in df2$Index) {my_function(k)}
df3
but it didn't work.
Here are sample data:
> df1
thermocouple slope intercept
1 1 0.01 0.5
2 2 -0.01 0.4
3 3 0.03 0.2
> df2
index t_1 t_2 t_3
1 1 0.3 0.2 0.2
2 2 0.5 0.2 0.3
3 3 0.3 0.9 0.1
4 4 1.2 1.8 0.4
5 5 2.3 3.1 1.2
Here would be the output I need:
index baseline_t_1 baseline_t_2 baseline_t_3
1 0.51 0.39 0.23
2 0.52 0.38 0.26
3 0.53 0.37 0.29
4 0.54 0.36 0.32
5 0.55 0.35 0.35
What am I doing wrong?
Thanks!
Try this: pass the three argument vectors in parallel to an anonymous function defined in Map.
Map(function(index, slope, intercept) (index * slope) + intercept,
    index = df2$index, slope = df1$slope, intercept = df1$intercept)
Or maybe this; I am not sure which one you prefer:
lapply(df2$index, function(index) {
  unlist(Map(function(slope, intercept) (index * slope) + intercept,
             slope = df1$slope, intercept = df1$intercept))
})
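Note that the first Map call pairs its argument vectors element by element, so it assumes index, slope, and intercept all have the same length; with the sample data (five index values, three thermocouples) it is the second, nested form that applies. If the goal is exactly the output shown in the question, one sketch (assuming df1 and df2 as printed above) builds one baseline column per thermocouple:
baselines <- Map(function(slope, intercept) slope * df2$index + intercept,
                 df1$slope, df1$intercept)  # one baseline vector per thermocouple
names(baselines) <- paste0("baseline_t_", df1$thermocouple)
df3 <- cbind(index = df2$index, as.data.frame(baselines))
df3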

Using R to do tapply-like aggregation over matrix rows

I have a problem with a matrix computation; could you please shed some light on it?
Thank you very much in advance!
I have a data frame genderLocation and a matrix test; they correspond to each other by row index.
genderLocation[,1:6]
scanner_gender cmall_gender wechat_gender scanner_location cmall_location wechat_location
156043 3 2 2 Guangzhou Shenzhen Shenzhen
156044 2 NA NA Shenzhen <NA>
156045 2 NA 2 Shenzhen <NA> Hongkong
156046 2 NA 2 Shenzhen <NA> Shenzhen
test
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.8 0.7 0.6 0.6 0.7 0.7
[2,] 0.8 1.0 1.0 0.6 0.7 0.7
[3,] 0.8 1.0 0.6 0.6 0.7 0.7
[4,] 0.8 1.0 0.6 0.6 0.7 0.7
Now I want to aggregate genderLocation and compute the averages of the corresponding values in the matrix test.
Take row 156043 for example; the result should be
2 3 Guangzhou Shenzhen
0.65 0.80 0.60 0.70
I don't know how to do it using the apply family (since for loops are discouraged in R).
My attempt was:
> apply(test,1,function(tst,genderLoc) print(tapply(tst,as.character(genderLoc),mean)),genderLocation)
but I cannot understand the results. Limited to the first 2 rows, they seem more understandable:
> apply(test[1:2,],1,function(tst,genderLoc) print(tapply(tst,as.character(genderLoc),mean)),genderLocation[1:2,])
c("2", NA) c("3", "2") c("广州", "深圳") c("深圳", "") c("深圳", NA)
0.65 0.80 0.60 0.70 0.70
c("2", NA) c("3", "2") c("广州", "深圳") c("深圳", "") c("深圳", NA)
1.0 0.8 0.6 0.7 0.7
[,1] [,2]
c("2", NA) 0.65 1.0
c("3", "2") 0.80 0.8
c("广州", "深圳") 0.60 0.6
c("深圳", "") 0.70 0.7
c("深圳", NA) 0.70 0.7
##### FYI
test=matrix(c(0.8,0.8,0.8,0.8, 0.7,1,1,1, 0.6,1,0.6,0.6, 0.6,0.6,0.6,0.6, 0.7,0.7,0.7,0.7, 0.7,0.7,0.7,0.7),nrow=4,ncol=6,byrow=F)
genderLocation<- data.frame(scanner_gender=c(3,2,2,2),cmall_gender=c(2,NA,NA,NA),wechat_gender=c(2,NA,2,2),
scanner_location=c("Guangzhou","Shenzhen","Shenzhen","Shenzhen"),
cmall_location=c("Shenzhen",NA,NA,NA),
wechat_location=c("Shenzhen","","Hongkong","Shenzhen"))
genderLocation1 <- cbind(genderLocation, test)  # bound, since some apply functions accept only one input
The following works for your example data but I don't know how stable it is with all of your data. An issue may occur if some of your rows in df do not share a common value with other rows. However, if you want to keep your output as a list, this should work with no problems (that is, skip Reduce...). Keeping that in mind...
--Your data--
test <- matrix(c(0.8,0.8,0.8,0.8,0.7,1,1,1,0.6,1,0.6,0.6,0.6,0.6,0.6,0.6,rep(0.7,8)), nrow=4)
df <- data.frame(scanner_gender=c(3,2,2,2),
                 cmall_gender=c(2,NA,NA,NA),
                 wechat_gender=c(2,NA,2,2),
                 scanner_location=c("Guangzhou","Shenzhen","Shenzhen","Shenzhen"),
                 cmall_location=c("Shenzhen",NA,NA,NA),
                 wechat_location=c("Shenzhen",NA,"Hongkong","Shenzhen"),
                 stringsAsFactors=F)
rownames(df) <- c(156043,156044,156045,156046)
--Operation--
I combine map from purrr with other tidyverse verbs to (1) create a two-column data frame with the df row entries in column A and the test row entries in column B, (2) filter out rows where is.na(A) is TRUE, (3) summarise the mean of B by group, and (4) spread into a one-row data frame using the values of A as column names.
library(purrr)
library(dplyr)
library(tidyr)
L <- map(1:nrow(df), ~ data.frame(A = unlist(df[.x, ]), B = unlist(test[.x, ])) %>%
           filter(!is.na(A)) %>%
           group_by(A) %>%
           summarise(B = mean(B)) %>%
           spread(A, B))
I then reduce this list to a data frame using Reduce and full_join
newdf <- Reduce("full_join", L)
--Output--
`2` `3` Guangzhou Shenzhen Hongkong
1 0.65 0.8 0.6 0.70 NA
2 0.80 NA NA 0.60 NA
3 0.70 NA NA 0.60 0.7
4 0.70 NA NA 0.65 NA
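For comparison, a base-R sketch of the same row-wise aggregation, assuming the test matrix and genderLocation data frame from the FYI block above (note that tapply silently drops NA groups):
gl_mat <- as.matrix(genderLocation)  # coerces every entry to character
res <- Map(function(vals, grp) tapply(vals, grp, mean),
           split(test, row(test)),       # rows of test, as a list
           split(gl_mat, row(gl_mat)))   # matching rows of genderLocation
res[[1]]
#    2    3 Guangzhou Shenzhen
# 0.65 0.80      0.60     0.70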

Replicate rows by different N

I have the following data:
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
1) I need to repeat the row for each id n times, with n rounded up. For example, n=2.63 for id=1, so I need to replicate the id=1 row three times. If n=0.5, I need to replicate the row only once.
2) Create a new variable called t, where the sum of t for each id must equal n.
3) Create another new variable called accumulated.t, the running cumulative sum of t within each id.
Here is how the output should look:
id n t accumulated.t
1 2.63 1 1
1 2.63 1 2
1 2.63 0.63 2.63
2 1.5 1 1
2 1.5 0.5 1.5
3 0.5 0.5 0.5
4 3.5 1 1
4 3.5 1 2
4 3.5 1 3
4 3.5 0.5 3.5
5 4 1 1
5 4 1 2
5 4 1 3
5 4 1 4
Get the ceiling of the 'n' column and use that to expand the rows of 'mydata' (rep(1:nrow(mydata), ceiling(mydata$n))).
Using data.table, we convert the data.frame to a data.table (setDT(mydata1)). Grouped by the 'id' column, we replicate 1 with times given by the truncated first value of 'n' (rep(1, trunc(n[1]))), then take the difference between the group's value of 'n' and the sum of 'tmp' (n[1] - sum(tmp)). If the difference is greater than 0, we concatenate 'tmp' and 'tmp2' (c(tmp, tmp2)); if it is 0, we take only 'tmp'. Placing 'tmp3' and its cumulative sum (cumsum(tmp3)) in a list creates the two columns 't' and 'taccum'.
library(data.table)
mydata1 <- mydata[rep(1:nrow(mydata),ceiling(mydata$n)),]
setDT(mydata1)[, c('t', 'taccum') := {
  tmp <- rep(1, trunc(n[1]))    # one full unit per whole part of n
  tmp2 <- n[1] - sum(tmp)       # fractional remainder
  tmp3 <- if (tmp2 == 0) tmp else c(tmp, tmp2)
  list(tmp3, cumsum(tmp3))
}, by = id]
mydata1
# id n t taccum
# 1: 1 2.63 1.00 1.00
# 2: 1 2.63 1.00 2.00
# 3: 1 2.63 0.63 2.63
# 4: 2 1.50 1.00 1.00
# 5: 2 1.50 0.50 1.50
# 6: 3 0.50 0.50 0.50
# 7: 4 3.50 1.00 1.00
# 8: 4 3.50 1.00 2.00
# 9: 4 3.50 1.00 3.00
#10: 4 3.50 0.50 3.50
#11: 5 4.00 1.00 1.00
#12: 5 4.00 1.00 2.00
#13: 5 4.00 1.00 3.00
#14: 5 4.00 1.00 4.00
An alternative that utilizes base R.
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
mynewdata <- data.frame(
  id = rep(x = mydata$id, times = ceiling(x = mydata$n)),
  n  = mydata$n[match(x = rep(x = mydata$id, ceiling(mydata$n)), table = mydata$id)],
  t  = rep(x = mydata$n / ceiling(mydata$n), times = ceiling(mydata$n)))
mynewdata$t.accum <- unlist(x = by(data = mynewdata$t, INDICES = mynewdata$id, FUN = cumsum))
We start by creating a data.frame with three columns: id, n, and t. id is calculated using rep and ceiling to repeat the ID variable the appropriate number of times. n is obtained by using match to look up the right value in mydata$n. t is obtained by taking the ratio of n to the ceiling of n and repeating it the appropriate number of times (in this case, the ceiling of n again).
Then we use cumsum to get the cumulative sum, called via by to allow by-group processing for each group of IDs. You could probably use tapply() here as well.
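Note that this base-R version splits n into ceiling(n) equal shares, so its t values differ from the 1, 1, 0.63 pattern in the question's desired output (the cumulative sum per id still ends at n). A sketch that reproduces the exact pattern, assuming mydata as above:
t_list <- lapply(mydata$n, function(n) {
  whole <- rep(1, trunc(n))  # one full unit per whole part of n
  frac  <- n - trunc(n)      # fractional remainder, if any
  if (frac > 0) c(whole, frac) else whole
})
out <- data.frame(id = rep(mydata$id, lengths(t_list)),
                  n  = rep(mydata$n, lengths(t_list)),
                  t  = unlist(t_list))
out$accumulated.t <- ave(out$t, out$id, FUN = cumsum)
out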

Partially match a data.frame and subset the whole data.frame

I have some data that looks like this:
List_name Condition1 Condition2 Situation1 Situation2
List1 0.01 0.12 66 123
List2 0.23 0.22 45 -34
List3 0.32 0.23 13 -12
List4 0.03 0.56 -3 45
List5 0.56 0.05 12 100
List6 0.90 0.09 22 32
I would like to filter each "Condition" column of the data.frame against a cutoff of 0.5.
After the filter, the subset should carry along the corresponding values from the "Situation" columns. Filtering and subsetting work pairwise: "Condition1" with "Situation1", "Condition2" with "Situation2", and so on.
Here is the desired output:
List_name Condition1 Situation1 List_name Condition2 Situation2
List1 0.01 66 List1 0.12 123
List2 0.23 45 List2 0.22 -34
List3 0.32 13 List3 0.23 -12
List4 0.03 -3 List5 0.05 100
List6 0.09 32
I'm pretty sure a similar question has been posted before, but I searched and didn't find it.
Similar to @Arun's excellent solution, but based on column names and without any assumption about column order:
cols.conds <- colnames(dat)[gregexpr(pattern = 'Condition[0-9]+', colnames(dat)) > 0]
lapply(cols.conds, function(x) {
  col.list <- colnames(dat)[1]
  col.situ <- gsub('Condition', 'Situation', x)
  dat[which(dat[[x]] < 0.5), c(col.list, x, col.situ)]
})
I assume dat is:
dat <- read.table(text =' List_name Condition1 Condition2 Situation1 Situation2
List1 0.01 0.12 66 123
List2 0.23 0.22 45 -34
List3 0.32 0.23 13 -12
List4 0.03 0.56 -3 45
List5 0.56 0.05 12 100
List6 0.90 0.09 22 32', head = T)
You can use the notion that boolean checks are vectorized:
x <- c(0.1, 0.3, 0.5, 0.2)
x < 0.5
# [1] TRUE TRUE FALSE TRUE
And some grep results:
grep('Condition', names(DF1))
To do this subsetting you can use apply to generate your boolean vector:
keepers <- apply(DF1[, grep('Condition', names(DF1))], 1, function(x) any(x < 0.5))
And subset:
DF1[keepers,]
Notice that this doesn't necessarily return the data structure you showed in your question. But you can alter the anonymous function accordingly using all or a different threshold value.
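For instance, a sketch of the stricter variant in which every Condition column must pass the cutoff (treating NA as a pass via na.rm = TRUE is an assumption here):
keepers.all <- apply(DF1[, grep('Condition', names(DF1))], 1,
                     function(x) all(x < 0.5, na.rm = TRUE))
DF1[keepers.all, ]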
In light of the edits, I would approach this differently. I would use melt from the reshape2 package:
library(reshape2)
dat.c <- melt(DF1,
id.var='List_name',
measure.var=grep('Condition', names(DF1), value=TRUE),
variable.name='condition',
value.name='cond.val')
dat.c$idx <- gsub('Condition', '', dat.c$condition)
dat.s <- melt(DF1,
id.var='List_name',
measure.var=grep('Situation', names(DF1), value=TRUE),
variable.name='situation',
value.name='situ.val')
dat.s$idx <- gsub('Situation', '', dat.s$situation)
dat <- merge(dat.c, dat.s)
out <- dat[dat$cond.val < 0.5,]
List_name idx condition cond.val situation situ.val
1 List1 1 Condition1 0.01 Situation1 66
2 List1 2 Condition2 0.12 Situation2 123
3 List2 1 Condition1 0.23 Situation1 45
4 List2 2 Condition2 0.22 Situation2 -34
5 List3 1 Condition1 0.32 Situation1 13
6 List3 2 Condition2 0.23 Situation2 -12
7 List4 1 Condition1 0.03 Situation1 -3
10 List5 2 Condition2 0.05 Situation2 100
12 List6 2 Condition2 0.09 Situation2 32
You can then use dcast to put the data back in the initial format if you want, but I find data in this "long" form much easier to work with. This form is also pleasant since it avoids the need for NA values where you have rows where one condition is met and others are not.
out.c <- dcast(out, List_name ~ condition, value.var='cond.val')
out.s <- dcast(out, List_name ~ situation, value.var='situ.val')
merge(out.c, out.s)
List_name Condition1 Condition2 Situation1 Situation2
1 List1 0.01 0.12 66 123
2 List2 0.23 0.22 45 -34
3 List3 0.32 0.23 13 -12
4 List4 0.03 NA -3 NA
5 List5 NA 0.05 NA 100
6 List6 NA 0.09 NA 32
I think what you're asking for is attainable, but the pieces can't be bound together (e.g. with cbind) in the way you've shown, as they have unequal numbers of rows. So you'll get a list.
Here, I assume that your data.frame is always of the form List_name, followed by Condition1, ..., ConditionN and then Situation1, ..., SituationN.
Then the result can be obtained by getting the column indices first and then filtering using lapply:
ids <- grep("Condition", names(df))
lapply(ids, function(x) df[which(df[[x]] < 0.5), c(1,x,x+length(ids))])
# [[1]]
# List_name Condition1 Situation1
# 1 List1 0.01 66
# 2 List2 0.23 45
# 3 List3 0.32 13
# 4 List4 0.03 -3
#
# [[2]]
# List_name Condition2 Situation2
# 1 List1 0.12 123
# 2 List2 0.22 -34
# 3 List3 0.23 -12
# 5 List5 0.05 100
# 6 List6 0.09 32
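If you prefer the list elements to be named after their Condition columns, a small follow-up using the same ids as above (a hypothetical convenience, not required):
res <- lapply(ids, function(x) df[which(df[[x]] < 0.5), c(1, x, x + length(ids))])
names(res) <- names(df)[ids]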
