Subsetting a dataframe based on another dataframe in R

df:
y x
F T
F F
T T
T F
df1:
y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100
df and df1 are the two data frames. For each row of df, I want to select the matching rows in df1, sum the values in the column "new", and store the output in a new data frame like this.
df_res:
y x new
F T .104
F F .312
T T .4
T F .2
Kindly help me out! I have been toiling over this for a long time now. The table headers will change according to the variables, so please do not hard-code the table headers.
Thanks.

I don't know how long your data is, but this can be one approach.
df<- read.table(text="y x
F T
F F
T T
T F",header=T,sep="")
df1 <- read.table(text="y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100", header=T, sep="")
df$yx <- paste0(df$y, df$x)
df1$yx <- paste0(df1$y, df1$x)
# Update automatically using a for loop (column 7 of df1 is "yx", column 6 is "new")
new <- numeric(nrow(df))
for (i in 1:nrow(df)) {
  new[i] <- sum(df1[which(df1[, 7] == df[i, 3]), 6])
}
df$new <- new
df
y x yx new
1 FALSE TRUE FALSETRUE 0.104
2 FALSE FALSE FALSEFALSE 0.312
3 TRUE TRUE TRUETRUE 0.400
4 TRUE FALSE TRUEFALSE 0.200
Or, using sapply instead of the loop:
new <- sapply(seq_len(nrow(df)), function(i) sum(df1[which(df1[, 7] == df[i, 3]), 6]))
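Since the headers (and hence column positions) may change, a sketch that avoids hard-coded indices altogether is to merge on whatever grouping columns the two frames share and then aggregate. This assumes the grouping variables are exactly the columns common to df and df1 (y and x here, i.e. before the yx/new columns above were added) and that the value column is called new:
# Sketch: join on the shared columns, then sum "new" per group (no hard-coded headers)
keys <- intersect(names(df), names(df1))
merged <- merge(df, df1[, c(keys, "new")], by = keys)
df_res <- aggregate(merged["new"], by = merged[keys], FUN = sum)
df_res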

It seems like all you want is the sum for each y/x (F/T) combination. If so, this works; otherwise you will have to describe the requirement more clearly.
text=" y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100"
df<-read.table(text=text, header=T)
df_res <- aggregate(data = df, new ~ interaction(y, x), sum)
interaction(y, x) new
1 FALSE.FALSE 0.312
2 TRUE.FALSE 0.200
3 FALSE.TRUE 0.104
4 TRUE.TRUE 0.400
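If you'd rather keep y and x as separate columns in the result (as in the desired df_res), a two-variable formula should work; a small sketch on the same df:
# Sketch: group by y and x directly so the result has separate y and x columns
df_res <- aggregate(new ~ y + x, data = df, FUN = sum)
df_res
#       y     x   new
# 1 FALSE FALSE 0.312
# 2  TRUE FALSE 0.200
# 3 FALSE  TRUE 0.104
# 4  TRUE  TRUE 0.400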

Here's an answer using merge and plyr.
Read in your example data.frame:
df1 <- read.table(text="y z probs.x x probs.y new
F F 0.08 T 0.4 0.032
F F 0.24 F 0.4 0.096
F T 0.12 T 0.6 0.072
F T 0.36 F 0.6 0.216
T F 0.40 T 0.5 0.200
T F 0.20 F 0.5 0.100
T T 0.40 T 0.5 0.200
T T 0.20 F 0.5 0.100", header=T, sep="")
If I understand, there are 2 steps to what you're asking. The first is to select the rows in df1 that match patterns in df. That can be done with merge. The df you gave has all combinations of TRUE and FALSE for x and y, so let's leave one out to see the effect:
df <- read.table(text="y x
F T
T T
T F",header=T,sep="")
df_merged <- merge(df, df1, all.y=F)
The result is a new data.frame that omits the rows where both x and y are F. With the default all.x=FALSE this is an inner join in SQL terms (use all.x=TRUE if you want a true left join that also keeps unmatched rows of df).
y x z probs.x probs.y new
1 FALSE TRUE FALSE 0.08 0.4 0.032
2 FALSE TRUE TRUE 0.12 0.6 0.072
3 TRUE FALSE FALSE 0.20 0.5 0.100
4 TRUE FALSE TRUE 0.20 0.5 0.100
5 TRUE TRUE FALSE 0.40 0.5 0.200
6 TRUE TRUE TRUE 0.40 0.5 0.200
The second part of the question is to group the data and apply a sum to the groups. Plyr is a great tool for this kind of data manipulation:
library(plyr)
ddply(df_merged, .(y,x), function(df) c(new=sum(df$new)))
The dd means we are giving a data.frame and want a data.frame as a result. The next argument .(y,x) is a quoted expression and names the variables we're grouping by. The result is this:
y x new
1 FALSE TRUE 0.104
2 TRUE FALSE 0.200
3 TRUE TRUE 0.400
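plyr has since been superseded by dplyr; a roughly equivalent sketch with dplyr (assuming the same df_merged as above) would be:
library(dplyr)
# Sketch: group the merged rows by y and x, then sum "new" within each group
df_merged %>%
  group_by(y, x) %>%
  summarise(new = sum(new), .groups = "drop")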

Related

Automated fill in columns in r

I have a dataframe (shown below) where there are some asterisks in the "sig" column.
I want to fill asterisks into the empty cells of the sig column everywhere above the furthest-down row that has an asterisk; in this case that means every row from "h" upward.
I'm thinking some sort of for loop that identifies the furthest-down row with an asterisk and then fills asterisks into the empty cells above it might be the way to go, but I'm not sure how to code this.
For debugging purposes, I make the data frame in R with
df <- data.frame("variable" = c("a","b","c","d","e","f","g","h","i","j","k"),
                 "value" = c(0.04,0.03,0.04,0.02,0.03,0.02,0.02,0.01,0.04,0.1,0.02),
                 "sig" = c("*","*","*","","*","*","","*","","",""))
Any help would be greatly appreciated - thanks!
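For reference, the explicit loop described in the question could be written directly; a minimal sketch using the df above:
# Sketch: find the last row that has an asterisk, then fill the empty cells above it
last_star <- max(which(df$sig == "*"))
for (i in seq_len(last_star)) {
  if (df$sig[i] == "") df$sig[i] <- "*"
}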
Another way:
df[1:max(which(df$sig == "*")), "sig"] = "*"
Gives:
variable value sig
1 a 0.04 *
2 b 0.03 *
3 c 0.04 *
4 d 0.02 *
5 e 0.03 *
6 f 0.02 *
7 g 0.02 *
8 h 0.01 *
9 i 0.04
10 j 0.10
11 k 0.02
We could use replace based on finding the index of the last element having *
library(dplyr)
df <- df %>%
  mutate(sig = replace(sig, seq(tail(which(sig == "*"), 1)), "*"))
-output
df
variable value sig
1 a 0.04 *
2 b 0.03 *
3 c 0.04 *
4 d 0.02 *
5 e 0.03 *
6 f 0.02 *
7 g 0.02 *
8 h 0.01 *
9 i 0.04
10 j 0.10
11 k 0.02
Another solution would be using fill, but you need to change "" to NA
Libraries
library(tidyverse)
Data
df <- data.frame("variable" = c("a","b","c","d","e","f","g","h","i","j","k"),
                 "value" = c(0.04,0.03,0.04,0.02,0.03,0.02,0.02,0.01,0.04,0.1,0.02),
                 "sig" = c("*","*","*","","*","*","","*","","",""))
Code
df %>%
  mutate(sig = if_else(sig == "", NA_character_, sig)) %>%
  fill(sig, .direction = "up")
Output
variable value sig
1 a 0.04 *
2 b 0.03 *
3 c 0.04 *
4 d 0.02 *
5 e 0.03 *
6 f 0.02 *
7 g 0.02 *
8 h 0.01 *
9 i 0.04 <NA>
10 j 0.10 <NA>
11 k 0.02 <NA>

Find the nth largest values in the top row and omit the rest of the columns in R

I am trying to change a data frame such that I only include those columns where the first value of the row is the nth largest.
For example, let's assume I want to include only the columns whose value in row 1 is among the 2 largest (top 2) values in that row.
dat1 = data.frame(a = c(0.1,0.2,0.3,0.4,0.5), b = c(0.6,0.7,0.8,0.9,0.10), c = c(0.12,0.13,0.14,0.15,0.16), d = c(NA, NA, NA, NA, 0.5))
a b c d
1 0.1 0.6 0.12 NA
2 0.2 0.7 0.13 NA
3 0.3 0.8 0.14 NA
4 0.4 0.9 0.15 NA
5 0.5 0.1 0.16 0.5
such that a and d are removed, because 0.1 and NA are not among the 2 largest values in row 1. Here 0.6 and 0.12 are larger than 0.1 (column a) and NA (column d) respectively.
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
Is there a simple way to subset this? I do not want to order it, because that will create problems with other data frames I have that are related.
Complementing pieca's answer, you can encapsulate that into a function.
Also, this way, the columns of the returned data.frame keep their original order.
get_nth <- function(df, n) {
  df[] <- lapply(df, as.numeric)  # make sure every column is numeric (edit)
  cols <- names(sort(df[1, ], na.last = NA, decreasing = TRUE))
  cols <- cols[seq(n)]
  df <- df[names(df) %in% cols]
  return(df)
}
Hope this works for you.
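Called on the example data from the question, this should keep only columns b and c; a quick usage sketch:
dat1 <- data.frame(a = c(0.1,0.2,0.3,0.4,0.5), b = c(0.6,0.7,0.8,0.9,0.10),
                   c = c(0.12,0.13,0.14,0.15,0.16), d = c(NA, NA, NA, NA, 0.5))
get_nth(dat1, 2)  # expected to return the b and c columns, in their original order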
Sort the first row of your data.frame, and then subset by names:
cols <- names(sort(dat1[1,], na.last = NA, decreasing = TRUE))
> dat1[,cols[1:2]]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
You can get an inverted rank of the first row and take the top n columns:
> r <- rank(-dat1[1,], na.last=T)
> r <- r <= 2
> dat1[,r]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16

Rearranging each row from largest value to smallest value in R

My data frame is set as follows:
Black White Red Blue
0.8 0.1 0.07 0.03
0.3 0.6 0 0.1
0.1 0.6 0.25 0.05
I wanted my data frame to look like this:
Black White Red Blue Color1 Color2 Color3 Color4
0.8 0.1 0.07 0.03 0.8 0.1 0.07 0.03
0.3 0.6 0 0.1 0.6 0.3 0.1 0
0.1 0.6 0.25 0.05 0.6 0.25 0.1 0.05
In which Color1 represents the largest value for each row, Color2 represents the second largest value, Color3 represents the third largest, and Color4 represents the smallest value for each row.
So far, I've used this function to obtain what I wanted, which is the result above:
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
df$Color1 <- apply(df, 1, max)
df$Color2 <- apply(df, 1, function(x)x[maxn(3)(x)])
df$Color3 <- apply(df, 1, function(x)x[maxn(4)(x)])
df$Color4 <- apply(df, 1, function(x)x[maxn(5)(x)])
Is there a more concise way for me to arrange my dataset?
Additionally, a bit off-topic: I'm not sure if it's because I'm working with a CSV file, but whenever I use the function
df$Color2 <- apply(df, 1, function(x)x[maxn(2)(x)])
It will return the same result as the function
apply(df, 1, max)
AND
apply(df, 1, function(x)x[maxn(1)(x)])
One option is to use sort with apply, transpose, and then cbind the result with the original data frame:
cbind(df, t(apply(df, 1, sort, decreasing = TRUE)))
# Black White Red Blue 1 2 3 4
# 1 0.8 0.1 0.07 0.03 0.8 0.10 0.07 0.03
# 2 0.3 0.6 0.00 0.10 0.6 0.30 0.10 0.00
# 3 0.1 0.6 0.25 0.05 0.6 0.25 0.10 0.05
Updated: based on a suggestion from @dww, column names can be assigned as:
df[paste0('color',1:4)] = t(apply(df, 1, sort, decreasing = TRUE))
# Black White Red Blue color1 color2 color3 color4
# 1 0.8 0.1 0.07 0.03 0.8 0.10 0.07 0.03
# 2 0.3 0.6 0.00 0.10 0.6 0.30 0.10 0.00
# 3 0.1 0.6 0.25 0.05 0.6 0.25 0.10 0.05
It's quite a bit more complex, but a speedier solution if you're dealing with a large number of rows is to do the sorting/ordering only once and re-insert the result into a matrix shape (here x is the data as a matrix):
matrix(x[order(-row(x), x, decreasing=TRUE)], nrow=nrow(x), ncol=ncol(x), byrow=TRUE)
Some timings:
x <- matrix(rnorm(300000*5), nrow=300000, ncol=5)
system.time(t(apply(x, 1, sort, decreasing=TRUE)))
# user system elapsed
# 14.13 0.00 14.13
system.time(
matrix(x[order(-row(x),x, decreasing=TRUE)], nrow=nrow(x), ncol=ncol(x), byrow=TRUE)
)
# user system elapsed
# 0.10 0.00 0.09
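To use this on the original data frame, one might convert it to a matrix first and assign the sorted values back; a sketch (assuming df still holds just the four colour columns; the colorN names are chosen here, not taken from the answer):
# Sketch: apply the single-ordering trick to df and bind the sorted rows back
x <- as.matrix(df)
df[paste0("color", seq_len(ncol(x)))] <-
  matrix(x[order(-row(x), x, decreasing = TRUE)],
         nrow = nrow(x), ncol = ncol(x), byrow = TRUE)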

Rename dataframe columns by splitting their names and melting the dataframe afterwards

Example dataframe:
col_1 col_2 col_3 col_4
f1 0.1 0.2 0.3 0.4
f2 0.01 0.02 0.03 0.04
f3 0.001 0.002 0.003 0.004
I want to rename the columns by splitting their names on sep="_" to get this:
1 2 3 4
f1 0.1 0.2 0.3 0.4
f2 0.01 0.02 0.03 0.04
f3 0.001 0.002 0.003 0.004
Then I'd like to plot density for each column (f.name vs f.value) on the same plot (for instance: http://ggplot2.tidyverse.org/reference/geom_freqpoly-11.png) so I guess I need to melt it into something like this:
col f.name f.value
1 f1 0.1
2 f1 0.2
3 f1 0.3
4 f1 0.4
1 f2 0.01
2 f2 0.02
3 f2 0.03
4 f2 0.04
1 f3 0.001
2 f3 0.002
3 f3 0.003
4 f3 0.004
Any suggestions how to do that?
Without testing the code, use packages 'dplyr' and 'tidyr'. Where df is your input data frame, the following should work:
library(dplyr); library(tidyr); library(ggplot2)
df %>%
  gather(col, val, starts_with('col')) %>%
  separate(col, into = c('nah', 'col'), sep = '_') %>%
  ggplot(aes(x = val, colour = col)) + geom_freqpoly()
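Note that the f1..f3 labels are row names rather than a column, so gather drops them; a sketch that keeps them, assuming the labels really are row names and using the newer pivot_longer in place of gather:
library(tibble); library(tidyr); library(ggplot2)
# Sketch: keep the row labels as f.name, melt the col_* columns, plot one curve per f.name
df %>%
  rownames_to_column("f.name") %>%
  pivot_longer(starts_with("col"), names_to = "col", values_to = "f.value",
               names_prefix = "col_") %>%
  ggplot(aes(x = f.value, colour = f.name)) + geom_freqpoly()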

replacing values in dataframe with another dataframe r

I have a dataframe of values that represent fold changes as such:
> df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
A B C
1 1.74 1.50 1.10
2 -1.30 0.90 3.01
3 3.10 0.71 1.40
And a dataframe of pvalues as such that matches rows and columns identically:
> df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
A B C
1 0.02 NA 0.01
2 0.01 0.01 0.01
3 0.80 0.06 0.03
What I want is to modify the values in df1 so that I retain only the values whose corresponding p-value in df2 is < .05, and replace the rest with NA. Note there are also NAs in df2.
> desired <- data.frame(A=c(1.74,-1.3,NA), B=c(NA,.9,NA), C=c(1.1,3.01,1.4))
> desired
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
I first tried to use vector syntax on these dataframes and that didn't work. Then I tried a for loop over the columns and that also failed.
I don't think I understand how to index each i,j position and then replace the df1 value based on a logical test of the df2 value, or whether there is a better way in R.
You can try this:
df1[!(df2 < 0.05) | is.na(df2)] <- NA  # logical matrix indexing across the whole data frame
Out:
> df1
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
ifelse and as.matrix seem to do the trick.
df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
x1 <- as.matrix(df1)
x2 <- as.matrix(df2)
as.data.frame( ifelse( x2 >= 0.05 | is.na(x2), NA, x1) )
Result
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
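If you prefer to stay with data frames and avoid the matrix conversion, a column-wise sketch using Map gives the same result (assuming df1 and df2 have identical dimensions and column order, as in the question):
# Sketch: mask each column of df1 with the corresponding column of df2
desired <- as.data.frame(Map(function(v, p) ifelse(!is.na(p) & p < 0.05, v, NA), df1, df2))
desired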
