R - compare columns with rows of two different data frames

I have two data frames. I would like to take a subset of the first data frame, keeping only the columns whose first value is equal to the first value of the rows of the second data frame.
Example
Data Frame 1:
columns_df1 : a b c d e
Data Frame 2:
rows_df2 : a c e
Subset I would like to obtain:
final_columns_df1 = a c e
I am stuck on how to compare columns with rows belonging to two different data frames.
Thanks for your help!

OK. It's a little unclear what you want from your question, as you don't provide a full reproducible example. But I think this is what you're looking for.
df1 <- data.frame(a = c(1, 2),
b = c(3, 4),
c = c(5, 6),
d = c(7, 8),
e = c(9, 10))
df2 <- data.frame(f = c("a", "b"),
g = c("c", "d"),
h = c("e", "f"))
final_columns_df1 <- df1[ , names(df1) %in% df2[1, ]]
final_columns_df1
a c e
1 1 5 9
2 2 6 10
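The comparison above passes a one-row data frame to %in%. A slightly more defensive sketch, assuming the same example data, first flattens that row into a plain character vector (the name `wanted` is only illustrative) and uses drop = FALSE so the result stays a data frame even when a single column matches:

```r
df1 <- data.frame(a = c(1, 2), b = c(3, 4), c = c(5, 6),
                  d = c(7, 8), e = c(9, 10))
df2 <- data.frame(f = c("a", "b"), g = c("c", "d"), h = c("e", "f"))

# Flatten the first row of df2 into a character vector.
# as.character() also copes with factor columns (the default before R 4.0).
wanted <- vapply(df2[1, ], as.character, character(1))

# Keep the columns of df1 whose names appear in that vector.
# drop = FALSE keeps a data frame even if only one column matches.
final_columns_df1 <- df1[, names(df1) %in% wanted, drop = FALSE]
names(final_columns_df1)
```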

Related

How to sum the data frames in a list that have the same name?

I have two lists containing multiple data frames:
library(tibble)
list_1 <- list(a = tibble(c(1,2),c(3,4)),
b = tibble(c(3,4),c(2,5)),
c = tibble(c(5,62),c(1,6)))
list_2 <- list(a = tibble(c(1,2),c(3,4)),
b = tibble(c(3,4),c(2,5)),
d = tibble(c(5,62),c(1,6)))
Now, I would like to sum up all data frames that have the same name. Thus, the desired output should look like this:
list_1 <- list(a = tibble(c(2,4),c(6,8)),
b = tibble(c(6,8),c(4,10)))
Does anybody have an idea how to solve this problem?
Thanks in advance.
You can get the names common to both lists using intersect, and then add the two lists element-wise for only those names with Map.
common_names <- intersect(names(list_1), names(list_2))
Map(`+`, list_1[common_names], list_2[common_names])
#$a
# c(1, 2) c(3, 4)
#1 2 6
#2 4 8
#$b
# c(3, 4) c(2, 5)
#1 6 4
#2 8 10
The same with purrr's map2:
purrr::map2(list_1[common_names], list_2[common_names], `+`)
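As a quick check of the intersect plus Map approach, here is a minimal sketch that uses base data frames instead of tibbles. The column names x and y are made up for illustration, and `summed` is just an illustrative name:

```r
list_1 <- list(a = data.frame(x = c(1, 2), y = c(3, 4)),
               b = data.frame(x = c(3, 4), y = c(2, 5)),
               c = data.frame(x = c(5, 62), y = c(1, 6)))
list_2 <- list(a = data.frame(x = c(1, 2), y = c(3, 4)),
               b = data.frame(x = c(3, 4), y = c(2, 5)),
               d = data.frame(x = c(5, 62), y = c(1, 6)))

# Names present in both lists; "c" and "d" are dropped.
common_names <- intersect(names(list_1), names(list_2))

# Add the data frames pairwise, element by element.
summed <- Map(`+`, list_1[common_names], list_2[common_names])
names(summed)
summed$a
```

Note that this relies on the paired data frames having the same dimensions.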

How to merge two data frames

Say I have two data frames, for example:
a <- c(10,20,30,40)
b <- c('b', 'p', 't', 'x')
c <- c(TRUE,FALSE,TRUE,FALSE)
d <- c(2.5, 8, 10, 7)
df1 <- data.frame(a,b,c,d)
e<-c(2.5,2.5,8,8,8,10,10,10)
f<-c(T, T, F, F, F, T, F, T)
df2<- data.frame(e,f)
I know that all the values of column e in dataframe 2 are contained in column d of dataframe 1.
I want to be able to place column b into dataframe 2 so that it would look like this:
e<-c(2.5,2.5,8,8,8,10,10,10)
f<-c(T, T, F, F, F, T, F, T)
b<-c("b", "b", "p", "p", "p", "t", "t", "t")
df2<- data.frame(e,f,b)
That is, where a value in column e of dataframe 2 is equal to a value in column d of dataframe 1, I want to place the value of column b corresponding to that value in column d into a new column in dataframe 2.
In reality, I am using much larger datasets than this, so I am hoping for something that does this in a timely manner(i.e preferably not nested for loops). Any help would be greatly appreciated!
We can do a merge in base R
merge(df2, df1[c('b', 'd')], by.x = 'e', by.y = 'd')
Another base R solution using match
df2$b <- df1$b[match(df2$e,df1$d)]
which gives
> df2
e f b
1 2.5 TRUE b
2 2.5 TRUE b
3 8.0 FALSE p
4 8.0 FALSE p
5 8.0 FALSE p
6 10.0 TRUE t
7 10.0 FALSE t
8 10.0 TRUE t
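The two approaches behave differently when a key has no match or when row order matters. The following is a small sketch with made-up values (99 is deliberately absent from df1$d): match keeps the rows of df2 in place and fills in NA, while merge with its defaults drops the unmatched row and sorts by the key.

```r
df1 <- data.frame(a = c(10, 20, 30, 40),
                  b = c("b", "p", "t", "x"),
                  d = c(2.5, 8, 10, 7))
df2 <- data.frame(e = c(10, 2.5, 99),   # 99 is a made-up value not in df1$d
                  f = c(TRUE, FALSE, TRUE))

# merge(): inner join by default, result sorted by the key column.
merged <- merge(df2, df1[c("b", "d")], by.x = "e", by.y = "d")

# match(): position of each df2$e in df1$d, NA where there is no match.
idx <- match(df2$e, df1$d)
df2$b <- df1$b[idx]

merged
df2
```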

Combining values Boolean columns to one with Priority in R

I have gone through the links below, but they only partially solved my problem.
merge multiple TRUE/FALSE columns into one
Combining a matrix of TRUE/FALSE into one
R: Converting multiple boolean columns to single factor column
I have a dataframe which looks like:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
A = c('Y','N','N','N','N','N','N','N'),
B = c('N','Y','N','N','N','N','Y','N'),
C = c('N','N','Y','N','N','Y','N','N'),
D = c('N','N','N','Y','N','Y','N','N'),
E = c('N','N','N','N','Y','N','Y','N')
)
I want to reshape my df into a single result column, but it has to apply a priority when there are two "Y" values in a row.
The priority is A>B>C>D>E, which means that if there is a "Y" in A, the resulting value should be A. Similarly, in the example df above, both C and D have a "Y" for Id 6, but the resulting df should contain "C".
Hence output should look like:
resultant_dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
Result = c('A','B','C','D','E','C','B','NA')
)
I have tried this:
library(reshape2)
new_df <- melt(dat, "Id", variable.name = "Result")
new_df <-new_df[new_df$value == "Y", c("Id", "Result")]
But the problem is that it doesn't handle the priority; it creates 2 rows for the same Id.
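# Columns in priority order (taken from the question: A > B > C > D > E)
col_order = c("A", "B", "C", "D", "E")
tmp = data.frame(ID = dat[,1],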
Result = col_order[apply(
X = dat[col_order],
MARGIN = 1,
FUN = function(x) which(x == "Y")[1])],
stringsAsFactors = FALSE)
tmp$Result[is.na(tmp$Result)] = "Not Present"
tmp
# ID Result
#1 1 A
#2 2 B
#3 3 C
#4 4 D
#5 5 E
#6 6 C
#7 7 B
#8 8 Not Present
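A vectorised variant of the same idea, offered as a sketch rather than a drop-in replacement: max.col() with ties.method = "first" returns the first column holding the row maximum, which for a 0/1 matrix is the first "Y" in priority order. Rows with no "Y" at all have to be handled separately. The names `is_y`, `first_y` and `result` are illustrative.

```r
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
                  A = c('Y','N','N','N','N','N','N','N'),
                  B = c('N','Y','N','N','N','N','Y','N'),
                  C = c('N','N','Y','N','N','Y','N','N'),
                  D = c('N','N','N','Y','N','Y','N','N'),
                  E = c('N','N','N','N','Y','N','Y','N'))

# Columns in priority order, highest priority first.
col_order <- c("A", "B", "C", "D", "E")

# Logical matrix: TRUE where a cell is "Y".
is_y <- dat[col_order] == "Y"

# Index of the first "Y" per row (adding 0 converts TRUE/FALSE to 1/0).
first_y <- max.col(is_y + 0, ties.method = "first")

# Rows without any "Y" would otherwise report column 1, so blank them out.
first_y[rowSums(is_y) == 0] <- NA

result <- col_order[first_y]
data.frame(Id = dat$Id, Result = result)
```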

How to deal with join keys of a different length and variable ordering?

Consider two data tables where the number of key columns differ:
library(data.table)
tmp_dt <- data.table(group1 = letters[1:5], group2 = c(1, 1, 2, 2, 2), a = rnorm(5), key = c("group1", "group2"))
tmp_dt2 <- data.table(group2 = c(1, 2, 3), color = c("r", "g", "b"), key = "group2")
I want to join tmp_dt to tmp_dt2 by group2, however the following fails:
tmp_dt[tmp_dt2]
> tmp_dt[tmp_dt2]
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, :
x.'group1' is a character column being joined to i.'group2' which is type 'double'. Character columns must join to factor or character columns.
This makes sense since it tries to join the data tables on the first key variable. How do I fix it so that the behaviour is the same as dplyr::inner_join, without incurring overheads in resetting the key on tmp_dt twice?
> inner_join(tmp_dt, tmp_dt2, by = "group2")
group1 group2 a color
1 a 1 0.2501413 r
2 b 1 0.6182433 r
3 c 2 -0.1726235 g
4 d 2 -2.2239003 g
5 e 2 -1.2636144 g
Using lapply
tmp_dt[,color:=unlist(lapply(.BY, function(x) tmp_dt2[group2==x, color])), by=group2]
As pointed out by Frank in the comments, using on
tmp_dt[tmp_dt2, on="group2"]
tmp_dt2[tmp_dt, on="group2"]
Using on is roughly twice as fast as lapply with .BY. Note, however, that the first form, tmp_dt[tmp_dt2, on="group2"], returns a sixth row (NA 3 NA b), because group2 = 3 in tmp_dt2 has no match in tmp_dt.
You should use this code
tmp_dt2[tmp_dt, on = 'group2']
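To see the difference between the join directions, here is a short sketch on the question's data, assuming the data.table package is installed. The set.seed call and the names `right` and `inner` are additions for illustration. The nomatch = NULL argument drops rows of the i table that have no match, which gives inner_join-like rows without changing the key of tmp_dt.

```r
library(data.table)

set.seed(1)  # only so the random column `a` is reproducible
tmp_dt <- data.table(group1 = letters[1:5], group2 = c(1, 1, 2, 2, 2),
                     a = rnorm(5), key = c("group1", "group2"))
tmp_dt2 <- data.table(group2 = c(1, 2, 3), color = c("r", "g", "b"),
                      key = "group2")

# Default nomatch = NA: every row of tmp_dt2 is kept, so group2 = 3
# shows up with NA in group1 and a.
right <- tmp_dt[tmp_dt2, on = "group2"]

# nomatch = NULL: unmatched rows are dropped (inner join).
inner <- tmp_dt[tmp_dt2, on = "group2", nomatch = NULL]

nrow(right)
nrow(inner)
```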

split dataframe in R by row

I have a long dataframe like this:
Row Conc group
1 2.5 A
2 3.0 A
3 4.6 B
4 5.0 B
5 3.2 C
6 4.2 C
7 5.3 D
8 3.4 D
...
The actual data have hundreds of rows. I would like to split it into two parts: groups A to C, and group D. I searched the web and found several solutions, but none of them apply to my case.
How to split a data frame?
For example:
Case 1:
x = data.frame(num = 1:26, let = letters, LET = LETTERS)
set.seed(10)
split(x, sample(rep(1:2, 13)))
I don't want to split by arbitrary number
Case 2: Split by level/factor
data2 <- data[data$sum_points == 2500, ]
I don't want to split by a single factor either. Sometimes I want to combine many levels together.
Case 3: select by row number
newdf <- mydf[1:3,]
The actual data have hundreds of rows. I don't know the row number. I just know the level I would like to split at.
It sounds like you want two data frames, where one has (A,B,C) in it and one has just D. In that case you could do
Data1 <- subset(Data, group %in% c("A","B","C"))
Data2 <- subset(Data, group=="D")
Correct me if you were asking something different.
For those who end up here through internet search engines time after time, the answer to the question in the title is:
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)
split(x, sort(as.numeric(rownames(x))))
This assumes that your data frame has numerically ordered row names. split(x, rownames(x)) also works, but the result is rearranged, because the row names are then sorted as character strings.
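If the row names cannot be relied on, a sketch that does not depend on them is to split on the row positions themselves. seq_len(nrow(x)) is used here as the grouping vector, and `rows` is an illustrative name:

```r
x <- data.frame(num = 1:26, let = letters, LET = LETTERS)

# One group per row position, so each list element is a one-row data frame
# and the original row order is preserved.
rows <- split(x, seq_len(nrow(x)))

length(rows)
rows[[3]]
```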
You may consider using the recode() function from the "car" package.
# Load the library and make up some sample data
library(car)
set.seed(1)
dat <- data.frame(Row = 1:100,
Conc = runif(100, 0, 10),
group = sample(LETTERS[1:10], 100, replace = TRUE))
Currently, dat$group contains the upper case letters A to J. Imagine we wanted the following four groups:
"one" = A, B, C
"two" = D, E, J
"three" = F, I
"four" = G, H
Now, use recode() (note the semicolon and the nested quotes).
recodes <- recode(dat$group,
'c("A", "B", "C") = "one";
c("D", "E", "J") = "two";
c("F", "I") = "three";
c("G", "H") = "four"')
split(dat, recodes)
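The same regrouping idea can be sketched in base R without the car package, using a named lookup vector that maps each level to a new group. The example below uses the eight rows from the question; the labels "A_to_C" and "D_only" and the names `lookup`, `new_group` and `parts` are made up for illustration.

```r
dat <- data.frame(Row = 1:8,
                  Conc = c(2.5, 3.0, 4.6, 5.0, 3.2, 4.2, 5.3, 3.4),
                  group = c("A", "A", "B", "B", "C", "C", "D", "D"))

# Map every original level to the name of the group it should land in.
lookup <- c(A = "A_to_C", B = "A_to_C", C = "A_to_C", D = "D_only")

# Look up the new group for each row; as.character() guards against factors.
new_group <- unname(lookup[as.character(dat$group)])

# Split on the recoded groups.
parts <- split(dat, new_group)
names(parts)
```

Changing the grouping then only requires editing the lookup vector.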
With base R, we can pass split() a logical condition that defines the split.
split(df, df$group == "D")
Output
$`FALSE`
Row Conc group
1 1 2.5 A
2 2 3.0 A
3 3 4.6 B
4 4 5.0 B
5 5 3.2 C
6 6 4.2 C
$`TRUE`
Row Conc group
7 7 5.3 D
8 8 3.4 D
If you wanted to split on multiple letters, then we could:
split(df, df$group %in% c("A", "D"))
Another option is to use group_split from dplyr, but you will need to create a grouping variable first for the split.
library(dplyr)
df %>%
mutate(spl = ifelse(group == "D", 1, 0)) %>%
group_split(spl, .keep = FALSE)
