Match select the max value from another data frame - r

I have two data frames, v1 and v2. I need to add column y from v2 to data frame v1, but where a key matches more than once I want the maximum value. For example:
v1 <- data.frame(x = c("a1","b2"))
v2 <- data.frame(x = c("a1","a1","b2","b2"), y= c(1,3,4,6))
I am using below line to populate y column in v1.
v1$y <-v2$y[match(v1$x,v2$x)]
which outputs below.
> v1
x y
1 a1 1
2 b2 4
match takes y from the first occurrence, but I need the max for each key, something like below:
> v1
x y
1 a1 3
2 b2 6

As match returns the first match, you can order the data so that the first match is also the max:
v2 <- v2[order(v2$x, -v2$y), ]
v1$y <- v2$y[match(v1$x, v2$x)]
v1
# x y
#1 a1 3
#2 b2 6

You can first aggregate to find the max and then match it to v1.
tt <- aggregate(y ~ x, data=v2, FUN=max)
v1$y <-tt$y[match(v1$x,tt$x)]
v1
# x y
#1 a1 3
#2 b2 6

Try to aggregate first and then join (or match),
merge(v1, aggregate(y~x, v2, max), by = 'x')
or
max_v2 <- aggregate(y~x, v2, max)
max_v2$y[match(v1$x, max_v2$x)]

A possible base solution:
new_df<-merge(v1,v2, by="x")
aggregate(.~x, new_df,max)
Or with dplyr:
library(dplyr)
v1 %>%
left_join(v2, "x") %>%
group_by(x) %>%
summarise(y=max(y))
# A tibble: 2 x 2
x y
<fct> <dbl>
1 a1 3
2 b2 6
Or another base option:
# keep only the rows of v2 whose key appears in v1, then take the max per key
aggregate(. ~ x, v2[v2$x %in% v1$x, ], max)
x y
1 a1 3
2 b2 6

First filter v2 for the max values and then match:
library(dplyr)
v1 <- data.frame(x = c("a1","b2"))
v2 <- data.frame(x = c("a1","a1","b2","b2"), y= c(1,3,4,6))
v2.sub <- v2 %>%
group_by(x) %>%
filter(y==max(y))
v1$y <-v2.sub$y[match(v1$x,v2.sub$x)]

Here is a solution with data.table
library("data.table")
v1 <- data.table(x = c("a1","b2"))
v2 <- data.table(x = c("a1","a1","b2","b2"), y= c(1,3,4,6))
v2[, .(y=max(y)), x][v1, on="x"]
# > v2[, .(y=max(y)), x][v1, on="x"]
# x y
# 1: a1 3
# 2: b2 6
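One practical difference between the answers above is what happens when v1 contains a key that does not occur in v2. A minimal base R sketch, assuming a made-up extra key "c3" that is not in the original data: match() keeps the row and fills it with NA, while merge() defaults to an inner join and drops it unless all.x = TRUE is given.

```r
# "c3" is a hypothetical key added for illustration; it has no match in v2
v1 <- data.frame(x = c("a1", "b2", "c3"), stringsAsFactors = FALSE)
v2 <- data.frame(x = c("a1", "a1", "b2", "b2"), y = c(1, 3, 4, 6),
                 stringsAsFactors = FALSE)

# one row per key, holding the max of y
max_v2 <- aggregate(y ~ x, data = v2, FUN = max)

# match() keeps every row of v1 and yields NA where there is no match
v1$y <- max_v2$y[match(v1$x, max_v2$x)]

# merge() is an inner join by default, so the unmatched key disappears
inner <- merge(v1["x"], max_v2, by = "x")

# all.x = TRUE makes it a left join and keeps the unmatched key with NA
left <- merge(v1["x"], max_v2, by = "x", all.x = TRUE)
```

Which behaviour is wanted depends on whether unmatched keys should survive in the result.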

Related

Split data frame column by number of characters specified in another column

Sorry if this sounds trivial, but I have been stuck for a while with this.
I want to split a column of strings into two, splitting at the number of the character specified in another column:
dat <- tibble(x=c("ABCDEFG", "QRSTUVWXYZ", "FGYHGBJIOW"), y=c(4,3,8))
dat
# A tibble: 3 x 2
x y
<chr> <dbl>
1 ABCDEFG 4
2 QRSTUVWXYZ 3
3 FGYHGBJIOW 8
Desired outcome:
x1 x2 y
-------------------------
ABCD EFG 4
QRS TUVWXYZ 3
FGYHGBJI OW 8
I have tried using tidyr::separate, which accepts a number of characters in sep =, but it won't take that number from another column. I have tried writing a function in the hope that it would do that (https://dplyr.tidyverse.org/articles/programming.html), but it seems the sep = argument does not accept a column name (https://tidyr.tidyverse.org/reference/separate.html).
Any help would be appreciated!
A simple solution would be:
dat <- dat %>% mutate(x1 = substring(x, 1, y),
x2 = substring(x, y + 1, nchar(x)))
Similar to @PinotTiger's solution, using within:
dat <- within(dat, {
x2 <- substring(x, y + 1, nchar(x))
x1 <- substring(x, 1, y)
rm(x)
})[c(2, 3, 1)]
dat
# x1 x2 y
# 1 ABCD EFG 4
# 2 QRS TUVWXYZ 3
# 3 FGYHGBJI OW 8
You can use str_extract from the stringr library and build the number of characters to extract into the pattern for each row:
dat$x1 <- str_extract(dat$x, paste0("\\w{",dat$y,"}"))
dat$x2 <- str_extract(dat$x, paste0("\\w{",nchar(dat$x) - dat$y,"}$"))
dat
# A tibble: 3 x 4
x y x1 x2
<chr> <dbl> <chr> <chr>
1 ABCDEFG 4 ABCD EFG
2 QRSTUVWXYZ 3 QRS TUVWXYZ
3 FGYHGBJIOW 8 FGYHGBJI OW
An option with separate, after using str_replace to insert a delimiter at the position specified by 'y':
library(dplyr)
library(tidyr)
library(stringr)
dat %>%
mutate(x = str_replace(x, sprintf("(.{%d})", y), "\\1,")) %>%
separate(x, into = c('x1', 'x2'))
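The substring-based answers work because substring() is vectorised over its start and stop arguments, so each row gets its own split position. A small base R sketch of that idea; the fourth row ("XY" with y = 5) is a made-up edge case, not part of the question, showing what happens when y exceeds the string length.

```r
dat <- data.frame(x = c("ABCDEFG", "QRSTUVWXYZ", "FGYHGBJIOW", "XY"),
                  y = c(4, 3, 8, 5),  # last row is a hypothetical edge case
                  stringsAsFactors = FALSE)

# first y characters of each string
dat$x1 <- substring(dat$x, 1, dat$y)

# everything from position y + 1 to the end of each string
dat$x2 <- substring(dat$x, dat$y + 1)

dat[c("x1", "x2", "y")]
```

When y is larger than the string, x1 is the whole string and x2 is empty rather than NA, which may or may not be what you want.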

How to rearrange a tibble by rownames

Traditional dataframes support rearrangement of rows by rownames:
> df <- data.frame(c1 = letters[1:3], c2 = 1:3, row.names = paste0("x", 1:3))
> df
c1 c2
x1 a 1
x2 b 2
x3 c 3
#' If we want, say, row "x3" and "x1":
> df[c("x3", "x1"), ]
c1 c2
x3 c 3
x1 a 1
Tibbles drop the concept of rownames, so I wonder what the standard way is to achieve something similar.
> tb <- as_tibble(rownames_to_column(df))
> tb
# A tibble: 3 x 3
rowname c1 c2
<chr> <fct> <int>
1 x1 a 1
2 x2 b 2
3 x3 c 3
> ?
Thanks.
Update
I can come up with the following solution:
> tb[match(c("x3", "x1"), tb[["rowname"]]), ]
# A tibble: 2 x 3
rowname c1 c2
<chr> <fct> <int>
1 x3 c 3
2 x1 a 1
But it seems clumsy. Does anyone have a better idea?
Update 2
More generally, my question can be rephrased as: in tidyverse syntax, what is the neatest and quickest equivalent to
df[c("x3", "x1"), ]
that is, subsetting and rearranging rows of a dataframe.
As joran described, you can use filter to select the rows of interest. To then put them in a specific, manually defined order, use arrange with a factor whose levels are in that order:
tibble(rowname = paste0("x", 1:3), c1 = letters[1:3], c2 = 1:3) %>%
filter(rowname %in% c("x3", "x1")) %>%
arrange(factor(rowname, levels = c("x3", "x1")))
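For comparison, the same subset-and-reorder can be sketched in base R by wrapping the match() idea from the question's update in a small function. pick_rows() is a hypothetical helper name, not something from dplyr or tibble, and it assumes the former rownames live in a column called rowname.

```r
df <- data.frame(rowname = paste0("x", 1:3), c1 = letters[1:3], c2 = 1:3,
                 stringsAsFactors = FALSE)

# pick_rows() is a hypothetical helper: return the rows whose key column
# equals each element of `keys`, in the order given by `keys`
pick_rows <- function(data, keys, by = "rowname") {
  idx <- match(keys, data[[by]])
  # silently skip keys that are not present (an assumption; you may prefer an error)
  data[idx[!is.na(idx)], , drop = FALSE]
}

res <- pick_rows(df, c("x3", "x1"))
res
```

Unlike filter() plus arrange(), this returns only the first row for each key, so it assumes the key column is unique.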

Update existing data.frame with values from another one if missing

I'm looking for (1) the name of and (2) a cleaner method for the following operation in R (base and data.table preferred).
Input
> d1
id x y
1 1 1 NA
2 2 NA 3
3 3 4 NA
> d2
id x y z
1 4 NA 30 a
2 3 20 2 b
3 2 14 NA c
4 1 15 97 d
(note that the actual data.frames have hundreds of columns)
Expected output:
> d1
id x y z
1 1 1 97 d
2 2 14 3 c
3 3 4 2 b
Data and current solution:
d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])
for (col in setdiff(names(d1), "id")) {
# If missing look in d2
missing <- is.na(d1[[col]])
d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
}
for (col in setdiff(names(d2), names(d1))) {
# If column missing then add
d1[[col]] <- d2[match(d1$id, d2$id), col]
}
PS:
This question has likely been asked before, but I lack the vocabulary to search for it.
Assuming you are working with 2 data.frames, here is a base solution
#expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#make sure that d2 also has the same columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#align rows and columns to match those in d1
mask <- d2[match(d1$id, d2$id), names(d)]
#replace NAs in d with the corresponding values from mask
replace(d, is.na(d), mask[is.na(d)])
If you don't mind, we can rewrite your question into a general matrix-coalesce question (i.e. any number of matrices, columns, rows), which does not seem to have been asked before.
edit:
Another base R solution is a hack of coalesce1a from How to implement coalesce efficiently in R
coalesce.mat <- function(...) {
ans <- ..1
for (elt in list(...)[-1]) {
rn <- match(ans$id, elt$id)
ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
}
ans
}
allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat,
lapply(list(d1, d2), function(x) {
x[, setdiff(allcols, names(x))] <- NA
x
}))
edit:
A possible data.table solution using coalesce1a from How to implement coalesce efficiently in R by Martin Morgan:
coalesce1a <- function(...) {
ans <- ..1
for (elt in list(...)[-1]) {
i <- which(is.na(ans))
ans[i] <- elt[i]
}
ans
}
setDT(d1)
setDT(d2)
#melt into long formats and full outer join the 2
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"), by=c("id","variable"), all=TRUE)
#perform a coalesce on vectors
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]
#pivot into original format and subset to those in d1
dcast.data.table(mdt, id ~ variable, value.var="value")[
d1, .SD, on=.(id)]
Here is a possibility using dplyr::left_join:
left_join(d1, d2, by = "id") %>%
mutate(
x = ifelse(!is.na(x.x), x.x, x.y),
y = ifelse(!is.na(y.x), y.x, y.y)) %>%
select(id, x, y, z)
# id x y z
#1 1 1 97 d
#2 2 14 3 c
#3 3 4 2 b
We can use data.table with coalesce from dplyr. Create a vector of column names that are common ('nm1') and difference ('nm2') in both datasets. Convert the first dataset to 'data.table' (setDT(d1)), join on the 'id' column, assign (:=) the coalesced columns of the first and second (with prefix i. - if there are common columns) to update the values in the first dataset
library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1),
mget(paste0("i.", nm1))), mget(nm2)), on = .(id)]
d1
# id x y z
#1: 1 1 97 d
#2: 2 14 3 c
#3: 3 4 2 b
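The common thread in these answers is a "coalesce": take the value from d1 where it is present, otherwise fall back to the matching row of d2. A minimal base R sketch of that idea follows; coalesce2() is a hypothetical two-vector helper written for illustration, not the dplyr function.

```r
d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97),
                 z = letters[1:4], stringsAsFactors = FALSE)

# coalesce2() is a hypothetical helper: a where it is not NA, otherwise b
coalesce2 <- function(a, b) ifelse(is.na(a), b, a)

# row of d2 that corresponds to each row of d1
rows <- match(d1$id, d2$id)

out <- d1
for (col in setdiff(names(d2), "id")) {
  from_d2 <- d2[[col]][rows]
  # shared columns are coalesced; columns only in d2 are simply added
  out[[col]] <- if (col %in% names(d1)) coalesce2(d1[[col]], from_d2) else from_d2
}
out
```

This assumes id is unique in d2; with duplicate ids, match() would silently use the first one.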

Concatenate rows and columns

I have a data set like this
x y z
a 5 4
b 1 2
And I want to concatenate the row labels with the column names:
ay 5
az 4
by 1
bz 2
Thanks
You can use melt and paste, but you will need to make your rownames a variable, i.e.
df$new <- rownames(df)
m_df <- reshape2::melt(df)
rownames(m_df) <- paste0(m_df$new, m_df$variable)
m_df <- m_df[-c(1:2)]
m_df
# value
#ax 5
#bx 1
#ay 4
#by 2
#az 3
#bz 1
After your edit, you don't need to convert rownames to a variable so just,
m1_df <- reshape2::melt(df)
m1_df$new <- paste0(m1_df$x, m1_df$variable)
m1_df
# x variable value new
#1 a y 5 ay
#2 b y 1 by
#3 a z 4 az
#4 b z 2 bz
You can then tidy your data frame to the required output.
With dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>%
gather(var, val, -x) %>%
mutate(var=paste0(x, var)) %>%
select(var, val)%>%
arrange(var)
# var val
#1 ay 5
#2 az 4
#3 by 1
#4 bz 2
library(reshape2)
library(dplyr)
library(tibble)
library(stringr)
# Create dataframe
x <- data.frame(x = c(5, 1),
y = c(4, 2),
z = c(3, 1),
row.names = c('a', 'b'))
# Convert rowname to column and melt
x <- tibble::rownames_to_column(x, "rownames") %>%
melt('rownames')
# assign concat columns as rownames
row.names(x) <- str_c(x$rownames, x$variable)
# Select relevant columns only
x <- select(x, value)
# Remove names from dataframe
names(x) <- NULL
> x
ax 5
bx 1
ay 4
by 2
az 3
bz 1
Here is another option in base R
stack(setNames(as.list(unlist(df1[-1])), outer(df1$x, names(df1)[-1], paste0)))[2:1]
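The one-liner above is dense, so here is the same idea unpacked step by step, as a sketch. It assumes df1 is the question's data with x as an ordinary column; the names value_cols, labels and long are introduced only for this illustration.

```r
df1 <- data.frame(x = c("a", "b"), y = c(5, 1), z = c(4, 2),
                  stringsAsFactors = FALSE)

# columns that hold values (everything except the label column x)
value_cols <- names(df1)[-1]

# every combination of row label and column name, as a 2 x 2 character matrix
labels <- outer(df1$x, value_cols, paste0)

# the values, flattened column by column (same order as as.vector(labels))
values <- unlist(df1[value_cols], use.names = FALSE)

long <- data.frame(name = as.vector(labels), value = values,
                   stringsAsFactors = FALSE)

# sort to get the order shown in the question
long <- long[order(long$name), ]
rownames(long) <- NULL
long
```

The key point is that outer() and unlist() both walk the data column by column, so labels and values stay aligned.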

forloop inside dplyr mutate

I would like to do a few column operations with mutate in a more elegant way, as my table has more than 200 columns that I would like to transform.
Here is an example.
Sample data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20),c2_y=rnorm(20),c3_z=rnorm(20),
c4_x=rnorm(20),c5_y=rnorm(20),c6_z=rnorm(20),
c7_x=rnorm(20),c8_y=rnorm(20),c9_z=rnorm(20),
c10_x=rnorm(20),c11_y=rnorm(20),c12_z=rnorm(20),
c_n=rnorm(20))
sample code:
dfm<-df %>%
mutate(cx=(c1_x*c4_x/c_n+c7_x*c10_x/c_n),
cy=(c2_y*c5_y/c_n+c8_y*c11_y/c_n),
cz=(c3_z*c6_z/c_n+c9_z*c12_z/c_n))
Despite the tangent, the initial recommendations to use tidyr functions are where you need to go. This pipe of functions seems to do the job based on what you've provided.
Your data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20), c2_y=rnorm(20), c3_z=rnorm(20),
c4_x=rnorm(20), c5_y=rnorm(20), c6_z=rnorm(20),
c7_x=rnorm(20), c8_y=rnorm(20), c9_z=rnorm(20),
c10_x=rnorm(20), c11_y=rnorm(20), c12_z=rnorm(20),
c_n=rnorm(20))
library(dplyr)
library(tidyr)
This first auxiliary data.frame is used to translate your c#_[xyz] variable into a unified one. I'm sure there are other ways to handle this, but it works and is relatively easy to reproduce and extend based on your 200+ columns.
variableTransform <- data_frame(
cnum = paste0("c", 1:12),
cvar = rep(paste0("a", 1:4), each = 3)
)
head(variableTransform)
# Source: local data frame [6 x 2]
# cnum cvar
# <chr> <chr>
# 1 c1 a1
# 2 c2 a1
# 3 c3 a1
# 4 c4 a2
# 5 c5 a2
# 6 c6 a2
Here's the pipe all at once. I'll explain the steps in a sec. What you're looking for is likely a combination of the treatment, xyz, and ans columns.
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
tidyr::spread(cvar, value) %>%
mutate(
ans = a1 * (a2/c_n) + a3 * (a4/c_n)
) %>%
head
# treatment c_n xyz a1 a2 a3 a4 ans
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419 1.15801448
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979 -0.01828831
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878 -2.03197283
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493 0.15759418
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839 0.65270681
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095 0.06136036
First, we take the original data and turn all (except two) columns into two columns of "column name" and "column values" pairs:
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
# treatment c_n cnum value
# 1 a 0.20745647 c1_x -0.1250222
# 2 b 0.01015871 c1_x -0.4585088
# 3 a 1.65671028 c1_x -0.2455927
# 4 b -0.24037137 c1_x 0.6219516
# 5 a -1.16092349 c1_x -0.3716138
# 6 b 1.61191700 c1_x 1.7605452
It will be helpful to split c1_x into c1 and x in order to translate the first and preserve the latter:
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
# treatment c_n cnum xyz value
# 1 a 0.20745647 c1 x -0.1250222
# 2 b 0.01015871 c1 x -0.4585088
# 3 a 1.65671028 c1 x -0.2455927
# 4 b -0.24037137 c1 x 0.6219516
# 5 a -1.16092349 c1 x -0.3716138
# 6 b 1.61191700 c1 x 1.7605452
From here, let's translate the c1, c2, and c3 variables into a1 (repeat for other 9 variables) using variableTransform:
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
# treatment c_n xyz value cvar
# 1 a 0.20745647 x -0.1250222 a1
# 2 b 0.01015871 x -0.4585088 a1
# 3 a 1.65671028 x -0.2455927 a1
# 4 b -0.24037137 x 0.6219516 a1
# 5 a -1.16092349 x -0.3716138 a1
# 6 b 1.61191700 x 1.7605452 a1
Since we want to deal with multiple variables simultaneously (with a simple mutate), we need to bring some of the variables back into columns. (Gathering first and then spreading helps me keep things organized and well named. I'm confident somebody can come up with another way to do it.)
tidyr::spread(cvar, value) %>% head
# treatment c_n xyz a1 a2 a3 a4
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095
From here, we just need to mutate to get the right answer.
Similar to r2evans's answer, but with more manipulation instead of the joins (and less explanation).
library(tidyr)
library(stringr)
library(dplyr)
# get it into fully long form
gather(df, key = cc_xyz, value = value, c1_x:c12_z) %>%
# separate off the xyz and the c123
separate(col = cc_xyz, into = c("cc", "xyz")) %>%
# extract the number
mutate(num = as.numeric(str_replace(cc, pattern = "c", replacement = "")),
# mod it by 4 for groupings and add a letter so it's a good col name
num_mod = paste0("v", (num %% 4) + 1)) %>%
# remove unwanted columns
select(-cc, -num) %>%
# go into a reasonable data width for calculation
spread(key = num_mod, value = value) %>%
# calculate
mutate(result = v1 + v2/c_n + v3 + v4 / c_n)
# treatment c_n xyz v1 v2 v3 v4 result
# 1 a -1.433858289 x 1.242153708 -0.985482158 -0.0240414692 1.98710285 0.51956295
# 2 a -1.433858289 y -0.019255516 0.074453615 -1.6081599298 1.18228939 -2.50389188
# 3 a -1.433858289 z -0.362785313 2.296744655 -0.0610463292 0.89797526 -2.65188998
# 4 a -0.911463819 x -1.088308527 -0.703388193 0.6308253909 0.22685013 0.06534405
# 5 a -0.911463819 y 1.284513516 1.410276163 0.5066869590 -2.07263912 2.51790289
# 6 a -0.911463819 z 0.957778345 -1.136532104 1.3959561507 -0.50021647 4.14947069
# ...
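For comparison, here is a base R sketch that loops over the suffixes and applies the formula from the question (c1 * c4 / c_n + c7 * c10 / c_n for x, and likewise for y and z) directly. It assumes every suffix has exactly four columns named c<number>_<suffix>, as in the sample data; the seed and the programmatic column construction are only there to make the example reproducible.

```r
set.seed(42)  # assumption: any seed, only for reproducibility

# rebuild the sample data: c1_x, c2_y, c3_z, ..., c12_z, plus c_n
df <- data.frame(treatment = rep(letters[1:2], 10))
suffix <- rep(c("x", "y", "z"), times = 4)
for (i in 1:12) df[[paste0("c", i, "_", suffix[i])]] <- rnorm(20)
df$c_n <- rnorm(20)

for (s in c("x", "y", "z")) {
  # the four columns that end in this suffix, e.g. c1_x, c4_x, c7_x, c10_x
  cols <- grep(paste0("^c[0-9]+_", s, "$"), names(df), value = TRUE)
  # sort them by their number so the pairing follows the question
  cols <- cols[order(as.integer(sub("^c([0-9]+)_.*$", "\\1", cols)))]
  # (first * second + third * fourth) / c_n, stored as cx, cy, cz
  df[[paste0("c", s)]] <- (df[[cols[1]]] * df[[cols[2]]] +
                           df[[cols[3]]] * df[[cols[4]]]) / df$c_n
}

head(df[c("treatment", "cx", "cy", "cz")])
```

With 200+ columns, only the vector of suffixes and the pairing rule inside the loop would need to change.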

Resources