generic data frame update function with tidyverse dplyr like sql-update - r

I am looking for a generic data frame update function, like SQL's UPDATE, that replaces values in the first data frame wherever its keys match keys in the second data frame. Is there a more generic way than the one in my example, perhaps one that also matches the value columns by name? Something like a generic dplyr::update(df1, df2, by = "key") function?
library(tidyverse)
# example data frame
df1 <- tibble(key = c(1,2,3,4,5,6,7,8,9),
              v1 = c(11,12,13,14,15,16,17,18,19),
              v2 = c(21,22,23,24,25,26,27,28,29),
              v3 = c(31,32,33,34,35,36,37,38,39),
              v4 = c(41,42,43,44,45,46,47,48,49))
df2 <- tibble(key = c(3,5,9),
              v2 = c(231,252,293),
              v4 = c(424,455,496))
# update df1 with values from df2 where key match
org_names <- names(df1)
df1 <- df1 %>%
  left_join(df2, by = "key") %>%
  # prefer the value from df2 (.y) when the key matched, else keep df1's (.x)
  mutate(v2 = ifelse(is.na(v2.y), v2.x, v2.y),
         v4 = ifelse(is.na(v4.y), v4.x, v4.y)) %>%
  select(all_of(org_names))
> df1
# A tibble: 9 x 5
    key    v1    v2    v3    v4
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    11    21    31    41
2     2    12    22    32    42
3     3    13   231    33   424
4     4    14    24    34    44
5     5    15   252    35   455
6     6    16    26    36    46
7     7    17    27    37    47
8     8    18    28    38    48
9     9    19   293    39   496

1) %<>% Magrittr has the compound assignment pipe:
library(magrittr)
df1 %>%
{ keys <- intersect(.$key, df2$key)
.[match(keys, .$key), names(df2)] %<>% { df2[match(keys, df2$key), ] }
.
}
which, for the problem under consideration, simplifies to this because all keys in df2 are in df1:
df1 %>% { .[match(df2$key, .$key), names(df2)] %<>% { df2 }; . }
2) <- The basic R assignment operator could also be used in much the same way and, in fact, the code is shorter than (1):
df1 %>%
{ keys <- intersect(.$key, df2$key)
.[match(keys, .$key), names(df2)] <- df2[match(keys, df2$key), ]
.
}
however, for the problem under consideration all keys in df2 are in df1 so it simplifies to:
df1 %>% { .[match(df2$key, .$key), names(df2)] <- df2; . }
3) mutate_cond Using mutate_cond defined in this SO post we can write the following.
df1 %>% mutate_cond(.$key %in% df2$key, v2 = df2$v2, v4 = df2$v4)
Note: The first two approaches work if the keys in df1 and df2 are each unique. The third additionally requires the keys be in the same order and every key in df2 be in df1. The problem in the question satisfies these.
Update: Have somewhat generalized the code in (1) and (2).
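For readers on dplyr 1.0.0 or later, rows_update() comes close to the dplyr::update(df1, df2, by = "key") asked for in the question. The following is a minimal sketch, assuming the same data as the question (rebuilt here so the block runs on its own); the name `updated` is only illustrative.

```r
library(dplyr)

# Same data as in the question, rebuilt so this block is self-contained
df1 <- tibble(
  key = as.numeric(1:9),
  v1  = as.numeric(11:19),
  v2  = as.numeric(21:29),
  v3  = as.numeric(31:39),
  v4  = as.numeric(41:49)
)
df2 <- tibble(
  key = c(3, 5, 9),
  v2  = c(231, 252, 293),
  v4  = c(424, 455, 496)
)

# Overwrite the columns present in df2 (v2, v4) for every key found in df2;
# columns that df2 does not have (v1, v3) are left untouched
updated <- rows_update(df1, df2, by = "key")
```

By default rows_update() raises an error if df2 contains a key that is missing from df1, and it requires the keys in df2 to be unique, so it is stricter than the join-based version in the question.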

Related

Using Filter inside a Function - Optimizing the function code

I often load data from which I need to filter a select number of variables, which are always the same. The column containing them, however, isn't always identical. Here's a sample that reflects my data:
df <- data.frame(id = c(1:50),
variable = sample(c("v1", "v2", "v3", "v4","v5","v6","v7"), 50, replace = T),
value = sample(c(80:150), 50, replace = T))
> df %>%
+ head()
id variable value
1 1 v3 138
2 2 v5 89
3 3 v2 98
4 4 v7 84
5 5 v1 144
6 6 v2 105
Using help from posts on Stack Overflow, I've come up with the following function; above it I've written the piece of code I'd usually write:
df %>%
  filter(variable %in% c('v1','v2','v3','v4'))
my_filter <- function(df,x){
  df %>%
    filter(x %in% c('v1','v2','v3','v4'))
}
Now if I used it:
df %>%
my_filter(df$variable)
> head(df %>%
+ my_filter(df$variable))
id variable value
1 3 v2 98
2 5 v1 144
3 6 v2 105
4 8 v2 128
5 10 v1 122
6 11 v1 131
This works fine, and will save me a small amount of time every time I repeat this task. I was wondering, though, if there is a way to avoid typing df$variable in the second argument of the function, and to obtain a solution using chaining only. Something that I could then use like this:
df %>%
my_filter(variable)
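One way to get that calling style is the embrace operator `{{ }}` (available when dplyr is used with rlang 0.4.0 or later), which forwards a bare column name into filter(). This is a sketch rather than the only option; the argument name `col` and the set.seed() call are assumptions added for illustration.

```r
library(dplyr)

set.seed(1)  # assumption: only here to make the sample reproducible
df <- data.frame(
  id = 1:50,
  variable = sample(c("v1", "v2", "v3", "v4", "v5", "v6", "v7"), 50, replace = TRUE),
  value = sample(80:150, 50, replace = TRUE)
)

# {{ col }} passes the unquoted column name through to filter()
my_filter <- function(df, col) {
  df %>%
    filter({{ col }} %in% c("v1", "v2", "v3", "v4"))
}

res <- df %>%
  my_filter(variable)
```

The same function then works unchanged when the values live in a differently named column, for example my_filter(other_df, some_other_column).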

How to match values of several variables to a variable in a look up table?

I have two datasets:
loc <- c("a","b","c","d","e")
id1 <- c(NA,9,3,4,5)
id2 <- c(2,3,7,5,6)
id3 <- c(2,NA,5,NA,7)
cost1 <- c(10,20,30,40,50)
cost2 <- c(50,20,30,30,50)
cost3 <- c(40,20,30,10,20)
dt <- data.frame(loc,id1,id2,id3,cost1,cost2,cost3)
id <- c(1,2,3,4,5,6,7)
rate <- c(0.9,0.8,0.7,0.6,0.5,0.4,0.3)
lookupd_tb <- data.frame(id,rate)
What I want to do is match the values of id1, id2 and id3 in dt against lookupd_tb and, where there is a match, multiply the rate for that id by its related cost.
This is my approach:
dt <- dt %>%
  left_join(lookupd_tb, by = c("id1" = "id")) %>%
  dplyr::mutate(cost1 = ifelse(!is.na(rate), cost1 * rate, cost1)) %>%
  dplyr::select(-rate)
What I am doing now works fine, but I have to repeat it three times, once for each variable, and I was wondering if there is a more efficient way to do this (probably using the apply family?).
I tried to join all three variables with id in my lookup table, but when rate is joined to my dt, all the costs (cost1, cost2 and cost3) get multiplied by the same rate, which I don't want.
I appreciate your help!
A base R approach is to loop over the 'id' columns with sapply/lapply: for each column, find the matching index in the 'id' column of 'lookupd_tb', use that index to pull the corresponding 'rate', replace the NA elements with 1, then multiply by the 'cost' columns and assign the result back to them.
nmid <- grep("id", names(dt))
nmcost <- grep("cost", names(dt))
dt[nmcost] <- dt[nmcost]*sapply(dt[nmid], function(x) {
x1 <- lookupd_tb$rate[match(x, lookupd_tb$id)]
replace(x1, is.na(x1), 1) })
Or, using tidyverse, we can loop through the paired sets of columns, i.e. 'id' and 'cost', with purrr::map2 and apply the same approach as above. The only difference is that here we create new columns instead of updating the 'cost' columns.
library(tidyverse)
dt %>%
select(nmid) %>%
map2_df(., dt %>%
select(nmcost), ~
.x %>%
match(., lookupd_tb$id) %>%
lookupd_tb$rate[.] %>%
replace(., is.na(.),1) * .y ) %>%
rename_all(~ paste0("costnew", seq_along(.))) %>%
bind_cols(dt, .)
In tidyverse you can also try an alternative approach by transforming the data from wide to long
library(tidyverse)
dt %>%
# data transformation to long
gather(k, v, -loc) %>%
mutate(ID=paste0("costnew", str_extract(k, "[:digit:]")),
k=str_remove(k, "[:digit:]")) %>%
spread(k, v) %>%
# left_join and calculations of new costs
left_join(lookupd_tb , by="id") %>%
mutate(cost_new=ifelse(is.na(rate), cost,rate*cost)) %>%
# clean up and expected output
select(loc, ID, cost_new) %>%
spread(ID, cost_new) %>%
left_join(dt,., by="loc") # or %>% bind_cols(dt, .)
loc id1 id2 id3 cost1 cost2 cost3 costnew1 costnew2 costnew3
1 a NA 2 2 10 50 40 10 40 32
2 b 9 3 NA 20 20 20 20 14 20
3 c 3 7 5 30 30 30 21 9 15
4 d 4 5 NA 40 30 10 24 15 10
5 e 5 6 7 50 50 20 25 20 6
The idea is to bring the data into a long format suitable for the left_join, using a gather and spread combination with the new index columns k and ID. After the calculation we transform back to the expected output with a second spread and join the result to dt.
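The same long-format idea can also be written with pivot_longer() and pivot_wider() from tidyr 1.0.0 or later. The sketch below assumes the data from the question (rebuilt so the block is self-contained); the helper column `n` and the names `new_costs` and `result` are made up for the example.

```r
library(dplyr)
library(tidyr)

dt <- data.frame(
  loc   = c("a", "b", "c", "d", "e"),
  id1   = c(NA, 9, 3, 4, 5),
  id2   = c(2, 3, 7, 5, 6),
  id3   = c(2, NA, 5, NA, 7),
  cost1 = c(10, 20, 30, 40, 50),
  cost2 = c(50, 20, 30, 30, 50),
  cost3 = c(40, 20, 30, 10, 20)
)
lookupd_tb <- data.frame(
  id   = c(1, 2, 3, 4, 5, 6, 7),
  rate = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3)
)

new_costs <- dt %>%
  # id1/cost1, id2/cost2, ... become rows with columns id, cost and a set number n
  pivot_longer(-loc, names_to = c(".value", "n"), names_pattern = "([a-z]+)([0-9]+)") %>%
  left_join(lookupd_tb, by = "id") %>%
  # ids without a match (or NA ids) keep their cost: treat the missing rate as 1
  mutate(cost_new = cost * coalesce(rate, 1)) %>%
  select(loc, n, cost_new) %>%
  pivot_wider(names_from = n, values_from = cost_new, names_prefix = "costnew")

result <- left_join(dt, new_costs, by = "loc")
```

Using ".value" in names_to keeps each id next to its cost in a single pass, so no separate spread step is needed before the join.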

Subtract data using id in 1 column

I have 3 data sets. All of them have a column called ID. I would like to list, for each ID, the matching rows from all 3 tables (I'm not sure I'm explaining this right). For example:
df1
ID age
1 34
2 33
5 34
7 35
43 32
76 33
df2
ID height
1 178
2 176
5 166
7 159
43 180
76 178
df3
ID class type
1 a 1
2 b 1
5 a 2
7 b 3
43 b 2
76 a 3
I would like to have an output which looks like this
ID = 1
df1 age
34
df2 height
178
df3 class type
a 1
ID = 2
df1 age
33
df2 height
176
df3 class type
b 1
I wrote a script
listing <- function(x) {
for(i in 1:n) {
data <- print(x[x$ID == 'i', ])
print(data)
}
return(data)
}
why am I not getting the output I wanted?
This is a hack. If you want/need to export to a word document, I strongly urge you to use something like R Markdown (for example in RStudio) with knitr (and, behind the scenes, pandoc). I'd encourage you to look at knitr::kable, for instance, as well as better looping structures for dealing with large numbers of datasets.
This hack can be improved considerably. But it gets you the output you want.
func <- function(...) {
  dfnames <- as.character(match.call()[-1])
  dfs <- setNames(list(...), dfnames)
  IDs <- unique(unlist(lapply(dfs, `[[`, "ID")))
  fmt <- paste("%", max(nchar(dfnames)), "s %s", sep = "")
  for (id in IDs) {
    cat(sprintf("ID = %d\n", id))
    for (nm in dfnames) {
      df <- dfs[[nm]][ dfs[[nm]]$ID == id, names(dfs[[nm]]) != "ID", drop = FALSE]
      cat(paste(sprintf(fmt, c(nm, ""),
                        capture.output(print(df, row.names = FALSE))),
                collapse = "\n"), "\n")
    }
  }
}
Execution. Though this is showing just two data.frames, you can provide an arbitrary number of data.frames (and in your preferred order) in the function arguments. It assumes you are providing them as direct variables and not subsetting within the function call ... you'll understand if you try it.
func(df1, df3)
# ID = 1
# df1 age
# 34
# df3 class type
# a 1
# ID = 2
# df1 age
# 33
# df3 class type
# b 1
# ID = 5
# df1 age
# 34
# df3 class type
# a 2
# ID = 7
# df1 age
# 35
# df3 class type
# b 3
# ID = 43
# df1 age
# 32
# df3 class type
# b 2
# ID = 76
# df1 age
# 33
# df3 class type
# a 3
(Personally, I can't imagine providing output in this format, but I don't know your tastes or use-case. There are many many other ways to show data like this. Like:
Reduce(function(x,y) merge(x, y, by = "ID"), list(df1, df2, df3))
# ID age height class type
# 1 1 34 178 a 1
# 2 2 33 176 b 1
# 3 5 34 166 a 2
# 4 7 35 159 b 3
# 5 43 32 180 b 2
# 6 76 33 178 a 3
It's much more concise. But, then again, I'm also assuming that you want to show them all at once instead of "show one, talk about it, then show another one, talk about it ...".)
Why not do a merge by ID?
df_1 <- merge(df1, df2, by = 'ID')
df_final <- merge(df_1, df3, by = 'ID')
or by using
library(dplyr)
full_join(df1, df2, by = "ID") %>%
  full_join(df3, by = "ID")
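As for why the loop in the question prints nothing useful: x$ID == 'i' compares ID with the literal string "i" rather than with the loop variable, n is never defined inside the function, and data <- print(...) followed by print(data) prints each piece twice. A minimal corrected sketch, assuming the df1 from the question and returning the pieces as a list instead of printing them:

```r
df1 <- data.frame(
  ID  = c(1, 2, 5, 7, 43, 76),
  age = c(34, 33, 34, 35, 32, 33)
)

listing <- function(x) {
  out <- list()
  # loop over the IDs that actually occur instead of 1:n
  for (id in unique(x$ID)) {
    # compare against the loop variable itself, not the string 'i'
    out[[as.character(id)]] <- x[x$ID == id, , drop = FALSE]
  }
  out
}

res <- listing(df1)
```

Base R's split(df1, df1$ID) builds a similar list in a single call.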

Compare two columns of factors from two data frames and get the entries that exist in both data frames in R

I have two data frames:
df1
vehicle speed time
a 23 234
b 34 421
d 45 290
df2
vehicle speed time
a 29 215
b 54 450
c 45 21
f 40 367
Both vehicle columns are factors. I want to find the common vehicles and add the corresponding df2$time to df1, name it as time.2.
The output I want:
df1
vehicle speed time time.2
a 23 234 215
b 34 421 450
I tried:
df1 <- df1[df1$vehicle %in% df2$vehicle, ]
df2 <- df2[df2$vehicle %in% df1$vehicle, ]
df1 <- cbind(df1, time.2 = df2$time)
But after the first two commands, both df1 and df2 have 0 rows. When I previously compared vehicle in df1 against another data frame, it worked. I don't know why it doesn't work with df2.
Thanks!
Try:
library(dplyr)
inner_join(df1,
           df2 %>%
             select(-speed) %>%
             rename(time.2 = time) )
Use the merge() function:
df1$vehicle <- as.character(df1$vehicle)
df2$vehicle <- as.character(df2$vehicle)
df <- merge(df1, df2, by = "vehicle")
# the key column keeps its name; only the clashing columns get .x/.y suffixes
df <- df[, c("vehicle", "speed.x", "time.x", "time.y")]
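To land directly on the column names asked for in the question, merge() can be given only the needed columns of df2 together with custom suffixes. A small sketch, assuming the data from the question with vehicle stored as a factor; the name `out` is illustrative.

```r
df1 <- data.frame(
  vehicle = factor(c("a", "b", "d")),
  speed   = c(23, 34, 45),
  time    = c(234, 421, 290)
)
df2 <- data.frame(
  vehicle = factor(c("a", "b", "c", "f")),
  speed   = c(29, 54, 45, 40),
  time    = c(215, 450, 21, 367)
)

# keep only vehicle and time from df2; the empty first suffix leaves
# df1's time column unchanged and df2's becomes time.2
out <- merge(df1, df2[, c("vehicle", "time")],
             by = "vehicle", suffixes = c("", ".2"))
```

merge() keeps only the vehicles present in both data frames by default, which is the inner join the question describes.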

Pivot rows into a single column and index them using column names in R

I need to flip row values into a single column and create an index based on the column name and row number. I checked a lot of pivot solutions in R but none seem to simply flip things around without creating means, sums, etc. Help would be appreciated.
df1 <- read.table(textConnection("a1,a2,a3
23,34,4
34,44,98"), sep=",", header=TRUE)
df2 <- read.table(textConnection("id,val
1_a1,23
2_a2,34
3_a3,4
4_a1,34
5_a2,44
6_a3,98"), sep=",", header=TRUE)
I need to go from a data frame looking like this:
a1 a2 a3
1 23 34 4
2 34 44 98
To this:
id val
1 1_a1 23
2 2_a2 34
3 3_a3 4
4 4_a1 34
5 5_a2 44
6 6_a3 98
Many thanks!!
This can easily be done with gather from the tidyr package:
library(tidyr)
df2 <- gather(df1, id, val)
Note that this requires the latest development version of tidyr, after this commit; you can install it with devtools::install_github("hadley/tidyr"). Otherwise, you can change the line to gather(df1, id, val, a1:a3).
To add the 1_, 2_, etc. prefix, you can do:
df2$id <- paste(seq_len(nrow(df2)), df2$id, sep = "_")
If you use the dplyr package as well, you could do this as:
library(dplyr)
library(tidyr)
df2 <- df1 %>% gather(id, val) %>% mutate(id = paste(seq_len(n()), id, sep = "_"))
You could try
m1 <- t(df1)
d1 <- data.frame(id = paste(seq_along(m1),
                            rownames(m1)[row(m1)], sep = "_"), val = c(m1))
d1
# id val
#1 1_a1 23
#2 2_a2 34
#3 3_a3 4
#4 4_a1 34
#5 5_a2 44
#6 6_a3 98
require(dplyr) # for mutate()
require(tidyr) # for gather()
d <- data.frame(
  a1 = c(23, 34),
  a2 = c(34, 44),
  a3 = c(4, 98)
)
# use the id column itself, not the string "id", when building the label
gather(d, id, val, a1:a3) %>%
  mutate(id = paste(row_number(), id, sep = "_"))
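Note that gather() stacks the values column by column (both a1 values, then both a2 values, and so on), whereas the output requested in the question runs row by row. With tidyr 1.0.0 or later, pivot_longer() walks the rows in order, so a sketch along the following lines should reproduce the requested ordering; the name `out` is illustrative.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  a1 = c(23, 34),
  a2 = c(34, 44),
  a3 = c(4, 98)
)

out <- df1 %>%
  # one output row per cell, emitted row by row: a1, a2, a3, a1, a2, a3
  pivot_longer(everything(), names_to = "id", values_to = "val") %>%
  # prefix each column name with its running position
  mutate(id = paste(row_number(), id, sep = "_"))
```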
