I have two incomplete data frames (df_a, df_b): columns are missing or contain NA values. "by" is the merge index, and df_a has "priority" over df_b.
df_a = structure(list(Datum = structure(c(1635163200, 1635166800, 1635170400, 1635174000), class = c("POSIXct", "POSIXt")), Vorhersage = c(10.297922, 10.155121, 10.044135, 9.699513), Export = c("10.912", "10.47", NA, NA), color = c("rgb(0,128,0)", "rgb(0,128,0)", NA, NA), Status = c("ok", "ok", NA, NA), Plausibilität = c("4", "4", NA, NA), min = c(7.93000000000001, 9.4, 8.7, 8.3), max = c(12.31715325, 12.42822725, 12.51326325, 12.28620625)), row.names = c(NA, -4L), class = "data.frame")
df_b = structure(list(Datum = structure(c(1632510000, 1632513600, 1632517200, 1632520800), class = c("POSIXct", "POSIXt")), Vorhersage = c(14.821988, 14.832919, 14.706179, 14.573527), Referenz = c(16.6, 16.2, 15.9, 16), DWD_Name = c("Elpersbüttel", "Elpersbüttel", "Elpersbüttel", "Elpersbüttel"), Export = c(17.198, 16.713, 16.378, 16.358), color = c("rgb(0,128,0)", "rgb(0,128,0)", "rgb(0,128,0)", "rgb(0,128,0)"), Status = c("ok", "ok", "ok", "ok"), Plausibilität = c(4, 4, 4, 4), min = c(13.05, 12.808, 11.631891, 12.312), max = c(17, 17, 16.9, 16.7)), row.names = c(NA, -4L), class = "data.frame")
desired output is:
Datum Vorhersage Export color Status Plausibilität min max Referenz
1 2021-10-25 14:00:00 10.3 10.912 rgb(0,128,0) ok 4 7.9 12 NA
2 2021-10-25 15:00:00 10.2 10.47 rgb(0,128,0) ok 4 9.4 12 NA
3 2021-10-25 16:00:00 10.0 <NA> <NA> <NA> <NA> 8.7 13 NA
4 2021-10-25 17:00:00 9.7 <NA> <NA> <NA> <NA> 8.3 12 NA
5 2021-09-24 21:00:00 14.8 17.198 rgb(0,128,0) ok 4 13.1 17 17
6 2021-09-24 22:00:00 14.8 16.713 rgb(0,128,0) ok 4 12.8 17 16
7 2021-09-24 23:00:00 14.7 16.378 rgb(0,128,0) ok 4 11.6 17 16
8 2021-09-25 00:00:00 14.6 16.358 rgb(0,128,0) ok 4 12.3 17 16
DWD_Name
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 Elpersbüttel
6 Elpersbüttel
7 Elpersbüttel
8 Elpersbüttel
# for rebuild:
structure(list(Datum = structure(c(1635163200, 1635166800, 1635170400,
1635174000, 1632510000, 1632513600, 1632517200, 1632520800), class = c("POSIXct",
"POSIXt")), Vorhersage = c(10.297922, 10.155121, 10.044135, 9.699513,
14.821988, 14.832919, 14.706179, 14.573527), Export = c("10.912",
"10.47", NA, NA, "17.198", "16.713", "16.378", "16.358"), color = c("rgb(0,128,0)",
"rgb(0,128,0)", NA, NA, "rgb(0,128,0)", "rgb(0,128,0)", "rgb(0,128,0)",
"rgb(0,128,0)"), Status = c("ok", "ok", NA, NA, "ok", "ok", "ok",
"ok"), Plausibilität = c("4", "4", NA, NA, "4", "4", "4", "4"
), min = c(7.93000000000001, 9.4, 8.7, 8.3, 13.05, 12.808, 11.631891,
12.312), max = c(12.31715325, 12.42822725, 12.51326325, 12.28620625,
17, 17, 16.9, 16.7), Referenz = c(NA, NA, NA, NA, 16.6, 16.2,
15.9, 16), DWD_Name = c(NA, NA, NA, NA, "Elpersbüttel", "Elpersbüttel",
"Elpersbüttel", "Elpersbüttel")), row.names = c(NA, -8L), class = "data.frame")
Thanks to the help of @r2evans I tried the following:
by = "Datum"
library(data.table)
colnms <- setdiff(intersect(names(df_a), names(df_b)), by)
setDT(df_a)
setDT(df_b)
merge(df_a, df_b, by = by, all = TRUE
)[, (colnms) := lapply(colnms, function(nm) fcoalesce(.SD[[paste0(nm, ".x")]], .SD[[paste0(nm, ".y")]]))
][, c(outer(colnms, c(".x", ".y"), paste0)) := NULL ][]
but I get the following error:
Error in fcoalesce(.SD[[paste0(nm, ".x")]], .SD[[paste0(nm, ".y")]]) :
Item 2 is type double but the first item is type character. Please coerce
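The error comes from mixed column types: Export and Plausibilität are character in df_a but numeric in df_b, and fcoalesce() refuses to mix types. A minimal sketch of a workaround, assuming df_a's character representation should win (it does in the desired output), is to align df_b's types before merging:
# align the two mismatched columns to df_a's (character) types
df_b$Export <- as.character(df_b$Export)
df_b$Plausibilität <- as.character(df_b$Plausibilität)
After this coercion, the merge/fcoalesce attempt above runs through.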
Most of the other answers are good, but many either over-complicate the result (in my opinion) or perform a left or right join rather than the full join expected in the OP.
Here's a quick solution that uses dynamic column names.
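(The demo below runs on small toy frames that the answer does not show; a hypothetical reconstruction that is consistent with the printed output, and with the other answers below, would be:)
df_a <- data.frame(by = c(1, 2, 3), a = c(1, NA, 3), b = c(1, NA, 3))
df_b <- data.frame(by = c(2, 3, 4), a = c(2, 3, 4), c = c(2, 3, 4))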
library(data.table)
colnms <- setdiff(intersect(names(df_a), names(df_b)), "by")
colnms
# [1] "a"
setDT(df_a)
setDT(df_b)
merge(df_a, df_b, by = "by", all = TRUE
)[, (colnms) := lapply(colnms, function(nm) fcoalesce(.SD[[paste0(nm, ".x")]], .SD[[paste0(nm, ".y")]]))
][, c(outer(colnms, c(".x", ".y"), paste0)) := NULL ][]
# by b c a
# <num> <num> <num> <num>
# 1: 1 1 NA 1
# 2: 2 NA 2 2
# 3: 3 3 3 3
# 4: 4 NA 4 4
Notes:
the native data.table x[y] join keeps all the rows of only one table, so we need data.table's merge() in order to get a full join with all=TRUE;
because it's using merge, the repeated columns get the .x and .y suffixes, something we can easily capitalize on;
the canonical and most performant way when using (colnms) := ... is to also include .SDcols=colnms, but that won't work as well here since we need the suffixed columns, not the colnms columns themselves; this is a slight performance penalty but certainly not an anti-pattern (I believe) given what we need to do. And since we could have more than one duplicated column, we have to be careful to handle each .x/.y pair one at a time, not all of them at once;
the last [-block (using outer) removes the duplicated columns; without it, the output would have the column names c("by", "a.x", "b", "a.y", "c", "a"). It uses outer because that is a straightforward way to take one or more colnms and append .x and .y to each of them; it then uses data.table's := NULL shortcut for removing one or more columns.
This isn't the most elegant, but you can make a function that applies your rule to coalesce the values if they occur in both data frames.
# find the unique column names (not called "by")
cols <- union(names(df_a),names(df_b))
cols <- cols[!(cols == "by")]
# merge the data sets
df_merge <- merge(df_a, df_b, by = "by", all = TRUE)
# function to check for the base column names that now have a '.x' and
# a '.y' version. for the columns, fill in the NAs from '.x' with the
# value from '.y'
col_val <- function(col_base, df) {
x <- names(df)
if (all(paste0(col_base, c(".x", ".y")) %in% x)) {
na.x <- is.na(df[[paste0(col_base, ".x")]])
df[[paste0(col_base, ".x")]][na.x] <- df[[paste0(col_base, ".y")]][na.x]
df[[paste0(col_base, ".x")]]
} else {
df[[col_base]]
}
}
# apply this function to every column
cbind(df_merge["by"], sapply(cols, col_val, df = df_merge))
This will give the following result.
by a b c
1 1 1 1 NA
2 2 2 NA 2
3 3 3 3 3
4 4 4 NA 4
I know you specified base R, but the natural_join() function is worth mentioning.
library(rqdatatable)
natural_join(df_a, df_b, by = "by", jointype = "FULL")
This gives exactly what you want.
by a b c
1 1 1 1 NA
2 2 2 NA 2
3 3 3 3 3
4 4 4 NA 4
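Applied to the frames from the question, the call would look like the sketch below; natural_join() coalesces the shared columns with the left table taking precedence, though the mixed character/numeric columns may still need aligning first, as noted above.
natural_join(df_a, df_b, by = "Datum", jointype = "FULL")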
Not an answer in base R, but here is one possible solution with the data.table package.
library(data.table)
setDT(df_a)
setDT(df_b)
df_a <- rbind(df_a, list(4, NA, NA))
df_b <- rbind(list(1, NA, NA), df_b)
df_a[df_b, `:=` (a = fifelse(is.na(a), i.a, a), c = c), on = .(by)][]
#> by a b c
#> 1: 1 1 1 NA
#> 2: 2 2 NA 2
#> 3: 3 3 3 3
#> 4: 4 4 NA 4
Edit: with the help of @r2evans, a much more elegant and efficient solution:
df_a[df_b, `:=` (a = fcoalesce(a, i.a), c = c), on = .(by)][]
#> by a b c
#> 1: 1 1 1 NA
#> 2: 2 2 NA 2
#> 3: 3 3 3 3
#> 4: 4 4 NA 4
Created on 2021-10-19 by the reprex package (v2.0.1)
Here is a dynamic solution. Not bad, but maybe someone knows how to speed it up.
get_complete_df<-function(df_a,df_b, by = "by"){
df_a = unique(df_a)
df_b = unique(df_b)
nam_a = names(df_a)[!(names(df_a) == by)]
nam_b = names(df_b)[!(names(df_b) == by)]
nums_a = unlist(lapply(df_a, is.numeric))
nums_b = unlist(lapply(df_b, is.numeric))
nums = unique(c(names(df_a)[nums_a], names(df_b)[nums_b])) # c() is needed: unique(x, y) would treat y as 'incomparables'
## try to supplement NAs
x = df_b[[by]][df_b[[by]] %in% df_a[[by]]]
y = nam_b[nam_b %in% nam_a]
vna = is.na(df_a[df_a[,1] %in% x,y])
df_a[df_a[,1] %in% x ,y][vna] = df_b[df_b[,1] %in% x,y][vna]
## get complete df
all_names = c(nam_a,nam_b )
all_names = c(by, unique(all_names))
all_by = na.omit(unique(c(df_a[[by]],df_b[[by]]) ))
## build
df_o = as.data.frame(matrix(nrow = length(all_by),ncol = length(all_names)))
names(df_o) = all_names
df_o[[by]] = all_by
## fill in content
df_o[df_o[,1] %in% df_b[,1],names(df_b)] = df_b
df_o[df_o[,1] %in% df_a[,1],names(df_a)] = df_a ## df_a has priority!
# fix numeric:
# why did some(!) numeric fields change to chr?
df_o[,nums] = as.data.frame(apply(df_o[,nums], 2, as.numeric))
df_o
}
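A hypothetical call on the question's data (note the function assumes the key column comes first, since it indexes df_a[,1]):
res <- get_complete_df(df_a, df_b, by = "Datum")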
I need to group the data by ID and then replace the missing values of price by iterating over date values up and down: first look one date value up and down; if there is no data, go two date values up and down, and so on, until there is a mean value for all rows.
Input data :
df1 <- data.frame(id = c(11,11,11,11,11,11,11,11,555,555,555,555,555,555,555,555,555),
Date = c("1-Jun", "18-Jun", "3-Jul", "4-Jul", "25-Jul", "3-Nov", "7-Nov", "28_Nov",
"1-Jun", "18-Jun", "3-Jul", "4-Jul", "25-Jul", "3-Nov", "7-Nov", "28_Nov",
"30-Nov"),
price = c(NA, NA, 100, NA, 25, NA, 50, NA, 400, NA, NA, NA, NA, NA, NA, NA, 200)
)
Updated requirement:
Input data :
df1 <- data.frame(id = c(11,11,11,11,11,11,11,11),
Date = c("1-Jun", "5-Jun", "8-Jun", "9-Jun", "14-Jun", "16-Jun", "20-Jun", "21-Jun"),
price = c(NA, NA,100, NA, 50, NA, 200, NA)
)
I need to impute all missing dates between the available dates for each id and then go symmetrically up and down to impute the missing prices. Also, I do not always need the average between two values; e.g., when I go 2 dates up and down and see only 1 value, I would impute that value.
Please find below one possible solution (with a reprex) using the data.table library.
I built a function to make it easier to use.
Reprex
Code of the NA_imputations() function
library(data.table)
NA_imputations <- function(x) {
x[, rows := .I]
z <- x[, .I[!is.na(price)]]
id_1 <- z[-length(z)]
id_2 <- z[-1]
values <- x[z, .(price = price, id = id)]
values_1 <- values[-nrow(values)]
names(values_1) <- c("price_1", "id_o1")
values_2 <- values[-1]
names(values_2) <- c("price_2", "id_o2")
subtract <- z[-1] - z[-length(z)]
r <- data.table(id_1, values_1, id_2, values_2, subtract)
Results <- r[, `:=` (id_mean = fifelse(subtract > 2 & subtract %% 2 == 0, id_1+(subtract/2), (id_1+id_2)/2),
mean = fifelse(subtract >= 2 & subtract %% 2 == 0 & id_o1 == id_o2, (price_1+price_2)/2, NA_real_))
][, `:=` (price_1 = NULL, id_o1 = NULL, id_2 = NULL, price_2 = NULL, id_o2 = NULL)
][x, on = .(id_mean = rows)
][, price := fcoalesce(price, mean)
][, mean := NULL
][r[subtract > 2 & subtract %% 2 == 0,id_1]:r[subtract > 2 & subtract %% 2 == 0,id_mean-1], price := lapply(price, nafill, type = "nocb"), by = .(id)
][, price := nafill(price, type = "nocb"), by = .(id)
][, price := nafill(price, type = "locf")
][, `:=` (id_1 = NULL, id_mean = NULL, subtract = NULL)][]
return(Results)
}
Output of the NA_imputations() function
NA_imputations(df1)
#> id Date price
#> <num> <char> <num>
#> 1: 11 1-Jun 100.0
#> 2: 11 18-Jun 100.0
#> 3: 11 3-Jul 100.0
#> 4: 11 4-Jul 62.5
#> 5: 11 25-Jul 25.0
#> 6: 11 3-Nov 37.5
#> 7: 11 7-Nov 50.0
#> 8: 11 28_Nov 50.0
#> 9: 555 1-Jun 400.0
#> 10: 555 18-Jun 400.0
#> 11: 555 3-Jul 400.0
#> 12: 555 4-Jul 400.0
#> 13: 555 25-Jul 300.0
#> 14: 555 3-Nov 200.0
#> 15: 555 7-Nov 200.0
#> 16: 555 28_Nov 200.0
#> 17: 555 30-Nov 200.0
Created on 2021-12-05 by the reprex package (v2.0.1)
I am trying to modify my dataset with a for loop. I want to modify certain cells of some columns depending on the value of its "paired" column. My dataset could be:
data1989 <- data.frame("date" = c("1987-01-01", "1987-01-03", "1987-01-19"),
"NDVI_1" = c(NA, 0.589, 0.120),
"NDVI_3" = c(NA, 0.447, NA),
"NDVI_4" = c(NA, NA, NA),
"pixelQA_1" = c(NA, 66.897,90.599),
"pixelQA_3" = c(NA, 66.097,NA),
"pixelQA_4" = c(NA, NA, NA),
stringsAsFactors = FALSE)
> data1989
date NDVI_1 NDVI_3 NDVI_4 pixelQA_1 pixelQA_3 pixelQA_4
1 1987-01-01 NA NA NA NA NA NA
2 1987-01-03 0.589 0.447 NA 66.897 66.097 NA
3 1987-01-19 0.120 NA NA 90.599 NA NA
Columns are "paired" by the suffix of each column, so NDVI_1 is paired with pixelQA_1, and so on. I want to modify the values under NDVI's columns depending on it's "paired" values on pixelQA column, following:
if pixelQA is NA -> then NDVI should also be NA.
if pixelQA is 66±0.5 OR 130±0.5 -> then NDVI keeps its value.
if pixelQA is anything other than 66±0.5 OR 130±0.5 -> then NDVI is set to NA (this is bad-quality data which needs to be ignored).
Applying these very simple rules my data should look like:
data1989clean <- data.frame("date" = c("1987-01-01", "1987-01-03", "1987-01-19"),
"NDVI_1" = c(NA, NA, NA),
"NDVI_3" = c(NA, 0.447, NA),
"NDVI_4" = c(NA, NA, NA),
"pixelQA_1" = c(NA, 66.897,90.599),
"pixelQA_3" = c(NA, 66.097,NA),
"pixelQA_4" = c(NA, NA, NA),
stringsAsFactors = FALSE)
> data1989clean
date NDVI_1 NDVI_3 NDVI_4 pixelQA_1 pixelQA_3 pixelQA_4
1 1987-01-01 NA NA NA NA NA NA
2 1987-01-03 NA 0.447 NA 66.897 66.097 NA
3 1987-01-19 NA NA NA 90.599 NA NA
To reach my goal I am trying the following for loop:
for(i in 1:4){
data1989$NDVI_[i] <- ifelse(data1989$pixelQA_[i] < 66.5 & data1989$pixelQA_[i] > 65.5 |
data1989$pixelQA_[i] < 130.5 & data1989$pixelQA_[i] > 129.5,
data1989$NDVI_[i], NA)
}
But so far it is not working: the output dataset looks exactly the same as the original one. Any suggestions are welcome.
As suggested by @George Savva, you can achieve this by pivoting longer, correcting the data, and pivoting back wider. So, using the tidyverse, that gives:
library(tidyverse)
newdd1 <-
#
data1989 %>%
#
pivot_longer(cols = -date,
names_to = c(".value", "set"),
names_sep = "_") %>%
#
mutate(NDVI = case_when(is.na(pixelQA) ~ NA_real_,
between(pixelQA, 65.5, 66.5) ~ NDVI,
between(pixelQA, 129.5, 130.5) ~ NDVI,
TRUE ~ NA_real_)) %>%
#
pivot_wider(names_from = set,
values_from = c(NDVI, pixelQA))
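For completeness, here is a minimal base R sketch that fixes the paste-style indexing missing from the question's loop (assuming the column suffixes are 1, 3 and 4):
for (s in c("1", "3", "4")) {
  qa <- data1989[[paste0("pixelQA_", s)]]
  # keep NDVI only where the paired pixelQA is within 66±0.5 or 130±0.5
  keep <- !is.na(qa) & (abs(qa - 66) <= 0.5 | abs(qa - 130) <= 0.5)
  data1989[[paste0("NDVI_", s)]][!keep] <- NA
}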
I'm an R beginner and this is my first post here. I'm struggling with a problem and would love your advice. Basically, I have a dataset with 3 sets of columns that I need to handle together to obtain the desired outcome: an average of the 2 most recent observations that are of high quality, where those observations must occur after a cutoff date (say, 3/15/2018). What makes it complex is that the relevant columns that go into the average differ for each case.
The first set of data columns has to do with the number of observations each case has, so subject one has 2 observations, subject two has 3, etc.
The second set of columns describes the data quality of each observation. For example, subject 1 has two good observations, whereas subject 2 has bad data quality for the first observation and good data quality for the two latter ones, and subject 3 has 3 observations that are of good quality and one observation (obs_3) that is of bad data quality.
The third set of columns specify the dates of the observations.
subject_id obs_1 obs_2 obs_3 obs_4 obs_1_dq obs_2_dq obs_3_dq obs_4_dq obs_1_date obs_2_date obs_3_date obs_4_date desired.average
1 1 5 6 NA NA TRUE TRUE NA NA 2018-02-01 2018-03-16 <NA> <NA> NA
2 2 6 8 11 NA FALSE TRUE TRUE NA 2018-02-18 2018-03-16 2018-04-10 <NA> 9.5
3 3 7 9 12 15 TRUE TRUE FALSE TRUE 2018-02-15 2018-03-18 2018-04-02 2018-04-10 12.0
4 4 3 4 8 15 TRUE TRUE TRUE TRUE 2018-02-16 2018-03-08 2018-03-10 2018-03-15 NA
In order to compute an average of the TWO latest observations that are of good data quality:
I must first decide which observations are of good quality,
Then, compute an average (and it has to be an average of exactly 2 observations) of the two most recent observations that occur after 3/15.
Below is my sample dataset. I've tried to do this manually in Excel and it was really painstaking. I'm hoping to do this in R and would very much appreciate your feedback. Thank you!
Here is my sample dataset:
> dput(head(df,5))
structure(list(subject_id = c(1, 2, 3, 4), obs_1 = c(5, 6, 7,
3), obs_2 = c(6, 8, 9, 4), obs_3 = c(NA, 11, 12, 8), obs_4 = c(NA,
NA, 15, 15), obs_1_dq = c(TRUE, FALSE, TRUE, TRUE), obs_2_dq = c(TRUE,
TRUE, TRUE, TRUE), obs_3_dq = c(NA, TRUE, FALSE, TRUE), obs_4_dq =
c(NA,
NA, TRUE, TRUE), obs_1_date = structure(c(17563, 17580, 17577,
17578), class = "Date"), obs_2_date = structure(c(17606, 17606,
17608, 17598), class = "Date"), obs_3_date = structure(c(NA,
17631, 17623, 17600), class = "Date"), obs_4_date = structure(c(NA,
NA, 17631, 17605), class = "Date"), desired.average = c(NA, 9.5,
12, NA)), .Names = c("subject_id", "obs_1", "obs_2", "obs_3",
"obs_4", "obs_1_dq", "obs_2_dq", "obs_3_dq", "obs_4_dq", "obs_1_date",
"obs_2_date", "obs_3_date", "obs_4_date", "desired.average"), row.names
= c(NA,
4L), class = "data.frame")
This should also work, and though a bit verbose it doesn't rely on column indices, so should be robust:
library(dplyr)
library(tidyr)
num_date <- as.numeric(as.Date("2018-03-15"))
df <- df[,-ncol(df)]
df_join <- df %>%
gather(Obs, value, 2:ncol(df)) %>%
mutate(
nr = as.numeric(gsub("[^\\d]", "", Obs, perl = TRUE))
) %>%
group_by(subject_id, nr) %>%
filter(!(is.na(value) | (grepl("_dq", Obs) & value == 0) | any(value[grepl("_date", Obs)] <= num_date))) %>%
ungroup() %>%
group_by(subject_id, Obs) %>%
filter(!row_number() < (max(row_number() - 1))) %>%
ungroup() %>%
group_by(subject_id) %>%
mutate(
desired.average = mean(value[grepl("_date|_dq", Obs) == FALSE], na.rm = TRUE)
) %>%
filter(!max(row_number()) == 3) %>%
distinct(subject_id, desired.average)
df <- left_join(df, df_join)
Result:
subject_id obs_1 obs_2 obs_3 obs_4 obs_1_dq obs_2_dq obs_3_dq obs_4_dq obs_1_date obs_2_date
1 1 5 6 NA NA TRUE TRUE NA NA 2018-02-01 2018-03-16
2 2 6 8 11 NA FALSE TRUE TRUE NA 2018-02-18 2018-03-16
3 3 7 9 12 15 TRUE TRUE FALSE TRUE 2018-02-15 2018-03-18
4 4 3 4 8 15 TRUE TRUE TRUE TRUE 2018-02-16 2018-03-08
obs_3_date obs_4_date desired.average
1 <NA> <NA> NA
2 2018-04-10 <NA> 9.5
3 2018-04-02 2018-04-10 12.0
4 2018-03-10 2018-03-15 NA
See if this works for you. Code is annotated briefly.
df=structure(list(subject_id = c(1, 2, 3, 4), obs_1 = c(5, 6, 7,
3), obs_2 = c(6, 8, 9, 4), obs_3 = c(NA, 11, 12, 8), obs_4 = c(NA,
NA, 15, 15), obs_1_dq = c(TRUE, FALSE, TRUE, TRUE), obs_2_dq = c(TRUE,
TRUE, TRUE, TRUE), obs_3_dq = c(NA, TRUE, FALSE, TRUE), obs_4_dq =
c(NA, NA, TRUE, TRUE), obs_1_date = structure(c(17563, 17580, 17577,
17578), class = "Date"), obs_2_date = structure(c(17606, 17606,
17608, 17598), class = "Date"), obs_3_date = structure(c(NA,
17631, 17623, 17600), class = "Date"), obs_4_date = structure(c(NA,
NA, 17631, 17605), class = "Date"), desired.average = c(NA, 9.5,
12, NA)), .Names = c("subject_id", "obs_1", "obs_2", "obs_3",
"obs_4", "obs_1_dq", "obs_2_dq", "obs_3_dq", "obs_4_dq", "obs_1_date",
"obs_2_date", "obs_3_date", "obs_4_date", "desired.average"), row.names
= c(NA, 4L), class = "data.frame")
# separate each section
obs=df[,2:5]
dq=df[, 6:9]
dt=sapply(df[, 10:13], as.numeric) # for easier calculations
# remove bad quality
obs[dq==F]=NA
# remove dates before 2018-3-15
obs[dt - as.numeric(as.Date("2018-03-15")) <= 0] = NA
# only leave two most recent dates
dt[is.na(obs)]=NA
dt=t(apply(dt,1,function(x){x[x<max(x[x!=max(x, na.rm=T)],na.rm=T)]=NA;x}))
obs[is.na(dt)]=NA
# average
df$avg=apply(obs,1,function(x)ifelse(sum(!is.na(x))>=2, mean(x,na.rm=T), NA))
df
Edits:
Explanation
dt=t(apply(dt,1, function(x){x[x<max(x[x!=max(x, na.rm=T)],na.rm=T)]=NA;x}))
I think x[x<max(x[x!=max(x, na.rm=T)],na.rm=T)]=NA might be a little confusing. na.rm=T means NA values are removed. max(x[x!=max(x)]) is the second-largest number, so x[x < 2nd_largest_num]=NA removes every number except the largest and the second largest. This function is then applied to the data frame row-wise. The final result is that dt contains only the two largest numbers in each row (the most recent dates in numeric form). All "discarded" values (NA in dt) are then removed from obs in the next line, obs[is.na(dt)]=NA. After all this, obs only contains the two most recent values in each row.
I am using src_postgres to connect and the dplyr::tbl function to fetch data from a Redshift database. I have applied some filters and a top function to it using dplyr itself. Now my data looks like this:
riid day hour
<dbl> <chr> <chr>
1 5542. "THURSDAY " 12
2 5862. "FRIDAY " 15
3 5982. "TUESDAY " 15
4 6022. WEDNESDAY 16
My final output should be as below:
riid  MON  TUES  WED  THUR  FRI  SAT  SUN
5542                  12
5862                        15
5982       15
6022             16
I have tried spread. It throws the below error because of the class type:
Error in UseMethod("spread_") : no applicable method for 'spread_'
applied to an object of class "c('tbl_dbi', 'tbl_sql', 'tbl_lazy',
'tbl')"
Since this is a really big table, I do not want to collect it into a local data frame, as that takes a long time.
I was able to do it as below:
df_mon <- df2 %>% filter(day == 'MONDAY') %>% mutate(MONDAY = hour) %>% select(riid,MONDAY)
df_tue <- df2 %>% filter(day == 'TUESDAY') %>% mutate(TUESDAY = hour) %>% select(riid,TUESDAY)
df_wed <- df2 %>% filter(day == 'WEDNESDAY') %>% mutate(WEDNESDAY = hour) %>% select(riid,WEDNESDAY)
df_thu <- df2 %>% filter(day == 'THURSDAY') %>% mutate(THURSDAY = hour) %>% select(riid,THURSDAY)
df_fri <- df2 %>% filter(day == 'FRIDAY') %>% mutate(FRIDAY = hour) %>% select(riid,FRIDAY)
Is it possible to write all above in one statement?
Any help to transpose this in a faster manner is really appreciated.
EDIT
Adding the dput of the tbl object:
structure(list(src = structure(list(con = <S4 object of class structure("PostgreSQLConnection", package = "RPostgreSQL")>,
disco = <environment>), .Names = c("con", "disco"), class = c("src_dbi",
"src_sql", "src")), ops = structure(list(name = "select", x = structure(list(
name = "filter", x = structure(list(name = "filter", x = structure(list(
name = "group_by", x = structure(list(x = structure("SELECT riid,day,hour,sum(weightage) AS score FROM\n (SELECT riid,day,hour,\n POWER(2,(cast(datediff (seconds,convert_timezone('UTC','PKT',SYSDATE),TO_DATE(TO_CHAR(event_captured_dt,'mm/dd/yyyy hh24:mi:ss'),'mm/dd/yyyy hh24:mi:ss')) as decimal) / cast(7862400 as decimal))) AS weightage\n FROM (\n SELECT riid,convert_timezone('GMT','PKT',event_captured_dt) AS EVENT_CAPTURED_DT,\n TO_CHAR(convert_timezone('GMT','PKT',event_captured_dt),'DAY') AS day,\n TO_CHAR(convert_timezone('GMT','PKT',event_captured_dt),'HH24') AS hour\n FROM Zameen_STO_DATA WHERE EVENT_CAPTURED_DT >= TO_DATE((sysdate -30),'yyyy-mm-dd') and LIST_ID = 4282\n )) group by riid,day,hour", class = c("sql",
"character")), vars = c("riid", "day", "hour", "score"
)), .Names = c("x", "vars"), class = c("op_base_remote",
"op_base", "op")), dots = structure(list(riid = riid,
day = day), .Names = c("riid", "day")), args = structure(list(
add = FALSE), .Names = "add")), .Names = c("name",
"x", "dots", "args"), class = c("op_group_by", "op_single",
"op")), dots = structure(list(~min_rank(desc(~score)) <=
1), .Names = ""), args = list()), .Names = c("name",
"x", "dots", "args"), class = c("op_filter", "op_single",
"op")), dots = structure(list(~row_number() == 1), .Names = ""),
args = list()), .Names = c("name", "x", "dots", "args"), class = c("op_filter",
"op_single", "op")), dots = structure(list(~riid, ~day, ~hour), class = "quosures", .Names = c("",
"", "")), args = list()), .Names = c("name", "x", "dots", "args"
), class = c("op_select", "op_single", "op"))), .Names = c("src",
"ops"), class = c("tbl_dbi", "tbl_sql", "tbl_lazy", "tbl"))
I think what you're looking for is the ability to run the tidyr::spread() function against a remote source, or database. I have a PR for dbplyr that attempts to implement that here: https://github.com/tidyverse/dbplyr/pull/72, you can try it out by using: devtools::install_github("tidyverse/dbplyr", ref = devtools::github_pull(72)).
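Until something like that is merged, a common workaround (assuming the aggregated result is small enough to pull down) is to collect() first and reshape locally:
library(dplyr)
library(tidyr)
df2 %>% collect() %>% spread(day, hour)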
Use dcast from the reshape2 package.
> data
# A tibble: 4 x 3
riid day hour
<dbl> <chr> <dbl>
1 1.00 TH 12.0
2 2.00 FR 15.0
3 3.00 TU 15.0
4 4.00 WE 16.0
> dcast(data, riid~day, value.var = "hour")
riid FR TH TU WE
1 1 NA 12 NA NA
2 2 15 NA NA NA
3 3 NA NA 15 NA
4 4 NA NA NA 16
Further if you want to remove NA, then
> z <- dcast(data, riid~day, value.var = "hour")
> z[is.na(z)] <- ""
> z
riid FR TH TU WE
1 1 12
2 2 15
3 3 15
4 4 16
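On current tidyr, the same reshape can be written with pivot_wider() (again assuming the data has been collected into a local frame):
library(tidyr)
pivot_wider(data, names_from = day, values_from = hour)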
I tried to combine your multiple line attempts into one. Can you try this and let us know the outcome?
library(dplyr)
df %>%
rowwise() %>%
mutate(Mon = ifelse(day=='MONDAY', hour[day=='MONDAY'], NA),
Tue = ifelse(day=='TUESDAY', hour[day=='TUESDAY'], NA),
Wed = ifelse(day=='WEDNESDAY', hour[day=='WEDNESDAY'], NA),
Thu = ifelse(day=='THURSDAY', hour[day=='THURSDAY'], NA),
Fri = ifelse(day=='FRIDAY', hour[day=='FRIDAY'], NA),
Sat = ifelse(day=='SATURDAY', hour[day=='SATURDAY'], NA),
Sun = ifelse(day=='SUNDAY', hour[day=='SUNDAY'], NA)) %>%
select(-day, -hour)
Output is:
riid Mon Tue Wed Thu Fri Sat Sun
1 5542 NA NA NA 12 NA NA NA
2 5862 NA NA NA NA 15 NA NA
3 5982 NA 15 NA NA NA NA NA
4 6022 NA NA 16 NA NA NA NA
Sample data:
# A tibble: 4 x 3
riid day hour
* <dbl> <chr> <int>
1 5542 THURSDAY 12
2 5862 FRIDAY 15
3 5982 TUESDAY 15
4 6022 WEDNESDAY 16
Update:
Can you try below approach using data.table?
library(data.table)
dt <- setDT(df)[, c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") :=
list(ifelse(day=='MONDAY', hour[day=='MONDAY'], NA),
ifelse(day=='TUESDAY', hour[day=='TUESDAY'], NA),
ifelse(day=='WEDNESDAY', hour[day=='WEDNESDAY'], NA),
ifelse(day=='THURSDAY', hour[day=='THURSDAY'], NA),
ifelse(day=='FRIDAY', hour[day=='FRIDAY'], NA),
ifelse(day=='SATURDAY', hour[day=='SATURDAY'], NA),
ifelse(day=='SUNDAY', hour[day=='SUNDAY'], NA))][, !c("day","hour"), with=F]
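For comparison, data.table ships its own dcast() method, which gives a much shorter route (assuming df is a local table):
library(data.table)
dcast(setDT(df), riid ~ day, value.var = "hour")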
I have a (sample)table like this:
df <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL Values
TP53 2 3.55
XBP1 5 4.06
TP27 1 2.53
REDD1 4 3.99
ERO1L 6 5.02
STK11 9 3.64
HIF2A 8 2.96")
I want to look up the symbols from two different genelists, given here as genelist1 and genelist2:
genelist1 <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL
P4H 10
PLK 7
TP27 1
KTD 11
ERO1L 6")
genelist2 <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="Gene SYMBOL
TP53 2
XBP1 5
BHLHB 12
STK11 9
TP27 1
UPK 18")
What I want is to get a new column where I can see in which genelist(s) each of the genes in my data frame can be found, but when I run the following code, it is just the symbols that are repeated in the new columns.
df_geneinfo <- df %>%
join(genelist1,by="SYMBOL") %>%
join(genelist2, by="SYMBOL")
Any suggestions on how to solve this, either by making one new column with the names of the genelists, or by making one column for each of the genelists?
Thanks in advance! :)
For the sake of completeness (and performance with large tables, perhaps), here is a data.table approach:
library(data.table)
rbindlist(list(genelist1, genelist2), idcol = "glid")[, -"Gene"][
setDT(df), on = "SYMBOL"][, .(glid = toString(glid)), by = .(Gene, SYMBOL, Values)][]
Gene SYMBOL Values glid
1: TP53 2 3.55 2
2: XBP1 5 4.06 2
3: TP27 1 2.53 1, 2
4: REDD1 4 3.99 1
5: ERO1L 6 5.02 NA
6: STK11 9 3.64 2
7: HIF2A 8 2.96 NA
rbindlist() creates a data.table from all genelists and adds a column glid to identify the origin of each row. The Gene column is dropped as the subsequent join is only on SYMBOL. Before joining, df is coerced to class data.table using setDT(). The joined result is then aggregated by SYMBOL to exhibit cases where a symbol appears in both genelists, which is the case for SYMBOL == 1.
Edit
In case there are many genelists or the full name of the genelist is required instead of just a number, we can try this:
rbindlist(mget(ls(pattern = "^genelist")), idcol = "glid")[, -"Gene"][
setDT(df), on = "SYMBOL"][, .(glid = toString(glid)), by = .(Gene, SYMBOL, Values)][]
Gene SYMBOL Values glid
1: TP53 2 3.55 genelist2
2: XBP1 5 4.06 genelist2
3: TP27 1 2.53 genelist1, genelist2
4: REDD1 4 3.99 NA
5: ERO1L 6 5.02 genelist1
6: STK11 9 3.64 genelist2
7: HIF2A 8 2.96 NA
ls() looks for objects in the environment whose names start with genelist. mget() returns a named list of those objects, which is passed to rbindlist().
Data
As provided by the OP
df <- structure(list(Gene = c("TP53", "XBP1", "TP27", "REDD1", "ERO1L",
"STK11", "HIF2A"), SYMBOL = c(2L, 5L, 1L, 4L, 6L, 9L, 8L), Values = c(3.55,
4.06, 2.53, 3.99, 5.02, 3.64, 2.96)), .Names = c("Gene", "SYMBOL",
"Values"), class = "data.frame", row.names = c(NA, -7L))
genelist1 <- structure(list(Gene = c("P4H", "PLK", "TP27", "KTD", "ERO1L"),
SYMBOL = c(10L, 7L, 1L, 11L, 4L)), .Names = c("Gene", "SYMBOL"
), class = "data.frame", row.names = c(NA, -5L))
genelist2 <- structure(list(Gene = c("TP53", "XBP1", "BHLHB", "STK11", "TP27",
"UPK"), SYMBOL = c(2L, 5L, 12L, 9L, 1L, 18L)), .Names = c("Gene",
"SYMBOL"), class = "data.frame", row.names = c(NA, -6L))
I just wrote my own function, which replaces the column values:
replace_by_lookuptable <- function(df, col, lookup) {
  assertthat::assert_that(all(col %in% names(df)))  # all cols exist in df
  assertthat::assert_that(all(c("new", "old") %in% colnames(lookup)))
  # 'my_match' in the original is undefined; base match() does the lookup
  cond_na_exists <- is.na(unlist(lapply(df[, col], function(x) match(x, lookup$old))))
  assertthat::assert_that(!any(cond_na_exists))  # every value must appear in lookup$old
  df[, col] <- unlist(lapply(df[, col], function(x) lookup$new[match(x, lookup$old)]))
  return(df)
}
df is the data.frame, col is a vector of column names whose values should be replaced using lookup, a data.frame with columns "old" and "new".
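A hypothetical usage sketch (data invented for illustration; the assertions require every value in col to appear in lookup$old):
d <- data.frame(x = c("a", "b", "a"))
lkp <- data.frame(old = c("a", "b"), new = c("A", "B"))
replace_by_lookuptable(d, col = "x", lookup = lkp)
# d$x becomes "A", "B", "A"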
If you add a listid column to your genelists
genelist1$listid = 1
genelist2$listid = 2
you can then merge your df with the genelists:
merge(df,rbind(genelist1,genelist2),all.x=T, by = "SYMBOL")
Note that ERO1L is SYMBOL 6 in your df and 4 in genelist1, and HIF2A and REDD1 are missing from the genelists, but REDD1 is SYMBOL 4 in your df (which is ERO1L in genelist1)... so I'm not sure what output you're expecting in that case.
You could also merge only on Gene names:
merge(df,rbind(genelist1,genelist2),all.x=T, by.x = "Gene", by.y= "Gene")
You could put all of your genelists in a list:
gen_list <- list(genelist1 = genelist1,genelist2 = genelist2)
and compare them to your target data.frame:
cbind(df,
      do.call(cbind,
              lapply(seq_along(gen_list), function(x)
                ifelse(df$Gene %in% gen_list[[x]]$Gene, names(gen_list[x]), NA))))
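This yields one column per genelist, containing the genelist's name where the gene appears in it and NA otherwise.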