Replacing missing values - r

Let's say I have a dataframe containing the sales for some quarters, while the values for the following quarters are missing. I would like to replace the NAs by a simple formula (with mutate/dplyr like below). The issue is that I don't want to use mutate so many times. How could I do that for all NAs at the same time? Is there a way?
test <- structure(list(Period = c("1999Q1", "1999Q2", "1999Q3", "1999Q4",
"2000Q1", "2000Q2", "2000Q3", "2000Q4", "2001Q1", "2001Q2", "2001Q3",
"2001Q4", "2002Q1", "2002Q2", "2002Q3", "2002Q4", "2003Q1", "2003Q2",
"2003Q3", "2003Q4"), Sales= c(353.2925571, 425.9299841, 357.5204626,
363.80247, 302.8081066, 394.328576, 435.15573, 387.99768, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-20L))
test %>%
mutate(Sales = ifelse(is.na(Sales), 1.05*lag(Sales, 4), Sales)) %>%
mutate(Sales = ifelse(is.na(Sales), 1.05*lag(Sales, 4), Sales)) %>%
mutate(Sales = ifelse(is.na(Sales), 1.05*lag(Sales, 4), Sales))

One dplyr and tidyr possibility could be:
df %>%
group_by(quarter = substr(Period, 5, 6)) %>%
mutate(Sales_temp = replace_na(Sales, last(na.omit(Sales)))) %>%
group_by(quarter, na = is.na(Sales)) %>%
mutate(constant = 1.05,
Sales_temp = Sales_temp * cumprod(constant),
Sales = coalesce(Sales, Sales_temp)) %>%
ungroup() %>%
select(1:2)
Period Sales
<chr> <dbl>
1 1999Q1 353.
2 1999Q2 426.
3 1999Q3 358.
4 1999Q4 364.
5 2000Q1 303.
6 2000Q2 394.
7 2000Q3 435.
8 2000Q4 388.
9 2001Q1 318.
10 2001Q2 414.
11 2001Q3 457.
12 2001Q4 407.
13 2002Q1 334.
14 2002Q2 435.
15 2002Q3 480.
16 2002Q4 428.
17 2003Q1 351.
18 2003Q2 456.
19 2003Q3 504.
20 2003Q4 449.
Or with just dplyr:
df %>%
group_by(quarter = substr(Period, 5, 6)) %>%
mutate(Sales_temp = if_else(is.na(Sales), last(na.omit(Sales)), Sales)) %>%
group_by(quarter, na = is.na(Sales)) %>%
mutate(constant = 1.05,
Sales_temp = Sales_temp * cumprod(constant),
Sales = coalesce(Sales, Sales_temp)) %>%
ungroup() %>%
select(1:2)

x <- test$Sales
# find the position of the last non-NA value
last.valid <- tail(which(!is.na(x)),1)
# store the "base"
base <- ceiling(last.valid/4)*4 + (-3:0)
base <- base + ifelse(base > last.valid, -4, 0)
base <- x[base]
# calculate the "exponents"
expos <- ceiling( ( seq(length(x)) - last.valid ) / 4 )
test$Sales <- ifelse(is.na(x), base * 1.05 ^ expos, x)
tail(test)
# Period Sales
# 15 2002Q3 479.7592
# 16 2002Q4 427.7674
# 17 2003Q1 350.5382
# 18 2003Q2 456.4846
# 19 2003Q3 503.7472
# 20 2003Q4 449.1558

Here's another base solution:
non_nas <- na.omit(test$Sales)
nas <- length(attr(non_nas, 'na.action'))
test$Sales <- c(non_nas, #keep non_nas
tail(non_nas, 4) * 1.05 ^(rep(1:floor(nas / 4), each = 4, length.out = nas)))
test
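Since each missing quarter depends only on the value four rows earlier, the fill can also be written as a plain loop over the NA positions. This is a minimal sketch, assuming the data frame is named test as in the question and that the first four values are present; the name filled is only for illustration.

```r
# Rebuild the question's data
test <- data.frame(
  Period = paste0(rep(1999:2003, each = 4), "Q", 1:4),
  Sales = c(353.2925571, 425.9299841, 357.5204626, 363.80247,
            302.8081066, 394.328576, 435.15573, 387.99768, rep(NA, 12)),
  stringsAsFactors = FALSE
)

filled <- test
# which() returns positions in ascending order, so by the time row i is
# reached, row i - 4 has either an original or an already filled value
for (i in which(is.na(filled$Sales))) {
  filled$Sales[i] <- 1.05 * filled$Sales[i - 4]
}
tail(filled)
```

The loop trades vectorization for readability; for 20 rows the difference is negligible.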

Related

In R, how do you make a new row that is the combination of two other rows without removing the original rows?

I have a longitudinal dataset about individuals from different socioeconomic backgrounds. The raw data is broken up into high, middle, lower middle, and lower SES statuses. However, I want to add a fifth row that aggregates the lower middle and lower SES statuses. I know how to get the information that I need as columns (see below), but I'm not sure how to elegantly get that information into another row.
Here's a subset of my raw dataset:
library(dplyr)
test_data <- tibble(month = c(rep(c("Jan"), 4), rep(c("Feb"), 4)),
ses = c(rep(c("High", "Mid", "Mid Low", "Low"), 2)),
total = c(10, 20, 20, 30, 9, 11, 40, 60),
total_selected = c(9, 10, 8, 3, 8, 6, 8, 6)) %>%
group_by(month, ses) %>%
mutate(success_rate = total_selected/total)
And here's my code that does get the information that I need (i.e., it aggregates the information for lower and lower middle ses), but it puts them as columns instead of rows:
(test_data2 <- test_data %>%
group_by(month) %>%
mutate(three_ses_total = case_when(
ses %in% c("High", "Mid") ~ total,
ses %in% c("Mid Low", "Low") ~ (total[ses == "Mid Low"] + total[ses == "Low"])
),
three_ses_total_selected = case_when(
ses %in% c("High", "Mid") ~ total_selected,
ses %in% c("Mid Low", "Low") ~ (total_selected[ses == "Mid Low"] + total_selected[ses == "Low"])
),
three_ses_success_rate = case_when(
ses %in% c("High", "Mid") ~ success_rate,
ses %in% c("Mid Low", "Low") ~ three_ses_total_selected/three_ses_total
)))
Last, this is what I want the output to look like. Note: I want 5 rows--in other words, I still want the 4 raw classes in the dataset, but I also want the new combined lower middle and lower:
(answer <- tibble(month = c(rep(c("Jan"), 5), rep(c("Feb"), 5)),
ses = c(rep(c("High", "Mid", "Mid Low", "Low", "Mid Low and Low"), 2)),
total = c(10, 20, 20, 30, 50, 9, 11, 40, 60, 100),
total_selected = c(9, 10, 8, 3, 11, 8, 6, 8, 6, 14)) %>%
group_by(month, ses) %>%
mutate(success_rate = total_selected/total))
I'm open to any suggestion, but if there's a dplyr, tidyr, or other tidyverse function(s) that could help, I'd especially appreciate that. I was trying to think if tidyr's pivot functions would work, but I can't seem to crack it.
Try this:
library(dplyr)
test_data %>%
filter(ses %in% c("Low", "Mid Low")) %>%
group_by(month) %>%
summarize(
ses = "Mid Low and Low",
across(-c(ses, success_rate), sum),
success_rate = total_selected / total
) %>%
bind_rows(test_data) %>%
arrange(month, ses)
# # A tibble: 10 x 5
# month ses total total_selected success_rate
# <chr> <chr> <dbl> <dbl> <dbl>
# 1 Feb High 9 8 0.889
# 2 Feb Low 60 6 0.1
# 3 Feb Mid 11 6 0.545
# 4 Feb Mid Low 40 8 0.2
# 5 Feb Mid Low and Low 100 14 0.14
# 6 Jan High 10 9 0.9
# 7 Jan Low 30 3 0.1
# 8 Jan Mid 20 10 0.5
# 9 Jan Mid Low 20 8 0.4
# 10 Jan Mid Low and Low 50 11 0.22
The intent of this is to produce the additional rows first (and separately), which in this case produces just two rows:
test_data %>%
filter(ses %in% c("Low", "Mid Low")) %>%
group_by(month) %>%
summarize(
ses = "Mid Low and Low",
across(-c(ses, success_rate), sum),
success_rate = total_selected / total
)
# # A tibble: 2 x 5
# month ses total total_selected success_rate
# <chr> <chr> <dbl> <dbl> <dbl>
# 1 Feb Mid Low and Low 100 14 0.14
# 2 Jan Mid Low and Low 50 11 0.22
Once we have those two, add them to the original data with %>% bind_rows(test_data). (I added the arrange since the months would be out of order.)
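The same two-step idea (build the extra rows separately, then append them) can be sketched in base R without any packages. This is an illustrative variant, not the answerer's code; the names low, extra and combined are made up for the example.

```r
test_data <- data.frame(
  month = rep(c("Jan", "Feb"), each = 4),
  ses = rep(c("High", "Mid", "Mid Low", "Low"), 2),
  total = c(10, 20, 20, 30, 9, 11, 40, 60),
  total_selected = c(9, 10, 8, 3, 8, 6, 8, 6),
  stringsAsFactors = FALSE
)

# Step 1: build only the extra rows (one per month)
low <- test_data[test_data$ses %in% c("Mid Low", "Low"), ]
extra <- aggregate(cbind(total, total_selected) ~ month, data = low, FUN = sum)
extra$ses <- "Mid Low and Low"

# Step 2: append them to the untouched original rows,
# then compute the rate once for all rows
combined <- rbind(test_data, extra[names(test_data)])
combined$success_rate <- combined$total_selected / combined$total
combined[order(combined$month, combined$ses), ]
```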
It's a little hacky, but this solution works.
First, get the values that you want by combining them together as I did above, but instead of making new column names, keep the original names:
library(dplyr)
(test_data2 <- test_data %>%
group_by(month) %>%
mutate(total = case_when(
ses %in% c("High", "Mid") ~ total,
ses %in% c("Mid Low", "Low") ~ (total[ses == "Mid Low"] + total[ses == "Low"])
),
total_selected = case_when(
ses %in% c("High", "Mid") ~ total_selected,
ses %in% c("Mid Low", "Low") ~ (total_selected[ses == "Mid Low"] + total_selected[ses == "Low"])
),
success_rate = case_when(
ses %in% c("High", "Mid") ~ success_rate,
ses %in% c("Mid Low", "Low") ~ total_selected/total
)))
Then, filter down to just one of the duplicated rows and change the name to what you want:
(test_data2 <- test_data2 %>%
filter(ses == "Low") %>%
mutate(ses = "Mid Low and Low"))
Last, full_join it with your original data:
(test_data3 <- test_data %>%
full_join(test_data2))
Still open to more parsimonious options, but this works!
Here is a solution that combines group_modify and adorn_totals from janitor package:
library(janitor)
library(dplyr)
df %>%
filter(ses == "Mid Low" | ses == "Low") %>%
group_by(month) %>%
group_modify(~ .x %>%
adorn_totals("row")) %>%
filter(ses == "Total") %>%
mutate(success_rate = total_selected/total)
month ses total total_selected success_rate
<chr> <chr> <dbl> <dbl> <dbl>
1 Feb Total 100 14 0.14
2 Jan Total 50 11 0.22
An option with add_row
library(dplyr)
test_data %>%
group_by(month) %>%
group_modify(~ add_row(.x, ses = "Mid Low and Low",
!!! colSums(.x[.x$ses %in% c("Mid Low", "Low"),
c("total", "total_selected")]))) %>%
ungroup %>%
mutate(success_rate = coalesce(success_rate, total_selected/total))
-output
# A tibble: 10 × 5
month ses total total_selected success_rate
<chr> <chr> <dbl> <dbl> <dbl>
1 Feb High 9 8 0.889
2 Feb Mid 11 6 0.545
3 Feb Mid Low 40 8 0.2
4 Feb Low 60 6 0.1
5 Feb Mid Low and Low 100 14 0.14
6 Jan High 10 9 0.9
7 Jan Mid 20 10 0.5
8 Jan Mid Low 20 8 0.4
9 Jan Low 30 3 0.1
10 Jan Mid Low and Low 50 11 0.22
Or in data.table
library(data.table)
setDT(test_data)[, rbind(.SD, c(list(ses = "Mid Low and Low"),
lapply(.SD[ses %in% c("Mid Low", "Low"),
.(total, total_selected)], sum)), fill = TRUE), month][,
success_rate := fcoalesce(success_rate, total_selected/total)][]
month ses total total_selected success_rate
<char> <char> <num> <num> <num>
1: Jan High 10 9 0.9000000
2: Jan Mid 20 10 0.5000000
3: Jan Mid Low 20 8 0.4000000
4: Jan Low 30 3 0.1000000
5: Jan Mid Low and Low 50 11 0.2200000
6: Feb High 9 8 0.8888889
7: Feb Mid 11 6 0.5454545
8: Feb Mid Low 40 8 0.2000000
9: Feb Low 60 6 0.1000000
10: Feb Mid Low and Low 100 14 0.1400000

R: Expand rows according to start and end date and calculate hours between days

My question extends this one: Generate rows between two dates into a data frame in R
I have a dataset on admissions, discharges and lengths of stay (Stay_in_days) of patients from a hospital. It looks like this:
ID Admission Discharge Stay_in_days
1 2020-08-20 15:25:03 2020-08-21 21:09:49 1.239
2 2020-10-04 21:53:43 2020-10-09 11:02:57 4.548
...
Dates are in POSIXct format so far.
I aim for this:
ID Date Stay_in_days
1 2020-08-20 15:25:03 0.357
1 2020-08-21 21:09:49 1.239
2 2020-10-04 21:53:43 0.087
2 2020-10-05 00:00:00 1.087
2 2020-10-06 00:00:00 2.087
2 2020-10-07 00:00:00 3.087
2 2020-10-08 00:00:00 4.087
2 2020-10-09 11:02:57 4.548
...
What I have done so far:
M <- Map(seq, patients$Admission, patients$Discharge, by = "day")
patients2 <- data.frame(
ID = rep.int(patients$ID, vapply(M, length, 1L)),
Date = do.call(c, M)
)
patients <- patients %>%
mutate(
Date2=as.Date(Date, format = "%Y-%m-%d"),
Dat2=Date2+1,
Diff=difftime(Date2, Date, units = "days")
)
but this gives me:
ID Date Date2 Diff
1 2020-08-20 17:25:03 2020-08-21 0.357
1 2020-08-21 17:25:03 2020-08-22 0.357
2 2020-10-04 23:53:43 2020-10-05 0.087
2 2020-10-05 23:53:43 2020-10-06 0.087
2 2020-10-06 23:53:43 2020-10-07 0.087
2 2020-10-07 23:53:43 2020-10-08 0.087
2 2020-10-08 23:53:43 2020-10-09 0.087
...
Strangely enough, it adds two hours to the Admission date but calculates the correct length of stay. Can someone explain?
Here is some data:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), Admission = structure(c(1597937103.872,
1598717768.704, 1599060521.984, 1599758087.168, 1599815496.704,
1600702198.784, 1600719631.36, 1601065923.584, 1601119400.96,
1601215476.736, 1601236710.4, 1601416934.4, 1601499640.832, 1601545647.104,
1601587328, 1601644868.608, 1601741206.528, 1601848423.424, 1601901245.44,
1601913828.352), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Discharge = structure(c(1598044189.696, 1598897337.344, 1599144670.208,
1599845118.976, 1599842366.464, 1602733683.712, 1603372135.424,
1601125168.128, 1601314173.952, 1605193905.152, 1602190259.2,
1601560720.384, 1601737143.296, 1602705634.304, 1602410460.16,
1602698425.344, 1601770566.656, 1602241377.28, 1602780476.416,
1602612048.896), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Stay_in_days = c(1.239, 2.078, 0.974, 1.007, 0.311, 23.513,
30.7, 0.686, 2.254, 46.047, 11.036, 1.664, 2.749, 13.426,
9.527, 12.194, 0.34, 4.548, 10.176, 8.081)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Thanks in advance for your help!
It is a bit crude, but it works:
library(tidyverse)
library(lubridate)
df %>%
pivot_longer(cols = -c(ID, Stay_in_days), names_to = "Event", values_to = "DATE") %>%
group_by(ID) %>%
mutate(dummy = case_when(Event == "Admission" ~ 0,
Event == "Discharge" ~ max(floor(Stay_in_days),1))) %>%
complete(dummy = seq(min(dummy), max(dummy), 1)) %>%
mutate(Event = ifelse(is.na(Event), "Dummy", Event),
DATE = if_else(is.na(DATE), first(DATE)+dummy*24*60*60, DATE),
Stay_in_days = case_when(Event == "Admission" ~ as.numeric(difftime(ceiling_date(DATE, "day"), DATE, units = "days")),
Event == "Discharge" ~ Stay_in_days,
TRUE ~ dummy + as.numeric(difftime(ceiling_date(first(DATE), "day"), first(DATE), units = "days")))) %>%
select(ID, DATE, Stay_in_days)
# A tibble: 199 x 3
# Groups: ID [20]
ID DATE Stay_in_days
<dbl> <dttm> <dbl>
1 1 2020-08-20 15:25:03 0.358
2 1 2020-08-21 21:09:49 1.24
3 2 2020-08-29 16:16:08 0.322
4 2 2020-08-30 16:16:08 1.32
5 2 2020-08-31 18:08:57 2.08
6 3 2020-09-02 15:28:41 0.355
7 3 2020-09-03 14:51:10 0.974
8 4 2020-09-10 17:14:47 0.281
9 4 2020-09-11 17:25:18 1.01
10 5 2020-09-11 09:11:36 0.617
# ... with 189 more rows
Explanation of logic: For the first date in every ID, Stay_in_days gives the duration from the admission date-time to the end of that day (the following midnight). For intermediate dates, it just adds 1 to the previous value. For the discharge date, it retains the stay value calculated prior to pivoting. Hope this is what you were after.
Explanation of code: After pivoting longer, I used a dummy column to create the intermediate date-time objects. After that I just mutate the columns to generate the output described above.
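The logic described above can also be sketched in base R for a single patient. expand_stay() is a hypothetical helper written for this illustration, not part of any package. It assumes the timestamps are in UTC, as in the posted data, and it follows the question's target layout by placing intermediate rows at midnight. Keeping tz explicit everywhere also avoids clock times being shifted when they are printed in a local time zone.

```r
# Hypothetical helper: one row for admission, one per intermediate
# calendar day (at midnight), and one for discharge
expand_stay <- function(id, admission, discharge) {
  # Calendar days touched by the stay, evaluated in UTC
  days <- seq(as.Date(admission, tz = "UTC"),
              as.Date(discharge, tz = "UTC"), by = "day")
  midnights <- as.POSIXct(format(days), tz = "UTC")
  # Drop the first and last day; those rows use the real timestamps
  inner <- midnights[-c(1, length(midnights))]
  stamps <- c(admission, inner, discharge)
  # Cumulative stay up to the end of each row's day, capped at discharge
  day_end <- as.POSIXct(format(as.Date(stamps, tz = "UTC") + 1), tz = "UTC")
  secs <- pmin(as.numeric(day_end), as.numeric(discharge)) - as.numeric(admission)
  data.frame(ID = id, Date = stamps, Stay_in_days = round(secs / 86400, 3))
}

adm <- as.POSIXct("2020-10-04 21:53:43", tz = "UTC")
dis <- as.POSIXct("2020-10-09 11:02:57", tz = "UTC")
res <- expand_stay(2, adm, dis)
res
```

For the whole table, the helper could be applied per patient with Map() over ID, Admission and Discharge and the pieces combined with do.call(rbind, ...), in the same way the question already uses Map() with seq.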
You can achieve this with pivot_longer from tidyr.
Edit: with comments:
df1 <- df %>%
select(ID = ID, date1 = Admission, date2 = Discharge, Stay_in_days) %>% # prepare for pivoting
pivot_longer(
cols = starts_with("date"),
names_to = "Date1",
values_to = "Date",
) %>% # pivot to longformat
select(-Date1) %>% # remove temporary Date1
relocate(Stay_in_days, .after = Date) %>% # change column order
group_by(ID) %>%
mutate(idgroup = row_number()) %>% # id for admission = 1 and for discharge id = 2
mutate(Stay_in_days = replace(Stay_in_days, row_number() == 1, 0)) %>% # set Admission to zero
ungroup()

Perform a series of mutations to columns in dataframe

I am trying to replace some text in my dataframe (a few rows given below)
> dput(Henry.longer[1:4,])
structure(list(N_l = c(4, 4, 4, 4), UG = c("100", "100", "100",
"100"), S = c(12, 12, 12, 12), Sample = c(NA, NA, NA, NA), EQ = c("Henry",
"Henry", "Henry", "Henry"), DF = c(0.798545454545455, 0.798545454545455,
0.798545454545455, 0.798545454545455), meow = c("Henry.Exterior.single",
"Multi", "Henry.Exterior.multi", "Henry.Interior.single"), Girder = c("Henry.Exterior.single",
"Henry.Interior.multi", "Henry.Exterior.multi", "Interior")), row.names = c(NA,
-4L), groups = structure(list(UG = "100", S = 12, .rows = list(
1:4)), row.names = c(NA, -1L), class = c("tbl_df", "tbl",
"data.frame"), .drop = FALSE), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"))
I try to mutate the dataframe as:
Henry.longer <- Henry.longer %>%
mutate(Loading = str_replace(meow, "Henry.Exterior.single", "Single")) %>%
mutate(Loading = str_replace(meow, "Henry.Exterior.multi", "Multi")) %>%
mutate(Loading = str_replace(meow, "Henry.Interior.single", "Single")) %>%
mutate(Loading = str_replace(meow, "Henry.Interior.multi", "Multi")) %>%
mutate(Girder = str_replace(meow, "Henry.Exterior.multi", "Exterior")) %>%
mutate(Girder = str_replace(meow, "Henry.Exterior.single", "Exterior")) %>%
mutate(Girder = str_replace(meow, "Henry.Interior.multi", "Interior")) %>%
mutate(Girder = str_replace(meow, "Henry.Interior.single", "Interior")) %>%
select(-meow)
But for some reason the replacements are not applied to all the rows, and I only get:
N_l UG S Sample EQ DF Loading Girder
1 4 100 12 NA Henry 0.799 Henry.Exterior.single Henry.Exterior.single
2 4 100 12 NA Henry 0.799 Multi Henry.Interior.multi
3 4 100 12 NA Henry 0.799 Henry.Exterior.multi Henry.Exterior.multi
4 4 100 12 NA Henry 0.799 Henry.Interior.single Interior
I think we can use lookup vectors for this, if it's easy or safer to use static string lookups:
tr_vec <- c(Henry.Exterior.single = "Single", Henry.Exterior.multi = "Multi", Henry.Interior.single = "Single", Henry.Interior.multi = "Multi")
tr_vec2 <- c(Henry.Exterior.multi = "Exterior", Henry.Exterior.single = "Exterior", Henry.Interior.multi = "Interior", Henry.Interior.single = "Interior")
Henry.longer %>%
mutate(
Loading = coalesce(tr_vec[Loading], Loading),
Girder = coalesce(tr_vec2[Girder], Girder)
)
# # A tibble: 4 x 8
# # Groups: UG, S [1]
# N_l UG S Sample EQ DF Loading Girder
# <dbl> <chr> <dbl> <lgl> <chr> <dbl> <chr> <chr>
# 1 4 100 12 NA Henry 0.799 Single Exterior
# 2 4 100 12 NA Henry 0.799 Multi Interior
# 3 4 100 12 NA Henry 0.799 Multi Exterior
# 4 4 100 12 NA Henry 0.799 Single Interior
The advantage of RonakShah's regex solution is that it can very easily handle many of the types of substrings you appear to need. Regexes do carry a little risk, though, in that they may mismatch (unlikely in that answer).
Instead of using str_replace I guess it would be easier to extract what you want using regex.
library(dplyr)
Henry.longer %>%
mutate(Loading = sub('.*\\.', '', meow),
Girder = sub('.*\\.(\\w+)\\..*', '\\1', meow))
where
Loading - removes everything until last dot
Girder - extracts a word between two dots.
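To see what the two patterns do in isolation, here is a small sketch on a plain character vector. The vector meow below is made up to mirror the sample data, including the odd "Multi" entry; strings that do not match a pattern are returned unchanged by sub().

```r
meow <- c("Henry.Exterior.single", "Henry.Interior.multi", "Multi")

# Greedy ".*\\." consumes everything up to and including the last dot
loading <- sub(".*\\.", "", meow)

# Capture the word sitting between two dots and keep only that group
girder <- sub(".*\\.(\\w+)\\..*", "\\1", meow)

data.frame(meow, loading, girder)
```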
Oh boy, looks like you've got some answers here already but here's a super-simple one that uses stringr::str_extract:
Henry.longer <- Henry.longer %>%
mutate(Loading = str_extract(meow, "single|multi")) %>%
mutate(Girder = str_extract(meow, "Interior|Exterior"))
It's worth noting that the demo data has a weird entry for meow in one column, so it didn't run perfectly on my machine.

difference between first non-NA and last non-NA in each row

I have a data frame with up to 5 measurements (x) and their corresponding time:
df = structure(list(x1 = c(92.9595722286402, 54.2085219673818,
46.3227062573019,
NA, 65.1501442134141, 49.736451235317), time1 = c(43.2715277777778,
336.625, 483.975694444444, NA, 988.10625, 510.072916666667),
x2 = c(82.8368681534474, 53.7981639701784, 12.9993531230419,
NA, 64.5678816290574, 55.331442940348), time2 = c(47.8166666666667,
732, 506.747222222222, NA, 1455.25486111111, 958.976388888889
), x3 = c(83.5433119686794, 65.723072881366, 19.0147593408309,
NA, 65.1989838202356, 36.7000828457705), time3 = c(86.5888888888889,
1069.02083333333, 510.275, NA, 1644.21527777778, 1154.95694444444
), x4 = c(NA, 66.008102917677, 40.6243513885846, NA, 62.1694420909955,
29.0078249523063), time4 = c(NA, 1379.22986111111, 520.726388888889,
NA, 2057.20833333333, 1179.86805555556), x5 = c(NA, 61.0047472617535,
45.324715258421, NA, 59.862110645527, 45.883161439362), time5 = c(NA,
1825.33055555556, 523.163888888889, NA, 3352.26944444444,
1364.99513888889)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))
"NA" means that the person (row) didn't have a measurement.
I would like to calculate the difference between the last existing measurement and the first one.
So for the first one it would be x3 minus x1 (-9.4), for the second it would be x5 minus x1 (6.8) and so on.
I tried something like this, which didn't work:
df$diff = apply(df %>% select(., contains("x")), 1, function(x) head(x,
na.rm = T) - tail(x, na.rm=T))
Any suggestions? Also, is apply/rowwise the most efficient way, or is there a vectorized function to do that?
A vectorized way would be using max.col where we get "first" and "last" non-NA value using ties.method parameter
#Get the positions of the `"x"` columns
x_cols <- grep("^x", names(df))
#Get column number of the first and last non-NA value in each row
first_col <- max.col(!is.na(df[x_cols]), ties.method = "first")
last_col <- max.col(!is.na(df[x_cols]), ties.method = "last")
#subset the dataframe to include only `"x"` cols
new_df <- as.data.frame(df[x_cols])
#Subtract the first non-NA value from the last one
df$new_calc <- new_df[cbind(1:nrow(df), last_col)] -
new_df[cbind(1:nrow(df), first_col)]
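The max.col trick is easier to follow on a toy matrix. In the sketch below (the matrix m is made up for illustration), !is.na(m) is TRUE wherever a value exists. Every TRUE ties for the row maximum, so ties.method picks the first or the last of them.

```r
m <- rbind(c(5, 3, NA, NA),
           c(NA, 2, 7, 9),
           c(NA, NA, NA, NA))

present <- !is.na(m)
first_col <- max.col(present, ties.method = "first")
last_col <- max.col(present, ties.method = "last")

# Matrix indexing with (row, column) pairs picks one cell per row
rows <- seq_len(nrow(m))
last_minus_first <- m[cbind(rows, last_col)] - m[cbind(rows, first_col)]
last_minus_first
```

For an all-NA row every column ties at FALSE, so the chosen cells are both NA and the difference is NA, which is the desired result here.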
Using apply you could do
x_cols <- grep("^x", names(df))
df$new_calc <- apply(df[x_cols], 1, function(x) {
new_x <- x[!is.na(x)]
if (length(new_x) > 0)
new_x[length(new_x)] - new_x[1L]
else NA
})
We can use tidyverse methods on the tbl_df. Create a row names column (rownames_to_column), gather the 'x' columns to 'long' format while removing the NA elements (na.rm = TRUE), grouped by row name, get the difference of first and last 'val'ues and bind the extracted column with the original dataset 'df'
library(tidyverse)
rownames_to_column(df, 'rn') %>%
select(rn, starts_with('x')) %>%
gather(key, val, -rn, na.rm = TRUE) %>%
group_by(rn) %>%
summarise(Diff = diff(c(first(val), last(val)))) %>%
mutate(rn = as.numeric(rn)) %>%
complete(rn = min(rn):max(rn)) %>%
pull(Diff) %>%
bind_cols(df, new_col = .)
# A tibble: 6 x 11
# x1 time1 x2 time2 x3 time3 x4 time4 x5 time5 new_col
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 93.0 43.3 82.8 47.8 83.5 86.6 NA NA NA NA -9.42
#2 54.2 337. 53.8 732 65.7 1069. 66.0 1379. 61.0 1825. 6.80
#3 46.3 484. 13.0 507. 19.0 510. 40.6 521. 45.3 523. -0.998
#4 NA NA NA NA NA NA NA NA NA NA NA
#5 65.2 988. 64.6 1455. 65.2 1644. 62.2 2057. 59.9 3352. -5.29
#6 49.7 510. 55.3 959. 36.7 1155. 29.0 1180. 45.9 1365. -3.85

Transpose dplyr::tbl object

I am using src_postgres to connect and dplyr::tbl function to fetch data from redshift database. I have applied some filters and top function to it using the dplyr itself. Now my data looks as below:
riid day hour
<dbl> <chr> <chr>
1 5542. "THURSDAY " 12
2 5862. "FRIDAY " 15
3 5982. "TUESDAY " 15
4 6022. WEDNESDAY 16
My final output should be as below:
riid MON TUES WED THUR FRI SAT SUN
5542 12
5862 15
5982 15
6022 16
I have tried spread. It throws the below error because of the class type:
Error in UseMethod("spread_") : no applicable method for 'spread_'
applied to an object of class "c('tbl_dbi', 'tbl_sql', 'tbl_lazy',
'tbl')"
Since this is a really big table, I do not want to use dataframe as it takes a longer time.
I was able to use as below:
df_mon <- df2 %>% filter(day == 'MONDAY') %>% mutate(MONDAY = hour) %>% select(riid,MONDAY)
df_tue <- df2 %>% filter(day == 'TUESDAY') %>% mutate(TUESDAY = hour) %>% select(riid,TUESDAY)
df_wed <- df2 %>% filter(day == 'WEDNESDAY') %>% mutate(WEDNESDAY = hour) %>% select(riid,WEDNESDAY)
df_thu <- df2 %>% filter(day == 'THURSDAY') %>% mutate(THURSDAY = hour) %>% select(riid,THURSDAY)
df_fri <- df2 %>% filter(day == 'FRIDAY') %>% mutate(FRIDAY = hour) %>% select(riid,FRIDAY)
Is it possible to write all above in one statement?
Any help to transpose this in a faster manner is really appreciated.
EDIT
Adding the dput of the tbl object:
structure(list(src = structure(list(con = <S4 object of class structure("PostgreSQLConnection", package = "RPostgreSQL")>,
disco = <environment>), .Names = c("con", "disco"), class = c("src_dbi",
"src_sql", "src")), ops = structure(list(name = "select", x = structure(list(
name = "filter", x = structure(list(name = "filter", x = structure(list(
name = "group_by", x = structure(list(x = structure("SELECT riid,day,hour,sum(weightage) AS score FROM\n (SELECT riid,day,hour,\n POWER(2,(cast(datediff (seconds,convert_timezone('UTC','PKT',SYSDATE),TO_DATE(TO_CHAR(event_captured_dt,'mm/dd/yyyy hh24:mi:ss'),'mm/dd/yyyy hh24:mi:ss')) as decimal) / cast(7862400 as decimal))) AS weightage\n FROM (\n SELECT riid,convert_timezone('GMT','PKT',event_captured_dt) AS EVENT_CAPTURED_DT,\n TO_CHAR(convert_timezone('GMT','PKT',event_captured_dt),'DAY') AS day,\n TO_CHAR(convert_timezone('GMT','PKT',event_captured_dt),'HH24') AS hour\n FROM Zameen_STO_DATA WHERE EVENT_CAPTURED_DT >= TO_DATE((sysdate -30),'yyyy-mm-dd') and LIST_ID = 4282\n )) group by riid,day,hour", class = c("sql",
"character")), vars = c("riid", "day", "hour", "score"
)), .Names = c("x", "vars"), class = c("op_base_remote",
"op_base", "op")), dots = structure(list(riid = riid,
day = day), .Names = c("riid", "day")), args = structure(list(
add = FALSE), .Names = "add")), .Names = c("name",
"x", "dots", "args"), class = c("op_group_by", "op_single",
"op")), dots = structure(list(~min_rank(desc(~score)) <=
1), .Names = ""), args = list()), .Names = c("name",
"x", "dots", "args"), class = c("op_filter", "op_single",
"op")), dots = structure(list(~row_number() == 1), .Names = ""),
args = list()), .Names = c("name", "x", "dots", "args"), class = c("op_filter",
"op_single", "op")), dots = structure(list(~riid, ~day, ~hour), class = "quosures", .Names = c("",
"", "")), args = list()), .Names = c("name", "x", "dots", "args"
), class = c("op_select", "op_single", "op"))), .Names = c("src",
"ops"), class = c("tbl_dbi", "tbl_sql", "tbl_lazy", "tbl"))
I think what you're looking for is the ability to run the tidyr::spread() function against a remote source, or database. I have a PR for dbplyr that attempts to implement that here: https://github.com/tidyverse/dbplyr/pull/72, you can try it out by using: devtools::install_github("tidyverse/dbplyr", ref = devtools::github_pull(72)).
Use dcast from reshape2 package
> data
# A tibble: 4 x 3
riid day hour
<dbl> <chr> <dbl>
1 1.00 TH 12.0
2 2.00 FR 15.0
3 3.00 TU 15.0
4 4.00 WE 16.0
> dcast(data, riid~day, value.var = "hour")
riid FR TH TU WE
1 1 NA 12 NA NA
2 2 15 NA NA NA
3 3 NA NA 15 NA
4 4 NA NA NA 16
Further if you want to remove NA, then
> z <- dcast(data, riid~day, value.var = "hour")
> z[is.na(z)] <- ""
> z
riid FR TH TU WE
1 1 12
2 2 15
3 3 15
4 4 16
I tried to combine your multiple line attempts into one. Can you try this and let us know the outcome?
library(dplyr)
df %>%
rowwise() %>%
mutate(Mon = ifelse(day=='MONDAY', hour[day=='MONDAY'], NA),
Tue = ifelse(day=='TUESDAY', hour[day=='TUESDAY'], NA),
Wed = ifelse(day=='WEDNESDAY', hour[day=='WEDNESDAY'], NA),
Thu = ifelse(day=='THURSDAY', hour[day=='THURSDAY'], NA),
Fri = ifelse(day=='FRIDAY', hour[day=='FRIDAY'], NA),
Sat = ifelse(day=='SATURDAY', hour[day=='SATURDAY'], NA),
Sun = ifelse(day=='SUNDAY', hour[day=='SUNDAY'], NA)) %>%
select(-day, -hour)
Output is:
riid Mon Tue Wed Thu Fri Sat Sun
1 5542 NA NA NA 12 NA NA NA
2 5862 NA NA NA NA 15 NA NA
3 5982 NA 15 NA NA NA NA NA
4 6022 NA NA 16 NA NA NA NA
Sample data:
# A tibble: 4 x 3
riid day hour
* <dbl> <chr> <int>
1 5542 THURSDAY 12
2 5862 FRIDAY 15
3 5982 TUESDAY 15
4 6022 WEDNESDAY 16
Update:
Can you try below approach using data.table?
library(data.table)
dt <- setDT(df)[, c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") :=
list(ifelse(day=='MONDAY', hour, NA),
ifelse(day=='TUESDAY', hour, NA),
ifelse(day=='WEDNESDAY', hour, NA),
ifelse(day=='THURSDAY', hour, NA),
ifelse(day=='FRIDAY', hour, NA),
ifelse(day=='SATURDAY', hour, NA),
ifelse(day=='SUNDAY', hour, NA))][, !c("day","hour"), with=FALSE]
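The seven near-identical ifelse calls can be generated from a vector of day names instead of being written out. This is a base R sketch on a small local data frame that mimics the question's sample; it does not address running against the remote table. The trimws() call is an assumption based on the trailing spaces visible in the question's day column.

```r
df <- data.frame(
  riid = c(5542, 5862, 5982, 6022),
  day = c("THURSDAY ", "FRIDAY   ", "TUESDAY  ", "WEDNESDAY"),
  hour = c("12", "15", "15", "16"),
  stringsAsFactors = FALSE
)

days <- c("MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
          "FRIDAY", "SATURDAY", "SUNDAY")
day_clean <- trimws(df$day)

# One column per weekday: the hour where the row's day matches, NA elsewhere
wide_cols <- sapply(days, function(d) ifelse(day_clean == d, df$hour, NA_character_))
wide <- cbind(df["riid"], wide_cols, stringsAsFactors = FALSE)
wide
```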
