R transpose including NA - r

I have data like this:
trackingnumer = c(1,1,2,2,3)
date = c("2017-08-01", "2017-08-10", "2017-08-02", "2017-08-05", "2017-08-12")
scan = c("Pickup", "Delivered", "Pickup", "Delivered", "Delivered")
df = data.frame(trackingnumer, date, scan)
I want to transpose this data by trackingnumer:
df2 <- df %>%
group_by(trackingnumer) %>%
mutate(n = row_number()) %>%
{data.table::dcast(data = setDT(.), trackingnumer ~ n, value.var = c('date', 'scan'))}
I have tried this, but I couldn't get the desired outcome. I want date_1 to be the pickup date and date_2 to be the delivered date. As you can see, trackingnumer 3 doesn't have a pickup record, so I want its date_1 to be NA.

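Base R attempt, converting scan to a factor whose levels are in the desired order (relevel() would also work, but only if scan is already a factor, which is no longer the default for character columns since R 4.0):
reshape(
cbind(df, time=as.numeric(factor(df$scan, levels=c("Pickup", "Delivered")))),
idvar="trackingnumer", direction="wide", sep="_"
)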
# trackingnumer date_1 scan_1 date_2 scan_2
#1 1 2017-08-01 Pickup 2017-08-10 Delivered
#3 2 2017-08-02 Pickup 2017-08-05 Delivered
#5 3 <NA> <NA> 2017-08-12 Delivered

The problem is that row_number() in your mutate() just counts the rows within each group; it pays no attention to what is in them. case_when() lets you set the value of the "n" column from the value of "scan", so a "Delivered" row always gets 2 even when there is no "Pickup" row:
library(dplyr)
library(data.table)
df2 <- df %>%
group_by(trackingnumer) %>%
mutate(n = case_when(scan == "Pickup" ~ 1,
scan == "Delivered" ~ 2)) %>%
{data.table::dcast(data = setDT(.), trackingnumer ~ n, value.var = c('date', 'scan'))}
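To see the difference, here is a minimal sketch (assuming dplyr is installed; the column names n_row and n_scan are made up for illustration) that computes both indices side by side on the question's data:

```r
library(dplyr)

df <- data.frame(
  trackingnumer = c(1, 1, 2, 2, 3),
  date = c("2017-08-01", "2017-08-10", "2017-08-02", "2017-08-05", "2017-08-12"),
  scan = c("Pickup", "Delivered", "Pickup", "Delivered", "Delivered")
)

idx <- df %>%
  group_by(trackingnumer) %>%
  mutate(
    n_row = row_number(),                       # position within the group
    n_scan = case_when(scan == "Pickup" ~ 1,    # index derived from the scan type
                       scan == "Delivered" ~ 2)
  ) %>%
  ungroup()

# Compare the two indices for the tracking number without a pickup record
idx[idx$trackingnumer == 3, c("n_row", "n_scan")]
```

For tracking number 3, n_row is 1 while n_scan is 2, which is why only the second index leaves the "1" columns empty after casting.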

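Or with tidyr (this uses the older nest()/spread() interface, which current tidyr releases have deprecated or superseded):
library(dplyr)
library(tidyr)
library(purrr)  # map_lgl
library(tibble) # is_tibble, lst, tibble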
df %>% group_by(trackingnumer,scan2 = scan) %>%
nest(date,scan) %>%
spread(scan2,data) %>%
mutate_at(c("Delivered","Pickup"),~ifelse(map_lgl(.x,is_tibble),.x,lst(tibble(date=NA,scan=NA)))) %>%
unnest %>%
rename_at(c("date","scan"),paste0,2)
# # A tibble: 3 x 5
# trackingnumer date2 scan2 date1 scan1
# <dbl> <fctr> <fctr> <fctr> <fctr>
# 1 1 2017-08-10 Delivered 2017-08-01 Pickup
# 2 2 2017-08-05 Delivered 2017-08-02 Pickup
# 3 3 2017-08-12 Delivered <NA> <NA>
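With a current tidyr, the same result can probably be reached more directly with pivot_wider(). This is only a sketch, not the answerer's code, and it assumes that scan only ever takes the two values "Pickup" and "Delivered":

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  trackingnumer = c(1, 1, 2, 2, 3),
  date = c("2017-08-01", "2017-08-10", "2017-08-02", "2017-08-05", "2017-08-12"),
  scan = c("Pickup", "Delivered", "Pickup", "Delivered", "Delivered")
)

wide <- df %>%
  # Assumption: anything that is not "Pickup" is treated as "Delivered"
  mutate(n = if_else(scan == "Pickup", 1L, 2L)) %>%
  pivot_wider(
    id_cols = trackingnumer,
    names_from = n,
    values_from = c(date, scan),
    names_sort = TRUE   # keep the _1 columns before the _2 columns
  )

wide
```

Combinations that do not occur in the data (here, a pickup for tracking number 3) are filled with NA by pivot_wider().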

Related

Reshape () and modify_shape()

df <- data.frame(
code1 = c ("ZAZ","ZAZ","ZAZ","ZAZ","ZAZ","ZAZ","JOZ","JOZ","JOZ","JOZ","JOZ","JOZ","TSV","TSV"),
code2 = c("NAN","NAN","NAN","NAN","NAN","NAN","NAN","NAN","NAN","NAN","NAN","NAN","TSA","TSA"),
start = c("Date1.1","Date1.1","Date1.3","Date1.3","Date1.5","Date1.5","Date3.1","Date3.1","Date3.3","Date3.3","Date3.5","Date3.5","Date 5.1","Date 5.1"),
end = c("Date2.1","Date2.1","Date2.3","Date2.3","Date2.5","Date2.5","Date4.1","Date4.1","Date4.3","Date4.3","Date4.5","Date4.5","Date6.1","Date6.1"),
price = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2))
I'm trying to achieve:
I have so far done:
df <- df %>%
group_by(code1, code2,start,end) %>%
slice_min(price) # %>%
# group_modify()  # unsure what to put here
df <- df[order(df$price),]
All well explained in the image but in brief:
Group by code1, code2, start, end and select the smallest price for each group.
Reshape, sending start, end, price to separate columns (at most 3 start/end/price sets per key code1, code2).
I understand that this can be done within group_modify() but unsure how
Any help so much appreciated!
Brian
Here is one way using dplyr and tidyr libraries.
For each group (code1, code2, start and end) calculate the minimum value of price.
Create an index column for code1 and code2. This is to name start, end and price as start_1, start_2 etc.
Get the data in wide format using pivot_wider.
library(dplyr)
library(tidyr)
df %>%
group_by(code1, code2, start, end) %>%
summarise(price = min(price, na.rm = TRUE)) %>%
group_by(code1, code2) %>%
mutate(index = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = index, values_from = c(start, end, price),
names_vary = "slowest")
# code1 code2 start_1 end_1 price_1 start_2 end_2 price_2 start_3 end_3 price_3
# <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl>
#1 JOZ NAN Date3.1 Date4.1 1 Date3.3 Date4.3 3 Date3.5 Date4.5 5
#2 TSV TSA Date 5.1 Date6.1 1 NA NA NA NA NA NA
#3 ZAZ NAN Date1.1 Date2.1 1 Date1.3 Date2.3 3 Date1.5 Date2.5 5
Note that names_vary = "slowest" keeps the columns for each index together (start_1, end_1, price_1, ... instead of start_1, start_2, ..., end_1, end_2, ...).
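A small sketch of what names_vary changes, on a made-up two-row data frame (the data and the names long, fastest and slowest are only for illustration; names_vary needs tidyr 1.2.0 or later):

```r
library(tidyr)

# Toy long-format data: one key with two indexed records
long <- data.frame(
  key = c("a", "a"),
  index = c(1, 2),
  start = c("s1", "s2"),
  price = c(10, 20)
)

# Default (names_vary = "fastest"): all start_* columns, then all price_* columns
fastest <- pivot_wider(long, names_from = index, values_from = c(start, price))

# names_vary = "slowest": columns grouped by index instead
slowest <- pivot_wider(long, names_from = index, values_from = c(start, price),
                       names_vary = "slowest")

names(fastest)
names(slowest)
```

The values are identical in both results; only the column order differs.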
I guess you can try aggregate + reshape + ave (all from base R)
reshape(
transform(
aggregate(price ~ ., df, min),
id = ave(seq_along(price), code1, code2, FUN = seq_along)
),
direction = "wide",
idvar = c("code1", "code2"),
timevar = "id"
)
which gives
code1 code2 start.1 end.1 price.1 start.2 end.2 price.2 start.3 end.3
1 ZAZ NAN Date1.1 Date2.1 1 Date1.3 Date2.3 3 Date1.5 Date2.5
4 JOZ NAN Date3.1 Date4.1 1 Date3.3 Date4.3 3 Date3.5 Date4.5
7 TSV TSA Date5.1 Date6.1 1 <NA> <NA> NA <NA> <NA>
price.3
1 5
4 5
7 NA

How to output the sum of specific rows from one data frame to a new column in another data frame?

I would ultimately like to have df2 with certain dates and the cumulative sum of values connected to those date ranges from df1.
df1 = data.frame("date"=c("10/01/2020","10/02/2020","10/03/2020","10/04/2020","10/05/2020",
"10/06/2020","10/07/2020","10/08/2020","10/09/2020","10/10/2020"),
"value"=c(1:10))
df1
> df1
date value
1 10/01/2020 1
2 10/02/2020 2
3 10/03/2020 3
4 10/04/2020 4
5 10/05/2020 5
6 10/06/2020 6
7 10/07/2020 7
8 10/08/2020 8
9 10/09/2020 9
10 10/10/2020 10
df2 = data.frame("date"=c("10/05/2020","10/10/2020"))
df2
> df2
date
1 10/05/2020
2 10/10/2020
I realize this is incorrect, but I am not sure how to define df2$value as the sums of certain df1$value rows:
df2$value = filter(df1, c(sum(1:5),sum(6:10)))
df2
I would like the output to look like this:
> df2
date value
1 10/05/2020 15
2 10/10/2020 40
Here is another approach using dplyr, tidyr and lubridate (the dates are parsed as month/day/year, consistent with the other answers):
library(lubridate)
library(dplyr)
library(tidyr) # fill
df1 %>%
mutate(date = mdy(date)) %>%
mutate(date = if_else(date %in% as.Date(c("2020-10-05", "2020-10-10")), date, NA_Date_)) %>%
fill(date, .direction = "up") %>%
group_by(date) %>%
summarise(value = sum(value))
date value
<date> <int>
1 2020-10-05 15
2 2020-10-10 40
We may use a non-equi join after converting the 'date' columns to Date class
library(lubridate)
library(data.table)
setDT(df1)[, date := mdy(date)]
setDT(df2)[, date := mdy(date)]
df2[, start_date := fcoalesce(shift(date) + days(1), floor_date(date, 'month'))]
df1[df2,.(value = sum(value)), on = .( date >= start_date,
date <= date), by = .EACHI][, -1, with = FALSE]
date value
<Date> <int>
1: 2020-10-05 15
2: 2020-10-10 40
Or another option is to create a grouping variable with findInterval and then sum by group (this assumes the 'date' columns were already converted to Date class as above)
library(dplyr)
df1 %>%
group_by(grp = findInterval(date, df2$date, left.open = TRUE)) %>%
summarise(date = last(date), value = sum(value)) %>%
select(-grp)
-output
# A tibble: 2 × 2
date value
<date> <int>
1 2020-10-05 15
2 2020-10-10 40
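To make the grouping step easier to follow, here is a base R sketch of what findInterval returns for these dates. The variable names (dates, cutoffs, grp, sums) are made up, and the Dates are converted with as.numeric() only to keep the comparison explicit:

```r
# Ten consecutive days and the two cut-off dates from the question
dates <- seq(as.Date("2020-10-01"), as.Date("2020-10-10"), by = "day")
value <- 1:10
cutoffs <- as.Date(c("2020-10-05", "2020-10-10"))

# left.open = TRUE makes the intervals (cutoff[i], cutoff[i + 1]], so a date
# equal to a cut-off falls in the group that ends on that cut-off
grp <- findInterval(as.numeric(dates), as.numeric(cutoffs), left.open = TRUE)

# Sum the values within each group
sums <- tapply(value, grp, sum)
sums
```

Here the first five days land in group 0 and the last five in group 1, so the group sums are 15 and 40.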

How to Reshape Long to Wide While Preserving Some Variables in R [duplicate]

I have a data frame that is not really in a 'long form', but it is in a longer form than I would like. I would like to condense it into a 'wide form' that puts all the information associated with an id on one line. Right now, some of the information is repeated on each line (like the date in the example below), and other information needs to be preserved when the lines are consolidated (like the type column below). Thanks!
id <- c(1000, 1000, 1000, 1001, 1001, 1001)
type <- c("A", "B", "B", "C", "C", "A")
dates <- c("10/5/2019", "10/5/2019", "10/5/2019", "9/17/2020", "9/17/2020", "9/17/2020")
df <- as.data.frame(cbind(id, type, dates))
df
id type dates
1 1000 A 10/5/2019
2 1000 B 10/5/2019
3 1000 B 10/5/2019
4 1001 C 9/17/2020
5 1001 C 9/17/2020
6 1001 A 9/17/2020
I would like it to look like this:
Another option only using tidyverse:
library(tidyverse)
#Code
df %>% group_by(id) %>% mutate(idv=paste0('type.',1:n())) %>%
pivot_wider(names_from = idv,values_from=type)
Output:
# A tibble: 2 x 5
# Groups: id [2]
id dates type.1 type.2 type.3
<chr> <chr> <chr> <chr> <chr>
1 1000 10/5/2019 A B B
2 1001 9/17/2020 C C A
Or using row_number() (credits to @r2evans):
#Code 2
df %>% group_by(id) %>% mutate(idv=paste0('type.',row_number())) %>%
pivot_wider(names_from = idv,values_from=type)
Output:
# A tibble: 2 x 5
# Groups: id [2]
id dates type.1 type.2 type.3
<chr> <chr> <chr> <chr> <chr>
1 1000 10/5/2019 A B B
2 1001 9/17/2020 C C A
Here is a base R option using reshape
reshape(
within(df, num <- ave(1:nrow(df), id, FUN = seq_along)),
direction = "wide",
idvar = c("id", "dates"),
timevar = "num"
)
which gives
id dates type.1 type.2 type.3
1 1000 10/5/2019 A B B
4 1001 9/17/2020 C C A
We can use pivot_wider to reshape from 'long' to 'wide' after creating a sequence column with rowid (from data.table)
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
df %>%
mutate(rn = str_c('type.', rowid(id))) %>%
pivot_wider(names_from = rn, values_from = type)
-output
# A tibble: 2 x 5
# id dates type.1 type.2 type.3
# <chr> <chr> <chr> <chr> <chr>
#1 1000 10/5/2019 A B B
#2 1001 9/17/2020 C C A
Or only using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
df %>%
group_by(id) %>%
mutate(rn = str_c('type.', row_number())) %>%
pivot_wider(names_from = rn, values_from = type)
Or using data.table in a compact way
library(data.table)
dcast(setDT(df), id + dates ~ paste0('type.', rowid(id)), value.var = 'type')
-output
# id dates type.1 type.2 type.3
#1: 1000 10/5/2019 A B B
#2: 1001 9/17/2020 C C A
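All of the answers above rest on the same idea: a within-id counter that supplies the new column names. A minimal base R sketch of just that step (the names counter and col_names are made up for illustration):

```r
id <- c(1000, 1000, 1000, 1001, 1001, 1001)

# Running count within each id; plays the same role as data.table::rowid(id)
# or dplyr's row_number() after group_by(id)
counter <- ave(seq_along(id), id, FUN = seq_along)

# Column names that the reshaping step will spread the values into
col_names <- paste0("type.", counter)

counter
col_names
```

Each id restarts at 1, so the second id's first row maps to type.1 again rather than type.4.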

how to replace NA with the value that later input with same ID and date

I have data like the sample below. I want to fill each N/A with the result that is entered later for the same ID and Test_date, and keep only one record per ID per day.
What should I do?
Here is the code for the sample data:
ID <-c("1", "1", "1","2", "2")
Test_date <-c("2020-07-09", "2020-07-09","2020-07-09", "2020-07-07","2020-07-08")
Art <-c("N/A","D","N/A","N/A", "B")
PE<-c("N/A","N/A","B","A","N/A")
Sample.data <- data.frame(ID, Test_date, Art, PE)
In Base-R
First change character strings "N/A" to actual NA
Sample.data[Sample.data=="N/A"] <- NA
Now for the real meat of the answer (the formula interface of aggregate drops rows where the left-hand variable is NA):
merge(
aggregate(Art ~ ID + Test_date, Sample.data, paste),
aggregate(PE ~ ID + Test_date, Sample.data, paste),
all = TRUE
)
output:
ID Test_date Art PE
1 1 2020-07-09 D B
2 2 2020-07-07 <NA> A
3 2 2020-07-08 B <NA>
Using data.table:
library(data.table)
# Convert to data.table
setDT(Sample.data)
# Format NA properly as NA
Sample.data[, c("Art", "PE") := lapply(.SD, function(x) fifelse(x == "N/A", NA_character_, x)), .SDcols = c("Art", "PE")]
Sample.data[, .(Art[!is.na(Art)], PE[!is.na(PE)]), by = .(ID, Test_date)]
# ID Test_date V1 V2
# 1: 1 2020-07-09 D B
# 2: 2 2020-07-07 <NA> A
# 3: 2 2020-07-08 B <NA>
Alternatively:
Sample.data[, lapply(.SD, function(x) x[!is.na(x)]), by = .(ID, Test_date)]
(Edited to correct my misgrouping.)
I'm going to suggest a tidyverse solution to be expeditious, though this can be done (with a little more effort) in base R (and data.table).
A few tasks:
replace "N/A" (which is a completely valid and definite string) with NA (actually, NA_character_, since there are over six types of NA in R);
convert Test_date to a real Date class, and order by this;
fill up by group;
group by id/date and keep only one
The first few are done with
library(dplyr)
library(tidyr) # fill
Sample.data %>%
mutate(Test_date = as.Date(Test_date)) %>%
mutate_at(vars(Art, PE), ~ replace(., . == "N/A", NA_character_)) %>%
arrange(Test_date) %>%
group_by(ID, Test_date) %>%
tidyr::fill(., Art, PE, .direction = "up") %>%
ungroup()
# # A tibble: 5 x 4
# ID Test_date Art PE
# <chr> <date> <chr> <chr>
# 1 2 2020-07-07 <NA> A
# 2 2 2020-07-08 B <NA>
# 3 1 2020-07-09 D B
# 4 1 2020-07-09 D B
# 5 1 2020-07-09 <NA> B
though you need to think about what happens when your last observation is NA.
Now for your last point
and only keep one record for each ID each day
I'll expand the above with a little more. I'm going to infer first, but frankly you haven't provided enough information to know if it should be first, last, sum, max, row-with-the-fewest-NA-values, or whatever.
Sample.data %>%
mutate(Test_date = as.Date(Test_date)) %>%
mutate_at(vars(Art, PE), ~ replace(., . == "N/A", NA_character_)) %>%
arrange(Test_date) %>%
group_by(ID, Test_date) %>%
tidyr::fill(., Art, PE, .direction = "up") %>%
slice(1) %>%
ungroup()
# # A tibble: 3 x 4
# ID Test_date Art PE
# <chr> <date> <chr> <chr>
# 1 1 2020-07-09 D B
# 2 2 2020-07-07 <NA> A
# 3 2 2020-07-08 B <NA>
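If "first non-missing value per column" is the rule you want, the fill-then-slice steps can probably be collapsed into a single summarise(). This is a sketch under that assumption, not the only reasonable reading of the question (it needs dplyr 1.0 or later for across()):

```r
library(dplyr)

Sample.data <- data.frame(
  ID = c("1", "1", "1", "2", "2"),
  Test_date = c("2020-07-09", "2020-07-09", "2020-07-09", "2020-07-07", "2020-07-08"),
  Art = c("N/A", "D", "N/A", "N/A", "B"),
  PE = c("N/A", "N/A", "B", "A", "N/A")
)

res <- Sample.data %>%
  # Turn the literal string "N/A" into a real missing value
  mutate(across(c(Art, PE), ~ na_if(.x, "N/A"))) %>%
  group_by(ID, Test_date) %>%
  # First non-missing value in each column; indexing an empty vector with [1]
  # yields NA, so groups with no value at all stay NA
  summarise(across(c(Art, PE), ~ .x[!is.na(.x)][1]), .groups = "drop")

res
```

Unlike slice(1), this picks each column's value independently, so the kept Art and PE values may come from different original rows.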

Trying to merge two dataframes with specific conditions and gap in the rows in R

I have two dataframes (df1 and df2). I am working with dplyr to manipulate my data. However, I have some trouble finding the following result :
df1 contains some information about id, price, and date (id is not unique: a given id can have several prices)
df2 can tell if for a given id there has been a modification of the value of price and/or date in df1
I want to know if there has been a modification of price and/or date, and if that's the case, I want to take this new value as the price/date
However, both df1 and df2 can be a little tricky since you can have several modifications for a given id.
More specifically, for a given modification of price (if it exists, otherwise I take the price given in df1), I want to associate it with the last modification of date (if it exists, otherwise I take the date given in df1) as long as it is <= df1$date + 30
To sum it up, here's an example:
df1 <- data.frame(
Id = c(1,1,2),
price = c(1000,2000,1000),
date = c("2016-01-01","2016-09-01","2016-01-01")
)
df1
Id price date
1 1000 2016-01-01
1 2000 2016-09-01
2 1000 2016-01-01
And df2 is the following :
df2 <- data.frame(
Id = c(1,1,1,1,1,2,2),
price = c(1500,NA,2000,NA,3000,NA,NA),
date = c(NA, "2016-01-03", "2016-01-05", "2016-09-02","2016-09-03","2016-01-03","2016-01-05")
)
df2
Id price date
1 1500 <NA>
1 NA 2016-01-03
1 2000 2016-01-05
1 NA 2016-09-02
1 3000 2016-09-03
2 NA 2016-01-03
2 NA 2016-01-05
And the result I wish to have is something similar to this:
Id initial_price initial_date is_modification_price is_modification_date true_price true_date
1 1000 2016-01-01 TRUE TRUE 2000 2016-01-05
1 2000 2016-09-01 TRUE TRUE 3000 2016-09-03
2 1000 2016-01-01 FALSE TRUE 1000 2016-01-05
I hope I'm clear enough
Does anyone have an idea of how to implement this, or even a completely different approach?
First, prepare your dataframes:
library(dplyr)
# fix type
df1 <- mutate(df1, date = as.Date(date))
# fill NAs in df2
df2 <- df2 %>%
mutate(date = as.Date(date)) %>%
group_by(Id) %>%
tidyr::fill(price, date) %>%
ungroup
# fill remaining NAs with default values taken from df1
default_values <- df1 %>%
group_by(Id) %>%
slice(1) %>%
rename(price0 = price, date0 = date) %>%
ungroup
df2 <- df2 %>%
left_join(default_values, by = "Id") %>%
mutate(price = if_else(is.na(price), price0, price),
date = if_else(is.na(date), date0, date)) %>%
select(Id, price, date)
Then join:
df1 %>%
left_join(df2, by = "Id") %>%
filter(date.y <= date.x + 30) %>%
group_by(Id, price.x, date.x) %>%
arrange(date.y) %>%
slice(n()) %>%
ungroup %>%
rename(initial_price = price.x, initial_date = date.x,
true_price = price.y, true_date = date.y) %>%
mutate(is_modification_price = (initial_price != true_price),
is_modification_date = (initial_date != true_date))
# # A tibble: 3 x 7
# Id initial_price initial_date true_price true_date is_modification_price is_modification_date
# <dbl> <dbl> <date> <dbl> <date> <lgl> <lgl>
# 1 1 1000 2016-01-01 2000 2016-01-05 TRUE TRUE
# 2 1 2000 2016-09-01 3000 2016-09-03 TRUE TRUE
# 3 2 1000 2016-01-01 1000 2016-01-05 FALSE TRUE
Note that the left_join followed by filter in the last step builds every Id-level combination before discarding most of them, so it could take too much memory on large data. If that is the case, use the non-equi join functionality in data.table instead.
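As an alternative to data.table, dplyr 1.1.0 and later can express the same kind of inequality join with join_by(), which avoids materialising all the combinations first. The sketch below is a swapped-in technique, not the data.table code the note refers to; mods is a hand-typed copy of df2 as it looks after the fill/default step above, and the column names mod_price, mod_date and limit are made up:

```r
library(dplyr)

df1 <- data.frame(
  Id = c(1, 1, 2),
  price = c(1000, 2000, 1000),
  date = as.Date(c("2016-01-01", "2016-09-01", "2016-01-01"))
)

# Assumption: df2 after NAs were filled and defaults taken from df1
mods <- data.frame(
  Id = c(1, 1, 1, 1, 1, 2, 2),
  mod_price = c(1500, 1500, 2000, 2000, 3000, 1000, 1000),
  mod_date = as.Date(c("2016-01-01", "2016-01-03", "2016-01-05", "2016-09-02",
                       "2016-09-03", "2016-01-03", "2016-01-05"))
)

res <- df1 %>%
  mutate(limit = date + 30) %>%
  # For each df1 row, keep only the latest modification dated on or before limit
  left_join(mods, by = join_by(Id, closest(limit >= mod_date)))

res
```

If two modifications for the same Id share the same closest date, both are returned, so real data may still need a tie-breaking rule.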
