Spreading data over a date range from a column (R)

I have a set of survey data, where each survey covers multiple days. Here is an example of what the data looks like in the current form:
| Survey | Dates | Result |
|--------|--------------|--------|
| A | 11/30 - 12/1 | 33% |
| B | 12/2 - 12/4 | 26% |
| C | 12/4 - 12/5 | 39% |
This example can be made with the following:
frame <- data.frame(Survey = c('A', 'B', 'C'),
                    Dates = c('11/30 - 12/1', '12/2 - 12/4', '12/4 - 12/5'),
                    Result = c('33%', '26%', '39%'))
What I would like to do is make a column for each date, and if the date is within the range of the survey, to put the result in the cell. It would look something like this:
| Survey | 11/30 | 12/1 | 12/2 | 12/3 | 12/4 | 12/5 |
|--------|-------|------|------|------|------|------|
| A | 33% | 33% | | | | |
| B | | | 26% | 26% | 26% | |
| C | | | | | 39% | 39% |
Any help would be appreciated.

Here's an idea:
library(dplyr)
library(tidyr)
frame %>%
  separate_rows(Dates, sep = " - ") %>%
  mutate(Dates = as.Date(Dates, format = "%m/%d")) %>%
  group_by(Survey) %>%
  complete(Dates = seq(min(Dates), max(Dates), 1)) %>%
  fill(Result) %>%
  spread(Dates, Result)
Which gives:
# Survey `2017-11-30` `2017-12-01` `2017-12-02` `2017-12-03` `2017-12-04` `2017-12-05`
#* <fctr> <fctr> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 A 33% 33% NA NA NA NA
#2 B NA NA 26% 26% 26% NA
#3 C NA NA NA NA 39% 39%
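If you are on tidyr 1.0.0 or newer, the same pipeline also works with pivot_wider(), which supersedes spread(); a minimal sketch, assuming the same frame as above:
library(dplyr)
library(tidyr)
# Same idea as above, ending with pivot_wider() instead of the superseded spread()
frame %>%
  separate_rows(Dates, sep = " - ") %>%
  mutate(Dates = as.Date(Dates, format = "%m/%d")) %>%
  group_by(Survey) %>%
  complete(Dates = seq(min(Dates), max(Dates), by = "1 day")) %>%
  fill(Result) %>%
  ungroup() %>%
  pivot_wider(names_from = Dates, values_from = Result)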

A tidyverse solution but it requires that you play with the Dates column a bit:
#install.packages('tidyverse')
library(tidyverse)
dframe <- data.frame(Survey = c('A', 'B', 'C'),
                     Dates = c('11/30 - 12/1', '12/2 - 12/4', '12/4 - 12/5'),
                     Result = c('33%', '26%', '39%'), stringsAsFactors = F)
# Expand each range to all the dates it covers; sapply keeps Dates a character vector
dframe$Dates <- sapply(strsplit(dframe$Dates, split = " - "), function(x) {
  x <- strptime(x, "%m/%d")
  x <- seq(min(x), max(x), '1 day')
  paste0(strftime(x, "%m/%d"), collapse = " - ")
})
dframe %>%
  separate_rows(Dates, sep = " - ") %>%
  spread(Dates, Result)
Should get:
Survey 11/30 12/01 12/02 12/03 12/04 12/05
A 33% 33% <NA> <NA> <NA> <NA>
B <NA> <NA> 26% 26% 26% <NA>
C <NA> <NA> <NA> <NA> 39% 39%
I hope this helps.
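One caveat for both approaches: parsing with "%m/%d" fills in the current year (which is why the first answer's column names came out as 2017 dates), so a range that crosses a year boundary, such as 12/30 - 1/2, would not order correctly. A minimal sketch of pinning an explicit year before parsing; the year 2017 here is only an assumption for illustration:
# Attach a known survey year so the parsed dates do not depend on when the code runs
x <- c("11/30", "12/1")
as.Date(paste0("2017/", x), format = "%Y/%m/%d")
# [1] "2017-11-30" "2017-12-01"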

Related

R Studio: How to perform separate data wrangling procedures for different values of a variable into a list of individual dataframes?

I have a dataframe that looks like this:
+-----------+------------+--------+------------+
| Geography | Dates | Sales | Avg_Volume |
+-----------+------------+--------+------------+
| A | 2020-01-01 | | |
+-----------+------------+--------+------------+
| A | 2020-01-02 | | |
+-----------+------------+--------+------------+
| A | 2020-01-03 | | |
+-----------+------------+--------+------------+
| A | 2020-01-04 | | |
+-----------+------------+--------+------------+
| A | 2020-01-05 | | |
+-----------+------------+--------+------------+
| B | 2020-01-01 | | |
+-----------+------------+--------+------------+
| B | 2020-01-02 | | |
+-----------+------------+--------+------------+
| B | 2020-01-03 | | |
+-----------+------------+--------+------------+
| B | 2020-01-04 | | |
+-----------+------------+--------+------------+
| B | 2020-01-05 | | |
+-----------+------------+--------+------------+
| C | 2020-01-01 | | |
+-----------+------------+--------+------------+
| C | 2020-01-02 | | |
+-----------+------------+--------+------------+
| C | 2020-01-03 | | |
+-----------+------------+--------+------------+
| C | 2020-01-04 | | |
+-----------+------------+--------+------------+
| C | 2020-01-05 | | |
+-----------+------------+--------+------------+
| D | 2020-01-01 | | |
+-----------+------------+--------+------------+
| D | 2020-01-02 | | |
+-----------+------------+--------+------------+
| D | 2020-01-03 | | |
+-----------+------------+--------+------------+
| D | 2020-01-04 | | |
+-----------+------------+--------+------------+
| D | 2020-01-05 | | |
+-----------+------------+--------+------------+
I would like to have 3 dataframes dedicated to cities B, C, and D that look like this (I need A_Sales to always be present):
+------------+----------+---------+--------------+
| Dates | A_Sales | B_Sales | B_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
+------------+----------+---------+--------------+
| Dates | A_Sales | C_Sales | C_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
+------------+----------+---------+--------------+
| Dates | A_Sales | D_Sales | D_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
Currently this is what I have:
data_A <- data %>%
  filter(Geography == "A") %>%
  rename("A_Sales" = Sales) %>%
  select(Dates, A_Sales)
data_B <- data %>%
  filter(Geography == 'B') %>%
  rename("B_Sales" = Sales) %>%
  rename("B_Avg_Volume" = Avg_Volume) %>%
  select(Dates, B_Sales, B_Avg_Volume)
data_a_n_b <- data_A %>%
  left_join(data_B, by = 'Dates')
This is very redundant and inefficient, because I would have to change Geography == '...' to 'B', 'C', 'D', and so on every time and re-run. My real data has ~50 cities, so it is unrealistic to repeat this process for each city individually.
What is an elegant way to batch this process?
I am imagining the end result as a list of dataframes for cities B, C, D and so on, with the name of each individual dataframe being the city name. That way I can easily access each individual dataframe; for example, calling data_result$C (or something like that) would give me the dataframe for City C. Any other output format is also welcome, as long as accessing an individual dataframe is easy.
Thanks so much for your help!
Using purrr this could be achieved like so:
1. Split your df by Geography.
2. Loop over the list (except for region "A") and join the dfs to the one for region A.
3. Do some renaming.
set.seed(42)
dat <- data.frame(
  Geography = rep(LETTERS[1:4], each = 4),
  Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
  Sales = runif(4 * 4),
  Avg_Volume = runif(4 * 4)
)
library(purrr)
library(dplyr)
library(stringr)
dat_list <- dat %>%
  split(.$Geography) %>%
  map(select, -Geography)
imap(dat_list[setdiff(names(dat_list), "A")], function(x, y) {
  left_join(dat_list[["A"]], x, by = "Dates", suffix = c(paste0("_", y), "_A")) %>%
    rename_with(~ str_replace(.x, "(Sales|Avg_Volume)_(.*)", "\\2_\\1"), -Dates) %>%
    select(-A_Avg_Volume)
})
#> $B
#> Dates B_Sales B_Avg_Volume A_Sales
#> 1 2020-01-01 0.9148060 0.9782264 0.6417455
#> 2 2020-01-02 0.9370754 0.1174874 0.5190959
#> 3 2020-01-03 0.2861395 0.4749971 0.7365883
#> 4 2020-01-04 0.8304476 0.5603327 0.1346666
#>
#> $C
#> Dates C_Sales C_Avg_Volume A_Sales
#> 1 2020-01-01 0.9148060 0.9782264 0.6569923
#> 2 2020-01-02 0.9370754 0.1174874 0.7050648
#> 3 2020-01-03 0.2861395 0.4749971 0.4577418
#> 4 2020-01-04 0.8304476 0.5603327 0.7191123
#>
#> $D
#> Dates D_Sales D_Avg_Volume A_Sales
#> 1 2020-01-01 0.9148060 0.9782264 0.9346722
#> 2 2020-01-02 0.9370754 0.1174874 0.2554288
#> 3 2020-01-03 0.2861395 0.4749971 0.4622928
#> 4 2020-01-04 0.8304476 0.5603327 0.9400145
Created on 2021-02-05 by the reprex package (v1.0.0)
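Note that left_join()'s suffix argument applies its first element to the left table (here the region "A" data frame) and the second to the right, so in the call above the columns end up cross-labelled; that is why B_Sales, C_Sales and D_Sales all show the same numbers in the output above. A sketch with the suffixes flipped, otherwise identical to the pipeline above:
# Same loop, but with the suffixes in left-table-first order so region A's
# columns get "_A" and the looped region keeps its own letter
imap(dat_list[setdiff(names(dat_list), "A")], function(x, y) {
  left_join(dat_list[["A"]], x, by = "Dates", suffix = c("_A", paste0("_", y))) %>%
    rename_with(~ str_replace(.x, "(Sales|Avg_Volume)_(.*)", "\\2_\\1"), -Dates) %>%
    select(-A_Avg_Volume)
})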
I took Stefan's setup dataframe and added another way to do it. The steps are:
1. Get the list of city names (excluding A). The way I wrote it assumes A is first, but you could also use discard() to remove "A" from the city list.
2. Use map with filter to get a list of data frames that each contain A plus one city in cities, and set_names so each list element is accessible by its city name.
3. pivot_wider each data frame in the list, then drop the Avg_Volume column for A.
#Set up a sample data frame
library(dplyr)
set.seed(42)
dat <- tibble(
  Geography = rep(LETTERS[1:4], each = 4),
  Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
  Sales = runif(4 * 4),
  Avg_Volume = runif(4 * 4)
)
#Code to wrangle into a list of filtered, wide-format data frames
library(dplyr)
library(tidyr)
library(purrr)
cities <- unique(dat$Geography)[-1]
dat_list <- map(cities, ~ filter(dat, Geography == "A" | Geography == .x)) %>%
  set_names(cities)
dat_list_wider <- map(dat_list,
                      ~ pivot_wider(.x, id_cols = "Dates",
                                    names_from = "Geography",
                                    values_from = c("Sales", "Avg_Volume")) %>%
                        select(-Avg_Volume_A))
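Either way you end up with a named list, so each city's wide data frame can be pulled out by name, which is the access pattern the question asked for:
# Access an individual city's data frame by its name
dat_list_wider$B
dat_list_wider[["C"]]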

Grouping by column and finding the preceding value of another column

I have a very long sales data, below an exemplary excerpt:
| Date       | CountryA | CountryB | PriceA | PriceB |
+------------+----------+----------+--------+--------+
| 05/09/2019 | US       | Japan    | 20     | 55     |
| 28/09/2019 | Japan    | Germany  | 30     | 28     |
| 16/10/2019 | Canada   | US       | 25     | 78     |
| 28/10/2019 | Germany  | Japan    | 60     | 17     |
+------------+----------+----------+--------+--------+
I would like to group on column "CountryB" and then generate a new column which displays the preceding value of PriceA of that respective country, i.e. when that specific country was present in column "CountryA" the last time based on date order. In this exemplary table, I want to get the following results:
| Date       | CountryA | CountryB | PriceA | PriceB | PriceA_lag1 |
+------------+----------+----------+--------+--------+-------------+
| 05/09/2019 | US       | Japan    | 20     | 55     |             |
| 28/09/2019 | Japan    | Germany  | 30     | 28     |             |
| 16/10/2019 | Canada   | US       | 25     | 78     | 20          |
| 28/10/2019 | Germany  | Japan    | 60     | 17     | 30          |
+------------+----------+----------+--------+--------+-------------+
I have tried the following with dplyr:
data = data %>%
  group_by(CountryB) %>%
  mutate_at(list(lag1 = ~ dplyr::lag(., 1, order_by = Date)), .vars = vars(PriceA))
However this does not give me the preceding value when the respective country is in column "CountryA", but rather when the respective country is in "CountryB".
Can someone please help me out on this one?
Thanks.
Quite possibly some of the ugliest code I've written, but...
# install.packages(c('dplyr', 'magrittr'))
library(dplyr)
library(magrittr)
d <- data.frame(
  stringsAsFactors = FALSE,
  Date = c("05/09/2019", "28/09/2019", "16/10/2019", "28/10/2019"),
  CountryA = c("US", "Japan", "Canada", "Germany"),
  CountryB = c("Japan", "Germany", "US", "Japan"),
  PriceA = c(20L, 30L, 25L, 60L),
  PriceB = c(55L, 28L, 78L, 17L)
) %>%
  mutate(Date = as.Date(Date, format = '%d/%m/%Y'))
priceA_lag <- c()
for (row in 1:nrow(d)) {
  country <- slice(d, row) %$% CountryB
  date <- slice(d, row) %$% Date
  # most recent earlier row where this row's CountryB appeared in CountryA
  thePrice <- d %>%
    filter(CountryA == country,
           date > Date) %>%
    filter(Date == max(Date)) %$%
    PriceA
  thePrice <- ifelse(length(thePrice) > 0, thePrice, NA)
  priceA_lag <- priceA_lag %>%
    append(thePrice)
}
d$priceA_lag <- priceA_lag
> d
Date CountryA CountryB PriceA PriceB priceA_lag
1 2019-09-05 US Japan 20 55 NA
2 2019-09-28 Japan Germany 30 28 NA
3 2019-10-16 Canada US 25 78 20
4 2019-10-28 Germany Japan 60 17 30
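For a long sales table, the same lookup can also be written without an explicit loop as a self-join; a rough sketch, assuming the d built above and dplyr >= 1.0 (for slice_max()):
library(dplyr)
# Pair every row with all earlier rows where its CountryB appeared as CountryA,
# then keep only the most recent of those matches per row
lagged <- d %>%
  mutate(row_id = row_number()) %>%
  select(row_id, CountryB, Date_B = Date) %>%
  inner_join(d %>% select(CountryA, Date_A = Date, PriceA),
             by = c("CountryB" = "CountryA")) %>%
  filter(Date_A < Date_B) %>%
  group_by(row_id) %>%
  slice_max(Date_A, n = 1) %>%
  ungroup() %>%
  select(row_id, PriceA_lag1 = PriceA)
d %>%
  mutate(row_id = row_number()) %>%
  left_join(lagged, by = "row_id") %>%
  select(-row_id)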

parse values based on groups in R

I have a very large dataset and a sample of that looks something like the one below:
| Id | Name | Start_Date | End_Date |
|----|---------|------------|------------|
| 10 | Mark | 4/2/1999 | 7/5/2018 |
| 10 | | 1/1/2000 | 9/24/2018 |
| 25 | | 5/3/1968 | 6/3/2000 |
| 25 | | 6/6/2009 | 4/23/2010 |
| 25 | Anthony | 2/20/2010 | 7/21/2016 |
| 25 | | 9/12/2014 | 11/26/2019 |
I need to parse the names from the Name column based on their Id, such that the output table looks like:
| Id | Name | Start_Date | End_Date |
|----|---------|------------|------------|
| 10 | Mark | 4/2/1999 | 7/5/2018 |
| 10 | Mark | 1/1/2000 | 9/24/2018 |
| 25 | Anthony | 5/3/1968 | 6/3/2000 |
| 25 | Anthony | 6/6/2009 | 4/23/2010 |
| 25 | Anthony | 2/20/2010 | 7/21/2016 |
| 25 | Anthony | 9/12/2014 | 11/26/2019 |
How can I achieve an output as shown above? I went through the substitute and parse functions, but was unable to understand how they apply to this problem.
My dataset would be:
df = data.frame(Id = c("10", "10", "25", "25", "25", "25"),
                Name = c("Mark", "", "", "", "Anthony", ""),
                Start_Date = c("4/2/1999", "1/1/2000", "5/3/1968", "6/6/2009", "2/20/2010", "9/12/2014"),
                End_Date = c("7/5/2018", "9/24/2018", "6/3/2000", "4/23/2010", "7/21/2016", "11/26/2019"))
We can change the blanks ("") to NA and use fill to replace the NA elements with the previous non-NA element
library(dplyr)
library(tidyr)
df1 %>%
  mutate(Name = na_if(Name, "")) %>%
  group_by(Id) %>%
  fill(Name, .direction = "down") %>%
  fill(Name, .direction = "up")
# A tibble: 6 x 4
# Groups: Id [2]
# Id Name Start_Date End_Date
# <chr> <chr> <chr> <chr>
#1 10 Mark 4/2/1999 7/5/2018
#2 10 Mark 1/1/2000 9/24/2018
#3 25 Anthony 5/3/1968 6/3/2000
#4 25 Anthony 6/6/2009 4/23/2010
#5 25 Anthony 2/20/2010 7/21/2016
#6 25 Anthony 9/12/2014 11/26/2019
In the devel version of tidyr (‘0.8.3.9000’), this can be done in a single fill statement as .direction = "downup" is also an option
df1 %>%
  mutate(Name = na_if(Name, "")) %>%
  group_by(Id) %>%
  fill(Name, .direction = "downup")
Or another option is to group by 'Id', and mutate the 'Name' as the first non-blank element
df1 %>%
  group_by(Id) %>%
  mutate(Name = first(Name[Name != ""]))
# A tibble: 6 x 4
# Groups: Id [2]
# Id Name Start_Date End_Date
# <chr> <chr> <chr> <chr>
#1 10 Mark 4/2/1999 7/5/2018
#2 10 Mark 1/1/2000 9/24/2018
#3 25 Anthony 5/3/1968 6/3/2000
#4 25 Anthony 6/6/2009 4/23/2010
#5 25 Anthony 2/20/2010 7/21/2016
#6 25 Anthony 9/12/2014 11/26/2019
data
df1 <- structure(list(Id = c("10", "10", "25", "25", "25", "25"), Name = c("Mark",
"", "", "", "Anthony", ""), Start_Date = c("4/2/1999", "1/1/2000",
"5/3/1968", "6/6/2009", "2/20/2010", "9/12/2014"), End_Date = c("7/5/2018",
"9/24/2018", "6/3/2000", "4/23/2010", "7/21/2016", "11/26/2019"
)), class = "data.frame", row.names = c(NA, -6L))
Using DF defined reproducibly in the Note at the end, replace each zero-length element of Name with NA and then use na.omit to get the unique non-NA to use to fill. We have assumed that there is only one non-NA per Id which is the case in the question. If not we could replace na.omit with function(x) unique(na.omit(x)) assuming that the non-NAs are all the same within Id. No packages are used.
transform(DF, Name = ave(replace(Name, !nzchar(Name), NA), Id, FUN = na.omit))
giving:
Id Name Start_Date End_Date
1 10 Mark 4/2/1999 7/5/2018
2 10 Mark 1/1/2000 9/24/2018
3 25 Anthony 5/3/1968 6/3/2000
4 25 Anthony 6/6/2009 4/23/2010
5 25 Anthony 2/20/2010 7/21/2016
6 25 Anthony 9/12/2014 11/26/2019
na.strings
We can simplify this slightly if we make sure that the zero-length elements of Name are NA in the first place. We replace the read.table line in the Note with the first line below. Then it is just a matter of applying na.omit within ave.
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, sep = "|",
strip.white = TRUE, na.strings = "")
transform(DF, Name = ave(Name, Id, FUN = na.omit))
Note
The input in reproducible form:
Lines <- "
Id | Name | Start_Date | End_Date
10 | Mark | 4/2/1999 | 7/5/2018
10 | | 1/1/2000 | 9/24/2018
25 | | 5/3/1968 | 6/3/2000
25 | | 6/6/2009 | 4/23/2010
25 | Anthony | 2/20/2010 | 7/21/2016
25 | | 9/12/2014 | 11/26/2019"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, sep = "|", strip.white = TRUE)

How can I take two columns in R and flatten them like the below?

Before
+---------+------------------------------------+
| Word | Tags |
+---------+------------------------------------+
| morning | #sunrise #droplets #waterdroplets |
| morning | #sky #ocean #droplets |
+---------+------------------------------------+
After
+---------+---------------+
| Word | Tags |
+---------+---------------+
| morning | sunrise |
| morning | droplets |
| morning | waterdroplets |
| morning | sky |
| morning | ocean |
| morning | droplets |
+---------+---------------+
Notice how I want to keep droplets appearing twice. This table is very large (over 5 million rows), so an efficient method would be very helpful. Thanks!
We can use separate_rows from tidyr.
library(dplyr)
library(tidyr)
dat <- tribble(
  ~Word,     ~Tags,
  "morning", "#sunrise #droplets #waterdroplets",
  "morning", "#sky #ocean #droplets"
)
dat2 <- dat %>%
  separate_rows(Tags, sep = " #") %>%
  mutate(Tags = gsub("#", "", Tags))
dat2
# # A tibble: 6 x 2
# Word Tags
# <chr> <chr>
# 1 morning sunrise
# 2 morning droplets
# 3 morning waterdroplets
# 4 morning sky
# 5 morning ocean
# 6 morning droplets
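Since the real table has ~5 million rows, a data.table sketch may be worth benchmarking against the tidyr version; it splits every Tags string in one pass and assumes the same dat as above, with single spaces between tags:
library(data.table)
# Split each Tags string, strip the leading "#", and repeat Word alongside
DT <- as.data.table(dat)[, row := .I]
dat2 <- DT[, .(Tags = sub("#", "", unlist(strsplit(Tags, " ", fixed = TRUE)), fixed = TRUE)),
           by = .(row, Word)][, row := NULL][]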

Populating column based on row matches without for loop

Is there a way to obtain the annual count values based on the state, species, and year, without using a for loop?
| Name | State | Age | Species    | Annual Ct |
|------|-------|-----|------------|-----------|
| Nemo | NY    | 5   | Clownfish  | ?         |
| Dora | CA    | 2   | Regal Tang | ?         |
Lookup table:
| State | Species    | Year | AnnualCt |
|-------|------------|------|----------|
| NY    | Clownfish  | 2012 | 500      |
| NY    | Clownfish  | 2014 | 200      |
| CA    | Regal Tang | 2001 | 400      |
| CA    | Regal Tang | 2014 | 680      |
| CA    | Regal Tang | 2000 | 700      |
The output would be:
| Name | State | Age | Species    | Annual Ct |
|------|-------|-----|------------|-----------|
| Nemo | NY    | 5   | Clownfish  | 200       |
| Dora | CA    | 2   | Regal Tang | 680       |
What I've tried:
pets <- data.frame("Name" = c("Nemo", "Dora"), "State" = c("NY", "CA"),
                   "Age" = c(5, 2), "Species" = c("Clownfish", "Regal Tang"))
fishes <- data.frame("State" = c("NY", "NY", "CA", "CA", "CA"),
                     "Species" = c("Clownfish", "Clownfish", "Regal Tang",
                                   "Regal Tang", "Regal Tang"),
                     "Year" = c("2012", "2014", "2001", "2014", "2000"),
                     "AnnualCt" = c("500", "200", "400", "680", "700"))
pets["AnnualCt"] <- NA
for (row in (1:nrow(pets))) {
  pets$AnnualCt[row] <- as.character(droplevels(fishes[which(fishes$State == pets[row, ]$State &
                                                               fishes$Species == pets[row, ]$Species &
                                                               fishes$Year == 2014),
                                                       which(colnames(fishes) == "AnnualCt")]))
}
I'm confused as to what you're trying to do; isn't this just this?
library(dplyr);
left_join(pets, fishes) %>%
filter(Year == 2014) %>%
select(-Year);
#Joining, by = c("State", "Species")
# Name State Age Species AnnualCt
#1 Nemo NY 5 Clownfish 200
#2 Dora CA 2 Regal Tang 680
Explanation: left_join both data.frames by State and Species, filter for Year == 2014 and output without Year column.
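A small variant worth considering: filtering fishes down to Year == "2014" before the join keeps any pet with no 2014 record as an NA row instead of dropping it in the later filter. A sketch with the same pets and fishes (like the answer above, it assumes pets without the AnnualCt column added by the loop attempt):
library(dplyr)
# Restrict the lookup table to 2014 first, then attach it to pets
fishes %>%
  filter(Year == "2014") %>%
  select(State, Species, AnnualCt) %>%
  right_join(pets, by = c("State", "Species")) %>%
  select(Name, State, Age, Species, AnnualCt)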
