Using pivot_wider to get true or false [duplicate] - r

This question already has answers here:
Aggregate and reshape from long to wide
(2 answers)
Closed 2 years ago.
I'm trying to use pivot_wider to get a binary result for each country in each year between 1991 - 1995 like this table:
+------+-------+--------+--------+
| year | USA | Israel | Sweden |
| 1991 | FALSE | TRUE | TRUE |
| 1992 | FALSE | FALSE | TRUE |
| 1993 | FALSE | TRUE | TRUE |
| 1994 | FALSE | FALSE | TRUE |
| 1995 | TRUE | TRUE | TRUE |
+------+-------+--------+--------+
Of course, any binary indication will be great, besides true/false.
However, my data frame looks like:
country = c("Sweden", "Sweden", "Sweden", "Sweden", "Sweden", "Israel", "Israel",
"Israel", "USA")
year = c(1991,1992,1993,1994,1995,1991,1993,1995,1995)
df = as.data.frame(cbind(year,country))
df
+---------+------+
| country | Year |
| Sweden | 1991 |
| Sweden | 1992 |
| Sweden | 1993 |
| Sweden | 1994 |
| Sweden | 1995 |
| Israel | 1991 |
| Israel | 1993 |
| Israel | 1995 |
| USA | 1995 |
+---------+------+
I tried the following code and obtained the result below, which is not what I'm looking for:
library(dplyr)
library(tidyr)
df2 = df %>%
group_by(country) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = country, values_from = year) %>%
select(-row)
df2
+------+--------+--------+
| USA | Israel | Sweden |
| 1995 | 1991 | 1991 |
| NA | 1993 | 1992 |
| NA | 1995 | 1993 |
| NA | NA | 1994 |
| NA | NA | 1995 |
+------+--------+--------+

You can try this:
library(dplyr)
library(tidyr)
df %>% mutate(val=1) %>% pivot_wider(names_from = country,values_from = val) %>%
mutate(across(-year, ~replace_na(.x, 0))) %>%
mutate(across(-year, ~ifelse(.x==1, TRUE,FALSE)))
Output:
# A tibble: 5 x 4
year Sweden Israel USA
<fct> <lgl> <lgl> <lgl>
1 1991 TRUE TRUE FALSE
2 1992 TRUE FALSE FALSE
3 1993 TRUE TRUE FALSE
4 1994 TRUE FALSE FALSE
5 1995 TRUE TRUE TRUE
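A shorter variant of the same idea, as a sketch: fill the new columns with TRUE directly and let pivot_wider supply FALSE for the missing year/country pairs through values_fill. The column name present is only an illustrative helper, and the data frame is rebuilt here with data.frame() (an assumption, not the asker's cbind() construction) so that year stays numeric.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, built with data.frame() so year stays numeric
df <- data.frame(
  year = c(1991, 1992, 1993, 1994, 1995, 1991, 1993, 1995, 1995),
  country = c("Sweden", "Sweden", "Sweden", "Sweden", "Sweden",
              "Israel", "Israel", "Israel", "USA"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  mutate(present = TRUE) %>%                       # helper column: every observed pair is TRUE
  pivot_wider(names_from = country,
              values_from = present,
              values_fill = list(present = FALSE)) # unobserved pairs become FALSE
wide
```

This avoids the intermediate 1/0 coding and the two across() passes, at the cost of relying on values_fill.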

Here is a data.table solution:
library( data.table )
# custom function: determines whether the length of a vector is > 0 (TRUE/FALSE)
cust_fun <- function(x) length(x) > 0
#cast to wide, aggregating with the custom function above
dcast( setDT(df), year ~ country, fun.aggregate = cust_fun )
# year Israel Sweden USA
# 1: 1991 TRUE TRUE FALSE
# 2: 1992 FALSE TRUE FALSE
# 3: 1993 TRUE TRUE FALSE
# 4: 1994 FALSE TRUE FALSE
# 5: 1995 TRUE TRUE TRUE
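For comparison, the same presence/absence table can probably be built in base R without any packages. This is a minimal sketch, assuming the data frame is built with data.frame() as below: table() counts each year/country combination and the comparison with 0 turns the counts into TRUE/FALSE.

```r
df <- data.frame(
  year = c(1991, 1992, 1993, 1994, 1995, 1991, 1993, 1995, 1995),
  country = c("Sweden", "Sweden", "Sweden", "Sweden", "Sweden",
              "Israel", "Israel", "Israel", "USA"),
  stringsAsFactors = FALSE
)

# table() counts year/country combinations; unclass() leaves a plain integer matrix
counts <- unclass(table(df$year, df$country))

# A count above zero means the country was observed in that year
present <- as.data.frame(counts > 0)

# Move the years from the row names into a regular column
present <- cbind(year = as.numeric(rownames(present)), present)
rownames(present) <- NULL
present
```

As with dcast, the country columns come out in alphabetical order (Israel, Sweden, USA).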

Related

R Studio: How to perform separate data wrangling procedures for different values of a variable into a list of individual dataframes?

I have a dataframe that looks like this:
+-----------+------------+--------+------------+
| Geography | Dates | Sales | Avg_Volume |
+-----------+------------+--------+------------+
| A | 2020-01-01 | | |
+-----------+------------+--------+------------+
| A | 2020-01-02 | | |
+-----------+------------+--------+------------+
| A | 2020-01-03 | | |
+-----------+------------+--------+------------+
| A | 2020-01-04 | | |
+-----------+------------+--------+------------+
| A | 2020-01-05 | | |
+-----------+------------+--------+------------+
| B | 2020-01-01 | | |
+-----------+------------+--------+------------+
| B | 2020-01-02 | | |
+-----------+------------+--------+------------+
| B | 2020-01-03 | | |
+-----------+------------+--------+------------+
| B | 2020-01-04 | | |
+-----------+------------+--------+------------+
| B | 2020-01-05 | | |
+-----------+------------+--------+------------+
| C | 2020-01-01 | | |
+-----------+------------+--------+------------+
| C | 2020-01-02 | | |
+-----------+------------+--------+------------+
| C | 2020-01-03 | | |
+-----------+------------+--------+------------+
| C | 2020-01-04 | | |
+-----------+------------+--------+------------+
| C | 2020-01-05 | | |
+-----------+------------+--------+------------+
| D | 2020-01-01 | | |
+-----------+------------+--------+------------+
| D | 2020-01-02 | | |
+-----------+------------+--------+------------+
| D | 2020-01-03 | | |
+-----------+------------+--------+------------+
| D | 2020-01-04 | | |
+-----------+------------+--------+------------+
| D | 2020-01-05 | | |
+-----------+------------+--------+------------+
I would like to have 3 dataframes dedicated to Cities B, C and D that look like this (I need A_Sales to always be present):
+------------+----------+---------+--------------+
| Dates | A_Sales | B_Sales | B_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
+------------+----------+---------+--------------+
| Dates | A_Sales | C_Sales | C_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
+------------+----------+---------+--------------+
| Dates | A_Sales | D_Sales | D_Avg_Volume |
+------------+----------+---------+--------------+
| 2020-01-01 | | | |
+------------+----------+---------+--------------+
| 2020-01-02 | | | |
+------------+----------+---------+--------------+
| 2020-01-03 | | | |
+------------+----------+---------+--------------+
| 2020-01-04 | | | |
+------------+----------+---------+--------------+
| 2020-01-05 | | | |
+------------+----------+---------+--------------+
Currently this is what I have:
data_A <- data %>%
filter(Geography == "A") %>%
rename("A_Sales" = Sales) %>%
select(Dates, A_Sales)
data_B <- data %>%
filter(Geography == 'B') %>%
rename("B_Sales" = Sales)%>%
rename("B_Avg_Volume" = Avg_Volume)%>%
select(Dates, B_Sales, B_Avg_Volume)
data_a_n_b <- data_A %>%
left_join(data_B, by = 'Dates')
This is very redundant and inefficient, because I would have to change Geography == '...' to "B", "C", "D", ... every time and re-run. My real data has ~50 cities, so it is unrealistic for me to repeat this process for each city individually.
What is an elegant way to batch this process?
I am imagining the end result to be a list of dataframes for cities B, C, D and so on, with each dataframe named after its city. This way I can easily access each individual dataframe. For example, calling data_result$C (or something like that) would give me the dataframe for City C. Any other output format is also welcome, as long as accessing an individual dataframe is easy.
Thanks so much for your help!
Using purrr this could be achieved like so:
Split your df by Geography
Loop over the list (except for region "A") and join the dfs to the one for region A
Do some renaming
set.seed(42)
dat <- data.frame(
Geography = rep(LETTERS[1:4], each = 4),
Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
Sales = runif(4 * 4),
Avg_Volume = runif(4 * 4)
)
library(purrr)
library(dplyr)
library(stringr)
dat_list <- dat %>%
split(.$Geography) %>%
map(select, -Geography)
imap(dat_list[setdiff(names(dat_list), "A")], function(x, y) {
# suffix[1] applies to the left table (region A), suffix[2] to the right table (city y)
left_join(dat_list[["A"]], x, by = "Dates", suffix = c("_A", paste0("_", y))) %>%
rename_with(~ str_replace(.x, "(Sales|Avg_Volume)_(.*)", "\\2_\\1"), -Dates) %>%
select(-A_Avg_Volume)
})
# Returns a list with elements $B, $C and $D; each element has the columns
# Dates, A_Sales, <city>_Sales and <city>_Avg_Volume.
Created on 2021-02-05 by the reprex package (v1.0.0)
I took Stefan's setup dataframe and added another way to do it. The steps are:
Get a vector of city names (excluding A). The way I wrote it assumes A is first, but you could also use discard() to remove "A" from the city list.
Use map with filter to get a list of data frames that each contain A and one city in cities. Use set_names to make sure each list element is accessible by its city name.
Take each data frame in the list and pivot_wider, then select everything but Avg_Volume for A.
#Set up a sample data frame
library(dplyr)
set.seed(42)
dat <- tibble(
Geography = rep(LETTERS[1:4], each = 4),
Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
Sales = runif(4 * 4),
Avg_Volume = runif(4 * 4)
)
#Code to wrangle into list of filtered, wide format data frames
library(dplyr)
library(tidyr)
library(purrr)
cities <- unique(dat$Geography)[-1]
dat_list <- map(cities, ~ filter(dat, Geography == "A" | Geography == .x)) %>% set_names(cities)
dat_list_wider <- map(dat_list,
~pivot_wider(.x, id_cols = "Dates",
names_from = "Geography",
values_from = c("Sales","Avg_Volume")) %>%
select(-Avg_Volume_A))
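A third possible variant, sketched under the same sample data: rename each city's columns before joining, so the result does not depend on join suffixes, and name the list by city so that result$C returns the data frame for City C. The names ref and result are illustrative only.

```r
library(dplyr)
library(purrr)

set.seed(42)
dat <- data.frame(
  Geography = rep(LETTERS[1:4], each = 4),
  Dates = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-04"), by = "1 day"), 4),
  Sales = runif(4 * 4),
  Avg_Volume = runif(4 * 4)
)

# Reference table: the A_Sales column that every result should carry
ref <- dat %>%
  filter(Geography == "A") %>%
  select(Dates, A_Sales = Sales)

cities <- setdiff(unique(dat$Geography), "A")

result <- cities %>%
  set_names() %>%                                  # name each list element after its city
  map(function(city) {
    dat %>%
      filter(Geography == city) %>%
      select(Dates, Sales, Avg_Volume) %>%
      rename_with(~ paste0(city, "_", .x), -Dates) %>%  # e.g. Sales -> C_Sales
      left_join(ref, ., by = "Dates")
  })

result$C
```

Because the list is named, names(result) lists the available cities and result[["D"]] works just as well as result$D.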

Weekly Weight Based on a category using dplyr in R

I have the following data and am looking to create the "Final Col" shown below using dplyr in R. I would appreciate your ideas.
| Year | Week | MainCat|Qty |Final Col |
|:----: |:------: |:-----: |:-----:|:------------:|
| 2017 | 1 | Edible |69 |69/(69+12) |
| 2017 | 2 | Edible |12 |12/(69+12) |
| 2017 | 1 | Flowers|88 |88/(88+47) |
| 2017 | 2 | Flowers|47 |47/(88+47) |
| 2018 | 1 | Edible |90 |90/(90+35) |
| 2018 | 2 | Edible |35 |35/(90+35) |
| 2018 | 1 | Flowers|78 |78/(78+85) |
| 2018 | 2 | Flowers|85 |85/(78+85) |
This can be done with a group_by operation: group by 'Year' and 'MainCat', then divide 'Qty' by the sum of 'Qty' within each group to create the 'Final' column.
library(dplyr)
df1 <- df1 %>%
group_by(Year, MainCat) %>%
mutate(Final = Qty/sum(Qty))
You can use prop.table :
library(dplyr)
df %>% group_by(Year, MainCat) %>% mutate(Final = prop.table(Qty))
# Year Week MainCat Qty Final
# <int> <int> <chr> <int> <dbl>
#1 2017 1 Edible 69 0.852
#2 2017 2 Edible 12 0.148
#3 2017 1 Flowers 88 0.652
#4 2017 2 Flowers 47 0.348
#5 2018 1 Edible 90 0.72
#6 2018 2 Edible 35 0.28
#7 2018 1 Flowers 78 0.479
#8 2018 2 Flowers 85 0.521
You can also do this in base R :
df$Final <- with(df, ave(Qty, Year, MainCat, FUN = prop.table))
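A quick sanity check, as a self-contained sketch: rebuild the table from the question, compute the weights, and confirm that they sum to 1 within every Year/MainCat group. The names weights and check are illustrative only.

```r
library(dplyr)

df <- data.frame(
  Year = rep(c(2017L, 2018L), each = 4),
  Week = rep(1:2, times = 4),
  MainCat = rep(rep(c("Edible", "Flowers"), each = 2), times = 2),
  Qty = c(69L, 12L, 88L, 47L, 90L, 35L, 78L, 85L),
  stringsAsFactors = FALSE
)

# Weight of each week within its Year/MainCat group
weights <- df %>%
  group_by(Year, MainCat) %>%
  mutate(Final = Qty / sum(Qty)) %>%
  ungroup()

# One row per group; Total should be 1 for every group
check <- weights %>%
  group_by(Year, MainCat) %>%
  summarise(Total = sum(Final), .groups = "drop")
check
```

The first weight is 69 / (69 + 12), matching the first row of the "Final Col" in the question.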

Grouping by column and finding preceding value of another column

I have a very long sales dataset; below is an exemplary excerpt:
| Date | CountryA | CountryB | PriceA | PriceB | |
+------------+----------+----------+--------+--------+--+
| 05/09/2019 | US | Japan | 20 | 55 | |
| 28/09/2019 | Japan | Germany | 30 | 28 | |
| 16/10/2019 | Canada | US | 25 | 78 | |
| 28/10/2019 | Germany | Japan | 60 | 17 | |
+------------+----------+----------+--------+--------+--+
I would like to group on column "CountryB" and then generate a new column which displays the preceding value of PriceA of that respective country, i.e. when that specific country was present in column "CountryA" the last time based on date order. In this exemplary table, I want to get the following results:
| Date | CountryA | CountryB | PriceA | PriceB | PriceA_lag1 | |
+------------+----------+----------+--------+--------+-------------+--+
| 05/09/2019 | US | Japan | 20 | 55 | | |
| 28/09/2019 | Japan | Germany | 30 | 28 | | |
| 16/10/2019 | Canada | US | 25 | 78 | 20 | |
| 28/10/2019 | Germany | Japan | 60 | 17 | 30 | |
+------------+----------+----------+--------+--------+-------------+--+
I have tried the following with dplyr:
data=data%>%group_by(CountryB)%>%mutate_at(list(lag1=~dplyr::lag(.,1,order_by=Date)),.vars=vars(PriceA))
However this does not give me the preceding value when the respective country is in column "CountryA", but rather when the respective country is in "CountryB".
Can someone please help me out on this one?
Thanks.
Quite possibly some of the ugliest code I've written, but...
# install.packages(c('dplyr', 'magrittr'))
library(dplyr)
library(magrittr)
d <- data.frame(
stringsAsFactors = FALSE,
Date = c("05/09/2019", "28/09/2019", "16/10/2019", "28/10/2019"),
CountryA = c("US", "Japan", "Canada", "Germany"),
CountryB = c("Japan", "Germany", "US", "Japan"),
PriceA = c(20L, 30L, 25L, 60L),
PriceB = c(55L, 28L, 78L, 17L)
) %>%
mutate(Date = as.Date(Date, format = '%d/%m/%Y'))
priceA_lag <- c()
for(row in 1:nrow(d)){
country <- slice(d, row) %$% CountryB
date <- slice(d, row) %$% Date
thePrice <- d %>%
filter(CountryA == country,
date > Date) %>%
filter(Date == max(Date)) %$%
PriceA
thePrice <- ifelse(length(thePrice) > 0, thePrice, NA)
priceA_lag <- priceA_lag %>%
append(thePrice)
}
d$priceA_lag <- priceA_lag
> d
Date CountryA CountryB PriceA PriceB priceA_lag
1 2019-09-05 US Japan 20 55 NA
2 2019-09-28 Japan Germany 30 28 NA
3 2019-10-16 Canada US 25 78 20
4 2019-10-28 Germany Japan 60 17 30
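The same lookup can probably be done without an explicit loop. The sketch below joins the table to itself (CountryB against CountryA), keeps only earlier dates, and picks the most recent one per row. The helper column row_id and the names lookup, latest_earlier and result are assumptions introduced for illustration.

```r
library(dplyr)

d <- data.frame(
  Date = as.Date(c("05/09/2019", "28/09/2019", "16/10/2019", "28/10/2019"),
                 format = "%d/%m/%Y"),
  CountryA = c("US", "Japan", "Canada", "Germany"),
  CountryB = c("Japan", "Germany", "US", "Japan"),
  PriceA = c(20L, 30L, 25L, 60L),
  PriceB = c(55L, 28L, 78L, 17L),
  stringsAsFactors = FALSE
) %>%
  mutate(row_id = row_number())   # hypothetical helper: identifies each original row

# Every observation of a country in CountryA, with its date and price
lookup <- d %>%
  select(CountryA, DateA = Date, PriceA_lag1 = PriceA)

# For each row, the most recent earlier PriceA of the country in CountryB
latest_earlier <- d %>%
  inner_join(lookup, by = c("CountryB" = "CountryA")) %>%
  filter(DateA < Date) %>%
  group_by(row_id) %>%
  slice_max(DateA, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(row_id, PriceA_lag1)

# Rows without an earlier match keep NA
result <- d %>%
  left_join(latest_earlier, by = "row_id") %>%
  select(-row_id)
result
```

On very long data the self-join can produce many candidate rows per country before filtering, so whether this is faster than the loop depends on how often each country repeats.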

Calculate sum of a column if the difference between consecutive rows meets a condition

This is a continued question from the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diff | Amount |
|----|:----------:|------:|------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | 76 | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | 124 | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
I need to retain those records where, for each buyer and id, consecutive rows are at most 5 days apart and their amounts sum to more than 5000. For example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018', 5 days apart; since the sum of these two amounts is > 5000, the output should include these records. The same Buyer 'Sandy' with id '4' also has transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018', each 3 days apart, but the output should not include these records because their sum is < 5000.
The final output would look like:
| id | Date | Buyer | diff | Amount |
|----|:---------:|------:|------|--------|
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
Changing Date from character to Date and Amount from character to numeric:
df$Date<-as.Date(df$Date, '%m/%d/%Y')
df$Amount<-as.numeric(df$Amount)
Here I group the dataset by id, arrange it by Date, and create a rank within each id (for example, Sandy gets ranks 1 through 5 for the 5 different days on which she shopped). I then define a new variable called ConsecutiveSum, which adds the Amount of each row to the previous row's Amount (lag gives you the previous row). The ifelse statement forces ConsecutiveSum to be 0 if the previous row's Amount doesn't exist. The next step just enforces your conditions:
df %>%
group_by(id) %>%
arrange(Date) %>%
mutate(rank=dense_rank(Date)) %>%
mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)),0,Amount + lag(Amount , default = 0)))%>%
filter(diffs<=5 & ConsecutiveSum>=5000 | ConsecutiveSum==0 & lead(ConsecutiveSum)>=5000)
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
I would use a combination of techniques available in tidyverse:
First create a grouping variable (new_id) that starts a new group whenever the id changes or the gap to the previous row is more than 5 days. Then sum Amount within each combination of id and new_id and keep the combinations whose total is > 5000. Finally, use a join or semi_join to keep only the rows of the original data that belong to those combinations.
ids is a dataset that holds the total Amount for each id and new_id and is filtered for dollar > 5000. This gives you the id and new_id combinations that meet your criteria.
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
library(tidyverse)
df1 <- df %>% mutate(Date = as.Date(Date , format = "%m/%d/%Y"),
tf1 = (id != lag(id, default = "")), # id is character, so the default must be character too
tf2 = (is.na(diffs) | diffs > 5))
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)
>df1
id Date Buyer Amount diffs days_post tf1 tf2 new_id
<chr> <date> <chr> <dbl> <dbl> <date> <lgl> <lgl> <int>
1 9 2018-11-29 John 959 NA 2018-12-04 TRUE TRUE 1
2 9 2018-11-29 John 1158 0 2018-12-04 FALSE FALSE 1
3 9 2018-11-29 John 596 0 2018-12-04 FALSE FALSE 1
4 5 2019-02-13 Maria 922 NA 2019-02-18 TRUE TRUE 2
5 5 2019-02-13 Maria 922 0 2019-02-18 FALSE FALSE 2
6 4 2018-06-15 Sandy 1849 NA 2018-06-20 TRUE TRUE 3
7 4 2018-06-20 Sandy 4193 5 2018-06-25 FALSE FALSE 3
8 4 2018-08-17 Sandy 4256 58 2018-08-22 FALSE TRUE 4
9 4 2018-08-20 Sandy 65 3 2018-08-25 FALSE FALSE 4
10 4 2018-08-23 Sandy 100 3 2018-08-28 FALSE FALSE 4
11 20 2018-12-25 Paul 313 NA 2018-12-30 TRUE TRUE 5
12 20 2018-12-25 Paul 99 0 2018-12-30 FALSE FALSE 5
ids <- df1 %>%
group_by(id, new_id) %>%
summarise(dollar = sum(Amount)) %>%
ungroup() %>% filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids)
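The same run-based idea can be sketched in a single pipeline, assuming the requirement is read as "sum every run of transactions whose consecutive gaps are at most 5 days". The column names gap and cluster are illustrative; cumsum() over the "new run starts here" flag numbers the runs within each Buyer/id group.

```r
library(dplyr)

df <- data.frame(
  id = c("9", "9", "9", "5", "5", "4", "4", "4", "4", "4", "20", "20"),
  Date = c("11/29/2018", "11/29/2018", "11/29/2018", "2/13/2019", "2/13/2019",
           "6/15/2018", "6/20/2018", "8/17/2018", "8/20/2018", "8/23/2018",
           "12/25/2018", "12/25/2018"),
  Buyer = c("John", "John", "John", "Maria", "Maria", "Sandy", "Sandy", "Sandy",
            "Sandy", "Sandy", "Paul", "Paul"),
  Amount = c(959, 1158, 596, 922, 922, 1849, 4193, 4256, 65, 100, 313, 99),
  stringsAsFactors = FALSE
)

flagged <- df %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(Buyer, id) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(
    gap = as.numeric(Date - lag(Date)),      # days since the previous transaction
    cluster = cumsum(is.na(gap) | gap > 5)   # a new run starts at the first row or after a long gap
  ) %>%
  group_by(Buyer, id, cluster) %>%
  filter(sum(Amount) > 5000) %>%             # keep whole runs whose total exceeds 5000
  ungroup()
flagged
```

With the sample data only Sandy's transactions of 1849 and 4193 remain, since 1849 + 4193 = 6042 while her later run totals 4256 + 65 + 100 = 4421.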

Populating column based on row matches without for loop

Is there a way to obtain the annual count values based on the state, species, and year, without using a for loop?
Name | State | Age | Species | Annual Ct
Nemo | NY | 5 | Clownfish | ?
Dora | CA | 2 | Regal Tang | ?
Lookup table:
State | Species | Year | AnnualCt
NY | Clownfish | 2012 | 500
NY | Clownfish | 2014 | 200
CA | Regal Tang | 2001 | 400
CA | Regal Tang | 2014 | 680
CA | Regal Tang | 2000 | 700
The output would be:
Name | State | Age | Species | Annual Ct
Nemo | NY | 5 | Clownfish | 200
Dora | CA | 2 | Regal Tang | 680
What I've tried:
pets <- data.frame("Name" = c("Nemo","Dora"), "State" = c("NY","CA"),
"Age" = c(5,2), "Species" = c("Clownfish","Regal Tang"))
fishes <- data.frame("State" = c("NY","NY","CA","CA","CA"),
"Species" = c("Clownfish","Clownfish","Regal Tang",
"Regal Tang", "Regal Tang"),
"Year" = c("2012","2014","2001","2014","2000"),
"AnnualCt" = c("500","200","400","680","700"))
pets["AnnualCt"] <- NA
for (row in (1:nrow(pets))){
pets$AnnualCt[row] <- as.character(droplevels(fishes[which(fishes$State == pets[row,]$State &
fishes$Species == pets[row,]$Species &
fishes$Year == 2014),
which(colnames(fishes)=="AnnualCt")]))
}
I'm confused as to what you're trying to do; isn't this just this?
library(dplyr);
left_join(pets, fishes) %>%
filter(Year == 2014) %>%
select(-Year);
#Joining, by = c("State", "Species")
# Name State Age Species AnnualCt
#1 Nemo NY 5 Clownfish 200
#2 Dora CA 2 Regal Tang 680
Explanation: left_join both data.frames by State and Species, filter for Year == 2014 and output without Year column.
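One caveat with filtering after the join is that a pet with no 2014 record in the lookup table disappears from the result. A possible alternative, sketched below, filters the lookup table first and then joins, so such a pet is kept with NA. The name counts_2014 is illustrative only.

```r
library(dplyr)

pets <- data.frame(
  Name = c("Nemo", "Dora"),
  State = c("NY", "CA"),
  Age = c(5, 2),
  Species = c("Clownfish", "Regal Tang"),
  stringsAsFactors = FALSE
)
fishes <- data.frame(
  State = c("NY", "NY", "CA", "CA", "CA"),
  Species = c("Clownfish", "Clownfish", "Regal Tang", "Regal Tang", "Regal Tang"),
  Year = c("2012", "2014", "2001", "2014", "2000"),
  AnnualCt = c("500", "200", "400", "680", "700"),
  stringsAsFactors = FALSE
)

# Restrict the lookup table to the year of interest before joining
counts_2014 <- fishes %>%
  filter(Year == "2014") %>%
  select(State, Species, AnnualCt)

# Every pet is kept; pets without a 2014 count would get NA
result <- left_join(pets, counts_2014, by = c("State", "Species"))
result
```

Spelling out by = c("State", "Species") also silences the "Joining, by" message and guards against accidental matches on other shared column names.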
