R: using the mutate function to combine strings

I've got a dataset called data1 with headers year and count.
My sample data looks like this:
Year Count
1 2005 3000
2 2006 4000
3 2007 5000
4 2008 6000
I add another column to the data which works out the yearly increase. This is my code:
data1growth <- data1 %>%
  mutate(Growth = Count - lag(Count))
I want to be able to add another column called period so that I can get the following output:
Year Count Growth Period
1 2005 3000 NA NA
2 2006 4000 1000 2005-2006
3 2007 5000 1000 2006-2007
4 2008 6000 1000 2007-2008
What code should I add to the mutate function to get the desired output, or am I off the mark completely? Any help is appreciated.
Thanks everyone.

library(dplyr)
data1 %>%
  mutate(
    Growth = Count - lag(Count),
    period = if_else(
      row_number() > 1,
      paste0(lag(Year), "-", Year),
      NA_character_
    )
  )
# Year Count Growth period
# 1 2005 3000 NA <NA>
# 2 2006 4000 1000 2005-2006
# 3 2007 5000 1000 2006-2007
# 4 2008 6000 1000 2007-2008
Reproducible data
data1 <- data.frame(
  Year = seq(2005L, 2008L, 1L),
  Count = seq(3000L, 6000L, 1000L)
)
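As a side note on why the if_else() guard above is needed: paste0() converts the NA that lag(Year) produces in the first row into the literal string "NA" rather than a missing value. A minimal sketch using the same data:

```r
library(dplyr)

data1 <- data.frame(Year = 2005:2008, Count = seq(3000, 6000, 1000))

# Without the guard, the first row becomes the string "NA-2005", not a true NA
res <- data1 %>% mutate(Period = paste0(lag(Year), "-", Year))
res$Period[1]  # "NA-2005"
```

Hence the row_number() > 1 test (or an is.na(lag(Year)) check) to emit a genuine NA_character_ in the first row.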

If you want Period to be a string, you can use another mutate:
library(tidyverse)
data1 <- tibble(Year = 2005:2008, Count = c(3000, 4000, 5000, 6000))
data1growth <- data1 %>%
  mutate(Growth = Count - lag(Count))
# Period as string
data1growth %>%
  mutate(Period = paste0(Year - 1, "-", Year))
#> # A tibble: 4 x 4
#>    Year Count Growth Period
#>   <int> <dbl>  <dbl> <chr>
#> 1  2005  3000     NA 2004-2005
#> 2  2006  4000   1000 2005-2006
#> 3  2007  5000   1000 2006-2007
#> 4  2008  6000   1000 2007-2008
# Period as string (don't include NA Growth)
data1growth %>%
  mutate(Period = ifelse(is.na(Growth), NA, paste0(Year - 1, "-", Year)))
#> # A tibble: 4 x 4
#>    Year Count Growth Period
#>   <int> <dbl>  <dbl> <chr>
#> 1  2005  3000     NA <NA>
#> 2  2006  4000   1000 2005-2006
#> 3  2007  5000   1000 2006-2007
#> 4  2008  6000   1000 2007-2008

Here is a base R option:
transform(data1,
  Growth = c(NA, diff(Count)),
  Period = c(NA, paste0(Year[-nrow(data1)], "-", Year[-1]))
)
which gives
Year Count Growth Period
1 2005 3000 NA <NA>
2 2006 4000 1000 2005-2006
3 2007 5000 1000 2006-2007
4 2008 6000 1000 2007-2008
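The c(NA, ...) padding is needed because diff() returns one fewer element than its input; a quick sketch of that behavior:

```r
Count <- c(3000, 4000, 5000, 6000)

growth <- diff(Count)   # successive differences: one element shorter than Count
padded <- c(NA, growth) # pad with NA so it lines up with the original rows

length(growth)  # 3
padded          # NA 1000 1000 1000
```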

Related

Growth Rates in R with dplyr not working, NAs generated - Help needed

I'm trying to generate growth rates for each fiscal half for various products in R, using dplyr and the lag function.
Usually this works for me, but this time it's generating NAs and I'm not sure why. The following code produces NA for all growth rates. Hoping someone can help.
library(flexdashboard)
library(dplyr)
library(magrittr)
library(scales)
library(sqldf)
library(ggplot2)
library(lubridate)
library(knitr)
library(tidyr)
library(kableExtra)
library(ggrepel)
library(htmltools)
library(stringr)
library(readxl)
t <- c(3000, 2000, 6000)
u <- c("FY18H1", "FY18H2", "FY19H1", "FY19H2", "FY20H1", "FY20H2")
x <- c(1, 2, 3, 4, 5)
y <- c("a", "b", "c", "d", "e")
z <- c("apples", "oranges")
identifer <- sort(c(replicate(x, n = 6)))
name <- sort(c(replicate(y, n = 6)))
business <- sort(c(replicate(z, n = 15)))
half <- c(replicate(u, n = 5))
dollars <- c(replicate(t, n = 10))
df <- data.frame(identifer, name, business, half, dollars)
df <- df %>%
  group_by(identifer, name, business, half) %>%
  mutate(
    YoY_GROWTH_DOLLARS = dollars - lag(dollars, 2),
    YoY_GROWTH_RATE = round(YoY_GROWTH_DOLLARS / lag(dollars, 2), 4)
  )
I think you should not group_by half. Try:
library(dplyr)
df %>%
  group_by(identifer, name, business) %>%
  mutate(
    YoY_GROWTH_DOLLARS = dollars - lag(dollars, 2),
    YoY_GROWTH_RATE = round(YoY_GROWTH_DOLLARS / lag(dollars, 2), 4)
  ) %>%
  ungroup()
# identifer name business half dollars YoY_GROWTH_DOLLARS YoY_GROWTH_RATE
# <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
# 1 1 a apples FY18H1 3000 NA NA
# 2 1 a apples FY18H2 2000 NA NA
# 3 1 a apples FY19H1 6000 3000 1
# 4 1 a apples FY19H2 3000 1000 0.5
# 5 1 a apples FY20H1 2000 -4000 -0.667
# 6 1 a apples FY20H2 6000 3000 1
# 7 2 b apples FY18H1 3000 NA NA
# 8 2 b apples FY18H2 2000 NA NA
# 9 2 b apples FY19H1 6000 3000 1
#10 2 b apples FY19H2 3000 1000 0.5
# … with 20 more rows
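The reason grouping by half breaks lag(): with half in the grouping, every (identifer, name, business, half) combination is a single row, so lag() has nothing earlier in the group to look back to and returns NA. A minimal illustration (using a small made-up frame, not the df above):

```r
library(dplyr)

d <- data.frame(g = c("a", "b", "b"), x = c(1, 2, 3))

res <- d %>%
  group_by(g) %>%
  mutate(prev = lag(x)) %>%  # lag() is computed within each group
  ungroup()

res$prev  # NA NA 2 — group "a" has only one row, so its lag is NA
```

Dropping half from group_by() makes the halves consecutive rows within each product group, which is what lag(dollars, 2) needs.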
Instead of using dplyr::mutate, use plyr::mutate:
df %>%
  dplyr::group_by(identifer, name, business, half) %>%
  plyr::mutate(
    YoY_GROWTH_DOLLARS = dollars - lag(dollars, 2),
    YoY_GROWTH_RATE = round(YoY_GROWTH_DOLLARS / lag(dollars, 2), 4)
  )
identifer name business half dollars YoY_GROWTH_DOLLARS YoY_GROWTH_RATE
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 a apples FY18H1 3000 NA NA
2 1 a apples FY18H2 2000 NA NA
3 1 a apples FY19H1 6000 3000 1
4 1 a apples FY19H2 3000 1000 0.5
5 1 a apples FY20H1 2000 -4000 -0.667
6 1 a apples FY20H2 6000 3000 1
7 2 b apples FY18H1 3000 1000 0.5
8 2 b apples FY18H2 2000 -4000 -0.667
9 2 b apples FY19H1 6000 3000 1
10 2 b apples FY19H2 3000 1000 0.5

Using dplyr and group_by to calculate the number of repetitions of a value

I have a dataset with seller_ID, Product_ID, and the year each product was sold. For each individual seller, I am trying to find the year in which that seller sold the most products, together with the number sold in that year. Here is an example of the data:
seller_ID <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)
Product_ID <- c(1000, 1000, 1005, 1004, 1005, 1003, 1010,
                1000, 1001, 1019, 1017)
year <- c(2015, 2016, 2015, 2020, 2020, 2000, 2000, 2001, 2001, 2001, 2005)
data <- data.frame(seller_ID, Product_ID, year)
seller_ID Product_ID year
1 1 1000 2015
2 1 1000 2016
3 1 1005 2015
4 2 1004 2020
5 2 1005 2020
6 3 1003 2000
7 4 1010 2000
8 4 1000 2001
9 4 1001 2001
10 4 1019 2001
11 4 1017 2005
so the ideal result would be:
seller_ID Max_sold_num_year Max_year
1 1 2 2015
2 2 2 2020
3 3 1 2000
4 4 3 2001
I tried the approach below, and it works:
df_temp <- data %>%
  group_by(seller_ID, year) %>%
  summarize(Sold_in_Year = length(Product_ID))

unique_seller <- unique(data$seller_ID)
ID_list <- c()
Max_list <- c()
Max_Sold_Year <- c()
j <- 1
for (ID in unique_seller) {
  df_temp_2 <- subset(df_temp, df_temp$seller_ID == ID)
  # rows of this seller with the maximum count (if there are ties, only the first is kept)
  Max_year <- subset(df_temp_2, df_temp_2$Sold_in_Year == max(df_temp_2$Sold_in_Year))
  ID_list[j] <- Max_year[1, 1]
  Max_Sold_Year[j] <- Max_year[1, 2]
  Max_list[j] <- Max_year[1, 3]
  j <- j + 1
}
# changing the lists above to a data frame
mm <- length(ID_list)
df_test_list <- data.frame(seller_ID = numeric(mm),
                           Max_sold_num_year = numeric(mm),
                           Max_year = numeric(mm))
for (i in 1:mm) {
  df_test_list$seller_ID[i] <- ID_list[[i]]
  df_test_list$Max_sold_num_year[i] <- Max_list[[i]]
  df_test_list$Max_year[i] <- Max_Sold_Year[[i]]
}
However, because it subsets the data on every iteration of the for loop, this approach is slow for a large dataset. Do you have any suggestions for improving my code? Is there another way to calculate the desired result without using a for loop?
Thanks
Try this:
library(dplyr)
seller_ID <- c(1,1,1,2,2,3,4,4,4,4,4)
Product_ID <- c(1000,1000,1005,1004,1005,1003,1010,
1000,1001,1019,1017)
year <- c(2015,2016,2015,2020,2020,2000,2000,2001,2001,2001,2005)
data<- data.frame(seller_ID,Product_ID,year)
data %>%
  dplyr::count(seller_ID, year) %>%
  dplyr::group_by(seller_ID) %>%
  dplyr::filter(n == max(n)) %>%
  dplyr::rename(Max_sold_num_year = n, Max_year = year)
#> # A tibble: 4 x 3
#> # Groups: seller_ID [4]
#> seller_ID Max_year Max_sold_num_year
#> <dbl> <dbl> <int>
#> 1 1 2015 2
#> 2 2 2020 2
#> 3 3 2000 1
#> 4 4 2001 3
And thanks to a comment by @yung_febreze, this can be shortened to:
data %>%
  dplyr::count(seller_ID, year) %>%
  dplyr::group_by(seller_ID) %>%
  dplyr::top_n(1)
EDIT: In case of duplicated maximum values, one can add dplyr::top_n(1, wt = year), which keeps the latest (maximum) year:
data %>%
  dplyr::count(seller_ID, year) %>%
  dplyr::group_by(seller_ID) %>%
  dplyr::top_n(1, wt = n) %>%
  dplyr::top_n(1, wt = year) %>%
  dplyr::rename(Max_sold_num_year = n, Max_year = year)
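On newer dplyr versions (1.0.0+), top_n() is superseded by slice_max(); an equivalent sketch using with_ties = FALSE to keep exactly one row per seller (an alternative phrasing, not from the original answer):

```r
library(dplyr)

data <- data.frame(
  seller_ID = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4),
  year = c(2015, 2016, 2015, 2020, 2020, 2000, 2000, 2001, 2001, 2001, 2005)
)

res <- data %>%
  count(seller_ID, year) %>%                    # one row per seller-year with count n
  group_by(seller_ID) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%    # keep the year with the largest n
  ungroup() %>%
  rename(Max_year = year, Max_sold_num_year = n)
```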

Calculate change over time with tidy data in R - do you have to spread and gather?

Quick question about calculating a change over time for tidy data. Do I need to spread the data, mutate the variable, and then gather the data again (see below), or is there a quicker way that keeps the data tidy?
Here is an example:
df <- data.frame(country = c(1, 1, 2, 2),
                 year = c(1999, 2000, 1999, 2000),
                 value = c(20, 30, 40, 50))
df
country year value
1 1 1999 20
2 1 2000 30
3 2 1999 40
4 2 2000 50
To calculate the change in value between 1999 and 2000 I would:
library(dplyr)
library(tidyr)
df2 <- df %>%
  spread(year, value) %>%
  mutate(change.99.00 = `2000` - `1999`) %>%
  gather(year, value, c(`1999`, `2000`))
df2
country change.99.00 year value
1 1 10 1999 20
2 2 10 1999 40
3 1 10 2000 30
4 2 10 2000 50
This seems a laborious way to do it. I assume there is a neater way that keeps the data in the narrow, tidy format, perhaps by grouping the data, but I can't think of it and I can't find an answer online.
Is there an easier way to do this?
After grouping by 'country', take the diff of 'value' filtered with the logical expression year %in% 1999:2000:
library(dplyr)
df %>%
  group_by(country) %>%
  mutate(change.99.00 = diff(value[year %in% 1999:2000]))
# A tibble: 4 x 4
# Groups: country [2]
# country year value change.99.00
# <dbl> <dbl> <dbl> <dbl>
#1 1 1999 20 10
#2 1 2000 30 10
#3 2 1999 40 10
#4 2 2000 50 10
NOTE: Here, we assume that the 'year' is not duplicated per 'country'
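Under the same assumption, a lag()-based sketch works too and avoids indexing by specific years; note it leaves NA in each country's first year rather than repeating the change on every row:

```r
library(dplyr)

df <- data.frame(country = c(1, 1, 2, 2),
                 year = c(1999, 2000, 1999, 2000),
                 value = c(20, 30, 40, 50))

res <- df %>%
  group_by(country) %>%
  mutate(change = value - lag(value, order_by = year)) %>%  # change vs previous year
  ungroup()

res$change  # NA 10 NA 10
```

Which shape you want (NA on the first row versus the change copied onto both rows) depends on what the downstream analysis needs.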

Remove rows out of a specific year range, without using for-loop in R

I am looking for a way to omit the rows that are not part of a complete run of years between two specific values, without using a for loop. All values in the year column are between 1999 and 2002, but some runs do not include every year between those two dates. The initial data looks like this:
a <- data.frame(year = c(2000:2002, 1999:2002, 1999:2002, 1999:2001),
                id = c(4, 6, 2, 1, 3, 5, 7, 4, 2, 0, -1, -3, 4, 3))
year id
1 2000 4
2 2001 6
3 2002 2
4 1999 1
5 2000 3
6 2001 5
7 2002 7
8 1999 4
9 2000 2
10 2001 0
11 2002 -1
12 1999 -3
13 2000 4
14 2001 3
The processed dataset should only include complete consecutive runs of 1999:2002. The following data.frame is exactly what I need:
year id
1 1999 1
2 2000 3
3 2001 5
4 2002 7
5 1999 4
6 2000 2
7 2001 0
8 2002 -1
When I execute the following for loop, I get the previous data.frame without any problem:
for (i in 1:which(a$year == 2002)[length(which(a$year == 2002))]) {
  if (a[i, 1] == 1999 & a[i + 3, 1] == 2002) {
    b <- a[i:(i + 3), ]
  } else {
    next
  }
  if (!exists("d")) {
    d <- b
  } else {
    d <- rbind(d, b)
  }
}
However, I have more than 1 million rows and need to do this without a for loop. Is there a faster way?
You could try this. First we create groups of consecutive years, then we join with the full year range, then we drop any group that is not complete. If you already have a grouping variable, this can be cut down a lot.
library(tidyverse)
df <- data_frame(year = c(2000:2002, 1999:2002, 1999:2002, 1999:2001),
                 id = c(4, 6, 2, 1, 3, 5, 7, 4, 2, 0, -1, -3, 4, 3))
df %>%
  mutate(groups = cumsum(c(0, diff(year) != 1))) %>%
  nest(-groups) %>%
  mutate(data = map(data, .f = ~full_join(.x, data_frame(year = 1999:2002), by = "year")),
         drop = map_lgl(data, ~any(is.na(.x$id)))) %>%
  filter(drop == FALSE) %>%
  unnest() %>%
  select(-c(groups, drop))
#> # A tibble: 8 x 2
#> year id
#> <int> <dbl>
#> 1 1999 1
#> 2 2000 3
#> 3 2001 5
#> 4 2002 7
#> 5 1999 4
#> 6 2000 2
#> 7 2001 0
#> 8 2002 -1
Created on 2018-08-31 by the reprex package (v0.2.0).
There is a function that can do part of this automatically.
First, install the package dplyr (or tidyverse) with install.packages("dplyr") or install.packages("tidyverse").
Then, load the package with library(dplyr).
Then, use the filter function: a_filtered <- filter(a, year >= 1999 & year <= 2002).
This should be fast even with many rows. Note, however, that this keeps every row whose year falls in the range; it does not check for complete consecutive runs.
We could also do this by creating a grouping column from a logical expression checking for the year 1999, then filtering on whether the first 'year' is 1999, the last is 2002, and all the years in between are present for the particular 'grp':
library(dplyr)
a %>%
  group_by(grp = cumsum(year == 1999)) %>%
  filter(dplyr::first(year) == 1999,
         dplyr::last(year) == 2002,
         all(1999:2002 %in% year)) %>%
  ungroup() %>%  # in case you want to remove 'grp'
  select(-grp)
# A tibble: 8 x 2
# year id
# <int> <dbl>
#1 1999 1
#2 2000 3
#3 2001 5
#4 2002 7
#5 1999 4
#6 2000 2
#7 2001 0
#8 2002 -1
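For completeness, the same cumsum-grouping idea can be written in base R with ave(), which avoids loading any packages (a sketch, not from the answers above):

```r
a <- data.frame(year = c(2000:2002, 1999:2002, 1999:2002, 1999:2001),
                id = c(4, 6, 2, 1, 3, 5, 7, 4, 2, 0, -1, -3, 4, 3))

# New group each time a run starts at 1999; rows before the first 1999 fall in group 0
grp <- cumsum(a$year == 1999)

# For each group, flag whether it is exactly the complete run 1999:2002
keep <- ave(a$year, grp, FUN = function(y) length(y) == 4 && all(y == 1999:2002))

result <- a[keep == 1, ]
result  # the 8 rows of the two complete runs
```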

Selecting distinct rows in dplyr [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 4 years ago.
dat <- data.frame(loc.id = rep(1:2, each = 3),
                  year = rep(1981:1983, times = 2),
                  prod = c(200, 300, 400, 150, 450, 350),
                  yld = c(1200, 1250, 1200, 3000, 3200, 3200))
If I want to select for each loc.id distinct values of yld, I do this:
dat %>% group_by(loc.id) %>% distinct(yld)
loc.id yld
<int> <dbl>
1 1200
1 1250
2 3000
2 3200
However, what I want is: for each loc.id, if years have the same yld, keep the one with the lower prod value. I want the prod and year columns included in the final dataframe:
loc.id year prod yld
1 1981 200 1200
1 1982 300 1250
2 1981 150 3000
2 1983 350 3200
We can arrange by 'prod' and then slice the first observation:
dat %>%
  arrange(loc.id, prod) %>%
  group_by(loc.id, yld) %>%
  slice(1)
# A tibble: 4 x 4
# Groups: loc.id, yld [4]
# loc.id year prod yld
# <int> <int> <dbl> <dbl>
#1 1 1981 200 1200
#2 1 1982 300 1250
#3 2 1981 150 3000
#4 2 1983 350 3200
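On dplyr 1.0.0+, the arrange + slice(1) pattern can also be written with slice_min(), which makes the "lowest prod per group" intent explicit (an alternative sketch, not the original answer):

```r
library(dplyr)

dat <- data.frame(loc.id = rep(1:2, each = 3),
                  year = rep(1981:1983, times = 2),
                  prod = c(200, 300, 400, 150, 450, 350),
                  yld = c(1200, 1250, 1200, 3000, 3200, 3200))

res <- dat %>%
  group_by(loc.id, yld) %>%
  slice_min(prod, n = 1, with_ties = FALSE) %>%  # keep the row with the lowest prod
  ungroup()
```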
