How best to do this pivot operation in R

Below is the sample data and the desired outcome. This is a much simplified version of the actual data set. In the actual data set, there are 20 years with 4 quarters apiece. I am looking to have each unique company entry listed once, with the employment data series running from beginning to end, left to right. In the event that there is no data for Vision Inc in 2019 quarter 3, I would want it to return a 0 and not an NA.
library(tidyverse)
library(dplyr)
legalname <- c("Vision Inc.","Expedia","Strong Enterprise","Vision Inc.","Expedia","Strong Enterprise")
year <- c(2019,2019,2019,2019,2019,2019)
quarter <- c(1,1,1,2,2,2)
cnty <- c(031,029,027,031,029,027)
naics <- c(345110,356110,362110,345110,356110,345110)
mnth1emp <- c (11,13,15,15,17,20)
mnth2emp <- c(12,14,15,16,18,22)
mnth3emp <-c(13,15,15,17,21,29)
employers <- data.frame(legalname,year,quarter,naics,mnth1emp,mnth2emp,mnth3emp)
Desired Outcome
legalname cnty naics 2019m1 2019m2 2019m3 2019m4 2019m5 2019m6
Vision Inc 031 345110 11 12 13 15 16 17
Expedia 029 356110 13 14 15 17 18 21

I first pivot to long form, then arrange by legalname and year (just to double-check that they are in numerical order). Then, I create a unique month series for each year for each company. Then, I drop quarter, pivot back to wide form while putting year and month name together, and finally replace NA with 0. Here, I'm assuming that you want each unique naics on its own row.
library(tidyverse)
employers %>%
pivot_longer(starts_with("mnth")) %>%
arrange(legalname, year) %>%
group_by(legalname, year, naics) %>%
mutate(name = paste0("m", 1:n())) %>%
select(-quarter) %>%
pivot_wider(names_from = c("year", "name"), names_sep = "", values_from = "value") %>%
mutate(across(everything(), ~replace_na(.,0)))
Output
legalname naics `2019m1` `2019m2` `2019m3` `2019m4` `2019m5` `2019m6`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Expedia 356110 13 14 15 17 18 21
2 Strong Enterprise 362110 15 15 15 0 0 0
3 Strong Enterprise 345110 0 0 0 20 22 29
4 Vision Inc. 345110 11 12 13 15 16 17

Does this work for you?
First pivot longer to get the months and values in a quarter; and then pivot wider to get the wide format you want.
employers %>%
filter(legalname != "Strong Enterprise") %>%
pivot_longer(mnth1emp:mnth3emp, names_to = "mnth", values_to = "value") %>%
mutate(month_in_quarter = as.numeric(str_extract(mnth, "\\d")),
month =str_c("m", month_in_quarter + 3*(quarter - 1))) %>%
select(-c(month_in_quarter, mnth)) %>%
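# note: cnty must be a column of employers; the data.frame() call in the question omits it
pivot_wider(id_cols = c(legalname, cnty, naics),
names_from = c(year, month),
values_from = value,
values_fill = 0)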
values_fill will fill NAs with 0s.
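As a minimal sketch of the same idea on a self-contained toy data set: the month index is computed as 3 * (quarter - 1) plus the month within the quarter, and values_fill = 0 covers a company with a missing quarter. The data frame toy and its companies "A" and "B" are made up for illustration, and names_glue is used here only to get column names like 2019m1.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data: company "B" has no row for quarter 2
toy <- data.frame(
  legalname = c("A", "B", "A"),
  year      = c(2019, 2019, 2019),
  quarter   = c(1, 1, 2),
  mnth1emp  = c(11, 13, 15),
  mnth2emp  = c(12, 14, 16),
  mnth3emp  = c(13, 15, 17)
)

wide <- toy %>%
  pivot_longer(mnth1emp:mnth3emp, names_to = "mnth", values_to = "value") %>%
  # running month number within the year: quarter 2, month 1 becomes month 4
  mutate(month = 3 * (quarter - 1) + as.numeric(str_extract(mnth, "\\d"))) %>%
  select(-mnth, -quarter) %>%
  pivot_wider(id_cols = legalname,
              names_from = c(year, month),
              names_glue = "{year}m{month}",
              values_from = value,
              values_fill = 0)   # missing company/month cells become 0, not NA

print(wide)
```

Company "B" ends up with 0 in the quarter 2 columns because no row existed for it there.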

perhaps try this.
I found a way to get the pivot right in R. I used the library("pivottabler") with the data.frame "bhmtrains". This worked now.
library(pivottabler)
qhpvt(bhmtrains, c("=","TOC"), "TrainCategory",
c("Mean Speed"="mean(SchedSpeedMPH, na.rm=TRUE)",
"Std Dev Speed"="sd(SchedSpeedMPH, na.rm=TRUE)"),
formats=list("%.0f", "%.1f"), totals=list("", "TrainCategory"="All",
"Categories"))
my results out of the code

Related

In R, pivot duplicate row values into column values

My problem is similar to this one, but I am having trouble making the code work for me:
Pivot dataframe to keep column headings and sub-headings in R
My data looks like this:
prod1<-c(1000,2000,1400,1340)
prod2<-c(5000,5400,3400,5400)
partner<-c("World","World","Turkey","Turkey")
year<-c("2017","2018","2017","2018")
type<-c("credit","credit","debit","debit")
s<-as.data.frame(rbind(partner,year,type,prod1,prod2))
But I need to convert all the rows into individual variables so that my columns are:
column.names<-c("products","partner","year","type","value")
I've been trying the code below:
#fix partners
colnames(s)[seq(2, 7, 1)] <- colnames(s)[2] #seq(start,end,increment)
colnames(s)[seq(9, ncol(s), 1)] <- colnames(s)[8]
colnames(s) <-
c(s[1, 1], paste(sep = '_', colnames(s)[2:ncol(s)], as.character(unlist(s[1, 2:ncol(s)]))))
test<-s[-1,]
s <- rename(s, category=1)
test<- s %>%
slice(-1) %>%
pivot_longer(-1,
names_to = c("partner", ".value"),
names_sep = "_") %>%
arrange(partner, `Service item`) %>%
mutate(partner = as.character(partner))
But it keeps saying I can't have duplicate column names. Can someone please help? The initial data is submitted in this format so I need to get it in the right shape.
s <- rownames_to_column(s)
s %>% pivot_longer(starts_with("V")) %>%
pivot_wider(names_from = rowname,values_from = value) %>%
select(-name) %>% pivot_longer(starts_with("prod"), names_to = "product",
values_to = "value")
# A tibble: 8 × 5
partner year type product value
<chr> <chr> <chr> <chr> <chr>
1 World 2017 credit prod1 1000
2 World 2017 credit prod2 5000
3 World 2018 credit prod1 2000
4 World 2018 credit prod2 5400
5 Turkey 2017 debit prod1 1400
6 Turkey 2017 debit prod2 3400
7 Turkey 2018 debit prod1 1340
8 Turkey 2018 debit prod2 5400
Sorry, I misread the question at first. Is this what you are looking for?
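A possible alternative, sketched under the assumption that the data really arrives with one variable per row as in the question: since s is just a transposed table, base R's t() can flip it before a single pivot_longer(). The name tidy is made up for illustration, and the values are converted to numeric because rbind() on mixed vectors produces character data.

```r
library(dplyr)
library(tidyr)

prod1   <- c(1000, 2000, 1400, 1340)
prod2   <- c(5000, 5400, 3400, 5400)
partner <- c("World", "World", "Turkey", "Turkey")
year    <- c("2017", "2018", "2017", "2018")
type    <- c("credit", "credit", "debit", "debit")
s <- as.data.frame(rbind(partner, year, type, prod1, prod2))

# t() swaps rows and columns, so each original row name becomes a column name
tidy <- as.data.frame(t(s), stringsAsFactors = FALSE) %>%
  pivot_longer(starts_with("prod"), names_to = "product", values_to = "value") %>%
  mutate(value = as.numeric(value))   # rbind() had coerced everything to character

print(tidy)
```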

Reshaping multiple long columns into wide column format in R

My sample dataset has multiple columns that I want to convert into wide format. I have tried using the dcast function, but I get an error. Below is my sample dataset:
df2 = data.frame(emp_id = c(rep(1,2), rep(2,4),rep(3,3)),
Name = c(rep("John",2), rep("Kellie",4), rep("Steve",3)),
Year = c("2018","2019","2018","2018","2019","2019","2018","2019","2019"),
Type = c(rep("Salaried",2), rep("Hourly", 2), rep("Salaried",2),"Hourly",rep("Salaried",2)),
Dept = c("Sales","IT","Sales","Sales", rep("IT",3),rep("Sales",2)),
Salary = c(100,1000,95,95,1500,1500,90,1200,1200))
I'm expecting my output to look like:
One option is the function pivot_wider() from the tidyr package:
df.wide <- tidyr::pivot_wider(df2,
names_from = c("Type", "Dept", "Year"),
values_from = "Salary",
values_fn = mean)
This should get you the desired result.
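To illustrate what values_fn does, here is a small sketch on made-up data (the data frame toy is hypothetical): when several rows land in the same output cell, pivot_wider() would otherwise return list-columns with a warning, and values_fn collapses them to one value per cell.

```r
library(tidyr)

# Hypothetical data: two 2018 salary rows for the same person
toy <- data.frame(
  Name   = c("Kellie", "Kellie", "Kellie"),
  Year   = c("2018", "2018", "2019"),
  Salary = c(95, 97, 1500)
)

wide <- pivot_wider(toy,
                    names_from  = Year,
                    values_from = Salary,
                    values_fn   = mean)   # average duplicates within each cell

print(wide)
```

Swapping mean for another summary such as max or sum changes how the duplicates are combined.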
What do you think about this output? It is not the expected output, but somehow I find it easier to interpret the data??
df2 %>%
group_by(Name, Year, Type, Dept) %>%
summarise(mean = mean(Salary))
Output:
Name Year Type Dept mean
<chr> <chr> <chr> <chr> <dbl>
1 John 2018 Salaried Sales 100
2 John 2019 Salaried IT 1000
3 Kellie 2018 Hourly Sales 95
4 Kellie 2019 Salaried IT 1500
5 Steve 2018 Hourly IT 90
6 Steve 2019 Salaried Sales 1200

R Calculate change in Weekly values Year on Year (with additional complication)

I have a data set of daily value. It spans from Dec-1 2018 to April-1 2020.
The columns are "date" and "value". As shown here:
date <- c("2018-12-01","2018-12-02", "2018-12-03",
...
"2020-03-30","2020-03-31","2020-04-01")
value <- c(1592,1825,1769,1909,2022, .... 2287,2169,2366,2001,2087,2099,2258)
df <- data.frame(date,value)
What I would like to do is sum the values by week and then calculate the week-over-week change from the previous year to the current year.
I know that I can sum by week using the following function:
Data_week <- df%>% group_by(category ,week = cut(date, "week")) %>% mutate(summed= sum(value))
My questions are twofold:
1) How do I sum by week and then manipulate the dataframe so that I can calculate week over week change (e.g. week dec.1 2019/ week dec.1 2018).
2) How can I do that above, but using a "customized" week. Let's say I want to define a week as moving 7 days back from the latest date I have data for. Eg. the latest week I would have would be week starting on March 26th (April 1st -7 days).
We can use lag from dplyr to help and also some convenience functions from lubridate.
library(dplyr)
library(lubridate)
df %>%
mutate(year = year(date)) %>%
group_by(week = week(date),year) %>%
summarize(summed = sum(value)) %>%
arrange(year, week) %>%
ungroup %>%
mutate(change = summed - lag(summed))
# week year summed change
# <dbl> <dbl> <dbl> <dbl>
# 1 48 2018 3638. NA
# 2 49 2018 15316. 11678.
# 3 50 2018 13283. -2033.
# 4 51 2018 15166. 1883.
# 5 52 2018 12885. -2281.
# 6 53 2018 1982. -10903.
# 7 1 2019 14177. 12195.
# 8 2 2019 14969. 791.
# 9 3 2019 14554. -415.
#10 4 2019 12850. -1704.
#11 5 2019 1907. -10943.
If you would like to define "weeks" in different ways, there is also isoweek and epiweek. See this answer for a great explanation of your options.
Data
set.seed(1)
df <- data.frame(date = seq.Date(from = as.Date("2018-12-01"), to = as.Date("2019-01-29"), "days"), value = runif(60,1500,2500))
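For the "customized" week in the second part of the question, one possible sketch is to count days back from the latest date and bucket them in groups of seven. Everything here is an assumption for illustration: the simulated data, the column names weeks_back, week_start and yoy_ratio, and the choice to compare each week with the one 52 weeks (364 days) earlier as an approximation of "same week last year".

```r
library(dplyr)

# Simulated daily data covering the range described in the question
set.seed(1)
df <- data.frame(date = seq.Date(as.Date("2018-12-01"), as.Date("2020-04-01"), by = "day"))
df$value <- runif(nrow(df), 1500, 2500)

weekly <- df %>%
  # 0 = the latest 7 days, 1 = the 7 days before that, and so on
  mutate(weeks_back = as.integer(max(date) - date) %/% 7) %>%
  group_by(weeks_back) %>%
  summarise(week_start = min(date), n_days = n(), summed = sum(value)) %>%
  arrange(week_start) %>%
  # ratio of each week to the week 52 rows (364 days) earlier
  mutate(yoy_ratio = summed / lag(summed, 52))

tail(weekly, 3)
```

The oldest bucket may contain fewer than seven days, which is why n_days is kept alongside the sums.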

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(nrow(full_sites_wells_months_set)*data_availability)) # generate random groundwater values
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
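A more compact variant of the same idea, offered only as a sketch: tidyr::complete() can fill in the missing months directly when month is a factor with all twelve levels, after which the six-month lag lines up correctly. The toy data frame obs and the column name diff_6m are hypothetical; the three values are taken from the sample rows in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical well with only three observed months
obs <- data.frame(
  Well  = 222,
  year  = 1995,
  month = factor(c("February", "March", "August"), levels = month.name),
  value = c(8.53, 8.69, 9.66)
)

diffs <- obs %>%
  complete(Well, year, month) %>%            # adds NA rows for unobserved months
  group_by(Well, year) %>%
  arrange(month, .by_group = TRUE) %>%       # factor order = calendar order
  mutate(diff_6m = value - lag(value, 6)) %>%  # e.g. August minus February
  ungroup()

print(diffs)
```

Pairs where either month is missing come out as NA instead of being compared against the wrong month.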

Convert data.frame wide to long while concatenating date formats

In R (or another language), I want to transform the upper data frame into the lower one.
How can I do that?
Thank you beforehand.
year month income expense
2016 07 50 15
2016 08 30 75
month income_expense
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
Well, it seems that you are trying to do multiple operations in the same question: combining date columns, melting your data, transforming some column names, and sorting.
This will give your expected output:
library(tidyr); library(reshape2); library(dplyr)
df %>% unite("date", c(year, month)) %>%
mutate(expense=-expense) %>% melt(value.name="income_expense") %>%
select(-variable) %>% arrange(date)
#### date income_expense
#### 1 2016_07 50
#### 2 2016_07 -15
#### 3 2016_08 30
#### 4 2016_08 -75
I'm using three different libraries here, for better readability of the code. It might be possible to do it with base R, though.
Here's a solution using only two packages, dplyr and tidyr
First, your dataset:
df <- dplyr::tibble( # data_frame() is deprecated; dplyr re-exports tibble()
year =2016,
month = c("07", "08"),
income = c(50,30),
expense = c(15, 75)
)
The mutate() function in dplyr creates/edits individual variables. The gather() function in tidyr will bring multiple variables/columns together in the way that you specify.
df <- df %>%
dplyr::mutate(
month = paste0(year, "-", month)
) %>%
tidyr::gather(
key = direction, #your name for the new column containing classification 'key'
value = income_expense, #your name for the new column containing values
income:expense #which columns you're acting on
) %>%
dplyr::mutate(income_expense =
ifelse(direction=='expense', -income_expense, income_expense)
)
The output has all the information you'd need (but we will clean it up in the last step)
> df
# A tibble: 4 × 4
year month direction income_expense
<dbl> <chr> <chr> <dbl>
1 2016 2016-07 income 50
2 2016 2016-08 income 30
3 2016 2016-07 expense -15
4 2016 2016-08 expense -75
Finally, we select() to drop columns we don't want, and then arrange it so that df shows the rows in the same order as you described in the question.
df <- df %>%
dplyr::select(-year, -direction) %>%
dplyr::arrange(month)
> df
# A tibble: 4 × 2
month income_expense
<chr> <dbl>
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
NB: I guess that I'm using three libraries, including magrittr for the pipe operator %>%. But, since the pipe operator is the best thing ever, I often forget to count magrittr.
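For comparison, the same reshaping can be sketched with pivot_longer(), which supersedes gather() in current tidyr. The name long is made up for illustration; negating expense before pivoting is just one way to get the signs right.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  year    = 2016,
  month   = c("07", "08"),
  income  = c(50, 30),
  expense = c(15, 75)
)

long <- df %>%
  mutate(month   = paste(year, month, sep = "-"),  # "2016-07", "2016-08"
         expense = -expense) %>%                   # expenses become negative
  select(-year) %>%
  pivot_longer(c(income, expense),
               names_to  = "direction",
               values_to = "income_expense") %>%
  select(-direction)

print(long)
```

Because pivot_longer() keeps the values from each input row together, the rows already come out in the order shown in the question without a separate arrange().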

Resources