Blank column separates two variables in wide format - r

I am tidying the Quarterly Employment Statistics dataset that Statistics South Africa provides. The Excel file is just aggregated data.
Two variables (employment levels and aggregate earnings) are in wide format, with the data spanning quarters over time. Row 1 in the Excel file gives the variable names, and row 2 gives the quarters. The two variables are separated by a blank column, and both blocks use the same quarter labels. I am happy to ignore row 1 and use row 2 as the column names, but because the quarter labels repeat across the two variables, I need to rename the columns.
Rather than hard-coding a numeric vector of the columns to rename for the current dataset, I want code that will also tidy future updates, where I expect each new quarter for employees to be added just before the blank column. The following code imports the dataset correctly at present. How can I instead select the columns to rename based on the position of the blank column?
QES <- read_xlsx(
path="QES.xlsx",
range=cell_limits(c(2, 1), c(115, NA))
) %>%
rename(
industry=...1,
SIC=...2
) %>%
rename_with(
.fn=function(q) paste("employees", q, sep="."),
.cols=seq(3, 45, 1) # '45' will change every quarter.
) %>%
rename_with(
.fn=~substr(x=., start=1, stop=16),
.cols=starts_with("employees.")
) %>%
rename_with(
.fn=function(q) paste("earnings", q, sep="."),
.cols=seq(47, 89, 1) # '47' and '89' will change every quarter.
) %>%
rename_with(
.fn=~substr(x=., start=1, stop=15),
.cols=starts_with("earnings.")
) %>%
select(-46) %>% # blank column
pivot_longer(
cols=-c(1, 2),
names_to=c(".value", "time"),
names_pattern="(.+)\\.([0-9]{6})" # escape the dot so it matches a literal "."
) %>%
mutate( time=as.yearmon(time, "%Y%m") )
Output:
# A tibble: 4,859 x 5
industry SIC time employees earnings
<chr> <chr> <yearmon> <dbl> <dbl>
1 Mining and quarrying 2 Sep 2009 487132 16448188825
2 Mining and quarrying 2 Dec 2009 487745 17510873008
3 Mining and quarrying 2 Mar 2010 490847 17149708859
4 Mining and quarrying 2 Jun 2010 496710 17603372833
5 Mining and quarrying 2 Sep 2010 505244 19129235237
6 Mining and quarrying 2 Dec 2010 504067 19696688896
7 Mining and quarrying 2 Mar 2011 511433 19567627283
8 Mining and quarrying 2 Jun 2011 517104 20444707224
9 Mining and quarrying 2 Sep 2011 518719 21593220480
10 Mining and quarrying 2 Dec 2011 518240 24878861928
# ... with 4,849 more rows
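One possible way to locate the blank column without hard-coding positions is sketched below. It assumes, as the code above suggests, that readxl names an empty header cell `...<position>` and appends `...<position>` to repeated quarter labels. The vector `nms` is a hypothetical stand-in for `names()` of the imported tibble, and `emp_cols`, `earn_cols` and `new_nms` are illustrative names only.

```r
# Hypothetical column names, shaped like readxl's repaired names:
# an empty header cell becomes "...<position>", and a repeated quarter
# label gets the suffix "...<position>".
nms <- c("...1", "...2",
         "200909...3", "200912...4",   # employees block
         "...5",                       # blank separator column
         "200909...6", "200912...7")   # earnings block

# Positions whose header was empty; the first two are industry and SIC,
# so the next one is taken to be the blank separator.
empty <- grep("^\\.\\.\\.[0-9]+$", nms)
blank <- empty[empty > 2][1]

# Everything between column 2 and the blank is employees;
# everything after the blank is earnings.
emp_cols  <- 3:(blank - 1)
earn_cols <- (blank + 1):length(nms)

# Build the new names: prefix plus the six-digit quarter label.
new_nms <- nms
new_nms[1:2]       <- c("industry", "SIC")
new_nms[emp_cols]  <- paste0("employees.", substr(nms[emp_cols], 1, 6))
new_nms[earn_cols] <- paste0("earnings.",  substr(nms[earn_cols], 1, 6))

new_nms[-blank]
```

With the real data, the same positions could be computed from `names(QES)` right after the import, so that neither `45`, `46`, `47` nor `89` needs to be typed in.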

Related

How to replace NA values with average of precedent and following values, in R

I currently have a dataset that has more or less the following characteristics:
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
What I would like to do is find a command/write a function that fills only the NAs for each country with:
1. if the NA Observation is for the first year (2010), fill it with the next non-NA Observation;
2. if the NA Observation is for the last year (2016), fill it with the previous available period's Observation.
3.1 if the NA Observation is for a year between the first and last, fill it with the average of the 2 closest periods.
3.2 However, if there are 2 or more consecutive NAs (let's take 2 as an example), first fill the first with the preceding Observation and the second with the same method as (3.1).
As an illustration, the previous dataset should finally be:
Observation2 <- c(2, 5, 5, 3.5 ,2,3,2, 2,2,3,1,1)
df2 <- data.frame(Country, Year, Observation2)
I hope I was sufficiently clear. It is very specific but I hope someone can help.
Feel free to ask any questions about it if you do not understand.
Input. There is some question of whether alternation of country names as mentioned in the comments under the question and shown in the Note at the end was intended but at any rate assume that each subsequence of increasing years is a separate group and group by them, grp. (If it was intended that the first 6 entries in Country be Honduras the last 6 be Belize then we could replace the group_by(...) with group_by(Country) in the code below.)
Clarification of Question. We assume that the question is asking that within group:
Leading NAs are to be replaced with the first non-NA.
Trailing NAs are to be replaced with the last non-NA.
If there is one consecutive NA surrounded by non-NAs it is replaced by the prior non-NA.
If there are two consecutive NA's then the first is replaced with the prior non-NA and the second is filled in with the average of the prior non-NA and next non-NA.
The question does not address the situation of 3+ consecutive NAs so maybe this never occurs but just in case it does what the code should do is fill in the first NA with the prior non-NA and the remainder should be filled in using linear interpolation.
Code. Now for each group, replace any NA with the prior value. Then use linear interpolation on what is left via na.approx using rule=2 to extend the ends. Finally only keep desired columns.
dplyr clashes. Note that lag and filter in dplyr collide in an incompatible way with the functions of the same name in base R so we exclude them and use dplyr:: prefix if we want to access them.
library(dplyr, exclude = c("lag", "filter"))
library(zoo)
df2 <- df %>%
# group_by(Country) %>%
group_by(grp = cumsum(c(TRUE, diff(Year) < 0))) %>%
mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
na.approx(rule = 2)) %>%
ungroup %>%
select(Country, Year, Observation2)
identical(df2$Observation2, Observation2)
## [1] TRUE
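To see what the two steps do within a single group, here is a minimal base-R sketch. It uses `stats::approx` as a stand-in for `zoo::na.approx` (an assumption made only so the example needs no packages), and the names `prev`, `step1` and `filled` are illustrative.

```r
x <- c(2, 5, NA, NA, 2, 3)   # observations for one group, in year order

# Step 1: replace each NA with the value immediately before it, if there is one.
prev  <- c(NA, head(x, -1))
step1 <- ifelse(is.na(x), prev, x)   # 2 5 5 NA 2 3

# Step 2: linearly interpolate what is still NA; rule = 2 extends the end
# values outward, which covers leading and trailing NAs.
i      <- seq_along(step1)
filled <- approx(i, step1, xout = i, rule = 2)$y

filled   # 2.0 5.0 5.0 3.5 2.0 3.0
```

Note that interpolation here is over positions rather than over `Year`, so the gap between 2012 and 2014 is treated like any other step.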
Note
We used this input taken from the question.
Country <- rep(c("Honduras", "Belize"),6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
df
giving:
Country Year Observation
1 Honduras 2010 2
2 Belize 2011 5
3 Honduras 2012 NA
4 Belize 2014 NA
5 Honduras 2015 2
6 Belize 2016 3
7 Honduras 2010 NA
8 Belize 2011 NA
9 Honduras 2012 2
10 Belize 2014 3
11 Honduras 2015 1
12 Belize 2016 NA
Added
In a comment the poster added another example. We run it here. This is the same code incorporating the simplification to group_by discussed in the first paragraph above. (That does not change the result.)
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5, NA, NA,2,3, NA, NA,2, NA,1,NA)
df <- data.frame(Country, Year, Observation)
df2 <- df %>%
group_by(Country) %>%
mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
na.approx(rule = 2)) %>%
ungroup %>%
select(Country, Year, Observation2)
df2
giving:
# A tibble: 12 x 3
Country Year Observation2
<chr> <dbl> <dbl>
1 Honduras 2010 2
2 Honduras 2011 5
3 Honduras 2012 5
4 Honduras 2014 3.5
5 Honduras 2015 2
6 Honduras 2016 3
7 Belize 2010 2
8 Belize 2011 2
9 Belize 2012 2
10 Belize 2014 2
11 Belize 2015 1
12 Belize 2016 1

In 0:(b - 1) : numerical expression has 6 elements: only the first used

I've been working on a project that includes time series. My issue is that I don't have information for each year, only for the period. I basically want to duplicate each row for as long as the period lasts: for an n-year period, I want to create (n-1) new rows with exactly the same information. So far, so good.
stack = data.frame(c("ville","commune","université","pole emploi", "ministère","collège"),
c(2014,2015,2016,2014,2015,2014),
c(5,3,2,6,4,1))
colnames(stack) = c("benefit recipient","beginning year", "length of the period")
->
b = stack$`length of the period` # number of times each row is repeated
stack2 = stack[rep(rownames(stack),b),]
Now what I want to do is change the beginning year into the current year, so that each successive row for a recipient is one year later than the previous one. To visualise it, here is some code where I do it manually (there is also a screenshot of what I have and what I want in my real project).
stack3 = data.frame(c("ville","ville","ville","ville","ville","commune","commune","commune","université","université","pole emploi","pole emploi","pole emploi","pole emploi","pole emploi","pole emploi", "ministère","ministère","ministère","ministère","collège"),
c(2014,2015,2016,2017,2018,2015,2016,2017,2016,2017,2014,2015,2016,2017,2018,2019,2015,2016,2017,2018,2014),
c(5,5,5,5,5,3,3,3,2,2,6,6,6,6,6,6,4,4,4,4,1))
colnames(stack3) = c("benefit recipient","effective year", "length of the period")
So far, my idea was to split my period and to add the value of this new vector to my table. I tried with the function:
c = c(0:(b-1))
But it didn't work, I have the message In 0:(b - 1) : numerical expression has 6 elements: only the first used. It's a shame because it did exactly what I wanted but, only for the first element...
Do you have any idea how I can solve it?
Thanks a lot for your time!
[Screenshot: what I have]
[Screenshot: what I would like to have]
Solution using lapply() and seq():
b = stack$`beginning year`
c = stack$`length of the period`
stack2 = stack[rep(rownames(stack), c), ] # repeat each row once per year of its period
stack2$`beginning year` = unlist(lapply(seq_along(b), function(x) seq(b[x], b[x] + c[x] - 1, by = 1)))
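On why the original attempt warned: the `:` operator uses only the first element of a vector operand, so `0:(b - 1)` builds a single sequence. A vectorised alternative is base R's `sequence()`, which concatenates `1:n` for each `n` in its argument. The sketch below uses shortened, hypothetical column names (`recipient`, `begin`, `len`) and only three of the rows.

```r
stack <- data.frame(
  recipient = c("ville", "commune", "université"),
  begin     = c(2014, 2015, 2016),
  len       = c(5, 3, 2)
)

# Repeat each row once per year of its period.
expanded <- stack[rep(seq_len(nrow(stack)), stack$len), ]

# sequence(c(5, 3, 2)) is 1:5, 1:3, 1:2 concatenated, so this adds
# 0, 1, 2, ... to the beginning year within each recipient.
expanded$year <- expanded$begin + sequence(stack$len) - 1

expanded$year
```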
We can use rowwise
library(dplyr)
library(tidyr)
stack %>%
rowwise %>%
mutate(year = list(`beginning year`:(`beginning year` +
`length of the period` - 1))) %>%
unnest(year)
You can use map2 to create sequence and unnest to create new rows.
library(tidyverse)
stack %>%
mutate(year = map2(`beginning year`, `beginning year` + `length of the period` - 1, seq)) %>%
unnest(year)
# `benefit recipient` `beginning year` `length of the period` year
# <chr> <dbl> <dbl> <int>
# 1 ville 2014 5 2014
# 2 ville 2014 5 2015
# 3 ville 2014 5 2016
# 4 ville 2014 5 2017
# 5 ville 2014 5 2018
# 6 commune 2015 3 2015
# 7 commune 2015 3 2016
# 8 commune 2015 3 2017
# 9 université 2016 2 2016
#10 université 2016 2 2017
# … with 11 more rows

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(dplyr::n())) # generate random groundwater values, one per sampled row
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble, by = c("sites", "wells", "months")) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
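The reason for joining onto the full set of months first is that a lag of 6 rows only means "6 months earlier" when no month is missing. A small base-R sketch for one well in one year, with made-up values and hypothetical names (`obs`, `full`, `diff_6`), shows the idea:

```r
# Observed data for one well and year; most months are missing.
obs <- data.frame(
  month = c("January", "February", "July", "August"),
  value = c(8.0, 8.5, 9.7, 9.6)
)

# Pad to all twelve months so that row position equals calendar position.
full <- data.frame(month = month.name)
full$value <- obs$value[match(full$month, obs$month)]

# Value minus the value six rows (months) earlier; NA where either is missing.
full$diff_6 <- full$value - c(rep(NA, 6), head(full$value, -6))

full[!is.na(full$diff_6), ]
```

Only July (July minus January) and August (August minus February) get a difference here; every other month is `NA` because one side of the pair is missing.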

R include rows conditioned to other variables with `add_row`

I have a data.frame like test. It corresponds to information associated with a registry of firms. year.entry reflects the time period when a firm gets into the registry. items are elements that represent capacity and remain fixed throughout time. It may happen that the firm increases its capacity in a particular year. My aim is to present this information longitudinally.
To do that, I would ideally add rows for the years that are missing between 2010 and 2015. I have tried this with add_row() from tibble, but I am having difficulty making it work.
> test %>% add_row(firm = firm, year.entry == (year.entry)+1, item = item, .before = row_number(year.entry) == n())
Error in eval(expr, envir, enclos) : object 'firm' not found
I wonder whether there is an easier way to solve this problem. The ideal data frame should look like this:
firm year.entry item
<chr> <chr> <int>
1 1-102642692 2010 15
2 1-102642692 2011 15
3 1-102642692 2012 15
4 1-102642692 2013 15
5 1-102642692 2014 15
6 1-102642692 2015 8
test is given by:
test = data.frame(firm = c("1-102642692", "1-102642692"), year.entry = c(2010, 2015), item =c(15,8))
I add a dummy firm to the data to use later.
First I make sure every firm has all the years of the period of interest with complete. That is why I entered a dummy firm.
The missing years are added to the dataframe.
Then I take the last observation carried forward with na.locf.
When completed I remove the dummy firm.
comp <- data.frame(firm="test", year.entry= (2009:2016), item=0)
test = data.frame(firm = c("1-102642692", "1-102642692"), year.entry = c(2010, 2015), item =c(15,8))
library(dplyr) # %>%, arrange, group_by, mutate, filter
library(tidyr) # complete
library(zoo)   # na.locf
rbind(test,comp) %>%
complete(firm,year.entry) %>%
arrange(firm, year.entry)%>%
group_by(firm) %>%
mutate(item = na.locf(item, na.rm=FALSE)) %>%
filter(firm !="test")
result:
firm year.entry item
<fctr> <dbl> <dbl>
1-102642692 2009 NA
1-102642692 2010 15
1-102642692 2011 15
1-102642692 2012 15
1-102642692 2013 15
1-102642692 2014 15
1-102642692 2015 8
1-102642692 2016 8
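If only the years between the first and last entry are wanted, as in the desired output in the question, the same idea can be sketched in base R without a dummy firm. This is an illustrative alternative, not the method above; `grid`, `full` and `idx` are hypothetical names, and the carry-forward step assumes the first year has a value.

```r
test <- data.frame(
  firm       = c("1-102642692", "1-102642692"),
  year.entry = c(2010, 2015),
  item       = c(15, 8)
)

# Every year from the first to the last entry for this firm.
grid <- data.frame(
  firm       = test$firm[1],
  year.entry = as.numeric(min(test$year.entry):max(test$year.entry))
)

# Left join: years without an entry get item = NA.
full <- merge(grid, test, all.x = TRUE)
full <- full[order(full$year.entry), ]

# Last observation carried forward: for each row, the index of the most
# recent non-NA row so far (assumes the first row is not NA).
idx <- cummax(ifelse(is.na(full$item), 0L, seq_along(full$item)))
full$item <- full$item[idx]

full
```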

Convert data.frame wide to long while concatenating date formats

In R (or another language), I want to transform the upper data frame into the lower one.
How can I do that?
Thank you beforehand.
year month income expense
2016 07 50 15
2016 08 30 75
month income_expense
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
Well, it seems that you are trying to do several operations in the same question: combining the date columns, melting your data, renaming some columns, and sorting.
This will give your expected output:
library(tidyr); library(reshape2); library(dplyr)
df %>% unite("date", c(year, month), sep = "-") %>% # sep = "-" gives "2016-07"; the default is "_"
mutate(expense=-expense) %>% melt(id.vars="date", value.name="income_expense") %>%
select(-variable) %>% arrange(date)
#### date income_expense
#### 1 2016-07 50
#### 2 2016-07 -15
#### 3 2016-08 30
#### 4 2016-08 -75
I'm using three different libraries here, for better readability of the code. It might be possible to do it with base R, though.
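A base-R version might look like the sketch below. It assumes `month` is stored as a character column (so the leading zero survives), and it relies on `rbind()` followed by `as.vector()` to interleave income and negated expense row by row; `long` is an illustrative name.

```r
df <- data.frame(
  year    = 2016,
  month   = c("07", "08"),
  income  = c(50, 30),
  expense = c(15, 75)
)

long <- data.frame(
  # One "year-month" label per original row, repeated for income and expense.
  month = rep(paste(df$year, df$month, sep = "-"), each = 2),
  # rbind() stacks income over -expense; as.vector() reads the matrix
  # column by column, giving income1, -expense1, income2, -expense2.
  income_expense = as.vector(rbind(df$income, -df$expense))
)

long
```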
Here's a solution using only two packages, dplyr and tidyr
First, your dataset:
df <- dplyr::tibble( # data_frame() is deprecated; tibble() is re-exported by dplyr
year =2016,
month = c("07", "08"),
income = c(50,30),
expense = c(15, 75)
)
The mutate() function in dplyr creates/edits individual variables. The gather() function in tidyr will bring multiple variables/columns together in the way that you specify.
df <- df %>%
dplyr::mutate(
month = paste0(year, "-", month)
) %>%
tidyr::gather(
key = direction, #your name for the new column containing classification 'key'
value = income_expense, #your name for the new column containing values
income:expense #which columns you're acting on
) %>%
dplyr::mutate(income_expense =
ifelse(direction=='expense', -income_expense, income_expense)
)
The output has all the information you'd need (but we will clean it up in the last step)
> df
# A tibble: 4 × 4
year month direction income_expense
<dbl> <chr> <chr> <dbl>
1 2016 2016-07 income 50
2 2016 2016-08 income 30
3 2016 2016-07 expense -15
4 2016 2016-08 expense -75
Finally, we select() to drop columns we don't want, and then arrange it so that df shows the rows in the same order as you described in the question.
df <- df %>%
dplyr::select(-year, -direction) %>%
dplyr::arrange(month)
> df
# A tibble: 4 × 2
month income_expense
<chr> <dbl>
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
NB: I guess that I'm using three libraries, including magrittr for the pipe operator %>%. But, since the pipe operator is the best thing ever, I often forget to count magrittr.
