Transforming a dataframe from wide to long using dplyr [duplicate] - r

I would like to transform my data frame from wide to long format so that I can make a plot with the years on the x-axis and the shares on the y-axis. I would like to draw a line for each family because the goal is to see the gap between the two 2034 values.
This is what my dataframe currently looks like:
And this is my desired output (I would call the x-axis "Year" and the y-axis "Share")
I have already tried gather() (which comes from tidyr, not dplyr):
gather(CPS_fam.long, Year, Share, 2:5)
but I obtain the years duplicated instead of the Family Name.
I cannot provide the data but any suggestion using a sample dataframe would be highly appreciated.

I think you already got the right code but simply have to arrange the resulting data frame by Fam_Name.
Let me reproduce your problem:
library(tidyverse)
df <- tibble("Fam_Name" = c("Architecture", "Arts", "Business"), "2002" = c(0.134, 0.116, 0.399), "2018" = c(0.161, 0.089, 0.06))
df %>% gather(., key = Year, value = Shares, c("2002", "2018"))
# Fam_Name Year Shares
# <chr> <chr> <dbl>
#1 Architecture 2002 0.134
#2 Arts 2002 0.116
#3 Business 2002 0.399
#4 Architecture 2018 0.161
#5 Arts 2018 0.089
#6 Business 2018 0.06
Now, with arrange as last part of the pipe:
df %>% gather(., key = Year, value = Shares, c("2002", "2018")) %>% arrange(Fam_Name)
# Fam_Name Year Shares
# <chr> <chr> <dbl>
#1 Architecture 2002 0.134
#2 Architecture 2018 0.161
#3 Arts 2002 0.116
#4 Arts 2018 0.089
#5 Business 2002 0.399
#6 Business 2018 0.06
Is this what you want?
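As a side note, gather() has been superseded in tidyr by pivot_longer(), which returns the rows already grouped by family. A minimal sketch, assuming the same illustrative sample values as above (not the asker's real data) and hypothetical object names `long` and `p`:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Illustrative sample data; the real data frame was not provided
df <- tibble(
  Fam_Name = c("Architecture", "Arts", "Business"),
  `2002` = c(0.134, 0.116, 0.399),
  `2018` = c(0.161, 0.089, 0.06)
)

long <- df %>%
  pivot_longer(cols = -Fam_Name, names_to = "Year", values_to = "Share") %>%
  mutate(Year = as.integer(Year))  # numeric years give a continuous x-axis

# One line per family, years on x, shares on y
p <- ggplot(long, aes(x = Year, y = Share, colour = Fam_Name, group = Fam_Name)) +
  geom_line() +
  geom_point()
```

Printing `p` draws the plot. Converting Year to an integer is a design choice so that the spacing between years on the x-axis reflects the actual time gap.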

Related

Different results of `summarize` and `group_by` with different months in time-series datasets

I have daily time-series data for more than 20 years. I want to extract the quantiles (0.1, 0.5, 0.9) over three-month windows for each year, where the windows are JFM (Jan-Mar), FMA (Feb-Apr), ... and so on until OND (Oct-Dec). As a newbie in R, after many days of research over the past two weeks, I finally found a method to do this. However, I am stuck on the final step.
Actually, I am working using lists. But, for example, let's say we have this dataframe:
library(lubridate)
Date<-seq.Date(ymd(19700101),ymd(19721231),"day")
Q<-runif(ymd(19730101)-ymd(19700101),1,20)
df<-data.frame(Date,Q)
Now, we subset the df to keep only a specific three-month window (in this case JFM and FMA):
df.JFM<-df[months(df$Date) %in% month.name[1:3],] #cutting Jan-Mar
df.FMA<-df[months(df$Date) %in% month.name[2:4],] #cutting Feb-Apr
Then, to find the 50% quantile of each three-month series, I use this method:
library(dplyr)
df.JFM %>% group_by(Year=floor_date(Date, "3 months")) %>%
summarize(Q=quantile(Q, 0.5, na.rm=T))
# A tibble: 3 x 2
Year Q
<date> <dbl>
1 1970-01-01 8.83
2 1971-01-01 9.88
3 1972-01-01 11.3
No issue in the JFM set. Let's see for FMA set:
df.FMA %>% group_by(Year=floor_date(Date, "3 months")) %>%
summarize(Q=quantile(Q, 0.5, na.rm=T))
# A tibble: 6 x 2
Year Q
<date> <dbl>
1 1970-01-01 8.75
2 1970-04-01 13.5
3 1971-01-01 8.58
4 1971-04-01 13.2
5 1972-01-01 10.2
6 1972-04-01 8.29
Here, we find that the floor_date function rounds the February dates down to January of the same year. I expected that after I cut the data with February as the first element in the Date column, floor_date would start from February. Apparently not. I also tried other three-month series and found that they give the same result as the FMA set. I also tried changing the index of the dataframe back to the original index before the subset/cut, but no luck.
How can I solve this problem?
Other methods for obtaining quantiles from a given period in a year (in the sense of my aim described at the beginning of the post) are also very welcomed.
Thank you.
Here, floor_date/ceiling_date always rounds to three-month periods counted from the start of the year (January, April, July, October), not from the first date in the data.
You can use cut instead, which starts its breaks at the first month present in the data and so does what you need.
library(dplyr)
df.JFM %>%
group_by(Year=cut(Date, "3 months")) %>%
summarize(Q=quantile(Q, 0.5, na.rm=TRUE))
# Year Q
# <fct> <dbl>
#1 1970-01-01 11.0
#2 1971-01-01 11.5
#3 1972-01-01 9.57
df.FMA %>%
group_by(Year= cut(Date, '3 months')) %>%
summarize(Q = quantile(Q, 0.5, na.rm=T))
# Year Q
# <fct> <dbl>
#1 1970-02-01 11.3
#2 1971-02-01 10.5
#3 1972-02-01 9.67
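Since other methods were also welcome: a rough sketch of computing all three quantiles for every window from JFM to OND in one pass, without relying on date rounding at all. It filters by month number and groups by calendar year. The names `windows`, `res`, `q10`, `q50`, `q90` and `Window` are made up for this illustration, and the data is random as in the question.

```r
library(dplyr)
library(lubridate)

set.seed(1)
Date <- seq.Date(ymd(19700101), ymd(19721231), "day")
df <- data.frame(Date, Q = runif(length(Date), 1, 20))

# Start months 1..10 give the windows JFM, FMA, ..., OND
windows <- lapply(1:10, function(m) {
  # Label built from the initials of the three month names, e.g. "JFM"
  label <- paste(substr(month.name[m:(m + 2)], 1, 1), collapse = "")
  df %>%
    filter(month(Date) %in% m:(m + 2)) %>%
    group_by(Year = year(Date)) %>%
    summarise(
      q10 = quantile(Q, 0.1, na.rm = TRUE, names = FALSE),
      q50 = quantile(Q, 0.5, na.rm = TRUE, names = FALSE),
      q90 = quantile(Q, 0.9, na.rm = TRUE, names = FALSE),
      .groups = "drop"
    ) %>%
    mutate(Window = label)
})

res <- bind_rows(windows)
```

This gives one row per window and year (30 rows for three years). Because no window crosses a year boundary, grouping by year(Date) is enough; windows such as NDJ would need extra handling.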

In r, how do I add rows together to get totals for a specific set of variables [duplicate]

My goal is to have a list of how much FDI China sent to each country per year. At the moment I have a list of individual projects that looks like this
Year Country Amount
2001 Angola 6000000
2001 Angola 8000000
2001 Angola 5.0E7
I want to sum it so it looks like this.
Year Country Amount
2001 Angola 6.4E7
How do I merge the rows and add the totals to get nice country-year data? I can't find an R command that does this precise thing.
library(tidyverse)
I copied the data table and read your dataframe into R using:
df <- clipr::read_clip_tbl(clipr::read_clip())
I like using dplyr to solve this question:
df2 <- as.data.frame(df %>% group_by(Country,Year) %>% summarize(Amount=sum(Amount)))
# A tibble: 1 x 3
# Groups: Country [1]
Country Year Amount
<chr> <int> <dbl>
1 Angola 2001 64000000
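For readers without the clipboard data, here is a self-contained sketch of the same idea. The data frame `fdi` is typed in from the three rows shown in the question; `totals` and `totals_base` are illustrative names. The base R aggregate() call is included as an alternative that needs no packages.

```r
library(dplyr)

# The three project rows from the question, typed in by hand
fdi <- data.frame(
  Year = c(2001, 2001, 2001),
  Country = c("Angola", "Angola", "Angola"),
  Amount = c(6e6, 8e6, 5e7)
)

# dplyr: one row per country-year with the summed amount
totals <- fdi %>%
  group_by(Country, Year) %>%
  summarise(Amount = sum(Amount), .groups = "drop")

# Base R equivalent
totals_base <- aggregate(Amount ~ Country + Year, data = fdi, FUN = sum)
```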

In R, how to combine two data frames where column names in one equals row values in another?

Say I have one data frame of tooth brush brands and a measure of how popular they are over time:
year brand_1 brand_2
2010 0.7 0.3
2011 0.6 0.6
2012 0.4 0.9
And another that says when each tooth brush brand went electrical, with NA meaning they never did so:
brand went_electrical_year
brand_1 NA
brand_2 2011
Now I'd like to combine these to get the prevalence of electrical tooth brush brands (as a proportion of the total) each year:
year electrical_prevalence
2010 0
2011 0.5
2012 0.69
In 2010 it's 0 b/c none of the brands are electrical. In 2011 it's 0.5 b/c only brand_2 is electrical and the two brands are equally prevalent. In 2012 it's 0.69 b/c brand_2, the electrical one, is more prevalent.
I've wrestled with this in R but can't figure out a way to do it. Would appreciate any help or suggestions. Cheers.
Assuming your data frames are df1 and df2, you can use the following tidyverse approach.
First, use pivot_longer to put your data into a long format which will be easier to manipulate. Use left_join to add the relevant years of when the brands went electrical.
We can create an indicator mult, which is 1 if the brand has gone electrical by that year and 0 if it has not (or never did).
Then, for each year, you can determine the proportion by multiplying the popularity value by mult for each brand, and then dividing by the total sum for that year.
library(tidyverse)
df1 %>%
pivot_longer(cols = -year) %>%
left_join(df2, by = c("name" = "brand")) %>%
mutate(mult = ifelse(went_electrical_year > year | is.na(went_electrical_year), 0, 1)) %>%
group_by(year) %>%
summarise(electrical_prevalence = sum(value * mult) / sum(value))
Output
year electrical_prevalence
<int> <dbl>
1 2010 0
2 2011 0.5
3 2012 0.692
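To try this without the original data, the two data frames can be typed in from the tables in the question. The sketch below also swaps the numeric mult indicator for a logical column, which should be equivalent; the column names `popularity` and `is_electrical` and the object `res` are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Data typed in from the question
df1 <- tibble(
  year = 2010:2012,
  brand_1 = c(0.7, 0.6, 0.4),
  brand_2 = c(0.3, 0.6, 0.9)
)
df2 <- tibble(
  brand = c("brand_1", "brand_2"),
  went_electrical_year = c(NA, 2011L)
)

res <- df1 %>%
  pivot_longer(cols = -year, names_to = "brand", values_to = "popularity") %>%
  left_join(df2, by = "brand") %>%
  # TRUE only when a switch year exists and has been reached
  mutate(is_electrical = !is.na(went_electrical_year) & went_electrical_year <= year) %>%
  group_by(year) %>%
  summarise(electrical_prevalence = sum(popularity[is_electrical]) / sum(popularity))
```

Checking is.na() first means a brand that never went electrical yields FALSE rather than NA, so no ifelse() is needed.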

Calculate annual average of quarterly data in R [duplicate]

I have a dataframe with some TS data reported quarterly, as follows
quarter region value
2018T4 A 4
2018T3 A 2
2018T2 A 3
2018T1 A 9
2018T4 B 6
2018T3 B 2
2018T2 B 5
2018T1 B 8
2017T4 A 2
...
I want to aggregate the quarterly observations and average them to obtain an annual mean value for each year and region, as such
year region value
2018 A 4.5
2018 B 5.25
2017 A 2
...
What would be an appropriate approach to this?
We can strip the quarter suffix from the quarter column to get the year, then take the mean by year and region.
aggregate(value~year+region, transform(df, year = sub('T.*', '', quarter)), mean)
# year region value
#1 2017 A 2.00
#2 2018 A 4.50
#3 2018 B 5.25
The same using dplyr:
library(dplyr)
df %>%
group_by(year = sub('T.*', '', quarter), region) %>%
summarise(value = mean(value))
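A runnable version of the dplyr approach, assuming only the nine rows visible in the question (the elided rows are not reconstructed). Here substr() is used instead of sub() to take the first four characters as the year, which assumes every quarter label starts with a four-digit year; `annual` is an illustrative name.

```r
library(dplyr)

# Only the rows shown in the question
df <- data.frame(
  quarter = c("2018T4", "2018T3", "2018T2", "2018T1",
              "2018T4", "2018T3", "2018T2", "2018T1", "2017T4"),
  region = c("A", "A", "A", "A", "B", "B", "B", "B", "A"),
  value = c(4, 2, 3, 9, 6, 2, 5, 8, 2)
)

annual <- df %>%
  # First four characters of e.g. "2018T4" are the year
  group_by(year = as.integer(substr(quarter, 1, 4)), region) %>%
  summarise(value = mean(value), .groups = "drop")
```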

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(dplyr::n())) # generate one random groundwater value per sampled row
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble, by = c("sites", "wells", "months")) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
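A variant closer to the column layout in the question (Well, year, month, value), sketched under the assumption that month is a factor whose levels are the twelve month names in calendar order. tidyr::complete() fills in the missing months with NA so that a six-row offset always lines up January with July, February with August, and so on. Here lead() is used so the difference reads "earlier month minus later month" (Jan - Jul), whereas a lag-based difference reads the other way round. The names `gw`, `diffs` and `diff_6m` are illustrative.

```r
library(dplyr)
library(tidyr)

# The ten rows printed in the question; January and December are missing
gw <- tibble(
  Well = 222,
  year = 1995,
  month = factor(
    c("February", "March", "April", "May", "June", "July",
      "August", "September", "October", "November"),
    levels = month.name
  ),
  value = c(8.53, 8.69, 8.92, 9.59, 9.59, 9.70, 9.66, 9.46, 9.49, 9.31)
)

diffs <- gw %>%
  complete(Well, year, month) %>%   # adds NA rows for unused factor levels
  arrange(Well, year, month) %>%
  group_by(Well, year) %>%
  mutate(diff_6m = value - lead(value, 6)) %>%  # Jan - Jul, Feb - Aug, ...
  ungroup() %>%
  filter(as.integer(month) <= 6)    # keep the six pairs, one row each
```

Pairs where either month is missing come out as NA rather than being silently paired with the wrong month.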
