In R, how to combine two data frames where column names in one equals row values in another? - r

Say I have one data frame of tooth brush brands and a measure of how popular they are over time:
year brand_1 brand_2
2010 0.7 0.3
2011 0.6 0.6
2012 0.4 0.9
And another that says when each tooth brush brand went electrical, with NA meaning they never did so:
brand went_electrical_year
brand_1 NA
brand_2 2011
Now I'd like to combine these to get the prevalence of electrical tooth brush brands (as a proportion of the total) each year:
year electrical_prevalence
2010 0
2011 0.5
2012 0.69
In 2010 it's 0 because none of the brands are electrical yet. In 2011 it's 0.5 because one brand is electrical and the two brands are equally prevalent. In 2012 it's 0.69 because the electrical brand is the more prevalent one.
I've wrestled with this in R but can't figure out a way to do it. Would appreciate any help or suggestions. Cheers.

Assuming your data frames are df1 and df2, you can use the following tidyverse approach.
First, use pivot_longer to put your data into a long format which will be easier to manipulate. Use left_join to add the relevant years of when the brands went electrical.
We can create an indicator mult, which is 1 if the brand has gone electrical by that year and 0 otherwise.
Then, for each year, you can determine the proportion by multiplying the popularity value by mult for each brand, and then dividing by the total sum for that year.
library(tidyverse)
df1 %>%
pivot_longer(cols = -year) %>%
left_join(df2, by = c("name" = "brand")) %>%
mutate(mult = ifelse(went_electrical_year > year | is.na(went_electrical_year), 0, 1)) %>%
group_by(year) %>%
summarise(electrical_prevalence = sum(value * mult) / sum(value))
Output
year electrical_prevalence
<int> <dbl>
1 2010 0
2 2011 0.5
3 2012 0.692
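To try this out, the two tables from the question can be rebuilt by hand. The following is only a sketch: the column names brand, popularity and is_electrical are chosen here for readability and are not from the original answer, and the indicator is written as a logical rather than with ifelse.

```r
library(dplyr)
library(tidyr)

# Rebuild the two tables from the question
df1 <- tibble(
  year    = 2010:2012,
  brand_1 = c(0.7, 0.6, 0.4),
  brand_2 = c(0.3, 0.6, 0.9)
)
df2 <- tibble(
  brand                = c("brand_1", "brand_2"),
  went_electrical_year = c(NA, 2011)
)

result <- df1 %>%
  # one row per year and brand
  pivot_longer(cols = -year, names_to = "brand", values_to = "popularity") %>%
  left_join(df2, by = "brand") %>%
  # TRUE only when the brand has a switch year and it is not in the future
  mutate(is_electrical = !is.na(went_electrical_year) & went_electrical_year <= year) %>%
  group_by(year) %>%
  # a logical counts as 0/1 when multiplied
  summarise(electrical_prevalence = sum(popularity * is_electrical) / sum(popularity))
```

Checking is.na() first means a brand that never went electrical always gets FALSE, without relying on how NA behaves inside the | operator.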

Related

How to exclude observation that does not appear at least once every year - R

I have a database where companies are identified by an ID (cnpjcei) from 2009 to 2018, where we can have 1 or more observations of a given company in a given year or no observations of a given company in a given year.
Here is a sample of the database:
> df
cnpjcei year
<chr> <dbl>
1 4774 2009
2 4774 2010
3 28959 2009
4 29688 2009
5 43591 2010
6 43591 2010
7 65803 2011
8 105104 2011
9 113980 2012
10 220043 2013
I would like to keep in that df only the companies that appear at least once in every year.
What would be the easiest way to do this?
Using the data.table library:
library(data.table)
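df <- data.table(df)
# count distinct years per company, then keep companies seen in all 10 years (2009 to 2018)
df <- df[, unique_years := length(unique(year)), by = list(cnpjcei)][unique_years == 10]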
We can use dplyr, group_by id and filter only the cases in which all the elements in 2009:2018 can be found %in% the year column.
Please mind that, for this code to work with the sample database as in the question, the range would have to be replaced with 2009:2013
library(dplyr)
df %>% group_by(cnpjcei) %>% filter(all(2009:2018 %in% year))
You can keep the ids (cnpjcei) that have every unique year available in the data.
library(dplyr)
result <- df %>%
group_by(cnpjcei) %>%
filter(n_distinct(year) == n_distinct(.$year)) %>%
ungroup
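The answers above differ in what counts as "every year": the explicit range checks against a fixed list, while n_distinct(.$year) checks against whatever years happen to occur in the data. A small sketch of the explicit-range version, using made-up toy data (the ids "A", "B", "C" and the required_years vector are hypothetical, not from the question):

```r
library(dplyr)

# Toy data: only company "A" is observed in each of 2009, 2010 and 2011
df <- tibble(
  cnpjcei = c("A", "A", "A", "A", "B", "B", "C"),
  year    = c(2009, 2010, 2010, 2011, 2009, 2011, 2010)
)

# Years every company must cover (assumed range for this toy example)
required_years <- 2009:2011

kept <- df %>%
  group_by(cnpjcei) %>%
  # keep the whole group only if no required year is missing
  filter(all(required_years %in% year)) %>%
  ungroup()
```

Repeated observations in the same year (company "A" in 2010) are all kept, because the filter decides per company, not per row.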

Growth Rates in Unbalanced Panel Data

I am trying to get a growth rate for some variables in unbalanced panel data, but I'm still getting results for years in which the lag does not exist.
I've been trying to get the growth rates using the dplyr library, as shown here:
total_firmas_growth <- total_firmas %>%
group_by(firma) %>%
arrange(anio, .by_group = T) %>% mutate(
ing_real_growth = (((ingresos_real_2/Lag(ingresos_real_2))-1)*100)
)
For instance, if a firm has a value for "ingresos_real_2" in 2008 and its next value is in 2012, the code calculates a growth rate instead of returning NA, even though the preceding year is missing (i.e. 2011 is needed to calculate the 2012 growth rate). See "firma" 115 (id) in the example right below, where the 2012 growth rate should be NA:
total_firmas_growth <-
" firma anio ingresos_real_2 ing_real_growth
1 110 2005 14000 NA
2 110 2006 15000 7.14
3 110 2007 13000 -13.3
4 115 2008 15000 NA
5 115 2012 13000 NA
6 115 2013 14000 7.69
I will really appreciate your help.
The easiest way to get your original table into a format with NAs for the missing years is to join it against a tibble holding every combination of the grouping column and your years. expand() creates that all-by-all tibble of the variables you are interested in, and {.} takes in whatever was piped more robustly than . (by creating a copy, I believe). The join has to keep every row of the expanded tibble, so use right_join here; a left_join from the original table would not add any rows. Since any arithmetic operation that includes an NA results in an NA, this should get you what you're after if you run your group_by, arrange, mutate code after it. Note that expand() only knows the years that occur somewhere in the data, so a year missing for every firm is still not filled in.
total_firmas %>%
right_join(
expand({.}, firma, anio),
by = c("firma","anio")
)
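An alternative that avoids padding the table is to check directly whether the previous row is exactly one year earlier. This is only a sketch under that assumption, rebuilt from the sample rows in the question; the helper column prev_value is a name introduced here for illustration.

```r
library(dplyr)

# Sample rows from the question
total_firmas <- tibble(
  firma           = c(110, 110, 110, 115, 115, 115),
  anio            = c(2005, 2006, 2007, 2008, 2012, 2013),
  ingresos_real_2 = c(14000, 15000, 13000, 15000, 13000, 14000)
)

total_firmas_growth <- total_firmas %>%
  group_by(firma) %>%
  arrange(anio, .by_group = TRUE) %>%
  mutate(
    # use the lagged value only when the previous observation is the prior year
    prev_value      = if_else(anio - lag(anio) == 1, lag(ingresos_real_2), NA_real_),
    ing_real_growth = (ingresos_real_2 / prev_value - 1) * 100
  ) %>%
  ungroup() %>%
  select(-prev_value)
```

With this data, firma 115 gets NA for 2012 because its previous observation is from 2008, while 2013 is still computed from 2012.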

Transforming a dataframe from wide to long using dplyr [duplicate]

I would like to transform my database from a wide format to a long format so that I can make a plot where there are the years in the x axis and the shares in the y axis. I would like to draw a line for each family because the goal is to see the gap between the two 2034 values.
This is what my dataframe currently looks like:
And this is my desired output (I would call the x-axis "Year" and the y-axis "Share")
I have already tried using the "gather" option of "dplyr" using:
gather(CPS_fam.long, Year, Share, 2:5)
but I obtain the years duplicated instead of the Family Name.
I cannot provide the data but any suggestion using a sample dataframe would be highly appreciated.
I think you already got the right code but simply have to arrange the resulting data frame by Fam_Name.
Let me reproduce your problem:
library(tidyverse)
df <- tibble("Fam_Name" = c("Architecture", "Arts", "Business"), "2002" = c(0.134, 0.116, 0.399), "2018" = c(0.161, 0.089, 0.06))
df %>% gather(., key = Year, value = Shares, c("2002", "2018"))
# Fam_Name Year Shares
# <chr> <chr> <dbl>
#1 Architecture 2002 0.134
#2 Arts 2002 0.116
#3 Business 2002 0.399
#4 Architecture 2018 0.161
#5 Arts 2018 0.089
#6 Business 2018 0.06
Now, with arrange as last part of the pipe:
df %>% gather(., key = Year, value = Shares, c("2002", "2018")) %>% arrange(Fam_Name)
# Fam_Name Year Shares
# <chr> <chr> <dbl>
#1 Architecture 2002 0.134
#2 Architecture 2018 0.161
#3 Arts 2002 0.116
#4 Arts 2018 0.089
#5 Business 2002 0.399
#6 Business 2018 0.06
Is this what you want?
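For reference, the same reshaping can be sketched with tidyr's pivot_longer, which the tidyr documentation recommends over gather for new code. The sample values are the ones from the answer above; converting Year to integer is an extra step added here on the assumption that a numeric x axis is wanted for plotting.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  Fam_Name = c("Architecture", "Arts", "Business"),
  `2002`   = c(0.134, 0.116, 0.399),
  `2018`   = c(0.161, 0.089, 0.06)
)

long <- df %>%
  # every column except Fam_Name becomes a (Year, Share) pair
  pivot_longer(cols = -Fam_Name, names_to = "Year", values_to = "Share") %>%
  # column names arrive as text, so convert them for a numeric axis
  mutate(Year = as.integer(Year))
```

pivot_longer keeps the rows of each family together, so no separate arrange step is needed here.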

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(dplyr::n())) # generate one random groundwater value per sampled row
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble, by = c("sites", "wells", "months")) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
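If the difference should stay within one well and one year, as in the question, another option is to pair each month with the month six positions later through a self-join. This is a sketch using the ten rows shown in the question; month_num, later_month and later_value are names introduced here, and the sign follows the asker's own attempt (earlier month minus later month).

```r
library(dplyr)

# The ten rows printed in the question (January and December are missing)
df <- tibble(
  Well  = 222,
  year  = 1995,
  month = factor(month.name[2:11], levels = month.name),
  value = c(8.53, 8.69, 8.92, 9.59, 9.59, 9.70, 9.66, 9.46, 9.49, 9.31)
)

# Calendar position of each month (January = 1, ..., December = 12)
monthly <- df %>%
  mutate(month_num = match(as.character(month), month.name))

diffs <- monthly %>%
  inner_join(
    # shift the later months back by six so they line up with their partner
    monthly %>%
      transmute(Well, year, month_num = month_num - 6,
                later_month = month, later_value = value),
    by = c("Well", "year", "month_num")
  ) %>%
  mutate(Diff = value - later_value) %>%
  select(Well, year, month, later_month, Diff)
```

Because it is an inner join, a pair is simply dropped when either month is missing, so the missing January means no January-July row is produced.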

removing all rows if a row = 0 for certain number of days per quarter

First and foremost, thank you for taking YOUR time to view/answer my question.
I am getting a bit stuck on this question - I believe I am close but could not get to my desired solution. I have quite a bit of stock data, see example below.
id date qtr price volume
1 2/8/12 2012 Q1 101 0
1 2/9/12 2012 Q1 101.1 105
1 2/17/12 2012 Q1 102.1 0
1 3/13/12 2012 Q1 104.1 0
1 5/12/12 2012 Q2 99.1 0
1 5/14/12 2012 Q2 101.1 24
2 2/12/12 2012 Q1 4 0
2 2/15/12 2012 Q1 4 0
2 3/19/12 2012 Q1 4.5 102
2 5/12/12 2012 Q2 6.5 291
2 5/13/12 2012 Q2 6.54 45
Essentially, I want to group_by(qtr, id), and If the volume is 0 for a security for more than 3 days - I want to remove it from the DF for that quarter.
I am assuming the formula would look something like this:
df %>% group_by(qtr, id) %>% filter(.....)
I have looked at other similar questions, however, most of them use rowSums, but not sure how that can be applicable in this case.
Thank you very much!
library(dplyr)
df %>%
group_by(id, qtr) %>%
filter(sum(volume == 0) <= 3)
Or with data.table
library(data.table)
setDT(df)
df[, if(sum(volume == 0) <= 3) .SD, by = .(id, qtr)]
We can use rle within filter to drop the 'qtr', 'id' groups in which 'volume' is 0 for 3 or more consecutive days
library(dplyr)
df %>%
group_by(qtr, id) %>%
filter(with(rle(volume == 0), !any(lengths[values] >= 3)))
NOTE: Using the above example, it would give the full dataset as the condition is not satisfied
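To see what the rle condition tests, here it is applied to a single made-up volume vector (the values are hypothetical, chosen so that one run of zeros has length 3):

```r
# Hypothetical daily volumes for one id and quarter
volume <- c(0, 105, 0, 0, 0, 24)

# Run-length encode the zero / non-zero pattern
runs <- rle(volume == 0)

# Lengths of the runs that consist of zeros only
zero_run_lengths <- runs$lengths[runs$values]

# TRUE when some run of zeros lasts 3 or more rows
has_long_zero_run <- any(zero_run_lengths >= 3)
```

Inside filter the condition is negated, so a group like this one would be dropped.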
Here, we assumed the zero 'volume' days have to be consecutive. If that is not the case, i.e. any 3 days within each group count, one option similar to @RyanD's in base R would be
df[with(df, ave(volume == 0, id, qtr, FUN = sum) <=3),]
df %>%
mutate(volume_ind = volume == 0) %>%
group_by(qtr, id) %>%
mutate(volume_ind = sum(volume_ind)) %>% # number of zero-volume days per quarter and id
ungroup %>%
filter(volume_ind <= 3) %>% # drop groups with more than 3 zero-volume days
select(-volume_ind)
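The count-based filter can be checked against the sample from the question. This is a sketch that keeps only the id, qtr and volume columns; max_zero_days is a name introduced here so that the threshold is easy to change.

```r
library(dplyr)

# id, qtr and volume columns from the question's sample
df <- tibble(
  id     = c(rep(1, 6), rep(2, 5)),
  qtr    = c(rep("2012 Q1", 4), rep("2012 Q2", 2),
             rep("2012 Q1", 3), rep("2012 Q2", 2)),
  volume = c(0, 105, 0, 0, 0, 24, 0, 0, 102, 291, 45)
)

# Keep groups with at most max_zero_days zero-volume rows
keep_liquid <- function(data, max_zero_days) {
  data %>%
    group_by(id, qtr) %>%
    filter(sum(volume == 0) <= max_zero_days) %>%
    ungroup()
}

kept     <- keep_liquid(df, max_zero_days = 3)  # "more than 3 days" rule
stricter <- keep_liquid(df, max_zero_days = 2)  # tighter rule, for comparison
```

With the rule from the question no group has more than 3 zero days, so all 11 rows are kept; with the tighter threshold of 2, id 1 in 2012 Q1 (3 zero days) is removed.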
