I am trying to compute growth rates for some variables in an unbalanced panel data set, but I'm still getting results for years in which the lag does not exist.
I've been trying to compute the growth rates with the dplyr library, as shown below:
total_firmas_growth <- total_firmas %>%
  group_by(firma) %>%
  arrange(anio, .by_group = TRUE) %>%
  mutate(
    ing_real_growth = (ingresos_real_2 / lag(ingresos_real_2) - 1) * 100
  )
For instance, if a firm has a value for "ingresos_real_2" in 2008 and its next value is in 2012, the code calculates a growth rate for 2012 instead of returning NA, even though the preceding year is missing (i.e., 2011 is needed to calculate the 2012 growth rate). The result I want is shown below; see "firma" (id) 115, where the 2012 growth rate should be NA:
total_firmas_growth
  firma anio ingresos_real_2 ing_real_growth
1   110 2005           14000              NA
2   110 2006           15000            7.14
3   110 2007           13000           -13.3
4   115 2008           15000              NA
5   115 2012           13000              NA
6   115 2013           14000            7.69
I would really appreciate your help.
The easiest way to get NAs for the missing firm-year rows is to join your original table against a tibble containing every combination of the grouping column and the years. expand() (from tidyr) creates that all-by-all tibble of the variables you are interested in, and . refers to whatever was piped in. Note that the join has to keep the rows of the expanded tibble (right_join here); a left_join would just return the original rows. expand() only uses years that appear somewhere in the data, so to also cover years that are absent for every firm, use anio = full_seq(anio, 1) inside expand(). Since any mathematical operation that includes an NA results in an NA, this should get you what you're after if you run your group_by, arrange, mutate code after it.
total_firmas %>%
  right_join(
    expand(., firma, anio),
    by = c("firma", "anio")
  )
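If you would rather not add rows to the table, a minimal sketch of an alternative is to compare each year with the previously observed year and only compute the growth rate when the gap is exactly one year. The data frame below just retypes the example from the question; the helper columns prev_anio and prev_ing are names made up for illustration.

```r
library(dplyr)

# Example data retyped from the question
total_firmas <- data.frame(
  firma = c(110, 110, 110, 115, 115, 115),
  anio = c(2005, 2006, 2007, 2008, 2012, 2013),
  ingresos_real_2 = c(14000, 15000, 13000, 15000, 13000, 14000)
)

total_firmas_growth <- total_firmas %>%
  group_by(firma) %>%
  arrange(anio, .by_group = TRUE) %>%
  mutate(
    prev_anio = lag(anio),            # previous observed year (hypothetical helper)
    prev_ing = lag(ingresos_real_2),  # previous observed value (hypothetical helper)
    # growth rate only when the previous observation is exactly one year earlier;
    # the first row of each firm has an NA gap, so it also ends up NA
    ing_real_growth = if_else(
      anio - prev_anio == 1,
      (ingresos_real_2 / prev_ing - 1) * 100,
      NA_real_
    )
  ) %>%
  ungroup() %>%
  select(-prev_anio, -prev_ing)

print(total_firmas_growth)
```

With this approach firma 115 gets NA for 2012 (the previous observation is from 2008) and a regular growth rate for 2013.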
Related
I have a dataset with ~40 variables, with a row for each of the 25 areas and each quarter. We have data from 2019 Q1 to today, 2022 Q2. For each quarter I am creating a rate (variable / population * 10000) to allow comparison; however, we want each quarter's rate to be based on the preceding 4 quarters, i.e. the 2022 Q2 rate will be the sum of the variable for 2022 Q2, 2022 Q1, 2021 Q4 and 2021 Q3. I can calculate this for all the relevant columns using the code below:
full_data_rates_pop %>%
  group_by(Area) %>%
  summarise(across(4:21, ~ sum(., na.rm = TRUE) / mean(Population_17.24) * 10000)) %>%
  bind_rows(full_data_rates_pop) %>%
  arrange(Area, -Quarter) %>% # sort so that total rows come after the quarters
  mutate(Timeframe = if_else(is.na(Quarter), timeframe_value, 'Quarterly'))
This does the job for my areas. However, I also want to create regional rates for each time period. Originally I just summed the variable and the population over all the areas and created the rates in the same way, but I have realised that data is missing for some areas/time periods, so the current method produces inaccurate results. For each column, I want to be able to exclude any rows where that column is NULL.
Area Quarter Metric_1 Metric_2 Population
A    2022.2  45       89       12000
A    2022.1  58       23       12000
A    2021.4  NULL     64       11000
A    2021.3  20       76       11000
B    2022.2  56       101      9700
B    2022.1  32       78       9700
B    2021.4  41       NULL     10100
B    2021.3  38       NULL     10100
This is a mini dummy version of my data with just the latest 4 quarters. I want the new row to be calculated so that each rate is the sum of all values divided by the sum of population, excluding any rows where that metric's value is NULL:
Area Quarter Metric_1_rate Metric_2_rate
ALL  2022.2  38.87         75.08
Is there a way to exclude rows that have a NULL value from the calculation for that column, while still using those rows for the other columns where the value is not NULL?
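One way this could be approached, as a rough sketch only: inside across(), each column's own missingness can be used to subset Population, so a row is dropped for one metric but still counts for the others. The data frame below retypes the dummy data with NULL entered as NA, and the names full_data and regional are made up for illustration.

```r
library(dplyr)

# Dummy data retyped from the question, with NULL represented as NA (assumption)
full_data <- data.frame(
  Area = rep(c("A", "B"), each = 4),
  Quarter = rep(c(2022.2, 2022.1, 2021.4, 2021.3), times = 2),
  Metric_1 = c(45, 58, NA, 20, 56, 32, 41, 38),
  Metric_2 = c(89, 23, 64, 76, 101, 78, NA, NA),
  Population = c(12000, 12000, 11000, 11000, 9700, 9700, 10100, 10100)
)

regional <- full_data %>%
  summarise(
    Area = "ALL",
    Quarter = max(Quarter),
    across(
      c(Metric_1, Metric_2),
      # sum the metric over non-missing rows and divide by the population
      # of exactly those rows
      ~ sum(.x, na.rm = TRUE) / sum(Population[!is.na(.x)]) * 10000,
      .names = "{.col}_rate"
    )
  )

print(regional)
```

On the dummy data this reproduces the 38.87 for Metric_1 (290 / 74600 * 10000). The resulting row can then be added to the area-level table with bind_rows(), as in the code from the question.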
Say I have one data frame of tooth brush brands and a measure of how popular they are over time:
year brand_1 brand_2
2010 0.7 0.3
2011 0.6 0.6
2012 0.4 0.9
And another that says when each tooth brush brand went electrical, with NA meaning they never did so:
brand went_electrical_year
brand_1 NA
brand_2 2011
Now I'd like to combine these to get the prevalence of electrical tooth brush brands (as a proportion of the total) each year:
year electrical_prevalence
2010 0
2011 0.5
2012 0.69
In 2010 it's 0 b/c none of the brands are electrical yet. In 2011 it's 0.5 b/c brand_2 has gone electrical and the two brands are equally prevalent. In 2012 it's 0.69 b/c the electrical brand is more prevalent (0.9 / 1.3).
I've wrestled with this in R but can't figure out a way to do it. Would appreciate any help or suggestions. Cheers.
Assuming your data frames are df1 and df2, you can use the following tidyverse approach.
First, use pivot_longer to put your data into a long format which will be easier to manipulate. Use left_join to add the relevant years of when the brands went electrical.
We can create an indicator mult, which will be 1 if the brand has gone electrical by that year, or 0 if it hasn't (including brands that never did).
Then, for each year, you can determine the proportion by multiplying the popularity value by mult for each brand, and then dividing by the total sum for that year.
library(tidyverse)
df1 %>%
  pivot_longer(cols = -year) %>%
  left_join(df2, by = c("name" = "brand")) %>%
  mutate(mult = ifelse(went_electrical_year > year | is.na(went_electrical_year), 0, 1)) %>%
  group_by(year) %>%
  summarise(electrical_prevalence = sum(value * mult) / sum(value))
Output
year electrical_prevalence
<int> <dbl>
1 2010 0
2 2011 0.5
3 2012 0.692
I have a table with countries, GDP, and some missing values. I want to replace the missing values with a mean, not the mean of the whole column but only the mean of the same group (country).
I have 27 countries and 11 years, like this:
countries year GDP
1 2001 125
1 2002 ...
1 2003 525
2 2001 222
2 2002 ...
So I would like to get the mean of the first country over all years and use it to replace that country's missing GDP values.
I know how to do the replacement with the whole-column mean:
data$gdp[which(is.na(data$gdp))]<- mean(data$gdp, na.rm=TRUE)
but this uses the mean of the whole column. I don't want to take a subset for each country and calculate it separately; I was wondering whether I could do it in one go.
One option is to use na.aggregate (from zoo; by default it takes the mean and replaces the NA elements), grouped by 'countries':
library(dplyr)
library(zoo)
df1 %>%
  group_by(countries) %>%
  mutate(GDP = na.aggregate(GDP))
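If you prefer not to depend on zoo, a minimal sketch of the same idea with dplyr alone is to replace each NA with the mean of its own group. The small df1 below is hypothetical: it retypes the first rows of the question and assumes the elided values ("...") are the missing ones, entered as NA.

```r
library(dplyr)

# Hypothetical sample based on the question; the "..." entries are assumed to be NA
df1 <- data.frame(
  countries = c(1, 1, 1, 2, 2),
  year = c(2001, 2002, 2003, 2001, 2002),
  GDP = c(125, NA, 525, 222, NA)
)

filled <- df1 %>%
  group_by(countries) %>%
  # within each country, swap NA for that country's mean of the observed values
  mutate(GDP = if_else(is.na(GDP), mean(GDP, na.rm = TRUE), GDP)) %>%
  ungroup()

print(filled)
```

Here the missing value for country 1 becomes 325 (the mean of 125 and 525), and the one for country 2 becomes 222.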
Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the latest (most recent) year. In the first row of the output, John has the latest year (2016) but the balance from 2015, which is not what I need; it should output 100, not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I would suggest using the dplyr library:
library(dplyr)
data.frame(Name = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
           Year = c(2016, 2015, 2014, 2016, 2006, 2006),
           Balance = c(100, 150, 65, 75, 150, 10)) %>% # create the data frame
  as_tibble() %>% # convert it to a tibble
  group_by(Name, Year) %>% # group it by Name and Year
  summarise(maxBalance = max(Balance)) %>% # calculate the maximum for each group
  group_by(Name) %>% # group the resulting data frame by Name
  slice_max(Year, n = 1) # keep only the row for the latest year of each Name
Here is another solution without the data.table package.
First, sort the data frame by descending Year and Balance:
df <- df[order(-df$Year, -df$Balance), ]
Then select the first row in each group with the same name:
df[!duplicated(df$Name), ]
I have a data frame called dt. dt looks like this.
Year Sale
2009 6
2008 3
2007 4
2006 5
2005 12
2004 3
I am interested in getting the standard deviation of sales over the past four years. Where four years of data are not available, as for 2007, 2006, 2005, and 2004, I want NA. How can I create a new column with the value corresponding to each year? The new data would look like this:
Year Sale std.
2009 6 std(05,06,07,08)
2008 3 std(07,06,05,04)
2007 4 NA
2006 5 NA
2005 12 NA
2004 3 NA
I tried this a lot, but because I am a novice at R, I couldn't do it. Someone please help. Thanks.
Edit :
Here is the data with GVKEY.
GVKEY FYEAR IBC
1 1004 2003 3.504
2 1004 2004 18.572
3 1004 2005 35.163
4 1004 2006 59.447
5 1004 2007 75.745
Regards
Edit:
I am using the rollapply function mentioned in the answer, in this manner:
dt <- ddply(dt, .(GVKEY), function(x){x$ww <- rollapply(x$Sale,4,sd, fill =NA, align="right"); x});
But I am getting the following error:
Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argument
Not sure what I am doing wrong. The data with GVKEY is mentioned at the top.
You can use rollapply from package zoo:
require(zoo)
rollapply(df$Sale, 4, sd, fill=NA, align="right")
[edit] I assumed your data frame is sorted by year in ascending order. If it is still in the original (descending) order, you will probably need to use align="left" instead.
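This is how I solved the problem:
library(sqldf) # sqldf()
library(TTR)   # runSD()
dt <- dt[order(dt$GVKEY, dt$FYEAR), ]
dt <- sqldf("select GVKEY, FYEAR, IBC from dt")
# SD of the previous four years (current year excluded); all NA for firms with fewer than four rows
dt$STDEARN <- ave(dt$IBC, dt$GVKEY, FUN = function(x) {
  if (length(x) > 3) c(NA, head(runSD(x, 4), -1)) else rep(NA_real_, length(x))
})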
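For comparison, here is a rough sketch of the same "previous four years" logic using zoo instead of TTR. The helper name prev4_sd is made up for illustration, and the sample vector is the Sale column from the question, sorted by ascending year. The length check is there because short groups are a likely cause of the rollapply error shown above.

```r
library(zoo)

# Hypothetical helper: SD of the four observations before each position.
# Assumes x is sorted by ascending year with no gaps between years.
prev4_sd <- function(x) {
  n <- length(x)
  # too few observations for a window of four: return all NA
  if (n < 4) return(rep(NA_real_, n))
  # right-aligned rolling SD over the current and three preceding values
  trailing <- rollapplyr(x, width = 4, FUN = sd, fill = NA)
  # shift by one so the current year is excluded from its own window
  c(NA_real_, head(trailing, -1))
}

# Sale values from the question for 2004 to 2009, in ascending year order
sale <- c(3, 12, 5, 4, 3, 6)
res <- prev4_sd(sale)
print(res)
```

The first four entries are NA, the fifth (2008) is the SD of the 2004 to 2007 values, and the sixth (2009) is the SD of the 2005 to 2008 values. Per firm it could be applied as ave(dt$IBC, dt$GVKEY, FUN = prev4_sd) after sorting by GVKEY and FYEAR.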