How to calculate the average of a column using other columns as a reference in Excel (like R's group_by)

I have three columns in Excel: year, month, and value.
I want to average value grouped by year and month. In R this would be done with group_by(). How can it be done in Excel?
year month value
2019 1 12
2019 1 34
2019 2 56
2019 2 15
2020 1 16
2020 3 67
2020 4 89
2018 6 123
2018 6 45
2018 7 98
2019 3 53
2019 1 23
2020 1 12
2020 3 1
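
(For reference, the group_by() pattern being mimicked would look roughly like this in R; a minimal dplyr sketch, assuming the data sit in a data frame called df with columns year, month, and value.)

library(dplyr)

df %>%
  group_by(year, month) %>%                            # group rows by the year/month pair
  summarise(avg_value = mean(value), .groups = "drop") # one averaged row per group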

If one has Office 365 we can use:
=LET(
y,A2:A15,
m,B2:B15,
v,C2:C15,
u,SORT(UNIQUE(CHOOSE({1,2},y,m)),{1,2}),
CHOOSE({1,1,2},u,AVERAGEIFS(v,y,INDEX(u,0,1),m,INDEX(u,0,2))))
Put this in the first cell and it will spill the results.
Once HSTACK is released, we can replace the CHOOSE with it:
=LET(
y,A2:A15,
m,B2:B15,
v,C2:C15,
u,SORT(UNIQUE(HSTACK(y,m)),{1,2}),
HSTACK(u,AVERAGEIFS(v,y,INDEX(u,0,1),m,INDEX(u,0,2))))

AVERAGEIFS will do what you want, but you might also want to look at the FILTER function to replicate the group_by() approach for other similar tasks. Once the group is filtered, you can sum, average, sort, etc.
AVERAGEIFS:
=AVERAGEIFS(C:C,A:A,2018,B:B,6)
FILTER:
=FILTER(C:C,(A:A=2018)*(B:B=6))
=AVERAGE(FILTER(C:C,(A:A=2018)*(B:B=6)))
See this spreadsheet for examples of both. I realize you're using Excel, but these formulas should work in both Excel and Google Sheets (though the two are not identical).

Related

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim is to convert the Date column, which is currently in a numeric YYYYMMDD format, into YYYY by simply taking the first four digits of each entry in the date column, as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal is to employ the plm package for panel data regression, but when applying the package and using pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be given the tag A, 2004). To solve this issue, I would like to delete row 1 in the original data and only keep the newer observation from the year 2004. This would then provide me with unique ID/Time pairs across the whole data.
Therefore I was hoping someone could help me out with a loop or a package suggestion with which I keep only the row with the newer/later observation within a year whenever such duplicates occur, in a way that also works for larger data sets. I believe this involves a couple of conditional steps that I am currently having difficulty putting together. A loop that checks whether the first four digits of consecutive date observations are identical and then deletes the row with the "smaller" date (keeping the "larger" one) would probably do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference for picking the later observation and mutating a new column that holds only the year, since you want the latest observation within each year.
library(dplyr)

Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)

Data %>%
  arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
Since we arranged by the actual date in descending order, the row that distinct() drops for each unique combination of ID and year is guaranteed to be the older one. You can reverse the arrangement to keep the oldest occurrence instead.
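An alternative sketch (not from the original answer; it assumes dplyr >= 1.0) keeps the latest observation per ID/year with slice_max(). Run on the original Data, either the numeric YYYYMMDD column or the converted Date column orders correctly:

library(dplyr)

Data %>%
  mutate(year = substr(Date_column, 1, 4)) %>%
  group_by(ID_column, year) %>%
  slice_max(Date_column, n = 1) %>%   # keep only the latest date within each ID/year
  ungroup()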

How can I add a new variable with mutate: growth rate?

I haven't coded for several months and now am stuck with the following issue.
I have the following dataset:
Year World_export China_exp World_import China_imp
1 1992 3445.534 27.7310 3402.505 6.2220
2 1993 1940.061 27.8800 2474.038 18.3560
3 1994 2458.337 39.6970 2978.314 3.3270
4 1995 4641.168 15.9790 5504.787 18.0130
5 1996 5680.688 74.1650 6939.291 25.1870
6 1997 7206.604 70.2440 8639.422 31.9030
7 1998 7069.725 99.6510 8530.293 41.5030
8 1999 5916.077 169.4593 6673.743 37.8139
9 2000 7331.588 136.2180 8646.253 47.3789
10 2001 7471.374 143.0542 8292.893 41.2899
11 2002 8074.975 217.4286 9092.341 46.4730
12 2003 9956.433 162.2522 11558.007 71.7753
13 2004 13751.671 282.8678 16345.452 157.0768
14 2005 15976.238 430.8655 16708.094 284.1065
15 2006 19728.935 398.6704 22344.856 553.6356
16 2007 24275.244 484.5276 28693.113 815.7914
17 2008 32570.781 613.3714 39381.251 1414.8120
18 2009 21282.228 173.9463 28563.576 1081.3720
19 2010 25283.462 475.7635 34884.450 1684.0839
20 2011 41418.670 636.5881 45759.051 2193.8573
21 2012 46027.529 432.6025 46404.382 2373.4535
22 2013 37132.301 460.7133 43022.550 2829.3705
23 2014 36046.461 640.2552 40502.268 2373.2351
24 2015 26618.982 781.0016 30264.299 2401.1907
25 2016 23537.354 472.7022 27609.884 2129.4806
What I need is simple: to compute the growth rate of each variable, that is, take the difference between two consecutive elements, divide it by the first element, and multiply by 100.
I'm trying to write a script, but it ends up with an error message:
trade_Ch %>%
mutate (
World_exp_grate = sapply(2:nrow(trade_Ch),function(i)((World_export[i]-World_export[i-1])/World_export[i-1]))
)
Error in mutate_impl(.data, dots) : Column World_exp_grate must
be length 25 (the number of rows) or one, not 24
although this piece of code gives me right values:
x <- sapply(2:nrow(trade_Ch),function(i)((trade_Ch$World_export[i]-trade_Ch$World_export[i-1])/trade_Ch$World_export[i-1]))
How can I correctly embed the code in the mutate() call from the dplyr package?
OR
Is there another, more elegant way to solve this issue?
library(dplyr)
df %>%
mutate_each(funs(chg = ((.-lag(.))/lag(.))*100), World_export:China_imp)
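Note that mutate_each() and funs() are deprecated in current dplyr; a rough modern equivalent (a sketch assuming dplyr >= 1.0) uses across():

library(dplyr)

trade_Ch %>%
  mutate(across(World_export:China_imp,
                ~ (.x - lag(.x)) / lag(.x) * 100,   # percentage change versus the previous row
                .names = "{.col}_chg"))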
trade_Ch %>%
mutate(world_exp_grate = 100*(World_export - lag(World_export))/lag(World_export))
The problem is that the growth rate cannot be calculated for your first row, so the sapply() result has 24 elements while mutate() expects 25 (the number of rows). The first element therefore has to be set to NA, which is exactly what lag() does for you.
One variant to solve this is
trade_Ch %>%
  mutate(World_export_lag = lag(World_export),
         World_exp_grate = (World_export - World_export_lag) / World_export_lag) %>%
  select(-World_export_lag)
lag shifts the vector by one position.
lag(1:5)
# [1] NA 1 2 3 4

Count unique values in one column for specific values in another column

I have a data frame on bills that has (among other variables) a column for 'year', a column for 'issue', and a column for 'sub issue.' A simplified example df looks like this:
year issue sub issue
1970 4 20
1970 3 21
1970 4 22
1970 2 8
1971 5 31
1971 4 22
1971 9 10
1971 3 21
1971 4 22
Etc., for about 60 years. I want to count the unique values in the issue and sub issue columns for each year, and use those counts to create a new df, dat2. Using the df above, dat2 would look like this:
year issues sub issues
1970 3 4
1971 4 4
Wary of factors, I confirmed that the values in all columns are integers, if that makes a difference. I am new to R (obviously), and I haven't been able to find relevant code for this specific purpose online. Thanks for any help!
That's a one-liner, with aggregate:
with(d,aggregate(cbind(issue,subissue) ~ year,FUN=function(x){length(unique(x))}))
returning:
year issue subissue
1 1970 3 4
2 1971 4 4
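
A dplyr alternative (just a sketch; it assumes the columns are named issue and subissue, without spaces) counts distinct values per year with n_distinct():

library(dplyr)

d %>%
  group_by(year) %>%
  summarise(issues = n_distinct(issue),
            sub_issues = n_distinct(subissue))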

R Table data with a grouping command

This seems like a very simple problem, but I can't seem to sort it out. I have sought help from this forum; the topics below come close, but they don't do exactly what I need. I have count data over several years, and I want to obtain frequencies of each count value by year. It seems I need a table function with a grouping option, but I haven't found the proper syntax.
Data:
count year
1 15 1957
2 6 1957
3 23 1957
4 23 1957
5 2 1957
6 28 1980
7 15 1980
8 32 1980
9 18 1981
thank you in advance!
Counting the number of elements with the values of x in a vector
grouping data splitted by frequencies
Aggregate data in R
You're looking for the table function. Something like:
with(yourdata, table(year, count))
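A minimal sketch with the example data (the data frame name d and the lower-case column names count and year are assumptions):

d <- data.frame(count = c(15, 6, 23, 23, 2, 28, 15, 32, 18),
                year  = c(1957, 1957, 1957, 1957, 1957, 1980, 1980, 1980, 1981))

with(d, table(year, count))                 # frequency of each count value within each year
as.data.frame(with(d, table(year, count)))  # the same cross-tabulation in long form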

How can I separate the years of my data in ggplot2 without adding data to my frame?

I retrieve data from Google Analytics with RGoogleAnalytics.
The data I get have the form:
year month visits
1 2011 11 106
2 2011 12 118
3 2012 01 273
4 2012 02 354
5 2012 03 353
6 2012 04 302
....
When I use the following statement, I do not get separate bars per year, just a single set of bars that adds up the values across years. I would like them to be separated.
ggplot(ga.data, aes(x = month, y = visits), group = year, colour = as.factor(year)) +
  geom_bar(stat = "identity")
If you want to group bars by year, you could consider using faceting for that.
ggplot(ga.data, aes(x = as.factor(month), y = visits, fill = as.factor(year))) +
  geom_bar(stat = "identity") + facet_grid(~ year, scales = "free_x", space = "free_x")
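If you would rather keep a single panel, a sketch of the dodged-bar alternative (same aesthetics, just position = "dodge") places the years side by side within each month:

library(ggplot2)

ggplot(ga.data, aes(x = as.factor(month), y = visits, fill = as.factor(year))) +
  geom_bar(stat = "identity", position = "dodge")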
