R plots: simple statistics on data by year with the base package

How to apply simple statistics to data and plot them elegantly by year using the R base plotting system and default functions?
The dataset is quite large, so it would be preferable not to generate new variables.
I hope it is not a silly question, but I have been puzzling over this problem without finding a solution that avoids additional packages such as ggplot2, dplyr, or lubridate, unlike the ones I found on the site:
ggplot2: Group histogram data by year
R group by year
Split data by year
The use of the R defaults is for didactic purposes: I think it is important training before turning to the more "comfortable" specialized packages.
Consider a simple dataset:
> prod_dat
lab year production(kg)
1 2010 0.3219
1 2011 0.3222
1 2012 0.3305
2 2010 0.3400
2 2011 0.3310
2 2012 0.3310
3 2010 0.3400
3 2011 0.3403
3 2012 0.3410
I would like to plot a histogram of, let's say, the total production of material during specific years.
> hist(sum(prod_dat$production[prod_dat$year == c(2010, 2013)]))
Unfortunately, this is my best attempt, and it throws an error:
in prod_dat$year == c(2010, 2012):
longer object length is not a multiple of shorter object length
I am really at a loss, so any suggestion would be welcome.
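Note on the error: == recycles the two-element vector of years along the whole year column, hence the length warning, whereas %in% tests membership. A minimal sketch of that fix, assuming the prod_dat columns shown above are named year and production:
# sum production over a chosen set of years using %in% instead of ==
years <- c(2010, 2012)
sum(prod_dat$production[prod_dat$year %in% years])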

Without ggplot I used to do it like this, but I think there are smarter ways:
all <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "lab year production
1 2010 1
1 2011 0.3222
1 2012 0.3305
2 2010 0.3400
2 2011 0.3310
2 2012 0.3310
3 2010 0.3400
3 2011 0.3403
3 2012 0.3410")
ar <- data.frame(year = unique(all$year), prod = tapply(all$production, list(all$year), FUN = sum))
barplot(ar$prod)
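For readability, the same base call can label the bars with the years (a small sketch building on the ar data frame above; the axis labels are just illustrative):
# label each bar with its year and name the axes
barplot(ar$prod, names.arg = ar$year, xlab = "year", ylab = "total production", main = "Total production by year")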

Related

r edgar Error: Input year(s) is not numeric

I have a "test" dataframe with 3 companies (ciknum variable) and years in which each company filed annual reports (fyearq):
ciknum fyearq
1 1408356 2012
2 1557255 2012
3 1557255 2013
4 1557255 2014
5 1557255 2015
6 1557255 2016
7 1555538 2013
8 1555538 2014
9 1555538 2015
10 1555538 2016
After obtaining the MasterIndex folder and running this code (see proposed solution) I use the R edgar package to obtain 10-K filings. I run the following code:
for (i in 1:nrow(test)){
  firm <- test[i, "ciknum"]  # edit: the mistake may be here, since the new firm object contains only 1 obs. of 1 variable
  year <- test[i, "fyearq"]  # edit: the mistake may be here, since the new year object contains only 1 obs. of 1 variable
  my_getFilings(firm, '10-K', year, downl.permit = "y")
}
And it keeps spitting the following error: Error: Input year(s) is not numeric. I checked the variable type and it seems my fyearq variable is numeric.
sapply(test,class)
ciknum fyearq
"numeric" "numeric"
I don't really understand why the "numeric" fyearq variable is not read as such by the my_getFilings function. Any help would be much appreciated.
Thank you in advance.
Martins
The argument order seems to matter here. I solved this problem by using the argument names from the function, so that
my_getFilings(firm,'10-K',year,downl.permit="y")
as you wrote it is written as
my_getFilings(cik.no = firm, form.def = '10-K', filing.year = 2016, downl.permit = "y")
Thank you #bartosz25 and M Grace,
I finally made it work through the following code:
for (row in 1:nrow(test)){
  firm <- as.numeric(test[row, "ciknum"])
  year <- as.numeric(test[row, "fyearq"])
  my_getFilings(firm, c('10-K'), year, downl.permit = "y")
}
Apologies for not posting it before.

How to apply a summarization measure to matching data.frame columns in R

I have a hypothetical data-frame as follows:
# inventory of goods
year category count-of-good
2010 bikes 1
2011 bikes 3
2013 bikes 5
2010 skates 1
2011 skates 1
2013 skates 0
2010 skis 0
2011 skis 2
2013 skis 2
My end goal is to show a stacked bar chart of how the %-<good>-of-decade-total has changed year-to-year.
Therefore, I want to compute a percent.good.of.decade.total column for each row.
With that, I should be able to do ggplot(df, aes(factor(year), fill = percent.good.of.decade.total)) + geom_bar(), or similar (hopefully!), creating a bar chart where each bar sums to 100%.
However, I'm struggling to determine how to get percent.good.of.decade.total (the far-right column) in a non-hacky way. Thanks for your time!
You can use dplyr to compute the sum:
library("dplyr")
newDf <- df %>%
  group_by(year) %>%
  mutate(decades.total.goods = sum(count.of.goods)) %>%
  ungroup()
Either use mutate or normal R syntax to compute the "% good of decade total"
Note: you have not shared your exact data-frame, so the names are obviously made up.
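For the percentage itself, one option is to extend the same pipeline (a sketch using the same made-up column names):
# add the per-year total and each good's share within its year
newDf <- df %>%
  group_by(year) %>%
  mutate(decades.total.goods = sum(count.of.goods),
         percent.good.of.decade.total = 100 * count.of.goods / decades.total.goods) %>%
  ungroup()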
We can do this with ave from base R
df1$decades.total.goods <- with(df1, ave(count.of.good, year, FUN = sum))
df1$decades.total.goods
#[1] 2 6 7 2 6 7 2 6 7
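The percentage column then follows directly (a sketch, assuming the df1 column names used above):
# share of each good within its year, in percent
df1$percent.good.of.decade.total <- with(df1, 100 * count.of.good / decades.total.goods)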

Are there simple ways to lag (by group) in data frames without workarounds like data tables, xts, zoo, dplyr etc in R?

Whenever I want to lag in a data frame I realize that something that should be simple is not. While the problem has been asked and answered many times (see p.s.), I did not find a simple solution that I can remember until the next time I need a lag. In general, lagging does not seem to be a simple thing in R, as the many workarounds testify. I run into this problem often and it would be very helpful to have some basic R solutions which do not need extra packages. Could you provide your simple solution for lagging?
If that is not possible, could you at least provide your workaround here so we can choose among second-best alternatives? One collection already exists here.
Also, in all blog posts on this subject I see people complain about how unexpectedly difficult lagging is, so how can we get a simple lag function for data frames into R Core? This must be extremely disappointing for anyone coming from Stata or EViews. Or am I missing something and there is a simple built-in solution?
Say we want to lag "value" by 3 "year"s for each "country" here:
Data <- data.frame(year = rep(2010:2015, 2), country = c(rep("AT", 6), rep("DE", 6)), value = rnorm(12))
to create L3 like:
year country value L3
2010 AT 0.3407 NA
2011 AT -1.7981 NA
2012 AT -0.8390 NA
2013 AT -0.6888 0.3407
2014 AT -1.1019 -1.7981
2015 AT -0.8953 -0.8390
2010 DE 0.5877 NA
2011 DE -1.0204 NA
2012 DE -0.6576 NA
2013 DE 0.6620 0.5877
2014 DE 0.9579 -1.0204
2015 DE -0.7774 -0.6576
And we neither want to change the nature of our data (to ts or data.table) nor do we want to immerse ourselves in three new packages when the deadline is tonight and our supervisor uses Stata and thinks lagging is easy ;-) (it's not, I just want to be prepared...)
p.s.:
without groups
with data.table: Lag in dataframe or How to create a lag variable within each group?
time series are straightforward
If the question is how to provide a column with the value from three years prior without using packages, then try this:
prior_year3 <- function(x, k = 3) head(c(rep(NA, k), x), length(x))
transform(Data, prior_year_value = ave(value, country, FUN = prior_year3))
giving:
year country value prior_year_value
1 2010 AT -1.66562121 NA
2 2011 AT -0.04950063 NA
3 2012 AT 1.55930293 NA
4 2013 AT -0.40462394 -1.66562121
5 2014 AT 0.78602610 -0.04950063
6 2015 AT 0.73912916 1.55930293
7 2010 DE 1.03710539 NA
8 2011 DE -1.13370942 NA
9 2012 DE -1.20530981 NA
10 2013 DE 1.66870572 1.03710539
11 2014 DE 1.53615793 -1.13370942
12 2015 DE -0.09693335 -1.20530981
That said, to use R effectively you do need to learn how to use the key packages.
Try slide from the DataCombine package; it's simple:
library(DataCombine)
slide(Data, Var = 'value', GroupVar = 'country', slideBy = -3)

How to form a linear model from two data frames?

MarriageLicen
Year Month Amount
1 2011 Jan 742
2 2011 Feb 796
3 2011 Mar 1210
4 2011 Apr 1376
BusinessLicen
Month Year MARRIAGE_LICENSES
1 Jan 2011 754
2 Feb 2011 2706
3 Mar 2011 2689
4 Apr 2011 738
My question is how can we predict the number of Marriage Licenses (Y) issued by the city using the number of Business Licenses (X)?
And how can we join two datasets together?
It says that you can join using the combined key of Month and Year.
But I have been struggling with this question for several days.
There are three options here.
The first is to just be direct. I'm going to assume you have the labels swapped around for the data frames in your example (it doesn't make a whole lot of sense to have a MARRIAGE_LICENSES variable in the BusinessLicen data frame, if I'm following what you are trying to do).
You can model the relationship between those two variables with:
my.model <- lm(MarriageLicen$MARRIAGE_LICENSES ~ BusinessLicen$Amount)
The second (not very rational) option would be to make a new data frame explicitly, since it looks like you have an exact match on each of your rows:
new.df <- data.frame(marriage.licenses=MarriageLicen$MARRIAGE_LICENSES, business.licenses=BusinessLicen$Amount)
my.model <- lm(marriage.licenses ~ business.licenses, data=new.df)
Finally, if you don't actually have the perfect alignment shown in your example you can use merge.
my.df <- merge(BusinessLicen, MarriageLicen, by=c("Month", "Year"))
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data = my.df)
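Once fitted, the model can be used for prediction like any other lm object (a small sketch; the Amount values below are made up for illustration):
# predict marriage licenses for hypothetical business-license counts
new.business <- data.frame(Amount = c(900, 1300))
predict(my.model, newdata = new.business)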

Confidence intervals in a nested function

New student in R, taking a very accelerated class with little/no instruction. Please be patient with me...so far y'all have been extremely helpful, and I appreciate it. I apologize in advance if this doesn't make sense.
I am trying to make a function that reads columns from an object with columns "year", "complex", "mean", "2_sd", and "n" and calculates the confidence interval, then merges the lower and upper CI's as two separate columns into a new object with the same dimensions as the products of the CI calculations. However, I keep getting an error:
code for lower CI:
x=aggregate(m.all$mean, by=list(year,complex),FUN=(m.all$mean - qnorm(0.9) * sd(m.all$mean)/sqrt(m.all$n)))
error:
'(m.all$mean - qnorm(0.9) * sd(m.all$mean)/sqrt(m.all$n))' is not a function, character or symbol
I tried to use:
x=aggregate(total_male, by=list(year,complex),FUN=t.test(total_male,conf.level=0.90))
(where "total_male", "year", "complex" variables were sourced from the BASE object) but R doesn't recognize t.test when it's inside aggregate() for some reason...
The BASE object is 3 columns of "year", "complex", "total_males". The NEW object has "year", "complex", "mean", "2_sd", and "n"
I built "mean", 2_sd" and "n" out of the BASE object, with functions, and then merged them to create the NEW object, so I understand that. But CI's is confusing me.
The BASE object has been attach()'ed so I could work with the variables more easily.
Any ideas?
NEW object:
m.all
year complex mean X2st.dev n
1 2007 3corners 26.28571 52.04760 7
2 2007 Blue 18.87500 20.15476 8
3 2007 book_cliffs 4.50000 13.19091 6
4 2007 Diamond 13.25000 48.83431 20
The OLD object is 41 observations (all in 2007) of 4 complexes, with various numeric tot_male values:
head(d4)
year complex tot_male
2 2007 Diamond 17
21 2007 3corners 19
36 2007 Blue 40
73 2007 Diamond 22
85 2007 Diamond 0
115 2007 Diamond 2
You need to learn how to construct an R function. All you have now is an expression but it's not capable of accepting arguments. Perhaps:
x=aggregate(m.all$mean, by=list(year,complex),
FUN=function(v){ v - qnorm(0.9) * sd(v)/sqrt(length(v)) })
Please do not attach data objects. It just ends up making things less stable and less easy to understand. If 'm.all' is actually a dataframe with named columns "mean", "year", then the first line you proposed might be:
with( m.all, aggregate(mean, by=list(year,complex),
FUN=function(v){ v - qnorm(0.9) * sd(v)/sqrt(length(v)) }))
with() lets you create a small environment where the column names are interpreted as objects. Generally it is a bad idea to use names like mean and sd, since those are function names.
The following may be what you want (since the mean and sd have already been generated):
> m.all$lowerci = with(m.all, mean - qnorm(0.9)*X2st.dev/sqrt(n))
> m.all
year complex mean X2st.dev n lowerci
1: 2007 3corners 26.28571 52.04760 7 1.0748434
2: 2007 Blue 18.87500 20.15476 8 9.7429407
3: 2007 book_cliffs 4.50000 13.19091 6 -2.4013685
4: 2007 Diamond 13.25000 48.83431 20 -0.7441377
Similarly for upper CI:
> m.all$upperci = with(m.all, mean + qnorm(0.9)*X2st.dev/sqrt(n))
> m.all
year complex mean X2st.dev n lowerci upperci
1: 2007 3corners 26.28571 52.04760 7 1.0748434 51.49658
2: 2007 Blue 18.87500 20.15476 8 9.7429407 28.00706
3: 2007 book_cliffs 4.50000 13.19091 6 -2.4013685 11.40137
4: 2007 Diamond 13.25000 48.83431 20 -0.7441377 27.24414
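If you would rather compute the intervals straight from the OLD object instead of the pre-computed summaries, a sketch along the same lines (assuming d4 has the year, complex, and tot_male columns shown above; note the NEW object's X2st.dev column appears to hold 2 * sd, so the numbers will not match exactly):
# mean and 90% normal-approximation bounds per year/complex group
ci_fun <- function(v) {
  m <- mean(v)
  half <- qnorm(0.9) * sd(v) / sqrt(length(v))
  c(mean = m, lowerci = m - half, upperci = m + half)
}
aggregate(tot_male ~ year + complex, data = d4, FUN = ci_fun)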
