Confidence intervals in a nested function - R

New student in R, taking a very accelerated class with little/no instruction. Please be patient with me...so far y'all have been extremely helpful, and I appreciate it. I apologize in advance if this doesn't make sense.
I am trying to write a function that reads an object with columns "year", "complex", "mean", "2_sd", and "n", calculates the confidence interval, and then merges the lower and upper CIs as two separate columns into a new object with the same dimensions as the CI results. However, I keep getting an error.
Code for the lower CI:
x=aggregate(m.all$mean, by=list(year,complex),FUN=(m.all$mean - qnorm(0.9) * sd(m.all$mean)/sqrt(m.all$n)))
error:
'(m.all$mean - qnorm(0.9) * sd(m.all$mean)/sqrt(m.all$n))' is not a function, character or symbol
I tried to use:
x=aggregate(total_male, by=list(year,complex),FUN=t.test(total_male,conf.level=0.90))
(where "total_male", "year", "complex" variables were sourced from the BASE object) but R doesn't recognize t.test when it's inside aggregate() for some reason...
The BASE object is 3 columns of "year", "complex", "total_males". The NEW object has "year", "complex", "mean", "2_sd", and "n"
I built "mean", 2_sd" and "n" out of the BASE object, with functions, and then merged them to create the NEW object, so I understand that. But CI's is confusing me.
The BASE object has been attach()'ed so I could work with the variables more easily.
Any ideas?
NEW object:
m.all
year complex mean X2st.dev n
1 2007 3corners 26.28571 52.04760 7
2 2007 Blue 18.87500 20.15476 8
3 2007 book_cliffs 4.50000 13.19091 6
4 2007 Diamond 13.25000 48.83431 20
The OLD object is 41 observations (all in 2007) of 4 complexes, with various numeric tot_male values:
head(d4)
year complex tot_male
2 2007 Diamond 17
21 2007 3corners 19
36 2007 Blue 40
73 2007 Diamond 22
85 2007 Diamond 0
115 2007 Diamond 2

You need to learn how to construct an R function. All you have now is an expression, which is not capable of accepting arguments. Perhaps:
x=aggregate(m.all$mean, by=list(year,complex),
FUN=function(v){ v - qnorm(0.9) * sd(v)/sqrt(length(v)) })
Please do not attach data objects. It just ends up making things less stable and harder to understand. If 'm.all' is actually a dataframe with named columns "mean", "year", etc., then the first line you proposed might be:
with( m.all, aggregate(mean, by=list(year,complex),
FUN=function(v){ v - qnorm(0.9) * sd(v)/sqrt(length(v)) }))
with() lets you create a small environment where the column names get interpreted as objects. Generally it is a bad idea to use names like mean and sd, since those are function names.
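If you do want the t.test-based interval per group, the same pattern applies: pass a function that calls t.test and extracts the interval. A minimal sketch using the d4 object shown above (note that t.test will fail on any group with fewer than two observations):
with(d4, aggregate(tot_male, by = list(year, complex),
                   FUN = function(v) t.test(v, conf.level = 0.90)$conf.int))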

The following may be what you want (since the mean and sd columns have already been generated):
> m.all$lowerci = with(m.all, mean - qnorm(0.9)*X2st.dev/sqrt(n))
> m.all
year complex mean X2st.dev n lowerci
1: 2007 3corners 26.28571 52.04760 7 1.0748434
2: 2007 Blue 18.87500 20.15476 8 9.7429407
3: 2007 book_cliffs 4.50000 13.19091 6 -2.4013685
4: 2007 Diamond 13.25000 48.83431 20 -0.7441377
Similarly for upper CI:
> m.all$upperci = with(m.all, mean + qnorm(0.9)*X2st.dev/sqrt(n))
> m.all
year complex mean X2st.dev n lowerci upperci
1: 2007 3corners 26.28571 52.04760 7 1.0748434 51.49658
2: 2007 Blue 18.87500 20.15476 8 9.7429407 28.00706
3: 2007 book_cliffs 4.50000 13.19091 6 -2.4013685 11.40137
4: 2007 Diamond 13.25000 48.83431 20 -0.7441377 27.24414
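If you need this more than once, the two steps can be wrapped in one small helper. A minimal sketch (the name ci_bounds is made up for illustration, and it assumes the columns shown above):
# hypothetical helper: lower/upper bounds from columns of means,
# standard deviations and group sizes
ci_bounds <- function(m, s, n, z = qnorm(0.9)) {
  data.frame(lowerci = m - z * s / sqrt(n),
             upperci = m + z * s / sqrt(n))
}
m.all <- cbind(m.all, with(m.all, ci_bounds(mean, X2st.dev, n)))
As an aside, qnorm(0.9) gives roughly 80% two-sided coverage; a two-sided 90% interval would use qnorm(0.95).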

Related

Return closest matching value in separate dataframe in R

DATAFRAMES
DataFrame 1
GENDER HANDICAP PRIMARYREC
male 10 Model1
female 12 Model2
male 18 Model3
DataFrame 2
MODEL1Lofts  MODEL2Lofts  MODEL3Lofts
9            10.5         9
10.5                      10.5
12
My current filter function:
GetDriver <- function(gender, handicap) {
  driverfunc <- driver_model[driver_model$Gender == gender &
                             driver_model$HandicapMin <= handicap &
                             driver_model$HandicapMax >= handicap, ]
  return(driverfunc)
}
EXAMPLE INPUT: GetDriver(gender = "Male", handicap = 5)
Note: It automatically finds the recommended model
GOAL
What I'm looking to do is add a nested function within my filter (if possible) that lets the user input their "current loft", looks up the possible values from DataFrame 2, and adds that output to driverfunc.
This is literally my 4th day doing R and I've searched around a lot but keep getting stuck, so any help would be tremendously appreciated!
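One way to structure the lookup, as a minimal sketch: the object name lofts_df for DataFrame 2 is an assumption, as is deriving the column name from the recommendation column (here called PrimaryRec); both would need to match your real data.
GetDriver <- function(gender, handicap, current_loft) {
  # step 1: the existing filter
  rec <- driver_model[driver_model$Gender == gender &
                      driver_model$HandicapMin <= handicap &
                      driver_model$HandicapMax >= handicap, ]
  # step 2: map the recommended model (e.g. "Model1") to the matching
  # column of DataFrame 2 (e.g. "MODEL1Lofts") and fetch its lofts
  loft_col <- paste0(toupper(rec$PrimaryRec), "Lofts")
  lofts <- lofts_df[[loft_col]]
  lofts <- lofts[!is.na(lofts)]    # ragged columns are NA-padded
  # step 3: attach the loft closest to the user's current loft
  rec$ClosestLoft <- lofts[which.min(abs(lofts - current_loft))]
  rec
}
# e.g. GetDriver(gender = "Male", handicap = 5, current_loft = 10)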

How to apply a summarization measure to matching data.frame columns in R

I have a hypothetical data-frame as follows:
# inventory of goods
year category count-of-good
2010 bikes 1
2011 bikes 3
2013 bikes 5
2010 skates 1
2011 skates 1
2013 skates 0
2010 skis 0
2011 skis 2
2013 skis 2
My end goal is to show a stacked bar chart of how the %-<good>-of-decade-total has changed year-to-year.
Therefore, I want to compute a percent.good.of.decade.total column: each good's count as a share of the total for that year.
Now, I should be able to do ggplot(df, aes(factor(year), fill = percent.good.of.decade.total)) + geom_bar(), or similar (hopefully!), creating a bar chart where each bar sums to 100%.
However, I'm struggling to determine how to get percent.good.of.decade.total (the far-right column) in a non-hacky way. Thanks for your time!
You can use dplyr to compute the sum:
library("dplyr")
newDf <- df %>%
  group_by(year) %>%
  mutate(decades.total.goods = sum(count.of.good)) %>%
  ungroup()
Either use mutate or normal R syntax to compute the "% good of decade total".
Note: you have not shared your exact data-frame, so the names are obviously made up.
We can do this with ave from base R:
df1$decades.total.goods <- with(df1, ave(count.of.good, year, FUN = sum))
df1$decades.total.goods
#[1] 2 6 7 2 6 7 2 6 7
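To get from the per-year total to the chart in the question, a minimal sketch (assuming the made-up column names above, and using ggplot2 as the question itself suggests):
library(dplyr)
library(ggplot2)

df <- df %>%
  group_by(year) %>%
  mutate(decades.total.goods = sum(count.of.good),
         percent.good.of.decade.total =
           100 * count.of.good / decades.total.goods) %>%
  ungroup()

# stacked bars: each year's segments sum to 100%
ggplot(df, aes(factor(year), percent.good.of.decade.total, fill = category)) +
  geom_col()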

Interpreting results of sequence mining with SPADE

Please help me interpret the results of the SPADE frequent sequence mining algorithm (http://www.inside-r.org/packages/cran/arulesSequences/docs/cspade).
With support = 0.05:
s1 <- cspade(x, parameter = list(support = 0.05), control = list(verbose = TRUE))
I get, for example, these sequences:
4 <{C},{V}> 0.15644023
5 <{C,V}> 0.73127376
Looks like these are the same sequences, aren't they? How does <{C},{V}> semantically differ from <{C,V}>? Any real-life examples?
From Spade paper (M. J. Zaki. (2001). SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, 42, 31--60):
"An input-sequence C is said to contain another sequence A, if A is a
subsequence of the input-sequence C. The support or frequency of a sequence is the total number of input-sequences in the database D that contain A."
Then, for example, if:
sequence support
1 <{C}> 1.00000000
Does it mean that sequence <{C}> is contained in all sequences in database D, correct?
Complete output that I get from my data:
> as(s1, "data.frame")
sequence support
1 <{C}> 1.00000000
2 <{L}> 0.20468120
3 <{V}> 0.73127376
4 <{C},{V}> 0.15644023
5 <{C,V}> 0.73127376
6 <{L,V}> 0.07882027
7 <{V},{V}> 0.13343431
8 <{C,V},{V}> 0.13343431
9 <{C},{C},{V}> 0.05558572
10 <{C,L,V}> 0.07882027
11 <{V},{C,V}> 0.13343431
12 <{C},{C,V}> 0.15644023
13 <{C,V},{C,V}> 0.13343431
14 <{C},{C},{C,V}> 0.05558572
15 <{C},{L}> 0.05738619
16 <{C,L}> 0.20468120
17 <{C},{C,L}> 0.05738619
18 <{C},{C}> 0.22128547
19 <{L},{C}> 0.06233031
20 <{V},{C}> 0.16921494
21 <{V},{V},{C}> 0.05047012
22 <{V},{C},{C}> 0.06233031
23 <{C,V},{C}> 0.16921494
24 <{C},{V},{C}> 0.05781487
25 <{C,V},{V},{C}> 0.05047012
26 <{V},{C,V},{C}> 0.05047012
27 <{C},{C,V},{C}> 0.05781487
28 <{C,V},{C,V},{C}> 0.05047012
29 <{C,L},{C}> 0.06233031
30 <{C},{C},{C}> 0.07882027
31 <{C,V},{C},{C}> 0.06233031
> summary(s1)
set of 31 sequences with
most frequent items:
C V L (Other)
27 22 8 8
most frequent elements:
{C} {V} {C,V} {L} {C,L} (Other)
21 12 12 3 3 2
element (sequence) size distribution:
sizes
1 2 3
7 13 11
sequence length distribution:
lengths
1 2 3 4 5
3 9 12 6 1
summary of quality measures:
support
Min. :0.05047
1st Qu.:0.05760
Median :0.07882
Mean :0.17121
3rd Qu.:0.16283
Max. :1.00000
includes transaction ID lists: FALSE
mining info:
data ntransactions nsequences support
x 61000 34991 0.05
When using the SPADE algorithm, remember that you are also dealing with temporal data (i.e., you know the order or time of occurrence of the items).
Looks like these are the same sequences, aren't they? How does <{C},{V}>
semantically differ from <{C,V}>? Any real-life examples?
In your example, <{C},{V}> means that item C occurred first, and then item V; <{C,V}> means that items C and V occurred at the same time.
Then, for example, if:
sequence support
1 <{C}> 1.00000000
Does it mean that sequence <{C}> is contained in all sequences in
database D, correct?
An item with a support value of 1 means that it occurred in ALL input-sequences in the database (in a market basket analysis example, in every customer's transaction history).
Hope this helps.
Looks like these are the same sequences, aren't they? How does <{C},{V}>
semantically differ from <{C,V}>? Any real-life examples?
As user2552108 pointed out, {C,V} implies that C and V occurred at the same time. In practice this can be used to encode multi-dimensional sequential data. For example, suppose that C was Canada and V was Vancouver. Now this could have been something like:
[{C,V,M,peanut,butter,maple_syrup}, ... , {}]
In this case, your frequent itemsets can contain not only single-length sets like {C}, {V}, {U}, {W}, or {X}, but also sets with length > 1 (the sets that appeared simultaneously, at the same time).
For this reason, the elements in transactions/sequences are defined as sets rather than single items.
Does it mean that sequence <{C}> is contained in all sequences in
database D, correct?
That's correct!
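If you want to experiment with the distinction yourself, a minimal runnable sketch using the zaki example data that ships with arulesSequences (the support threshold is arbitrary):
library(arulesSequences)
data(zaki)                      # small example sequence database from Zaki's paper
s <- cspade(zaki, parameter = list(support = 0.4),
            control = list(verbose = TRUE))
as(s, "data.frame")             # shows both <{A},{B}>-style and <{A,B}>-style patterns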

R plots: simple statistics on data by year. Base package

How to apply simple statistics to data and plot them elegantly by year using the R base plotting system and default functions?
The dataset is quite heavy, so not generating new variables would be preferable.
I hope this is not a silly question, but I have been puzzling over this problem without finding a solution that avoids additional packages such as ggplot2, dplyr, or lubridate, unlike the ones I found on the site:
ggplot2: Group histogram data by year
R group by year
Split data by year
The use of the R defaults is for didactic purposes. I think it could be important training before turning to the more "comfortable" R-specific packages.
Consider a simple dataset:
> prod_dat
lab year production(kg)
1 2010 0.3219
1 2011 0.3222
1 2012 0.3305
2 2010 0.3400
2 2011 0.3310
2 2012 0.3310
3 2010 0.3400
3 2011 0.3403
3 2012 0.3410
I would like to plot a histogram of, let's say, the total production of material during specific years.
> hist(sum(prod_dat$production[prod_dat$year == c(2010, 2013)]))
Unfortunately, this is my best attempt, and it throws an error:
in prod_dat$year == c(2010, 2012):
longer object length is not a multiple of shorter object length
I am really at a loss, so any suggestion would be useful.
Without ggplot I used to do it like this, but there are smarter ways, I think:
all <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "lab year production
1 2010 1
1 2011 0.3222
1 2012 0.3305
2 2010 0.3400
2 2011 0.3310
2 2012 0.3310
3 2010 0.3400
3 2011 0.3403
3 2012 0.3410")
ar <- data.frame(year = unique(all$year),
                 prod = tapply(all$production, list(all$year), FUN = sum))
barplot(ar$prod)
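As for the error in the question: == recycles c(2010, 2012) down the year column element by element, which is where the length complaint comes from; %in% is the idiomatic membership test. A short sketch of the fix, plus a labelled version of the barplot above (using the all and ar objects just built):
# total production over selected years: %in%, not ==
sum(all$production[all$year %in% c(2010, 2012)])

# the same barplot with axis labels
barplot(ar$prod, names.arg = ar$year,
        xlab = "year", ylab = "total production (kg)")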

Eliminating Existing Observations in a Zoo Merge

I'm trying to do a zoo merge between stock prices from selected trading days and observations about those same stocks (we call these "Nx observations") made on the same days. Sometimes we do not have Nx observations on stock trading days, and sometimes we have Nx observations on non-trading days. We want to place an NA where we do not have any Nx observations on trading days, but eliminate Nx observations made on non-trading days, since without trading data for the same day, Nx observations are useless.
The following SO question is close to mine, but I would characterize that question as REPLACING missing data, whereas my objective is to truly eliminate observations made on non-trading days (if necessary, we can change the process by which Nx observations are taken, but it would be a much less expensive solution to leave it alone).
merge data frames to eliminate missing observations
The script I have prepared to illustrate follows (I'm new to R and SO; all suggestions welcome):
# create Stk_data data.frame for use in the Stack Overflow question
Date_Stk <- c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13") # dates for stock prices used in the example
ABC_Stk <- c(65.73, 66.85, 66.92, 66.60, 66.07) # stock prices for tkr ABC for Jan 1 2013 through Jan 8 2013
DEF_Stk <- c(42.98, 42.92, 43.47, 43.16, 43.71) # stock prices for tkr DEF for Jan 1 2013 through Jan 8 2013
GHI_Stk <- c(32.18, 31.73, 32.43, 32.13, 32.18) # stock prices for tkr GHI for Jan 1 2013 through Jan 8 2013
Stk_data <- data.frame(Date_Stk, ABC_Stk, DEF_Stk, GHI_Stk) # create the stock price data.frame
# create Nx_data data.frame for use in the Stack Overflow question
Date_Nx <- c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13") # dates for Nx Observations used in the example
ABC_Nx <- c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564) # Nx scores for stock ABC for Jan 1 2013 through Jan 8 2013
DEF_Nx <- c(35.23809, 36.66667, 28.57142, 28.51778, 27.23150, 26.94331) # Nx scores for stock DEF for Jan 1 2013 through Jan 8 2013
GHI_Nx <- c(7.14256, 8.44573, 6.25344, 6.00423, 5.99239, 6.10034) # Nx scores for stock GHI for Jan 1 2013 through Jan 8 2013
Nx_data <- data.frame(Date_Nx, ABC_Nx, DEF_Nx, GHI_Nx) # create the Nx scores data.frame
# create zoo objects & merge
z.Stk_data <- zoo(Stk_data, as.Date(as.character(Stk_data[, 1]), format = "%m/%d/%Y"))
z.Nx_data <- zoo(Nx_data, as.Date(as.character(Nx_data[, 1]), format = "%m/%d/%Y"))
z.data.outer <- merge(z.Stk_data, z.Nx_data)
The NAs on Jan 3 2013 for the Nx observations are fine (we'll use the na.locf) but we need to eliminate the Nx observations that appear on Jan 5 and 6 as well as the associated NAs in the Stock price section of the zoo objects.
I've read the R documentation for merge.zoo regarding the use of "all": its use "allows
intersection, union and left and right joins to be expressed". But trying all combinations of the
following use of "all" yielded the same results (as to why would be a secondary question).
z.data.outer <- zoo(merge(x = Stk_data, y = Nx_data, all.x = FALSE)) # try using "all"
While I would appreciate comments on the secondary question, I'm primarily interested in learning how to eliminate the extraneous Nx observations on days when there is no trading of stocks. Thanks. (And thanks in general to the community for all the great explanations of R!)
The all argument of merge.zoo must be (quoting from the help file):
logical vector having the same length as the number of "zoo" objects to be merged
(otherwise expanded)
and you want to keep all rows from the first argument but not the second, so its value should be c(TRUE, FALSE).
merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))
The reason for the change in the all syntax for merge.zoo relative to merge.data.frame is that merge.zoo can merge any number of arguments, whereas merge.data.frame only handles two, so the syntax had to be extended to handle that.
Also note that %Y should have been %y in the question's code.
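Putting both fixes together, a sketch of the corrected merge (dropping the date column before coercing, so the zoo objects stay numeric rather than becoming character matrices):
library(zoo)
z.Stk_data <- zoo(Stk_data[, -1], as.Date(as.character(Stk_data[, 1]), format = "%m/%d/%y"))
z.Nx_data  <- zoo(Nx_data[, -1],  as.Date(as.character(Nx_data[, 1]),  format = "%m/%d/%y"))
merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))  # drops the Jan 5 and 6 Nx rows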
I hope I have understood your desired output correctly ("NAs on Jan 3 2013 for the Nx observations are fine"; "eliminate [...] observations that appear on Jan 5 and 6"). I don't quite see the need for zoo in the merging step.
merge(Stk_data, Nx_data, by.x = "Date_Stk", by.y = "Date_Nx", all.x = TRUE)
# Date_Stk ABC_Stk DEF_Stk GHI_Stk ABC_Nx DEF_Nx GHI_Nx
# 1 1/2/13 65.73 42.98 32.18 51.42857 35.23809 7.14256
# 2 1/3/13 66.85 42.92 31.73 NA NA NA
# 3 1/4/13 66.92 43.47 32.43 51.67565 36.66667 8.44573
# 4 1/7/13 66.60 43.16 32.13 58.57143 27.23150 5.99239
# 5 1/8/13 66.07 43.71 32.18 58.99564 26.94331 6.10034
