ANOVA of subsetted data - r

I am manipulating a data set comprising several factors with several variables. The idea is that I want to do ANOVA analysis between factor levels nested within one level of another factor.
Here is an example similar to my data set:
treatment category trial individual response
1 A big 1 F1 0.10
2 A big 2 F1 0.20
3 A big 1 F2 0.30
4 A big 2 F2 0.11
5 A small 1 F3 0.12
6 A small 2 F3 0.13
7 A small 1 F4 0.20
8 A small 2 F4 0.30
9 B big 1 F5 0.40
10 B big 2 F5 0.21
11 B big 1 F6 0.22
12 B big 2 F6 0.23
13 B small 1 F7 0.31
14 B small 2 F7 0.32
15 B small 1 F8 0.34
16 B small 2 F8 0.25
So basically, I'd like to do an ANOVA between big and small when treatment is A, then B, then same idea with ANOVA between big and small when treatment is A and trial 1... you get the logic.
It seems I have to use:
anova(lm(Y~x,data=dataset))
and add a subset argument, but I can't work out the logic and I can't find any example similar to mine. Any hints? Thank you in advance!

From your description, you want to run separate ANOVAs on different subsets of your data.
Try this:
df1 <- df[df$treatment=="A",]
df2 <- df[df$treatment=="B",]
aov(response ~ category, data=df1)
aov(response ~ category, data=df2)
If you are interested in the effect of the factor treatment, maybe you should keep it in a more complex model and use a post hoc test to compare big and small within treatments A and B. But it's just a suggestion.
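Since the question mentions a subset argument, here is a minimal sketch of that route, assuming the example table is stored in a data frame called df (rebuilt below so the snippet runs on its own). lm() accepts subset directly, and split() plus lapply() gives one ANOVA per treatment without creating df1 and df2 by hand. The object names fit_A, fit_A1 and fits are only illustrative.

```r
# Rebuild the example data from the question
df <- data.frame(
  treatment  = rep(c("A", "B"), each = 8),
  category   = rep(rep(c("big", "small"), each = 4), times = 2),
  trial      = rep(1:2, times = 8),
  individual = rep(paste0("F", 1:8), each = 2),
  response   = c(0.10, 0.20, 0.30, 0.11, 0.12, 0.13, 0.20, 0.30,
                 0.40, 0.21, 0.22, 0.23, 0.31, 0.32, 0.34, 0.25)
)

# big vs small within treatment A, using the subset argument of lm()
fit_A <- lm(response ~ category, data = df, subset = treatment == "A")
anova(fit_A)

# big vs small within treatment A and trial 1
fit_A1 <- lm(response ~ category, data = df,
             subset = treatment == "A" & trial == 1)
anova(fit_A1)

# One ANOVA table per treatment level, returned as a named list
fits <- lapply(split(df, df$treatment),
               function(d) anova(lm(response ~ category, data = d)))
fits$B
```

Note that each individual is measured twice here (trial 1 and 2), so the rows are not independent; these simple models ignore that.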

Related

Nested logit model using panel data in R

I am new to R and I would love it if you can help me with this because I am having serious difficulties.
I have unbalanced panel data that shows monthly companies' performance compared to the rest of the market in terms of $$ (eg. this month company 1 has made $1000 more than the average of the market). Each of these companies had decided on a strategy when they entered the market (1 through 8). These strategies are nested into two different groups (a,b) so that strategies 1,2, and 3 are part of the group a, while strategies 4 through 8 are part of group b. I would need a rank of the best strategies from best to worst.
I have discretized my DV so that now it only shows whether that month company 1 performed higher or lower than the market. However, I am not sure it is the right way because I would then lose how much better or worse each month companies performed compared to the market.
My data looks like this:
ID Main Strategy YearMonth DiffPerformance Control1 Control 2 DiffPerformanceHL
1 a 2 201706 9.037 2 57 H
1 a 2 201707 4.371 2 57 H
1 a 2 201708 1.633 2 57 H
1 a 2 201709 -3.521 2 59 L
1 a 2 201710 13.096 2 59 H
1 a 2 201711 5.070 2 60 H
1 a 2 201712 4.25 2 60 H
2 b 5 201904 6.78 4 171 H
2 b 5 201905 -15.26 4 169 L
2 b 5 201906 7.985 4 169 H
Where ID is the company; Main is the group (a or b); Strategy is 1 through 8 and nested as previously stated; YearMonth represents the specific month; DiffPerformance is the DV as a continuous variable; Control1 is static over time and is a categorical variable (1 through 6); Control 2 is a count variable that changes over time; and DiffPerformanceHL is the discretized DV.
Can you please help me figure out how to create a nested logit model in R? I would be super appreciative.
Thanks

How to fix the error "Subscript out of bounds"

I have a question about fixing the error:
"subscript out of bounds".
I am analyzing data of an eye-tracking experiment. You may find example data below:
Stimulus Timebin Language Percentage on AOI
1 11 L1 0.80
1 11 L2 0.60
1 12 L1 0.80
1 12 L2 0.50
1 13 L1 0.83
1 13 L2 0.50
...
10 37 L1 0.00
10 37 L2 0.50
10 38 L1 0.70
10 38 L2 0.50
10 39 L1 0.60
10 39 L2 0.70
10 40 L1 0.75
10 40 L2 0.89
...
I would like to do a growth curve analysis with Language and Timebin as independent variables and percentage on Area of Interest (AOI) as the dependent variable, with Stimulus as a random factor. I have 40 timebins for each stimulus and condition. To avoid potential collinearity problems, I want to create orthogonal polynomials. The code below was used to create independent (orthogonal) polynomial time terms (linear, quadratic, and cubic).
Gaze_1_Poly <- poly((unique(Gaze_1$timebin)), 3)
Gaze_1[,paste("ot", 1:3, sep="")] <- Gaze_1_Poly[Gaze_1$timebin, 1:3]
I always get an error telling me that a subscript is out of bounds:
Error in Gaze_1_Poly[Gaze_1$timebin, :
subscript out of bounds
So I checked the classes of the variables, and I don't think they are the problem:
Stimulus Timebin Language percentage on AOI
"character" "integer" "factor" "numeric"
I can not figure out the reason. Can someone give me a hand?
See comment above. Let me know if this is what you had in mind.
library(dplyr)
Gaze_1 %>%
left_join(data.frame(Timebin = unique(.$Timebin), poly(unique(.$Timebin), degree = 3)),
by = 'Timebin') %>%
setNames(c("Stimulus", "Timebin", "Language", "Percentage on AOI", "ot1", "ot2", "ot3"))
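A possible explanation for the error, assuming the Timebin values do not run from 1 to the number of unique bins (the sample rows show values from 11 to 40): poly() returns one row per unique time bin, so indexing that matrix by the raw Timebin value asks for rows that do not exist. A base R sketch that looks up the row position with match() instead is below. The toy Gaze_1 is made up for illustration and only mimics the column names in the question.

```r
# Toy data: 30 time bins (11..40), two languages, two stimuli (hypothetical)
Gaze_1 <- expand.grid(Timebin = 11:40,
                      Language = c("L1", "L2"),
                      Stimulus = 1:2)

tb <- sort(unique(Gaze_1$Timebin))   # 30 unique bins
Gaze_1_Poly <- poly(tb, 3)           # 30 x 3 matrix, one row per unique bin

# Gaze_1_Poly[Gaze_1$Timebin, 1:3] would ask for rows 31..40, which do not
# exist; match() converts each Timebin value into its row position instead
idx <- match(Gaze_1$Timebin, tb)
Gaze_1[, paste0("ot", 1:3)] <- Gaze_1_Poly[idx, 1:3]

head(Gaze_1)
```

Also note that R column names are case sensitive, so Gaze_1$timebin and Gaze_1$Timebin are different things; it is worth checking which one the data actually has.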

Stacking two data frame columns into a single separate data frame column in R

I will present my question in two ways. First, requesting a solution for a task; and second, as a description of my overall objective (in case I am overthinking this and there is an easier solution).
1) Task Solution
Data context: each row contains four price variables (columns) representing (a) the price at which the respondent feels the product is too cheap; (b) the price that is perceived as a bargain; (c) the price that is perceived as expensive; (d) the price that is too expensive to purchase.
## mock data set
a<-c(1,5,3,4,5)
b<-c(6,6,5,6,8)
c<-c(7,8,8,10,9)
d<-c(8,10,9,11,12)
df<-as.data.frame(cbind(a,b,c,d))
## result
# a b c d
#1 1 6 7 8
#2 5 6 8 10
#3 3 5 8 9
#4 4 6 10 11
#5 5 8 9 12
Task Objective: The goal is to create a single column in a new data frame that lists all of the unique values contained in a, b, c, and d.
price
#1 1
#2 3
#3 4
#4 5
#5 6
...
#12 12
My initial thought was to use rbind() and unique()...
price<-rbind(df$a,df$b,df$c,df$d)
price<-unique(price)
...expecting that a, b, c and d would stack vertically.
[Pseudo illustration]
a[1]
a[2]
a[...]
a[n]
b[1]
b[2]
b[...]
b[n]
etc.
Instead, the "columns" are treated as rows and stacked horizontally.
V1 V2 V3 V4 V5
1 1 5 3 4 5
2 6 6 5 6 8
3 7 8 8 10 9
4 8 10 9 11 12
How may I stack a, b, c and d such that price consists of only one column ("V1") that contains all twenty responses? (The unique part I can handle separately afterwards).
2) Overall Objective: The Bigger Picture
Ultimately, I want to create a cumulative share of population for each price (too cheap, bargain, expensive, too expensive) at each price point (defined by the unique values described above). For example, what percentage of respondents felt $1 was too cheap, what percentage felt $3 or less was too cheap, etc.
The cumulative shares for bargain and expensive are later inverted to become not.bargain and not.expensive and the four vectors reside in a data frame like this:
buckets too.cheap not.bargain not.expensive too.expensive
1 0.01 to 0.50 0.000000000 1 1 0
2 0.51 to 1.00 0.000000000 1 1 0
3 1.01 to 1.50 0.000000000 1 1 0
4 1.51 to 2.00 0.000000000 1 1 0
5 2.01 to 2.50 0.001041667 1 1 0
6 2.51 to 3.00 0.001041667 1 1 0
...
from which I may plot something that looks like this:
Above, I accomplished my plotting objective using defined price buckets ($0.50 ranges) and the hist() function.
However, the intersections of these lines have meanings and I want to calculate the exact price at which any of the lines cross. This is difficult when the x-axis is defined by price range buckets instead of a specific value; hence the desire to switch to exact values and the need to generate the unique price variable.
[Postscript: This analysis is based on Peter Van Westendorp's Price Sensitivity Meter (https://en.wikipedia.org/wiki/Van_Westendorp%27s_Price_Sensitivity_Meter) which has known practical limitations but is relevant in the context of my research which will explore consumer perceptions of value under different treatments rather than defining an actual real-world price. I mention this for two reasons 1) to provide greater insight into my objective in case another approach comes to mind, and 2) to keep the thread focused on the mechanics rather than whether or not the Price Sensitivity Meter should be used.]
We can unlist the data.frame to a vector and get the sorted unique elements
sort(unique(unlist(df)))
When we do an rbind, it creates a matrix, and calling unique on a matrix dispatches to unique.matrix
methods('unique')
#[1] unique.array unique.bibentry* unique.data.frame unique.data.table* unique.default unique.IDate* unique.ITime*
#[8] unique.matrix unique.numeric_version unique.POSIXlt unique.warnings
which loops through the rows (the default MARGIN is 1) and looks for unique rows rather than unique elements. If we want to keep using 'price', either as.vector(price) or c(price) converts the matrix into a vector
sort(unique(c(price)))
#[1] 1 3 4 5 6 7 8 9 10 11 12
If we use unique.default
sort(unique.default(price))
#[1] 1 3 4 5 6 7 8 9 10 11 12
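For the stacking part and the bigger picture, here is a rough sketch, assuming the mock df from the question. unlist() stacks the columns a, b, c, d vertically (all twenty responses), and ecdf() gives the share of respondents whose answer is at or below each unique price. The names stacked and shares are just illustrative, and the inversion into not.bargain / not.expensive is left out.

```r
# Mock data from the question
df <- data.frame(a = c(1, 5, 3, 4, 5),
                 b = c(6, 6, 5, 6, 8),
                 c = c(7, 8, 8, 10, 9),
                 d = c(8, 10, 9, 11, 12))

# One column holding all twenty responses: a[1..n], b[1..n], c[1..n], d[1..n]
stacked <- data.frame(price = unlist(df, use.names = FALSE))

# Unique price points, sorted
price <- sort(unique(stacked$price))

# Cumulative share of respondents at or below each price point, per column
shares <- data.frame(price = price,
                     sapply(df, function(col) ecdf(col)(price)))
shares
```

If you also want to know which column each stacked value came from, stack(df) returns a two-column data frame (values and ind).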

Merging dataframes with all.equal on numeric(float) keys?

I have two data frames I want to merge based on a numeric value, however I am having trouble with floating point accuracy. Example:
> df1 <- data.frame(number = 0.1 + seq(0.01,0.1,0.01), letters = letters[1:10])
> df2 <- data.frame(number = seq(0.11,0.2,0.01), LETTERS = LETTERS[1:10])
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 <NA> B
3 0.12 b <NA>
4 0.13 c C
5 0.14 d D
6 0.15 <NA> E
7 0.15 e <NA>
8 0.16 f F
9 0.17 g G
10 0.18 h H
11 0.19 i I
12 0.20 j J
Some of the values (0.12 and 0.15) don't match up due to floating point accuracy issues as discussed in this post. The solution for finding equality there was the use of the all.equal function to remove floating point artifacts, however I don't believe there is a way to do this within the merge function.
Currently I get around it by forcing one of the number columns to a character and then back to a number after the merge, but this is a little clunky; does anyone have a better solution for this problem?
> df1c <- df1
> df1c[["number"]] <- as.character(df1c[["number"]])
> merged2 <- merge(df1c, df2, by = "number", all = TRUE)
> merged2[["number"]] <- as.numeric(merged2[["number"]])
> merged2
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
EDIT: A little more about the data
I wanted to keep my question general to make it more applicable to other people's problems, but it seems I may need to be more specific to get an answer.
It is likely that all of the issues with merging will be due to floating point inaccuracy, but it may be a little hard to be sure. The data comes in as a series of time series values, a start time, and a frequency. These are then turned into a time series (ts) object and a number of functions are called to extract features from the time series (one of which is the time value), which are returned as a data frame. Meanwhile another set of functions is being called to get other features from the time series as targets. There are also potentially other series getting features generated to complement the original series. These values then have to be reunited using the time value.
Can't store as POSIXct: Each of these processes (feature extraction, target computation, merging) has to be able to occur independently and be stored in a CSV type format so it can be passed to other platforms. Storing as a POSIXct value would be difficult since the series aren't necessarily stored in calendar times.
Round to the level of precision that will allow the number to be equal.
> df1$number=round(df1$number,2)
> df2$number=round(df2$number,2)
>
> (merged <- merge(df1, df2, by = "number", all = TRUE))
number letters LETTERS
1 0.11 a A
2 0.12 b B
3 0.13 c C
4 0.14 d D
5 0.15 e E
6 0.16 f F
7 0.17 g G
8 0.18 h H
9 0.19 i I
10 0.20 j J
If you need to choose the level of precision programmatically then you should tell us more about the data and whether we can perhaps assume that it will always be due to floating point inaccuracy. If so, then rounding to 10 decimal places should be fine. The all.equal function uses a default tolerance of sqrt(.Machine$double.eps), which is about 1.5e-8 and is applied to the relative difference, so in practice it is much looser than round( ..., 16) and closer to comparing about 8 significant digits.
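A variation on the same idea, sketched under the assumption that the values sit on a known grid (steps of 0.01 here): build an integer key from the rounded, scaled value and merge on that, which leaves the original number columns untouched. The names scale and key are made up for this example.

```r
df1 <- data.frame(number = 0.1 + seq(0.01, 0.1, 0.01), letters = letters[1:10])
df2 <- data.frame(number = seq(0.11, 0.2, 0.01), LETTERS = LETTERS[1:10])

scale <- 100  # assumed grid: values are multiples of 1/scale

# Integer keys compare exactly, so floating point noise no longer matters
df1$key <- as.integer(round(df1$number * scale))
df2$key <- as.integer(round(df2$number * scale))

# Merge on the key; keep the number column from df1 only
merged <- merge(df1, df2[, c("key", "LETTERS")], by = "key", all = TRUE)
merged
```

An integer key also survives a round trip through CSV, which may help if the pieces are produced by independent processes.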

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance, but for ecology (a cross-sectional sample of a population of any kind of wild fauna), so essentially censoring variables like smoker/non-smoker, pregnant, gender, health status, etc.:
AgeClass=c(1,2,3,4,5,6)
Sample=c(100,99,87,46,32,19)
for(i in 1:6){
PropSurv=c(Sample/100)
}
LifeTab1=data.frame(cbind(AgeClass,Sample,PropSurv))
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate the number that died in each interval (DeathInt) by taking the number that survived in each row and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so on). The result should look like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a vector x, that contains numbers, you can calculate the difference by using the diff function.
In your case it would be
LifeTab1$DeathInt <- c(-diff(Sample), NA)
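On the side note about the for() loop: the division is already vectorised, so the loop is not needed. A minimal sketch putting both pieces together, assuming the first age class is the starting cohort (100 in the example):

```r
AgeClass <- 1:6
Sample   <- c(100, 99, 87, 46, 32, 19)

LifeTab1 <- data.frame(AgeClass, Sample)

# Proportion surviving, relative to the first age class (no loop required)
LifeTab1$PropSurv <- LifeTab1$Sample / LifeTab1$Sample[1]

# Deaths in each interval: current row minus the row below; last row has none
LifeTab1$DeathInt <- c(-diff(LifeTab1$Sample), NA)

LifeTab1
```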
