Two-way ANOVA with weighted dependent variable - r

I am trying to compare 3 independent populations across years by the size of their individuals. I have this kind of data set:
year <- c(rep(2000,5),rep(2001,3),rep(2002,7))
region <- c(1,1,2,3,3,1,2,3,rep(1,3),rep(2,3),3)
size <- c(28,24,26,56,47,85,12,24,68,71,42,59,12,25,33)
count <- c(3,8,9,1,2,4,7,12,4,8,3,2,7,15,4)
df <- data.frame(year, region, size, count)
Which gives:
year region size count
2000 1 28 3
2000 1 24 8
2000 2 26 9
2000 3 56 1
2000 3 47 2
2001 1 85 4
2001 2 12 7
2001 3 24 12
2002 1 68 4
2002 1 71 8
2002 1 42 3
2002 2 59 2
2002 2 12 7
2002 2 25 15
2002 3 33 4
I want to make a 2-Way ANOVA:
model.2way <- lm(size ~ year * region, df) # example of code
anova(model.2way)
My issue is that the variable size is weighted by count: for each size value, I have count individuals of that size. I have millions of observations and can't easily expand my data into millions of individual size values.
Do you know a way to make a 2-Way ANOVA with this kind of weighted data?
Thanks in advance!

model.2way <- lm(size ~ year * region, df, weights = count)
From ?lm:
... when the elements of ‘weights’ are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations ...
In other words, a weight of 2 means that case appears twice.
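Putting it together with the example data from the question (a minimal sketch; year and region are numeric in df, so they are converted to factors here so that lm treats them as categorical rather than continuous):

df$year   <- factor(df$year)
df$region <- factor(df$region)

# each row of df is treated as the mean of `count` unit-weight observations
model.2way <- lm(size ~ year * region, data = df, weights = count)
anova(model.2way)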

Related

How to write a loop for this case in R?

I have a database with 121 rows and about 10 columns. One of these columns corresponds to Station, another to Depth, and the rest to chemical variables (temperature, salinity, etc.). I want to calculate the integrated value of these chemical properties by station, using the function oce::integrateTrapezoid. It's my first time writing a loop, so I don't know how to do it. Could you help me?
dA <- matrix(data = NA, nrow = 121, ncol = 3)
for (Station in unique(datos$Station)) {
  dA[Station, cd] <- integrateTrapezoid(cd, Profundidad..m., "cA")
}
Station Depth temp
1 10 28
1 50 25
1 100 15
1 150 10
2 9 27
2 45 24
2 98 14
2 152 11
3 11 28.7
3 48 23
3 102 14
3 148 9
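For reference, a minimal sketch of one way to loop over stations, assuming the data frame is called datos and has the columns Station, Depth and temp shown above (in the real data the depth column appears to be Profundidad..m., so adjust the names accordingly); oce::integrateTrapezoid(x, y) integrates y over x:

library(oce)

stations <- unique(datos$Station)
dA <- data.frame(Station = stations, temp_int = NA_real_)

for (i in seq_along(stations)) {
  d <- datos[datos$Station == stations[i], ]
  # integrate temperature over depth for this station
  dA$temp_int[i] <- integrateTrapezoid(d$Depth, d$temp)
}
dA

The same pattern extends to the other chemical variables by adding one column per variable inside the loop.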

Trying to keep values of a column based on the unique values of two other columns

I want to keep only the 2 largest values in a column of a df according to the unique pair of values in two other columns. For example, I have this df:
df <- data.frame('ID' = c(1,1,1,2,2,3,4,4,4,5),
'YEAR' = c(2001,2002,2003,2002,2003,2005,2010,2011,2012,2008),
'WAGES' = c(100,98,60,120,80,300,50,40,30,500));
I want to drop the 3rd and 9th rows, i.e., keep only the two largest WAGES values for each ID. The df has roughly 300,000 rows.
You can use dplyr's top_n:
library(dplyr)
df %>%
  group_by(ID) %>%
  top_n(n = 2, wt = WAGES)
# A tibble: 8 x 3
# Groups:   ID [5]
# ID YEAR WAGES
# <dbl> <dbl> <dbl>
#1 1 2001 100
#2 1 2002 98
#3 2 2002 120
#4 2 2003 80
#5 3 2005 300
#6 4 2010 50
#7 4 2011 40
#8 5 2008 500
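Note that in current dplyr versions top_n has been superseded by slice_max; assuming a recent dplyr, an equivalent call would be:

library(dplyr)
df %>%
  group_by(ID) %>%
  slice_max(order_by = WAGES, n = 2) %>%  # keep the 2 largest WAGES per ID
  ungroup()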
If I understood your question correctly, using base R:
for (i in 1:2) {
  max_row <- which.max(df$WAGES)
  df <- df[-c(max_row), ]
}
df
# ID YEAR WAGES
# 1 1 2001 100
# 2 1 2002 98
# 3 1 2003 60
# 4 2 2002 120
# 5 2 2003 80
# 7 4 2010 50
# 8 4 2011 40
# 9 4 2012 30
Note the - and the , in df <- df[-c(max_row), ].

Loop of all regressions with previous rows

I am trying to build, for every row, a linear regression model between that row's variable and all the variables before it, taken from two matrices.
I have two similar matrices with 30 rows and 41 columns that both look like this:
Subject1 Subject2 Subject3 Subject4 Subject5 Subject6 Subject7 Subject8 Subject9
Trial 1 NA 66 NA 6 NA 45 NA NA NA
Trial 2 10 105 10 6 6 NA 6 10 15
Trial 3 NA 136 10 6 10 45 15 10 NA
Trial 4 10 NA 10 6 10 45 28 NA 6
Trial 5 10 NA 15 6 15 45 36 0 10
Trial 6 NA 21 NA 6 15 45 55 10 10
Where one matrix has an NA, the other one has a value.
I'm trying to loop a regression for every Trial, where the n-th Trial value is predicted from all the previous Trial values (1 to n-1) as regressors.
After a lot of research I found a way to fit all possible regressions, with
expand.grid(c(TRUE,FALSE), c(TRUE,FALSE), c(TRUE,FALSE), c(TRUE,FALSE))
and building from that. But since the last model would have 29 regressors from each of the 2 matrices (58 in total), that grid would be far too large, and I only need the models built from previous Trials anyway.
Any help is very much appreciated.
Thanks.
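For what it's worth, a minimal sketch of the looping idea on simulated data (m1 and m2 are placeholder names for the two 30 x 41 matrices; here the response for Trial n is taken from m1 and the regressors are all previous Trials from both matrices, with subjects as observations; adjust to your actual setup, and note that the NAs in the real data will need handling):

set.seed(1)
m1 <- matrix(rnorm(30 * 41), nrow = 30)
m2 <- matrix(rnorm(30 * 41), nrow = 30)

models <- vector("list", nrow(m1))
for (n in 2:nrow(m1)) {
  # response: Trial n across subjects; regressors: Trials 1..(n-1) from both matrices
  dat <- data.frame(y = m1[n, ],
                    t(m1[seq_len(n - 1), , drop = FALSE]),
                    t(m2[seq_len(n - 1), , drop = FALSE]))
  models[[n]] <- lm(y ~ ., data = dat)
}
summary(models[[5]])  # model predicting Trial 5

For the later Trials there are more regressors (up to 58) than subjects (41), so lm will return NA for the coefficients it cannot estimate.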

Is there a way to order output by multiple columns within the aggregate() function in R?

I'd like to use the aggregate function but then have the output ordered (smallest to largest) based on 2 columns (sorted by the first, and then by the second within it).
Here is an example:
test<-data.frame(c(sample(1:4),1),sample(2001:2005),11:15,c(letters[1:4],'a'),sample(101:105))
names(test)<-c("plot","year","age","spec","biomass")
test
plot year age spec biomass
1 2 2001 11 a 102
2 4 2005 12 b 101
3 1 2004 13 c 105
4 3 2002 14 d 103
5 1 2003 15 a 104
aggregate(biomass~plot+year,data=test,FUN='sum')
This creates output with just year ordered from smallest to largest.
plot year biomass
1 2 2001 102
2 3 2002 103
3 1 2003 104
4 1 2004 105
5 4 2005 101
But I'd like the output to be ordered by plot and THEN year.
plot year biomass
1 1 2003 104
2 1 2004 105
3 2 2001 102
4 3 2002 103
5 4 2005 101
Thanks!!
The aggregate function does sort its output by the grouping columns: rows are ordered primarily by the last grouping variable in the formula, then by the earlier ones within it. Switch the order of the grouping variables to get your desired sorting:
# switch from
a0 <- aggregate(biomass~plot+year,data=test,FUN='sum')
# to
a <- aggregate(biomass~year+plot,data=test,FUN='sum')
The result a is now sorted by plot and then year, as described in the question; no further sorting is needed.
If you want the columns displayed in exactly the order of your desired output (plot, year, biomass), try a[, c(2, 1, 3)]. This reordering is not computationally costly: a data.frame is essentially a list of column vectors, and this just reorders that list.
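Equivalently, you can keep the original formula and sort the result explicitly; a small sketch using the test data above:

a0 <- aggregate(biomass ~ plot + year, data = test, FUN = 'sum')
a0[order(a0$plot, a0$year), ]  # sort by plot, then year within plot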

Compute percent change variable only when ID is the same across rows

I have a dataframe with a KEY/ID column, a year column, and two variables V1 and V2:
KEY V1 V2 YEAR
1 10 5 1990
1 20 10 1991
1 30 15 1992
2 40 20 1990
2 50 25 1991
2 60 30 1992
I would like to compute the percent change for the values of V1 from one year to another one. That is, I would like to compute (V1[i+1]-V1[i])/V1[i] but only when the value in KEY[i+1] is equal to the value of KEY[i]. When they are different, I would like to get a NA.
KEY V1 V2 YEAR CHANGE
1 10 5 1990 1
1 20 10 1991 0.5
1 30 15 1992 NA
2 40 20 1990 0.25
2 50 25 1991 0.2
2 60 30 1992 NA
This is my attempt using the Delt function from the quantmod package and ddply from plyr.
data$change <- ddply(data, "data$KEY", transform, DeltaCol=Delt(data$V1) )
Unfortunately, it doesn't do the trick.
Any help would be appreciated.
I don't know how to do it with ddply but it's pretty easy with ave:
> dat$pctchg <- ave(dat$V1, dat$KEY, FUN=function(x) c( NA, diff(x)/x[-length(x)]) )
> dat
KEY V1 V2 YEAR pctchg
1 1 10 5 1990 NA
2 1 20 10 1991 1.00
3 1 30 15 1992 0.50
4 2 40 20 1990 NA
5 2 50 25 1991 0.25
6 2 60 30 1992 0.20
ave works when you want a result that depends only on one vector within any number of categories. As far as I know, you cannot do calculations involving multiple vectors with ave, nor do you have access to the factor levels within the function. If you want the same calculation(s) applied to each of a group of vectors considered separately, then aggregate is the best choice; and finally, if you want calculations that each depend on multiple vectors, use either do.call(rbind, by(dat, cats, function)) or lapply(split(dat, cats), function).
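If instead the NA should fall on the last row of each KEY, as in the desired output in the question, the same ave pattern works with the change taken forward; a small sketch using dat from above:

dat$change <- ave(dat$V1, dat$KEY,
                  FUN = function(x) c(diff(x) / x[-length(x)], NA))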
