Compute percent change variable only when ID is the same across rows - r

I have a dataframe with a KEY/ID column, a year column, and two variables V1 and V2:
KEY V1 V2 YEAR
1 10 5 1990
1 20 10 1991
1 30 15 1992
2 40 20 1990
2 50 25 1991
2 60 30 1992
I would like to compute the percent change in the values of V1 from one year to the next. That is, I would like to compute (V1[i+1]-V1[i])/V1[i], but only when the value in KEY[i+1] is equal to the value of KEY[i]. When they are different, I would like to get an NA.
KEY V1 V2 YEAR CHANGE
1 10 5 1990 1
1 20 10 1991 0.5
1 30 15 1992 NA
2 40 20 1990 0.25
2 50 25 1991 0.2
2 60 30 1992 NA
This is my attempt, using the Delt function from the quantmod package and ddply from plyr:
data$change <- ddply(data, "data$KEY", transform, DeltaCol=Delt(data$V1) )
Unfortunately, it doesn't do the trick.
Any help would be appreciated.

I don't know how to do it with ddply, but it's pretty easy with ave:
> dat$pctchg <- ave(dat$V1, dat$KEY, FUN=function(x) c( NA, diff(x)/x[-length(x)]) )
> dat
KEY V1 V2 YEAR pctchg
1 1 10 5 1990 NA
2 1 20 10 1991 1.00
3 1 30 15 1992 0.50
4 2 40 20 1990 NA
5 2 50 25 1991 0.25
6 2 60 30 1992 0.20
ave works when you want a result that depends on only one vector within any number of categories. As far as I know, you cannot do multiple-vector calculations with ave, nor do you have access to the factor levels within the function. If you want the same calculation(s) applied to each of a group of vectors considered separately, then aggregate is best; and finally, if you want calculations that each depend on multiple vectors, use either do.call(rbind, by(dat, cats, function)) or lapply(split(dat, cats), function).
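To make the split/lapply route concrete, here is a sketch of the same per-KEY percent change (assuming the dat from above), plus the variant the question actually asked for, with the NA on the last row of each group rather than the first:

# Per-KEY percent change via split/lapply (a sketch; dat as above)
pct <- function(x) c(NA, diff(x) / x[-length(x)])
res <- do.call(rbind, lapply(split(dat, dat$KEY), function(d) {
  d$pctchg <- pct(d$V1)
  d
}))

# The question's layout (NA on the last row of each KEY) is the same
# computation shifted by one position:
pct_lead <- function(x) c(diff(x) / x[-length(x)], NA)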

Related

Two-way ANOVA with weighted dependent variable

I am trying to compare 3 independent populations across years by the size of their individuals. I have this kind of data set:
year <- c(rep(2000,5),rep(2001,3),rep(2002,7))
region <- c(1,1,2,3,3,1,2,3,rep(1,3),rep(2,3),3)
size <- c(28,24,26,56,47,85,12,24,68,71,42,59,12,25,33)
count <- c(3,8,9,1,2,4,7,12,4,8,3,2,7,15,4)
df <- data.frame(year, region, size, count)
Which gives:
year region size count
2000 1 28 3
2000 1 24 8
2000 2 26 9
2000 3 56 1
2000 3 47 2
2001 1 85 4
2001 2 12 7
2001 3 24 12
2002 1 68 4
2002 1 71 8
2002 1 42 3
2002 2 59 2
2002 2 12 7
2002 2 25 15
2002 3 33 4
I want to run a two-way ANOVA:
model.2way <- lm(size ~ year * region, df) # example of code
anova(model.2way)
My issue is that the variable size is weighted by count: for each size value, count gives the number of individuals. I have millions of rows and can't easily expand my data into millions of individual size values.
Do you know a way to make a 2-Way ANOVA with this kind of weighted data?
Thanks in advance!
model.2way <- lm(size ~ year * region, df, weights = count)
From ?lm:
... when the elements of ‘weights’ are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations ...
In other words, a weight of 2 means that case appears twice.
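To see that claim in action, here is a small check of my own (a sketch, not part of the original answer); factor() is used so that year and region are treated as categories, which is what a two-way ANOVA normally requires:

model.w    <- lm(size ~ factor(year) * factor(region), df, weights = count)
df.long    <- df[rep(seq_len(nrow(df)), df$count), ]  # each row repeated count times
model.long <- lm(size ~ factor(year) * factor(region), df.long)
all.equal(coef(model.w), coef(model.long))  # TRUE: the coefficients agree

Note that the standard errors and ANOVA degrees of freedom still differ, because the weighted fit has only nrow(df) residual degrees of freedom.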

lag and summarize time series data

I have spent a significant amount of time searching for an answer with little luck. I have some time series data and need to collapse it into a rolling mean over every nth row. It looks like this is possible in zoo, and maybe Hmisc, and I am sure other packages. I need to average rows 1,2,3, then 3,4,5, then 5,6,7, and so on. My data looks like this and has thousands of observations:
id time x.1 x.2 y.1 y.2
10 1 22 19 0 -.5
10 2 27 44 -1 0
10 3 19 13 0 -1.5
10 4 7 22 .5 1
10 5 -15 5 .33 2
10 6 3 17 1 .33
10 7 6 -2 0 0
10 8 44 25 0 0
10 9 27 12 1 -.5
10 10 2 11 2 1
I would like it to look like this when complete:
id time x.1 x.2 y.1 y.2
10 1 22.66 25.33 -.33 -.66
10 2 3.66 13.33 .27 .50
The time value 1 would actually be times 1,2,3 averaged and 2 would be 3,4,5 averaged, but at this point the time variable is not important to keep. I would need to group by id, as it does change eventually. The only way I could figure out how to do this was to use Lag() to make one new column led by 1 and another by 2, then take the average across columns, and after that delete every other row:
1 NA NA
2 1 NA
3 2 1
4 3 2
5 4 3
That is, use the 1,2,3 and 3,4,5 rows and remove the 2,3,4 row, and so on. Doing this for each variable would be outrageous, especially as I gather new data.
Any ideas? Help would be much appreciated.
Something like this, maybe?
# sample data
id <- c(10,10,10,10,10,10)
time <- c(1,2,3,4,5,6)
x1 <- c(22,27,19,7,-15,3)
x2 <- c(19,44,13,22,5,17)
df <- data.frame(id, time, x1, x2)

library(zoo)  # for rollmean

means <- data.frame(rollmean(df[, c(1, 3:NCOL(df))], 3))  # roll everything but time
means <- means[c(TRUE, FALSE), ]    # keep every other window: rows 1-3, 3-5, 5-7, ...
means$time <- 1:NROW(means)         # renumber time
row.names(means) <- 1:NROW(means)
> means
id x1 x2 time
1 10 22.666667 25.33333 1
2 10 3.666667 13.33333 2
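Since the real data contains several ids, a hedged extension (a sketch reusing the same rollmean pattern, untested on the full data) is to split by id first and then stack the results:

library(zoo)
means_by_id <- do.call(rbind, lapply(split(df, df$id), function(g) {
  m <- data.frame(rollmean(g[, c(1, 3:NCOL(g))], 3))
  m[c(TRUE, FALSE), ]   # windows 1-3, 3-5, 5-7, ... within each id
}))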

Define a dummy variable based on binary code in R

Take the following patient data example from a hospital.
YEAR <- sample(1980:1995,15, replace=T)
Pat_ID <- sample(1:100,15)
sex <- c(1,0,1,0,1,0,0,1,0,0,0,0,1,0,0)
df1 <- data.frame(Pat_ID,YEAR,sex)
I want to introduce a dummy variable PAIR_IDENTIFIER that takes a new value each time a new sex==1 appears. The problem is that there is no constant pattern to the sex variable.
Sometimes the succeeding 1 appears at position i+2, sometimes at position i+3, and so on,
so PAIR_IDENTIFIER <- c(1,1,2,2,3,3,3,4,4,4,4,4, ...)
You can do this simply by using cumsum:
df1$PAIR_IDENTIFIER <- cumsum(df1$sex)
df1
# Pat_ID YEAR sex PAIR_IDENTIFIER
#1 54 1991 1 1
#2 100 1992 0 1
#3 6 1995 1 2
#4 99 1994 0 2
#5 42 1988 1 3
#6 65 1990 0 3
#7 53 1994 0 3
#8 96 1987 1 4
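As a quick sanity check, applied to the fixed sex vector from the question (the sampled columns vary from run to run, but sex does not), cumsum reproduces exactly the pattern described above:

sex <- c(1,0,1,0,1,0,0,1,0,0,0,0,1,0,0)
cumsum(sex)
# [1] 1 1 2 2 3 3 3 4 4 4 4 4 5 5 5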

How to turn monadic into dyadic data in R?

NOTE: This is a modified version of How do I turn monadic data into dyadic data in R (country-year into pair-year)?
I have data organized by country-year, with an ID for a dyadic relationship. I want to organize this by dyad-year.
Here is how my data is organized:
dyadic_id country_codes year
1 1 200 1990
2 1 20 1990
3 1 200 1991
4 1 20 1991
5 1 200 1991
6 1 300 1991
7 1 300 1991
8 1 20 1991
9 2 300 1990
10 2 10 1990
11 3 100 1990
12 3 10 1990
13 4 500 1991
14 4 200 1991
Here is how I want the data to be:
dyadic_id_want country_codes_1 country_codes_2 year_want
1 1 200 20 1990
2 1 200 20 1991
3 1 200 300 1991
4 1 300 20 1991
5 2 300 10 1990
6 3 100 10 1990
7 4 500 200 1991
Here is reproducible code:
dyadic_id<-c(1,1,1,1,1,1,1,1,2,2,3,3,4,4)
country_codes<-c(200,20,200,20,200,300,300,20,300,10,100,10,500,200)
year<-c(1990,1990,1991,1991,1991,1991,1991,1991,1990,1990,1990,1990,1991,1991)
mydf <- data.frame(dyadic_id, country_codes, year)
dyadic_id_want<-c(1,1,1,1,2,3,4)
country_codes_1<-c(200,200,200,300,300,100,500)
country_codes_2<-c(20,20,300,20,10,10,200)
year_want<-c(1990,1991,1991,1991,1990,1990,1991)
my_df_i_want <- data.frame(dyadic_id_want, country_codes_1, country_codes_2, year_want)
This is a unique problem since more than one country participates in each event (noted by a dyadic_id).
You can actually do it very similarly to akrun's solution for dplyr. Unfortunately I'm not well versed enough in data.table to help you with that part, and I'm sure others may have better solutions to this one.
Basically, for the mutate(ind=...) portion, you need to be a little more clever about how you construct the indicator so that it is unique and leads to the result you're looking for. For my solution, I noticed that since the countries come in groups of two, the indicator just needs a modulus operation attached to it.
ind=paste0('country_codes', ((row_number()+1) %% 2+1))
Then you need an identifier for each group of two, which again can be constructed using a similar idea.
ind_row = ceiling(row_number()/2)
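For intuition, here is what the two helper expressions produce for eight rows (a standalone sketch, with row_number() replaced by 1:8):

((1:8 + 1) %% 2) + 1   # column indicator: 1 2 1 2 1 2 1 2
ceiling(1:8 / 2)       # group-of-two identifier: 1 1 2 2 3 3 4 4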
Then you can proceed as normal in the code.
The full code is as follows:
library(dplyr)
library(tidyr)

mydf %>%
  group_by(dyadic_id, year) %>%
  mutate(ind = paste0('country_codes', ((row_number() + 1) %% 2 + 1)),
         ind_row = ceiling(row_number() / 2)) %>%
  spread(ind, country_codes) %>%
  select(-ind_row)
# dyadic_id year country_codes1 country_codes2
#1 1 1990 200 20
#2 1 1991 200 20
#3 1 1991 200 300
#4 1 1991 300 20
#5 2 1990 300 10
#6 3 1990 100 10
#7 4 1991 500 200
All credit to akrun's solution though.

Count different IDs in the same month and in different months

I have a data frame like this:
FisherID Year Month VesselID
1 2000 1 56
1 2000 1 81
1 2000 2 81
1 2000 3 81
1 2000 4 81
1 2000 5 81
1 2000 6 81
1 2000 7 81
1 2000 8 81
1 2000 9 81
1 2000 10 81
1 2001 1 56
1 2001 2 56
1 2001 3 81
1 2001 4 56
1 2001 5 56
1 2001 6 56
1 2001 7 56
1 2002 3 81
1 2002 4 81
1 2002 5 81
1 2002 6 81
1 2002 7 81
...and I need the number of times the VesselID changes per year, so the output that I want is:
FisherID Year DiffVesselUsed
1 2000 1
1 2001 2
1 2002 0
I tried to get that using aggregate():
aggregate(VesselID, by=list(FisherID, Year, Month), length)
but what I got was:
FisherID Year DiffVesselUsed
1 2000 2
1 2001 1
1 2002 1
because aggregate() counted vessels as different only when they appeared together in the same month. I have tried different ways to aggregate, without success. Any help will be deeply appreciated. Cheers, Rafael
First, a clarification: there are two different statistics you might want here. You ask for the number of times the ID changes per year, which matches your posted output (in 2000 the ID changes once, in 2001 twice, in 2002 not at all). Counting how many unique VesselIDs are observed per year is a different quantity: in both 2000 and 2001, two unique IDs are observed.
Either way, if you're looking for a statistic by FisherID and Year, there's no reason to group by Month as well. To count unique vessels, look at the unique values of VesselID for each combination of FisherID and Year:
aggregate(VesselID, by = list(FisherID, Year), function(x) length(unique(x)))
# Group.1 Group.2 x
# 1 1 2000 2
# 2 1 2001 2
# 3 1 2002 1
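As a side note, if these vectors live in a data frame (say, a hypothetical df with those three columns), the formula interface of aggregate keeps the original column names:

aggregate(VesselID ~ FisherID + Year, data = df,
          FUN = function(x) length(unique(x)))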
If you really want the number of times ID changes, use the rle function.
aggregate(VesselID, by = list(FisherID, Year),
function(x) length(rle(x)$values) - 1)
# Group.1 Group.2 x
# 1 1 2000 1
# 2 1 2001 2
# 3 1 2002 0
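To see why this works, take the 2001 vessel sequence from the question as a standalone example: rle collapses consecutive repeats into runs, and the number of changes is the number of runs minus one.

v2001 <- c(56, 56, 81, 56, 56, 56, 56)
rle(v2001)$values              # 56 81 56  -> three runs
length(rle(v2001)$values) - 1  # 2 vessel changes in 2001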
