Count different IDs in the same month and in different months - R

I have a data frame like this:
FisherID Year Month VesselID
1 2000 1 56
1 2000 1 81
1 2000 2 81
1 2000 3 81
1 2000 4 81
1 2000 5 81
1 2000 6 81
1 2000 7 81
1 2000 8 81
1 2000 9 81
1 2000 10 81
1 2001 1 56
1 2001 2 56
1 2001 3 81
1 2001 4 56
1 2001 5 56
1 2001 6 56
1 2001 7 56
1 2002 3 81
1 2002 4 81
1 2002 5 81
1 2002 6 81
1 2002 7 81
...and I need the number of times that the ID changes per year, so the output that I want is:
FisherID Year DiffVesselUsed
1 2000 1
1 2001 2
1 2002 0
I tried to get that using aggregate():
aggregate(VesselID, by = list(FisherID, Year, Month), length)
but what I got was:
FisherID Year DiffVesselUsed
1 2000 2
1 2001 1
1 2002 1
because aggregate() only counted different vessels when they appeared in the same month. I have tried different ways to aggregate, without success. Any help will be deeply appreciated. Cheers, Rafael

First, a question: your expected output doesn't seem to reflect what you ask for. You ask for the number of times the ID changes per year, but parts of your question suggest you want the number of unique VesselIDs observed per year. For example, in 2000 the ID changes once and in 2001 it changes twice, yet in both years two unique IDs are observed.
If you're looking for the number of unique vessels by FisherID and Year, there's no reason to group by Month as well. Instead, count the unique values of VesselID for each combination of FisherID and Year:
aggregate(VesselID, by = list(FisherID, Year), function(x) length(unique(x)))
# Group.1 Group.2 x
# 1 1 2000 2
# 2 1 2001 2
# 3 1 2002 1
If you really want the number of times the ID changes, use the rle (run-length encoding) function: it collapses consecutive repeats into runs, so the number of changes is the number of runs minus one.
aggregate(VesselID, by = list(FisherID, Year),
          function(x) length(rle(x)$values) - 1)
# Group.1 Group.2 x
# 1 1 2000 1
# 2 1 2001 2
# 3 1 2002 0
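For reference, a self-contained version of both calls (a sketch: the calls above assume the columns are available as bare vectors, e.g. via attach(), so here they are wrapped in with()):

df <- data.frame(
  FisherID = 1,
  Year     = c(rep(2000, 11), rep(2001, 7), rep(2002, 5)),
  VesselID = c(56, rep(81, 10), 56, 56, 81, rep(56, 4), rep(81, 5))
)

# Unique vessels per year
with(df, aggregate(VesselID, by = list(FisherID, Year),
                   function(x) length(unique(x))))

# Number of vessel changes per year (rle assumes the rows are already
# ordered by Month within each Year, as in the posted data)
with(df, aggregate(VesselID, by = list(FisherID, Year),
                   function(x) length(rle(x)$values) - 1))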

Related

Counting the number of changes of a categorical variable during repeated measurements within a category

I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variable, 1-4), 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' has changed (e.g. from region 1 to region 3, or from urban 0 to 1) during the observation period (25 years) within each subject? I also have some NAs in the data, which should be ignored.
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, the output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First you group by i. Then you create a column that indicates a 1 for each change in region, by comparing the current value of region with the previous value (using lag). Note that if the previous value is NA (as it is for the first value of a given i), it is treated as no change.
The same approach is taken for urban. Then, summarize by totaling up all the changes for each i. You can drop the summarize step to examine the intermediate change columns and check that you are getting the desired results.
Edit: if you wish to remove rows that have NA for region or urban, you can add drop_na first.
library(dplyr)
library(tidyr)

df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change   = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban  = sum(urban_change))
# A tibble: 3 x 3
      i tot_region tot_urban
  <int>      <dbl>     <dbl>
1     1          1         0
2     4          3         0
3    45          2         2
Edit: afterwards, to get a grand total for both the tot_region and tot_urban columns, you can use colSums (store your earlier result as df_tot, as above).
colSums(df_tot[-1])
tot_region tot_urban
6 2
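To sanity-check the approach against the i = 4 fragment posted above, here is a self-contained sketch (transcribing only the columns the pipeline touches):

library(dplyr)
library(tidyr)

df4 <- data.frame(
  i      = 4,
  year   = c(1979:1994, 1996),
  region = c(1, NA, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 4),
  urban  = c(1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1)
)

df4 %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change   = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change), tot_urban = sum(urban_change))
#       i tot_region tot_urban
# 1     4          3         0

This reproduces the expected change_reg = 3 and change_urban = 0 for subject 4.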

How to turn monadic into dyadic data in R?

NOTE: This is a modified version of How do I turn monadic data into dyadic data in R (country-year into pair-year)?
I have data organized by country-year, with an ID for a dyadic relationship. I want to organize this by dyad-year.
Here is how my data is organized:
dyadic_id country_codes year
1 1 200 1990
2 1 20 1990
3 1 200 1991
4 1 20 1991
5 1 200 1991
6 1 300 1991
7 1 300 1991
8 1 20 1991
9 2 300 1990
10 2 10 1990
11 3 100 1990
12 3 10 1990
13 4 500 1991
14 4 200 1991
Here is how I want the data to be:
dyadic_id_want country_codes_1 country_codes_2 year_want
1 1 200 20 1990
2 1 200 20 1991
3 1 200 300 1991
4 1 300 20 1991
5 2 300 10 1990
6 3 100 10 1990
7 4 500 200 1991
Here is reproducible code:
dyadic_id<-c(1,1,1,1,1,1,1,1,2,2,3,3,4,4)
country_codes<-c(200,20,200,20,200,300,300,20,300,10,100,10,500,200)
year<-c(1990,1990,1991,1991,1991,1991,1991,1991,1990,1990,1990,1990,1991,1991)
mydf<-as.data.frame(cbind(dyadic_id,country_codes,year))
dyadic_id_want<-c(1,1,1,1,2,3,4)
country_codes_1<-c(200,200,200,300,300,100,500)
country_codes_2<-c(20,20,300,20,10,10,200)
year_want<-c(1990,1991,1991,1991,1990,1990,1991)
my_df_i_want<-as.data.frame(cbind(dyadic_id_want,country_codes_1,country_codes_2,year_want))
This is a somewhat unusual problem, since more than one country participates in each event (noted by a dyadic_id).
You can actually do it very similarly to akrun's solution for dplyr. Unfortunately I'm not well versed enough in data.table to help you with that part, and I'm sure others may have a better solution.
Basically, for the mutate(ind = ...) portion you need to be a little more clever about how you construct the indicator, so that it is unique within each group and leads to the result you're looking for. Since the countries come in groups of two, the indicator just needs a modulus operator:
ind=paste0('country_codes', ((row_number()+1) %% 2+1))
Then you need an identifier for each group of two, which can be constructed using a similar idea:
ind_row = ceiling(row_number()/2)
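To see what these two expressions produce, evaluate them for the first four rows of a group:

n <- 1:4  # row_number() within a dyadic_id/year group
paste0('country_codes', (n + 1) %% 2 + 1)
# "country_codes1" "country_codes2" "country_codes1" "country_codes2"
ceiling(n / 2)
# 1 1 2 2  -> one ind_row per pair of countries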
Then you can proceed as normal in the code.
The full code is as follows:
library(dplyr)
library(tidyr)

mydf %>%
  group_by(dyadic_id, year) %>%
  mutate(ind = paste0('country_codes', ((row_number() + 1) %% 2 + 1)),
         ind_row = ceiling(row_number() / 2)) %>%
  spread(ind, country_codes) %>%
  select(-ind_row)
# dyadic_id year country_codes1 country_codes2
#1 1 1990 200 20
#2 1 1991 200 20
#3 1 1991 200 300
#4 1 1991 300 20
#5 2 1990 300 10
#6 3 1990 100 10
#7 4 1991 500 200
All credit to akrun's solution though.

How to populate a matrix from a for loop in R

I keep getting a 'subscript out of bounds' error when I try to populate a matrix using the for loop scripted below. My data come from a large csv file and look similar to the following dummy dataset:
Sample k3 Year
1 B92028UUU 1 1990
2 B93001UUU 1 1993
3 B93005UUU 1 1993
4 B93006UUU 1 1993
5 B93010UUU 1 1993
6 B93011UUU 1 1994
7 B93022UUU 1 1994
8 B93035UUU 1 2014
9 B93036UUU 1 2014
10 B95015UUU 2 2013
11 B95016UUU 2 2013
12 B98027UUU 2 1990
13 B05005FUS 2 1990
14 B06006FIS 2 2001
15 B06010MUS 2 2001
16 B05023FUN 2 2001
17 B05024FUN 3 2001
18 B05025FIN 3 2001
19 B05034MMN 3 2002
20 B05037MMS 3 1996
21 B05041MUN 3 1996
22 B06047FUS 3 2007
23 B05048MUS 3 2000
24 B06059FUS 3 2000
25 B05063MUN 3 2000
My script is as follows:
Year.Matrix = matrix(1:75, nrow = 25, byrow = T)
colnames(Year.Matrix) = c("Group 1", "Group 2", "Group 3")
rownames(Year.Matrix) = 1990:2014

for(i in 1:3){
  x = subset(data2, k3 == i)
  for(j in 1990:2014){
    y = subset(x, Year == j)
    z = nrow(y)
    Year.Matrix[j, i] = z
  }
}
I am not sure why I am getting the error message, but from other posts I gather that the issue arises when I try to populate my matrix, perhaps because I do not have an entry for every year in each of my k3 levels?
Any commentary would be helpful!
No need to use a loop here. You are just counting rows by the Year and k3 columns:
library(data.table)
setDT(dat)[,.N,"Year,k3"]
Year k3 N
1: 1990 1 1
2: 1993 1 4
3: 1994 1 2
4: 2014 1 2
5: 2013 2 2
6: 1990 2 2
7: 2001 2 3
8: 2001 3 2
9: 2002 3 1
10: 1996 3 2
11: 2007 3 1
12: 2000 3 3
You can also use dplyr to do this:

library(dplyr)
dat %>%
  group_by(Year, k3) %>%
  summarize(N = n())
I am not sure exactly what you are trying to do, but as Hubert L said, the problem is your j index: the row index of Year.Matrix must be an integer between 1 and 25, but since you wrote for(j in 1990:2014), j takes the values 1990, 1991, ..., 2014. Add print statements to your loop to see this:

for(i in 1:3){
  print(i)
  x = subset(data2, k3 == i)
  for(j in 1990:2014){
    print(j)
    y = subset(x, Year == j)
    z = nrow(y)
    Year.Matrix[j, i] = z
  }
}

Keep using print statements to debug your function: running this loop immediately shows that the first assignment tries to index Year.Matrix[1990, 1], which throws the out-of-bounds error.
Fix the loop by offsetting the row index:

for(i in 1:3){
  print(i)
  x = subset(data2, k3 == i)
  for(j in 1990:2014){
    print(j)
    y = subset(x, Year == j)
    z = nrow(y)
    Year.Matrix[j - 1990 + 1, i] = z  # map year 1990..2014 to row 1..25
  }
}
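As an aside, the full 25 x 3 matrix can also be produced in a single vectorized call: table() counts rows by year and group, and explicit factor levels keep years or groups with no observations as zero rows (a sketch, assuming data2 as in the question):

Year.Matrix <- table(factor(data2$Year, levels = 1990:2014),
                     factor(data2$k3, levels = 1:3))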

Fill the time gap in a data frame in R

I have a data set including the following info:
id class year n
25 A63 2006 3
25 F16 2006 1
39 0901 2001 1
39 0903 2001 3
39 0903 2003 2
39 1901 2003 1
...
There are about 100k different ids and more than 300 classes. The year varies from 1998 to 2007.
What I want to do is fill the time gaps: after an id/class combination first appears, every later year up to 2007 with no observation should get a row with n = 0, by id and class.
Then calculate the sum of n and the quantity of classes.
For example, the above six lines of data should expand to the following table:
id class year n sum Qc Qs
25 A63 2006 3 3 2 2
25 F16 2006 1 1 2 2
25 A63 2007 0 3 0 2
25 F16 2007 0 1 0 2
39 0901 2001 1 1 2 2
39 0903 2001 3 3 2 2
39 0901 2002 0 1 0 2
39 0903 2002 0 3 0 2
39 0901 2003 0 1 2 3
39 0903 2003 2 5 2 3
39 1901 2003 1 1 2 3
39 0901 2004 0 1 0 3
39 0903 2004 0 5 0 3
39 1901 2004 0 1 0 3
...
39 0901 2007 0 1 0 3
39 0903 2007 0 5 0 3
39 1901 2007 0 1 0 3
I can solve it with an ugly for loop, but that takes an hour to run. Is there a better way, vectorized or using data.table?
Using dplyr you could try:
library(dplyr)

df %>%
  group_by(class, id) %>%
  arrange(year) %>%
  do(merge(data.frame(year = .$year[1]:2007,
                      id = rep(.$id[1], 2007 - .$year[1] + 1),
                      class = rep(.$class[1], 2007 - .$year[1] + 1)),
           ., all.x = TRUE))
It groups the data by class and id, and merges each group with a data frame containing all the years from that group's first year through 2007, along with the group's id and class.
Edit: if you want to do this only after a certain id you could do:
as.data.frame(rbind(df[df$id <= 25, ],
                    df %>%
                      filter(id > 25) %>%
                      group_by(class, id) %>%
                      arrange(year) %>%
                      do(merge(data.frame(year = .$year[1]:2007,
                                          id = rep(.$id[1], 2007 - .$year[1] + 1),
                                          class = rep(.$class[1], 2007 - .$year[1] + 1)),
                               ., all.x = TRUE))))
Use expand.grid to get the Cartesian product of class and year. Then merge your current data frame into this new one, and finish with the classic subset replacement to fill the NAs with 0:
df <- data.frame(class = as.factor(c("A63", "F16", "0901", "0903", "0903", "1901")),
                 year = c(2006, 2006, 2001, 2001, 2003, 2003),
                 n = c(3, 1, 1, 3, 2, 1))

df2 <- expand.grid(class = levels(df$class),
                   year = 1997:2006)
df2 <- merge(df2, df, all.x = TRUE)
df2$n[is.na(df2$n)] <- 0
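Since the question also asks about data.table, here is a sketch of that route, covering the gap-filling and (reading the sum column in the expected output as a running total of n within each id/class) the cumulative sum; the Qc/Qs counts would need further by-year steps:

library(data.table)

dat <- data.table(id = c(25, 25, 39, 39, 39, 39),
                  class = c("A63", "F16", "0901", "0903", "0903", "1901"),
                  year = c(2006, 2006, 2001, 2001, 2003, 2003),
                  n = c(3, 1, 1, 3, 2, 1))

# Expand each id/class pair from its first year out to 2007
grid <- dat[, .(year = min(year):2007), by = .(id, class)]

# Join the observations back on and fill the gaps with n = 0
out <- dat[grid, on = .(id, class, year)]
out[is.na(n), n := 0]

# Running total of n within each id/class (the "sum" column above)
out[order(year), sum := cumsum(n), by = .(id, class)]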

R Example - ddply, ave, and merge

I have written some code. It would be great if you could suggest a better way of doing what I am trying to do. The data frame dt is given as follows:
SIC FYEAR AU AT
1 1 2003 6 212.748
2 1 2003 5 3987.884
3 1 2003 4 100.835
4 1 2003 4 1706.719
5 1 2003 5 9.159
6 1 2003 7 60.069
7 1 2003 5 100.696
8 1 2003 4 113.865
9 1 2003 6 431.552
10 1 2003 7 309.109 ...
My job is to create a new column: for a given SIC and FYEAR, the AU with the highest percentage share of AT gets a value of 1 when the difference between the highest and second-highest share exceeds 10%; all other rows get 0. Here is my attempt:
library(plyr)
a <- ddply(dt, .(SIC, FYEAR), function(x) {
  ddply(x, .(AU), function(x) sum(x$AT))
})
SIC FYEAR AU V1
1 1 2003 4 3412.619
2 1 2003 5 13626.241
3 1 2003 6 644.300
4 1 2003 7 1478.633
5 1 2003 9 0.003
6 1 2004 4 3976.242
7 1 2004 5 9383.516
8 1 2004 6 457.023
9 1 2004 7 456.167
10 1 2004 9 238.282
where V1 represents the sum of AT over all rows for a given AU within a given SIC and FYEAR. Next I do:
a$V1 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(x) x/sum(x));
SIC FYEAR AU V1
1 1 2003 4 1.780949e-01
2 1 2003 5 7.111150e-01
3 1 2003 6 3.362420e-02
4 1 2003 7 7.716568e-02
5 1 2003 9 1.565615e-07
6 1 2004 4 2.740114e-01
7 1 2004 5 6.466382e-01
8 1 2004 6 3.149444e-02
9 1 2004 7 3.143545e-02
10 1 2004 9 1.642052e-02
The column V1 now represents the percentage contribution of each AU to AT for a given SIC and FYEAR. Next,
a$V2 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(x) {
  t <- sort(x, TRUE)[2]
  ifelse((x - t) > 0.1, 1, 0)
})
SIC FYEAR AU V1 V2
1 1 2003 4 1.780949e-01 0
2 1 2003 5 7.111150e-01 1
3 1 2003 6 3.362420e-02 0
4 1 2003 7 7.716568e-02 0
5 1 2003 9 1.565615e-07 0
6 1 2004 4 2.740114e-01 0
7 1 2004 5 6.466382e-01 1
8 1 2004 6 3.149444e-02 0
9 1 2004 7 3.143545e-02 0
10 1 2004 9 1.642052e-02 0
The AU with the highest percentage contribution to AT for a given SIC and FYEAR gets 1 if the difference from the second-highest share is greater than 10%; otherwise it gets 0.
Then I merge the result with the original data dt:
dt <- merge(dt, a, by = c("SIC", "FYEAR", "AU"))
SIC FYEAR AU AT V1 V2
1 1 2003 4 1706.719 1.780949e-01 0
2 1 2003 4 100.835 1.780949e-01 0
3 1 2003 4 113.865 1.780949e-01 0
4 1 2003 4 1491.200 1.780949e-01 0
5 1 2003 5 3987.884 7.111150e-01 1
6 1 2003 5 100.696 7.111150e-01 1
7 1 2003 5 67.502 7.111150e-01 1
8 1 2003 5 9461.000 7.111150e-01 1
9 1 2003 5 9.159 7.111150e-01 1
10 1 2003 6 212.748 3.362420e-02 0
What I did is very cumbersome. Is there a better way to do the same thing? Thanks.
I'm not sure if the deleted answer was the same as this, but you can effectively do it in a couple of lines.
# Simulate data
set.seed(1)
n <- 1000
dt <- data.frame(SIC = sample(1:10, n, replace = TRUE),
                 FYEAR = sample(2003:2007, n, replace = TRUE),
                 AU = sample(1:7, n, replace = TRUE),
                 AT = abs(rnorm(n)))

# Calculate each row's proportion of AT within its SIC/FYEAR group.
dt$prop <- ave(dt$AT, dt$SIC, dt$FYEAR, FUN = prop.table)

# Find the AU with the max proportion in each group.
dt$au.with.max.prop <-
  ave(dt, dt$SIC, dt$FYEAR, FUN = function(x) x$AU[x$prop == max(x$prop)])[, 1]
It is all in base, and avoids merge so it won't be that slow.
Here's a version using data.table:
require(data.table)
DT <- data.table(your_data_frame)
setkey(DT, SIC, FYEAR, AU)
DT[setkey(DT[, sum(AT), by = key(DT)
          ][, V1 := V1/sum(V1), by = list(SIC, FYEAR)]
  )[, V2 := (V1 - V1[.N-1] > 0.1) * 1, by = list(SIC, FYEAR)]]
The part DT[, sum(AT), by=key(DT)][, V1 := V1/sum(V1), by=list(SIC, FYEAR)] first sums AT by all three key columns and then replaces V1 with V1/sum(V1) within each SIC, FYEAR group, by reference. The setkey wrapping this code orders all four columns, so the last-but-one value within each group is always the second-highest value (provided there are no duplicated values). Using this, we create V2 by reference with [, V2 := (V1 - V1[.N-1] > 0.1) * 1, by=list(SIC, FYEAR)]. Once we have this, we perform a join back onto the original data using DT[.].
Hope this helps.
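For comparison, the three steps also collapse into a single dplyr pipeline (a sketch, assuming dplyr >= 1.0 for the .groups argument; column names V1/V2 as above, with the same caveat about groups containing a single AU):

library(dplyr)

dt %>%
  group_by(SIC, FYEAR, AU) %>%
  summarize(V1 = sum(AT), .groups = "drop_last") %>%   # sum AT per AU
  mutate(V1 = V1 / sum(V1),                            # share within SIC/FYEAR
         V2 = as.integer(V1 - sort(V1, decreasing = TRUE)[2] > 0.1)) %>%
  ungroup() %>%
  right_join(dt, by = c("SIC", "FYEAR", "AU"))         # attach back to every row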
