I have a data set including the following info:
id class year n
25 A63 2006 3
25 F16 2006 1
39 0901 2001 1
39 0903 2001 3
39 0903 2003 2
39 1901 2003 1
...
There are about 100k different ids and more than 300 classes. The year varies from 1998 to 2007.
What I want to do is fill the time gaps, after each id/class combination first appears, with n = 0, by id and class.
Then I want to calculate, per id, the cumulative sum of n for each class (sum), the number of active classes in each year (Qc), and the cumulative number of distinct classes seen so far (Qs).
For example, the above six lines of data should expand to the following table:
id class year n sum Qc Qs
25 A63 2006 3 3 2 2
25 F16 2006 1 1 2 2
25 A63 2007 0 3 0 2
25 F16 2007 0 1 0 2
39 0901 2001 1 1 2 2
39 0903 2001 3 3 2 2
39 0901 2002 0 1 0 2
39 0903 2002 0 3 0 2
39 0901 2003 0 1 2 3
39 0903 2003 2 5 2 3
39 1901 2003 1 1 2 3
39 0901 2004 0 1 0 3
39 0903 2004 0 5 0 3
39 1901 2004 0 1 0 3
...
39 0901 2007 0 1 0 3
39 0903 2007 0 5 0 3
39 1901 2007 0 1 0 3
I can solve this with an ugly for loop, but it takes about an hour to produce the result. Is there a better way, for example vectorizing or using data.table?
Using dplyr you could try:
library(dplyr)
df %>%
  group_by(class, id) %>%
  arrange(year) %>%
  do(merge(data.frame(year = .$year[1]:2007,
                      id = rep(.$id[1], 2007 - .$year[1] + 1),
                      class = rep(.$class[1], 2007 - .$year[1] + 1)),
           ., all.x = TRUE))
It groups the data by class and id, and merges each group with a data frame containing every year from the group's first observed year through 2007, together with that group's id and class.
Edit: if you want to do this only for ids above a certain value, you could do:
as.data.frame(rbind(df[df$id <= 25, ],
                    df %>%
                      filter(id > 25) %>%
                      group_by(class, id) %>%
                      arrange(year) %>%
                      do(merge(data.frame(year = .$year[1]:2007,
                                          id = rep(.$id[1], 2007 - .$year[1] + 1),
                                          class = rep(.$class[1], 2007 - .$year[1] + 1)),
                               ., all.x = TRUE))))
Use expand.grid to get the Cartesian product of class and year.
Then merge your current data frame to this new one, and finish with the classic subset-and-replace.
df <- data.frame(class = as.factor(c("A63", "F16", "0901", "0903", "0903", "1901")),
                 year = c(2006, 2006, 2001, 2001, 2003, 2003),
                 n = c(3, 1, 1, 3, 2, 1))
df2 <- expand.grid(class = levels(df$class),
                   year = 1998:2007)  # the years the full data set covers
df2 <- merge(df2, df, all.x = TRUE)
df2$n[is.na(df2$n)] <- 0
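Since the question also asks about data.table, here is a minimal sketch of the gap-filling step (my own addition, using the question's sample rows; the sum, Qc and Qs columns would still have to be computed afterwards):
library(data.table)

DT <- data.table(id    = c(25, 25, 39, 39, 39, 39),
                 class = c("A63", "F16", "0901", "0903", "0903", "1901"),
                 year  = c(2006, 2006, 2001, 2001, 2003, 2003),
                 n     = c(3, 1, 1, 3, 2, 1))

# Expand each id/class pair to every year from its first appearance through 2007,
# join the observed rows back on, and set the missing n to 0.
full <- DT[, .(year = min(year):2007), by = .(id, class)]
res  <- DT[full, on = .(id, class, year)]
res[is.na(n), n := 0]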
I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. The survey follows a large group of people for 25 years and records changes in 'region' (categorical, 1-4), 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' changed (e.g. from region 1 to region 3, or from urban 0 to 1) during the 25-year observation period within each subject? I also have some NAs in the data, which should be ignored.
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First you group by i. Then you can create a column that flags each change in region with a 1, comparing the current region with the previous value (using lag). Note that if the previous value is NA (as when looking at the first value for a given i), it is counted as no change.
The same approach is taken for urban. Then summarize, totalling up all the changes for each i. I left in these temporary variables so you can examine whether you are getting the desired results.
Edit: if you wish to remove rows that have NA for region or urban, you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
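For comparison, a data.table sketch of the same logic (my own addition, assuming the same df): drop the NA rows, then count how often consecutive values differ within each i.
library(data.table)

dt <- as.data.table(df)[!is.na(region) & !is.na(urban)]
dt[, .(tot_region = sum(region != shift(region), na.rm = TRUE),
       tot_urban = sum(urban != shift(urban), na.rm = TRUE)),
   by = i]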
Given the following toy example:
set.seed(200)
h <- data.frame(T1 = sample(0:100, size = 20),
                ID = sample(c("A", "B", "C", "D"), size = 20, replace = TRUE),
                yr = sample(2006:2010, size = 20, replace = TRUE))
How can I
1. calculate the proportion of IDs having more than one instance per year,
2. create a variable that increments for each ascending value of T1 per ID and year, and
3. subtract each instance T1(1) from T1(2), T1(2) from T1(3), etc. for each ID?
I figured out the first one:
h %>%
  group_by(yr, ID) %>%
  summarise(n = n()) %>%
  summarise(n2 = sum(n > 1), n3 = n(), n4 = n2 / n3)
Now, to the last two questions - this is the desired output:
T1 ID yr Inc.var diff
1 92 A 2006 1 6
2 98 A 2006 2 0
3 41 B 2006 1 0
4 26 C 2006 1 71
5 97 C 2006 2 0
6 11 D 2006 1 56
7 67 D 2006 2 0
8 9 B 2008 1 44
9 53 B 2008 2 4
10 57 B 2008 3 19
11 76 B 2008 4 0
12 33 D 2008 etc etc
13 48 A 2009
14 58 A 2009
15 99 A 2009
16 52 B 2009
17 80 B 2009
18 13 B 2010
19 64 B 2010
20 21 C 2010
Here is how I solved the last two questions:
j <- h %>%
  group_by(ID, yr) %>%
  arrange(T1) %>%
  mutate(diff = lead(T1) - T1,
         inc.var = seq(length(T1))) %>%
  arrange(yr)
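For comparison, a data.table sketch of the same two steps (my own addition, assuming h as created above):
library(data.table)

setDT(h)                      # convert in place
setorder(h, yr, ID, T1)       # sort so T1 ascends within each ID/yr group
h[, `:=`(inc.var = seq_len(.N),
         diff = shift(T1, type = "lead") - T1),
  by = .(ID, yr)]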
I would like to compute the mean age for every value (1-7) of another variable called period.
This is what my data looks like:
work1 <- read.table(header=T, text="ID dead age gender inclusion_year diagnosis surv agrp period
87 0 25 2 2006 1 2174 1 5
396 0 19 2 2003 1 3077 1 3
446 0 23 2 2003 1 3144 1 3
497 0 19 2 2011 1 268 1 7
522 1 57 2 1999 1 3407 2 1
714 0 58 2 2003 1 3041 2 3
741 0 27 2 2004 1 2587 1 4
767 0 18 1 2008 1 1104 1 6
786 0 36 1 2005 1 2887 3 4
810 0 25 1 1998 1 3783 4 2")
This is a subset of a data set with more than 1500 observations.
This is what I'm trying to achieve:
sim <- read.table(header=T, text="Period diagnosis dead surv age
1 1 50 50000 35.5
2 1 80 70000 40.3
3 1 100 80000 32.8
4 1 120 100000 39.8
5 1 140 1200000 28.7
6 1 150 1400000 36.2
7 1 160 1600000 37.1")
In this data set I would like to group by period and diagnosis, summing dead (deaths) and surv (survival time in days) within each period. I would also like the mean age for every period.
I have tried everything and still can't create the data set I'm striving for.
All help is appreciated!
You could try data.table
library(data.table)
as.data.table(work1)[, .(dead_sum=sum(dead),
surv_sum=sum(surv),
age_mean=mean(age)), keyby=.(period, diagnosis)]
Or dplyr
library(dplyr)
work1 %>%
  group_by(period, diagnosis) %>%
  summarise(dead_sum = sum(dead), surv_sum = sum(surv), age_mean = mean(age))
# result
period diagnosis dead_sum surv_sum age_mean
1: 1 1 1 3407 57.00000
2: 2 1 0 3783 25.00000
3: 3 1 0 9262 33.33333
4: 4 1 0 5474 31.50000
5: 5 1 0 2174 25.00000
6: 6 1 0 1104 18.00000
7: 7 1 0 268 19.00000
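A base R alternative (my own sketch, using the same work1): compute the sums and the mean separately with aggregate, then merge the two results.
sums <- aggregate(cbind(dead_sum = dead, surv_sum = surv) ~ period + diagnosis,
                  data = work1, FUN = sum)
means <- aggregate(cbind(age_mean = age) ~ period + diagnosis,
                   data = work1, FUN = mean)
merge(sums, means, by = c("period", "diagnosis"))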
I have the following code.
DT <- data.table(s3ITR)
DTKey <- data.table(s3Key, key = "Age")
> DT
Index Country Age Time Charity
1: 1 France 30 40 1
2: 2 France 40 40 0
3: 3 France 40 50 0
4: 4 Germany 40 40 1
5: 5 France 60 40 1
6: 6 France 40 40 1
7: 7 Germany 30 40 0
8: 8 Germany 30 40 1
9: 9 Germany 30 40 NA
10: 10 Germany 30 40 1
> DTKey
Index Country Age Time Charity
1: 1 France 30 40 0
2: 2 Germany 30 40 0
3: 3 Germany 30 40 1
4: 4 Germany 30 40 0
5: 5 Germany 30 40 1
6: 6 Germany 30 40 1
I would like to impute the NAs in DT by random sampling from DTKey; the result may be stored in a new column called Impute.
I can easily set a key within DT and sample from DT itself with the code below
DT <- data.table(s3ITR, key = "Age")
DT[, Impute := sample(na.omit(Charity), length(Charity), replace = TRUE), by = key(DT)]
DT[!is.na(Charity), Impute := Charity]
It is a bit convoluted, but it works and I get the result below
Index Country Age Time Charity Impute
1: 1 France 30 40 1 1
2: 2 France 40 40 0 0
3: 3 France 40 50 0 0
4: 4 Germany 40 40 1 1
5: 5 France 60 40 1 1
6: 6 France 40 40 1 1
7: 7 Germany 30 40 0 0
8: 8 Germany 30 40 1 1
9: 9 Germany 30 40 NA 1
10: 10 Germany 30 40 1 1
Here the probability of the NA being imputed as 1 is 3/4. I would like to do exactly the same thing but sample from DTKey instead, where the probability would be 3/6.
Is there an easy way to do this without merging the tables?
Is there a special reason why you want to sample from DTKey? To achieve a "fair" probability you could simply use:
sample(0:1, 1, replace = TRUE)
assuming that Charity is either 0 or 1.
UPDATE:
Okay, in that case you could try the following:
DT[, Impute := sample(DTKey[, Charity], .N, replace = TRUE)]
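Note that this samples from all of DTKey (which happens to give P(1) = 3/6 here, since every DTKey row has Age 30) and overwrites the non-NA rows as well. If you want the sampling restricted to the matching Age group and to the NA rows only, a sketch (my own addition, not part of the original answer) could look like:
DT[, Impute := Charity]  # start from the observed values
DT[is.na(Charity), Impute := {
  pool <- DTKey$Charity[DTKey$Age == .BY$Age]  # matching Age pool from DTKey
  sample(na.omit(pool), .N, replace = TRUE)
}, by = Age]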
I am working with a large dataset of patent data. Each row is an individual patent, and columns contain information including application year and number of citations in the patent.
> head(p)
allcites appyear asscode assgnum cat cat_ocl cclass country ddate gday gmonth
1 6 1974 2 1 6 6 2/161.4 US 6 1
2 0 1974 2 1 6 6 5/11 US 6 1
3 20 1975 2 1 6 6 5/430 US 6 1
4 4 1974 1 NA 5 <NA> 114/354 6 1
5 1 1975 1 NA 6 6 12/142S 6 1
6 3 1972 2 1 6 6 15/53.4 US 6 1
gyear hjtwt icl icl_class icl_maingroup iclnum nclaims nclass nclass_ocl
1 1976 1 A41D 1900 A41D 19 1 4 2 2
2 1976 1 A47D 701 A47D 7 1 3 5 5
3 1976 1 A47D 702 A47D 7 1 24 5 5
4 1976 1 B63B 708 B63B 7 1 7 114 9
5 1976 1 A43D 900 A43D 9 1 9 12 12
6 1976 1 B60S 304 B60S 3 1 12 15 15
patent pdpass state status subcat subcat_ocl subclass subclass1 subclass1_ocl
1 3930271 10030271 IL 63 63 161.4 161.4 161
2 3930272 10156902 PA 65 65 11.0 11 11
3 3930273 10112031 MO 65 65 430.0 430 331
4 3930274 NA CA 55 NA 354.0 354 2
5 3930275 NA NJ 63 63 NA 142S 142
6 3930276 10030276 IL 69 69 53.4 53.4 53
subclass_ocl term_extension uspto_assignee gdate
1 161 0 251415 1976-01-06
2 11 0 246000 1976-01-06
3 331 0 10490 1976-01-06
4 2 0 0 1976-01-06
5 142 0 0 1976-01-06
6 53 0 243840 1976-01-06
I am attempting to create a new data frame containing the mean number of citations (allcites) per application year (appyear), separated by category (cat), for patents from 1970 to 2006 (the data goes all the way back to 1901). I did this successfully, but my solution feels ad hoc and does not take advantage of R's capabilities. Here is my solution:
#citations by category
citescat <- data.frame("chem"=integer(37),
"comp"=integer(37),
"drugs"=integer(37),
"ee"=integer(37),
"mech"=integer(37),
"other"=integer(37),
"year"=1970:2006
)
for (i in 1:37) {
  for (j in 1:6) {
    citescat[i, j] <- mean(p$allcites[p$appyear == (i + 1969) & p$cat == j], na.rm = TRUE)
  }
}
I am wondering if there is a simple way to do this without the nested for loops, one that would make it easy to apply small tweaks. It is hard for me to pin down exactly what I am looking for beyond that, but my code just looks ugly to me and I suspect there are better ways to do this in R.
Joran is right - here's a plyr solution. Without your dataset in a usable form it's hard to show you exactly, but here it is on a simplified dataset:
library(plyr)
library(reshape2)  # for dcast below

p <- data.frame(allcites = sample(1:20, 20), appyear = 1974:1975, pcat = rep(1:4, each = 5))
#First calculate the means of each group
cites <- ddply(p, .(appyear, pcat), summarise, meancites = mean(allcites, na.rm = TRUE))
#This gives us the data in long form
# appyear pcat meancites
# 1 1974 1 14.666667
# 2 1974 2 9.500000
# 3 1974 3 10.000000
# 4 1974 4 10.500000
# 5 1975 1 16.000000
# 6 1975 2 4.000000
# 7 1975 3 12.000000
# 8 1975 4 9.333333
#Now use dcast to get it in wide form (which I think your for loop was doing):
citescat <- dcast(cites, appyear ~ pcat, value.var = "meancites")
# appyear 1 2 3 4
# 1 1974 14.66667 9.5 10 10.500000
# 2 1975 16.00000 4.0 12 9.333333
Hopefully you can see how to adapt that to your specific data.
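If you prefer the tidyverse today, here is a dplyr/tidyr sketch of the same summarise-then-widen pattern (my own addition, run on the toy p above; on the real data you would first filter(appyear >= 1970, appyear <= 2006) and group by cat instead of pcat):
library(dplyr)
library(tidyr)

p %>%
  group_by(appyear, pcat) %>%
  summarise(meancites = mean(allcites, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = pcat, values_from = meancites)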