An increasing counter for occurrence of new values in R

I am trying to make a counter which increases for each new change in another vector. E.g. I have several individuals that are observed over several weeks, and I want to know how many weeks they are observed. So I'll end up with a table like this:
Id year Week Weeks observed
 1 2006   10              1
 1 2006   10              1
 1 2006   11              2
 1 2006   11              2
 1 2006   12              3
 1 2006   13              4
 1 2007    1              5
 1 2007    2              6
 1 2007    3              7
 1 2007    4              8
 1 2007    5              9
 1 2007    6             10
 2 2006   10              1
 2 2006   10              1
 2 2006   11              2
 2 2006   11              2
 2 2006   12              3
 2 2006   13              4
 2 2007    1              5
 2 2007    2              6
 2 2007    3              7
 2 2007    4              8
 2 2007    5              9
 2 2007    6             10

Assuming you have your data in a data.frame called dat, you could use tapply(), converting Phase (the column of interest in the question's original phrasing) to a factor and then unclassing it to get the underlying integer codes:
dat$newcounter <- unlist(tapply(dat$Phase, dat$Id,
                                function(x) unclass(as.factor(x))))
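Note that unlist(tapply(...)) reassembles the groups in the order of the grouping values, so it assumes dat is sorted by Id. A base-R sketch that preserves row order, using ave() instead (assuming Phase is numeric):
dat$newcounter <- ave(dat$Phase, dat$Id, FUN = function(x) as.integer(as.factor(x)))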
Obligatory data.table answer:
library(data.table)
dt <- as.data.table(dat)
dt[, newcounter := unclass(as.factor(Phase)), by = Id]
EDIT
To account for the newly phrased question, here is a possibility using data.table.
dt <- as.data.table(dat[, -4]) # Create data.table
setkeyv(dt, c("Id", "year", "Week")) # Create key for data.table
dt2 <- unique(dt) # Get only unique rows by key
dt3 <- dt2[, Weeks.observed := seq_len(.N), by = "Id"] # Create new variable
dt[dt3] # Merge data.tables back together
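For completeness, here is a base-R sketch of the same counter (assuming, as in the example table, that rows are already ordered by Id, year and Week; wk is a helper name introduced here):
wk <- paste(dat$year, dat$Week)                # year/week combination per row
dat$Weeks.observed <- ave(seq_along(wk), dat$Id,
                          FUN = function(i) cumsum(!duplicated(wk[i])))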

Related

Lapply in a dataframe over different variables using filters

I'm trying to calculate several new variables in my dataframe; take initial-year values, for example.
Say I have:
Dataset <- data.frame(time=rep(c(1990:1992),2),
                      geo=c(rep("AT",3), rep("DE",3)), var1=c(1:6), var2=c(7:12))
  time geo var1 var2
1 1990  AT    1    7
2 1991  AT    2    8
3 1992  AT    3    9
4 1990  DE    4   10
5 1991  DE    5   11
6 1992  DE    6   12
And I want:
  time geo var1 var2 var1_1990 var1_1991 var2_1990 var2_1991
1 1990  AT    1    7         1         2         7         8
2 1991  AT    2    8         1         2         7         8
3 1992  AT    3    9         1         2         7         8
4 1990  DE    4   10         4         5        10        11
5 1991  DE    5   11         4         5        10        11
6 1992  DE    6   12         4         5        10        11
So both time and the variable are changing for the new variables. Here is my attempt:
intitialyears <- c(1990,1991)
intitialvars <- c("var1", "var2")
# ideally, I want code where I only have to change these two vectors
# and where it's possible to change their dimensions
for (i in intitialyears){
  lapply(intitialvars, function(x){
    rep(Dataset[Dataset$time==i, x], each=length(unique(Dataset$time)))
  })
}
Which runs without error but yields nothing. I would like to assign the variable names from the example (e.g. "var1_1990") and immediately make the new variables part of the dataframe. I would also like to avoid the for loop, but I don't know how to wrap two lapply()s around this function. Should I rather have the function use two arguments? Is the problem that the apply function does not carry the results into my environment? I've been stuck here for a while, so I would be grateful for any help!
P.S.: I have a solution that works combination by combination, without apply and the like, but I'm trying to get away from copy and paste:
Dataset$var1_1990 <- c(rep(Dataset$var1[which(Dataset$time==1990)],
                           each=length(unique(Dataset$time))))
This can be done with subset(), reshape(), and merge():
merge(Dataset, reshape(subset(Dataset, time %in% c(1990,1991)), dir='w', idvar='geo', sep='_'));
##   geo time var1 var2 var1_1990 var2_1990 var1_1991 var2_1991
## 1  AT 1990    1    7         1         7         2         8
## 2  AT 1991    2    8         1         7         2         8
## 3  AT 1992    3    9         1         7         2         8
## 4  DE 1990    4   10         4        10         5        11
## 5  DE 1991    5   11         4        10         5        11
## 6  DE 1992    6   12         4        10         5        11
The column order isn't exactly what you have in your question, but you can fix that up after the fact with an index operation, if necessary.
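For instance, a sketch of that index operation (res is just a hypothetical name for the merged result):
res <- merge(Dataset, reshape(subset(Dataset, time %in% c(1990,1991)),
                              dir='w', idvar='geo', sep='_'))
res[, c("time","geo","var1","var2","var1_1990","var1_1991","var2_1990","var2_1991")]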
Here's a data.table method:
require(data.table)
dt <- as.data.table(Dataset)
in_cols = c("var1", "var2")
out_cols = do.call("paste", c(CJ(in_cols, unique(dt$time)), sep="_"))
dt[, (out_cols) := unlist(lapply(.SD, as.list), FALSE), by=geo, .SDcols=in_cols]
#    time geo var1 var2 var1_1990 var1_1991 var1_1992 var2_1990 var2_1991 var2_1992
# 1: 1990  AT    1    7         1         2         3         7         8         9
# 2: 1991  AT    2    8         1         2         3         7         8         9
# 3: 1992  AT    3    9         1         2         3         7         8         9
# 4: 1990  DE    4   10         4         5         6        10        11        12
# 5: 1991  DE    5   11         4         5         6        10        11        12
# 6: 1992  DE    6   12         4         5         6        10        11        12
This assumes that the time variable is identical (and in the same order) for each geo value.
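A quick sanity check of that assumption might look like this (a sketch, not part of the original answer):
first_time <- dt[geo == dt$geo[1], time]  # time sequence of the first geo
stopifnot(dt[, identical(time, first_time), by = geo][, all(V1)])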
With dplyr and tidyr and using a custom function try the following:
Data
Dataset <- data.frame(time=rep(c(1990:1992),2),
                      geo=c(rep("AT",3), rep("DE",3)), var1=c(1:6), var2=c(7:12))
Code
library(dplyr); library(tidyr)
intitialyears <- c(1990,1991)
intitialvars <- c("var1", "var2")
#create this function
myTranForm <- function(dataSet, varName, years){
  temp <- dataSet %>% select(time, geo, eval(parse(text=varName))) %>%
    filter(time %in% years) %>% mutate(time=paste(varName, time, sep="_"))
  names(temp)[names(temp) %in% varName] <- "someRandomStringForVariableName"
  temp <- temp %>% spread(time, someRandomStringForVariableName)
  return(temp)
}
#Then lapply on intitialvars using the custom function
DatasetList <- lapply(intitialvars, function(x) myTranForm(Dataset, x, intitialyears))
#and loop over the data frames in the list
for(i in 1:length(intitialvars)){
  Dataset <- left_join(Dataset, DatasetList[[i]])
}
Dataset
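As an aside, the explicit loop can also be written as a fold over the list (a sketch using base R's Reduce):
Dataset <- Reduce(left_join, DatasetList, Dataset)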

Merge unequal length data.frames by id in R

Sample data
x <- data.frame(id=c(1,1,1,2,2,7,7,7,7),dna=c(232,424,5345,45345,45,345,4543,345345,4545))
y <- data.frame(id=c(1,1,1,2,2,7,7,7,7),year=c(2001,2002,2003,2005,2006,2000,2001,2002,2003))
A plain merge(x, y, by="id") doesn't give a good solution: it produces duplicates.
Now, for the above sample data, a simple cbind(x, y) works, and this is what I'm after: just pairing the year with the corresponding id.
The problem arises when the two data.frames do not match, so that the data.frame containing the variable year is shorter. Something like this:
x <- data.frame(id=c(1,1,1,2,2,7,7,7,7),dna=c(232,424,5345,45345,45,345,4543,345345,4545))
y <- data.frame(id=c(1,1,1,2,2,7,7,7),year=c(2001,2002,2003,2005,2006,2000,2001,2002))
So I need to pair the two data.frames, with the unmatched rows of data.frame x set to NA so that I can remove those rows.
Desired output for the shorter sample data:
  id year    dna
1  1 2001    232
2  1 2002    424
3  1 2003   5345
4  2 2005  45345
5  2 2006     45
6  7 2000    345
7  7 2001   4543
8  7 2002 345345
You should add a record number to each id so you can work with merge:
x <- transform(x, rec = ave(id, id, FUN = seq_along))
y <- transform(y, rec = ave(id, id, FUN = seq_along))
merge(x, y, c("id", "rec"))
#   id rec    dna year
# 1  1   1    232 2001
# 2  1   2    424 2002
# 3  1   3   5345 2003
# 4  2   1  45345 2005
# 5  2   2     45 2006
# 6  7   1    345 2000
# 7  7   2   4543 2001
# 8  7   3 345345 2002
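If you would rather see the unmatched rows of x as NAs first and then drop them, as the question describes, a sketch using an outer merge:
m <- merge(x, y, by = c("id", "rec"), all.x = TRUE)  # unmatched rows get year = NA
na.omit(m)[, c("id", "year", "dna")]                 # drop them and reorder the columns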

Create a panel data frame

I would like to create a panel from a dataset in which each unit is observed in only one time period, such that every unit gets an observation for every time period. Using the following example:
id <- seq(1:4)
year <- c(2005, 2008, 2008, 2007)
y <- c(1,0,0,1)
frame <- data.frame(id, year, y)
frame
  id year y
1  1 2005 1
2  2 2008 0
3  3 2008 0
4  4 2007 1
For each unique ID, I would like there to be a unique observation for the years 2005, 2006, 2007, and 2008 (the lower and upper time bounds of this frame), with the outcome y set to 0 at all times for which there isn't an existing observation, such that the new frame looks like:
   id year y
1   1 2005 1
2   1 2006 0
3   1 2007 0
4   1 2008 0
....
13  4 2005 0
14  4 2006 0
15  4 2007 1
16  4 2008 0
I haven't had much success with loops; any and all thoughts would be greatly appreciated.
1) reshape2. Create a grid g of all year and id values crossed, and rbind it with frame.
Then, using the reshape2 package, cast frame from long to wide form and melt it back to long form. Finally, rearrange the rows and columns as desired.
The lines ending in a single # are only there to ensure that every year is present, so if we knew that were already the case they could be omitted. The line ending in ## only rearranges the rows and columns, so if that did not matter it could be omitted too.
library(reshape2)
g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #
wide <- dcast(frame, year ~ id, fill = 0, fun = sum, value.var = "y")
long <- melt(wide, id = "year", variable.name = "id", value.name = "y")
long <- long[order(long$id, long$year), c("id", "year", "y")] ##
giving:
> long
   id year y
1   1 2005 1
2   1 2006 0
3   1 2007 0
4   1 2008 0
5   2 2005 0
6   2 2006 0
7   2 2007 0
8   2 2008 0
9   3 2005 0
10  3 2006 0
11  3 2007 0
12  3 2008 0
13  4 2005 0
14  4 2006 0
15  4 2007 1
16  4 2008 0
2) aggregate. A shorter solution is to run just the two lines that end with # above and then follow them with an aggregate as shown. This solution uses no add-on packages.
g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #
aggregate(y ~ year + id, frame, sum)[c("id", "year", "y")]
This gives the same answer as solution (1), except that, as noted by a commenter, solution (1) makes id a factor whereas this solution does not.
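If you use solution (1) and need id back as a number, the usual factor-to-integer idiom applies (a sketch):
long$id <- as.integer(as.character(long$id))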
Using data.table:
require(data.table)
DT <- data.table(frame, key=c("id", "year"))
comb <- CJ(1:4, 2005:2008) # like 'expand.grid', but faster + sets key
ans <- DT[comb][is.na(y), y:=0L] # perform a join (DT[comb]), then set NAs to 0
#     id year y
#  1:  1 2005 1
#  2:  1 2006 0
#  3:  1 2007 0
#  4:  1 2008 0
#  5:  2 2005 0
#  6:  2 2006 0
#  7:  2 2007 0
#  8:  2 2008 0
#  9:  3 2005 0
# 10:  3 2006 0
# 11:  3 2007 0
# 12:  3 2008 0
# 13:  4 2005 0
# 14:  4 2006 0
# 15:  4 2007 1
# 16:  4 2008 0
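As a side note, CJ() also accepts named arguments, in case you want comb to carry proper column names (a sketch):
comb <- CJ(id = 1:4, year = 2005:2008)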
Maybe not an elegant solution, but anyway:
df <- expand.grid(id=id, year=unique(year))
frame <- frame[frame$y != 0,]
df$y <- 0
df2 <- rbind(frame, df)
df2 <- df2[!duplicated(df2[,c("id", "year")]),]
df2 <- df2[order(df2$id, df2$year),]
rownames(df2) <- NULL
df2
#    id year y
# 1   1 2005 1
# 2   1 2006 0
# 3   1 2007 0
# 4   1 2008 0
# 5   2 2005 0
# 6   2 2006 0
# 7   2 2007 0
# 8   2 2008 0
# 9   3 2005 0
# 10  3 2006 0
# 11  3 2007 0
# 12  3 2008 0
# 13  4 2005 0
# 14  4 2006 0
# 15  4 2007 1
# 16  4 2008 0

R Example - ddply, ave, and merge

I have written some code; it would be great if you could suggest a better way of doing what I am trying to do. The data dt is given as follows:
   SIC FYEAR AU       AT
1    1  2003  6  212.748
2    1  2003  5 3987.884
3    1  2003  4  100.835
4    1  2003  4 1706.719
5    1  2003  5    9.159
6    1  2003  7   60.069
7    1  2003  5  100.696
8    1  2003  4  113.865
9    1  2003  6  431.552
10   1  2003  7  309.109 ...
My job is to create a new column: for a given SIC and FYEAR, the AU with the highest percentage of AT gets a value of 1 if the difference between the highest and the second-highest percentage is more than 10%, and 0 otherwise. Here is my attempt to do this.
library(plyr)
a <- ddply(dt, .(SIC,FYEAR), function(x){ddply(x, .(AU), function(x) sum(x$AT))});
   SIC FYEAR AU        V1
1    1  2003  4  3412.619
2    1  2003  5 13626.241
3    1  2003  6   644.300
4    1  2003  7  1478.633
5    1  2003  9     0.003
6    1  2004  4  3976.242
7    1  2004  5  9383.516
8    1  2004  6   457.023
9    1  2004  7   456.167
10   1  2004  9   238.282
where V1 represents the sum of AT over all rows for a given AU, SIC, and FYEAR. Next I do:
a$V1 <- ave(a$V1, a$SIC, a$FYEAR, FUN = function(x) x/sum(x));
   SIC FYEAR AU           V1
1    1  2003  4 1.780949e-01
2    1  2003  5 7.111150e-01
3    1  2003  6 3.362420e-02
4    1  2003  7 7.716568e-02
5    1  2003  9 1.565615e-07
6    1  2004  4 2.740114e-01
7    1  2004  5 6.466382e-01
8    1  2004  6 3.149444e-02
9    1  2004  7 3.143545e-02
10   1  2004  9 1.642052e-02
The column V1 now represents each AU's percentage contribution to AT for a given SIC and FYEAR. Next,
a$V2 <- ave(a$V1, a$SIC, a$FYEAR,
            FUN = function(x) {t <- sort(x, TRUE)[2]; ifelse((x - t) > 0.1, 1, 0)});
   SIC FYEAR AU           V1 V2
1    1  2003  4 1.780949e-01  0
2    1  2003  5 7.111150e-01  1
3    1  2003  6 3.362420e-02  0
4    1  2003  7 7.716568e-02  0
5    1  2003  9 1.565615e-07  0
6    1  2004  4 2.740114e-01  0
7    1  2004  5 6.466382e-01  1
8    1  2004  6 3.149444e-02  0
9    1  2004  7 3.143545e-02  0
10   1  2004  9 1.642052e-02  0
The AU for a given SIC and FYEAR which has the highest percentage contribution to AT gets a 1 if the difference is greater than 10%, and 0 otherwise.
Then I merge the result with original data dt.
dt <- merge(dt, a, by=c("SIC","FYEAR","AU"));
   SIC FYEAR AU       AT           V1 V2
1    1  2003  4 1706.719 1.780949e-01  0
2    1  2003  4  100.835 1.780949e-01  0
3    1  2003  4  113.865 1.780949e-01  0
4    1  2003  4 1491.200 1.780949e-01  0
5    1  2003  5 3987.884 7.111150e-01  1
6    1  2003  5  100.696 7.111150e-01  1
7    1  2003  5   67.502 7.111150e-01  1
8    1  2003  5 9461.000 7.111150e-01  1
9    1  2003  5    9.159 7.111150e-01  1
10   1  2003  6  212.748 3.362420e-02  0
What I did is very cumbersome. Is there a better way to do the same stuff? Thanks.
I'm not sure if the deleted answer was the same as this, but you can effectively do it in a couple of lines.
# Simulate data
set.seed(1)
n <- 1000
dt <- data.frame(SIC=sample(1:10,n,replace=TRUE), FYEAR=sample(2003:2007,n,replace=TRUE),
                 AU=sample(1:7,n,replace=TRUE), AT=abs(rnorm(n)))
# Calculate proportion.
dt$prop<-ave(dt$AT,dt$SIC,dt$FYEAR,FUN=prop.table)
# Find AU with max proportion.
dt$au.with.max.prop <-
  ave(dt, dt$SIC, dt$FYEAR, FUN=function(x) x$AU[x$prop==max(x$prop)])[,1]
It is all in base R and avoids merge, so it won't be that slow.
Here's a version using data.table:
require(data.table)
DT <- data.table(your_data_frame)
setkey(DT, SIC, FYEAR, AU)
DT[setkey(DT[, sum(AT), by=key(DT)][, V1 := V1/sum(V1), by=list(SIC, FYEAR)])[,
     V2 := (V1 - V1[.N-1] > 0.1) * 1, by=list(SIC, FYEAR)]]
The part DT[, sum(AT), by=key(DT)][, V1 := V1/sum(V1), by=list(SIC, FYEAR)] first sums AT by all three key columns and then replaces V1 with V1/sum(V1) within SIC, FYEAR by reference. The setkey wrapping this code orders all four columns, so the last-but-one value within each group is always the second-highest value (provided there are no duplicated values). Using this, we create V2 by reference with [, V2 := (V1 - V1[.N-1] > 0.1) * 1, by=list(SIC, FYEAR)]. Once we have this, we can perform the join using DT[.].
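For comparison, here is a rough dplyr sketch of the same pipeline (an assumed translation, not the original answer; dt is the original data frame, and summarise() is relied on to drop the last grouping level):
library(dplyr)
a <- dt %>%
  group_by(SIC, FYEAR, AU) %>%
  summarise(V1 = sum(AT)) %>%                  # sum AT per SIC/FYEAR/AU
  mutate(V1 = V1 / sum(V1),                    # share within SIC/FYEAR
         V2 = as.integer(V1 - sort(V1, decreasing = TRUE)[2] > 0.1))
left_join(dt, a, by = c("SIC", "FYEAR", "AU"))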
Hope this helps.

subset dataframe

I have a dataframe with counts of geese at several different sites. The aim was to make monthly counts of geese at each site in all 8 months between September and April (a winter period), in consecutive winter periods.
If the method had been carried out as planned, this is what the data would look like:
library(lubridate) # for dmy()
df <- data.frame(site=c(rep('site 1', 16), rep('site 2', 16), rep('site 3', 16)),
                 date=dmy(rep(c('01/09/2007', '02/10/2007', '02/11/2007',
                                '02/12/2007', '02/01/2008', '02/02/2008', '02/03/2008',
                                '02/04/2008', '01/09/2008', '02/10/2008', '02/11/2008',
                                '02/12/2008', '02/01/2009', '02/02/2009', '02/03/2009',
                                '02/04/2009'), 3)),
                 count=sample(1:100, 48))
It's ended up with a situation where some sites have all 8 counts in some September-April periods but not in others. In addition, some sites never achieved 8 counts in any September-April period. These toy data look like my actual data:
df <- df[-c(11:16, 36:48),]
I need to remove rows from the dataframe which do not form part of 8 consecutive counts in a September-April period. Using the toy data, this is the dataframe I need:
df <- df[-c(9:10, 27:29), ]
I've tried various commands using ddply() from the plyr package, but without success. Is there a solution to this problem?
One way I can think of is to subtract four months from your dates, so that you can then group by year. To get the corresponding date by subtracting 4 months, I suggest you use the mondate package. See here for an excellent answer on the problem you'd face when subtracting months and how you can overcome it.
require(plyr)    # for ddply()
require(mondate)
df$grp <- mondate(df$date) - 4
df$year <- year(df$grp)
df$month <- month(df$date)
ddply(df, .(site, year), function(x) {
  if (all(c(1:4, 9:12) %in% x$month)) {
    return(x)
  } else {
    return(NULL)
  }
})
#       site       date count        grp year month
# 1   site 1 2007-09-01    87 2007-05-02 2007     9
# 2   site 1 2007-10-02    44 2007-06-02 2007    10
# 3   site 1 2007-11-02    50 2007-07-03 2007    11
# 4   site 1 2007-12-02    65 2007-08-02 2007    12
# 5   site 1 2008-01-02    12 2007-09-02 2007     1
# 6   site 1 2008-02-02     2 2007-10-03 2007     2
# 7   site 1 2008-03-02   100 2007-11-02 2007     3
# 8   site 1 2008-04-02    29 2007-12-03 2007     4
# 9   site 2 2007-09-01     3 2007-05-02 2007     9
# 10  site 2 2007-10-02    22 2007-06-02 2007    10
# 11  site 2 2007-11-02    56 2007-07-03 2007    11
# 12  site 2 2007-12-02     5 2007-08-02 2007    12
# 13  site 2 2008-01-02    40 2007-09-02 2007     1
# 14  site 2 2008-02-02    15 2007-10-03 2007     2
# 15  site 2 2008-03-02    10 2007-11-02 2007     3
# 16  site 2 2008-04-02    20 2007-12-03 2007     4
# 17  site 2 2008-09-01    93 2008-05-02 2008     9
# 18  site 2 2008-10-02    13 2008-06-02 2008    10
# 19  site 2 2008-11-02    58 2008-07-03 2008    11
# 20  site 2 2008-12-02    64 2008-08-02 2008    12
# 21  site 2 2009-01-02    92 2008-09-02 2008     1
# 22  site 2 2009-02-02    69 2008-10-03 2008     2
# 23  site 2 2009-03-02    89 2008-11-02 2008     3
# 24  site 2 2009-04-02    27 2008-12-03 2008     4
An alternative solution using data.table:
require(data.table)
require(mondate)
dt <- data.table(df)
dt[, `:=`(year=year(mondate(date)-4), month=month(date))]
dt.out <- dt[, .SD[rep(all(c(1:4,9:12) %in% month), .N)],
             by=list(site,year)][, c("year", "month") := NULL]
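The same group-wise filter can also be sketched in dplyr (assuming the year and month columns created above):
library(dplyr)
df %>%
  group_by(site, year) %>%
  filter(all(c(1:4, 9:12) %in% month))  # keep only complete winters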
