I have a really simple problem, but I'm probably not thinking vector-y enough to solve it efficiently. I tried two different approaches and they've been looping on two different computers for a long time now. I wish I could say the competition made it more exciting, but ... bleh.
rank observations in group
I have long data (many rows per person, one row per person-observation) and I basically want a variable that tells me how many times the person has been observed so far.
I have the first two columns and want the third one:
person wave obs
pers1 1999 1
pers1 2000 2
pers1 2003 3
pers2 1998 1
pers2 2001 2
Now I'm using two loop-approaches. Both are excruciatingly slow (150k rows). I'm sure I'm missing something, but my search queries didn't really help me yet (hard to phrase the problem).
Thanks for any pointers!
# ordered dataset by persnr and year of observation
person.obs <- person.obs[order(person.obs$PERSNR,person.obs$wave) , ]
person.obs$n.obs = 0
# first approach: loop through people and assign range
unp = unique(person.obs$PERSNR)
unplength = length(unp)
for(i in 1:unplength) {
print(unp[i])
person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs =
1:length(person.obs[which(person.obs$PERSNR==unp[i]),]$n.obs)
i=i+1
gc()
}
# second approach: loop through rows and reset counter at new person
pnr = 0
for(i in 1:length(person.obs[,2])) {
if(pnr!=person.obs[i,]$PERSNR) { pnr = person.obs[i,]$PERSNR
e = 0
}
e=e+1
person.obs[i,]$n.obs = e
i=i+1
gc()
}
The answer from Marek in this question has proven very useful in the past. I wrote it down and use it almost daily since it was fast and efficient. We'll use ave() and seq_along().
foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011))
foo <- transform(foo, obs = ave(rep(NA, nrow(foo)), person, FUN = seq_along))
foo
person year obs
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
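For the original question's column names, the same idea might look like the sketch below. It assumes the data frame is called person.obs with columns PERSNR and wave, as in the question; the toy values are made up for illustration. The sort matters because seq_along() just counts rows in the order they appear within each group.

```r
# Toy data, deliberately unsorted (illustrative values, not the asker's data)
person.obs <- data.frame(
  PERSNR = c("pers2", "pers1", "pers1", "pers2", "pers1"),
  wave   = c(2001, 2000, 1999, 1998, 2003)
)

# Sort by person and wave so the counter follows time order
person.obs <- person.obs[order(person.obs$PERSNR, person.obs$wave), ]

# ave() splits the first argument by PERSNR and applies seq_along() to each piece
person.obs$n.obs <- ave(seq_len(nrow(person.obs)), person.obs$PERSNR,
                        FUN = seq_along)
person.obs
```

This replaces both loops with a single vectorised call, so it should not need print() or gc() inside a loop at all.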
Another option using plyr
library(plyr)
ddply(foo, "person", transform, obs2 = seq_along(person))
person year obs obs2
1 pers1 1999 1 1
2 pers1 2000 2 2
3 pers1 2003 3 3
4 pers2 1998 1 1
5 pers2 2011 2 2
A few alternatives with the data.table and dplyr packages.
data.table:
library(data.table)
# setDT(foo) is needed to convert to a data.table
# option 1:
setDT(foo)[, rn := rowid(person)]
# option 2:
setDT(foo)[, rn := 1:.N, by = person]
both give:
> foo
person year rn
1: pers1 1999 1
2: pers1 2000 2
3: pers1 2003 3
4: pers2 1998 1
5: pers2 2011 2
If you want a true rank, you should use the frank function:
setDT(foo)[, rn := frank(year, ties.method = 'dense'), by = person]
dplyr:
library(dplyr)
# method 1
foo <- foo %>% group_by(person) %>% mutate(rn = row_number())
# method 2
foo <- foo %>% group_by(person) %>% mutate(rn = 1:n())
both giving a similar result:
> foo
Source: local data frame [5 x 3]
Groups: person [2]
person year rn
(fctr) (dbl) (int)
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
Would by do the trick?
> foo <-data.frame(person=c(rep("pers1",3),rep("pers2",2)),year=c(1999,2000,2003,1998,2011),obs=c(1,2,3,1,2))
> foo
person year obs
1 pers1 1999 1
2 pers1 2000 2
3 pers1 2003 3
4 pers2 1998 1
5 pers2 2011 2
> by(foo, foo$person, nrow)
foo$person: pers1
[1] 3
------------------------------------------------------------
foo$person: pers2
[1] 2
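Note that by(foo, foo$person, nrow) returns the total number of rows per person rather than a running counter. If the data are already sorted by person, one possible way to turn those group sizes into a counter is rle() plus sequence(). This is only a sketch and it assumes that all rows of a person are contiguous:

```r
foo <- data.frame(person = c(rep("pers1", 3), rep("pers2", 2)),
                  year   = c(1999, 2000, 2003, 1998, 2011))

# Run lengths of consecutive identical person ids (here 3 and 2)
runs <- rle(as.character(foo$person))

# sequence(c(3, 2)) expands to 1 2 3 1 2, i.e. a counter per run
foo$obs <- sequence(runs$lengths)
foo
```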
Another option using aggregate and rank in base R:
foo$obs <- unlist(aggregate(.~person, foo, rank)[,2])
# person year obs
# 1 pers1 1999 1
# 2 pers1 2000 2
# 3 pers1 2003 3
# 4 pers2 1998 1
# 5 pers2 2011 2
Related
I have a dataset that identifies observations based on two variables: Time and Country. The variable of interest is dichotomous, and has the value 0 if the event didn't occur and 1 if it did.
For some countries more than one observation is reported per year.
The data can be summarized like this:
Country  Time  Conflict  Bio Weapons
A        2000  1         0
A        2000  2         0
B        2000  3         1
C        2000  4         0
D        2000  5         1
D        2000  6         0
D        2000  7         0
D        2000  8         1
Is it possible to collapse these multiple observations into one observation per year and country, with outcome 0 if the event never occurred and 1 if the event occurred at least once? Like this:
Country  Time  Bio Weapons
A        2000  0
B        2000  1
C        2000  0
D        2000  1
Thank you in advance !
Your output is a bit unclear since it doesn't match your description, but this is what I think you want:
dat %>%
dplyr::group_by(Country, Time) %>%
dplyr::summarise(Bio_Weapons = dplyr::if_else(1 %in% Bio.Weapons, 1, 0))
# A tibble: 4 x 3
# Groups: Country [4]
Country Time Bio_Weapons
<chr> <int> <dbl>
1 A 2000 0
2 B 2000 1
3 C 2000 0
4 D 2000 1
And since I like data.table solutions:
dat[, .(Bio_Weapons = fifelse(1 %in% Bio.Weapons, 1, 0)), by=c("Country", "Time")]
Country Time Bio_Weapons
1: A 2000 0
2: B 2000 1
3: C 2000 0
4: D 2000 1
An option without ifelse
library(dplyr)
dat %>%
group_by(Country, Time) %>%
summarise(Bio_Weapons = +(1 %in% Bio.Weapons))
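A base R variant without any packages could look like this. It is a sketch that assumes the data frame is named dat and the indicator column is called Bio.Weapons, as in the answers above; the data are typed in from the table in the question. Because the indicator is 0/1, taking the maximum per group is the same as asking whether the event occurred at least once.

```r
dat <- data.frame(
  Country     = c("A", "A", "B", "C", "D", "D", "D", "D"),
  Time        = 2000L,
  Conflict    = 1:8,
  Bio.Weapons = c(0, 0, 1, 0, 1, 0, 0, 1)
)

# One row per Country/Time; max() of a 0/1 column is 1 if any row is 1
collapsed <- aggregate(Bio.Weapons ~ Country + Time, data = dat, FUN = max)
collapsed
```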
I have firm-year longitudinal data, but the years are not continuous for some firms. For example:
library(data.table)
dt = data.table(firm_id=c(rep(1,5),rep(2,5)),year=c(1990,1991,1999,2000,2001,1995,1997,2008,2009,2010))
For each firm, I want to keep observations in the most recent continuous years and remove other observations. For example, Firm 1 has five-year observations in (1990, 1991, 1999, 2000, 2001) and I want to keep (1999, 2000, 2001)
I can think of some awkward approaches to solve this issue but I am wondering if there is an easy way to solve it.
Prompted by the comments, I am also wondering if there is any way to keep the longest continuous block of years. For example,
library(data.table)
dt = data.table(firm_id=c(rep(1,5),rep(2,5)),year=c(1990,1991,1992,2000,2001,1995,1997,2008,2009,2010))
The result would be
library(data.table)
DT2 <- setorder(dt, firm_id, year)[
,d := cumsum(c(TRUE, diff(year) > 1)), by = .(firm_id) ][
,n := .N, by = .(firm_id, d) ]
DT2
# firm_id year d n
# <num> <num> <int> <int>
# 1: 1 1990 1 3
# 2: 1 1991 1 3
# 3: 1 1992 1 3
# 4: 1 2000 2 2
# 5: 1 2001 2 2
# 6: 2 1995 1 1
# 7: 2 1997 2 1
# 8: 2 2008 3 3
# 9: 2 2009 3 3
# 10: 2 2010 3 3
From here, if you want runs of 3 consecutive years or more, then
DT2[ (n > 2), ]
If you want the longest run for each firm_id, then
DT2[, .SD[n == max(n),], by = .(firm_id) ]
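For the first part of the question (keep only the most recent continuous block per firm), the same run id can be reused: the most recent block is the one with the largest d within each firm. Below is a minimal sketch in base R, using the first example data from the question. The column name d mirrors the answer above; the name recent is just for illustration.

```r
dt <- data.frame(firm_id = rep(1:2, each = 5),
                 year = c(1990, 1991, 1999, 2000, 2001,
                          1995, 1997, 2008, 2009, 2010))
dt <- dt[order(dt$firm_id, dt$year), ]

# Run id per firm: increases by one whenever the gap to the previous year is > 1
dt$d <- ave(dt$year, dt$firm_id,
            FUN = function(y) cumsum(c(TRUE, diff(y) > 1)))

# Keep the rows whose run id is the largest (most recent) one for that firm
recent <- dt[dt$d == ave(dt$d, dt$firm_id, FUN = max), c("firm_id", "year")]
recent
```

With data.table the equivalent filter would presumably be along the lines of DT2[, .SD[d == max(d)], by = .(firm_id)].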
I am looking for a way to omit the rows that are not part of a complete run between two specific values, without using a for loop. All values in the year column are between 1999 and 2002, but some runs do not include all years between these two dates. You can see the initial data as follows:
a <- data.frame(year = c(2000:2002,1999:2002,1999:2002,1999:2001),
id=c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))
year id
1 2000 4
2 2001 6
3 2002 2
4 1999 1
5 2000 3
6 2001 5
7 2002 7
8 1999 4
9 2000 2
10 2001 0
11 2002 -1
12 1999 -3
13 2000 4
14 2001 3
Processed dataset should only include consecutive rows between 1999:2002. The following data.frame is exactly what I need:
year id
1 1999 1
2 2000 3
3 2001 5
4 2002 7
5 1999 4
6 2000 2
7 2001 0
8 2002 -1
When I execute the following for loop, I get previous data.frame without any problem:
for(i in 1:which(a$year == 2002)[length(which(a$year == 2002))]){
if(a[i,1] == 1999 & a[i+3,1] == 2002){
b <- a[i:(i+3),]
}else{next}
if(!exists("d")){
d <- b
}else{
d <- rbind(d,b)
}
}
However, I have more than 1 million rows and I need to do this process without using for loop. Is there any faster way for that?
You could try this. First we create groups of consecutive numbers, then we join with the full date range, then we filter if any group is not full. If you already have a grouping variable, this can be cut down a lot.
library(tidyverse)
df <- data_frame(year = c(2000:2002,1999:2002,1999:2002,1999:2001),
id=c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))
df %>%
mutate(groups = cumsum(c(0,diff(year)!=1))) %>%
nest(-groups) %>%
mutate(data = map(data, .f = ~full_join(.x, data_frame(year = 1999:2002), by = "year")),
drop = map_lgl(data, ~any(is.na(.x$id)))) %>%
filter(drop == FALSE) %>%
unnest() %>%
select(-c(groups, drop))
#> # A tibble: 8 x 2
#> year id
#> <int> <dbl>
#> 1 1999 1
#> 2 2000 3
#> 3 2001 5
#> 4 2002 7
#> 5 1999 4
#> 6 2000 2
#> 7 2001 0
#> 8 2002 -1
Created on 2018-08-31 by the reprex package (v0.2.0).
There is a function that can do this automatically.
First, install the package called dplyr or tidyverse with command install.packages("dplyr") or install.packages("tidyverse").
Then, load the package with library(dplyr).
Then, use the filter function: a_filtered = filter(a, year >=1999 & year < 2002).
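This should be fast even if there are many rows. Note, however, that it only restricts rows to a range of years; it does not drop incomplete runs.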
We could also do this by creating a grouping column that starts a new group at every 'year' equal to 1999, then filtering to keep only the groups whose first 'year' is 1999, whose last 'year' is 2002, and which contain every 'year' in between.
library(dplyr)
a %>%
group_by(grp = cumsum(year == 1999)) %>%
filter(dplyr::first(year) == 1999,
dplyr::last(year) == 2002,
all(1999:2002 %in% year)) %>%
ungroup %>% # in case to remove the 'grp'
select(-grp)
# A tibble: 8 x 2
# year id
# <int> <dbl>
#1 1999 1
#2 2000 3
#3 2001 5
#4 2002 7
#5 1999 4
#6 2000 2
#7 2001 0
#8 2002 -1
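If you would rather avoid extra packages, the same grouping idea can be sketched in base R. This assumes, as in the example data, that rows belonging to one run are adjacent and in increasing year order; grp and complete are illustrative names.

```r
a <- data.frame(year = c(2000:2002, 1999:2002, 1999:2002, 1999:2001),
                id = c(4, 6, 2, 1, 3, 5, 7, 4, 2, 0, -1, -3, 4, 3))

# A new block starts whenever the year does not follow the previous one
grp <- cumsum(c(TRUE, diff(a$year) != 1))

# For each block, check that every year 1999..2002 is present;
# ave() recycles the single TRUE/FALSE (stored as 1/0) over the block
complete <- ave(a$year, grp, FUN = function(y) all(1999:2002 %in% y)) == 1

d <- a[complete, ]
d
```

Everything here is vectorised apart from the per-block check, so it should scale far better than calling rbind() inside a loop.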
Suppose I have a data frame with three variables as below. How do I create a new variable that for each group takes the first observation of x?
group year x
1 2000 3
1 2001 4
2 2000 1
2 2001 3
3 2000 5
3 2001 2
I want to create something like this:
group year x y
1 2000 3 3
1 2001 4 3
2 2000 1 1
2 2001 3 1
3 2000 5 5
3 2001 2 5
Set up data for example:
dd <- data.frame(group=rep(1:3,each=2),
year=rep(2000:2001,3),
x=c(3,4,1,3,5,2))
In base R, use ave(). By default this finds the group average (rather than the first value), but we can use the FUN argument to ask it to select the first value instead.
dd$y <- ave(dd$x, dd$group, FUN=function(x) x[1])
## or
dd <- transform(dd, y=ave(x, group, FUN=function(x) x[1]))
(alternatively could use FUN=function(x) head(x,1))
In tidyverse,
library(dplyr)
dd <- dd %>%
group_by(group) %>%
mutate(y=first(x))
@lmo points out another alternative in the comments:
library(data.table)
setDT(dd)[, y := first(x), by=group]
You can find nearly endless discussion of the relative merits of these three major approaches (base R vs tidyverse vs data.table) elsewhere (on StackOverflow and on the interwebs generally).
Using package plyr:
df <- data.frame(group=c(1,1,2,2,3,3),
year=c(2000,2001,2000,2001,2000,2001),
x=c(3,4,1,3,5,2))
library(plyr)
ddply(df, .(group), transform, y=x[1])
A simple version in base R
### Your data
df = read.table(text="group year x
1 2000 3
1 2001 4
2 2000 1
2 2001 3
3 2000 5
3 2001 2",
header=TRUE)
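# Find the first row index of each group, then look up x at those rows
# (indexing by df$group assumes the groups are labelled 1, 2, 3, ...)
df$y = df$x[aggregate(seq_len(nrow(df)), list(df$group), min)$x[df$group]]
df
group year x y
1 1 2000 3 3
2 1 2001 4 3
3 2 2000 1 1
4 2 2001 3 1
5 3 2000 5 5
6 3 2001 2 5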
Here's yet another way, using base R:
dd <- data.frame(group = rep(1:3, each = 2),
year = rep(2000:2001, 3),
x = c(3, 4, 1, 3, 5, 2))
transform(dd, y = unsplit(tapply(x, group, function(x) x[1]), group))
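One more base R possibility, offered as a sketch rather than a recommendation: match() returns the position of the first occurrence of each value, so matching the group column against itself gives, for every row, the index of the first row of its group. This relies on the row order of dd defining what "first" means.

```r
dd <- data.frame(group = rep(1:3, each = 2),
                 year = rep(2000:2001, 3),
                 x = c(3, 4, 1, 3, 5, 2))

# match(group, group) gives the first row index of each row's group
first_row <- match(dd$group, dd$group)

# Look up x at those first rows
dd$y <- dd$x[first_row]
dd
```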
Sample data:
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34))
id year month new.employee
1 A 2014 1 4
2 A 2014 2 6
3 A 2015 1 2
4 A 2015 2 6
5 B 2014 1 23
6 B 2014 2 2
7 B 2015 1 5
8 B 2015 2 34
Desired outcome:
desired_df <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34),
new.employee.rank=c(1,1,2,2,2,2,1,1))
id year month new.employee new.employee.rank
1 A 2014 1 4 1
2 A 2014 2 6 1
3 A 2015 1 2 2
4 A 2015 2 6 2
5 B 2014 1 23 2
6 B 2014 2 2 2
7 B 2015 1 5 1
8 B 2015 2 34 1
The ranking rule is: I choose month 2 in each year to rank number of new employees between A and B. Then I need to give those ranks to month 1. i.e., month 1 of each year rankings must be equal to month 2 ranking in the same year.
I tried this code to get rankings for each month and each year:
library(data.table)
df1 <- data.table(df1)
df1[,rank:=rank(new.employee), by=c("year","month")]
If anyone can roll the rank value within a column, so that the rank of month 1 is replaced by the rank of month 2, that might be a solution.
You've tried a data.table solution, so here's how I would do this using data.table:
library(data.table) # V1.9.6+
temp <- setDT(df1)[month == 2L, .(id, frank(-new.employee)), by = year]
df1[temp, new.employee.rank := i.V2, on = c("year", "id")]
df1
# id year month new.employee new.employee.rank
# 1: A 2014 1 4 1
# 2: A 2014 2 6 1
# 3: A 2015 1 2 2
# 4: A 2015 2 6 2
# 5: B 2014 1 23 2
# 6: B 2014 2 2 2
# 7: B 2015 1 5 1
# 8: B 2015 2 34 1
It is somewhat similar to the dplyr solution: it basically ranks the ids per year and joins them back to the original data set. I'm using data.table V1.9.6+ here.
Here's a dplyr-based solution. The idea is to reduce the data to the parts you want to compare, make the comparison, then join the results back into the original data set, expanding it to fill all of the relevant slots. Note the edits to your code for creating the sample data.
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=rep(c(2014,2014,2015,2015), 2),
month=rep(c(1,2), 4),
new.employee=c(4,6,2,6,23,2,5,34))
library(dplyr)
df1 %>%
# Reduce the data to the slices (months) you want to compare
filter(month==2) %>%
# Group the data by year, so the comparisons are within and not across years
group_by(year) %>%
# Create a variable that indicates the rankings within years in descending order
mutate(rank = rank(-new.employee)) %>%
# To prepare for merging, reduce the new data to just that ranking var plus id and year
select(id, year, rank) %>%
# Use left_join to merge the new data (.) with the original df, expanding the
# new data to fill all rows with id-year matches
left_join(df1, .) %>%
# Order the data by id, year, and month to make it easier to review
arrange(id, year, month)
Output:
Joining by: c("id", "year")
id year month new.employee rank
1 A 2014 1 4 1
2 A 2014 2 6 1
3 A 2015 1 2 2
4 A 2015 2 6 2
5 B 2014 1 23 2
6 B 2014 2 2 2
7 B 2015 1 5 1
8 B 2015 2 34 1
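The same reduce, rank, and join-back idea can also be sketched in base R, in case neither package is available. The names m2 and out are only illustrative, and the ranking is in descending order of new.employee within each year, as in the answers above.

```r
df1 <- data.frame(id = rep(c("A", "B"), each = 4),
                  year = rep(c(2014, 2014, 2015, 2015), 2),
                  month = rep(c(1, 2), 4),
                  new.employee = c(4, 6, 2, 6, 23, 2, 5, 34))

# Reduce to the month used for the comparison
m2 <- df1[df1$month == 2, c("id", "year", "new.employee")]

# Rank within each year; negating the values gives descending order
m2$new.employee.rank <- ave(-m2$new.employee, m2$year, FUN = rank)

# Join the ranks back onto every row with the same id and year
out <- merge(df1, m2[, c("id", "year", "new.employee.rank")],
             by = c("id", "year"))
out <- out[order(out$id, out$year, out$month), ]
out
```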