Comparing elements between groups in R

I have a data frame arranged as below:
DEPUTIES CHAMBER (...)
1 2496 1
2 2577 1
3 2577 2
4 2577 3
5 2577 4
6 2578 2
(...)
I have 2322 different deputies and 4 chambers, but some deputies appear in more than one chamber. What I want to do is to create a variable that indicates whether a deputy was in the previous chamber ("reelection") or not (the first chamber will be discarded later). I think it is probably simple, but could someone help me out?

Like this?
df <- df[order(df$DEPUTIES,df$CHAMBER),]
df$r <- unlist(aggregate(CHAMBER~DEPUTIES,df,function(x)c(NA,diff(x)))$CHAMBER)
df
# DEPUTIES CHAMBER r
# 1 2496 1 NA
# 2 2577 1 NA
# 3 2577 2 1
# 4 2577 3 1
# 5 2577 4 1
# 6 2578 2 NA
This orders df by deputy and chamber (it already seems to be ordered that way, but to be sure). Then, using aggregate(...), it calculates for each deputy the difference between the current chamber number and the previous one. If this is > 0, they went from, e.g., chamber 1 to chamber 2. I'm not sure what to do if someone starts in a chamber > 1 but never advances.
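If what you need is a plain TRUE/FALSE "reelection" flag, the same idea can be sketched with ave(), which returns a vector aligned with the rows of df. This is only a sketch on the sample rows shown in the question; the column name reelected and the rule "difference of exactly 1 means the deputy sat in the immediately preceding chamber" are my assumptions.

```r
# Sample rows from the question
df <- data.frame(DEPUTIES = c(2496, 2577, 2577, 2577, 2577, 2578),
                 CHAMBER  = c(1, 1, 2, 3, 4, 2))

# Sort so that diff() compares consecutive chambers of the same deputy
df <- df[order(df$DEPUTIES, df$CHAMBER), ]

# Per-deputy difference to the previous chamber; NA for a deputy's first row
step <- ave(df$CHAMBER, df$DEPUTIES, FUN = function(x) c(NA, diff(x)))

# Assumption: "reelected" means the deputy was in the chamber right before
df$reelected <- !is.na(step) & step == 1
df
```

A deputy who skips a chamber (say 1 and then 3) gets FALSE under this rule; use step > 0 instead if any earlier chamber should count.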

Related

For loop skips rows in R dataframe

I have a for loop printing values out of this small test dataframe.
USA Finland China Sweden
1 1 3 5.505962 8.310596
2 2 4 11.033347 5.425747
3 3 5 14.932882 3.272544
4 4 6 10.155517 5.980190
5 5 7 11.020148 3.692313
Total 0 0 0.000000 0.000000
This line prints out a line from the dataframe:
print(countries[2,])
and results in this:
USA Finland China Sweden
2 2 4 11.03335 5.425747
So based on that, I imagine I could do the same in a for loop and print out all the lines. Code for the loop:
for (i in countries[1,])
{
print(countries[i,])
}
However this results in only every second line printed out which doesn't make sense. The result I get is this:
USA Finland China Sweden
1 1 3 5.505962 8.310596
USA Finland China Sweden
3 3 5 14.93288 3.272544
USA Finland China Sweden
5 5 7 11.02015 3.692313
USA Finland China Sweden
NA NA NA NA NA
What could possibly lead to this happening? I'm using RStudio, so could it be the console logging not keeping up with the values?
@lmo's comment suggests a solution. I think you want to know why this happened, so I'll try to answer that.
You are using this code:
1: for (i in countries[1,])
2: {
3: print(countries[i,])
4: }
In line 1 you are selecting the values that i will take. These happen to be the first row of your data: 1 3 5.505962 8.310596. Used as row indexes, the fractional values are truncated, so they translate to c(1,3,5,8).
So in line 3 you are printing rows 1, 3, 5 and 8 (because those are the indexes you chose). Row 8 does not exist, which is why the last result is a row of NAs. It is a coincidence that the rows printed were every second row, but I hope you understand it better now.
Of course you should use countries[1:5,] or print(countries) instead of a for loop.
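If you do want a loop, iterate over the row numbers rather than over the contents of the first row. A minimal sketch, rebuilding the test data from the question without the Total row (idx is just a helper name I made up to show what the original loop iterated over):

```r
countries <- data.frame(
  USA     = 1:5,
  Finland = 3:7,
  China   = c(5.505962, 11.033347, 14.932882, 10.155517, 11.020148),
  Sweden  = c(8.310596, 5.425747, 3.272544, 5.980190, 3.692313)
)

# What the original loop used as row indexes: the first row, truncated
idx <- as.integer(unlist(countries[1, ]))
idx

# Looping over row numbers visits every row exactly once
for (i in seq_len(nrow(countries))) {
  print(countries[i, ])
}
```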

Finding newest data older than a specific date in R

I have two data.frames (call them dataset.new and dataset.old) that both contain information about some individuals. These individuals all have an identification number (a variable we can call "individual") that occurs in both data.frames, and each frame records when the data was collected in a column that we can call "some.date".
The second of these two data.frames (dataset.old) contains historical data for the individuals, i.e. values of some other variables measured at other times and thus each individual appears many times in dataset.old.
What I wish to do is the following. For each individual in dataset.new, find the rows from dataset.old that are the newest but still older than the observations in dataset.new. For the individuals that have no such date present in dataset.old, I want it to return NA.
This is perhaps easiest illustrated through some example data, presented below.
dataset.new
individual some.date
1 1 2016-05-01
2 2 2016-01-28
3 7 2016-03-03
dataset.old
individual some.date
1 1 2016-01-12
2 1 2015-12-30
3 1 2016-04-27
4 1 2016-05-02
5 2 2015-11-15
6 2 2012-01-27
7 2 2016-02-06
8 3 2016-04-30
9 3 2016-01-27
10 4 2016-03-01
11 4 2011-01-16
In this example, I am looking for a way to get the following output:
individual row.nr
1 1 3
2 2 5
3 7 NA
since those rows correspond to the newest data in dataset.old that still is older than the data in dataset.new.
I have code that solves the problem, but it is too slow for the data that I have in mind (well over 20 000 rows in dataset.new and many, many more in dataset.old). My solution is basically a loop over all individuals, subsetting the data at each stage.
find.previous <- function(dataset.old, individual, some.new.date){
subsetted.dataset <- dataset.old[dataset.old[, "individual"] == individual, ] # We only look at the individual in question.
subsetted.dataset <- subsetted.dataset[subsetted.dataset[, "some.date"] < some.new.date, ]# Here we get all the rows that have data that are measured BEFORE timepoint.
row.index <- which.min(some.new.date - subsetted.dataset[, "some.date"]) # This can be done, since we have already made sure that fromdatum < timepoint.
ifelse(length(row.index)!= 0, as.integer(rownames(subsetted.dataset[row.index,])), NA) # Then we output the row that had that information.
}
output <- matrix(ncol=2, nrow=0)
for(i in 1:nrow(dataset.new)){
output <- rbind(output, cbind(dataset.new[, "individual"][i], find.previous(dataset.old, dataset.new[, "individual"][i], dataset.new[, "some.date"][i])))
}
colnames(output) <- c("individual", "row.nr")
output
Any help on how to solve this problem would be greatly appreciated. I have tried using my Google skills as well as reading other posts here on Stack Overflow, but without success.
The example data can be replicated by copying the following lines of code:
dataset.new <- data.frame(individual=c(1, 2, 7), some.date=as.Date(c("2016-05-01", "2016-01-28", "2016-03-03")))
dataset.old <- data.frame(individual=c(1,1,1,1,2,2,2,3,3,4,4), some.date=as.Date(c("2016-01-12", "2015-12-30", "2016-04-27", "2016-05-02", "2015-11-15", "2012-01-27", "2016-02-06", "2016-04-30", "2016-01-27", "2016-03-01", "2011-01-16")))
You can solve this efficiently with a merge.
First make the rownumber variable you want in dataset.old. Then merge dataset.new with dataset.old on individual (left join, or merge(lhs, rhs, all.x = TRUE)). This can get you:
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
4 1 2016-05-01 2016-05-02 4
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
7 2 2016-01-28 2016-02-06 7
8 7 2016-03-03 NA NA
Subset to new.date > old.date or is.na(old.date):
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
8 7 2016-03-03 NA NA
Subset to old.date == max(old.date) or is.na(old.date) grouped by individual.
dataset.old
individual new.date old.date old.rownumber
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
8 7 2016-03-03 NA NA
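For reference, the three steps above could look roughly like this in base R. This is a sketch on the example data rather than tested on anything large; the suffixes .new/.old and the names m, latest and result are my own choices. Instead of carrying the NA row through each subset, it drops it and restores it with a second left join at the end.

```r
dataset.new <- data.frame(individual = c(1, 2, 7),
  some.date = as.Date(c("2016-05-01", "2016-01-28", "2016-03-03")))
dataset.old <- data.frame(individual = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  some.date = as.Date(c("2016-01-12", "2015-12-30", "2016-04-27", "2016-05-02",
                        "2015-11-15", "2012-01-27", "2016-02-06", "2016-04-30",
                        "2016-01-27", "2016-03-01", "2011-01-16")))

# Step 1: remember the original row numbers, then left join new onto old
dataset.old$row.nr <- seq_len(nrow(dataset.old))
m <- merge(dataset.new, dataset.old, by = "individual", all.x = TRUE,
           suffixes = c(".new", ".old"))

# Step 2: keep only old observations strictly before the new one
m <- m[!is.na(m$some.date.old) & m$some.date.old < m$some.date.new, ]

# Step 3: sort so the newest remaining old row is last within each individual
m <- m[order(m$individual, m$some.date.old), ]
latest <- m[!duplicated(m$individual, fromLast = TRUE), c("individual", "row.nr")]

# Individuals without a qualifying old row come back as NA
result <- merge(dataset.new["individual"], latest, by = "individual", all.x = TRUE)
result
```

Note that merge() sorts by individual, so the output order may differ from the order in dataset.new.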
Edit:
I'm partial to data.table. The code would look something like:
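library(data.table)
# convert both data.frames to data.tables by reference
setDT(dataset.old)
setDT(dataset.new)
dataset.old[, old.rownumber := 1:.N]
setnames(dataset.old, "some.date", "old.date")
setnames(dataset.new, "some.date", "new.date")
# left join: dataset.new goes first so that every individual in it is kept
dataset.merge <- merge(dataset.new, dataset.old, by = "individual", all.x = TRUE)
# keep older observations, plus individuals with no match at all
dataset.merge <- dataset.merge[is.na(old.date) | new.date > old.date]
dataset.merge[, .SD[is.na(old.date) | old.date == max(old.date)], by = individual]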
We can skip the NA search by finding the minimum square root. The negative values will be coerced to missing for us:
library(dplyr)
dataset.old$rn <- 1:nrow(dataset.old)
minp <- function(x) if(!length(m <- which.min(as.numeric(x)^.5))) NA else m
mrg <- merge(dataset.new, dataset.old, by="individual", all.x=TRUE)
mrg %>% group_by(individual) %>%
summarise(row.nr=rn[minp(some.date.x - some.date.y)])
# A tibble: 3 x 2
# individual row.nr
# <int> <int>
# 1 1 3
# 2 2 5
# 3 7 NA

Erasing duplicates with NA values

I have a data frame like this:
names <- c('Mike','Mike','Mike','John','John','John','David','David','David','David')
dates <- c('04-26','04-26','04-27','04-28','04-27','04-26','04-01','04-02','04-02','04-03')
values <- c(NA,1,2,4,5,6,1,2,NA,NA)
test <- data.frame(names,dates,values)
Which is:
names dates values
1 Mike 04-26 NA
2 Mike 04-26 1
3 Mike 04-27 2
4 John 04-28 4
5 John 04-27 5
6 John 04-26 6
7 David 04-01 1
8 David 04-02 2
9 David 04-02 NA
10 David 04-03 NA
I'd like to get rid of duplicates with NA values. So, in this case, I have a valid observation from Mike on 04-26 and also have a valid observation from David on 04-02, so rows 1 and 9 should be erased and I will end up with:
names dates values
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
I tried to use the duplicated function, something like this:
test[!duplicated(test[,c('names','dates')]),]
But that does not work since some NA values come before the valid value. Do you have any suggestions without trying things like merge or making another data frame?
Update: I'd like to keep rows with NA that are not duplicates.
What about this way?
library(dplyr)
test %>% group_by(names, dates) %>% filter((n()>=2 & !is.na(values)) | n()==1)
Source: local data frame [8 x 3]
Groups: names, dates [8]
names dates values
(fctr) (fctr) (dbl)
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
Here is an attempt in data.table:
# set up
library(data.table)
setDT(test)
# construct condition
test[, dupes := max(duplicated(.SD)), .SDcols=c("names", "dates"), by=c("names", "dates")]
# print out result
test[dupes == 0 | !is.na(values),]
Here is a similar method using base R, except that the dupes variable is kept separately from the data.frame:
dupes <- duplicated(test[c("names", "dates")])
# this generates warnings, but works nonetheless
dupes <- ave(dupes, test$names, test$dates, FUN=max)
# print out result
test[dupes == 0 | !is.na(test$values),]
If there are duplicated rows where the values variable is NA, and these duplicates add nothing to the data, then you can drop them prior to running the code above:
testNoNADupes <- test[!(duplicated(test) & is.na(test$values)),]
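This should work based on your sample. Note that it checks names and dates for duplication independently rather than as a pair, so on other data it may drop an NA row whose date merely appears under a different name.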
test <- test[order(test$values),]
test <- test[!(duplicated(test$names) & duplicated(test$dates) & is.na(test$values)),]
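Another base R way to state the rule directly ("drop an NA row only if its name/date pair has at least one valid value") is sketched below. The helper names key and has_valid are mine, and I have only checked it against the sample data.

```r
names  <- c('Mike','Mike','Mike','John','John','John','David','David','David','David')
dates  <- c('04-26','04-26','04-27','04-28','04-27','04-26','04-01','04-02','04-02','04-03')
values <- c(NA, 1, 2, 4, 5, 6, 1, 2, NA, NA)
test   <- data.frame(names, dates, values)

# One key per (names, dates) pair
key <- paste(test$names, test$dates)

# TRUE for every row whose pair has at least one non-NA value
has_valid <- ave(!is.na(test$values), key, FUN = any)

# Keep valid rows, and NA rows that have no valid counterpart
result <- test[!is.na(test$values) | !has_valid, ]
result
```

If a pair has several NA rows and no valid value, all of them are kept; deduplicate those separately if that matters.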

Combining observations with overlapping dates

Each observation in my dataframe contains a different "before date" and "after date". The problem is that some date ranges overlap within an ID. For instance, in the table below, IDs 1 and 4 contain overlapping date values.
ID before date after date
1 10/1/1996 12/1/1996
1 1/1/1998 9/30/2003
1 1/1/2000 12/31/2004
2 1/1/2001 3/31/2006
3 1/1/2001 9/30/2006
4 1/1/2001 9/30/2005
4 10/1/2004 12/30/2004
4 10/3/2004 11/28/2004
I am trying to get something like this:
ID before date after date
1 10/1/1996 12/1/1996
1 1/1/1998 12/31/2004
2 1/1/2001 3/31/2006
3 1/1/2001 9/30/2006
4 1/1/2001 9/30/2005
Basically, I would like to replace any overlapping date ranges with the single range that spans them, leave the non-overlapping ranges alone, and delete any unnecessary rows. I'm not sure how to go about doing this.
Firstly, you should convert your string dates into Date-classed values, which will make comparison possible. Here's how I've defined and coerced your data:
df <- data.frame(ID=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004') );
dcis <- grep('date$',names(df));
df[dcis] <- lapply(df[dcis],as.Date,'%m/%d/%Y');
df;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2004-12-30
## 8 4 2004-10-03 2004-11-28
Now, my solution involves computing an "overlapping grouping" vector which I've called og. It makes the assumption that the input df is ordered by ID and then before.date, which it is in your example data. If not, this could be achieved by df[order(df$ID,df$before.date),]. Here's how I compute og:
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
Unfortunately, the base R cummax() function doesn't work on Date-classed objects, so I had to write a cummax.Date() shim. I'll explain the need for the ave() and cummax() business at the end of the post.
As you can see, the above computation lags the RHS of each of the two vectorized comparisons by excluding the first element via [-1]. This allows us to compare a record's ID for equality with the following record's ID, and also compare if its after.date is after the before.date of the following record. The resulting logical vectors are ANDed (&) together. The negation of that logical vector then represents adjacent pairs of records that do not overlap, and thus we can cumsum() the result (and prepend zero, as the first record must start with zero) to get our grouping vector.
Finally, for the final piece of the solution, I've used by() to work with each overlapping group independently:
do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=max(g$after.date))));
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
Since all records in a group must have the same ID, and we've made the assumption that records are ordered by before.date (after being ordered by ID, which is no longer relevant), we can get the correct ID and before.date values from the first record in the group. That's why I started with g[1,]. Then we just need to get the greatest after.date from the group via max(g$after.date), and overwrite the first record's after.date with that, which I've done with transform().
A word about performance: The assumption about ordering aids performance, because it allows us to simply compare each record against the immediately following record via lagged vectorized comparisons, rather than comparing every record in a group with every other record.
Now, for the ave() and cummax() business. I realized after writing the initial version of my answer that there was a flaw in my solution, which happens to not be exposed by your example data. Say there are three records in a group. If the first record has a range that overlaps with both of the following two records, and then the middle record does not overlap with the third record, then my (original) code would fail to identify that the third record is part of the same overlapping group of the previous two records.
The solution is to not simply use the after.date of the current record when comparing against the following record, but instead use the cumulative maximum after.date within the group. If any earlier record sprawled completely beyond its immediately following record, then it obviously overlapped with that record, and its after.date is what's important in considering overlapping groups for subsequent records.
Here's a demonstration of input data that requires this fix, using your df as a base:
df2 <- df;
df2[7,'after.date'] <- '2004-10-02';
df2;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2004-10-02
## 8 4 2004-10-03 2004-11-28
Now record 6 overlaps with both records 7 and 8, but record 7 does not overlap with record 8. The solution still works:
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
do.call(rbind,by(df2,og,function(g) transform(g[1,],after.date=max(g$after.date))));
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
Here's a proof that the og calculation would be wrong without the ave()/cummax() fix:
og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 5
Minor adjustment to the solution, to overwrite after.date in advance of the og computation, and avoid the max() call (makes more sense if you're planning on overwriting the original df with the new aggregation):
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
df$after.date <- ave(df$after.date,df$ID,FUN=cummax);
df;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2005-09-30
## 8 4 2004-10-03 2005-09-30
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
df <- do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=g$after.date[nrow(g)])));
df;
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
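For comparison, here is a compact sketch of the same cumulative-maximum idea that works on the numeric day counts, so it does not need the cummax.Date() shim. It is not the code above, just a variation; run.max, new.group and merged are names I picked, and the group numbers start at 1 rather than 0.

```r
df <- data.frame(
  ID = c(1, 1, 1, 2, 3, 4, 4, 4),
  before.date = as.Date(c("1996-10-01", "1998-01-01", "2000-01-01", "2001-01-01",
                          "2001-01-01", "2001-01-01", "2004-10-01", "2004-10-03")),
  after.date  = as.Date(c("1996-12-01", "2003-09-30", "2004-12-31", "2006-03-31",
                          "2006-09-30", "2005-09-30", "2004-12-30", "2004-11-28"))
)
df <- df[order(df$ID, df$before.date), ]
n  <- nrow(df)

# Running maximum of after.date within each ID, as days since 1970-01-01
run.max <- ave(as.numeric(df$after.date), df$ID, FUN = cummax)

# A row opens a new group if the ID changes or it starts on or after
# the latest end date seen so far for that ID
new.group <- c(TRUE, df$ID[-1] != df$ID[-n] |
                     as.numeric(df$before.date[-1]) >= run.max[-n])
og <- cumsum(new.group)

# First row of each group gives ID and start; the group maximum gives the end
merged <- data.frame(
  ID          = df$ID[new.group],
  before.date = df$before.date[new.group],
  after.date  = as.Date(as.vector(tapply(as.numeric(df$after.date), og, max)),
                        origin = "1970-01-01")
)
merged
```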

selecting rows with specific conditions in R

I currently have data that looks like this for multiple ids (ranging up to around 1600):
id year name status
1 1980 James 3
1 1981 James 3
1 1982 James 3
1 1983 James 4
1 1984 James 4
1 1985 James 1
1 1986 James 1
1 1987 James 1
2 1982 John 2
2 1983 John 2
2 1984 John 1
2 1985 John 1
I want to subset this data so that it only has the information for status=1 and the status right before that. I also want to eliminate multiple 1s and only save the first 1s. In conclusion I would want:
id year name status
1 1984 James 4
1 1985 James 1
2 1983 John 2
2 1984 John 1
I'm doing this because I'm trying to figure out, for each year, how many people changed from a given status to status 1. I only know the subset command, and I don't think I can get this data from subset(data, subset=(status==1)). How could I also keep the row right before that?
I want to add to this question one more time: I did not get the same results when I applied the first reply (which uses the plyr package) and the third reply (which uses the duplicated command). I found that the first reply preserved the information accurately while the third one did not.
This does what you want.
library(plyr)
ddply(d, .(name), function(x) {
i <- match(1, x$status)
if (is.na(i))
NULL
else
x[c(i-1, i), ]
})
id year name status
1 1 1984 James 4
2 1 1985 James 1
3 2 1983 John 2
4 2 1984 John 1
Here's a solution - for each grouping of numbers (the cumsum bit), it looks at the first one and takes that and the previous row if status is 1:
library(data.table)
dt = data.table(your_df)
dt[dt[, if(status[1] == 1) c(.I[1]-1, .I[1]),
by = cumsum(c(0,diff(status)!=0))]$V1]
# id year name status
#1: 1 1984 James 4
#2: 1 1985 James 1
#3: 2 1983 John 2
#4: 2 1984 John 1
Using base R, here is a way to do this:
# this first line is how I imported your data after highlighting and copying (i.e. ctrl+c)
d<-read.table("clipboard",header=T)
# find entries where the subsequent row's "status" is equal to 1
# really what's going on is finding rows where "status" = 1, then subtracting 1
# to find the index of the previous row
e<-d[which(d$status==1)-1 ,]
# be careful if your first "status" entry = 1...
# What you want
# Here R will look for entries where "name" and "status" are both repeats of a
# previous row and where "status" = 1, and it will get rid of those entries
e[!(duplicated(e[,c("name","status")]) & e$status==1),]
id year name status
5 1 1984 James 4
6 1 1985 James 1
10 2 1983 John 2
11 2 1984 John 1
I like the data.table solution myself, but there actually is a way to do it with subset.
# import data from clipboard
x = read.table(pipe("pbpaste"),header=TRUE)
# Get the result table that you want
x1 = subset(x, status==1 |
c(status[-1],0)==1 )
result = subset(x1, !duplicated(cbind(name,status)) )
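Since the clipboard imports above are platform specific, here is a self-contained base R sketch of the same "first 1 plus the row before it" logic, grouping by id. The data frame is retyped from the question, and pick and result are names I made up.

```r
d <- data.frame(
  id     = c(rep(1, 8), rep(2, 4)),
  year   = c(1980:1987, 1982:1985),
  name   = c(rep("James", 8), rep("John", 4)),
  status = c(3, 3, 3, 4, 4, 1, 1, 1, 2, 2, 1, 1)
)

# For each id, collect the row numbers to keep
pick <- unlist(lapply(split(seq_len(nrow(d)), d$id), function(rows) {
  i <- match(1, d$status[rows])       # position of the first status 1 in this id
  if (is.na(i)) return(integer(0))    # this id never reaches status 1
  rows[seq(max(i - 1, 1), i)]         # the previous row (if there is one) and the first 1
}), use.names = FALSE)

result <- d[pick, ]
result
```

This assumes the rows of each id are already sorted by year, as in the example.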
