Add column from another data.frame based on multiple criteria - r

I have 2 data frames:
cars = data.frame(car_id=c(1,2,2,3,4,5,5),
max_speed=c(150,180,185, 200, 210, 230,235),
since=c('2000-01-01', '2000-01-01', '2007-10-01', '2000-01-01', '2000-01-01', '2000-01-01', '2009-11-18'))
voyages = data.frame(voy_id=c(1234,1235,1236,1237,1238),
car_id=c(1,2,3,4,5),
date=c('2000-01-01', '2002-02-02', '2003-03-03', '2004-04-04', '2010-05-05'))
If you look closely you can see that the cars data frame occasionally has multiple entries for a car_id because the manufacturer decided to increase the max speed of that make. Each entry has a date, since, that indicates the date from which that max speed applies.
My goal: I want to add the max_speed variable to the voyages data frame based on the values found in cars. I can't just join the 2 data frames by car_id because I also have to check the date in voyages and compare it to since in cars to determine the proper max_speed.
Question: What is the elegant way to do this without loops?

One approach:
Merge the two datasets, including duplicated observations in "cars".
Drop any observations where the date for "since" is later than the date for "date". Order the dataset so most recent dates are first, then drop duplicated observations for "voy_id"--this ensures that where there are two dates in "since", you'll only keep the most recent one that occurs before the voyage date.
z <- merge(cars, voyages, by="car_id")
z <- z[as.Date(z$since)<=as.Date(z$date),]
z <- z[order(as.Date(z$since), decreasing=TRUE),]
z <- z[!duplicated(z$voy_id),]
Also curious to see if someone comes up with a more elegant, parsimonious approach.
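For reference, here is how the steps above might look when run end to end on the sample data from the question. This is only a sketch in base R; the final reordering and column selection are additions for readability and are not part of the original answer.

```r
cars <- data.frame(
  car_id = c(1, 2, 2, 3, 4, 5, 5),
  max_speed = c(150, 180, 185, 200, 210, 230, 235),
  since = c("2000-01-01", "2000-01-01", "2007-10-01", "2000-01-01",
            "2000-01-01", "2000-01-01", "2009-11-18")
)
voyages <- data.frame(
  voy_id = c(1234, 1235, 1236, 1237, 1238),
  car_id = c(1, 2, 3, 4, 5),
  date = c("2000-01-01", "2002-02-02", "2003-03-03", "2004-04-04", "2010-05-05")
)

# Pair every voyage with every speed entry for its car
z <- merge(cars, voyages, by = "car_id")
# Keep only speed entries already in force on the voyage date
z <- z[as.Date(z$since) <= as.Date(z$date), ]
# Newest entries first, then keep the first row seen for each voyage
z <- z[order(as.Date(z$since), decreasing = TRUE), ]
z <- z[!duplicated(z$voy_id), ]
# Added for readability: restore voyage order and pick the columns of interest
z <- z[order(z$voy_id), c("voy_id", "car_id", "date", "max_speed")]
z
```

Voyage 1238 takes place after the 2009-11-18 change for car 5, so it picks up the newer max speed, while voyage 1235 predates the 2007 change for car 2 and keeps the older one.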

Related

PSM in R with specific lines

To get matched pairs with PSM (the MatchIt package with method = "full"), I need to specify my command for a longitudinal data frame. Every case has several observations, but I only need the first observation per patient to be included in the matching. So the matching should be based on each patient's first observation, but my later analysis should include the complete dataset of each patient with all observations.
Has anyone an idea how to achieve this?
I tried using a data subset (first observation per patient) but wasn't able to get the matching results back into the full data set (with all observations per patient) using match.data.
Thanks in advance
Simon (desperately writing his masters thesis)
My understanding is that you want to create matches at just the first time point but have those matches be identified for each unit at all time points. Fortunately, this is pretty straightforward: just perform the matching at the first time point and then merge the matched dataset with the full dataset. Here is how this might look. Let's say your original long dataset is d and has an ID column id and a time column time.
library(MatchIt)
m <- matchit(treat ~ X1 + X2, data = subset(d, time == 1), method = "full")
md1 <- match.data(m)
d <- merge(d, md1[c("id", "subclass", "weights")], by = "id", all.x = TRUE)
Your new dataset should have two new columns, subclass and weights, which contain the matching subclass and matching weight for each unit. Rows with identical IDs (i.e., rows corresponding to the same unit at multiple time points) will have the same values of subclass and weights.
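To see what the merge step does on its own, here is a small base R sketch that does not call MatchIt at all. The data frame md1 below is a hand-made stand-in for the output of match.data(m), and all values in it are invented purely for illustration.

```r
# Toy long dataset: 3 units observed at 2 time points each
d <- data.frame(
  id = rep(1:3, each = 2),
  time = rep(1:2, times = 3),
  y = c(5.1, 5.4, 3.2, 3.9, 4.4, 4.0)
)

# Hypothetical stand-in for match.data(m): one row per matched unit.
# Unit 3 is left out on purpose to show what happens to units without a match.
md1 <- data.frame(
  id = 1:2,
  subclass = factor(c(1, 1)),
  weights = c(1, 1)
)

# all.x = TRUE keeps every row of d; units missing from md1 get NA
d2 <- merge(d, md1[c("id", "subclass", "weights")], by = "id", all.x = TRUE)
d2
```

Both rows of unit 1 and both rows of unit 2 carry the same subclass and weights, and the rows of unit 3 are kept with NA in the two new columns, so they can be filtered out before the outcome analysis if needed.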

Comparing dates in different columns to isolate certain within-group entries in R

I have a data frame with an ID column that includes duplicates. There is a column called type that takes the values "S" or "N." There are two additional date columns - admission date and discharge date. My question is a bit similar to comparing two data frames and isolating rows based on certain date differences, but not quite. If needed, I could separate my data into two data frames, but I'm wondering if I can accomplish what I want without the extra steps.
Here is a small example of what the data for two patients looks like in R:
example <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
What I want to do is compare within patients, entries that take the value "N" and entries that take the value "S" in the type variable. Based on the discharge date for entries with the value "S," I would like to find entries with the value "N" that have an admission date within 5 days of the former's discharge date (the discharge date with value "S" should be before the admission date with value "N").
So in the example data frame, the only two entries that should be retained are rows 2 and 3 and not rows 5 and 6 since the difference between admission date and discharge date is greater than 5.
Does anyone have any suggestions of how I can filter this data? Any help is greatly appreciated.
This was an interesting challenge, partly because iterating over rows is less intuitive than iterating over columns (see this question for lots of suggestions: For each row in an R dataframe).
Now I know vectorized solutions are preferred over for loops, but one of the challenges with this problem was that instead of just performing functions on each row, we're comparing the iterated rows to other rows and deleting some rows as we go along. I expect there's a better solution out there and I hope someone posts a better solution to help me learn.
One minor note before I begin: "example" isn't a great name for an object because it's also a function in base R. Additionally, the solution is much easier if we're only dealing with alternating rows of "S" and "N"; if many S's precede an N, then only the last of those S rows might be within 5 days of the N. Nonetheless it was worth the effort to attack the more challenging case.
Ultimately I ended up solving this as a 2-stage problem, each solved with a for loop. First, I took out all the S rows which weren't within 5 days of the corresponding N rows. Then I took out those N rows which didn't have any appropriate S companions. All of this is implemented in base R.
So to begin:
example_df <- data.frame(ID = c(22,22,22,52,52,52),
admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
type = c('S','S','N','S','S','N'))
example_df$admission_date<-as.numeric(as.Date(example_df$admission_date))
example_df$discharge_date<-as.numeric(as.Date(example_df$discharge_date))
The first thing I did was to take the date columns (which were characters) and convert them to numeric based on date. Originally I was doing mathematical operations with date objects, but this became complicated with the subsetting operations I ended up using.
Here's the first for loop:
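del_vec <- integer(0)  # row numbers of the S rows to delete
for (i in seq_len(nrow(example_df))) {
  if (example_df[i, "type"] == "S") {
    next  # only N rows trigger a search
  }
  # S rows of the same patient discharged more than 5 days before this N admission
  add_on <- which(
    example_df$type == "S" &
      example_df$ID == example_df[i, "ID"] &
      example_df$discharge_date < (example_df[i, "admission_date"] - 5)
  )
  del_vec <- append(del_vec, add_on)
}
# Negative indexing with an empty vector would drop every row, so guard against it
example_df_new <- if (length(del_vec) > 0) example_df[-del_vec, ] else example_df
rownames(example_df_new) <- seq_len(nrow(example_df_new))
example_df_new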
What I did here is start by creating a vector which will contain the row numbers that we delete. To get rid of the inappropriate S rows we need to actually work on the N rows, so I have the loop skip the S rows. Then when the loop encounters an N row, we find the rows which meet the following conditions:
have type S
have the same ID as the N row in question
have a discharge date which is more than 5 days from the admission date for the N row in question
Using which() captures the row numbers that meet these criteria. Now I add these row numbers to the vector and remove those rows from the original df. I also renumber the rows of the new df to get the following output for example_df_new:
  ID admission_date discharge_date type
1 22          16140          16145    S
2 22          16145          16157    N
3 52          16241          16251    N
So we've preserved the 2 rows you wanted to keep, but now we have this bottom row that we want to get rid of. I do this in the second loop which iterates over the rows in the new reduced df:
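del_vec2 <- integer(0)  # row numbers of the N rows without an S companion
for (i in seq_len(nrow(example_df_new))) {
  if (example_df_new[i, "type"] == "S") {
    next
  }
  # S rows that survived the first pass and belong to the same patient
  add_on_two <- which(example_df_new$type == "S" & example_df_new$ID == example_df_new[i, "ID"])
  if (length(add_on_two) == 0) {
    del_vec2 <- append(del_vec2, i)
  }
}
# Same guard as before: an empty negative index would drop every row
example_df_3 <- if (length(del_vec2) > 0) example_df_new[-del_vec2, ] else example_df_new
example_df_3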
Again, we tell the loop to skip the S rows — whichever ones made the first cut should stay in. Now when the loop encounters an N row we ask the loop to look for rows that meet the following criteria:
is type S
has the same ID as the N row in question
Again I use which() to save the positions of these rows. If these criteria are met then we skip ahead - we want to keep all the N's that have an appropriate S companion. If not then we add the row number of (i) - that is the row number for the N in question to our vector of rows that we want to delete.
We then delete those rows and end up with the desired output:
  ID admission_date discharge_date type
1 22          16140          16145    S
2 22          16145          16157    N
At this point you can change the date columns back to a date format.
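A minimal sketch of that conversion, assuming the numeric values are days since 1970-01-01 (which is what as.numeric() returns for a Date). The data frame is rebuilt here from the output shown above so the snippet runs on its own.

```r
# Rebuilt from the printed result above so this snippet is self-contained
example_df_3 <- data.frame(
  ID = c(22, 22),
  admission_date = c(16140, 16145),
  discharge_date = c(16145, 16157),
  type = c("S", "N")
)

# Numeric dates count days since 1970-01-01, so pass that as the origin
example_df_3$admission_date <- as.Date(example_df_3$admission_date, origin = "1970-01-01")
example_df_3$discharge_date <- as.Date(example_df_3$discharge_date, origin = "1970-01-01")
example_df_3
```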
Again, while this may be the first, I expect it's not the best solution. I hope to see an improved solution, but the problem is more tricky than it appears at first.
After attempting to filter within the same data frame, I decided to separate the data into two tables: one containing only data of type "S" and the other containing only data of type "N." Then, I joined the two tables on the ID column, which pairs every "N" row with every "S" row of the same patient. While this creates a greater number of rows than before, I was then able to compare the two dates of interest. The resulting data frame contains only one row: the entry of a patient with an admission date with type "N" within 5 days of a discharge date with type "S."
The code in R is as follows:
library(dplyr)
example_df <- data.frame(ID = c(22,22,22,52,52,52),
                         admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
                         discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
                         type = c('S','S','N','S','S','N'))
N_only <- example_df %>%
  filter(type == "N")
S_only <- example_df %>%
  filter(type == "S")
# After the merge, .x columns come from N_only and .y columns from S_only
example_df_merged <- merge(N_only, S_only, by = "ID")
example_df_merged$admission_date.x <- as.Date(as.character(example_df_merged$admission_date.x), format="%Y-%m-%d")
example_df_merged$discharge_date.y <- as.Date(as.character(example_df_merged$discharge_date.y), format="%Y-%m-%d")
# Days from the "S" discharge to the "N" admission; non-negative when the discharge comes first
example_df_merged$dateDiff <- as.numeric(example_df_merged$admission_date.x - example_df_merged$discharge_date.y)
example_df_final <- example_df_merged %>%
  filter(dateDiff >= 0 & dateDiff <= 5)
For clearer variable names, I would have changed the variables ending in ".x" and ".y," but that is not necessary.
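If the goal is to get back the original rows (both the "S" and the "N" entry) rather than one merged row, something along these lines might work in base R without loops. This is a sketch under a few assumptions: the helper column row_id and the suffixes .s and .n are names introduced here for illustration, and "within 5 days" is read as a gap of 0 to 5 days from discharge to admission.

```r
example_df <- data.frame(
  ID = c(22, 22, 22, 52, 52, 52),
  admission_date = as.Date(c("2013-10-03", "2014-03-11", "2014-03-16",
                             "2012-02-08", "2014-06-10", "2014-06-20")),
  discharge_date = as.Date(c("2013-10-11", "2014-03-16", "2014-03-28",
                             "2012-02-13", "2014-06-12", "2014-06-30")),
  type = c("S", "S", "N", "S", "S", "N")
)

# Remember the original row numbers so they can be recovered after the merge
example_df$row_id <- seq_len(nrow(example_df))

s_rows <- example_df[example_df$type == "S", ]
n_rows <- example_df[example_df$type == "N", ]

# Every S row paired with every N row of the same patient
pairs <- merge(s_rows, n_rows, by = "ID", suffixes = c(".s", ".n"))

# Days from the S discharge to the N admission
gap <- as.numeric(pairs$admission_date.n - pairs$discharge_date.s)
ok <- pairs[gap >= 0 & gap <= 5, ]

# Original rows involved in at least one qualifying pair
keep <- sort(unique(c(ok$row_id.s, ok$row_id.n)))
result <- example_df[keep, names(example_df) != "row_id"]
result
```

On the sample data this keeps rows 2 and 3, the pair the question asks for.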

Expand Row with Multiple Observations into Individual Rows

Just wondering if there is a way to expand rows which have multiple observations, into rows of unique observations using R? I have data in an excel spreadsheet with the variable headings: Lease, Line, Bay, Date, Predators, Food.Index, DD, MM, YY.
On some dates, there have been multiple predators (from 1 to 4) recorded in the same row. Other days just have 0. On a day where 4 predators have been recorded, I would like to somehow transform the data to show four unique observations (instead of one row with 4 recorded under "Predators").
I have 1669 rows of data and multiple rows need to be expanded
Example of Data set
Many thanks for your help in advance.
Assuming you have your data in a data.frame, df, one possible solution would be
df.expanded <- df[rep(row.names(df), df$Predators), ]
EDIT: If you also want to keep the rows with 0 predators, you can use pmax to always return at least one:
df.expanded <- df[rep(row.names(df), pmax(df$Predators, 1)),]
Here the pmax(df$Predators, 1) will return the elementwise maximum of df$Predators and 1 so that it returns a new vector where each element is at least 1 but takes the value of df$Predators if that number is greater than 1.
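A tiny made-up example may help show the difference between the two variants. Only the Predators column name comes from the question's description; the Bay values and counts are invented for illustration.

```r
# Toy data: three rows with 0, 1 and 4 predators recorded
df <- data.frame(
  Bay = c("A", "B", "C"),
  Predators = c(0, 1, 4)
)

# Variant 1: rows with 0 predators disappear (0 + 1 + 4 = 5 rows)
df.expanded <- df[rep(row.names(df), df$Predators), ]

# Variant 2: rows with 0 predators are kept once (1 + 1 + 4 = 6 rows)
df.expanded.keep0 <- df[rep(row.names(df), pmax(df$Predators, 1)), ]

nrow(df.expanded)
nrow(df.expanded.keep0)
```

Note that the expanded rows still carry the original count in Predators; if each row should stand for a single sighting, that column may need to be overwritten afterwards.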

In R, select rows that have one column that exists in another list

I'm new to R; have a simple stumbling block for which I've been searching for an answer for too long.
The data frame includes a list of individuals with their performance over a five year period. The analysis needs to include only those individuals that participated in the most recent year, so I need to identify those individuals and then select all records from the original data frame for those individuals with all columns (there are 50 or more other columns).
Original data frame is performance_fiveyr; variables I'm working with are person_id and year. I have tried any number of possible ways to get what I need; I'm listing one of those ways here...
First step is to create the list of individuals that participated this past year
person_current <- subset (x = performance_fiveyr,
subset = year==2015, # keep only records from 2015
select = person_id # keep only the person_id variable
)
Next step then is to select from performance_fiveyr all rows that have a person_id that exists in person_current and return all other columns (more than 50 columns total).
performance_current <- performance_fiveyr[performance_fiveyr$person_id
%in% person_current, ]
I've tried more than a few variations of this and end up with either all columns and no rows or all rows and no variables.
Here is some example data:
set.seed(0)
p5 <- data.frame(id = sample(5, 20, replace=TRUE), year = sample(2010:2015, 20, replace=TRUE))
p5 <- p5[order(p5$id, p5$year), ]
I think you were on the right track. I think the below does what you are after:
current <- unique(p5[p5$year==2015, 'id'])
p_current <- p5[p5$id %in% current, ]
p_current
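One likely reason the original attempt returned no rows is that subset() with select = person_id gives back a one-column data frame rather than a plain vector, so %in% is not comparing against the individual IDs. A small sketch using the question's own names follows; the toy values and the score column are made up for illustration.

```r
# Toy stand-in for the real data; score represents the 50+ other columns
performance_fiveyr <- data.frame(
  person_id = c(1, 1, 2, 2, 3),
  year = c(2014, 2015, 2013, 2014, 2015),
  score = c(10, 12, 7, 8, 15)
)

person_current <- subset(performance_fiveyr,
                         subset = year == 2015,
                         select = person_id)
is.data.frame(person_current)  # a data frame, not a vector of IDs

# Pull the column out as a vector before using %in%
ids <- unique(person_current$person_id)
performance_current <- performance_fiveyr[performance_fiveyr$person_id %in% ids, ]
performance_current
```

Persons 1 and 3 took part in 2015, so all of their rows are returned with every column, and person 2 is dropped.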

How to group data to minimize the variance while preserving the order of the data in R

I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4/5 groups based on price that would minimize the variance in each group while preserving the order Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function however it does not take the order of Size into consideration.
Basically, I have data similar to the following (with more data)
Price=c(90,100,125,100,130,182,125,250,300,95)
Size=c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata=data.frame(Size,Price)
I would like to group the data to minimize the variance of price in each group while respecting 1) the Size value: for example, the first two prices 90 and 100 cannot be in different groups since they are the same size, and 2) the order of the Size: for example, if Group One includes observations (Obs) 1-2 and Group Two includes observations 3-9, observation 10 can only enter into group two or three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% operator can be read as "then do", so you can chain multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <- c(90,100,125,100,130,182,125,250,300,95)
Size <- c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata <- data.frame(Size,Price) %>% # "then"
group_by(Size) # group data by Size column
mydata_mean_sd <- mydata %>% # "then"
summarise(mean = mean(Price), sd = sd(Price)) # calculate grouped
#mean and sd for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
Not an elegant solution, but I wrote my own function that first splits a sorted series at specified break points, then calculates the sum of squared deviations from the class means (SDCM) for those break points (the quantity that the Jenks approach minimizes, as described on Wikipedia).
Then just iterated through all valid combinations of break points, and selected the set of points that produced the minimum sum(SDCM).
Would quickly become unmanageable as number of possible breakpoints combinations increases, but it worked for my data set.
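A rough base R sketch of this brute-force idea, applied to the sample data from the question. This is not the function described above, only an illustration of the same approach. The choice of k = 4 groups, the helper names sdcm and total_sdcm, and the rule that a cut is only allowed where Size changes (so that equal sizes stay together) are all assumptions made here.

```r
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)
k <- 4  # assumed number of groups

# A cut after row i is allowed only where Size changes between row i and i + 1
cuts <- which(diff(Size) != 0)

# Sum of squared deviations from the mean of one group
sdcm <- function(x) sum((x - mean(x))^2)

# Total SDCM over all groups for one set of cut positions
total_sdcm <- function(cut_set) {
  grp <- findInterval(seq_along(Price), cut_set + 1)
  sum(tapply(Price, grp, sdcm))
}

# Every way of choosing k - 1 cut positions, one combination per column.
# Caution: combn() treats a single number as 1:n, so this assumes length(cuts) > 1.
combos <- combn(cuts, k - 1)
scores <- apply(combos, 2, total_sdcm)

best <- combos[, which.min(scores)]
groups <- findInterval(seq_along(Price), best + 1) + 1
data.frame(Size, Price, group = groups)
```

With only four candidate cut positions there are just four combinations to score here, but the number of combinations grows quickly with more distinct sizes, which is the scaling problem mentioned above.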

Resources