I am trying to add data from a reference table onto my primary data frame. I see similar questions have been asked about this, however I can't find anything for my specific case.
An example of how my data frames are set up:
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
print(df)
participant time
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
12 3 4
13 1 5
14 2 5
15 3 5
16 1 6
17 2 6
18 3 6
19 1 7
20 2 7
21 3 7
22 1 8
23 2 8
24 3 8
25 1 9
26 2 9
27 3 9
print(lookup)
start.time end.time var1 var2 var3
1 1 3 A 8 fast
2 5 6 B 12 fast
3 8 10 A 3 slow
What I want to do is merge or join these two data frames in a way that also covers the times falling between the start and end times of the lookup data frame. That is, the columns var1, var2 and var3 should be added onto df wherever the time lies between the start time and end time.
For example, the first row of the lookup has a start time of 1 and an end time of 3, so for times 1, 2 and 3 the first row's data should be added for each participant.
The output should look something like this:
print(output)
participant time var1 var2 var3
1 1 1 A 8 fast
2 2 1 A 8 fast
3 3 1 A 8 fast
4 1 2 A 8 fast
5 2 2 A 8 fast
6 3 2 A 8 fast
7 1 3 A 8 fast
8 2 3 A 8 fast
9 3 3 A 8 fast
10 1 4 <NA> NA <NA>
11 2 4 <NA> NA <NA>
12 3 4 <NA> NA <NA>
13 1 5 B 12 fast
14 2 5 B 12 fast
15 3 5 B 12 fast
16 1 6 B 12 fast
17 2 6 B 12 fast
18 3 6 B 12 fast
19 1 7 <NA> NA <NA>
20 2 7 <NA> NA <NA>
21 3 7 <NA> NA <NA>
22 1 8 A 3 slow
23 2 8 A 3 slow
24 3 8 A 3 slow
25 1 9 A 3 slow
26 2 9 A 3 slow
27 3 9 A 3 slow
I realise the column names don't match between the two data frames, as they normally would need to for merging data sets.
One option would be to use the sqldf package, and phrase your problem as a SQL left join:
library(sqldf)
sql <- "SELECT t1.participant, t1.time, t2.var1, t2.var2, t2.var3
FROM df t1
LEFT JOIN lookup t2
ON t1.time BETWEEN t2.\"start.time\" AND t2.\"end.time\""
output <- sqldf(sql)
A dplyr solution:
library(dplyr)
output <- df %>%
# Create an id for the join
mutate(merge_id=1) %>%
# Use full join to create all the combinations between the two datasets
full_join(lookup %>% mutate(merge_id=1), by="merge_id") %>%
# Keep only the rows that we want
filter(time >= start.time, time <= end.time) %>%
# Select the relevant variables
select(participant,time,var1:var3) %>%
# Right join with initial dataset to get the missing rows
right_join(df, by = c("participant","time")) %>%
# Sort to match the formatting asked by OP
arrange(time, participant)
This produces the output the OP asked for, but it will only work for data of reasonable size, as the intermediate full join produces a data frame whose row count is the product of the row counts of the two initial data sets.
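If the size of that full join is a concern, a data.table non-equi join is one alternative that avoids materialising all the combinations; a minimal sketch, assuming the df and lookup from the question, which updates the table by reference:

library(data.table)
dt <- as.data.table(df)
lk <- as.data.table(lookup)
# Non-equi update join: for each interval in lk, find the dt rows whose
# time falls inside [start.time, end.time] and copy var1:var3 across.
# Rows matching no interval are left as NA.
dt[lk, on = .(time >= start.time, time <= end.time),
   `:=`(var1 = i.var1, var2 = i.var2, var3 = i.var3)]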
Using tidyverse and creating an auxiliary table:
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
library(tidyverse)
lookup_extended <- lookup %>%
mutate(time = map2(start.time, end.time, ~ c(.x:.y))) %>%
unnest(time) %>%
select(-start.time, -end.time)
df2 <- df %>%
left_join(lookup_extended, by = "time")
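One caveat: c(.x:.y) assumes the times fall on whole numbers. If the time grid were ever finer than that, a variant using seq() generalises the expansion (the step size by = 1 below is an assumption chosen to match the example data):

lookup_extended <- lookup %>%
  mutate(time = map2(start.time, end.time, ~ seq(.x, .y, by = 1))) %>%
  unnest(time) %>%
  select(-start.time, -end.time)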
I'm in the process of cleaning some data for a survival analysis, and I am trying to impute missing values based on the surrounding values within a given subject. I'd like to use the mean of the closest previous and closest subsequent values for the participant. If there is no subsequent value present, then I'd like the previous value carried forward until a subsequent value appears.
I've been trying to break the problem apart into smaller, more manageable operations and objects; however, the solutions I keep arriving at force me to use conditional logic based on the rows immediately above and below a missing value, and, quite frankly, I'm at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms for looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(2,2,4,3,NA,0,0,1,4,0,NA,0,0,0,4,2,1,3,3,2,NA,3,4,3,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
The goal here is to get the NA values in ss to look like this:
ID #1: 2, 2, 4, 3, 1.5, 0, 0
ID #2: 1, 4, 0, 0, 0, 0, 0
ID #3: 4, 2, 1, 3, 3, 2, NA (no change, because the row with the NA will be deleted eventually)
ID #4: 3, 4, 3, 3, 1.5, 0, 0 (this one requires multiple changes and I expect it is the most challenging to tackle)
If processing speed is not the issue (I guess "ID #4" makes it hard to vectorize imputations), then maybe try:
f <- function(x) {
  idx <- which(is.na(x))
  # NAs are processed left to right, so a value imputed earlier can feed
  # a later NA in the same run (the "carried forward" behaviour)
  for (id in idx) {
    # the values immediately before and after the missing one
    sel <- x[id + c(-1, 1)]
    # drop NAs from that pair, except at the last position, where a
    # missing "next" value should leave the result NA
    if (id < length(x))
      sel <- sel[!is.na(sel)]
    x[id] <- mean(sel)
  }
  return(x)
}
# apply f() to ss separately within each id group
cbind(mydat, ss_imp = ave(mydat$ss, mydat$id, FUN = f))
# id time ss ss_imp
# 11 1 0 2 2.0
# 12 1 1 2 2.0
# 13 1 2 4 4.0
# 14 1 3 3 3.0
# 15 1 4 NA 1.5
# 16 1 5 0 0.0
# 17 1 6 0 0.0
# 21 2 0 1 1.0
# 22 2 1 4 4.0
# 23 2 2 0 0.0
# 24 2 3 NA 0.0
# 25 2 4 0 0.0
# 26 2 5 0 0.0
# 27 2 6 0 0.0
# 31 3 0 4 4.0
# 32 3 1 2 2.0
# 33 3 2 1 1.0
# 34 3 3 3 3.0
# 35 3 4 3 3.0
# 36 3 5 2 2.0
# 37 3 6 NA NA
# 41 4 0 3 3.0
# 42 4 1 4 4.0
# 43 4 2 3 3.0
# 44 4 3 NA 3.0
# 45 4 4 NA 1.5
# 46 4 5 0 0.0
# 47 4 6 0 0.0
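The same per-group application also works with dplyr if you prefer pipes; a small sketch reusing the f() defined above:

library(dplyr)
mydat %>%
  group_by(id) %>%
  mutate(ss_imp = f(ss)) %>%
  ungroup()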
I have a data frame with columns for "Count","Transect Number","Data", and "Year". My goal is to split up the data frame by Transect, then again by Year, and create a new data frame with a column for "Transect", and then the appropriate data per Year in the following columns.
To build a dummy data frame:
Count1<-1:27
Count2<-1:30
Count3<-1:25
T1<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
T2<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,2,3,3,3,3)
T3<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3)
Data1<-c(1,2,3,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2,3,4,5,4,3,2,3,3,2)
Data2<-c(1,2,3,2,1,4,3,2,1,2,4,3,2,3,4,3,2,3,4,5,6,4,3,2,1,4,5,4,3,2)
Data3<-c(1,2,3,4,5,4,3,3,3,4,5,4,3,3,2,3,4,5,4,3,4,3,2,3,4)
Year1<-rep(2014:2016, each = 9)
Year2<-rep(2014:2016, times = c(9, 9, 12))
Year3<-rep(2014:2016, times = c(9, 9, 7))
DF1<-data.frame(Count1,T1,Data1,Year1)
colnames(DF1)<-c("Count","Transect","Data","Year")
DF2<-data.frame(Count2,T2,Data2,Year2)
colnames(DF2)<-c("Count","Transect","Data","Year")
DF3<-data.frame(Count3,T3,Data3,Year3)
colnames(DF3)<-c("Count","Transect","Data","Year")
All<-rbind(DF1,DF2,DF3)
Once I have the data frame, my thought was to split up the data by transect since this will be a permanent aspect of my ongoing data set.
#Step 1-Break down by T
Trans1<-All[All$Transect==1,]
Trans2<-All[All$Transect==2,]
Trans3<-All[All$Transect==3,]
Trans4<-All[All$Transect==4,]
Trans5<-All[All$Transect==5,]
But I'm a little less clear on the next step. I need to pull out data from the "Data" column organized by year. Something along the lines of further breaking down the data like so:
Trans1_Year1<-Trans1[Trans1$Year==2014,]
Trans2_Year1<-Trans2[Trans2$Year==2014,]
Trans3_Year1<-Trans3[Trans3$Year==2014,]
Trans4_Year1<-Trans4[Trans4$Year==2014,]
Trans5_Year1<-Trans5[Trans5$Year==2014,]
or even using split
ByYear1<-split(Trans1,Trans1$Year)
But I would prefer to avoid writing the code out by hand as above, since I hope to add new data every year as this data set progresses; the code should accommodate new "Year" values as they are added, rather than needing new lines of code every year.
Once I have the data set up like so, I'd like to create a second data frame with a column for each year. One problem is that each year contains a differing number of rows, which has been an issue for me. My final result would have the columns:
"Transect", "Data 2014", "Data 2015", "Data 2016"
Since each year can have a different number of rows within a transect, I'd like NA's left at the end of each Transect section when the number of rows per individual transect differs between years.
It sounds like you are basically trying to convert your data into a semi-wide format, with columns for years, rather than keeping it in the "long" format.
If this is the case, you're better off adding a secondary index column that shows the repeated combination of "Transect" and "Year".
This can easily be done with getanID from my "splitstackshape" package. "splitstackshape" also loads "data.table", from which you could then use dcast.data.table to get a wide format.
library(splitstackshape)
dcast.data.table(getanID(All, c("Transect", "Year")),
Transect + .id ~ Year, value.var = "Data")
# Transect .id 2014 2015 2016
# 1: 1 1 1 2 3
# 2: 1 2 2 1 4
# 3: 1 3 3 2 5
# 4: 1 4 1 2 4
# 5: 1 5 2 4 5
# 6: 1 6 3 3 6
# 7: 1 7 1 4 4
# 8: 1 8 2 5 4
# 9: 1 9 3 4 3
# 10: 1 10 NA NA 4
# 11: 2 1 2 3 4
# 12: 2 2 1 4 3
# 13: 2 3 2 3 2
# 14: 2 4 2 2 3
# 15: 2 5 1 3 2
# 16: 2 6 4 4 1
# 17: 2 7 4 3 4
# 18: 2 8 5 3 3
# 19: 2 9 4 2 2
# 20: 2 10 NA NA 3
# 21: 3 1 3 2 3
# 22: 3 2 4 1 3
# 23: 3 3 3 2 2
# 24: 3 4 3 3 5
# 25: 3 5 2 2 4
# 26: 3 6 1 3 3
# 27: 3 7 3 3 2
# 28: 3 8 3 4 4
# 29: 3 9 3 5 NA
# Transect .id 2014 2015 2016
Then, if you really want to split on the "Transect" column you can go ahead and use split, but since you now have a "data.table" it would be better to stick with that and take advantage of its many convenient features, including those related to subsetting and aggregation.
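For completeness, a minimal sketch of that final split, assuming the dcast result has been stored in an object named wide:

wide <- dcast.data.table(getanID(All, c("Transect", "Year")),
                         Transect + .id ~ Year, value.var = "Data")
# one table per Transect, in a named list
by_transect <- split(wide, wide$Transect)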
I think you are forcing your data into a format it does not have naturally. There are a lot of processing advantages to leaving it in "long" format. Have a look at this article if you have not seen it yet; it is a classic.
http://www.jstatsoft.org/v21/i12
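For instance, per-Transect, per-Year summaries fall straight out of the long format with no reshaping; a small illustration using base R's aggregate():

# mean of Data for every Transect/Year combination
aggregate(Data ~ Transect + Year, data = All, FUN = mean)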
I have a data frame in R containing the columns ID.A, ID.B and DISTANCE, where DISTANCE represents the distance between ID.A and ID.B. For each value (1->n) of ID.A there may be multiple values of ID.B and DISTANCE, i.e. the same ID.A value may appear on several rows, each with a different ID.B and DISTANCE.
I would like to remove the rows where ID.A is duplicated, conditional upon the distance value, so that I am left with the smallest DISTANCE for each ID.A record.
Hopefully that makes sense?
Many thanks in advance
EDIT
Hopefully an example will prove more useful than my text. Here I would like to remove the second and third rows where ID.A = 3:
myDF <- read.table(text="ID.A ID.B DISTANCE
1 3 1
2 6 8
3 2 0.4
3 3 1
3 8 5
4 8 7
5 2 11", header = TRUE)
You can also do it easily in base R. If dat is your data frame:
do.call(rbind,
by(dat, INDICES=list(dat$ID.A),
FUN=function(x) head(x[order(x$DISTANCE), ], 1)))
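A slightly more direct variant of the same idea picks each group's minimum row with which.min() instead of sorting; a sketch using the myDF from the question (note that which.min() returns a single row even when distances tie):

do.call(rbind,
        by(myDF, myDF$ID.A,
           function(x) x[which.min(x$DISTANCE), ]))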
One possibility:
myDF <- myDF[order(myDF$ID.A, myDF$DISTANCE), ]
newdata <- myDF[which(!duplicated(myDF$ID.A)),]
Which gives :
ID.A ID.B DISTANCE
1 1 3 1.0
2 2 6 8.0
5 3 2 0.4
6 4 8 7.0
7 5 2 11.0
You can use the plyr package for that. For example, if your data are like this:
d <- data.frame(id.a=c(1,1,1,2,2,3,3,3,3),
id.b=c(1,2,3,1,2,1,2,3,4),
dist=c(12,10,15,20,18,16,17,25,9))
id.a id.b dist
1 1 1 12
2 1 2 10
3 1 3 15
4 2 1 20
5 2 2 18
6 3 1 16
7 3 2 17
8 3 3 25
9 3 4 9
You can use the ddply function like this :
library(plyr)
ddply(d, "id.a", function(df) return(df[df$dist==min(df$dist),]))
Which gives :
id.a id.b dist
1 1 2 10
2 2 2 18
3 3 4 9
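plyr has since been retired in favour of dplyr; a modern equivalent, as a sketch (slice_min() needs dplyr >= 1.0.0, and, like the ddply() filter above, it keeps ties by default):

library(dplyr)
d %>%
  group_by(id.a) %>%
  slice_min(dist, n = 1) %>%
  ungroup()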