Match function with data frames that are differently constructed - r

I am relatively new to R and I have hit a wall trying to figure out how to do what I want to do. I went through many questions on Stack Overflow but still could not figure it out exactly. Here is what I am trying to do:
data frame 1:
d1 =c("2005/01/02")
d2 = c("2005/01/08")
rm = c(13)
df1 = data.frame(d1, d2, rm)
data frame 2:
df2 <- as.data.frame(seq(as.Date("2005-01-02"), as.Date("2005-01-08"), by="days"))
colnames(df2)<-c("dtime")
What I hope to create:
df2$new <- if (df2$dtime >= df1$d1 AND df2$dtime <= df1$d2) return df1$rm   # pseudocode, not valid R
The hope is to end up with a variable df2$new that looks like this:
df2$new <- 13
View(df2)
I am essentially trying to match the value that corresponds to the week (df1$rm) to the individual days (df2$new) within that week.

I think what you might be looking for is sapply
df2$new <- sapply(df2$dtime, function(row){df1[((row >= as.Date(df1$d1)) + (row <= as.Date(df1$d2))) == 2,]$rm})
In R, vectorised operations are generally preferred to for loops. What I'm doing is taking df2's dtime column and applying function(row) to each element in turn. This gets me one lookup into df1 per date. Will there always be a matching entry in df1, or do we need a default case?
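If a default case is needed, a minimal sketch could look like the following. It assumes the date columns of df1 are already stored as Date, adds a hypothetical second week to df1, and uses a hypothetical helper name lookup_rm; days that fall in no range get NA.

```r
# Example data: two weekly ranges (the second one is made up for illustration)
df1 <- data.frame(d1 = as.Date(c("2005-01-02", "2005-01-09")),
                  d2 = as.Date(c("2005-01-08", "2005-01-15")),
                  rm = c(13, 7))
df2 <- data.frame(dtime = seq(as.Date("2005-01-01"), as.Date("2005-01-10"), by = "days"))

# Hypothetical helper: return rm of the first range containing `day`, or NA if none does
lookup_rm <- function(day) {
  hit <- which(day >= df1$d1 & day <= df1$d2)
  if (length(hit) == 0) NA_real_ else df1$rm[hit[1]]
}

# vapply guarantees exactly one numeric value per date
df2$new <- vapply(df2$dtime, lookup_rm, numeric(1))
```

Using vapply instead of sapply means the result is always a plain numeric vector, even when some dates have no match.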

If the dataframe is not too big, this would be really easy using a simple for loop:
d1 =c("2005/01/02")
d2 = c("2005/01/08")
rm = c(13)
df1 = data.frame(d1, d2, rm)
df2 <- as.data.frame(seq(as.Date("2005-01-02"), as.Date("2005-01-08"), by="days"))
colnames(df2)<-c("dtime")
df2$new <- NA
for(i in 1:nrow(df1)) df2$new[df2$dtime >= as.Date(df1$d1[i]) & df2$dtime <= as.Date(df1$d2[i])] <- df1$rm[i]

Related

Create list of dataframes subset by date range

I am trying to create a series of dataframes which are subset from a larger dataframe by a date range (2-year blocks), in order to do a separate survival analysis for each new dataframe. I cannot use "split" to split the dataframe based on one factor, as the data will need to be present in more than one subset.
I have some example data as follows:
Patient <- c(1:10)
First.Appt <- c("2014-01-01","2014-03-02","2015-05-17","2015-06-03","2016-01-12","2016-11-07","2017-07-08","2017-09-09","2018-04-12","2018-05-13")
DOD <- c("2014-01-29","2014-03-30","2015-06-14","2015-07-01","2016-02-09","2016-12-05","2017-08-05","2017-10-07","2018-05-10","2018-06-10")
First.Appt.Year <- c(2014,2014,2015,2015,2016,2016,2017,2017,2018,2018)
library(dplyr)  # provides %>% and mutate_at()
df <- as.data.frame(cbind(Patient, First.Appt, DOD, First.Appt.Year)) %>%
  mutate_at("First.Appt.Year", as.numeric)
I have created a start date (the minimum First.Appt.Year), the final start date (maximum First.Appt.Year - 1), and then a vector containing all my start dates from which to subset full 2-year blocks as follows:
Start.year <- as.numeric(min(df$First.Appt.Year))
Final.start.year <- max(df$First.Appt.Year) - 1
Start.vec <- c(Start.year:Final.start.year)
I thought to use a for loop with lapply to create a subset based on First.Appt.Year falling within the range of Start.vec and Start.vec + 1, for each value of Start.vec as follows:
for (i in 1:length(Start.vec)){
new.df = lapply(Start.vec, function(x)
subset(df, First.Appt.Year == Start.vec[i] | First.Appt.Year == Start.vec[i] + 1))
}
This almost works, but instead of creating four different dataframes (e.g. 2014-2015, 2015-2016, 2016-2017 and 2017-2018), all four of the dataframes in the output list only contain 2017-2018 values as below.
Patient  First.Appt  DOD         First.Appt.Year
7        08/07/2017  05/08/2017  2017
8        09/09/2017  07/10/2017  2017
9        12/04/2018  10/05/2018  2018
10       13/05/2018  10/06/2018  2018
Can anyone help me with what I am doing wrong and how to return the different subsets into each list object?
If there are other ways of doing this that seem more logical then please let me know.
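It looks like a simple misunderstanding about the use of lapply. The anonymous function never uses its argument x and reads Start.vec[i] instead, and each pass of the for loop overwrites new.df, so only the result for the last i survives. You don't need to wrap lapply in a for loop. Just replace your last block with: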
new.df = lapply(Start.vec, function(x) subset(df, First.Appt.Year == x | First.Appt.Year == x + 1))
And that should work. At least, it does on my side.
You are close! Instead of using both the for loop and the lapply, you need only one.
For example, with the lapply:
new.df <- lapply(Start.vec, function(x) subset(df, First.Appt.Year == x | First.Appt.Year == x + 1))
And using only the for loop:
df_list <- list()
for (i in 1:length(Start.vec)){
new.df <- subset(df, First.Appt.Year == Start.vec[i] | First.Appt.Year == Start.vec[i] + 1)
df_list <- c(df_list, list(new.df))
}
df_list
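As a small extension of the lapply version, here is a sketch that also names each list element after its 2-year block, which can make the later per-block survival analyses easier to tell apart. The simplified df and the name blocks are assumptions for illustration only.

```r
# Simplified stand-in for the example data: two patients per year
df <- data.frame(Patient = 1:10,
                 First.Appt.Year = rep(2014:2018, each = 2))
Start.vec <- min(df$First.Appt.Year):(max(df$First.Appt.Year) - 1)

# One subset per start year; %in% is equivalent to the two == tests joined by |
blocks <- lapply(Start.vec, function(x) df[df$First.Appt.Year %in% c(x, x + 1), ])

# Label each element with its block, e.g. "2014-2015"
names(blocks) <- paste(Start.vec, Start.vec + 1, sep = "-")
```

A single block can then be pulled out by name, for example blocks[["2016-2017"]].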

Merging in R with data.table with two columns

I am fairly new to using data.table but I am using it since I have heard it is faster than data.frame and plan to loop.
I am trying to merge raster data (which comes with longitude "x", latitude "y", and temperature information) onto a master dataset, which is just all possible combinations of "x" and "y" for this particular country I am looking at.
For some reason, data.frame works (the temperature information is merged in. Some missing information but that's to be expected) while data.table does not (the temperature variable is "added" but all information is missing). I think it has to do something with the fact that I am merging with two columns, or maybe the data isn't sorted the right way, but I'm not completely sure.
Below is my code
# Set common parameters
x <- rep(seq(-49.975,49.975, by = 0.05), times = 2000)
y <- rep(seq(-49.975,49.975, by = 0.05), each = 2000)
xy <- cbind(x,y)
## What works
# Create data frame, then subset to possible coordinates of country
df_xy <- data.frame(xy)
eth_df_xy <- subset(df_xy, df_xy$x >= 30 & df_xy$x <= 50 & df_xy$y >=0 & df_xy$y <= 20)
# Bring in raster dataset
examine <- print(paste0(dir_tif, files[[1]]))
sds <- raster(examine)
x <- rasterToPoints(sds)
df_x <- data.frame(x)
# Merge
eth_df_xy <- merge(df_x, eth_df_xy, by = c("x","y"), all.x = F, all.y=T)
## What doesn't work but seems intuitive
# Create data table, then subset to possible coordinates of country (as above)
dt_xy <- data.table(xy)
eth_dt_xy <- subset(dt_xy, dt_xy$x >= 30 & dt_xy$x <= 50 & dt_xy$y >=0 & dt_xy$y <= 20)
# Bring in raster dataset (from above, skip to fourth step)
dt_x <- data.table(x)
# Merge
eth_dt_xy <- merge(dt_x, eth_dt_xy, by = c("x","y"), all.x = F, all.y=T)
Thanks

Delete rows based on result of value in concerned row and of other column value in previous row

Although there are many questions for deleting rows I couldn't find a solution for my problem.
Here is a data.frame as an example:
df <- data.frame(A = c(1,2,3,4,5,6),
D1 = as.Date(as.character(c("1863-12-01","1945-06-06","1955-03-01","1962-08-01","1980-08-01","1998-12-01")), format = "%Y-%m-%d"),
D2 = as.Date(as.character(c("1923-02-28","1953-05-28","1962-07-31","1978-06-30","1998-11-30","2015-12-31")), format = "%Y-%m-%d"))
The result should exclude the rows where there is more than one day between the D1 date of the row and the D2 date of the previous row, like this:
A D1 D2
5 1980-08-01 1998-11-30
6 1998-12-01 2015-12-31
I tried it by a loop, but it doesn't work in the required way - I have to repeat the loop again and again for the final result:
for (i in 1:length(df))
{
if ((df$D1[i + 1] - df$D2[i]) > 1)
df <- df[-c(i), ]
}
Where is the bug and is there a better way than a loop? Thank You!
Using dplyr you can do:
library(dplyr)
filter(df, D1 - lag(D2) < 2)
EDIT
In case you also want to keep the row whose D2 is the lagged value that fulfills the condition, use the following:
filter(df, lead(D1) - D2 < 2 | D1 - lag(D2) < 2)
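As for the bug in the loop: length(df) is the number of columns (3), not the number of rows, and deleting rows while iterating shifts the remaining indices. A base-R sketch of the first dplyr filter, without any loop, might look like this (the names gap, keep and result are just illustrative):

```r
df <- data.frame(
  A  = 1:6,
  D1 = as.Date(c("1863-12-01", "1945-06-06", "1955-03-01",
                 "1962-08-01", "1980-08-01", "1998-12-01")),
  D2 = as.Date(c("1923-02-28", "1953-05-28", "1962-07-31",
                 "1978-06-30", "1998-11-30", "2015-12-31"))
)

# Gap in days between each row's D1 and the previous row's D2 (NA for the first row)
gap <- c(NA, as.numeric(df$D1[-1] - df$D2[-nrow(df)]))

# Keep rows that start at most one day after the previous row ended
keep <- !is.na(gap) & gap < 2
result <- df[keep, ]
```

On the example data this keeps the rows with A equal to 4 and 6, the same rows as filter(df, D1 - lag(D2) < 2).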

Creating a dataframe from an lapply function with different numbers of rows

I have a list of dates (df2) and a separate data frame with weekly dates and a measurement on each of those days (df1). What I need is a data frame containing the weekly dates that fall within the year prior to each sample date in df2, together with their measurements.
eg1 <- data.frame(Date=seq(as.Date("2008-12-30"), as.Date("2012-01-04"), by="weeks"))
eg2 <- as.data.frame(matrix(sample(0:1000, 79*2, replace=TRUE), ncol=1))
df1 <- cbind(eg1,eg2)
df2 <- as.Date(c("2011-07-04","2010-07-28"))
A similar question I have previously asked (Outputting various subsets from one data frame based on dates) was answered effectively with daily data (where there is a balanced number of rows) through this function...
output <- as.data.frame(lapply(df2, function(x) {
df1[difftime(df1[,1], x - days(365)) >= 0 & difftime(df1[,1], x) <= 0, ]
}))
However, with weekly data an uneven number of rows means this is not possible. When the 'as.data.frame' function is removed, the code works but I get a list of data frames. What I would like to do is append a row of NA's to those dataframes containing fewer observations so that I can output one dataframe, so that I can apply functions simply ignoring the NA values e.g...
df2 <- as.Date(c("2011-01-04","2010-07-28"))
output <- as.data.frame(lapply(df2, function(x) {
df1[difftime(df1[,1], x - days(365)) >= 0 & difftime(df1[,1], x) <= 0, ]
}))
col <- c(2,4)
output_two <- output[,col]
Mean <- as.data.frame(apply(output_two, 2, mean, na.rm = TRUE))
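Try
library(lubridate)  # days() is assumed to come from lubridate
# one subset of df1 per sample date: rows within the preceding 365 days
lst <- lapply(df2, function(x) {df1[difftime(df1[,1], x - days(365)) >= 0 &
difftime(df1[,1], x) <= 0, ]})
# indexing past the last row pads the shorter subsets with NA rows
n1 <- max(sapply(lst, nrow))
output <- data.frame(lapply(lst, function(x) x[seq_len(n1),]))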
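To see the padding idea end to end, here is a self-contained sketch on data shaped like the example. It is not the exact code above: it swaps days(365) for plain Date arithmetic (x - 365) so that no extra package is needed, and the column name V1 and the seed are assumptions.

```r
set.seed(1)  # arbitrary seed so the sampled measurements are reproducible
df1 <- data.frame(Date = seq(as.Date("2008-12-30"), as.Date("2012-01-04"), by = "weeks"))
df1$V1 <- sample(0:1000, nrow(df1), replace = TRUE)
df2 <- as.Date(c("2011-07-04", "2010-07-28"))

# One subset per sample date: rows within the 365 days up to and including that date
lst <- lapply(df2, function(x) df1[df1$Date >= x - 365 & df1$Date <= x, ])

# Pad every subset to the longest length; out-of-range row indices yield NA rows
n1 <- max(sapply(lst, nrow))
padded <- lapply(lst, function(x) x[seq_len(n1), ])
output <- do.call(cbind, padded)

# Column means of the two measurement columns, ignoring the NA padding
means <- colMeans(output[, c(2, 4)], na.rm = TRUE)
```

Because the padding is NA, any summary that accepts na.rm = TRUE can be applied to the combined data frame directly.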

merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR in more than one range in dat$Pos, I want it merged each time). What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
start <- as.numeric(range[1])  # convert so seq() receives numbers, not strings
end <- as.numeric(range[2])
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# both ends are inclusive, so single-position ranges (min == max) can match
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
# collect the (dat row, dat2 row) index pairs where the position falls in the range
idx <- which(matches, arr.ind = TRUE)
# bind the matching rows of dat and dat2 side by side, once per match
merged <- cbind(dat[idx[, "row"], ], dat2[idx[, "col"], ])
merged
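Note that the comparison above only looks at positions, not at the chromosome. A hedged base-R sketch that joins on chromosome first and then filters by range could look like this. The rows of dat2 are made up so that two of them actually fall inside the ranges, and the helper columns chr, start, end and pos are assumptions.

```r
dat <- data.frame(Locus = c("ACTC1-001_1", "ACTC1-001_2"),
                  Pos = c("chr15:35087734..35087734", "chr15:35086890..35086919"),
                  stringsAsFactors = FALSE)
# Hypothetical variants: the first two lie inside a range, the third is on another chromosome
dat2 <- data.frame(VAR = c("chr15:35087734", "chr15:35086900", "chr1:116242719"),
                   FUNC = c("exonic", "exonic", "intergenic"),
                   stringsAsFactors = FALSE)

# Split "chr:start..end" into chromosome, start and end
dat$chr   <- sub(":.*$", "", dat$Pos)
dat$start <- as.numeric(sub("^.*:([0-9]+)\\.\\..*$", "\\1", dat$Pos))
dat$end   <- as.numeric(sub("^.*\\.\\.", "", dat$Pos))

# Split "chr:pos" into chromosome and position
dat2$chr <- sub(":.*$", "", dat2$VAR)
dat2$pos <- as.numeric(sub("^.*:", "", dat2$VAR))

# Pair every range with every variant on the same chromosome,
# then keep only the pairs whose position lies inside the range (inclusive)
pairs <- merge(dat, dat2, by = "chr")
result <- pairs[pairs$pos >= pairs$start & pairs$pos <= pairs$end, ]
```

The intermediate pairs table grows with the product of rows per chromosome, so this is a sketch for small data rather than something tuned for large inputs.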
