Merging in R with data.table with two columns

I am fairly new to using data.table, but I am using it since I have heard it is faster than data.frame, and I plan to run this in a loop.
I am trying to merge raster data (which comes with longitude "x", latitude "y", and temperature information) onto a master dataset, which is just all possible combinations of "x" and "y" for this particular country I am looking at.
For some reason, the data.frame version works (the temperature information is merged in; some values are missing, but that's expected), while the data.table version does not (the temperature variable is added, but all of its values are missing). I suspect it has something to do with merging on two columns, or perhaps the data isn't sorted the right way, but I'm not completely sure.
Below is my code
library(data.table)
library(raster)
# Set common parameters
x <- rep(seq(-49.975,49.975, by = 0.05), times = 2000)
y <- rep(seq(-49.975,49.975, by = 0.05), each = 2000)
xy <- cbind(x,y)
## What works
# Create data frame, then subset to possible coordinates of country
df_xy <- data.frame(xy)
eth_df_xy <- subset(df_xy, df_xy$x >= 30 & df_xy$x <= 50 & df_xy$y >=0 & df_xy$y <= 20)
# Bring in raster dataset
examine <- print(paste0(dir_tif, files[[1]]))
sds <- raster(examine)
x <- rasterToPoints(sds)
df_x <- data.frame(x)
# Merge
eth_df_xy <- merge(df_x, eth_df_xy, by = c("x","y"), all.x = F, all.y=T)
## What doesn't work but seems intuitive
# Create data table, then subset to possible coordinates of country (as above)
dt_xy <- data.table(xy)
eth_dt_xy <- subset(dt_xy, dt_xy$x >= 30 & dt_xy$x <= 50 & dt_xy$y >=0 & dt_xy$y <= 20)
# Bring in raster dataset (from above, skip to fourth step)
dt_x <- data.table(x)
# Merge
eth_dt_xy <- merge(dt_x, eth_dt_xy, by = c("x","y"), all.x = F, all.y=T)
Thanks
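For reference, here is what a two-column data.table merge looks like on toy data with exact-match keys (a minimal sketch with made-up values, separate from the raster data above):
library(data.table)
a <- data.table(x = c(1, 2, 2), y = c(1, 1, 2), temp = c(10, 20, 30))
b <- data.table(x = c(2, 2, 3), y = c(1, 2, 2))
# keep every row of b; bring in temp where the (x, y) pair matches exactly
merge(a, b, by = c("x", "y"), all.x = FALSE, all.y = TRUE)
One thing worth checking in the setup above: both merge() calls compare the x/y keys for exact equality, and doubles built with seq() are not always bitwise-identical to coordinates coming out of rasterToPoints(), so rounding both key columns to a fixed number of decimals before merging is a useful sanity check.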

Related

subset the data frame based on multiple ranges and save each range as an element in a list

I want to split the data frame into a list based on which of several ranges its values fall into, so that the rows falling in each range become one element of that list. For example, if I have 10 ranges and a data frame with nrow = n, I should get a list of 10 data frames.
The data
df<- data.frame(x=seq(33, 37, 0.12), y=seq(31,35, 0.12))
library(data.table)
range <- data.table(start = c(36.15, 36.08, 36.02, 35.95, 35.89, 35.82, 35.76, 35.69),
                    end   = c(36.08, 36.02, 35.95, 35.89, 35.82, 35.76, 35.69, 35.63))
I tried
nlist<-list(
df[which(df$x>36.15),],
df[which(df$x<=36.15 & df$x>36.08),],
df[which(df$x<=36.08 & df$x>36.02),],
df[which(df$x<=36.02 & df$x>35.95),],
df[which(df$x<=35.95 & df$x>35.89),],
df[which(df$x<=35.89 & df$x>35.82),],
df[which(df$x<=35.82 & df$x>35.76),],
df[which(df$x<=35.76 & df$x>35.69),],
df[which(df$x<=35.69 & df$x>35.63),],
df[which(df$x <= 35.63),])
There are two problems. First, I want to do this in a loop instead of hard-coding the limits of each range. Second, this code:
Reduce('+', lapply(nlist, nrow))
produces the sum of rows = 35 whereas my data frame has nrow = 34. Where does this extra value come from?
You could apply over the rows of your range object:
apply(range, 1, function(z) df[df$x > z[2] & df$x <= z[1],])
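A quick way to inspect the result (reusing the df and range objects defined above; the call is the same one-liner just shown):
nlist <- apply(range, 1, function(z) df[df$x > z[2] & df$x <= z[1], ])
sapply(nlist, nrow) # row count per range; apply() converts range to a matrix
Note this covers only the eight rows of range itself; the two catch-all bins from the hand-written list (x > 36.15 and x <= 35.63) are not produced, so these counts will not sum to 34.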
You can split the data frame according to levels obtained by cutting df$x by range$start. You don't even need a loop for this:
nlist <- split(df, cut(df$x, breaks = c(-Inf, range$start, Inf)))
Or if you want it in the same format (an unnamed list in reverse order), you can do:
nlist <- setNames(rev(split(df, cut(df$x, breaks=c(-Inf, range$start, Inf)))),NULL)
This also gives the correct answer for Reduce:
Reduce('+', lapply(nlist, nrow))
#> [1] 34

data.table filtering solution based on POSIXct datetime and two variables?

I have a data frame of when animals are detected at different sites. I want to eliminate (filter) rows from the detection file (df) for site A if the individual animal wasn't also detected at site B within a time frame (5 minutes). I need to iterate this over every individual animal and across multiple sites. My real data has many animals and over a million detection observations, so I'm looking for a data.table solution for efficiency.
The two variables would be the individuals (animals) and the site detected.
Example:
obs.num <- 1:21 # a simple observation number
animal <- c(rep("RBT 1", 10), rep("RBT 2", 7), rep("RBT 3", 2), "RBT 4", "RBT 2") # a fake list of animal ids (my data has many)
now <- Sys.time()
ts <- seq(from = now, length.out = 16, by = "mins")
ts <- c(ts, seq(from = tail(ts, 1), length.out = 3, by = "hour")) # create a fake series of time stamps
ts <- c(ts, seq(from = tail(ts, 1), length.out = 2, by = "hour"))
df <- data.frame(obs.num, animal, ts) # make data frame
df$site <- c("A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","B") # a fake series of sites where detections occurred
str(df)
df # my example data frame
In this example I would like to remove the entire row for observation 19.
I'm looking for a data.table solution similar to this sqldf solution:
library(sqldf)
sqldf("with B as (select * from df where site == 'B')
select distinct df.* from df
join B on df.animal = B.animal and
B.ts - df.ts between -5 * 60 and 5 * 60
order by 1")
A bit clunky, but you can accomplish this with non-equi-joins in data.table:
library(data.table)
setDT(df)
nm = names(df)
# unfortunately non-equi-joins don't support on-the-fly
# columns yet, so we have to first define them explicitly; see:
# https://github.com/Rdatatable/data.table/issues/1639
df[ , ts_minus_5 := ts - 5*60]
df[ , ts_plus_5 := ts + 5*60]
# identify the observations _matching_ your criteria (i.e. those to keep)
found_at_b = unique(
  df[site == 'A'][df[site == 'B'], .(x.obs.num, x.animal),
    on = .(animal == animal, ts >= ts_minus_5, ts <= ts_plus_5),
    # allow.cartesian allows this join to return any number of rows,
    # necessary since any "B" row might match multiple "A" rows;
    # nomatch = 0L drops any "B" row without a match found in "A" rows
    allow.cartesian = TRUE, nomatch = 0L]
)
# to filter, define a "drop" flag (could also call it "filter")
df[site == 'B', drop := FALSE]
df[found_at_b, on = c(obs.num = 'x.obs.num', animal = 'x.animal'),
drop := FALSE]
# could define drop = TRUE for the other rows, but no need
df = df[(!drop)]
There are some other ways to clean the code up a bit by being more careful about potentially creating copies, perhaps split-ing the data by site first, doing as much as possible within one [] call, etc., but this will get you started.
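To verify on the example data (a quick check, not part of the original answer), observation 19 should now be gone:
df[obs.num == 19] # empty: the unmatched site-A detection was dropped
nrow(df)          # 20 of the original 21 rows remain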

Binning variables in a dataframe with input bin data from another dataframe

As a beginner-level R user, despite having read (1) numerous posts about binning and grouping here on SO and (2) the documentation for the data.table and dplyr packages, I still can't figure out how to apply those packages to binning continuous and factor variables for further use in credit-scoring modelling.
Problem: To build a code-efficient, easily-customisable, more or less automated solution for binning variables with minimal hard-coding.
These variables used to be binned with a stored procedure (Oracle), but I would like to switch entirely to R and operate on the following dataframes: binning the variables in "df_Raw" according to the variable/bin ranges and levels in "binsDF", and storing the binned variables in "df_Binned".
So far I have produced simple, straightforward code that is lengthy, error-prone (cut levels and labels are hard-coded), difficult to roll back, and just plain ugly, although it works.
The goal is to automate this binning operation as much as possible with minimal hard-coding, so that re-doing the binning only requires updating the bin ranges and levels in "binsDF" and re-running the code, rather than manually editing hard-coded values.
I wonder how the *ply family of functions and the dplyr functions could be applied nicely to this problem.
Data description: the datasets have 100+ variables and 1-2 million observations, with two types of variables to be binned:
Continuous variables. Example: OVERDUEAMOUNT has values of 0 (zero), NA, and both negative and positive numeric values.
OVERDUEAMOUNT needs to be split into 7 bins: bin #1 contains only zeros, bins #2-6 contain continuous values split into 5 custom-sized intervals, and bin #7 contains only NAs.
Factor variables, with both character and numeric values. Example: PROFESSION has 4 levels: NA and 3 values/codes that stand for certain categories of professions/types of employment.
It is important to place zeros and NAs in 2 separate bins, as they usually have very different interpretations from each other and from other values.
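A tiny base-R illustration of why (nothing here beyond cut() itself): cut() leaves NAs unbinned and drops zeros into whichever numeric interval contains them, so both need explicit handling:
cut(c(NA, 0, 2000), breaks = c(-Inf, 1500, Inf))
# the NA stays NA, and 0 lands in the (-Inf, 1500] interval with other values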
Datasets like iris or GermanCredit are not applicable due to not having NAs, strings or zeros, so I wrote some code below to replicate my data.
Many thanks in advance!
Raw data to be binned.
library(dplyr) # for n_distinct(), semi_join(), glimpse() and arrange() below
OVERDUEAMOUNT_numbers <- rnorm(10000, mean = 9000, sd = 3000)
OVERDUEAMOUNT_zeros <- rep(0, 3000)
OVERDUEAMOUNT_NAs <- rep(NA, 4000)
OVERDUEAMOUNT <- c(OVERDUEAMOUNT_numbers, OVERDUEAMOUNT_zeros, OVERDUEAMOUNT_NAs)
PROFESSION_f1 <- rep("438", 3000)
PROFESSION_f2 <- rep("000", 4000)
PROFESSION_f3 <- rep("selfemployed", 5000)
PROFESSION_f4 <- rep(NA, 5000)
PROFESSION <- c(PROFESSION_f1, PROFESSION_f2, PROFESSION_f3, PROFESSION_f4)
ID <- sample(123456789:987654321, 17000, replace = TRUE); n_distinct(ID)
df_Raw <- cbind.data.frame(ID, OVERDUEAMOUNT, PROFESSION)
colnames(df_Raw) <- c("ID", "OVERDUEAMOUNT", "PROFESSION")
Convert PROFESSION to a factor to replicate how this variable is processed and prepared for import into R. Reshuffle the dataframe row-wise to make it look like real data.
df_Raw$PROFESSION <- as.factor(df_Raw$PROFESSION)
df_Raw <- df_Raw[sample(nrow(df_Raw)), ]
Dataframe with bins.
variable <- c(rep("OVERDUEAMOUNT", 7), rep("PROFESSION", 4))
min <- c(0, c(-Inf, 1500, 4000, 8000, 12000), "", c("438", "000", "selfemployed", ""))
max <- c(0, c(1500, 4000, 8000, 12000, Inf), "", c("438", "000", "selfemployed", ""))
bin <- c(c(1, 2, 3, 4, 5, 6, 7), c(1, 2, 3, 4))
binsDF <- cbind.data.frame(variable, min, max, bin)
colnames(binsDF) <- c("variable", "min", "max", "bin")
How I bin the variables: first, copy the list of IDs as-is into a separate dataframe, for later use in semi-joins as a reference for the original list of IDs.
dfID <- as.data.frame(df_Raw$ID); colnames(dfID) <- c("ID")
Continuous variable - OVERDUEAMOUNT. Split the variable into 3 temporary dataframes: zeros, NAs and numeric observations to cut.
df_tmp_zeros <- subset(x=df_Raw, subset=(OVERDUEAMOUNT == 0), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_zeros)
df_tmp_NAs <- subset(x=df_Raw, subset=(is.na(OVERDUEAMOUNT)), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_NAs)
df_tmp_numbers <- subset(x=df_Raw, subset=(OVERDUEAMOUNT != 0 & !is.na(OVERDUEAMOUNT)), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_numbers)
(nrow(df_tmp_zeros) + nrow(df_tmp_NAs) + nrow(df_tmp_numbers)) == nrow(df_Raw) # double-check that all observations are split into 3 parts.
Replace zeros and NAs with the appropriate bin numbers; specify the number of intervals and the interval ranges; cut the numeric values into bins; merge the 3 parts back together; and append the binned variable to the final dataframe.
df_tmp_zeros$OVERDUEAMOUNT <- as.factor(1)
df_tmp_NAs$OVERDUEAMOUNT <- as.factor(7)
cuts.OVERDUEAMOUNT <- c(-Inf, 1500, 4000, 8000, 12000, Inf)
labels.OVERDUEAMOUNT <- c(2:6)
df_tmp_numbers$OVERDUEAMOUNT <- cut(df_tmp_numbers$OVERDUEAMOUNT, breaks = cuts.OVERDUEAMOUNT, labels = labels.OVERDUEAMOUNT, right = FALSE)
df_tmp_allback <- rbind(df_tmp_zeros, df_tmp_NAs, df_tmp_numbers)
nrow(df_tmp_allback) == nrow(df_Raw) # double-check that all observations are added back.
df_semijoin <- semi_join(x=df_tmp_allback, y=dfID, by=c("ID")) # return all rows from x where there are matching values in y, keeping just columns from x.
glimpse(df_semijoin); summary(df_semijoin)
df_Binned <- df_semijoin
str(df_Binned)
Factor variable - PROFESSION. Split the variable into several temporary dataframes: NAs and as many parts as there are other factor levels.
df_tmp_f1 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "438"), select=c(ID, PROFESSION)); nrow(df_tmp_f1)
df_tmp_f2 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "000"), select=c(ID, PROFESSION)); nrow(df_tmp_f2)
df_tmp_f3 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "selfemployed"), select=c(ID, PROFESSION)); nrow(df_tmp_f3)
df_tmp_NAs <- subset(x=df_Raw, subset=(is.na(PROFESSION)), select=c(ID, PROFESSION)); nrow(df_tmp_NAs)
df_tmp_f1$PROFESSION <- as.factor(1)
df_tmp_f2$PROFESSION <- as.factor(2)
df_tmp_f3$PROFESSION <- as.factor(3)
df_tmp_NAs$PROFESSION <- as.factor(4)
df_tmp_allback <- rbind(df_tmp_f1, df_tmp_f2, df_tmp_f3, df_tmp_NAs)
nrow(df_tmp_allback) == nrow(df_Raw) # double-check that all observations are added back.
df_semijoin <- semi_join(x=df_tmp_allback, y=dfID, by=c("ID")) # return all rows from x where there are matching values in y, keeping just columns from x.
str(df_semijoin); summary(df_semijoin)
df_Binned <- cbind(df_Binned, df_semijoin$PROFESSION)
str(df_Binned)
And so on...
P.S. UPDATE:
The best solution to this problem is given in this post.
roll join with start/end window
These posts are also helpful:
How to join (merge) data frames (inner, outer, left, right)?
Why does X[Y] join of data.tables not allow a full outer join, or a left join?
The idea is as follows: make a subset of the raw-data dataframe with 1 column of unique IDs, 1 column of raw values of the variable, and 1 column with the variable's name (use rep() to repeat the variable name as many times as there are observations); then make a subset of the bins dataframe for just that one variable (as many rows as that variable has bins), with, in my case, 4 columns: Variable, Min, Max, Bin.
See sample code below:
I also tried foverlaps() from the data.table package, but it can't handle NAs; as far as I understand, NAs have to be processed separately. Another option is rolling joins, but I haven't cracked that yet and would appreciate advice on them.
# Subset numeric variables by variable name.
rawDF_num_X <- cbind(rawDF2bin$ID,
                     rep(var_num, times = nrow(rawDF2bin[, vars_num_raw][var_num])),
                     rawDF2bin[, vars_num_raw][var_num])
colnames(rawDF_num_X) <- c("ID", "Variable", "Value")
rawDF_num_X <- as.data.table(rawDF_num_X)
# Subset table with bins for numeric variables by variable name.
bins_num_X <- bins_num[bins_num$Variable == var_num, ]
bins_num_X <- arrange(bins_num_X, Bin) # sort by bin values, in ascending order.
bins_num_X <- as.data.table(bins_num_X)
# Select and join numeric variables with their corresponding bins using the sqldf package.
library(sqldf)
vars_num_join <- sqldf("SELECT a.ID, a.Variable, a.Value, b.Min, b.Max, b.Bin
FROM rawDF_num_X AS a, bins_num_X AS b
WHERE a.Variable = b.Variable AND a.Value between b.Min and b.Max
OR a.Value IS NULL AND b.Min IS NULL AND b.Max IS NULL")
View(vars_num_join); dim(vars_num_join)
# Create a TRUE/FALSE flag/check according to the binning conditions.
vars_num_join$check <- ifelse((is.na(vars_num_join$Value) == TRUE & is.na(vars_num_join$Min) == TRUE & is.na(vars_num_join$Max) == TRUE), "TRUE",
                       ifelse((vars_num_join$Value == 0 & vars_num_join$Min == 0 & vars_num_join$Max == 0), "TRUE",
                       ifelse((vars_num_join$Value != 0 & vars_num_join$Value >= vars_num_join$Min & vars_num_join$Value < vars_num_join$Max), "TRUE", "FALSE")))
# Remove (duplicate) rows that have FALSE flag due to not matching the binning conditions.
vars_num_join <- vars_num_join[vars_num_join$check == TRUE, ]
identical(rawDF2bin$ID, vars_num_join$ID) # should be TRUE
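For what it's worth, the range-join idea from the linked post can be sketched with a data.table non-equi update join (toy data and illustrative names, not the final code for the datasets above):
library(data.table)
bins <- data.table(min = c(-Inf, 1500, 4000, 8000, 12000),
                   max = c(1500, 4000, 8000, 12000, Inf),
                   bin = 2:6)
dt <- data.table(ID = 1:5, OVERDUEAMOUNT = c(NA, 0, 2000, 9000, 20000))
# assign interval bins where min <= value < max; NAs match nothing and stay NA
dt[bins, on = .(OVERDUEAMOUNT >= min, OVERDUEAMOUNT < max), bin := i.bin]
# zeros and NAs then get their own bins, overriding the interval match for 0
dt[OVERDUEAMOUNT == 0, bin := 1L]
dt[is.na(OVERDUEAMOUNT), bin := 7L]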

match function with data frames that are differently constructed

I am relatively new to R and I have hit a wall trying to figure out how to do what I want to do. I went through many questions on Stack Overflow but still could not figure it out exactly. Here is what I am trying to do:
data frame 1:
d1 =c("2005/01/02")
d2 = c("2005/01/08")
rm = c(13)
df1 = data.frame(d1, d2, rm)
data frame 2:
df2 <- as.data.frame(seq(as.Date("2005-01-02"), as.Date("2005-01-08"), by="days"))
colnames(df2)<-c("dtime")
What I hope to create:
df2$new <- if (df2$dtime >= df1$d1 AND df2$dtime <= df1$d2) return df1$rm   # pseudocode
with the hope of creating a variable df2$new that looks like this in the end:
df2$new <- 13
View(df2)
I am essentially trying to match the value that corresponds to the week (df1$rm) to the individual days (df2$new) within that week.
I think what you might be looking for is sapply:
df2$new <- sapply(df2$dtime, function(row) {
  df1[((row >= as.Date(df1$d1)) + (row <= as.Date(df1$d2))) == 2, ]$rm
})
In R, vectorised operations are preferred to for loops. What I'm doing is taking df2's dtime column and applying function(row) to each element in turn. This gets me a list of lookups into df1. Will there always be a matching entry in df1, or do we need a default case?
If the dataframe is not too big, this would be really easy using a simple for loop:
d1 =c("2005/01/02")
d2 = c("2005/01/08")
rm = c(13)
df1 = data.frame(d1, d2, rm)
df2 <- as.data.frame(seq(as.Date("2005-01-02"), as.Date("2005-01-08"), by="days"))
colnames(df2)<-c("dtime")
df2$new <- NA
for(i in 1:nrow(df1)) df2$new[df2$dtime >= as.Date(df1$d1[i]) & df2$dtime <= as.Date(df1$d2[i])] <- df1$rm[i]
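If the data grow, the same range lookup can be done without an explicit loop via a data.table non-equi update join (a sketch on the toy data above, not one of the original answers):
library(data.table)
dt1 <- as.data.table(df1)[, `:=`(d1 = as.Date(d1), d2 = as.Date(d2))]
dt2 <- as.data.table(df2)
# for each dtime, pull rm from the df1 row whose [d1, d2] window contains it
dt2[dt1, new := i.rm, on = .(dtime >= d1, dtime <= d2)]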

merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR falls within more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient, so I would not recommend it for large data. However, you seemed to indicate your data is not that large, so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
  parts <- strsplit(as.character(x$Pos), ":")[[1]]
  chr <- parts[1]
  range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
  start <- as.numeric(range[1]) # seq() needs numbers, not the strings strsplit returns
  end <- as.numeric(range[2])
  data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
library(stringr) # for str_split()
d2.vars <- as.numeric(do.call(rbind, str_split(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
matches <- sapply(d2.vars, function(v) mins < v & v <= maxs)
MERGE
# create a column in dat to merge by
dat <- cbind(dat, VAR = NA)
# write the matching VAR from dat2 into dat (note <<- so the assignment
# persists outside the function; only the last match per row is kept)
sapply(seq(ncol(matches)), function(i)
  dat$VAR[matches[, i]] <<- as.character(dat2[i, "VAR"]))
merge(dat, dat2)
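A further alternative (a sketch using data.table non-equi joins, not from the original answers; it assumes the Pos/VAR formats shown above, and the helper columns chr/start/end/pos are illustrative):
library(data.table)
dt  <- as.data.table(dat)
dt2 <- as.data.table(dat2)
# split "chr15:35087734..35087734" into chromosome, start and end
dt[, c("chr", "start", "end") := tstrsplit(Pos, "[:.]+", type.convert = TRUE)]
# split "chr1:116242719" into chromosome and position
dt2[, c("chr", "pos") := tstrsplit(VAR, ":", type.convert = TRUE)]
# one output row per match where pos falls inside [start, end] on the same chromosome
# (on this particular toy data the chromosomes differ, so the result is empty)
dt2[dt, on = .(chr, pos >= start, pos <= end), nomatch = 0L]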
