Binning variables in a dataframe with input bin data from another dataframe - r

Being a beginner-level user of R, despite having read (1) numerous posts about binning & grouping here at SO, and (2) the documentation for the data.table and dplyr packages, I still can't figure out how to apply the power of those packages to binning continuous & factor variables for further use in credit scoring modelling.
Problem: To build a code-efficient, easily-customisable, more or less automated solution for binning variables with minimal hard-coding.
These variables used to be binned with a stored procedure (Oracle), but I would like to switch entirely to R and operate the following dataframes:
to bin variables in "df_Raw" according to the variable/bin ranges & levels in "binsDF", and to store the binned variables in "df_Binned".
So far I have been able to produce simple, straightforward code that is lengthy, error-prone (cut levels and labels are hard-coded), difficult to roll back, and just ugly, although it works.
The goal is to automate this binning as much as possible with minimal hard-coding, so that re-doing the binning only takes updating the bin ranges & levels in "binsDF" and re-running the code, rather than manually editing all the hard-coded values.
I wonder how **ply family of functions and dplyr functions could be applied nicely to this problem.
Data description: the datasets have 100+ variables and 1-2 million observations, with two types of variables to be binned:
Continuous variables. Example: OVERDUEAMOUNT has values 0 (zero), NA, and both negative & positive numeric values.
OVERDUEAMOUNT needs to be split into 7 bins: bin#1 contains only zeros, bins#2-6 contain continuous values that need to be split into 5 custom-sized intervals, and bin#7 contains only NAs.
Factor variables, with both character and numeric values. Example - PROFESSION - has 4 levels: "NA" and 3 values/codes that stand for certain categories of professions/types of employment.
It is important to place zeros and NAs in 2 separate bins, as they usually have very different interpretation from each other and from other values.
Datasets like iris or GermanCredit are not applicable because they contain no NAs, strings or zeros, so I wrote some code below to replicate my data.
Many thanks in advance!
Raw data to be binned.
OVERDUEAMOUNT_numbers <- rnorm(10000, mean = 9000, sd = 3000)
OVERDUEAMOUNT_zeros <- rep(0, 3000)
OVERDUEAMOUNT_NAs <- rep(NA, 4000)
OVERDUEAMOUNT <- c(OVERDUEAMOUNT_numbers, OVERDUEAMOUNT_zeros, OVERDUEAMOUNT_NAs)
PROFESSION_f1 <- rep("438", 3000)
PROFESSION_f2 <- rep("000", 4000)
PROFESSION_f3 <- rep("selfemployed", 5000)
PROFESSION_f4 <- rep(NA, 5000)
PROFESSION <- c(PROFESSION_f1, PROFESSION_f2, PROFESSION_f3, PROFESSION_f4)
library(dplyr) # n_distinct() comes from dplyr
ID <- sample(123456789:987654321, 17000, replace = TRUE); n_distinct(ID)
df_Raw <- cbind.data.frame(ID, OVERDUEAMOUNT, PROFESSION)
colnames(df_Raw) <- c("ID", "OVERDUEAMOUNT", "PROFESSION")
Convert PROFESSION to a factor to replicate how this variable arrives after being processed and imported into R. Reshuffle the dataframe row-wise to make it look like real data.
df_Raw$PROFESSION <- as.factor(df_Raw$PROFESSION)
df_Raw <- df_Raw[sample(nrow(df_Raw)), ]
Dataframe with bins.
variable <- c(rep("OVERDUEAMOUNT", 7), rep("PROFESSION", 4))
min <- c(0, c(-Inf, 1500, 4000, 8000, 12000), "", c("438", "000", "selfemployed", ""))
max <- c(0, c(1500, 4000, 8000, 12000, Inf), "", c("438", "000", "selfemployed", ""))
bin <- c(c(1, 2, 3, 4, 5, 6, 7), c(1, 2, 3, 4))
binsDF <- cbind.data.frame(variable, min, max, bin)
colnames(binsDF) <- c("variable", "min", "max", "bin")
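Once binsDF exists, the continuous bins can be applied without any hard-coded cut points via a data.table non-equi join. A sketch, assuming the OVERDUEAMOUNT rows of binsDF are first restated with numeric lo/hi columns (the posted binsDF stores everything as character, so a numeric copy is needed):

```r
library(data.table)

# Numeric restatement of binsDF's OVERDUEAMOUNT interval rows (bins 2-6);
# the zero and NA bins are handled explicitly afterwards.
bins_num <- data.table(lo  = c(-Inf, 1500, 4000, 8000, 12000),
                       hi  = c(1500, 4000, 8000, 12000, Inf),
                       bin = 2:6)

dt <- as.data.table(df_Raw)
# Non-equi join: each value picks up the bin whose [lo, hi) interval contains it.
dt[, OVERDUEAMOUNT_bin := bins_num[dt, on = .(lo <= OVERDUEAMOUNT, hi > OVERDUEAMOUNT), x.bin]]
dt[OVERDUEAMOUNT == 0,   OVERDUEAMOUNT_bin := 1L]  # zeros get their own bin
dt[is.na(OVERDUEAMOUNT), OVERDUEAMOUNT_bin := 7L]  # NAs get their own bin
```

Updating binsDF and re-running is then the whole re-binning workflow; no breaks or labels live in the code.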
How I bin the variables: copy the list of IDs "as is" in a separate dataframe for further use in semi-joins as a "reference/standard" for the original list of IDs.
dfID <- as.data.frame(df_Raw$ID); colnames(dfID) <- c("ID")
Continuous variable - OVERDUEAMOUNT. Split the variable into 3 temporary dataframes: zeros, NAs and numeric observations to cut.
df_tmp_zeros <- subset(x=df_Raw, subset=(OVERDUEAMOUNT == 0), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_zeros)
df_tmp_NAs <- subset(x=df_Raw, subset=(is.na(OVERDUEAMOUNT)), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_NAs)
df_tmp_numbers <- subset(x=df_Raw, subset=(OVERDUEAMOUNT != 0 & !is.na(OVERDUEAMOUNT)), select=c(ID, OVERDUEAMOUNT)); nrow(df_tmp_numbers)
(nrow(df_tmp_zeros) + nrow(df_tmp_NAs) + nrow(df_tmp_numbers)) == nrow(df_Raw) # double-check that all observations are split into 3 parts.
Replace zeros and NAs with the appropriate bin numbers.
Specify number of intervals, interval ranges and partition numeric values into bins.
Cut the variable into intervals.
Merge 3 parts together.
Append the binned variable to the final dataframe.
df_tmp_zeros$OVERDUEAMOUNT <- as.factor(1)
df_tmp_NAs$OVERDUEAMOUNT <- as.factor(7)
cuts.OVERDUEAMOUNT <- c(-Inf, 1500, 4000, 8000, 12000, Inf)
labels.OVERDUEAMOUNT <- c(2:6)
df_tmp_numbers$OVERDUEAMOUNT <- cut(df_tmp_numbers$OVERDUEAMOUNT, breaks = cuts.OVERDUEAMOUNT, labels = labels.OVERDUEAMOUNT, right = FALSE)
df_tmp_allback <- rbind(df_tmp_zeros, df_tmp_NAs, df_tmp_numbers)
nrow(df_tmp_allback) == nrow(df_Raw) # double-check that all observations are added back.
df_semijoin <- semi_join(x=df_tmp_allback, y=dfID, by=c("ID")) # return all rows from x where there are matching values in y, keeping just columns from x.
glimpse(df_semijoin); summary(df_semijoin)
df_Binned <- df_semijoin
str(df_Binned)
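For comparison, the whole split / replace / rbind / semi_join sequence above can be collapsed into one vectorised step; a sketch using dplyr::case_when with the same breaks (just one way to do it, but it removes the temporary dataframes entirely):

```r
library(dplyr)

df_Binned2 <- df_Raw %>%
  mutate(OVERDUEAMOUNT_bin = case_when(
    is.na(OVERDUEAMOUNT) ~ 7L,   # NA bin
    OVERDUEAMOUNT == 0   ~ 1L,   # zero bin
    TRUE ~ as.integer(as.character(
      cut(OVERDUEAMOUNT, breaks = c(-Inf, 1500, 4000, 8000, 12000, Inf),
          labels = 2:6, right = FALSE)))
  ))
```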
Factor variable - PROFESSION. Split the variable into several temporary dataframes: NAs and as many parts as there are other factor levels.
df_tmp_f1 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "438"), select=c(ID, PROFESSION)); nrow(df_tmp_f1)
df_tmp_f2 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "000"), select=c(ID, PROFESSION)); nrow(df_tmp_f2)
df_tmp_f3 <- subset(x=df_Raw, subset=(df_Raw$PROFESSION == "selfemployed"), select=c(ID, PROFESSION)); nrow(df_tmp_f3)
df_tmp_NAs <- subset(x=df_Raw, subset=(is.na(PROFESSION)), select=c(ID, PROFESSION)); nrow(df_tmp_NAs)
df_tmp_f1$PROFESSION <- as.factor(1)
df_tmp_f2$PROFESSION <- as.factor(2)
df_tmp_f3$PROFESSION <- as.factor(3)
df_tmp_NAs$PROFESSION <- as.factor(4)
df_tmp_allback <- rbind(df_tmp_f1, df_tmp_f2, df_tmp_f3, df_tmp_NAs)
nrow(df_tmp_allback) == nrow(df_Raw) # double-check that all observations are added back.
df_semijoin <- semi_join(x=df_tmp_allback, y=dfID, by=c("ID")) # return all rows from x where there are matching values in y, keeping just columns from x.
str(df_semijoin); summary(df_semijoin)
df_Binned <- cbind(df_Binned, df_semijoin$PROFESSION)
str(df_Binned)
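For PROFESSION, no interval logic is needed at all: a named lookup vector derived from the bin table replaces the four subsets and the rbind. A sketch (NA handled separately, since vector names cannot be NA):

```r
# Map each factor level to its bin number; the names mirror the binsDF rows.
lookup <- c("438" = 1L, "000" = 2L, "selfemployed" = 3L)
prof_bin <- unname(lookup[as.character(df_Raw$PROFESSION)])
prof_bin[is.na(df_Raw$PROFESSION)] <- 4L   # NA bin
```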
And so on...
P.S. UPDATE:
The best solution to this problem is given in this post.
roll join with start/end window
These posts are also helpful:
How to join (merge) data frames (inner, outer, left, right)?
Why does X[Y] join of data.tables not allow a full outer join, or a left join?
The idea is as follows: make a subset of the raw-data dataframe with one column of unique IDs, one column of raw values, and one column holding the variable's name (use rep() to repeat the name as many times as there are observations); then make a subset of the bins dataframe for that one variable (as many rows as that variable has bins), which in my case has 4 columns: Variable, Min, Max, Bin.
Also I tried foverlaps() from the data.table package, but it can't handle NAs; processing of NAs has to be done separately, AFAIU. Another option is rolling joins, but I haven't cracked that yet and will appreciate advice on them. See sample code below:
# Subset numeric variables by variable name.
rawDF_num_X <- cbind(rawDF2bin$ID,
                     rep(var_num, times = nrow(rawDF2bin[, vars_num_raw][var_num])),
                     rawDF2bin[, vars_num_raw][var_num])
colnames(rawDF_num_X) <- c("ID", "Variable", "Value")
rawDF_num_X <- as.data.table(rawDF_num_X)
# Subset table with bins for numeric variables by variable name.
bins_num_X <- bins_num[bins_num$Variable == var_num, ]
bins_num_X <- arrange(bins_num_X, Bin) # sort by bin values, in ascending order.
bins_num_X <- as.data.table(bins_num_X)
# Select and join numeric variables with their corresponding bins using sqldf package.
vars_num_join <- sqldf("SELECT a.ID, a.Variable, a.Value, b.Min, b.Max, b.Bin
FROM rawDF_num_X AS a, bins_num_X AS b
WHERE a.Variable = b.Variable AND a.Value between b.Min and b.Max
OR a.Value IS NULL AND b.Min IS NULL AND b.Max IS NULL")
View(vars_num_join); dim(vars_num_join)
# Create a TRUE/FALSE flag/check according to the binning conditions.
vars_num_join$check <- ifelse(is.na(vars_num_join$Value) & is.na(vars_num_join$Min) & is.na(vars_num_join$Max), "TRUE",
                       ifelse(vars_num_join$Value == 0 & vars_num_join$Min == 0 & vars_num_join$Max == 0, "TRUE",
                       ifelse(vars_num_join$Value != 0 & vars_num_join$Value >= vars_num_join$Min & vars_num_join$Value < vars_num_join$Max, "TRUE", "FALSE")))
# Remove (duplicate) rows that have FALSE flag due to not matching the binning conditions.
vars_num_join <- vars_num_join[vars_num_join$check == TRUE, ]
identical(rawDF2bin$ID, vars_num_join$ID) # should be TRUE
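On the rolling-join question above: a sketch of the idea, assuming a numeric bin table for one variable with [Min, Max) intervals. With roll = Inf, each value matches the row with the largest Min not exceeding it, so only the interval starts are needed; NAs are handled separately, as with foverlaps():

```r
library(data.table)

# Hypothetical single-variable bin table; only interval starts are required.
bins <- data.table(Min = c(-Inf, 1500, 4000, 8000, 12000), Bin = 2:6)
vals <- data.table(Value = c(-100, 2000, 7999, 50000, NA))

# Rolling join: each Value matches the row with the largest Min <= Value.
vals[, Bin := bins[vals, on = .(Min = Value), roll = Inf, x.Bin]]
vals[is.na(Value), Bin := 7L]  # NAs get their own bin afterwards
```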

Related

Can I use lapply to check for outliers in comparison to values from all listed tibbles?

My data is imported into R as a list of 60 tibbles each with 13 columns and 8 rows. I want to detect outliers defined as 2*sd by comparing each value in column "2" to the mean of all values of column "2" in the same row.
I know that I am on a wrong path with these lines, as I am not comparing the single values
lapply(list, function(x){
  if(x$"2" > (mean(x$"2")) + (2*sd(x$"2")) || x$"2" < (mean(x$"2")) - (2*sd(x$"2"))) {}
})
Also I was hoping to replace all values that are thus identified as outliers by the corresponding mean calculated from the 60 values in the same position as the outlier while keeping everything else, but I am also quite unsure how to do that.
Thank you!
You haven't added a reproducible example of your data, so I've made a quick and simple example to demonstrate my answer. I think the logic is much more straightforward if you first combine the list of tibbles into a single tibble. This allows you to do everything you want in a single dplyr pipe, ultimately identifying outliers by 1s in the 'outlier' column:
library(tidyverse)
tibble1 <- tibble(colA = c(seq(1,20,1), 150),
                  colB = seq(0.1,2.1,0.1),
                  id = 1:21)
tibble2 <- tibble(colA = c(seq(101,120,1), -150),
                  colB = seq(21,41,1),
                  id = 1:21)
# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one
# The 'id' column is essentially shorthand for an index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')
res <- joinedTibbles %>%
  group_by(id) %>%
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         lowThresh = meanA - 2*sdA,
         uppThresh = meanA + 2*sdA,
         outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
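For the second part of the question (replacing flagged values with the corresponding mean), the same pipe can be extended; a sketch continuing from the res object above:

```r
# Swap each flagged value for the mean of its row-position group, keep the rest.
res_clean <- res %>%
  mutate(colA = ifelse(outlier == 1, meanA, colA)) %>%
  ungroup() %>%
  select(tbl, id, colA, colB)
```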

fast replacement of data.table values by labels stored in another data.table

It is related to this question and this other one, although to a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.
See reproducible example :
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table(
  country   = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
  gender    = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
  age       = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
  age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
  status    = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
  children  = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table(
  label_id  = as.character(c(1:9)),
  country   = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
  gender    = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
  age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
  status    = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
  children  = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller than the survey data table; this is expected.
Also, despite all variables being stored as character, some are true numeric variables like age, and consequently do not appear in the dictionary table.
My objective is to replace the values of all variables of the first data.table with a matching name in the dictionary table by its corresponding label.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
  if(length(which(repex_DT[[x]] == "0")) == 0){
    values_vector <- labels_DT$label_id
    labels_vector <- labels_DT[[x]]
  } else {
    values_vector <- c("0", labels_DT$label_id)
    labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])
  }
  result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}
What I want is a faster method (the fastest if one exists), since I have thousands of variables to qualify for tens of thousands of records.
Any performance improvements would be more than welcome. I battled with stringi but could not have the function running without errors unless using hard-coded variable names. See example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]
Columns of your 2nd data.table are just lookup vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
edit
You can add NA in the first position of the labels_DT columns (similar to what you did for other missing values), or better yet keep the labels in a list:
labels_list <- list(
  country   = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
  gender    = c("Male","Female"),
  age_group = c("Less than 35","35 and more"),
  status    = c("Employed","Unemployed","Do not want to say"),
  children  = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
Notice that converting to factor first is necessary this way, because the values in repex_DT may not form the sequence 1, 2, 3...
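The factor trick merits a short illustration: subscripting a vector with a factor uses the factor's internal codes 1..k, so non-sequential raw codes are compressed into valid indices. A sketch with made-up codes, mirroring the status variable where 9 means "Do not want to say":

```r
y <- c("Unemployed", "Do not want to say")
x <- c("2", "9", "2")       # raw codes, not a 1..k sequence
y[as.integer(x)]            # index 9 is out of bounds, so the result contains NA
y[factor(as.integer(x))]    # factor codes are 1,2,1: every label is found
```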
A very computationally efficient way would be to melt your tables first, match them, and cast again:
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by @det, it is not possible to consider variables with a starting '0' label in the same loop as other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to @det for the mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]

subset the data frame based on multiple ranges and save each range as element in the list

I want to split the data frame into a list based on which range each value belongs to, so that each range becomes one element of the list. For example, if I have 10 ranges and a data frame with nrow = n, I will get a list of 10 data frames.
The data
df<- data.frame(x=seq(33, 37, 0.12), y=seq(31,35, 0.12))
library(data.table)
range <- data.table(start = c(36.15,36.08,36.02,35.95,35.89,35.82,35.76,35.69),
                    end   = c(36.08,36.02,35.95,35.89,35.82,35.76,35.69,35.63))
I tried
nlist <- list(
  df[which(df$x > 36.15), ],
  df[which(df$x <= 36.15 & df$x > 36.08), ],
  df[which(df$x <= 36.08 & df$x > 36.02), ],
  df[which(df$x <= 36.02 & df$x > 35.95), ],
  df[which(df$x <= 35.95 & df$x > 35.89), ],
  df[which(df$x <= 35.89 & df$x > 35.82), ],
  df[which(df$x <= 35.82 & df$x > 35.76), ],
  df[which(df$x <= 35.76 & df$x > 35.69), ],
  df[which(df$x <= 35.69 & df$x > 35.63), ],
  df[which(df$x <= 35.63), ]
)
There are two problems. Firstly, I want to do this in a loop instead of writing out the values of each range limit. Secondly, this code:
Reduce('+', lapply(nlist, nrow))
produces the sum of rows = 35 whereas my data frame has nrow = 34. Where does this extra value come from?
You could apply over the rows of your range object:
apply(range, 1, function(z) df[df$x > z[2] & df$x <= z[1],])
You can split the data frame according to levels obtained by cutting df$x by range$start. You don't even need a loop for this:
nlist <- split(df, cut(df$x, breaks = c(-Inf, range$start, Inf)))
Or if you want it in the same format (an unnamed list in reverse order), you can do:
nlist <- setNames(rev(split(df, cut(df$x, breaks=c(-Inf, range$start, Inf)))),NULL)
This also gives the correct answer for Reduce:
Reduce('+', lapply(nlist, nrow))
#> [1] 34

Calculate outliers of a group of specific columns then identify ids which have >5 columns with outliers

I'm working with a big dataframe (df). I would like to calculate outliers for a specific subset of columns based off the mean + 3 sd.
I first extracted the columns I wanted, so all the columns with color in their column names.
colors = colnames(df)[grep('color', colnames(df))]
I'm not sure how I should then go about looping it to calculate the outliers across all the columns using this new variable. The formula I had was:
# id those with upper outliers
uthr = mean(df$color)+3*sd(df$color)
rm_u_ids = df$id[which(df$color >= uthr)]
# id those with lower outliers
lthr = mean(df$color)-3*sd(df$color)
rm_l_ids = df$id[which(df$color <= lthr)]
# remove those with both upper and lower outliers
rm_ids = sort(c(rm_u_ids, rm_l_ids))
df_2 = df %>% filter(!id %in% rm_ids)
Now, the actual problem.
I would like to use something similar to do the following:
1) for each color in colors, identify those ids with outliers and maybe save this info elsewhere,
2) using that info (maybe in a list or separate data frame), identify the ids which appear in 5 or more columns/colors,
3) subset the original data frame with this list so we eliminate those ids with outliers in 5 or more color columns.
Does that make sense? I'm not sure if a loop is also recommended for this problem.
Thank you and sorry if I made it sound more complex than it should be!
An alternative to the clever answers already provided is to convert the relevant columns into a matrix and use some fast matrix operations:
df = iris
colors = colnames(iris)[1:4]
m = as.matrix(df[,colors])
# Standardize the numeric values in each column
m = scale(m)
# Apply some outlier definition rules, e.g.
# detect measurements with |Zscore|>3
outliers = abs(m)>3
# detect rows with at least 5 such measurements
outliers = rowSums(outliers)
which(outliers>=5)
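To get from the flagged rows back to ids and drop them (step 3 of the question), one more line suffices; a sketch, adding a hypothetical id column since iris has none:

```r
df$id <- seq_len(nrow(df))  # hypothetical id column for illustration
# Flag rows with 5+ measurements beyond |Z| > 3, then filter them out by id.
bad <- rowSums(abs(scale(as.matrix(df[, colors]))) > 3, na.rm = TRUE) >= 5
df_2 <- df[!df$id %in% df$id[bad], ]
```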
You could create a function which returns the ids of outliers:
find_outlier <- function(df, x) {
  # ids with upper outliers
  uthr <- mean(x) + 3*sd(x)
  rm_u_ids <- df$id[which(x >= uthr)]
  # ids with lower outliers
  lthr <- mean(x) - 3*sd(x)
  rm_l_ids <- df$id[which(x <= lthr)]
  # combine ids with upper and lower outliers
  unique(sort(c(rm_u_ids, rm_l_ids)))
}
Apply it to every colors column, count the occurrences with table, and remove the ids which occur 5 times or more.
all_ids <- lapply(df[colors], find_outlier, df = df)
temp_tab <- table(unlist(all_ids))
remove_ids <- names(temp_tab[temp_tab >= 5])
subset(df, !id %in% remove_ids)
I'm going to assume that your data.frame only has the numeric variables you want
findOutlierCols = function(color.df){
  hasOutliers = function(col){
    bds = mean(col) + c(-3,3)*sd(col)
    if(any(col <= bds[1]) || any(col >= bds[2])){
      return(TRUE)
    } else {
      return(FALSE)
    }
  }
  apply(color.df, 2, hasOutliers)
}
## make some fake data
set.seed(123)
x = matrix(rnorm(1000), ncol = 10)
color.df = data.frame(x)
colnames(x) = paste0("color.", colors()[1:10])
color.df = apply(color.df, 2, function(col){col+rbinom(100, 5, 0.1)})
boxplot(color.df)
findOutlierCols(color.df)
> findOutlierCols(color.df)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE

Merging in R with data.table with two columns

I am fairly new to using data.table but I am using it since I have heard it is faster than data.frame and plan to loop.
I am trying to merge raster data (which comes with longitude "x", latitude "y", and temperature information) onto a master dataset, which is just all possible combinations of "x" and "y" for this particular country I am looking at.
For some reason, data.frame works (the temperature information is merged in; some information is missing but that's expected) while data.table does not (the temperature variable is added but all its values are missing). I think it has something to do with the fact that I am merging on two columns, or maybe the data isn't sorted the right way, but I'm not completely sure.
Below is my code
# Set common parameters
x <- rep(seq(-49.975,49.975, by = 0.05), times = 2000)
y <- rep(seq(-49.975,49.975, by = 0.05), each = 2000)
xy <- cbind(x,y)
## What works
# Create data frame, then subset to possible coordinates of country
df_xy <- data.frame(xy)
eth_df_xy <- subset(df_xy, df_xy$x >= 30 & df_xy$x <= 50 & df_xy$y >=0 & df_xy$y <= 20)
# Bring in raster dataset
examine <- print(paste0(dir_tif, files[[1]]))
sds <- raster(examine)
x <- rasterToPoints(sds)
df_x <- data.frame(x)
# Merge
eth_df_xy <- merge(df_x, eth_df_xy, by = c("x","y"), all.x = F, all.y=T)
## What doesn't work but seems intuitive
# Create data table, then subset to possible coordinates of country (as above)
dt_xy <- data.table(xy)
eth_dt_xy <- subset(dt_xy, dt_xy$x >= 30 & dt_xy$x <= 50 & dt_xy$y >=0 & dt_xy$y <= 20)
# Bring in raster dataset (from above, skip to fourth step)
dt_x <- data.table(x)
# Merge
eth_dt_xy <- merge(dt_x, eth_dt_xy, by = c("x","y"), all.x = F, all.y=T)
Thanks
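Not something the thread resolved, but one plausible culprit to check first: the join keys here are doubles produced by seq() and rasterToPoints(), and coordinates that print identically can differ in their last bits, so an exact two-column merge finds no matches. A sketch of the failure mode and the usual workaround (snapping both sides to the grid resolution before merging):

```r
library(data.table)

print(0.1 + 0.2 == 0.3)  # FALSE: the classic floating-point mismatch

# Keys that "look" equal but differ in the last bits fail an exact merge:
dt_a <- data.table(x = 0.1 + 0.2, temp = 25.0)
dt_b <- data.table(x = 0.3)
merge(dt_a, dt_b, by = "x", all.y = TRUE)   # temp comes back NA

# Rounding both key columns to the grid resolution restores the match:
dt_a[, x := round(x, 3)]
dt_b[, x := round(x, 3)]
merge(dt_a, dt_b, by = "x", all.y = TRUE)   # temp is 25
```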
