R - removing data table rows based on two values

I have a large data frame (tbl_df) with approximately the following information:
data <- data.frame(Energy = sample(1:200, 100, replace = TRUE),
                   strip1 = sample(1:12, 100, replace = TRUE),
                   strip2 = sample(1:12, 100, replace = TRUE))
It has 3 columns. The first is energy, the second and third are strip numbers (where energy was deposited).
Each strip has a different threshold; these are stored in two numeric vectors, where each position corresponds to the strip number:
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)
These tell me the minimum amount of energy each strip can receive. What I want to be able to do is remove the rows where the energy does not meet the required threshold for both strips (i.e., a row is kept only if both strips' thresholds are satisfied).
As an example, if I have the row:
Energy = 4, strip1 = 2, strip2 = 2
Then I would remove this row: although strip2's threshold (3) is below 4, strip1's threshold is 6, so there isn't enough energy here.
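To make the example concrete, the check for that row would look like this (using the threshold vectors above; s1 and s2 just stand for the row's strip numbers):
Energy <- 4; s1 <- 2; s2 <- 2
Energy >= threshold_strip1[s1]  # FALSE: strip1's threshold is 6
Energy >= threshold_strip2[s2]  # TRUE: strip2's threshold is 3
# strip1 fails, so the row is removed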
Apologies if this question is worded poorly, I couldn't seem to find anything like it in old questions.

filter1 <- data$Energy >= threshold_strip1[data$strip1]
filter2 <- data$Energy >= threshold_strip2[data$strip2]
data <- subset(data, filter1 & filter2)

I'd maybe do...
library(data.table)
setDT(data)
# structure lower-bound rules
threshes = list(threshold_strip1, threshold_strip2)
lbDT = data.table(
  strip_loc = rep(seq_along(threshes), lengths(threshes)),
  strip_num = unlist(lapply(threshes, seq_along)),
  thresh = unlist(threshes)
)
# loop over strip locations (strip1, strip2, etc),
# marking where threshold is not met
data[, keep := TRUE]
lbDT[, {
  onexpr = c(sprintf("strip%s==s", strip_loc), "Energy<th")
  data[.(s = strip_num, th = thresh), on = onexpr, keep := FALSE]
  NULL
}, by = strip_loc]
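A possible follow-up step, not part of the answer above: once the flag has been set, the rows that failed a threshold can be dropped and the helper column removed with
data <- data[keep == TRUE][, keep := NULL]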

What about this? Using dplyr:
require(dplyr)
data2 <- data %>%
  mutate(
    strip1_value = threshold_strip1[strip1],
    strip2_value = threshold_strip2[strip2],
    to_keep = Energy > strip1_value & Energy > strip2_value
  ) %>%
  filter(to_keep)
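If you don't need the helper columns, roughly the same check can be written directly in filter (a sketch along the lines of the answer above, keeping its strict > comparison):
data2 <- data %>%
  filter(Energy > threshold_strip1[strip1],
         Energy > threshold_strip2[strip2])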

Related

create multiple train, val, test splits using sample

I want to create train, val, test splits (60:20:20) and repeat the process multiple times.
The test set should contain only 2 observations each time, but why does it sometimes contain 1 or 3 observations?
What is the role of replace in sample()? Should I keep it FALSE?
library(dplyr)
tbl <- tibble(id = 1:10)
train <- list()
val <- list()
test <- list()
for (run in 1:5) {
  assignment <- sample(1:3, size = nrow(tbl), prob = c(0.6, 0.2, 0.2), replace = TRUE)
  # Create the train, validation and test sets
  train[[run]] <- tbl[assignment == 1, ]
  val[[run]] <- tbl[assignment == 2, ]
  test[[run]] <- tbl[assignment == 3, ]
}
If we want exactly 6, 2 and 2 values for the labels 1, 2 and 3, just replicate 1:3 the required number of times and shuffle the result with sample:
v1 <- sample(rep(1:3, c(6, 2, 2)))
Then do a split
split(tbl, v1)
When we use prob, the frequencies can vary slightly from run to run because prob only gives sampling probabilities. Regarding replace = TRUE: it is needed in the OP's code because 1:3 has length 3 while size = nrow(tbl) is 10, so without replacement there would not be enough elements to draw.
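Putting the two together, a minimal sketch of the loop with an exact 6/2/2 split in every run (same tbl as above; only the assignment line changes):
train <- list(); val <- list(); test <- list()
for (run in 1:5) {
  assignment <- sample(rep(1:3, c(6, 2, 2)))  # exactly six 1s, two 2s, two 3s, shuffled
  train[[run]] <- tbl[assignment == 1, ]
  val[[run]]   <- tbl[assignment == 2, ]
  test[[run]]  <- tbl[assignment == 3, ]
}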

Correlation between variables under the for loop

I have an issue that is shown below and have not been able to solve it. I have a dataframe df1 and need to build a table of correlations between its variables inside a for loop, because I do not want the code to become long and complicated.
df1 <- structure(list(a = c(1, 2, 3, 4, 5), b = c(3, 5, 7, 4, 3),
                      c = c(3, 6, 8, 1, 2), d = c(5, 3, 1, 3, 5)),
                 class = "data.frame", row.names = c(NA, -5L))
I tried the code below using two for loops:
fv <- as.data.frame(combn(names(df1), 2, paste, collapse = "&"))
colnames(fv) <- "ColA"
fv$ColB <- sapply(strsplit(fv$ColA, "\\&"), '[', 1)
fv$ColC <- sapply(strsplit(fv$ColA, "\\&"), '[', 2)
asd <- list()
for (i in fv$ColB) {
  for (j in fv$ColC) {
    asd[i, j] <- as.data.frame(cor(df1[, i], df1[, j]))
  }
}
May I know what I am doing wrong?
We can apply cor directly on the data.frame and convert the result to 'long' format with melt. As the values in the lower triangular part are mirror images of those in the upper triangular part, either one can be set to NA before the melt:
library(reshape2)
out <- cor(df1)
out[lower.tri(out, diag = TRUE)] <- NA
melt(out, na.rm = TRUE)
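For what it's worth, a minimal sketch of how the OP's loop could be repaired: iterate over the rows of fv (the pairs) rather than over ColB and ColC independently, and store each correlation under the pair's name (the loop index k is my own):
asd <- list()
for (k in seq_len(nrow(fv))) {
  i <- fv$ColB[k]
  j <- fv$ColC[k]
  asd[[paste(i, j, sep = "&")]] <- cor(df1[, i], df1[, j])
}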

Using foreach to create new observations and deleting erroneous observations in parallel

I am currently trying to clean a very large data set. I have working code to clean it, but it takes about three days to run without any parallelization, so I want to parallelize it. The original code works fine, but I can't figure out how to parallelize it in R using the doParallel and foreach packages or any other pre-built ones.
In particular, if I observe two data points that have the same time stamp, they should really be one data point. The non-parallelized code can accurately identify the points, flag them to be deleted later and create a new data point that is correct.
I've tried adapting existing code to convert the for loops into foreach loops using the %do% option provided by the doParallel package. Doing this works fine. Changing the %do% to %dopar% causes the code to stop working. I understand that this is the incorrect way to use %dopar%, but I don't know how to correctly accomplish my goal.
library(doParallel)
library(foreach)
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(10, 1, 9, 4, 11),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0)) # Indicator for problem observations
df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(12, 10, 7, 5, 6),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0))
foreach(row1 = 1:nrow(df1)) %dopar% {
  for (row2 in 1:nrow(df2)) {
    if (df1[row1, "date"] == df2[row2, "date"]) { # Observations that occur on the same date should be combined
      df1[row1, "ind"] <- 1 # Tag problem observations to delete them later
      df2[row2, "ind"] <- 1
      temp_obs <- data.frame(ID = df2[row2, "ID"],
                             date = df1[row1, "date"],
                             var2 = df1[row1, "var2"],
                             var3 = df1[row1, "var3"] + df2[row2, "var3"],
                             ind = 0)
      df1 <- rbind(df1, temp_obs)
      rm(temp_obs)
    }
  }
}
The sample code demonstrates my problem in a simpler context. It loops through all observations in df1 and df2, and identifies observations with the same date. It should add a 6th observation to df1, and change the indicators from 0 to 1 in the 1st entry of df1 and the second entry of df2 to indicate that they have been matched. As is, this code does not change df1 or df2 at all. It works when %dopar% is replaced with %do%.
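One way to think about it: with %dopar% every worker gets its own copy of df1 and df2, so assignments made inside the loop never reach the master session; foreach is built around returning a value from each iteration and combining the results. A minimal sketch of that pattern for this example (the cluster size, .combine choice and post-processing are my own, not the OP's code):
library(doParallel)
library(foreach)
cl <- makeCluster(2)
registerDoParallel(cl)
# each iteration returns the matches it found; nothing is modified in place
matches <- foreach(row1 = 1:nrow(df1), .combine = rbind) %dopar% {
  out <- NULL
  for (row2 in 1:nrow(df2)) {
    if (df1[row1, "date"] == df2[row2, "date"]) {
      out <- rbind(out,
                   data.frame(row1 = row1, row2 = row2,
                              ID = df2[row2, "ID"],
                              date = df1[row1, "date"],
                              var2 = df1[row1, "var2"],
                              var3 = df1[row1, "var3"] + df2[row2, "var3"],
                              ind = 0))
    }
  }
  out
}
stopCluster(cl)
# apply the flags and append the combined rows back on the master
if (!is.null(matches)) {
  df1$ind[matches$row1] <- 1
  df2$ind[matches$row2] <- 1
  df1 <- rbind(df1, matches[, c("ID", "date", "var2", "var3", "ind")])
}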

ffbase: merge on columns X and Y and closest column Z

I would like to accomplish the following using ffdf: merge on columns X and Y and then on the closest value of the remaining column (Time in the example below). The procedure I know for smaller samples involves outer merges (as shown below). What is a way around this for a large sample that won't fit in memory (and probably wouldn't work with sqldf), using ffbase? If that is not possible, what would be the best library for this?
As a reproducible example:
library(ffbase)  # also attaches ff, which provides as.ffdf
set.seed(1)
df.ff <- as.ffdf(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))
to.merge.ff <- as.ffdf(data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                                  time = c(17, 12, 11.6, 22.5, 2),
                                  val = letters[1:5], stringsAsFactors = FALSE))
I borrow the following example from #ChinmayPatil to highlight the procedure I would like to follow (R - merge dataframes on matching A, B and *closest* C?):
require(data.table)
set.seed(1)
df <- setDT(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))
to.merge <- setDT(data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                             time = c(17, 12, 11.6, 22.5, 2),
                             val = letters[1:5], stringsAsFactors = FALSE))
## First do a left outer merge
A <- merge(to.merge, df, by = c('x', 'y'), all.x = TRUE)
## Then calculate a diff column
A$diff <- abs(A$time.x - A$time.y)
## Then take the row with the minimum distance per group
A[, .I[which.min(diff)], by = c('x', 'y')]
Given that my question got so few views and no answers, I will describe the approach I came up with, in the hope that someone might find it useful (or as a reminder for myself later):
To me, the most difficult aspect of performing this match on one column and then a nearest match on another column was that I kept thinking an outer join (as described in the post) was necessary. The solution is pretty simple using data.table and ffdfdply. For the purpose of illustration, assume there is one large ffdf object and one regular data.table that fits in memory:
### Large ffdf object
A <- as.ffdf(data.table(dates.A = seq.Date(as.Date('2008-01-01'), as.Date('2008-01-31'), by = '3 days'),
                        letters.A = LETTERS[1:4], value.A = runif(4)))
### Small data.table that fits in memory
B <- data.table(date.B = seq.Date(as.Date('2008-01-01'), as.Date('2008-01-05'), by = 'days'),
                letters.B = LETTERS[1:4], value.B = runif(4))
Then you can simply define a function that does the merging using data.table and roll = 'nearest':
merge.ff <- function(x) {
  setDT(x)
  x[, ':=' (dates.merge = dates.A, letters.merge = letters.A)]
  B[, ':=' (dates.merge = date.B, letters.merge = letters.B)]
  setkeyv(x, c('letters.merge', 'dates.merge'))
  setkeyv(B, c('letters.merge', 'dates.merge'))
  as.data.frame(B[x, roll = 'nearest'])
}
and apply it to A:
result <- ffdfdply( A, split = A$dates.A, FUN = merge.ff)
The key was essentially using the roll join in data.table and passing it to ffdfdply. It seemed to be quite efficient.
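For readers unfamiliar with rolling joins, a minimal in-memory sketch of what roll = 'nearest' does on plain data.tables (toy data and names of my own):
library(data.table)
left  <- data.table(grp = c("A", "A", "B"), t = c(1, 9, 4),   val_left  = 1:3)
right <- data.table(grp = c("A", "A", "B"), t = c(2.5, 8, 7), val_right = c("x", "y", "z"))
setkey(left, grp, t)
setkey(right, grp, t)
# for each row of left, take the row of right with the same grp and the nearest t
right[left, roll = "nearest"]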

Divide vector with grouping vector

I have two vectors which I would like to combine in one dataframe. The values of one vector (value) need to be divided over two columns. The second vector nc gives the number of values for each observation: if nc is 1, only one value is given in value (it goes into val1) and 999 is to be written in the second column (val2).
What is an R-ish way to divide the vector value and populate the two columns of df? I suspect I am missing something very obvious, but can't proceed at the moment... Many thanks!
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
# result by hand
df <- data.frame(nc = nc,
                 val1 = c(6, 3, 4, 1, 2, 2, 6, 5, 6, 5),
                 val2 = c(999, 5, 999, 6, 1, 999, 6, 4, 4, 999))
Here's an approach based on this answer:
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
splitUsing <- function(x, pos) {
  unname(split(x, cumsum(seq_along(x) %in% cumsum(replace(pos, 1, pos[1] + 1)))))
}
combineValues <- function(vals, nums) {
  mydf <- data.frame(cbind(nums, do.call(rbind, splitUsing(vals, nums))))
  mydf$V3[mydf$nums == 1] <- 999
  return(mydf)
}
df <- combineValues(value, nc)
I think this is what you are looking for. I'm not sure it is the fastest way, but it should do the trick.
# start from a data frame with the right number of rows (the question builds df by hand above)
df <- data.frame(nc = nc, val1 = NA, val2 = NA)
count <- 0
for (i in 1:length(nc)) {
  count <- count + nc[i]
  if (nc[i] == 1) {
    df$val1[i] <- value[count]
    df$val2[i] <- 999
  } else {
    df$val1[i] <- value[count - 1]
    df$val2[i] <- value[count]
  }
}
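For completeness, a vectorised sketch of the same idea, using cumsum to locate each observation's last value (the start/end names are my own):
end <- cumsum(nc)         # index of the last value for each observation
start <- end - nc + 1     # index of the first value for each observation
df <- data.frame(nc = nc,
                 val1 = value[start],
                 val2 = ifelse(nc == 2, value[end], 999))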
