Create conditional variable in multiple data.tables (or data.frames) - r

I want to execute the same action in multiple data.tables (or data.frames). For example, I want to create the same variable conditional on the same rule in all data.tables.
A simple example can be (df1=df2=df3, without loss of generality here)
df1 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 =c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
df2 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 =c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
df3 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 =c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
My approach was: (i) to create a list of the data frames (list.df), (ii) to loop on this list trying to create the variable:
list.df
list.df<-vector('list',3)
for(j in 1:3){
name <- paste('df',j,sep='')
list.df[j] <- name
}
My (bad) attempt:
for(i in 1:3){
a<-get(paste(list.df[[i]], "$var1", sep=""))
b<-get(paste(list.df[[i]], "$var2", sep=""))
name<-paste(list.df[[i]], "$var.new", sep="")
assign(name, ifelse(a==2 & b==10, 1, 0))
}
Clearly R cannot create the new variable this way: get() and assign() look for an object literally named "df1$var1", so I get an "object not found" error.
Any clues on how to fix my bad code? I have a feeling that dplyr could help me but I don't know how.

We can use mget, after building the object names with paste0, to get the data.frames in a list. We then loop through the list with lapply and transform each dataset by creating the binary variable 'varNew'. We can either use ifelse on the logical condition or wrap it in unary + to coerce TRUE/FALSE to 1/0.
lst <- lapply(mget(paste0('df', 1:3)), transform,
varNew = +(var1==2 & var2==10))
If we need to update the original objects, we can use list2env.
list2env(lst, envir = .GlobalEnv)
df1
df2
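If the data frames do not have to exist as separate objects, the same idea can be sketched by keeping them in a named list from the start, which avoids mget and list2env altogether. The helper name add_flag and the small example data are assumptions made for illustration only.

```r
# Keep the data frames in a named list instead of separate global objects.
dfs <- list(
  df1 = data.frame(var1 = c(1, 2, 2, 2, 1), var2 = c(20, 10, 10, 10, 20)),
  df2 = data.frame(var1 = c(2, 1, 2), var2 = c(10, 10, 20))
)

# Hypothetical helper: add a 0/1 flag for rows where var1 == 2 and var2 == 10.
add_flag <- function(d) {
  d$varNew <- as.integer(d$var1 == 2 & d$var2 == 10)
  d
}

# Apply the same rule to every data frame in the list.
dfs <- lapply(dfs, add_flag)
dfs$df1
```

Each element is then available as dfs$df1, dfs$df2, and so on, so no names have to be pasted together.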

Related

Check the correlation of two columns in a dataframe (in R)

I want to check the Correlation between two columns (c1 and c2) in my Dataframe (df)
cor(df$c1, df$c2, method = c("pearson", "kendall", "spearman"))
But as an output I just get "NA"
You get this result because one of your columns has an NA value in it. To deal with the NA values, set the use parameter.
From the documentation for cor()
use
an optional character string giving a method for computing
covariances in the presence of missing values. This must be (an
abbreviation of) one of the strings "everything", "all.obs",
"complete.obs", "na.or.complete", or "pairwise.complete.obs".
The use argument defaults to "everything", which produces an NA result if there are any NA values. Set use to "complete.obs" to use only the rows in which both values are non-NA.
cor(df$c1, df$c2, method = c("pearson", "kendall", "spearman"),
use = "complete.obs")
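As a quick check of what "complete.obs" does, the following sketch compares it with dropping the incomplete pairs by hand. The vectors x and y are made-up example data.

```r
# Two vectors with missing values in different positions (example data).
x <- c(1, 2, 3, 4, 5, NA, 7, 8, 9, 10)
y <- c(0, NA, 3, 4, 5, 6, 8, 8, 9, 10)

# Default use = "everything": any NA makes the result NA.
r_default <- cor(x, y)

# "complete.obs" keeps only the pairs where both values are present.
r_complete <- cor(x, y, use = "complete.obs")

# The same result, computed by removing incomplete pairs manually.
ok <- complete.cases(x, y)
r_manual <- cor(x[ok], y[ok])
```

Here r_default is NA, while r_complete and r_manual agree because both are computed from the same eight complete pairs.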
Your call of cor appears to be correct.
It works well for me using the minimal example below:
df <- data.frame(
c1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
c2 = c(0, 2, 3, 4, 5, 6, 8, 8, 9, 10)
)
cor(df$c1, df$c2, method = c("pearson", "kendall", "spearman"))
which gives you the wanted output: [1] 0.9918652
I suspect your df data frame includes some NA values.
If any NA exists in your df then output of the cor call would indeed be NA.
Try calling cor with the use parameter set to "na.or.complete" instead, like this:
df <- data.frame(
c1 = c(1, 2, 3, 4, 5, NA, 7, 8, 9, 10),
c2 = c(0, NA, 3, 4, 5, 6, 8, 8, 9, 10)
)
cor(df$c1, df$c2, method = c("pearson", "kendall", "spearman"), use = "na.or.complete")
This way you do not have to worry about NAs anymore. The correlation is computed from the rows in which both values are present, and rows containing an NA are skipped.
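The two options "complete.obs" and "na.or.complete" behave the same way as long as at least one complete pair exists. They differ only when there is none, as this small sketch with made-up vectors is meant to show.

```r
# No position has both values present, so there are no complete pairs.
x <- c(1, NA, 3)
y <- c(NA, 2, NA)

# "na.or.complete" returns NA when there are no complete cases.
r_na <- cor(x, y, use = "na.or.complete")

# "complete.obs" signals an error in the same situation.
res <- try(cor(x, y, use = "complete.obs"), silent = TRUE)
failed <- inherits(res, "try-error")
```

Which one is preferable depends on whether a silent NA or a hard error is the more useful signal for empty input.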

Correlation between variables under the for loop

I have an issue that is shown below. I tried to solve it but was not successful. I have a dataframe df1. I need to make a table of correlation between the variables within a for loop. Reason being I do not want to make the code look long and complicated.
df1 <- structure(list(a = c(1, 2, 3, 4, 5), b = c(3, 5, 7, 4, 3), c = c(3,
6, 8, 1, 2), d = c(5, 3, 1, 3, 5)), class = "data.frame", row.names =
c(NA, -5L))
I tried with the below code using 2 for loops
fv <- as.data.frame(combn(names(df1),2,paste, collapse="&"))
colnames(fv) <- "ColA"
fv$ColB <- sapply(strsplit(fv$ColA,"\\&"),'[',1)
fv$ColC <- sapply(strsplit(fv$ColA,"\\&"),'[',2)
asd <- list()
for (i in fv$ColB) {
for (j in fv$ColC) {
asd[i,j] <- as.data.frame(cor(df1[,i],df1[,j]))}}
May I know what I am doing wrong?
We can apply cor directly to the data.frame and convert the result to 'long' format with melt. Because the values in the lower triangle mirror those in the upper triangle, either half (together with the diagonal) can be set to NA before melting.
library(reshape2)
out <- cor(df1)
out[lower.tri(out, diag = TRUE)] <- NA
melt(out, na.rm = TRUE)
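If reshape2 is not available, the same long table can be sketched in base R by indexing the upper triangle of the correlation matrix. The result name res and its column names are assumptions chosen to resemble melt's output.

```r
df1 <- data.frame(a = c(1, 2, 3, 4, 5), b = c(3, 5, 7, 4, 3),
                  c = c(3, 6, 8, 1, 2), d = c(5, 3, 1, 3, 5))

# Full correlation matrix of all columns.
out <- cor(df1)

# Row/column positions of the upper triangle (each pair once, no diagonal).
idx <- which(upper.tri(out), arr.ind = TRUE)

# One row per variable pair, with its correlation.
res <- data.frame(
  Var1  = rownames(out)[idx[, "row"]],
  Var2  = colnames(out)[idx[, "col"]],
  value = out[idx]
)
res
```

With four variables this gives six rows, one per unordered pair.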

Using foreach to create new observations and deleting erroneous observations in parallel

I am currently trying to clean a very large data set. I have working code to clean it, but it takes about three days to run without any parallelization, so I want to parallelize it. The original code works fine, but I can't figure out how to parallelize it in R using the doParallel and foreach packages or any other pre-built ones.
In particular, if I observe two data points that have the same time stamp, they should really be one data point. The non-parallelized code can accurately identify the points, flag them to be deleted later and create a new data point that is correct.
I've tried adapting existing code to convert the for loops into foreach loops using the %do% option provided by the doParallel package. Doing this works fine. Changing the %do% to %dopar% causes the code to stop working. I understand that this is the incorrect way to use %dopar%, but I don't know how to correctly accomplish my goal.
library(doParallel)
library(foreach)
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
date = c(10, 1, 9, 4, 11),
var2 = c(2, 4, 6, 8, 10),
var3 = c(2, 4, 6, 8, 10),
ind = c(0, 0, 0, 0, 0)) #Indicator for problem observations
df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
date = c(12, 10, 7, 5, 6),
var2 = c(2, 4, 6, 8, 10),
var3 = c(2, 4, 6, 8, 10),
ind = c(0, 0, 0, 0, 0))
foreach (row1 = 1:nrow(df1)) %dopar% {
for (row2 in 1:nrow(df2)) {
if(df1[row1, "date"] == df2[row2, "date"]) { #Observations that occur on the same date should be combined
df1[row1, "ind"] <- 1 #Tag problem observations to delete them later
df2[row2, "ind"] <- 1
temp_obs <- data.frame(ID = df2[row2, "ID"],
date = df1[row1, "date"],
var2 = df1[row1, "var2"],
var3 = df1[row1, "var3"] + df2[row2, "var3"],
ind = 0)
df1 <- rbind(df1, temp_obs)
rm(temp_obs)
}
}
}
The sample code demonstrates my problem in a simpler context. It loops through all observations in df1 and df2, and identifies observations with the same date. It should add a 6th observation to df1, and change the indicators from 0 to 1 in the 1st entry of df1 and the second entry of df2 to indicate that they have been matched. As is, this code does not change df1 or df2 at all. It works when %dopar% is replaced with %do%.
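With %dopar%, each iteration runs in a worker on its own copy of df1 and df2, so assignments made inside the loop body never reach the data frames in the calling session. One possible way around this, sketched here in base R rather than with foreach, is to find the matching rows in a single vectorised step and then update both data frames. The sketch assumes each date occurs at most once in df2, because match() returns only the first match.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5), date = c(10, 1, 9, 4, 11),
                  var2 = c(2, 4, 6, 8, 10), var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0))
df2 <- data.frame(ID = c(1, 2, 3, 4, 5), date = c(12, 10, 7, 5, 6),
                  var2 = c(2, 4, 6, 8, 10), var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0))

# For each row of df1, the position of the df2 row with the same date (or NA).
pos <- match(df1$date, df2$date)
hit <- !is.na(pos)

# Build the combined observations for all matches at once.
new_obs <- data.frame(ID   = df2$ID[pos[hit]],
                      date = df1$date[hit],
                      var2 = df1$var2[hit],
                      var3 = df1$var3[hit] + df2$var3[pos[hit]],
                      ind  = 0)

# Flag the matched rows in both data frames, then append the new rows.
df1$ind[hit] <- 1
df2$ind[pos[hit]] <- 1
df1 <- rbind(df1, new_obs)
```

On the sample data this flags the 1st row of df1 and the 2nd row of df2 and appends a 6th row to df1, which is the behaviour described above.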

R - removing data table rows based on two values

I have a large data frame (tbl_df) with approximately the following information:
data <- data.frame(Energy = sample(1:200, 100, replace = T), strip1 = sample(1:12, 100, replace = T), strip2 = sample(1:12, 100, replace = T))
It has 3 columns. The first is energy, the second and third are strip numbers (where energy was deposited).
Each strip has a different threshold and these are stored in two numeric arrays, each position in the array is for the corresponding strip number:
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)
These tell me the minimum amount of energy the strip can receive. What I want to be able to do is remove the rows from the data frame where BOTH strips do not have over the required threshold.
As an example, if I have the row:
Energy = 4, strip1 = 2, strip2 = 2
Then I would remove this row as although strip2 has a lower threshold than 4, strip1 has a threshold of 6 and so there isn't enough energy here.
Apologies if this question is worded poorly, I couldn't seem to find anything like it in old questions.
# compare the deposited energy with the threshold of each strip that was hit
filter1 <- data$Energy >= threshold_strip1[data$strip1]
filter2 <- data$Energy >= threshold_strip2[data$strip2]
data <- subset(data, filter1 & filter2)
I'd maybe do...
library(data.table)
setDT(data)
# structure lower-bound rules
threshes = list(threshold_strip1, threshold_strip2)
lbDT = data.table(
strip_loc = rep(seq_along(threshes), lengths(threshes)),
strip_num = unlist(lapply(threshes, seq_along)),
thresh = unlist(threshes)
)
# loop over strip locations (strip1, strip2, etc)
# marking where threshold is not met
data[, keep := TRUE]
lbDT[, {
onexpr = c(sprintf("strip%s==s", strip_loc), "Energy<th")
data[.(s = strip_num, th = thresh), on=onexpr, keep := FALSE]
NULL
}, by=strip_loc]
What about this? Using dplyr:
require(dplyr)
data2 <- data %>%
mutate(
strip1_value = threshold_strip1[strip1],
strip2_value = threshold_strip2[strip2],
to_keep = Energy > strip1_value & Energy > strip2_value
) %>%
filter(to_keep == TRUE)
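The threshold lookup can be checked on a few hand-picked rows, including the example row from the question (Energy = 4, strip1 = 2, strip2 = 2). This is a minimal base R sketch that assumes a row passes when Energy is at least the threshold. Use > instead of >= if the energy must strictly exceed it.

```r
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)

# Three hand-picked rows; the first is the example row from the question.
data <- data.frame(Energy = c(4, 7, 1),
                   strip1 = c(2, 2, 6),
                   strip2 = c(2, 4, 6))

# Indexing the threshold vector by strip number looks up each row's threshold.
keep <- data$Energy >= threshold_strip1[data$strip1] &
        data$Energy >= threshold_strip2[data$strip2]

kept <- data[keep, ]
```

Only the second row survives: the first fails the strip1 threshold of 6, and the third fails the strip2 threshold of 2.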

Divide vector with grouping vector

I have two vectors, which I would like to combine in one dataframe. One of the vectors values needs to be divided into two columns. The second vector nc informs about the number of values for each observation. If nc is 1, only one value is given in values (which goes into val1) and 999 is to be written in the second column (val2).
What is an r-ish way to divide vector value and populate the two columns of df? I suspect I miss something very obvious, but can't proceed at the moment...Many thanks!
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
# result by hand
df <- data.frame(nc = nc,
val1 = c(6, 3, 4, 1, 2, 2, 6, 5, 6, 5),
val2 = c(999, 5, 999, 6, 1, 999, 6, 4, 4, 999))
Here's an approach based on this answer:
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
splitUsing <- function(x, pos) {
unname(split(x, cumsum(seq_along(x) %in% cumsum(replace(pos, 1, pos[1] + 1)))))
}
combineValues <- function(vals, nums) {
mydf <- data.frame(cbind(nums, do.call(rbind, splitUsing(vals, nums))))
mydf$V3[mydf$nums == 1] <- 999
return(mydf)
}
df <- combineValues(value, nc)
I think this is what you are looking for. I'm not sure it is the fastest way, but it should do the trick.
count <- 0
for (i in 1:length(nc)) {
count <- count + nc[i]
if(nc[i]==1) {
df$val1[i] <- value[count]
df$val2[i] <- 999
} else {
df$val1[i] <- value[count-1]
df$val2[i] <- value[count]
}
}
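The loop can also be written without explicit iteration, since the cumulative sum of nc gives the position of the last value belonging to each observation. This is a sketch on small made-up inputs, and it assumes nc only ever contains 1 or 2.

```r
# Example inputs: 3 observations with 1, 2 and 1 values respectively.
nc    <- c(1, 2, 1)
value <- c(6, 3, 5, 4)

# Position of the last and first value of each observation within `value`.
last  <- cumsum(nc)
first <- last - nc + 1

df <- data.frame(
  nc   = nc,
  val1 = value[first],
  # A second value exists only when nc == 2; otherwise fill with 999.
  val2 = ifelse(nc == 2, value[last], 999)
)
df
```

For these inputs val1 is 6, 3, 4 and val2 is 999, 5, 999.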
