Correlation between variables under the for loop - r

I have an issue that is shown below. I tried to solve it but was not successful. I have a dataframe df1. I need to make a table of correlation between the variables within a for loop. Reason being I do not want to make the code look long and complicated.
df1 <- structure(list(a = c(1, 2, 3, 4, 5), b = c(3, 5, 7, 4, 3), c = c(3,
6, 8, 1, 2), d = c(5, 3, 1, 3, 5)), class = "data.frame", row.names =
c(NA, -5L))
I tried with the below code using 2 for loops
fv <- as.data.frame(combn(names(df1),2,paste, collapse="&"))
colnames(fv) <- "ColA"
fv$ColB <- sapply(strsplit(fv$ColA,"\\&"),'[',1)
fv$ColC <- sapply(strsplit(fv$ColA,"\\&"),'[',2)
asd <- list()
for (i in fv$ColB) {
for (j in fv$ColC) {
asd[i,j] <- as.data.frame(cor(df1[,i],df1[,j]))}}
May I know what wrong I am doing

We can apply cor directly on the data.frame and convert to 'long' format with melt. As the values in the lower triangular part is the mirror values of those in the upper triangular part, either one of these can be assigned to NA and then do the melt
library(reshape2)
out[lower.tri(out, diag = TRUE)] <- NA
melt(out, na.rm = TRUE)

Related

Using foreach to create new observations and deleting erroneous observations in parallel

I am currently trying to clean a very large data set. I have working code to clean it, but it takes about three days to run without any parallelization, so I want to parallelize it. The original code works fine, but I can't figure out how to parallelize it in R using the doParallel and foreach packages or any other pre-built ones.
In particular, if I observe two data points that have the same time stamp, they should really be one data point. The non-parallelized code can accurately identify the points, flag them to be deleted later and create a new data point that is correct.
I've tried adapting existing code to convert the for loops into foreach loops using the %do% option provided by the doParallel package. Doing this works fine. Changing the %do% to %dopar% causes the code to stop working. I understand that this is the incorrect way to use %dopar%, but I don't know how to correctly accomplish my goal.
library(doParallel)
library(foreach)
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
date = c(10, 1, 9, 4, 11),
var2 = c(2, 4, 6, 8, 10),
var3 = c(2, 4, 6, 8, 10),
ind = c(0, 0, 0, 0, 0)) #Indicator for problem observations
df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
date = c(12, 10, 7, 5, 6),
var2 = c(2, 4, 6, 8, 10),
var3 = c(2, 4, 6, 8, 10),
ind = c(0, 0, 0, 0, 0))
foreach (row1 = 1:nrow(df1)) %dopar% {
for (row2 in 1:nrow(df2)) {
if(df1[row1, "date"] == df2[row2, "date"]) { #Observations that occur on the same date should be combined
df1[row1, "ind"] <- 1 #Tag problem observations to delete them later
df2[row2, "ind"] <- 1
temp_obs <- data.frame(ID = df2[row2, "ID"],
date = df1[row1, "date"],
var2 = df1[row1, "var2"],
var3 = df1[row1, "var3"] + df2[row2, "var3"],
ind = 0)
df1 <- rbind(df1, temp_obs)
rm(temp_obs)
}
}
}
The sample code demonstrates my problem in a simpler context. It loops through all observations in df1 and df2, and identifies observations with the same date. It should add a 6th observation to df1, and change the indicators from 0 to 1 in the 1st entry of df1 and the second entry of df2 to indicate that they have been matched. As is, this code does not change df1 or df2 at all. It works when %dopar% is replaced with %do%.

Loop over non-standard variable names in R

I have a dataframe (df) with variables that look similar to vector-variables:
myvariable[1], myvariable[2] , myvariable[3] , etc.
However, if I want to refer to them, R automatically creates barticks around them:
df$`myvariable[1]`
I want to use those variables within a for-loop, and hence, want to change the number within the brackets automatically. Does anyone know how to do this?
PS: This question is different from other questions insofar as R doesn't see my variables as vector variables but rather as single variables that look the same. Hence, the []-part of my variables is seen as only some kind of string and not as a subsetting operator.
PS2: dput(head(zTT$subjects[, c("myvariable[1]","myvariable[3]","myvariable[4]")],4))
structure(list(\`myvariable[1]\` = c(2, 4, 2, 9), \`myvariable[3]\` = c(1,
1,2, 3), \`myvariable[4]\` = c(2, 4, 2, 7)), .Names = c("myvariable[1]",
"myvariable[3]", "myvariable[4]"), row.names = c(NA, 4L), class = "data.frame")
As akrun has suggested, you can use [[. The code below uses your own data frame to construct the string which corresponds to the list names.
temp <- structure(list(`myvariable[1]` = c(2, 4, 2, 9),
`myvariable[3]` = c(1, 1,2, 3),
`myvariable[4]` = c(2, 4, 2, 7)),
.Names = c("myvariable[1]", "myvariable[3]",
"myvariable[4]"), row.names = c(NA, 4L),
class = "data.frame")
for (i in c(1, 3, 4)) {
myVar <- paste0("myvariable[", i, "]")
print(temp[[myVar]])
}

R - removing data table rows based on two values

I have a large data frame (tbl_df) with approximately the following information:
data <- data.frame(Energy = sample(1:200, 100, replace = T), strip1 = sample(1:12, 100, replace = T), strip2 = sample(1:12, 100, replace = T))
It has 3 columns. The first is energy, the second and third are strip numbers (where energy was deposited).
Each strip has a different threshold and these are stored in two numeric arrays, each position in the array is for the corresponding strip number:
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)
These tell me the minimum amount of energy the strip can receive. What I want to be able to do is remove the rows from the data frame where BOTH strips do not have over the required threshold.
As an example, if I have the row:
Energy = 4, strip1 = 2, strip2 = 2
Then I would remove this row as although strip2 has a lower threshold than 4, strip1 has a threshold of 6 and so there isn't enough energy here.
Apologies if this question is worded poorly, I couldn't seem to find anything like it in old questions.
filter1 <- data$strip1 >= threshold_strip1[data$strip1]
filter2 <- data$strip2 >= threshold_strip1[data$strip2]
data <- subset(data, filter1 & filter2)
I'd maybe do...
library(data.table)
setDT(data)
# structure lower-bound rules
threshes = list(threshold_strip1, threshold_strip2)
lbDT = data.table(
strip_loc = rep(seq_along(threshes), lengths(threshes)),
strip_num = unlist(lapply(threshes, seq_along)),
thresh = unlist(threshes)
)
# loop over strip locations (strip1, strip2, etc)
# marking where threshold is not met
data[, keep := TRUE]
lbDT[, {
onexpr = c(sprintf("strip%s==s", strip_loc), "Energy<th")
data[.(s = strip_num, th = thresh), on=onexpr, keep := FALSE]
NULL
}, by=strip_loc]
What about this? Using dplyr:
require(dplyr)
data2 <- data %>%
mutate(
strip1_value = threshold_strip1[strip1],
strip2_value = threshold_strip2[strip2],
to_keep = Energy > strip1_value & Energy > strip2_value
) %>%
filter(to_keep == TRUE)

How to efficiently produce a desired matrix in R?

I was trying to produce the following 7 x 4 matrix in R:
m = matrix(c(seq(25, 1, by = -4),
seq(26, 2, by = -4),
seq(27, 3, by = -4),
seq(28, 4, by = -4)), nrow = 7, ncol = 4)
BUT, I'm wondering if could I achieve the same matrix with more efficient R code than what I used above?
Here's a solution:
m <- matrix(rev(c(1:28)),nrow=7,ncol=4,byrow = TRUE)[,rev(1:4)]
And this one is even faster:
m <- matrix(28:1,nrow=7,ncol=4,byrow = TRUE)[,4:1]
m = matrix(c(rep(seq(25, 1, by = -4),4) + rep(c(0:3),each=7) ), nrow = 7, ncol = 4)
Not sure if you would call this more efficient...

Create conditional variable in multiple data.tables (or data.frames)

I want to execute the same action in multiple data.tables (or data.frames). For example, I want to create the same variable conditional on the same rule in all data.tables.
A simple example can be (df1=df2=df3, without loss of generality here)
df1 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 =c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
df2 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 =c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
df3 <- data.frame(var1 = c(1, 2, 2, 2, 1), var2 =c(20, 10, 10, 10, 20), var3 = c(10, 8, 15, 7, 9))
My approach was: (i) to create a list of the data frames (list.df), (ii) to loop on this list trying to create the variable:
list.df
list.df<-vector('list',3)
for(j in 1:3){
name <- paste('df',j,sep='')
list.df[j] <- name
}
My (bad) tentative:
for(i in 1:3){
a<-get(paste(list.df[[i]], "$var1", sep=""))
b<-get(paste(list.df[[i]], "$var2", sep=""))
name<-paste(list.df[[i]], "$var.new", sep="")
assign(name, ifelse(a==2 & b==10, 1, 0))
}
Clearly r cannot create this new variable the way I am doing as I get a error message "object not found".
Any clues on how to fix my bad code? I have a feeling that dplyr could help me but I don't know how.
We can use mget after creating the strings of object names with paste so that we get the values ie. data.frames in a list. We loop through the list (lapply(...,) and transform each dataset by creating the variable ('varNew') which is a binary variable. We can either use ifelse on the logical statement or just wrap with + to coerce the TRUE/FALSE to 1/0.
lst <- lapply(mget(paste0('df', 1:3)), transform,
varNew = +(var1==2 & var2==10))
If we need to update the original objects, we can use list2env.
list2env(lst, envir = .GlobalEnv)
df1
df2

Resources