Error in R: subscript out of bounds - r

So I'm trying to do something very simple: loop over a data frame and calculate the maximum correlation coefficient between any pair of columns.
I am trying to do this in R.
My data frame has been read using fread()
Here's my code. I declared max=-1, a=0 and b=0 at the start.
for(i in 2:1933)
{
for(j in i+1:1934)
{
if(is.numeric(data[[i]]) && is.numeric(data[[j]]))
{
if(isTRUE(sd(data[[i]], na.rm=TRUE) !=0) && isTRUE(sd(data[[j]], na.rm=TRUE) !=0))
{
c = cor(data[[i]], data[[j]], use="pairwise.complete.obs")
if(isTRUE(c>=max))
{
max = c
a = i
b = j
}
}
}
}
}
The error I get is
Error in .subset2(x, i, exact = exact) : subscript out of bounds
I do have 1934 columns, I can't figure out the problem. Am I missing something fairly obvious?

There's a much easier way to do this: cor(...) takes a matrix (nr X nc) and returns a new matrix (nc X nc) with the correlation coefficient of every column against every other column. The rest is pretty straightforward:
library(data.table) # to simulate fread(...)
set.seed(1) # for reproducible example
dt <- as.data.table(matrix(1:50+rnorm(50,sd=5), ncol=5)) # create reproducible example
result <- cor(dt, use="pairwise.complete.obs") # matrix of correlation coefficients
diag(result) <- NA # set diagonals to NA
max(result, na.rm=TRUE) # maximum correlation coefficient
# [1] 0.7165304
which(result==max(result, na.rm=TRUE), arr.ind=TRUE) # location of max
# row col
# V3 3 2
# V2 2 3
There are two locations because of course the correlation between col 2 and 3 is the same as the correlation between cols 3 and 2.
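As to why the original loop fails: in R the : operator binds more tightly than +, so i+1:1934 is read as i + (1:1934) and j runs up to i + 1934, past the last column. Writing (i+1):1934 gives the intended range. A small sketch with made-up bounds shows the difference:

```r
i <- 2

# `:` is evaluated before `+`, so this is i + (1:5), not (i + 1):5
buggy <- i + 1:5      # 3 4 5 6 7
fixed <- (i + 1):5    # 3 4 5

# With the bounds from the question, the largest j requested when i = 2
max_j_buggy <- max(i + 1:1934)     # 1936, beyond the 1934 columns
max_j_fixed <- max((i + 1):1934)   # 1934

print(buggy)
print(fixed)
```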

Try this:
drop_list <- NULL
# Assume the first column is an ID column
feature.names <- names(data)[2:length(names(data))]
for(f in feature.names){
# Check the type first so sd() is never called on a non-numeric column
if(!is.numeric(data[[f]]) || isTRUE(sd(data[[f]], na.rm=TRUE) == 0))
{
drop_list <- c(drop_list, f)
}
}
data <- data[,!(names(data) %in% drop_list)]
corr_data <- cor(data, use="pairwise.complete.obs")
##remove Correlation between same variables
for(i in 1:dim(corr_data)[1]){corr_data[i,i] <- -99 }
#Please try to sort the correlation data.frame accordingly with which function as Howard suggested
Cheers
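One possible way to do the sorting step mentioned above is to keep only the upper triangle of the correlation matrix, so each pair appears once, and order the pairs by coefficient. This is only a sketch on made-up data; the matrix m and the data frame pairs are names introduced here for illustration.

```r
set.seed(1)
# Hypothetical stand-in for the cleaned numeric data
m <- matrix(rnorm(200), ncol = 5, dimnames = list(NULL, paste0("V", 1:5)))
corr_data <- cor(m, use = "pairwise.complete.obs")

# Row/column positions of the upper triangle (each pair once, no diagonal)
idx <- which(upper.tri(corr_data), arr.ind = TRUE)

pairs <- data.frame(
  col_a = rownames(corr_data)[idx[, "row"]],
  col_b = colnames(corr_data)[idx[, "col"]],
  r     = corr_data[idx]
)

# Highest correlation first
pairs <- pairs[order(pairs$r, decreasing = TRUE), ]
head(pairs, 3)
```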

Related

Calculate the accuracy of an imputation function in R

I'm trying to test various imputation methods in R and I've written a function which takes a data frame, inserts some random NA values, imputes the missing values and then compares the imputation method back to the original data using MAE.
My function looks as follows:
pacman::p_load(tidyverse)
impute_diamonds_accuracy <- function(df, col, prop) {
require(tidyverse)
# Sample the indices of the rows to convert to NA
n <- nrow(df)
idx_na <- sample(1:n, prop*n)
# Convert the values at the sampled indices to NA
df[idx_na, col] <- NA
# Impute missing values using mice with pmm method
imputed_df <- mice::mice(df, method='pmm', m=1, maxit=10)
imputed_df <- complete(imputed_df)
# Calculate MAE between imputed and original values
mae <- mean(abs(imputed_df[idx_na, col] - df[idx_na, col]), na.rm = TRUE)
return(list(original_data = df,imputed_data = imputed_df, accuracy = mae))
}
impute_diamonds_accuracy(df = diamonds, col = 'cut', prop = 0.02)
The function prints to the screen that it's doing the imputation but it fails when it performs that MAE calculation with the following error:
Error in imputed_df[idx_na, col] - df[idx_na, col] :
non-numeric argument to binary operator
How can I compare the original data against the imputed version to get a sense of the accuracy?
diamonds is a tibble.
> library(ggplot2)
> data(diamonds)
> is_tibble(diamonds)
[1] TRUE
so we need [[ to extract the column as a vector; [ on a tibble returns a one-column tibble, which cannot be subtracted like a vector. Also, by the time the MAE is computed, the rows in idx_na have already been set to NA in df, so there is nothing left to compare against. Make a copy of the original data before assigning the NAs, then compare the imputed values with that copy:
mae <- mean(abs(imputed_df[[col]][idx_na] - df_cpy[[col]][idx_na]), na.rm = TRUE)
Full code:
impute_diamonds_accuracy <- function(df, col, prop) {
# Sample the indices of the rows to convert to NA
n <- nrow(df)
idx_na <- sample(1:n, prop*n)
df_cpy <- df # untouched copy of the original; R copies on modify, so plain assignment is enough
# Convert the values at the sampled indices to NA
df[idx_na, col] <- NA
# Impute missing values using mice with pmm method
imputed_df <- mice::mice(df, method='pmm', m=1, maxit=10)
imputed_df <- mice::complete(imputed_df)
# Calculate MAE between imputed and original values
mae <- mean(abs(imputed_df[[col]][idx_na] - df_cpy[[col]][idx_na]), na.rm = TRUE)
return(list(original_data = df_cpy, imputed_data = imputed_df, accuracy = mae))
}
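Note that cut in diamonds is an ordered factor, so subtraction is still not meaningful for it even with the [[ fix; MAE only makes sense for numeric columns such as carat. For a categorical column, one option is the share of masked values that were imputed incorrectly. The helper below is a hypothetical sketch (imputation_error is not part of mice) and uses small made-up vectors rather than a real imputation:

```r
# Hypothetical helper: error between original and imputed values at the masked rows
imputation_error <- function(original, imputed, idx) {
  if (is.numeric(original)) {
    # Mean absolute error for numeric columns
    mean(abs(imputed[idx] - original[idx]), na.rm = TRUE)
  } else {
    # Misclassification rate for factor / character columns
    mean(as.character(imputed[idx]) != as.character(original[idx]), na.rm = TRUE)
  }
}

# Made-up numeric example: rows 2 and 4 were masked and imputed
orig_num <- c(1, 2, 3, 4)
imp_num  <- c(1, 2.5, 3, 3)
err_num  <- imputation_error(orig_num, imp_num, idx = c(2, 4))

# Made-up categorical example: one of the two masked rows is wrong
orig_cut <- factor(c("Good", "Ideal", "Fair", "Ideal"))
imp_cut  <- factor(c("Good", "Good", "Fair", "Ideal"))
err_cut  <- imputation_error(orig_cut, imp_cut, idx = c(2, 4))

print(c(numeric = err_num, categorical = err_cut))
```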

R - I have a for loop to identify outliers in each row for one column - how to loop to look across every column?

I have a long (row per pt) dataset, with columns for numerous variables. I've created a for loop to run over each row and print the id of every participant that is an outlier based on their results for a specific column/variable. In the below example looking at column x, this correctly identifies Pt6 as an outlier on variable x.
dat <- data.frame(id=c("Pt1","Pt2", "Pt3","Pt4", "Pt5", "Pt6"),
x=c(1,3,3,3,5,31),
y=c(2,9,10,10.5,10.5,11),
z=c(34,34,34,35,68,36))
for (row in 1:nrow(dat)) {
variable <- dat[row, "x"]
id <- dat[row, "id"]
if((variable>(mean(dat$x, na.rm=TRUE)
+ (2*sd(dat$x, na.rm=TRUE))))
|
(variable<(mean(dat$x, na.rm=TRUE)
- (2*sd(dat$x, na.rm=TRUE))))
)
{
print(id)
}}
However, I'd like to identify all participants that are an outlier based on each column individually - in the example data, it should identify Pt6 because of their x value AND Pt1 because of their y value AND Pt5 because of their z value.
I know I'll need to nest another for loop to go over the columns, something like the below, but it only identifies Pt5 so I think it is not looking at the columns individually?
for (row in 1:nrow(dat)) {
for (col in 1:ncol(dat))
value <- dat[row, col]
id <- dat[row, "id"]
if((value>(mean(dat[[col]], na.rm=TRUE)
+ (2*sd(dat[[col]], na.rm=TRUE))))
|
(value<(mean(dat[[col]], na.rm=TRUE)
- (2*sd(dat[[col]], na.rm=TRUE))))
)
{
print(id)
}}
I'm new to forloops (obviously) - trying to get out of the bad habit of copy pasting. I've tried looking at other answers but I can't see how to apply it here / they're not in R. Any help appreciated! Open to different approaches altogether (e.g apply based ones) but would quite like to plug my gap in forloop understanding if possible. Thanks!
Let's start by looking at your for-loops. You can speed these up quite easily by storing intermediate results (the mean and standard deviation) in variables, so they do not have to be recomputed for every row. Recomputing them is by far the slowest part of your loop, so the boost will be significant. In your first code example this would look like this:
dat <- data.frame(id=c("Pt1","Pt2", "Pt3","Pt4", "Pt5", "Pt6"),
x=c(1,3,3,3,5,31),
y=c(2,9,10,10.5,10.5,11),
z=c(34,34,34,35,68,36))
# Pre-define variables
mu <- mean(dat$x, na.rm = TRUE)
sd2 <- 2 * sd(dat$x, na.rm = TRUE)
upper <- mu + sd2
lower <- mu - sd2
# Create storage
rows <- logical(n <- nrow(dat))
for (row in 1:n) {
variable <- dat[row, "x"]
if(variable > upper || variable < lower)
{
# Set index to true, for row being an "outlier"
rows[row] <- TRUE
}
}
# Print outlier rows
dat[rows,]
For your next loop, it would make sense to either store a matrix of "outlier indicators" or just the row/column pairs, for example as a list. You are most of the way there already. It makes sense to loop over columns in the outer loop, so you once again avoid recomputing the mean and standard deviation at every iteration.
# Specify columns to iterate over
cols <- names(dat)[-1]
# Storage for outliers
outliers <- list()
for(j in cols){
# Pre-define variables
mu <- mean(dat[, j], na.rm = TRUE)
sd2 <- 2 * sd(dat[, j], na.rm = TRUE)
upper <- mu + sd2
lower <- mu - sd2
# Create storage
rows <- logical(n <- nrow(dat))
for (row in 1:n) {
variable <- dat[row, j]
if(variable > upper || variable < lower)
{
# Set index to true, for row being an "outlier"
rows[row] <- TRUE
}
}
outliers[[j]] <- rows
}
# Print outliers
dat[outliers[['x']], ]
dat[outliers[['y']], ]
dat[outliers[['z']], ]
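On why the nested loop in the question only reports Pt5: the inner for has no braces, so only the line value <- dat[row, col] is repeated. Once that inner loop finishes, col is left at the last column (z), and the if runs once per row against z alone. Below is a minimal sketch of the question's own loop with the braces added, starting at column 2 so the non-numeric id column is skipped. The flagged vector is a name introduced here for illustration.

```r
dat <- data.frame(id = c("Pt1", "Pt2", "Pt3", "Pt4", "Pt5", "Pt6"),
                  x = c(1, 3, 3, 3, 5, 31),
                  y = c(2, 9, 10, 10.5, 10.5, 11),
                  z = c(34, 34, 34, 35, 68, 36),
                  stringsAsFactors = FALSE)

flagged <- character(0)
for (row in 1:nrow(dat)) {
  for (col in 2:ncol(dat)) {   # braces: the whole check now runs for every column
    value <- dat[row, col]
    mu    <- mean(dat[[col]], na.rm = TRUE)
    sd2   <- 2 * sd(dat[[col]], na.rm = TRUE)
    if (value > mu + sd2 || value < mu - sd2) {
      # Record the participant together with the column that triggered it
      flagged <- c(flagged, paste(dat[row, "id"], names(dat)[col]))
    }
  }
}
print(flagged)
```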
Now this is one method for doing it. But many functions in R are vectorized. So we could simplify this massively. Vectorization basically allows us to evaluate functions over vector inputs and this is also possible for logical comparison such as < <= == and so on. This allows us to remove the row iteration in this case, and simplifies the code drastically. For example the first code would be reduced to
# Only 1 column
mu <- mean(dat$x)
sd2 <- sd(dat$x) * 2
upper <- mu + sd2
lower <- mu - sd2
rows <- dat$x > upper | dat$x < lower
# Alternative, cheeky 1 liner:
rows <- abs(dat$x - mean(dat$x)) > 2 * sd(dat$x)
while the latter could even be done as
outliers <- lapply(dat[, c('x', 'y', 'z')],
function(x) abs(x - mean(x)) > 2 * sd(x))
dat[outliers[['x']], ]
dat[outliers[['y']], ]
dat[outliers[['z']], ]
where I replace the for-loop with a call to lapply, which iterates over the columns in dat and applies the specified function, returning a list with one element (here a logical vector) per column. There is no real performance gain from replacing the for-loop, but it is easier to read for smaller calls like this.
The following code computes the column means and SDs first, then the limits of mu +/- 2sd. It then uses a sapply loop to see which column elements fall outside those limits. Finally, it subsets the id column based on the results of sapply.
means <- colMeans(dat[-1], na.rm = TRUE)
sds <- apply(dat[-1], 2, sd, na.rm = TRUE)
ci95 <- means + cbind(-2*sds, 2*sds)
out <- sapply(seq_along(dat[-1]), function(i){
v <- dat[-1][[i]]
v < ci95[i, 1] | v > ci95[i, 2]
})
out
# [,1] [,2] [,3]
#[1,] FALSE TRUE FALSE
#[2,] FALSE FALSE FALSE
#[3,] FALSE FALSE FALSE
#[4,] FALSE FALSE FALSE
#[5,] FALSE FALSE TRUE
#[6,] TRUE FALSE FALSE
dat[[1]][rowSums(out) > 0]
#[1] "Pt1" "Pt5" "Pt6"
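The same check can also be written with z-scores: scale() centres each column on its mean and divides by its standard deviation, so an absolute value above 2 marks a point more than 2 SDs from the mean. A short sketch on the same example data:

```r
dat <- data.frame(id = c("Pt1", "Pt2", "Pt3", "Pt4", "Pt5", "Pt6"),
                  x = c(1, 3, 3, 3, 5, 31),
                  y = c(2, 9, 10, 10.5, 10.5, 11),
                  z = c(34, 34, 34, 35, 68, 36),
                  stringsAsFactors = FALSE)

# (value - column mean) / column sd, for every numeric column at once
zscores <- scale(dat[-1])

# TRUE where a value lies more than 2 SDs from its column mean
out <- abs(zscores) > 2

# Participants flagged on at least one column
outlier_ids <- dat$id[rowSums(out) > 0]
print(outlier_ids)
```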

Writing a for loop with the output as a data frame in R

I am currently working my way through the book 'R for Data Science'.
I am trying to solve this exercise question (21.2.1 Q1.4) but have not been able to determine the correct output before starting the for loop.
Write a for loop to:
Generate 10 random normals for each of μ= −10, 0, 10 and 100.
Like the previous questions in the book I have been trying to insert into a vector output but for this example, it appears I need the output to be a data frame?
This is my code so far:
values <- c(-10,0,10,100)
output <- vector("double", 10)
for (i in seq_along(values)) {
output[[i]] <- rnorm(10, mean = values[[i]])
}
I know the output is wrong but am unsure how to create the format I need here. Any help much appreciated. Thanks!
There are many ways of doing this. Here is one. See inline comments.
set.seed(357) # to make things reproducible, set random seed
N <- 10 # number of loops
xy <- vector("list", N) # create an empty list into which values are to be filled
# run the loop N times and on each loop...
for (i in 1:N) {
# generate a data.frame with 4 columns, and add a random number into each one
# random number depends on the mean specified
xy[[i]] <- data.frame(um10 = rnorm(1, mean = -10),
u0 = rnorm(1, mean = 0),
u10 = rnorm(1, mean = 10),
u100 = rnorm(1, mean = 100))
}
# result is a list of data.frames with 1 row and 4 columns
# you can bind them together into one data.frame using do.call
# rbind means they will be merged row-wise
xy <- do.call(rbind, xy)
um10 u0 u10 u100
1 -11.241117 -0.5832050 10.394747 101.50421
2 -9.233200 0.3174604 9.900024 100.22703
3 -10.469015 0.4765213 9.088352 99.65822
4 -9.453259 -0.3272080 10.041090 99.72397
5 -10.593497 0.1764618 10.505760 101.00852
6 -10.935463 0.3845648 9.981747 100.05564
7 -11.447720 0.8477938 9.726617 99.12918
8 -11.373889 -0.3550321 9.806823 99.52711
9 -7.950092 0.5711058 10.162878 101.38218
10 -9.408727 0.5885065 9.471274 100.69328
Another way would be to pre-allocate a matrix, add in values and coerce it to a data.frame.
xy <- matrix(NA, nrow = N, ncol = 4)
for (i in 1:N) {
xy[i, ] <- rnorm(4, mean = c(-10, 0, 10, 100))
}
# note that the column names are assigned afterwards
colnames(xy) <- c("um10", "u0", "u10", "u100")
xy <- as.data.frame(xy)
As this is a learning question I will not provide the solution directly.
> values <- c(-10,0,10,100)
> for (i in seq_along(values)) {print(i)} # Checking we iterate by position
[1] 1
[1] 2
[1] 3
[1] 4
> output <- vector("double", 10)
> output # Checking the place where the output will be
[1] 0 0 0 0 0 0 0 0 0 0
> for (i in seq_along(values)) { # Testing the full code
+ output[[i]] <- rnorm(10, mean = values[[i]])
+ }
Error in output[[i]] <- rnorm(10, mean = values[[i]]) :
more elements supplied than there are to replace
As you can see, the error says there are more elements to store than there is space for: each iteration generates 10 random numbers (40 in total) and tries to put them into a single slot of a vector that only has 10 slots. Consider using a data format that can store several values for each iteration.
So that:
> output <- ??
> for (i in seq_along(values)) { # Testing the full code
+ output[[i]] <- rnorm(10, mean = values[[i]])
+ }
> output # Should have length 4 and each element all the 10 values you created in the loop
# set the number of rows
rows <- 10
# vector with the values
means <- c(-10,0,10,100)
# generating output matrix
output <- matrix(nrow = rows,
ncol = 4)
# setting seed and looping through the number of rows
set.seed(222)
for (i in 1:rows){
output[i,] <- rnorm(length(means),
mean=means)
}
#printing the output
output
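A list is another container that fits here, since each element of a list can hold a whole vector. The sketch below keeps the loop from the question and only changes how output is created; the element names are borrowed from the first answer and are otherwise arbitrary.

```r
values <- c(-10, 0, 10, 100)

set.seed(1)
# One list slot per mean; each slot can hold all 10 draws
output <- vector("list", length(values))
for (i in seq_along(values)) {
  output[[i]] <- rnorm(10, mean = values[[i]])
}

# Optional: name the elements and turn the list into a 10 x 4 data frame
names(output) <- c("um10", "u0", "u10", "u100")
output_df <- as.data.frame(output)
str(output_df)
```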

Spearman correlation between two matrices of same dimensions

I have two matrices of equal dimensions (p and e) and I would like to make a spearman correlation between columns of the same name. I want to have the output of pair correlations in a matrix (M)
I used the corr.test() function from library Psych and here is what I did:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character(M[,1])
G <- colnames(p)
for(rs in 1:ncol(p)){
M[rs,1] <- G[rs]
cor <- corr.test(p[,rs],e[,rs],method="spearman",adjust="none")
M[rs,2] <- cor$r
M[rs,3] <- cor$p
}
But I get an error message:
Error in 1:ncol(y) : argument of length 0
Could you please show me what is wrong? or suggest another method?
No need for all this looping and indexing etc:
# test data
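# column names are set so that colnames(p) below is not NULL
p <- matrix(data = rnorm(100),nrow = 10, dimnames = list(NULL, letters[1:10]))
e <- matrix(data = rnorm(100),nrow = 10, dimnames = list(NULL, letters[1:10]))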
cor <- corr.test(p, e, method="spearman", adjust="none")
data.frame(name=colnames(p), r=diag(cor$r), p=diag(cor$p))
# name r p
#a a 0.36969697 0.2930501
#b b 0.16363636 0.6514773
#c c -0.15151515 0.6760652
# etc etc
If the columns of the two matrices are not already in the same order, align them by name first:
cor <- corr.test(p, e[,match(colnames(p),colnames(e))], method="spearman", adjust="none")
Since the two matrices are huge, running corr.test() on all possible pairs of columns would take a very long time. The loop that finally worked is as follows:
library(psych)
M <- data.frame(matrix(ncol=3,nrow=ncol(p)))
M[,1] <- as.character(M[,1])
G <- colnames(p)
for(rs in 1:ncol(p)){
M[rs,1] <- G[rs]
cor <- corr.test(as.data.frame(p[,rs]),as.data.frame(e[,rs]),
method="spearman",adjust="none")
M[rs,2] <- cor$r
M[rs,3] <- cor$p
}
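If only the matching columns are needed, base R's cor.test() can do the same job one column pair at a time without psych. This is a sketch on random test data, assuming both matrices share the same column names; res is a name introduced here for illustration.

```r
set.seed(42)
# Made-up test matrices with identical column names
p <- matrix(rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))
e <- matrix(rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))

# One Spearman test per shared column name; returns a 2-row matrix, so transpose
res <- t(sapply(colnames(p), function(nm) {
  ct <- cor.test(p[, nm], e[, nm], method = "spearman")
  c(r = unname(ct$estimate), p = ct$p.value)
}))

M <- data.frame(name = rownames(res), res, row.names = NULL)
head(M, 3)
```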

How do you find the sample sizes used in calculations on r?

I am running correlations between variables, some of which have missing data, so the sample size for each correlation are likely different. I tried print and summary, but neither of these shows me how big my n is for each correlation. This is a fairly simple problem that I cannot find the answer to anywhere.
Like this?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
You can also get the degrees of freedom (n - 2 for a Pearson correlation test) like this:
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
But I think it would be best if you showed the code you are using to estimate the correlations, so we can give more exact help.
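To make the link between the degrees of freedom and the sample size concrete, here is a small sketch with made-up vectors whose missing values sit in different positions. cor.test() drops incomplete pairs, so the degrees of freedom plus 2 should equal the number of complete pairs.

```r
set.seed(1)
x <- c(rnorm(50), NA, NA)   # missing at positions 51 and 52
y <- c(NA, rnorm(51))       # missing at position 1

# Pairs in which both x and y are present
n_complete <- sum(complete.cases(x, y))

# Degrees of freedom reported by the Pearson test (n - 2)
df <- unname(cor.test(x, y)$parameter)

print(c(n_complete = n_complete, n_from_df = df + 2))
```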
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
u <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
h <- expand.grid(x = u, y = u)
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
h$n <- mapply(f, h[, 1], h[, 2])
h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, you could return an n x n matrix instead of a three-column data frame, etc.
Here is a for-loop implementation of Dennis' function above that outputs an n x n matrix directly, rather than having to reshape the result with pivot_wider(). On my Databricks cluster it cut the compute time for a 1865 row x 69 column matrix from 2.5-3 minutes down to 30-40 seconds.
Thanks for your answer Dennis, this helped me with my work.
pairwise_nxn <- function(mat)
{
cols <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
rownames(nn) <- colnames(nn) <- cols
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
for (i in 1:nrow(nn))
for (j in 1:ncol(nn))
nn[i,j] <- f(rownames(nn)[i], colnames(nn)[j])
nn
}
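The n x n matrix can also be obtained without any loop: if obs is a 0/1 matrix marking non-missing values, then t(obs) %*% obs counts, for every pair of columns, the rows where both are present. A sketch on the same kind of example data as above (obs and n_mat are names introduced here):

```r
set.seed(1)
xx <- rnorm(3000)
xx[sample(3000, 200)] <- NA
dd <- matrix(xx, ncol = 3, dimnames = list(NULL, paste0("x", 1:3)))

# 1 where a value is present, 0 where it is NA (dimnames are kept)
obs <- (!is.na(dd)) * 1

# crossprod(obs) is t(obs) %*% obs: entry [i, j] is the number of rows
# in which both column i and column j are non-missing
n_mat <- crossprod(obs)
n_mat
```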
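If your variables are vectors named a and b, would something like sum(!is.na(a) & !is.na(b)) help you? It counts the pairs in which both values are present, which is the n behind the correlation; sum(is.na(a) | is.na(b)) gives the number of pairs that are dropped.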

Resources