Appending every nth column using a loop in R

I have a data frame which consists of paired columns of ratings given by participants and the reasons for giving their ratings. I would like to insert blank columns after each pair of columns, so that after columns 1 and 2 there are new columns. I managed to do this manually by creating a vector of NAs, appending the new columns at the end, and then reordering the columns myself. Here's the code for that so it is clear what I am trying to achieve:
v <- rep(NA, 184)
Scheme1$Code1.1 <- v
Scheme1$Code2.1 <- v
Scheme1$Code1.2 <- v
Scheme1$Code2.2 <- v
Scheme1$Code1.3 <- v
Scheme1$Code2.3 <- v
Scheme1$Code1.4 <- v
Scheme1$Code2.4 <- v
Scheme1$Code1.5 <- v
Scheme1$Code2.5 <- v
Scheme1$Code1.6 <- v
Scheme1$Code2.6 <- v
Scheme1$Code1.7 <- v
Scheme1$Code2.7 <- v
# Reorganize
Scheme1 <- Scheme1[,c(1,2,15,16,3,4,17,18,5,6,19,20,7,8,21,22,9,10,23,24
,11,12,25,26,13,14,27,28)]
I wanted to see how this could be achieved by using a for loop.
Thanks!

Based on the description, maybe this helps:
lst1 <- split.default(Scheme1, as.integer(gl(ncol(Scheme1), 2, ncol(Scheme1))))
do.call(cbind, unname(Map(function(x, i) {x[paste0(names(x), ".", i)] <- NA;x}, lst1, names(lst1))))
data
set.seed(24)
Scheme1 <- as.data.frame(matrix(rnorm(14 * 5), ncol = 14))
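Since the question asks specifically for a for loop, here is a minimal sketch of one, assuming the data frame has an even number of columns and that two blank columns named Code1.i and Code2.i should follow each pair, as in the question's manual version. The names n_pairs, pieces and result are only illustrative.

```r
# Example data: 14 columns, i.e. 7 pairs (same shape as the data above)
set.seed(24)
Scheme1 <- as.data.frame(matrix(rnorm(14 * 5), ncol = 14))

n_pairs <- ncol(Scheme1) / 2            # assumes an even number of columns
pieces <- vector("list", n_pairs)       # one list slot per pair

for (i in seq_len(n_pairs)) {
  # Take the i-th pair of original columns
  pair <- Scheme1[, c(2 * i - 1, 2 * i), drop = FALSE]
  # Add two blank columns after the pair, named as in the question
  pair[[paste0("Code1.", i)]] <- NA
  pair[[paste0("Code2.", i)]] <- NA
  pieces[[i]] <- pair
}

# Bind the pieces back together in order
result <- do.call(cbind, pieces)
names(result)
```

Collecting the pieces in a list and binding once at the end avoids reordering columns by hand afterwards.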

Related

undefined columns selected and cannot xtfrm data frame error

I am trying to write code that checks for outliers based on the IQR and changes the respective values to NA. So I wrote this:
dt <- rnorm(200)
dg <- rnorm(200)
dh <- rnorm(200)
l <- c(1,3) #List of relevant columns
df <- data.frame(dt,dg,dh)
To check if the column contains any outliers and change their value to NA:
vector.is.empty <- function(x) return(length(x) ==0)
#Checks for empty values in vector and returns booleans.
for (i in 1:length(l)){
IDX <- l[i]
BP <- boxplot.stats(df[IDX])
OutIDX <- which(df[IDX] %in% BP$out)
if (vector.is.empty(OutIDX)==FALSE){
for (u in 1:length(OutIDX)){
IDX2 <- OutIDX[u]
df[IDX2,IDX] <- NA
}
}
}
So, when I run this code, I get the error messages quoted in the title ("undefined columns selected" and "cannot xtfrm data frame").
I've searched online for good answers, but I'm not sure why R claims that the column is undefined. Any clues here?
I would do something like this in order to replace the outliers:
# Set a seed (to make the example reproducible)
set.seed(31415)
# Generate the data.frame
df <- data.frame(dt = rnorm(100), dg = rnorm(100), dh = rnorm(100))
# A list to save the result of boxplot.stats()
l <- list()
for (i in seq_along(df)){
  l[[i]] <- boxplot.stats(df[,i])
  # %in% flags every outlier; == would recycle the outlier vector against the column
  df[df[,i] %in% l[[i]]$out, i] <- NA
}
# Which values have been replaced?
lapply(l, function(x) x$out)
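As for why the original loop fails: df[IDX] with single brackets returns a one-column data frame, not a vector, and that is most likely what trips up boxplot.stats and %in%. A minimal sketch that keeps the question's idea of checking only selected columns, using df[[IDX]] to extract a plain numeric vector (the name cols is just illustrative):

```r
set.seed(31415)
df <- data.frame(dt = rnorm(200), dg = rnorm(200), dh = rnorm(200))
cols <- c(1, 3)   # columns to check, as in the question

for (IDX in cols) {
  x <- df[[IDX]]                 # [[ ]] returns a numeric vector, [ ] a data frame
  out <- boxplot.stats(x)$out    # values beyond the boxplot whiskers
  df[[IDX]][x %in% out] <- NA    # no inner loop needed; works when 'out' is empty too
}

# Number of values replaced per column
colSums(is.na(df))
```

Column dg is left untouched because it is not listed in cols.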

How to repeat codes changing the variables in a sequence in R

This is the code I want to repeat
A_1981 <- Base[[1:12]]
B <- sum(A_1981)
MFI_1981 <- sum(A_1981^2)/B
Base is a Raster brick
A_1981 is for a year
MFI_1981 is the final result
So I have to continue with the next year:
A_1982 <- Base[[13:24]]
B <- sum(A_1982)
MFI_1982 <- sum(A_1982^2)/B
To repeat the same code, I thought of replacing only the values in the names:
a <- seq(1,421,by=12)
b <- seq(12,432,by=12)
c <- seq(1981,2016, by=1)
And doing it in sequence, for the third year it would be something like this (pseudocode):
A_a[3] <- Base[[a[3]:b[3]]]
B <- sum(A_a[3])
MFI_a[3] <- sum(A_a[3]^2)/B
There has to be some way to do this with a for loop or a function, but I have no idea where to start.
I think you are looking for something like this
Example data (48 layers, i.e., 4 "years")
library(raster)
f <- system.file("external/rlogo.grd", package="raster")
Base <- stack(rep(f, 4*4))
Approach 1
f <- function(year) {
start <- (year-1981) * 12 + 1
A <- Base[[start:(start+11)]]
sum(A^2)/sum(A)
}
mfi <- lapply(1981:1984, f)
MFI <- stack(mfi)
Approach 2
for (year in 1981:1984) {
start <- (year-1981) * 12 + 1
A <- Base[[start:(start+11)]]
mfi <- sum(A^2)/sum(A)
writeRaster(mfi, paste0(year, ".tif"))
}
s <- stack(paste0(1981:1984, ".tif"))
Approach 3, with mapply as in Rui Barradas' answer, but fixed for when Base is a RasterBrick (and also including the last year)
n <- nlayers(Base)
a <- seq(1, n, by = 12)
mfi <- mapply(function(i, j) sum(Base[[i:j]]^2)/sum(Base[[i:j]]), a, a+11)
s <- stack(mfi)
The following does what you want using mapply and creates only one object in the .GlobalEnv, which I named MFI.
I start by creating a vector Base, since you have not posted a dataset example.
set.seed(2469) # Make the results reproducible
n <- 432
Base <- sample(100, n, TRUE)
step <- 12
b <- seq(1 + step, n, by = step)
a <- seq(1, n - step, by = step)
MFI <- mapply(function(i, j) sum(Base[i:j]^2)/sum(Base[i:j]), a, b)
head(MFI)
#[1] 63.66472 70.54014 67.60567 53.15550 58.71111 65.37008
Another way would be to use Map, as @Parfait suggests in his comment.
obj <- Map(function(i, j) sum(Base[i:j]^2)/sum(Base[i:j]), a, b)
names(obj) <- paste("MFI", 1980 + seq_along(obj), sep = "_")
obj$MFI_1981
#[1] 63.66472
Note that length(obj) is 35, so the last element is obj$MFI_2015 and not MFI_2016 as stated in the question. This can easily be solved by setting n <- 444 right at the beginning of the code.
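For a plain vector like the one above, another option is to label every element with its year and let tapply do the grouping. This is only a sketch under the assumption that Base is an ordinary numeric vector of 36 complete years; it does not apply to a RasterBrick. The variable year is illustrative.

```r
set.seed(2469)
Base <- sample(100, 432, TRUE)        # stand-in data: 36 years x 12 months
year <- rep(1981:2016, each = 12)     # one year label per element of Base

# One MFI value per year: sum of squares divided by the sum
MFI <- tapply(Base, year, function(x) sum(x^2) / sum(x))

length(MFI)      # 36, one value per year
MFI["1981"]      # look up a year by name
```

Because the groups come from explicit labels rather than start and end indices, each year covers exactly 12 values and the last year is included.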

R, apply function on every second column of a data frame?

How to apply a function on every second column of a data frame? That is to say, how to modify df2 <- sapply(df1, fun) such that df2 equals df1 but with fun applied to every second column? Here is what I tried:
a <- c(1,2,3,4,5)
b <- c(6,7,8,9,10)
df1 <- data.frame(a,b)
df2 <- sapply(df1[c(TRUE, FALSE)], function(x) x^2)
isTRUE(dim(df1)==dim(df2)) # FALSE
The problem with this code is that it drops all columns to which fun was not applied (dim(df2) # 5 1).
Assigning variables to slices
You can assign new values for subsets of an object. Say for:
x <- c(1,2,3)
x[2] <- 4
Now x will be c(1,4,3). Similarly, you can do this for rows/columns of a matrix or data frame. Here we use the apply function with the second argument 2 for columns (1 for rows). I recommend the seq function to generate a sequence of indices: from=1, by=2 gives odd and from=2, by=2 gives even indices. Specifying it this way generalises to other subsets and makes it straightforward to check you got it right.
a <- c(1,2,3,4,5)
b <- c(6,7,8,9,10)
df1 <- data.frame(a,b)
df2 <- df1
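# drop=FALSE keeps a data frame even when only one column is selected, which apply() needs
df2[,seq(1, ncol(df2), 2)] <- apply(df2[,seq(1, ncol(df2), 2), drop=FALSE], 2, function(x) x^2)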
Loops
Note that you can also do this with a loop:
df2 <- df1
for(col in seq(1, ncol(df2), 2)) df2[,col] <- sapply(df2[,col], function(x) x^2)
Vectorised functions
Since the squared operation is "vectorised" in R, in this case you could also do:
for(col in seq(1, ncol(df2), 2)) df2[,col] <- df2[,col]^2
Or use vectorisation completely:
df2 <- df1
df2[,seq(1, ncol(df2), 2)] <- df2[,seq(1, ncol(df2), 2)]^2
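For a function that is not vectorised over a whole data frame, the same slice assignment works with lapply, which stays close to the sapply call in the question. A minimal sketch, where fun is a stand-in for any column-wise function and a third column is added only to make the pattern visible:

```r
df1 <- data.frame(a = 1:5, b = 6:10, c = 11:15)
fun <- function(x) x^2                # stand-in for any function of one column

idx <- seq(1, ncol(df1), by = 2)      # every second column, starting at the first
df2 <- df1
df2[idx] <- lapply(df2[idx], fun)     # replace only the selected columns

dim(df1) == dim(df2)                  # both TRUE: no columns were dropped
```

Single-bracket indexing without a comma treats the data frame as a list of columns, so this also works when idx selects just one column.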

Avoid a for loop

I am trying to optimize an algorithm and I really want to avoid all my loops. Hence I am wondering if there is a way to avoid the following simple loop:
library(FNN)
data <- cbind(1:10, 1:10)
NN.index <- get.knn(data, 5)$nn.index
bc <- matrix(0, nrow(NN.index), max(NN.index))
for(i in 1:nrow(bc)){
bc[i,NN.index[i,]] <- 1
}
where bc is a matrix of zeros.
In R, if you index a matrix M with a k-by-2 matrix I, each row of I is interpreted as a (row, column) index pair into M. For example:
M = matrix(1:12, nrow = 4, ncol = 3)
print(M)
I = rbind(c(1,2), c(4,2), c(3,3))
print(M[I])
In this case, M[1,2], M[4,2] and M[3,3] are extracted.
In your case, we can create row_index and col_index from NN.index as below, and then assign 1 to the corresponding entries.
bc <- matrix(0, nrow(NN.index), max(NN.index))
row_index <- rep(1:nrow(NN.index), times = ncol(NN.index))
col_index <- as.vector(NN.index)
bc[cbind(row_index, col_index)] <- 1
print(bc)
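To check that the matrix-index version matches the loop, here is a small sketch that does not need FNN. The NN.index matrix below is made up for illustration and merely stands in for get.knn(data, 5)$nn.index.

```r
# Made-up neighbour index matrix (5 points, 2 neighbours each)
NN.index <- rbind(c(2, 3), c(1, 3), c(1, 2), c(3, 5), c(4, 3))

# Loop version from the question
bc_loop <- matrix(0, nrow(NN.index), max(NN.index))
for (i in seq_len(nrow(bc_loop))) bc_loop[i, NN.index[i, ]] <- 1

# Matrix-index version: row() gives the row number of every entry,
# in the same column-major order that as.vector() uses
bc_vec <- matrix(0, nrow(NN.index), max(NN.index))
bc_vec[cbind(as.vector(row(NN.index)), as.vector(NN.index))] <- 1

identical(bc_loop, bc_vec)
```

Using row(NN.index) is an alternative to building row_index with rep(); both give the row numbers in the same order.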

How do you find the sample sizes used in calculations on r?

I am running correlations between variables, some of which have missing data, so the sample sizes for the correlations are likely different. I tried print and summary, but neither of these shows me how big my n is for each correlation. This is a fairly simple problem that I cannot find the answer to anywhere.
like this..?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
You can also get the degrees of freedom (n - 2 for a Pearson correlation) like this:
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
But I think it would be best if you showed the code you are using to estimate the correlation, so we can give more exact help.
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
u <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
h <- expand.grid(x = u, y = u)
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
h$n <- mapply(f, h[, 1], h[, 2])
h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, you could return an n x n matrix instead of a three-column data frame, etc.
Here is a for-loop implementation of Dennis' function above that outputs an n x n matrix directly, rather than having to pivot_wider() the result. On my Databricks cluster it cut the compute time for a 1865-row x 69-column matrix down from 2.5-3 minutes to 30-40 seconds.
Thanks for your answer Dennis, this helped me with my work.
pairwise_nxn <- function(mat)
{
cols <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
rownames(nn) <- colnames(nn) <- cols
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
for (i in 1:nrow(nn))
for (j in 1:ncol(nn))
nn[i,j] <- f(rownames(nn)[i], colnames(nn)[j])
nn
}
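If your variables are vectors named a and b, would something like sum(!(is.na(a) | is.na(b))) help you? It counts the pairs in which neither value is missing; without the negation you get the number of pairs that are dropped.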
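The pairwise counts discussed in the answers above can also be obtained without any loop from a 0/1 indicator matrix of observed values. This is a sketch of an alternative technique, not the method of the earlier answers; dd is regenerated here with an arbitrary seed so the block runs on its own, and ok and n_pairs are illustrative names.

```r
set.seed(1)
xx <- rnorm(3000)
xx[sample(3000, 200)] <- NA
dd <- matrix(xx, ncol = 3, dimnames = list(NULL, paste0("x", 1:3)))

ok <- 1 * (!is.na(dd))       # 1 where a value is observed, 0 where it is NA
n_pairs <- crossprod(ok)     # t(ok) %*% ok: entry [i, j] counts rows observed in both i and j
n_pairs
```

The diagonal holds the number of non-missing values per column, and the off-diagonal entries are the sample sizes that cor(dd, use = "pairwise.complete.obs") would use for each pair.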
