Solve iteratively dataframe in R - r

I am trying to build a dataframe (df2) based on the following relationship: df1[i,j] = df2[i,j]^2. To do this, I need to solve a system of non-linear equations:
library(nleqslv)
df1 = data.frame(a = c(9,9), b = c(9,9))
df2 = df1
for(i in colnames(df1)){
f = function(x) {df1[i] - x^2}
xstart = c(df2[i])
df2[i] = nleqslv(xstart, f)[[1]]
}
The expected result is:
a b
1 3 3
2 3 3
But I get the following error message:
Error in nleqslv(xstart, f) :
Argument 'x' cannot be converted to numeric!
I am not sure what causes the problem. Could you give me some advice, please?

Well, I don't know what you are trying to accomplish, but I think the function you defined has to be fixed. The following runs without the error, although the answer it gives is not the one you want, because f here solves x - x^2 = 0 rather than df1 - x^2 = 0.
f <- function(x) x - x^2
df1 = data.frame(a = c(9,9), b = c(9,9))
sapply(df1, function(y) nleqslv(y, f)[[1]])
You should instead use sqrt() since it is vectorized.
sqrt(df1)
# a b
# 1 3 3
# 2 3 3

I'm unclear as to why you need such a complex solution for such a simple operation (df2 <- sqrt(df1) would produce your example solution). But if you want to know what's producing that error, it comes down to how R indexes lists.
df1[1] returns a one-column data frame (which is a list), whereas df1[[1]] (double brackets) returns the column as a vector. The nleqslv function expects vectors. So all we have to do is modify your existing code to use double brackets instead of single ones:
library(nleqslv)
df1 = data.frame(a = c(9,9), b = c(9,9))
df2 = df1
for(i in colnames(df1)){
f = function(x) {df1[[i]] - x^2}
xstart = c(df2[[i]])
df2[i] = nleqslv(xstart, f)[[1]]
}
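To make the difference between single and double brackets concrete, here is a minimal sketch in base R. The names single and double_br are only illustrative and do not come from the original code.

```r
df1 <- data.frame(a = c(9, 9), b = c(9, 9))

# Single brackets keep the data frame wrapper around the column
single <- df1["a"]

# Double brackets extract the column itself as a numeric vector
double_br <- df1[["a"]]

class(single)          # "data.frame"
is.numeric(double_br)  # TRUE
```

A numeric vector such as double_br is what a function expecting a numeric starting value can work with, which is why the double-bracket version avoids the conversion error.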

First creating the data:
df2 <- data.frame(a=c(9,9), b=c(9,9))
df1 <- df2
To solve it iteratively, here is the R code:
for(i in 1:nrow(df1)){
for(j in 1:ncol(df1)){
df2[i, j] <- sqrt(df1[i,j])
}
}
df2
This will return:
a b
1 3 3
2 3 3
You could have used a vectorized solution (df2 <- sqrt(df1)) to achieve the same result, but the loop above will work for you if you need to solve it iteratively using a traditional loop.

Related

How to rbind output of repeated function to a df - vectorized

I am reading everywhere that you should not use for-loops in R, but rather do it 'vectorized'. However I find for-loops intuitive and struggle with transforming my code.
I have a function f1 that I want to use multiple times. The inputs for the function are in a list called l1. My f1 outputs a df. I want to rbind these output dfs into one df. The for loop I have now is this:
z3 <- data.frame()
for(i in l1) {
z3 <- rbind(z3, f1(i))
}
Could anyone help me to do the same, but without the for-loop?
You can use lapply(), and do.call()
do.call(rbind, lapply(l1, f1))
Another, more verbose approach:
## function that returns a 1 x 2 dataframe:
get_fresh_straw <- function(x) data.frame(col1 = x, col2 = x * pi)
l1 = list(A = 1, B = 5, C = 2)
Reduce(l1,
f = function(camel_back, list_item){
rbind(camel_back, get_fresh_straw(list_item))
},
init = data.frame(col1 = NULL, col2 = NULL)
)
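As a rough check that the do.call(rbind, lapply(l1, f1)) pattern matches the original for loop, here is a small sketch. The function f1 below is a hypothetical stand-in, since the real f1 is not shown in the question.

```r
# Hypothetical stand-in for the asker's f1: returns a 1 x 2 data frame
f1 <- function(x) data.frame(col1 = x, col2 = x * 2)
l1 <- list(A = 1, B = 5, C = 2)

# Loop version from the question
z_loop <- data.frame()
for (i in l1) {
  z_loop <- rbind(z_loop, f1(i))
}

# Loop-free version: build all pieces first, then bind them in one call
z_vec <- do.call(rbind, lapply(l1, f1))
rownames(z_vec) <- NULL  # drop the A/B/C row names inherited from the list
```

Binding once at the end also avoids growing z3 on every iteration, which tends to matter when the list is long.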

R tapply() does not work on data.frame due to improper length check

This is a bug report, not a question. The procedure to report bugs in R core appears complicated, and I don't want to be part of a mailing list. So I'm posting this here (as recommended by https://www.r-project.org/bugs.html.)
Here it is:
The tapply() help of R 4.0.3 says the following on argument X:
an R object for which a split method exists. Typically vector-like, allowing subsetting with [.
Issue: this R object cannot be a data.frame, although a data.frame can be split and subsetted.
To reproduce, run the following:
func <- function(dt) {
sum(dt[,1] * dt[,2])
}
tab <- data.frame(x = sample(100), y = sample(100), z = sample(letters[1:10], 100, T))
tapply(tab[,1:2], INDEX = tab$z, FUN = func)
This results in
error in tapply(tab[, 1:2], INDEX = tab$z, FUN = func) :
arguments must have same length
which, upon looking at the tapply() source code, appears to result from this check:
if (!all(lengths(INDEX) == length(X)))
stop("arguments must have same length")
But length() is not the relevant function to call on a data.frame to determine if it has the right dimension for a split. nrow() should be used instead.
Replacing the above code with
if(is.data.frame(X)) {
len <- nrow(X)
} else {
len <- length(X)
}
if (!all(lengths(INDEX) == len))
stop("arguments must have same length")
solves the error.
This fix looks rather straightforward, and implementing it would increase the usefulness of tapply() by a lot (I know there are powerful alternatives to tapply()), so I wonder if the current limitation reflects a design choice.
Based on the function, we could use
library(dplyr)
tab %>%
group_by(z) %>%
summarise(new = func(cur_data()), .groups = 'drop')
-output
# A tibble: 10 x 2
# z new
# <chr> <int>
# 1 a 26647
# 2 b 28010
# 3 c 31340
# 4 d 20780
# 5 e 33311
# 6 f 31880
# 7 g 37527
# 8 h 8752
# 9 i 15490
Or using by from base R
by(tab[, 1:2], tab$z, FUN = func)
According to ?tapply
X - an R object for which a split method exists. Typically vector-like, allowing subsetting with [.
Here, tab[, 1:2] is a data.frame and not a vector. If it were a matrix, it would be a vector with a dim attribute.
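For completeness, here is a sketch of two base R workarounds that do not require changing tapply(). The seed and the names res_split and res_tapply are assumptions added for illustration.

```r
set.seed(1)  # seed chosen only to make the sketch reproducible
func <- function(dt) sum(dt[, 1] * dt[, 2])
tab <- data.frame(x = sample(100), y = sample(100),
                  z = sample(letters[1:10], 100, TRUE))

# Option 1: split the data frame by group, then apply func to each piece
res_split <- sapply(split(tab[, 1:2], tab$z), func)

# Option 2: keep tapply(), but pass row indices and subset inside FUN
res_tapply <- tapply(seq_len(nrow(tab)), tab$z,
                     function(idx) func(tab[idx, 1:2]))
```

Both give one value per level of z. The second works because the vector of row indices has length nrow(tab), which is what the length check in tapply() compares against.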

making an integer vector from index of another vector in R

I have just started to learn R and am trying to do the following task.
I have a vector of 10 random values, some of which are NA and some numeric, like
a <- rnorm(100)
b <- rep(NA, 100)
c <- sample(c(a, b), 10)
Now I want to make another vector "d" which holds the indices of all the NA values in "c", for example
d <- c(2, 7, 9)
I tried
d <- which(c %in% is.na(c))
but it's not giving me the desired result.
Also, what is wrong with this code I tried for the same purpose?
navects <- function(x) {
for(i in 1:length(x)) {
if(is.na(x[i])) c(i)
}
}
You can try with which
which(is.na(c))
NOTE: c is also a function, so it is better not to name objects with c.
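As for the loop in the question: a for loop returns NULL, and the value of c(i) inside it is never stored, so navects() returns nothing useful. A minimal sketch of a corrected version follows, using a small made-up vector x rather than the random one from the question.

```r
x <- c(1.2, NA, 0.5, NA, 3.1)  # illustrative data, not from the question

# Vectorized version
na_idx <- which(is.na(x))

# Loop version: collect matching indices and return them explicitly
navects <- function(x) {
  out <- integer(0)
  for (i in seq_along(x)) {
    if (is.na(x[i])) out <- c(out, i)
  }
  out
}

navects(x)
```

Both approaches give the positions 2 and 4 here; which(is.na(x)) is the shorter and more idiomatic of the two.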

Making a column to help aggregation in r dataframe

I need to construct a new column for R dataframe that would help in aggregation.
First, I have some vectors:
vector1 <- c("ITEM11","ITEM12","ITEM13")
vector2 <- c("ITEM21","ITEM22","ITEM32")
and dataframe DF which has column VAR with the items included in the vectors. Now I want to make new column AGGVAR:
DF$AGGVAR[DF$VAR %in% vector1] <- "vector1"
This is manageable with a small number of vectors, but I want to make it neater for more vectors. I made a list
vectorList <- ls(pattern = "^vector")
and my obviously naive attempt was
for(i in regList){DF$AGGVAR[DF$VAR %in i] <- i}
What is still needed to make this work?
EDIT: My problem was actually a bit hairier than I first presented. The vectors don't actually have neat numerical suffixes, e.g.:
vectorGHI <- c("ITEM11","ITEM12","ITEM13")
vectorJKL <- c("ITEM21","ITEM22","ITEM32")
Something like this should do the trick:
vector1 <- c("ITEM11","ITEM12","ITEM13")
vector2 <- c("ITEM21","ITEM22","ITEM32")
d <- data.frame(var=c(vector1, vector2))
L <- mget(ls(patt='^vector'))
d$aggvar <- paste0('vector', sapply(d$var, grep, L))
d
# var aggvar
# 1 ITEM11 vector1
# 2 ITEM12 vector1
# 3 ITEM13 vector1
# 4 ITEM21 vector2
# 5 ITEM22 vector2
# 6 ITEM32 vector2
An alternative, which might have better performance:
lookup <- cbind(unlist(L),
rep(names(L), lengths(L)))
d$aggvar <- lookup[match(d$var, lookup[, 1]), 2]
Slightly modified answer based on jbaums' suggestion to make this complete:
namesVectors <- ls(pattern = "^vector")
vectorList <- mget(namesVectors)
# Getting rid of auxiliary prefix
namesVectors <- substring(namesVectors, 7)
DF$AGGVAR <- sapply(DF$VAR, grep, vectorList)
for(i in seq_along(namesVectors)) {DF$AGGVAR[DF$AGGVAR == i] <- namesVectors[i]}
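A lookup-table variant that does not depend on numeric suffixes can be sketched with stack() from base R. The list groups and the small DF below are assumptions for illustration. In the real session the list would come from mget(ls(pattern = "^vector")).

```r
# Named list of item vectors, written out by hand to keep the sketch self-contained
groups <- list(
  vectorGHI = c("ITEM11", "ITEM12", "ITEM13"),
  vectorJKL = c("ITEM21", "ITEM22", "ITEM32")
)
DF <- data.frame(VAR = c("ITEM21", "ITEM11", "ITEM32", "ITEM13"))

# stack() turns the list into a two-column table: values (item) and ind (list name)
lookup <- stack(groups)

# Look up each item once and take the name of the vector it belongs to
DF$AGGVAR <- as.character(lookup$ind[match(DF$VAR, lookup$values)])
DF
```

Items that appear in none of the vectors get NA in AGGVAR, which makes unmatched rows easy to spot.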

tapply on matrices of data and indices

I am calculating sums of matrix columns to each group, where the corresponding group values are contained in matrix columns as well. At the moment I am using a loop as follows:
index <- matrix(c("A","A","B","B","B","B","A","A"),4,2)
x <- matrix(1:8,4,2)
for (i in 1:2) {
tapply(x[,i], index[,i], sum)
}
At the end of the day I need the following result:
1 2
A 3 15
B 7 11
Is there a way to do this using matrix operations without a loop? On top of that, the real data is large (e.g. 500 x 10000), so it has to be fast.
Thanks in advance.
Here are a couple of solutions:
# 1
ag <- aggregate(c(x), data.frame(index = c(index), col = c(col(x))), sum)
xt <- xtabs(x ~., ag)
# 2
m <- mapply(rowsum, as.data.frame(x), as.data.frame(index))
dimnames(m) <- list(levels(factor(index)), 1:ncol(index))
The second only works if every column of index has at least one of each level, and it also requires at least 2 levels; however, it's faster.
This is ugly but it works; there is surely a better, more generalizable way to do it. Just getting the ball rolling.
data.frame("col1"=as.numeric(table(rep(index[,1], x[,1]))),
"col2"=as.numeric(table(rep(index[,2], x[,2]))),
row.names=names(table(index)))
I still suspect there's a better option, but this seems reasonably fast actually:
index <- matrix(sample(LETTERS[1:4],size = 500*10000,replace = TRUE),500,10000)
x <- matrix(sample(1:10,500*10000,replace = TRUE),500,10000)
rs <- matrix(NA,4,10000)
rownames(rs) <- LETTERS[1:4]
for (i in LETTERS[1:4]){
tmp <- x
tmp[index != i] <- 0
rs[i,] <- colSums(tmp)
}
It runs in ~0.8 seconds on my machine. I upped the number of categories to four and scaled it up to the size of the data you have. But I don't like having to copy x each time.
You can get clever with matrix multiplication, but I think you still have to do one row or column at a time.
You used tapply. If you add mapply, you can complete your objective.
It does the same thing as that for loop.
index <- matrix(c("A","A","B","B","B","B","A","A"),4,2)
x <- matrix(1:8,4,2)
mapply( function(i) tapply(x[,i], index[,i], sum), 1:2 )
result:
[,1] [,2]
A 3 15
B 7 11
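Another loop-free option, sketched here on the small example only, is a single tapply() call that groups by both the label and the column number. The name res is an assumption for illustration, and this has not been timed on data of the real size.

```r
index <- matrix(c("A","A","B","B","B","B","A","A"), 4, 2)
x <- matrix(1:8, 4, 2)

# Flatten both matrices and group by (label, column number) in one call
res <- tapply(c(x), list(c(index), c(col(x))), sum)
res
```

The result is a matrix with the labels as row names and the column numbers as column names. A label that never occurs in a given column comes back as NA rather than 0.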
