Dealing with empty elements in lapply() in R

This might be a rather beginner-level question. lapply() is useful for applying a function to each component of a list. However, when I deal with data periodically generated by the database, it sometimes happens that one or more elements in the list are empty, while all the other components are of the same class, let's say data frames.
When I use lapply() on the whole list, an error occurs when it reaches the empty elements, because their dimension, length or class doesn't fit. What I do in this case is use an if/else branch, but I guess there must be a neater, smarter way to tackle this problem.
Here is an example:
FTSR.site.app <- lapply(sortier.d.f, function(x) {
  if (length(x) != 1) {
    FTSR <- as.numeric(get.FTSR(x))
  } else FTSR <- 0
})
sortier.d.f is a list consisting of data frames with numerous rows and columns. If an empty element is present among them, meaning no data was generated there, it will not get along with the get.FTSR function (which I wrote for a particular calculation), because the latter can only process data frames. The length of this empty element will be 1, I guess because it still exists as a 0 or a FALSE. Otherwise, without such empty elements, I can simply use
FTSR.site.app <- lapply(sortier.d.f, get.FTSR)
Would you please suggest a better solution for the problem with empty elements in such a case?
A simpler dummy example here:
test.A <- data.frame(name = c("Michael", "John", "Mary"),
                     mathematik = c(85, 72, 90),
                     physics = c(67, 82, 94))
test.B <- vector(length = 0, mode = "numeric")
test.L <- list(test.A, test.B)
sum.mean.calc <- function(test) {
  test$total <- apply(test[, 2:3], MARGIN = 1, sum)
  test$mean <- apply(test[, 2:3], MARGIN = 1, mean)
  return(test)
}
test.L <- lapply(test.L, sum.mean.calc)
test.L <- lapply(test.L, function(x) {
  if (length(x) != 0) {
    x <- sum.mean.calc(x)
  } else x <- 0
  return(x)
})
The first attempt to use lapply fails, because test.B is a numeric vector of length 0, which can't be processed by the function sum.mean.calc. So in the second attempt I have to use the extra guard
if (length(x) != 0) {
  ...
} else x <- 0
to process all components in the list test.L, and that gets annoying when I want to use lapply on that list a number of times.
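One way to avoid repeating that guard every time (a sketch; the helper name skip_empty and the length-0 test are illustrative, so adapt the emptiness check to your data):
skip_empty <- function(f, default = 0) {
  # returns a guarded version of f: empty elements are mapped to a
  # default value instead of being handed to f
  function(x, ...) {
    if (length(x) == 0) default else f(x, ...)
  }
}
test.L <- lapply(test.L, skip_empty(sum.mean.calc))
# later, with no repeated if/else:
# FTSR.site.app <- lapply(sortier.d.f, skip_empty(function(x) as.numeric(get.FTSR(x))))
The emptiness check then lives in one place, and every subsequent lapply call over the same list stays a one-liner.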


Creating a function called clean_data in R

I'm a new guy in the R world and had to create a vector.
data <- rnorm(10, 0, 1)
The next question asked for a loop, so I did:
for (i in 1:length(data)) {
  if (data[i] > 0)
    print("positive")
  else
    print("negative")
}
But now it's asking for:
"Write a function called “clean data” that takes in a vector of numbers and returns a vector called “ret” of same length such that ret[i] = 1 if the input vector ith element was positive, and ret[i] = 0 otherwise. To get started, make a separate R Block for your function and use the following shell:
clean_data <- function(input){
  # your code here [...]
  return(ret)
}
Professor also recommends reusing the loop from earlier.
I believe this is what you are looking for. Please note how ifelse is used in the function instead of separate if and else statements. This function works with any numeric vector as input, and I've added a run of it on the data you created earlier. I would add, though, that the purpose of these exercises is really to make us scratch our heads and work the problem out for ourselves, so I recommend you persist in trying next time :)
data <- rnorm(10, 0, 1)
for (i in 1:length(data)) {
  if (data[i] > 0) print("positive") else print("negative")
}
clean_data <- function(input){
  ret <- NULL
  for (i in 1:length(input)) {
    ifelse(input[i] > 0, ret[i] <- 1, ret[i] <- 0) # note the ifelse structure
  }
  return(ret)
}
clean_data(data)
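As an aside (not part of the original answer), once the loop version works it is worth knowing that the same function can be written without any loop, since comparisons are vectorized in R:
clean_data_vec <- function(input){
  # the TRUE/FALSE results of the comparison become 1/0 when coerced to integer
  ret <- as.integer(input > 0)
  return(ret)
}
clean_data_vec(data)
The professor wants the loop reused here, but the vectorized form is the idiomatic R way.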

Nested loop in R: number of items to replace is not a multiple of replacement length

I am trying to subset my dataset using a nested loop. Unfortunately, it does not work properly: I get a couple of warnings, and the loop does not do what I want.
Here is a short code example. The data shown is just an example - the actual dataset is much bigger, so any solution that involves manually picking values is not feasible.
# #Generate example data
unique_test <- list()
unique_test[[1]] <- c(178.5, 179.5, 180.5, 181.5)
unique_test[[2]] <- c(269.5, 270.5, 271.5)
tmp_dataframe1 <- data.frame(myID = c(268, 305, 268, 305, 268, 305, 306),
                             myvalue = c(1.150343, 2.830392, 1.150343, 2.830392, 1.150343, 2.830392, 1.150343),
                             myInter = c(178.5, 178.5, 179.5, 179.5, 180.5, 180.5, 181.5))
tmp_dataframe2 <- data.frame(myID = c(144, 188, 196, 300, 301, 302, 303, 97),
                             myvalue = c(1.293493, 3.286649, 1.408049, 0.469219, 11.143147, 0.687355, 0.508603, 0.654335),
                             myInter = c(269.5, 269.5, 269.5, 270.5, 270.5, 271.5, 185.5, 186.5))
mydata <- list()
mydata[[1]] <- tmp_dataframe1
mydata[[2]] <- tmp_dataframe2
########################
# #Generate nested loop
mysubset <- list() #Define list
for (i in 1:length(unique_test)) {
  #Prepare list of lists
  mysubset[[i]] <- NaN
  for (j in 1:length(unique_test[[i]])) {
    #Select myvalues whose myInter data equals the one found in unique_test and assign them to a new subset
    mysubset[[i]][j] <- mydata[[i]][which(mydata[[i]]$myInter == unique_test[[i]][j]), ][["myvalue"]]
  }
}
# #There are warnings and the nested loop is not really doing what it is supposed to do!
R gives the following warnings:
Warning messages:
1: In mysubset[[i]][j] <- mydata[[i]][which(mydata[[i]]$myInter == :
number of items to replace is not a multiple of replacement length
2: In mysubset[[i]][j] <- mydata[[i]][which(mydata[[i]]$myInter == :
number of items to replace is not a multiple of replacement length
3: In mysubset[[i]][j] <- mydata[[i]][which(mydata[[i]]$myInter == :
number of items to replace is not a multiple of replacement length
4: In mysubset[[i]][j] <- mydata[[i]][which(mydata[[i]]$myInter == :
number of items to replace is not a multiple of replacement length
5: In mysubset[[i]][j] <- mydata[[i]][which(mydata[[i]]$myInter == :
number of items to replace is not a multiple of replacement length
If I restrict myself to just the first element in my dataset, the "normal" (i.e. NOT nested) loop works out:
# #If I don't use a nested loop (by just using the first element in both "mydata" and "unique_test"), things seem to work out
# #But obviously, this is not really what I want to achieve (I can't just manually select every element in mydata and unique_test)
mysubset <- list()
mysubset <- list()
for (i in 1:length(unique_test[[1]])) {
  #Select myvalues whose myInter data equals the one found in unique_test and assign them to a new subset
  mysubset[[i]] <- mydata[[1]][which(mydata[[1]]$myInter == unique_test[[1]][i]), ][["myvalue"]]
}
Could it be that I first have to initialize my list with the appropriate dimensions? But how would I do that if the dimensions are NOT the same for all the elements in my dataset (that's why I have to use the length() function in the first place)?
As you can see, mydata[[1]] does not have the same dimensions as mydata[[2]].
Therefore the solutions presented in the following links do not apply to this dataset:
Error in R: Number of items to replace is not a multiple of replacement length
Error in `*tmp*`[[k]] : subscript out of bounds in R
I'm pretty sure it's something obvious I'm missing, but I just cannot find it. Any help is much appreciated!
If there are better ways of achieving the same without a loop (I'm sure there are, e.g. apply() or something along the lines of subset()), I would appreciate such comments as well. Unfortunately I'm not familiar enough with the alternatives to be able to implement them quickly.
Simply wrap your assignment in list(): because of the nested for loops you are assigning a numeric vector, often of length greater than one, into a single element of a nested list, which is what triggers the replacement-length warning. Wrapping the vector in list() stores it whole in that element.
mysubset[[i]][j] <- list(mydata[[i]][which(mydata[[i]]$myInter == unique_test[[i]][j]),][["myvalue"]])
Or, shorter, since neither which() nor the outer square brackets are needed:
mysubset[[i]][j] <- list(mydata[[i]][mydata[[i]]$myInter == unique_test[[i]][j], c("myvalue")])
Alternatively, consider an apply solution, since then you do not need to assign an empty list up front and grow it iteratively. Nested lapply, sapply, mapply, and even rapply can create the needed lists and dimensions in one call. The mapply version assumes unique_test and mydata are always equal-length objects.
# NESTED LAPPLY
mysubset2 <- lapply(seq(length(unique_test)), function(i) {
  lapply(seq(length(unique_test[[i]])), function(j) {
    mydata[[i]][mydata[[i]]$myInter == unique_test[[i]][j], c("myvalue")]
  })
})
# NESTED SAPPLY
mysubset3 <- sapply(seq(length(unique_test)), function(i) {
  sapply(seq(length(unique_test[[i]])), function(j) {
    mydata[[i]][mydata[[i]]$myInter == unique_test[[i]][j], c("myvalue")]
  })
}, simplify = FALSE)
# NESTED M/LAPPLY
mysubset4 <- mapply(function(u, m) {
  lapply(u, function(i) m[m$myInter == i, c("myvalue")])
}, unique_test, mydata, SIMPLIFY = FALSE)
# NESTED R/LAPPLY
mysubset5 <- rapply(unique_test, function(i) {
  df <- do.call(rbind, mydata)
  lapply(i, function(u) df[df$myInter == u, c("myvalue")])
}, how = "list")
# ALL SUBSETS EQUAL EXACTLY
all.equal(mysubset, mysubset2)
# [1] TRUE
all.equal(mysubset, mysubset3)
# [1] TRUE
all.equal(mysubset, mysubset4)
# [1] TRUE
all.equal(mysubset, mysubset5)
# [1] TRUE
Can you post what you expect mysubset to look like? Based on my understanding, this should subset myvalue using values in unique_test:
mysubset <- unique(unlist(lapply(unlist(unique_test), function(x) subset(mydata, myInter == x, select = "myvalue"))))

Indexing certain elements in a nested list, for all nests

I have a list which contains more lists of lists:
results <- sapply(c(paste0("cv_", seq(1:50)), "errors"), function(x) NULL)
## Locations for results to be stored
step_results <- sapply(c("myFit", "forecast", "errors"), function(x) NULL)
step_errors <- sapply(c("MAE", "MSE", "sign_accuracy"), function(x) NULL)
final_error <- sapply(c("MAE", "MSE", "sign_accuracy"), function(x) NULL)
for(i in 1:50){results[[i]] <- step_results}
for(i in 1:50){results[[i]][[3]] <- step_errors}
results$errors <- final_error
Now in this whole structure, I would like to sum up all the values in sign_accuracy and save them in results$errors$sign_accuracy
I could maybe do this with a for-loop, indexing with i:
## This is just an example - it won't actually work!
sign_acc <- matrix(nrow = 50, ncol = 2)
for (i in 1:50){
sign_acc[i, ] <- `results[[i]][[3]][[3]]`
results$errors$sign_accuracy <- sign_acc
}
If I remember correctly, in Matlab there is something like list(:), which means all elements. In Python I have seen something like list[:], which also means all elements.
What is the elegant R equivalent? I don't really like loops.
I have seen methods using the apply family of functions, with something like apply(data, "[[", 2), but I can't get it to work for deeper lists.
Did you try with c(..., recursive)?
Here is an option with a short example at the end:
sumList <- function(l, label) {
  lc <- c(l, recursive = T)
  filter <- grepl(paste0("\\.", label, "$"), names(lc)) | (names(lc) == label)
  nums <- lc[filter]
  return(sum(as.numeric(nums)))
}
ex <- list(a = 56, b = list("5", a = 34, list(c = "3", a = "5")))
sumList(ex, "a")
In this case, you can do what you want with
results$errors$sign_accuracy <- do.call(sum, lapply(results, function(x){x[[3]][[3]]}))
lapply loops through the first layer of results and pulls out the third element of the third element for each; do.call(sum, ...) collects all the results and sums them.
The real problems with lists arise when the nesting is more irregular, or when you need to loop through more than one index. It can always be done in the same way, but it gets extraordinarily ugly very quickly.
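When every nest has the same named structure, extracting by name rather than by position is also an option (a sketch, assuming the cv_* slots have been filled with numeric sign_accuracy values):
acc <- sapply(results[paste0("cv_", 1:50)],
              function(x) x$errors$sign_accuracy)
results$errors$sign_accuracy <- sum(acc)
Indexing by name keeps working even if elements are later reordered or new slots are added, which positional [[3]][[3]] does not.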

Trying to vectorize a for loop in R

UPDATE
Thanks to the help and suggestions of @CarlWitthoft my code was simplified to this:
model <- unlist(sapply(1:length(model.list),
                       function(i) ifelse(length(model.list[[i]][model.lookup[[i]]] == "") == 0,
                                          NA, model.list[[i]][model.lookup[[i]]])))
ORIGINAL POST
Recently I read an article on how vectorizing operations in R, instead of using for loops, is good practice. I have a piece of code with a big for loop that I'm trying to turn into a vector operation, but I cannot find the answer. Could someone help me? Is it possible, or do I need to change my approach? My code works fine with the for loop, but I want to try the other way.
model <- c(0)
price <- c(0)
size <- c(0)
reviews <- c(0)
for (i in 1:length(model.list)) {
  if (length(model.list[[i]][model.lookup[[i]]] == "") == 0) {
    model[i] <- NA
  } else {
    model[i] <- model.list[[i]][model.lookup[[i]]]
  }
  if (length(model.list[[i]][price.lookup[[i]]] == "") == 0) {
    price[i] <- NA
  } else {
    price[i] <- model.list[[i]][price.lookup[[i]]]
  }
  if (length(model.list[[i]][reviews.lookup[[i]]] == "") == 0) {
    reviews[i] <- NA
  } else {
    reviews[i] <- model.list[[i]][reviews.lookup[[i]]]
  }
  size[i] <- product.link[[i]][size.lookup[[i]]]
}
Basically, model.list is a list from which I want to extract particular elements. Their locations are given by the variables model.lookup, price.lookup and reviews.lookup, which contain logical vectors with just one TRUE value each, used to pull the desired element out of model.list. On every cycle of the for loop the extracted values are stored in the variables model, price, size and reviews.
Could this be changed to a vector operation?
In general, try to avoid if when not needed. I think your desired output can be built as follows.
model <- unlist(sapply(1:length(model.list), function(i) model.list[[i]][model.lookup[[i]]]))
model[model == ''] <- NA
And the same for your other variables. This assumes that all model.lookup[[i]] are of length one. If they aren't, you won't be able to write the output to a single element of model in the first place.
I would also note that you are grossly overcoding, e.g. x <- 0 is better than x <- c(0), and don't bother with length evaluation on a single item.
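To spell out "the same for your other variables" (a sketch; the helper name extract_field is illustrative, and it assumes the lookup lists run parallel to the data lists with each lookup selecting exactly one element):
extract_field <- function(lst, lookup) {
  # pull the single looked-up element from each list component,
  # then recode empty strings as NA
  out <- unlist(sapply(seq_along(lst), function(i) lst[[i]][lookup[[i]]]))
  out[out == ''] <- NA
  out
}
model   <- extract_field(model.list, model.lookup)
price   <- extract_field(model.list, price.lookup)
reviews <- extract_field(model.list, reviews.lookup)
size    <- extract_field(product.link, size.lookup)
This avoids repeating the same sapply pattern four times.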

Direct update (replace) of sparse data frame is slow and inefficient

I'm attempting to read in a few hundred thousand JSON files and eventually get them into a dplyr object. But the JSON files are not a simple key-value parse; they require a lot of pre-processing. The pre-processing is written and performs fairly well. The challenge I am having is loading each record into a single object (a data.table or dplyr object) efficiently.
This is very sparse data: I'll have over 2000 variables that will mostly be missing. Each record will have maybe a hundred variables set. The variables will be a mix of character, logical and numeric, and I do know the mode of each variable.
I thought the best way to avoid R copying the object on every update (or adding one row at a time) would be to create an empty data frame and then update the specific fields after they are pulled from the JSON file. But doing this in a data frame is extremely slow; moving to a data.table is much better, but I'm still hoping to reduce it to minutes instead of hours. See my example below:
library(data.table)

timeMe <- function() {
  set.seed(1)
  names = paste0("A", seq(1:1200))
  # try with a data frame
  # outdf <- data.frame(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  # try with a data table
  outdf <- data.table(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  for (i in seq(100)) {
    # generate 100 columns (real data is in json)
    sparse.cols <- sample(1200, 100)
    # Each record is coming in as a list
    # Each column is either a character, logical, or numeric
    sparse.val <- lapply(sparse.cols, function(i) {
      if (i < 401) { # logical
        sample(c(TRUE, FALSE), 1)
      } else if (i < 801) { # numeric
        sample(seq(10), 1)
      } else { # character
        sample(LETTERS, 1)
      }
    }) # now we have a list with values to populate
    names(sparse.val) <- paste0("A", sparse.cols)
    # and here is the challenge and what takes a long time.
    # want to assign the ith row and the named column with each value
    for (x in names(sparse.val)) {
      val = sparse.val[[x]]
      # this is where the bottleneck is:
      # for data frame
      # outdf[i, x] <- val
      # for data table
      outdf[i, x := val]
    }
  }
  outdf
}
I thought the mode of each column might be getting set and reset with each update, but pre-setting each column type didn't help either.
For me, running this example with a data.frame (commented out above) takes around 22 seconds; with a data.table it is 5 seconds. I was hoping someone knew what was going on under the covers and could provide a faster way to populate the data.table here.
I follow your code except for the part where you construct sparse.val. There are minor errors in the way you assign columns. Don't forget to check that the answer is right while trying to optimise :).
First, the creation of data.table:
Since you say that you already know the type of the columns, it's important to generate the correct type up front. Else, when you do: DT[, LHS := RHS] and RHS type is not equal to LHS, RHS will be coerced to the type of LHS. In your case, all your numeric and character values will be converted to logical, as all columns are logical type. This is not what you want.
Creating a matrix therefore won't help (all columns would end up the same type), and it's also slow. Instead, I'd do it like this:
rows = 100L
cols = 1200L
outdf <- setDT(lapply(seq_len(cols), function(i) {
  if (i < 401L) rep(NA, rows)
  else if (i >= 402L & i < 801L) rep(NA_real_, rows)
  else rep(NA_character_, rows)
}))
Now we have the right types set up front. Next, I think it should be i >= 402L & i < 801L. Otherwise, you're assigning the first 401 columns as logical and then the first 801 columns as numeric, which, given that you know the type of the columns upfront, doesn't make much sense, right?
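As a quick sanity check (a sketch with simple contiguous ranges rather than the 402 cut-off above), you can confirm the column types come out as intended:
library(data.table)
chk <- setDT(lapply(seq_len(1200L), function(i) {
  if (i < 401L) rep(NA, 2L)            # columns 1-400: logical
  else if (i < 801L) rep(NA_real_, 2L) # columns 401-800: numeric
  else rep(NA_character_, 2L)          # columns 801-1200: character
}))
table(vapply(chk, typeof, character(1L)))
# character    double   logical
#       400       400       400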
Second, doing names(.) <-:
The line:
names(sparse.val) <- paste0("A", sparse.cols)
will create a copy and is not really necessary. Therefore we'll delete this line.
Third, the time consuming for-loop:
for (x in names(sparse.val)) {
  val = sparse.val[[x]]
  outdf[i, x := val]
}
is not actually doing what you think it's doing. It's not assigning the values from val to the column whose name is stored in x. Instead it's (over)writing, each time, a column literally named x. Check your output.
This is not part of the optimisation. It's just to let you know what you're actually trying to do here.
for (x in names(sparse.val)) {
  val = sparse.val[[x]]
  outdf[i, (x) := val]
}
Note the ( around x. Now it'll be evaluated, and the value contained in x names the column to which val's value will be assigned. It's a bit subtle, I understand. But this is necessary, because the bare form DT[, x := val] has to stay available for when you actually want to create a column called x.
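A tiny illustration of the difference (a sketch, not from the original answer):
library(data.table)
DT <- data.table(a = 1:3)
col <- "b"
DT[, col := 0]   # creates a column literally named "col"
DT[, (col) := 0] # evaluates col and creates the column "b", as intended
It is the same subtlety as in the loop above.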
Coming back to the optimisation: the good news is that your time-consuming for-loop is simply:
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
This is where data.table's sub-assign by reference feature comes in handy!
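To see set() in isolation (a minimal sketch): set(DT, i, j, value) writes value into rows i and columns j by reference, without the overhead of the [.data.table call, and j and value may cover several columns at once:
library(data.table)
DT <- data.table(A1 = rep(NA, 3), A2 = rep(NA_real_, 3))
set(DT, i = 2L, j = c("A1", "A2"), value = list(TRUE, 3.14))
DT
#       A1    A2
# 1:    NA    NA
# 2:  TRUE  3.14
# 3:    NA    NA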
Putting it all together:
Your final function looks like this:
timeMe2 <- function() {
  set.seed(1L)
  rows = 100L
  cols = 1200L
  outdf <- as.data.table(lapply(seq_len(cols), function(i) {
    if (i < 401L) rep(NA, rows)
    else if (i >= 402L & i < 801L) rep(NA_real_, rows)
    else rep(NA_character_, rows)
  }))
  setnames(outdf, paste0("A", seq(1:1200)))
  for (i in seq(100)) {
    sparse.cols <- sample(1200L, 100L)
    sparse.val <- lapply(sparse.cols, function(i) {
      if (i < 401L) sample(c(TRUE, FALSE), 1)
      else if (i >= 402L & i < 801L) sample(seq(10), 1)
      else sample(LETTERS, 1)
    })
    set(outdf, i = i, j = paste0("A", sparse.cols), value = sparse.val)
  }
  outdf
}
By doing this, your solution takes 9.84 seconds on my system, whereas the function above takes 0.34 seconds, which is a ~29x improvement. I think this is the result you're looking for. Please verify it.
HTH
