Setting new values in a data.table fast in R

I am trying to set values to a data.table in an efficient way. The following code will do what I want, but it is too slow for large datasets:
DTcars<-as.data.table(mtcars)
for(i in 1:(dim(DTcars)[1]-1)){
for(j in 1:dim(DTcars)[2]){
if(DTcars[i,j, with=F]>10){
set(DTcars,
i=as.integer(i),
j =as.integer(j) ,
value = DTcars[dim(DTcars)[1],j,with=F])
}
}
}
What I want is something like the following. The code is wrong, but it expresses my need, and I think the idea would be faster: subset the data.table, assign the same value to the matching rows of one column, and repeat for each column.
DTcars<-as.data.table(mtcars)
ns<-names(DTcars)
for(j in 1:length(ns)){
DTcars[ns[j]>10]<-DTcars[20,ns[j]]
}

I think you're looking for
for (j in names(DTcars)) set(DTcars,
i = which(DTcars[[j]]>10),
j = j,
value = tail(DTcars[[j]],1)
)
The column numbers or names can be used as the for iterator here.
The value changes between the two pieces of code in the OP, so I'm not sure about that.
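For a quick check of what this loop does, here is a minimal sketch on a toy table. The table DT and its values are made up for illustration, and the sketch assumes the intended replacement is the value in the last row, as in the first snippet of the question.

```r
library(data.table)

# Toy table (hypothetical data, not from the question)
DT <- data.table(a = c(5, 12, 20, 7), b = c(11, 3, 15, 9))

for (j in names(DT)) {
  last_val <- DT[[j]][nrow(DT)]      # value in the last row of column j
  set(DT,
      i = which(DT[[j]] > 10),       # rows of column j that exceed 10
      j = j,
      value = last_val)
}

DT$a  # 5 7 7 7
DT$b  # 9 3 9 9
```

Because set() modifies the table by reference, no reassignment of DT is needed inside the loop.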

IMO set should be used sparingly, and regular := is sufficient almost always:
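for (col in names(DTcars)) {
  last_val <- DTcars[[col]][nrow(DTcars)] # last row of the full table; inside DTcars[i, ...] .N counts only the matching rows
  DTcars[get(col) > 10, (col) := last_val]
}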

Related

How to pass a column name in a for loop concatenating i with a string?

I need to subset a data frame in several others based in the values of several columns of the original data frame.
Here's my for loop:
for (i in 1:qtde_erros_esti){
temp_esti <- erro_esti[(paste0("erro_esti$" , "erro", i) == "1"),]
assign(paste0("erro", i,"_esti"), temp_esti)
rm(temp_esti)
}
The last piece of the puzzle for me is to pass the column name which value I must check (1st line in the for loop).
I'm trying to pass it with the function paste0, but the result of the function is a string that will never be equal to "1", hence never getting any data.
How can I pass the column names (erro_esti$erro1, erro_esti$erro2, and so on...) in this case?
Observation: I'm aware that this may not be the best approach using R, but I'm a noobie, coming from SAS, so I have limited knowledge.
Secondary question: is the way that I formulated the question (topic title) good? Accepting criticism on that too, please, aiming to improve future questions.
Thanks in advance to anyone who takes the time to read this.
We can use [[ instead of $ to subset the column dynamically:
erro_esti[[paste0("erro", i)]]
Full code:
for(i in seq_len(qtde_erros_esti)) {
temp_esti <- erro_esti[erro_esti[[paste0("erro", i)]] == 1,]
assign(paste0("erro", i,"_esti"), temp_esti)
rm(temp_esti)
}
You are probably making this more complicated than it needs to be. Consider this approach:
for (i in 1:qtde_erros_esti){
column.name <- paste0("erro", i)
column.data <- erro_esti[, column.name ]
## do things with the column.data vector here
}
Now you can do what needs to be done with the data from column i, using the column.data variable.
If you just want to work with every column of your data.frame, also consider this further simplified pattern:
for( column.data in erro_esti ) {
## work with column.data here
}
You can just iterate over the columns of erro_esti directly, no need to use a counter, unless you need that counter for something else.
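If the goal is still one subset per error column, an alternative to creating separate variables with assign is to keep the subsets in a named list. This is only a sketch: the example data frame and its values are invented, and the "_esti" naming simply mirrors the question.

```r
# Hypothetical example data: two error flag columns
erro_esti <- data.frame(id = 1:4,
                        erro1 = c(1, 0, 1, 0),
                        erro2 = c(0, 0, 1, 1))
qtde_erros_esti <- 2

cols <- paste0("erro", seq_len(qtde_erros_esti))

# One data frame per error column, holding the rows where that flag equals 1.
# which() drops NA comparisons instead of producing all-NA rows.
subsets <- lapply(cols, function(col) {
  erro_esti[which(erro_esti[[col]] == 1), , drop = FALSE]
})
names(subsets) <- paste0(cols, "_esti")

subsets$erro1_esti$id  # 1 3
subsets$erro2_esti$id  # 3 4
```

A list keeps the results together, so they can be processed later with lapply rather than looked up by constructed variable names.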

Double "for loops" in a dataframe in R

I need to do a quality control in a dataset with more than 3000 variables (columns). However, I only want to apply some conditions in a couple of them. A first step would be to replace outliers by NA. I want to replace the observations that are greater or smaller than 3 standard deviations from the mean by NA. I got it, doing column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Someone told me that I am not storing d2. How can I create for loops to apply the conditions I want? I know that there are similar questions, but I haven't understood them yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
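NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm = TRUE) # ignore existing NAs when computing the mean
  sd_x <- sd(x, na.rm = TRUE)
  # TRUE where a value lies more than 3 standard deviations from the mean
  x_outside_3s <- abs(x - mean_x) > 3 * sd_x
  x[which(x_outside_3s)] <- NA # no need for ifelse here; which() skips NA comparisons
  x
}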
Of course, you can choose any function name you want; more descriptive is better.
Then, if you want to apply the function to every column, just loop over the columns. The function NA_outside_3s is already vectorized, i.e. it takes a numeric vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or, some subset of columns.
for (j in cols_to_loop_over) {
  my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.
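To see the whole pattern end to end, here is a small self-contained sketch that selects the columns by name, as in the question. The helper name na_outside_k_sd, its k parameter, and the example data are all made up for illustration. Note that with only a handful of rows, no value can lie 3 standard deviations from the mean, so the toy data uses 21 rows.

```r
# Hypothetical helper: set values more than k standard deviations
# from the column mean to NA, ignoring NAs that are already there
na_outside_k_sd <- function(x, k = 3) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  x[which(abs(x - centre) > k * spread)] <- NA
  x
}

# Made-up data: one extreme height among 21 rows, one missing age
my_data <- data.frame(
  height = c(rep(150, 20), 1000),
  age    = c(seq(10, 29), NA)
)

traits <- c("height", "age")  # only the columns that need the check
my_data[traits] <- lapply(my_data[traits], na_outside_k_sd)

my_data$height[21]  # NA: the extreme value was replaced
```

Selecting columns by name keeps the other columns untouched, which matters when only a few of several thousand variables need the check.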

Deleting a row from a data set

I am trying to create a function that deletes n rows from a data set in R. The rows that I want to delete are the minimum values from the column time in the data set my_data_set.
I currently have
delete_data <- function(n)
{
k=1
while(k <= n)
{
my_data_set = my_data_set[-(which.min(my_data_set$time)),]
k=k+1
}
}
When I input these lines manually (without the use of the while loop) it works perfectly but I am not able to get the loop to work.
I am calling the function by:
delete_data(n = 2)
Any help is appreciated!
Thanks
Try:
my_data_set[ ! my_data_set$time == min(my_data_set$time), ]
Or if you are using data.table and wish to use the more direct syntax that data.table provides:
library(data.table)
setDT( my_data_set )
my_data_set[ ! time == min(time) ]
Then review how R works. R is a vectorized language that pretty much does what you mean without having to resort to complicated loops.
Also try:
my_data_set <- my_data_set[which(my_data_set$time > min(my_data_set$time)),]
By the way, which.min() will only pick up the first record if there is more than one record matching the minimum value.
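The min()-based answers above remove every row tied at the minimum rather than exactly n rows. If the goal is really to drop the n smallest times, one possible sketch is below. The function name drop_n_smallest and the example data are hypothetical; the key points are that the data frame is passed in as an argument and the result is returned and reassigned, since a function only modifies its local copy.

```r
# Hypothetical helper: return df without the n rows that have the smallest `time`
drop_n_smallest <- function(df, n) {
  if (n <= 0) return(df)
  # order() sorts ascending and puts NA last, so the first n positions are the smallest times
  idx <- order(df$time)[seq_len(min(n, nrow(df)))]
  df[-idx, , drop = FALSE]
}

my_data_set <- data.frame(id = 1:5, time = c(3, 1, 4, 1, 5))
my_data_set <- drop_n_smallest(my_data_set, n = 2)  # reassign the returned value

my_data_set$id  # 1 3 5
```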

Modifying Data Set within a function but data set is not changed

My code is the following in R:
replaceNA<- function(myData,limit){
numNA<- rowsum(is.na(myData))
targetRows<- which(numNA<=limit)
targetCols<- length(names(myData))
for(row in targetRows){
for(col in 1:targetCols){
myData[row,col][is.na(myData[row,col])]<-1
}
}
}
I am trying to iterate through each element in myData and replace all NAs of a row with 1 IF the row does not have more than the number of NAs. I have tested my code with print statements and found that the iteration works perfectly (although not the most efficient code) and if I examine the modified myData by putting in a fix(myData) before the last bracket of the function, I see that my function worked perfectly(the NAs are replaced with 1s for the rows that meet the limit condition). However, when I examine myData after the function terminates, myData does not show the changes replaceNA made.
I know there is a problem in storing the modified myData but I am not sure how to store it properly.
The condition is not clear (English problem). In any case, you don't need a for loop here.
To compute the number of missing values for each row :
rowSums(is.na(myData))
Then you just test your condition and you replace all the row:
mm <- myData[rowSums(is.na(myData)) <= limit ,]
mm[is.na(mm)] <- 1
myData[rowSums(is.na(myData)) <= limit ,] <- mm
You should make your function explicitly return the modified data:
replaceNA <- function(myData, limit) {
  numNA <- rowSums(is.na(myData)) # rowSums, not rowsum: count the NAs in each row
  targetRows <- which(numNA <= limit)
  targetCols <- length(names(myData))
  for (row in targetRows) {
    for (col in 1:targetCols) {
      myData[row, col][is.na(myData[row, col])] <- 1
    }
  }
  return(myData) # without this, the changes stay in the function's local copy
}
then assign the modified data. You could overwrite your old data
myData <- replaceNA(myData, limit = 2)
or make a copy to compare
myData_no_na <- replaceNA(myData, limit = 2)
You can also avoid the loop entirely, which is much more R-like. @agstudy's answer seems to cover that approach nicely.
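Putting the two answers together, a loop-free version that also returns its result might look like the sketch below. The function name replace_na_rows, the replacement argument, and the example data are assumptions for illustration; it reads the condition as "rows with at most limit NAs".

```r
# Hypothetical loop-free variant: fill NAs only in rows with at most `limit` NAs
replace_na_rows <- function(my_data, limit, replacement = 1) {
  na_mask  <- is.na(my_data)              # logical matrix, one cell per value
  eligible <- rowSums(na_mask) <= limit   # one flag per row
  # `eligible` is recycled down each column, so it lines up with the rows
  my_data[na_mask & eligible] <- replacement
  my_data                                 # return the modified copy
}

df <- data.frame(a = c(NA, 2, NA), b = c(1, NA, NA), c = c(1, 2, NA))
out <- replace_na_rows(df, limit = 1)

out$a  # 1 2 NA  (row 3 has three NAs, so it is left alone)
```

The original df is unchanged; only the returned out carries the replacements.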

missing value where TRUE/FALSE needed error in R

I have got a column with different numbers (from 1 to tt) and would like to use looping to perform a count on the occurrence of these numbers in R.
count = matrix(ncol=1,nrow=tt) #creating an empty matrix
for (j in 1:tt)
{count[j] = 0} #initiate count at 0
for (j in 1:tt)
{
for (i in 1:N) #for each observation (1 to N)
{
if (column[i] == j)
{count[j] = count[j] + 1 }
}
}
Unfortunately I keep getting this error.
Error in if (column[i] == j) { :
missing value where TRUE/FALSE needed
So I tried:
for (i in 1:N) #from obs 1 to obs N
if (column[i] = 1) print("Test")
I basically got the same error.
I tried to do a bit of research on this kind of error, and a lot of it talks about "debugging", which I'm not familiar with.
Hopefully someone can tell me what's happening here. Thanks!
As you progress with your learning of R, one feature you should be aware of is vectorisation. Many operations that (in C, say) would have to be done in a loop can be done all at once in R. This is particularly true when you have a vector/matrix/array and a scalar, and want to perform an operation between them.
Say you want to add 2 to the vector myvector. The C/C++ way to do it in R would be to use a loop:
for ( i in 1:length(myvector) )
myvector[i] = myvector[i] + 2
Since R has vectorisation, you can do the addition without a loop at all, that is, add a scalar to a vector:
myvector = myvector + 2
Vectorisation means the loop is done internally. This is much more efficient than writing the loop within R itself! (If you've ever done any Matlab or python/numpy it's much the same in this sense).
I know you're new to R so this is a bit confusing but just keep in mind that often loops can be eliminated in R.
With that in mind, let's look at your code:
The initialisation of count to 0 can be done at creation, so the first loop is unnecessary.
count = matrix(0,ncol=1,nrow=tt)
Secondly, because of vectorisation, you can compare a vector to a scalar.
So for your inner loop in i, instead of looping through column and doing if column[i]==j, you can do idx = (column==j). This returns a vector that is TRUE where column[i]==j and FALSE otherwise.
To find how many elements of column are equal to j, we just count how many TRUEs there are in idx. That is, we do sum(idx).
So your double-loop can be rewritten like so:
for ( j in 1:tt ) {
idx = (column == j)
count[j] = sum(idx) # no need to add
}
Now it's even possible to remove the outer loop in j by using the function sapply:
sapply( 1:tt, function(j) sum(column==j) )
The above line of code means: "for each j in 1:tt, call function(j)", and it returns a vector where the j'th element is the result of the function.
So in summary, you can reduce your entire code to:
count = sapply( 1:tt, function(j) sum(column==j) )
(Although this doesn't explain your error, which I suspect is to do with the construction or class of your column).
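As for the error itself: in general, if() raises it whenever its condition evaluates to NA, which happens when column[i] is NA, either because the data contains a missing value or because i runs past the end of the vector. Whether that is the cause in the question's data is an assumption; the sketch below reproduces the situation on a made-up vector.

```r
# Made-up vector with one missing value
column <- c(1, 2, NA, 2)

# NA == 2 is NA, and if() cannot decide on NA, so it raises an error
res <- tryCatch(
  if (column[3] == 2) "match" else "no match",
  error = function(e) conditionMessage(e)
)
res  # the error message from the question

# Indexing past the end of a vector also yields NA
out_of_range <- column[5]

# Two ways to guard against NA
guarded <- isTRUE(column[3] == 2)         # FALSE instead of an error
n_twos  <- sum(column == 2, na.rm = TRUE) # vectorised count that skips NAs
n_twos  # 2
```

Checking anyNA(column) and length(column) against N before looping would reveal which of the two cases applies.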
I suggest to not use for loops, but use the count function from the plyr package. This function does exactly what you want in one line of code.
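If adding a package is not desirable, base R's tabulate can produce the same counts in one call. A minimal sketch, assuming the column holds whole numbers from 1 to tt (the vector below is made up):

```r
# Made-up data: values between 1 and tt
column <- c(1, 3, 3, 2, 3)
tt <- 4

# Count occurrences of each integer 1..tt; values that never occur get 0
count <- tabulate(column, nbins = tt)
count  # 1 1 3 0

# Same result as the sapply version above
count_sapply <- sapply(1:tt, function(j) sum(column == j))
```

table(column) is another base option, though it only lists values that actually occur.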
