Deleting a row from a data set in R

I am trying to create a function that deletes n rows from a data set in R. The rows I want to delete are the ones with the smallest values in the column time of the data set my_data_set.
I currently have
delete_data <- function(n)
{
  k <- 1
  while (k <= n)
  {
    my_data_set <- my_data_set[-which.min(my_data_set$time), ]
    k <- k + 1
  }
}
When I input these lines manually (without the use of the while loop) it works perfectly, but I am not able to get the loop to work.
I am calling the function by:
delete_data(n = 2)
Any help is appreciated!
Thanks

Try:
my_data_set[ ! my_data_set$time == min(my_data_set$time), ]
Or if you are using data.table and wish to use the more direct syntax that data.table provides:
library(data.table)
setDT( my_data_set )
my_data_set[ ! time == min(time) ]
Then review how R works: it is a vectorized language that pretty much does what you mean without you having to resort to complicated loops.

Also try:
my_data_set <- my_data_set[which(my_data_set$time > min(my_data_set$time)),]
By the way, which.min() will only pick up the first record if there is more than one record matching the minimum value.
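If the goal really is to drop exactly n rows (one per iteration, as in the original function), here is a vectorized sketch using order(). Note that it takes the data as an argument and returns the result, which also fixes the reason the original loop seemed to do nothing: the assignment inside the function only modified a local copy of my_data_set.

# A sketch; assumes my_data_set$time is numeric and 1 <= n <= nrow(my_data_set).
# order() breaks ties by row order, matching repeated which.min() calls.
delete_data <- function(data, n) {
  stopifnot(n >= 1, n <= nrow(data))
  data[-order(data$time)[seq_len(n)], ]
}
my_data_set <- delete_data(my_data_set, n = 2)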

Related

Operate on multiple rows of dataframe simultaneously in R

I'm sure someone has asked this (very basic) question before, but I must be searching for the wrong thing because I can't find an answer:
I frequently need to perform operations that involve combining data from multiple rows of the same dataframe. I know how to do this with a looping construct, e.g.
for (i in 2:nrow(df)) { df$result[i] <- df$data[i] - df$data[i-1] }
for (i in 12:nrow(df)) { j <- i - 11; df$result[i] <- prod(df$data[j:i]) }
Is there a general solution for these types of operations that does not involve looping? Or is looping actually the best way to do it in R?
You may try subsetting your data frame, e.g. this:
for (i in 2:nrow[df]) { df$result[i] <- df$data[i] - df$data[i-1] }
becomes:
df$result[2:nrow(df)] <- df$data[2:nrow(df)] - df$data[1:(nrow(df) - 1)]
Note: nrow() is a function AFAIK, so you should call it using parentheses, not square brackets.
In base R:
df$result[2:nrow(df)] = diff(df$data)
df$result2[13:nrow(df)] = diff(df$data,12)
Or dplyr:
df$result = df$data - dplyr::lag(df$data)
df$result2 = df$data - dplyr::lag(df$data, 12)
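The diff()/lag() lines cover the running-difference case; the second loop in the question is a rolling 12-element product, which can be vectorized too. Below is a sketch of my own (not from the answers above) using cumulative sums of logs; it assumes df$data is strictly positive, otherwise rolling-window helpers such as zoo's rollapply are the safer route.

n  <- nrow(df)
cl <- c(0, cumsum(log(df$data)))   # cl[i + 1] == sum(log(df$data[1:i]))
df$result2 <- NA_real_
# prod(df$data[(i - 11):i]) == exp(cl[i + 1] - cl[i - 11])
df$result2[12:n] <- exp(cl[(12:n) + 1] - cl[(12:n) - 11])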

R: Build custom cumsum function in sapply

I'm trying to build a more custom version of cumsum to use on a data.table, but I'm failing at the first step:
library(data.table)

numbers <- data.table(num = 1:10)
sum <- 0
cumFunct <- function(n) {
  sum <<- sum + n
  return(sum)
}
numbers[, cum := sapply(num, cumFunct)]
While this works, it is very unclean. It also requires sum to be set to 0 before I run the function.
Now, how do I write this in a cleaner way? Essentially, how can I pass the intermediate result to the next iteration of cumFunct without using global variables?
Thanks very much!
One way to do this would be to use the data.table "numbers" within the function:
numbers <- data.table(num = 1:10)
cumFunct <- function(n) {
  sum <- sum(numbers[1:n])
  return(sum)
}
numbers[, cum := sapply(num, cumFunct)]
This is not the most efficient way (the sum is recomputed from scratch for every row), but depending on what you do in your custom code it can be improved.
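For a plain running sum there is no need for sapply at all; data.table can take cumsum directly (shown for contrast; the custom-logic case is what the recursive answer below addresses):

numbers[, cum := cumsum(num)]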
An answer that is also a question: is this a pattern that will work here?
complicated.wizardry <- function(a, b) {
  a + b
}

cumlist <- function(sofar, remaining, myfn) {
  step <- c(sofar, myfn(sofar[length(sofar)], remaining[1]))
  if (length(remaining) == 1) return(step)
  cumlist(step, remaining[2:length(remaining)], myfn)
}

cumlist(0, 1:10, complicated.wizardry)
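This recursive pattern does work, but base R already provides it: Reduce() with accumulate = TRUE keeps every intermediate result, and init supplies the starting value. A sketch matching the call above:

# Same output as cumlist(0, 1:10, complicated.wizardry): 0, 1, 3, 6, ..., 55
Reduce(complicated.wizardry, 1:10, init = 0, accumulate = TRUE)

# Applied to the original data.table task (no init, so no leading 0):
numbers[, cum := Reduce(`+`, num, accumulate = TRUE)]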

in R: Setting new Values in a data.table fast

I am trying to set values in a data.table in an efficient way. The following code does what I want, but it is too slow for large datasets:
library(data.table)

DTcars <- as.data.table(mtcars)
for (i in 1:(dim(DTcars)[1] - 1)) {
  for (j in 1:dim(DTcars)[2]) {
    if (DTcars[i, j, with = FALSE] > 10) {
      set(DTcars,
          i = as.integer(i),
          j = as.integer(j),
          value = DTcars[dim(DTcars)[1], j, with = FALSE])
    }
  }
}
And I want something like the following, which is wrong as written but expresses my need, and I think it would be faster: I want to subset my data.table, insert the same value for a particular column, and repeat for each column.
DTcars <- as.data.table(mtcars)
ns <- names(DTcars)
for (j in 1:length(ns)) {
  DTcars[ns[j] > 10] <- DTcars[20, ns[j]]
}
I think you're looking for
for (j in names(DTcars)) set(DTcars,
i = which(DTcars[[j]]>10),
j = j,
value = tail(DTcars[[j]],1)
)
The column numbers or names can be used as the for iterator here.
The value changes between the two pieces of code in the OP, so I'm not sure about that.
IMO set should be used sparingly, and regular := is sufficient almost always:
for (col in names(DTcars))
DTcars[get(col) > 10, (col) := get(col)[.N]]
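One caveat worth spelling out (my observation, not part of either answer): inside DT[i, j], the j expression is evaluated on the rows selected by i, so get(col)[.N] is the last matching value, whereas tail(DTcars[[j]], 1) in the set() version is the last value of the whole column. The two only coincide when the final row satisfies the condition. A tiny illustration:

library(data.table)
DT <- data.table(x = c(5, 20, 30, 5))
DT[x > 10, y := x[.N]]   # x[.N] is 30 (last matching value), not 5 (last row)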

How to efficiently chunk data.frame into smaller ones and process them

I have a larger data.frame which I want to cut into smaller ones, depending on some "unique keys" (in the MySQL sense). At the moment I am doing this with the loop below, but it takes awfully long: ~45 sec for 10k rows.
true_counter <- 0
for (i in 1:nrow(identifiers_test)) {
  data_test_offer <- data_test[identifiers_test[i, "m_id"] == data_test[, "m_id"] &
                                 identifiers_test[i, "a_id"] == data_test[, "a_id"] &
                                 identifiers_test[i, "condition"] == data_test[, "condition"] &
                                 identifiers_test[i, "time_of_change"] == data_test[, "time_of_change"], ]
  # Sort data by highest prediction
  data_test_offer <- data_test_offer[order(-data_test_offer[, "prediction"]), ]
  if (data_test_offer[1, "is_v"] == 1) {
    true_counter <- true_counter + 1
  }
}
How can I refactor this to make it more idiomatic R, and faster?
Before applying groups, you are filtering your data.frame using another data.frame. I would use merge, then by.
ID <- c("m_id", "a_id", "condition", "time_of_change")
filter_data <- merge(data_test, identifiers_test, by = ID)
by(filter_data, do.call(paste, filter_data[, ID]),
   FUN = function(x) x[order(-x[, "prediction"]), ])
Of course the same thing can be written using data.table more efficiently:
library(data.table)
setkeyv(setDT(identifiers_test), ID)
setkeyv(setDT(data_test), ID)
data_test[identifiers_test][rev(order(prediction)), , ID]
NOTE: the answer above is not tested since you don't provide data to test it.
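If the end goal is just the true_counter from the question, the whole loop collapses into one grouped operation. A sketch (likewise untested, with column names taken from the question):

library(data.table)
ID <- c("m_id", "a_id", "condition", "time_of_change")
setDT(data_test)
setDT(identifiers_test)
# Inner join on the keys, then check per group whether the row with the
# highest prediction has is_v == 1, and count the groups where it does
matched <- data_test[identifiers_test, on = ID, nomatch = 0L]
true_counter <- matched[, .(top_is_v = is_v[which.max(prediction)]),
                        by = ID][, sum(top_is_v == 1)]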

Modifying Data Set within a function but data set is not changed

My code is the following in R:
replaceNA <- function(myData, limit) {
  numNA <- rowSums(is.na(myData))
  targetRows <- which(numNA <= limit)
  targetCols <- length(names(myData))
  for (row in targetRows) {
    for (col in 1:targetCols) {
      myData[row, col][is.na(myData[row, col])] <- 1
    }
  }
}
I am trying to iterate through each element in myData and replace all NAs in a row with 1 IF the row does not have more than limit NAs. I have tested my code with print statements and found that the iteration works perfectly (although it is not the most efficient code). If I examine the modified myData by putting fix(myData) before the last bracket of the function, I see that my function worked: the NAs are replaced with 1s for the rows that meet the limit condition. However, when I examine myData after the function terminates, myData does not show the changes replaceNA made.
I know there is a problem in storing the modified myData but I am not sure how to store it properly.
The condition is not entirely clear (a language issue), but in any case you don't need a for loop here.
To compute the number of missing values for each row:
rowSums(is.na(myData))
Then you just test your condition and do the replacement on those rows:
mm <- myData[rowSums(is.na(myData)) <= limit, ]
mm[is.na(mm)] <- 1
myData[rowSums(is.na(myData)) <= limit, ] <- mm
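Wrapped as a function that returns the modified copy (a sketch that combines this approach with the explicit-return advice in the next answer):

replaceNA <- function(myData, limit) {
  keep <- rowSums(is.na(myData)) <= limit
  mm <- myData[keep, ]
  mm[is.na(mm)] <- 1
  myData[keep, ] <- mm
  myData
}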
You should make your function explicitly return the modified data,
replaceNA <- function(myData, limit) {
  numNA <- rowSums(is.na(myData))
  targetRows <- which(numNA <= limit)
  targetCols <- length(names(myData))
  for (row in targetRows) {
    for (col in 1:targetCols) {
      myData[row, col][is.na(myData[row, col])] <- 1
    }
  }
  return(myData)
}
then assign the modified data. You could overwrite your old data
myData <- replaceNA(myData, limit = 2)
or make a copy to compare
myData_no_na <- replaceNA(myData, limit = 2)
You can also avoid the loop entirely, which is much more R-like. #agstudy's answer seems to be covering that approach nicely.
