Optimize code to filter R dataframe - r

I have some R code that takes in the args string from the command line and then filters a dataframe based on values in a column; the args string contains the column names. Right now I'm doing it by looping through the vector but something tells me that there has to be a better way. Is there a way to optimize this code?
args = c("col1","col2")
for(i in args){
df = df[df[,i]==0,]
}

If I understand correctly, you want to keep the rows where all of the args are equal to 0 (or any other given value).
First get the indices of the columns you're interested in:
idx <- match(args, colnames(df))
Then you can simply do:
df <- df[apply(df[, idx, drop = FALSE], 1, function(x) all(x == 0)), ]
Another possibility:
df <- df[rowSums(df[, idx, drop = FALSE] != 0) == 0, ]
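As a quick check, here is a minimal sketch comparing the original loop with the rowSums approach. The data frame and its values are made up for illustration. It uses drop = FALSE so the column subset stays a data frame even if args names a single column.

```r
# Hypothetical example data; col3 is not part of the filter
df <- data.frame(col1 = c(0, 1, 0, 0),
                 col2 = c(0, 0, 2, 0),
                 col3 = c(5, 6, 7, 8))
args <- c("col1", "col2")

# Original approach: filter one column at a time
looped <- df
for (i in args) {
  looped <- looped[looped[, i] == 0, ]
}

# Vectorized approach: keep rows with no non-zero value in the chosen columns
keep <- rowSums(df[, args, drop = FALSE] != 0) == 0
vectorized <- df[keep, ]

# Both approaches keep the same rows (the first and the fourth)
all.equal(looped, vectorized)
```

Columns can also be selected by name directly, as here, so the match() step is optional.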

Related

Looping through rows until criteria is met in R

I'm trying to loop over each row in my dataframe, and if that row contains a 1, I want "bf" to change to TRUE so that the loop stops and then prints out the index of the row. Here's the code I've tried below.
bf <- FALSE
for(row in 1:nrow(df)){
while(bf == FALSE){
if(df[row, ] == 1){
bf==TRUE
print(row)
}
}
}
However, as far as I can tell, this code never reaches the if statement and executes it properly.
You can use the apply, any and which functions to identify rows with a 1. Then select the first row:
bdrows <- apply(df, 1, function(x) any(x == 1))
bd <- which(bdrows == TRUE)
firstbdrow <- bd[1]
bf==TRUE is a comparison; you are probably looking for the assignment bf = TRUE. Also, this operation doesn't require a for or while loop. Let's say you have a column called column_name in your data; you can do:
which.max(df$column_name == 1)
Or
which(df$column_name == 1)[1]
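To check every column rather than a single one, the same idea can be sketched with rowSums. The data frame below is made up for illustration.

```r
# Hypothetical example data
df <- data.frame(a = c(0, 0, 1, 0),
                 b = c(2, 0, 0, 1))

# Logical vector: does each row contain a 1 in any column?
has_one <- rowSums(df == 1) > 0

# Index of the first such row; NA if no row qualifies
first_row <- unname(which(has_one)[1])
print(first_row)
```

Note that the two one-column variants differ when nothing matches: which(...)[1] gives NA, whereas which.max() on an all-FALSE vector returns 1, which can be mistaken for a hit in the first row.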

Use if else statement for Dummy-Coding in R

I tried to create an if/else statement to recode my variable into a dummy variable.
I know there is the ifelse() function and the fastDummies package, but I tried it this way without success.
Why does this not work? I want to learn and understand R better.
if(df$iscd115==1){
df$iscd1151 <- 1
} else {
df$iscd1151 <- 0
}
This should be a reasonable solution.
First we'll find the positions of the two relevant columns, and then we'll apply a function over the rows (MARGIN = 1) that checks whether the source column is 1 and sets the other column accordingly. This assumes the column iscd1151 already exists in df.
col1 <- which(names(df) == "iscd115")
col2 <- which(names(df) == "iscd1151")
mat <- apply(df, MARGIN = 1, function(x) {
# x is one row; set the dummy column from the value in the source column
if (x[col1] == 1) {
x[col2] <- 1
} else {
x[col2] <- 0
}
x
})
Unfortunately, this transforms the original data frame into a transposed matrix. We can re-transpose the matrix back and turn it back into a data frame with the following.
new_df <- as.data.frame( t(mat))
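As for why the code in the question does not work: if() expects a single TRUE or FALSE, but df$iscd115 == 1 is a whole vector of comparisons. Since R 4.2.0 a condition of length greater than one is an error; earlier versions used only the first element and gave a warning. A minimal sketch of the vectorized alternative, on made-up data:

```r
# Hypothetical example data
df <- data.frame(iscd115 = c(1, 3, 1, 2))

# ifelse() evaluates the test element by element
df$iscd1151 <- ifelse(df$iscd115 == 1, 1, 0)

# Equivalent: convert the logical comparison directly to 0/1
# (iscd1151_alt is a column name chosen only for this illustration)
df$iscd1151_alt <- as.integer(df$iscd115 == 1)

print(df)
```

This also avoids the round trip through a transposed matrix.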

How to substitute negative values with a calculated value in an entire dataframe

I've got a huge dataframe with many negative values in different columns that should be equal to their original value*0.5.
I've tried to apply many R functions but it seems I can't find a single function to work for the entire dataframe.
I would like something like the following (not working) piece of code:
mydf[] <- replace(mydf[], mydf[] < 0, mydf[]*0.5)
You can simply do,
mydf[mydf<0] <- mydf[mydf<0] * 0.5
If you have values that are non-numeric, then you may want to apply this to only the numeric ones,
ind <- sapply(mydf, is.numeric)
mydf1 <- mydf[ind]
mydf1[mydf1<0] <- mydf1[mydf1<0] * 0.5
mydf[ind] <- mydf1
You could try using lapply() on the entire data frame, making the replacements on each column in succession.
df[] <- lapply(df, function(x) {
ifelse(x < 0, x*0.5, x)
})
The lapply(), or list apply, function is intended to be used on lists, but data frames are a special type of list so this works here. Assigning into df[] keeps the result a data frame; a plain df <- lapply(...) would leave df as a list.
In replace(), the values argument should have the same length as the number of TRUE values in the list ('index') argument:
replace(mydf, mydf <0, mydf[mydf <0]*0.5)
Or another option is set from data.table, which would be very efficient
library(data.table)
for(j in seq_along(mydf)){
i1 <- mydf[[j]] < 0
set(mydf, i = which(i1), j= j, value = mydf[[j]][i1]*0.5)
}
data
set.seed(24)
mydf <- as.data.frame(matrix(rnorm(25), 5, 5))
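For a result that is easy to verify by eye, here is a small sketch that halves the negative values in the numeric columns only. The data frame and its column names are made up for illustration.

```r
# Hypothetical example data with one non-numeric column
mydf <- data.frame(a = c(-2, 4, -6),
                   b = c(1, -3, 5),
                   id = c("x", "y", "z"))

# Which columns are numeric?
num <- sapply(mydf, is.numeric)

# Halve negative values column by column; other columns are untouched
mydf[num] <- lapply(mydf[num], function(x) ifelse(x < 0, x * 0.5, x))

print(mydf)
```

After this, column a holds -1, 4, -3 and column b holds 1, -1.5, 5, while id is unchanged.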

How to create a function which removes a certain first character of column names

R has problems when reading .csv files with column names that begin with a number; it changes these names by putting an "X" as the first character.
I am trying to write a function which simply solves this problem (although: is this the easiest way?)
As an example file, I simply created two new (non-sensical) columns in iris:
iris$X12.0 <- iris$Sepal.Length
iris$X18.0 <- iris$Petal.Length
remv.X <- function(x){
if(substr(colnames(x), 1, 1) == "X"){
colnames(x) <- substr(colnames(x), 2, 100)
}
else{
colnames(x) <- substr(colnames(x), 1, 100)
}
}
remv.X(iris)
When printing, I get a warning, and nothing changes.
What do I do wrong?
check.names=FALSE
Use the read.table/read.csv argument check.names = FALSE to turn off column name mangling.
For example,
read.csv(text = "1x,2x\n10,20", check.names = FALSE)
giving:
  1x 2x
1 10 20
Removing X using sub
If for some reason you did have an unwanted X character at the beginning of some column names they could be removed like this. This only removes an X at the beginning of columns names for which the next character is a digit. If the next character is not a digit or if there is no next character then the column name is left unchanged.
names(iris) <- sub("^X(\\d.*)", "\\1", names(iris))
or as a function:
rmX <- function(data) setNames(data, sub("^X(\\d.*)", "\\1", names(data)))
# test
iris <- rmX(iris)
Problem with code in question
There are two problems with the code in the question.
First, in if (condition) ... the condition is a vector but must be a scalar.
Second, the data frame is never returned.
Here it is fixed up. We have also factored out the LHS of the two legs of the if.
remv.X2 <- function(x) {
for (i in seq_along(x)) {
colnames(x)[i] <- if (substr(colnames(x)[i], 1, 1) == "X") {
substr(colnames(x)[i], 2, 100)
} else {
substr(colnames(x)[i], 1, 100)
}
}
x
}
iris <- remv.X2(iris)
or maybe even:
remv.X3 <- function(x) {
setNames(x, substr(colnames(x), (substr(colnames(x), 1, 1) == "X") + 1, 100))
}
iris <- remv.X3(iris)
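The sub() pattern and the substr() versions differ for names where the X is not followed by a digit. A small sketch on a plain character vector shows this (the names are made up for illustration):

```r
# Hypothetical column names
nms <- c("X12.0", "X18.0", "Xvalue", "X", "Sepal.Length")

# Strip a leading X only when a digit follows it
cleaned <- sub("^X(\\d.*)", "\\1", nms)
print(cleaned)

# Strip any leading X, as the substr() versions do
stripped <- substr(nms, (substr(nms, 1, 1) == "X") + 1, 100)
print(stripped)
```

With the regular expression, "Xvalue" and "X" are left alone; with substr() they become "value" and an empty string, which may not be what you want for columns that legitimately start with X.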

How to set a column value based on values in another column in R

I am trying to add a new column based on values in another column. (Basically, if the other column is missing or 0, set the new value to 0, otherwise to 1.)
What's wrong with this code below?
times=nrow(eachfile)
for(i in 1:times)
{eachfile$SalesCycleN0[i] <- ifelse(eachfile$R[i]==NA | eachfile$R[i]==0,0,1 ) }
table(eachfile$SalesCycleN0)
As long as you have tested that the column only contains 0, 1 and NA I would do:
eachfile$SalesCycleN0 <- 1
eachfile$SalesCycleN0[is.na(eachfile$R) | eachfile$R==0] <- 0
Nothing is ever "==" to NA. Just do this (no loop):
eachfile$SalesCycleN0 <- ifelse( is.na(eachfile$R) | eachfile$R==0, 0,1 )
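To see why the == NA test fails, here is a minimal sketch on a made-up vector standing in for the column R:

```r
# Hypothetical stand-in for eachfile$R
R <- c(3, 0, NA, 7)

# Comparing with NA yields NA for every element, never TRUE
cmp <- R == NA
print(cmp)

# is.na() is the test that actually detects missing values
print(is.na(R))

# TRUE | NA is TRUE, so the NA element is handled by the is.na() part
flag <- ifelse(is.na(R) | R == 0, 0, 1)
print(flag)
```

Here flag is 1 for the values 3 and 7, and 0 for both the zero and the missing value.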
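If you were looking for a little more economy in code this also works:
eachfile$SalesCycleN0 <- as.numeric( !is.na(eachfile$R) & eachfile$R != 0 )
FALSE & NA evaluates to FALSE, so NA rows get 0. A grepl("^0$", eachfile$R) test is not enough on its own: grepl returns FALSE for NA's, so negating it would code them as 1.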
Another way to avoid the explicit for loop is the sapply function, although the vectorized versions above are generally faster on a huge dataset. Here is an example:
df = data.frame(x = c(1,2,0,NA,5))
fun = function(i) {!(is.na(df$x[i]) || (df$x[i] == 0))} ## TRUE unless the value is NA or 0
bin <- (sapply(1:nrow(df), FUN = fun))*1 ## multiplying by 1 will convert the logical vector to a binary one.
df <- cbind(df, bin)
In your case:
fun = function(i) {!(is.na(eachfile$R[i]) || (eachfile$R[i] == 0))}
bin <- (sapply(1:times, FUN = fun))*1
eachfile <- cbind(eachfile, bin)
