I would like to know if there is a simpler way of subsetting a data.frame's integer columns.
My goal is to modify the numerical columns in my data.frame without touching the purely integer columns (in my case containing 0 or 1). The integer columns were originally factor levels turned into dummy variables and should stay as they are. So I want to temporarily remove them.
To distinguish numerical from integer columns I used the OP's version from here (Check if the number is integer).
But is.wholenumber is vectorized: it returns TRUE/FALSE for every element instead of one value per column like is.numeric, so sapply(mtcars, is.wholenumber) gives a logical matrix and does not help me directly. I came up with the following solution, but is there an easier way?
data(mtcars)
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
integer_column_names <- apply(is.wholenumber(mtcars), 2, mean) == 1
numeric_df <- mtcars[, !integer_column_names]
You can use dplyr to achieve that as shown here
library(dplyr)
is_whole <- function(x) all(floor(x) == x)
df <- select_if(mtcars, is_whole) # keeps only the whole-number columns
or in base R
df <- mtcars[, sapply(mtcars, is_whole)] # same selection without dplyr
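Since the question asks to set the whole-number columns aside and keep the rest, the selection can also be negated. The following is a minimal base R sketch of that idea; is_whole_col is a hypothetical helper name, and the is.numeric() guard, na.rm = TRUE, and drop = FALSE are additions that the answer above does not use.

```r
# Hypothetical helper: TRUE when a column is numeric and every
# non-missing value is a whole number.
is_whole_col <- function(x) is.numeric(x) && all(x == round(x), na.rm = TRUE)

# vapply() guarantees exactly one logical value per column.
whole_cols <- vapply(mtcars, is_whole_col, logical(1))

# Keep only the columns that are NOT whole-number columns.
# drop = FALSE keeps a data.frame even if a single column remains.
numeric_df <- mtcars[, !whole_cols, drop = FALSE]
names(numeric_df)
```

The whole-number columns stay available as mtcars[, whole_cols, drop = FALSE], so they can be bound back with cbind() after the other columns have been modified.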
summary(standard_airline)
#outlier treatment
outFix <- function(x){
quant <- quantile(x,probs = c(.25,.75))
h <- 1.5*IQR(x,na.rm = T)
x[x<(quant[1]-h)] <- quant[1]
x[x>(quant[2]+h)] <- quant[2]
}
v <- colnames(airline[,-1])
data2 <- lapply(v,outFix)
Error - Error in (1 - h) * qs[i] : non-numeric argument to binary operator
I can't figure out what causes this error, since the logic seems right to me. Is there any way in R to pass multiple columns of a dataset to a particular function? Here I want to pass every column except ID to fix the outliers.
Problem
The issue you are encountering is that v is a character vector of column names, while your function outFix expects a numeric vector. So what your lapply code is actually doing is something like this: outFix("Balance"). It's trying to compute quantiles and IQRs on a string, which is why you get the error.
quantile("Balance")
Error in (1 - h) * qs[i] : non-numeric argument to binary operator
Solutions
In the following code replace df with airline for your specific data.
In base R:
df[,-1] <- lapply(df[, -1], function(x) outFix(as.numeric(x))) # exclude first column
Or using your code:
df[, v] <- lapply(df[, v], function(x) outFix(as.numeric(x)))
Using dplyr you can apply your function to every column except ID with:
library(dplyr)
df %>%
dplyr::mutate_at(dplyr::vars(-ID), ~ outFix(as.numeric(.))) # remove ID by name
df %>%
dplyr::mutate_at(-1, ~ outFix(as.numeric(.))) # remove ID by column position
This makes sure that all your columns are numeric before being passed to your function outFix.
If you're certain that all of your columns are numeric ahead of time, then you don't need the as.numeric call, but it can be good to have just in case.
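One more thing to watch for: as written in the question, outFix ends with an assignment, so it returns the value of that assignment (the upper quartile) rather than the modified vector. Below is a minimal sketch of a version that returns x explicitly. The toy data frame and its values are made up for illustration, and the na.rm = TRUE and is.na() guards are assumptions about how missing values should be handled.

```r
# Sketch of outFix that returns the adjusted vector.
outFix <- function(x) {
  quant <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  h <- 1.5 * IQR(x, na.rm = TRUE)
  # Replace values beyond the fences with the nearest quartile,
  # as in the original function; NAs are left untouched.
  x[!is.na(x) & x < quant[1] - h] <- quant[1]
  x[!is.na(x) & x > quant[2] + h] <- quant[2]
  x  # return the modified vector, not the last assigned value
}

# Hypothetical data: ID in the first column, one numeric column.
df <- data.frame(ID = 1:6, Balance = c(10, 12, 11, 13, 12, 500))

# Pass the columns themselves (not their names) to the function.
df[-1] <- lapply(df[-1], outFix)
df
```

Here the value 500 lies above the upper fence and is replaced by the third quartile, while the other values are unchanged.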
I have a data frame composed of numeric values. I calculated the standard deviation and mean for each column and created Upper_Bound and Lower_Bound vectors as follows:
std_devs = apply(exp_vars[,sapply(exp_vars,is.numeric)], 2, sd)
means = apply(exp_vars[,sapply(exp_vars,is.numeric)], 2, mean)
Upper_Bound = means + 3*std_devs
Lower_Bound = means - 3*std_devs
Now I want to detect the rows that have at least one value that does not fall between the relevant upper and lower bound. For example, a value in column j must be greater than or equal to Lower_Bound[j] and less than or equal to Upper_Bound[j]. If at least one value in row i violates this condition, I want to save the index of that row (I also have row names, so saving row names would be fine too). What I want to obtain is a vector of indices (or row names) of all rows that violate the rule. I tried the following:
outliers = apply(my_data ,1, between(x,Lower_Bound, Upper_Bound,incbounds = TRUE))
But I guess it was too much to expect between to automatically go over every value in a row and compare it with the relevant bounds. This was my second hopeless attempt, which did not work either:
outliers = apply(exp_vars_numeric,1, apply(x,2,between(x,Lower_Bound, Upper_Bound, incbounds = TRUE)))
I know that I can do it with a for loop, but I am hoping for a more efficient solution. Any suggestion is highly appreciated.
Thanks in advance.
Consider keeping everything in one data frame by adding lower and upper bound columns with help of ave() for inline aggregation of sd and mean. Then run conditional ifelse() for the flagging of such rows.
num_cols <- sapply(exp_vars,is.numeric)
num_names <- colnames(exp_vars)[num_cols]
means <- sapply(exp_vars[,num_cols], function(x) ave(x, FUN=mean))
std_devs <- sapply(exp_vars[,num_cols], function(x) ave(x, FUN=sd))
exp_vars[,paste0(num_names, "_lower")] <- means - 3*std_devs
exp_vars[,paste0(num_names, "_upper")] <- means + 3*std_devs
# CONDITIONALLY ASSIGN FLAG COLS (1 = VALUE FALLS OUTSIDE ITS BOUNDS)
exp_vars[,paste0(num_names, "_flag")] <- ifelse(exp_vars[,num_names] < exp_vars[,paste0(num_names, "_lower")] |
exp_vars[,num_names] > exp_vars[,paste0(num_names, "_upper")], 1, 0)
# ADD ALL FLAG COLS HORIZONTALLY
exp_vars$index <- ifelse(rowSums(exp_vars[,paste0(num_names, "_flag")]) > 0, row.names(exp_vars), NA)
# ROWS WITH AT LEAST ONE VALUE OUTSIDE ITS BOUNDS
exp_vars[!is.na(exp_vars$index), ]
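If only the row indices are needed, the same check can be done without adding helper columns. The following is a rough base R sketch that compares each column against its own bounds with sweep(); the toy exp_vars data is made up for illustration, and the bound names mirror the ones in the question.

```r
# Hypothetical data: column a has one extreme value, column b has none.
exp_vars <- data.frame(a = c(rep(0, 20), 100), b = 1:21)

num <- as.matrix(exp_vars[sapply(exp_vars, is.numeric)])
means <- colMeans(num)
std_devs <- apply(num, 2, sd)
Lower_Bound <- means - 3 * std_devs
Upper_Bound <- means + 3 * std_devs

# sweep() applies the comparison column by column, so column j
# is checked against Lower_Bound[j] and Upper_Bound[j].
too_low  <- sweep(num, 2, Lower_Bound, "<")
too_high <- sweep(num, 2, Upper_Bound, ">")

# Rows with at least one value outside its bounds.
outlier_rows <- which(rowSums(too_low | too_high) > 0)
outlier_rows
```

Using rownames(exp_vars)[outlier_rows] instead gives the row names of the same rows.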
It is recommended to include a small example of what your data looks like so that it is easier for us to respond to your question :) I generated data frames based on your description, and the following seems to solve your problem:
df <- data.frame(a=c(1:10),b=c(5:14))
ncols <- ncol(df)
bounds <- data.frame(lower=seq(.5,5,.5),upper=seq(6.5,11,.5))
one_plus_fall_outside <- sapply(1:nrow(df),
function(i)
sum(between(df[i,],bounds$lower[i],bounds$upper[i]))/ncols<1
)
which(one_plus_fall_outside)
You can check that this works by looking at all the columns together:
cbind(df,bounds,one_plus_fall_outside)
I wanted to order a multi-column dataframe by some column and subset it, but the command I used did not work:
print(df[order(df$x) & df$x < 5,])
This does not order the results.
To debug this I generated a test dataframe with 1 column but this 'simplification' had unexpected effects
df <- data.frame(x = sample(1:50))
print(df[order(df$x) & df$x < 5,])
This does not order the results so I felt I had reproduced the problem but with simpler data.
Breaking down the process to first ordering and then subsetting led me to discover the ordering in this case does not generate a dataframe object
df <- data.frame(x = sample(1:50))
ndf <- df[order(df$x),]
print(class(ndf))
produces
[1] "integer"
Attempting to subset the resultant "integer" ndf object using dataframe syntax e.g.
print(ndf[ndf$x < 5, ])
obviously generates an error:
Error in ndf$x : $ operator is invalid for atomic vectors.
Simplifying even further, I found subsetting alone (not applying the order function ) does not generate a dataframe object
ndf <- df[df$x < 5,]
class(ndf)
[1] "integer"
It turns out for the multicolumn dataframe that separating the ordering and the subsetting does work as expected
df <- data.frame(x = sample(1:50), y = rnorm(50))
ndf <- df[order(df$x),]
print(ndf[ndf$x < 5, ])
and this solved my original problem, but led to two further questions:
Why is the type of object returned, as described above based on the 1 column dataframe test case, not a dataframe? (I appreciate a 1 column dataframe just contains a single vector, but it's still wrapped in a dataframe?)
Is it possible to order and subset a multicolumn dataframe in 1 step?
Subsetting a data.frame in R with [ automatically simplifies the result to a vector when only one column is selected. This is a common and useful simplification and is better described in this question. Of course you can prevent that with drop=FALSE, as in df[order(df$x), , drop=FALSE].
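A small sketch of the difference, using a made-up one-column data frame:

```r
df <- data.frame(x = c(3L, 1L, 2L))

# Default: a one-column result is simplified to a plain vector.
simplified <- df[order(df$x), ]
class(simplified)

# drop = FALSE keeps the data.frame wrapper, so $x still works afterwards.
kept <- df[order(df$x), , drop = FALSE]
class(kept)
kept[kept$x < 3, , drop = FALSE]
```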
Subsetting and ordering are two different operations. You should do them in two logical steps (but possibly one line of code). This line doesn't make a lot of sense
df[order(df$x) & df$x < 5,]
Subsetting in R can either be done with a vector of row indices (which order() returns) or boolean values (which the < comparison returns). Mixing them with & does not combine the two: the indices from order() are coerced to logical, and since they are all nonzero they all become TRUE, so only the comparison has any effect and the rows come back unordered. But you can break that out into two steps with subset()
subset(df[order(df$x),], x < 5)
This does the ordering first and then the subsetting. Note that the condition no longer directly references the values of df specifically; it will filter the data from the re-ordered data.frame.
Operations like this are one of the reasons many people prefer the dplyr library for data manipulation. For example, this can be done with
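To see why the combined expression only filters, it can help to look at what the index evaluates to. A minimal sketch with a made-up four-row data frame:

```r
df <- data.frame(x = c(4L, 9L, 2L, 7L), y = c("a", "b", "c", "d"))

idx <- order(df$x)        # integer row positions
mixed <- idx & df$x < 5   # nonzero integers are coerced to TRUE

# The combined index is just the logical filter; the ordering is lost.
same_as_filter <- identical(mixed, df$x < 5)
unsorted <- df[mixed, ]

# Ordering first and filtering second gives the intended result.
sorted <- subset(df[order(df$x), ], x < 5)
sorted
```

Filtering first and ordering the smaller result afterwards gives the same rows and avoids sorting rows that are then discarded.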
library(dplyr)
dd <- data.frame(x = sample(1:50))
dd %>% filter(x<5) %>% arrange(x)
df is a frequency table, where the values in a were reported as many times as recorded in columns x, y, z. I'm trying to convert the frequency table to the original data, so I use the rep() function.
How do I loop the rep() function to give me the original data for x, y, z without having to repeat the function several times like I did below?
Also, can I input the result into a data frame, bearing in mind that the output will have different column lengths:
a <- (1:10)
x <- (6:15)
y <- (11:20)
z <- (16:25)
df <- data.frame(a,x,y,z)
df
rep(df[,1], df[,2])
rep(df[,1], df[,3])
rep(df[,1], df[,4])
If you don't want to repeat the rep() call, you can always try using an apply-family function. Note that you cannot store the result in a data.frame because the objects are of different lengths, but you could store it in a list and access the elements in a similar way to a data.frame. Something like this works:
df2<-sapply(df[,2:4],function(x) rep(df[,1],x))
What this sapply function is saying is for each column in df[,2:4], apply the rep(df[,1],x) function to it where x is one of your columns ( df[,2], df[,3], or df[,4]).
The below code just makes sure the apply function is giving the same result as your original way.
identical(df2$x,rep(df[,1], df[,2]))
[1] TRUE
identical(df2$y,rep(df[,1], df[,3]))
[1] TRUE
identical(df2$z,rep(df[,1], df[,4]))
[1] TRUE
EDIT:
If you want it as a data.frame object you can do this:
res<-as.data.frame(sapply(df2, '[', seq(max(sapply(df2, length)))))
Note this introduces NAs into your data.frame so be careful!
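As a variation on the above, lapply() always returns a list, whereas sapply() returns a matrix whenever all results happen to have the same length. The sketch below uses lapply() and pads with NA through `length<-`; the names expanded and padded are made up for illustration.

```r
a <- 1:10
df <- data.frame(a, x = 6:15, y = 11:20, z = 16:25)

# One expanded vector per frequency column, always stored as a list.
expanded <- lapply(df[-1], function(n) rep(df$a, times = n))
lengths(expanded)

# Pad every vector with NA up to the longest one, then bind into a data.frame.
max_len <- max(lengths(expanded))
padded <- as.data.frame(lapply(expanded, `length<-`, max_len))
dim(padded)
```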
I have a data frame (150000 obs, 15 variables) in R and need to correct a subset of values of one variable (simply by multiplying by a constant) based on the value of another. What's an easy way to do this?
I thought apply would work, but I'm not sure how to write the function (obviously I can't multiply in the function like this) and the qualifier:
df$RESULT <- df[apply(df$RESULT, 1, function(x * 18.01420678) where(SITE==1)), ]
Do you mean this?
dat <- data.frame(x=1:10,y=sample(20,10))
constant <- 100
dat$y <- ifelse(dat$x > dat$y, dat$y*constant, dat$y)
You could use the capacity of "[" to do subsetting but for "correction" of a subset you need to use the logical expression that defines the subset on both sides of the assignment. Since you will then be working with only the values that need correction you do not use any further conditional function.
df[ df$SITE==1, "RESULT" ] <- df[ df$SITE==1, "RESULT"] * 18.01420678
In cases where the operation is to be done on a large number (millions) of cases or done repeatedly in simulations, this approach may be much faster than the ifelse approach.
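A small runnable sketch of the same idea on made-up data follows. Wrapping the condition in which() is an addition here: it drops rows where SITE is NA, which a bare logical index would carry into the assignment and which can make the assignment fail.

```r
# Hypothetical data with the column names from the question.
df <- data.frame(SITE = c(1, 2, 1, NA), RESULT = c(1, 2, 3, 4))
conversion <- 18.01420678

# Row positions where SITE is 1; which() silently drops NA comparisons.
sel <- which(df$SITE == 1)

# Use the same selection on both sides of the assignment.
df$RESULT[sel] <- df$RESULT[sel] * conversion
df
```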