I have a dataset with a categorical variable that may take around 6 or 7 unique values. Depending on which value it takes, I need to run several functions, each of which differs according to the value of the categorical variable.
I don't know how to program this so that the right functions are called. Keep in mind that the example here is simple, but my real scenario is much more complicated, with lots of sub-functions.
library(dplyr)
func1_value_one = function(multiplication_value){
mtcars$check="value_one"
mtcars$mpg =mtcars$mpg * multiplication_value
filter(mtcars, mpg>60)
}
func0_value_zero = function(division_value){
mtcars$check="value_zero"
mtcars$mpg =mtcars$mpg / division_value
filter(mtcars, mpg <3)
}
help_function=function(category_p,change_p){
mtcars=return(filter(mtcars, vs==category_p))
data=ifelse(category_p==0,return(func0_value_zero(change_p)),return(func1_value_one(change_p)))
return(data)
}
# I want to filter for the rows that match the parameter passed in and then perform the update on those rows.
# Right now I am not able to both filter for the rows and then make the correct function call.
help_function(0,2)
So help_function(0,20) would return only the rows from mtcars that have vs==0, divide all the mpg values by 20, and create a new column called check with all values equal to 'value_zero'.
In the broader context, my dataset would use a flag to determine many different table join combinations and depending upon the data perform a variety of calculations and adjustments.
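One possible way to structure this, offered as a sketch rather than a definitive answer: filter first, then look up the handler for the category in a named list and pass it the filtered data. The names handle_value_zero, handle_value_one and handlers are hypothetical, base R subsetting is used instead of dplyr::filter, and the extra mpg filters from the original functions are left out to match the behaviour described above.

```r
# Hypothetical handlers: each receives the already-filtered data plus one parameter.
handle_value_zero <- function(data, division_value) {
  data$check <- "value_zero"
  data$mpg <- data$mpg / division_value
  data
}

handle_value_one <- function(data, multiplication_value) {
  data$check <- "value_one"
  data$mpg <- data$mpg * multiplication_value
  data
}

# Lookup table: category value (as a string) -> handler function.
handlers <- list("0" = handle_value_zero, "1" = handle_value_one)

help_function <- function(category_p, change_p, data = mtcars) {
  handler <- handlers[[as.character(category_p)]]
  if (is.null(handler)) stop("No handler for category: ", category_p)
  # Filter first, then hand the filtered rows to the matching handler.
  filtered <- data[data$vs == category_p, , drop = FALSE]
  handler(filtered, change_p)
}

result <- help_function(0, 20)
head(result[, c("mpg", "vs", "check")])
```

Supporting a new category then only means adding one entry to handlers, and each handler can call its own sub-functions without help_function changing.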
Good day,
There is a topic here that I don't understand: the code works, but I can't see why.
I have this data frame:
# planets_df is pre-loaded in your workspace
# Use order() to create positions
positions <- order(planets_df$diameter)
positions
# Use positions to sort planets_df
planets_df[positions,]
I don't understand this: if you take the diameter column and want to sort by it, why do you put the result in the row position of the data frame? To me the form is [rows, columns], yet here something computed from a column goes in the row slot and the order changes. I really don't get that. Why is it not planets_df[, positions]?
The exercise is solved; I just don't get it. It is a DataCamp exercise, by the way.
Sorry if my English is wrong; it is not my native language.
I believe that I have created an example that matches your description. For the mtcars data set, which is pre-loaded in any R session, we can sort based on the variable mpg.
The function order returns the row indices sorted by mpg in this case. The ordering variable indicates the order that the rows should be presented in by storing the row indices based on mpg.
ordering <- order(mtcars$mpg)
This next step indicates that we want the rows of mtcars in the order specified by ordering. Essentially, ordering holds the order of the rows we want, so we pass that object to the row portion of the call to mtcars.
mtcars[ordering,]
If we instead passed ordering as the columns, we would be reordering the columns of mtcars instead of the rows.
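A small illustration of the difference, using a made-up three-row data frame (toy_planets and its values are assumptions for this sketch, not the DataCamp data):

```r
# Toy data frame standing in for planets_df.
toy_planets <- data.frame(
  name = c("A", "B", "C"),
  diameter = c(3, 1, 2)
)

# order() looks at the diameter values but returns ROW positions:
# the row holding the smallest value first, then the next, and so on.
positions <- order(toy_planets$diameter)
positions

# Row positions belong in the row slot, before the comma.
sorted_rows <- toy_planets[positions, ]
sorted_rows

# The column slot, after the comma, rearranges columns instead.
swapped_cols <- toy_planets[, c(2, 1)]
names(swapped_cols)
```

So the column is only used to compute the ordering; what comes back is a vector of row numbers, which is why it goes in the row position.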
Let's see an example.
library(sjmisc)
data(efc)
From this dataset I want to recode all variables whose names contain cop (so I could use the tidyselect helper contains) as follows: for males (e16sex==1), recode NA to 999 and copy everything else, as I could do with sjmisc::rec(..., rec = "NA=999; else=copy"); for females (e16sex==2), keep the values intact.
I tried through dplyr (and sjmisc) the next naive test:
mutate_at(efc, vars(contains("cop")), list(~if_else(e16sex == 1, rec(., rec="NA=999; else=copy"),.)))
but, understandably, if_else does not seem to treat the second dot . as the original contains("cop") variables for the rows with e16sex != 1.
I am looking for a function (or a composition of functions) that returns a data frame with the recoding specified (so please avoid for loops). I have not tried data.table because I do not know it yet, but all effective (and efficient) solutions are welcome. Could it perhaps be done with purrr?
Thank you!
UPDATE
The naive test above works. I hadn't tried it with this example but with the iris dataset, using the Species variable instead of the cop variables. As Species is a factor, trying to change some of its levels to a new one produces NAs, hence my confusion.
I'm not sure I fully understand the question, but you could use a for loop for this:
for(x in grep( "cop",names(efc))) {
efc[!is.na(efc$e16sex) & efc$e16sex==1 & is.na(efc[,x]),x] <- 999
}
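If an explicit for loop is to be avoided, the same replacement can be sketched with lapply over the matching columns. The toy data frame below is made up for illustration (it only mimics the e16sex and cop column names from efc), so treat this as a sketch under that assumption:

```r
# Made-up stand-in for efc: one sex column, two "cop" columns, one other column.
toy <- data.frame(
  e16sex  = c(1, 2, 1, NA),
  c82cop1 = c(NA, NA, 3, NA),
  c83cop2 = c(1, NA, NA, 2),
  other   = c(NA, 1, 2, 3)
)

cop_cols <- grep("cop", names(toy), value = TRUE)

# TRUE only for rows where e16sex is known and equal to 1.
is_male <- !is.na(toy$e16sex) & toy$e16sex == 1

# Replace NA with 999 in every cop column, but only in the male rows.
toy[cop_cols] <- lapply(toy[cop_cols], function(col) {
  col[is_male & is.na(col)] <- 999
  col
})

toy
```

Columns that do not match "cop" and rows where e16sex is 2 or missing are left untouched.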
I am wrestling with programming using dplyr in R to operate on columns of a data frame that are only known by their string names. I know there was recently an update to dplyr to support quosures and the like and I've reviewed what I think are the relevant components of the new "Programming with dplyr" article here: http://dplyr.tidyverse.org/articles/programming.html. However, I'm still not able to do what I want.
My situation is that I know a column name of a data frame only by its string name. Thus, I can't use non-standard evaluation in a call to dplyr within a function or even a script where the column name may change between runs because I can't hard-code the unquoted (i.e., "bare") column name generally. I'm wondering how to get around this, and I'm guessing I'm overlooking something with the new quoting/unquoting syntax.
For example, suppose I have user inputs that define cutoff percentiles for a distribution of data. A user may run the code using any percentile he/she would like, and the percentile he/she picks will change the output. Within the analysis, a column in an intermediate data frame is created with the name of the percentile that is used; thus this column's name changes depending on the cutoff percentile input by the user.
Below is a minimal example to illustrate. I want to call the function with various values for the cutoff percentile. I want the data frame named MPGCutoffs to have a column that is named according to the chosen cutoff quantile (this currently works in the below code), and I want to later operate on this column name. Because of the generality of this column name, I can only know it in terms of the input pctCutoff at the time of writing the function, so I need a way to operate on it when only knowing the string defined by probColName, which follows a predefined pattern based on the value of pctCutoff.
userInput_prob1 <- 0.95
userInput_prob2 <- 0.9
# Function to get cars that have the "best" MPG
# fuel economy, where "best" is defined by the
# percentile cutoff passed to the function.
getBestMPG <- function( pctCutoff ){
# Define new column name to hold the MPG percentile cutoff.
probColName <- paste0('P', pctCutoff*100)
# Compute the MPG percentile cutoff by number of gears.
MPGCutoffs <- mtcars %>%
dplyr::group_by( gear ) %>%
dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )
# Filter mtcars with only MPG values above cutoffs.
output <- mtcars %>%
dplyr::left_join( MPGCutoffs, by='gear' ) %>%
dplyr::filter( mpg > !!probColName ) #****This doesn't run; this is where I'm stuck
# Return filtered data.
return(output)
}
best_1 <- getBestMPG( userInput_prob1 )
best_2 <- getBestMPG( userInput_prob2 )
The dplyr::filter() statement is what I can't get to run properly. I've tried:
dplyr::filter( mpg > probColName ) - No error, but no rows returned.
dplyr::filter( mpg > !!probColName ) - No error, but no rows returned.
I've also seen examples where I could pass something like quo(P95) to the function and then unquote it in the call to dplyr::filter(); I've gotten this to work, but it doesn't solve my problem since it requires hard-coding the variable name outside the function. For example, if I do this and the percentile passed by the user is 0.90, then the call to dplyr::filter() fails because the column created is named P90 and not P95.
Any help would be greatly appreciated. I'm hoping there's an easy solution that I'm just overlooking.
If you have a column name in a string (aka character vector) and you want to use it with tidyeval, then you can convert it with rlang::sym(). Just change the filter line to
dplyr::filter( mpg > !!rlang::sym(probColName) )
and it should work. This is taken from the recommendation at this github issue: https://github.com/tidyverse/rlang/issues/116
It's still fine to use
dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )
because when dynamically setting a parameter name, you just need the string and not an unquoted symbol.
Here's an alternate solution from Hadley's comment in the post referred to in MrFlick's answer (https://github.com/tidyverse/rlang/issues/116). Using as.name() from base R takes the place of rlang::sym(), and you still do need to unquote it. That is, the following also works:
dplyr::filter( mpg > !!as.name(probColName) )
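Putting the pieces together, the function from the question might look like the sketch below. It assumes dplyr and rlang are installed and uses the rlang::sym() variant; swapping in as.name() should behave the same way here.

```r
library(dplyr)

getBestMPG <- function(pctCutoff) {
  # Column name is only known as a string, e.g. "P90".
  probColName <- paste0("P", pctCutoff * 100)

  # On the left of := a string is enough to name the new column.
  MPGCutoffs <- mtcars %>%
    group_by(gear) %>%
    summarize(!!probColName := quantile(mpg, pctCutoff))

  # Inside an expression, the string must become a symbol before unquoting;
  # otherwise mpg is compared with the literal text "P90".
  mtcars %>%
    left_join(MPGCutoffs, by = "gear") %>%
    filter(mpg > !!rlang::sym(probColName))
}

best_90 <- getBestMPG(0.9)
head(best_90)
```

This also explains the "no rows returned" result in the question: without sym(), the comparison is between a number and a character string rather than between two columns.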
I am new to R and have a question. To check a variable for outliers, we generally use:
boxplot(train$rate)
Here rate is a variable in my dataset, and train is the dataset's name. But when I have many variables, say 100 or 150, checking each one for outliers individually is very time-consuming. Is there a function that shows the outliers of all 100 variables in one boxplot?
If so, which function can remove the outliers from all of those variables at once instead of one by one? Please help me solve this problem.
Thanks in advance.
I agree with Rui Barradas that it is bad practice to remove outliers without further thought. As long as the value is valid you should keep it in your data or at least run two separate analyses with and without the influential value. You could use a for loop to apply a function to every variable in your dataset.
train2<-train # Copy old dataset
outvalue<-list() # Create two empty lists
outindex<-list()
for(i in 1:ncol(train2)){ # For every column in your dataset
outvalue[[i]]<-boxplot(train2[,i])$out # Plot and get the outlier values
outindex[[i]]<-which(train2[,i] %in% outvalue[[i]]) # Get the outlier indices (%in% handles several outliers)
train2[outindex[[i]],i] <- NA # Replace the outliers with NA
}
This works and plots the data, but it is quite slow. If you don't want to plot the data and just want the outliers, you could look into other outlier functions. The extremevalues package has a function that takes a different approach to identifying outliers and doesn't require a plot.
This uses the getOutliers function from the extremevalues package:
library(extremevalues)
outRight<-list()
outLeft<-outRight
for(i in 1:ncol(train2)){
out<-getOutliers(train2[,i]) # Run the detection once per column
outRight[[i]]<-out$iRight # Indices of outliers on the right
outLeft[[i]]<-out$iLeft # Indices of outliers on the left
train2[outRight[[i]],i] <- NA
train2[outLeft[[i]],i] <- NA
}
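For checking many columns at once without drawing anything, boxplot.stats() from base R applies the same rule as boxplot() and returns the outlying values in its out component. The sketch below uses a made-up two-column train data frame and a hypothetical helper flag_outliers; it sets flagged values to NA rather than dropping rows.

```r
# Made-up data: each column has one value far from the rest.
train <- data.frame(
  rate   = c(1:20, 100),
  amount = c(20:1, -80)
)

# Hypothetical helper: TRUE where a value is a boxplot outlier.
flag_outliers <- function(x) x %in% boxplot.stats(x)$out

# Apply the helper to every column; result has the same shape as train.
outlier_flags <- as.data.frame(lapply(train, flag_outliers))

# Number of outliers found per column.
colSums(outlier_flags)

# Copy the data and blank out the flagged cells.
train_na <- train
train_na[as.matrix(outlier_flags)] <- NA
```

Calling boxplot(train) on the whole data frame also draws one box per column in a single plot, although with 100 or more columns that plot is likely to be hard to read.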
The function boxplot returns a value. If you see the Value section of its help page you'll see that it's a list with named components, one of which is out. That's the one you seem to be looking for.
bp <- boxplot(train$rate)
bp$out
clean <- train$rate[!train$rate %in% bp$out] # remove the outliers; also safe when there are none
I also would not do that. Outliers are data, and normal/likely to occur. By eliminating them you are not taking into account the entirety of your data, a bad practice.
I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot with X1 giving the x-axis values, faceted by the values in B and C.
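library(reshape2) # assumed source of melt()
library(ggplot2) # ggplot(), facet_grid(), ggsave()
plotdata<-function(l){
melted<-melt(df,id.vars=c("X1","B","C"),measure.vars=l) # long format for the chosen column
p<-ggplot(melted,aes(x=X1,y=value))+geom_point()
p2<-p+facet_grid(B ~ C)
ggsave(filename=paste("X_vs_",l,"_faceted.jpeg",sep=""),plot=p2)
}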
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type each column of interest into plotdata and get the result, but this seems quite inelegant (and time-consuming). I would prefer to specify the columns of interest manually, e.g. "Y1","Y3","Y4", and then write a loop to process all of them.
However, I am new to writing for loops and can't find a way to loop over the specific column names that my function requires. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop over the user-specified columns.
Apologies if there is already an answer to this on Stack Overflow. I couldn't find one.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric; a for loop can iterate directly over a character vector of column names.
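To see the loop mechanics without writing any files, here is a minimal sketch in which plot_file_name is a hypothetical stand-in for plotdata that only builds the file name the real function would save to:

```r
# Hypothetical stand-in for plotdata(): returns the file name instead of plotting.
plot_file_name <- function(l) paste("X_vs_", l, "_faceted.jpeg", sep = "")

columns_of_interest <- c("Y1", "Y3", "Y4")

written <- character(0)
for (x in columns_of_interest) {
  # x takes the values "Y1", "Y3", "Y4" in turn, not 1, 2, 3.
  written <- c(written, plot_file_name(x))
}
written
```

Keeping the column names in a variable such as columns_of_interest means the user only has to edit one line to change which plots are produced.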