SUM with three logical conditions in R

First of all, I have checked the existing topics. Unfortunately, they are either not quite relevant or I am not able to understand them. As you can probably tell from my question, I'm VERY new to R. I hope this is okay...
I feel I am on the right track.
Here https://i.stack.imgur.com/5jv0m.jpg is an excerpt of the data frame (df).
I want to check whether the values of the emission subcategories (y) sum to the values stated in the parent categories. Part of this is summing up the values of the subcategories.
In short, I want to know whether sum(3.B.1 + 3.B.2 + ... + 3.B.n) = 3.B. (i.e. the sum stated in the CSV) for a given year and country. I want to verify the sums.
I've tried this code (with 2010 and Austria):
sum(compare_df, x4 %in% c("1.A.1", "1.A.2", "1.A.3", "1.A.4", "1.A.5") &
    x == "2010" & x2 == "Austria")
but get this:
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
Once this works, is there a way to automate running the code for other conditions (i.e. a list of countries and years)? Some keywords would be helpful here; I could then search for them myself.
I hope my question is clear enough, and thank you for any sort of help or suggestion. Sorry for such a long post...
PS: I've updated everything now and hope my question is clearer.

If you want to verify the sums of the y variable, you need to specify which variable you want to sum. Currently your sum statement tries to sum the whole data.frame, and when it encounters a categorical variable it throws the error
Error in FUN(X[[i]], ...) : only defined on a data frame with all
numeric variables
I didn't reproduce your code, but this can be verified with sum(iris). If you truly want to sum all numeric variables, you would have to do sum(iris[sapply(iris, is.numeric)]).
But to get to your question about subsetting on three variables you would have to do something like this:
sum(iris$Sepal.Length[iris$Species %in% c("setosa", "versicolor") &
                      iris$Sepal.Width >= 3 &
                      iris$Petal.Length >= 2])
First you have to tell sum which data.frame and variable you want to sum over, e.g. the iris$Sepal.Length part of the code (this would be your df$y); then with [ you subset on the variables of interest. In your code, when you refer to variables without the df$ notation, R will not find them, because they are not standalone objects but columns of the data.frame. Hope this helps.
Also, in your post your year variable is numeric, not categorical, so you should remove the quotes around 2010.
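Translated to the data frame from the question, here is a minimal sketch with toy data (the column meanings x = year, x2 = country, x4 = category code, y = emission value are my assumptions based on the post):

```r
# Toy stand-in for the question's compare_df.
compare_df <- data.frame(
  x  = c(2010, 2010, 2010, 2009),
  x2 = c("Austria", "Austria", "Germany", "Austria"),
  x4 = c("1.A.1", "1.A.2", "1.A.1", "1.A.1"),
  y  = c(1.5, 2.5, 10, 99)
)

# Sum y over the chosen subcategories for one year and country.
subtotal <- sum(compare_df$y[compare_df$x4 %in% c("1.A.1", "1.A.2") &
                             compare_df$x == 2010 &
                             compare_df$x2 == "Austria"])
subtotal  # 4
```

The subtotal can then be compared against the parent-category value, preferably with all.equal() rather than == to avoid floating-point surprises.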

Hard to be sure without knowing what compare_df looks like, but here is a possible solution using dplyr, which is great for working with data frames.
The %>% operator is the 'pipe', which takes the result of the previous function and inserts it into the first argument of the subsequent function.
All of the dplyr functions (filter, group_by, summarize, etc.) take the data as the first argument, so they work nicely with %>%.
library(dplyr)
compare_df %>%
  filter(x4 %in% c("1.A.1", "1.A.2", "1.A.3", "1.A.4", "1.A.5")) %>%
  group_by(x, x2) %>%
  summarize(sum_emissions = sum(y, na.rm = TRUE)) %>%
  filter(x == "2010", x2 == "Austria")
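To see the pipeline run end to end, here it is on toy data standing in for compare_df (the column meanings x = year, x2 = country, x4 = category, y = value are assumed):

```r
library(dplyr)

# Toy stand-in for compare_df.
compare_df <- data.frame(
  x  = c("2010", "2010", "2010"),
  x2 = c("Austria", "Austria", "Germany"),
  x4 = c("1.A.1", "1.A.2", "1.A.1"),
  y  = c(1.5, 2.5, 10)
)

result <- compare_df %>%
  filter(x4 %in% c("1.A.1", "1.A.2")) %>%
  group_by(x, x2) %>%
  summarize(sum_emissions = sum(y, na.rm = TRUE)) %>%
  filter(x == "2010", x2 == "Austria")

result$sum_emissions  # 4
```

To run it for every country and year at once, simply drop the final filter: the group_by already produces one summed row per (year, country) pair.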


Counting number of rows where a value occurs at least once within many columns

I updated the question with pseudocode to better explain what I would like to do.
I have a data.frame named df_sel, with 5064 rows and 215 columns.
Some of the columns (~80) contain integers with a unique identifier for a specific trait (medications). These columns are named "meds_0_1", "meds_0_2", "meds_0_3", etc., as well as "meds_1_1", "meds_1_2", "meds_1_3". Each column may or may not contain any of the integer values I am looking for.
Of the specific integer values to look for, some could be grouped under different types of medication but are coded as specific brand names.
metformin = 1140884600 # not grouped
sulfonylurea = c(1140874718, 1140874724, 1140874726) # grouped
If it would be possible to look up a group of medications, given as a vector like above, that would be helpful.
I would like to do this:
IF [a specific row]
CONTAINS [the single integer value of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_METFORMIN = 1 ELSE A_NEW_VARIABLE_METFORMIN = 0
and correspondingly
IF [a specific row]
CONTAINS [any of multiple integer values of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_SULFONYLUREA = 1 ELSE A_NEW_VARIABLE_SULFONYLUREA = 0
I have managed to create a vector based on the column names:
column_names <- names(df_sel) %>% str_subset('^meds_0')
But I haven't gotten any further, despite some suggestions below.
I hope you understand better what I am trying to do.
As for the selection of the columns, you could do this by first extracting the names with a regex, as you are already doing, and then using select:
library(dplyr)
library(stringr)

column_names <- names(df_sel) %>%
  str_subset('^meds_0')

relevant_df <- df_sel %>%
  select(column_names)
I didn't quite get the structure of your variables (whether they are integers, logicals, etc.), so I'm not sure how to continue, but it would probably involve something like summing across all the columns and keeping the rows whose sum is not 0, like:
library(tibble)  # for add_column

meds_taken <- rowSums(relevant_df)

df_sel_med_count <- df_sel %>%
  add_column(meds_taken)
At this point you should have your initial df with the relevant data in one column, and you can summarize by subject, medication or whatever in any way you want.
If this is not enough, please edit your question providing a relevant sample of your data (you can do this with the dput function), and I'll edit this answer to add more detail.
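One possible way to get the 0/1 indicator variables from the pseudocode, using base R on toy data (the meds_0_* column names and medication codes are taken from the question; the helper function is my sketch, not a known library call):

```r
# Toy stand-in for df_sel with two "meds_0" columns.
df_sel <- data.frame(
  meds_0_1 = c(1140884600, NA,         1140874724),
  meds_0_2 = c(NA,         1140884600, 1140874718)
)

metformin    <- 1140884600                             # not grouped
sulfonylurea <- c(1140874718, 1140874724, 1140874726)  # grouped

# TRUE for every row in which any meds_0_* column contains one of the codes.
# %in% treats NA as "not a match", so missing cells are handled safely.
has_code <- function(df, codes) {
  meds <- df[grep("^meds_0", names(df))]
  Reduce(`|`, lapply(meds, function(col) col %in% codes))
}

df_sel$metformin    <- as.integer(has_code(df_sel, metformin))
df_sel$sulfonylurea <- as.integer(has_code(df_sel, sulfonylurea))
```

Because the vector of codes is a plain argument, grouped medications (the sulfonylurea vector) and single codes work the same way.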
First, I would like to start off by recommending Bioconductor for R libraries, as it sounds like you may be studying biological data. Now to your question.
Although the tidyverse is the most widely accepted and 'easy' method, I would recommend in this instance using lapply, as it is extremely fast. From a programming standpoint your code becomes a simple boolean test, as you stated, but I think we can go a little further. Using the built-in mtcars data:
data(mtcars)
head(mtcars, 6)

target <- 6

# TRUEs and FALSEs for each row and column
rows <- lapply(mtcars, function(x) x %in% target)

# Number of TRUEs for each column, and which columns have more than 0 TRUEs
column_sums <- unlist(lapply(rows, function(x) sum(x, na.rm = TRUE)))
which(column_sums > 0)
This will work with other data types with a few tweaks here and there.
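Since the question ultimately wants a per-row flag rather than per-column counts, the same lapply idea can be turned around to mark rows (my tweak, not part of the answer above):

```r
data(mtcars)
target <- 6

# TRUE for every row in which at least one column equals the target.
row_hits <- Reduce(`|`, lapply(mtcars, function(x) x %in% target))
which(row_hits)
```

as.integer(row_hits) then gives exactly the kind of 0/1 indicator variable described in the pseudocode.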

R dplyr operate on a column known only by its string name

I am wrestling with programming using dplyr in R to operate on columns of a data frame that are only known by their string names. I know there was recently an update to dplyr to support quosures and the like and I've reviewed what I think are the relevant components of the new "Programming with dplyr" article here: http://dplyr.tidyverse.org/articles/programming.html. However, I'm still not able to do what I want.
My situation is that I know a column name of a data frame only by its string name. Thus, I can't use non-standard evaluation in a call to dplyr within a function or even a script where the column name may change between runs because I can't hard-code the unquoted (i.e., "bare") column name generally. I'm wondering how to get around this, and I'm guessing I'm overlooking something with the new quoting/unquoting syntax.
For example, suppose I have user inputs that define cutoff percentiles for a distribution of data. A user may run the code using any percentile he/she would like, and the percentile he/she picks will change the output. Within the analysis, a column in an intermediate data frame is created with the name of the percentile that is used; thus this column's name changes depending on the cutoff percentile input by the user.
Below is a minimal example to illustrate. I want to call the function with various values for the cutoff percentile. I want the data frame named MPGCutoffs to have a column that is named according to the chosen cutoff quantile (this currently works in the below code), and I want to later operate on this column name. Because of the generality of this column name, I can only know it in terms of the input pctCutoff at the time of writing the function, so I need a way to operate on it when only knowing the string defined by probColName, which follows a predefined pattern based on the value of pctCutoff.
userInput_prob1 <- 0.95
userInput_prob2 <- 0.9

# Function to get cars that have the "best" MPG
# fuel economy, where "best" is defined by the
# percentile cutoff passed to the function.
getBestMPG <- function(pctCutoff){
  # Define new column name to hold the MPG percentile cutoff.
  probColName <- paste0('P', pctCutoff * 100)
  # Compute the MPG percentile cutoff by number of gears.
  MPGCutoffs <- mtcars %>%
    dplyr::group_by(gear) %>%
    dplyr::summarize(!!probColName := quantile(mpg, pctCutoff))
  # Filter mtcars to only MPG values above the cutoffs.
  output <- mtcars %>%
    dplyr::left_join(MPGCutoffs, by = 'gear') %>%
    dplyr::filter(mpg > !!probColName)  #**** This doesn't run; this is where I'm stuck
  # Return the filtered data.
  return(output)
}

best_1 <- getBestMPG(userInput_prob1)
best_2 <- getBestMPG(userInput_prob2)
The dplyr::filter() statement is what I can't get to run properly. I've tried:
dplyr::filter( mpg > probColName ) - No error, but no rows returned.
dplyr::filter( mpg > !!probColName ) - No error, but no rows returned.
I've also seen examples where I could pass something like quo(P95) to the function and then unquote it in the call to dplyr::filter(); I've gotten this to work, but it doesn't solve my problem since it requires hard-coding the variable name outside the function. For example, if I do this and the percentile passed by the user is 0.90, then the call to dplyr::filter() fails because the column created is named P90 and not P95.
Any help would be greatly appreciated. I'm hoping there's an easy solution that I'm just overlooking.
If you have a column name in a string (i.e. a character vector) and you want to use it with tidyeval, you can convert it to a symbol with rlang::sym(). Just change the filter to
dplyr::filter( mpg > !!rlang::sym(probColName) )
and it should work. This is taken from the recommendation at this github issue: https://github.com/tidyverse/rlang/issues/116
It's still fine to use
dplyr::summarize( !!probColName := quantile(mpg, pctCutoff) )
because when dynamically setting a parameter name, you just need the string, not an unquoted symbol.
Here's an alternative solution, from Hadley's comment in the issue referred to in MrFlick's answer (https://github.com/tidyverse/rlang/issues/116). as.name() from base R takes the place of rlang::sym(), and you still need to unquote it. That is, the following also works:
dplyr::filter( mpg > !!as.name(probColName) )
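Putting the fix into the question's function gives a working version; a sketch using the built-in mtcars data (either !!as.name(...) or !!rlang::sym(...) works in the filter):

```r
library(dplyr)

getBestMPG <- function(pctCutoff) {
  probColName <- paste0("P", pctCutoff * 100)
  # One cutoff row per gear group, column named e.g. "P95".
  MPGCutoffs <- mtcars %>%
    group_by(gear) %>%
    summarize(!!probColName := quantile(mpg, pctCutoff))
  # Join the cutoffs back and keep rows above their group's cutoff.
  mtcars %>%
    left_join(MPGCutoffs, by = "gear") %>%
    filter(mpg > !!as.name(probColName))
}

best_1 <- getBestMPG(0.95)  # cars beating the 95th percentile of their gear group
```

Because the symbol is built from probColName at run time, the same code works for P95, P90, or any other user-supplied cutoff.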

aggregate function is not intuitive

In source code like the following, which uses the aggregate function, I can't understand why we have to use the list() function here. I would rather replace it with just the column that needs to be grouped by. And I don't know why we use the same subset, train[Sales != 0], twice. What if I pass a different dataset as the second parameter? I think that makes mistakes fairly likely.
aggregate(train[Sales != 0]$Sales,
          by = list(train[Sales != 0]$Store), mean)
Maybe someone will say this is a wrong use case, but I also saw this source code in the R documentation:
## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to.
aggregate(state.x77, list(Region = state.region), mean)
Thanks for reading my question.
First of all, if you don't like the syntax of the aggregate function, you could take a look at the dplyr package. Its syntax might be a bit easier for you.
To answer your questions:
The second argument is simply expected to be a list, so that you can group by multiple variables.
You have to use train[Sales != 0] twice, because otherwise the first argument and the by argument look at different indices. You could also create the subset first:
Base R-code:
trainSales <- train[Sales != 0]
aggregate( trainSales$Sales, by = list(trainSales$Store), mean )
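To see why the two subsets must agree, here is a base-R illustration with toy data (plain data.frame indexing stands in for the data.table-style train[Sales != 0] from the question):

```r
train <- data.frame(
  Store = c(1, 1, 2, 2),
  Sales = c(10, 0, 20, 30)
)

keep <- train$Sales != 0

# Both the values and the grouping vector drop the same rows,
# so value i always lines up with group i.
avg <- aggregate(train$Sales[keep],
                 by = list(Store = train$Store[keep]), mean)
```

Subsetting only the Sales vector while passing the full train$Store would leave the two arguments with different lengths, and aggregate would fail rather than silently pair values with the wrong stores.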
With dplyr you could do something like this:
train %>%
  filter(Sales != 0) %>%
  group_by(Store) %>%
  summarise_each(funs(mean))
You see I use summarise_each because it condenses the dataset to one row per group, but you could of course also do something that leaves all the rows intact (in that case, use do).

R Use factor() for nominal data - how to apply value labels to multiple variables

I'm an R newbie, so this problem is probably quite obvious, but I have had a good search around and can't find anything.
I want to use R to analyse a survey rather than the usual method, Excel.
I have my variables labelled Q1, Q2, Q3...
Q1 and Q2 contain nominal data (1, 2), and I'd like the values replaced with ("Yes", "No"). I can do this for Q1 using the code below, but I'm not sure whether I should subset or use c( to apply the factor function to several columns. The survey will have about 25 questions this needs applying to, so I'd rather it be done in one line of code rather than 25.
resdata$Q1 <- factor(resdata$Q1, levels = c(1,2), labels = c("Yes", "No"))
First you need to define the replacer function:
repl.f <- function(x) ifelse(x==1, "Yes","No")
Then simply use it inside mutate_each() from dplyr:
library(dplyr)
resdata <- resdata %>% mutate_each(funs(repl.f), contains("Q"))
Its first argument, wrapped in funs(), is the function(s) you want to apply; the second is the subset of columns. If you give it only one function, it will by default replace those columns in place. To select the columns you can use another dplyr helper, contains(), which, like much of dplyr, speaks for itself.
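If you would rather keep the factor() call from the question (so the result is a true factor rather than character strings), a base-R sketch that applies it to every column whose name starts with "Q" (toy data in place of the real survey):

```r
# Toy survey data; Q columns coded 1 = Yes, 2 = No.
resdata <- data.frame(Q1 = c(1, 2, 1), Q2 = c(2, 2, 1), age = c(30, 41, 52))

# Find the question columns by name and re-label them all in one pass.
qcols <- grep("^Q", names(resdata))
resdata[qcols] <- lapply(resdata[qcols],
                         factor, levels = c(1, 2), labels = c("Yes", "No"))
```

The extra levels/labels arguments are passed straight through lapply to factor(), so this is the one-liner version of repeating the question's own Q1 line 25 times.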

Specifying names of columns to be used in a loop R

I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1 <- sample(100, 25)
B <- sample(4, 25, replace = TRUE)
C <- sample(2, 25, replace = TRUE)
Y1 <- sample(100, 25)
Y2 <- sample(100, 25)
Y3 <- sample(100, 25)
Y4 <- sample(100, 25)
Y5 <- sample(100, 25)

df <- as.data.frame(cbind(X1, B, C, Y1, Y2, Y3, Y4, Y5))
I wrote a function that melts the data and generates a plot, with X1 giving the x-axis values, faceted using the values in B and C.
library(reshape2)  # for melt
library(ggplot2)

plotdata <- function(l){
  melted <- melt(df, id.vars = c("X1", "B", "C"), measure.vars = l)
  plot <- ggplot(melted, aes(x = X1, y = value)) + geom_point()
  plot2 <- plot + facet_grid(B ~ C)
  ggsave(filename = paste("X_vs_", l, "_faceted.jpeg", sep = ""), plot = plot2)
}
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type each column of interest into plotdata and get the result, but this seems quite inelegant (and time consuming). I would prefer to manually specify the columns of interest, e.g. "Y1", "Y3", "Y4", and then write a loop to run the function for all of them.
However, I am new to writing for loops and can't find a way to loop over the specific column names required for my function to work. A standard for (i in 1:length(df)) wouldn't be appropriate, because I only want to loop over the user-specified columns.
Apologies if there is an answer to this is already in stackoverflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric.
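The same pattern on a toy data frame, with a simple sum standing in for plotdata, shows the character loop variable in action (my illustration, not from the original answer):

```r
df <- data.frame(Y1 = 1:3, Y2 = 4:6, Y3 = 7:9)

totals <- c()
for (nm in c("Y1", "Y3")) {
  totals[nm] <- sum(df[[nm]])  # nm is a string, used directly as a column name
}
totals  # Y1 = 6, Y3 = 24
```

lapply(c("Y1", "Y3", "Y4"), plotdata) would work just as well; for plotting functions called only for their side effects, the for loop and lapply are interchangeable.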
