How to get a dynamic variable into a string in R?
How can I get a function argument to be substituted dynamically within a string?
Below is an extract of what I am trying to achieve: a function that produces a result depending on varName. But I cannot get varName to be substituted dynamically within the string passed to sqldf(...). I assume this problem is not specific to the sqldf package.
createExcelSheetData <- function(varName){
  sqldf("
    SELECT Name
    FROM dataTable
    WHERE Choice = varName
  ")
}
table1<-createExcelSheetData(1)
table2<-createExcelSheetData(2)
table3<-createExcelSheetData(3)
What the above gives me is the choice fixed as the literal text varName, rather than the value passed to the function.
UPDATE: I need the variable substituted within the text of the query, not just at the end.
createExcelSheetData <- function(varName){
  sqldf("
    SELECT Name
    FROM dataTable
    WHERE Choice = varName
    ORDER BY Name
  ")
}
table1<-createExcelSheetData(1)
table2<-createExcelSheetData(2)
table3<-createExcelSheetData(3)
fn$ is discussed in Example 6 on the sqldf home page. Here is a self-contained minimal reproducible example using the iris data frame that comes with R. (In the future, please ensure all code is minimal and reproducible and, in particular, includes all inputs.)
library(sqldf)
# retrieve records for specified Species and Petal.Length above minPetalLength
f <- function(Species, minPetalLength) {
  fn$sqldf("SELECT *
            FROM iris
            WHERE Species = '$Species' and [Petal.Length] > $minPetalLength")
}
f("virginica", 6)
giving:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.6 3.0 6.6 2.1 virginica
2 7.3 2.9 6.3 1.8 virginica
3 7.2 3.6 6.1 2.5 virginica
4 7.7 3.8 6.7 2.2 virginica
5 7.7 2.6 6.9 2.3 virginica
6 7.7 2.8 6.7 2.0 virginica
7 7.4 2.8 6.1 1.9 virginica
8 7.9 3.8 6.4 2.0 virginica
9 7.7 3.0 6.1 2.3 virginica
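If you would rather not rely on fn$ substitution, a similar effect can be had by building the query string with base R's sprintf() before handing it to sqldf(). This is a sketch under the same assumptions as the answer above (f2 is an illustrative name, not part of sqldf):

```r
library(sqldf)

# Build the SQL string first, then pass the finished string to sqldf().
# %s inserts the species name, %f the numeric threshold.
f2 <- function(Species, minPetalLength) {
  query <- sprintf(
    "SELECT * FROM iris WHERE Species = '%s' AND [Petal.Length] > %f",
    Species, minPetalLength
  )
  sqldf(query)
}

f2("virginica", 6)  # should return the same rows as the fn$ version
```

Note that pasting values into SQL this way is fine for trusted inputs like these, but it performs no quoting or escaping of the interpolated strings.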
Related
R Best way to delete data.table rows in a function call
I am looking for the best way to subset the iris dataset in a function call. Here is the code:

data(iris)
remove_rows <- function(x) {
  x = setDT(x)[Species == "virginica"]
}
remove_rows(iris)

> iris
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

As you can see, none of the rows are deleted after running the remove_rows function. This is understandable, as data.table does not have the functionality to remove rows by reference. The workaround I have used is to update remove_rows and return the new object from the function:

library(data.table)
remove_rows <- function(x) {
  x = setDT(x)[Species == "virginica"]
  return(x)
}
iris = remove_rows(iris)

This has solved the problem, but since this data.table is huge in my case (iris is just a toy example), it takes a lot of time to run this function and copy the subset into the iris dataset. Is there a workaround for this situation?
This feature is not yet implemented, though it is highly requested. You can track its progress at https://github.com/Rdatatable/data.table/issues/635. The function setsubset that you are about to test is not complete: it lacks the C part that sets the true length of the object to something shorter than the original, so without that missing piece it won't help you much. As it stands, it will return a subset at the beginning of the data.table, and the remaining rows will be garbage. For now, you have to return the new object from the function and assign it to (possibly) the same variable as the one you are passing in. If you really don't want to do this, you can always use assign to write into the parent frame, but it is less elegant.
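The assign-to-the-parent-frame workaround mentioned at the end can be sketched as follows; note this is an illustration, not a data.table feature, and it still builds the subset as a copy before overwriting the caller's variable:

```r
library(data.table)

remove_rows <- function(x) {
  nm  <- deparse(substitute(x))            # name of the caller's variable
  res <- setDT(x)[Species == "virginica"]  # the subset is still a copy
  assign(nm, res, envir = parent.frame())  # overwrite the variable in the caller
  invisible(res)
}

dt <- as.data.table(iris)
remove_rows(dt)  # dt now holds only the virginica rows
```

This only saves the explicit reassignment at the call site; it does not avoid the cost of copying the subset.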
How can I use a parameter as a filter criterion with 'dplyr' in R?
I am using the dplyr package in R. With the filter function, I would like to use a parameter as the filter criterion. How can I proceed? Instead of writing this

b = dplyr::filter(a, Note == "N.6.2", Liability == R.val.1)

(where Note and Liability are column names in the table a), I would like to have

R.cat.1 = "Liability"
b = dplyr::filter(a, Note == "N.6.2", R.cat.1 == R.val.1)

The second method does not produce an error, but it produces an empty table b.
This will get you what you want:

R.cat.1 = "Liability"
b = dplyr::filter(a, Note == "N.6.2", !!rlang::sym(R.cat.1) == R.val.1)

You can learn more about how this works by reading Advanced R and Programming in the tidyverse.
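Since a, Note, and R.val.1 are not reproducible here, the same pattern can be checked on iris; the column name and value below are illustrative:

```r
library(dplyr)
library(rlang)

col_name  <- "Species"    # column chosen at run time
col_value <- "virginica"  # value to filter on

# sym() turns the string into a symbol, and !! injects it into filter()
# so it is evaluated as a column reference rather than a string literal.
b <- iris %>% filter(!!sym(col_name) == col_value)
nrow(b)  # 50
```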
You have multiple options. The solution I use is based on the bang-bang operator (!!). I don't know if this is the most elegant/concise/efficient method, and I would suggest reading the documentation on tidy evaluation (https://tidyeval.tidyverse.org/dplyr.html).

x <- 'Sepal.Length'
iris %>% dplyr::filter( !!rlang::sym(x) > 7.5)

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.6         3.0          6.6         2.1 virginica
2          7.7         3.8          6.7         2.2 virginica
3          7.7         2.6          6.9         2.3 virginica
4          7.7         2.8          6.7         2.0 virginica
5          7.9         3.8          6.4         2.0 virginica
6          7.7         3.0          6.1         2.3 virginica

Otherwise, as suggested by @Phil, you can use:

dplyr::filter(iris, .data[[x]] > 7.5)
Function to filter data equal to or greater than a certain value
I have a dataframe containing thousands of rows and columns. The rows contain the names of genes and the columns the names of samples. I only want to keep the rows that contain a value equal to or greater than 5 in more than 3 samples. I tried this so far, but I can't figure out how to set multiple conditions:

data.frame1 %>% filter_all(all_vars(. >= 5))

I hope I have stated this question correctly.
The way I do it in my gene expression filtering pre-differential-expression pipeline is as follows:

data.frame1[rowSums(data.frame1 >= 5) > 3, ] -> filtered.counts

And if your first column is your gene identifier, with all the other columns being numeric, you can have the evaluation skip the first column as follows:

data.frame1[rowSums(data.frame1[-1] >= 5) > 3, ] -> filtered.counts
The way to do this in dplyr 1.0.0 is

iris %>% filter(rowSums(across(where(is.numeric)) > 6) > 1)

  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.6         3.0          6.6         2.1 virginica
2          7.3         2.9          6.3         1.8 virginica
3          7.2         3.6          6.1         2.5 virginica
4          7.7         3.8          6.7         2.2 virginica
5          7.7         2.6          6.9         2.3 virginica
6          7.7         2.8          6.7         2.0 virginica
7          7.4         2.8          6.1         1.9 virginica
etc.

For your case:

data.frame1 %>% filter(rowSums(across(where(is.numeric)) >= 5) > 3)
Write a tidyeval function to rename a factor level in a dplyr pipeline
I'm trying to write a tidyeval function that takes a numeric column, replaces values above a certain limit with the value of limit, turns that column into a factor, and then replaces the factor level equal to limit with a level called "limit+". For example, I'm trying to replace any value above 3 in Sepal.Width with 3 and then rename that factor level to 3+. Here's how I'm trying to make it work with the iris dataset; the fct_recode() call is not renaming the factor level properly, though.

plot_hist <- function(x, col, limit) {
  col_enq <- enquo(col)
  x %>%
    mutate(var = factor(ifelse(!!col_enq > limit, limit, !!col_enq)),
           var = fct_recode(var, assign(paste(limit, "+", sep = ""), paste(limit))))
}

plot_hist(iris, Sepal.Width, 3)
To fix the last line, we can use the special symbol :=, since we need to set the value on the left-hand side of the expression. For the RHS we need to coerce to character, since fct_recode expects a character vector there.

library(tidyverse)

plot_hist <- function(x, col, limit) {
  col_enq <- enquo(col)
  x %>%
    mutate(var = factor(ifelse(!!col_enq > limit, limit, !!col_enq)),
           var = fct_recode(var, !!paste0(limit, "+") := as.character(limit)))
}

plot_hist(iris, Sepal.Width, 3) %>% sample_n(10)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  var
#> 40           5.1         3.4          1.5         0.2     setosa   3+
#> 98           6.2         2.9          4.3         1.3 versicolor  2.9
#> 7            4.6         3.4          1.4         0.3     setosa   3+
#> 99           5.1         2.5          3.0         1.1 versicolor  2.5
#> 76           6.6         3.0          4.4         1.4 versicolor   3+
#> 77           6.8         2.8          4.8         1.4 versicolor  2.8
#> 85           5.4         3.0          4.5         1.5 versicolor   3+
#> 119          7.7         2.6          6.9         2.3  virginica  2.6
#> 110          7.2         3.6          6.1         2.5  virginica   3+
#> 103          7.1         3.0          5.9         2.1  virginica   3+
Smart spreadsheet parsing (managing group sub-header and sum rows, etc)
Say you have a set of spreadsheets formatted like so:

Is there an established method/library to parse this into R without having to individually edit the source spreadsheets? The aim is to parse the header rows and dispense with the sum rows so the output is the raw data, like so:

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           5.1         3.5          1.4         0.2     setosa
2           4.9         3.0          1.4         0.2     setosa
3           4.7         3.2          1.3         0.2     setosa
4           7.0         3.2          4.7         1.4 versicolor
5           6.4         3.2          4.5         1.5 versicolor
6           6.9         3.1          4.9         1.5 versicolor
7           5.7         2.8          4.1         1.3 versicolor
8           6.3         3.3          6.0         2.5  virginica
9           5.8         2.7          5.1         1.9  virginica
10          7.1         3.0          5.9         2.1  virginica

I can certainly hack a tailored solution to this, but I'm wondering if there is something a bit more developed/elegant than read.csv and a load of logic. Here's a reproducible demo csv dataset (you can't assume an equal number of lines per group), although I'm hoping the solution transposes to *.xlsx:

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17
There is a variety of ways to present spreadsheets, so it would be hard to have a consistent methodology for all presentations. However, it is possible to transform the data once it is loaded in R. Here's an example with your data. It uses the function na.locf from the zoo package.

x <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)

library(zoo)
x <- x[x$X != "Mean", ]                     # remove Mean lines
x$Species <- x$X                            # create Species column
x$Species[grepl("[0-9]", x$Species)] <- NA  # put NA if Species contains numbers
x$Species <- na.locf(x$Species)             # carry last observation forward over NA
x <- x[!rowSums(is.na(x)) > 0, ]            # remove lines with NA

   X Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
3  1          5.1         3.5          1.4         0.2     Setosa
4  2          4.9         3.0          1.4         0.2     Setosa
5  3          4.7         3.2          1.3         0.2     Setosa
9  1          7.0         3.2          4.7         1.4 Versicolor
10 2          6.4         3.2          4.5         1.5 Versicolor
11 3          6.9         3.1          4.9         1.5 Versicolor
15 1          6.3         3.3          6.0         2.5  Virginica
16 2          5.8         2.7          5.1         1.9  Virginica
17 3          7.1         3.0          5.9         2.1  Virginica
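The question hopes the approach transposes to *.xlsx; the same cleaning logic should carry over if the sheet is read with readxl. This is a sketch, assuming a file groups.xlsx laid out like the demo csv (the filename is hypothetical):

```r
library(readxl)
library(zoo)

# Read everything as text: the first column mixes row numbers,
# group names, and "Mean" labels, so types can't be guessed per column.
x <- read_excel("groups.xlsx", col_types = "text")
names(x)[1] <- "X"

x <- x[!is.na(x$X) & x$X != "Mean", ]             # drop blank and Mean rows
x$Species <- ifelse(grepl("[0-9]", x$X), NA, x$X) # NA where X is a row number
x$Species <- na.locf(x$Species)                   # fill group name downward
x <- x[grepl("[0-9]", x$X), ]                     # keep only the data rows
x[2:5] <- lapply(x[2:5], as.numeric)              # measurements back to numeric
```

The column positions (2:5) assume the four measurement columns sit between the row label and nothing else; adjust for your real sheet.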
I just recently did something similar. Here was my solution:

iris <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)

First I used a helper function which splits a data frame at a given index:

split_at <- function(x, index) {
  N <- NROW(x)
  s <- cumsum(seq_len(N) %in% index)
  unname(split(x, s))
}

Then you define that index using:

iris[,1] <- stringr::str_trim(iris[,1])
index <- which(iris[,1] %in% c("Virginica", "Versicolor", "Setosa"))

The rest is just using purrr::map_df to perform actions on each data.frame in the list that's returned. You can add some additional flexibility for removing unwanted rows if needed.

split_at(iris, index) %>%
  .[2:length(.)] %>%
  purrr::map_df(function(x) {
    Species <- x[1, 1]
    x <- x[-c(1, NROW(x) - 1, NROW(x)), ]
    data.frame(x, Species = Species)
  })