how to drop columns by passing variable name with dplyr? - r

I have a df as follows:
a <- data_frame(keep=c("hello", "world"),drop = c("nice", "work"))
a
Source: local data frame [2 x 2]
keep drop
(chr) (chr)
1 hello nice
2 world work
I can use a %>% select(-drop) to drop the column without a problem. However, if I store the name of the column to drop in a variable and pass that instead, it returns an error:
name <- "drop"
a %>% select(-(name))
Error in -(name) : invalid argument to unary operator

You can use one_of to find the column positions and then use - to drop them: select(-one_of(name)). If you check ?select, this usage is documented in the Drop variable section in the Examples:
name <- "drop"
a %>% select(-one_of(name))
# A tibble: 2 × 1
# keep
# <chr>
#1 hello
#2 world
Or, with select_, paste - in front of the column names to drop them; if more than one column is to be dropped, pass the pasted column names to the .dots parameter:
name <- "drop"
a %>% select_(.dots = paste("-", name))
# A tibble: 2 × 1
# keep
# <chr>
#1 hello
#2 world

You can simply use
a <- data_frame(keep=c("hello", "world"),drop = c("nice", "work"))
select(a, -starts_with('drop'))
# Source: local data frame [2 x 1]
#
# keep
# (chr)
# 1 hello
# 2 world
You should also search for previously written solutions. Please read the dplyr documentation on "Select/rename variables by name".
I hope that does the job for you :)

We can use select_ with setdiff
a %>%
select_(setdiff(names(.), name))
# A tibble: 2 × 1
# keep
# <chr>
#1 hello
#2 world
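As a hedged sketch, the same setdiff idea can be written without the underscored verb by wrapping the kept names in all_of(). The data frame and name repeat the question's example; nothing else is assumed.

```r
library(dplyr)

a <- tibble(keep = c("hello", "world"), drop = c("nice", "work"))
name <- "drop"

# Keep every column whose name is not listed in `name`
kept <- a %>% select(all_of(setdiff(names(a), name)))
names(kept)
```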

A few more possibilities:
name <- "drop"
a %>% `[<-`(name, value=NULL)
a %>% magrittr::inset(name,value=NULL)
a %>% purrr::modify_at(name,~NULL)

I could only get these solutions to work by first ungrouping the data using ungroup:
df <- df %>% ungroup %>% select(-hello)
Notice no quotation marks on the column name you want to drop (hello). Also, to remove multiple columns, just place a , after hello and add the second column.
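A likely reason the ungroup step matters: select() always keeps grouping variables, so a column that is currently a grouping variable cannot be dropped until the grouping is removed. A minimal sketch, with a made-up two-column tibble (the column value is hypothetical):

```r
library(dplyr)

# Hypothetical data, grouped by the column we want to drop
df <- tibble(hello = c("x", "y"), value = 1:2) %>% group_by(hello)

# select() re-adds the grouping variable (with a message), so `hello` survives
kept <- df %>% select(-hello)

# After ungroup(), the column can be dropped as usual
dropped <- df %>% ungroup() %>% select(-hello)

names(kept)
names(dropped)
```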

From the ?select_ help: "dplyr used to offer twin versions of each verb suffixed with an underscore. ... However, dplyr now uses tidy evaluation semantics. ... Thus, the underscored versions are now superfluous."
The example given in vignette("programming"), similar to @Psidom's answer, is:
name <- "drop"
a %>% select(!all_of(name))
Alternatively, one could create a function to drop columns, so that drop does not need quoting:
drop_columns <- function(data, cols) {
data %>% select(!{{cols}})
}
drop_columns(a, drop)
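When the names to drop come from elsewhere and some of them may not exist in the data, any_of() may be a safer choice than all_of(), since all_of() raises an error for unknown names. A small sketch, assuming a tidyselect version that provides any_of(); the columns extra and not_there are hypothetical:

```r
library(dplyr)

a <- tibble(keep = c("hello", "world"),
            drop = c("nice", "work"),
            extra = 1:2)            # hypothetical extra column

# "not_there" does not exist in `a`; any_of() silently ignores it
to_drop <- c("drop", "extra", "not_there")

res <- a %>% select(!any_of(to_drop))
names(res)
```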

Related

R/arrow summarizing on variable columns

I have a large-ish parquet file I'm referencing via arrow::open_dataset. I'd like to get the max value of one or more of the columns, where I don't know a priori which (or how many) columns. In general, this sounds like "programming with dplyr" (assuming arrow-10 and its recent support of dplyr::across), but I can't get it to work.
write_parquet(data.frame(a=c(1,9), b=c(2,10), d=c("q","r")), "quux.parquet")
vars <- c("a")
open_dataset("quux.parquet") %>%
summarize(across(sym(vars), ~ max(.))) %>%
collect()
# # A tibble: 1 x 1
# a
# <dbl>
# 1 9
But when vars is length 2 or more, I assume I need to be using syms or similar, but that fails with
vars <- c("a", "b")
open_dataset("quux.parquet") %>%
summarize(across(all_of(syms(vars)), ~ max(.))) %>%
collect()
# Error: Must subset columns with a valid subscript vector.
# x Subscript has the wrong type `list`.
# i It must be numeric or character.
How do I lazily (not load all data) find the max of multiple columns in an arrow dataset?
I suspect that the correct answer in dplyr will be some form of syms; whether or not arrow supports that is the next question. I'm not tied to the dplyr mechanisms: if there's a method using ds$NewScan() or similar, I'm amenable.
Is this the kind of thing you're after - using tidyselect's all_of function?
library(arrow)
library(dplyr)
write_parquet(data.frame(a=c(1,9), b=c(2,10), d=c("q","r")), "quux.parquet")
vars <- c("a", "d")
open_dataset("quux.parquet") %>%
summarize(across(all_of(vars), ~ max(.))) %>%
collect()
#> # A tibble: 1 × 2
#> a d
#> <dbl> <chr>
#> 1 9 r
See https://tidyselect.r-lib.org/reference/index.html for the different tidyselect functions you may also want to check out.
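all_of() expects a character vector of column names, which is why wrapping syms(vars) (a list of symbols) in it fails. A minimal in-memory sketch with two numeric columns; a plain data frame stands in for the arrow dataset here, so this says nothing about arrow's own support:

```r
library(dplyr)

df <- data.frame(a = c(1, 9), b = c(2, 10), d = c("q", "r"))
vars <- c("a", "b")   # plain character vector, no sym()/syms() needed

res <- df %>% summarize(across(all_of(vars), max))
res
```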

extracting string patterns that repeat to a separate column or columns

Hi, I have not seen a similar solution to the problem I am having. I am trying to write a regex pattern to extract the characters following the word major within { } and place them in a major column. However, major repeats in row 2, and I need to extract and combine all characters within both { } following major. Ideally I would do this for the minor and incidental attributes as well. Not sure what I am getting wrong here. Thanks!
test <- data.frame(lith=c("major{basalt} minor{andesite} incidental{dacite rhyolite}",
"major {andesite flows} major {dacite flows}",
"major{andesite} minor{dacite}",
"major{basaltic andesitebasalt}"))
test %>%
mutate(major = str_extract_all(test$lith, "[major].*[{](\\D[a-z]*)[}]") %>%
map_chr(toString))
What I am looking for:
  major                        minor    incidental
1 basalt                       andesite dacite rhyolite
2 andesite flows, dacite flows <NA>     <NA>
3 basaltic andesitebasalt      <NA>     <NA>
First, (almost) never use test$ within a dplyr pipe starting with test %>%. At best it's just a little inefficient; if there are any intermediate steps that re-order, alter, or filter the data, then the results will be either (a) an error, preferred; or (b) silently just wrong. The reason: let's say you do
test %>%
filter(grepl("[wy]", lith)) %>%
mutate(major = str_extract_all(test$lith, ...))
In this case, the filter reduced the data from 4 rows to just 2 rows. However, since you're using test$lith, that's taken from the contents of test before the pipe started, so here test$lith is length-4 where we need it to be length-2.
Alternatively (and preferred),
test %>%
filter(grepl("[wy]", lith)) %>%
mutate(major = str_extract_all(lith, ...))
Here, the str_extract_all(lith, ...) sees only two values, not the original four.
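The length mismatch can be reproduced in a few lines. This is only an illustrative sketch: the four strings are made up, and nchar() stands in for the extraction step.

```r
library(dplyr)

# Made-up data: only rows 2 and 4 contain a "w" or a "y"
test <- data.frame(lith = c("major{basalt}", "major{andesite flows}",
                            "minor{dacite}", "major{rhyolite}"))

# test$lith still has 4 values, but only 2 rows remain after filter(): error
bad <- try(
  test %>%
    filter(grepl("[wy]", lith)) %>%
    mutate(n = nchar(test$lith)),
  silent = TRUE
)

# The bare column name refers to the filtered data, so lengths agree
good <- test %>%
  filter(grepl("[wy]", lith)) %>%
  mutate(n = nchar(lith))

inherits(bad, "try-error")
nrow(good)
```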
On to the regularly-scheduled answer ...
I'll add a row number rn column as an original row reference (id of sources). This is both functional (for things to work internally) and in case you need to tie it back to the original data somehow. I'm inferring that you group the values together as strings instead of list-columns, though it's easy enough to switch to the latter if desired.
library(dplyr)
library(stringr) # str_extract_all
library(tidyr) # unnest, pivot_wider
test %>%
mutate(
rn = row_number(),
tmp = str_extract_all(lith, "\\b([[:alpha:]]+) ?\\{([^}]+)\\}"),
tmp = lapply(tmp, function(z) strcapture("^([^{}]*) ?\\{(.*)\\}", z, list(categ="", val="")))
) %>%
unnest(tmp) %>%
mutate(across(c(categ, val), trimws)) %>%
group_by(rn, categ) %>%
summarize(val = paste(val, collapse = ", ")) %>%
pivot_wider(id_cols = rn, names_from = "categ", values_from = "val") %>%
ungroup()
# # A tibble: 4 x 4
#      rn incidental      major                        minor
#   <int> <chr>           <chr>                        <chr>
# 1     1 dacite rhyolite basalt                       andesite
# 2     2 NA              andesite flows, dacite flows NA
# 3     3 NA              andesite                     dacite
# 4     4 NA              basaltic andesitebasalt      NA

Make column of input items with purrr::map_df using .id without duplicating inputs for named vector

I often want to map over a vector of column names in a data frame, and keep track of the output using the .id argument. But to write the column names related to each map iteration into that .id column seems to require doubling up their name in the input vector - in other words, by naming each column name with its own name. If I don't name the column with its own name, then .id just stores the index of the iteration.
This is expected behavior, per the purrr::map docs:
.id
Either a string or NULL. If a string, the output will contain a variable with that name, storing either the name (if .x is named) or the index (if .x is unnamed) of the input.
But my approach feels a little clunky, so I imagine I'm missing something. Is there a better way to get a list of the columns I'm iterating over, that doesn't require writing each column name twice in the input vector? Any suggestions would be much appreciated!
Here's an example to work with:
library(rlang)
library(tidyverse)
tb <- tibble(foo = rnorm(10), bar = rnorm(10))
cols_once <- c("foo", "bar")
cols_once %>% map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# A tibble: 2 x 2
var avg <-- var stores only the iteration index
<chr> <dbl>
1 1 -0.0519
2 2 0.204
cols_twice <- c("foo" = "foo", "bar" = "bar")
cols_twice %>% map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# A tibble: 2 x 2
var avg <-- var stores the column names
<chr> <dbl>
1 foo -0.0519
2 bar 0.204
Here's an alternative solution for your specific scenario using summarize_at and gather:
tb %>% summarize_at( cols_once, mean ) %>% gather( var, avg )
# # A tibble: 2 x 2
# var avg
# <chr> <dbl>
# 1 foo 0.374
# 2 bar 0.0397
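summarize_at() and gather() have since been superseded in dplyr and tidyr. A sketch of the same idea with across() and pivot_longer(), assuming dplyr >= 1.0.0 and tidyr >= 1.0.0 (the seed is arbitrary and only there for reproducibility):

```r
library(dplyr)
library(tidyr)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
tb <- tibble(foo = rnorm(10), bar = rnorm(10))
cols_once <- c("foo", "bar")

# One mean per selected column, then reshape to one row per column name
res <- tb %>%
  summarise(across(all_of(cols_once), mean)) %>%
  pivot_longer(everything(), names_to = "var", values_to = "avg")
res
```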
In a more general scenario, I don't think there's a way around naming your cols_once when working with map_dfr, because of the expected behavior you pointed out in your question. However, you can use the "snake case" wrapper for setNames() to do it more elegantly:
cols_once %>% set_names %>%
map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
# # A tibble: 2 x 2
# var avg
# <chr> <dbl>
# 1 foo 0.374
# 2 bar 0.0397
You could create your input vector easily with:
setNames(names(tb), names(tb))
So your code would be:
setNames(names(tb), names(tb)) %>%
map_dfr(~ tb %>% summarise(avg = mean(!!sym(.x))), .id="var")
Edit following your comment:
Still not the solution you are hoping for, but when you don't use all the column names, you could still use setNames() and subset the ones you want (or subset out the ones you don't).
tb <- tibble(foo = rnorm(10), bar = rnorm(10), taz = rnorm(10))
setNames(names(tb), names(tb))[-3]

Problem using dplyr on tibbles with vector elements [list columns]

I am running into some problems doing text processing using dplyr and stringr functions (specifically str_split()). I think I am misunderstanding something very fundamental about how to use dplyr correctly when dealing with elements that are vectors/lists.
Here's a tibble, df...
library(tidyverse)
df <- tribble(
~item, ~phrase,
"one", "romeo and juliet",
"two", "laurel and hardy",
"three", "apples and oranges and pears and peaches"
)
Now I create a new column, splitPhrase, by doing str_split() on one of the columns using "and" as the delimiter.
df <- df %>%
mutate(splitPhrase = str_split(phrase,"and"))
That seems to work, sort of. In the console I see that my new column, splitPhrase, is actually a list column, although it looks correct in the RStudio display:
df
#> # A tibble: 3 x 3
#> item phrase splitPhrase
#> <chr> <chr> <list>
#> 1 one romeo and juliet <chr [2]>
#> 2 two laurel and hardy <chr [2]>
#> 3 three apples and oranges and pears and peaches <chr [4]>
What I ultimately want to do is to extract the last item of each splitPhrase (for example, "juliet" for the first row).
The problem is I can't see how to just grab the last element in each splitPhrase. If it were just a vector, I could do something like this...
last(c("a", "b", "c"))
#> [1] "c"
But that doesn't work within the tibble, and neither do the other things that come to mind:
df <- df %>%
mutate(lastThing = last(splitPhrase))
# Error in mutate_impl(.data, dots) :
# Column `lastThing` must be length 3 (the number of rows) or one, not 4
df <- df %>% group_by(splitPhrase) %>%
mutate(lastThing = last(splitPhrase))
# Error in grouped_df_impl(data, unname(vars), drop) :
# Column `splitPhrase` can't be used as a grouping variable because it's a list
So, I think I am "not getting" how to work with vectors that are inside an element in table/tibble column. It seems to have something to do with the fact that in my example it's actually a list of vectors.
Is there a particular function that will help me out here, or a better way of getting to this?
Created on 2018-09-27 by the reprex package (v0.2.1)
The 'splitPhrase' column is a list, so we loop through the list to get the elements
library(tidyverse)
df %>%
mutate(splitPhrase = str_split(phrase,"\\s*and\\s*"),
Last = map_chr(splitPhrase, last)) %>%
select(item, Last)
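To see why the loop is needed, it may help to compare the two calls outside of mutate. In this sketch (two phrases adapted from the question), last() applied to the whole list returns its last element, a whole character vector, whereas map_chr() applies last() to each element and returns one string per row:

```r
library(dplyr)    # last()
library(stringr)  # str_split()
library(purrr)    # map_chr()

phrases <- c("romeo and juliet", "apples and oranges and pears")
parts <- str_split(phrases, "\\s*and\\s*")   # a list of character vectors

whole_last <- last(parts)           # last list element: a length-3 vector
last_each  <- map_chr(parts, last)  # last piece of each phrase

whole_last
last_each
```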
But, it can be done in many ways. Using separate_rows, expand the column, then get last element grouped by 'item'
df %>%
separate_rows(phrase,sep = " and ") %>%
group_by(item) %>%
summarise(Last = last(phrase))
Haven't tested for efficiency, but we can also use regex to extract the string segment after the last "and":
With sub:
library(dplyr)
df %>%
mutate(lastThing = sub("^.*and\\s", "", phrase)) %>%
select(-phrase)
With str_extract:
library(stringr)
df %>%
mutate(lastThing = str_extract(phrase, "(?<=and\\s)\\w+$")) %>%
select(-phrase)
With extract:
library(tidyr)
df %>%
extract(phrase, "lastThing", "^.*and\\s(\\w+)")
Output:
# A tibble: 3 x 2
item lastThing
<chr> <chr>
1 one juliet
2 two hardy
3 three peaches

R: How to best extract two XML attributes from a node?

The following code extracts one attribute (or all) from an XML file:
library(xml2);library(magrittr);library(readr);library(tibble);library(knitr)
fname<-'https://raw.githubusercontent.com/wardblonde/ODM-to-i2b2/master/odm/examples/CDISC_ODM_example_3.xml'
fname
x<-read_xml(fname)
xpath="//d1:ItemDef"
itemsNames <- x %>% xml_find_all(xpath, ns=xml_ns(x)) %>% xml_attr('Name')
items <- x %>% xml_find_all(xpath, ns=xml_ns(x))
Item looks like this:
<ItemDef OID="IT.ABNORM" Name="Normal/Abnormal/Not Done" DataType="integer" Length="1" ...
Sample file can be viewed here: https://raw.githubusercontent.com/wardblonde/ODM-to-i2b2/master/odm/examples/CDISC_ODM_example_3.xml
Using pipes and xml_attr, what is the best way to extract both the Name and DataType attributes and have them rbinded?
Ideally it would be a single line of super efficient piped code. I can extract names and types and have 'data.frame(name=names,type=types)' but that seems not the best and most modern.
The result should be a tibble with columns name and data type.
library(purrr)
library(dplyr) # select
map(items, xml_attrs) %>%
map_df(as.list) %>%
select(Name, DataType)
## # A tibble: 94 × 2
## Name DataType
## <chr> <chr>
## 1 Normal/Abnormal/Not Done integer
## 2 Actions taken re study drug text
## 3 Actions taken, other text
## 4 Stop Day - Enter Two Digits 01-31 text
## 5 Derived Stop Date text
## 6 Stop Month - Enter Two Digits 01-12 text
## 7 Stop Year - Enter Four Digit Year text
## 8 Outcome text
## 9 Relationship to study drug text
## 10 Severity text
## # ... with 84 more rows
One "base" version:
lapply(items, xml_attrs) %>%
lapply(function(x) as.data.frame(as.list(x))[,c("Name", "DataType")]) %>%
do.call(rbind, .) %>%
tbl_df()
NOTE: an issue with ^^ is that if Name or DataType is missing then you're SOL. You can mitigate that with:
lapply(items, xml_attrs) %>%
lapply(function(x) as.data.frame(as.list(x))[,c("Name", "DataType")]) %>%
data.table::rbindlist(fill=TRUE) %>%
tbl_df()
or:
lapply(items, xml_attrs) %>%
lapply(function(x) as.data.frame(as.list(x))[,c("Name", "DataType")]) %>%
bind_rows() %>%
tbl_df()
if you don't like purrr.
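Another option, sketched here under the assumption that only a known set of attributes is needed, is to call xml_attr() once per attribute. It returns NA where an attribute is absent, so missing values need no special handling. The inline XML below is made up for illustration and has no namespace, unlike the ODM file in the question:

```r
library(xml2)
library(tibble)

# Made-up document; the second ItemDef has no DataType attribute
doc <- read_xml(
  '<root>
     <ItemDef OID="IT.A" Name="Alpha" DataType="integer"/>
     <ItemDef OID="IT.B" Name="Beta"/>
   </root>'
)
items <- xml_find_all(doc, "//ItemDef")

# One vectorised xml_attr() call per column; missing attributes become NA
res <- tibble(
  Name     = xml_attr(items, "Name"),
  DataType = xml_attr(items, "DataType")
)
res
```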

Resources