Avoiding backtick characters with dplyr

How can I write the argument of select() without backtick characters? I would like to do this so that I can pass the argument in as a character string stored in a variable.
df <- dat[["__Table"]] %>% select(`__ID` ) %>% mutate(fk_table = "__Table", val = 1)
Changing the argument of select to "__ID" gives this error:
Error: All select() inputs must resolve to integer column positions.
The following do not:
* "__ID"
Unfortunately, the _ characters in column names cannot be avoided since the data is downloaded from a relational database (FileMaker) via ODBC and needs to be written back to the database while preserving the column names.
Ideally, I would like to be able to do the following:
colName <- "__ID"
df <- dat[["__Table"]] %>% select(colName) %>% mutate(fk_table = "__Table", val = 1)
I've also tried eval(parse()):
df <- dat[["__Table"]] %>% select( eval(parse(text="__ID")) ) %>% mutate(fk_table = "__Table", val = 1)
It throws this error:
Error in parse(text = "__ID") : <text>:1:1: unexpected input
1: _
^
By the way, the following does work, but then I'm back to square one (still with backtick symbol).
eval(parse(text="`__ID`"))
References about backtick characters in R:
Removing backticks in R output
What do backticks do in R?
R encoding ASCII backtick

You can use as.name() with select_():
colName <- "__ID"
df <- data.frame(`__ID` = c(1,2,3), `123` = c(4,5,6), check.names = FALSE)
select_(df, as.name(colName))
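Note that select_() and the other underscore verbs have since been deprecated in dplyr. On current versions you can pass the string straight through with all_of(), or splice a symbol with !!; a quick sketch:
library(dplyr)

colName <- "__ID"
df <- data.frame(`__ID` = c(1, 2, 3), check.names = FALSE)

select(df, all_of(colName))    # string-based selection, no backticks needed
select(df, !!as.name(colName)) # tidy-eval equivalent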

Related

Multiline text in R dataframe

I'm trying to include multiline text in a dataframe cell; however, R keeps reading the \n as a new row, resulting in row mismatches. If I change the 'code' input to a simple string, the code works fine.
Defined dataframe:
df <- data.frame(
  "Id" = character(),
  "Name" = character(),
  "Code" = character()
)
Adding a new row:
NewRow <- data.frame(
  "Id" = Id,     # Simple string
  "Name" = Name, # Simple string
  "Code" = Code  # Complex multiline string containing '#' and '\n' (10+ lines)
)
df <- rbind(df, NewRow)
Received error: Error in data.frame: arguments imply differing number of rows: 1, 0
Does anyone know how to get around this problem?
Many thanks in advance!
Maybe what you can try is to clean up the Code variable a bit before adding it to the dataframe: remove the \n and # characters from Code, then add it. You can use stringr together with the dplyr pipe to update the Code variable:
library(dplyr)   # for the %>% pipe
library(stringr)

### Using the replace option:
Code <- Code %>%
  str_replace_all("\n", "") %>%
  str_replace_all("#", "")

### Using the remove option:
Code <- Code %>%
  str_remove_all("\n") %>%
  str_remove_all("#")

What is causing 'object not found' error in filter() with the across() function?

This function filters/selects one or more variables from my dataset and writes it to a new CSV file. I'm getting an 'object not found' error when I call the function. Here is the function:
extract_ids <- function(filename, opp, ...) {
  # Read in data
  df <- read_csv(filename)
  # Remove rows 2,3
  df <- df[-c(1,2),]
  # Filter and select
  df_id <- filter(df, across(..., ~ !is.na(.x)) & gc == 1) %>%
    select(...) # not sure if my use of ... here is correct
  # String together variables for export file path
  path <- c("/Users/stephenpoole/Downloads/", opp, "_", ..., ".csv") # not sure if ... here is correct
  # Export the file
  write_csv(df_id, paste(path, collapse = ''))
}
And here is the function call. I'm trying to get columns "rid" and "cintid."
extract_ids(filename = "farmers.csv",
            opp = "farmers",
            rid, cintid)
When I run this, I get the below error:
Error: Problem with `filter()` input `..1`.
ℹ Input `..1` is `across(..., ~!is.na(.x)) & gc == 1`.
x object 'cintid' not found
The column cintid is correct and appears in the data. I've also tried running it with just one column, rid, and get the same 'object not found' error.
If you are passing multiple values to across(), you need to collect them in the first parameter, otherwise they will spread into the other parameters of across(). Try
filter(df, across(c(...), ~ !is.na(.x)))
Otherwise, every value other than the first one will be passed along as a parameter to the function you've specified in across().
Sorry for omitting this in my previous suggestion to you. Unfortunately, your original question was closed before I could post it as an answer:
If you want your function to resemble dplyr, here are a few modifications you can make. Write your function header as function(filename, opp, ...) verbatim. Then, replace !is.na(ID) with across(..., ~ !is.na(.x)) verbatim. Now, you can call extract_ids() and, just as you would with any dplyr verb, you can specify any selection of columns you want to filter out NAs:
extract_ids(filename = "farmers.csv", opp = "farmers", rid, another_column_you_want_without_NAs).
Object Not Found
As MrFlick rightly suggests in their comment, you should wrap ... with c(), so everything you pass in ... is interpreted as the first argument to across(): a single tidy-selection of columns from df:
extract_ids <- function(filename, opp, ...) {
  # ...

  # Filter and select
  df_id <- df %>%
    # This format is preferred for dplyr workflows with pipes (%>%).
    filter(across(c(...), ~ !is.na(.x)) & gc == 1) %>%
    select(...)

  # ...
}
Without this precaution, R interprets rid and cintid as multiple arguments to across(), rather than as simply columns named by the first argument (the tidy-selection).
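As a side note, on dplyr 1.0.4 and later the preferred spelling of this kind of filter is if_all(), since using across() inside filter() was later deprecated. Inside the same function it would read:
# Hypothetical drop-in replacement inside extract_ids():
df_id <- df %>%
  filter(if_all(c(...), ~ !is.na(.x)) & gc == 1) %>%
  select(...)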
Variable Names in the Filepath
To get those variable names within your filepath, use
extract_ids <- function(filename, opp, ...) {
  # ...

  # Expand the '...' into a list of the given variable names, which will get pasted.
  path <- c("/Users/stephenpoole/Downloads/", opp, "_", match.call(expand.dots = FALSE)$`...`, ".csv")

  # ...
}
though you might want to consider replacing match.call(expand.dots = FALSE)$`...`, which currently mushes together the variable names:
"/Users/stephenpoole/Downloads/farmers_ridcintid.csv"
In exactly the same place, you might use the expression paste(match.call(expand.dots = FALSE)$`...`, collapse = "-"), which will separate those variable names using -
"/Users/stephenpoole/Downloads/farmers_rid-cintid.csv"
or any other separator of your choice that gives a valid filename.
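For reference, here is a sketch of the whole function with both fixes applied; the Downloads path and the gc == 1 condition are carried over from the question unchanged:
library(readr)
library(dplyr)

extract_ids <- function(filename, opp, ...) {
  # Read in data and drop the first two data rows
  df <- read_csv(filename)
  df <- df[-c(1, 2), ]

  # Keep rows with no NAs in the named columns, then keep only those columns
  df_id <- df %>%
    filter(across(c(...), ~ !is.na(.x)) & gc == 1) %>%
    select(...)

  # Build the export path from the variable names passed through '...'
  cols <- paste(match.call(expand.dots = FALSE)$`...`, collapse = "-")
  path <- paste0("/Users/stephenpoole/Downloads/", opp, "_", cols, ".csv")

  write_csv(df_id, path)
}

extract_ids(filename = "farmers.csv", opp = "farmers", rid, cintid)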

using select and stringr together

I'm trying
qual %>% select(reasons_code) %>% str_replace('\\+.*',replacement = '')
but I get the Warning message: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), : argument is not an atomic vector; coercing.
However, when I do the following, the replacement works fine.
str_replace(qual$reasons_code,'\\+.*',replacement = '')
Does anyone know why this is happening?
Per ?str_replace, the input string is
string - Input vector. Either a character vector, or something coercible to one.
while the output from select is a data.frame with a single column selected; it is not converted to a vector. Instead of select, we can pull the column as a vector, and it should work:
library(dplyr)
qual %>%
  pull(reasons_code) %>%
  str_replace('\\+.*', replacement = '')
Or, if we prefer to use the OP's code with select, there are several ways to convert to a vector; unlist is one of them:
qual %>%
  select(reasons_code) %>%
  unlist %>%
  str_replace('\\+.*', replacement = '')
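A quick reproducible check, with a made-up qual data frame:
library(dplyr)
library(stringr)

qual <- data.frame(reasons_code = c("A+too slow", "B+no stock", "C"))

qual %>%
  pull(reasons_code) %>%
  str_replace('\\+.*', replacement = '')
# [1] "A" "B" "C"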

Error with Chinese characters in creating notebook: non-numeric argument to binary operator

I am using an R Notebook. The problem is that when I add Chinese-character levels or columns to the data table, the application always says:
error creating notebook: non-numeric argument to binary operator.
I know there is some incompatibility between R Notebooks and Chinese characters.
Does anyone know how to fix it?
My code is here.
library(dplyr)   # data_frame()
library(forcats) # fct_recode()

(
  y <- data_frame(
    x = c("cmstr", "cmbool", "cmnum")
  ) %>%
    mutate(x = as.factor(x)) %>%
    mutate(x = fct_recode(x,
      "一" = "cmstr",
      "二" = "cmbool",
      "三" = "cmnum"
    ))
)

Generalizing a function to return a list of data.frame columns with invalid UTF-8 bytes/code points

I recently wrote a function that uses grep and regex to find invalid UTF-8 code points (since I work on a Mac, my locale is also UTF-8). The input doesn't have to be UTF-8, as the function is looking for invalid UTF-8 bytes. I wrote the function for work and was wondering if anyone could provide tips for generalizing it, or catch any red flags in the code that I didn't notice (e.g. using base code instead of dplyr). Feel free to use any of the code if it's useful to you.
enc_check <- function(data) {
  library(dplyr)
  library(magrittr)

  # Create vector of all possible 2-digit hexadecimal numbers (2 digits is the length of a byte)
  allBytes <- list(x_esc = '\\x',
                   hex1 = as.character(c(seq(0,9),
                                         c('a','b','c','d','e','f'))),
                   hex2 = as.character(c(seq(0,9),
                                         c('a','b','c','d','e','f')))
                   ) %$%
    expand.grid(x_esc, hex1, hex2) %>%
    apply(1, paste, collapse = '')

  # Valid mixed alphanumeric bytes
  validBytes1 <- list(x_esc = '\\x',
                      hexNum = as.character(c(seq(2,7))),
                      hexAlpha = c('a','b','c','d','e','f')
                      ) %$%
    expand.grid(x_esc, hexNum, hexAlpha) %>%
    apply(1, paste, collapse = '') %>%
    extract(. != '\\x7f')

  # Valid purely numeric bytes
  validBytes2 <- list(x_esc = '\\x',
                      hexNum2 = as.character(seq(20,79))
                      ) %$%
    expand.grid(x_esc, hexNum2) %>%
    apply(1, paste, collapse = '')

  # New-line byte
  validBytes3 <- '\\x0a'
  # charToRaw('\n')
  # [1] 0a

  # Filter all possible combinations down to only invalid bytes
  validBytes <- c(validBytes1, validBytes2, validBytes3)
  invalidBytes <- allBytes %>%
    extract(not(is_in(., validBytes)))

  # Create list of data.frame columns with invalid bytes
  a_vector <- vector()
  matches <- list()
  for (i in 1:dim(data)[2]) {
    a_vector <- data[,i]
    matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
  }

  # Get rid of empty list elements
  matches %<>%
    lapply(length) %$%
    extract(matches, . > 0)
  # matches <- matches[lapply(matches,length) > 0]
  return(matches)
}
Edit: Here's the updated code with the suggestions implemented.
enc_check <- function(dataset) {
  library(dplyr)
  library(magrittr)
  rASCII <- c( '\n', '\r', '\t','\b',
               '\a', '\f', '\v', '\\', '\'', '\"', '\`')
  # Note: allBytes was missing from this revision; reusing the construction
  # suggested in the answer below
  allBytes <- paste0("\\x", as.character(as.hexmode(0:255)))
  validBytes <- paste0("\\x",
                       c(as.character(as.hexmode(32:126)),
                         sapply(rASCII, charToRaw))) %>%
    extract(not(duplicated(.)))
  invalidBytes <- allBytes %>%
    extract(not(is_in(., validBytes)))
  a_vector <- vector()
  matches <- list()
  for (i in 1:dim(dataset)[2]) {
    a_vector <- dataset[,i]
    matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
  } # sapply() is preferable to lapply due to USE.NAMES = TRUE
  names(matches) <- names(dataset)
  matches %<>%
    lapply(length) %$%
    extract(matches, . > 0)
  return(matches)
}
2nd Edit: A better strategy was to use iconv. Say you have a file or object that is generally UTF-8 but contains some invalid bytes. This is often the case with Mac computers, whose default locale setting seems to be UTF-8. Moreover, Mac-based RStudio seems to use UTF-8 internally, and this can't be changed even if you set your computer's locale to a different encoding.

Anyway, you can use iconv to substitute all invalid bytes, normally displayed as hexadecimal bytes (e.g. "\x8f"), with the Unicode replacement symbol. Then you can search for that symbol and return a list of unique observations within a data.frame column containing it. Based on that, you can use sub() to replace those characters with the desired ones. One thing to note is that converting the file to another encoding, say latin-1, can have unexpected results if invalid bytes are present. When I did this, I noticed that some invalid bytes were converted to Unicode control characters, while other invalid bytes apparently matched valid latin-1 bytes and were displayed as nonsensical characters.

In either case, I wrote a package to search data.frames for these characters and return a list, then do some replacement. The package isn't nearly as official as something off of CRAN, but if anyone's interested, here's a link to the repository: https://github.com/jkroes/FixEncoding. It's important to note that the "stable" version of the package isn't on the "master" branch; it's actually on the branch "iconv". The documentation can be searched in R via "?FixEncoding" after installation of the correct branch, then finding the functions listed there and searching help for those.
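To illustrate the iconv() substitution described above, here is a minimal sketch. The input string is a made-up example containing the invalid byte "\x8f"; passing a multibyte replacement string as sub is assumed to work on your platform (sub = "byte" is a conservative fallback):
x <- "abc\x8fdef"  # invalid UTF-8 byte embedded in an otherwise-valid string

# Replace every invalid byte with the Unicode replacement character U+FFFD
clean <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "\uFFFD")

# The damage is now searchable as an ordinary character
grepl("\uFFFD", clean)  # TRUE: the string contained at least one invalid byte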
This will construct the character versions of all the hex numbers up to "ff":
allBytes <- as.character( as.hexmode(0:255) )
Or as greppish patterns as you seem to prefer:
allBytes <- paste0("\\x", as.character( as.hexmode(0:255) ) )
The "special" characters that R recognizes does include the "\n" that you lissted but also a few more listed on the ?Quotes help page:
rASCII <- c( '\n', '\r', '\t','\b',
'\a', '\f', '\v', '\\', '\'', '\"', '\`')
You could create a vector of valid grep patterns for the printable "characters" from space up to tilde ("~") just with this:
validBytes1 <- c(rASCII, paste0("\\x", as.hexmode(32:126)))
I have concerns about using this strategy, since my R throws errors when it tries to do greppish matches with what it considers an invalid input string.
> txt <- "ttt\nuuu\tiii\xff"
> dfrm <- data.frame(a = txt)
> lapply(dfrm, grep, patt = "\\xff")
$a
integer(0)
Warning message:
In FUN(X[[i]], ...) : input string 1 is invalid in this locale
> lapply(dfrm, grep, patt = "\\\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
> lapply(dfrm, grep, patt = "\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
You may want to switch over to grepRaw since it doesn't throw the same errors:
> grepRaw("\xff", txt)
[1] 12
Or you may use ?tools::showNonASCII, as suggested by Duncan Murdoch when this came up on R-help 4 years ago:
?tools::showNonASCII
# and the help page has a reproducible example of its use:
out <- c(
"fa\xE7ile test of showNonASCII():",
"\\details{",
" This is a good line",
" This has an \xfcmlaut in it.",
" OK again.",
"}")
f <- tempfile()
cat(out, file = f, sep = "\n")
tools::showNonASCIIfile(f)
#-------output appears in red----
1: fa<e7>ile test of showNonASCII():
4: This has an <fc>mlaut in it.
