How select columns with UTF-8 symbols in name in R?

How select columns with UTF-8 symbols in name in R? - r

I have db as a database and "Coś ktoś_był" as a column name in db.
I tried this:
temp_df <- db %>%
select('Coś ktoś_był')
but output:
Error: Can't subset columns that don't exist.
x Column `Cos ktos_byl` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
I don't know how make it correct without change column name.
I can't change it!

Try this:
library(dplyr)
temp_df <- db %>%
select(matches("[^ -~]"))
Alternatively, in base R:
db[ , grepl("[^ -~]", names(db))]
Both methods will select any column with non-ASCII characters in the name. If you need to be more specific, you can use something along these lines:
temp_df <- df %>%
select(matches("^Co[^ -~]"))

Related

In R list ,how to set sub list names

How to set list names ,here is the code as below.
Currently,split_data include two sub list [[1]] and [[2]], how set names separately for them?
I want set name 'A' for [[1]],'B' for [[2]], so can retrieve data use split_data['A']...
Anyone can help on this, thanks ?
for instance ma <- list(a=c('a1','a2'),b=c('b1','b2')) can use ma["a"] for sub list
library(tidyverse)
test_data <- data.frame(category=c('A','B','A','B','A','B','A','B'),
sales=c(1,2,4,5,8,1,4,6))
split_data <- test_data %>% group_split(category)

Others have shown you in the comments how to get what you want using split() instead of group_split(). That seems like the easiest solution.
However, if you're stuck with the existing code, here's an alternative that keeps your current code, and adds the names.
library(tidyverse)
test_data <- data.frame(category=c('A','B','A','B','A','B','A','B'),
sales=c(1,2,4,5,8,1,4,6))
split_data <- test_data %>% group_split(category)
names(split_data) <- test_data %>% group_by(category) %>% group_keys() %>% apply(1, paste, collapse = ".")
The idea is to use group_by to split in the same way group_split does, then extract the keys as a tibble. This will have one row per group, but will have the different variables in separate columns, so I put them together by pasting the columns with a dot as separator. The last expression in the pipe is equivalent to apply(keys, 1, f)
where f is function(row) paste(row, collapse = "."). It applies f to each row of the tibble, producing a single name.
This should work even if the split happens on multiple variables, and produces names similar to those produced by split().

Batch Convert Columns from chr to num with either read_excel or dplyr

I have a database saved in excel, and when I bring it into R there are many columns that should be numeric, but they get listed as characters. I know that in read_excel I can specify each column format using the col_types = "numeric", but I have > 500 columns, so this gets a bit tedious.
Any suggestions on how to do this either when importing with read_excel, or after with dplyr or something similar?
I can do this 1 by 1 using a function that I wrote but it still requires writing out each column name
convert_column <- function(data, col_name) {
new_col_name <- paste0(col_name)
data %>% mutate(!!new_col_name := as.numeric(!!sym(col_name)))
}
convert_column("gFat_OVX") %>%
convert_column("gLean_OVX")%>%
convert_column("pFat_OVX") %>%
convert_column("pLean_OVX")
I would ideally like to say "if a column contains the text "Fat" or "Lean" in the header, then convert to numeric", but I'm open to suggestions.
select(df, contains("Fat" | "Lean"))
I'm not sure how to make an example that allows people to test this out, given that we're starting with an excel sheet here.

dplyr::mutate and across may be a solution after reading in the data.
Something like this, where df1 is your data frame from read_excel:
library(dplyr)
df1 <- df1 %>%
mutate(across(contains(c("Fat", "Lean")), ~as.numeric(.x)))

Selecting columns from dataframe programmatically when column names have spaces

I have a dataframe which I would like to query. Note that the columns of that dataframe could change and the column names have spcaes. I have a function that I want to apply on the dataframe columns. I figured I could programmatically find out what columns exists and then use that list of columns to apply function to the columns that exist.
I was able to figure out how to do that when the column names don't have spaces: See the code below
library(tidyverse)
library(rlang)
col_names <- c("cyl","mpg","New_Var")
cc <- rlang::quos(col_names)
mtcars%>%mutate(New_Var=1)%>%select(!!!cc)
But when the column names have spaces, this method does not works, below is the code I used:
col_names <- c("cyl","mpg","`New Var`")
cc <- rlang::quos(col_names)
mtcars%>%mutate(`New Var`=1)%>%select(!!!cc)
Is there a way to select columns that have spaces in their name without changing their names ?

You have to do nothing differently for values with spaces. For example,
library(dplyr)
library(rlang)
col_names <- c("cyl","mpg","New Var")
cc <- quos(col_names)
mtcars %>% mutate(`New Var`=1) %>% select(!!!cc)
Also note, that select also accepts string names so this works too :
mtcars%>% mutate(`New Var`=1) %>% select(col_names)

The columns option in spark_read_parquet

I tried to read a subset of columns from a 'table' using spark_read_parquet,
temp <- spark_read_parquet(sc, name='mytable',columns=c("Col1","Col2"),
path="/my/path/to/the/parquet/folder")
But I got the error:
Error: java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (54): .....
Is my syntax right? I tried googling for a (real) code example using the columns argument but couldn't find any.
(And my apologies in advance... I don't really know how to give you a reproducible example involving a spark and cloud.)

TL;DR This is not how columns work. When applied like this there are used to rename the columns, hence its lengths, should be equal to the length of the input.
The way to use it is (please note memory = FALSE, it is crucial for this to work correctly):
spark_read_parquet(
sc, name = "mytable", path = "/tmp/foo",
memory = FALSE
) %>% select(Col1, Col2)
optionally followed by
... %>%
sdf_persist()
If you have a character vector, you can use rlang:
library(rlang)
cols <- c("Col1", "Col2")
spark_read_parquet(sc, name="mytable", path="/tmp/foo", memory=FALSE) %>%
select(!!! lapply(cols, parse_quosure))

Passing column names as both variables and columns in a single dplyr function in R

I am writing a code in which a column name (e.g. "Category") is supplied by the user and assigned to a variable biz.area. For example...
biz.area <- "Category"
The original data frame is saved as risk.data. User also supplies the range of columns to analyze by providing column names for variables first.column and last.column.
Text in these columns will be broken up into bigrams for further text analysis including tf_idf.
My code for this analysis is given below.
x.bigrams <- risk.data %>%
gather(fields, alldata, first.column:last.column) %>%
unnest_tokens(bigrams,alldata,token = "ngrams", n=2) %>%
count(bigrams, biz.area, sort=TRUE) %>%
bind_tf_idf(bigrams, biz.area, n) %>%
arrange(desc(tf_idf))
However, I get the following error.
Error in grouped_df_impl(data, unname(vars), drop) : Column
x.biz.area is unknown
This is because count() expects a column name text string instead of variable biz.area. If I use count_() instead, I get the following error.
Error in compat_lazy_dots(vars, caller_env()) : object 'bigrams'
not found
This is because count_() expects to find only variables and bigrams is not a variable.
How can I pass both a constant and a variable to count() or count_()?
Thanks for your suggestion!

It looks to me like you need to enclosures, so that you can pass column names as variables, rather than as strings or values. Since you're already using dplyr, you can use dplyr's non-standard evaluation techniques.
Try something along these lines:
library(tidyverse)
analyze_risk <- function(area, firstcol, lastcol) {
# turn your arguments into enclosures
areaq <- enquo(area)
firstcolq <- enquo(firstcol)
lastcolq <- enquo(lastcol)
# run your analysis on the risk data
risk.data %>%
gather(fields, alldata, !!firstcolq:!!lastcolq) %>%
unnest_tokens(bigrams,alldata,token = "ngrams", n=2) %>%
count(bigrams, !!areaq, sort=TRUE) %>%
bind_tf_idf(bigrams, !!areaq, n) %>%
arrange(desc(tf_idf))
}
In this case, your users would pass bare column names into the function like this:
myresults <- analyze_risk(Category, Name_of_Firstcol, Name_of_Lastcol)
If you want users to pass in strings, you'll need to use rlang::expr() instead of enquo().

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How select columns with UTF-8 symbols in name in R? - r

Related

In R list ,how to set sub list names

Batch Convert Columns from chr to num with either read_excel or dplyr

Selecting columns from dataframe programmatically when column names have spaces

The columns option in spark_read_parquet

Passing column names as both variables and columns in a single dplyr function in R

Categories

Resources