The columns option in spark_read_parquet - r

I tried to read a subset of columns from a 'table' using spark_read_parquet,
temp <- spark_read_parquet(sc, name='mytable',columns=c("Col1","Col2"),
path="/my/path/to/the/parquet/folder")
But I got the error:
Error: java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (54): .....
Is my syntax right? I tried googling for a (real) code example using the columns argument but couldn't find any.
(And my apologies in advance... I don't really know how to give you a reproducible example involving a spark and cloud.)

TL;DR This is not how columns works. When applied like this, it is used to rename the columns, so its length must equal the number of columns in the input (54 here).
To read only a subset, select the columns after reading (please note memory = FALSE, it is crucial for this to work correctly):
spark_read_parquet(
sc, name = "mytable", path = "/tmp/foo",
memory = FALSE
) %>% select(Col1, Col2)
optionally followed by
... %>%
sdf_persist()
If you have a character vector, you can use rlang:
library(rlang)
cols <- c("Col1", "Col2")
spark_read_parquet(sc, name="mytable", path="/tmp/foo", memory=FALSE) %>%
select(!!!syms(cols))

Related

`by` can't contain join column/inner_join

I'm trying to run the code below but I keep getting this error. I'm applying inner_join to student_db and grade_db by student_id, and then joining course_db by course_id.
Could anyone help with this issue?
q2 <- inner_join(student_db, grade_db, by = "student_id") %>%
inner_join(course_db, by = "course_id", suffix = c(".student", ".course")) %>%
filter(name.student == "Ava Smith" | name.student == "Freddie Haris")
Error in common_by.list():
! by can't contain join column
course_id which is missing from LHS.
Run rlang::last_error() to see where the error occurred.
Does this work?
library(dplyr)
q2 <- student_db %>%
inner_join(grade_db, by = c("student_id"="student_id")) %>%
inner_join(course_db, by = c("course_id"="course_id")) %>%
filter(name.student %in% c("Ava Smith","Freddie Haris"))
If the names of your id variables are different in the two data frames, the by argument allows you to tell R which variables should be matched, e.g. by = c("var_from_df1" = "var_from_df2"). (My guess is that your data frames have different column names, so this might be what you need to fix.)
I'm not sure why you've included the suffix argument. It is there for when both data sets contain a non-join variable with the same name, so that the two copies can be told apart in the result. If you need it you can add it back. It's hard to tell exactly what the problem is without seeing your data frames or similar example data.
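As a minimal sketch of the named by vector described above, assume three small tables in which the student key is called student_id on one side and sid on the other. The tables, their contents, and the sid column are hypothetical, since the original data frames were not shown.

```r
library(dplyr)

# Hypothetical tables; the student key has a different name in each one.
students <- tibble(student_id = c(1, 2, 3),
                   name = c("Ava Smith", "Freddie Haris", "Li Wei"))
grades   <- tibble(sid = c(1, 1, 2),
                   course_id = c(10, 11, 10),
                   grade = c("A", "B", "C"))
courses  <- tibble(course_id = c(10, 11),
                   name = c("Algebra", "Biology"))

q2 <- students %>%
  # Left-hand name on the left of "=", right-hand name on the right.
  inner_join(grades, by = c("student_id" = "sid")) %>%
  # Both sides now carry a "name" column, so suffix keeps them apart.
  inner_join(courses, by = "course_id", suffix = c(".student", ".course")) %>%
  filter(name.student %in% c("Ava Smith", "Freddie Haris"))
```

The second join only works because course_id survives the first join; if the intermediate result lacks that column, dplyr reports that the join column is missing from the left-hand side.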

How to select columns with UTF-8 symbols in the name in R?

I have db as a database and "Coś ktoś_był" as a column name in db.
I tried this:
temp_df <- db %>%
select('Coś ktoś_był')
but output:
Error: Can't subset columns that don't exist.
x Column `Cos ktos_byl` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
I don't know how to make this work without changing the column name, and I can't change it!
Try this:
library(dplyr)
temp_df <- db %>%
select(matches("[^ -~]"))
Alternatively, in base R:
db[ , grepl("[^ -~]", names(db))]
Both methods will select any column with non-ASCII characters in the name. If you need to be more specific, you can use something along these lines:
temp_df <- db %>%
select(matches("^Co[^ -~]"))
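The same pattern can be tried on a small stand-in data frame. This is only a sketch: the data frame and its extra columns are hypothetical, and the name is written with \u escapes so the script itself stays pure ASCII.

```r
# Hypothetical data frame with one non-ASCII column name.
db <- data.frame(a = 1:3, b = 4:6, c = 7:9)
names(db) <- c("Co\u015b kto\u015b_by\u0142", "id", "Cos_plain")

# TRUE for every name containing a character outside the printable ASCII range.
non_ascii <- grepl("[^ -~]", names(db))
picked <- db[, non_ascii, drop = FALSE]

# Narrower match: names starting with "Co" followed by a non-ASCII character.
by_prefix <- db[, grepl("^Co[^ -~]", names(db)), drop = FALSE]
```

drop = FALSE keeps the result a data frame even when only one column matches.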

Only strings can be converted to symbols within a function in R

I have a function that is intended to operate on data obtained from a variety of sources with many manual entry fields. Since I don't know what to expect for the layout or naming convention used in these files, I want it to 'scan' a data frame for columns with the character string 'fix', 'name', or 'agent', and mutate the column to a new column with name 'Firm', then proceed to do string cleaning on the entries of that column, then finally, remove the original column. I have gotten it to work with SOME of the CSVs that I have already, but now have run into this error: ONLY STRINGS CAN BE CONVERTED TO SYMBOLS. I have checked into this thread ERROR: Only strings can be converted to symbols but to no avail.
Here is the function at the moment:
clean_firm_names2 <- function(df){
df <- df %>%
mutate(Firm := !!rlang::sym(grep(pattern = '(AGENT)|(NAME)|(FIX)',x = colnames(.), ignore.case = T, value = T)) %>%
str_replace_all(pattern = "(\\W)+"," ") %>%
...str manipulations...
str_squish()) %>%
dplyr::select(-(!!rlang::sym(grep(pattern = '(AGENT)|(NAME)|(FIX)',x = colnames(.), ignore.case = T, value = T))))
return(df)
}
I have tried using as.character() around the grep() function but that did not solve the problem. I have looked at the CSV that the function is meant to operate on and all of the column names are character strings. I read in the CSV using vroom(), as with my other CSVs, and that works fine, all of the column names appear. I can perform other dplyr functions on the df, suggesting to me that the df is behaving normally otherwise. I have run out of ideas as to why the function is choking up only on SOME of my CSVs but works as intended on others. Has anyone run into similar issues or got any clues as to what might be causing this error? This is the first time I've used SO-- I'm sorry if this question isn't very clear. I'll try and edit as needed.
Thanks!
Note that grep() returns the indices of the matches (integers) by default, and the matching strings only when value = TRUE. rlang::sym() accepts exactly one string, so the error also appears when the pattern matches no column or more than one. Integer indices can be passed directly to dplyr::rename, so perhaps the following works better?
i <- grep(pattern = '(AGENT)|(NAME)|(FIX)', x = colnames(df), ignore.case = TRUE)
df <- df %>%
rename(Firm = all_of(i)) %>%
mutate(Firm = ...str manipulations... )
(There is an implicit assumption here that your grep() returns a single index. Additional code is required to handle zero or multiple matches.)
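One way to handle the zero-match and multiple-match cases, sketched in base R. The column names are hypothetical stand-ins for one of the CSV files, and keeping only the first match is an assumption rather than something the question specifies.

```r
# Hypothetical column names standing in for one of the CSV files.
df <- data.frame(ID = 1:2,
                 `Agent Name` = c("a  b", "c"),
                 Fix = c("x", "y"),
                 check.names = FALSE)

pattern <- "(AGENT)|(NAME)|(FIX)"
idx  <- grep(pattern, colnames(df), ignore.case = TRUE)                # integer positions
hits <- grep(pattern, colnames(df), ignore.case = TRUE, value = TRUE)  # matching names

# Fail early when nothing matches; otherwise rename the first match only.
if (length(idx) == 0) stop("no column matches the pattern")
names(df)[idx[1]] <- "Firm"
```

Checking length(idx) before renaming makes it obvious which files have no matching column or several candidates.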

Suppose the name of a column in a dataframe is unknown to me, how can I sort the df according to the values in that column?

I'm trying to sort a dataframe in descending order according to the values in a specific column whose name is supposed to be unknown to me (i.e. I know it but I am not allowed to use it). The only clue is that it is the last column of this dataframe.
I've tried arrange() and order() but they don't work. I also noticed that names(df)[ncol(df)] gives me the name of that column as a character string. However, arrange() seems to expect the column name in backticks (`columnName`) rather than in quotes ("columnName"), so I don't know how to pass the name I got to the functions I want to use.
Base R
mtcars[order(mtcars[[ncol(mtcars)]]), ] # ascending
mtcars[order(mtcars[[ncol(mtcars)]], decreasing = TRUE), ] # descending
tidyverse
library(dplyr)
mtcars %>% arrange_at(vars(last(names(.)))) #ascending
mtcars %>% arrange_at(vars(last(names(.))), desc) #descending
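The base R approach can be checked on a tiny data frame. This is a sketch with hypothetical data; the last column is extracted with [[ ]] so that order() receives a plain vector rather than a one-column data frame.

```r
# Hypothetical data; only the position of the last column is used.
df <- data.frame(id = c("a", "b", "c"), score = c(2, 9, 5))

last <- df[[ncol(df)]]                             # values of the last column
asc_sorted  <- df[order(last), ]                   # ascending
desc_sorted <- df[order(last, decreasing = TRUE), ] # descending
```

Because the column is addressed by position, its name never appears in the code.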

how to use gather_ in tidyr with variables

I'm using tidyr together with shiny and hence need to use dynamic values in tidyr operations.
However, I have trouble using gather_(), which I think was designed for such cases.
Minimal example below:
library(tidyr)
df <- data.frame(name=letters[1:5],v1=1:5,v2=10:14,v3=7:11,stringsAsFactors=FALSE)
#works fine
df %>% gather(Measure,Qty,v1:v3)
dyn_1 <- 'Measure'
dyn_2 <- 'Qty'
dyn_err <- 'v1:v3'
dyn_err_1 <- 'v1'
dyn_err_2 <- 'v2'
#error
df %>% gather_(dyn_1,dyn_2,dyn_err)
#error
df %>% gather_(dyn_1,dyn_2,dyn_err_1:dyn_err_2)
After some debugging I realized the error happens in the melt measure.vars part, but I don't know how to get it to work with the ':' there...
Please help with a solution and explain a little bit so I can learn more.
You are telling gather_ to look for a single column named 'v1:v3' rather than the separate column names. Simply change dyn_err <- "v1:v3" to dyn_err <- paste("v", seq(3), sep="").
If your df has different column names (e.g. var_a, qtr_b, stg_c), you can either extract those column names or use the paste function for whichever variables are of interest.
dyn_err <- colnames(df)[2:4]
or
dyn_err <- paste(c("var", "qtr", "stg"), letters[1:3], sep="_")
You need to look at what column names you want and make the corresponding vector.
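The underscore verbs such as gather_() are deprecated in current tidyr, so here is a sketch of the same idea with pivot_longer(), a different function from the one in the question. It reuses the question's df, dyn_1 and dyn_2; dyn_cols is a name introduced here for illustration.

```r
library(tidyr)

df <- data.frame(name = letters[1:5], v1 = 1:5, v2 = 10:14, v3 = 7:11,
                 stringsAsFactors = FALSE)

dyn_1 <- "Measure"
dyn_2 <- "Qty"
dyn_cols <- paste("v", seq(3), sep = "")  # c("v1", "v2", "v3")

# all_of() takes a character vector of column names;
# names_to and values_to accept plain strings.
long <- df %>%
  pivot_longer(cols = all_of(dyn_cols), names_to = dyn_1, values_to = dyn_2)
```

As with gather_(), the columns are passed as a vector of individual names, not as a single "v1:v3" string.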
