Issue with SparkR regexp_extract function - r

Data
I work with a large dataset (280 million rows), for which Spark and R seem to work nicely.
Problem
I've had problems with SparkR's regexp_extract function. I expected it to work analogously to stringr's str_detect, but I haven't managed to get it to work. The documentation for regexp_extract is limited. Could you please give me a hand?
Reprex
Here is a reprex where I try to identify strings that do not have a space and paste " 00:01" as a suffix.
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)
# Create data
df <- data.frame(sampletaken = c("06/03/2013", "29/11/2005 8:30", "06/03/2013", "15/01/2007 12:25", "06/03/2013", "15/01/2007 12:25"))
# Create Spark connection
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)
# Modify data
df1 <- df %>%
dplyr::mutate(sampletaken = ifelse(regexp_extract(sampletaken, " "), sampletaken, paste(sampletaken, "00:01")))
# Collect data as dataframe
df1 <- df1 %>% as.data.frame()
head(df1$sampletaken)
Error
error: org.apache.spark.sql.AnalysisException: cannot resolve '(NOT regexp_extract(df.sampletaken, ' ', 1))' due to data type mismatch: argument 1 requires boolean type, however, 'regexp_extract(df.sampletaken, ' ', 1)' is of string type.; line 1 pos 80;
Solution
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)
# Create data
df <- data.frame(sampletaken = c("06/03/2013", "29/11/2005 8:30", "06/03/2013", "15/01/2007 12:25", "06/03/2013", "15/01/2007 12:25"))
# Create Spark connection
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)
# Modify data
df1 <- df %>%
dplyr::mutate(sampletaken1 = ifelse(rlike(sampletaken, " "), sampletaken, paste(sampletaken, "00:01")))
# Collect data as dataframe
df1 <- df1 %>% as.data.frame()
head(df1$sampletaken1)

Probably rlike is what you're after if you're looking for the analog to str_detect, see the SQL API docs:
str rlike regexp - Returns true if str matches regexp, or false otherwise.
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
true
On a Column (i.e., in R, rather than in Spark SQL through sql()), it would look like this:
rlike(Column, 'regex.*pattern')
# i.e., in magrittr form
Column %>% rlike('regex.*pattern')
Note that like is usually more efficient if you can use it since the set of valid like patterns is much smaller.

I'm not familiar with SparkR, but it seems that the function regexp_extract returns a string (presumably the matched pattern in the string) instead of a boolean, as required by the function ifelse.
You may try to match the returned value against the empty string.
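A minimal sketch of that idea in plain R, with no Spark connection needed. The helper extract_or_empty is hypothetical and only mimics the documented behaviour of Spark's regexp_extract, which returns an empty string when the pattern does not match. Inside the Spark pipeline the equivalent condition would presumably be regexp_extract(sampletaken, " ", 0) != "", where the group index 0 (the whole match) is an assumption on my part.

```r
# Two of the sample values from the question
sampletaken <- c("06/03/2013", "29/11/2005 8:30")

# Hypothetical local stand-in for regexp_extract(str, pattern, 0):
# returns the matched text, or "" when the pattern does not match
extract_or_empty <- function(x, pattern) {
  m <- regexpr(pattern, x)
  out <- rep("", length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# Comparing against "" turns the string result into the boolean ifelse() needs
has_space <- extract_or_empty(sampletaken, " ") != ""
result <- ifelse(has_space, sampletaken, paste(sampletaken, "00:01"))
result
```

The comparison is what fixes the type mismatch: ifelse receives TRUE/FALSE instead of the extracted string.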

Related

unexpected = error when using mutate function

I want to replace an existing value in one column of my dataset with a new value: T with 0.0. The column type is char, so I put the values in quotes. The error points at the =. How can I resolve this problem? (I am required to do the replacement in R with the mutate function.)
# Install tidymodels if you haven't done so
install.packages("rlang")
install.packages("tidymodels")
install.packages("dplyr")
# Library for modeling
library(tidymodels)
# Load tidyverse
library(tidyverse)
library(dplyr)
URL <- 'https://dax-cdn.cdn.appdomain.cloud/dax-noaa-weather-data-jfk-airport/1.1.4/noaa-weather-sample-data.tar.gz'
download.file (URL, destfile='noaa-weather-sample-data.tar.gz')
untar('noaa-weather-sample-data.tar.gz',tar = 'internal')
dataset<- read.csv ('noaa-weather-sample-data/jfk_weather_sample.csv')
head(dataset)
glimpse(dataset)
subset_data <- data.frame(dataset$HOURLYRelativeHumidity,dataset$HOURLYDRYBULBTEMPF,dataset$HOURLYStationPressure,dataset$HOURLYWindSpeed,dataset$HOURLYPrecip)
subset_data<-setNames(subset_data,c('HOURLYRelativeHumidity','HOURLYDRYBULBTEMPF','HOURLYStationPressure','HOURLYWindSpeed', 'HOURLYPrecip'))
head(subset_data,10)
unique(subset_data$HOURLYPrecip)
data_new <- subset_data %>% # Replacing values
mutate(subset_data$HOURLYPrecip = replace(subset_data$HOURLYPrecip, subset_data$HOURLYPrecip == 'T', '99'))
View(data_new)
and I got this error:
Error: unexpected '=' in:
"data_new <- subset_data %>% # Replacing values
mutate(subset_data$HOURLYPrecip ="
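A likely cause, sketched on toy data: inside mutate(), the left-hand side of = has to be a bare column name, so subset_data$HOURLYPrecip = ... does not parse. The data frame below is a made-up stand-in for the weather data, and the replacement value "0.0" follows the question text rather than the '99' in the posted code.

```r
library(dplyr)

# Toy stand-in for the weather subset (values are made up for illustration)
subset_data <- data.frame(
  HOURLYPrecip = c("0.00", "T", "0.02", "T"),
  stringsAsFactors = FALSE
)

# Name the column directly; no `subset_data$` prefix is needed inside mutate()
data_new <- subset_data %>%
  mutate(HOURLYPrecip = replace(HOURLYPrecip, HOURLYPrecip == "T", "0.0"))

data_new$HOURLYPrecip
```

Because the pipe already passes subset_data as the first argument, column names inside mutate() are looked up in that data frame.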

How to pipe in dplyr

I am trying to use the pipe function in dplyr and left_join to clean some meta data up. Setting up variables....
library(openxlsx)
library(tidyverse)
mdat <- read.xlsx("https://journals.plos.org/plospathogens/article/file?type=supplementary&id=info:doi/10.1371/journal.ppat.1005511.s011",
startRow = 3, fillMergedCells = TRUE) %>%
mutate(sample=Accession.Number)
dge$samples$sample=
[1] "SRR1346026" "SRR1346027" "SRR1346028" "SRR1346029" "SRR1346030" "SRR1346031" "SRR1346032" "SRR1346033" "SRR1346034"
[10] "SRR1346035" "SRR1346036" "SRR1346037" "SRR1346038" "SRR1346039" "SRR1346040" "SRR1346041" "SRR1346042" "SRR1346043"
[19] "SRR1346044" "SRR1346045" "SRR1346046" "SRR1346047" "SRR1346049" "SRR1346048" "SRR1346050" "SRR1346051" "SRR1346052"
I am trying to pipe in the dge$samples$sample, which is a character class. It needs to become a data frame of one column named sample so I can merge mdat with it by left join in order to remove all the metadata I don't have a sample for. If you run dim(mdat) you will find it is 35 by 15, I want to reduce it to the 19 samples I actually have data for, these are given in the dge$samples$sample list. I am trying to use the following code to first convert dge$samples$sample into a data frame with one column titled sample for joining the two and essentially removing all metadata that is not of interest to me. The code below has been my progress so far but I think I am failing to understand how pipe works.
test = data.frame(dge$samples$sample) %>%
colnames(.) = c("sample") %>%
left_join(
.,
mdat,
by = sample,
copy = FALSE,
suffix = c(".x", ".y"),
keep = FALSE,
na_matches = c("na", "never")
)
Why not just check whether they're in there and filter them:
mdat %>% filter( sample %in% dge$samples$sample )
It's easier to understand and control than a join, and performance shouldn't be an issue.
I think your code can be reduced to
library(dplyr)
test <- data.frame(sample = dge$samples$sample) %>%
left_join(mdat, by = 'sample')
Or an inner join should work as well, using base R :
test <- merge(data.frame(sample = dge$samples$sample), mdat, by = 'sample')
Using collapse
library(collapse)
sbt(mdat, sample %in% dge$samples$sample)
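To see how these options differ, here is a small sketch on made-up data. The objects mdat and have_data are hypothetical stand-ins for the real metadata and for dge$samples$sample. Note that by takes a quoted column name ("sample"), unlike the bare sample in the question's attempt.

```r
library(dplyr)

# Hypothetical stand-ins for mdat and dge$samples$sample
mdat <- data.frame(
  sample = c("S1", "S2", "S3", "S4"),
  tissue = c("liver", "lung", "liver", "spleen"),
  stringsAsFactors = FALSE
)
have_data <- c("S2", "S4", "S9")  # S9 has no metadata row

# Filtering keeps only the metadata rows whose sample is in the vector
by_filter <- mdat %>% filter(sample %in% have_data)

# A left join from the sample vector keeps every sample and fills NA for S9
by_left <- data.frame(sample = have_data, stringsAsFactors = FALSE) %>%
  left_join(mdat, by = "sample")

# An inner join drops S9, so it matches the filter result
by_inner <- data.frame(sample = have_data, stringsAsFactors = FALSE) %>%
  inner_join(mdat, by = "sample")
```

If every sample is guaranteed to have a metadata row, all three approaches return the same rows; they only diverge when a sample is missing from mdat.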

Error in stream-delim when exporting to CSV

I'm trying to write this StatsBomb Data into a CSV but I keep on getting the following error message:
Error in stream_delim_(df, path, ..., bom = bom, quote_escape = quote_escape) :
Don't know how to handle vector of type list.
I'm lost (tried multiple things) and not sure what I did wrong here. Is there anyone out here who knows how to solve this? I've included my code below.
library(StatsBombR)
library(tidyverse)
### Read in all free events and matches from the FAWSL
data <- StatsBombFreeEvents()
matches <- FreeMatches(Competitions = 72)
### Clean and separate all data loaded above
dataclean <- allclean(data)
### Filter event data to include only FAWSL data.
data1 <- dataclean %>%
filter(dataclean$competition_id == 72)
### Join event and match data by "match_id"
data1 <- left_join(data1, matches, by = "match_id")
FullData <- data1 %>%
select(-c(related_events, tactics.lineup, shot.freeze_frame, location, pass.end_location, shot.end_location, goalkeeper.end_location))
setwd()
write_csv(FullData, "StatsBomb_FullData.csv")
I had the same problem. Unlisting the column fixed mine.
df$listcolumn <- sapply(df$listcolumn, function(x) paste0(unlist(x), collapse = "\n"))
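When it is not obvious which columns are lists, the same idea can be applied to all of them at once. This is a sketch on a made-up data frame using base R only; the column names and the ";" separator are illustrative choices, not taken from the StatsBomb data.

```r
# Toy data frame with one list column (made-up values)
df <- data.frame(id = 1:2, stringsAsFactors = FALSE)
df$location <- list(c(10.5, 20), c(3, 4.25))

# Find every list column, then collapse each cell to a single string
is_list_col <- vapply(df, is.list, logical(1))
df[is_list_col] <- lapply(df[is_list_col], function(col) {
  vapply(col, function(x) paste(unlist(x), collapse = ";"), character(1))
})

# With only atomic columns left, the data frame can be written out
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```

Running names(df)[is_list_col] before the conversion is a quick way to see which columns would otherwise trip up the CSV writer.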

Concat_ws() function in Sparklyr is missing

I am following a tutorial on web (Adobe) analytics, where I want to build a Markov Chain Model. (http://datafeedtoolbox.com/attribution-theory-the-two-best-models-for-algorithmic-marketing-attribution-implemented-in-apache-spark-and-r/).
In the example they are using the function:
concat_ws (from library(sparklyr)). But it looks like the function does not exist (after installing the package and loading the library, I receive an error that the function does not exist...).
Comment from the author of the blog: concat_ws is a Spark SQL function:
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/functions.html
So, you’ll have to rely on sparklyr to have that function work.
My question: are there workarounds to get access to the concat_ws() function? I tried:
Searching on GitHub (https://github.com/rstudio/sparklyr) to see if I could find the function (or the source code), unfortunately with no result.
What is the goal of the function?
Concatenates multiple input string columns together into a single string column, using the given separator.
You can simply use paste from base R.
library(sparklyr)
library(dplyr)
config <- spark_config()
sc <- spark_connect(master = "local", config = config)
df <- as.data.frame(cbind(c("1", "2", "3"), c("a", "b", "c")))
sdf <- sdf_copy_to(sc, df, overwrite = T)
sdf %>%
mutate(concat = paste(V1, V2, sep = "-"))
You cannot find the function because it doesn't exist in the sparklyr package. concat_ws is a Spark SQL function (org.apache.spark.sql.functions.concat_ws).
sparklyr depends on a SQL translation layer - function calls are translated into SQL expressions with dbplyr:
> dbplyr::translate_sql(concat_ws("-", foo, bar))
<SQL> CONCAT_WS('-', "foo", "bar")
This means that the function can be applied only in the sparklyr context:
sc <- spark_connect(master = "local[*]")
df <- copy_to(sc, tibble(x="foo", y="bar"))
df %>% mutate(xy = concat_ws("-", x, y))
# # Source: spark<?> [?? x 3]
# x y xy
# * <chr> <chr> <chr>
# 1 foo bar foo-bar
I had a similar problem with dbplyr (BigQuery database).
Problem
I kept getting the error:
my_dbplyr_object %>%
mutate(datetime_char = paste(date_char, time_char))
# failed x Function not found: CONCAT_WS at [1:147] [invalidQuery]
Solution
I wrote custom SQL and placed it inside sql().
Example
Once you know the SQL that will generate what you're after (in my case it was CONCAT(date_char, ' ', time_char)), then simply place it inside the sql() function, like so:
my_dbplyr_object %>%
mutate(datetime_char = sql("CONCAT(date_char, ' ', time_char)"))

Can dplyr modify multiple columns of spark DF using a vector?

I'm new to working with Spark. I would like to multiply a large number of columns of a Spark dataframe by values in a vector. So far with mtcars I used a for loop and mutate_at as follows:
library(dplyr)
library(rlang)
library(sparklyr)
sc1 <- spark_connect(master = "local")
mtcars_sp = sdf_copy_to(sc1, mtcars, overwrite = TRUE)
mtcars_cols = colnames(mtcars_sp)
mtc_factors = 0:10 / 10
# mutate 1 col at a time
for (i in 1:length(mtcars_cols)) {
# set equation and print - use sym() convert a string
mtcars_eq = quo( UQ(sym(mtcars_cols[i])) * mtc_factors[i])
# mutate formula - LHS resolves to a string, RHS a quosure
mtcars_sp = mtcars_sp %>%
mutate(!!mtcars_cols[i] := !!mtcars_eq )
}
dbplyr::sql_render(mtcars_sp)
mtcars_sp
This works ok with mtcars. However, it results in nested SQL queries being sent to spark, as shown by the sql_render, and breaks down with many columns. Can dplyr be used to instead send a single SQL query in this case?
BTW, I'd rather not transpose the data as it would be too expensive. Any help would be much appreciated!
In general, you can use the great answer by Artem Sokolov:
library(glue)
mtcars_sp %>%
mutate(!!! setNames(glue("{mtcars_cols} * {mtc_factors}"), mtcars_cols) %>%
lapply(parse_quosure))
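The same splicing pattern can be tried locally before pointing it at Spark. The sketch below runs on the in-memory mtcars data frame and swaps in rlang::parse_expr, since parse_quosure was deprecated in later rlang releases; whether the resulting SQL on a Spark table is a single flat query is something to confirm with dbplyr::sql_render, not something this local example shows.

```r
library(dplyr)
library(rlang)

cols <- colnames(mtcars)
factors <- 0:10 / 10

# Build one expression per column, e.g. mpg * 0, cyl * 0.1, ...
exprs <- lapply(paste(cols, "*", factors), parse_expr)
names(exprs) <- cols

# Splice all expressions into a single mutate() call
scaled <- mtcars %>% mutate(!!!exprs)
```

Each expression refers only to its own column, so the order in which mutate() evaluates them does not change the result.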
However if this is input for MLlib algorithms then ft_vector_assembler combined with ft_elementwise_product might be a better fit:
scaled <- mtcars_sp %>%
ft_vector_assembler(mtcars_cols, "features") %>%
ft_elementwise_product("features", "features_scaled", mtc_factors)
The result can be separated (I wouldn't recommend that if you're going with MLlib) into individual columns with sdf_separate_column:
scaled %>%
select(features_scaled) %>%
sdf_separate_column("features_scaled", mtcars_cols)

Resources