sparklyr: change all column names of a Spark dataframe in R

I want to change all of the column names. Renaming them one at a time with rename or select is too laborious. Does anybody have a better solution? Example as below:
df <- data.frame(oldname1 = LETTERS, oldname2 = 1,...oldname200 = "APPLE")
df_tbl <- copy_to(sc,df,"df")
newnamelist <- paste("Name", 1:200, sep ="_")
How do I assign newnamelist as the new column names? I probably can't do this:
df_new <- df_tbl %>% dplyr::select(Name_1 = oldname1, Name_2 = oldname2,....)

You can use select_ with .dots:
df <- copy_to(sc, iris)
newnames <- paste("Name", 1:5, sep="_")
df %>% select_(.dots=setNames(colnames(df), newnames))
# Source: lazy query [?? x 5]
# Database: spark_connection
Name_1 Name_2 Name_3 Name_4 Name_5
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
You can also select with !!!:
library(rlang)
library(purrr)
df %>% select(!!! setNames(map(colnames(df), parse_quosure), newnames))
# Source: lazy query [?? x 5]
# Database: spark_connection
Name_1 Name_2 Name_3 Name_4 Name_5
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with more rows
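If the above triggers deprecation warnings on newer rlang (parse_quosure() has since been superseded), the same splicing idea can be written with rlang::syms() instead. A minimal sketch, assuming sc is the Spark connection from the question:
library(dplyr)
library(rlang)
df <- copy_to(sc, iris)
newnames <- paste("Name", 1:5, sep = "_")
# Turn the existing column names into symbols, name them with the new names,
# and splice into select(), i.e. select(Name_1 = old_col_1, Name_2 = old_col_2, ...)
df %>% select(!!! setNames(syms(colnames(df)), newnames))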

The solutions listed above did not work for me. I did find a straightforward solution, documented on GitHub, which works with sparklyr:
rename() doesn't support unquoting of character vectors #3030
Below is an excerpt of my script expanding on the method described in the link above.
library(dplyr)
library(stringr)
# Generate a list of cleaned column names (drop the 'LAST ' prefix, replace dashes and spaces with underscores)
list_new_names = colnames(spark_df) %>%
  str_remove_all('LAST ') %>%
  str_replace_all(' - ', '_') %>%
  str_replace_all(' ', '_')
# Name the old column names with the new ones (new_name = old_name), as rename() expects
list_new_names = colnames(spark_df) %>% setNames(list_new_names)
# Rename columns
spark_df = spark_df %>% rename(!!! list_new_names)
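To make the pattern concrete, here is a minimal local sketch with made-up column names ('LAST NAME' and 'DATE - START' are hypothetical); it builds the same new_name = old_name vector and splices it into rename() with !!!, assuming a dplyr version that accepts character column names this way (as in the GitHub issue above):
library(dplyr)
library(stringr)
# Hypothetical messy column names, purely for illustration
local_df <- data.frame(`LAST NAME` = "Smith", `DATE - START` = "2021-01-01", check.names = FALSE)
list_new_names = colnames(local_df) %>%
  str_remove_all('LAST ') %>%
  str_replace_all(' - ', '_') %>%
  str_replace_all(' ', '_')
# Named character vector in the form new_name = old_name
list_new_names = colnames(local_df) %>% setNames(list_new_names)
local_df %>% rename(!!! list_new_names)
# columns are now NAME and DATE_START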

You can also do this; it worked fine for me:
df <- copy_to(sc, iris)
newnames <- paste("Name", 1:5, sep="_")
colnames(df) <- newnames

Related

Select first occurrence of variable with prefix in dataframe

What is the best way to dplyr::select the first occurrence of a variable with a certain prefix (and all other variables without that prefix)? Or, put another way, drop all variables with that prefix except the first occurrence.
library(tidyverse)
hiris <- head(iris)
#given this data.frame:
lst(hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
# rowname Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
# 2 2 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2 setosa
# 5 5 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa 5.4 3.9 1.7 0.4 setosa 5.4 3.9 1.7 0.4 setosa
Now let's say I want to drop all variables with prefix Sepal.Length except the first one (Sepal.Length.x). I could do:
lst(hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname") %>%
dplyr::select(-Sepal.Length.y, -Sepal.Length)
which works fine, but I want something flexible that will work with an arbitrary number of variables with prefix Sepal.Length, e.g.:
lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
I could do something like this:
df <- lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
name_drop <- (df %>% select(matches("Sepal.Length")) %>% names())[-1]
df %>%
select(-name_drop)
but I'm looking to do it in a pipe and more efficiently. Any suggestions? Thanks.
I like this explanation of the problem:
drop all variables with that prefix except the first occurrence.
select(iris, !starts_with("Sepal")[-1])
# Sepal.Length Petal.Length Petal.Width Species
# 1 5.1 1.4 0.2 setosa
# 2 4.9 1.4 0.2 setosa
# ...
starts_with("Sepal") of course returns all columns that start with "Sepal", we can use [-1] to remove the first match, and ! to drop any remaining matches.
It does seem a little like black magic - if we were doing this in base R, the [-1] would be appropriate if we used which() to get column indices, and the ! would be appropriate if we didn't use which() and had a logical vector, but somehow the tidyselect functionality makes it work!
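Applied to the joined data frame from the question, the same trick drops every Sepal.Length column except the first, no matter how many joins there are. A sketch, assuming a tidyselect version that allows subsetting and negating selections as above:
df <- lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
  map(rownames_to_column) %>%
  reduce(full_join, by = "rowname")
df %>% select(!starts_with("Sepal.Length")[-1])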

Succinct subsetting across multiple columns in R

Say I have a massive data frame, and in multiple columns I have an extremely large set of unique codes that I want to use to select certain rows of the original data frame. There are around 1000 codes and the codes I want all follow one another. For example, I have about 30 columns that contain codes, and I only want to take rows that have codes 100 to 120 in ANY of these columns.
There's a long way to do this which is something like
new_dat <- df[which(df$codes==100 | df$codes==101 | df$codes1==100
and I repeat this for every single possible code for every one of the columns that can contain these codes. Is there a way to do this in a more convenient fashion?
I want to try solving this with dplyr's select function, but I'm having trouble seeing if it works for my case out of the box
Take the iris dataset. Say I wanted all rows that contain a value between 4.0 and 5.0 in any column whose name contains the word Sepal.
# this only checks for the exact value 4.0
brand_new_df <- select(filter(iris, Sepal.Length ==4.0 | Sepal.Width == 4.0))
but what I want is something like
brand_new_df <- select(filter(iris, contains(Sepal) == 4.0:5.0))
Is there a dplyr way to do this?
A corresponding across() version of @RonakShah's answer:
library(dplyr)
iris %>% filter(rowSums(across(contains('Sepal'), ~ between(., 4, 5))) > 0)
or
iris %>% filter(rowSums(across(contains('Sepal'), between, 4, 5)) > 0)
From vignette("colwise"):
Previously, filter() was paired with the all_vars() and any_vars() helpers. Now, across() is equivalent to all_vars(), and there’s no direct replacement for any_vars().
So you need something like rowSums(...) > 0 to achieve the effect of any_vars().
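Since that vignette was written, dplyr (1.0.4 and later) added if_any() and if_all(), which give a direct replacement for any_vars() without the rowSums() trick; a minimal sketch:
library(dplyr)
iris %>% filter(if_any(contains('Sepal'), ~ between(.x, 4, 5)))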
You can use filter_at():
library(dplyr)
iris %>% filter_at(vars(contains('Sepal')), any_vars(between(., 4, 5)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#3 4.6 3.1 1.5 0.2 setosa
#4 5.0 3.6 1.4 0.2 setosa
#5 4.6 3.4 1.4 0.3 setosa
#6 5.0 3.4 1.5 0.2 setosa
#7 4.4 2.9 1.4 0.2 setosa
#....
Base R:
# Subset:
cols <- grep("codes", names(df2), value = TRUE)
df2[rowSums(sapply(cols, function(x) {
  df2[, x] >= 100 & df2[, x] <= 120
})) == length(cols), ]
# Data:
tmp <- data.frame(x1 = rnorm(999, mean = 100, sd = 2))
df <- setNames(data.frame(tmp[rep(1, each = 80)]), paste0("codes", 1:80))
df2 <- cbind(id = 1:nrow(df), df)
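Note that == length(cols) keeps only rows where every code column is in range; since the question asks for rows where ANY of the code columns holds a value between 100 and 120, that reading is a small variation of the same idea (a sketch on the same df2):
df2[rowSums(sapply(cols, function(x) {
  df2[, x] >= 100 & df2[, x] <= 120
})) > 0, ]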
One option could be:
iris %>%
filter(Reduce(`|`, across(contains("Sepal"), ~ between(.x, 4, 5))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 4.6 3.4 1.4 0.3 1
6 5.0 3.4 1.5 0.2 1
7 4.4 2.9 1.4 0.2 1
8 4.9 3.1 1.5 0.1 1
9 4.8 3.4 1.6 0.2 1
10 4.8 3.0 1.4 0.1 1
library(dplyr)
df <- iris
# value to look for
val <- 4
# find columns
cols <- which(colSums(df == val , na.rm = TRUE) > 0L)
# filter rows
iris %>% filter_at(cols, any_vars(.==val))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.5 2.3 4.0 1.3 versicolor
3 6.0 2.2 4.0 1.0 versicolor
4 6.1 2.8 4.0 1.3 versicolor
5 5.5 2.5 4.0 1.3 versicolor
6 5.8 2.6 4.0 1.2 versicolor

How to change the column names and make a data frame of columns in the dataset

I am building a function to change the names of 3 columns and make a new data frame with those 3 columns. The data frame is noaaFilename; the original column names are Date, HrMn, and Slp, and the new names I want are Date, Time, and AtmosPressure.
names(noaaFilename)[names(noaaFilename) == "Date"] <- "Date"
names(noaaFilename)[names(noaaFilename) == "HrMn"] <- "Time"
names(noaaFilename)[names(noaaFilename) == "Slp"] <- "AtmosPressure"
noaaData <- subset(noaaFilename, select = c(Date, Time, AtmosPressure))
mysubset <- function(df, oldnames, newnames){
  if(length(oldnames) != length(newnames)){
    stop("oldnames and newnames are not the same length")
  }
  if(!all(oldnames %in% colnames(df))){
    stop("Not all of oldnames match column names of df")
  }
  df <- df[, oldnames, drop = FALSE]
  colnames(df) <- newnames
  return(df)
}
An example with the iris data set.
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
tmp <- mysubset(iris,
                oldnames = c("Sepal.Length", "Sepal.Width", "Species"),
                newnames = c("Date", "Time", "AtmosPressure"))
head(tmp)
# Date Time AtmosPressure
#1 5.1 3.5 setosa
#2 4.9 3.0 setosa
#3 4.7 3.2 setosa
#4 4.6 3.1 setosa
#5 5.0 3.6 setosa
#6 5.4 3.9 setosa
Writing the function like this means you are not limited to exactly 3 columns.
Alternatively, without writing a function:
noaaData <- subset(noaaFilename, select = c(Date, HrMn, Slp))
names(noaaData) <- c("Date", "Time", "AtmosPressure")

Using a custom function with tidyverse

I created a dummy function to get the lag of one variable and I want to use it with other tidyverse functions. It works after I call mutate, but not after calling group_by. It throws the following error:
Error in mutate_impl(.data, dots) :
Not compatible with STRSXP: [type=NULL].
Here is a reprex:
# create a function to lag a selected variable
lag_func <- function(df, x) {
  mutate(df, lag = lag(df[, x]))
}
#works
iris %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func('Petal.Length')
#doesn't work
iris %>%
group_by(Species) %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func('Petal.Length')
Any idea what the error means and/or how to fix it?
The best way to pass a column name as an argument to a tidyverse function is to convert it to a quosure using enquo(). See this code:
lag_func <- function(df, x) {
  x <- enquo(x)
  mutate(df, lag = lag(!!x)) # !! unquotes (evaluates) x rather than quoting it
}
Now let's try our function:
iris %>%
group_by(Species) %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func(Petal.Length)
# A tibble: 150 x 7
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species lead lag
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.4 NA
2 4.9 3 1.4 0.2 setosa 1.3 1.4
3 4.7 3.2 1.3 0.2 setosa 1.5 1.4
4 4.6 3.1 1.5 0.2 setosa 1.4 1.3
5 5 3.6 1.4 0.2 setosa 1.7 1.5
6 5.4 3.9 1.7 0.4 setosa 1.4 1.4
7 4.6 3.4 1.4 0.3 setosa 1.5 1.7
8 5 3.4 1.5 0.2 setosa 1.4 1.4
9 4.4 2.9 1.4 0.2 setosa 1.5 1.5
10 4.9 3.1 1.5 0.1 setosa 1.5 1.4
# ... with 140 more rows
For more info on how to use tidyverse functions within your custom functions, see here.
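On rlang 0.4.0 and later, the enquo()/!! pair can also be written with the {{ }} ('curly-curly') shorthand; a minimal sketch of the same function:
library(dplyr)
lag_func <- function(df, x) {
  mutate(df, lag = lag({{ x }}))
}
iris %>%
  group_by(Species) %>%
  mutate(lead = lead(Petal.Length)) %>%
  lag_func(Petal.Length)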

Creating a data frame based on a simple VLOOKUP in R?

df <- iris
x <- data.frame(Petal.Length=c('1.7', '1.9', '3.5'))
The new data frame (dfnew) needs all 5 columns from "iris" extracted, for all the rows with the petal lengths specified in "x".
I've tried it this way, but it doesn't seem to work:
dfnew <- df$Petal.Length[x]
Using dplyr:
library(dplyr)
data(iris)
(dfnew <- iris %>% filter(Petal.Length %in% c('1.7', '1.9', '3.5')))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.4 3.9 1.7 0.4 setosa
2 5.7 3.8 1.7 0.3 setosa
3 5.4 3.4 1.7 0.2 setosa
4 5.1 3.3 1.7 0.5 setosa
5 4.8 3.4 1.9 0.2 setosa
6 5.1 3.8 1.9 0.4 setosa
7 5.0 2.0 3.5 1.0 versicolor
8 5.7 2.6 3.5 1.0 versicolor
It's worth noting that this is what you are technically asking for with "VLOOKUP", but the comment from phiver might actually be what you want.
df <- iris
x <- data.frame(Petal.Length=c('1.7', '1.9', '3.5'), X = c('X','Y','Z'))
df.new <- merge(df, x, by = 'Petal.Length')
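A dplyr equivalent of that merge() is inner_join(); dplyr joins are stricter about key types than merge(), so convert the lookup column to numeric first. A sketch using the same df and x, with x_num as a hypothetical intermediate name:
library(dplyr)
x_num <- x %>% mutate(Petal.Length = as.numeric(as.character(Petal.Length)))
df.new <- inner_join(df, x_num, by = "Petal.Length")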
