Using a custom function with tidyverse - r

I created a dummy function to get the lag of one variable and I want to use it with other tidyverse functions. It works after I call mutate but not after calling group_by. It throws the following error:
Error in mutate_impl(.data, dots) :
Not compatible with STRSXP: [type=NULL].
Here is a repex:
#create a function to lag a selected variable
lag_func <- function(df, x) {
mutate(df, lag = lag(df[,x]))
}
#works
iris %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func('Petal.Length')
#doesn't work
iris %>%
group_by(Species) %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func('Petal.Length')
Any idea what the error means and/or how to fix it?

The best way to pass a column name as an argument to a tidyverse function is convert it to quosure using enquo(). See this code:
lag_func <- function(df, x) {
x <- enquo(x)
mutate(df, lag = lag(!!x)) # !! is to evaluate rather than quoting (x)
}
Now let's try our function:
iris %>%
group_by(Species) %>%
mutate(lead = lead(Petal.Length)) %>%
lag_func(Petal.Length)
# A tibble: 150 x 7
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species lead lag
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 1.4 NA
2 4.9 3 1.4 0.2 setosa 1.3 1.4
3 4.7 3.2 1.3 0.2 setosa 1.5 1.4
4 4.6 3.1 1.5 0.2 setosa 1.4 1.3
5 5 3.6 1.4 0.2 setosa 1.7 1.5
6 5.4 3.9 1.7 0.4 setosa 1.4 1.4
7 4.6 3.4 1.4 0.3 setosa 1.5 1.7
8 5 3.4 1.5 0.2 setosa 1.4 1.4
9 4.4 2.9 1.4 0.2 setosa 1.5 1.5
10 4.9 3.1 1.5 0.1 setosa 1.5 1.4
# ... with 140 more rows
For more info on how to use tidyverse functions within your custom functions see here

Related

R dataframe - Top n values in row with column names

I want to sort rowwise values in specific columns, get top 'n' values, and get corresponding column names in new columns.
The output would look something like this:
SL SW PL PW Species high1 high2 high3 col1 col2 col3
dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 setosa 3.5 1.4 0.2 SW PL PW
2 4.9 3 1.4 0.2 setosa 3 1.4 0.2 SW PL PW
3 4.7 3.2 1.3 0.2 setosa 3.2 1.3 0.2 SW PL PW
Tried something like code below, but unable to get column names.
What I'm hoping to achieve is to compare the highest 'n' values (rows[n]) with values in dataframe for each row, and then extract corresponding column name of matching value. For eg. rows[1] == 3.5 (from column 'SW'). Is this feasible?
Help appreciated.
iris %>%
rowwise() %>%
mutate(rows = list(sort(c( Sepal.Width, Petal.Length, Petal.Width), decreasing = TRUE))) %>%
mutate(high1 = rows[1], col1 = names(~.)[which(~.[] ==rows[1]),
high2 = rows[2], col2 = names(~.)[which(~.[] ==rows[2]),
high3 = rows[3], col3 = names(~.)[which(~.[] ==rows[3])
) %>%
select(-rows)
You could pivot to long, group by the corresponding original row, use slice_max to get the top values, then pivot back to wide and bind that output to the original table.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
iris %>%
group_by(rn = row_number()) %>%
pivot_longer(-c(Species, rn), 'col', values_to = 'high') %>%
slice_max(col, n = 2) %>%
mutate(nm = row_number()) %>%
pivot_wider(values_from = c(high, col),
names_from = nm) %>%
ungroup() %>%
select(-c(Species, rn)) %>%
bind_cols(iris)
#> # A tibble: 150 × 9
#> high_1 high_2 col_1 col_2 Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 5.1 3.5 Sepal.… Sepa… 5.1 3.5 1.4 0.2
#> 2 4.9 3 Sepal.… Sepa… 4.9 3 1.4 0.2
#> 3 4.7 3.2 Sepal.… Sepa… 4.7 3.2 1.3 0.2
#> 4 4.6 3.1 Sepal.… Sepa… 4.6 3.1 1.5 0.2
#> 5 5 3.6 Sepal.… Sepa… 5 3.6 1.4 0.2
#> 6 5.4 3.9 Sepal.… Sepa… 5.4 3.9 1.7 0.4
#> 7 4.6 3.4 Sepal.… Sepa… 4.6 3.4 1.4 0.3
#> 8 5 3.4 Sepal.… Sepa… 5 3.4 1.5 0.2
#> 9 4.4 2.9 Sepal.… Sepa… 4.4 2.9 1.4 0.2
#> 10 4.9 3.1 Sepal.… Sepa… 4.9 3.1 1.5 0.1
#> # … with 140 more rows, and 1 more variable: Species <fct>
Created on 2022-02-16 by the reprex package (v2.0.1)
Edited to remove the unnecessary rename and mutate, thanks to tip from #Onyambu!
My approach is to make a function that takes any dataframe (df), any set of columns that you want to focus on (cols), and any value for top n (n)
# load data.table and magrittr (I only use %>% for illustration here)
library(data.table)
library(magrittr)
# define function
get_high_vals_cols <- function(df, cols, n=3) {
setDT(df)[, `_rowid`:=.I]
df_l <- melt(df,id = "_rowid",measure.vars = cols, variable.name = "col",value.name = "high") %>%
.[order(-high), .SD[1:n], by="_rowid"] %>%
.[,id:=1:.N, by="_rowid"]
dcast(df_l, `_rowid`~id, value.var = list("col", "high"))[,`_rowid`:=NULL]
}
Then, you can feed any dataframe to this function, along with any columns of interest
cols= c("Sepal.Width", "Petal.Length", "Petal.Width")
get_high_vals_cols(iris,cols,3)
Output
col_1 col_2 col_3 high_1 high_2 high_3
1: Sepal.Width Petal.Length Petal.Width 3.5 1.4 0.2
2: Sepal.Width Petal.Length Petal.Width 3.0 1.4 0.2
3: Sepal.Width Petal.Length Petal.Width 3.2 1.3 0.2
4: Sepal.Width Petal.Length Petal.Width 3.1 1.5 0.2
5: Sepal.Width Petal.Length Petal.Width 3.6 1.4 0.2
---
146: Petal.Length Sepal.Width Petal.Width 5.2 3.0 2.3
147: Petal.Length Sepal.Width Petal.Width 5.0 2.5 1.9
148: Petal.Length Sepal.Width Petal.Width 5.2 3.0 2.0
149: Petal.Length Sepal.Width Petal.Width 5.4 3.4 2.3
150: Petal.Length Sepal.Width Petal.Width 5.1 3.0 1.8

select first occurrence of variable with prefix in dataframe

What is the best way to dplyr::select the first occurrence of a variable with a certain prefix (and all other variables without that prefix). Or put another way, drop all variables with that prefix except the first occurrence.
library(tidyverse)
hiris <- head(iris)
#given this data.frame:
lst(hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
# rowname Sepal.Length.x Sepal.Width.x Petal.Length.x Petal.Width.x Species.x Sepal.Length.y Sepal.Width.y Petal.Length.y Petal.Width.y Species.y Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 1 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa 5.1 3.5 1.4 0.2 setosa
# 2 2 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 setosa 4.9 3.0 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 setosa 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2 setosa 4.6 3.1 1.5 0.2 setosa
# 5 5 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 setosa 5.0 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa 5.4 3.9 1.7 0.4 setosa 5.4 3.9 1.7 0.4 setosa
Now lets say I want to drop all variables with prefix Sepal.Length except the first one (Sepal.Length.x) I could do:
lst(hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname") %>%
dplyr::select(-Sepal.Length.y, -Sepal.Length)
which works fine but I want something flexible so it will work with an arbitrary number of variables with prefix Sepal.Length e.g.:
lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
I could do something like this:
df <- lst(hiris, hiris, hiris, hiris, hiris, hiris, hiris) %>%
map(rownames_to_column) %>%
reduce(full_join, by = "rowname")
name_drop <- (df %>% select(matches("Sepal.Length")) %>% names())[-1]
df %>%
select(-name_drop)
but im looking to do it in a pipe and more efficiently. any suggestions?
thanks
I like this explanation of the problem:
drop all variables with that prefix except the first occurrence.
select(iris, !starts_with("Sepal")[-1])
# Sepal.Length Petal.Length Petal.Width Species
# 1 5.1 1.4 0.2 setosa
# 2 4.9 1.4 0.2 setosa
# ...
starts_with("Sepal") of course returns all columns that start with "Sepal", we can use [-1] to remove the first match, and ! to drop any remaining matches.
It does seem a little like black magic - if we were doing this in base R, the [-1] would be appropriate if we used which() to get column indices, and the ! would be appropriate if we didn't use which() and had a logical vector, but somehow the tidyselect functionality makes it work!

Creating a data frame based on a simple VLOOKUP in R?

df <- iris
x <- data.frame(Petal.Length=c('1.7', '1.9', '3.5'))
The new data frame (dfnew) needs all 5 columns from "iris" extracted, for all the rows with the petal lengths specified in "x".
I've tried it this way, but it doesn't seem to work:
dfnew <- df$Petal.Length[x]
Using dplyr:
> library(dplyr)
> data(iris)
> (dfnew <- iris %>% filter(Petal.Length %in% c('1.7', '1.9', '3.5')) )
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.4 3.9 1.7 0.4 setosa
2 5.7 3.8 1.7 0.3 setosa
3 5.4 3.4 1.7 0.2 setosa
4 5.1 3.3 1.7 0.5 setosa
5 4.8 3.4 1.9 0.2 setosa
6 5.1 3.8 1.9 0.4 setosa
7 5.0 2.0 3.5 1.0 versicolor
8 5.7 2.6 3.5 1.0 versicolor
It's worth noting that this is what you are technically asking for with "VLOOKUP", but the comment from phiver might actually be what you want.
df <- iris
x <- data.frame(Petal.Length=c('1.7', '1.9', '3.5'), X = c('X','Y','Z'))
df.new <- merge(df, x, by = 'Petal.Length')

sparklyr change all column names spark dataframe

I intended to change all column names. The current rename or select operation is too labouring. I dont know if anybody has a better solution. Examples as belwo:
df <- data.frame(oldname1 = LETTERS, oldname2 = 1,...oldname200 = "APPLE")
df_tbl <- copy_to(sc,df,"df")
newnamelist <- paste("Name", 1:200, sep ="_")
How do I assign newnamelist as the new colnames? I probably cant do this:
df_new <- df_tbl %>% dplyr::select(Name_1 = oldname1, Name_2 = oldname2,....)
You can use select_ with .dots:
df <- copy_to(sc, iris)
newnames <- paste("Name", 1:5, sep="_")
df %>% select_(.dots=setNames(colnames(df), newnames))
# Source: lazy query [?? x 5]
# Database: spark_connection
Name_1 Name_2 Name_3 Name_4 Name_5
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
You can also select with !!!:
library(rlang)
library(purrr)
df %>% select(!!! setNames(map(colnames(df), parse_quosure), newnames))
# Source: lazy query [?? x 5]
# Database: spark_connection
Name_1 Name_2 Name_3 Name_4 Name_5
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with more rows
The solutions listed above did not work for me. I did find a straight forward solution documented in github which works with sparklyr.
rename() doesn't support unquoting of character vectors #3030
Below is an excerpt of my script expanding on the method described in the link above.
library(dplyr)
library(stringr)
# Generate list of column names without special characters (replace spaces and dashes with underscores)
list_new_names = colnames(spark_df) %>% str_remove_all('LAST ') %>% str_replace_all(' - ', '_') %>% str_replace_all(' ', '_')
# Generate list used to rename columns
list_new_names = colnames(spark_df) %>% setNames(list_new_names)
# Rename columns
spark_df = spark_df %>% rename(!!! list_new_names)
You can do this too, This worked fine for me.
df <- copy_to(sc, iris)
newnames <- paste("Name", 1:5, sep="_")
colnames(df) <- newnames

tidyr::spread() function throws an error

I try to use gather and spread functions in tidyverse package, but it throws an error in spread function
library(caret)
dataset<-iris
# gather function is to convert wide data to long data
dataset_gather<-dataset %>% tidyr::gather(key=Type,value = Values,1:4)
head(dataset_gather)
# spead is the opposite of gather
This code below throws an error like this Error: Duplicate identifiers for rows
dataset_spead<- dataset_gather%>%tidyr::spread(key = Type,value = Values)
Added later: Sorry #alistaire, only saw your comment on the original post after posting this response.
As far as I understand Error: Duplicate identifiers for rows..., it occurs when you have values with the same identifier. For example in the original 'iris' dataset, the first five rows of Species = setosa all have a Petal.Width of 0.2, and three rows of Petal.Length have values of 1.4. Gathering those data isn't an issue, but when you try spread them, the function doesn't know what belongs to what. That is, which 0.2 Petal.Width and 1.4 Petal.Length belongs to which row of setosa.
The (tidyverse) solution I use in those circumstances is to create a unique marker for each row of data at the gather stage so that the function can keep track which duplicate data belong to which rows when you want to spread again. See example below:
# Load packages
library(dplyr)
library(tidyr)
# Get data
dataset <- iris
# View dataset
head(dataset)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# Gather data
dataset_gathered <- dataset %>%
# Create a unique identifier for each row
mutate(marker = row_number(Species)) %>%
# Gather the data
gather(key = Type, value = Values, 1:4)
# View gathered data
head(dataset_gathered)
#> Species marker Type Values
#> 1 setosa 1 Sepal.Length 5.1
#> 2 setosa 2 Sepal.Length 4.9
#> 3 setosa 3 Sepal.Length 4.7
#> 4 setosa 4 Sepal.Length 4.6
#> 5 setosa 5 Sepal.Length 5.0
#> 6 setosa 6 Sepal.Length 5.4
# Spread it out again
dataset_spread <- dataset_gathered %>%
# Group the data by the marker
group_by(marker) %>%
# Spread it out again
spread(key = Type, value = Values) %>%
# Not essential, but remove marker
ungroup() %>%
select(-marker)
# View spread data
head(dataset_spread)
#> # A tibble: 6 x 5
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#> <fctr> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 1.4 0.2 5.1 3.5
#> 2 setosa 1.4 0.2 4.9 3.0
#> 3 setosa 1.3 0.2 4.7 3.2
#> 4 setosa 1.5 0.2 4.6 3.1
#> 5 setosa 1.4 0.2 5.0 3.6
#> 6 setosa 1.7 0.4 5.4 3.9
(and as ever, thanks to Jenny Bryan for the reprex package)
We can do this with data.table
library(data.table)
dcast(melt(setDT(dataset, keep.rownames = TRUE), id.var = c("rn", "Species")), rn + Species ~ variable)

Resources