How to use something similar to set (data.table) in sparklyr? - r

I have the following spark data frame called "clients":
ID VAR1 VAR2 VAR3
1 A F A1
2 C M B1
3 E F C1
4 C 0 B1
And I also have a dictionary of variables called "ref", which shows the unique values a variable can take, like this one:
Variables Values
VAR1 A
VAR1 B
VAR1 C
VAR1 D
VAR1 E
VAR2 F
VAR2 M
VAR3 A1
VAR3 B1
I want to replace those values that are not allowed for variables, with a default value like "-1", so I can have something like this:
ID VAR1 VAR2 VAR3
1 A F A1
2 C M B1
3 E F -1
4 C -1 B1
With data table library I would use something like this:
names <- colnames(clients)
for(var in names){
set(x = clients,
i = which(!clients[[var]] %in% ref[Variables == var]$Values),
j = var,
value = "-1")
}
I would like to know if there is a similar solution that works with sparklyr.

I am not sure which functions are available for spark hence here is a base R option.
clients[-1] <- lapply(names(clients)[-1], function(x) {
replace(clients[[x]],
!clients[[x]] %in% ref$Values[ref$Variables == x], NA)
})
clients
# ID VAR1 VAR2 VAR3
#1 1 A F A1
#2 2 C M B1
#3 3 E F <NA>
#4 4 C <NA> B1
I have replaced the values which are not allowed with NA instead of -1, you can use whichever value you want.

One approach, similar to your current code, is to collect the list of ref values for each VAR and apply case_when. If the value exists in the collected list, return the value of VAR; otherwise return "-1".
Let's see an example. I suggest always providing code that generates the test dataset:
clients <- data.frame(
ID=c(1, 2, 3, 4),
VAR1=c("A", "C", "E", "C"),
VAR2=c("F", "M", "F", "0"),
VAR3=c("A1", "B1", "C1", "B1"),
stringsAsFactors = FALSE
)
ref <- data.frame(
Variables = c("VAR1", "VAR1", "VAR1", "VAR1", "VAR1", "VAR2", "VAR2", "VAR3", "VAR3"),
Values=c("A", "B", "C", "D", "E", "F", "M", "A1", "B1"),
stringsAsFactors = FALSE
)
Start the session
library(sparklyr)
library(tidyverse)
sc <- spark_connect(master="local[4]", version = "3.0.1")
sref <- copy_to(sc, ref)
sclients <- copy_to(sc, clients, overwrite = TRUE)
for(var in c("VAR1", "VAR2", "VAR3")) {
var_values <- (sref %>%
filter(Variables == var) %>%
select(Values) %>%
collect())$Values
sclients <- sclients %>%
mutate(!!as.name(var) := case_when(!!as.name(var) %in% var_values ~ !!as.name(var),
TRUE ~ "-1"))
}
sclients
Result
# Source: spark<?> [?? x 4]
ID VAR1 VAR2 VAR3
<dbl> <chr> <chr> <chr>
1 1 A F A1
2 2 C M B1
3 3 E F -1
4 4 C -1 B1
Although this solution works, it may not be the best depending on how large your dataset is. For instance, if the values of ref for one VAR don't fit in memory, the collect will throw an out-of-memory error. An alternative could be to generate an array column with the ref values and then check whether the VAR value exists inside the array using a function like hof_exists().
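Another way to avoid collect() altogether is to express the check as a join against ref. The sketch below runs on local data frames with dplyr only; the assumption (not verified here) is that the same verbs would translate to Spark SQL if clients and ref were the Spark tables created with copy_to. The helper column name is_allowed is made up for illustration, and the approach relies on ref having no duplicate (Variables, Values) pairs, since duplicates would multiply rows in the join.

```r
library(dplyr)

clients <- data.frame(
  ID = c(1, 2, 3, 4),
  VAR1 = c("A", "C", "E", "C"),
  VAR2 = c("F", "M", "F", "0"),
  VAR3 = c("A1", "B1", "C1", "B1"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  Variables = c(rep("VAR1", 5), rep("VAR2", 2), rep("VAR3", 2)),
  Values = c("A", "B", "C", "D", "E", "F", "M", "A1", "B1"),
  stringsAsFactors = FALSE
)

result <- clients
for (v in c("VAR1", "VAR2", "VAR3")) {
  # Allowed values for this variable, renamed so they can be joined on v.
  # is_allowed is a hypothetical helper column, dropped again below.
  allowed <- ref %>%
    filter(Variables == !!v) %>%
    transmute(!!v := Values, is_allowed = TRUE)

  # Rows without a match in `allowed` get NA in is_allowed and become "-1"
  result <- result %>%
    left_join(allowed, by = v) %>%
    mutate(!!v := if_else(is.na(is_allowed), "-1", .data[[v]])) %>%
    select(-is_allowed)
}
result
```

Because nothing is pulled into the R session, the size of ref no longer matters for driver memory; the cost moves to one join per variable.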

Related

Call a function with some parameters in a vector variable

I have a dataframe, and a function I use to mark specific cases in it (The following is an example to make things more concrete. The actual dataframe and function are much more complicated).
df = data.frame(number = c(1,2,3,4),
type1 = c("A","A","B","B"),
type2 = c("A","B","A","B"),
var1 = c(1,2,3,4),
var2 = c(1,2,3,4) )
FilterMark <- function (fun_data, cond_type1, cond_type2) {
fun_data$filter <- FALSE
fun_data$filter [which(fun_data$type1 == cond_type1 &
fun_data$type2 == cond_type2 )] <- TRUE
return(fun_data$filter)
}
I need to call this function multiple times, with various conditions. I want to use a vector to define the conditions for each such call. For example:
conds = c("A","A")
df$case1[FilterMark(df, conds)] <- TRUE
But this doesn't work. The function interprets conds as one parameter, instead of breaking the vector into two parameters.
Is there a way to call the function in such a way?
(I already tried do.call, but couldn't get it to work...)
Do you really need a specific function for that?
library(tidyverse)
df = data.frame(
number = c(1, 2, 3, 4),
type1 = c("A", "A", "B", "B"),
type2 = c("A", "B", "A", "B"),
var1 = c(1, 2, 3, 4),
var2 = c(1, 2, 3, 4)
)
df |>
mutate(case1 = type1 == "A" & type2 == "A")
#> number type1 type2 var1 var2 case1
#> 1 1 A A 1 1 TRUE
#> 2 2 A B 2 2 FALSE
#> 3 3 B A 3 3 FALSE
#> 4 4 B B 4 4 FALSE
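If the function does need to stay, the question's do.call idea can be made to work. This is a minimal sketch with a simplified FilterMark that returns the logical vector directly; the key point is that do.call expects a list of arguments, so the data frame has to be wrapped in list() before the condition vector is appended with as.list().

```r
df <- data.frame(
  number = c(1, 2, 3, 4),
  type1 = c("A", "A", "B", "B"),
  type2 = c("A", "B", "A", "B"),
  var1 = c(1, 2, 3, 4),
  var2 = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Simplified version of the question's function: TRUE where both types match
FilterMark <- function(fun_data, cond_type1, cond_type2) {
  fun_data$type1 == cond_type1 & fun_data$type2 == cond_type2
}

conds <- c("A", "A")

# list(df) keeps the data frame as ONE argument; c(df, ...) would instead
# splice its columns into separate arguments.
args <- c(list(df), as.list(conds))
df$case1 <- do.call(FilterMark, args)
df
```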

Adding new information to a table upon matching rows

I have very basic knowledge of R. I have two tabs (A and B) with rows I want to compare - some values match and some don't. I want R to find the matching elements and add the text value "E" to a pre-existing row in tab A if this is the case.
Example:
Tab A
ID Existing?
1 A
2 B
3 C
4 D
5 E
Tab B
ID
1 D
2 B
3 Y
4 A
5 W
Upon match:
Tab A
ID Existing?
1 A E
2 B E
3 C
4 D E
5 E
I have found information online on how to match tables but none on how to write new information when the match takes place.
Please explain like I'm 5... I have no programming background.
Thank you in advance!
Use match to find, for each element of df1$ID, its position in df2$ID (NA where there is no match), and ifelse to recode the values that are in both df1 and df2 as "E", and the rest as NA.
df1 <- data.frame(ID = LETTERS[1:5])
df2 <- data.frame(ID = c("D", "B", "Y", "A", "W"))
df1$Existing <- ifelse(match(df1$ID, df2$ID), "E", NA)
ID Existing
1 A E
2 B E
3 C <NA>
4 D E
5 E <NA>
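To see why this works, it can help to look at the intermediate steps. The sketch below uses the same example data; the names pos and found are just illustrative, and the last line shows how to get an empty string instead of NA, as in the question's expected table.

```r
df1 <- data.frame(ID = LETTERS[1:5], stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("D", "B", "Y", "A", "W"), stringsAsFactors = FALSE)

# Position of each df1$ID inside df2$ID; NA means "not found"
pos <- match(df1$ID, df2$ID)
pos
# 4 2 NA 1 NA

# Turn positions into a plain yes/no answer
found <- !is.na(pos)

# Write "E" where a match exists and leave the cell empty otherwise
df1$Existing <- ifelse(found, "E", "")
df1
```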
Another solution - using dplyr - would be to join the two dataframes, where you have added the column Existing to the one being joined:
library(dplyr, warn.conflicts = FALSE)
df1 <- tibble(ID = LETTERS[1:5])
df2 <- tibble(ID = c("D", "B", "Y", "A", "W"))
df1 %>%
left_join(df2 %>% mutate(Existing = "E"))
#> Joining, by = "ID"
#> # A tibble: 5 x 2
#> ID Existing
#> <chr> <chr>
#> 1 A E
#> 2 B E
#> 3 C <NA>
#> 4 D E
#> 5 E <NA>
This will set all matching IDs to E and all non-matching to NA.
# data
tab1 <- structure(list(ID = c("A", "B", "C", "D", "E"), Existing = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_)), class = "data.frame", row.names = c(NA,
-5L))
tab2 <- structure(list(ID = c("D", "B", "Y", "A", "W")), class = "data.frame", row.names = c(NA,
-5L))
There are many ways to skin this cat. In base-R, you could try, e.g.,
tab1$Existing[tab1$ID %in% tab2$ID] <- 'E'
In practise, for anything more complicated than tables with 6 rows, you could try dplyr:
library(dplyr)
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA))
Another useful tool -- with a slightly differing syntax -- is data.table.
library(data.table)
setDT(tab1) -> tab1
setDT(tab2) -> tab2
tab1[,Existing := ifelse(tab1$ID %in% tab2$ID, 'E',NA)]
Note that, here mutate and := play roughly the same role. Probably, if you work more with R, you will develop an affinity with one of the "dialects" above.
EDIT: To drop the rows with NA values (in dplyr), you could either do:
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA)) %>%
filter(!is.na(Existing))
Or piggy-backing on @jpiversen's solution:
df1 %>%
inner_join(df2 %>% mutate(Existing = "E"))

R get column names for changed rows

I have two dataframes, old and new, in R. Is there a way to add a column (called changed) to the new dataframe that lists the column names (in this case, separated with a ";") where the values are different between the two dataframes? I am also trying to use this in a function where the column names that I am comparing are contained in other variables (x1, x2, x3). Ideally, I would only refer to x1, x2, x3 instead of the actual column names, but I can make do if this isn't possible. A tidy solution is preferable.
old <- data.frame(var1 = c(1, 2, 3, 5), var2 = c("A", "B", "C", "D"))
new <- data.frame(var1 = c(1, 4, 3, 6), var2 = c("A", "B", "D", "Z"))
x1 <- "var1"
x2 <- "var2"
x3 <- "changed"
#Output, adding a new column changed to new dataframe
var1 var2 changed
1 1 A NA
2 4 B var1
3 3 D var2
4 6 Z var1; var2
A tidyverse way -
library(dplyr)
library(tidyr)
library(purrr) # map2_df
cols <- names(new)
bind_cols(new, map2_df(old, new, `!=`) %>%
rowwise() %>%
transmute(changed = {
x <- c_across()
if(any(x)) paste0(cols[x], collapse = ';') else NA
}))
# var1 var2 changed
#1 1 A <NA>
#2 4 B var1
#3 3 D var2
#4 6 Z var1;var2
The same logic can be implemented in base R as well -
new$changed <- apply(mapply(`!=`, old, new), 1, function(x)
if(any(x)) paste0(cols[x], collapse = ';') else NA)
Here is a base R approach.
new$changed <- apply(old != new, 1L, \(r, nms) toString(nms[which(r)]), colnames(old))
Output
var1 var2 changed
1 1 A
2 4 B var1
3 3 D var2
4 6 Z var1, var2
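The question also asked to refer only to x1, x2 and x3 rather than the literal column names. A minimal base R sketch of that, assuming both data frames have the same rows in the same order (compare_cols and diff_mat are illustrative names):

```r
old <- data.frame(var1 = c(1, 2, 3, 5), var2 = c("A", "B", "C", "D"),
                  stringsAsFactors = FALSE)
new <- data.frame(var1 = c(1, 4, 3, 6), var2 = c("A", "B", "D", "Z"),
                  stringsAsFactors = FALSE)
x1 <- "var1"
x2 <- "var2"
x3 <- "changed"

# Columns to compare, taken from the variables rather than hard-coded
compare_cols <- c(x1, x2)

# Logical matrix: TRUE where a cell differs between old and new
diff_mat <- old[compare_cols] != new[compare_cols]

# For each row, paste the names of the differing columns, or NA if none differ
new[[x3]] <- unname(apply(diff_mat, 1, function(r) {
  if (any(r)) paste(compare_cols[r], collapse = "; ") else NA_character_
}))
new
```

Wrapping these lines in a function that takes old, new, the columns to compare and the output column name as arguments would be a straightforward next step.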

Automating the process of recoding numeric variables to meaningful factor variables

I have a large data frame (hundreds of variables wide) in which all values of categorical variables are saved as numerics, for example, 1, 2, 8, representing no, yes, and unknown.
However, this is not always consistent. There are variables that have ten or more categories with 88 representing unknown etc.
data <- data.frame("ID" = c(1:5),
"Var1" = c(2,2,8,1,8),
"Var2" = c(5,8,4,88,10))
For each variable, I do have all information on which value represents which category. Currently, I have this information stored in vectors that are each correctly ordered, like
> Var1_values
[1] 8 2 1
with a corresponding vector containing the categories:
> Var1_categories
[1] "unknown" "yes" "no"
But I cannot figure out a process for how to bring this information together in order to automate the recoding process towards an expected result like
| ID | Var1 | Var2 |
|----|---------|-------------------|
| 1 | yes | condition E |
| 2 | yes | condition H |
| 3 | unknown | condition D |
| 4 | no | unknown condition |
| 5 | unknown | condition H |
where each column is a meaningful factor variable.
As I said, the data frame is very wide and things might change internally, so doing this manually is not an option. I feel like I'm being stupid as I have all the necessary information readily available, so any insight would be greatly appreciated, and a cup of coffee is the least I can do for helpful advice.
// edit:
I forgot to mention that I have already made some kind of a mapping-dataframe but I couldn't really put it to use, yet. It looks like this:
mapping <- data.frame("Variable" = c("Var1", "Var2", "Var3", "Var4"),
"Value1" = c(2,2,2,7),
"Word1" = c("yes","yes","yes","condition A"),
"Value2" = c(1,1,1,6),
"Word2" = c("no","no","no","Condition B"),
"Value3" = c(8,8,8,5),
"Word3" = c("unk","unk","unk", "Condition C"),
"Value4" = c(NA,NA,NA,4),
"Word4" = c(NA,NA,NA,"Condition B")
)
I would like to "long"-transform it so I can use it with @r2evan's solution.
Here's one thought, though it requires reshaping (twice) the data.
mapping <- data.frame(
Var = c(rep("Var1", 3), rep("Var2", 5)),
Val = c(1, 2, 8, 4, 5, 8, 10, 88),
Words = c("no", "yes", "unk", "D", "E", "H", "H", "unk")
)
mapping
# Var Val Words
# 1 Var1 1 no
# 2 Var1 2 yes
# 3 Var1 8 unk
# 4 Var2 4 D
# 5 Var2 5 E
# 6 Var2 8 H
# 7 Var2 10 H
# 8 Var2 88 unk
library(dplyr)
library(tidyr) # pivot_*
data %>%
pivot_longer(-ID, names_to = "Var", values_to = "Val") %>%
left_join(mapping, by = c("Var", "Val")) %>%
pivot_wider(id_cols = ID, names_from = "Var", values_from = "Words")
# # A tibble: 5 x 3
# ID Var1 Var2
# <int> <chr> <chr>
# 1 1 yes E
# 2 2 yes H
# 3 3 unk D
# 4 4 no unk
# 5 5 unk H
With this method, you control the number-to-words mapping for each variable.
Another option is to use a map list, similar to above but it does not require double-reshaping.
maplist <- list(
Var1 = c("1" = "no", "2" = "yes", "8" = "unk"),
Var2 = c("4" = "D", "5" = "E", "8" = "H", "10" = "H", "88" = "unk")
)
maplist
# $Var1
# 1 2 8
# "no" "yes" "unk"
# $Var2
# 4 5 8 10 88
# "D" "E" "H" "H" "unk"
nms <- c("Var1", "Var2")
data[,nms] <- Map(function(val, lookup) lookup[as.character(val)],
data[nms], maplist[nms])
data
# ID Var1 Var2
# 1 1 yes E
# 2 2 yes H
# 3 3 unk D
# 4 4 no unk
# 5 5 unk H
Between the two, I think I prefer the first if your data doesn't punish you for reshaping it (many things could make this less appealing). One reason it's good is that maintaining the mapping can be as easy as maintaining a CSV (which might be done in your favorite spreadsheet tool, e.g., Excel or Calc).
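If the mapping already exists in a wide layout like the one in the question's edit (Value1/Word1, Value2/Word2, ...), it can be stacked into the long Var/Val/Words layout used in the first approach. This is a base R sketch that assumes the columns follow the Value<k>/Word<k> naming shown in the question; slots and long are illustrative names.

```r
mapping <- data.frame(
  Variable = c("Var1", "Var2", "Var3", "Var4"),
  Value1 = c(2, 2, 2, 7),
  Word1 = c("yes", "yes", "yes", "condition A"),
  Value2 = c(1, 1, 1, 6),
  Word2 = c("no", "no", "no", "Condition B"),
  Value3 = c(8, 8, 8, 5),
  Word3 = c("unk", "unk", "unk", "Condition C"),
  Value4 = c(NA, NA, NA, 4),
  Word4 = c(NA, NA, NA, "Condition B"),
  stringsAsFactors = FALSE
)

# Slot suffixes ("1", "2", ...) derived from the Value<k> column names
slots <- sub("^Value", "", grep("^Value", names(mapping), value = TRUE))

# One small data frame per slot, stacked on top of each other
long <- do.call(rbind, lapply(slots, function(k) {
  data.frame(
    Var = mapping$Variable,
    Val = mapping[[paste0("Value", k)]],
    Words = mapping[[paste0("Word", k)]],
    stringsAsFactors = FALSE
  )
}))

# Drop the padding rows of variables that have fewer categories
long <- long[!is.na(long$Val), ]
long <- long[order(long$Var, long$Val), ]
rownames(long) <- NULL
long
```

The resulting long can then be passed to left_join(..., by = c("Var", "Val")) in place of the hand-written mapping.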
Here's a way of doing it that requires no reshaping of your original data, and can be conceivably applied to any number of columns. First, put all your existing "values" and "categories" vectors into lists, formatting all as characters:
library(tidyverse)
# Recreating your existing vectors
Var1_values <- c(8, 2, 1)
Var2_values <- c(88, 10, 8, 5, 4)
Var1_categories <- c("unknown", "yes", "no")
Var2_categories <- c("unknown condition", "condition H", "condition H", "condition E", "condition D")
Var_values <- list(Var1_values, Var2_values) %>%
map(as.character)
Var_categories <- list(Var1_categories, Var2_categories)
Add names to each element of each vector in Var_categories using Var_values, and get a list of variable names to recode from your dataset:
for (i in seq_along(Var_categories)) {
names(Var_categories[[i]]) <- Var_values[[i]]
}
vars_names <- str_subset(colnames(data), "Var")
Then, use map2 to recode all of your target variables, before transforming into a tibble with ID column.
data_recoded <- map2(vars_names, Var_categories, ~ dplyr::recode(unlist(data[.x], use.names = FALSE), !!!.y)) %>%
as_tibble(.name_repair = ~ vars_names) %>%
add_column(ID = 1:5, .before = "Var1")
Output (data_recoded):
ID Var1 Var2
<int> <chr> <chr>
1 1 yes condition E
2 2 yes condition H
3 3 unknown condition D
4 4 no unknown condition
5 5 unknown condition H

Data manipulation using r purrr

I have two datasets dat1 and dat2. I would like to pull out rows from dat1 which match the pairs of variables from dat2. var6 can be matched in any of var1, var2, var3, and var4. var7 must be matched with var5.
I would like to come up with a solution using the map functions from the purrr package in tidyverse but I'm not sure where to start. Thank you for any help!
dat1 <- data.frame(id = c(1:9),
var1 = c("x","x","x","y","y","y","z","z","z"),
var2 = c("c","c","c","d","d","d","e","e","e"),
var3 = c("f","f","f","g","g","g","h","h","h"),
var4 = c("i","i","i","j","j","j","k","k","k"),
var5 = c("aa","aa","aa","aa","aa","aa","bb","bb","bb"), stringsAsFactors = FALSE)
dat2 <- data.frame(var6 = c("c", "d", "l", "m", "n"),
var7 = c("aa", "bb", "aa", "aa","aa"), stringsAsFactors = FALSE)
In this example the result would pull out rows 1, 2, and 3 from dat1 as "c" is matched in var2 and "aa" is matched in var5.
For an elementwise comparison, loop over columns 2 to 5 of 'dat1' with lapply. For each column, compare its values against 'var6' of 'dat2' using outer, and do the same for 'var5' of 'dat1' against 'var7' of 'dat2'. Combine the two logical matrices with & so that an entry is TRUE only when both conditions hold for the same row of 'dat2', then use rowSums(...) > 0 to collapse each matrix into a single logical vector. Finally, Reduce the list of vectors into one vector with |, i.e. a row is kept if it matches in any of the four columns. The result ('i1') is used to subset the rows.
i1 <- Reduce(`|`, lapply(dat1[2:5], function(x)
rowSums(outer(x, dat2$var6, `==`) & outer(dat1$var5, dat2$var7, `==`)) > 0 ))
dat1[i1,]
# id var1 var2 var3 var4 var5
#1 1 x c f i aa
#2 2 x c f i aa
#3 3 x c f i aa
Or using map
library(purrr)
library(dplyr)
map(dat1[2:5], ~ outer(.x, dat2$var6, `==`) &
outer(dat1$var5, dat2$var7, `==`)) %>%
reduce(`+`) %>%
rowSums %>%
as.logical %>%
magrittr::extract(dat1, ., )
# id var1 var2 var3 var4 var5
#1 1 x c f i aa
#2 2 x c f i aa
#3 3 x c f i aa
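The same matching rule can also be written by looping over the (var6, var7) pairs of dat2 instead of over the columns of dat1, which some may find easier to read. This is an alternative sketch, not the approach above; search_cols, hits and keep are illustrative names.

```r
library(purrr)

dat1 <- data.frame(id = c(1:9),
                   var1 = c("x","x","x","y","y","y","z","z","z"),
                   var2 = c("c","c","c","d","d","d","e","e","e"),
                   var3 = c("f","f","f","g","g","g","h","h","h"),
                   var4 = c("i","i","i","j","j","j","k","k","k"),
                   var5 = c("aa","aa","aa","aa","aa","aa","bb","bb","bb"),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(var6 = c("c", "d", "l", "m", "n"),
                   var7 = c("aa", "bb", "aa", "aa", "aa"),
                   stringsAsFactors = FALSE)

search_cols <- c("var1", "var2", "var3", "var4")

# For each pair in dat2: which rows of dat1 contain var6 in any search column
# AND have var5 equal to var7?
hits <- map2(dat2$var6, dat2$var7, function(v6, v7) {
  in_any <- rowSums(dat1[search_cols] == v6) > 0
  in_any & dat1$var5 == v7
})

# A row is kept if it satisfies at least one pair
keep <- reduce(hits, `|`)
dat1[keep, ]
```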

Resources