I have two datasets dat1 and dat2. I would like to pull out rows from dat1 which match the pairs of variables from dat2. var6 can be matched in any of var1, var2, var3, and var4. var7 must be matched with var5.
I would like to come up with a solution using the map functions from the purrr package in the tidyverse, but I'm not sure where to start. Thank you for any help!
dat1 <- data.frame(id = 1:9,
                   var1 = c("x", "x", "x", "y", "y", "y", "z", "z", "z"),
                   var2 = c("c", "c", "c", "d", "d", "d", "e", "e", "e"),
                   var3 = c("f", "f", "f", "g", "g", "g", "h", "h", "h"),
                   var4 = c("i", "i", "i", "j", "j", "j", "k", "k", "k"),
                   var5 = c("aa", "aa", "aa", "aa", "aa", "aa", "bb", "bb", "bb"),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(var6 = c("c", "d", "l", "m", "n"),
                   var7 = c("aa", "bb", "aa", "aa", "aa"),
                   stringsAsFactors = FALSE)
In this example the result would pull out rows 1, 2, and 3 from dat1 as "c" is matched in var2 and "aa" is matched in var5.
If we need an elementwise comparison, loop through columns 2 to 5 of 'dat1' with lapply, then do an elementwise comparison with 'var6' of 'dat2' using outer while doing the same comparison with the 'var5' and 'var7' columns of 'dat1' and 'dat2' respectively, and check whether both are TRUE (&). Then take the row-wise sum (rowSums) to collapse each matrix into a single logical vector, and Reduce the list of vectors into a single vector with |, i.e. checking whether any of the row elements are TRUE in each of the vectors. That vector is used for subsetting the rows ('i1').
i1 <- Reduce(`|`, lapply(dat1[2:5], function(x)
rowSums(outer(x, dat2$var6, `==`) & outer(dat1$var5, dat2$var7, `==`)) > 0 ))
dat1[i1,]
# id var1 var2 var3 var4 var5
#1 1 x c f i aa
#2 2 x c f i aa
#3 3 x c f i aa
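For intuition, here is what the intermediate pieces produce; this is an illustration, not part of the original answer:
# 9 x 5 logical matrix: cell (i, j) is TRUE when dat1$var2[i] == dat2$var6[j]
m1 <- outer(dat1$var2, dat2$var6, `==`)
# 9 x 5 logical matrix: cell (i, j) is TRUE when dat1$var5[i] == dat2$var7[j]
m2 <- outer(dat1$var5, dat2$var7, `==`)
# A row of dat1 qualifies only if both conditions hold against the same row of dat2
rowSums(m1 & m2) > 0
#[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE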
Or using map
library(purrr)
library(dplyr)
map(dat1[2:5], ~ outer(.x, dat2$var6, `==`) &
outer(dat1$var5, dat2$var7, `==`)) %>%
reduce(`+`) %>%
rowSums %>%
as.logical %>%
magrittr::extract(dat1, ., )
# id var1 var2 var3 var4 var5
#1 1 x c f i aa
#2 2 x c f i aa
#3 3 x c f i aa
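A join-based alternative (a sketch, not from the original answers): since each var6 value must pair with its own var7 value, pivoting dat1 long and using a semi_join on both columns expresses the paired condition directly.
library(dplyr)
library(tidyr)
matched_ids <- dat1 %>%
  pivot_longer(var1:var4, names_to = "which", values_to = "val") %>% # one row per (id, candidate column)
  semi_join(dat2, by = c("val" = "var6", "var5" = "var7")) %>% # keep rows where the pair matches
  distinct(id)
dat1 %>%
  semi_join(matched_ids, by = "id")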
I am trying to reshape a dataset by switching some cells' information. Here is what my sample dataset looks like.
data <- data.frame(var1 = c("Text", "A", "B", "C", "D"),
                   var2 = c("Text", NA, 1, 0, 1),
                   var3 = c("112-1", NA, NA, "text", NA),
                   var4 = c("Text", 1, 0, NA, NA),
                   var5 = c("113-1", NA, "text", NA, NA))
> data
var1 var2 var3 var4 var5
1 Text Text 112-1 Text 113-1
2 A <NA> <NA> 1 <NA>
3 B 1 <NA> 0 text
4 C 0 text <NA> <NA>
5 D 1 <NA> <NA> <NA>
It needs some cleaning first. var1 has the item information. var2 and var4 have score information. var3 and var5 have id information in the first row.
I will need to reshape this dataset as below.
> data.1
id A B C D
1 112 NA 1 0 1
2 113 1 0 NA NA
Considering that this data file may have more columns (e.g. var6, var7, var8, var9, etc.) following the same pattern, how can I reshape it into this desired dataset?
This isn't much different from my answer yesterday, but this will give you the result you asked for. Shift that first row over one column so that the id is in the same column as the needed values, remove the unnecessary columns, then make row one the column names. Add some pivots and then it should be roughly what you need:
data <- data.frame(var1 = c("Text", "A", "B", "C", "D"),
                   var2 = c("Text", NA, 1, 0, 1),
                   var3 = c("112", NA, NA, NA, NA),
                   var4 = c("Text", 1, 0, NA, NA),
                   var5 = c(113, NA, NA, NA, NA))
library(dplyr)
library(tidyr)
data2 <- data %>%
  mutate_all(as.character) # Make everything character to avoid factor issues
data2[1, 2:(ncol(data2) - 1)] <- data2[1, 3:ncol(data2)] # Shift the first row over one column
data3 <- data2 %>%
  select(-var3, -var5) # Remove the unneeded columns
colnames(data3) <- data3[1, ] # Make the first row the column names
data3 <- data3[-1, ] # Remove row 1, since it was made into column names
data3 %>%
  tidyr::pivot_longer(-Text, names_to = "id", values_to = "time") %>% # Reshape into long format
  tidyr::pivot_wider(names_from = Text, values_from = time) # Then back into wide
You could shift the first row, keep the first column plus the even-numbered columns (index %% 2 == 0), and transpose.
data[1, ] <- data[1, -1]
data <- data[c(TRUE, seq_len(ncol(data))[-1] %% 2 == 0)]
setNames(as.data.frame(t(data[, -1]), row.names=FALSE), c('id', data[[1]][-1])) |>
type.convert(as.is=TRUE)
# id A B C D
# 1 112-1 NA 1 0 1
# 2 113-1 1 0 NA NA
BTW, how do you get such data? Maybe you have an XY problem.
library(dplyr)
library(tidyr)
library(stringr)
#First, rename the columns to something more appropriate
n = 2 #Number of pairs of columns you have (here 2)
nam <- do.call(paste0, (expand.grid(c("n_", "id_"), seq(n))))
colnames(data) <- c("col", nam)
#Then, the data manipulation
data %>%
mutate(across(starts_with("id"), ~ first(str_remove(.x, "-")))) %>%
fill(starts_with("id")) %>%
slice(-1) %>%
pivot_longer(-col, names_to = c(".value", "rn"), names_sep = "_") %>%
pivot_wider(names_from = "col", values_from = 'n') %>%
select(-rn)
id A B C D
1 1121 NA 1 0 1
2 1131 1 0 NA NA
I have two dataframes, old and new, in R. Is there a way to add a column (called changed) to the new dataframe that lists the column names (in this case, separated with a ";") where the values differ between the two dataframes? I am also trying to use this in a function where the column names that I am comparing are contained in other variables (x1, x2, x3). Ideally, I would refer only to x1, x2, x3 instead of the actual column names, but I can make do if this isn't possible. A tidy solution is preferable.
old <- data.frame(var1 = c(1, 2, 3, 5), var2 = c("A", "B", "C", "D"))
new <- data.frame(var1 = c(1, 4, 3, 6), var2 = c("A", "B", "D", "Z"))
x1 <- "var1"
x2 <- "var2"
x3 <- "changed"
#Output, adding a new column changed to new dataframe
var1 var2 changed
1 1 A NA
2 4 B var1
3 3 D var2
4 6 Z var1; var2
A tidyverse way -
library(dplyr)
library(tidyr)
library(purrr) # for map2_df
cols <- names(new)
bind_cols(new, map2_df(old, new, `!=`) %>%
rowwise() %>%
transmute(changed = {
x <- c_across()
if(any(x)) paste0(cols[x], collapse = ';') else NA
}))
# var1 var2 changed
#1 1 A <NA>
#2 4 B var1
#3 3 D var2
#4 6 Z var1;var2
The same logic can be implemented in base R as well -
new$changed <- apply(mapply(`!=`, old, new), 1, function(x)
if(any(x)) paste0(cols[x], collapse = ';') else NA)
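The question also asks to drive this from x1, x2, x3 rather than hard-coded names; a minimal sketch building on the base R version above (assuming x1 and x2 hold the column names to compare and x3 the name of the new column):
cols <- c(x1, x2) # columns to compare, taken from the variables
new[[x3]] <- apply(mapply(`!=`, old[cols], new[cols]), 1, function(x)
  if(any(x)) paste0(cols[x], collapse = '; ') else NA)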
Here is a base R approach.
new$changed <- apply(old != new, 1L, \(r, nms) toString(nms[which(r)]), colnames(old))
Output
var1 var2 changed
1 1 A
2 4 B var1
3 3 D var2
4 6 Z var1, var2
I'm trying to iterate over the rows of a data frame and get access to the values in the columns of each row. Perhaps I need a paradigm shift. I've attempted a vectorization approach. My ultimate objective is to use specific column values in each row to filter another data frame.
Any help would be appreciated.
df <- data.frame(a = 1:3, b = letters[24:26], c = 7:9)
f <- function(row) {
var1 <- row$a
var2 <- row$b
var3 <- row$c
}
pmap(df, f)
Is there a way to do this in purrr?
Using pmap, we can do
library(purrr)
pmap(df, ~ f(list(...)))
#[[1]]
#[1] 7
#[[2]]
#[1] 8
#[[3]]
#[1] 9
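If you would rather avoid the list(...) wrapper, you can also give the function one argument per column, since pmap matches data frame columns to function arguments by name (f2 here is hypothetical, not from the question):
f2 <- function(a, b, c) {
  # pmap supplies each column of df as a named argument
  paste(a, b, c, sep = "-")
}
pmap(df, f2)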
Or use rowwise with cur_data
library(dplyr)
df %>%
rowwise %>%
transmute(new = f(cur_data()))
Output
# A tibble: 3 x 1
# Rowwise:
# new
# <int>
#1 7
#2 8
#3 9
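Side note: cur_data() was deprecated in dplyr 1.1.0; with a recent dplyr the same idea can be written with pick() (a sketch, assuming dplyr >= 1.1.0):
df %>%
  rowwise %>%
  transmute(new = f(pick(everything()))) # pick() returns the current row's columns as a tibble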
library(tidyverse)
df <- data.frame(a = 1:3, b = letters[24:26], c = 7:9)
f <- function(row) {
var1 <- row$a
var2 <- row$b
var3 <- row$c
}
df %>%
split(rownames(.)) %>%
map(
~f(.x)
)
I have the following spark data frame called "clients":
ID VAR1 VAR2 VAR3
1 A F A1
2 C M B1
3 E F C1
4 C 0 B1
And I also have a dictionary of variables called "ref", which shows the unique values a variable can take, like this one:
Variables Values
VAR1 A
VAR1 B
VAR1 C
VAR1 D
VAR1 E
VAR2 F
VAR2 M
VAR3 A1
VAR3 B1
I want to replace the values that are not allowed for a variable with a default value like "-1", so I can have something like this:
ID VAR1 VAR2 VAR3
1 A F A1
2 C M B1
3 E F -1
4 C -1 B1
With the data.table library I would use something like this:
names <- colnames(clients)
for(var in names){
  set(x = clients,
      i = which(!clients[[var]] %in% ref[Variables == var]$Values),
      j = var,
      value = "-1")
}
I would like to know if there is a similar solution I can use with sparklyr.
I am not sure which functions are available for Spark, hence here is a base R option.
clients[-1] <- lapply(names(clients)[-1], function(x) {
replace(clients[[x]],
!clients[[x]] %in% ref$Values[ref$Variables == x], NA)
})
clients
# ID VAR1 VAR2 VAR3
#1 1 A F A1
#2 2 C M B1
#3 3 E F <NA>
#4 4 C <NA> B1
I have replaced the values that are not allowed with NA instead of -1; you can use whichever value you want.
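If you would rather keep the question's "-1" default, the same call works with "-1" in place of NA (the columns here are character, so nothing else changes):
clients[-1] <- lapply(names(clients)[-1], function(x) {
  replace(clients[[x]],
          !clients[[x]] %in% ref$Values[ref$Variables == x], "-1")
})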
One approach, similar to your current code, is to collect the list of ref values for each VAR and apply case_when. If the value exists in the collected list, return the value of VAR; otherwise return -1.
Let's see an example. I suggest always providing code that generates the test dataset:
clients <- data.frame(
ID=c(1, 2, 3, 4),
VAR1=c("A", "C", "E", "C"),
VAR2=c("F", "M", "F", "0"),
VAR3=c("A1", "B1", "C1", "B1"),
stringsAsFactors = FALSE
)
ref <- data.frame(
Variables = c("VAR1", "VAR1", "VAR1", "VAR1", "VAR1", "VAR2", "VAR2", "VAR3", "VAR3"),
Values=c("A", "B", "C", "D", "E", "F", "M", "A1", "B1"),
stringsAsFactors = FALSE
)
Start the session
library(sparklyr)
library(tidyverse)
sc <- spark_connect(master="local[4]", version = "3.0.1")
sref <- copy_to(sc, ref)
sclients <- copy_to(sc, clients, overwrite = T)
for(var in c("VAR1", "VAR2", "VAR3")) {
var_values <- (sref %>%
filter(Variables == var) %>%
select(Values) %>%
collect())$Values
sclients <- sclients %>%
mutate(!!as.name(var) := case_when(!!as.name(var) %in% var_values ~ !!as.name(var),
TRUE ~ "-1"))
}
sclients
Result
# Source: spark<?> [?? x 4]
ID VAR1 VAR2 VAR3
<dbl> <chr> <chr> <chr>
1 1 A F A1
2 2 C M B1
3 3 E F -1
4 4 C -1 B1
Although this solution works, it may not be the best depending on how large your dataset is. For instance, if the values of ref for one VAR don't fit in memory, the collect will throw an out-of-memory error. An alternative could be to generate an array column with the ref values and then check whether the VAR value exists inside the array, using a function like hof_exists().
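To illustrate that idea without collecting to R, here is a rough, untested sketch that builds an array column of allowed values per variable and tests membership with Spark SQL's array_contains(), shown for VAR1 only; it assumes sparklyr passes collect_list() and array_contains() through to Spark:
# One row per variable, with its allowed values gathered into an array column
sref_arrays <- sref %>%
  group_by(Variables) %>%
  summarise(allowed = collect_list(Values))
# Join the array in and test membership on the Spark side
sclients %>%
  mutate(Variables = "VAR1") %>%
  left_join(sref_arrays, by = "Variables") %>%
  mutate(VAR1 = ifelse(array_contains(allowed, VAR1), VAR1, "-1")) %>%
  select(-Variables, -allowed)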
Is there a function in dplyr that allows you to test the same condition against a selection of columns?
Take the following dataframe:
Demo1 <- c(8,9,10,11)
Demo2 <- c(13,14,15,16)
Condition <- c('A', 'A', 'B', 'B')
Var1 <- c(13,76,105,64)
Var2 <- c(12,101,23,23)
Var3 <- c(5,5,5,5)
df <- as.data.frame(cbind(Demo1, Demo2, Condition, Var1, Var2, Var3), stringsAsFactors = F)
df[4:6] <- lapply(df[4:6], as.numeric)
I want to take all the rows in which there is at least one value greater than 100 in any of Var1, Var2, or Var3. I realise that I could do this with a series of or statements, like so:
df <- df %>%
filter(Var1 > 100 | Var2 > 100 | Var3 > 100)
However, since I have quite a few columns in my actual dataset this would be time-consuming. I am assuming that there is some reasonably straightforward way to do this but haven't been able to find a solution on SO.
We can do this with filter_at and any_vars
df %>%
filter_at(vars(matches("^Var")), any_vars(.> 100))
# Demo1 Demo2 Condition Var1 Var2 Var3
#1 9 14 A 76 101 5
#2 10 15 B 105 23 5
Or using base R, create a logical expression with lapply and Reduce and subset the rows
df[Reduce(`|`, lapply(df[grepl("^Var", names(df))], `>`, 100)),]
In base R one can write the same filter using rowSums as:
df[rowSums((df[,grepl("^Var",names(df))] > 100)) >= 1, ]
# Demo1 Demo2 Condition Var1 Var2 Var3
# 2 9 14 A 76 101 5
# 3 10 15 B 105 23 5
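As a side note, filter_at()/any_vars() have since been superseded; in dplyr 1.0.0 or later the same filter is usually written with if_any() (a sketch, assuming dplyr >= 1.0.0):
library(dplyr)
df %>%
  filter(if_any(starts_with("Var"), ~ .x > 100))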