I would like to combine two data frames using crossing(), but some columns have the same name in both. For those, I would like to append "_nameofdataframe" to the column names. Here are some reproducible data frames (dput below):
> df1
  person V1 V2 V3
1      A  1  3  3
2      B  4  4  5
3      C  2  1  1
> df2
  V2 V3
1  2  5
2  1  6
3  1  2
When I run the following code it will return duplicated column names:
library(tidyr)
crossing(df1, df2, .name_repair = "minimal")
#> # A tibble: 9 × 6
#> person V1 V2 V3 V2 V3
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 3 3 1 2
#> 2 A 1 3 3 1 6
#> 3 A 1 3 3 2 5
#> 4 B 4 4 5 1 2
#> 5 B 4 4 5 1 6
#> 6 B 4 4 5 2 5
#> 7 C 2 1 1 1 2
#> 8 C 2 1 1 1 6
#> 9 C 2 1 1 2 5
As you can see, the result keeps the duplicated column names. My desired output should look like this:
person V1 V2_df1 V3_df1 V2_df2 V3_df2
1 A 1 3 3 1 2
2 A 1 3 3 1 6
3 A 1 3 3 2 5
4 B 4 4 5 1 2
5 B 4 4 5 1 6
6 B 4 4 5 2 5
7 C 2 1 1 1 2
8 C 2 1 1 1 6
9 C 2 1 1 2 5
So I was wondering if anyone knows a more automatic way, using crossing(), to rename the duplicated columns as in the desired output above?
dput of df1 and df2:
df1 <- structure(list(person = c("A", "B", "C"), V1 = c(1, 4, 2), V2 = c(3,
4, 1), V3 = c(3, 5, 1)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(V2 = c(2, 1, 1), V3 = c(5, 6, 2)), class = "data.frame", row.names = c(NA,
-3L))
As you probably know, the .name_repair parameter can take a function. The problem is that crossing() passes that function only one argument: a single character vector of the concatenated names() of both data frames (see the sketch after the list below). So we can't easily pass the names of the data frame objects to it. It seems to me that there are two solutions:
Manually add the desired suffix to an anonymous function.
Create a wrapper function around crossing().
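To see what the repair function actually receives, here is a minimal sketch (the function below is purely illustrative; it just prints its input and returns unique names so the call succeeds):
library(tidyr)
crossing(
  df1, df2,
  .name_repair = function(nms) {
    print(nms)        # one vector: "person" "V1" "V2" "V3" "V2" "V3"
    make.unique(nms)  # return valid, unique names
  }
)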
1. Manually add the desired suffix to an anonymous function
We can simply supply the suffixes as a character-vector default argument of an anonymous function passed to .name_repair, e.g. suffix = c("_df1", "_df2").
crossing(
df1,
df2,
.name_repair = \(x, suffix = c("_df1", "_df2")) {
names_to_repair <- names(which(table(x) == 2))
x[x %in% names_to_repair] <- paste0(
x[x %in% names_to_repair],
rep(
suffix,
each = length(unique(names_to_repair))
)
)
x
}
)
# person V1 V2_df1 V3_df1 V2_df2 V3_df2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 3 3 1 2
# 2 A 1 3 3 1 6
# 3 A 1 3 3 2 5
# 4 B 4 4 5 1 2
# 5 B 4 4 5 1 6
# 6 B 4 4 5 2 5
# 7 C 2 1 1 1 2
# 8 C 2 1 1 1 6
# 9 C 2 1 1 2 5
The disadvantage of this is that there is room for error when typing the suffixes, and that we might forget to change them if we rename the data frames.
Also note that we are checking for names which appear twice. If one of your original data frames already has broken (duplicated) names then this function will also rename those columns. But I think it would be unwise to try to do any type of join if either data frame did not have unique column names.
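If you want to guard against that explicitly, a small optional check (my addition, not part of the repair logic) before calling crossing():
# stop early if either input already has duplicated column names
stopifnot(!anyDuplicated(names(df1)), !anyDuplicated(names(df2)))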
2. Create a wrapper function around crossing()
This might be more in the spirit of the tidyverse. The crossing() docs to which you linked state that crossing() is a wrapper around expand_grid(). The source for expand_grid() shows that it is basically a wrapper which uses map() to apply vctrs::vec_rep() to some inputs. So if we want to add another function to the call stack, there are two ways I can think of:
Using deparse(substitute())
crossing_fix_names <- function(df_1, df_2) {
suffixes <- paste0(
"_",
c(deparse(substitute(df_1)), deparse(substitute(df_2)))
)
crossing(
df_1,
df_2,
.name_repair = \(x, suffix = suffixes) {
names_to_repair <- names(which(table(x) == 2))
x[x %in% names_to_repair] <- paste0(
x[x %in% names_to_repair],
rep(
suffix,
each = length(unique(names_to_repair))
)
)
x
}
)
}
# Output the same as above
crossing_fix_names(df1, df2)
The disadvantage of this is that deparse(substitute()) is ugly and can occasionally have surprising behaviour. The advantage is we do not need to remember to manually add the suffixes.
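One example of that surprising behaviour (purely illustrative): if you pass a computed expression rather than a bare object name, the deparsed expression becomes the suffix:
crossing_fix_names(df1, head(df2, 2))
# the duplicated columns end up named "V2_head(df2, 2)" and "V3_head(df2, 2)"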
Using match.call()
crossing_fix_names2 <- function(df_1, df_2) {
args <- as.list(match.call())
suffixes <- paste0(
"_",
c(
args$df_1,
args$df_2
)
)
crossing(
df_1,
df_2,
.name_repair = \(x, suffix = suffixes) {
names_to_repair <- names(which(table(x) == 2))
x[x %in% names_to_repair] <- paste0(
x[x %in% names_to_repair],
rep(
suffix,
each = length(unique(names_to_repair))
)
)
x
}
)
}
# Also the same output
crossing_fix_names2(df1, df2)
As we don't have the drawbacks of deparse(substitute()) and we don't have to manually specify the suffixes, I think this is probably the best approach.
Test for the condition (overlapping column names) using the dputs:
colnames(df1) %in% colnames(df2)
[1] FALSE FALSE TRUE TRUE
rename
colnames(df2) <- paste0(colnames(df2), '_df2')
then cbind
cbind(df1,df2)
person V1 V2 V3 V2_df2 V3_df2
1 A 1 3 3 2 5
2 B 4 4 5 1 6
3 C 2 1 1 1 2
Not so elegant, but the renamed columns remain usefully discernible later.
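If the full 9-row cross join from the question is still wanted, the same renaming idea can feed crossing() instead of cbind() (a sketch, assuming the dput data above):
shared <- intersect(colnames(df1), colnames(df2))
colnames(df1)[colnames(df1) %in% shared] <- paste0(shared, "_df1")
colnames(df2)[colnames(df2) %in% shared] <- paste0(shared, "_df2")
tidyr::crossing(df1, df2)  # 9 rows: person, V1, V2_df1, V3_df1, V2_df2, V3_df2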
I have a dataset where the first line is the header, the second line is some explanatory data, and rows 3 onwards are numbers. When I read in the data with this second explanatory row, the columns are automatically converted to factors (or I could set stringsAsFactors = FALSE to keep them as characters).
What I would like to do is remove the second row and have a function that goes through all columns, detects whether they contain only numbers, and changes each class to the appropriate type. Is there something like that available? Perhaps using dplyr? I have many columns, so I'd like to avoid manually reassigning them.
A simplified example below
> df <- data.frame(A = c("col 1",1,2,3,4,5), B = c("col 2",1,2,3,4,5))
> df
      A     B
1 col 1 col 2
2     1     1
3     2     2
4     3     3
5     4     4
6     5     5
If everything after the explanatory row is numeric, then we can simply do:
library(tidyverse)
df[-1, ] %>% mutate_all(as.numeric)
Depending on the task, it can also be done this way:
df <- tibble(A = c("col 1",1,2,3,4,5),
B = c("col 2",1,2,3,4,5),
C = c(letters[1:5], 6))
df[-1, ] %>% mutate_if(~ any(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 6
or like this:
df[-1, ] %>% mutate_if(~ all(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <chr>
1 1 1 b
2 2 2 c
3 3 3 d
4 4 4 e
5 5 5 6
In base R, we can remove the explanatory row and then convert every column:
df <- df[-1, ]
df[] <- lapply(df, as.numeric)
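Alternatively, a hedged base R sketch with type.convert(), which guesses an appropriate class per column instead of forcing everything to numeric (df_clean is just an illustrative name):
df_clean <- df[-1, ]  # drop the explanatory row
df_clean[] <- lapply(df_clean, function(x) type.convert(as.character(x), as.is = TRUE))
str(df_clean)         # numeric columns become numeric, others stay character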
I have a df like this
name <- c("Fred","Mark","Jen","Simon","Ed")
a_or_b <- c("a","a","b","a","b")
abc_ah_one <- c(3,5,2,4,7)
abc_bh_one <- c(5,4,1,9,8)
abc_ah_two <- c(2,1,3,7,6)
abc_bh_two <- c(3,6,8,8,5)
abc_ah_three <- c(5,4,7,6,2)
abc_bh_three <- c(9,7,2,1,4)
def_ah_one <- c(1,3,9,2,7)
def_bh_one <- c(2,8,4,6,1)
def_ah_two <- c(4,7,3,2,5)
def_bh_two <- c(5,2,9,8,3)
def_ah_three <- c(8,5,3,5,2)
def_bh_three <- c(2,7,4,3,0)
df <- data.frame(name,a_or_b,abc_ah_one,abc_bh_one,abc_ah_two,abc_bh_two,
abc_ah_three,abc_bh_three,def_ah_one,def_bh_one,
def_ah_two,def_bh_two,def_ah_three,def_bh_three)
I want to use the value in column "a_or_b" to choose the values in each of the corresponding "ah/bh" columns for each "abc" (one, two, and three), and put it into a new data frame. For example, Fred would have the values 3, 2 and 5 in his row in the new df. Those values represent the values of each of his "ah" categories for the abc columns. Jen, who has "b" in her a_or_b column, would have all of her "bh" values from her abc columns for her row in the new df. Here is what my desired output would look like:
combo_one <- c(3,5,1,4,8)
combo_two <- c(2,1,8,7,5)
combo_three <- c(5,4,2,6,4)
df2 <- data.frame(name,a_or_b,combo_one,combo_two,combo_three)
I've attempted this using sapply. The following gives me a matrix of the correct column indexes of df[grep("abc", colnames(df), fixed = TRUE)] for each row:
sapply(paste0(df$a_or_b,"h"),grep,colnames(df[grep("abc",colnames(df),fixed=TRUE)]))
First we gather your data into a tidy long format, then break out the columns into something useful. After that the filtering is simple, and if necessary we can convert back to the wide format:
library(dplyr)
library(tidyr)
gather(df, key = "var", value = "val", -name, -a_or_b) %>%
separate(var, into = c("combo", "h", "ind"), sep = "_") %>%
mutate(h = substr(h, 1, 1)) %>%
filter(a_or_b == h, combo == "abc") %>%
arrange(name) -> result_long
result_long
# name a_or_b combo h ind val
# 1 Ed b abc b one 8
# 2 Ed b abc b two 5
# 3 Ed b abc b three 4
# 4 Fred a abc a one 3
# 5 Fred a abc a two 2
# 6 Fred a abc a three 5
# 7 Jen b abc b one 1
# 8 Jen b abc b two 8
# 9 Jen b abc b three 2
# 10 Mark a abc a one 5
# 11 Mark a abc a two 1
# 12 Mark a abc a three 4
# 13 Simon a abc a one 4
# 14 Simon a abc a two 7
# 15 Simon a abc a three 6
spread(result_long, key = ind, value = val) %>%
select(name, a_or_b, one, two, three)
# name a_or_b one two three
# 1 Ed b 8 5 4
# 2 Fred a 3 2 5
# 3 Jen b 1 8 2
# 4 Mark a 5 1 4
# 5 Simon a 4 7 6
A base R approach would be using lapply: we loop through each row of the data frame, build a pattern with paste0 from the a_or_b column to find the matching columns, and then rbind all the values together.
new_df <- do.call("rbind", lapply(seq(nrow(df)), function(x)
setNames(df[x, grepl(paste0("abc_",df[x,"a_or_b"], "h"), colnames(df))],
c("combo_one", "combo_two", "combo_three"))))
new_df
# combo_one combo_two combo_three
#1 3 2 5
#2 5 1 4
#3 1 8 2
#4 4 7 6
#5 8 5 4
We can then cbind the required columns:
cbind(df[c(1, 2)], new_df)
# name a_or_b combo_one combo_two combo_three
#1 Fred a 3 2 5
#2 Mark a 5 1 4
#3 Jen b 1 8 2
#4 Simon a 4 7 6
#5 Ed b 8 5 4
It's possible to do this with a combination of map and mutate:
require(tidyverse)
df %>%
select(name, a_or_b, starts_with("abc")) %>%
rename_if(is.numeric, funs(sub("abc_", "", .))) %>%
mutate(combo_one = map_chr(a_or_b, ~ paste0(.x,"h_one")),
combo_one = !!combo_one,
combo_two = map_chr(a_or_b, ~ paste0(.x,"h_two")),
combo_two = !!combo_two,
combo_three = map_chr(a_or_b, ~ paste0(.x,"h_three")),
combo_three = !!combo_three) %>%
select(name, a_or_b, starts_with("combo"))
Output:
name a_or_b combo_one combo_two combo_three
1 Fred a 3 2 5
2 Mark a 5 1 4
3 Jen b 1 8 2
4 Simon a 4 7 6
5 Ed b 8 5 4
I have a big df with the following structure
df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7), name = c("aa", "ab", "ac", "aa", "aab", "aac", "aabc")), .Names = c("id", "name"), row.names = c(NA, -7L), class = "data.frame")
df
id name
1 1 aa
2 2 ab
3 3 ac
4 4 aa
5 5 aab
6 6 aac
7 7 aabc
I would like to create a new column group based on the two-character strings in column name (here aa, ab, ac), to achieve something like
df
id name group
1 1 aa 1
2 2 ab 2
3 3 ac 3
4 4 aa 1
5 5 aab 1
5 5 aab 2
6 6 aac 1
6 6 aac 3
7 7 aabc 1
7 7 aabc 2
7 7 aabc 3
While assigning groups for the two-character strings is straightforward, I struggle to find an efficient way to include the pairwise combinations of longer strings. I thought about splitting each string with nchar > 2 into all possible pairwise combinations and assigning them to the respective groups, but wonder if there is a better way.
Further notes
only pairwise combinations found in df (not all possible combinations)
order of the two character string does not matter (e.g. ab=ba)
only unique recombinations of longer strings (e.g. aaab is just aa and ab)
Similar question without the recombination problem: Assigning groups using grepl with multiple inputs
How about the following
# Your data
df <- structure(
list(
id = c(1, 2, 3, 4, 5, 6, 7),
name = c("aa", "ab", "ac", "aa", "aab", "aac", "aabc")),
.Names = c("id", "name"), row.names = c(NA, -7L), class = "data.frame")
# Create all possible 2char combinations from unique chars in string
group <- lapply(strsplit(df$name, ""), function(x)
unique(apply(combn(x, 2), 2, function(y) paste0(y, collapse = ""))));
# Melt and add original data
require(reshape2);
df2 <- melt(group);
df2 <- cbind.data.frame(
df2,
df[match(df2$L1, df$id), ]);
df2$group <- as.numeric(as.factor(df2$value));
df2;
# value L1 id name group
#1 aa 1 1 aa 1
#2 ab 2 2 ab 2
#3 ac 3 3 ac 3
#4 aa 4 4 aa 1
#5 aa 5 5 aab 1
#5.1 ab 5 5 aab 2
#6 aa 6 6 aac 1
#6.1 ac 6 6 aac 3
#7 aa 7 7 aabc 1
#7.1 ab 7 7 aabc 2
#7.2 ac 7 7 aabc 3
#7.3 bc 7 7 aabc 4
Explanation: strsplit splits the strings from df$name into char vectors. combn creates all 2-char combinations based on those char vectors. paste0 and unique keeps the concatenated unique 2-char combinations.
Note that this almost reproduces your example. That's because in my case, aabc also gives rise to group 4 = bc.
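For a single name, the intermediate steps look like this (illustrative only):
chars <- strsplit("aabc", "")[[1]]                         # "a" "a" "b" "c"
pairs <- apply(combn(chars, 2), 2, paste0, collapse = "")  # "aa" "ab" "ac" "ab" "ac" "bc"
unique(pairs)                                              # "aa" "ab" "ac" "bc"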
Update 1
You can filter entries based on a list of 2-char comparisons
# Filter entries
filter <- c("aa", "ab", "ac");
df2 <- df2[df2$value %in% filter, ]
# Clean up df2 to be consistent with OPs request
df2 <- df2[, -(1:2)];
df2;
# id name group
#1 1 aa 1
#2 2 ab 2
#3 3 ac 3
#4 4 aa 1
#5 5 aab 1
#5.1 5 aab 2
#6 6 aac 1
#6.1 6 aac 3
#7 7 aabc 1
#7.1 7 aabc 2
#7.2 7 aabc 3
Update 2
You can also create a filter dynamically, by selecting those value entries that are represented as 2-char strings in the original dataframe (in this case aa, ab and ac).
filter <- unique(unlist(group[sapply(group, function(x) length(x) == 1)]));
I have a dataset arranged such that the data is stored as a list of multiple observations within each 'cell'. See below:
partID | Var 1 | Var 2
1 | 1,2,3 | 4,5,6
2 | 7,8,9 | 1,2,3
I would like to get the data in a format more like this:
partID | Var 1 | Var 2
1 | 1 | 4
1 | 2 | 5
1 | 3 | 6
I've been trying various combinations of melt, unlist, and data.table but I haven't had much luck applying the various ways to expand the lists while simultaneously preserving multiple columns and their names. Am I reduced to looping through the dataset and binding the columns together?
If, for each row, the cells have the same number of entries and they are strings, then this is what you can do using data.table.
require(data.table)
DT<-data.table(partID=c(1,2),Var1=c("1,2,3","7,8,9"),Var2=c("4,5,6","1,2,3"))
DT2<-DT[,list(Var1=unlist(strsplit(Var1,",")),Var2=unlist(strsplit(Var2,","))),by=partID]
You use strsplit() to split the strings by the commas. You use unlist() to make the entries into a vector, not a list.
If, on the other hand, each cell is already a list, then all you need to do is unlist().
require(data.table)
DT3<-data.table(partID=c(1,2),Var1=list(c(1,2,3),c(7,8,9)),Var2=list(c(4,5,6),c(1,2,3)))
DT4<-DT3[,list(Var1=unlist(Var1),Var2=unlist(Var2)),by=partID]
Either way, you get this:
partID Var1 Var2
1 1 4
1 2 5
1 3 6
2 7 1
2 8 2
2 9 3
We can do this easily with cSplit
library(splitstackshape)
cSplit(DT, c("Var1", "Var2"), ",", "long")
# partID Var1 Var2
#1: 1 1 4
#2: 1 2 5
#3: 1 3 6
#4: 2 7 1
#5: 2 8 2
#6: 2 9 3
data
DT<-data.frame(partID=c(1,2),Var1=c("1,2,3","7,8,9"),Var2=c("4,5,6","1,2,3"))
The separate_rows() function in tidyr is the boss for observations with multiple delimited values...
# create data
library(tidyverse)
d <- data_frame(
partID = c(1, 2),
Var1 = c("1,2,3", "7,8,9"),
Var2 = c("4,5,6","1,2,3")
)
d
# # A tibble: 2 x 3
# partID Var1 Var2
# <dbl> <chr> <chr>
# 1 1 1,2,3 4,5,6
# 2 2 7,8,9 1,2,3
# tidy data
separate_rows(d, Var1, Var2, convert = TRUE)
# # A tibble: 6 x 3
# partID Var1 Var2
# <dbl> <int> <int>
# 1 1 1 4
# 2 1 2 5
# 3 1 3 6
# 4 2 7 1
# 5 2 8 2
# 6 2 9 3
You can also use dplyr together with tidyr, which provides the unnest() function to expand the columns:
library(dplyr); library(tidyr);
df %>% mutate(Var.1 = strsplit(Var.1, ","), Var.2 = strsplit(Var.2, ",")) %>% unnest()
Source: local data frame [6 x 3]
partID Var.1 Var.2
(dbl) (chr) (chr)
1 1 1 4
2 1 2 5
3 1 3 6
4 2 7 1
5 2 8 2
6 2 9 3