group 2 variables and then delimit the strings - r

I am trying to group by two variables and spread the comma-separated values in g3 across the rows of each group, without increasing the number of rows.
For example:
#my dataframe
> df
g1 g2 g3
1 a1 a2 77.7,81.7
2 a1 a2 77.7,81.7
3 b2 b3 3,1,5
4 b2 b3 3,1,5
5 b2 b3 3,1,5
Expected Output:
g1 g2 g3
1 a1 a2 77.7
2 a1 a2 81.7
3 b2 b3 3
4 b2 b3 1
5 b2 b3 5
I tried the code below, but it does not group the rows and the result is not in the expected format. Please help!
Code:
df <- data.frame(g1 = c("a1","a1","b2","b2","b2"), g2 = c("a2","a2","b3","b3","b3"), g3 = c("77.7,81.7","77.7,81.7","3,1,5","3,1,5","3,1,5"))
library(stringr)
s <- strsplit(df$g3, split = ",")
data.frame(V1 = rep(df$g1, sapply(s, length)), V2 = unlist(s))

Building on Chris Ruehlemann's answer: the following splits within each g1 group, so it still works if a value reappears in another group (as 77.7 does in the data below).
df$g3_split <- unlist(lapply(split(df,df$g1), function(x) unique(unlist(strsplit(x$g3, ","))) ))
df
g1 g2 g3 g3_split
1 a1 a2 77.7,81.7 77.7
2 a1 a2 77.7,81.7 81.7
3 b2 b3 3,77.7,5 3
4 b2 b3 3,77.7,5 77.7
5 b2 b3 3,77.7,5 5
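
The per-group call above relies on the rows being sorted by g1 and on each group having exactly as many distinct values as rows. A minimal base R sketch that drops both assumptions by giving the k-th row of each group the k-th value of its string (the names `key`, `pos` and `parts` are illustrative, not from the answer):

```r
# Same data as the output above, where 77.7 appears in both groups
df <- data.frame(
  g1 = c("a1", "a1", "b2", "b2", "b2"),
  g2 = c("a2", "a2", "b3", "b3", "b3"),
  g3 = c("77.7,81.7", "77.7,81.7", "3,77.7,5", "3,77.7,5", "3,77.7,5"),
  stringsAsFactors = FALSE
)

# Position of each row within its (g1, g2) group
key <- paste(df$g1, df$g2)
pos <- ave(seq_along(key), key, FUN = seq_along)

# Split every g3 string and keep the element matching the row's position
parts <- strsplit(df$g3, ",", fixed = TRUE)
df$g3_split <- mapply(function(p, i) p[i], parts, pos)
df
```

If a group has more rows than values, the extra rows get NA rather than an error.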

DATA:
df <- data.frame(g1 = c("a1","a1","b2","b2","b2"),
                 g2 = c("a2","a2","b3","b3","b3"),
                 g3 = c("77.7,81.7","77.7,81.7","3,1,5","3,1,5","3,1,5"), stringsAsFactors = FALSE)
SOLUTION:
df$g3_split <- unique(unlist(strsplit(df$g3, ",")))
RESULT:
df
g1 g2 g3 g3_split
1 a1 a2 77.7,81.7 77.7
2 a1 a2 77.7,81.7 81.7
3 b2 b3 3,1,5 3
4 b2 b3 3,1,5 1
5 b2 b3 3,1,5 5
If you want to replace g3 with the new values, just assign unique(unlist(strsplit(df$g3, ","))) to df$g3 instead of df$g3_split.

An option with separate_rows:
library(dplyr)
library(tidyr)
df %>%
  mutate(g3_split = g3) %>%
  separate_rows(g3_split) %>%
  distinct(g3_split, .keep_all = TRUE)

Related

Is there a way to change data frame entries in R from numeric to a specific character?

If I have a data frame like so:
df <- data.frame(
a = c(1,1,1,2,2,2,3,3,3),
b = c(1,2,3,1,2,3,1,2,3)
)
which looks like this:
> df
a b
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
Is there a quick way to change the columns a and b to match the example below, without explicitly having to type it all out?
> df
a b
a1 b1
a1 b2
a1 b3
a2 b1
a2 b2
a2 b3
a3 b1
a3 b2
a3 b3
In other words, I'm trying to take the name of the column and place it in front of the value that was originally in that row.
We can use cur_column() inside across() to get the name of the current column and paste it (with str_c) in front of each value in that column:
library(dplyr)
library(stringr)
df1 <- df %>%
  mutate(across(everything(), ~ str_c(cur_column(), .)))
Output:
df1
# a b
#1 a1 b1
#2 a1 b2
#3 a1 b3
#4 a2 b1
#5 a2 b2
#6 a2 b3
#7 a3 b1
#8 a3 b2
#9 a3 b3
Or using base R
df[] <- Map(paste0, names(df), df)
Or another option is
df[] <- paste0(names(df)[col(df)], unlist(df))
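
If only some columns should get the prefix, the Map approach can be restricted to a subset. A small sketch, assuming a hypothetical selection vector `cols` (not part of the original answer):

```r
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  b = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
)

# Hypothetical selection: prefix only these columns, leave the rest numeric
cols <- c("a")
df[cols] <- Map(paste0, cols, df[cols])
str(df)
```

Here column a becomes character ("a1", "a2", ...) while b stays numeric.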

How to write two vectors of different length into one data frame by writing same values into same row?

I want to write two vectors of different length with partly equal values into one data frame. The same values should be written in the same row.
ef1 <- c('A1', 'A2', 'B0', 'B1', 'C1', 'C2')
ef2 <- c('A1', 'A2', 'C1', 'C2', 'D1', 'D2')
If I write them in one data frame, it looks like this:
df <- data.frame (ef1, ef2)
> df
ef1 ef2
1 A1 A1
2 A2 A2
3 B0 C1
4 B1 C2
5 C1 D1
6 C2 D2
But what I want is this:
> df
ef1 ef2
1 A1 A1
2 A2 A2
3 B0 NA
4 B1 NA
5 C1 C1
6 C2 C2
7 NA D1
8 NA D2
I'm grateful for any help.
One option is match
(tmp <- unique(c(ef1, ef2)))
# [1] "A1" "A2" "B0" "B1" "C1" "C2" "D1" "D2"
out <- data.frame(ef1 = ef1[match(tmp, ef1)],
ef2 = ef2[match(tmp, ef2)])
Result
out
# ef1 ef2
#1 A1 A1
#2 A2 A2
#3 B0 <NA>
#4 B1 <NA>
#5 C1 C1
#6 C2 C2
#7 <NA> D1
#8 <NA> D2
Another solution uses dplyr's full_join: create an artificial key column in each table, then do a full join on it.
library(dplyr)
ef1 <- tibble(a = ef1, ef1 = ef1)
ef2 <- tibble(a = ef2, ef2 = ef2)
ef1 %>%
  full_join(ef2, by = "a") %>%
  select(ef1, ef2)
# A tibble: 8 x 2
ef1 ef2
<chr> <chr>
1 A1 A1
2 A2 A2
3 B0 NA
4 B1 NA
5 C1 C1
6 C2 C2
7 NA D1
8 NA D2
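
The same key-column idea can be sketched in base R with merge(); the names `left`, `right` and `key` are illustrative assumptions. Note that merge() sorts the result by the key, so the rows come out in alphabetical order rather than order of first appearance (which happens to coincide for this data).

```r
ef1 <- c("A1", "A2", "B0", "B1", "C1", "C2")
ef2 <- c("A1", "A2", "C1", "C2", "D1", "D2")

# Duplicate each vector into a key column so merge() has something to join on
left  <- data.frame(key = ef1, ef1 = ef1, stringsAsFactors = FALSE)
right <- data.frame(key = ef2, ef2 = ef2, stringsAsFactors = FALSE)

# all = TRUE keeps unmatched rows from both sides (a full outer join)
out <- merge(left, right, by = "key", all = TRUE)[, c("ef1", "ef2")]
out
```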

Split and combination of dataframe columns in R

I have a very large dataset (around 500k rows and 15 columns). One of the columns holds several values separated by semicolons, as follows:
Date a b c d
01-01-2020 A1 B1 C1a;C1b D1
30-12-2019 A2 B2 C2a;C2b;C2c D2
33-5-2018 A3 B3 C3a;C3b;C3c;C3d D3
20-11-2019 A4 B4 C4a;C4b D4
I would like to split column c so that there are only two columns (c_01 and c_02). When c holds more than two values, as in rows 2 and 3, I want to create one row for each unique pair of those values, with all other columns unchanged. The result would then look like:
Date a b c_01 c_02 d
01-01-2020 A1 B1 C1a C1b D1
30-12-2019 A2 B2 C2a C2b D2
30-12-2019 A2 B2 C2a C2c D2
30-12-2019 A2 B2 C2b C2c D2
33-5-2018 A3 B3 C3a C3b D3
33-5-2018 A3 B3 C3a C3c D3
33-5-2018 A3 B3 C3a C3d D3
33-5-2018 A3 B3 C3b C3c D3
33-5-2018 A3 B3 C3b C3d D3
33-5-2018 A3 B3 C3c C3d D3
20-11-2019 A4 B4 C4a C4b D4
I have tried to use cSplit to create one column per value and then a for loop over each row, but it does not really work. I have also tried the apply functions to build something similar to a loop, but the dataset is too large and I keep getting errors. Can someone help? Thank you very much!
We could use strsplit to split the 'c' column by ';', loop over the resulting list with map, get the pairwise combinations with combn, convert each to a data.frame, and unnest the resulting list column:
library(dplyr)
library(tidyr)
library(purrr)
df1 %>%
  mutate(c = map(strsplit(c, ";"), ~ combn(.x, 2) %>%
                   t %>%
                   as.data.frame %>%
                   set_names(c('c_01', 'c_02')))) %>%
  unnest(c(c))
# A tibble: 11 x 6
# Date a b c_01 c_02 d
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 01-01-2020 A1 B1 C1a C1b D1
# 2 30-12-2019 A2 B2 C2a C2b D2
# 3 30-12-2019 A2 B2 C2a C2c D2
# 4 30-12-2019 A2 B2 C2b C2c D2
# 5 33-5-2018 A3 B3 C3a C3b D3
# 6 33-5-2018 A3 B3 C3a C3c D3
# 7 33-5-2018 A3 B3 C3a C3d D3
# 8 33-5-2018 A3 B3 C3b C3c D3
# 9 33-5-2018 A3 B3 C3b C3d D3
#10 33-5-2018 A3 B3 C3c C3d D3
#11 20-11-2019 A4 B4 C4a C4b D4
Or using base R
lst1 <- lapply(strsplit(df1$c, ";"),
function(x) as.data.frame(t(combn(x, 2))))
l1 <- sapply(lst1, nrow)
out <- cbind(df1[rep(seq_len(nrow(df1)), l1),c('Date', 'a', 'b', 'd')],
do.call(rbind, lst1))
row.names(out) <- NULL
names(out)[5:6] <- c("c_01", "c_02")
out
# Date a b d c_01 c_02
#1 01-01-2020 A1 B1 D1 C1a C1b
#2 30-12-2019 A2 B2 D2 C2a C2b
#3 30-12-2019 A2 B2 D2 C2a C2c
#4 30-12-2019 A2 B2 D2 C2b C2c
#5 33-5-2018 A3 B3 D3 C3a C3b
#6 33-5-2018 A3 B3 D3 C3a C3c
#7 33-5-2018 A3 B3 D3 C3a C3d
#8 33-5-2018 A3 B3 D3 C3b C3c
#9 33-5-2018 A3 B3 D3 C3b C3d
#10 33-5-2018 A3 B3 D3 C3c C3d
#11 20-11-2019 A4 B4 D4 C4a C4b
data
df1 <- structure(list(Date = c("01-01-2020", "30-12-2019", "33-5-2018",
"20-11-2019"), a = c("A1", "A2", "A3", "A4"), b = c("B1", "B2",
"B3", "B4"), c = c("C1a;C1b", "C2a;C2b;C2c", "C3a;C3b;C3c;C3d",
"C4a;C4b"), d = c("D1", "D2", "D3", "D4")), class = "data.frame",
row.names = c(NA,
-4L))
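
Both versions call combn(x, 2), which stops with an error when a cell holds only one value. A hedged sketch of a per-cell helper that guards against that case; `pair_cells` is a hypothetical name, and padding the second column with NA is an assumption about what a single-value row should look like:

```r
# Hypothetical helper: all unordered pairs from one semicolon-separated cell
pair_cells <- function(cell, sep = ";") {
  parts <- strsplit(cell, sep, fixed = TRUE)[[1]]
  if (length(parts) < 2) {
    # combn(parts, 2) would fail here, so return a single NA-padded row
    return(data.frame(c_01 = parts[1], c_02 = NA_character_,
                      stringsAsFactors = FALSE))
  }
  m <- t(combn(parts, 2))  # one row per pair
  data.frame(c_01 = m[, 1], c_02 = m[, 2], stringsAsFactors = FALSE)
}

pair_cells("C2a;C2b;C2c")  # three pairs
pair_cells("C1a")          # one row, c_02 is NA
```

Such a helper could stand in for the anonymous function passed to lapply in the base R version.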

Create new columns looping an array inside mutate (dplyr)

I have the following dummy data frame called df:
A1 A2 A3 B1 B2 B3 C1 C2 C3
1 1 1 2 2 2 3 3 3
and I would like to sum columns that contain the same letter into a new column (naming it using the corresponding letter).
I would expect this result:
A1 A2 A3 B1 B2 B3 C1 C2 C3 A B C
1 1 1 2 2 2 3 3 3 3 6 9
I know I can achieve this result using mutate from dplyr:
mutate(df,
A = A1 + A2 + A3,
B = B1 + B2 + B3,
C = C1 + C2 + C3)
Is there any way to do it using a vector like letters <- c("A", "B", "C") and looping over that vector inside the mutate function? Something like:
mutate(df,
letters = paste0(letters,"1") + paste0(letters,"2") + paste0(letters,"3") )
One dplyr and purrr solution could be:
library(dplyr)
library(purrr)
bind_cols(df, map_dfc(.x = LETTERS[1:3],
                      ~ df %>%
                        transmute(!!.x := rowSums(select(., starts_with(.x))))))
A1 A2 A3 B1 B2 B3 C1 C2 C3 A B C
1 1 1 1 2 2 2 3 3 3 3 6 9
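
For comparison, a base R sketch of the same idea that loops over a vector of prefixes. It assumes the source columns are named as a prefix followed by digits; `prefixes` and `sums` are illustrative names:

```r
df <- data.frame(A1 = 1, A2 = 1, A3 = 1,
                 B1 = 2, B2 = 2, B3 = 2,
                 C1 = 3, C2 = 3, C3 = 3)
prefixes <- c("A", "B", "C")

# For each prefix, sum the columns named <prefix><digits> row by row.
# The anchored pattern skips the new sum columns if this is run twice.
sums <- lapply(prefixes, function(p) {
  rowSums(df[grepl(paste0("^", p, "[0-9]+$"), names(df))])
})
df[prefixes] <- sums
df
```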

how to sort data frame by column names in R?

How can I sort the below data frame df to df1?
df
a1 a4 a3 a5 a2
sorted data frame
df1
a1 a2 a3 a4 a5
We can use mixedorder from library(gtools)
library(gtools)
df1 <- df[mixedorder(colnames(df))]
df1
# a1 a3 a9 a10
#1 1 3 1 2
#2 2 4 2 3
#3 3 5 3 4
#4 4 6 4 5
#5 5 7 5 6
data
df <- data.frame(a1 = 1:5, a10=2:6, a3 = 3:7, a9= 1:5)
In base R, assuming the numbers in the column names don't go into double digits:
df
# a1 a4 a3 a5 a2
#1 1 4 3 5 2
df[, order(names(df))]
# a1 a2 a3 a4 a5
#1 1 2 3 4 5
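
order(names(df)) compares the names as text, so "a10" sorts before "a3". A base R sketch that orders by the numeric suffix instead, assuming every column name is a shared prefix followed by digits (`suffix` is an illustrative name):

```r
df <- data.frame(a1 = 1:5, a10 = 2:6, a3 = 3:7, a9 = 1:5)

# Strip everything that is not a digit, then order by the numeric value
suffix <- as.integer(gsub("[^0-9]", "", names(df)))
df1 <- df[order(suffix)]
names(df1)
# [1] "a1"  "a3"  "a9"  "a10"
```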
Assuming there is no gap in the numeric suffixes of the column names, you can also use dplyr:
library(dplyr)
# 1:5 covers a1 through a5, the columns in the question
df1 <- select(df, num_range("a", 1:5))
