I am searching for a simple dplyr or data.table solution. I need to sort the values within each row of a large data frame, but so far I only have a solution with for loops.
Here is a minimal example:
A = c('A1', 'A2', 'A3', 'A4', 'A5')
B = c('B1', 'B2', 'B3')
set.seed(20)
df = data.frame(col1 = sample(c(A,B),8,1), col2 = sample(c(A,B),8,1), col3 = sample(c(A,B),8,1))
col1 col2 col3
1 B1 B1 A1
2 B2 B1 A5
3 A3 A5 B1
4 B3 B2 B3
5 A2 B2 A2
6 A1 A1 B2
7 A2 A3 A4
8 A5 A5 A1
The expected output should be:
col1 col2 col3
1 B1 A1 B1
2 B1 A5 B2
3 B1 A3 A5
4 B2 B3 B3
5 B2 A2 A2
6 B2 A1 A1
7 A2 A3 A4
8 A1 A5 A5
So the sort order within each row is c('B1', 'B2', 'B3', 'A1', 'A2', 'A3', 'A4', 'A5'), with one exception: if the first column already holds one of the B's, the remaining columns continue with the A's.
A further problem is that the data frame has three more columns containing numbers, which should be rearranged in the same order as these three columns.
You can use apply, factor and sort twice with different orders.
order1 = c('B1', 'B2', 'B3', 'A1', 'A2', 'A3', 'A4', 'A5') #Main order
order2 = c('A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3') #Secondary order for rows with 1st column as "B"
startB <- grepl("B", df[, 1]) #Rows with 1st column being "B"
df <- data.frame(t(apply(df, 1, \(x) sort(factor(x, levels = order1)))))
df[startB, -1] <- t(apply(df[startB, ], 1, \(x) sort(factor(x[-1], levels = order2))))
output
X1 X2 X3
1 B1 A1 B1
2 B1 A5 B2
3 B1 A3 A5
4 B2 B3 B3
5 B2 A2 A2
6 B2 A1 A1
7 A2 A3 A4
8 A1 A5 A5
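To see why the factor trick above works, here is a minimal sketch on a single row taken from the example (row 3). It only assumes base R; the name sorted is introduced here for illustration.

```r
# One row of the example data, as a plain character vector
x <- c("A3", "A5", "B1")
order1 <- c("B1", "B2", "B3", "A1", "A2", "A3", "A4", "A5")

# sort() on a factor orders by level position, not alphabetically,
# so the custom level order decides which value comes first
sorted <- as.character(sort(factor(x, levels = order1)))
sorted
```

The same call with levels = order2 puts the A's first, which is what the second pass does for the rows that start with a B.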
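This may be more convoluted than necessary, but a dplyr and purrr option is:
library(dplyr)
library(purrr)
# vec1 and vec2 are not defined in this answer; presumably they are the two orderings (B's first and A's first)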
map2_dfr(.x = df %>%
group_split(cond = as.numeric(grepl("^B", col1))),
.y = list(vec1, vec2),
~ .x %>%
mutate(pmap_dfr(across(c(starts_with("col"), - pluck(select(.x, "cond"), 1))),
function(...) set_names(c(...)[order(match(c(...), .y))], names(c(...))))))
col1 col2 col3 cond
<chr> <chr> <chr> <dbl>
1 B1 A3 A5 0
2 B2 A2 A2 0
3 B2 A1 A1 0
4 A2 A3 A4 0
5 A1 A5 A5 0
6 B1 A1 B1 1
7 B2 A5 B1 1
8 B3 B2 B3 1
My solution so far:
A = c('A1', 'A2', 'A3', 'A4', 'A5')
B = c('B1', 'B2', 'B3')
set.seed(100)
N = 20
df_1 = data.frame(col1 = sample(c(A,B),N,1), col2 = sample(c(A,B),N,1), col3 = sample(c(A,B),N,1))
vec = c('B1', 'B2', 'B3', 'A1', 'A2', 'A3', 'A4', 'A5')
df_2 = t(apply(df_1,1,function(x)match(x,vec)))
df_3 = t(apply(df_2,1,sort))
tr = rowSums(matrix(df_3 %in% c(1,2,3),nrow(df_3), ncol(df_3))) == 2
change = which((df_3[,2]*tr)!=0)
save = df_3[change,2]
df_3[change,2] = df_3[change,3]
df_3[change,3] = save
df_4 = matrix(vec[df_3],nrow(df_3),ncol(df_3))
Going from df_2 to df_3, the positions of the numbers change, so I can rearrange the other columns accordingly.
It looks a little complicated, though.
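One way to carry the same rearrangement over to the companion number columns is to compute the per-row permutation once and reuse it. The sketch below is only an illustration under stated assumptions: the toy matrices labels and values are hypothetical, and the exception for rows that start with a B is left out to keep it short.

```r
vec <- c("B1", "B2", "B3", "A1", "A2", "A3", "A4", "A5")

# Hypothetical toy data: three label columns and three matching value columns
labels <- matrix(c("A3", "A5", "B1",
                   "A2", "A3", "A4"), nrow = 2, byrow = TRUE)
values <- matrix(c(10, 20, 30,
                   40, 50, 60), nrow = 2, byrow = TRUE)

# Per-row permutation that sorts the labels by their position in vec
perm <- t(apply(labels, 1, function(x) order(match(x, vec))))

# Matrix indexing: pair each row number with its permuted column numbers
idx <- cbind(as.vector(row(perm)), as.vector(perm))
labels_sorted <- matrix(labels[idx], nrow = nrow(labels))
values_sorted <- matrix(values[idx], nrow = nrow(values))
```

Because labels and values are indexed with the same idx, each number stays attached to the label it started with.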
If I have a data frame like so:
df <- data.frame(
a = c(1,1,1,2,2,2,3,3,3),
b = c(1,2,3,1,2,3,1,2,3)
)
which looks like this:
> df
a b
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
Is there a quick way to change the columns a and b to match the example below, without explicitly having to type it all out?
> df
a b
a1 b1
a1 b2
a1 b3
a2 b1
a2 b2
a2 b3
a3 b1
a3 b2
a3 b3
In other words, I'm trying to take the name of the column and place it in front of the value that was originally in that row.
We can use cur_column() to return the corresponding column name within across(), and paste it (with str_c) in front of each column value.
library(dplyr)
library(stringr)
df1 <- df %>%
mutate(across(everything(), ~ str_c(cur_column(), .)))
-output
df1
# a b
#1 a1 b1
#2 a1 b2
#3 a1 b3
#4 a2 b1
#5 a2 b2
#6 a2 b3
#7 a3 b1
#8 a3 b2
#9 a3 b3
Or using base R
df[] <- Map(paste0, names(df), df)
Or another option is
df[] <- paste0(names(df)[col(df)], unlist(df))
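For readers unfamiliar with Map(), the base R one-liner can be unpacked as follows. This is a small sketch on a shortened, hypothetical version of the data; the intermediate name prefixed is introduced here only for illustration.

```r
df <- data.frame(a = c(1, 1, 2), b = c(1, 2, 3))

# Map() walks names(df) and the columns of df in parallel,
# so paste0() receives one column name and one whole column per call
prefixed <- Map(paste0, names(df), df)

# Assigning into df[] keeps the data frame structure and replaces every column
df[] <- prefixed
df
```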
I want to write two vectors of different length with partly equal values into one data frame. The same values should be written in the same row.
ef1 <- c('A1', 'A2', 'B0', 'B1', 'C1', 'C2')
ef2 <- c('A1', 'A2', 'C1', 'C2', 'D1', 'D2')
If I write them in one data frame, it looks like this:
df <- data.frame (ef1, ef2)
> df
ef1 ef2
1 A1 A1
2 A2 A2
3 B0 C1
4 B1 C2
5 C1 D1
6 C2 D2
But what I want is this:
> df
ef1 ef2
1 A1 A1
2 A2 A2
3 B0 NA
4 B1 NA
5 C1 C1
6 C2 C2
7 NA D1
8 NA D2
I'm grateful for any help.
One option is match
(tmp <- unique(c(ef1, ef2)))
# [1] "A1" "A2" "B0" "B1" "C1" "C2" "D1" "D2"
out <- data.frame(ef1 = ef1[match(tmp, ef1)],
ef2 = ef2[match(tmp, ef2)])
Result
out
# ef1 ef2
#1 A1 A1
#2 A2 A2
#3 B0 <NA>
#4 B1 <NA>
#5 C1 C1
#6 C2 C2
#7 <NA> D1
#8 <NA> D2
Another solution uses dplyr's full_join. The idea is to artificially create a merging column and then do a full join.
library(dplyr)
ef1<-tibble(a=ef1,ef1=ef1)
ef2<-tibble(a=ef2,ef2=ef2)
ef1 %>%
full_join(ef2,by="a") %>%
select(ef1,ef2)
# A tibble: 8 x 2
ef1 ef2
<chr> <chr>
1 A1 A1
2 A2 A2
3 B0 NA
4 B1 NA
5 C1 C1
6 C2 C2
7 NA D1
8 NA D2
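The same "helper key plus full join" idea can also be sketched in base R with merge(). The column name key and the objects left and right are hypothetical names chosen for this illustration.

```r
ef1 <- c("A1", "A2", "B0", "B1", "C1", "C2")
ef2 <- c("A1", "A2", "C1", "C2", "D1", "D2")

# 'key' is a helper column used only for the join
left  <- data.frame(key = ef1, ef1 = ef1)
right <- data.frame(key = ef2, ef2 = ef2)

# all = TRUE keeps keys found in either input (a full outer join);
# merge() sorts the result by the key column by default
out <- merge(left, right, by = "key", all = TRUE)[, c("ef1", "ef2")]
out
```

Note that merge() orders the rows by key, which happens to match the wanted output here because the values are already in alphabetical order.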
I have a very large dataset (around 500k rows and 15 columns). One of the columns holds several values separated by semicolons, as follows:
Date a b c d
01-01-2020 A1 B1 C1a;C1b D1
30-12-2019 A2 B2 C2a;C2b;C2c D2
33-5-2018 A3 B3 C3a;C3b;C3c;C3d D3
20-11-2019 A4 B4 C4a;C4b D4
I would like to split column c so that there are only two columns (c_01 and c_02 in the output below). When there are more than two values in c, as in rows 2 and 3, I want to create one row for each unique pair of the Cs, with all other columns unchanged. The result would then look like:
Date a b c_01 c_02 d
01-01-2020 A1 B1 C1a C1b D1
30-12-2019 A2 B2 C2a C2b D2
30-12-2019 A2 B2 C2a C2c D2
30-12-2019 A2 B2 C2b C2c D2
33-5-2018 A3 B3 C3a C3b D3
33-5-2018 A3 B3 C3a C3c D3
33-5-2018 A3 B3 C3a C3d D3
33-5-2018 A3 B3 C3b C3c D3
33-5-2018 A3 B3 C3b C3d D3
33-5-2018 A3 B3 C3c C3d D3
20-11-2019 A4 B4 C4a C4b D4
I have tried to use cSplit to create a single column for each value and then a for loop over the rows, but it does not really work. I have also tried the apply function to create something similar to a loop, but the dataset is too large and I keep receiving errors. Can someone help? Thank you very much!
We could use strsplit to split the 'c' column by ';', then loop over the resulting list with map, get the pairwise combinations with combn, convert them to a data.frame, and unnest the list column of data.frames.
library(dplyr)
library(tidyr)
library(purrr)
df1 %>%
mutate(c = map(strsplit(c, ";"), ~ combn(.x, 2) %>%
t %>%
as.data.frame %>%
set_names(c('c_01', 'c_02')))) %>%
unnest(c(c))
# A tibble: 11 x 6
# Date a b c_01 c_02 d
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 01-01-2020 A1 B1 C1a C1b D1
# 2 30-12-2019 A2 B2 C2a C2b D2
# 3 30-12-2019 A2 B2 C2a C2c D2
# 4 30-12-2019 A2 B2 C2b C2c D2
# 5 33-5-2018 A3 B3 C3a C3b D3
# 6 33-5-2018 A3 B3 C3a C3c D3
# 7 33-5-2018 A3 B3 C3a C3d D3
# 8 33-5-2018 A3 B3 C3b C3c D3
# 9 33-5-2018 A3 B3 C3b C3d D3
#10 33-5-2018 A3 B3 C3c C3d D3
#11 20-11-2019 A4 B4 C4a C4b D4
Or using base R
lst1 <- lapply(strsplit(df1$c, ";"),
function(x) as.data.frame(t(combn(x, 2))))
l1 <- sapply(lst1, nrow)
out <- cbind(df1[rep(seq_len(nrow(df1)), l1),c('Date', 'a', 'b', 'd')],
do.call(rbind, lst1))
row.names(out) <- NULL
names(out)[5:6] <- c("c_01", "c_02")
out
# Date a b d c_01 c_02
#1 01-01-2020 A1 B1 D1 C1a C1b
#2 30-12-2019 A2 B2 D2 C2a C2b
#3 30-12-2019 A2 B2 D2 C2a C2c
#4 30-12-2019 A2 B2 D2 C2b C2c
#5 33-5-2018 A3 B3 D3 C3a C3b
#6 33-5-2018 A3 B3 D3 C3a C3c
#7 33-5-2018 A3 B3 D3 C3a C3d
#8 33-5-2018 A3 B3 D3 C3b C3c
#9 33-5-2018 A3 B3 D3 C3b C3d
#10 33-5-2018 A3 B3 D3 C3c C3d
#11 20-11-2019 A4 B4 D4 C4a C4b
data
df1 <- structure(list(Date = c("01-01-2020", "30-12-2019", "33-5-2018",
"20-11-2019"), a = c("A1", "A2", "A3", "A4"), b = c("B1", "B2",
"B3", "B4"), c = c("C1a;C1b", "C2a;C2b;C2c", "C3a;C3b;C3c;C3d",
"C4a;C4b"), d = c("D1", "D2", "D3", "D4")), class = "data.frame",
row.names = c(NA,
-4L))
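Both answers rely on combn() to build the pairs. As a minimal sketch of what happens for a single cell of column c (the names parts and pairs are introduced here for illustration):

```r
# One cell of column c, split on ";"
parts <- strsplit("C2a;C2b;C2c", ";")[[1]]

# combn(x, 2) returns a 2-row matrix with one pair per column,
# so transposing gives one pair per row
pairs <- t(combn(parts, 2))
pairs

# The number of rows generated for a cell with n values is choose(n, 2)
nrow(pairs) == choose(length(parts), 2)
```

A cell with n values therefore expands to choose(n, 2) rows, which matters for a 500k-row dataset. Cells with fewer than two values are not covered by the example data and would need separate handling.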
I am trying to group by two variables and split the comma-separated values without increasing the number of rows.
For example:
#my dataframe
> df
g1 g2 g3
1 a1 a2 77.7,81.7
2 a1 a2 77.7,81.7
3 b2 b3 3,1,5
4 b2 b3 3,1,5
5 b2 b3 3,1,5
Expected Output:
g1 g2 g3
1 a1 a2 77.7
2 a1 a2 81.7
3 b2 b3 3
4 b2 b3 1
5 b2 b3 5
I tried the code below, but it does not group the data and the result is not in the expected format. Please help!
Code:
df <- data.frame(g1 = c("a1","a1","b2","b2","b2"), g2 = c("a2","a2","b3","b3","b3"), g3 = c("77.7,81.7","77.7,81.7","3,1,5","3,1,5","3,1,5"))
library(stringr)
s <- strsplit(df$g3, split = ",")
data.frame(V1 = rep(df$g1, sapply(s, length)), V2 = unlist(s))
Building on Chris Ruehlemann's answer: you can use the following, which still works if a value reappears in another group.
df$g3_split <- unlist(lapply(split(df,df$g1), function(x) unique(unlist(strsplit(x$g3, ","))) ))
df
g1 g2 g3 g3_split
1 a1 a2 77.7,81.7 77.7
2 a1 a2 77.7,81.7 81.7
3 b2 b3 3,77.7,5 3
4 b2 b3 3,77.7,5 77.7
5 b2 b3 3,77.7,5 5
DATA:
df <- data.frame(g1 = c("a1","a1","b2","b2","b2"),
g2 = c("a2","a2","b3","b3","b3"),
g3 = c("77.7,81.7","77.7,81.7","3,1,5","3,1,5","3,1,5"), stringsAsFactors = F)
SOLUTION:
df$g3_split <- unique(unlist(strsplit(df$g3, ",")))
RESULT:
df
g1 g2 g3 g3_split
1 a1 a2 77.7,81.7 77.7
2 a1 a2 77.7,81.7 81.7
3 b2 b3 3,1,5 3
4 b2 b3 3,1,5 1
5 b2 b3 3,1,5 5
If you want to replace g3 with the new values, just assign unique(unlist(strsplit(df$g3, ","))) to df$g3 instead of df$g3_split.
An option with separate_rows
library(dplyr)
library(tidyr)
df %>%
mutate( g3_split = g3) %>%
separate_rows(g3_split) %>%
distinct(g3_split, .keep_all = TRUE)
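The approaches above rely on unique() or distinct(), so they would drop a value that is repeated inside one string (for example "3,3,5"). A position-based sketch avoids that, under the assumption that every (g1, g2) group has exactly as many rows as its g3 string has pieces. The names pos and pieces are introduced here for illustration.

```r
df <- data.frame(g1 = c("a1", "a1", "b2", "b2", "b2"),
                 g2 = c("a2", "a2", "b3", "b3", "b3"),
                 g3 = c("77.7,81.7", "77.7,81.7", "3,1,5", "3,1,5", "3,1,5"),
                 stringsAsFactors = FALSE)

# Position of each row within its (g1, g2) group: 1, 2, 1, 2, 3
pos <- ave(seq_len(nrow(df)), df$g1, df$g2, FUN = seq_along)

# Take the pos-th comma-separated piece of each row's g3
pieces <- strsplit(df$g3, ",")
df$g3_split <- mapply(function(p, k) p[k], pieces, pos)
df
```

If a group had more rows than pieces, the extra rows would get NA, so it is worth checking that assumption on real data first.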
I have a dataframe:
df <- data.frame(id = c('1','2','3'), b = c('b1', '', 'b3'), c = c('c1', 'c2', ''), d = c('d1', '', ''))
id b c d
1 b1 c1 d1
2 c2
3 b3
where the row with id 1 is completely filled, with no empty column values. I want to copy the cell values from id 1 into the rows with id 2 and 3 wherever those rows have missing values. The final output should look something like:
df2 <- data.frame(id = c('1','2','3'), b = c('b1', 'b1', 'b3'), c = c('c1', 'c2', 'c1'), d = c('d1', 'd1', 'd1'))
id b c d
1 b1 c1 d1
2 b1 c2 d1
3 b3 c1 d1
Thank you for your help in advance
Use some matrix indexing to get the "" cases and then overwrite selecting the appropriate column from the first row of df:
idx <- which(df[-1]=="", arr.ind=TRUE)
df[-1][idx] <- unlist(df[1,-1][idx[,"col"]])
# id b c d
#1 1 b1 c1 d1
#2 2 b1 c2 d1
#3 3 b3 c1 d1
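For comparison, a more explicit column-by-column version of the same fill might look like the sketch below. It assumes, as in the question, that the first row is complete and that missing cells are empty strings; the loop variable and the name empty are introduced here for illustration.

```r
df <- data.frame(id = c("1", "2", "3"),
                 b = c("b1", "", "b3"),
                 c = c("c1", "c2", ""),
                 d = c("d1", "", ""),
                 stringsAsFactors = FALSE)

# For every column except id, overwrite the empty strings
# with that column's value from the first row
for (col in names(df)[-1]) {
  empty <- df[[col]] == ""
  df[[col]][empty] <- df[[col]][1]
}
df
```

The matrix-indexing answer does the same thing in one step, while the loop is easier to adapt if different columns need different fill rules.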