Group values in rows according to similar columns - R

I had a column with multiple comma-separated values in it, like this:
ColumnX1
A,D,C,B,F,E,G
F,A,B,E,G,C
C,D,G,F,A,T
I split the data with
Species_Data2 <- data.frame(str_split_fixed(Species_Data$Other.Anopheline.species, ",", 21))
but the result was a data frame like this:
X1 X2 X3 X4 X5 X6 X7
A D C B F E G
F A B E G NA C
C D G F A T NA
I want to make a data frame like this:
X1 X2 X3 X4 X5 X6 X7 X8
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
and then I want to use the letter values as the column names:
Colnames
'A' 'B' 'C' 'D' 'E' 'F' 'G' 'T'
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
I tried to sort the values, but it does not work that well and comes up with 0 values.

If I understand correctly, the OP wants to rearrange the data so that there is a separate column for each letter. If a letter is present in a row, then the letter appears in the appropriate column/row of the reshaped data. NA indicates that a letter is missing in a row. In addition, the letter columns should be arranged in alphabetical order.
1. dplyr/tidyr approach
If we start with the data.frame resulting from the OP's call to stringr::str_split_fixed(), we need to reshape the split data from wide to long format, remove empty entries, order the rows so that the columns appear in letter order, and reshape back to wide format. For reshaping, a row id is required. To achieve the desired output, pivot_wider() has to be called with the names_from = value parameter:
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF$ColumnX1, ",", 21)) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-rn) %>%
  filter(value != "") %>%
  arrange(as.character(value)) %>%
  pivot_wider(id_cols = rn, names_from = value)
rn A B C D E F G T
<int> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 1 A B C D E F G NA
2 2 A B C NA E F G NA
3 3 A NA C D NA F G T
2. data.table approach
If we start from the unsplit original data, there is a much more concise variant which uses data.table's dcast() for reshaping:
library(data.table)
setDT(DF)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
If required, the additional row id column can be removed in both approaches.
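For example, here is a minimal sketch of that final clean-up step (my addition; the object names wide_tbl and wide_dt are only illustrative), reusing the two pipelines above:
library(dplyr)
library(tidyr)
library(data.table)

# dplyr/tidyr: drop the rn helper column at the end of the pipeline
wide_tbl <- as.data.frame(stringr::str_split_fixed(DF$ColumnX1, ",", 21)) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-rn) %>%
  filter(value != "") %>%
  arrange(as.character(value)) %>%
  pivot_wider(id_cols = rn, names_from = value) %>%
  select(-rn)

# data.table: remove the nrow id column by reference after dcast()
wide_dt <- setDT(DF)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF)][, dcast(.SD, nrow ~ V1)]
wide_dt[, nrow := NULL]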
Data
DF <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G",
                              "F,A,B,E,G,C",
                              "C,D,G,F,A,T"))
EDIT: Duplicate values
In a comment, the OP has disclosed that the production dataset contains duplicate values.
In case of duplicate values, dcast() uses the length() function by default to aggregate the data.
With a modified dataset DF2 which contains duplicate values in rows 1 and 2, the original data.table approach returns:
library(data.table)
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 1 1 2 1 1 1 1 0
2: 2 1 1 1 0 1 2 1 0
3: 3 1 0 1 1 0 1 1 1
Here, the number of duplicate letters is shown.
The expected behaviour can be restored by removing the duplicate values before reshaping by using unique():
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][
  , dcast(unique(.SD), nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
Also the dplyr/tidyr approach needs to be modified by specifying an appropriate aggregation function in the call to pivot_wider():
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF2$ColumnX1, ",", 21)) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-rn) %>%
  filter(value != "") %>%
  arrange(as.character(value)) %>%
  pivot_wider(id_cols = rn, names_from = value, values_fn = list(value = unique))
Data with duplicate values
DF2 <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G,C",
                               "F,A,B,E,G,C,F",
                               "C,D,G,F,A,T"))

Related

R Replace column values in a dataframe based on a matching index column in a separate dataframe

I have a dataframe 'df1' that looks like:
Number  Variable1  Variable  Variable3
1       A          B         C
2       A          B         C
3       A          B         C
4       A          B         C
5       A          B         C
And I have a second dataframe 'df2' that looks like:
Number  Variable1  Variable  Variable3
1       D          E         F
2       G          H         I
3       J          K         L
4       M          N         O
15      P          Q         R
I want to update the three Variable columns in df1 with the data in the Variable columns in df2 based on matching values in Number so that df1 ends up looking like:
Number  Variable1  Variable  Variable3
1       D          E         F
2       G          H         I
3       J          K         L
4       M          N         O
5       A          B         C
You could use power_left_join() from the powerjoin package with conflict = coalesce_yx like this:
library(powerjoin)
power_left_join(df1, df2, by = "Number", conflict = coalesce_yx)
#> Number Variable1 Variable Variable3
#> 1 1 D E F
#> 2 2 G H I
#> 3 3 J K L
#> 4 4 M N O
#> 5 5 A B C
Created on 2022-12-13 with reprex v2.0.2
Data:
df1 <- read.table(text = 'Number Variable1 Variable Variable3
1 A B C
2 A B C
3 A B C
4 A B C
5 A B C
', header = TRUE)
df2 <- read.table(text = 'Number Variable1 Variable Variable3
1 D E F
2 G H I
3 J K L
4 M N O
15 P Q R
', header = TRUE)
It would have been helpful if dput(df) output had been provided. I have created another dataset for replication:
df1<-cbind.data.frame(id=c(1:5),var1=rep("A",5),var2=rep("B",5),var3=rep("C",5))
df2<-cbind.data.frame(id=c(1:4,15),var1=LETTERS[7:11],var2=LETTERS[12:16],var3=LETTERS[16:20])
library(dplyr)
df1 %>%
  left_join(df2, by = "id") %>%
  mutate(var1 = coalesce(var1.y, var1.x),
         var2 = coalesce(var2.y, var2.x),
         var3 = coalesce(var3.y, var3.x)) %>%
  select(-var1.y, -var1.x,
         -var2.y, -var2.x,
         -var3.y, -var3.x)
With dplyr, we can use rows_update():
library(dplyr)
rows_update(df1, df2, by = 'Number', unmatched = "ignore")
Output:
Number Variable1 Variable Variable3
1 1 D E F
2 2 G H I
3 3 J K L
4 4 M N O
5 5 A B C
You could update df1 while joining, using the data.table package and the fcoalesce() function:
library(data.table)
cols = c("Variable1", "Variable", "Variable3")
setDT(df1)[df2, (cols) := Map(fcoalesce, mget(paste0("i.", cols)), mget(cols)), on="Number"]
Number Variable1 Variable Variable3
<int> <char> <char> <char>
1: 1 D E F
2: 2 G H I
3: 3 J K L
4: 4 M N O
5: 5 A B C

Merge data frames based on custom condition - string comparison

I'd like to merge rows of two data frames - df1 and df2 using column A:
#df1
A <- c('ab','ab','bc','bc','bc','cd')
B <- floor(runif(6, min=0, max=10))
C <- floor(runif(6, min=0, max=10))
D <- floor(runif(6, min=0, max=10))
E <- c('a, b, c','a, d, e','a, g, h','d, e, f','a, d, f','f, j')
df1 <- data.frame(A,B,C,D,E)
df1
A B C D E
1 ab 5 4 3 a, b, c
2 ab 9 4 0 a, d, e
3 bc 4 4 9 a, g, h
4 bc 5 5 6 d, e, f
5 bc 1 6 6 a, d, f
6 cd 1 2 0 f, j
#df2
A <- c('ab','bc','cd')
B <- floor(runif(3, min=0, max=10))
E <- c('a, d','d, f','n, m')
df2 <- data.frame(A,B,E)
df2
A B E
1 ab 4 a, d
2 bc 7 d, f
3 cd 1 n, m
I can do simply:
df3 <- merge(x=df1, y=df2, by='A', all.x = TRUE)
However, there is a condition for merging. Namely, I'd like to merge rows from df2 into df1 only when all substrings (column E) from df2 are present in df1, so the output should look like this:
df3
A B C D E A.y B.y E.y
1 ab 5 4 3 a, b, c NA NA NA
2 ab 9 4 0 a, d, e ab 6 a, d
3 bc 4 4 9 a, g, h NA NA NA
4 bc 5 5 6 d, e, f bc 7 d, f
5 bc 1 6 6 a, d, f bc 7 d, f
6 cd 1 2 0 f, j NA NA NA
I know there's an option to use %in% for vector comparison. However, I have strings; should I first do some strsplit and unlist and then perform the comparison?
This is pretty messy but should do what you're looking for:
First, expand rows for both E values, then group by the key column to check if any values from RHS E are in LHS E. Then filter based on the lookup table.
library(tidyverse)
df3 <- merge(x=df1, y=df2, by='A', all.x = TRUE)
check_rows <- df3 %>%
  separate_rows(E.y, sep = ',') %>%
  separate_rows(E.x, sep = ',') %>%
  mutate(E.x = trimws(E.x),
         E.y = trimws(E.y)) %>%
  group_by(A) %>%
  mutate(check = E.y %in% E.x,
         check = ifelse(any(check == TRUE), TRUE, FALSE)) %>%
  select(A, check) %>%
  unique() %>%
  filter(check == TRUE)

df3 <- df3 %>%
  filter(A %in% check_rows$A)
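As a side note on the strsplit() idea raised in the question: the following base R sketch (my addition, not part of the original answer; all_present is just an illustrative name) enforces the stricter per-row condition that every element of df2's E must be present in df1's E, and blanks out the merged df2 columns otherwise:
df3 <- merge(x = df1, y = df2, by = "A", all.x = TRUE, suffixes = c(".x", ".y"))

# for each row, check whether all elements of E.y occur among the elements of E.x
all_present <- mapply(function(x, y) {
  if (is.na(y)) return(FALSE)
  all(trimws(strsplit(as.character(y), ",")[[1]]) %in%
        trimws(strsplit(as.character(x), ",")[[1]]))
}, df3$E.x, df3$E.y)

# set the merged df2 columns to NA where the condition does not hold
df3[!all_present, c("B.y", "E.y")] <- NA
df3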

Replace NA in row with value in adjacent row "ROW" not column [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 1 year ago.
Raw data:
V1 V2
1 c1 a
2 c2 b
3 <NA> c
4 <NA> d
5 c3 e
6 <NA> f
7 c4 g
Reproducible Sample Data
V1 = c('c1','c2',NA,NA,'c3',NA,'c4')
V2 = c('a','b','c','d','e','f','g')
df <- data.frame(V1, V2)
Expected output
V1_after V2_after
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
V1_after <- c('c1','c2','c3','c4')
V2_after <- c('a',paste('b','c','d'),paste('e','f'),'g')
data.frame(V1_after,V2_after)
This is sample data. In the real data, the rows where V1 is NA do not follow a regular pattern. It is too difficult for me.
You could make use of zoo::na.locf() for this. It takes the most recent non-NA value and fills all NA values along the way:
library(dplyr)
library(zoo)
df %>%
  mutate(V1 = zoo::na.locf(V1)) %>%
  group_by(V1) %>%
  summarise(V2 = paste0(V2, collapse = " "))
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
A base R option using na.omit + cumsum + aggregate
aggregate(
  V2 ~ .,
  transform(
    df,
    V1 = na.omit(V1)[cumsum(!is.na(V1))]
  ), c
)
gives
V1 V2
1 c1 a
2 c2 b, c, d
3 c3 e, f
4 c4 g
You can fill the NA with the previous non-NA values and summarise the data.
library(dplyr)
library(tidyr)
df %>%
  fill(V1) %>%
  group_by(V1) %>%
  summarise(V2 = paste(V2, collapse = ' '))
# V1 V2
# <chr> <chr>
#1 c1 a
#2 c2 b c d
#3 c3 e f
#4 c4 g

R: create a new data frame that matches row elements grouped by another column

I want to create a new data frame from the df below. In the new data frame (df2), each element of df$name is placed in the first column and matched in its row with the other elements of df$name within the same df$group.
df <- data.frame(group = rep(letters[1:2], each=3),
name = LETTERS[1:6])
> df
group name
1 a A
2 a B
3 a C
4 b D
5 b E
6 b F
In this example, "A", "B", and "C" in df$name belong to "a" in df$group, and I want to put them in the same row in a new data frame. The desired output looks like this:
> df2
V1 V2
1 A B
2 A C
3 B A
4 B C
5 C A
6 C B
7 D E
8 D F
9 E D
10 E F
11 F D
12 F E
We could do this in base R with merge():
out <- setNames(subset(merge(df, df, by.x = 'group', by.y = 'group'),
                       name.x != name.y, select = -group), c("V1", "V2"))
row.names(out) <- NULL
out
# V1 V2
#1 A B
#2 A C
#3 B A
#4 B C
#5 C A
#6 C B
#7 D E
#8 D F
#9 E D
#10 E F
#11 F D
#12 F E
In my opinion, it's a case of self-join. Using dplyr, a solution can be:
library(dplyr)
inner_join(df, df, by = "group") %>%
  filter(name.x != name.y) %>%
  select(V1 = name.x, V2 = name.y)
# V1 V2
# 1 A B
# 2 A C
# 3 B A
# 4 B C
# 5 C A
# 6 C B
# 7 D E
# 8 D F
# 9 E D
# 10 E F
# 11 F D
# 12 F E
df <- data.frame(group = rep(letters[1:2], each=3),
name = LETTERS[1:6])
library(tidyverse)
df %>%
  group_by(group) %>%                                          # for every group
  summarise(v = list(expand.grid(V1 = name, V2 = name))) %>%   # create all combinations of names
  select(v) %>%                                                # keep only the combinations
  unnest(v) %>%                                                # unnest the combinations
  filter(V1 != V2)                                             # exclude pairs of identical names
# # A tibble: 12 x 2
# V1 V2
# <fct> <fct>
# 1 B A
# 2 C A
# 3 A B
# 4 C B
# 5 A C
# 6 B C
# 7 E D
# 8 F D
# 9 D E
# 10 F E
# 11 D F
# 12 E F
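Note that the row order here differs from the desired df2 shown in the question. If that exact order matters, appending an arrange() step (my addition, not part of the original answer) should restore it:
df %>%
  group_by(group) %>%
  summarise(v = list(expand.grid(V1 = name, V2 = name))) %>%
  select(v) %>%
  unnest(v) %>%
  filter(V1 != V2) %>%
  arrange(V1, V2)   # sort rows as A B, A C, B A, ...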

R: Binding columns by key variable

I want to combine two data frames, df1 and df2, by different groups of a key variable x1. It is basically a join operation; however, I do not want the rows to duplicate, and I do not care about the relationship among the added columns.
Assume:
df1:
x1 x2
A 1
A 2
A 3
B 4
B 5
C 6
C 7
df2:
x1 x3
A a
A b
A c
A d
A e
A f
B g
C h
The result should look like this.
df1 + df2:
x1 x2 x3
A 1 a
A 2 b
A 3 c
A NA d
A NA e
A NA f
B 4 g
B 5 NA
C 6 h
C 7 NA
Does anyone have an idea? I would most appreciate your help!
The full_join in dplyr works well for this too. See below:
# recreate your data
library(data.table)
library(dplyr)
df1 <- data.table(x1 = c("A", "A", "A", "B", "B", "C", "C"), x2 = seq(from = 1, to = 7))
df2 <- data.table(x1 = c("A", "A", "A", "A", "A", "A", "B", "C"), x3 = c("a", "b", "c", "d", "e", "f", "g", "h"))
df1[, rowid := rowid(x1)]
df2[, rowid := rowid(x1)]
df3 <- full_join(df1, df2, by = c("x1", "rowid"))
df3$rowid <- NULL
setorder(df3, x1)
To replicate your resulting data.frame you can create row ids by x1 and then merge on those row ids and x1 (but I don't really know if that is what you are trying to accomplish)
library(data.table)
df1 = read.table(text = "x1 x2
A 1
A 2
A 3
B 4
B 5
C 6
C 7", header = T)
df2 = read.table(text = "x1 x3
A a
A b
A c
A d
A e
A f
B g
C h", header = T)
setDT(df1)
setDT(df2)
df1[, rowid := seq(.N), by = x1] # create rowid
df2[, rowid := seq(.N), by = x1] # create rowid
merge(df1, df2, by = c("x1", "rowid"), all = T)[, rowid := NULL][]
x1 x2 x3
1: A 1 a
2: A 2 b
3: A 3 c
4: A NA d
5: A NA e
6: A NA f
7: B 4 g
8: B 5 NA
9: C 6 h
10: C 7 NA
