Remove duplicates based on elements in a character vector - r

I have a data frame like this. Column y contains 3 or more characters separated by commas (,). I want to remove a row if all of its characters are the same.
x <-c(1,2,3,4,5)
y <-c("a,a,a","a,a,b,c","b,c,a","b,b,b,b","a,b,b,c")
df<-data.frame(x,y)
The desired output is:
x <-c(2,3,5)
y <-c("a,a,b,c","b,c,a","a,b,b,c")
df<-data.frame(x,y)

You can use separate_rows to split the comma-separated values into separate rows, drop the groups that contain only 1 distinct value, and then summarise the data back into one row per group.
library(dplyr)
df %>%
tidyr::separate_rows(y) %>%
group_by(x) %>%
filter(n_distinct(y) > 1) %>%
summarise(y = toString(y))
# x y
# <dbl> <chr>
#1 2 a, a, b, c
#2 3 b, c, a
#3 5 a, b, b, c
In base R :
df[sapply(strsplit(df$y, ','), function(x) length(unique(x))) > 1, ]
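The base R filter can also be written out step by step. This is only a sketch: the helper name n_unique_tokens is made up for illustration, and as.character() is added on the assumption that y might be a factor (the data.frame default before R 4.0).

```r
x <- c(1, 2, 3, 4, 5)
y <- c("a,a,a", "a,a,b,c", "b,c,a", "b,b,b,b", "a,b,b,c")
df <- data.frame(x, y)

# Hypothetical helper: number of distinct comma-separated tokens per element
n_unique_tokens <- function(s) {
  tokens <- strsplit(as.character(s), ",", fixed = TRUE)
  vapply(tokens, function(tok) length(unique(tok)), integer(1))
}

# Keep only the rows whose y holds more than one distinct token
res <- df[n_unique_tokens(df$y) > 1, ]
res
```

Unlike the separate_rows version, this leaves the strings in y untouched, so they keep the original comma format without added spaces.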

Related

r, dplyr: how to transform values in one column based on value in another column using gsub

I have a dataframe with two (relevant) factors, and I'd like to remove a substring equal to one factor from the value of the other factor, or leave it alone if there is no such substring. Can I do this using dplyr?
To make a MWE, suppose these factors are x and y.
library(dplyr)
df <- data.frame(x = c(rep('abc', 3)), y = c('a', 'b', 'd'))
df:
x y
1 abc a
2 abc b
3 abc d
What I want:
x y
1 bc a
2 ac b
3 abc d
My attempt was:
df |> transform(x = gsub(y, '', x))
However, this produces the following, incorrect result, plus a warning message:
x y
1 bc a
2 bc b
3 bc d
Warning message:
In gsub(y, "", x) :
argument 'pattern' has length > 1 and only the first element will be used
How can I do this?
Unlike gsub, str_remove is vectorized over the pattern, so each value of y is matched against the x in the same row.
library(stringr)
library(dplyr)
df <- df %>%
mutate(x = str_remove(x, y))
-output
df
x y
1 bc a
2 ac b
3 abc d
If we want to use sub/gsub, then we may need rowwise so that the pattern is applied one row at a time
df %>%
rowwise %>%
mutate(x = sub(y, "", x)) %>%
ungroup
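If you would rather stay in base R without rowwise, one possible sketch is to pair the two columns with mapply. The fixed = TRUE argument is my own addition here, on the assumption that the values in y should be treated as literal text and not as regular expressions.

```r
df <- data.frame(x = rep("abc", 3), y = c("a", "b", "d"),
                 stringsAsFactors = FALSE)

# Pair each pattern in y with the string in the same row of x;
# fixed = TRUE treats the pattern as literal text, not a regular expression
df$x <- mapply(function(pattern, string) sub(pattern, "", string, fixed = TRUE),
               df$y, df$x, USE.NAMES = FALSE)
df
```

Like str_remove, sub only removes the first match in each string; swap in gsub if every occurrence should go.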

Paste column content by group into a new group

Here is my data frame:
a <- data.frame(x=c(rep("A",2),rep("B",4)),
y=c("AA","BB","CC","AA","DD","AA"))
What I want is group the data frame by x and for each member of the group (here A or B), I would like to paste the content of column y into a single element, separated by _. I would like to sort it by alphabetical order and remove identical characters. Here is the desired result:
out <- data.frame(x=c(rep("A",1),rep("B",1)),
y=c("AA_BB","AA_CC_DD"))
I tried the following code, which produces an error message:
library(dplyr)
a %>% group_by(x) %>% mutate(y_comb=paste(as.character(sort(unique(y))))) %>%
slice(1) %>% ungroup()
We get the distinct rows of the 'x' and 'y' columns (as there are only two columns, simply use distinct on the entire data), arrange the rows by 'x' and 'y', group by 'x', and paste (str_c) the 'y' elements into a single string by collapsing them with _
library(dplyr)
library(stringr)
a %>%
distinct %>%
arrange(x, y) %>%
group_by(x) %>%
summarise(y = str_c(y, collapse="_"))
-output
# A tibble: 2 x 2
# x y
#* <chr> <chr>
#1 A AA_BB
#2 B AA_CC_DD
The error in the OP's code comes from the change in length after unique, and paste on its own does not join the elements. We need to set either collapse or sep (in this case it is collapse). mutate requires a result with the same length as the number of rows in each group, while summarise does not.
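To see the length point in isolation, here is a tiny sketch outside of dplyr. The vector y is just made-up sample input.

```r
y <- c("AA", "BB", "AA")

# Without collapse, paste() returns one string per input element
no_collapse <- paste(sort(unique(y)))
length(no_collapse)

# With collapse, the elements are joined into a single string
collapsed <- paste(sort(unique(y)), collapse = "_")
collapsed
```

Here no_collapse still has 2 elements for a group of 3 rows, which is the kind of length mismatch mutate complains about, while collapsed is a single string that summarise can store as one row per group.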
Perhaps we can do it like this
a %>%
group_by(x) %>%
summarise(y = paste0(sort(unique(y)), collapse = "_"))
which gives
# A tibble: 2 x 2
x y
<chr> <chr>
1 A AA_BB
2 B AA_CC_DD
Base R option with aggregate :
aggregate(y~x, unique(a), function(x) paste0(sort(x), collapse = '_'))
# x y
#1 A AA_BB
#2 B AA_CC_DD

Data cleaning in R: grouping by number and then by name

A small sample of my dataset looks something like this:
x <- c(1,2,3,4,1,7,1)
y <- c("A","b","a","F","A",".A.","B")
data <- cbind(x,y)
My goal is to first group data that have the same number together and then followed by the same name together (A,a,.A. are considered as the same name for my case).
In other words, the final output should look something like this:
xnew <- c(1,1,3,7,1,2,4)
ynew <- c("A","A","a",".A.","B","b","F")
datanew <- cbind(xnew,ynew)
Currently, I am only able to group by number in the column labelled x. I am unable to group by name yet. I would appreciate any help given.
Note: I need an automated solution as my raw dataset contains over 10,000 lines for the x and y columns.
Assuming what you have is a data frame, data <- data.frame(x,y), and not the matrix that cbind generates, you could combine the different values into one using fct_collapse and then arrange the data by this new column (z) and by x.
library(dplyr)
library(forcats)
data %>%
mutate(z = fct_collapse(y,
"A" = c('A', '.A.', 'a'),
"B" = c('B', 'b'))) %>%
arrange(z, x) %>%
select(-z) -> result
result
# x y
#1 1 A
#2 1 A
#3 3 a
#4 7 .A.
#5 1 B
#6 2 b
#7 4 F
Or you can remove all punctuation from the y column, convert it to upper or lower case, and then arrange.
data %>%
mutate(z = toupper(gsub("[[:punct:]]", "", y))) %>%
arrange(z, x) %>%
select(-z) -> result
result
library(dplyr)
data %>%
as.data.frame() %>%
group_by(x, y) %>%
summarise(records = n()) %>%
arrange(x, y)
According to your question it's just a matter of ordering data.
result <- data[order(data$x, data$y),]
or, considering that you want to collate A, a and .A.:
result <- data[order(data$x, toupper(gsub("[^A-Za-z]","",data$y))),]
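For comparison, here is a base R sketch that sorts by the normalised name first and the number second, which is the order shown in the question's expected output. It assumes data is built with data.frame() and not cbind(), since $ does not work on a matrix; the variable name key is only for illustration.

```r
x <- c(1, 2, 3, 4, 1, 7, 1)
y <- c("A", "b", "a", "F", "A", ".A.", "B")
data <- data.frame(x, y, stringsAsFactors = FALSE)

# Normalised name: strip non-letters, then upper-case ("a", ".A." -> "A")
key <- toupper(gsub("[^A-Za-z]", "", data$y))

# Sort by normalised name first, then by number within each name
result <- data[order(key, data$x), ]
result
```

Nothing is hard-coded per name, so this should scale to the 10,000-line dataset as long as stripping non-letters and ignoring case is the right notion of "same name".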

Dictionary-like matching on string in R

I have a dataframe in which a string variable is an informal list of elements that can be split on a symbol. I would like to perform operations on these elements on the basis of another dataset.
e.g. task: Calculate the sum of the elements
df_1 <- data.frame(element=c(1:2),groups=c("A,B,C","A,D"))
df_2 <- data.frame(groups=c("A","B","C","D"), values=c(1:4))
desired <- data.frame(element=c(1:2),groups=c("A,B,C","A,D"),sum=c(6,5))
An option would be to split 'groups' by the delimiter , and expand the rows with separate_rows, join with the key/value dataset ('df_2'), group by 'element', and get the sum of 'values'
library(tidyverse)
df_1 %>%
separate_rows(groups) %>%
left_join(df_2) %>%
group_by(element) %>%
summarise(groups = toString(groups), sum = sum(values))
# A tibble: 2 x 3
# element groups sum
# <int> <chr> <int>
#1 1 A, B, C 6
#2 2 A, D 5
Another option, in base R, is to use a named key/value vector ('nm1') to look up the values for the split list elements, sum them, and assign the result to a new column in 'df_1'
nm1 <- setNames(df_2$values, df_2$groups)
df_1$sum <- sapply(strsplit(as.character(df_1$groups), ","), function(x) sum(nm1[x]))
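The named-vector lookup can be unpacked into separate steps. This is only a sketch of the same idea: the names lookup and keys are mine, and stringsAsFactors = FALSE is set on the assumption that the columns should be plain character vectors.

```r
df_1 <- data.frame(element = 1:2, groups = c("A,B,C", "A,D"),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(groups = c("A", "B", "C", "D"), values = 1:4,
                   stringsAsFactors = FALSE)

# Named vector acting as a dictionary: names are keys, elements are values
lookup <- setNames(df_2$values, df_2$groups)
lookup[c("A", "D")]

# Split each string into keys, look them up by name, and sum per row
keys <- strsplit(df_1$groups, ",", fixed = TRUE)
df_1$sum <- vapply(keys, function(k) sum(lookup[k]), integer(1))
df_1
```

A key that is missing from df_2 is looked up as NA, so the sum for that row becomes NA as well; whether to pass na.rm = TRUE to sum depends on how unknown elements should be treated.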

Rearrange specific cell according to the tag in front

I work in R and my data looks like the following. Right now there are 3 columns of data in df.
Is there any way to rearrange the columns by the tag in front, namely H, D & A?
So far I can't see any pattern in the wrong arrangement, but I would like to rearrange it for over 1000 rows.
H A D
H":"100#1.68" A:"100#4.35" D:"100#3.35"
H":"100#2.33" D:"100#3.20" A:"100#2.62"
A":"100#2.25" D:"100#3.15" H:"100#2.78"
H":"100#2.80" D:"100#3.25" A:"100#2.18"
D":"100#3.05" A:"100#3.40" H:"100#2.00"
H":"100#2.30" A:"100#2.90" D:"100#2.92"
D":"100#3.05" H:"100#2.25" A:"100#2.85"
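# example data
# comment.char = "" is needed because read.table treats "#" as a comment by default
df = read.table(text = "
H A D
H:100#1.68 A:100#4.35 D:100#3.35
H:100#2.33 D:100#3.20 A:100#2.62
A:100#2.25 D:100#3.15 H:100#2.78
", header=TRUE, stringsAsFactors=FALSE, comment.char="")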
library(tidyverse)
df %>%
gather() %>% # reshape data
group_by(key = substr(value,1,1)) %>% # update column names based on first character
mutate(row_id = row_number()) %>% # add row id (useful for reshaping again)
spread(key, value) %>% # reshape data
select(-row_id) # remove column
# # A tibble: 3 x 3
# A D H
# <chr> <chr> <chr>
# 1 A:100#2.25 D:100#3.20 H:100#1.68
# 2 A:100#4.35 D:100#3.15 H:100#2.33
# 3 A:100#2.62 D:100#3.35 H:100#2.78
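Note that in the output above the values are grouped by tag but no longer sit in their original rows (the first row's A value, A:100#2.25, comes from the third input row). If each row has to keep its own three values, a row-wise base R sketch could look like this. It assumes every row contains each tag exactly once; the names tags and fixed are only for illustration.

```r
df <- read.table(text = "
H A D
H:100#1.68 A:100#4.35 D:100#3.35
H:100#2.33 D:100#3.20 A:100#2.62
A:100#2.25 D:100#3.15 H:100#2.78
", header = TRUE, stringsAsFactors = FALSE, comment.char = "")

tags <- c("H", "A", "D")

# For each row, pick the cell whose first character matches each tag
fixed <- t(apply(df, 1, function(row) {
  unname(row[match(tags, substr(row, 1, 1))])
}))
fixed <- as.data.frame(fixed, stringsAsFactors = FALSE)
names(fixed) <- tags
fixed
```

If a tag is missing from a row, match returns NA and that cell becomes NA, so it is worth checking anyNA(fixed) before relying on the result.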
