Is there a way to eliminate duplicate strings inside a column value? (Please check the details below, I really don't know how to put it concisely) [duplicate]

This question already has answers here:
Remove duplicates in string
(3 answers)
Remove duplicate values in string for all values in a column of a data frame [duplicate]
(1 answer)
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 3 years ago.
I have a data frame with two columns:
VAR1. VAR2.
A. 102 million; 102 million
B. 0.1 million; 2 million; 0.1 million; 2 million
I want to remove duplicate values of VAR2. for each row, obtaining
VAR1. VAR2.
A. 102 million
B. 0.1 million; 2 million
How can I do this?
Thank you for your suggestions.

Using base R, we can split the string on ";" (plus any following whitespace) and paste the unique entries back together for VAR2:
sapply(strsplit(df$VAR2, ";\\s*"), function(x) paste(unique(x), collapse = "; "))
#[1] "102 million"            "0.1 million; 2 million"
Using dplyr and tidyr we can use separate_rows to bring VAR2 into different rows and then paste only unique entries per group.
library(dplyr)
library(tidyr)
df %>%
  separate_rows(VAR2, sep = ";\\s*") %>%
  group_by(VAR1) %>%
  summarise(VAR2 = paste(unique(VAR2), collapse = "; "))
# VAR1  VAR2
# <fct> <chr>
#1 A     102 million
#2 B     0.1 million; 2 million

Here is a solution using gsub which seems to work:
x <- "0.1 million; 2 million; 0.1 million; 2 million"
gsub("\\b(\\d+(?:\\.\\d+)?) ([^;]+); (?=.*\\b\\1 \\2\\b)", "", x, perl=TRUE)
[1] "0.1 million; 2 million"
The general strategy used here is to match a number, with an optional decimal component, followed by another word, provided that the same number-word pair appears at least once later in the string. If it does appear again, the earlier occurrence is removed by replacing it with the empty string. Note that the last occurrence of each pair is never deleted, because for it the positive lookahead fails.
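Since gsub is vectorized, the same pattern can be applied to the whole VAR2 column at once; a minimal sketch, assuming a data frame `df` built from the example above:

```r
df <- data.frame(
  VAR1 = c("A.", "B."),
  VAR2 = c("102 million; 102 million",
           "0.1 million; 2 million; 0.1 million; 2 million"),
  stringsAsFactors = FALSE
)

# Each "number word; " pair is removed as long as the same pair still
# appears later in the string, so only the last occurrence survives
df$VAR2 <- gsub("\\b(\\d+(?:\\.\\d+)?) ([^;]+); (?=.*\\b\\1 \\2\\b)",
                "", df$VAR2, perl = TRUE)
df$VAR2
# [1] "102 million"            "0.1 million; 2 million"
```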

Related

How to count specific occurrences of strings within chr data into a new column

I have a data table similar to the one below, with character data held in the Description column. I need to count the number of times certain strings of characters occur in that column, and put the total in the Occurrences column.
In the table below, it would be counting the number of times "A18" or "A19" appears.
ID   Date         Description              Occurrences
1    2020-01-01   A1901,A1804,A2008,AB06   2
2    2020-01-14   A1402,A1805,A1902        2
3    2018-02-25   A1702                    0
I'm very new to R and data tables, so I haven't tried much. I've searched, but only found how to count occurrences of whole strings, not substrings within them.
Use str_count:
library(stringr)
library(dplyr)
df %>%
  mutate(Occurrences2 = str_count(Description, "A18|A19"))
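A base R alternative, as a sketch, counts the matches per row with gregexpr and regmatches instead of str_count:

```r
df <- data.frame(
  ID = 1:3,
  Description = c("A1901,A1804,A2008,AB06", "A1402,A1805,A1902", "A1702"),
  stringsAsFactors = FALSE
)

# regmatches() returns the actual matches per row (an empty vector when
# there are none), so lengths() gives the per-row counts directly
df$Occurrences <- lengths(regmatches(df$Description,
                                     gregexpr("A18|A19", df$Description)))
df$Occurrences
# [1] 2 2 0
```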

Replace strings of numbers separated by commas with the median in R [duplicate]

This question already has answers here:
R: split string into numeric and return the mean as a new column in a data frame
(3 answers)
Closed 2 years ago.
I need help with extracting the strings of numbers, separated by commas, in each element of my df, and replacing each element with its median. For example,
a <- c("3, 3, 5, 5", "7, 7, 5, 5", "3, 4, 4, 5", "5, 7")
b <- c("Karina", "Eva", "Jake", "Ana")
df <- data.frame(b,a)
Now I need to replace variable a with the median of the numbers contained in each element, so it looks like below:
b a
1 Karina 4
2 Eva 6
3 Jake 4
4 Ana 6
A little background: each number is actually the length of a word that belongs to the corresponding name. I need to find the median length for each name and figure out whether names that start with a vowel have a longer median length or not. For example, from the above I would conclude that names starting with a vowel have shorter lengths. I also want to use a test to show that the difference is statistically significant. If someone can guide me in any way, I'd really appreciate it!
We can split the 'a' column with strsplit on a comma followed by zero or more spaces (\\s*), loop over the resulting list, convert each element to numeric, take the median, and assign the result back to the same column:
df$a <- sapply(strsplit(df$a, ",\\s*"), function(x) median(as.numeric(x)))
df$a
#[1] 4 6 4 6
Or using the tidyverse, we can use separate_rows to split the 'a' column and expand the rows while converting the type, then do a group-by median:
library(dplyr)
library(tidyr)
df %>%
  separate_rows(a, convert = TRUE) %>%
  group_by(b) %>%
  summarise(a = median(a))
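For the follow-up goal (do names starting with a vowel have a longer median?), one possible base R sketch; the grepl grouping and the use of aggregate here are assumptions about the intended analysis, and with real data a test such as wilcox.test(a ~ vowel, data = df) would be one option:

```r
a <- c("3, 3, 5, 5", "7, 7, 5, 5", "3, 4, 4, 5", "5, 7")
b <- c("Karina", "Eva", "Jake", "Ana")
df <- data.frame(b, a, stringsAsFactors = FALSE)

# Replace each comma-separated string with the median of its numbers
df$a <- sapply(strsplit(df$a, ",\\s*"), function(x) median(as.numeric(x)))

# Flag names that start with a vowel, then compare the group medians
df$vowel <- grepl("^[AEIOUaeiou]", df$b)
aggregate(a ~ vowel, data = df, FUN = median)
```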

How to split one column to multiple? [duplicate]

This question already has answers here:
Split column at delimiter in data frame [duplicate]
(6 answers)
Closed 3 years ago.
May I ask how to split a column of a data frame into multiple columns? For example:
ID value
10.A.S 1
11.A.S 2
12.A.S 3
10.A 4
11.A 5
12.A 6
I want to split the ID column based on the ".", and the expected result should be like:
ID NO. type treatment value
10.A.S 10 A S 1
11.A.S 11 A S 2
12.A.S 12 A S 3
10.A 10 A 4
11.A 11 A 5
12.A 12 A 6
Thank you very much.
An option is separate. The sep argument in separate takes a regular expression by default. According to ?separate:
sep - If character, is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
The . is a metacharacter that matches any character, so we either escape it (\\.) or place it in a character class ([.]).
library(dplyr)
library(tidyr)
df1 %>%
  separate(ID, into = c("NO.", "type", "treatment"),
           sep = "\\.", remove = FALSE, convert = TRUE)
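A base R sketch of the same split, padding the shorter IDs with NA so every row has three pieces (the padding helper is an assumption, not part of the original answer):

```r
df1 <- data.frame(
  ID = c("10.A.S", "11.A.S", "12.A.S", "10.A", "11.A", "12.A"),
  value = 1:6,
  stringsAsFactors = FALSE
)

parts <- strsplit(df1$ID, ".", fixed = TRUE)  # split on a literal "."
n <- max(lengths(parts))                      # widest ID decides the column count
mat <- t(sapply(parts, function(x) c(x, rep(NA, n - length(x)))))
colnames(mat) <- c("NO.", "type", "treatment")

df1 <- cbind(df1["ID"], as.data.frame(mat, stringsAsFactors = FALSE),
             value = df1$value)
df1$NO. <- as.integer(df1$NO.)                # mimic convert = TRUE
```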

How can I cross tabulate multiple select and single select questions in R [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 5 years ago.
Problem description
I've run a survey with a multiple-select question, where the output is separated by commas in one column, and a grouping question (e.g. sex). Now I want to cross-tabulate those two variables.
Sample data
My data comprises two columns:
A multiple select question, which the survey software outputs as one column with commas separating the selection
A grouping variable, in this case male or female
dat <- data.frame(Multiple = c("A,B,C","B","A,C"), Sex = c("M","F","F"))
Desired output
I want to cross-tabulate the multiple-select options (without commas) with sex:
Multiple Sex Count
A M 1
B M 1
C M 1
A F 1
B F 1
C F 1
Attempted solution
This is a partial solution, where I count the elements in the multiple-select question only. My problem is that I don't know how to include the grouping variable sex in this function, because I am using a regular expression to count the elements in the comma-separated vector:
MSCount <- function(X) {
  # Function to count values in a comma-separated vector
  # Find the possible options from the data alone, e.g. "A", "B", etc.
  Answers <- sort(unique(unlist(strsplit(as.character(X), ","))))
  Answers <- Answers[Answers != ""]  # Drop blank answers
  CountAnswers <- numeric(0)         # Initialise the count as an empty numeric vector
  for (i in seq_along(Answers)) {
    # Count the rows with a match for the answer text
    CountAnswers[i] <- sum(grepl(Answers[i], X))
  }
  SummaryAnswers <- data.frame(
    Answers, CountAnswers,
    PropAnswers = 100 * CountAnswers / length(X[!is.na(X)])
  )
  return(SummaryAnswers)
}
We can use separate_rows
library(tidyverse)
separate_rows(dat, Multiple) %>%
  mutate(Count = 1) %>%
  arrange(Sex, Multiple) %>%
  select(Multiple, Sex, Count)
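If repeated selections should be tallied rather than marked 1 per row, dplyr's count can replace the mutate/arrange/select steps; a sketch (the name argument of count requires dplyr >= 0.8.1):

```r
library(dplyr)
library(tidyr)

dat <- data.frame(Multiple = c("A,B,C", "B", "A,C"), Sex = c("M", "F", "F"))

# One row per selected option, then tally each option/sex combination
res <- separate_rows(dat, Multiple) %>%
  count(Multiple, Sex, name = "Count")
res
```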

Creating new rows for listed substrings [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
My goal is to create a wordcloud in R, but I'm working with nested JSON data (which also happens to be incredibly messy).
There's a nice explanation here for how to create a wordcloud of phrases rather than singular words. I also know melt() from reshape2 can create new rows out of entire columns. Is there a way in R to perform a melt-like function over nested substrings?
Example:
N Group String
1 A c("a", "b", "c")
2 A character(0)
3 B a
4 B c("b", "d")
5 B d
...should become:
N Group String
1 A a
2 A b
3 A c
4 A character(0)
5 B a
6 B b
7 B d
8 B d
...where each subsequent substring is returned to the next row. In my actual data, the pattern c("x, y") is consistent but the substrings are too varied to know a priori.
If there's no great way to do this, too bad... just thought I'd ask the experts!
You can use separate_rows from the tidyr package, then clean up the leftover wrapper text with mutate and gsub:
library(tidyverse)
data %>%
  separate_rows(String, sep = ",") %>%                          # split on commas
  mutate(String = trimws(gsub("^c\\(\"|\")$|\"", "", String)))  # clean up the quotations and parens
