R - Splitting a dataframe by using strsplit, but keep delimiter [duplicate] - r

This question already has an answer here:
R split on delimiter (split) keep the delimiter (split)
(1 answer)
Closed 2 months ago.
I have a dataframe like the following:
ref = c("ab/1bc/1", "dd/1", "cc/1", "2323")
text = c("car", "train", "mouse", "house")
data = data.frame(ref, text)
Which produces this:
If a cell in the ref column contains /1, I want to split it after each /1 and duplicate the row.
I.e. the table above should look like this:
I have the following code, which splits the cell by the /1, but it also removes it. I thought about adding /1 back onto every ref, but not all refs have it.
data1 = data %>%
mutate(ref = strsplit(as.character(ref), "/1")) %>%
unnest(ref)
Some of the other answers use regex for when people split by things like &/,. etc, but not /1. Any ideas?

With separate_rows and look-behind:
library(tidyr)
library(dplyr)
data %>%
separate_rows(ref, sep = "(?<=/1)") %>%
filter(ref != "")
output
# A tibble: 5 × 2
ref text
<chr> <chr>
1 ab/1 car
2 bc/1 car
3 dd/1 train
4 cc/1 mouse
5 2323 house
Or with strsplit:
data %>%
mutate(ref = strsplit(ref, "(?<=/1)", perl = TRUE)) %>%
unnest(ref)
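The same look-behind split can also be sketched in base R, without tidyr. This is only an illustration that reuses the question's example data; the names `parts` and `result` are made up here. The row duplication is done by repeating each `text` value once per piece.

```r
# Example data from the question
ref  <- c("ab/1bc/1", "dd/1", "cc/1", "2323")
text <- c("car", "train", "mouse", "house")
data <- data.frame(ref, text)

# Split after every "/1"; the look-behind matches a position, so "/1" is kept
parts <- strsplit(as.character(data$ref), "(?<=/1)", perl = TRUE)

# Repeat each text value once per piece of its ref
result <- data.frame(
  ref  = unlist(parts),
  text = rep(as.character(data$text), lengths(parts))
)
result
```

Refs without /1, such as 2323, come through unchanged as a single piece.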

Related

Using R Regex to identify two characters followed by a dash and two numbers

Very obnoxious regex question incoming! I have a column that I am trying to split into two based off a condition. I'd like a new column to be created when there are two characters, followed by a dash and two numbers (e.g., CA-01).
My code is:
mydf %>% extract(col = pilot_id, regex = "[a-z]{2}.d{2}", into = 'facility_test')
Where the column I'd like to identify the pattern in is pilot_id, and the new column I'd like to make is facility_test.
We need a capture group in the regex passed to extract:
library(dplyr)
library(tidyr)
mydf %>%
extract(col = pilot_id, regex = ".*-([A-Z]{2}-\\d{2})\\s.*",
into = 'facility_test')
# A tibble: 1 x 1
# facility_test
# <chr>
#1 FL-03
data
mydf <- tibble(pilot_id = "TGT Track -FL-03 (Hilsborough County) 3/3/2021")
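If only the matching piece is needed, a base R sketch without tidyr could look like the following. It assumes the code is always two uppercase letters, a dash and two digits, and that the first such match in the string is the one wanted; the variable name `facility_test` simply mirrors the question.

```r
pilot_id <- "TGT Track -FL-03 (Hilsborough County) 3/3/2021"

# regexpr() locates the first match, regmatches() pulls it out
facility_test <- regmatches(pilot_id, regexpr("[A-Z]{2}-[0-9]{2}", pilot_id))
facility_test
```

Note that `regmatches()` silently drops elements with no match, so for a whole column the result can be shorter than the input.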

How to optimize For Loops in R? I am aware of the apply function but currently facing problem in applying it

So basically I have a vector of tags that I want to find in my Transcript column (row by row) and if I find any word from the tags in my Transcript string, I want to create a separate column concatenating all the tags as shown in the example below (see image):
tags=c("loan","deposit","quarter","morning")
So, the output should look like this:
Output Result
Currently, I am able to tag this by using two for loops i.e. one to go over Tags vector and the other to go over my data frame's Transcript column one-by-one. But, I have a tag list of around 500 words and data frame has more than 100,000 rows. So, I am concerned about the run time. Is there any better way to optimize my R code using apply function or any other method?
I am using the following code to tag all the rows of the Transcript column one by one:
for (i in 1:length(tags)) {
  for (j in 1:nrow(FinalData)) {
    check_tag <- str_extract(string = FinalData$Cleaned_Transcript[j], pattern = tags[i])
    if (is.na(check_tag) == FALSE) {
      FinalData$Tags[j] <- stri_remove_empty(paste(FinalData$Tags[j], check_tag, sep = ","))
    }
  }
}
Not sure if you are open to not using a for loop, but if so, here's a tidyverse approach.
library(tidyverse)
dat <- data.frame(Transcript = c("This is example text a", "this is loan", "deposit is not quarter"))
# as per the comment from the OP, we want to provide an input vector of tags
my_tags <- c("loan", "deposit", "quarter", "morning")
my_tags_collapsed <- str_c(my_tags, collapse = "|")
# We can now use the collapsed tags in the str_extract_all function
dat %>%
mutate(test = str_extract_all(Transcript, my_tags_collapsed)) %>%
unnest_wider(test) %>%
mutate(across(-Transcript, replace_na, "")) %>%
mutate(Tags_Marked = apply(across(-Transcript), 1, str_c, collapse = ",")) %>%
select(Transcript, Tags_Marked)
Which gives:
# A tibble: 3 x 2
Transcript Tags_Marked
<chr> <chr>
1 This is example text a ,
2 this is loan loan,
3 deposit is not quarter deposit,quarter
Admittedly, this is not perfect, since you still get the comma separator where a match is empty.
An alternative is to not concatenate the strings into one column but keep them as separate columns, which means you can stop much earlier:
dat %>%
mutate(test = str_extract_all(Transcript, my_tags_collapsed)) %>%
unnest_wider(test)
which would give you:
# A tibble: 3 x 3
Transcript ...1 ...2
<chr> <chr> <chr>
1 This is example text a NA NA
2 this is loan loan NA
3 deposit is not quarter deposit quarter
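A base R sketch of the same idea is shown below. It pastes only the matches that were actually found, so no stray commas appear. It reuses the example data from this answer; `tags_marked` is a name chosen for illustration, and plain substring matching is assumed (a tag like "loan" would also match inside "loaned").

```r
transcript <- c("This is example text a", "this is loan", "deposit is not quarter")
tags <- c("loan", "deposit", "quarter", "morning")

# One alternation pattern instead of a loop over tags
pattern <- paste(tags, collapse = "|")

# All matches per row, as a list of character vectors
matches <- regmatches(transcript, gregexpr(pattern, transcript))

# Collapse each row's matches; rows without a match become ""
tags_marked <- vapply(matches, paste, character(1), collapse = ",")
tags_marked
```

Because the work is vectorised over rows, there is no explicit loop over either the tags or the transcripts.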

List of unique characters in a column [duplicate]

This question already has answers here:
keep only unique elements in string in r
(2 answers)
Closed 2 years ago.
I am trying to figure out how to extract all the unique characters from a certain column. For example, if one of my column has the following rows,
june
july&
august%
then I would like r to give me the list of all the unique characters, i.e,
junely&agst%
How can this be done in R?
Split the column values at each character and paste only unique characters.
x <- c('june', 'july&', 'august%')
paste0(unique(unlist(strsplit(x, ''))), collapse = "")
#[1] "junely&agst%"
Maybe a tidyverse approach will be useful:
library(dplyr)
library(purrr)
library(stringr)
# input
x <- c("june", "july&", "august%")
expected <- "junely&agst%"
# modify
actual <- x %>% str_split(pattern = "") %>% flatten_chr %>% unique %>% paste0(collapse = "")
# validate
stopifnot(actual == expected)

group by a id and concatenate where matches into a new features [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 4 years ago.
sample_data <- data.frame(id = c("123abc", "def456", "789ghi", "123abc"),
some_str = c("carrots", "bananas", "apples", "cabbage"))
I would like to know how to wrangle sample df to be like this:
desired_df <- data.frame(id = c("123abc", "def456", "789ghi"),
some_str_concat = c("carrots, cabbage", "bananas", "apples"))
Each id may appear multiple times. In that case I would like to get the corresponding value from some_str and concatenate into a new feature, where the new df is grouped on id.
In the example above, id 123abc appears twice: first with a value of "carrots" and then again with a value of "cabbage". Thus, the desired data frame has a single row for 123abc with the value "carrots, cabbage".
How can I do this? Ideally within either base r or dplyr.
sample_data %>%
  group_by(id) %>%
  mutate(some_str = paste(some_str, collapse = ", ")) %>%
  distinct()
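Since the question also asks about base R, one possible sketch uses `aggregate()`, which passes the extra `collapse` argument on to `paste()`. The result name `res` is made up for illustration, and the rows come back ordered by id rather than in order of first appearance.

```r
sample_data <- data.frame(id = c("123abc", "def456", "789ghi", "123abc"),
                          some_str = c("carrots", "bananas", "apples", "cabbage"))

# One row per id, with all some_str values pasted together
res <- aggregate(some_str ~ id, data = sample_data, FUN = paste, collapse = ", ")

# Rename the aggregated column to match the desired output
names(res)[2] <- "some_str_concat"
res
```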

Not able to convert a list into a data frame

I'm new to programming in R, and I've been having this problem for several days now. I started with a list that I created by splitting a file. Each element of this list contains a number of items.
head(sales2)
$`7143443`
[1] "SSS-U-CCXVCSS1" "L-CCX-8GETTS-LIC"
$`7208993`
[1] "NFFGSR4=" "1MV-FT-1=" "VI-NT/TE="
$`7241758`
[1] "PW_SQSGG="
$`9273628`
[1] "O1941-SE9" "CCO887VA-K9" "2901-SEC/K9" "CO1941-C/K9"
$`9371709`
[1] "HGR__SASS=" "WWQTTB0S-L" "WS-RRRT48FP" "WTTTF24PS-L"
[5] "GEDQTT8TS-L" "WD-SRNS-2S-L"
$`9830473`
[1] "SPA$FFSB0S"
I wanted to convert it into a data frame, so I used
x<-do.call(rbind, lapply(sales2,data.frame))
It is converted into a data frame, but the result looks like this:
head(x, 6)
          id
7143443.1 "SSS-U-CCXVCSS1"
7143443.2 "L-CCX-8GETTS-LIC"
7208993.1 "NFFGSR4="
7208993.2 "1MV-FT-1="
7208993.3 "VI-NT/TE="
7241758   "PW_SQSGG="
I want all of 7143443's items in a single row, not in multiple rows.
From this I want to calculate how many rows contain two given items together,
for example "WS-C2960S-48TS-L" and "WS-C2960S-24TS-L":
in how many rows do these two elements both appear?
You could also call it the probability of this pair relative to all other elements.
I am not sure what your final desired output is, but the following script converts your list to a data frame. Perhaps you can begin your analysis from this data frame.
# Create example list
sales2 <- list(`7143443` = c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC"),
`7208993` = c("NFFGSR4=", "1MV-FT-1=", "VI-NT/TE="),
`7241758` = "PW_SQSGG=",
`9273628` = c("O1941-SE9", "CCO887VA-K9", "2901-SEC/K9", "CO1941-C/K9"),
`9371709` = c("HGR__SASS=", "WWQTTB0S-L", "WS-RRRT48FP", "WTTTF24PS-L",
"GEDQTT8TS-L", "WD-SRNS-2S-L"),
`9830473` = "SPA$FFSB0S")
# Load packages
library(dplyr)
library(purrr)
dat <- map(sales2, ~ tibble(Value = .x)) %>%        # Convert each list element to a one-column tibble
  bind_rows(.id = "ID") %>%                         # Combine all tibbles, keeping the list names as ID
  group_by(ID) %>%                                  # Group by the ID column
  summarise(Value = paste0(Value, collapse = " "))  # Collapse the values of each ID into one string
dat
# A tibble: 6 × 2
ID Value
<chr> <chr>
1 7143443 SSS-U-CCXVCSS1 L-CCX-8GETTS-LIC
2 7208993 NFFGSR4= 1MV-FT-1= VI-NT/TE=
3 7241758 PW_SQSGG=
4 9273628 O1941-SE9 CCO887VA-K9 2901-SEC/K9 CO1941-C/K9
5 9371709 HGR__SASS= WWQTTB0S-L WS-RRRT48FP WTTTF24PS-L GEDQTT8TS-L WD-SRNS-2S-L
6 9830473 SPA$FFSB0S
Update
After reading the original poster's comment, I decided to update my solution to count how many rows contain two specified strings.
Here one row is a unique ID, so I assume the request can be rephrased as "How many IDs contain two specified strings?" If this is the case, I would prefer not to collapse the observations, because after collapsing them to one ID per row we would need a strategy to match the strings, such as regular expressions. I am not familiar with regular expressions, so I will leave that for others to provide solutions.
In addition, the original poster did not specify which two strings are the targets, so I will use a strategy in which users can replace the target strings case by case.
dat <- map(sales2, ~ tibble(Value = .x)) %>%  # Convert each list element to a one-column tibble
  bind_rows(.id = "ID")                       # Combine all tibbles, keeping the list names as ID
# After this, there is no need to collapse the rows
# Set the target strings; users can change them here
target_string1 <- c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC")
dat2 <- dat %>%
  filter(Value %in% target_string1) %>%      # Keep rows matching one of the target strings
  distinct(ID, Value, .keep_all = TRUE) %>%  # Keep one row per exact ID/Value duplicate
  count(ID) %>%                              # Count how many distinct targets each ID has
  filter(n > 1) %>%                          # Keep only IDs that contain more than one target
  select(ID)
dat2
# A tibble: 1 × 1
ID
<chr>
1 7143443
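The count, and the share of IDs the question calls a probability, can also be sketched directly on the original list in base R, without building a data frame first. The names `targets` and `has_both` are made up for illustration, and "contains both" is assumed to mean exact matches of both strings within one list element.

```r
sales2 <- list(`7143443` = c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC"),
               `7208993` = c("NFFGSR4=", "1MV-FT-1=", "VI-NT/TE="),
               `7241758` = "PW_SQSGG=",
               `9273628` = c("O1941-SE9", "CCO887VA-K9", "2901-SEC/K9", "CO1941-C/K9"),
               `9371709` = c("HGR__SASS=", "WWQTTB0S-L", "WS-RRRT48FP", "WTTTF24PS-L",
                             "GEDQTT8TS-L", "WD-SRNS-2S-L"),
               `9830473` = "SPA$FFSB0S")

targets <- c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC")

# TRUE for every ID whose items include all target strings
has_both <- vapply(sales2, function(items) all(targets %in% items), logical(1))

names(sales2)[has_both]  # IDs containing both targets
sum(has_both)            # how many IDs contain both
mean(has_both)           # share of all IDs that contain both
```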
