I want to remove duplicated text within the values of certain columns of a data frame, like this.
What should I do?
In base R, we can split the 'originaltext' column on a comma followed by zero or more spaces (,\\s*), loop over the resulting list with sapply, keep the unique values, and paste them back together with collapse = "" (that is, without a separator)
df1$result <- sapply(strsplit(df1$originaltext, ",\\s*"),
                     function(x) paste(unique(x), collapse = ""))
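As a quick check of the base R approach, here is a minimal sketch on made-up data. The contents of the data frame are an assumption for illustration; only the column name originaltext comes from the answer above.

```r
# Hypothetical sample data; the real data frame is not shown in the question.
df1 <- data.frame(originaltext = c("a, b, a", "x,x", "p, q"),
                  stringsAsFactors = FALSE)

# Split on a comma plus optional spaces, drop repeats, glue the rest together.
df1$result <- sapply(strsplit(df1$originaltext, ",\\s*"),
                     function(x) paste(unique(x), collapse = ""))

df1$result
# [1] "ab" "x"  "pq"
```

If the deduplicated pieces should stay separated, collapse = ", " can be used instead of collapse = "".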
Here's a way with dplyr and tidyr:
library(dplyr)
df %>%
  mutate(row = row_number()) %>%
  tidyr::separate_rows(original_text, sep = ',\\s*') %>%
  group_by(row) %>%
  summarise(result = paste0(unique(original_text), collapse = ''),
            original_text = toString(original_text)) %>%
  select(-row)
Related
I have an R dataframe with column names as following,
MMR42_L_2_S52_L001_R1_001
MMR42_LN_2_S51_L001_R1_001
MMR43_N_1_S53_L001_R1_001
MMR48_N_1_S54_L001_R1_001
MMR612_S55_L001_R1_001
MMR658_S56_L001_R1_001
I have to remove the _S* from each column name
Desired Column names:
MMR42_L_2
MMR42_LN_2
MMR43_N_1
MMR48_N_1
MMR612
MMR658
My Idea
library(dplyr)
df1 %>%
rename_all(.funs = funs(sub("\\_S*", "", names(df1)))) %>%
I could not get the desired result with the above
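Within rename_at/rename_all, the . stands for the column name, so we don't need names(df1). The pattern also needs .* after the S: "\\_S*" matches an underscore followed by zero or more S characters, so it only removes the first underscore.
library(dplyr)
library(stringr)
df1 %>%
  rename_all(~ str_remove(., "\\_S.*"))
Or using the OP's code (funs() is deprecated as of dplyr 0.8.0, so the formula version above is preferable)
df1 %>%
  rename_all(.funs = funs(sub("\\_S.*", "", .)))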
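To see why the two patterns behave differently, here is a small base R sketch that uses the column names from the question. It relies on sub() only, so it runs without dplyr. The variable names nms and cleaned are made up for the illustration.

```r
# Column names copied from the question.
nms <- c("MMR42_L_2_S52_L001_R1_001", "MMR42_LN_2_S51_L001_R1_001",
         "MMR43_N_1_S53_L001_R1_001", "MMR48_N_1_S54_L001_R1_001",
         "MMR612_S55_L001_R1_001", "MMR658_S56_L001_R1_001")

# "_S*" means "an underscore followed by zero or more S", so sub() only
# deletes the first underscore it finds.
sub("_S*", "", nms[1])
# [1] "MMR42L_2_S52_L001_R1_001"

# "_S.*" means "_S followed by anything", which drops the whole suffix.
cleaned <- sub("_S.*", "", nms)
cleaned
# gives MMR42_L_2, MMR42_LN_2, MMR43_N_1, MMR48_N_1, MMR612, MMR658

# Applied to a data frame: names(df1) <- sub("_S.*", "", names(df1))
# With dplyr >= 1.0.0, rename_with(df1, ~ sub("_S.*", "", .x)) does the same.
```

This assumes that no wanted part of a name contains "_S"; if that could happen, a stricter pattern such as "_S[0-9]+.*" may be safer.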
The dataset below has columns with very similar names and some values which are NA.
library(tidyverse)
dat <- data.frame(
v1_min = c(1,2,4,1,NA,4,2,2),
v1_max = c(1,NA,5,4,5,4,6,NA),
other_v1_min = c(1,1,NA,3,4,4,3,2),
other_v1_max = c(1,5,5,6,6,4,3,NA),
y1_min = c(3,NA,2,1,2,NA,1,2),
y1_max = c(6,2,5,6,2,5,3,3),
other_y1_min = c(2,3,NA,1,1,1,NA,2),
other_y1_max = c(5,6,4,2,NA,2,NA,NA)
)
head(dat)
In this example, v1 and y1 are what I would consider the common "categories" among the columns. To get something similar with my current dataset, I had to use gsub to tease these out:
cats <- dat %>%
  names() %>%
  gsub("^(.*)_(min|max)", "\\1", .) %>%
  gsub("^(.*)_(.*)", "\\2", .) %>%
  unique()
Now, my goal is to mutate a new min and a new max column for each of those categories. So far the code below works just fine.
dat %>%
  rowwise() %>%
  mutate(min_v1 = min(c_across(contains(cats[1])), na.rm = TRUE)) %>%
  mutate(max_v1 = max(c_across(contains(cats[1])), na.rm = TRUE)) %>%
  mutate(min_y1 = min(c_across(contains(cats[2])), na.rm = TRUE)) %>%
  mutate(max_y1 = max(c_across(contains(cats[2])), na.rm = TRUE))
However, the number of categories in my current dataset is quite a bit larger than 2. Is there a way to implement this without repeating the mutate calls for every category?
I've tried a few of the suggestions on this post but haven't quite been able to extend them to this problem.
You can use one of the purrr map functions to apply the same steps to each common category.
library(dplyr)
library(purrr)
result <- bind_cols(dat, map_dfc(cats,
  ~ dat %>%
    rowwise() %>%
    transmute(!!paste('min', .x, sep = '_') := min(c_across(matches(.x)), na.rm = TRUE),
              !!paste('max', .x, sep = '_') := max(c_across(matches(.x)), na.rm = TRUE))))
result
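For comparison, the same per-category row minimum and maximum can be sketched in base R with pmin() and pmax(), which work across rows without rowwise(). This is an alternative technique, not the method used in the answer. The loop variable category and the hard-coded cats vector are assumptions made so the example runs on its own.

```r
dat <- data.frame(
  v1_min = c(1, 2, 4, 1, NA, 4, 2, 2),
  v1_max = c(1, NA, 5, 4, 5, 4, 6, NA),
  other_v1_min = c(1, 1, NA, 3, 4, 4, 3, 2),
  other_v1_max = c(1, 5, 5, 6, 6, 4, 3, NA),
  y1_min = c(3, NA, 2, 1, 2, NA, 1, 2),
  y1_max = c(6, 2, 5, 6, 2, 5, 3, 3),
  other_y1_min = c(2, 3, NA, 1, 1, 1, NA, 2),
  other_y1_max = c(5, 6, 4, 2, NA, 2, NA, NA)
)
cats <- c("v1", "y1")  # the categories extracted in the question

result <- dat
for (category in cats) {
  # All original columns whose name contains the category.
  cols <- dat[grepl(category, names(dat), fixed = TRUE)]
  # pmin()/pmax() take one vector per column and compare them element-wise.
  result[[paste0("min_", category)]] <- do.call(pmin, c(cols, na.rm = TRUE))
  result[[paste0("max_", category)]] <- do.call(pmax, c(cols, na.rm = TRUE))
}

# Row 2 of the v1 columns is 2, NA, 1, 5, so min_v1 is 1 and max_v1 is 5.
head(result[c("min_v1", "max_v1", "min_y1", "max_y1")], 3)
```

One difference to keep in mind: for a row where every value is NA, pmin() and pmax() return NA, whereas min() and max() with na.rm = TRUE return Inf and -Inf with a warning.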
In a data frame, I have a column (type: chr) that contains answers separated by a comma. I want to create another column based on the size of the string and award points. For example, some of the entries in a column are:
Column1
word1,word2,word3
word1,word2
word1
Now, for the first cell, I want the size of the cell to be evaluated as 3 (as it contains three distinct words and there are no duplicates among the cell values). I'm not sure how to achieve this.
An option is to split the column with strsplit into a list of vectors, get the unique elements by looping over the list with lapply and get the lengths
df1$Size <- lengths(lapply(strsplit(df1$Column1, ",\\s*"), unique))
Another option is separate_rows from tidyr
library(dplyr)
library(tidyr)
df1 %>%
  mutate(rn = row_number()) %>%
  separate_rows(Column1) %>%
  group_by(rn) %>%
  summarise(Size = n_distinct(Column1), .groups = 'drop') %>%
  select(Size) %>%
  bind_cols(df1, .)
-output
#            Column1 Size
#1 word1,word2,word3    3
#2       word1,word2    2
#3             word1    1
data
df1 <- data.frame(Column1 = c('word1,word2,word3', 'word1,word2', 'word1'))
Original Answer:
Another option:
library(dplyr)
library(stringr)
df1 %>%
  mutate(Lengths = str_count(Column1, ",") + 1)
Edit:
I hadn't read the OP's requirements properly (about non-duplicates). As @Onyambu pointed out in the comments, this chunk only works if there are no duplicated words in the data.
It simply counts how many words there are.
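The difference between the two counts shows up as soon as a cell repeats a word. The sketch below uses a made-up input vector x, and a base R comma count as a stand-in for str_count(x, ",") + 1.

```r
# Hypothetical input with a repeated word in the first entry.
x <- c("word1,word1,word2", "word1")

# Counting commas and adding one counts every word, repeated or not.
n_words <- lengths(regmatches(x, gregexpr(",", x, fixed = TRUE))) + 1
n_words
# [1] 3 1

# Splitting and de-duplicating counts distinct words only.
n_distinct_words <- lengths(lapply(strsplit(x, ",\\s*"), unique))
n_distinct_words
# [1] 2 1
```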
I have this string and I need to split it into different columns:
legend = "Frequency..Derivatives.measure...Derivatives.instrument...Derivatives.risk.category...Derivatives.reporting.country...Derivatives.counterparty.sector...Derivatives.counterparty.country...Derivatives.underlying.risk.sector...Derivatives.currency.leg.1...Derivatives.currency.leg.2...Derivatives.maturity...Derivatives.rating...Derivatives.execution.method...Derivatives.basis...Period..30.06.1998.31.12.1998.30.06.1999.31.12.1999.30.06.2000.31.12.2000.30.06.2001.31.12.2001.30.06.2002.31.12.2002.30.06.2003.31.12.2003.30.06.2004.31.12.2004.30.06.2005.31.12.2005.30.06.2006.31.12.2006.30.06.2007.31.12.2007.30.06.2008.31.12.2008.30.06.2009.31.12.2009.30.06.2010.31.12.2010.30.06.2011.31.12.2011.30.06.2012.31.12.2012.30.06.2013.31.12.2013.30.06.2014.31.12.2014.30.06.2015.31.12.2015.30.06.2016.31.12.2016.30.06.2017.31.12.2017.30.06.2018.31.12.2018.30.06.2019"
Every three dots there should be a new column, up to the word Period. Note that the first word, Frequency, is separated from the second, Derivatives.measure, by only two dots, not three.
After that, there is a series of dates (at 6-month intervals), and they should be split this way: every time there is a 4-digit number, perform a split.
How can I do this? Thank you.
We can use strsplit to split at the ... with fixed = TRUE into a list of vectors and then rbind the vectors to create a data.frame
df1 <- do.call(rbind.data.frame, strsplit(legend, "...", fixed = TRUE))
names(df1) <- paste0("V", seq_along(df1))
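The fixed = TRUE argument matters here because an unescaped "..." is a regular expression meaning "any three characters". A minimal sketch on a shortened, made-up stand-in for the legend string (the names s and parts are hypothetical):

```r
# Shortened, hypothetical stand-in for the long 'legend' string.
s <- "Frequency..A.measure...A.instrument...Period..30.06.1998"

# fixed = TRUE treats "..." as three literal dots, so the two-dot gap
# after "Frequency" is left alone.
parts <- strsplit(s, "...", fixed = TRUE)[[1]]
parts
# [1] "Frequency..A.measure" "A.instrument"         "Period..30.06.1998"
```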
If we also need to handle the last condition and split the "Period" part:
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)  # for rowid()
tibble(col = legend) %>%
  mutate(rn = row_number()) %>%
  separate_rows(col, sep = "[.]{3}") %>%
  mutate(rn2 = str_c("V", rowid(rn))) %>%
  pivot_wider(names_from = rn2, values_from = col) %>%
  rename_at(ncol(.), ~ "Period") %>%
  mutate(Period = str_remove(Period, "Period\\.+")) %>%
  separate_rows(Period, sep = "(?<=\\.[0-9]{4})\\.")
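The last separator is a lookbehind: it matches a dot only when that dot comes right after a dot and four digits, that is, right after a year. A minimal base R sketch on a short, made-up excerpt of the date series (the names period and dates are hypothetical):

```r
# Short excerpt in the same format as the dates in 'legend'.
period <- "30.06.1998.31.12.1998.30.06.1999"

# perl = TRUE is needed for the (?<=...) lookbehind in base R.
dates <- strsplit(period, "(?<=\\.[0-9]{4})\\.", perl = TRUE)[[1]]
dates
# [1] "30.06.1998" "31.12.1998" "30.06.1999"
```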
I have two data frames that look like this:
Table1:
Gender<-c("M","F","M","M","F")
CPTCodes<-c("15777, 19328, 19342, 19366, 19370, 19371, 19380","15777, 19357","19367, 49568","15777, 19357","15777, 19357")
Df<-tibble(Gender,CPTCodes)
Table2:
Code<-c(19328,19342,15777,49568,12345)
Value<-c(0.5,7,9,35,2)
Df2<-tibble(Code,Value)
I had previously asked this question about how to summarize the "values" from table 2 into a column in table 1, depending on how many codes were in the "Code" column of table 1. Turns out it was a duplicate of another question, but either way, the solutions there worked great! It did exactly what I asked.
The problem was that I didn't realize that, buried deep in the thousands of rows of Table 2, there were some duplicate codes, i.e. Table 2 really looked like this:
Code<-c(19357,19342,15777,49568,12345,15777,19357)
Modifier<-c("","","","","","a","a")
Value<-c(0.5,7,9,35,2,3,45)
Df2<-tibble(Code,Modifier,Value)
So when I use the suggested code:
Df %>%
  mutate(id = row_number()) %>%
  separate_rows(CPTCodes, sep = ", ", convert = TRUE) %>%
  left_join(Df2, by = c("CPTCodes" = "Code")) %>%
  group_by(id, Gender) %>%
  summarize(total = sum(Value, na.rm = TRUE))
It sums ALL of the codes it finds that match in Table 2, and I really just want the rows that don't have anything in the "Modifier" column. Any ideas?
Lastly, the current code returns the summarized total in its own data frame, but it'd be cool if everything was still there from the original Table 1, and it just had an extra column with the new sum.
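I'm not entirely sure of your expected output, but you should be able to filter out the rows that have a modifier and then join the new column back to the original data frame. Joining on a row id rather than on Gender avoids duplicated rows, since Gender is not unique.
library(dplyr)
library(tidyr)
# Add the row id to Df first so the totals can be joined back by id.
Df <- Df %>% mutate(id = row_number())
Df %>%
  separate_rows(CPTCodes, sep = ", ", convert = TRUE) %>%
  left_join(Df2, by = c("CPTCodes" = "Code")) %>%
  filter(Modifier == "") %>%
  group_by(id) %>%
  summarize(total = sum(Value, na.rm = TRUE)) %>%
  right_join(Df, by = "id")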
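The same idea (drop the modified rows of Table 2, then add one total column to Table 1) can also be sketched in base R. This is an alternative to the dplyr pipeline, not the answerer's code. It uses plain data frames instead of tibbles so that it runs without extra packages, and the name unmodified is made up for the illustration.

```r
# Data from the question, as plain data frames.
Df <- data.frame(
  Gender = c("M", "F", "M", "M", "F"),
  CPTCodes = c("15777, 19328, 19342, 19366, 19370, 19371, 19380",
               "15777, 19357", "19367, 49568", "15777, 19357", "15777, 19357"),
  stringsAsFactors = FALSE
)
Df2 <- data.frame(
  Code = c(19357, 19342, 15777, 49568, 12345, 15777, 19357),
  Modifier = c("", "", "", "", "", "a", "a"),
  Value = c(0.5, 7, 9, 35, 2, 3, 45),
  stringsAsFactors = FALSE
)

# Keep only the rows of Df2 that have no modifier.
unmodified <- Df2[Df2$Modifier == "", ]

# For each row of Df, look up its codes and add up the matching values.
# Codes without a match give NA in the lookup and are dropped by na.rm = TRUE.
Df$total <- sapply(strsplit(Df$CPTCodes, ",\\s*"), function(codes) {
  sum(unmodified$Value[match(as.numeric(codes), unmodified$Code)], na.rm = TRUE)
})

Df$total
# [1] 16.0  9.5 35.0  9.5  9.5
```

Because the total is written straight into Df, all original columns of Table 1 are kept and no join back is needed. Note that match() uses the first matching row per code, so this assumes each code appears at most once among the unmodified rows.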