remove suffix with pattern in R dataframe - r

I have an R dataframe with the following column names:
MMR42_L_2_S52_L001_R1_001
MMR42_LN_2_S51_L001_R1_001
MMR43_N_1_S53_L001_R1_001
MMR48_N_1_S54_L001_R1_001
MMR612_S55_L001_R1_001
MMR658_S56_L001_R1_001
I need to remove the _S* suffix (everything from _S onward) from each column name.
Desired Column names:
MMR42_L_2
MMR42_LN_2
MMR43_N_1
MMR48_N_1
MMR612
MMR658
My Idea
library(dplyr)
df1 %>%
rename_all(.funs = funs(sub("\\_S*", "", names(df1)))) %>%
I could not get the desired result with the above

Within rename_at/rename_all, the . stands for the column names, so we don't need names(df1).
library(dplyr)
library(stringr)
df1 %>%
rename_all(~ str_remove(., "\\_S.*"))
Or using the OP's code
df1 %>%
rename_all(.funs = funs(sub("\\_S.*", "", .)))
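In recent dplyr versions funs() is deprecated and rename_all() is superseded, so the same idea can be sketched with rename_with(). The toy data frame below is an assumption built from the column names in the question, and the slightly stricter pattern "_S[0-9]+.*$" (an _S followed by digits) is my choice rather than something the answer prescribes.

```r
library(dplyr)

# Toy data frame (an assumption) using the column names from the question
cols <- c("MMR42_L_2_S52_L001_R1_001", "MMR42_LN_2_S51_L001_R1_001",
          "MMR43_N_1_S53_L001_R1_001", "MMR48_N_1_S54_L001_R1_001",
          "MMR612_S55_L001_R1_001", "MMR658_S56_L001_R1_001")
df1 <- as.data.frame(matrix(0, nrow = 1, ncol = length(cols),
                            dimnames = list(NULL, cols)))

# rename_with() applies a function to the vector of column names
df2 <- df1 %>% rename_with(~ sub("_S[0-9]+.*$", "", .x))
names(df2)

# Base R equivalent, no packages needed
names(df1) <- sub("_S[0-9]+.*$", "", names(df1))
```

Requiring digits after _S is a small safeguard in case a sample name itself ever contains "_S".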

Related

how to remove duplicate values from specific columns in a data frame?

I want to remove duplicate text within the values of certain columns of a data frame.
What should I do?
In base R, we can split the 'originaltext' column on a comma followed by zero or more spaces (,\\s*), loop over the resulting list with sapply, keep the unique values, and paste them back together, collapsing without a space
df1$result <- sapply(strsplit(df1$originaltext, ",\\s*"),
function(x) paste(unique(x), collapse=""))
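Since the question's example data is not shown, here is a small worked run of the base R approach. The two input strings are made up for illustration only.

```r
# Hypothetical input: comma-separated values with repeats
df1 <- data.frame(originaltext = c("apple, banana, apple", "x, y, x, z"),
                  stringsAsFactors = FALSE)

# Split on comma + optional spaces, drop repeats, glue back together
df1$result <- sapply(strsplit(df1$originaltext, ",\\s*"),
                     function(x) paste(unique(x), collapse = ""))

# Variant that keeps a separator between the unique values
df1$result_sep <- sapply(strsplit(df1$originaltext, ",\\s*"),
                         function(x) paste(unique(x), collapse = ", "))
df1
```

Using collapse = ", " instead of "" is probably what you want if the de-duplicated values should stay readable as a list.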
Here's a way with dplyr :
library(dplyr)
df %>%
mutate(row = row_number()) %>%
tidyr::separate_rows(original_text, sep = ',\\s*') %>%
group_by(row) %>%
summarise(result = paste0(unique(original_text), collapse = ''),
original_text = toString(original_text)) %>%
select(-row)

Mutate new column by comparing multiple column values in different data frame in r

I want to add a new column called Stage to the data frame DF1 by looking up each Style value in the columns Col1, Col2, Col3, Col4, Col5, Col6 of the data frame DF; Stage is the number of the column in which the value is found. Below is my sample data format
Col1=c("ABCD","","","","wxyz","")
Col2=c("","","MTNL","","","")
Col3=c("","PQRS","","","","")
Col4=c("","","","","","")
Col5=c("","","","","","")
Col6=c("","","","","","EFGH")
DF=data.frame(Col1,Col2,Col3,Col4,Col5,Col6)
Style=c("ABCD","WXYZ","PQRS","EFGH")
DF1=data.frame(Style)
Stage=c(1,1,3,6)
DFR=data.frame(Style,Stage)
DFR would be my resulting data frame.
Can someone help me solve this?
A tidyverse method:
library(tidyverse)
DFR <- DF %>%
mutate(across(everything(), ~na_if(., ""))) %>%
pivot_longer(cols = everything(),
names_to = "Stage",
values_to = "Style",
values_drop_na = TRUE) %>%
filter(Style %in% c("ABCD","WXYZ","PQRS","EFGH"))%>%
mutate(Stage = as.integer(gsub("Col", "", Stage)))
The first mutate call replaces your blank values with NA. Then I pivot your table to long format and drop NA values, before filtering for only the Style values you're interested in (these can be saved in a vector instead to make the code cleaner, but here the column and your vector are named the same so I didn't want to make it confusing). The second mutate call is optional, it removes "Col" from each of your Stage values and converts the column to the type integer.
You can join the data after getting it into long format.
library(dplyr)
library(tidyr)
DF %>%
pivot_longer(cols = everything()) %>%
right_join(DF1, by = c('value' = 'Style'))
# name value
# <chr> <chr>
#1 Col1 ABCD
#2 Col3 PQRS
#3 Col6 EFGH
#4 NA WXYZ
I solved this in the following way, and it works:
DF <- DF %>%
mutate(across(everything(), ~na_if(., "")))
DFR=DF1
DFR$Stage=ifelse(is.na(DF1$Style),NA,ifelse(DF1$Style %in% DF$Col1,1,
ifelse(DF1$Style %in% DF$Col2,2,
ifelse(DF1$Style %in% DF$Col3,3,
ifelse(DF1$Style %in% DF$Col4,4,
ifelse(DF1$Style %in% DF$Col5,5,
ifelse(DF1$Style %in% DF$Col6,6,NA)))))))
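The nested ifelse() chain can also be written as a lookup over the columns, which scales to any number of ColN columns. This is a minimal base R sketch; the helper name find_stage is hypothetical, and matching case-insensitively is an assumption made because the expected output puts WXYZ at stage 1 while Col1 holds "wxyz".

```r
Col1 <- c("ABCD", "", "", "", "wxyz", "")
Col2 <- c("", "", "MTNL", "", "", "")
Col3 <- c("", "PQRS", "", "", "", "")
Col4 <- c("", "", "", "", "", "")
Col5 <- c("", "", "", "", "", "")
Col6 <- c("", "", "", "", "", "EFGH")
DF <- data.frame(Col1, Col2, Col3, Col4, Col5, Col6, stringsAsFactors = FALSE)
Style <- c("ABCD", "WXYZ", "PQRS", "EFGH")

# Hypothetical helper: position of the first column containing `style`,
# ignoring case; NA if no column contains it
find_stage <- function(style, data) {
  hit <- which(sapply(data, function(col) toupper(style) %in% toupper(col)))
  if (length(hit) == 0) NA_integer_ else unname(hit[1])
}

DFR <- data.frame(Style,
                  Stage = sapply(Style, find_stage, data = DF, USE.NAMES = FALSE))
DFR
```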

Simplify a list to a data frame & create new columns from numeric vectors in the list

I have a fairly simple list:
ls <- list(560L, 4163L, 3761L, 287:290, 4467L, 3564L, 200:202)
where each element corresponds to a row in a data frame:
df <- enframe(c("tom", "dick", "harry", "sally", "sarah", "petra", "helen"), value = "name", name = NULL)
Because some row elements of the list contain a numeric vector it's not as easy as converting the list to a data frame and using bind_cols to combine the data.
So, I'd like to be able to simplify the list into a data frame and put each vector element into a column so I can combine with the df. The simplified list from this sample would be a data frame 7 rows by 4 columns. The non-reprex data will change and so the number of columns would represent the number of elements in the longest numeric vector and not just this sample.
Thanks.
We can use unnest_wider
library(tidyr)
library(dplyr)
library(purrr) # for set_names
set_names(ls, df$name) %>%
tibble(col = .) %>%
unnest_wider(c(col))
Or after stacking into a 2 column data.frame, use pivot_wider
set_names(ls, df$name) %>%
stack %>%
group_by(ind) %>%
mutate(rn = row_number()) %>%
ungroup %>%
pivot_wider(names_from = ind, values_from = values)
If we need the opposite (stringr is required for str_c)
library(stringr)
df %>%
mutate(val = ls) %>%
unnest(val) %>%
group_by(name) %>%
mutate(rn = str_c('col', row_number())) %>%
ungroup %>%
pivot_wider(names_from = rn, values_from = val)
Or with unnest_wider
library(stringr)
df %>%
mutate(val = ls) %>%
unnest_wider(c(val), names_repair = ~ c('name', str_c('col', 1:4)))
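For comparison, the same 7-row, 4-column shape can be sketched in base R by padding every vector with NA up to the length of the longest one. The column names col1 to col4 are an assumption; the number of columns is derived from the data rather than hard-coded.

```r
ls <- list(560L, 4163L, 3761L, 287:290, 4467L, 3564L, 200:202)
name <- c("tom", "dick", "harry", "sally", "sarah", "petra", "helen")

# Length of the longest vector decides the number of columns
n_max <- max(lengths(ls))

# Indexing past the end of a vector yields NA, which pads short vectors
padded <- t(sapply(ls, function(x) x[seq_len(n_max)]))
colnames(padded) <- paste0("col", seq_len(n_max))

out <- data.frame(name, padded)
out
```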

write from nested dataframe with on-the-fly filename using purrr::walk

I'm applying a function to a nested dataframe using purrr::map to get a new dataframe list column.
Now I want to write each of these new dataframes to file using column values from the same row as part of the filename.
I'm stuck on how to pull the other column values out in order to pass to the filename for writing to file. I'm confident purrr::walk should be involved but the manner of how to access column variables and the list dataframe contents is the problem.
Reprex below:
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)
# Data
data("mtcars")
mtcars_nest <- mtcars %>% rownames_to_column() %>% rename(rowname_1 = rowname) %>% select(-mpg) %>% group_by(cyl) %>% nest()
mtcars_mpg <- mtcars %>% rownames_to_column() %>% rename(rowname_2 = rowname) %>% select(rowname_2, mpg)
# Function to apply to nested dataframe
join_df <- function(df_nest, df_other) {
df_all <- inner_join(df_nest, df_other, by = c("rowname_1" = "rowname_2"))
return(df_all)
}
# 1. Apply function to `$data` to get new dataframe list column and add an extra 'case' column for filename
mtcars_nest %>%
mutate(case = c("first", "second", "third")) %>%
mutate(new_mpg = map(data, ~ join_df(., mtcars_mpg)))
# 2. Now write `$new_mpg` to file with the filename sourced from $cyl and $case
# I think `walk` is the correct function to use, but how do I pass the two row values into the filename?
## Not real code##
# mtcars_nest %>%
# walk(., function(x) {write.csv(., file = paste0(cyl, "_", case, ".csv")})
Use pwalk:
... %>%
select(cyl, case, new_mpg) %>%
pwalk(~ write.csv(..3, file = paste0(..1, '_', ..2, '.csv')))
Chain after your code:
mtcars_nest %>%
mutate(case = c("first", "second", "third")) %>%
mutate(new_mpg = map(data, ~ join_df(., mtcars_mpg))) %>%
select(cyl, case, new_mpg) %>%
pwalk(~ write.csv(..3, file = paste0(..1, '_', ..2, '.csv')))
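The positional ..1, ..2, ..3 arguments can be swapped for a function whose argument names match the column names, which some find easier to read. This is a self-contained sketch, not the original poster's data: the small tibble and the use of tempdir() as the output folder are assumptions.

```r
library(purrr)
library(tibble)

out_dir <- tempdir()  # assumption: write to a temporary folder

# Stand-in for the nested result: one data frame per row
nested <- tibble(
  cyl = c(4, 6),
  case = c("first", "second"),
  new_mpg = list(head(mtcars, 2), tail(mtcars, 2))
)

# pwalk() passes each row's columns as named arguments
pwalk(nested, function(cyl, case, new_mpg) {
  write.csv(new_mpg, file = file.path(out_dir, paste0(cyl, "_", case, ".csv")))
})
```

Because the arguments are matched by name, the select() step is only needed to drop columns the function does not accept.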

Finding elements from multiple columns of one dataframe that are not in multiple columns of another

library(tidyverse)
I have two dataframes (see sample code at bottom) called Df1 and Df2. I want to find phone numbers in Df1 (from all the columns) that are not in any of the phone number columns in Df2.
First, I restructure Df1 so that there is only one Id per row.
Df1<-Df1 %>%
gather(key, value, -Id) %>%
filter(!is.na(value)) %>%
select(-key) %>%
group_by(Id) %>%
filter(!duplicated(value)) %>%
mutate(Phone=paste0("Phone_",1:n())) %>%
spread(Phone, value)
Next, I rename Df2 and then use a join to find only Ids in Df1 that are in Df2.
Df2<-Df2%>%set_names(c("Id","Ph1","Ph2"))
DfJoin<-left_join(Df2,Df1,by="Id")
This is where I'm stuck. I want to find all the numbers in Df1 (Phone_1, Phone_2, and Phone_3) that are not in Df2 (Ph1 and Ph2). Below are some ideas for code. I tried many variations of this idea but could not find a way to achieve what I want. The final product should just be a table with the phone number(s) in any Df1 column that are not in any Df2 column, together with the associated Id. I'm also wondering whether another join or set operation would achieve this more efficiently.
DfJoin<-DfJoin%>%mutate(New=if_else(! DfJoin[2:3] %in% DfJoin[4:6]),1,0)
DfJoin<-DfJoin%>%filter(! DfJoin[2:3] %in% DfJoin[2:4])
Sample Data:
Dataframe 1:
Id<-c(199,148,148,145,177,165,144,121,188,188,188,111)
Ph1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
Ph2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df1<-data.frame(Id,Ph1,Ph2)
Dataframe 2:
Id2<-c(199,148,142,145,177,165,144,121,182,109,188,111)
Phone1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
Phone2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
Df2<-data.frame(Id2,Phone1,Phone2)
One way to think about this problem:
You have a set of phone numbers in df1 for each ID number.
You have a set of phone numbers in df2 for each ID number.
You want to find, within each ID, the set difference between df1 and df2.
You can do this by mapping the base R function setdiff() onto your joined dataframe. To do this, you need to convert your data frames into list-column format, where all the phone numbers for each ID are present as a list in a "cell" of the dataframe. This is easily done by combining group_by(), summarize() and list().
library(tidyverse)
# create example data
Id <- c(199,148,148,145,177,165,144,121,188,188,188,111)
ph1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
ph2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df1 <- data.frame(Id, ph1, ph2)
Id2 <- c(199,148,142,145,177,165,144,121,182,109,188,111)
phone1 <- c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554400,8764443344,6453348736)
phone2 <- c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)
df2 <- data.frame(Id=Id2, phone1, phone2)
# convert the data to list-column format
df1.listcol <- df1 %>%
gather(col, phone, -Id) %>%
na.omit() %>%
group_by(Id) %>%
summarize(phone_list1 = list(phone))
df2.listcol <- df2 %>%
gather(col, phone, -Id) %>%
na.omit() %>%
group_by(Id) %>%
summarize(phone_list2 = list(phone))
Take a look at these dataframes to make sure you understand how we've reformatted them. Obviously, we could save a few lines of code by making this conversion process into a function, and then calling the function on each of df1 and df2, but I didn't do that here.
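The function mentioned above could look something like this. It is only a sketch: the name to_listcol and its out_col argument are hypothetical, it uses pivot_longer() in place of gather(), and it assumes the id column is called Id in both data frames.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: gather every non-Id column into one list-column per Id
to_listcol <- function(df, out_col) {
  out <- df %>%
    pivot_longer(cols = -Id, names_to = "col", values_to = "phone",
                 values_drop_na = TRUE) %>%
    group_by(Id) %>%
    summarize(phones = list(phone))
  # Rename the list-column so the two results can be joined side by side
  names(out)[names(out) == "phones"] <- out_col
  out
}

# Small made-up input to show the shape of the result
toy <- data.frame(Id = c(1, 1, 2), ph1 = c(111, 222, 333), ph2 = c(NA, 111, NA))
res <- to_listcol(toy, "phone_list1")
res
```

With this in place, df1.listcol and df2.listcol would each be a single call, for example to_listcol(df1, "phone_list1").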
# join the two listcol dfs by Id, then map setdiff on the two columns
result <-
df1.listcol %>%
left_join(df2.listcol, by='Id') %>%
mutate(only_list_1 = map2(phone_list1, phone_list2, ~setdiff(.x, .y))) %>%
select(Id, only_list_1) %>%
unnest(only_list_1)
result
The result is
Id only_list_1
148 6541132112
188 7890986543
188 6785554444
Have you tried anti_join(a, b, by = "x1")?
It returns all rows in a that have no match in b on the join column(s).
DfJoin <- anti_join(Df1, Df2, by = "Id")
tidyr_dplyr cheatsheet
Use the above cheatsheet for data manipulation in tidyverse
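Note that joining on Id alone only finds Ids missing from the second data frame. To compare phone numbers within each Id, one option is to reshape both data frames to long format first and anti-join on both Id and phone. The sketch below uses a reduced, made-up subset of the sample data, and the helper to_long is hypothetical.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(Id = c(148, 148, 188, 188, 188),
                  ph1 = c(6572231223, 6541132112, 7890986543, 6785554444, 8764443344),
                  ph2 = c(NA, NA, NA, 6785554444, NA))
df2 <- data.frame(Id = c(148, 188, 109),
                  phone1 = c(6572231223, 8764443344, 6785554400),
                  phone2 = c(NA, NA, 6785554444))

# Hypothetical helper: one row per distinct (Id, phone) pair
to_long <- function(df) {
  df %>%
    pivot_longer(cols = -Id, names_to = "col", values_to = "phone",
                 values_drop_na = TRUE) %>%
    distinct(Id, phone)
}

# Keep the (Id, phone) pairs of df1 that have no match in df2
only_in_df1 <- anti_join(to_long(df1), to_long(df2), by = c("Id", "phone"))
only_in_df1
```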
