Replace Column Names With String Right of "_" - r

I have a dataframe (d3) in which some column names have the form "Date_Month.Year". I want to replace those column names with just "Month.Year", so that multiple columns with the same "Month.Year" can then be summed into a single column.
Below is the code I tried and the output
library(stringr)
print(colnames(d3))
#below is output of the print statement
#[1] "ProductCategoryDesc" "RegionDesc" "SourceDesc" "variable"
#[5] "2019-02-28_Feb.2019" "2019-03-01_Mar.2019" "2019-03-04_Mar.2019" "2019-03-05_Mar.2019"
#[9] "2019-03-06_Mar.2019" "2019-03-07_Mar.2019" "2019-03-08_Mar.2019"
d3 <- d3 %>% mutate(col = str_remove(col, '*._'))
Here is the error I get:
Evaluation error: argument str should be a character vector (or an object coercible to).
So I got the first part of my problem answered: I used the line below to get all column names in Month.Year format. Now I am having issues with summing the columns that have the same name. For that I looked at "Sum and replace columns with same name R for a data frame containing different classes".
colnames(d3) <- gsub('.*_', '', colnames(d3))
Below is the code I used to get the columns summed that have a duplicate name, however with this code it is not necessarily putting the summed values in the correct columns.
indx <- sapply(d3, is.numeric)#check which columns are numeric
nm1 <- which(indx)#get the numeric index of the column
indx2 <- duplicated(names(nm1))|duplicated(names(nm1),fromLast=TRUE)
nm2 <- nm1[indx2]
indx3 <- duplicated(names(nm2))
d3[nm2[!indx3]] <- Map(function(x,y) rowSums(x[y],na.rm = FALSE),
list(d3),split(nm2, names(nm2)))
d3 <- d3[ -nm2[indx3]]

If you want to change the column names, you should be changing colnames:
colnames(d3) <- gsub('.*_', '', colnames(d3))
Note that in a regex, quantifiers (i.e. *) go after the thing they quantify. So it should be .*_ rather than *._
An example where we remove text before a . in iris:
colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
# In regex, . means any character, so to match an actual '.',
# we need to 'escape' it with \\.
colnames(iris) <- gsub('.*\\.', '', colnames(iris))
colnames(iris)
[1] "Length" "Width" "Length" "Width" "Species"
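For the follow-up about summing columns that end up sharing a name, here is a minimal base R sketch. The toy d3 below is hypothetical (made-up values, only a few of the original columns), and it assumes every non-numeric column should be kept as-is:

```r
# Hypothetical toy data standing in for d3 (values are made up)
d3 <- data.frame(
  RegionDesc = c("East", "West"),
  "2019-02-28_Feb.2019" = c(1, 2),
  "2019-03-01_Mar.2019" = c(10, 20),
  "2019-03-04_Mar.2019" = c(100, 200),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Shortened names: everything up to the last "_" is dropped
short <- sub(".*_", "", colnames(d3))

# Group the numeric columns by their shortened name, keeping first-seen order
is_num <- vapply(d3, is.numeric, logical(1))
num_names <- short[is_num]
groups <- split(which(is_num), factor(num_names, levels = unique(num_names)))

# Row-wise sum within each group, then bind back to the non-numeric columns
summed <- lapply(groups, function(idx) unname(rowSums(d3[idx], na.rm = FALSE)))
result <- cbind(d3[!is_num], data.frame(summed, check.names = FALSE))
result
```

Grouping on the shortened names before renaming avoids ever having a data frame with duplicated column names, which is where index-based approaches tend to go wrong.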

colnames(d3) <- sapply(colnames(d3), function(colname){
return( str_remove(colname, '.*_') )
})
The regex should be ".*_" to match the case you need
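To see what the corrected pattern does, here is a small base R sketch on made-up strings. Because .* is greedy, '.*_' removes everything up to the last underscore; the second pattern is an alternative (not from the answers above) that stops at the first underscore:

```r
x <- c("2019-02-28_Feb.2019", "a_b_c", "no underscore")

# Greedy: strips everything up to and including the LAST "_"
greedy <- sub(".*_", "", x)
greedy
# [1] "Feb.2019"      "c"             "no underscore"

# Alternative: strips only up to and including the FIRST "_"
first_only <- sub("^[^_]*_", "", x)
first_only
# [1] "Feb.2019"      "b_c"           "no underscore"
```

Strings without an underscore are returned unchanged, which is why columns such as "RegionDesc" keep their names.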

Related

Rename strings with identical names

Context
When importing columns with identical names from spreadsheet software, readxl renames the duplicates with the following syntax: "Col1","Col1" becomes "Col1","Col1...2". I would like instead to transform it into "Col1","Col1A".
Here is a reproducible example :
Example
# Original string :
library(stringr)
string <- c("G01","G01...2","G02","G03","G04","G04...6","G05","G05...8")
# Desired result
result <- c("G01","G01A","G02","G03","G04","G04A","G05","G05A")
# this line successfully detects the wrongful entries :
str_detect(string,pattern = "[:alpha:][:digit:][:digit:]...[:digit:]")
# this line fails to address the issue correctly :
str_replace(string,"[:alpha:][:digit:][:digit:]...[:digit:]", "[:alpha:][:digit:][:digit:]A")
#output :
[1] "G01" "[:alpha:][:digit:][:digit:]A" "G02"
[4] "G03" "G04" "[:alpha:][:digit:][:digit:]A"
[7] "G05" "[:alpha:][:digit:][:digit:]A"
We could use str_remove to remove the substring that starts with one or more . followed by any other characters, and then use make.unique to disambiguate the duplicates by appending .1, .2, etc.
library(stringr)
make.unique(str_remove(string, "\\.+.*"))
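For reference, here is what that approach produces on the example vector. This sketch uses base R's sub() in place of str_remove, which should be equivalent for this pattern:

```r
string <- c("G01","G01...2","G02","G03","G04","G04...6","G05","G05...8")

# Drop the "...N" suffix that readxl added, then let make.unique() disambiguate
base_names <- sub("\\.+.*", "", string)
unique_names <- make.unique(base_names)
unique_names
# [1] "G01"   "G01.1" "G02"   "G03"   "G04"   "G04.1" "G05"   "G05.1"
```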
If we need to append LETTERS instead, the limitation is that only 26 duplicates of a given name can be labelled.
Assuming there will not be more than 26 duplicates, you could do
nm = sapply(strsplit(string, "\\.{3}"), function(x) x[1])
paste0(nm, ave(nm, nm, FUN = function(x) c("", LETTERS)[seq_along(x)]))
# [1] "G01" "G01A" "G02" "G03" "G04" "G04A" "G05" "G05A"
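As for why the original str_replace attempt printed "[:alpha:][:digit:][:digit:]A": the replacement argument is inserted as text, so character classes are not expanded there. A minimal base R sketch, assuming each name has at most one duplicate (so the suffix is always "A"), either replaces only the suffix or refers back to the matched prefix with \\1:

```r
string <- c("G01","G01...2","G02","G03","G04","G04...6","G05","G05...8")

# Replace just the "...N" suffix with "A"
fixed <- sub("\\.{3}[0-9]+$", "A", string)

# Same result with a capture group and a backreference in the replacement
fixed_backref <- sub("^(.*)\\.{3}[0-9]+$", "\\1A", string)

fixed
# [1] "G01"  "G01A" "G02"  "G03"  "G04"  "G04A" "G05"  "G05A"
```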

Find year in random data in R

I have 71 columns in a dataframe, 10 of which include data that may include a year between 1990 and 2019 in the format YYYY (e.g. 2019). For example:
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
I am trying to find a way to pull the years from relevant cells and insert them in a new column.
So far, I am only aware of how to filter the data in a very time-consuming way. I have produced the following code, which starts like this:
dated_data <- select(undated_data, 1:71) %>%
filter(grepl("1990", id_1) | filter(grepl("1990", id_2) | filter(grepl("1991", id_1) | filter(grepl("1991", id_2)
However, it takes a really long time to write that for all ten columns and all 30 years. I am sure there is a quicker way. I also have no idea how to then pull the dates from each of the matching cells into a new cell.
The output I want looks like this:
dated_data$year <- c("2013", "2014", "2016", "1990")
Does anyone know how I do this? Thank you in advance for your help!
There are many ways. This is one of them:
Step 1: define a pattern you want to match with regex:
pattern <- "(1|2)\\d{3}"
Step 2: define a function to extract raw matches:
extract <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
Step 3: apply the function to your data, e.g., id_1:
extract(id_1)
[1] "2013" "2014" "2016" "1990"
Here's another way, actually simpler ;)
It uses the str_extract function from the stringr package. So you install the package and activate it:
install.packages("stringr")
library(stringr)
and use str_extract to pull your matches:
years <- str_extract(id_1,"(1|2)\\d{3}")
years
[1] "2013" "2014" "2016" "1990"
EDIT:
If not every string contains a match and you want to preserve the length of the vectors/columns, you can use ifelse to test whether the regex finds a match and, where it doesn't, to put NA.
For example, if your data is like this (note the two added strings which do not contain years):
id_3 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759", "gbgbgbgb", "hnhna25")
you can set up the ifelse test like this:
years <- ifelse(grepl("(1|2)\\d{3}", id_3), str_extract(id_3,"(1|2)\\d{3}"), NA)
years
[1] "2013" "2014" "2016" "1990" NA NA
Based on the example in your question, you are trying to filter out any rows without years and then extract the year from the string. It looks like every row only contains 1 year. Here is some code so that you do not have to write long filter statements for 10 columns and 30 years. Keep in mind that I don't have your data so I couldn't test it.
library(tidyverse)
undated_data %>%
select(1:71) %>%
filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
mutate(year = str_extract(id_1, pattern = paste0(1990:2019, collapse = "|")))
EDIT: based on your comment, it looks like some columns may have a year and others may not. What we do instead is pull the year out of every id_* column and then coalesce those columns together. Again, without your data it's tough to test this.
undated_data %>%
select(1:71) %>%
filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
mutate_at(vars(starts_with("id_")), list(year = ~str_extract(., pattern = paste0(1990:2019, collapse = "|")))) %>%
mutate(year = coalesce(!!!select(., ends_with("_year")))) %>%
select(-ends_with("_year"))
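The coalescing idea can also be sketched in base R, which makes the "first non-missing year across the id columns" step explicit. The data frame undated and the helper first_year below are hypothetical stand-ins, not part of the original data:

```r
# Hypothetical stand-in for undated_data, with only two id columns
undated <- data.frame(
  id_1 = c("regkfg_2013", "no year here", "abc"),
  id_2 = c("xyz", "gc-hs1990", "def"),
  stringsAsFactors = FALSE
)

year_pattern <- paste(1990:2019, collapse = "|")

# First matching year per element, NA where nothing matches
first_year <- function(x) {
  hit <- regexpr(year_pattern, x)
  out <- rep(NA_character_, length(x))
  out[hit != -1] <- regmatches(x, hit)
  out
}

# Extract per column, then keep the first non-NA value across columns
id_cols <- grep("^id_", names(undated), value = TRUE)
per_column <- lapply(undated[id_cols], first_year)
undated$year <- Reduce(function(a, b) ifelse(is.na(a), b, a), per_column)
undated$year
# [1] "2013" "1990" NA
```

Rows where no id column contains a year end up as NA, so they can be filtered afterwards with !is.na(undated$year).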
Using tidyverse methods:
undated_data %>%
mutate_at(vars(1:71),
~ str_extract(., "(1|2)[0-9]{3}"))
(Note that the regex pattern will match numbers that may not be years, such as 2999; if your data has many "false positives" like that, you may be better off writing a custom function.)
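One possible shape for such a custom function, as a sketch only: collect every four-digit candidate starting with 1 or 2, then keep the first one that falls inside the wanted range. The name extract_valid_year and its min_year/max_year parameters are made up for illustration:

```r
extract_valid_year <- function(x, min_year = 1990L, max_year = 2019L) {
  # All non-overlapping four-digit candidates per string
  candidates <- regmatches(x, gregexpr("[12][0-9]{3}", x))
  vapply(candidates, function(found) {
    yrs <- as.integer(found)
    yrs <- yrs[yrs >= min_year & yrs <= max_year]
    # First in-range candidate, or NA when none qualifies
    if (length(yrs) == 0L) NA_character_ else as.character(yrs[1])
  }, character(1))
}

valid_years <- extract_valid_year(c("ab2999cd", "x2999y2015", "gjdg1990_3759", "none"))
valid_years
# [1] NA     "2015" "1990" NA
```

Because matches do not overlap, a longer digit run such as "12015" would be read as "1201" and rejected, so this sketch suits IDs where years are not embedded in longer numbers.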
Here is a similar solution to the one provided, but using dplyr and stringr on a data.frame.
library(stringr)
library(dplyr)
library(tidyr) # pivot_longer() comes from tidyr
df<-data.frame("X1" = id_1,"X2" = id_2)
#Set in cols the column names from which years are going to be extracted
df %>%
pivot_longer(cols = c("X1","X2"), names_to = "id") %>%
arrange(id) %>%
mutate(new = unlist(str_extract_all(value, pattern = "(1|2)\\d{3}")))
Base R solution:
# Sample data: id_1; id_2 => character vectors
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
# Thanks @Chris Ruehlemann: store the date pattern: date_pattern => character scalar
date_pattern <- "(1|2)\\d{3}"
# Convert to data.frame: df => data.frame
df <- data.frame(id_1, id_2, stringsAsFactors = FALSE)
# Subset the data to only contain date information vectors: dates_subset => data.frame
dates_subset <- df[,sapply(df, function(x){any(grepl(date_pattern, x))}), drop = FALSE]
# Initialise the year vector: year => character vector:
df$years <- NA_character_
# Remove punctuation and letters, return valid dates, combine into a comma-separated string:
# Store the dates found in the string: years => character vector
df$years[which(rowSums(Vectorize(grepl)(date_pattern, dates_subset)) > 0)] <-
apply(sapply(dates_subset, function(x){
grep(date_pattern, unlist(strsplit(x, "[[:punct:]]|[a-zA-Z]")), value = TRUE)}),
1, paste, collapse = ", ")
Here may be another solution.
We just use the gsub() function with the pattern ".*(199[0-9]|20[01][0-9]).*".
The pattern captures a year between 1990 and 2019 as its only group, so we replace the whole original string with the contents of that first group (\\1):
library(magrittr)
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_1)
# [1] "2013" "2014" "2016" "1990"
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_2)
#[1] "2013" "2014" "2016" "1990"
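Two properties of this substitution approach are worth checking on your own data: a string without a matching year comes back unchanged rather than as NA, and the greedy leading .* means the last year in a string wins. A small sketch with made-up inputs that guards against the first case:

```r
ids <- c("regkfg_2013", "gbgbgbgb", "gc-hs1990", "a1995b2001")
year_regex <- ".*(199[0-9]|20[01][0-9]).*"

# Without a guard, the non-matching string is returned as-is
unguarded <- sub(year_regex, "\\1", ids)
unguarded
# [1] "2013"     "gbgbgbgb" "1990"     "2001"

# With a guard, non-matching strings become NA
guarded <- ifelse(grepl(year_regex, ids), sub(year_regex, "\\1", ids), NA_character_)
guarded
# [1] "2013" NA     "1990" "2001"
```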

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I would first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for numbers and letters mixed together and remove the numbers. Then we define groups for collapsing, looking for characters that come after other characters to put them in the same group. Then we finally collapse all the values in the same group.
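The grouping step is the least obvious part, so here is a short trace of it on the vector as it looks after the digits have been stripped from the mixed element. The intermediate name starts_group is introduced only for this illustration:

```r
x <- c('1.2','asd','gkd','232','4343','zyz','fva','3213','1232','dasd')

is_char <- grepl("[[:alpha:]]", x)

# A new group starts at every non-letter element, and at every letter
# element whose predecessor is not a letter element
starts_group <- !is_char | (is_char & !c(FALSE, head(is_char, -1)))

# Running count of group starts gives each element its group id
grp <- cumsum(starts_group)
grp
# [1] 1 2 2 3 4 5 5 6 7 8
```

Elements sharing a group id ("asd"/"gkd" and "zyz"/"fva") are the ones that tapply pastes together.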
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elements containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))

Match unlist output against set of column names

Here is sample data:
main.data <- c("id","num","open","close","char","gene","valid")
data.step.1 <- list(id="12",num="00",open="01-01-2015",char="yes",gene="1234",valid="NA")
match.step.1 <- unlist(data.step.1)
main.data holds the names of all possible columns.
I have a loop that streams data step by step, and any step may be missing some columns (list names).
I would like to match each step (data.step.n) against the master column names (main.data).
Desired output:
id num open close char gene valid
"12" "00" "01-01-2015" "" "yes" "1234" "NA"
How can I unlist the data and match it against the names so that a missing entry (close, in this case) is filled with an empty string?
Try
v1 <- setNames(rep('', length(main.data)), main.data)
v1[main.data %in% names(match.step.1)] <- match.step.1
Or use match
v1[match(names(match.step.1), main.data)] <- match.step.1
Or just use [
v2 <- setNames(match.step.1[main.data], main.data)
v2[is.na(v2)] <- ''
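Since the data arrives step by step in a loop, the last variant can be wrapped in a small helper and applied to every step. This is a sketch: align_step is a made-up name, and the second step is invented to show a step with several missing columns:

```r
main.data <- c("id","num","open","close","char","gene","valid")

# Reorder one step to the master columns, filling absent ones with ""
align_step <- function(step, columns) {
  values <- unlist(step)
  out <- setNames(values[columns], columns)
  out[is.na(out)] <- ""
  out
}

steps <- list(
  list(id="12",num="00",open="01-01-2015",char="yes",gene="1234",valid="NA"),
  list(id="13",close="02-01-2015")  # hypothetical second step
)

# One row per step, one column per master name
aligned <- do.call(rbind, lapply(steps, align_step, columns = main.data))
aligned
```

Note that the literal string "NA" in valid is kept, because is.na() only flags real missing values, not the text "NA".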

how to eliminate specific columns by column name

I have a data set df and I have 300 columns. I also have a vector names which is a vector of characters. I'm trying to eliminate the columns that match the characters in names. I tried
> head(names)
[1] "X749.-4" "X339" "X449" "X486" "X300" "X301"
real.final<-df[-names]
Error in -names : invalid argument to unary operator
Would there be a way to remove the columns mentioned in the names?
I would use setdiff instead. Here's an example:
## This is head(names)
x <- c("X749.-4", "X339", "X449", "X486", "X300", "X301")
## Imagine this is names(df)
y <- c(letters[1:2], x, LETTERS[1:2])
setdiff(y, x)
# [1] "a" "b" "A" "B"
## So, you could try:
df[, setdiff(y, x)]
The negation operator "-" will not work with character arguments passed as arguments to "[". You need to either use a logical vector with "!" as illustrated by user2568648, or you need to convert the character vector into a numeric vector with grep:
#Failed attempt : real.final <- df[-grep(names, names(df) )]
Perhaps:
real.final <- df[ -as.vector(sapply(names[1], grep, x=c(names,names)))]
Another error:
real.final <- subset( df, select=-names)
Error in -"Result" : invalid argument to unary operator
Success with:
subset(df, select=-which(names(df) %in% names))
I don't like to use -which() because it will bite you if there are no "hits", but it's probably safe as an argument to subset.
You can use the which function. For example to drop the columns named "X749.-4" and "X486":
df <- df[ , -which(names(df) %in% c("X749.-4", "X486"))]
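The caveat about -which() with no "hits" can be seen on a toy data frame (the columns below are made up). When nothing matches, which() returns an empty index, and negating it still selects nothing, so every column is dropped; the logical form with "!" does not have this problem:

```r
df <- data.frame(a = 1:2, X339 = 3:4, X486 = 5:6)

# Normal case: both forms drop the named columns
kept <- df[ , !(names(df) %in% c("X339", "X486")), drop = FALSE]
names(kept)
# [1] "a"

# No-hit case: -which() silently drops everything, "!" keeps everything
no_hits <- c("not_a_column")
with_which   <- df[ , -which(names(df) %in% no_hits), drop = FALSE]
with_logical <- df[ , !(names(df) %in% no_hits), drop = FALSE]
ncol(with_which)    # 0
ncol(with_logical)  # 3
```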
Would this work? [NO - see comment from Dwin for correction; as written, the logical vector is taken as subset's row filter instead of selecting columns]
subset.df<-subset(df, !(colnames(df) %in% names))
