I have 71 columns in a dataframe, 10 of which include data that may include a year between 1990 and 2019 in the format YYYY (e.g. 2019). For example:
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
I am trying to find a way to pull the years from relevant cells and insert them in a new column.
So far, I am only aware of how to filter the data in a very time-consuming way. I have produced the following code, which starts like this:
dated_data <- select(undated_data, 1:71) %>%
filter(grepl("1990", id_1) | grepl("1990", id_2) | grepl("1991", id_1) | grepl("1991", id_2)
However, it takes a really long time to write that for all ten columns and all 30 years. I am sure there is a quicker way. I also have no idea how to then pull the dates from each of the matching cells into a new cell.
The output I want looks like this:
dated_data$year <- c("2013", "2014", "2016", "1990")
Does anyone know how I do this? Thank you in advance for your help!
There are many ways. This is one of them:
Step 1: define a pattern you want to match with regex:
pattern <- "(1|2)\\d{3}"
Step 2: define a function to extract raw matches:
extract <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
Step 3: apply the function to your data, e.g., id_1:
extract(id_1)
[1] "2013" "2014" "2016" "1990"
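One caveat with the helper above: because it unlists every match, strings without a year are dropped and strings with two matches contribute two values, so the result may not line up with the rows. A minimal base R sketch that keeps exactly one value per input, assuming only the first match matters (the name extract_first is made up for illustration):

```r
pattern <- "(1|2)\\d{3}"

# Hypothetical helper: first match per element, NA where there is none
extract_first <- function(x) {
  m <- regexpr(pattern, x, perl = TRUE)  # position of the first match, -1 if none
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)         # regmatches() returns matched elements only
  out
}

extract_first(c("regkfg_2013", "gbgbgbgb", "gjdg1990_3759"))
# [1] "2013" NA     "1990"
```

The output has the same length as the input, so it can be assigned directly to a new data frame column.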
Here's another way, actually simpler ;)
It uses the str_extract function from the stringr package. So you install the package and activate it:
install.packages("stringr")
library(stringr)
and use str_extract to pull your matches:
years <- str_extract(id_1,"(1|2)\\d{3}")
years
[1] "2013" "2014" "2016" "1990"
EDIT:
If not every string contains a match and you want to preserve the length of the vectors/columns, you can use ifelse to test whether the regex finds a match and, where it doesn't, to put NA.
For example, if your data is like this (note the two added strings which do not contain years):
id_3 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759", "gbgbgbgb", "hnhna25")
you can set up the ifelse test like this:
years <- ifelse(grepl("(1|2)\\d{3}", id_3), str_extract(id_3,"(1|2)\\d{3}"), NA)
years
[1] "2013" "2014" "2016" "1990" NA NA
Based on the example in your question, you are trying to filter out any rows without years and then extract the year from the string. It looks like every row only contains 1 year. Here is some code so that you do not have to write long filter statements for 10 columns and 30 years. Keep in mind that I don't have your data so I couldn't test it.
library(tidyverse)
undated_data %>%
select(1:71) %>%
filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
mutate(year = str_extract(id_1, pattern = paste0(1990:2019, collapse = "|")))
EDIT: based on your comment, it looks like some columns may have a year and others may not. What we do instead is pull the year out of every column named id_* and then coalesce those columns together. Again, without your data it's tough to test this.
undated_data %>%
select(1:71) %>%
filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
mutate_at(vars(starts_with("id_")), list(year = ~str_extract(., pattern = paste0(1990:2019, collapse = "|")))) %>%
mutate(year = coalesce(!!!select(., ends_with("_year")))) %>%
select(-ends_with("_year"))
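Since the pipeline above could not be tested against real data, here is a small self-contained base R sketch of the same idea: extract a year from each id column, then take the first non-missing value per row. The sample values are adapted from the question (some years removed on purpose), and the helper name first_year is made up for illustration.

```r
# Made-up sample: each row has a year in at least one column
id_1 <- c("regkfg_2013", "no_year_here", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg*89", "nothing")
undated <- data.frame(id_1, id_2, stringsAsFactors = FALSE)

# Matches only 1990-2019
year_pattern <- "199[0-9]|20[01][0-9]"

# Hypothetical helper: first matching year per element, NA if none
first_year <- function(x) {
  m <- regexpr(year_pattern, x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# One vector of years per id column
year_cols <- lapply(undated[c("id_1", "id_2")], first_year)

# Coalesce: keep the value found so far, otherwise fall back to the next column
undated$year <- Reduce(function(a, b) ifelse(is.na(a), b, a), year_cols)
undated$year
# [1] "2013" "2014" "2016" "1990"
```

With ten columns, only the vector of column names passed to lapply() would need to change.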
Using tidyverse methods:
undated_data %>%
mutate_at(vars(1:71),
~ str_extract(., "(1|2)[0-9]{3}"))
(Note that the regex pattern will match numbers that may not be years, such as 2999; if your data has many "false positives" like that, you may be better off writing a custom function.)
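One possible shape for such a custom function, as a rough base R sketch: collect every 4-digit run and keep the first one that falls inside the wanted range. The function name and its min_year/max_year arguments are assumptions for illustration, not part of any package.

```r
# Hypothetical helper: first 4-digit number within [min_year, max_year], else NA
extract_year_in_range <- function(x, min_year = 1990, max_year = 2019) {
  # All non-overlapping 4-digit runs per string (character(0) if none)
  candidates <- regmatches(x, gregexpr("[0-9]{4}", x))
  vapply(candidates, function(cand) {
    yrs <- as.integer(cand)
    ok <- yrs[yrs >= min_year & yrs <= max_year]
    if (length(ok) == 0) NA_integer_ else ok[1]
  }, integer(1))
}

extract_year_in_range(c("gjdg1990_3759", "x2999y", "ab2016cd", "none"))
# [1] 1990   NA 2016   NA
```

Because the matches do not overlap, a year embedded in a longer digit run (for example 12019) could be missed; whether that matters depends on the data.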
Here is a similar solution to the one provided, but using dplyr and stringr on a data.frame.
library(stringr)
library(dplyr)
library(tidyr) # provides pivot_longer()
df<-data.frame("X1" = id_1,"X2" = id_2)
#Set in cols the column names from which years are going to be extracted
df %>%
pivot_longer(cols = c("X1","X2"), names_to = "id") %>%
arrange(id) %>%
mutate(new = unlist(str_extract_all(value, pattern = "(1|2)\\d{3}")))
Base R solution:
# Sample data: id_1; id_2 => character vectors
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
# Thanks @Chris Ruehlemann: store the date pattern: date_pattern => character scalar
date_pattern <- "(1|2)\\d{3}"
# Convert to data.frame: df => data.frame
df <- data.frame(id_1, id_2, stringsAsFactors = FALSE)
# Subset the data to only contain date information vectors: dates_subset => data.frame
dates_subset <- df[,sapply(df, function(x){any(grepl(date_pattern, x))}), drop = FALSE]
# Initialise the year vector: year => character vector:
df$years <- NA_character_
# Remove punctuation and letters, return valid dates, combine into a comma-separated string:
# Store the dates found in the string: years => character vector
df$years[which(rowSums(Vectorize(grepl)(date_pattern, dates_subset)) > 0)] <-
apply(sapply(dates_subset, function(x){
grep(date_pattern, unlist(strsplit(x, "[[:punct:]]|[a-zA-Z]")), value = TRUE)}),
1, paste, collapse = ", ")
Here is another possible solution.
We just use the gsub() function with the pattern ".*(199[0-9]|20[01][0-9]).*".
The pattern captures a year between 1990 and 2019 as its only group,
so we replace the whole string with the contents of that first group:
library(magrittr)
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_1)
# [1] "2013" "2014" "2016" "1990"
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_2)
#[1] "2013" "2014" "2016" "1990"
Let's say my data is df <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2","Author3","Reference3","Author4","Reference4","Abstract4").
This is a series in which the order is Author, Reference and Abstract. But in some cases, the Abstract data is missing. (In this example, the third Abstract is missing.) So, how can I add NA values in place of Abstract, when Abstract is missing?
In other words, if an element in the vector starts with the word "Reference" but the next element doesn't start with the word "Abstract", I want to add an NA value just after the element starting with "Reference". The result vector should be
result <- c("Author1","Reference1","Abstract1","Author2","Reference2","Abstract2","Author3","Reference3",NA,"Author4","Reference4","Abstract4")
How can I do it?
I have tried the append function in R, but for using it, I need to have the index number of the element where I want to add NA. So, it takes a manual entry for each NA element.
Here's an approach.
Basically, you build two logical vectors:
one that tests whether each element contains Reference, and one that tests whether each element does not contain Abstract.
You offset the second vector by 1, because you want to test whether an Abstract follows a Reference.
You take the logical AND of the two.
Then you insert NA at the positions where an Abstract should be but isn't, using append().
ab_missing <- grepl("Reference", df) & c(!grepl("Abstract", df)[-1], FALSE)
df <- append(df, NA, which(ab_missing))
df
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3" "Reference3" NA "Author4"
[11] "Reference4" "Abstract4"
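Note that append() takes a single position in its after argument, so the snippet above is only expected to work when exactly one Abstract is missing. A minimal sketch of a variant that handles any number of gaps (including a Reference at the very end), assuming elements really start with Reference and Abstract; the function name fill_missing_abstracts is made up for illustration:

```r
# Hypothetical helper: insert NA after every Reference not followed by an Abstract
fill_missing_abstracts <- function(x) {
  is_ref <- grepl("^Reference", x)
  # TRUE where the following element is an Abstract; the last element has no successor
  next_is_abs <- c(grepl("^Abstract", x)[-1], FALSE)
  missing <- is_ref & !next_is_abs
  # Shift every element right by the number of gaps that occur before it
  shift <- cumsum(c(0L, head(missing, -1)))
  out <- rep(NA_character_, length(x) + sum(missing))
  out[seq_along(x) + shift] <- x
  out
}

x <- c("Author1", "Reference1", "Author2", "Reference2", "Abstract2",
       "Author3", "Reference3")
fill_missing_abstracts(x)
# [1] "Author1"    "Reference1" NA           "Author2"    "Reference2"
# [6] "Abstract2"  "Author3"    "Reference3" NA
```

The idea is to pre-allocate a vector of NA and drop the original elements into their shifted positions, which avoids calling append() once per gap.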
One way (and the only way I get these things done) is to think in tibbles or data frames, so this may not be the best approach:
We create a tibble with one column called x,
then we group by the trailing number (e.g. 1, 1, 1) using the parse_number() function from readr (I love parse_number()).
With summarise(cur_data()[seq(3),]) we expand each group to three rows; see "Expand each group to the max n of rows".
If NA is desired, stop here and pull the column; otherwise continue.
Finally, we use paste0() with R's recycling to rebuild the labels, and pull the vector:
1. In case NA is desired:
library(dplyr)
library(readr)
my_vector <- tibble(x = c("Author1","Reference1","Abstract1","Author2","Reference2",
"Abstract2","Author3","Reference3","Author4","Reference4","Abstract4")) %>%
group_by(group= parse_number(x)) %>%
summarise(cur_data()[seq(3),]) %>%
pull(x)
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" NA "Author4" "Reference4" "Abstract4"
2. In case the lacking word is desired:
library(dplyr)
library(readr)
my_vector <- tibble(x = c("Author1","Reference1","Abstract1","Author2","Reference2",
"Abstract2","Author3","Reference3","Author4","Reference4","Abstract4")) %>%
group_by(group= parse_number(x)) %>%
summarise(cur_data()[seq(3),]) %>%
mutate(group = paste0(c("Author", "Reference", "Abstract"), group)) %>%
pull(group)
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" "Abstract3" "Author4" "Reference4" "Abstract4"
A slightly different approach, with the question's vector stored as x, might be:
c(sapply(split(x, cumsum(grepl("Author", x))), function(x) head(c(x, NA_character_), 3)))
[1] "Author1" "Reference1" "Abstract1" "Author2" "Reference2" "Abstract2" "Author3"
[8] "Reference3" NA "Author4" "Reference4" "Abstract4"
I have a data set with a good hundred thousand lines in it.
Somehow, the data provider sent it to me with all the dates formatted like 1/1/20202021 08:07:43 AM (mdy_hms). The correct year is the last four digits of the year field in every row.
lubridate::mdy_hms() obviously can't recognize this. So I am trying to figure out how I could use grep or similar to pull out the correct date time. Any ideas?
Thanks everyone (:
You can handle this with functions in the stringr package. First, get the correct year by extracting it from the date variable. For example,
library(stringr)
date_value <- "1/1/20202021 08:07:43 AM"
correct_year <- str_sub(
str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
)
This returns "2021 ". You can now use str_replace() to replace the 8-digit bad year with correct_year:
str_replace(date_value, pattern = "\\d{8}\\s", replacement = correct_year)
[1] "1/1/2021 08:07:43 AM"
To perform this operation across the whole data frame you can do something like this:
library(tidyverse)
df %>%
mutate(
date_value = str_replace(
date_value,
pattern = "\\d{8}\\s",
replacement = str_sub(
str_extract(date_value, pattern = "\\d{8}\\s"), 5, 10
)
)
)
You can extract only the 2nd 4-digit year with sub.
x <- "1/1/20202021 08:07:43 AM"
lubridate::mdy_hms(sub('(\\d{4})(\\d{4})', '\\2', x))
#[1] "2021-01-01 08:07:43 UTC"
To apply this to an entire column, replace x with df$column_name.
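As a rough illustration of the column-wide case, here is a base R sketch on a small made-up vector. It anchors the pattern on the month/day prefix so that only the doubled year is touched, on the assumption that every value starts with m/d/ followed by eight digits.

```r
# Made-up sample values in the doubled-year format from the question
x <- c("1/1/20202021 08:07:43 AM", "12/31/20192020 11:59:59 PM")

# Keep the m/d/ prefix (group 1) and the second 4-digit year (group 2)
fixed <- sub("^(\\d{1,2}/\\d{1,2}/)\\d{4}(\\d{4})", "\\1\\2", x)
fixed
# [1] "1/1/2021 08:07:43 AM"   "12/31/2020 11:59:59 PM"
```

The cleaned strings can then be passed to lubridate::mdy_hms() as in the answer above.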
I am working with the following dataset called results and am trying to add in a column that only contains the date (ideally just the year) of the row.
I am trying to extract just the date (for example: 2012-02-10) from the column_label column.
This is the code that I use:
pattern <- "- (.*?) .RData"
subsetpk <- results %>%
filter(team=="Pakistan") %>%
mutate(year = str_extract(column_label, pattern))
This, however, only gives me NA values.
You can use a regular expression. Here '\\d{4}' just matches the first 4 consecutive digits that are found in the string. This works if your data always looks the same as your example. If not, you may need something more sophisticated. If this doesn't work, post some more example data.
library(tidyverse)
library(stringr)
df <- data.frame(column_label = c("Afghanistan-Pakistan-2012-02-10.RDATA.overs",
"Afghanistan-Pakistan-2019-02-10.RDATA.overs"))
df %>%
mutate(my_year = str_extract(column_label, '\\d{4}'))
column_label my_year
#1 Afghanistan-Pakistan-2012-02-10.RDATA.overs 2012
#2 Afghanistan-Pakistan-2019-02-10.RDATA.overs 2019
The ymd() function from the lubridate package "transforms dates stored in character and numeric vectors to Date or POSIXct objects".
So, we can pass the complete string conveniently without having to deal with regular expressions:
x <- c("Afghanistan-Pakistan-2012-02-10.RDATA.overs",
"Afghanistan-Pakistan-2019-02-10.RDATA.overs")
lubridate::ymd(x)
[1] "2012-02-10" "2019-02-10"
The year can be derived from the extracted dates by
library(lubridate)
year(ymd(x))
[1] 2012 2019
Use str_extract from the package stringr:
DATA:
results <- data.frame(
column_label = "Afghanistan-Pakistan-2012-02-10.RData.overs")
SOLUTION:
results$date <- str_extract(results$column_label, "\\d{4}-\\d{2}-\\d{2}")
RESULT:
results
column_label date
1 Afghanistan-Pakistan-2012-02-10.RData.overs 2012-02-10
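For what it's worth, the pattern in the question appears to return NA because it contains literal spaces ("- " and " .RData") while the labels contain none. A small base R sketch, using two labels modelled on the question's example, that checks this and then pulls out both the date and the year:

```r
column_label <- c("Afghanistan-Pakistan-2012-02-10.RData.overs",
                  "Afghanistan-Pakistan-2019-02-10.RData.overs")

# The original pattern never matches: the labels have no spaces
original_matches <- grepl("- (.*?) .RData", column_label)
original_matches
# [1] FALSE FALSE

# Match the yyyy-mm-dd part directly, then convert and take the year
m <- regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", column_label)
match_date <- as.Date(regmatches(column_label, m))
match_year <- as.integer(format(match_date, "%Y"))
match_year
# [1] 2012 2019
```

This assumes every label contains a date; labels without one would be dropped by regmatches() and would need the NA handling shown elsewhere in this thread.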
I have a Name column and names are like this:
Preety ..
Sudalai Rajkumar S.
Parvathy M. S.
Navaraj Ranjan Arthur
I want to get which of these are single-word names, like in this case Preety.
I have tried eliminating the "." and " " and counting the length and using the difference of this length and the original string length.
But it's not giving me the desired output. Please help.
NBData3$namewodot <- gsub(" .","",NBData3$Client.Name)
NBData3$namewoblank <- gsub(" ","",NBData3$namewodot)
wordlength <- NBData3$namelengthchar-nchar(as.character(NBData3$namewoblank))
You could use str_count from stringr to check for one-word names, after first removing the dots from the names with gsub. The comparison already returns TRUE or FALSE, so no ifelse() wrapper is needed.
library(stringr)
NBData3$namewodot <- gsub("\\.", "", NBData3$Client.Name)
NBData3$oneword <- str_count(NBData3$namewodot, '\\w+') == 1
# Client.Name namewodot oneword
# 1 Preety .. Preety TRUE
# 2 Sudalai Rajkumar S. Sudalai Rajkumar S FALSE
# 3 Parvathy M. S. Parvathy M S FALSE
# 4 Navaraj Ranjan Arthur Navaraj Ranjan Arthur FALSE
This seems to work for your example
names = c("Preety ..",
"Sudalai Rajkumar S." ,
"Parvathy M. S.",
"Navaraj Ranjan Arthur")
names[sapply(strsplit(gsub(".", "", names, fixed = TRUE), " ", fixed = TRUE), function(x) length(x) == 1)]
[1] "Preety .."
This may be a bit roundabout, but here is a text mining approach. There are definitely more streamlined ways, but I thought there might be concepts in here that are also useful.
# define the data frame
df <- data.frame(Name = c("Preety ..",
"Sudalai Rajkumar S.",
"Parvathy M. S.",
"Navaraj Ranjan Arthur"),
stringsAsFactors = FALSE)
library(tidyverse)
library(tidytext)
# break each name out by words. remove all the periods
df_token <- df %>%
rowid_to_column(var = "name_id") %>%
mutate(Name = str_remove_all(Name, pattern = "\\.")) %>%
unnest_tokens(name_split, Name, to_lower = FALSE)
# find the lines with only one word
df_token %>%
group_by(name_id) %>%
summarize(count = n()) %>%
filter(count == 1) %>%
left_join(df_token) %>%
pull(name_split)
[1] "Preety"
In base R you could use grep:
grep("^\\S+$", gsub("\\W+$", "", names), value = TRUE)
[1] "Preety"
If you need the names as originally given, then you will just use [:
names[grep("^\\S+$", gsub("\\W+$", "", names))]
[1] "Preety .."
I have a large dataframe of 22641 obs. and 12 variables.
The first column "year" includes extracted values from satellite images in the format below.
1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc
From this format, I only want to keep the date which in this case is 19870517 and format it as date (so two different things). Usually, I use the regex to extract the words that I want, but here the date is different for each cell and I have no idea how to replace the above text with only the date. Maybe the way to do this is to search by position within the sentence but I do not know how.
Any ideas?
Thanks.
It's not clear what the "date is different in each cell" means but if it means that the value of the date is different and it is always the 7th field then either of (1) or (2) will work. If it either means that it consists of 8 consecutive digits anywhere in the text or 8 consecutive digits surrounded by _ anywhere in the text then see (3).
1) Assuming the input DF shown in reproducible form in the Note at the end, use read.table to read year, pick out the 7th field and then convert it to Date class. No packages are used.
transform(read.table(text = DF$year, sep = "_")[7],
year = as.Date(as.character(V7), "%Y%m%d"), V7 = NULL)
## year
## 1 1987-05-17
2) Another alternative is separate() from tidyr; version 0.8.2 or later is needed.
library(dplyr)
library(tidyr)
DF %>%
separate(year, c(rep(NA, 6), "year"), extra = "drop") %>%
mutate(year = as.Date(as.character(year), "%Y%m%d"))
## year
## 1 1987-05-17
3) This assumes that the date is the only sequence of 8 digits in the year field. If we know it is surrounded by _ delimiters, the regular expression "_(\\d{8})_" can be used instead.
library(gsubfn)
transform(DF,
year = do.call("c", strapply(DF$year, "\\d{8}", ~ as.Date(x, "%Y%m%d"))))
## year
## 1 1987-05-17
Note
DF <- data.frame(year = "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
stringsAsFactors = FALSE)
Not sure if this will generalize to your whole data but maybe:
gsub(
'(^(?:.*?[^0-9])?)(\\d{8})((?:[^0-9].*)?$)',
'\\2',
'1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc',
perl = TRUE
)
## [1] "19870517"
This uses group capturing and throws away anything but bounded 8 digit strings.
You can use sub to extract the date string and as.Date to convert it into R's date format:
as.Date(sub(".+?([0-9]+)_[^_]+$", "\\1", txt), "%Y%m%d")
# [1] "1987-05-17"
where txt <- "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc"
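Since the question mentions searching by position, here is a base R sketch that splits each value on _ and keeps the 7th field for a whole column. It assumes the date is always the 7th field; the second sample value is made up for illustration.

```r
DF <- data.frame(
  year = c("1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
           "1_1_1_1_LT05_127024_19990102_0000abcdef"),  # second row is invented
  stringsAsFactors = FALSE
)

# Split on the literal underscore and take field 7 (NA if a row has fewer fields)
fields <- strsplit(DF$year, "_", fixed = TRUE)
date_chr <- vapply(fields, function(f) f[7], character(1))

# Convert the yyyymmdd strings to Date
DF$date <- as.Date(date_chr, format = "%Y%m%d")
DF$date
# [1] "1987-05-17" "1999-01-02"
```

Keeping the result in a new column (date) leaves the original text available for checking.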