Substring a statement after character matching and year - r

I am trying to extract certain rows from my dataset based on year, and then trim the values in those rows according to the following conditions. For 2017, I want to drop everything up to and including the second '-' in the statement, e.g. from "17Q4-EMEA-All-SOV-OutR-Sov_Score-18Dec.Email" I want only "All-SOV-OutR-Sov_Score-18Dec.Email". For 2018, I want to remove the portion after the '.', e.g. from "IVP Program Template.IVP Email Template" I want "IVP Program Template".
I have tried using
data$col <- sub(".*:", "", data$`Email Name`)
data$col2 <- substring(data$`Email Name`, regexpr(".", data$`Email Name`) + 1)
but neither works; both return the statements as is. To filter by year I tried the filter function
filter(data, as.Date(data$First Activity (EDT)) = "2017")
but it gives me a syntax error.
My dataset is like this:

Here is the regex that should give you the desired result for 2017 values:
sub(".*?-.*?-", "", "17Q4-EMEA-All-SOV-OutR-Sov_Score-18Dec.Email")
# "All-SOV-OutR-Sov_Score-18Dec.Email"
The one for 2018 values:
sub("\\..*", "", "IVP Program Template.IVP Email Template")
# IVP Program Template
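As a small illustration of why the attempts in the question returned the text unchanged: the first pattern looks for a ':' that never occurs in the data, and an unescaped '.' in a regex matches any character rather than a literal dot. The sketch below uses only base R; the variable name x is just for illustration.

```r
x <- "IVP Program Template.IVP Email Template"

# No ':' in the string, so sub() finds nothing to replace
sub(".*:", "", x) == x                        # TRUE

# Unescaped "." matches any character, so the first match is at position 1
as.integer(regexpr(".", x))                   # 1

# fixed = TRUE (or the escaped "\\.") looks for a literal dot
dot_pos <- as.integer(regexpr(".", x, fixed = TRUE))
dot_pos                                       # 21

# Keep everything before the literal dot
before_dot <- substring(x, 1, dot_pos - 1)
before_dot                                    # "IVP Program Template"
```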
You can then apply the regex functions with ifelse:
library(lubridate)
data$email_adj <- NA
data$email_adj <- ifelse(year(mdy(data$`First Activity (EDT)`)) %in% "2017", sub(".*?-.*?-", "", data$`Email Name`), data$email_adj)
data$email_adj <- ifelse(year(mdy(data$`First Activity (EDT)`)) %in% "2018", sub("\\..*", "", data$`Email Name`), data$email_adj)
If you want to filter by month instead of year, use the month function instead of the year function (in the example I only selected months from April until July):
library(lubridate)
data$email_adj <- NA
data$email_adj <- ifelse(month(mdy(data$`First Activity (EDT)`)) %in% 4:7, sub(".*?-.*?-", "", data$`Email Name`), data$email_adj)
data$email_adj <- ifelse(month(mdy(data$`First Activity (EDT)`)) %in% 4:7, sub("\\..*", "", data$`Email Name`), data$email_adj)
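The question also asked how to keep only the rows for one year. A minimal base R sketch of that step, combined with the two substitutions, might look like the following. The sample rows and the month/day/year date format are assumptions made for illustration (the mdy() call above suggests that format, but the real data is not shown).

```r
# Hypothetical sample data; the date format is an assumption
data <- data.frame(
  `Email Name` = c("17Q4-EMEA-All-SOV-OutR-Sov_Score-18Dec.Email",
                   "IVP Program Template.IVP Email Template"),
  `First Activity (EDT)` = c("12/18/2017", "03/05/2018"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Extract the year as a character string such as "2017"
yr <- format(as.Date(data$`First Activity (EDT)`, format = "%m/%d/%Y"), "%Y")

# Apply the year-specific rule in one pass; other years stay NA
data$email_adj <- ifelse(
  yr == "2017", sub("^[^-]*-[^-]*-", "", data$`Email Name`),
  ifelse(yr == "2018", sub("\\..*", "", data$`Email Name`), NA_character_)
)

# Keep only the 2017 rows
rows_2017 <- data[yr == "2017", , drop = FALSE]
```

With dplyr, the last step would be roughly filter(data, yr == "2017"); note the comparison operator is ==, not =.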

Related

Have a dataframe of tweets, wanting to filter out tweets that contain one of a number of keywords in the text with str_detect() in R

I have a dataframe of tweets. I want to identify all the tweets that contain at least one reference to a set of countries.
These references can appear in various forms. For instance, a reference to the US might be written as "America", "Washington", "Biden", or a number of other things. I figure the best way to do this is to create a vector for each country containing each value I'm searching for:
usid <- c("America", "Washington", "Biden")
rusid <- c("Russia", "Moscow", "Putin")
chnid <- c("China", "Beijing", "Xi jingping")
ids <- c(usid, rusid, chnid)
And so on. Please note that this is just a sample. I have 18 countries that will each have a vector of terms.
I have been using stringr because I thought the str_detect() function would be the best way to do this.
I've tried:
newdf <- filter(df, str_detect(text, usid))
This will return ONLY tweets that contain "America" but no other values in the vector and this error message: "Warning message: In stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): longer object length is not a multiple of shorter object length"
When I use:
newdf <- filter(df, str_detect(text, ids))
I get seemingly random results and the same error message.
After resolving the above, I'd like to be able to negate certain country vectors within the ids vector. For instance, I'd like to search the dataframe for all country vectors except the US vector:
newdf <- filter(df, str_detect(text, ids![usid]))
But I am unsure of the syntax for this.
You need to include 'or'.
Instead of
newdf <- filter(df, str_detect(text, usid))
You can
newdf <- filter(df, str_detect(text, paste0(usid, collapse = "|")))
where usid becomes "America|Washington|Biden"
If you wanted everything except usid these two calls do the exact same thing. One uses the argument negate in str_detect, the other uses the ! as a negation operator.
filter(df, str_detect(text, paste0(usid, collapse = "|"), negate = T))
filter(df, !str_detect(text, paste0(usid, collapse = "|")))
If you wanted to use usid as a 'not' filter in ids you can:
notusid <- ids[which(!ids %in% usid)]
# [1] "Russia" "Moscow" "Putin" "China" "Beijing"
# [6] "Xi jingping"
filter(df, str_detect(text, paste0(notusid, collapse = "|")))
or (this does the exact same thing)
filter(df, str_detect(text, paste0(ids[which(!ids %in% usid)], collapse = "|")))
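With 18 countries, it may be easier to keep the term vectors in a named list and build the pattern from whichever countries are wanted. The sketch below is one possible arrangement, not part of the answer above: country_terms, build_pattern and the sample df are hypothetical names, and base R grepl() stands in for str_detect() so the example runs without extra packages.

```r
# Hypothetical named list: one vector of search terms per country
country_terms <- list(
  us  = c("America", "Washington", "Biden"),
  rus = c("Russia", "Moscow", "Putin"),
  chn = c("China", "Beijing", "Xi jingping")
)

# Collapse the terms of all countries except those in `exclude` into one regex
build_pattern <- function(terms, exclude = character(0)) {
  keep <- terms[setdiff(names(terms), exclude)]
  paste(unlist(keep), collapse = "|")
}

# Hypothetical sample tweets
df <- data.frame(
  text = c("Biden visits Ohio", "Putin speaks in Moscow", "Nothing relevant"),
  stringsAsFactors = FALSE
)

pattern_non_us <- build_pattern(country_terms, exclude = "us")
non_us <- df[grepl(pattern_non_us, df$text), , drop = FALSE]
```

Note that a tweet mentioning both a US term and a Russian term would still be kept here, because the pattern only checks for the presence of the non-US terms.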

String conversion to array: Opening hours (over a week)

I've done an OSM-extraction and here you can see the column "osm_openin" for the opening hours for each object in R.
It has the following structure:
I would love to have new columns for each day of the week, with a symbol "X" - if it is not open all day - or the according opening hours for the day "07:00 - 21:00".
My solution:
Firstly, I am thinking of using representative values for the weekdays: "Mo = 1", "Tu = 2" ... "Su = 7". This matters when a day is not mentioned explicitly but lies inside an interval.
For each value, I search for its existence in the column.
If the value is found, I take the opening hours that follow directly after it (I don't know which R command to use for that).
If not, the value has to lie within an interval. For example, if "2" (Tuesday) does not appear, the script needs to realize that Tuesday is between Mo-Sa (I don't know which method to use for that).
Public Holiday is not important.
Any suggestion for a solution?
Thanks.
I don't know the best way, but maybe I can help.
Firstly we need to create array of weekdays:
wdays <- c("Mo", "Tu", "We", "Th", "Fr", "Sa", "Su")
Now let's write code for converting text from "Mo,We-Fr" to vector c(1, 3, 4, 5). Algorithm:
Delete information about holidays ("PH", "SH");
Replace name of weekday with number ("Mo" --> 1, "Tu" --> 2, etc.);
Replace - with :. For example, 3-5 will be 3:5 and it is R-style code;
Add c( to the beginning and ) to the end. For example, 1,3:5 will be c(1, 3:5);
c(1, 3:5) is R-style vector and we can create vector by text (eval(parse(text = "c(1, 3:5)"))).
Full code:
GetWDays <- function(x, wdays) {
  holi <- c("PH", "SH")
  x <- gsub(paste0("(,|^)", holi, collapse = "|"), "", x) # delete holidays
  for (i in 1:7) {
    x <- gsub(wdays[i], i, x)
  }
  x <- gsub("-", ":", x)
  x <- paste0("c(", x, ")")
  wday_idx <- eval(parse(text = x))
  return(wday_idx)
}
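If evaluating constructed text with eval(parse()) feels risky for data pulled from OSM, the same day expansion can be sketched with match() and seq() instead. This is an alternative under stated assumptions, not the function above: parse_days is a hypothetical name, and ranges that wrap around the week (such as "Sa-Mo") are not handled.

```r
# Expand a day specification such as "Mo,We-Fr" into weekday indices 1..7
parse_days <- function(spec,
                       wdays = c("Mo", "Tu", "We", "Th", "Fr", "Sa", "Su")) {
  parts <- strsplit(spec, ",", fixed = TRUE)[[1]]
  parts <- parts[!parts %in% c("PH", "SH")]      # drop holiday markers
  idx <- lapply(parts, function(p) {
    ends <- match(strsplit(p, "-", fixed = TRUE)[[1]], wdays)
    # "We-Fr" gives two endpoints, a single day gives one
    if (length(ends) == 2) seq(ends[1], ends[2]) else ends
  })
  unlist(idx)
}

parse_days("Mo,We-Fr")  # 1 3 4 5
parse_days("Su,PH")     # 7
```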
Let's create function that has opening hours (like "Mo-Fr 6:30-19:00;Sa 09:00-17:00;Su,PH 09:00-15:00") as input and returns data.frame with 7 columns (for each weekday). Algorithm:
Split text by ;; Now we will work with one part of text (for example, "Mo-Fr 6:30-19:00");
Split text by (space); "Mo-Fr 6:30-19:00" --> "Mo-Fr" and "6:30-19:00"
Pass the first part ("Mo-Fr") to GetWDays, and repeat the second part so that it has the same length as the resulting vector of days. Example: "Mo-Fr" --> c(1,2,3,4,5), "6:30-19:00" --> rep("6:30-19:00", 5);
Make data.frame from 2 vectors (Day and Time);
Use bind_rows for each part from first step. Now we have big data.frame, but some weekdays may be missing, and some weekdays may have "Off" in column Time;
So add rows for missing weekdays (by merge) and replace "Off" and NA with "X" (as you want);
Transpose data.frame and return
Full code:
library(dplyr) # provides %>%, bind_rows, bind_cols, arrange

GetTimetable <- function(x) {
  wdays <- c("Mo", "Tu", "We", "Th", "Fr", "Sa", "Su")
  # split into rules (";"), then each rule into days and hours (" ")
  tmp <- strsplit(strsplit(x, ";")[[1]], " ")
  tmp <- lapply(tmp, function(x) {
    Day <- GetWDays(x[1], wdays)
    data.frame(Day, Time = rep(x[2], length(Day)))
  })
  tmp <- bind_rows(tmp) %>% arrange(Day) %>% as.data.frame()
  # add rows for weekdays that were not mentioned
  tmp <- merge(data.frame(Day = 1:7), tmp, all.x = TRUE, by = "Day")
  tmp$Time[is.na(tmp$Time) | tmp$Time == "Off"] <- "X"
  # transpose: one column per weekday
  tmp <- tmp %>% t() %>% "["(2, ) %>% as.list() %>% setNames(wdays) %>% bind_cols()
  return(tmp)
}
If you want to apply GetTimetable for each row you can use this code:
df_time <- df$osm_openin %>% lapply(GetTimetable) %>% bind_rows()
And if you want to add this data.frame to your data you can do something like this:
df <- bind_cols(df, df_time)

Can you use the output of a function with user input to "call" a dataframe?

I have some code in R that invites a user to put in a year between 2010 and 2021.
chosen.year <- readline(promt = "choose year between 2010 and 2021:")
y.chosen.year <- paste("y", chosen.year)
year.input <- gsub(" ", "", y.chosen.year, fixed = TRUE)
The output that is stored in year.input is for e.g. 2015: y2015.
I have a dataframe for each year between 2010 and 2021 that is called y2010, y2011 etc.
Is it possible to later use year.input in another function that would otherwise require me to write y2015 (so that the user can choose a year that will be used later on)?
Example:
myspdf2 <- merge(myspdf1, year.input, by.x = "abc", by.y = "def")
Instead of:
myspdf2 <- merge(myspdf1, y2015, by.x = "abc", by.y = "def")
I tried the method above but it did not work.
Assuming promt= is not in your real code, two options:
Combine all years into one frame, including the year in the data (if not there already).
years <- ls(pattern = "^y\\d{4}$")
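# tag each frame with its year ("2015", not "y2015"), then stack them;
# rbind assumes all yearly frames share the same columns
allyears <- do.call(rbind, Map(
  function(x, yr) transform(x, year = yr),
  mget(years), sub("^y", "", years)))
subset(allyears, year == chosen.year)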
Combine all years into a list of frames, and subset from there:
allyears <- mget(ls(pattern = "^y\\d{4}$"))
allyears[[ paste0("y", chosen.year) ]]
(This assumes that a chosen.year will only reference one of the multiple frames.)
Ultimately I suspect that this is not so much about merge as about subset (one frame) or [[-extraction (list of frames).
A third option that I'm not fond of, but offered to round out the answer:
Just get the data. BTW, you should use either paste0(.) or paste(., sep=""), otherwise you'll get y 2015 instead of y2015. This is much more direct than paste(.) and gsub(" ", "", .).
year.input <- paste0("y", chosen.year)
get(year.input)
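To tie the list-of-frames option back to the merge call in the question, here is a minimal sketch. The tiny y2014, y2015 and myspdf1 frames are made up for illustration, the list is built by hand (mget(ls(pattern = "^y\\d{4}$")) would produce the same shape from existing objects), and chosen.year is hard-coded where readline(prompt = ...) would normally supply it.

```r
# Hypothetical yearly frames and a frame to merge them with
allyears <- list(
  y2014 = data.frame(def = 1:2, value = c(10, 20)),
  y2015 = data.frame(def = 1:2, value = c(30, 40))
)
myspdf1 <- data.frame(abc = 1:2, label = c("a", "b"))

chosen.year <- "2015"              # would come from readline(prompt = ...)
key <- paste0("y", chosen.year)    # "y2015"

# Fail early with a readable message if the user typed an unknown year
if (!key %in% names(allyears)) stop("no data for year ", chosen.year)

myspdf2 <- merge(myspdf1, allyears[[key]], by.x = "abc", by.y = "def")
```

Looking the frame up by name in a list avoids get() and makes it straightforward to validate the user's input before merging.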

Split characters of a column in a dataframe to insert an underscore between values in R

I have a dataframe with a column of dates in YYYYMMDD format. I would like to convert this column of values to YYYY_MM_DD to match the format in another database.
The solutions I have found primarily focus on splitting delimited strings, such as comma-separated values.
I essentially want a solution that indexes YYYYMMDD and inserts an '_' after the 4th character and the 6th character.
Thanks
UPDATE:
After posting I almost immediately found a solution that worked for me:
library(tidyr)
# create names for new columns
newCol <- c("YEAR", "MONTH", "DAY")
# separate existing column at 4 and 6 character
newData <- separate(table, old_column, newCol, sep = c(4,6))
# combining the three columns to one delimited by '_'
newData$newColumn <- paste(newData$YEAR, newData$MONTH, newData$DAY, sep = "_")
You can try gsub + as.Date, e.g.,
> gsub("-","_",as.Date(s,"%Y%m%d"))
[1] "2020_10_30" "2020_09_22"
Data
s <- c("20201030", "20200922")
You can use the format function in R to achieve this, like so:
year = format(as.Date("20200913", "%Y%m%d"), "%Y")
month = format(as.Date("20200913", "%Y%m%d"), "%m")
day = format(as.Date("20200913", "%Y%m%d"), "%d")
dt = paste0(year, "_", month, "_", day)
print(dt)
This prints the following result:
[1] "2020_09_13"
Another way you can try
library(stringr)
date1 <- c("20201030", "20200922")
date2 <- paste0(str_sub(date1, 1,4), "_", str_sub(date1, 5,6), "_", str_sub(date1, 7,8))
#[1] "2020_10_30" "2020_09_22"
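Two shorter base R variants of the same idea, offered as a sketch: a single sub() call with capture groups, which treats the values purely as text, and a single format() call with underscores in the output format, which also validates that the values are real dates. The vector s is the sample data used above.

```r
s <- c("20201030", "20200922")

# Text-only: capture year, month and day, then rejoin with underscores
via_regex <- sub("^(\\d{4})(\\d{2})(\\d{2})$", "\\1_\\2_\\3", s)

# Date-aware: parse as a Date and print with the desired separators
via_date <- format(as.Date(s, format = "%Y%m%d"), "%Y_%m_%d")

via_regex  # "2020_10_30" "2020_09_22"
via_date   # "2020_10_30" "2020_09_22"
```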

Split values in a column and reassign it to a new column [duplicate]

This question already has answers here:
Split data frame string column into multiple columns
(16 answers)
Closed 3 years ago.
In my DataFrame, one of the columns has a value that is a combination of [state,country]
I tried this code:
voivodeshipdf <- voivodeshipdf %>% mutate(state = as.character(unlist(str_split(voivodeship, ','))[1]))
but it only reassigns the value of the first row.
Please how do I update my code to split the right values for each row?
An option would be separate
library(tidyverse)
voivodeshipdf %>%
separate(voivodeship, into = c('state', 'newcol'), sep=",", remove = FALSE) %>%
select(-newcol)
Or extract
voivodeshipdf %>%
extract(voivodeship, into = 'state', '^([^,]+),.*', remove = FALSE)
or with word
voivodeshipdf %>%
mutate(state = word(voivodeship, 1, sep=","))
The issue in the OP's code is that unlist flattens the pieces from every row into one vector, and [1] then selects only its first element (the state from the first row). That single value is assigned to the whole column due to recycling.
Instead, what we need is to extract the first element from the list output of str_split with map or lapply (map would be more appropriate in tidyverse context)
voivodeshipdf %>%
mutate(state = map_chr(str_split(voivodeship, ','), first))
We can try using sub here for a base R option:
voivodeshipdf$state <- sub("^.*, ", "", voivodeshipdf$voivodeship)
voivodeshipdf$voivodeship <- sub(",.*$", "", voivodeshipdf$voivodeship)
Sample script:
voivodeship <- "Greater Poland voivodeship, poland"
sub("^.*, ", "", voivodeship)
sub(",.*$", "", voivodeship)
[1] "poland"
[1] "Greater Poland voivodeship"
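For completeness, a base R sketch of the per-row extraction that also shows the recycling problem from the question. The two sample rows are made up, and the first piece before the comma is assumed to be the state, as in the [state,country] description.

```r
# Hypothetical sample data
voivodeshipdf <- data.frame(
  voivodeship = c("Greater Poland voivodeship, poland",
                  "Silesian voivodeship, poland"),
  stringsAsFactors = FALSE
)

# strsplit returns a list with one character vector per row
parts <- strsplit(voivodeshipdf$voivodeship, ",", fixed = TRUE)

# What the original code did: flatten everything, keep one value for all rows
first_only <- unlist(parts)[1]

# Per-row version: take the first piece of each list element
voivodeshipdf$state <- vapply(parts, function(p) trimws(p[1]), character(1))
```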
