gsub to create new variables [duplicate] - r

This question already has answers here:
How to split a string on first number only
(4 answers)
Closed 4 years ago.
I want to create two variables from one variable in R.
I have the following character variable for a gas station:
station
Valero 1810 N Foster Rd & IH-10 E
From this variable I want to create two: station_id and address
station_id
Valero
address
1810 N Foster Rd & IH-10 E
In my data set, all strings in the station variable begin with words (up to 3 words, e.g. EZ Mart), and all addresses begin with a numeric value.
I have been trying to achieve this with gsub for the last couple of hours but couldn't do it.
Thank you

Base R solution: This works for the sample string you gave; you will need to test whether it works for your other cases. It would have been good to include more than one sample string.
ss <- "Valero 1810 N Foster Rd & IH-10 E";
station_id <- trimws(gsub("(\\w+\\s+){1,3}(\\d+.+)$", "\\1", ss));
address <- gsub("(\\w+\\s+){1,3}(\\d+.+)$", "\\2", ss);
station_id;
#[1] "Valero"
address;
#[1] "1810 N Foster Rd & IH-10 E"


Regex match for singular version BUT NOT plural in R [duplicate]

This question already has answers here:
Using regex in R to find strings as whole words (but not strings as part of words)
(2 answers)
Closed 2 years ago.
I might be missing something very obvious, but how can I write efficient code to get all matches of the singular form of a noun but NOT its plural? For example, I want to match
angel investor
angel
BUT NOT
angels
try angels
If I try
grep("angel ", string)
Then a string with JUST the word
angel
won't match.
Please help!
Use word-boundary markers \\b:
x <- c("angel investor", "angel","angels", "try angels")
grep("\\bangel\\b", x, value = T)
[1] "angel investor" "angel"
You can try the following approach, though I believe there are other excellent ways to solve this problem. The lookbehind pattern (?<=[ertkgwmnl])s\\b matches a word-final s preceded by one of the listed letters, which flags the plural forms here:
library(dplyr)
library(stringr)
df <- data.frame(obs = 1:4, words = c("angel", "try angels", "angel investor", "angels"))
df %>%
filter(!str_detect(words, "(?<=[ertkgwmnl])s\\b"))
#   obs          words
# 1   1          angel
# 2   3 angel investor
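The word-boundary pattern from the first answer also works inside filter() and is arguably easier to read (a sketch using the same df):
df %>%
filter(str_detect(words, "\\bangel\\b"))
#   obs          words
# 1   1          angel
# 2   3 angel investor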

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets containing mentions that start with '#'. I need to extract all of them and save the mentions of each tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions <- str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse the list in each row to form a single string separated by spaces, as mentioned earlier?
Thanks in advance.
I trust it would be best if you used an "AsIs" column (via I()) in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single comma-separated string of mentions per tweet, you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
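Since the question asked for space-separated mentions ("#mention1 #mention2"), the same approach works with collapse = " " (a minimal variant of the line above):
tweets_date$Mentions <- sapply(tweets, paste, collapse = " ")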
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice
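A quick base R sketch of that pattern in action (my own illustration; x is made-up data). gregexpr() also keeps the one non-tag character matched by the alternation, so it is trimmed afterwards:
x <- c("see #RealTag and email foo#bar.com", "#StartOfTweet", "http://ex.com/#anchor")
hits <- regmatches(x, gregexpr("(^|[^[:alnum:]_#/!?=&])#[[:alnum:]_]{1,15}\\b", x))
lapply(hits, function(h) sub("^[^#]", "", h))  # drop the leading captured character
#[[1]]
#[1] "#RealTag"
#
#[[2]]
#[1] "#StartOfTweet"
#
#[[3]]
#character(0)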

removing variables containing certain string in r [duplicate]

This question already has answers here:
Remove Rows From Data Frame where a Row matches a String
(6 answers)
Delete rows containing specific strings in R
(7 answers)
Closed 4 years ago.
I have hundreds of observations and I'd like to remove the ones that contain the string "english basement". I can't seem to find the right syntax to do so; I can only figure out how to keep observations with that string. For instance, I used the code below to get only the observations containing the string, and it worked perfectly:
eng_base <- zdata %>%
filter(str_detect(zdata$ListingDescription, "english basement"))
Now I want a data set, top_10mpEB, that excludes observations containing "english basement". Your help is greatly appreciated.
I do not know what your data looks like, but maybe this example helps you; I think you just need to negate the logical vector returned by str_detect:
library(dplyr)
library(stringr)
zdata <- data.frame(ListingDescription = c(rep("english basement, etc",3), letters[1:2] ))
zdata
# ListingDescription
#1 english basement, etc
#2 english basement, etc
#3 english basement, etc
#4 a
#5 b
zdata %>%
filter(!str_detect(ListingDescription, "english basement"))
# ListingDescription
#1 a
#2 b
Or using data.table package (no need of stringr::str_detect):
library(data.table)
setDT(zdata)
zdata[! ListingDescription %like% "english basement"]
# ListingDescription
#1: a
#2: b
You can do this using grepl():
x <- data.frame(ListingDescription = c('english basement other words description continued',
'great fireplace and an english basement',
'no basement',
'a house with a sauna!',
'the pool is great... and wait till you see the english basement!',
'new listing...will go fast'),
rent = c(3444, 23444, 346, 9000, 1250, 599))
x_english_basement <- x[!grepl('english basement', x$ListingDescription), ]
You can use dplyr to easily filter your dataframe.
library(dplyr)
library(stringr)
new_data <- data %>%
filter(!str_detect(ListingDescription, "english basement"))
The ! became my best friend once I realized it means logical negation ("not"). Note that a plain equality filter (!ListingDescription == "english basement") would drop only rows that exactly equal that string, which is why str_detect is needed here for partial matches.
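For completeness, a base R equivalent that creates the data set the asker named (a sketch, assuming the same zdata; fixed = TRUE treats the pattern as a literal string rather than a regex):
top_10mpEB <- subset(zdata, !grepl("english basement", ListingDescription, fixed = TRUE))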

How do I parse the contents of hundreds of CSVs that are in a list of dataframes and split on ":" and ";" in loops?

I'm working with a large number (1,983) of CSV files. Posts on Stack Overflow have said that lists are easier to work with, so I've approached my task that way. I have read the CSVs in and accomplished the first part of my task: finding the maximum number of concurrent users of the application (A: 203). Here's that code:
# get a list of the files
files <- list.files("my_path_here", pattern="*.CSV$", recursive = TRUE, full.names=TRUE)
# read in the CSVs and store them as a list of data frames
tables <- lapply(files, read.csv)
# store the counts of the number of users here
counts <- rep(NA, length(tables))
# loop through the files to find the count and store that value
for (i in 1:length(files)) {
counts[i] <- length(tables[[i]][[2]])
}
# what's the largest number?
max(counts)
# 203
The 2nd part of the task is to show the count of each title for each file. The contents of each file will be something like this:
compute_0001 compute_0002
[1] 3/26/2015 6:00:00 Business System Manager;Lead CoPath Analyst
[2] Regional Histotechnologist;Hist Tech - Ht
[3] Regional Histotechnologist;Tissue Tech
[4] SDX Histotechnologist;Histology Tech
[5] SDX Histotechnologist;Histology Tech
[6] Regional Histotechnologist;Lab Asst II Histology
[7] CytoPrep Tech;Histo Tech - Ht
[8] Regional Histotechnologist;Tissue Tech
[9] Histology Supervisor;Supv Reg Lab Unit
[10] Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT
What will differ from file to file is the time stamp in compute_0001, the name of the file, and the number of users (i.e. the length of the file).
My approach was to try this:
>col2 <- sapply(tables,summary, maxsum=300) # gives me a list of 1983 elements that is 23.6Mb
(I noticed that when doing a summary() on the files I would get something like this, which is why I was trying it)
>col2[[1]]
compute_0001 compute_0002
1] Business System Manager;Lead CoPath Analyst :1
[2] Regional Histotechnologist;Hist Tech - Ht :1
[3] Regional Histotechnologist;Tissue Tech :1
[4] SDX Histotechnologist;Histology Tech :1
[5] SDX Histotechnologist;Histology Tech :1
[6] Regional Histotechnologist;Lab Asst II Histology :2
[7] CytoPrep Tech;Histo Tech - Ht :4
[8] Regional Histotechnologist;Tissue Tech :1
[9 Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT :1
The above is actually many different people. For my purposes, [2], [3], [6] and [8] are the same title, even though the text after the ";" differs (in truth, [4] and [5] could also be considered the same as [2], [3], [6] and [8]).
That ":1" (or generally ":#") is the number of users with that title at that particular time. I was hoping to grab that character, make it numeric, and add the values up to get a count of the users with each title for each file. Each file is an observation at a particular datetime.
I tried something like this:
for (s in 1:length(col2)) {
split <- strsplit(col2[[s]][,2], ":")
# ... make it numeric so I can do addition with it
num <- as.numeric(split[[s]][2])
# ... and put it in the correct df
tables[[s]]$count <- num
# after dealing with the ":" I was going to handle splitting on the first ";"
}
But I couldn't get the loop to iterate more than a single time or past the first element of col2.
A more experienced useR suggested something like this:
strsplit(x = as.character(compute2[[s]]), split=";", fixed=TRUE)
He said: "However this results in a messy list also, since there are multiple ";" in some lines. What I would suggest is to use grep() with a regex that returns the text before the first ";"; use that with sapply(compute2, grep()) and then you can run sapply(??, table) on the list that is returned to tally the job titles."
I'd prefer not to get into regex but, following his advice, I tried:
for (s in 1:length(tables)) {
split <- strsplit(x = as.character(compute2[[s]]), split=";", fixed=TRUE)
}
split is a list of only 122 elements, not nearly long enough, so it's not iterating through the loop either. So I figured I'd skip the loop and try:
title_split <- sapply(compute2, strsplit, x = as.character(compute2[[1]]), split=";", fixed=TRUE)
But that gave me more than 50 warnings and a matrix with 105,000+ elements that was 20.2Mb in size.
Like I said, I'd prefer not to venture into the world of regex, since I think I should be able to split on the ":" first, then on the first ";", and return the string that precedes the ";". I'm just not sure why the loop is failing.
What I eventually want is a table that shows the count of each title (collapsed for duplicates like [2],[3], [6] and [8] above) for each file (which represents an observation at a particular datetime). I'm pretty agnostic as to approach, so if I have to do it via regex, then so be it.
Sorry for the lengthy post but I suspect that part of my problem (besides being brand new to stackoverflow, R and not understanding regex well) is that I'm not well versed in list manipulation and I wanted you to have the context.
Many thanks for reading.
Your data isn't easily reproducible, so I've created a simple list of fake data that I hope captures the essence of your data.
Make a list of fake data frames:
string1 = "53 Regional histotechnologist;text2 - more text"
string2 = "54 Regional histotechnologist;text2 - more text"
string3 = "CytoPrep Tech;text2 - more text"
tables = list(df1=data.frame(compute=c(string1, string2, string3)),
df2=data.frame(compute=c(string1, string2, string3)))
Count the number of rows in each data frame:
counts = sapply(tables, nrow)
Add a column that extracts job titles from the compute column. The regex pattern skips zero or more digit characters ([0-9]*) followed by an optional space ( ?), captures everything up to, but not including, the first semicolon (([^;]*);), and then skips every character after the semicolon (.*).
tables = sapply(names(tables), function(df) {
cbind(tables[[df]], title=gsub("[0-9]* ?([^;]*);.*", "\\1", tables[[df]][,"compute"]))
}, simplify=FALSE)
tables
$df1
compute title
1 53 Regional histotechnologist;text2 - more text Regional histotechnologist
2 54 Regional histotechnologist;text2 - more text Regional histotechnologist
3 CytoPrep Tech;text2 - more text CytoPrep Tech
$df2
compute title
1 53 Regional histotechnologist;text2 - more text Regional histotechnologist
2 54 Regional histotechnologist;text2 - more text Regional histotechnologist
3 CytoPrep Tech;text2 - more text CytoPrep Tech
Make a table of counts of each title for each data frame in tables:
title.table.list = lapply(tables, function(df) table(df$title))
title.table.list
$df1
CytoPrep Tech Regional histotechnologist
1 2
$df2
CytoPrep Tech Regional histotechnologist
1 2
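To apply the same idea to the real files, something like this sketch should work, assuming (as in the counting loop in the question) that the job titles sit in the second column of each CSV:
title.table.list <- lapply(tables, function(df) {
  # strip an optional leading number and keep the text before the first ";"
  # (rows without any ";" are left unchanged by sub)
  titles <- sub("^[0-9]* ?([^;]*);.*", "\\1", as.character(df[[2]]))
  table(titles)
})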

Selecting strings and using in logical expressions to create new variable - R

I have a categorical variable indicating the location of flu clinics, as well as an "other" category. Participants who select the "other" category give open-ended responses for their location. In most cases, these open-ended responses fit one of the existing categories (for example, one category is "public health clinic", but some respondents picked "other" and cited "mall", which was a public health clinic). I could easily do this by hand, but I want to learn the code to select the "mall" strings and then use logical expressions to assign these people to "public health clinic" (i.e. create a new variable for the location of flu clinics).
My categorical variable is "lrecflu2" and my character string variable is "lfother"
So far I have:
mall <- grep("MALL", Motiv82012$lfother, value = TRUE)
This gives me a vector with all the string responses containing "MALL" (all strings are in caps in the dataframe)
How do I use this vector in a logical expression to create a new flu clinic location variable that assigns these people to the "public health clinic" category, while keeping the original location value for people who did not select "other" (and so have no text in the character string variable)?
Perhaps, grep is not even the right function to be using.
As I understand it, you have a column in a data frame, where you want to reassign one character value to another. If so, you were almost there...
set.seed(1) # for generating an example
df1 <- data.frame(flu2=sample(c("MALL","other","PHC"),size=10,replace=TRUE))
df1$flu2[grep("MALL",df1$flu2)] <- "PHC"
Here grep() is giving you the required vector index; you then subset the vector based on this and change those elements.
Update 2
This should produce a data.frame similar to the one you are using:
set.seed(1)
lreflu2 <- sample(c("PHC","Med","Work","other"),size=10,replace=TRUE)
Ifother <- rep("",10) # blank character vector
s1 <- c("Frontenac Mall","Kingston Mall","notMALL")
Ifother[lreflu2=="other"] <- s1
df1 <- data.frame(lreflu2,Ifother)
### alternative:
### df1 <- data.frame(lreflu2,Ifother, stringsAsFactors = FALSE)
df1
gives:
lreflu2 Ifother
1 Med
2 Med
3 Work
4 other Frontenac Mall
5 PHC
6 other Kingston Mall
7 other notMALL
8 Work
9 Work
10 PHC
If you're looking for an exact string match you don't need grep at all:
df1$lreflu2[df1$Ifother=="MALL"] <- "PHC"
Using a regex:
df1$lreflu2[grep("Mall",df1$Ifother)] <- "PHC"
gives:
lreflu2 Ifother
1 Med
2 Med
3 Work
4 PHC Frontenac Mall
5 PHC
6 PHC Kingston Mall
7 other notMALL
8 Work
9 Work
10 PHC
Whether Ifother is a factor or a character vector doesn't affect things. Note that data.frame() coerced string vectors to factors by default only in R versions before 4.0; since R 4.0, stringsAsFactors defaults to FALSE.
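Putting it together for the original variables: a sketch of the asker's actual goal, assuming the Motiv82012 data frame with the lrecflu2 and lfother columns named in the question (lrecflu2_new is a hypothetical name for the new variable):
# start the new variable from the original clinic location
Motiv82012$lrecflu2_new <- Motiv82012$lrecflu2
# reassign anyone whose open-ended "other" response mentions MALL
# (responses are all caps in this data frame, per the question)
is_mall <- grepl("MALL", Motiv82012$lfother)
Motiv82012$lrecflu2_new[is_mall] <- "public health clinic"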
