Remove part of an element in a dataframe in R

I have a data frame (DF) like this:
word
1 vet clinic New York
2 super haircut Alabama
3 best deal on dog drugs
4 doggy medicine Texas
5 cat healthcare
6 lizards that don't lie
I am trying to get the resulting data frame (only remove the geo names)
word
1 vet clinic
2 super haircut
3 best deal on dog drugs
4 doggy medicine
5 cat healthcare
6 lizards that don't lie
The following does not keep the remaining words after the geo name has been removed.
vec <- # vector of geo names
DF <-DF[!grepl(vec,DF$word),]

Using @Ari's variables and data frame, a vectorized method could use Reduce:
vec = c("New York", "Texas", "Alabama")
word = c("vet clinic New York", "super haircut Alabama", "best deal on dog drugs", "doggy medicine Texas", "cat healthcare", "lizards that don't lie")
df = data.frame(word=word)
df$word = as.character(df$word)
Reduce(function(a, b) gsub(b, "", a, fixed = TRUE), vec, df$word)
[1] "vet clinic " "super haircut " "best deal on dog drugs" "doggy medicine "
[5] "cat healthcare" "lizards that don't lie"
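The result above still carries a trailing space wherever a geo name was removed. A minimal sketch of one way to tidy that up, assuming base R's trimws is acceptable and the geo name sits at the end of the string (a name in the middle would leave a double space behind):

```r
vec  <- c("New York", "Texas", "Alabama")
word <- c("vet clinic New York", "super haircut Alabama",
          "best deal on dog drugs", "doggy medicine Texas",
          "cat healthcare", "lizards that don't lie")

# Remove each geo name in turn (literal match), starting from `word`
stripped <- Reduce(function(a, b) gsub(b, "", a, fixed = TRUE), vec, word)

# Drop the leading/trailing whitespace left behind by the removal
cleaned <- trimws(stripped)
cleaned
```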

Using @Ari's example,
library(stringr)
df$word <- str_trim(gsub(paste(vec,collapse="|"),"", df$word))
df$word
#[1] "vet clinic" "super haircut" "best deal on dog drugs"
#[4] "doggy medicine" "cat healthcare" "lizards that don't lie"
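One caveat with a plain alternation is that it also matches inside longer words. A hedged sketch of a stricter variant using word boundaries follows; the "New Yorker magazine" entry is a made-up edge case, not part of the original data, and names containing regex metacharacters would additionally need escaping:

```r
vec  <- c("New York", "Texas", "Alabama")
word <- c("vet clinic New York", "doggy medicine Texas",
          "New Yorker magazine")   # last entry is a hypothetical edge case

# \\b anchors each name at word boundaries, so the "New York" inside
# "New Yorker" is left alone
pat <- paste0("\\b(", paste(vec, collapse = "|"), ")\\b")
cleaned <- trimws(gsub(pat, "", word, perl = TRUE))
cleaned
```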

As Henrik mentioned, it would have been helpful if you submitted a reproducible example along with your post. I will do so here:
vec = c("New York", "Texas", "Alabama")
word = c("vet clinic New York", "super haircut Alabama", "best deal on dog drugs", "doggy medicine Texas", "cat healthcare", "lizards that don't lie")
df = data.frame(word=word)
df$word = as.character(df$word)
df
word
1 vet clinic New York
2 super haircut Alabama
3 best deal on dog drugs
4 doggy medicine Texas
5 cat healthcare
6 lizards that don't lie
Generally speaking R gurus prefer vectorization over for loops. But in this case I found a nested for loop and the stringr package to be the easiest way to solve this problem.
library(stringr)
for (i in 1:nrow(df))
{
  for (j in 1:length(vec))
  {
    df[i, "word"] = str_replace_all(df[i, "word"], vec[j], "")
  }
}
df
word
1 vet clinic
2 super haircut
3 best deal on dog drugs
4 doggy medicine
5 cat healthcare
6 lizards that don't lie
I believe that this code gives you the result that you were looking for.

Related

R Regex for Positive Look-Around to Match Following

I have a dataframe in R. I want to match with and keep the row if
"woman" is the first or
the second word in a sentence, or
if it is the third word in a sentence and preceded by the words "no," "not," or "never."
phrases_with_woman <- structure(list(phrase = c("woman get degree", "woman obtain justice",
"session woman vote for member", "woman have to end", "woman have no existence",
"woman lose right", "woman be much", "woman mix at dance", "woman vote as member",
"woman have power", "woman act only", "she be woman", "no committee woman passed vote")), row.names = c(NA,
-13L), class = "data.frame")
In the above example, I want to be able to match with all rows except for "she be woman."
This is my code so far. I have a positive look-around ("(?<=woman\\s)\\w+") that seems to be on the right track, but it matches too many preceding words. I tried using {1} to match just one preceding word, but this syntax didn't work.
matches <- phrases_with_woman %>%
filter(str_detect(phrase, "^woman|(?<=woman\\s)\\w+"))
Help is appreciated.
Each of the conditions can be an alternative although the last one requires two alternatives assuming that no/not/never can be either the first or second word.
library(dplyr)
pat <- "^(woman|\\w+ woman|\\w+ (no|not|never) woman|(no|not|never) \\w+ woman)\\b"
phrases_with_woman %>%
filter(grepl(pat, phrase))
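As a quick sanity check of that pattern, here is a small base R sketch (no dplyr needed) that applies it to the phrases from the question and lists whatever gets filtered out. The vector name phrases is just a stand-in for the phrase column:

```r
phrases <- c("woman get degree", "woman obtain justice",
             "session woman vote for member", "woman have to end",
             "woman have no existence", "woman lose right", "woman be much",
             "woman mix at dance", "woman vote as member", "woman have power",
             "woman act only", "she be woman", "no committee woman passed vote")

# One alternative per condition: "woman" first, "woman" second,
# or "woman" third with no/not/never among the first two words
pat <- "^(woman|\\w+ woman|\\w+ (no|not|never) woman|(no|not|never) \\w+ woman)\\b"

keep <- grepl(pat, phrases)
phrases[!keep]   # the rows that would be dropped
```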
I haven't come up with a regex solution but here is a workaround.
library(dplyr)
library(stringr)
phrases_with_woman %>%
filter(str_detect(word(phrase, 1, 2), "\\bwoman\\b") |
(word(phrase, 3) == "woman" & str_detect(word(phrase, 1, 2), "\\b(no|not|never)\\b")))
# phrase
# 1 woman get degree
# 2 woman obtain justice
# 3 session woman vote for member
# 4 woman have to end
# 5 woman have no existence
# 6 woman lose right
# 7 woman be much
# 8 woman mix at dance
# 9 woman vote as member
# 10 woman have power
# 11 woman act only
# 12 no committee woman passed vote

How to clean up data in R using strings?

I need to clean up gender and dates columns of the dataset found here.
They apparently contain some misspellings and ambiguities. I am new to R and data cleaning so I am not sure how to go about doing this. For starters, I have tried to correct the misspellings using
factor(data$artist_data$gender)
str_replace_all(data$artist_data$gender, pattern = "femle", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "f.", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "F.", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "female", replacement = "Female")
But it doesn't seem to work as I still have f., F. and femle in my output. Secondly, there seem to be empty cells inside. Do I need to remove them or is it alright to leave them there. If I need to remove them, how?
Thirdly, for the dates column, how do I make it clearer? i.e. change the format of born in xxxx to maybe xxxx-yyyy if died or xxxx-present if still alive. e.g. born in 1940 - is it safe to assume that they are still alive? Also one of the data has the word active in it. Would like to make this data more straight-forward.
Please help,
Thank you.
We have to escape the dot in f. and F., because an unescaped . in a regular expression matches any character.
library(dplyr)
library(stringr)
library(tibble)
pattern <- "f\\.|F\\.|female|femle"
df[[2]] %>%
mutate(gender = str_replace(string=gender,
pattern = pattern,
replacement="Female")) %>%
as_tibble()
name gender dates placeOfBirth placeOfDeath
<chr> <chr> <chr> <chr> <chr>
1 Abakanowicz, Magdalena Female born 1930 Polska ""
2 Abbey, Edwin Austin Male 1852–1911 Philadelphia, United States "London, United Kingdom"
3 Abbott, Berenice Female 1898–1991 Springfield, United States "Monson, United States"
4 Abbott, Lemuel Francis Male 1760–1803 Leicestershire, United Kingdom "London, United Kingdom"
5 Abrahams, Ivor Male born 1935 Wigan, United Kingdom ""
6 Absalon Male 1964–1993 Tel Aviv-Yafo, Yisra'el "Paris, France"
7 Abts, Tomma Female born 1967 Kiel, Deutschland ""
8 Acconci, Vito Male born 1940 New York, United States ""
9 Ackling, Roger Male 1947–2014 Isleworth, United Kingdom ""
10 Ackroyd, Norman Male born 1938 Leeds, United Kingdom ""
# ... with 3,522 more rows
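Two details from the question are worth illustrating: the original str_replace_all calls never assigned their result back, and an unescaped "f." also matches the start of "femle". The sketch below uses a small made-up gender vector (not the real dataset) and base R only. Anchoring the pattern and turning empty cells into NA are assumptions about what is wanted, not something the answer above prescribes:

```r
# Hypothetical sample values, standing in for data$artist_data$gender
gender <- c("Female", "femle", "f.", "F.", "female", "Male", "")

# Unescaped dot: "f." matches "fe" in "femle", giving a mangled value
mangled <- sub("f.", "Female", "femle")

# Anchored, escaped pattern: only whole-cell misspellings are replaced,
# and the result is assigned back so the change is kept
pattern <- "^(f\\.|F\\.|female|femle)$"
gender  <- sub(pattern, "Female", gender)

# One option for empty cells: mark them as missing rather than deleting rows
gender[gender == ""] <- NA
gender
```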

Replacing NAs in a dataframe based on a partial string match (in another dataframe) in R

Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLookUp, except only in R)
Given df1 here (For Simplicity's sake, I just have 6 rows. The key I have is 50 rows for 50 states):
Index  State_Name  Abbreviation
1      California  CA
2      Maryland    MD
3      New York    NY
4      Texas       TX
5      Virginia    VA
6      Washington  WA
And given df2 here (This is just an example. The real dataframe I'm working with has a lot more rows) :
Index  State  Article
1      NA     Texas governor, Abbott, signs new abortion bill
2      NA     Effort to recall California governor Newsome loses steam
3      NA     New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4      NA     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5      NA     DC statehood unlikely as Manchin opposes
6      NA     Amazon HQ2 causing housing prices to soar in northern Virginia
Task: To create an R function that loops over and reads the state in each df2$Article row, then cross-references it with df1$State_Name to replace the NAs in df2$State with the respective df1$Abbreviation key based on the state in df2$Article. I know it's quite a mouthful. I'm stuck on how to start and finish this puzzle. Hard-coding is not an option, as the real datasheet I have has thousands of rows like this and will be updated as we add more articles to text-scrape.
The output should look like:
Index  State  Article
1      TX     Texas governor, Abbott, signs new abortion bill
2      CA     Effort to recall California governor Newsome loses steam
3      NY     New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4      MD     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5      NA     DC statehood unlikely as Manchin opposes
6      VA     Amazon HQ2 causing housing prices to soar in northern Virginia
Note: The fifth entry with DC is intended to be NA.
Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!
You can create a regex pattern from State_Name and use str_extract to extract the first matching state name from Article. Then use match to get the corresponding Abbreviation from df1.
library(stringr)
df2$State <- df1$Abbreviation[match(str_extract(df2$Article,
str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA "VA"
You can also use inbuilt state.name and state.abb instead of df1 to get state name and abbreviations.
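A rough base R sketch of that idea, using the built-in state.name and state.abb vectors instead of df1. The articles vector simply repeats the example rows. This only picks up the first state mentioned per article, and a bare "Washington" would also match articles about Washington, DC, so treat it as a starting point:

```r
articles <- c("Texas governor, Abbott, signs new abortion bill",
              "Effort to recall California governor Newsome loses steam",
              "New York governor, Cuomo, accused of manipulating Covid-19 nursing home data",
              "Hogan (Maryland, R) announces plans to lift statewide Covid restrictions",
              "DC statehood unlikely as Manchin opposes",
              "Amazon HQ2 causing housing prices to soar in northern Virginia")

# One alternation over all 50 built-in state names
pattern <- paste(state.name, collapse = "|")

# First match per article; regexpr() returns -1 where nothing matched
m <- regexpr(pattern, articles)
found <- rep(NA_character_, length(articles))
found[m > 0] <- regmatches(articles, m)

# Look up the abbreviation for each extracted name (NA stays NA)
state <- state.abb[match(found, state.name)]
state
```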
Here's a way to do this in a for loop -
for(i in seq(nrow(df1))) {
inds <- grep(df1$State_Name[i], df2$Article)
if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2
# Index State Article
#1 1 TX Texas governor, Abbott, signs new abortion bill
#2 2 CA Effort to recall California governor Newsome loses steam
#3 3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4 4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5 5 <NA> DC statehood unlikely as Manchin opposes
#6 6 VA Amazon HQ2 causing housing prices to soar in northern Virginia
Not as concise as above but a Base R approach:
# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
# Coerce 0 length vectors to na values of the appropriate type:
# .zero_to_nas => function()
.zero_to_nas <- function(x){
if(identical(x, character(0))){
NA_character_
}else if(identical(x, integer(0))){
NA_integer_
}else if(identical(x, numeric(0))){
NA_real_
}else if(identical(x, complex(0))){
NA_complex_
}else if(identical(x, logical(0))){
NA
}else{
x
}
}
# Unlist cleaned list: res => vector
res <- unlist(lapply(lst, .zero_to_nas))
# Explicitly define return object: vector => GlobalEnv()
return(res)
}
# Classify each article as belonging to the appropriate state:
# clean_df => data.frame
clean_df <- transform(
df2,
State = df1$Abbreviation[
match(
list_2_vec(
regmatches(
Article,
gregexpr(
paste0(df1$State_Name, collapse = "|"), Article
)
)
),
df1$State_Name
)
]
)
# Data:
df1 <- structure(list(Index = 1:6, State_Name = c("California", "Maryland",
"New York", "Texas", "Virginia", "Washington"), Abbreviation = c("CA",
"MD", "NY", "TX", "VA", "WA")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(Index = 1:6, State = c(NA, NA, NA, NA, NA, NA),
Article = c("Texas governor, Abbott, signs new abortion bill",
"Effort to recall California governor Newsome loses steam",
"New York governor, Cuomo, accused of manipulating Covid-19 nursing home data",
"Hogan (Maryland, R) announces plans to lift statewide Covid restrictions",
"DC statehood unlikely as Manchin opposes", "Amazon HQ2 causing housing prices to soar in northern Virginia"
)), class = "data.frame", row.names = c(NA, -6L))

R extract house / street numbers from address string

Say I have the following data with addresses, i.e. street names. My goal is to separate street names from house numbers.
mydf <- tribble(
~street,
"Some Way 10",
"Shiny Street 12b",
"Dark Street from Netflix Movie 17c - 17d",
"Seasame Street",
"Dark Alley 15c",
)
mydf <- mydf %>% mutate(street= str_squish(street)) # get rid of whitespace
I tried the following
sub <- tidyr::extract(mydf, "street", c("street_name_only", "house_number"), "(\\D+)(\\d.*)") %>%
print(n=5)
which works fine as long as a house number is present. If the string in "street" has no house number, NAs show up in both new variables, "street_name_only" and "house_number", as is the case with "Seasame Street". (I would like to have "Seasame Street" in the "street_name_only" column and ideally "" (empty) in the "house_number" column, though I could manage the NA in the house_number column afterwards.)
Could anybody tell me where I went wrong and how to solve this issue?
Thank you very much in advance.
Will this work:
mydf %>%
transmute(street_name_only = str_remove(street, '\\d.*'),
house_number = str_extract(street, '\\d.*'))
# A tibble: 5 x 2
street_name_only house_number
<chr> <chr>
1 "Some Way " 10
2 "Shiny Street " 12b
3 "Dark Street from Netflix Movie " 17c - 17d
4 "Seasame Street" NA
5 "Dark Alley " 15c
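The original pattern "(\\D+)(\\d.*)" requires at least one digit, so a row without a house number does not match at all and both new columns become NA. As a sketch of an alternative in base R only, the two parts can be derived independently, which also allows trimming the trailing space and returning "" instead of NA. The plain vector street stands in for mydf$street:

```r
street <- c("Some Way 10", "Shiny Street 12b",
            "Dark Street from Netflix Movie 17c - 17d",
            "Seasame Street", "Dark Alley 15c")

# Street name: drop everything from the first digit on, then trim
street_name_only <- trimws(sub("\\d.*$", "", street))

# House number: everything from the first digit on, or "" if no digit
has_number   <- grepl("\\d", street)
house_number <- ifelse(has_number, sub("^\\D*", "", street), "")

data.frame(street_name_only, house_number)
```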
Using tidyr::separate :
tidyr::separate(mydf, street, c("street_name_only", "house_number"),
'(?=\\d)', extra = 'merge', fill = 'right')
# street_name_only house_number
# <chr> <chr>
#1 "Some Way " 10
#2 "Shiny Street " 12b
#3 "Dark Street from Netflix Movie " 17c - 17d
#4 "Seasame Street" NA
#5 "Dark Alley " 15c

Checking to see if strings in one column matches the abbreviated form of the strings in another column

I have a large data frame "df" with 2 columns:
**column1** **column2**
The City of New York TCNY
The Land of the Free TLF
Stellar Stars Basketball Program SSBP
Center for Life Sciences CLS
Children's Hospital of Los Angeles CHLA
New York Yankees NY
etc etc
I've done some research and saw that you could use mapply to do a function on two columns at the same time but I'm uncertain what function I would do. I was thinking doing something where a function checks all the capital letters in the strings of column1 and checks if those capital letters exist in column2 but really unsure how.. Any help would be great! Thank you so much!
Here's an example of what I think you might be trying to achieve (on a subset of the rows you've shown in your question):
df <- data.frame(
col_1 = c("The City of New York", "The Land of the Free", "New York Yankees"),
col_2 = c("TCNY", "TLF", "NY")
)
> df
col_1 col_2
1 The City of New York TCNY
2 The Land of the Free TLF
3 New York Yankees NY
# Add a third column indicating whether the capitalised letters of the first
# column are equal to the strings in the second
df$col_3 <- unlist(apply(df, 1, function(x) gsub("[^A-Z]", "", x[1]) == x[2]))
> df
col_1 col_2 col_3
1 The City of New York TCNY TRUE
2 The Land of the Free TLF TRUE
3 New York Yankees NY FALSE
Above I'm using gsub to remove any characters that aren't upper case from the first column values, then comparing them to the second column in an apply statement, which is operating on each row of the dataframe. Then I'm using unlist to convert the result from a list to a vector, which can be stored in the third column of the dataframe df.
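Since gsub and == are both vectorized, the same check can probably be written without apply at all. A minimal sketch on the same three rows:

```r
df <- data.frame(
  col_1 = c("The City of New York", "The Land of the Free", "New York Yankees"),
  col_2 = c("TCNY", "TLF", "NY"),
  stringsAsFactors = FALSE
)

# Strip everything that is not an upper-case letter, then compare
# element-wise with the abbreviation column
df$col_3 <- gsub("[^A-Z]", "", df$col_1) == df$col_2
df
```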
Using base R:
transform(dat,correctABBV=x<-gsub("[^A-Z]","",column1),check=x==column2)
column1 column2 correctABBV check
1 The City of New York TCNY TCNY TRUE
2 The Land of the Free TLF TLF TRUE
3 Stellar Stars Basketball Program SSBP SSBP TRUE
4 Center for Life Sciences CLS CLS TRUE
5 Children's Hospital of Los Angeles CHLA CHLA TRUE
6 New York Yankees NY NYY FALSE
Here is one approach for you. I was not sure if you wanted to keep etc as an abbreviation or not. At the moment, I treat it as an abbreviation. First, I wanted to create abbreviations based on the first column. I checked how many words exist in each string using stri_count(). When the answer is TRUE to the logical condition, I used gsub() to extract capital letters. When the answer is FALSE to the logical condition, I added elements in mycol1 to abb. Finally, I checked if elements in abb and mycol2 are the same or not and created check.
mydf <- data.frame(mycol1 = c("The City of New York", "The Land of the Free", "Stellar Stars Basketball Program",
"Center for Life Sciences", "Children's Hospital of Los Angeles", "New York Yankees", "etc"),
mycol2 = c("TCNY", "TLF", "SSBP", "CLS", "CHLA", "NY", "etc"),
stringsAsFactors = FALSE)
library(dplyr)
library(stringi)
mutate(mydf,
abb = if_else(stri_count(mycol1, regex = "\\w+") > 1,
gsub(x = mycol1, pattern = "[^A-Z]",replacement = ""),
mycol1),
check = abb == mycol2)
mycol1 mycol2 abb check
1 The City of New York TCNY TCNY TRUE
2 The Land of the Free TLF TLF TRUE
3 Stellar Stars Basketball Program SSBP SSBP TRUE
4 Center for Life Sciences CLS CLS TRUE
5 Children's Hospital of Los Angeles CHLA CHLA TRUE
6 New York Yankees NY NYY FALSE
7 etc etc etc TRUE
