I'm working on a dataset where one column (Place) contains a free-text location description.
library(tidyverse)
example <- tibble(Datum = c("October 1st 2017",
"October 2st 2017",
"October 3rd 2017"),
Place = c("Tabiyyah Jazeera village, 20km south east of Deir Ezzor, Deir Ezzor Governorate, Syria",
"Abu Kamal, Deir Ezzor Governorate, Syria",
"شارع القطار al Qitar [train] street, al-Tawassiya area, north of Raqqah city centre, Raqqah governorate, Syria"))
I would like to split the Place column on the comma separator, and I would prefer a tidyverse solution. Because the values of Place contain different numbers of comma-separated parts, I would like to fill from right to left, so that the country (Syria) ends up in the last column of the data frame.
Oh, and as a bonus: with which regex can I delete the Arabic characters?
Thanks in advance.
Edit: Found my answer:
For removing Arabic characters (thanks to @g5w):
gsub("[\u0600-\u06FF]", "", airstrikes_okt_clean$Plek)
And splitting the column in a tidyr way:
airstrikes_okt_clean <- separate(example,
                                 Place,
                                 into = c("detail",
                                          "detail2",
                                          "City_or_village",
                                          "District",
                                          "Country"),
                                 sep = ",",
                                 fill = "left")
Just split the string on the comma and then reverse it.
lapply(strsplit(example$Place, ","), rev)
[[1]]
[1] " Syria" " Deir Ezzor Governorate"
[3] " 20km south east of Deir Ezzor" "Tabiyyah Jazeera village"
[[2]]
[1] " Syria" " Deir Ezzor Governorate"
[3] "Abu Kamal"
[[3]]
[1] " Syria" " Raqqah governorate"
[3] " north of Raqqah city centre" " al-Tawassiya area"
[5] "شارع القطار al Qitar [train] street"
To get rid of the Arabic characters before splitting, try
gsub("[\u0600-\u06FF]", "", Place)
[1] "Tabiyyah Jazeera village, 20km south east of Deir Ezzor, Deir Ezzor Governorate, Syria"
[2] "Abu Kamal, Deir Ezzor Governorate, Syria"
[3] " al Qitar [train] street, al-Tawassiya area, north of Raqqah city centre, Raqqah governorate, Syria"
Here's a one-liner.
sapply(strsplit(example$Place, ","), function(x) trimws(x[length(x)]))
It will return the string after the last comma, be it Syria or anything else.
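If you'd rather have the same one-liner in tidyverse idiom, a rough sketch (str_split() from stringr, map_chr() from purrr, last() from dplyr):
map_chr(str_split(example$Place, ","), ~ str_trim(last(.x)))
# for the example data this gives "Syria" for every row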
Related
I have state, county and MSA names in a single string variable states_county_MSA, and I want to split them to create three distinct variables - states, county and MSAs.
tail(df$states_county_MSA,n=10)
[1] "Iowa Polk Des Moines"
[2] "Mississippi Hinds Jackson"
[3] "Georgia Richmond Augusta-Richmond"
[4] "Ohio Mahoning Youngstown-Warren-Boardman"
[5] "Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[6] "Pennsylvania Dauphin Harrisburg-Carlisle"
[7] "Florida Brevard Palm Bay-Melbourne-Titusville"
[8] "Utah Utah Provo-Orem"
[9] "Tennessee Hamilton Chattanooga"
[10] "North Carolina Durham Durham"
Modifying the solution by @jared_mamrot to a similar question (splitting a state-county variable into distinct state and county variables, posted below; full problem here for reference - Extracting states and counties from state-county character variable), I can split the states_county_MSA variable into two variables: states and county-MSA.
library(tidyverse)
states_county_names_df <- data.frame(states_county = c(
"California San Francisco",
"New York Bronx",
"Illinois Cook",
"Massachusetts Suffolk",
"District of Columbia District of Columbia"
)
)
data(state)
states_inc_Columbia <- c(state.name, "District of Columbia")
states_county_names_df %>%
mutate(state = str_extract(states_county, paste(states_inc_Columbia, collapse = "|")),
county = str_remove(states_county, paste(states_inc_Columbia, collapse = "|")))
However, in this scenario, I am not able to decompose states_county_MSA further, as I cannot find a function that returns county or MSA names. I was not able to get a county.names function to work, and I tried the tigris, censusapi and maps packages but was unable to generate a vector of US county names for the string split/extract command.
> data(county.names)
Warning in data(county.names) : data set ‘county.names’ not found
I was thinking of using the word() function, but the names of MSAs are not standard either (one or more words).
Would anyone know a way to split the county-MSA in an efficient manner ?
EDIT - Data with (space) delimiter {state, county, MSA, MSA population, month, year}.
[1] "Virginia Richmond Richmond 1,210,063 8 2014"
[2] "Louisiana Orleans New Orleans-Metairie-Kenner 1,195,794"
[3] "North Carolina Wake Raleigh-Cary 1,137,346 6 2014"
[4] "New York Erie Buffalo-Niagara Falls 1,135,342"
[5] "Alabama Jefferson Birmingham-Hoover 1,129,034"
[6] "Utah Salt Lake Salt Lake City 1,091,432 5 2014"
[7] "New York Monroe Rochester 1,080,082"
[8] "Michigan Kent Grand Rapids-Wyoming 989,205 7 2014"
[9] "Arizona Pima Tucson 981,935 10 2013"
[10] "Hawaii Honolulu Honolulu 956,336 8 2013"
I think this should work:
data <- tibble::tribble(~state_county_msa,
"Iowa Polk Des Moines" ,
"Mississippi Hinds Jackson" ,
"Georgia Richmond Augusta-Richmond" ,
"Ohio Mahoning Youngstown-Warren-Boardman" ,
"Pennsylvania Lackawanna Scranton--Wilkes-Barre",
"Pennsylvania Dauphin Harrisburg-Carlisle" ,
"Florida Brevard Palm Bay-Melbourne-Titusville" ,
"Utah Utah Provo-Orem" ,
"Tennessee Hamilton Chattanooga" ,
"North Carolina Durham Durham")
state_county <- ggplot2::map_data("county") %>%
select(state = region,
county = subregion) %>%
as_tibble() %>%
mutate(across(everything(),str_to_title)) %>%
unite(state_county, c("state","county"), sep = " ", remove = FALSE) %>%
distinct(state_county, .keep_all = TRUE)
state_county_string <- paste(state_county$state_county, collapse = "|")
data %>%
mutate(state_county = str_extract(state_county_msa, state_county_string),
msa = str_trim(str_remove(state_county_msa, state_county_string))) %>%
left_join(state_county, by = "state_county") %>%
select(state, county, msa)
Output:
# A tibble: 10 × 3
state county msa
<chr> <chr> <chr>
1 Iowa Polk Des Moines
2 Mississippi Hinds Jackson
3 Georgia Richmond Augusta-Richmond
4 Ohio Mahoning Youngstown-Warren-Boardman
5 Pennsylvania Lackawanna Scranton--Wilkes-Barre
6 Pennsylvania Dauphin Harrisburg-Carlisle
7 Florida Brevard Palm Bay-Melbourne-Titusville
8 Utah Utah Provo-Orem
9 Tennessee Hamilton Chattanooga
10 North Carolina Durham Durham
I have a variable below that I believe is separated by spaces.
[95] "Florida Volusia Deltona-Daytona Beach-Ormond Beach"
[96] "Iowa Polk Des Moines"
[97] "Mississippi Hinds Jackson"
[98] "Georgia Richmond Augusta-Richmond"
[99] "Ohio Mahoning Youngstown-Warren-Boardman"
[100] " Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[101] " Pennsylvania Dauphin Harrisburg-Carlisle"
[102] " Florida Brevard Palm Bay-Melbourne-Titusville"
[103] " Utah Utah Provo-Orem"
[104] " Tennessee Hamilton Chattanooga"
[105] " North Carolina Durham Durham"
I want to create three variables out of this string - state, county, and MSA. But the usual string-splitting commands are not working. I tried stringi commands too but failed to split the variable. I am not sure why this is happening, as the same commands work on simpler strings.
> strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df,"")
Error in strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df, :
non-character argument
> BK<-strsplit(as.character(BK_state_county_MSA$non_squished_states_county_MSA_names_df),"\\t")
> str(BK) #List of 0
list()
> stri_split(str=BK_state_county_MSA$non_squished_states_county_MSA_names_df, regex="\\t",n=3)
list()
> BK <-stri_split_lines(BK_state_county_MSA$non_squished_states_county_MSA_names_df)
list()
EDIT - The original data has 104 observations, but I am posting only 8 observations with dput command...
> dput(BK_state_county_MSA)
structure(list(non_squished_state_county_MSA = c(
"New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem",
" Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")
Here's an option using stri_trim_left() from stringi and separate() from tidyr:
stri_trim_left() removes leading whitespace from strings, which occurs in your data starting at [100]. You can then separate() the strings into the three specified columns state, county and MSA, splitting on at least two whitespaces (sep = " {2,}").
Data
BK_state_county_MSA<- structure(list(non_squished_state_county_MSA = c(
"New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem",
" Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")
Code
library(dplyr)
library(tidyr)
library(stringi)
BK_state_county_MSA %>%
  mutate(non_squished_state_county_MSA = stri_trim_left(non_squished_state_county_MSA)) %>%
  separate(non_squished_state_county_MSA, into = c("state", "county", "MSA"), sep = " {2,}")
Output
state county MSA
1 New York Bronx New York-Wayne-White Plains
2 New York Kings New York-Wayne-White Plains
3 Pennsylvania Lackawanna Scranton--Wilkes-Barre
4 Pennsylvania Dauphin Harrisburg-Carlisle
5 Florida Brevard Palm Bay-Melbourne-Titusville
6 Utah Utah Provo-Orem
7 Tennessee Hamilton Chattanooga
8 North Carolina Durham Durham
There's probably something useful, in terms of further data analysis, about keeping North with Carolina and New with York, and so on. And it's always nice to have a one-liner, but sometimes a few lines get you where you need to be for moving forward. Consider playing around like this:
maddening_txt <- " North Carolina Durham Durham"
strsplit(maddening_txt, split = ' ') # n space = 4L
[[1]]
[1] " North Carolina" " Durham" "" ""
[5] "" " Durham"
nchar(strsplit(maddening_txt, split = '    ')[[1]])
[1] 15 7 0 0 0 7
# you could throw in a which test for >0 for indexing here
strsplit(maddening_txt, split = '    ')[[1]][c(1,2,6)]
[1] " North Carolina" " Durham" " Durham"
string_vec <- strsplit(maddening_txt, split = '    ')[[1]][c(1,2,6)]
> string_vec[1]
[1] " North Carolina"
string_vec[2:3]
[1] " Durham" " Durham"
Left as it is, when returned to in the future, something like split = '    ' begs the question 'how many spaces was that?'
So, more explicitly, and using some of the notation above:
> strsplit(sc_msa$entries[8], split='\\s{5,}')
[[1]]
[1] " North Carolina" "Durham" "Durham"
This appears to also remove the leading left whitespace without recourse to trim. And then the above-mentioned approach in a for loop:
sc_msa_lst <- list()
> for(i in 1:length(sc_msa$entries)) {
+ sc_msa_lst[[i]] <- strsplit(sc_msa$entries[i], split='\\s{5,}')
+ }
> sc_msa_lst
[[1]]
[[1]][[1]]
[1] "New York" "Bronx"
[3] "New York-Wayne-White Plains"
[[2]]
[[2]][[1]]
[1] "New York" "Kings"
[3] "New York-Wayne-White Plains"
[[3]]
[[3]][[1]]
[1] " Pennsylvania" "Lackawanna" "Scranton--Wilkes-Barre"
[[4]]
[[4]][[1]]
[1] " Pennsylvania" "Dauphin" "Harrisburg-Carlisle"
[[5]]
[[5]][[1]]
[1] " Florida" "Brevard"
[3] "Palm Bay-Melbourne-Titusville"
[[6]]
[[6]][[1]]
[1] " Utah" "Utah" "Provo-Orem"
[[7]]
[[7]][[1]]
[1] " Tennessee" "Hamilton" "Chattanooga"
[[8]]
[[8]][[1]]
[1] " North Carolina" "Durham" "Durham"
So, our regex pattern, '\s{5,}' works for cases seen so far. Next, I have to remember how to make a data.frame...there's got to be a list of lists to data.frame question on SO somewhere that'll help me.
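For what it's worth, a minimal sketch of that last step, assuming each element of sc_msa_lst wraps a single three-element character vector as in the output above:
sc_msa_df <- as.data.frame(
  do.call(rbind, lapply(sc_msa_lst, function(e) trimws(e[[1]])))  # bind the 3-element vectors into a matrix
)
names(sc_msa_df) <- c("state", "county", "MSA")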
A completely unacceptable answer, but it will get you into the guts of what the simplifications are offering, and it provides more control going forward.
Data, with renaming:
sc_msa <- structure(list(entries = c("New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem", " Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")
I have raw bibliographic data as follows:
bib =
c("Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte",
"Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in",
"Republican China*, Cambridge: Harvard University Press, 1976.",
"", "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing",
"History* and the Crisis of Traditional Chinese Historiography,\"",
"*Historiography East & West*2.2 (Sept. 2004): 173-204", "",
"Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:",
"Yale University Press, 1988.", "")
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte"
[2] "Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in"
[3] "Republican China*, Cambridge: Harvard University Press, 1976."
[4] ""
[5] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing"
[6] "History* and the Crisis of Traditional Chinese Historiography,\""
[7] "*Historiography East & West*2.2 (Sept. 2004): 173-204"
[8] ""
[9] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:"
[10] "Yale University Press, 1988."
[11] ""
I would like to collapse the elements between the ""s onto single lines so that:
clean_bib[1]=paste(bib[1], bib[2], bib[3])
clean_bib[2]=paste(bib[5], bib[6], bib[7])
clean_bib[3]=paste(bib[9], bib[10])
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
Is there a one-liner that does this automatically?
You can use tapply, grouping on the cumulative count of "" elements, then paste the groups together:
unname(tapply(bib,cumsum(bib==""),paste,collapse=" "))
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] " Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] " Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
[4] ""
you can also do:
unname(c(by(bib,cumsum(bib==""),paste,collapse=" ")))
or
unname(tapply(bib,cumsum(grepl("^$",bib)),paste,collapse=" "))
etc
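One small follow-up sketch: the grouped result keeps a leading space on the later groups and a trailing empty element, both of which are easy to clean:
clean_bib <- trimws(unname(tapply(bib, cumsum(bib == ""), paste, collapse = " ")))
clean_bib <- clean_bib[nzchar(clean_bib)]  # drop any empty groups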
Similar to the other answer. This uses split and sapply. The second line is just to remove any elements that only contain "".
vec <- unname(sapply(split(bib, f = cumsum(bib %in% "")), paste0, collapse = " "))
vec[!vec %in% ""]
This question already has answers here:
R: Find and remove all one to two letter words
(2 answers)
Closed 5 years ago.
I've got a problem and I'm sure it's super simple to fix, but I've been searching for an answer for about an hour and can't seem to work it out.
I have a character vector with data that looks a bit like this:
[5] "Toronto, ON" "Manchester, UK"
[7] "New York City, NY" "Newark, NJ"
[9] "Melbourne" "Los Angeles, CA"
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, CO" "London, UK"
[15] "New York, NY"
and basically I'd like to get rid of all character elements that are 2 characters or shorter, so that the data can then look as follows:
[5] "Toronto, " "Manchester, "
[7] "New York City, " "Newark, "
[9] "Melbourne" "Los Angeles, "
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, " "London, "
[15] "New York, "
The commas I know how to get rid of. As I said, I'm sure this is super simple, any help would be greatly appreciated. Thanks!
You can use a quantifier on the word character class \\w with word boundaries: \\b\\w{1,2}\\b will match a word of one or two characters; use gsub to remove every match in case you have multiple matched patterns:
gsub("\\b\\w{1,2}\\b", "", v)
# [1] "Toronto, " "Manchester, " "New York City, " "Newark, " "Melbourne" "Los Angeles, " "New York, USA"
# [8] "Liverpool, England" "Fort Collins, " "London, " "New York, "
Notice \\w matches letters, digits and the underscore; if you only want to take alphabetic letters into account, you can use gsub("\\b[a-zA-Z]{1,2}\\b", "", v).
v <- c("Toronto, ON", "Manchester, UK", "New York City, NY", "Newark, NJ", "Melbourne", "Los Angeles, CA", "New York, USA", "Liverpool, England", "Fort Collins, CO", "London, UK", "New York, NY")
This one doesn't pattern-match the short words directly (it splits on whitespace and filters by token length), but it gets the job done:
d <- c(
"Toronto, ON", "Manchester, UK",
"New York City, NY", "Newark, NJ",
"Melbourne", "Los Angeles, CA" ,
"New York, USA", "Liverpool, England" ,
"Fort Collins, CO", "London, UK" ,
"New York, NY" )
toks <- strsplit(d, "\\s+")        # tokenize each string on whitespace
lens <- sapply(toks, nchar)        # character count of each token
mapply(function(a, b) {
  paste(a[b > 2], collapse = " ")  # keep only tokens longer than 2 characters
}, toks, lens)
I am trying to clean a set of data which is in csv format. After loading the data into R, I need to replace and also remove some characters from it. Below is an example. Ideally I want to:
replace the St at the end of each row with Street;
in cases where there is St St., remove the St and replace the St. with just Street.
I tried this code:
sub(x = evostreet, pattern = "St.", replacement = " ") and later
gsub(x = evostreet, pattern = "St.", replacement = " ") to remove the St. at the end of each row, but this also removes some other occurrences of St and the next character:
3 James St.
4 Glover Road St.
5 Jubilee Estate. St.
7 Fed Housing Estate St.
8 River State School St.
9 Brown State Veterinary Clinic. St.
11 Saw Mill St.
12 Dyke St St.
13 Governor Rd St.
I'm seeing a lot of close answers, but I'm not seeing any that address the second problem he's having: replacing "St St." with "Street", e.g., "Dyke St St."
sub, as stated in the documentation:
The two *sub functions differ only in that sub replaces only the first occurrence of a pattern
So, just using "St\\." as the pattern match is incorrect.
OP needs to match a possible pattern of "St St." and I'll further assume that it could even be "St. St." or "St. St".
Assuming OP is using a simple list:
x = c("James St.", "Glover Road St.", "Jubilee Estate. St.",
"Fed Housing Estate St.", "River State School St St.",
"Brown State Vet Clinic. St. St.", "Dyke St St.")`
[1] "James St." "Glover Road St."
[3] "Jubilee Estate. St." "Fed Housing Estate St."
[5] "River State School St St." "Brown State Vet Clinic. St. St."
[7] "Dyke St St."
Then the following will replace the possible combinations mentioned above with "Street", as requested:
y <- sub(x, pattern = "[ St\\.]*$", replacement = " Street")  # the character class strips any trailing run of spaces, "S", "t" and "."
[1] "James Street" "Glover Road Street"
[3] "Jubilee Estate Street" "Fed Housing Estate Street"
[5] "River State School Street" "Brown State Vet Clinic Street"
[7] "Dyke Street"
Edit:
To answer OP's question below about replacing one occurrence of St. with Saint and another with Street: I was looking for a way to match similar expressions and return different values, but so far I haven't found one. I suspect regmatches can do this, but it's something I'll have to fiddle with later.
A simple way to accomplish what you're wanting - let's assume:
x <- c("St. Mary St St.", "River State School St St.", "Dyke St. St")
[1] "Saint Mary St St." "River State School St St."
[3] "Dyke St. St"
So you want x[1] to be Saint Mary Street, x[2] to be River State School Street and x[3] to be Dyke Street. I would want to resolve the Saint issue first by assigning sub() to y like:
y <- sub(x, pattern = "^St\\.", replacement = "Saint")
[1] "Saint Mary Street" "River State School Street"
[3] "Dyke Street"
To resolve the St's at the end, we can use the same solution I posted above, except notice that now I'm not using x as my input vector but instead the y I just made:
y <- sub(y, pattern = "[ St\\.]*$", replacement = " Street")
[1] "Saint Mary Street"          "River State School Street"
[3] "Dyke Street"
And that should take care of it. Now, I don't know if this is the most efficient way, and if your dataset is rather large this may run slowly. If I find a better solution I will post it (provided no one else beats me to it).
You don't need to use a regular expression here.
sub(x = evostreet, pattern = "St.", replacement = " ", fixed = TRUE)
The fixed argument means that you want to replace this exact string, not matches of a regular expression.
I think that your problem is that the '.' character in the regular expression world means "any single character". So to match literally in R you should write
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You will need to "comment" the dot... otherwise it means anything after St and that is why some other parts of your text are eliminated.
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You can add $ at the end if you want to remove the pattern only when it appears at the very end of the text.
sub(x = evostreet, pattern = "St\\.$", replacement = " ")
The difference between sub and gsub is that sub deals just with the first time your pattern appears in the text, while gsub replaces all occurrences if there are several. In your case, as you are looking for the pattern at the end of the line, it should not make any difference whether you use sub or gsub once you include the $.
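For example, a quick sketch (this evostreet is a made-up two-element vector, since the full data isn't shown):
evostreet <- c("James St.", "Dyke St St.")
sub(x = evostreet, pattern = "St\\.$", replacement = "Street")  # anchored to the line end
# [1] "James Street"   "Dyke St Street"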