Using if_else for string search [closed] - r

I am trying to replace the names of US states with the string " United States".
Country <- data.frame(Name = c(" China", " Japan", " Florida", " Canada", " Texas"))
Country$Name <- as.character(Country$Name)
Country
Name
1 China
2 Japan
3 Florida
4 Canada
5 Texas
str(Country)
'data.frame': 5 obs. of 1 variable:
$ Name: chr " China" " Japan" " Florida" " Canada" ...
Below is the dplyr command I used, and it doesn't work. I use state.name for this purpose.
Country %>% mutate(Name = if_else(Name %in% state.name, " United States", Name))
Name
1 China
2 Japan
3 Florida
4 Canada
5 Texas
I tried to use str_detect, but it gives multiple outputs when searching against state.name (FALSE FALSE TRUE ...), so I was unable to use it in the condition check.
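A sketch of how str_detect could still be used, by collapsing state.name into a single pattern so it returns one logical per row (the pattern construction is an assumption, not from my original attempt):
library(dplyr)
library(stringr)
states_pattern <- str_c("\\b(", str_c(state.name, collapse = "|"), ")\\b")  # one alternation over all 50 states
Country %>% mutate(Name = if_else(str_detect(Name, states_pattern), " United States", Name))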

You can use ifelse from base R to do it:
Country <- within(Country, Name <- ifelse(trimws(Name) %in% state.name, "United States", trimws(Name)))

Your problem is that %in% only checks for exact matches. The names in your data.frame have whitespace at the beginning while the entries in state.name don't, so you need to remove this whitespace before comparing the two.
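A quick illustration of the mismatch:
" Florida" %in% state.name          # FALSE: the leading space prevents an exact match
trimws(" Florida") %in% state.name  # TRUE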
You can either remove the whitespace (with trimws) from the Name column before comparison:
library(dplyr)
Country %>%
  mutate(Name = trimws(Name)) %>%
  mutate(Name = if_else(Name %in% state.name, "United States", Name))
#> Name
#> 1 China
#> 2 Japan
#> 3 United States
#> 4 Canada
#> 5 United States
Or trim only within the comparison, which preserves the whitespace in the result (I don't see a reason why you would want that, but just in case):
Country %>%
  mutate(Name = if_else(trimws(Name) %in% state.name, "United States", Name))
#> Name
#> 1 China
#> 2 Japan
#> 3 United States
#> 4 Canada
#> 5 United States
A third possibility would be to use string replacement, for example, with the stringi package:
library(stringi)
Country %>%
  mutate(Name = stri_replace_all_fixed(Name, state.name, "United States",
                                       vectorize_all = FALSE,
                                       opts_fixed = stri_opts_fixed(case_insensitive = TRUE)))
#> Name
#> 1 China
#> 2 Japan
#> 3 United States
#> 4 Canada
#> 5 United States
I wouldn't recommend this either, but I included it since it gives you a few more options (e.g., case_insensitive) if your strings are more complicated than those in your sample data.
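For example, if your data also had inconsistent casing, the case-insensitive fixed matching would still catch it where the %in% approaches above would not; a quick sketch with made-up data (Country2 is hypothetical):
library(dplyr)
library(stringi)
Country2 <- data.frame(Name = c("FLORIDA", "texas", "Japan"), stringsAsFactors = FALSE)
Country2 %>%
  mutate(Name = stri_replace_all_fixed(Name, state.name, "United States",
                                       vectorize_all = FALSE,
                                       opts_fixed = stri_opts_fixed(case_insensitive = TRUE)))
# both state rows become "United States"; "Japan" is untouched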

Related

How to fill in missing time series rows in a data frame?

I am working with the following time series data:
Weeks <- c("1995-01", "1995-02", "1995-03", "1995-04", "1995-06", "1995-08", "1995-10", "1995-15", "1995-16", "1995-24", "1995-32")
Country <- c("United States")
Values <- sample(seq(1,500,1), length(Weeks), replace = T)
df <- data.frame(Weeks,Country, Values)
Weeks Country Values
1 1995-01 United States 193
2 1995-02 United States 183
3 1995-03 United States 402
4 1995-04 United States 75
5 1995-06 United States 402
6 1995-08 United States 436
7 1995-10 United States 97
8 1995-15 United States 445
9 1995-16 United States 336
10 1995-24 United States 31
11 1995-32 United States 413
It is structured according to the year and the week number in that year (column 1). Notice, how some weeks are omitted (as a result of the aggregation function). For example, 1995-05 is not included. How can I include the omitted rows into the data, add the appropriate country name, and assign them a value = 0?
Thank you for your help!
Separate the year and week values into different columns. For each Country and Years combination, complete the missing weeks and assign Values = 0. Finally, unite the year and week columns to get the data back into its original format.
library(dplyr)
library(tidyr)
df %>%
  separate(Weeks, c('Years', 'Weeks'), sep = '-', convert = TRUE) %>%
  group_by(Country, Years) %>%
  complete(Weeks = min(Weeks):max(Weeks), fill = list(Values = 0)) %>%
  ungroup() %>%
  mutate(Weeks = sprintf('%02d', Weeks)) %>%
  unite(Weeks, Years, Weeks, sep = '-')
# Country Weeks Values
# <chr> <chr> <dbl>
# 1 United States 1995-01 354
# 2 United States 1995-02 395
# 3 United States 1995-03 408
# 4 United States 1995-04 143
# 5 United States 1995-05 0
# 6 United States 1995-06 481
# 7 United States 1995-07 0
# 8 United States 1995-08 49
# 9 United States 1995-09 0
#10 United States 1995-10 229
# … with 22 more rows
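If you instead need every week of the year filled in (not just the span between the first and last observed weeks), a variant of the same pipeline; this sketch assumes a fixed 52 weeks per year:
df %>%
  separate(Weeks, c('Years', 'Weeks'), sep = '-', convert = TRUE) %>%
  group_by(Country, Years) %>%
  complete(Weeks = 1:52, fill = list(Values = 0)) %>%  # fill the whole year, not min:max
  ungroup() %>%
  mutate(Weeks = sprintf('%02d', Weeks)) %>%
  unite(Weeks, Years, Weeks, sep = '-')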

Create a ggplot with grouped factor levels

This is variation on a question asked here: Group factor levels in ggplot.
I have a dataframe:
df <- data.frame(respondent = factor(c(1, 2, 3, 4, 5, 6, 7)),
location = factor(c("California", "Oregon", "Mexico",
"Texas", "Canada", "Mexico", "Canada")))
There are three separate levels related to the US. I don't want to collapse them, as the distinction between states is useful for data analysis. I would, however, like a basic barplot that combines the three US states and stacks them on top of one another, so that there are three bars in the barplot (Canada, Mexico, and US), with the US bar divided into the three states.
If the state factor levels had the "US" in their names, e.g. "US: California", I could use
library(tidyverse)
with_states <- df %>%
  separate(location, into = c("Country", "State"), sep = ": ") %>%
  replace_na(list(State = "Other")) %>%
  mutate(State = as.factor(State) %>% fct_relevel("Other", after = Inf))
to achieve the desired outcome. But how can this be done when R doesn't know that the three states are in the US?
If you look at the previous example, all the separate and replace_na functions do is separate the location variable into a country and state variable:
df
respondent location
1 1 US: California
2 2 US: Oregon
3 3 Mexico
...
df %>%
  separate(location, into = c("Country", "State"), sep = ": ") %>%
  replace_na(list(State = "Other"))
respondent Country State
1 1 US California
2 2 US Oregon
3 3 Mexico Other
...
So really all you need to do is get your data into this format: with a column for country and a column for state/province.
There are many ways to do this yourself. Many times your data will already be in this format. If it isn't, the easiest way to fix it is to do a join to a table which maps location to country:
df
respondent location
1 1 California
2 2 Oregon
3 3 Mexico
4 4 Texas
5 5 Canada
6 6 Mexico
7 7 Canada
state_mapping <- data.frame(state = c("California", "Oregon", "Texas"),
country = c('US', 'US', 'US'),
stringsAsFactors = F)
df %>%
  left_join(state_mapping, by = c('location' = 'state')) %>%
  mutate(country = if_else(is.na(country),
                           as.character(location),
                           country))
respondent location country
1 1 California US
2 2 Oregon US
3 3 Mexico Mexico
4 4 Texas US
5 5 Canada Canada
6 6 Mexico Mexico
7 7 Canada Canada
Once you've got it in this format, you can just do what the other question suggested.
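For completeness, a minimal sketch of the stacked barplot built on top of that join; the state/"Other" split is an assumption about what the desired plot looks like:
library(ggplot2)
df %>%
  left_join(state_mapping, by = c('location' = 'state')) %>%
  mutate(country = if_else(is.na(country), as.character(location), country),
         state = if_else(country == 'US', as.character(location), 'Other')) %>%
  ggplot(aes(x = country, fill = state)) +
  geom_bar()  # one bar per country, with the US bar stacked by state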

Regular Expressions to Unmerge row entries

I have an example data set given by
df <- data.frame(
country = c("GermanyBerlin", "England (UK)London", "SpainMadrid", "United States of AmericaWashington DC", "HaitiPort-au-Prince", "country66city"),
capital = c("#Berlin", "NA", "#Madrid", "NA", "NA", "NA"),
url = c("/country/germany/01", "/country/england-uk/02", "/country/spain/03", "country/united-states-of-america/04", "country/haiti/05", "country/country6/06"),
stringsAsFactors = FALSE
)
country capital url
1 GermanyBerlin #Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 SpainMadrid #Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
5 HaitiPort-au-Prince NA country/haiti/05
6 country66city NA country/country6/06
The aim is to tidy this so that the columns are as one would expect from their names:
the first should contain only the country name.
the second should contain the capital (without a # sign).
the third should remain unchanged.
So my desired output is:
country capital url
1 Germany Berlin /country/germany/01
2 England (UK) London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 country6 6city country/country6/06
In the cases where there are non-NA entries in the capital column, I have a piece of code that achieves this (see bottom of post).
Therefore I am looking for a solution that recognises that the pattern of the url column can be used to split the capital out of the country column.
This needs to account for the fact that
the URL text is all lower case, whilst the country name as it appears in the country column is in mixed case.
the text in the URL replaces spaces with hyphens.
the URL drops special characters (such as the brackets around UK).
I would be interested to see how this aim can be achieved, presumably using regular expressions (though open to any options).
Partial solution when capital column is non-NA
Where there are non-NA entries in the capital column the following code achieves my aim:
library(stringr)
df %>% mutate(capital = str_replace(capital, "#", ""),
              country = str_replace(country, capital, ""))
country capital url
1 Germany Berlin /country/germany/01
2 England (UK)London NA /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States of AmericaWashington DC NA country/united-states-of-america/04
You can do:
transform(df, capital = sub(".*[A-Z]\\S+([A-Z])", "\\1", country))
The pattern consumes everything up to and including the first letter of the capital's name (the last upper-case letter preceded by another capitalised word) and replaces the whole match with just that captured letter, leaving the capital behind.
country capital url
1 GermanyBerlin Berlin /country/germany/01
2 England (UK)London London /country/england-uk/02
3 SpainMadrid Madrid /country/spain/03
4 United States of AmericaWashington DC Washington DC country/united-states-of-america/04
You could start with something like this and keep on refining until you get the (100%) correct results and then see if you can skip/merge any steps.
library(magrittr)
df$country2 <- df$url %>%
  gsub("-", " ", .) %>%                          # turn hyphens back into spaces
  gsub(".+try/(.+)/.+", "\\1", .) %>%            # keep the slug between "country/" and the id
  gsub("(\\b[a-z])", "\\U\\1", ., perl = TRUE)   # capitalise the first letter of each word
df$capital <- df$country %>%
  gsub("[()]", " ", .) %>%                       # replace brackets with spaces
  gsub(" +", " ", .) %>%                         # collapse repeated spaces
  gsub(paste(df$country2, collapse = "|"), "", ., ignore.case = TRUE)  # strip the country names
df$country <- df$country2
df$country2 <- NULL
df
country capital url
1 Germany Berlin /country/germany/01
2 England Uk London /country/england-uk/02
3 Spain Madrid /country/spain/03
4 United States Of America Washington DC country/united-states-of-america/04
5 Haiti Port-au-Prince country/haiti/05
6 Country6 6city country/country6/06

How to remove specific words in a column

I have a column consisting of several Country Offices associated with a company, where I would like to shorten, e.g., China Country Office and Bangladesh Country Office to just China or Bangladesh. In other words, I want to remove the words "Office" and "Country" from the column called Imp_Office.
I tried using the tm package, with reference to an earlier post, but nothing happened.
What I wrote:
library(tm)
stopwords = c("Office", "Country","Regional")
MY_df$Imp_Office <- gsub(paste0(stopwords, collapse = "|","",
MY_df$Imp_Office))
Where I got the following error message:
Error in gsub(paste0(stopwords, collapse = "|", "", MY_df$Imp_Office))
:
argument "x" is missing, with no default
I also tried using the function readLines:
stopwords = readLines("Office", "Country","Regional")
MY_df$Imp_Office <- gsub(paste0(stopwords, collapse = "|","",
MY_df$Imp_Office))
But this didn't help either
I have considered the possibility of using some other string manipulation method, but I don't need to detect, replace or remove whitespace - so I am kind of lost here.
Thank you.
First, let's set up a dataframe with a column like what you describe:
library(tidyverse)
df <- data_frame(Imp_Office = c("China Country Office",
"Bangladesh Country Office",
"China",
"Bangladesh"))
df
#> # A tibble: 4 x 1
#> Imp_Office
#> <chr>
#> 1 China Country Office
#> 2 Bangladesh Country Office
#> 3 China
#> 4 Bangladesh
Then we can use str_remove_all() from the stringr package to remove any bits of text that you don't want from them.
df %>%
  mutate(Imp_Office = str_remove_all(Imp_Office, " Country| Office"))
#> # A tibble: 4 x 1
#> Imp_Office
#> <chr>
#> 1 China
#> 2 Bangladesh
#> 3 China
#> 4 Bangladesh
Created on 2018-04-24 by the reprex package (v0.2.0).
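As an aside on the original error: the gsub call failed because the replacement and x arguments were accidentally placed inside paste0, so gsub received only a pattern. A corrected base-R version of that attempt might look like this (my reading of the intent, not part of the answer above):
stopwords <- c("Office", "Country", "Regional")
MY_df$Imp_Office <- trimws(gsub(paste(stopwords, collapse = "|"), "", MY_df$Imp_Office))  # trimws() cleans leftover spaces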

How to parse non xml as xml? [closed]

http://api.bart.gov/api/stn.aspx?cmd=stns&key=MW9S-E7SL-26DU-VV8V
How do I import this as an XML document? I'm trying to parse it in R.
You can use xml2 to read and parse:
library(xml2)
library(tidyverse)
xml <- read_xml('https://api.bart.gov/api/stn.aspx?cmd=stns&key=MW9S-E7SL-26DU-VV8V')
bart <- xml %>%
  xml_find_all('//station') %>%  # select all station nodes
  map_df(as_list) %>%            # coerce each node to a list, collect into a data.frame
  unnest()                       # unnest the list columns of the data.frame
bart
#> # A tibble: 46 × 9
#> name abbr gtfs_latitude gtfs_longitude
#> <chr> <chr> <chr> <chr>
#> 1 12th St. Oakland City Center 12TH 37.803768 -122.271450
#> 2 16th St. Mission 16TH 37.765062 -122.419694
#> 3 19th St. Oakland 19TH 37.808350 -122.268602
#> 4 24th St. Mission 24TH 37.752470 -122.418143
#> 5 Ashby ASHB 37.852803 -122.270062
#> 6 Balboa Park BALB 37.721585 -122.447506
#> 7 Bay Fair BAYF 37.696924 -122.126514
#> 8 Castro Valley CAST 37.690746 -122.075602
#> 9 Civic Center/UN Plaza CIVC 37.779732 -122.414123
#> 10 Coliseum COLS 37.753661 -122.196869
#> # ... with 36 more rows, and 5 more variables: address <chr>, city <chr>,
#> # county <chr>, state <chr>, zipcode <chr>
Using library rvest. The basic idea is to find the nodes of interest with XPath selectors (xml_nodes), then grab their values with xml_text().
library(rvest)
doc <- read_xml("http://api.bart.gov/api/stn.aspx?cmd=stns&key=MW9S-E7SL-26DU-VV8V")
names <- doc %>%
  xml_nodes(xpath = "/root/stations/station/name") %>%
  xml_text()
names[1:5]
# [1] "12th St. Oakland City Center" "16th St. Mission" "19th St. Oakland" "24th St. Mission"
# [5] "Ashby"
I had some problems using the URL within read_html directly, so I used readLines first. After that, it finds all the nodesets with <station>, transforms them into a list, and feeds that into data.table::rbindlist. The idea of using rbindlist came from here.
library(xml2)
library(data.table)
nodesets <- read_html(readLines("http://api.bart.gov/api/stn.aspx?cmd=stns&key=MW9S-E7SL-26DU-VV8V")) %>%
  xml_find_all(".//station")
data.table::rbindlist(as_list(nodesets))
