How to remove specific words in a column - r

I have a column consisting of several country offices associated with a company, and I would like to shorten e.g. "China Country Office" and "Bangladesh Country Office" to just "China" or "Bangladesh". In other words, I want to remove the words "Office" and "Country" from the column called Imp_Office.
I tried using the tm package, following an earlier post, but nothing happened.
This is what I wrote:
library(tm)
stopwords = c("Office", "Country", "Regional")
MY_df$Imp_Office <- gsub(paste0(stopwords, collapse = "|", "",
                                MY_df$Imp_Office))
which gave the following error message:
Error in gsub(paste0(stopwords, collapse = "|", "", MY_df$Imp_Office)) :
  argument "x" is missing, with no default
I also tried using the function readLines:
stopwords = readLines("Office", "Country", "Regional")
MY_df$Imp_Office <- gsub(paste0(stopwords, collapse = "|", "",
                                MY_df$Imp_Office))
But this didn't help either.
I have considered using some other string manipulation method, but I don't need to detect, replace, or remove whitespace, so I am kind of lost here.
Thank you.

First, let's set up a dataframe with a column like what you describe:
library(tidyverse)
df <- data_frame(Imp_Office = c("China Country Office",
                                "Bangladesh Country Office",
                                "China",
                                "Bangladesh"))
df
#> # A tibble: 4 x 1
#> Imp_Office
#> <chr>
#> 1 China Country Office
#> 2 Bangladesh Country Office
#> 3 China
#> 4 Bangladesh
Then we can use str_remove_all() from the stringr package to remove any bits of text that you don't want from them.
df %>%
  mutate(Imp_Office = str_remove_all(Imp_Office, " Country| Office"))
#> # A tibble: 4 x 1
#> Imp_Office
#> <chr>
#> 1 China
#> 2 Bangladesh
#> 3 China
#> 4 Bangladesh
Created on 2018-04-24 by the reprex package (v0.2.0).
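As an aside, the gsub() attempt in the question fails only because the replacement string and the x argument ended up inside paste0(). A minimal base R sketch of the corrected call (assuming the same stopwords vector and data frame) would be:
stopwords <- c("Office", "Country", "Regional")
# collapse the words into a single alternation pattern, strip them, then trim leftover spaces
MY_df$Imp_Office <- trimws(gsub(paste0(stopwords, collapse = "|"), "", MY_df$Imp_Office))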

Related

Convert list from API to dataframe

I have data as a list which looks like the following.
name          type           value
Api_collect   list [5]       List of length 5
  country     character [1]  US
  state       character [1]  Texas
  computer    character [1]  Mac
  house       character [1]  Mansion
  president   character [1]  Trump
I have run the following code in R:
api_col <- base::rawToChar((response$country))
as.data.frame(api_json$country)
which results in this data frame:
country
US
How do I convert this list to a data frame with every column of Api_collect except house?
Here's an option using purrr::map_df() and dplyr::select():
# name          type           value
# Api_collect   list [5]       List of length 5
#   country     character [1]  US
#   state       character [1]  Texas
#   computer    character [1]  Mac
#   house       character [1]  Mansion
#   president   character [1]  Trump

library(dplyr)
library(purrr)

your_list <- list(
  country = "US",
  state = "Texas",
  computer = "Mac",
  house = "Mansion",
  president = "Biden"
)
purrr::map_df(your_list, ~.x) %>% select(-country)
Which gives:
# A tibble: 1 × 4
  state computer house   president
  <chr> <chr>    <chr>   <chr>
1 Texas Mac      Mansion Biden
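Since every element of the example list has length one, a base R sketch (assuming the same your_list as above) gives the same one-row result without purrr:
# drop the unwanted element, then turn the named list into a one-row data frame
as.data.frame(your_list[setdiff(names(your_list), "country")])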

Add zero padding to numbers in a column by using str_pad in the stringr package

I want to use the stringr str_pad() function to get a column in my desired format, which means zero-padding the numbers in the "Code" column to 3 digits.
I've run this code:
Animals %>%
  gather(most_common, cnt, M:OG) %>%
  group_by(name) %>%
  slice(which.max(cnt)) %>%
  arrange(code)
Which resulted in the following tibble:
Code Name     most_common
  32 Monkey   Africa
  33 Wolf     Europe
  34 Tiger    Asia
  35 Godzilla Asia
# with 1,234 more rows
I'm happy with my code above. However, because I'm going to merge this df later on, I need the "Code" column to be three digits with zero padding (i.e. in the format "nnn", e.g. "032"), like this:
Code Name     most_common
 032 Monkey   Africa
 033 Wolf     Europe
 034 Tiger    Asia
 035 Godzilla Asia
# with 1,234 more rows
I've tried str_pad($code, $3, $0), but it doesn't work, so I guess there's something wrong with the syntax. Should I run this code anywhere in my chunk, or inside the pipe with %>%?
A possible solution:
library(tidyverse)

df <- read.table(text = "Code Name most_common
32 Monkey Africa
33 Wolf Europe
34 Tiger Asia
35 Godzilla Asia", header = T)

df %>%
  mutate(Code = str_pad(Code, width = 3, pad = "0"))
#> Code Name most_common
#> 1 032 Monkey Africa
#> 2 033 Wolf Europe
#> 3 034 Tiger Asia
#> 4 035 Godzilla Asia
In base R, we can use sprintf:
df$Code <- sprintf("%03d", df$Code)
Another option is formatC() with format = "d" for integer and flag = "0" to pad with leading zeros, like this:
df$Code <- formatC(df$Code, width = 3, format = "d", flag = "0")
df
#> Code Name most_common
#> 1 032 Monkey Africa
#> 2 033 Wolf Europe
#> 3 034 Tiger Asia
#> 4 035 Godzilla Asia
Created on 2022-07-23 by the reprex package (v2.0.1)
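To answer where it goes: str_pad() runs inside a mutate() step, so a sketch of how it could slot into the original pipeline (assuming the column is really called Code, as in the printed tibble) is:
Animals %>%
  gather(most_common, cnt, M:OG) %>%
  group_by(name) %>%
  slice(which.max(cnt)) %>%
  arrange(Code) %>%
  mutate(Code = str_pad(Code, width = 3, pad = "0"))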

Beginner question: How do you remove a date from a column?

I want to remove the date part from the first column, but I can't do it for the whole dataset.
Can someone please advise?
Example of dataset:
You can use the sub() function to replace ^[^[:alpha:]]+ (a regular expression matching all non-alphabetic characters at the beginning of the string) with "", i.e. an empty string.
sub("^[^[:alpha:]]+", "", data)
Example
data <- data.frame(
  good_column = 1:4,
  bad_column = c("13/1/2000pasta", "14/01/2000flour", "15/1/2000aluminium foil", "15/1/2000soap"))
data
#> good_column bad_column
#> 1 1 13/1/2000pasta
#> 2 2 14/01/2000flour
#> 3 3 15/1/2000aluminium foil
#> 4 4 15/1/2000soap
data$bad_column <- sub("^[^[:alpha:]]+", "", data$bad_column)
data
#> good_column bad_column
#> 1 1 pasta
#> 2 2 flour
#> 3 3 aluminium foil
#> 4 4 soap
Created on 2020-07-29 by the reprex package (v0.3.0)
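If you prefer the stringr style used elsewhere on this page, a sketch with the same regular expression (assuming the same data frame) would be:
library(stringr)
# remove the leading non-alphabetic characters (the date) from each value
data$bad_column <- str_remove(data$bad_column, "^[^[:alpha:]]+")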

World Bank API query

I want to get data using the World Bank's API. For this purpose I use the following query:
wb_data <- httr::GET("http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO?format=json") %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  data.frame()
It works pretty well. However, when I try to specify more than one indicator, it doesn't work:
http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO;NE.CON.PRVT.ZS?format=json
Note that if I change the format to XML and also add source=2 (because the data come from the same database, the World Development Indicators), the query works:
http://api.worldbank.org/v2/country/all/indicator/AG.AGR.TRAC.NO;NE.CON.PRVT.ZS?source=2&formal=xml
However, if I want to get data from different databases (e.g. WDI and Doing Business), it doesn't work either.
So, my first question is: how can I get data for several indicators from different databases using one query? According to the World Bank API tutorial I can include about 60 indicators.
My second question is how to specify the number of rows per page. As far as I know, I can add something like &per_page=100 to get 100 rows of output. Should I calculate the number of rows myself, or can I use something like &per_page=9999999 to get all the data in one request?
P.S. I don't want to use any packages (such as wb or wbstats); I want to do this myself and learn something new.
Here's an answer to your question. To use multiple indicators and return JSON, you need to provide both the source ID and the format parameter, as mentioned in the World Bank API tutorial. The returned JSON metadata includes a field called "total" with the total number of records; you can then use this value as the per_page parameter in a second GET request to return everything at once.
library(magrittr)
library(httr)
library(jsonlite)

# set up the target url - you need BOTH the source ID and the format parameters
target_url <- "http://api.worldbank.org/v2/country/chn;ago/indicator/AG.AGR.TRAC.NO;SP.POP.TOTL?source=2&format=json"

# look at the metadata returned for the target url
httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # the metadata is in the first item in the returned list of JSON
  extract2(1)
#> $page
#> [1] 1
#>
#> $pages
#> [1] 5
#>
#> $per_page
#> [1] 50
#>
#> $total
#> [1] 240
#>
#> $sourceid
#> NULL
#>
#> $lastupdated
#> [1] "2019-12-20"
# get the total number of records for the target url query
wb_data_totalpagenumber <- httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # get the first item in the returned list of JSON
  extract2(1) %>%
  # get the total number of records, which is a named element called "total"
  extract2("total")
# get the data
wb_data <- httr::GET(paste0(target_url, "&per_page=", wb_data_totalpagenumber)) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = T) %>%
  # get the data, which is the second item in the returned list of JSON
  extract2(2) %>%
  data.frame()
# look at the data
dim(wb_data)
#> [1] 240 11
head(wb_data)
#> countryiso3code date value scale unit obs_status decimal indicator.id
#> 1 AGO 2019 NA 0 AG.AGR.TRAC.NO
#> 2 AGO 2018 NA 0 AG.AGR.TRAC.NO
#> 3 AGO 2017 NA 0 AG.AGR.TRAC.NO
#> 4 AGO 2016 NA 0 AG.AGR.TRAC.NO
#> 5 AGO 2015 NA 0 AG.AGR.TRAC.NO
#> 6 AGO 2014 NA 0 AG.AGR.TRAC.NO
#> indicator.value country.id country.value
#> 1 Agricultural machinery, tractors AO Angola
#> 2 Agricultural machinery, tractors AO Angola
#> 3 Agricultural machinery, tractors AO Angola
#> 4 Agricultural machinery, tractors AO Angola
#> 5 Agricultural machinery, tractors AO Angola
#> 6 Agricultural machinery, tractors AO Angola
tail(wb_data)
#> countryiso3code date value scale unit obs_status decimal indicator.id
#> 235 CHN 1965 715185000 <NA> 0 SP.POP.TOTL
#> 236 CHN 1964 698355000 <NA> 0 SP.POP.TOTL
#> 237 CHN 1963 682335000 <NA> 0 SP.POP.TOTL
#> 238 CHN 1962 665770000 <NA> 0 SP.POP.TOTL
#> 239 CHN 1961 660330000 <NA> 0 SP.POP.TOTL
#> 240 CHN 1960 667070000 <NA> 0 SP.POP.TOTL
#> indicator.value country.id country.value
#> 235 Population, total CN China
#> 236 Population, total CN China
#> 237 Population, total CN China
#> 238 Population, total CN China
#> 239 Population, total CN China
#> 240 Population, total CN China
Created on 2020-01-30 by the reprex package (v0.3.0)
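If you would rather not pull everything in a single oversized page, a hedged sketch of paging through the results instead (reusing target_url and the packages loaded above, and assuming the API's documented page parameter) looks like this:
library(dplyr)

# read the metadata once to find out how many pages the default page size produces
meta <- httr::GET(target_url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE) %>%
  extract2(1)

# fetch each page in turn and bind the data blocks together
wb_pages <- lapply(seq_len(meta$pages), function(p) {
  httr::GET(paste0(target_url, "&page=", p)) %>%
    content("text", encoding = "UTF-8") %>%
    fromJSON(flatten = TRUE) %>%
    extract2(2)
})
wb_data_paged <- bind_rows(wb_pages)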

Using if_else for string search [closed]

I am trying to replace the names of US states with the string "United States".
Country <- data.frame(Name = c(" China", " Japan", " Florida", " Canada", " Texas"))
Country$Name <- as.character(Country$Name)
Country
Name
1 China
2 Japan
3 Florida
4 Canada
5 Texas
str(Country)
'data.frame': 5 obs. of 1 variable:
$ Name: chr " China" " Japan" " Florida" " Canada" ...
Below is the dplyr command I used; it doesn't work. I use state.name for this purpose.
Country %>% mutate(Name = if_else(Name %in% state.name, " United States", Name))
Name
1 China
2 Japan
3 Florida
4 Canada
5 Texas
I tried to use str_detect, but it gives multiple outputs when searching against state.name (FALSE FALSE TRUE ...), so I was unable to get the condition check to work.
You can use ifelse from base R to do it:
Country <- within(Country, Name <- ifelse(trimws(Name) %in% state.name, "United States", trimws(Name)))
Your problem is that %in% only checks for exact matches. The names in your data frame have whitespace at the beginning while the entries of state.name don't, so you need to remove this whitespace before comparing the two.
You can either remove the whitespace (with trimws) from the Name column before comparison:
library(dplyr)

Country %>%
  mutate(Name = trimws(Name)) %>%
  mutate(Name = if_else(Name %in% state.name, "United States", Name))
#> Name
#> 1 China
#> 2 Japan
#> 3 United States
#> 4 Canada
#> 5 United States
Or just within the comparison, which will preserve the whitespace (I don't see a reason why you would want that but just in case):
Country %>%
  mutate(Name = if_else(trimws(Name) %in% state.name, "United States", Name))
#> Name
#> 1 China
#> 2 Japan
#> 3 United States
#> 4 Canada
#> 5 United States
A third possibility would be to use string replacement, for example, with the stringi package:
library(stringi)

Country %>%
  mutate(Name = stri_replace_all_fixed(Name, state.name, "United States",
                                       vectorize_all = FALSE,
                                       opts_fixed = stri_opts_fixed(case_insensitive = TRUE)))
#> Name
#> 1 China
#> 2 Japan
#> 3 United States
#> 4 Canada
#> 5 United States
I wouldn't recommend this either but included it since you have a few more options (e.g., case_insensitive) if your strings are more complicated than what's in your sample data.
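As for the str_detect() attempt mentioned in the question: it returns one value per state name because it vectorises over the 50 patterns. A hedged sketch of making it work by collapsing state.name into a single regular expression first (note that substring matches could cause false positives on messier data):
library(dplyr)
library(stringr)

# one pattern that matches any state name, so str_detect() returns one value per row
state_pattern <- str_c(state.name, collapse = "|")
Country %>%
  mutate(Name = if_else(str_detect(Name, state_pattern), "United States", trimws(Name)))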
