I have an R data frame that contains U.S. state and county names in one column. The data is in the format:
United States - State name - County name
where each cell is a unique county. For example:
United States - North Carolina - Wake County
United States - North Carolina - Warren County
etc.
I need to break the column into two columns, one containing just the state name and the other containing just the county name. I've experimented with sub and gsub but am getting no results. I understand this is probably a simple matter for R experts, but I'm a newbie. I would be most grateful if anyone can point me in the right direction.
You can use tidyr's separate function:
library(tidyr)
df <- separate(df, currentColumn, into = c("Country", "State", "County"), sep = " - ")
This assumes the data is as shown in your question (including United States as the country), that your data frame is called df, and that the column holding the data is called currentColumn.
Example:
df <- data.frame(currentColumn = c("United States - North Carolina - Wake County",
"United States - North Carolina - Warren County"), val = rnorm(2))
df
# currentColumn val
#1 United States - North Carolina - Wake County 0.8173619
#2 United States - North Carolina - Warren County 0.4941976
separate(df, currentColumn, into = c("Country", "State", "County"), sep = " - ")
# Country State County val
#1 United States North Carolina Wake County 0.8173619
#2 United States North Carolina Warren County 0.4941976
Using read.table, and assuming your data is in df$var
read.table(text = df$var, sep = "-", strip.white = TRUE,
           col.names = c("Country", "State", "County"))
If speed is an issue, then strsplit will be a lot quicker:
setNames(data.frame(do.call(rbind, strsplit(df$var, split = " - "))),
         c("Country", "State", "County"))
Both give:
# Country State County
#1 United States North Carolina Wake County
#2 United States North Carolina Warren County
I have two datasets from two different agencies that report census tracts in two different ways, namely FIPSCode and GISJOIN.
I frequently have to convert between the two, so I am looking for an efficient way to do it; my dataset includes some 70,000 census tracts, and doing it manually is out of the question.
For example,
Fipscode 1073011803 = GISJOIN G0100730011803.
The logic is simple:
1 = 01 (state code)
073 = 0073 (county code)
011803 = 0011803 (census tract number)
It seems that padding each of the three elements of a fipscode with a 0 gives the GISJOIN; however, I am unsure how to perform the conversion.
I am using Excel but can work with R if there is a way.
Thank you for your time!
After giving it a few tries, I have found a solution to this.
We need to know a bit more about the GISJOIN! It is a geo-identifier unique to US census geographies; NHGIS defines a standard structure, which for census tracts is the letter G followed by 13 digits.
For demonstration purposes, I selected five random census tracts from the HUD data with their fips2010 codes and converted them into the prescribed GISJOIN style.
data <- data.frame(State = c("Alabama", "Alabama", "Delaware",
                             "Texas", "Wisconsin"),
                   County = c("Jefferson County", "Montgomery County", "Kent County",
                              "Travis County", "Milwaukee County"),
                   Tract = c("118.03", "1.00", "433.00", "13.07", "86.00"),
                   fips2010 = c("1073011803", "1101000100", "10001043300",
                                "48453001307", "55079008600"))
print(data)
##Output
State County Tract fips2010
1 Alabama Jefferson County 118.03 1073011803
2 Alabama Montgomery County 1.00 1101000100
3 Delaware Kent County 433.00 10001043300
4 Texas Travis County 13.07 48453001307
5 Wisconsin Milwaukee County 86.00 55079008600
Following the logic established in the NHGIS documentation, the code below converts the fips2010 column to the appropriate GISJOIN standard.
for (i in 1:nrow(data)) {
  fips2010 <- data$fips2010[i]
  if (nchar(fips2010) == 10) {
    # 10-character FIPS: 1-digit state code, so pad the state with a leading 0
    data$fips2010[i] <- paste0("G0", substr(fips2010, 1, 1), "0", substr(fips2010, 2, 4), "0", substr(fips2010, 5, 10))
  } else if (nchar(fips2010) == 11) {
    # 11-character FIPS: state code is already 2 digits
    data$fips2010[i] <- paste0("G", substr(fips2010, 1, 2), "0", substr(fips2010, 3, 5), "0", substr(fips2010, 6, 11))
  }
}
print(data)
##Output
State County Tract fips2010
1 Alabama Jefferson County 118.03 G0100730011803
2 Alabama Montgomery County 1.00 G0101010000100
3 Delaware Kent County 433.00 G1000010043300
4 Texas Travis County 13.07 G4804530001307
5 Wisconsin Milwaukee County 86.00 G5500790008600
A side-by-side comparison of fips2010 and GISJOIN:
State County Tract fips2010 GISJOIN
1 Alabama Jefferson County 118.03 1073011803 G0100730011803
2 Alabama Montgomery County 1.00 1101000100 G0101010000100
3 Delaware Kent County 433.00 10001043300 G1000010043300
4 Texas Travis County 13.07 48453001307 G4804530001307
5 Wisconsin Milwaukee County 86.00 55079008600 G5500790008600
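For larger datasets, the same conversion can be vectorized so that no row-by-row loop is needed. Here is a minimal sketch, assuming fips2010 is stored as character, using stringr::str_pad to left-pad the 10-character codes:
library(stringr)
# Left-pad every FIPS code to 11 digits so the state code is always 2 digits,
# then splice in the two extra zeros that the GISJOIN format expects
fips <- str_pad(data$fips2010, width = 11, pad = "0")
data$GISJOIN <- paste0("G",
                       substr(fips, 1, 2), "0",   # state (2 digits)
                       substr(fips, 3, 5), "0",   # county (3 digits)
                       substr(fips, 6, 11))       # tract (6 digits)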
I hope this helps anyone dealing with a similar issue.
I need to clean up gender and dates columns of the dataset found here.
They apparently contain some misspellings and ambiguities. I am new to R and data cleaning so I am not sure how to go about doing this. For starters, I have tried to correct the misspellings using
factor(data$artist_data$gender)
str_replace_all(data$artist_data$gender, pattern = "femle", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "f.", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "F.", replacement = "Female")
str_replace_all(data$artist_data$gender, pattern = "female", replacement = "Female")
But it doesn't seem to work, as I still have f., F. and femle in my output. Secondly, there seem to be empty cells inside. Do I need to remove them, or is it alright to leave them there? If I need to remove them, how?
Thirdly, for the dates column, how do I make it clearer? I.e., change the format of born in xxxx to maybe xxxx-yyyy if the artist died, or xxxx-present if still alive. E.g., for born in 1940, is it safe to assume that they are still alive? Also, one of the entries has the word active in it. I would like to make this data more straightforward.
Please help,
Thank you.
We have to escape the dot in f. and F.
library(dplyr)
library(stringr)
library(tibble)
pattern <- paste(c("f\\.", "F\\.", "female", "femle"), collapse = "|")
df[[2]] %>%
mutate(gender = str_replace(string=gender,
pattern = pattern,
replacement="Female")) %>%
as_tibble()
name gender dates placeOfBirth placeOfDeath
<chr> <chr> <chr> <chr> <chr>
1 Abakanowicz, Magdalena Female born 1930 Polska ""
2 Abbey, Edwin Austin Male 1852–1911 Philadelphia, United States "London, United Kingdom"
3 Abbott, Berenice Female 1898–1991 Springfield, United States "Monson, United States"
4 Abbott, Lemuel Francis Male 1760–1803 Leicestershire, United Kingdom "London, United Kingdom"
5 Abrahams, Ivor Male born 1935 Wigan, United Kingdom ""
6 Absalon Male 1964–1993 Tel Aviv-Yafo, Yisra'el "Paris, France"
7 Abts, Tomma Female born 1967 Kiel, Deutschland ""
8 Acconci, Vito Male born 1940 New York, United States ""
9 Ackling, Roger Male 1947–2014 Isleworth, United Kingdom ""
10 Ackroyd, Norman Male born 1938 Leeds, United Kingdom ""
# ... with 3,522 more rows
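Note that str_replace returns a new object rather than modifying df in place, so assign the result if you want to keep the cleaned column. A minimal sketch, reusing the pattern defined above:
cleaned <- df[[2]] %>%
  mutate(gender = str_replace(string = gender,
                              pattern = pattern,
                              replacement = "Female")) %>%
  as_tibble()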
In my data I have one column with country names. I want to make a new variable that lists which region each country is in based on an excel sheet I have where I have labelled each country by region.
I don't want to use the countrycode package because its regions are not specific enough (e.g. it labels the Netherlands as Europe rather than Northern Europe). Is there a way to get R to inspect a cell and match the contents of that cell to another dataset?
Import your spreadsheet into R. (Use RExcel, or export as CSV and import that using base functions.) Suppose your spreadsheet has two columns, named Country and Region, something like this:
regions <- data.frame(Country = c("Greece", "Netherlands"),
Region = c("Southern Europe", "Northern Europe"),
stringsAsFactors = FALSE)
regions
#> Country Region
#> 1 Greece Southern Europe
#> 2 Netherlands Northern Europe
Now create a named vector from the dataframe:
named <- regions$Region
names(named) <- regions$Country
named
#> Greece Netherlands
#> "Southern Europe" "Northern Europe"
Now you can index the named vector to convert country names to regions in any other vector.
other <- c("Netherlands", "Greece", "Greece")
named[other]
#> Netherlands Greece Greece
#> "Northern Europe" "Southern Europe" "Southern Europe"
If you have any missing countries (or variant spellings), you'll get NA for the region, e.g.
other2 <- c("Greece", "France")
named[other2]
#> Greece <NA>
#> "Southern Europe" NA
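An alternative to indexing a named vector is a join. A minimal sketch with dplyr::left_join, assuming your main data frame is called mydata and has a Country column (both names are hypothetical):
library(dplyr)
mydata <- data.frame(Country = c("Netherlands", "Greece", "France"))
# left_join keeps every row of mydata; countries missing from the
# lookup table get NA in the Region column
left_join(mydata, regions, by = "Country")
#>       Country          Region
#> 1 Netherlands Northern Europe
#> 2      Greece Southern Europe
#> 3      France            <NA>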
The rnaturalearth library has country shapefiles with region and subregion attributes.
library(rnaturalearth)
world <- rnaturalearth::ne_countries(returnclass = "sf")
world$region_un    # UN region; there is also region_wb (World Bank regions)
world$subregion
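To use this as a plain lookup table, drop the geometry first. A minimal sketch, assuming the sf package is installed and the usual Natural Earth attribute names (name, region_un, subregion):
library(sf)
# Drop the geometry column to get an ordinary data frame,
# then keep just the country name and the region columns
lookup <- st_drop_geometry(world)[, c("name", "region_un", "subregion")]
head(lookup)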
I am relatively new to R. I have written the following code, but because it uses a for-loop it is slow. I am not too familiar with the packages (apply functions?) that would turn this for-loop into a more efficient solution.
What my code does is this: it extracts country names from a variable, based on another data frame that lists all countries.
For instance, this is what data looks like:
country Institution
edmonton general hospital
ontario, canada
miyazaki, japan
department of head
This is what countries looks like:
Name Code
algeria dz
canada ca
japan jp
kenya ke
# string match the countries
for (i in 1:nrow(data)) {
  for (j in 1:nrow(countries)) {
    data$country[i] <- ifelse(str_detect(string = data$Institution[i],
                                         pattern = paste0("\\b", countries$Name[j], "\\b")),
                              countries$Name[j],
                              data$country[i])
  }
}
The above code changes data so that it looks like this:
country Institution
edmonton general hospital
canada ontario, canada
japan miyazaki, japan
department of head
How can I replace my for-loop with something more efficient that preserves the same behavior?
Thanks.
You can do a one-liner with str_extract. We'll wrap the country names in word boundaries and concatenate them with the regex alternation operator |.
library(stringr)
data$country = str_extract(data$Institution, paste0(
"\\b", country$Name, "\\b", collapse = "|"
))
data
# Institution country
# 1 edmonton general hospital <NA>
# 2 ontario, canada canada
# 3 miyazaki, japan japan
# 4 department of head <NA>
Using this data:
country <- read.table(text = " Name Code
algeria dz
canada ca
japan jp
kenya ke",
stringsAsFactors = FALSE, header = TRUE)
data <- data.frame(Institution = c("edmonton general hospital",
"ontario, canada",
"miyazaki, japan",
"department of head"))
The data (note that setDT requires the data.table package):
library(data.table)
countries <- setDT(read.table(text = " Name Code
algeria dz
canada ca
japan jp
kenya ke",
stringsAsFactors = FALSE, header = TRUE))
data <- setDT(list(country = array(dim = 2),
                   Institution = c("edmonton general hospital ontario, canada",
                                   "miyazaki, japan department of head")))
I use data.table for syntax convenience, but you can surely do otherwise; the main idea is to use just one loop and grepl.
data[, country := as.character(country)]
for (x in unique(countries$Name)) {
  data[grepl(x, Institution), country := x]
}
> data
country Institution
1: canada edmonton general hospital ontario, canada
2: japan miyazaki, japan department of head
You could add the tolower function to avoid case problems: grepl(tolower(x), tolower(Institution)).
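Equivalently, grepl's ignore.case argument avoids calling tolower twice; a minimal sketch of the same loop:
for (x in unique(countries$Name)) {
  data[grepl(x, Institution, ignore.case = TRUE), country := x]
}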
I have a data frame with one column representing country names. My goal is to add one more column which gives the continent information. Please check the following use case:
my.df <- data.frame(country = c("Afghanistan","Algeria"))
Is there a package that I can use to append a column of data containing the continent names without having the original data?
You can use the countrycode package for this task.
library(countrycode)
df <- data.frame(country = c("Afghanistan",
"Algeria",
"USA",
"France",
"New Zealand",
"Fantasyland"))
df$continent <- countrycode(sourcevar = df[, "country"],
origin = "country.name",
destination = "continent")
#warning
#In countrycode(sourcevar = df[, "country"], origin = "country.name", :
# Some values were not matched unambiguously: Fantasyland
Result
df
# country continent
#1 Afghanistan Asia
#2 Algeria Africa
#3 USA Americas
#4 France Europe
#5 New Zealand Oceania
#6 Fantasyland <NA>
Expanding on Markus' answer: countrycode draws on codelist's 'continent' definition.
?codelist
Definition of continent:
continent: Continent as defined in the World Bank Development Indicators
The question asked for continents, but sometimes continents don't provide enough groups to delineate the data. For example, the continent destination groups North and South America into Americas.
What you might want is region:
region: Regions as defined in the World Bank Development Indicators
It is unclear exactly how the World Bank groups regions, but the code below shows that this destination is more granular.
library(countrycode)
egnations <- c("Afghanistan","Algeria","USA","France","New Zealand","Fantasyland")
countrycode(sourcevar = egnations, origin = "country.name",destination = "region")
Output:
[1] "Southern Asia"
[2] "Northern Africa"
[3] "Northern America"
[4] "Western Europe"
[5] "Australia and New Zealand"
[6] NA
You can try
my.df <- data.frame(country = c("Afghanistan","Algeria"),
continent= as.factor(c("Asia","Africa")))
merge(my.df, raster::ccodes()[,c("NAME", "CONTINENT")], by.x="country", by.y="NAME", all.x=T)
# country continent CONTINENT
# 1 Afghanistan Asia Asia
# 2 Algeria Africa Africa
Some country values might need adjustment; I can't say, since you did not provide all your values.