I'm trying to get rid of some specific words in a data frame column. The data set looks somewhat like this, with 3235 rows:
V1 V2
AUTAUGA COUNTY 1
BALDWIN COUNTY 3
VALDEZ-CORDOVA CENSUS AREA 261
What I'm trying to do is:
data$V1 <- replace(data$V1, " COUNTY", "")
But I get an error that looks like this:
Error in `$<-.data.frame`(`*tmp*`, "V1", value = c("AUTAUGA COUNTY", :
replacement has 3236 rows, data has 3235
Am I using the function the wrong way? Or is there any other way to do this?
Thanks!
Hugo,
For the example you've provided, this code works well:
eg <- data.frame(V1 = c("AUTAUGA COUNTY", "BALDWIN COUNTY",
"VALDEZ-CORDOVA CENSUS AREA"),
V2 = c(1, 3, 261))
eg$gsub <- gsub(" COUNTY", "", eg$V1)
eg
                          V1  V2                       gsub
1             AUTAUGA COUNTY   1                    AUTAUGA
2             BALDWIN COUNTY   3                    BALDWIN
3 VALDEZ-CORDOVA CENSUS AREA 261 VALDEZ-CORDOVA CENSUS AREA
Does this resolve the error?
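For what it's worth, the reason replace() threw that error: replace(x, list, values) expects list to be an index (positions or a logical vector), not a text pattern. A character index assigns by name and appends one new element, hence "replacement has 3236 rows, data has 3235". A quick illustration, plus the gsub() call applied directly to your column (assuming the suffix to strip is always " COUNTY"):
# replace() works on positions/indices, not patterns:
x <- c("AUTAUGA COUNTY", "BALDWIN COUNTY")
replace(x, 1, "AUTAUGA")
# [1] "AUTAUGA"        "BALDWIN COUNTY"
# For pattern substitution on the whole column, use gsub()/sub():
data$V1 <- gsub(" COUNTY", "", data$V1)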
I have two datasets from two different agencies that report census tracts in two different ways, namely FIPS code and GISJOIN.
I frequently have to convert between the two, so I'm looking for an efficient way to do it: my dataset includes some 70,000 census tracts, and doing it manually is out of the question.
For example,
Fipscode 1073011803 = GISJOIN G0100730011803.
The logic is simple where
1 = 01 (state code)
073 = 0073 (county code)
011803 = 0011803 (census tract number)
It seems that padding each of the three elements of the FIPS code with a leading 0 (and adding the "G" prefix) gives the GISJOIN; however, I am unsure how to do the conversion programmatically.
I am using excel but can work with R if there is a way.
Thank you for your time!
After giving it a few tries, I have found a solution to this.
We need to know a bit more about the GISJOIN! It is a geographic identifier unique to US census geographies, and NHGIS defines a standard structure for it: for census tracts, a 14-character alphanumeric ID made of a "G" prefix plus zero-padded state, county, and tract codes.
For demonstration purposes, I selected five random census tracts from the HUD data with their fips2010 codes and converted them into the prescribed GISJOIN format.
data <- data.frame(State = c("Alabama", "Alabama", "Delaware",
"Texas", "Wisconsin"),
County = c("Jefferson County", "Montgomary County", "Kent County",
"Travis County", "Milwaukee County"),
Tract = c("118.03", "1.00", "433.00", "13.07", "86.00"),
fips2010 = c("1073011803", "1101000100", "10001043300",
"48453001307", "55079008600"))
print(data)
##Output
State County Tract fips2010
1 Alabama Jefferson County 118.03 1073011803
2 Alabama Montgomery County 1.00 1101000100
3 Delaware Kent County 433.00 10001043300
4 Texas Travis County 13.07 48453001307
5 Wisconsin Milwaukee County 86.00 55079008600
Following the logic established in the NHGIS documentation, the code below converts the fips2010 column to the appropriate GISJOIN standard.
for (i in 1:nrow(data)) {
  fips2010 <- data$fips2010[i]
  if (nchar(fips2010) == 10) {
    # 1-digit state code: pad state to 2, county to 4, and tract to 7 digits
    data$fips2010[i] <- paste0("G0", substr(fips2010, 1, 1), "0",
                               substr(fips2010, 2, 4), "0",
                               substr(fips2010, 5, 10))
  } else if (nchar(fips2010) == 11) {
    # 2-digit state code: insert the filler zeros after the state and county parts
    data$fips2010[i] <- paste0("G", substr(fips2010, 1, 2), "0",
                               substr(fips2010, 3, 5), "0",
                               substr(fips2010, 6, 11))
  }
}
print(data)
##Output
State County Tract fips2010
1 Alabama Jefferson County 118.03 G0100730011803
2 Alabama Montgomery County 1.00 G0101010000100
3 Delaware Kent County 433.00 G1000010043300
4 Texas Travis County 13.07 G4804530001307
5 Wisconsin Milwaukee County 86.00 G5500790008600
A side-by-side comparison of fips2010 and GISJOIN:
State County Tract fips2010 GISJOIN
1 Alabama Jefferson County 118.03 1073011803 G0100730011803
2 Alabama Montgomery County 1.00 1101000100 G0101010000100
3 Delaware Kent County 433.00 10001043300 G1000010043300
4 Texas Travis County 13.07 48453001307 G4804530001307
5 Wisconsin Milwaukee County 86.00 55079008600 G5500790008600
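For larger datasets (the question mentions some 70,000 tracts), a vectorized version avoids the row-by-row loop. A minimal sketch, assuming fips2010 is stored as character and is always 10 or 11 characters long, and run against the original fips2010 values before the loop above overwrites them:
# Left-pad the state code so every FIPS is 11 characters, then splice in
# the filler zeros GISJOIN requires after the state and county parts.
fips <- data$fips2010
fips <- ifelse(nchar(fips) == 10, paste0("0", fips), fips)
data$GISJOIN <- paste0("G",
                       substr(fips, 1, 2),  "0",   # 2-digit state FIPS + filler zero
                       substr(fips, 3, 5),  "0",   # 3-digit county FIPS + filler zero
                       substr(fips, 6, 11))        # 6-digit tract code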
I hope this helps anyone dealing with a similar issue.
Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLOOKUP, except in R).
Given df1 here (for simplicity's sake, I show just 6 rows; the key I actually have is 50 rows, one per state):
Index State_Name Abbreviation
1     California CA
2     Maryland   MD
3     New York   NY
4     Texas      TX
5     Virginia   VA
6     Washington WA
And given df2 here (this is just an example; the real dataframe I'm working with has a lot more rows):
Index State Article
1     NA    Texas governor, Abbott, signs new abortion bill
2     NA    Effort to recall California governor Newsome loses steam
3     NA    New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4     NA    Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5     NA    DC statehood unlikely as Manchin opposes
6     NA    Amazon HQ2 causing housing prices to soar in northern Virginia
Task: To create an R function that loops over df2$Article, reads the state mentioned in each row, cross-references it with df1$State_Name, and replaces the NA in df2$State with the corresponding df1$Abbreviation. I know it's quite a mouthful. I'm stuck on how to start, and finish, this puzzle. Hard-coding is not an option, as the real datasheet I have has thousands of rows like this and will keep growing as we text-scrape more articles.
The output should look like:
Index State Article
1     TX    Texas governor, Abbott, signs new abortion bill
2     CA    Effort to recall California governor Newsome loses steam
3     NY    New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4     MD    Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5     NA    DC statehood unlikely as Manchin opposes
6     VA    Amazon HQ2 causing housing prices to soar in northern Virginia
Note: The fifth entry with DC is intended to be NA.
Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!
You can create a regex pattern from State_Name and use str_extract to extract it from Article. Then use match to get the corresponding Abbreviation from df1.
library(stringr)
df2$State <- df1$Abbreviation[match(str_extract(df2$Article,
str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA "VA"
You can also use the built-in state.name and state.abb vectors instead of df1 to get state names and abbreviations, as sketched below.
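For example, a minimal sketch of that (stringr already loaded above; this assumes each article spells out the full state name):
df2$State <- state.abb[match(str_extract(df2$Article,
                                         str_c(state.name, collapse = '|')),
                             state.name)]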
Here's a way to do this in a for loop -
for(i in seq(nrow(df1))) {
  inds <- grep(df1$State_Name[i], df2$Article)
  if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2
# Index State Article
#1 1 TX Texas governor, Abbott, signs new abortion bill
#2 2 CA Effort to recall California governor Newsome loses steam
#3 3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4 4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5 5 <NA> DC statehood unlikely as Manchin opposes
#6 6 VA Amazon HQ2 causing housing prices to soar in northern Virginia
Not as concise as above but a Base R approach:
# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
  # Coerce 0 length vectors to NA values of the appropriate type:
  # .zero_to_nas => function()
  .zero_to_nas <- function(x){
    if(identical(x, character(0))){
      NA_character_
    }else if(identical(x, integer(0))){
      NA_integer_
    }else if(identical(x, numeric(0))){
      NA_real_
    }else if(identical(x, complex(0))){
      NA_complex_
    }else if(identical(x, logical(0))){
      NA
    }else{
      x
    }
  }
  # Unlist cleaned list: res => vector
  res <- unlist(lapply(lst, .zero_to_nas))
  # Explicitly define return object: vector => GlobalEnv()
  return(res)
}
# Classify each article as belonging to the appropriate state:
# clean_df => data.frame
clean_df <- transform(
  df2,
  State = df1$Abbreviation[
    match(
      list_2_vec(
        regmatches(
          Article,
          gregexpr(
            paste0(df1$State_Name, collapse = "|"), Article
          )
        )
      ),
      df1$State_Name
    )
  ]
)
# Data:
df1 <- structure(list(Index = 1:6, State_Name = c("California", "Maryland",
"New York", "Texas", "Virginia", "Washington"), Abbreviation = c("CA",
"MD", "NY", "TX", "VA", "WA")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(Index = 1:6, State = c(NA, NA, NA, NA, NA, NA),
Article = c("Texas governor, Abbott, signs new abortion bill",
"Effort to recall California governor Newsome loses steam",
"New York governor, Cuomo, accused of manipulating Covid-19 nursing home data",
"Hogan (Maryland, R) announces plans to lift statewide Covid restrictions",
"DC statehood unlikely as Manchin opposes", "Amazon HQ2 causing housing prices to soar in northern Virginia"
)), class = "data.frame", row.names = c(NA, -6L))
I am relatively new to R. I have written the following code, but because it uses a for-loop, it is slow. I am not too familiar with the packages (apply functions?) that would turn this for-loop into a more efficient solution.
What my code does is this: it extracts country names from a text variable based on another dataframe that lists all countries.
For instance, this is what data looks like:
country Institution
edmonton general hospital
ontario, canada
miyazaki, japan
department of head
This is what countries looks like:
Name Code
algeria dz
canada ca
japan jp
kenya ke
# string-match the countries (str_detect is from the stringr package)
library(stringr)

for (i in 1:nrow(data)) {
  for (j in 1:nrow(countries)) {
    data$country[i] <- ifelse(
      str_detect(string  = data$Institution[i],
                 pattern = paste0("\\b", countries$Name[j], "\\b")),
      countries$Name[j],
      data$country[i]
    )
  }
}
The above code runs and changes data so that it looks like this:
country Institution
edmonton general hospital
canada ontario, canada
japan miyazaki, japan
department of head
How can I rewrite my for-loop more efficiently while preserving the same behaviour?
Thanks.
You can do a one-liner with str_extract. We'll paste the country names together with word boundaries and concatenate them with the regex | (or) operator.
library(stringr)
data$country = str_extract(data$Institution, paste0(
"\\b", country$Name, "\\b", collapse = "|"
))
data
# Institution country
# 1 edmonton general hospital <NA>
# 2 ontario, canada canada
# 3 miyazaki, japan japan
# 4 department of head <NA>
Using this data:
country <- read.table(text = " Name Code
algeria dz
canada ca
japan jp
kenya ke",
stringsAsFactors = FALSE, header = TRUE)
data <- data.frame(Institution = c("edmonton general hospital",
"ontario, canada",
"miyazaki, japan",
"department of head"))
The data:
library(data.table)

countries <- setDT(read.table(text = " Name Code
algeria dz
canada ca
japan jp
kenya ke",
stringsAsFactors = FALSE, header = TRUE))

data <- setDT(list(country = array(dim = 2),
                   Institution = c("edmonton general hospital ontario, canada",
                                   "miyazaki, japan department of head")))
I use data.table for syntax convenience, but you can surely do otherwise; the main idea is to use just one loop and grepl.
data[, country := as.character(country)]

for (x in unique(countries$Name)) {
  data[grepl(x, data$Institution), country := x]
}
> data
country Institution
1: canada edmonton general hospital ontario, canada
2: japan miyazaki, japan department of head
You could wrap both sides in tolower to avoid case problems: grepl(tolower(x), tolower(data$Institution)). A sketch of that variant follows.
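For instance, a minimal sketch of the case-insensitive variant (same loop as above, both sides lower-cased):
for (x in unique(countries$Name)) {
  data[grepl(tolower(x), tolower(Institution)), country := x]
}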
I have a dataset like the following:
> head(data,3)
city state zip_code overall_spend
1 MIDDLESBORO KY 40965 $252,168.12
2 PALM BEACH FL 33411-3518 $369,240.74
3 CORBIN KY 40701 $292,496.03
Now, I want to clean up the zip_code column, which has extra parts after the -. For example, in the second row I have 33411-3518; after formatting, I want to keep only 33411. How can I do this for the whole zip_code column? Also, note that zip_code is currently a factor.
Try
data$zip_code <- sub('-.*', '', data$zip_code)
data$zip_code
#[1] "40965" "33411" "40701"
I have df1:
City Freq
Seattle 20
San Jose 10
SEATTLE 5
SAN JOSE 15
Miami 12
I created this dataframe using table(df)
I have another df2:
City
San Jose
Miami
I want to subset df1 to the rows whose City values match those in df2. This df2 is only a sample, so I can't use an OR condition ("|") because I have many different criteria. Perhaps I could convert df2 into a vector, but I'm not sure how to do this; as.vector() doesn't seem to work.
I thought about using
subset(df1, City == df2)
but this gives me errors.
Also, if you guys could get me a way to make this case insensitive such that "San Jose" and "SAN JOSE" are added together, that would be even better!
If I use "toupper / tolower", I get the error: invalid multibyte
Thanks in advance!!
Here are a few more methods.
R Code:
# Method 1: using dplyr package
library(dplyr)
filter(df1, tolower(df1$City) %in% tolower(df2$City))
df1 %>% filter(tolower(df1$City) %in% tolower(df2$City))
# Method 2: using which function
df1[ which( tolower(df1$City) %in% tolower(df2$City)) , ]
# Method 3:
df1[(tolower(df1$City) %in% tolower(df2$City)), ]
Output:
City Freq
2 San Jose 10
4 SAN JOSE 15
5 Miami 12
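To also combine the case variants (so that "San Jose" and "SAN JOSE" are summed together, as asked), a minimal base R sketch, assuming you only need City and Freq:
matched <- df1[tolower(df1$City) %in% tolower(df2$City), ]
matched$City <- tolower(matched$City)           # normalise case before grouping
aggregate(Freq ~ City, data = matched, FUN = sum)

#       City Freq
# 1    miami   12
# 2 san jose   25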
Hope this helps.