Creating a table with scraped CSV data in R

I have the following name_total = matrix(nrow = 51, ncol=3, NA), where each row corresponds to a state (51 being District of Columbia). The first column is a string giving the name of the state (for example: name_total[1,1]= "Alabama").
The second and third are urls of CSV files from the Census, respectively linking counties with the state senate districts, and counties with state house districts.
For Alabama:
name_total[1,2] ="http://www2.census.gov/geo/relfiles/cdsld13/01/co_lu_delim_01.txt"
name_total[1,3] ="http://www2.census.gov/geo/relfiles/cdsld13/01/co_ll_delim_01.txt"
I wish to get as a final output a table which would basically be all 50 states + DC with their respective counties and linked Senate and House districts. I don't know if that's very clear so here is an example:
[,1] [,2] [,3] [,4]
[1,] "Alabama" "countyX1" "Senate District Y1" "House District Z1"
[2,] "Alabama" "countyX2" "Senate District Y2" "House District Z2"
[3,] "Alabama" "countyX3" "Senate District Y3" "House District Z3"
[4,] "Alaska" "countyX4" "Senate District Y4" "House District Z4"
[5,] "Alaska" "countyX5" "Senate District Y4" "House District Z5"
I use a for loop:
for (i in 1:51){
senate= name_total[i,2]
link_senate = url(senate)
house= name_total[i,3]
link_house = url(house)
state=name_total[i,1]
data_senate= read.csv2(link_senate,sep=",",header=TRUE, skip=1)
data_house= read.csv2(link_house,sep=",",header=TRUE, skip=1)
final=cbind(state, data_senate, data_house)
}
Of course, each element has a different number of rows: for Alabama (i=1), state returns "Alabama" once, while the other two return 3 by 122 and 3 by 207 matrices respectively. I get an error message about these differing numbers of rows.
I'm pretty sure one of the issues is the use of the cbind function, but I do not know what to use to get a better result.

In case others have similar issues, I found a way to get what I wanted separately for State Senates and State Houses. First of all, some of the states only have one of the two, and the link for Oregon was down. Personally, I took them out of my original data.
Then I initialized for the first state outside of the loop:
senate = url(name_total[1,2])
data_senate= read.csv2(senate,sep=",",header=TRUE, skip=1)
assign(paste("Base_senate_",name_total[1,1],sep=""),data_senate)
A = assign(paste("Base_senate_",name_total[1,1],sep=""),data_senate)
house= url(name_total[1,3])
data_house= read.csv2(house,sep=",",header=TRUE, skip=1)
assign(paste("Base_house_",name_total[1,1],sep=""),data_house)
B = assign(paste("Base_house_",name_total[1,1],sep=""),data_house)
and then I used a for loop:
for (i in 2:48){
senate = url(name_total[i,2])
house= url(name_total[i,3])
data_senate= read.csv2(senate,sep=",",header=TRUE, skip=1)
assign(paste("Base_senate_",name_total[i,1],sep=""),data_senate)
names(data_senate)[2] = "County"
A = rbind(A,assign(paste("Base_senate_",name_total[i,1],sep=""),data_senate))
data_house= read.csv2(house,sep=",",header=TRUE, skip=1)
assign(paste("Base_house_",name_total[i,1],sep=""),data_house)
names(data_house)[2] = "County"
B = rbind(B,assign(paste("Base_house_",name_total[i,1],sep=""),data_house))
}
A and B give you the expected tables (without the string name of the State, but the first variable identifies the state).
I had to use the names(data_senate)[2] = "County" because the second column had a different name for some states.
Hope it helps!
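For anyone who prefers not to grow A and B inside a loop, here is a minimal sketch of the same stacking step with lapply and a single rbind. The two small data frames are made-up stand-ins for what read.csv2 returns per state (the column names County and District are assumptions, not the real Census headers). The point is only that cbind recycles the single state name across each table's rows before the tables are stacked.

```r
# Hypothetical stand-ins for the per-state tables returned by read.csv2();
# the real Census files have their own column names and values.
tables <- list(
  Alabama = data.frame(County = c("001", "003", "005"),
                       District = c("030", "032", "028"),
                       stringsAsFactors = FALSE),
  Alaska  = data.frame(County = c("013", "016"),
                       District = c("00S", "00S"),
                       stringsAsFactors = FALSE)
)

# Add the state name to each table (cbind recycles the single value),
# then stack all tables with one rbind call.
stacked <- do.call(rbind, lapply(names(tables), function(st) {
  cbind(State = st, tables[[st]])
}))
rownames(stacked) <- NULL
stacked
```

With the real data, the list could be filled by reading each URL in name_total[, 2] inside lapply, after renaming the columns so that they agree across states.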

Related

Arguments should have same length error in R

I'm trying to create a key-value store with the key being entities and the value being the average sentiment score of the entity in news articles.
I have a dataframe containing news articles and a list called organization1 holding the entities identified in those articles by a classifier. The first element of organization1 contains the entities identified in the article on the first row of the news_us dataframe. I'm trying to iterate through organization1 and create a key-value store with the key being the entity name and the value being the sentiment score of the news description in which the entity was mentioned.
I can get the sentiment scores for the entity from an article but I wanted to add them together and average the sentiment score.
library(syuzhet)
sentiment <- list()
organization1 <- list(NULL, "US", "Bath", "Animal Crossing", "World Health Organization",
NULL, c("Microsoft", "Facebook"))
news_us <- structure(list(title = c("Stocks making the biggest moves after hours: Bed Bath & Beyond, JC Penney, United Airlines and more - CNBC",
"Los Angeles mayor says 'very difficult to see' large gatherings like concerts and sporting events until 2021 - CNN",
"Bed Bath & Beyond shares rise as earnings top estimates, retailer plans to maintain some key investments - CNBC",
"6 weeks with Animal Crossing: New Horizons reveals many frustrations - VentureBeat",
"Timeline: How Trump And WHO Reacted At Key Moments During The Coronavirus Crisis : Goats and Soda - NPR",
"Michigan protesters turn out against Whitmer’s strict stay-at-home order - POLITICO"
), description = c("Check out the companies making headlines after the bell.",
"Los Angeles Mayor Eric Garcetti said Wednesday large gatherings like sporting events or concerts may not resume in the city before 2021 as the US grapples with mitigating the novel coronavirus pandemic.",
"Bed Bath & Beyond said that its results in 2020 \"will be unfavorably impacted\" by the crisis, and so it will not be offering a first-quarter nor full-year outlook.",
"Six weeks with Animal Crossing: New Horizons has helped to illuminate some of the game's shortcomings that weren't obvious in our first review.",
"How did the president respond to key moments during the pandemic? And how did representatives of the World Health Organization respond during the same period?",
"Many demonstrators, some waving Trump campaign flags, ignored organizers‘ pleas to stay in their cars and flooded the streets of Lansing, the state capital."
), name = c("CNBC", "CNN", "CNBC", "Venturebeat.com", "Npr.org",
"Politico")), na.action = structure(c(`35` = 35L, `95` = 95L,
`137` = 137L, `154` = 154L, `213` = 213L, `214` = 214L, `232` = 232L,
`276` = 276L, `321` = 321L), class = "omit"), row.names = c(NA,
6L), class = "data.frame")
setNames(lapply(news_us$description, get_sentiment), unlist(organization1))
#$US
#[1] 0
#$Bath
#[1] -0.4
#$`Animal Crossing`
#[1] -0.1
#$`World Health Organization`
#[1] 1.1
#$Microsoft
#[1] -0.6
#$Facebook
#[1] -1.9
tapply(sapply(news_us$description, get_sentiment), unlist(organization1), mean) #this line throws the error
Your problem seems to arise from the use of 'unlist'. Avoid this, as it drops the NULL values and concatenates list entries with multiple values.
Your organization1 list has 7 entries (two of which are NULL and one is length = 2). You should have 6 entries if this is to match the news_us data.frame - so something is out of sync there.
Let's assume the first 6 entries in organization1 are correct; I would bind them to your data.frame to avoid further 'sync errors':
news_us$organization1 = organization1[1:6]
Then you need to do the sentiment analysis on each row of the data.frame and bind the results to the organization1 value/s. The code below might not be the most elegant way to achieve this, but I think it does what you are looking for:
results = do.call("rbind", apply(news_us, 1, function(item){
if(!is.null(item$organization1[[1]])) cbind(item$organization1, get_sentiment(item$description))
}))
This code drops any rows where there were no detected organization1 values. It should also duplicate sentiment scores in the case of more than one organization1 being detected. The results will look like this (which I believe is your goal):
[,1] [,2]
[1,] "US" "-0.4"
[2,] "Bath" "-0.1"
[3,] "Animal Crossing" "1.1"
[4,] "World Health Organization" "-0.6"
The mean scores for each organization can then be collapsed using by, aggregate or similar.
[Edit: Examples of by and aggregate]
# results is a character matrix, so its columns are selected by position rather than with $
by(as.numeric(results[, 2]), results[, 1], mean)
aggregate(as.numeric(results[, 2]), list(results[, 1]), mean)
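The same idea can be sketched without apply: repeat each article's score once per entity found in it, then average by entity. The entities list and the scores below are made-up values for illustration (they are not output of get_sentiment), chosen so that one entity appears in two articles.

```r
# Hypothetical data: one list element per article, NULL when no entity was found
entities <- list(NULL, "US", "Bath", c("US", "Microsoft"))
# Hypothetical sentiment scores, one per article (stand-ins for get_sentiment())
scores <- c(0.2, 0, -0.4, 1.0)

# lengths() counts the entities per article (0 for NULL), so rep() drops
# articles without entities and duplicates scores for multi-entity articles.
long <- data.frame(
  entity = unlist(entities),
  score  = rep(scores, lengths(entities)),
  stringsAsFactors = FALSE
)

# Average score per entity
avg <- tapply(long$score, long$entity, mean)
avg
```

This relies on the list and the scores having the same length and order, which is the sync problem pointed out above.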

Retrieving latitude/longitude coordinates for cities/countries that have since changed names?

Say I have a vector of cities and countries, which may or may not include names of places that have since changed names:
locations <- c("Paris, France", "Sarajevo, Yugoslavia", "Rome, Italy", "Leningrad, Soviet Union", "St Petersburg, Russia")
The problem is that I can't use something like ggmap::geocode since it doesn't appear to work well for locations whose names have changed:
ggmap::geocode(locations, source = "dsk")
lon lat
1 2.34880 48.85341 #Works for Paris
2 NA NA #Didn't work for Sarajevo
3 12.48390 41.89474 #Works for Rome
4 98.00000 60.00000 #Didn't work for the old name of St Petersburg; it seems to just return the center of Russia
5 30.26417 59.89444 #Worked for St Petersburg
Are there alternative functions I could use? If I have to "update" the names of the cities and countries, is there an easy way of doing this? I have hundreds of locations for which I want to collect longitude and latitude coordinates.
This might not be what you had in mind, but if you use the exact same code with only the city names (and not the countries), at least the two cases that you mentioned (Sarajevo and Leningrad) seem to work fine. You could try to run the function with a modified locations vector including just the city names, and see if you still get errors. Something like this:
(cities <- gsub(',.*', '', locations))
## [1] "Paris" "Sarajevo" "Rome" "Leningrad" "St Petersburg"
cbind(ggmap::geocode(cities, source = 'dsk'), cities)
## lon lat cities
## 1 2.34880 48.85341 Paris
## 2 18.35644 43.84864 Sarajevo
## 3 12.48390 41.89474 Rome
## 4 30.26417 59.89444 Leningrad
## 5 30.26417 59.89444 St Petersburg
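If the country is needed for disambiguation, one option is to "update" the names with a small hand-made lookup before geocoding. This is only a sketch: the renamed vector is a hypothetical crosswalk that you would have to fill in yourself, and no geocoding call is made here.

```r
locations <- c("Paris, France", "Sarajevo, Yugoslavia", "Rome, Italy",
               "Leningrad, Soviet Union", "St Petersburg, Russia")

# Hypothetical crosswalk: old name -> current name (extend as needed)
renamed <- c(
  "Sarajevo, Yugoslavia"    = "Sarajevo, Bosnia and Herzegovina",
  "Leningrad, Soviet Union" = "St Petersburg, Russia"
)

# Replace only the locations that appear in the crosswalk; keep the rest
updated <- ifelse(locations %in% names(renamed), renamed[locations], locations)
updated
```

The updated vector can then be passed to the geocoding function in place of locations.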

Update a field if the value of a pattern is true

This is my first question so please excuse the mistakes.
I have a dataframe where the address is in one line and has many missing values and several errors.
Address
Braemor Drive, Clontarf, Co.Dublin
Meadow Avenue, Dundrum
Philipsburgh Avenue, Marino
Myrtle Square, The Coast
I would like to add a new field "District", if the value of the address contains certain values for example if it contains Marino, Fairview or Clontarf the District should be Dublin 3.
Dublin3 <- c("Marino", "Fairview", "Clontarf")
matches <- unique (grep(paste(Dublin3,collapse="|"),
DubPPReg$Address, value=TRUE))
Using R, how can I update the value of District where the match is true?
# I've created an example data frame with a column called Adress
df <- data.frame(Adress = c("Braemor Drive",
"Clontarf",
"Co.Dublin",
"Meadow Avenue",
"Dundrum",
"Philipsburgh Avenue",
"Marino",
"Myrtle Square", "The Coast"))
# And vector Dublin
Dublin3 <- c("Marino", "Fairview", "Clontarf")
# Match names in column Adress and vector Dublin 3
df$District <- ifelse(df$Adress %in% Dublin3, "Dublin 3",FALSE)
df
Adress District
1 Braemor Drive FALSE
2 Clontarf Dublin 3
3 Co.Dublin FALSE
4 Meadow Avenue FALSE
5 Dundrum FALSE
6 Philipsburgh Avenue FALSE
7 Marino Dublin 3
8 Myrtle Square FALSE
9 The Coast FALSE
Instead of FALSE you can choose something else (e.g. NA).
Edit: if your data are in a vector
df <- c("Braemor Drive, Churchtown, Co.Dublin",
"Meadow Avenue, Clontarf, Dublin 14",
"Sallymount Avenue, Ranelagh", "Philipsburgh Avenue, Marino")
Which looks like this
df
[1] "Braemor Drive, Churchtown, Co.Dublin"
[2] "Meadow Avenue, Clontarf, Dublin 14"
[3] "Sallymount Avenue, Ranelagh"
[4] "Philipsburgh Avenue, Marino"
You can find your matches using grepl like this
match <- ifelse(grepl("Marino|Fairview|Clontarf", df, ignore.case = TRUE), "Dublin 3", FALSE)
and output is
[1] "FALSE" "Dublin 3" "FALSE" "Dublin 3"
This means that at least one of the names you are looking for (i.e. Marino, Fairview or Clontarf) appears in the second and fourth elements of df.
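If there are several districts to assign, the grepl approach extends to a named list of area names. The sketch below uses the addresses from the question. The "Dublin 14" entry is a made-up mapping for illustration only, so check the real area-to-district assignments before using it.

```r
addresses <- c("Braemor Drive, Clontarf, Co.Dublin",
               "Meadow Avenue, Dundrum",
               "Philipsburgh Avenue, Marino",
               "Myrtle Square, The Coast")

# District -> area names; the "Dublin 14" entry is a hypothetical example
districts <- list(
  "Dublin 3"  = c("Marino", "Fairview", "Clontarf"),
  "Dublin 14" = c("Dundrum", "Churchtown")
)

# Start with NA and fill in the first district whose pattern matches
District <- rep(NA_character_, length(addresses))
for (d in names(districts)) {
  hit <- grepl(paste(districts[[d]], collapse = "|"), addresses, ignore.case = TRUE)
  District[hit & is.na(District)] <- d
}

data.frame(Address = addresses, District = District)
```

Addresses that match none of the patterns keep NA, which is usually easier to work with than the string "FALSE".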

Splitting a string by more than one space

I am trying to load some data into R that is in the following format (as a text file)
Name                  Country           Age
John,Smith            United Kingdom    20
Washington,George     USA               50
Martin,Joseph         Argentina         43
The problem I have is that the "columns" are separated by spaces such that they all line up nicely, but one row may have 5 spaces between values and the next 10 spaces. So when I load it in using read.delim I get a one column data.frame with
"John,Smith            United Kingdom    20"
as the first observation and so on.
Is there any way I can either:
Load the data into R into a usable format? or
Split the character strings up into separate columns once I load it in in the one column format?
My thought was to split the character strings by spaces, except it would need to be between 2 and x spaces (so, for example, "United Kingdom" stays together and doesn't become "United" "" "Kingdom"). But I don't know if that is possible.
I tried strsplit(data.frame[,1], split="\\s") but it returns a list of character strings like:
"John,Smith" "" "" "" "" "" "" "" "United" "" "Kingdom" "" ""...
which I don't know what to do with.
Having columns that all "line up nicely" is a typical characteristic of fixed-width data.
For the sake of this answer, I've written your three lines of data and one line of header information to a temporary file called "x". For your actual use, replace "x" with the file name/path, as you would normally use with read.delim.
Here's the sample data:
x <- tempfile()
cat("Name                  Country           Age\nJohn,Smith            United Kingdom    20\nWashington,George     USA               50\nMartin,Joseph         Argentina         43\n", file = x)
R has its own function for reading fixed-width data (read.fwf), but it is notoriously slow and you need to know the widths before you can get started. We can count those if the file is small, and then use something like:
read.fwf(x, c(22, 18, 4), strip.white = TRUE, skip = 1,
col.names = c("Name", "Country", "Age"))
# Name Country Age
# 1 John,Smith United Kingdom 20
# 2 Washington,George USA 50
# 3 Martin,Joseph Argentina 43
Alternatively, you can let fwf_empty from the "readr" package guess the widths for you, and then use read_fwf:
library(readr)
read_fwf(x, fwf_empty(x, col_names = c("Name", "Country", "Age")), skip = 1)
# Name Country Age
# 1 John,Smith United Kingdom 20
# 2 Washington,George USA 50
# 3 Martin,Joseph Argentina 43
You can do this in base R, assuming no value in a column contains two or more consecutive spaces:
txt = "Name                  Country           Age
John,Smith            United Kingdom    20
Washington,George     USA               50
Martin,Joseph         Argentina         43"
conn = textConnection(txt)
do.call(rbind, lapply(readLines(conn), function(u) strsplit(u,'\\s{2,}')[[1]]))
# [,1] [,2] [,3]
#[1,] "Name" "Country" "Age"
#[2,] "John,Smith" "United Kingdom" "20"
#[3,] "Washington,George" "USA" "50"
#[4,] "Martin,Joseph" "Argentina" "43"
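A variation on the same idea, as a rough sketch: turn each run of two or more spaces into a tab and let read.delim build the data frame, so the header row and the numeric Age column are handled for you. It carries the same assumption that no value contains two consecutive spaces. The lines vector stands in for readLines() on your file.

```r
# Stand-in for readLines("yourfile.txt")
lines <- c(
  "Name                  Country           Age",
  "John,Smith            United Kingdom    20",
  "Washington,George     USA               50",
  "Martin,Joseph         Argentina         43"
)

# Collapse every run of two or more whitespace characters into a single tab
delimited <- gsub("\\s{2,}", "\t", trimws(lines))

# read.delim() splits on tabs and uses the first line as the header
people <- read.delim(text = delimited, stringsAsFactors = FALSE)
str(people)
```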

State name to abbreviation

I have a large file with a variable state that has full state names. I would like to replace it with the state abbreviations (that is, "NY" for "New York"). Is there an easy way to do this (apart from using several if-else commands)? Maybe using the replace() function?
R has two built-in constants that might help: state.abb with the abbreviations, and state.name with the full names. Here is a simple usage example:
> x <- c("New York", "Virginia")
> state.abb[match(x,state.name)]
[1] "NY" "VA"
1) grep the full name from state.name and use that to index into state.abb:
state.abb[grep("New York", state.name)]
## [1] "NY"
1a) or using which:
state.abb[which(state.name == "New York")]
## [1] "NY"
2) or create a vector of state abbreviations whose names are the full names and index into it using the full name:
setNames(state.abb, state.name)["New York"]
## New York
## "NY"
Unlike (1), this one works even if "New York" is replaced by a vector of full state names, e.g. setNames(state.abb, state.name)[c("New York", "Idaho")]
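One caveat with (1) worth a quick illustration: grep does partial matching, so a name that is contained in another state's name returns more than one abbreviation. The sketch below uses only the built-in state.name and state.abb, and "Narnia" is just a made-up non-state to show the NA case.

```r
# grep() matches substrings, so "Virginia" also hits "West Virginia"
partial <- state.abb[grep("Virginia", state.name)]
partial

# Anchoring the pattern restricts grep() to an exact match
exact <- state.abb[grep("^Virginia$", state.name)]
exact

# match() compares whole strings and is vectorised; unknown names give NA
matched <- state.abb[match(c("Virginia", "Kansas", "Narnia"), state.name)]
matched
```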
Old post I know, but wanted to throw mine in there. I learned on tidyverse, so for better or worse I avoid base R when possible. I wanted one with DC too, so first I built the crosswalk:
library(tidyverse)
st_crosswalk <- tibble(state = state.name) %>%
bind_cols(tibble(abb = state.abb)) %>%
bind_rows(tibble(state = "District of Columbia", abb = "DC"))
Then I joined it to my data:
left_join(data, st_crosswalk, by = "state")
I found that the built-in state.name and state.abb have only the 50 states. I got a bigger table (including DC and so on) online (e.g., this link: http://www.infoplease.com/ipa/A0110468.html) and pasted it into a .csv file named States.csv. I then load the state names and abbreviations from this file instead of using the built-ins. The rest is quite similar to @Aniko's answer.
library(dplyr)
library(stringr)
library(stringdist)
# setwd(...) # set the working directory to the folder containing States.csv
# load data
data = c("NY", "New York", "NewYork")
data = toupper(data)
# load state name and abbr.
State.data = read.csv('States.csv')
State = toupper(State.data$State)
Stateabb = as.vector(State.data$Abb)
# match data with state names, misspell of 1 letter is allowed
match = amatch(data, State, maxDist=1)
data[ !is.na(match) ] = Stateabb[ na.omit( match ) ]
There's a small difference between match and amatch in how they calculate the distance from one word to another. See P25-26 here http://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
You can also use base::abbreviate if you don't have US state names. This won't give you equally sized abbreviations unless you increase minlength.
state.name %>% base::abbreviate(minlength = 1)
Here is another way of doing it in case you have more than one state in your data and you want to replace the names with the corresponding abbreviations.
#creating a list of names
states_df <- c("Alabama","California","Nevada","New York",
"Oregon","Texas", "Utah","Washington")
states_df <- as.data.frame(states_df)
The output is
> print(states_df)
states_df
1 Alabama
2 California
3 Nevada
4 New York
5 Oregon
6 Texas
7 Utah
8 Washington
Now, using the built-in state.abb vector together with match(), you can easily convert the names into abbreviations, and vice versa.
states_df$state_code <- state.abb[match(states_df$states_df, state.name)]
> print(states_df)
states_df state_code
1 Alabama AL
2 California CA
3 Nevada NV
4 New York NY
5 Oregon OR
6 Texas TX
7 Utah UT
8 Washington WA
If matching state names to abbreviations or the other way around is something you have to do frequently, you could put Aniko's solution in a function in a .Rprofile or a package:
state_to_st <- function(x){
c(state.abb, 'DC')[match(x, c(state.name, 'District of Columbia'))]
}
st_to_state <- function(x){
c(state.name, 'District of Columbia')[match(x, c(state.abb, 'DC'))]
}
Using that function as a part of a dplyr chain:
enframe(state.name, value = 'state_name') %>%
mutate(state_abbr = state_to_st(state_name))
