Split string of a data frame into new columns - r

I need your help because I have a data frame with a very difficult format. My data frame
data <- data.frame(information = c("{u'info1': u'mnfd', u'text': u'exampletext'}","{u'info2': u'332', u'text': u'lalala'}","{u'info1': u'', u'text': u'blub'}"))
has the column information (and a few other columns in the real data frame) and looks, for example, like this:
## information
## 1 {u'info1': u'mnfd', u'text': u'exampletext'}
## 2 {u'info2': u'332', u'text': u'lalala'}
## 3 {u'info1': u'', u'text': u'blub'}
The real data frame has a few thousand rows and the strings are much longer. I would like to add columns that display the information from the strings, so at the end I would like to have a data frame looking like this (the string "of_" is added before every column name):
information of_info1 of_text of_info2
1 {u'info1': u'mnfd', u'text': u'exampletext'} mnfd exampletext <NA>
2 {u'info2': u'332', u'text': u'lalala'} <NA> lalala 332
3 {u'info1': u'', u'text': u'blub'} blub <NA>
Thanks for your help

This is close to a JSON file, so do a bit of formatting to get it right, and then import via the awesome jsonlite package:
library(jsonlite)
# replace the Python-style u'...' quoting with double quotes, then wrap the rows in a JSON array
fromJSON(paste0("[", paste(gsub("(u|)'", '"', data$information), collapse = ",\n"), "]"))
# info1 text info2
#1 mnfd exampletext <NA>
#2 <NA> lalala 332
#3 blub <NA>
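If you also want the of_ prefix from the question and the result attached back to data, a small follow-up sketch in base R (the parsed object is the same fromJSON result as above):
parsed <- fromJSON(paste0("[", paste(gsub("(u|)'", '"', data$information), collapse = ",\n"), "]"))
names(parsed) <- paste0("of_", names(parsed))   # add the requested prefix
cbind(data, parsed)                             # keep the original column alongside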

Here's a version with dplyr and stringr. It should not be too difficult to translate it into base R if you prefer that.
This will break, however, if there are escaped single quotation marks in the fields; a contrived example follows the code.
library(stringr)
library(dplyr)
data <- data$information %>%
  str_match_all("u'([^']+)': u'([^']*)'") %>%
  lapply(function(matches) {
    result <- data.frame(as.list(matches[, 3]), stringsAsFactors = FALSE)
    colnames(result) <- paste0("of_", matches[, 2])
    result
  }) %>%
  bind_rows() %>%
  bind_cols(data, .)
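To see the escaped-quote caveat in action, here is a contrived input (not from the question):
str_match_all("{u'text': u'it\\'s broken'}", "u'([^']+)': u'([^']*)'")
# the value capture stops at the backslash, so the field comes back truncated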

Convert it to DCF format and then read it in using read.dcf. No packages are used.
First we remove the junk, giving s0, and then split on comma-space, giving s1. Then we add an empty terminating line between records, giving s2. Finally, we use read.dcf to read that in and append it to data.
s0 <- gsub("[{}]", "", gsub("u'(.*?)'", "\\1", data$information))
s1 <- strsplit(s0, ", ")
s2 <- unlist(lapply(s1, c, ""))
cbind(data, read.dcf(textConnection(s2)))
giving:
information info1 text info2
1 {u'info1': u'mnfd', u'text': u'exampletext'} mnfd exampletext <NA>
2 {u'info2': u'332', u'text': u'lalala'} <NA> lalala 332
3 {u'info1': u'', u'text': u'blub'} blub <NA>
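For reference, this is the intermediate DCF text in s2 that read.dcf consumes (each record is a set of "field: value" lines terminated by a blank line):
cat(s2, sep = "\n")
# info1: mnfd
# text: exampletext
#
# info2: 332
# text: lalala
#
# info1:
# text: blub
#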
magrittr
This could also be expressed as a nested magrittr pipeline like this:
library(magrittr)
data %>%
  cbind({
    .$information %>%
      gsub("u'(.*?)'", "\\1", .) %>%
      gsub("[{}]", "", .) %>%
      strsplit(", ") %>%
      lapply(c, "") %>%
      unlist %>%
      textConnection %>%
      read.dcf
  })

Related

Formatting a column in csvs while I import them

I'm importing several .csvs that are all two columns wide (they're output from a program). The first column is wavelength and the second is absorbance, but I'm naming the second by the file name so the files can be combined later, like in this old Stack Overflow answer (Combining csv files in R to different columns). The incoming .csvs don't have headers, and I'm aware that the way I'm naming them crops the first data points. I would like the first column to not have any decimals, and to standardize all of the numbers to four digits; the code I've added works on its own but not in this block, and I would prefer to do this formatting all in one go. I run into errors with $ not being the right operator, but when I use [] I get errors about that too. The column I need to do this to is the first, and it's named 'Wavelength', which also gives me errors, either because Wavelength doesn't exist or because it's non-numeric. Any ideas?
This is what my script currently looks like:
for (file in file_list) {
  f <- sub("(.*)\\.CSV", "\\1", file)
  assign(f, read.csv(file = file))
  assign(f, setNames(get(f), c(names(get(f))[0:0], "Wavelength")))
  assign(f, setNames(get(f), c(names(get(f))[1:1], file)))
  floor(f[Wavelength])            # the issues are here
  sprintf("%04d", f$Wavelength)   # and here
}
The data looks like this in the csv before it gets processed:
1 401.7664 0.1379457
2 403.8058 0.1390427
3 405.8452 0.1421666
4 407.8847 0.1463629
5 409.9241 0.1477264
I would like the output to be:
Wavelength (file name)
1 0401 0.1379457
2 0403 0.1390427
3 0405 0.1421666
4 0407 0.1463629
5 0409 0.1477264
And here's the dput that r2evans asked for:
structure(list(X3.997270e.002 = c(401.7664, 403.8058, 405.8452,
    407.8847, 409.9241, 411.9635), X1.393858e.001 = c(0.1379457,
    0.1390427, 0.1421666, 0.1463629, 0.1477264, 0.1476971)),
  row.names = c(NA, 6L), class = "data.frame")
Thanks in advance!
6/24 Update:
When I assign the column name "Wavelength", it only gets added as a character, not as a real column name. When I dput/head the files once they go through (omitting the sprintf/floor functions), only the file name (the second column) is listed. When I open the csvs in RStudio the first column is properly labeled, and I'm even able to combine all the csvs sorted by "Wavelength":
list_csvs <- mget(sub("(.*)\\.CSV", "\\1", file_list))
all_csvs <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "Wavelength"),
                   list_csvs, accumulate = FALSE)
Naturally I've thought about just formatting the column after this, but some of the decimals are off in the thousands place so I do need to format before I merge the csvs.
I've updated the code to use colnames outside of the read.csv:
for (file in file_list) {
  f <- sub("(.*)\\.CSV", "\\1", file)
  assign(f, read.csv(file = file,
                     header = FALSE,
                     row.names = NULL))
  colnames(f) <- c("Wavelength", file)
  print(summary(f))
  print(names(f))
  #floor("Wavelength")                # I'm omitting this to see the console errors
  #sprintf("%04.0f", f["Wavelength"]) # omitting this too
}
but I get the following error:
attempt to set 'colnames' on an object with less than two dimensions
Without the naming bit and without the sprintf/floor I get this back from the summary and names prompt for each file:
Length Class Mode
1 character character
NULL
When I try to call out the first column by f[1], f[[1]], f[,1], or f[[,1]] I get error messages about 'incorrect number of dimensions'. I can clearly see in the R environment that each data frame has a length of 2. I also double checked with .row_names_info(f) that the first column isn't being read as row names. What am I doing wrong?
I'm going to suggest a dplyr/tidyr pipe for this.
First, data-setup:
writeLines(
  "401.7664,0.1379457
403.8058,0.1390427
405.8452,0.1421666
407.8847,0.1463629
409.9241,0.1477264", "sample1.csv")
file.copy("sample1.csv", "sample2.csv")
file_list <- normalizePath(list.files(pattern = ".*\\.csv$", full.names = TRUE), winslash = "/")
file_list
# [1] "C:/Users/r2/StackOverflow/13765634/sample1.csv"
# [2] "C:/Users/r2/StackOverflow/13765634/sample2.csv"
First, I'm going to suggest a slightly different format: not naming the column for the filename. I like this because I'm still going to preserve the filename with the data (as a category, so to speak), but it allows you to combine all of your data into one frame for more efficient processing:
library(dplyr)
library(purrr) # map*
library(tidyr) # pivot_wider
file_list %>%
  set_names(.) %>%
  # set_names(tools::file_path_sans_ext(basename(.))) %>%
  map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
          .id = "filename") %>%
  mutate(freq = sprintf("%04.0f", freq))
# filename freq val
# 1 C:/Users/r2/StackOverflow/13765634/sample1.csv 0402 0.1379457
# 2 C:/Users/r2/StackOverflow/13765634/sample1.csv 0404 0.1390427
# 3 C:/Users/r2/StackOverflow/13765634/sample1.csv 0406 0.1421666
# 4 C:/Users/r2/StackOverflow/13765634/sample1.csv 0408 0.1463629
# 5 C:/Users/r2/StackOverflow/13765634/sample1.csv 0410 0.1477264
# 6 C:/Users/r2/StackOverflow/13765634/sample2.csv 0402 0.1379457
# 7 C:/Users/r2/StackOverflow/13765634/sample2.csv 0404 0.1390427
# 8 C:/Users/r2/StackOverflow/13765634/sample2.csv 0406 0.1421666
# 9 C:/Users/r2/StackOverflow/13765634/sample2.csv 0408 0.1463629
# 10 C:/Users/r2/StackOverflow/13765634/sample2.csv 0410 0.1477264
Options: if you prefer just the filename (no path) and are certain that there is no filename collision, then use set_names(basename(.)) instead. (This step is really necessary when using the filename as a column name anyway.) I'll also remove the file extension, since they're likely all .csv or similar.
file_list %>%
  # set_names(.) %>%
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
          .id = "filename") %>%
  mutate(freq = sprintf("%04.0f", freq))
# filename freq val
# 1 sample1 0402 0.1379457
# 2 sample1 0404 0.1390427
# 3 sample1 0406 0.1421666
# 4 sample1 0408 0.1463629
# 5 sample1 0410 0.1477264
# 6 sample2 0402 0.1379457
# 7 sample2 0404 0.1390427
# 8 sample2 0406 0.1421666
# 9 sample2 0408 0.1463629
# 10 sample2 0410 0.1477264
(If you need to do something to each dataset at a time, then you should use %>% group_by(filename), not sure if that's relevant.)
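For instance, a sketch of a per-file operation on the combined frame (the rescaling step here is hypothetical):
file_list %>%
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
          .id = "filename") %>%
  group_by(filename) %>%
  mutate(val_norm = val / max(val)) %>%   # hypothetical per-file rescaling
  ungroup()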
If you really need the filename to be the column name of the value, then modify this slightly so that it preserves it as a list:
file_list %>%
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  map(~ read.csv(.x, header = FALSE, col.names = c("freq", "val"))) %>%
  map2(., names(.), ~ transmute(.x, freq = sprintf("%04.0f", freq), !!.y := val))
# $sample1
# freq sample1
# 1 0402 0.1379457
# 2 0404 0.1390427
# 3 0406 0.1421666
# 4 0408 0.1463629
# 5 0410 0.1477264
# $sample2
# freq sample2
# 1 0402 0.1379457
# 2 0404 0.1390427
# 3 0406 0.1421666
# 4 0408 0.1463629
# 5 0410 0.1477264
But I'm going to infer that ultimately you want to combine these column-wise, assuming there will be alignment in the freq column. (I can't think of another reason why you'd want the column name to be the filename.)
For that, try this, reverting to the first use of map_dfr and introducing pivot_wider:
file_list %>%
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
          .id = "filename") %>%
  mutate(freq = sprintf("%04.0f", freq)) %>%
  pivot_wider(freq, names_from = filename, values_from = val)
# # A tibble: 5 x 3
# freq sample1 sample2
# <chr> <dbl> <dbl>
# 1 0402 0.138 0.138
# 2 0404 0.139 0.139
# 3 0406 0.142 0.142
# 4 0408 0.146 0.146
# 5 0410 0.148 0.148
Notes (perhaps more of a soap-box):
Regarding your use of assign: I strongly discourage this pattern. Since the data is effectively all structured the same, I infer that you'll be doing the same thing to each of these files. In that case, it is much better to use one of the *apply functions on a list of data.frames. That is, instead of having to iterate over a list of variable names, get each one, do something, and reassign it, it is often much easier (to program, to read, to maintain) to write dats <- lapply(dats, some_function) or dats2 <- lapply(dats, function(x) { ...; x; }).
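A minimal sketch of that list-of-data-frames pattern, assuming the same file_list as above (the column names are illustrative):
dats <- lapply(file_list, read.csv, header = FALSE,
               col.names = c("Wavelength", "Absorbance"))
names(dats) <- tools::file_path_sans_ext(basename(file_list))
# one pass over every file, no assign/get gymnastics
dats <- lapply(dats, function(x) { x$Wavelength <- sprintf("%04.0f", x$Wavelength); x })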
Regarding the use of filename-as-column-name. Some tools (e.g., ggplot2) really benefit from having "long" data (i.e., one or more category columns such as filename, and one column for each type of data ... type is relative to your understanding of the data). You might benefit from reframing your thinking on working with this data.

Convert string data into data frame

I am new to R, any suggestions would be appreciated.
This is the data:
coordinates <- "(-79.43591570873059, 43.68015339477487), (-79.43491506339724, 43.68036886994886), (-79.43394727223847, 43.680578504490335), (-79.43388162422195, 43.68058996121469), (-79.43281544978878, 43.680808044458765), (-79.4326971769691, 43.68079658822322)"
I would like this to become:
Latitude Longitude
-79.43591570873059 43.68015339477487
-79.43491506339724 43.68036886994886
-79.43394727223847 43.680578504490335
-79.43388162422195 43.68058996121469
-79.43281544978878 43.680808044458765
-79.4326971769691 43.68079658822322
You can use scan with a little gsub:
matrix(scan(text = gsub("[()]", "", coordinates), sep = ","),
ncol = 2, byrow = TRUE, dimnames = list(NULL, c("Lat", "Long")))
# Read 12 items
# Lat Long
# [1,] -79.43592 43.68015
# [2,] -79.43492 43.68037
# [3,] -79.43395 43.68058
# [4,] -79.43388 43.68059
# [5,] -79.43282 43.68081
# [6,] -79.43270 43.68080
The precision is still there; it is just truncated in the matrix display.
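A quick way to confirm that, assuming the matrix is stored in m:
m <- matrix(scan(text = gsub("[()]", "", coordinates), sep = ","),
            ncol = 2, byrow = TRUE, dimnames = list(NULL, c("Lat", "Long")))
print(m, digits = 17)   # shows the full stored precision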
Two clear advantages:
Fast.
Handles multi-element "coordinates" vector (eg: coordinates <- rep(coordinates, 10) as an input).
Here's another option:
library(data.table)
fread(gsub("[()]", "", gsub("), (", "\n", toString(coordinates), fixed = TRUE)), header = FALSE)
The toString(coordinates) is for cases when length(coordinates) > 1. You could also use fread(text = gsub(...), ...) and skip using toString. I'm not sure of the advantages or limitations of either approach.
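For completeness, the text = variant might look like this for a single-element coordinates (a sketch; it should behave like the version above):
fread(text = gsub("[()]", "", gsub("), (", "\n", coordinates, fixed = TRUE)),
      header = FALSE)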
We can use str_extract_all from stringr
library(stringr)
df <- data.frame(Latitude = str_extract_all(coordinates, "(?<=\\()-\\d+\\.\\d+")[[1]],
                 Longitude = str_extract_all(coordinates, "(?<=,\\s)\\d+\\.\\d+(?=\\))")[[1]])
df
# Latitude Longitude
#1 -79.43591570873059 43.68015339477487
#2 -79.43491506339724 43.68036886994886
#3 -79.43394727223847 43.680578504490335
#4 -79.43388162422195 43.68058996121469
#5 -79.43281544978878 43.680808044458765
#6 -79.4326971769691 43.68079658822322
Latitude captures the negative decimal number after the opening round bracket ((), whereas Longitude captures the number between the comma (,) and the closing round bracket ()).
Or, without regex lookahead and lookbehind, capture both numbers together using str_match_all:
df <- data.frame(str_match_all(coordinates,
                               "\\((-\\d+\\.\\d+),\\s(\\d+\\.\\d+)\\)")[[1]][, c(2, 3)])
To convert the columns to their respective types, you could use type.convert:
df <- type.convert(df)
Here is a base R option:
coordinates <- "(-79.43591570873059, 43.68015339477487), (-79.43491506339724, 43.68036886994886), (-79.43394727223847, 43.680578504490335), (-79.43388162422195, 43.68058996121469), (-79.43281544978878, 43.680808044458765), (-79.4326971769691, 43.68079658822322)"
coordinates <- gsub("^\\(|\\)$", "", coordinates)
x <- strsplit(coordinates, "\\), \\(")[[1]]
df <- data.frame(lat=sub(",.*$", "", x), lng=sub("^.*, ", "", x), stringsAsFactors=FALSE)
df
The strategy here is to first strip the leading and trailing parentheses, then split the string on "\\), \\(" to generate a single character vector with one latitude/longitude pair per element. Finally, we generate the data frame output.
lat lng
1 -79.43591570873059 43.68015339477487
2 -79.43491506339724 43.68036886994886
3 -79.43394727223847 43.680578504490335
4 -79.43388162422195 43.68058996121469
5 -79.43281544978878 43.680808044458765
6 -79.4326971769691 43.68079658822322
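The two columns come out as character here; if numeric columns are wanted, type.convert (used in the answer above) or a plain as.numeric pass would do:
df[] <- lapply(df, as.numeric)   # or: df <- type.convert(df, as.is = TRUE)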
Yet another base R version with a bit of regex, relying on the fact that replacing the punctuation with newlines produces blank lines that get skipped on import.
read.csv(text=gsub(")|(, |^)\\(", "\n", coordinates), col.names=c("lat","long"), header=FALSE)
# lat long
#1 -79.43592 43.68015
#2 -79.43492 43.68037
#3 -79.43395 43.68058
#4 -79.43388 43.68059
#5 -79.43282 43.68081
#6 -79.43270 43.68080
Advantages:
Deals with vector input as well like the other scan answer.
Converts to correct numeric types in output
Disadvantages:
Not super fast
We can use rm_round from qdapRegex
library(qdapRegex)
read.csv(text = rm_round(coordinates, extract = TRUE)[[1]], header = FALSE,
         col.names = c('lat', 'lng'))
# lat lng
#1 -79.43592 43.68015
#2 -79.43492 43.68037
#3 -79.43395 43.68058
#4 -79.43388 43.68059
#5 -79.43282 43.68081
#6 -79.43270 43.68080
Or in combination with tidyverse
library(tidyr)
library(dplyr)
rm_round(coordinates, extract = TRUE)[[1]] %>%
  tibble(col1 = .) %>%
  separate(col1, into = c('lat', 'lng'), sep = ",\\s*", convert = TRUE)
# A tibble: 6 x 2
# lat lng
# <dbl> <dbl>
#1 -79.4 43.7
#2 -79.4 43.7
#3 -79.4 43.7
#4 -79.4 43.7
#5 -79.4 43.7
#6 -79.4 43.7

Conversion of data (one column) with stringr

I have a table called DATA_SET. This table contains one column with six different cases of data.
#DATA_SET
DATA_SET <- data.frame(
  CUSTOMS_RATE = c("20", "15+0,41 eur/kg", "10+0,1 eur/kg max.17",
                   "0,1 eur/l max.17", "0,04 eur/kg max.10", "NA")
)
View(DATA_SET)
#DATA_SET1
DATA_SET1 <- data.frame(
  RATE = "",
  SPECIFIC_RATE = "",
  MAXIMUM_RATE = ""
)
So my intention is to divide this column into three different columns, in order to continue with other statistical operations (calculation of averages, etc.), like the table DATA_SET1 above.
So can anybody help me transform this table?
Usually, separate would be a better option, but in this case the positions of the numbers are not the same in each row (and some are missing). So we use str_extract to extract the values individually.
library(tidyverse)
DATA_SET %>%
  mutate(CUSTOMS_RATE = str_replace_all(CUSTOMS_RATE, ",", "."),
         RATE = str_extract(CUSTOMS_RATE, "^[0-9]+(?=\\+|$)"),
         SPECIFIC_RATE = str_extract(CUSTOMS_RATE, "\\d+\\.\\d+"),
         MAXIMUM_RATE = str_extract(CUSTOMS_RATE, "(?<=max\\.)\\d+")) %>%
  select(2:4) %>%
  mutate_all(as.numeric)
# RATE SPECIFIC_RATE MAXIMUM_RATE
#1 20 <NA> <NA>
#2 15 0.41 <NA>
#3 10 0.1 17
#4 <NA> 0.1 17
#5 <NA> 0.04 10
#6 <NA> <NA> <NA>
Or use str_replace to create a single delimiter and then use separate
DATA_SET %>%
  mutate(CUSTOMS_RATE = str_replace_all(CUSTOMS_RATE, ",", ".") %>%
           str_replace("\\+?([0-9]+\\.[0-9]+)", "+\\1") %>%
           str_replace_all("[A-Za-z/ ]+\\.?", "+")) %>%
  separate(CUSTOMS_RATE, into = c("RATE", "SPECIFIC_RATE", "MAXIMUM_RATE"),
           sep = "\\+", convert = TRUE)
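If the second pipeline misbehaves on a new pattern, it can help to inspect the intermediate delimiter string by stopping before separate:
DATA_SET %>%
  mutate(CUSTOMS_RATE = str_replace_all(CUSTOMS_RATE, ",", ".") %>%
           str_replace("\\+?([0-9]+\\.[0-9]+)", "+\\1") %>%
           str_replace_all("[A-Za-z/ ]+\\.?", "+"))
# e.g. "10+0,1 eur/kg max.17" becomes "10+0.1+17" before splitting on "+"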

Add synonyms from qdap to a preexisting dataframe in R

I have created the following dataframe df in R
Sl NO Word
1 get
2 Free
3 Joshi
4 Hello
5 New
I have used this code to get a list of synonyms, but the results come back as a list:
library(qdap)
synonyms(DF$Word)
I am getting a list of synonymous words for this. I want the synonyms for each word in the data frame appended row-wise as separate columns:
DF<-
Sl NO Word Syn1 Syn2
1 get obtain receive
2 Free independent NA
3 Joshi NA NA
4 Hello Greeting NA
5 New Unused Fresh
Is there an elegant way to obtain this? Are there other dictionaries that can be used for this?
One approach could be to use mapply and pass one word at a time to qdap::synonyms. The result from synonyms can be collapsed into a single column using paste0 with collapse = "|". Now the data is ready.
Use tidyr::separate to split that column into Syn1, Syn2, etc.
Note: synonyms is called with two extra arguments, return.list = FALSE and multiwords = FALSE.
The code below caps the result at a maximum of 10 synonyms, but the solution can be extended to handle the number dynamically.
library(tidyverse)
library(qdap)
df %>%
  mutate(Synonyms = mapply(function(x)
           paste0(head(synonyms(x, return.list = FALSE, multiwords = FALSE), 10),
                  collapse = "|"),
         tolower(.$Word))) %>%
  separate(Synonyms, paste("Syn", 1:10), sep = "\\|", extra = "drop")
Result:
# SlNO Word Syn 1 Syn 2 Syn 3 Syn 4 Syn 5 Syn 6 Syn 7 Syn 8 Syn 9 Syn 10
# 1 1 get achieve acquire attain bag bring earn fetch gain glean inherit
# 2 2 Free buckshee complimentary gratis gratuitous unpaid footloose independent liberated loose uncommitted
# 3 3 Joshi <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 4 4 Hello <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 5 5 New advanced all-singing all-dancing contemporary current different fresh ground-breaking happening latest
Data
df <- read.table(text =
"SlNO Word
1 get
2 Free
3 Joshi
4 Hello
5 New",
header = TRUE, stringsAsFactors = FALSE)
Here is another approach with splitstackshape::cSplit.
library(tidyverse)
library(qdap)
library(splitstackshape)
DF <- df   # the sample data frame from the question; the original answer read it from an undefined `tt`
DF <- DF %>% mutate_at(vars(Word), tolower)
syns <- synonyms_frame(synonyms(tolower(DF$Word))) %>%
  mutate_at(vars(x), funs(str_remove(x, "\\..*"))) %>%
  mutate_at(vars(y), funs(str_extract(y, '[:alpha:]+'))) %>%
  group_by(x) %>%
  summarise(Syn = toString(y)) %>%
  rename(Word = x) %>%
  cSplit('Syn')
left_join(DF, syns)
I am not sure exactly how you would like to add all synonyms of a word, because when you run synonyms("get") it gives 75 definitions of get, and I feel the desired layout will not be of much help if you add the values of all 75 definitions in a single row.
So in the solution below I have selected only the very first definition.
library(qdap)
library(dplyr)
library(splitstackshape)
df %>%
  rowwise() %>%
  mutate(synonym_of_word = paste(synonyms(tolower(word))[[1]], collapse = ",")) %>%
  cSplit("synonym_of_word", ",")
Sample data:
df <- structure(list(sl_no = 1:5, word = c("get", "Free", "Joshi", "Hello", "New")),
  .Names = c("sl_no", "word"), class = "data.frame", row.names = c(NA, -5L))

Creating a dataframe from a scraped character vector

I am trying to create a dataframe that has the columns: First name, Last name, Party, State, Member ID. Here is my code
library('rvest')
candidate_url <- 'https://www.congress.gov/help/field-values/member-bioguide-ids'
candidate_page <- read_html(candidate_url)
candidate_nodes <- html_nodes(candidate_page, 'table')
candidate_list <- html_text(candidate_nodes)
My main issue is getting the member IDs. An example ID is A000009. When I use the gsub function I lose the leading A in this example; the A comes from this candidate's last name (Abercrombie), and I do not know how to add the A back into the member ID. Of course, if there's a better way, I am open to any suggestions.
Since you've got an HTML table, use html_table to extract it to a data.frame. You'll need fill = TRUE, because the table has extra empty rows inserted between each entry, which you can easily drop afterwards with tidyr::drop_na.
library(tidyverse)
library(rvest)
page <- 'https://www.congress.gov/help/field-values/member-bioguide-ids' %>%
  read_html()
members <- page %>%
  html_node('table') %>%
  html_table(fill = TRUE) %>%
  set_names('member', 'bioguide') %>%
  drop_na(member) %>%   # remove empty rows inserted in the table
  tbl_df()              # for printing
members
#> # A tibble: 2,243 x 2
#> member bioguide
#> * <chr> <chr>
#> 1 Abdnor, James (Republican - South Dakota) A000009
#> 2 Abercrombie, Neil (Democratic - Hawaii) A000014
#> 3 Abourezk, James (Democratic - South Dakota) A000017
#> 4 Abraham, Ralph Lee (Republican - Louisiana) A000374
#> 5 Abraham, Spencer (Republican - Michigan) A000355
#> 6 Abzug, Bella S. (Democratic - New York) A000018
#> 7 Acevedo-Vila, Anibal (Democratic - Puerto Rico) A000359
#> 8 Ackerman, Gary L. (Democratic - New York) A000022
#> 9 Adams, Alma S. (Democratic - North Carolina) A000370
#> 10 Adams, Brock (Democratic - Washington) A000031
#> # ... with 2,233 more rows
The member column could be further extracted, if you like.
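For instance, a sketch with tidyr::extract (the regex assumes the "Last, First (Party - State)" pattern shown above):
members %>%
  extract(member, into = c("name", "party", "state"),
          regex = "^(.*?)\\s*\\((.*?) - (.*)\\)$", remove = FALSE)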
There are also many other useful sources for this data, some of which correlate it with other useful variables. This one is well-structured and updated regularly.
Give this a try. I have updated this to include separating out the different fields.
library('rvest')
library('dplyr')
library('tidyr')
candidate_url <- 'https://www.congress.gov/help/field-values/member-bioguide-ids'
candidate_page <- read_html(candidate_url)
candidate_nodes <- html_nodes(candidate_page, 'table')
df.candidates <- as.data.frame(html_table(candidate_nodes, header = TRUE, fill = TRUE),
                               stringsAsFactors = FALSE)
df.candidates <- df.candidates[!is.na(df.candidates$Member), ]
df.candidates <- df.candidates %>%
  mutate(Party.State = gsub("[\\(\\)]", "",
                            regmatches(Member, gregexpr("\\(.*?\\)", Member))[[1]])) %>%
  separate(Party.State, into = c("Party", "State"), sep = " - ") %>%
  mutate(Full.name = trimws(regmatches(df.candidates$Member,
                                       regexpr("^[^\\(]+", df.candidates$Member)))) %>%
  separate(Full.name, into = c("Last.Name", "First.Name", "Suffix"),
           sep = ",", fill = "right") %>%
  select(First.Name, Last.Name, Suffix, Party, State, Member.ID)
This is a bit hackish, but if you want to extract the variables using regex here are a few pointers.
candidate_list <- unlist(candidate_list)
ID <- regmatches(candidate_list,
                 gregexpr("[a-zA-Z]{1}[0-9]{6}", candidate_list))
party_state <- regmatches(candidate_list,
                          gregexpr("(?<=\\()[^)]+(?=\\))", candidate_list, perl = TRUE))
names_etc <- strsplit(candidate_list, "[a-zA-Z]{1}[0-9]{6}")
names <- sapply(names_etc, function(x) sub(" \\([^)]*\\)", "", x))
