Creating a dataframe from a scraped character vector

I am trying to create a dataframe that has the columns: First name, Last name, Party, State, Member ID. Here is my code:
library('rvest')
candidate_url <- 'https://www.congress.gov/help/field-values/member-bioguide-ids'
candidate_page <- read_html(candidate_url)
candidate_nodes <- html_nodes(candidate_page, 'table')
candidate_list <- html_text(candidate_nodes)
My main issue is getting the member IDs. An example ID is A000009. When I use the gsub function I lose the leading A in this example. The A is from this candidate's last name (Abercrombie), but I do not know how to add the A back into the member ID. Of course if there's a better way I am open to any suggestions.

Since you've got an HTML table, use html_table to extract it to a data.frame. You'll need fill = TRUE, because the table has extra empty rows inserted between each entry, which you can easily drop afterwards with tidyr::drop_na.
library(tidyverse)
library(rvest)
page <- 'https://www.congress.gov/help/field-values/member-bioguide-ids' %>%
  read_html()
members <- page %>%
  html_node('table') %>%
  html_table(fill = TRUE) %>%
  set_names('member', 'bioguide') %>%
  drop_na(member) %>% # remove empty rows inserted in the table
  as_tibble() # for printing; replaces the deprecated tbl_df()
members
#> # A tibble: 2,243 x 2
#> member bioguide
#> * <chr> <chr>
#> 1 Abdnor, James (Republican - South Dakota) A000009
#> 2 Abercrombie, Neil (Democratic - Hawaii) A000014
#> 3 Abourezk, James (Democratic - South Dakota) A000017
#> 4 Abraham, Ralph Lee (Republican - Louisiana) A000374
#> 5 Abraham, Spencer (Republican - Michigan) A000355
#> 6 Abzug, Bella S. (Democratic - New York) A000018
#> 7 Acevedo-Vila, Anibal (Democratic - Puerto Rico) A000359
#> 8 Ackerman, Gary L. (Democratic - New York) A000022
#> 9 Adams, Alma S. (Democratic - North Carolina) A000370
#> 10 Adams, Brock (Democratic - Washington) A000031
#> # ... with 2,233 more rows
The member column could be further extracted, if you like.
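As a rough sketch of that further extraction, the base R snippet below splits a few member strings (copied from the output above) into name, party, and state. The regular expression and the output column names (first_name, last_name, and so on) are my own assumptions about the "Last, First (Party - State)" layout, so entries with extra parentheses or unusual punctuation may need more care.

```r
# Sample rows copied from the printed tibble above
members <- data.frame(
  member = c("Abdnor, James (Republican - South Dakota)",
             "Abercrombie, Neil (Democratic - Hawaii)",
             "Acevedo-Vila, Anibal (Democratic - Puerto Rico)"),
  bioguide = c("A000009", "A000014", "A000359"),
  stringsAsFactors = FALSE
)

# Assumed layout: "Last, First (Party - State)"
pattern <- "^([^,]+),\\s*(.*?)\\s*\\(([^()]*?)\\s+-\\s+([^()]*)\\)\\s*$"

# Return capture group i for every row
pick <- function(i) sub(pattern, paste0("\\", i), members$member, perl = TRUE)

parsed <- data.frame(
  first_name = pick(2),
  last_name  = pick(1),
  party      = pick(3),
  state      = pick(4),
  member_id  = members$bioguide,
  stringsAsFactors = FALSE
)
parsed
```

Because the ID column is taken straight from the table, the leading letter of each member ID is never touched.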
There are also many other useful sources for this data, some of which correlate it with other useful variables. This one is well-structured and updated regularly.

Give this a try. I have updated this to include separating out the different fields.
library('rvest')
library('dplyr')
library('tidyr')
candidate_url <- 'https://www.congress.gov/help/field-values/member-bioguide-ids'
candidate_page <- read_html(candidate_url)
candidate_nodes <- html_nodes(candidate_page, 'table')
df.candidates <- as.data.frame(html_table(candidate_nodes, header = TRUE, fill = TRUE), stringsAsFactors = FALSE)
df.candidates <- df.candidates[!is.na(df.candidates$Member),]
df.candidates <- df.candidates %>%
  # party and state are the text inside the last pair of parentheses on each row
  mutate(Party.State = sub("^.*\\(([^()]*)\\)\\s*$", "\\1", Member)) %>%
  separate(Party.State, into = c("Party","State"), sep = " - ") %>%
  # everything before the first "(" is the name
  mutate(Full.name = trimws(regmatches(Member, regexpr("^[^\\(]+", Member)))) %>%
  separate(Full.name, into = c("Last.Name","First.Name","Suffix"), sep = ",\\s*", fill = "right") %>%
  select(First.Name, Last.Name, Suffix, Party, State, Member.ID)

This is a bit hackish, but if you want to extract the variables using regex here are a few pointers.
candidate_list <- unlist(candidate_list)
ID <- regmatches(candidate_list,
gregexpr("[a-zA-Z]{1}[0-9]{6}", candidate_list))
party_state <- regmatches(candidate_list,
gregexpr("(?<=\\()[^)]+(?=\\))", candidate_list, perl=TRUE))
names_etc <- strsplit(candidate_list, "[a-zA-Z]{1}[0-9]{6}")
names <- sapply(names_etc, function(x) sub(" \\([^)]*\\)", "", x))
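To make those patterns concrete, here is a small sketch run on a made-up string. I am assuming that html_text() on the table returns one long string with names and IDs run together; the sample_text value below is a hypothetical stand-in for that, not real scraped output.

```r
# Hypothetical stand-in for the text returned by html_text()
sample_text <- paste0(
  "Abdnor, James (Republican - South Dakota)A000009",
  "Abercrombie, Neil (Democratic - Hawaii)A000014"
)

# One letter followed by six digits: keeps the leading letter of the ID
ids <- regmatches(sample_text,
                  gregexpr("[A-Za-z][0-9]{6}", sample_text))[[1]]

# Text between parentheses (lookarounds need perl = TRUE)
party_state <- regmatches(sample_text,
                          gregexpr("(?<=\\()[^)]+(?=\\))", sample_text,
                                   perl = TRUE))[[1]]

# Split "Party - State" into two columns
parts <- do.call(rbind, strsplit(party_state, " - ", fixed = TRUE))

extracted <- data.frame(id = ids,
                        party = parts[, 1],
                        state = parts[, 2],
                        stringsAsFactors = FALSE)
extracted
```

Matching the whole ID with regmatches, rather than deleting everything else with gsub, is what keeps the leading letter intact.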

Related

Rvest : Extracting clickable content

I am trying to extract the table in the link below
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
I want the whole table to be extracted and I am using the following code
html_page <- read_html(curl(curl))
tab <- html_page %>% html_table(., fill = TRUE)
I get the table in tab[[1]], however, if you notice that website it has a clickable section within the table that has additional data. That part is missing from the extracted table. Will appreciate any help on how the whole table can be extracted.

I'm not sure what you're getting. However, when I pulled from this website I see that there are multiple tabs but I pulled all of the data.
Here is the bottom of the table, when you show all.
Here are the results, when I query for the last line of this website data.
library(rvest)
library(tidyverse)
hx = "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
htp <- read_html(hx) %>% html_table(., fill = T)
tbOne = htp[[1]][, 1:10] # just the data
tbOne %>% filter(`State Name` == "Uttar Pradesh",
`District Name` == "Badaun",
`Market Name` == "Wazirganj")
# # A tibble: 1 × 10
# `State Name` `District Name` `Market Name` Variety Group `Arrivals (Tonnes)`
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Uttar Pradesh Badaun Wazirganj Dara Cereals 3.50
# # … with 4 more variables: `Min Price (Rs./Quintal)` <chr>,
# # `Max Price (Rs./Quintal)` <chr>, `Modal Price (Rs./Quintal)` <chr>,
# # `Reported Date` <chr>
Update
When I pressed the 2, nothing happened (and I did try repeatedly). However, I needed to be really patient and I wasn't. Sorry about that.
The URL has the query in it, so the URL can be used to get all of the data. You could do this by adding the states you're missing, or you could do this for every state. For example, page one ends on Uttar Pradesh, but we don't know if this is all of Uttar Pradesh. That might make more sense when you see what I did.
Using rvest, I collected all of the states' names from the form. Then I put these name-value pairs into a data frame.
# collect form values for State
ht <- read_html(hx) %>% html_form()
df1 <- as.data.frame(ht[[1]][["fields"]][["ctl00$ddlState"]][["options"]]) %>%
rownames_to_column("State")
names(df1)[2] <- "Abb"
To only look at the states that were not included in page one, you could just query the states after Uttar Pradesh like this.
which(df1$State == "Uttar Pradesh", arr.ind = T)
# [1] 35
# split the URL
urone = "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State="
urtwo = "&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=West+Bengal&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
# collect remaining states' data
df2 <- map(36:nrow(df1),
function(x){
# assemble URL
y = toString(df1$Abb[x])
urall = paste0(urone, y, urtwo)
# get table
tabs <- read_html(urall) %>% html_table(., fill = T)
tabs
})
length(df2)
# [1] 2
length(df2[[1]]) # state 36 is empty
length(df2[[2]]) # state 37 is not
# add the new data to the original data
df3 <- df2[[2]][[1]]
tbOne <- rbind(tbOne, df3) # one data frame of tabled data
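The URL-assembly step can also be wrapped in a small helper, which makes it easier to check the generated URLs before requesting anything. This is only a sketch: build_state_url is a hypothetical helper name, and the state codes in the example are invented placeholders rather than the real values from the form.

```r
base_url <- "https://agmarknet.gov.in/SearchCmmMkt.aspx"

# Hypothetical helper: one URL per state code, same query as the original link
build_state_url <- function(state_code, date = "2022-01-28") {
  paste0(base_url,
         "?Tx_Commodity=1&Tx_State=", state_code,
         "&Tx_District=0&Tx_Market=0",
         "&DateFrom=", date, "&DateTo=", date,
         "&Fr_Date=", date, "&To_Date=", date,
         "&Tx_Trend=2&Tx_CommodityHead=Wheat",
         "&Tx_StateHead=--Select--",
         "&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--")
}

# Placeholder codes, NOT the real form values
state_codes <- c("UP", "UC", "WB")
urls <- build_state_url(state_codes)
urls
```

Since paste0 is vectorised, passing the whole Abb column returns every URL in one call.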
If you wanted to make sure that you had all the data for each state, you could expand this. Although, using map for that much data may be slow. So I used the function mclapply from the package parallel. In this code, I used 15 cores. You may need to change this depending on your computer's processor. Using 15 made this take less than a second.
library(parallel) # provides mclapply; mc.cores > 1 is not supported on Windows
# skip row 1, that's "select" or all
df4 <- mclapply(2:nrow(df1), mc.cores = getOption("mc.cores", 15L),
function(x){
# assemble URL
y = toString(df1$Abb[x])
urall = paste0(urone, y, urtwo)
# get table
tabs <- read_html(urall) %>% html_table(., fill = T)
tabs
})
length(df4)
# [1] 36
# create storage using first state with data
df5 <- df4[[7]][[1]]
map(8:36,
function(x){
y = length(df4[[x]])
if(y > 0){
df5 <<- rbind(df5, df4[[x]][[1]])
}
})
Now you have a data frame, df5 that started as each state queried separately.
I didn't look at how the data was different. However, my tbOne data frame has 577 observations. My df5 data frame has 584.
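If you would rather not grow df5 with <<-, the same combining step can be written by dropping the empty results and binding the rest in one call. The sketch below uses a small invented list in place of df4 (each element is either an empty list or a list holding one data frame), so the column names and values are placeholders.

```r
# Invented stand-in for df4: one element per state
results <- list(
  list(),                                                    # state with no data
  list(data.frame(State = "A", Price = 1, stringsAsFactors = FALSE)),
  list(),                                                    # another empty state
  list(data.frame(State = "B", Price = 2, stringsAsFactors = FALSE))
)

# Keep only the states that returned a table
non_empty <- Filter(function(x) length(x) > 0, results)

# Take the first table from each and stack them
combined <- do.call(rbind, lapply(non_empty, function(x) x[[1]]))
combined
```

This also removes the need to know in advance which state is the first one with data.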

Is there a function similar to read_html() that can be used on data table or data frame types in R?

I'm attempting to webscrape from footballdb.com to get data related to NFL player injuries for a model I am creating from links such as this: https://www.footballdb.com/transactions/injuries.html?yr=2016&wk=1&type=reg which will then be output in a data table. Along with data related to individual player injury information (i.e. their name, injury, and status throughout the week leading up to the game), I also want to include the season and week of the injury in question for each player. I started by using nested for loops to generate the url for each webpage in question, along with the season and week corresponding to each webpage, which were stored in a data table with columns: link, season, and week.
I then tried to use the functions map_df(), read_html(), and html_nodes() to extract the information I wanted from each webpage, but I run into errors because read_html() does not work on objects of the data table or data frame class. I then tried different types of indexing and the $ operator, with no luck either. Is there any way I can modify the code I have produced thus far to extract the information I want from a data table? Below is what I have written thus far:
library(purrr)
library(rvest)
library(data.table)
#Remove file if file already exists
if (file.exists("./project/volume/data/interim/injuryreports.csv")) {
file.remove("./project/volume/data/interim/injuryreports.csv")}
#Declare variables and empty data tables
path1<-("https://www.footballdb.com/transactions/injuries.html?yr=")
seasons<-c("2016", "2017", "2020")
weeks<-1:17
result<-data.table()
temp<-NULL
#Use nested for loops to get the url, season, and week for each webpage of interest, store in result data table
for(s in 1:length(seasons)){
for(w in 1:length(weeks)){
temp$link<- paste0(path1, seasons[s],"&wk=", as.character(w), "&type=reg")
temp$season<-as.numeric(seasons[s])
temp$week<-weeks[w]
result<-rbind(result,temp)
}
}
#Get rid of any potential empty values from result
result<-compact(result)
###Errors Below####
DT <- map_df(result, function(x){
page <- read_html(x[[1]])
data.table(
Season = x[[2]],
Week = x[[3]],
Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text(),
Wed = page %>% html_nodes('.divtable .td:nth-child(3)') %>% html_text(),
Thu = page %>% html_nodes('.divtable .td:nth-child(4)') %>% html_text(),
Fri = page %>% html_nodes('.divtable .td:nth-child(5)') %>% html_text(),
GameStatus = page %>% html_nodes('.divtable .td:nth-child(6)') %>% html_text()
)
}
)
#####End of Errors###
#Write out injury data table
fwrite(DT,"./project/volume/data/interim/injuryreports.csv")

The issue is that your input, result, is a data.table. When you pass it to map_df, it loops over the columns(!!) of the data.table, not the rows.
One approach to make your code work is to split result by link and loop over the resulting list.
Note: For the reprex I only loop over the first two elements of the list. Additionally I have put your function outside of the map statement which made debugging easier.
library(purrr)
library(rvest)
library(data.table)
#Declare variables and empty data tables
path1<-("https://www.footballdb.com/transactions/injuries.html?yr=")
seasons<-c("2016", "2017", "2020")
weeks<-1:17
result<-data.table()
temp<-NULL
#Use nested for loops to get the url, season, and week for each webpage of interest, store in result data table
for(s in 1:length(seasons)){
for(w in 1:length(weeks)){
temp$link<- paste0(path1, seasons[s],"&wk=", as.character(w), "&type=reg")
temp$season<-as.numeric(seasons[s])
temp$week<-weeks[w]
result<-rbind(result,temp)
}
}
#Get rid of any potential empty values from result
result<-compact(result)
result <- split(result, result$link)
get_table <- function(x) {
page <- read_html(x[[1]])
data.table(
Season = x[[2]],
Week = x[[3]],
Player = page %>% html_nodes('.divtable .td:nth-child(1) b') %>% html_text(),
Injury = page %>% html_nodes('.divtable .td:nth-child(2)') %>% html_text(),
Wed = page %>% html_nodes('.divtable .td:nth-child(3)') %>% html_text(),
Thu = page %>% html_nodes('.divtable .td:nth-child(4)') %>% html_text(),
Fri = page %>% html_nodes('.divtable .td:nth-child(5)') %>% html_text(),
GameStatus = page %>% html_nodes('.divtable .td:nth-child(6)') %>% html_text()
)
}
DT <- map_df(result[1:2], get_table)
DT
#> Season Week Player Injury Wed Thu Fri
#> 1: 2016 1 Justin Bethel Foot Limited Limited Limited
#> 2: 2016 1 Lamar Louis Knee DNP Limited Limited
#> 3: 2016 1 Kareem Martin Knee DNP DNP DNP
#> 4: 2016 1 Alex Okafor Biceps Full Full Full
#> 5: 2016 1 Frostee Rucker Neck Limited Limited Full
#> ---
#> 437: 2016 10 Will Blackmon Thumb Limited Limited Limited
#> 438: 2016 10 Duke Ihenacho Concussion Full Full Full
#> 439: 2016 10 DeSean Jackson Shoulder DNP DNP DNP
#> 440: 2016 10 Morgan Moses Ankle Limited Limited Limited
#> 441: 2016 10 Brandon Scherff Shoulder Full Full Full
#> GameStatus
#> 1: (09/09) Questionable vs NE
#> 2: (09/09) Questionable vs NE
#> 3: (09/09) Out vs NE
#> 4: --
#> 5: --
#> ---
#> 437: (11/11) Questionable vs Min
#> 438: (11/11) Questionable vs Min
#> 439: (11/11) Doubtful vs Min
#> 440: (11/11) Questionable vs Min
#> 441: --
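To see the column-versus-row behaviour without touching the network, here is a minimal base R sketch. It uses a plain data.frame with placeholder links ("u1", "u2") and lapply in place of map_df; the iteration order is the same idea, because a data frame is a list of columns.

```r
# Small stand-in for the result table; links are placeholders
result <- data.frame(link = c("u1", "u2"),
                     season = c(2016, 2016),
                     week = 1:2,
                     stringsAsFactors = FALSE)

# Iterating over a data frame visits each COLUMN once
by_column <- lapply(result, function(x) length(x))
names(by_column)

# Splitting by row gives a list with one single-row data frame per element
rows <- split(result, seq_len(nrow(result)))
by_row <- lapply(rows, function(x) paste(x$season, x$week, x$link))
by_row
```

In the original call, x[[1]] was therefore the first element of a whole column rather than the link of one row.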

.TXT in long form to data.frame in wide form in R

I am currently working with clinical assessment data that is scored and output by a software package in a .txt file. My goal is to extract the data from the txt file into a long format data frame with columns for: Participant # (which is included in the file name), subtest, Score, and T-score.
An example data file is available here:
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
I am running into a couple of roadblocks and could use some input on how to navigate them.
1) I only need the information that corresponds to each subtest; these all have a number before the subtest name. The rows that contain only one or two words (e.g. cognitive screen) are not necessary, and they seem to interfere with creating new data frames because the columns provided do not match the columns wanted.
Some additional quirks of the data:
1) the asterisks are NOT necessary
2) the cognitive TOTAL will never have a value
I am using the readtext package to import the data at the moment, and I am able to get a data frame with two columns. One is the file name (which includes the participant name), so that problem is solved. However, the next column is a giant character string containing the data points for both Score and T-Score. Presumably I would then need to split these into the columns of interest listed above.
Next problem, when I view the data the T scores are in the correct order, however the "score" data no longer matches the true values.
Here is what I have tried:
# install.packages("readtext")
library(readtext)
library(tidyr)
pathTofile <- path.expand("/Users/Brahma/Desktop/CAT TEXT FILES/")
data <- readtext(paste0(pathTofile, "CAToutput.txt"),
                 #docvarsfrom = "filenames",
                 dvsep = " ")
From here I do not know how to split the data, in my head I would do something like this
data2 <- separate(data, text, sep = " ", into = c("subtest", "score", "t_score"))
This of course, gives the correct column names but removes almost all the data I actually am interested in.
Any help would be appreciated whether a solution or a direction you might suggest I look for more answers.
Sincerely,
Alex

Here is a way of converting that text file to a dataframe that you can do analysis on
library(tidyverse)
input <- read_lines('c:/temp/scores.txt')
# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
mutate(row = row_number()) %>%
fill(title) # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
mutate(row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]), -1]
# merge the header with the scores to give each section
table <- left_join(scores,
header,
by = 'row'
)
colnames(table) <- c('index', 'type', 'Score', 'T-Score', 'row', 'title')
head(table, 10)
# A tibble: 10 x 6
index type Score `T-Score` row title
<chr> <chr> <chr> <chr> <int> <chr>
1 "1. " Line Bisection 9 53 3 Subtest/Section
2 "2. " Semantic Memory 8 51 4 Subtest/Section
3 "3. " Word Fluency 1 56* 5 Subtest/Section
4 "4. " Recognition Memory 40 59 6 Subtest/Section
5 "5. " Gesture Object Use 2 68 7 Subtest/Section
6 "6. " Arithmetic 5 49 8 Subtest/Section
7 "7. " Spoken Words 17 45* 14 Spoken Language
8 "9. " Spoken Sentences 25 53* 15 Spoken Language
9 "11. " Spoken Paragraphs 4 60 16 Spoken Language
10 "8. " Written Words 14 45* 20 Written Language
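For reference, the core of that approach (keep only the numbered rows and capture their fields) can be sketched in base R as well. The input lines below are invented to resemble the output shown above; the real file's spacing may differ, so treat the pattern as an assumption to adjust.

```r
# Invented lines resembling the scored output; spacing is an assumption
lines <- c("Subtest/Section      Score   T-Score",
           "1. Line Bisection    9       53",
           "3. Word Fluency      1       56*",
           "Cognitive Screen",
           "Cognitive TOTAL")

# index, subtest name, score, T-score (a trailing asterisk is dropped)
pattern <- "^([0-9]+)[. ]+(.*?)\\s+([0-9]+)\\s+([0-9]+)\\*?$"

m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
m <- m[lengths(m) > 0]          # drop lines that are not numbered subtests
mat <- do.call(rbind, m)        # column 1 is the full match

scores <- data.frame(index   = as.integer(mat[, 2]),
                     subtest = mat[, 3],
                     score   = as.integer(mat[, 4]),
                     t_score = as.integer(mat[, 5]),
                     stringsAsFactors = FALSE)
scores
```

Header rows and the valueless TOTAL row simply fail to match, so they never reach the data frame.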

What is the source for the code at the link provided?
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
This data is odd. I was able to successfully match patterns and manipulate most of the data, but two rows refused to oblige. Rows 17 and 20 refused to be matched. In addition, the data type / data structure are very unfamiliar.
This is what was accomplished before hitting a wall.
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, Extract = str_extract(df$V2, "[1-9]+\\s[1-9]+\\*+\\s?"))
df2 <- df1 %>% mutate(V2, Extract2 = str_extract(df1$V2, "[0-9]+.[0-9]+$"))
head(df2)
When the data was further explored, the second column, V2, included data types that are completely unfamiliar. These included: Arithmetic, Complex Words, Digit Strings, and Function Words.
If anything, it would be good to know something about those unfamiliar data types.

Took another look at this problem and found where it had gotten off track. Ignore my previous post. This solution works in Jupyter Lab using the data that was provided.
library(stringr)
library(dplyr)
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, "Score" = str_extract(df$V2, "\\d+") )
df2 <- df1 %>% mutate(V2, "T Score" = str_extract(df$V2, "\\d\\d\\*?$"))
df3 <- df2 %>% mutate(V2, "Subtest/Section" = str_remove_all(df2$V2, "\\\t+[0-9]+"))
df4 <- df3 %>% mutate(V1, "Sub-S" = str_extract(df3$V1, "\\s\\d\\d\\s*"))
df5 <- df4 %>% mutate(V1, "Sub-T" = str_extract(df4$V1,"\\d\\d\\*"))
df6 <- replace(df5, is.na(df5), "")
df7 <- df6 %>% mutate(V1, "Description" = str_remove_all(V1, "\\d\\d\\s\\d\\d\\**$")) # remove digits, new variable
df7$V1 <- NULL # remove variable
df7$V2 <- NULL # remove variable
df8 <- df7[, c(6,3,1,4,2,5)] # re-align variables
head(df8,15)

R scrape html table and extract background color

I am trying to scrape some data off a wikipedia table from this page:
https://en.wikipedia.org/wiki/Results_of_the_Indian_general_election,_2014 and I am interested in the table:
Summary of the 2014 Indian general election
I would also like to extract the party colors from the table.
Here's what I've tried so far:
library("rvest")
url <-
"https://en.wikipedia.org/wiki/Results_of_the_Indian_general_election,_2014"
electionstats <- read_html(url)
results <- html_nodes(electionstats, xpath='//*[@id="mw-content-text"]/div/table[79]') %>% html_table(fill = T)
party_colors <- electionstats %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>%
html_table(fill = T)
Printing out party_colors does not show any info about the colors
So, I tried:
party_colors <- electionstats %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>%
html_nodes('tr')
Now if I print out party_colors, I get:
[1] <tr style="background-color:#E9E9E9">\n<th style="text-align:left;vertical-align:bottom;" rowspan="2"></th>\n<th style="text-align:left; ...
[2] <tr style="background-color:#E9E9E9">\n<th style="text-align:center;">No.</th>\n<th style="text-align:center;">+/-</th>\n<th style="text ...
[3] <tr>\n<td style="background-color:#FF9933"></td>\n<td style="text-align:left;"><a href="/wiki/Bharatiya_Janata_Party" title="Bharatiya J ...
[4] <tr>\n<td style="background-color:#00BFFF"></td>\n<td style="text-align:left;"><a href="/wiki/Indian_National_Congress" title="Indian Na ...
[5] <tr>\n<td style="background-color:#009900"></td>\n<td style="text-align:left;"><a href="/wiki/All_India_Anna_Dravida_Munnetra_Kazhagam" ...
and so on...
But, now, I have no idea how to pull out the colors from this data. I cannot convert the output of the above to a html_table with:
html_table(fill = T)
I get the error:
Error: html_name(x) == "table" is not TRUE
I also tried various options with html_attrs, but I have no idea what the correct attribute I should be passing is.
I even tried SelectorGadget to try and figure out the attribute, but if I select the first column of the table in question, SelectorGadget shows just "td".

I would get the table like you did and then add the color attribute as a column. The wikitable sortable class works on many pages, so get the first one and remove the second header in row 1.
electionstats <- read_html(url)
x <- html_nodes(electionstats, xpath='//table[@class="wikitable sortable"]')[[1]] %>%
html_table(fill=TRUE)
# paste names from 2nd row header and then remove
names(x)[6:14] <- paste(names(x)[6:14], x[1,6:14])
x <- x[-1,]
The colors are in the first tr/td tags and you can add it to empty column 1 or 3 (see str(x))
names(x)[3] <- "Color"
x$Color <- html_nodes(electionstats, xpath='//table[@class="wikitable sortable"][1]/tr/td[1]') %>%
html_attr("style") %>% gsub("background-color:", "", .)
## drop table footer, extra columns
x <- x[1:83, 2:14]
head(x)
Party Color Alliance Abbr. Candidates No. Candidates +/- Candidates %
2 Bharatiya Janata Party #FF9933 NDA BJP 428 -5 78.82%
3 Indian National Congress #00BFFF UPA INC 464 24 85.45%
4 All India Anna Dravida Munnetra Kazhagam #009900 ADMK 40 17 7.37%
5 All India Trinamool Congress #00FF00 AITC 131 96 24.13%
6 Biju Janata Dal #006400 BJD 21 3 3.87%
7 Shiv Sena #E3882D NDA SHS 24 11 10.68%
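The gsub call works here because each style attribute holds only the background colour. If a cell carried other declarations as well, a slightly stricter pattern could pull out just the hex code. The sketch below is an illustration on hand-written style strings, and extract_bg is a hypothetical helper name.

```r
# Hand-written style attributes; the last two are edge cases
styles <- c("background-color:#FF9933",
            "text-align:left;",
            "background-color: #00BFFF; width:4px",
            NA)

# Hypothetical helper: return the hex colour, or NA when none is present
extract_bg <- function(style) {
  pattern <- "^.*background-color:\\s*(#[0-9A-Fa-f]{3,8}).*$"
  has_colour <- !is.na(style) & grepl(pattern, style)
  ifelse(has_colour, sub(pattern, "\\1", style), NA_character_)
}

extract_bg(styles)
```

The result has the same length as the input, so it can be assigned directly to a column.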

Looks like your xml_nodeset contains both tr and td nodes.
Deal with both trs and tds, converting to data frames:
party_colors_tr <- electionstats %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% html_nodes('tr')
trs <- bind_rows(lapply(xml_attrs(party_colors_tr), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))
party_colors_td <- electionstats %>% html_nodes(xpath='//*[@id="mw-content-text"]/div/table[3]') %>% html_nodes('tr') %>% html_nodes('td')
tds <- bind_rows(lapply(xml_attrs(party_colors_td), function(x) data.frame(as.list(x), stringsAsFactors=FALSE)))
Write function for extracting styles from data frames:
library(stringi)
list_styles <- function(nodes_frame) {
get_cols <- function(x) { stri_detect_fixed(x, 'background-color') }
has_style <- which(lapply(nodes_frame$style, get_cols) == TRUE)
res <- strsplit(nodes_frame[has_style,]$style, ':')
return(res)
}
Create data frame of extracted styles:
l_trs <- list_styles(trs)
df_trs <- data.frame(do.call('rbind', l_trs)[,1], do.call('rbind', l_trs)[,2])
names(df_trs) <- c('style', 'color')
l_tds <- list_styles(tds)
df_tds <- data.frame(do.call('rbind', l_tds)[,1], do.call('rbind', l_tds)[,2])
names(df_tds) <- c('style', 'color')
Combine trs and tds frames:
final_style_frame <- do.call('rbind', list(df_trs, df_tds))
Here are the first 20 rows:
final_style_frame[1:20,]

R dplyr, using mutate with na.omit causes error incompatible size (%d)

I'm doing data cleaning. I use mutate in dplyr a lot since it generates new columns step by step and I can easily see how it goes.
Here are two examples where I have this error
Error: incompatible size (%d), expecting %d (the group size) or 1
Example 1: Get town name from zipcode. Data is simply like this:
Zip
1 02345
2 02201
And I notice when the data has NA in it, it doesn't work.
Without NA it works:
library(dplyr)
library(zipcode)
data(zipcode)
test = data.frame(Zip=c('02345','02201'),stringsAsFactors=FALSE)
test %>%
rowwise() %>%
mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )
resulting in
Source: local data frame [2 x 2]
Groups: <by row>
Zip Town1
1 02345 Manomet
2 02201 Boston
With NA it doesn't work:
library(dplyr)
library(zipcode)
data(zipcode)
test = data.frame(Zip=c('02345','02201',NA),stringsAsFactors=FALSE)
test %>%
rowwise() %>%
mutate( Town1 = zipcode[zipcode$zip==na.omit(Zip),'city'] )
resulting in
Error: incompatible size (%d), expecting %d (the group size) or 1
Example 2: I wanna get rid of the redundant state name that occurs in the Town column in the following data.
Town State
1 BOSTON MA MA
2 NORTH AMAMS MA
3 CHICAGO IL IL
This is how I do it:
(1) split the string in Town into words, e.g. 'BOSTON' and 'MA' for row 1.
(2) see if any of these words match the State of that line
(3) delete the matched words
library(dplyr)
test = data.frame(Town=c('BOSTON MA','NORTH AMAMS','CHICAGO IL'), State=c('MA','MA','IL'), stringsAsFactors=FALSE)
test %>%
mutate(Town.word = strsplit(Town, split=' ')) %>%
rowwise() %>% # rowwise ensures every calculation only considers the current row
mutate(is.state = match(State,Town.word ) ) %>%
mutate(Town1 = Town.word[-is.state])
This results in:
Town State Town.word is.state Town1
1 BOSTON MA MA <chr[2]> 2 BOSTON
2 NORTH AMAMS MA <chr[2]> NA NA
3 CHICAGO IL IL <chr[2]> 2 CHICAGO
Meaning: e.g., row 1 shows is.state==2, meaning the 2nd word in Town is the state name. After getting rid of that word, Town1 is the correct town name.
Now I wanna fix the NA in row 2, but adding na.omit causes an error:
test %>%
mutate(Town.word = strsplit(Town, split=' ')) %>%
rowwise() %>% # rowwise ensures every calculation only considers the current row
mutate(is.state = match(State,Town.word ) ) %>%
mutate(Town1 = Town.word[-na.omit(is.state)])
results in:
Error: incompatible size (%d), expecting %d (the group size) or 1
I checked the data type and size:
test %>%
mutate(Town.word = strsplit(Town, split=' ')) %>%
rowwise() %>% # rowwise ensures every calculation only considers the current row
mutate(is.state = match(State,Town.word ) ) %>%
mutate(length(is.state) ) %>%
mutate(class(na.omit(is.state)))
results in:
Town State Town.word is.state length(is.state) class(na.omit(is.state))
1 BOSTON MA MA <chr[2]> 2 1 integer
2 NORTH AMAMS MA <chr[2]> NA 1 integer
3 CHICAGO IL IL <chr[2]> 2 1 integer
So it is %d of length==1. Can somebody tell me what's wrong? Thanks

Can you just sub it out?
test %>%
rowwise() %>%
mutate(Town=sub(sprintf('[, ]*%s$', State), '', Town))
## Source: local data frame [3 x 2]
## Groups: <by row>
##
## Town State
## 1 BOSTON MA
## 2 NORTH AMAMS MA
## 3 CHICAGO IL
(This way also catches commas after the town, if that happens.)
NB: if you use ungroup() here with a rowwise_df (as this is), it will wipe the tbl_df class as well and output a straight data.frame, which is fine for your data but will clobber your screen if you aren't careful and are looking at large amounts of data (as I've done countless times). (Github references #936 and #553.)
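As for why na.omit triggers the size error: na.omit() on a single NA returns a zero-length vector, so the expression inside mutate produces zero values where one per row is expected. The base R sketch below shows this, plus a non-rowwise version of the sub() idea using mapply; the variable names are only for illustration.

```r
is_state <- NA_integer_              # what match() returns for row 2
length(na.omit(is_state))            # zero-length after dropping the NA

words <- c("NORTH", "AMAMS")
dropped <- words[-na.omit(is_state)] # empty index, so nothing is returned
length(dropped)

# Same cleaning as the sub() answer, without rowwise()
test <- data.frame(Town  = c("BOSTON MA", "NORTH AMAMS", "CHICAGO IL"),
                   State = c("MA", "MA", "IL"),
                   stringsAsFactors = FALSE)

cleaned <- mapply(function(town, state) sub(sprintf("[, ]*%s$", state), "", town),
                  test$Town, test$State, USE.NAMES = FALSE)
cleaned
```

Rows whose town does not end in the state code are left unchanged, so there is no NA to handle in the first place.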
