I am scraping data off a website using rvest for the first time.
It gave me a character vector that I am trying to split and convert to a data frame with columns.
How do you turn this vector:
char.vector <- c("John DoeTeacherSpeaksEnglishJapaneseRateUSD 10Video Intro","JaneTutorSpeaksJapaneseFrenchRateUSD 15Video Intro")
...into this data frame with columns:
Name      Role     English  Japanese  French  Rate_USD
John Doe  Teacher  1        1         0       10
Jane      Tutor    0        1         1       15
Splitting on spaces or character position is problematic. Is there maybe a way to create a vector of all the words I want to split at and use it as the split argument?
split.vector <- c("Teacher", "Tutor", "Speaks", "English", "Japanese", "French", "Rate", "Video")
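One way to use known words as split points is to build a single regular expression from them and pull out the pieces with capture groups. The following is only a sketch in base R: it assumes every entry follows the pattern name, role, "Speaks", languages, "RateUSD", number, and the vectors roles and languages are illustrative names, not part of the original post. An entry that does not match the pattern would be silently dropped by rbind, so check the row count on real data.

```r
char.vector <- c(
  "John DoeTeacherSpeaksEnglishJapaneseRateUSD 10Video Intro",
  "JaneTutorSpeaksJapaneseFrenchRateUSD 15Video Intro"
)
roles <- c("Teacher", "Tutor")                    # assumed list of possible roles
languages <- c("English", "Japanese", "French")  # assumed list of possible languages

# Groups: 1 = name, 2 = role, 3 = concatenated languages, 4 = rate
pattern <- paste0("^(.*?)(", paste(roles, collapse = "|"),
                  ")Speaks(.*?)RateUSD (\\d+)")
matches <- regmatches(char.vector, regexec(pattern, char.vector, perl = TRUE))
parts <- do.call(rbind, matches)  # column 1 is the full match

result <- data.frame(Name = parts[, 2], Role = parts[, 3],
                     stringsAsFactors = FALSE)
# One 0/1 column per language, set to 1 if it appears in the language chunk
for (lang in languages) {
  result[[lang]] <- as.integer(grepl(lang, parts[, 4], fixed = TRUE))
}
result$Rate_USD <- as.integer(parts[, 5])
```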
My code and url:
EN.char <- read_html("https://www.italki.com/teachers/english") %>%
  html_nodes(".teacher-card") %>%
  html_text()
EN.char
Because each entry can list a different number of languages, I shamelessly borrowed the approach used by @akrun in their answer here: read.dcf maps out all the languages present and puts NA where a language is not present for a given entry. After reading this article I saw that:
The DCF rules as implemented in R are:
1. A database consists of one or more records, each with one or more named fields. Not every record must contain each field, a field may appear only once in a record.
2. Regular lines start with a non-whitespace character.
3. Regular lines are of form tag:value, i.e., have a name tag and a value for the field, separated by : (only the first : counts). The value can be empty (=whitespace only).
4. Lines starting with whitespace are continuation lines (to the preceding field) if at least one character in the line is non-whitespace.
5. Records are separated by one or more empty (=whitespace only) lines.
To resolve a "malformed line" error, I had to satisfy rule 3 by appending ":1" to each language so that every line is a tag:value pair.
Following @akrun's example, I wrapped read.dcf in an as.data.frame call. I kept the if statement, which is unnecessary in this case, as a safeguard against entries with no languages, although I doubt languages would be missing given the nature of the site.
I then used replace_na() to replace NA with "0" and cast all the language columns to integer.
I then built a second data frame with the other required details, each of which has exactly one value per record (row). Because the rows of both data frames match and are in the same order, I used cbind() to join them column-wise.
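The read.dcf step can be tried offline on made-up input. This sketch is not the scraping code itself: it puts two hypothetical records into one character vector, separated by an empty line as rule 5 requires, and shows that a field missing from a record comes back as NA before being replaced with "0".

```r
# Two hypothetical records; each language line is already a tag:value pair
lines <- c("English:1", "Japanese:1",
           "",                          # empty line separates the records
           "Japanese:1", "French:1")

con <- textConnection(lines)
m <- read.dcf(con)   # character matrix: one row per record, one column per field
close(con)

m[is.na(m)] <- "0"   # a language absent from a record is NA at this point
language_df <- as.data.frame(m, stringsAsFactors = FALSE)
language_df[] <- lapply(language_df, as.integer)
```

The answer's code reads one record per call and row-binds the results instead, which reaches the same shape as long as the binding step fills missing columns with NA.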
N.B. Given my location, the USD values are actually showing as GBP.
R:
library(purrr)
library(dplyr)
library(stringr)
library(rvest)

page <- read_html("https://www.italki.com/teachers/english")
entries <- page %>% html_elements(".teacher-card")

# One row per teacher card, one column per language (NA where not listed)
language_df <- map_dfr(entries, ~ {
  new <- .x %>%
    html_elements("p + div > div") %>%
    html_text() %>%
    paste0(., ":1")  # append ":1" so each line is a tag:value pair
  if (length(new) > 0) {
    as.data.frame(read.dcf(textConnection(new)))
  } else {
    NULL
  }
}) %>%
  # the columns are character, so NA is replaced with "0" before the cast
  mutate(across(.cols = everything(), ~ as.integer(tidyr::replace_na(.x, "0"))))

# Name, role and rate: exactly one value per card
details_df <- map_dfr(entries, ~
  data.frame(
    name = .x %>% html_element("div > .overflow-hidden") %>% html_text(),
    role = .x %>% html_element("span + div > .text-tiny") %>% html_text(),
    rate_usd = .x %>% html_element(".flex-1:nth-child(1) > div > span") %>% html_text()
  ))

results <- cbind(details_df, language_df)
I'm new to R and have a large list of 30 elements, each of which is a data frame with a few hundred rows and around 20 columns (this varies by data frame). Each data frame is named after the original .csv filename (for example "experiment data XYZ QWERTY 01"). How can I go through the whole list, filter only those data frames that don't have specific text in their filename, and also add a unique id column to the filtered data frames (the id value would be the first three characters of the filename)? For example, all the data frames in the list that include "XYZ QWERTY" in their name should not be filtered and don't need a unique id. I had this pseudocode:
for(i in 1:length(list_of_dataframes)){
  if
    list_of_dataframes[[i]] contains "this text" then don't filter
  else
    list_of_dataframes[[i]] <- filter(list_of_dataframes[[i]], rule) AND add unique.id.of.first.three.char.of.list_of_dataframes[[i]]
}
Sorry if the terminology used here is a bit awkward; I'm just starting out with programming and this is my first time posting here, so there's still a lot to learn. As a bonus, if you have any good resources or websites for learning to automate similar tasks with R, I would be glad to get some recommendations! :-)
EDIT:
The code I tried for the filtering part was:
for(i in 1:length(tbl)){
  if (!(str_detect (tbl[[i]], "OLD"))){
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}
However there was an error message stating "argument is not an atomic vector; coercing" and "the condition has length > 1 and only the first element will be used". Is there any way to get this code working?
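Both messages come from passing the data frame itself, tbl[[i]], to str_detect: the data frame is coerced and checked column by column, so the condition has one value per column. Testing the element's name instead gives a single TRUE or FALSE. The sketch below uses two made-up data frames and assumes the list is named after the files; the id column holds the first three characters of each name.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the real list of data frames
tbl <- list(
  "experiment 1"    = data.frame(age = c(25, 60, 41)),
  "OLDexperiment 1" = data.frame(age = c(30, 70))
)

for (nm in names(tbl)) {
  # Test the name (a single string), not the data frame
  if (!str_detect(nm, "OLD")) {
    tbl[[nm]] <- tbl[[nm]] %>%
      filter(age < 50) %>%
      mutate(id = str_sub(nm, 1, 3))  # first three characters of the name
  }
}
```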
Let there be a directory called files containing these csv files:
'experiment 1.csv' 'experiment 2.csv' 'experiment 3.csv'
'OLDexperiment 1.csv' 'OLDexperiment 2.csv'
This will give you a list of data frames with a filter condition (here: do not contain the substring OLD in the filename). Just remove the ! to only include old experiments instead. A new column id is added containing the file path:
library(tidyverse)
list.files("files")
paths <- list.files("files", full.names = TRUE)
names(paths) <- list.files("files", full.names = TRUE)
list_of_dataframes <- paths %>% map(read_csv)
list_of_dataframes %>%
  enframe() %>%
  filter(! name %>% str_detect("OLD")) %>%
  mutate(value = name %>% map2(value, ~ {
    .y %>% mutate(id = .x)
  })) %>%
  pull(value)
A good resource to start is the free book R for Data Science
This is a much simpler approach without a list to get one big combined table of files matching the same condition:
list.files("files", full.names = TRUE) %>%
  tibble(id = .) %>%
  # discard old experiments
  filter(! id %>% str_detect("OLD")) %>%
  # read the csv table for every matching file
  mutate(data = id %>% map(read_csv)) %>%
  # combine the tables into one big one
  unnest(data)
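The question asked for an id made of the first three characters of the filename rather than the full path. A base R sketch of that variant follows; it writes two small made-up csv files to a temporary directory so it can run on its own, and the directory name files_demo is purely illustrative.

```r
# Create two small hypothetical csv files in a temporary directory
dir <- file.path(tempdir(), "files_demo")
dir.create(dir, showWarnings = FALSE)
write.csv(data.frame(age = c(25, 60)),
          file.path(dir, "experiment 1.csv"), row.names = FALSE)
write.csv(data.frame(age = c(30, 70)),
          file.path(dir, "OLDexperiment 1.csv"), row.names = FALSE)

paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
# Test only the filename, so "OLD" elsewhere in the path is ignored
keep <- paths[!grepl("OLD", basename(paths), fixed = TRUE)]

combined <- do.call(rbind, lapply(keep, function(p) {
  d <- read.csv(p)
  d$id <- substr(basename(p), 1, 3)  # first three characters of the filename
  d
}))
```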
I've read in a table df which has numbers and strings.
I have keywords stored in a vector arr_words. For every row in the table, if the row contains any word from the vector (ignoring case), I want to keep that row.
For instance, if one of the cells has "i like magIcalot" and one of my keywords is "magic", I want to keep all the attributes from that row.
I've been trying this, but I'm pretty sure it's wrong since it returns zero rows:
df %>%
  rowwise() %>%
  filter(any(names(df) %in% arr_words))
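If you want to search in a specific field, say field1, collapse the keywords into a single regular expression (grepl uses only the first element of a pattern vector) and ignore case:
library(dplyr)
pattern <- paste(arr_words, collapse = "|")
df %>%
  filter(grepl(pattern, field1, ignore.case = TRUE))
If you want to search across all fields, then:
library(stringr)
library(dplyr)
pattern <- paste(arr_words, collapse = "|")
df %>%
  filter_all(any_vars(str_detect(., regex(pattern, ignore_case = TRUE))))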
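For a version that can be run without the original table, here is a base R sketch on made-up data. The data frame and keywords are hypothetical; the keywords are joined with "|" into one alternation, and each column is converted to character before matching so that numeric columns are searched too. Keywords containing regex metacharacters would need escaping first.

```r
# Hypothetical table mixing numbers and strings
df <- data.frame(
  id   = 1:3,
  note = c("i like magIcalot", "nothing here", "Dragons!"),
  stringsAsFactors = FALSE
)
arr_words <- c("magic", "dragon")

# One alternation pattern: matches if any keyword occurs
pattern <- paste(arr_words, collapse = "|")

# Per-column logical vectors, combined with OR across columns
keep <- Reduce(`|`, lapply(df, function(col) {
  grepl(pattern, as.character(col), ignore.case = TRUE)
}))
kept <- df[keep, , drop = FALSE]
```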
What I want to do is create a column in the gamelog data frame that holds the goalie's name from the Goalies vector.
Is there any way to do this?
I heard of deparse(substitute()), but it doesn't work when I use it in my for loop.
library(XML)
Howard<-'http://naturalstattrick.com/playerreport.php?fromseason=20182019&thruseason=20182019&stype=2&sit=5v5&stdoi=oi&rate=n&v=g&playerid=8470657'
Lehner<-'http://naturalstattrick.com/playerreport.php?fromseason=20182019&thruseason=20182019&stype=2&sit=5v5&stdoi=oi&rate=n&v=g&playerid=8475215'
Binnington<-'http://naturalstattrick.com/playerreport.php?fromseason=20182019&thruseason=20182019&stype=2&sit=5v5&stdoi=oi&rate=n&v=g&playerid=8476412'
Goalies<-c(Howard, Lehner, Binnington)
gamelog<-data.frame()
for(goalie in Goalies){
  goaliehtml<-readHTMLTable(goalie)
  goaliedata<-goaliehtml[['gamelog']]
  goaliedata$player<-deparse(substitute(goalie))
  gamelog<-rbind(gamelog, goaliedata)
}
I want goaliedata$player to be equal to the goalie that is being run through the for loop.
I would approach this differently. First, I'd store the player names and IDs in a list or data frame. For example:
player_id <- data.frame(player = c("Howard", "Lehner", "Binnington"),
                        id = c(8470657, 8475215, 8476412),
                        stringsAsFactors = FALSE)

      player      id
1     Howard 8470657
2     Lehner 8475215
3 Binnington 8476412
Next, I would write a function which takes player and id and returns the data frame of data from the website with the player name column added.
My function uses the rvest library, which supplies read_html and html_table, instead of XML. There's a complication: missing values are represented by -, which turns a column into characters. But not all players have missing values, so those columns are numeric. So the function changes - to NA, then converts all values to numeric before combining the players. The dplyr library supplies the mutate functions.
library(rvest)
library(dplyr)
get_player_data <- function(player, id) {
  base_url <- "http://naturalstattrick.com/playerreport.php?fromseason=20182019&thruseason=20182019&stype=2&sit=5v5&stdoi=oi&rate=n&v=g&playerid="
  paste0(base_url, id) %>%
    read_html() %>%
    html_table(header = TRUE) %>%
    .[[1]] %>%
    # all columns except Game* and Team*: values containing "-" become NA,
    # the rest are converted to numeric
    mutate_at(vars(-starts_with("Game"), -starts_with("Team")),
              ~ as.numeric(gsub("-", NA, .))) %>%
    mutate(player = player)
}
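The missing-value conversion can be checked in isolation on made-up values. Note that gsub("-", NA, x) in the function above sets every value that contains a hyphen to NA, which would include negative numbers. The sketch below is a variation, not the answer's method: it treats only the exact string "-" as missing, on the assumption that a lone hyphen is how the site marks a missing value.

```r
# Hypothetical column as it might come back from the scraped table
raw <- c("12", "-", "-3.5", "7")

# Only an exact "-" is treated as missing; negative numbers survive
clean <- as.numeric(ifelse(raw == "-", NA, raw))
```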
Now we can go through each player + id. Rather than a loop we can use pmap_df from the purrr library. This takes each player + id, sends it to our function and combines the outputs into a single data frame at the end:
library(purrr)
player_data <- pmap_df(player_id, get_player_data)
For your 3 example players, this returns a data frame of 83 rows and 52 columns, with the player name in the last column.
Note: it's assumed all player data has the same form as the 3 examples (52 columns, missing values represented by -). If not, the function will probably give errors.
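To see the row-wise pattern without hitting the website, the same idea can be sketched in base R with Map and a stub function. fake_player_data is a hypothetical stand-in for get_player_data that returns two made-up rows per player; Map plays the role that pmap_df plays above, and do.call(rbind, ...) combines the pieces.

```r
player_id <- data.frame(player = c("Howard", "Lehner", "Binnington"),
                        id = c(8470657, 8475215, 8476412),
                        stringsAsFactors = FALSE)

# Hypothetical stub: returns a small data frame instead of scraping
fake_player_data <- function(player, id) {
  data.frame(Game = c("game 1", "game 2"),
             playerid = id,
             player = player,
             stringsAsFactors = FALSE)
}

# Call the function once per row (player, id), then stack the results
pieces <- Map(fake_player_data, player_id$player, player_id$id)
player_data <- do.call(rbind, pieces)
rownames(player_data) <- NULL
```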
goalie holds the URL, not the name of the goalie. So first you need to give the elements of the vector Goalies the names of the goalies.
library(XML)
Howard<-'http://naturalstattrick.com/playerreport.php?fromseason=20182019&thruseason=20182019&stype=2&sit=5v5&stdoi=oi&rate=n&v=g&playerid=8470657'
Lehner<-'http://naturalstattrick.com/playerreport.php?fromseason=20182019&thruseason=20182019&stype=2&sit=5v5&stdoi=oi&rate=n&v=g&playerid=8475215'
Binnington<-'http://naturalstattrick.com/playerreport.php?fromseason=20182019&thruseason=20182019&stype=2&sit=5v5&stdoi=oi&rate=n&v=g&playerid=8476412'
Goalies<-c(Howard, Lehner, Binnington)
# give the vector the names of the goalies
names(Goalies) <- c("Howard", "Lehner", "Binnington")
gamelog<-data.frame()
for(i in 1:length(Goalies)) {
  goaliehtml<-readHTMLTable(Goalies[i])
  goaliedata<-goaliehtml[['gamelog']]
  goaliedata$player<-names(Goalies[i])
  gamelog<-rbind(gamelog, goaliedata)
}
Is this what you are looking for?
After reading an HTML table, my name column appears with records as follows:
\n\t\t\t\t\t\t\t\t\t\t\t\t\tMike Moon\n\t\t\t\t\t\t\t\t
The following code fails to generate the correct values in the First and Last name columns:
separate(data=nametable, col = Name, into = c("First","Last"), sep= " ")
Curiously, the First column is blank, while the Last column contains only the person's first name.
How can I correctly split this column into the desired First and Last columns, i.e.:
First Last
Mike Moon
Data example, per the recommendation of @r2evans and as it appears in the correct answer's code below:
nametable <- data.frame(Name="\n\t\t\t\t\t\t\t\t\t\t\t\t\tMike Moon\n\t\t\t\t\t\t\t\t", stringsAsFactors=FALSE)
It might help to trim whitespace from the field before moving on. trimws removes "leading and/or trailing whitespace from character strings" (from ?trimws).
Data:
nametable <- data.frame(Name="\n\t\t\t\t\t\t\t\t\t\t\t\t\tMike Moon\n\t\t\t\t\t\t\t\t", stringsAsFactors=FALSE)
library(dplyr)
nametable %>% mutate(Name = trimws(Name))
#        Name
# 1 Mike Moon
I infer that you are using dplyr as well as tidyr, so I'm using it here. It is also really straightforward to do nametable$Name <- trimws(nametable$Name) without dplyr.
From here, it's as you initially coded:
nametable %>%
  mutate(Name = trimws(Name)) %>%
  tidyr::separate(col=Name, into=c("First", "Last"))
#   First Last
# 1  Mike Moon
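For completeness, the same result can be reached without dplyr or tidyr. This base R sketch assumes every name consists of exactly two space-separated parts once the surrounding whitespace is trimmed; names with middle names or suffixes would need different handling.

```r
nametable <- data.frame(
  Name = "\n\t\t\t\t\t\t\t\t\t\t\t\t\tMike Moon\n\t\t\t\t\t\t\t\t",
  stringsAsFactors = FALSE
)

# Trim leading/trailing newlines and tabs, then split on the single space
clean <- trimws(nametable$Name)
parts <- strsplit(clean, " ", fixed = TRUE)

nametable$First <- vapply(parts, `[`, character(1), 1)
nametable$Last  <- vapply(parts, `[`, character(1), 2)
```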
I have a dataframe with columns containing words that make up an ngram. I would like to sum up the number of stopwords in each ngram and add this column to the dataframe but I can't think of an elegant way to do it with multiple values for n (4-grams, 5-grams etc. . .).
So far I have been doing the following:
mutate(Bigram_Counts_By_Company,
       stopword_count = (word1 %in% stop_words$word) %>% as.integer() +
         (word2 %in% stop_words$word) %>% as.integer())
Now this works, but I'd much rather write a general function that does the same with all columns starting with "word".
What I'd like to do:
mutate(Web_Bigram_Counts_By_Company,
       stopword_count = select(Web_Bigram_Counts_By_Company, starts_with("word")) %in% stop_words$word)
select(Web_Bigram_Counts_By_Company, starts_with("word")) works great to select the columns whose names start with "word", but when I use it in the call to mutate I get this error: Column 'stopword_count' must be length 360463 (the number of rows) or one, not 2.
Is this just a simple R fundamentals error or am I going about this wrong?
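The length-2 result arises because %in% treats a data frame as a list of columns, so it returns one value per selected column instead of one per row. One way around this, sketched here in base R on made-up data, is to test each word column separately and add the results row by row. The small stop_words and ngrams data frames are hypothetical stand-ins for the real tables, and the approach works for any number of word columns.

```r
# Hypothetical stand-ins for the stop word list and the n-gram table
stop_words <- data.frame(word = c("the", "of", "a"), stringsAsFactors = FALSE)
ngrams <- data.frame(
  word1 = c("the", "bank", "cost"),
  word2 = c("bank", "of", "basis"),
  word3 = c("of", "america", "points"),
  stringsAsFactors = FALSE
)

# All columns whose names start with "word", however many there are
word_cols <- grep("^word", names(ngrams), value = TRUE)

# 0/1 per column and row, then summed across columns for each row
flags <- lapply(ngrams[word_cols], function(col) {
  as.integer(col %in% stop_words$word)
})
ngrams$stopword_count <- Reduce(`+`, flags)
```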