I am trying to use the following syntax to get the occupation information from George Clooney's wikipedia page. Eventually I would like there to be a loop to get data on various personalitys' occupations.
However, I get the following problem running the below code:
Error in if (symbol != "role") symbol = NULL : argument is of length zero
I am not sure why this keeps on coming up.
library(XML)
library(plyr)
url = 'http://en.wikipedia.org/wiki/George_Clooney'
# don't forget to parse the HTML, doh!
doc = htmlParse(url)
# get every link in a table cell:
links = getNodeSet(doc, '//table/tr/td')
# make a data.frame for each node with non-blank text, link, and 'title' attribute:
df = ldply(links, function(x) {
text = xmlValue(x)
if (text=='') text=NULL
symbol = xmlGetAttr(x, 'class')
if (symbol!='role') symbol=NULL
if(!is.null(text) & !is.null(symbol))
data.frame(symbol, text) } )
As #gsee mentioned, you need to check that symbol isn't NULL before you check its value. Here's a minor update to your code that works (at least for George).
df = ldply(
links,
function(x)
{
text = xmlValue(x)
if (!nzchar(text)) text = NULL
symbol = xmlGetAttr(x, 'class')
if (!is.null(symbol) && symbol != 'role') symbol = NULL
if(!is.null(text) & !is.null(symbol))
data.frame(symbol, text)
}
)
Use col.names = my_column_names in kable() with my_column_names being character vector of your wanted names, for me it worked!
Related
I am working with quarto to table some results from a qualitative data analysis, and present them in a {DT} or {gt} table.
I have a placeholder character in the table I'm receiving from another data source, but cannot seem to replace that placeholder with one or more carriage returns to make entries easier to read in the resulting DT or gt table.
Thanks for your collective help!
library(tidyverse)
library(DT)
library(gt)
df_text <- tibble::tibble(index = c("C-1", "C-2"),
finding = c("A finding on a single line.", "A finding with a return in the middle.<return>Second line is usually some additional piece of context or a quote.")) %>%
dplyr::mutate(finding = stringr::str_replace_all(finding, "<return>", "\\\n\\\n"))
DT::datatable(df_text)
gt::gt(df_text)
for gt you need
gt::gt(df_text) |> tab_style(style=cell_text(whitespace = "pre"),
locations=cells_body())
for DT you could modify the column with the required text to be html and then tell DT to respect your HTML
df_text <- tibble::tibble(index = c("C-1", "C-2"),
finding = c("A finding on a single line.", "A finding with a return in the middle.<return>Second line is usually some additional piece of context or a quote.")) %>%
dplyr::mutate(finding = paste0("<HTML>",
stringr::str_replace_all(finding, "<return>", "</br></br>"),
"</HTML>"))
DT::datatable(df_text,escape = FALSE)
Based on #Nir Graham's answer, I wrote a function. The type = "dt" isn't working for some reason, but Nir's recommendation to dplyr::mutate() inline does :shrug:
fx_add_returns <- function(x, type = c("dt", "gt")) {
type <- match.arg(type)
if (type == "dt"){
paste0("<HTML>",
stringr::str_replace_all(x, "<return>", "</br></br>"), "</HTML>")
}
if (type == "gt"){
stringr::str_replace_all(x, "<return>", "\\\n\\\n")
}
}
So I'm diving into yet another language (R), and need to be able to look at individual items in a dataframe(?). I've tried a number of ways to access this, but so far am confused by what R wants me to do to get this out. Current code:
empStatistics <- read.csv("C:/temp/empstats.csv", header = TRUE, row.names = NULL, encoding = "UTF-8", sep = ",", dec = ".", quote = "\"", comment.char = "")
attach(empStatistics)
library(svDialogs)
Search_Item <- dlgInput("Enter a Category", "")$res
if (!length(Search_Item)) {
cat("You didn't pick anything!?")
} else {
Category <- empStatistics[Search_Item]
}
Employee_Name <- dlgInput("Enter a Player", "")$res
if (!length(Employee_Name)) {
cat("No Person Selected!\n")
} else {
cat(empStatistics[Employee_Name, Search_Item])
}
and the sample of my csv file:
Name,Age,Salary,Department
Frank,25,40000,IT
Joe,24,40000,Sales
Mary,34,56000,HR
June,39,70000,CEO
Charles,60,120000,Janitor
From the languages I'm used to, I would have expected the brackets to work, but that obviously isn't the case here, so I tried looking for other solutions including separating each variable into its own brackets, trying to figure out how to use subset() (failed there, not sure it is applicable), tried to find out how to get the column and row indexes, and a few other things I'm not sure I can describe.
How can I enter values into variables, and then use that to get the individual pieces of data (ex, enter "Frank" for the name and "Age" for the search item and get back 25 or "June" for the name and "Department" for the search item to get back "CEO")?
If you would like to access it like that, you can do:
Search_Item <- "Salary"
Employee_Name <- "Frank"
empStatistics <- read.csv("empstats.csv",header = TRUE, row.names = 1)
empStatistics[Employee_Name,Search_Item]
[1] 40000
R doesn't have an Index for its data.frame. The other thing you can try is:
empStatistics <- read.csv("empstats.csv",header = TRUE)
empStatistics[match(Employee_Name,empStatistics$Name),Search_Item]
[1] 40000
I have written a function that "cleans up" taxonomic data from NGS taxonomic files. The problem is that I am unable to replace NA cells with a string like "undefined". I know that it has something to do with variables being made into factors and not characters (Warning message: In `...` : invalid factor level, NA generated), however even when importing data with stringsAsFactors = FALSE I still get this error in some cells.
Here is how I import the data:
raw_data_1 <- taxon_import(read.delim("taxonomy_site_1/*/*/*/taxonomy.tsv", stringsAsFactors = FALSE))
The taxon_import function is used to split the taxa and assign variable names:
taxon_import <- function(data) {
data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
return(data)
}
Now the following function is used to "clean" the data and this is where I would like to replace certain strings with "Undefined", however I keep getting the error: In[<-.factor(tmp, thisvar, value = "Undefined") : invalid factor level, NA generated
Here follows the data_cleanup function:
data_cleanup <- function(data) {
strip_1 = list("D_0__", "D_1__", "D_2__", "D_3__", "D_4__", "D_5__", "D_6__")
for (i in strip_1) {
data <- as.data.frame(sapply(data, gsub, pattern = i, replacement = ""))
}
data[data==""] <- "Undefined"
strip_2 = list("__", "unidentified", "Ambiguous_taxa", "uncultured", "Unknown", "uncultured .*", "Unassigned .*", "wastewater Unassigned", "metagenome")
for (j in strip_2) {
data <- as.data.frame(sapply(data, gsub, pattern = j, replacement = "Undefined"))
}
return(data)
}
The function is simply applied like: test <- data_cleanup(raw_data_1)
I am appending the data from a cloud, since it is very lengthy data. Here is the link to a data file https://drive.google.com/open?id=1GBkV_sp3A0M6uvrx4gm9Woaan7QinNCn
I hope you will forgive my ignorance, however I tried many solutions before posting here.
We start by using the tidyverse library. Let me give a twist to your question, as it's about replacing NAs, but I think with this code you should avoid that problem.
As I read your code, you erase the strings "D_0__", "D_1__", ... from the observation strings. Then you replace the strings "Ambiguous_taxa", "unidentified", ... with the string "Undefined".
According to your data, I replaced the functions with regex, which makes a little easy to clean your data:
library(tidyverse)
taxon_import <- function(data) {
data <- as.data.frame(str_split_fixed(data$Taxon, ";", 7))
colnames(data) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
return(data)
}
raw_data_1 <- taxon_import(read.delim("taxonomy.tsv", stringsAsFactors = FALSE))
raw_data_1 <- data.frame(lapply(raw_data_1,as.character),stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(raw_data_1,function(x) sub("^D_[0-6]__","",x)), stringAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("__|unidentified|Ambiguous_taxa|uncultured","Undefined",x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("Unknown|uncultured\\s.\\*|Unassigned\\s.\\*","Undefined",x)), stringsAsFactors = FALSE)
depured <- as.data.frame(sapply(depured,function(x) sub("wastewater\\sUnassigned|metagenome","Undefined",x)), stringsAsFactors = FALSE)
depured[depured ==""] <- "Undefined"
Let me explain my code. First, I read in many websites that it's better to avoid loops, as "for". So how you replace text that starts with "D_0__"?
The answer is regex (regular expression). It seems complicated at first but with practice it'll be helpful. See this expression:
"^D_[0-6]__"
It means: "Take the start of the string which begins with "D_" and follows a number between 0 and 6 and follows "__"
Aha. So you can use the function sub
sub("^D_[0-6]__","",string)
which reads: replace the regular expression with a blank space "" in the string.
Now you see another regex:
"__|unidentified|Ambiguous_taxa|uncultured"
It means: select the string "__" or "unidentified" or "Ambiguous_taxa" ...
Be careful with this regex
"Unknown|uncultured\\s.\\*|Unassigned\\s.\\*"
it means: select the string "Unknown" or "uncultured .*" or...
the blank space it's represented by \s and the asterisk is \*
Now what about the as.data.frame function? Every time I use it I have to make it "stringsAsFactors = FALSE" because the function tries to use the characters, as factors.
With this code no NA are created.
Hope it helps, please don't hesitate to ask if needed.
Regards,
Alexis
I recently wrote a function that uses grep and regex to find invalid UTF-8 code point (Since I work on a mac, my locale is also UTF-8). The input doesn't have to be UTF-8, as it is looking for invalid UTF-8 bytes. I wrote the function for work, and was wondering if anyone could provide tips for generalizing/catch any red flags in the code that I didn't notice (e.g. using base code instead of dplyr). Feel free to use any of the code if it's useful to you.
enc_check <- function(data) {
library(dplyr)
library(magrittr)
# Create vector of all possible 2-digit hexadecimal numbers (2 digits is the lenth of a byte)
allBytes <- list(x_esc = '\\x',
hex1 = as.character(c(seq(0,9),
c('a','b','c','d','e','f'))),
hex2 = as.character(c(seq(0,9),
c('a','b','c','d','e','f')))
) %$%
expand.grid(x_esc, hex1, hex2) %>%
apply(1, paste, collapse = '')
# Valid mixed alphanumeric bytes
validBytes1 <- list(x_esc = '\\x',
hexNum = as.character(c(seq(2,7))),
hexAlpha = c('a','b','c','d','e','f')
) %$%
expand.grid(x_esc, hexNum, hexAlpha) %>%
apply(1, paste, collapse = '') %>%
extract(. != '\\x7f')
# Valid purely numeric bytes
validBytes2 <- list(x_esc = '\\x',
hexNum2 = as.character(seq(20,79))
) %$%
expand.grid(x_esc, hexNum2) %>%
apply(1, paste, collapse = '')
# New-line byte
validBytes3 <- '\\x0a'
# charToRaw('\n')
# [1] 0a
# Filter all possible combinations down to only invalid bytes
validBytes <- c(validBytes1, validBytes2, validBytes3)
invalidBytes <- allBytes %>%
extract(not(is_in(., validBytes)))
# Create list of data.frame columns with invalid bytes
a_vector <- vector()
matches <- list()
for (i in 1:dim(data)[2]) {
a_vector <- data[,i]
matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
}
# Get rid of empty list elements
matches %<>%
lapply(length) %$%
extract(matches, . > 0)
# matches <- matches[lapply(matches,length) > 0]
return(matches)
}
Edit: Here's the updated code with the suggestions implemented.
enc_check <- function(dataset) {
library(dplyr)
library(magrittr)
rASCII <- c( '\n', '\r', '\t','\b',
'\a', '\f', '\v', '\\', '\'', '\"', '\`')
validBytes <- paste0("\\x",
c(as.character(as.hexmode(32:126)),
sapply(rASCII, charToRaw))) %>%
extract(not(duplicated(.)))
invalidBytes <- allBytes %>%
extract(not(is_in(., validBytes)))
a_vector <- vector()
matches <- list()
for (i in 1:dim(dataset)[2]) {
a_vector <- dataset[,i]
matches[[i]] <- unlist(sapply(invalidBytes, grep, a_vector, useBytes = TRUE))
} # sapply() is preferable to lapply due to USE.NAMES = TRUE
names(matches) <- names(dataset)
matches %<>%
lapply(length) %$%
extract(matches, . > 0)
return(matches)
}
2nd Edit: A better strategy was to use iconv. Let's say you have a file or object with some invalid bytes but that is generally UTF-8. This is often the case with Mac computers, whose default locale setting seems to be UTF-8. Moreover, Mac-based RStudio seems to use UTF-8 internally, and this can't be changed even if you set your computer's locale to a different encoding. Anyway, you can use iconv to sub all invalid bytes, normally displayed as hexadecimal bytes, (e.g. "\x8f") for the Unicode replacement symbol. Then you can search for that symbol and return a list of unique observations within a data.frame column with that symbol. Based on that, you can use "sub()" to replace those characters with the desired ones. One thing to note is that converting the file to another encoding, say latin-1, can have unexpected results if invalid bytes are present. When I did this, I noticed that some invalid bytes were converted to Unicode control characters, while other invalid bytes apparently matched valid latin-1 bytes and were displayed as nonsensical characters. In either case, I wrote a package to search data.frames for these characters and return a list, then do some replacement. The package isn't nearly as official as something off of CRAN, but if anyone's interested then here's a link to the repository: https://github.com/jkroes/FixEncoding. It's important to note that the "stable" version of the package isn't on the "master" branch; it's actually on branch "iconv". The documentation can be searched in R via "?FixEncoding" after installation of the correct branch, then finding the functions listed there and searching help for those.
This will construct all the alpha-versions of the hex numbers to "ff":
allBytes <- as.character( as.hexmode(0:255) )
Or as greppish patterns as you seem to prefer:
allBytes <- paste0("\\x", as.character( as.hexmode(0:255) ) )
The "special" characters that R recognizes does include the "\n" that you lissted but also a few more listed on the ?Quotes help page:
rASCII <- c( '\n', '\r', '\t','\b',
'\a', '\f', '\v', '\\', '\'', '\"', '\`')
You could create a vector of valid grep patterns for "characters" " space up to tilde ("~") just with this:
validBytes1 <- c(rASCII, paste( "\\x", as.hexmode( c(20:126)) )
I have concerns about using this strategy since my R throws errors when it tried to do greppish matches with what it considers an invalid input string.
> txt <- "ttt\nuuu\tiii\xff"
> dfrm <- data.frame(a = txt)
> lapply(dfrm, grep, patt = "\\xff")
$a
integer(0)
Warning message:
In FUN(X[[i]], ...) : input string 1 is invalid in this locale
> lapply(dfrm, grep, patt = "\\\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
> lapply(dfrm, grep, patt = "\xff")
Error in FUN(X[[i]], ...) : regular expression is invalid in this locale
You may want to switch over to grepRaw since it doesn't throw the same errors:
> grepRaw("\xff", txt)
[1] 12
Or may use ?tools::showNonASCII as suggested by Duncan Murdoch when this came up on Rhelp 4 years ago:
?tools::showNonASCII
# and the help page has a reproducible example of its use:
out <- c(
"fa\xE7ile test of showNonASCII():",
"\\details{",
" This is a good line",
" This has an \xfcmlaut in it.",
" OK again.",
"}")
f <- tempfile()
cat(out, file = f, sep = "\n")
tools::showNonASCIIfile(f)
#-------output appears in red----
1: fa<e7>ile test of showNonASCII():
4: This has an <fc>mlaut in it.
I am trying to obtain data from a website and thanks to a helper i could get to the following script:
require(httr)
require(rvest)
res <- httr::POST(url = "http://apps.kew.org/wcsp/advsearch.do",
body = list(page = "advancedSearch",
AttachmentExist = "",
family = "",
placeOfPub = "",
genus = "Arctodupontia",
yearPublished = "",
species ="scleroclada",
author = "",
infraRank = "",
infraEpithet = "",
selectedLevel = "cont"),
encode = "form")
pg <- content(res, as="parsed")
lnks <- html_attr(html_node(pg,"td"), "href")
However, in some cases, like the example above, it does not retrieve the right link because, for some reason, html_attr does not find urls ("href") within the node detected by html_node. So far, i have tried different CSS selector, like "td", "a.onwardnav" and ".plantname" but none of them generate an object that html_attr can handle correctly.
Any hint?
You are really close on getting the answer your were expecting. If you would like to pull the links off of the desired page then:
lnks <- html_attr(html_nodes(pg,"a"), "href")
will return a list of all of the links at the "a" tag with a "href" attribute. Notice the command is html_nodes and not node. There are multiple "a" tags thus the plural.
If you are looking for the information from the table in the body of then try this:
html_table(pg, fill=TRUE)
#or this
html_nodes(pg,"tr")
The second line will return a list of the 9 rows from the table which one could then parse to obtain the row names ("th") and/or row values ("td").
Hope this helps.