I'm totally new to R, Weka, and data mining, but my tutor asked me to do some research with RWeka. I've been browsing online for information about write.arff(), but I still don't know how to use this function correctly:
write.arff(x, file, eol = "\n", relation = deparse(substitute(x)))
I hope someone can show me how to use the function in more detail, with some samples that use all of the parameters.
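For illustration, here is a minimal sketch of each argument in use. It relies on write.arff() from the foreign package, which has exactly the signature quoted above (RWeka exports a function of the same name). The data frame `weather` and the relation name "demo" are made up for this example.

```r
library(foreign)  # recommended package, normally installed with R

# Hypothetical toy data: one numeric and one factor column
weather <- data.frame(
  temperature = c(21.5, 18.0, 25.3),
  outlook = factor(c("sunny", "rainy", "sunny"))
)

out_file <- tempfile(fileext = ".arff")

# x        : the data frame (or matrix) to export
# file     : path or connection to write to
# eol      : end-of-line marker written after each line
# relation : name stored in the @relation header; defaults to the name of x
write.arff(weather, file = out_file, eol = "\n", relation = "demo")

# Inspect the written file and read it back as a data frame
arff_lines <- readLines(out_file)
round_trip <- read.arff(out_file)
```

Printing `arff_lines` shows the @relation header, one @attribute line per column, and the rows after @data.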
I'm fairly new to working with XML files within the R environment, and I have gotten further with other files than I have with this specific one.
Quick background: I receive data in the attached format, but I cannot convert the data into a data frame (which I have succeeded in with other files.) Somehow my normal procedure doesn't work with this. My goal is to make the data into a data frame. Normally I would just use xmlToDataFrame(), but that provides me with the following error:
unable to find an inherited method for function ‘xmlToDataFrame’ for
signature ‘"xml_document", "missing", "missing", "missing", "missing"’
Then I tried the below sequence
data = read_xml("file.xml")
xmlimport = xmlTreeParse("file.xml")
topxml = xmlRoot(xmlimport)
topxml = xmlSApply(topxml,function(x) xmlSApply(x,xmlValue))
That provided me with the attached picture as output. All the data is contained within the cells, and I cannot seem to access the data. I feel like there is a really simple solution, but after working with the file for longer than I like to admit, I hope you can point something (hopefully) obvious out to me.
If you have the time to assist me in it, I've uploaded the file here
Hope that will do.
Thanks for taking the time to assist me.
Note: The data is a bank fee statement, and the data is completely fictional
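The error above suggests that an xml2 object (from read_xml()) was passed to xmlToDataFrame(), which belongs to the XML package, so two different parsers are being mixed. One possible approach is to stay within xml2, sketched below on a made-up document. The element names `fee`, `name` and `amount` are assumptions, since the real file is not shown here.

```r
library(xml2)

# Hypothetical stand-in for the real statement file
xml_text_in <- '<statement>
  <fee><name>Account maintenance</name><amount>12.50</amount></fee>
  <fee><name>Card replacement</name><amount>5.00</amount></fee>
</statement>'

doc <- read_xml(xml_text_in)

# One node per record, then one column per child element
fees <- xml_find_all(doc, "//fee")
fee_df <- data.frame(
  name = xml_text(xml_find_first(fees, "./name")),
  amount = as.numeric(xml_text(xml_find_first(fees, "./amount"))),
  stringsAsFactors = FALSE
)
```

For a real file, the XPath expressions would need to follow its actual nesting, and namespaces (if present) may require xml_ns_strip() or explicit prefixes.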
note: I haven't asked a question here before, and am still not sure how to make this legible, so let me know of any confusion or tips on making this more readable
I'm trying to download user information from the 2004/06 to 2004/09 Internet Archive captures of makeoutclub.com (a wacky, now-defunct social network targeted toward alternative music fans, which was created in ~2000, making it one of the oldest profile-based social networks on the Internet) using R,* specifically the Rcrawler package.
So far, I've been able to use the package to get the usernames and profile links in a data frame, using XPath to identify the elements I want, but somehow it doesn't work for either the location or interests sections of the profiles, both of which are just text rather than child elements in the HTML. For an idea of the site/data I'm talking about, here's the page I've been testing my XPath on: https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html
I have been testing out my xpath expressions using rcrawler's ContentScraper function, which extracts the set of elements matching the specified xpath from one specific page of the site you need to crawl. Here is my functioning expression that identifies the usernames and links on the site, with the specific page I'm using specified, and returns a vector:
testwaybacktable <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = c("//tr[1]/td/font/a[1]/@href", "//tr[1]/td/font/a[1]"), ManyPerPattern = TRUE)
And here is the bad one, where I'm testing the "location," which ends up returning an empty vector
testwaybacklocations <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = "//td/table/tbody/tr[1]/td/font/text()[2]", ManyPerPattern = TRUE)
And the other bad one, this one looking for the text under "interests":
testwaybackint <- ContentScraper(Url = "https://web.archive.org/web/20040805155243/http://www.makeoutclub.com/03/profile/html/boys/2.html", XpathPatterns = "//td/table/tbody/tr[2]/td/font/text()", ManyPerPattern = TRUE)
The xpath expressions I'm using here seem to select the right elements when I try searching them in the Chrome Inspect thing, but the program doesn't seem to read them. I also have tried selecting only one element for each field, and it still produced an empty vector. I know that this tool can read text in this webpage–I tested another random piece of text–but somehow I'm getting nothing when I run this test.
Is there something wrong with my xpath expression? Should I be using different tools to do this?
Thanks for your patience!
*This is for a digital humanities project that will hopefully use some NLP to analyze language around gender and sexuality in particular, in dialogue with some NLP analysis of the lyrics of the most popular bands on the site.
A late answer, but maybe it will help nonetheless. Also, I am not sure about the whole TOS question, but I think that's yours to figure out. Long story short ... I will just try to address the technical aspects of your problem ;)
I am not familiar with the Rcrawler package. Usually I use rvest for web scraping, and I think it is a good choice. To achieve the desired output you would have to use something like
# parameters
url <- your_url
xpath_pattern <- your_pattern
# get the data
wp <- xml2::read_html(url)
# extract whatever you need
res <- rvest::html_nodes(wp,xpath=xpath_pattern)
I think it is not possible to use a vector with multiple elements as the pattern argument, but you can run html_nodes for each pattern you want to extract separately.
I think the first two urls/patterns should work this way. The pattern in your last url seems to be wrong somehow. If you want to extract the text inside the tables, it should probably be something like "//tr[2]/td/font/text()[2]"
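One likely reason the original location and interests patterns return nothing is the tbody step: browsers insert a tbody element into the DOM shown by their inspectors, while the parser behind xml2 and rvest works on the raw HTML, which often lacks it. The sketch below shows the difference on a made-up table fragment, not on the actual archived page.

```r
library(xml2)

# Hypothetical fragment: raw HTML with no <tbody>, as old pages often have
html_snippet <- '<html><body><table>
  <tr><td><font><b>name</b><br>Seattle, WA</font></td></tr>
  <tr><td><font>punk, zines</font></td></tr>
</table></body></html>'

page <- read_html(html_snippet)

# Pattern copied from a browser inspector: matches nothing in the parsed source
with_tbody <- xml_find_all(page, "//table/tbody/tr[1]/td/font/text()")

# Same pattern without the tbody step: matches the loose text node
without_tbody <- xml_find_all(page, "//table/tr[1]/td/font/text()")
location <- trimws(xml_text(without_tbody))
```

If the real page behaves the same way, dropping tbody from the XPath (as in the pattern suggested above) may be all that is needed.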
I have a quick question that I cannot figure out. I am reading some results from an output file using the code below and stored as a list in R that can be seen in the picture. I want to delete all of the information after an empty row, in other words, it would be everything after line 42:
Does anybody know anything that I could use? I tried using gsub, but I was not very successful.
Thanks for all of the help I am new to programming in R. Again any help is very much appreciated.
LoadFFA <- function(filename, folder.out, TYPE = "PeakFQ_17C",
                    colStandard = TRUE){ # standardize column output names
  require(data.table)
  if(grepl("PEAKFQSA",TYPE)){ # PeakfqSA Bulletin 17C analysis
    # read every file as a character vector of lines
    text.list<-lapply(fileinput,readLines)
    # number of lines to skip so that reading starts at the "Ann. Exc. Prob." header
    skip.rows<-sapply(text.list, grep, pattern = '^Ann. Exc. Prob.\\s+EMA Est.')-1
    PFA <- lapply(seq_along(text.list),function(i) read.delim(fileinput[i],skip=skip.rows[i],sep="\n",stringsAsFactors = TRUE,blank.lines.skip = FALSE))
  }
EDIT
I don't know if I could upload directly so here is the google drive link.
Also, here is the command to run the function LoadFFA("03606500peaks.out","D:/Documents/hydraulic.failures","PEAKFQSA"). The screenshot is the result using print(PFA).
The reason I am using a loop is that I am reading multiple output files, which contain a lot of data and have different lengths. I read the data beginning at Ann.Exc.Prob. and, as per the screenshot provided, I would like to stop after line 42 (at the first fully empty row). I hope that clears up some confusion.
Basically: read the output files, start reading at "Ann.Exc.Prob" and stop at the end of that data (line 42 for this particular file). I am using a function because I run this several times.
Again, sorry for the trouble. Thank you for your time and I appreciate your patience.
https://drive.google.com/file/d/1PGbGWIHFj7IQRevTAEfqqA9Okg4fz7Mg/view?usp=sharing
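The requirement described above (start at the "Ann. Exc. Prob." header, stop at the first empty row) can be sketched in base R by working on the lines from readLines() directly. The sample lines and the helper name `keep_block` are made up for illustration; the real files will have more columns.

```r
# Hypothetical stand-in for readLines() on one output file
lines <- c(
  "Header text",
  "Ann. Exc. Prob.   EMA Est.",
  "0.5000   1200",
  "0.1000   3400",
  "",
  "Trailing notes that should be dropped",
  "more notes"
)

# Keep the lines from the header row up to (not including) the next blank row
keep_block <- function(lines,
                       pattern = "^Ann\\. Exc\\. Prob\\.\\s+EMA Est\\.") {
  start <- grep(pattern, lines)[1]
  if (is.na(start)) return(character(0))   # header not found
  blank <- which(!nzchar(trimws(lines)))   # indices of empty rows
  after <- blank[blank > start]
  end <- if (length(after) > 0) after[1] - 1 else length(lines)
  lines[start:end]
}

block <- keep_block(lines)

# Parse the data rows (header dropped) into a data frame
tab <- read.table(text = block[-1], col.names = c("prob", "ema_est"))
```

Inside a function like LoadFFA, the same helper could be applied with lapply() to each element of text.list, which avoids hard-coding line 42 for files of different lengths.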
I'm a beginner in web scraping and I'm not yet familiar with the nomenclature for the problems I'm trying to solve. Nevertheless, I've searched exhaustively for this specific problem and was unsuccessful in finding a solution. If it is already answered somewhere else, I apologize in advance and thank you for your suggestions.
Getting to it. I'm trying to build a script with R that will:
1. Search for specific keywords in a newspaper website;
2. Give me the headlines, dates and contents for the number of results/pages that I desire.
I already know how to post the form for the search and scrape the results from the first page, but I've had no success so far in getting the content from the next pages. To be honest, I don't even know where to start (I've read stuff about RCurl and so on, but it still hasn't made much sense to me).
Below is a partial sample of the code I've written so far (scraping only the headlines of the first page to keep it simple).
library(RCurl)
library(XML)

curl <- getCurlHandle()
curlSetOpt(cookiefile='cookies.txt', curl=curl, followlocation = TRUE)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
search=getForm("http://www.washingtonpost.com/newssearch/search.html",
               .params=list(st="Dilma Rousseff"),
               .opts=curlOptions(followLocation = TRUE),
               curl=curl)
results=htmlParse(search)
results=xmlRoot(results)
results=getNodeSet(results,"//div[@class='pb-feed-headline']/h3")
results=unlist(lapply(results, xmlValue))
I understand that I could perform the search directly on the website and then inspect the URL for references regarding the page numbers or the number of the news article displayed in each page and, then, use a loop to scrape each different page.
But please bear in mind that after I learn how to go from page 1 to page 2, 3, and so on, I will try to develop my script to perform more searches with different keywords in different websites, all at the same time, so the solution in the previous paragraph doesn't seem the best to me so far.
If you have any other solution to suggest me, I will gladly embrace it. I hope I've managed to state my issue clearly so I can get a share of your ideas and maybe help others that are facing similar issues. I thank you all in advance.
Best regards
First, I'd recommend you use httr instead of RCurl - for most problems it's much easier to use.
library(httr)

r <- GET("http://www.washingtonpost.com/newssearch/search.html",
  query = list(
    st = "Dilma Rousseff"
  )
)
stop_for_status(r)
content(r)
Second, if you look at the URL in your browser, you'll notice that clicking a page number modifies the startat query parameter:
r <- GET("http://www.washingtonpost.com/newssearch/search.html",
query = list(
st = "Dilma Rousseff",
startat = 10
)
)
Third, you might want to try out my experimental rvest package. It makes it easier to extract information from a web page:
# devtools::install_github("hadley/rvest")
library(rvest)
page <- html(r)
links <- page[sel(".pb-feed-headline a")]
links["href"]
html_text(links)
I highly recommend reading the selectorgadget tutorial and using that to figure out what css selectors you need.
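The startat idea from the second point can be turned into a loop over pages. The sketch below only builds the URLs, without making any requests; the page size of 10 is an assumption inferred from startat = 10 for the second page, and `build_url` is a made-up helper name.

```r
base_url <- "http://www.washingtonpost.com/newssearch/search.html"
page_size <- 10   # assumption: 10 results per page
n_pages <- 3      # how many result pages to visit

# startat offsets for pages 1, 2, 3: 0, 10, 20
offsets <- (seq_len(n_pages) - 1) * page_size

# Hypothetical helper: encode the search term and append the offset
build_url <- function(term, startat) {
  paste0(base_url,
         "?st=", utils::URLencode(term, reserved = TRUE),
         "&startat=", startat)
}

urls <- vapply(offsets,
               function(o) build_url("Dilma Rousseff", o),
               character(1))
```

Each element of `urls` could then be fetched and parsed with the same extraction code used for the first page, and the helper reused for other keywords.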
I am using tableNominal{reporttools} to produce frequency tables. The way I understand it, tableNominal() produces LaTeX code which has to be copied and pasted into a text file and then saved as .tex. But is it possible to simply export the table, as can be done with print(xtable(table), file="path/outfile.tex")?
You may be able to use either latex or latexTranslate from the "Hmisc" package for this purpose. If you have the necessary program infrastructure the output gets sent to your TeX engine. (You may be able to improve the level of our answers by adding specific examples.)
Looks like that function does not return a character vector, so you need to use a strategy to capture the output from cat(). Using the example in the help page:
capture.output( TN <- tableNominal(vars = vars, weights = weights, group = group,
cap = "Table of nominal variables.", lab = "tab: nominal") ,
file="outfile.tex")
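To show the mechanism on its own, here is a small sketch with a made-up function that, like the one discussed above, only prints LaTeX with cat() and returns nothing useful. The function `print_table` and its output are purely illustrative.

```r
# Hypothetical function that writes LaTeX to the console via cat()
print_table <- function() {
  cat("\\begin{tabular}{ll}\n")
  cat("a & b \\\\\n")
  cat("\\end{tabular}\n")
  invisible(NULL)
}

out_file <- tempfile(fileext = ".tex")

# capture.output() diverts everything printed while evaluating the call
capture.output(print_table(), file = out_file)

written <- readLines(out_file)
```

The file then contains exactly the lines that would otherwise have appeared in the console, ready to be included from a LaTeX document.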