Subsetting data frame in R after reading in data with scan - r

I'm reading in data from an HTTP access log. I've got a file with columns for the IP address, year, month, day, hour, HTTP verb, and requested URL. I read the file in like this:
ipdata = scan(file="sample_r.log", what=list(ip="", year=0, month=0, day=0, hour=0, verb="", url=""))
This seems to work. RStudio says that ipdata is a list[7], and names(ipdata) returns
[1] "ip" "year" "month" "day" "hour" "verb" "url"
So that seems cool. I wanted to do something fun, like graph some data for a specific hour. I tried doing a subset:
s <- subset(ipdata, ipdata$hour==3)
This data looks remarkably different from the original: s is a list[297275], and the following doesn't work right:
> table(ipdata$verb)
GET POST
2870709 1596748
> table(s$verb)
character(0)
Am I going about this the correct way? What I typically do is wrap my data frame in a table() and then barplot or dotplot it. Is R a good way to do this? I want to say "Show me all of the top URLs in hour 3", for example. Or "How many times did this IP address show up per hour?"
Update: By using read.table instead of scan I was able to get a data frame. Apparently scan returns a list of vectors rather than a data frame? Definitely confusing to a n00b like myself, but I'm feeling good about it now.

If you ran
dat <- as.data.frame(ipdata)
str(dat)
.... you would probably see that it is pretty much the same as the result of your read.table() operation. read.table() is a wrapper around scan() that does a lot of the formatting and consistency checking for you.
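A minimal sketch of the whole pipeline without read.table(), assuming the sample_r.log layout from the question; once the scanned list is converted to a data frame, subset() and table() behave as expected:
ipdata <- scan(file = "sample_r.log",
               what = list(ip = "", year = 0, month = 0, day = 0,
                           hour = 0, verb = "", url = ""))
dat <- as.data.frame(ipdata, stringsAsFactors = FALSE)  # one column per list element
s <- subset(dat, hour == 3)                             # rows for hour 3 only
table(s$verb)                                           # now tabulates correctly
head(sort(table(s$url), decreasing = TRUE), 10)         # top URLs in hour 3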

Related

R: XML file to data frame

I'm fairly new to working with XML files in R, but I've had more success with other files than with this specific one.
Quick background: I receive data in the attached format, but I cannot convert it into a data frame (which I have succeeded in doing with other files). Somehow my normal procedure doesn't work here. My goal is to turn the data into a data frame. Normally I would just use xmlToDataFrame(), but that gives me the following error:
unable to find an inherited method for function ‘xmlToDataFrame’ for
signature ‘"xml_document", "missing", "missing", "missing", "missing"’
Then I tried the sequence below:
library(xml2)  # read_xml() comes from xml2
library(XML)   # xmlTreeParse(), xmlRoot(), xmlSApply(), xmlValue() come from XML
data <- read_xml("file.xml")
xmlimport <- xmlTreeParse("file.xml")
topxml <- xmlRoot(xmlimport)
topxml <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
That gave me the attached picture as output. All the data is contained within the cells, but I cannot seem to access it. I feel like there is a really simple solution, but after working with the file for longer than I like to admit, I hope you can point out something (hopefully) obvious to me.
If you have the time to assist me in it, I've uploaded the file here
Hope that will do.
Thanks for taking the time to assist me.
Note: The data is a bank fee statement, and the data is completely fictional
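For what it's worth, the error above is the kind of thing you see when objects from the xml2 and XML packages get mixed: xmlToDataFrame() belongs to the XML package and doesn't recognize the xml_document class that xml2::read_xml() returns. A minimal sketch that stays entirely within the XML package, assuming the records in file.xml are flat:
library(XML)
doc <- xmlParse("file.xml")  # XML's own parser, not xml2::read_xml()
df <- xmlToDataFrame(doc)    # works when each record's children hold simple values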

How do you download data from an API and export it into a nice CSV file to query?

I am trying to figure out how to 'download' data into a nice CSV file to be able to analyse it.
I am currently looking at WHO data here:
I am doing so by following the documentation, and getting output like so:
test_data <- jsonlite::parse_json(url("http://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple"))
head(test_data)
This gives me a rather messy list of lists of lists, which is not very easy to analyse. How could I clean this up, keeping only some of the columns returned by parse_json, say the dimensions REGION, YEAR, and COUNTRY, plus the values from the Value column? I would like to turn this into a nice data frame/CSV file so I can more easily understand what is happening.
Can anyone give any advice?
jsonlite::fromJSON gives you the data in a better format, and the third element of the list is where the main data is.
url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
data <- tmp[[3]]
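From there, a minimal sketch of the CSV step, assuming the flattened data frame has the REGION, YEAR, COUNTRY and Value columns the question mentions (the output file name is made up):
df <- data[, c("REGION", "YEAR", "COUNTRY", "Value")]  # keep only the columns of interest
write.csv(df, "who_whs6_102.csv", row.names = FALSE)   # hypothetical file name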

Cannot get head of SparkR data frame

I use SparkR to run a SQL query and get a SparkR data frame, like the following:
data = sql(sql_query)
I can get the dimensions of data by using dim(data).
However, when I want to take a look at the data by using head(data), it fails with this error:
java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.TimestampWritable cannot be cast to org.apache.hadoop.io.IntWritable
I tried the SQL query in Hive and it runs without any problem. The weird thing is that I can get the dimensions but cannot get the head.
Any idea?
Try View(head(data, 20)). Also, it would be helpful if you could give a bit more detail in your question.
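That ClassCastException typically means the Hive metastore type for some column (an int here) no longer matches what is actually stored on disk (a timestamp), and head() is the first call that actually deserializes rows. A hedged workaround sketch, with ts_col and my_table as hypothetical names, is to cast the suspect column explicitly in the query:
library(SparkR)
# Cast the mismatched column to string before fetching rows; names are hypothetical
data <- sql("SELECT CAST(ts_col AS STRING) AS ts_col FROM my_table")
head(data, 20)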

Data Imported from Excel to R not Assigning Headers as Column Names

I'm brand new to R and am having difficulty with something very basic. I'm importing data from an Excel file like this:
data1 <- read.csv(file.choose(), header=TRUE)
When I try to look at the data in the table by column, R doesn't recognize the column headers as objects. This is what it looks like:
summary(Square.Feet)
Error in summary(Square.Feet) : object 'Square.Feet' not found
I need to run a regression and I'm having the same problem. Any help would be much appreciated.
Yes, it does recognize them; you have to tell R which data frame to look in, like so:
summary(data1$Square.Feet)
Where "data" is the name of your dataframe, and after the dollar goes the name of the variable
Hope it helps
UPDATE
As suggested below, you can use the following:
data1 <- read.csv(file.choose(), header=TRUE)
attach(data1)
This way, by calling attach(), you avoid writing the name of the dataset every time, so we would go from
summary(data1$Square.Feet)
To this point after attaching the data:
summary(Square.Feet)
However, I DO NOT recommend doing this, because if you load other datasets you may mess everything up, as it's quite common for variables in different datasets to have the same names, among other major problems; see the many discussions of why attach() is discouraged (thanks Ben Bolker for the links).
If you want a summary of all data fields, then
summary(data1)
Or you can use the with() helper function:
with(data1, summary(Square.Feet))
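For the regression mentioned in the question, the same idea applies: most modelling functions take a data argument, so neither $ nor attach() is needed. A minimal sketch, where Price is a hypothetical response column:
fit <- lm(Price ~ Square.Feet, data = data1)  # Price is a made-up column name
summary(fit)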

R rvest connect with local host

I am creating a way to read SPSS labels into R. Using library(sjPlot) and view_spss(df, useViewer = FALSE), I can create a local HTML page such as http://localhost:11773/session/file1e0c67270a5.html that shows a nice table with columns for the variable names and the labels I am looking for.
Now I want to use rvest to scrape it, but when I start with a command such as page <- rvest::html("http://localhost:11773/session/file1e0c67270a5.html"), R just seems to get stuck.
I've tried searching for "connect with local host" but I can't seem to find any questions or answers related to the R package.
This doesn't really answer your specific question, as I think the reason is that R spins up a non-persistent process to serve that HTML view of your data. But your approach seems quite roundabout just to get at variable labels. This is a general way that works quite well:
library(foreign)
d <- read.spss("your_data.sav", use.value.labels=TRUE, to.data.frame=FALSE)
var_labels <- attr(d, "variable.labels")
## To access the label of a variable named 'var_name':
var_labels[["var_name"]]
Here d is a list of the data, and var_labels is a named vector of labels keyed by variable/column name.
If you want to get the variable and/or value labels of SPSS-imported data, you can use get_val_labels and get_var_labels from the sjmisc package.
Both functions accept either a single variable (vector) or a data frame and return the associated variable and value labels.
The sjmisc package supports data frames imported with either the haven or foreign package.
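A minimal sketch, assuming d is a data frame imported with haven or foreign as above (some_var is a hypothetical column name):
library(sjmisc)
get_var_labels(d)           # variable labels for every column
get_val_labels(d$some_var)  # value labels for one variable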
