How can I fully extract all elements into a data frame? - r

I retrieve some data from an API and convert it to a flat structure.
library(httr)
library(jsonlite) # provides fromJSON()
url <- "https://api.carbonintensity.org.uk/intensity/2019-11-25/2019-11-26"
raw_original <- GET(url)
raw <- rawToChar(raw_original$content)
raw <- fromJSON(raw)
api_extr <- do.call("rbind", lapply(raw, data.frame))
At first, all seems well (a 5-column data frame):
> head(api_extr)
from to intensity.forecast intensity.actual intensity.index
1 2019-11-24T23:30Z 2019-11-25T00:00Z 210 200 moderate
2 2019-11-25T00:00Z 2019-11-25T00:30Z 199 200 moderate
3 2019-11-25T00:30Z 2019-11-25T01:00Z 200 198 moderate
4 2019-11-25T01:00Z 2019-11-25T01:30Z 204 189 moderate
5 2019-11-25T01:30Z 2019-11-25T02:00Z 199 191 moderate
6 2019-11-25T02:00Z 2019-11-25T02:30Z 192 193 moderate
However, one of the columns (intensity) is in fact a data frame which contains three further columns.
> str(api_extr)
'data.frame': 49 obs. of 3 variables:
$ from : chr "2019-11-24T23:30Z" "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" ...
$ to : chr "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" "2019-11-25T01:30Z" ...
$ intensity:'data.frame': 49 obs. of 3 variables:
..$ forecast: int 210 199 200 204 199 192 191 194 197 192 ...
..$ actual : int 200 200 198 189 191 193 197 193 193 194 ...
..$ index : chr "moderate" "moderate" "moderate" "moderate" ...
I would expect the data frame to have five columns whereas instead it only has three.
This may seem insignificant at first glance, but it causes problems when working with the data (e.g. plotting it).
How can I achieve five columns?

You can pass the URL directly to fromJSON and flatten the result in a single step.
library(jsonlite)
url <- "https://api.carbonintensity.org.uk/intensity/2019-11-25/2019-11-26"
df <- fromJSON(url, flatten = TRUE)[[1]]
str(df)
'data.frame': 49 obs. of 5 variables:
$ from : chr "2019-11-24T23:30Z" "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" ...
$ to : chr "2019-11-25T00:00Z" "2019-11-25T00:30Z" "2019-11-25T01:00Z" "2019-11-25T01:30Z" ...
$ intensity.forecast: int 210 199 200 204 199 192 191 194 197 192 ...
$ intensity.actual : int 200 200 198 189 191 193 197 193 193 194 ...
$ intensity.index : chr "moderate" "moderate" "moderate" "moderate" ...
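If the JSON has already been parsed, the nested column can also be flattened after the fact. The following is a minimal sketch, assuming a small hand-written payload (the `json` string is hypothetical and only mimics the shape of the API response); it uses `jsonlite::flatten()`, which is the same flattening that `flatten = TRUE` applies.

```r
library(jsonlite)

# Hypothetical payload shaped like the API response: each record
# carries a nested "intensity" object.
json <- '{"data": [
  {"from": "2019-11-24T23:30Z", "to": "2019-11-25T00:00Z",
   "intensity": {"forecast": 210, "actual": 200, "index": "moderate"}},
  {"from": "2019-11-25T00:00Z", "to": "2019-11-25T00:30Z",
   "intensity": {"forecast": 199, "actual": 200, "index": "moderate"}}
]}'

nested <- fromJSON(json)$data      # 3 columns; "intensity" is itself a data frame
flat <- jsonlite::flatten(nested)  # 5 columns; nested names are joined with "."

ncol(nested)
names(flat)
```

Calling the function as `jsonlite::flatten()` avoids picking up a different `flatten()` if another package that exports one (such as purrr) is attached.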

Related

Create a dataframe in R

I would like to create a dataframe with 117 columns and 90 rows, the first ones being: ID, date1, date2, Category, DR1, DRM01, DRM02, DRM03 .... up to DRM111. For the first column, it would have values ranging from 1 to 3. In date1 it would have a fixed value, which would be "2022-01-05", in date2, it would have values between 2021-12-20 to the maximum that it gives. Category can be ABC or ERF, in DR1 would be values that would vary from 200 to 250, and finally, in DRM columns, would be values that would vary from 0 to 300. Is it possible to create a dataframe like this?
I'm wondering whether this is an effort at simulation. The first few tasks seem blindingly obvious, but the last call to replicate with simplify=FALSE might have been a bit less than trivial.
test <- data.frame( ID = rep(1:3, length=90),
date1 = as.Date( "2022-01-05"),
date2= seq( as.Date("2021-12-20"), length.out=90, by=1),
#Category = ???? so far not specified
DR1 = sample( 200:250, 90, replace=TRUE), # replace=TRUE is needed because 90 draws exceed the 51 available values
setNames( replicate(111, { sample(0:300, 90)}, simplify=FALSE) ,
nm=paste("DRM",1:111) ) )
Snipped the last 105 rows of the output from str:
str(test)
'data.frame': 90 obs. of 115 variables:
$ ID : int 1 2 3 1 2 3 1 2 3 1 ...
$ date1 : Date, format: "2022-01-05" "2022-01-05" "2022-01-05" "2022-01-05" ...
$ date2 : Date, format: "2021-12-20" "2021-12-21" "2021-12-22" "2021-12-23" ...
$ DR1 : int 229 218 240 243 221 202 242 221 237 208 ...
$ DRM.1 : int 41 238 142 100 19 56 224 152 85 84 ...
$ DRM.2 : int 150 185 141 55 34 83 88 105 165 294 ...
$ DRM.3 : int 144 22 237 174 78 291 120 63 261 236 ...
$ DRM.4 : int 223 105 263 214 45 226 129 80 182 15 ...
$ DRM.5 : int 27 108 288 237 129 251 150 70 300 243 ...
# additional rows elided
The last item in that construction returns a list that has 111 "columns" with ascending numbered names. I admit to being puzzled about why there were periods in the DRM names but then realized that the data.frame function uses check.names to make sure they are legitimate, so the spaces from paste were converted to periods. If you don't like periods then use paste0.
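The naming behaviour described above can be checked on a small scale. This is only a sketch with three short columns instead of 111; the variable names (`cols`, `with_space`, `no_space`, `padded`) are made up for the illustration, and the `sprintf` variant is one possible way to get the zero-padded names (DRM01, DRM02, ...) asked for in the question.

```r
set.seed(1)  # only to make the illustration reproducible
cols <- replicate(3, sample(0:300, 5), simplify = FALSE)

with_space <- setNames(cols, paste("DRM", 1:3))        # "DRM 1", "DRM 2", "DRM 3"
no_space   <- setNames(cols, paste0("DRM", 1:3))       # "DRM1", "DRM2", "DRM3"
padded     <- setNames(cols, sprintf("DRM%02d", 1:3))  # "DRM01", "DRM02", "DRM03"

names(data.frame(with_space))                       # check.names turns spaces into periods
names(data.frame(with_space, check.names = FALSE))  # spaces are kept
names(data.frame(no_space))
names(data.frame(padded))
```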

How do I plot asset stock prices in R?

I'm trying to plot asset stock prices in R. I'm downloading the data in csv format from Yahoo Finance and then importing it to R so I can run some statistical tests on it and draw a few plots.
I'm currently trying to plot the closing price vs the date, and I'm not having a lot of success. R is just plotting it as a series of distinct points and won't join these points up with lines, despite me trying to use the argument type = "l".
price <- read.csv("~/Downloads/AAPL.csv")
plot(price$Date,price$Close,type="l")
I'm just grabbing the data from here: https://finance.yahoo.com/quote/AAPL/history?p=AAPL
I get an output like this every time, regardless of what kind of extra arguments I try.
For example, I tried to make it red, didn't change at all.
Thanks!
The problem is that price$Date is a factor (categorical variable) and not a number. You can convert the date string to a POSIX timestamp with as.POSIXlt, and then compute a floating point representation from it, e.g. year + yday/366.
Try this
price$Date = as.Date(price$Date)
plot(price$Date,price$Close,type="l",col=4) # the CSV column is named Close; AAPL.Close only exists in quantmod objects
or better
library(quantmod)
fro = '2014-07-31'
Apple = getSymbols('AAPL',auto.assign = F,from=fro)
chartSeries(Apple,subset = "last 3 years")
You don't need to use a package unless you want to create candlestick charts.
df <- read.csv("AAPL.csv")
> str(df)
'data.frame': 254 obs. of 7 variables:
$ Date : Factor w/ 254 levels "2019-07-10","2019-07-11",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Open : num 202 203 202 204 205 ...
$ High : num 204 204 204 206 206 ...
$ Low : num 202 202 202 204 204 ...
$ Close : num 203 202 203 205 204 ...
$ Adj.Close: num 201 199 201 203 202 ...
$ Volume : int 17897100 20191800 17595200 16947400 16866800 14107500 18582200 20929300 22277900 18355200 ...
df$Date <- as.Date(df$Date) # Otherwise it is treated as a factor variable
> str(df)
'data.frame': 254 obs. of 7 variables:
$ Date : Date, format: "2019-07-10" "2019-07-11" "2019-07-12" "2019-07-15" ...
$ Open : num 202 203 202 204 205 ...
$ High : num 204 204 204 206 206 ...
$ Low : num 202 202 202 204 204 ...
$ Close : num 203 202 203 205 204 ...
$ Adj.Close: num 201 199 201 203 202 ...
$ Volume : int 17897100 20191800 17595200 16947400 16866800 14107500 18582200 20929300 22277900 18355200 ...
plot(y=df$Close, x=df$Date, col="red", type = "l") # look at ?plot for more details
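Another option is to avoid the factor at import time. The sketch below assumes a tiny inline CSV (the `csv_text` values are made up, not real AAPL prices) in place of the downloaded file, reads it with `stringsAsFactors = FALSE`, and converts the column to `Date` before plotting.

```r
# Hypothetical three-row stand-in for the downloaded AAPL.csv
csv_text <- "Date,Close
2019-07-10,203.2
2019-07-11,201.8
2019-07-12,203.3"

price <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(price$Date)  # character at this point, not factor

price$Date <- as.Date(price$Date, format = "%Y-%m-%d")
class(price$Date)  # now a Date, so plot() draws a time axis

# With a Date on the x axis, type = "l" joins the points with a line
plot(price$Date, price$Close, type = "l", col = "red",
     xlab = "Date", ylab = "Close")
```

In recent versions of R (4.0.0 and later) `stringsAsFactors` already defaults to `FALSE`, but the `as.Date` conversion is still needed to get a proper date axis.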

read complicated dataset in R

My dataset looks something like the one given below. The first number is the feature number, followed by a colon and then the value associated with that specific feature. I am not sure how to import this dataset into R. Does anyone have any ideas?
236:24 500:163 732:234 869:117 885:106 1249:103 1280:158 1889:119 2015:55 2718:126 3307:137 3578:25 3770:26 4139:128 4723:114 4957:82 5128:50 5420:124 5603:135 5897:34 5946:117 6069:154 6153:55 6347:87 6372:77 6666:109 6866:223 6984:39 7709:253 7950:87 8078:38 8945:141 9316:111 9948:103 9989:68 10276:43 10530:76 10532:55 10799:15 10802:20 10848:82 11347:16 11871:51 11883:105 12534:133 12601:13 12781:178 12798:116 12842:106 12916:7 12935:51 12968:154 13028:58 13330:105 13384:2 13568:47 13641:632 13829:18 13964:62 14385:93 14392:272 15280:140 15424:119 15492:52 15523:31 16311:23 16464:69 16478:94 16584:102 16586:107 16705:272 17138:108 17181:150 17526:280 17540:163 18007:114 18050:53 18180:2 18806:160 18943:73 19055:41 19255:88 19774:59 19889:72 19921:45
101:68 572:57 732:63 962:120 1304:61 1831:60 1889:58 1973:105 2518:161 2629:228 2990:158 3147:75 3578:11 3860:88 4011:18 4623:141 4684:411 4758:69 4820:120 6149:102 6234:134 6306:118 6866:147 6927:89 6988:51 7048:178 7193:31 7257:61 7709:229 8061:125 8202:188 8272:17 8759:165 9104:77 9325:135 9860:97 10055:684 10532:180 10735:64 10744:267 10820:120 10848:186 10923:128 10936:129 11203:160 11303:144 11668:87 11867:97 11871:207 12191:83 12238:193 12380:51 12968:164 13369:58 13929:39 14531:102 14800:130 14931:99 15314:91 15632:62 16165:7 16353:120 16584:137 17216:172 18372:31 18893:75 19133:93 19154:101 19165:133 19607:20 19784:141 19889:97 19921:60
Assuming your data is stored in input.txt,
input <- scan('input.txt', what = 'character')
# byrow = TRUE keeps each feature:value pair together on one row
data <- as.data.frame(matrix(as.numeric(unlist(strsplit(input, ':'))), ncol = 2, byrow = TRUE))
colnames(data) <- c('Feature', 'Value')
str(data)
# 'data.frame': 158 obs. of 2 variables:
# $ Feature: num 236 500 732 869 885 ...
# $ Value : num 24 163 234 117 106 103 158 119 55 126 ...
Alternatively, you can use read.table to parse the input rather than splitting the strings manually. This is slightly slower but more readable.
data <- read.table(text = input, sep = ':')
colnames(data) <- c('Feature', 'Value')
str(data)
# 'data.frame': 158 obs. of 2 variables:
# $ Feature: int 236 500 732 869 885 1249 1280 1889 2015 2718 ...
# $ Value : int 24 163 234 117 106 103 158 119 55 126 ...
Edit: adapted for your dataset. Reads your Feature/Value pairs into a data frame.
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/dexter/DEXTER/dexter_test.data'
input <- scan(url, what = 'character')
# byrow = TRUE keeps each feature:value pair together on one row
data <- as.data.frame(matrix(as.numeric(unlist(strsplit(input, ':'))), ncol = 2, byrow = TRUE))
colnames(data) <- c('Feature','Value')
str(data)
# 'data.frame': 192449 obs. of 2 variables:
# $ Feature: num 236 500 732 869 885 ...
# $ Value : num 24 163 234 117 106 103 158 119 55 126 ...
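The approaches above pool all pairs into one long table, so the information about which input line a pair came from is lost. If that matters, a row index can be carried along. The following is a rough sketch, assuming the lines are already in a character vector (the `lines` vector here is a shortened, hypothetical sample; with a file it could come from `readLines`).

```r
# Shortened stand-in for the file contents: one string per input line
lines <- c("236:24 500:163 732:234",
           "101:68 572:57")

# Split each line into "feature:value" tokens
tokens <- strsplit(lines, " ", fixed = TRUE)

# Split every token at the colon and stack the pieces into a 2-column matrix
pairs <- do.call(rbind, strsplit(unlist(tokens), ":", fixed = TRUE))

data <- data.frame(
  Row     = rep(seq_along(tokens), lengths(tokens)),  # originating line number
  Feature = as.integer(pairs[, 1]),
  Value   = as.integer(pairs[, 2])
)
data
```

This long (row, feature, value) layout is also a convenient starting point for building a sparse matrix later, if the analysis needs one.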

TM - Clustering data with special date variable

I've got the following data from TripAdvisor:
'data.frame': 682 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : Factor w/ 674 levels "id","rn106322397",..: 672 671 670 669 668 667 666 665 664 663 ...
$ quote : Factor w/ 606 levels "\"Picturesque Lake Konigssee\"",..: 389 139 113 149 384 39 176 598 199 603 ...
$ rating : Factor w/ 6 levels "1","2","3","4",..: 3 5 5 5 4 5 5 5 4 5 ...
$ date : Factor w/ 505 levels "date","Reviewed 1 August 2014\n",..: 200 200 427 427 427 443 434 351 313 494 ...
$ reviewnospace: Factor w/ 674 levels "- Good car parking facilities- Organized boat trips- Ensure that you have enough time at hand for the boat trip",..: 624 573 144 211 507 26 351 672 451 249 ...
I am trying to cluster the data on the basis of the date, to get two groups: winter and summer vacationers. I want to use this clustering to analyse the reviews afterwards. I am using the tm package and tried the following code:
> x <- read.csv ("seeganz.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
> corp <- VCorpus(VectorSource(x$reviewnospace), readerControl = list(language = "eng"))
> meta(corp,tag = "date") <- x$date
> idx <- meta(corp, "date") == 'December'
But it is not working, as the content says 0 documents:
> corp [idx]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 1
Content: documents: 0
As the date has the structure "Reviewed 1 August 2014", how do I have to adapt this code to get, for example just the reviews from Nov - Feb?
Do you have any idea how I can solve this problem?
Thank you.
Generic Approach:
Use substr(date, 10, nchar(date)) to strip the leading "Reviewed " and get to "1 August 2014"; call this new vector dateNew.
Use a normal date function, e.g. as.Date(dateNew,...), to change dateNew into a vector of type Date, on which you can do subsetting, subtraction and other operations.
References from http://www.statmethods.net/input/dates.html
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
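Applied to strings of the form "Reviewed 1 August 2014", the two steps might look like the sketch below. The sample `date_raw` vector is made up, `sub()` is used here instead of `substr()` as an equivalent way to drop the prefix, and the `%B` format assumes an English locale for month names.

```r
# Hypothetical sample in the same shape as x$date (note the trailing newline)
date_raw <- c("Reviewed 1 August 2014\n",
              "Reviewed 15 December 2013\n",
              "Reviewed 3 February 2015\n")

# Step 1: drop the "Reviewed " prefix and surrounding whitespace
date_new <- trimws(sub("^Reviewed ", "", date_raw))

# Step 2: parse as Date; %d = day, %B = full month name, %Y = 4-digit year
dates <- as.Date(date_new, format = "%d %B %Y")

# Logical index for reviews written from November through February
winter <- as.integer(format(dates, "%m")) %in% c(11, 12, 1, 2)
winter
```

A logical vector like `winter` could then be used to index the corpus (for example `corp[winter]`), in place of the comparison against the literal string 'December', which never matches the full date text.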

Failure passing plots to `saveHTML(){animation}` in R

The background:
I am trying to create an animation with saveHTML(){animation} to show how runners' pace between two consecutive laps changes over time. I tried to pass the plots into the expr block with the following code:
MakeSpLaps <- function(finishers.pace, lap1, lap2, start.lap) {
  sp <- qplot(lap1, lap2, data=finishers.pace,
              color=gender, alpha = I(.7) )
  # + additional elements removed;
  return(sp)
}
MakeSpLapsAnimation <- function(){
  brk <- seq(0, 3000, 60)
  lbl <- seconds_to_period(brk)
  oopt = ani.options(interval = 0.2, nmax = 20)
  saveHTML({
    par(mar = c(4, 4, 0.5, 0.5))
    for (i in 3:11){
      # The problematic line below
      MakeSpLaps(p, p[[i]], p[[i+1]], i-2)
      ani.pause()
    }
  }, img.name = "lap_plot", imgdir = "lap_dir", htmlfile = "laps.html",
  autobrowse = FALSE, title = "Plots of consecutive laps.",
  description = "Plots of consecutive laps.")
}
Where the data.frame p looks like this:
'data.frame': 17051 obs. of 11 variables:
$ bib : int 10001 10003 10004 10005 10006 10009 10010 10011 10012 10013 ...
$ gender : Factor w/ 3 levels "","F","M": 3 3 3 3 3 3 3 3 3 3 ...
$ X5km_lap : num 290 204 196 315 228 ...
$ X10km_lap : num 280 204 201 322 225 ...
$ X15km_lap : num 283 205 204 326 235 ...
$ X20km_lap : num 282 206 204 342 229 ...
$ X25km_lap : num 280 210 205 371 235 ...
$ X30km_lap : num 280 225 216 407 254 ...
$ X35km_lap : num 279 274 231 404 267 ...
$ X40km_lap : num 284 251 257 357 262 ...
$ Finish_lap: num 289 242 247 333 265 ...
The problem and question:
Running MakeSpLaps(p, p[[3]], p[[4]], 1) on its own creates the graph I want, but when I plug it into saveHTML(), no plot is created except for a blank PNG. The HTML files are created with the following messages. How can I correctly pass the plots to the function saveHTML()?
animation option 'nmax' changed: 20 --> 1
animation option 'nmax' changed: 1 --> 20
HTML file created at: laps.html
The actual code is here: https://github.com/hktang/rscraper/blob/3d542b18b5f6fbf1a1fa31b0bd3936f1179cdc59/r/visuals.R#L145
