I am trying to get informations about not occupied parking space in a car park. The info on the website is constantly updating the numbers of free parking spots.
Since I'm on the beginning of learning webscraping with R, I started learning the basics.
So I tried getting the Year of an IMDB Movie with the code
url2 <- "https://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature"
page2 <- read_html(url2)
data2 <- page2 %>%
html_node(".lister-item-year") %>%
html_text
data2
This code is running with no problems.
Now I tried the same with the website about parking spots and since the HTML Code is almost the same as in the example above, I figured it shouldn't be that hard.
url <- "https://www.rosenheim.de/stadt-buerger/verkehr/parken.html"
page <- read_html(url)
data <- page %>%
html_node('.jwGetFreeParking-8') %>%
html_text
data
But as a result I don't get the information about free parking spots. The Result I get is "". So nothing.
Is it because the number on the second webpage is updating from time to time?
This page is rendered using javascript thus the techniques from your example don't apply. If you use the developer tools from your browser and examine the files loaded on the network tab, you will find a file named "index.php". This is a JSON file containing the parking information.
Downloading this file will provide the requested information. The fromJSON function the "jsonlite" library will access the file and convert it into a data frame.
library(jsonlite)
answer<-fromJSON("https://www.rosenheim.de/index.php?eID=jwParkingGetParkings")
answer
uid title parkings occupied free isOpened link
1 4 Reserve 0 0 --- FALSE 0
2 7 Reserve 0 0 --- FALSE 0
3 13 Reserve 0 0 --- FALSE 0
4 14 Reserve 0 0 --- FALSE 0
5 0 P1 Zentrum 257 253 4 TRUE 224
6 1 P2 KU'KO 138 133 5 TRUE 225
7 2 P3 Rathaus 31 29 2 TRUE 226
8 3 P4 Mitte 275 275 0 TRUE 227
9 5 P6 Salinplatz 232 148 84 TRUE 228
10 6 P7 Altstadt-Ost 82 108 0 TRUE 229
11 10 P8 Beilhack-Citydome 160 130 30 TRUE 230
12 8 P9 Am Klinikum 426 424 2 TRUE 1053
13 9 P10 Stadtcenter 56 54 2 TRUE 231
14 11 P11 Beilhack-Gießereistr. 155 155 --- FALSE 1151
15 12 P12 Bahnhof Nord 148 45 103 TRUE 1203
this seems like a basic question; however, I am not sure if I am unable to word my question to search for the answer that I need.
This is the sample:
id2 sbp1 dbp1 age1 sbp2 dbp2 sex bmi1 bmi2 smoke drink exercise
1 1 134.5 89.5 40 146 84 2 21.74685 22.19658 1 0 1
2 4 128.5 89.5 48 125 70 1 24.61942 22.29476 1 0 0
3 5 105.5 64.5 42 121 80 2 22.15103 26.90204 1 0 0
4 8 116.5 79.5 39 107 72 2 21.08032 27.64403 0 0 1
5 9 106.5 73.5 26 132 81 2 21.26762 29.16131 0 0 0
6 10 120.5 81.5 34 130 85 1 24.91663 26.89427 1 1 0
I have this code here for a function I am making:
linreg.ols<- function(indat, dv, p1, p2, p3){
data<- read.csv(file= indat, header=T)
data[1:5,]
y<- data$dv
x <- as.matrix(data.frame(x0=rep(1,nrow(data)), x1=data$p1, x2=data$p2,
x3=data$p3))
inv<- solve(t(x)%*%x)
xy<- t(x)%*%y
betah<- inv%*%xy
print("Value of beta hat")
betah
}
And when I run my code with this line:
linreg.ols("bp.csv",sbp1,smoke,drink,exercise)
I get the following error:
Error in data.frame(x0 = rep(1, nrow(data)), x1 = data$p1, x2 = data$p2, :
arguments imply differing number of rows: 75, 0
I have a feeling that it's because of how I am extracting the p1, p2, and p3 columns on the line where I create the x variable.
EDIT: changed to y<-data$dv
EDIT: added on part of the sample. Also, I tried:
x <- as.matrix(data.frame(1,data[,c("p1","p2","p3")]))
But that returned the error:
Error in `[.data.frame`(data, , c("p1", "p2", "p3")) : undefined columns selected
I have a problem calculating the mean of columns for a dataset imported from this CSV file
I import the file using the following command:
dataGSR = read.csv("ShimmerData.csv", header = TRUE, sep = ",",stringsAsFactors=T)
dataGSR$X=NULL #don't need this column
Then I take a subset of this
dati=dataGSR[4:1000,]
i check they are correct
head(dati)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
4 31329 0 713 623.674691281028 2545 3706.5641025641 2409 3529.67032967033
5 31649 9.765625 713 623.674691281028 2526 3678.89230769231 2501 3664.46886446886
6 31969 19.53125 712 638.528829576655 2528 3681.80512820513 2501 3664.46886446886
7 32289 29.296875 713 623.674691281028 2516 3664.3282051282 2498 3660.07326007326
8 32609 39.0625 711 654.10779696494 2503 3645.39487179487 2496 3657.14285714286
9 32929 48.828125 713 623.674691281028 2505 3648.30769230769 2496 3657.14285714286
When I type
means=colMeans(dati)
Error in colMeans(dati) : 'x' must be numeric
In order to solve this problem I convert everything into a matrix
datiM=data.matrix(dati)
But when I check the new variable, data values are different
head(datiM)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
4 370 1 10 1 65 65 1 1
5 375 3707 10 1 46 46 24 24
6 381 1025 9 2 48 48 24 24
7 386 2162 10 1 36 36 21 21
8 392 3126 8 3 23 23 19 19
9 397 3229 10 1 25 25 19 19
My questions here is:
How to convert correctly the "dati" variable in order to perform the colMeans()?
In addition to #akrun's advice, another option is to convert the columns to numeric yourself (rather than having read.csv do it):
dati <- data.frame(
lapply(dataGSR[-c(1:3),-9],as.numeric))
##
R> colMeans(dati)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
33004.2924 18647.4609 707.4335 718.3989 2521.3626 3672.1383 2497.9013 3659.9287
Where dataGSR was read in with stringsAsFactors=F,
dataGSR <- read.csv(
file="F:/temp/ShimmerData.csv",
header=TRUE,
stringsAsFactors=F)
Unless you know for sure that you need character columns to be factors, you are better off setting this option to FALSE.
The header lines ("character") in the dataset span first 4 lines. We could skip the 4 lines, use header=FALSE and then change the column names based on the info from the first 4 lines.
dataGSR <- read.csv('ShimmerData.csv', header=FALSE,
stringsAsFactors=FALSE, skip=4)
lines <- readLines('ShimmerData.csv', n=4)
colnames(dataGSR) <- do.call(paste, c(strsplit(lines, ','),
list(sep="_")))
dataGSR <- dataGSR[,-9]
unname(colMeans(dataGSR))
# [1] 33004.2924 18647.4609 707.4335 718.3989 2521.3626
# 3672.1383 2497.9013
# [8] 3659.9287
The libraries used are: library(survival)
library(splines)
library(boot)
library(frailtypack) and the function used is in the library frailty pack.
In my data I have two recurrent events(delta.stable and delta.unstable) and one terminal event (delta.censor). There are some time-varying explanatory variables, like unemployment rate(u.rate) (is quarterly) that's why my dataset has been splitted by quarters.
Here there is a link to the subsample used in the code just below, just in case it may be helpful to see the mistake. https://www.dropbox.com/s/spfywobydr94bml/cr_05_males_services.rda
The problem is that it takes a lot of time running until the warning message appear.
Main variables of the Survival function are:
I have two recurrent events:
delta.unstable (unst.): takes value one when the individual find an unstable job.
delta.stable (stable): takes value one when the individual find a stable job.
And one terminal event
delta.censor (d.censor): takes value one when the individual has death, retired or emigrated.
row id contadorbis unst. stable d.censor .t0 .t
1 78 1 0 1 0 0 88
2 101 2 0 1 0 0 46
3 155 3 0 1 0 0 27
4 170 4 0 0 0 0 61
5 170 4 1 0 0 61 86
6 213 5 0 0 0 0 92
7 213 5 0 0 0 92 182
8 213 5 0 0 0 182 273
9 213 5 0 0 0 273 365
10 213 5 1 0 0 365 394
11 334 6 0 1 0 0 6
12 334 7 1 0 0 0 38
13 369 8 0 0 0 0 27
14 369 8 0 0 0 27 119
15 369 8 0 0 0 119 209
16 369 8 0 0 0 209 300
17 369 8 0 0 0 300 392
When I apply multivePenal I obtain the following message:
Error en aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
Además: Mensajes de aviso perdidos
In Surv(.t0, .t, delta.stable) : Stop time must be > start time, NA created
#### multivePenal function
fit.joint.05_malesP<multivePenal(Surv(.t0,.t,delta.stable)~cluster(contadorbis)+terminal(as.factor(delta.censor))+event2(delta.unstable),formula.terminalEvent=~1, formula2=~as.factor(h.skill),data=cr_05_males_serv,Frailty=TRUE,recurrentAG=TRUE,cross.validation=F,n.knots=c(7,7,7), kappa=c(1,1,1), maxit=1000, hazard="Splines")
I have checked if Surv(.t0,.t,delta.stable) contains NA, and there are no NA's.
In addition, when I apply for the same data the function frailtyPenal for both possible combinations, the function run well and I get results. I take one week looking at this and I do not find the key. I would appreciate some of light to this problem.
#delta unstable+death
enter code here
fit.joint.05_males<-frailtyPenal(Surv(.t0,.t,delta.unstable)~cluster(id)+u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(non.manual)+as.factor(municipio)+as.factor(spanish.speakers)+ as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+ as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+ as.factor(responsabilities)+
terminal(delta.censor),formula.terminalEvent=~u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(municipio)+as.factor(spanish.speakers)+as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+ as.factor(responsabilities),data=cr_05_males_services,n.knots=12,kappa1=1000,kappa2=1000,maxit=1000, Frailty=TRUE,joint=TRUE, recurrentAG=TRUE)
###Be patient. The program is computing ...
###The program took 2259.42 seconds
#delta stable+death
fit.joint.05_males<frailtyPenal(Surv(.t0,.t,delta.stable)~cluster(id)+u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(non.manual)+as.factor(municipio)+as.factor(spanish.speakers)+as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+as.factor(responsabilities)+terminal(delta.censor),formula.terminalEvent=~u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(municipio)+as.factor(spanish.speakers)+as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+as.factor(responsabilities),data=cr_05_males_services,n.knots=12,kappa1=1000,kappa2=1000,maxit=1000, Frailty=TRUE,joint=TRUE, recurrentAG=TRUE)
###The program took 3167.15 seconds
Because you neither provide information about the packages used, nor the data necessary to run multivepenal or frailtyPenal, I can only help you with the Surv part (because I happened to have that package loaded).
The Surv warning message you provided (In Surv(.t0, .t, delta.stable) : Stop time must be > start time, NA created) suggests that something is strange with your variables .t0 (the time argument in Surv, refered to as 'start time' in the warning), and/or .t (time2 argument, 'Stop time' in the warning). I check this possibility with a simple example
# read the data you feed `Surv` with
df <- read.table(text = "row id contadorbis unst. stable d.censor .t0 .t
1 78 1 0 1 0 0 88
2 101 2 0 1 0 0 46
3 155 3 0 1 0 0 27
4 170 4 0 0 0 0 61
5 170 4 1 0 0 61 86
6 213 5 0 0 0 0 92
7 213 5 0 0 0 92 182
8 213 5 0 0 0 182 273
9 213 5 0 0 0 273 365
10 213 5 1 0 0 365 394
11 334 6 0 1 0 0 6
12 334 7 1 0 0 0 38
13 369 8 0 0 0 0 27
14 369 8 0 0 0 27 119
15 369 8 0 0 0 119 209
16 369 8 0 0 0 209 300
17 369 8 0 0 0 300 392", header = TRUE)
# create survival object
mysurv <- with(df, Surv(time = .t0, time2 = .t, event = stable))
mysurv
# create a new data set where one .t for some reason is less than .to
# on row five .t0 is 61, so I set .t to 60
df2 <- df
df2$.t[df2$.t == 86] <- 60
# create survival object using new data which contains at least one Stop time that is less than Start time
mysurv2 <- with(df2, Surv(time = .t0, time2 = .t, event = stable))
# Warning message:
# In Surv(time = .t0, time2 = .t, event = stable) :
# Stop time must be > start time, NA created
# i.e. the same warning message as you got
# check the survival object
mysurv2
# as you can see, the fifth interval contains NA
# I would recommend you check .t0 and .t in your data set carefully
# one way to examine rows where Stop time (.t) is less than start time (.t0) is:
df2[which(df2$.t0 > df2$.t), ]
I am not familiar with multivepenal but it seems that it does not accept a survival object which contains intervals with NA, whereas might frailtyPenal might do so.
The authors of the package have told me that the function is not finished yet, so perhaps that is the reason that it is not working well.
I encountered the same error and arrived at this solution.
frailtyPenal() will not accept data.frames of different length. The data.frame used in Surv and data.frame named in data= in frailtyPenal must be the same length. I used a Cox regression to identify the incomplete cases, reset the survival object to exclude the missing cases and, finally, run frailtyPenal:
library(survival)
library(frailtypack)
data(readmission)
#Reproduce the error
#change the first start time to NA
readmission[1,3] <- NA
#create a survival object with one missing time
surv.obj1 <- with(readmission, Surv(t.start, t.stop, event))
#observe the error
frailtyPenal(surv.obj1 ~ cluster(id) + dukes,
data=readmission,
cross.validation=FALSE,
n.knots=10,
kappa=1,
hazard="Splines")
#repair by resetting the surv object to omit the missing value(s)
#identify NAs using a Cox model
cox.na <- coxph(surv.obj1 ~ dukes, data = readmission)
#remove the NA cases from the original set to create complete cases
readmission2 <- readmission[-cox.na$na.action,]
#reset the survival object using the complete cases
surv.obj2 <- with(readmission2, Surv(t.start, t.stop, event))
#run frailtyPenal using the complete cases dataset and the complete cases Surv object
frailtyPenal(surv.obj2 ~ cluster(id) + dukes,
data = readmission2,
cross.validation = FALSE,
n.knots = 10,
kappa = 1,
hazard = "Splines")
I need some suggestions how to better design my problem’s resolution.
I starting from many Csv file of result of parametric study (time series data). I want to analyze the influence of some parameters on variable. The idea is to extract some variable from table of result for each id of parametric study and create a data.frame for each variable to easily make some plot and some analysis.
The problem is that some parameters change the time step of parametric study, so there are some csv much longer. One variable for example is Temperature. It is possible to maintain the differences on time step and evaluate Delta T varying one parameter? Plyr can do that? Or I have to resample part of my result to make this evaluation losing part of information?
I achieve to this point at moment:
head(data, 5)
names Date.Time Tout.dry.bulb RHout TsupIn TsupOut QconvIn[Wm2]
1 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:03:00 0 50 23 15.84257 -1.090683e-14
2 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:06:00 0 50 23 16.66988 0.000000e+00
3 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:09:00 0 50 23 13.83446 1.090683e-14
4 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:12:00 0 50 23 14.34774 2.181366e-14
5 G_0-T_0-W_0-P1_0-P2_0 2005-01-01 00:15:00 0 50 23 12.59164 2.181366e-14
QconvOut[Wm2] Hvout[Wm2K] Qradout[Wm2] MeanRadTin MeanAirTin MeanOperTin
1 0.0000 17.76 -5.428583e-08 23 23 23
2 -281.3640 17.76 -1.151613e-07 23 23 23
3 -296.0570 17.76 -1.018871e-07 23 23 23
4 -245.7001 17.76 -1.027338e-07 23 23 23
5 -254.8158 17.76 -9.458750e-08 23 23 23
> str(data)
'data.frame': 1858080 obs. of 13 variables:
$ names : Factor w/ 35 levels "G_0-T_0-W_0-P1_0-P2_0",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Date.Time : POSIXct, format: "2005-01-01 00:03:00" "2005-01-01 00:06:00" "2005-01-01 00:09:00" ...
$ Tout.dry.bulb: num 0 0 0 0 0 0 0 0 0 0 ...
$ RHout : num 50 50 50 50 50 50 50 50 50 50 ...
$ TsupIn : num 23 23 23 23 23 23 23 23 23 23 ...
$ TsupOut : num 15.8 16.7 13.8 14.3 12.6 ...
$ QconvIn[Wm2] : num -1.09e-14 0.00 1.09e-14 2.18e-14 2.18e-14 ...
$ QconvOut[Wm2]: num 0 -281 -296 -246 -255 ...
$ Hvout[Wm2K] : num 17.8 17.8 17.8 17.8 17.8 ...
$ Qradout[Wm2] : num -5.43e-08 -1.15e-07 -1.02e-07 -1.03e-07 -9.46e-08 ...
$ MeanRadTin : num 23 23 23 23 23 23 23 23 23 23 ...
$ MeanAirTin : num 23 23 23 23 23 23 23 23 23 23 ...
$ MeanOperTin : num 23 23 23 23 23 23 23 23 23 23 ...
names(DF)
[1] "G_0-T_0-W_0-P1_0-P2_0" "G_0-T_0-W_0-P1_0-P2_1" "G_0-T_0-W_0-P1_0-P2_2"
[4] "G_0-T_0-W_0-P1_0-P2_3" "G_0-T_0-W_0-P1_0-P2_4" "G_0-T_0-W_0-P1_0-P2_5"
[7] "G_0-T_0-W_0-P1_0-P2_6" "G_0-T_0-W_0-P1_1-P2_0" "G_0-T_0-W_0-P1_1-P2_1"
[10] "G_0-T_0-W_0-P1_1-P2_2" "G_0-T_0-W_0-P1_1-P2_3" "G_0-T_0-W_0-P1_1-P2_4"
[13] "G_0-T_0-W_0-P1_1-P2_5" "G_0-T_0-W_0-P1_1-P2_6" "G_0-T_0-W_0-P1_2-P2_0"
[16] "G_0-T_0-W_0-P1_2-P2_1" "G_0-T_0-W_0-P1_2-P2_2" "G_0-T_0-W_0-P1_2-P2_3"
[19] "G_0-T_0-W_0-P1_2-P2_4" "G_0-T_0-W_0-P1_2-P2_5" "G_0-T_0-W_0-P1_2-P2_6"
[22] "G_0-T_0-W_0-P1_3-P2_0" "G_0-T_0-W_0-P1_3-P2_1" "G_0-T_0-W_0-P1_3-P2_2"
[25] "G_0-T_0-W_0-P1_3-P2_3" "G_0-T_0-W_0-P1_3-P2_4" "G_0-T_0-W_0-P1_3-P2_5"
[28] "G_0-T_0-W_0-P1_3-P2_6" "G_0-T_0-W_0-P1_4-P2_0" "G_0-T_0-W_0-P1_4-P2_1"
[31] "G_0-T_0-W_0-P1_4-P2_2" "G_0-T_0-W_0-P1_4-P2_3" "G_0-T_0-W_0-P1_4-P2_4"
[34] "G_0-T_0-W_0-P1_4-P2_5" "G_0-T_0-W_0-P1_4-P2_6"
From P1_4-P2_0 to P1_4-P2_6 the length is 113760 obs estand of 37920 because the time step change from 3 min to 1 min.
I’d like to have separated database for each variable in which I have date.time and value of variable for each names in column.
How I can do it?
Thank for any suggestion
I strongly suggest using a data structure that is appropriate for working with time series. In this case, the zoo package would work well. Load each CSV file into a zoo object, using your Date.Time column to define the index (timestamps) of the data. You can use the zoo() function to create those objects, for example.
Then use the merge function of zoo to combine the objects. It will find observations with the same timestamp and put them into one row. With merge, you can specify all=TRUE to get the union of all timestamps; or you can specify all=FALSE to get the intersection of the timestamps. For the union (all=TRUE), missing observations will be NA.
The read.zoo function could be difficult to use for reading your data. I suggest replacing your call to read.zoo with something like this:
table <- read.csv(filepath, header=TRUE, stringsAsFactors=FALSE)
dateStrings <- paste("2005/", table$Date.Time, sep="")
dates <- as.POSIXct(dateStrings)
dat <- zoo(table[,-1], dates)
(I assume that Date.Time is the first column in your file. That's why I wrote table[,-1].)