I am new to R; I have this HTML table here.
I need to find out if there is a gap of more than one minute in the "Time (DT)" column. I need to analyze the data and create a new table with just two columns, the first one with the time and the second one with the length of the gap.
So far I am able to download the data:
require(XML)
u <- "http://cronos.est.pr/test.html"
tables <- readHTMLTable(u)
datatest <- tables[[1]]
View(datatest)
What's next?
Convert the first column to "POSIXct" class, take differences and replace differences of one minute or less with NA. No packages are used.
with(datatest, {
  Time <- as.POSIXct(`Time (DT)`)
  # differences in minutes; the first row has no predecessor, so use 0
  Diff <- c(0, as.numeric(diff(Time), units = "mins"))
  data.frame(Time, Diff = ifelse(Diff <= 1, NA, Diff))
})
giving:
Time Diff
1 2010-01-01 09:10:00 NA
2 2010-01-01 09:11:00 NA
3 2010-01-01 09:12:00 NA
4 2010-01-01 09:13:00 NA
5 2010-01-01 09:17:00 4
6 2010-01-01 09:18:00 NA
7 2010-01-01 09:19:00 NA
8 2010-01-01 09:20:00 NA
9 2010-01-01 09:22:00 2
10 2010-01-01 09:24:00 2
11 2010-01-01 09:25:00 NA
12 2010-01-01 09:26:00 NA
13 2010-01-01 09:38:00 12
14 2010-01-01 09:39:00 NA
15 2010-01-01 09:40:00 NA
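If you only want the rows where a gap was detected, assign the result and subset out the NA rows; a short sketch (the variable name res is just for illustration):
res <- with(datatest, {
  Time <- as.POSIXct(`Time (DT)`)
  Diff <- c(0, as.numeric(diff(Time), units = "mins"))
  data.frame(Time, Diff = ifelse(Diff <= 1, NA, Diff))
})

# keep only the times at which a gap of more than one minute ended
subset(res, !is.na(Diff))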
Use the lubridate package.
library(lubridate)

# parse the full timestamps (not just the minute component), so gaps across
# hour boundaries are computed correctly
times <- as_datetime(as.character(datatest[, "Time (DT)"]))
gaps  <- c(0, as.numeric(diff(times), units = "mins"))
output <- data.frame(date_time = datatest[, "Time (DT)"], gaps = gaps)
The output is like you requested except that every gap is recorded, not just the ones greater than 1 minute. To get just the big gaps, do
output[output$gaps > 1,]
I have a data frame (DATA) with > 2 million rows (observations at different time points) and another data frame (INSERTION) which gives info about missing observations. The latter object contains 2 columns: 1st column with row indices after which empty (NA) rows should be inserted into DATA, and 2nd column with the number of empty rows that should be inserted at that position.
Below is a minimum working example:
DATA <- data.frame(
  datetime = strptime(as.character(c(
    201301011700, 201301011701, 201301011703, 201301011704, 201301011705,
    201301011708, 201301011710, 201301011711, 201301011715, 201301011716,
    201301011718, 201301011719, 201301011721, 201301011722, 201301011723,
    201301011724, 201301011725, 201301011726, 201301011727, 201301011729,
    201301011730, 201301011731, 201301011732, 201301011733, 201301011734,
    201301011735, 201301011736, 201301011737, 201301011738, 201301011739
  )), format = "%Y%m%d%H%M"),
  var1 = rnorm(30), var2 = rnorm(30), var3 = rnorm(30)
)
INSERTION <- data.frame(index=c(2, 5, 6, 8, 10, 12, 19), repetition=c(1, 2, 1, 3, 1, 1, 1))
Now I'm looking for an efficient (and thus fast) way to insert the n empty rows at the given row indices of the original data frame. How can I additionally fill in the correct datetimes for these empty rows (adding 1 minute for every new row; note, however, that on weekends and bank holidays there are regular gaps which are not contained in INSERTION)?
Any help is appreciated!
Looking at the pattern in INSERTION and matching it with DATA, it seems you are trying to fill in the missing minutes in the datetime column of DATA. You can create a data frame with a sequence of every minute from the minimum to the maximum datetime in DATA and then merge:
merge(data.frame(datetime = seq(min(DATA$datetime), max(DATA$datetime), by = "1 min")),
      DATA, all.x = TRUE)
# datetime var1 var2 var3
#1 2013-01-01 17:00:00 -1.063326 0.11925 -0.788622
#2 2013-01-01 17:01:00 1.263185 0.24369 -0.502199
#3 2013-01-01 17:02:00 NA NA NA
#4 2013-01-01 17:03:00 -0.349650 1.23248 1.496061
#5 2013-01-01 17:04:00 -0.865513 -0.51606 -1.137304
#6 2013-01-01 17:05:00 -0.236280 -0.99251 -0.179052
#7 2013-01-01 17:06:00 NA NA NA
#8 2013-01-01 17:07:00 NA NA NA
#9 2013-01-01 17:08:00 -0.197176 1.67570 1.902362
#10 2013-01-01 17:09:00 NA NA NA
#...
#...
Or using similar logic with tidyr::complete
tidyr::complete(DATA, datetime = seq(min(datetime), max(datetime), by = "1 min"))
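Either way, a quick sanity check (a sketch, with the merged result stored in a variable called filled) is that consecutive rows of the filled data are now exactly one minute apart:
filled <- merge(data.frame(datetime = seq(min(DATA$datetime), max(DATA$datetime),
                                          by = "1 min")),
                DATA, all.x = TRUE)

# every consecutive pair of datetimes should differ by exactly one minute
all(as.numeric(diff(filled$datetime), units = "mins") == 1)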
If performance is a factor on a large data frame, this approach avoids joins:
# Generate new data.frame containing missing datetimes
tmp <- data.frame(datetime = DATA$datetime[with(INSERTION, rep(index, repetition))] +
                             sequence(INSERTION$repetition) * 60)
# Create variables filled with NA to match main data.frame
tmp[setdiff(names(DATA), names(tmp))] <- NA
# Bind and sort
new_df <- rbind(DATA, tmp)
new_df <- new_df[order(new_df$datetime),]
head(new_df, 15)
datetime var1 var2 var3
1 2013-01-01 17:00:00 0.98789253 0.68364933 0.70526985
2 2013-01-01 17:01:00 -0.68307496 0.02947599 0.90731512
31 2013-01-01 17:02:00 NA NA NA
3 2013-01-01 17:03:00 -0.60189915 -1.00153188 0.06165694
4 2013-01-01 17:04:00 -0.87329313 -1.81532302 -2.04930719
5 2013-01-01 17:05:00 -0.58713154 -0.42313098 0.37402224
32 2013-01-01 17:06:00 NA NA NA
33 2013-01-01 17:07:00 NA NA NA
6 2013-01-01 17:08:00 2.41350911 -0.13691754 1.57618578
34 2013-01-01 17:09:00 NA NA NA
7 2013-01-01 17:10:00 -0.38961552 0.83838954 1.18283382
8 2013-01-01 17:11:00 0.02290672 -2.10825367 0.87441448
35 2013-01-01 17:12:00 NA NA NA
36 2013-01-01 17:13:00 NA NA NA
37 2013-01-01 17:14:00 NA NA NA
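As a small follow-up, the mixed row names (31, 32, ... for the inserted rows) can be reset, and it is easy to verify that the insertion did not create duplicate datetimes:
rownames(new_df) <- NULL               # renumber the rows 1..n
anyDuplicated(new_df$datetime) == 0    # TRUE when every datetime is unique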
Hey, I want to compute the variance of a column. My data frame is sorted by date (as.Date() format). Here you can see a snippet of it:
Date USA ARG BRA CHL COL MEX PER
2012-04-01 1 0.2271531 0.4970299 0.001956865 0.0005341452 0.07341428 NA
2012-05-01 1 0.2218906 0.4675895 0.001911405 0.0005273186 0.07026524 NA
2012-06-01 1 0.2054076 0.4531661 0.001891352 0.0005292575 0.06897811 NA
2012-07-01 1 0.2033470 0.4596730 0.001950686 0.0005312600 0.07269619 NA
2012-08-01 1 0.1993882 0.4596039 0.001980537 0.0005271514 0.07268987 NA
2012-09-01 1 0.1967152 0.4593390 0.002011212 0.0005305549 0.07418838 NA
2012-10-01 1 0.1972730 0.4597584 0.002002203 0.0005284380 0.07428555 NA
2012-11-01 1 0.1937618 0.4519187 0.001979805 0.0005238670 0.07329656 NA
2012-12-01 1 0.1854037 0.4500448 0.001993309 0.0005323795 0.07453949 NA
2013-01-01 1 0.1866007 0.4607501 0.002013112 0.0005412329 0.07551040 NA
2013-02-01 1 0.1855950 0.4712956 0.002011067 0.0005359562 0.07554661 NA
The data frame ranges from January 2004 up to December 2018. But I do not want to compute the variance of the whole columns.
I want to compute the variance over one year (12 values) in a window that moves month by month.
I do not really know how to start. I can imagine using the zoo package and rollapply. But here the problem is (I think) that R uses the values around each point and not just the past values?
I also found this question: R: create a data frame out of a rolling window, so my idea was to get rid of the date column. It is easy to build the matrix, but now I do not understand how to apply the variance function to my data...
Is there a smart way to compute it all in one and also using the information of the date? If not, I also appreciate any other solution from you!
We can use rollapplyr to perform the rolling computations. Since there are only 11 rows in the data in the question we can't take 12-month windows, but using 3-month windows instead we can illustrate it. Remove fill = NA if you want to omit the NA rows, or replace it with partial = TRUE if you want variances computed from fewer than 12 points near the beginning. If you want a data frame result use fortify.zoo(zv).
library(zoo)
z <- read.zoo(DF)
zv <- rollapplyr(z, 3, var, fill = NA)
zv
giving this zoo object:
USA ARG BRA CHL COL MEX PER
2012-04-01 NA NA NA NA NA NA NA
2012-05-01 NA NA NA NA NA NA NA
2012-06-01 0 1.287083e-04 4.998008e-04 1.126781e-09 1.237524e-11 5.208793e-06 NA
2012-07-01 0 1.033001e-04 5.217420e-05 9.109406e-10 3.883996e-12 3.565057e-06 NA
2012-08-01 0 9.358558e-06 1.396497e-05 2.060928e-09 4.221043e-12 4.600220e-06 NA
2012-09-01 0 1.113297e-05 3.108380e-08 9.159058e-10 4.826929e-12 7.453672e-07 NA
2012-10-01 0 1.988357e-06 4.498977e-08 2.485889e-10 2.953403e-12 8.001948e-07 NA
2012-11-01 0 3.560373e-06 1.944961e-05 2.615387e-10 1.168389e-11 2.971477e-07 NA
2012-12-01 0 3.717777e-05 2.655440e-05 1.271886e-10 1.814869e-11 4.312436e-07 NA
2013-01-01 0 2.042867e-05 3.268476e-05 2.806455e-10 7.540331e-11 1.231438e-06 NA
2013-02-01 0 4.134729e-07 1.129013e-04 1.186146e-10 1.983651e-11 3.263780e-07 NA
We can plot the log of the variances like this:
library(ggplot2)
autoplot(log(zv), facet = NULL) + geom_point() + ylab("log(var(.))")
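For the full January 2004 to December 2018 data described in the question, the same call with a 12-month window and the options mentioned above would look like this (a sketch):
zv12 <- rollapplyr(z, 12, var, partial = TRUE)  # 12-month rolling variances
DF12 <- fortify.zoo(zv12)                       # back to a data frame with an Index column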
Note
We assume that the starting point is the data frame generated reproducibly below:
Lines <- "Date USA ARG BRA CHL COL MEX PER
2012-04-01 1 0.2271531 0.4970299 0.001956865 0.0005341452 0.07341428 NA
2012-05-01 1 0.2218906 0.4675895 0.001911405 0.0005273186 0.07026524 NA
2012-06-01 1 0.2054076 0.4531661 0.001891352 0.0005292575 0.06897811 NA
2012-07-01 1 0.2033470 0.4596730 0.001950686 0.0005312600 0.07269619 NA
2012-08-01 1 0.1993882 0.4596039 0.001980537 0.0005271514 0.07268987 NA
2012-09-01 1 0.1967152 0.4593390 0.002011212 0.0005305549 0.07418838 NA
2012-10-01 1 0.1972730 0.4597584 0.002002203 0.0005284380 0.07428555 NA
2012-11-01 1 0.1937618 0.4519187 0.001979805 0.0005238670 0.07329656 NA
2012-12-01 1 0.1854037 0.4500448 0.001993309 0.0005323795 0.07453949 NA
2013-01-01 1 0.1866007 0.4607501 0.002013112 0.0005412329 0.07551040 NA
2013-02-01 1 0.1855950 0.4712956 0.002011067 0.0005359562 0.07554661 NA"
DF <- read.table(text = Lines, header = TRUE)
Let's say I have a dataframe of timestamps with the corresponding number of tickets sold at that time.
Timestamp ticket_count
(time) (int)
1 2016-01-01 05:30:00 1
2 2016-01-01 05:32:00 1
3 2016-01-01 05:38:00 1
4 2016-01-01 05:46:00 1
5 2016-01-01 05:47:00 1
6 2016-01-01 06:07:00 1
7 2016-01-01 06:13:00 2
8 2016-01-01 06:21:00 1
9 2016-01-01 06:22:00 1
10 2016-01-01 06:25:00 1
I want to know how to calculate the number of tickets sold within a certain time frame of each ticket. For example, I want to calculate the number of tickets sold up to 15 minutes after each ticket. In this case, the first row would have three tickets, the second row would have four tickets, etc.
Ideally, I'm looking for a dplyr solution, as I want to do this for multiple stores with a group_by() function. However, I'm having a little trouble figuring out how to hold each Timestamp fixed for a given row while simultaneously searching through all Timestamps via dplyr syntax.
In the current development version of data.table, v1.9.7, non-equi joins are implemented. Assuming your data.frame is called df and the Timestamp column is POSIXct type:
require(data.table) # v1.9.7+
window = 15L # minutes
(counts = setDT(df)[.(t=Timestamp+window*60L), on=.(Timestamp<t),
.(counts=sum(ticket_count)), by=.EACHI]$counts)
# [1] 3 4 5 5 5 9 11 11 11 11
# add that as a column to original data.table by reference
df[, counts := counts]
For each row in t, all rows where df$Timestamp < that row's t are fetched, and by=.EACHI instructs the expression sum(ticket_count) to run for each row of t. That gives your desired result.
Hope this helps.
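Since the question mentions doing this per store with group_by(), the same non-equi join can carry the store through the join key; a sketch (untested, assuming df also has a store column and has already been converted with setDT(df) as above):
window <- 15L                                              # minutes
lookup <- df[, .(store, t = Timestamp + window * 60L)]     # one row per ticket
counts_by_store <- df[lookup, on = .(store, Timestamp < t),
                      .(counts = sum(ticket_count)), by = .EACHI]$counts
df[, counts := counts_by_store]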
This is a simpler version of the ugly one I wrote earlier...
# install.packages('dplyr')
library(dplyr)

your_data %>%
  mutate(timestamp = as.POSIXct(timestamp, format = '%m/%d/%Y %H:%M'),
         ticket_count = as.numeric(ticket_count)) %>%
  mutate(window = cut(timestamp, '15 min')) %>%
  group_by(window) %>%
  dplyr::summarise(tickets = sum(ticket_count))
window tickets
(fctr) (dbl)
1 2016-01-01 05:30:00 3
2 2016-01-01 05:45:00 2
3 2016-01-01 06:00:00 3
4 2016-01-01 06:15:00 3
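Note that cut() counts tickets within fixed calendar 15-minute bins rather than relative to each ticket. If you want the same per-row counts as the data.table answer above but in dplyr syntax, a brute-force sketch (assuming Timestamp is already POSIXct; it rescans the whole column for every row, so it will be slow on large data) is:
library(dplyr)

df %>%
  # add a group_by(store) step here first if the data has a store column
  mutate(counts = sapply(seq_along(Timestamp), function(i)
    sum(ticket_count[Timestamp < Timestamp[i] + 15 * 60])))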
Here is a solution using data.table. Also incorporating different stores.
Example data:
library(data.table)
dt <- data.table(Timestamp = as.POSIXct("2016-01-01 05:30:00") + seq(60, 120000, by = 60),
                 ticket_count = sample(1:9, 2000, TRUE),
                 store = rep(c("A", "B", "C", "D"), 500))
Now apply the following:
ts <- dt$Timestamp
for (x in ts) {
  end <- x + 900
  dt[Timestamp <= end & Timestamp >= x, CS := sum(ticket_count), by = store]
}
This gives you
Timestamp ticket_count store CS
1: 2016-01-01 05:31:00 3 A 13
2: 2016-01-01 05:32:00 5 B 20
3: 2016-01-01 05:33:00 3 C 19
4: 2016-01-01 05:34:00 7 D 12
5: 2016-01-01 05:35:00 1 A 15
---
1996: 2016-01-02 14:46:00 4 D 10
1997: 2016-01-02 14:47:00 9 A 9
1998: 2016-01-02 14:48:00 2 B 2
1999: 2016-01-02 14:49:00 2 C 2
2000: 2016-01-02 14:50:00 6 D 6
I have start and end times of some commercial event for a couple of locations. The event may or may not take place on each day and the event duration does not overlap. For example, run this:
inputdata = data.frame(
location = c('x','x','y','z','z'),
start = c(as.POSIXct("2010/1/1 8:28:00"),as.POSIXct("2010/1/2 7:20:00"),
as.POSIXct("2010/1/1 10:22:00"),
as.POSIXct("2010/1/5 13:28:00"),as.POSIXct("2010/1/7 15:39:00")),
end = c(as.POSIXct("2010/1/1 13:25:00"),as.POSIXct("2010/1/2 10:09:00"),
as.POSIXct("2010/1/1 15:24:00"),
as.POSIXct("2010/1/6 00:28:00"),as.POSIXct("2010/1/7 19:34:00"))
)
The input data looks like:
location start end
1 x 2010-01-01 08:28:00 2010-01-01 13:25:00
2 x 2010-01-02 07:20:00 2010-01-02 10:09:00
3 y 2010-01-01 10:22:00 2010-01-01 15:24:00
4 z 2010-01-05 13:28:00 2010-01-06 00:28:00
5 z 2010-01-07 15:39:00 2010-01-07 19:34:00
I want to construct an hourly dataset with three columns: 1. location, 2. hour, and 3. indicator. Each row is for a pair of location and sharp hour (for instance, as.POSIXct("2010/1/1 13:00:00")), where indicator is a dummy that equals 1 if that hour falls within some event's start and end times for that location.
For instance, let's say the output hourly data are for 2010-01-01 to 2010-01-07. Run this:
output = data.frame(
location = rep(c('x','y','z'),
each=length(seq(as.POSIXct("2010/1/1"), as.POSIXct("2010/1/7 23:00:00"), "hours"))),
hour = rep(seq(as.POSIXct("2010/1/1"), as.POSIXct("2010/1/7 23:00:00"), "hours"),3),
indicator = rep(0,3*length(seq(as.POSIXct("2010/1/1"), as.POSIXct("2010/1/7 23:00:00"), "hours"))))
So the first six rows look like this:
location hour indicator
1 x 2010-01-01 00:00:00 0
2 x 2010-01-01 01:00:00 0
3 x 2010-01-01 02:00:00 0
4 x 2010-01-01 03:00:00 0
5 x 2010-01-01 04:00:00 0
6 x 2010-01-01 05:00:00 0
Now, we need to change the value of indicator to 1 if the hour in the same row has an event in effect for the location in the same row.
For instance, location x has an event between 08:28 and 13:25 on 2010/1/1, so the rows for 07:00 to 14:00 should look like this:
location hour indicator
8 x 2010-01-01 07:00:00 0
9 x 2010-01-01 08:00:00 1
10 x 2010-01-01 09:00:00 1
11 x 2010-01-01 10:00:00 1
12 x 2010-01-01 11:00:00 1
13 x 2010-01-01 12:00:00 1
14 x 2010-01-01 13:00:00 1
15 x 2010-01-01 14:00:00 0
It seems that I could exhaustively search each pair of location and hour and update the value of indicator if the hour is between the start and end of some event at that location, but I doubt this is the best way.
Or I am thinking that I can first convert the input data to hourly data, where an hour appears only if it falls between the start and end of some event. In other words, the converted data should look like this:
location hour indicator
1 x 2010-01-01 08:00:00 1
2 x 2010-01-01 09:00:00 1
3 x 2010-01-01 10:00:00 1
4 x 2010-01-01 11:00:00 1
5 x 2010-01-01 12:00:00 1
6 x 2010-01-01 13:00:00 1
7 x 2010-01-02 07:00:00 1
8 x 2010-01-02 08:00:00 1
9 x 2010-01-02 09:00:00 1
10 x 2010-01-02 10:00:00 1
11 y 2010-01-01 10:00:00 1
12 y 2010-01-01 11:00:00 1
and then I go from there to get the correct indicators for each hour for each location. Though, I don't know how to convert the start/end hours to hourly observations.
This is all I get for this problem so far.
With this said, I do not have a solution and would like to ask for help.
Also, all I want is that output with three columns. When contributing, please do not be constrained by my thoughts, which may not be efficient.
It is worth mentioning that the actual problem covers 5 years and there are 30 locations. So the algorithm needs to be efficient.
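For the conversion step described above, one possibility is to expand every event into the sharp hours it touches and then merge that back onto the full location-hour grid; a base-R sketch (untested), using inputdata and output as defined above:
# one row per (location, hour) combination covered by an event
hourly <- do.call(rbind, lapply(seq_len(nrow(inputdata)), function(i) {
  data.frame(location  = inputdata$location[i],
             hour      = seq(as.POSIXct(trunc(inputdata$start[i], "hours")),
                             inputdata$end[i], by = "hour"),
             indicator = 1)
}))

# merge onto the full grid and set the non-matching hours to 0
result <- merge(output[c("location", "hour")], hourly, all.x = TRUE)
result$indicator[is.na(result$indicator)] <- 0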
Here is a way to do this with a cross join.
library(dplyr)

hours =
  data_frame(hour = seq(as.POSIXct("2010/1/1"),
                        as.POSIXct("2010/1/7 23:00:00"),
                        "hours")) %>%
  merge(inputdata %>% select(location) %>% distinct)

hours %>%
  left_join(inputdata) %>%
  filter(start <= hour & hour <= end) %>%
  right_join(hours) %>%
  mutate(indicator = +!is.na(start))
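If the cross join turns out to be too slow for the full five years and 30 locations, a non-equi update join in data.table (version 1.9.8 or later) over the same hour grid is one alternative; a sketch (untested):
library(data.table)

# full location x hour grid
grid <- CJ(location = unique(inputdata$location),
           hour = seq(as.POSIXct("2010/1/1"), as.POSIXct("2010/1/7 23:00:00"), "hours"))
ev <- as.data.table(inputdata)

# mark every grid hour that falls inside an event for the same location
grid[, indicator := 0]
grid[ev, indicator := 1, on = .(location, hour >= start, hour <= end)]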
I have two data.tables that are each 5-10GB in size. They look similar to the following.
library(data.table)
A <- data.table(
person = c(1,1,1,2,3,3,3,3,4,4),
datetime = c(
'2015-04-06 14:22:18',
'2015-04-07 02:55:32',
'2015-11-21 10:16:05',
'2015-10-03 13:37:29',
'2015-02-26 23:51:56',
'2015-05-16 18:21:44',
'2015-06-02 04:07:43',
'2015-11-28 15:22:36',
'2015-01-19 04:10:22',
'2015-01-24 02:18:11'
)
)
B <- data.table(
person = c(1,1,3,4,4,5),
datetime2 = c(
'2015-04-06 14:24:59',
'2015-11-28 15:22:36',
'2015-06-02 04:07:43',
'2015-01-19 06:10:22',
'2015-01-24 02:18:18',
'2015-04-06 14:22:18'
)
)
A$datetime <- as.POSIXct(A$datetime)
B$datetime2 <- as.POSIXct(B$datetime2)
The idea is to find rows in B where the datetime is within 0-10 minutes of a matching row in A (matching is done by person) and mark them in A. The question is how can I do it most efficiently using data.table?
One plan is to join the two data tables based on person only, then calculate the time difference and find rows where the time difference is between 0 and 600 seconds, and finally outer join the latter with A:
setkey(A, person)
AB <- A[B, .(datetime,
             datetime2,
             diff = difftime(datetime2, datetime, units = "secs")),
        by = .EACHI]
M <- AB[diff < 600 & diff > 0]
setkey(A, person, datetime)
setkey(M, person, datetime)
M[A, ]
Which gives us the correct result:
person datetime datetime2 diff
1: 1 2015-04-06 14:22:18 2015-04-06 14:24:59 161 secs
2: 1 2015-04-07 02:55:32 <NA> NA secs
3: 1 2015-11-21 10:16:05 <NA> NA secs
4: 2 2015-10-03 13:37:29 <NA> NA secs
5: 3 2015-02-26 23:51:56 <NA> NA secs
6: 3 2015-05-16 18:21:44 <NA> NA secs
7: 3 2015-06-02 04:07:43 <NA> NA secs
8: 3 2015-11-28 15:22:36 <NA> NA secs
9: 4 2015-01-19 04:10:22 <NA> NA secs
10: 4 2015-01-24 02:18:11 2015-01-24 02:18:18 7 secs
However, I am not sure if this is the most efficient way. Specifically, I am using AB[diff < 600 & diff > 0] which I assume will run a vector search not a binary search, but I cannot think of how to do it using a binary search.
Also, I am not sure if converting to POSIXct is the most efficient way of calculating time differences.
Any ideas on how to improve efficiency are highly appreciated.
data.table's rolling join is perfect for this task:
B[, datetime := datetime2]
setkey(A,person,datetime)
setkey(B,person,datetime)
B[A,roll=-600]
person datetime2 datetime
1: 1 2015-04-06 14:24:59 1428319338
2: 1 NA 1428364532
3: 1 NA 1448090165
4: 2 NA 1443868649
5: 3 NA 1424983916
6: 3 NA 1431789704
7: 3 2015-06-02 04:07:43 1433207263
8: 3 NA 1448713356
9: 4 NA 1421629822
10: 4 2015-01-24 02:18:18 1422055091
The only difference from your expected output is that it treats the time difference as less than or equal to 10 minutes (<=). If that is a problem for you, you can simply drop the equal matches.
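With non-equi joins (data.table 1.9.8 and later) the matching can also be written as an update join that simply flags the rows of A, which is closer to the "mark them in A" wording of the question; a sketch (untested):
# flag rows of A that have a B row for the same person within 0-10 minutes after them
B[, `:=`(lower = datetime2 - 600, upper = datetime2)]
A[, matched := FALSE]
A[B, matched := TRUE, on = .(person, datetime >= lower, datetime <= upper)]

# use datetime < upper instead of <= to drop exact matches, mirroring the
# diff > 0 filter in the question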