How to read a file with a specific format in R?

I'd like to read a file where each line is a record containing a date, some text, and numbers. Example:
Fri Dec 11 12:40:01 CET 2015 Uptime: 108491 Threads: 2 Questions: 576603 Slow queries: 10 Opens: 2238 Flush tables: 1 Open tables: 7 Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015 Uptime: 109090 Threads: 2 Questions: 580407 Slow queries: 10 Opens: 2253 Flush tables: 1 Open tables: 6 Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015 Uptime: 109690 Threads: 2 Questions: 583895 Slow queries: 10 Opens: 2268 Flush tables: 1 Open tables: 8 Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015 Uptime: 110290 Threads: 1 Questions: 586891 Slow queries: 10 Opens: 2279 Flush tables: 1 Open tables: 6 Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015 Uptime: 110890 Threads: 2 Questions: 590871 Slow queries: 10 Opens: 2292 Flush tables: 1 Open tables: 5 Queries per second avg: 5.328
There is no general separating character (as in CSV), but the format can be described pretty well in terms of tabs, literal characters, and text:
%DATESTRING%\tUptime: %uptime% Threads: %threads% Questions: %questions% Slow queries: %slow% Opens: %opens% Flush tables: %flush% Open tables: %otables% Queries per second avg: %qps%
Is there a function that takes a description of the format and the file, and fills a data.frame with the given data?

The tidyr package has some utility functions that may be useful for this, although I wouldn't be surprised if there were more special-purpose tools built for this job.
We start by loading the data, in this case from a string:
raw <- 'Fri Dec 11 12:40:01 CET 2015 Uptime: 108491 Threads: 2 Questions: 576603 Slow queries: 10 Opens: 2238 Flush tables: 1 Open tables: 7 Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015 Uptime: 109090 Threads: 2 Questions: 580407 Slow queries: 10 Opens: 2253 Flush tables: 1 Open tables: 6 Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015 Uptime: 109690 Threads: 2 Questions: 583895 Slow queries: 10 Opens: 2268 Flush tables: 1 Open tables: 8 Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015 Uptime: 110290 Threads: 1 Questions: 586891 Slow queries: 10 Opens: 2279 Flush tables: 1 Open tables: 6 Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015 Uptime: 110890 Threads: 2 Questions: 590871 Slow queries: 10 Opens: 2292 Flush tables: 1 Open tables: 5 Queries per second avg: 5.328'
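df <- read.csv(textConnection(raw), header = FALSE, stringsAsFactors = FALSE)
Here I've used read.csv so that I get it as a data frame (the lines contain no commas, so each line lands in the single column V1), but you could also just use readLines and add it to a frame yourself.
Then we process it: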
library(tidyr)
processed <- df %>% extract(V1,
    c("Date", "Uptime", "Threads", "Questions"),
    "(.*) *Uptime: (\\d+) *Threads: (\\d+) *Questions: (\\d+)")
processed
Date Uptime Threads Questions
1 Fri Dec 11 12:40:01 CET 2015 108491 2 576603
2 Fri Dec 11 12:50:01 CET 2015 109090 2 580407
3 Fri Dec 11 13:00:01 CET 2015 109690 2 583895
4 Fri Dec 11 13:10:01 CET 2015 110290 1 586891
5 Fri Dec 11 13:20:01 CET 2015 110890 2 590871
It should be clear how to extract the remaining columns from here.

Two more options:
txt <- "Fri Dec 11 12:40:01 CET 2015 Uptime: 108491 Threads: 2 Questions: 576603 Slow queries: 10 Opens: 2238 Flush tables: 1 Open tables: 7 Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015 Uptime: 109090 Threads: 2 Questions: 580407 Slow queries: 10 Opens: 2253 Flush tables: 1 Open tables: 6 Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015 Uptime: 109690 Threads: 2 Questions: 583895 Slow queries: 10 Opens: 2268 Flush tables: 1 Open tables: 8 Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015 Uptime: 110290 Threads: 1 Questions: 586891 Slow queries: 10 Opens: 2279 Flush tables: 1 Open tables: 6 Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015 Uptime: 110890 Threads: 2 Questions: 590871 Slow queries: 10 Opens: 2292 Flush tables: 1 Open tables: 5 Queries per second avg: 5.328"
## first just tack on the date label
txt <- gsub('^', 'Date: ', readLines(textConnection(txt)))
Option 1:
sp <- strsplit(txt, '\\s{2,}')
out <- lapply(sp, function(x) gsub('([\\w ]+:)\\s+(.*)$', '\\2', x, perl = TRUE))
dd <- setNames(do.call('rbind.data.frame', out),
gsub('([\\w ]+):\\s+(.*)$', '\\1', sp[[1]], perl = TRUE))
dd[, -1] <- lapply(dd[, -1], function(x) as.numeric(as.character(x)))
dd
Option 2: This one uses the yaml package, but it is much more straightforward and does the type conversion for you:
yml <- gsub('\\s{2,}', '\n', txt)
do.call('rbind.data.frame', lapply(yml, yaml::yaml.load))
# Date Uptime Threads Questions Slow queries Opens Flush tables
# 1 Fri Dec 11 12:40:01 CET 2015 108491 2 576603 10 2238 1
# 2 Fri Dec 11 12:50:01 CET 2015 109090 2 580407 10 2253 1
# 3 Fri Dec 11 13:00:01 CET 2015 109690 2 583895 10 2268 1
# 4 Fri Dec 11 13:10:01 CET 2015 110290 1 586891 10 2279 1
# 5 Fri Dec 11 13:20:01 CET 2015 110890 2 590871 10 2292 1
# Open tables Queries per second avg
# 1 7 5.314
# 2 6 5.320
# 3 8 5.323
# 4 6 5.321
# 5 5 5.328

Related

Function to return how many names occur exactly n times in R

Dataset:
[1] Wed sun Sat fri mon sun sun Wed Wed sun sun Wed Sat Sat fri thu thu Wed Wed mon sun thu thu
[24] Wed fri thu Wed Sat thu sun sun sun sun Sat sun sun Wed tue sun sun Sat fri Wed mon mon sun
I need a function that returns how many days occur exactly 1, 5, or 10 times.
The outcome should be tabulated like:
1 5 10
0 1 1
I thought about using tapply with a new vector as the index, like:
count<- c(1,10,15)
Try using the base R table function:
days <- read.table(text="Wed sun Sat fri mon sun sun Wed Wed sun sun Wed Sat Sat fri thu thu Wed Wed mon sun thu thu Wed fri thu Wed Sat thu sun sun sun sun Sat sun sun Wed tue sun sun Sat fri Wed mon mon sun", sep = " ")
table(t(days))
#fri mon Sat sun thu tue Wed
# 4 4 6 15 6 1 10
To re-aggregate the initial table count:
table(table(t(days)))
# 1 4 6 10 15
# 1 2 2 1 1
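To report only the frequencies asked about, including those that never occur, one possible sketch is to turn the per-day counts into a factor whose levels are the wanted frequencies. The names `wanted` and `result` are illustrative and not from the original answer.

```r
# The data from the question, split into one element per day name
days <- strsplit("Wed sun Sat fri mon sun sun Wed Wed sun sun Wed Sat Sat fri thu thu Wed Wed mon sun thu thu Wed fri thu Wed Sat thu sun sun sun sun Sat sun sun Wed tue sun sun Sat fri Wed mon mon sun", " ")[[1]]

# How often each day name occurs
counts <- table(days)

# Frequencies of interest (hypothetical name)
wanted <- c(1, 5, 10)

# Using the wanted frequencies as factor levels keeps zero-count entries
# and drops every frequency that was not asked for
result <- table(factor(as.vector(counts), levels = wanted))
result
```

Because the levels are fixed up front, a frequency that no day reaches still appears in the result with a count of 0.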
If this is your data:
days <- "Wed sun Sat fri mon sun sun Wed Wed sun sun Wed Sat Sat fri thu thu Wed Wed mon sun thu thu Wed fri thu Wed Sat thu sun sun sun sun Sat sun sun Wed tue sun sun Sat fri Wed mon mon sun"
you will probably want to split it into substrings and convert them to lower case first:
days <- tolower(unlist(strsplit(days, " ")))
Then you can use table:
table(days)
days
fri mon sat sun thu tue wed
4 4 6 15 6 1 10
If you want to know which days occur a certain number of times:
t <- as.data.frame.table(table(days))
t$days[t$Freq == 10]
[1] wed
Here you are asking which day occurs exactly 10 times.
Or, if you want to know which days occur exactly 4 times:
t$days[t$Freq == 4]
[1] fri mon
If you just want to know how many days occur exactly, say, 4 times:
length(t$days[t$Freq == 4])
[1] 2
If you want to know how many days occur n times for each n, you can use a for loop:
times <- c()
for(i in 1:15){
times[i] <- length(t$days[t$Freq == i])
}
times
[1] 1 0 0 2 0 2 0 0 0 1 0 0 0 0 1
If you want this info in a nice dataframe:
df <- data.frame(
freq = 1:15
)
for(i in 1:15){
df$times[i] <- length(t$days[t$Freq == i])
}
df
freq times
1 1 1
2 2 0
3 3 0
4 4 2
5 5 0
6 6 2
7 7 0
8 8 0
9 9 0
10 10 1
11 11 0
12 12 0
13 13 0
14 14 0
15 15 1
Does this answer your question?

Daily Average of Time series derived from monthly data R monthdays()

I have a time series object ts, shown in full below. It has monthly data from Jan 2013 to Dec 2017. I am trying to find the daily average, that is, each monthly value divided by the number of days in that month.
Expected output
The first value in ts, for Jan 2013, is 23770; I want it to become 23770/31, where 31 is the number of days in January. The second value, for Feb 2013, is 23482; I want it to become 23482/28, since February 2013 had 28 days, and so on.
Tried so far:
I know monthdays() can do this, with something like ts/monthdays(); monthdays() returns the number of days in a month. I am not able to implement it here. I read about tapply somewhere, but it is not giving me the desired result, since I need a value for each month-year combination.
ts
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 23770 23482 23601 22889 23401 24240 23873 23647 23378 23871 22624 23496
2014 26765 27619 26341 27320 27389 27418 26874 27005 27538 26324 27267 27583
2015 28354 27452 28336 28998 28595 28338 27806 28660 27226 28317 28666 28574
2016 30209 30659 31554 30248 30358 31091 30389 30247 31227 31839 30602 30609
2017 32180 32203 31639 31784 32375 30856 31863 32827 32506 31702 31681 32176
> cycle(ts_actual_group2)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 1 2 3 4 5 6 7 8 9 10 11 12
2014 1 2 3 4 5 6 7 8 9 10 11 12
2015 1 2 3 4 5 6 7 8 9 10 11 12
2016 1 2 3 4 5 6 7 8 9 10 11 12
2017 1 2 3 4 5 6 7 8 9 10 11 12
I tried tapply, since I had read about it, but it does not give the desired output:
tapply(ts_actual_group2, cycle(ts_actual_group2), mean)
1 2 3 4 5 6 7 8 9 10 11 12
28255.6 28283.0 28294.2 28247.8 28423.6 28388.6 28161.0 28477.2 28375.0 28410.6 28168.0 28487.6
I am not able to implement it here.
I'm not sure why you couldn't. The monthdays function from the forecast package, when applied to a ts object, returns the number of days in each month of the series. The object returned is a time-series of the same dimension as the input. So you can simply divide them.
library(forecast)
ts/monthdays(ts)
Jan Feb Mar Apr May Jun
2013 766.7742 838.6429 761.3226 762.9667 754.8710 808.0000
2014 863.3871 986.3929 849.7097 910.6667 883.5161 913.9333
2015 914.6452 980.4286 914.0645 966.6000 922.4194 944.6000
2016 974.4839 1057.2069 1017.8710 1008.2667 979.2903 1036.3667
2017 1038.0645 1150.1071 1020.6129 1059.4667 1044.3548 1028.5333
monthdays(ts) # Accepts a time-series object
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 31 28 31 30 31 30 31 31 30 31 30 31
2014 31 28 31 30 31 30 31 31 30 31 30 31
2015 31 28 31 30 31 30 31 31 30 31 30 31
2016 31 29 31 30 31 30 31 31 30 31 30 31
2017 31 28 31 30 31 30 31 31 30 31 30 31
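If the forecast package is not available, the same divisor can be built in base R. This is only a sketch under the assumption that the series is monthly (frequency 12); the names `first_day`, `month_starts`, `days`, and `daily` are illustrative, and the three values are the Jan to Mar 2016 figures from the question.

```r
# Jan-Mar 2016 from the question (2016 is a leap year)
x <- ts(c(30209, 30659, 31554), start = c(2016, 1), frequency = 12)

# First day of the month in which the series starts
s <- as.integer(start(x))
first_day <- as.Date(sprintf("%d-%02d-01", s[1], s[2]))

# First day of every month in the series, plus one extra month at the end
month_starts <- seq(first_day, by = "month", length.out = length(x) + 1)

# The gap between consecutive month starts is the month length in days
days <- as.numeric(diff(month_starts))

# Dividing a ts by a plain vector of the same length keeps the ts attributes
daily <- x / days
daily
```

Taking differences between month starts handles leap years without a lookup table.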

converting to time series using ts() in r

Good afternoon
I have a time series
v2<-c(12,13,15,17,18,12,11,12)
which runs from July 1996 to October 1997, covering only the months from July to October in each year.
When I try to convert it to a time series with
v2.ts<-ts(v2, frequency=12, start=c(1996,7), end=c(1997,10))
I get this result:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1996 12 13 15 17 18 12
1997 11 12 12 13 15 17 18 12 11 12
What parameters can I use to make it look like this:
Jul Aug Sep Oct
1996 12 13 15 17
1997 18 12 11 12
Thanks in advance for the help
A ts series must be regularly spaced. In the desired output the points are one month apart except for the gap between October of the first year and July of the second, so it cannot be represented as a ts series.
There are several packages that can represent irregularly spaced series. With the zoo package it would be done like this:
library(zoo)
z <- as.zoo(v2.ts)
z[cycle(z) %in% 7:10]
## Jul 1996 Aug 1996 Sep 1996 Oct 1996 Jul 1997 Aug 1997 Sep 1997 Oct 1997
## 12 13 15 17 18 12 11 12
If you are not looking for a time series but just a matrix with the indicated elements then:
tapply(c(v2.ts), list(floor(time(v2.ts)), cycle(v2.ts)), c)[, 7:10]
## 7 8 9 10
## 1996 12 13 15 17
## 1997 18 12 11 12

R convert year-month-day-hour local standard time to UTC

Background info (see question at bottom):
I received a dataset of hourly averaged observations collected by instruments at hundreds of sites in different time zones every hour for the past 10 years. The instruments are never adjusted for daylight savings time, so all times in the dataset are in local standard time. The hourly reported values are averages of measurements made every minute for the previous hour. Year, month, day, and hour are reported in separate columns. The hours go from 1:24 instead of 0:23. I want to create a new column containing the UTC datetime.
The data table below is a sample dataset with my most recent solution, as far as it goes. For many frustrating hours over two weeks, I have experimented with strptime, chron, POSIXct, and POSIXlt, and scoured Stack Overflow and other sources to try to understand what the solution would be. I'm never sure of what's going on in my attempts at conversion (except when I'm sure it's wrong, which is most of the time!).
I'm not sure that the datetime column I've created is the correct intermediate step, either, or how to get from that to a UTC time that R will handle correctly. I inserted the character "T" between date and time in my datetime column to force the column to remain character; otherwise unexpected things happen. For example, my computer operating system timezone is America/Toronto, and
as.POSIXct(mydata$datetime, format="%Y-%m-%dT%H:%M %z")
converts 2013-01-01T01:00-0800 to 2013-01-01 04:00:00. The above command seems to be converting to my machine's timezone, not UTC. So, if I change the R environment time zone, without changing the computer operating system time zone, before running the command
Sys.setenv(TZ = "GMT")
mydata$dateUTC <- as.POSIXct(mydata$datetime, format="%Y-%m-%dT%H:%M %z")
Sys.unsetenv("TZ")
then the above command converts 2013-01-01T01:00-0800 to 2013-01-01 09:00:00 which appears to be the UTC time I'm looking for.
I'm not too worried about hour24, because it seems that whatever method is used, the date is automatically increased to the next day and the hour changed to 00:00 (e.g., 2013-01-01 24:00 becomes 2013-01-02 00:00).
When converting from UTC to local time, I'm not too worried about the fact that the date on which times change from Standard time to Daylight Savings time can, and has changed over the years. Given the correct UTC time and Olson timezone, if I use the IANA timezone database this should be automatically taken care of (I think).
Question 1:
Using R, how should I convert year-month-day-hour reported in local standard time all year to UTC time?
Question 2:
Using R, How should I convert from UTC time to local standard time (without converting to DST in localities that use DST for civil time)?
Question 3:
Using R, how should I convert from UTC time to local time, taking into account daylight saving time?
Question 4:
In order to convert from UTC to local time, I will need the timezone names from the IANA database. Is there some way I can pull this in from somewhere on the web, given the latitude and longitude for each site?
filename = mydata
site year month day hourend UTCoffset datetime obs
2001 2015 1 1 22:00 -0200 2013-01-01T22:00-0200 1356
2001 2015 1 1 23:00 -0200 2013-01-01T23:00-0300 1593
2001 2015 1 1 24:00 -0200 2013-01-01T24:00-0200 946
2001 2015 1 2 01:00 -0200 2013-01-02T01:00-0200 271
2001 2015 1 2 02:00 -0200 2013-01-02T02:00-0200 665
3001 2015 1 1 22:00 -0350 2013-01-01T22:00-0350 548
3001 2015 1 1 23:00 -0350 2013-01-01T23:00-0350 936
3001 2015 1 1 24:00 -0350 2013-01-01T24:00-0350 1938
3001 2015 1 2 01:00 -0350 2013-01-02T01:00-0350 952
3001 2015 1 2 02:00 -0350 2013-01-02T02:00-0350 1584
4001 2015 1 1 22:00 -0400 2013-01-01T22:00-0400 1837
4001 2015 1 1 23:00 -0400 2013-01-01T23:00-0400 1275
4001 2015 1 1 24:00 -0400 2013-01-01T24:00-0400 382
4001 2015 1 2 01:00 -0400 2013-01-02T01:00-0400 837
4001 2015 1 2 02:00 -0400 2013-01-02T02:00-0400 592
5001 2015 1 1 22:00 -0500 2013-01-01T22:00-0500 392
5001 2015 1 1 23:00 -0500 2013-01-01T23:00-0500 15
5001 2015 1 1 24:00 -0500 2013-01-01T24:00-0500 403
5001 2015 1 2 01:00 -0500 2013-01-02T01:00-0500 993
5001 2015 1 2 02:00 -0500 2013-01-02T02:00-0500 1287
6001 2015 1 1 22:00 -0600 2013-01-01T22:00-0600 738
6001 2015 1 1 23:00 -0600 2013-01-01T23:00-0600 992
6001 2015 1 1 24:00 -0600 2013-01-01T24:00-0600 1392
6001 2015 1 2 01:00 -0600 2013-01-02T01:00-0600 189
6001 2015 1 2 02:00 -0600 2013-01-02T02:00-0600 1282
7001 2015 1 1 22:00 -0700 2013-01-01T22:00-0700 839
7001 2015 1 1 23:00 -0700 2013-01-01T23:00-0700 742
7001 2015 1 1 24:00 -0700 2013-01-01T24:00-0700 942
7001 2015 1 2 01:00 -0700 2013-01-02T01:00-0700 882
7001 2015 1 2 02:00 -0700 2013-01-02T02:00-0700 993
8001 2015 1 1 22:00 -0800 2013-01-01T22:00-0800 1140
8001 2015 1 1 23:00 -0800 2013-01-01T23:00-0800 1532
8001 2015 1 1 24:00 -0800 2013-01-01T24:00-0800 1834
8001 2015 1 2 01:00 -0800 2013-01-02T01:00-0800 1732
8001 2015 1 2 02:00 -0800 2013-01-02T02:00-0800 954
You can check out the lubridate package in R. The strptime function (which is part of base R rather than lubridate) would also be useful for your case.
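For Question 1, one possible base R sketch avoids string parsing of the timestamp entirely: treat the local standard clock time as if it were UTC, then remove the fixed offset. The helper name `to_utc` and its arguments are hypothetical, and the code assumes the offset is a string of the form sign, two hour digits, two minute digits (as in the UTCoffset column) and that hours run from 1 to 24.

```r
# Hypothetical helper: local standard time plus a fixed offset -> UTC
to_utc <- function(year, month, day, hour, offset) {
  # Offset such as "-0800": sign, then hours, then minutes
  sgn <- ifelse(substr(offset, 1, 1) == "-", -1, 1)
  off_sec <- sgn * (as.numeric(substr(offset, 2, 3)) * 3600 +
                    as.numeric(substr(offset, 4, 5)) * 60)
  # Start from midnight and add the hours, so hour 24 rolls over
  # to 00:00 of the next day without any special casing
  local_as_utc <- ISOdatetime(year, month, day, 0, 0, 0, tz = "UTC") +
    hour * 3600
  # local = UTC + offset, therefore UTC = local - offset
  local_as_utc - off_sec
}

# The example from the question: 01:00 at UTC-08:00
format(to_utc(2013, 1, 1, 1, "-0800"), "%Y-%m-%d %H:%M", tz = "UTC")
```

Because the result is built with tz = "UTC", it does not depend on the machine's time zone, so there is no need to set and unset the TZ environment variable. The function is vectorized, so whole columns can be passed in.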

Dealing with nonexistent data when converting to time-series in CRAN R

I have the following data set and I am trying to convert the consumption column to a time series. Some of the data are missing (e.g. there is no data for 10/2014).
year month consumption
2014 7 10617
2014 8 8318
2014 9 3199
2014 12 2066
2015 1 10825
2015 2 3096
2015 3 1665
2015 4 3651
2015 5 5807
2015 7 2951
2015 8 5885
2015 9 3653
2015 10 4266
2015 11 9706
When I use ts() in R, the missing months are filled with the wrong values: the later observations are shifted into the gaps and the data is recycled at the end.
ts(mkt$consumptions, start = c(2014,7),end=c(2015,11), frequency=12)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 10617 8318 3199 2066 10825 3096
2015 1665 3651 5807 2951 5885 3653 4266 9706 10617 8318 3199
My question is how to simply replace the nonexistent values with zero or blank?
"ts" class requires that the data be regularly spaced, i.e. every month should be present or NA but that is not the case here. The zoo package can handle irregularly spaced series. Read the input into zoo using the "yearmon" class for the year/month and then simply use it as a "zoo" series or else convert it to "ts". If the input is in a file but otherwise is exactly the same as in Lines then replace text = Lines with something like "myfile.dat" .
Lines <- "year month consumption
2014 7 10617
2014 8 8318
2014 9 3199
2014 12 2066
2015 1 10825
2015 2 3096
2015 3 1665
2015 4 3651
2015 5 5807
2015 7 2951
2015 8 5885
2015 9 3653
2015 10 4266
2015 11 9706"
library(zoo)
toYearmon <- function(y, m) as.yearmon(paste(y, m), "%Y %m")
z <- read.zoo(text = Lines, header = TRUE, index = 1:2, FUN = toYearmon)
as.ts(z)
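If zoo is not available, a similar result can be sketched in base R by placing the observed values on a complete monthly grid and leaving NA where a month is missing. The data frame below holds only the first five rows of the question's data, and the names `idx`, `full`, `vals`, and `x` are illustrative.

```r
# First five rows from the question; Oct and Nov 2014 are missing
mkt <- data.frame(year = c(2014, 2014, 2014, 2014, 2015),
                  month = c(7, 8, 9, 12, 1),
                  consumption = c(10617, 8318, 3199, 2066, 10825))

# Number each month consecutively so that gaps become visible
idx <- mkt$year * 12 + (mkt$month - 1)
full <- seq(min(idx), max(idx))

# Start with NA everywhere, then fill in the months that were observed
vals <- rep(NA_real_, length(full))
vals[match(idx, full)] <- mkt$consumption

# Recover the starting year and month from the smallest index
x <- ts(vals, start = c(min(idx) %/% 12, min(idx) %% 12 + 1), frequency = 12)
x
```

To use zeros instead of NA, `x[is.na(x)] <- 0` can be applied afterwards, although NA is usually the safer marker for months with no measurement.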
