Applying a function to a time series in an R data frame

I'm trying to apply a function to a column in a dataframe that contains dates and keep getting an error. I'm not exactly sure what I'm doing wrong.
My dataframe:
dates total
1 2014-12-08 01:10:00 163.7
2 2014-12-08 01:10:00 163.9
3 2014-12-08 01:12:00 163.6
4 2014-12-08 08:27:00 163.0
5 2014-12-08 08:35:00 163.7
6 2014-12-08 08:39:00 162.4
I want to replace the dates with either 'morning' or 'night', or alternatively create a new column containing 'morning' or 'night'. The approach I took involved unclassing the date so I could get the hour. I defined night as before 4am or after 5pm. I put this in a function called timeofday:
timeofday <- function(x) {
  bmk <- unclass(x)
  if (bmk$hour < 4) {
    return("night")
  } else if (bmk$hour > 17) {
    return("night")
  } else {
    return("morning")
  }
}
I then did the following:
timeofday(df$dates)
Warning message:
In if (bmk$hour < 4) { :
the condition has length > 1 and only the first element will be used
Any help on identifying the issue would be greatly appreciated.

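You could also use cut, as in:
cut((unclass(x)$hour - 4) %% 24, c(-1, 13, 23), c('morning', 'night'))
(Note that you have to shift your frame of reference so that you don't have two 'night' categories with this solution. After subtracting 4 and wrapping modulo 24, hours 4 to 17 map to 0 to 13 and the night hours map to 14 to 23.)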

Your code contains this if statement:
if (bmk$hour < 4)
An if statement expects a single TRUE or FALSE. In your case bmk$hour is a vector with one element per row, so only its first element is tested, which is exactly what the warning says.
A workaround is to call the function once per element:
sapply(df$dates, timeofday)
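A vectorized alternative avoids the per-element loop altogether. The sketch below is not from the answers above: the name timeofday_vec and the three sample timestamps are made up for illustration, and it assumes the column is a date-time that as.POSIXlt() can convert.

```r
# Hypothetical vectorized variant of the question's timeofday().
# as.POSIXlt() exposes the hour of every element at once, and ifelse()
# evaluates the condition for the whole vector instead of one value.
timeofday_vec <- function(x) {
  hour <- as.POSIXlt(x)$hour
  ifelse(hour < 4 | hour > 17, "night", "morning")
}

# Illustrative input (the third timestamp is not in the question's data)
dates <- as.POSIXct(
  c("2014-12-08 01:10:00", "2014-12-08 08:27:00", "2014-12-08 19:05:00"),
  tz = "UTC"
)
result <- timeofday_vec(dates)
print(result)
```

With a data frame like the one in the question, the result could be stored as a new column, for example df$period <- timeofday_vec(df$dates).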

Related

R convert data into a vector

I have an object (a Seurat object) and I need to get certain data out of it:
> sc@misc[["colors"]][["seurat_clusters"]]
0 1 2 3 4 5 6 7
"#CC0C00FF" "#5C88DAFF" "#84BD00FF" "#FFCD00FF" "#7C878EFF" "#00B5E2FF" "#00AF66FF" "#CC0C00B2"
This data is needed as a vector, but I don't know how to pull "#CC0C00FF" "#5C88DAFF" etc. out of it.
In order to hand this data to the next function, the result should look like this:
> vec
[1] "#CC0C00FF" "#5C88DAFF" "#84BD00FF"
Thanks in advance!

Solved it! I'm pretty disappointed in myself, because I didn't know this function existed:
> as.vector(sc@misc[["colors"]][["seurat_clusters"]])
[1] "#CC0C00FF" "#5C88DAFF" "#84BD00FF" "#FFCD00FF" "#7C878EFF" "#00B5E2FF" "#00AF66FF" "#CC0C00B2"
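What as.vector does here is drop the names (the cluster labels 0, 1, 2, ...) from a named character vector. The sketch below reproduces that on a small stand-in vector, since a real Seurat object is not available here; the variable cols is an assumption standing in for sc@misc[["colors"]][["seurat_clusters"]].

```r
# Stand-in for the named colour vector stored in the Seurat object
cols <- c("0" = "#CC0C00FF", "1" = "#5C88DAFF", "2" = "#84BD00FF")

# as.vector() strips all attributes of an atomic vector, including names
vec <- as.vector(cols)

# unname() removes only the names and states the intent more directly
vec2 <- unname(cols)

print(vec)
```

Both calls give the same plain character vector; unname() may be the clearer choice when dropping the names is the only goal.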

subset function returns all rows

I recently reverted to R version 3.1.3 for compatibility reasons and am now encountering an unexplained error with the subset function.
I want to extract all rows for the gene "Migut.A00003" from the data frame transcr_effects using the gene name as listed in the data frame expr_mim_genes. (this will later become a loop). This action always returns all rows instead of specific rows I am looking for, no matter the formatting of the subset lookup:
> class(expr_mim_genes)
[1] "data.frame"
> sapply(expr_mim_genes, class)
gene longest.tr pair.length
"character" "logical" "numeric"
> head(expr_mim_genes)
gene longest.tr pair.length
1 Migut.A00003 NA 0
2 Migut.A00006 NA 0
3 Migut.A00007 NA 0
4 Migut.A00012 NA 0
5 Migut.A00014 NA 0
6 Migut.A00015 NA 0
> class(transcr_effects)
[1] "data.frame"
> sapply(transcr_effects, class)
pair gene
"character" "character"
> head(transcr_effects)
pair gene
1 pair1 Migut.N01020
2 pair10 Migut.A00351
3 pair1000 Migut.F00857
4 pair10007 Migut.D01637
5 pair10008 Migut.A00401
6 pair10009 Migut.G00442
. . .
7168 pair3430 Migut.A00003
. . .
The gene I am interested in:
> expr_mim_genes[1,"gene"]
[1] "Migut.A00003"
R sees these two terms as equivalent:
> expr_mim_genes[1,"gene"] == "Migut.A00003"
[1] TRUE
If I type in the name of the gene manually, the correct number of rows are returned:
> nrow(subset(transcr_effects, transcr_effects$gene=="Migut.A00003"))
[1] 1
> subset(transcr_effects, transcr_effects$gene=="Migut.A00003")
pair gene
7168 pair3430 Migut.A00003
However, this should return one row from the data frame, but it returns all rows:
> nrow(subset(transcr_effects, transcr_effects$gene == (expr_mim_genes[1,"gene"])))
[1] 10122
I have a feeling this has something to do with text formatting, but I've tried everything and haven't been able to figure it out. I've seen this issue with quoted vs. unquoted entries, but that does not appear to be the issue here (see the equality check above).
I didn't have this problem before switching to R v.3.1.3, so maybe it is a version convention I am unaware of?
EDIT:
This is driving me crazy, but at least I think I have found a patch. There was quite a bit of data and file processing to get to this point in the code, involving loading at least 4 files. I've tried taking snippets of each file to post a reproducible example here, but sometimes when I analyze the snippets the error recurs, sometimes it does not (!!). After going through the process though, I discover that:
i = 1
gene = expr_mim_genes[i,"gene"]
> nrow(subset(transcr_effects, gene == gene))
[1] 10122
> nrow(subset(transcr_effects, gene == (expr_mim_genes[i,"gene"])))
[1] 1
I still can't explain this behavior of the code, but at least I know how to work around it.
Thanks all.
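A likely explanation for the behavior in the EDIT, offered as a sketch rather than a confirmed diagnosis of the original session: subset() evaluates its condition inside the data frame first, so in gene == gene both sides refer to the gene column and the comparison is TRUE for every row. The three-row data frame below is a made-up excerpt of transcr_effects, and target_gene is a hypothetical variable name.

```r
# Made-up three-row excerpt of transcr_effects from the question
transcr_effects <- data.frame(
  pair = c("pair1", "pair10", "pair3430"),
  gene = c("Migut.N01020", "Migut.A00351", "Migut.A00003"),
  stringsAsFactors = FALSE
)

gene <- "Migut.A00003"

# Both `gene`s resolve to the column, so every row matches
n_all <- nrow(subset(transcr_effects, gene == gene))

# A variable name that is not a column name is looked up outside the data frame
target_gene <- gene
n_one <- nrow(subset(transcr_effects, gene == target_gene))

# Plain logical indexing has no such ambiguity
n_idx <- nrow(transcr_effects[transcr_effects$gene == gene, ])

print(c(n_all, n_one, n_idx))
```

If this is the cause, it would not depend on the R version; renaming the loop variable so it does not clash with a column name is enough.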

Convert periods (hundredth of a second) r

I am trying to convert a vector of the following form:
data$Time[1:10]
[1] 0:00.00 0:00.01 0:00.02 0:00.03 0:00.04 0:00.05 0:00.06 0:00.07 0:00.08 0:00.09
573394 Levels: 0:00.00 0:00.01 0:00.02 0:00.03 0:00.04 0:00.05 0:00.06 0:00.07 0:00.08 0:00.09 0:00.10 0:00.11 0:00.12 0:00.13 0:00.14 ... 9:59.99
Note that this is a factor:
class(data$Time)
factor
I've tried the following
hms(data$Time[1:10])
[1] "0S" "1S" "2S" "3S" "4S" "5S" "6S" "7S" "8S" "9S"
It treats each hundredth of a second as a whole second! The same thing happens with
period_to_seconds(hms(data$Time[1:10]))
[1] 0 1 2 3 4 5 6 7 8 9
I need to be able to extract the time (with the required accuracy) so that I can subtract values and calculate periods. Note that these files will extend to a few hours, so a solution that also works for HH:MM:SS.00 would be appreciated.
Another approach, which only works if the data is either all H:M:S or all M:S, is the following:
Test <- c('03:5.05', '1:03.05.05')
tmp <- strptime(as.character(Test),"%H:%M:%OS")
tmp
[1] NA NA
tmp <- strptime(as.character(Test),"%M:%OS")
tmp
[1] "2016-04-30 00:03:05.05 CDT" "2016-04-30 00:01:03.05 CDT"
(The hours had to be removed)

## set option to use digits for seconds
options(digits.secs = 2)
## convert your factor to a string and then to Posix format
tmp <- strptime(as.character(data$Time),'%H:%M:%OS')
## convert it to a numeric (unit seconds)
as.numeric(strftime(tmp,'%OS'))+60*as.numeric(strftime(tmp,'%M'))+60*60*as.numeric(strftime(tmp,'%H'))

The lubridate package has an ms() function that reads only minutes and seconds.
Test <- c('0:00.02', '9:59.99')
library(lubridate)
period_to_seconds(ms(Test))
[1] 0.02 599.99

Based on Jorg's answer, I think I was able to solve my problem. The files I am working with extend for a few hours (with each point representing 0.01 sec), so I split the vector (data$Time) and applied the M:S script to the first 360000 points and the H:M:S script to what follows:
options(digits.secs = 2)
tmp1 <- strptime(as.character(data$Time[1:360000]),"%M:%OS")
tmp2 <- strptime(as.character(data$Time[-(1:360000)]),"%H:%M:%OS")
tmp1_numeric <-as.numeric(strftime(tmp1,'%OS'))+60*as.numeric(strftime(tmp1,'%M'))+60*60*as.numeric(strftime(tmp1,'%H'))
tmp2_numeric <-as.numeric(strftime(tmp2,'%OS'))+60*as.numeric(strftime(tmp2,'%M'))+60*60*as.numeric(strftime(tmp2,'%H'))
tmp_numeric <- c(tmp1_numeric, tmp2_numeric)
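The split at point 360000 can be avoided by counting the colon-separated fields of each value. The base R sketch below is an alternative to the strptime approach, not the method used in the answers; the helper name to_seconds and the three sample values are made up for illustration.

```r
# Hypothetical helper: converts "M:SS.ss" or "H:MM:SS.ss" strings (or a
# factor of them) to seconds. The rightmost field is seconds, the next one
# minutes, the next one hours.
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    # weights are 60^0 for seconds, 60^1 for minutes, 60^2 for hours
    sum(p * 60^((length(p) - 1):0))
  }, numeric(1))
}

# Illustrative input mixing both layouts
times <- factor(c("0:00.02", "9:59.99", "1:03:05.05"))
secs <- to_seconds(times)
print(secs)
```

Because the result is a plain numeric vector of seconds, differences between time points can be taken with ordinary subtraction or diff().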

Warning message when using read.zoo function

I have a data frame (df) that contains daily stock index prices covering over 4000 days. It looks like:
Date Prices
1986-1-1 20
. .
. .
. .
. .
2001-08-31 40
I am trying to convert the data frame into a zoo object using read.zoo(df) (read.zoo is a function in the zoo package). However, it gives me the following warning:
Warning message:
In zoo(rval3, ix) :
some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
This affects the subsequent code I apply to the object.
For reproducibility, the original data (FTSE100jensen.csv) and code (JensenPaper.R) are available at https://github.com/ahmedfsalhin/1stpaper

The problem is that you called read.zoo() without providing a value for format=, but your dates are formatted like "%d/%m/%Y", not "%Y-%m-%d".
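One way to check this diagnosis before building the zoo object is to parse the date column with the explicit format and confirm that nothing became NA or duplicated. This is a base R sketch on made-up rows; the column names follow the question, and the values are illustrative only.

```r
# Made-up rows with day/month/year strings, as described in the answer above
df <- data.frame(
  Date = c("01/01/1986", "02/01/1986", "03/01/1986"),
  Prices = c(1412.6, 1420.5, 1429.8),
  stringsAsFactors = FALSE
)

# Parse with the format that matches the strings
df$Date <- as.Date(df$Date, format = "%d/%m/%Y")

# A zoo index should have no missing and no repeated entries
has_na <- anyNA(df$Date)
has_dups <- anyDuplicated(df$Date) > 0

print(format(df$Date))
```

read.zoo also takes a format argument, so passing format = "%d/%m/%Y" to it on the unparsed data frame is another option.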

I'm not quite sure why this error was occurring, but I first converted Date to the Date class and was able to call read.zoo without error using this:
options(stringsAsFactors=FALSE)
library(zoo)
##
Data <- read.csv(
"F:/gitData.csv",
header=TRUE)
#
Data$Date <- as.Date(
Data$Date,
"%d/%m/%Y")
##
zData <- read.zoo(Data)
##
> head(zData)
Open High Low Close Volume Adj.Close
1986-01-01 1412.6 1412.6 1412.6 1412.6 0 1412.6
1986-01-02 1412.6 1420.8 1412.0 1420.5 0 1420.5
1986-01-03 1420.5 1430.0 1419.6 1429.8 0 1429.8
1986-01-06 1429.8 1436.3 1424.1 1424.1 0 1424.1
1986-01-07 1419.8 1419.8 1411.6 1415.2 0 1415.2
1986-01-08 1415.2 1419.3 1400.3 1404.2 0 1404.2
and everything seems to be in order, e.g. I can call zoo methods properly, etc.
> plot(zData)
To address the comments above, the error message does seem to indicate that there are duplicated dates, but this is not the case:
> dim(Data)
[1] 4088 7
> length(unique(Data$Date))
[1] 4088

How do I change the index in a csv file to a proper time format?

I have a CSV file of 1000 daily prices
They are of this format:
1 1.6
2 2.5
3 0.2
4 ..
5 ..
6
7 ..
.
.
1700 1.3
The index runs from 1 to 1700.
But I need to specify a begin date and an end date this way:
The start period is, let's say, 25 January 2009,
and the last (1700th) value corresponds to 14 May 2013.
So far I've gotten this close to solving the problem:
> dseries <- ts(dseries[,1], start = ??time??, freq = 30)
How do I go about this? thanks
UPDATE:
I managed to create a separate object with dates as suggested in the answers and plotted it, but the y axis is weird, as shown in the screenshot.

Something like this?
as.Date("25-01-2009",format="%d-%m-%Y") + (seq(1:1700)-1)
A better way, thanks to @AnandaMahto:
seq(as.Date("2009-01-25"), by="1 day", length.out=1700)
Plotting:
df <- data.frame(
myDate=seq(as.Date("2009-01-25"), by="1 day", length.out=1700),
myPrice=runif(1700)
)
plot(df)
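One thing worth checking with this approach: 1700 consecutive calendar days starting on 2009-01-25 do not end on 2013-05-14. The sketch below only does the date arithmetic. Whether the original data skips weekends or holidays is an assumption the question does not confirm.

```r
start <- as.Date("2009-01-25")
end <- as.Date("2013-05-14")

# One date per calendar day, 1700 values in total
dates <- seq(start, by = "1 day", length.out = 1700)
last_date <- format(dates[1700])

# Number of calendar days actually spanned by the stated start and end
span_days <- as.numeric(end - start)

print(last_date)
print(span_days)
```

The stated range spans fewer calendar days than there are observations, so the prices are probably not one per calendar day, and the real dates would have to come from the data source rather than from a regular sequence.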

R stores Date-classed objects as the integer offset from "1970-01-01", but the as.Date.numeric function needs an offset ('origin'), which can be any starting date:
rDate <- as.Date.numeric(dseries[,1], origin="2009-01-24")
Testing:
> rDate <- as.Date.numeric(1:10, origin="2009-01-24")
> rDate
[1] "2009-01-25" "2009-01-26" "2009-01-27" "2009-01-28" "2009-01-29"
[6] "2009-01-30" "2009-01-31" "2009-02-01" "2009-02-02" "2009-02-03"
You don't need to add the extension .numeric, since R automatically dispatches to that function if you use the generic stem, as.Date, with an integer argument. I just put it in because as.Date.numeric has different arguments than as.Date.character.
