Create warnings in R

I want to write a script that makes R usable for "everybody" for this particular kind of analysis. Is there a way to create warnings?
time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3
For example, if the value is 0 at least 3 times in a row (better: within a set period of time, e.g. 3 days), issue a warning and name the date(s). Ideally I could combine conditions and create something like a report.
In general: measurement data are read via read.csv, the date column is converted with as.POSIXct, and the data are stored as xts/zoo objects. I want the "user" to get a clear message if the values are changing, if they are 0 for a long time, etc.
A second step would be sending emails, perhaps running on a server later.
Additional question:
I now have the data as an xts object. Is it possible to check whether a value is greater than a threshold? A direct comparison is not working because the object is not an atomic vector.
Thanks

Try this.
x <- read.table(text = "time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3", header = TRUE, sep = ",")
if(any(rle(x$value)$lengths >= 3)) warning("I noticed some dates have value 0 at least three times.")
Warning message:
I noticed some dates have value 0 at least three times.
I'll leave it to you as a training exercise to paste a warning message that would also give you the date(s).
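For reference, here is one hedged sketch of how the date(s) could be worked into the warning (the rle/cumsum run bookkeeping is my own addition, not part of the original answer):

```r
x <- read.table(text = "time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3", header = TRUE, sep = ",")

r <- rle(x$value == 0)                      # runs of zero / non-zero values
ends   <- cumsum(r$lengths)                 # last index of each run
starts <- ends - r$lengths + 1              # first index of each run
long0  <- which(r$values & r$lengths >= 3)  # zero-runs of length >= 3

for (i in long0) {
  warning(sprintf("Value is 0 from %s to %s (%d days).",
                  x$time[starts[i]], x$time[ends[i]], r$lengths[i]))
}
```

With the sample data this warns about the run from 2012-01-02 to 2012-01-04.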

Related

In Surv(start_time, end_time, new_death) : Stop time must be > start time, NA created

I am using the package "survival" to fit a cox model with time intervals (intervals are 30 days long). I am reading the data in from an xlsx worksheet. I keep getting the error that says my stop time must be greater than my start time. The start values are all smaller than the stop values.
I checked to make sure these are being read in as numbers which they are. I also changed them to integers which did not solve the problem. I used this code to see if any observations met this criterion:
a <- a1[which(a1$end_time > a1$start_time),]
About half the dataset meets this criterion, but when I look at the data all the start times appear to be less than the end times.
Does anyone know why this is happening and how I can fix it? I am an R newbie so perhaps there is something obvious I don't know about?
model1<- survfit(Surv(start_time, end_time, censor) ~ exp, data=a1, weights = weight)
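One possibility worth ruling out (my own guess, not from the original post): if the times came out of the xlsx file as character strings rather than numbers, comparisons are lexicographic, so rows can look fine to the eye yet fail the start < stop check. A minimal illustration:

```r
# Character "numbers" compare alphabetically, not by value
start_chr <- c("5", "30", "100")
end_chr   <- c("30", "60", "200")

chr_ok <- end_chr > start_chr                           # lexicographic comparison
num_ok <- as.numeric(end_chr) > as.numeric(start_chr)   # true numeric comparison

chr_ok  # "30" > "5" is FALSE alphabetically, despite 30 > 5
num_ok  # all TRUE once converted with as.numeric()
```

Running str(a1) would show whether the columns are character/factor rather than numeric.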

Count duration of value in vector in R

I am trying to count the length of runs of a value in a vector such as
q <- c(1,1,1,1,1,1,4,4,4,4,4,4,4,4,4,4,4,4,6,6,6,6,6,6,6,6,6,6,1,1,4,4,4)
Actual vectors are longer than this, and are time based. What I would like is an output for 4 that tells me it occurred for 12 time steps (before the vector changes to 6) and then for 3 time steps (not that it occurred 15 times in total).
Currently my ideas to do this are pretty inefficient (a loop that looks element by element that I can have stop when it doesn't equal the value I specified). Can anyone recommend a more efficient method?
x <- with(rle(q), data.frame(values, lengths)) will pull the information that you want (courtesy of d.b. in the comments).
From the R Documentation: rle is used to "Compute the lengths and values of runs of equal values in a vector – or the reverse operation."
y <- x[x$values == 4, ] will subset the data frame to include only the value of interest (4). You can then see clearly that 4 ran for 12 times and then later for 3.
Modifying the code will let you check whatever value you want.
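Putting the pieces together as a runnable sketch (using the vector q from the question):

```r
q <- c(1,1,1,1,1,1,4,4,4,4,4,4,4,4,4,4,4,4,6,6,6,6,6,6,6,6,6,6,1,1,4,4,4)

# All runs with their values and run lengths
x <- with(rle(q), data.frame(values, lengths))

# Only the runs of 4: a run of 12, then later a run of 3
y <- x[x$values == 4, ]
y$lengths  # 12 3
```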

Using Loops to find the Difference between two Rows in the same Column

So I have 252 rows of data in column 4, and I would like to find the difference between each pair of consecutive rows throughout the entire column.
My current code is:
appleClose<-NULL
for (i in 1:Apple[1]){
appleClose[i] <- AA[i,4]
}
appleClose[]
I tried, and failed, with:
appleClose<-NULL
for (i in 1:Apple[1]){
appleClose[i] <- AA[i,4] - AA[i+1,4]
}
appleClose[]
Edit:
I am trying to optimize a stock market portfolio in retrospect.
AA is the ticker symbol for Apple. I downloaded that information through some R code written earlier in the program.
I have not yet checked out the diff function yet. I will do that now.
The error I am receiving is
Error in [.xts(AA, i + 1, 4) : subscript out of bounds
Is this what you mean?
> Apple=runif(5,1,10)
#5 numbers
> Apple
[1] 3.362267 2.489085 3.899513 5.591127 9.315716
#4 differences
> diff(Apple)
[1] -0.8731816 1.4104271 1.6916143 3.7245894
or depending on your data either
>diff(AA$Apple)
or maybe
>diff(AA[,4])
Another option (if this is what you are referring to; your question is not very clear):
AA[-1, 4] - AA[-dim(AA)[1], 4]
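A small self-contained sketch of the off-by-one issue behind the "subscript out of bounds" error, using a plain numeric vector as a stand-in for AA[, 4] (xts objects behave analogously, except that diff() on xts keeps a leading NA):

```r
# The loop fails at the last row because i + 1 runs past the end;
# stopping one element early (or using diff) avoids that.
close <- c(10, 12, 11, 15, 14)   # stand-in for AA[, 4]

manual <- numeric(length(close) - 1)
for (i in 1:(length(close) - 1)) {
  manual[i] <- close[i + 1] - close[i]   # difference between consecutive rows
}

stopifnot(identical(manual, diff(close)))  # diff() gives the same result in one call
```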

ROCR Plot using R

I have a CSV file (tab-delimited) that contains 2 columns and looks like this:
5 0
6 0
9 0
8 1
"+5000 lines similar lines"
I am attempting to create a ROC plot using ROCR.
This is what I have tried so far:
p<-read.csv(file="forROC.csv", head=TRUE, sep="\t")
pred<-prediction(p[1],p[2])
The second line gives me an error: Error in prediction(p[1], p[2]) : Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.
I am not sure what the error is. Is there something wrong with my CSV file?
My guess is that your array indexing isn't setup properly. If you read in that CSV file, you should expect a data.frame (think matrix or 2D array, depending on your background) with two columns and 5,000+ rows.
So your current calls to p[1] and p[2] aren't especially meaningful. You probably want to access the first and second columns of that data.frame, which you can do using the syntax p[,1] for the first column and p[,2] for the second.
The specific error you're encountering, however, is a complaint that the "truth" variable you're using isn't binary. It seems that your file is setup to have an output of 1 and 0, so this error may go away once you properly access your array. But if you encounter this in the future, just be sure to binarize your truth data before you use it. For instance:
p[,2] <- p[,2] != 0
Would set the values to FALSE if it's a zero, and TRUE for each non-zero cell in the column.
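A quick illustration of the indexing difference, using a tiny stand-in for the CSV data (base R only; the ROCR call itself is unchanged, you just pass the columns as vectors):

```r
p <- data.frame(score = c(5, 6, 9, 8), label = c(0, 0, 0, 1))

class(p[1])    # "data.frame": single-bracket indexing keeps a one-column frame
class(p[, 1])  # "numeric": the comma syntax drops to an atomic vector

# prediction(p[, 1], p[, 2]) is what ROCR expects: two plain vectors
```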

Calculate percentage over time on very large data frames

I'm new to R and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web-services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like the sample below; it's at least 20M observations retrieved from a Mongo database. Time is epoch and we're recording many thousands of rows per second, so time has a lot of duplicates; that's expected. There could be more than 50 URIs, but I only care about the top 50. The end result would be a line plot over time of % TCP_HIT relative to the total occurrences, by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset (using the order of factor levels, TCP_HIT=1, TCP_MISS=2 as alphabetical order is used by default), with ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
FUN=function(x) sum((2-as.numeric(x))/length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
function(l) list(uri=levels(u$uri)[l],
hits=ratio(u[as.numeric(u$uri) == l,])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
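An alternative sketch of the same idea that avoids the factor-level arithmetic (my own variant, not part of the answer above): encode hits as logical and let aggregate() take the mean per URI and time bucket, which is exactly the hit ratio.

```r
u <- data.frame(
  time   = c(1355683900, 1355683900, 1355683905, 1355683906),
  uri    = c("/some/uri", "/some/other/uri", "/some/other/uri", "/some/uri"),
  action = c("TCP_HIT", "TCP_HIT", "TCP_MISS", "TCP_MISS")
)

u$hit <- u$action == "TCP_HIT"   # TRUE/FALSE; the mean of this is the hit ratio
u$bin <- u$time %/% 10           # ten-second buckets, as in the answer above

ratios <- aggregate(hit ~ uri + bin, data = u, FUN = mean)
ratios   # one row per URI and bucket; the hit column is the % TCP_HIT
```

With the sample data, each URI gets a ratio of 0.5 in the single ten-second bucket.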
#MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package; it is noticeably faster for grouped operations at this scale.
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot. It's plotting the actual value; I haven't done any conversions.
for (i in 1:length(h)) {
  name <- unlist(h[[i]][1])
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type = "o")
  title(main = name)
}