Is there some way to detect 'wrong' measures in a dataframe? - r

I'm struggling with how to remove 'wrong' measurements from my dataset. I'm dealing with a fairly large table that contains a date and the size of a piece of equipment. The equipment can't get bigger with use; at most it stays the same size, so a measurement that increases has to be a measurement error.
My database is extensive and full of particular cases, which (among other business reasons) makes it impossible for me to post it here. So I use a small piece of the data as an example, but the problem is what I described above.
simplest_example <- data.frame(data1 = c("20-09-2020", "15-10-2020", "13-05-2021", "20-10-2021", "20-11-2021"),
                               measure = c(5, 4, 3, 5, 2))
# desired result (the wrong measure on 20-10-2021 removed):
#        data1 measure
# 1 20-09-2020       5
# 2 15-10-2020       4
# 3 13-05-2021       3
# 4 20-11-2021       2
The point is: select the largest possible non-increasing sequence, and exclude the values that prevent it from forming.
So I would like to ask for suggestions: if anyone here has come across something similar, please let me know how you approached it.

If I understand correctly, you want to detect any time the variable measure is greater than the value at the previous time point? I'd create a lag column, which is just the measure column lagged by one time step, then identify the rows where the current measure is greater than the previous one:
library(dplyr)
simplest_example %>%
  mutate(previous_measure = lag(measure)) %>%
  filter(previous_measure < measure)
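If the goal is to drop the offending rows rather than just flag them, here is a minimal sketch (my addition, not part of the original answer, using the same simplest_example data) that keeps only values not exceeding the running minimum:
library(dplyr)

simplest_example %>%
  # keep rows whose measure equals the running minimum so far
  # (a greedy rule, not guaranteed to yield the longest possible subsequence)
  filter(measure == cummin(measure))
Note that this greedy filter is not the same as finding the truly longest non-increasing subsequence, which is a separate longest-subsequence problem.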

Related

Finding the percentage of a specific value in the column of a data set

I have a dataset called college, and one of the columns is 'accepted'. There are two values for this column: 1 (the student was accepted) and 0 (the student was not accepted). I want to find the percentage of accepted students.
I did this...
table(college$accepted)
which gave me the frequency of 1 and 0 (1 = 44,224 and 0 = 75,166). I then manually added those two values together (119,390) and divided 44,224 by 119,390. This is fine and gets me the value I was looking for, but I would really like to know how I could do this with R code, since I'm sure there is a way to do it that I just haven't thought of.
Thanks!
Perhaps you can use prop.table like below
prop.table(table(college$accepted))["1"]
If it's a simple 0/1 column then you only need to take the column mean.
mean_accepted <- mean(college$accepted)
You could first sum the column, and then divide by the number of values in the column:
sum(college$accepted)/length(college$accepted)
To make the code more explicit and describe your intent better, I suggest using a condition to identify the cases that meet your criteria for inclusion. For example:
college$accepted == 1
Then take the mean of the logical vector to compute the proportion (between 0 and 1), and multiply by 100 to make it a percentage:
100 * mean(college$accepted == 1, na.rm = TRUE)
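For completeness, the prop.table output can also be scaled to percentages directly (a small sketch on the same column, not from the original answers):
100 * prop.table(table(college$accepted))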

Vectorizing R custom calculation with dynamic day range

I have a big dataset (around 100k rows) with 2 columns referencing a device_id and a date and the rest of the columns being attributes (e.g. device_repaired, device_replaced).
I'm building an ML algorithm to predict when a device will have to be maintained. To do so, I want to calculate certain features (e.g. device_reparations_on_last_3days, device_replacements_on_last_5days).
I have a function that subsets my dataset and returns a calculation:
For the specified device,
That happened before the day in question,
As long as there's enough data (e.g. if I want the last 3 days but only 2 records exist, this returns NA).
Here's a sample of the data and the function outlined above:
data = data.frame(device_id = c(rep(1, 5), rep(2, 10)),
                  day = c(1:5, 1:10),
                  device_repaired = sample(0:1, 15, replace = TRUE),
                  device_replaced = sample(0:1, 15, replace = TRUE))

# Example: how many times device 1 was repaired over the last 2 days before day 3
# => getCalculation(3, 1, data, "device_repaired", 2)
getCalculation <- function(fday, fdeviceid, fdata, fattribute, fpreviousdays){
  # Subset dataset: same device, strictly before fday, within the last fpreviousdays days
  df = subset(fdata, day < fday & day > (fday - fpreviousdays - 1) & device_id == fdeviceid)
  # Make sure there's enough data; if so, make the calculation
  if (nrow(df) < fpreviousdays) {
    calculation = NA
  } else {
    calculation = sum(df[, fattribute])
  }
  return(calculation)
}
My problem is that the number of available attributes (e.g. device_repaired) and of features to calculate (e.g. device_reparations_on_last_3days) has grown very quickly, and my script takes around 4 hours to execute, since I need to loop over each row and calculate all these features.
I'd like to vectorize this logic using some apply approach, which would also allow me to parallelize its execution, but I don't know if/how it's possible to pass these extra arguments to an lapply-style function.
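No answer is reproduced here, but one way to avoid the explicit row loop (a sketch built on the getCalculation function above; the attribute and window size below are just example values) is mapply, which takes the per-row arguments as vectors and the fixed arguments once via MoreArgs:
# fday and fdeviceid vary per row; the data, attribute and window are fixed
data$repaired_last_2days <- mapply(
  getCalculation,
  fday      = data$day,
  fdeviceid = data$device_id,
  MoreArgs  = list(fdata = data, fattribute = "device_repaired", fpreviousdays = 2)
)
parallel::mcmapply has the same interface, so the call above can be parallelized by swapping the function name and adding mc.cores.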

R - efficiently organize tables on condition over time

I'd like to know how to organize a data.frame into tables on conditions over time. I have a politics data set where certain organizations take a position on a bill and whether the bill passed or failed, over the last few decades.
I know how to organize the data individually into tables, but I do it one by one, and it's really hard to see the trends. The stackoverflow community always seems to have ingenious ways of grouping data. Here's some mock data:
Data <- data.frame(
  year = sample(1998:2004, 200, replace = TRUE),
  outcome = sample(0:1, 200, replace = TRUE),
  biz1 = sample(-2:2, 200, replace = TRUE),
  biz2 = sample(-2:2, 200, replace = TRUE),
  biz3 = sample(-2:2, 200, replace = TRUE)
)
In the biz columns, a negative number means they oppose the bill and a positive number means they support it. In outcome, a 0 means the law did not pass and a 1 means that it did.
I would like to use tables to see how each business has become more or less successful over time, by looking at how often their positive numbers line up with 1s and their negative numbers with 0s, compared to every other organization (and vice versa for the unsuccessful combinations).
A few notes
In the data set, I have about 100 businesses as columns, so I definitely need an efficient way to make the tables without naming every single column. I can select them in a range, like 125:300, since they are ordered together.
Of course I'm open to all ideas! Feel free to list any other ways of looking at this.
If I failed to ask this question right, please let me know how I could improve it.
The comments above about your question being too vague are right on target. Having said that, this interests me, and the vagueness leaves me free to interpret...
First, I'd recode the outcome as -1 if the bill fails. Then outcome * bizn is, in a sense, a success score for that business on that piece of legislation: positive if either a bill the business supported passed, or a bill the business opposed failed. Then there are several ways to visualize the scores. Here are just a few to get you started.
# re-code outcomes: -1 = failed, 1 = passed
Data$outcome <- ifelse(Data$outcome == 0, -1, 1)

library(reshape2)   # for melt(...)
library(ggplot2)

gg <- melt(Data, id = c("year", "outcome"),
           variable.name = "business", value.name = "support")
gg$score <- with(gg, outcome * support)   # score represents level of success

# mean success vs. year with +/- 1 sd
ggplot(gg, aes(x = year, y = score, color = business)) +
  stat_summary(fun.data = "mean_sdl") +
  stat_summary(fun.y = mean, geom = "line") +
  facet_grid(business ~ .)

# boxplot of success scores
ggplot(gg, aes(x = factor(year), y = score)) +
  geom_boxplot(aes(fill = business)) +
  facet_grid(business ~ .)

# barplot of success/failure frequencies
# excludes cases where a business did not take a position pro or con
gg.bar <- aggregate(score ~ year + business, gg,
                    function(eff) c(success = sum(eff > 0), failure = sum(eff < 0)))
gg.bar <- data.frame(gg.bar[1:2], gg.bar$score)

ggplot(gg.bar, aes(x = factor(year))) +
  geom_bar(aes(y = success, fill = "success"), stat = "identity") +
  geom_bar(aes(y = -failure, fill = "failure"), stat = "identity") +
  geom_hline(yintercept = 0, linetype = 2, color = "blue") +
  scale_fill_discrete(name = "", breaks = c("success", "failure")) +
  labs(x = "", y = "frequency") +
  facet_grid(business ~ .)
All of these represent rather simplistic ways of looking at the data. If this was a serious project I would probably run a principal components analysis on the businesses to identify groups of businesses that tend to support or oppose the same legislation. Then I'd run a cluster analysis on the principal components to identify groups of legislation that tend to attract the support or opposition of groups of businesses.
Another way to approach this would be to run a logistic regression on the outcomes using the support/opposition of the various businesses as predictors. This would tell you which businesses tend to be more influential.
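As a rough sketch of that last suggestion (my addition, not part of the original answer, using only the three mock biz columns), a logistic regression on whether the bill passed might look like this:
# outcome was recoded to -1/1 above, so convert it back to a 0/1 response
fit <- glm(I(outcome == 1) ~ biz1 + biz2 + biz3,
           data = Data, family = binomial)
summary(fit)   # larger coefficients (in absolute value) suggest more influential businesses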

Summarized huge data, How to handle it with R?

I am working with EBS Forex market Limit Order Book (LOB) data; here is an example of the LOB in a 100 millisecond time slice:
datetime | side (0=Bid, 1=Ask) | distance (1: best price, 2: 2nd best, etc.) | price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071
2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)
To summarize the data, they apply two rules (I have changed them a bit for simplicity):
If there is no change in the LOB on the Bid or Ask side, that side is not recorded. Look at the last line of the data: the millisecond field was 000 and is now 500, which means there was no change in the LOB on either side for 100, 200, 300 and 400 milliseconds (but that information is important for any calculation).
When the last (and only the last) price is removed from a given side of the order book, a single record is written with nothing in the price field. Again, there is no record for the whole LOB at that time.
Example: 2008/01/28,09:11:28.800,0,1,
I want to calculate minAsk - maxBid (1.6067 - 1.6066) or a weighted average price (using the sizes at all distances as weights; there is a size column in my real data). I want to do this for my whole dataset. But as you can see, the data has been summarized, and this is not routine. I have written code to reproduce the whole data (not just the summary). This is fine for a small dataset, but for a large one I end up creating a huge file. I was wondering if you have any tips on how to handle the data, i.e. how to fill the gaps efficiently.
You did not give a great reproducible example so this will be pseudo/untested code. Read the docs carefully and make adjustments as needed.
I'd suggest you first filter and split your data into two data.frames:
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)
Then, for each of these two data.frames, use findInterval to compute the corresponding best ask or best bid:
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]
(for this to work you might have to transform date/time into a linear measure, e.g. time in seconds since market opening.)
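For instance, a minimal sketch of that conversion (assuming the CSV is read into columns named date, time, side, distance and price; those names are my assumption, not part of the original answer):
# overwrite the time column with numeric seconds so findInterval() can use it
best.bid$time <- as.numeric(as.POSIXct(paste(best.bid$date, best.bid$time),
                                       format = "%Y/%m/%d %H:%M:%OS", tz = "UTC"))
best.ask$time <- as.numeric(as.POSIXct(paste(best.ask$date, best.ask$time),
                                       format = "%Y/%m/%d %H:%M:%OS", tz = "UTC"))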
Then it should be easy:
min.spread <- min(c(best.bid$ask - best.bid$price,
best.ask$bid - best.ask$price))
I'm not sure I understand the end of day particularity but I bet you could just compute the spread at market close and add it to the final min call.
For the weighted average prices, use the same idea but instead of the two best.bid and best.ask data.frames, you should start with two weighted.avg.bid and weighted.avg.ask data.frames.
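A rough sketch of how those could start (my assumption, not part of the original answer, using aggregate, the size column mentioned in the question, and a numeric time column):
bids <- subset(data, side == 0)
# total price*size and total size per timestamp; their ratio is the size-weighted bid price
weighted.avg.bid <- aggregate(cbind(wap = price * size, size) ~ time, data = bids, FUN = sum)
weighted.avg.bid$wap <- weighted.avg.bid$wap / weighted.avg.bid$size
The ask side follows the same pattern with side == 1, and the two results can then be matched up with findInterval as above.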

Calculate percentage over time on very large data frames

I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations I'm retrieving from a Mongo database. Time is epoch, and we're recording many thousands of observations per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs, but I only care about the top 50. The end result would be a line plot over time of % TCP_HIT relative to the total occurrences, by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset (using the order of factor levels, TCP_HIT=1, TCP_MISS=2 as alphabetical order is used by default), with ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
                               FUN = function(x) sum((2 - as.numeric(x)) / length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
       function(l) list(uri = levels(u$uri)[l],
                        hits = ratio(u[as.numeric(u$uri) == l, ])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
#MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
For data of this size, data.table is simply faster.
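As a rough sketch of that idea (my addition, not from the original answers; column names come from the sample data above, and the 10-second bucket is an assumption):
library(data.table)

dt <- as.data.table(u)

# keep only the 50 most frequent URIs
top50 <- dt[, .N, by = uri][order(-N)][1:50, uri]

# % TCP_HIT per URI per 10-second bucket
hit.pct <- dt[uri %in% top50,
              .(hit_pct = 100 * mean(action == "TCP_HIT")),
              by = .(uri, bucket = time %/% 10)]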
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine here, but this is what I came up with. It makes a basic line plot. It's plotting the actual value; I haven't done any conversions.
for (i in 1:length(h)) {
  name <- unlist(h[[i]][1])                           # the uri for this list element
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))  # its table of time buckets and hit ratios
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type = "o")
  title(main = name)
}
