Alternative to for loop and indexing? - r

I have a large data set of 3 columns, Order, Discharge, Date (numeric). There are 20 years of daily Discharge values for each Order, which can extend beyond 100.
> head(dat)
Order Discharge date
1 0.04712 6574
2 0.05108 6574
3 0.00000 6574
4 0.00000 6574
5 3.54100 6574
6 3.61500 6574
For a given Order x, I would like to replace the Discharge value with the average of the Discharge at x+1 and x-1 for that date. I have been doing this in a crude manner with a for loop and indexing, but it takes over an hour to process. I know there has to be a better way.
x <- 4
for(i in min(dat[,3]):max(dat[,3]))
dat[,2][dat[,3] == i & dat[,1] == x ] <-
mean(c(dat[,2][dat[,3] == i & dat[,1] == x + 1],
dat[,2][dat[,3] == i & dat[,1] == x - 1]))
Gives
> head(dat)
Order Discharge date
1 0.04712 6574
2 0.05108 6574
3 0.00000 6574
4 1.77050 6574
5 3.54100 6574
6 3.61500 6574
Where the Discharge at Order 4, for date 6574 has been replaced with 1.77050. It works, but it's ridiculously slow.
I should specify that I don't need to do this calculation on every Order, but only a select few (only 8 out of a total of 117). Based on the answer, I have the following.
dat$NewDischarge <- by(dat$Discharge,dat$date,function(x)
colMeans(cbind(c(x[-1],NA), x,
c(NA, x[-length(x)])), na.rm=T))
I am still trying to find a way to calculate values only for the selected Orders, and I am stuck in the rut of a for loop and indexing on date and Order.

I would approach it as follows:
1. Ensure that Order is a factor.
2. For each Order, you now have a sub-problem:
2.1. Sort the sub-data-frame by date.
2.2. Each Discharge mean can be produced in a vectorized way as:
rowMeans(cbind(c(Discharge[-1], NA), Discharge, c(NA, Discharge[-length(Discharge)])))
3. The sub-problem can be handled with a simple for loop or the function by. I would prefer by.
4. Your data has been rearranged, but you can easily reorder it.
For point 2.2, imagine it (or try it) with a simple vector and see the effects of the cbind operation. It also forces you to consider the edge cases: how are the first and last Discharge values calculated (they have no preceding or following dates)?
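To try point 2.2 on a simple vector, here is a minimal sketch. The toy values are made up, and it uses rowMeans so that there is one mean per element. Passing na.rm = TRUE is only one possible way to treat the first and last values, so take that part as an assumption rather than the intended edge handling.

```r
# Toy vector standing in for one Order's Discharge values, sorted by date
Discharge <- c(10, 20, 30, 40)
n <- length(Discharge)

# One row per element: the next value, the value itself, the previous value
neighbours <- cbind(nxt  = c(Discharge[-1], NA),
                    cur  = Discharge,
                    prev = c(NA, Discharge[-n]))

# Row-wise mean; na.rm = TRUE averages only the neighbours that exist
smoothed <- rowMeans(neighbours, na.rm = TRUE)
smoothed
# [1] 15 20 30 35
```

Dropping na.rm = TRUE would instead leave the first and last elements as NA, which may be preferable if a partial average is not meaningful.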

There are several ways to solve your particular dilemma, but the basic question to ask when confronted with a slow for loop is, "How do I use vectorization to replace this loop?" (Well, maybe you should ask "Should I...?" first.) In your case, you're looping across dates, but there's no need to explicitly do that, since just grabbing all of the rows where dat$Order==x will implicitly grab all the dates.
The dataset you posted only has one date, but I can generate some fake data to illustrate:
generate.data <- function(n.order, n.date){
  dat <- expand.grid(Order=seq_len(n.order), date=seq_len(n.date))
  dat$Discharge <- rlnorm(n.order * n.date)
  dat[, c("Order", "Discharge", "date")]
}
dat <- generate.data(10, 5)
head(dat)
# Order Discharge date
# 1 1 2.1925563 1
# 2 2 0.4093022 1
# 3 3 2.5525497 1
# 4 4 1.9274013 1
# 5 5 1.1941986 1
# 6 6 1.2407451 1
tail(dat)
# Order Discharge date
# 45 5 1.4344575 5
# 46 6 0.5757580 5
# 47 7 0.4986190 5
# 48 8 1.2076292 5
# 49 9 0.3724899 5
# 50 10 0.8288401 5
Here's all the rows where dat$Order==4, across all dates:
dat[dat$Order==4, ]
# Order Discharge date
# 4 4 1.9274013 1
# 14 4 3.5319072 2
# 24 4 0.2374532 3
# 34 4 0.4549798 4
# 44 4 0.7654059 5
You can just take the Discharge column, and you'll have the left-hand side of your assignment:
dat[dat$Order==4, ]$Discharge
# [1] 1.9274013 3.5319072 0.2374532 0.4549798 0.7654059
Now you just need the right side, which has two components: the x-1 discharges and the x+1 discharges. You can grab these the same way you grabbed the x discharges:
dat[dat$Order==4-1, ]$Discharge
# [1] 2.5525497 1.9143963 0.2800546 8.3627810 7.8577635
dat[dat$Order==4+1, ]$Discharge
# [1] 1.1941986 4.6076114 0.3963693 0.4190957 1.4344575
To obtain the new values, you need the parallel mean. R doesn't have a pmean function, but you can cbind these and take the rowMeans:
rowMeans(cbind(dat[dat$Order==4-1, ]$Discharge, dat[dat$Order==4+1, ]$Discharge))
# [1] 1.8733741 3.2610039 0.3382119 4.3909383 4.6461105
So, in the end you have:
dat[dat$Order==4, ]$Discharge <- rowMeans(cbind(dat[dat$Order==4-1, ]$Discharge,
dat[dat$Order==4+1, ]$Discharge))
You can even use %in% to make this work across all of your x values.
Note that this assumes your data is ordered.
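As a rough sketch of the %in% idea, the following replaces several Orders at once. The vector xs is a hypothetical choice of Orders, and the code assumes that every x - 1 and x + 1 exists for every date and that the rows are sorted by date and then Order, so that the three selections line up element by element.

```r
# Fake data in the same shape as above: one row per Order per date
set.seed(1)
dat <- expand.grid(Order = 1:10, date = 1:5)
dat$Discharge <- rlnorm(nrow(dat))

# Hypothetical set of Orders to replace
xs <- c(4, 7)

# Sort so that, within each date, Orders ascend
dat <- dat[order(dat$date, dat$Order), ]

# Keep the untouched values so replaced Orders never feed into each other
original <- dat$Discharge

lower <- original[dat$Order %in% (xs - 1)]  # discharges at x - 1
upper <- original[dat$Order %in% (xs + 1)]  # discharges at x + 1

# Parallel mean of the two neighbours, written back to the selected Orders
dat$Discharge[dat$Order %in% xs] <- (lower + upper) / 2
```

Because both neighbour vectors are read from the saved copy, the result does not depend on the order in which the Orders in xs are listed.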

Related

How to use if else statement in a dataframe when comparing dates?

I have a dataframe D and I would want to calculate a daily return of "Close" only if they share the same month. So for example there would be 0 for 1995-08-01
Date Close Month
1 1995-07-27 163.32 1995-07
2 1995-07-28 161.36 1995-07
3 1995-07-30 162.91 1995-07
4 1995-08-01 162.95 1995-08
5 1995-08-02 162.69 1995-08
I am trying to use an if-else statement and looping to apply it on other dataframes.
D1 <- D[-1,]
for (i in c("Close"))
{ TT <- dim(D)[1]
if (D[1:(TT-1),"Month"] == D[2:TT,"Month"]) {
D1[,i] = round((100*(log(D[2:TT,i]/D[1:(TT-1),i]))), digits = 4)
}
else {
D1[i] = 0 }
}
I get the results below, but the fourth row should be 0.0000 because it is from a different month than the third row. Moreover, I get this warning message: "Warning message: In if (D[1:(TT - 1), "Month"] == D[2:TT, "Month"]) { : the condition has length > 1 and only the first element will be used". Can you please help me out? Thank you.
Date Close Month
1 1995-07-27 0.5903 1995-07
2 1995-07-28 1.4577 1995-07
3 1995-07-30 0.9139 1995-07
4 1995-08-01 0.0006 1995-08
5 1995-08-02 0.0255 1995-08
Next time you should REALLY provide a reproducible example; here I did it for you. My solution uses diff and ifelse as requested.
month <- c(1,1:5,5:6)
data <- (1:8)*(1:8)
df <- data.frame(cbind(month, data))
diffs <- sapply(df, diff)
diffs <- data.frame(rbind(NA, diffs))
df$result <- ifelse(diffs$month==0, diffs$data, 0)
df
month data result
1 1 1 NA
2 1 4 3
3 2 9 0
4 3 16 0
5 4 25 0
6 5 36 0
7 5 49 13
8 6 64 0
if() expects a single value (usually TRUE or FALSE, but it also accepts other single values, e.g., it treats 0 as FALSE and any other number as TRUE). You are feeding in a vector of values. The warning message is telling you that it is ignoring all the values of the vector except the first, which is usually a strong indication that your code is not doing what you intend it to do. (From R 4.2.0 onward, a condition of length greater than one is an error rather than a warning.)
Here's one do-it-yourself approach with no loops (I'm sure some time-series package has a function to calculate returns):
# create your example dataset
D <- data.frame(
Date = (as.Date("1995-07-27") + 0:6)[-c(3,5)],
Close = 162 + c(1.32, -.64, .91, .95, .69)
)
# get lagged values as new columns
D$Close_lag <- dplyr::lag(D$Close)
D$Date_lag <- dplyr::lag(D$Date)
# calculate all returns
D$return <- D$Close / D$Close_lag - 1
# identify month switches
D$new_month <- lubridate::month(D$Date) != lubridate::month(D$Date_lag)
# replace returns with zeros when month switches
D[!is.na(D$return) & D$new_month==TRUE, "return"] <- 0
# print results
D
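For comparison, here is a base-R sketch that keeps the log-return formula from the question and uses ifelse for the month check. The data frame is retyped from the question's table, the column name Return is made up for illustration, and leaving the first row as NA (it has no previous day) is an assumption.

```r
# Example data retyped from the question
D <- data.frame(
  Date  = as.Date(c("1995-07-27", "1995-07-28", "1995-07-30",
                    "1995-08-01", "1995-08-02")),
  Close = c(163.32, 161.36, 162.91, 162.95, 162.69)
)
D$Month <- format(D$Date, "%Y-%m")
n <- nrow(D)

# Compare each row's month with the previous row's month (length n - 1)
same_month <- D$Month[-1] == D$Month[-n]

# Log returns in percent between consecutive rows (length n - 1)
log_ret <- round(100 * log(D$Close[-1] / D$Close[-n]), digits = 4)

# ifelse() is vectorized, unlike if(): keep the return within a month, else 0
D$Return <- c(NA, ifelse(same_month, log_ret, 0))
D
```

Row 4 (1995-08-01) gets 0 because its month differs from that of row 3.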

If (condition), add 1 to previous value, else, subtract 1

I'm tracking Meals and satiety in a dataframe. I would like to have R add 1 to the previous value in the satiety column when a meal is eaten, and subtract 1 when no meal is eaten (meal=NA).
I'm trying to accomplish this with a for loop nested in an ifelse statement but it is not working.
My current attempt:
ifelse(Meals=="NA",for (i in 1:length(Day$Fullness)){
print(Day$Fullness[[i]]-1+i)}, for (i in 1:length(Day$Fullness)){
print(Day$Fullness[[i]]+1+i)}
Error: Error in ans[test & ok] <- rep(yes, length.out = length(ans))
[test & ok] :
replacement has length zero
In addition: Warning message:
In rep(yes, length.out = length(ans)) :
'x' is NULL so the result will be NULL
I'm not sure how to create a table on here but I will do my best to make sense.
Time: 9:30 AM 10:00 AM 10:30 AM ETC
Meals: NA NA Breakfast NA NA Snack NA NA NA ETC
Satiety: Range from 0-10.
My current satiety data is just a vector I created, but I would like it to start at 0, increase by 1 after every meal, and decrease by 1 after every 30-minute timeframe in which there is no meal (where meal = NA).
I'm sure there is a much better way to do this.
Thank you.
Here's some sample data and a potential solution.
set.seed(123)
meals <- sample(c(1, 1, 1, NA), 20, replace = TRUE)
df <- data.frame(meals = meals)
head(df)
# meals
# 1 1
# 2 NA
# 3 1
# 4 NA
# 5 NA
# 6 1
df$meals[is.na(df$meals)] <- -1
df$satiety <- cumsum(df$meals)
head(df)
# meals satiety
# 1 1 1
# 2 -1 0
# 3 1 1
# 4 -1 0
# 5 -1 -1
# 6 1 0
tail(df)
# meals satiety
# 15 1 5
# 16 -1 4
# 17 1 5
# 18 1 6
# 19 1 7
# 20 -1 6
I would suggest not coding the absence of a meal (or a skipped meal) as NA, which means "I don't know". If you're using NA to mean the meal was skipped, then you do actually know and you should give it something that represents a skipped meal. Here, since your model interprets a skipped meal as having a negative impact on satiety (not a neutral impact), -1 actually makes quite a lot of sense. If that's how you use it in your model, then code it that way.
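The question mentions that satiety ranges from 0 to 10, while a plain cumsum can drift below 0 or above 10. As a sketch only, one way to keep the running total inside that range is Reduce with accumulate = TRUE. The meals vector is a made-up sample in the question's layout, clamp_step is a hypothetical helper, and clamping at the bounds is an assumption about the intended behaviour.

```r
# Made-up sample in the question's layout: NA means no meal in that slot
meals <- c(NA, NA, "Breakfast", NA, NA, "Snack", NA, NA, NA)

# +1 for a meal, -1 for no meal
step <- ifelse(is.na(meals), -1, 1)

# Plain running total, as in the answer above (can leave the 0-10 range)
unbounded <- cumsum(step)

# Hypothetical helper: apply one step, then clamp the result to [0, 10]
clamp_step <- function(prev, s) min(10, max(0, prev + s))

# Running total starting from 0; [-1] drops the initial value itself
bounded <- Reduce(clamp_step, step, init = 0, accumulate = TRUE)[-1]

unbounded
# [1] -1 -2 -1 -2 -3 -2 -3 -4 -5
bounded
# [1] 0 0 1 0 0 1 0 0 0
```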
A couple of things here.
Unless the data includes the string "NA", you should use the command is.na(x) to check if a value or values are NA. It's hard to tell however without sample data.
Generally speaking, in R you will want to use vectorised solutions. In many cases, if you're using a for loop, it's incorrect.
You've stated that "Meals" is in a dataframe. As such, you will need to refer to Meals as a subset of that data frame. For example, if the data frame is data, then the expression should be data$Meals.
Summarising all of this, I'd probably do something similar to the following:
Day$Meals.na <- is.na(Day$Meals)
print(Day$Fullness + (-1)^Day$Meals.na)
This uses a nice trick: TRUE and FALSE are treated as 1 and 0 respectively in arithmetic, so (-1)^TRUE is -1 and (-1)^FALSE is 1.
Hopefully this helps. If not, we'd really need sample data and expected outputs to be able to be of more use.

How can I fill in NA values based on the next real value but divide that value between the preceding NAs?

Please note: this is a hyper simplified explanation of where the 'data' comes from, but where the data is from is irrelevant to the coding question.
I have a data set created by collecting water in a tube everyday.
I can't go and measure the tube every day (but the tube keeps filling) so there are gaps in the water value records.
This dummy data set shows where this has happened on days 5 and 10. Because this is a dummy dataset, I have assumed that 500ml of water goes into the tube each day (the real data set is a lot messier!).
dummy data
day<-c(1,2,3,4,5,6,7,8,9,10,11,12)
value<-c(500,500,500,500,NA,1000,NA,NA,NA,2000,500,500)
df<-data.frame(day,value)
Data explanation: I collected every day for days 1:4, so the value for each day is 500ml; missed day 5, so the value is NA; collected on day 6, so the value is 1000ml (the water from day 5 and day 6 combined); missed days 7, 8 and 9, so the values equal NA; collected on day 10 to give a value of 2000ml (for the 4 days); then collected every day for the last two.
I would like to fill in the NA gaps by taking the value of the next 'real' measurement and dividing that value between the NAs and that value's day. Yes, I am assuming that if I have not made a measurement there is a constant process and that I can divide the last measurement equally between the days.
this is what the output data should look like
day<-c(1,2,3,4,5,6,7,8,9,10,11,12)
corrected.value<-c(500,500,500,500,500,500,500,500,500,500,500,500)
corrected.df<-data.frame(day,corrected.value)
Again this is just a dummy data set otherwise the easiest way would just be replace NA with 500 with 'value[is.na(value)] <- 500', but in the real data set the values can be 457.6, 779, 376, etc.
Also tried to do a loop but keep getting stuck...
Any ideas on how I can do this?
Help is greatly appreciated
Here's a possible solution :
# Create test Data:
# note that this is slightly different from your input
# but in this way you can better verify that it works as expected
day<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
value<-c(NA,500,500,500,NA,3000,NA,NA,NA,5000,500,500,NA,NA,NA)
df<-data.frame(day,value)
# "Cleansing" starts here :
RLE <- rle(is.na(df$value))
# we cannot do anything if last values are NAs, we'll just keep them in the data.frame
if(tail(RLE$values,1)){
RLE$lengths <- head(RLE$lengths,-1)
RLE$values <- head(RLE$values,-1)
}
afterNA <- cumsum(RLE$lengths)[RLE$values] + 1
firstNA <- (cumsum(RLE$lengths)- RLE$lengths + 1)[RLE$values]
occurences <- afterNA - firstNA + 1
replacements <- df$value[afterNA] / occurences
df$value[unlist(Map(f=seq.int,firstNA,afterNA))] <- rep.int(replacements,occurences)
Result :
> df
day value
1 1 250
2 2 250
3 3 500
4 4 500
5 5 1500
6 6 1500
7 7 1250
8 8 1250
9 9 1250
10 10 1250
11 11 500
12 12 500
13 13 NA
14 14 NA
15 15 NA
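An alternative sketch of the same idea uses a reversed cumsum to label each run of NAs together with the measurement that ends it, then spreads each group's total with ave. The function name fill_gaps and the helper spread are hypothetical, and trailing NAs are left as NA, matching the behaviour above.

```r
# Hypothetical helper: spread each measurement over the NA days before it
fill_gaps <- function(value) {
  # Counting non-NA values from the right gives every NA run the same
  # label as the next real measurement; trailing NAs get label 0
  grp <- rev(cumsum(rev(!is.na(value))))

  # Group total divided by group size; all-NA groups stay NA
  spread <- function(v) {
    if (all(is.na(v))) NA_real_ else sum(v, na.rm = TRUE) / length(v)
  }
  ave(value, grp, FUN = spread)
}

# Dummy data from the question
value <- c(500, 500, 500, 500, NA, 1000, NA, NA, NA, 2000, 500, 500)
corrected <- fill_gaps(value)
corrected
# [1] 500 500 500 500 500 500 500 500 500 500 500 500
```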

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in "possible" called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID. Instead, I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
library(data.table)
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
  #get boolean vector to show if any of the 'truth' rows are a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  #if any are true then update the match and between vectors
  if(any(match.vec)){
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
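The same check can also be sketched without an explicit loop by building a logical matrix with outer. The data are retyped from the question, the column names Between and Match follow the question's wording, and tol is a hypothetical tolerance parameter for the "within 2 seconds" variant. The matrix has one cell per possible/truth pair, so this may not suit very large inputs.

```r
# Data retyped from the question
truth <- data.frame(
  ID    = 1:5,
  Start = c(213466, 32238, 218934, 222774, 68137),
  Stop  = c(213468, 32240, 218936, 222776, 68139)
)
possible <- data.frame(
  ID    = 1:8,
  Times = c(32239.76, 32241.14, 68138.72, 111233.93,
            128395.28, 146180.31, 188433.35, 198714.7)
)

# Hypothetical tolerance in seconds; 0 means strictly within [Start, Stop]
tol <- 0

# inside[i, j] is TRUE when possible time i falls inside truth interval j
inside <- outer(possible$Times, truth$Start - tol, ">=") &
          outer(possible$Times, truth$Stop + tol, "<=")

# A possible time is flagged if it falls in any interval, and vice versa
possible$Between <- as.integer(rowSums(inside) > 0)
truth$Match      <- as.integer(colSums(inside) > 0)
```

Setting tol to 2 would also flag times that fall within 2 seconds of a start or stop value.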

Subsetting an integer vector based on a vector of corresponding dates

Elementary question:
I'm trying to subset a vector of a data frame based on a vector of dates that correspond with the vector that I wish to subset. Consider the following data frame as an example:
Date Time Axis1 Day Sum.A1.Daily
1 6/12/10 5:00:00 20 1 NA
2 6/12/10 5:01:00 40 1 NA
3 6/12/10 5:02:00 50 1 NA
4 6/13/10 5:03:00 10 2 NA
5 6/13/10 5:04:00 20 2 NA
6 6/13/10 5:05:00 30 2 NA
I want to fill the column to the right with the sum of values for each day. Basically, (1:3,5) should = 110, and (4:6,5) should = 60.
I know there are many ways to do this that are smarter/faster/better than what I'm attempting to do (e.g., my date variable is a factor split into "levels" that I don't know how to access), but I'm trying to build my skills from the ground up, and want to figure out how to:
Take a subset of data$Axis1 that will only grab the values for the 1st day
Take a subset of the values of data$Axis1 that will only grab the values for the 2nd day
Sum the values for each day, and place them in column 5, overwriting the "NA"
I successfully performed a function similar to this to auto-fill-in the "Day" vector, which was originally full of "NA" values (below). But I'm getting stuck as I think about how to a) subset with dates, and b) sum while subsetting.
Thanks in advance for your help - also, let me know if my question could be clearer/I'm violating cardinal stackoverflow rules. I'm very new to R and the coding community in general; I appreciate your help!
dates <-c("6/12/10","6/13/10")
counts <- c(1:2)
x <- nrow(data)
for (i in 1:x) {
for (j in 1:12) {
if (data[i,1] == dates[j]) {
data[i,4] <- counts[j]
}
}
}
Using ave :
transform(dat,Sum.A1.Daily=ave(dat$Axis1,dat$Date,FUN=sum))
Date Time Axis1 Day Sum.A1.Daily
1 6/12/10 5:00:00 20 1 110
2 6/12/10 5:01:00 40 1 110
3 6/12/10 5:02:00 50 1 110
4 6/13/10 5:03:00 10 2 60
5 6/13/10 5:04:00 20 2 60
6 6/13/10 5:05:00 30 2 60
Another way would be using data.table
#Let's say df is your dataset
library(data.table)
dt = as.data.table(df)
dt = dt[, Sum.A1.Daily := sum(Axis1), by = Date]
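Since the question asks how to subset by date and sum "from the ground up", here is a small base-R sketch of those individual steps. The data frame is retyped from the question (the Time column is omitted for brevity), and the intermediate names day1 and daily are made up for illustration.

```r
# Example data retyped from the question (Time column omitted)
data <- data.frame(
  Date  = c("6/12/10", "6/12/10", "6/12/10", "6/13/10", "6/13/10", "6/13/10"),
  Axis1 = c(20, 40, 50, 10, 20, 30),
  Day   = c(1, 1, 1, 2, 2, 2),
  Sum.A1.Daily = NA
)

# a) Subset with dates: a logical test on Date picks the matching rows
day1 <- data$Axis1[data$Date == "6/12/10"]

# b) Sum while subsetting
sum(day1)
# [1] 110

# The same for every date at once: one sum per distinct Date
daily <- tapply(data$Axis1, data$Date, sum)

# Look up each row's date in the per-day sums and overwrite the NA column
data$Sum.A1.Daily <- as.vector(daily[as.character(data$Date)])
```

The as.character call makes the lookup work by name even if Date is stored as a factor.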
