How can I fill in NA values based on the next real value but divide that value between the preceding NAs?

Please note: this is a highly simplified explanation of where the data comes from, but its origin is irrelevant to the coding question.
I have a data set created by collecting water in a tube everyday.
I can't go and measure the tube every day (but the tube keeps filling), so there are gaps in the water value records.
This dummy data set shows where this has happened: day 5 and days 7-9 are missing, with catch-up measurements on days 6 and 10. Because this is a dummy data set, I have assumed that 500 ml of water goes into the tube each day (the real data set is a lot messier!).
dummy data
day <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
value <- c(500, 500, 500, 500, NA, 1000, NA, NA, NA, 2000, 500, 500)
df <- data.frame(day, value)
Data explanation: I collected every day for days 1-4, so the value for each of those days is 500 ml. I missed day 5, so its value is NA. I collected on day 6, so the value is 1000 ml (the water from days 5 and 6 combined). I missed days 7, 8 and 9, so those values are NA, and collected on day 10, giving 2000 ml for the four days. Then I collected every day for the last two days.
I would like to fill in the NA gaps by taking the value of the next 'real' measurement and dividing it equally between the NAs and the day of that measurement. Yes, I am assuming that if I have not made a measurement the process is constant, so I can divide the last measurement equally between the days.
This is what the output data should look like:
day <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
corrected.value <- c(500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500)
corrected.df <- data.frame(day, corrected.value)
Again, this is just a dummy data set; otherwise the easiest way would be to replace NA with 500 via value[is.na(value)] <- 500. In the real data set the values can be 457.6, 779, 376, etc.
I also tried to write a loop but keep getting stuck.
Any ideas on how I can do this? Help is greatly appreciated.

Here's a possible solution:
# Create test Data:
# note that this is slightly different from your input
# but in this way you can better verify that it works as expected
day <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
value <- c(NA, 500, 500, 500, NA, 3000, NA, NA, NA, 5000, 500, 500, NA, NA, NA)
df <- data.frame(day, value)
# "Cleansing" starts here :
RLE <- rle(is.na(df$value))
# we cannot do anything if last values are NAs, we'll just keep them in the data.frame
if(tail(RLE$values,1)){
RLE$lengths <- head(RLE$lengths,-1)
RLE$values <- head(RLE$values,-1)
}
afterNA <- cumsum(RLE$lengths)[RLE$values] + 1
firstNA <- (cumsum(RLE$lengths)- RLE$lengths + 1)[RLE$values]
occurences <- afterNA - firstNA + 1
replacements <- df$value[afterNA] / occurences
df$value[unlist(Map(f=seq.int,firstNA,afterNA))] <- rep.int(replacements,occurences)
Result:
> df
day value
1 1 250
2 2 250
3 3 500
4 4 500
5 5 1500
6 6 1500
7 7 1250
8 8 1250
9 9 1250
10 10 1250
11 11 500
12 12 500
13 13 NA
14 14 NA
15 15 NA
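Since the question mentions getting stuck with a loop, here is a plain-loop version of the same idea for comparison. This is only a sketch: fill_na_runs is a hypothetical helper name, and trailing NAs are left untouched, just as in the solution above.
fill_na_runs <- function(x) {
  i <- 1
  while (i <= length(x)) {
    if (is.na(x[i])) {
      j <- i
      while (j <= length(x) && is.na(x[j])) j <- j + 1  # scan forward to the next real value
      if (j > length(x)) break                          # trailing NAs: nothing to divide, keep them
      x[i:j] <- x[j] / (j - i + 1)                      # split the measurement across the run and its own day
      i <- j + 1
    } else {
      i <- i + 1
    }
  }
  x
}
df$value <- fill_na_runs(df$value)
On the test data above this reproduces the same result, e.g. the 5000 on day 10 becomes 1250 spread over days 7-10.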

Related

Delete rows when a certain factor is present fewer than 200 times

I have a dataset with over 400,000 cows. These cows are (unevenly) spread over 2355 herds. Some herds are present only once in the data, while one herd is present 2033 times, meaning that 2033 cows belong to it. I want to delete herds from my data that occur fewer than 200 times.
Using plyr and subset, I can obtain a list of which herds occur fewer than 200 times; however, I cannot work out how to apply this selection to the full dataset.
For example, my current data looks a little like:
cow herd
1 1
2 1
3 1
4 2
5 3
6 4
7 4
8 4
With the function count() I can obtain the following:
x freq
1 3
2 1
3 1
4 3
Say I want to delete the data belonging to herds that occur fewer than 3 times; eventually I want my data to look like this:
cow herd
1 1
2 1
3 1
6 4
7 4
8 4
I do know how to tell R to delete the data herd by herd, but since over 1000 herds in my real dataset occur fewer than 200 times, that would mean typing every herd number into my script one by one. I am sure there is an easier and quicker way of asking R to delete data above or below a certain number of occurrences.
I hope my explanation is clear and someone can help me. Thanks in advance!
Use n() with group_by():
library(dplyr)
your_data %>%
  group_by(herd) %>%
  filter(n() >= 3)
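For reference, a base R equivalent (a sketch; ave() attaches each herd's row count to every row, so no extra packages are needed):
# keep only rows whose herd occurs at least 3 times
keep <- ave(seq_along(your_data$herd), your_data$herd, FUN = length) >= 3
your_data[keep, ]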

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange; my apologies, as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times that represent possible 'real' detections:
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see whether the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called 'Between' and a column in the 'truth' data frame called 'Match'. For every value from possible that falls between, I'd like a 1, otherwise a 0; likewise, for every row in 'truth' that finds a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important, and I'm not looking to match on ID. Instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing whether any of my 'possible' time values fall within 2 seconds of the start or end times would also do the trick (again with 1/0 outputs).
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
# melt start/stop into a single 'times' column, then roll each possible time
# back to the nearest preceding boundary: landing on a "start" means the time
# lies inside an interval, landing on a "stop" means it has fallen outside one
melt(truth, measure.vars=c("start","stop"), value.name="times")[
  possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
# generate random test data, with column names matching the question
set.seed(7)
truth <- data.frame(ID    = 1:100,
                    Start = sample(5:20, size = 100, replace = TRUE),
                    Stop  = sample(21:50, size = 100, replace = TRUE))
possible <- data.frame(Times = sample(1:15, size = 15, replace = FALSE))
With sample data to work with, the following solution provides what I believe you are asking for. It should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
# need the %between% operator
library(data.table)
# initialize result vectors: 0 (false) by default
truth.match <- rep(0, times = nrow(truth))
possible.between <- rep(0, times = nrow(possible))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)) {
  # get a boolean vector showing whether any of the 'truth' rows are a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  # if any are true then update the match and between vectors
  if (any(match.vec)) {
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
# I think this is clearer named anyMatch
truth$anyMatch <- truth.match
# similarly, betweenAny
possible$betweenAny <- possible.between
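A vectorized alternative to the loop, for comparison (a sketch, assuming the Start/Stop/Times columns shown in the question; it still performs all pairwise comparisons, just without the explicit loop):
# 1/0 flag for each possible time: does it fall inside any truth interval?
possible$Between <- as.integer(sapply(possible$Times, function(t)
  any(t >= truth$Start & t <= truth$Stop)))
# 1/0 flag for each truth interval: does any possible time fall inside it?
truth$Match <- as.integer(sapply(seq_len(nrow(truth)), function(i)
  any(possible$Times >= truth$Start[i] & possible$Times <= truth$Stop[i])))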

Subsetting an integer vector based on a vector of corresponding dates

Elementary question:
I'm trying to subset a vector of a data frame based on a vector of dates that correspond with the vector that I wish to subset. Consider the following data frame as an example:
Date Time Axis1 Day Sum.A1.Daily
1 6/12/10 5:00:00 20 1 NA
2 6/12/10 5:01:00 40 1 NA
3 6/12/10 5:02:00 50 1 NA
4 6/13/10 5:03:00 10 2 NA
5 6/13/10 5:04:00 20 2 NA
6 6/13/10 5:05:00 30 2 NA
I want to fill the column to the right with the sum of values for each day. Basically, rows 1:3 of column 5 should equal 110, and rows 4:6 should equal 60.
I know there are many ways to do this that are smarter/faster/better than what I'm attempting to do (e.g., my date variable is a factor split into "levels" that I don't know how to access), but I'm trying to build my skills from the ground up, and want to figure out how to:
1. Take a subset of data$Axis1 that will only grab the values for the 1st day.
2. Take a subset of data$Axis1 that will only grab the values for the 2nd day.
3. Sum the values for each day, and place them in column 5, overwriting the NAs.
I successfully performed a similar operation to auto-fill the "Day" vector, which was originally full of NA values (below). But I'm getting stuck as I think about how to (a) subset with dates, and (b) sum while subsetting.
Thanks in advance for your help. Also, let me know if my question could be clearer or if I'm violating any cardinal Stack Overflow rules; I'm very new to R and the coding community in general!
dates <-c("6/12/10","6/13/10")
counts <- c(1:2)
x <- nrow(data)
for (i in 1:x) {
for (j in 1:12) {
if (data[i,1] == dates[j]) {
data[i,4] <- counts[j]
}
}
}
Using ave:
transform(dat, Sum.A1.Daily = ave(dat$Axis1, dat$Date, FUN = sum))
Date Time Axis1 Day Sum.A1.Daily
1 6/12/10 5:00:00 20 1 110
2 6/12/10 5:01:00 40 1 110
3 6/12/10 5:02:00 50 1 110
4 6/13/10 5:03:00 10 2 60
5 6/13/10 5:04:00 20 2 60
6 6/13/10 5:05:00 30 2 60
Another way would be to use data.table:
# let's say df is your dataset
library(data.table)
dt <- as.data.table(df)
dt[, Sum.A1.Daily := sum(Axis1), by = Date]
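For the ground-up, step-by-step route the asker describes, here is a minimal base R sketch (assuming the data frame is named data, with Date stored as character or factor, as in the question):
# step 1: grab the Axis1 values for the first day only
day1 <- data$Axis1[data$Date == "6/12/10"]
# step 2: the same for the second day
day2 <- data$Axis1[data$Date == "6/13/10"]
# step 3: sum each day and overwrite the NAs in column 5
data$Sum.A1.Daily[data$Date == "6/12/10"] <- sum(day1)  # 110
data$Sum.A1.Daily[data$Date == "6/13/10"] <- sum(day2)  # 60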

Alternative to for loop and indexing?

I have a large data set with 3 columns: Order, Discharge, Date (numeric). There are 20 years of daily Discharge values for each Order, and Order numbers can extend beyond 100.
> head(dat)
Order Discharge date
1 0.04712 6574
2 0.05108 6574
3 0.00000 6574
4 0.00000 6574
5 3.54100 6574
6 3.61500 6574
For a given Order x, I would like to replace the Discharge value with the average of the Discharge at x+1 and x-1 for that date. I have been doing this in a crude manner with a for loop and indexing, but it takes over an hour to process. I know there has to be a better way.
x <- 4
for (i in min(dat[, 3]):max(dat[, 3]))
  dat[, 2][dat[, 3] == i & dat[, 1] == x] <-
    mean(c(dat[, 2][dat[, 3] == i & dat[, 1] == x + 1],
           dat[, 2][dat[, 3] == i & dat[, 1] == x - 1]))
Gives
> head(dat)
Order Discharge date
1 0.04712 6574
2 0.05108 6574
3 0.00000 6574
4 1.77050 6574
5 3.54100 6574
6 3.61500 6574
Where the Discharge at Order 4, for date 6574 has been replaced with 1.77050. It works, but it's ridiculously slow.
I should specify that I don't need to do this calculation on every Order, but only a select few (only 8 out of a total of 117). Based on the answer, I have the following.
dat$NewDischarge <- by(dat$Discharge, dat$date, function(x)
  rowMeans(cbind(c(x[-1], NA), x,
                 c(NA, x[-length(x)])), na.rm = TRUE))
I am still trying to figure out a way to calculate the values for only the selected Orders, and am stuck in the rut of a for loop and indexing on date and Orders.
I would go about it as follows:
1. Ensure that Order is a factor.
2. For each Order, you now have a sub-problem:
   a. Sort the sub-data-frame by date.
   b. Each Discharge mean can be produced "vectorally" as:
      rowMeans(cbind(c(Discharge[-1], NA), Discharge, c(NA, Discharge[-length(Discharge)])))
3. The sub-problem can be dealt with by a simple for loop or the function by. I would prefer by.
4. Your data has been rearranged, but you can easily reorder it.
For point 2b, imagine it (or try it) with a simple vector and see the effects of the cbind operation. It also forces you to consider the limit situations: how are the first and last Discharge values calculated (with no preceding or following dates)?
There are several ways to solve your particular dilemma, but the basic question to ask when confronted with a slow for loop is, "How do I use vectorization to replace this loop?" (Well, maybe you should ask "Should I...?" first.) In your case, you're looping across dates, but there's no need to explicitly do that, since just grabbing all of the rows where dat$Order==x will implicitly grab all the dates.
The dataset you posted only has one date, but I can generate some fake data to illustrate:
generate.data <- function(n.order, n.date){
  dat <- expand.grid(Order=seq_len(n.order), date=seq_len(n.date))
  dat$Discharge <- rlnorm(n.order * n.date)
  dat[, c("Order", "Discharge", "date")]
}
dat <- generate.data(10, 5)
head(dat)
# Order Discharge date
# 1 1 2.1925563 1
# 2 2 0.4093022 1
# 3 3 2.5525497 1
# 4 4 1.9274013 1
# 5 5 1.1941986 1
# 6 6 1.2407451 1
tail(dat)
# Order Discharge date
# 45 5 1.4344575 5
# 46 6 0.5757580 5
# 47 7 0.4986190 5
# 48 8 1.2076292 5
# 49 9 0.3724899 5
# 50 10 0.8288401 5
Here's all the rows where dat$Order==4, across all dates:
dat[dat$Order==4, ]
# Order Discharge date
# 4 4 1.9274013 1
# 14 4 3.5319072 2
# 24 4 0.2374532 3
# 34 4 0.4549798 4
# 44 4 0.7654059 5
You can just take the Discharge column, and you'll have the left-hand side of your assignment:
dat[dat$Order==4, ]$Discharge
# [1] 1.9274013 3.5319072 0.2374532 0.4549798 0.7654059
Now you just need the right side, which has two components: the x-1 discharges and the x+1 discharges. You can grab these the same way you grabbed the x discharges:
dat[dat$Order==4-1, ]$Discharge
# [1] 2.5525497 1.9143963 0.2800546 8.3627810 7.8577635
dat[dat$Order==4+1, ]$Discharge
# [1] 1.1941986 4.6076114 0.3963693 0.4190957 1.4344575
To obtain the new values, you need the parallel mean. R doesn't have a pmean function, but you can cbind these and take the rowMeans:
rowMeans(cbind(dat[dat$Order==4-1, ]$Discharge, dat[dat$Order==4+1, ]$Discharge))
# [1] 1.8733741 3.2610039 0.3382119 4.3909383 4.6461105
So, in the end you have:
dat[dat$Order==4, ]$Discharge <- rowMeans(cbind(dat[dat$Order==4-1, ]$Discharge,
                                                dat[dat$Order==4+1, ]$Discharge))
You can even use %in% to make this work across all of your x values.
Note that this assumes your data is ordered.
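As a sketch of that %in% generalization (xs is a hypothetical name for the orders you want to update; this assumes every Order/date pair occurs exactly once, the data is sorted consistently, and the chosen orders are not adjacent to one another, or the selections would misalign):
xs <- c(4, 7)
dat$Discharge[dat$Order %in% xs] <- rowMeans(cbind(
  dat$Discharge[dat$Order %in% (xs - 1)],   # the x-1 neighbours
  dat$Discharge[dat$Order %in% (xs + 1)]))  # the x+1 neighbours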

Re-sample a data frame with panel dimension

I have a data set consisting of 2000 individuals. For each individual i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid which takes an identical value for an individual across its different rows (situations).
Taking the panel nature of the data into consideration, I want to re-sample d (as in a bootstrap):
with replacement,
storing each re-sampled data set as a data frame.
I considered using the sample function but could not make it work. I am a new user of R and have no programming skills.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows, with pid=1, are for the first person, and the next six rows, with pid=2, are observations for the second person.
This should work for you:
z <- replicate(100,
               d[d$pid %in% sample(unique(d$pid), 2000, replace=TRUE), ],
               simplify = FALSE)
The result z will be a list of data frames that you can do whatever you like with.
EDIT: this is a little wordy, but it deals with duplicated rows. replicate performs a given operation a set number of times (in the example below, 4). I then sample the unique values of pid (here 3 of those values, with replacement) and extract the rows of d corresponding to each sampled value. The combination of do.call("rbind", ...) and lapply handles the duplicates that the code above does not: instead of generating data frames with potentially different lengths, it builds one data frame per sampled pid and sticks them back together within each iteration of replicate.
z <- replicate(4, do.call("rbind", lapply(sample(unique(d$pid), 3, replace=TRUE),
                                          function(x) d[d$pid==x, ])),
               simplify=FALSE)
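Each replicate can then be fed to whatever statistic you are bootstrapping; a minimal sketch, assuming the numeric column x from the example data:
# bootstrap distribution of the mean of x, one value per re-sample
boot_means <- sapply(z, function(s) mean(s$x))
quantile(boot_means, c(0.025, 0.975))  # simple percentile interval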
