I have some cumulative count data. Because of reporting inaccuracies, the cumulative sum sometimes decreases, as in 0 1 2 2 3 3 2 4 5.
I would like to create a new vector that retains the largest value reported so far and carries it forward until the cumulative count catches up. So the corrected version of the above would be 0 1 2 2 3 3 3 4 5.
I tried the following:
mydf <- data.frame(ts1 = c(0,1,1,1,2,3,2,2,3,4,4,5))
mydf$lag1 <- dplyr::lag(mydf[,1])  # dplyr::lag() shifts values down one row; stats::lag() would leave them unchanged
mydf$corrected <- ifelse(is.na(mydf[,2]),mydf[,1],
ifelse(mydf[,2] > mydf[,1], mydf[,2], mydf[,1]))
which returns:
ts1 lag1 corrected
1 0 NA 0
2 1 0 1
3 1 1 1
4 1 1 1
5 2 1 2
6 3 2 3
7 2 3 3
8 2 2 2
9 3 2 3
10 4 3 4
11 4 4 4
12 5 4 5
This worked the first time a value was smaller than the previous one (row 7), but it fails the second time (row 8).
I thought there must be a better way of doing this: a new vector that equals the input vector unless the value decreases, in which case it retains the prior value until the input vector exceeds that retained value.
You are looking for cummax:
cummax(mydf$ts1)
#[1] 0 1 1 1 2 3 3 3 3 4 4 5
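As a quick check against the first example in the question, here is a minimal sketch in base R. The variable names `reported`, `corrected` and `by_hand` are only illustrative.

```r
# Cumulative counts from the question; the 7th value dips from 3 to 2
reported <- c(0, 1, 2, 2, 3, 3, 2, 4, 5)

# cummax() returns the running maximum, so a dip is replaced
# by the largest value seen so far
corrected <- cummax(reported)
print(corrected)

# The same running maximum written out with Reduce(),
# to show what cummax() computes
by_hand <- Reduce(max, reported, accumulate = TRUE)
print(identical(corrected, by_hand))
```

Note that cummax propagates missing values: once an NA appears in the input, that element and every later element of the result is NA, so missing counts would need handling first.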
For each ID, I want to return the value in the 'distance' column where the value becomes negative for the first time. If the value does not become negative at all, return the value 99 (or some other random number) for that ID. A sample data frame is given below.
df <- data.frame(ID=c(rep(1, 4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)),distance=rep(1:4,5), value=c(1,4,3,-1,2,1,-4,1,3,2,-1,1,-4,3,2,1,2,3,4,5))
> df
ID distance value
1 1 1 1
2 1 2 4
3 1 3 3
4 1 4 -1
5 2 1 2
6 2 2 1
7 2 3 -4
8 2 4 1
9 3 1 3
10 3 2 2
11 3 3 -1
12 3 4 1
13 4 1 -4
14 4 2 3
15 4 3 2
16 4 4 1
17 5 1 2
18 5 2 3
19 5 3 4
20 5 4 5
The desired output is as follows
> df2
ID first_negative_distance
1 1 4
2 2 3
3 3 3
4 4 1
5 5 99
I tried but couldn't figure out how to do it with dplyr. Any help would be much appreciated. The actual data I'm working on has thousands of IDs with 30 different distance levels for each. Bear in mind that for any ID, there could be multiple negative values. I just need the first one.
Edit:
Tried the solution proposed by AntonoisK.
> df%>%group_by(ID)%>%summarise(first_neg_dist=first(distance[value<0]))
first_neg_dist
1 4
This is the result I am getting. Does not match what Antonois got. Not sure why.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(first_neg_dist = first(distance[value < 0]))
# # A tibble: 5 x 2
# ID first_neg_dist
# <dbl> <int>
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 1
# 5 5 NA
If you really prefer 99 instead of NA you can use
summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
instead.
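For comparison, the same result can probably be sketched in base R without dplyr. This is only an illustration and not part of the answer above; the helper name `first_negative` and its fallback argument `none` are made up for the example.

```r
df <- data.frame(ID = rep(1:5, each = 4),
                 distance = rep(1:4, 5),
                 value = c(1, 4, 3, -1, 2, 1, -4, 1, 3, 2, -1, 1,
                           -4, 3, 2, 1, 2, 3, 4, 5))

# Hypothetical helper: distance at the first negative value,
# or `none` if the group has no negative value at all
first_negative <- function(distance, value, none = 99L) {
  hits <- distance[value < 0]
  if (length(hits) > 0) hits[1] else none
}

# Split the rows by ID and apply the helper to each group
groups <- split(df, df$ID)
df2 <- data.frame(
  ID = as.integer(names(groups)),
  first_negative_distance = vapply(
    groups,
    function(g) first_negative(g$distance, g$value),
    integer(1)
  )
)
print(df2)
```

With thousands of IDs the dplyr version is likely to be more convenient, but the logic is the same: subset distance to the negative rows within each ID and take the first element.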
I could use some help. I need to add a new variable to a dataframe based on whether or not the value of a variable in a dataframe equals the index value of another vector. Below is a simplified example:
vector [2 7 15 4 5]
dataframe (4 variables; Index, Site, Quad, Count)
Index Site Quad Count
1 2 3 0
1 3 7 2
2 1 8 0
2 3 3 1
3 2 3 0
4 3 7 2
5 1 8 0
5 3 3 1
The variable I would like to create would match the value of df$Index from the dataframe with the matching position in the vector. That is, when df$Index = 1, the new variable would be 2 (position 1 in the vector), when df$Index = 2, the new variable would be 7 (position 2 in the vector), when df$Index = 3, the new variable would be 15 (position 3 in the vector).
I've ended up in an R wormhole, and know the solution is simple, but I cannot seem to get it. Thanks for any help.
If your indexes are actually integer indices, for example
dd<-read.table(text="Index Site Quad Count
1 2 3 0
1 3 7 2
2 1 8 0
2 3 3 1
3 2 3 0
4 3 7 2
5 1 8 0
5 3 3 1", header=TRUE)
vec <- c(2, 7, 15, 4, 5)
Then you can create the new column with
dd$value <- vec[dd$Index]
dd
# Index Site Quad Count value
# 1 1 2 3 0 2
# 2 1 3 7 2 2
# 3 2 1 8 0 7
# 4 2 3 3 1 7
# 5 3 2 3 0 15
# 6 4 3 7 2 4
# 7 5 1 8 0 5
# 8 5 3 3 1 5
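If the Index values are not positions 1..n, direct positional indexing no longer applies. A minimal sketch of one alternative, assuming a hypothetical vector `keys` that records which Index value each element of vec belongs to (neither `keys` nor the data below come from the question):

```r
# Hypothetical data: Index values are arbitrary labels, not positions
dd <- data.frame(Index = c(10, 10, 20, 30, 50))
keys <- c(10, 20, 30, 40, 50)  # assumed: keys[i] is the Index that vec[i] belongs to
vec  <- c(2, 7, 15, 4, 5)

# match() finds the position of each Index in keys;
# that position is then used to index vec
dd$value <- vec[match(dd$Index, keys)]
print(dd)
```

Any Index value that does not appear in keys gets NA in the new column, which makes unmatched rows easy to spot.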
I want to repeat a sequence to a specific length.
The sequence is 1:4, and I want to repeat it until it matches the number of rows in a data frame.
Let's say the data frame has 24 rows.
I tried the following:
test <- rep(1:4, each=24/4)
1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4
Lengthwise this is fine, but I want to retain the sequence:
1 2 3 4 1 2 3 4 1 2 3 4.....
You need to use times instead of each:
rep(1:4, times=24/4)
[1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
We can also pass the count without naming the argument, since the first unnamed argument after the vector is taken as times:
rep(1:4, 24/4)
#[1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
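When the number of rows is not a multiple of 4, dividing by 4 no longer gives a whole number of repetitions. A minimal sketch using the length.out argument of rep instead; the row count `n_rows` is a made-up value for illustration:

```r
n_rows <- 10  # hypothetical row count that is not a multiple of 4

# length.out recycles 1:4 and cuts the result off at exactly n_rows elements
seq_col <- rep(1:4, length.out = n_rows)
print(seq_col)
```

In practice n_rows would come from nrow() of the data frame, so the column always has the right length.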
I have the following dataset called asteroids:
3 4 3 3 1 4 1 3 2 3
1 1 4 2 3 3 2 6 1 1
3 3 2 2 2 2 1 3 2 1
6 1 3 2 2 1 2 2 4 2
I need to find out what proportion of this dataset is 1.
If you have a specific value in mind you can just do an equality comparison and then use mean on the resulting logical vector.
> asteroids <- scan(what=numeric())
1: 3 4 3 3 1 4 1 3 2 3 1 1 4 2 3 3 2 6 1 1 3 3 2 2 2 2 1 3 2 1 6 1 3 2 2 1 2 2 4 2
41:
Read 40 items
> mean(asteroids == 1)
[1] 0.25
This works since the equality comparison will give TRUE and FALSE and when T/F are coerced numerically they become 1s and 0s so mean ends up giving us the proportion of TRUEs.
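The coercion described above can be spelled out step by step. This is a small sketch with the same 40 values entered directly as a vector; the names `is_one`, `n_ones` and `prop` are only illustrative.

```r
asteroids <- c(3, 4, 3, 3, 1, 4, 1, 3, 2, 3,
               1, 1, 4, 2, 3, 3, 2, 6, 1, 1,
               3, 3, 2, 2, 2, 2, 1, 3, 2, 1,
               6, 1, 3, 2, 2, 1, 2, 2, 4, 2)

is_one <- asteroids == 1           # logical vector of TRUE/FALSE
n_ones <- sum(is_one)              # TRUE counts as 1, FALSE as 0
prop   <- n_ones / length(is_one)  # same value as mean(is_one)
print(c(n_ones, prop))
```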
I assumed asteroids was a vector. You don't specify in your question but if it's a different type of structure you'll probably need to coerce it into a vector in some way or another.
Assuming that 'asteroids' is a data.frame, unlist it, get the table and find the proportion with prop.table.
prop.table(table(unlist(asteroids)==1))
# FALSE TRUE
# 0.75 0.25
Or, as @Richard Scriven mentioned, we can compare the data.frame directly to get a logical matrix and use table on it, since a 'matrix' is a vector with dim attributes.
prop.table(table(asteroids == 1))
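To try the data.frame case end to end, here is a minimal sketch. It assumes the 40 values are stored as a 4 x 10 data.frame, which the question does not actually state.

```r
# Assumed layout: 4 rows by 10 columns, filled row by row as printed in the question
asteroids <- as.data.frame(matrix(
  c(3, 4, 3, 3, 1, 4, 1, 3, 2, 3,
    1, 1, 4, 2, 3, 3, 2, 6, 1, 1,
    3, 3, 2, 2, 2, 2, 1, 3, 2, 1,
    6, 1, 3, 2, 2, 1, 2, 2, 4, 2),
  nrow = 4, byrow = TRUE
))

# Comparing a data.frame with a scalar gives a logical matrix
is_one <- asteroids == 1
print(is.matrix(is_one))

# mean() and table() both treat the matrix as a plain vector
print(mean(is_one))
print(prop.table(table(is_one)))
```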
I have the following data.frame:
head(data.c)
mark high_mark mark_cum
5 0 0
7 1 1
7 1 2
NA 0 2
7 1 3
7 1 4
As there are NAs, I need to construct an additional column holding a running sequence that starts at 1. However, where mark is NA, the cell has to repeat the previous value. So it must look like this:
mark high_mark mark_cum mark_seq
5 0 0 1
7 1 1 2
7 1 2 3
NA 0 2 3
7 1 3 4
7 1 4 5
NA 0 4 5
1) cumsum This solution uses the fact that each mark_seq element equals the cumulative number of non-NA elements in mark at that point.
transform(data.c, mark_seq = cumsum(!is.na(mark)))
giving:
mark high_mark mark_cum mark_seq
1 5 0 0 1
2 7 1 1 2
3 7 1 2 3
4 NA 0 2 3
5 7 1 3 4
6 7 1 4 5
7 NA 0 4 5
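The idea behind the cumsum solution can be seen on the mark column alone. A small sketch, with the intermediate name `present` introduced only for illustration:

```r
mark <- c(5, 7, 7, NA, 7, 7, NA)

present  <- !is.na(mark)     # TRUE where a mark was recorded
mark_seq <- cumsum(present)  # the count goes up by one only at non-NA positions
print(mark_seq)
```

One edge case to keep in mind: if the very first mark is NA, the sequence starts at 0 because nothing has been counted yet.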
2) na.locf Here is a second solution using seq_along and na.locf (from zoo). It creates a sequence the same length as the number of non-NA elements in mark and uses replace to put them in the spots where the non-NA elements exist. Then na.locf is used to fill in the NAs with the prior values.
library(zoo)
transform(data.c, mark_seq=na.locf(replace(mark, !is.na(mark), seq_along(na.omit(mark)))))
3) mark_cum It was not stated in the question how the input column mark_cum is constructed but in the sample output in the question the mark_seq column equals the mark_cum column plus 1 so if that is always the case then an easy solution is:
transform(data.c, mark_seq = mark_cum + 1)
Note: We used this as the input:
Lines <- "mark high_mark mark_cum
5 0 0
7 1 1
7 1 2
NA 0 2
7 1 3
7 1 4
NA 0 4"
data.c <- read.table(text = Lines, header = TRUE)