R: sum consecutive duplicate rows and remove all but the first

I am stuck on a probably simple question: how do I sum consecutive duplicate rows and remove all but the first row? And if there is an NA between two duplicates (such as 2, NA, 2), how do I sum those as well and keep only the first entry?
So far so good; here is my sample data:
ia <- c(1, 1, 2, NA, 2, 1, 1, 1, 1, 2, 1, 2)
time <- c(4.5, 2.4, 3.6, 1.5, 1.2, 4.9, 6.4, 4.4, 4.7, 7.3, 2.3, 4.3)
a <- as.data.frame(cbind(ia, time))
sample output
a
ia time
1 1 4.5
2 1 2.4
3 2 3.6
4 NA 1.5
5 2 1.2
6 1 4.9
7 1 6.4
8 1 4.4
9 1 4.7
10 2 7.3
11 1 2.3
12 2 4.3
Now I want to
1.) Sum the "time" column over runs of consecutive identical ia values - i.e., if the number 1 occurs two or more times in a row, sum their times; in my case here, the first and second rows of time give 4.5 + 2.4.
2.) If there is an NA between two identical numbers in the ia column (i.e., ia = 2, NA, 2), also sum all of those times.
3.) Keep only the first occurrence of each ia run and delete the rest.
In the end, I would want to have something like this:
a
ia time
1 1 6.9
3 2 6.3
6 1 20.4
10 2 7.3
11 1 2.3
12 2 4.3
I found this for summing, but it does not take the consecutive runs into account:
aggregate(time~ia,data=a,FUN=sum)
and I found this for deleting:
a[cumsum(rle(as.numeric(a[,1]))$lengths),]
although the rle approach keeps the last entry, and I would want to keep the first. I also have no idea how to handle the NAs. If I have a pattern of 1, NA, 2, then the NA should NOT be counted with either of them; in this case the NA row should simply be removed.

With data.table (using zoo's na.locf, as RHertel suggested):
library(data.table)
library(zoo)
setDT(a)[na.locf(ia, fromLast = TRUE) == na.locf(ia),
         sum(time),
         by = .(id = cumsum(c(TRUE, !!diff(na.locf(ia)))))]
# id V1
#1: 1 6.9
#2: 2 6.3
#3: 3 20.4
#4: 4 7.3
#5: 5 2.3
#6: 6 4.3
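A variant of the same idea that also reports each run's ia value (a sketch, untested beyond this sample; it assumes ia does not start with NA, since na.locf would then shorten the vector). The first row of every run is necessarily non-NA, so ia[1] is safe:
library(data.table)
library(zoo)
# Same filter and grouping as above; j additionally returns the run's ia.
setDT(a)[na.locf(ia, fromLast = TRUE) == na.locf(ia),
         .(ia = ia[1], time = sum(time)),
         by = .(id = cumsum(c(TRUE, !!diff(na.locf(ia)))))]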

You first need to replace sequences of NAs with the values surrounding them (if they are the same). zoo's na.locf function fills in NAs with the last observation. By testing whether the result is the same whether you carry values backwards or forwards, you can filter out the NAs you don't want, then do the carrying forward:
library(dplyr)
library(zoo)
a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia))
#>    ia time
#> 1   1  4.5
#> 2   1  2.4
#> 3   2  3.6
#> 4   2  1.5
#> 5   2  1.2
#> 6   1  4.9
#> 7   1  6.4
#> 8   1  4.4
#> 9   1  4.7
#> 10  2  7.3
#> 11  1  2.3
#> 12  2  4.3
Now that you've fixed those NAs, you can group consecutive sets of values using cumsum. The full solution is:
result <- a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia)) %>%
  mutate(change = ia != lag(ia, default = 0)) %>%
  group_by(group = cumsum(change), ia) %>%
  summarise(time = sum(time))
result
#> Source: local data frame [6 x 3]
#> Groups: group [?]
#>
#>   group    ia  time
#>   (int) (dbl) (dbl)
#> 1     1     1   6.9
#> 2     2     2   6.3
#> 3     3     1  20.4
#> 4     4     2   7.3
#> 5     5     1   2.3
#> 6     6     2   4.3
If you want to get rid of the group column, use the additional lines:
result %>%
  ungroup() %>%
  select(-group)
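On a newer dplyr (>= 1.1, which postdates this answer), consecutive_id() expresses the change/cumsum trick directly; a hedged sketch of the same pipeline:
library(dplyr)
library(zoo)
a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia)) %>%
  group_by(group = consecutive_id(ia), ia) %>%  # new run id whenever ia changes
  summarise(time = sum(time), .groups = "drop")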

# df here is the sample data frame `a` from the question
nas <- which(is.na(df$ia))
add.index <- sapply(nas, function(x) {
  logi <- which(!is.na(df$ia))     # positions of the non-NA values
  aft  <- logi[logi > x][1]        # nearest non-NA position after the NA
  fore <- tail(logi[logi < x], 1)  # nearest non-NA position before the NA
  if (df$ia[aft] == df$ia[fore]) aft else NA
})
df$ia[nas] <- df$ia[add.index]
df <- df[complete.cases(df), ]
First we determine whether each NA in the column is surrounded by the same value. If yes, the surrounding value replaces the NA; otherwise the row is dropped by complete.cases. Consecutive NA values pose no problem, since only non-NA positions are searched.
Next we do a standard sum-by-group operation. cumsum lets us create a unique group id based on changes in the numbers.
df$grps <- cumsum(c(FALSE, df$ia[-length(df$ia)] != df$ia[-1])) + 1
aggregate(time ~ grps, df, sum)
# grps time
# 1 1 6.9
# 2 2 6.3
# 3 3 20.4
# 4 4 7.3
# 5 5 2.3
# 6 6 4.3
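The aggregate() call drops the ia column; if you want it back, one small base R follow-up (using the grps column just created):
res <- aggregate(time ~ grps, df, sum)
res$ia <- df$ia[!duplicated(df$grps)]  # first ia of each run
res[, c("ia", "time")]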
This is a base R approach. With packages like dplyr, zoo, or data.table, different options are available, as they come with specialized functions for exactly these steps.
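For instance, data.table's rleid() collapses the run-grouping step into one call (a sketch, assuming the NA fix above has already been applied to df):
library(data.table)
# rleid(ia) gives a new group id at every change in ia
setDT(df)[, .(time = sum(time)), by = .(grp = rleid(ia), ia)]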

Related

Impute missing values for missing dates

Imagine I have the following two data frames:
> sp
date value
1 2004-08-20 1
2 2004-08-23 2
3 2004-08-24 4
4 2004-08-25 5
5 2004-08-26 10
6 2004-08-27 11
> other
date value
1 2004-08-20 2
2 2004-08-23 4
3 2004-08-24 5
4 2004-08-25 10
5 2004-08-27 11
where the first column holds the dates and the second the value for each day. The reference data frame is sp, and I want to impute into other the dates and values that are missing with respect to sp. For instance, in this case other is missing the date "2004-08-26". I should add a new row to other with the date "2004-08-26" and a value given by the mean of the values at "2004-08-25" and "2004-08-27".
Could anyone suggest how I can do this?
Data
sp <- data.frame(date=c("2004-08-20", "2004-08-23", "2004-08-24", "2004-08-25",
"2004-08-26", "2004-08-27"), value=c(1, 2, 4, 5, 10, 11))
other <- data.frame(date=c("2004-08-20", "2004-08-23", "2004-08-24", "2004-08-25",
"2004-08-27"), value=c(2, 4, 5, 10, 11))
An option using zoo::na.approx:
library(dplyr)
sp %>%
  select(date) %>%
  left_join(other, by = 'date') %>%
  mutate(value = zoo::na.approx(value))
# date value
#1 2004-08-20 2.0
#2 2004-08-23 4.0
#3 2004-08-24 5.0
#4 2004-08-25 10.0
#5 2004-08-26 10.5
#6 2004-08-27 11.0
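Note that na.approx here interpolates on row position, not on calendar distance; for this single midpoint the two coincide. If you want the interpolation weighted by the actual day gaps, a sketch (assuming the date column parses with as.Date):
library(dplyr)
sp %>%
  select(date) %>%
  left_join(other, by = 'date') %>%
  mutate(value = zoo::na.approx(value, x = as.Date(date)))  # x = real dates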
If I understand correctly, you want to add the dates from sp that are missing in other.
You can merge other with just the date column of sp. Note that, by default, subsetting one-column data frames (and matrices) drops dimensions, so we need drop=FALSE.
The resulting NA can then be linearly interpolated using approx; because the missing date sits midway between its neighbours, linear interpolation gives exactly the desired mean.
other2 <- merge(other, sp[, 'date', drop=FALSE], all=TRUE) |>
  transform(value = approx(value, xout = seq_along(value))$y)
other2
# date value
# 1 2004-08-20 2.0
# 2 2004-08-23 4.0
# 3 2004-08-24 5.0
# 4 2004-08-25 10.0
# 5 2004-08-26 10.5 ## interpolated
# 6 2004-08-27 11.0
Note: for R < 4.1, which lacks the native pipe |>, do:
transform(merge(other, sp[, "date", drop = FALSE], all = TRUE),
          value = approx(value, xout = seq_along(value))$y)
# (output identical to the table above)

Nested For Loop in R to iterate between columns

I'm trying to generate a dataset in which the first column changes every three rows, and as the first column changes, the second and third columns change with it, as in the example below. I'm a little confused about how to achieve this with a nested for loop.
df <- NULL
for (CDRID in 1:3) {
  for (STARTDATE in 20200517:20200519) {
    for (PRIIDENTITY in 4:6) {
      # note: nesting all three loops crosses every combination (27 rows),
      # not the desired paired rows
      df <- rbind(df, data.frame(CDR_ID = CDRID,
                                 START_DATE = STARTDATE,
                                 PRI_IDENTITY = PRIIDENTITY))
    }
  }
}
df <- data.frame(ID=1:3, START_DATE=20200517:20200519, PRI_IDENTITY=4:6)
df[rep(seq_len(nrow(df)), each=3),]
# ID START_DATE PRI_IDENTITY
# 1 1 20200517 4
# 1.1 1 20200517 4
# 1.2 1 20200517 4
# 2 2 20200518 5
# 2.1 2 20200518 5
# 2.2 2 20200518 5
# 3 3 20200519 6
# 3.1 3 20200519 6
# 3.2 3 20200519 6
A thought on 20200517:20200519, assuming these are intended to be dates: this integer encoding is fragile in that it knows nothing about wrapping days between months. In R, it is better to use proper Date objects. With that, try:
df <- data.frame(ID = 1:3,
                 START_DATE = seq.Date(as.Date("2020-05-17"), as.Date("2020-05-19"), by = "days"),
                 PRI_IDENTITY = 4:6)
df[rep(seq_len(nrow(df)), each = 3), ]
ID START_DATE PRI_IDENTITY
1 1 2020-05-17 4
1.1 1 2020-05-17 4
1.2 1 2020-05-17 4
2 2 2020-05-18 5
2.1 2 2020-05-18 5
2.2 2 2020-05-18 5
3 3 2020-05-19 6
3.1 3 2020-05-19 6
3.2 3 2020-05-19 6
As akrun suggested, if you are using (or amenable to using) packages from the tidyverse, then instead of the rep(seq_len(... part, you can do
# df <- data.frame(...)
tidyr::uncount(df, 3)
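uncount() can also label the copies via its .id argument, which adds a within-row counter (a small illustration):
tidyr::uncount(df, 3, .id = "rep")  # adds a column "rep" running 1:3 per original row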

How to divide all previous observations by the last observation iteratively within a data frame column by group in R and then store the result

I have the following data frame:
data <- data.frame("Group" = c(1,1,1,1,1,1,1,1,2,2,2,2),
                   "Days" = c(1,2,3,4,5,6,7,8,1,2,3,4),
                   "Num" = c(10,12,23,30,34,40,50,60,2,4,8,12))
I need to take the last value in Num and divide it by all of the preceding values. Then I need to move to the second-to-last value in Num and do the same, until I reach the first value in each group.
Edited based on the comments below:
In plain language and showing all the math, starting with the first group as suggested below, I am trying to achieve the following:
Take 60 (last value in group 1) and:
Day Ratio Res
7 60/50 1.2
6 60/40 1.5
5 60/34 1.76
4 60/30 2
3 60/23 2.60
2 60/12 5
1 60/10 6
Then keep only the row that has the value 2, as I don't care about the others (I want the value that is greater than or equal to 2 and closest to 2), and also return the day of that value, which is 4. Then move on to 50 and do the following:
Day Ratio Res
6 50/40 1.25
5 50/34 1.47
4 50/30 1.67
3 50/23 2.17
2 50/12 4.17
1 50/10 5
Then keep only the row that has the value 2.17 and also return its day, which is 3. Then move on to 40 and do the same thing over again, then to 34, 30, 23, and 12; the last value (the Day 1 value) I don't care about. Then move on to the next group's last value (12) and repeat the same approach within that group (12/8, 12/4, 12/2; 8/4, 8/2; 4/2).
I would like to store the results of these divisions, but only the most recent result that is greater than or equal to 2, and I would also like to return the day on which that result was achieved. Basically, I am trying to calculate a doubling time for each day, grouped by Group. Normally I would use dplyr for this, but I am not sure how to link a loop with dplyr to take advantage of group_by. I could also be overlooking lapply or some variation thereof. My expected data frame with the results would ideally be this:
data2 <- data.frame(divres = c(NA,NA,2.3,2.5,2.833333333,3.333333333,2.173913043,2,NA,2,2,3),
                    obs_n = c(NA,NA,1,2,2,2,3,4,NA,1,2,2))
data3 <- bind_cols(data, data2)
I have tried this first loop to calculate the division, but I am lost as to how to move on to the next last value within each group. Right now it ignores the groups, though I obviously have not told it to group, as I am unclear how to do that outside of dplyr.
for (i in 1:nrow(data))
  data$test[i] <- ifelse(!is.na(data$Num), last(data$Num)/data$Num[i], NA)
I also get the following error when I run it (ifelse(!is.na(data$Num), ...) returns a vector as long as the whole column, while data$test[i] is a single element):
number of items to replace is not a multiple of replacement length
To store the division, I have tried this:
division <- function(x) {
  if (x >= 2) {
    return(x)
  } else {
    return(FALSE)
  }
}
for (i in 1:nrow(data)) {
  data$test[i] <- division(data$test[i])
}
Now, this approach works, but only if I run it once on the last observation and apply it to a single group. I have 209 groups and many days to run this over. I am not sure how to put the first for loop together with the division function, and I am also lost as to how to do this by group while moving through the successive last values. Any suggestions would be appreciated.
You can modify your division function to handle a vector and return a data frame with two columns, divres and ind; the latter is the row index that will be used to compute obs_n, as shown below:
division <- function(x) {
  lenx <- length(x)
  y <- vector(mode = "numeric", length = lenx)  # divres: latest ratio >= 2
  z <- vector(mode = "numeric", length = lenx)  # ind: row index of that ratio
  for (i in lenx:1) {
    ratios <- x[i] / x[1:i]       # current value divided by all preceding values
    hits   <- which(ratios >= 2)  # positions where the ratio reaches 2
    if (length(hits) == 0) {
      y[i] <- NA
      z[i] <- NA
    } else {
      y[i] <- ratios[max(hits)]   # the most recent ratio that is >= 2
      z[i] <- max(hits)
    }
  }
  data.frame(divres = y, ind = z)
}
Check the output of the division function created above, using data$Num as input:
> division(data$Num)
divres ind
1 NA NA
2 NA NA
3 2.300000 1
4 2.500000 2
5 2.833333 2
6 3.333333 2
7 2.173913 3
8 2.000000 4
9 NA NA
10 2.000000 9
11 2.000000 10
12 3.000000 10
Use cbind to combine the above output with the data frame data, use pipes and mutate from dplyr to look up the obs_n value in Days using ind, and select the appropriate columns to generate the desired data frame data2:
library(dplyr)
data2 <- cbind.data.frame(data, division(data$Num)) %>%
  mutate(obs_n = Days[ind]) %>%
  select(-ind)
Output
> data2
Group Days Num divres obs_n
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 2.300000 1
4 1 4 30 2.500000 2
5 1 5 34 2.833333 2
6 1 6 40 3.333333 2
7 1 7 50 2.173913 3
8 1 8 60 2.000000 4
9 2 1 2 NA NA
10 2 2 4 2.000000 1
11 2 3 8 2.000000 2
12 2 4 12 3.000000 2
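One caveat: division() above runs over the whole Num column, so ratios can cross group boundaries; it produces the right answer for this data only because no cross-group ratio happens to reach 2. A group-wise application is safer (a sketch using dplyr's group_modify(); ind then indexes within each group):
library(dplyr)
data %>%
  group_by(Group) %>%
  group_modify(~ cbind(.x, division(.x$Num))) %>%  # apply division() per group
  mutate(obs_n = Days[ind]) %>%                    # look up the day within the group
  select(-ind) %>%
  ungroup()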
You can create a function with a for loop to get the desired day, as given below, then use it to obtain divres inside a dplyr mutate:
obs_n <- function(x, days) {
  lst <- list()
  for (i in length(x):1) {
    # earlier days where (current value / earlier value) >= 2
    obs <- days[which(rev(x[i] / x[(i-1):1]) >= 2)]
    lst[[i]] <- if (length(obs) == 0) NA else max(obs)
  }
  unlist(lst)
}
Then use dense_rank to obtain the row number corresponding to each obs_n. This is needed in case the days are not consecutive, i.e. have gaps.
library(dplyr)
data %>%
  group_by(Group) %>%
  mutate(obs_n = obs_n(Num, Days), divres = Num/Num[dense_rank(obs_n)])
# A tibble: 12 x 5
# Groups: Group [2]
Group Days Num obs_n divres
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 1 2.3
4 1 4 30 2 2.5
5 1 5 34 2 2.83
6 1 6 40 2 3.33
7 1 7 50 3 2.17
8 1 8 60 4 2
9 2 1 2 NA NA
10 2 2 4 1 2
11 2 3 8 2 2
12 2 4 12 2 3
Explanation of dense ranks (from Wikipedia).
In dense ranking, items that compare equally receive the same ranking number, and the next item(s) receive the immediately following ranking number.
x <- c(NA, NA, 1,2,2,4,6)
dplyr::dense_rank(x)
# [1] NA NA  1  2  2  3  4
Compare with rank (default method = "average"). Note that NAs are ranked last by default:
rank(x)
# [1] 6.0 7.0 1.0 2.5 2.5 4.0 5.0

Multiple conditions then creating a new column

I have a dataset with two columns, and I need to create a third column based on conditions on the first two.
library(dplyr)
set.seed(1)
x1 <- sample(1:10, 100, replace = TRUE)
y1 <- sample(seq(1, 10, 0.1), 100, replace = TRUE)
z <- cbind(x1, y1)
unique(as.data.frame(z)$x1)
z %>% as.data.frame() %>% dplyr::filter(x1 == 3)
table(x1)
1 2 3 4 5 6 7 8 9 10
7 6 11 14 14 5 11 15 11 6
> z%>%as.data.frame()%>%dplyr::filter(x1==3)
x1 y1
1 3 6.9
2 3 9.5
3 3 10.0
4 3 5.6
5 3 4.1
6 3 2.5
7 3 5.3
8 3 9.5
9 3 5.5
10 3 8.9
11 3 1.2
For example, when I filter x1 == 3, the y1 values above can be seen; I need to write 1 on the 11th row (the minimum, 1.2) and 0 on the rest. In other words, I need to flag the minimum in that column. My original dataset has 43545 rows but only 638 unique numbers like x1. table(x1) shows that 1 is repeated 7 times, but in my dataset some values have a frequency of 1 and some a frequency of 100. I thought I should use case_when, but how can I check every y1 within a group to find the smallest and put a 1 there?
If I understand correctly, you are looking for the row with the minimal y1 value for each value of x1:
library(tidyverse)
z %>%
  as.data.frame() %>%
  group_by(x1) %>%
  arrange(y1) %>%  # sort values in increasing order
  mutate(flag = ifelse(row_number() == 1, 1, 0)) %>%  # flag the first row in each group
  ungroup()
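If you'd rather keep the original row order, a small alternative of the same idea (note that ties at the group minimum would all be flagged):
library(dplyr)
z %>%
  as.data.frame() %>%
  group_by(x1) %>%
  mutate(flag = as.integer(y1 == min(y1))) %>%  # 1 on the row(s) with the group minimum
  ungroup()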

Change Column Random Numbers To Serial

In a data set, there is a specific column whose values repeat at a regular interval. I want to replace them with increasing values, as explained below.
Column_B currently holds the repeating data:
Column_A Column_B
1.5 0
0.2 1
0.3 5
4.5 6
12.5 7
1.6 0
7.8 1
1.8 5
6.9 6
11.0 7
After transformation Column_B should have
Column_A Column_B
1.5 0
0.2 1
0.3 2
4.5 3
12.5 4
1.6 0
7.8 1
1.8 2
6.9 3
11.0 4
Is there a faster way to do this than creating a new column and then replacing Column_B with it? Thanks.
You can use recycling to fill the column with a repeating sequence; note that column assignment recycles only when the number of rows is a multiple of the sequence length. For example, if you want the sequence to be 64 long before repeating, you can use
DF$Column_B <- 0:(64 - 1L)
More generally, for patterns like your example in which each element within the repeating sequence is distinct, you can find the sequence length using which and then do the same thing:
seq.length <- which(DF$Column_B == DF$Column_B[1L])[2L] - 1L
DF$Column_B <- 0:(seq.length - 1L)
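A quick check of that logic on the example column (standalone, so it can be run as-is):
B <- c(0, 1, 5, 6, 7, 0, 1, 5, 6, 7)              # Column_B from the question
seq.length <- which(B == B[1L])[2L] - 1L          # second 0 is at position 6, so 5
rep(0:(seq.length - 1L), length.out = length(B))  # what the recycled assignment yields
# [1] 0 1 2 3 4 0 1 2 3 4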
We group by the cumulative sum of positions where Column_B is 0 (or, alternatively, where the next element decreases) and assign the within-group row sequence to Column_B:
library(data.table)
setDT(df1)[, Column_B := as.integer(seq_len(.N)-1), cumsum(Column_B==0)]
df1
# Column_A Column_B
# 1: 1.5 0
# 2: 0.2 1
# 3: 0.3 2
# 4: 4.5 3
# 5: 12.5 4
# 6: 1.6 0
# 7: 7.8 1
# 8: 1.8 2
# 9: 6.9 3
#10: 11.0 4
Or find the difference between adjacent elements of Column_B and take the cumulative sum of the decreases to create the grouping variable:
setDT(df1)[, Column_B := as.integer(seq_len(.N)-1), cumsum(c(TRUE, diff(Column_B)< 0))]
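A base R equivalent of the same grouping idea, using ave() on the original df1 (a sketch; it assumes each block starts at 0, as in the example):
grp <- cumsum(df1$Column_B == 0)  # new group at every 0
df1$Column_B <- ave(df1$Column_B, grp, FUN = seq_along) - 1L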
