Finding the closest rivals in the group - R

I have unbalanced panel company data, as follows:
time comid group sales closeone
1988m1 tw1701 1 2.45 tw1410
1988m1 tw1213 1 1.98 tw1701
1988m1 tw1707 1 2.67
1988m1 tw2702 1 9.45
1988m1 tw9902 1 4.16
1988m1 tw1410 1 2.57
1988m2 tw2601 3 27.44 tw2505
1988m2 tw2505 3 9.49
1988m2 tw1413 3 1.46
1988m2 tw2901 3 3.74
1988m2 tw1417 4 1.87 tw1506
1988m2 tw1506 4 3.24
1988m2 tw1215 4 3.58
My aim is to find the closest rival within the same group and time, just as the column closeone shows. For example, in the first row closeone is tw1410: within group = 1 and time = 1988m1, compute abs(sales - sales of tw1701) over all firms, take the minimum, and exclude zero (i.e. a firm can't be its own rival).

I'm not sure why you were downvoted; I didn't think this was trivial. This is how I solved it. There might be an easier way. I couldn't get data.table operations to set the final value, so I had to use a for loop.
Basically it sorts the data within each group, calculates the differences against the rows above and below within the same group, finds which of those two values is smaller, and then sets the rival by that reference.
library(data.table)
setDT(dat)
setorder(dat, time, group, -sales)

# Differences against the row above and the row below within each time/group
dat[, Diff := c(NA, diff(sales)), by = .(time, group)]
dat[, Diff2 := c(diff(sales), NA), by = .(time, group)]

# Min = 1 if the row above is closer, 2 if the row below is closer
dat[, Min := ifelse(abs(Diff) < abs(Diff2), 1, 2)]
dat[, Min := ifelse(is.na(Diff), 2, Min)]   # first row of a group: only a row below
dat[, Min := ifelse(is.na(Diff2), 1, Min)]  # last row of a group: only a row above

dat[, Rival := NA]
for (i in 1:nrow(dat)) {
  if (dat$Min[i] == 2) {
    dat$Rival[i] <- as.character(dat[i + 1, comid])
  } else {
    dat$Rival[i] <- as.character(dat[i - 1, comid])
  }
}
> dat
time comid group sales Diff Diff2 Min Rival
1: 1988m1 tw2702 1 9.45 NA -5.29 2 tw9902
2: 1988m1 tw9902 1 4.16 -5.29 -1.49 2 tw1707
3: 1988m1 tw1707 1 2.67 -1.49 -0.10 2 tw1410
4: 1988m1 tw1410 1 2.57 -0.10 -0.12 1 tw1707
5: 1988m1 tw1701 1 2.45 -0.12 -0.47 1 tw1410
6: 1988m1 tw1213 1 1.98 -0.47 NA 1 tw1701
7: 1988m2 tw2601 3 27.44 NA -17.95 2 tw2505
8: 1988m2 tw2505 3 9.49 -17.95 -5.75 2 tw2901
9: 1988m2 tw2901 3 3.74 -5.75 -2.28 2 tw1413
10: 1988m2 tw1413 3 1.46 -2.28 NA 1 tw2901
11: 1988m2 tw1215 4 3.58 NA -0.34 2 tw1506
12: 1988m2 tw1506 4 3.24 -0.34 -1.37 1 tw1215
13: 1988m2 tw1417 4 1.87 -1.37 NA 1 tw1506
If anyone has a better solution I'd love to see it.
EDIT
The reason I couldn't get this into vector format, apparently, was that comid was a factor. I have no idea why that would break the function, but when I changed it to character it worked.
Replace the for loop with this:
dat$comid = as.character(dat$comid)
dat[, "Rival" := ifelse(Min == 2, shift(comid, type = "lead"), shift(comid, type = "lag"))]
> dat
time comid group sales Diff Diff2 Min Rival
1: 1988m1 tw2702 1 9.45 NA -5.29 2 tw9902
2: 1988m1 tw9902 1 4.16 -5.29 -1.49 2 tw1707
3: 1988m1 tw1707 1 2.67 -1.49 -0.10 2 tw1410
4: 1988m1 tw1410 1 2.57 -0.10 -0.12 1 tw1707
5: 1988m1 tw1701 1 2.45 -0.12 -0.47 1 tw1410
6: 1988m1 tw1213 1 1.98 -0.47 NA 1 tw1701
7: 1988m2 tw2601 3 27.44 NA -17.95 2 tw2505
8: 1988m2 tw2505 3 9.49 -17.95 -5.75 2 tw2901
9: 1988m2 tw2901 3 3.74 -5.75 -2.28 2 tw1413
10: 1988m2 tw1413 3 1.46 -2.28 NA 1 tw2901
11: 1988m2 tw1215 4 3.58 NA -0.34 2 tw1506
12: 1988m2 tw1506 4 3.24 -0.34 -1.37 1 tw1215
13: 1988m2 tw1417 4 1.87 -1.37 NA 1 tw1506
That should run a lot faster.
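For what it's worth, the criterion can also be coded directly: within each time/group, find for each row the comid with the smallest non-self absolute sales difference. A minimal sketch (assuming dat as above, with comid already converted to character; Rival2 is just an illustrative column name). It skips the sort-and-neighbour bookkeeping and handles single-row groups, at the cost of an O(n^2) scan per group:
dat[, Rival2 := if (.N < 2) NA_character_ else
      comid[vapply(seq_len(.N), function(i) {
        d <- abs(sales - sales[i])
        d[i] <- Inf  # a firm can't be its own rival
        which.min(d)
      }, integer(1))],
    by = .(time, group)]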

Related

Method in R to find difference between rows with varying row spacing

I want to add an extra column to a data frame which displays the difference between certain rows, where the distance between the rows depends on values in the table.
I found out that:
mutate(Col_new = Col_1 - lead(Col_1, n = x))
can find the difference for a fixed n, but only an integer constant can be used as input. How would you find the difference between rows when the distance between them varies?
I am trying to get the output in Col_new, which is the difference between row i and row i+n, where n takes the value in column Count. (The data is rounded, so there might be 0.01 discrepancies in Col_new.)
Col_1 Count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
Data:
df <- data.frame(Col_1 = c(0.90, 1.58, 1.89, 1.84, 1.57, 1.30, 1.35,
                           1.56, 2.24, 3.14, 4.04, 4.72, 5.04, 4.99,
                           4.71, 4.44, 4.39, 4.70, 5.38, 6.28),
                 Count = sort(rep(1:4, 5)))
Some code that generates the intended output, but can undoubtedly be made more efficient:
library(dplyr)
df %>%
  mutate(col_2 = sapply(1:4, function(s) lead(Col_1, n = s))) %>%
  rowwise() %>%
  mutate(Col_new = Col_1 - col_2[Count]) %>%
  select(-col_2)
Output:
# A tibble: 20 × 3
# Rowwise:
Col_1 Count Col_new
<dbl> <int> <dbl>
1 0.9 1 -0.68
2 1.58 1 -0.310
3 1.89 1 0.0500
4 1.84 1 0.27
5 1.57 1 0.27
6 1.3 2 -0.26
7 1.35 2 -0.89
8 1.56 2 -1.58
9 2.24 2 -1.8
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.0100
13 5.04 3 0.600
14 4.99 3 0.600
15 4.71 3 0.0100
16 4.44 4 -1.84
17 4.39 4 NA
18 4.7 4 NA
19 5.38 4 NA
20 6.28 4 NA
df %>% mutate(Col_new = case_when(
  Count == 1 ~ Col_1 - lead(Col_1, n = 1),
  Count == 2 ~ Col_1 - lead(Col_1, n = 2),
  Count == 3 ~ Col_1 - lead(Col_1, n = 3),
  Count == 4 ~ Col_1 - lead(Col_1, n = 4),
  Count == 5 ~ Col_1 - lead(Col_1, n = 5)
))
Col_1 Count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
This gives the desired result but is not a good solution for more cases: with 10 or more distinct counts, another solution is required.
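For arbitrary Count values, a scalable alternative is plain vectorised indexing: row i pairs with row i + Count[i], and indices past the end of the vector naturally return NA. A minimal sketch (assuming df as defined above):
idx <- seq_len(nrow(df)) + df$Count     # partner row for each row
df$Col_new <- df$Col_1 - df$Col_1[idx]  # out-of-range indices yield NA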

In R, how do I keep the first single occurrence of a row based on a repeated value in one column?

I want to keep the row with the first occurrence of a changed value in a column (the last column in the example below). My data frame is an xts object.
In the example below, I would keep the first row with a 2 in the last column, but not the next two because they are unchanged from the first 2. I'd then keep the next three rows (the sequence 3, 2, 3) because the value changes each time, remove the next four because it doesn't change, and so on. The final data frame would look like the smaller one below the original.
Any help is appreciated!
Original Dataframe
2007-01-31 2.72 4.75 2
2007-02-28 2.82 4.75 2
2007-03-31 2.85 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-07-31 4.19 4.75 3
2007-08-31 4.55 4.75 3
2007-09-30 4.20 4.75 3
2007-10-31 4.36 4.75 3
2007-11-30 5.75 4.76 4
2007-12-31 5.92 4.76 4
2008-01-31 6.95 4.87 4
2008-02-29 7.67 4.87 4
2008-03-31 8.21 4.90 4
2008-04-30 6.86 4.91 1
2008-05-31 6.53 5.07 1
2008-06-30 7.35 5.08 1
2008-07-31 8.00 5.13 4
2008-08-31 8.36 5.19 4
Final Dataframe
2007-01-31 2.72 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-11-30 5.75 4.76 4
2008-04-30 6.86 4.91 1
2008-07-31 8.00 5.13 4
Here's another solution using run-length encoding, rle().
lens <- rle(df$V4)$lengths
df[cumsum(lens) - lens + 1,]
Output:
V1 V2 V3 V4
1 2007-01-31 2.72 4.75 2
4 2007-04-30 2.74 4.75 3
5 2007-05-31 2.46 4.75 2
6 2007-06-30 2.98 4.75 3
11 2007-11-30 5.75 4.76 4
16 2008-04-30 6.86 4.91 1
19 2008-07-31 8.00 5.13 4
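To see why cumsum(lens) - lens + 1 picks out the first row of each run, here is a tiny worked example (values are illustrative):
v <- c(2, 2, 2, 3, 2, 3, 3)
lens <- rle(v)$lengths   # 3 1 1 2
cumsum(lens)             # 3 4 5 7 -> last index of each run
cumsum(lens) - lens + 1  # 1 4 5 6 -> first index of each run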
You can use data.table::shift to filter, and add the first row back in with rbind:
library(data.table)
rbind(setDT(dt)[1], dt[v3 != shift(v3)])
Or an equivalent approach using dplyr:
library(dplyr)
bind_rows(dt[1, ], filter(dt, v3 != lag(v3)))
Output:
date v1 v2 v3
<IDat> <num> <num> <int>
1: 2007-01-31 2.72 4.75 2
2: 2007-04-30 2.74 4.75 3
3: 2007-05-31 2.46 4.75 2
4: 2007-06-30 2.98 4.75 3
5: 2007-11-30 5.75 4.76 4
6: 2008-04-30 6.86 4.91 1
7: 2008-07-31 8.00 5.13 4
DATA
x <- "
2007-01-31 2.72 4.75 2
2007-02-28 2.82 4.75 2
2007-03-31 2.85 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-07-31 4.19 4.75 3
2007-08-31 4.55 4.75 3
2007-09-30 4.20 4.75 3
2007-10-31 4.36 4.75 3
2007-11-30 5.75 4.76 4
2007-12-31 5.92 4.76 4
2008-01-31 6.95 4.87 4
2008-02-29 7.67 4.87 4
2008-03-31 8.21 4.90 4
2008-04-30 6.86 4.91 1
2008-05-31 6.53 5.07 1
2008-06-30 7.35 5.08 1
2008-07-31 8.00 5.13 4
2008-08-31 8.36 5.19 4
"
df <- read.table(textConnection(x), header = FALSE)
and use these two lines:
df$V5 <- c(1, diff(df$V4))
df[abs(df$V5) > 0, ][1:4]
#> V1 V2 V3 V4
#> 1 2007-01-31 2.72 4.75 2
#> 4 2007-04-30 2.74 4.75 3
#> 5 2007-05-31 2.46 4.75 2
#> 6 2007-06-30 2.98 4.75 3
#> 11 2007-11-30 5.75 4.76 4
#> 16 2008-04-30 6.86 4.91 1
#> 19 2008-07-31 8.00 5.13 4
Created on 2022-06-12 by the reprex package (v2.0.1)
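The idea behind those two lines: diff flags the positions where V4 differs from the previous row, and the leading 1 force-keeps the first row. A tiny illustration (values are illustrative):
v <- c(2, 2, 2, 3, 2, 3, 3)
c(1, diff(v))  # 1 0 0 1 -1 1 0 -> nonzero marks the first row or a change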

R data.table Creating simple custom function [duplicate]

This question already has answers here:
Apply a function to every specified column in a data.table and update by reference
(7 answers)
Closed 2 years ago.
I am currently working in R on a data set that looks somewhat like the following (except it holds millions of rows and more variables):
pid agedays wtkg htcm bmi haz waz whz
1 2 1.92 44.2 9.74 -2.72 -3.23 NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70
I am working to create a function in which the following variables are added:
haz_1.5, waz_1.5, whz_1.5, htcm_1.5, wtkg_1.5, and bmi_1.5
Each variable follows the same pattern of criteria as below:
when !is.na(haz) and agedays > 61-45 and agedays <= 61-15, haz_1.5 holds the value of haz (and is NA otherwise).
The new data set should look like the following (except bmi_1.5, wtkg_1.5, and htcm_1.5 are omitted from the output below so the table sample fits in the box):
pid agedays wtkg htcm bmi haz waz whz haz_1.5 waz_1.5 whz_1.5
1 2 1.92 44.2 9.74 -2.72 -3.23 NA NA NA NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48 NA NA NA
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14 NA NA NA
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54 NA NA NA
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18 NA NA NA
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70 NA NA NA
Here's the code that I've tried so far:
measure <- list("haz", "waz", "whz", "htcm", "wtkg", "bmi")
set_1.5_months <- function(x, y, z){
  maled_anthro[!is.na(z) & agedays > (x - 45) & agedays <= (x - 15), y := z]
}
for(i in 1:length(measure)){
  z <- measure[i]
  y <- paste(measure[i], "1.5", sep = "_")
  x <- 61
  maled_anthro_1 <- set_1.5_months(x, y, z)
}
The code above has not been successful. I just end up with a new variable "y" added to the original data table that holds the values "bmi" or NA. Can someone help me figure out where I went wrong with this code?
I'd like to keep the function formatted much as above (easy to change), since I have other similar functions to create in which the values 1.5 and x = 61 will be swapped out for other numbers, and I like that these are relatively easy to change in the current format.
I believe the following is an idiomatic way to create new columns by applying a function to many existing columns.
Note that I've left the condition as it was, negating it all, to keep the code as close to the question's as possible.
library(data.table)
setDT(maled_anthro)

set_1.5_months <- function(y, agedays, x = 61){
  z <- y
  is.na(z) <- !(!is.na(y) & agedays > (x - 45) & agedays <= (x - 15))
  z
}

measure <- c("haz", "waz", "whz", "htcm", "wtkg", "bmi")
new_measure <- paste(measure, "1.5", sep = "_")
maled_anthro[, (new_measure) := lapply(.SD, function(y) set_1.5_months(y, agedays, x = 61)),
             .SDcols = measure]
# pid agedays wtkg htcm bmi haz waz whz haz_1.5 waz_1.5 whz_1.5 htcm_1.5 wtkg_1.5 bmi_1.5
#1: 1 2 1.92 44.2 9.74 -2.72 -3.23 NA NA NA NA NA NA NA
#2: 1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00 -2.21 -3.03 -2.00 49.2 2.68 11.07
#3: 1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48 NA NA NA NA NA NA
#4: 1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14 NA NA NA NA NA NA
#5: 2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54 NA NA NA NA NA NA
#6: 2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79 -0.14 -0.58 -0.79 53.1 3.78 13.41
#7: 2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18 NA NA NA NA NA NA
#8: 2 104 5.82 61.3 15.49 0.23 -0.38 -0.70 NA NA NA NA NA NA
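As for why the original loop failed: inside data.table's j, y := z assigns to a column literally named y, and z there is just the string holding a column name. A hedged sketch of how the original function could be repaired with programmatic assignment, wrapping the target name in parentheses and fetching the source column with get() (assuming maled_anthro is already a data.table and measure is the character vector above):
set_1.5_months <- function(x, y, z){
  maled_anthro[!is.na(get(z)) & agedays > (x - 45) & agedays <= (x - 15),
               (y) := get(z)]  # rows outside the condition get NA in the new column
}
for(m in measure){
  set_1.5_months(x = 61, y = paste(m, "1.5", sep = "_"), z = m)
}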
Data
maled_anthro <- read.table(text = "
pid agedays wtkg htcm bmi haz waz whz
1 2 1.92 44.2 9.74 -2.72 -3.23 NA
1 29 2.68 49.2 11.07 -2.21 -3.03 -2.00
1 61 3.63 52.0 13.42 -2.49 -2.62 -0.48
1 89 4.11 55.0 13.59 -2.20 -2.70 -1.14
2 1 2.40 48.1 10.37 -0.65 -1.88 -2.54
2 28 3.78 53.1 13.41 -0.14 -0.58 -0.79
2 56 4.53 55.2 14.87 -0.68 -0.74 -0.18
2 104 5.82 61.3 15.49 0.23 -0.38 -0.70
", header = TRUE)

Replace ID variable values with counts of value occurrences

I have a data frame like:
DATE x y ID
06/10/2003 7.21 0.651 1
12/10/2003 5.99 0.428 1
18/10/2003 4.68 1.04 1
24/10/2003 3.47 0.363 1
30/10/2003 2.42 0.507 1
02/05/2010 2.72 0.47 2
05/05/2010 2.6 0.646 2
08/05/2010 2.67 0.205 2
11/05/2010 3.57 0.524 2
12/05/2010 0.428 4.68 3
13/05/2010 1.04 3.47 3
14/05/2010 0.363 2.42 3
18/10/2003 0.507 2.52 3
24/10/2003 0.418 4.68 3
30/10/2003 0.47 3.47 3
29/04/2010 0.646 2.42 4
18/10/2003 3.47 2.52 4
I have the count of the number of rows per group in column ID as an integer vector: 5 4 6 2.
Is there a way to replace the group values in column ID with this integer vector 5 4 6 2?
The output I am expecting is:
DATE x y ID
06/10/2003 7.21 0.651 5
12/10/2003 5.99 0.428 5
18/10/2003 4.68 1.04 5
24/10/2003 3.47 0.363 5
30/10/2003 2.42 0.507 5
02/05/2010 2.72 0.47 4
05/05/2010 2.6 0.646 4
08/05/2010 2.67 0.205 4
11/05/2010 3.57 0.524 4
12/05/2010 0.428 4.68 6
13/05/2010 1.04 3.47 6
14/05/2010 0.363 2.42 6
18/10/2003 0.507 2.52 6
24/10/2003 0.418 4.68 6
30/10/2003 0.47 3.47 6
29/04/2010 0.646 2.42 2
18/10/2003 3.47 2.52 2
I am quite new to R and tried to find out whether there is an existing replace function for this, but I'm having a hard time. Any help is much appreciated.
The above data is just an example to illustrate my requirement.
A compact solution with the data.table-package:
library(data.table)
setDT(mydf)[, ID := .N, by = ID][]
which gives:
> mydf
DATE x y ID
1: 06/10/2003 7.210 0.651 5
2: 12/10/2003 5.990 0.428 5
3: 18/10/2003 4.680 1.040 5
4: 24/10/2003 3.470 0.363 5
5: 30/10/2003 2.420 0.507 5
6: 02/05/2010 2.720 0.470 4
7: 05/05/2010 2.600 0.646 4
8: 08/05/2010 2.670 0.205 4
9: 11/05/2010 3.570 0.524 4
10: 12/05/2010 0.428 4.680 6
11: 13/05/2010 1.040 3.470 6
12: 14/05/2010 0.363 2.420 6
13: 18/10/2003 0.507 2.520 6
14: 24/10/2003 0.418 4.680 6
15: 30/10/2003 0.470 3.470 6
16: 29/04/2010 0.646 2.420 2
17: 18/10/2003 3.470 2.520 2
What this does:
setDT(mydf) converts the dataframe to a data.table
by = ID groups by ID
ID := .N replaces the original value of ID with the count by group
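If you would rather keep the original ID, the same idea can write the count to a new column instead (a small variant of the code above; the column name n_rows is arbitrary):
setDT(mydf)[, n_rows := .N, by = ID][]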
You can use the ave() function to calculate how many rows each ID takes up. In the example below I created a new variable ID2, but you could replace the original ID if you want.
(I included code to create your data in R below, but when you ask questions in the future please include your data in the question by using the dput() function on the data object. That's what I did to make the code below.)
mydata <- structure(list(DATE = c("06/10/2003", "12/10/2003", "18/10/2003",
"24/10/2003", "30/10/2003", "02/05/2010", "05/05/2010", "08/05/2010",
"11/05/2010", "12/05/2010", "13/05/2010", "14/05/2010", "18/10/2003",
"24/10/2003", "30/10/2003", "29/04/2010", "18/10/2003"),
x = c(7.21, 5.99, 4.68, 3.47, 2.42, 2.72, 2.6, 2.67, 3.57, 0.428, 1.04, 0.363,
0.507, 0.418, 0.47, 0.646, 3.47),
y = c(0.651, 0.428, 1.04, 0.363, 0.507, 0.47, 0.646, 0.205, 0.524, 4.68, 3.47,
2.42, 2.52, 4.68, 3.47, 2.42, 2.52),
ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4)),
.Names = c("DATE", "x", "y", "ID"),
class = c("data.frame"),
row.names = c(NA, -17L))
# ave() takes an input object, an object of group IDs of the same length
# as the input object, and a function to apply to the input object split across groups
mydata$ID2 <- ave(mydata$ID, mydata$ID, FUN = length)
mydata
DATE x y ID ID2
1 06/10/2003 7.210 0.651 1 5
2 12/10/2003 5.990 0.428 1 5
3 18/10/2003 4.680 1.040 1 5
4 24/10/2003 3.470 0.363 1 5
5 30/10/2003 2.420 0.507 1 5
6 02/05/2010 2.720 0.470 2 4
7 05/05/2010 2.600 0.646 2 4
8 08/05/2010 2.670 0.205 2 4
9 11/05/2010 3.570 0.524 2 4
10 12/05/2010 0.428 4.680 3 6
11 13/05/2010 1.040 3.470 3 6
12 14/05/2010 0.363 2.420 3 6
13 18/10/2003 0.507 2.520 3 6
14 24/10/2003 0.418 4.680 3 6
15 30/10/2003 0.470 3.470 3 6
16 29/04/2010 0.646 2.420 4 2
17 18/10/2003 3.470 2.520 4 2
# if you want to replace the original ID variable, you can assign to it
# instead of adding a new variable
mydata$ID <- ave(mydata$ID, mydata$ID, FUN = length)
A solution with dplyr:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(ID2 = n()) %>%
ungroup() %>%
mutate(ID = ID2) %>%
select(-ID2)
Edit:
I've just found a solution that's a bit cleaner than the above:
df %>%
group_by(ID2 = ID) %>%
mutate(ID = n()) %>%
select(-ID2)
Result:
# A tibble: 17 x 4
DATE x y ID
<fctr> <dbl> <dbl> <int>
1 06/10/2003 7.210 0.651 5
2 12/10/2003 5.990 0.428 5
3 18/10/2003 4.680 1.040 5
4 24/10/2003 3.470 0.363 5
5 30/10/2003 2.420 0.507 5
6 02/05/2010 2.720 0.470 4
7 05/05/2010 2.600 0.646 4
8 08/05/2010 2.670 0.205 4
9 11/05/2010 3.570 0.524 4
10 12/05/2010 0.428 4.680 6
11 13/05/2010 1.040 3.470 6
12 14/05/2010 0.363 2.420 6
13 18/10/2003 0.507 2.520 6
14 24/10/2003 0.418 4.680 6
15 30/10/2003 0.470 3.470 6
16 29/04/2010 0.646 2.420 2
17 18/10/2003 3.470 2.520 2
Notes:
The reason behind ungroup() %>% mutate(ID = ID2) %>% select(-ID2) is that dplyr doesn't allow mutating grouping variables. So this would not work:
df %>%
group_by(ID) %>%
mutate(ID = n())
Error in mutate_impl(.data, dots) : Column ID can't be modified
because it's a grouping variable
If you don't care about replacing the original ID column, you can just do:
df %>%
group_by(ID) %>%
mutate(ID2 = n())
Alternative Result:
# A tibble: 17 x 5
# Groups: ID [4]
DATE x y ID ID2
<fctr> <dbl> <dbl> <int> <int>
1 06/10/2003 7.210 0.651 1 5
2 12/10/2003 5.990 0.428 1 5
3 18/10/2003 4.680 1.040 1 5
4 24/10/2003 3.470 0.363 1 5
5 30/10/2003 2.420 0.507 1 5
6 02/05/2010 2.720 0.470 2 4
7 05/05/2010 2.600 0.646 2 4
8 08/05/2010 2.670 0.205 2 4
9 11/05/2010 3.570 0.524 2 4
10 12/05/2010 0.428 4.680 3 6
11 13/05/2010 1.040 3.470 3 6
12 14/05/2010 0.363 2.420 3 6
13 18/10/2003 0.507 2.520 3 6
14 24/10/2003 0.418 4.680 3 6
15 30/10/2003 0.470 3.470 3 6
16 29/04/2010 0.646 2.420 4 2
17 18/10/2003 3.470 2.520 4 2
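For completeness, a lookup-based base R sketch achieves the same replacement: table() counts the rows per ID, and indexing the table by each row's ID maps the counts back (assuming mydata as in the ave() answer):
counts <- table(mydata$ID)
mydata$ID <- as.integer(counts[as.character(mydata$ID)])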

Error: missing value where TRUE/FALSE needed

WEEK PRICE QUANTITY SALE_PRICE TYPE
1 4992 5.99 2847.50 0.00 3
2 4995 3.33 36759.00 3.33 3
3 4996 5.99 2517.00 0.00 3
4 4997 5.49 2858.50 0.00 3
5 5001 3.33 32425.00 3.33 3
6 5002 5.49 4205.50 0.00 3
7 5004 5.99 4329.50 0.00 3
8 5006 2.74 55811.00 2.74 3
9 5007 5.49 4133.00 0.00 3
10 5008 5.99 4074.00 0.00 3
11 5009 3.99 12125.25 3.99 3
12 5017 2.74 77645.00 2.74 3
13 5018 5.49 5315.50 0.00 3
14 5020 2.74 78699.00 2.74 3
15 5021 5.49 5158.50 0.00 3
16 5023 5.99 5315.00 0.00 3
17 5024 5.49 6545.00 0.00 3
18 5025 3.33 63418.00 3.33 3
If there are consecutive zero sale-price entries, I want to keep only the last entry of each run. For example, I want to remove week 4996 and keep week 4997; I want to keep week 5004 and remove week 5002. Similarly, I want to delete weeks 5021 and 5023 and keep week 5024.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), then create a grouping variable with rleid based on the logical vector marking 0 in 'SALE_PRICE' (!SALE_PRICE). Using 'grp' as the grouping variable, we take the last row of the Subset of Data.table (.SD[.N]) if the 'SALE_PRICE' elements are all 0, or else keep .SD, i.e. the full rows, for that group.
library(data.table)
setDT(df1)[, grp := rleid(!SALE_PRICE)
           ][, if (all(!SALE_PRICE)) .SD[.N] else .SD, grp
           ][, grp := NULL][]
# WEEK PRICE QUANTITY SALE_PRICE TYPE
# 1: 4992 5.99 2847.50 0.00 3
# 2: 4995 3.33 36759.00 3.33 3
# 3: 4997 5.49 2858.50 0.00 3
# 4: 5001 3.33 32425.00 3.33 3
# 5: 5004 5.99 4329.50 0.00 3
# 6: 5006 2.74 55811.00 2.74 3
# 7: 5008 5.99 4074.00 0.00 3
# 8: 5009 3.99 12125.25 3.99 3
# 9: 5017 2.74 77645.00 2.74 3
#10: 5018 5.49 5315.50 0.00 3
#11: 5020 2.74 78699.00 2.74 3
#12: 5024 5.49 6545.00 0.00 3
#13: 5025 3.33 63418.00 3.33 3
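As a quick illustration of the grouping trick: rleid(!SALE_PRICE) gives every run of consecutive zero prices its own id, so .SD[.N] can pick the run's last row (values here are illustrative):
library(data.table)
sp <- c(0, 3.33, 0, 0, 2.74)
rleid(!sp)  # 1 2 3 3 4 -> the consecutive zeros (rows 3 and 4) share group 3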
Or an option using dplyr: create a grouping variable with diff and cumsum, then filter the rows to keep only the last row of each run of zero 'SALE_PRICE', or (|) the rows where 'SALE_PRICE' is not 0.
library(dplyr)
df1 %>%
  group_by(grp = cumsum(c(TRUE, diff(!SALE_PRICE) != 0))) %>%
  filter(!duplicated(!SALE_PRICE, fromLast = TRUE) | SALE_PRICE != 0) %>%
  select(-grp)
# grp WEEK PRICE QUANTITY SALE_PRICE TYPE
# (int) (int) (dbl) (dbl) (dbl) (int)
#1 1 4992 5.99 2847.50 0.00 3
#2 2 4995 3.33 36759.00 3.33 3
#3 3 4997 5.49 2858.50 0.00 3
#4 4 5001 3.33 32425.00 3.33 3
#5 5 5004 5.99 4329.50 0.00 3
#6 6 5006 2.74 55811.00 2.74 3
#7 7 5008 5.99 4074.00 0.00 3
#8 8 5009 3.99 12125.25 3.99 3
#9 8 5017 2.74 77645.00 2.74 3
#10 9 5018 5.49 5315.50 0.00 3
#11 10 5020 2.74 78699.00 2.74 3
#12 11 5024 5.49 6545.00 0.00 3
#13 12 5025 3.33 63418.00 3.33 3
