WEEK PRICE QUANTITY SALE_PRICE TYPE
1 4992 5.99 2847.50 0.00 3
2 4995 3.33 36759.00 3.33 3
3 4996 5.99 2517.00 0.00 3
4 4997 5.49 2858.50 0.00 3
5 5001 3.33 32425.00 3.33 3
6 5002 5.49 4205.50 0.00 3
7 5004 5.99 4329.50 0.00 3
8 5006 2.74 55811.00 2.74 3
9 5007 5.49 4133.00 0.00 3
10 5008 5.99 4074.00 0.00 3
11 5009 3.99 12125.25 3.99 3
12 5017 2.74 77645.00 2.74 3
13 5018 5.49 5315.50 0.00 3
14 5020 2.74 78699.00 2.74 3
15 5021 5.49 5158.50 0.00 3
16 5023 5.99 5315.00 0.00 3
17 5024 5.49 6545.00 0.00 3
18 5025 3.33 63418.00 3.33 3
If there are consecutive entries with a sale price of 0, I want to keep only the last entry of each such run. For example, I want to remove week 4996 and keep week 4997; I want to keep week 5004 and remove week 5002. Similarly, I want to delete weeks 5021 and 5023 and keep week 5024.
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), then create a grouping variable with rleid based on a logical vector flagging the 0s in 'SALE_PRICE' (!SALE_PRICE). Using 'grp' as the grouping variable, we take the last row of the Subset of Data.table (.SD[.N]) if the 'SALE_PRICE' elements are all 0, or else keep .SD, i.e. the full set of rows for that group.
library(data.table)
setDT(df1)[, grp := rleid(!SALE_PRICE)
  ][, if (all(!SALE_PRICE)) .SD[.N] else .SD, grp
  ][, grp := NULL][]
# WEEK PRICE QUANTITY SALE_PRICE TYPE
# 1: 4992 5.99 2847.50 0.00 3
# 2: 4995 3.33 36759.00 3.33 3
# 3: 4997 5.49 2858.50 0.00 3
# 4: 5001 3.33 32425.00 3.33 3
# 5: 5004 5.99 4329.50 0.00 3
# 6: 5006 2.74 55811.00 2.74 3
# 7: 5008 5.99 4074.00 0.00 3
# 8: 5009 3.99 12125.25 3.99 3
# 9: 5017 2.74 77645.00 2.74 3
#10: 5018 5.49 5315.50 0.00 3
#11: 5020 2.74 78699.00 2.74 3
#12: 5024 5.49 6545.00 0.00 3
#13: 5025 3.33 63418.00 3.33 3
Or an option using dplyr: create a grouping variable with diff and cumsum, then filter the rows to keep only the last row of each run where 'SALE_PRICE' is 0, or (|) the rows where 'SALE_PRICE' is not 0.
library(dplyr)
df1 %>%
  group_by(grp = cumsum(c(TRUE, diff(!SALE_PRICE) != 0))) %>%
  filter(!duplicated(!SALE_PRICE, fromLast = TRUE) | SALE_PRICE != 0) %>%
  select(-grp)
# grp WEEK PRICE QUANTITY SALE_PRICE TYPE
# (int) (int) (dbl) (dbl) (dbl) (int)
#1 1 4992 5.99 2847.50 0.00 3
#2 2 4995 3.33 36759.00 3.33 3
#3 3 4997 5.49 2858.50 0.00 3
#4 4 5001 3.33 32425.00 3.33 3
#5 5 5004 5.99 4329.50 0.00 3
#6 6 5006 2.74 55811.00 2.74 3
#7 7 5008 5.99 4074.00 0.00 3
#8 8 5009 3.99 12125.25 3.99 3
#9 8 5017 2.74 77645.00 2.74 3
#10 9 5018 5.49 5315.50 0.00 3
#11 10 5020 2.74 78699.00 2.74 3
#12 11 5024 5.49 6545.00 0.00 3
#13 12 5025 3.33 63418.00 3.33 3
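(The grp column still appears in the output because select() retains grouping variables; adding ungroup() before the select(-grp) step would drop it.)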
I want to add an extra column to a dataframe that displays the difference between certain rows, where the distance between the rows depends on values in the table.
I found out that:
mutate(Col_new = Col_1 - lead(Col_1, n = x))
can find the difference for a fixed n, but only an integer can be used as input. How would you find the difference between rows when the distance between them varies?
I am trying to get the output in Col_new, which is the difference between row i and row i+n, where n takes the value in the Count column. (The data is rounded, so there may be 0.01 discrepancies in Col_new.)
Col_1 Count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.35 2 -0.89
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
Data:
df <- data.frame(Col_1 = c(0.90, 1.58, 1.89, 1.84, 1.57, 1.30, 1.35,
1.56, 2.24, 3.14, 4.04, 4.72, 5.04, 4.99,
4.71, 4.44, 4.39, 4.70, 5.38, 6.28),
Count = sort(rep(1:4, 5)))
Some code that generates the intended output, but which can undoubtedly be made more efficient:
library(dplyr)
df %>%
  mutate(col_2 = sapply(1:4, function(s) lead(Col_1, n = s))) %>%
  rowwise() %>%
  mutate(Col_new = Col_1 - col_2[Count]) %>%
  select(-col_2)
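Here col_2 is a matrix column whose s-th column is Col_1 shifted up by s rows, so after rowwise(), col_2[Count] picks the shift that matches each row's Count.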
Output:
# A tibble: 20 × 3
# Rowwise:
Col_1 Count Col_new
<dbl> <int> <dbl>
1 0.9 1 -0.68
2 1.58 1 -0.310
3 1.89 1 0.0500
4 1.84 1 0.27
5 1.57 1 0.27
6 1.3 2 -0.26
7 1.35 2 -0.89
8 1.56 2 -1.58
9 2.24 2 -1.8
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.0100
13 5.04 3 0.600
14 4.99 3 0.600
15 4.71 3 0.0100
16 4.44 4 -1.84
17 4.39 4 NA
18 4.7 4 NA
19 5.38 4 NA
20 6.28 4 NA
df %>% mutate(Col_new = case_when(
  Count == 1 ~ Col_1 - lead(Col_1, n = 1),
  Count == 2 ~ Col_1 - lead(Col_1, n = 2),
  Count == 3 ~ Col_1 - lead(Col_1, n = 3),
  Count == 4 ~ Col_1 - lead(Col_1, n = 4),
  Count == 5 ~ Col_1 - lead(Col_1, n = 5)
))
Col_1 Count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.35 2 -0.89
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
This gives the desired result, but it is not a very good solution for the general case: imagine the task with 10 or more distinct counts; another approach would be required, such as the vectorized sketch below.
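A fully general alternative is to index Col_1 directly by each row's position offset by Count; out-of-range indices return NA automatically. This is a sketch (not one of the original answers) using the df defined above:
library(dplyr)
df %>%
  mutate(Col_new = Col_1 - Col_1[row_number() + Count])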
I want to keep the row with the first occurrence of a changed value in a column (the last column in the example below). My dataframe is an xts object.
In the example below, I would keep the first row with a 2 in the last column, but not the next two because they are unchanged from the first 2. I'd then keep the next three rows (the sequence 3, 2, 3) because the value changes each time, and remove the next 4 because they didn't change, and so on. The final data frame would look like the smaller one below the original.
Any help is appreciated!
Original Dataframe
2007-01-31 2.72 4.75 2
2007-02-28 2.82 4.75 2
2007-03-31 2.85 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-07-31 4.19 4.75 3
2007-08-31 4.55 4.75 3
2007-09-30 4.20 4.75 3
2007-10-31 4.36 4.75 3
2007-11-30 5.75 4.76 4
2007-12-31 5.92 4.76 4
2008-01-31 6.95 4.87 4
2008-02-29 7.67 4.87 4
2008-03-31 8.21 4.90 4
2008-04-30 6.86 4.91 1
2008-05-31 6.53 5.07 1
2008-06-30 7.35 5.08 1
2008-07-31 8.00 5.13 4
2008-08-31 8.36 5.19 4
Final Dataframe
2007-01-31 2.72 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-11-30 5.75 4.76 4
2008-04-30 6.86 4.91 1
2008-07-31 8.00 5.13 4
Here's another solution using run-length encoding, rle().
lens <- rle(df$V4)$lengths
df[cumsum(lens) - lens + 1,]
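Here cumsum(lens) gives the row index of the last element of each run, so cumsum(lens) - lens + 1 is the index of each run's first element, i.e. exactly the rows to keep.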
Output:
V1 V2 V3 V4
1 2007-01-31 2.72 4.75 2
4 2007-04-30 2.74 4.75 3
5 2007-05-31 2.46 4.75 2
6 2007-06-30 2.98 4.75 3
11 2007-11-30 5.75 4.76 4
16 2008-04-30 6.86 4.91 1
19 2008-07-31 8.00 5.13 4
You can use data.table::shift to filter the rows whose value differs from the previous one, then rbind the first row back on:
library(data.table)
rbind(setDT(dt)[1],dt[v3!=shift(v3)])
Or an equivalent approach using dplyr
library(dplyr)
bind_rows(dt[1,], filter(dt, v3!=lag(v3)))
Output:
date v1 v2 v3
<IDat> <num> <num> <int>
1: 2007-01-31 2.72 4.75 2
2: 2007-04-30 2.74 4.75 3
3: 2007-05-31 2.46 4.75 2
4: 2007-06-30 2.98 4.75 3
5: 2007-11-30 5.75 4.76 4
6: 2008-04-30 6.86 4.91 1
7: 2008-07-31 8.00 5.13 4
DATA
x <- "
2007-01-31 2.72 4.75 2
2007-02-28 2.82 4.75 2
2007-03-31 2.85 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-07-31 4.19 4.75 3
2007-08-31 4.55 4.75 3
2007-09-30 4.20 4.75 3
2007-10-31 4.36 4.75 3
2007-11-30 5.75 4.76 4
2007-12-31 5.92 4.76 4
2008-01-31 6.95 4.87 4
2008-02-29 7.67 4.87 4
2008-03-31 8.21 4.90 4
2008-04-30 6.86 4.91 1
2008-05-31 6.53 5.07 1
2008-06-30 7.35 5.08 1
2008-07-31 8.00 5.13 4
2008-08-31 8.36 5.19 4
"
df <- read.table(textConnection(x) , header = F)
and use these two lines:
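# V5 is nonzero exactly where V4 changes; the leading 1 keeps the first row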
df$V5 <- c(1 ,diff(df$V4))
df[abs(df$V5) > 0 ,][1:4]
#> V1 V2 V3 V4
#> 1 2007-01-31 2.72 4.75 2
#> 4 2007-04-30 2.74 4.75 3
#> 5 2007-05-31 2.46 4.75 2
#> 6 2007-06-30 2.98 4.75 3
#> 11 2007-11-30 5.75 4.76 4
#> 16 2008-04-30 6.86 4.91 1
#> 19 2008-07-31 8.00 5.13 4
Created on 2022-06-12 by the reprex package (v2.0.1)
I have a data frame df1 that summarises water temperature every 2 meters down to 39 meters depth over time. As an example:
df1<-data.frame(Datetime=c("2016-08-18 00:00:00","2016-08-18 00:01:00","2016-08-18 00:02:00","2016-08-18 00:03:00"),
Site=c("BD","HG","BD","HG"),
m0=c(2,5,6,1),
m2=c(3,5,2,4),
m4=c(4,1,9,3),
m6=c(2,5,6,1),
m8=c(3,5,2,4),
m10=c(2,5,6,1),
m12=c(4,1,9,3),
m14=c(3,5,2,4),
m16=c(2,5,6,1),
m18=c(4,1,9,3),
m20=c(3,5,2,4),
m22=c(2,5,6,1),
m24=c(4,1,9,3),
m26=c(3,5,2,4),
m28=c(2,5,6,1),
m30=c(4,1,9,3),
m32=c(3,5,2,4),
m34=c(2,5,6,1),
m36=c(4,1,9,3),
m38=c(3,5,2,4)
)
> df1
Datetime Site m0 m2 m4 m6 m8 m10 m12 m14 m16 m18 m20 m22 m24 m26 m28 m30 m32 m34 m36 m38
1 2016-08-18 00:00:00 BD 2 3 4 2 3 2 4 3 2 4 3 2 4 3 2 4 3 2 4 3
2 2016-08-18 00:01:00 HG 5 5 1 5 5 5 1 5 5 1 5 5 1 5 5 1 5 5 1 5
3 2016-08-18 00:02:00 BD 6 2 9 6 2 6 9 2 6 9 2 6 9 2 6 9 2 6 9 2
4 2016-08-18 00:03:00 HG 1 4 3 1 4 1 3 4 1 3 4 1 3 4 1 3 4 1 3 4
I would like to calculate the water temperature for layers of 8 meters instead of 2 meters by averaging the water temperatures across the corresponding columns. For instance, I would like to collapse columns m0, m2, m4 and m6 into a single column called m3.5 that reflects the mean water temperature between 0 and 7 meters depth.
As my desired result:
> df1
Datetime Site m3.5 m11.5 m19.5 m27.5 m35.5
1 2016-08-18 00:00:00 BD 2.75 3.00 2.75 3.25 3.00
2 2016-08-18 00:01:00 HG 4.00 4.00 4.00 3.00 4.00
3 2016-08-18 00:02:00 BD 5.75 4.75 5.75 6.50 4.75
4 2016-08-18 00:03:00 HG 2.25 3.00 2.25 2.75 3.00
Does anyone know how to do that with dplyr?
Here is a solution that works with any number of columns:
num_meters <- 39
grp <- as.factor(cumsum(seq(0,num_meters, 2) %% 8 == 0))
df <- data.frame(df1[,c(1,2)],
t(apply(df1[,-c(1,2)], 1, function(x) tapply(x, grp, mean))))
# Datetime Site X1 X2 X3 X4 X5
#1 2016-08-18 00:00:00 BD 2.75 3.00 2.75 3.25 3.00
#2 2016-08-18 00:01:00 HG 4.00 4.00 4.00 3.00 4.00
#3 2016-08-18 00:02:00 BD 5.75 4.75 5.75 6.50 4.75
#4 2016-08-18 00:03:00 HG 2.25 3.00 2.25 2.75 3.00
# in case you also need the colnames that you have specified
colnames(df)[-c(1,2)] <- paste("m", tapply(seq(0,num_meters, 2), grp, mean) + 0.5, sep = "")
With the tidyverse you could also do something like this:
df1 %>%
gather(var, val, -Datetime, -Site) %>%
mutate(group = rep(seq(3.5, 35.5, 8), each = 16)) %>%
group_by(group, Site, Datetime) %>%
summarise(value = mean(val)) %>%
spread(group, value)
Site Datetime `3.5` `11.5` `19.5` `27.5` `35.5`
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BD 2016-08-18 00:00:00 2.75 3 2.75 3.25 3
2 BD 2016-08-18 00:02:00 5.75 4.75 5.75 6.5 4.75
3 HG 2016-08-18 00:01:00 4 4 4 3 4
4 HG 2016-08-18 00:03:00 2.25 3 2.25 2.75 3
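As an aside, gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(). Here is a sketch of the same computation with the newer verbs (the depth arithmetic assumes the 8-meter layers described in the question):
library(dplyr)
library(tidyr)
df1 %>%
  # melt the depth columns m0..m38 into one numeric 'depth' column
  pivot_longer(starts_with("m"), names_to = "depth", names_prefix = "m",
               names_transform = list(depth = as.numeric)) %>%
  # map each depth to the midpoint of its 8 m layer (0-6 -> 3.5, 8-14 -> 11.5, ...)
  mutate(layer = (depth %/% 8) * 8 + 3.5) %>%
  group_by(Datetime, Site, layer) %>%
  summarise(value = mean(value), .groups = "drop") %>%
  pivot_wider(names_from = layer, values_from = value, names_prefix = "m")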
You're probably looking for rowMeans:
df1$m3.5 <- rowMeans(df1[, c("m0", "m2", "m4", "m6")])
No need for dplyr.
The following does it.
library(dplyr)
df1 %>%
mutate(m3.5 = rowMeans(.[3:6]),
m11.5 = rowMeans(.[7:10]),
m19.5 = rowMeans(.[11:14]),
m27.5 = rowMeans(.[15:18]),
m35.5 = rowMeans(.[19:22])) %>%
select(Datetime, Site, m3.5:m35.5)
# Datetime Site m3.5 m11.5 m19.5 m27.5 m35.5
#1 2016-08-18 00:00:00 BD 2.75 3.00 2.75 3.25 3.00
#2 2016-08-18 00:01:00 HG 4.00 4.00 4.00 3.00 4.00
#3 2016-08-18 00:02:00 BD 5.75 4.75 5.75 6.50 4.75
#4 2016-08-18 00:03:00 HG 2.25 3.00 2.25 2.75 3.00
I have an unbalanced panel of company data, as follows:
time comid group sales closeone
1988m1 tw1701 1 2.45 tw1410
1988m1 tw1213 1 1.98 tw1701
1988m1 tw1707 1 2.67
1988m1 tw2702 1 9.45
1988m1 tw9902 1 4.16
1988m1 tw1410 1 2.57
1988m2 tw2601 3 27.44 tw2505
1988m2 tw2505 3 9.49
1988m2 tw1413 3 1.46
1988m2 tw2901 3 3.74
1988m2 tw1417 4 1.87 tw1506
1988m2 tw1506 4 3.24
1988m2 tw1215 4 3.58
My aim is to find the closest rival, by sales, within the same group and time, just as the column closeone shows. For example, closeone in the first row is tw1410: among all companies with group = 1 and time = 1988m1, take abs(sales - sales of tw1701) and find the minimum value, excluding zero (i.e. a company can't be its own rival).
I'm not sure why you were downvoted; I didn't think this was trivial. This is how I solved it, though there might be an easier way. I couldn't get data.table operations to set the final value, so I had to use a for loop.
Basically it sorts the data by time and group with sales descending, calculates the difference to the rows above and below within the same group, finds the smaller of those two values, and then sets the rival by that reference.
library(data.table)
setDT(dat)
setorder(dat,time,group,-sales)
dat[ , "Diff" := c(NA, diff(sales)), by = .(time,group)]
dat[ , "Diff2" := c(diff((sales)),NA), by = .(time,group)]
dat[ ,"Min" := ifelse(abs(Diff) < abs(Diff2), 1, 2)]
dat[ ,"Min" := ifelse(is.na(Diff),2,Min)]
dat[ ,"Min" := ifelse(is.na(Diff2),1,Min)]
dat[, "Rival" := NA_character_]  # character NA, so the loop's assignments don't coerce the column
for(i in 1:nrow(dat)){
if(dat$Min[i] == 2){
dat$Rival[i] = as.character(dat[i+1,comid])
}else{
dat$Rival[i] = as.character(dat[i-1,comid])
}
}
> dat
time comid group sales Diff Diff2 Min Rival
1: 1988m1 tw2702 1 9.45 NA -5.29 2 tw9902
2: 1988m1 tw9902 1 4.16 -5.29 -1.49 2 tw1707
3: 1988m1 tw1707 1 2.67 -1.49 -0.10 2 tw1410
4: 1988m1 tw1410 1 2.57 -0.10 -0.12 1 tw1707
5: 1988m1 tw1701 1 2.45 -0.12 -0.47 1 tw1410
6: 1988m1 tw1213 1 1.98 -0.47 NA 1 tw1701
7: 1988m2 tw2601 3 27.44 NA -17.95 2 tw2505
8: 1988m2 tw2505 3 9.49 -17.95 -5.75 2 tw2901
9: 1988m2 tw2901 3 3.74 -5.75 -2.28 2 tw1413
10: 1988m2 tw1413 3 1.46 -2.28 NA 1 tw2901
11: 1988m2 tw1215 4 3.58 NA -0.34 2 tw1506
12: 1988m2 tw1506 4 3.24 -0.34 -1.37 1 tw1215
13: 1988m2 tw1417 4 1.87 -1.37 NA 1 tw1506
If anyone has a better solution I'd love to see it.
EDIT
The reason I couldn't get this into vector format apparently was because comid was a factor. I have no idea why that would break the function, but when I changed it to character it worked.
Replace the for loop with this:
dat$comid = as.character(dat$comid)
dat[, "Rival" := ifelse(Min == 2, shift(comid, type = "lead"), shift(comid, type = "lag"))]
> dat
time comid group sales Diff Diff2 Min Rival
1: 1988m1 tw2702 1 9.45 NA -5.29 2 tw9902
2: 1988m1 tw9902 1 4.16 -5.29 -1.49 2 tw1707
3: 1988m1 tw1707 1 2.67 -1.49 -0.10 2 tw1410
4: 1988m1 tw1410 1 2.57 -0.10 -0.12 1 tw1707
5: 1988m1 tw1701 1 2.45 -0.12 -0.47 1 tw1410
6: 1988m1 tw1213 1 1.98 -0.47 NA 1 tw1701
7: 1988m2 tw2601 3 27.44 NA -17.95 2 tw2505
8: 1988m2 tw2505 3 9.49 -17.95 -5.75 2 tw2901
9: 1988m2 tw2901 3 3.74 -5.75 -2.28 2 tw1413
10: 1988m2 tw1413 3 1.46 -2.28 NA 1 tw2901
11: 1988m2 tw1215 4 3.58 NA -0.34 2 tw1506
12: 1988m2 tw1506 4 3.24 -0.34 -1.37 1 tw1215
13: 1988m2 tw1417 4 1.87 -1.37 NA 1 tw1506
That should run a lot faster.
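For completeness, a more direct (if slower) approach is to compute, for each firm, the absolute sales gap to every other firm in the same time/group and take the smallest. A dplyr sketch, assuming dat carries the question's time, group, sales and comid columns:
library(dplyr)
dat %>%
  group_by(time, group) %>%
  mutate(Rival = sapply(seq_along(sales), function(i) {
    gap <- abs(sales - sales[i])
    gap[i] <- Inf  # a firm can't be its own rival
    as.character(comid)[which.min(gap)]
  })) %>%
  ungroup()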
I have a dataframe price.hierarchy
read.table(header=TRUE, text=
" WEEK PRICE QUANTITY SALE_PRICE TYPE
1 4992 3.49 11541.600 3.49 1
2 5001 3.00 38944.000 3.00 1
3 5002 3.49 10652.667 3.49 1
4 5008 3.00 21445.000 3.00 1
5 5009 3.49 10039.667 3.49 1
6 5014 3.33 22624.000 3.33 1
7 5015 3.49 9146.500 3.49 1
8 5027 3.33 14751.000 3.33 1
9 5028 3.49 9146.667 3.49 1
10 5034 3.33 18304.000 3.33 1
11 5035 3.49 10953.500 3.49 1")
I want output like
read.table(header=F, text=
"1 4992 3.49 11541.600 3.49 1 5001 3.00 38944.000 3.00 1
2 5001 3.00 38944.000 3.00 1 5002 3.49 10652.667 3.49 1
3 5002 3.49 10652.667 3.49 1 5008 3.00 21445.000 3.00 1
4 5008 3.00 21445.000 3.00 1 5009 3.49 10039.667 3.49 1
5 5009 3.49 10039.667 3.49 1 5014 3.33 22624.000 3.33 1
6 5014 3.33 22624.000 3.33 1 5015 3.49 9146.500 3.49 1
7 5015 3.49 9146.500 3.49 1 5027 3.33 14751.000 3.33 1
8 5027 3.33 14751.000 3.33 1 5028 3.49 9146.667 3.49 1
9 5028 3.49 9146.667 3.49 1 5034 3.33 18304.000 3.33 1
10 5034 3.33 18304.000 3.33 1 5035 3.49 10953.500 3.49 1")
I am trying to cbind each row with the following one (first with second, second with third, etc.) from the same data frame.
I tried
price.hierarchy1 <- price.hierarchy[c(1: (nrow(price.hierarchy)-1)), ]
price.hierarchy2 <- price.hierarchy[c(2: nrow(price.hierarchy)), ]
price.hierarchy3 <- cbind(price.hierarchy1, price.hierarchy2)
price.hierarchy3 <- unique(price.hierarchy3)
Another variant is
cbind(df1[-nrow(df1),], df1[-1,])
First offset the rows by one, then cbind():
cbind(df[1:(nrow(df)-1),], df[2:nrow(df),])
or
cbind(head(df,-1), tail(df,-1))
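A dplyr equivalent (a sketch, assuming df is the price.hierarchy frame above) shifts every column up with lead() and drops the last row, whose shifted values are all NA:
library(dplyr)
df %>%
  mutate(across(everything(), lead, .names = "{.col}_next")) %>%
  slice(-n())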