Combine two consecutive rows in a dataframe - r

I have a dataframe price.hierarchy:
price.hierarchy <- read.table(header=TRUE, text=
" WEEK PRICE QUANTITY SALE_PRICE TYPE
1 4992 3.49 11541.600 3.49 1
2 5001 3.00 38944.000 3.00 1
3 5002 3.49 10652.667 3.49 1
4 5008 3.00 21445.000 3.00 1
5 5009 3.49 10039.667 3.49 1
6 5014 3.33 22624.000 3.33 1
7 5015 3.49 9146.500 3.49 1
8 5027 3.33 14751.000 3.33 1
9 5028 3.49 9146.667 3.49 1
10 5034 3.33 18304.000 3.33 1
11 5035 3.49 10953.500 3.49 1")
I want output like this:
read.table(header=F, text=
"1 4992 3.49 11541.600 3.49 1 5001 3.00 38944.000 3.00 1
2 5001 3.00 38944.000 3.00 1 5002 3.49 10652.667 3.49 1
3 5002 3.49 10652.667 3.49 1 5008 3.00 21445.000 3.00 1
4 5008 3.00 21445.000 3.00 1 5009 3.49 10039.667 3.49 1
5 5009 3.49 10039.667 3.49 1 5014 3.33 22624.000 3.33 1
6 5014 3.33 22624.000 3.33 1 5015 3.49 9146.500 3.49 1
7 5015 3.49 9146.500 3.49 1 5027 3.33 14751.000 3.33 1
8 5027 3.33 14751.000 3.33 1 5028 3.49 9146.667 3.49 1
9 5028 3.49 9146.667 3.49 1 5034 3.33 18304.000 3.33 1
10 5034 3.33 18304.000 3.33 1 5035 3.49 10953.500 3.49 1")
I am trying to combine the first and second rows, the second and third rows, and so on, by appending each following row to the end of the current row of the same data frame.
I tried
price.hierarchy1 <- price.hierarchy[c(1: (nrow(price.hierarchy)-1)), ]
price.hierarchy2 <- price.hierarchy[c(2: nrow(price.hierarchy)), ]
price.hierarchy3 <- cbind(price.hierarchy1, price.hierarchy2)
price.hierarchy3 <- unique(price.hierarchy3)

Another variant is
cbind(df1[-nrow(df1),], df1[-1,])

First offset the rows by one, then cbind():
cbind(df[1:(nrow(df)-1),], df[2:nrow(df),])
or
cbind(head(df,-1), tail(df,-1))
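Put together on the data above, a minimal sketch (the _next suffix for the second block of columns is my own naming choice, not something the question prescribes):
paired <- cbind(head(price.hierarchy, -1),
                setNames(tail(price.hierarchy, -1),
                         paste0(names(price.hierarchy), "_next")))
rownames(paired) <- NULL  # reset the row names carried over from head()
head(paired, 3)           # inspect the first three combined rows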

Related

In R, how do I keep the first single occurrence of a row based on a repeated value in one column?

I want to keep the row with the first occurrence of a changed value in a column (the last column in the example below). My dataframe is an xts object.
In the example below, I would keep the first row with a 2 in the last column, but not the next two because they are unchanged from the first 2. I'd then keep the next three rows (the sequence 3, 2, 3) because they change each time, and remove the next 4 because they didn't change, and so on. The final data frame would look like the smaller one below the original.
Any help is appreciated!
Original Dataframe
2007-01-31 2.72 4.75 2
2007-02-28 2.82 4.75 2
2007-03-31 2.85 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-07-31 4.19 4.75 3
2007-08-31 4.55 4.75 3
2007-09-30 4.20 4.75 3
2007-10-31 4.36 4.75 3
2007-11-30 5.75 4.76 4
2007-12-31 5.92 4.76 4
2008-01-31 6.95 4.87 4
2008-02-29 7.67 4.87 4
2008-03-31 8.21 4.90 4
2008-04-30 6.86 4.91 1
2008-05-31 6.53 5.07 1
2008-06-30 7.35 5.08 1
2008-07-31 8.00 5.13 4
2008-08-31 8.36 5.19 4
Final Dataframe
2007-01-31 2.72 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-11-30 5.75 4.76 4
2008-04-30 6.86 4.91 1
2008-07-31 8.00 5.13 4
Here's another solution using run-length encoding, rle().
lens <- rle(df$V4)$lengths
df[cumsum(lens) - lens + 1,]
Output:
V1 V2 V3 V4
1 2007-01-31 2.72 4.75 2
4 2007-04-30 2.74 4.75 3
5 2007-05-31 2.46 4.75 2
6 2007-06-30 2.98 4.75 3
11 2007-11-30 5.75 4.76 4
16 2008-04-30 6.86 4.91 1
19 2008-07-31 8.00 5.13 4
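As a side note on the indexing: cumsum(lens) gives the last index of each run, so subtracting lens and adding 1 lands on the first index of each run. A tiny illustration on a toy vector (not the question's data):
v <- c(2, 2, 2, 3, 2, 3, 3)
lens <- rle(v)$lengths   # 3 1 1 2
cumsum(lens)             # 3 4 5 7  -> last index of each run
cumsum(lens) - lens + 1  # 1 4 5 6  -> first index of each run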
You can use data.table::shift to keep the rows where the value changes from the previous one, then rbind the first row back on:
library(data.table)
rbind(setDT(dt)[1],dt[v3!=shift(v3)])
Or an equivalent approach using dplyr
library(dplyr)
bind_rows(dt[1,], filter(dt, v3!=lag(v3)))
Output:
date v1 v2 v3
<IDat> <num> <num> <int>
1: 2007-01-31 2.72 4.75 2
2: 2007-04-30 2.74 4.75 3
3: 2007-05-31 2.46 4.75 2
4: 2007-06-30 2.98 4.75 3
5: 2007-11-30 5.75 4.76 4
6: 2008-04-30 6.86 4.91 1
7: 2008-07-31 8.00 5.13 4
DATA
x <- "
2007-01-31 2.72 4.75 2
2007-02-28 2.82 4.75 2
2007-03-31 2.85 4.75 2
2007-04-30 2.74 4.75 3
2007-05-31 2.46 4.75 2
2007-06-30 2.98 4.75 3
2007-07-31 4.19 4.75 3
2007-08-31 4.55 4.75 3
2007-09-30 4.20 4.75 3
2007-10-31 4.36 4.75 3
2007-11-30 5.75 4.76 4
2007-12-31 5.92 4.76 4
2008-01-31 6.95 4.87 4
2008-02-29 7.67 4.87 4
2008-03-31 8.21 4.90 4
2008-04-30 6.86 4.91 1
2008-05-31 6.53 5.07 1
2008-06-30 7.35 5.08 1
2008-07-31 8.00 5.13 4
2008-08-31 8.36 5.19 4
"
df <- read.table(textConnection(x), header = FALSE)
and use these two lines:
df$V5 <- c(1, diff(df$V4))
df[abs(df$V5) > 0, ][1:4]
#> V1 V2 V3 V4
#> 1 2007-01-31 2.72 4.75 2
#> 4 2007-04-30 2.74 4.75 3
#> 5 2007-05-31 2.46 4.75 2
#> 6 2007-06-30 2.98 4.75 3
#> 11 2007-11-30 5.75 4.76 4
#> 16 2008-04-30 6.86 4.91 1
#> 19 2008-07-31 8.00 5.13 4
Created on 2022-06-12 by the reprex package (v2.0.1)
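One caveat: diff() only works on numeric columns. If V4 were character or a factor, comparing neighbouring elements directly does the same job; a sketch under that assumption, not tested against the xts case mentioned in the question:
df[c(TRUE, df$V4[-1] != df$V4[-nrow(df)]), 1:4]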

How to change data frame by getting some specific rows repeated? [duplicate]

This question already has an answer here:
How to repeat rows with 0 price from dataframe before non zero price row?
(1 answer)
Closed 6 years ago.
a b c d
5015 3.49 1059.500 0.00
5023 2.50 6056.000 2.50
5024 3.00 1954.500 3.00
5026 3.49 1163.833 0.00
5037 2.50 6797.000 2.50
5038 3.00 2109.000 3.00
5040 2.50 4521.000 2.50
5041 3.33 2469.000 3.33
I want to repeat the previously observed row with d equal to 0 before each row with a non-zero value of d, so that the result alternates between a row with d equal to 0 and a row with a non-zero d. The repeated zero-d row must be the most recently observed one.
Output I want is:
a b c d
5015 3.49 1059.500 0.00
5023 2.50 6056.000 2.50
5015 3.49 1059.500 0.00
5024 3.00 1954.500 3.00
5026 3.49 1163.833 0.00
5037 2.50 6797.000 2.50
5026 3.49 1163.833 0.00
5038 3.00 2109.000 3.00
5026 3.49 1163.833 0.00
5040 2.50 4521.000 2.50
5026 3.49 1163.833 0.00
5041 3.33 2469.000 3.33
We can create a custom function f that interleaves the first row of each group with the remaining rows. Split on cumsum(d == 0), which starts a new group at every row where d equals 0. Finally we combine with do.call(rbind, ...). I added an optional 'row.names<-'(..., NULL) call to undo the default naming convention:
f <- function(x) x[c(rbind(rep(1,nrow(x)-1), 2:nrow(x))),]
`row.names<-`(do.call(rbind, lapply(split(df1, cumsum(df1$d == 0)), f)), NULL)
# a b c d
# 1 5015 3.49 1059.500 0.00
# 2 5023 2.50 6056.000 2.50
# 3 5015 3.49 1059.500 0.00
# 4 5024 3.00 1954.500 3.00
# 5 5026 3.49 1163.833 0.00
# 6 5037 2.50 6797.000 2.50
# 7 5026 3.49 1163.833 0.00
# 8 5038 3.00 2109.000 3.00
# 9 5026 3.49 1163.833 0.00
# 10 5040 2.50 4521.000 2.50
# 11 5026 3.49 1163.833 0.00
# 12 5041 3.33 2469.000 3.33
There is an interleave trick in there. Try c(rbind(c(1,1,1), c(2,3,4))) to see how the numbers are woven together.
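Evaluated, that gives:
c(rbind(c(1, 1, 1), c(2, 3, 4)))
# [1] 1 2 1 3 1 4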
Package data.table's grouping by is useful here:
library(data.table)
DF <-fread(" a b c d
5015 3.49 1059.500 0.00
5023 2.50 6056.000 2.50
5024 3.00 1954.500 3.00
5026 3.49 1163.833 0.00
5037 2.50 6797.000 2.50
5038 3.00 2109.000 3.00
5040 2.50 4521.000 2.50
5041 3.33 2469.000 3.33")
DF[ # find indices:
  DF[, {
    ind <- .I[rep(1L, (.N - 1) * 2)]  # first repeat the first index
    ind[c(FALSE, TRUE)] <- .I[-1]     # then replace every second repeat with the other indices
    ind
  }, by = cumsum(abs(d) < .Machine$double.eps^0.5)][["V1"]]
  # group by the different d = 0 rows; beware of floating point errors if you have calculated d
] # subset with the indices
# a b c d
# 1: 5015 3.49 1059.500 0.00
# 2: 5023 2.50 6056.000 2.50
# 3: 5015 3.49 1059.500 0.00
# 4: 5024 3.00 1954.500 3.00
# 5: 5026 3.49 1163.833 0.00
# 6: 5037 2.50 6797.000 2.50
# 7: 5026 3.49 1163.833 0.00
# 8: 5038 3.00 2109.000 3.00
# 9: 5026 3.49 1163.833 0.00
# 10: 5040 2.50 4521.000 2.50
# 11: 5026 3.49 1163.833 0.00
# 12: 5041 3.33 2469.000 3.33
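About the floating point warning in those comments: the grouping treats any d within sqrt(.Machine$double.eps) of zero as zero. A small illustration of mine, not part of the original answer:
abs(c(0, 2.5, 1e-9)) < .Machine$double.eps^0.5
# [1]  TRUE FALSE  TRUE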

How to repeat rows with 0 price from dataframe before non zero price row?

I have a data
WEEK PRICE QUANTITY SALE_PRICE
4992 3.49 1908.750 0.00
4999 2.50 5681.000 2.50
5001 3.00 3187.000 3.00
5002 3.49 1455.000 0.00
5008 2.69 2263.500 2.69
5011 3.49 1515.500 0.00
5013 2.69 3297.000 2.69
5015 3.49 1059.500 0.00
5023 2.50 6056.000 2.50
5024 3.00 1954.500 3.00
5026 3.49 1163.833 0.00
5037 2.50 6797.000 2.50
5038 3.00 2109.000 3.00
5040 2.50 4521.000 2.50
5041 3.33 2469.000 3.33
I want to repeat the last observed row with sale price 0 immediately before a row with a non-zero sale price, but only at the places where the non-zero row is not already preceded by a row with sale price 0. The output should look like this:
WEEK PRICE QUANTITY SALE_PRICE TYPE
4992 3.49 1908.750 0.00 3
4999 2.50 5681.000 2.50 3
4992 3.49 1908.750 0.00 3
5001 3.00 3187.000 3.00 3
5002 3.49 1455.000 0.00 3
5008 2.69 2263.500 2.69 3
5011 3.49 1515.500 0.00 3
5013 2.69 3297.000 2.69 3
5015 3.49 1059.500 0.00 3
5023 2.50 6056.000 2.50 3
5015 3.49 1059.500 0.00 3
5024 3.00 1954.500 3.00 3
5026 3.49 1163.833 0.00 3
5037 2.50 6797.000 2.50 3
5026 3.49 1163.833 0.00 3
5038 3.00 2109.000 3.00 3
5026 3.49 1163.833 0.00 3
5040 2.50 4521.000 2.50 3
5026 3.49 1163.833 0.00 3
5041 3.33 2469.000 3.33 3
Data
structure(list(WEEK = c(4992L, 4999L, 5001L, 5002L, 5008L, 5011L,
5013L, 5015L, 5023L, 5024L, 5026L, 5037L, 5038L, 5040L, 5041L
), PRICE = c(3.49, 2.5, 3, 3.49, 2.69, 3.49, 2.69, 3.49, 2.5,
3, 3.49, 2.5, 3, 2.5, 3.33), QUANTITY = c(1908.75, 5681, 3187,
1455, 2263.5, 1515.5, 3297, 1059.5, 6056, 1954.5, 1163.833, 6797,
2109, 4521, 2469), SALE_PRICE = c(0, 2.5, 3, 0, 2.69, 0, 2.69,
0, 2.5, 3, 0, 2.5, 3, 2.5, 3.33)), .Names = c("WEEK", "PRICE",
"QUANTITY", "SALE_PRICE"), class = "data.frame", row.names = c(NA,
-15L))
I have tried
if(nrow(price.hierarchy) > 2) {
  for(i in 2:(nrow(price.hierarchy) - 1)) {
    if(price.hierarchy$SALE_PRICE[i] != 0 & price.hierarchy$SALE_PRICE[i+1] != 0 & price.hierarchy$SALE_PRICE[i-1] == 0) {
      price.hierarchy1 <- price.hierarchy[which(price.hierarchy[, 1] > price.hierarchy[i, 1]), ]
      price.hierarchy[i+1, ] <- NA
      price.hierarchy[i+1, ] <- price.hierarchy[i-1, ]
      price.hierarchy2 <- price.hierarchy[which(price.hierarchy[, 1] < price.hierarchy[i+2, 1]), ]
      price.hierarchy <- rbind(price.hierarchy2, price.hierarchy1)
    } else {
      price.hierarchy
    }
  }
}
Which gives:
> price.hierarchy
WEEK PRICE QUANTITY SALE_PRICE
1 4992 3.49 1908.750 0.00
2 4999 2.50 5681.000 2.50
3 4992 3.49 1908.750 0.00
31 5001 3.00 3187.000 3.00
4 5002 3.49 1455.000 0.00
5 5008 2.69 2263.500 2.69
6 5011 3.49 1515.500 0.00
7 5013 2.69 3297.000 2.69
8 5015 3.49 1059.500 0.00
9 5023 2.50 6056.000 2.50
10 5015 3.49 1059.500 0.00
101 5024 3.00 1954.500 3.00
11 5026 3.49 1163.833 0.00
12 5037 2.50 6797.000 2.50
13 5026 3.49 1163.833 0.00
131 5038 3.00 2109.000 3.00
14 5040 2.50 4521.000 2.50
15 5041 3.33 2469.000 3.33
The shift function from data.table could be useful here.
I want to repeat the last observed row with sale price 0, immediately before a non-zero sale price, only at the places where there is no row with sale price 0 before a row having a sale price
I'm not sure about the latter requirements without more explanation. Here's a quick solution that gives the observations an identifier i for sorting purposes, and then duplicates and appends those that precede a zero sale price. Finally, sort on identifiers adjusted to place duplicated rows two rows after the originals.
library(data.table)
setDT(d)
d[, i := seq_len(.N)]
d[, led_zero := shift(SALE_PRICE, type = "lead") == 0]
d_duped = d[(led_zero)][, i := i + 1.5]
d_out = rbind(d, d_duped)
d_out[order(i)]
# WEEK PRICE QUANTITY SALE_PRICE i led_zero
# 1: 4992 3.49 1908.750 0.00 1.0 FALSE
# 2: 4999 2.50 5681.000 2.50 2.0 FALSE
# 3: 5001 3.00 3187.000 3.00 3.0 TRUE
# 4: 5002 3.49 1455.000 0.00 4.0 FALSE
# 5: 5001 3.00 3187.000 3.00 4.5 TRUE
# 6: 5008 2.69 2263.500 2.69 5.0 TRUE
# 7: 5011 3.49 1515.500 0.00 6.0 FALSE
# 8: 5008 2.69 2263.500 2.69 6.5 TRUE
# 9: 5013 2.69 3297.000 2.69 7.0 TRUE
# 10: 5015 3.49 1059.500 0.00 8.0 FALSE
# 11: 5013 2.69 3297.000 2.69 8.5 TRUE
# 12: 5023 2.50 6056.000 2.50 9.0 FALSE
# 13: 5024 3.00 1954.500 3.00 10.0 TRUE
# 14: 5026 3.49 1163.833 0.00 11.0 FALSE
# 15: 5024 3.00 1954.500 3.00 11.5 TRUE
# 16: 5037 2.50 6797.000 2.50 12.0 FALSE
# 17: 5038 3.00 2109.000 3.00 13.0 FALSE
# 18: 5040 2.50 4521.000 2.50 14.0 FALSE
# 19: 5041 3.33 2469.000 3.33 15.0 NA
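If the helper columns are unwanted in the final result, they can be dropped afterwards; a small addition of mine, not part of the original answer:
d_out[order(i)][, c("i", "led_zero") := NULL][]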

Error : missing value where TRUE/FALSE needed

WEEK PRICE QUANTITY SALE_PRICE TYPE
1 4992 5.99 2847.50 0.00 3
2 4995 3.33 36759.00 3.33 3
3 4996 5.99 2517.00 0.00 3
4 4997 5.49 2858.50 0.00 3
5 5001 3.33 32425.00 3.33 3
6 5002 5.49 4205.50 0.00 3
7 5004 5.99 4329.50 0.00 3
8 5006 2.74 55811.00 2.74 3
9 5007 5.49 4133.00 0.00 3
10 5008 5.99 4074.00 0.00 3
11 5009 3.99 12125.25 3.99 3
12 5017 2.74 77645.00 2.74 3
13 5018 5.49 5315.50 0.00 3
14 5020 2.74 78699.00 2.74 3
15 5021 5.49 5158.50 0.00 3
16 5023 5.99 5315.00 0.00 3
17 5024 5.49 6545.00 0.00 3
18 5025 3.33 63418.00 3.33 3
If there are consecutive entries with a sale price of 0, I want to keep only the last such entry. For example, I want to remove week 4996 and keep week 4997, keep week 5004 and remove week 5002, and similarly delete weeks 5021 and 5023 and keep week 5024.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and create a grouping variable with rleid based on a logical vector flagging the 0 values in 'SALE_PRICE' (!SALE_PRICE). Using 'grp' as the grouping variable, we take the last row of the Subset of Data.table (.SD[.N]) if the 'SALE_PRICE' elements are all 0, or else take .SD, i.e. the full rows, for that group.
library(data.table)
setDT(df1)[, grp := rleid(!SALE_PRICE)
  ][, if (all(!SALE_PRICE)) .SD[.N] else .SD, grp
  ][, grp := NULL][]
# WEEK PRICE QUANTITY SALE_PRICE TYPE
# 1: 4992 5.99 2847.50 0.00 3
# 2: 4995 3.33 36759.00 3.33 3
# 3: 4997 5.49 2858.50 0.00 3
# 4: 5001 3.33 32425.00 3.33 3
# 5: 5004 5.99 4329.50 0.00 3
# 6: 5006 2.74 55811.00 2.74 3
# 7: 5008 5.99 4074.00 0.00 3
# 8: 5009 3.99 12125.25 3.99 3
# 9: 5017 2.74 77645.00 2.74 3
#10: 5018 5.49 5315.50 0.00 3
#11: 5020 2.74 78699.00 2.74 3
#12: 5024 5.49 6545.00 0.00 3
#13: 5025 3.33 63418.00 3.33 3
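For intuition, this is what the grouping variable looks like on the data above (assuming df1 holds that data and data.table is loaded):
rleid(!df1$SALE_PRICE)
# [1]  1  2  3  3  4  5  5  6  7  7  8  8  9 10 11 11 11 12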
Or an option using dplyr: create a grouping variable with diff and cumsum, then filter to keep only the last row of each group where 'SALE_PRICE' is 0, or (|) the rows where 'SALE_PRICE' is not 0.
library(dplyr)
df1 %>%
  group_by(grp = cumsum(c(TRUE, diff(!SALE_PRICE) != 0))) %>%
  filter(!duplicated(!SALE_PRICE, fromLast = TRUE) | SALE_PRICE != 0) %>%
  select(-grp)
# grp WEEK PRICE QUANTITY SALE_PRICE TYPE
# (int) (int) (dbl) (dbl) (dbl) (int)
#1 1 4992 5.99 2847.50 0.00 3
#2 2 4995 3.33 36759.00 3.33 3
#3 3 4997 5.49 2858.50 0.00 3
#4 4 5001 3.33 32425.00 3.33 3
#5 5 5004 5.99 4329.50 0.00 3
#6 6 5006 2.74 55811.00 2.74 3
#7 7 5008 5.99 4074.00 0.00 3
#8 8 5009 3.99 12125.25 3.99 3
#9 8 5017 2.74 77645.00 2.74 3
#10 9 5018 5.49 5315.50 0.00 3
#11 10 5020 2.74 78699.00 2.74 3
#12 11 5024 5.49 6545.00 0.00 3
#13 12 5025 3.33 63418.00 3.33 3
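The cumsum(c(TRUE, diff(!SALE_PRICE) != 0)) construction builds the same run ids as rleid() in the data.table answer; a quick check of mine, assuming df1 holds the data shown in the question:
all(cumsum(c(TRUE, diff(!df1$SALE_PRICE) != 0)) == data.table::rleid(!df1$SALE_PRICE))
# [1] TRUE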

How to count these transitions - in R

Given a table of values, where A = state of system, B = length of state, and C = cumulative length of states:
A B C
1 1.16 1.16
0 0.51 1.67
1 1.16 2.84
0 0.26 3.10
1 0.59 3.69
0 0.39 4.08
1 0.78 4.85
0 0.90 5.75
1 0.78 6.53
0 0.26 6.79
1 0.12 6.91
0 0.51 7.42
1 0.26 7.69
0 0.51 8.20
1 0.39 8.59
0 0.51 9.10
1 1.16 10.26
0 1.10 11.36
1 0.59 11.95
0 0.51 12.46
How would I use R to calculate the number of transitions (where A gives the state) per constant interval length, where the intervals are consecutive and the interval width could be any arbitrary number (I chose a width of 2 in my example)? For example, using the table values we count 2 transitions in the interval 0-2, 3 transitions in >2-4, 3 transitions in >4-6, etc.
This is straightforward in R. All you need is column C and ?cut. Consider:
d <- read.table(text="A B C
1 1.16 1.16
0 0.51 1.67
1 1.16 2.84
0 0.26 3.10
1 0.59 3.69
0 0.39 4.08
1 0.78 4.85
0 0.90 5.75
1 0.78 6.53
0 0.26 6.79
1 0.12 6.91
0 0.51 7.42
1 0.26 7.69
0 0.51 8.20
1 0.39 8.59
0 0.51 9.10
1 1.16 10.26
0 1.10 11.36
1 0.59 11.95
0 0.51 12.46", header=TRUE)
fi <- cut(d$C, breaks=seq(from=0, to=14, by=2))
table(fi)
# fi
# (0,2] (2,4] (4,6] (6,8] (8,10] (10,12] (12,14]
# 2 3 3 5 3 3 1
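If a data frame of counts per interval is preferred, the contingency table converts directly; a small follow-up of mine, not part of the original answer:
as.data.frame(table(fi), responseName = "transitions")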
