I have run across similar questions, but have not been able to find an answer for my specific needs.
I have a data set with a nested group design and I need to assign a unique, non-repeating ID to each nested group, even when different groups have identical values. While I regularly conduct this type of data wrangling, both the structure of this data set and the required outcome are beyond my skillset at this time.
Below I have provided an example data set (df) and what the results should look like.
I used the below code in my actual data set, but realized that it fails under certain circumstances...which are exaggerated in the example data set provided here. I prefer the ID to be sequentially numbered.
df$ID = cumsum(c(TRUE, diff(df$LENGTH) != 0))
I am open to all options (e.g., library(data.table), library(boot), etc) as it would be great if others find this post useful. However, I prefer solutions that do not require the installation and loading of additional packages.
Thanks in advance for your help.
Take care.
df <- read.table(text = "GROUP REGION TIME LENGTH
a x 1 3
a x 2 3
a x 3 3
a y 4 3
a y 5 3
a y 6 3
a z 7 2
a z 8 2
b z 1 2
b z 2 2
b x 3 2
b x 4 2
c x 1 2
c x 2 2
c y 3 2
c y 4 2
c x 5 2
c x 6 2
c z 7 1", header = TRUE)
result <- read.table(text = "GROUP REGION TIME LENGTH ID
a x 1 3 1
a x 2 3 1
a x 3 3 1
a y 4 3 2
a y 5 3 2
a y 6 3 2
a z 7 2 3
a z 8 2 3
b z 1 2 4
b z 2 2 4
b x 3 2 5
b x 4 2 5
c x 1 2 6
c x 2 2 6
c y 3 2 7
c y 4 2 7
c x 5 2 8
c x 6 2 8
c z 7 1 9", header = TRUE)
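To make the failure mode concrete, here is a minimal sketch on a made-up five-row data frame (toy is a hypothetical name, not part of the data above). It assumes, as in the example, that neighbouring REGION blocks can share the same LENGTH, which is exactly when the diff(LENGTH) approach merges them.

```r
# Hypothetical toy data: the "x" and "y" blocks share LENGTH == 3
toy <- data.frame(
  GROUP  = c("a", "a", "a", "a", "a"),
  REGION = c("x", "x", "y", "y", "z"),
  LENGTH = c(3, 3, 3, 3, 2)
)

# Original attempt: a new ID starts only when LENGTH changes
id_length <- cumsum(c(TRUE, diff(toy$LENGTH) != 0))

# Comparing the grouping keys instead starts a new ID at every new block
key    <- paste(toy$GROUP, toy$REGION)
id_key <- cumsum(c(TRUE, key[-1] != key[-length(key)]))

id_length  # 1 1 1 1 2  (x and y are merged)
id_key     # 1 1 2 2 3
```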
Paste GROUP and REGION columns and use rle to create a sequential ID column.
transform(df,ID = with(rle(paste(GROUP, REGION)),rep(seq_along(values),lengths)))
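If it helps to see what rle is doing in that one-liner, here is a small sketch on a made-up key vector (the values are hypothetical, chosen so that one key reappears later). rle returns the run values and run lengths, and repeating the run index by the run lengths gives the ID.

```r
# Hypothetical pasted keys; note that "a x" comes back at the end
key <- c("a x", "a x", "a y", "b y", "b y", "a x")

runs <- rle(key)
runs$values   # "a x" "a y" "b y" "a x"
runs$lengths  # 2 1 2 1

# One ID per run, repeated for every row in that run
id <- rep(seq_along(runs$values), runs$lengths)
id            # 1 1 2 3 3 4
```

Because the ID is tied to the run and not to the key itself, the second "a x" run gets its own number, which matches the non-repeating ID asked for in the question.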
In data.table we can use rleid.
library(data.table)
setDT(df)[, ID := rleid(GROUP, REGION)]
# GROUP REGION TIME LENGTH ID
# 1: a x 1 3 1
# 2: a x 2 3 1
# 3: a x 3 3 1
# 4: a y 4 3 2
# 5: a y 5 3 2
# 6: a y 6 3 2
# 7: a z 7 2 3
# 8: a z 8 2 3
# 9: b z 1 2 4
#10: b z 2 2 4
#11: b x 3 2 5
#12: b x 4 2 5
#13: c x 1 2 6
#14: c x 2 2 6
#15: c y 3 2 7
#16: c y 4 2 7
#17: c x 5 2 8
#18: c x 6 2 8
#19: c z 7 1 9
Another base R option, but without rle
transform(
df,
ID = cumsum(c(1, (s <- paste0(GROUP, REGION))[-1] != head(s, -1)))
)
gives
GROUP REGION TIME LENGTH ID
1 a x 1 3 1
2 a x 2 3 1
3 a x 3 3 1
4 a y 4 3 2
5 a y 5 3 2
6 a y 6 3 2
7 a z 7 2 3
8 a z 8 2 3
9 b z 1 2 4
10 b z 2 2 4
11 b x 3 2 5
12 b x 4 2 5
13 c x 1 2 6
14 c x 2 2 6
15 c y 3 2 7
16 c y 4 2 7
17 c x 5 2 8
18 c x 6 2 8
19 c z 7 1 9
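One caveat that may or may not matter for real data: paste0 joins the two columns without a separator, so different GROUP/REGION pairs can collapse to the same string. The sketch below uses made-up values to show this, and assumes a separator ("\r" here, an arbitrary choice) that never occurs in the data.

```r
# Hypothetical values where paste0() cannot tell the pairs apart
GROUP  <- c("ab", "ab", "a")
REGION <- c("c",  "c",  "bc")

s0 <- paste0(GROUP, REGION)                 # "abc" "abc" "abc"
id_paste0 <- cumsum(c(1, s0[-1] != head(s0, -1)))

# Same idea with a separator assumed not to appear in the data
s1 <- paste(GROUP, REGION, sep = "\r")
id_sep <- cumsum(c(1, s1[-1] != head(s1, -1)))

id_paste0  # 1 1 1
id_sep     # 1 1 2
```

With single-letter codes like the ones in the question this cannot happen, so the original answer is fine there.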
With dplyr (rleid still comes from data.table):
library(dplyr)
library(data.table)
df %>%
mutate(ID = rleid(GROUP, REGION))
I have a data frame with 5 columns and a large number of rows, where elements are repeated only in the first 3 columns. In short, it is a volume built from several volumes, so the same coordinates (x,y,z) appear with different labels, and I would like to eliminate the repeated coordinates.
How can I eliminate these with R commands?
Thanks
AV
You can use the duplicated function, e.g.:
# create an example data.frame
Lab1<-letters[1:10]
Lab2<-LETTERS[1:10]
x <- c(3,4,3,3,4,2,4,3,9,0)
y <- c(3,4,3,5,4,2,1,5,7,2)
z <- c(8,7,8,8,4,3,1,8,6,3)
DF <- data.frame(Lab1,Lab2,x,y,z)
> DF
Lab1 Lab2 x y z
1 a A 3 3 8
2 b B 4 4 7
3 c C 3 3 8
4 d D 3 5 8
5 e E 4 4 4
6 f F 2 2 3
7 g G 4 1 1
8 h H 3 5 8
9 i I 9 7 6
10 j J 0 2 3
# remove rows having repeated x,y,z
DF2 <- DF[!duplicated(DF[,c('x','y','z')]),]
> DF2
Lab1 Lab2 x y z
1 a A 3 3 8
2 b B 4 4 7
4 d D 3 5 8
5 e E 4 4 4
6 f F 2 2 3
7 g G 4 1 1
9 i I 9 7 6
10 j J 0 2 3
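duplicated keeps the first occurrence of each coordinate by default. As a sketch of the other obvious choices, the fromLast argument keeps the last occurrence instead, and combining both directions drops every row whose coordinates appear more than once. The variable names below (coords, keep_last, no_repeats) are just for illustration.

```r
DF <- data.frame(
  Lab1 = letters[1:10],
  Lab2 = LETTERS[1:10],
  x = c(3, 4, 3, 3, 4, 2, 4, 3, 9, 0),
  y = c(3, 4, 3, 5, 4, 2, 1, 5, 7, 2),
  z = c(8, 7, 8, 8, 4, 3, 1, 8, 6, 3)
)
coords <- DF[, c("x", "y", "z")]

dup_first <- duplicated(coords)                   # TRUE for rows 3 and 8
dup_last  <- duplicated(coords, fromLast = TRUE)  # TRUE for rows 1 and 4

# Keep the last occurrence of each coordinate
keep_last <- DF[!dup_last, ]

# Keep only coordinates that occur exactly once
no_repeats <- DF[!(dup_first | dup_last), ]

rownames(keep_last)   # "2" "3" "5" "6" "7" "8" "9" "10"
rownames(no_repeats)  # "2" "5" "6" "7" "9" "10"
```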
EDIT :
To allow choosing among the rows that have the same coordinates, you can use, for example, the by function (although it is less efficient than the previous approach):
res <- by(DF,
INDICES=paste(DF$x,DF$y,DF$z,sep='|'),
FUN=function(equalRows){
# equalRows is a data.frame with the rows having the same x,y,z
# for example, here we choose the first row after ordering by Lab1, then Lab2
row <- equalRows[order(equalRows$Lab1,equalRows$Lab2),][1,]
return(row)
})
DF2 <- do.call(rbind.data.frame,res)
> DF2
Lab1 Lab2 x y z
0|2|3 j J 0 2 3
2|2|3 f F 2 2 3
3|3|8 a A 3 3 8
3|5|8 d D 3 5 8
4|1|1 g G 4 1 1
4|4|4 e E 4 4 4
4|4|7 b B 4 4 7
9|7|6 i I 9 7 6
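A possibly simpler way to get the same kind of control, sketched under the assumption that the preferred row can be expressed as a sort order: sort the data frame so the preferred row comes first within each coordinate, then apply duplicated as before. The example below prefers the largest Lab1, purely for illustration.

```r
DF <- data.frame(
  Lab1 = letters[1:10],
  Lab2 = LETTERS[1:10],
  x = c(3, 4, 3, 3, 4, 2, 4, 3, 9, 0),
  y = c(3, 4, 3, 5, 4, 2, 1, 5, 7, 2),
  z = c(8, 7, 8, 8, 4, 3, 1, 8, 6, 3)
)

# Put the preferred rows first (here: Lab1 in decreasing order)
ord <- DF[order(DF$Lab1, decreasing = TRUE), ]

# The first row seen for each (x, y, z) is now the preferred one
DF3 <- ord[!duplicated(ord[, c("x", "y", "z")]), ]

rownames(DF3)  # "10" "9" "8" "7" "6" "5" "3" "2"
```

Rows c and h are kept here in place of a and d, and the original row names survive, which the by version does not preserve.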
Suppose I have this dataset
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 0 0 1 X K John
1 A 2 0 0 2 X K John
1 A 2 5 8 3 X K John
1 A 2 5 8 4 X L Sam
2 B 2 3 4 1 X L Sam
2 B 2 0 0 2 X L Sam
2 B 2 0 0 3 X M John
2 B 2 0 0 4 X L John
3 C 2 0 0 1 X K John
3 C 2 8 10 2 Y M John
3 C 2 8 10 3 Y K John
3 C 2 0 0 4 Y K John
5 E 2 0 0 1 Y M Sam
5 E 2 5 5 2 Y L Sam
5 E 2 5 9 3 Y M Sam
5 E 2 0 0 4 Z M Kyle
5 E 2 5 8 5 Z L Kyle
5 E 2 5 8 6 Z M Kyle
I want to delete rows with zeroes in the sales and Profit columns, by Id group.
So for a given Id, if two or more consecutive rows have zero values for both sales and Profit, those rows should be deleted. The dataset would then look like this.
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 8 3 X K John
1 A 2 5 8 4 X L Sam
2 B 2 3 4 1 X L Sam
3 C 2 0 0 1 X K John
3 C 2 8 10 2 Y M John
3 C 2 8 10 3 Y K John
3 C 2 0 0 4 Y K John
5 E 2 0 0 1 Y M Sam
5 E 2 5 5 2 Y L Sam
5 E 2 5 9 3 Y M Sam
5 E 2 0 0 4 Z M Kyle
5 E 2 5 8 5 Z L Kyle
5 E 2 5 8 6 Z M Kyle
I can remove all rows if they have zero values for Sales and Profit with
df1 = df[!(df$sales==0 & df$Profit==0),]
But how do I delete such rows only within each group, in this case by Id?
P.S. The idea is to delete entries for products that only started selling after a few months, or that were abandoned after a few months, within a yearly cycle.
Here's an approach using rleid from "data.table":
library(data.table)
as.data.table(mydf)[, N := .N, by = .(Id, rleid(sales == 0 & Profit == 0))][
!(sales == 0 & Profit == 0 & N >= 2)]
## Id Name Price sales Profit Month Category Mode Supplier N
## 1: 1 A 2 5 8 3 X K John 2
## 2: 1 A 2 5 8 4 X L Sam 2
## 3: 2 B 2 3 4 1 X L Sam 1
## 4: 3 C 2 0 0 1 X K John 1
## 5: 3 C 2 8 10 2 Y M John 2
## 6: 3 C 2 8 10 3 Y K John 2
## 7: 3 C 2 0 0 4 Y K John 1
## 8: 5 E 2 0 0 1 Y M Sam 1
## 9: 5 E 2 5 5 2 Y L Sam 2
## 10: 5 E 2 5 9 3 Y M Sam 2
## 11: 5 E 2 0 0 4 Z M Kyle 1
## 12: 5 E 2 5 8 5 Z L Kyle 2
## 13: 5 E 2 5 8 6 Z M Kyle 2
Here's how to do it with dplyr. Basically, I'm only keeping rows that are not zero OR whose previous/following row is not zero.
library(dplyr)
table1 %>%
group_by(Id) %>%
mutate(Lag=lag(sales),Lead=lead(sales)) %>%
rowwise() %>%
mutate(Min=min(Lag,Lead,na.rm=TRUE)) %>%
filter(sales>0|Min>0) %>%
select(-Lead,-Lag,-Min)
Id Name Price sales Profit Month Category Mode Supplier
(int) (chr) (int) (int) (int) (int) (chr) (chr) (chr)
1 1 A 2 5 8 3 X K John
2 1 A 2 5 8 4 X L Sam
3 2 B 2 3 4 1 X L Sam
4 3 C 2 0 0 1 X K John
5 3 C 2 8 10 2 Y M John
6 3 C 2 8 10 3 Y K John
7 3 C 2 0 0 4 Y K John
8 5 E 2 0 0 1 Y M Sam
9 5 E 2 5 5 2 Y L Sam
10 5 E 2 5 9 3 Y M Sam
11 5 E 2 0 0 4 Z M Kyle
12 5 E 2 5 8 5 Z L Kyle
13 5 E 2 5 8 6 Z M Kyle
Data
table1 <-read.table(text="
Id,Name,Price,sales,Profit,Month,Category,Mode,Supplier
1,A,2,0,0,1,X,K,John
1,A,2,0,0,2,X,K,John
1,A,2,5,8,3,X,K,John
1,A,2,5,8,4,X,L,Sam
2,B,2,3,4,1,X,L,Sam
2,B,2,0,0,2,X,L,Sam
2,B,2,0,0,3,X,M,John
2,B,2,0,0,4,X,L,John
3,C,2,0,0,1,X,K,John
3,C,2,8,10,2,Y,M,John
3,C,2,8,10,3,Y,K,John
3,C,2,0,0,4,Y,K,John
5,E,2,0,0,1,Y,M,Sam
5,E,2,5,5,2,Y,L,Sam
5,E,2,5,9,3,Y,M,Sam
5,E,2,0,0,4,Z,M,Kyle
5,E,2,5,8,5,Z,L,Kyle
5,E,2,5,8,6,Z,M,Kyle
",sep=",",stringsAsFactors =FALSE, header=TRUE)
UPDATE
To filter on more than one column with these criteria, here's how to do it. In the present case, the result is the same because when sales are 0, profits are also 0.
library(dplyr)
table1 %>%
group_by(Id) %>%
mutate(LagS=lag(sales),LeadS=lead(sales),LagP=lag(Profit),LeadP=lead(Profit)) %>%
rowwise() %>%
mutate(MinS=min(LagS,LeadS,na.rm=TRUE),MinP=min(LagP,LeadP,na.rm=TRUE)) %>%
filter(sales>0|MinS>0|Profit>0|MinP>0) %>% # "|" means OR
select(-LeadS,-LagS,-MinS,-LeadP,-LagP,-MinP)
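For readers without dplyr, the same lag/lead idea can be sketched in base R with ave. The helpers lag1 and lead1 and the toy data frame are made up for this illustration, and the sketch only looks at sales, like the first dplyr snippet.

```r
# Hypothetical helpers: shift a vector by one position, padding with NA
lag1  <- function(v) c(NA, head(v, -1))
lead1 <- function(v) c(tail(v, -1), NA)

toy <- data.frame(
  Id    = c(1, 1, 1, 2, 2, 2),
  sales = c(0, 0, 5, 3, 0, 4)
)

# Previous and next value of sales within each Id
prev_s <- ave(toy$sales, toy$Id, FUN = lag1)
next_s <- ave(toy$sales, toy$Id, FUN = lead1)

# Smallest available neighbour; NA only if a row has no neighbours at all
nb <- pmin(prev_s, next_s, na.rm = TRUE)

# Keep non-zero rows, and zero rows whose neighbours are all non-zero
keep <- toy$sales > 0 | is.na(nb) | nb > 0
toy[keep, ]
```

In this toy data the two leading zeros of Id 1 are dropped, while the isolated zero in Id 2 is kept.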
I can't do it in one line, but here it is in three:
x <- df$sales==0 & df$Profit==0
y <- cumsum(c(1,head(x,-1)!=tail(x,-1)))
df[ave(x,df$Id,y,FUN=sum)<2,]
# Id Name Price sales Profit Month Category Mode Supplier
# 3 1 A 2 5 8 3 X K John
# 4 1 A 2 5 8 4 X L Sam
# 5 2 B 2 3 4 1 X L Sam
# 9 3 C 2 0 0 1 X K John
# 10 3 C 2 8 10 2 Y M John
# 11 3 C 2 8 10 3 Y K John
# 12 3 C 2 0 0 4 Y K John
# 13 5 E 2 0 0 1 Y M Sam
# 14 5 E 2 5 5 2 Y L Sam
# 15 5 E 2 5 9 3 Y M Sam
# 16 5 E 2 0 0 4 Z M Kyle
# 17 5 E 2 5 8 5 Z L Kyle
# 18 5 E 2 5 8 6 Z M Kyle
This works by first identifying all rows where sales and Profit are both 0 (x). The variable y groups consecutive TRUE and FALSE values. The ave() function splits the first input variable (x) according to the subsequent input variables (df$Id and y) then applies the function within groups. Since the function is sum(), it will add up all the TRUE values in x, then it returns a vector of the same length as x, so we just need to keep all the rows where the result is less than 2.
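The intermediate vectors are easier to follow on a small made-up example (toy and n_zero are hypothetical names; as.numeric is added only to make the result type explicit). Note that y is not reset at the Id boundary, which is why df$Id is passed to ave as well.

```r
toy <- data.frame(
  Id     = c(1, 1, 1, 2, 2, 2),
  sales  = c(0, 0, 5, 3, 0, 4),
  Profit = c(0, 0, 8, 4, 0, 2)
)

# Rows where both columns are zero
x <- toy$sales == 0 & toy$Profit == 0
# TRUE TRUE FALSE FALSE TRUE FALSE

# Run counter: increases whenever x flips
y <- cumsum(c(1, head(x, -1) != tail(x, -1)))
# 1 1 2 2 3 4   (rows 3 and 4 share run 2 but belong to different Ids)

# Number of zero rows in each (Id, run) block, repeated for every row
n_zero <- ave(as.numeric(x), toy$Id, y, FUN = sum)
# 2 2 0 0 1 0

toy[n_zero < 2, ]  # drops rows 1 and 2 only
```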
Here is my solution:
aux <- lapply(tapply(df$sales + df$Profit, df$Id, rle), function(x)
with(x, cbind(rep(values, lengths), rep(lengths, lengths))))
df[!(do.call(rbind, aux)[,1]==0 & do.call(rbind, aux)[,2] >= 2),]
Id Name Price sales Profit Month Category Mode Supplier
3 1 A 2 5 8 3 X K John
4 1 A 2 5 8 4 X L Sam
5 2 B 2 3 4 1 X L Sam
9 3 C 2 0 0 1 X K John
10 3 C 2 8 10 2 Y M John
11 3 C 2 8 10 3 Y K John
12 3 C 2 0 0 4 Y K John
13 5 E 2 0 0 1 Y M Sam
14 5 E 2 5 5 2 Y L Sam
15 5 E 2 5 9 3 Y M Sam
16 5 E 2 0 0 4 Z M Kyle
17 5 E 2 5 8 5 Z L Kyle
18 5 E 2 5 8 6 Z M Kyle
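Two things to keep in mind with the snippet above: it stacks the per-Id results in sorted Id order, so it lines up with df only when df is already sorted by Id, and sales + Profit is zero for any pair that cancels out (a negative profit, say). A variant that sidesteps both, sketched with a hypothetical helper run_len and made-up data in which Id 2 comes before Id 1:

```r
# Hypothetical helper: length of the run each element belongs to
run_len <- function(flag) {
  r <- rle(flag)
  rep(r$lengths, r$lengths)
}

toy <- data.frame(
  Id     = c(2, 2, 2, 1, 1, 1),
  sales  = c(3, 0, 4, 0, 0, 5),
  Profit = c(4, 0, 2, 0, 0, 8)
)

# Test both columns explicitly instead of summing them
zero <- toy$sales == 0 & toy$Profit == 0

# Run lengths computed within each Id; ave() returns them in row order
len <- ave(as.numeric(zero), toy$Id, FUN = run_len)

keep <- !(zero & len >= 2)
toy[keep, ]  # rows 1, 2, 3 and 6
```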
Imagine we have an M*N matrix, with M rows and N columns, like
b a d c e
a 2 1 4 3 5
b 3 2 5 4 6
c 1 3 3 2 4
I want to write a function that takes the above matrix and returns the following matrix:
a b c d e
a 1 2 3 4 5
b 2 3 4 5 6
c 3 1 2 3 4
Here the first M*M part of the matrix (3*3 in this case) has its row names and column names in the same order, and the remaining columns (3*2 here, out of 3*5 in total) are placed after it.
For an N x M matrix where N <= M and all row names are contained in col names, this will bring the columns with names existing in row names to the front in the same order as the row names, and leave the rest of the columns in their original order after that:
mat_ord <- function(mx) mx[, c(rownames(mx), setdiff(colnames(mx), rownames(mx)))]
mat_ord(mx)
produces:
a b c d e
a 1 2 3 4 5
b 2 3 4 5 6
c 3 1 2 3 4
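Since mx is not defined above, here is one way to build it from the question's example, together with a slightly more defensive version of the function. mat_ord_checked is a hypothetical name; it only adds a check for the stated assumption (all row names present among the column names) and drop = FALSE so that a one-row matrix stays a matrix.

```r
# The matrix from the question
mx <- matrix(
  c(2, 1, 4, 3, 5,
    3, 2, 5, 4, 6,
    1, 3, 3, 2, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), c("b", "a", "d", "c", "e"))
)

# Hypothetical checked variant of mat_ord
mat_ord_checked <- function(mx) {
  stopifnot(!is.null(rownames(mx)),
            all(rownames(mx) %in% colnames(mx)))
  first <- rownames(mx)                  # columns matching the row names
  rest  <- setdiff(colnames(mx), first)  # remaining columns, original order
  mx[, c(first, rest), drop = FALSE]
}

res <- mat_ord_checked(mx)
colnames(res)  # "a" "b" "c" "d" "e"
```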
To see the difference, consider mx2 which has rows and columns ordered differently than mx:
e a b d c
b 6 2 3 5 4
a 5 1 2 4 3
c 4 3 1 3 2
And with mat_ord(mx2) we get:
b a c e d
b 3 2 4 6 5
a 2 1 3 5 4
c 1 3 2 4 3
UPDATE: this sorts rows and columns while ensuring symmetry on first N cols/rows:
mat_ord2 <- function(mx) mx[sort(rownames(mx)), c(sort(rownames(mx)), sort(setdiff(colnames(mx), rownames(mx))))]
mat_ord2(mx2)
produces:
a b c d e
a 1 2 3 4 5
b 2 3 4 5 6
c 3 1 2 3 4
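For anyone who wants to reproduce the mx2 comparison, a small sketch follows. It assumes mx is the matrix from the question and builds mx2 by reordering its rows and columns, which matches the values printed above.

```r
mx <- matrix(
  c(2, 1, 4, 3, 5,
    3, 2, 5, 4, 6,
    1, 3, 3, 2, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), c("b", "a", "d", "c", "e"))
)

# Same data, rows and columns shuffled
mx2 <- mx[c("b", "a", "c"), c("e", "a", "b", "d", "c")]

mat_ord  <- function(mx) mx[, c(rownames(mx), setdiff(colnames(mx), rownames(mx)))]
mat_ord2 <- function(mx) mx[sort(rownames(mx)),
                            c(sort(rownames(mx)),
                              sort(setdiff(colnames(mx), rownames(mx))))]

colnames(mat_ord(mx2))   # "b" "a" "c" "e" "d"  (follows the row order of mx2)
colnames(mat_ord2(mx2))  # "a" "b" "c" "d" "e"  (sorted)

# The sorted version gives the same matrix regardless of the input order
identical(mat_ord2(mx2), mat_ord2(mx))  # TRUE
```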