Delete certain rows in a group of rows in R
Suppose I have this dataset:
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 0 0 1 X K John
1 A 2 0 0 2 X K John
1 A 2 5 8 3 X K John
1 A 2 5 8 4 X L Sam
2 B 2 3 4 1 X L Sam
2 B 2 0 0 2 X L Sam
2 B 2 0 0 3 X M John
2 B 2 0 0 4 X L John
3 C 2 0 0 1 X K John
3 C 2 8 10 2 Y M John
3 C 2 8 10 3 Y K John
3 C 2 0 0 4 Y K John
5 E 2 0 0 1 Y M Sam
5 E 2 5 5 2 Y L Sam
5 E 2 5 9 3 Y M Sam
5 E 2 0 0 4 Z M Kyle
5 E 2 5 8 5 Z L Kyle
5 E 2 5 8 6 Z M Kyle
I want to delete rows that have zeroes in the sales and Profit columns, by Id group.
So for a certain Id, if two or more consecutive rows have zero values for both sales and Profit, those rows should be deleted. This dataset will then become:
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 8 3 X K John
1 A 2 5 8 4 X L Sam
2 B 2 3 4 1 X L Sam
3 C 2 0 0 1 X K John
3 C 2 8 10 2 Y M John
3 C 2 8 10 3 Y K John
3 C 2 0 0 4 Y K John
5 E 2 0 0 1 Y M Sam
5 E 2 5 5 2 Y L Sam
5 E 2 5 9 3 Y M Sam
5 E 2 0 0 4 Z M Kyle
5 E 2 5 8 5 Z L Kyle
5 E 2 5 8 6 Z M Kyle
I can remove all rows that have zero values for sales and Profit with
df1 <- df[!(df$sales == 0 & df$Profit == 0), ]
But how do I delete such rows only within a certain group, in this case by Id?
P.S. The idea is to delete entries for products that only started selling after a few months, or that were abandoned after a few months, within a one-year cycle.
Here's an approach using rleid from "data.table":
library(data.table)
as.data.table(mydf)[, N := .N, by = .(Id, rleid(sales == 0 & Profit == 0))][
!(sales == 0 & Profit == 0 & N >= 2)]
## Id Name Price sales Profit Month Category Mode Supplier N
## 1: 1 A 2 5 8 3 X K John 2
## 2: 1 A 2 5 8 4 X L Sam 2
## 3: 2 B 2 3 4 1 X L Sam 1
## 4: 3 C 2 0 0 1 X K John 1
## 5: 3 C 2 8 10 2 Y M John 2
## 6: 3 C 2 8 10 3 Y K John 2
## 7: 3 C 2 0 0 4 Y K John 1
## 8: 5 E 2 0 0 1 Y M Sam 1
## 9: 5 E 2 5 5 2 Y L Sam 2
## 10: 5 E 2 5 9 3 Y M Sam 2
## 11: 5 E 2 0 0 4 Z M Kyle 1
## 12: 5 E 2 5 8 5 Z L Kyle 2
## 13: 5 E 2 5 8 6 Z M Kyle 2
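Here, rleid() gives each run of consecutive identical values of (sales == 0 & Profit == 0) its own id, so grouping by Id together with the run id isolates every uninterrupted stretch of zero rows within an Id, and N is the length of that stretch. If you don't want the helper column N in the result, it can be dropped at the end of the chain (a small addition of mine, same logic otherwise):

library(data.table)
as.data.table(mydf)[, N := .N, by = .(Id, rleid(sales == 0 & Profit == 0))][
  !(sales == 0 & Profit == 0 & N >= 2)][
  , N := NULL][]  # remove the helper column; the trailing [] prints the result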
Here's how to do it with dplyr. Basically, I only keep rows that are non-zero OR whose previous/following row is non-zero. (This checks sales only; see the UPDATE below for filtering on both columns.)
table1 %>%
  group_by(Id) %>%
  mutate(Lag = lag(sales), Lead = lead(sales)) %>%
  rowwise() %>%
  mutate(Min = min(Lag, Lead, na.rm = TRUE)) %>%
  filter(sales > 0 | Min > 0) %>%
  select(-Lead, -Lag, -Min)
Id Name Price sales Profit Month Category Mode Supplier
(int) (chr) (int) (int) (int) (int) (chr) (chr) (chr)
1 1 A 2 5 8 3 X K John
2 1 A 2 5 8 4 X L Sam
3 2 B 2 3 4 1 X L Sam
4 3 C 2 0 0 1 X K John
5 3 C 2 8 10 2 Y M John
6 3 C 2 8 10 3 Y K John
7 3 C 2 0 0 4 Y K John
8 5 E 2 0 0 1 Y M Sam
9 5 E 2 5 5 2 Y L Sam
10 5 E 2 5 9 3 Y M Sam
11 5 E 2 0 0 4 Z M Kyle
12 5 E 2 5 8 5 Z L Kyle
13 5 E 2 5 8 6 Z M Kyle
Data
table1 <-read.table(text="
Id,Name,Price,sales,Profit,Month,Category,Mode,Supplier
1,A,2,0,0,1,X,K,John
1,A,2,0,0,2,X,K,John
1,A,2,5,8,3,X,K,John
1,A,2,5,8,4,X,L,Sam
2,B,2,3,4,1,X,L,Sam
2,B,2,0,0,2,X,L,Sam
2,B,2,0,0,3,X,M,John
2,B,2,0,0,4,X,L,John
3,C,2,0,0,1,X,K,John
3,C,2,8,10,2,Y,M,John
3,C,2,8,10,3,Y,K,John
3,C,2,0,0,4,Y,K,John
5,E,2,0,0,1,Y,M,Sam
5,E,2,5,5,2,Y,L,Sam
5,E,2,5,9,3,Y,M,Sam
5,E,2,0,0,4,Z,M,Kyle
5,E,2,5,8,5,Z,L,Kyle
5,E,2,5,8,6,Z,M,Kyle
",sep=",",stringsAsFactors =FALSE, header=TRUE)
UPDATE
To filter on more than one column with these criteria, here's how to do it. In the present case the result is the same, because whenever sales is 0, Profit is also 0.
library(dplyr)
table1 %>%
  group_by(Id) %>%
  mutate(LagS = lag(sales), LeadS = lead(sales), LagP = lag(Profit), LeadP = lead(Profit)) %>%
  rowwise() %>%
  mutate(MinS = min(LagS, LeadS, na.rm = TRUE), MinP = min(LagP, LeadP, na.rm = TRUE)) %>%
  filter(sales > 0 | MinS > 0 | Profit > 0 | MinP > 0) %>% # "|" means OR
  select(-LeadS, -LagS, -MinS, -LeadP, -LagP, -MinP)
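As an aside, the rowwise() step can be avoided with the vectorised pmin(). This is a sketch of the same logic (my addition, not part of the original answer), assuming sales and Profit are non-negative; lag()/lead() with default = 1 make the group edges count as non-zero neighbours, mimicking na.rm = TRUE above:

table1 %>%
  group_by(Id) %>%
  filter(sales > 0 | Profit > 0 |
         pmin(lag(sales, default = 1), lead(sales, default = 1)) > 0 |
         pmin(lag(Profit, default = 1), lead(Profit, default = 1)) > 0) %>%
  ungroup()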
I can't do it in one line, but here it is in three:
x <- df$sales == 0 & df$Profit == 0           # rows where both columns are zero
y <- cumsum(c(1, head(x, -1) != tail(x, -1))) # run id: increments whenever x changes value
df[ave(x, df$Id, y, FUN = sum) < 2, ]         # keep rows whose (Id, run) holds fewer than 2 zero rows
# Id Name Price sales Profit Month Category Mode Supplier
# 3 1 A 2 5 8 3 X K John
# 4 1 A 2 5 8 4 X L Sam
# 5 2 B 2 3 4 1 X L Sam
# 9 3 C 2 0 0 1 X K John
# 10 3 C 2 8 10 2 Y M John
# 11 3 C 2 8 10 3 Y K John
# 12 3 C 2 0 0 4 Y K John
# 13 5 E 2 0 0 1 Y M Sam
# 14 5 E 2 5 5 2 Y L Sam
# 15 5 E 2 5 9 3 Y M Sam
# 16 5 E 2 0 0 4 Z M Kyle
# 17 5 E 2 5 8 5 Z L Kyle
# 18 5 E 2 5 8 6 Z M Kyle
This works by first identifying all rows where sales and Profit are both 0 (x). The variable y numbers the runs of consecutive TRUE and FALSE values. The ave() function splits the first input variable (x) according to the subsequent grouping variables (df$Id and y) and applies the function within each group. Since the function is sum(), it adds up the TRUE values in x and returns a vector of the same length as x, so we just need to keep all the rows where the result is less than 2.
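To make this concrete, here are the intermediate values for the sample data above (my annotation, writing TRUE/FALSE as 1/0 for compactness):

x  # 1 1 0 0 0 1 1 1 1 0 0 1 1 0 0 1 0 0
y  # 1 1 2 2 2 3 3 3 3 4 4 5 5 6 6 7 8 8

Run 3 (rows 6 to 9) crosses the boundary between Id 2 and Id 3, which is exactly why ave() must also split on df$Id: within Id 3 that run contains a single zero row, so the row is kept.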
Here is my solution:
# rle of sales + Profit within each Id: a run value of 0 means both columns are 0
# (this relies on sales and Profit being non-negative)
aux <- lapply(tapply(df$sales + df$Profit, df$Id, rle), function(x)
  with(x, cbind(rep(values, lengths), rep(lengths, lengths))))
m <- do.call(rbind, aux)  # column 1: run value, column 2: run length
df[!(m[, 1] == 0 & m[, 2] >= 2), ]
Id Name Price sales Profit Month Category Mode Supplier
3 1 A 2 5 8 3 X K John
4 1 A 2 5 8 4 X L Sam
5 2 B 2 3 4 1 X L Sam
9 3 C 2 0 0 1 X K John
10 3 C 2 8 10 2 Y M John
11 3 C 2 8 10 3 Y K John
12 3 C 2 0 0 4 Y K John
13 5 E 2 0 0 1 Y M Sam
14 5 E 2 5 5 2 Y L Sam
15 5 E 2 5 9 3 Y M Sam
16 5 E 2 0 0 4 Z M Kyle
17 5 E 2 5 8 5 Z L Kyle
18 5 E 2 5 8 6 Z M Kyle
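One caveat (my note, not part of the original answer): tapply splits by Id and the per-Id blocks are rbind-ed back in sorted Id order, so the mask only lines up with the rows of df if the data is already ordered by Id, as it is here. A cheap guard:

stopifnot(!is.unsorted(df$Id))  # rows must be grouped by Id for the mask to align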
Related
Creating groups based on running totals against a value
I have data which is unique at one variable Y. Another variable Z tells me how many people are in each Y. My problem is that I want to create groups of 45 from these Y and Z. I mean that whenever the running total of Z touches 45, one group is made and the code moves on to create the next group. My data looks something like this:

ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
13 A M 1
14 A N 1
15 A O 2
16 A P 0
17 A Q 1
18 A R 2

What I want is something like this:

ID X Y Z CumSum Group
1 A A 1 1 1
2 A B 5 6 1
3 A C 2 8 1
4 A D 42 50 1
5 A E 10 10 2
6 A F 2 12 2
7 A G 0 12 2
8 A H 3 15 2
9 A I 0 15 2
10 A J 8 23 2
11 A K 19 42 2
12 A L 3 45 2
13 A M 1 1 3
14 A N 1 2 3
15 A O 2 4 3
16 A P 0 4 3
17 A Q 1 5 3
18 A R 2 7 3

Please let me know how I can achieve this with R.
EDIT: I extended the minimal reproducible example for more clarity.
EDIT 2: I have one extra question on this topic. What if the variable X, which is only A right now, also changes? For example, it can be B for a while, then C. How can I prevent the code from generating groups that straddle two categories of X? For example, if Group = 3, how can I make sure that group 3 does not contain both category A and category B?
A function for this is available in the MESS package:

library(MESS)
library(data.table)
DT[, Group := MESS::cumsumbinning(Z, 50)][, Cumsum := cumsum(Z), by = .(Group)][]

output:

    ID X Y  Z Group Cumsum
 1:  1 A A  1     1      1
 2:  2 A B  5     1      6
 3:  3 A C  2     1      8
 4:  4 A D 42     1     50
 5:  5 A E 10     2     10
 6:  6 A F  2     2     12
 7:  7 A G  0     2     12
 8:  8 A H  3     2     15
 9:  9 A I  0     2     15
10: 10 A J  8     2     23
11: 11 A K 19     2     42
12: 12 A L  3     2     45

sample data:

DT <- fread("ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3")
Define Accum, which adds x to acc, resetting to x if acc is 45 or more. Use Reduce to apply it to Z, giving the cumulative-sum column. The values greater than or equal to 45 are the group ends, so attach a unique group id to them by using a cumsum that starts from the end and runs backwards toward the beginning, giving a unique value for each group (GroupNo). Finally, shift the group ids so that they start from 1. We run this with the input in the Note at the end, which duplicates the last line several times so that 3 groups can be shown. No packages are used.

Accum <- function(acc, x) if (acc < 45) acc + x else x
applyAccum <- function(x) Reduce(Accum, x, accumulate = TRUE)
cumsumr <- function(x) rev(cumsum(rev(x)))  # reverse cumsum
GroupNo <- function(x) {
  y <- cumsumr(x >= 45)
  max(y) - y + 1
}
transform(transform(DF, Cumsum = ave(Z, ID, FUN = applyAccum)),
          Group = ave(Cumsum, ID, FUN = GroupNo))

giving:

   ID X Y  Z Cumsum Group
1   1 A A  1      1     1
2   2 A B  5      6     1
3   3 A C  2      8     1
4   4 A D 42     50     1
5   5 A E 10     10     2
6   6 A F  2     12     2
7   7 A G  0     12     2
8   8 A H  3     15     2
9   9 A I  0     15     2
10 10 A J  8     23     2
11 11 A K 19     42     2
12 12 A L  3     45     2
13 12 A L  3      3     3
14 12 A L  3      6     3

Note
The input in reproducible form:

Lines <- "ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
12 A L 3
12 A L 3"
DF <- read.table(text = Lines, as.is = TRUE, header = TRUE)
One tidyverse possibility could be:

df %>%
  mutate(Cumsum = accumulate(Z, ~ if_else(.x >= 45, .y, .x + .y)),
         Group = cumsum(Cumsum >= 45),
         Group = if_else(Group > lag(Group, default = first(Group)), lag(Group), Group) + 1)

   ID X Y  Z Cumsum Group
1   1 A A  1      1     1
2   2 A B  5      6     1
3   3 A C  2      8     1
4   4 A D 42     50     1
5   5 A E 10     10     2
6   6 A F  2     12     2
7   7 A G  0     12     2
8   8 A H  3     15     2
9   9 A I  0     15     2
10 10 A J  8     23     2
11 11 A K 19     42     2
12 12 A L  3     45     2
Not a pretty solution, but functional:

df$Group <- 0
group <- 1
while (df$Group[nrow(df)] == 0) {
  df$ww[df$Group == 0] <- cumsum(df$Z[df$Group == 0])
  df$Group[df$Group == 0 & (lag(df$ww) <= 45 | is.na(lag(df$ww)) | lag(df$Group != 0))] <- group
  group <- group + 1
}
df

   ID X Y  Z ww Group
1   1 A A  1  1     1
2   2 A B  5  6     1
3   3 A C  2  8     1
4   4 A D 42 50     1
5   5 A E 10 10     2
6   6 A F  2 12     2
7   7 A G  0 12     2
8   8 A H  3 15     2
9   9 A I  0 15     2
10 10 A J  8 23     2
11 11 A K 19 42     2
12 12 A L  3 45     2

OK, yeah, #tmfmnk's solution is vastly better:

Unit: milliseconds
 expr       min        lq     mean    median        uq      max neval
   tm  2.224536  2.805771  6.76661  3.221449  3.990778 303.7623   100
  iod 19.198391 22.294222 30.17730 25.765792 35.768616 110.2062   100
Or using data.table:

library(data.table)
n <- 45L
DT[, cs := Reduce(function(tot, z) if (tot + z > n) z else tot + z, Z, accumulate = TRUE)][
  , Group := .GRP, by = cumsum(c(1L, diff(cs)) < 0L)]

output:

    ID X Y  Z cs Group
 1:  1 A A  1  1     1
 2:  2 A B  5  6     1
 3:  3 A C  2  8     1
 4:  4 A D 42 42     1
 5:  5 A E 10 10     2
 6:  6 A F  2 12     2
 7:  7 A G  0 12     2
 8:  8 A H  3 15     2
 9:  9 A I  0 15     2
10: 10 A J  8 23     2
11: 11 A K 19 42     2
12: 12 A L  3 45     2
13: 13 A M  1  1     3
14: 14 A N  1  2     3
15: 15 A O  2  4     3
16: 16 A P  0  4     3
17: 17 A Q  1  5     3
18: 18 A R  2  7     3

data:

library(data.table)
DT <- fread("ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
13 A M 1
14 A N 1
15 A O 2
16 A P 0
17 A Q 1
18 A R 2")
Determining the total number of times the distinct values 0, 1, or NA occur in each column of a data frame in R
I have 15 columns and I want to count, for each column, how many of its values are 0, 1, or NA.

My dataset:

A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
NA,1.0,0.0,0.0,NA,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,NA,NA,NA,NA
1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,NA,0.0,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0
1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0
0.0,1.0,1.0,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
NA,NA,1.0,NA,NA,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
0.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,NA,NA,NA,NA,NA

I want the output to be like:

     A B C D E F G H I J K L M N O
0    5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
1    5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
NA   5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
We can loop through the dataset and apply table with useNA = "always":

sapply(df1, table, useNA = "always")

If only one particular value occurs in a column, say 1, then convert the column to a factor with the levels specified as 0 and 1:

sapply(df1, function(x) table(factor(x, levels = 0:1), useNA = "always"))
#      A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
#0     4  3  8  7 17 15 14 11 14 12 12 10  8 11  9
#1    19 21 17 17  6  9 10 12  8 11  8 10 12  9 11
#<NA>  2  1  0  1  2  1  1  2  3  2  5  5  5  5  5
Combine rowPerc and colPerc in one matrix?
Suppose I have data with sales of different products in different categories in different months, and I want to see their percentage of sales or number of items in each category:

Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 8 1 X K John
1 A 2 6 9 2 X K John
1 A 2 5 8 3 X K John
2 B 2 4 6 1 X L Sam
2 B 2 3 4 2 X L Sam
2 B 2 5 7 3 X L Sam
3 C 2 5 11 1 X M John
3 C 2 5 11 2 X L John
3 C 2 5 11 3 X K John
4 D 2 8 10 1 Y M John
4 D 2 8 10 2 Y K John
4 D 2 5 7 3 Y K John
5 E 2 5 9 1 Y M Sam
5 E 2 5 9 2 Y L Sam
5 E 2 5 9 3 Y M Sam
6 F 2 4 7 1 Z M Kyle
6 F 2 5 8 2 Z L Kyle
6 F 2 5 8 3 Z M Kyle

Applying table on Category and Mode shows how many times a particular category occurred under a particular mode:

  K L M
X 4 4 1
Y 2 1 3
Z 0 1 2

Now rowPerc and colPerc give percentages either row-wise or column-wise. But what if I am interested in, for example, what percentage of the total data X/K makes up, which is 22.22% (the matrix totals 18)? Is there any way I can get a matrix of each cell's percentage of the total, something like this:

      K     L     M
X 22.22 22.22  5.55
Y 11.11  5.55 16.67
Z  0.00  5.55 11.11

so that the total sum of the matrix is 100% instead of the rows or columns. I hope I explained it clearly. Thanks.
If df is your data.frame:

with(df, prop.table(table(Category, Mode)) * 100)
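To reproduce the rounded display from the question, the result can be wrapped in round() (a small addition of mine, not part of the original answer):

with(df, round(prop.table(table(Category, Mode)) * 100, 2))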
Fill in missing rows in R
Suppose I have a data frame which looks like this:

ID A B C D Month
1 X M 5 1 3
1 X K 4 2 4
1 X K 3 7 5
1 X K 2 6 6
2 Y L 5 8 1
2 Y L 2 3 2
2 Y M 5 1 3
2 Y K 2 7 5
2 Y M 2 8 6
3 Z K 5 3 1
3 Z M 6 3 2
3 Z M 5 8 3
3 Z K 4 2 4

In this data, ID and A are unique variables, while B, C, D, and Month can change their value. Month has 6 factor values, from 1 to 6. B has 3 factor values: K, L, M. C and D can have any value. I want this data to become like this:

ID A B C D Month
1 X 0 0 0 1
1 X 0 0 0 2
1 X M 5 1 3
1 X K 4 2 4
1 X K 3 7 5
1 X K 2 6 6
2 Y L 5 8 1
2 Y L 2 3 2
2 Y M 5 1 3
2 Y 0 0 0 4
2 Y K 2 7 5
2 Y M 2 8 6
3 Z K 5 3 1
3 Z M 6 3 2
3 Z M 5 8 3
3 Z K 4 2 4
3 Z 0 0 0 5
3 Z 0 0 0 6

It should fill in the missing rows, keeping the values of the unique variables the same and filling in the varying ones with zero. I can use the zoo library to fill in missing values, but how do I fill in complete missing rows?
Maybe something like this would work for your needs:

library(dplyr)
mydf %>%
  full_join(expand.grid(ID = unique(mydf$ID), Month = 1:6)) %>%
  group_by(ID) %>%
  mutate(A = replace(A, is.na(A), unique(na.omit(A)))) %>%
  arrange(ID, A, Month) %>%
  replace(., is.na(.), 0)
# Joining by: c("ID", "Month")
# Source: local data frame [18 x 6]
# Groups: ID
#
#    ID A B C D Month
# 1   1 X 0 0 0     1
# 2   1 X 0 0 0     2
# 3   1 X M 5 1     3
# 4   1 X K 4 2     4
# 5   1 X K 3 7     5
# 6   1 X K 2 6     6
# 7   2 Y L 5 8     1
# 8   2 Y L 2 3     2
# 9   2 Y M 5 1     3
# 10  2 Y 0 0 0     4
# 11  2 Y K 2 7     5
# 12  2 Y M 2 8     6
# 13  3 Z K 5 3     1
# 14  3 Z M 6 3     2
# 15  3 Z M 5 8     3
# 16  3 Z K 4 2     4
# 17  3 Z 0 0 0     5
# 18  3 Z 0 0 0     6
Here's a way using base R:

frame <- expand.grid(ID = unique(dat$ID), Month = 1:6)
dat2 <- merge(dat, frame, by = c("ID", "Month"), all = TRUE)[, union(names(dat), names(frame))]
levels(dat2$B) <- c(levels(dat2$B), 0)
res <- lapply(split(dat2, dat2$ID), function(x) {
  x$A[which(is.na(x$A))] <- unique(x$A)[!is.na(unique(x$A))]
  x[is.na(x)] <- 0
  x
})
do.call(rbind, res)

     ID A B C D Month
1.1   1 X 0 0 0     1
1.2   1 X 0 0 0     2
1.3   1 X M 5 1     3
1.4   1 X K 4 2     4
1.5   1 X K 3 7     5
1.6   1 X K 2 6     6
2.7   2 Y L 5 8     1
2.8   2 Y L 2 3     2
2.9   2 Y M 5 1     3
2.10  2 Y 0 0 0     4
2.11  2 Y K 2 7     5
2.12  2 Y M 2 8     6
3.13  3 Z K 5 3     1
3.14  3 Z M 6 3     2
3.15  3 Z M 5 8     3
3.16  3 Z K 4 2     4
3.17  3 Z 0 0 0     5
3.18  3 Z 0 0 0     6
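A more recent tidyverse alternative is tidyr's complete() with nesting(), which keeps the existing ID/A pairs together while crossing them with all months. This is a sketch of mine, not part of the original answers, assuming B is stored as character (note the completed columns come first in the result):

library(dplyr)
library(tidyr)
mydf %>%
  complete(nesting(ID, A), Month = 1:6,
           fill = list(B = "0", C = 0, D = 0)) %>%  # fill the varying columns with zeroes
  arrange(ID, Month)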
Change variable value for repeated IDs
I have this data set:

id <- c(0,0,1,1,2,2,3,3,4,4)
gender <- c("m","m","f","f","f","f","m","m","m","m")
x1 <- c(1,1,1,1,2,2,3,3,10,10)
x2 <- c(3,7,5,6,9,15,10,15,12,20)
alldata <- data.frame(id, gender, x1, x2)

which looks like:

id gender x1 x2
0 m 1 3
0 m 1 7
1 f 1 5
1 f 1 6
2 f 2 9
2 f 2 15
3 m 3 10
3 m 3 15
4 m 10 12
4 m 10 20

Notice that for each unique id the x1 values are the same, but the x2 values are different. I need to sort the data by id and x2 (from smallest to largest), and then for each unique id set x1 (for the second record) = x2 (for the first record). The data would then look like:

id gender x1 x2
0 m 1 3
0 m 3 7
1 f 1 5
1 f 5 6
2 f 2 9
2 f 9 15
3 m 3 10
3 m 10 15
4 m 10 12
4 m 12 20
I found this easier using data.table:

> library(data.table)
> dt = data.table(alldata)
> setkey(dt, id, x2)  # sort the data

This next line says: within each id, for x1, take the first value of x1, then take every remaining value from x2 as needed.

> dt[, x1 := c(x1[1], x2)[1:.N], keyby = id]
> dt
    id gender x1 x2
 1:  0      m  1  3
 2:  0      m  3  7
 3:  1      f  1  5
 4:  1      f  5  6
 5:  2      f  2  9
 6:  2      f  9 15
 7:  3      m  3 10
 8:  3      m 10 15
 9:  4      m 10 12
10:  4      m 12 20
Here's another possible solution, using the seq command to select every other record:

alldata <- alldata[order(id, x2), ]
alldata$x1[seq(2, length(alldata$x1), 2)] <- alldata$x2[seq(1, length(alldata$x2) - 1, 2)]
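A caution (my note, not part of the original answer): this every-other-row trick assumes each id has exactly two rows, as in the sample data. A cheap sanity check before applying it:

stopifnot(all(table(alldata$id) == 2))  # each id must appear exactly twice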
Here is a dplyr solution:

library(dplyr)
arrange(alldata, id, x2) %>%
  group_by(id) %>%
  mutate(x1 = c(first(x1), first(x2)))
Source: local data frame [10 x 4]
Groups: id

   id gender x1 x2
1   0      m  1  3
2   0      m  3  7
3   1      f  1  5
4   1      f  5  6
5   2      f  2  9
6   2      f  9 15
7   3      m  3 10
8   3      m 10 15
9   4      m 10 12
10  4      m 12 20
`rownames<-`(do.call(rbind, by(alldata, alldata$id, function(g) {
  o <- order(g$x2)
  g$x1[o[2]] <- g$x2[o[1]]
  g
})), NULL)
##    id gender x1 x2
## 1   0      m  1  3
## 2   0      m  3  7
## 3   1      f  1  5
## 4   1      f  5  6
## 5   2      f  2  9
## 6   2      f  9 15
## 7   3      m  3 10
## 8   3      m 10 15
## 9   4      m 10 12
## 10  4      m 12 20