Using the following reproducible example:
ID1<-c("a1","a4","a6","a6","a5", "a1" )
ID2<-c("b8","b99","b5","b5","b2","b8" )
Value1<-c(2,5,6,6,2,7)
Value2<- c(23,51,63,64,23,23)
Year<- c(2004,2004,2004,2004,2005,2004)
df<-data.frame(ID1,ID2,Value1,Value2,Year)
I want to select rows where ID1, ID2 and Year all have the same values in their respective columns. For these rows I want to compare Value1 and Value2 across the duplicate rows and, if the values are not the same, remove the row with the smaller value.
Expected result:
ID1 ID2 Value1 Value2 Year new
2 a4 b99 5 51 2004 a4_b99_2004
4 a6 b5 6 64 2004 a6_b5_2004
5 a5 b2 2 23 2005 a5_b2_2005
6 a1 b8 7 23 2004 a1_b8_2004
I tried the following:
Create a unique identifier for the conditions I am interested in:
df$new<-paste(df$ID1,df$ID2, df$Year, sep="_")
I can use the unique identifier to find the rows of the database that contain the duplicates
IND<-which(duplicated(df$new) | duplicated(df$new, fromLast = TRUE))
In a for loop, if the unique identifier has duplicates, compare the values and remove the rows. But the loop is too complicated and I cannot get it to work:
for (i in df$new) {
if(sum(df$new == i)>1)
{
ind<-which(df$new==i)
m= min(df$Value1[ind])
df<-df[-which.min(df$Value1[ind]),]
m= min(df$Value2[ind])
df<-df[-which.min(df$Value2[ind]),]
}
}
Some different possibilities. Using dplyr:
library(dplyr)
df %>%
group_by(ID1, ID2, Year) %>%
filter(Value1 == max(Value1) & Value2 == max(Value2))
Or:
df %>%
rowwise() %>%
mutate(max_val = sum(Value1, Value2)) %>%
ungroup() %>%
group_by(ID1, ID2, Year) %>%
filter(max_val == max(max_val)) %>%
select(-max_val)
Using data.table:
library(data.table)
setDT(df)[df[, .I[Value1 == max(Value1) & Value2 == max(Value2)], by = list(ID1, ID2, Year)]$V1]
Or:
setDT(df)[, max_val := sum(Value1, Value2), by = 1:nrow(df)
][, filter := max_val == max(max_val), by = list(ID1, ID2, Year)
][filter != FALSE
][, -c("max_val", "filter")]
Or:
subset(setDT(df)[, max_val := sum(Value1, Value2), by = 1:nrow(df)
][, filter := max_val == max(max_val), by = list(ID1, ID2, Year)], filter != FALSE)[, -c("max_val", "filter")]
Consider aggregate to retrieve the max values by your grouping, ID1, ID2, and Year:
df_new <- aggregate(.~ID1 + ID2 + Year, df, max)
df_new
# ID1 ID2 Year Value1 Value2
# 1 a6 b5 2004 6 64
# 2 a1 b8 2004 7 23
# 3 a4 b99 2004 5 51
# 4 a5 b2 2005 2 23
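One thing to watch with this approach: max is applied to each value column separately, so the result can mix values from different rows of a group. A minimal sketch in base R, using a made-up two-row data frame (toy is hypothetical, not the question's data), shows the effect and one possible way to keep a whole row instead, by sorting and dropping the later duplicates:

```r
# Hypothetical toy data: the largest Value1 and the largest Value2 sit in different rows
toy <- data.frame(ID1 = c("a1", "a1"), ID2 = c("b8", "b8"),
                  Year = c(2004, 2004),
                  Value1 = c(2, 7), Value2 = c(30, 23))

# Column-wise maxima: Value1 = 7 comes from row 2, Value2 = 30 from row 1
agg <- aggregate(. ~ ID1 + ID2 + Year, toy, max)

# Alternative: sort by Value1, then Value2 (both descending),
# and keep only the first row of each ID1/ID2/Year group
ord <- toy[order(-toy$Value1, -toy$Value2), ]
kept <- ord[!duplicated(ord[c("ID1", "ID2", "Year")]), ]
```

With the question's data both approaches return the same four rows, because in every duplicated group a single row holds both maxima. Which row counts as "larger" when the two columns disagree is a choice the question leaves open.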
Solution without loading libraries:
ID1 ID2 Value1 Value2 Year
a6.b5.2004 a6 b5 6 64 2004
a1.b8.2004 a1 b8 7 23 2004
a4.b99.2004 a4 b99 5 51 2004
a5.b2.2005 a5 b2 2 23 2005
Code
do.call(rbind, lapply(split(df, list(df$ID1, df$ID2, df$Year)), # make identifiers
function(x) {return(x[which.max(x$Value1 + x$Value2),])})) # take max of sum
Related
I have a single column dataframe with all possible IDs:
ID
a1
a2
b1
b11
c1
I get a dataframe from my database with the same column "ID", but not all IDs may be present in it. Here is an example of that table:
ID value
a1 18
a2 10
b1 10
I want to combine those two tables so that IDs which are missing from my table get the value zero. How can I combine these two tables to get this:
ID value
a1 18
a2 10
b1 10
b11 0
c1 0
Join the two tables and replace the NA values with 0.
Using dplyr :
library(dplyr)
df1 %>%
full_join(df2, by = 'ID') %>%
mutate(value = replace(value, is.na(value), 0))
# ID value
#1 a1 18
#2 a2 10
#3 b1 10
#4 b11 0
#5 c1 0
In base R, you can do this as :
transform(merge(df1, df2, by = 'ID', all = TRUE),
value = replace(value, is.na(value), 0))
We can also do
library(dplyr)
df1 %>%
full_join(df2, by = 'ID') %>%
mutate(value = case_when(is.na(value) ~ 0, TRUE ~ value))
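The answers assume that df1 (all possible IDs) and df2 (the table from the database) already exist. A self-contained base R sketch under that assumption, with both data frames rebuilt from the question. Since df1 already lists every ID, a left join (all.x = TRUE) should be enough here:

```r
# df1 and df2 are rebuilt from the question; the names are the ones the answers use
df1 <- data.frame(ID = c("a1", "a2", "b1", "b11", "c1"), stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("a1", "a2", "b1"), value = c(18, 10, 10),
                  stringsAsFactors = FALSE)

# Left join: keep every ID from df1, value is NA where df2 has no match
out <- merge(df1, df2, by = "ID", all.x = TRUE)

# Replace the missing values with 0
out$value[is.na(out$value)] <- 0
out
```

Note that merge sorts the result by the join column, so the row order follows ID rather than the original order of df1.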
There is a matrix:
mat<-matrix(0,ncol = 10, nrow = 5)
colnames(mat)<-c("A1","A2","A3","A4","A5","A6","A7","A8","A9","A10")
rownames(mat)<-c("ID_1", "ID_2", "ID_3", "ID_4", "ID_5")
mat[1,] <-c(0,0,1,1,1,1,0,0,0,0)
mat[2,]<-c(0,0,0,1,1,1,0,0,0,0)
mat[3,]<-c(0,0,0,1,1,1,1,1,1,0)
mat[4,]<-c(0,0,0,0,0,1,1,1,1,0)
mat[5,]<-c(0,0,0,0,0,0,1,1,1,1)
I want to convert this matrix into a table with three columns - "ID", "start" and "stop", where "start" is a column with the first value (1) in row "ID", "stop" is a column with the last value in the row. I would like to receive this output:
Could You please help me?
Thanks in advance.
Here is one way using dplyr: convert the matrix to a data frame, move the row names into a column, reshape the data to long format, keep the rows where value is 1, and take the first and last column name for each id.
library(dplyr)
mat %>%
as.data.frame() %>%
tibble::rownames_to_column('id') %>%
tidyr::pivot_longer(cols = -id) %>%
filter(value == 1) %>%
group_by(id) %>%
summarise(start = first(name), stop = last(name))
# A tibble: 5 x 3
# id start stop
# <chr> <chr> <chr>
#1 ID_1 A3 A6
#2 ID_2 A4 A6
#3 ID_3 A4 A9
#4 ID_4 A6 A9
#5 ID_5 A7 A10
In base R and keeping mat as matrix :
t(apply(mat, 1, function(x) {
inds <- which(x == 1)
c(start = colnames(mat)[min(inds)], stop = colnames(mat)[max(inds)])
}))
You can do this using the ties.method argument in max.col. Use the result to subset the colnames.
data.frame(id = rownames(mat),
start = colnames(mat)[max.col(mat, "first")],
stop = colnames(mat)[max.col(mat, "last")])
# id start stop
# 1 ID_1 A3 A6
# 2 ID_2 A4 A6
# 3 ID_3 A4 A9
# 4 ID_4 A6 A9
# 5 ID_5 A7 A10
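One edge case the example matrix never exercises: max.col returns a column index even for a row that contains no 1 at all, because every entry of such a row ties for the maximum. A hedged sketch with a small made-up matrix (m and its contents are hypothetical) that reports NA for such rows:

```r
# Hypothetical 3 x 4 matrix; the last row contains no 1
m <- matrix(0, nrow = 3, ncol = 4,
            dimnames = list(c("ID_1", "ID_2", "ID_3"), paste0("A", 1:4)))
m[1, 2:3] <- 1
m[2, 4] <- 1

# TRUE for rows that contain at least one 1
has_one <- unname(rowSums(m == 1) > 0)

res <- data.frame(
  id    = rownames(m),
  start = ifelse(has_one, colnames(m)[max.col(m, "first")], NA_character_),
  stop  = ifelse(has_one, colnames(m)[max.col(m, "last")], NA_character_),
  stringsAsFactors = FALSE
)
res
```

Without the has_one guard, ID_3 would be reported as starting at A1 and stopping at A4, which is misleading for a row of zeros.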
I have a dataframe with different parameters in each column. I'd like to merge rows, using a different function for each set of parameters.
Here is my sample data ZZ:
ZZ<-data.frame(Name =c("A","B","C","D","E","F"),A1=c(19,20,21,23,45,67),A2=c(1,2,3,4,5,6),A3=c(7,8,13,24,88,90),x=c(4,5,6,8,23,16),y=c(-3,-7,-6,-9,3,2))
> ZZ
Name A1 A2 A3 x y
1 A 19 1 7 4 -3
2 B 20 2 8 5 -7
3 C 21 3 13 6 -6
4 D 23 4 24 8 -9
5 E 45 5 88 23 3
6 F 67 6 90 16 2
I want to aggregate the rows A, B, C and D, E, F such that a new name is defined for each group (e.g. C1 and C2); A1, A2 and A3 are combined using the sum, while x and y are combined using the mean.
How can this be done please? The result should be:
> ZZ2
Name A1 A2 A3 x y
1 C1 60 6 28 5.000 -5.333
2 C2 135 15 202 15.667 -1.333
Based on how I interpreted your question I believe this should give you what you want using dplyr:
library(dplyr)
result <- ZZ %>%
mutate(Name = ifelse(Name %in% c("A", "B", "C"), "C1", "C2")) %>%
group_by(Name) %>%
summarise(A1 = sum(A1), A2 = sum(A2), A3 = sum(A3), x = mean(x), y = mean(y)) %>%
ungroup()
Depending on how many rows with different names you have, there might be better alternatives for mutating the Name variable into the 2 groups.
EDIT: Example if 4 cases exist
result <- ZZ %>%
mutate(Name = case_when(Name %in% c("A", "B", "C") ~ "C1",
Name %in% c("D", "E") ~ "C2",
Name %in% c("F", "G") ~ "C3",
Name %in% c("H", "I") ~ "C4")) %>%
group_by(Name) %>%
summarise(A1 = sum(A1), A2 = sum(A2), A3 = sum(A3), x = mean(x), y = mean(y)) %>%
ungroup()
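When there are many names, one alternative to a long ifelse or case_when is a lookup table that maps each name to its group. A minimal base R sketch, assuming the mapping is available as a named vector (lookup and the Group column are hypothetical names introduced for illustration):

```r
ZZ <- data.frame(Name = c("A", "B", "C", "D", "E", "F"),
                 A1 = c(19, 20, 21, 23, 45, 67),
                 A2 = c(1, 2, 3, 4, 5, 6),
                 A3 = c(7, 8, 13, 24, 88, 90),
                 x = c(4, 5, 6, 8, 23, 16),
                 y = c(-3, -7, -6, -9, 3, 2))

# Hypothetical lookup: original name -> new group name
lookup <- c(A = "C1", B = "C1", C = "C1", D = "C2", E = "C2", F = "C2")
ZZ$Group <- unname(lookup[as.character(ZZ$Name)])

# Sum A1..A3 and average x, y per group, then put the two results side by side
sums  <- aggregate(cbind(A1, A2, A3) ~ Group, ZZ, sum)
means <- aggregate(cbind(x, y) ~ Group, ZZ, mean)
ZZ2 <- merge(sums, means, by = "Group")
ZZ2
```

Adding a new group then only means extending lookup; the aggregation code stays the same.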
Given a dataframe df like below
text <- "
parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2
"
df <- read.table(textConnection(text), sep = ",", header = TRUE)
I want to add a row with parameter b_diff for each car, qtr combination with val as difference of parameter b for two consecutive qtr. The qtr ascending order is FY18Q1, FY18Q2. For the first qtr which is FY18Q1, the val for b_diff shall be NA as there is no previous qtr.
The expected output is as below.
parameter car qtr val
a a3 FY18Q1 23
b a3 FY18Q1 10000
b_diff a3 FY18Q1 NA
a a3 FY18Q2 14
b a3 FY18Q2 12000
b_diff a3 FY18Q2 2000
a cla FY18Q1 15
b cla FY18Q1 12000
c cla FY18Q1 5.5
b_diff cla FY18Q1 NA
a cla FY18Q2 26
b cla FY18Q2 10000
c cla FY18Q2 6.2
b_diff cla FY18Q2 -2000
How do I go about doing this with dplyr ?
A solution using dplyr and purrr. We can create a group ID using group_indices, split the data frame on that ID, compute the b_diff rows, and then combine the pieces. df5 is the final output.
library(dplyr)
library(purrr)
df2 <- df %>% mutate(GroupID = group_indices(., car, qtr))
df3 <- df2 %>%
filter(parameter %in% "b") %>%
group_by(car) %>%
mutate(val = val - lag(val), parameter = "b_diff") %>%
ungroup() %>%
split(f = .$GroupID)
df4 <- df2 %>% split(f = .$GroupID)
df5 <- map2_dfr(df4, df3, bind_rows) %>% select(-GroupID)
df5
# parameter car qtr val
# 1 a a3 FY18Q1 23.0
# 2 b a3 FY18Q1 10000.0
# 3 b_diff a3 FY18Q1 NA
# 4 a a3 FY18Q2 14.0
# 5 b a3 FY18Q2 12000.0
# 6 b_diff a3 FY18Q2 2000.0
# 7 a cla FY18Q1 15.0
# 8 b cla FY18Q1 12000.0
# 9 c cla FY18Q1 5.5
# 10 b_diff cla FY18Q1 NA
# 11 a cla FY18Q2 26.0
# 12 b cla FY18Q2 10000.0
# 13 c cla FY18Q2 6.2
# 14 b_diff cla FY18Q2 -2000.0
DATA
Notice that it is better to have stringsAsFactors = FALSE.
text <- "
parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2
"
df <- read.table(textConnection(text), sep = ",", header = TRUE, stringsAsFactors = FALSE)
Here is one algorithm:
Reshape the data to "wide" format, so that qtr and car form a unique row index, with the parameter column "spread" into columns
Within each car value, take the 1-period diff of the new parameter_b column
Reshape the data back to "long" format
Equivalent code, using reshape2 and dplyr:
library(dplyr)
library(reshape2)
# optional. you could just use `c(NA, diff(x))` below, but this is more general
padded_diff <- function(x, lag = 1L) {
c(rep.int(NA, lag), diff(x, lag = lag))
}
df %>%
dcast(car + qtr ~ parameter, value.var = "val") %>%
# take the diff within each car, so the first qtr of every car is NA
group_by(car) %>%
mutate(b_diff = padded_diff(b)) %>%
ungroup() %>%
as.data.frame() %>%
melt(id.vars = c("car", "qtr"), variable.name = "parameter") %>%
arrange(car, qtr, parameter)
Here is another algorithm:
Group the data frame by car
Within each group, temporarily filter so that only rows with parameter == "b" are present
Take the 1-period diff of the val column
Remove the filter and ungroup
Equivalent code, using only dplyr, using a temporary table to simulate a "removable" filter:
make_b_diff_within_group <- function(df) {
tmp <- df %>%
filter(parameter == "b") %>%
transmute(
car = car, # keep the group key so the new rows do not end up with NA in car
qtr = qtr,
val = padded_diff(val),
parameter = "b_diff")
bind_rows(df, tmp)
}
df %>%
group_by(car) %>%
do(make_b_diff_within_group(.)) %>%
ungroup() %>%
arrange(car, qtr, parameter)
This second algorithm could be implemented using several other "split-apply-combine" paradigms, including the tapply or by functions in base R, the ddply function in the plyr package (an ancestor of dplyr by the same author), and base R's split combined with dplyr and purrr, as shown in the answer above.
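As a rough illustration of the base R route, here is a sketch of the same split-apply-combine idea using split, lapply and rbind. The helper add_b_diff is a hypothetical name, and the sketch assumes that sorting qtr as text gives the right chronological order, which holds for FY18Q1 and FY18Q2:

```r
df <- read.csv(text = "parameter,car,qtr,val
a,a3,FY18Q1,23
b,a3,FY18Q1,10000
a,a3,FY18Q2,14
b,a3,FY18Q2,12000
a,cla,FY18Q1,15
b,cla,FY18Q1,12000
c,cla,FY18Q1,5.5
a,cla,FY18Q2,26
b,cla,FY18Q2,10000
c,cla,FY18Q2,6.2", stringsAsFactors = FALSE)

# Hypothetical helper: for one car, append the b_diff rows to the original rows
add_b_diff <- function(g) {
  b <- g[g$parameter == "b", ]
  if (nrow(b) == 0) return(g)      # nothing to diff for this car
  b <- b[order(b$qtr), ]           # assumes text order equals time order
  b$val <- c(NA, diff(b$val))      # first quarter has no predecessor
  b$parameter <- "b_diff"
  rbind(g, b)
}

out <- do.call(rbind, lapply(split(df, df$car), add_b_diff))
out <- out[order(out$car, out$qtr, out$parameter), ]
rownames(out) <- NULL
out
```

Sorting by parameter places b_diff between b and c, so the row order within a quarter differs slightly from the expected output in the question.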
I have a dataframe (df) with three columns like so:
Structure:
id id1 age
A1 a1 32
A1 a2 45
A1 a3 45
A1 a4 12
A2 b1 15
A2 b5 34
A2 b64 17
Expected Output:
id count count1
A1 4 1
A2 3 2
Logic:
Column "count" is the number of times "id" is repeated
Column "count1" is the number of rows where age is less than 21
Current Code:
library(dplyr)
df_summarized <- df %>%
group_by(id) %>%
summarise(count = n(),count1 = count(age<21))
Problem:
Error: no applicable method for 'group_by_' applied to an object of class "logical"
We need to use sum here:
df %>%
group_by(id) %>%
summarise(count = n(),count1 = sum(age < 21))
# A tibble: 2 × 3
# id count count1
# <chr> <int> <int>
#1 A1 4 1
#2 A2 3 2
This is because count applies to a data.frame or tbl_df, not to a single column inside summarise.
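The reason sum gives a count is that a comparison such as age < 21 produces a logical vector, and TRUE is treated as 1 when summed. A small base R sketch with the question's data (tapply is used here only for illustration):

```r
df <- data.frame(id = c("A1", "A1", "A1", "A1", "A2", "A2", "A2"),
                 id1 = c("a1", "a2", "a3", "a4", "b1", "b5", "b64"),
                 age = c(32, 45, 45, 12, 15, 34, 17),
                 stringsAsFactors = FALSE)

# Logical vector: TRUE where the condition holds
under_21 <- df$age < 21

# Summing a logical vector counts its TRUE values
total_under_21 <- sum(under_21)

# The same idea per group
count  <- tapply(df$age, df$id, length)
count1 <- tapply(under_21, df$id, sum)
```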
Or using data.table
library(data.table)
setDT(df)[, .(count = .N, count1 = sum(age < 21)), id]
Or with base R
cbind(count = rowSums(table(df[-2])), count1 = as.vector(rowsum(+(df$age < 21), df$id)))
# count count1
#A1 4 1
#A2 3 2
Or using aggregate based on the sum
do.call(data.frame, aggregate(age~id, df, FUN =
function(x) c(count = length(x), count1 = sum(x<21))))
NOTE: All the above methods return a dataset with proper columns. This matters especially for aggregate, whose output column is a matrix; that is why it is converted to proper columns with do.call(data.frame, ...).
With base R, we can use aggregate to find number of rows for each group (id) as well as number of rows with value less than 21
aggregate(age~id, df, function(x) c(count = length(x),
count1 = length(x[x < 21])))
# id age.count age.count1
#1 A1 4 1
#2 A2 3 2