Finding the max per grouped variable and converting it into new variables [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I have the following data set and I would like to identify the product with the highest amount per customer_ID and convert it into a new column. I also want to keep only one record per ID.
Data to generate the data set:
x <- data.frame(customer_id = c(1,1,1,2,2,2),
                product = c("a","b","c","a","b","c"),
                amount = c(50,125,100,75,110,150))
The actual data set looks like this:
customer_id product amount
1 a 50
1 b 125
1 c 100
2 a 75
2 b 110
2 c 150
The desired output should look like this:
customer_ID product_b product_c
1 125 0
2 0 150

We can do this with the tidyverse. After grouping by 'customer_id', slice the row that has the maximum 'amount', paste the prefix ('product_') onto the 'product' column (if needed), and spread to wide format:
library(dplyr)
library(tidyr)
x %>%
  group_by(customer_id) %>%
  slice(which.max(amount)) %>%
  mutate(product = paste0("product_", product)) %>%
  spread(product, amount, fill = 0)
# customer_id product_b product_c
#* <dbl> <dbl> <dbl>
#1 1 125 0
#2 2 0 150
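Note: in current tidyr, spread() is superseded by pivot_wider(). A sketch of the equivalent (assuming tidyr >= 1.0.0):
x %>%
  group_by(customer_id) %>%
  slice(which.max(amount)) %>%
  ungroup() %>%  # pivot_wider doesn't need the grouping
  mutate(product = paste0("product_", product)) %>%
  pivot_wider(names_from = product, values_from = amount, values_fill = 0)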
Another option is to arrange the dataset by 'customer_id' and 'amount' in descending order, get the distinct rows based on 'customer_id' and spread to wide format:
arrange(x, customer_id, desc(amount)) %>%
  distinct(customer_id, .keep_all = TRUE) %>%
  spread(product, amount, fill = 0)

Using the reshape2 package:
library(reshape2)
x1 <- x[!!with(x, ave(amount, customer_id, FUN = function(i) i == max(i))), ]  # keep rows where amount equals the group max
dcast(x1, customer_id ~ product, value.var = 'amount', fill = 0)
# customer_id b c
#1 1 125 0
#2 2 0 150
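For completeness, the same logic ports to data.table (a sketch, assuming the package is installed):
library(data.table)
x1 <- as.data.table(x)
x1 <- x1[, .SD[which.max(amount)], by = customer_id]  # row with the max amount per customer
x1[, product := paste0("product_", product)]
dcast(x1, customer_id ~ product, value.var = "amount", fill = 0)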

Join with closest value between two values in R

I was working on the following problem. I've got monthly data from a survey; let's call it df1:
df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5))
ID reported_value anchor_month
1 1200 3
2 31000 5
So the first row was reported in March, but there's no way to know whether it reports March or February values, and it may also be an approximation of the real value. I've also got a table with actual values for each ID; let's call it df2:
df2 = tibble(ID = c('1', '2') %>% rep(4) %>% sort,
             real_value = c(1200,1230,11000,10,25000,3100,100,31030),
             month = c(1,2,3,4,2,3,4,5))
ID real_value month
1 1200 1
1 1230 2
1 11000 3
1 10 4
2 25000 2
2 3100 3
2 100 4
2 31030 5
So there are two challenges: first, I only care about the anchor month OR the month before the anchor month for each ID; second, I want to match to the closest value (sounds like a fuzzy join). My first step was to filter the second table so that it only keeps the anchor month or the previous one, which I did as follows:
filter_aux = df1 %>%
  bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month - 1)))
df2 = df2 %>%
  inner_join(filter_aux, by = c('ID', 'month' = 'anchor_month')) %>%
  distinct(ID, month, .keep_all = TRUE)
Reducing df2 to:
ID real_value month
1 1230 2
1 11000 3
2 100 4
2 31030 5
Now I tried to do a difference_inner_join by ID and reported_value = real_value (df1 %>% difference_inner_join(df2, by = c('ID', 'reported_value' = 'real_value'))), but it throws a "non-numeric argument to binary operator" error, I'm guessing because ID is a string in my actual data. What gives? I'm no expert in fuzzy joins, so I guess I'm missing something.
My final dataframe would look like this:
ID reported_value anchor_month closest_value month
1 1200 3 1230 2
2 31000 5 31030 5
Thanks!
It was easier without fuzzy_join:
df3 = df1 %>%
  left_join(df2, by = 'ID') %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  filter(dif == min(dif))
Output:
ID reported_value anchor_month real_value month dif
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1200 3 1230 2 30
2 2 31000 5 31030 5 30
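As for the error itself: difference_inner_join() applies abs(x - y) to every pair of by columns, so the character ID column is what triggers "non-numeric argument to binary operator". If a fuzzy join is still wanted, fuzzy_inner_join() accepts one match function per column; a sketch with exact matching on ID and an assumed tolerance of 100 on the values:
library(dplyr)
library(fuzzyjoin)
df1 %>%
  fuzzy_inner_join(df2,
                   by = c("ID" = "ID", "reported_value" = "real_value"),
                   match_fun = list(`==`, function(x, y) abs(x - y) <= 100))  # the tolerance is an assumption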

Identifying values from one database to use in another database

I am working on a project in which I need to work with two databases, identifying values in one database to use in the other.
I have a first data frame, df1:
df1 <- data.frame("ID" = c(1,2,3),
                  "Condition A" = c("B","B","A"),
                  "Condition B" = c("1","1","2"),
                  "Year" = c(2002,1988,1995))
and a second data frame, df2:
df2 <- data.frame("Condition A" = c("A","A","B","B"),
                  "Condiction B" = c("1","2","1","2"),
                  "<1990" = c(20,30,50,80),
                  "1990-2000" = c(100,90,80,30),
                  ">2000" = c(300,200,800,400))
I would like to add a new column to df1 called "Value" that, for each ID in df1, collects the value from column 3, 4 or 5 of df2 (depending on the Year), matching on conditions A and B, which are available in both databases. The end result would be something like this:
df1 <- data.frame("ID" = c(1,2,3),
                  "Condition A" = c("B","B","A"),
                  "Condition B" = c("1","1","2"),
                  "Year" = c(2002,1988,1995),
                  "Value" = c(800,50,90))
thanks!
I think we can simply left_join, then mutate with case_when, then drop the undesired columns with select:
library(dplyr)
left_join(df1, df2, by = c("Condition.A", "Condition.B")) %>%
  mutate(Value = case_when(Year < 1990 ~ X.1990,
                           Year < 2000 ~ X1990.2000,
                           Year >= 2000 ~ X.2000)) %>%
  select(-starts_with("X"))
ID Condition.A Condition.B Year Value
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
EDIT: I edited your code, removing the "Condiction" typo
You could use
library(dplyr)
library(tidyr)
df2 %>%
  rename(Condition.B = Condiction.B) %>%
  pivot_longer(matches("\\d{4}")) %>%
  right_join(df1, by = c("Condition.A", "Condition.B")) %>%
  filter(name == case_when(Year < 1990 ~ "X.1990",
                           Year > 2000 ~ "X.2000",
                           TRUE ~ "X1990.2000")) %>%
  select(ID, Condition.A, Condition.B, Year, Value = value) %>%
  arrange(ID)
This returns
# A tibble: 3 x 5
ID Condition.A Condition.B Year Value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
First we rename the misspelled column Condiction.B of df2 and bring the data into long format based on the "<1990", "1990-2000", ">2000" columns. Note that a data.frame can't keep those names; with the default check.names = TRUE they are automatically renamed to X.1990, X1990.2000 and X.2000.
Next we use a right join with df1 on the two Condition columns.
Finally we filter just the matching years based on a hard-coded case_when and do some cleanup (selecting and arranging).
We could do it this way:
"Condiction" must be a typo, so I changed it to Condition.
In df1, create a helper column that assigns each Year to the group that is a column name in df2.
Bring df2 into long format.
Finally, apply left_join with by = c("Condition.A", "Condition.B", "helper" = "name").
library(dplyr)
library(tidyr)
df1 <- df1 %>%
  mutate(helper = case_when(Year >= 1990 & Year <= 2000 ~ "X1990.2000",
                            Year < 1990 ~ "X.1990",
                            Year > 2000 ~ "X.2000"))
df2 <- df2 %>%
  pivot_longer(cols = starts_with("X"))
df3 <- left_join(df1, df2, by = c("Condition.A", "Condition.B", "helper" = "name")) %>%
  select(-helper)
ID Condition.A Condition.B Year value
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
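As a variant on the same idea, the year bucket can be derived with cut() instead of a hand-written case_when (a sketch, assuming the Condiction typo is fixed as above):
library(dplyr)
library(tidyr)
df2 %>%
  rename(Condition.B = Condiction.B) %>%
  pivot_longer(starts_with("X")) %>%
  right_join(df1 %>%
               mutate(name = as.character(cut(Year,  # map each Year to its bucket's column name
                                              breaks = c(-Inf, 1989, 2000, Inf),
                                              labels = c("X.1990", "X1990.2000", "X.2000")))),
             by = c("Condition.A", "Condition.B", "name")) %>%
  select(ID, Condition.A, Condition.B, Year, Value = value)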

Joining data in R by first row, then second and so on

I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use a plain left join because the first (left) file needs to stay as it is: I don't want all combinations returned, which would add rows. But I also don't want it to behave like VLOOKUP in Excel, which finds the first match and returns it, so that duplicate ID numbers only ever get the first match. I need it to return the first match, then the second, then the third (the dates are sorted so that the newest date is always first for every ID number) and so on, BUT without added rows. Is there any way to do this? Not sure if I made myself clear, but thank you in advance!
You can add a second column with sub-ids that follow the row order within each ID. Then you can use an inner_join to join everything together.
Since you don't have example data sets I created two to show the principle.
df1 <- df1 %>%
  group_by(ID) %>%
  mutate(follow_id = row_number())
df2 <- df2 %>%
  group_by(ID) %>%
  mutate(follow_id = row_number())
outcome <- df1 %>% inner_join(df2)  # joins on both ID and follow_id
# A tibble: 7 x 3
# Groups: ID [?]
ID follow_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2, 3, 4, 4, 4))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4),
                  var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an auto-incrementing id within each group, then join as usual:
library(dplyr)
df1 <- data.frame(id = c(1, 1, 2, 3, 4, 4, 4))
d1 <- sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by = "day"), 11)
df2 <- data.frame(id = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4), d1, d2 = d1 + sample.int(50, 11))
df11 <- df1 %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  ungroup()
df21 <- df2 %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  ungroup()
left_join(df11, df21, by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05

group by in R dplyr for more than one variable on unique value of other variable

I have a dataset with three columns as below:
data <- data.frame(
  grpA = c(1,1,1,1,1,2,2,2),
  idB = c(1,1,2,2,3,4,5,6),
  valueC = c(10,10,20,20,10,30,40,50),
  otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>% mutate(newCol = sum(valueC)), I get newCol <- c(70,70,70,70,70,120,120,120).
How do I include only unique values of idB? Is there anything else I can use instead of group_by in the dplyr %>% pipe?
I can't use summarise, as I need to keep the values in otherD intact for later use.
Another option would be to create newCol separately through SQL and then merge with a left join, but I am looking for a better inline solution.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.
We can take just the first 'valueC' per 'idB' by combining unique with match:
data %>%
  group_by(grpA) %>%
  mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Or another option is to get the distinct rows by 'grpA' and 'idB', group by 'grpA', take the sum of 'valueC', and left_join with the original data:
data %>%
  distinct(grpA, idB, .keep_all = TRUE) %>%
  group_by(grpA) %>%
  summarise(newCol = sum(valueC)) %>%
  left_join(data, ., by = 'grpA')
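A third variant of the same idea takes the first valueC per idB with !duplicated() instead of match()/unique() (a minimal sketch):
library(dplyr)
data %>%
  group_by(grpA) %>%
  mutate(newCol = sum(valueC[!duplicated(idB)])) %>%  # count each idB's value once per group
  ungroup()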

Setting column value of a subset of rows in a dataframe in R [duplicate]

This question already has answers here:
How can I rank observations in-group faster?
(4 answers)
Closed 5 years ago.
I have a dataframe df with a column called ID.
Multiple rows may have the same ID and I want to set a column value "occurrence" to indicate how many times the ID has been seen before.
for (i in unique(df$ID)) {
  rows = df[df$ID == i, ]
  for (idx in 1:nrow(rows)) {
    rows[idx, 'occurrence'] = idx
  }
}
Unfortunately, this adds the occurrence column to rows, but it does not update the original data frame. How do I get the occurrence column added to df?
Update: The row_number() function pointed out by neilfws works great. Actually, I have a follow-up question: the data frame also has a Year column, and what I need to do is add a new column (say Prev.Year.For.This.ID) holding the Year of the previous occurrence of the ID. E.g. if the input is
Year = c(1991, 1992, 1993, 1994, 1995)
ID = c(1, 2, 1, 2, 1)
df <- data.frame(Year, ID)
I'd like the output to look like this:
ID Year occurrence Prev.Year.For.This.Id
1 1991 1 <NA>
2 1992 1 <NA>
1 1993 2 1991
2 1994 2 1992
1 1995 3 1993
You can use dplyr to group_by ID, then row_number gives the running total of occurrences.
library(dplyr)
df1 <- data.frame(ID = c(1,2,3,1,4,5,6,2,7,8,2))
df1 %>%
  group_by(ID) %>%
  mutate(cnt = row_number()) %>%
  ungroup()
ID cnt
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 1 2
5 4 1
6 5 1
7 6 1
8 2 2
9 7 1
10 8 1
11 2 3
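The follow-up about Prev.Year.For.This.ID fits the same pattern: within each ID group, lag() pulls the Year of the previous occurrence (a sketch using the example data from the updated question):
library(dplyr)
Year <- c(1991, 1992, 1993, 1994, 1995)
ID <- c(1, 2, 1, 2, 1)
df <- data.frame(Year, ID)
df %>%
  group_by(ID) %>%
  mutate(occurrence = row_number(),
         Prev.Year.For.This.Id = lag(Year)) %>%  # NA for the first occurrence
  ungroup()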
Are you after something like the following (I made up sample data for you):
library(dplyr)
df = data.frame(ID = c(1, 1, 1, 2, 2, 3))
answer = df %>%
  group_by(ID) %>%
  mutate(occurrence = cumsum(ID / ID) - 1) %>%  # ID/ID is always 1, so this is a 0-based running count
  as.data.frame
This will give something which looks like this:
ID occurrence
1 0
1 1
1 2
2 0
2 1
3 0
The dplyr package is a great tool for grouping and summarising data. I also find the code very readable when I use the pipe %>% (though, admittedly, it does take some getting used to).
library(data.table)
df = data.frame(ID = c(1, 1, 1, 2, 2, 3))
df <- data.table(df)
df[, occurrence := sequence(.N), by = c("ID")]
df
ID occurrence
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 3 1
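data.table also ships rowid(), which collapses the grouped sequence() call into one step (a sketch, assuming data.table >= 1.9.8):
library(data.table)
df <- data.table(ID = c(1, 1, 1, 2, 2, 3))
df[, occurrence := rowid(ID)]  # per-ID running count in one call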
