SUMIF equivalent from one dataframe to another in R

I have two dataframes that look somewhat like this (they're much bigger)
df1 <- data.frame(center = c("5012","5012","5025"),
                  product = c("Apple","Grape","Apple"),
                  value = c(20,30,50))
df1:
  center product value
1   5012   Apple    20
2   5012   Grape    30
3   5025   Apple    50
df2 <- data.frame(center = c("5012","5012","5012","5012","5012","5025"),
                  profitCenter = c("A","B","C","D","A","A"),
                  product = c("Apple","Apple","Apple","Apple","Grape","Apple"),
                  volume = c(20,30,50,70,60,80))
df2:
  center profitCenter product volume
1   5012            A   Apple     20
2   5012            B   Apple     30
3   5012            C   Apple     50
4   5012            D   Apple     70
5   5012            A   Grape     60
6   5025            A   Apple     80
I want to get the sum of the volume column from df2, by center and product, into df1. In Excel I would use SUMIFS, but I'm struggling to work out how to do this properly in R:
Desired df1:
  center product value volume
1   5012  Apple     20    170
2   5012  Grape     30     60
3   5025  Apple     50     80
Currently I'm creating an aggregated version of DF2 (with dplyr's group_by) and then doing a left_join, but I have to do this a few more times and I'm sure there's a better way.

This is similar to what @MrFlick commented, but I would prefer to first summarise df2 and then add the resulting column to df1; the outcome and performance are the same, as @r2evans pointed out:
library(tidyverse)
df2 %>%
  group_by(center, product) %>%
  summarise(Volume = sum(volume)) %>%
  right_join(df1, by = c("center", "product"))
Or:
library(tidyverse)
df1 %>%
  left_join(
    df2 %>%
      group_by(center, product) %>%
      summarise(Volume = sum(volume)),
    by = c("center", "product")
  )
Or, to avoid nesting (thanks @r2evans again):
library(tidyverse)
df2 %>%
  group_by(center, product) %>%
  summarise(Volume = sum(volume)) %>%
  left_join(df1, ., by = c("center", "product"))
Output:
  center product value Volume
1   5012  Apple     20    170
2   5012  Grape     30     60
3   5025  Apple     50     80
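For completeness, the same SUMIF-style lookup can be sketched in base R with aggregate() plus merge(); this is not from the answers above, just an equivalent on the same data:

```r
# Base R sketch: sum volume by center and product, then merge into df1
df1 <- data.frame(center = c("5012","5012","5025"),
                  product = c("Apple","Grape","Apple"),
                  value = c(20,30,50))
df2 <- data.frame(center = c("5012","5012","5012","5012","5012","5025"),
                  profitCenter = c("A","B","C","D","A","A"),
                  product = c("Apple","Apple","Apple","Apple","Grape","Apple"),
                  volume = c(20,30,50,70,60,80))

agg <- aggregate(volume ~ center + product, data = df2, FUN = sum)
res <- merge(df1, agg, by = c("center", "product"), all.x = TRUE)  # keep all df1 rows
res
#   center product value volume
# 1   5012  Apple     20    170
# 2   5012  Grape     30     60
# 3   5025  Apple     50     80
```

all.x = TRUE makes this a left join, so df1 rows with no match in df2 would get NA rather than being dropped.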


Subset data based on condition in R [duplicate]

This question already has answers here:
Remove group from data.frame if at least one group member meets condition
(4 answers)
Closed 1 year ago.
I want to select those households where all members' ages are greater than 20 in R.
household Members_age
      100          75
      100          74
      100          30
      101          20
      101          50
      101          60
      102          35
      102          40
      102           5
Here two households satisfy the condition: household 100 and 101.
How can I do this in R?
What I did is the following, but it's not working:
sqldf("select household,Members_age from data group by household having Members_age > 20")
household Members_age
100 75
102 35
Please suggest. Here is the sample dataset:
library(dplyr)
library(sqldf)
data <- data.frame(household = c(100,100,100,101,101,101,102,102,102),
                   Members_age = c(75,74,30,20,50,60,35,40,5))
You can use ave.
data[ave(data$Members_age, data$household, FUN=min) > 20,]
# household Members_age
#1 100 75
#2 100 74
#3 100 30
Or, to get only the households:
unique(data$household[ave(data$Members_age, data$household, FUN=min) > 20])
#[1] 100
I understand SQL's HAVING clause, but your request "all members' ages are greater than 20" does not match your sqldf output. This is because HAVING here is effectively looking at only one row per household, which is why we see 102 (which we shouldn't, since one member is 5) and we don't see 101 (which indeed shouldn't appear, since one member is exactly 20, not greater).
To implement your logic, I suggest changing your sqldf code to the following:
sqldf("select household,Members_age from data group by household having min(Members_age) > 20")
# household Members_age
# 1 100 30
which is effectively the SQL analog of GKi's ave answer.
An alternative:
library(dplyr)
data %>%
  group_by(household) %>%
  filter(all(Members_age > 20)) %>%
  ungroup()
# # A tibble: 3 x 2
# household Members_age
# <dbl> <dbl>
# 1 100 75
# 2 100 74
# 3 100 30
and if you just need one row per household, then add %>% distinct(household) or perhaps %>% distinct(household, .keep_all = TRUE).
But for base R, I think nothing is likely to be better than GKi's use of ave.
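If you want the ave() call to mirror the "all members older than 20" wording directly rather than going through min(), here is a small variant (my own sketch, same result as GKi's):

```r
data <- data.frame(household = c(100,100,100,101,101,101,102,102,102),
                   Members_age = c(75,74,30,20,50,60,35,40,5))

# all() returns TRUE/FALSE, which ave() coerces to 1/0 and
# recycles across every row of the group
keep <- ave(data$Members_age, data$household,
            FUN = function(x) all(x > 20)) == 1
data[keep, ]
#   household Members_age
# 1       100          75
# 2       100          74
# 3       100          30
```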

Add row with group sum in new column at the end of group category

I have been searching this information since yesterday but so far I could not find a nice solution to my problem.
I have the following dataframe:
CODE CONCEPT  P.NR. NAME  DEPTO. PRICE
   1 Lunch       11 John  SALES    160
   1 Lunch       11 John  SALES    120
   1 Lunch       11 John  SALES     10
   1 Lunch       13 Frank IT       200
   2 Internet    13 Frank IT       120
and I want to add a column with the sum of rows by group, for instance, the total amount of concept: Lunch, code: 1 by name in order to get an output like this:
CODE CONCEPT  P.NR. NAME  DEPTO. PRICE TOTAL
   1 Lunch       11 John  SALES    160    NA
   1 Lunch       11 John  SALES    120    NA
   1 Lunch       11 John  SALES     10   290
   1 Lunch       13 Frank IT       200   200
   2 Internet    13 Frank IT       120   120
So far, I tried with:
aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
But this retrieves just the total of the concepts like this:
NAME CODE TOTAL
John 1 290
Frank 1 200
Frank 2 120
And not the table with the rest of the data as I would like to have it.
I also tried adding an extra column with NA but somehow I cannot paste the total in a specific row position.
Any suggestions? I would like to have something I can do in base R.
Thanks!
In base R you can use ave to add the new column. We insert the group sum only in the last row of each group:
df$TOTAL <- with(df, ave(PRICE, CODE, CONCEPT, PNR, NAME, FUN = function(x)
  ifelse(seq_along(x) == length(x), sum(x), NA)))
df
# CODE CONCEPT PNR NAME DEPTO. PRICE TOTAL
#1 1 Lunch 11 John SALES 160 NA
#2 1 Lunch 11 John SALES 120 NA
#3 1 Lunch 11 John SALES 10 290
#4 1 Lunch 13 Frank IT 200 200
#5 2 Internet 13 Frank IT 120 120
Similar logic using dplyr:
library(dplyr)
df %>%
  group_by(CODE, CONCEPT, PNR, NAME) %>%
  mutate(TOTAL = ifelse(row_number() == n(), sum(PRICE), NA))
For a base R option, you may try merging the original data frame with an aggregate of it:
df2 <- aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
out <- merge(df[ , !(names(df) %in% c("PRICE"))], df2, by=c("NAME", "CODE"))
out[with(out, order(CODE, NAME)), ]
NAME CODE CONCEPT PNR DEPT PRICE
1 Frank 1 Lunch 13 IT 200
3 John 1 Lunch 11 SALES 290
4 John 1 Lunch 11 SALES 290
5 John 1 Lunch 11 SALES 290
2 Frank 2 Internet 13 IT 120
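As an aside: if you instead want the group total repeated on every row rather than only on the last one, the same ave() call without the ifelse() does it in one line (my sketch, not part of the original question):

```r
df <- data.frame(CODE = c(1,1,1,1,2),
                 CONCEPT = c("Lunch","Lunch","Lunch","Lunch","Internet"),
                 PNR = c(11,11,11,13,13),
                 NAME = c("John","John","John","Frank","Frank"),
                 DEPTO = c("SALES","SALES","SALES","IT","IT"),
                 PRICE = c(160,120,10,200,120))

# group sum broadcast to every row of the group
df$TOTAL <- ave(df$PRICE, df$CODE, df$CONCEPT, df$PNR, df$NAME, FUN = sum)
df$TOTAL
# [1] 290 290 290 200 120
```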

How to create a Markdown table with different column lengths based on a dataframe in long format in R?

I'm working on an R Markdown file that I would like to submit as a manuscript to an academic journal. I would like to create a table that shows which three words (item2) co-occur most frequently with some keywords (item1). Note that some keywords have more than three co-occurring words. The data I am currently working with:
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
df <- data.frame(item1,item2,n)
Which gives this dataframe:
item1 item2 n
1 water tree 200
2 water dog 83
3 water cat 34
4 water fish 34
5 water eagle 34
6 sun bird 300
7 sun table 250
8 sun bed 77
9 sun flower 77
10 moon house 122
11 moon desk 46
12 moon tiger 46
Ultimately, I would like to pass the data to the function papaja::apa_table, which requires a data.frame (or a matrix / list). I therefore need to reshape the data.
My question:
How can I reshape the data (preferably with dplyr) to get the following structure?
water_item2 water_n sun_item2 sun_n moon_item2 moon_n
1 tree 200 bird 300 house 122
2 dog 83 table 250 desk 46
3 cat 34 bed 77 tiger 46
4 fish 34 flower 77 <NA> <NA>
5 eagle 34 <NA> <NA> <NA> <NA>
We can borrow an approach from an old answer of mine to a different question, and modify a classic gather(), unite(), spread() strategy by creating unique identifiers by group to avoid duplicate identifiers, then dropping that variable:
library(dplyr)
library(tidyr)

item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")

# Owing to Richard Telford's excellent comment,
# I use data_frame() (or equivalently for our purposes,
# data.frame(..., stringsAsFactors = FALSE))
# to avoid turning the strings into factors
df <- data_frame(item1, item2, n)

df %>%
  group_by(item1) %>%
  mutate(id = 1:n()) %>%
  ungroup() %>%
  gather(temp, val, item2, n) %>%
  unite(temp2, item1, temp, sep = '_') %>%
  spread(temp2, val) %>%
  select(-id)
# A tibble: 5 x 6
moon_item2 moon_n sun_item2 sun_n water_item2 water_n
<chr> <chr> <chr> <chr> <chr> <chr>
1 house 122 bird 300 tree 200
2 desk 46 table 250 dog 83
3 tiger 46 bed 77 cat 34
4 NA NA flower 77 fish 34
5 NA NA NA NA eagle 34
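gather()/unite()/spread() still work but have since been superseded; here is a sketch of the same reshape with tidyr's pivot_wider() (names_vary needs tidyr >= 1.2), where names_glue builds the keyword_item2 / keyword_n column names:

```r
library(dplyr)
library(tidyr)

item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
df <- tibble(item1, item2, n)

res <- df %>%
  group_by(item1) %>%
  mutate(id = row_number()) %>%  # unique row id within each keyword
  ungroup() %>%
  pivot_wider(id_cols = id,
              names_from = item1,
              values_from = c(item2, n),
              names_glue = "{item1}_{.value}",
              names_vary = "slowest") %>%  # keep each keyword's columns adjacent
  select(-id)
res
```

Unlike spread(), which sorted the keywords alphabetically (moon first), pivot_wider() keeps them in order of first appearance, so the columns come out water, sun, moon as in the target table.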

Create a Table with Alternating Total Rows Followed by Sub-Rows Using Dplyr and Tidyverse

library(dplyr)
library(forcats)
Using the simple dataframe and code below, I want to create a table with total rows and sub-rows. For example, the first row would be "Region1" from the NEW column and 70 from the TotNumber column, then below that would be three rows for "Town1", "Town2", and "Town3", and their associated numbers from the Number column, and the same for "Region2" and "Region3". I attached a pic of the desired table...
I'm also looking for a solution using dplyr and Tidyverse.
Number <- c(10,30,30,10,56,30,40,50,33,10)
Town <- c("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF <- data_frame(Town, Number)
DF <- DF %>% mutate_at(vars(Town), funs(as.factor))
To create the Region variable...
DF <- DF %>%
  mutate(NEW = fct_collapse(Town,
                            Region1 = c("Town1","Town2","Town3"),
                            Region2 = c("Town4","Town5","Town6"),
                            Region3 = c("Town7","Town8","Town9","Town10"))) %>%
  group_by(NEW) %>%
  summarise(TotNumber = sum(Number))
Modifying your last pipe and adding some additional steps:
library(dplyr)
library(forcats)
DF %>%
  mutate(NEW = fct_collapse(Town,
                            Region1 = c("Town1","Town2","Town3"),
                            Region2 = c("Town4","Town5","Town6"),
                            Region3 = c("Town7","Town8","Town9","Town10")),
         NEW = as.character(NEW)) %>%
  group_by(NEW) %>%
  mutate(TotNumber = sum(Number)) %>%
  ungroup() %>%
  split(.$NEW) %>%
  lapply(function(x) rbind(setNames(x[1, 3:4], names(x)[1:2]), x[1:2])) %>%
  do.call(rbind, .)
Results:
# A tibble: 13 × 2
Town Number
* <chr> <dbl>
1 Region1 70
2 Town1 10
3 Town2 30
4 Town3 30
5 Region2 96
6 Town4 10
7 Town5 56
8 Town6 30
9 Region3 133
10 Town7 40
11 Town8 50
12 Town9 33
13 Town10 10
Data:
Number <- c(10,30,30,10,56,30,40,50,33,10)
Town <- c("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF <- data_frame(Town, Number) %>%
  mutate_at(vars(Town), funs(as.factor))
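A more compact variant of the same split/rbind idea, using dplyr's group_modify() with tibble's add_row() to prepend each region's total row (my sketch; I keep Town as character rather than a factor so add_row() can insert the region name):

```r
library(dplyr)
library(forcats)
library(tibble)

Number <- c(10,30,30,10,56,30,40,50,33,10)
Town <- c("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF <- tibble(Town, Number)  # Town left as character on purpose

res <- DF %>%
  mutate(NEW = fct_collapse(Town,
                            Region1 = c("Town1","Town2","Town3"),
                            Region2 = c("Town4","Town5","Town6"),
                            Region3 = c("Town7","Town8","Town9","Town10"))) %>%
  group_by(NEW) %>%
  group_modify(~ add_row(.x,
                         Town = as.character(.y$NEW),  # total-row label
                         Number = sum(.x$Number),      # region total
                         .before = 1)) %>%
  ungroup() %>%
  select(Town, Number)
res
```

Inside group_modify(), .x holds the group's rows (without the grouping column) and .y holds the group key, so each region gets its total row stacked on top before the towns.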

Subsetting observations with grouping some features

I have a dataset like below:
date     time product shop_id
20140104  900 Banana       18
20140104  900 Banana       19
20140104  924 Banana       18
20140104  929 Banana       18
20140104  932 Banana       20
20140104  948 Banana       18
and I need to extract the observations with distinct product and shop_id combinations, so I need to group the observations by product + shop_id.
Here is my code:
library(plyr)
d_ply(shop, .(product, shop_id), table)
print(p)
Unfortunately, it prints NULL.
Dataset:
date = c(20140104, 20140104, 20140104, 20140104, 20140104)
time = c(924, 900, 854, 700, 1450)
product = c("Banana", "Banana", "Banana", "Banana", "Banana")
shop_id = c(18, 18, 18, 19, 20)
shop <- data.frame(date = date, time = time, product = product, shop_id = shop_id)
The output should be:
date     time product shop_id
20140104  900 Banana       19
20140104  932 Banana       20
20140104  948 Banana       18
We can do
library(tidyverse)
shop %>%
  group_by(product, shop_id) %>%
  mutate(n = n()) %>%
  group_by(time) %>%
  arrange(n) %>%
  slice(1) %>%
  group_by(product, shop_id) %>%
  arrange(-time) %>%
  slice(1) %>%
  select(-n) %>%
  arrange(time)
# date time product shop_id
# <int> <int> <chr> <int>
#1 20140104 900 Banana 19
#2 20140104 932 Banana 20
#3 20140104 948 Banana 18
In order to take only the first row of each unique combination, just use aggregate from the stats package:
> aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){x[1]})
Group.1 Group.2 date time product shop_id
1 Banana 18 20140104 924 Banana 18
2 Banana 19 20140104 700 Banana 19
3 Banana 20 20140104 1450 Banana 20
Explanation: my FUN = function(x){x[1]} takes only the first element in case of a collision.
To drop "Group.1", "Group.2" or other columns:
> res <- aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){x[1]})
> res[ , !(names(res) %in% c("Group.1", "Group.2"))]
date time product shop_id
1 20140104 924 Banana 18
2 20140104 700 Banana 19
3 20140104 1450 Banana 20
P.S. The dataset you provided is inconsistent with the output you asked for, which is why the numbers differ.
P.S.2 If you want to get all data in case of collision:
> aggregate(shop, by=list(shop$product, shop$shop_id), FUN="identity")
Group.1 Group.2 date time product shop_id
1 Banana 18 20140104, 20140104, 20140104 924, 900, 854 1, 1, 1 18, 18, 18
2 Banana 19 20140104 700 1 19
3 Banana 20 20140104 1450 1 20
If you want to mark collisions:
> aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){if (length(x) > 1) NA else x})
Group.1 Group.2 date time product shop_id
1 Banana 18 NA NA NA NA
2 Banana 19 20140104 700 1 19
3 Banana 20 20140104 1450 1 20
If you want to exclude non-unique rows:
> res <- aggregate(shop, by=list(shop$product, shop$shop_id), FUN=function(x){if (length(x) > 1) NULL else x})
> res[res$product != "NULL", !(names(res) %in% c("Group.1", "Group.2"))]
date time product shop_id
2 20140104 700 1 19
3 20140104 1450 1 20
If you want to avoid the coercion from string to integer (for product), use ""/"NULL"/"NA" instead of NULL/NA.
It can be done using dplyr as follows:
# create the sample dataset
date = c(20140104, 20140104, 20140104, 20140104, 20140104)
time = c(924, 900, 854, 700, 1450)
product = c("Banana", "Banana", "Banana", "Banana", "Banana")
shop_id = c(18, 18, 18, 19, 20)
shop <- data.frame(date = date, time = time, product = product, shop_id = shop_id)

# load the dplyr library
library(dplyr)

# take the shop data
shop %>%
  # group by product, shop id, date
  group_by(product, shop_id, date) %>%
  # for each such combination, find the earliest time
  summarise(time = min(time)) %>%
  # group by product, shop id
  group_by(product, shop_id) %>%
  # for each combination of product & shop id,
  # return the earliest date and the time recorded on that date
  summarise(date = min(date), time = time[date == min(date)])
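With dplyr >= 1.0 the two summarise steps above can be condensed with slice_min(), which directly keeps the earliest-time row per product/shop_id group (a sketch of the same logic; it matches here because all dates are equal):

```r
library(dplyr)

date <- c(20140104, 20140104, 20140104, 20140104, 20140104)
time <- c(924, 900, 854, 700, 1450)
product <- c("Banana", "Banana", "Banana", "Banana", "Banana")
shop_id <- c(18, 18, 18, 19, 20)
shop <- data.frame(date, time, product, shop_id)

res <- shop %>%
  group_by(product, shop_id) %>%
  slice_min(time, n = 1, with_ties = FALSE) %>%  # earliest time per group
  ungroup()
res
```

This keeps time 854 for shop 18, 700 for shop 19, and 1450 for shop 20, the same rows the summarise pipeline produces.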
