Summarise rows in dataframe by two columns - r

I have this data frame called World, which shows the following:
City Year Income Tourist
London 2008 50 100
NY 2009 75 250
Paris 2010 45 340
Dubai 2008 32 240
London 2011 50 140
Abu Dhabi 2009 60 120
Paris 2009 70 140
NY 2007 50 150
Tokyo 2008 45 150
Dubai 2010 40 480
#With 207 more rows
I want to summarise the rows so that every city shows its total income and tourists across all the years. So I want code that matches the rows by City across Years and sums them, so that every city has just one row.
Something like this:
City Income Tourist
London 1051 5040
NY 1547 5432
Paris 2600 4321
Dubai 3222 5312
Abu Dhabi 3100 7654
Tokyo 2404 4321
#With 40 more rows
From the research I've done, n_distinct and group_by should be used.

Base R solution:
You can use the sapply() function to iterate over the cities:
The first argument is a vector of unique cities.
The function we write selects all the rows (years) of each city and returns the "Income" and "Tourist" columns.
The column values are summed with the colSums() function.
The output is transposed with the t() function.
t( sapply( unique( World$City ),function(CITY) colSums(World[World$City==CITY,c("Income","Tourist")] ) ) )
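As a quick check, here is a minimal sketch that runs the one-liner above on only the ten sample rows shown in the question, so the totals differ from the asker's illustrative output. The aggregate() call is an alternative that the answer does not mention; it is included only for comparison.

```r
# The ten sample rows from the question (the real data has more rows)
World <- data.frame(
  City    = c("London", "NY", "Paris", "Dubai", "London",
              "Abu Dhabi", "Paris", "NY", "Tokyo", "Dubai"),
  Year    = c(2008, 2009, 2010, 2008, 2011, 2009, 2009, 2007, 2008, 2010),
  Income  = c(50, 75, 45, 32, 50, 60, 70, 50, 45, 40),
  Tourist = c(100, 250, 340, 240, 140, 120, 140, 150, 150, 480),
  stringsAsFactors = FALSE
)

# sapply approach: one column of sums per city, then transpose
# so that each city becomes a row of the resulting matrix
totals <- t(sapply(unique(World$City), function(CITY)
  colSums(World[World$City == CITY, c("Income", "Tourist")])))

# Alternative (not part of the answer): the formula interface of aggregate()
# returns a data frame with one row per city
totals_df <- aggregate(cbind(Income, Tourist) ~ City, data = World, FUN = sum)

print(totals)
print(totals_df)
```

Note that the sapply version returns a matrix with the cities as row names, while aggregate() returns a data frame with City as a regular column.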
Solution with R's data.table package:
Make sure your object is a data.table (for example, convert it in place with setDT(World)).
In the j part of the bracket (the "do" part):
you can give names to the wanted columns ("Income="),
and specify the wanted output ("sum(Income)").
To group by city, add a by argument inside the brackets.
World[,.(Income=sum(Income),Tourist=sum(Tourist)),by=City]

Yes, you can use the group_by and summarise functions from dplyr.
library(dplyr)
World %>% group_by(City) %>% summarise(across(c(Income, Tourist), sum))
You can also add Year to the group_by call.
World %>% group_by(City, Year) %>% summarise(across(c(Income, Tourist), sum))

Related

Sum already existing rows into one row given specific years

This is my current data
Year Economy GDP
2000 US 26
2000 China 24
2000 Rest of the World 100
2001 US 25
2001 China 25
2001 Rest of the World 120
I want to add "China" to "Rest of the World" for each year. My final data should look like this. Thanks in advance for your assistance.
Year Economy GDP
2000 US 26
2000 Rest of the World 124
2001 US 25
2001 Rest of the World 145
We can group by 'Year' and, with case_when, replace the 'Economy' values that are not 'US' with "Rest of the World", then get the sum of 'GDP'.
library(dplyr)
df1 %>%
group_by(Year, Economy = case_when(Economy != "US" ~
"Rest of the World", TRUE ~ Economy)) %>%
summarise(GDP = sum(GDP, na.rm = TRUE))
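For readers without dplyr, the same relabel-then-sum idea can be sketched in base R. This is a swapped-in technique (ifelse() plus aggregate()), not the answerer's code. The data frame is rebuilt from the question's table and assumes Economy is stored as character rather than factor.

```r
# Data from the question, with Economy as a character column (assumption)
df1 <- data.frame(
  Year    = c(2000, 2000, 2000, 2001, 2001, 2001),
  Economy = c("US", "China", "Rest of the World",
              "US", "China", "Rest of the World"),
  GDP     = c(26, 24, 100, 25, 25, 120),
  stringsAsFactors = FALSE
)

# Relabel every non-US economy so China merges into "Rest of the World"
df1$Economy <- ifelse(df1$Economy != "US", "Rest of the World", df1$Economy)

# Sum GDP within each Year / Economy combination
result <- aggregate(GDP ~ Year + Economy, data = df1, FUN = sum)
result <- result[order(result$Year), ]

print(result)
```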

Conditional imputation of one variable using Dplyr

I have a dataset (main dataset) which looks like this:
id cleaning_fee boro zipcode price
1 NA Manhattan 10014 100
2 70 Manhattan 10013 125
3 NA Brooklyn 11201 97
4 25 Manhattan 10012 110
5 30 Staten Island 10305 60
Grouping by borough and zipcode, I get this (using na.rm = TRUE):
borough zipcode avgCleaningFee
Brooklyn 11217 88.32000
Brooklyn 11231 89.05085
Brooklyn 11234 42.50000
Manhattan 10003 97.03738
Manhattan 10011 109.97647
What I want to do is impute the NAs in the 'cleaning_fee' variable in my main dataset by either:
(a) imputing the grouped mean (as shown above in table 2 where I group on 2 conditions)
or
(b) use KNN regression on variables such as zipcode, boro and the price to impute the cleaning fee variable. (PS I understand how KNN regression works but I haven't used it, would be great if you can explain the code in 1 line or so)
Would be great if anyone can help me out with this. Thanks!!
We can use the first method (column names as in the main dataset):
library(dplyr)
df1 %>%
group_by(boro, zipcode) %>%
mutate(cleaning_fee = replace(cleaning_fee,
is.na(cleaning_fee), mean(cleaning_fee, na.rm = TRUE)))
Or with na.aggregate from zoo
library(zoo)
df1 %>%
group_by(boro, zipcode) %>%
mutate(cleaning_fee = na.aggregate(cleaning_fee))
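To see what grouped-mean imputation does, here is a minimal base R sketch using ave(), a swapped-in technique rather than the dplyr or zoo code above. The rows are hypothetical: the sample in the question has one listing per zipcode, so this toy data repeats zipcodes to give each group a non-missing value to average.

```r
# Hypothetical listings; zipcodes repeat so that group means exist
df1 <- data.frame(
  id           = 1:5,
  cleaning_fee = c(NA, 70, 90, NA, 40),
  boro         = c("Manhattan", "Manhattan", "Manhattan", "Brooklyn", "Brooklyn"),
  zipcode      = c(10014, 10014, 10014, 11201, 11201),
  stringsAsFactors = FALSE
)

# Replace each NA with the mean of the non-missing fees in its boro/zipcode group.
# A group with no observed fee at all would be filled with NaN.
df1$cleaning_fee <- ave(
  df1$cleaning_fee, df1$boro, df1$zipcode,
  FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
)

print(df1)
```

Observed fees are left unchanged; only the missing entries take the group mean.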

R: How to spread, group_by, summarise and mutate at the same time

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
spread(Year, sum_orders) %>%
mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
group_by(Country) %>%
arrange(Country, Year) %>%
mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1: You're likely getting the error 'duplicate identifiers for rows' because of spread. spread wants to make N columns from your N unique values, but it needs to know in which unique row to place those values. If you have duplicate value combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread cannot tell which row it should place the data in. The quick fix is to add a unique row identifier before spreading: data %>% mutate(row=row_number()) %>% spread(Year, Orders).
Error 2: You're likely getting the error 'sum not meaningful for factors' because of summarise_all. summarise_all operates on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`, na.rm = TRUE), `2015_Sum` = sum(`2015`, na.rm = TRUE)).
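The factor error is easy to reproduce in isolation. The sketch below uses a small hypothetical data frame (not the asker's data) to show the failing call and then a base R aggregate() call that sums only the numeric Orders column, which sidesteps the problem.

```r
# Hypothetical rows; CountryName is deliberately stored as a factor
data <- data.frame(
  CountryName = factor(c("UK", "UK", "US")),
  Year        = c(2014, 2015, 2014),
  Orders      = c(6, -1, 5)
)

# Summing a factor column fails; capture the message instead of stopping
err <- tryCatch(sum(data$CountryName), error = function(e) conditionMessage(e))
print(err)

# Summing only the numeric column within each country/year group works
sums <- aggregate(Orders ~ CountryName + Year, data = data, FUN = sum)
print(sums)
```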

Add columns to other columns

I would like to take two columns and add them to two other columns. For example, I have the data below:
EU.Member.States X. Other.countries..continued. X..1
Austria 122 Cameroon 203
Belgium 150 Canada 156
Denmark 179 Canary Islands 132
Finland 156 Cape Verde 147
France 130 Cayman Islands 213
How can I take the rows under "Other.countries..continued." and "X..1" and add them directly under "EU.Member.States" and "X." respectively?
I have tried using unite() from tidyr with no success.
Your question is almost identical to this one. Using the piping from dplyr package I can suggest a solution by first duplicating your column names, and then applying classic rbind. I used only the first 2 lines of your example:
df %>% setNames(names(df)[c(1,2,1,2)]) %>% {rbind(.[,1:2], .[,3:4])}
#### EU.Member.States X.
#### 1 Austria 122
#### 2 Belgium 150
#### 3 Cameroon 203
#### 4 Canada 156
Note: the curly braces are there to tell the pipe not to pass the . as an implicit first argument.
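The same stacking can be written without the pipe, which some readers may find easier to follow. This is a sketch in base R using only the first two rows of the example. The intermediate names left, right and stacked are made up for illustration.

```r
# First two rows of the example data
df <- data.frame(
  EU.Member.States            = c("Austria", "Belgium"),
  X.                          = c(122, 150),
  Other.countries..continued. = c("Cameroon", "Canada"),
  X..1                        = c(203, 156),
  stringsAsFactors = FALSE
)

# Split into the two column pairs and give the second pair the first pair's
# names, because rbind() on data frames requires matching column names
left    <- df[, 1:2]
right   <- setNames(df[, 3:4], names(left))
stacked <- rbind(left, right)

print(stacked)
```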

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the latest (most recent) year. In the first row of the output, John has the latest year (2016) but the balance from 2015, which is not what I need; it should output 100, not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I suggest using the dplyr library:
library(dplyr)
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
Year=c(2016,2015,2014,2016,2006,2006),
Balance=c(100,150,65,75,150,10)) %>% #create the dataframe
as_tibble() %>% #convert it to a tibble (tbl_df() is deprecated)
group_by(Name, Year) %>% #group it by Name and Year
summarise(maxBalance=max(Balance)) %>% # calculate the maximum for each group
group_by(Name) %>% # group the resulting dataframe by Name
top_n(1, Year) # keep the row for the most recent year in each group
Here is another solution without the data.table package.
First sort the data frame by descending Year and Balance,
df <- df[order(-df$Year, -df$Balance),]
then select the first row in each group with the same name:
df[!duplicated(df$Name),]
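Put together on the question's data, the sort-then-deduplicate approach can be checked as follows. This is a minimal sketch. The names sorted and latest are introduced here so the original df is left untouched.

```r
# Data from the question
df <- data.frame(
  Name    = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
  Year    = c(2016, 2015, 2014, 2016, 2006, 2006),
  Balance = c(100, 150, 65, 75, 150, 10),
  stringsAsFactors = FALSE
)

# Most recent year first; within a year, largest balance first
sorted <- df[order(-df$Year, -df$Balance), ]

# duplicated() flags every repeat of a Name, so negating it keeps
# only the first (latest year, highest balance) row per person
latest <- sorted[!duplicated(sorted$Name), ]

print(latest)
```

John's row is now 2016 with a balance of 100, as the question asks, and Kat's two 2006 rows resolve to the larger balance of 150.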
