How to relate two different dataframes to make calculations - r

I know how to work and computing math/statistics with one dataframe. But, what happens when I have to deal with two? For example:
> df1
supervisor salesperson
1 Supervisor1 Matt
2 Supervisor2 Amelia
3 Supervisor2 Philip
> df2
month channel Matt Amelia Philip
1 Jan Internet 10 50 20
2 Jan Cellphone 20 60 30
3 Feb Internet 40 40 30
4 Feb Cellphone 30 120 40
How can I compute the sales by supervisor grouped by channel in a efficient and generalizable way?. Is there any methodology or criteria when you need to relate two or more dataframes in order to compute the data you need?
PS: The number are the sales made by each sales person.

Here is the idea of converting to long and merging using tidyverse,
library(tidyverse)
df2 %>%
gather(salesperson, val, -c(1:2)) %>%
left_join(., df1, by = 'salesperson') %>%
spread(salesperson, val, fill = 0) %>%
group_by(channel, supervisor) %>%
summarise_at(vars(names(.)[4:6]), funs(sum))
which gives,
# A tibble: 4 x 5
# Groups: channel [?]
channel supervisor Amelia Matt Philip
<fct> <fct> <dbl> <dbl> <dbl>
1 Cellphone Supervisor1 0. 50. 0.
2 Cellphone Supervisor2 180. 0. 70.
3 Internet Supervisor1 0. 50. 0.
4 Internet Supervisor2 90. 0. 50.
NOTE: You can also add month in the group_by

Related

Removing matching observations where their adjacent column does not equal to 100

I have ~4000 observations in my data frame, test_11, and have pasted part of the data frame below:
data frame snippit
The k_hidp column represents matching households, the k_fihhmnnet1_dv column is their reported household income and the percentage_income_rounded reports each participant's income contribution to the total household income
I want to filter my data to remove all k_hidp observations where their collective income in the percentage_income_rounded does not equal 100.
So for example, the first household 68632420 reported a contribution of 83% (65+13) instead of the 100% as the other households report.
Is there any way to remove these household observations so I am only left with households with a collective income of 100%?
Thank you!
Try this:
## Creating the dataframe
df=data.frame(k_hidp = c(68632420,68632420,68632420,68632420,68632420,68632420,68632422,68632422,68632422,68632422,68632428,68632428),
percentage_income_rounded = c(65,18,86,14,49,51,25,25,25,25,50,50))
## Loading the libraries
library(dplyr)
## Aggregating and determining which household collective income is 100%
df1 = df %>%
group_by(k_hidp) %>%
mutate(TotalPercentage = sum(percentage_income_rounded)) %>%
filter(TotalPercentage == 100)
Output
> df1
# A tibble: 6 x 3
# Groups: k_hidp [2]
k_hidp percentage_income_rounded TotalPercentage
<dbl> <dbl> <dbl>
1 68632422 25 100
2 68632422 25 100
3 68632422 25 100
4 68632422 25 100
5 68632428 50 100
6 68632428 50 100

How to add rows to dataframe R with rbind

I know this is a classic question and there are also similar ones in the archive, but I feel like the answers did not really apply to this case. Basically I want to take one dataframe (covid cases in Berlin per district), calculate the sum of the columns and create a new dataframe with a column representing the name of the district and another one representing the total number. So I wrote
covid_bln <- read.csv('https://www.berlin.de/lageso/gesundheit/infektionsepidemiologie-infektionsschutz/corona/tabelle-bezirke-gesamtuebersicht/index.php/index/all.csv?q=', sep=';')
c_tot<-data.frame('district'=c(), 'number'=c())
for (n in colnames(covid_bln[3:14])){
x<-data.frame('district'=c(n), 'number'=c(sum(covid_bln$n)))
c_tot<-rbind(c_tot, x)
next
}
print(c_tot)
Which works properly with the names but returns only the number of cases for the 8th district, but for all the districts. If you have any suggestion, even involving the use of other functions, it would be great. Thank you
Here's a base R solution:
number <- colSums(covid_bln[3:14])
district <- names(covid_bln[3:14])
c_tot <- cbind.data.frame(district, number)
rownames(c_tot) <- NULL
# If you don't want rownames:
rownames(c_tot) <- NULL
This gives us:
district number
1 mitte 16030
2 friedrichshain_kreuzberg 10679
3 pankow 10849
4 charlottenburg_wilmersdorf 10664
5 spandau 9450
6 steglitz_zehlendorf 9218
7 tempelhof_schoeneberg 12624
8 neukoelln 14922
9 treptow_koepenick 6760
10 marzahn_hellersdorf 6960
11 lichtenberg 7601
12 reinickendorf 9752
I want to provide a solution using tidyverse.
The final result is ordered alphabetically by districts
c_tot <- covid_bln %>%
select( mitte:reinickendorf) %>%
gather(district, number, mitte:reinickendorf) %>%
group_by(district) %>%
summarise(number = sum(number))
The rusult is
# A tibble: 12 x 2
district number
* <chr> <int>
1 charlottenburg_wilmersdorf 10736
2 friedrichshain_kreuzberg 10698
3 lichtenberg 7644
4 marzahn_hellersdorf 7000
5 mitte 16064
6 neukoelln 14982
7 pankow 10885
8 reinickendorf 9784
9 spandau 9486
10 steglitz_zehlendorf 9236
11 tempelhof_schoeneberg 12656
12 treptow_koepenick 6788

Cumsum w/ panel data: different start dates

Trying to find the cumsum across different types of contracts. Each has a unique stop (i.e. delivery) date with several months of expected delivery leading up to that date. Needing to calculate the cumsum of all expected deliveries before the actual delivery date.
For some reason the cumsum/rollsum function is not working. I have tried both DT and dplyr versions but both have failed.
Here is a simplified data for the problem I am working on.
df <- data.frame(report_year = c(rep(2017,10), rep(2018,10)),
report_month = c(seq(1,5,1), seq(2,6,1), seq(3,7,1), seq(2,6,1)),
delivery_year = c(rep(2017,10), rep(2018,10)),
delivery_month = c(rep(5,5),rep(6,5), rep(7,5), rep(6,5)),
sum = c(rep(seq(100,500,100), 4)),
cumsum = c(rep(c(100,300,600,1000,1500),4)))
The first 5 columns is what I currently have.
I am trying to get the last column (i.e. cumsum)
I am probably doing something wrong. Any help is appreciated.
The question did not specifically define which grouping columns to use so this may have to be modified slightly depending on what you want but this does it without any packages:
df$cumsum <- NULL # remove the result from df shown in question
transform(df, cumsum = ave(sum, delivery_year, delivery_month, FUN = cumsum))
Note that although the above works you may run into some problems using sum and cumsum as the column names due to confusion with the functions of the same name so you might want to use Sum and Cumsum, say. For example if you don't null out cumsum as we did above then FUN = cumsum will think that you want to apply the cumsum column which is not a function.
Use arrange and mutate
# Import library
library(dplyr)
# Calculating cumsum
df %>%
group_by(delivery_year, delivery_month) %>%
arrange(sum) %>%
mutate(cs = cumsum(sum))
Output
report_year report_month delivery_year delivery_month sum cumsum cs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017 1 2017 5 100 100 100
2 2017 2 2017 6 100 100 100
3 2018 3 2018 7 100 100 100
4 2018 2 2018 6 100 100 100
5 2017 2 2017 5 200 300 300
6 2017 3 2017 6 200 300 300
7 2018 4 2018 7 200 300 300

Grouping within group in R, plyr/dplyr

I'm working on the baseball data set:
data(baseball, package="plyr")
library(dplyr)
baseball[,1:4] %>% head
id year stint team
4 ansonca01 1871 1 RC1
44 forceda01 1871 1 WS3
68 mathebo01 1871 1 FW1
99 startjo01 1871 1 NY2
102 suttoez01 1871 1 CL1
106 whitede01 1871 1 CL1
First I want to group the data set by team in order to find the first year each team appears, and the number of distinct players that has ever played for each team:
baseball[,1:4] %>% group_by(team) %>%
summarise("first_year"=min(year), "num_distinct_players"=n_distinct(id))
# A tibble: 132 × 3
team first_year num_distinct_players
<chr> <int> <int>
1 ALT 1884 1
2 ANA 1997 29
3 ARI 1998 43
4 ATL 1966 133
5 BAL 1954 158
Now I want to add a column showing the maximum number of years any player (id) has played for the team in question. To do this, I need to somehow group by player within the existing group (team), and select the maximum number of rows. How do I do this?
Perhaps this helps
baseball %>%
select(1:4) %>%
group_by(id, team) %>%
dplyr::mutate(nyear = n_distinct(year)) %>%
group_by(team) %>%
dplyr::summarise(first_year = min(year),
num_distinct_players = n_distinct(id),
maxYear = max(nyear))
I tried doing this with base R and came up with this. It's fairly slow.
df = data.frame(t(sapply(split(baseball, baseball$team), function(x)
cbind( min(x$year),
length(unique(x$id)),
max(sapply(split(x,x$id), function(y)
nrow(y))),
names(which.max(sapply(split(x,x$id), function(y)
nrow(y)))) ))))
colnames(df) = c("Year", "Unique Players", "Longest played duration",
"Longest Playing Player")
First, split by team into different groups
For each group, obtain the minimum year as first year when the team appears
Get length of unique ids which is the number of players in that team
Split each group into subgroup by id and obtain the maximum number of rows that will give the maximum duration played by a player in that team
For each subgroup, get names of the id with maximum rows which gives the name of the player that played for the longest time in that team

Convert data.frame wide to long while concatenating date formats

In R (or other language), I want to transform an upper data frame to lower one.
How can I do that?
Thank you beforehand.
year month income expense
2016 07 50 15
2016 08 30 75
month income_expense
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
Well, it seems that you are trying to do multiple operations in the same question: combine dates columns, melt your data, some colnames transformations and sorting
This will give your expected output:
library(tidyr); library(reshape2); library(dplyr)
df %>% unite("date", c(year, month)) %>%
mutate(expense=-expense) %>% melt(value.name="income_expense") %>%
select(-variable) %>% arrange(date)
#### date income_expense
#### 1 2016_07 50
#### 2 2016_07 -15
#### 3 2016_08 30
#### 4 2016_08 -75
I'm using three different libraries here, for better readability of the code. It might be possible to do it with base R, though.
Here's a solution using only two packages, dplyr and tidyr
First, your dataset:
df <- dplyr::data_frame(
year =2016,
month = c("07", "08"),
income = c(50,30),
expense = c(15, 75)
)
The mutate() function in dplyr creates/edits individual variables. The gather() function in tidyr will bring multiple variables/columns together in the way that you specify.
df <- df %>%
dplyr::mutate(
month = paste0(year, "-", month)
) %>%
tidyr::gather(
key = direction, #your name for the new column containing classification 'key'
value = income_expense, #your name for the new column containing values
income:expense #which columns you're acting on
) %>%
dplyr::mutate(income_expense =
ifelse(direction=='expense', -income_expense, income_expense)
)
The output has all the information you'd need (but we will clean it up in the last step)
> df
# A tibble: 4 × 4
year month direction income_expense
<dbl> <chr> <chr> <dbl>
1 2016 2016-07 income 50
2 2016 2016-08 income 30
3 2016 2016-07 expense -15
4 2016 2016-08 expense -75
Finally, we select() to drop columns we don't want, and then arrange it so that df shows the rows in the same order as you described in the question.
df <- df %>%
dplyr::select(-year, -direction) %>%
dplyr::arrange(month)
> df
# A tibble: 4 × 2
month income_expense
<chr> <dbl>
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
NB: I guess that I'm using three libraries, including magrittr for the pipe operator %>%. But, since the pipe operator is the best thing ever, I often forget to count magrittr.

Resources