Count Observations Meeting Certain Criteria by Group in R

I need some help counting observations meeting certain criteria by group. I first want the number of employees by location as a column. Then I would like to retrieve the number of employees that have worked more than 40 hours (by location) and summarize that into a column. I assume there is an easy way to do it with dplyr or base R but I'm stumped. My data is below.
name hours_worked location
Bob 55 IL
Nick 25 IL
Sally 30 IL
Patricia 50 WI
Tim 35 WI
Liz 42 OH
Brad 60 OH
Sam 48 OH
Ideal output would be something like:
location headcount over_40
IL 3 1
WI 2 1
OH 3 3

We can do this with a grouped summary: group by 'location', use n() to get the number of rows for 'headcount', and sum the logical vector hours_worked > 40 to get 'over_40' (each TRUE counts as 1).
library(dplyr)
df1 %>%
group_by(location) %>%
summarise(headcount = n(), over_40 = sum(hours_worked > 40))
-output
# A tibble: 3 x 3
location headcount over_40
<chr> <int> <int>
1 IL 3 1
2 OH 3 3
3 WI 2 1
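Since the question also mentions base R, the same counts can be sketched without dplyr. This is only an illustration and not part of the original answer: the result name res is an assumption, and the data is re-created here so that the block runs on its own.

```r
# Example data from the question
df1 <- data.frame(
  name = c("Bob", "Nick", "Sally", "Patricia", "Tim", "Liz", "Brad", "Sam"),
  hours_worked = c(55L, 25L, 30L, 50L, 35L, 42L, 60L, 48L),
  location = c("IL", "IL", "IL", "WI", "WI", "OH", "OH", "OH")
)

# Number of rows per location
headcount <- table(df1$location)

# Summing a logical vector counts its TRUE values, here per location
over_40 <- tapply(df1$hours_worked > 40, df1$location, sum)

# Both results are ordered by the sorted location names
res <- data.frame(
  location  = names(headcount),
  headcount = as.integer(headcount),
  over_40   = as.integer(over_40[names(headcount)])
)
res
```

As with the dplyr result, the locations come back in alphabetical order rather than in the order of the input rows.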
data
df1 <- structure(list(name = c("Bob", "Nick", "Sally", "Patricia", "Tim",
"Liz", "Brad", "Sam"), hours_worked = c(55L, 25L, 30L, 50L, 35L,
42L, 60L, 48L), location = c("IL", "IL", "IL", "WI", "WI", "OH",
"OH", "OH")), class = "data.frame", row.names = c(NA, -8L))

Related

How do I sum up certain rows with different values but keep all rows on certain columns the same and then collapse this into one line? [duplicate]

For example I have the data frame:
firstname lastname season attempts yards weight
bob smith 2018 7 38 200
bob smith 2018 11 56 200
bob smith 2018 17 88 200
bob smith 2018 8 24 200
And I want to condense this into one line that reads:
firstname lastname season attempts yards weight
bob smith 2018 43 206 200
We can use aggregate from base R with the formula method. Combine the columns to sum with cbind on the left-hand side of ~; the . on the right-hand side stands for all the other columns, which are used as grouping variables. The aggregating function is sum.
aggregate(cbind(attempts, yards) ~ ., df1, sum)
-output
firstname lastname season weight attempts yards
1 bob smith 2018 200 43 206
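If relying on . feels too implicit, the grouping columns can also be spelled out on the right-hand side of the formula. The following is a sketch under that assumption (the result name res is introduced only for illustration); the data is re-created so the block is self-contained.

```r
# Example data from the question
df1 <- data.frame(
  firstname = rep("bob", 4),
  lastname  = rep("smith", 4),
  season    = rep(2018L, 4),
  attempts  = c(7L, 11L, 17L, 8L),
  yards     = c(38L, 56L, 88L, 24L),
  weight    = rep(200L, 4)
)

# Columns to sum on the left, explicit grouping columns on the right
res <- aggregate(
  cbind(attempts, yards) ~ firstname + lastname + season + weight,
  data = df1,
  FUN = sum
)

# aggregate puts the grouping columns first; restore the original order
res <- res[names(df1)]
res
```

Naming the grouping columns guards against a new column silently becoming part of the grouping if the data frame grows later.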
Or, in the tidyverse, group by every column other than 'attempts' and 'yards', take the sum across all remaining columns (everything()), and then restore the original column order with select
library(dplyr)
df1 %>%
group_by(across(-c(attempts, yards))) %>%
summarise(across(everything(), sum), .groups = 'drop') %>%
select(names(df1))
-output
# A tibble: 1 x 6
firstname lastname season attempts yards weight
<chr> <chr> <int> <int> <int> <int>
1 bob smith 2018 43 206 200
data
df1 <- structure(list(firstname = c("bob", "bob", "bob", "bob"),
lastname = c("smith",
"smith", "smith", "smith"), season = c(2018L, 2018L, 2018L, 2018L
), attempts = c(7L, 11L, 17L, 8L), yards = c(38L, 56L, 88L, 24L
), weight = c(200L, 200L, 200L, 200L)), class = "data.frame", row.names = c(NA,
-4L))

Calculate Growth Increase, grouping by Type (Using R)

I have a dataset, z, that I wish to calculate the growth increase by the type:
location size type date
ny 5 hello 10/01/2020
ny 7 ai 10/02/2020
ny 8 ai 10/03/2020
ny 6 hello 10/04/2020
ca 15 cool 10/05/2020
ca 10 name 10/06/2020
ca 5 name 10/07/2020
ca 16 cool 10/08/2020
Desired output
location type increase percent_increase start_date end_date
ca cool 1 6.67% 10/05/2020 10/08/2020
ca name -5 -50% 10/6/2020 10/7/2020
ny hello 1 20% 10/01/2020 10/4/2020
ny ai 1 14.28% 10/2/2020 10/3/2020
This is what I am doing:
library(tidyverse)
z %>%
group_by(type, location) %>%
mutate(percent_increase = (size/lead(size) - 1) * 100)
I am not getting my desired output. Any assistance is appreciated.
To get the results you want, you need a different calculation in your mutate line: subtract the current size from the next one (lead(size)) to get the increase, then divide that increase by the current size.
I also added a filter to remove any rows with NA for the percent_increase variable.
Finally, I added arrange to sort alphabetically by location, matching the order of your requested output.
CODE
z %>%
group_by(type, location) %>%
mutate(
increase = (lead(size) - size),
percent_increase = (increase/size) * 100,
start_date = date,
end_date = lead(date)) %>%
filter(!is.na(percent_increase)) %>%
arrange(location)
OUTPUT
# A tibble: 4 x 8
# Groups: type, location [4]
location size type date increase percent_increase start_date end_date
<chr> <int> <chr> <chr> <int> <dbl> <chr> <chr>
1 ca 15 cool 10/05/2020 1 6.67 10/05/2020 10/08/2020
2 ca 10 name 10/06/2020 -5 -50 10/06/2020 10/07/2020
3 ny 5 hello 10/01/2020 1 20 10/01/2020 10/04/2020
4 ny 7 ai 10/02/2020 1 14.3 10/02/2020 10/03/2020
INPUT
z <- structure(list(location = c("ny", "ny", "ny", "ny", "ca", "ca",
"ca", "ca"), size = c(5L, 7L, 8L, 6L, 15L, 10L, 5L, 16L), type = c("hello",
"ai", "ai", "hello", "cool", "name", "name", "cool"), date = c("10/01/2020",
"10/02/2020", "10/03/2020", "10/04/2020", "10/05/2020", "10/06/2020",
"10/07/2020", "10/08/2020")), class = "data.frame", row.names = c(NA,
-8L))
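The lead-based calculation compares each row with the next one, which equals the start-to-end growth here because every group has exactly two rows. As a hedged base R sketch, assuming the goal is the change from the earliest to the latest date within each location/type pair, the same table can be built as follows (the name res is hypothetical):

```r
# Example data from the question
z <- data.frame(
  location = c("ny", "ny", "ny", "ny", "ca", "ca", "ca", "ca"),
  size     = c(5L, 7L, 8L, 6L, 15L, 10L, 5L, 16L),
  type     = c("hello", "ai", "ai", "hello", "cool", "name", "name", "cool"),
  date     = c("10/01/2020", "10/02/2020", "10/03/2020", "10/04/2020",
               "10/05/2020", "10/06/2020", "10/07/2020", "10/08/2020")
)

# Parse the text dates so that sorting is chronological
z$date <- as.Date(z$date, format = "%m/%d/%Y")
z <- z[order(z$date), ]

# One summary row per location/type pair: first row vs. last row
groups <- split(z, list(z$location, z$type), drop = TRUE)
res <- do.call(rbind, lapply(groups, function(g) {
  first <- g[1, ]
  last  <- g[nrow(g), ]
  data.frame(
    location         = first$location,
    type             = first$type,
    increase         = last$size - first$size,
    percent_increase = (last$size - first$size) / first$size * 100,
    start_date       = first$date,
    end_date         = last$date
  )
}))
res
```

For groups with more than two rows this reports only the overall change, which may or may not be what is wanted.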
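You're missing an arrange call to order the rows by date before taking lead. The ratio also needs to be inverted (lead(size)/size) so that it measures growth from the earlier row to the later one,
like this:
z %>%
group_by(type, location) %>%
# date is stored as text, so parse it before sorting
arrange(as.Date(date, format = "%m/%d/%Y")) %>%
mutate(percent_increase = (lead(size)/size - 1) * 100)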

Dividing each row by the previous one in R

I have R dataframe:
city hour value
0 NY 0 12
1 NY 12 24
2 LA 0 3
3 LA 12 9
I want, for each city, to divide each row by the previous one and write the result into a new dataframe. The desired output is:
city ratio
NY 2
LA 3
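You can try aggregate as below. Note that x[-1]/x[1] divides every later value by the first value of the group, which is the same as dividing by the previous row only when each city has exactly two rows, already ordered by hour.
aggregate(value ~ city, df, function(x) x[-1]/x[1])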
which gives
city value
1 LA 3
2 NY 2
Data
> dput(df)
structure(list(city = c("NY", "NY", "LA", "LA"), hour = c(0L,
12L, 0L, 12L), value = c(12L, 24L, 3L, 9L)), class = "data.frame", row.names = c("0",
"1", "2", "3"))
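You can use lag to get the previous value, divide each value by its previous value within each city, and drop the NA rows. (From dplyr 1.1.0 onwards, summarise warns when it returns more than one row per group; reframe is the intended replacement there.)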
library(dplyr)
df %>%
arrange(city, hour) %>%
group_by(city) %>%
summarise(value = value/lag(value)) %>%
na.omit()
# city value
# <chr> <dbl>
#1 LA 3
#2 NY 2
In data.table we can do this via shift:
library(data.table)
setDT(df)[order(city, hour), value := value/shift(value), city]
na.omit(df)
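For completeness, the same previous-row division can be sketched in base R with ave, which applies a function within each group and returns a vector aligned with the original rows. This is an illustration rather than one of the original answers; the column name ratio follows the desired output and res is a hypothetical name.

```r
# Example data from the question (value stored as numeric)
df <- data.frame(
  city  = c("NY", "NY", "LA", "LA"),
  hour  = c(0L, 12L, 0L, 12L),
  value = c(12, 24, 3, 9)
)

# Make sure rows are in time order within each city
df <- df[order(df$city, df$hour), ]

# Within each city, divide each value by the one before it;
# the first row of a city has no predecessor, so it becomes NA
df$ratio <- ave(df$value, df$city,
                FUN = function(x) x / c(NA, head(x, -1)))

# Keep only the rows that have a predecessor
res <- df[!is.na(df$ratio), c("city", "ratio")]
res
```

Unlike x[-1]/x[1], this divides by the immediately preceding row, so it also handles cities with more than two rows.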

How to append 2 data sets one below the other having slightly different column names?

Data set1:
ID Name Territory Sales
1 Richard NY 59
8 Sam California 44
Data set2:
Terr ID Name Comments
LA 5 Rick yes
MH 11 Oly no
I want the final data set to have only the columns of the 1st data set, to treat Terr as the same column as Territory, and not to carry over the Comments column.
Final data should look like:
ID Name Territory Sales
1 Richard NY 59
8 Sam California 44
5 Rick LA NA
11 Oly MH NA
Thanks in advance
A possible solution:
# create a named vector with names from 'set2'
# with the positions of the matching columns in 'set1'
nms2 <- sort(unlist(sapply(names(set2), agrep, x = names(set1))))
# only keep the columns in 'set2' for which a match is found
# and give them the same names as in 'set1'
set2 <- setNames(set2[names(nms2)], names(set1[nms2]))
# bind the two dataset together
# option 1:
library(dplyr)
bind_rows(set1, set2)
# option 2:
library(data.table)
rbindlist(list(set1, set2), fill = TRUE)
which gives (dplyr-output shown):
ID Name Territory Sales
1 1 Richard NY 59
2 8 Sam California 44
3 5 Rick LA NA
4 11 Oly MH NA
Used data:
set1 <- structure(list(ID = c(1L, 8L),
Name = c("Richard", "Sam"),
Territory = c("NY", "California"),
Sales = c(59L, 44L)),
.Names = c("ID", "Name", "Territory", "Sales"), class = "data.frame", row.names = c(NA, -2L))
set2 <- structure(list(Terr = c("LA", "MH"),
ID = c(5L, 11L),
Name = c("Rick", "Oly"),
Comments = c("yes", "no")),
.Names = c("Terr", "ID", "Name", "Comments"), class = "data.frame", row.names = c(NA, -2L))
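Fuzzy matching with agrep is convenient, but it can pair the wrong columns when several names are similar. As an alternative sketch, assuming the correspondence between column names is known in advance, the mapping can be written out explicitly and the rest done in base R. The names rename_map, hits and res are hypothetical.

```r
# Example data from the question
set1 <- data.frame(ID = c(1L, 8L), Name = c("Richard", "Sam"),
                   Territory = c("NY", "California"), Sales = c(59L, 44L),
                   stringsAsFactors = FALSE)
set2 <- data.frame(Terr = c("LA", "MH"), ID = c(5L, 11L),
                   Name = c("Rick", "Oly"), Comments = c("yes", "no"),
                   stringsAsFactors = FALSE)

# Explicit mapping: name in set2 -> name in set1 (assumed to be known)
rename_map <- c(Terr = "Territory")
hits <- names(set2) %in% names(rename_map)
names(set2)[hits] <- unname(rename_map[names(set2)[hits]])

# Keep only the columns that also exist in set1 (drops Comments)
set2 <- set2[intersect(names(set1), names(set2))]

# Add any columns of set1 that set2 lacks, filled with NA (here: Sales)
for (col in setdiff(names(set1), names(set2))) set2[[col]] <- NA

# Stack the two data sets in the column order of set1
res <- rbind(set1, set2[names(set1)])
res
```

The explicit map costs one extra line per renamed column but makes the pairing visible and reviewable.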

Merging two Dataframes in R by ID, One is the subset of the other

I have 2 dataframes in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 dataframes are matched by a primary key ('pid'). dfnew is a subset of dfold, so all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). At the same time, dfold has more variables, and I will need them in the analysis phase. I would like to merge the 2 dataframes into dfmerge so as to update the common variables from dfnew --> dfold while retaining the pre-existing variables in dfold. I have tried merge(), match(), dplyr, and the sqldf package, but I obtain either a dfmerge with the updated 75 variables only (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I have found is an elegant but fairly long (10 lines) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one is available?
Thank you in advance
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the dataframes:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update the NA values within your dfold with the imputed values in dfnew. The example data below is a slightly modified version of yours.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
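To do this for a single column, look up each pid of dfold in dfnew with match, so that the two dataframes need not have the same number or order of rows:
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[match(dfold$pid, dfnew$pid)], dfold$Salary)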
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse, again aligning the rows by pid with match:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[match(dfold$pid, dfnew$pid)], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, restrict cols to the names the two dataframes share, excluding the key:
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
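If the aim is to overwrite every common variable with the value from dfnew (not only the NAs), the update can be sketched as a direct assignment aligned by pid. This is an illustration under stated assumptions: the toy data is modelled on the question's but with extra NAs added so the effect is visible, every pid in dfnew is assumed to exist in dfold, and the names idx and dfmerge are chosen for the example.

```r
# Toy data modelled on the question (extra NAs added for illustration)
dfold <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany"),
  Age       = c(44, NA, 30, 38, 40),
  Salary    = c(NA, 48000, 54000, 61000, NA),
  Purchased = c("No", "Yes", "No", "No", "Yes"),
  pid       = 1:5
)
dfnew <- data.frame(
  Age    = c(44, 27, 30),
  Salary = c(72000, 48000, 54000),
  pid    = c(1, 2, 3)
)

# Common columns, excluding the key
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")

# Row of dfold that corresponds to each row of dfnew
idx <- match(dfnew$pid, dfold$pid)
stopifnot(!anyNA(idx))  # assumption: every pid in dfnew is in dfold

# Copy dfold and overwrite the matched rows of the common columns
dfmerge <- dfold
dfmerge[idx, cols] <- dfnew[cols]
dfmerge
```

Rows of dfold without a counterpart in dfnew (pid 4 and 5 here) are left untouched, so their NAs remain, and all columns that exist only in dfold are retained.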
