R: sum row based on several conditions

I am working on my thesis with little knowledge of R, so the answer to this question may be pretty obvious.
I have a dataset looking like this:
county<-c('1001','1001','1001','1202','1202','1303','1303')
naics<-c('423620','423630','423720','423620','423720','423550','423720')
employment<-c(5,6,5,5,5,6,5)
data<-data.frame(county,naics,employment)
For every county, I want to sum the value of employment of rows with naics '423620' and '423720'. (So two conditions: 1. same county code 2. those two naics codes) The row in which they are added should be the first one ('423620'), and the second one ('423720') should be removed
The final dataset should look like this:
county2<-c('1001','1001','1202','1303','1303')
naics2<-c('423620','423630','423620','423550','423720')
employment2<-c(10,6,10,6,5)
data2<-data.frame(county2,naics2,employment2)
I have tried to do it myself with aggregate and rowSum, but because of the two conditions, I have failed thus far. Thank you very much.

We can do
library(dplyr)
data$naics <- as.character(data$naics)
data %>%
filter(naics %in% c(423620, 423720)) %>% group_by(county) %>%
summarise(naics = "423620", employment = sum(employment)) %>%
bind_rows(., filter(data, !naics %in% c(423620, 423720)))
# A tibble: 5 x 3
# county naics employment
# <fctr> <chr> <dbl>
#1 1001 423620 10
#2 1202 423620 10
#3 1303 423620 5
#4 1001 423630 6
#5 1303 423550 6

With such a condition, I'd first write a small helper and then pass it on to dplyr mutate:
# replace 423720 by 423620 only if both exist
onlyThoseNAICS <- function(v){
if( ("423620" %in% v) & ("423720" %in% v) ) v[v == "423720"] <- "423620"
v
}
data %>%
dplyr::group_by(county) %>%
dplyr::mutate(naics = onlyThoseNAICS(naics)) %>%
dplyr::group_by(county, naics) %>%
dplyr::summarise(employment = sum(employment)) %>%
dplyr::ungroup()
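For readers without dplyr, the same conditional merge can be sketched in base R. This is only an illustration on the question's data: the names has_both, relabel and result are made up here, and naics is assumed to be stored as character.

```r
# Data from the question (naics kept as character)
data <- data.frame(
  county = c("1001", "1001", "1001", "1202", "1202", "1303", "1303"),
  naics = c("423620", "423630", "423720", "423620", "423720", "423550", "423720"),
  employment = c(5, 6, 5, 5, 5, 6, 5),
  stringsAsFactors = FALSE
)

# TRUE for counties that contain BOTH codes (hypothetical helper name)
has_both <- tapply(data$naics, data$county,
                   function(v) all(c("423620", "423720") %in% v))

# Relabel 423720 as 423620 only inside those counties
relabel <- as.vector(has_both[data$county]) & data$naics == "423720"
data$naics[relabel] <- "423620"

# Sum employment per county and (relabelled) naics code
result <- aggregate(employment ~ county + naics, data = data, FUN = sum)
result <- result[order(result$county, result$naics), ]
rownames(result) <- NULL
result
```

County 1303 has no 423620 row, so its 423720 row is left as it is, which is what the desired output in the question asks for.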

Related

R - manipulate last rows depending on group and previous elements

I'm fairly new to R and struggling to find a solution for the following problem:
I have a tibble consisting of 3 columns. The first column holds the stock IDs (e.g. ID1, ID2, ...), the second the date of observation, and the third the corresponding return (ID | Date | Return).
To tidy my dataset, I need to delete all zero returns, starting from the end of the sample period, until I reach the first non-zero return.
The following picture further visualises my issue.
[Image: DatasetExample]
In the example dataset depicted above, I need to delete the yellow-coloured elements.
Hence, one needs to first group by ID and second iterate over the table from bottom to top until reaching a non zero return.
I already found a way by converting the tibble into a matrix and then looping over each element, but this approach is rather naive and does not perform well on large datasets (more than 2 million observations), which is exactly my case.
Is there a more efficient way to achieve this? Solutions using dplyr would be highly appreciated.
Thanks in advance.
Here is a dplyr solution. I believe it's a bit complicated, but it works.
library(dplyr)
df1 %>%
mutate(Date = as.Date(Date, format = "%d.%m.%Y")) %>%
group_by(ID) %>%
arrange(desc(Date), .by_group = TRUE) %>%
mutate(flag = min(which(Return == 0)),
flag = cumsum(Return != 0 & flag <= row_number())) %>%
filter(flag > 0) %>%
select(-flag) %>%
arrange(Date, .by_group = TRUE)
## A tibble: 7 x 3
## Groups: ID [2]
# ID Date Return
# <int> <date> <dbl>
#1 1 2020-09-20 0.377
#2 1 2020-09-21 0
#3 1 2020-09-22 -1.10
#4 2 2020-09-20 0.721
#5 2 2020-09-21 0
#6 2 2020-09-22 0
#7 2 2020-09-23 1.76
Test data creation code
set.seed(2020)
df1 <- data.frame(ID = rep(1:2, each = 5), Date = Sys.Date() - 5:1, Return = rnorm(10))
df1$Date <- format(df1$Date, "%d.%m.%Y")
df1$Return[sample(1:5, 2)] <- 0
df1$Return[sample(6:10, 2)] <- 0
df1$Return[10] <- 0
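A shorter variant of the same idea, sketched in base R on a small made-up data set (the values below are not the ones generated by the seed above): within each ID, count the non-zero returns from the last date backwards and keep every row for which that count is still positive. The column name remaining is hypothetical.

```r
# Made-up example: ID 1 ends with two zero returns, ID 2 ends with a non-zero return
df1 <- data.frame(
  ID = rep(1:2, each = 5),
  Date = rep(as.Date("2020-09-20") + 0:4, times = 2),
  Return = c(0.4, 0, -1.1, 0, 0,
             0.7, 0, 0, 1.8, 0.2)
)

# Make sure rows are in date order within each ID
df1 <- df1[order(df1$ID, df1$Date), ]

# For each row: how many non-zero returns occur at this row or later in the same ID
remaining <- ave(as.numeric(df1$Return != 0), df1$ID,
                 FUN = function(z) rev(cumsum(rev(z))))

# Rows with nothing non-zero at or after them are the trailing zeros
tidy <- df1[remaining > 0, ]
tidy
```

Zeros in the middle of a series are kept, an ID whose last return is non-zero is left untouched, and an ID whose returns are all zero is dropped entirely.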
There might be a more elegant way but this could work, assuming the rows are sorted by date within each ID:
split_data <- split(data, data$ID)
# keep the rows up to and including the last non-zero Return of each ID
split_tidy_data <- lapply(split_data, function(x) x[seq_len(max(which(x$Return != 0))), ])
tidy_data <- do.call(rbind, split_tidy_data)
Note: This only works if every ID has at least one "Return" that is not equal to 0.

How to create columns from a list in a for loop using mutate

I was wondering if there was a way to create multiple columns from a list in R using the mutate() function within a for loop.
Here is an example of what I mean:
The Problem:
I have a data frame df that has 2 columns: category and rating. I want to add a column for every element of df$category and in that column, I want a 1 if the category column matches the iterator.
library(dplyr)
df <- tibble(
category = c("Art","Technology","Finance"),
rating = c(100,95,50)
)
Doing it manually, I could do:
df <-
df %>%
mutate(art = ifelse(category == "Art", 1,0))
However, what happens when I have 50 categories? (Which is close to what I have in my original problem. That would take a lot of time!)
What I tried:
category_names <- df$category
for(name in category_names){
df <-
df %>%
mutate(name = ifelse(category == name, 1,0))
}
Unfortunately, It doesn't seem to work.
I'd appreciate any light on the subject!
Full Code:
library(dplyr)
#Creates tibble
df <- tibble(
category = c("Art","Technology","Finance"),
rating = c(100,95,50)
)
#Showcases the operation I would like to loop over df
df <-
df %>%
mutate(art = ifelse(category == "Art", 1,0))
#Creates a variable for clarity
category_names <- df$category
#For loop I tried
for(name in category_names){
df <-
df %>%
mutate(name = ifelse(category == name, 1,0))
}
I am aware that what I am essentially doing is a form of model.matrix(); however, before I found out about that function I was still perplexed why what I was doing before wasn't working.
We can use pivot_wider after creating a sequence column
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number(), n = 1) %>%
pivot_wider(names_from = category, values_from = n,
values_fill = list(n = 0)) %>%
select(-rn)
# A tibble: 3 x 4
# rating Art Technology Finance
# <dbl> <dbl> <dbl> <dbl>
#1 100 1 0 0
#2 95 0 1 0
#3 50 0 0 1
Or another option is map
library(purrr)
map_dfc(unique(df$category), ~ df %>%
transmute(!! .x := +(category == .x))) %>%
bind_cols(df, .)
# A tibble: 3 x 5
# category rating Art Technology Finance
#* <chr> <dbl> <int> <int> <int>
#1 Art 100 1 0 0
#2 Technology 95 0 1 0
#3 Finance 50 0 0 1
If we need a for loop
for(name in category_names) df <- df %>% mutate(!! name := +(category == name))
Or in base R with table
cbind(df, as.data.frame.matrix(table(seq_len(nrow(df)), df$category)))
# category rating Art Finance Technology
#1 Art 100 1 0 0
#2 Technology 95 0 0 1
#3 Finance 50 0 1 0
Wanted to throw something in for anyone who stumbles across this question. The problem in the OP is that the "name" column name gets re-used during each iteration of the loop: you end up with only one new column, when you really wanted three (or 50). I consistently find myself wanting to create multiple new columns within loops, and I recently found out that mutate can now take "glue"-like inputs to do this. The following code now also solves the original question:
for(name in category_names){
df <-
df %>%
mutate("{name}" := ifelse(category == name, 1, 0))
}
This is equivalent to akrun's answer using a for loop, but it doesn't involve the !! operator. Note that you still need the "walrus" := operator, and that the column name needs to be a string (I think since it's using "glue" in the background). I'm thinking some people might find this format easier to understand.
Reference: https://www.tidyverse.org/blog/2020/02/glue-strings-and-tidy-eval/
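For comparison, here is a minimal base R sketch of the same loop. Because `[[<-` takes the column name as a string, there is nothing to unquote. The data frame below just mirrors the example from the question.

```r
# Example data mirroring the question
df <- data.frame(
  category = c("Art", "Technology", "Finance"),
  rating = c(100, 95, 50),
  stringsAsFactors = FALSE
)

# One 0/1 indicator column per distinct category
for (name in unique(df$category)) {
  # df[[name]] uses the VALUE of name as the column name
  df[[name]] <- as.integer(df$category == name)
}
df
```

In the original attempt, mutate(name = ...) treats name on the left-hand side as the literal column name, which is why only one column called "name" appears.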

How can I speed up a function combining rbind and lapply?

I have a large dataframe(100K rows, 19 columns). I need to count the number of cases each month that contain each possible combination of 5 items.
The following code works for a small dataset but with my complete dataset it takes way too long. From my searching I suspect that pre-allocating a dataframe is the key, but I cannot figure out how to do that.
library(dplyr)
Case<-c(1,1,1,2,2,3,4,5,5,6,6,6,7,8,8,8,9,9,9)
Month<- c("Jan","Jan","Jan","Mar","Mar","Sep","Sep","Nov","Nov","Dec","Dec","Dec","Apr","Dec","Dec","Dec","Dec","Dec","Dec")
Fruits<-c("Apple","Orange","Grape","Grape","Orange","Apple","Apple","Orange","Grape","Apple","Orange","Grape","Grape","Apple","Orange","Grape","Apple","Orange","Grape")
df<-data.frame(Case,Month,Fruits)
Patterns <- with(df, do.call(rbind, lapply(unique(Case), function(x){
y <- subset(df, Case == x )
Date<-as.character(y$Month[1])
Fruits <- paste(unique(y$Fruits[order(y$Fruits)]), collapse = ' / ')
as.data.frame(unique (cbind(Case = y$Case, Date, Fruits)))
})))
Total<-Patterns %>%
group_by(Date,Fruits) %>%
tally()
The results I get are acceptable but the process takes too long and with a large dataset I run out of memory.
Over large datasets, data.table will be a lot quicker than dplyr:
library(data.table)
setDT(df)[, lapply(.SD, toString), by = c("Case","Month")][,.N, by = c("Fruits","Month")]
We could do all of it in one command using dplyr. First we group_by Case and Month to paste all Fruits together by group; then, grouping by Month and Fruits, we count the number of rows in each group using tally.
library(dplyr)
df %>%
group_by(Case, Month) %>%
summarise(Fruits = paste(Fruits, collapse = "/")) %>%
group_by(Month, Fruits) %>%
tally()
# OR count()
# Month Fruits n
# <fct> <chr> <int>
#1 Apr Grape 1
#2 Dec Apple/Orange/Grape 3
#3 Jan Apple/Orange/Grape 1
#4 Mar Grape/Orange 1
#5 Nov Orange/Grape 1
#6 Sep Apple 2
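Note that in the output above "Grape/Orange" and "Orange/Grape" show up as different patterns, whereas the code in the question sorts the fruits before pasting. A base R sketch that keeps the sorting, using the question's data (the names patterns and total are made up for this example):

```r
Case <- c(1,1,1,2,2,3,4,5,5,6,6,6,7,8,8,8,9,9,9)
Month <- c("Jan","Jan","Jan","Mar","Mar","Sep","Sep","Nov","Nov","Dec","Dec","Dec",
           "Apr","Dec","Dec","Dec","Dec","Dec","Dec")
Fruits <- c("Apple","Orange","Grape","Grape","Orange","Apple","Apple","Orange","Grape",
            "Apple","Orange","Grape","Grape","Apple","Orange","Grape","Apple","Orange","Grape")
df <- data.frame(Case, Month, Fruits, stringsAsFactors = FALSE)

# One row per case: its month and its sorted, de-duplicated fruit pattern
patterns <- aggregate(Fruits ~ Case + Month, data = df,
                      FUN = function(f) paste(sort(unique(f)), collapse = " / "))

# Number of cases per month and pattern
total <- aggregate(Case ~ Month + Fruits, data = patterns, FUN = length)
names(total)[3] <- "n"
total
```

This replaces the per-case subset() and rbind() loop with two grouped aggregations. Whether it is fast enough for 100K rows is not tested here.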

How to group multiple rows based on some criteria and sum values in R?

Hi All,
Example: The above is the data I have. I want to group ages 1-2 and count the values; in this data the count is 4 for age group 1-2. Similarly, I want to group ages 3-4 and count the values; here the count for age group 3-4 is 6.
How can I group ages and aggregate the values corresponding to each group?
I know this way:
data.frame(df %>% group_by(df$Age) %>% tally())
But this aggregates the values for each individual age.
I want the values aggregated over groups of several ages, as in the example above.
Any help on this will be greatly helpful.
Thanks a lot to All.
Here are two solutions, with base R and with package dplyr.
I will use the data posted by Shree.
First, base R.
I create a grouping variable grp and then aggregate on it.
grp <- with(df, c((age %in% 1:2) + 2*(age %in% 3:4)))
aggregate(age ~ grp, df, length)
# grp age
#1 1 4
#2 2 6
Second a dplyr way.
Function case_when is used to create a grouping variable. This allows for meaningful names to be given to the groups in an easy way.
library(dplyr)
df %>%
mutate(grp = case_when(
age %in% 1:2 ~ "1:2",
age %in% 3:4 ~ "3:4",
TRUE ~ NA_character_
)) %>%
group_by(grp) %>%
tally()
## A tibble: 2 x 2
# grp n
# <chr> <int>
#1 1:2 4
#2 3:4 6
Here's one way using dplyr and ?cut from base R -
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4),
Name = letters[1:10],
stringsAsFactors = F)
df %>%
count(grp = cut(age, breaks = c(0,2,4)))
# A tibble: 2 x 2
# grp n
# <fct> <int>
#1 (0,2] 4
#2 (2,4] 6
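The same binning also works without dplyr. A small sketch with cut and table, where the labels "1-2" and "3-4" are chosen here just to match the wording of the question:

```r
age <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4)

# (0,2] becomes "1-2" and (2,4] becomes "3-4"
grp <- cut(age, breaks = c(0, 2, 4), labels = c("1-2", "3-4"))

# Count the rows per age group
counts <- as.data.frame(table(grp), responseName = "n")
counts
```

Ages outside the breaks become NA in grp and are left out of the table by default.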

R: Generate a table of win/loss records against specific players

Let's say I have the following data:
dat <- read.table(text="p1 p2 outcome
jon joe 1-0
jon james 0-1
james ken 1-0
ken jon 1-0", header=T)
I'm trying to use dplyr to output a summary table of some specific player's (e.g. jon's) statistics against every other player in the dataframe. So, the output should be:
joe: 1-0
james: 1-0
ken: 0-1
I want to use 'group_by' to work with a corpus of joe games, but don't know how to implement conditional group_by's (e.g. group_by joe if p1 or p2 == joe). I could mutate to create a dummy column that is equal to 1 if either of those conditions are true, and group_by that, but was hoping there was a more parsimonious strategy. And then, the only way I can see of counting a 'win' for Joe is to use an ifelse statement whereby if p1 == Joe and outcome == 1-0 or p2 == Joe and outcome == 0-1, then count that as a win for Joe. However, not sure how to do these if statements within dplyr piping.
This would be a dplyr solution that allows for multiple games between jon and the other players (not just one game). It basically filters all games that jon was part of and extracts the opponent via mutate and ifelse. It then summarizes the number of wins and losses after grouping by opponent. In the end I paste the overall result for each opponent and only select this pasted column:
dat %>% mutate(p1 = as.character(p1), p2 = as.character(p2)) %>%
filter((p1 == "jon")|(p2 == "jon")) %>%
mutate(opponent= ifelse(p1 == "jon",p2,p1)) %>%
group_by(opponent) %>%
summarize(Wins = sum((outcome == "1-0" & p1 == "jon") |
(outcome == "0-1" & p2 == "jon")) ,
Losses = n() - Wins) %>%
mutate(Outcome = paste0(opponent, ": ", Wins, "-", Losses)) %>%
select(Outcome)
I had to add the as.character mutate to properly return the opponents in the ifelse. Otherwise the variables p1 and p2 would still be factors and the numeric codes would be returned instead of the labels (i.e. names of the players).
Here's an alternative tidyverse solution:
# example data
dat <- read.table(text="
p1 p2 outcome
jon joe 1-0
jon james 0-1
james ken 1-0
ken jon 1-0", header=T, stringsAsFactors=F)
library(tidyverse)
# reshape your dataset
dat2 = dat %>%
mutate(game_id = row_number()) %>% # add game id
unite(p, p1, p2, sep="-") %>% # combine player names
separate_rows(p, outcome) # separate rows using name and scores
# get summary stats for jon
dat2 %>%
group_by(game_id) %>% # for each game id
filter("jon" %in% p) %>% # keep games that jon played
summarise(pl = p[p != "jon"], # get the name of the other player
outcome = paste0(outcome[p=="jon"], "-", outcome[p!="jon"])) # combine the scores (jon vs. other)
# # A tibble: 3 x 3
# game_id pl outcome
# <int> <chr> <chr>
# 1 1 joe 1-0
# 2 2 james 0-1
# 3 4 ken 0-1
Assuming you can reshape your original dataset once, at the beginning, you can create a function using the second part:
GetSummaryStats = function(x) {
dat2 %>%
group_by(game_id) %>%
filter(x %in% p) %>%
summarise(pl = p[p != x],
outcome = paste0(outcome[p==x], "-", outcome[p!=x])) }
and call it like this:
GetSummaryStats("jon")
for any player you like.
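If you would rather skip the reshaping step, a base R sketch along the same lines could look like this. The function name record_vs is made up for this example, and it assumes every outcome is either "1-0" or "0-1" (no draws), so any game that is not a win is counted as a loss.

```r
dat <- read.table(text = "p1 p2 outcome
jon joe 1-0
jon james 0-1
james ken 1-0
ken jon 1-0", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: win/loss record of `player` against each opponent
record_vs <- function(dat, player) {
  # Games the player took part in
  games <- dat[dat$p1 == player | dat$p2 == player, ]
  # The other player in each of those games
  opponent <- ifelse(games$p1 == player, games$p2, games$p1)
  # A win is 1-0 when listed first, 0-1 when listed second
  won <- (games$p1 == player & games$outcome == "1-0") |
         (games$p2 == player & games$outcome == "0-1")
  wins <- tapply(won, opponent, sum)
  losses <- tapply(!won, opponent, sum)
  data.frame(opponent = names(wins),
             wins = as.vector(wins),
             losses = as.vector(losses),
             stringsAsFactors = FALSE)
}

rec <- record_vs(dat, "jon")
rec
```

Because wins and losses are summed per opponent, several games against the same player are accumulated into one row.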
