R - how to sum each column of a df

I have this df
df <- read.table(text="
id month gas tickets
1 1 13 14
2 1 12 1
1 2 4 5
3 1 5 7
1 3 0 9
", header=TRUE)
What I'd like to do is calculate the sum of gas, tickets (and 50+ other columns in my real df) for each month. Usually I would do something like
result <-
df %>%
group_by(month) %>%
summarise(
gas = sum(gas),
tickets = sum(tickets)
) %>%
ungroup()
But since I have a really large number of columns in my dataframe, I don't want to repeat myself by writing a sum for each column. I'm wondering if it's possible to create something more elegant - a function or something that will sum every column except id and month, grouped by the month column.

You can use summarise_at() to ignore id and sum the rest:
df %>%
group_by(month) %>%
summarise_at(vars(-id), list(sum = ~sum(.)))
# A tibble: 3 x 3
month gas_sum tickets_sum
<int> <int> <int>
1 1 30 22
2 2 4 5
3 3 0 9

You can use aggregate as markus recommends in the comments. If you want to stick to the tidyverse you could try something like this:
df %>%
select(-id) %>%
group_by(month) %>%
summarise_if(is.numeric, sum)
#### OUTPUT ####
# A tibble: 3 x 3
month gas tickets
<int> <int> <int>
1 1 30 22
2 2 4 5
3 3 0 9
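As a side note, summarise_at() and summarise_if() have since been superseded by across() (dplyr >= 1.0.0), and the aggregate() route mentioned above was never spelled out. A sketch of both, assuming the df from the question:

```r
library(dplyr)

df <- read.table(text = "
id month gas tickets
1 1 13 14
2 1 12 1
1 2 4 5
3 1 5 7
1 3 0 9
", header = TRUE)

# Modern dplyr: across() supersedes summarise_at()/summarise_if()
result <- df %>%
  group_by(month) %>%
  summarise(across(-id, sum))

# The base R aggregate() route: drop id, then sum everything by month
result_base <- aggregate(. ~ month, data = df[-1], FUN = sum)
```

Both scale to any number of columns, since you only ever name the columns to exclude.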


Copy a row if there is only one record for that group

I have some pre-post data (in R) for which some individuals only have a value for time 1, like episode "2" here:
episode <- c('1','1','2','3','3')
score <- c('44','12','37','40','9')
df <- data.frame(episode,score)
For all the records where there is only "pre" data (1 score per episode), I would like to use R (dplyr preferred) to copy that record and then indicate for all records which is the pre and which is the post. So the final should look something like:
episode score time
1       44    1
1       12    2
2       37    1
2       37    2
3       40    1
3       9     2
Thanks!
Here is one option - create a column of frequency counts by 'episode'; add 1 to the logical (n == 1), so singletons get 2 and everything else gets 1, and replicate the rows with uncount:
library(dplyr)
library(tidyr)
df %>%
add_count(episode) %>%
mutate(n = (n == 1) + 1) %>%
uncount(n) %>%
group_by(episode) %>%
mutate(time = row_number()) %>%
ungroup
Output:
# A tibble: 6 × 3
episode score time
<chr> <chr> <int>
1 1 44 1
2 1 12 2
3 2 37 1
4 2 37 2
5 3 40 1
6 3 9 2
Or create the 'time' column first and then use complete with fill
df %>%
group_by(episode) %>%
mutate(time = row_number()) %>%
ungroup %>%
complete(episode, time) %>%
fill(score)
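For completeness, the same row-replication idea works in base R without uncount(); this sketch assumes R >= 4.0, where data.frame() no longer converts characters to factors:

```r
episode <- c('1','1','2','3','3')
score <- c('44','12','37','40','9')
df <- data.frame(episode, score)

# count rows per episode, then replicate singleton rows twice, others once
tab <- table(df$episode)
reps <- ifelse(tab[df$episode] == 1, 2, 1)
out <- df[rep(seq_len(nrow(df)), reps), ]

# number the rows within each episode to get the pre/post indicator
out$time <- ave(seq_len(nrow(out)), out$episode, FUN = seq_along)
```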

Counting how often x occurs per y and visualizing it in R

I would like to count certain things in my dataset. I have panel data and ideally would like to count the number of activities per person.
people <- c(1,1,1,2,2,3,3,4,4,5,5)
activity <- c(1,1,1,2,2,3,4,5,5,6,6)
completion <- c(0,0,1,0,1,1,1,0,0,0,1)
So my output would tell me that person 4 has 2 tasks:
people  frequency activity
4       2
Would I need to group something? Ideally I would also like to visualize this as a histogram.
I have tried this:
## activity per person
cllw %>%
  ## group observations by people
  group_by(id_user) %>%
  ## count activities per person - I am not sure how to create frequencies at all
Like this?
library(dplyr)
df %>%
group_by(people) %>%
summarise("frequency activity" = n())
# A tibble: 5 x 2
people `frequency activity`
<dbl> <int>
1 1 3
2 2 2
3 3 2
4 4 2
5 5 2
Or like this if you only want "active" tasks:
df %>%
filter(completion != 1) %>%
group_by(people) %>%
summarise("frequency activity" = n())
# A tibble: 4 x 2
people `frequency activity`
<dbl> <int>
1 1 2
2 2 1
3 4 2
4 5 1
Edit for unique tasks per person:
df %>%
filter(completion != 1) %>%
distinct(people, activity) %>%
group_by(people) %>%
summarise("frequency activity" = n())
# A tibble: 4 x 2
people `frequency activity`
<dbl> <int>
1 1 1
2 2 1
3 4 1
4 5 1
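The visualization part of the question went unanswered. A minimal base R sketch of the bar chart of activity counts per person (ggplot2's geom_bar() would work just as well):

```r
people <- c(1,1,1,2,2,3,3,4,4,5,5)
activity <- c(1,1,1,2,2,3,4,5,5,6,6)
completion <- c(0,0,1,0,1,1,1,0,0,0,1)
df <- data.frame(people, activity, completion)

# tabulate the number of activity rows per person
freq <- as.data.frame(table(people = df$people))

# draw a bar chart of the counts
barplot(freq$Freq, names.arg = as.character(freq$people),
        xlab = "person", ylab = "number of activities")
```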

Calculate ratio for subsets within subsets using dplyr

I have a set of data for many authors (AU), spanning multiple years (Year) and multiple topics (Topic). For each AU, Year, and Topic combination I want to calculate a ratio of the total FL by Topic / total FL for the year.
The data will look like this:
Data <- data.frame("AU" = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
"Year" = c(2010,2010,2010,2010,2010,2010,2011,2011,2011,2011,2010,2010,2010,2011,2011,2011,2011,2010,2011,2011),
"Topic" = c(1,1,1,2,2,2,1,1,2,2,2,2,2,1,1,1,1,1,1,1),
"FL" = c(1,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0,1,1,1))
I've been playing around with dplyr trying to figure out how to do this. I can group_by easily enough, but I'm not sure how to calculate the ratio using one group for the numerator and the total across all groups for the denominator.
Results <- Data %>%
group_by(Year, AU) %>%
summarise(ratio = ???) # Should be (Sum(FL) by Topic) / (Sum(FL) across all Topics)
If I understand your desired output correctly, you can calculate the total by Topic, Year, AU and the total by Year, AU separately, then join them together using left_join.
left_join(
Data %>%
group_by(AU, Year, Topic) %>%
summarise(FL_topic = sum(FL)) %>%
ungroup(),
Data %>%
group_by(AU, Year) %>%
summarise(FL_total = sum(FL)) %>%
ungroup(),
by = c("AU", "Year")
) %>%
mutate(ratio = FL_topic/FL_total)
# A tibble: 7 x 6
# AU Year Topic FL_topic FL_total ratio
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2010 1 2 4 0.5
# 2 1 2010 2 2 4 0.5
# 3 1 2011 1 0 2 0
# 4 1 2011 2 2 2 1
# 5 2 2010 1 1 4 0.25
# 6 2 2010 2 3 4 0.75
# 7 2 2011 1 4 4 1
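The same result can also come from a single pipeline: count() with a wt argument gives the per-Topic sums, and a grouped mutate() supplies the denominator. A sketch, assuming the Data frame from the question:

```r
library(dplyr)

Data <- data.frame("AU" = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
                   "Year" = c(2010,2010,2010,2010,2010,2010,2011,2011,2011,2011,
                              2010,2010,2010,2011,2011,2011,2011,2010,2011,2011),
                   "Topic" = c(1,1,1,2,2,2,1,1,2,2,2,2,2,1,1,1,1,1,1,1),
                   "FL" = c(1,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0,1,1,1))

Results <- Data %>%
  count(AU, Year, Topic, wt = FL, name = "FL_topic") %>%  # sum(FL) per AU/Year/Topic
  group_by(AU, Year) %>%
  mutate(FL_total = sum(FL_topic),         # sum(FL) across all Topics
         ratio = FL_topic / FL_total) %>%
  ungroup()
```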

Dynamically Normalize all rows with first element within a group

Suppose I have the following data frame:
year subject grade study_time
1 1 a 30 20
2 2 a 60 60
3 1 b 30 10
4 2 b 90 100
What I would like to do is be able to divide grade and study_time by their first record within each subject. I do the following:
df %>%
group_by(subject) %>%
mutate(RN = row_number()) %>%
mutate(study_time = study_time/study_time[RN ==1],
grade = grade/grade[RN==1]) %>%
select(-RN)
I would get the following output
year subject grade study_time
1 1 a 1 1
2 2 a 2 3
3 1 b 1 1
4 2 b 3 10
It's fairly easy to do when I know the variable names. However, I'm trying to write a generalized function that can act on any data.frame/data.table/tibble where I may not know the names of the variables I need to mutate; I'll only know the variable names not to mutate. I'm trying to get this done using tidyverse/data.table and I can't get anything to work.
Any help would be greatly appreciated.
We group by 'subject' and use mutate_at to transform multiple columns, dividing each element by the first element of its column:
library(dplyr)
df %>%
group_by(subject) %>%
mutate_at(3:4, funs(./first(.)))
# A tibble: 4 x 4
# Groups: subject [2]
# year subject grade study_time
# <int> <chr> <dbl> <dbl>
#1 1 a 1 1
#2 2 a 2 3
#3 1 b 1 1
#4 2 b 3 10
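In current dplyr, funs() is deprecated and mutate_at() is superseded by across(); with across() you only need to name the columns you want to leave alone, which matches the requirement of not knowing the columns to mutate. A sketch, assuming the df shown above:

```r
library(dplyr)

df <- data.frame(year = c(1, 2, 1, 2),
                 subject = c("a", "a", "b", "b"),
                 grade = c(30, 60, 30, 90),
                 study_time = c(20, 60, 10, 100))

keep <- c("year", "subject")  # the columns NOT to normalize

out <- df %>%
  group_by(subject) %>%
  # any_of() tolerates names that are absent or already used for grouping
  mutate(across(-any_of(keep), ~ .x / first(.x))) %>%
  ungroup()
```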

Count number of observations without N/A per year in R

I have a dataset and I want to summarize the number of observations without the missing values (denoted by NA).
My data is similar as the following:
data <- read.table(header = TRUE,
stringsAsFactors = FALSE,
text="CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")
I was planning to use the dplyr package, but that only takes the years into account, not the different variables:
library(dplyr)
data %>%
group_by(Year) %>%
summarise(number = n())
How can I obtain the following outcome?
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
To get the counts, you can start by using:
library(dplyr)
data %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.)))
## A tibble: 3 x 3
# Year ExplanatoryVariable1 ExplanatoryVariable2
# <int> <int> <int>
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
If you want to reshape it as shown in your question, you can extend the pipe using tidyr functions:
library(tidyr)
data %>%
group_by(Year) %>%
summarise_at(vars(starts_with("Expla")), ~sum(!is.na(.))) %>%
gather(var, count, -Year) %>%
spread(Year, count)
## A tibble: 2 x 4
# var `2000` `2001` `2002`
#* <chr> <int> <int> <int>
#1 ExplanatoryVariable1 2 2 1
#2 ExplanatoryVariable2 2 2 2
Just to let OP know, since they have ~200 explanatory variables to select: you can use other options of summarise_at to select the variables. You can simply name the first:last variables, if they are ordered consecutively in the data, for example:
data %>%
group_by(Year) %>%
summarise_at(vars(ExplanatoryVariable1:ExplanatoryVariable2), ~sum(!is.na(.)))
Or:
data %>%
group_by(Year) %>%
summarise_at(3:4, ~sum(!is.na(.)))
Or store the variable names in a vector and use that:
vars <- names(data)[4:5]
data %>%
group_by(Year) %>%
summarise_at(vars, ~sum(!is.na(.)))
data %>%
gather(cat, val, -(1:3)) %>%
filter(complete.cases(.)) %>%
group_by(Year, cat) %>%
summarize(n = n()) %>%
spread(Year, n)
# # A tibble: 2 x 4
# cat `2000` `2001` `2002`
# * <chr> <int> <int> <int>
# 1 ExplanatoryVariable1 2 2 1
# 2 ExplanatoryVariable2 2 2 2
Should do it. You start by stacking the data, then simply calculate the n for each Year and explanatory variable. If you want the data back in wide format, use spread; either way, even without spread you get the counts for both variables.
Using base R:
do.call(cbind,by(data[3:5], data$Year,function(x) colSums(!is.na(x[-1]))))
2000 2001 2002
ExplanatoryVariable1 2 2 1
ExplanatoryVariable2 2 2 2
For aggregate:
aggregate(.~Year,data[3:5],function(x) sum(!is.na(x)),na.action = function(x)x)
You could do it with aggregate in base R.
aggregate(list(ExplanatoryVariable1 = data$ExplanatoryVariable1,
ExplanatoryVariable2 = data$ExplanatoryVariable2),
list(Year = data$Year),
function(x) length(x[!is.na(x)]))
# Year ExplanatoryVariable1 ExplanatoryVariable2
#1 2000 2 2
#2 2001 2 2
#3 2002 1 2
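In current tidyverse code the same answer is usually written with across() plus the pivot_* replacements for gather()/spread(). A sketch, assuming the data frame from the question:

```r
library(dplyr)
library(tidyr)

data <- read.table(header = TRUE, text = "
CompanyNumber ResponseVariable Year ExplanatoryVariable1 ExplanatoryVariable2
1 2.5 2000 1 2
1 4 2001 3 1
1 3 2002 NA 7
2 1 2000 3 NA
2 2.4 2001 0 4
2 6 2002 2 9
3 10 2000 NA 3")

counts <- data %>%
  group_by(Year) %>%
  # count non-NA values in every explanatory column
  summarise(across(starts_with("Expla"), ~ sum(!is.na(.x)))) %>%
  # reshape so variables are rows and years are columns
  pivot_longer(-Year, names_to = "var", values_to = "n") %>%
  pivot_wider(names_from = Year, values_from = n)
```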
