I have a binary variable in my dataframe (people are older than 65 years or not) and the total amount of people for the two groups for several years.
I would like to have only one row for each year, so one column would show the amount of people older than 65 and one column the amount of people younger than 65.
How can I split my column with the binary variable up to make two columns out of it?
Thank you very much for your answer.
Is this what you're looking for:
dat <- data.frame(over65 = rep(c(0,1), 5),
year = rep(2016:2020, each=2),
n=round(runif(10, 100, 200)))
dat
# over65 year n
# 1 0 2016 176
# 2 1 2016 109
# 3 0 2017 133
# 4 1 2017 142
# 5 0 2018 150
# 6 1 2018 110
# 7 0 2019 127
# 8 1 2019 138
# 9 0 2020 151
# 10 1 2020 159
library(dplyr) # for %>%
library(tidyr) # for pivot_wider()
dat %>% pivot_wider(names_from = "over65", values_from = "n", names_prefix = "over65_")
# # A tibble: 5 x 3
# year over65_0 over65_1
# <int> <dbl> <dbl>
# 1 2016 176 109
# 2 2017 133 142
# 3 2018 150 110
# 4 2019 127 138
# 5 2020 151 159
Here is a data.table approach:
# H/t to DaveArmstrong for the data!
dat <- data.frame(over65 = rep(c(0,1), 5),
year = rep(2016:2020, each=2),
n=round(runif(10, 100, 200)))
library(data.table)
setDT(dat)
# name the value column explicitly so dcast does not have to guess it
dcast(dat, year ~ paste0("over65_", over65), value.var = "n", fun.aggregate = sum)
#> year over65_0 over65_1
#> 1: 2016 146 159
#> 2: 2017 134 120
#> 3: 2018 164 113
#> 4: 2019 185 163
#> 5: 2020 180 114
Created on 2021-03-18 by the reprex package (v0.3.0)
We can use xtabs in base R
xtabs(n ~ year + over65, dat)
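xtabs returns a contingency table rather than a data frame. A minimal sketch of getting from that table to one row per year with two count columns is below. It assumes a hand-made data frame `toy` (hypothetical values, chosen so the result is easy to check) in place of the random `dat`.

```r
# Hypothetical toy data with fixed values instead of runif()
toy <- data.frame(over65 = rep(c(0, 1), 3),
                  year   = rep(2016:2018, each = 2),
                  n      = c(10, 20, 30, 40, 50, 60))

# Cross-tabulate: rows are years, columns are the two over65 groups
tab <- xtabs(n ~ year + over65, toy)

# Turn the table into a plain data frame with descriptive column names
wide <- as.data.frame.matrix(tab)
names(wide) <- paste0("over65_", names(wide))
wide <- cbind(year = as.integer(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```

If a year/group combination occurs more than once, xtabs sums the n values for it, which is the behaviour the question asks for.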
library(dplyr)
dat1 <- dat %>%
group_by(year) %>%
summarize(older_65 = case_when(over65==1 ~ n),
younger_65 = case_when(over65==0 ~ n)) %>%
mutate(older_65=lead(older_65)) %>%
na.omit()
data: borrowed from DaveArmstrong
set.seed(123)
dat <- data.frame(over65 = rep(c(0,1), 5),
year = rep(2016:2020, each=2),
n=round(runif(10, 100, 200)))
For example, I have a dataset called data, and the column names are date min max avg. The total number of rows is 366.
I want to sum min over each block of seven rows (rows 1-7, 8-14, and so on) to get a total per block. How can I do this?
If you create a grouping column which increments after every 7 days you may apply all the answers from How to sum a variable by group .
Here's how you can do it in base R.
set.seed(123)
df <- data.frame(Date = Sys.Date() - 365:0, min = rnorm(366), max = runif(366))
df$group <- ceiling(seq(nrow(df))/7)
aggregate(min~group, df, sum)
# group min
#1 1 3.1438325
#2 2 -0.3022263
#3 3 -1.0769539
#4 4 -3.2934430
#5 5 2.8419110
#...
This is a solution based on {tidyverse}, in particular using {dplyr} for the main operations and {lubridate} for formatting your dates.
First simulate some data - as you have not provided a reproducible dataset. I take the year 2020 which has 366 days ... obviously adapt this to your problem.
For the min, max, and average values (columns) , let's generate some random numbers. Again, adapt this to your needs.
library(dplyr) # for general data frame crunching
library(lubridate) # to coerce date-time
data <- data.frame(
date = seq(from = lubridate::ymd("2020-01-01")
, to = lubridate::ymd("2020-12-31"), by = 1)
, min = sample(x = 1:10, size = 366, replace = TRUE)
, max = sample(x = 10:15, size = 366, replace = TRUE)) %>%
dplyr::mutate(avg = mean(min + max))
To group your data, inject a binning / grouping variable.
The following is a generic "every 7 rows" grouping based on integer division (%/%).
If you want to group by weeks, etc. check out the {lubridate} documentation. You can get some useful bits out of dates for this. Or insert any other binning you need.
data <- data %>%
mutate(bin = (row_number() - 1) %/% 7) # rows 1-7 -> bin 0, rows 8-14 -> bin 1, ...
This yields:
> data
# A tibble: 366 x 5
# Groups: bin [53]
date min max avg bin
<date> <int> <int> <dbl> <dbl>
1 2020-01-01 2 11 18.2 0
2 2020-01-02 6 14 18.2 0
3 2020-01-03 7 13 18.2 0
4 2020-01-04 6 15 18.2 0
5 2020-01-05 3 10 18.2 0
6 2020-01-06 5 12 18.2 0
7 2020-01-07 5 12 18.2 0
8 2020-01-08 7 13 18.2 1
9 2020-01-09 8 11 18.2 1
10 2020-01-10 5 10 18.2 1
We can now summarise our grouped data.
For this you use the bin-variable to group your data, and then summarise to perform aggregations on these groups. Based on your question, the following sums the min-values. Put the function/summary you need:
data %>%
group_by(bin) %>%
summarise(tot_min = sum(min))
# A tibble: 53 x 2
bin tot_min
<dbl> <int>
1 0 34
2 1 31
3 2 35
4 3 44
5 4 40
6 5 50
7 6 46
8 7 38
9 8 33
10 9 21
# ... with 43 more rows
Assign the result to your liking or whatever type of output you need.
If you want to combine this with your original data frame, read up on left_join() and join on the bin column.
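For readers without the tidyverse, the same bin-then-sum idea can be sketched in base R, including attaching the totals back to the original rows. The data frame `toy` and the column name `tot_min` are assumptions made for illustration.

```r
# Hypothetical toy data: 15 rows, so the last bin is incomplete
toy <- data.frame(day = 1:15, min = 1:15)

# Integer division gives rows 1-7 -> 0, rows 8-14 -> 1, row 15 -> 2
toy$bin <- (seq_len(nrow(toy)) - 1) %/% 7

# Sum min within each bin
totals <- aggregate(min ~ bin, toy, sum)
names(totals)[2] <- "tot_min"

# Attach each bin's total to every original row of that bin
merged <- merge(toy, totals, by = "bin")
```

Note that the final bin covers fewer than seven rows whenever the row count is not a multiple of 7, as with 366 days.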
I'm currently working through R for Data Science and came across this chunk of code.
The question for this code is as follows. I don't understand the necessity of the arrange function here. Doesn't arrange function just reorder the rows?
library(tidyverse)
library(nycflights13)
flights %>%
arrange(tailnum, year, month, day) %>%
group_by(tailnum) %>%
mutate(delay_gt1hr = dep_delay > 60) %>%
mutate(before_delay = cumsum(delay_gt1hr)) %>%
filter(before_delay < 1) %>%
count(sort = TRUE)
However, it does output differently with or without the arrange function, as shown below:
#with the arrange function
tailnum n
<chr> <int>
1 N954UW 206
2 N952UW 163
3 N957UW 142
4 N5FAAA 117
5 N38727 99
6 N3742C 98
7 N5EWAA 98
8 N705TW 97
9 N765US 97
10 N635JB 94
# ... with 3,745 more rows
and
#Without the arrange function
tailnum n
<chr> <int>
1 N952UW 215
2 N315NB 161
3 N705TW 160
4 N961UW 139
5 N713TW 128
6 N765US 122
7 N721TW 120
8 N5FAAA 117
9 N945UW 104
10 N19130 101
# ... with 3,774 more rows
I'd appreciate it if you can help me understand this. Why is it necessary to include the arrange function here?
Yes, arrange only reorders the rows, but cumsum depends on row order, so the filter that follows keeps different rows and the counts change.
Here is a simplified example to demonstrate how the output differs with and without arrange.
library(dplyr)
df <- data.frame(a = 1:5, b = c(7, 8, 9, 1, 2))
df %>% filter(cumsum(b) < 20)
# a b
#1 1 7
#2 2 8
df %>% arrange(b) %>% filter(cumsum(b) < 20)
# a b
#1 4 1
#2 5 2
#3 1 7
#4 2 8
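The same effect can be sketched per group, which is closer to the flights code. The example below is base R only and uses a made-up data frame `toy` and a hypothetical helper `flights_before_delay`. It counts, for each plane, the flights seen before the first delay of more than 60 minutes, once in file order and once in date order.

```r
# Hypothetical toy data: flights are NOT stored in date order
toy <- data.frame(
  tailnum   = c("A", "A", "A", "B", "B"),
  day       = c(3, 1, 2, 2, 1),
  dep_delay = c(10, 5, 90, 0, 70)
)

flights_before_delay <- function(d) {
  # running count of long delays within each plane, in the current row order
  late_so_far <- ave(as.integer(d$dep_delay > 60), d$tailnum, FUN = cumsum)
  # number of rows per plane seen before the first long delay
  tapply(late_so_far < 1, d$tailnum, sum)
}

unsorted <- flights_before_delay(toy)
sorted   <- flights_before_delay(toy[order(toy$tailnum, toy$day), ])
```

Only the sorted version answers "how many flights before the first long delay"; the unsorted one counts rows that merely appear earlier in the file.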
I have a large dataframe in R with daily time series data of rainfall for a number of locations (each in their own column). I would like to know the number of times the rainfall is less than, or is greater than a threshold value for each location in each month and also by year.
My dataframe is large so I have provided example data here:
Date_ex <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-31'),by = 1)
A <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
B <- sample(x = c(1, 2, 10), size = 1858, replace = TRUE)
C <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
D <- sample(x = c(1, 3, 4), size = 1858, replace = TRUE)
df <- data.frame(Date_ex, A, B, C, D)
How would I find out the number of times the value in A, B, C and D is greater than 4 for each month and then also for each year.
I think I should then be able to summarise this into two new tables.
One like this (example, ignore numbers):
A B C D
2000-01 1 0 5 0
2000-02 2 16 25 0
2000-03 1 5 26 0
And one like this (example, ignore numbers):
A B C D
2000 44 221 67 0
2001 67 231 4 132
2002 99 111 66 4
2003 33 45 45 4
I think I should be using dplyr for this? But I'm not sure how to get the dates to work.
A solution using the dplyr and lubridate packages. The key is to create Year and Month columns, group by those columns, and use summarise_all to summarise the data.
# Create the example data frame, set the seed for reproducibility
set.seed(199)
Date_ex <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-31'),by = 1)
A <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
B <- sample(x = c(1, 2, 10), size = 1858, replace = TRUE)
C <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
D <- sample(x = c(1, 3, 4), size = 1858, replace = TRUE)
df <- data.frame(Date_ex, A, B, C, D)
library(dplyr)
library(lubridate)
# Summarise for each year and month
df2 <- df %>%
mutate(Year = year(Date_ex), Month = month(Date_ex)) %>%
select(-Date_ex) %>%
group_by(Year, Month) %>%
summarise_all(~ sum(. > 4)) %>% # funs() is deprecated; a formula does the same job
ungroup()
df2
# # A tibble: 61 x 6
# Year Month A B C D
# <dbl> <dbl> <int> <int> <int> <int>
# 1 2000 1 13 8 13 0
# 2 2000 2 12 7 8 0
# 3 2000 3 7 9 9 0
# 4 2000 4 9 12 10 0
# 5 2000 5 11 12 8 0
# 6 2000 6 12 9 16 0
# 7 2000 7 10 11 10 0
# 8 2000 8 8 12 14 0
# 9 2000 9 12 12 12 0
# 10 2000 10 9 9 7 0
# # ... with 51 more rows
# Summarise for each year
df3 <- df %>%
mutate(Year = year(Date_ex)) %>%
select(-Date_ex) %>%
group_by(Year) %>%
summarise_all(~ sum(. > 4)) # funs() is deprecated; a formula does the same job
df3
# # A tibble: 6 x 5
# Year A B C D
# <dbl> <int> <int> <int> <int>
# 1 2000 120 119 125 0
# 2 2001 119 123 113 0
# 3 2002 135 122 105 0
# 4 2003 114 112 104 0
# 5 2004 115 125 124 0
# 6 2005 9 14 11 0
Here are a few solutions.
1) aggregate This solution uses only base R. The new Date column is the date for the first of the month or first of the year.
aggregate(df[-1] > 4, list(Date = as.Date(cut(df[[1]], "month"))), sum)
aggregate(df[-1] > 4, list(Date = as.Date(cut(df[[1]], "year"))), sum)
1a) Using yearmon class from zoo and toyear from (3) we can write:
library(zoo)
aggregate(df[-1] > 4, list(Date = as.yearmon(df[[1]])), sum)
aggregate(df[-1] > 4, list(Date = toyear(df[[1]])), sum)
2) rowsum This is another base R solution. The year/month or year is given by the row names.
rowsum((df[-1] > 4) + 0, format(df[[1]], "%Y-%m"))
rowsum((df[-1] > 4) + 0, format(df[[1]], "%Y"))
2a) Using yearmon class from zoo and toyear from (3) we can write:
library(zoo)
rowsum((df[-1] > 4) + 0, as.yearmon(df[[1]]))
rowsum((df[-1] > 4) + 0, toyear(df[[1]]))
3) aggregate.zoo Convert to a zoo object and use aggregate.zoo. Note that yearmon class internally represents a year and month as the year plus 0 for Jan, 1/12 for Feb, 2/12 for March, etc. so taking the integer part gives the year.
library(zoo)
z <- read.zoo(df)
aggregate(z > 4, as.yearmon, sum)
toyear <- function(x) as.integer(as.yearmon(x))
aggregate(z > 4, toyear, sum)
The result is a zoo time series with a yearmon index in the first case and an integer index in the second. If you want a data frame use fortify.zoo(ag) where ag is the result of aggregate.
4) dplyr toyear is from (3).
library(dplyr)
library(zoo)
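df %>%
group_by(YearMonth = as.yearmon(Date_ex)) %>%
select(-Date_ex) %>% # drop the raw date so only A-D are summarised
summarize_all(~ sum(. > 4)) %>% # count values above the threshold
ungroup()
df %>%
group_by(Year = toyear(Date_ex)) %>%
select(-Date_ex) %>%
summarize_all(~ sum(. > 4)) %>%
ungroup()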
Data.table is missing so I'm adding this. Comments are in the code. I used set.seed(1) to generate the samples.
library(data.table)
setDT(df)
# add year and month to df
df[, `:=`(month = month(Date_ex),
year = year(Date_ex))]
# monthly returns, remove date_ex
monthly_dt <- df[,lapply(.SD, function(x) sum(x > 4)), by = .(year, month), .SDcols = -("Date_ex")]
year month A B C D
1: 2000 1 10 10 11 0
2: 2000 2 10 11 8 0
3: 2000 3 11 11 11 0
4: 2000 4 10 11 8 0
5: 2000 5 7 10 8 0
6: 2000 6 9 6 7 0
.....
# yearly returns, remove Date_ex and month
yearly_dt <- df[,lapply(.SD, function(x) sum(x > 4)), by = .(year), .SDcols = -c("Date_ex", "month")]
year A B C D
1: 2000 114 118 113 0
2: 2001 127 129 120 0
3: 2002 122 108 126 0
4: 2003 123 128 125 0
5: 2004 123 132 131 0
6: 2005 14 15 15 0
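The question also asks for counts below a threshold, which the answers above do not show. A minimal base R sketch follows, assuming a small hand-made data frame `toy` and a variable `threshold` (both hypothetical) so the counts can be checked by eye. Only the comparison operator changes.

```r
# Hypothetical toy data with fixed values instead of sample()
toy <- data.frame(
  Date_ex = as.Date(c("2000-01-01", "2000-01-02", "2000-02-01", "2001-01-01")),
  A = c(1, 5, 5, 3),
  B = c(10, 10, 1, 2)
)
threshold <- 4

# Days per month with a value above the threshold (row names are "YYYY-MM")
above_by_month <- rowsum((toy[-1] > threshold) + 0, format(toy$Date_ex, "%Y-%m"))

# Days per year with a value below the threshold (row names are "YYYY")
below_by_year <- rowsum((toy[-1] < threshold) + 0, format(toy$Date_ex, "%Y"))
```

Values exactly equal to the threshold are counted in neither table; use >= or <= if they should be included.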
I have a data frame in R like this:
D I S ...
110 2012 1000
111 2012 2000
110 2012 1000
111 2014 2000
110 2013 1000
111 2013 2000
I want to count how often each level occurs in each factor and save the result in a data frame like this:
D Count I Count S Count ...
110 3 2012 3 1000 3
111 3 2013 2 2000 3
2014 1
or this:
D Count
110 3
111 3
I Count
2012 3
2013 2
2014 1
S Count
1000 3
2000 3
....
I tried to do it with sapply, levels, the library(dplyr) or aggregate, but it does not produce the desired output. How can I do that?
I think the most efficient way to do it, in terms of length of code and storing final output in a tidy format is this:
library(tidyverse)
# example data
data <- data.frame(D = rep(c("110", "111"), 3),
I = c(rep("2012", 3), "2014", "2013", "2013"),
S = rep(c("1000", "2000"), 3))
data %>%
gather(name, value) %>% # reshape dataset to long format
count(name, value) # count combinations
# # A tibble: 7 x 3
# name value n
# <chr> <chr> <int>
# 1 D 110 3
# 2 D 111 3
# 3 I 2012 3
# 4 I 2013 2
# 5 I 2014 1
# 6 S 1000 3
# 7 S 2000 3
1st column holds the name of your factor variable.
2nd column has the unique values of each variable.
3rd column is the counter.
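The same long name/value/n layout can also be sketched without any packages. This is only an illustration of the idea, reusing the example data from this answer under the hypothetical name `toy`.

```r
# Same example data as above
toy <- data.frame(D = rep(c("110", "111"), 3),
                  I = c(rep("2012", 3), "2014", "2013", "2013"),
                  S = rep(c("1000", "2000"), 3))

# Tabulate every column, then stack the per-column results
counts <- do.call(rbind, lapply(names(toy), function(col) {
  tab <- table(toy[[col]])
  data.frame(name = col, value = names(tab), n = as.vector(tab))
}))
counts
```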
Here is a solution using data.table:
data <- data.frame(D = rep(c("110", "111"), 3),
I = c(rep("2012", 3), "2014", "2013", "2013"),
S = rep(c("1000", "2000"), 3))
str(data)
# you just want
table(data$D)
table(data$I)
table(data$S)
# one option using data.table
require(data.table)
dt <- as.data.table(data)
dt # see dt
dt[, table(D)] # or dt[, .N, by = D], for one variable
paste(names(dt), "Count", sep = "_") # names of new count columns
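# look up each row's value in its column's table so every row gets its own count
dt[, paste(names(dt), "Count", sep = "_") := lapply(.SD, function(x) as.vector(table(x)[as.character(x)]))]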
dt # new dt
data2 <- as.data.frame(dt)[, sort(names(dt))]
data2 # final data frame
And a pipe-based one (using %>% from dplyr) for the second output:
counts <- data %>%
lapply(table) %>%
lapply(as.data.frame)
counts
I think the easy way is by using the "plyr" R-library.
library(plyr)
count(data$D)
count(data$I)
count(data$S)
It will give you
> count(data$D)
x freq
1 110 3
2 111 3
> count(data$I)
x freq
1 2012 3
2 2013 2
3 2014 1
> count(data$S)
x freq
1 1000 3
2 2000 3
My data frame has 10 columns and 100,000 rows; each row is an observation and the columns hold data pertaining to that observation. One of the columns holds the date of the observation as a Julian day (i.e. Feb 4 = day 34). I want to reduce my data set to the first 10% of observations per year per species. For example, for species 1 in the year 1901 I want the average day of appearance based on the first 10% of observations.
Example of what I have: note id= species but as a number. ie blue=1
date=c(3,84,98,100,34,76,86...)
species=c(blue,purple,grey,purple,green,pink,pink,white...)
id=c(1,2,3,2,4,5,5,6...)
year=c(1901,2000,1901,1996,1901,2000,1986...)
habitat=c(forest,plain,mountain...)
etc.
What I want:
date=c(3,84,76,86...)
species=c(purple,pink,pink, white...)
id=c(2,5,5,6...)
year=c(1901,2000,2000,1986...)
habitat=c(forest,plain,mountain...)
new=c(3,84,79,86...)
Assuming the data set dd defined below
set.seed(123)
n <- 100000
dd <- data.frame(year = sample(1901:2000, n, replace = TRUE),
date = sample(0:364, n, replace = TRUE),
species = sample(1:5, n, replace = TRUE))
1) base Aggregate dd with the indicated function. No packages are used:
avg10 <- function(date) {
ok <- seq_along(date) <= length(date) / 10
if (any(ok)) mean(date[ok]) else NA
}
aggregate(date ~ species + year, dd, avg10)
2) data.table Here is a data.table solution:
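library(data.table)
# seq_len(.N) indexes rows within each group (.I would be the row number in the whole table)
data.table(dd)[,
{ok <- seq_len(.N) <= .10 * .N; if (any(ok)) mean(date[ok]) else NA_real_}, by = "species,year"]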
Note: If you don't want NA's then use this instead of either of the if statements above to get the first point in that case:
if (any(ok)) mean(date[ok]) else date[1]
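Both versions above take the first 10% of rows in whatever order each group is stored. If "first" is meant as "earliest in the year" and the data are not already sorted by date, a variant that sorts within the group may be closer to the intent. The sketch below makes that assumption; `avg_first_tenth` and the data frame `toy` are hypothetical.

```r
# Mean of the earliest 10% of dates in a group (NA if the group has fewer than 10 rows)
avg_first_tenth <- function(date) {
  k <- floor(length(date) / 10)
  if (k == 0) return(NA_real_)
  mean(sort(date)[seq_len(k)])
}

# Hypothetical toy data: species 1 has 20 unsorted dates, species 2 only 5
toy <- data.frame(species = c(rep(1, 20), rep(2, 5)),
                  year    = 1901,
                  date    = c(20:1, 5:1))

res <- aggregate(date ~ species + year, toy, avg_first_tenth)
res
```

For species 1 the two earliest dates are 1 and 2, so the result is 1.5; species 2 has too few rows and yields NA.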
Just as for your last question, dplyr may work well for you:
Some data:
library(dplyr)
set.seed(42)
n <- 500
dat <- data.frame(date = sample(365, size=n, replace=TRUE),
species = sample(5, size=n, replace=TRUE),
year = 1980 + sample(20, size=n, replace=TRUE))
How it looks without filtering:
dat %>% group_by(year, species) %>% arrange(year, date)
## Source: local data frame [500 x 3]
## Groups: year, species
## date species year
## 1 50 1 1981
## 2 138 1 1981
## 3 174 1 1981
## 4 179 1 1981
## 5 200 1 1981
## 6 332 1 1981
## 7 31 2 1981
## 8 52 2 1981
## 9 196 2 1981
## 10 226 2 1981
## .. ... ... ...
How it looks with the first 10% by date within each year and species:
dat %>%
group_by(year, species) %>%
filter(ntile(date, 10) == 1) %>%
arrange(year, date)
## Source: local data frame [100 x 3]
## Groups: year, species
## date species year
## 1 50 1 1981
## 2 31 2 1981
## 3 63 3 1981
## 4 112 4 1981
## 5 1 5 1981
## 6 40 1 1982
## 7 103 2 1982
## 8 40 3 1982
## 9 86 4 1982
## 10 48 5 1982
## .. ... ... ...
I think the ntile trick is doing what you want: it breaks the data into roughly equal-sized bins, so it should be giving you the lowest 10% of your dates.
EDIT
Sorry, I missed the mean in there:
dat %>% group_by(year, species) %>%
filter(ntile(date, 10) == 1) %>%
summarise(date = mean(date)) %>%
arrange(year, date)
## Source: local data frame [99 x 3]
## Groups: year
## year species date
## 1 1981 5 1
## 2 1981 2 31
## 3 1981 1 50
## 4 1981 3 63
## 5 1981 4 112
## 6 1982 1 40
## 7 1982 3 40
## 8 1982 5 48
## 9 1982 4 86
## 10 1982 2 103
## .. ... ... ...