R: isolation of initial 10% - r

My data frame has 10 columns and 100,000 rows; each row is an observation and the columns hold data pertaining to that observation. One of the columns holds the date of the observation as a Julian day (i.e. Feb 4 = day 34). I want to reduce my data set to the first 10% of observations PER year PER species. I.e., for species 1 in the year 1901 I want the average day of appearance based on the first 10% of observations.
Example of what I have (note: id is the species encoded as a number, e.g. blue = 1):
date=c(3,84,98,100,34,76,86...)
species=c(blue,purple,grey,purple,green,pink,pink,white...)
id=c(1,2,3,2,4,5,5,6...)
year=c(1901,2000,1901,1996,1901,2000,1986...)
habitat=c(forest,plain,mountain...)
etc.
What I want:
date=c(3,84,76,86...)
species=c(purple,pink,pink, white...)
id=c(2,5,5,6...)
year=c(1901,2000,2000,1986...)
habitat=c(forest,plain,mountain...)
new=c(3,84,79,86...)

Assuming the data set dd defined below
set.seed(123)
n <- 100000
dd <- data.frame(year = sample(1901:2000, n, replace = TRUE),
date = sample(0:364, n, replace = TRUE),
species = sample(1:5, n, replace = TRUE))
1) base Aggregate dd with the indicated function. No packages are used:
avg10 <- function(date) {
ok <- seq_along(date) <= length(date) / 10
if (any(ok)) mean(date[ok]) else NA
}
aggregate(date ~ species + year, dd, avg10)
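2) data.table Here is a data.table solution. The row number within each group is seq_len(.N); .I would give row positions in the whole table instead:
library(data.table)
data.table(dd)[,
{ok <- seq_len(.N) <= .10 * .N; if (any(ok)) mean(date[ok]) else NA_real_}, by = "species,year"]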
Note: If you don't want NA's then use this instead of either of the if statements above to get the first point in that case:
if (any(ok)) mean(date[ok]) else date[1]
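Both solutions above take the first 10% of rows in the order they appear within each group. If "first 10%" is meant chronologically, the dates need to be sorted within each group first. A minimal base R sketch under that assumption follows; avg_earliest10 and the smaller dd are made up for illustration, and groups with fewer than 10 rows fall back to the single earliest date:

```r
set.seed(123)
n <- 1000
dd <- data.frame(year = sample(1901:1905, n, replace = TRUE),
                 date = sample(0:364, n, replace = TRUE),
                 species = sample(1:2, n, replace = TRUE))

# Mean of the earliest 10% of dates in one group.
# Hypothetical helper: always uses at least one observation.
avg_earliest10 <- function(date) {
  k <- max(1, floor(length(date) / 10))
  mean(sort(date)[seq_len(k)])
}

# One row per species/year combination
res <- aggregate(date ~ species + year, dd, avg_earliest10)
head(res)
```

Sorting inside the function keeps the call to aggregate identical to the one in solution 1, so only the helper needs to change.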

Just as for your last question, dplyr may work well for you:
Some data:
library(dplyr)
set.seed(42)
n <- 500
dat <- data.frame(date = sample(365, size=n, replace=TRUE),
species = sample(5, size=n, replace=TRUE),
year = 1980 + sample(20, size=n, replace=TRUE))
How it looks without filtering:
dat %>% group_by(year, species) %>% arrange(year, date)
## Source: local data frame [500 x 3]
## Groups: year, species
## date species year
## 1 50 1 1981
## 2 138 1 1981
## 3 174 1 1981
## 4 179 1 1981
## 5 200 1 1981
## 6 332 1 1981
## 7 31 2 1981
## 8 52 2 1981
## 9 196 2 1981
## 10 226 2 1981
## .. ... ... ...
How it looks with the first 10% by date within each year:
dat %>%
group_by(year, species) %>%
filter(ntile(date, 10) == 1) %>%
arrange(year, date)
## Source: local data frame [100 x 3]
## Groups: year, species
## date species year
## 1 50 1 1981
## 2 31 2 1981
## 3 63 3 1981
## 4 112 4 1981
## 5 1 5 1981
## 6 40 1 1982
## 7 103 2 1982
## 8 40 3 1982
## 9 86 4 1982
## 10 48 5 1982
## .. ... ... ...
I think the ntile trick is doing what you want: it breaks the data into roughly equal-sized bins, so it should be giving you the lowest 10% of your dates.
EDIT
Sorry, I missed the mean in there:
dat %>% group_by(year, species) %>%
filter(ntile(date, 10) == 1) %>%
summarise(date = mean(date)) %>%
arrange(year, date)
## Source: local data frame [99 x 3]
## Groups: year
## year species date
## 1 1981 5 1
## 2 1981 2 31
## 3 1981 1 50
## 4 1981 3 63
## 5 1981 4 112
## 6 1982 1 40
## 7 1982 3 40
## 8 1982 5 48
## 9 1982 4 86
## 10 1982 2 103
## .. ... ... ...
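To see what the ntile filter keeps, here is a small sketch on made-up vectors (dates and small are hypothetical, not taken from dat). With 20 distinct values and 10 bins, the first bin holds the two smallest values. With fewer than 10 values in a group, the first bin holds only the minimum, which is why the filtered output above has one row per group:

```r
library(dplyr)

# 20 distinct day-of-year values in arbitrary order
dates <- c(200, 31, 150, 12, 98, 310, 45, 77, 260, 5,
           120, 64, 180, 222, 17, 301, 88, 139, 250, 9)
bins <- ntile(dates, 10)           # bin 1 = lowest tenth
earliest <- sort(dates[bins == 1])
earliest                           # the two smallest dates

# A group with only three observations
small <- c(50, 138, 174)
small[ntile(small, 10) == 1]       # only the minimum survives the filter
```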

Related

R splitting a column into two separate columns

I have a binary variable in my dataframe (people are older than 65 years or not) and the total amount of people for the two groups for several years.
I would like to have only one row for each year, so one column would show the amount of people older than 65 and one column the amount of people younger than 65.
How can I split my column with the binary variable up to make two columns out of it?
Thank you very much for your answer.
Is this what you're looking for?
library(dplyr)
library(tidyr)
dat <- data.frame(over65 = rep(c(0,1), 5),
year = rep(2016:2020, each=2),
n=round(runif(10, 100, 200)))
dat
# over65 year n
# 1 0 2016 176
# 2 1 2016 109
# 3 0 2017 133
# 4 1 2017 142
# 5 0 2018 150
# 6 1 2018 110
# 7 0 2019 127
# 8 1 2019 138
# 9 0 2020 151
# 10 1 2020 159
dat %>% pivot_wider(names_from="over65", values_from="n", names_prefix="over65_")
# # A tibble: 5 x 3
# year over65_0 over65_1
# <int> <dbl> <dbl>
# 1 2016 176 109
# 2 2017 133 142
# 3 2018 150 110
# 4 2019 127 138
# 5 2020 151 159
Here is a data.table approach:
# H/t to DaveArmstrong for the data!
dat <- data.frame(over65 = rep(c(0,1), 5),
year = rep(2016:2020, each=2),
n=round(runif(10, 100, 200)))
library(data.table)
setDT(dat)
dcast(dat, year ~ paste0("over65_", over65), value.var = "n", fun.aggregate = sum)
#> year over65_0 over65_1
#> 1: 2016 146 159
#> 2: 2017 134 120
#> 3: 2018 164 113
#> 4: 2019 185 163
#> 5: 2020 180 114
Created on 2021-03-18 by the reprex package (v0.3.0)
We can use xtabs in base R
xtabs(n ~ year + over65, dat)
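xtabs returns a contingency table rather than a data frame. If a data frame with one row per year is needed, one possible conversion is sketched below. The values are copied from the first three years of the example output shown earlier, and the over65_ prefix is only chosen to match the other answers:

```r
dat <- data.frame(over65 = rep(c(0, 1), 3),
                  year = rep(2016:2018, each = 2),
                  n = c(176, 109, 133, 142, 150, 110))

tab <- xtabs(n ~ year + over65, dat)

# Turn the two-way table into a plain data frame, one row per year
wide <- as.data.frame.matrix(tab)
names(wide) <- paste0("over65_", names(wide))
wide <- cbind(year = as.integer(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```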
library(dplyr)
dat1 <- dat %>%
group_by(year) %>%
summarize(older_65 = case_when(over65==1 ~ n),
younger_65 = case_when(over65==0 ~ n)) %>%
mutate(older_65=lead(older_65)) %>%
na.omit()
data: borrowed from DaveArmstrong
set.seed(123)
dat <- data.frame(over65 = rep(c(0,1), 5),
year = rep(2016:2020, each=2),
n=round(runif(10, 100, 200)))

Count number of times in a month and year that time series data is above a threshold

I have a large dataframe in R with daily time series data of rainfall for a number of locations (each in their own column). I would like to know the number of times the rainfall is less than, or is greater than a threshold value for each location in each month and also by year.
My dataframe is large so I have provided example data here:
Date_ex <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-31'),by = 1)
A <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
B <- sample(x = c(1, 2, 10), size = 1858, replace = TRUE)
C <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
D <- sample(x = c(1, 3, 4), size = 1858, replace = TRUE)
df <- data.frame(Date_ex, A, B, C, D)
How would I find out the number of times the value in A, B, C and D is greater than 4 for each month, and then also for each year?
I think I should then be able to summarise this into two new tables.
One like this (example, ignore numbers):
A B C D
2000-01 1 0 5 0
2000-02 2 16 25 0
2000-03 1 5 26 0
And one like this (example, ignore numbers):
A B C D
2000 44 221 67 0
2001 67 231 4 132
2002 99 111 66 4
2003 33 45 45 4
I think I should be using dplyr for this? But I'm not sure how to get the dates to work.
A solution using the dplyr and lubridate packages. The key is to create Year and Month columns, group by those columns, and use summarise_all to summarize the data.
# Create the example data frame, set the seed for reproducibility
set.seed(199)
Date_ex <- seq.Date(as.Date('2000-01-01'),as.Date('2005-01-31'),by = 1)
A <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
B <- sample(x = c(1, 2, 10), size = 1858, replace = TRUE)
C <- sample(x = c(1, 3, 5), size = 1858, replace = TRUE)
D <- sample(x = c(1, 3, 4), size = 1858, replace = TRUE)
df <- data.frame(Date_ex, A, B, C, D)
library(dplyr)
library(lubridate)
# Summarise for each year and month
df2 <- df %>%
mutate(Year = year(Date_ex), Month = month(Date_ex)) %>%
select(-Date_ex) %>%
group_by(Year, Month) %>%
summarise_all(~ sum(. > 4)) %>%
ungroup()
df2
# # A tibble: 61 x 6
# Year Month A B C D
# <dbl> <dbl> <int> <int> <int> <int>
# 1 2000 1 13 8 13 0
# 2 2000 2 12 7 8 0
# 3 2000 3 7 9 9 0
# 4 2000 4 9 12 10 0
# 5 2000 5 11 12 8 0
# 6 2000 6 12 9 16 0
# 7 2000 7 10 11 10 0
# 8 2000 8 8 12 14 0
# 9 2000 9 12 12 12 0
# 10 2000 10 9 9 7 0
# # ... with 51 more rows
# Summarise for each year
df3 <- df %>%
mutate(Year = year(Date_ex)) %>%
select(-Date_ex) %>%
group_by(Year) %>%
summarise_all(~ sum(. > 4))
df3
# # A tibble: 6 x 5
# Year A B C D
# <dbl> <int> <int> <int> <int>
# 1 2000 120 119 125 0
# 2 2001 119 123 113 0
# 3 2002 135 122 105 0
# 4 2003 114 112 104 0
# 5 2004 115 125 124 0
# 6 2005 9 14 11 0
Here are a few solutions.
1) aggregate This solution uses only base R. The new Date column is the date for the first of the month or first of the year.
aggregate(df[-1] > 4, list(Date = as.Date(cut(df[[1]], "month"))), sum)
aggregate(df[-1] > 4, list(Date = as.Date(cut(df[[1]], "year"))), sum)
1a) Using yearmon class from zoo and toyear from (3) we can write:
library(zoo)
aggregate(df[-1] > 4, list(Date = as.yearmon(df[[1]])), sum)
aggregate(df[-1] > 4, list(Date = toyear(df[[1]])), sum)
2) rowsum This is another base R solution. The year/month or year is given by the row names.
rowsum((df[-1] > 4) + 0, format(df[[1]], "%Y-%m"))
rowsum((df[-1] > 4) + 0, format(df[[1]], "%Y"))
2a) Using yearmon class from zoo and toyear from (3) we can write:
library(zoo)
rowsum((df[-1] > 4) + 0, as.yearmon(df[[1]]))
rowsum((df[-1] > 4) + 0, toyear(df[[1]]))
3) aggregate.zoo Convert to a zoo object and use aggregate.zoo. Note that yearmon class internally represents a year and month as the year plus 0 for Jan, 1/12 for Feb, 2/12 for March, etc. so taking the integer part gives the year.
library(zoo)
z <- read.zoo(df)
aggregate(z > 4, as.yearmon, sum)
toyear <- function(x) as.integer(as.yearmon(x))
aggregate(z > 4, toyear, sum)
The result is a zoo time series with a yearmon index in the first case and an integer index in the second. If you want a data frame use fortify.zoo(ag) where ag is the result of aggregate.
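The arithmetic behind toyear can be worked through without zoo. The sketch below only mimics the year-plus-fraction representation described in (3) using base R; it is not zoo's implementation, and ym is a made-up name:

```r
dates <- as.Date(c("2000-01-15", "2000-02-01", "2000-12-31", "2001-03-10"))

yr <- as.integer(format(dates, "%Y"))
mo <- as.integer(format(dates, "%m"))

# Year plus 0 for Jan, 1/12 for Feb, ..., 11/12 for Dec
ym <- yr + (mo - 1) / 12
ym

# Truncating to the integer part recovers the year
as.integer(ym)
```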
4) dplyr toyear is from (3).
library(dplyr)
library(zoo)
df %>%
group_by(YearMonth = as.yearmon(Date_ex)) %>%
select(-Date_ex) %>%
summarize_all(~ sum(. > 4)) %>%
ungroup
df %>%
group_by(Year = toyear(Date_ex)) %>%
select(-Date_ex) %>%
summarize_all(~ sum(. > 4)) %>%
ungroup
Data.table is missing so I'm adding this. Comments are in the code. I used set.seed(1) to generate the samples.
library(data.table)
setDT(df)
# add year and month to df
df[, `:=`(month = month(Date_ex),
year = year(Date_ex))]
# monthly returns, remove date_ex
monthly_dt <- df[,lapply(.SD, function(x) sum(x > 4)), by = .(year, month), .SDcols = -("Date_ex")]
year month A B C D
1: 2000 1 10 10 11 0
2: 2000 2 10 11 8 0
3: 2000 3 11 11 11 0
4: 2000 4 10 11 8 0
5: 2000 5 7 10 8 0
6: 2000 6 9 6 7 0
.....
# yearly returns, remove Date_ex and month
yearly_dt <- df[,lapply(.SD, function(x) sum(x > 4)), by = .(year), .SDcols = -c("Date_ex", "month")]
year A B C D
1: 2000 114 118 113 0
2: 2001 127 129 120 0
3: 2002 122 108 126 0
4: 2003 123 128 125 0
5: 2004 123 132 131 0
6: 2005 14 15 15 0
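The question also asks for counts below a threshold. The same grouping pattern works with the comparison flipped. A small base R sketch on a hand-made data frame (four rows, so the counts can be checked by eye; month_key, above and below are illustrative names):

```r
df <- data.frame(
  Date_ex = as.Date(c("2000-01-01", "2000-01-02", "2000-02-01", "2001-01-01")),
  A = c(5, 1, 5, 3),
  B = c(10, 10, 2, 1)
)

# Grouping key: year and month as text
month_key <- format(df$Date_ex, "%Y-%m")

# Adding 0 turns the logical matrix into numbers so rowsum can add them up
above <- rowsum((df[-1] > 4) + 0, month_key)  # days above the threshold
below <- rowsum((df[-1] < 4) + 0, month_key)  # days below the threshold
above
below
```

Replacing the format string with "%Y" gives the yearly counts in the same way.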

count and listing all factor levels of all factors

I have a data frame in R like this:
D I S ...
110 2012 1000
111 2012 2000
110 2012 1000
111 2014 2000
110 2013 1000
111 2013 2000
I want to count how many times each level occurs for each factor and save the result in a data frame like this:
D Count I Count S Count ...
110 3 2012 3 1000 3
111 3 2013 2 2000 3
2014 1
or this:
D Count
110 3
111 3
I Count
2012 3
2013 2
2014 1
S Count
1000 3
2000 3
....
I tried to do it with sapply, levels, the library(dplyr) or aggregate, but it does not produce the desired output. How can I do that?
I think the most efficient way to do it, in terms of length of code and storing final output in a tidy format is this:
library(tidyverse)
# example data
data <- data.frame(D = rep(c("110", "111"), 3),
I = c(rep("2012", 3), "2014", "2013", "2013"),
S = rep(c("1000", "2000"), 3))
data %>%
gather(name,value) %>% # reshape dataset
count(name, value) # count combinations
# # A tibble: 7 x 3
# name value n
# <chr> <chr> <int>
# 1 D 110 3
# 2 D 111 3
# 3 I 2012 3
# 4 I 2013 2
# 5 I 2014 1
# 6 S 1000 3
# 7 S 2000 3
The 1st column holds the name of your factor variable.
The 2nd column has the unique values of each variable.
The 3rd column is the count.
Here is a solution using data.table
data <- data.frame(D = rep(c("110", "111"), 3),
I = c(rep("2012", 3), "2014", "2013", "2013"),
S = rep(c("1000", "2000"), 3))
str(data)
# you just want
table(data$D)
table(data$I)
table(data$S)
# one option using data.table
require(data.table)
dt <- as.data.table(data)
dt # see dt
dt[, table(D)] # or dt[, .N, by = D], for one variable
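cols <- copy(names(dt)) # copy, because names(dt) changes as columns are added by reference
# add one <column>_Count column per variable, holding the number of rows for each level
for (col in cols) dt[, paste0(col, "_Count") := .N, by = col]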
dt # new dt
data2 <- as.data.frame(dt)[, sort(names(dt))]
data2 # final data frame
And a dplyr one for the second output:
counts <- data %>%
lapply(table) %>%
lapply(as.data.frame)
counts
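If a single long table is preferred over a list of per-variable tables, the pieces can be stacked. A base R sketch, assuming the same example data as above (the column names name, value and n are chosen here only to mirror the tidyverse output):

```r
data <- data.frame(D = rep(c("110", "111"), 3),
                   I = c(rep("2012", 3), "2014", "2013", "2013"),
                   S = rep(c("1000", "2000"), 3))

# One small data frame of counts per column
counts <- lapply(data, function(x) {
  as.data.frame(table(value = x), responseName = "n",
                stringsAsFactors = FALSE)
})

# Label each piece with its column name and stack them
long <- do.call(rbind, Map(function(nm, d) cbind(name = nm, d),
                           names(counts), counts))
rownames(long) <- NULL
long
```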
I think the easy way is by using the "plyr" R-library.
library(plyr)
count(data$D)
count(data$I)
count(data$S)
It will give you
> count(data$D)
x freq
1 110 3
2 111 3
> count(data$I)
x freq
1 2012 3
2 2013 2
3 2014 1
> count(data$S)
x freq
1 1000 3
2 2000 3

expand a data frame from the min to the max value of each column

The reproducible data below contains random values for 2 covariates (cov1 and cov2), 2 animals (Cat and Dog) and 2 seasons (Summer and Winter).
library(dplyr); library(tidyr)
set.seed(123)
dat <- data.frame(Season = rep(c("Summer", "Winter"), each = 100),
Species = rep(c("Cat", "Dog", "Cat", "Dog"), each = 50),
cov1 = sample(1:100, 200, replace = TRUE),
cov2 = sample(1:100, 200, replace = TRUE))
head(dat)
Season Species cov1 cov2
1 Summer Cat 29 24
2 Summer Cat 79 97
3 Summer Cat 41 61
4 Summer Cat 89 52
5 Summer Cat 95 41
6 Summer Cat 5 89
I want to create a new df that contains a sequence from the min to the max value for each Season/Species combination. My initial thought was to first use dplyr to identify the min and max values.
RangeDat <- dat %>% group_by(Season, Species) %>%
summarise_each(funs(min, max)) %>%
as.data.frame()
> RangeDat
Season Species cov1_min cov2_min cov1_max cov2_max
1 Summer Cat 3 5 100 97
2 Summer Dog 1 1 99 99
3 Winter Cat 2 1 99 100
4 Winter Dog 12 2 99 100
From here I am not sure how to expand the df. Ideally the result df would have 4 columns (Season, Species, cov1, cov2). The values for cov1 and cov2 would range from the min to the max value for each Season/Species combination. Like the initial dat df, the values for Season and Species would repeat down the df for the increasing values of cov1 and cov2.
In reference to the comments, is it possible to include an NA value where the length of a Species/Season combination is less than the 'maximum' range?
Any suggestions are greatly appreciated!
We can summarise in a list
library(dplyr)
dat %>%
group_by(Season, Species) %>%
summarise(cov1 = list(min(cov1):max(cov1)), cov2 = list(min(cov2):max(cov2)))
or with data.table
library(data.table)
setDT(dat)[, .(cov1 = list(min(cov1):max(cov1)),
cov2 = list(min(cov2):max(cov2))), by = .(Season, Species)]
Update
As the OP asked about keeping the two sequences the same length by padding the shorter one with NA, one option with dplyr would be
f1 <- function(x1, x2){
x1 <- min(x1):max(x1)
x2 <- min(x2):max(x2)
m1 <- max(c(length(x1), length(x2)))
length(x1) <- m1
length(x2) <- m1
list(cov1 = x1, cov2 = x2)
}
dat %>%
group_by(Season, Species) %>%
do(data.frame(Season = .$Season[1], Species = .$Species[1], f1(.$cov1, .$cov2)))
# A tibble: 396 x 4
# Groups: Season, Species [4]
# Season Species cov1 cov2
# <fctr> <fctr> <int> <int>
# 1 Summer Cat 3 5
# 2 Summer Cat 4 6
# 3 Summer Cat 5 7
# 4 Summer Cat 6 8
# 5 Summer Cat 7 9
# 6 Summer Cat 8 10
# 7 Summer Cat 9 11
# 8 Summer Cat 10 12
# 9 Summer Cat 11 13
#10 Summer Cat 12 14
# ... with 386 more rows
and the possible extension with data.table would be
setDT(dat)[, f1(cov1, cov2), .(Season, Species)]
# Season Species cov1 cov2
# 1: Summer Cat 3 5
# 2: Summer Cat 4 6
# 3: Summer Cat 5 7
# 4: Summer Cat 6 8
# 5: Summer Cat 7 9
# ---
#392: Winter Dog NA 96
#393: Winter Dog NA 97
#394: Winter Dog NA 98
#395: Winter Dog NA 99
#396: Winter Dog NA 100
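The padding in f1 relies on the fact that assigning a larger length to a vector extends it with NA. A tiny standalone sketch with made-up vectors shows the mechanism:

```r
x1 <- 3:5   # shorter sequence
x2 <- 1:6   # longer sequence

m <- max(length(x1), length(x2))
length(x1) <- m   # x1 is extended with NA up to length 6
length(x2) <- m   # already length 6, so unchanged

# Equal lengths, so the two vectors now fit side by side
padded <- data.frame(cov1 = x1, cov2 = x2)
padded
```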

Using index to reference column in summarise() in dplyr - R

I would like to reference a column inside the summarise() in dplyr with its index rather than with its name. For example:
> a
id visit timepoint bedroom den
1 0 0 62 NA
2 1 0 53 6.00
3 2 0 56 2.75
4 0 1 55 NA
5 1 2 61 NA
6 2 0 54 NA
7 0 1 58 2.75
8 1 2 59 NA
9 2 2 60 NA
10 0 1 57 NA
# E.g.
a %>% group_by(visit) %>% summarise(avg.bedroom = mean(bedroom, na.rm = T))
# Returns
visit avg.bedroom
<dbl> <dbl>
1 0 4.375
2 1 2.750
3 2 NaN
How could I use the index of column "bedroom" rather than its name in the summarise clause? I tried:
a %>% group_by(visit) %>% summarise("4" = mean(.[[4]], na.rm = T))
but this returned false results:
visit `4`
<dbl> <dbl>
1 0 3.833333
2 1 3.833333
3 2 3.833333
Is my objective achievable and if yes how? Thank you.
Perhaps not exactly what you're looking for, but one option would be to use purrr rather than dplyr. Something like
# Read in data
d <- read.table(textConnection(" id visit timepoint bedroom den
1 12 0 62 NA
2 14 0 53 6.00
3 14 0 56 2.75
4 14 1 55 NA
5 14 2 61 NA
6 15 0 54 NA
7 15 1 58 2.75
8 16 2 59 NA
9 16 2 60 NA
10 17 1 57 NA "),
header = TRUE)
library(purrr)
d %>%
split(.$timepoint) %>%
map_dbl(function(x) mean(x[ ,5], na.rm = TRUE))
# 0 1 2
# 4.375 2.750 NaN
Or, with base
aggregate(d[ ,5] ~ timepoint, data = d, mean)
# timepoint d[, 5]
# 1 0 4.375
# 2 1 2.750
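The attempt in the question returns the same value for every group because `.` refers to the whole data frame passed through the pipe, not to the current group. A base R sketch that selects the column by position instead is shown below; the data frame is a cut-down, made-up version of a, and idx is a hypothetical variable holding the column position:

```r
a <- data.frame(visit = c(0, 1, 2, 0, 1, 2),
                bedroom = c(62, 53, 56, 55, 61, 54),
                den = c(NA, 6, 2.75, NA, NA, NA))

idx <- 2  # position of "bedroom" in this small example

# a[idx] keeps the column as a one-column data frame, so its name is preserved
res <- aggregate(a[idx], by = a["visit"], FUN = mean, na.rm = TRUE)
res
```

Because the column is picked with single brackets, the result keeps the original column name without it ever being typed.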
The answer I found is the summarize_at() function of dplyr. Here is how I used summarize_at() to create summary statistics on subsets of my dataframe where the columns were not known in advance (object is my original dataframe which is in a long form and has a column -- room -- that contains the names of the rooms, as well as two other columns, "visit" and "value"):
# Convert object to a wide form
object$row <- 1 : nrow(object)
y <- spread(object, room, value)
# Remove the row column from y
y <- y %>% select(-row)
# Initialize stat1, the dataframe with the summary
# statistics
stat1 <- data.frame(visit = c(0, 1, 2))
# Find the number of columns that stat1 will eventually
# have
y <- y %>% filter(id == id) %>%
select_if(function(col) mean(is.na(col)) != 1)
n <- ncol(y)
# Append columns with summary statistics to stat1
for (i in 3 : n) {
t <- y %>% group_by(visit) %>%
summarise_at(c(i), mean, na.rm = T)
t[, 2] <- round(t[, 2], 2)
stat1 <- cbind(stat1, t[, 2])
}
# Pass the dataframe stat1 to the list "results"
results$stat1 <- stat1
