Calculating rate of change over rows in R in multiple large datasets

I currently work with multiple large datasets that have the same number of rows but different numbers of columns. I need to calculate the rate of change between columns and add it either to a new object or to the existing object so I can go on with my analysis.
In my research on the web I mostly encountered people calculating the rate of change within a column, not between columns. Is the easiest way to just flip (transpose) all my data?
I am very sorry for the vague description of my problem; neither R nor English is my first language.
I hope you can still point me in the right direction to further my understanding of R.
Thank you in advance for any tips you might have!

I recommend joining all the data together and then converting it into a normalized (3NF) long-format table:
library(tidyverse)
data1 <- tibble(
country = c("A", "B", "C"),
gdp_2020 = c(1, 8, 10),
gdp_2021 = c(1, 8, 10),
population_2010 = c(5e3, 6e3, 6e3),
population_2020 = c(5.5e3, 6.8e3, 6e3)
)
data1
#> # A tibble: 3 x 5
#> country gdp_2020 gdp_2021 population_2010 population_2020
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 1 5000 5500
#> 2 B 8 8 6000 6800
#> 3 C 10 10 6000 6000
data2 <- tibble(
country = c("A", "B", "C"),
population_2021 = c(7e3, 8e3, 7e3),
population_2022 = c(7e3, 7e3, 10e3)
)
data2
#> # A tibble: 3 x 3
#> country population_2021 population_2022
#> <chr> <dbl> <dbl>
#> 1 A 7000 7000
#> 2 B 8000 7000
#> 3 C 7000 10000
list(
data1,
data2
) %>%
reduce(full_join) %>%
pivot_longer(matches("^(gdp|population)")) %>%
separate(name, into = c("variable", "year"), sep = "_") %>%
type_convert() %>%
arrange(country, variable, year) %>%
group_by(variable, country) %>%
mutate(
# NA for the first value because it does not have a precursor to calculate change
change_rate = (value - lag(value)) / (year - lag(year))
)
#> Joining, by = "country"
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> country = col_character(),
#> variable = col_character(),
#> year = col_double()
#> )
#> # A tibble: 18 x 5
#> # Groups: variable, country [6]
#> country variable year value change_rate
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 A gdp 2020 1 NA
#> 2 A gdp 2021 1 0
#> 3 A population 2010 5000 NA
#> 4 A population 2020 5500 50
#> 5 A population 2021 7000 1500
#> 6 A population 2022 7000 0
#> 7 B gdp 2020 8 NA
#> 8 B gdp 2021 8 0
#> 9 B population 2010 6000 NA
#> 10 B population 2020 6800 80
#> 11 B population 2021 8000 1200
#> 12 B population 2022 7000 -1000
#> 13 C gdp 2020 10 NA
#> 14 C gdp 2021 10 0
#> 15 C population 2010 6000 NA
#> 16 C population 2020 6000 0
#> 17 C population 2021 7000 1000
#> 18 C population 2022 10000 3000
Created on 2021-12-16 by the reprex package (v2.0.1)
Example: the rate of change in the second row (gdp of country A in 2021) is 0 because the value was the same in 2020 and 2021.
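On the "do I have to flip my data?" part of the question: the rate of change between adjacent year columns can also be computed on the wide table itself. The following is a minimal base R sketch, not the approach used in the answer; it assumes every value column carries a hypothetical `population_` prefix followed by the year, and it reuses the population figures from the example data.

```r
# Wide data: one row per country, one column per year
pop <- data.frame(
  country = c("A", "B", "C"),
  population_2010 = c(5000, 6000, 6000),
  population_2020 = c(5500, 6800, 6000),
  population_2021 = c(7000, 8000, 7000),
  population_2022 = c(7000, 7000, 10000)
)

# Pick the value columns and read the year out of each column name
value_cols <- grep("^population_", names(pop), value = TRUE)
years <- as.numeric(sub("^population_", "", value_cols))
values <- as.matrix(pop[value_cols])

# Difference between each column and the column before it
delta <- values[, -1, drop = FALSE] - values[, -ncol(values), drop = FALSE]

# Divide each column of differences by the number of years it spans
rate <- sweep(delta, 2, diff(years), "/")
colnames(rate) <- paste0("rate_", years[-1])

result <- cbind(pop, rate)
result
```

For country A this gives 50, 1500 and 0, the same values as the change_rate column of the long-format result. The long format is still easier to extend once several variables (gdp, population, ...) with different sets of years are involved.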

Related

Filter by value counts within groups

I want to filter my grouped data frame based on the number of occurrences of a specific value within each group.
Some example data:
data <- data.frame(ID = sample(c("A","B","C","D"),100,replace = T),
rt = runif(100,0.2,1),
lapse = sample(1:2,100,replace = T))
The “lapse” column is my filter variable in this case.
I want to exclude every “ID” group that has more than 15 rows with “lapse” == 2.
data %>% group_by(ID) %>% count(lapse == 2)
So if, for example, group “A” contains 17 rows with “lapse” == 2, it should be removed entirely from the data frame.
First I created some reproducible data using set.seed() and checked the number of values per group. In this case only group D has more than 15 rows with lapse == 2. You can use filter() and sum the rows with lapse == 2 per group like this:
set.seed(7)
data <- data.frame(ID = sample(c("A","B","C","D"),100,replace = T),
rt = runif(100,0.2,1),
lapse = sample(1:2,100,replace = T))
library(dplyr)
# Check n values per group
data %>%
group_by(ID, lapse) %>%
summarise(n = n())
#> # A tibble: 8 × 3
#> # Groups: ID [4]
#> ID lapse n
#> <chr> <int> <int>
#> 1 A 1 8
#> 2 A 2 7
#> 3 B 1 13
#> 4 B 2 15
#> 5 C 1 18
#> 6 C 2 6
#> 7 D 1 17
#> 8 D 2 16
data %>%
group_by(ID) %>%
filter(!(sum(lapse == 2) > 15))
#> # A tibble: 67 × 3
#> # Groups: ID [3]
#> ID rt lapse
#> <chr> <dbl> <int>
#> 1 B 0.517 2
#> 2 C 0.589 1
#> 3 C 0.598 2
#> 4 C 0.715 1
#> 5 B 0.475 2
#> 6 C 0.965 1
#> 7 B 0.234 1
#> 8 B 0.812 2
#> 9 C 0.517 1
#> 10 B 0.700 1
#> # … with 57 more rows
Created on 2023-01-08 with reprex v2.0.2
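The same idea, counting the matching rows per group and keeping only the groups under the limit, can be sketched without dplyr. This is a small base R illustration on made-up data; the data set and the threshold `max_lapses` are hypothetical and chosen so the result is easy to check by hand.

```r
# Small hypothetical data set: group A has three lapses, group B has one
data <- data.frame(
  ID    = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  lapse = c(2, 2, 2, 1, 1, 2, 1, 1, 1)
)
max_lapses <- 2  # hypothetical threshold (the question uses 15)

# ave() returns, for every row, the number of lapse == 2 rows in that row's group
n_lapse <- ave(as.integer(data$lapse == 2), data$ID, FUN = sum)

# Keep only rows from groups that do not exceed the threshold
kept <- data[n_lapse <= max_lapses, ]
kept
```

Group A has three rows with lapse == 2 and is dropped as a whole; group B has one and is kept.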

Species list: long to wide based on date and temperature average in R

I have a list of species with details on temperature and other variables that needs to be in a wide format for analysis. I need one row per date and site, with all other variables, like temperature, averaged per date within each site. Find a diagram of what I need below.
I have found ways to do the reshaping, but I cannot find a way to get the other variables, for example temperature or cloud cover, averaged by day.
I hope someone can help me.
Replace the values in the columns you want to average with the averages
before pivoting. Your example data doesn’t illustrate the problem of having
varying temps within a site/date, so I modified the data a bit:
library(tidyverse)
tbl <- tibble(
site = c("A", "A", "A", "B", "B", "B"),
sp = c("sp1", "sp1", "sp2", "sp1", "sp3", "sp4"),
day = c(1, 1, 2, 2, 2, 3),
temp = c(17, 20, 16, 18, 18, 20)
)
tbl
#> # A tibble: 6 x 4
#> site sp day temp
#> <chr> <chr> <dbl> <dbl>
#> 1 A sp1 1 17
#> 2 A sp1 1 20
#> 3 A sp2 2 16
#> 4 B sp1 2 18
#> 5 B sp3 2 18
#> 6 B sp4 3 20
And here’s the averaging step added:
tbl |>
group_by(site, day) |>
mutate(across(where(is.numeric), mean)) |>
pivot_wider(
names_from = sp,
values_from = sp,
values_fn = length,
values_fill = 0L
)
#> # A tibble: 4 x 7
#> # Groups: site, day [4]
#> site day temp sp1 sp2 sp3 sp4
#> <chr> <dbl> <dbl> <int> <int> <int> <int>
#> 1 A 1 18.5 2 0 0 0
#> 2 A 2 16 0 1 0 0
#> 3 B 2 18 1 0 1 0
#> 4 B 3 20 0 0 0 1
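The two steps (average the measurements per site/day, then count species per site/day) can also be written separately, which makes it easier to see what each one contributes. The following base R sketch uses the same modified data; the combined `site_day` key is an assumption introduced only for this illustration.

```r
tbl <- data.frame(
  site = c("A", "A", "A", "B", "B", "B"),
  sp   = c("sp1", "sp1", "sp2", "sp1", "sp3", "sp4"),
  day  = c(1, 1, 2, 2, 2, 3),
  temp = c(17, 20, 16, 18, 18, 20)
)

# One key per site/day combination (the "site_day" format is an assumption)
key <- paste(tbl$site, tbl$day, sep = "_")

# Step 1: mean temperature per site/day
temp_mean <- tapply(tbl$temp, key, mean)

# Step 2: species counts per site/day, one column per species
counts <- as.data.frame.matrix(table(key, tbl$sp))

# Combine both, aligning the means to the row order of the counts
wide <- data.frame(
  key  = rownames(counts),
  temp = as.vector(temp_mean[rownames(counts)]),
  counts,
  row.names = NULL
)
wide
```

The temp and sp1 to sp4 columns agree with the pivot_wider() result: for example, site A on day 1 has a mean temperature of 18.5 and two records of sp1.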

Tally if observations fall in date windows

I have a data frame that represents policies with start and end dates. I'm trying to tally the count of policies that are active each month.
library(tidyverse)
ayear <- 2021
amonth <- 10
months <- 12
df <- tibble(
pol = c(1, 2, 3, 4)
, bdate = c('2021-02-23', '2019-12-03', '2020-08-11', '2020-12-14')
, edate = c('2022-02-23', '2020-12-03', '2021-08-11', '2021-06-14')
)
These four policies have a begin date (bdate) and an end date (edate). Beginning in October (amonth) 2021 (ayear) and going back 12 months (months), I'm trying to count how many of the 4 policies were active at some point in each month.
The data frame I'm trying to generate would have three columns (month, year, and active_pol_count) and 12 rows.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
df <- tibble(
pol = c(1, 2, 3, 4),
bdate = c("2021-02-23", "2019-12-03", "2020-08-11", "2020-12-14"),
edate = c("2022-02-23", "2020-12-03", "2021-08-11", "2021-06-14")
)
# transform start and end date to an interval
df <- mutate(df, interval = interval(bdate, edate))
# for the first day of each month from 2020-10 to 2021-09, count the policies active on that day
seq(as.Date("2020-10-01"), as.Date("2021-09-01"), by = "months") %>%
tibble(date = .) %>%
mutate(
year = year(date),
month = month(date),
active_pol_count = date %>% map_dbl(~ .x %within% df$interval %>% sum()),
)
#> # A tibble: 12 x 4
#> date year month active_pol_count
#> <date> <dbl> <dbl> <dbl>
#> 1 2020-10-01 2020 10 2
#> 2 2020-11-01 2020 11 2
#> 3 2020-12-01 2020 12 2
#> 4 2021-01-01 2021 1 2
#> 5 2021-02-01 2021 2 2
#> 6 2021-03-01 2021 3 3
#> 7 2021-04-01 2021 4 3
#> 8 2021-05-01 2021 5 3
#> 9 2021-06-01 2021 6 3
#> 10 2021-07-01 2021 7 2
#> 11 2021-08-01 2021 8 2
#> 12 2021-09-01 2021 9 1
Created on 2021-12-13 by the reprex package (v2.0.1)
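Note that the code above counts policies that are active on the first day of each month. If "active at some point in the month" is meant literally, a policy that starts mid-month should count as well, which is an interval overlap test: the policy must begin on or before the last day of the month and end on or after the first day. A minimal base R sketch of that variant follows; the object names are hypothetical and the month range is the same one used above.

```r
pol <- data.frame(
  bdate = as.Date(c("2021-02-23", "2019-12-03", "2020-08-11", "2020-12-14")),
  edate = as.Date(c("2022-02-23", "2020-12-03", "2021-08-11", "2021-06-14"))
)

# First day of each month from 2020-10 to 2021-09
month_start <- seq(as.Date("2020-10-01"), as.Date("2021-09-01"), by = "month")
# Last day of each month: first day of the following month minus one day
month_end <- seq(as.Date("2020-11-01"), as.Date("2021-10-01"), by = "month") - 1

# A policy overlaps a month if it starts before the month ends
# and ends after the month starts
active_pol_count <- vapply(
  seq_along(month_start),
  function(i) sum(pol$bdate <= month_end[i] & pol$edate >= month_start[i]),
  integer(1)
)

data.frame(
  year  = as.integer(format(month_start, "%Y")),
  month = as.integer(format(month_start, "%m")),
  active_pol_count
)
```

Compared with the first-of-month count, this gives 3 instead of 2 for December 2020 (policy 4 starts on 2020-12-14) and for February 2021 (policy 1 starts on 2021-02-23).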

R classification of a number

I am working in R, but I don't know how to split a number into a series of fields. For example, I want to subdivide the number 20102168056 like this:
2010 -> year
2 -> semester
168 -> university career
056 -> unique number
I tried to do it with an if, but I kept getting more errors. I am new to this and would like to know if you can help me. (By the way, it has to work for any number, such as 20211888070, so I did not use the if I came up with.)
You can use tidyr::separate.
library(tidyverse)
df <- tibble(original = c(20102168056, 20141152013, 20182008006))
df %>%
separate(original, into = c("year", "semester", "university_career", "unique_number"), sep = c(4, 5, 8))
# A tibble: 3 × 4
year semester university_career unique_number
<chr> <chr> <chr> <chr>
1 2010 2 168 056
2 2014 1 152 013
3 2018 2 008 006
You may want to convert some of the columns to an integer:
df %>%
separate(original, into = c("year", "semester", "university_career", "unique_number"), sep = c(4, 5, 8)) %>%
mutate(across(year:unique_number, as.integer))
# A tibble: 3 × 4
year semester university_career unique_number
<int> <int> <int> <int>
1 2010 2 168 56
2 2014 1 152 13
3 2018 2 8 6
We can use stringr::str_match().
library(tidyverse)
data <- c(20102168056, 20102168356)
str_match(data, '^(\\d{4})(\\d{1})(\\d{3})(\\d{3})') %>%
as.data.frame() %>%
set_names(c('value', 'year', 'semester', 'university_career', 'unique_number'))
#> value year semester university_career unique_number
#> 1 20102168056 2010 2 168 056
#> 2 20102168356 2010 2 168 356
Created on 2021-12-08 by the reprex package (v2.0.1)
You can use the substr() function if you first make the number into a character with as.character().
test <- '20102168056'
data <- list()
data$year <- substr(test, 1, 4)
data$semester <- substr(test, 5, 5)
data$uni_career <- substr(test, 6, 8)
data$unique_num <- substr(test, 9, 11)
print(data)
#> $year
#> [1] "2010"
#>
#> $semester
#> [1] "2"
#>
#> $uni_career
#> [1] "168"
#>
#> $unique_num
#> [1] "056"
Created on 2021-12-08 by the reprex package (v2.0.1)
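Since the input is a number and the field widths are fixed (4, 1, 3 and 3 digits), the parts can also be taken out arithmetically with integer division (`%/%`) and remainder (`%%`), without converting to character first. This is only a sketch of that alternative, using the same three example numbers as the separate() answer.

```r
ids <- c(20102168056, 20141152013, 20182008006)

year              <- ids %/% 1e7            # first 4 digits
semester          <- (ids %/% 1e6) %% 10    # 5th digit
university_career <- (ids %/% 1e3) %% 1e3   # digits 6 to 8
unique_number     <- ids %% 1e3             # last 3 digits

parts <- data.frame(
  year,
  semester,
  university_career,
  # sprintf() restores the leading zeros that a number cannot store
  unique_number = sprintf("%03d", as.integer(unique_number))
)
parts
```

As with the as.integer() conversion shown earlier, numeric fields lose their leading zeros (008 becomes 8), so any field that is really a code rather than a quantity is better kept as a zero-padded string.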

Create a column for days sampled (e.g. 0, 10, 30 days), starting with 0 days for every study area?

I would like to create some sampling effort curves for species data. There are several study areas, each with a number of sampling plots that were resampled over a certain time period. My data set looks similar to this one:
df1 <- data.frame(PlotID = c("A","A","A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D","E","E","E"),
species = c("x","x","x1","x","x1","x2","x1","x3","x4","x4","x5","x5","x","x3","x","x3","x3","x4","x5","x","x1","x2","x3"),
date = as.Date(c("27-04-1995", "26-05-1995", "02-08-1995", "02-05-1995", "28-09-1995", "02-08-1994", "31-05-1995", "27-07-1995", "06-12-1995", "03-05-1996", "27-04-1995", "31-05-1995", "29-06-1994", "30-08-1995", "26-05-1994", "30-05-1995", "30-06-1995", "30-06-1995", "30-06-1995", "30-08-1995", "31-08-1995", "01-09-1995","02-09-1995"),'%d-%m-%Y'),
area= c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C"))
I would really like an output with an extra column giving the time of sampling, e.g. 0, 10, 30 days, for the whole data frame, but the times should start at 0 for each area. I tried this:
effort<-df1%>% arrange(PlotID, date,species) %>% group_by(area) %>%
mutate(diffDate = difftime(date, lag(date,1))) %>% ungroup()
But somehow my code produces nonsense.
Could somebody please enlighten me?
In the end I would like to achieve something like the example below: a list of matrices, one per research area, with species as rows and time (in days, showing the increasing sampling effort) as columns instead of sampling plots. The example shows a data set from the package iNEXT. But I'm stuck calculating the days of sampling between the sampling dates for every area. For now I just want this extra column showing the days between the sampling events in each area and the species found. I hope it's a bit clearer now.
Edit: This is what the date in my real data set looks like:
Output from dput(head(my.data)):
date= structure(c(801878400, 798940800, 780710400, 769910400, 775785600, 798940800), class = c("POSIXct", "POSIXt"), tzone = "UTC")
A possible tidyverse solution would be:
library(dplyr)
df1 %>% arrange(area, date) %>%
group_by(area) %>%
mutate(diff_date_from_start = date - min(date),
diff_date_from_prev = date - lag(date))
#> # A tibble: 23 x 6
#> # Groups: area [3]
#> PlotID species date area diff_date_from_start diff_date_from_prev
#> <chr> <chr> <date> <chr> <drtn> <drtn>
#> 1 B x2 1994-08-02 A 0 days NA days
#> 2 A x 1995-04-27 A 268 days 268 days
#> 3 A x 1995-05-02 A 273 days 5 days
#> 4 A x 1995-05-26 A 297 days 24 days
#> 5 B x1 1995-05-31 A 302 days 5 days
#> 6 B x3 1995-07-27 A 359 days 57 days
#> 7 A x1 1995-08-02 A 365 days 6 days
#> 8 A x1 1995-09-28 A 422 days 57 days
#> 9 B x4 1995-12-06 A 491 days 69 days
#> 10 B x4 1996-05-03 A 640 days 149 days
#> # … with 13 more rows
The diff_date_from_prev variable might make more sense if you group by other variables as well, such as species and PlotID.
The diff_date_from_start variable is the difference in days between the current sample and the first sample in each area.
Edit to answer comment:
Your date is stored as POSIXct, not as the Date class. If time zones are not relevant, I find it easier to work with Date, so one option is to convert with as.Date() and then apply the manipulations shown previously. Alternatively, you can use the difftime() function, as suggested by @Rui Barradas in the comments, and specify the units accordingly.
df1 <- data.frame(PlotID = c("A","A","A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D","E","E","E"),
species = c("x","x","x1","x","x1","x2","x1","x3","x4","x4","x5","x5","x","x3","x","x3","x3","x4","x5","x","x1","x2","x3"),
# date as POSIXct, not as Date; they are different classes.
# format and tz must be passed by name: the second positional argument of as.POSIXct() is tz, not format.
date = as.POSIXct(c("27-04-1995", "26-05-1995", "02-08-1995", "02-05-1995", "28-09-1995", "02-08-1994", "31-05-1995", "27-07-1995", "06-12-1995", "03-05-1996", "27-04-1995", "31-05-1995", "29-06-1994", "30-08-1995", "26-05-1994", "30-05-1995", "30-06-1995", "30-06-1995", "30-06-1995", "30-08-1995", "31-08-1995", "01-09-1995","02-09-1995"), format = '%d-%m-%Y', tz = "UTC"),
area= c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C"))
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df1 %>% arrange(area, date) %>%
group_by(area) %>%
mutate(
date = as.Date(date),
diff_date_from_start = date - min(date)
)
#> # A tibble: 23 x 5
#> # Groups: area [3]
#> PlotID species date area diff_date_from_start
#> <chr> <chr> <date> <chr> <drtn>
#> 1 B x2 1994-08-02 A 0 days
#> 2 A x 1995-04-27 A 268 days
#> 3 A x 1995-05-02 A 273 days
#> 4 A x 1995-05-26 A 297 days
#> 5 B x1 1995-05-31 A 302 days
#> 6 B x3 1995-07-27 A 359 days
#> 7 A x1 1995-08-02 A 365 days
#> 8 A x1 1995-09-28 A 422 days
#> 9 B x4 1995-12-06 A 491 days
#> 10 B x4 1996-05-03 A 640 days
#> # … with 13 more rows
# or, as suggested by Rui Barradas, you can use the difftime() function and keep your date as POSIXct
df1 %>% arrange(area, date) %>%
group_by(area) %>%
mutate(
diff_date_time = difftime(date, min(date), units = "days")
)
#> # A tibble: 23 x 5
#> # Groups: area [3]
#> PlotID species date area diff_date_time
#> <chr> <chr> <dttm> <chr> <drtn>
#> 1 B x2 1994-08-02 00:00:00 A 0 days
#> 2 A x 1995-04-27 00:00:00 A 268 days
#> 3 A x 1995-05-02 00:00:00 A 273 days
#> 4 A x 1995-05-26 00:00:00 A 297 days
#> 5 B x1 1995-05-31 00:00:00 A 302 days
#> 6 B x3 1995-07-27 00:00:00 A 359 days
#> 7 A x1 1995-08-02 00:00:00 A 365 days
#> 8 A x1 1995-09-28 00:00:00 A 422 days
#> 9 B x4 1995-12-06 00:00:00 A 491 days
#> 10 B x4 1996-05-03 00:00:00 A 640 days
#> # … with 13 more rows
Created on 2021-06-13 by the reprex package (v2.0.0)
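To see the Date versus POSIXct difference on the asker's own values, here is a small base R sketch that rebuilds the first two timestamps from the dput() output in the question (seconds since 1970-01-01, UTC). The variable names are hypothetical.

```r
# First two values of the dput() output, as POSIXct in UTC
stamps <- as.POSIXct(c(801878400, 798940800), origin = "1970-01-01", tz = "UTC")

# Convert to Date, stating the time zone explicitly
dates <- as.Date(stamps, tz = "UTC")
dates

# Subtracting POSIXct values picks a unit automatically;
# difftime() lets you fix the unit to days
gap_days <- difftime(stamps[1], stamps[2], units = "days")
as.numeric(gap_days)
```

The two timestamps are 1995-05-31 and 1995-04-27, 34 days apart. Fixing `units = "days"` matters for POSIXct because the automatically chosen unit can be seconds, minutes or hours when the values are close together.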
I solved it with a for loop:
areas <- unique(df1$area)
df1$diffdate <- 0
for (i in seq_along(areas)) {
in_area <- df1$area == areas[i]
# days since the first sample of this area
df1$diffdate[in_area] <- as.numeric(difftime(df1$date[in_area], min(df1$date[in_area]), units = "days"))
}
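The per-area loop can also be written as a single vectorised step with ave(), which applies a function within each group and returns a result of the same length as the input. A minimal sketch on a small hypothetical subset of the dates:

```r
df <- data.frame(
  area = c("A", "A", "A", "B", "B"),
  date = as.Date(c("1995-04-27", "1994-08-02", "1995-05-02",
                   "1994-05-26", "1994-06-29"))
)

# Dates as day counts, so the group minimum can be subtracted directly
days <- as.numeric(df$date)

# ave() returns the earliest day of each row's area, aligned to the rows
df$days_since_start <- days - ave(days, df$area, FUN = min)
df
```

Each area starts at 0 on its own earliest date; the rows do not need to be sorted first.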
Do you want a sequence of dates in 10-day steps for each area?
library(dplyr)
library(tidyr)
df1 %>%
arrange(PlotID, date, species) %>%
group_by(area) %>%
complete(date = full_seq(date, 1)) %>%
mutate(species = zoo::na.locf(species),
PlotID = zoo::na.locf(PlotID),
diffDate = 10*as.integer(date - first(date)) %/% 10) %>%
ungroup() %>%
group_by(diffDate) %>%
filter(row_number() == 1)
## A tibble: 65 x 5
## Groups: diffDate [65]
# area date PlotID species diffDate
# <chr> <date> <chr> <chr> <dbl>
# 1 A 1994-08-02 B x2 0
# 2 A 1994-08-12 B x2 10
# 3 A 1994-08-22 B x2 20
# 4 A 1994-09-01 B x2 30
# 5 A 1994-09-11 B x2 40
# 6 A 1994-09-21 B x2 50
# 7 A 1994-10-01 B x2 60
# 8 A 1994-10-11 B x2 70
# 9 A 1994-10-21 B x2 80
#10 A 1994-10-31 B x2 90
## … with 55 more rows
