Arithmetic between two columns, lagged by one row in dplyr - r

Wasn't sure of the best way to word this, but I'd like to multiply / divide two columns by each other, lagged by one row (in my dataset this means var_x / var_y, offset by one row).
The end result should be an additional column, with one NA value (for the first year which isn't present)
I'm having trouble indexing it, but I think it would go something along these lines...
e.g.
df <- data_frame(year = c(2010:2020), var_x = c(20:30), var_y = c(2:12))
#not correct
diff <- df[,2, 2:ncol(df)-1] * df[,3, 1:ncol(df)]
dplyr would look something like...
df %>%
mutate(forecast = (var_x * ncol(var_y)-1))
incorrect result:
# A tibble: 11 x 4
year var_x var_y forecast
<int> <int> <int> <int>
1 2010 20 2 40
2 2011 21 3 63
3 2012 22 4 88
4 2013 23 5 115
5 2014 24 6 144
6 2015 25 7 175
7 2016 26 8 208
8 2017 27 9 243
9 2018 28 10 280
10 2019 29 11 319
11 2020 30 12 360
Error in mutate_impl(.data, dots) :
Column `forecast` must be length 11 (the number of rows) or one, not 0
Thanks, your guidance is appreciated.

From recommended comment above:
df %>%
mutate(forecast = var_y * lag(var_x))
# A tibble: 11 x 4
year var_x var_y forecast
<int> <int> <int> <int>
1 2010 20 2 NA
2 2011 21 3 60
3 2012 22 4 84
4 2013 23 5 110
5 2014 24 6 138
6 2015 25 7 168
7 2016 26 8 200
8 2017 27 9 234
9 2018 28 10 270
10 2019 29 11 308
11 2020 30 12 348
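To make the lag step explicit, here is a minimal sketch of the same idea that also covers the division case mentioned in the question. The column names prev_x and ratio are made up for illustration, and dplyr::lag is spelled out on the assumption that you want to avoid accidentally picking up stats::lag when dplyr is not attached.

```r
library(dplyr)

df <- tibble(year = 2010:2020, var_x = 20:30, var_y = 2:12)

res <- df %>%
  mutate(
    prev_x   = dplyr::lag(var_x),         # var_x from the previous row; NA in row 1
    forecast = var_y * prev_x,            # same as var_y * lag(var_x)
    ratio    = var_x / dplyr::lag(var_y)  # hypothetical division variant
  )
```

The first row of forecast and ratio is NA because there is no previous row to take a value from.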

Related

loop to identify a sum and then get the position of that sum

I have this data.frame called EXAMPLE, with 4 variables:
date <- c(2010, 2011, 2012, 2013)
new_york <- c(10,20,22,28)
berlin <- c(0,51,45,12)
tokyo <- c(2,15,20,13)
EXAMPLE <- data.frame(date, new_york, berlin, tokyo)
For each column, I want to identify the row at which the running total first reaches at least 50, and also store that total. For example, in the column new_york the running total reaches 52 at row 3.
I was thinking of something like the code below, but it didn't work:
x <- 1
while(sum(EXAMPLE$berlin[1:x]) <= 50) {
a <-x
}
I'd appreciate it if someone could help.
out <- lapply(EXAMPLE[,-1], cumsum)
names(out) <- paste0(names(out), "_cumulative")
options(width=123, length=99999)
cbind(EXAMPLE, out)
# date new_york berlin tokyo new_york_cumulative berlin_cumulative tokyo_cumulative
# 1 2010 10 0 2 10 0 2
# 2 2011 20 51 15 30 51 17
# 3 2012 22 45 20 52 96 37
# 4 2013 28 12 13 80 108 50
Here's the equivalent tidy version of @r2evans's answer...
library(dplyr)
EXAMPLE %>%
mutate(across(new_york:tokyo,
cumsum,
.names = "cumsum_{.col}")
)
#> date new_york berlin tokyo cumsum_new_york cumsum_berlin cumsum_tokyo
#> 1 2010 10 0 2 10 0 2
#> 2 2011 20 51 15 30 51 17
#> 3 2012 22 45 20 52 96 37
#> 4 2013 28 12 13 80 108 50
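Both answers stop at the cumulative sums. If you also need the position where each column first reaches the threshold, a rough sketch could look like the following. The helper name first_reach and its threshold argument are hypothetical, and it assumes "at least 50" means >= 50.

```r
EXAMPLE <- data.frame(
  date     = c(2010, 2011, 2012, 2013),
  new_york = c(10, 20, 22, 28),
  berlin   = c(0, 51, 45, 12),
  tokyo    = c(2, 15, 20, 13)
)

# Hypothetical helper: first row where the running total reaches the threshold
first_reach <- function(x, threshold = 50) {
  cs  <- cumsum(x)
  pos <- which(cs >= threshold)[1]  # NA if the threshold is never reached
  c(position = pos, sum = cs[pos])
}

# One column per city, with rows "position" and "sum"
result <- sapply(EXAMPLE[, -1], first_reach)
result
```

For new_york this gives position 3 and sum 52, matching the example in the question.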

Two condition filtering and removing Zero Values

I have a simple dataframe of counts from a pepper-growing experiment. I want to remove observations where Fruit_total is zero for both treatments (Control and Covered). I tried filter, but I can only handle one variable at a time. Any advice?
You can accomplish this by grouping on location.ID and keeping only the groups whose Fruit_total does not sum to zero:
library(tidyverse)
df %>%
group_by(location.ID) %>%
filter(sum(Fruit_total) != 0)
Yields:
# A tibble: 22 x 5
# Groups: location.ID [11]
location.ID Year plant treatment Fruit_total
<dbl> <dbl> <chr> <chr> <dbl>
1 7 2019 Anaheim.Peppers.Count Control 23
2 9 2019 Anaheim.Peppers.Count Control 3
3 15 2019 Anaheim.Peppers.Count Control 0
4 23 2019 Anaheim.Peppers.Count Control 1
5 38 2019 Anaheim.Peppers.Count Control 8
6 41 2019 Anaheim.Peppers.Count Control 1
7 42 2019 Anaheim.Peppers.Count Control 12
8 43 2019 Anaheim.Peppers.Count Control 7
9 45 2019 Anaheim.Peppers.Count Control 5
10 49 2019 Anaheim.Peppers.Count Control 13
# ... with 12 more rows
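Since the original data was only posted as an image, here is a small sketch on made-up data (the peppers tibble below is hypothetical). It uses any(Fruit_total != 0) instead of the sum, which should behave the same for non-negative counts and states the intent a little more directly.

```r
library(dplyr)

# Hypothetical stand-in for the pepper data
peppers <- tibble(
  location.ID = c(1, 1, 2, 2, 3, 3),
  treatment   = rep(c("Control", "Covered"), 3),
  Fruit_total = c(0, 0, 5, 0, 2, 7)
)

kept <- peppers %>%
  group_by(location.ID) %>%
  filter(any(Fruit_total != 0)) %>%  # drop locations where every treatment is zero
  ungroup()
```

Location 1 is dropped because both treatments are zero; location 2 is kept in full, including its zero row.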

Extract a value of a certain row based on a particular value in a column in a dataframe

I'm new to R and am working on longitudinal data. What I'd like to do with dplyr is extract the value from a certain row by matching the value of another column.
I tried using which() within mutate, but it doesn't work. I also tried using indices, but that has its own problems (as shown below).
For example, suppose I have:
library(dplyr)
df <- tibble(ID = c(1,1,1,2,2,3,3,3,4,4),
year = c(2013,2014,2015,2013,2015,2013,2014,2015,2013,2015),
Income = c(49, 32, 47, 14, 15, 14, 46, 45, 16, 42),
Sales = c(12, 21, 42, 30, 10, 19, 16, 27, 18, 32))
Eventually, I want to subtract the prior year's value from a given year's value, for example (Income in 2014) - (Income in 2013). What I want is to use dplyr the way I would call df$Income[df$year=="2014"] in base R.
The reason I don't go with:
dftemp <- df %>%
group_by(ID) %>%
mutate(Income14minus13 = Income[2] - Income[1])
is that indices don't account for the missing 2014s in the data, so I want to make sure I'm extracting the exact values.
I've also tried this without success:
dftemp <- df %>%
group_by(ID) %>%
mutate(Income13 = Income[which(year==2013)],
Income14 = Income[which(year==2014)],
Income14minus13 = Income14 - Income13)
Finally, I'd like to have this as the output:
> desired_data
# A tibble: 10 x 7
ID year Income Sales Income13 Income14 Income15
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 49 12 49 32 47
2 1 2014 32 21 49 32 47
3 1 2015 47 42 49 32 47
4 2 2013 14 30 14 NA 15
5 2 2015 15 10 14 NA 15
6 3 2013 14 19 14 46 45
7 3 2014 46 16 14 46 45
8 3 2015 45 27 14 46 45
9 4 2013 16 18 16 NA 42
10 4 2015 42 32 16 NA 42
I've noticed that case_when() only fills in the new variable on the single matching row, whereas my desired output repeats the value on every row of the same ID.
Any help is greatly appreciated!
Perhaps a join would help here?
df %>%
left_join(by = "ID",
df %>%
select(ID, year, Income) %>%
mutate(year = paste0("Income", year)) %>%
tidyr::spread(year, Income)
)
# A tibble: 10 x 7
ID year Income Sales Income2013 Income2014 Income2015
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 49 12 49 32 47
2 1 2014 32 21 49 32 47
3 1 2015 47 42 49 32 47
4 2 2013 14 30 14 NA 15
5 2 2015 15 10 14 NA 15
6 3 2013 14 19 14 46 45
7 3 2014 46 16 14 46 45
8 3 2015 45 27 14 46 45
9 4 2013 16 18 16 NA 42
10 4 2015 42 32 16 NA 42
Perhaps an alternative approach could be to reshape data from long to wide; missing values will then automatically become NA (or you can specify a value with fill).
For example
library(tidyr)  # for spread()
df %>%
select(-Sales) %>%
spread(year, Income) %>%
mutate(Income14minus13 = `2014` - `2013`)
## A tibble: 4 x 5
# ID `2013` `2014` `2015` Income14minus13
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 49 32 47 -17
#2 2 14 NA 15 NA
#3 3 14 46 45 32
#4 4 16 NA 42 NA
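If you would rather stay close to the attempt in the question, one option is to look the year up with match() instead of which(). This is only a sketch: match() returns NA when the year is absent from a group, so the lookup yields a single NA rather than a zero-length vector, which is what I would assume trips up mutate() in the which() version.

```r
library(dplyr)

df <- tibble(ID = c(1,1,1,2,2,3,3,3,4,4),
             year = c(2013,2014,2015,2013,2015,2013,2014,2015,2013,2015),
             Income = c(49, 32, 47, 14, 15, 14, 46, 45, 16, 42),
             Sales = c(12, 21, 42, 30, 10, 19, 16, 27, 18, 32))

res <- df %>%
  group_by(ID) %>%
  mutate(
    Income13 = Income[match(2013, year)],  # NA if the group has no 2013 row
    Income14 = Income[match(2014, year)],  # NA if the group has no 2014 row
    Income14minus13 = Income14 - Income13
  ) %>%
  ungroup()
```

Each lookup returns a single value per group, which mutate() then recycles across all rows of that ID.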

add rows to data frame for non-observations

I have a dataframe that summarizes the number of times birds were observed at their breeding site on each day and each hour during daytime (i.e., when the sun was above the horizon). Example:
head(df)
ID site day hr nObs
1 19 A 202 11 60
2 19 A 202 13 18
3 19 A 202 15 27
4 8 B 188 8 6
5 8 B 188 9 6
6 8 B 188 11 7
However, this dataframe does not include hours when the bird was not observed. Eg. no line for bird 19 on day 202 at 14 with an nObs value of 0.
I'd like to find a way, preferably with dplyr (tidyverse), to add in those rows for when individuals were not observed.
You can use complete from tidyr, i.e.
library(tidyverse)
df %>%
group_by(ID, site) %>%
complete(hr = seq(min(hr), max(hr)))
which gives,
# A tibble: 9 x 5
# Groups: ID, site [2]
ID site hr day nObs
<int> <fct> <int> <int> <int>
1 8 B 8 188 6
2 8 B 9 188 6
3 8 B 10 NA NA
4 8 B 11 188 7
5 19 A 11 202 60
6 19 A 12 NA NA
7 19 A 13 202 18
8 19 A 14 NA NA
9 19 A 15 202 27
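The output above leaves day and nObs as NA in the added rows. A possible variation, sketched here under the assumption that each ID/site pair should be completed per day, is to include day in the grouping and use the fill argument of complete() to set nObs to 0. The names obs and filled are placeholders.

```r
library(dplyr)
library(tidyr)

obs <- tibble(
  ID   = c(19, 19, 19, 8, 8, 8),
  site = c("A", "A", "A", "B", "B", "B"),
  day  = c(202, 202, 202, 188, 188, 188),
  hr   = c(11, 13, 15, 8, 9, 11),
  nObs = c(60, 18, 27, 6, 6, 7)
)

filled <- obs %>%
  group_by(ID, site, day) %>%                 # day is now a grouping key, so it is kept
  complete(hr = seq(min(hr), max(hr)),
           fill = list(nObs = 0)) %>%         # unobserved hours get nObs = 0
  ungroup()
```

This only fills hours between the first and last observation of each group; extending it to the full daylight window would need the sunrise and sunset hours, which are not in the data shown.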
One way to do this would be to first build a "template" of all possible combinations where birds can be observed and then merge ("left join") the actual observations onto that template:
a = read.table(text = " ID site day hr nObs
1 19 A 202 11 60
2 19 A 202 13 18
3 19 A 202 15 27
4 8 B 188 8 6
5 8 B 188 9 6
6 8 B 188 11 7")
tpl <- expand.grid(c(unique(a[, 1:3]), list(hr = 1:24)))
merge(tpl, a, all.x = TRUE)
Edit based on a comment by @user3220999: in case we want to do the process per ID, we can use split to get a list of data.frames per ID, build a list of templates, and mapply merge over the two lists:
a <- split(a, a$ID)
tpl <- lapply(a, function(ai) {
expand.grid(c(unique(ai[, 1:3]), list(hr = 1:24)))
})
res <- mapply(merge, tpl, a, SIMPLIFY = FALSE, MoreArgs = list(all.x = TRUE))
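A variation on the template idea, offered as a sketch rather than a drop-in replacement: as far as I can tell, passing the unique rows to expand.grid crosses ID, site and day independently, so the unsplit version can produce combinations that never occur together (such as ID 19 at site B). Cross-joining the unique key rows with the hours through merge(..., by = NULL) keeps each ID/site/day triple intact. The 1:24 hour range is taken from the answer above.

```r
a <- data.frame(
  ID   = c(19, 19, 19, 8, 8, 8),
  site = c("A", "A", "A", "B", "B", "B"),
  day  = c(202, 202, 202, 188, 188, 188),
  hr   = c(11, 13, 15, 8, 9, 11),
  nObs = c(60, 18, 27, 6, 6, 7)
)

# One row per observed ID/site/day combination
keys <- unique(a[, c("ID", "site", "day")])

# by = NULL makes merge() return the Cartesian product: every key row x every hour
tpl <- merge(keys, data.frame(hr = 1:24), by = NULL)

# Left join the observations onto the template, then turn missing counts into 0
res <- merge(tpl, a, all.x = TRUE)
res$nObs[is.na(res$nObs)] <- 0
```

With two key rows and 24 hours the result has 48 rows, and no split or mapply step is needed.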

Difference in Timestamp

I want to calculate the time difference between two events. The first five columns give the date-time of the incident. The remaining five columns give the date-time of death.
dat <- read.table(header=TRUE, text="
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32")
dat
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32
I want to calculate the difference of time (in minutes). The following code is not going anywhere. The timestamp will look like 2013-04-06 04:08.
library(lubridate)
dat$tstamp1 <- mdy(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE,sep = "-"))
dat$tstamp2 <- mdy(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"))
dat$diff <- dat$tstamp2 -dat$tstamp2 ### want the difference in minutes
In order to parse a date/time string of the "-"-separated format you're creating, you'll need to give a custom format, and pass it to parse_date_time. For example:
parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
Your new code would therefore look like:
library(lubridate)
dat$tstamp1 <- parse_date_time(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
dat$tstamp2 <- parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
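Then the following will get you the time difference in minutes. The units are passed explicitly because plain subtraction of date-times picks its units automatically:
dat$diff <- as.numeric(difftime(dat$tstamp2, dat$tstamp1, units = "mins"))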
You can try this:
library(lubridate)
dat$tstamp1 <- strptime(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE,sep = "-"),"%Y-%m-%d-%H-%M")
dat$tstamp2 <- strptime(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),"%Y-%m-%d-%H-%M")
dat$diff <- difftime(as.POSIXct(dat$tstamp2), as.POSIXct(dat$tstamp1), units = "mins")
Using strptime is faster and a bit safer against unexpected data.
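As an alternative that skips string pasting altogether, here is a base R sketch using ISOdatetime, which builds date-times directly from numeric components. The tz = "UTC" setting and the diff_min column name are assumptions made for the example. A fixed time zone avoids daylight-saving jumps, but use the real zone if the data has one.

```r
dat <- read.table(header = TRUE, text = "
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32")

# Build POSIXct values from the numeric parts (seconds set to 0)
t1 <- ISOdatetime(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, 0, tz = "UTC")
t2 <- ISOdatetime(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, 0, tz = "UTC")

# Ask for minutes explicitly instead of relying on automatic units
dat$diff_min <- as.numeric(difftime(t2, t1, units = "mins"))
```

For the sample data this gives 1, 1550, 44, 690, 38450 and 1 minutes.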

Resources