Find the smallest value under conditions on the index (NA output possible) in R

Question:
I am using dplyr for data analysis in R, and I have come across the following problem.
My data frame looks like this:
item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65
The data frame is already arranged by item, day. Now I want to mutate a new column in which each row holds the smallest value among rows of the same group whose day falls within the next 2 days.
For the example above, I want the resulting data frame to be:
item day val output
1 A 1 90 100 # the smaller of 100 and 110
2 A 2 100 110 # the only value within 2 days
3 A 3 110 80 # the only value within 2 days
4 A 5 80 NA # there is no data within 2 days
5 A 8 70 NA # there is no data within 2 days
6 B 1 75 65 # the only value within 2 days
7 B 3 65 NA # there is no data within 2 days
I understand that I will probably use group_by and mutate, but how do I write the inner function to achieve my desired result?
Any help is greatly appreciated. Let me know if you need me to clarify anything. Thank you!
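
One direct way to express "smallest value within the next 2 days" is to compute the window per row inside a grouped mutate(). A minimal sketch (not from the answers below; it assumes df holds the data above):
library(dplyr)
df %>%
  group_by(item) %>%
  mutate(output = sapply(day, function(d) {
    v <- val[day > d & day <= d + 2]   # same-group values within the next 2 days
    if (length(v) > 0) min(v) else NA  # NA when nothing falls in the window
  })) %>%
  ungroup()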

Try this:
df %>%
  # arrange(item, day) %>% # if not already arranged
  # take note of the next two values & the corresponding difference in days
  group_by(item) %>%
  mutate(val.1 = lead(val),
         day.1 = lead(day) - day,
         val.2 = lead(val, 2),
         day.2 = lead(day, 2) - day) %>%
  ungroup() %>%
  # if a value is associated with a day more than 2 days away, change it to NA
  mutate(val.1 = ifelse(day.1 %in% c(1, 2), val.1, NA),
         val.2 = ifelse(day.2 %in% c(1, 2), val.2, NA)) %>%
  # grouping by item & day makes each group a single row, so min() acts rowwise
  group_by(item, day) %>%
  mutate(output = min(val.1, val.2, na.rm = TRUE)) %>%
  ungroup() %>%
  # arrange results (min over two NAs yields Inf; convert it back to NA)
  select(item, day, val, output) %>%
  mutate(output = ifelse(output == Inf, NA, output)) %>%
  arrange(item, day)
# A tibble: 7 x 4
    item   day   val output
  <fctr> <int> <int>  <dbl>
1      A     1    90    100
2      A     2   100    110
3      A     3   110   80.0
4      A     5    80     NA
5      A     8    70     NA
6      B     1    75   65.0
7      B     3    65     NA
Data:
df <- read.table(text = " item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65", header = TRUE)
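
As an aside, the final ifelse(output == Inf, NA, output) step is there because min() with na.rm = TRUE over nothing but NAs returns Inf with a warning, not NA:
min(NA, NA, na.rm = TRUE)
#> [1] Inf
#> Warning: no non-missing arguments to min; returning Inf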

We can use complete from the tidyr package to complete the dataset by day, and then use lead from dplyr and rollapply from zoo to find the minimum of the next two days.
library(dplyr)
library(tidyr)
library(zoo)
DF2 <- DF %>%
  group_by(item) %>%
  complete(day = full_seq(day, period = 1)) %>%
  mutate(output = rollapply(lead(val), width = 2, FUN = min, na.rm = TRUE,
                            fill = NA, align = "left")) %>%
  drop_na(val) %>%
  ungroup() %>%
  mutate(output = ifelse(output == Inf, NA, output))
DF2
# # A tibble: 7 x 4
#   item    day   val output
#   <chr> <dbl> <int>  <dbl>
# 1 A      1.00    90    100
# 2 A      2.00   100    110
# 3 A      3.00   110   80.0
# 4 A      5.00    80     NA
# 5 A      8.00    70     NA
# 6 B      1.00    75   65.0
# 7 B      3.00    65     NA
DATA
DF <- read.table(text = "item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65",
header = TRUE, stringsAsFactors = FALSE)
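
To see why the positional lead() calls are valid here, it helps to look at what the completion step produces: every missing day is padded with an NA val, so lead(val) and lead(val, 2) always refer to day + 1 and day + 2. A sketch for item A (column order and formatting approximate):
DF %>%
  filter(item == "A") %>%
  complete(day = full_seq(day, period = 1))
#     day item    val
#       1 A        90
#       2 A       100
#       3 A       110
#       4 A        NA
#       5 A        80
#       6 A        NA
#       7 A        NA
#       8 A        70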

We'll create a dataset with modified day values so we can left-join it onto the original dataset, keeping only the minimum value.
df %>%
  left_join(
    bind_rows(mutate(., day = day - 1), mutate(., day = day - 2)) %>%
      rename(output = val)) %>%
  group_by(item, day, val) %>%
  summarize_at("output", min) %>%
  ungroup
# # A tibble: 7 x 4
#     item   day   val output
#   <fctr> <dbl> <int>  <dbl>
# 1      A     1    90    100
# 2      A     2   100    110
# 3      A     3   110     80
# 4      A     5    80     NA
# 5      A     8    70     NA
# 6      B     1    75     65
# 7      B     3    65     NA
data
df <- read.table(text = " item day val
1 A 1 90
2 A 2 100
3 A 3 110
4 A 5 80
5 A 8 70
6 B 1 75
7 B 3 65", header = TRUE)

Related

How to create a percentage column based on the values present in every third row?

I have a data frame containing weight values. I have to create a new column, the percentage change of weight, where the denominator takes the value of every third row.
df <- data.frame(weight = c(30,30,109,30,309,10,20,20,14))
# expected output
change_of_weight = c(30/109, 30/109, 109/109, 30/10,309/10,10/10,20/14,20/14,14/14)
Subset the weight column at the positions where position %% 3 is zero and repeat each value three times (note this assumes the number of rows is a multiple of 3):
df <- transform(df, change_of_weight = weight / rep(weight[1:nrow(df) %% 3 == 0], each = 3))
df
weight change_of_weight
1 30 0.2752294
2 30 0.2752294
3 109 1.0000000
4 30 3.0000000
5 309 30.9000000
6 10 1.0000000
7 20 1.4285714
8 20 1.4285714
9 14 1.0000000
You can create a group of every 3 rows and divide weight column by the last value in the group.
df$change <- with(df, ave(df$weight, ceiling(seq_len(nrow(df))/3),
                          FUN = function(x) x/x[length(x)]))
Or using dplyr:
library(dplyr)
df %>%
  group_by(grp = ceiling(row_number()/3)) %>%
  mutate(change = weight/last(weight))
# weight grp change
# <dbl> <dbl> <dbl>
#1 30 1 0.275
#2 30 1 0.275
#3 109 1 1
#4 30 2 3
#5 309 2 30.9
#6 10 2 1
#7 20 3 1.43
#8 20 3 1.43
#9 14 3 1
We can also use gl to create a grouping column:
library(dplyr)
df %>%
  group_by(grp = as.integer(gl(n(), 3, n()))) %>%
  mutate(change = weight/last(weight))
# A tibble: 9 x 3
# Groups: grp [3]
# weight grp change
# <dbl> <int> <dbl>
#1 30 1 0.275
#2 30 1 0.275
#3 109 1 1
#4 30 2 3
#5 309 2 30.9
#6 10 2 1
#7 20 3 1.43
#8 20 3 1.43
#9 14 3 1
Or using data.table:
library(data.table)
setDT(df)[, change := weight/last(weight), .(as.integer(gl(nrow(df), 3, nrow(df))))]

Summarising rows and columns from known groupings

I have a symmetrical matrix of flows (in tibble form) similar to the below example:
library(tibble)
set.seed(2019)
df1 <- as_tibble(matrix(sample(1:10, 100, replace = T), nrow = 10, ncol = 10, byrow = TRUE,
                        dimnames = list(as.character(1:10),
                                        as.character(1:10))))
df1
# `1` `2` `3` `4` `5` `6` `7` `8` `9` `10`
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 8 8 4 7 1 1 9 1 2 7
# 2 8 7 3 2 7 7 1 8 4 5
# 3 5 6 10 2 2 1 6 10 7 5
# 4 7 1 9 2 1 1 4 5 1 8
# 5 7 3 9 7 9 5 10 10 3 2
# 6 4 1 1 4 6 4 10 10 1 1
# 7 2 3 8 4 8 10 4 1 9 6
# 8 4 2 4 2 7 10 2 6 4 8
# 9 1 10 10 3 6 2 6 7 8 4
#10 6 8 9 3 6 9 5 10 4 10
I also have a lookup table that shows the broad groups that each flow subgroup fits into:
lookup <- tibble(sector = as.character(1:10),
                 aggregate_sector = c(rep('A', 3), rep('B', 3), rep('C', 4)))
lookup
# sector aggregate_sector
#1 1 A
#2 2 A
#3 3 A
#4 4 B
#5 5 B
#6 6 B
#7 7 C
#8 8 C
#9 9 C
#10 10 C
I want to summarise my original df1 such that it represents the flows between each aggregate_sector (as per the lookup table) rather than each sector. Expected output:
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94
My initial attempt was to convert to a matrix and then use a nested for loop to calculate the sum of flows for each aggregate_sector combination in turn:
mdat <- as.matrix(df1)
# replace row and column names with group names
# (assumes lookup is in the same order as the row and column names...)
row.names(mdat) <- lookup$aggregate_sector
colnames(mdat) <- lookup$aggregate_sector
# pre-allocate an empty matrix
new_mat <- matrix(nrow = 3, ncol = 3, dimnames = list(LETTERS[1:3], LETTERS[1:3]))
# fill in the matrix section by section
for (i in row.names(new_mat)) {
  for (j in colnames(new_mat)) {
    new_mat[i, j] <- sum(mdat[which(row.names(mdat) == i), which(colnames(mdat) == j)])
  }
}
new_mat
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94
While this is a satisfactory solution, I wonder if there's a solution using dplyr or similar that uses nicer logic and saves me from having to convert my actual data (which is a tibble) into matrix form.
The key step is to gather; after that it is all straightforward dplyr stuff:
flow_by_sector <-
  df1 %>%
  mutate(sector_from = rownames(.)) %>%
  tidyr::gather(sector_to, flow, -sector_from)

flow_by_sector_with_agg <-
  flow_by_sector %>%
  left_join(lookup, by = c("sector_from" = "sector")) %>%
  rename(agg_from = aggregate_sector) %>%
  left_join(lookup, by = c("sector_to" = "sector")) %>%
  rename(agg_to = aggregate_sector)

flow_by_agg <-
  flow_by_sector_with_agg %>%
  group_by(agg_from, agg_to) %>%
  summarise(flow = sum(flow))

tidyr::spread(flow_by_agg, agg_to, flow)
Here's a base answer that uses stack and xtabs. It's not super robust: it assumes that the lookup table is in the same order as the rows and columns of the data.frame.
colnames(df1) <- lookup$aggregate_sector
xtabs(values ~ sector + ind,
      dat = data.frame(sector = rep(lookup$aggregate_sector, length(df1)),
                       stack(df1)))
Here's another way to build the data.frame:
xtabs(values ~ Var1 + Var2,
      dat = data.frame(expand.grid(lookup$aggregate_sector, lookup$aggregate_sector),
                       values = unlist(df1)))
Var2
Var1 A B C
A 59 30 65
B 42 39 65
C 67 70 94
I actually figured out a matrix algebra alternative to my problem which is much faster despite having to convert my data.frame into a matrix. I won't accept this solution as I did ask specifically for a dplyr answer, but thought it interesting enough to post here anyway.
I first had to form an aggregation matrix, S, from my lookup table, where the locations of ones in row i of S indicate which sectors of the original matrix are grouped together as sector i in the aggregated matrix:
S <- lookup %>%
  mutate(sector = as.numeric(sector), value = 1) %>%
  spread(sector, value) %>%
  column_to_rownames('aggregate_sector') %>%
  as.matrix()
S[is.na(S)] <- 0
S
# 1 2 3 4 5 6 7 8 9 10
#A 1 1 1 0 0 0 0 0 0 0
#B 0 0 0 1 1 1 0 0 0 0
#C 0 0 0 0 0 0 1 1 1 1
Then I convert my original data.frame, df1, into a matrix x and simply calculate S %*% x %*% t(S):
x <- as.matrix(df1)
S %*% x %*% t(S)
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94

R - Tallying up number of new individuals encountered as days go by based on ID

I am monitoring an animal population. I have their individual IDs as numbers, the date each animal was encountered, and the number of individuals encountered on that day. I want to sum up the total number of different individuals encountered as the days go by, so I need it to recognize repeated IDs and add only new individuals to the running total.
This is my dataset, the last column being my desired outcome:
Month Day ID N. individuals that day Total encountered
5 13 44 3 3
5 13 58 3 3
5 13 57 3 3
5 14 58 1 3
5 15 44 2 4
5 15 06 2 4
Edit: updated to a working, but inelegant, solution. The approach here was to use padr to create a row for every ID on every date, with a 1 once the ID has appeared. Then we can count how many IDs have appeared as of each date, and add that to the original with a join.
library(tidyverse); library(lubridate)
# First, make a date column for easier sorting etc.
df1 <- df %>%
  mutate(date = ymd(paste(2019, Month, Day))) %>%
  select(date, ID) %>%
  mutate(appearance = 1) # for counting later; becomes NA in the padded rows

df2 <- df1 %>%
  padr::pad(group = "ID", start_val = min(df1$date), end_val = max(df1$date)) %>%
  fill(appearance) %>%
  count(date, Month = month(date), Day = day(date),
        wt = appearance, name = "Total_encountered_calc")

df %>%
  left_join(df2)
Output
Month Day ID N_individuals_that_day Total_encountered date Total_encountered_calc
1 5 13 44 3 3 2019-05-13 3
2 5 13 58 3 3 2019-05-13 3
3 5 13 57 3 3 2019-05-13 3
4 5 14 58 1 3 2019-05-14 3
5 5 15 44 2 4 2019-05-15 4
6 5 15 6 2 4 2019-05-15 4
An option:
library(tidyverse)
df %>%
  add_count(Month, Day) %>%
  mutate(n1 = duplicated(ID)) %>%
  group_by(Month, Day) %>%
  mutate(n1 = c(min(n - n1), rep(0, n() - 1))) %>%
  ungroup %>%
  mutate(n1 = cumsum(n1))
# A tibble: 6 x 5
# Month Day ID n n1
# <int> <int> <int> <int> <dbl>
#1 5 13 44 3 3
#2 5 13 58 3 3
#3 5 13 57 3 3
#4 5 14 58 1 3
#5 5 15 44 2 4
#6 5 15 6 2 4
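
A more transparent alternative (a sketch, assuming the data are already sorted by date as above): count each ID the first time it appears with cumsum(!duplicated(ID)), then take the maximum of that running count within each day so that every row of a day shows the day-end total.
library(dplyr)
df %>%
  mutate(Total = cumsum(!duplicated(ID))) %>%   # running count of distinct IDs
  group_by(Month, Day) %>%
  mutate(Total = max(Total)) %>%                # day-end total for every row of that day
  ungroup()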

r tidyverse spread() using multiple key value pairs not collapsing rows

I am trying to spread() a couple of key/value pairs, but the common value column does not collapse. I think it may have to do with some previous processing, or, more likely, I do not know the right way to spread two or more key/value pairs to get the result I expect.
I'm starting with this data set:
library(tidyverse)
df <- tibble(order = 1:7,
             line_1 = c(23, 8, 21, 45, 68, 31, 24),
             line_2 = c(63, 25, 25, 24, 48, 24, 63),
             line_3 = c(62, 12, 10, 56, 67, 25, 35))
There are two pre-spread steps to define the order of the "count" values created by the following gather() call. The first pre-spread step records the original order of the "count" variable using the row number:
ntrl <- df %>%
  gather(line_1,
         line_2,
         line_3,
         key = "sector",
         value = "count") %>%
  group_by(order) %>%
  mutate(sector_ord = row_number()) %>%
  arrange(order,
          sector)
The second pre-spread step defines the numerical order of the "count" variable:
ord <- ntrl %>%
  arrange(order,
          count) %>%
  group_by(order) %>%
  mutate(num_ord = paste0("ord_",
                          row_number(),
                          sep = ""))
And then finally the spread code that I have been using:
wide <- ord %>%
  group_by(order) %>%
  spread(key = sector,
         value = count) %>%
  spread(key = num_ord,
         value = sector_ord)
What I'm getting is this:
order line_1 line_2 line_3 ord_1 ord_2 ord_3
1 1 23 NA NA 1 NA NA
2 1 NA 63 NA NA NA 2
3 1 NA NA 62 NA 3 NA
4 2 8 NA NA 1 NA NA
5 2 NA 25 NA NA NA 2
6 2 NA NA 12 NA 3 NA
7 3 21 NA NA NA 1 NA
8 3 NA 25 NA NA NA 2
9 3 NA NA 10 3 NA NA
... and so on through 21 lines, accounting for all 7 "order" values
The behavior I am expecting is for all rows with the same "order" value to collapse into one, giving the following:
order line_1 line_2 line_3 ord_1 ord_2 ord_3
1 1 23 63 62 1 3 2
2 2 8 25 12 1 3 2
3 3 21 25 10 2 3 1
4 4 45 24 56 2 1 3
... and so on, I think that paints the picture
I have reviewed the questions and answers about spreading with duplicate identifiers and using an index of row numbers, but they do not help.
I figure it has something to do with the double spreading, but I cannot work out how to fix it.
Thanks for your help.
A solution using the tidyverse, starting from your df. The key is to use summarise_all(funs(.[which(!is.na(.))])) to select the single non-NA value in each column.
library(tidyverse)
df2 <- df %>%
  gather(Lines, Value, -order) %>%
  group_by(order) %>%
  mutate(Rank = dense_rank(Value),
         RankOrder = paste0("ord_", row_number())) %>%
  spread(Lines, Value) %>%
  spread(RankOrder, Rank) %>%
  summarise_all(funs(.[which(!is.na(.))]))
df2
# A tibble: 7 x 7
order line_1 line_2 line_3 ord_1 ord_2 ord_3
<int> <dbl> <dbl> <dbl> <int> <int> <int>
1 1 23 63 62 1 3 2
2 2 8 25 12 1 3 2
3 3 21 25 10 2 3 1
4 4 45 24 56 2 1 3
5 5 68 48 67 3 1 2
6 6 31 24 25 3 1 2
7 7 24 63 35 1 3 2
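Note that funs() has since been deprecated (dplyr 0.8 and later); assuming a recent dplyr, the collapsing step can be written with a lambda instead:
summarise_all(~ .[which(!is.na(.))])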
Starting from df:
df %>%
  gather(headers, line, -order) %>%
  separate(headers, into = c('dummy', 'rn')) %>%
  select(-dummy) %>%
  group_by(order) %>%
  mutate(ord = rank(line, ties.method = 'first')) %>%
  {data.table::dcast(data.table::setDT(.), order ~ rn, value.var = c("line", "ord"))}
# order line_1 line_2 line_3 ord_1 ord_2 ord_3
#1: 1 23 63 62 1 3 2
#2: 2 8 25 12 1 3 2
#3: 3 21 25 10 2 3 1
#4: 4 45 24 56 2 1 3
#5: 5 68 48 67 3 1 2
#6: 6 31 24 25 3 1 2
#7: 7 24 63 35 1 3 2
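
With tidyr 1.0 or later, pivot_longer()/pivot_wider() handle several value columns at once and collapse the rows by order without the summarise_all() trick. A sketch (column names come out as value_line_1 ... ord_line_3 rather than the originals):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-order, names_to = "line", values_to = "value") %>%
  group_by(order) %>%
  mutate(ord = rank(value, ties.method = "first")) %>%
  ungroup() %>%
  pivot_wider(names_from = line, values_from = c(value, ord))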

How to mimic ROW_NUMBER() OVER(...) in R

To manipulate/summarize data over time, I usually use SQL ROW_NUMBER() OVER(PARTITION BY ...). I'm new to R, so I'm trying to recreate tables I would otherwise create in SQL. The sqldf package does not allow OVER clauses. Example table:
ID Day Person Cost
1 1 A 50
2 1 B 25
3 2 A 30
4 3 B 75
5 4 A 35
6 4 B 100
7 6 B 65
8 7 A 20
I want my final table to include the average of each person's previous 2 instances, for each day after that person's 2nd instance (day 4 for both):
ID Day Person Cost Prev2
5 4 A 35 40
6 4 B 100 50
7 6 B 65 90
8 7 A 20 35
I've been trying to play around with aggregate, but I'm not really sure how to partition or qualify the function. Ideally, I'd prefer not to rely on the fact that ID is sequential with the date (i.e. the original table could be rearranged into a random date order and the code would still work). Let me know if you need more details; thanks for your help!
You could lag zoo::rollapplyr with a width of 2. In dplyr,
library(dplyr)
df %>% arrange(Day) %>%       # sort
  group_by(Person) %>%        # set grouping
  mutate(Prev2 = lag(zoo::rollapplyr(Cost, width = 2, FUN = mean, fill = NA)))
#> Source: local data frame [8 x 5]
#> Groups: Person [2]
#>
#> ID Day Person Cost Prev2
#> <int> <int> <fctr> <int> <dbl>
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or all in dplyr,
df %>% arrange(Day) %>% group_by(Person) %>% mutate(Prev2 = (lag(Cost) + lag(Cost, 2)) / 2)
which returns the same thing. In base,
df <- df[order(df$Day), ]
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
  c(NA, zoo::rollapplyr(x, width = 2, FUN = mean, fill = NA)[-length(x)])
})
df
#> ID Day Person Cost Prev2
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or without zoo,
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
  (c(NA, x[-length(x)]) + c(NA, NA, x[-(length(x) - 1):-length(x)])) / 2
})
which does the same thing. If you want to remove the NA rows, tack on tidyr::drop_na(Prev2) or na.omit.
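
For the literal ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) pattern in the title, dplyr's row_number() inside a grouped mutate is the usual analogue; a minimal sketch:
library(dplyr)
df %>%
  arrange(Day) %>%
  group_by(Person) %>%            # PARTITION BY Person
  mutate(rn = row_number()) %>%   # ROW_NUMBER(), rows already ordered by Day
  ungroup()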
