R: Insert a value within a data frame

I have a fairly complex problem and hope someone can help. I want to copy a row value (i.e. the value of Player 1 or Player 2) into two other rows (those of Players 3 and 4) if and only if these players are in the same Treatment, Group and Period AND that player was actually picked (see column Player.Picked).
I know that with tidyverse I can group_by my columns of interest: Treatment, Group, and Period.
However, I am unsure how to proceed with the condition that Player.Picked is fulfilled, and then how to extract this value appropriately for Players 3 and 4 in the same treatment, group and period.
The column "extracted.Player1/2Value" should be the output. (I have manually provided the first four correct solutions.)
Any ideas? Help would be very much appreciated. Thanks a lot in advance!
df
T Player Group Player.Picked Period Player1/2Value extracted.Player1/2Value
1 1 6 1 1 10
1 2 6 1 1 9
1 3 5 2 1 NA -> 4
1 4 6 1 1 NA -> 10
1 5 3 1 1 NA
1 1 5 2 1 8
1 2 1 0 1 7
1 3 6 1 1 NA -> 10
1 4 2 2 1 NA
1 5 2 2 1 NA
1 1 1 0 1 7
1 2 2 2 1 11
1 3 3 1 1 NA
1 4 4 1 1 NA
1 5 4 1 1 NA
1 1 2 2 1 21
1 2 4 1 1 17
1 3 1 0 1 NA
1 4 5 2 1 NA -> 4
1 5 6 1 1 NA
1 1 3 1 1 12
1 2 3 1 1 15
1 3 4 1 1 NA
1 4 1 0 1 NA
1 5 1 0 1 NA
1 1 4 1 1 11
1 2 5 2 1 4
1 3 2 2 1 NA
1 4 3 1 1 NA
1 5 5 2 1 NA

I'm not sure if I understood the required logic; here I'm assuming that Player 5 always picks Player 1 or 2 per Group.
So, here is my go at this using library(data.table):
library(data.table)
DT <- data.table::data.table(
check.names = FALSE,
T = c(1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L),
Player = c(1L,2L,3L,
4L,5L,1L,2L,3L,4L,5L,1L,2L,3L,4L,5L,
1L,2L,3L,4L,5L,1L,2L,3L,4L,5L,1L,
2L,3L,4L,5L),
Group = c(6L,6L,5L,
6L,3L,5L,1L,6L,2L,2L,1L,2L,3L,4L,4L,
2L,4L,1L,5L,6L,3L,3L,4L,1L,1L,4L,
5L,2L,3L,5L),
Player.Picked = c(1L,1L,2L,
1L,1L,2L,0L,1L,2L,2L,0L,2L,1L,1L,1L,
2L,1L,0L,2L,1L,1L,1L,1L,1L,0L,0L,
1L,2L,2L,2L),
Period = c(1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L),
`Player1/2Value` = c(10L,9L,NA,
NA,NA,8L,7L,NA,NA,NA,7L,11L,NA,NA,
NA,21L,17L,NA,NA,NA,12L,15L,NA,NA,NA,
11L,4L,NA,NA,NA),
`extracted.Player1/2Value` = c(NA,NA,4L,
10L,NA,NA,NA,10L,NA,NA,NA,NA,NA,NA,
NA,NA,NA,NA,4L,NA,NA,NA,NA,NA,NA,NA,
NA,NA,NA,NA)
)
# Sort so that rows of the same T/Group/Period sit together
setorderv(DT, cols = c("T", "Group", "Period", "Player"))
# Player 5's pick per T/Group/Period
Player5PickedDT <- DT[Player == 5, Player.Picked, by = c("T", "Group", "Period")]
setnames(Player5PickedDT, old = "Player.Picked", new = "Player5Picked")
DT <- DT[Player5PickedDT, on = c("T", "Group", "Period")]
# Value of the picked player (1 or 2); groups where Player 5 picked 0 drop out here
extractedDT <- DT[Player == Player5Picked & Player5Picked > 0, `Player1/2Value`, by = c("T", "Group", "Period")]
setnames(extractedDT, old = "Player1/2Value", new = "extractedValue")
DT[, "Player5Picked" := NULL]
DT <- extractedDT[DT, on = c("T", "Group", "Period")]
# Keep the value for Players 3 and 4 only; NA_integer_ matches the integer column type
DT[, extractedValue := fifelse(Player %in% c(3, 4), yes = extractedValue, no = NA_integer_)]
setcolorder(DT, c("T", "Group", "Period", "Player", "Player.Picked", "Player1/2Value", "extracted.Player1/2Value", "extractedValue"))
DT
The resulting table differs from your expected result (compare extracted.Player1/2Value with extractedValue), but in my eyes it follows the logic you explained:
T Group Period Player Player.Picked Player1/2Value extracted.Player1/2Value extractedValue
1: 1 1 1 1 0 7 NA NA
2: 1 1 1 2 0 7 NA NA
3: 1 1 1 3 0 NA NA NA
4: 1 1 1 4 1 NA NA NA
5: 1 1 1 5 0 NA NA NA
6: 1 2 1 1 2 21 NA NA
7: 1 2 1 2 2 11 NA NA
8: 1 2 1 3 2 NA NA 11
9: 1 2 1 4 2 NA NA 11
10: 1 2 1 5 2 NA NA NA
11: 1 3 1 1 1 12 NA NA
12: 1 3 1 2 1 15 NA NA
13: 1 3 1 3 1 NA NA 12
14: 1 3 1 4 2 NA NA 12
15: 1 3 1 5 1 NA NA NA
16: 1 4 1 1 0 11 NA NA
17: 1 4 1 2 1 17 NA NA
18: 1 4 1 3 1 NA NA 11
19: 1 4 1 4 1 NA NA 11
20: 1 4 1 5 1 NA NA NA
21: 1 5 1 1 2 8 NA NA
22: 1 5 1 2 1 4 NA NA
23: 1 5 1 3 2 NA 4 4
24: 1 5 1 4 2 NA 4 4
25: 1 5 1 5 2 NA NA NA
26: 1 6 1 1 1 10 NA NA
27: 1 6 1 2 1 9 NA NA
28: 1 6 1 3 1 NA 10 10
29: 1 6 1 4 1 NA 10 10
30: 1 6 1 5 1 NA NA NA
T Group Period Player Player.Picked Player1/2Value extracted.Player1/2Value extractedValue
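
For readers who want to see the rule without the joins, the same logic can be sketched in base R. This is only an illustration under the answer's assumption that Player 5's pick decides which value is copied. The toy data, the column name Value (standing in for Player1/2Value) and the helper extract_picked are made up for this sketch.

```r
# Toy data: two groups of five players; `Value` stands in for `Player1/2Value`
toy <- data.frame(
  T = 1L,
  Group = rep(1:2, each = 5),
  Period = 1L,
  Player = rep(1:5, times = 2),
  Player.Picked = rep(c(2L, 0L), each = 5),
  Value = c(21L, 11L, NA, NA, NA, 7L, 7L, NA, NA, NA)
)

# Hypothetical helper: handles one T/Group/Period block at a time
extract_picked <- function(g) {
  picked <- g$Player.Picked[g$Player == 5][1]   # Player 5's pick (0 = nobody)
  value  <- g$Value[match(picked, g$Player)]    # value of the picked player, NA if none
  g$extractedValue <- ifelse(g$Player %in% c(3, 4) & picked > 0, value, NA_integer_)
  g
}

blocks <- split(toy, toy[c("T", "Group", "Period")], drop = TRUE)
result <- do.call(rbind, lapply(blocks, extract_picked))
result
```

In the first group Player 5 picked Player 2, so Players 3 and 4 receive 11; in the second group nobody was picked, so the new column stays NA throughout.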

Related

Filter to remove all rows before a particular value in a specific column, while this particular value occurs several times

I would like to filter out all rows before a particular value in a specific column. For example, in the data frame below, I would like to remove all rows before the "1" in column x, for every occurrence of "1". Please note that the value "1" repeats many times, and I want to remove the "NA" rows before each "1" in column x, within each group of column a.
Thanks
a b x
1 1 NA
1 2 NA
1 3 1
1 4 0
1 5 0
1 6 NA
1 7 NA
2 1 NA
2 2 NA
2 3 1
2 4 NA
2 5 0
2 6 0
2 7 NA
3 1 NA
3 2 NA
3 3 NA
3 4 NA
3 5 1
3 6 0
3 7 NA
the desired output would be like this:
a b x
1 3 1
1 4 0
1 5 0
1 6 NA
1 7 NA
2 3 1
2 4 NA
2 5 0
2 6 0
2 7 NA
3 5 1
3 6 0
3 7 NA
Does this solve your problem?
library(tidyverse)
dat <- read.table(text = "a b x
1 1 NA
1 2 NA
1 3 1
1 4 0
1 5 0
1 6 NA
1 7 NA
2 1 NA
2 2 NA
2 3 1
2 4 NA
2 5 0
2 6 0
2 7 NA
3 1 NA
3 2 NA
3 3 NA
3 4 NA
3 5 1
3 6 0
3 7 NA", header = TRUE)
dat %>%
group_by(a) %>%
filter(cummax(!is.na(x)) == 1)
#> # A tibble: 13 × 3
#> # Groups: a [3]
#> a b x
#> <int> <int> <int>
#> 1 1 3 1
#> 2 1 4 0
#> 3 1 5 0
#> 4 1 6 NA
#> 5 1 7 NA
#> 6 2 3 1
#> 7 2 4 NA
#> 8 2 5 0
#> 9 2 6 0
#> 10 2 7 NA
#> 11 3 5 1
#> 12 3 6 0
#> 13 3 7 NA
Created on 2021-12-07 by the reprex package (v2.0.1)
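
To see why the filter works, it may help to look at a single group as a plain vector. The sketch below uses base R only; the vectors x and x2 are made-up examples. Note that the filter in the answer keeps everything from the first non-NA value onwards, which coincides with "from the first 1" in the sample data. The second half shows a variant for the case where a 0 could come before the first 1, which is an assumption about the intent rather than something stated in the question.

```r
x <- c(NA, NA, 1, 0, 0, NA, NA)

# !is.na(x) is FALSE until the first non-missing value;
# cummax() then "latches" at 1 for the rest of the vector
seen <- cummax(!is.na(x))
seen
x[seen == 1]

# Variant: start at the first 1 rather than at the first non-NA value
x2 <- c(NA, 0, NA, 1, 0)
from_first_one <- cummax(!is.na(x2) & x2 == 1) == 1
x2[from_first_one]
```

Inside group_by(a) the same expression is evaluated once per group, which is why each group is trimmed independently.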

How to make the next number in a column a sequence in r

Sorry to bother everyone. I have been stuck on some code for data like this:
Student Number
1 NA
1 NA
1 1
1 1
2 NA
2 1
2 1
2 1
3 NA
3 NA
3 1
3 1
I tried using dplyr to group by student and to find a way so that every time it reads a 1, it adds it to a running count in that column, so that it would read as
Student Number
1 NA
1 NA
1 1
1 2
2 NA
2 1
2 2
2 3
3 NA
3 NA
3 1
3 2
etc
Thank you! It'd help with attendance.
data.table solution:
library(data.table)
setDT(df)
df[!is.na(Number),Number:=cumsum(Number),by=Student]
df
Student Number
<int> <int>
1 1 NA
2 1 NA
3 1 1
4 1 2
5 2 NA
6 2 1
7 2 2
8 2 3
9 3 NA
10 3 NA
11 3 1
12 3 2
Try using cumsum. Note that cumsum itself cannot ignore NA, so NA is replaced by 0 inside cumsum and then restored by adding 0 * Number:
library(dplyr)
df %>%
group_by(Student) %>%
mutate(n = cumsum(ifelse(is.na(Number), 0, Number)) + 0 * Number)
Student Number n
<int> <int> <dbl>
1 1 NA NA
2 1 NA NA
3 1 1 1
4 1 1 2
5 2 NA NA
6 2 1 1
7 2 1 2
8 2 1 3
9 3 NA NA
10 3 NA NA
11 3 1 1
12 3 1 2
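
The `+ 0 * Number` trick can be checked on plain vectors. The following is a small base R sketch using the data from the question; the helper name running_count is made up, and ave() stands in for the grouped mutate.

```r
student <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
number  <- c(NA, NA, 1, 1, NA, 1, 1, 1, NA, NA, 1, 1)

# Plain cumsum: a single NA makes every later element NA as well
cumsum(number[1:4])

# Hypothetical helper: running count that leaves missing rows missing
running_count <- function(v) {
  total <- cumsum(ifelse(is.na(v), 0, v))  # treat NA as 0 while summing
  total + 0 * v                            # 0 * NA is NA, so NA rows stay NA
}

# ave() applies the helper separately within each student
n <- ave(number, student, FUN = running_count)
n
```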

Calculate price differences in an unbalanced data set

I have an unbalanced data frame with dates, localities and prices. I would like to calculate the price differences between different localities by date. My data is unbalanced, and to get all price differences I am thinking of creating rows (localities) to balance the data.
My data looks like:
library(dplyr)
set.seed(123)
df= data.frame(date=(1:3),
locality= rbinom(21,3, 0.2),
price=rnorm(21, 50, 20))
df %>%
arrange(date, locality)
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 2 0 26.68910
9 2 1 100.56673
10 2 1 48.88628
11 2 1 48.29153
12 2 2 29.02214
13 2 2 45.68269
14 2 2 43.59887
15 3 0 60.98193
16 3 0 75.89527
17 3 0 43.30174
18 3 0 71.41221
19 3 0 33.62969
20 3 1 34.31236
21 3 1 23.76955
To get balanced data I am thinking of:
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 1 2 NA
9 1 2 NA
10 2 0 26.68910
10 2 0 NA
10 2 0 NA
11 2 1 100.56673
12 2 1 48.88628
13 2 1 48.29153
14 2 2 29.02214
15 2 2 45.68269
16 2 2 43.59887
etc...
Finally, to get the price differences between pairs of localities I am thinking of:
> date diff(price, 0-1) diff(price, 0-2) diff(price, 1-2)
1 1 60.07625-54.76426 60.07625-47.09213 etc...
2 1 35.32994-66.51080 35.32994-NA
3 1 63.69872-28.28602 63.69872-NA
You don't need to balance your data. If you use dcast, it will add the NAs for you.
First transform the data to show individual columns for each locality
library(data.table)
library(tidyverse)
setDT(df)
df[, rid := rowid(date, locality)]
df2 <- dcast(df, rid + date ~ locality, value.var = 'price')
# rid date 0 1 2
# 1: 1 1 60.07625 54.76426 47.09213
# 2: 1 2 26.68910 100.56673 29.02214
# 3: 1 3 60.98193 34.31236 NA
# 4: 2 1 35.32994 66.51080 NA
# 5: 2 2 NA 48.88628 45.68269
# 6: 2 3 75.89527 23.76955 NA
# 7: 3 1 63.69872 28.28602 NA
# 8: 3 2 NA 48.29153 43.59887
# 9: 3 3 43.30174 NA NA
# 10: 4 3 71.41221 NA NA
# 11: 5 3 33.62969 NA NA
Then create a data frame to_diff of differences to calculate, and pmap over that to calculate the differences. Here c0_1 corresponds to what you call in your question diff(price, 0-1).
to_diff <- CJ(0:2, 0:2)[V1 < V2]
pmap(to_diff, ~ df2[[as.character(.x)]] - df2[[as.character(.y)]]) %>%
setNames(paste0('c', to_diff[[1]], '_', to_diff[[2]])) %>%
bind_cols(df2[, 1:2])
# A tibble: 11 x 5
# c0_1 c0_2 c1_2 rid date
# <dbl> <dbl> <dbl> <int> <int>
# 1 5.31 13.0 7.67 1 1
# 2 -73.9 -2.33 71.5 1 2
# 3 26.7 NA NA 1 3
# 4 -31.2 NA NA 2 1
# 5 NA NA 3.20 2 2
# 6 52.1 NA NA 2 3
# 7 35.4 NA NA 3 1
# 8 NA NA 4.69 3 2
# 9 NA NA NA 3 3
# 10 NA NA NA 4 3
# 11 NA NA NA 5 3
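
The pairwise step can also be sketched without pmap, using combn() from base R to enumerate the locality pairs. The small wide table below is made up (rounded prices, two rows) and only stands in for df2; the column naming c0_1, c0_2, c1_2 follows the answer.

```r
# Made-up stand-in for the wide table: one column per locality
wide <- data.frame(`0` = c(60, 35), `1` = c(54, 66), `2` = c(47, NA),
                   check.names = FALSE)

# All unordered pairs of locality columns: ("0","1"), ("0","2"), ("1","2")
pairs <- combn(names(wide), 2)

# One difference column per pair; NA propagates where a locality is missing
diffs <- apply(pairs, 2, function(p) wide[[p[1]]] - wide[[p[2]]])
colnames(diffs) <- paste0("c", pairs[1, ], "_", pairs[2, ])
diffs
```

Because subtraction with NA yields NA, unbalanced dates need no explicit padding: the missing combinations simply show up as NA differences.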

Update all the records in between from NA to the value

I have a data table in the format below:
id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 NA
1 1 NA
1 1 11
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 12
From this data table I would like to update all the NA in between the two values in c2 as below:
id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 10
1 1 10
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 11
1 1 11
1 1 11
2 1 NA
2 1 12
2 1 12
2 1 12
2 1 12
You can do it using a for loop and which():
df=data.frame(id = c(rep(1,12)),c2 = c(NA,NA,10,NA,NA,10, NA,NA,11,NA,11,NA))
Find unique values of c2:
vals=unique(df[which(!is.na(df$c2)),'c2'])
Loop through unique values and replace observations between their first and last appearance:
for(i in vals){
df[min(which(df$c2==i)):max(which(df$c2==i)),'c2']=i
}
Besides David's approach which is working directly with row indices there is another data.table approach which uses a non-equi join:
# coerce to data.table
setDT(DT)[
# append unique row id
, rn := .I][
# non-equi join on row ids
DT[!is.na(c2), .(rmin = min(rn), rmax = max(rn)), by = c2],
on = .(rn >= rmin, rn <= rmax), c2 := i.c2][
# remove row id column
, rn := NULL][]
id c1 c2
1: 1 1 NA
2: 1 1 NA
3: 1 1 10
4: 1 1 10
5: 1 1 10
6: 1 1 10
7: 1 1 NA
8: 1 1 NA
9: 1 1 11
10: 1 1 11
11: 1 1 11
12: 1 1 11
13: 2 1 NA
14: 2 1 12
15: 2 1 12
16: 2 1 12
17: 2 1 12
Caveat
The expression
DT[!is.na(c2), .(rmin = min(rn), rmax = max(rn)), by = c2]
returns the row id ranges for each unique value of c2
c2 rmin rmax
1: 10 3 6
2: 11 9 12
3: 12 14 17
There is an implicit assumption that the row id ranges do not overlap, i.e. that each "gap" is associated with a unique c2 value. This also affects the other solutions.
Improved solution using rleid()
The code can be improved to handle cases where the above mentioned assumption is violated.
Using rleid(), we can distinguish different gaps even if they have the same c2 value. For instance, for the second sample data set
DT2[!is.na(c2), .(c2 = first(c2), rmin = min(rn), rmax = max(rn)), by = rleid(c2)]
rleid c2 rmin rmax
1: 1 10 3 6
2: 2 11 9 12
3: 3 12 14 17
4: 4 10 20 23
The complete code:
setDT(DT2)[, rn := .I][
DT2[!is.na(c2), .(c2 = first(c2), rmin = min(rn), rmax = max(rn)), by = rleid(c2)],
on = .(rn >= rmin, rn <= rmax), c2 := i.c2][, rn := NULL][]
id c1 c2
1: 1 1 NA
2: 1 1 NA
3: 1 1 10
4: 1 1 10
5: 1 1 10
6: 1 1 10
7: 1 1 NA
8: 1 1 NA
9: 1 1 11
10: 1 1 11
11: 1 1 11
12: 1 1 11
13: 2 1 NA
14: 2 1 12
15: 2 1 12
16: 2 1 12
17: 2 1 12
18: 2 1 NA
19: 2 1 NA
20: 2 1 10
21: 2 1 10
22: 2 1 10
23: 2 1 10
24: 2 1 NA
25: 2 1 NA
id c1 c2
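
The difference between grouping by value and grouping by run can be reproduced in base R, where rle() plays the role that rleid() plays above. This is only a sketch on a made-up vector, not a replacement for the non-equi join.

```r
c2 <- c(NA, 10, NA, 10, NA, NA, 11, NA, 11, NA, 10, NA, 10)

# Grouping by value alone: the row range for 10 swallows the range for 11
range(which(c2 == 10))   # 2 13
range(which(c2 == 11))   # 7 9

# Run-based version: equal non-NA values that follow each other form one run
idx    <- which(!is.na(c2))          # positions of the non-NA values
runs   <- rle(c2[idx])               # runs among the non-NA values only
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

filled <- c2
for (k in seq_along(runs$values)) {
  filled[idx[starts[k]]:idx[ends[k]]] <- runs$values[k]
}
filled
```

Like the rleid() version, this still treats two neighbouring gaps with the same value as a single run, so it relies on a different value appearing in between.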
Data
library(data.table)
DT <- fread("id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 NA
1 1 NA
1 1 11
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 12")
Expanded data set (note the repeated appearance of c2 == 10):
DT2 <- fread("id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 NA
1 1 NA
1 1 11
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 10
2 1 NA
2 1 NA
2 1 10
2 1 NA
2 1 NA")
Okay (new/edited answer), we can make use of the fact that the desired property of a solution is that filling up should yield the same result as filling down:
library(tidyverse)
df %>%
mutate(filled_down = c2, filled_up = c2) %>%
fill(filled_down, .direction="down") %>%
fill(filled_up, .direction="up") %>%
mutate(c2 = ifelse(filled_down == filled_up, filled_down, c2)) %>%
select(-filled_down, -filled_up)
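
The fill-down/fill-up idea can be checked on a plain vector. The sketch below re-implements both fills in base R; the helpers fill_down and fill_up are made up for illustration and are not the tidyr functions.

```r
# Hypothetical helpers: carry the last (or next) non-NA value along
fill_down <- function(v) {
  idx <- cumsum(!is.na(v))            # how many non-NA values seen so far
  c(NA, v[!is.na(v)])[idx + 1]        # 0 seen -> NA, otherwise the latest value
}
fill_up <- function(v) rev(fill_down(rev(v)))

c2 <- c(NA, NA, 10, NA, NA, 10, NA, NA, 11, NA, NA, 11)
down <- fill_down(c2)
up   <- fill_up(c2)

# A gap is filled only where both directions agree on the value
result <- ifelse(down == up, down, c2)
result
```

Where the two fills disagree (between the 10s and the 11s) or one of them is NA (leading rows), the original NA is kept.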

Count consecutive strings of zeroes and ones over multiple groups

There have been several discussions about counting consecutive strings of zeroes and ones (or other values) using functions like rle or cumsum. I have played around with these functions, but I can't easily figure out how to get them to apply to my specific problem.
I am working with ecological presence/absence data ("pres.abs" = 1 or 0) organized by time ("year") and location ("id"). For each location id, I would like to separately calculate the length of consecutive ones and zeroes through time. Where these cannot be calculated, I want to return "NA".
Below is a sample of what the data looks like (first 3 columns) and the output I am hoping to achieve (last 2 columns). Ideally, this would be a pretty fast function avoiding for-loops since the real data frame contains ~15,000 rows.
year = rep(1:10, times=3)
id = c(rep(1, times=10), rep(2, times=10), rep(3, times=10))
pres.abs.id.1 = c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1) #Pres/abs data at site 1 across time
pres.abs.id.2 = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0) #Pres/abs data at site 2 across time
pres.abs.id.3 = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1) #Pres/abs data at site 3 across time
pres.abs = c(pres.abs.id.1, pres.abs.id.2, pres.abs.id.3)
dat = data.frame(id, year, pres.abs)
dat$cumul.zeroes = c(1,2,3,NA,NA,NA,1,2,NA,NA,NA,NA,1,NA,1,2,NA,1,2,3,1,2,3,4,5,NA,NA,NA,NA,NA)
dat$cumul.ones = c(NA,NA,NA,1,2,3,NA,NA,1,2,1,2,NA,1,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,2,3,4,5)
> dat
id year pres.abs cumul.zeroes cumul.ones
1 1 1 0 1 NA
2 1 2 0 2 NA
3 1 3 0 3 NA
4 1 4 1 NA 1
5 1 5 1 NA 2
6 1 6 1 NA 3
7 1 7 0 1 NA
8 1 8 0 2 NA
9 1 9 1 NA 1
10 1 10 1 NA 2
11 2 1 1 NA 1
12 2 2 1 NA 2
13 2 3 0 1 NA
14 2 4 1 NA 1
15 2 5 0 1 NA
16 2 6 0 2 NA
17 2 7 1 NA 1
18 2 8 0 1 NA
19 2 9 0 2 NA
20 2 10 0 3 NA
21 3 1 0 1 NA
22 3 2 0 2 NA
23 3 3 0 3 NA
24 3 4 0 4 NA
25 3 5 0 5 NA
26 3 6 1 NA 1
27 3 7 1 NA 2
28 3 8 1 NA 3
29 3 9 1 NA 4
30 3 10 1 NA 5
Thanks very much for your help.
Here's a base R way using rle and sequence:
dat <- within(dat, {
cumul.counts <- unlist(lapply(split(pres.abs, id), function(x) sequence(rle(x)$lengths)))
cumul.zeroes <- replace(cumul.counts, pres.abs == 1, NA)
cumul.ones <- replace(cumul.counts, pres.abs == 0, NA)
rm(cumul.counts)
})
# id year pres.abs cumul.ones cumul.zeroes
# 1 1 1 0 NA 1
# 2 1 2 0 NA 2
# 3 1 3 0 NA 3
# 4 1 4 1 1 NA
# 5 1 5 1 2 NA
# 6 1 6 1 3 NA
# 7 1 7 0 NA 1
# 8 1 8 0 NA 2
# 9 1 9 1 1 NA
# 10 1 10 1 2 NA
# 11 2 1 1 1 NA
# 12 2 2 1 2 NA
# 13 2 3 0 NA 1
# 14 2 4 1 1 NA
# 15 2 5 0 NA 1
# 16 2 6 0 NA 2
# 17 2 7 1 1 NA
# 18 2 8 0 NA 1
# 19 2 9 0 NA 2
# 20 2 10 0 NA 3
# 21 3 1 0 NA 1
# 22 3 2 0 NA 2
# 23 3 3 0 NA 3
# 24 3 4 0 NA 4
# 25 3 5 0 NA 5
# 26 3 6 1 1 NA
# 27 3 7 1 2 NA
# 28 3 8 1 3 NA
# 29 3 9 1 4 NA
# 30 3 10 1 5 NA
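
To make the rle/sequence combination concrete, here are the intermediate values for a single site (site 1 from the question). This is just a walk-through of the pieces used above, in base R.

```r
x <- c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1)   # pres.abs at site 1

runs <- rle(x)
runs$lengths   # 3 3 2 2
runs$values    # 0 1 0 1

# sequence() restarts the count at 1 for every run
counts <- sequence(runs$lengths)
counts         # 1 2 3 1 2 3 1 2 1 2

# Blank out the rows that belong to the other state
zeroes <- replace(counts, x == 1, NA)
ones   <- replace(counts, x == 0, NA)
```

The split(pres.abs, id) in the answer applies exactly this per site, so a run never continues across two sites as long as the rows are sorted by id.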
Here's one option with dplyr:
require(dplyr)
dat %>%
group_by(id, x = cumsum(c(0,diff(pres.abs)) != 0)) %>%
mutate(cumul.zeros = ifelse(pres.abs, NA_integer_, row_number()),
cumul.ones = ifelse(!pres.abs, NA_integer_, row_number())) %>%
ungroup() %>% select(-x)
#Source: local data frame [30 x 5]
#
# id year pres.abs cumul.zeros cumul.ones
#1 1 1 0 1 NA
#2 1 2 0 2 NA
#3 1 3 0 3 NA
#4 1 4 1 NA 1
#5 1 5 1 NA 2
#6 1 6 1 NA 3
#7 1 7 0 1 NA
#8 1 8 0 2 NA
#9 1 9 1 NA 1
#10 1 10 1 NA 2
#11 2 1 1 NA 1
#12 2 2 1 NA 2
#13 2 3 0 1 NA
#14 2 4 1 NA 1
#15 2 5 0 1 NA
#16 2 6 0 2 NA
#17 2 7 1 NA 1
#18 2 8 0 1 NA
#19 2 9 0 2 NA
#20 2 10 0 3 NA
#21 3 1 0 1 NA
#22 3 2 0 2 NA
#23 3 3 0 3 NA
#24 3 4 0 4 NA
#25 3 5 0 5 NA
#26 3 6 1 NA 1
#27 3 7 1 NA 2
#28 3 8 1 NA 3
#29 3 9 1 NA 4
#30 3 10 1 NA 5
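
The grouping expression in the dplyr answer builds a run id from the changes in pres.abs. A base R sketch on made-up data shows what that id looks like and why id is still needed as a second grouping key; ave() stands in for the grouped mutate here.

```r
id       <- rep(1:2, each = 4)
pres.abs <- c(0, 0, 1, 1,  1, 0, 0, 0)   # the run of 1s crosses the site boundary

# A new run starts wherever the value changes from the previous row
run_id <- cumsum(c(0, diff(pres.abs)) != 0)
run_id   # 0 0 1 1 1 2 2 2

# Counting within id AND run_id restarts the count at the start of site 2
counts <- ave(pres.abs, id, run_id, FUN = seq_along)
counts   # 1 2 1 2 1 1 2 3
```

Without id in the grouping, the first row of site 2 would be counted as the third 1 of the run that began in site 1.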
