I have a data frame with one variable, x. I want to create a new variable y which is equal to 1 when x decreases by 2 from its previous value and equal to 0 otherwise. Then I want to create a variable z which holds the value of x when y was last equal to 1. I want the initial value of z to be 0. I haven't been able to figure out how to make z. Any advice?
Here's what I'm trying to obtain (but for about 1000 rows):
x y z
9 0 0
8 0 0
6 1 6
9 0 6
7 1 7
5 1 5
I've tried lag() and the cumulative functions in dplyr, to no avail.
library(dplyr)
library(tidyr)
df <- data.frame(x = c(9,8,6,10,9,7,5))
df %>%
  mutate(y = +(lag(x, default = x[1]) - x == 2),
         z = ifelse(cumsum(y) > 0 & y == 0, NA, x * y)) %>%
  fill(z, .direction = "down")
#> x y z
#> 1 9 0 0
#> 2 8 0 0
#> 3 6 1 6
#> 4 10 0 6
#> 5 9 0 6
#> 6 7 1 7
#> 7 5 1 5
Created on 2022-11-07 by the reprex package (v2.0.1)
One option:
df$y = 0L
df$y[-1] = (diff(df$x) == -2L)
df$z = data.table::nafill(ifelse(df$y == 1L, df$x, NA), "locf", fill = 0L)
# x y z
# 1 9 0 0
# 2 8 0 0
# 3 6 1 6
# 4 9 0 6
# 5 7 1 7
# 6 5 1 5
Reproducible data (please provide next time)
df = data.frame(x = c(9L,8L,6L,9L,7L,5L))
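Here's a simple way to do it using dplyr and tidyr.
library(dplyr)
library(tidyr)
tmp = data.frame(x = c(9,8,6,9,7,5))
tmp %>%
  # y is 1 when x dropped by exactly 2 relative to the previous row
  mutate(y = ifelse(lag(x) - x == 2, 1, 0)) %>%
  # keep x only where y == 1, then carry the last such value forward
  mutate(z = ifelse(y == 1, x, NA)) %>%
  fill(z) %>%
  # the first row's y and any z before the first drop become 0
  replace(is.na(.), 0)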
# output
# x y z
# 1 9 0 0
# 2 8 0 0
# 3 6 1 6
# 4 9 0 6
# 5 7 1 7
# 6 5 1 5
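For comparison, here is a base R sketch of the same idea. It assumes the six-row x from the question; idx is a helper name introduced here, not something from the answers above. cummax() over the row numbers where y == 1 gives, for each row, the position of the most recent drop (0 if there has been none yet), and that position is then used to look up x.

```r
x <- c(9, 8, 6, 9, 7, 5)

# y: 1 when x is exactly 2 below the previous value; the first row has no
# previous value, so it is set to 0
y <- c(0L, as.integer(diff(x) == -2))

# idx (hypothetical helper): row number of the latest y == 1 so far, 0 if none
idx <- cummax(seq_along(x) * y)

# prepend 0 so that idx == 0 maps to the required initial value of z
z <- c(0, x)[idx + 1]

data.frame(x, y, z)
```

Because everything is vectorised, this should scale to the roughly 1000 rows mentioned in the question without any loop.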
I have a table of 3 columns:
Start of range
End of range
Number assigned to all values within the range.
I want to create a table whose first column holds the values 1 to x (x being the largest end of range) and whose second column holds the number assigned to each value. Any unassigned values need to be set to 0.
E.g. original table:
start end value
1 4 -1
6 8 4
So the final table would be:
Number Value
1 -1
2 -1
3 -1
4 -1
5 0
6 4
7 4
8 4
But I have no idea where to start - any suggestions?
Thanks.
Does this do the trick? Starting from your example data:
library(dplyr)
a = data.frame(start= c(1,6),end=c(4,8),value=c(-1,4))
c= apply(a, 1,function(i){
  b = i[1]:i[2]
  return(as.data.frame(cbind(b, rep(i[3], length(b)))))
})
c = bind_rows(c, .id = "column_label")[,-1]
d= (c[1,1]:c[nrow(c),1])[!c[1,1]:c[nrow(c),1]%in%c$b]
d= cbind(d, rep(0, length(d)))
colnames(d)=colnames(c)
res = rbind(c,d)[order(rbind(c,d)[,1]),]
rownames(res)= 1:nrow(res)
colnames(res)=c('Number', 'Value')
res
output:
> res
Number Value
1 1 -1
2 2 -1
3 3 -1
4 4 -1
5 5 0
6 6 4
7 7 4
8 8 4
The obligatory data.table solution ;). A general solution can be obtained using foverlaps():
library(data.table)
data <- data.frame(start = c(1, 6), end= c(4, 8), value = c(-1, 4))
number <- data.frame(start = c(1:8), end = c(1:8))
setDT(data)
setDT(number)
setkey(data, start, end)
df <- foverlaps(number, data)[, c("i.start", "value"), with = FALSE]
df[is.na(value), value := 0]
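The foverlaps() answer does not show its result, so here is a small base R sketch that can be used to cross-check it. It assumes the ranges do not overlap and that every start is 1 or greater; ranges and res are names chosen for this illustration.

```r
ranges <- data.frame(start = c(1, 6), end = c(4, 8), value = c(-1, 4))

# start with every number from 1 to the largest end mapped to 0
res <- data.frame(Number = seq_len(max(ranges$end)), Value = 0)

# overwrite the positions covered by each range with that range's value
for (i in seq_len(nrow(ranges))) {
  res$Value[ranges$start[i]:ranges$end[i]] <- ranges$value[i]
}

res
```

The loop runs once per range rather than once per number, so it stays cheap as long as the number of ranges is modest.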
Here is a tidyverse solution:
library(dplyr)
library(tidyr)
df %>%
  group_by(start) %>%
  mutate(index = list(start:end)) %>%
  unnest(cols = c(index)) %>%
  ungroup() %>%
  complete(index = 1:max(index), fill = list(value = 0)) %>%
  select(Number=index, Value=value)
Number Value
<int> <dbl>
1 1 -1
2 2 -1
3 3 -1
4 4 -1
5 5 0
6 6 4
7 7 4
8 8 4
If you are looking for a generic solution, you can try this function
expand_integers <- function(start, end, value) {
  n <- end - start + 1L
  rng <- range(c(start, end))
  pos <- sequence(n, start - rng[[1L]] + 1L)
  val <- rep.int(value, n)
  data.frame(
    number = seq.int(rng[[1L]], rng[[2L]]),
    value = `[<-`(integer(rng[[2L]] - rng[[1L]] + 1L), pos, value = val)
  )
}
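It works for any integer start and end values (with start <= end), not just ranges beginning at 1, and it is fully vectorised, so there is no loop over the ranges. Note that the second (from) argument of sequence() requires R >= 4.0.0. Here is a simple test: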
df <- data.frame(start = c(4L, 10L), end = c(7L, 19L), value = c(-1L, 4L))
df
expand_integers(df$start, df$end, df$value)
Output
> df
start end value
1 4 7 -1
2 10 19 4
> expand_integers(df$start, df$end, df$value)
number value
1 4 -1
2 5 -1
3 6 -1
4 7 -1
5 8 0
6 9 0
7 10 4
8 11 4
9 12 4
10 13 4
11 14 4
12 15 4
13 16 4
14 17 4
15 18 4
16 19 4
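To see why expand_integers() works, the intermediate vectors can be computed by hand for the test data above. This is only a sketch: pos is built here with Map() and seq() rather than sequence(), which should give the same positions without relying on the from argument of sequence(), and out is a name introduced for the illustration.

```r
start <- c(4L, 10L); end <- c(7L, 19L); value <- c(-1L, 4L)

n   <- end - start + 1L            # length of each range: 4 and 10
rng <- range(c(start, end))        # overall span: 4 to 19

# positions of the covered numbers, counted from the start of the span
pos <- unlist(Map(seq, start, end)) - rng[1L] + 1L

# one value per covered position
val <- rep.int(value, n)

# a zero-filled vector over the whole span, with covered positions overwritten
out <- integer(rng[2L] - rng[1L] + 1L)
out[pos] <- val
out
```

The gap between the two ranges (8 and 9) is never touched, so it keeps the initial 0.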
In my data I have repeating entries in a column. What I'm trying to do is if an entry n is repeated more than 2 times within a column, then I want to replace that entry with n-(number_of_times_it_has_repeated - 2). For example, if my data looks like this:
df <- data.frame(
A = c(1,2,2,4,5,7,7,7,7,2,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> df
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
7 9
7 10
2 11
8 12
8 13
We can see that in df$A, 7 is repeated 4 times. If an entry is repeated more than 2 times, then I want to replace that entry. So in my example, the 1st and 2nd instances of the number 7 would remain unchanged. The 3rd instance of the number 7 would be replaced by 7 - (3-2). The 4th instance of the number 7 would be replaced by 7 - (4-2).
We can also see that in df$A, the number 2 is repeated 3 times. Using the same method, the 3rd instance of the number 2 would be replaced with 2 - (3-2).
As there are no repeating values in df$B, that column would remain unchanged.
For clarity, my expected result would be:
dfNew <- data.frame(
A = c(1,2,2,4,5,7,7,6,5,1,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> dfNew
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
6 9
5 10
1 11
8 12
8 13
Here's how you can do it for one column -
library(dplyr)
df %>%
  group_by(A) %>%
  transmute(A = A - c(rep(0, 2), row_number())[row_number()]) %>%
  ungroup
# A
# <dbl>
# 1 1
# 2 2
# 3 2
# 4 4
# 5 5
# 6 7
# 7 7
# 8 6
# 9 5
#10 1
#11 8
#12 8
To do it for all the columns you can use map_dfc -
purrr::map_dfc(names(df), ~{
  df %>%
    group_by(.data[[.x]]) %>%
    transmute(!!.x := .data[[.x]] - c(rep(0, 2), row_number())[row_number()])%>%
    ungroup
})
# A B
# <dbl> <dbl>
# 1 1 2
# 2 2 3
# 3 2 4
# 4 4 5
# 5 5 6
# 6 7 7
# 7 7 8
# 8 6 9
# 9 5 10
#10 1 11
#11 8 12
#12 8 13
The logic here is that, for each number, we subtract 0 from the first 2 occurrences, then subtract 1 from the 3rd, 2 from the 4th, and so on.
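As a small illustration of that indexing trick, here is what the offset vector looks like for a value that occurs four times (such as 7 in column A). n and offsets are names used only for this sketch; seq_len(n) plays the role that row_number() plays inside a group.

```r
n <- 4                                           # occurrences of the value
offsets <- c(rep(0, 2), seq_len(n))[seq_len(n)]  # pad with two zeros, keep first n
offsets                                          # 0 0 1 2
rep(7, n) - offsets                              # 7 7 6 5
```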
You can skip the ordering step if you don't need it. Here is my approach; if your data still contain duplicates after the changes, I can rework the answer into a function.
library(dplyr)
my_df <- data.frame(A = c(1,2,2,4,5,7,7,7,7,2,8,8),
                    B = c(2,3,4,5,6,7,8,9,10,11,12,13),
                    stringsAsFactors = FALSE)
my_df <- my_df[order(my_df$A, my_df$B),]
my_df$Id <- seq.int(from = 1, to = nrow(my_df), by = 1)
# rows that are the 3rd or later occurrence of a value, with A adjusted
my_temp <- my_df %>%
  group_by(A) %>%
  filter(n() > 2) %>%
  mutate(Count = seq.int(from = 1, to = n(), by = 1)) %>%
  filter(Count > 2) %>%
  ungroup() %>%  # ungroup first so the grouping column A can be modified
  mutate(A = A - (Count - 2))
my_var <- which(my_df$Id %in% my_temp$Id)
# swap the original rows for their adjusted versions
if (length(my_var)) {
  my_df <- my_df[-my_var,]
  my_df <- rbind(my_df, my_temp[, c("A", "B", "Id")])
}
my_df <- my_df[order(my_df$A, my_df$B),]
A base R option using ave + pmax + seq_along
list2DF(
  lapply(
    df,
    function(x) {
      x - ave(x, x, FUN = function(v) pmax(seq_along(v) - 2, 0))
    }
  )
)
gives
A B
1 1 2
2 2 3
3 2 4
4 4 5
5 5 6
6 7 7
7 7 8
8 6 9
9 5 10
10 1 11
11 8 12
12 8 13
I would like to compute a cumulative sum of the variable "nbre_lignes" and store the result in a variable named cumulative_sum. I managed to do it, but it's not automated.
Can someone help me to automate it?
library(FSA)
library(dplyr)
library(zoo)  # provides as.yearqtr()
months.numeric <- lubridate:::months.numeric
strwr <- function(str) gsub(" ", "\n", str)
waterfall <- data.frame(
  table = strwr(c(
    "Concaténation", "Doublons & NPI", "DGC & CAR-REU", "BDD",
    paste("Répondants",
          format(as.yearqtr(Sys.Date()-base::months(12)), "T%q"), "&",
          format(as.yearqtr(Sys.Date()-months(9)), "T%q-%y")),
    paste("Sollicités au ", format(as.yearqtr(Sys.Date()-months(3)), "T%q-%y")),
    "Exclusions ", "QD", "Cible Finale"
  )),
  nbre_lignes = c(638334, -362769, -17674, 41927, -1540, -20149, -300, -10, 19928)
)
#
waterfall$time <- 1:nrow(waterfall)
waterfall$flow <- factor(sign(waterfall$nbre_lignes))
waterfall$table <- factor(waterfall$table, levels = waterfall[["table"]])
b <- pcumsum(waterfall$nbre_lignes[1:3])
l <- pcumsum(waterfall$nbre_lignes[4:8])
cumulative_sum <- c(b,l, 0)
waterfall <- waterfall %>% cbind(cumulative_sum)
table nbre_lignes time flow cumulative_sum
1 Concaténation 638334 1 1 0
2 Doublons\n&\nNPI -362769 2 -1 638334
3 DGC\n&\nCAR-REU -17674 3 -1 275565
4 BDD 41927 4 1 0
5 Répondants\nT3\n&\nT4-18 -1540 5 -1 41927
6 Sollicités\nau\n\nT2-19 -20149 6 -1 40387
7 Exclusions\n -300 7 -1 20238
8 QD -10 8 -1 19938
9 Cible\nFinale 19928 9 1 0
We can form a grouping variable using cumsum(flow == 1) as shown:
waterfall %>%
  group_by(grp = cumsum(flow == 1)) %>%
  mutate(cumsum = lag(cumsum(nbre_lignes), default = 0 )) %>%
  ungroup %>%
  select(- grp)
giving:
# A tibble: 9 x 5
table nbre_lignes time flow cumsum
<fct> <dbl> <int> <fct> <dbl>
1 Concaténation 638334 1 1 0
2 "Doublons\n&\nNPI" -362769 2 -1 638334
3 "DGC\n&\nCAR-REU" -17674 3 -1 275565
4 BDD 41927 4 1 0
5 "Répondants\nT3\n&\nT4-18" -1540 5 -1 41927
6 "Sollicités\nau\n\nT2-19" -20149 6 -1 40387
7 "Exclusions\n" -300 7 -1 20238
8 QD -10 8 -1 19938
9 "Cible\nFinale" 19928 9 1 0
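The same grouping idea can be sketched in base R without FSA or dplyr. This assumes, as in the data above, that each positive nbre_lignes starts a new block; grp and prior are names introduced for the illustration. Within a block, the prior cumulative sum is simply the running sum minus the current value.

```r
nbre_lignes <- c(638334, -362769, -17674, 41927, -1540, -20149, -300, -10, 19928)

# a new block starts at every positive value
grp <- cumsum(nbre_lignes > 0)

# running sum within each block, excluding the current row
prior <- ave(nbre_lignes, grp, FUN = function(v) cumsum(v) - v)

prior
```

If the blocks are defined by something other than the sign of the value, only the line that builds grp would need to change.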
I have the following data frame ordered by name and time.
set.seed(100)
df <- data.frame('name' = c(rep('x', 6), rep('y', 4)),
                 'time' = c(rep(1, 2), rep(2, 3), 3, 1, 2, 3, 4),
                 'score' = c(0, sample(1:10, 3), 0, sample(1:10, 2), 0, sample(1:10, 2))
)
> df
name time score
1 x 1 0
2 x 1 4
3 x 2 3
4 x 2 5
5 x 2 0
6 x 3 1
7 y 1 5
8 y 2 0
9 y 3 5
10 y 4 8
In df$score there are zeros followed by an unknown number of actual values, i.e. df[1:4,], and sometimes there are overlapping df$name between two df$score == 0, i.e. df[6:7,].
I want to change df$time where df$score != 0. Specifically, I want to assign the time value of the closest upper row with df$score == 0 if df$name is matching.
The following code gives the correct output, but my data have millions of rows, so this solution is very inefficient.
score_0 <- append(which(df$score == 0), dim(df)[1] + 1)
for(i in 1:(length(score_0) - 1)) {
  df$time[score_0[i]:(score_0[i + 1] - 1)] <-
    ifelse(df$name[score_0[i]:(score_0[i + 1] - 1)] == df$name[score_0[i]],
           df$time[score_0[i]],
           df$time[score_0[i]:(score_0[i + 1] - 1)])
}
> df
name time score
1 x 1 0
2 x 1 4
3 x 1 3
4 x 1 5
5 x 2 0
6 x 2 1
7 y 1 5
8 y 2 0
9 y 2 5
10 y 2 8
Here score_0 gives the indices where df$score == 0. We see that df$time[2:4] are now all equal to 1, and that in df$time[6:7] only the first one changed, because the second has df$name == 'y' while the closest upper row with df$score == 0 has df$name == 'x'. The last two rows have also changed correctly.
You can do it like this:
library(dplyr)
df %>% group_by(name) %>% mutate(ID=cumsum(score==0)) %>%
  group_by(name,ID) %>% mutate(time = head(time,1)) %>%
  ungroup() %>% select(name,time,score) %>% as.data.frame()
# name time score
# 1 x 1 0
# 2 x 1 8
# 3 x 1 10
# 4 x 1 6
# 5 x 2 0
# 6 x 2 5
# 7 y 1 4
# 8 y 2 0
# 9 y 2 5
# 10 y 2 9
Solution using dplyr and data.table:
library(data.table)
library(dplyr)
df %>%
  mutate(
    chck = score == 0,
    chck_rl = ifelse(score == 0, lead(rleid(chck)), rleid(chck))) %>%
  group_by(name, chck_rl) %>% mutate(time = first(time)) %>%
  ungroup() %>%
  select(-chck_rl, -chck)
Output:
# A tibble: 10 x 3
name time score
<chr> <dbl> <int>
1 x 1 0
2 x 1 2
3 x 1 9
4 x 1 7
5 x 2 0
6 x 2 1
7 y 1 8
8 y 2 0
9 y 2 2
10 y 2 3
Solution only using data.table:
library(data.table)
setDT(df)[, chck_rl := ifelse(score == 0, shift(rleid(score == 0), type = "lead"),
                              rleid(score == 0))][, time := first(time), by = .(name, chck_rl)][, chck_rl := NULL]
Output:
name time score
1: x 1 0
2: x 1 2
3: x 1 9
4: x 1 7
5: x 2 0
6: x 2 1
7: y 1 8
8: y 2 0
9: y 2 2
10: y 2 3
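For reference, the grouping logic can also be written in base R with ave(). This is a sketch using the scores printed in the question (hard-coded here so that it does not depend on the random seed); grp and first_time are helper names introduced for the illustration. Rows that come before the first zero of their name get grp == 0 and keep their own time.

```r
df <- data.frame(
  name  = c(rep("x", 6), rep("y", 4)),
  time  = c(1, 1, 2, 2, 2, 3, 1, 2, 3, 4),
  score = c(0, 4, 3, 5, 0, 1, 5, 0, 5, 8)
)

# number of zeros seen so far within each name
grp <- ave(as.integer(df$score == 0), df$name, FUN = cumsum)

# time of the first row of each (name, grp) block; for grp > 0 that is the zero row
first_time <- ave(df$time, df$name, grp,
                  FUN = function(v) rep(v[1], length(v)))

# rows before the first zero of their name (grp == 0) stay unchanged
df$time <- ifelse(grp > 0, first_time, df$time)
df
```

Everything here is vectorised over the whole column, which is the property that matters for data with millions of rows, though it has not been benchmarked against the answers above.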
I have a dataset where each row contains the data needed for the combinevar function (package = fishmethods; combinevar combines info from two distributions to come up with the combined variance).
xbar1 = c(2,2,1,4,3)
xbar2 = c(0,0,0,0,0)
var1 = c(0,1,3,2,1)
var2 = c(0,0,0,0,0)
n1 = c(50,10,30,40,50)
n2 = c(3,4,50,32,20)
df <- data.frame(xbar1, xbar2, var1, var2, n1, n2)
xbar1 xbar2 var1 var2 n1 n2
2 0 0 0 50 3
2 0 1 0 10 4
1 0 3 0 30 50
4 0 2 0 40 32
3 0 1 0 50 20
How would I apply the function across the rows? I can do it in a for loop like this:
for (i in 1:nrow(df)) {
  combined_var <- combinevar(xbar = c(df$xbar1[i], df$xbar2[i]),
                             s_squared = c(df$var1[i], df$var2[i]),
                             n = c(df$n1[i], df$n2[i]))[2]
  print(combined_var)
}
[1] 0.2177068
[1] 1.571429
[1] 1.338608
[1] 5.104851
[1] 2.573499
But I'm sure there's a better way. I think I can probably do it with an apply function but I can't figure out how.
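You can use apply() over the rows; the function just needs to pick each row's values out by position:
library(fishmethods)
my_function <- function(vec) {
  # vec is one row of df: xbar1, xbar2, var1, var2, n1, n2
  combinevar(xbar = c(vec[1], vec[2]), s_squared = c(vec[3], vec[4]), n = c(vec[5], vec[6]))
}
# apply() returns a 2-row matrix here; the second row holds the combined variances
apply(df, 1, my_function)[2, ]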
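Another option is mapply(), which passes the columns in parallel and avoids the matrix conversion that apply() performs. Because fishmethods may not be installed everywhere, the sketch below uses combine_two(), a hypothetical stand-in written for this example from the usual combined mean and variance formula for two samples. It is not the fishmethods implementation, although it appears to reproduce the values printed by the loop in the question. In real use, the call to combinevar() (taking element [2]) would go in its place.

```r
df <- data.frame(xbar1 = c(2, 2, 1, 4, 3), xbar2 = 0,
                 var1  = c(0, 1, 3, 2, 1), var2  = 0,
                 n1 = c(50, 10, 30, 40, 50), n2 = c(3, 4, 50, 32, 20))

# hypothetical stand-in for combinevar(): combined variance of two samples
combine_two <- function(xbar, s_squared, n) {
  xc <- sum(n * xbar) / sum(n)                       # combined mean
  sum((n - 1) * s_squared + n * (xbar - xc)^2) / (sum(n) - 1)
}

# mapply() walks the six columns in parallel, one row at a time
df$combined_var <- mapply(
  function(x1, x2, v1, v2, n1, n2) combine_two(c(x1, x2), c(v1, v2), c(n1, n2)),
  df$xbar1, df$xbar2, df$var1, df$var2, df$n1, df$n2
)
df
```

Since each column is passed as its own argument, the columns keep their types, which matters if the data frame also contains non-numeric columns.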
We can nest the data by row and then map the function for each row.
library(tidyverse)
library(fishmethods)
df %>%
  rownames_to_column("row") %>%
  nest(-row) %>%
  mutate(combined_var = map(data, ~combinevar(xbar = c(.x$xbar1, .x$xbar2),
                                              s_squared = c(.x$var1, .x$var2),
                                              n = c(.x$n1, .x$n2))[2])) %>%
  unnest()
#> row combined_var xbar1 xbar2 var1 var2 n1 n2
#> 1 1 0.2177068 2 0 0 0 50 3
#> 2 2 1.5714286 2 0 1 0 10 4
#> 3 3 1.3386076 1 0 3 0 30 50
#> 4 4 5.1048513 4 0 2 0 40 32
#> 5 5 2.5734990 3 0 1 0 50 20
Or we can just apply the function rowwise
df %>%
  rowwise() %>%
  mutate(combined_var = combinevar(xbar = c(xbar1, xbar2),
                                   s_squared = c(var1, var2),
                                   n = c(n1, n2))[2])
#> Source: local data frame [5 x 7]
#> Groups: <by row>
#>
#> # A tibble: 5 x 7
#> xbar1 xbar2 var1 var2 n1 n2 combined_var
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 0 0 0 50 3 0.218
#> 2 2 0 1 0 10 4 1.57
#> 3 1 0 3 0 30 50 1.34
#> 4 4 0 2 0 40 32 5.10
#> 5 3 0 1 0 50 20 2.57
Created on 2018-08-19 by the reprex package (v0.2.0).