I am tinkering with this loop (I'm new to writing loops but trying to learn).
What I am aiming for: when x == 1, on the first match, store the value of z, then subtract each successive z value from that first value. If x == 0 it should do nothing (I'm not sure whether I have to tell the code explicitly to do nothing when x == 0).
This is my dummy data:
x <- c(0,0,1,1,1,0,1,1,1,1,0,0,0)
z <- c(10,34,56,43,56,98,78,98,23,21,45,65,78)
df <- data.frame(x,z)
for (i in 1:nrow(df)) {
if (df$x[i] == 1) {
first_price <- df$z[i]
df$output <- first_price - df$z
}
}
I have my condition, if (df$x == 1).
Then I want to save the first price, so first_price <- df$z[i].
The i in there means the first in the series, right?
Then for my output, I want to subtract each successive price from the first price. If I fix the first price with [i], is this the correct way? And if I leave df$z as it is, would that take the next price each time through the loop and subtract it from first_price <- df$z[i]?
Here's a visual:
******Progress****
> for (i in 1:nrow(df)) {
+ if (df$x[i] == 1) {
+ first_price <- df$z[1]
+ df$output <- first_price - df$z
+ }
+ }
> df$output
[1] 0 -24 -46 -33 -46 -88 -68 -88 -13 -17 -35 -55 -68
If I use [1], which picks the first element of df$z, this fixes the first element and subtracts each successive value from it. Now it needs to be rule-based, so that this only happens when df$x == 1.
This should work for you
library(dplyr)
library(data.table)
ans <- df %>%
mutate(originalrow = row_number()) %>% # original row position
group_by(rleid(x)) %>%
mutate(ans = first(z) - z) %>%
filter(x==1)
# # A tibble: 7 x 5
# # Groups: rleid(x) [2]
# x z originalrow `rleid(x)` ans
# <dbl> <dbl> <int> <int> <dbl>
# 1 1 56 3 2 0
# 2 1 43 4 2 13
# 3 1 56 5 2 0
# 4 1 78 7 4 0
# 5 1 98 8 4 -20
# 6 1 23 9 4 55
# 7 1 21 10 4 57
vans <- ans$ans
# [1] 0 13 0 0 -20 55 57
EDIT
To keep all rows and output 0 where x == 0:
ans <- df %>%
mutate(originalrow = row_number()) %>%
group_by(rleid(x)) %>%
mutate(ans = ifelse(x==0, 0, first(z) - z))
# # A tibble: 13 x 5
# # Groups: rleid(x) [5]
# x z originalrow `rleid(x)` ans
# <dbl> <dbl> <int> <int> <dbl>
# 1 0 10 1 1 0
# 2 0 34 2 1 0
# 3 1 56 3 2 0
# 4 1 43 4 2 13
# 5 1 56 5 2 0
# 6 0 98 6 3 0
# 7 1 78 7 4 0
# 8 1 98 8 4 -20
# 9 1 23 9 4 55
# 10 1 21 10 4 57
# 11 0 45 11 5 0
# 12 0 65 12 5 0
# 13 0 78 13 5 0
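For comparison with the loop in the question, here is a minimal base R sketch of the same rule written as an explicit loop. It assumes that "first match" means the first row of each consecutive run of x == 1, which is how the rleid answer above groups the data; the variable names follow the question.

```r
x <- c(0,0,1,1,1,0,1,1,1,1,0,0,0)
z <- c(10,34,56,43,56,98,78,98,23,21,45,65,78)
df <- data.frame(x, z)

df$output <- 0          # rows with x == 0 simply keep this default
first_price <- NA       # reference price of the current run of 1s

for (i in seq_len(nrow(df))) {
  if (df$x[i] == 1) {
    # a run starts on row 1 or right after a row where x == 0
    if (i == 1 || df$x[i - 1] == 0) {
      first_price <- df$z[i]
    }
    # assign to element i only, not to the whole column
    df$output[i] <- first_price - df$z[i]
  }
}

df$output
# [1]   0   0   0  13   0   0   0 -20  55  57   0   0   0
```

The two changes from the original attempt are that the reference price is stored only when a run begins, and that the result is written to df$output[i] rather than overwriting the whole column on every iteration.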
I want to remove all observations from data(ChickWeight) except those from day 10 or day 20, and I want to do it with R's logical operators: "&" (and) or "|" (or). Below is my code, but I get an error.
ChickWeight %>% slice(10|20)
We could concatenate the indexes into a vector with c() and use - to drop those observations in slice, which requires a numeric index:
library(dplyr)
ChickWeight %>%
slice(-c(10, 20))
filter, by contrast, expects a logical vector:
ChickWeight %>%
filter(!row_number() %in% c(10, 20))
If this is based on the 'Time' column, use either of the following:
ChickWeight %>%
slice(-which(Time %in% c(10, 20)))
ChickWeight %>%
filter(! Time %in% c(10, 20))
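Note that the snippets above drop the matching rows. The question asks for the opposite (keep only day 10 or day 20), so, assuming "day" refers to the Time column, a sketch of the keep-only variant is the same filter without the negation:

```r
library(dplyr)

# keep only the observations recorded on day 10 or day 20
kept <- ChickWeight %>%
  filter(Time %in% c(10, 20))

# the same selection spelled out with the logical "or" operator
kept_or <- ChickWeight %>%
  filter(Time == 10 | Time == 20)

unique(kept$Time)
```

The original slice(10|20) does not do this because 10|20 is evaluated first and collapses to a single TRUE, which is not a vector of row positions.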
Here is another option using filter:
ChickWeight %>%
filter(row_number() != 10 &
row_number() != 20)
# # A tibble: 576 × 4
#    weight  Time Chick Diet
#     <dbl> <dbl> <ord> <fct>
#  1     42     0 1     1
#  2     51     2 1     1
#  3     59     4 1     1
#  4     64     6 1     1
#  5     76     8 1     1
#  6     93    10 1     1
#  7    106    12 1     1
#  8    125    14 1     1
#  9    149    16 1     1
# 10    199    20 1     1
You can use subset,
ChickWeight |> subset(Time == 10 | Time == 20)
or with (same result)
ChickWeight[with(ChickWeight, Time == 10 | Time == 20), ]
# weight Time Chick Diet
# 6 93 10 1 1
# 11 199 20 1 1
# 18 103 10 2 1
# 23 209 20 2 1
# 30 99 10 3 1
# 35 198 20 3 1
# ...
or likewise a sequence if you aim for row numbers.
ChickWeight |> subset({m <- seq_len(nrow(ChickWeight)); m == 10 | m == 20})
ChickWeight[{m <- seq_len(nrow(ChickWeight)); m == 10 | m == 20}, ]
# weight Time Chick Diet
# 10 171 18 1 1
# 20 138 14 2 1
I have some values in
df:
# A tibble: 7 × 1
var1
<dbl>
1 0
2 10
3 20
4 210
5 230
6 266
7 267
that I would like to compare to a second dataframe called
value_lookup
# A tibble: 4 × 2
var1 value
<dbl> <dbl>
1 0 0
2 200 10
3 230 20
4 260 30
In particular, I would like to make a join based on >=, meaning that a value in df that is greater than or equal to a number in value_lookup$var1 gets the corresponding value. E.g. take the number 210 from the original data frame: since it is >= 200 and < 230, it gets a value of 10.
Here is the expected output:
var1 value
1 0 0
2 10 0
3 20 0
4 210 10
5 230 20
6 266 30
7 267 30
I thought it should be doable using {fuzzyjoin} but I cannot get it done.
value_lookup <- tibble(var1 = c(0, 200,230,260),
value = c(0,10,20,30))
df <- tibble(var1 = c(0,10,20,210,230,266,267))
library(fuzzyjoin)
fuzzyjoin::fuzzy_left_join(
x = df,
y = value_lookup ,
by = "var1",
match_fun = list(`>=`)
)
An option is also findInterval:
df$value <- value_lookup$value[findInterval(df$var1, value_lookup$var1)]
Output:
var1 value
1 0 0
2 10 0
3 20 0
4 210 10
5 230 20
6 266 30
7 267 30
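To see why this works, it may help to look at the intermediate index on its own. The sketch below uses plain vectors (breaks and vals are illustrative names, not from the question) holding the same numbers as value_lookup: findInterval returns, for each input, the position of the last break that is less than or equal to it.

```r
breaks <- c(0, 200, 230, 260)   # lower bounds, must be sorted increasingly
vals   <- c(0, 10, 20, 30)      # value attached to each lower bound
input  <- c(0, 10, 20, 210, 230, 266, 267)

idx <- findInterval(input, breaks)
idx
# [1] 1 1 1 2 3 4 4

vals[idx]
# [1]  0  0  0 10 20 30 30
```

An input below the smallest break would get index 0 and silently drop out of vals[idx], so this relies on the lookup starting at or below the minimum of the data.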
As you're mentioning joins, you could also do a rolling join via data.table with the argument roll = T (i.e. TRUE), which for each var1 in df looks up the same or the closest preceding var1 in value_lookup:
library(data.table)
setDT(value_lookup)[setDT(df), on = 'var1', roll = T]
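If you would rather stay in dplyr, a similar "closest preceding value" join can be sketched with join_by() and closest(). This assumes dplyr 1.1.0 or later, and the lookup column is renamed to lower here (a name made up for this example) so the two sides of the inequality are easy to tell apart.

```r
library(dplyr)

df <- tibble(var1 = c(0, 10, 20, 210, 230, 266, 267))
lookup <- tibble(lower = c(0, 200, 230, 260),   # renamed from var1 for clarity
                 value = c(0, 10, 20, 30))

# for each var1, keep only the closest lookup row with lower <= var1
res <- left_join(df, lookup, by = join_by(closest(var1 >= lower)))
res$value
```

Without closest(), the inequality join would return every lookup row whose lower bound is at or below var1, which is the same multiple-match behaviour the fuzzyjoin attempt in the question runs into.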
You can use cut:
df$value <- value_lookup$value[cut(df$var1,
c(value_lookup$var1, Inf),
right=F)]
# # A tibble: 7 x 2
# var1 value
# <dbl> <dbl>
# 1 0 0
# 2 10 0
# 3 20 0
# 4 210 10
# 5 230 20
# 6 266 30
# 7 267 30
I have a square matrix of flows (in tibble form) similar to the below example:
library(tibble)
set.seed(2019)
df1 <- as_tibble(matrix(sample(1:10,100,replace = T), nrow = 10, ncol = 10, byrow = TRUE,
dimnames = list(as.character(1:10),
as.character(1:10))))
df1
# `1` `2` `3` `4` `5` `6` `7` `8` `9` `10`
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 8 8 4 7 1 1 9 1 2 7
# 2 8 7 3 2 7 7 1 8 4 5
# 3 5 6 10 2 2 1 6 10 7 5
# 4 7 1 9 2 1 1 4 5 1 8
# 5 7 3 9 7 9 5 10 10 3 2
# 6 4 1 1 4 6 4 10 10 1 1
# 7 2 3 8 4 8 10 4 1 9 6
# 8 4 2 4 2 7 10 2 6 4 8
# 9 1 10 10 3 6 2 6 7 8 4
#10 6 8 9 3 6 9 5 10 4 10
I also have a lookup table that shows the broad groups that each flow subgroup fits into:
lookup <- tibble(sector = as.character(1:10),
aggregate_sector = c(rep('A',3), rep('B', 3), rep('C', 4)))
lookup
# sector aggregate_sector
#1 1 A
#2 2 A
#3 3 A
#4 4 B
#5 5 B
#6 6 B
#7 7 C
#8 8 C
#9 9 C
#10 10 C
I want to summarise my original df1 such that it represents the flows between each aggregate_sector (as per the lookup table) rather than each sector. Expected output:
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94
My initial attempt has been to convert into a matrix and then use a nested for loop to calculate the sum of flows for each aggregate_sector combination in turn:
mdat <- as.matrix(df1)
# replace row and column names with group names - assumes lookup is in same order as row and col names...
row.names(mdat) <- lookup$aggregate_sector
colnames(mdat) <- lookup$aggregate_sector
# pre-allocate an empty matrix
new_mat <- matrix(nrow = 3, ncol = 3, dimnames = list(LETTERS[1:3], LETTERS[1:3]))
# fill in matrix section by section
for(i in row.names(new_mat)){
for(j in colnames(new_mat)){
new_mat[i,j] <- sum(mdat[which(row.names(mdat) ==i), which(colnames(mdat) ==j)])
}
}
new_mat
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94
While this is a satisfactory solution, I wonder if there's a solution using dplyr or similar that uses nicer logic and saves me from having to convert my actual data (which is a tibble) into matrix form.
The key step is to gather; after that it is all straightforward dplyr stuff:
library(dplyr)
flow_by_sector <-
df1 %>%
mutate(sector_from = rownames(.)) %>%
tidyr::gather(sector_to, flow, -sector_from)
flow_by_sector_with_agg <-
flow_by_sector %>%
left_join(lookup, by = c("sector_from" = "sector")) %>%
rename(agg_from = aggregate_sector) %>%
left_join(lookup, by = c("sector_to" = "sector")) %>%
rename(agg_to = aggregate_sector)
flow_by_agg <-
flow_by_sector_with_agg %>%
group_by(agg_from, agg_to) %>%
summarise(flow = sum(flow))
tidyr::spread(flow_by_agg, agg_to, flow)
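gather and spread are superseded in current tidyr, so here is a sketch of the same pipeline with pivot_longer and pivot_wider. It runs on a made-up 3 x 3 flow table (flows and its lookup are illustrative, not the data from the question) so the result can be checked by hand.

```r
library(dplyr)
library(tidyr)
library(tibble)

# toy data: sectors 1 and 2 belong to group A, sector 3 to group B
flows  <- tibble(`1` = c(1, 2, 3), `2` = c(4, 5, 6), `3` = c(7, 8, 9))
lookup <- tibble(sector = c("1", "2", "3"),
                 aggregate_sector = c("A", "A", "B"))

agg <- flows %>%
  mutate(sector_from = as.character(row_number())) %>%
  pivot_longer(-sector_from, names_to = "sector_to", values_to = "flow") %>%
  left_join(lookup, by = c("sector_from" = "sector")) %>%
  rename(agg_from = aggregate_sector) %>%
  left_join(lookup, by = c("sector_to" = "sector")) %>%
  rename(agg_to = aggregate_sector) %>%
  group_by(agg_from, agg_to) %>%
  summarise(flow = sum(flow), .groups = "drop") %>%
  pivot_wider(names_from = agg_to, values_from = flow)

agg
# agg_from A = 12, B = 15 for row "A"; A = 9, B = 9 for row "B"
```

Like the gather version, this assumes the row order of the flow table matches the column order, since a tibble carries no row names of its own.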
Here's a base answer that uses stack and xtabs. It's not super robust - it assumes that the lookup table has the same columns and order as what would be expressed in the data.frame.
colnames(df1) <- lookup$aggregate_sector
xtabs(values ~ sector + ind
, dat = data.frame(sector = rep(lookup$aggregate_sector
, length(df1)), stack(df1))
)
Here's another way to do the data.frame:
xtabs(values ~ Var1 + Var2,
dat = data.frame(expand.grid(lookup$aggregate_sector, lookup$aggregate_sector)
, values = unlist(df1))
)
Var2
Var1 A B C
A 59 30 65
B 42 39 65
C 67 70 94
I actually figured out a matrix algebra alternative to my problem which is much faster, despite having to convert my data.frame into a matrix. I won't accept this solution as I did ask specifically for a dplyr answer, but I thought it interesting enough to post here anyway.
I first had to form an adjustment matrix, S, from my lookup table, where the locations of the ones in row i of S indicate which sectors of the original matrix are grouped together as sector i in the aggregated matrix:
library(dplyr)
library(tidyr) # for spread(); column_to_rownames() comes from tibble
S <- lookup %>% mutate(sector = as.numeric(sector), value = 1) %>%
spread(sector, value) %>%
column_to_rownames('aggregate_sector') %>%
as.matrix()
S[is.na(S)] <- 0
S
# 1 2 3 4 5 6 7 8 9 10
#A 1 1 1 0 0 0 0 0 0 0
#B 0 0 0 1 1 1 0 0 0 0
#C 0 0 0 0 0 0 1 1 1 1
Then, I convert my original data.frame, df1, into a matrix x and simply calculate S x S' (S times x times the transpose of S):
x <- as.matrix(df1)
S %*% x %*% t(S)
# A B C
#A 59 30 65
#B 42 39 65
#C 67 70 94
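To make the S x S' step easier to follow, here is a small base R sketch on a 4 x 4 matrix with two groups. The names groups, x_small and S_small are made up for this illustration, and S_small is built with sapply instead of the spread pipeline above.

```r
groups  <- c("A", "A", "B", "B")   # group of each row/column, in order
x_small <- matrix(1:16, nrow = 4)  # filled column by column

# one row per group, with a 1 wherever a sector belongs to that group
S_small <- t(sapply(unique(groups), function(g) as.numeric(groups == g)))
S_small
#   [,1] [,2] [,3] [,4]
# A    1    1    0    0
# B    0    0    1    1

agg_small <- S_small %*% x_small %*% t(S_small)
agg_small
#    A  B
# A 14 46
# B 22 54
```

Pre-multiplying by S_small sums the rows within each group, and post-multiplying by its transpose sums the columns, so each cell is the block total, e.g. 14 = 1 + 2 + 5 + 6 for the A to A block.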
I have been given a data set called ChickWeight, which has the weights of chicks over a time period. I need to introduce a new variable that measures the current weight difference compared to day 0. The data set is in library(datasets), so you should have it.
library(dplyr)
weightgain <- ChickWeight %>%
group_by(Chick) %>%
filter(any(Time == 21)) %>%
mutate(weightgain = weight - first(weight))
I have this code, but it just subtracts 42, the weight at time 0 for chick 1, from every weight. I need each chick's own weight at time 0 to be subtracted so that the weightgain column is correct.
We could do
library(dplyr)
ChickWeight %>%
group_by(Chick) %>%
mutate(weightgain = weight - weight[Time == 0])
#Or mutate(weightgain = weight - first(weight))
# A tibble: 578 x 5
# Groups: Chick [50]
# weight Time Chick Diet weightgain
# <dbl> <dbl> <ord> <fct> <dbl>
# 1 42 0 1 1 0
# 2 51 2 1 1 9
# 3 59 4 1 1 17
# 4 64 6 1 1 22
# 5 76 8 1 1 34
# 6 93 10 1 1 51
# 7 106 12 1 1 64
# 8 125 14 1 1 83
# 9 149 16 1 1 107
#10 171 18 1 1 129
# … with 568 more rows
Or using base R ave
with(ChickWeight, ave(weight, Chick, FUN = function(x) x - x[1]))
I have a function that returns a list. I'm using mutate to add columns to a data frame that correspond to its output. The calculation is rather involved, so I would prefer to call the function only once. I'm rather new to R and dplyr and cannot figure out a more efficient way of doing this.
Here is a very simple example of what I am doing now.
library(dplyr)
testFun <- function(x,z)
{
list(x2=x*x + z, x3=x*x*x + z)
}
have <- data.frame(x=seq(1:10),y=1,z=0)
want <- have %>%
dplyr::mutate(x2=testFun(x,z)$x2,
x3=testFun(x,z)$x3)
How can I do this more efficiently?
With the purrr package you can solve this problem like this:
library(purrr)
library(dplyr)
library(tidyr) # for unnest()
testFun <- function(x,z) {
tibble(x2=x*x + z, x3=x*x*x + z)
}
have %>%
mutate(new_x = map2(x, z, testFun)) %>%
unnest(new_x)
# x y z x2 x3
# 1 1 1 0 1 1
# 2 2 1 0 4 8
# 3 3 1 0 9 27
# 4 4 1 0 16 64
# 5 5 1 0 25 125
# 6 6 1 0 36 216
# 7 7 1 0 49 343
# 8 8 1 0 64 512
# 9 9 1 0 81 729
# 10 10 1 0 100 1000
Note that I changed the output of your function from a list to a tibble.
We can use pmap
library(purrr)
library(dplyr)
pmap_dfr(have %>%
select(x, z), testFun) %>%
bind_cols(have, .)
# x y z x2 x3
#1 1 1 0 1 1
#2 2 1 0 4 8
#3 3 1 0 9 27
#4 4 1 0 16 64
#5 5 1 0 25 125
#6 6 1 0 36 216
#7 7 1 0 49 343
#8 8 1 0 64 512
#9 9 1 0 81 729
#10 10 1 0 100 1000
Or, if we can change the function to return quoted expressions (quote or quo), this becomes easier:
testFun <- function(x,z){
list(x2= quo(x*x + z), x3= quo(x*x*x + z))
}
have %>%
mutate(!!! testFun(x, z))
# x y z x2 x3
#1 1 1 0 1 1
#2 2 1 0 4 8
#3 3 1 0 9 27
#4 4 1 0 16 64
#5 5 1 0 25 125
#6 6 1 0 36 216
#7 7 1 0 49 343
#8 8 1 0 64 512
#9 9 1 0 81 729
#10 10 1 0 100 1000
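If the function is already vectorised, as the testFun from the question is, another option is to call it once on whole columns and bind the result. The sketch below shows a base R version and, assuming dplyr 1.0.0 or later (where an unnamed data frame inside mutate is unpacked into columns), a dplyr version; want_base and want_dplyr are names chosen for this example.

```r
library(dplyr)

testFun <- function(x, z) {
  list(x2 = x * x + z, x3 = x * x * x + z)
}
have <- data.frame(x = 1:10, y = 1, z = 0)

# base R: one call on the full columns, then bind the list as new columns
res <- testFun(have$x, have$z)
want_base <- cbind(have, as.data.frame(res))

# dplyr: one call inside mutate; the unnamed data frame becomes columns x2, x3
want_dplyr <- have %>%
  mutate(as.data.frame(testFun(x, z)))

want_base$x3
# [1]    1    8   27   64  125  216  343  512  729 1000
```

Either way the function runs once for the whole data frame rather than once per output column; this does not apply if the real function only accepts one row at a time, in which case the map2 or pmap approaches above are the better fit.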
I might have missed something really obvious here, but you seem to be running the function twice to produce two answers. To keep things really simple to begin with, try:
library(dplyr)
have <- data.frame(x=seq(1:10),y=1,z=0)
want <- have %>%
dplyr::mutate(x2 = (x * 2 + z),
x3 = (x * 3 + z))
Does that help? Or has your example simplified out what you were trying to achieve?
Using a different function for the mutate you should be able to do:
library(dplyr)
createMultiX <- function(inputX, inputZ, multiplier) {
inputX * multiplier + inputZ
}
have <- data.frame(x=seq(1:10),y=1,z=0)
want <- have %>%
dplyr::mutate(x2 = createMultiX(x, z, 2),
x3 = createMultiX(x, z, 3))
(Apologies in advance as I've written this blindly without access to an R terminal so hope it works first time without typos!)