Obtaining data from a different row of data frame - r

I have a large data frame. Each row has data for a specific date. The next group of columns has stock prices. Each column represents one stock. I then have offset columns, one for each stock column. I have to offset the current row by the offset amount. Then I put the prices found in the first group of columns (but now using the offset row) in the last group of columns, which start out as NA.
For example, the value in row 1, col 3 is 1, so I need to offset the first row by 1. That gives me row 2. I need to get the price, p1, that is in row 2, col 1. That value is 2. The value 2 is then placed into row 1, col 5.
I theoretically solved the problem with a double loop, but the code was hopelessly slow. I was able to eliminate one loop. Can someone please help me eliminate the remaining loop?
Below is my code as well as the data frame before and after the code runs. Note that in the sample, I omitted the dates as they are not needed.
p1 = 1:1000000
p2 = 11:1000010
of1 = c(rep(1, 100000), rep(2, 800000), rep(0, 100000) )
of2 = c(rep(2,100000),rep(1,800000), rep(0, 100000) )
DF1 = data.frame(p1 = p1, p2 = p2, of1 = of1, of2 = of2)
DF1$newPrice1 = rep(NA, 1000000)
DF1$newPrice2 = rep(NA, 1000000)
head(DF1)
p1 p2 of1 of2 newPrice1 newPrice2
1 1 11 1 2 NA NA
2 2 12 1 2 NA NA
3 3 13 1 2 NA NA
4 4 14 1 2 NA NA
5 5 15 1 2 NA NA
6 6 16 1 2 NA NA
for(j in 1:2) {
DF1[j+4] = DF1[DF1[,j+2] + row(DF1)[,j], j]
}
head(DF1)
p1 p2 of1 of2 newPrice1 newPrice2
1 1 11 1 2 2 13
2 2 12 1 2 3 14
3 3 13 1 2 4 15
4 4 14 1 2 5 16
5 5 15 1 2 6 17
6 6 16 1 2 7 18

DF1$np1 <- DF1$p1[seq_along(DF1$p1) + DF1$of1]
DF1$np2 <- DF1$p2[seq_along(DF1$p2) + DF1$of2]
identical(DF1$np1, DF1$newPrice1)
#[1] TRUE
identical(DF1$np2, DF1$newPrice2)
#[1] TRUE
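The same indexing idea can probably be extended to any number of stock columns at once with matrix indexing. The following is only a minimal sketch on a small made-up data set: the `prices` and `offsets` objects and their values are assumptions for illustration, not the asker's data, and it assumes every offset keeps the target row within the bounds of the data.

```r
# Hypothetical small data mirroring the question's layout (values are made up)
prices  <- data.frame(p1 = 1:6, p2 = 11:16)
offsets <- data.frame(of1 = c(1, 1, 2, 0, 1, 0), of2 = c(2, 1, 1, 2, 0, 0))

off_mat <- as.matrix(offsets)

# Target row for every cell: current row number plus that cell's offset
target_row <- row(off_mat) + off_mat
# Column index for every cell (stock 1, stock 2, ...)
target_col <- col(off_mat)

# A two-column index matrix picks exactly one (row, column) cell per entry
picked <- as.matrix(prices)[cbind(c(target_row), c(target_col))]

new_prices <- matrix(picked, nrow = nrow(prices),
                     dimnames = list(NULL, c("newPrice1", "newPrice2")))
new_prices
```

Each cell of `new_prices` holds the price found that many rows further down in the same stock column, so no loop over columns is needed.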

Related

R Fill backwards with flexible window based on number of rows in a separate column

I am trying to carry a value in one column backwards by a number of rows given in a second column and fill everything in between.
So column y mainly has 1s in it but might have individual numbers up to about 20 (in my real data, up to 3 in my example below). If the number in y is 20, I need the 19 rows before that row and that row itself to equal the value of x for the row where y is 20. If the value in y is 1 the output will just equal x.
y also has many NAs. These are either legitimate NAs, where I want an NA output, or placeholders where the filling should occur if a later y value is > 1.
I thought I could use dplyr::lead, but it cannot take a variable n to look forwards a different number of steps for each row, and it wouldn't fill in between. I also wondered about making a new, always-increasing column and using RcppRoll::roll_max, but I hit similar problems with the flexible window size.
Typically the y values in the lead-up to a y > 1 will be 0 or NA, but if there are conflicts I want to adopt the later value. For example, in row 8 of my data frame y is 1, followed by y = 2 in row 9, so I want the value associated with row 9 in both rows. If y is NA and the row is not covered by filling backwards, I want it to remain NA (or 0 would be fine).
Thanks for any thoughts
set.seed(1)
test <- data.frame(x = sample(1:15,replace = F), y = c(NA,NA,NA,1,NA,NA,3,1,2,1,1,NA,NA,NA,2))
desired_out <- test
desired_out$out <- c(NA,NA,NA,1,11,11,11,8,8,12,5,NA,NA,14,14)
desired_out
#> x y out
#> 1 9 NA NA
#> 2 4 NA NA
#> 3 7 NA NA
#> 4 1 1 1
#> 5 2 NA 11
#> 6 13 NA 11
#> 7 11 3 11
#> 8 3 1 8
#> 9 8 2 8
#> 10 12 1 12
#> 11 5 1 5
#> 12 6 NA NA
#> 13 15 NA NA
#> 14 10 NA 14
#> 15 14 2 14
# try adopting @Sirius's answer, written before I specified the extra NAs
test$y <- ifelse(is.na(test$y),0,test$y)
test$out <- with( test, rep( x, y ) )
#> Error in `$<-.data.frame`(`*tmp*`, out, value = c(1L, 11L, 11L, 11L, 3L, : replacement has 11 rows, data has 15
Created on 2021-04-08 by the reprex package (v0.3.0)
Things got a bit complex, but essentially: calculate all the repeated x's for each y > 0, and then let subsequent x's overwrite earlier ones.
set.seed(1)
test <- data.frame(x = sample(1:15,replace = F), y = c(NA,NA,NA,1,NA,NA,3,1,2,1,1,NA,NA,NA,2))
desired_out <- test
desired_out$out <- c(NA,NA,NA,1,11,11,11,8,8,12,5,NA,NA,14,14)
desired_out
library(dplyr)
library(magrittr) # provides the %<>% assignment pipe
test %<>% mutate( id = seq(n()) ) %>%
filter( !is.na(y) & y != 0 ) %>%
group_by(id) %>%
slice( rep(1,y) ) %>%
mutate( id = rev( max(id)+1-1:n() ) ) %>%
group_by(id) %>%
summarize( out = as.numeric(last(x)) ) %>%
right_join( test %>% mutate( id=seq(n()) ) ) %>%
arrange( id ) %>% select( -id ) %>% relocate( x, y, out )
identical( as.data.frame(test), desired_out ) ## TRUE
test
Output:
> test
# A tibble: 15 x 3
x y out
<int> <dbl> <dbl>
1 9 NA NA
2 4 NA NA
3 7 NA NA
4 1 1 1
5 2 NA 11
6 13 NA 11
7 11 3 11
8 3 1 8
9 8 2 8
10 12 1 12
11 5 1 5
12 6 NA NA
13 15 NA NA
14 10 NA 14
15 14 2 14
What the algorithm does, which after a few piped lines is no longer very clear, is the following:
1. temporarily add id as the original row number
2. take away 0 and NA rows for y
3. repeat each row y times
4. within each such repeated row, create a new id that counts backwards (these will be the new row numbers for the x-values to go)
5. group by id again, this time to let later values overwrite earlier values (so keep only the highest row number for any collision)
6. join these data back on the original data, using the newly calculated row numbers; repeated x's will now be inserted
7. sort and clean up
Sequencing and indexing to the rescue:
test$rn <- seq_len(nrow(test))
src <- with(test[!is.na(test$y),],
list(val = rep(x,y), idx = rep(rn,y) - sequence(y) + 1) )
test$out[src$idx] <- src$val
test$rn <- NULL
# x y out
#1 9 NA NA
#2 4 NA NA
#3 7 NA NA
#4 1 1 1
#5 2 NA 11
#6 13 NA 11
#7 11 3 11
#8 3 1 8
#9 8 2 8
#10 12 1 12
#11 5 1 5
#12 6 NA NA
#13 15 NA NA
#14 10 NA 14
#15 14 2 14
I'm generating a row number, getting the row numbers prior to the key rows, and then overwriting those rows with repeats of the selected rows. Sometimes they specify the same location, but the later value will be taken as you can see in the output.
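To make the index arithmetic concrete, here is a small sketch using just the two colliding key rows from the example (row 8 with y = 1 and row 9 with y = 2). The standalone vectors `rn`, `x`, `y` and `out` are set up here only for illustration.

```r
# Two key rows taken from the example data: row 8 (x = 3, y = 1) and row 9 (x = 8, y = 2)
rn <- c(8, 9)
x  <- c(3, 8)
y  <- c(1, 2)

val <- rep(x, y)                      # values to write: 3 8 8
idx <- rep(rn, y) - sequence(y) + 1   # rows to write them to: 8 9 8

out <- rep(NA_real_, 10)
out[idx] <- val   # row 8 is targeted twice; the later assignment (8) is kept
out[8:9]          # 8 8
```

Because assignment with repeated indices keeps the last value written, the value from row 9 ends up in both rows 8 and 9, which is the "later value wins" behaviour the question asks for.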
Should be pretty efficient as everything is vectorised and there's only one major assignment operation back to the original dataset for updating all the rows at once. Here's 4.5M rows processed in a fraction of a second:
test <- test[rep(1:15, 3e5),]
system.time({
test$rn <- seq_len(nrow(test))
src <- with(test[!is.na(test$y),],
list(val = rep(x,y), idx = rep(rn,y) - sequence(y) + 1) )
test$out[src$idx] <- src$val
test$rn <- NULL
})
# user system elapsed
# 0.28 0.00 0.28

how to subset every 6 rows in R?

I need to subset my data six rows at a time. How can I do that in R?
data:
col1 : 1,2,3,4,5,6,7,8,9,10
col2 : a1,a2,a3,a4,a5,a6,a7,a8,a9,a10
I want subsets of six rows each: the first subset should contain rows 1:6 and the next should contain rows 7:nrow(data). I have tried using the seq function:
seqData <- seq(1,nrow(data),6)
Output: this gives only rows 1 and 7, but I want rows 1 to 6 first and then rows 7 to nrow(data).
How can I get output like that?
Will this work:
set.seed(1)
dat <- data.frame(c1 = sample(1:5,12,T),
c2 = sample(1:5,12,T))
dat
c1 c2
1 1 2
2 4 2
3 1 1
4 2 5
5 5 5
6 3 1
7 2 1
8 3 5
9 3 5
10 1 2
11 5 2
12 5 1
split(dat, rep(seq_len(ceiling(nrow(dat) / 6)), each = 6, length.out = nrow(dat)))
$`1`
c1 c2
1 1 2
2 4 2
3 1 1
4 2 5
5 5 5
6 3 1
$`2`
c1 c2
7 2 1
8 3 5
9 3 5
10 1 2
11 5 2
12 5 1
The function below creates a numeric vector of integers that increase by 1 every n rows, and then uses this vector to split the data as needed.
data <- data.frame(col1 = 1:10, col2 = paste0("a", 1:10))
split_nrows <- function(x, n){
f <- c(1, rep(0, n - 1))
f <- rep(f, length.out = NROW(x))
f <- cumsum(f)
split(x, f)
}
split_nrows(data, 6)
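To see what the grouping vector looks like for the 10-row case in the question, the steps inside the function can be run on their own. This is a small sketch; the names `n`, `n_rows`, `g` and `sizes` are introduced here for illustration, and the integer-division variant is an alternative that should give the same groups rather than part of the answer above.

```r
n <- 6        # rows per chunk
n_rows <- 10  # number of rows in the question's data

# Same steps as in split_nrows(): mark the first row of each chunk, then accumulate
f <- cumsum(rep(c(1, rep(0, n - 1)), length.out = n_rows))
f  # 1 1 1 1 1 1 2 2 2 2

# Alternative grouping vector built with integer division
g <- (seq_len(n_rows) - 1) %/% n + 1

# Chunk sizes: the last chunk simply holds the remaining rows
sizes <- sapply(split(seq_len(n_rows), f), length)
sizes  # 6 4
```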
Here's a simple example with mtcars that yields a list of 6 subset dfs.
nrows <- nrow(mtcars)
breaks <- seq(1, nrows, 6)
listdfs <- lapply(breaks, function(x) mtcars[x:(x+5), ]) # increment by 5 not 6
listdfs[[6]] <- listdfs[[6]][1:2, ] #last df: remove 4 NA rows (36 - 32)

R - delete consecutive (ONLY) duplicates

I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive.
For example, for the following data frame:
df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)
x y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9
I would need to eliminate rows with consecutive repeated values in the x column, keep the last repeated row, and maintain the structure of the data frame:
x y z
1 30 3
2 49 5
4 13 6
2 49 8
1 30 9
Following directions from help and some other posts, I have tried using the duplicated function:
df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
x y z
1 1 10 1
6 4 13 6
7 2 12 7
9 1 30 9
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
Not sure why I get the NA rows at the end (wasn't happening with a similar table I was testing), but works only partially on the values.
I have also tried using the data.table package as follows:
library(data.table)
dt <- as.data.table(df)
setkey(dt, x)
dt[J(unique(x)), mult ='last']
Works great, but it eliminates ALL duplicates from the data frame, not just those that are consecutive, giving something like this:
x y z
1 30 9
2 49 8
4 13 6
Please, forgive if cross-posting. I tried some of the suggestions but none worked for eliminating only those that are consecutive.
I would appreciate any help.
Thanks
How about:
df[cumsum(rle(df$x)$lengths),]
Explanation:
rle(df$x)
gives you the run lengths and values of consecutive duplicates in the x variable. Then:
rle(df$x)$lengths
extracts the lengths. Finally:
cumsum(rle(df$x)$lengths)
gives the row indices which you can select using [.
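As a quick check of the explanation, the intermediate values for the x column of the question can be inspected directly. The `first_of_run` line is an extra shown for comparison, in case the first row of each run were wanted instead of the last.

```r
x <- c(1, 1, 1, 2, 2, 4, 2, 2, 1)  # the x column from the question

runs <- rle(x)
runs$values    # 1 2 4 2 1
runs$lengths   # 3 2 1 2 1

# Cumulative run lengths are the positions where each run ends
last_of_run <- cumsum(runs$lengths)
last_of_run    # 3 5 6 8 9

# Extra, for comparison: positions where each run starts
first_of_run <- last_of_run - runs$lengths + 1L
first_of_run   # 1 4 6 7 9
```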
EDIT: for fun, here's a microbenchmark of the answers given so far. rle is mine; consec is the answer given by @James, which I think is the most fundamentally direct one and is the answer I would "accept"; and dp is the dplyr answer given by @Nik.
#> Unit: microseconds
#> expr min lq mean median uq max
#> rle 134.389 145.4220 162.6967 154.4180 172.8370 375.109
#> consec 111.411 118.9235 136.1893 123.6285 145.5765 314.249
#> dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213
rle performs better than I thought it would.
You just need to check that no duplicate follows a value, i.e. x[i+1] != x[i], and note that the last value will always be kept.
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
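The comparison can be broken into steps to see why the trailing TRUE is needed. This sketch uses the question's x column with one extra trailing 1 appended (an assumed variation, so that the final run has two rows) and compares the result with the run-length approach.

```r
# The question's x column plus one extra trailing 1 (assumed for illustration)
x <- c(1, 1, 1, 2, 2, 4, 2, 2, 1, 1)

# TRUE where an element differs from the one that follows it
differs_from_next <- x[-1] != x[-length(x)]

# The last element has no successor, so it is always kept
keep <- c(differs_from_next, TRUE)
which(keep)  # 3 5 6 8 10

# Same positions as the ends of the runs found by rle()
identical(which(keep), cumsum(rle(x)$lengths))
```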
A cheap solution with dplyr that I could think of:
Method:
library(dplyr)
df %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 30 9
This will even work if your data has the same x value at the bottom
New Input:
df2 <- df %>% add_row(x = 1, y = 10, z = 12)
df2
x y z
1 1 10 1
2 1 11 2
3 1 30 3
4 2 12 4
5 2 49 5
6 4 13 6
7 2 12 7
8 2 49 8
9 1 30 9
10 1 10 12
Use same method:
df2 %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
New Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 10 12
Here is a data.table solution. The trick is to create a shifted version of x with the shift function and compare it with x
library(data.table)
dattab <- as.data.table(df)
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")]
This way you compare each value of x with its immediately following value and throw out where they match. Make sure to set fill to something that is not in x in order for correct handling of the last value.

Sorting and numbering can be easier (in R) [duplicate]

This question already has answers here:
How can I rank observations in-group faster?
(4 answers)
Closed 6 years ago.
First: I want to sort a dataframe and then add a rank to the dataframe.
df <- data.frame(a = 3:1, b = 6:4, Rank = NA) # create dataframe
df <- df[order(df[, 1], df[, 2]), ] # sort dataframe
for (i in 1:nrow(df)) df[i, 3] <- i # add the ranking
Second: I want to sort within a group g
df <- data.frame(g = sample(1:4, 4), num = 1:20, Rank = NA)
df <- df[order(df[, 1], df[, 2]), ]
row <- 1
for (x in 1:4) {
rank <- 1
df[row, 3] <- rank # adding the number one to list
row <- row + 1 # move to the next row!
while (df[row - 1, 1] == df[row, 1] & row < length(df[,1]) + 1){
# Check whether the group is still the same as in the previous row; otherwise stop this loop!
rank <- rank + 1 # adding next to rank!
df[row, 3] <- rank # Put rank in dataframe!
row <- row + 1 # move to next row
}
}
It works, but I would like to accomplish the same tasks with more parsimonious or efficient code.
Try:
set.seed(123)
df = data.frame(g=sample(1:4, 4), num = 1:20, Rank = NA)
library(dplyr)
df %>% arrange(g, num) %>% group_by(g) %>% mutate(rank = seq_along(g))
Source: local data frame [20 x 4]
Groups: g
g num Rank rank
1 1 3 NA 1
2 1 7 NA 2
3 1 11 NA 3
4 1 15 NA 4
5 1 19 NA 5
6 2 1 NA 1
7 2 5 NA 2
8 2 9 NA 3
9 2 13 NA 4
10 2 17 NA 5
11 3 2 NA 1
12 3 6 NA 2
13 3 10 NA 3
14 3 14 NA 4
15 3 18 NA 5
16 4 4 NA 1
17 4 8 NA 2
18 4 12 NA 3
19 4 16 NA 4
20 4 20 NA 5
Is this what you need?
df = data.frame(g=sample(1:4, 4), num = 1:20, Rank = NA)
df <- df[order(df[,1],df[,2]),]
df$Rank <- rep(1:5,4)
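If the groups are not all the same size, `rep(1:5, 4)` no longer fits. A base R sketch that should handle both tasks without loops is shown below; the small data frame and the `overall_rank` column name are made up for illustration.

```r
# Hypothetical data with unequal group sizes
df <- data.frame(g = c(2, 1, 2, 1, 3), num = c(5, 3, 1, 4, 2))

# Sort by group, then by num within group
df <- df[order(df$g, df$num), ]

# First task: rank over the whole sorted data frame
df$overall_rank <- seq_len(nrow(df))

# Second task: rank within each group; ave() applies seq_along per group
df$Rank <- ave(df$num, df$g, FUN = seq_along)
df
```

Here `ave()` returns a vector aligned with the rows of `df`, so the within-group counter can be assigned directly as a column.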

Conditional calculation of means of different columns in data.table with R

The question linked below the table discussed how to calculate the means and medians of vector t for each value of vector y (from 1 to 4) where x=1 and z=1, using the aggregate function in R.
x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86
Multiple aggregation in R with 4 parameters
But how can I calculate (mean(y)+mean(z))/(mean(z)-mean(t)) for each value (from 1 to 5) of vector x? Values of 0 and NA in any vector should be excluded from the calculation. For example, in vector y the 3rd value is 0, so the 3rd number in every vector (y,z,t) should not be used. In the result, the third row (for x=3) should be NA.
Here is the code for calculating the means of y, z and t; it still needs the formula (mean(y)+mean(z))/(mean(z)-mean(t)) added:
data <- data.table(dataframe)
bar <- data[,.N,by=x]
foo <- data[ ,list(mean.y =mean(y, na.rm = T),
mean.z=mean(z, na.rm = T),
mean.t=mean(t,na.rm = T)),
by=x]
This code uses all rows to calculate the means, but for calculating (mean(y)+mean(z))/(mean(z)-mean(t)), any row where y, z or t equals zero or NA should not be used.
Update:
Oh, this can be further simplified, as data.table doesn't return rows where the condition in i is NA (especially with such cases in mind, similar to base::subset). So, you just have to do:
dt[y != 0 & z != 0 & t != 0,
list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
FWIW, here's how I'd do it in data.table:
dt[(y | NA) & (z | NA) & (t | NA),
list(ans=(mean(y)+mean(z))/(mean(z)-mean(t))), by=x]
# x ans
# 1: 1 -0.22222222
# 2: 2 -0.18750000
# 3: 3 -0.16949153
# 4: 4 -0.07142857
# 5: 5 -0.10309278
Let's break it down with the general syntax: dt[i, j, by]:
In i, we filter out for your conditions using a nice little hack TRUE | NA = TRUE and FALSE | NA = NA and NA | NA = NA (you can test these out in your R session).
Since you say you need only the non-zero non-NA values, it's just a matter of |ing each column with NA - which'll return TRUE only for your condition. That settles the subset by condition part.
Then for each group in by, we aggregate according to your function, in j, to get the result.
HTH
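The logical identities behind the hack can be checked in plain R. This is a small sketch with a made-up vector `y`; the `which()` call is how the same filter could be applied in base R, where an NA in a logical index would otherwise produce an NA element instead of being dropped.

```r
TRUE | NA    # TRUE
FALSE | NA   # NA
NA | NA      # NA

# Made-up values: a non-zero number, a zero, an NA, another non-zero number
y <- c(1, 0, NA, 3)

# Non-zero numbers count as TRUE, zero as FALSE
keep <- y | NA
keep         # TRUE NA NA TRUE

# which() drops the NAs and returns only the positions that are TRUE
which(keep)  # 1 4
```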
Here's one solution:
# create your sample data frame
df <- read.table(text = " x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)
library('dplyr')
dfmeans <- df %>%
filter(!is.na(y) & !is.na(z) & !is.na(t)) %>% # remove rows with NAs
filter(y != 0 & z != 0 & t != 0) %>% # remove rows with zeroes
group_by(x) %>%
summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))
I'm sure there is a simpler way to remove the rows with NAs and zeroes, but it's not coming to me. Anyway, dfmeans looks like this:
# x xmeans
# 1 1 -0.22222222
# 2 2 -0.18750000
# 3 3 -0.16949153
# 4 4 -0.07142857
# 5 5 -0.10309278
And if you just want the values from xmeans use dfmeans$xmeans.
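For comparison, the same filter and per-group calculation can be sketched in base R. The names `keep`, `ok` and `ans` are introduced here for illustration. `complete.cases()` handles the NA rows, so the NA and zero conditions can be combined into one logical vector.

```r
df <- read.table(text = "x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)

# Keep rows with no NA and no zero in y, z or t
keep <- complete.cases(df) & df$y != 0 & df$z != 0 & df$t != 0
ok <- df[keep, ]

# Apply the formula to each group of x
ans <- sapply(split(ok, ok$x), function(d) {
  (mean(d$y) + mean(d$z)) / (mean(d$z) - mean(d$t))
})
ans
```

The result is a named vector with one value per level of x, which should match the xmeans column above.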
