I have a large dataframe with 400 columns of baseline and follow-up scores (and 10,000 subjects). Each letter represents a score, and I would like to calculate the difference between the follow-up and baseline for each score in a new column:
subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup
    1              100              150                5                2               80               70
    2              120              142               10                9               79               42
    3              111              146               60               49               89               46
    4              152              148                4                4               69               48
    5              110              123               20               18               60               23
    6              112              120                5                3               12               20
    7              111              145                6                4               11               45
I'd like to calculate the difference between followup and baseline for each score in a new column like this:
df$a_score_difference = df$a_score.followup - df$a_score.baseline
Any ideas on how to do this efficiently? I really appreciate your help.
code to generate sample data:
subid <- c(1:7)
a_score.baseline <- c(100,120,111,152,110,112,111)
a_score.followup <- c(150,142,146,148,123,120,145)
b_score.baseline <- c(5,10,60,4,20,5,6)
b_score.followup <- c(2,9,49,4,18,3,4)
c_score.baseline <- c(80,79,89,69,60,12,11)
c_score.followup <- c(70,42,46,48,23,20,45)
df <- data.frame(subid,a_score.baseline,a_score.followup,b_score.baseline,b_score.followup,c_score.baseline,c_score.followup)
base R
scores <- sort(grep("score\\.(baseline|followup)", names(df), value = TRUE))
scores
# [1] "a_score.baseline" "a_score.followup" "b_score.baseline" "b_score.followup" "c_score.baseline" "c_score.followup"
scores <- split(scores, sub(".*_", "", scores))
scores
# $score.baseline
# [1] "a_score.baseline" "b_score.baseline" "c_score.baseline"
# $score.followup
# [1] "a_score.followup" "b_score.followup" "c_score.followup"
Map(`-`, df[scores[[2]]], df[scores[[1]]])
# $a_score.followup
# [1] 50 22 35 -4 13 8 34
# $b_score.followup
# [1] -3 -1 -11 0 -2 -2 -2
# $c_score.followup
# [1] -10 -37 -43 -21 -37 8 34
out <- Map(`-`, df[scores[[2]]], df[scores[[1]]])
names(out) <- sub("followup", "difference", names(out))
df <- cbind(df, out)
df
# subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference
# 1 1 100 150 5 2 80 70 50
# 2 2 120 142 10 9 79 42 22
# 3 3 111 146 60 49 89 46 35
# 4 4 152 148 4 4 69 48 -4
# 5 5 110 123 20 18 60 23 13
# 6 6 112 120 5 3 12 20 8
# 7 7 111 145 6 4 11 45 34
# b_score.difference c_score.difference
# 1 -3 -10
# 2 -1 -37
# 3 -11 -43
# 4 0 -21
# 5 -2 -37
# 6 -2 8
# 7 -2 34
There is the possibility (when run unsupervised) that not all followups will have comparable baselines, which could cause a problem. You might include a test to validate their presence and order:
all(sub("baseline", "followup", scores$score.baseline) == scores$score.followup)
# [1] TRUE
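If that test ever fails, one defensive option (a sketch, not needed for this sample data) is to derive the followup names from the baselines and keep only the pairs that actually exist in the frame:
base_cols <- scores$score.baseline
fup_cols  <- sub("baseline", "followup", base_cols)
keep      <- fup_cols %in% names(df)   # drop baselines with no matching followup
out <- Map(`-`, df[fup_cols[keep]], df[base_cols[keep]])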
dplyr
You might consider pivoting the data into a longer format. This can be done in base R as well, but looks a lot simpler with tidyr:
library(dplyr)
# library(tidyr) # pivot_*
df %>%
tidyr::pivot_longer(
-subid,
names_pattern = "(.*)_score.(.*)",
names_to = c("ltr", ".value")) %>%
mutate(difference = followup - baseline)
# # A tibble: 21 x 5
# subid ltr baseline followup difference
# <int> <chr> <dbl> <dbl> <dbl>
# 1 1 a 100 150 50
# 2 1 b 5 2 -3
# 3 1 c 80 70 -10
# 4 2 a 120 142 22
# 5 2 b 10 9 -1
# 6 2 c 79 42 -37
# 7 3 a 111 146 35
# 8 3 b 60 49 -11
# 9 3 c 89 46 -43
# 10 4 a 152 148 -4
# # ... with 11 more rows
Honestly, I tend to prefer a long format most of the time for many reasons. If, however, you want to make it wide again, then
df %>%
tidyr::pivot_longer(
-subid, names_pattern = "(.*)_score.(.*)",
names_to = c("ltr", ".value")) %>%
mutate(difference = followup - baseline) %>%
tidyr::pivot_wider(
names_from = "ltr",
values_from = c("baseline", "followup", "difference"),
names_glue = "{ltr}_score.{.value}")
# # A tibble: 7 x 10
# subid a_score.baseline b_score.baseline c_score.baseline a_score.followup b_score.followup c_score.followup a_score.difference b_score.difference c_score.difference
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 100 5 80 150 2 70 50 -3 -10
# 2 2 120 10 79 142 9 42 22 -1 -37
# 3 3 111 60 89 146 49 46 35 -11 -43
# 4 4 152 4 69 148 4 48 -4 0 -21
# 5 5 110 20 60 123 18 23 13 -2 -37
# 6 6 112 5 12 120 3 20 8 -2 8
# 7 7 111 6 11 145 4 45 34 -2 34
dplyr #2
This is a keep-it-wide approach (no pivoting), which will be more efficient than the pivot/mutate/pivot sequence above if you have no intention of working with the data in a longer format.
df %>%
mutate(across(
ends_with("score.followup"),
~ . - cur_data()[[sub("followup", "baseline", cur_column())]],
.names = "{sub('followup', 'difference', col)}")
)
# subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference b_score.difference c_score.difference
# 1 1 100 150 5 2 80 70 50 -3 -10
# 2 2 120 142 10 9 79 42 22 -1 -37
# 3 3 111 146 60 49 89 46 35 -11 -43
# 4 4 152 148 4 4 69 48 -4 0 -21
# 5 5 110 123 20 18 60 23 13 -2 -37
# 6 6 112 120 5 3 12 20 8 -2 8
# 7 7 111 145 6 4 11 45 34 -2 34
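Note that cur_data() was deprecated in dplyr 1.1.0. If you are on a recent dplyr, the same idea can be written with pick() instead; a sketch, assuming dplyr >= 1.1.0:
df %>%
  mutate(across(
    ends_with("score.followup"),
    ~ . - pick(everything())[[sub("followup", "baseline", cur_column())]],
    .names = "{sub('followup', 'difference', .col)}")
  )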
Related
How can I transpose specific columns in a data.frame, given:
id<- c(1,2,3)
t0<- c(0,0,0)
bp0<- c(88,95,79)
t1<- c(15,12,12)
bp1<- c(92,110,82)
t2<- c(25,30,20)
bp2<- c(75,99,88)
df1<- data.frame(id, t0, bp0, t1, bp1, t2, bp2)
df1
> df1
id t0 bp0 t1 bp1 t2 bp2
1 1 0 88 15 92 25 75
2 2 0 95 12 110 30 99
3 3 0 79 12 82 20 88
In order to obtain:
> df2
id t bp
1 1 0 88
2 2 0 95
3 3 0 79
4 1 15 92
5 2 12 110
6 3 12 82
7 1 25 75
8 2 30 99
9 3 20 88
That is, df2 stacks t (t0, t1, t2) and bp (bp0, bp1, bp2) for the corresponding "id".
Using base R, you can do:
Code
df2 <- cbind(df1[1],
             stack(df1, startsWith(names(df1), "t"))[1],
             stack(df1, startsWith(names(df1), "bp"))[1])
names(df2)[2:3] <- c("t", "bp")
Output
df2
#> id t bp
#> 1 1 0 88
#> 2 2 0 95
#> 3 3 0 79
#> 4 1 15 92
#> 5 2 12 110
#> 6 3 12 82
#> 7 1 25 75
#> 8 2 30 99
#> 9 3 20 88
Created on 2022-02-14 by the reprex package (v2.0.1)
Here is a solution with pivot_longer using names_pattern:
\\w+ = one or more word characters
\\d+ = one or more digits
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer (
-id,
names_to = c(".value", "name"),
names_pattern = "(\\w+)(\\d+)"
) %>%
select(-name)
id t bp
<dbl> <dbl> <dbl>
1 1 0 88
2 1 15 92
3 1 25 75
4 2 0 95
5 2 12 110
6 2 30 99
7 3 0 79
8 3 12 82
9 3 20 88
A base R option using reshape
reshape(
setNames(df1, sub("(\\d+)", ".\\1", names(df1))),
direction = "long",
idvar = "id",
varying = -1
)
gives
id time t bp
1.0 1 0 0 88
2.0 2 0 0 95
3.0 3 0 0 79
1.1 1 1 15 92
2.1 2 1 12 110
3.1 3 1 12 82
1.2 1 2 25 75
2.2 2 2 30 99
3.2 3 2 20 88
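The setNames()/sub() step is what makes this work: it inserts a dot before the digits so that reshape() can split each varying name into a variable part and a time part. A quick illustration:
sub("(\\d+)", ".\\1", c("t0", "bp0", "t1", "bp1"))
# [1] "t.0"  "bp.0" "t.1"  "bp.1"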
I have the following dataset and am looking to write code that can pull out which stocks have been positive or negative consecutively. The data would have the first 3 columns; the last 2 columns were manually calculated in Excel to depict the expected results.
This is only a sample; I would have data for 200+ stocks and a few years of data, with not all stocks trading every day.
In the end, I want to extract which stocks have, say, 3 or 4 or 5 consecutive positive or negative changes for the day.
Stocks Date Close_Price Change_for_day Positive/Negative_Count
A 11/11/2020 11
B 11/11/2020 50
C 11/11/2020 164
A 11/12/2020 19 8 1
B 11/12/2020 62 12 1
C 11/12/2020 125 -39 -1
A 11/13/2020 7 -12 -1
B 11/13/2020 63 1 2
C 11/13/2020 165 40 1
A 11/16/2020 17 10 1
B 11/16/2020 70 7 3
C 11/16/2020 170 5 2
A 11/17/2020 24 7 2
B 11/17/2020 52 -18 -1
C 11/17/2020 165 -5 -1
A 11/18/2020 31 7 3
B 11/18/2020 61 9 1
C 11/18/2020 157 -8 -2
The difficulty is writing a function that computes the cumulative sum, both positive and negative, resetting the count when the sign changes and starting the count with the first value. I managed to write one, but it is not terribly efficient and will probably get slow on a bigger dataset. I suspect there is a way to do better, if only with a simple for loop in C or C++.
library(tidyverse)
df <- read.table(text="Stocks Date Close_Price Change_for_day Positive/Negative_Count
A 11/11/2020 11 NA 0
B 11/11/2020 50 NA 0
C 11/11/2020 164 NA 0
A 11/12/2020 19 8 1
B 11/12/2020 62 12 1
C 11/12/2020 125 -39 -1
A 11/13/2020 7 -12 -1
B 11/13/2020 63 1 2
C 11/13/2020 165 40 1
A 11/16/2020 17 10 1
B 11/16/2020 70 7 3
C 11/16/2020 170 5 2
A 11/17/2020 24 7 2
B 11/17/2020 52 -18 -1
C 11/17/2020 165 -5 -1
A 11/18/2020 31 7 3
B 11/18/2020 61 9 1
C 11/18/2020 157 -8 -2",
header = TRUE) %>%
select(1:3) %>%
as_tibble()
# this formulation could be faster on data with longer stretches
nb_days_cons2 <- function(x){
n <- length(x)
if(n < 2) return(x) # early return; a single value needs no accumulation
out <- integer(n)
y <- rle(x)
cur_pos <- 1
for(i in seq_len(length(y$lengths))){
out[(cur_pos):(cur_pos+y$lengths[i]-1)] <- cumsum(rep(y$values[i], y$lengths[i]))
cur_pos <- cur_pos + y$lengths[i]
}
out
}
# this formulation was faster on some tests, and would be easier to rewrite in C
nb_days_cons <- function(x){
n <- length(x)
if(n < 2) return(x) # early return; a single value needs no accumulation
out <- integer(n)
out[1] <- x[1]
for(i in 2:n){
if(x[i] == x[i-1]){
out[i] <- out[i-1] + x[i]
} else{
out[i] <- x[i]
}
}
out
}
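As an aside, since the count within each run is just the run's value times a sequence along that run, the loop can be avoided entirely. A fully vectorised sketch of the same idea, which can be plugged into the dplyr step below in place of nb_days_cons:
nb_days_cons_vec <- function(x){
  r <- rle(x)
  # 1, 2, ..., length within each run, multiplied by the run's value (-1, 0 or 1)
  rep(r$values, r$lengths) * sequence(r$lengths)
}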
Once we have that function, the dplyr part is quite classic.
df %>%
group_by(Stocks) %>%
arrange(Date) %>% # make sure of order
mutate(change = c(0, diff(Close_Price)),
stretch_duration = nb_days_cons(sign(change))) %>%
arrange(Stocks)
#> # A tibble: 18 x 5
#> # Groups: Stocks [3]
#> Stocks Date Close_Price change stretch_duration
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 A 11/11/2020 11 0 0
#> 2 A 11/12/2020 19 8 1
#> 3 A 11/13/2020 7 -12 -1
#> 4 A 11/16/2020 17 10 1
#> 5 A 11/17/2020 24 7 2
#> 6 A 11/18/2020 31 7 3
#> 7 B 11/11/2020 50 0 0
#> 8 B 11/12/2020 62 12 1
#> 9 B 11/13/2020 63 1 2
#> 10 B 11/16/2020 70 7 3
#> 11 B 11/17/2020 52 -18 -1
#> 12 B 11/18/2020 61 9 1
#> 13 C 11/11/2020 164 0 0
#> 14 C 11/12/2020 125 -39 -1
#> 15 C 11/13/2020 165 40 1
#> 16 C 11/16/2020 170 5 2
#> 17 C 11/17/2020 165 -5 -1
#> 18 C 11/18/2020 157 -8 -2
Created on 2020-11-19 by the reprex package (v0.3.0)
Of course, the final arrange() is just for easy visualization, and you can remove the columns you don't need anymore with select().
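To then answer the original question (which stocks reached, say, 3 consecutive moves in the same direction), filter on the stretch column. A sketch; the threshold of 3 is just an example:
df %>%
  group_by(Stocks) %>%
  arrange(Date) %>%
  mutate(change = c(0, diff(Close_Price)),
         stretch_duration = nb_days_cons(sign(change))) %>%
  filter(abs(stretch_duration) >= 3) %>%  # catches both +3 and -3 stretches
  distinct(Stocks)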
I need to expand a dataset and merge it with a series of times from 1 hr to 600 hr:
ID Age Weight
1 25 150
2 30 140
3 28 170
to the following format (dots represent the continuous count from 4 to 599, with the same ID, Age and Weight values):
ID Age Weight Time
1 25 150 1
1 25 150 2
1 25 150 3
1 25 150 4
.. .. ... ..
1 25 150 599
1 25 150 600
2 30 140 1
2 30 140 2
2 30 140 3
2 30 140 4
.. .. ... ..
2 30 140 599
2 30 140 600
3 28 170 1
3 28 170 2
3 28 170 3
3 28 170 4
.. .. ... ..
3 28 170 599
3 28 170 600
We can use complete from tidyr
library(dplyr)
df %>%
mutate(Time = 1) %>%
group_by(ID, Age, Weight) %>%
tidyr::complete(Time = 1:600)
# ID Age Weight Time
# <int> <int> <int> <dbl>
# 1 1 25 150 1
# 2 1 25 150 2
# 3 1 25 150 3
# 4 1 25 150 4
# 5 1 25 150 5
# 6 1 25 150 6
# 7 1 25 150 7
# 8 1 25 150 8
# 9 1 25 150 9
#10 1 25 150 10
# … with 1,790 more rows
Or in base R using split and transform
do.call(rbind, lapply(split(df, df$ID), function(x) transform(x, Time = 1:600)))
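Another base R route (a sketch): merge() with no common columns performs a Cartesian product, so a cross join against the time grid gives the same expansion; tidyr::crossing(df, Time = 1:600) is the tidyverse equivalent.
merge(df, data.frame(Time = 1:600))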
Hi, I have a data frame with 100 columns, and I am trying to calculate current_value - lag(current_value).
I can calculate it for a single column, but I am unable to run the code on all available columns dynamically.
Sample df
Class <- c("A","A","A","A","B","B","B","C","C","C","C","C","C")
A<-c(23,33,45,56,22,34,34,45,65,5,57,75,57)
D<-c(2,133,5,60,23,312,341,25,75,50,3,9,21)
M<-c(34,35,67,325,46,56,547,47,67,67,68,3,12)
df <- data.frame(Class,A,D,M)
I have tried the below code:
df <- df %>% group_by(Class) %>%
mutate(A_lag =(A-lag(A)))
Help me calculate all lag columns dynamically.
Thanks in advance
Class <- c("A","A","A","A","B","B","B","C","C","C","C","C","C")
A<-c(23,33,45,56,22,34,34,45,65,5,57,75,57)
D<-c(2,133,5,60,23,312,341,25,75,50,3,9,21)
M<-c(34,35,67,325,46,56,547,47,67,67,68,3,12)
df <- data.frame(Class,A,D,M)
library(dplyr)
df %>%
group_by(Class) %>%
mutate_all(~.-lag(.)) %>%
ungroup()
# # A tibble: 13 x 4
# Class A D M
# <fct> <dbl> <dbl> <dbl>
# 1 A NA NA NA
# 2 A 10 131 1
# 3 A 12 -128 32
# 4 A 11 55 258
# 5 B NA NA NA
# 6 B 12 289 10
# 7 B 0 29 491
# 8 C NA NA NA
# 9 C 20 50 20
#10 C -60 -25 0
#11 C 52 -47 1
#12 C 18 6 -65
#13 C -18 12 9
or if you want to add new columns to the existing ones
df %>%
group_by(Class) %>%
mutate_all(funs(new = .-lag(.))) %>%
ungroup()
# # A tibble: 13 x 7
# Class A D M A_new D_new M_new
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 23 2 34 NA NA NA
# 2 A 33 133 35 10 131 1
# 3 A 45 5 67 12 -128 32
# 4 A 56 60 325 11 55 258
# 5 B 22 23 46 NA NA NA
# 6 B 34 312 56 12 289 10
# 7 B 34 341 547 0 29 491
# 8 C 45 25 47 NA NA NA
# 9 C 65 75 67 20 50 20
#10 C 5 50 67 -60 -25 0
#11 C 57 3 68 52 -47 1
#12 C 75 9 3 18 6 -65
#13 C 57 21 12 -18 12 9
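Note that mutate_all() and funs() are superseded/deprecated in current dplyr. On dplyr >= 1.0.0 the same results can be written with across(); a sketch:
df %>%
  group_by(Class) %>%
  mutate(across(everything(), ~ . - lag(.), .names = "{.col}_new")) %>%
  ungroup()
# across(everything()) skips the grouping column, so Class is left untouched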
Using data.table, calculating the lag only for numeric variables:
library(data.table)
setDT(df)
df[,
lapply(.SD, function(x) x - shift(x)),
by = Class,
.SDcols = (sapply(df, is.numeric))
]
Class A D M
1: A NA NA NA
2: A 10 131 1
3: A 12 -128 32
4: A 11 55 258
5: B NA NA NA
6: B 12 289 10
7: B 0 29 491
8: C NA NA NA
9: C 20 50 20
10: C -60 -25 0
11: C 52 -47 1
12: C 18 6 -65
13: C -18 12 9
I have a list of 15 data frames, each with 13 columns (time + 6 stations with 3 layers each) and 172 rows. I want to collapse those columns (observations at stations) into basically two columns (one for station and one for observation) by applying a function over the whole list. Here I use gather from tidyr.
In addition, I want to find a pattern (upper, middle or lower layer) in one of the columns and assign a new value (depth) in a new column. For this I use ddply from plyr and grep.
My problem is that it is very slow. I guess I created a bottleneck with my limited R knowledge. So where is the bottleneck and how can I improve it?
an example:
data <- list(a = data.frame(time = 1:180, alpha.upper = sample(1:180),
                            beta.middle = sample(1:180), gamma.lower = sample(1:180)),
             b = data.frame(time = 1:180, alpha.upper = sample(1:180),
                            beta.middle = sample(1:180), gamma.lower = sample(1:180)))
> data
$a
time alpha.upper beta.middle gamma.lower
1 1 133 179 99
2 2 175 147 56
3 3 169 9 24
4 4 116 129 75
5 5 92 65 65
6 6 141 73 49
$b
time alpha.upper beta.middle gamma.lower
1 1 111 2 89
2 2 84 81 159
3 3 93 82 84
4 4 44 58 125
5 5 31 33 131
6 6 1 120 63
my code is:
> data2<-lapply(data, function(x) {
x<-gather(x,stn,value,-time)
x<-arrange(x,time)
x<-ddply(x,c("time","stn","value"), function(x) {
if (grepl(".upper",x$stn) == TRUE)
{
x$depth<-1
return(x)
}
if (grepl(".lower",x$stn) == TRUE)
{
x$depth<-3
return(x)
}
if (grepl(".middle",x$stn) == TRUE)
{
x$depth<-2
return(x)
}
})
return(x)
})
the result should be:
> data2
$a
time stn value depth
1 1 alpha.upper 111 1
2 1 beta.middle 2 2
3 1 gamma.lower 89 3
4 2 alpha.upper 84 1
5 2 beta.middle 81 2
6 2 gamma.lower 159 3
$b
  time         stn value depth
1    1 alpha.upper    38     1
2 1 beta.middle 151 2
3 1 gamma.lower 93 3
4 2 alpha.upper 61 1
5 2 beta.middle 56 2
6 2 gamma.lower 66 3
First of all, let's reproduce your data.
dataa <- read.table(text =
"time alpha.upper beta.middle gamma.lower
1 133 179 99
2 175 147 56
3 169 9 24
4 116 129 75
5 92 65 65
6 141 73 49", header = T, sep = " ")
datab <- read.table(text =
"time alpha.upper beta.middle gamma.lower
1 1 111 2 89
2 2 84 81 159
3 3 93 82 84
4 4 44 58 125
5 5 31 33 131
6 6 1 120 63", header = T, sep = " ")
mydata <- list(a = dataa, b = datab)
# $a
# time alpha.upper beta.middle gamma.lower
# 1 1 133 179 99
# 2 2 175 147 56
# 3 3 169 9 24
# 4 4 116 129 75
# 5 5 92 65 65
# 6 6 141 73 49
# $b
# time alpha.upper beta.middle gamma.lower
# 1 1 111 2 89
# 2 2 84 81 159
# 3 3 93 82 84
# 4 4 44 58 125
# 5 5 31 33 131
# 6 6 1 120 63
Here I named the variable mydata because there is a function data in the standard utils package, and it's better not to use that name for a variable.
As far as I understand, you need to reshape every data.frame in the list from "wide" to "long" form. You can use gather from the tidyr package, and in my opinion it's a good choice, but for this situation I show how we can get the same result with base R tools.
rebuilddf <- function(df)
{ # first of all see the difference between rep(1:3, each = 3) and rep(1:3, times = 3)
res_df <- data.frame(
time = rep(df$time, each = 3),# first column of new data.frame -
# we repeat each time mark 3 times
# as we know that there are exactly 3
# observations for single time: upper, middle, lower
stn = rep(colnames(df)[-1], times = nrow(df)), # second column
# fill it with words "alpha.upper",
# "beta.middle", "gamma.lower" which are colnames(df)[-1]
# repeated nrow(df) times
value = as.vector(t(as.matrix(df[,-1]))) ) #
# numbers of 2:4 columns of our data.frame are
# transposed and then arranged in a vector
# the result is like reading it row by row
# to understand what's happening with the matrix you can try this code
# m <- matrix(1:20, nrow = 4)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 5 9 13 17
# [2,] 2 6 10 14 18
# [3,] 3 7 11 15 19
# [4,] 4 8 12 16 20
# as.vector(t(m))
# 1 5 9 13 17 2 6 10 14 18 3 7 11 15 19 4 8 12 16 20
# after that we add column "depth"
# as I got it, we need 1 for "upper", 2 for "middle" and 3 for "lower"
# we make it with the help of two nested ifelse functions
res_df <- transform(res_df, depth = ifelse(stn == "alpha.upper", 1,
ifelse(stn == "beta.middle", 2, 3)) )
return(res_df)
}
If the column names are not always the same, and only the end of each name is invariant, we can modify the condition for depth as follows:
suffix <- sapply(strsplit(as.character(res_df$stn), "[.]"),
                 function(parts) rev(parts)[1])
# suffix is "upper", "middle" or "lower" for each row: we split each
# string of the form "some.name1.upper" or "some.other.colname.lower"
# at every dot and take the last piece (rev() reverses the order, so
# its first element is the last piece).
# Note: strsplit() returns a list with one element per string, so we
# need sapply() here; strsplit(stn, "[.]")[[1]] would only ever look
# at the first row.
res_df <- transform(res_df,
                    depth = ifelse(suffix == "upper", 1,
                                   ifelse(suffix == "middle", 2, 3)))
You may also modify the condition and use grepl, but I believe it will be faster with strsplit.
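For completeness, the grepl() version alluded to above could look like this (a sketch; it anchors on how each name ends, and grepl() is vectorised so no sapply() is needed):
res_df <- transform(res_df,
                    depth = ifelse(grepl("upper$", stn), 1,
                                   ifelse(grepl("middle$", stn), 2, 3)))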
When we've finished with our rebuilddf function, let's see what it does.
lapply(mydata, rebuilddf)
# $a
# time stn value depth
# 1 1 alpha.upper 133 1
# 2 1 beta.middle 179 2
# 3 1 gamma.lower 99 3
# 4 2 alpha.upper 175 1
# 5 2 beta.middle 147 2
# 6 2 gamma.lower 56 3
# 7 3 alpha.upper 169 1
# 8 3 beta.middle 9 2
# 9 3 gamma.lower 24 3
# 10 4 alpha.upper 116 1
# 11 4 beta.middle 129 2
# 12 4 gamma.lower 75 3
# 13 5 alpha.upper 92 1
# 14 5 beta.middle 65 2
# 15 5 gamma.lower 65 3
# 16 6 alpha.upper 141 1
# 17 6 beta.middle 73 2
# 18 6 gamma.lower 49 3
#
# $b
# time stn value depth
# 1 1 alpha.upper 111 1
# 2 1 beta.middle 2 2
# 3 1 gamma.lower 89 3
# 4 2 alpha.upper 84 1
# 5 2 beta.middle 81 2
# 6 2 gamma.lower 159 3
# 7 3 alpha.upper 93 1
# 8 3 beta.middle 82 2
# 9 3 gamma.lower 84 3
# 10 4 alpha.upper 44 1
# 11 4 beta.middle 58 2
# 12 4 gamma.lower 125 3
# 13 5 alpha.upper 31 1
# 14 5 beta.middle 33 2
# 15 5 gamma.lower 131 3
# 16 6 alpha.upper 1 1
# 17 6 beta.middle 120 2
# 18 6 gamma.lower 63 3
I want to believe this is your desired output, though in your question the expected output for a shows the numbers from b and vice versa.