I've got a data frame like below with vector coordinates:
df <- structure(list(x0 = c(22.6, 38.5, 73.7), y0 = c(62.9, 56.6, 27.7
), x1 = c(45.8, 49.3, 80.8), y1 = c(69.9, 21.9, 14)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
# A tibble: 3 x 4
x0 y0 x1 y1
<dbl> <dbl> <dbl> <dbl>
1 22.6 62.9 45.8 69.9
2 38.5 56.6 49.3 21.9
3 73.7 27.7 80.8 14
For visualisation purposes I need to manually interpolate points, i.e. add an intermediate row between each pair of rows of df, where the starting coordinates x0, y0 are the ending coordinates of the previous original row, and the ending coordinates x1, y1 are the starting coordinates of the next original row. I also need to preserve the information of whether an observation comes from the original dataset or was added manually. So the expected output would be:
# A tibble: 5 x 5
x y pass_end_x pass_end_y source
<dbl> <dbl> <dbl> <dbl> <chr>
1 22.6 62.9 45.8 69.9 original
2 45.8 69.9 38.5 56.6 added
3 38.5 56.6 49.3 21.9 original
4 49.3 21.9 73.7 27.7 added
5 73.7 27.7 80.8 14 original
How can I do that in an efficient and elegant way (preferably with the tidyverse)?
To do this, all I do is swap the column names of the start and end points, then use lead() to get the next row's values of x1 and y1. Then we just add the source tag and bind_rows().
library(tidyverse)
df2 <- df
names(df2) <- names(df2)[c(3,4,1,2)] # swap names
df2 <- df2 %>% mutate(x1 = lead(x1), y1 = lead(y1), source = "added")
df <- df %>% mutate(source = "original") %>% bind_rows(., df2)
Resulting in:
# A tibble: 6 x 5
x0 y0 x1 y1 source
<dbl> <dbl> <dbl> <dbl> <chr>
1 22.6 62.9 45.8 69.9 original
2 38.5 56.6 49.3 21.9 original
3 73.7 27.7 80.8 14 original
4 45.8 69.9 38.5 56.6 added
5 49.3 21.9 73.7 27.7 added
6 80.8 14 NA NA added
If you need the rows in order:
df2 <- df2 %>% mutate(x1 = lead(x1), y1 = lead(y1), source = "added", ID = seq(1, n() * 2, by = 2) + 1)
df <- df %>% mutate(source = "original", ID = seq(1, n() * 2, by = 2)) %>% bind_rows(., df2) %>% arrange(ID)
# A tibble: 6 x 6
x0 y0 x1 y1 source ID
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 22.6 62.9 45.8 69.9 original 1
2 45.8 69.9 38.5 56.6 added 2
3 38.5 56.6 49.3 21.9 original 3
4 49.3 21.9 73.7 27.7 added 4
5 73.7 27.7 80.8 14 original 5
6 80.8 14 NA NA added 6
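If you also want the exact expected output shown in the question (five rows, columns renamed to x, y, pass_end_x, pass_end_y), here is a minimal sketch along the same lines; it starts again from the original four-column df, builds the added rows explicitly, and drops the trailing NA row by construction:
library(tidyverse)
# Sketch: each added row starts at the previous row's end point and
# finishes at the next row's start point.
added <- tibble(
  x0 = head(df$x1, -1), y0 = head(df$y1, -1),   # starts = previous row's end
  x1 = tail(df$x0, -1), y1 = tail(df$y0, -1),   # ends   = next row's start
  source = "added"
)
df %>%
  mutate(source = "original", ID = row_number() * 2 - 1) %>%
  bind_rows(mutate(added, ID = row_number() * 2)) %>%
  arrange(ID) %>%
  select(x = x0, y = y0, pass_end_x = x1, pass_end_y = y1, source)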
The randomly generated data frame contains ID, Date, and Earning. I changed the data frame format so that each column represents a date and its values correspond to the earnings.
I want to create a new variable named "Date_over100" that records the date on which one's cumulative earnings have exceeded 100. I have put below reproducible code that generates the data frame. I assume conditional statements or loops would be used to achieve this. I would appreciate all the help there is. Thanks in advance!
library(dplyr)
library(tidyr)
ID <- c(1:10)
Date <- sample(seq(as.Date('2021/01/01'), as.Date('2021/01/11'), by = "day"), 10)
Earning <- round(runif(10, 30, 50), digits = 2)
df <- data.frame(ID, Date, Earning, check.names = F)
df1 <- df %>%
  arrange(Date) %>%
  pivot_wider(names_from = Date, values_from = Earning)
df1 <- as.data.frame(df1)
df1[is.na(df1)] <- round(runif(sum(is.na(df1)), min = 30, max = 50), digits = 2)
I go back to long format for the calculation, then join to the wide data:
library(dplyr)
library(tidyr)
df1 %>% pivot_longer(cols = -ID, names_to = "date") %>%
group_by(ID) %>%
summarize(Date_over_100 = as.Date(date)[which.max(cumsum(value) > 100)]) %>%
right_join(df1, by = "ID")
# # A tibble: 10 × 12
# ID Date_over_100 `2021-01-04` `2021-01-01` `2021-01-08` `2021-01-11` `2021-01-02` `2021-01-09`
# <int> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2021-01-08 45.0 46.2 40.1 47.4 47.5 48.8
# 2 2 2021-01-08 36.7 30.3 36.2 47.5 41.4 41.7
# 3 3 2021-01-08 49.5 46.0 45.0 43.9 45.4 37.1
# 4 4 2021-01-08 31.0 48.7 47.3 40.4 40.8 35.5
# 5 5 2021-01-08 48.2 35.2 32.1 44.2 35.4 49.7
# 6 6 2021-01-08 40.8 37.6 31.8 40.3 38.3 42.5
# 7 7 2021-01-08 37.9 42.9 36.8 46.0 39.8 33.6
# 8 8 2021-01-08 47.7 47.8 39.7 46.4 43.8 46.5
# 9 9 2021-01-08 32.9 42.0 41.8 32.8 33.9 35.5
# 10 10 2021-01-08 34.5 40.1 42.7 35.9 44.8 31.8
# # … with 4 more variables: 2021-01-10 <dbl>, 2021-01-03 <dbl>, 2021-01-07 <dbl>, 2021-01-05 <dbl>
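One caveat: the cumsum() above runs in the column order of df1, which is not necessarily chronological (note the unsorted date columns in the printed result). A sketch that sorts each ID's earnings by date before accumulating, assuming the column names parse with as.Date():
df1 %>%
  pivot_longer(cols = -ID, names_to = "date") %>%
  mutate(date = as.Date(date)) %>%   # parse the column names as dates
  arrange(ID, date) %>%              # chronological order within each ID
  group_by(ID) %>%
  summarize(Date_over_100 = date[which.max(cumsum(value) > 100)]) %>%
  right_join(df1, by = "ID")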
I wish to summarize a set of data in a dataframe using dplyr.
Concerning the "vars" argument, the documentation reads:
A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions, or NULL.
I observe the following behavior depending on the type of the "vars" argument:
summarize_at(vars(D8,D9,D10), mean, na.rm=TRUE) # works
summarize_at(c("D8","D9","D10"), mean, na.rm=TRUE) # works
summarize_at(c(12,13,14), mean, na.rm=TRUE) # Using column indexes for D8, D9 and D10, respectively
# ! Can't convert a character `NA` to a symbol.
summarize_at(c(12:14), mean, na.rm=TRUE) # Same error as c(12,13,14)
Why am I getting that error?
POST EDIT: Adding data and actual code
Data:
# A tibble: 12 x 5
TTMENT DOSE D8 D9 D10
<chr> <dbl> <dbl> <dbl> <dbl>
1 Group_1 0 40.3 41.1 41.5
2 Group_1 0 37.4 36.9 37.1
3 Group_1 0 44.8 44.1 44.4
4 Group_2 450 39.6 39.6 39.4
5 Group_2 450 40.6 41.2 40.8
6 Group_2 450 41.1 42.1 41.2
7 Group_3 500 38.5 39.2 39.9
8 Group_3 500 41.6 41.6 41.5
9 Group_3 500 41.8 41.8 42.4
10 Group_4 700 43.6 42 42.4
11 Group_4 700 43.1 42.7 42.7
12 Group_4 700 41.6 40.8 41.9
Error-triggering code:
stack %>%
  group_by(TTMENT, DOSE) %>%
  #summarize_at(c("D8","D9","D10"), mean, na.rm=TRUE)
  #summarize_at(vars(D8,D9,D10), mean, na.rm=TRUE)
  summarize_at(c(3,4,5), mean, na.rm=TRUE)
Full error:
Error in `FUN()`:
! Can't convert a character `NA` to a symbol.
Backtrace:
 stack %>% group_by(TTMENT, DOSE) %>% ...
 dplyr::summarize_at(., c(3, 4, 5), mean, na.rm = TRUE)
 dplyr:::manip_at(...)
 dplyr:::tbl_at_syms(.tbl, .vars, .include_group_vars = .include_group_vars)
 rlang::syms(vars)
 rlang:::map(x, sym)
 base::lapply(.x, .f, ...)
 rlang FUN(X[[i]], ...)
Error in FUN(X[[i]], ...) : Can't convert a character `NA` to a symbol.
I actually want an output showing the mean, SD and SE presented in 3 rows per group (rather than in columns), and if possible an asterisk next to the mean in case of a significant t-test between each group and the reference group (Group_1). Something like this:
Group Statistic D8 D9 D10
Group_1 Mean XX XX XX
Group_1 SD XX XX XX
Group_1 SE XX XX XX
Group_2 Mean XX* XX XX*
Group_2 SD XX XX XX
Group_2 SE XX XX XX
Group_3 etc.
Any ideas on how to achieve this?
Just posting an answer as I found an explanation (newbie topic though...)
Apparently, when group_by() is used, the grouping columns are excluded before the column positions are counted, so the indexes refer only to the non-grouping columns. Therefore, given the dataframe:
# A tibble: 12 x 5
TTMENT DOSE D8 D9 D10
<chr> <dbl> <dbl> <dbl> <dbl>
1 Group_1 0 40.3 41.1 41.5
2 Group_1 0 37.4 36.9 37.1
3 Group_1 0 44.8 44.1 44.4
4 Group_2 450 39.6 39.6 39.4
5 Group_2 450 40.6 41.2 40.8
6 Group_2 450 41.1 42.1 41.2
7 Group_3 500 38.5 39.2 39.9
8 Group_3 500 41.6 41.6 41.5
9 Group_3 500 41.8 41.8 42.4
10 Group_4 700 43.6 42 42.4
11 Group_4 700 43.1 42.7 42.7
12 Group_4 700 41.6 40.8 41.9
The following code fails as it assumes column indexes 3, 4 and 5 for columns D8, D9 and D10 respectively:
results <- stack %>%
group_by(TTMENT, DOSE) %>%
summarize_at(c(3:5), mean, na.rm=TRUE)
Results in error:
Error in `FUN()`:
! Can't convert a character `NA` to a symbol.
In contrast, the following code provides the expected result, as it assumes column indexes 1, 2 and 3 for columns D8, D9 and D10 respectively. That is, TTMENT and DOSE are ignored when counting the indexes:
results <- stack %>%
group_by(TTMENT, DOSE) %>%
summarize_at(c(1:3), mean, na.rm=TRUE)
Result:
# A tibble: 4 x 5
# Groups: TTMENT [4]
TTMENT DOSE D8 D9 D10
<chr> <dbl> <dbl> <dbl> <dbl>
1 Group_1 0 40.8 40.7 41
2 Group_2 450 40.4 41.0 40.5
3 Group_3 500 40.6 40.9 41.3
4 Group_4 700 42.8 41.8 42.3
Thanks to @jpiversen, whose comment helped me understand what was going on.
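As for the row-wise Mean/SD/SE table requested in the question, here is a minimal sketch with across() and pivot_longer(), assuming the data frame is called stack as in the backtrace and leaving the significance asterisks aside:
library(dplyr)
library(tidyr)
# Helper for the standard error (an assumption; adjust as needed).
se <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
stack %>%
  group_by(TTMENT) %>%
  summarise(across(c(D8, D9, D10),
                   list(Mean = ~mean(.x, na.rm = TRUE),
                        SD   = ~sd(.x, na.rm = TRUE),
                        SE   = ~se(.x)))) %>%
  # Names like D8_Mean are split into the value column (D8) and the Statistic label.
  pivot_longer(-TTMENT,
               names_to = c(".value", "Statistic"),
               names_sep = "_")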
I have data that is in the following format:
(data <- tribble(
~Date, ~ENRSxOPEN, ~ENRSxCLOSE, ~INFTxOPEN, ~INFTxCLOSE,
"1989-09-11",82.97,82.10,72.88,72.56,
"1989-09-12",83.84,83.96,73.52,72.51,
"1989-09-13",83.16,83.88,72.91,72.12))
# A tibble: 3 x 5
Date ENRSxOPEN ENRSxCLOSE INFTxOPEN INFTxCLOSE
<chr> <dbl> <dbl> <dbl> <dbl>
1 1989-09-11 83.0 82.1 72.9 72.6
2 1989-09-12 83.8 84.0 73.5 72.5
3 1989-09-13 83.2 83.9 72.9 72.1
For analysis, I want to pivot this tibble longer to the following format:
tribble(
~Ticker, ~Date, ~OPEN, ~CLOSE,
"ENRS","1989-09-11",82.97,82.10,
"ENRS","1989-09-12",83.84,83.96,
"ENRS","1989-09-13",83.16,83.88,
"INFT","1989-09-11",72.88,72.56,
"INFT","1989-09-12",73.52,72.51,
"INFT","1989-09-13",72.91,72.12)
I.e., I want to separate the OPEN/CLOSE prices from the ticker, and put the latter as an entirely new column at the beginning.
I've tried to use the function pivot_longer:
pivot_longer(data, cols = ENRSxOPEN:INFTxCLOSE)
While this goes in the direction of what I want to achieve, it does not separate the OPEN/CLOSE prices and keep them in one row per ticker and date.
Is there a way to add additional arguments to pivot_longer() to achieve that?
pivot_longer(data, -Date, names_to = c('Ticker', '.value'), names_sep = 'x')
# A tibble: 6 x 4
  Date       Ticker  OPEN CLOSE
  <chr>      <chr>  <dbl> <dbl>
1 1989-09-11 ENRS    83.0  82.1
2 1989-09-11 INFT    72.9  72.6
3 1989-09-12 ENRS    83.8  84.0
4 1989-09-12 INFT    73.5  72.5
5 1989-09-13 ENRS    83.2  83.9
6 1989-09-13 INFT    72.9  72.1
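If a ticker symbol could itself contain a lowercase x, names_sep = 'x' would split at the wrong place; a slightly more defensive sketch uses names_pattern with a regex anchored on the OPEN/CLOSE suffix:
pivot_longer(data, -Date,
             names_to = c('Ticker', '.value'),
             names_pattern = '(.*)x(OPEN|CLOSE)$')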
I'm trying to get the area under the curve of some data for each run of a set of simulation runs. My data is of the form:
run year data1 data2 data3
--- ---- ----- ----- -----
1 2001 2.3 45.6 30.2
1 2002 2.4 35.4 23.4
1 2003 2.6 45.6 23.6
2 2001 2.3 45.6 30.2
2 2002 2.4 35.4 23.4
2 2003 2.6 45.6 23.6
3 2001 ... and so on
So, I'd like to get the area under the curve for each data trace for run 1, run 2, ... where the x axis is always the year column and the y axis is each data column. So, as output I want something like:
run Data1_auc Data2_auc Data3_auc
--- --------- --------- ---------
1 4.5 6.7 27.5
2 3.4 6.8 35.4
3 4.5 7.8 45.6
(These are not the actual areas for the data above.)
I want to use the trapz() function from the pracma package to compute the area. It takes x and y values, trapz(x, y), where in my case x = year and y = each data column.
I've tried
dataCols <- colnames(myData %>% select(-c("run", "year")))
myData <- myData %>% group_by(run) %>% summarize_at(vars(dataCols), list(auc = trapz(year, .)))
but I can't get it to work without error. I've tried different variations on this, but can't seem to get it right.
Is this possible? If so, how do I do it?
library(dplyr)
library(pracma)
set.seed(1)
df <- tibble(
run = rep(1:3, each = 3),
year = rep(2001:2003, 3),
data1 = runif(9, 2, 3),
data2 = runif(9, 30, 50),
data3 = runif(9, 20, 40)
)
df
#> # A tibble: 9 x 5
#> run year data1 data2 data3
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2001 2.27 31.2 27.6
#> 2 1 2002 2.37 34.1 35.5
#> 3 1 2003 2.57 33.5 38.7
#> 4 2 2001 2.91 43.7 24.2
#> 5 2 2002 2.20 37.7 33.0
#> 6 2 2003 2.90 45.4 22.5
#> 7 3 2001 2.94 40.0 25.3
#> 8 3 2002 2.66 44.4 27.7
#> 9 3 2003 2.63 49.8 20.3
df %>%
group_by(run) %>%
summarise_at(vars(starts_with("data")), list(auc = ~trapz(year, .)))
#> # A tibble: 3 x 4
#> run data1_auc data2_auc data3_auc
#> <int> <dbl> <dbl> <dbl>
#> 1 1 4.79 66.5 68.7
#> 2 2 5.10 82.3 56.4
#> 3 3 5.45 89.2 50.5
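summarise_at() still works but is superseded in current dplyr; an equivalent sketch of the same computation with across() would be:
df %>%
  group_by(run) %>%
  summarise(across(starts_with("data"), ~trapz(year, .x), .names = "{.col}_auc"))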
Let's use the following example:
set.seed(2409)
N=5
T=10
id<- rep(LETTERS[1:N],each=T)
time<-rep(1:T, times=N)
var1<-runif(N*T,0,100)
var2<-runif(N*T,0,100)
var3<-runif(N*T,0,100)
var4<-runif(N*T,0,100)
var5<-runif(N*T,0,100)
df<-data.frame(id,time,var1,var2,var3,var4,var5); rm(N,T,id,time,var1,var2,var3,var4,var5)
I now want to apply a function to several of these variables (not the whole set of variables!) and create new variables accordingly.
I already have a suitable code for creating log variables. For this I would use the following code:
cols <- c("var1",
"var3",
"var5")
log <- log(df[cols])
colnames(log) <- paste(colnames(log), "log", sep = "_")
df <- cbind(df,log); rm(log, cols)
This gives me my additional log variables. But now I also want to create lagged and z-transformed variables, where these transformations are computed within each ID. So I wrote the following code, which works, but is extremely long and repetitive in my real dataset, where I apply it to 38 variables:
library(Hmisc)
library(dplyr)
df<-df %>%
group_by(id) %>%
mutate(var1_1=Lag(var1, shift=1),
var3_1=Lag(var3, shift=1),
var5_1=Lag(var5, shift=1),
var1_2=Lag(var1, shift=2),
var3_2=Lag(var3, shift=2),
var5_2=Lag(var5, shift=2),
var1_z=scale(var1),
var3_z=scale(var3),
var5_z=scale(var5)
)
I am quite sure there is a way to make this more efficient. Ideally, I could specify the original variables once, apply several functions to them, and create the new variables in one step.
Thank you very much!
You can use mutate_at with funs. This will apply the three functions in funs to each of the three variables in vars, creating 9 new columns.
library(dplyr)
df %>%
group_by(id) %>%
mutate_at(vars(var1, var3, var5),
funs(lag1 = lag(.), lag2 = lag(., 2), scale))
# # A tibble: 50 x 16
# # Groups: id [5]
# id time var1 var2 var3 var4 var5 var1_lag1 var3_lag1 var5_lag1
# <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 38.8 25.7 29.2 91.1 35.3 NA NA NA
# 2 A 2 87.1 22.3 8.27 31.5 93.7 38.8 29.2 35.3
# 3 A 3 61.7 38.8 0.887 63.0 50.4 87.1 8.27 93.7
# 4 A 4 0.692 60.1 71.5 74.0 41.6 61.7 0.887 50.4
# 5 A 5 60.1 13.3 90.4 80.6 47.5 0.692 71.5 41.6
# 6 A 6 46.4 3.67 36.7 86.9 67.5 60.1 90.4 47.5
# 7 A 7 80.4 72.1 82.2 25.5 70.3 46.4 36.7 67.5
# 8 A 8 48.8 25.7 93.4 19.8 81.2 80.4 82.2 70.3
# 9 A 9 48.2 31.5 82.1 47.2 49.2 48.8 93.4 81.2
# 10 A 10 21.8 32.6 76.5 19.7 41.1 48.2 82.1 49.2
# # ... with 40 more rows, and 6 more variables: var1_lag2 <dbl>, var3_lag2 <dbl>,
# # var5_lag2 <dbl>, var1_scale <dbl>, var3_scale <dbl>, var5_scale <dbl>
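Note that funs() is deprecated in current dplyr; an equivalent sketch with across(), using the same three variables and the same three functions, would be:
df %>%
  group_by(id) %>%
  mutate(across(c(var1, var3, var5),
                list(lag1  = ~lag(.x),
                     lag2  = ~lag(.x, 2),
                     scale = ~as.vector(scale(.x)))))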
Here is an option with data.table
library(data.table)
nm1 <- c('var1', 'var3', 'var5')
nm2 <- paste0(nm1, rep(c('_lag1', '_lag2'), each = 3))
nm3 <- paste0(nm1, '_scale')
setDT(df)[, c(nm2, nm3) := c(shift(.SD, n = 1:2), lapply(.SD,
   function(x) as.vector(scale(x)))), by = id, .SDcols = nm1]