Reshape horizontal to to long format using pivot_longer

Reshape horizontal to to long format using pivot_longer - r

I am trying to reshape my data to long instead of wide format using the same code provided earlier link , however it doesn't work even after several trials to modify names_pattern = "(.*)_(pre|post.*)",
My data sample is
data1<-read.table(text="
Serial_ID pre_EDV pre_ESV pre_LVEF post_EDV post_ESV post_LVEF
1 76.2 32.9 56.8 86.3 36.6 57.6
2 65.4 35.9 45.1 60.1 26.1 56.7
3 64.4 35.1 45.5 72.5 41.1 43.3
4 50 13.9 72.1 46.4 18.4 60.4
5 89.6 32 64.3 70.9 19.3 72.8
6 62 20.6 66.7 55.9 17.8 68.2
7 91.2 37.7 58.6 61.9 23.8 61.6
8 62 24 61.3 69.3 34.9 49.6
9 104.1 22.7 78.8 38.6 11.5 70.1
10 90.6 31.2 65.6 48 16.1 66.4", sep="", header=T)
I want to reshape my data to
put identical column headings below each other eg post_EDV below
pre_EDV
Create new column Pre vs. post
Fix column heading (remove "pre_" and "post_" to be "EDV" only (as shown in the screenshot below)).
This is the used code:
library(dplyr);library(tidyr);library(stringr)
out <- data %>% pivot_longer(cols = -Serial_ID,
names_to = c(".value", "prevspost"),
names_pattern = "(.*)_(pre|post.*)",
names_sep="_") #%>% as.data.frame
Also I tried names_prefix = c("pre_","post_") instead of names_pattern = "(.*)_(pre|post.*)", but it doesn't work.
Any advice will be greatly appreciated.

Edit I recommend using #Dave2e's superior approach.
The reason your attempt didn't work is because the pattern has to match in order. You could try this:
library(tidyr)
library(dplyr)
data1 %>% pivot_longer(cols = -Serial_ID,
names_to = c("prevspost",".value"),
names_pattern = "(pre|post)_(\\w+)") %>%
dplyr::arrange(desc(prevspost),Serial_ID)
# A tibble: 20 x 5
Serial_ID prevspost EDV ESV LVEF
<int> <chr> <dbl> <dbl> <dbl>
1 1 pre 76.2 32.9 56.8
2 2 pre 65.4 35.9 45.1
3 3 pre 64.4 35.1 45.5
4 4 pre 50 13.9 72.1
5 5 pre 89.6 32 64.3
6 6 pre 62 20.6 66.7
7 7 pre 91.2 37.7 58.6
8 8 pre 62 24 61.3
9 9 pre 104. 22.7 78.8
10 10 pre 90.6 31.2 65.6
11 1 post 86.3 36.6 57.6
12 2 post 60.1 26.1 56.7
13 3 post 72.5 41.1 43.3
14 4 post 46.4 18.4 60.4
15 5 post 70.9 19.3 72.8
16 6 post 55.9 17.8 68.2
17 7 post 61.9 23.8 61.6
18 8 post 69.3 34.9 49.6
19 9 post 38.6 11.5 70.1
20 10 post 48 16.1 66.4

Your initial approach very close, it needed some simplification. Use only "names_sep" or "names_pattern"
library(tidyr)
library(dplyr)
data1 %>% pivot_longer(cols = -Serial_ID,
names_to = c("Pre vs. post", '.value'),
names_sep="_")
# A tibble: 20 x 5
Serial_ID `Pre vs. post` EDV ESV LVEF
<int> <chr> <dbl> <dbl> <dbl>
1 1 pre 76.2 32.9 56.8
2 1 post 86.3 36.6 57.6
3 2 pre 65.4 35.9 45.1
4 2 post 60.1 26.1 56.7
5 3 pre 64.4 35.1 45.5
6 3 post 72.5 41.1 43.3
7 4 pre 50 13.9 72.1
8 4 post 46.4 18.4 60.4
9 5 pre 89.6 32 64.3
10 5 post 70.9 19.3 72.8
11 6 pre 62 20.6 66.7
12 6 post 55.9 17.8 68.2
13 7 pre 91.2 37.7 58.6
14 7 post 61.9 23.8 61.6
15 8 pre 62 24 61.3
16 8 post 69.3 34.9 49.6
17 9 pre 104. 22.7 78.8
18 9 post 38.6 11.5 70.1
19 10 pre 90.6 31.2 65.6
20 10 post 48 16.1 66.4

try this:
library(dplyr);library(tidyr);library(stringr)
out <- data1 %>% pivot_longer(-Serial_ID,
names_to = c("measurement", "names"),
values_to = "values",
names_sep = "_")
out
# # A tibble: 60 x 4
# Serial_ID measurement names values
# <int> <chr> <chr> <dbl>
# 1 1 pre EDV 76.2
# 2 1 pre ESV 32.9
# 3 1 pre LVEF 56.8
# 4 1 post EDV 86.3
# 5 1 post ESV 36.6
# 6 1 post LVEF 57.6
# 7 2 pre EDV 65.4
# 8 2 pre ESV 35.9
# 9 2 pre LVEF 45.1
# 10 2 post EDV 60.1
# # ... with 50 more rows
Your code snipped passed the object "data" instead of "data1" into the pipe which produced an error:
"Error: No tidyselect variables were registered".

Related

Apply a function to a set of columns in a dataset

using this function I calculate the variance of some 3d points.
centroid_3d_sq_dist <- function(
point_matrix
) {
if (nrow(point_matrix) == 1) {
return(0)
}
mean_point <- apply(point_matrix, 2, mean)
point_sq_distances <- apply(
point_matrix,
1,
function(row_point) {
sum((row_point - mean_point) ** 2)
}
)
sum_sq_distances <- sum(point_sq_distances)
return(sum_sq_distances)
}
point_3d_variance <- function(
point_matrix
) {
if (nrow(point_matrix) == 1) {
return(0)
}
dist_var <- centroid_3d_sq_dist(point_matrix) /
(nrow(point_matrix) - 1)
return(dist_var)
}
The argument of this function is a matrix (x,y,z).
Now I have a dataset with two 3D points.
ID Trial Size PP PA FkA ciccioX ciccioY ciccioZ pinoX pinoY pinoZ
1 Gigi 1 40 39.6 1050. 31.5 521. 293. 10.6 516. 323. 6.41
2 Gigi 2 20.0 30.7 944. 9.35 525. 300. 12.6 520. 305. 7.09
3 Gigi 3 30 29.5 1056. 24.1 521. 298. 12.3 519. 321. 5.89
4 Gigi 5 60 53.0 1190. 53.0 680. 287. 64.4 699. 336. 68.6
5 Bibi 1 40 38.3 1038. 31.4 524. 289. 10.9 519. 319. 6.17
6 Bibi 2 60 64.7 1293. 47.8 516. 282. 10.4 519. 330. 6.32
7 Bibi 3 20.0 33.8 1092. 17.5 523. 300. 12.8 518. 315. 6.22
8 Bibi 4 30 35.0 1108. 26.4 525. 295. 11.7 517. 320. 5.78
9 Bibi 5 50 46.5 1199. 34.2 515. 289. 11.2 517. 323. 6.27
10 Bibi 6 30 28.7 1016. 17.1 528. 298. 12.7 524. 314. 6.36
The 3D points are:
ciccio: ciccioX ciccioY ciccioZ
pino: pinoX pinoY pinoZ
I want to calculate the variance of ciccio and the variance of pino grouped by ID and SIZE.
I tried to do:
data %>%
group_by(SubjectID, Size) %>%
summarize(as.data.frame(matrix(f4(dd[7:9],dd[10:12]), nr = 1)))
But it doesn't work.
Do you have any advice?

Your shown dataset is too small to calculate (meaningful) variations. But you could use:
library(dplyr)
df %>%
group_by(ID, Size) %>%
summarise(var_ciccio = point_3d_variance(as.matrix(across(ciccioX:ciccioZ))),
var_pino = point_3d_variance(as.matrix(across(pinoX:pinoZ))),
.groups = "drop")
This returns
# A tibble: 9 x 4
ID Size var_ciccio var_pinoo
<chr> <dbl> <dbl> <dbl>
1 Bibi 20 0 0
2 Bibi 30 9.5 42.7
3 Bibi 40 0 0
4 Bibi 50 0 0
5 Bibi 60 0 0
6 Gigi 20 0 0
7 Gigi 30 0 0
8 Gigi 40 0 0
9 Gigi 60 0 0

Pivot Longer with Modification of Columns

I have data that is in the following format:
(data <- tribble(
~Date, ~ENRSxOPEN, ~ENRSxCLOSE, ~INFTxOPEN, ~INFTxCLOSE,
"1989-09-11",82.97,82.10,72.88,72.56,
"1989-09-12",83.84,83.96,73.52,72.51,
"1989-09-13",83.16,83.88,72.91,72.12))
# A tibble: 3 x 5
Date ENRSxOPEN ENRSxCLOSE INFTxOPEN INFTxCLOSE
<chr> <dbl> <dbl> <dbl> <dbl>
1 1989-09-11 83.0 82.1 72.9 72.6
2 1989-09-12 83.8 84.0 73.5 72.5
3 1989-09-13 83.2 83.9 72.9 72.1
For analysis, I want to pivot this tibble longer to the following format:
tribble(
~Ticker, ~Date, ~OPEN, ~CLOSE,
"ENRS","1989-09-11",82.97,82.10,
"ENRS","1989-09-12",83.84,83.96,
"ENRS","1989-09-13",83.16,83.88,
"INFT","1989-09-11",72.88,72.56,
"INFT","1989-09-12",73.52,72.51,
"INFT","1989-09-13",72.91,72.12)
# A tibble: 3 x 5
Date ENRSxOPEN ENRSxCLOSE INFTxOPEN INFTxCLOSE
<chr> <dbl> <dbl> <dbl> <dbl>
1 1989-09-11 83.0 82.1 72.9 72.6
2 1989-09-12 83.8 84.0 73.5 72.5
3 1989-09-13 83.2 83.9 72.9 72.1
I.e., I want to separate the Open/Close prices from the ticker, and put the latter as an entirely new column in the beginning.
I've tried to use the function pivot_longer:
pivot_longer(data, cols = ENRSxOPEN:INFTxCLOSE)
While this goes into the direction of what I wanna achieve, it does not separate the prices and keep them in one row for each Ticker.
Is there a way to add additional arguments to pivot_longer()to achieve that?

pivot_longer(data, -Date, names_to = c('Ticker', '.value'), names_sep = 'x')
# A tibble: 6 x 4
Date Ticker OPEN CLOSE
<dbl> <chr> <dbl> <dbl>
1 1969 ENRS 83.0 82.1
2 1969 INFT 72.9 72.6
3 1968 ENRS 83.8 84.0
4 1968 INFT 73.5 72.5
5 1967 ENRS 83.2 83.9
6 1967 INFT 72.9 72.1

How to get returns from a DF

So I'm using the quantmod library to calculate historical returns, but while I can get the past prices, how can I calculate the returns and add it on to the dataframe???
My code looks like this
tickers <- c('KO', 'AAPL')
getSymbols(tickers, from = '2020-07-01', to = '2021-07-01')
history <- cbind(KO$KO.Close,AAPL$AAPL.Close)

First I did a way to better import and structure data
Import
library(quantmod)
library(tidyverse)
tickers <- c('KO', 'AAPL')
df <-
map_df(
.x = tickers,
.f = function(x){
getSymbols(x, from = '2020-07-01', to = '2021-07-01',auto.assign = FALSE) %>%
as_tibble() %>%
set_names(c("open","high","low","close","volume","adjusted")) %>%
mutate(symbol = x)
}
)
# A tibble: 504 x 7
open high low close volume adjusted symbol
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 44.9 45.5 44.7 44.8 14316500 43.1 KO
2 45.3 45.4 44.8 44.9 15111900 43.2 KO
3 45.1 45.3 44.6 45.2 15146000 43.5 KO
4 45 45.5 44.8 45.2 13043600 43.5 KO
5 45.1 45.2 44.5 45.1 13851200 43.3 KO
6 45.0 45.0 43.8 43.9 16087100 42.2 KO
7 43.9 45.2 43.9 45.2 15627800 43.4 KO
8 45.5 45.7 45.0 45.2 16705300 43.5 KO
9 44.9 45.9 44.7 45.9 17080100 44.1 KO
10 46.3 47.2 46.2 46.4 23738000 44.6 KO
Return
I do not know if this is the right formula for return, but you can change later inside mutate
df %>%
group_by(symbol) %>%
mutate(return = 100*((open/lag(open))-1))
# A tibble: 504 x 8
# Groups: symbol [2]
open high low close volume adjusted symbol return
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 44.9 45.5 44.7 44.8 14316500 43.1 KO NA
2 45.3 45.4 44.8 44.9 15111900 43.2 KO 0.801
3 45.1 45.3 44.6 45.2 15146000 43.5 KO -0.331
4 45 45.5 44.8 45.2 13043600 43.5 KO -0.310
5 45.1 45.2 44.5 45.1 13851200 43.3 KO 0.311
6 45.0 45.0 43.8 43.9 16087100 42.2 KO -0.199
7 43.9 45.2 43.9 45.2 15627800 43.4 KO -2.60
8 45.5 45.7 45.0 45.2 16705300 43.5 KO 3.76
9 44.9 45.9 44.7 45.9 17080100 44.1 KO -1.36
10 46.3 47.2 46.2 46.4 23738000 44.6 KO 3.10
# ... with 494 more rows

Assuming the return you're looking for as today's value/yesterday's value, and using the tidyverse:
library(tidyverse)
library(timetk)
tickers <- c('KO', 'AAPL')
quantmod::getSymbols(tickers, from = '2020-07-01', to = '2021-07-01')
# Convert to a tibble to keep the dates
equity1 <- tk_tbl(KO) %>%
select(date = index, 5)
equity2 <- tk_tbl(AAPL) %>%
select(date = index, 5)
# Combine the series using a join, in case dates don't line up exactly.
history <- full_join(equity1, equity2, by = "date")
# Make data long, group by equity, do the calculation, turn back into wide data:
return <- history %>%
pivot_longer(-date) %>%
group_by(name) %>%
mutate(return = value/lag(value)-1) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = c(value, return))
# A tibble: 252 x 5
date value_KO.Close value_AAPL.Close return_KO.Close return_AAPL.Close
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-07-01 44.8 91.0 NA NA
2 2020-07-02 44.9 91.0 0.00134 0
3 2020-07-06 45.2 93.5 0.00780 0.0268
4 2020-07-07 45.2 93.2 -0.000442 -0.00310
5 2020-07-08 45.1 95.3 -0.00310 0.0233
6 2020-07-09 43.9 95.8 -0.0257 0.00430
7 2020-07-10 45.2 95.9 0.0282 0.00175
8 2020-07-13 45.2 95.5 0.00221 -0.00461
9 2020-07-14 45.9 97.1 0.0137 0.0165
10 2020-07-15 46.4 97.7 0.0116 0.00688
# ... with 242 more rows

avoid scatterplot between the same variable in a loop

I have a data frame (tab3) looking like this:
CWRES ID AGE BMI WGT
3 0.59034000 1 37.5 20.7 64.6
4 1.81300000 1 37.5 20.7 64.6
5 1.42920000 1 37.5 20.7 64.6
6 0.59194000 1 37.5 20.7 64.6
7 0.30886000 1 37.5 20.7 64.6
8 -0.14601000 1 37.5 20.7 64.6
9 -0.19776000 1 37.5 20.7 64.6
10 0.74208000 1 37.5 20.7 64.6
11 -0.69280000 1 37.5 20.7 64.6
38 -2.42900000 1 37.5 20.7 64.6
39 -0.25732000 1 37.5 20.7 64.6
40 -0.49689000 1 37.5 20.7 64.6
41 -0.11556000 1 37.5 20.7 64.6
42 0.91036000 1 37.5 20.7 64.6
43 -0.24766000 1 37.5 20.7 64.6
44 -0.14962000 1 37.5 20.7 64.6
45 -0.45651000 1 37.5 20.7 64.6
48 0.53237000 2 58.5 23.0 53.4
49 -0.53284000 2 58.5 23.0 53.4
50 -0.33086000 2 58.5 23.0 53.4
51 -0.56355000 2 58.5 23.0 53.4
52 0.00883120 2 58.5 23.0 53.4
53 -1.00650000 2 58.5 23.0 53.4
80 0.85810000 2 58.5 23.0 53.4
81 -0.71715000 2 58.5 23.0 53.4
82 0.44346000 2 58.5 23.0 53.4
83 1.09890000 2 58.5 23.0 53.4
84 0.98726000 2 58.5 23.0 53.4
85 0.19667000 2 58.5 23.0 53.4
86 -1.32570000 2 58.5 23.0 53.4
89 -4.56920000 3 43.5 26.7 66.2
90 0.75174000 3 43.5 26.7 66.2
91 0.40935000 3 43.5 26.7 66.2
92 0.18340000 3 43.5 26.7 66.2
93 0.27399000 3 43.5 26.7 66.2
94 -0.23596000 3 43.5 26.7 66.2
95 -1.59460000 3 43.5 26.7 66.2
96 -0.03708900 3 43.5 26.7 66.2
97 0.68750000 3 43.5 26.7 66.2
98 -0.47979000 3 43.5 26.7 66.2
125 2.23200000 3 43.5 26.7 66.2
126 0.90470000 3 43.5 26.7 66.2
127 -0.34493000 3 43.5 26.7 66.2
128 -0.02114400 3 43.5 26.7 66.2
129 -1.08830000 3 43.5 26.7 66.2
130 -0.33937000 3 43.5 26.7 66.2
131 1.19820000 3 43.5 26.7 66.2
132 0.81653000 3 43.5 26.7 66.2
133 1.61810000 3 43.5 26.7 66.2
134 0.42914000 3 43.5 26.7 66.2
135 -1.03150000 3 43.5 26.7 66.2
...
I want to plot the variable CWRES versus ID, AGE, BMI and WGT. To do this I use this code:
library(ggplot2)
plotloop <- function(x, na.rm = TRUE, ...) {
nm <- names(x)
for (i in seq_along(nm)) {
print(ggplot(x,aes_string(x = nm[i], y = nm[1])) +
geom_point()) }
}
plotloop(tab3)
However, it also plots CWRES vs CWRES and I do not want to plot CWRES vs CWRES.
What should I do?
Thanks in advance,
Mario

The loop may not be the best way to go about plotting multiple plots with ggplot. Hence I will not try to fix your code but suggest an alternative route.
You should first melt your data.frame to transform it to long format with only retaining the CWRES variable:
require(reshape2)
mDf <- melt(x, c("CWRES"))
Now you can create your plots as follows:
g <- ggplot(mDF,aes(x=CWRES,y=value))
g <- g + geom_point()
g <- g + facet_grid(.~variable)
g
This creates a faceted plot with the four scatter plots next to each other.
If you really want to plot multiple i would proceed as follows (based on the formatting above):
variables <- unique(mDF$variable)
for (v in variables)
{
print(ggplot(mDF[mDF$variable==v,],aes(x=CWRES,y=value)) + geom_point() )
}

Finally I found the solution:
This code works great!
for(i in names(tab3)[2:5]) {
df2 <- tab3[, c(i, "CWRES")]
print(ggplot(tab3) + geom_point(aes_string(x = i, y = "CWRES")) + theme_bw())
}

changing variable values into new variables in r

I have a data set with about 600 animals with this structure:
anim <- c(1,1,1,1,1,2,2,2,2)
point <- c(1,2,3,4,6,3,4,5,6)
temp <- c(38.8,38.5,33.2,33.5,37.2,36.2,36.5,36.0,37.8)
mydf <- data.frame(anim,point,temp)
anim point temp
1 1 38.8
1 2 38.5
1 3 33.2
1 4 33.5
1 6 37.2
2 3 36.2
2 4 36.5
2 5 36.0
2 6 37.8
The variable "point" contains different measuring points (temperature) on an animal and I need to have these measuring points (1,2,3,4,5,6) as new variables such that 1=bel,2=ber,3=le,4=re,5=ey,6=cr. Note, that some points may not appear or measured and therefore NA should be put in.
mynewdf should look like this:
anim bel ber le re ey cr
1 38.8 38.5 33.2 33.5 NA 37.2
2 NA NA 36.2 36.5 36.0 37.8
I hope that my question is clear enough and any help would be very much appreciated.
EDITED:
This is an extract from my actual data set:
head(irpig,n=25)
head(irpig,n=25)
dam anim point temp
1 1A0331 20584 1 37.9
2 1A0331 20584 2 37.7
3 1A0331 20584 3 34.3
4 1A0331 20584 4 35.8
5 1A0331 20584 6 37.6
6 1A0331 20585 2 38.7
7 1A0331 20585 4 36.4
8 1A0331 20585 6 38.0
9 1A0331 20586 1 39.0
10 1A0331 20586 2 39.8
11 1A0331 20586 3 37.9
12 1A0331 20586 4 38.0
13 1A0331 20586 6 38.5
14 1A0331 20587 1 39.3
15 1A0331 20587 2 38.9
16 1A0331 20587 3 39.4
17 1A0331 20587 4 38.6
18 1A0331 20587 6 39.4
19 1A0331 20588 1 39.6
20 1A0331 20588 2 39.2
21 1A0331 20588 3 38.9
22 1A0331 20588 4 38.0
23 1A0331 20588 6 39.6
24 1A0331 20589 1 38.1
25 1A0331 20589 2 38.7
Baz

Here is one solution. It uses the dcast function from reshape2 library by Hadley Wickham
mydf$point = as.factor(mydf$point)
levels(mydf$point) = c("bel", "ber", "le", "re", "ey", "cr")
library(reshape2)
dcast(mydf, anim ~ point)
anim bel ber le re ey cr
1 1 38.8 38.5 33.2 33.5 NA 37.2
2 2 NA NA 36.2 36.5 36 37.8

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Reshape horizontal to to long format using pivot_longer - r

Related

Apply a function to a set of columns in a dataset

Pivot Longer with Modification of Columns

How to get returns from a DF

avoid scatterplot between the same variable in a loop

changing variable values into new variables in r

Categories

Resources