Reshaping database using reshape package - r

I would like to reshape some rows of my data frame. In particular, I have rows that are replicated for the Id column, and I would like to convert those rows into columns. Below is some code that builds an example of my data.
I've tried t() and reshape, but they don't do what I want. Can anyone give me any suggestions?
test <- data.frame(Id = c(1, 1, 2, 3),
                   St = c(20, 80, 80, 20),
                   gap = seq(0.02, 0.08, by = 0.02),
                   gip = c(0.23, 0.60, 0.86, 2.09),
                   gat = c(0.0107, 0.989, 0.337, 0.663))

setNames(data.frame(t(test))[2:nrow(data.frame(t(test))),], test$Id)
1 1 2 3
St 20.0000 80.000 80.000 20.000
gap 0.0200 0.040 0.060 0.080
gip 0.2300 0.600 0.860 2.090
gat 0.0107 0.989 0.337 0.663
It helps to provide an expected output. Is this what you expected?
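For completeness, if you want to stay within the reshape2 family mentioned in the title, a rough sketch of an equivalent reshape could look like the following (the rep counter is only needed because Id = 1 occurs twice, and it shows up in the resulting column names as 1_1, 1_2, ...):
library(reshape2)
# add a within-Id replicate counter so the duplicated Id = 1 rows become separate columns
test$rep <- ave(seq_along(test$Id), test$Id, FUN = seq_along)
molten <- melt(test, id.vars = c("Id", "rep"))
dcast(molten, variable ~ Id + rep)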

Related

How to choose best imputed data for further analysis in r

I have a multivariate time series dataset (almost 30 years) with random missing values.
        T        S   po4     si   din
  9.00000       NA 0.290  5.310 18.51
  8.45000       NA 0.130  6.180 14.74
 13.60000 36.46000 0.010  0.500  1.86
 23.20000 32.12000 0.010  6.580  0.81
 26.00000 32.13000 0.070  0.500  0.23
       NA 35.41400 0.010  1.670  0.72
 24.80000 36.42000 0.000  3.540    NA
 24.20000 33.16000 0.110  2.020    NA
 22.50000 37.60000 0.040  0.400    NA
 16.32000 36.01000 0.020  2.900    NA
 17.60000 38.04000 0.010  0.970    NA
  9.70000 36.36000 0.120  7.950    NA
 13.80000 38.33000 0.010  5.760    NA
  7.90000 35.51000 0.060  2.350    NA
 11.90000 38.33000 0.030  3.410    NA
 24.10000 36.30000 0.020  0.730    NA
 25.20000 35.77000 0.020  1.370    NA
 24.70000 37.54000 0.330  0.700    NA
  5.75000 33.26000 0.120  0.860    NA
 13.30000 33.14000 0.000  0.000    NA
 13.60000 38.21265 0.000  0.190    NA
 15.70000 28.33000 0.040 11.500 41.64
I would like to fill the gaps in order to have a constant frequency (I have a monthly frequency with missing values) so that I can try different techniques in the context of a time series analysis.
I have tried to use the mice package in R, and to decide which imputed dataset to use with with() and pool(), but I don't want to use all of them in a model; I'd rather obtain the most correct one and use that one for further analysis.
How can I do that? How can I find the best one?
Otherwise, can you suggest another method as an alternative to mice?
Thank you very much in advance.
If you have strong time correlation, you can use the imputeTS package for time series imputation.
library(imputeTS)
na_kalman(your_dataframe)
There are also several other methods included in the package. As for mice, the whole point of multiple imputation is to have several imputed datasets: you perform your analysis separately on each of them and then pool or compare the results. Because imputation always comes with some uncertainty (your missing-data replacements are only estimates), this technique lets you model, or at least get a feeling for, that uncertainty.
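If you do go the multiple-imputation route with mice, the usual workflow is roughly the following sketch (the lm() formula here is only an illustrative assumption; substitute your actual analysis model):
library(mice)
imp  <- mice(your_dataframe, m = 5, seed = 1)   # m imputed datasets
fits <- with(imp, lm(din ~ T + S + po4 + si))   # fit the same model on each of them
pool(fits)                                      # pooled estimates across the imputations
# if you nevertheless want one completed dataset, e.g. the first of the five:
completed1 <- complete(imp, 1)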
If you don't want to run multiple analyses and prefer single imputation, you can use any one of these datasets; they are equally valid, and there is no single best one.
Or you could use a single-imputation package like missForest.

How could I use R to pull a few select lines out of a large text file?

I am fairly new to stack overflow but did not find this in the search engine. Please let me know if this question should not be asked here.
I have a very large text file. It has 16 entries and each entry looks like this:
AI_File 10
Version
Date 20200708 08:18:41
Prompt1 LOC
Resp1 H****
Prompt2 QUAD
Resp2 1012
TransComp c-p-s
Model Horizontal
### Computed Results
LAI 4.36
SEL 0.47
ACF 0.879
DIFN 0.031
MTA 40.
SEM 1.
SMP 5
### Ring Summary
MASK 1 1 1 1 1
ANGLES 7.000 23.00 38.00 53.00 68.00
AVGTRANS 0.038 0.044 0.055 0.054 0.030
ACFS 0.916 0.959 0.856 0.844 0.872
CNTCT# 3.539 2.992 2.666 2.076 1.499
STDDEV 0.826 0.523 0.816 0.730 0.354
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.028 0.039 0.034 0.032 0.018
### Contributing Sensors
### Observations
A 1 20200708 08:19:12 x 31.42 38.30 40.61 48.69 60.28
L 2 20200708 08:19:12 1 5.0e-006
B 3 20200708 08:19:21 x 2.279 2.103 1.408 5.027 1.084
B 4 20200708 08:19:31 x 1.054 0.528 0.344 0.400 0.379
B 5 20200708 08:19:39 x 0.446 1.255 2.948 3.828 1.202
B 6 20200708 08:19:47 x 1.937 2.613 5.909 3.665 5.964
B 7 20200708 08:19:55 x 0.265 1.957 0.580 0.311 0.551
Almost all of this is junk information, and I am looking to run some code for the whole file that will only give me the lines for "Resp2" and "LAI" for all 16 of the entries. Is a task like this doable in R? If so, how would I do it?
Thanks very much for any help and please let me know if there's any more information I can give to clear anything up.
I've saved your file as a text file and read in the lines. Then you can use regex to extract the desired rows. However, I feel that my approach is rather clumsy; I bet there are more elegant ways (maybe also with Unix command-line tools).
data <- readLines("testfile.txt")
library(stringr)
resp2 <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^Resp2).*$")))
lai <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
  resp2 = resp2[!is.na(resp2)],
  lai = lai[!is.na(lai)]
)
data_extract
resp2 lai
1 1012 4.36
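A base-R variant of the same idea, without stringr (again assuming the file was saved as testfile.txt), might look like this:
data  <- readLines("testfile.txt")
keep  <- grepl("^(Resp2|LAI)\\b", data)          # lines starting with Resp2 or LAI
parts <- strsplit(trimws(data[keep]), "\\s+")    # split each kept line on whitespace
data_extract <- data.frame(
  name  = vapply(parts, `[`, character(1), 1),
  value = as.numeric(vapply(parts, `[`, character(1), 2))
)
data_extract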
A solution based on the tidyverse can look as follows.
library(dplyr)
library(vroom)
library(stringr)
library(tibble)
library(tidyr)
vroom_lines('data') %>%
  enframe() %>%
  filter(str_detect(value, 'Resp2|LAI')) %>%
  transmute(value = str_squish(value)) %>%
  separate(value, into = c('name', 'value'), sep = ' ')
# name value
# <chr> <chr>
# 1 Resp2 1012
# 2 LAI 4.36

Error Adding p-value to parallel coordinates plot (ggplot)

I have a dataset of paired data from multiple samples that I want to show as a parallel coordinates graph with a p-value above it (i.e. plot each data point in each group, link the pairs with lines, and show the comparison statistic above the plotted data).
I can get the graph to (largely) look the way I want it to, but when I try to add a p-value using stat_compare_means(paired=TRUE), I get 3 errors:
Twice:
"Don't know how to automatically pick scale for object of type quosure/formula. Defaulting to continuous."
Once:
"Error in validDetails.text(x) : 'pairlist' object cannot be coerced to type 'double'".
My data is a data.frame with three variables: a sample variable so I know which pair is which, a group variable so I know which category the value belongs to, and the value variable. I've pasted the code below and am more than happy to take suggestions on any other ways to make the code look better as well.
ggplot(test_OCI, aes(x = test_OCI$variable, y = test_OCI$value, group = test_OCI$Pt)) +
  geom_point(aes(x = test_OCI$variable), size = 3) +
  geom_line(aes(x = test_OCI$variable), group = test_OCI$Pt) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(color = "black")) +
  scale_x_discrete(labels = c("OCI_pre_ART" = "Pre-ART OCI", "OCI_on_ART" = "On-ART OCI")) +
  stat_compare_means(paired = TRUE)
edit 1: adding sample data
There isn't too much data, but I've added it below per request.
Pt variable value
1 Pt1 OCI_pre_ART 0.024
2 Pt2 OCI_pre_ART 0.027
3 Pt3 OCI_pre_ART 0.027
4 Pt4 OCI_pre_ART 0.010
5 Pt5 OCI_pre_ART 0.075
6 Pt6 OCI_pre_ART 0.040
7 Pt7 OCI_pre_ART 0.070
8 Pt8 OCI_pre_ART 0.011
9 Pt9 OCI_pre_ART 0.022
10 Pt10 OCI_pre_ART 0.006
11 Pt11 OCI_pre_ART 0.019
12 Pt1 OCI_on_ART 0.223
13 Pt2 OCI_on_ART 0.166
14 Pt3 OCI_on_ART 0.163
15 Pt4 OCI_on_ART 0.126
16 Pt5 OCI_on_ART 0.090
17 Pt6 OCI_on_ART 0.139
18 Pt7 OCI_on_ART 0.403
19 Pt8 OCI_on_ART 0.342
20 Pt9 OCI_on_ART 0.092
edit 2: packages
All lines in the figure code are from ggplot2 except stat_compare_means(paired=TRUE), which is from ggpubr.
I'm not sure if this is the reason, but it appears that the stat_compare_means() line was not interpreting the x ~ y aesthetic. Changing the line to
stat_compare_means(comparisons = list(c("OCI_pre_ART", "OCI_on_ART")), paired = TRUE) resulted in a functional graph.
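Putting that fix into the full call, a cleaned-up sketch (referencing columns through aes() rather than test_OCI$..., which is generally safer) could look like this:
library(ggplot2)
library(ggpubr)
ggplot(test_OCI, aes(x = variable, y = value, group = Pt)) +
  geom_point(size = 3) +
  geom_line() +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(color = "black")) +
  scale_x_discrete(labels = c("OCI_pre_ART" = "Pre-ART OCI", "OCI_on_ART" = "On-ART OCI")) +
  stat_compare_means(comparisons = list(c("OCI_pre_ART", "OCI_on_ART")), paired = TRUE)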

Building a time series of r-squared values in R

I have multiple xts objects that look something like this:
t x0 x1 x2 x3 x4
2000-01-01 0.397 0.262 -0.039 0.440 0.186
2000-01-02 0.020 -0.219 0.197 0.301 -0.300
2000-01-03 0.010 0.064 -0.034 0.214 -0.451
2000-01-04 -0.002 -0.483 -0.251 0.023 0.153
2000-01-05 0.451 0.375 0.566 0.258 -0.092
2000-01-06 0.411 0.219 0.108 0.137 -0.087
2000-01-07 0.111 -0.039 0.187 -0.271 0.346
2000-01-08 0.131 -0.078 0.163 -0.057 -0.313
2000-01-09 -0.409 -0.022 0.211 -0.297 0.217
.
.
.
Now, I am interested in finding out how well the average of the other variables explains each single x variable during each period 𝜏 (e.g. monthly).
Or in other words, I'm interested in building a time series of the r-squared in the following regression:
x_{i,t} = 𝛽_0 + 𝛽_1 · avg_{i,t} + 𝜀_{i,t}
where avg_{i,t} is the average of all other variables at time t, running that for each of the i ∈ n variables and for the observations t ∈ 𝜏 (e.g. 𝜏 is a month). I would like to be able to run this with any period 𝜏, where I believe xts might be able to help, since there are functions such as endpoints or apply.monthly.
I have years and years of data, and multiple of those xts objects, so the question is: what would be a sensible way to run those regressions and collect the time series of r-squared values? A for-loop could take care of iterating over the separate xts objects, but I don't think a for-loop would be very wise inside each single object.
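Purely as an illustration of the per-month regressions described above, a minimal sketch (assuming a single xts object x with columns x0..x4 and a monthly 𝜏; other periods would just change the split argument) might be:
library(xts)
# for every month and every column i, regress x_i on the row-wise average of the
# other columns and keep the R-squared
rsq_by_month <- function(x) {
  monthly <- split(x, f = "months")                 # list of one-month xts chunks
  out <- t(sapply(monthly, function(chunk) {
    sapply(seq_len(ncol(chunk)), function(i) {
      y   <- as.numeric(chunk[, i])                 # variable being explained
      avg <- rowMeans(chunk[, -i, drop = FALSE])    # average of the other variables
      summary(lm(y ~ avg))$r.squared
    })
  }))
  colnames(out) <- colnames(x)
  out                                               # one row per month, one column per variable
}
# rsq <- rsq_by_month(my_xts)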

Finance - Random Portfolios using plyr and rportfolio - need to split columns

I am trying to compare an actual portfolio's performance to the performances of hypothetical random portfolios.
Here is a sample of the data set I am working with. It shows two months worth of data, the names of the managers in the portfolios, and the returns, allocations, and attributions of those managers.
"date" "manager" "return" "allocation" "attribution"
2005-01-31 "manager01" -0.00763241754291056 0.146 6.94549996404861e-05
2005-01-31 "manager02" 0.0292205518315147 0.048 4.09087725641205e-05
2005-01-31 "manager03" -0.0354047394153526 0.049 -8.85118485383814e-05
2005-01-31 "manager04" 0.0424244772606645 0.124 -0.000148485670412326
2005-01-31 "manager05" -0.0574606103881735 0.134 0.000206858197397425
2005-01-31 "manager06" 0.0465278163188542 0.098 -0.000265208553017469
2005-01-31 "manager07" 0.157063203979822 0.142 -0.000219888485571751
2005-01-31 "manager08" -0.0594342759491509 0.071 2.97171379745754e-05
2005-01-31 "manager09" -0.0199466865109495 0.093 6.18347281839434e-05
2005-01-31 "manager10" 0.118839410130508 0.095 0.000190143056208813
2005-02-28 "manager01" 0.0403671815817711 0.119 -0.000460185870032191
2005-02-28 "manager02" 0.0246109773791459 0.064 -3.93775638066334e-05
2005-02-28 "manager03" 0.00868489880733732 0.065 -4.08190243944854e-05
2005-02-28 "manager04" -0.082332291530606 0.105 2.46996874591818e-05
2005-02-28 "manager05" -0.0903959999837099 0.114 -0.000117514799978823
2005-02-28 "manager06" 0.0514735666329574 0.081 -6.17682799595489e-05
2005-02-28 "manager07" -0.00914374153663751 0.164 -8.41224221370651e-05
2005-02-28 "manager08" -0.0367283709786134 0.083 -4.77468822721974e-05
2005-02-28 "manager09" -0.04752320926613 0.079 -3.8018567412904e-05
2005-02-28 "manager10" -0.0657464361573664 0.126 -0.000309008249939622
In order to get the data into R, copy the data to the clipboard and then
mydata<-read.table("clipboard",header=TRUE)
In order to create the random portfolios I then use the ddply and mutate functions from plyr and the rlongonly function from rportfolios.
library(plyr)
library(rportfolios)
mydata.new <- ddply(mydata, .(date), mutate,
                    new.attr = t(rlongonly(m = 1, n = length(date), k = 10, x.u = .15)) * return)
In the rlongonly function:
The value for m is the number of random portfolios I want to create.
The value for n is the number of allocations for the time period.
The value for k is the number of non zero allocations.
The value for x.u is the upper limit for an allocation.
Attribution is merely return * allocation.
If I have m=1, everything is fine.
If I have m>1, the dimensions of the output are not correct.
mydata.new2 <- ddply(mydata, .(date), mutate,
                     new.attr = t(rlongonly(m = 2, n = length(date), k = 10, x.u = .15)) * return)
dim(mydata.new2)
mydata.new2 has only 6 columns when it should have 7.
The last column "new.attr" is basically 2 columns in one.
When I try to melt mydata.new2, I get the following error.
library(reshape2)
drop<-names(mydata.new2) %in% c("manager","return","allocation")
melt(mydata.new2[!drop],id="date")
> Error in rbind(deparse.level, ...) :
> numbers of columns of arguments do not match
How do I split up the "new.attr" column so that I can melt and graph the data?
First I regenerate your data; you should use dput(mydata) and post the result next time.
Then I generate your mydata.new2 data frame.
library(plyr)
library(rportfolios)
mydata.new2 <- ddply(mydata,
                     .(date),
                     mutate,
                     new.attr = t(rlongonly(m = 2, n = length(date), k = 10, x.u = .15)) * return)
I round the numeric values and show the data:
mydata.new2[,-c(1,2)] <- numcolwise(round_any)(mydata.new2,0.0001)
head(mydata.new2)
date manager return allocation attribution new.attr.1 new.attr.2
1 2005-01-31 manager01 -0.0076 0.146 1e-04 -0.0009 -0.0007
2 2005-01-31 manager02 0.0292 0.048 0e+00 0.0032 0.0040
3 2005-01-31 manager03 -0.0354 0.049 -1e-04 -0.0024 -0.0049
4 2005-01-31 manager04 0.0424 0.124 -1e-04 0.0029 0.0025
5 2005-01-31 manager05 -0.0575 0.134 2e-04 -0.0047 -0.0042
6 2005-01-31 manager06 0.0465 0.098 -3e-04 0.0051 0.0039
Here I have 7 columns and not 6 columns as you said.
I try to melt the data:
library(reshape2)
drop<-names(mydata.new2) %in% c("manager","return","allocation")
melt(mydata.new2[!drop],id="date")
But here you get the error:
numbers of columns of arguments do not match
because of the nested data.frame new.attr inside the mydata.new2 data.frame. This is due to the use of mutate. Here it is better to use transform, because you don't need to do the transformation iteratively.
So :
mydata.new2 <- ddply(mydata,
                     .(date),
                     transform,
                     new.attr = t(rlongonly(m = 2, n = length(date), k = 10, x.u = .15)) * return)
and you get your result
head(melt(mydata.new2[!drop],id="date"))
date variable value
1 2005-01-31 attribution 6.945500e-05
2 2005-01-31 attribution 4.090877e-05
3 2005-01-31 attribution -8.851185e-05
4 2005-01-31 attribution -1.484857e-04
5 2005-01-31 attribution 2.068582e-04
6 2005-01-31 attribution -2.652086e-04
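For the graphing step mentioned in the question, one hypothetical way to plot the melted result (summing the attributions per date and per series, so the actual portfolio can be compared with the random ones) would be:
library(ggplot2)
# one coloured line per attribution series over time
melted <- melt(mydata.new2[!drop], id = "date")
ggplot(melted, aes(x = as.Date(date), y = value, colour = variable)) +
  geom_line(stat = "summary", fun = sum)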
