Suppose I have many data frames with varying numbers of rows but a common Date column, e.g.:
DF1:
Date Index Change
05-04-17 29911.55 0
03-04-17 29910.22 0.0098
31-03-17 29620.5 -0.0009
30-03-17 29647.42 0.0039
29-03-17 29531.43 0.0041
28-03-17 29409.52 0.0059
27-03-17 29237.15 -0.0063
24-03-17 29421.4 0.003
And
DF2:
Date NG NG_Change
05-04-17 213.8 0.0047
04-04-17 212.8 0.0421
03-04-17 204.2 -0.0078
31-03-17 205.8 -0.0068
30-03-17 207.2 -0.0166
29-03-17 210.7 0.0483
28-03-17 201 0.005
27-03-17 200 -0.0015
24-03-17 200.3 0.0137
And another one:
DF3:
Date TI_Price TI_Change
05-04-17 51.39 0.0071
04-04-17 51.03 0.0157
03-04-17 50.24 -0.0071
31-03-17 50.6 0.005
30-03-17 50.35 0.017
29-03-17 49.51 0.0236
28-03-17 48.37 0.0134
I want to combine them using the Date column as the common variable, so that the final result contains only those rows whose dates are common to all of them. Such as:
Date TI_Price TI_Change NG NG_Change TI_Price TI_Change
05-04-17 51.39 0.0071 213.8 0.0047 51.39 0.0071
04-04-17 51.03 0.0157 212.8 0.0421 51.03 0.0157
03-04-17 50.24 -0.0071 204.2 -0.0078 50.24 -0.0071
31-03-17 50.6 0.005 205.8 -0.0068 50.6 0.005
30-03-17 50.35 0.017 207.2 -0.0166 50.35 0.017
29-03-17 49.51 0.0236 210.7 0.0483 49.51 0.0236
28-03-17 48.37 0.0134 201 0.005 48.37 0.0134
I am wondering whether there is a way to merge them all in one go, rather than using merge(), which takes DF1 and DF2, merges them, and then requires the result to be merged with DF3.
What I tried and tweaked (without success):
myfulldata = merge(DF1, DF2, all.x=T)
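One common way to do this in base R is to fold merge() over a list of data frames with Reduce(). The following is a minimal sketch, not a definitive answer: it assumes every frame has a column named Date, uses only a few rows from the frames above, and omits the Change columns for brevity.

```r
# Small subsets of the frames shown above (Change columns omitted for brevity).
DF1 <- data.frame(Date = c("05-04-17", "03-04-17", "31-03-17", "24-03-17"),
                  Index = c(29911.55, 29910.22, 29620.5, 29421.4))
DF2 <- data.frame(Date = c("05-04-17", "04-04-17", "03-04-17", "31-03-17", "24-03-17"),
                  NG = c(213.8, 212.8, 204.2, 205.8, 200.3))
DF3 <- data.frame(Date = c("05-04-17", "04-04-17", "03-04-17", "31-03-17"),
                  TI_Price = c(51.39, 51.03, 50.24, 50.6))

# Reduce() applies merge() pairwise along the list. With by = "Date" and the
# default all = FALSE, only dates present in every frame are kept (inner join).
merged <- Reduce(function(x, y) merge(x, y, by = "Date"), list(DF1, DF2, DF3))
merged
```

Because Date is a character column here, merge() sorts the result alphabetically rather than chronologically; converting with as.Date(Date, format = "%d-%m-%y") before merging would likely be preferable.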
This is my data. Daily return data for different sectors.
I would like to compute the 3 month rolling correlation between sectors but keep the date field and have it line up.
> head(data)
Date Communication Services Consumer Discretionary Consumer Staples Energy Financials - AREITs Financials - ex AREITs Health Care
1 2003-01-02 -0.0004 0.0016 0.0033 0.0007 0.0073 0.0006 0.0370
2 2003-01-03 -0.0126 -0.0008 0.0057 -0.0019 0.0016 0.0062 0.0166
3 2003-01-06 0.0076 0.0058 -0.0051 0.0044 0.0063 0.0037 -0.0082
4 2003-01-07 -0.0152 0.0052 -0.0024 -0.0042 -0.0037 -0.0014 0.0027
5 2003-01-08 0.0107 0.0017 0.0047 -0.0057 0.0013 -0.0008 -0.0003
6 2003-01-09 -0.0157 0.0019 -0.0020 0.0009 -0.0016 -0.0012 0.0055
My data type is this
$ Date : Date[1:5241], format: "2003-01-02" "2003-01-03" "2003-01-06" "2003-01-07" ...
$ Communication Services : num [1:5241] -0.0004 -0.0126 0.0076 -0.0152 0.0107 -0.0157 0.0057 -0.0131 0.0044 0.0103 ...
$ Consumer Discretionary : num [1:5241] 0.0016 -0.0008 0.0058 0.0052 0.0017 0.0019 -0.0022 0.0057 -0.0028 0.0039 ...
$ Consumer Staples : num [1:5241] 0.0033 0.0057 -0.0051 -0.0024 0.0047 -0.002 0.0043 -0.0005 0.0163 0.004 ...
$ Energy : num [1:5241] 0.0007 -0.0019 0.0044 -0.0042 -0.0057 0.0009 0.0058 0.0167 -0.0026 -0.0043 ...
$ Financials - AREITs : num [1:5241] 0.0073 0.0016 0.0063 -0.0037 0.0013 -0.0016 0 0.0025 -0.0051 0.0026 ...
Currently what I am doing is this:
rollingcor <- rollapply(data, width=60, function(x) cor(x[,2],x[,3]),by=60, by.column=FALSE)
This works fine: it computes the rolling 60-day correlation and shifts the window by 60 days. However, it doesn't keep the date column, and I find it hard to match the dates.
The end goal is to produce a data frame in which the date advances every 3 months and the other columns are the correlations between all the sectors in my data.
Please read the information at the top of the r tag and, in particular, provide the input in an easily reproducible manner using dput. In the absence of that we will use the data shown below, based on the 6x2 BOD data frame that comes with R, and a width of 4. The names on the correlation columns are the row:column numbers in the correlation matrix. For example, compare the 4th row of the output below with cor(data[1:4, -1]).
fill=NA causes it to output the same number of rows as the input by filling with NAs.
library(zoo)
# test data
data <- cbind(Date = as.Date("2023-02-01") + 0:5, BOD, X = 1:6)
# given data frame x return lower triangular part of cor matrix
# Last 2 lines add row:column names.
Cor <- function(x) {
k <- cor(x)
lo <- lower.tri(k)
k.lo <- k[lo]
m <- which(lo, arr.ind = TRUE) # rows & cols of lower tri
setNames(k.lo, paste(m[, 1], m[, 2], sep = ":"))
}
cbind(data, rollapplyr(data[-1], 4, Cor, by.column = FALSE, fill = NA))
giving:
Date Time demand X 2:1 3:1 3:2
1 2023-02-01 1 8.3 1 NA NA NA
2 2023-02-02 2 10.3 2 NA NA NA
3 2023-02-03 3 19.0 3 NA NA NA
4 2023-02-04 4 16.0 4 0.8280576 1.0000000 0.8280576
5 2023-02-05 5 15.6 5 0.4604354 1.0000000 0.4604354
6 2023-02-06 7 19.8 6 0.2959666 0.9827076 0.1223522
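To get closer to the stated end goal (one dated row per non-overlapping window), one option is to keep the approach above and then select only the rows that close each block. This is a sketch under assumptions: width = 3 is chosen purely so the 6-row test data yields two blocks (the question uses 60), and block_ends is a hypothetical helper name.

```r
library(zoo)

# Same test data as above: BOD ships with R; X is an extra illustrative column.
data <- cbind(Date = as.Date("2023-02-01") + 0:5, BOD, X = 1:6)

# Lower triangle of the correlation matrix, named row:column.
Cor <- function(x) {
  k <- cor(x)
  lo <- lower.tri(k)
  m <- which(lo, arr.ind = TRUE)
  setNames(k[lo], paste(m[, 1], m[, 2], sep = ":"))
}

width <- 3  # illustrative only; the question uses 60
rolled <- cbind(data, rollapplyr(data[-1], width, Cor, by.column = FALSE, fill = NA))

# Keep only the row that closes each non-overlapping block of `width` rows,
# so each correlation is labelled with the last date of its window.
block_ends <- seq(width, nrow(rolled), by = width)
rolled[block_ends, c("Date", "2:1", "3:1", "3:2")]
```

Selecting rows afterwards keeps the Date column aligned with the window end without relying on how the by argument positions its windows.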
I have the below data set:
Profit      MRO 15x5    D30
$150.00     -9.189      -0.24
$12.50      -6.076      -0.248
-$125.00    -7.699      -0.282
-$162.50    -8.008      -0.281
-$175.00    -0.183      -0.056
-$175.00    -0.235      -0.061
$275.00     0.141       -0.027
-$175.00    -4.062      -0.103
-$162.50    -5.654      -0.258
-$162.50    -1.578      -0.051
-$175.00    -3.336      -0.205
-$162.50    -1.523      -0.022
$412.50     -1.524      -0.194
$337.50     -1.049      -0.055
$100.00     -1.043      -0.059
I want to first sort the data by column D30 in ascending order and then look at the Profit column. If the values in the top n rows and the bottom n rows (a contiguous range of cells) are less than -50 in the Profit column, those entire rows should be deleted from the data set.
The result would be like this:
Profit      MRO 15x5    D30
$275.00     0.141       -0.027
-$162.50    -1.578      -0.051
$337.50     -1.049      -0.055
-$175.00    -0.183      -0.056
$100.00     -1.043      -0.059
-$175.00    -0.235      -0.061
-$175.00    -4.062      -0.103
$412.50     -1.524      -0.194
-$175.00    -3.336      -0.205
$150.00     -9.189      -0.24
$12.50      -6.076      -0.248
This output results from deleting the top row and the bottom 3 rows of the sorted data set, because those rows (a contiguous range of values) had Profit values less than -50.
Can anyone please help me do this in R using dplyr or some other filtering package?
I would be thankful for your kind support.
Regards,
Farhan
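Use cumany. Combined with filter, it drops leading rows until a criterion is first met (here Profit > -50), so only the initial run of rows with Profit <= -50 is removed. Doing this once in ascending and once in descending order of D30 trims both ends.
The mutate() step is a way to parse your Profit column into a numeric column.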
library(dplyr)
library(readr)   # parse_number()
library(stringr) # str_replace()
data %>% mutate(Profit = parse_number(str_replace(Profit,"^-\\$(.*)$", "$-\\1"))) %>%
arrange(D30) %>%
filter(cumany(Profit > -50)) %>%
arrange(desc(D30)) %>%
filter(cumany(Profit > -50))
Profit MRO_15x5 D30
1 275.0 0.141 -0.027
2 -162.5 -1.578 -0.051
3 337.5 -1.049 -0.055
4 -175.0 -0.183 -0.056
5 100.0 -1.043 -0.059
6 -175.0 -0.235 -0.061
7 -175.0 -4.062 -0.103
8 412.5 -1.524 -0.194
9 -175.0 -3.336 -0.205
10 150.0 -9.189 -0.240
11 12.5 -6.076 -0.248
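To see why this trims only the ends, it may help to look at cumany() on its own. The sketch below uses a short made-up profit vector (not the data above) purely for illustration.

```r
library(dplyr)

# Illustrative values only, not taken from the data set above.
profit <- c(-125, -162.5, 12.5, -175, 150)

# cumany() is a cumulative "any": FALSE until the condition is first TRUE,
# then TRUE for every later element.
keep <- cumany(profit > -50)
keep

# Inside filter(), this drops only the leading run of rows that fail the
# condition; later failures (the -175 here) are kept.
trimmed <- filter(data.frame(Profit = profit), cumany(Profit > -50))
trimmed
```

Here keep is FALSE for the first two elements and TRUE from the third onwards, so trimmed retains the last three rows, including the -175 that sits after the first qualifying value.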
I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
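One possible approach for the wide layout is to locate each header line, split it and the following line on whitespace, and pick values by the position of the field name. This is only a sketch under assumptions: the sheet has been read with readLines() (here replaced by a literal vector), each header line starts with FILE, the value line follows it directly, and no field on the value line is blank. The helper name pick is hypothetical.

```r
# Stand-in for readLines("datasheet.txt"); first lines of the sheet shown above.
lines <- c(
  "FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP",
  "20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4",
  "ANGLES 7.000 23.00 38.00 53.00 68.00",
  "CNTCT# 1.969 1.517 0.981 1.579 1.386"
)

# Indices of header lines, one per measurement block.
hdr <- grep("^FILE\\s", lines)

# Split header and value lines into whitespace-separated fields.
header <- strsplit(trimws(lines[hdr]), "\\s+")
values <- strsplit(trimws(lines[hdr + 1]), "\\s+")

# Look up a field by its position in the header line of each block.
pick <- function(field) {
  vapply(seq_along(hdr), function(i) {
    as.numeric(values[[i]][match(field, header[[i]])])
  }, numeric(1))
}

data_extract <- data.frame(QUAD = pick("QUAD"), LAI = pick("LAI"))
data_extract
```

If some sheets leave a field such as LOC empty, positional matching would shift by one column, so fixed-width parsing (for example with read.fwf) might be the safer choice in that case.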
I have the following dataset:
Col1 Col2 Col3 Col4 Col5 Col6
4439.5 6.5211 50.0182 29.4709 -0.0207 0.0888
4453 25.1186 46.5586 34.1279 -0.0529 0.082
4453.5 24.2974 46.6291 30.6281 -0.057 0.0809
4457.5 25.3257 49.6885 26.2664 -0.0357 0.0837
4465 7.1077 53.516 32.5077 -0.0398 0.1099
4465.5 7.5892 53.0884 33.1582 -0.0395 0.1128
4898.5 8.8296 55.0611 40.3813 -0.0123 0.1389
4899 9.2469 54.4799 37.1927 -0.0061 0.1354
4900 13.4119 50.8334 28.9441 -0.0272 0.1071
4900.5 21.8415 50.1127 24.2351 -0.0375 0.0882
4905 11.3824 52.4024 37.2646 -0.0324 0.1215
4918.5 6.2601 49.9454 27.715 0.0101 0.1444
4919 7.4157 49.7412 25.6159 -0.0164 0.1038
4932 25.737 46.2825 38.6334 -0.0425 0.0717
5008.5 13.641 49.7868 18.0337 -0.0213 0.111
5010.5 13.5935 49.5352 23.9319 -0.0518 0.0979
5012 16.6945 48.0672 25.2408 -0.0446 0.0985
5014.5 14.1303 49.6361 23.1816 -0.0455 0.1056
5040 7.6895 49.8688 31.562 -0.0138 0.126
5044 12.594 60.822 52.4569 0.0481 0.1877
5045.5 10.3719 56.443 43.3782 0.0076 0.1403
5046 8.1382 54.5388 46.2675 0.01 0.1443
5051.5 29.0142 46.8052 43.3224 -0.0465 0.0917
5052 32.3053 46.4278 32.9387 -0.0509 0.0868
5052.5 38.4807 45.3555 24.4187 -0.0619 0.0774
5053 38.8954 43.8459 21.8487 -0.0688 0.0681
5055 19.69 50.9335 46.9419 -0.0527 0.0897
5055.5 11.7398 51.8329 59.5443 -0.0307 0.1083
5056 13.3196 51.8329 55.4419 -0.0276 0.1262
5056.5 18.3702 51.7003 39.232 -0.0408 0.1105
5057.5 14.0531 50.1129 24.4546 -0.0444 0.0921
5058 15.292 49.8805 23.0938 -0.0347 0.0925
5059 20.5135 49.52 21.6173 -0.0333 0.1006
5060 14.5151 47.5836 27.0685 -0.0156 0.1062
5060.5 14.5188 48.2506 27.9704 -0.0363 0.1018
5228 1.2168 54.2009 17.4351 0.0583 0.1794
5229 3.5896 51.7649 26.1107 -0.0033 0.1362
5232.5 2.7404 53.5941 38.6852 0.0646 0.194
5233 3.6694 53.9483 36.674 0.0633 0.204
5234 1.3789 53.8741 18.5804 0.0693 0.1958
5234.5 0.8592 53.6052 18.1654 0.0742 0.1982
5237 2.6951 52.3763 24.8098 0.0549 0.1923
I am trying to create an R visual that will break out each Column into facets, using Col1 as the identity column.
To do this I am using this (faulty) code:
library(reshape2)
library(plotly)
plot.data <- dataset
melted <- melt(dataset, id.vars="Col1")
sp <- ggplot(melted, aes(x=Col1, y=value)) + geom_line()
# Divide by variable in the vertical direction
sp + facet_grid(variable~.)
ggplotly()
However, I am receiving an error saying:
Faceting variables must have at least one value
I know this is an unlikely solution, but did you make sure all your filters are correct and not somehow filtering out values? I find that filters are often a source of mistakes for me, so if the code works in R, that could be the problem.
I had the same error and it was caused by my filtering.
Example:
I wrote data <- data[data$symbol == geneId,] instead of data <- data[data$symbol %in% geneId,]
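The difference between the two filters can be seen in a small sketch. The data frame and geneId values below are made up for illustration; only the column name symbol comes from the example above.

```r
# Hypothetical data; geneId holds more than one value.
data <- data.frame(symbol = c("A", "B", "C", "A"), value = 1:4)
geneId <- c("A", "C")

# == recycles geneId along the column (A==A, B==C, C==A, A==C),
# so it compares element by element instead of testing membership.
wrong <- data[data$symbol == geneId, ]

# %in% tests each symbol for membership in geneId.
right <- data[data$symbol %in% geneId, ]

nrow(wrong)
nrow(right)
```

With these values the == filter keeps a single row while %in% keeps three; with less fortunate ordering == can keep no rows at all, and an empty data frame is one plausible way to end up with a faceting variable that has no values.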
I'm trying to plot boxplots of multiple variables (columns in a table), grouping the subjects by the levels in cl.med.
This is what I tried:
boxplot(scfa[,c("acetate","propionate")]~as.factor(cl.med),outline=FALSE)
this is my table:
aceticacid.methylester.1 acetate butyrate fumarate caprate propionate X3phenylpropionate valerate formate
01.BA.V 4.509 0.1430 0.0168 4e-04 0.0080 0.0174 0.0008 0.0030 5e-04
01.BA.VG 2.750 0.2736 0.0228 4e-04 0.0047 0.0261 0.0012 0.0014 4e-04
01.BO.VG 15.281 0.1667 0.0159 6e-04 0.0049 0.0191 0.0008 0.0011 4e-04
01.PR.O 0.317 0.2470 0.0327 4e-04 0.0078 0.0293 0.0006 0.0016 4e-04
01.TO.VG 0.210 0.1406 0.0186 4e-04 0.0034 0.0161 0.0006 0.0026 6e-04
and this is my class vector
01.BA.VG 01.BO.VG 01.PR.O 01.TO.VG 02.BA.VG
1 2 3 1 3 2
This produces 3 boxes (for the 3 classes, as expected), but the two variables are merged. How could I modify it to obtain 3 boxes for each variable?
Thanks
Your code really shouldn't work at all, rather than producing 3 boxes. Usually you have to add the plots one at a time using the add=TRUE argument:
boxplot(scfa[,c("acetate")]~as.factor(cl.med), outline=FALSE, ylim=c(0, 0.3))
boxplot(scfa[,c("propionate")]~as.factor(cl.med), outline=FALSE, add=TRUE)
You either need to 'melt' or 'stack' your data so that the 2 response columns are stacked. Or split the data yourself. Here is an example of the second option:
tmp <- split(iris[,c('Sepal.Width','Petal.Width')], iris$Species)
tmp2 <- unlist(tmp, recursive=FALSE)
boxplot(tmp2)
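For the first option, a minimal sketch using base R's stack() on the same iris columns might look like this. It assumes the grouping factor can simply be repeated once per stacked column, which holds here because stack() appends the columns one after another.

```r
# Stack the two response columns into long form: `values` and `ind`.
long <- stack(iris[, c("Sepal.Width", "Petal.Width")])

# stack() appends the columns one after another, so repeat the grouping
# factor once per stacked column to keep it aligned with the values.
long$Species <- rep(iris$Species, times = 2)

# One box per Species-by-variable combination (3 x 2 = 6 boxes).
bp <- boxplot(values ~ Species + ind, data = long, outline = FALSE)
bp$names
```

Applied to the question's data, the same pattern would stack the acetate and propionate columns of scfa and repeat cl.med twice.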