I am working with a large dataset in long format (after using melt) that I would now like to convert to wide format before running a calculation on each row of the df.
This long-form dataset ("dflong") has 7 columns: name, returnmonth, startmonth, descriptive1, descriptive2, descriptive3, and return.
I use the long format because I have multiple returnmonths for each name/startmonth pair. Specifically, I have the next 12 months of returns for each name and startmonth, and the startmonths span 10 years. This makes the df large: it has 1.2 million rows and those 7 columns.
What I would like next is to standardize the next 12 months for each name/startmonth and have a wide format where the right-most 12 columns are the returns in month startmonth+1, startmonth+2, etc., regardless of what the actual calendar month is. I cannot just use the call below, because the returnmonths themselves differ, which I imagine would create one column per distinct returnmonth and therefore a very wide df, which I do not want. (At this point, it just gives me an error when I try to run it.)
dfwide=reshape(dflong,idvar=c("name","startmonth","descriptive1","descriptive2","descriptive3"),timevar="returnmonth",direction="wide")
Is there a combination of reshape and some other tool that would allow me to convert from long to wide while ignoring the particular calendar month each return comes from? This would give a long but manageable data frame of n rows by 17 columns to work with.
Appreciate your help very much.
This old question hasn't got an answer so far.
The OP has asked to reshape from long to wide format so that the return values are arranged in 12 columns representing the 12 months after each startmonth.
This can be accomplished using the dcast() and rowid() functions from the data.table package.
library(data.table)
dcast(setDT(DT), name + startmonth + d1 + d2 + d3 ~ sprintf("M_%02i", rowid(name, startmonth)),
value.var = "return")
name startmonth d1 d2 d3 M_01 M_02 M_03 M_04 M_05 M_06 M_07 M_08 M_09 M_10 M_11 M_12
1: 0001 2000-01-01 A B C -56.0 -23.0 155.9 7.1 12.9 171.5 46.1 -126.5 -68.7 -44.6 122.4 36.0
2: 0001 2000-02-01 A B C 40.1 11.1 -55.6 178.7 49.8 -196.7 70.1 -47.3 -106.8 -21.8 -102.6 -72.9
3: 0001 2000-03-01 A B C -62.5 -168.7 83.8 15.3 -113.8 125.4 42.6 -29.5 89.5 87.8 82.2 68.9
4: 0001 2000-04-01 A B C 55.4 -6.2 -30.6 -38.0 -69.5 -20.8 -126.5 216.9 120.8 -112.3 -40.3 -46.7
5: 0001 2000-05-01 A B C 78.0 -8.3 25.3 -2.9 -4.3 136.9 -22.6 151.6 -154.9 58.5 12.4 21.6
---
95996: 0800 2009-08-01 A B C 136.5 79.6 -140.4 -216.5 35.3 31.4 -57.0 78.1 -53.3 -18.0 60.9 100.7
95997: 0800 2009-09-01 A B C 33.8 -12.5 -17.2 90.8 -24.7 -19.3 101.4 -39.1 139.4 15.5 -33.5 17.7
95998: 0800 2009-10-01 A B C 56.8 -121.2 242.4 -1.0 -9.9 78.2 34.6 31.8 -51.7 -11.1 45.0 89.1
95999: 0800 2009-11-01 A B C 158.8 -101.9 -271.8 -21.1 -24.6 108.0 185.3 82.3 -54.9 116.4 149.2 -30.4
96000: 0800 2009-12-01 A B C 34.9 -34.0 186.4 -3.7 -26.8 97.0 8.7 84.5 -35.5 -106.2 -165.0 188.0
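Note that rowid() numbers the returns in the order in which they appear, so the call above relies on each name/startmonth group already being sorted by returnmonth. As a minimal base R sketch (the tiny data frame and the offset column are made up for illustration and are not part of the OP's data), the month offset can instead be computed from the two dates and used as the timevar in reshape():

```r
# Hypothetical miniature of dflong; rows are deliberately out of order.
dflong <- data.frame(
  name        = "0001",
  startmonth  = as.Date("2000-11-01"),
  returnmonth = as.Date(c("2001-01-01", "2000-12-01", "2001-02-01")),
  return      = c(2, 1, 3)
)

# Whole months between startmonth and returnmonth (1, 2, 3, ...).
s <- as.POSIXlt(dflong$startmonth)
r <- as.POSIXlt(dflong$returnmonth)
dflong$offset <- 12L * (r$year - s$year) + (r$mon - s$mon)

# Drop the calendar month so it is not widened as well, then sort by offset
# so the new columns come out in order.
dflong$returnmonth <- NULL
dflong <- dflong[order(dflong$name, dflong$startmonth, dflong$offset), ]

dfwide <- reshape(dflong, idvar = c("name", "startmonth"),
                  timevar = "offset", direction = "wide")
names(dfwide)
```

Because the column names come from the offset rather than from the row position, a missing month shows up as an NA in the right column instead of shifting the later returns to the left.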
Data
The OP hasn't provided a reproducible example, so we make up our own dummy data:
library(data.table)
library(lubridate)
# 10 years of startmonths times 12 returnmonths times 800 names
nn <- 800L
ns <- 10L * 12L
nr <- 12L
# cross join to create all combinations
DT <- CJ(name = sprintf("%04i", seq_len(nn)),
startmonth = seq(as.Date("2000-01-01"), length.out = ns, by = "month"),
returnmonth = seq_len(nr)
)
# compute returnmonth as sequence of 12 months after each startmonth
DT[, returnmonth := startmonth + months(returnmonth)][]
# append other data columns
DT[, paste0("d", 1:3) := .("A", "B", "C")][]
set.seed(123L)
DT[, return := round(100 * rnorm(nrow(DT)), 1L)][]
str(DT)
Classes ‘data.table’ and 'data.frame': 1152000 obs. of 7 variables:
$ name : chr "0001" "0001" "0001" "0001" ...
$ startmonth : Date, format: "2000-01-01" "2000-01-01" "2000-01-01" "2000-01-01" ...
$ returnmonth: Date, format: "2000-02-01" "2000-03-01" "2000-04-01" "2000-05-01" ...
$ d1 : chr "A" "A" "A" "A" ...
$ d2 : chr "B" "B" "B" "B" ...
$ d3 : chr "C" "C" "C" "C" ...
$ return : num -56 -23 155.9 7.1 12.9 ...
- attr(*, ".internal.selfref")=<externalptr>
Related
I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (oC)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.
The output should look as:
Air_T (oC) TBins
8.16 8.5
10.88 10.5
5.28 NA
NA
19.82 20.5
23.62 24.5
13.14 14.5
NA
NA
28.84 28.5
32.21 32.5
17.44 18.5
31.21 32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = FALSE), digits = 0) # is start at minimum value
Tmax <- round(max(TAir, na.rm = FALSE), digits = 0)
int <- 2 # bin ranges 2 degrees
mean_int <- int/2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an additional sequential column, shifts the column names, and collapses all values, eliminating the blank cells. Something like this:
Air_T (oC) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where I am going wrong and how to solve it?
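v <- ceiling(max(dat$V1, na.rm = TRUE))
breaks <- seq(8, v, 2)
# one label per interval, i.e. one fewer than the number of breaks
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
# right = FALSE gives [8,10), [10,12), ... to match the 8-9.99, 10-11.99 ranges
transform(dat, Tbins = cut(V1, breaks, labels, right = FALSE))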
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given: we have
paste(seq(8,v,2),seq(9.99,v,by=2),sep="-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 lies between 18 and 20 and is therefore given the value 18.5, just as 10.88 lies in 10-11.99 and is assigned the value 10.5.
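A small self-contained sketch of the same binning, under a few assumptions of mine: the temperatures are a made-up subset of the question's values, bins start at 8, and labels are the lower edge plus 0.5 as in the output above. It extends the last break to cover the maximum, so a value such as 32.21 gets a bin rather than NA, and it shows that cut() keeps NA inputs as NA:

```r
# Hypothetical subset of the AirTDay values, including a missing reading.
temps <- c(8.16, 10.88, 5.28, NA, 19.82, 32.21)

width <- 2   # bin width in degrees
lo    <- 8   # lower edge of the first bin (assumed, as in the answer above)

# Round the upper edge up to the next multiple of `width` above the maximum.
hi     <- lo + width * ceiling((max(temps, na.rm = TRUE) - lo) / width)
breaks <- seq(lo, hi, by = width)

# One label per interval: lower edge + 0.5.
labels <- head(breaks, -1) + 0.5

bins <- cut(temps, breaks = breaks, labels = labels, right = FALSE)

# cut() returns a factor; convert via character to get the numeric labels.
tbins <- as.numeric(as.character(bins))
tbins
```

Values below the first break (5.28 here) and missing values both come back as NA, while the number of rows is unchanged.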
I have two multivariate time series x and y, both covering approximately the same range in time (one starts two years before the other, but they end on the same date). Both series have missing observations in the form of empty columns next to the date column, and also in the sense that one of the series has several dates that are not found in the other, and vice versa.
I would like to create a data frame (or similar) with a column that lists all the dates found in x OR y, without duplicate dates. For each date (row), I would like to horizontally stack the observations from x next to the observations from y, with NA's filling the missing cells. Example:
>x
"1987-01-01" 7.1 NA 3
"1987-01-02" 5.2 5 2
"1987-01-06" 2.3 NA 9
>y
"1987-01-01" 55.3 66 45
"1987-01-03" 77.3 87 34
# result I would like
"1987-01-01" 7.1 NA 3 55.3 66 45
"1987-01-02" 5.2 5 2 NA NA NA
"1987-01-03" NA NA NA 77.3 87 34
"1987-01-06" 2.3 NA 9 NA NA NA
What I have tried: with the zoo package, I've tried the merge.zoo method, but this seems to just stack the two series next to each other, with the dates (as numbers, e.g. "1987-01-02" shown as 6210) from each series appearing in two separate columns.
I've sat for hours getting almost nowhere, so all help is appreciated.
EDIT: some code included below as per suggestion from Soumendra
atcoa <- read.csv(file = "ATCOA_full_adj.csv", header = TRUE)
atcob <- read.csv(file = "ATCOB_full_adj.csv", header = TRUE)
atcoa$date <- as.Date(atcoa$date)
atcob$date <- as.Date(atcob$date)
# only number of observations and the observations themselves differ
>str(atcoa)
'data.frame': 6151 obs. of 8 variables:
$ date :Class 'Date' num [1:6151] 6210 6213 6215 6216 6217 ...
$ max : num 4.31 4.33 4.38 4.18 4.13 4.05 4.08 4.05 4.08 4.1 ...
$ min : num 4.28 4.31 4.28 4.13 4.05 3.95 3.97 3.95 4 4.02 ...
$ close : num 4.31 4.33 4.31 4.15 4.1 3.97 4 3.97 4.08 4.02 ...
$ avg : num NA NA NA NA NA NA NA NA NA NA ...
$ tot.vol : int 877733 89724 889437 1927113 3050611 846525 1782774 1497998 2504466 5636999 ...
$ turnover : num 3762300 388900 3835900 8015900 12468100 ...
$ transactions: int 12 9 24 17 31 26 34 35 37 33 ...
>atcoa[1:1, ]
date a.max a.min a.close a.avg a.tot.vol a.turnover a.transactions
1 1987-01-02 4.31 4.28 4.31 NA 877733 3762300 12
# using timeSeries package
ts.atcoa <- timeSeries::as.timeSeries(atcoa, format = "%Y-%m-%d")
ts.atcob <- timeSeries::as.timeSeries(atcob, format = "%Y-%m-%d")
>str(ts.atcoa)
Time Series:
Name: object
Data Matrix:
Dimension: 6151 7
Column Names: a.max a.min a.close a.avg a.tot.vol a.turnover a.transactions
Row Names: 1970-01-01 01:43:30 ... 1970-01-01 04:12:35
Positions:
Start: 1970-01-01 01:43:30
End: 1970-01-01 04:12:35
With:
Format: %Y-%m-%d %H:%M:%S
FinCenter: GMT
Units: a.max a.min a.close a.avg a.tot.vol a.turnover a.transactions
Title: Time Series Object
Documentation: Wed Aug 17 13:00:50 2011
>ts.atcoa[1:1, ]
GMT
a.max a.min a.close a.avg a.tot.vol a.turnover a.transactions
1970-01-01 01:43:30 4.31 4.28 4.31 NA 877733 3762300 12
# The following will create an object of class "data frame" and mode "list", which contains observations for the days mutual for the two series
>ts.atco <- timeSeries::merge(atcoa, atcob) # produces same result as base::merge, apparently
>ts.atco[1:1, ]
date a.max a.min a.close a.avg a.tot.vol a.turnover a.transactions b.max b.min b.close b.avg b.tot.vol b.turnover b.transactions
1 1989-08-25 7.92 7.77 7.79 NA 269172 2119400 19 7.69 7.56 7.64 NA 81176693 593858000 12
EDIT: problem solved by (using zoo package)
atcoa <- read.zoo(read.csv(file = "ATCOA_full_adj.csv", header = TRUE))
atcob <- read.zoo(read.csv(file = "ATCOB_full_adj.csv", header = TRUE))
names(atcoa) <- c("a.max", "a.min", "a.close",
"a.avg", "a.tot.vol", "a.turnover", "a.transactions")
names(atcob) <- c("b.max", "b.min", "b.close",
"b.avg", "b.tot.vol", "b.turnover", "b.transactions")
atco <- merge.zoo(atcoa, atcob)
Thank you all for your help.
Try this:
Lines.x <- '"1987-01-01" 7.1 NA 3
"1987-01-02" 5.2 5 2
"1987-01-06" 2.3 NA 9'
Lines.y <- '"1987-01-01" 55.3 66 45
"1987-01-03" 77.3 87 34'
library(zoo)
# in reality x might be in a file and might be read via: x <- read.zoo("x.dat")
# ditto for y. See ?read.zoo and the zoo-read vignette if you need other args too
x <- read.zoo(text = Lines.x)
y <- read.zoo(text = Lines.y)
merge(x, y)
giving:
V2.x V3.x V4.x V2.y V3.y V4.y
1987-01-01 7.1 NA 3 55.3 66 45
1987-01-02 5.2 5 2 NA NA NA
1987-01-03 NA NA NA 77.3 87 34
1987-01-06 2.3 NA 9 NA NA NA
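If the two series are kept as plain data frames instead of zoo objects, the same union of dates can be obtained in base R with merge() and all = TRUE, which performs a full outer join (the default keeps only the dates common to both). A minimal sketch using the example values from the question; the column names x1..x3 and y1..y3 are made up for illustration:

```r
x <- data.frame(
  date = as.Date(c("1987-01-01", "1987-01-02", "1987-01-06")),
  x1 = c(7.1, 5.2, 2.3), x2 = c(NA, 5, NA), x3 = c(3, 2, 9)
)
y <- data.frame(
  date = as.Date(c("1987-01-01", "1987-01-03")),
  y1 = c(55.3, 77.3), y2 = c(66, 87), y3 = c(45, 34)
)

# all = TRUE keeps every date found in x OR y and fills the gaps with NA.
both <- merge(x, y, by = "date", all = TRUE)
both
```

The result has one row per distinct date, sorted by date, with the columns of x followed by the columns of y.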
You can create a timeSeries (timeSeries library) object from your dates, merge them (timeSeries default merge behaviour is different from zoo and xts and does exactly what you are asking for) and then make zoo/xts objects out of the result in case you don't want to stay with timeSeries.
One quick way to test is the following, assuming you have two zoo objects zz1 and zz2 -
library(timeSeries)
as.zoo(merge(as.timeSeries(zz1), as.timeSeries(zz2)))
Compare the output of the above command with
merge(zz1, zz2)
You can also cbind -
cbind(zz1, zz2)
provided the two objects share no column names. Even if they do, you can select the columns you want to cbind, and you will still get a zoo object.
cbind(zz1[, 1:2], zz2[, 2:3]) #Assuming other columns are common
Here is a more generic approach that I found on stat.ethz.ch:
a <- ts(1:10, start=c(2014,6), frequency=12)
b <- ts(1:12, start=c(2015,1), frequency=12)
library(zoo)
m <- merge(a = as.zoo(a), b = as.zoo(b))
to get a ts object back:
as.ts(m)
How about this:
## Generate unique sorted time values.
i <- sort(unique(c(index(x), index(y))))
## Empty data matrix.
v <- matrix(nrow=length(i), ncol=6, NA)
## Pull in data items.
v[match(index(x), i), 1:3] <- coredata(x)
v[match(index(y), i), 4:6] <- coredata(y)
## Build new zoo object.
d <- zoo(v, order.by=i)
I'm trying to reshape my data from long format into wide format based on multiple groupings, without success. With this data:
id <- 1:20
month <- rep(4:7, 50)
name <- rep(c("sam", "mike", "tim", "jill", "max"), 40)
cost <- sample(1:100, 200, replace=TRUE)
df <- data.frame(id, month, name, cost)
df.mo.mean <- aggregate(df$cost ~ df$name + df$month, FUN="mean")
df.mo.sd <- aggregate(df$cost ~ df$name + df$month, FUN="sd")
df.mo <- data.frame(df.mo.mean, df.mo.sd)
df.mo <- df.mo[,-c(4,5)]
df.mo[3:4] <- round(df.mo[3:4],2)
head(df)
id month name cost
1 1 4 sam 29
2 2 5 mike 93
3 3 6 tim 27
4 4 7 jill 67
5 5 4 max 28
6 6 5 sam 69
I'm trying to get my data to look like something below, and try to generalize it for an unknown number of names (but <15 max)
month name1.cost.mean name1.cost.sd name2.cost.mean name2.cost.sd
1 45 4 40 6
2 ...
I've tried reshape and do.call with rbind without success. The only other way I can think of doing it is with a loop, which suggests I'm doing something wrong. I don't have any experience with plyr and would prefer to solve this problem with base packages (for learning purposes), but if that's not possible, any other suggestions would be very helpful.
set.seed(1)
library(plyr)
kk<-ddply(df,.(month,name),summarize,mean=mean(cost),sd=sd(cost))
reshape(kk,timevar="name",idvar="month",direction="wide")
month mean.jill sd.jill mean.max sd.max mean.mike sd.mike mean.sam sd.sam mean.tim sd.tim
1 4 55.3 34.62834 63.3 23.35261 57.6 22.91627 63.4 28.89906 43.3 25.42112
6 5 49.3 25.00689 51.1 27.85059 48.4 23.16223 43.0 24.33562 47.6 32.13928
11 6 60.4 23.61826 52.1 29.74503 38.6 34.39703 53.0 23.28567 52.4 20.88700
16 7 50.0 30.76073 62.7 23.98634 51.7 32.10763 52.8 32.27589 49.5 23.00845
> means <- with( df, tapply(cost, list(month, name), FUN=mean) )
> sds <- with( df, tapply(cost, list(month, name), FUN=sd) )
> colnames(means) <- paste0(colnames(means), ".mean")
> colnames(sds) <- paste0(colnames(sds), ".sd")
> comb.df <- as.data.frame( cbind(means, sds) )
> comb.df <- comb.df[order(names(comb.df))]
> comb.df
jill.mean jill.sd max.mean max.sd mike.mean mike.sd
4 62.1 22.29823 39.7 25.53016 39.6 30.11164
5 40.7 30.72838 44.4 29.12502 54.2 23.91095
6 47.3 31.54556 46.9 32.30910 65.3 30.05569
7 55.5 33.16038 45.9 28.13637 59.7 31.79815
sam.mean sam.sd tim.mean tim.sd
4 40.9 23.54877 58.5 21.69613
5 51.5 30.76163 34.2 32.16900
6 69.1 18.26016 55.2 32.99764
7 46.9 29.90150 55.8 27.17352
I'm not sure exactly what you are asking for, but maybe something like this could be useful:
> set.seed(1)
> df <- data.frame(id=1:20, month=rep(4:7, 50),
+ name=rep(c("sam", "mike", "tim", "jill", "max"), 40),
+ cost= sample(1:100, 200, replace=TRUE))
>
> DF.mean <- aggregate(cost ~ name + month, FUN=mean, data=df) ## mean
> DF.sd <- aggregate(cost ~ name + month, FUN=sd, data=df) ## sd
>
> x1 <- as.data.frame.matrix(xtabs(cost~month+name, data=DF.mean)) # reshaping mean
> colnames(x1) <- paste0(colnames(x1), ".mean")
> x2 <- as.data.frame.matrix(xtabs(cost~month+name, data=DF.sd)) # reshaping sd
> colnames(x2) <- paste0(colnames(x2), ".sd")
>
> cbind(x1, x2)
jill.mean max.mean mike.mean sam.mean tim.mean jill.sd max.sd mike.sd sam.sd tim.sd
4 55.3 63.3 57.6 63.4 43.3 34.62834 23.35261 22.91627 28.89906 25.42112
5 49.3 51.1 48.4 43.0 47.6 25.00689 27.85059 23.16223 24.33562 32.13928
6 60.4 52.1 38.6 53.0 52.4 23.61826 29.74503 34.39703 23.28567 20.88700
7 50.0 62.7 51.7 52.8 49.5 30.76073 23.98634 32.10763 32.27589 23.00845
Also, note that @Metrics' approach can be done using base R functions without any extra packages:
> kk <- aggregate(cost ~ name + month, FUN=function(x) c(mean=mean(x), sd=sd(x)), data=df)
> reshape(kk,timevar="name",idvar="month",direction="wide")
month cost.jill.mean cost.jill.sd cost.max.mean cost.max.sd cost.mike.mean cost.mike.sd cost.sam.mean cost.sam.sd cost.tim.mean cost.tim.sd
1 4 55.30000 34.62834 63.30000 23.35261 57.60000 22.91627 63.40000 28.89906 43.30000 25.42112
6 5 49.30000 25.00689 51.10000 27.85059 48.40000 23.16223 43.00000 24.33562 47.60000 32.13928
11 6 60.40000 23.61826 52.10000 29.74503 38.60000 34.39703 53.00000 23.28567 52.40000 20.88700
16 7 50.00000 30.76073 62.70000 23.98634 51.70000 32.10763 52.80000 32.27589 49.50000 23.00845
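One thing worth knowing about this aggregate() call: when FUN returns a vector, the result stores it as a single matrix column (here cost, with sub-columns mean and sd), so the data frame has three columns rather than four. A minimal sketch, reusing the question's data layout, of how that column can be flattened with do.call(data.frame, ...) if ordinary columns are preferred:

```r
set.seed(1)
df <- data.frame(month = rep(4:7, 50),
                 name  = rep(c("sam", "mike", "tim", "jill", "max"), 40),
                 cost  = sample(1:100, 200, replace = TRUE))

kk <- aggregate(cost ~ name + month, data = df,
                FUN = function(x) c(mean = mean(x), sd = sd(x)))

ncol(kk)            # name, month, and one matrix column called cost
is.matrix(kk$cost)  # the mean and sd live inside that matrix

# Flatten the matrix column into separate cost.mean and cost.sd columns.
flat <- do.call(data.frame, kk)
names(flat)
```

reshape() copes with the matrix column directly, as shown above, but other functions may not, in which case the flattened version is easier to work with.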
You can reshape twice with dcast(), once for the mean and once for the sd, and then merge the results:
library(reshape2)
> dcast(df, month ~ name, mean, value.var="cost")
month jill max mike sam tim
1 4 39.5 54.6 45.6 48.4 57.4
2 5 45.1 61.7 45.4 54.5 50.8
3 6 41.9 45.7 56.4 43.1 52.1
4 7 51.6 38.6 43.6 65.1 51.5
> dcast(df, month ~ name, sd, value.var="cost")
month jill max mike sam tim
1 4 29.31154 25.25954 28.96051 31.32695 29.82989
2 5 31.02848 27.96049 34.32589 30.08599 23.95273
3 6 32.09517 32.50316 37.16988 27.03681 30.42094
4 7 19.56300 31.50026 28.65969 36.53750 26.73429
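The merge step itself can be sketched in base R as follows. The two small data frames stand in for the dcast() results (values copied and rounded from the first two rows and two names of the output above); the suffixes argument of merge() renames the clashing name columns:

```r
# Stand-ins for the two dcast() results (subset of the output shown above).
means <- data.frame(month = 4:5, jill = c(39.5, 45.1), max = c(54.6, 61.7))
sds   <- data.frame(month = 4:5, jill = c(29.31, 31.03), max = c(25.26, 27.96))

# Columns present in both inputs (other than the key) get the suffixes appended.
wide <- merge(means, sds, by = "month", suffixes = c(".mean", ".sd"))
names(wide)
```

This works for any number of names, since merge() applies the suffixes to every non-key column that the two inputs share.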
I have a data that looks like this.
Name|ID|p72|p78|p51|p49|c36.1|c32.1|c32.2|c36.2|c37
hsa-let-7a-5p|MIMAT0000062|9.1|38|12.7|185|8|4.53333333333333|17.9|23|63.3
hsa-let-7b-5p|MIMAT0000063|11.3|58.6|27.5|165.6|20.4|8.5|21|30.2|92.6
hsa-let-7c|MIMAT0000064|7.8|40.2|9.6|147.8|11.8|4.53333333333333|15.4|17.7|62.3
hsa-let-7d-5p|MIMAT0000065|4.53333333333333|27.7|13.4|158.1|8.5|4.53333333333333|14.2|13.5|50.5
hsa-let-7e-5p|MIMAT0000066|6.2|4.53333333333333|4.53333333333333|28|4.53333333333333|4.53333333333333|5.6|4.7|12.8
hsa-let-7f-5p|MIMAT0000067|4.53333333333333|4.53333333333333|4.53333333333333|78.2|4.53333333333333|4.53333333333333|6.8|4.53333333333333|8.9
hsa-miR-15a-5p|MIMAT0000068|4.53333333333333|70.3|10.3|147.6|4.53333333333333|4.53333333333333|21.1|30.2|100.8
hsa-miR-16-5p|MIMAT0000069|9.5|562.6|60.5|757|25.1|4.53333333333333|89.4|142.9|613.9
hsa-miR-17-5p|MIMAT0000070|10.5|71.6|27.4|335.1|6.3|10.1|51|51|187.1
hsa-miR-17-3p|MIMAT0000071|4.53333333333333|4.53333333333333|4.53333333333333|17.2|4.53333333333333|4.53333333333333|9.5|4.53333333333333|7.3
hsa-miR-18a-5p|MIMAT0000072|4.53333333333333|14.6|4.53333333333333|53.4|4.53333333333333|4.53333333333333|9.5|25.5|29.7
hsa-miR-19a-3p|MIMAT0000073|4.53333333333333|11.6|4.53333333333333|42.8|4.53333333333333|4.53333333333333|4.53333333333333|5.5|17.9
hsa-miR-19b-3p|MIMAT0000074|8.3|93.3|15.8|248.3|4.53333333333333|6.3|44.7|53.2|135
hsa-miR-20a-5p|MIMAT0000075|4.53333333333333|75.2|23.4|255.7|6.6|4.53333333333333|43.8|38|130.3
hsa-miR-21-5p|MIMAT0000076|6.2|19.7|18|299.5|6.8|4.53333333333333|49.9|68.5|48
hsa-miR-22-3p|MIMAT0000077|40.4|128.4|65.4|547.1|56.5|33.4|104.9|84.1|248.3
hsa-miR-23a-3p|MIMAT0000078|58.3|99.3|58.6|617.9|36.6|21.4|107.1|125.5|120.9
hsa-miR-24-1-5p|MIMAT0000079|4.53333333333333|4.53333333333333|4.53333333333333|9.2|4.53333333333333|4.53333333333333|4.53333333333333|4.9|4.53333333333333
hsa-miR-24-3p|MIMAT0000080|638.2|286.9|379.5|394.4|307.8|240.4|186|234.2|564
What I want to do is simply pick the rows where all the values are greater than 10.
But why does this code of mine report only the last one?
The data clearly show that more rows satisfy this condition.
> dat<-read.delim("http://dpaste.com/1215552/plain/",sep="|",na.strings="",header=TRUE,blank.lines.skip=TRUE,fill=FALSE)
> dat[apply(dat[, -1], MARGIN = 1, function(x) all(x > 10)), ]
Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186 234.2 564
What is the right way to do it?
Update:
alexwhan's solution works. But I wonder how I can generalize his approach
so that it can handle data with missing values (NA):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
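Since you're including your ID column (which is a factor) in the data passed to apply(), each row is coerced to character, so x > 10 becomes a string comparison and all() gives the wrong answer. Drop both non-numeric columns instead. Try: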
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10)), ]
# Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
EDIT
For the case where you have NA, you can just use the na.rm argument of all(), which ignores the missing values in the test. Using your new data (from the comment):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10, na.rm = T)), ]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 7 hsa-miR-15a-5p MIMAT0000068 NA 70.3 10.3 147.6 NA NA 21.1 30.2 100.8
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
# 20 hsa-miR-25-3p MIMAT0000081 19.3 78.6 25.6 84.3 14.9 16.9 19.1 27.2 113.8
# 21 hsa-miR-26a-5p MIMAT0000082 NA 22.8 31.0 561.2 12.4 NA 67.0 55.8 48.9
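With na.rm = TRUE, a row passes as long as all of its non-missing values exceed 10, which is why rows containing NA appear in the output above. As a minimal sketch on a made-up three-row data frame (column names p1 and p2 are hypothetical), the same filter can be written without apply() by counting failing cells per row with rowSums(), and a stricter variant can treat NA as a failure:

```r
# Hypothetical miniature of the data: two label columns, two numeric columns.
dat <- data.frame(Name = c("a", "b", "c"),
                  ID   = c("i1", "i2", "i3"),
                  p1   = c(9.1, 40.4, NA),
                  p2   = c(38, 128.4, 70.3),
                  stringsAsFactors = FALSE)

num <- dat[, -(1:2)]  # numeric columns only

# Lenient: NA cells are ignored, like all(x > 10, na.rm = TRUE).
keep_lenient <- rowSums(num <= 10, na.rm = TRUE) == 0

# Strict: a row with any NA is dropped as well.
keep_strict <- rowSums(is.na(num) | num <= 10) == 0

dat[keep_lenient, ]
dat[keep_strict, ]
```

Which variant is appropriate depends on whether a missing measurement should count as "not known to be below the threshold" or as a reason to exclude the row.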
Another idea is to transform your data to long (molten) format. I think this is an even better way to avoid the missing-values problem:
library(reshape2)
dat.m <- melt(dat,id.vars=c('Name','ID'))
dat.m$value <- as.numeric(dat.m$value)
library(plyr)
res <- ddply(dat.m,.(Name,ID), summarise, keepme = all(value > 10))
res[res$keepme,]
# Name ID keepme
# 16 hsa-miR-22-3p MIMAT0000077 TRUE
# 17 hsa-miR-23a-3p MIMAT0000078 TRUE
# 19 hsa-miR-24-3p MIMAT0000080 TRUE