Restructuring Market Data in R - r

I am very new to R and to stackoverflow. My data is being read into R as a csv file. I have figured out how to restructure currency 1 by itself within R. However, I am working with 900+ columns of data and need a way of looping the R script so that what I have done to columns 1 through 7 is applied to the other 900 columns.
Currently my data looks like this:
Currency 1 Blank Currency 2
Date Contract Last Open High Low Volume Column Date Contract Last Open High Low Volume
10/10/2012 Dec 100 101 105 99 20000
10/11/2012 Dec 101 102 106 98 20100
10/12/2012 Jan 102 103 107 97 20120
As you can see, the data is sent to me horizontally, with a blank column between each currency. I need the data stacked on top of each other.
I would like the data to look like this:
Date Contract Last Open High Low Volume Market
10/10/2012 Dec 100 101 105 99 20000 Currency 1
10/11/2012 Dec 101 102 106 98 20100 Currency 1
10/12/2012 Jan 102 103 107 97 20120 Currency 1
10/10/2012 Dec 50 52 49 99 20530 Currency 2
10/11/2012 Dec 53 56 43 98 24300 Currency 2
10/12/2012 Jan 56 52 48 97 22320 Currency 2

If I understand correctly, and if your source data really are nicely formatted you might be able to do something like the following. Here, I'm linking to a csv with three sets of currencies which replicates what I think your source data look like.
First, read the file in using read.csv but skip the first row. Use check.names = FALSE so that duplicate column names are allowed.
temp <- read.csv("http://ideone.com/plain/t3cGcA",
header = TRUE, skip = 1,
check.names = FALSE)
temp
# Date Contract Last Open High Low Volume Date
# 1 10/10/2012 Dec 100 101 105 99 20000 NA 10/10/2012
# 2 10/11/2012 Dec 101 102 106 98 20100 NA 10/11/2012
# 3 10/12/2012 Jan 102 103 107 97 20120 NA 10/12/2012
# Contract Last Open High Low Volume
# 1 Dec 50 52 49 99 20530
# 2 Dec 53 56 43 98 24300
# 3 Jan 56 52 48 97 22320
# structure(c("NA", "NA", "NA"), class = "AsIs") Date Contract
# 1 NA 10/10/2012 Dec
# 2 NA 10/11/2012 Dec
# 3 NA 10/12/2012 Jan
# Last Open High Low Volume
# 1 500 501 605 99 20000
# 2 600 502 606 98 20100
# 3 700 503 607 97 20120
Second (and this is where we assume your dataset is tidy), use seq to create a vector of the positions of your blank columns. If that assumption holds, simple arithmetic gives the start index (vector value minus 7) and end index (vector value minus 1) of each currency.
myblankcols <- seq(1, ncol(temp), by=8) + 7
myblankcols
# [1] 8 16 24
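To make the index arithmetic concrete, here is a minimal sketch assuming the same layout as the sample file: 7 data columns per currency, one blank column between currencies, and 23 columns in total. The names block_width, n_cols, starts and ends are only illustrative.

```r
# Assumed layout: 3 currencies x 7 columns, separated by blank columns
block_width <- 7
n_cols <- 23

# Position of the blank column that follows each block
# (the last one, 24, lies past the end of the data but is still a usable anchor)
blank_cols <- seq(1, n_cols, by = block_width + 1) + block_width

# First and last column of each currency block
starts <- blank_cols - block_width
ends <- blank_cols - 1

data.frame(starts, ends)
```

If a file has a different number of columns per currency, only block_width should need to change.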
Using the simple math mentioned above, create a list of the subsets of each currency, and add names to the list. You can get the names by re-reading just the first line of the file as a csv and dropping all the NA values.
tempL <- lapply(seq_along(myblankcols),
function(x) temp[(myblankcols[x] - 7):(myblankcols[x] - 1)])
NamesTempL <- read.csv("http://ideone.com/plain/t3cGcA",
header = FALSE, nrows = 1)
names(tempL) <- NamesTempL[!is.na(NamesTempL)]
tempL
# $`Currency 1`
# Date Contract Last Open High Low Volume
# 1 10/10/2012 Dec 100 101 105 99 20000
# 2 10/11/2012 Dec 101 102 106 98 20100
# 3 10/12/2012 Jan 102 103 107 97 20120
#
# $`Currency 2`
# Date Contract Last Open High Low Volume
# 1 10/10/2012 Dec 50 52 49 99 20530
# 2 10/11/2012 Dec 53 56 43 98 24300
# 3 10/12/2012 Jan 56 52 48 97 22320
#
# $`Currency 3`
# Date Contract Last Open High Low Volume
# 1 10/10/2012 Dec 500 501 605 99 20000
# 2 10/11/2012 Dec 600 502 606 98 20100
# 3 10/12/2012 Jan 700 503 607 97 20120
I am usually tempted to stop at this point, because I find lists convenient for many purposes. But, it's equally easy to convert it to a single data.frame. This is also one of the reasons to make sure you use check.names = FALSE in the first step: if all the columns have the same name, then there will be no problem rbinding them together.
do.call(rbind, tempL)
# Date Contract Last Open High Low Volume
# Currency 1.1 10/10/2012 Dec 100 101 105 99 20000
# Currency 1.2 10/11/2012 Dec 101 102 106 98 20100
# Currency 1.3 10/12/2012 Jan 102 103 107 97 20120
# Currency 2.1 10/10/2012 Dec 50 52 49 99 20530
# Currency 2.2 10/11/2012 Dec 53 56 43 98 24300
# Currency 2.3 10/12/2012 Jan 56 52 48 97 22320
# Currency 3.1 10/10/2012 Dec 500 501 605 99 20000
# Currency 3.2 10/11/2012 Dec 600 502 606 98 20100
# Currency 3.3 10/12/2012 Jan 700 503 607 97 20120
I'll definitely stop here, but from here, you probably want to convert your "Date" column to actual dates, and perhaps convert the row names ("Currency 1.1", "Currency 1.2", and so on) to a column in your data.frame.
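Those two follow-up steps might look something like the sketch below. It uses a small hand-built list in place of tempL (two currencies, two columns) so that it runs on its own, and the column name Market is taken from the desired output in the question.

```r
# Stand-in for the list built above (shortened, values from the question)
tempL <- list(
  `Currency 1` = data.frame(Date = c("10/10/2012", "10/11/2012"), Last = c(100, 101)),
  `Currency 2` = data.frame(Date = c("10/10/2012", "10/11/2012"), Last = c(50, 53))
)

stacked <- do.call(rbind, tempL)

# Row names look like "Currency 1.1"; drop the trailing ".<row>" to get the market
stacked$Market <- sub("\\.[0-9]+$", "", rownames(stacked))
rownames(stacked) <- NULL

# The dates are month/day/year strings; convert them to Date objects
stacked$Date <- as.Date(stacked$Date, format = "%m/%d/%Y")

stacked
```

The regular expression only strips the final dot-and-digits suffix, so a market name that itself contains a dot should survive unchanged.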

Related

What is the best way to assign detection history using the following values?

I have three years of detection data. In each year there are 8 probabilities at a site. These are no, a, n, na, l, la, ln, lna. I've assigned the values below:
no = 0
a = 1
n = 1
na = 2
l = 100
la = 101
ln = 101
lna = 102
In year 2, I wish to calculate and label all outcomes, so any combination of 2 of the terms above, to describe a detection history numerically.
So essentially I'm trying to get a list of 64 terms ranging from no,no to lna,lna with their respective values.
For example, no,no = 0 and lna,lna = 204
In year 3, I wish for the same. I'd like to calculate and label all possibilities. This needs to be arranged in two columns, one with history text, and one with history value.
x1 x2
no,no,no 0
I'm sure this is possible, and possibly even basic. Though I have no idea where to begin.
Any help would be greatly appreciated.
Thanks in advance
I'm sure there are more elegant, concise ways to do it, but here's one approach:
Define the two lists of possibilities
poss = c("no", "a", "n", "na", "l", "la", "ln", "lna")
vals = c(0, 1, 1, 2, 100, 101, 101, 102)
Use expand.grid to enumerate the possibilities
output <- expand.grid(poss, poss, stringsAsFactors = FALSE)
comb_values <- expand.grid(vals, vals)
Write the output
output$names <- paste(output$Var1, output$Var2, sep = ",")
output$value <- comb_values$Var1 + comb_values$Var2
output$Var1 <- output$Var2 <- NULL
Result
names value
1 no,no 0
2 a,no 1
3 n,no 1
4 na,no 2
5 l,no 100
6 la,no 101
7 ln,no 101
8 lna,no 102
9 no,a 1
10 a,a 2
11 n,a 2
12 na,a 3
13 l,a 101
14 la,a 102
15 ln,a 102
16 lna,a 103
17 no,n 1
18 a,n 2
19 n,n 2
20 na,n 3
21 l,n 101
22 la,n 102
23 ln,n 102
24 lna,n 103
25 no,na 2
26 a,na 3
27 n,na 3
28 na,na 4
29 l,na 102
30 la,na 103
31 ln,na 103
32 lna,na 104
33 no,l 100
34 a,l 101
35 n,l 101
36 na,l 102
37 l,l 200
38 la,l 201
39 ln,l 201
40 lna,l 202
41 no,la 101
42 a,la 102
43 n,la 102
44 na,la 103
45 l,la 201
46 la,la 202
47 ln,la 202
48 lna,la 203
49 no,ln 101
50 a,ln 102
51 n,ln 102
52 na,ln 103
53 l,ln 201
54 la,ln 202
55 ln,ln 202
56 lna,ln 203
57 no,lna 102
58 a,lna 103
59 n,lna 103
60 na,lna 104
61 l,lna 202
62 la,lna 203
63 ln,lna 203
64 lna,lna 204
The same logic works for three years: replace poss, poss with poss, poss, poss (and vals, vals with vals, vals, vals), then paste and add all three Var columns.
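A three-year version might look like the sketch below. It uses the values listed in the question (no = 0 through lna = 102) and the two-column layout x1 / x2 the question asks for. The names names3, vals3 and history are only illustrative.

```r
poss <- c("no", "a", "n", "na", "l", "la", "ln", "lna")
vals <- c(0, 1, 1, 2, 100, 101, 101, 102)

# All 8 * 8 * 8 = 512 three-year combinations, labels and values in the same order
names3 <- expand.grid(poss, poss, poss, stringsAsFactors = FALSE)
vals3 <- expand.grid(vals, vals, vals)

history <- data.frame(
  x1 = do.call(paste, c(names3, sep = ",")),  # history text, e.g. "no,no,no"
  x2 = rowSums(vals3),                        # history value
  stringsAsFactors = FALSE
)

head(history)
```

Because both expand.grid calls enumerate their inputs in the same order, row i of names3 always lines up with row i of vals3.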

Create a new column based on the conditions of other columns

The raw data is presented as below,
Year Price Volume P1 P2 P3 V1 V2 V3
2009 46 125 25 50 75 200 400 600
2009 65 800 25 50 75 200 400 600
2010 20 560 30 55 90 250 500 800
2010 15 990 30 55 90 250 500 800
2011 89 350 35 70 120 250 500 800
2012 23 100 35 70 120 250 500 800
... ... ... ... ... ... ... ... ...
I am trying to create a new column named "Portfolio". If Price and Volume are smaller than P1 and V1, respectively, Portfolio is equal to 11. Otherwise, if Price is smaller than P1 and Volume is smaller than V2, Portfolio is equal to 12, and so on.
There are 3 breakpoints each for Price and Volume, so 16 portfolios are created, named 11, 12, 13, 14, 21, 22, 23, 24,...,44.
The result would be as the table below,
Year Price Volume P1 P2 P3 V1 V2 V3 Portfolio
2009 46 125 25 50 75 200 400 600 21
2009 65 800 25 50 75 200 400 600 34
2010 20 560 30 55 90 250 500 800 13
2010 15 990 30 55 90 250 500 800 14
2011 89 350 35 70 120 250 500 800 32
2012 23 100 35 70 120 250 500 800 11
... ... ... ... ... ... ... ... ... ...
Could you please help me to solve this issue. I tried if(){} and else if(){} functions. However, I did not get the result as the second table. That is why I post raw data here. Thank you so much.
The code I tried was as the following,
if ((Price<P1)&&(Volume<V1)){data$Portfolio=11}
else if ((Price<P1)&&(Volume<V2)){data$Portfolio=12}
else if((Price<P1)&&(Volume<V3)){data$Portfolio=13}
else if(Price<P1){data$Portfolio=14}
else if((Price<P2)&&(Volume<V1)){Fin_Ret$port=21}
...
else if(Price>P3){data$Portfolio=44}
The output was,
> if ((Price<P1)&&(Volume<V1)){data$Portfolio=11}
> else if ((Price<P1)&&(Volume<V2)){data$Portfolio=12}
Error: unexpected 'else' in "else"
...
When I tried "&" instead of "&&", the result showed,
> if ((mkvalt<MV20)&(BM<BM20)){Fin_Ret$port=11}
Warning message:
In if ((mkvalt < MV20) & (BM < BM20)) { :
the condition has length > 1 and only the first element will be used
I am confused maybe I don't understand fundamental things in R.
You can use:
df$Portfolio[(df$Price<df$P1)&(df$Volume<df$V1)] <- 11
df$Portfolio[(df$Price<df$P1)&(df$Volume<df$V2) & is.na(df$Portfolio)] <- 12
or using dplyr::mutate
library(dplyr)
df <- df %>%
mutate(Portfolio=ifelse((Price<P1)&(Volume<V1),11,NA)) %>%
mutate(Portfolio=ifelse((Price<P1)&(Volume<V2)& is.na(Portfolio),12,Portfolio))
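Extending that pattern to all 16 portfolios would take 16 such assignments. As a more compact alternative, here is a base R sketch (not taken from the answers on this page) that counts how many breakpoints each value has reached, using the sample rows from the question. The names price_cat and volume_cat are illustrative.

```r
df <- data.frame(
  Year   = c(2009, 2009, 2010, 2010, 2011, 2012),
  Price  = c(46, 65, 20, 15, 89, 23),
  Volume = c(125, 800, 560, 990, 350, 100),
  P1 = c(25, 25, 30, 30, 35, 35),
  P2 = c(50, 50, 55, 55, 70, 70),
  P3 = c(75, 75, 90, 90, 120, 120),
  V1 = c(200, 200, 250, 250, 250, 250),
  V2 = c(400, 400, 500, 500, 500, 500),
  V3 = c(600, 600, 800, 800, 800, 800)
)

# Each comparison is vectorised over the rows; TRUE counts as 1,
# so the sum is the number of breakpoints at or below the value
price_cat  <- 1 + (df$Price  >= df$P1) + (df$Price  >= df$P2) + (df$Price  >= df$P3)
volume_cat <- 1 + (df$Volume >= df$V1) + (df$Volume >= df$V2) + (df$Volume >= df$V3)

# First digit from the price category, second digit from the volume category
df$Portfolio <- 10 * price_cat + volume_cat

df[, c("Year", "Price", "Volume", "Portfolio")]
```

Unlike if, which expects a single TRUE or FALSE, these comparisons operate on whole columns at once, which is why no loop is needed.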
In the code you have given,
else if(Price<P1){data$Portfolio=14}
else if((Price<P2)&&(Volume<V1)){Fin_Ret$port=21}
...
else if(Price>P3){data$Portfolio=44}
Remove if after else in the last line. You should be able to get the expected result.
Here is a different and concise approach using findInterval and data.table. It is based on the observation that the Portfolio id consists of two digits where the first digit is determined solely by the price category and the second digit solely by the volume category.
library(data.table)
dt[, Portfolio := paste0(findInterval(Price, c(-Inf, P1, P2, P3)),
findInterval(Volume, c(-Inf, V1, V2, V3))),
by = .(P1, P2, P3, V1, V2, V3)]
print(dt)
# Year Price Volume P1 P2 P3 V1 V2 V3 Portfolio
#1: 2009 46 125 25 50 75 200 400 600 21
#2: 2009 65 800 25 50 75 200 400 600 34
#3: 2010 20 560 30 55 90 250 500 800 13
#4: 2010 15 990 30 55 90 250 500 800 14
#5: 2011 89 350 35 70 120 250 500 800 32
#6: 2012 23 100 35 70 120 250 500 800 11
findInterval uses right open intervals by default which is in line with the conditions (Price<P1), etc in the code of the OP.
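A quick way to see the interval behaviour is to feed findInterval a few values on and around the breakpoints. The sketch below uses the 2009 price breakpoints from the sample data; the test prices are made up for illustration.

```r
breaks <- c(-Inf, 25, 50, 75)

# Intervals are [-Inf, 25), [25, 50), [50, 75), [75, Inf)
prices <- c(24.9, 25, 46, 75, 100)
cats <- findInterval(prices, breaks)

data.frame(prices, cats)
```

A price exactly equal to a breakpoint falls into the higher category, which matches a strict test such as Price < P1 being FALSE at Price == P1.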
Data
To make it a reproducible example
dt <- fread("Year Price Volume P1 P2 P3 V1 V2 V3
2009 46 125 25 50 75 200 400 600
2009 65 800 25 50 75 200 400 600
2010 20 560 30 55 90 250 500 800
2010 15 990 30 55 90 250 500 800
2011 89 350 35 70 120 250 500 800
2012 23 100 35 70 120 250 500 800")

How can I make the loop to count the gene against query id

I have a data frame in R with 14 columns and 4.4 million rows.
Column 1 has the query id and column 4 has the gene name.
I want to build a data frame that shows which genes correspond to each query id, and how many times each one occurs.
I have 44K different query ids, and each query has at most ~100 gene hits.
CSAI_contig04661_6 sp O65396 GCST ARATH 86.03 408 56 1 72 478 1 408 0.0e+00 738.0
CSAI_contig04661_6 sp Q681Y3 Y1099 ARATH 22.55 337 244 10 140 474 103 424 8.0e-09 56.6
CSAI_contig04661_6 sp Q9FLR5 SMC6A ARATH 24.27 103 66 3 04. Jun 249 342 441 4.6e+00 28. Sep
CSAI_contig04661_6 sp Q9LQI7 GCST ARATH 24.28 74 47 2 17. Aug 300 31 100 8.1e+00 27. Jul
CSAI_contig04661_6 sp P56795 RK22 ARATH 28.95 76 49 4 11. Mrz 509 15 87 8.4e+00 27. Mrz
CSAI_isotig00001_4 sp Q8VZE4 PP299 ARATH 29.63 108 55 5 31. Jul 307 10 109 1.6e+00 30. Apr
I am interested in this type of output.
CSAI_contig04661_6
GCST 2
Y1099 1
SMC6A 1
RK22 1
How can I write a loop that scans column 1 for as long as the query id stays the same (6 in this example), then goes to column 4, finds which genes are present, and counts how often each one occurs (in this example, GCST is present 2 times for the first query)?
You can accomplish this easily with dplyr:
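library(dplyr)
# count each (query id, gene) pair, then collapse the counts into one row per query id
group_by(df, V1, V4) %>%
summarise(n=n()) %>%
group_by(V1) %>%
summarise(hits=paste(paste(V4, n), collapse=" "))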
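If you would rather keep one row per query/gene pair, a base R sketch with table() is shown below. It assumes, as in the answer above, that the query id is in V1 and the gene name in V4, and it uses only the two relevant columns of the sample rows.

```r
df <- data.frame(
  V1 = c(rep("CSAI_contig04661_6", 5), "CSAI_isotig00001_4"),
  V4 = c("GCST", "Y1099", "SMC6A", "GCST", "RK22", "PP299"),
  stringsAsFactors = FALSE
)

# Cross-tabulate query id against gene, then reshape to long format
counts <- as.data.frame(table(query = df$V1, gene = df$V4),
                        stringsAsFactors = FALSE)

# Drop the query/gene pairs that never occur, and sort within each query
counts <- counts[counts$Freq > 0, ]
counts <- counts[order(counts$query, -counts$Freq), ]

counts
```

With 44K query ids the full cross-tabulation can become very large before the zero rows are dropped, so for the real data a grouped count such as the dplyr approach is likely the safer choice.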

diff operation within a group, after a dplyr::group_by()

Let's say I have this data.frame (with 3 variables)
ID Period Score
123 2013 146
123 2014 133
23 2013 150
456 2013 205
456 2014 219
456 2015 140
78 2012 192
78 2013 199
78 2014 133
78 2015 170
Using dplyr I can group them by ID and keep only the IDs that appear more than once
data <- data %>% group_by(ID) %>% filter(n() > 1)
Now, what I like to achieve is to add a column that is:
Difference = Score of Period P - Score of Period P-1
to get something like this:
ID Period Score Difference
123 2013 146
123 2014 133 -13
456 2013 205
456 2014 219 14
456 2015 140 -79
78 2012 192
78 2013 199 7
78 2014 133 -66
78 2015 170 37
It is rather trivial to do this in a spreadsheet, but I have no idea on how I can achieve this in R.
Thanks for any help or guidance.
Here is another solution using lag. Depending on the use case, it might be more convenient than diff because the NAs clearly show that a particular value had no predecessor, whereas a 0 produced with diff might be the result of either a) a missing predecessor or b) no change between two periods.
data %>% group_by(ID) %>% filter(n() > 1) %>%
mutate(
Difference = Score - lag(Score)
)
# ID Period Score Difference
# 1 123 2013 146 NA
# 2 123 2014 133 -13
# 3 456 2013 205 NA
# 4 456 2014 219 14
# 5 456 2015 140 -79
# 6 78 2012 192 NA
# 7 78 2013 199 7
# 8 78 2014 133 -66
# 9 78 2015 170 37
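For readers without dplyr, roughly the same result can be sketched in base R with ave(). This assumes the rows are already sorted by Period within each ID, as they are in the sample data.

```r
data <- data.frame(
  ID     = c(123, 123, 23, 456, 456, 456, 78, 78, 78, 78),
  Period = c(2013, 2014, 2013, 2013, 2014, 2015, 2012, 2013, 2014, 2015),
  Score  = c(146, 133, 150, 205, 219, 140, 192, 199, 133, 170)
)

# Keep only the IDs that appear more than once
keep <- ave(data$Score, data$ID, FUN = length) > 1
data <- data[keep, ]

# Within each ID, the first period has no predecessor, so it gets NA
data$Difference <- ave(data$Score, data$ID,
                       FUN = function(s) c(NA, diff(s)))

data
```

If the rows are not sorted, ordering them first with data[order(data$ID, data$Period), ] should give the same differences.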

Time Calculation Between Specific Events

Here is a subset of my data:
Fr Sig Code NumDet Date.Time Aerial
62 150102 102 15 195 2012-09-14 18:28:00 1
63 150102 102 15 189 2012-09-14 18:32:00 1
64 150102 106 15 213 2012-09-14 18:36:00 1
65 150102 102 15 152 2012-09-14 18:40:00 1
66 150102 105 15 190 2012-09-14 18:46:00 1
67 150102 97 15 4 2012-09-14 18:51:00 2
I am trying to calculate the time between the first detection on Aerial 1 and the first detection on Aerial 2. In this data set it would be 23 mins.
I have tried variations of difftime but can't seem to select specific times based on the Aerial number.
I have tried:
a <- difftime(table$Date.Time[2:length(table$Aerial == "1")],
table$Date.Time[2:length(table$Aerial == "2")])
but it's not even close.
This command using difftime
difftime(table$Date.Time[table$Aerial == "2"][1],
table$Date.Time[table$Aerial == "1"][1])
will return
Time difference of 23 mins
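The [1] in that command picks the first matching row, so it relies on the rows being in time order. A self-contained sketch that takes the earliest time explicitly and fixes the units is shown below. The data frame name detections is made up for the example, and the timestamps are assumed to be in a single time zone (UTC here).

```r
detections <- data.frame(
  Date.Time = as.POSIXct(c("2012-09-14 18:28:00", "2012-09-14 18:32:00",
                           "2012-09-14 18:36:00", "2012-09-14 18:40:00",
                           "2012-09-14 18:46:00", "2012-09-14 18:51:00"),
                         tz = "UTC"),
  Aerial = c(1, 1, 1, 1, 1, 2)
)

# Earliest detection on each aerial, regardless of row order
first_1 <- min(detections$Date.Time[detections$Aerial == 1])
first_2 <- min(detections$Date.Time[detections$Aerial == 2])

# Ask for minutes explicitly so the units do not depend on the size of the gap
gap <- difftime(first_2, first_1, units = "mins")
gap
```

If Date.Time was read from a file as text, it needs converting with as.POSIXct before difftime gives a meaningful answer.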
