I have three years of detection data. In each year there are 8 possible outcomes at a site: no, a, n, na, l, la, ln, lna. I've assigned them the values below:
no = 0
a = 1
n = 1
na = 2
l = 100
la = 101
ln = 101
lna = 102
For year 2, I want to enumerate and label every outcome, i.e. every ordered pair of the terms above, so that a detection history is described numerically.
So essentially I'm trying to get a list of 64 terms, ranging from no,no to lna,lna, with their respective summed values.
For example, no,no = 0 and lna,lna = 204.
For year 3, I want the same: every possible three-term history, calculated and labelled. The result needs to be arranged in two columns, one with the history text and one with the history value.
x1 x2
no,no,no 0
I'm sure this is possible, and possibly even basic. Though I have no idea where to begin.
Any help would be greatly appreciated.
Thanks in advance
I'm sure there are more elegant, concise ways to do it, but here's one approach:
Define the two lists of possibilities (the values follow the question, so no = 0)
poss <- c("no", "a", "n", "na", "l", "la", "ln", "lna")
vals <- c(0, 1, 1, 2, 100, 101, 101, 102)
Use expand.grid to enumerate the possibilities
output <- expand.grid(poss, poss, stringsAsFactors = FALSE)
comb_values <- expand.grid(vals, vals)
Write the output
output$names <- paste(output$Var1, output$Var2, sep = ",")
output$value <- comb_values$Var1 + comb_values$Var2
output$Var1 <- output$Var2 <- NULL
Result
names value
1 no,no 0
2 a,no 1
3 n,no 1
4 na,no 2
5 l,no 100
6 la,no 101
7 ln,no 101
8 lna,no 102
9 no,a 1
10 a,a 2
11 n,a 2
12 na,a 3
13 l,a 101
14 la,a 102
15 ln,a 102
16 lna,a 103
17 no,n 1
18 a,n 2
19 n,n 2
20 na,n 3
21 l,n 101
22 la,n 102
23 ln,n 102
24 lna,n 103
25 no,na 2
26 a,na 3
27 n,na 3
28 na,na 4
29 l,na 102
30 la,na 103
31 ln,na 103
32 lna,na 104
33 no,l 100
34 a,l 101
35 n,l 101
36 na,l 102
37 l,l 200
38 la,l 201
39 ln,l 201
40 lna,l 202
41 no,la 101
42 a,la 102
43 n,la 102
44 na,la 103
45 l,la 201
46 la,la 202
47 ln,la 202
48 lna,la 203
49 no,ln 101
50 a,ln 102
51 n,ln 102
52 na,ln 103
53 l,ln 201
54 la,ln 202
55 ln,ln 202
56 lna,ln 203
57 no,lna 102
58 a,lna 103
59 n,lna 103
60 na,lna 104
61 l,lna 202
62 la,lna 203
63 ln,lna 203
64 lna,lna 204
Same logic for three years: replace poss, poss with poss, poss, poss (and vals, vals with vals, vals, vals), then paste and add all three columns.
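For the three-year case, here is a minimal sketch that uses the values listed in the question. The names lookup, grid3 and hist3 are made up for illustration, and the column names x1 and x2 are taken from the layout the question asks for.

```r
poss <- c("no", "a", "n", "na", "l", "la", "ln", "lna")
vals <- c(0, 1, 1, 2, 100, 101, 101, 102)
lookup <- setNames(vals, poss)   # named vector: label -> value

# All 8^3 = 512 ordered triples of labels (hypothetical column names y1..y3)
grid3 <- expand.grid(y1 = poss, y2 = poss, y3 = poss, stringsAsFactors = FALSE)

hist3 <- data.frame(
  x1 = paste(grid3$y1, grid3$y2, grid3$y3, sep = ","),                  # history text
  x2 = unname(lookup[grid3$y1] + lookup[grid3$y2] + lookup[grid3$y3]),  # history value
  stringsAsFactors = FALSE
)

head(hist3, 3)
```

Looking the values up by label keeps text and value in a single grid, so the two cannot drift out of step the way two separate expand.grid calls could.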
The raw data is presented as below,
Year Price Volume P1 P2 P3 V1 V2 V3
2009 46 125 25 50 75 200 400 600
2009 65 800 25 50 75 200 400 600
2010 20 560 30 55 90 250 500 800
2010 15 990 30 55 90 250 500 800
2011 89 350 35 70 120 250 500 800
2012 23 100 35 70 120 250 500 800
... ... ... ... ... ... ... ... ...
I am trying to create a new column named "Portfolio". If Price is smaller than P1 and Volume is smaller than V1, Portfolio equals 11. Otherwise, if Price is smaller than P1 and Volume is smaller than V2, Portfolio equals 12, and so on.
There are 3 breakpoints each for Price and Volume, so 16 portfolios are created, named 11, 12, 13, 14, 21, 22, 23, 24, ..., 44.
The result would be as the table below,
Year Price Volume P1 P2 P3 V1 V2 V3 Portfolio
2009 46 125 25 50 75 200 400 600 21
2009 65 800 25 50 75 200 400 600 34
2010 20 560 30 55 90 250 500 800 13
2010 15 990 30 55 90 250 500 800 14
2011 89 350 35 70 120 250 500 800 32
2012 23 100 35 70 120 250 500 800 11
... ... ... ... ... ... ... ... ... ...
Could you please help me solve this? I tried if(){} and else if(){} statements, but I did not get the result shown in the second table, which is why I am posting the raw data here. Thank you so much.
The code I tried was as follows,
if ((Price<P1)&&(Volume<V1)){data$Portfolio=11}
else if ((Price<P1)&&(Volume<V2)){data$Portfolio=12}
else if((Price<P1)&&(Volume<V3)){data$Portfolio=13}
else if(Price<P1){data$Portfolio=14}
else if((Price<P2)&&(Volume<V1)){Fin_Ret$port=21}
...
else if(Price>P3){data$Portfolio=44}
The output was,
> if ((Price<P1)&&(Volume<V1)){data$Portfolio=11}
> else if ((Price<P1)&&(Volume<V2)){data$Portfolio=12}
Error: unexpected 'else' in "else"
...
When I tried "&" instead of "&&", the result showed,
> if ((mkvalt<MV20)&(BM<BM20)){Fin_Ret$port=11}
Warning message:
In if ((mkvalt < MV20) & (BM < BM20)) { :
the condition has length > 1 and only the first element will be used
I am confused; maybe I don't understand something fundamental in R.
You can use:
df$Portfolio[(df$Price<df$P1)&(df$Volume<df$V1)] <- 11
df$Portfolio[(df$Price<df$P1)&(df$Volume<df$V2) & is.na(df$Portfolio)] <- 12
or using dplyr::mutate
library(dplyr)
df <- df %>%
mutate(Portfolio=ifelse((Price<P1)&(Volume<V1),11,NA)) %>%
mutate(Portfolio=ifelse((Price<P1)&(Volume<V2)& is.na(Portfolio),12,Portfolio))
In the code you have given,
else if(Price<P1){data$Portfolio=14}
else if((Price<P2)&&(Volume<V1)){Fin_Ret$port=21}
...
else if(Price>P3){data$Portfolio=44}
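Replace the else if(Price>P3) in the last line with a plain else, so that the final branch catches every remaining case. Also keep each else on the same line as the closing } of the previous branch; at the top level R otherwise treats the if as finished and reports "unexpected 'else'". Note that if() only tests a single value, so this chain works for one row at a time, not for whole columns.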
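Because if() tests only a single TRUE/FALSE, a vectorized form may be closer to what the question needs. The following is a minimal sketch with nested ifelse(), using three rows copied from the question's table. The helper names price_class and volume_class are made up for illustration.

```r
# Rows 1, 3 and 6 of the question's example table
df <- data.frame(
  Price  = c(46, 20, 23),
  Volume = c(125, 560, 100),
  P1 = c(25, 30, 35),    P2 = c(50, 55, 70),    P3 = c(75, 90, 120),
  V1 = c(200, 250, 250), V2 = c(400, 500, 500), V3 = c(600, 800, 800)
)

# ifelse() evaluates each condition for every row at once
price_class  <- with(df, ifelse(Price  < P1, 1,
                         ifelse(Price  < P2, 2,
                         ifelse(Price  < P3, 3, 4))))
volume_class <- with(df, ifelse(Volume < V1, 1,
                         ifelse(Volume < V2, 2,
                         ifelse(Volume < V3, 3, 4))))

# First digit from the price class, second digit from the volume class
df$Portfolio <- 10 * price_class + volume_class
df$Portfolio
```

For these three rows the result matches the Portfolio values 21, 13 and 11 in the question's expected table.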
Here is a different and concise approach using findInterval and data.table. It is based on the observation that the Portfolio id consists of two digits where the first digit is determined solely by the price category and the second digit solely by the volume category.
library(data.table)
dt[, Portfolio := paste0(findInterval(Price, c(-Inf, P1, P2, P3)),
findInterval(Volume, c(-Inf, V1, V2, V3))),
by = .(P1, P2, P3, V1, V2, V3)]
print(dt)
# Year Price Volume P1 P2 P3 V1 V2 V3 Portfolio
#1: 2009 46 125 25 50 75 200 400 600 21
#2: 2009 65 800 25 50 75 200 400 600 34
#3: 2010 20 560 30 55 90 250 500 800 13
#4: 2010 15 990 30 55 90 250 500 800 14
#5: 2011 89 350 35 70 120 250 500 800 32
#6: 2012 23 100 35 70 120 250 500 800 11
findInterval uses right-open intervals by default, which is in line with the conditions (Price<P1), etc. in the OP's code.
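The interval behaviour can be checked in isolation with a small base R sketch. The breakpoints are P1, P2 and P3 from the 2009 rows; the test prices are made up to hit the boundaries.

```r
breaks <- c(-Inf, 25, 50, 75)      # -Inf, P1, P2, P3 for 2009
prices <- c(24, 25, 46, 75, 80)    # made-up prices, two of them exactly on a breakpoint

# Default: intervals are [lower, upper), so a price equal to P1 falls in class 2
right_open <- findInterval(prices, breaks)
right_open

# left.open = TRUE switches to (lower, upper], i.e. conditions of the form Price <= P1
left_open <- findInterval(prices, breaks, left.open = TRUE)
left_open
```

Only the values that sit exactly on a breakpoint (25 and 75 here) change class between the two variants.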
Data
To make it a reproducible example
dt <- fread("Year Price Volume P1 P2 P3 V1 V2 V3
2009 46 125 25 50 75 200 400 600
2009 65 800 25 50 75 200 400 600
2010 20 560 30 55 90 250 500 800
2010 15 990 30 55 90 250 500 800
2011 89 350 35 70 120 250 500 800
2012 23 100 35 70 120 250 500 800")
I have a data frame in R with 14 columns and 4.4 million rows.
Column 1 holds the query id and column 4 holds the gene name.
I want to build a data frame that shows which genes, and how many of each, correspond to each query id.
I have 44K different query ids, and each query has at most ~100 gene hits.
CSAI_contig04661_6 sp O65396 GCST ARATH 86.03 408 56 1 72 478 1 408 0.0e+00 738.0
CSAI_contig04661_6 sp Q681Y3 Y1099 ARATH 22.55 337 244 10 140 474 103 424 8.0e-09 56.6
CSAI_contig04661_6 sp Q9FLR5 SMC6A ARATH 24.27 103 66 3 04. Jun 249 342 441 4.6e+00 28. Sep
CSAI_contig04661_6 sp Q9LQI7 GCST ARATH 24.28 74 47 2 17. Aug 300 31 100 8.1e+00 27. Jul
CSAI_contig04661_6 sp P56795 RK22 ARATH 28.95 76 49 4 11. Mrz 509 15 87 8.4e+00 27. Mrz
CSAI_isotig00001_4 sp Q8VZE4 PP299 ARATH 29.63 108 55 5 31. Jul 307 10 109 1.6e+00 30. Apr
I am interested in this type of output.
CSAI_contig04661_6
GCST 2
Y1099 1
SMC6A 1
RK22 1
How can I write a loop that scans column 1 for as long as the query id stays the same (the first query in this example), then goes to column 4 and counts how many times each gene occurs (here GCST occurs 2 times for the first query)?
You can accomplish this easily with dplyr:
library(dplyr)
# V1 = query id, V4 = gene name (default column names)
group_by(df, V1, V4) %>%
  summarise(n=n()) %>%
  group_by(V1) %>%
  summarise(hits=paste(paste(V4, n), collapse=" "))
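The same count can be sketched in base R without a loop. The toy data frame below copies only the query id and gene name from the six rows shown in the question, and assumes the default column names V1 and V4; the name counts is made up for illustration.

```r
# Query id (V1) and gene name (V4) from the six example rows
df <- data.frame(
  V1 = c(rep("CSAI_contig04661_6", 5), "CSAI_isotig00001_4"),
  V4 = c("GCST", "Y1099", "SMC6A", "GCST", "RK22", "PP299"),
  stringsAsFactors = FALSE
)

# Add a column of ones and sum it per (query, gene) pair
counts <- aggregate(n ~ V1 + V4, data = cbind(df, n = 1L), FUN = sum)

# Sort so each query's genes sit together, most frequent first
counts <- counts[order(counts$V1, -counts$n), ]
counts
```

This keeps one row per query and gene pair (long format), which is usually easier to filter or join afterwards than a pasted string of hits.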
Let's say I have this data.frame (with 3 variables)
ID Period Score
123 2013 146
123 2014 133
23 2013 150
456 2013 205
456 2014 219
456 2015 140
78 2012 192
78 2013 199
78 2014 133
78 2015 170
Using dplyr I can group the rows by ID and keep only the IDs that appear more than once
data <- data %>% group_by(ID) %>% filter(n() > 1)
Now, what I would like to do is add a column defined as:
Difference = Score of Period P - Score of Period P-1
to get something like this:
ID Period Score Difference
123 2013 146
123 2014 133 -13
456 2013 205
456 2014 219 14
456 2015 140 -79
78 2012 192
78 2013 199 7
78 2014 133 -66
78 2015 170 37
It is rather trivial to do this in a spreadsheet, but I have no idea on how I can achieve this in R.
Thanks for any help or guidance.
Here is another solution using lag. Depending on the use case, it might be more convenient than diff because the NAs clearly show that a particular value had no predecessor, whereas a 0 produced with diff might be the result of either a) a missing predecessor or b) a genuine zero difference between two periods.
data %>% group_by(ID) %>% filter(n() > 1) %>%
mutate(
Difference = Score - lag(Score)
)
# ID Period Score Difference
# 1 123 2013 146 NA
# 2 123 2014 133 -13
# 3 456 2013 205 NA
# 4 456 2014 219 14
# 5 456 2015 140 -79
# 6 78 2012 192 NA
# 7 78 2013 199 7
# 8 78 2014 133 -66
# 9 78 2015 170 37
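For comparison, the same NA-for-no-predecessor behaviour can be sketched in base R with ave(). The data frame is copied from the question; the sketch assumes the rows are already sorted by Period within each ID, and the name keep is made up for illustration.

```r
data <- data.frame(
  ID     = c(123, 123, 23, 456, 456, 456, 78, 78, 78, 78),
  Period = c(2013, 2014, 2013, 2013, 2014, 2015, 2012, 2013, 2014, 2015),
  Score  = c(146, 133, 150, 205, 219, 140, 192, 199, 133, 170)
)

# Keep only the IDs that occur more than once
keep <- ave(data$Score, data$ID, FUN = length) > 1
data <- data[keep, ]

# Within each ID: NA for the first period, then Score minus the previous Score
data$Difference <- ave(data$Score, data$ID, FUN = function(s) c(NA, diff(s)))
data
```

If the rows are not sorted, ordering them first with data[order(data$ID, data$Period), ] would be needed before taking differences.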
Here is a subset of my data:
Fr Sig Code NumDet Date.Time Aerial
62 150102 102 15 195 2012-09-14 18:28:00 1
63 150102 102 15 189 2012-09-14 18:32:00 1
64 150102 106 15 213 2012-09-14 18:36:00 1
65 150102 102 15 152 2012-09-14 18:40:00 1
66 150102 105 15 190 2012-09-14 18:46:00 1
67 150102 97 15 4 2012-09-14 18:51:00 2
I am trying to calculate the time between the first detection on Aerial 1 and the first detection on Aerial 2. In this data set it would be 23 minutes.
I have tried variations of difftime but can't seem to select specific times based on the Aerial number.
I have tried:
a <- difftime(table$Date.Time[2:length(table$Aerial == "1")],
table$Date.Time[2:length(table$Aerial == "2")])
but it's not even close.
This command using difftime
difftime(table$Date.Time[table$Aerial == "2"][1],
table$Date.Time[table$Aerial == "1"][1])
will return
Time difference of 23 mins
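A reproducible sketch of the same idea follows. It assumes Date.Time is stored as POSIXct (if it is still text, convert it with as.POSIXct first). The data frame name detections and the UTC time zone are assumptions made for illustration.

```r
# Date.Time and Aerial from the six rows in the question
detections <- data.frame(
  Date.Time = as.POSIXct(c("2012-09-14 18:28:00", "2012-09-14 18:32:00",
                           "2012-09-14 18:36:00", "2012-09-14 18:40:00",
                           "2012-09-14 18:46:00", "2012-09-14 18:51:00"),
                         tz = "UTC"),
  Aerial = c(1, 1, 1, 1, 1, 2)
)

# min() picks the earliest detection even if the rows are not in time order
first_a1 <- min(detections$Date.Time[detections$Aerial == 1])
first_a2 <- min(detections$Date.Time[detections$Aerial == 2])

gap <- difftime(first_a2, first_a1, units = "mins")
gap
```

Setting units explicitly keeps the result in minutes regardless of how large the gap is; without it, difftime chooses the unit automatically.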