Create a new column based on the conditions of other columns - r

The raw data is presented as below,
Year Price Volume P1 P2 P3 V1 V2 V3
2009 46 125 25 50 75 200 400 600
2009 65 800 25 50 75 200 400 600
2010 20 560 30 55 90 250 500 800
2010 15 990 30 55 90 250 500 800
2011 89 350 35 70 120 250 500 800
2012 23 100 35 70 120 250 500 800
... ... ... ... ... ... ... ... ...
I am trying to create a new column named "Portfolio". If Price and Volume are smaller than P1 and V1, respectively, Portfolio is equal to 11. Otherwise, if Price is smaller than P1 and Volume is smaller than V2, Portfolio is equal to 12, and so on.
There are 3 breakpoints for Price and also Volume. Therefore, 16 Portfolios are created, which are named 11, 12, 13, 14, 21, 22, 23, 24,...,44.
The result would be as the table below,
Year Price Volume P1 P2 P3 V1 V2 V3 Portfolio
2009 46 125 25 50 75 200 400 600 21
2009 65 800 25 50 75 200 400 600 34
2010 20 560 30 55 90 250 500 800 13
2010 15 990 30 55 90 250 500 800 14
2011 89 350 35 70 120 250 500 800 32
2012 23 100 35 70 120 250 500 800 11
... ... ... ... ... ... ... ... ... ...
Could you please help me to solve this issue. I tried if(){} and else if(){} functions. However, I did not get the result as the second table. That is why I post raw data here. Thank you so much.
The code I tried was as the following,
if ((Price<P1)&&(Volume<V1)){data$Portfolio=11}
else if ((Price<P1)&&(Volume<V2)){data$Portfolio=12}
else if((Price<P1)&&(Volume<V3)){data$Portfolio=13}
else if(Price<P1){data$Portfolio=14}
else if((Price<P2)&&(Volume<V1)){Fin_Ret$port=21}
...
else if(Price>P3){data$Portfolio=44}
The output was,
> if ((Price<P1)&&(Volume<V1)){data$Portfolio=11}
> else if ((Price<P1)&&(Volume<V2)){data$Portfolio=12}
Error: unexpected 'else' in "else"
...
When I tried "&" instead of "&&", the result showed,
> if ((mkvalt<MV20)&(BM<BM20)){Fin_Ret$port=11}
Warning message:
In if ((mkvalt < MV20) & (BM < BM20)) { :
the condition has length > 1 and only the first element will be used
I am confused; maybe I don't understand something fundamental in R.

You can use:
df$Portfolio[(df$Price<df$P1)&(df$Volume<df$V1)] <- 11
df$Portfolio[(df$Price<df$P1)&(df$Volume<df$V2) & is.na(df$Portfolio)] <- 12
or using dplyr::mutate
library(dplyr)
df <- df %>%
mutate(Portfolio=ifelse((Price<P1)&(Volume<V1),11,NA)) %>%
mutate(Portfolio=ifelse((Price<P1)&(Volume<V2)& is.na(Portfolio),12,Portfolio))
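The two assignments above only cover portfolios 11 and 12. As a rough sketch of how the same vectorized idea might cover all 16 at once (this is not the answerer's code, and the name df plus the helper vectors price_cat and volume_cat are assumptions): a comparison such as Price >= P1 returns one TRUE/FALSE per row, unlike if(), which only tests a single value. Summing three such comparisons counts how many breakpoints each row has reached.

```r
# Sample rows from the question; df is an assumed name
df <- data.frame(
  Year   = c(2009, 2009, 2010, 2010, 2011, 2012),
  Price  = c(46, 65, 20, 15, 89, 23),
  Volume = c(125, 800, 560, 990, 350, 100),
  P1 = c(25, 25, 30, 30, 35, 35),
  P2 = c(50, 50, 55, 55, 70, 70),
  P3 = c(75, 75, 90, 90, 120, 120),
  V1 = c(200, 200, 250, 250, 250, 250),
  V2 = c(400, 400, 500, 500, 500, 500),
  V3 = c(600, 600, 800, 800, 800, 800)
)

# TRUE counts as 1, so each sum is the number of breakpoints reached (0 to 3)
price_cat  <- 1 + with(df, (Price  >= P1) + (Price  >= P2) + (Price  >= P3))
volume_cat <- 1 + with(df, (Volume >= V1) + (Volume >= V2) + (Volume >= V3))

# First digit comes from the price category, second from the volume category
df$Portfolio <- 10 * price_cat + volume_cat
df$Portfolio
```

This avoids writing out 16 separate conditions, at the cost of hard-coding the assumption that there are exactly three breakpoints per variable.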

In the code you have given,
else if(Price<P1){data$Portfolio=14}
else if((Price<P2)&&(Volume<V1)){Fin_Ret$port=21}
...
else if(Price>P3){data$Portfolio=44}
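Remove the if after else in the last line, so that the chain ends with a plain else. Note that if() only tests a single value, so the comparisons still have to be run row by row (or replaced by a vectorized function such as ifelse()) to get the expected result.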

Here is a different and concise approach using findInterval and data.table. It is based on the observation that the Portfolio id consists of two digits where the first digit is determined solely by the price category and the second digit solely by the volume category.
library(data.table)
dt[, Portfolio := paste0(findInterval(Price, c(-Inf, P1, P2, P3)),
findInterval(Volume, c(-Inf, V1, V2, V3))),
by = .(P1, P2, P3, V1, V2, V3)]
print(dt)
# Year Price Volume P1 P2 P3 V1 V2 V3 Portfolio
#1: 2009 46 125 25 50 75 200 400 600 21
#2: 2009 65 800 25 50 75 200 400 600 34
#3: 2010 20 560 30 55 90 250 500 800 13
#4: 2010 15 990 30 55 90 250 500 800 14
#5: 2011 89 350 35 70 120 250 500 800 32
#6: 2012 23 100 35 70 120 250 500 800 11
findInterval uses right open intervals by default which is in line with the conditions (Price<P1), etc in the code of the OP.
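To make the boundary behaviour concrete, here is a small sketch with made-up test values (the vectors breaks and x are illustrative, not taken from the question). It compares the default with left.open = TRUE, which is the option to reach for if the conditions were written as Price <= P1 instead.

```r
# Breakpoints in the same form as c(-Inf, P1, P2, P3) for the 2009 rows
breaks <- c(-Inf, 25, 50, 75)

# Test values: one below the first breakpoint, three exactly on a breakpoint
x <- c(24, 25, 50, 75)

# Default: intervals are closed on the left and open on the right, [b1, b2)
default_bins <- findInterval(x, breaks)

# left.open = TRUE flips this to (b1, b2]
left_open_bins <- findInterval(x, breaks, left.open = TRUE)

data.frame(x, default_bins, left_open_bins)
```

With the default, a value exactly equal to a breakpoint lands in the higher category, which matches a strict Price < P1 test for the lower one.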
Data
To make it a reproducible example
dt <- fread("Year Price Volume P1 P2 P3 V1 V2 V3
2009 46 125 25 50 75 200 400 600
2009 65 800 25 50 75 200 400 600
2010 20 560 30 55 90 250 500 800
2010 15 990 30 55 90 250 500 800
2011 89 350 35 70 120 250 500 800
2012 23 100 35 70 120 250 500 800")


How to remove outliers by columns in R

I have this data frame.
IQ sleep GRE happiness
105 70 200 15
40 50 150 15
70 20 70 10
150 150 80 6
148 60 900 7
115 10 1200 40
110 90 15 5
120 40 60 12
99 30 70 15
1000 15 30 68
70 60 12 70
I would like to remove the outliers for each variable. I do not want to delete a whole row if one value is identified an outlier. For example, let's say the outlier for IQ is 40, I just want to delete 40, I don't want a whole row deleted.
If I define any value that is > mean + 3 SD or < mean - 3 SD as an outlier, what code can I use to do this?
If I can achieve this using dplyr and subset, that would be great.
I would expect something like this
IQ   sleep  GRE  happiness
105  70     200  15
     50     150  15
70   20     70   10
150         80   6
148  60     900  7
115              40
110  90          5
120  40     60   12
99   30     70   15
     15     30   68
70   60     12   70
I have tried the remove_sd_outlier function (from the dataPreparation package), but it deleted entire rows of data. I do not want this.
You can use scale() to compute z-scores and across() to apply across all numeric variables. Note none of your example values are > 3 SD from the mean, so I used 2 SD as the threshold for demonstration.
library(dplyr)
df1 %>%
mutate(across(
where(is.numeric),
~ ifelse(
abs(as.numeric(scale(.x))) > 2,
NA,
.x
)
))
# A tibble: 11 × 4
IQ sleep GRE happiness
<dbl> <dbl> <dbl> <dbl>
1 105 70 200 15
2 40 50 150 15
3 70 20 70 10
4 150 NA 80 6
5 148 60 900 7
6 115 10 NA 40
7 110 90 15 5
8 120 40 60 12
9 99 30 70 15
10 NA 15 30 68
11 70 60 12 70
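For readers without dplyr, roughly the same thing can be sketched in base R. The helper name mask_outliers and its argument k (the number of standard deviations) are assumptions made for illustration. As above, k = 2 is used only because no value in the example is more than 3 SD from its column mean.

```r
# Example data from the question
df1 <- data.frame(
  IQ        = c(105, 40, 70, 150, 148, 115, 110, 120, 99, 1000, 70),
  sleep     = c(70, 50, 20, 150, 60, 10, 90, 40, 30, 15, 60),
  GRE       = c(200, 150, 70, 80, 900, 1200, 15, 60, 70, 30, 12),
  happiness = c(15, 15, 10, 6, 7, 40, 5, 12, 15, 68, 70)
)

# Hypothetical helper: set values more than k SDs from the column mean to NA
mask_outliers <- function(x, k = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  x[!is.na(z) & abs(z) > k] <- NA
  x
}

# Apply column by column; the [] keeps the data.frame structure
cleaned <- df1
cleaned[] <- lapply(df1, mask_outliers, k = 2)
cleaned
```

Only the offending cells become NA, so every row is kept.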
I think you could rephrase the nested ifelse() as case_when() for something easier to read, but hopefully this works for you.
df %>%
mutate(across(everything(),
# NA rather than "" keeps the columns numeric
~ ifelse(. > (mean(.) + 3*sd(.)),
NA,
ifelse(. < (mean(.) - 3*sd(.)),
NA, .))))

What is the best way to assign detection history using the following values?

I have three years of detection data. In each year there are 8 possible outcomes at a site. These are no, a, n, na, l, la, ln, lna. I've assigned the values below:
no = 0
a = 1
n = 1
na = 2
l = 100
la = 101
ln = 101
lna = 102
In year 2, I wish to calculate and label all outcomes, so any combination of 2 of the terms above, to describe a detection history numerically.
So essentially I'm trying to get a list of 64 terms ranging from no,no to lna,lna with their respective values.
For example, no,no = 0 and lna,lna = 204
In year 3, I wish for the same. I'd like to calculate and label all possibilities. This needs to be arranged in two columns, one with history text, and one with history value.
x1 x2
no,no,no 0
I'm sure this is possible, and possibly even basic. Though I have no idea where to begin.
Any help would be greatly appreciated.
Thanks in advance
I'm sure there are more elegant, concise ways to do it, but here's one approach:
Define the two lists of possibilities
poss = c("no", "a", "n", "na", "l", "la", "ln", "lna")
vals = c(0, 1, 1, 2, 100, 101, 101, 102)
Use expand.grid to enumerate the possibilities
output <- expand.grid(poss, poss, stringsAsFactors = FALSE)
comb_values <- expand.grid(vals, vals)
Write the output
output$names <- paste(output$Var1, output$Var2, sep = ",")
output$value <- comb_values$Var1 + comb_values$Var2
output$Var1 <- output$Var2 <- NULL
Result
names value
1 no,no 0
2 a,no 1
3 n,no 1
4 na,no 2
5 l,no 100
6 la,no 101
7 ln,no 101
8 lna,no 102
9 no,a 1
10 a,a 2
11 n,a 2
12 na,a 3
13 l,a 101
14 la,a 102
15 ln,a 102
16 lna,a 103
17 no,n 1
18 a,n 2
19 n,n 2
20 na,n 3
21 l,n 101
22 la,n 102
23 ln,n 102
24 lna,n 103
25 no,na 2
26 a,na 3
27 n,na 3
28 na,na 4
29 l,na 102
30 la,na 103
31 ln,na 103
32 lna,na 104
33 no,l 100
34 a,l 101
35 n,l 101
36 na,l 102
37 l,l 200
38 la,l 201
39 ln,l 201
40 lna,l 202
41 no,la 101
42 a,la 102
43 n,la 102
44 na,la 103
45 l,la 201
46 la,la 202
47 ln,la 202
48 lna,la 203
49 no,ln 101
50 a,ln 102
51 n,ln 102
52 na,ln 103
53 l,ln 201
54 la,ln 202
55 ln,ln 202
56 lna,ln 203
57 no,lna 102
58 a,lna 103
59 n,lna 103
60 na,lna 104
61 l,lna 202
62 la,lna 203
63 ln,lna 203
64 lna,lna 204
Same logic for three years: replace poss, poss with poss, poss, poss (and vals, vals with vals, vals, vals), then paste and add the third column as well.
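A possible sketch of the three-year case, using the values given in the question (no = 0 up to lna = 102). The column names y1, y2, y3 and the object name history are assumptions; x1 and x2 follow the two-column layout the question asks for. Storing the values as a named vector lets a single expand.grid call drive both the labels and the sums.

```r
# Values from the question, stored as a named vector
vals <- c(no = 0, a = 1, n = 1, na = 2, l = 100, la = 101, ln = 101, lna = 102)

# All 8 * 8 * 8 = 512 combinations of outcomes over three years
grid <- expand.grid(y1 = names(vals), y2 = names(vals), y3 = names(vals),
                    stringsAsFactors = FALSE)

# x1 is the history as text, x2 is the sum of the three yearly values
history <- data.frame(
  x1 = paste(grid$y1, grid$y2, grid$y3, sep = ","),
  x2 = unname(vals[grid$y1] + vals[grid$y2] + vals[grid$y3]),
  stringsAsFactors = FALSE
)

head(history)
```

Looking the values up by name avoids keeping two separate grids in step with each other.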

duplicate low frequency values in a factor

I need to duplicate those levels whose frequency in my factor variable called groups is less than 500.
> head(groups)
[1] 0000000 1000000 1000000 1000000 0000000 0000000
75 Levels: 0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 ... 1111110
For example:
> table(group)
group
0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 0011000 0011010 0011100
58674 6 1033 654 223 1232 31 222 17 818 132 32 15 42 9 9
0011110 0100000 0100001 0100010 0100100 0100101 0100110 0101000 0101010 0101100 0101110 0110000 0110010 0110100 0110110 0111000
1 10609 1 487 64 1 58 132 11 12 3 142 27 9 7 11
0111010 0111100 0111110 1000000 1000001 1000010 1000011 1000100 1000101 1000110 1001000 1001001 1001010 1001100 1001110 1010000
5 1 2 54245 10 1005 1 329 1 138 573 1 31 71 11 969
1010010 1010100 1010110 1011000 1011010 1011100 1011110 1100000 1100001 1100010 1100011 1100100 1100110 1101000 1101010 1101011
147 29 21 63 15 10 4 14161 6 770 1 142 96 260 23 1
1101100 1101110 1110000 1110001 1110010 1110100 1110110 1111000 1111010 1111100 1111110
34 16 439 2 103 13 26 36 13 8 5
Groups 0000001, 0000110, 0001010, 0001100... must be duplicated up to 500.
Ideally I would end up with a balanced sample in which levels with fewer than 500 observations are duplicated up to 500, and levels with more than 500 observations are cut down to 500.
We can use rep on the levels of 'group' for the desired 'n'
factor(rep(levels(group), each = n))
If we need to use the table results as well
factor(rep(levels(group), table(group) + n-table(group)) )
Or with pmax
factor(rep(levels(group), pmax(n, table(levels(group)))))
data
set.seed(24)
group <- factor(sample(letters[1:6], 3000, replace = TRUE))
n <- 500
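If the goal is to resample the actual observations (so that other columns can be carried along by row index), one possible sketch is to draw exactly n indices per level: a random subset for frequent levels, and the originals plus random duplicates for rare ones. The helper resample_to_n and the unequal prob weights are assumptions added for illustration, and the sketch assumes every level occurs at least once.

```r
set.seed(24)
# Deliberately unbalanced example data (the weights are made up)
group <- factor(sample(letters[1:6], 3000, replace = TRUE,
                       prob = c(0.5, 0.2, 0.15, 0.1, 0.04, 0.01)))
n <- 500

# Hypothetical helper: return exactly n row indices for one level
resample_to_n <- function(idx, n) {
  if (length(idx) >= n) {
    # Frequent level: keep a random subset of size n
    idx[sample.int(length(idx), n)]
  } else {
    # Rare level: keep every original, then duplicate at random up to n
    extra <- idx[sample.int(length(idx), n - length(idx), replace = TRUE)]
    c(idx, extra)
  }
}

# Row indices grouped by level, resampled, then flattened again
idx_by_level <- split(seq_along(group), group)
balanced_idx <- unlist(lapply(idx_by_level, resample_to_n, n = n),
                       use.names = FALSE)
balanced <- group[balanced_idx]
table(balanced)
```

The same balanced_idx vector could be used to subset a data frame whose rows line up with group.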

Summarizing a data frame

I am trying to take the following data and use it to create a table that has the information broken down by state.
Here's the data:
> head(mydf2, 10)
lead_id buyer_account_id amount state
1 52055267 62 300 CA
2 52055267 64 264 CA
3 52055305 64 152 CA
4 52057682 62 75 NJ
5 52060519 62 750 OR
6 52060519 64 574 OR
15 52065951 64 152 TN
17 52066749 62 600 CO
18 52062751 64 167 OR
20 52071186 64 925 MN
I've already subset the states that I'm interested in, so I have just the data I need:
mydf2 = subset(mydf, state %in% c("NV","AL","OR","CO","TN","SC","MN","NJ","KY","CA"))
Here's an idea of what I'm looking for:
State Amount Count
NV 1 50
NV 2 35
NV 3 20
NV 4 15
AL 1 10
AL 2 6
AL 3 4
AL 4 1
...
For each state, I'm trying to find a count for each amount "level." I don't necessarily need to group the amount variable, but keep in mind that the amounts are not just 1, 2, 3, etc.
> mydf$amount
[1] 300 264 152 75 750 574 113 152 750 152 675 489 188 263 152 152 600 167 34 925 375 156 675 152 488 204 152 152
[29] 600 489 488 75 152 152 489 222 563 215 452 152 152 75 100 113 152 150 152 150 152 452 150 152 152 225 600 620
[57] 113 152 150 152 152 152 152 152 152 152 640 236 152 480 152 152 200 152 560 152 240 222 152 152 120 257 152 400
Is there an elegant solution for this in R, or will I be stuck using Excel (yuck!)?
Here's my understanding of what you're trying to do:
Start with a simple data.frame with 26 states and amounts only ranging from 1 to 50 (which is much more restrictive than what you have in your example, where the range is much higher).
set.seed(1)
mydf <- data.frame(
state = sample(letters, 500, replace = TRUE),
amount = sample(1:50, 500, replace = TRUE)
)
head(mydf)
# state amount
# 1 g 28
# 2 j 35
# 3 o 33
# 4 x 34
# 5 f 24
# 6 x 49
Here's some straightforward tabulation. I've also removed any instances where frequency equals zero, and I've reordered the output by state.
temp1 <- data.frame(table(mydf$state, mydf$amount))
temp1 <- temp1[!temp1$Freq == 0, ]
head(temp1[order(temp1$Var1), ])
# Var1 Var2 Freq
# 79 a 4 1
# 157 a 7 2
# 391 a 16 1
# 417 a 17 1
# 521 a 21 1
# 1041 a 41 1
dim(temp1) # How many rows/cols
# [1] 410 3
Here's a little bit different tabulation. We are tabulating after grouping the "amount" values. Here, I've manually specified the breaks, but you could just as easily let R decide what it thinks is best.
temp2 <- data.frame(table(mydf$state,
cut(mydf$amount,
breaks = c(0, 12.5, 25, 37.5, 50),
include.lowest = TRUE)))
temp2 <- temp2[!temp2$Freq == 0, ]
head(temp2[order(temp2$Var1), ])
# Var1 Var2 Freq
# 1 a [0,12.5] 3
# 27 a (12.5,25] 3
# 79 a (37.5,50] 3
# 2 b [0,12.5] 2
# 28 b (12.5,25] 6
# 54 b (25,37.5] 5
dim(temp2)
# [1] 103 3
I am not sure if I understand correctly (you have two data.frames mydf and mydf2). I'll assume your data is in mydf. Using aggregate:
mydf$count <- 1:nrow(mydf)
aggregate(data = mydf, count ~ amount + state, length)
Is this what you are looking for?
Note: here count is a variable that is created just to get directly the output of the 3rd column as count.
Alternatives with ddply from plyr:
library(plyr)
# no need to create a variable called count
ddply(mydf, .(state, amount), summarise, count=length(lead_id))
Here, one could use any column that exists in one's data instead of lead_id. Even state:
ddply(mydf, .(state, amount), summarise, count=length(state))
Or equivalently without using summarise:
ddply(mydf, .(state, amount), function(x) c(count=nrow(x)))
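As a base R sketch that needs no extra packages, the ten rows shown in the question can be tabulated directly and given the State / Amount / Count headings the question asks for. The object name counts is an assumption. With only these ten rows every state/amount pair happens to occur once, so all counts are 1.

```r
# The ten rows printed by head(mydf2, 10) in the question
mydf2 <- data.frame(
  lead_id = c(52055267, 52055267, 52055305, 52057682, 52060519,
              52060519, 52065951, 52066749, 52062751, 52071186),
  buyer_account_id = c(62, 64, 64, 62, 62, 64, 64, 62, 64, 64),
  amount = c(300, 264, 152, 75, 750, 574, 152, 600, 167, 925),
  state = c("CA", "CA", "CA", "NJ", "OR", "OR", "TN", "CO", "OR", "MN")
)

# Cross-tabulate, then reshape the table into long form
counts <- as.data.frame(table(State = mydf2$state, Amount = mydf2$amount),
                        responseName = "Count", stringsAsFactors = FALSE)

# Drop combinations that never occur and restore numeric amounts
counts <- counts[counts$Count > 0, ]
counts$Amount <- as.numeric(counts$Amount)

# Sort by state, then by amount within each state
counts <- counts[order(counts$State, counts$Amount), ]
rownames(counts) <- NULL
counts
```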

Restructuring Market Data in R

I am very new to R and to Stack Overflow. My data is being read into R as a csv file. I have figured out how to restructure currency 1 by itself within R. However, I am working with 900+ columns of data and need a way of looping the R script so that what I have done to columns 1 through 7 is applied to the other 900 columns.
Currently my data looks like this:
Currency 1 Blank Currency 2
Date Contract Last Open High Low Volume Column Date Contract Last Open High Low Volume
10/10/2012 Dec 100 101 105 99 20000
10/11/2012 Dec 101 102 106 98 20100
10/12/2012 Jan 102 103 107 97 20120
As you can see, the data is sent to me horizontally, with a blank column in between each currency, and I need the data stacked on top of each other.
I would like the data to look like this:
Date Contract Last Open High Low Volume Market
10/10/2012 Dec 100 101 105 99 20000 Currency 1
10/11/2012 Dec 101 102 106 98 20100 Currency 1
10/12/2012 Jan 102 103 107 97 20120 Currency 1
10/10/2012 Dec 50 52 49 99 20530 Currency 2
10/11/2012 Dec 53 56 43 98 24300 Currency 2
10/12/2012 Jan 56 52 48 97 22320 Currency 2
If I understand correctly, and if your source data really are nicely formatted you might be able to do something like the following. Here, I'm linking to a csv with three sets of currencies which replicates what I think your source data look like.
First, read the file in using read.csv but skip the first row. Use check.names = FALSE so that duplicate column names are allowed.
temp <- read.csv("http://ideone.com/plain/t3cGcA",
header = TRUE, skip = 1,
check.names = FALSE)
temp
# Date Contract Last Open High Low Volume Date
# 1 10/10/2012 Dec 100 101 105 99 20000 NA 10/10/2012
# 2 10/11/2012 Dec 101 102 106 98 20100 NA 10/11/2012
# 3 10/12/2012 Jan 102 103 107 97 20120 NA 10/12/2012
# Contract Last Open High Low Volume
# 1 Dec 50 52 49 99 20530
# 2 Dec 53 56 43 98 24300
# 3 Jan 56 52 48 97 22320
# structure(c("NA", "NA", "NA"), class = "AsIs") Date Contract
# 1 NA 10/10/2012 Dec
# 2 NA 10/11/2012 Dec
# 3 NA 10/12/2012 Jan
# Last Open High Low Volume
# 1 500 501 605 99 20000
# 2 600 502 606 98 20100
# 3 700 503 607 97 20120
Second---and here is one assumption of the tidiness of your dataset---use seq to create a vector of where your blank columns are. From this, if our assumption of tidiness is correct, you can use simple math to determine the start (vector value minus 7) and end indexes (vector value minus 1) of each currency.
myblankcols <- seq(1, ncol(temp), by=8) + 7
myblankcols
# [1] 8 16 24
Using the simple math mentioned above, create a list of the subsets of each currency, and add names to the list. You can get the names by re-reading just the first line of the file as a csv and dropping all the NA values.
tempL <- lapply(seq_along(myblankcols),
function(x) temp[(myblankcols[x] - 7):(myblankcols[x] - 1)])
NamesTempL <- read.csv("http://ideone.com/plain/t3cGcA",
header = FALSE, nrows = 1)
names(tempL) <- NamesTempL[!is.na(NamesTempL)]
tempL
# $`Currency 1`
# Date Contract Last Open High Low Volume
# 1 10/10/2012 Dec 100 101 105 99 20000
# 2 10/11/2012 Dec 101 102 106 98 20100
# 3 10/12/2012 Jan 102 103 107 97 20120
#
# $`Currency 2`
# Date Contract Last Open High Low Volume
# 1 10/10/2012 Dec 50 52 49 99 20530
# 2 10/11/2012 Dec 53 56 43 98 24300
# 3 10/12/2012 Jan 56 52 48 97 22320
#
# $`Currency 3`
# Date Contract Last Open High Low Volume
# 1 10/10/2012 Dec 500 501 605 99 20000
# 2 10/11/2012 Dec 600 502 606 98 20100
# 3 10/12/2012 Jan 700 503 607 97 20120
I am usually tempted to stop at this point, because I find lists convenient for many purposes. But, it's equally easy to convert it to a single data.frame. This is also one of the reasons to make sure you use check.names = FALSE in the first step: if all the columns have the same name, then there will be no problem rbinding them together.
do.call(rbind, tempL)
# Date Contract Last Open High Low Volume
# Currency 1.1 10/10/2012 Dec 100 101 105 99 20000
# Currency 1.2 10/11/2012 Dec 101 102 106 98 20100
# Currency 1.3 10/12/2012 Jan 102 103 107 97 20120
# Currency 2.1 10/10/2012 Dec 50 52 49 99 20530
# Currency 2.2 10/11/2012 Dec 53 56 43 98 24300
# Currency 2.3 10/12/2012 Jan 56 52 48 97 22320
# Currency 3.1 10/10/2012 Dec 500 501 605 99 20000
# Currency 3.2 10/11/2012 Dec 600 502 606 98 20100
# Currency 3.3 10/12/2012 Jan 700 503 607 97 20120
I'll definitely stop here, but from here, you probably want to convert your "Date" column to actual dates, and perhaps convert the row names ("Currency 1.1", "Currency 1.2", and so on) to a column in your data.frame.
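Those two follow-up steps could look roughly like the sketch below. It rebuilds a cut-down tempL by hand (two currencies, three columns) so that it runs on its own. The column name Market comes from the desired output in the question, while the date format month/day/year is an assumption about the source file.

```r
# Cut-down stand-in for the tempL list built above
tempL <- list(
  `Currency 1` = data.frame(Date = c("10/10/2012", "10/11/2012", "10/12/2012"),
                            Contract = c("Dec", "Dec", "Jan"),
                            Last = c(100, 101, 102)),
  `Currency 2` = data.frame(Date = c("10/10/2012", "10/11/2012", "10/12/2012"),
                            Contract = c("Dec", "Dec", "Jan"),
                            Last = c(50, 53, 56))
)

# Add each list name as a Market column before stacking
with_market <- Map(function(d, market) cbind(d, Market = market),
                   tempL, names(tempL))
stacked <- do.call(rbind, with_market)
rownames(stacked) <- NULL

# Assumed format: month/day/year
stacked$Date <- as.Date(stacked$Date, format = "%m/%d/%Y")
stacked
```

Adding the column before rbind means the market label no longer has to be parsed back out of the row names.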
