Why does the frequency reduce if I use the ifelse function in R? Is there a way to create categories from the combination of 2 variables/columns? - r

When I run
table(df$strategy.x)
0 1 2 3
70 514 223 209
table(df$strategy.y)
0 1 2 3
729 24 7 4
I want to create a variable that combines both of these. I tried this:
df <- df %>%
mutate(nstrategy1 = ifelse(strategy.x==1| strategy.y==1 , 1, 0))
table(df$nstrategy1)
0 1
399 519
I am supposed to get 514 + 24 = 538 but I got 519 instead
df <- df %>% mutate(nstrategy2 = ifelse(strategy.x==2| strategy.y==2 , 1, 0))
table(df$nstrategy2)
0 1
578 228
Similarly, I am supposed to get 223 + 7 = 230, but I got 228 instead
Is there a good way to merge both strategy.x and strategy.y and end up with a table like the following with 4 categories?
0 1 2 3
799 538 230 213

table(mtcars$am) # 13 1's
table(mtcars$vs) # 14 1's
mtcars$ones = ifelse(mtcars$am == 1 | mtcars$vs == 1, 1, 0)
table(mtcars$ones) # 20 1's < 13 + 14 = 27
Why does it show only 20 1's instead of 27? Because there are 7 + 6 + 7 = 20 cars with a 1 in am, in vs, or in both. There are 13 with am==1 (6+7) and 14 with vs==1 (7+7). Seven cars sit in the bottom-right cell of the cross-table below because they have 1's in both dimensions, and adding the two marginal counts would count those seven twice.
table(mtcars$am, mtcars$vs)
#      0  1
#   0 12  7
#   1  6  7
The simplest way to get the sum of the two results would be by adding the two table objects:
table(mtcars$am) + table(mtcars$vs)
# 0 1
# 37 27
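The same idea can be applied to the two strategy columns from the question. The following is a minimal sketch on made-up toy data (the values in `df` are illustrative, not the asker's data). It checks the inclusion-exclusion identity behind the "missing" counts and builds the summed four-category table, fixing the factor levels so the two tables line up even if one column lacks a category.

```r
# Toy data standing in for the question's df (values are made up)
df <- data.frame(
  strategy.x = c(0, 1, 1, 2, 3, 1, 2, 0),
  strategy.y = c(1, 1, 0, 2, 0, 3, 0, 0)
)

# Rows matching in either column = x-matches + y-matches - rows matching in both
n_x      <- sum(df$strategy.x == 1)
n_y      <- sum(df$strategy.y == 1)
n_both   <- sum(df$strategy.x == 1 & df$strategy.y == 1)
n_either <- sum(df$strategy.x == 1 | df$strategy.y == 1)
n_either == n_x + n_y - n_both

# Summed per-category counts; explicit levels keep the two tables aligned
lev <- 0:3
combined <- table(factor(df$strategy.x, levels = lev)) +
  table(factor(df$strategy.y, levels = lev))
combined
```

Note that the summed table counts each row twice (once per column), so its total is twice the number of rows.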

Related

Conditional filling NA rows with comparing non-NA labeled rows

I want to fill NA rows based on checking the differences between the closest non-NA labeled rows.
For instance
data <- data.frame(sd_value=c(34,33,34,37,36,45),
value=c(383,428,437,455,508,509),
label=c(c("bad",rep(NA,4),"unable")))
> data
sd_value value label
1 34 383 bad
2 33 428 <NA>
3 34 437 <NA>
4 37 455 <NA>
5 36 508 <NA>
6 45 509 unable
I want to decide how to fill the NA rows by checking the row-to-row differences in sd_value and value between the bad and unable rows.
To get the differences between consecutive rows we can do:
library(dplyr)
data%>%
mutate(diff_val=c(0,diff(value)), diff_sd_val=c(0,diff(sd_value)))
sd_value value label diff_val diff_sd_val
1 34 383 bad 0 0
2 33 428 <NA> 45 -1
3 34 437 <NA> 9 1
4 37 455 <NA> 18 3
5 36 508 <NA> 53 -1
6 45 509 unable 1 9
The condition for labelling the NA rows is:
if diff_val < 50 and diff_sd_val < 9, use the last non-NA label before the row; otherwise use the first non-NA label after the row.
So the expected output would be
sd_value value label diff_val diff_sd_val
1 34 383 bad 0 0
2 33 428 bad 45 -1
3 34 437 bad 9 1
4 37 455 bad 18 3
5 36 508 unable 53 -1
6 45 509 unable 1 9
The possible solution I cooked up so far:
custom_labelling <- function(x,y,label){
diff_sd_val<-c(NA,diff(x))
diff_val<-c(NA,diff(y))
label <- NA
for (i in 1:length(label)){
if(is.na(label[i])&diff_sd_val<9&diff_val<50){
label[i] <- label
}
else {
label <- label[i]
}
}
return(label)
}
which gives
data%>%
mutate(diff_val=c(0,diff(value)), diff_sd_val=c(0,diff(sd_value)))%>%
mutate(custom_label=custom_labelling(sd_value,value,label))
Error in mutate_impl(.data, dots) :
Evaluation error: missing value where TRUE/FALSE needed.
In addition: Warning message:
In if (is.na(label[i]) & diff_sd_val < 9 & diff_val < 50) { :
the condition has length > 1 and only the first element will be used
One option is to find the NA and non-NA indices and, based on the condition, pick the closest label before or after each NA row.
library(dplyr)
#Create a new dataframe with diff_val and diff_sd_val
data1 <- data%>% mutate(diff_val=c(0,diff(value)), diff_sd_val=c(0,diff(sd_value)))
#Get the NA indices
NA_inds <- which(is.na(data1$label))
#Get the non-NA indices
non_NA_inds <- setdiff(1:nrow(data1), NA_inds)
#For every NA index
for (i in NA_inds) {
  #Check the condition
  if (data1$diff_sd_val[i] < 9 && data1$diff_val[i] < 50) {
    #Get the last non-NA label before row i
    data1$label[i] <- data1$label[max(non_NA_inds[non_NA_inds < i])]
  } else {
    #Get the first non-NA label after row i
    data1$label[i] <- data1$label[min(non_NA_inds[non_NA_inds > i])]
  }
}
data1
# sd_value value label diff_val diff_sd_val
#1 34 383 bad 0 0
#2 33 428 bad 45 -1
#3 34 437 bad 9 1
#4 37 455 bad 18 3
#5 36 508 unable 53 -1
#6 45 509 unable 1 9
You can remove diff_val and diff_sd_val columns later if not needed.
We can also create a function
custom_label <- function(label, diff_val, diff_sd_val) {
  NA_inds <- which(is.na(label))
  non_NA_inds <- setdiff(seq_along(label), NA_inds)
  new_label <- label
  for (i in NA_inds) {
    if (diff_sd_val[i] < 9 && diff_val[i] < 50) {
      # last non-NA label before position i
      new_label[i] <- label[max(non_NA_inds[non_NA_inds < i])]
    } else {
      # first non-NA label after position i
      new_label[i] <- label[min(non_NA_inds[non_NA_inds > i])]
    }
  }
  return(new_label)
}
and then apply it
data%>%
mutate(diff_val = c(0, diff(value)),
diff_sd_val = c(0, diff(sd_value)),
new_label = custom_label(label, diff_val, diff_sd_val))
# sd_value value label diff_val diff_sd_val new_label
#1 34 383 bad 0 0 bad
#2 33 428 <NA> 45 -1 bad
#3 34 437 <NA> 9 1 bad
#4 37 455 <NA> 18 3 bad
#5 36 508 <NA> 53 -1 unable
#6 45 509 unable 1 9 unable
To apply it by group, add a group_by step before mutate. Here group is a placeholder for whatever grouping column your data has (the sample data above has none).
data%>%
group_by(group) %>%
mutate(diff_val = c(0, diff(value)),
diff_sd_val = c(0, diff(sd_value)),
new_label = custom_label(label, diff_val, diff_sd_val))
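For longer vectors, the same rule can also be written without an explicit loop. The following is a minimal base R sketch, not part of the original answer: `fill_label` is a hypothetical helper name, and it assumes the same thresholds (50 and 9) and the same difference columns as above. Rows with no non-NA label on the required side stay NA.

```r
fill_label <- function(label, diff_val, diff_sd_val) {
  label <- as.character(label)          # also works if label is a factor
  n     <- length(label)
  idx   <- seq_len(n)
  known <- !is.na(label)

  # Position of the nearest non-NA label at or before each row (NA if none)
  prev_idx <- cummax(ifelse(known, idx, 0L))
  prev_idx[prev_idx == 0L] <- NA

  # Position of the nearest non-NA label at or after each row (NA if none)
  next_idx <- rev(cummin(rev(ifelse(known, idx, n + 1L))))
  next_idx[next_idx > n] <- NA

  # Small differences take the previous label, otherwise the next one
  use_prev <- diff_val < 50 & diff_sd_val < 9
  filled   <- ifelse(use_prev, label[prev_idx], label[next_idx])

  # Rows that already have a label keep it
  ifelse(known, label, filled)
}

data <- data.frame(sd_value = c(34, 33, 34, 37, 36, 45),
                   value    = c(383, 428, 437, 455, 508, 509),
                   label    = c("bad", rep(NA, 4), "unable"))

new_label <- fill_label(data$label,
                        c(0, diff(data$value)),
                        c(0, diff(data$sd_value)))
new_label
```

Because it takes plain vectors, the function can be called inside mutate in the same way as custom_label.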

Sorting elements by column in R

I have a simple piece of code for a matrix:
ind1=which(macierz==1,arr.ind = TRUE)
A fragment of the result is:
> ind1
row col
TCGA.CH.5737.01 53 1
TCGA.CH.5791.01 66 1
P03.1334.Tumor 322 1
P04.1790.Tumor 327 1
CPCG0340.F1 425 1
TCGA.CH.5737.01 53 2
TCGA.CH.5791.01 66 2
P03.1334.Tumor 322 2
P04.1790.Tumor 327 2
CPCG0340.F1 425 2
I would like to sort it alphabetically by the first column. How can I do this in R?
It looks as if ind1 is a matrix and the first column is the rownames, so you probably need something like ind1 <- ind1[order(rownames(ind1)),]
You need (assuming your first column is called "label" and those are not rownames)
ind1[order(ind1$label),]
order() returns the vector of row indices that puts the label column in alphabetical order. To make the example reproducible, I recreated your data as a data frame:
ind1 <- data.frame ( label = c("TCGA.CH.5737.01", "TCGA.CH.5791.01",
"P03.1334.Tumor","P04.1790.Tumor", "CPCG0340.F1" , "TCGA.CH.5737.01",
"TCGA.CH.5791.01","P03.1334.Tumor", "P04.1790.Tumor", "CPCG0340.F1"),
row = c(53,66,322,327,425,53,66,322,327,425), col =
c(1,1,1,1,1,2,2,2,2,2),
stringsAsFactors = FALSE)
and the output is
> ind1[order(ind1$label),]
label row col
5 CPCG0340.F1 425 1
10 CPCG0340.F1 425 2
3 P03.1334.Tumor 322 1
8 P03.1334.Tumor 322 2
4 P04.1790.Tumor 327 1
9 P04.1790.Tumor 327 2
1 TCGA.CH.5737.01 53 1
6 TCGA.CH.5737.01 53 2
2 TCGA.CH.5791.01 66 1
7 TCGA.CH.5791.01 66 2
Hope that helps.
Regards, Umberto
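For the case in the first answer, where the labels are row names of the matrix returned by which(..., arr.ind = TRUE), a minimal sketch could look like this. The matrix `m` and its labels are made up for illustration; only the `order(rownames(...))` step comes from the answer above.

```r
# Small stand-in for the asker's matrix (labels and values are made up)
m <- matrix(0, nrow = 3, ncol = 2,
            dimnames = list(c("TCGA.A", "CPCG.B", "P03.C"), c("c1", "c2")))
m[c(1, 3), 1] <- 1
m[2, 2] <- 1

# which() with arr.ind = TRUE keeps the matrix row names as row names of the result
ind <- which(m == 1, arr.ind = TRUE)

# Sort by row name; drop = FALSE keeps a matrix even if only one row matches
sorted <- ind[order(rownames(ind)), , drop = FALSE]
sorted
```

To break ties between identical labels by column index as well, `order(rownames(ind), ind[, "col"])` can be used instead.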

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only the first rows where the values of 'X_POSITION' are increasing. That is, I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(dat) # dat is the question's data frame; or setDT(dat) to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = TRUE)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
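Here is something you can try with the dplyr package. Note that it sums every row where X_POSITION exceeds the previous row, so it matches the first pass only when, as in the sample, X never rises again after its first drop:
library(dplyr)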
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
If you want to summarize it down to one row per trial, you can use summarize like this:
library(dplyr)
df <- tibble(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
DURATION = c(204,172,186,670, 186,134,182,806, 323),
X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
group_by(TRIAL_INDEX) %>%
# the first row of each trial has no predecessor (lag gives NA), so treat it as increasing
mutate(x.increasing = is.na(lag(X_POSITION)) | X_POSITION > lag(X_POSITION)) %>%
filter(x.increasing) %>%
# sum DURATION (not X_POSITION) to get the time
summarize(FIRST_PASS_TIME = sum(DURATION))
res
# TRIAL_INDEX FIRST_PASS_TIME
#1 1 562
#2 2 1122
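The dplyr answers above flag every row where X_POSITION is higher than in the previous row, which would also pick up a later rise after the first drop. A minimal base R sketch that stops at the first drop within each trial could look like this. It is an alternative under that stricter reading of "first pass", not the original answerers' method, and it recreates the question's sample data so that it runs on its own.

```r
dat <- data.frame(
  TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  DURATION    = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
  X_POSITION  = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0)
)

# Within each trial: 1 while X_POSITION has risen on every row so far, 0 afterwards.
# cumprod() turns the first FALSE into a permanent 0.
first_pass <- ave(dat$X_POSITION, dat$TRIAL_INDEX,
                  FUN = function(x) cumprod(c(TRUE, diff(x) > 0))) == 1

# Sum DURATION over the first-pass rows and repeat the total on every row of the trial
dat$FIRST_PASS_TIME <- ave(dat$DURATION * first_pass, dat$TRIAL_INDEX, FUN = sum)
dat
```

On the sample data this gives 562 for trial 1 and 1122 for trial 2, the same values as the desired output in the question.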

Calculating R combinations from CSV file [closed]

I have a CSV file which contains about 400 values ranging from 10 000 to 50 000.
I want to find which combinations (integer multiples) of a few selected values, for example 100, 150, 200, 250, add up to values in the CSV file.
Is it possible to do it in R?
So this is part of the data:
1359.214844
1604.558594
1701.759766
1761.083984
1792.990234
1926.248047
1958.144531
2086.373047
2114.501953
2142.542969
2204.325621
2216.468750
2229.136719
2286.894531
2302.847656
2379.826172
2395.039063
2467.578125
2610.802734
2797.929688
2812.916016
2838.947266
2979.498047
3122.171875
3163.671875
3457.794922
3809.228516
3826.058594
3952.609375
3983.210938
4102.996094
The second data set is (146.058, 203.193, 162.053, 291.095).
I need the possible combinations of the second data set that correspond to values in the first one. For example, 291*2 + 162*5 + 203*4 = 2204.
There will be alternative ways to do that, like a loop that checks a specific combination at iteration i and decides to keep it or ignore it, but I prefer not to use loops when possible.
library(dplyr)
dt = read.table(text = "1359.214844
1604.558594
1701.759766
1761.083984
1792.990234
1926.248047
1958.144531
2086.373047
2114.501953
2142.542969
2204.325621
2216.468750
2229.136719
2286.894531
2302.847656
2379.826172
2395.039063
2467.578125
2610.802734
2797.929688
2812.916016
2838.947266
2979.498047
3122.171875
3163.671875
3457.794922
3809.228516
3826.058594
3952.609375
3983.210938
4102.996094")
# change column name and round values
names(dt) = "value"
dt$value = round(dt$value)
# give the manual values (assuming they are 4 values)
manual_values = c(146.058, 203.193, 162.053, 291.095)
# round values
manual_values = round(manual_values)
# get the maximum coefficient to investigate
coeff = ceiling(max(dt$value) / min(manual_values))
expand.grid(v1 = manual_values[1], ## create all combinations of coefficients and keep your values
v2 = manual_values[2],
v3 = manual_values[3],
v4 = manual_values[4],
coeff1 = 0:coeff,
coeff2 = 0:coeff,
coeff3 = 0:coeff,
coeff4 = 0:coeff) %>%
mutate(value = v1*coeff1+v2*coeff2+v3*coeff3+v4*coeff4) %>% ## calculate the value from each combination
inner_join(dt, by="value") ## join info from your initial values
## sample of the first 10 rows of the result :
# v1 v2 v3 v4 coeff1 coeff2 coeff3 coeff4 value
# 1 146 203 162 291 3 10 0 0 2468
# 2 146 203 162 291 7 12 0 0 3458
# 3 146 203 162 291 9 13 0 0 3953
# 4 146 203 162 291 7 3 1 0 1793
# 5 146 203 162 291 22 3 1 0 3983
# 6 146 203 162 291 15 4 1 0 3164
# 7 146 203 162 291 4 5 1 0 1761
# 8 146 203 162 291 0 11 1 0 2395
# 9 146 203 162 291 4 11 1 0 2979
# 10 146 203 162 291 2 14 2 0 3458
So, the first line of the output tells you that the combination 3*146 + 10*203 equals 2468, which is a value that exists in your initial dataset (CSV).
If you spot any bugs, or you need any clarifications let me know and I can update my answer.
A small improvement could be to replace the last inner_join with filter(value %in% dt$value). I don't think there's any reason to join when you can get the same output by using a filtering command.
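The filtering variant mentioned above can be sketched in base R as follows. This is a minimal illustration, assuming rounded values as in the answer and using only two of the rounded sample values as targets to keep the grid small. The names `targets`, `units`, `max_k` and `hits` are made up for this sketch.

```r
targets <- c(2204, 2468)           # two rounded values from the sample data
units   <- c(146, 203, 162, 291)   # rounded second data set
max_k   <- ceiling(max(targets) / min(units))  # largest coefficient worth trying

# All coefficient combinations (18^4 rows here)
grid <- expand.grid(k1 = 0:max_k, k2 = 0:max_k, k3 = 0:max_k, k4 = 0:max_k)

# Weighted sum of each combination, computed before the value column is added
grid$value <- as.vector(as.matrix(grid) %*% units)

# Keep only the combinations whose sum is one of the targets
hits <- grid[grid$value %in% targets, ]
head(hits)
```

The grid grows with the fourth power of max_k, so for the full data set it helps to cap the coefficients where the problem allows it.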
For your other objective (specified in the comments) try this:
library(dplyr)
dt = read.table(text = "1359.214844
1604.558594
1701.759766
1761.083984
1792.990234
1926.248047
1958.144531
2086.373047
2114.501953
2142.542969
2204.325621
2216.468750
2229.136719
2286.894531
2302.847656
2379.826172
2395.039063
2467.578125
2610.802734
2797.929688
2812.916016
2838.947266
2979.498047
3122.171875
3163.671875
3457.794922
3809.228516
3826.058594
3952.609375
3983.210938
4102.996094")
# change column name and round values
names(dt) = "value"
dt$value = round(dt$value)
# give the manual values (assuming they are 4 values)
manual_values = c(146.058, 203.193, 162.053, 291.095)
# get the maximum coefficient to investigate
coeff = ceiling(max(dt$value) / min(manual_values))
expand.grid(v1 = manual_values[1], ## create all combinations of coefficients and keep your values
v2 = manual_values[2],
v3 = manual_values[3],
v4 = manual_values[4],
coeff1 = 0:3,
coeff2 = 5:coeff,
coeff3 = 5:coeff,
coeff4 = 0:3) %>%
mutate(SUM = v1*coeff1+v2*coeff2+v3*coeff3+v4*coeff4) %>% ## calculate the value of each combination
as_tibble() ## only for printing top 10 rows
# v1 v2 v3 v4 coeff1 coeff2 coeff3 coeff4 SUM
# (dbl) (dbl) (dbl) (dbl) (int) (int) (int) (int) (dbl)
# 1 146.058 203.193 162.053 291.095 0 5 5 0 1826.230
# 2 146.058 203.193 162.053 291.095 1 5 5 0 1972.288
# 3 146.058 203.193 162.053 291.095 2 5 5 0 2118.346
# 4 146.058 203.193 162.053 291.095 3 5 5 0 2264.404
# 5 146.058 203.193 162.053 291.095 0 6 5 0 2029.423
# 6 146.058 203.193 162.053 291.095 1 6 5 0 2175.481
# 7 146.058 203.193 162.053 291.095 2 6 5 0 2321.539
# 8 146.058 203.193 162.053 291.095 3 6 5 0 2467.597
# 9 146.058 203.193 162.053 291.095 0 7 5 0 2232.616
# 10 146.058 203.193 162.053 291.095 1 7 5 0 2378.674
# .. ... ... ... ... ... ... ... ... ...
You can save this result table as a data frame and continue your process as you like.

Loop for subsetting data.frame

I am working with the neuralnet package to predict stock values (for my diploma thesis). Example data are below:
predict<-runif(23,min=0,max=1)
day<-c(369:391)
ChoosedN<-c(2,5,5,5,5,5,4,3,5,5,5,2,1,1,5,5,4,3,2,3,4,3,2)
Profit<-runif(23,min=-2,max=5)
df<-data.frame(predict,day,ChoosedN,Profit)
colnames(df)<-c('predict','day','ChoosedN','Profit')
But the holding period of each investment (ChoosedN) is not always the same. To backtest the neural network, I have to skip the days when I am still in a position, even if the network says 'buy' (i.e. predict > 0.5). The data frame looks like this:
predict day ChoosedN Profit
1 0.6762981061 369 2 -1.6288823350
2 0.0195611224 370 5 1.5682195597
3 0.2442795106 371 5 0.6195915225
4 0.9587601107 372 5 -1.9701975542
5 0.7415729680 373 5 3.7826137026
6 0.4814927997 374 5 4.1228808255
7 0.1340754859 375 4 3.7818792837
8 0.6316874851 376 3 0.7670884461
9 0.1107241728 377 5 -1.3367400097
10 0.5850426450 378 5 2.2848396166
11 0.2809308425 379 5 2.5234691438
12 0.2835292015 380 2 -0.3291319925
13 0.3328713216 381 1 4.7425349397
14 0.4766904986 382 1 -0.4062103292
15 0.5005860797 383 5 4.8612083721
16 0.2734292494 384 5 -0.2320077328
17 0.1488479455 385 4 2.6195679584
18 0.9446908936 386 3 0.4889716264
19 0.8222738281 387 2 0.7362413658
20 0.7570014759 388 3 4.6661250258
21 0.9988698252 389 4 2.6340743946
22 0.8384663551 390 3 1.0428046484
23 0.1938821415 391 2 0.8855748393
I need to create a new data.frame as follows. For example: if predict in the first row is > 0.5, delete the second and third rows (ChoosedN in the first row is 2, so the next two rows must be deleted because we were still in a position). Then continue from the fourth row in the same way (if predict in the fourth row is > 0.5, delete the next five rows, and so on). And of course, if predict <= 0.5, delete that row too.
Is there a straightforward way to do this with a loop?
Thanks
I would create a new data frame, then bind the rows you want using rbind inside a while loop:
newDF <- data.frame() # New, empty data frame
i <- 1 # Loop index variable
while (i <= nrow(df)) { # <= so the last row is checked too
  if (df$predict[i] > 0.5) { # If predict > 0.5,
    newDF <- rbind(newDF, df[i,]) # bind the row
    i <- i + df$ChoosedN[i] # and skip the next ChoosedN rows
  }
  i <- i + 1 # Move to the next row
}
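Growing a data frame with rbind inside a loop copies it on every iteration. A variant of the same loop that only collects row indices and subsets once at the end could look like the sketch below. `select_entries` is a hypothetical helper name, and the small input vectors are made up so that the result can be checked by hand.

```r
# Return the indices of the rows where a position is opened
select_entries <- function(predict, hold, threshold = 0.5) {
  keep <- integer(0)
  i <- 1
  n <- length(predict)
  while (i <= n) {
    if (predict[i] > threshold) {
      keep <- c(keep, i)   # open a position on this row
      i <- i + hold[i]     # skip the rows spent in the position
    }
    i <- i + 1             # move to the next candidate row
  }
  keep
}

# Hand-checkable example: row 1 is taken and rows 2-3 are skipped,
# row 4 is below the threshold, row 5 is taken and row 6 is skipped
predict_toy <- c(0.7, 0.9, 0.9, 0.2, 0.8, 0.6)
hold_toy    <- c(2, 1, 1, 1, 1, 1)
keep <- select_entries(predict_toy, hold_toy)
keep
```

With the question's data this would be used as `newDF <- df[select_entries(df$predict, df$ChoosedN), ]`.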
