How to reduce the if statement for multiple arguments? - r

I would like to reduce this code:
carro$custo_tprivate = with(carro, ifelse(decile_renda == 1,
renda_fa*0.116,
ifelse(decile_renda == 2, renda_fa*0.106,
ifelse(decile_renda == 3, renda_fa*0.102,
ifelse(decile_renda == 4, renda_fa*0.115,
ifelse(decile_renda == 5, renda_fa*0.124,
ifelse(decile_renda == 6, renda_fa*0.125,
ifelse(decile_renda == 7, renda_fa*0.137,
ifelse(decile_renda == 8, renda_fa*0.141,
ifelse(decile_renda == 9, renda_fa*0.156,
ifelse(decile_renda == 10, renda_fa*0.131, 0)))))))))))
Could someone teach me how to do that?
Thank you very much!

You can use match() to replace the nested ifelse() calls, as below:
# put your case data into a data frame
data = data.frame(x= seq(1, 10, by = 1)
, y = runif(10))
# creating your actual data
carro = data.frame(decile_renda = sample(1:10, 10, replace =T)
,renda_fa = runif(10)
)
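# match to get the position of each case in the lookup data (NA if there is no match)
pos = with(carro, match(decile_renda, data$x))
# multiply to get results; rows without a match give NA, which is set to 0
res = data$y[pos]*carro$renda_fa
res[is.na(res)] = 0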
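As a hedged sketch of the same match() idea, the snippet below uses the factors from the question instead of random placeholder values. The small carro data frame and the names fa and pos are made up for illustration; rows whose decile_renda is not in 1 to 10 get 0, as in the original nested ifelse().

```r
# factors for deciles 1 to 10, copied from the question
fa <- c(0.116, 0.106, 0.102, 0.115, 0.124, 0.125, 0.137, 0.141, 0.156, 0.131)

# illustrative data (assumed); 0 and 11 have no matching decile
carro <- data.frame(decile_renda = c(8, 3, 0, 11, 1), renda_fa = 100)

# position of each decile in 1:10, NA where there is no match
pos <- match(carro$decile_renda, 1:10)

# look up the factor and multiply; non-matching rows get 0
carro$custo_tprivate <- ifelse(is.na(pos), 0, carro$renda_fa * fa[pos])
carro
```

Because the factors live in a plain vector, changing or adding a case only means editing fa, not the logic.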

Alternatively, this can be solved by left-joining with a lookup table.
If I understand the nested ifelse() construct correctly, the OP wants to multiply renda_fa by a factor which depends on the value of decile_renda. The factors are given for 10 distinct values of decile_renda. In all other cases, the result must be zero.
Using left join will find matching values of decile_renda in the lookup table. Non-matching rows in carro will get an NA value. These need to be replaced by zero, subsequently.
The lookup table treats the single use cases as data instead of hard-coding. This gives the flexibility to add or change the use cases without changing the code.
Create lookup table and test dataset
# create lookup table
lut <- data.frame(
decile_renda = 1:10,
fa = c(0.116, 0.106, 0.102, 0.115, 0.124, 0.125, 0.137, 0.141, 0.156, 0.131)
)
lut
decile_renda fa
1 1 0.116
2 2 0.106
3 3 0.102
4 4 0.115
5 5 0.124
6 6 0.125
7 7 0.137
8 8 0.141
9 9 0.156
10 10 0.131
# create test dataset
carro <- data.frame(decile_renda = 0:11, renda_fa = 100)
# randomize row order
set.seed(1L) # required for reproducible data
carro <- carro[sample(nrow(carro)), ]
carro
decile_renda renda_fa
9 8 100
4 3 100
7 6 100
1 0 100
2 1 100
5 4 100
3 2 100
8 7 100
6 5 100
11 10 100
12 11 100
10 9 100
Note that the test dataset has been chosen to allow for easy verification of the results.
Base R: merge()
carro <- merge(carro, lut, all.x = TRUE, by = "decile_renda")
carro$custo_tprivate <- with(carro, ifelse(is.na(fa), 0, renda_fa * fa))
carro
decile_renda renda_fa fa custo_tprivate
1 0 100 NA 0.0
2 1 100 0.116 11.6
3 2 100 0.106 10.6
4 3 100 0.102 10.2
5 4 100 0.115 11.5
6 5 100 0.124 12.4
7 6 100 0.125 12.5
8 7 100 0.137 13.7
9 8 100 0.141 14.1
10 9 100 0.156 15.6
11 10 100 0.131 13.1
12 11 100 NA 0.0
Note that the result column custo_tprivate shows 0.0 for the rows with non-matching decile_renda values of 0 and 11 as requested.
However, the drawback here is that merge() does not maintain the original row order (this is why the test dataset uses a random row order for demonstration). Also, the result contains the fa column which is no longer needed.
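If the original row order matters, one possible workaround (a sketch, not part of the original answer) is to store a row index before merging and sort by it afterwards. The helper column row_id and the small carro data frame are assumptions made for illustration.

```r
# lookup table with the factors from the question
lut <- data.frame(
  decile_renda = 1:10,
  fa = c(0.116, 0.106, 0.102, 0.115, 0.124, 0.125, 0.137, 0.141, 0.156, 0.131)
)

# illustrative data in a deliberately unsorted order
carro <- data.frame(decile_renda = c(8L, 3L, 0L, 11L, 1L), renda_fa = 100)

# remember the original position of every row (hypothetical helper column)
carro$row_id <- seq_len(nrow(carro))

# left join; merge() sorts the result by the key column
res <- merge(carro, lut, all.x = TRUE, by = "decile_renda")
res$custo_tprivate <- with(res, ifelse(is.na(fa), 0, renda_fa * fa))

# restore the original order and drop the helper columns
res <- res[order(res$row_id), c("decile_renda", "renda_fa", "custo_tprivate")]
rownames(res) <- NULL
res
```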
dplyr
library(dplyr)
carro %>%
left_join(lut, by = "decile_renda") %>%
mutate(custo_tprivate = if_else(is.na(fa), 0, renda_fa * fa)) %>%
select(-fa)
decile_renda renda_fa custo_tprivate
1 8 100 14.1
2 3 100 10.2
3 6 100 12.5
4 0 100 0.0
5 1 100 11.6
6 4 100 11.5
7 2 100 10.6
8 7 100 13.7
9 5 100 12.4
10 10 100 13.1
11 11 100 0.0
12 9 100 15.6
Here, the original row order is kept and the fa column has been removed.
data.table
With data.table we can do an update join where the matching rows of carro are updated by reference, i.e., without copying the whole object. Only the result column custo_tprivate is appended to carro; fa is not, so it does not have to be removed afterwards. Also, when replacing the NA values, only the affected rows are updated in place.
This might be an advantage in terms of speed and memory consumption in case of large datasets.
library(data.table)
setDT(carro)[lut, on = .(decile_renda), custo_tprivate := renda_fa * fa]
carro[is.na(custo_tprivate), custo_tprivate := 0]
carro
decile_renda renda_fa custo_tprivate
1: 8 100 14.1
2: 3 100 10.2
3: 6 100 12.5
4: 0 100 0.0
5: 1 100 11.6
6: 4 100 11.5
7: 2 100 10.6
8: 7 100 13.7
9: 5 100 12.4
10: 10 100 13.1
11: 11 100 0.0
12: 9 100 15.6

Related

Dataframe calculation, anchor cell value to formula

I would like to do some calculations with the following dataframe. There are some values in specific cells of a column, and I would like to have them replicated based on a second column value, and store these in a new, third column:
x <- c ("1", "2","3", "4")
z <- (rep(x,5))
batch <- sort(z)
NDF <- rnorm(20, 10, 1); NDF <- signif (NDF, digits =3)
Fibre_analysis <- data.frame(batch, NDF)
Fibre_analysis$NDF[[1]] <- 10
Fibre_analysis$NDF[[6]] <- 100
Fibre_analysis$NDF[[11]] <- 1000
Fibre_analysis$NDF[[16]] <- 10000
This is the table that I would like:
batch NDF NEW_column
1 1 10.00 10
2 1 10.80 10
3 1 9.44 10
4 1 10.30 10
5 1 11.60 10
6 2 100.00 100
7 2 8.26 100
8 2 9.15 100
9 2 9.40 100
10 2 8.53 100
11 3 1000.00 1000
12 3 9.41 1000
13 3 9.20 1000
14 3 10.30 1000
15 3 9.32 1000
16 4 10000.00 10000
17 4 11.20 10000
18 4 7.33 10000
19 4 9.34 10000
20 4 11.00 10000
I would like this to create a new column in the dataframe, with absolute cell values from $NDF that have to change for each value of $batch.
Because I need to use this process more than once I created the following function:
batch_Function <- function (x,y){
ifelse (x =="1", y[[1]],
ifelse (x =="2", y[[6]],
ifelse (x =="3", y[[11]],
y[[16]] )))
print (y)
}
when I call the function:
Fibre_analysis$NEW_column <- batch_Function ( Fibre_analysis$batch , Fibre_analysis$NDF )
I expect $NEW_column to look like this:
x <- c(10,100,1000,10000)
NEW_column <- rep(x, each=5)
whereas instead it is the exact same copy of the $NDF.
The only necessary change is to drop print(y): as the last expression in the function, it makes the function return y instead of the ifelse() result:
batch_Function <- function (x, y) {
ifelse (x =="1", y[[1]],
ifelse (x =="2", y[[6]],
ifelse (x =="3", y[[11]],
y[[16]] )))
}
batch_Function (Fibre_analysis$batch , Fibre_analysis$NDF )
# [1] 10 10 10 10 10 100 100 100 100 100 1000 1000 1000 1000
# [15] 1000 10000 10000 10000 10000 10000
In case you still want print(y), you may put it at the beginning of batch_Function.
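If the anchor value is always the first row of each batch (an assumption that holds for the example data, where rows 1, 6, 11 and 16 start the batches), the hard-coded positions can be avoided. A minimal sketch using base R's ave(), with the NDF values copied from the table in the question:

```r
# example data reconstructed from the table in the question
batch <- rep(c("1", "2", "3", "4"), each = 5)
NDF <- c(10, 10.8, 9.44, 10.3, 11.6,
         100, 8.26, 9.15, 9.40, 8.53,
         1000, 9.41, 9.20, 10.3, 9.32,
         10000, 11.2, 7.33, 9.34, 11.0)
Fibre_analysis <- data.frame(batch, NDF)

# within each batch, repeat the first NDF value for every row of that batch
Fibre_analysis$NEW_column <- ave(Fibre_analysis$NDF, Fibre_analysis$batch,
                                 FUN = function(v) v[1])
Fibre_analysis
```

This keeps working when the number of batches or the rows per batch change, which the nested ifelse() version does not.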

Get column mean every block on n rows based on condition

I have this dataframe
r2 distance
1 33.64 67866
2 8.50 77229
3 15.07 109119
4 24.35 142279
5 7.74 143393
6 8.21 177670
7 12.26 216440
8 12.66 253751
9 26.31 282556
10 39.08 320816
I need to calculate the mean of column r2 for every block of rows where the difference between two values in the column distance is less than or equal to 100000.
For this example the desired output would be:
mean_r2 diff_of_distance
1 17.86 75527 ## mean of rows 1 to 5; distance 5 - distance 1
2 13.91 66164 ## mean of rows 2 to 5; distance 5 - distance 2
3 13.84 68551 ## mean of rows 3 to 6; distance 6 - distance 3
4 13.14 74161 ## mean of rows 4 to 7; distance 7 - distance 4
5 9.40 73047 ## mean of rows 5 to 7; distance 7 - distance 5
6 11.04 76081 ## mean of rows 6 to 8; distance 8 - distance 6
and so on.
Edit 1: I have more than 100,000 rows.
Thanks.
Loop through each value of distance, subtract it from the values in the distance vector and test if the result is less than 100000. This creates a boolean vector which you sum to identify the index at which the distance becomes greater than 100000 (i.e. bool becomes FALSE). Use this index to identify your block, then take the mean of r2 in each block.
To speed up the code, define your vector type and length up front (to avoid "growing vectors" on each iteration).
means <- vector("numeric", length = nrow(df))
rows <- vector("character", length = nrow(df)) # holds labels such as "1 - 5"
distance_diff <- vector("numeric", length = nrow(df))
for (i in seq_along(df$distance)) {
dis_val <- df$distance[i] # the ith distance value
bools <- (df$distance - dis_val) < 100000 # bool indicating if difference between i and every value in vector is less than 100000
block_range <- sum(bools)# taking sum of bools identifies the value at which the distance becomes > 100000
rows[i] <- paste(as.character(i), "-", as.character(block_range))
means[i] <- mean(df$r2[i:block_range]) # take the mean of r2 in the range i to all rows where distance is < 100000
distance_diff[i] <- df$distance[block_range] - dis_val # minus the distance from the value before distance is > 100000 from i
}
data.frame(mean_r2 = means, rows= rows, diff_of_distance=distance_diff)
mean_r2 rows diff_of_distance
1 17.860000 1 - 5 75527
2 13.915000 2 - 5 66164
3 13.842500 3 - 6 68551
4 13.140000 4 - 7 74161
5 9.403333 5 - 7 73047
6 11.043333 6 - 8 76081
7 17.076667 7 - 9 66116
8 26.016667 8 - 10 67065
9 32.695000 9 - 10 38260
10 39.080000 10 - 10 0
You can try:
# your data
d <- read.table(text="r2 distance
1 33.64 67866
2 8.50 77229
3 15.07 109119
4 24.35 142279
5 7.74 143393
6 8.21 177670
7 12.26 216440
8 12.66 253751
9 26.31 282556
10 39.08 320816", header=T)
library(tidyverse) #dplyr_0.7.2
d %>%
mutate(index=1:n()) %>% # add row index
group_by(index) %>% # group by this index
# calculate difference and find max row where diff < 100000
mutate(max_row=max(which(.$distance - distance < 100000, arr.ind=T))) %>%
# calculate mean
mutate(mean_r2=mean(.$r2[index:max_row])) %>%
# calculate the difference
mutate(diff_of_distance=.$distance[max_row] - .$distance[index]) %>%
# unite the columns
unite(rows, index, max_row, sep = "-")
# A tibble: 10 x 5
r2 distance rows mean_r2 diff_of_distance
* <dbl> <int> <chr> <dbl> <int>
1 33.64 67866 1-5 17.860000 75527
2 8.50 77229 2-5 13.915000 66164
3 15.07 109119 3-6 13.842500 68551
4 24.35 142279 4-7 13.140000 74161
5 7.74 143393 5-7 9.403333 73047
6 8.21 177670 6-8 11.043333 76081
7 12.26 216440 7-9 17.076667 66116
8 12.66 253751 8-10 26.016667 67065
9 26.31 282556 9-10 32.695000 38260
10 39.08 320816 10-10 39.080000 0
This works because group_by() subsets the dataframe, so within mutate() you can access the distance value of the current group and calculate its difference to the complete vector using .$distance, which accesses the complete column regardless of group_by().
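Both approaches above compare every row with the whole distance column, which may become slow for more than 100,000 rows. As a hedged alternative sketch (not from the original answers), the block ends can be found with findInterval() and the block means with cumulative sums. It assumes distance is sorted in increasing order; the names max_gap, start, end and cs are made up for illustration.

```r
# data from the question
d <- data.frame(
  r2 = c(33.64, 8.50, 15.07, 24.35, 7.74, 8.21, 12.26, 12.66, 26.31, 39.08),
  distance = c(67866, 77229, 109119, 142279, 143393,
               177670, 216440, 253751, 282556, 320816)
)

max_gap <- 100000              # assumed block width from the question
start <- seq_len(nrow(d))      # every row starts its own block

# index of the last row whose distance is strictly less than distance[i] + max_gap
# (left.open = TRUE makes findInterval() count the values strictly below each limit)
end <- findInterval(d$distance + max_gap, d$distance, left.open = TRUE)

# cumulative sums give each block sum as a difference of two entries
cs <- c(0, cumsum(d$r2))

res <- data.frame(
  mean_r2 = (cs[end + 1] - cs[start]) / (end - start + 1),
  rows = paste(start, end, sep = "-"),
  diff_of_distance = d$distance[end] - d$distance
)
res
```

The work here is one binary search per row plus one pass for the cumulative sums, instead of a full scan of the column for every row.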

R is not ordering data correctly - skips E values

I am trying to order data by the column weightFisher. However, it is almost as if R does not process e values as low, because all the e values are skipped when I try to order from smallest to greatest.
Code:
resultTable_bon <- GenTable(GOdata_bon,
weightFisher = resultFisher_bon,
weightKS = resultKS_bon,
topNodes = 15136,
ranksOf = 'weightFisher'
)
head(resultTable_bon)
#create Fisher ordered df
indF <- order(resultTable_bon$weightFisher)
resultTable_bonF <- resultTable_bon[indF, ]
what resultTable_bon looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
1 GO:0019373 epoxygenase P450 pathway 19 13 1.12 1
2 GO:0097267 omega-hydroxylase P450 pathway 9 7 0.53 2
3 GO:0042738 exogenous drug catabolic process 10 7 0.59 3
weightFisher weightKS
1 1.9e-12 0.79744
2 7.9e-08 0.96752
3 2.5e-07 0.96336
what "ordered" resultTable_bonF looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
17 GO:0014075 response to amine 33 7 1.95 17
18 GO:0034372 very-low-density lipoprotein particle re... 11 5 0.65 18
19 GO:0060710 chorio-allantoic fusion 6 4 0.35 19
weightFisher weightKS
17 0.00014 0.96387
18 0.00016 0.83624
19 0.00016 0.92286
As @bhas says, it appears to be working precisely as you want it to. Maybe it's the use of head() that's confusing you?
To put your mind at ease, try it with something simpler
dtf <- data.frame(a=c(1, 8, 6, 2)^-10, b=c(7, 2, 1, 6))
dtf
# a b
# 1 1.000000e+00 7
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
dtf[order(dtf$a), ]
# a b
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
# 1 1.000000e+00 7
Try the following:
resultTable_bon$weightFisher <- as.numeric(resultTable_bon$weightFisher)
Then:
resultTable_bonF <- resultTable_bon[order(resultTable_bon$weightFisher), ]
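The conversion matters if weightFisher is stored as character rather than numeric (an assumption about the table, which can be checked with str(resultTable_bon)). A small sketch with made-up values in the same format shows the difference: text is compared character by character, so strings starting with "0" sort before strings such as "1.9e-12".

```r
# p-values stored as text, in the format shown in the question (values assumed)
p_chr <- c("1.9e-12", "7.9e-08", "0.00014", "0.00016")

# ordering the strings compares them as text, not as numbers
ord_text <- order(p_chr)

# ordering after conversion compares the numeric values
ord_num <- order(as.numeric(p_chr))

p_chr[ord_text]  # the "0.000..." entries come first
p_chr[ord_num]   # smallest p-value first
```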

change code because of unwanted factors

So basically my code above simply takes the numbers in steps of 5 and calculates the standard deviation of the values within each step. So if I have sample data like this
Number STD
1 11.15
2 11.18
3 11.21
4 11.24
5 11.3
10 11.36
11 11.42
12 11.48
13 11.54
14 11.6
15 11.66
16 11.72
17 11.78
18 11.84
19 11.9
20 11.96
When I run my code, I'll get this output
Number STD
1 1 0.05770615
2 2 NA
3 3 0.09486833
4 4 0.09486833
So what I want to do is simply replace the NA with 0. Also, instead of getting factors like 1,2,3,4 etc., I want to get 5,10,15,20,25 etc.
Another way of doing it:
# Generate data
number <- c(1:5, 10:20)
val <- c(11.15, 11.18, 11.21, 11.24, 11.30, 11.36, 11.42,
11.48, 11.54, 11.60, 11.66, 11.72, 11.78, 11.84, 11.90, 11.96)
data <- data.frame(number, val)
# Calculate SD
breaks <- seq(0, 20, 5)
splitted.data <- split(data$val, f=cut(data$number, breaks, labels=F))
err <- sapply(splitted.data, sd)
err[is.na(err)] <- 0
res <- cbind(Number = breaks[-1], STD = err)
Resulting in:
> res
Number STD
1 5 0.05770615
2 10 0.00000000
3 15 0.09486833
4 20 0.09486833
I haven't tried to rewrite what you are trying to do, but just for the sake of continuity:
You can use the labels argument of cut to set the labels of the resulting categories.
Change NA to 0 using spread[is.na(spread)] <- 0
The full code is:
hunter <- lapply(hunt, function(i) {
random <- cut(value[,i],seq(0,max(value[i]),5),
labels=seq(5,max(value[i]),5))
spread<-tapply(value[,i+1],random, sd,na.rm=TRUE)
spread[is.na(spread)] <- 0
Number<-levels(as.factor(random))
d <- data.frame(Number=Number,STD=spread)
})
Number STD
5 5 0.05770615
10 10 0.00000000
15 15 0.09486833
20 20 0.09486833
Using the data.table package, you can accomplish this in one call:
library(data.table)
DT <- data.table(value)
As a single call:
DT[, list(SD = ifelse(is.na(sd(STD)), 0, sd(STD)))
, by=list("Group" = factor(G <- (Number-1) %/% 5, labels=(unique(G) + 1)*5))]
Group SD
1: 5 0.05770615
2: 10 0.00000000
3: 15 0.09486833
4: 20 0.09486833
Breaking it down:
# you can create your groupings by
(Number-1) %/% 5 # (i.e., the integer quotient when divided by 5)
# you can create your factor levels by
5 * ((Number-1) %/% 5 + 1)
# calculate the Group:
DT[, grp := factor(G <- (Number-1) %/% 5, labels=(unique(G) + 1)*5)]
# calculate the SD by Group, replacing NA's with 0:
DT[, SD := ifelse(is.na(sd(STD)), 0, sd(STD)), by=grp]
unique(DT[, list(grp, SD)])
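The same grouping idea can be sketched in base R without extra packages. The snippet below reuses the sample values from the question; the names grp and err are illustrative, and the label of each group is computed as the upper end of its block of 5.

```r
# sample data from the question
number <- c(1:5, 10:20)
val <- c(11.15, 11.18, 11.21, 11.24, 11.30, 11.36, 11.42,
         11.48, 11.54, 11.60, 11.66, 11.72, 11.78, 11.84, 11.90, 11.96)

# upper end of the block of 5 each number falls into: 1..5 -> 5, 6..10 -> 10, ...
grp <- 5 * ceiling(number / 5)

# standard deviation per group; a group with a single value gives NA
err <- tapply(val, grp, sd)
err[is.na(err)] <- 0

res <- data.frame(Number = as.numeric(names(err)), STD = as.vector(err))
res
```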

Selecting top finite number of rows for each unique value of a column in a data frame in R

I have a data frame with 3 columns: a, b, c. There are multiple rows corresponding to each unique value of column a. I want to select the top 5 rows for each unique value of column a. Column c is some value and the data frame is already sorted by it in descending order, so that would not be a problem. Can anyone please suggest how I can do this in R?
Stealing @ptocquin's example, here's how you can use the base function by. You can flatten the result using do.call (see below).
> by(data = data, INDICES = data$a, FUN = function(x) head(x, 5))
# or by(data = data, INDICES = data$a, FUN = head, 5)
data$a: 1
a b c
21 1 0.1188552 1.6389895
41 1 1.0182033 1.4811359
61 1 -0.8795879 0.7784072
81 1 0.6485745 0.7734652
31 1 1.5102255 0.7107957
------------------------------------------------------------
data$a: 2
a b c
15 2 -1.09704040 1.1710693
85 2 0.42914795 0.8826820
65 2 -1.01480957 0.6736782
45 2 -0.07982711 0.3693384
35 2 -0.67643885 -0.2170767
------------------------------------------------------------
A similar thing could be achieved by splitting your data.frame based on a and then using lapply to step through each element subsetting first n rows.
split.data <- split(data, data$a)
subsetted.data <- lapply(split.data, FUN = function(x) head(x, 5)) # or ..., FUN = head, 5) like above
flatten.data <- do.call("rbind", subsetted.data)
head(flatten.data)
a b c
1.21 1 0.11885516 1.63898947
1.41 1 1.01820329 1.48113594
1.61 1 -0.87958790 0.77840718
1.81 1 0.64857445 0.77346517
1.31 1 1.51022545 0.71079568
2.15 2 -1.09704040 1.17106930
2.85 2 0.42914795 0.88268205
2.65 2 -1.01480957 0.67367823
2.45 2 -0.07982711 0.36933837
2.35 2 -0.67643885 -0.21707668
Here is my try :
library(plyr)
data <- data.frame(a=rep(sample(1:20,10),10),b=rnorm(100),c=rnorm(100))
data <- data[rev(order(data$c)),]
head(data, 15)
a b c
28 6 1.69611039 1.720081
91 11 1.62656460 1.651574
70 9 -1.17808386 1.641954
6 15 1.23420550 1.603140
23 7 0.70854914 1.588352
51 11 -1.41234359 1.540738
19 10 2.83730734 1.522825
49 10 0.39313579 1.370831
80 9 -0.59445323 1.327825
59 10 -0.55538404 1.214901
18 6 0.08445888 1.152266
86 15 0.53027267 1.066034
69 10 -1.89077464 1.037447
62 1 -0.43599566 1.026505
3 7 0.78544009 1.014770
result <- ddply(data, .(a), "head", 5)
head(result, 15)
a b c
1 1 -0.43599566 1.02650544
2 1 -1.55113486 0.36380251
3 1 0.68608364 0.30911430
4 1 -0.85406406 0.05555500
5 1 -1.83894595 -0.11850847
6 5 -1.79715809 0.77760033
7 5 0.82814909 0.22401278
8 5 -1.52726859 0.06745849
9 5 0.51655092 -0.02737905
10 5 -0.44004646 -0.28106808
11 6 1.69611039 1.72008079
12 6 0.08445888 1.15226601
13 6 -1.99465060 0.82214319
14 6 0.43855489 0.76221979
15 6 -2.15251353 0.64417757
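Another base R option, shown here only as a sketch, numbers the rows within each value of a and keeps those numbered 1 to 5. Unlike the split/rbind and ddply versions, it keeps the original row order and row names. The example data and the name rank_in_group are assumptions for illustration.

```r
# illustrative data: 4 groups of 8 rows, sorted by c in descending order
set.seed(1)
data <- data.frame(a = rep(1:4, each = 8), b = rnorm(32), c = rnorm(32))
data <- data[order(data$c, decreasing = TRUE), ]

# running row number within each value of a (1 = first row seen for that group)
rank_in_group <- ave(seq_len(nrow(data)), data$a, FUN = seq_along)

# keep the first 5 rows of every group
top5 <- data[rank_in_group <= 5, ]
top5
```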
