R: fast way to perturb the columns of a data frame

I have a data frame that I'm trying to do some scenario analysis with. It looks like this:
Revenue Item_1 Item_2 Item_3
552 200 220 45
1500 400 300 200
2300 600 400 300
I'd like to generate something where one item at a time is increased or decreased by some fixed amount (i.e. 1 unit), like this:
Revenue Item_1 Item_2 Item_3
552 201 220 45
1500 401 300 200
2300 601 400 300
552 200 221 45
1500 400 301 200
2300 600 401 300
552 200 220 46
1500 400 300 201
2300 600 400 301
I'm currently doing it in loop like this but am wondering if there's a faster way:
l1 <- list()
increment_amt <- 1
for(i in c('Item_1','Item_2','Item_3')){
newDf <- df1
newDf[,i] <- newDf[,i] + increment_amt
l1[[i]] <- newDf
}
df2 <- do.call(rbind, l1)
Any suggestions?

With lapply,
do.call(rbind, lapply(names(dat)[2:4], function(x) {dat[,x] <- dat[,x] + 1; dat}))
Revenue Item_1 Item_2 Item_3
1 552 201 220 45
2 1500 401 300 200
3 2300 601 400 300
4 552 200 221 45
5 1500 400 301 200
6 2300 600 401 300
7 552 200 220 46
8 1500 400 300 201
9 2300 600 400 301
Of course, do.call / rbind can be replaced with data.table's faster rbindlist, which returns a data.table.
library(data.table)
rbindlist(lapply(names(dat)[2:4], function(x) {dat[,x] <- dat[,x] + 1; dat}))
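If the per-column copies themselves become the bottleneck, one possible alternative is to stack the data once and add the increment through matrix indexing. This is only a sketch: it assumes all columns are numeric (so the round trip through as.matrix is lossless), and the names items, block and idx are made up for illustration.

```r
# Sample data from the question
df1 <- data.frame(Revenue = c(552, 1500, 2300),
                  Item_1  = c(200, 400, 600),
                  Item_2  = c(220, 300, 400),
                  Item_3  = c(45, 200, 300))
items <- c("Item_1", "Item_2", "Item_3")   # columns to perturb (assumed)
increment_amt <- 1

n <- nrow(df1)
# Stack one copy of the data per perturbed column
df2 <- df1[rep(seq_len(n), times = length(items)), ]
rownames(df2) <- NULL

# block[k] says which item is perturbed in stacked row k
block <- rep(seq_along(items), each = n)
# Two-column (row, column) index of the cells that receive the increment
idx <- cbind(seq_len(nrow(df2)), match(items, names(df2))[block])

m <- as.matrix(df2)            # safe here because every column is numeric
m[idx] <- m[idx] + increment_amt
df2 <- as.data.frame(m)
```

The result has the same row order as the loop version: all rows with Item_1 perturbed, then Item_2, then Item_3.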

# Data frame
df <- data.frame(Item_1= c(200, 400, 600),
Item_2= c(220, 300, 400),
Item_3= c(45, 200, 300))
# Perturbation
p <- 1
# Add p to one column at a time (returns a list with one data frame per column)
df.new <- apply(diag(ncol(df)) * p, MARGIN = 1, function(x) data.frame(t(t(df) + x)))
[[1]]
Item_1 Item_2 Item_3
1 201 220 45
2 401 300 200
3 601 400 300
[[2]]
Item_1 Item_2 Item_3
1 200 221 45
2 400 301 200
3 600 401 300
[[3]]
Item_1 Item_2 Item_3
1 200 220 46
2 400 300 201
3 600 400 301

We can write a function and use lapply to achieve this task. df is your original data frame. df_list is a list with all final outputs. You can later use df2 <- do.call(rbind, df_list), or bind_rows from dplyr.
# A function to add 1 to all numbers in a column
add_one <- function(Col, dt){
dt[, Col] <- dt[, Col] + 1
return(dt)
}
# Get the column names
Col_vec <- colnames(df)[2:ncol(df)]
# Apply the add_one function
df_list <- lapply(Col_vec, add_one, dt = df)
# Combine all results
df2 <- dplyr::bind_rows(df_list)

You can use the perturb function from the perturb package (library(perturb)). The code is as follows:
# using the most important features, we create a model
m1 <- lm(revenue ~ item1 + item2 + item3)
#summary(m1)
#anova(m1)
#install.packages("perturb")
library(perturb)
set.seed(1234)
p1_new <- perturb(m1, pvars=c("item1","item2") , prange = c(1,1),niter=20)
p1_new
summary(p1_new)

Related

Sequential calculations fail in R

I tried to do some calculations with a constant and several variables in a dataframe.
For example we can use the following dummy data
constant <- 100
df <- as.data.frame(cbind(c(1,2,3,4,5),
c(4,3,6,1,4),
c(2,5,6,6,2),
c(5,5,5,1,2),
c(3,6,4,3,1)))
colnames(df) <- c("aa", "bb", "cc", "dd", "ee")
Now say that for every row in my dataframe I want to multiply my constant with variable bb, then cc, and then dd sequentially. I tried
answers <- sapply(df, function(x) constant * (1 + x[,2:4])
and similar attempts with lapply.
How would I go about it so that I get constant * bb * cc * dd? They are percentages, which is why I have the (1 + ...) there.
Try this approach with apply():
#Data
constant <- 100
df <- as.data.frame(cbind(c(1,2,3,4,5),
c(4,3,6,1,4),
c(2,5,6,6,2),
c(5,5,5,1,2),
c(3,6,4,3,1)))
colnames(df) <- c("aa", "bb", "cc", "dd", "ee")
#Apply
answers <- as.data.frame(t(apply(df,1, function(x) constant * (1 + x))))
Output:
answers
aa bb cc dd ee
1 200 500 300 600 400
2 300 400 600 600 700
3 400 700 700 600 500
4 500 200 700 200 400
5 600 500 300 300 200
Or using dplyr with across():
library(dplyr)
#Code
answer <- df %>% mutate(across(everything(),~constant * (1 + .)))
Output:
aa bb cc dd ee
1 200 500 300 600 400
2 300 400 600 600 700
3 400 700 700 600 500
4 500 200 700 200 400
5 600 500 300 300 200
Or with the same sapply():
#Code 3
answers <- sapply(df,function(x) constant * (1 + x))
answers <- as.data.frame(answers)
Output:
aa bb cc dd ee
1 200 500 300 600 400
2 300 400 600 600 700
3 400 700 700 600 500
4 500 200 700 200 400
5 600 500 300 300 200
Either of these options will produce the same output:
#Code 4
answers <- as.data.frame(do.call(cbind,lapply(df,function(x) constant * (1 + x))))
#Code 5
answers <- as.data.frame(mapply(function(x) constant * (1 + x),x=df))
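The options above transform every column independently. If the goal is instead the sequential product described in the question, a minimal sketch could look like the following. It assumes the intended result per row is constant * (1 + bb) * (1 + cc) * (1 + dd); the names growth and result are introduced here only for illustration.

```r
constant <- 100
df <- data.frame(aa = c(1, 2, 3, 4, 5),
                 bb = c(4, 3, 6, 1, 4),
                 cc = c(2, 5, 6, 6, 2),
                 dd = c(5, 5, 5, 1, 2),
                 ee = c(3, 6, 4, 3, 1))

# Growth factors for the three columns of interest
growth <- 1 + df[, c("bb", "cc", "dd")]

# Multiply the columns together element-wise, then scale by the constant
df$result <- constant * Reduce(`*`, growth)
```

Reduce walks over the columns of growth and multiplies them pairwise, so each row ends up with the product of its own three factors.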

How to get 3 lists with no duplicates in a random sampling? (R)

I have done the first step:
how many persons have more than 1 point
how many persons have more than 3 points
how many persons have more than 6 points
My goal:
I need to have random samples (with no duplicates of persons)
of 3 persons that have more than 1 point
of 3 persons that have more than 3 points
of 3 persons that have more than 6 points
My dataset looks like this:
id person points
201 rt99 NA
201 rt99 3
201 rt99 2
202 kt 4
202 kt NA
202 kt NA
203 rr 4
203 rr NA
203 rr NA
204 jk 2
204 jk 2
204 jk NA
322 knm3 5
322 knm3 NA
322 knm3 3
343 kll2 2
343 kll2 1
343 kll2 5
344 kll NA
344 kll 7
344 kll 1
345 nn 7
345 nn NA
490 kk 1
490 kk NA
490 kk 2
491 ww 1
491 ww 1
489 tt 1
489 tt 1
325 ll 1
325 ll 1
325 ll NA
That is what I have already tried to code, here is an example of code for finding persons that have more than 1 point:
persons_filtered <- dataset %>%
group_by(person) %>%
dplyr::filter(sum(points, na.rm = T)>1) %>%
distinct(person) %>%
pull()
persons_filtered
more_than_1 <- sample(persons_filtered, size = 3)
Question:
How can I write this code better so that I end up with 3 lists of unique persons? (I need to prevent the same person from appearing in more than one list.)
Here's a tidyverse solution, where the sampling in the three categories of interest is made at the same time.
library(tidyverse)
dataset %>%
# Group by person
group_by(person) %>%
# Get points sum
summarize(sum_points = sum(points, na.rm = T)) %>%
# Classify the sum points into categories defined by breaks: (0,1], (1,3], (3,6], (6,Inf]
# I used Inf as the last break so that all point sums above 6 get classified as (6,Inf]
mutate(point_class = cut(sum_points, breaks = c(0,1,3,6,Inf))) %>%
# ungroup
ungroup() %>%
# group by point class
group_by(point_class) %>%
# Sample 3 rows per point_class
sample_n(size = 3) %>%
# Eliminate the sum_points column
select(-sum_points) %>%
# If you need this data in lists you can nest the results in the sampled_data column
nest(sampled_data= -point_class)
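The pipeline above places each person in exactly one bin. If the three groups should instead keep the cumulative thresholds from the question (more than 1, more than 3, more than 6 points) while still avoiding duplicates, one possible sketch in base R is to sample from the strictest threshold first and remove the chosen persons from the later pools. The totals vector below was computed by hand from the sample data, and the names thresholds and picked are assumptions made for illustration.

```r
set.seed(42)  # only to make the sketch reproducible

# Point totals per person, summed from the sample data (NA ignored)
totals <- c(rt99 = 5, kt = 4, rr = 4, jk = 4, knm3 = 8, kll2 = 8,
            kll = 8, nn = 7, kk = 3, ww = 2, tt = 2, ll = 2)

thresholds <- c(6, 3, 1)   # strictest first, so scarce candidates are not used up
picked <- list()

for (th in thresholds) {
  # Eligible persons that have not been drawn for an earlier list
  pool <- setdiff(names(totals)[totals > th], unlist(picked))
  picked[[paste0("more_than_", th)]] <- sample(pool, size = 3)
}

picked
```

Sampling the strictest threshold first matters: persons with more than 6 points also qualify for the other two lists, so drawing them last could leave too few candidates.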

Finding overlap in dataframe ranges in R

I have two bedfiles as dataframes in R, for which I want to map all overlapping regions to each other (similar to what bedtools closest would be able to do).
BedA:
chr start end
2 100 500
2 200 250
3 275 300
BedB:
chr start end
2 210 265
2 99 106
8 275 290
BedOut:
chr start.A end.A start.B end.B
2 100 500 210 265
2 100 500 99 106
2 200 250 210 265
Now, I found this very similar question, which suggests using IRanges. The proposed approach seems to work, but I have no idea how to turn the output into a data frame like "BedOut".
Another data.table option using foverlaps:
setkeyv(BedA, names(BedA))
setkeyv(BedB, names(BedB))
ans <- foverlaps(BedB, BedA, nomatch=0L)
setnames(ans, c("start","end","i.start","i.end"), c("start.A","end.A","start.B","end.B"))
output:
chr start.A end.A start.B end.B
1: 2 100 500 99 106
2: 2 100 500 210 265
3: 2 200 250 210 265
data:
library(data.table)
BedA <- fread("chr start end
2 100 500
2 200 250
3 275 300")
BedB <- fread("chr start end
2 210 265
2 99 106
8 275 290")
Here is a solution using the data.table package.
library(data.table)
chr = c(2,2,3)
start.A = c(100, 200, 275)
end.A = c(500, 250, 300)
df_A = data.table(chr, start.A, end.A)
chr = c(2,2,8)
start.B = c(210, 99, 275)
end.B = c(265, 106, 290)
df_B = data.table(chr, start.B, end.B)
First, inner join the data tables on the key chr:
df_out = df_B[df_A, on="chr", nomatch=0]
Then filter the overlapping interval:
df_out = df_out[(start.A>=start.B & start.A<=end.B) | (start.B>=start.A & start.B<=end.A)]
setcolorder(df_out, c("chr", "start.A", "end.A", "start.B", "end.B"))
chr start.A end.A start.B end.B
1: 2 100 500 210 265
2: 2 100 500 99 106
3: 2 200 250 210 265
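For comparison, the same join-then-filter idea can be sketched in base R with merge, without data.table. This is only an illustration on the sample data: it pairs every row of BedA with every row of BedB on the same chromosome, which may be too memory-hungry for large BED files. Two closed intervals overlap exactly when each one starts no later than the other ends.

```r
BedA <- data.frame(chr = c(2, 2, 3), start = c(100, 200, 275), end = c(500, 250, 300))
BedB <- data.frame(chr = c(2, 2, 8), start = c(210, 99, 275), end = c(265, 106, 290))

# All A/B pairs on the same chromosome; shared column names get .A / .B suffixes
pairs <- merge(BedA, BedB, by = "chr", suffixes = c(".A", ".B"))

# Keep only the pairs whose intervals overlap
BedOut <- pairs[pairs$start.A <= pairs$end.B & pairs$start.B <= pairs$end.A, ]
rownames(BedOut) <- NULL
BedOut
```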

opposite of mutate_at dplyr

The mutate_at function from dplyr lets us mutate multiple selected columns by applying a function to each of them. I need the opposite: applying multiple functions to the same column, or applying the same function multiple times to the same column. Take the following reproducible example.
> main <- structure(list(PolygonId = c(0L, 1L, 1612L, 3L, 2L, 1698L), Area = c(3.018892,
1.995702, 0.582808, 1.176975, 2.277057, 0.014854), Perimeter = c(10.6415,
8.6314, 4.8478, 6.1484, 9.2226, 0.6503), h0 = c(1000,500,700,1000,200,1200)), .Names = c("PolygonId",
"Area", "Perimeter", "h0"), row.names = c(NA, 6L), class = "data.frame")
> main
PolygonId Area Perimeter h0
1 0 3.018892 10.6415 1000
2 1 1.995702 8.6314 500
3 1612 0.582808 4.8478 700
4 3 1.176975 6.1484 1000
5 2 2.277057 9.2226 200
6 1698 0.014854 0.6503 1200
I am only concerned about h0 column in the df main.
Expected outcome:
The h10 field is h0 + 10% of h0 and h_10 is h0 - 10% of h0
PolygonId Area Perimeter h0 h10 h20 h_10 h_20
1 0 3.018892 10.6415 1000 1100 1200 900 800
2 1 1.995702 8.6314 500 550 600 450 400
3 1612 0.582808 4.8478 700 770 840 630 560
4 3 1.176975 6.1484 1000 1100 1200 900 800
5 2 2.277057 9.2226 200 220 240 180 160
6 1698 0.014854 0.6503 1200 1320 1440 1080 960
I'd usually do this:
calcH <- function(h, pc){
h + pc / 100 * h
}
new_main <- mutate ( main,
h10 = calcH(h0, 10),
h20 = calcH(h0, 20),
h_10 = calcH(h0, -10),
h_20 = calcH(h0, -20)
)
But this will lead to long, tedious code, since I have to do this calculation for 1%, 2.5%, 5%, 7.5%, 10%, 12.5%, 15%... 30%, in both the positive and negative directions.
mutate_at can use multiple functions, but they need to exist in the environment as named functions (they can't be anonymous functions). So something like
pcts<-rep(c(1,2.5*1:12),2)*c(-1,1)
for(i in pcts){
assign(gsub("-","_",paste0("h",i)),eval(parse(text=sprintf("function(x) x*(100+%f)/100",i)))) }
main %>% mutate_at(vars(h0),gsub("-","_",paste0("h",pcts)))
would work
I like to solve this kind of problem using a long data representation:
library(dplyr)
library(tidyr)
# create data frame with join helper and multiplier-values:
bla <- data.frame(mult = seq(-.1, .1, .01),
join = TRUE)
# join, calculate values, create names, transform to wide:
main %>%
mutate(join = TRUE) %>%
left_join(bla) %>%
mutate(h0 = h0*(1+mult),
mult = sub(x = paste0("h", mult*100), pattern = "-", replacement = "_")) %>%
select(-join) %>%
spread(mult, h0)
This is easy in base R. The idea is to create a vector with the required percentages, loop over that vector and calculate your metric, i.e.
v1 <- c(1, seq(2.5, 30, by = 2.5), seq(-30, -2.5, by = 2.5), -1)
sapply(v1, function(i) calcH(main$h0, i))
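The sapply call returns an unnamed matrix with one column per percentage. A minimal sketch of naming those columns and attaching them to the data frame follows; it uses a shortened percentage vector and only the h0 column of main, and the names new_cols and new_main are chosen here for illustration.

```r
calcH <- function(h, pc) {
  h + pc / 100 * h
}

# Only the column of interest from the question's data
main <- data.frame(h0 = c(1000, 500, 700, 1000, 200, 1200))

# Shortened percentage vector for the example
v1 <- c(10, 20, -10, -20)

# One column per percentage
new_cols <- sapply(v1, function(i) calcH(main$h0, i))

# Name the columns h10, h20, h_10, h_20 (minus sign replaced by underscore)
colnames(new_cols) <- sub("-", "_", paste0("h", v1), fixed = TRUE)

new_main <- cbind(main, new_cols)
new_main
```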
Here's another approach similar to @andyyy's, but using rlang instead:
library(dplyr)
library(rlang)
percent <- c(1, 2.5*1:12)
calc_expr <- function(percent_vec){
parse_exprs(paste(paste0("h0+(",percent_vec,"/100*h0)"), collapse = ";"))
}
main %>%
mutate(!!!calc_expr (percent), !!!calc_expr (percent*-1)) %>%
setNames(c(colnames(main), paste0("h", percent), paste0("h_", percent)))
Result:
PolygonId Area Perimeter h0 h1 h2.5 h5 h7.5 h10 h12.5 h15 h17.5 h20 h22.5 h25 h27.5
1 0 3.018892 10.6415 1000 1010 1025.0 1050 1075.0 1100 1125.0 1150 1175.0 1200 1225.0 1250 1275.0
2 1 1.995702 8.6314 500 505 512.5 525 537.5 550 562.5 575 587.5 600 612.5 625 637.5
3 1612 0.582808 4.8478 700 707 717.5 735 752.5 770 787.5 805 822.5 840 857.5 875 892.5
4 3 1.176975 6.1484 1000 1010 1025.0 1050 1075.0 1100 1125.0 1150 1175.0 1200 1225.0 1250 1275.0
5 2 2.277057 9.2226 200 202 205.0 210 215.0 220 225.0 230 235.0 240 245.0 250 255.0
6 1698 0.014854 0.6503 1200 1212 1230.0 1260 1290.0 1320 1350.0 1380 1410.0 1440 1470.0 1500 1530.0
h30 h_1 h_2.5 h_5 h_7.5 h_10 h_12.5 h_15 h_17.5 h_20 h_22.5 h_25 h_27.5 h_30
1 1300 990 975.0 950 925.0 900 875.0 850 825.0 800 775.0 750 725.0 700
2 650 495 487.5 475 462.5 450 437.5 425 412.5 400 387.5 375 362.5 350
3 910 693 682.5 665 647.5 630 612.5 595 577.5 560 542.5 525 507.5 490
4 1300 990 975.0 950 925.0 900 875.0 850 825.0 800 775.0 750 725.0 700
5 260 198 195.0 190 185.0 180 175.0 170 165.0 160 155.0 150 145.0 140
6 1560 1188 1170.0 1140 1110.0 1080 1050.0 1020 990.0 960 930.0 900 870.0 840
Notes:
Using the vector of percentages, I construct multiple expressions using paste0 and parse_exprs then unquote and splice them in mutate using !!!. Finally, rename the columns using setNames.

How to make new variable across conditions

I need to calculate a new variable (a new Pheno) from my data using conditions.
The data set is huge.
The data set has the columns Animal, Record, Days, Pheno:
A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450
Conditions are:
The constant pheno per day is 2.
If a record's days are more than 305, the old pheno should be kept.
If a record has less than 305 days but has later records, the pheno should be kept.
If a record has less than 305 days and no later records, it should be calculated as (305 - days) * constant + pheno, e.g. (305 - 260) * 2 + 290 = 380.
Example: animal 1 has less than 305 days for both records. The first record keeps the same value in the new pheno, but the second record is the last one and has less than 305 days, so we need to recalculate: (305 - 230) * 2 + 290 = 440.
The final data will look like:
A R D P N_P
1 1 240 300 300
1 2 230 290 440
2 1 305 350 350
2 2 260 290 380
3 1 350 450 450
How can I do this in R or on Linux?
Here is a solution with base R
df <- read.table(header=TRUE, text=
"A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450")
newP <- function(d) {
np <- numeric(nrow(d))
for (i in 1:nrow(d)) {
if (d$D[i] > 305) { np[i] <- d$P[i]; next }
if (d$D[i] <= 305 && i<nrow(d)) { np[i] <- d$P[i]; next }
np[i] <- (305-d$D[i])*2 + d$P[i]
}
d$N_P <- np
return(d)
}
D <- split(df, df$A)
D2 <- lapply(D, newP)
do.call(rbind, D2)
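Since the question mentions a huge data set, a vectorized variant of the same rule may be worth trying instead of splitting and looping. This is a sketch under one assumption that the source does not state explicitly: R numbers the records within each animal, so the record with the largest R is the last one. The name is_last is introduced here for illustration.

```r
df <- data.frame(A = c(1, 1, 2, 2, 3),
                 R = c(1, 2, 1, 2, 1),
                 D = c(240, 230, 305, 260, 350),
                 P = c(300, 290, 350, 290, 450))

# TRUE for the last record of each animal (assumes the largest R is the last record)
is_last <- df$R == ave(df$R, df$A, FUN = max)

# Recalculate only last records with fewer than 305 days; keep P otherwise
df$N_P <- ifelse(df$D < 305 & is_last, (305 - df$D) * 2 + df$P, df$P)
df
```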
Check this out (I assume R is the record number in sorted order, so if an animal has 10 records the last will have R = 10):
library(dplyr)
df <- data.frame(A=c(1,1,2,2,3),
R=c(1,2,1,2,1),
D=c(240,230,305,260,350),
P=c(300,290,350,290,450))
df %>% group_by(A) %>%
mutate(N_P=ifelse(( D<305 & R==n()), # check if D<305 & Record is last record
((305-D)*2)+P # calculate new P
,P)) # Else : use old P
Source: local data frame [5 x 5]
Groups: A [3]
A R D P N_P
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 240 300 300
2 1 2 230 290 440
3 2 1 305 350 350
4 2 2 260 290 380
5 3 1 350 450 450
If you have predefined constants that depend on R value in the df, for example :
const <- c(1,2,1.5,2.5,3)
You can replace R in the code by const[R]
df %>% group_by(A) %>%
mutate(N_P=ifelse(( D<305 & R==n()), # check if D<305 & Record is last record
((305-D)*const[R])+P # calculate new P
,P)) # Else : use old P
