opposite of mutate_at dplyr - r

We know the mutate_at function from dplyr allows us to select multiple columns and apply the same function to each of them. I need the opposite of that: apply multiple functions (or the same function with different arguments) to the same column. Take the following reproducible example.
> main <- structure(list(PolygonId = c(0L, 1L, 1612L, 3L, 2L, 1698L), Area = c(3.018892,
1.995702, 0.582808, 1.176975, 2.277057, 0.014854), Perimeter = c(10.6415,
8.6314, 4.8478, 6.1484, 9.2226, 0.6503), h0 = c(1000,500,700,1000,200,1200)), .Names = c("PolygonId",
"Area", "Perimeter", "h0"), row.names = c(NA, 6L), class = "data.frame")
> main
PolygonId Area Perimeter h0
1 0 3.018892 10.6415 1000
2 1 1.995702 8.6314 500
3 1612 0.582808 4.8478 700
4 3 1.176975 6.1484 1000
5 2 2.277057 9.2226 200
6 1698 0.014854 0.6503 1200
I am only concerned with the h0 column in the df main.
Expected outcome:
The h10 field is h0 + 10% of h0, and h_10 is h0 - 10% of h0.
PolygonId Area Perimeter h0 h10 h20 h_10 h_20
1 0 3.018892 10.6415 1000 1100 1200 900 800
2 1 1.995702 8.6314 500 550 600 450 400
3 1612 0.582808 4.8478 700 770 840 630 560
4 3 1.176975 6.1484 1000 1100 1200 900 800
5 2 2.277057 9.2226 200 220 240 180 160
6 1698 0.014854 0.6503 1200 1320 1440 1080 960
I'd usually do this:
calcH <- function(h, pc) {
  h + pc / 100 * h
}
new_main <- mutate(main,
                   h10  = calcH(h0, 10),
                   h20  = calcH(h0, 20),
                   h_10 = calcH(h0, -10),
                   h_20 = calcH(h0, -20))
But this is going to be hectic and long code, since I have to do this calculation for 1%, 2.5%, 5%, 7.5%, 10%, 12.5%, 15%, ... 30%, in both the positive and negative directions.

mutate_at can use multiple functions, but they need to exist in the environment as named functions (they can't be anonymous functions). So something like
# each percentage appears once positive and once negative: the base vector has
# odd length (13), so recycling c(-1, 1) across two copies flips the signs
pcts <- rep(c(1, 2.5 * 1:12), 2) * c(-1, 1)
for (i in pcts) {
  assign(gsub("-", "_", paste0("h", i)),
         eval(parse(text = sprintf("function(x) x*(100+%f)/100", i))))
}
main %>% mutate_at(vars(h0), gsub("-", "_", paste0("h", pcts)))
would work
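A quick sanity check (a sketch, assuming the loop above has run): the generated helpers are ordinary named functions, which is exactly what mutate_at needs in order to look them up by name.
ls(pattern = "^h[0-9_]")    # "h1" "h2.5" ... "h_1" "h_2.5" ...
get("h10")(main$h0)         # same result as calcH(main$h0, 10)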

I like to solve these kinds of problems using a long data representation:
library(dplyr)
library(tidyr)
# create a data frame with a join helper and the multiplier values:
bla <- data.frame(mult = seq(-.1, .1, .01),
                  join = TRUE)
# join, calculate values, create names, transform to wide:
main %>%
  mutate(join = TRUE) %>%
  left_join(bla) %>%
  mutate(h0 = h0 * (1 + mult),
         mult = sub(x = paste0("h", mult * 100), pattern = "-", replacement = "_")) %>%
  select(-join) %>%
  spread(mult, h0)

This is easy in base R. The idea is to create a vector with the required percentages, loop over that vector and calculate your metric, i.e.
v1 <- c(1, seq(2.5, 30, by = 2.5), seq(-30, -2.5, by = 2.5), -1)
sapply(v1, function(i) calcH(main$h0, i))
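To attach the results back to main with readable names, something like this should work (a minimal completion sketch reusing calcH and v1 from above):
res <- sapply(v1, function(i) calcH(main$h0, i))
colnames(res) <- sub("-", "_", paste0("h", v1))  # h1, h2.5, ..., h_2.5, h_1
new_main <- cbind(main, res)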

Here's another approach, similar to @andyyy's, but using rlang instead:
library(dplyr)
library(rlang)
percent <- c(1, 2.5 * 1:12)
calc_expr <- function(percent_vec) {
  parse_exprs(paste(paste0("h0+(", percent_vec, "/100*h0)"), collapse = ";"))
}
main %>%
  mutate(!!!calc_expr(percent), !!!calc_expr(percent * -1)) %>%
  setNames(c(colnames(main), paste0("h", percent), paste0("h_", percent)))
Result:
PolygonId Area Perimeter h0 h1 h2.5 h5 h7.5 h10 h12.5 h15 h17.5 h20 h22.5 h25 h27.5
1 0 3.018892 10.6415 1000 1010 1025.0 1050 1075.0 1100 1125.0 1150 1175.0 1200 1225.0 1250 1275.0
2 1 1.995702 8.6314 500 505 512.5 525 537.5 550 562.5 575 587.5 600 612.5 625 637.5
3 1612 0.582808 4.8478 700 707 717.5 735 752.5 770 787.5 805 822.5 840 857.5 875 892.5
4 3 1.176975 6.1484 1000 1010 1025.0 1050 1075.0 1100 1125.0 1150 1175.0 1200 1225.0 1250 1275.0
5 2 2.277057 9.2226 200 202 205.0 210 215.0 220 225.0 230 235.0 240 245.0 250 255.0
6 1698 0.014854 0.6503 1200 1212 1230.0 1260 1290.0 1320 1350.0 1380 1410.0 1440 1470.0 1500 1530.0
h30 h_1 h_2.5 h_5 h_7.5 h_10 h_12.5 h_15 h_17.5 h_20 h_22.5 h_25 h_27.5 h_30
1 1300 990 975.0 950 925.0 900 875.0 850 825.0 800 775.0 750 725.0 700
2 650 495 487.5 475 462.5 450 437.5 425 412.5 400 387.5 375 362.5 350
3 910 693 682.5 665 647.5 630 612.5 595 577.5 560 542.5 525 507.5 490
4 1300 990 975.0 950 925.0 900 875.0 850 825.0 800 775.0 750 725.0 700
5 260 198 195.0 190 185.0 180 175.0 170 165.0 160 155.0 150 145.0 140
6 1560 1188 1170.0 1140 1110.0 1080 1050.0 1020 990.0 960 930.0 900 870.0 840
Notes:
Using the vector of percentages, I construct multiple expressions with paste0 and parse_exprs, then unquote-splice them into mutate using !!!. Finally, I rename the columns using setNames.
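For reference, this is roughly what parse_exprs returns for a toy two-percentage input; !!! then splices this list of unevaluated expressions into mutate:
rlang::parse_exprs("h0+(1/100*h0); h0+(2.5/100*h0)")
# [[1]]
# h0 + (1/100 * h0)
#
# [[2]]
# h0 + (2.5/100 * h0)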

Related

Divide columns by other columns and itself depending on index in dplyr

library(dplyr)
set.seed(1)
df <- data.frame(dddt_a = sample(1:1000, 1000, replace = TRUE),
                 dddt_b = sample(1:1000, 1000, replace = TRUE),
                 dddt_c = sample(1:1000, 1000, replace = TRUE),
                 dddt_d = sample(1:1000, 1000, replace = TRUE),
                 index = as.character(sample(c("a", "b"), 1000, replace = TRUE)))
I want to divide each column by either dddt_a or dddt_b, depending on what the index is. If the index is "a", then divide all columns except the index by dddt_a, and if index == "b", divide all columns except the index by dddt_b. The way it is set up now, this only divides dddt_a by dddt_a (giving 1) but leaves the other columns untouched (likewise for index == "b").
df1 <- df %>%
  mutate_at(.vars = vars(starts_with("dddt")),
            .funs = list(~ifelse(index == "a", ./dddt_a, ./dddt_b)))
head(df1)
dddt_a dddt_b dddt_c dddt_d index
1 1.0000000 686 474 756 a
2 0.7388466 1 681 726 b
3 1.0000000 218 570 448 a
4 2.0086393 1 830 958 b
5 1.0000000 989 590 128 a
6 1.0000000 128 978 144 a
A workaround is to store the denominator variable outside, split the data by index, divide everything, and put it back together (I ran it only for index == "a" here). However, this should be possible in dplyr, I'm sure...?
ind_a <- df$dddt_a[df$index == "a"]
dfa <- df %>%
  filter(index == "a") %>%
  mutate_at(.vars = vars(starts_with("dddt")),
            .funs = ~ . / !!ind_a)
Related to what seems to be the same issue: in a follow-up step I want to sum the values up, again depending on the index variable:
df2 <- df1 %>%
  mutate(SUMS = ifelse(index == "a",
                       1 + dddt_b + dddt_c + dddt_d,
                       1 + dddt_a + dddt_c + dddt_d))
However, this sums all variables up...
head(df2)
dddt_a dddt_b dddt_c dddt_d index SUMS
1 1.0000000 686 474 756 a 1917.000
2 0.7388466 1 681 726 b 1408.739
3 1.0000000 218 570 448 a 1237.000
4 2.0086393 1 830 958 b 1791.009
5 1.0000000 989 590 128 a 1708.000
6 1.0000000 128 978 144 a 1251.000
But for the first row, for example, SUMS should be equal to 1916:
rowSums(df2[1,2:4]) #the result should be 1916 not 1917
1916
Thanks for the help.
Create a new column after dividing. (In your attempt, mutate_at overwrote the denominator column, dddt_a or dddt_b, with 1 before the later columns were computed, so those columns were divided by the already-modified value; writing the results to new columns avoids that.)
library(dplyr)
df %>%
  mutate_at(vars(starts_with("dddt")),
            list(new = ~ifelse(index == "a", ./dddt_a, ./dddt_b))) %>%
  head
# dddt_a dddt_b dddt_c dddt_d index dddt_a_new dddt_b_new dddt_c_new dddt_d_new
#1 836 686 474 756 a 1.000 0.821 0.567 0.904
#2 679 919 681 726 b 0.739 1.000 0.741 0.790
#3 129 218 570 448 a 1.000 1.690 4.419 3.473
#4 930 463 830 958 b 2.009 1.000 1.793 2.069
#5 509 989 590 128 a 1.000 1.943 1.159 0.251
#6 471 128 978 144 a 1.000 0.272 2.076 0.306
If you want, you can then select only the "_new" columns, or rename them to names of your choice.
We can also use case_when
library(dplyr)
df %>%
  mutate_at(vars(starts_with("dddt")),
            list(new = ~case_when(index == "a" ~ ./dddt_a, TRUE ~ ./dddt_b)))
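For the follow-up SUMS question, here is a sketch building on the *_new columns above (untested on your full data): the self-division always contributes exactly 1, so row-summing the scaled columns and subtracting 1 leaves the sum of the other three.
df %>%
  mutate_at(vars(starts_with("dddt")),
            list(new = ~ifelse(index == "a", ./dddt_a, ./dddt_b))) %>%
  mutate(SUMS = rowSums(select(., ends_with("_new"))) - 1)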

Capture the column index in R or Excel for a series of data for a condition

I would like to capture the index value for any value less than 500 for a series of data.
Below is what my data looks like:
Category,Price1,Price2,Price3,Price4,Price5,Price6
Product1,967,855,929,811,501,387
Product2,526,809,723,304,315,671
Product3,412,133,369,930,400,337
Product4,709,241,625,822,967,952
Product5,395,506,110,280,829,817
Product6,803,618,794,214,605,788
For example, in the first row, Price6 is the first element of the series Price1 to Price6 whose value is less than 500, hence "First" is 6 in the output.
Similarly, for the second row, Price4 is less than 500 and then Price5 is less than 500, hence First and Second are 4 and 5 respectively for the series Price1 to Price6.
When nothing is captured by the logic, I want to place a "-" instead.
Below is the output I am looking for.
Category,Price1,Price2,Price3,Price4,Price5,Price6,First,Second,Third,Fourth,Fifth,Sixth
Product1,967,855,929,811,501,387,6,-,-,-,-,-
Product2,526,809,723,304,315,671,4,5,-,-,-,-
Product3,412,133,369,930,400,337,1,2,3,5,6,-
Product4,709,241,625,822,967,952,2,-,-,-,-,-
Product5,395,506,110,280,829,817,1,3,4,-,-,-
Product6,803,618,794,214,605,788,4,-,-,-,-,-
Not sure how to do the same in R or Excel.
Any leads would be highly appreciated.
Thanks,
Using data.table:
library(data.table)
# build one comma-separated string of matching positions per Category,
# then split it into separate columns
dt[, when := melt(dt, id.vars = "Category")[, toString(which(value < 500)), Category][, V1]]
cbind(dt, dt[, tstrsplit(when, ", ", fill = "-")])
Gives
Category Price1 Price2 Price3 Price4 Price5 Price6 when V1 V2 V3 V4 V5
1: Product1 967 855 929 811 501 387 6 6 - - - -
2: Product2 526 809 723 304 315 671 4, 5 4 5 - - -
3: Product3 412 133 369 930 400 337 1, 2, 3, 5, 6 1 2 3 5 6
4: Product4 709 241 625 822 967 952 2 2 - - - -
5: Product5 395 506 110 280 829 817 1, 3, 4 1 3 4 - -
6: Product6 803 618 794 214 605 788 4 4 - - - -
Now you just need to replace the names V1-V5 and drop column when.
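For example, one possible finishing step (a sketch; out is just an illustrative name):
out <- cbind(dt, dt[, tstrsplit(when, ", ", fill = "-")])
setnames(out, paste0("V", 1:5),
         c("First", "Second", "Third", "Fourth", "Fifth"))
out[, when := NULL]
Note that only five split columns appear with this data, since the longest run of matches is five; a constant Sixth column of "-" could be appended if the fixed six-column layout is required.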
Data:
dt <- fread("Category,Price1,Price2,Price3,Price4,Price5,Price6
Product1,967,855,929,811,501,387
Product2,526,809,723,304,315,671
Product3,412,133,369,930,400,337
Product4,709,241,625,822,967,952
Product5,395,506,110,280,829,817
Product6,803,618,794,214,605,788")
One can try an apply and tidyr::separate based solution:
# First build, for each row, a comma-separated string of the positions of
# values < 500, left-aligned; pad the empty places with '-'
df$DesiredData <- apply(df[2:7], 1, function(x){
  idx <- which(x < 500)
  paste0(c(idx, rep("-", length(x) - length(idx))), collapse = ",")
})
library(tidyverse)
# Now use the `separate` function to split the column into the 6 desired columns
df %>% separate("DesiredData",
                c("First", "Second", "Third", "Fourth", "Fifth", "Sixth"), sep = ",")
#   Category Price1 Price2 Price3 Price4 Price5 Price6 First Second Third Fourth Fifth Sixth
# 1 Product1    967    855    929    811    501    387     6      -     -      -     -     -
# 2 Product2    526    809    723    304    315    671     4      5     -      -     -     -
# 3 Product3    412    133    369    930    400    337     1      2     3      5     6     -
# 4 Product4    709    241    625    822    967    952     2      -     -      -     -     -
# 5 Product5    395    506    110    280    829    817     1      3     4      -     -     -
# 6 Product6    803    618    794    214    605    788     4      -     -      -     -     -
Data:
df <- read.table(text="
Category,Price1,Price2,Price3,Price4,Price5,Price6
Product1,967,855,929,811,501,387
Product2,526,809,723,304,315,671
Product3,412,133,369,930,400,337
Product4,709,241,625,822,967,952
Product5,395,506,110,280,829,817
Product6,803,618,794,214,605,788",
header = TRUE, stringsAsFactors = FALSE, sep=",")

R Conditional summing

I've just started my adventure with programming in R. I need to create a program summing the numbers divisible by 3 and 5 in the range of 1 to 1000, using the %% operator. My idea was to create two matrices with the numbers from 1 to 1000 in one column and their remainders in the second one. However, I don't know how to sum the proper elements (a kind of "SUMIF" function from Excel). I attach all I've done below. Thanks in advance for your help!
s1 <- 1:1000
r3 <- s1 %% 3   # note: 'in' is a reserved word in R, so use a different name
m1 <- matrix(c(s1, r3), 1000, 2, byrow = FALSE)
s2 <- 1:1000
r5 <- s2 %% 5
m2 <- matrix(c(s2, r5), 1000, 2, byrow = FALSE)
Mathematically, the best way is probably to find the least common multiple of the two numbers and check the remainder vs that:
# borrowed from Roland Rau
# http://r.789695.n4.nabble.com/Greatest-common-divisor-of-two-numbers-td823047.html
gcd <- function(a,b) if (b==0) a else gcd(b, a %% b)
lcm <- function(a,b) abs(a*b)/gcd(a,b)
s <- seq(1000)
s[ (s %% lcm(3,5)) == 0 ]
# [1] 15 30 45 60 75 90 105 120 135 150 165 180 195 210
# [15] 225 240 255 270 285 300 315 330 345 360 375 390 405 420
# [29] 435 450 465 480 495 510 525 540 555 570 585 600 615 630
# [43] 645 660 675 690 705 720 735 750 765 780 795 810 825 840
# [57] 855 870 885 900 915 930 945 960 975 990
Since your s is every number from 1 to 1000, you could instead do
seq(lcm(3,5), 1000, by=lcm(3,5))
Just use sum on either result if that's what you want to do.
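For example, summing the multiples of 15 up to 1000:
sum(seq(lcm(3, 5), 1000, by = lcm(3, 5)))
# [1] 33165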
Props to @HoneyDippedBadger for figuring out what the OP was after.
See if this helps
x = 1:1000                          ## store numbers 1 to 1000 in variable x
x                                   ## print x
Div = x[x %% 3 == 0 & x %% 5 == 0]  ## extract the numbers divisible by both 3 and 5
Div                                 ## the numbers stored in Div
length(Div)
table(x %% 3 == 0 & x %% 5 == 0)    ## how many are TRUE for the given condition
sum(Div)                            ## sum of the numbers divisible by both 3 and 5

Binning a dataframe with equal frequency of samples

I have binned my data using the cut function
breaks<-seq(0, 250, by=5)
data<-split(df2, cut(df2$val, breaks))
My split dataframe looks like
... ...
$`(15,20]`
val ks_Result c
15 60 237
18 70 247
... ...
$`(20,25]`
val ks_Result c
21 20 317
24 10 140
... ...
My bins looks like
> table(data)
data
(0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
0 0 0 7 128 2748 2307
(35,40] (40,45] (45,50] (50,55] (55,60] (60,65] (65,70]
1404 11472 1064 536 7389 1008 1714
(70,75] (75,80] (80,85] (85,90] (90,95] (95,100] (100,105]
2047 700 329 1107 399 376 323
(105,110] (110,115] (115,120] (120,125] (125,130] (130,135] (135,140]
314 79 1008 77 474 158 381
(140,145] (145,150] (150,155] (155,160] (160,165] (165,170] (170,175]
89 660 15 1090 109 824 247
(175,180] (180,185] (185,190] (190,195] (195,200] (200,205] (205,210]
1226 139 531 174 1041 107 257
(210,215] (215,220] (220,225] (225,230] (230,235] (235,240] (240,245]
72 671 98 212 70 95 25
(245,250]
494
Taking the mean of the bin counts, I get on average ~900 samples:
> mean(table(data))
[1] 915.9
I want to tell R to make irregular bins in such a way that each bin will contain on average 900 samples (e.g. (0, 27] = 900, (27, 28.5] = 900, and so on). I found something similar here, but it deals with only one variable, not the whole dataframe.
I also tried the Hmisc package, but unfortunately the bins don't contain equal frequencies!
library(Hmisc)
data<-split(df2, cut2(df2$val, g=30, oneval=TRUE))
data<-split(df2, cut2(df2$val, m=1000, oneval=TRUE))
Assuming you want 50 equal-sized buckets (based on your seq statement), you can use something like:
df <- data.frame(var = runif(500, 0, 100))  # make data
cut.vec <- cut(
  df$var,
  breaks = quantile(df$var, 0:50/50),  # breaks along 1/50 quantiles
  include.lowest = TRUE
)
df.split <- split(df, cut.vec)
Hmisc::cut2 has this option built in as well.
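For example (a sketch on the same df as above), cut2's g argument requests that many quantile groups directly:
library(Hmisc)
df.split2 <- split(df, cut2(df$var, g = 50))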
This can be done with the function provided here by Joris Meys:
EqualFreq2 <- function(x, n){
  nx <- length(x)
  nrepl <- floor(nx/n)
  nplus <- sample(1:n, nx - nrepl*n)
  nrep <- rep(nrepl, n)
  nrep[nplus] <- nrepl + 1
  x[order(x)] <- rep(seq.int(n), nrep)
  x
}
data <- split(df2, EqualFreq2(df2$val, 25))
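A quick check (assuming data from the call above) that the resulting bins are near-equal in size:
table(sapply(data, nrow))  # counts should cluster around length(df2$val) / 25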

How to divide a set of overlapping ranges into non-overlapping ranges? but in R

Let's say we have two datasets:
assays:
BHID<-c(127,127,127,127,128)
FROM<-c(950,959,960,961,955)
TO<-c(958,960,961,966,969)
Cu<-c(0.3,0.9,2.5,1.2,0.5)
assays<-data.frame(BHID,FROM,TO,Cu)
and litho:
BHID<-c(125,127,127,127)
FROM<-c(940,949,960,962)
TO<-c(949,960,961,969)
ROCK<-c(1,1,2,3)
litho<-data.frame(BHID,FROM,TO,ROCK)
and I want to join the two sets; the result after running the algorithm would be:
BHID FROM TO CU ROCK
125 940 970 - 1
127 949 950 - 1
127 950 958 0.3 1
127 958 959 - 1
127 959 960 0.9 1
127 960 961 2.5 2
127 961 962 1.2 -
127 962 966 1.2 3
127 966 969 - 3
128 955 962 0.5 -
Use merge:
merge(assays, litho, all = TRUE)
In essence, all = TRUE is the SQL equivalent of a FULL OUTER JOIN. I haven't specified any columns, because in this case the merge function will perform the join across the columns with the same names.
Tough one, but the code seems to work. The idea is to first expand each row into many, each representing a one-unit increment from FROM to TO. After merging, identify contiguous rows and un-expand them. Obviously it is not a very efficient approach, so it may or may not work if your real data has very large FROM and TO ranges.
library(plyr)
ASSAYS <- adply(assays, 1, with, {
  SEQ <- seq(FROM, TO)
  data.frame(BHID,
             FROM = head(SEQ, -1),
             TO   = tail(SEQ, -1),
             Cu)
})
LITHO <- adply(litho, 1, with, {
  SEQ <- seq(FROM, TO)
  data.frame(BHID,
             FROM = head(SEQ, -1),
             TO   = tail(SEQ, -1),
             ROCK)
})
not.as.previous <- function(x) {
  x1 <- head(x, -1)
  x2 <- tail(x, -1)
  c(TRUE, !is.na(x1) & !is.na(x2) & x1 != x2 |
          is.na(x1) & !is.na(x2) |
          !is.na(x1) & is.na(x2))
}
MERGED <- merge(ASSAYS, LITHO, all = TRUE)
MERGED <- transform(MERGED,
                    gp.id = cumsum(not.as.previous(BHID) |
                                   not.as.previous(Cu) |
                                   not.as.previous(ROCK)))
merged <- ddply(MERGED, "gp.id", function(x) {
  out <- head(x, 1)
  out$TO <- tail(x$TO, 1)
  out
})
merged
merged
# BHID FROM TO Cu ROCK gp.id
# 1 125 940 949 NA 1 1
# 2 127 949 950 NA 1 2
# 3 127 950 958 0.3 1 3
# 4 127 958 959 NA 1 4
# 5 127 959 960 0.9 1 5
# 6 127 960 961 2.5 2 6
# 7 127 961 962 1.2 NA 7
# 8 127 962 966 1.2 3 8
# 9 127 966 969 NA 3 9
# 10 128 955 969 0.5 NA 10
Note that the first row is not exactly the same as in your expected output, but I think mine makes more sense.
