sumif in ifelse condition in R

I have a DT with multiple columns and I need to give a condition in ifelse and do the calculations accordingly. I want it to do count/sum(count) grouped by segment. Here is the DT
Segment Count Flag
A 23 Y
B 45 N
A 56 N
B 212 Y
I want the fourth column to be the count as a share of the segment's total count, depending on the flag, so the output should look something like this. For flag N, it is the share of the count within the segment. For flag Y, it is the revenue percentage if the No (N) were to become Yes (Y), i.e. the revenue that could be earned in that case. I am sorry that this is clumsy; kindly ask in the comments if you have any doubts.
Segment Count Flag Rev Value
A 23 Y 34 ((56/23)*34)/(34+69)
B 45 N 48 45/(45+212)
A 56 N 23 56/(56+23)
B 212 Y 67 ((45/212)*67)/(67+12)
A 65 Y 69 ...
B 10 Y 12 ...
Any help is appreciated. Thanks!

We can do this with data.table. Convert the 'data.frame' to a 'data.table' (setDT(DT)); grouped by 'Segment', create the 'Value' column by dividing 'Count' by the sum of 'Count'; then update 'Value' for the rows where 'Flag' is 'N', again grouped by 'Segment'.
library(data.table)
setDT(DT)[, Value := Count/sum(Count), Segment
][Flag == "N", Value := Count/sum(Count), Segment]
DT
# Segment Count Flag Value
#1: A 23 Y 0.18852459
#2: B 45 N 1.00000000
#3: A 56 N 1.00000000
#4: B 212 Y 0.78810409
#5: A 43 Y 0.35245902
#6: B 12 Y 0.04460967
Just checking against the OP's expected output for 'Value':
> 23/122
#[1] 0.1885246
> 212/269
#[1] 0.7881041
> 43/122
#[1] 0.352459
> 12/269
#[1] 0.04460967
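If data.table is not available, the same grouped share can be sketched in base R with ave(), which returns a per-group statistic aligned to the original rows. This is only a minimal sketch on the DT data from this answer; the name is_n is my own and not part of the original code.

```r
DT <- data.frame(
  Segment = c("A", "B", "A", "B", "A", "B"),
  Count   = c(23L, 45L, 56L, 212L, 43L, 12L),
  Flag    = c("Y", "N", "N", "Y", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Share of each row's Count within its Segment
DT$Value <- DT$Count / ave(DT$Count, DT$Segment, FUN = sum)

# For Flag == "N" rows, recompute the share among the "N" rows of the Segment only
is_n <- DT$Flag == "N"
DT$Value[is_n] <- DT$Count[is_n] /
  ave(DT$Count[is_n], DT$Segment[is_n], FUN = sum)

DT
```

Because each segment has a single "N" row here, those rows end up with a share of 1, the same as in the data.table output.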
Update 3
Based on update no. 3 in the OP's post:
s1 <- setDT(DT1)[, .(rn = .I[Flag == "Y"], Value = (Rev[Flag=="Y"] *
(Count[Flag == "N"]/Count[Flag=="Y"]))/sum(Rev[Flag == "Y"])), Segment]
s2 <- DT1[, .(rn = .I[Flag == "N"], Value = Count[Flag == "N"]/(Count[Flag == "N"] +
Count[Flag=="Y"][1])), Segment]
DT1[, Value := rbind(s1, s2)[order(rn)]$Value]
DT1
# Segment Count Flag Rev Value
#1: A 23 Y 34 0.8037146
#2: B 45 N 48 0.1750973
#3: A 56 N 23 0.7088608
#4: B 212 Y 67 0.1800215
#5: A 65 Y 69 0.5771471
#6: B 10 Y 12 0.6835443
>((56/23)*34)/(34+69)
#[1] 0.8037146
> 45/(45+212)
#[1] 0.1750973
> 56/(56+23)
#[1] 0.7088608
> ((45/212)*67)/(67+12)
#[1] 0.1800215
data
DT <- structure(list(Segment = c("A", "B", "A", "B", "A", "B"), Count = c(23L,
45L, 56L, 212L, 43L, 12L), Flag = c("Y", "N", "N", "Y", "Y",
"Y")), .Names = c("Segment", "Count", "Flag"), row.names = c(NA,
-6L), class = "data.frame")
DT1 <- structure(list(Segment = c("A", "B", "A", "B", "A", "B"), Count = c(23L,
45L, 56L, 212L, 65L, 10L), Flag = c("Y", "N", "N", "Y", "Y",
"Y"), Rev = c(34L, 48L, 23L, 67L, 69L, 12L)), .Names = c("Segment",
"Count", "Flag", "Rev"), class = "data.frame", row.names = c(NA,
-6L))
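The Update 3 calculation can also be written as one helper applied per segment, which may be easier to read than building s1 and s2 separately. This is a hedged base R sketch: segment_value is a hypothetical helper name, and it assumes exactly one "N" row and at least one "Y" row per segment, as in DT1.

```r
DT1 <- data.frame(
  Segment = c("A", "B", "A", "B", "A", "B"),
  Count   = c(23L, 45L, 56L, 212L, 65L, 10L),
  Flag    = c("Y", "N", "N", "Y", "Y", "Y"),
  Rev     = c(34L, 48L, 23L, 67L, 69L, 12L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: computes Value for the rows of one Segment.
# Assumes exactly one "N" row and at least one "Y" row in d.
segment_value <- function(d) {
  n_count <- d$Count[d$Flag == "N"]      # Count of the single "N" row
  first_y <- d$Count[d$Flag == "Y"][1]   # Count of the first "Y" row
  rev_y   <- sum(d$Rev[d$Flag == "Y"])   # total Rev over the "Y" rows
  ifelse(d$Flag == "Y",
         (n_count / d$Count) * d$Rev / rev_y,
         d$Count / (d$Count + first_y))
}

# split by Segment, compute, and put the values back in the original row order
DT1$Value <- unsplit(lapply(split(DT1, DT1$Segment), segment_value), DT1$Segment)
DT1
```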

Alternatively, we could use the dplyr package for this.
Updating based on the suggestions provided by #Aramis7d - thanks!
library(data.table)
df <- fread("Segment Count Flag
A 23 Y
B 45 N
A 56 N
B 212 Y
A 43 Y
B 12 Y")
library(dplyr)
df %>%
group_by(Segment) %>%
mutate(Value = Count/sum(Count)) %>%
group_by(Segment, Flag) %>%
mutate(Value = if_else( Flag == "N", Count/sum(Count), Value))

Related

Group values from a column based on another column's values in R

I have a dataframe and I would like all values in the second column to be stored together when they have the same value in the first column.
One of the difficulties is to put these values in quotation marks separated by semicolons only when there are several of them.
A 12
A 56
A 23
B 16
C 04
C 73
The result would be this :
A "12;56;23"
B 16
C "04;73"
I saw that the fill() function from tidyr does more or less the opposite of what I want, but I don't know of any function that can do this. If you can give me some clues, thanks!
We can use paste with collapse after grouping
aggregate(col2 ~ col1, df1, \(x)
if(length(x) > 1) dQuote(paste(x, collapse =";"), FALSE) else x)
-output
col1 col2
1 A "12;56;23"
2 B 16
3 C "4;73"
data
df1 <- structure(list(col1 = c("A", "A", "A", "B", "C", "C"), col2 = c(12L,
56L, 23L, 16L, 4L, 73L)), class = "data.frame", row.names = c(NA,
-6L))
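Note that with an integer column the "04" from the question prints as 4. If the leading zero matters, one option is to keep the second column as character. A minimal sketch, assuming col2 is stored as text; collapse_group is a name I made up for illustration:

```r
df1 <- data.frame(
  col1 = c("A", "A", "A", "B", "C", "C"),
  col2 = c("12", "56", "23", "16", "04", "73"),  # character, so "04" keeps its zero
  stringsAsFactors = FALSE
)

# Quote and join only when a group has more than one value
collapse_group <- function(x) {
  if (length(x) > 1) paste0('"', paste(x, collapse = ";"), '"') else x
}

res <- aggregate(col2 ~ col1, data = df1, FUN = collapse_group)
res
```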

Calculating sum of certain values across two columns in R

I currently have a dataframe like the one below of a bunch of pairwise correlations:
Data
structure(list(ID1 = c("A", "A", "A", "B", "B", "C"), ID2 = c("B",
"C", "D", "C", "D", "D"), cor = c(0.6, 0.6, 0.2, 0.1, 0.9, 0.2
), value1 = c(50L, 50L, 50L, 20L, 20L, 30L), value2 = c(20L,
30L, 100L, 30L, 100L, 100L)), class = "data.frame", row.names = c(NA,
-6L))
ID1 ID2 cor value1 value2
1 A B 0.6 50 20
2 A C 0.6 50 30
3 A D 0.2 50 100
4 B C 0.1 20 30
5 B D 0.9 20 100
6 C D 0.2 30 100
I'm trying to get the sum of all IDs (i.e. B) of the product between cor and either value1 or value2 depending on whether it is from ID1 or ID2.
For instance, the sum of B would be (cor x value)
(0.6 x 50) + (0.1 x 30) + (0.9 x 100)
I essentially would need to do this for around 20000 unique IDs. I hope this makes sense. I'm not that great in R (yet)!
Does this achieve what you need?
library(tidyverse)
df2 <- df %>%
pivot_longer(names_to = "names", values_to = "values", -c(cor:value2)) %>%
mutate(value = if_else(names == "ID1", value2, value1),
sum = cor * value) %>%
group_by(values) %>%
summarise(sum = sum(sum))
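For around 20000 IDs, the same stacking idea can be sketched in base R without any packages: put both ID columns into one long vector, pair each ID with cor times the value of the other ID in the row (as in the question's example for B), and sum by ID. The names long and totals are my own.

```r
df <- data.frame(
  ID1    = c("A", "A", "A", "B", "B", "C"),
  ID2    = c("B", "C", "D", "C", "D", "D"),
  cor    = c(0.6, 0.6, 0.2, 0.1, 0.9, 0.2),
  value1 = c(50L, 50L, 50L, 20L, 20L, 30L),
  value2 = c(20L, 30L, 100L, 30L, 100L, 100L),
  stringsAsFactors = FALSE
)

# One row per (ID, pair): an ID in ID1 is paired with value2,
# an ID in ID2 is paired with value1
long <- data.frame(
  ID   = c(df$ID1, df$ID2),
  prod = c(df$cor * df$value2, df$cor * df$value1)
)

# Sum the products per ID in a single vectorised pass
totals <- tapply(long$prod, long$ID, sum)
totals[["B"]]  # (0.6 x 50) + (0.1 x 30) + (0.9 x 100)
```

This avoids calling a function once per ID, which should matter more as the number of unique IDs grows.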
Unless you specifically want a dplyr answer, here is a quick, if slightly inelegant, way of doing it:
cond1 <- df$ID1 == "B"  # logical index: rows where B is in ID1
sum1 <- sum(df$cor[cond1] * df$value2[cond1])  # value of the other ID, as in the question's example
cond2 <- df$ID2 == "B"  # logical index: rows where B is in ID2
sum2 <- sum(df$cor[cond2] * df$value1[cond2])
finalsum <- sum1 + sum2
Basically, you first find the rows where B is in ID1 and take the product-sum there, then find the rows where B is in ID2 and do the same, each time multiplying cor by the value of the other ID in the pair.
Update:
What if you have thousands of IDs? Again, I like it quick, so wrap it in a function:
prodsum <- function(df, ID) {
  cond1 <- df$ID1 == ID
  sum1 <- sum(df$cor[cond1] * df$value2[cond1])
  cond2 <- df$ID2 == ID
  sum2 <- sum(df$cor[cond2] * df$value1[cond2])
  return(sum1 + sum2)
}
Then prodsum(df, "B") gives the answer to the original question, and you can use sapply() to cycle through thousands of IDs:
IDs <- unique(c(df$ID1, df$ID2))
sapply(IDs, function(x) prodsum(df, x))
If an ID appears only in ID1 or only in ID2, the logical index for the other column selects no rows, and sum() of an empty vector is 0, so no special case should be needed.
Another way of looking at it is shown below.
Assuming your data frame is named a:
library(dplyr)  # for %>%, group_by() and summarise()
a1 <- subset(a,select=c(ID1,cor,value1))
a2 <- subset(a,select=c(ID2,cor,value1))
colnames(a2)[colnames(a2) == "ID2"] <- "ID1"
a3 <- rbind(a1,a2)
a3$MULTIPLY1 <- a3$cor * a3$value1
a4 <- a3 %>% group_by(ID1) %>% summarise(FINALVALUE = sum(MULTIPLY1))
# A tibble: 4 x 2
ID1 FINALVALUE
<chr> <dbl>
1 A 70
2 B 50
3 C 38
4 D 34
Hope this will help to some extent...!

Replace certain columns in dataframe with corresponding names from another dataframe

I have a dataframe with SRR names as column headers, and I would like to replace those with their corresponding PI names from another dataframe, using dplyr.
SRR dataframe:
CHR POS ALLELE SRR6 SRR8 SRR9 SRR10
01 10 A,T A T T A
01 20 C,G G C C C
02 15 T T T T T
PI dataframe:
PI_NAME SRR_NAME
PI1 SRR6
PI2 SRR7
PI3 SRR8
PI4 SRR9
PI5 SRR10
Desired Output:
CHR POS ALLELE PI1 PI3 PI4 PI5
01 10 A,T A T T A
01 20 C,G G C C C
02 15 T T T T T
So far, I've tried something like this:
SRR %>%
rename_at(vars(matches("SRR")), funs(str_replace(., ., PI$PI_NAME[PI$SRR == .])))
but have not been successful.
Thanks in advance for any help.
We can use a named key/value vector to match the column names and replace the names
library(dplyr)
SRR %>%
rename_at(vars(matches("SRR")), ~ setNames(PI$PI_NAME, PI$SRR_NAME)[.])
# CHR POS ALLELE PI1 PI3 PI4 PI5
#1 1 10 A,T A T T A
#2 1 20 C,G G C C C
#3 2 15 T T T T T
It can be translated in base R as well
i1 <- grep("SRR", names(SRR))
names(SRR)[i1] <- setNames(PI$PI_NAME, PI$SRR_NAME)[names(SRR)[i1]]
data
SRR <- structure(list(CHR = c(1L, 1L, 2L), POS = c(10L, 20L, 15L), ALLELE = c("A,T",
"C,G", "T"), SRR6 = c("A", "G", "T"), SRR8 = c("T", "C", "T"),
SRR9 = c("T", "C", "T"), SRR10 = c("A", "C", "T")), class = "data.frame",
row.names = c(NA,
-3L))
PI <- structure(list(PI_NAME = c("PI1", "PI2", "PI3", "PI4", "PI5"),
SRR_NAME = c("SRR6", "SRR7", "SRR8", "SRR9", "SRR10")),
class = "data.frame", row.names = c(NA,
-5L))
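The lookup can also be sketched with match(), which leaves any column without an entry in PI untouched instead of turning its name into NA. This is an illustrative variant of the base R approach, not a replacement for it; idx and found are names I introduced.

```r
SRR <- data.frame(
  CHR = c(1L, 1L, 2L), POS = c(10L, 20L, 15L), ALLELE = c("A,T", "C,G", "T"),
  SRR6 = c("A", "G", "T"), SRR8 = c("T", "C", "T"),
  SRR9 = c("T", "C", "T"), SRR10 = c("A", "C", "T"),
  stringsAsFactors = FALSE
)
PI <- data.frame(
  PI_NAME  = c("PI1", "PI2", "PI3", "PI4", "PI5"),
  SRR_NAME = c("SRR6", "SRR7", "SRR8", "SRR9", "SRR10"),
  stringsAsFactors = FALSE
)

# Position of each column name in the lookup table (NA when there is no entry)
idx <- match(names(SRR), PI$SRR_NAME)
found <- !is.na(idx)

# Rename only the columns that have a matching SRR_NAME
names(SRR)[found] <- PI$PI_NAME[idx[found]]
names(SRR)
```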

R dplyr: mutate specific values with a reference value found in the same column

How can I mutate specific values in a column with a reference value found in the same column? The data frame looks like this:
A Ref 20
A S1 12
A S2 76
A S3 12
A S4 12
A XY 89
B Ref 02
B S1 12
B S2 42
B S3 21
B S4 12
B XY 56
I would like to mutate by having all S values divided by the Ref value, however not the values for XY.
Basically S1/Ref, S2/Ref, ... S4/Ref for A and B excluding values for XY.
Thanks in advance.
Here is one way to do this with dplyr. After grouping by the first column, say 'v1', divide 'v3' by the 'v3' value where 'v2' is 'Ref' (assuming there is only one 'Ref' per unique 'v1'), then reset 'newcol' to the original 'v3' value wherever 'v2' does not match "^S\d+", i.e. "S" followed by digits.
library(dplyr)
df1 %>%
group_by(v1) %>%
mutate(newcol = v3/v3[v2 == "Ref"],
newcol = ifelse(!grepl("^S\\d+", v2), v3, newcol))
# A tibble: 12 x 4
# Groups: v1 [2]
# v1 v2 v3 newcol
# <chr> <chr> <int> <dbl>
# 1 A Ref 20 20.0
# 2 A S1 12 0.6
# 3 A S2 76 3.8
# 4 A S3 12 0.6
# 5 A S4 12 0.6
# 6 A XY 89 89.0
# 7 B Ref 2 2.0
# 8 B S1 12 6.0
# 9 B S2 42 21.0
#10 B S3 21 10.5
#11 B S4 12 6.0
#12 B XY 56 56.0
If we only need to restore the 'XY' rows to their 'v3' values, replace the last line with newcol = ifelse(v2 == "XY", v3, newcol))
data
df1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("Ref", "S1", "S2", "S3", "S4", "XY",
"Ref", "S1", "S2", "S3", "S4", "XY"), v3 = c(20L, 12L, 76L, 12L,
12L, 89L, 2L, 12L, 42L, 21L, 12L, 56L)), .Names = c("v1", "v2",
"v3"), class = "data.frame", row.names = c(NA, -12L))
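The same division can be sketched in base R with a named lookup vector of the 'Ref' values. As above, this assumes exactly one 'Ref' row per 'v1' group; ref and is_s are names I chose for illustration.

```r
df1 <- data.frame(
  v1 = rep(c("A", "B"), each = 6),
  v2 = rep(c("Ref", "S1", "S2", "S3", "S4", "XY"), times = 2),
  v3 = c(20L, 12L, 76L, 12L, 12L, 89L, 2L, 12L, 42L, 21L, 12L, 56L),
  stringsAsFactors = FALSE
)

# Named vector: group label -> its Ref value (assumes one Ref per group)
is_ref <- df1$v2 == "Ref"
ref <- setNames(df1$v3[is_ref], df1$v1[is_ref])

# Divide only the S rows by their group's Ref; keep every other row as is
is_s <- grepl("^S[0-9]+$", df1$v2)
df1$newcol <- ifelse(is_s, df1$v3 / unname(ref[df1$v1]), df1$v3)
df1
```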

Round conditionally numbers in R

How can I round conditionally the values of a column in a dataframe in R? I need to round to the lower 10 from 0-89 and not from 90-100. For example:
ID value
A 15
B 47
C 91
D 92
has to be changed to
ID value
A 10
B 40
C 91
D 92
so, no changes for C/D and A/B rounded down
Any ideas?
Thanks
You can do it like this:
df$value[df$value < 90] <- floor(df$value[df$value < 90] / 10) * 10
# ID value
# 1 A 10
# 2 B 40
# 3 C 91
# 4 D 92
As a reminder, here is your data:
df <- structure(list(ID = c("A", "B", "C", "D"), value = c(15L, 47L,
91L, 92L)), .Names = c("ID", "value"), class = "data.frame", row.names = c(NA,
-4L))
Other solution using data.table:
library(data.table)
setDT(df)[, value:= as.numeric(value)][value<90, value:= floor(value/10) * 10]
# ID value
# 1: A 10
# 2: B 40
# 3: C 91
# 4: D 92
You could do:
df$value <- with(df, ifelse(value %in% c(0:89), value-(value%%10), value))
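If the values might not be whole numbers, the %in% 0:89 test would miss something like 47.5. A small helper based on a comparison avoids that. This is just a sketch; round_down_below and its threshold and step arguments are hypothetical names, not from any package.

```r
# Round values below `threshold` down to the nearest multiple of `step`;
# leave values at or above `threshold` unchanged
round_down_below <- function(x, threshold = 90, step = 10) {
  ifelse(x < threshold, floor(x / step) * step, x)
}

df <- data.frame(ID = c("A", "B", "C", "D"), value = c(15, 47, 91, 92))
df$value <- round_down_below(df$value)
df
```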
