Combine if else statement with two conditions - r

I have one dataset which contain two columns "Code" and "Gross_i". You can see data below:
# Data
TABLE<-data.frame(Code=as.integer(c("1","2","3","4","5")),
Gross_i=as.integer(c("10","20","30","40","50")
))
TAX_RATE1<-0.20
TAX_RATE2<-0.25
My intention here is to multiply second column "Gross_i" with two different tax rates.So I need to multiply first three code "1","2" and "3" with TAX_RATE1 (20%) and observation "4" and "5" with TAX_RATE2 (25%). In order to do this I try this line of code (If else statment) but results are not good:
pit_1=if_else(filter(Code %in% c("1","2","3")),Gross_i*TAX_RATE1,Gross_i*TAX_RATE2)
So can anybody help how to fix this line of code?

This approach can be useful with dplyr in the field:
library(dplyr)
#Code
TABLE %>% mutate(Value=if_else(Code %in% c("1","2","3"),Gross_i*TAX_RATE1,Gross_i*TAX_RATE2))
Output:
Code Gross_i Value
1 1 10 2.0
2 2 20 4.0
3 3 30 6.0
4 4 40 10.0
5 5 50 12.5

If you have only two tax rates, you can do :
library(dplyr)
TABLE %>% mutate(pit_1 = Gross_i * c(TAX_RATE2, TAX_RATE1)[(Code %in% 1:3) + 1])
# Code Gross_i pit_1
#1 1 10 2.0
#2 2 20 4.0
#3 3 30 6.0
#4 4 40 10.0
#5 5 50 12.5
If you lot of rates like this it would be easy to specify conditions within case_when :
TABLE %>%
mutate(pit_1 = Gross_i * case_when(Code %in% 1:3 ~ TAX_RATE1,
TRUE ~ TAX_RATE2))

Your attempt at if_else translates easily to the following using ifelse
pit_1 <- ifelse(TABLE$Code %in% c("1", "2", "3"),
yes = TABLE$Gross_i * TAX_RATE1,
no = TABLE$Gross_i * TAX_RATE2)

Related

Dividing all possible rows within a given sub-data in R

My data looks like this:
set <- c(1,1,1,2,2,3,3,3,3,3,4,4)
density <- c(1,3,3,1,3,1,1,1,3,3,1,3)
counts <- c(100,2,4,76,33,12,44,13,54,36,65,1)
data <- data.frame(set,density,counts)
data$set <- as.factor(data$set)
data$density <- as.factor(data$density)
Within a given set there are two levels of densities "1" or "3". For a given set, I want to divide all possible combinations of counts of density "1" and density "3". I then want to print the original density associated with density "1", the ratio, and the set
For example, the result for the first few rows should look like:
set counts ratio
1 100 50 #100/2
1 100 25 #100/4
2 76 2.3 #76/33
3 12 0.22 #12/54
3 12 0.33 #12/36
3 44 0.8148 #44/54
...
I thought I could achieve it by dplyr..but it seems a little too complicated for dplyr.
It looks like the comments get you most of the way there. Here's a dplyr solution. With left_join each of the density1's get matched up with all density3's in the same set, providing output in line with your specification.
# Edited below to use dplyr syntax; my base syntax had a typo
library(dplyr)
data_combined <- data %>% filter(density == 1) %>%
# Match each 1 w/ each 3 in the set
left_join(data %>% filter(density == 3), by = "set") %>%
mutate(ratio = counts.x / counts.y) %>%
select(set, counts.x, counts.y, ratio)
data_combined
# set counts.x counts.y ratio
#1 1 100 2 50.0000000
#2 1 100 4 25.0000000
#3 2 76 33 2.3030303
#4 3 12 54 0.2222222
#5 3 12 36 0.3333333
#6 3 44 54 0.8148148
#7 3 44 36 1.2222222
#8 3 13 54 0.2407407
#9 3 13 36 0.3611111
#10 4 65 1 65.0000000

R - Create random subsamples of size n for multiple sample groups

I have a large data set of samples that belong to different groups and differ in the area covered. The structure of the data set is simplified below. I now would like to create pooled samples (Subgroups) for each Group where the area covered by each Subgroup equates to a specified area (e.g. 20). Samples should be allocated randomly and without replacement to each Subgroup and the number of the Subgroup should be listed in a new column at the end of the data frame.
SampleID Group Area Subgroup
1 A 1.5 1
2 A 3.8 2
3 A 6 4
4 A 1.9 1
5 A 1.5 3
6 A 4.1 1
7 A 3.7 1
8 A 4.5 3
...
300 B 1.2 1
301 B 3.8 1
302 B 4.1 4
303 B 2.6 3
304 B 3.1 5
305 B 3.5 3
306 B 2.1 2
...
2000 S 2.7 5
...
I am currently using the ‘cumsum’ command to create the Subgroups, using the code below.
dat <- read.table("Pooling_Test.txt", header = TRUE, sep = "\t")
dat$CumArea <- cumsum(dat$Area)
dat$Diff_CumArea <- c(0, head(cumsum(dat$Area), -1))
dat$Sample_Int_1 <- "0"
dat$Sample_End <- "0"
current.sum <- 0
for (c in 1:nrow(dat)) {
current.sum <- current.sum + dat[c, "Area"]
dat[c, "Diff_CumArea"] <- current.sum
if (current.sum >= 20) {
dat[c, "Sample_Int_1"] <- "1"
dat[c, "Sample_End"] <- "End"
current.sum <- 0
dat$Sample_Int_2 <- cumsum(dat$Sample_Int_1)+1
dat$Sample_Final <- dat$Sample_Int_2
for (d in 1:nrow(dat)) {
if (dat$Sample_End[d] == 'End')
dat$Subgroup[d] <- dat$Sample_Int_2[d]-1
else 0 }
}}
write.csv(dat, file = 'Pooling_Test_Output.csv', row.names = FALSE)
The resultant data frame shows what I want (see below). However, there are a couple of steps I would like to improve. First, I have problems including a command for choosing samples randomly from each Group, so I currently randomise the order of samples before loading the data frame into R. Secondly, in the output table the Subgroups are numbered consecutively, but I would like to start the Subgroup numbering with 1 for each new Group. Has anybody any advice on how to achieve this?
SampleID Group CumArea Subgroups
1 A 1.5 1
77 A 4.6 1
6 A 9.3 1
43 A 16.4 1
17 A 19.5 1
67 A 2.1 2
4 A 4.3 2
32 A 8.9 2
...
300 B 4.5 10
257 B 6.8 10
397 B 10.6 10
344 B 14.5 10
367 B 16.7 10
303 B 20.1 10
306 B 1.5 11
...
A few functions in the dplyr package make this fairly straightforward. You can use slice to randomize the data, group_by to perform computations at the group level, and mutate to create new variables. If you chain the functions together with the %>% operator, I believe the solution would look something like this, assuming that you want groups that add up to 20.
install.packages("dplyr") #If you haven't used dplyr before
library(dplyr)
dat %>%
group_by(Group) %>%
slice(sample(1:n())) %>%
mutate(CumArea = cumsum(Area), SubGroup = ceiling(CumArea / 20))

R loop through multiple sub groups with using functions

Hi I am trying to learn how to loop through multiple groups within a data frame and apply certain arithmetic operations. I do not have a programming background and am struggling to loop through the multiple conditions.
My data looks like the following:
Event = c(1,1,1,1,1,2,2,2,2,2)
Indiv1=c(4,5,6,11,45,66,8,9,32,45)
Indiv2=c(7,81,91,67,12,34,56,78,90,12)
Category=c(1,1,2,2,2,1,2,2,1,1)
Play_together=c(1,0,1,1,1,1,1,1,0,1)
Money=c(23,11,78,-9,-12,345,09,43,21,90)
z = data.frame(Event,Indiv1,Indiv2,Category,Play_together,Money)
What I would like to do is to look through each event and each category and take the average value of Money in cases where Play_together == 1. When Play_together==0, then I would like to apply Money/100.
I understand that the loop would look something like the following:
for i in 1:nrow(z){
#loop for event{
#loop for Category{
#Define avg or division function
}
}
}
However, I cannot seem to implement this using a nested loop. I saw another post (link: apply function for each subgroup) which uses dplyr package. I was wondering if someone could help me to implement this without using any packages (I know this might take longer as compared to using R packages). I am trying to learn R and this is the first time I am working with nested loops.
The final output will look like this:
where for event 1, the following holds:
a) For cateory 1:
Play_together ==1 in row 1; we take the avg of Money value and hence final output = 23/1= 23
Play_together==0 in row 2; we take Money/100= 0.11
b) For category 2:
Play_together == 1 for all observations. We take avg Money for all three observations.
This holds similarly for Event 2. In my actual dataset, I have event = 600 and number of category ranging from 1 - 10. Some events may have only 1 category and a maximum of 10 categories. So any function needs to be extremely flexible. The total number of observations in my dataset is around 1.5 million so any changes in the looping process to reduce the time taken to carry out the operation is going to be extremely helpful (Although at this stage my priority is the looping process itself).
Once again it would be a great help if you can show me how to use nested looping and explain the steps in brief. Much appreciated.
will something like this do?
I know it's using dplyr, but that package is made for this kind of jobs ;-)
Event = c(1,1,1,1,1,2,2,2,2,2)
Indiv1=c(4,5,6,11,45,66,8,9,32,45)
Indiv2=c(7,81,91,67,12,34,56,78,90,12)
Category=c(1,1,2,2,2,1,2,2,1,1)
Play_together=c(1,0,1,1,1,1,1,1,0,1)
Money=c(23,11,78,-9,-12,345,09,43,21,90)
z = data.frame(Event,Indiv1,Indiv2,Category,Play_together,Money)
library(dplyr)
df_temp <- z %>%
group_by( Event, Category, Play_together ) %>%
summarise( money_mean = mean( Money ) ) %>%
mutate( final_output = ifelse( Play_together == 0, money_mean / 100, money_mean )) %>%
select( -money_mean )
df <- z %>%
left_join(df_temp, by = c("Event", "Category", "Play_together" )) %>%
arrange(Event, Category)
Consider base R's by, the object-oriented wrapper to tapply designed to subset dataframes by factor(s) but unlike split can pass subsets into a defined function. Then, run conditional logic with ifelse for Final_Output field. Finally, stack all subsetted dataframes for final object.
# LIST OF DATAFRAMES
by_list <- by(z, z[c("Event", "Category")], function(sub) {
tmp <- subset(sub, Play_together==1)
sub$Final_Output <- ifelse(sub$Play_together == 1, mean(tmp$Money), sub$Money/100)
return(sub)
})
# APPEND ALL DATAFRAMES
final_df <- do.call(rbind, by_list)
row.names(final_df) <- NULL
final_df
# Event Indiv1 Indiv2 Category Play_together Money Final_Output
# 1 1 4 7 1 1 23 23.00
# 2 1 5 81 1 0 11 0.11
# 3 2 66 34 1 1 345 217.50
# 4 2 32 90 1 0 21 0.21
# 5 2 45 12 1 1 90 217.50
# 6 1 6 91 2 1 78 19.00
# 7 1 11 67 2 1 -9 19.00
# 8 1 45 12 2 1 -12 19.00
# 9 2 8 56 2 1 9 26.00
# 10 2 9 78 2 1 43 26.00

Extracting corresponding other values in mutate when group_by is applied

I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this but more lab-values variables:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,2),
time=c(0,3,7,35,0,7,14,28,42),
labvalue1=c(4.04,NA,2.93,NA,NA,3.78,3.66,NA,2.54),
labvalue2=c(NA,63.8,62.8,61.2,78.1,NA,77.6,75.3,NA))
> df2
id time labvalue1 labvalue2
1 1 0 4.04 NA
2 1 3 NA 63.8
3 1 7 2.93 62.8
4 1 35 NA 61.2
5 2 0 NA 78.1
6 2 7 3.78 NA
7 2 14 3.66 77.6
8 2 28 NA 75.3
9 2 42 2.54 NA
I want to calculate for each patient (with unique ID) the decrease or slope per day for the first and last measurement. To compare the slopes between patients. Time is in days. So, eventually I want a new variable, e.g. diff_labvalues - for each value, that gives me for labvalue1:
For patient 1: (2.93-4.04)/ (7-0) and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last-first); etc for labvalue2, and so forth.
So far I have used dplyr, created the first1 and last1 functions, because first() and last() did not work with the NA values.
Thereafter, I have grouped_by 'id', used mutate_all (because there are more lab-values in the original df) calculated the difference between the last1() and first1() lab-values for that patient.
But cannot find HOW to extract the values of the corresponding time values (the delta-time value) which I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
first(na.omit(x))
}
last1 <- function(x) {
last(na.omit(x))
}
df2 = df %>%
group_by(id) %>%
mutate_all(funs(diff=(last1(.)-first1(.)) / #it works until here
(time[position of last1(.)]-time[position of first1(.)]))) #something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try
df %>%
group_by(id) %>%
filter(!is.na(labs)) %>%
summarise(diff_labs = (last(labs) - first(labs))/(last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labs), .(diff_labs = (labs[.N] - labs[1])/(time[.N] - time[1])) , id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lies between the start and stop values. If so I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match. For every value from possible that falls between I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match I'd like a 1, otherwise a 0.
Neither ID, not SNR are important. I'm not looking to match on ID. Instead I wand to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptulised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with; the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between

Resources