How can I conditionally calculate a new variable from pre-existing variables? - r

I'm a programming novice trying to calculate some ideal body weight numbers from a large dataset of height, sex, and actual body weight. I would like to create a new column in the data frame (df$ibw) based on the ideal body weight calculation for each individual.
Ideal body weight (IBW) is calculated differently for men and women.
For males... IBW = 50 + 0.91 * (height in cm - 152.4)
For females... IBW = 45.5 + 0.91 * (height in cm - 152.4)
set.seed(1000)
weight <- rnorm(10, 100, 20) # weight in kilograms
sex <- 0:1 # 0 for male, 1 for female (recycled to alternate 0,1 down the 10 rows)
height <- rnorm(10, 150, 10) # height in centimeters
df <- data.frame(weight, sex, height)
df
I've been reading other posts using if else statements and other conditional formats, but I keep getting errors. This is something I will be frequently doing for datasets and I'm trying to figure out the best way to accomplish this task.

You could use a one-liner:
df$IBW <- 0.91 * (df$height - 152.4) + 50 - 4.5 * df$sex
df
# weight sex height IBW
# 1 91.08443 0 140.1757 38.87591
# 2 75.88287 1 144.4551 38.27015
# 3 100.82253 0 151.2138 48.92057
# 4 112.78777 1 148.7913 42.21606
# 5 84.26891 0 136.6396 35.65803
# 6 92.29021 1 151.7006 44.86352
# 7 90.48264 0 151.5508 49.22722
# 8 114.39501 1 150.2493 43.54288
# 9 99.62989 0 129.5341 29.19207
# 10 72.53764 1 152.1315 45.25570
If sex == 1 (female), we just subtract 4.5, since 50 - 45.5 = 4.5.

This should do it:
df$ibw <- ifelse(df$sex == 0,
                 50 + 0.91 * (df$height - 152.4),
                 45.5 + 0.91 * (df$height - 152.4))

Something like this should work. Note the parentheses around (height - 152.4); without them the subtraction is applied after the multiplication.
df$ibw <- 0
df[df$sex == 0, ]$ibw <- 50 + 0.91 * (df[df$sex == 0, ]$height - 152.4)
df[df$sex == 1, ]$ibw <- 45.5 + 0.91 * (df[df$sex == 1, ]$height - 152.4)
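If you prefer a tidyverse style, dplyr::case_when() scales nicely once there are more than two formulas. A sketch using the same cutoffs as the question (not from the original answers):
library(dplyr)

df <- df %>%
  mutate(ibw = case_when(
    sex == 0 ~ 50   + 0.91 * (height - 152.4),  # male
    sex == 1 ~ 45.5 + 0.91 * (height - 152.4)   # female
  ))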

Related

How to remove loop from the following model function?

I'm rewriting some code, and I am currently creating a small population model. I have re-created the current model function below from a book; it's a simple population model based on a few parameters. I've left them at their defaults and returned the data frame. Everything works well. However, I was wondering whether I could somehow eliminate the loop from the function.
I know R is great because of vectorized calculation, but I'm not sure whether that is possible in this case. I thought of using something like lead/lag to do it, but would that work? Perhaps not, as things need to be calculated sequentially?
# Nt = numbers at start of time t
# Ct = numbers removed at the end of time t
# N0 = numbers at time 0
# r  = intrinsic rate of population growth
# K  = carrying capacity
mod_fun = function(r = 0.5, K = 1000, N0 = 50, Ct = 0, Yrs = 10, p = 1)
{
  # one step more than Yrs, since time 0 is included
  yr1 <- Yrs + 1
  # sequence of time steps from 1 to Yrs + 1
  years <- seq(1, yr1, 1)
  # vector of length Yrs + 1 to hold the population
  pop <- numeric(yr1)
  # population at time 0
  pop[1] <- N0
  # loop that calculates the model for each year after the first
  for (i in 2:yr1) {
    # Nt is always the population at the previous step
    Nt <- pop[i - 1]
    pop[i] <- max(Nt + (r * Nt / p) * (1 - (Nt / K)^p) - Ct, 0)
  }
  # pop2 drops the first element: the population at the end of step t
  pop2 <- pop[2:yr1]
  # bind together years, pop (the population at the start of step t)
  # and pop2 (the population at the end of step t)
  out <- cbind(year = years, nt = pop, nt1 = c(pop2, NA))
  # row names are the years
  rownames(out) <- years
  out <- out[-yr1, ]
  # returns a matrix with columns year, nt and nt1
  return(out)
}
result = mod_fun()
This is what the output looks like. Basically, working row-wise from row 1 with the starting population of 50, the loop calculates nt1, the next row's nt becomes lag(nt1), and things continue in that fashion.
result
#> year nt nt1
#> 1 1 50.0000 73.7500
#> 2 2 73.7500 107.9055
#> 3 3 107.9055 156.0364
#> 4 4 156.0364 221.8809
#> 5 5 221.8809 308.2058
#> 6 6 308.2058 414.8133
#> 7 7 414.8133 536.1849
#> 8 8 536.1849 660.5303
#> 9 9 660.5303 772.6453
#> 10 10 772.6453 860.4776
Created on 2022-04-24 by the reprex package (v2.0.1)
Reduce() with accumulate = TRUE carries the population forward step by step, so the explicit loop is not needed:
mod_fun = function(r = 0.5, K = 1000, N0 = 50, Ct = 0, Yrs = 10, p = 1) {
  years <- seq_len(Yrs)
  pop <- Reduce(function(Nt, y) max(Nt + (r * Nt / p) * (1 - (Nt / K)^p) - Ct, 0),
                years, init = N0, accumulate = TRUE)
  data.frame(year = years, nt = head(pop, -1), nt1 = pop[-1])
}
which gives the same result:
year nt nt1
1 1 50.0000 73.7500
2 2 73.7500 107.9055
3 3 107.9055 156.0364
4 4 156.0364 221.8809
5 5 221.8809 308.2058
6 6 308.2058 414.8133
7 7 414.8133 536.1849
8 8 536.1849 660.5303
9 9 660.5303 772.6453
10 10 772.6453 860.4776
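The same fold can be written with purrr::accumulate if you prefer the tidyverse idiom. A sketch equivalent to the Reduce version above (mod_fun2 is my own name, not from the original answer):
library(purrr)

mod_fun2 = function(r = 0.5, K = 1000, N0 = 50, Ct = 0, Yrs = 10, p = 1) {
  # accumulate() carries the population forward year by year,
  # exactly like Reduce(..., accumulate = TRUE)
  pop <- accumulate(seq_len(Yrs),
                    function(Nt, y) max(Nt + (r * Nt / p) * (1 - (Nt / K)^p) - Ct, 0),
                    .init = N0)
  data.frame(year = seq_len(Yrs), nt = head(pop, -1), nt1 = pop[-1])
}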

Using lag for creating an x+1 column

I'm trying to implement a lag function, but it seems I need an existing Ix column for it to work.
Let's say I have this data frame
df <- data.frame(AgeGroup = c("0-4", "5-39", "40-59", "60-69", "70+"),
                 px = c(.99, .97, .95, .96, .94))
I want a column Ix that is lag(Ix) * lag(px), starting from 1000.
The data I want is
df2 <- data.frame(AgeGroup = c("0-4", "5-39", "40-59", "60-69", "70+"),
                  px = c(.99, .97, .95, .96, .94),
                  Ix = c(1000, 990, 960.3, 912.285, 875.7936))
I've tried
library(dplyr)
df2 <- mutate(df, Ix = lag(Ix, default = 1000) * lag(px))
ifelse statements don't work either, even after creating a reference value first:
df$Ix2 <- NA
df[1, 3] <- 1000
df$Ix <- ifelse(df[, 3] == 1000, 1000,
                lag(df$Ix, default = 1000) * lag(px, default = 1))
I've also been playing around with creating a separate Ix column with Ix = 1000 and then running the above, but it doesn't seem to work. Does anyone have any ideas how I can create the x+1 column?
You could use cumprod() combined with dplyr::lag() for this, since each Ix is just 1000 times the product of all the px values before it:
> df$Ix <- 1000*dplyr::lag(cumprod(df$px), default = 1)
> df
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
You can also use accumulate from purrr. Using head(px, -1) includes all values in px except the last one, and the initial Ix is set to 1000.
library(tidyverse)
df %>%
mutate(Ix = accumulate(head(px, -1), prod, .init = 1000))
Output
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
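For completeness, the same result is available in base R without any packages, using the same running-product identity (a minimal sketch):
# 1000 times the cumulative product of the preceding px values
df$Ix <- 1000 * c(1, cumprod(head(df$px, -1)))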

Rolling window with dplyr to find value of factor

I have a tibble like this
head(a)
# A tibble: 6 x 4
date ROE ROFE ROTFE
<date> <dbl> <dbl> <dbl>
1 2000-01-31 0.033968932 0.0324214815 0.010205926
2 2000-02-29 0.006891111 -0.0003352941 -0.005230147
3 2000-03-31 0.006158519 0.0213992647 0.040399265
4 2000-04-28 0.060022222 0.0151191176 0.047586029
5 2000-05-31 -0.016960000 -0.0287617647 -0.036209559
6 2000-06-30 0.034133577 0.0144456522 0.030756522
I want to pick, at each date, the value of the factor that had the highest cumulative return over the previous 2 months.
I have done it with the loop below and it works.
However, my friend told me it can be done in one or two lines of dplyr, and I'm wondering if you could please show me how.
index = as.Date(unique(a$date))
nmonth = 2
mean.ROE = numeric()
for (i in 1:(length(index) - nmonth)) {
  index1 = index[i]
  index2 = index[nmonth + i]
  index3 = index[nmonth + i + 1]
  # Take a 2-month window of returns:
  b = a[a$date >= index1 & a$date < index2, ] %>%
    mutate(cum.ROE   = cumprod(1 + ROE),
           cum.ROFE  = cumprod(1 + ROFE),
           cum.ROTFE = cumprod(1 + ROTFE))
  # Use the cumulative return over the 2-month window to determine which factor is best:
  mean.ROE1 = ifelse(b$cum.ROE[nmonth] > b$cum.ROFE[nmonth] & b$cum.ROE[nmonth] > b$cum.ROTFE[nmonth],
                     a[a$date == index3, ]$ROE,
                     ifelse(b$cum.ROFE[nmonth] > b$cum.ROE[nmonth] & b$cum.ROFE[nmonth] > b$cum.ROTFE[nmonth],
                            a[a$date == index3, ]$ROFE,
                            a[a$date == index3, ]$ROTFE))
  # Bind the answer to the answer vector
  mean.ROE = rbind(mean.ROE, mean.ROE1)
}
Create a function maxret which takes a window of 2 + nmonth rows, x, and calculates the cumulative return, r, of each column over the first two rows. For the column with the largest of those, it returns the value in the last row of x.
Now use rollapplyr to apply it to a rolling window of width 2 + nmonth:
library(zoo)
maxret <- function(x) {
  # cumulative return of each column over the first two rows (nmonth = 2)
  r <- apply(1 + x[1:2, ], 2, prod)
  # value of the best-performing column in the last row of the window
  x[2 + nmonth, which.max(r)]
}
z <- read.zoo(as.data.frame(a))
res <- rollapplyr(z, 2 + nmonth, maxret, by.column = FALSE)
giving the zoo series:
> res
2000-04-28 2000-05-31 2000-06-30
0.06002222 -0.03620956 0.03075652
If you want a data frame, use fortify.zoo(res).
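If you'd rather not rely on the global nmonth inside the function, a parameterized variant is straightforward (a sketch; maxret_n and its n argument are my own names, not from the original answer):
maxret_n <- function(x, n = 2) {
  # cumulative return of each column over the first n rows of the window
  r <- apply(1 + x[1:n, ], 2, prod)
  # value of the best-performing column in the last row of the window
  x[nrow(x), which.max(r)]
}
res <- rollapplyr(z, 2 + nmonth, maxret_n, n = nmonth, by.column = FALSE)
Extra arguments to rollapplyr are passed through to the applied function, so n = nmonth reaches maxret_n.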
Note 1: The input was not provided in reproducible form in the question, so I have assumed this data.frame:
Lines <-
"date ROE ROFE ROTFE
1 2000-01-31 0.033968932 0.0324214815 0.010205926
2 2000-02-29 0.006891111 -0.0003352941 -0.005230147
3 2000-03-31 0.006158519 0.0213992647 0.040399265
4 2000-04-28 0.060022222 0.0151191176 0.047586029
5 2000-05-31 -0.016960000 -0.0287617647 -0.036209559
6 2000-06-30 0.034133577 0.0144456522 0.030756522"
a <- read.table(text = Lines, header = TRUE)
Note 2: With the input in Note 1 or with zoo 1.8.1 (the development version of zoo) this line:
z <- read.zoo(as.data.frame(a))
could be simplified to just:
z <- read.zoo(a)
but we have added the as.data.frame part in the main code so it works with tibbles as well as straight data frames even with the current version of zoo on CRAN.

How to maintain the order of elements of a row when using by and rbind function in r?

I have written a function which takes a subset of data based on the value of the name column. It computes the outliers for the column "mark" and replaces them.
However, when I try to combine these different subsets, the order of my elements changes. Is there any way I can maintain the order of the elements in the column "mark"?
My data set is:
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0
The function I have written is:
data.frame(do.call("rbind", as.list(by(data, data$name, function(x) {
  apply(x[, .(mark)], 2, function(y) {
    y[y > (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <-
      (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))
    y
  })
}))))
The result of the above function is the first column below (I've manually added back name for illustrative purposes):
mark NAME
100.000 ----- A
50.000 ----- A
210.000 ----- A
0.500 ----- B
90.000 ----- B
839.625 ----- B
100.000 ----- C
1200.000 ----- C
4875.000 ----- C
In the above result, the order of the values in the mark column has changed. Is there any way I can maintain the original order of the elements?
Are you sure that code is doing what you think it is?
It looks like you're replacing any value greater than the median (the third value returned by quantile) plus 1.5 * IQR with that limit. Maybe that's what you intend, I don't know. The bigger problem is that you're doing that inside an apply function, so the median and IQR get re-calculated each iteration, with the previously changed rows already folded in. I'd wager that's not what you intend, but I suppose I've seen stranger.
A better option might be to create an external function to do the work, which takes in all of the data, does the calculation, then outputs all the data. I like dplyr for this simply because it's clean.
Reading your data in (why the "----"?)
scores <- read.table(text="
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0", header=TRUE)
and creating a function that does something a little more sensible: it replaces any value greater than the 75% quantile, or less than the 25% quantile, with that limiting value (the quantiles are referenced by name so you can see which is which)
scale_outliers <- function(data) {
lim <- quantile(data, na.rm = TRUE)
data[data > lim["75%"]] <- lim["75%"]
data[data < lim["25%"]] <- lim["25%"]
return(data)
}
Chaining this processing into dplyr::mutate is neat, and the result can then be passed on to ggplot. Here's the original data
library(dplyr)
library(ggplot2)
gg1 <- scores %>% ggplot(aes(x=name, y=mark))
gg1 <- gg1 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg1
And if we alter it with the new function we get the data back without rows changed around
scores %>% mutate(new_mark = scale_outliers(mark))
#> name mark new_mark
#> 1 A 100.0 100
#> 2 B 0.5 90
#> 3 C 100.0 100
#> 4 A 50.0 90
#> 5 B 90.0 90
#> 6 B 1000.0 1000
#> 7 C 1200.0 1000
#> 8 C 5000.0 1000
#> 9 A 210.0 210
and we can plot that,
gg2 <- scores %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg2 <- gg2 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg2
Best of all, if you now want to do that quantile comparison group-wise (say, by the name column), it's as easy as using dplyr::group_by(name),
gg3 <- scores %>% group_by(name) %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg3 <- gg3 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg3
A slightly refactored version of Hack-R's answer: you can add an index to your data.table.
library(data.table)
data <- data.table(name = c("A", "B", "C", "A", "B", "B", "C", "C", "A"),
                   mark = c(100, 0.5, 100, 50, 90, 1000, 1200, 5000, 210))
data[, i := .I]
Then you perform your calculation but you keep the name and i:
df <- data.frame(do.call("rbind", as.list(
  by(data, data$name, function(x) {
    cbind(i = x$i,
          name = x$name,
          apply(x[, .(mark)], 2, function(y) {
            y[y > (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <-
              (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))
            y
          }))
  }))))
And finally you order using the index:
df[order(df$i),]
i name mark
1 1 A 100
4 2 B 0.5
7 3 C 100
2 4 A 50
5 5 B 90
6 6 B 839.625
8 7 C 1200
9 8 C 4875
3 9 A 210
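For comparison, a grouped mutate in dplyr never reorders rows, so the index bookkeeping becomes unnecessary. A sketch reproducing the capping rule from the question's code (median + 1.5 * IQR within each name; my own restatement, not from the original answers):
library(dplyr)

data %>%
  group_by(name) %>%
  # cap each group's marks at its own median + 1.5 * IQR
  mutate(mark = pmin(mark, median(mark, na.rm = TRUE) + 1.5 * IQR(mark))) %>%
  ungroup()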

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude the values that fall within it. I would then like to rewrite the data without those values and use the new columns in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way to write this so that the threshold is easy to change by passing arguments from Java (as I have done with the input file name), that's even better!
Thank you so much.
I have now implemented the answer below and it works; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from CSV):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if either one does not meet my quartile threshold (the 0.25 quantile). For example, if the quartile for O were 45000, then the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read in both columns as a data frame, can I search each column separately? Or should I find the quartiles and then plug those values into code that removes the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
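To keep the pairs together, as the follow-up asks, the same subsetting idea extends to two columns by combining the two logical tests, so a row is dropped when either value falls in its own column's bottom quartile (a sketch, assuming the column names from the question's CSV):
# keep a row only if both columns clear their own lower quartile
keep <- Values$Abundance_O > quantile(Values$Abundance_O, 0.25) &
        Values$Abundance_S > quantile(Values$Abundance_S, 0.25)
Values_thresholded <- Values[keep, ]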
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numerical vector into its quantile groups. Parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4) {
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  # for each value, count how many quantile breakpoints it meets or exceeds
  out = sapply(numvec, function(x) sum(x >= qtile[-(n + 1)]))
  return(out)
}
Function example:
v = 1:20
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)
dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
> A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we keep only those rows whose quartile group is above the first quartile in both columns:
dt = dt[Q0 > 1 & Q1 > 1]
