Writing a custom function in R which takes an input vector with units

I am trying to write a function my_heights() that takes a vector like heights and returns a numeric vector with the heights in a consistent unit.
The unit should be specified as a function argument, with the default being cm. For the sake of simplicity, assume the unit can only be m or cm.
I have the following vector:
heights <- c("188 cm", "1.73 m", "192 cm", "175 cm", "165 cm", "191 cm", "1.85 m")
Desired output (sanity check):
my_heights(heights)
[1] 188 173 192 175 165 191 185
my_heights(heights, "m")
[1] 1.88 1.73 1.92 1.75 1.65 1.91 1.85
I understand that with the default unit (cm), the heights given in cm should be returned as they are and the heights given in m should be multiplied by 100. With unit = "m", the heights in m are returned as they are and the heights in cm are divided by 100. I am not sure how to separate the cm heights from the m heights to make this work correctly.

Create a function with an argument 'x' for the vector of values and an argument 'unit' that defaults to 'cm'. Extract the numeric part with either parse_number from readr or sub from base R, then build a logical vector (i1) with grepl that flags the strings ending ($) in ' m'. Depending on 'unit', multiply or divide the matching subset of values, and return the converted numeric vector.
clean_heights <- function(x, unit = 'cm') {
  x1 <- readr::parse_number(x)
  # base R alternative with sub
  # x1 <- as.numeric(sub("\\s+\\D+", "", x))
  i1 <- grepl('\\sm$', x)
  if (unit == 'cm') {
    x1[i1] <- x1[i1] * 100
  } else {
    x1[!i1] <- x1[!i1] / 100
  }
  x1
}
Testing:
clean_heights(heights)
#[1] 188 173 192 175 165 191 185
clean_heights(heights, "m")
#[1] 1.88 1.73 1.92 1.75 1.65 1.91 1.85
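The same idea can be sketched in base R only, with the unit argument checked by match.arg. This is a minimal sketch under the assumption that every element looks like "<number> <unit>"; the name convert_heights and the lookup vector to_cm are made up for illustration and do not appear in the answers here.

```r
heights <- c("188 cm", "1.73 m", "192 cm", "175 cm", "165 cm", "191 cm", "1.85 m")

# Hypothetical helper: convert "<number> <unit>" strings to a single unit
convert_heights <- function(x, unit = c("cm", "m")) {
  unit <- match.arg(unit)                            # rejects anything but "cm" or "m"
  value <- as.numeric(sub("\\s*[A-Za-z]+$", "", x))  # numeric part
  from <- sub("^[0-9.]+\\s*", "", x)                 # unit label of each element
  to_cm <- c(cm = 1, m = 100)                        # conversion factors to cm
  if (!all(from %in% names(to_cm))) stop("unknown unit in input")
  # convert everything to cm first, then to the requested unit
  unname(value * to_cm[from] / to_cm[unit])
}

convert_heights(heights)
convert_heights(heights, "m")
```

Going through cm as a common base unit keeps the logic to one line, and an unsupported unit fails loudly instead of being silently treated as cm.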

This would be done as follows:
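my_heights <- function(x, unit = "cm") {
  # conversion factors to cm
  units <- c(mm = 0.1, cm = 1, dm = 10, m = 100, km = 100000)
  # V1 holds the numbers, V2 the unit labels; V2 must stay character so it indexes 'units' by name
  parsed <- read.table(text = x, header = FALSE, stringsAsFactors = FALSE)
  unname(parsed$V1 * units[parsed$V2]) / units[unit]
}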
my_heights(heights)
[1] 188 173 192 175 165 191 185
my_heights(heights, "m")
[1] 1.88 1.73 1.92 1.75 1.65 1.91 1.85
my_heights(heights, "mm")
[1] 1880 1730 1920 1750 1650 1910 1850

You can try the following user function my_heights
my_heights <- function(h, unit = "cm") {
  ifelse(gsub(".*\\s+", "", h) == unit,
    1,
    100^ifelse(unit == "cm", 1, -1)
  ) * as.numeric(gsub("\\s.*$", "", h))
}
such that
> my_heights(heights)
[1] 188 173 192 175 165 191 185
> my_heights(heights, "m")
[1] 1.88 1.73 1.92 1.75 1.65 1.91 1.85
> my_heights(heights, "cm")
[1] 188 173 192 175 165 191 185

Related

How to apply a function from a package to a dataframe

How can I apply a package function to a data frame?
I have a data set (df) with two columns (total and n) to which I would like to apply the pois.exact function (pois.exact(x, pt = 1, conf.level = 0.95)) from the epitools package, with x = df$n and pt = df$total, and get a new data frame (new_df) with 3 more columns holding the corresponding rounded rates and their lower and upper CI limits.
df <- data.frame("total" = c(35725302,35627717,34565295,36170648,38957933,36579643,29628394,18212075,39562754,1265055), "n" = c(24,66,166,461,898,1416,1781,1284,329,12))
> df
total n
1 35725302 24
2 35627717 66
3 34565295 166
4 36170648 461
5 38957933 898
6 36579643 1416
7 29628394 1781
8 18212075 1284
9 9562754 329
In fact, the data frame is much longer.
For example, for the first row the desired results are:
require(epitools)
round(pois.exact(24, pt = 35725302, conf.level = 0.95) * 100000, 2)[3:5]
rate lower upper
1 0.07 0.04 0.1
The new data frame, with the results of pois.exact added, should look like this:
> new_df
total n incidence lower_95IC uppper_95IC
1 35725302 24 0.07 0.04 0.10
2 35627717 66 0.19 0.14 0.24
3 34565295 166 0.48 0.41 0.56
4 36170648 461 1.27 1.16 1.40
5 38957933 898 2.31 2.16 2.46
6 36579643 1416 3.87 3.67 4.08
7 29628394 1781 6.01 5.74 6.03
8 18212075 1284 7.05 6.67 7.45
9 9562754 329 3.44 3.08 3.83
Thanks.
library(epitools)
library(dplyr)
df %>%
  cbind(pois.exact(df$n, df$total)) %>%
  dplyr::select(total, n, rate, lower, upper)
# total n rate lower upper
# 1 35725302 24 1488554.25 1488066.17 1489042.45
# 2 35627717 66 539813.89 539636.65 539991.18
# 3 34565295 166 208224.67 208155.26 208294.10
# 4 36170648 461 78461.28 78435.71 78486.85
# 5 38957933 898 43383.00 43369.38 43396.62
# 6 36579643 1416 25833.08 25824.71 25841.45
# 7 29628394 1781 16635.82 16629.83 16641.81
# 8 18212075 1284 14183.86 14177.35 14190.37
# 9 39562754 329 120251.53 120214.06 120289.01
# 10 1265055 12 105421.25 105237.62 105605.12
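To get the rounded per-100,000 columns asked for in the question without depending on a package, the exact Poisson interval can be sketched in base R with gamma quantiles. This assumes the standard gamma-quantile form of the exact interval, which should agree with pois.exact to rounding but is worth checking against it; the helper name exact_rate and its per argument are hypothetical.

```r
df <- data.frame(
  total = c(35725302, 35627717, 34565295, 36170648, 38957933,
            36579643, 29628394, 18212075, 39562754, 1265055),
  n = c(24, 66, 166, 461, 898, 1416, 1781, 1284, 329, 12)
)

# Hypothetical helper: rate and exact Poisson CI per `per` units of person-time
exact_rate <- function(x, pt, conf.level = 0.95, per = 100000) {
  alpha <- 1 - conf.level
  data.frame(
    incidence  = round(x / pt * per, 2),
    lower_95IC = round(qgamma(alpha / 2, shape = x) / pt * per, 2),
    upper_95IC = round(qgamma(1 - alpha / 2, shape = x + 1) / pt * per, 2)
  )
}

# Both arguments are vectors, so all rows are handled in one call
new_df <- cbind(df, exact_rate(df$n, df$total))
new_df[1, ]
#      total  n incidence lower_95IC upper_95IC
# 1 35725302 24      0.07       0.04        0.1
```

The first row reproduces the 0.07, 0.04 and 0.10 given in the question for n = 24 and total = 35725302.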

Sorting pairs by first coordinate

I have the following vectors:
X<-c(140,140,130,109,124,114,65,162,150,0)
Y<-c(30.65,6.45,17.74,11.29,3.23,3.23,3.23,8.06,14.52,1.61)
What I would like to do is assign each entry in X to the corresponding entry in Y, and then order them by X. For example, if I had
J<-c(10,40,20)
K<-c(9,9,2)
I would like it to give me
Jo = (10,20,40)
Ko = (9,2,9)
How do I do this in R? Thanks for the help.
Use the order() function:
X <- c(140,140,130,109,124,114,65,162,150,0)
Y <- c(30.65,6.45,17.74,11.29,3.23,3.23,3.23,8.06,14.52,1.61)
ord <- order(X)
(X2 <- X[ord])
## [1] 0 65 109 114 124 130 140 140 150 162
(Y2 <- Y[ord])
## [1] 1.61 3.23 11.29 3.23 3.23 17.74 30.65 6.45 14.52 8.06
(You don't really need to save ord: you could use Y2 <- Y[order(X)]; X2 <- sort(X) instead, as long as Y is re-ordered before X is sorted.)
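If the two vectors really are pairs, one option is to keep them together in a data frame and order its rows, so the pairing cannot drift apart. A small sketch using the J and K example from the question (the data frame name pairs is just for illustration):

```r
J <- c(10, 40, 20)
K <- c(9, 9, 2)

# Keep each J with its K, then order the rows by J
pairs <- data.frame(J = J, K = K)
sorted <- pairs[order(pairs$J), ]

Jo <- sorted$J  # 10 20 40
Ko <- sorted$K  # 9 2 9
```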

R: test for a criterion in a column and, if not met, move to the next number and test again

This is quite complicated for me and I would be really grateful if someone could tell me how to go about this problem. My dataframe has two columns:
dat <- structure(list(day = 172:208,
x = c(0.14, 0.02, 0.09, 3.06, 3.21,
4.15, 6.24, 6.27, 3.31, 6.28,
16.9, 20.1, 20.29, 20.45, 17.52,
6.22, 1.14, 0.84, 0.68, 0.49,
0.22, 0.01, 0.01, 0.6, 0.64, 0.64,
0.66, 0.69, 0.15, 0.15, 3.16,
3.44, 3.42, 3.37, 3.51, 2.77, 3.51
)),
.Names = c("day", "x"),
class = "data.frame", row.names = c(NA,-37L))
dat
day x
172 0.14
173 0.02
174 0.09
175 3.06
176 3.21
177 4.15
178 6.24
179 6.27
180 3.31
181 6.28
182 16.90
183 20.10
184 20.29
185 20.45
186 17.52
187 6.22
188 1.14
189 0.84
190 0.68
191 0.49
192 0.22
193 0.01
194 0.01
195 0.60
196 0.64
197 0.64
198 0.66
199 0.69
200 0.15
201 0.15
202 3.16
203 3.44
204 3.42
205 3.37
206 3.51
207 2.77
208 3.51
What I want to do is this:
1) In column x, look for values greater than 2.3:
which(dat$x > 2.3)
2) For a day where x is greater than 2.3, calculate the percentage change in x over the next 3 days. For example, for day 175, x is 3.06 (> 2.3), so for the next 3 consecutive values of x (3.21 on day 176, 4.15 on day 177, 6.24 on day 178), do this:
(3.21 - 3.06)*100/3.06 = 4.9
(4.15 - 3.21)*100/3.21 = 29.28
(6.24 - 4.15)*100/4.15 = 50.36
If all three values are greater than -30, store the middle day of 176, 177 and 178 in a separate vector (in this case, store 177).
3) If any of the three values is less than -30, store nothing. In either case, start again from day 179 (x > 2.3 mm) and repeat step 2 for days 180, 181 and 182:
(3.31 - 6.27)*100/6.27 = -47.2
(6.28 - 3.31)*100/3.31 = 89.73
(16.9 - 6.28)*100/6.28 = 169.1
If all the values are greater than -30, then store the middle day (181). In this case, one of the values is less than -30, so do not store anything and start again from day 183 (x > 2.3 mm), repeating step 2 for days 184, 185 and 186. If any of those three values is again less than -30, start from day 187 (x > 2.3) and repeat step 2 for days 188, 189 and 190. If again any of the three values is less than -30, start from day 202 (since x > 2.3 on day 202).
I am really sorry, I do not have much programming experience in R, which is why I am posting this question; it has bogged me down for quite some time.
Thanks a lot
We can solve this without any loops, by using the new(ish) dplyr package:
library(dplyr)
library(magrittr)
Let's calculate percent change first, as a new column:
dat <- dat %>%
mutate(change = (x-lag(x))/lag(x)*100)
Next, make a vector of subscripts that indicates which groups of 3 should be included in your answer. In this chunk of code, we find the first (min) value that matches your condition, count from there in intervals of 4, and then make those the "starting points" for our subscripts. The last line simply makes this into a dataframe (the anonymous function has to be wrapped in parentheses to be used on the right-hand side of %>%):
grps <- which(dat$x > 2.3) %>%
  min %>%
  seq(from = ., to = nrow(dat), by = 4) %>%
  lapply('+', 1:3) %>%
  do.call(c, .) %>%
  (function(l) data.frame(group = gl(length(l)/3, 3), ss = l))
Then you use those group subscripts (ss) to pull out the necessary rows of dat. Let's look at the top of the new dataframe before proceeding:
grps %>%
do(data.frame(.,dat[.$ss,])) %>%
head
group ss day x change
5 1 5 176 3.21 4.901961
6 1 6 177 4.15 29.283489
7 1 7 178 6.24 50.361446
9 2 9 180 3.31 -47.208931
10 2 10 181 6.28 89.728097
11 2 11 182 16.90 169.108280
As you can see, the values for days 176, 177 and 178 exactly match those in your example. This is a group for which we will want the middle number, since all of change is greater than -30. We won't use 181, however, because one value of change in that group ("group 2") is less than -30. Again, this matches the original question.
Then group by the group column (which keeps the data in sets of 3). Finally, keep only the groups in which every change is greater than -30 and select the middle row of each:
grps %>%
do(data.frame(.,dat[.$ss,])) %>%
group_by(group) %>%
filter(all(change > -30)) %>%
do(.[2,])
group ss day x change
1 1 6 177 4.15 29.2834891
2 3 14 185 20.45 0.7885658
3 6 26 197 0.64 0.0000000
4 8 34 205 3.37 -1.4619883
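For comparison, the steps in the question can also be read as a plain loop that only starts a window on a day where x > 2.3, and otherwise moves on one day at a time. This is a minimal sketch of that reading, not the pipeline above (which steps in fixed intervals of 4 from the first match), so the two can return different days. The function name find_middle_days and its arguments are hypothetical.

```r
dat <- data.frame(
  day = 172:208,
  x = c(0.14, 0.02, 0.09, 3.06, 3.21, 4.15, 6.24, 6.27, 3.31, 6.28,
        16.9, 20.1, 20.29, 20.45, 17.52, 6.22, 1.14, 0.84, 0.68, 0.49,
        0.22, 0.01, 0.01, 0.6, 0.64, 0.64, 0.66, 0.69, 0.15, 0.15,
        3.16, 3.44, 3.42, 3.37, 3.51, 2.77, 3.51)
)

# Hypothetical helper implementing the question's steps literally
find_middle_days <- function(dat, threshold = 2.3, min_change = -30) {
  # percentage change of each day relative to the previous day
  change <- c(NA, diff(dat$x) / head(dat$x, -1) * 100)
  kept <- integer(0)
  n <- nrow(dat)
  i <- 1
  while (i + 3 <= n) {
    if (dat$x[i] > threshold) {
      # look at the changes on the next three days
      if (all(change[(i + 1):(i + 3)] > min_change)) {
        kept <- c(kept, dat$day[i + 2])  # middle day of the three
      }
      i <- i + 4  # restart after the three days just examined
    } else {
      i <- i + 1  # keep searching for a day with x above the threshold
    }
  }
  kept
}

find_middle_days(dat)
# [1] 177 185 204
```

A start day too close to the end of the data (here day 206) is skipped, because three following days are not available.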

Relative and cumulative percentages for levels of an ordered factor variable in R

The title says it all. I searched over the internet a lot but I was not able to find the answer.
The topic "Make Frequency Histogram for Factor Variables" does exactly what I need, but for a plot rather than a table. I have an ordered factor variable and I need to calculate the relative and cumulative percentages for each level, as if it were a numerical value. I would like to calculate the percentages and save them in a separate table.
Any suggestions?
Thank you in advance.
Is this what you mean:
X <- sample(LETTERS[1:5],1000,replace=T)
X <- factor(X, ordered=T)
prop.table(table(X))
# X
# A B C D E
# 0.210 0.187 0.180 0.222 0.201
cumsum(prop.table(table(X)))
# A B C D E
# 0.210 0.397 0.577 0.799 1.000
This is basically just @Roland's answer from the question you referenced.
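Since the question asks for the percentages saved in a separate table, the two calls above can be collected into one data frame. A minimal sketch; the column names and the seed are arbitrary choices, not part of the question:

```r
set.seed(1)  # arbitrary seed, only to make the sample reproducible
X <- factor(sample(LETTERS[1:5], 1000, replace = TRUE), ordered = TRUE)

counts <- table(X)
rel <- as.vector(prop.table(counts))

# One row per factor level, in level order
pct_table <- data.frame(
  level   = names(counts),
  n       = as.vector(counts),
  rel_pct = rel * 100,
  cum_pct = cumsum(rel) * 100
)
pct_table
```

Because table() keeps the levels of an ordered factor in their defined order, the cumulative column accumulates in that order as well.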
EDIT (Response to OP's comment)
Y <- table(X)
str(Y)
# 'table' int [1:5(1d)] 205 191 200 183 221
# - attr(*, "dimnames")=List of 1
# ..$ X: chr [1:5] "A" "B" "C" "D" ...
Z <- c(table(X))
str(Z)
# Named int [1:5] 205 191 200 183 221
# - attr(*, "names")= chr [1:5] "A" "B" "C" "D" ...
So Y is of class "table", whereas Z is a named integer vector. The main difference is the way various R functions treat the different classes. Try plot(Y) and plot(Z) or data.frame(Y) and data.frame(Z). Note, however, that (e.g.) sum(Y) and sum(Z) return the same thing.
Try the fdth package:
library(fdth)
set.seed(2019)
X <- sample(LETTERS[1:5],
1000,
replace=T)
X <- factor(X,
ordered=T)
tb <- fdt_cat(X)
summary(tb)
# Category f rf rf(%) cf cf(%)
# B 223 0.22 22.3 223 22.3
# A 210 0.21 21.0 433 43.3
# C 191 0.19 19.1 624 62.4
# D 188 0.19 18.8 812 81.2
# E 188 0.19 18.8 1000 100.0

K-fold cross-validation using cv.lm()

I am new to R and trying to do K-fold cross-validation using cv.lm().
Refer: http://www.statmethods.net/stats/regression.html
I am getting an error indicating that the lengths of my variables differ. In fact, when I checked with length(), I found that the lengths are the same.
Below is a minimal data set to replicate the problem:
X Y
277 5.20
285 5.17
297 4.96
308 5.26
308 5.11
263 5.27
278 5.20
283 5.16
268 5.17
250 5.20
275 5.18
274 5.09
312 5.03
294 5.21
279 5.29
300 5.14
293 5.09
298 5.16
290 4.99
273 5.23
289 5.32
279 5.21
326 5.14
293 5.22
256 5.15
291 5.09
283 5.09
284 5.07
298 5.27
269 5.19
I used the code below to do the cross-validation:
# K-fold cross-validation, with K=10
sampledata <- read.table("H:/sample.txt", header=TRUE)
y.1 <- sampledata$Y
x.1 <- sampledata$X
fit=lm(y.1 ~ x.1)
library(DAAG)
cv.lm(df=sampledata, fit, m=10)
The error on the terminal,
Error in model.frame.default(formula = form, data = df[rows.in, ], drop.unused.levels = TRUE) :
variable lengths differ (found for 'x.1')
Verification,
> length(x.1)
[1] 30
> length(y.1)
[1] 30
The above confirms the lengths are the same.
> str(x.1)
int [1:30] 277 285 297 308 308 263 278 283 268 250 ...
> str(y.1)
num [1:30] 5.2 5.17 4.96 5.26 5.11 5.27 5.2 5.16 5.17 5.2 ...
> is(y.1)
[1] "numeric" "vector"
> is(x.1)
[1] "integer" "numeric" "vector" "data.frameRowLabels"
A further check, as above, shows that one variable is integer and the other is numeric. But even when I convert numeric to integer or integer to numeric, the same error about differing lengths appears.
Can you guide me on what I should do to correct the error?
I have been unable to resolve this for 2 days and did not find any good leads by searching the internet.
Additional related query:
I see the fit works if we use the column headers of the data set in the formula,
fit=lm(Y ~ X, data=sampledata)
a) What is the difference between the above syntax and
fit1=lm(sampledata$Y ~ sampledata$X)
I thought they were the same. In the code below,
#fit 1 works
fit1=lm(Y ~ X, data=sampledata)
cv.lm(df=sampledata, fit1, m=10)
#fit 2 does not work
fit2=lm(sampledata$Y ~ sampledata$X)
cv.lm(df=sampledata, fit2, m=10)
The problem is at df=sampledata, as the header "sampledata$Y" does not exist; only Y exists. I tried changing the cv.lm call to the one below, but it does not work either:
cv.lm(fit2, m=10)
b) If we want to transform the variables, how do we use them in cv.lm()? For example:
y.1 <- (sampledata$Y/sampledata$X)
x.1 <- (1/sampledata$X)
#fit 4 problem
fit4=lm(y.1 ~ x.1)
cv.lm(df=sampledata, fit4, m=10)
Is there a way I could reference y.1 and x.1 instead of the header Y ~ X in the function?
Thanks.
I'm not sure exactly why this happens, but I noticed that you do not specify the data argument for lm(), so that was my first guess:
fit=lm(Y ~ X, data=sampledata)
Since the error is gone, this may be a sufficient answer.
UPD: The reason for the error is that y.1 and x.1 do not exist in sampledata, which is passed as the df argument to cv.lm, so the formula y.1 ~ x.1 makes no sense inside cv.lm.
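For part b), the same reasoning suggests storing the transformed variables as columns of sampledata, so that the formula y.1 ~ x.1 can be resolved inside the data frame; a model fitted with data=sampledata could then presumably be passed to cv.lm the same way as fit 1. The sketch below illustrates the idea with a hand-written 10-fold loop in base R rather than cv.lm itself, so it makes no assumption about DAAG's arguments; the seed and the names fold, cv_pred and cv_mse are arbitrary.

```r
sampledata <- data.frame(
  X = c(277, 285, 297, 308, 308, 263, 278, 283, 268, 250,
        275, 274, 312, 294, 279, 300, 293, 298, 290, 273,
        289, 279, 326, 293, 256, 291, 283, 284, 298, 269),
  Y = c(5.20, 5.17, 4.96, 5.26, 5.11, 5.27, 5.20, 5.16, 5.17, 5.20,
        5.18, 5.09, 5.03, 5.21, 5.29, 5.14, 5.09, 5.16, 4.99, 5.23,
        5.32, 5.21, 5.14, 5.22, 5.15, 5.09, 5.09, 5.07, 5.27, 5.19)
)

# Store the transformed variables as columns so the formula can find them
sampledata$y.1 <- sampledata$Y / sampledata$X
sampledata$x.1 <- 1 / sampledata$X

set.seed(29)  # arbitrary seed for a reproducible fold assignment
k <- 10
fold <- sample(rep(seq_len(k), length.out = nrow(sampledata)))

cv_pred <- numeric(nrow(sampledata))
for (f in seq_len(k)) {
  test_rows <- fold == f
  # fit on the other folds, predict the held-out fold
  fit <- lm(y.1 ~ x.1, data = sampledata[!test_rows, ])
  cv_pred[test_rows] <- predict(fit, newdata = sampledata[test_rows, ])
}

# Cross-validated mean squared error of the transformed model
cv_mse <- mean((sampledata$y.1 - cv_pred)^2)
cv_mse
```

The key point is the data argument: each fold's model is refitted on a subset of sampledata, which only works when every variable in the formula is a column of that data frame.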
