After imputation, how to round to nearest level of a factor

I have imputed missing values in column q1 of the following data frame. q1 is a factor with the levels 0, 20, 40, 60, 80, 100. I need to round column q1 to the nearest factor level; here the values 89 and 85 need to be rounded. Does anyone have a solution for this problem? Thanks in advance!
id <- rep(c(300,450), each=6)
visit <- rep(1:6,2)
trt <- rep(c(0,"A",0,"B",0,"C"),2)
q1 <- c(0,100,0,89,0, 60,0,85,0,40,0, 20)
df <- data.frame(id,visit,trt,q1)
df
id visit trt q1
1 300 1 0 0
2 300 2 A 100
3 300 3 0 0
4 300 4 B 89
5 300 5 0 0
6 300 6 C 60
7 450 1 0 0
8 450 2 A 85
9 450 3 0 0
10 450 4 B 40
11 450 5 0 0
12 450 6 C 20

Maybe you could use plyr's round_any function here, which rounds to the nearest multiple of 20 in this case.
plyr::round_any(df$q1, 20)
#[1] 0 100 0 80 0 60 0 80 0 40 0 20
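If the levels were not evenly spaced, rounding to a multiple would no longer land on a valid level. A minimal base R sketch of snapping each value to its closest level, using the q1 vector from the question's code, might look like this. The names lev, nearest and q1_rounded are only illustrative, and sending ties to the lower level is an assumption about the desired behaviour.

```r
# Allowed factor levels and the imputed values from the question's code
lev <- c(0, 20, 40, 60, 80, 100)
q1  <- c(0, 100, 0, 89, 0, 60, 0, 85, 0, 40, 0, 20)

# For each value, pick the level with the smallest absolute distance.
# which.min returns the first minimum, so exact ties go to the lower level.
nearest <- sapply(q1, function(v) lev[which.min(abs(v - lev))])

# Store the result as a factor that carries the full set of levels
q1_rounded <- factor(nearest, levels = lev)
q1_rounded
```

For these values the result agrees with the round_any output above, since 89 and 85 are both closest to 80.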

Related

The value in one column depends on the value of another column

I want to set column q2 to zero in all rows where column q1 is 2. Does anyone have a smart solution?
a <- rep(c(300,450), each=3, times=2)
q1 <- rep(c(1,1,2,1,1,2),2)
q2 <- c(100,40,"",80,30,"" , 45,78,"",20,58,"")
df <- cbind(a,q1,q2)
df <- as.data.frame(df)
Original input data :
> df
a q1 q2
1 300 1 100
2 300 1 40
3 300 2
4 450 1 80
5 450 1 30
6 450 2
7 300 1 45
8 300 1 78
9 300 2
10 450 1 20
11 450 1 58
12 450 2
Desired output :
> df
a q1 q2
1 300 1 100
2 300 1 40
3 300 2 0
4 450 1 80
5 450 1 30
6 450 2 0
7 300 1 45
8 300 1 78
9 300 2 0
10 450 1 20
11 450 1 58
12 450 2 0

One option is to create a logical vector from column 'q1' and use it to assign 0 to the matching elements of 'q2':
df$q2[df$q1 == 2] <- 0
df
# a q1 q2
#1 300 1 100
#2 300 1 40
#3 300 2 0
#4 450 1 80
#5 450 1 30
#6 450 2 0
#7 300 1 45
#8 300 1 78
#9 300 2 0
#10 450 1 20
#11 450 1 58
#12 450 2 0
Another option is replace:
transform(df, q2 = replace(q2, q1 == 2, 0))
Note that cbind converts its arguments to a matrix first, so a single character element anywhere makes the whole matrix character. It is better to use data.frame directly (see the data below).
Or in data.table:
library(data.table)
setDT(df)[q1 == 2, q2 := '0']
data
df <- data.frame(a, q1, q2, stringsAsFactors = FALSE)
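Because the empty strings force q2 to be character, the assigned value ends up as the string "0". A minimal sketch that keeps q2 numeric, assuming the blanks can be represented as NA (this differs from the question's data, which uses ""), could be:

```r
# Same layout as the question, but blanks are NA so q2 stays numeric
a  <- rep(c(300, 450), each = 3, times = 2)
q1 <- rep(c(1, 1, 2), 4)
q2 <- c(100, 40, NA, 80, 30, NA, 45, 78, NA, 20, 58, NA)
df <- data.frame(a, q1, q2)

# Logical indexing: set q2 to 0 wherever q1 equals 2
df$q2[df$q1 == 2] <- 0
df
```

With a numeric q2, later arithmetic such as sum(df$q2) works without any conversion.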

Linear interpolation by multiple groupings in R

I have the following data set:
District Type DaysBtwn Start_Day End_Day Start_Vol End_Vol
1 A 0 3 0 31 28 23
2 A 1 3 0 31 24 0
3 B 0 3 0 31 17700 10526
4 B 1 3 0 31 44000 35800
5 C 0 3 0 31 5700 0
6 C 1 3 0 31 35000 500
For each combination of District and Type, I want to do a simple linear interpolation: with x = days (Start_Day and End_Day) and y = volumes (Start_Vol and End_Vol), I want the estimated volume returned for xout = DaysBtwn.
I have tried many things. I think I am having issues because of the way my data is set up. Can someone point me in the right direction for how to use the approx function in R to get the desired output? I don't mind rearranging my data set to get the correct format for approx.
Example of desired output:
District Type EstimatedVol
1 A 0 25
2 A 1 15
3 B 0 13000
4 B 1 39000
5 C 0 2500
6 C 1 25000
dt <- data.table(input)
interpolation <- dt[, approx(x,y,xout=z), by=list(input$District,input$Type)]

Why not simply calculate it directly?
dt$EstimatedVol <- with(dt, (End_Vol - Start_Vol) / (End_Day - Start_Day) * (DaysBtwn - Start_Day) + Start_Vol)
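As a rough check, the direct formula can be compared with a row-by-row call to approx. The sketch below rebuilds the question's table as a plain data frame named input (the column ApproxVol is only illustrative) and is meant as a sanity check rather than the only way to set this up.

```r
# The question's data as a plain data frame (scalars are recycled)
input <- data.frame(
  District  = c("A", "A", "B", "B", "C", "C"),
  Type      = c(0, 1, 0, 1, 0, 1),
  DaysBtwn  = 3,
  Start_Day = 0,
  End_Day   = 31,
  Start_Vol = c(28, 24, 17700, 44000, 5700, 35000),
  End_Vol   = c(23, 0, 10526, 35800, 0, 500)
)

# Direct two-point linear interpolation, vectorised over all rows
input$EstimatedVol <- with(
  input,
  (End_Vol - Start_Vol) / (End_Day - Start_Day) * (DaysBtwn - Start_Day) + Start_Vol
)

# The same thing via approx, called once per row with two (x, y) points
input$ApproxVol <- mapply(
  function(x0, x1, y0, y1, xout) approx(c(x0, x1), c(y0, y1), xout = xout)$y,
  input$Start_Day, input$End_Day, input$Start_Vol, input$End_Vol, input$DaysBtwn
)

# Both columns should agree up to floating point error
all.equal(input$EstimatedVol, input$ApproxVol)
```

Since every row is already its own District and Type combination with exactly two points, no grouping step is needed here.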

Calculating table in R with uneven length

I have two tables of data in R
a = Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
1 2 0 0 0 2
2 3 0 0 10 3
3 4 0 51 25 0
4 5 19 129 14 0
5 6 60 137 1 0
6 7 31 62 15 5
7 8 7 11 7 0
and
b = Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
1 1 0 0 1 266
2 2 1 0 47 335
3 3 1 26 415 142
4 4 3 965 508 5
5 5 145 2535 103 0
6 6 939 2239 15 6
7 7 420 613 86 34
8 8 46 84 36 16
I would like to calculate b/a by matching on the duration. I thought of something like ifelse(), but it does not work. Can someone please help me?
Thanks a lot

Match the order and selection of b to a (in my example, y to x). Then do the math.
x <- data.frame(duration = 2:8, v = rnorm(7))
y <- data.frame(duration = 8:1, v = rnorm(8))
# for each duration in x, find the row of y with the same duration
m <- match(x$duration, y$duration)
ym <- y[m[!is.na(m)],]
# y plays the role of b and x the role of a, so this is b/a
ym$v/x$v
This does not work when x contains items that are not in y, by the way.

Do you want something like the following?
a <- a[-1]
b <- b[-1]
a <- a[order(a$Duration),]
b <- b[order(b$Duration),]
durations <- intersect(a$Duration, b$Duration)
b[b$Duration %in% durations,] / a[a$Duration %in% durations,]
Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
2 1 Inf NaN Inf 167.50000
3 1 Inf Inf 41.500000 47.33333
4 1 Inf 18.921569 20.320000 Inf
5 1 7.631579 19.651163 7.357143 NaN
6 1 15.650000 16.343066 15.000000 Inf
7 1 13.548387 9.887097 5.733333 6.80000
8 1 6.571429 7.636364 5.142857 Inf
You may want to replace the NaN and Inf values with something else.
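Another way to line up the durations is merge, which keeps only the durations present in both tables. The sketch below uses just the first two count columns from the question, renamed n1 and n2 for convenience (the names, and turning Inf and NaN into NA, are assumptions made for illustration).

```r
# First two count columns of the question's tables, with simplified names
a <- data.frame(Duration = 2:8,
                n1 = c(0, 0, 0, 19, 60, 31, 7),
                n2 = c(0, 0, 51, 129, 137, 62, 11))
b <- data.frame(Duration = 1:8,
                n1 = c(0, 1, 1, 3, 145, 939, 420, 46),
                n2 = c(0, 0, 26, 965, 2535, 2239, 613, 84))

# Inner join on Duration; columns get the suffixes .a and .b
ab <- merge(a, b, by = "Duration", suffixes = c(".a", ".b"))

# Divide the b columns by the matching a columns
value_cols <- setdiff(names(a), "Duration")
ratio <- as.matrix(ab[paste0(value_cols, ".b")]) /
         as.matrix(ab[paste0(value_cols, ".a")])

# Division by zero gives Inf or NaN; mark those as missing
ratio[!is.finite(ratio)] <- NA
colnames(ratio) <- value_cols

result <- data.frame(Duration = ab$Duration, ratio)
result
```

Because the join is done on Duration itself, the row order of the two inputs does not matter.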

How to remove rows based on distance from an average of column and max of another column

Consider this toy data frame. I would like to create a new data frame that keeps only the rows where "birds" is below its average and "wolfs" is less than the two top values after the maximum value of "wolfs". So from this data frame I would get only the rows 543, 608, 987, 225, 988, 556.
I used these two lines of code for the first constraint but couldn't find a solution for the second constraint.
df$filt<-ifelse(df$birds<mean(df$birds),1,0)
df1<-df[which(df$filt==1),]
How can I create the second constraint?
Here is the toy dataframe:
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 1
608 0 1 5
123 1 9 7
321 1 8 7
226 0 2 7
556 0 2 3
334 1 6 3
225 0 1 1
999 0 3 9
988 0 1 1 ",header = TRUE)

subset(df, birds < mean(birds) & wolfs < sort(unique(wolfs), decreasing=TRUE)[3])
## userid target birds wolfs
## 4 543 1 2 3
## 6 987 0 1 2
## 8 608 0 1 5
## 12 556 0 2 3
## 14 225 0 1 1
## 16 988 0 1 1

Here is a solution, although some of the constraints may not be clear to me, since it can select a different row compared with your desired output.
avbi <- mean(df$birds)
ttw <- sort(df$wolfs, decreasing = TRUE)[3]
df[df$birds < avbi & df$wolfs < ttw , ]
userid target birds wolfs
4 543 1 2 3
6 987 0 1 2
8 608 0 1 5
12 556 0 2 3
14 225 0 1 1
16 988 0 1 1
or with dplyr
library(dplyr)
df %>% filter(birds < avbi & wolfs < ttw)
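The two answers differ only in whether duplicates are dropped before taking the third-largest value. A small sketch of the difference, using the wolfs column from the question plus a made-up vector named ties that is purely illustrative:

```r
# wolfs column from the toy data frame
wolfs <- c(7, 4, 8, 3, 3, 2, 1, 5, 7, 7, 7, 3, 3, 1, 9, 1)

# Third-largest distinct value versus third element of the sorted vector
third_distinct <- sort(unique(wolfs), decreasing = TRUE)[3]
third_plain    <- sort(wolfs, decreasing = TRUE)[3]

# A hypothetical vector where the maximum is repeated
ties <- c(9, 9, 8, 7)
ties_distinct <- sort(unique(ties), decreasing = TRUE)[3]
ties_plain    <- sort(ties, decreasing = TRUE)[3]

c(third_distinct, third_plain, ties_distinct, ties_plain)
```

On the question's data both thresholds coincide, which is why the two answers return the same rows; with repeated top values they can diverge, so the unique version is the safer reading of "the two top values after the maximum".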

Calculating Confidence Intervals for two datasets

I've asked quite a number of questions today.
I'd like to calculate the Confidence Interval (99% level, not 95) for the mean value of variable age of two dataframes, infert_control and infert_patient where:
infert_control = subset(infert$age, infert$case == 0)
infert_patient = subset(infert$age, infert$case == 1)
infert is a built-in R dataset; for those not familiar with it, here it is. Case 0 denotes the control-group patients, case 1 the actual ones.
> infert
education age parity induced case spontaneous stratum pooled.stratum
1 0-5yrs 26 6 1 1 2 1 3
2 0-5yrs 42 1 1 1 0 2 1
3 0-5yrs 39 6 2 1 0 3 4
4 0-5yrs 34 4 2 1 0 4 2
5 6-11yrs 35 3 1 1 1 5 32
6 6-11yrs 36 4 2 1 1 6 36
7 6-11yrs 23 1 0 1 0 7 6
8 6-11yrs 32 2 0 1 0 8 22
9 6-11yrs 21 1 0 1 1 9 5
10 6-11yrs 28 2 0 1 0 10 19
11 6-11yrs 29 2 1 1 0 11 20
...
239 12+ yrs 38 6 0 0 2 74 63
240 12+ yrs 26 2 1 0 1 75 49
241 12+ yrs 31 1 1 0 0 76 45
242 12+ yrs 31 2 0 0 1 77 53
243 12+ yrs 25 1 0 0 1 78 41
244 12+ yrs 31 1 0 0 1 79 45
245 12+ yrs 34 1 0 0 0 80 47
246 12+ yrs 35 2 2 0 0 81 54
247 12+ yrs 29 1 0 0 1 82 43
248 12+ yrs 23 1 0 0 1 83 40
What would be the correct way to solve this?
I've already calculated the Mean value of column age for both infert_control and infert_patient, plus the standard deviation of each subset.

You could use bootstrap for this:
library(boot)
set.seed(42)
boot_mean <- boot(infert_control, function(x, i) mean(x[i]), R=1e4)
quantile(boot_mean$t, probs=c(0.005, 0.995))
# 0.5% 99.5%
# 30.47273 32.58182
Or if you don't want to use a library:
set.seed(42)
R <- 1e4
boot_mean <- colMeans(
  matrix(
    sample(infert_control, R * length(infert_control), TRUE),
    ncol=R))
quantile(boot_mean, probs=c(0.005, 0.995))
# 0.5% 99.5%
#30.42424 32.55152

So many answers...
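The mean of a random sample, standardized with the sample standard deviation, follows a t-distribution rather than a normal one (assuming roughly normal data), although t -> N as df -> Inf.
cl <- function(data,p) {
  n <- length(data)
  # half-width of the interval from the upper p/2 quantile of the t-distribution
  cl <- qt(p/2,n-1,lower.tail=FALSE)*sd(data)/sqrt(n)
  m <- mean(data)
  return(c(lower=m-cl,upper=m+cl))
}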
cl.control <- cl(infert_control,0.01)
cl.control
# lower upper
# 30.42493 32.55689
cl.patient <- cl(infert_patient,0.01)
cl.patient
# lower upper
# 30.00221 33.05803
aggregate(age~case,data=infert,cl,p=0.01) # much better way...
# case age.lower age.upper
# 1 0 30.42493 32.55689
# 2 1 30.00221 33.05803
Also, the quantile functions (e.g. qt(...) and qnorm(...)) return the lower tail by default, so your limits would be reversed unless you set lower.tail=FALSE.

You could easily calculate the confidence interval manually:
infert_control <- subset(infert$age, infert$case == 0)
# calculate needed values
m <- mean(infert_control)
s <- sd(infert_control)
n <- length(infert_control)
# calculate error for normal distribution (choose your distribution here, e.g. qt for t-distribution)
a <- 0.995 # 99% CI => 0.5% on both sides
error <- qnorm(a)*s/sqrt(n)
# calculate CI
ci_lower <- m-error
ci_upper <- m+error
See also http://en.wikipedia.org/wiki/Confidence_interval (sorry for a wikipedia link, but it has a good explanation and shows you the formula)

... or as small function:
cifun <- function(data, ALPHA){
  c(mean(data) - qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data)),
    mean(data) + qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data)))
}
cifun(infert_control, 0.01)
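For the t-based interval, base R's t.test also reports a confidence interval directly, which makes a handy cross-check on the hand-written versions. A minimal sketch, assuming the same infert_control subset as in the question (the names ci_t, half_width and manual are only illustrative):

```r
# Ages of the control group from the built-in infert dataset
infert_control <- infert$age[infert$case == 0]

# One-sample t-test; conf.int holds the 99% interval for the mean
ci_t <- t.test(infert_control, conf.level = 0.99)$conf.int

# Manual t-based interval for comparison
n <- length(infert_control)
half_width <- qt(0.995, df = n - 1) * sd(infert_control) / sqrt(n)
manual <- mean(infert_control) + c(-1, 1) * half_width

ci_t
manual
```

Both should agree with the t-based limits shown earlier, while the qnorm version gives a slightly narrower interval because it ignores the extra uncertainty in the estimated standard deviation.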
