Generate random and unique probabilities based on the data in a column (R / RStudio)

I have a database with different ages (16-85). I'm trying to assign a random probability to each age, subject to the following conditions:
The random probability should be between 20 and 100.
The probability should be higher at the lowest and highest ages.
For example:
I have this data frame:
Age  Probability
16   85
18   80
25   70
30   50
35   40
65   60
75   65
85   70
I'd really appreciate it if you could help me with this.
Thank you so much!
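The question is unanswered in this extract, but here is a minimal base-R sketch of one way to do it (my own illustration, not from the thread): score each age by its distance from the midpoint of the age range, rescale that score into the 20-100 band, and add uniform jitter so the probabilities come out random and effectively unique while staying highest at the extremes. The exact values in the example table won't be reproduced, and the jitter width of 10 is an arbitrary choice.

set.seed(42)
ages <- c(16, 18, 25, 30, 35, 65, 75, 85)

# Distance of each age from the midpoint of the range:
# largest at the extremes, smallest in the middle.
mid <- (min(ages) + max(ages)) / 2
d   <- abs(ages - mid)

# Rescale the distances into the 20-100 band, then add jitter and
# clamp so every probability stays within [20, 100].
base <- 20 + 80 * (d - min(d)) / (max(d) - min(d))
prob <- pmin(100, pmax(20, base + runif(length(ages), -10, 10)))

data.frame(Age = ages, Probability = round(prob, 1))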

Related

Randomly Select 10 percent of data from the whole data set in R

For my project, I have a data set with 1,296,765 observations of 23 columns, and I want to take a random 10% of it. How can I do that in R?
I tried the code below, but it sampled only 10 rows rather than 10% of the data. I am a beginner, so please help.
library(dplyr)
x <- sample_n(train, 10)
Here is a function from dplyr that selects rows at random by a given proportion:
dplyr::slice_sample(train, prop = 0.1)
In base R, you can subset by sampling a proportion of nrow():
set.seed(13)
train <- data.frame(id = 1:101, x = rnorm(101))
train[sample(nrow(train), nrow(train) / 10), ]
id x
69 69 1.14382456
101 101 -0.36917269
60 60 0.69967564
58 58 0.82651036
59 59 1.48369123
72 72 -0.06144699
12 12 0.46187091
89 89 1.60212039
8 8 0.23667967
49 49 0.27714729
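As a quick sanity check (my own addition): either approach returns roughly 10% of the rows, and calling set.seed first makes the sample reproducible.

set.seed(13)
# about 10 of the 101 example rows
nrow(dplyr::slice_sample(train, prop = 0.1))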

How to calculate 95% confidence interval for a proportion in R?

Assume I own a factory that produces 150 screws a day and there is a 22% error rate. Now I am going to estimate how many screws are faulty each day for a year (365 days) with
rbinom(n = 365, size = 150, prob = 0.22)
which generates 365 values in this way
45 31 35 31 34 37 33 41 37 37 26 32 37 38 39 35 44 36 25 27 32 25 30 33 25 37 36 31 32 32 43 42 32 33 33 38 26 24 ...
Now, for each value generated, I am supposed to calculate a 95% confidence interval for the proportion of faulty screws on that day.
I am not sure how I can do this. Are there any built-in functions for this (I am not supposed to use any packages), or should I write my own function?
If the number of trials per day is large enough and the probability of failure is not too extreme, you can use the normal approximation: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
# number of failures for each of the 365 days
f <- rbinom(365, size = 150, prob = 0.22)
# daily failure rates
p <- f / 150
# 95% confidence interval for the failure rate, for each day
upper <- p + 1.96 * sqrt(p * (1 - p) / 150)
lower <- p - 1.96 * sqrt(p * (1 - p) / 150)
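If you want a cross-check without installing anything (my own addition; both functions ship with R's stats package), binom.test() gives an exact Clopper-Pearson interval and prop.test() a score-based interval for a single day's count:

# exact (Clopper-Pearson) interval for the first day's proportion
binom.test(f[1], 150)$conf.int
# Wilson score interval (with continuity correction) for the same day
prop.test(f[1], 150)$conf.int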

What can I do to find and remove semi-duplicate rows in a matrix?

Assume I have this matrix
set.seed(123)
x <- matrix(rnorm(410),205,2)
x[8,] <- c(0.13152348, -0.05235148) #similar to x[5,]
x[16,] <- c(1.21846582, 1.695452178) #similar to x[11,]
The inserted values are very similar to the rows noted in the comments, and in the context of the whole data they are semi-duplicates. What could I do to find and remove them? My original data is an array that contains many such matrices, but the position of the semi-duplicates is the same across all matrices.
I know of agrep, but as far as I understand, that function operates on character vectors.
You will need to set a threshold, but you can simply compute the distance between each pair of rows using dist and find the points that are sufficiently close together. Of course, each point is near itself, so you need to ignore the diagonal of the distance matrix.
DM = as.matrix(dist(x))
diag(DM) = 1 ## ignore diagonal
which(DM < 0.025, arr.ind=TRUE)
row col
8 8 5
5 5 8
16 16 11
11 11 16
48 48 20
20 20 48
168 168 71
91 91 73
73 73 91
71 71 168
This finds the "close" points that you created and a few others that got generated at random.
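To actually remove the semi-duplicates (my own sketch building on the code above; the 0.025 threshold is taken from the answer): each close pair appears twice in the which() output, once as (i, j) and once as (j, i), so keep one orientation and drop the higher-indexed row of each pair.

hits <- which(DM < 0.025, arr.ind = TRUE)
# keep only one orientation of each (i, j) / (j, i) pair,
# then drop the higher-indexed member of each pair
to_drop <- unique(hits[hits[, "row"] > hits[, "col"], "row"])
x_clean <- if (length(to_drop)) x[-to_drop, ] else x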

R - Read matrix data from a file with empty spaces

I have a map of efficiency data stored as a matrix in a text file. As the row number increases, data points are missing at the end of the line, since no data are available there. When I try to read the data as a matrix in R:
mymatrix = as.matrix(read.table(file="~/Desktop/Map_.txt"))
I get an error message saying that line 3 (in this case the first line with missing values at the end) does not have the expected number of elements.
Is there a way to read the map file as a matrix even with empty spaces (missing values) in some lines?
Thank you!
example:
1000 2000 3000 4000 5000
-1 85 75 65 60 58
-2 86 74 64 58 52
-3 83 78 68 59
-4 86 80 72
-5 86 81 71
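The question is unanswered in this extract, but a minimal base-R sketch (assuming the file is whitespace-delimited exactly as shown): read.table() can pad short lines with NA via fill = TRUE, and because the header line has one entry fewer than the data lines, header = TRUE should make the first column (-1, -2, ...) the row names.

# fill = TRUE pads short lines with NA so ragged rows still parse
mymatrix <- as.matrix(read.table("~/Desktop/Map_.txt", header = TRUE, fill = TRUE))

One caveat: with fill = TRUE, read.table() infers the number of columns from the first five lines of the file, so this works here only because the early lines are complete.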

How to get the imputed values back in R

Is there any function in R that can return just the imputed values? For example:
x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,23)
Using the single linear interpolation method (na.approx() from the zoo package),
library(zoo)
na.approx(x)
I get the imputed data as:
[1] 23.0 23.0 25.0 43.0 34.0 22.0 78.0 88.0 98.0 23.0 30.0 25.5 21.0 78.0 22.0
[16] 76.0 76.5 77.0 33.0 98.0 22.0 37.0 52.0 87.0 55.0 23.0 23.0
How can I get the imputed values back from the program without looking through the completed dataset one by one? For example, if the imputed data contain n = 200 observations of which 20 were missing, can I get just those 20 estimates?
I am not 100 percent sure I understood you correctly, but does this help?
First, save the positions at which the original NA values occur; e.g., the first NA value is at the 8th position. Store these in a dummy variable:
dummy <- NA
for (i in 1:length(x)) {
  # record the index wherever x has an NA
  if (is.na(x[i])) dummy[i] <- i
}
Now get the corresponding values from the imputed data:
library(zoo)  # for na.approx
imputeddata <- na.approx(x)
for (i in 1:length(imputeddata)) {
  # dummy[i] is the index of an original NA (or NA itself, which is skipped)
  if (!is.na(imputeddata[dummy[i]])) print(imputeddata[dummy[i]])
}
You could use is.na() on the original vector to select only the values that were previously NA:
> x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,23)
> na.approx(x)[is.na(x)]
[1] 88.0 25.5 76.5 37.0 55.0
Hope that helps.
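If you also want the positions alongside the imputed values (my own addition), which(is.na(x)) pairs naturally with the same indexing trick:

library(zoo)  # for na.approx
pos <- which(is.na(x))  # positions of the original NAs
data.frame(position = pos, imputed = na.approx(x)[pos])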
