Writing an R function

I want to write an R function that gives me the value at the 10th percentile of my observations, and I want to use this function with sapply. For the mean, for example, the sapply call is
sapply(1:n, function(i) mean(a1))
Say, for example, I have 100 values; the 10th percentile of 100 is 10, so I want the value on the 10th line to be printed.
X
45
80
70
56
78
78
56
90
35
190
.......... up to 100 values
Desired output: print the value on the 10th line, i.e. 190 in the column above.
I want the function to first calculate the 10th percentile of my 100 observations and then just print the value at that position.

The desired function will look like:
give_quant_value <- function(vec, quantile) {
  return(vec[quantile(1:length(vec), p = quantile, type = 1)])
}
where
- vec is the vector of interest
- quantile is the quantile (probability) of interest
Proof:
set.seed(42)
a1 <- rnorm(100)
give_quant_value(a1, 0.1)
#[1] -0.0627141
You can read more about the quantile function by typing ?quantile in your R console.
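If you want to call it through sapply, as in the question, a minimal sketch could look like this (a1 is the vector from the proof above; the set of probabilities is my own choice):
# values at the 10th, 25th and 50th percentile positions of a1
sapply(c(0.1, 0.25, 0.5), function(p) give_quant_value(a1, p))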

Here is a simple function I wrote. It takes the data as a data.frame object.
select_percentile <- function(df, n) {
  # n is the desired percentile (e.g. 10 for the 10th percentile)
  nth <- ceiling((n / 100) * nrow(df))
  return(df[nth, ])
}
Using data from another answer:
set.seed(42)
a1 <- rnorm(100)
df1<-as.data.frame(a1)
select_percentile(df1,10)
Result:
#[1] -0.0627141


Simple If statement. Comparing non-numeric values

My data:
pirmas antras trecias
17 44 55
788 890 1409
968 218 344
333 355 NA
I want to check which correlation is bigger:
the correlation between pirmas and antras columns
or the correlation between antras and trecias columns
Next, I want to write an if statement.
If the correlation between the antras and trecias columns is bigger, I fill the NA value in the last column with the value from the antras column.
BUT I get an error, because cor.test returns a test object rather than a single numeric value, so I cannot compare the results in an if statement.
How can I do this?
My source code:
data <- X12_5_3
data
a <- cor.test(data$pirmas, data$trecias)
b <- cor.test(data$antras, data$trecias)
if (a < b) {
  data$trecias[4] <- data$antras[4]
}
data
You can extract the correlation value from the test objects with $estimate.
set.seed(7)
a <- cor.test(rnorm(5), rnorm(5))
b <- cor.test(rnorm(5), rnorm(5))
if (a$estimate < b$estimate) {
  print('correlation of a smaller than b')
}
If you don't need a hypothesis test, just use cor() to get the correlation coefficient. Also, because of the missing values, you need to set the use argument to handle them.
a <- cor(data$pirmas, data$trecias, use = "complete.obs")
b <- cor(data$antras, data$trecias, use = "complete.obs")
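Putting it together for the question's data, a minimal sketch could look like this (object and column names are taken from the question; filling every missing trecias value is my assumption):
a <- cor(data$pirmas, data$trecias, use = "complete.obs")
b <- cor(data$antras, data$trecias, use = "complete.obs")
if (b > a) {
  # fill the missing trecias entries with the corresponding antras values
  data$trecias[is.na(data$trecias)] <- data$antras[is.na(data$trecias)]
}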

R loop to execute the same code

I'm trying to figure out how to repeat the same code 30 times without typing each one at a time... any help will be much appreciated.
SRS_1 <- sample(1:nrow(MyData_points), size=.10*nrow(MyData_points))
data_sample_1 <- MyData_points[SRS_1,]
fpc.srs <- rep(6399875, 639987)
design_SRS_1 <- svydesign(id=~1, strata=NULL, data=data_sample_1, fpc=fpc.srs)
ONStotal_SRS1 <- svytotal(~data_sample_1$V4, design=design_SRS_1)
ONSmean_SRS1 <- svymean(~data_sample_1$V4, design=design_SRS_1)
CI_SRS_1 <- confint(svytotal(~data_sample_1$V4, design=design_SRS_1))
The first line draws a simple random sample of 10% of the rows of the data. The second extracts that sample from the data. The third builds the fpc, repeated once for each of the sampled points (10% of the total). Then, to estimate the population, I define a design for the sample without replacement, including the fpc. With the last three lines I calculate a population total estimate, a mean, and a confidence interval based on that sample.
What changes is that I must repeat 30 different simple random samplings from the data, so the resulting estimates, means, and confidence intervals will come from 30 different samples. They might be close, but not equal.
How can I make this code better so that I can run it 30 times and print a table with (ONStotal_SRS1, ONSmean_SRS1, CI_SRS_1)?
Usually I would use either rbindlist from the data.table package or bind_rows from dplyr in combination with an lapply to build the table a row at a time and then bind the rows together. Here is an example using bind_rows with the mtcars data set:
library(dplyr)
combined_data <- bind_rows(lapply(1:30, function(...) {
  # Take a sample
  SRS_1 <- sample(1:nrow(mtcars), size = .10 * nrow(mtcars))
  data_sample_1 <- mtcars[SRS_1, ]
  # Compute some things from the sample
  m_disp <- mean(data_sample_1$disp)
  m_hp <- mean(data_sample_1$hp)
  # Make a one-row data.frame that will be returned by the function
  data.frame(m_disp, m_hp)
}))
Which gives this data.frame:
> str(combined_data)
'data.frame': 30 obs. of 2 variables:
$ m_disp: num 235 272 410 115 249 ...
$ m_hp : num 147 159 195 113 154 ...
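Applied to the survey code from the question, a sketch could look something like this (MyData_points, V4 and the population size 6399875 are taken from the question; treating the fpc as one population-size value per sampled row is my assumption):
library(survey)
library(dplyr)
combined_SRS <- bind_rows(lapply(1:30, function(...) {
  # draw a 10% simple random sample
  SRS <- sample(1:nrow(MyData_points), size = .10 * nrow(MyData_points))
  data_sample <- MyData_points[SRS, ]
  # design without replacement; fpc repeated for each sampled row (assumption)
  design_SRS <- svydesign(id = ~1, strata = NULL, data = data_sample,
                          fpc = rep(6399875, nrow(data_sample)))
  tot <- svytotal(~V4, design = design_SRS)
  ci <- confint(tot)
  # one row per replication
  data.frame(total = coef(tot),
             mean = coef(svymean(~V4, design = design_SRS)),
             ci_lower = ci[1, 1],
             ci_upper = ci[1, 2])
}))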

Apply a function that requires seq() in R

I am trying to run a summation on each row of a data frame. Let's say I want to take the sum of 100n^2, from n=1 to n=4.
> df <- data.frame(n = seq(1:4),a = rep(100))
> df
n a
1 1 100
2 2 100
3 3 100
4 4 100
Simpler example:
Let's make fun our example summation function. I can pull the 100 out because I can just multiply it back in later.
fun <- function(x) {
  i <- seq(1, x, 1)
  sum(i^2)
}
I want to then apply this function to each row to the dataframe, where df$n provides the upper bound of the summation.
The desired outcome would be as follows, in df$b:
> df
n a b
1 1 100 1
2 2 100 5
3 3 100 14
4 4 100 30
To achieve these results I've tried the apply function:
apply(df$n, 1, fun)
and also with df converted into a matrix:
mat <- as.matrix(df)
apply(mat[1, ], 1, fun)
Both return an error:
Error in seq.default(1, x, 1) : 'to' must be of length 1
I understand this error, in that I understand why seq requires a 'to' value of length 1. I don't know how to go forward.
I have also tried the same while reading the dataframe as a matrix.
Maybe less simple example:
In my case I only need to multiply the results above, df$b, by 100 (or df$a) to get my final answer for each row. In other cases, though, the second value might be more deeply involved, for example a^i. How would I use both variables, a and n?
Underlying question:
My underlying goal is to apply a summation to each row of a dataframe (or a matrix). The above questions stem from my attempt to do so using seq(), as I saw advised in an answer on this site. I will gladly accept an answer that obviates the above questions with a different way to run a summation.
seq() doesn't take a vector for from or to, so we can loop over the elements and apply the function to each one:
df$b <- sapply(df$n, fun)
df$b
#[1] 1 5 14 30
Or we can Vectorize the function:
Vectorize(fun)(df$n)
#[1] 1 5 14 30
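For the "less simple" case, where the function needs both n and a (for example a^i summed over i = 1..n), one option is to pass both columns to mapply. Here fun2 is a hypothetical helper, not something from the question:
fun2 <- function(n, a) {
  i <- seq(1, n, 1)
  sum(a^i)
}
df$c <- mapply(fun2, df$n, df$a)
df$c
#[1]       100     10100   1010100 101010100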

Repeated subsetting of the same matrix using apply in R

Motivation: I am currently trying to rethink my coding such as to exclude for-loops where possible. The below problem can easily be solved with conventional for-loops, but I was wondering if R offers a possibility to utilize the apply-family to make the problem easier.
Problem: I have a matrix, say X (an n x k matrix), and two matrices of start and stop indices, called index.starts and index.stops, respectively. They are of size n x B and it holds that index.stops = index.starts + m for some integer m. Each pair index.starts[i,j] and index.stops[i,j] is needed to subset X as X[(index.starts[i,j]:index.stops[i,j]), ]. I.e., they should select all the rows of X in their index range.
Can I solve this problem using one of the apply functions?
Application: (Not necessarily important for understanding my problem.) In case you are interested, this is needed for a block bootstrap in a time series application. X represents the original sample. index.starts is sampled as replicate(repetitionNumber, sample.int((n-r), ceiling(n/r), replace=TRUE)) and index.stops is obtained as index.stops = index.starts + m. What I want in the end is a collection of rows of X. In particular, I want to resample, repetitionNumber times, blocks of length r from X.
Example:
#generate data
n<-100 #the size of your sample
B<-5 #the number of columns for index.starts and index.stops
#and equivalently the number of block bootstraps to sample
k<-2 #the number of variables in X
X<-matrix(rnorm(n*k), nrow=n, ncol = k)
#take a random sample of the indices 1:100 to get index.starts
r<-10 #this is the block length
#get a sample of the indices 1:(n-r), and get ceiling(n/r) of these
#(for n=100 and r=10, ceiling(n/r) = n/r = 10). Replicate this B times
index.starts<-replicate(B, sample.int((n-r), ceiling(n/r), replace=TRUE))
index.stops<-index.starts + r
#Now can I use apply-functions to extract the r subsequent rows that are
#paired in index.starts[i,j] and index.stops[i,j] for i = 1,2,...,10 = ceiling(n/r) and
#j=1,2,3,4,5=B ?
It's probably way more complicated than what you want/need, but here is a first approach. Just comment if it helps you in any way and I am happy to help.
My approach uses (multiple) *apply functions. The first lapply "loops" over the 1:B cases, where it first calculates the start and end points, which are combined into take.rows (the subsetting indices). Next, the initial matrix is subsetted by take.rows (and returned in a list). As a last step, the standard deviation is taken for each column of the subsetted matrices (as a dummy function).
The code (with heavy commenting) looks like this:
# you can use lapply in parallel mode if you want to speed up code...
lapply(1:B, function(i) {
  starts <- sample.int((n - r), ceiling(n / r), replace = TRUE)
  # [1] 64 22 84 26 40  7 66 12 25 15
  ends <- starts + r
  take.rows <- Map(":", starts, ends)
  # [[1]]
  # [1] 72 73 74 75 76 77 78 79 80 81 82
  # ...
  res <- lapply(take.rows, function(subs) X[subs, ])
  # res is now a list of 10 with the ten subsets
  # [[1]]
  #            [,1]        [,2]
  # [1,]  0.2658915 -0.18265235
  # [2,]  1.7397478  0.66315385
  # ...
  # say you want to compute something (sd in this case) you can do the following,
  # but better do the computing directly in the former "lapply(take.rows...)"
  res2 <- t(sapply(res, function(tmp) {
    apply(tmp, 2, sd)
  })) # simplify into a vector/data.frame
  #           [,1]      [,2]
  # [1,] 1.2345833 1.0927203
  # [2,] 1.1838110 1.0767433
  # [3,] 0.9808146 1.0522117
  # ...
  return(res2)
})
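If you would rather reuse the index.starts and index.stops matrices already built in the example, instead of re-sampling inside the loop, a compact sketch is:
# for each bootstrap replication j, collect the rows of X selected by
# the j-th columns of index.starts and index.stops
subsets <- lapply(1:B, function(j) {
  rows <- unlist(Map(":", index.starts[, j], index.stops[, j]))
  X[rows, ]
})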
Does that point you in the right direction / give you the answer?

How to sum data.frame column values?

I have a data frame with several columns, some numeric and some character. How do I compute the sum of a specific column? I’ve googled for this and I see numerous functions (sum, cumsum, rowsum, rowSums, colSums, aggregate, apply), but I can’t make sense of it all.
For example suppose I have a data frame people with the following columns
people <- read.table(
  text =
    "Name Height Weight
    Mary 65 110
    John 70 200
    Jane 64 115",
  header = TRUE
)
…
How do I get the sum of all the weights?
You can just use sum(people$Weight).
sum sums up a vector, and people$Weight retrieves the weight column from your data frame.
Note - you can get built-in help by using ?sum, ?colSums, etc. (by the way, colSums will give you the sum for each column).
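For example, to sum every numeric column at once (a small sketch; the Name column is left out because it is not numeric):
colSums(people[, c("Height", "Weight")])
# Height Weight
#    199    425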
To sum values in a data.frame you first need to extract them as a vector.
There are several ways to do it:
# $ operator
x <- people$Weight
x
# [1] 110 200 115
Or using [ , ] indexing, as with a matrix:
x <- people[, 'Weight']
x
# [1] 110 200 115
Once you have the vector you can use any vector-to-scalar function to aggregate the result:
sum(people[, 'Weight'])
# [1] 425
If you have NA values in your data, you should specify the na.rm argument:
sum(people[, 'Weight'], na.rm = TRUE)
You can use the tidyverse package to solve it; it would look like the following (which is more readable for me):
library(tidyverse)
people %>%
  summarise(sum(Weight, na.rm = TRUE))
When you have NA values in the column:
sum(as.numeric(JuneData1$Account.Balance), na.rm = TRUE)
To order the columns of a numeric data frame by their column sums:
order(colSums(people), decreasing = TRUE)
If there are 20+ columns and you only want some of them:
order(colSums(people[, c(5:25)]), decreasing = TRUE) ## here the first 4 columns are left out
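Note that colSums() only works on numeric columns, so with the people data from this question you would drop the Name column first, for example:
order(colSums(people[, c("Height", "Weight")]), decreasing = TRUE)
# [1] 2 1   (Weight, sum 425, comes before Height, sum 199)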
