I have a dataset column which contains values. When a new input is given, I want to check this column and find the percentile of that input value within it.
I tried the quantile function, but it returns the values at the 25th, 50th percentile and so on. I want the reverse: the percentile of a given value.
The following is my reproducible example,
data <- seq(90,100,length.out=1000)
input <- 97
My output should be the percentile of 97 in the data column. Is this possible to do?
Thanks
You may also use a somewhat more statistical version with an empirical cumulative distribution function:
ecdf(data)(input)
or
F <- ecdf(data)
F(input)
This approach also allows for vectorization over input.
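To illustrate the vectorization, here is a minimal sketch that reuses `data` from the question. The vector `inputs` is made up for the example.

```r
# Same data as in the question
data <- seq(90, 100, length.out = 1000)

# Build the empirical CDF once, then evaluate it at several inputs
F <- ecdf(data)
inputs <- c(90, 95, 97, 100)   # hypothetical query values
pct <- F(inputs)               # fraction of data <= each input

# Multiply by 100 to express the result as a percentile
data.frame(input = inputs, percentile = 100 * pct)
```

Note that ecdf() counts values less than or equal to the input, so it can differ from a strict "less than" count when the input ties with values in the data.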
I think you want to count the fraction of the data that are (is?) less than the input value:
mean(input>data)
## [1] 0.7
Related
I have an 11 x 8 data frame of numeric values in R that I want to find the standard deviation of. However, I cannot take the standard deviation of a matrix (use the sd() function), only the columns. But I need every data value used. How do I make this data frame into one column so that all values are used when finding the standard deviation? Hope this makes sense.
#generate data
df <- data.frame(matrix(rbinom(8*11, 1, .5), ncol=8))
#get sd
sd(unlist(df))
edit: just saw the comment where user fra got there first
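An equivalent route, sketched here on the assumption that every column is numeric and that you are on a current version of R, is to convert the data frame to a matrix first. The seed is arbitrary and only there to make the example repeatable.

```r
set.seed(1)  # arbitrary seed so the example is repeatable
df <- data.frame(matrix(rbinom(8 * 11, 1, 0.5), ncol = 8))

sd_unlist <- sd(unlist(df))      # flatten the columns into one vector
sd_matrix <- sd(as.matrix(df))   # current R treats the matrix as one long vector

# Both approaches use all 88 values, so they should agree
all.equal(sd_unlist, sd_matrix)
```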
I have the following data frame in R:
df <- data.frame(time=c("10:01","10:05","10:11","10:21"),
power=c(30,32,35,36))
Problem: I want to calculate the energy consumption, so I need the sum of the time differences multiplied by the power. But every row has one timestamp, meaning I need to do subtraction between two different rows. And that is the part I cannot figure out. I guess I would need some kind of function but I couldn't find online hints.
Example: It has to subtract row1$time from row2$time, and then multiply the result by row1$power.
As said, I do not know how to implement the step in one call, I am confused about the subtraction part since it takes values from different rows.
Expected output: E=662
Try this:
tmp = strptime(df$time, format="%H:%M")
df$interval = c(as.numeric(diff(tmp), units="mins"), NA)  # minutes until the next row
sum(df$interval*df$power, na.rm=TRUE)
I got 662 back.
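For reference, here is the same calculation as a self-contained sketch that requests the time unit explicitly and drops the last power value instead of padding with NA. The names `minutes` and `E` are just for illustration.

```r
df <- data.frame(time  = c("10:01", "10:05", "10:11", "10:21"),
                 power = c(30, 32, 35, 36))

# Parse the clock times; strptime() fills in today's date, which cancels out in diff()
tmp <- strptime(df$time, format = "%H:%M")

# Minutes between consecutive rows, with the unit requested explicitly
minutes <- as.numeric(diff(tmp), units = "mins")   # 4, 6, 10

# Each interval is weighted by the power at its start, so the last power value is unused
E <- sum(minutes * head(df$power, -1))
E  # 662
```

Asking for the unit explicitly matters because diff() picks the unit automatically, so data with gaps of hours or seconds would otherwise change the scale of the result.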
I am plotting data that consists of intervals that are more or less constant, plus spikes that arise because the data is a quotient of two parameters. The very high and very low quotients are not relevant for my purpose, so I have been looking for a way to filter them out. The dataset contains 40k+ values, so I cannot remove the high/low quotients manually.
Is there any function that can trim/filter out the very large/small quotients?
You can use the filter() function from dplyr. This can create a new dataframe without outliers that you can then plot. For example:
library(dplyr)  # without this, filter() resolves to stats::filter()
no_spikes <- filter(original_df, x > -100 & x < 100)
This would create a new dataframe, no_spikes, that only contains observations where the variable x is between the values -100 and 100.
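If you do not know sensible cut-offs in advance, one option is to derive them from the data. The sketch below is an assumption on my part, not part of the answer above: it uses base R only, invents some toy data, and keeps the central 98% of the values.

```r
set.seed(42)  # arbitrary seed for a repeatable toy example
# Toy data: mostly well-behaved values plus a few extreme spikes
original_df <- data.frame(x = c(rnorm(1000), 1e4, -1e4, 5e3))

# Assumed cut-offs: the 1st and 99th percentiles of x
limits <- quantile(original_df$x, probs = c(0.01, 0.99), na.rm = TRUE)

# Keep only the rows strictly inside the limits
keep <- original_df$x > limits[1] & original_df$x < limits[2]
no_spikes <- original_df[keep, , drop = FALSE]

range(no_spikes$x)
nrow(no_spikes)
```

The percentages are arbitrary; quantile-based trimming always removes a fixed share of the data, even when there are no spikes, so check the result against a plot.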
I have a data frame with 253 rows (locations on a chromosome in Mbps) and 1 column (allele score at each location). I need to produce a data frame which contains the mean of the allele score for every 0.5 Mbps along the chromosome. Please help with R code that can do this. Thanks.
The picture in this case is adequate to construct an answer but not adequate to support testing. You should learn to post data in a form that doesn't require re-entry by hand. (That's why you are accumulating negative votes.)
The basic R strategy would be to use cut to create a grouping variable and then use a loop construct to accumulate and apply the mean function. Presumably this is in a dataframe which I will assume is named something specific like my_alleles:
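tapply(my_alleles$Allele_score,   # act on this vector
       # in groups defined by this factor; the last break is rounded up to the
       # next multiple of 0.5 so the largest locations are not dropped
       cut(my_alleles$Location,
           breaks = seq(0, ceiling(max(my_alleles$Location) / 0.5) * 0.5, by = 0.5),
           include.lowest = TRUE),
       FUN = mean)                # with this function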
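Since no data was posted, here is a small worked sketch on invented values. The column names `Location` and `Allele_score` are the assumed names used in this answer, and the bin edges are extended to cover the largest location so that no row falls outside the last bin.

```r
# Invented toy data; the column names follow the assumption made above
my_alleles <- data.frame(
  Location     = c(0.1, 0.3, 0.6, 0.9, 1.2, 1.4),
  Allele_score = c(1, 3, 2, 4, 10, 20)
)

# Bin edges every 0.5, extended to cover the largest location
edges <- seq(0, ceiling(max(my_alleles$Location) / 0.5) * 0.5, by = 0.5)
bins  <- cut(my_alleles$Location, breaks = edges, include.lowest = TRUE)

# Mean score per bin
bin_means <- tapply(my_alleles$Allele_score, bins, mean)

# Turn the named result into a data frame, as the question asks for
result <- data.frame(bin = names(bin_means), mean_score = as.vector(bin_means))
result
```

Bins that contain no locations come back as NA, which you may want to drop or keep depending on the plot you have in mind.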
I am not used to R, so to practice I am trying to do in R everything that I used to do in SPSS.
In my dataset each row is a case. The columns are survey questions (1 per question).
Say I have columns "A1" up to "A6", "B1" to "B6" and so on
I just finished calculating the mean for each person on A1 to A6
data$meandata <- rowMeans(subset(data, select=c(A1:A6)), na.rm=TRUE)
How do I calculate the standard deviation of meandata?
Hey, the easiest way to do this is with the apply() function.
Assume you have 25 rows of data and 6 columns labeled A1 through A6.
data <- data.frame(A1=rnorm(25,50,4),A2=rnorm(25,50,4),A3=rnorm(25,50,4),
A4=rnorm(25,50,4),A5=rnorm(25,50,4),A6=rnorm(25,50,4))
You can use the apply() function to find the standard deviation of each row across columns 1 through 6 with the code below. The first argument is your data object. The second argument is an integer: 1 to apply the function over rows, 2 to apply it over columns. The final argument is the function you wish to apply, such as mean or, in this case, sd (standard deviation).
apply(data[,1:6],1,sd)
Indexing can be used to limit the number of rows or columns of data passed to the apply function. This is done by entering a vector of numbers for either the rows or columns you are interested in within brackets after your data object.
data[c(row.vector),c(column.vector)]
Say you only want to know the sd of the first 3 columns.
apply(data[,1:3],1,sd)
Now let's see the sd of columns 4 through 6 and rows 1 through 10
apply(data[1:10,4:6],1,sd)
Just for good measure, let's find the sd of each column
apply(data,2,sd)
Notice that the sd of each column is close to 4, which is what I specified when I generated the pseudo-random data for columns A1 through A6.
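To tie this back to the question, here is a rough sketch that stores the per-person sd next to the per-person mean and also computes the sd of the meandata column itself. The column name `sddata` and the seed are made up for the example.

```r
set.seed(123)  # arbitrary seed for repeatability
# Simulated survey data: 25 cases, items A1..A6
data <- as.data.frame(replicate(6, rnorm(25, 50, 4)))
names(data) <- paste0("A", 1:6)

items <- paste0("A", 1:6)
data$meandata <- rowMeans(data[, items], na.rm = TRUE)       # per-person mean
data$sddata   <- apply(data[, items], 1, sd, na.rm = TRUE)   # per-person sd

# Spread of the per-person means across all cases (a single number)
sd(data$meandata)
```

Which of the two you need depends on the question: `sddata` describes how much each person's answers vary across A1 to A6, while `sd(data$meandata)` describes how much the means vary between people.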
Hope this helps