How to calculate column mean at intervals of row values in R? - r

I have dataframe which has 253 rows(locations on a chromosome in Mbps) and 1 column (Allele score at each location). I need to produce a dataframe which contains the mean of the allele score at every 0.5 Mbps on the chromosome. Please help with R code that can do this. thanks.

The picture in this case is adequate to construct an answer but not adequate to support testing. You should learn to post data in a form that doesn't require re-entry by hand. (That's why you are accumulating negative votes.)
The basic R strategy would be to use cut to create a grouping variable and then use a loop construct to accumulate and apply the mean function. Presumably this is in a dataframe which I will assume is named something specific like my_alleles:
tapply( my_alleles$Allele_score, # act on this vector
# in groups defined by this factor
cut(my_alleles$Location,
breaks=seq(0, max(my_alleles$Location), by=0.5)
),
# with this function
FUN=mean)

Related

How to do two sorting in r when order matters

I have a data frame consisting of three variables named momentum returns(numeric),volatility (factor) and market states (factor). Volatility and market states both have two -two levels. Volatility have levels named high and low. Market states have level named positive and negative I want to make a two sorted table. I want mean of momentum returns in every case.
library(wakefield)
mom<-rnorm(30)
vol<-r_sample_factor(30,x=c("high","low"))
mar_state<-r_sample_factor(30,x=c("positive","negtive"))
df<-data.frame(mom,vol,mar)
Based on the suggestion given by #r2evans if you want mean of every sorted cases you can apply following code.
xtabs(mom~vol+mar,aggregate(mom~vol+mar,data=df,mean))
## If you want simple sum in every case
xtabs(mom~vol+mar,data=df)
You can also do this with help of data.table package. This approach will do same task in less time.
library(data.table)
df<-as.data.table(df)
## if you want results in data frame format
df[,.(mean(mom)),by=.(vol,mar)]
## if you want in simple vector form
df[,mean(mom),by=vol,mar]

Calculating standard deviation for each row on selected columns

I am not used to R, so to practice I am trying to do everything that I used to do on SPSS on R.
In my dataset each row is a case. The columns are survey questions (1 per question).
Say I have columns "A1" up to "A6", "B1" to "B6" and so on
I just finished calculating the mean for each person on A1 to A6
data$meandata <- rowMeans(subset(data, select=c(A1:A6), na.rm=TRUE))
How do I calculate the standard deviation of meandata ?
Hey the easiest way to do this is with the apply() function.
Assume you have 25 rows of data and 6 columns labeled A1 through A6.
data <- data.frame(A1=rnorm(25,50,4),A2=rnorm(25,50,4),A3=rnorm(25,50,4),
A4=rnorm(25,50,4),A5=rnorm(25,50,4),A6=rnorm(25,50,4))
You can use the apply function to find the standard deviation of each row columns 1 through 6 with the code below. The first argument is your data object. The second argument is an integer specifying either 1 for rows or 2 for columns (This is the direction the function will be applied to the data frame). The final argument is the function you wish to apply to your data frame (such as mean or standard deviation (sd) in this case. See the code below.
apply(data[,1:6],1,sd)
Indexing can be used to limit the number of rows or columns of data passed to the apply function. This is done by entering a vector of numbers for either the rows or columns you are interested in within brackets after your data object.
data[c(row.vector),c(column.vector)]
Say you only want to know the sd of the first 3 columns.
apply(data[,1:3],1,sd)
Now lets see the sd of columns 4 through 6 and rows 1 through 10
apply(data[1:10,4:6],1,sd)
Just for good measure lets find the sd of each column
apply(data,2,sd)
Notice that the sd is close to 4, which, is what I specified when I generated the pseudo-random data for columns A1 through A6.
Hope this helps

Finding percentile of a particular input in R

I have a dataset column which contains values. When a new input is given, I want to check this column and finding the percentile of that input value in that column.
I tried with quantile function. But the quantile function gives the values of 25th,50th percentile and so on. But I want the reverse of it. I want the percentile of a given value.
The following is my reproducible example,
data <- seq(90,100,length.out=1000)
input <- 97
My output should be the percentile of 97 in the data column. Is this possible to do?
Thanks
You may also use a somewhat more statistical version with an empirical cumulative distribution function:
ecdf(data)(input)
or
F <- ecdf(data)
F(input)
This approach also allows for vectorization over input.
I think you want to count the fraction of the data that are (is?) less than the input value:
mean(input>data)
## [1] 0.7

How to group data to minimize the variance while preserving the order of the data in R

I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4/5 groups based on price that would minimize the variance in each group while preserving the order Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function however it does not take the order of Size into consideration.
Basically, I have data simlar to the following (with more data)
Price=c(90,100,125,100,130,182,125,250,300,95)
Size=c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata=data.frame(Size,Price)
I would like to group data, to minimize the variance of price in each group respecting 1) The Size value: For example, the first two prices 90 and 100 cannot be in a different groups since they are the same size & 2) The order of the Size: For example, If Group One includes observations (Obs) 1-2 and Group Two includes observations 3-9, observation 10 can only enter into group two or three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>%can be read as "then do" so you can combine multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <– c(90,100,125,100,130,182,125,250,300,95)
Size <- c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata <- data.frame(Size,Price) %>% # "then"
group_by(Size) # group data by Size column
mydata_mean_sd <- mydata %>% # "then"
summarise(mean = mean(Price), sd = sd(Price)) # calculate grouped
#mean and sd for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
Not an elegant solution, but I wrote my own function that first split up a sorted series at specified break points, then calculates the sum(SDCM) using those break points (using the algorithm underlying the jenks approach from Wiki).
Then just iterated through all valid combinations of break points, and selected the set of points that produced the minimum sum(SDCM).
Would quickly become unmanageable as number of possible breakpoints combinations increases, but it worked for my data set.

R: Function: generate and save multiple matrices based on multiple conditions

I am a new R user and an unexperienced coder and I have a data handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50.000 rows. I want to generate and store for every firm a (class x year) matrix with class counts as the elements in the matrix. Every matrix would be automatically named something like firm.name and stored so that I can use them afterwards for computations. Ideally, I'd be able to change the simple class counts into a function of values in columns 4 and 5 (backward and forward citations)
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class,year,firm) as these columns have the same length. However, I don't know how to either store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon
So, your question is how to deal with a table object?
Example:
#note the assigment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
#access the data for the first chick
mytable[,,1]
#turn the table object into a data.frame
as.data.frame(mytable)

Resources