While transforming different variables, I was unable to transform variables that have large values and almost the same range of values. How should I transform this kind of data?
MonthlyRate
Min. : 2094
1st Qu.: 8047
Median :14236
Mean :14313
3rd Qu.:20462
Max. :26999
This is the summary of a variable.
It sounds like you are trying to normalize different variables so that they are on the same unitless scale. R has a built-in scale function
scale(my_dataframe)
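which will standardize each column so that it has mean 0 and standard deviation 1, i.e. its values are measured in standard deviations from the column mean. The columns end up on a common unitless scale, though not necessarily with identical minimum and maximum values. This only works on numeric columns, but if your dataframe includes other types of data you can normalize each numeric vector individually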
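my_dataframe$monthly_rate <- as.numeric(scale(my_dataframe$monthly_rate))
and achieve the same effect. (scale returns a one-column matrix, so as.numeric converts the result back to a plain vector.)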
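As a minimal sketch of the per-column approach, assuming a made-up data frame (toy_df, its column names, and its values are hypothetical and only loosely echo the summary above), the following standardizes just the numeric columns and leaves the rest untouched:

```r
# Hypothetical data; column names and values are made up for illustration.
toy_df <- data.frame(
  monthly_rate = c(2094, 8047, 14236, 20462, 26999),
  department   = c("A", "B", "A", "C", "B")
)

# Find the numeric columns and standardize only those.
is_num <- vapply(toy_df, is.numeric, logical(1))
scaled_df <- toy_df
scaled_df[is_num] <- lapply(toy_df[is_num], function(x) as.numeric(scale(x)))

# Each scaled column now has mean (approximately) 0 and standard deviation 1.
mean(scaled_df$monthly_rate)
sd(scaled_df$monthly_rate)
```

The non-numeric department column is carried over unchanged, so the result can be used wherever the original data frame was.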
Related
I have a dataframe with 253 rows (locations on a chromosome, in Mbp) and 1 column (allele score at each location). I need to produce a dataframe containing the mean allele score for every 0.5 Mbp along the chromosome. Please help with R code that can do this. Thanks.
The picture in this case is adequate to construct an answer but not adequate to support testing. You should learn to post data in a form that doesn't require re-entry by hand. (That's why you are accumulating negative votes.)
The basic R strategy is to use cut to create a grouping variable and then use a looping function such as tapply to apply mean within each group. Presumably the data are in a dataframe, which I will assume is named my_alleles, with columns Location and Allele_score:
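tapply(my_alleles$Allele_score,   # act on this vector
       # in groups defined by this factor
       cut(my_alleles$Location,
           # round the last break up so the largest location is not dropped
           breaks = seq(0, ceiling(max(my_alleles$Location) / 0.5) * 0.5, by = 0.5),
           include.lowest = TRUE),   # keep a location of exactly 0
       # with this function
       FUN = mean)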
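A small worked sketch of the same idea, using made-up values (the six locations and scores below are hypothetical, as is the column layout), shows how the result can be turned into the data frame the question asks for:

```r
# Hypothetical data: 6 locations (in Mbp) with made-up allele scores.
my_alleles <- data.frame(
  Location     = c(0.1, 0.4, 0.6, 0.9, 1.2, 1.3),
  Allele_score = c(1, 3, 2, 4, 5, 7)
)

# Breaks every 0.5, rounded up so they cover the largest location.
upper <- ceiling(max(my_alleles$Location) / 0.5) * 0.5
bins <- cut(my_alleles$Location, breaks = seq(0, upper, by = 0.5))

# Mean allele score within each 0.5-wide bin.
bin_means <- tapply(my_alleles$Allele_score, bins, mean)

# Turn the named result into a data frame with one row per bin.
result <- data.frame(bin = names(bin_means), mean_score = as.numeric(bin_means))
result
```

Bins that contain no locations would show up with a mean of NA, which may or may not be what you want to keep.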
I am a beginner with R and I have a question about simple functions such as mean or standard deviation for a big data set. My data shows monthly returns for hedge funds for the past 30 years and has 1550 columns for all hedge funds. I saw that I can calculate the mean with the mean function for a specific column by referring to the column with the name of my dataset and a $ and the no. of the column. However, I was wondering how I can get the mean for every hedge fund (which is every column) without assigning every single column. Thanks in advance for your help!
We can use colMeans
colMeans(df1, na.rm=TRUE)
where 'df1' is the dataset.
Another option is to loop through the columns and calculate the mean of each:
vapply(df1, mean, numeric(1), na.rm=TRUE)
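Since the question also mentions standard deviation, here is a short sketch covering both statistics at once. The data frame below is hypothetical (three made-up funds instead of 1550), and base R has no colSds, so vapply with sd is used for that part:

```r
# Hypothetical monthly returns for three funds; names and values are made up.
df1 <- data.frame(
  fund_a = c(0.01, 0.03, NA, 0.02),
  fund_b = c(-0.02, 0.00, 0.02, 0.04),
  fund_c = c(0.05, 0.05, 0.05, 0.05)
)

# Mean of every column at once, ignoring missing months.
col_means <- colMeans(df1, na.rm = TRUE)

# Standard deviation of every column; vapply checks each result is one number.
col_sds <- vapply(df1, sd, numeric(1), na.rm = TRUE)

# Combine into one summary table with a row per fund.
fund_summary <- data.frame(fund = names(df1), mean = col_means,
                           sd = col_sds, row.names = NULL)
fund_summary
```

Both calls return a named vector with one entry per column, so no column has to be referenced individually.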
I'm switching from R to SPSS for a specific project (I'm not allowed to use SPSS/R integration) and need to quickly summarize a big dataset. In R it's quite simple: one can use the summary() function and in a few seconds obtain the summary of each variable.
I would like to know if there is a function in SPSS that does the same job. If not, how could I achieve it?
For non-R users: summary() on a data frame returns labelled values for Min., 1st Qu., Median, Mean, 3rd Qu., and Max. for each numeric column, and counts of the 6 most common levels plus an "(Other)" count for each factor column. Character columns report only length, class, and mode.
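DESCRIPTIVES comes close:
descriptives variables = var1 var2 var3
/statistics = mean stddev variance min max .
(As far as I know, DESCRIPTIVES offers neither the median nor quartiles; FREQUENCIES with /NTILES=4 and /STATISTICS=MEDIAN can supply them.)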
If you have a mixture of continuous and categorical variables, use DESCRIPTIVES or SUMMARIZE for continuous and FREQUENCIES for categorical. You can use the SPSSINC SELECT VARIABLES extension command installed with Statistics to create macros listing variables according to the measurement level and then use the appropriate macro for each command.
I have a dataset column which contains values. When a new input is given, I want to check this column and find the percentile of that input value within it.
I tried the quantile function, but it gives the values at the 25th percentile, the 50th percentile, and so on. I want the reverse: the percentile of a given value.
The following is my reproducible example:
data <- seq(90,100,length.out=1000)
input <- 97
My output should be the percentile of 97 in the data column. Is this possible to do?
Thanks
You may also use a somewhat more statistical version with an empirical cumulative distribution function:
ecdf(data)(input)
or
F <- ecdf(data)
F(input)
This approach also allows for vectorization over input.
I think you want to count the fraction of the data that are (is?) less than the input value:
mean(input>data)
## [1] 0.7
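The two answers agree on the example above but differ in how they treat ties: ecdf counts values less than or equal to the input, while mean(input > data) counts strictly smaller values. A quick sketch with made-up data containing repeated values (the vector below is hypothetical) makes the difference visible:

```r
# Made-up data with repeated values, to show how ties are handled.
data <- c(90, 92, 95, 97, 97, 99, 100, 100, 100, 100)
input <- 97

# Fraction of values less than or equal to the input (what ecdf computes).
p_le <- ecdf(data)(input)

# Fraction of values strictly less than the input.
p_lt <- mean(data < input)

# ecdf is vectorized, so several inputs can be evaluated at once.
p_many <- ecdf(data)(c(91, 97, 100))

c(less_or_equal = p_le, strictly_less = p_lt)
```

Which convention is appropriate depends on how you want to define a percentile when the input equals values already in the column.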
I am trying to compute the median vector of a data set s with column A1 and B1. The median vector is the median for each observation from both columns.
I tried to do this and it did not work.
median(s[c("A1","B1")])
Is there another way to do it?
The median of two observations is simply their mean, so rowMeans(s[,c("A1","B1")]) works. More generally, for any number of columns, use apply(s[,c("A1","B1")],1,median).
Another solution:
library(plyr)
colwise(median)(s[c("A1", "B1")])
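which has the advantage of returning a data frame. Note that colwise applies median to each column, so this gives one median for A1 and one for B1 rather than one median per row.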
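To make the row-wise versus column-wise distinction concrete, here is a small base R sketch. The data frame s below is hypothetical, with made-up values, and vapply stands in for the plyr call so that no extra package is needed:

```r
# Hypothetical data frame; values are made up.
s <- data.frame(A1 = c(1, 4, 10), B1 = c(3, 8, 20))

# Row-wise median: one value per observation.
row_medians <- apply(s[, c("A1", "B1")], 1, median)

# With exactly two columns this equals the row mean.
row_means <- rowMeans(s[, c("A1", "B1")])

# Column-wise median: one value per column.
col_medians <- vapply(s[c("A1", "B1")], median, numeric(1))

row_medians
col_medians
```

The row-wise result has as many entries as s has rows, while the column-wise result has one entry per selected column.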