I am not used to R, so for practice I am trying to redo in R everything that I used to do in SPSS.
In my dataset each row is a case. The columns are survey questions (1 per question).
Say I have columns "A1" up to "A6", "B1" to "B6" and so on
I just finished calculating the mean for each person on A1 to A6
data$meandata <- rowMeans(subset(data, select=c(A1:A6)), na.rm=TRUE)
How do I calculate the standard deviation of meandata ?
Hey, the easiest way to do this is with the apply() function.
Assume you have 25 rows of data and 6 columns labeled A1 through A6.
data <- data.frame(A1=rnorm(25,50,4),A2=rnorm(25,50,4),A3=rnorm(25,50,4),
A4=rnorm(25,50,4),A5=rnorm(25,50,4),A6=rnorm(25,50,4))
You can use the apply function to find the standard deviation of each row across columns 1 through 6. The first argument is your data object. The second argument is an integer, 1 for rows or 2 for columns, giving the direction in which the function is applied to the data frame. The final argument is the function you wish to apply to your data frame, such as mean or, in this case, standard deviation (sd). See the code below.
apply(data[,1:6],1,sd)
Indexing can be used to limit the number of rows or columns of data passed to the apply function. This is done by entering a vector of numbers for either the rows or columns you are interested in within brackets after your data object.
data[c(row.vector),c(column.vector)]
Say you only want to know the sd of the first 3 columns.
apply(data[,1:3],1,sd)
Now let's see the sd of columns 4 through 6 and rows 1 through 10.
apply(data[1:10,4:6],1,sd)
Just for good measure, let's find the sd of each column.
apply(data,2,sd)
Notice that each sd is close to 4, which is what I specified when I generated the pseudo-random data for columns A1 through A6.
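If the real survey data has missing answers, the same call can pass na.rm through to sd(), and the columns can be picked by name rather than by position. A minimal sketch, assuming the column names A1 to A6 from the question; the toy data, the seed, and the new column names sd_A and mean_A are made up for illustration:

```r
set.seed(1)  # arbitrary seed so the toy data is reproducible

# hypothetical survey data: 25 respondents, items A1..A6
data <- as.data.frame(matrix(rnorm(25 * 6, mean = 50, sd = 4), ncol = 6))
names(data) <- paste0("A", 1:6)
data$A3[2] <- NA  # simulate one missing answer

items <- paste0("A", 1:6)  # select the item columns by name

# arguments after the function name are passed on to sd()
data$sd_A <- apply(data[, items], 1, sd, na.rm = TRUE)
data$mean_A <- rowMeans(data[, items], na.rm = TRUE)

head(data[, c("mean_A", "sd_A")])
```

Selecting by name keeps the call correct even after new columns such as sd_A are added to the data frame.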
Hope this helps
Related
I have an 11 x 8 data frame of numeric values in R, and I want to find the standard deviation of all of its values together. However, I cannot get sd() to work on the whole data frame, only on the individual columns, and I need every data value used. How do I turn this data frame into one column so that all values are used when finding the standard deviation? Hope this makes sense.
#generate data
df <- data.frame(matrix(rbinom(8*11, 1, .5), ncol=8))
#get sd
sd(unlist(df))
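This works because unlist() flattens the data frame into a single vector, so sd() sees all 88 values at once. A small sketch with made-up data (the seed and object names are arbitrary), which also shows that converting to a matrix first gives the same result:

```r
set.seed(42)  # arbitrary seed
df <- data.frame(matrix(rnorm(11 * 8), ncol = 8))

all_values <- unlist(df)   # one long vector, column after column
length(all_values)         # 88 values in total

sd(all_values)             # sd over every value
sd(as.matrix(df))          # same result: sd() treats a matrix as one vector

sapply(df, sd)             # for comparison: one sd per column
```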
edit: just saw the comment where user fra got there first
I have a data table with 3 variables and 1 frequency column, and I want to add a proportion column.
Variable 1 has 4 unique values,
Variable 2 has 5,
and Variable 3 has 2.
The frequency column records the number of times each combination occurs.
But if I apply prop.table to it, it calculates the proportion relative to the whole data.table, when I really want the proportion within each subset of Variable 2.
I thought of iterating, but that seems complicated with tables.
You could use the aggregate function (or tapply) to sum all the counts within the categories of variable 2, then use prop.table or similar on the result.
If you want to use the tidyverse instead of base R, then this would be a group_by followed by summarise to add up the counts within each group, then prop.table again to calculate the proportions.
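As a rough sketch of the base R route, assuming a table with hypothetical columns var1, var2, var3 and freq (the question does not show the real names or data): ave() repeats each group's total on every row of that group, so the proportion within each level of var2 is a single division.

```r
# hypothetical data: every combination of 4 x 5 x 2 levels with a made-up count
tab <- expand.grid(var1 = letters[1:4],
                   var2 = LETTERS[1:5],
                   var3 = c("x", "y"))
set.seed(7)  # arbitrary seed
tab$freq <- sample(1:20, nrow(tab), replace = TRUE)

# total frequency within each level of var2, repeated on every row of that level
group_total <- ave(tab$freq, tab$var2, FUN = sum)
tab$prop <- tab$freq / group_total

# check: the proportions add up to 1 within each level of var2
tapply(tab$prop, tab$var2, sum)
```

The same grouping vector could be swapped for any other variable if the proportions are wanted within a different subset.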
I have a dataframe with 253 rows (locations on a chromosome in Mbp) and 1 column (the allele score at each location). I need to produce a dataframe that contains the mean allele score for every 0.5 Mbp along the chromosome. Please help with R code that can do this. Thanks.
The picture in this case is adequate to construct an answer but not adequate to support testing. You should learn to post data in a form that doesn't require re-entry by hand. (That's why you are accumulating negative votes.)
The basic R strategy would be to use cut to create a grouping variable and then use a grouping function such as tapply to apply the mean within each group. Presumably the data is in a dataframe, which I will assume is named something specific like my_alleles, with columns Location and Allele_score:
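tapply( my_alleles$Allele_score,   # act on this vector
        # in groups defined by this factor
        cut(my_alleles$Location,
            # round the upper limit up to the next 0.5 so the last location gets a bin
            breaks=seq(0, ceiling(max(my_alleles$Location)/0.5)*0.5, by=0.5),
            include.lowest=TRUE    # keep a location of exactly 0 in the first bin
            ),
        # with this function
        FUN=mean)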
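Since the question did not include usable data, here is a sketch on simulated values. The data frame my_alleles and its columns Location and Allele_score are the assumed names from the answer; the numbers are random and only illustrate the shape of the result:

```r
set.seed(123)  # arbitrary seed
# simulated stand-in for the real data: 253 locations between 0 and 60
my_alleles <- data.frame(
  Location     = sort(runif(253, min = 0, max = 60)),
  Allele_score = runif(253)
)

# round the upper limit up to the next multiple of 0.5 so every location has a bin
upper <- ceiling(max(my_alleles$Location) / 0.5) * 0.5
bins <- cut(my_alleles$Location,
            breaks = seq(0, upper, by = 0.5),
            include.lowest = TRUE)

# mean allele score per 0.5-wide bin
bin_means <- tapply(my_alleles$Allele_score, bins, mean)

# collect the result as a data frame; bins without any location get NA
result <- data.frame(bin = names(bin_means),
                     mean_score = as.vector(bin_means))
head(result)
```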
I am a new R user.
I have a dataframe consisting of 50 columns and 300 rows. The first column holds the ID, while the 2nd through the last columns are standard deviations (sd) of traits. The pooled sd for each column is given in the last row. For each column, I want to remove all values that are more than ten times the pooled sd. I want to do this in one run. So far, the script below is what I have come up with to find out whether a value is greater than the pooled sd. However, even the ID (character) column is being processed (resulting in all FALSE). If I use raw_sd_summary[-1], I have no way of knowing which ID on which trait meets the criterion I'm looking for.
logic_sd <- lapply(raw_sd_summary, function(x) x>tail(x,1) )
logic_sd_df <- as.data.frame(logic_sd)
What should I do? And how can I extract the values labeled TRUE that are more than ten times the pooled sd, along with their corresponding IDs?
Your comparison is missing the factor of 10, and the character ID column has to be left out of it: apply converts a data frame to a matrix first, so with the ID column included every value becomes a string and the comparison fails. Drop the ID column and change it to
logic_sd <- apply(raw_sd_summary[-1], 2, function(x) x>10*tail(x,1) )
This will give you a logical matrix marking the values that are more than 10 times the last row. Its rows are in the same order as raw_sd_summary, so you could recover the IDs by using the first column as row names
rownames(logic_sd) <- raw_sd_summary[,1]
You could remove/replace the unwanted values in the original table directly by
raw_sd_summary[-300,-1][logic_sd[-300,]]<-NA # or new value
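A small self-contained sketch of the whole idea, on made-up data with 5 IDs, 3 traits, and the pooled sd in the last row (all names and numbers here are hypothetical). The comparison runs on the numeric columns only, and which(..., arr.ind = TRUE) then lists the flagged ID and trait pairs:

```r
# hypothetical table: the last row holds the pooled sd of each trait
raw_sd_summary <- data.frame(
  ID     = c("id1", "id2", "id3", "id4", "id5", "pooled"),
  trait1 = c(1.0, 25.0, 1.5, 0.8, 1.2, 2.0),
  trait2 = c(0.5, 0.4, 0.6, 9.0, 0.5, 0.8),
  trait3 = c(3.0, 2.5, 3.5, 2.8, 3.1, 3.0),
  stringsAsFactors = FALSE
)

# compare each trait column (ID column dropped) with 10 times its last value
logic_sd <- apply(raw_sd_summary[-1], 2, function(x) x > 10 * tail(x, 1))

# row and column positions of the TRUE cells
hits <- which(logic_sd, arr.ind = TRUE)

# flagged values together with their ID and trait
flagged <- data.frame(
  ID    = raw_sd_summary$ID[hits[, "row"]],
  trait = colnames(logic_sd)[hits[, "col"]],
  value = as.matrix(raw_sd_summary[-1])[hits],
  stringsAsFactors = FALSE
)
flagged

# copy of the table with the flagged values set to NA
cleaned <- raw_sd_summary
cleaned[-1][logic_sd] <- NA
```

In this toy table only id2 on trait1 (25 against a pooled sd of 2) and id4 on trait2 (9 against 0.8) exceed ten times the pooled sd.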
I'm relatively new in R so excuse me if I'm not even posting this question the right way.
I have a matrix generated from combination function.
double_expression_combinations <- combn(marker_column_vector,2)
This matrix has x columns and 2 rows. Each column holds 2 numbers that are column numbers in my main data frame, named initial; each pair is a combination of columns to be tested. The initial data frame has 27 columns (and thousands of rows) with values of 1 and 0. The test consists of taking the 2 columns of initial given by a column of double_expression_combinations, adding them row by row, and counting how many times the sum is equal to 2.
I believe I'm able to come up with the counting part, I just don't know how to use the data from the double_expression_combinations data frame to select columns to test from the "initial" data frame.
Edited to fix corrections made by commenters
When using R, it's important to keep your terminology precise. double_expression_combinations is not a dataframe but rather a matrix. It's easy to loop over the columns of a matrix with apply. I'm a bit unclear about the exact test, but this might succeed:
apply( double_expression_combinations, 2, # the 2 selects each column in turn
function(cols){ sum( initial[ , cols[1] ] + initial[ , cols[2] ] == 2) } )
Both the '+' and '==' operators are vectorised so no additional loop is needed inside the call to sum.
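For a concrete run, here is a sketch on toy data. The names initial and double_expression_combinations come from the question; marker_column_vector is taken to be the column numbers to test, and the 0/1 values are random, so all of that is assumed for illustration:

```r
set.seed(99)  # arbitrary seed
# toy stand-in for the real data: 200 rows, 5 columns of 0/1 values
initial <- as.data.frame(matrix(rbinom(200 * 5, 1, 0.5), ncol = 5))

marker_column_vector <- 1:5  # assumed: the column numbers to combine
double_expression_combinations <- combn(marker_column_vector, 2)

# for each pair of columns, count the rows where both values are 1
both_on <- apply(double_expression_combinations, 2, function(cols) {
  sum(initial[, cols[1]] + initial[, cols[2]] == 2)
})

# one row per tested pair, with its count
result <- data.frame(
  col_a = double_expression_combinations[1, ],
  col_b = double_expression_combinations[2, ],
  count = both_on
)
result
```

Putting the pair next to its count makes it easy to see which combination each number belongs to.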