Proportion and Averaging Data - r

I am brand new to R and I am trying to calculate the proportion of 'i' values at each time point and then average those proportions. I do not know the command for this, but I have a script that finds the total number of 'i' values across the time points.
C1imask<-C16.3[,2:8]== 'i'&!is.na(C16.3[,2:8])
C16.3[,2:8][C1imask]
C1inactive<-C16.3[,2:8][C1imask]
length(C1inactive)
C1bcmask<-C16.3[,8]== 'bc'&!is.na(C16.3[,8])
C16.3[,8][C1bcmask]
C1broodcare<-C16.3[,8][C1bcmask]
length(C1broodcare)
C1amask<-C16.3[,12]== 'bc'&!is.na(C16.3[,12])
C16.3[,12][C1amask]
C1after<-C16.3[,12][C1amask]
length(C1after)
C1<-length(C1after)-length(C1broodcare)
C1

I'd try taking the mean of a logical vector created with the test. You would pass na.rm = TRUE as an argument to mean. This gives you the proportion of non-NA values that meet the test, rather than a proportion with the number of rows as the denominator.
test <- sample( c(0,1,NA), 100, replace=TRUE)
mean( test==0, na.rm=TRUE)
#[1] 0.5072464
If you needed a proportion of total number of rows you would use sum and divide by nrow(dframe_name). You can then use sapply or lapply to iterate across a group of columns.
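As a rough sketch of both variants applied across several columns: the data frame obs and its columns t1 to t3 below are made up for illustration and stand in for the time-point columns (C16.3[, 2:8] in the question).

```r
# Hypothetical data: one column per time point, 'i' = inactive
obs <- data.frame(
  t1 = c("i", "bc", NA, "i"),
  t2 = c("i", "i", "bc", NA),
  t3 = c("bc", "bc", "i", "i"),
  stringsAsFactors = FALSE
)

# Proportion of 'i' among the non-NA values of each column
prop_non_na <- sapply(obs, function(x) mean(x == "i", na.rm = TRUE))

# Proportion of 'i' with the total number of rows as the denominator
prop_all_rows <- sapply(obs, function(x) sum(x == "i", na.rm = TRUE) / nrow(obs))

# Average of the per-time-point proportions
avg_prop <- mean(prop_non_na)
```

Which denominator is appropriate depends on whether an NA should count as "not 'i'" or as "not observed".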

Related

R - Create a loop to calculate the length and Sum of a column where Identifier equals value in list

I am attempting to create a loop to calculate the length and average value of a column where the identifier equals a value in a list. I basically have a data frame with an identifier, a number of occurrences, and additional data. I also have a list that contains the unique identifiers (50 string values). I want to summarize the number of rows and the average value for each of those 50 values.
So far I've tried creating two functions to calculate those values, and then integrating it into the loop but have been unsuccessful.
infoAvg <- function(x) {
  average <- mean(x)
  return(average)
}
infoLen <- function(x) {
  n <- length(x)
  return(n)
}
Here x is the DF and y is the column I want to calculate on.
Does it make sense to take this approach, and if so how do I integrate it into a loop?
Thanks.
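One possible way to do this without an explicit loop, sketched on made-up data: the names records, id, occurrences and ids are hypothetical stand-ins for the data frame, its identifier column, its value column and the list of unique identifiers.

```r
# Hypothetical data frame with an identifier and a value column
records <- data.frame(
  id = c("a", "b", "a", "c", "b", "a"),
  occurrences = c(2, 4, 6, 1, 8, 10),
  stringsAsFactors = FALSE
)

# Hypothetical list of identifiers to summarize
ids <- c("a", "b", "c")

# For each identifier, count the matching rows and average the value column
summary_tab <- data.frame(
  id = ids,
  n = sapply(ids, function(k) length(records$occurrences[records$id == k])),
  avg = sapply(ids, function(k) mean(records$occurrences[records$id == k])),
  row.names = NULL
)
```

sapply plays the role of the loop here. If every identifier in the data should be summarized, tapply(records$occurrences, records$id, mean) is a shorter alternative.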

How to select rows after calculating each row's quantiles?

I have a big dataframe with numerical values (12579 rows and 21 columns) from which I would like to extract those columns that fit in the first and the fourth quartile of each row (every row has independent values).
That is why I have calculated each row's quantiles in order to obtain two cutoffs by row.
library(matrixStats)
d_q1 <- rowQuantiles(delta, probs = c(0.25, 0.75))
delta2 <- as.data.frame(cbind(delta,d_q1))
dim(delta2) # 12579 23
library(dplyr)
delta2 <- filter(delta2, delta2[,1:21] <= `25%` & delta2[,1:21] >= delta2$`75%`)
I expected to get the values in Q1 and Q4. However, when I try to filter the values, I always get an error message:
Error: Result must have length 12579, not 264159
Can somebody help me?
Thank you!
I'm not entirely sure what you are trying here, but my guess is that you want for each row the values smaller than Q1 and larger than Q3. In that case this line should work for you.
t(apply(delta, 1, sort))[,c(1:6, 16:21)]
Regarding your code, dplyr::filter() doesn't work that way. It is meant to give you a subset of the rows of your data frame, so each condition must evaluate to a logical vector with one element per row. Your condition compares a 12579 x 21 block of values, which yields 12579 * 21 = 264159 results, hence the error.
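If the positions of the values matter, a logical mask is another option. This is only a sketch on simulated data: delta here is a small random matrix standing in for the real one, and base R's quantile is used in place of matrixStats::rowQuantiles.

```r
# Simulated stand-in for the real data: 5 rows, 21 columns
set.seed(1)
delta <- matrix(rnorm(5 * 21), nrow = 5)

# One row of cutoffs (25% and 75%) per row of delta
q <- t(apply(delta, 1, quantile, probs = c(0.25, 0.75)))

# TRUE where a value is at or below its row's Q1, or at or above its row's Q3.
# The cutoff vectors have one element per row, so they recycle down each column.
keep <- delta <= q[, 1] | delta >= q[, 2]

# Same shape as delta, with the middle values blanked out
extremes <- delta
extremes[!keep] <- NA
```

Note the | (or) in the mask: no value can be below Q1 and above Q3 at the same time, so combining the two tests with & would never match.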

How to sum rows in a data frame if the observations are NOT the NAs?

I've a data frame that contains 100 participants' data, and I want to calculate the Total scores of each participant. Some participants have data missing completely, however, I still want their Total score to be NA, Total = NA.
Regarding the participants who have some/partial NAs, I want to sum all the scores that are NOT NA. In other words, I want to calculate the total of each row while ignoring NAs. When I used rowSums(df[2:10], na.rm = T), the function sums the rows, but it gives 0s for those whose data are missing completely.
Is there any way to calculate each participant's total score without deleting NAs, while also assigning "NA" as the total score for completely missing data? Thank you in advance.
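Just use apply and specify a function that does exactly what you want. Hope this helps:
apply(data, 1, function(x) {
  # a row with no observed values at all gets NA as its total
  if (all(is.na(x))) {
    return(NA)
  } else {
    # otherwise sum the observed values and ignore the NAs
    return(sum(x, na.rm = TRUE))
  }
})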
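A vectorized variant of the same idea, as a minimal sketch: the data frame scores and its columns are invented for illustration, with the first column holding the participant ID.

```r
# Hypothetical scores: participant 2 is partly missing, participant 3 entirely
scores <- data.frame(
  id = 1:3,
  q1 = c(1, NA, NA),
  q2 = c(2, 5, NA)
)

items <- scores[2:3]

# Row totals, ignoring NAs (all-NA rows come out as 0 at this point)
total <- rowSums(items, na.rm = TRUE)

# Rows with no observed value at all get NA instead of 0
total[rowSums(!is.na(items)) == 0] <- NA

scores$Total <- total
```

On a large data frame this avoids calling an R function once per row, which apply does.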

Extract elements 10x greater than the last values for multiple columns

I am a new R user.
I have a data frame consisting of 50 columns and 300 rows. The first column holds the ID, while the 2nd through the last columns hold the standard deviation (sd) of traits. The pooled sd for each column is in the last row. For each column, I want to remove all values more than ten times greater than the pooled sd, and I want to do this in one run. So far, the script below is what I have come up with to find out whether a value is greater than the pooled sd. However, even the ID column (character) is being processed (resulting in all FALSE). If I use raw_sd_summary[-1], I have no way of knowing which ID and which trait meet the criterion I'm looking for.
logic_sd <- lapply(raw_sd_summary, function(x) x>tail(x,1) )
logic_sd_df <- as.data.frame(logic_sd)
What should I do? And how can I extract all the values that are more than ten times greater than the pooled sd, along with their corresponding IDs?
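Your lapply call does run over the columns, but it also tests the character ID column, and it compares against the pooled sd itself rather than ten times it. Calling apply on the whole data frame would not help either, since it converts everything to a character matrix first. Leave the ID column out of the test and change it to
logic_sd <- sapply(raw_sd_summary[-1], function(x) x > 10 * tail(x, 1))
This will give you a logical matrix that is TRUE wherever a value is more than 10 times the last row. You could recover the IDs by binding the first column back on
logic_sd_df <- data.frame(ID = raw_sd_summary[, 1], logic_sd)
You could remove/replace the unwanted values in the original table directly by
raw_sd_summary[-300, -1][logic_sd[-300, ]] <- NA # or new value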
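To list the flagged values together with their ID and trait, which(..., arr.ind = TRUE) can map the TRUE cells back to row and column positions. A small sketch, assuming a made-up table sd_tab with two trait columns and the pooled sd in the last row:

```r
# Hypothetical table: ID column, two traits, pooled sd in the last row
sd_tab <- data.frame(
  ID = c("a", "b", "c", "pooled"),
  height = c(0.5, 30, 1.2, 1),
  weight = c(25, 0.4, 0.9, 2),
  stringsAsFactors = FALSE
)

traits <- sd_tab[-1]

# TRUE where a value exceeds ten times the pooled sd of its column
flag <- sapply(traits, function(x) x > 10 * tail(x, 1))

# Row and column positions of the TRUE cells
hits <- which(flag, arr.ind = TRUE)

# One line per flagged value, with its ID and trait name
flagged <- data.frame(
  ID = sd_tab$ID[hits[, "row"]],
  trait = colnames(flag)[hits[, "col"]],
  value = as.matrix(traits)[hits],
  stringsAsFactors = FALSE
)
```

Because the ID column is never part of the test, it stays available for the lookup at the end.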

Find the maximum and mean length of the consecutive "TRUE"-arguments

I started with a daily time series of wind speeds. I wanted to examine whether the mean and maximum number of consecutive days under a certain threshold change between two periods of time. This is how far I've come: I subsetted the data to rows with values beneath the threshold and identified consecutive days.
I now have a data frame that looks like this:
dates consecutive_days
1970-03-25 NA
1970-04-09 TRUE
1970-04-10 TRUE
1970-04-11 TRUE
1970-04-12 TRUE
1970-04-15 FALSE
1970-05-08 TRUE
1970-05-09 TRUE
1970-05-13 FALSE
What I want to do next is to find the maximum and mean length of the consecutive "TRUE"-arguments. (which in this case would be: maximum=4; mean=3).
Here is one method using rle:
# construct sample data.frame:
set.seed(1234)
df <- data.frame(days=1:12, consec=sample(c(TRUE, FALSE), 12, replace=T))
# get rle object
consec <- rle(df$consec)
# max consecutive values
max(consec$lengths[consec$values==TRUE])
# mean consecutive values
mean(consec$lengths[consec$values==TRUE])
Quoting from ?rle, rle
Compute[s] the lengths and values of runs of equal values in a vector
We save the results and then subset to consecutive TRUE observations to calculate the mean and max.
You could easily combine this into a function, or simply concatenate the results above:
myResults <- c("max"=max(consec$lengths[consec$values==TRUE]),
"mean"= mean(consec$lengths[consec$values==TRUE]))
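One caveat when applying this to the data in the question: the column there contains an NA, and subsetting with consec$values==TRUE then carries an NA into the result, so max and mean return NA. A sketch of one way around it, using which() to drop the NA run (the vector below copies the consecutive_days column from the question):

```r
# The consecutive_days column from the question, including the leading NA
consecutive_days <- c(NA, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)

runs <- rle(consecutive_days)

# which() keeps only the positions that are TRUE and silently drops NA
true_runs <- runs$lengths[which(runs$values)]

run_max <- max(true_runs)   # longest run of TRUE
run_mean <- mean(true_runs) # average length of the TRUE runs
```

For this input the runs of TRUE have lengths 4 and 2, which gives the maximum of 4 and mean of 3 that the question expects.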

Resources