R function to calculate number of variables to divide total by

I have a data set with survey scores. I am summing the scores by row as follows:
d$total.score <- rowSums(d[, c("a", "b", "c", "d")], na.rm=TRUE)
I need to create another variable, the average score. If some of the variables are NA in a row (e.g., 3+1+4+NA=8), the number I need to divide by is not 4 but might be 2 or 3. What function can I use to calculate the number to divide by?
Thank you!
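One possible approach, sketched on made-up data (only the column names a to d come from the question; the values are invented for illustration), is to count the non-missing cells in each row with rowSums(!is.na(...)). rowMeans() with na.rm = TRUE should give the same average in one step.

```r
# Hypothetical example data; only the column names a-d come from the question.
d <- data.frame(a = c(3, 2, NA),
                b = c(1, NA, NA),
                c = c(4, 5, 1),
                d = c(NA, NA, 2))
cols <- c("a", "b", "c", "d")

d$total.score <- rowSums(d[, cols], na.rm = TRUE)

# Number of non-missing answers per row: this is the number to divide by.
d$n.answered <- rowSums(!is.na(d[, cols]))

# Average over the answered items only.
d$avg.score <- d$total.score / d$n.answered

# rowMeans() with na.rm = TRUE computes the same average directly.
d$avg.score2 <- rowMeans(d[, cols], na.rm = TRUE)
```

Note that a row in which every item is NA has a count of 0, so both versions of the average come out as NaN for that row.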

Related

Creating a column based on two criteria using max values from another column

I've got a dataset of species observations over time and I am trying to calculate observation dates based on the max value of criteria:
Df <- data.frame(Sp = c(1,1,2,2,3,3),
                 Site = c("A", "B", "C", "D"),
                 date = c('2021-1-1','2021-1-2','2021-1-3','2021-1-4','2021-1-5','2021-1-6', "2021-03-01","2021-03-05"),
                 N = c(2,5,9,4,14,7,3,11))
I want to create a new column called Dmax showing the date on which the value of N for a Sp on a given Site was at its maximum, so the column would look something like this:
Dmax=c("2021-1-2", "2021-1-2", '2021-1-2', '2021-1-2', '2021-1-5', '2021-1-5', "2021-03-05","2021-03-05")
So Dmax would show that for Sp 1 in site A the date on which N was max was "2021-1-2", and so on.
I've tried grouping by Site, Sp, and date and using mutate together with which.max(N), but it didn't work. I'd like to keep all my rows.
Any help is welcome.
Thanks!
From your desired output, it seems like you want the date of the max N regardless of site. Just group by Sp. Also, your sample data only has 6 values for Sp instead of 8, so I just assumed a 4th Sp.
library(dplyr)
Df |>
  group_by(Sp) |>
  mutate(Dmax = date[which.max(N)])
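A runnable sketch of the answer above, under the same assumption it makes: Sp is padded with a hypothetical 4th species so that every column has 8 values, and Site is written out as the 8 values the 4 given ones would recycle to. The native pipe |> requires R 4.1 or later.

```r
library(dplyr)

# Assumption: a 4th Sp is added so all columns have 8 values.
Df <- data.frame(
  Sp   = c(1, 1, 2, 2, 3, 3, 4, 4),
  Site = rep(c("A", "B", "C", "D"), times = 2),
  date = c("2021-1-1", "2021-1-2", "2021-1-3", "2021-1-4",
           "2021-1-5", "2021-1-6", "2021-03-01", "2021-03-05"),
  N    = c(2, 5, 9, 4, 14, 7, 3, 11)
)

Df <- Df |>
  group_by(Sp) |>
  # which.max() returns the position of the first maximum within each group
  mutate(Dmax = date[which.max(N)]) |>
  ungroup()

Df$Dmax
```

All rows are kept because mutate() adds a column rather than collapsing groups. To take the maximum within each Sp and Site combination instead, Site would be added to group_by(); with this sample data every such combination has a single row, so Dmax would simply equal date.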

Proportion and Averaging Data

I am brand new to R and I am trying to calculate the proportion of 'i' values at each time point and then average those proportions. I do not know the command for this, but I have the script to find the total number of 'i' in the time points.
C1imask<-C16.3[,2:8]== 'i'&!is.na(C16.3[,2:8])
C16.3[,2:8][C1imask]
C1inactive<-C16.3[,2:8][C1imask]
length(C1inactive)
C1bcmask<-C16.3[,8]== 'bc'&!is.na(C16.3[,8])
C16.3[,8][C1bcmask]
C1broodcare<-C16.3[,8][C1bcmask]
length(C1broodcare)
C1amask<-C16.3[,12]== 'bc'&!is.na(C16.3[,12])
C16.3[,12][C1amask]
C1after<-C16.3[,12][C1amask]
length(C1after)
C1<-length(C1after)-length(C1broodcare)
C1
I'd try taking the mean of a logical vector created with the test. You would pass na.rm = TRUE as an argument to mean. You will get the proportion of non-NA values that meet the test, rather than the proportion with the number of rows as the denominator.
test <- sample( c(0,1,NA), 100, replace=TRUE)
mean( test==0, na.rm=TRUE)
#[1] 0.5072464
If you needed a proportion of total number of rows you would use sum and divide by nrow(dframe_name). You can then use sapply or lapply to iterate across a group of columns.
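A minimal sketch of both denominators applied across several columns with sapply. The data frame obs and its columns t1 to t3 are hypothetical stand-ins for the time-point columns of C16.3.

```r
# Hypothetical data standing in for the time-point columns.
obs <- data.frame(t1 = c("i", "bc", NA, "i"),
                  t2 = c("i", "i", "bc", NA),
                  t3 = c(NA, "bc", "bc", "i"))

# Proportion of non-NA cells equal to "i", per column.
prop.nonNA <- sapply(obs, function(col) mean(col == "i", na.rm = TRUE))

# Proportion of "i" with the total number of rows as the denominator.
prop.rows <- sapply(obs, function(col) sum(col == "i", na.rm = TRUE) / nrow(obs))

# Average of the per-time-point proportions.
avg.prop <- mean(prop.nonNA)
```

The two versions differ only when a column contains NA values, since those are dropped from the denominator in the first and counted in the second.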

How to sum rows in a data frame if the observations are NOT the NAs?

I have a data frame that contains 100 participants' data, and I want to calculate the total score of each participant. Some participants have data missing completely; for those, I still want the total score to be NA (Total = NA).
For participants who have some NAs, I want to sum all the scores that are not NA. In other words, I want to calculate the total of each row while ignoring NAs. When I used rowSums(df[2:10], na.rm = T), the function calculates the row totals, but it gives 0 for those whose data are missing completely.
Is there any way to calculate each participant's total score without deleting NAs, while also assigning NA as the total score to rows with completely missing data? Thank you in advance.
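Just use apply and specify a function that does exactly what you want. Hope this helps:
apply(data, 1, function(x) {
  # every value in the row is missing: keep the total as NA
  if (sum(is.na(x)) == ncol(data)) {
    return(NA)
  } else {
    # otherwise sum the values that are present
    return(sum(x, na.rm = TRUE))
  }
})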
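A vectorized alternative, sketched on a hypothetical data frame (scores and its columns q1 to q3 are invented for illustration): compute the totals with rowSums and then overwrite the rows that have no observed values at all.

```r
# Hypothetical scores; the third participant is missing completely.
scores <- data.frame(q1 = c(1, NA, NA),
                     q2 = c(2, 3, NA),
                     q3 = c(NA, 4, NA))

# Row totals ignoring NAs; a fully missing row comes out as 0 here.
total <- rowSums(scores, na.rm = TRUE)

# Rows with zero non-missing values get NA instead of 0.
total[rowSums(!is.na(scores)) == 0] <- NA

total
```

This avoids calling a function once per row, which may matter for larger data frames.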

Frequency of data points by two variables in R [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 5 years ago.
I have a question that I know must have a simple answer, but I can't seem to figure it out.
Suppose I have a dataset:
id <- c(1,1,1,2,2,3,3,4,4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test <- c(12,16, NA, 11, 15,NA, 0,12, 5)
df <- data.frame(id,visit,test)
And I want to know the number of data points per visit so that the final output looks something like this:
visit test
A 3
B 3
C 1
How would I go about doing this? I've tried using table
table(df$visit, df$test)
but I get a full grid of counts for every combination of visit and test value.
I can sum each row by doing this:
sum(table(df$visit, df$test)[1,])
sum(table(df$visit, df$test)[2,])
sum(table(df$visit, df$test)[3,])
But I feel like there is an easier way and I'm missing it! Any help would be greatly appreciated!
aggregate from base R would be ideal for this. Group id by visit and count the length. Remove the rows with NA using !is.na() before determining the length.
aggregate(x = df$id[!is.na(df$test)], by = list(df$visit[!is.na(df$test)]), FUN = length)
# Group.1 x
#1 A 3
#2 B 3
#3 C 1
How about:
data.frame(rowSums(table(df$visit, df$test)))
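Two other base R routes, sketched on the question's own data. The first tabulates visit only for rows where test is not NA; the second relies on the formula interface of aggregate, which drops rows with NA by default (na.action = na.omit).

```r
id <- c(1, 1, 1, 2, 2, 3, 3, 4, 4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test <- c(12, 16, NA, 11, 15, NA, 0, 12, 5)
df <- data.frame(id, visit, test)

# Option 1: tabulate visit for the rows with a non-missing test value.
counts <- table(df$visit[!is.na(df$test)])
counts.df <- as.data.frame(counts, responseName = "test")

# Option 2: formula interface; rows with NA in test are dropped by default.
agg <- aggregate(test ~ visit, data = df, FUN = length)
```

Both should give 3 for A, 3 for B and 1 for C, matching the desired output in the question.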

Better subsetting and counting values in a dataframe [duplicate]

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 4 years ago.
I have a data frame with two columns and 70,000 rows. One column serves as an identifier for a household (column b in the example below). The other column refers to the individuals in the household, numbering them from 1 to n with some error (could be 1,2,3 or 1,4,5); this is column a in the example below.
I'm trying to use hierarchical clustering with the number of individuals in a household as a feature. The code I've written below counts the number of individuals in a household and puts them in the proper column and row; however, it takes several minutes with the actual data set I have, I assume due to its size. Is there a better way of getting this information?
fake.data <- data.frame(a = c(1,1,5,6,7,1,2,3,1,2,4), b = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))
fake.cluster <- data.frame(b = unique(fake.data$b))
fake.cluster$members <- sapply(fake.cluster$b, function(x) length(unique(subset(fake.data, fake.data$b == x)$a)))
Don't know if this is quicker, but you could use dplyr in various ways. One approach: get the distinct rows and then count b.
library(dplyr)
fake.cluster <- fake.data %>%
  distinct() %>%
  count(b)
Here is an option using data.table
library(data.table)
setDT(fake.data)[, .(members = uniqueN(a)), b]
# b members
#1: a 4
#2: b 3
#3: c 3
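For comparison, a base R sketch that needs no extra packages: tapply splits column a by household and counts the distinct values in each group. Whether it is faster than the original sapply/subset loop on 70,000 rows is untested here; it does avoid subsetting the whole data frame once per household.

```r
fake.data <- data.frame(a = c(1, 1, 5, 6, 7, 1, 2, 3, 1, 2, 4),
                        b = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))

# Distinct individuals per household.
members <- tapply(fake.data$a, fake.data$b, function(x) length(unique(x)))

fake.cluster <- data.frame(b = names(members), members = as.vector(members))
```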
