Counting complete cases by ID for several variables in R

I'm just beginning to learn R, so my apologies if this is simpler than I think it is, but I'm really struggling to find an answer.
What I'm attempting to do is to create a vector with a count of complete cases, by ID, for multiple variables.
For example, in this data frame:
ID <- rep(1:5, 2)   # each ID appears twice, matching the ten score values
score.1 <- c(1, 7, 3, 5, NA, 4, 6, 9, 11, NA)
score.2 <- c(2, NA, 7, 6, NA, 5, NA, 7, 10, 1)
sample <- data.frame(ID, score.1, score.2)
ID score.1 score.2
1 1 2
2 7 NA
3 3 7
4 5 6
5 NA NA
1 4 5
2 6 NA
3 9 7
4 11 10
5 NA 1
The output I'm looking for is something like:
ID Complete
1 4
2 2
3 4
4 4
5 1
Is there a way to do this that I'm missing? I've tried count(complete.cases(sample)) with plyr and sum(complete.cases()), but neither gives me what I actually want.
Any help with this is appreciated.

You can use dplyr:
library(dplyr)
sample %>%
  mutate(new_var = rowSums(!is.na(sample[, 2:3]))) %>%
  group_by(ID) %>%
  summarize(Complete = sum(new_var))
The output is exactly what you are looking for:
ID Complete
(int) (dbl)
1 4
2 2
3 4
4 4
5 1
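The intermediate column is not strictly needed; the count can also be taken directly in summarize. A minimal sketch, assuming score.1 and score.2 are the only score columns:
library(dplyr)
sample %>%
  group_by(ID) %>%
  summarize(Complete = sum(!is.na(score.1)) + sum(!is.na(score.2)))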

With package dplyr and the base function complete.cases applied to each score column, try
library(dplyr)
sample %>%
  mutate(complete = complete.cases(score.1) + complete.cases(score.2)) %>%
  group_by(ID) %>%
  summarise(complete = sum(complete))

This should do it (count() here is from plyr):
library(plyr)
score.1_complete <- sample[complete.cases(sample$score.1), ]
score.2_complete <- sample[complete.cases(sample$score.2), ]
total <- rbind(score.1_complete, score.2_complete)
output <- count(total, "ID")
My reasoning: score.1_complete selects the rows where score.1 (though not necessarily score.2) is complete, and score.2_complete selects the rows where score.2 (though not necessarily score.1) is complete. Therefore, counting how many times an ID shows up in total gives the number of times score.1 is complete for that ID plus the number of times score.2 is complete for that ID, which is what you want.
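A base R sketch of the same idea, for comparison (it assumes the column names from the example above):
row_counts <- rowSums(!is.na(sample[c("score.1", "score.2")]))        # non-NA scores per row
aggregate(list(Complete = row_counts), by = list(ID = sample$ID), FUN = sum)   # summed per ID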

Here is another option with gather/summarise. We convert from 'wide' to 'long' format with gather (from tidyr), then get the sum of non-NA 'value' grouped by 'ID'.
library(tidyr)
library(dplyr)
gather(sample, score, value, -ID) %>%
  group_by(ID) %>%
  summarise(value = sum(!is.na(value)))
# ID value
# (int) (int)
#1 1 4
#2 2 2
#3 3 4
#4 4 4
#5 5 1
Or a base R approach would be
tapply(rowSums(!is.na(sample[-1])), sample$ID, FUN=sum)
# 1 2 3 4 5
# 4 2 4 4 1
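For completeness, a data.table variant is also possible (a sketch; it assumes the data.table package is available):
library(data.table)
setDT(sample)[, .(Complete = sum(!is.na(score.1)) + sum(!is.na(score.2))), by = ID]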

Related

Fill empty rows with values from other rows

I have a dataset with a number of cases. Every case has two observations: the first observation for case number 1 has value 3 and the second has value 7. The two observations for case number 2 have missing values. I need to write code to fill the empty cells with the values from case number 1, so that the first row for case 2 has the same value as case 1 for obs = 1 and the second row has the same value for obs = 2. Of course, this is a very short version of a much bigger dataset, so I need something flexible enough to accommodate a couple of hundred cases, where the values used as fillers change for every subject.
Here is a toy data set:
# toy dataset
df <- data.frame(
  case = c(1, 1, 2, 2),
  obs = c(1, 2, NA, NA),
  value = c(3, 7, NA, NA)
)
# case obs value
# 1 1 1 3
# 2 1 2 7
# 3 2 NA NA
# 4 2 NA NA
Desired output:
case obs value
1 1 1 3
2 1 2 7
3 2 1 3
4 2 2 7
We may use fill after grouping on the row sequence (rowid) of 'case':
library(dplyr)
library(data.table)
library(tidyr)
df %>%
  group_by(grp = rowid(case)) %>%
  fill(obs, value) %>%
  ungroup() %>%
  select(-grp)
Output:
# A tibble: 4 × 3
case obs value
<dbl> <dbl> <dbl>
1 1 1 3
2 1 2 7
3 2 1 3
4 2 2 7
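The same row-position logic can be sketched without tidyr, using ave plus zoo's last-observation-carried-forward; this assumes the complete case appears before the cases to be filled:
library(zoo)
idx <- ave(seq_along(df$case), df$case, FUN = seq_along)   # row position within each case
df$obs   <- ave(df$obs,   idx, FUN = function(x) na.locf(x, na.rm = FALSE))
df$value <- ave(df$value, idx, FUN = function(x) na.locf(x, na.rm = FALSE))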

How do I subset patient data based on number of readings for a particular variable for each patient?

I keep trying to find an answer, but haven't had much luck. I'll add a sample of some similar data.
What I'd be trying to do here is exclude patient 1 and patient 4 from my subset, as they only have one reading for "Mobility Score". So far, I've been unable to work out a way of counting the number of readings under each variable for each patient. If the patient only has one or zero readings, I'd like to exclude them from a subset.
This is an imgur link to the sample data. I can't upload the real data, but it's similar to this
This can be done with dplyr and group_by. For more information see ?group_by and ?summarize
# Create random data
dta <- data.frame(patient = rep(c(1,2),4), MobiScor = runif(8, 0,20))
dta$MobiScor[sample(1:8,3)] <- NA
# Count all available Mobility scores per patient and keep the original format
library(dplyr)
dta %>% group_by(patient) %>% mutate(count = sum(!is.na(MobiScor)))
# Or collapse to one count per patient
dta %>% group_by(patient) %>% summarize(count = sum(!is.na(MobiScor)))
Example data
patient MobiScor
1 1 19.203898
2 2 13.684209
3 1 17.581468
4 2 NA
5 1 NA
6 2 NA
7 1 7.794959
8 2 NA
Result 1 (mutate):
patient MobiScor count
<dbl> <dbl> <int>
1 1 19.2 3
2 2 13.7 1
3 1 17.6 3
4 2 NA 1
5 1 NA 3
6 2 NA 1
7 1 7.79 3
8 2 NA 1
Result 2 (summarize):
patient count
<dbl> <int>
1 1 3
2 2 1
You can count the number of non-NA values in each group and then filter based on that.
This can be done in base R :
subset(df, ave(!is.na(Mobility_score), patient, FUN = sum) > 1)
Using dplyr
library(dplyr)
df %>% group_by(patient) %>% filter(sum(!is.na(Mobility_score)) > 1)
and data.table
library(data.table)
setDT(df)[, .SD[sum(!is.na(Mobility_score)) > 1], patient]
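Another base R sketch, first tabulating the non-missing readings per patient and then keeping the patients with more than one (the column name Mobility_score is assumed, as above):
keep <- names(which(tapply(!is.na(df$Mobility_score), df$patient, sum) > 1))
df[df$patient %in% keep, ]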

Filter for first 5 observations per group in tidyverse

I have precipitation data of several different measurement locations and would like to filter for only the first n observations per location and per group of precipitation intensity using tidyverse functions.
So far, I've grouped the data by location and by precipitation intensity.
This is a minimal example (there are several observations of each rainfall intensity per location)
df <- data.frame(location = c(rep(1, 7), rep(2, 7)),
                 rain = c(1:7, 1:7))
location rain
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 2 1
9 2 2
10 2 3
11 2 4
12 2 5
13 2 6
14 2 7
I thought that it should be quite easy using group_by() and filter(), but so far, I haven't found an expression that would return only the first n observations per rain group per location.
df %>% group_by(rain, location) %>% filter(???)
You can do:
df %>%
  group_by(location) %>%
  slice(1:5)
location rain
<dbl> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 2
8 2 3
9 2 4
10 2 5
library(dplyr)
df %>%
  group_by(location) %>%
  filter(row_number() %in% 1:5)
Non-dplyr solutions (that also rearrange the rows)
# Base R
df[unlist(lapply(split(row.names(df), df$location), "[", 1:5)), ]
# data.table
library(data.table)
setDT(df)[, .SD[1:5], by = location]
An option in data.table
library(data.table)
setDT(df)[, .SD[seq_len(.N) <=5], location]
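If the goal is really the first n rows per rain-intensity group within each location, as the question describes, the same pattern works with both grouping variables. A sketch with n = 5, assuming dplyr >= 1.0 for slice_head():
library(dplyr)
df %>%
  group_by(location, rain) %>%
  slice_head(n = 5) %>%
  ungroup()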

How to find indices of change based on two vectors in R

I have two vectors containing indices which look like this:
index A index B
1 1
1 1
1 1
1 2
1 2
2 1
2 1
Now, I want to find the length of each combination of index A and index B. So, in my example there are three unique combinations of index A and index B, and I want to get back 3, 2, 2 in a vector. Does anyone know how to do this without a for loop?
EDIT:
So, in this example there are three unique combinations (1 1, 1 2 and 2 1): there are 3 of combination 1 1, 2 of 1 2 and 2 of 2 1. Therefore, I want to return 3, 2, 2.
I think this is what you want:
library(plyr)
df <- data.frame(index_A = c(1, 1, 1, 1, 1, 2, 2),
                 index_B = c(1, 1, 1, 2, 2, 1, 1))
count(df, vars = c("index_A", "index_B"))
#> index_A index_B freq
#> 1 1 1 3
#> 2 1 2 2
#> 3 2 1 2
Created on 2019-03-17 by the reprex package (v0.2.1)
I got this from here.
In base R, we can use table
as.data.frame(table(dat))
You could paste the vectors together and call rle
rle(do.call(paste0, dat))$lengths
# [1] 3 2 2
If you need the result as a data.frame, do
as.data.frame(unclass(rle(do.call(paste0, dat))))
# lengths values
#1 3 11
#2 2 12
#3 2 21
data
text <- "indexA indexB
1 1
1 1
1 1
1 2
1 2
2 1
2 1"
dat <- read.table(text = text, header = TRUE)
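One caveat on the paste0 trick: gluing the indices together with no separator can merge distinct combinations (for example the pair 1 and 11 versus 11 and 1), so adding a separator is safer. A small sketch:
rle(do.call(paste, c(dat, sep = "_")))$lengths
# [1] 3 2 2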
This is somewhat hacky:
library(dplyr)
df %>%
  mutate(Combined = paste0(`index A`, "_", `index B`)) %>%
  group_by(Combined) %>%
  summarise(n = n())
# A tibble: 3 x 2
Combined n
<chr> <int>
1 1_1 3
2 1_2 2
3 2_1 2
You can actually just do:
df %>%
  group_by(`index A`, `index B`) %>%
  summarise(n = n())
Adding tidyr's unite, as suggested by @kath:
library(tidyr)
df %>%
  unite(new_col, `index A`, `index B`, sep = "_") %>%
  add_count(new_col) %>%
  unique()
Data:
df<-read.table(text="index A index B
1 1
1 1
1 1
1 2
1 2
2 1
2 1",header=T,as.is=T,fill=T)
df<-df[,1:2]
names(df)<-c("index A","index B")
Using dplyr:
library(dplyr)
count(dat,!!!dat)$n
# [1] 3 2 2

Count subgroups in group_by with dplyr [duplicate]

This question already has answers here: Add count of unique / distinct values by group to the original data (3 answers). Closed 4 years ago.
I'm stuck trying to do some counting on a data frame. The gist is to group by one variable and then break each group further into subgroups based on a second variable. From here I want to count the number of subgroups for each group. The sample code is this:
set.seed(123456)
df <- data.frame(User = c(rep("A", 5), rep("B", 4), rep("C", 6)),
                 Rank = c(rpois(5, 1), rpois(4, 2), rpois(6, 3)))
#This results in an error
df %>% group_by(User) %>% group_by(Rank) %>% summarize(Res = n_groups())
So what I want is 'User A' to have 3, 'User B' to have 4, and 'User C' to have 5. In other words the data frame df would end up looking like:
User Rank Result
1 A 2 3
2 A 2 3
3 A 1 3
4 A 0 3
5 A 0 3
6 B 1 4
7 B 2 4
8 B 0 4
9 B 6 4
10 C 1 5
11 C 4 5
12 C 3 5
13 C 5 5
14 C 5 5
15 C 8 5
I'm still learning dplyr, so I'm unsure how I should do it. How can this be achieved? Non-dplyr answers are also very welcome. Thanks in advance!
Try this:
df %>% group_by(User) %>% mutate(Result=length(unique(Rank)))
Or, equivalently, with n_distinct():
df %>% group_by(User) %>% mutate(Result=n_distinct(Rank))
A base R option would be using ave
df$Result <- with(df, ave(Rank, User, FUN = function(x) length(unique(x))))
df$Result
#[1] 3 3 3 3 3 4 4 4 4 5 5 5 5 5 5
and a data.table option is
library(data.table)
setDT(df)[, Result := uniqueN(Rank), by = User]
