Counting occurrences of a diagnosis code across multiple columns in a large R dataset

I'm using two years of NIS data (already combined) to search for a diagnosis code across all of the DX columns. The columns run from I10_DX1 to I10_DX40 (columns 18-57). I want to create a new dataset containing the observations that have this diagnosis code in any of these columns.
I've tried loops and the ICD packages but haven't been able to get it right. My most recent attempt:
get_icd_labels(icd3 = c("J80"), year = 2018:2019) %>%
  arrange(year, icd_sub) %>%
  filter(icd_sub %in% c("J80")) %>%
  select(year, icd_normcode, label) %>%
  knitr::kable(row.names = FALSE)

This is a tidyverse (dplyr) solution. If you don't already have a unique id for each record, I'd start out by adding one.
df <- df %>%
  mutate(my_id = row_number())
Next, I'd gather the diagnosis codes into a table where each record is a single diagnosis.
diagnoses <- df %>%
  select(my_id, 18:57) %>%
  gather("diag_num", "diag_code", 2:ncol(.)) %>%
  filter(!is.na(diag_code)) # no need to keep a bunch of empty rows
Finally, I would join my original df to the diagnoses data frame and filter for the code I want.
df %>%
  inner_join(diagnoses, by = "my_id") %>%
  filter(diag_code == "J80")
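If you only need the filtered dataset rather than the long diagnosis table, a more direct sketch uses dplyr's if_any() (available since dplyr 1.0.4), assuming the diagnosis columns are all named I10_DX1 through I10_DX40 as described in the question:
library(dplyr)
# keep rows where any I10_DX* column equals "J80"
j80_cases <- df %>%
  filter(if_any(starts_with("I10_DX"), ~ .x == "J80"))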

Related

How do you count the number of observations in multiple columns and use mutate to make the counts as new columns in R?

I have a dataset that has multiple lines of survey responses from different years and from different organizations. There are 100 questions in the survey and people can skip them. I am trying to get the average for each question by year by organization (so grouped by organization and year). I also want the count of the number of people in those averages, since people can skip questions. I want both as new columns, so it will add 200 columns total. I figured out how to get the average (see code below), but I can't seem to use the same approach to get the count of observations.
This is how I successfully got the average.
df <- df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Question'), mean, na.rm = TRUE,
                .names = "{.col}_average")) %>%
  ungroup()
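(An aside: passing extra arguments such as na.rm = TRUE through across()'s ... is deprecated as of dplyr 1.1.0; a lambda form avoids the warning.)
df <- df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Question'), ~ mean(.x, na.rm = TRUE),
                .names = "{.col}_average")) %>%
  ungroup()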
I am now trying to use a similar setup to get the count of observations. I duplicated the raw-data columns and added "Count" to their names, so that the new average columns are not among the columns R computes the n count for:
df <- df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Count'), function(x){sum(!is.na(.))},
                .names = "{.col}_ncount")) %>%
  ungroup()
The code above does get me the new columns, but the n count is the same for all columns and all rows. Any thoughts?
The issue is in the anonymous function: it is written as function(x), but the sum is computed on . instead of x. In a magrittr pipe, . refers to the whole data frame, so every new column gets the same overall count. Sum over the function's own argument instead:
library(dplyr)
df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Count'),
                function(x){sum(!is.na(x))},
                .names = "{.col}_ncount")) %>%
  ungroup()
If we want to use . or .x, specify the lambda with ~:
df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Count'),
                ~ sum(!is.na(.)),
                .names = "{.col}_ncount")) %>%
  ungroup()
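A tiny reproducible sketch (hypothetical data) showing that the counts now vary by group:
library(dplyr)
df <- data.frame(Organization = c("A", "A", "B", "B"),
                 Year = c(2020, 2020, 2020, 2020),
                 Q1_Count = c(5, NA, 3, 4))
df %>%
  group_by(Organization, Year) %>%
  mutate(across(contains('Count'), ~ sum(!is.na(.)),
                .names = "{.col}_ncount")) %>%
  ungroup()
# Q1_Count_ncount is 1 for organization A (one non-NA value) and 2 for B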

R: Add count for unique values within Group, disregarding other variables within dataframe

I would like to add a new variable to my data frame which, for each group, gives the number of unique entries with respect to one variable (state), while disregarding the others.
Data input
df <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 state = c("CT", "CT", "AK", "TX", "TX", "AZ", "GA", "TX", "WA"),
                 group = c(1, 1, 2, 3, 3, 3, 4, 4, 4),
                 age = c(12, 33, 57, 98, 45, 67, 16, 85, 22))
df
Desired output
want <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                   state = c("CT", "CT", "AK", "TX", "TX", "AZ", "GA", "TX", "WA"),
                   group = c(1, 1, 2, 3, 3, 3, 4, 4, 4),
                   age = c(12, 33, 57, 98, 45, 67, 16, 85, 22),
                   count = c(1, 1, 1, 2, 2, 2, 3, 3, 3))
want
We need n_distinct() by group:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(count = n_distinct(state)) %>%
  ungroup()
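For reference, a base R sketch of the same idea: compute the distinct-state count per group with tapply(), then index back by group:
cnt <- tapply(df$state, df$group, function(x) length(unique(x)))
df$count <- as.vector(cnt[as.character(df$group)])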

How to mutate new columns in R based on earliest and latest dates for other variables

In a dataset where each patient had multiple test administrations and a score on each test date, I have to identify the earliest & latest test dates, then take the difference between the scores on those dates. I think I've identified the first & last dates through dplyr, creating new columns for those:
SplitDates <- SortedDates %>%
  group_by(PatientID) %>%
  mutate(EarliestTestDate = min(AdministrationDate),
         LatestTestDate = max(AdministrationDate)) %>%
  arrange(desc(PatientID))
The score column is TotalScore.
Now how do I extract the scores from these 2 dates (for each patient) to create new columns of earliest & latest scores? Haven't been able to figure out a mutate with case_when or if_else to create a score based on a record with a certain date.
Have you tried a join verb, like left_join, for example?
SplitDates <- SortedDates %>%
  group_by(PatientID) %>%
  mutate(EarliestTestDate = min(AdministrationDate),
         LatestTestDate = max(AdministrationDate)) %>%
  ungroup() %>%
  left_join(SortedDates,
            by = c("PatientID", "EarliestTestDate" = "AdministrationDate"),
            suffix = c("", "_earliest")) %>% # picks the score on EarliestTestDate
  left_join(SortedDates,
            by = c("PatientID", "LatestTestDate" = "AdministrationDate"),
            suffix = c("", "_latest")) %>% # picks the score on LatestTestDate
  arrange(desc(PatientID)) # now you can do the mutate step you need
I suggest you take a look at the dplyr cheatsheet.
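An alternative sketch that avoids the joins entirely, assuming AdministrationDate is a Date and each patient has one TotalScore per test date:
library(dplyr)
ScoreChange <- SortedDates %>%
  group_by(PatientID) %>%
  summarise(EarliestScore = TotalScore[which.min(AdministrationDate)],
            LatestScore = TotalScore[which.max(AdministrationDate)],
            ScoreChange = LatestScore - EarliestScore) # difference between last and first scores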

Dataframe not populating when I am doing subset?

I have a dataframe that has two columns. One column is the product type, and the other is comment text. I essentially want to break the dataframe into 12 different dataframes, one for each level of 'product'. For the first level, I am running this code:
df = df %>% select('product','comments')
df['product'] = as.character(df['product'])
df['comments'] = as.character(df['comments'])
Now that the dataframe is in the structure I want it, I want to take a variety of subsets, and here is my first subset code:
df_boatstone = df[df$product == 'water',]
#df_boatstone <- subset(df, product == "boatstone", select = c('product','comments'))
I have tried both methods, and the dataframe is being created, but has nothing in it. Can anyone catch my mistake?
as.character() works on a vector, while df['product'] and df['comments'] are both data frames with a single column. Extract the column as a vector with [[:
df[['product']] <- as.character(df[['product']])
Or, better:
library(tidyverse)
df %>%
  select(product, comments) %>%
  mutate_all(as.character) %>%
  filter(product == 'water')
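Since the stated goal is one dataframe per product level, a base R sketch with split() gives all 12 at once:
product_dfs <- split(df, df$product) # named list, one dataframe per product level
df_boatstone <- product_dfs[['boatstone']]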

R dplyr group_by subject appears to use entire dataframe instead of subject

Background
I am working with a large dataset from a repeated measures clinical trial in R, where I want to do some data manipulations for each subject. This could be extraction of the max value in column x for each subject or the mean of column y for each subject.
Problem
I am fond of using the dplyr package and pipes, which led me to the group_by function. But when I try to apply it, the data that I want to extract does not seem to be grouped by subject as it is supposed to be; instead the values are computed over the entire dataset.
Code
This is what I have done so far:
data <- read.csv(file = "group_by_question.csv", header = TRUE, sep = ",")
library(dplyr)
library(plyr)
data <- tbl_df(data)
test <- data %>%
  filter(!is.na(wght)) %>%
  dplyr::group_by(subject_id) %>%
  mutate(maxwght = max(wght), meanwght = mean(wght)) %>%
  ungroup()
[Sample of the test dataframe omitted.]
A .csv sample of my dataset is available here:
https://drive.google.com/file/d/1wGkSQyJXqSswThiNsqC26qaP7d3catyX/view?usp=sharing
Is this what you want? In the example below, the output shows the max value of wght for each subject_id; you could replace max() with mean() if you need the mean instead. Note that your code loads plyr after dplyr, so the unqualified mutate() call resolves to plyr::mutate(), which ignores dplyr's grouping; that is the most likely reason your values were computed over the entire dataset.
library(dplyr)
data <- read.csv(file = "group_by_question.csv", header = TRUE, sep = ",")
test <- data %>%
  filter(!is.na(wght)) %>%
  group_by(subject_id) %>% # group first so mutate() works per subject
  mutate(maxwght = max(wght), meanwght = mean(wght)) %>%
  summarise(value = max(maxwght)) %>%
  ungroup()
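If both packages really are needed, a sketch of two workarounds for the masking (assuming the same data and columns as above):
library(plyr)
library(dplyr) # load dplyr last so its mutate()/summarise() are not masked
# or, without changing the load order, qualify the call:
test <- data %>%
  filter(!is.na(wght)) %>%
  group_by(subject_id) %>%
  dplyr::mutate(maxwght = max(wght), meanwght = mean(wght)) %>%
  ungroup()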
