I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and for each of the 2 million patients create a new binary variable column for each unique drug mention, and set the value to 1 for that pt. So a value of 1 for binary variable Heroin indicates that somewhere in the pts DRUGID_1....16 heroin is mentioned.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
if (!is.element(DRUGID.df$allDAWN.DRUGID_1,respDRUGID)){
Count <- Count + 1
respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)}
else {
assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)}
}
}
}
DrugPicker(DRUGID.df)
Here I have tried to first make a list to contain each new DRUGIDx value (respDRUGID) as well as a counter (Count) for the total number unique DRUGID values and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations and if not NA, then if DRUGID_1 is not in list respDRUGID then create a new column variable 'r.DRUGID' and set value to 1. Also increase the unique count by 1. Otherwise the value of DRUGID_1 is already in list respDRUGID then set r.DRUGID = 1
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.
Taking a guess at your data and required result format. Using package tidyverse
drug_df <- read.csv(text='
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')
library(tidyverse)
gather(drug_df, value = "DRUGID", ... = -patient, na.rm = TRUE) %>%
arrange(patient, DRUGID) %>%
group_by(patient) %>%
summarize(DRUGIDs = paste(DRUGID, collapse=","))
# patient DRUGIDs
# <fctr> <chr>
# 1 A 1,2,3
# 2 B 2
# 3 C 1,2
# 4 D 1,2,3
I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99 whose post helped think about the problem in another way.
Related
I have a dataset with a lot of variables that need to have its values labelled. I know how to add labels to their values one by one but I would like to incorporate a loop, which could automatically assign a label to a value of 1 (1 indicates that someone selected an option, for instance they have depression, while 0 means that they didn't select it) across several variables that are dsm_00 (where 1 is supposed to be labelled as "no diagnosis"), dsm_01 (1 is depression), dsm_02 (1 is anxiety) and so on up until dsm_34.
I have created a list of names to be assigned:
labels <- list("no diagnosis", "depression", "anxiety", "bipolar", ....).
And I have a code for how to do it one by one:
val_lab(mydat$dsm_00) = num_lab("
1 no diagnosis
")
I'm not sure how I would incorporate it as a loop (I have always struggled with those). Any help would be appreciated!
In this situation, you probably don't want to use a loop. An easier approach is to write a function that produces the desired label from a given value, and apply it to the whole column. A convenient way to do this is the mutate() function in the dplyr package. Here's an example:
labels <- list("no diagnosis", "depression", "anxiety", "bipolar")
# This is the function to contain your code for assigning labels
# based on values in your data set. Replace this with whatever
# logic you have. In this example, I've assumed that the values
# we are labeling are all integers we could use to look up labels.
get.label = Vectorize(
function(diagnosis.code) {
labels[[diagnosis.code]]
})
# This package gives you mutate() and %>%
library(dplyr)
# Example data.
data = data.frame(diagnosis.codes = c(1, 3, 2, 2, 1))
# Create a new column "label" by applying your function to the
# values in another column.
data = data %>% mutate(label = get.label(diagnosis.codes))
Now if you look at your data frame, you should get the following
> data
# response.codes label
# 1 1 no diagnosis
# 2 3 anxiety
# 3 2 depression
# 4 2 depression
# 5 1 no diagnosis
So I have a dataframe column with userID where there are duplicates. I was asked to find the userID that appear least frequent. What are the possible methods to achieve this. Only using Base R or Dplyr packages.
Something like this
userID = c(1,1,1,1,2,2,1,1,4,4,4,4,3)
Expected Output would be 3 in this case.
If this is based on the lengths of same adjacent values
with(rle(userID), values[which.min(lengths)])
#[1] 3
Or if it is based on the full data values
names(which.min(table(userID)))
#[1] "3"
Another possibility is to get the min of mode:
# example dataframe
df <- data.frame(userID = c(1,1,1,1,2,2,1,1,4,4,4,4,3))
# define Mode function
Mode <- function(x){
a = table(x) # x is a column
return(a[which.min(a)])
}
Mode(df$userID)
# Output:
3 #value
1 #count
Gives the value 3 and the count 1
I have a data set with multiple rows per patient, where each row represents a 1-week period of time over the course of 4 months. There is a variable grade that can take on values of 1,2,or 3, and I want to detect when a single patient's grade INCREASES (1 to 2, 1 to 3, or 2 to 3) at any point (the result would be a yes/no variable). I could write a function to do it but I'm betting there is some clever functional programming I could do to make use of existing R functions. Here is a sample data set below. Thank you!
df=data.frame(patient=c(1,1,1,2,2,3,3,3,3),period=c(1,2,3,1,3,1,3,4,5),grade=c(1,1,1,2,3,1,1,2,3))
what I would want is a resulting data frame of:
data.frame(patient=c(1,2,3),grade.increase=c(0,1,1))
library(dplyr)
df %>%
arrange(patient, period) %>%
mutate(grade.increase = case_when(grade > lag(grade) ~ TRUE,TRUE ~ FALSE)) %>%
group_by(patient) %>%
summarise(grade.increase = max(grade.increase))
Combining lag which checks the previous value with case_when allows us to identify each grade.increase.
Summarising the maximum of grade.increase for each patient gets the desired results as boolean calculations treat FALSE as 0 and TRUE as 1.
If you feel like doing this in base R, here's a solution that uses the split-apply-combine approach.
You use split to make a list with a separate data frame for each patient;
you use lapply to iterate a summarization function over each list element, where the summarization function uses diff to look at changes in grade and if and any to summarize; and then
you wrap the whole thing in do.call(rbind, ...) to collapse the resulting list into a data frame.
Here's what that looks like:
do.call(rbind, lapply(split(df, df[,"patient"]), function(i) {
data.frame(patient = i[,"patient"][1],
grade.increase = if (any(diff(i[,"grade"]) > 0)) 1 else 0 )
}))
Result:
patient grade.increase
1 1 0
2 2 1
3 3 1
I have a large dataframe with multiple columns representing different variables that were measured for different individuals. The name of the columns always start with a number (e.g. 1:18). I would like to subset the df and create separete dfs for each individual. Here it is an example:
x <- as.data.frame(matrix(nrow=10,ncol=18))
colnames(x) <- paste(1:18, 'col', sep="")
The column names of my real df is a composition of the Individual ID, the variable name, and the number of the measure (I took 3 measures of each variable). So for instance I have the measure b (body) for individual 1, then in the df I would have 3 columns named: 1b1, 1b2, 1b3. In the end I have 10 different regions (body, head, tail, tail base, dorsum, flank, venter, throat, forearm, leg). So for each individual I have 30 columns (10 regions x 3 measures per region). So I have multiple variables starting with the different numbers and I would like to subset then based on their unique numbers. I tried using grep:
partialName <- 1
df2<- x[,grep(partialName, colnames(x))]
colnames(x)
[1] "1col" "2col" "3col" "4col" "5col" "6col" "7col" "8col" "9col" "10col"
"11col" "12col" "13col" "14col" "15col" "16col" "17col" "18col"
My problem here as you can see it doesn't separate the individuals because 1 and 10 are in the subset. In other words this selects everybody that starts with 1.
Ultimately what I would like to do is to loop over all my individuals (1:18), creating new dfs for each individual.
I think keeping the data in one data.frame is the best option here. Either that, or put it into a list of data.frame's. This makes it easy to extract summary statistics per individual much easier.
First create some example data:
df = as.data.frame(matrix(runif(50 * 100), 100, 50), stringsAsFactors = FALSE)
names_variables = c('spam', 'ham', 'shrub')
individuals = 1:100
column_names = paste(sample(individuals, 50),
sample(names_variables, 50, TRUE),
sep = '')
colnames(df) = column_names
What I would do first is use melt to cast the data from wide format to long format. This essentially stacks all the columns in one big vector, and adds an extra column telling which column it came from:
library(reshape2)
df_melt = melt(df)
head(df_melt)
variable value
1 85ham 0.83619111
2 85ham 0.08503596
3 85ham 0.54599402
4 85ham 0.42579376
5 85ham 0.68702319
6 85ham 0.88642715
Then we need to separate the ID number from the variable. The assumption here is that the numeric part of the variable is the individual ID, and the text is the variable name:
library(dplyr)
df_melt = mutate(df_melt, individual_ID = gsub('[A-Za-z]', '', variable),
var_name = gsub('[0-9]', '', variable))
essentially removing the part of the string not needed. Now we can do nice things like:
mean_per_indivdual_per_var = summarise(group_by(df_melt, individual_ID, var_name),
mean(value))
head(mean_per_indivdual_per_var)
individual_ID var_name mean(value)
1 63 spam 0.4840511
2 46 ham 0.4979884
3 20 shrub 0.5094550
4 90 ham 0.5550148
5 30 shrub 0.4233039
6 21 ham 0.4764298
It seems that your colnames are the standard ones of a data.frame, so to get just the column 1 you can do this:
df2 <- df[,1] #Where 1 can be changed to the number of column you wish.
There is no need to subset by a partial name.
Although it is not recommended you could create a loop to do so:
for (i in ncol(x)){
assing(paste("df",i), x[,i]) #I use paste to get a different name for each column
}
Although the #paulhiemstra solution avoids the loop.
So with the new information then you can do as you wanted with grep, but specifically telling how many matches you expect:
df2<- x[,grep("1{30}", colnames(x))]
I have the following code, which does what I want. But I would like to know if there is a simpler/nicer way of getting there?
The overall aim of me doing this is that I am building a separate summary table for the overall data, so the average which comes out of this will go into that summary.
Test <- data.frame(
ID = c(1,1,1,2,2,2,3,3,3),
Thing = c("Apple","Apple","Pear","Pear","Apple","Apple","Kiwi","Apple","Pear"),
Day = c("Mon","Tue","Wed")
)
countfruit <- function(data){
df <- as.data.frame(table(data$ID,data$Thing))
df <- dcast(df, Var1 ~ Var2)
colnames(df) = c("ID", "Apple","Kiwi", "Pear")
#fixing the counts to apply a 1 for if there is any count there:
df$Apple[df$Apple>0] = 1
df$Kiwi[df$Kiwi>0] = 1
df$Pear[df$Pear>0] = 1
#making a new column in the summary table of how many for each person
df$number <- rowSums(df[2:4])
return(mean(df$number))}
result <- countfruit(Test)
I think you over complicate the problem, Here a small version keeping the same rationale.
df <- table(data$ID,data$Thing)
mean(rowSums(df>0)) ## mean of non zero by column
EDIT one linear solution:
with(Test , mean(rowSums(table(ID,Thing)>0)))
It looks like you are trying to count how many nonzero entries in each column. If so, either use as.logical which will convert any nonzero number to TRUE (aka 1) , or just count the number of zeros in a row and subtract from the number of pertinent columns.
For example, if I followed your code correctly, your dataframe is
Var1 Apple Kiwi Pear
1 1 2 0 1
2 2 2 0 1
3 3 1 1 1
So, (ncol(df)-1) - length(df[1,]==0) gives you the count for the first row.
Alternatively, use as.logical to convert all nonzero values to TRUE aka 1 and calculate the rowSums over the columns of interest.