Adding labels to values across many columns using a loop in R

I have a dataset with a lot of variables whose values need to be labelled. I know how to add labels to their values one by one, but I would like to incorporate a loop that automatically assigns a label to the value 1 (1 indicates that someone selected an option, for instance that they have depression, while 0 means they didn't select it) across several variables: dsm_00 (where 1 should be labelled "no diagnosis"), dsm_01 (1 is depression), dsm_02 (1 is anxiety), and so on up to dsm_34.
I have created a list of names to be assigned:
labels <- list("no diagnosis", "depression", "anxiety", "bipolar", ....).
And I have code for doing it one by one:
val_lab(mydat$dsm_00) = num_lab("
1 no diagnosis
")
I'm not sure how I would incorporate it as a loop (I have always struggled with those). Any help would be appreciated!

In this situation, you probably don't want to use a loop. An easier approach is to write a function that produces the desired label from a given value, and apply it to the whole column. A convenient way to do this is the mutate() function in the dplyr package. Here's an example:
labels <- list("no diagnosis", "depression", "anxiety", "bipolar")
# This is the function to contain your code for assigning labels
# based on values in your data set. Replace this with whatever
# logic you have. In this example, I've assumed that the values
# we are labeling are all integers we could use to look up labels.
get.label = Vectorize(function(diagnosis.code) {
  labels[[diagnosis.code]]
})
# This package gives you mutate() and %>%
library(dplyr)
# Example data.
data = data.frame(diagnosis.codes = c(1, 3, 2, 2, 1))
# Create a new column "label" by applying your function to the
# values in another column.
data = data %>% mutate(label = get.label(diagnosis.codes))
Now if you look at your data frame, you should get the following:
> data
#   diagnosis.codes        label
# 1               1 no diagnosis
# 2               3      anxiety
# 3               2   depression
# 4               2   depression
# 5               1 no diagnosis
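If you would rather keep the expss-style labelling from your own example, the loop itself is short. A rough sketch, assuming mydat is your data frame, the columns really are named dsm_00 through dsm_34, and labels has one entry per column in the same order:
library(expss)
dsm_cols <- sprintf("dsm_%02d", 0:34)
for (i in seq_along(dsm_cols)) {
  # build the same "1 <label>" string used in the one-by-one example above
  val_lab(mydat[[dsm_cols[i]]]) <- num_lab(paste(1, labels[[i]]))
}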


Multivariate Cointegration Test?

I am currently performing an analysis on FDI and income inequality in a panel of 10 countries over 30 years. In this setting, I want to test for panel cointegration, unit roots etc. My data is currently in the format:
ID year var1 var2
1 1 3 4
1 2 4 NA
1 3 1 6
2 1 4 2
2 2 1 3
2 3 2 2
and so on. I have used
data.frame(split(df$var1, df$ID))
to create a dataframe with the IDs as columns to use the purtest function from the plm package (note that my panel data is unbalanced). However, I have a lot of variables and this whole procedure seems excessively tedious.
Is there a way to perform a cointegration test on multiple variables at once? There appears to be an index specification in purtest but I don't fully understand how to use it. What format would my data have to be in, and how would I get there? Would it be an issue that there are quite a few NAs in some variables?
df <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
year = c(1,2,3,4,5,6,7,1,2,3,4,5,6,7),
var1 = c(3,5,6,2,NA,6,4, 2,3,4,5,8,NA,7),
var2 = c(5,6,3,8,NA,NA,2, NA,6,7,4,5,9,2))
The index argument to plm's purtest works like in the other functions of the package: it is used to specify the individual and the time dimension of your data. However, there is no need to use it directly in purtest (or other functions) if the data are converted to a pdata.frame first.
There is also no need to split the data by individual on your end. purtest supports various ways/formats to input data, and a user-driven split is just one option. If you already have a data frame in long format, I suggest converting it to a pdata.frame and passing a pseries to purtest (so there is no need to split the data yourself). Something along these lines:
library(plm)
df <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
year = c(1,2,3,4,5,6,7,1,2,3,4,5,6,7),
var1 = c(3,5,6,2,NA,6,4,2,3,4,5,8,NA,7),
var2 = c(5,6,3,8,NA,NA,2,NA,6,7,4,5,9,2))
pdf <- pdata.frame(df)
purtest(pdf$var1, test = "hadri", exo = "intercept")
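If the individual and time columns are ever not the first two columns of your data frame, the index argument mentioned above can be given explicitly when the pdata.frame is built (a small sketch on the same data; the default above already does the right thing here):
# same data as above; name the individual and time dimensions explicitly
# instead of relying on the first-two-columns default of pdata.frame()
pdf <- pdata.frame(df, index = c("id", "year"))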
To test multiple variables at once, you can loop purtest over the variables, e.g. with lapply:
myvars <- as.list(pdf[ , -c(1:2)], keep.attributes = TRUE)
lapply(myvars, function(x) purtest(x, test = "ips", exo = "intercept", pmax = 1L))

R code to detect a change in a variable over time for multiple patients

I have a data set with multiple rows per patient, where each row represents a 1-week period of time over the course of 4 months. There is a variable grade that can take on values of 1,2,or 3, and I want to detect when a single patient's grade INCREASES (1 to 2, 1 to 3, or 2 to 3) at any point (the result would be a yes/no variable). I could write a function to do it but I'm betting there is some clever functional programming I could do to make use of existing R functions. Here is a sample data set below. Thank you!
df=data.frame(patient=c(1,1,1,2,2,3,3,3,3),period=c(1,2,3,1,3,1,3,4,5),grade=c(1,1,1,2,3,1,1,2,3))
What I would want is a resulting data frame of:
data.frame(patient=c(1,2,3),grade.increase=c(0,1,1))
library(dplyr)
df %>%
  arrange(patient, period) %>%
  group_by(patient) %>%   # group first so lag() compares within each patient
  mutate(grade.increase = case_when(grade > lag(grade) ~ TRUE,
                                    TRUE ~ FALSE)) %>%
  summarise(grade.increase = max(grade.increase))
Combining lag which checks the previous value with case_when allows us to identify each grade.increase.
Summarising the maximum of grade.increase for each patient gets the desired results as boolean calculations treat FALSE as 0 and TRUE as 1.
If you feel like doing this in base R, here's a solution that uses the split-apply-combine approach.
You use split to make a list with a separate data frame for each patient;
you use lapply to iterate a summarization function over each list element, where the summarization function uses diff to look at changes in grade and if and any to summarize; and then
you wrap the whole thing in do.call(rbind, ...) to collapse the resulting list into a data frame.
Here's what that looks like:
do.call(rbind, lapply(split(df, df[,"patient"]), function(i) {
  data.frame(patient = i[,"patient"][1],
             grade.increase = if (any(diff(i[,"grade"]) > 0)) 1 else 0)
}))
Result:
patient grade.increase
1 1 0
2 2 1
3 3 1
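The same split-apply-combine idea can also be written more compactly with tapply, if you prefer (a sketch over the same df defined above):
ord <- df[order(df$patient, df$period), ]
increased <- tapply(ord$grade, ord$patient, function(g) any(diff(g) > 0))
data.frame(patient = as.numeric(names(increased)),
           grade.increase = as.integer(increased))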

dplyr::mutate changes row numbers, how to keep them?

I am using lme4::lmList on a tibble to obtain the coefficients of linear fit lines fitted for each subject (id) in my data. What I actually want is a nice long chain of pipes because I don't want to keep any of this output, just use it for a slope/intercept plot. However, I am running into a problem. lmList is creating a dataframe where the row numbers are the original subject ID numbers. I want to keep that information, but as soon as I use mutate on the output, the row numbers change to be sequential from 1. I tried rescuing them first by using rowid_to_column but that just gives me a column of sequential numbers from 1 too. What can I do, other than drop out of the pipe and put them in a column with base R? Is unique(a_df$id) really the best solution? I had a look around on here but didn't see a question like this one.
library(tibble)
library(dplyr)
library(Matrix)
library(lme4)
a_df <- tibble(id = c(rep(4, 3), rep(11, 3), rep(12, 3), rep(42, 3)),
               age = c(rep(seq(1, 3), 4)),
               hair = 1 + (age*2) + rnorm(12) + as.vector(sapply(rnorm(4), function(x) rep(x, 3))))
# as.data.frame to get around stupid RStudio diagnostics bug
int_slope <- coef(lmList(hair ~ age | id, as.data.frame(a_df))) %>%
  setNames(., c("Intercept", "Slope"))
# Notice how the row numbers are the original subject ids?
print(int_slope)
Intercept Slope
4 2.9723596 1.387635
11 0.2824736 2.443538
12 -1.8912636 2.494236
42 0.8648395 1.680082
int_slope2 <- int_slope %>% mutate(ybar = Intercept + (mean(a_df$age) * Slope))
# Look! Mutate has changed them to be the numbers 1 to 4
print(int_slope2)
Intercept Slope ybar
1 2.9723596 1.387635 5.747630
2 0.2824736 2.443538 5.169550
3 -1.8912636 2.494236 3.097207
4 0.8648395 1.680082 4.225004
# Try to rescue them with rowid_to_column
int_slope3 <- int_slope %>% rowid_to_column(var = "id")
# Nope, 1 to 4 again
print(int_slope3)
id Intercept Slope
1 1 2.9723596 1.387635
2 2 0.2824736 2.443538
3 3 -1.8912636 2.494236
4 4 0.8648395 1.680082
Thanks,
SJ
The dplyr/tidyverse universe doesn't "believe in" row names. Any data that is important for an observation should be included in a column. The tibble package includes a function to move row names into a column. Try
int_slope %>% rownames_to_column()
before any mutates.
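So the whole thing can stay in one pipe, something like this (a sketch; "id" is just the column name chosen here, and rownames_to_column() comes from the tibble package already loaded above):
int_slope2 <- int_slope %>%
  rownames_to_column(var = "id") %>%   # rescue the subject ids before mutate drops them
  mutate(ybar = Intercept + mean(a_df$age) * Slope)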
Nothing like asking for help to make you see the answer. Those aren't row numbers, they're numeric row names. Of course they are! Non-contiguous row numbers make no sense. rownames_to_column is my answer.
Why don't you just create another 'ybar' column in int_slope?
int_slope$ybar <- int_slope$Intercept + mean(a_df$age) * int_slope$Slope

R function that creates indicator variable values unique between several columns

I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and for each of the 2 million patients create a new binary variable column for each unique drug mention, and set the value to 1 for that pt. So a value of 1 for binary variable Heroin indicates that somewhere in the pts DRUGID_1....16 heroin is mentioned.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
  for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
    if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
      if (!is.element(DRUGID.df$allDAWN.DRUGID_1,respDRUGID)){
        Count <- Count + 1
        respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
        assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)}
      else {
        assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)}
    }
  }
}
DrugPicker(DRUGID.df)
Here I have tried to first make a list to hold each new DRUGID value (respDRUGID), a counter (Count) for the total number of unique DRUGID values, and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations; if the value is not NA and DRUGID_1 is not yet in respDRUGID, it should create a new column variable 'r.DRUGID', set its value to 1, and increase the unique count by 1. Otherwise, if the value of DRUGID_1 is already in respDRUGID, it should just set r.DRUGID = 1.
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.
Taking a guess at your data and required result format, using the tidyverse package:
drug_df <- read.csv(text='
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')
library(tidyverse)
gather(drug_df, value = "DRUGID", ... = -patient, na.rm = TRUE) %>%
  arrange(patient, DRUGID) %>%
  group_by(patient) %>%
  summarize(DRUGIDs = paste(DRUGID, collapse=","))
# patient DRUGIDs
# <fctr> <chr>
# 1 A 1,2,3
# 2 B 2
# 3 C 1,2
# 4 D 1,2,3
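If you do want the one-indicator-column-per-drug layout described in the question, rather than the collapsed strings above, a base R sketch along these lines should work on the same toy data (column names of the form DRUGID_* assumed):
drug_cols <- grep("^DRUGID_", names(drug_df), value = TRUE)
codes <- sort(unique(na.omit(unlist(drug_df[drug_cols]))))
for (code in codes) {
  # 1 if this code appears anywhere in the patient's DRUGID_* columns, else 0
  drug_df[[paste0("r.", code)]] <-
    as.integer(rowSums(drug_df[drug_cols] == code, na.rm = TRUE) > 0)
}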
I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99, whose post helped me think about the problem in another way.

Subset data frame based on first letters of column name

I have a large dataframe with multiple columns representing different variables that were measured for different individuals. The name of the columns always starts with a number (e.g. 1:18). I would like to subset the df and create separate dfs for each individual. Here is an example:
x <- as.data.frame(matrix(nrow=10,ncol=18))
colnames(x) <- paste(1:18, 'col', sep="")
The column names of my real df are a combination of the individual ID, the variable name, and the number of the measure (I took 3 measures of each variable). So, for instance, for the measure b (body) for individual 1, the df has 3 columns named 1b1, 1b2, 1b3. In all I have 10 different regions (body, head, tail, tail base, dorsum, flank, venter, throat, forearm, leg), so for each individual I have 30 columns (10 regions x 3 measures per region). So I have multiple variables starting with different numbers and I would like to subset them based on their unique numbers. I tried using grep:
partialName <- 1
df2<- x[,grep(partialName, colnames(x))]
colnames(x)
 [1] "1col"  "2col"  "3col"  "4col"  "5col"  "6col"  "7col"  "8col"  "9col"  "10col"
[11] "11col" "12col" "13col" "14col" "15col" "16col" "17col" "18col"
My problem here, as you can see, is that it doesn't separate the individuals, because both 1 and 10 end up in the subset. In other words, this selects every column whose name contains a 1.
Ultimately what I would like to do is to loop over all my individuals (1:18), creating new dfs for each individual.
I think keeping the data in one data.frame is the best option here. Either that, or put it into a list of data.frames. This makes it much easier to extract summary statistics per individual.
First create some example data:
df = as.data.frame(matrix(runif(50 * 100), 100, 50), stringsAsFactors = FALSE)
names_variables = c('spam', 'ham', 'shrub')
individuals = 1:100
column_names = paste(sample(individuals, 50),
                     sample(names_variables, 50, TRUE),
                     sep = '')
colnames(df) = column_names
What I would do first is use melt to reshape the data from wide format to long format. This essentially stacks all the columns in one big vector, and adds an extra column telling you which column each value came from:
library(reshape2)
df_melt = melt(df)
head(df_melt)
variable value
1 85ham 0.83619111
2 85ham 0.08503596
3 85ham 0.54599402
4 85ham 0.42579376
5 85ham 0.68702319
6 85ham 0.88642715
Then we need to separate the ID number from the variable. The assumption here is that the numeric part of the variable is the individual ID, and the text is the variable name:
library(dplyr)
df_melt = mutate(df_melt, individual_ID = gsub('[A-Za-z]', '', variable),
                 var_name = gsub('[0-9]', '', variable))
essentially removing the part of the string not needed. Now we can do nice things like:
mean_per_individual_per_var = summarise(group_by(df_melt, individual_ID, var_name),
                                        mean(value))
head(mean_per_individual_per_var)
individual_ID var_name mean(value)
1 63 spam 0.4840511
2 46 ham 0.4979884
3 20 shrub 0.5094550
4 90 ham 0.5550148
5 30 shrub 0.4233039
6 21 ham 0.4764298
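And if you really do want a separate data frame per individual (the list-of-data.frames option mentioned at the top), the long format makes that a one-liner:
per_individual <- split(df_melt, df_melt$individual_ID)   # one data frame per individual ID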
It seems that your colnames are the standard ones of a data.frame, so to get just column 1 you can do this:
df2 <- df[,1] # where 1 can be changed to the number of the column you want
There is no need to subset by a partial name.
Although it is not recommended, you could create a loop to do so:
for (i in 1:ncol(x)){
  assign(paste0("df", i), x[,i]) # paste0 builds a different name for each column
}
Although @paulhiemstra's solution avoids the loop.
So with the new information about your column names, you can still do what you wanted with grep, as long as the pattern pins down the whole individual ID so that 1 does not also match 10-18:
df2 <- x[, grep("^1[a-z]", colnames(x))]
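Or, to build all of the individuals at once, a sketch that splits the columns into a named list of data frames, assuming names like "1b1", "10b2", ... where the leading digits are the individual ID:
ids <- sub("[A-Za-z].*$", "", colnames(x))   # "1b1" -> "1", "10b2" -> "10"
per_individual <- lapply(unique(ids), function(i) x[, ids == i, drop = FALSE])
names(per_individual) <- unique(ids)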
