Dynamically create value labels with haven::labelled, follow-up - r

Follow-up question to Dynamically create value labels with haven::labelled, where akrun provided a good answer using deframe.
I am using haven::labelled to set value labels of a variable. The goal is to create a fully documented dataset I can export to SPSS.
Now, say I have a df value_labels of values and their value labels. I also have a df df_data with variables to which I want to allocate value labels.
value_labels <- tibble(
value = c(seq(1:6), seq(1:3), NA),
labels = c(paste0("value", 1:6),paste0("value", 1:3), NA),
name = c(rep("var1", 6), rep("var2", 3), "var3")
)
df_data <- tibble(
id = 1:10,
var1 = floor(runif(10, 1, 7)),
var2 = floor(runif(10, 1, 4)),
var3 = rep("string", 10)
)
Manually, I would create value labels for df_data$var1 and df_data$var2 like so:
df_data$var1 <- haven::labelled(df_data$var1, labels = c(values1 = 1, values2 = 2, values3 = 3, values4 = 4, values5 = 5, values6 = 6))
df_data$var2 <- haven::labelled(df_data$var2, labels = c(values1 = 1, values2 = 2, values3 = 3))
I need a more dynamic way of assigning the correct value labels to the correct variables in a large dataset. The solution also needs to ignore character vectors, since I don't want these to have value labels. For that reason, var3 in value_labels is listed as NA.
The solution does not need to work with multiple datasets in a list.

Here is one option: after removing the NA rows, we split the named 'value/labels' vector by 'name', use the names of the resulting list to subset the columns of 'df_data', apply labelled to each column, and assign the result back to the same columns.
lbls2 <- na.omit(value_labels)
lstLbls <- with(lbls2, split(setNames(value, labels), name))
df_data[names(lstLbls)] <- Map(haven::labelled,
df_data[names(lstLbls)], labels = lstLbls)
df_data
# A tibble: 10 x 4
# id var1 var2 var3
# <int> <dbl+lbl> <dbl+lbl> <chr>
# 1 1 2 [value2] 2 [value2] string
# 2 2 5 [value5] 2 [value2] string
# 3 3 4 [value4] 1 [value1] string
# 4 4 1 [value1] 2 [value2] string
# 5 5 1 [value1] 1 [value1] string
# 6 6 6 [value6] 2 [value2] string
# 7 7 1 [value1] 3 [value3] string
# 8 8 1 [value1] 1 [value1] string
# 9 9 3 [value3] 3 [value3] string
#10 10 6 [value6] 1 [value1] string
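To make the split(setNames(...)) step easier to follow, here is a minimal base-R sketch that builds only the list of named label vectors and stops before the haven::labelled call. It assumes the lookup table from the question, rebuilt here as a plain data.frame so that the block needs no packages.

```r
# Label lookup table in long form: one row per (variable, value) pair.
value_labels <- data.frame(
  value  = c(1:6, 1:3, NA),
  labels = c(paste0("value", 1:6), paste0("value", 1:3), NA),
  name   = c(rep("var1", 6), rep("var2", 3), "var3"),
  stringsAsFactors = FALSE
)

# Drop the placeholder row for the character column (value and label are NA).
lbls2 <- na.omit(value_labels)

# setNames() attaches each label as the name of its value;
# split() then breaks that named vector into one piece per variable.
lstLbls <- with(lbls2, split(setNames(value, labels), name))

str(lstLbls)
```

Each element of lstLbls is a named vector in the form that the labels argument expects, and var3 is absent from the list, so the character column is never touched.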

Related

Using a column as a column index to extract value from a data frame in R

I am trying to use the values from a column to extract column numbers in a data frame. My problem is similar to this topic in r-bloggers. Copying the script here:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)
However, instead of having column names in choice, I have column index numbers, such that my data frame looks like this:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3),
stringsAsFactors = FALSE)
I tried using this solution:
df$newValue <-
df[cbind(
seq_len(nrow(df)),
match(df$choice, colnames(df))
)]
Instead of giving me an output that looks like this:
# x y choice newValue
# 1 1 5 1 1
# 2 2 6 2 6
# 3 3 7 1 3
# 4 4 8 3 NA
My newValue column returns all NAs.
# x y choice newValue
# 1 1 5 1 NA
# 2 2 6 2 NA
# 3 3 7 1 NA
# 4 4 8 3 NA
What should I modify in the code so that it would read my choice column as column index?
Since choice already holds column numbers, match() is not needed here. However, the data frame also contains the choice column itself, which should not be used as a source of values, so any index outside the range of the value columns has to be turned into NA before subsetting the data frame.
mat <- cbind(seq_len(nrow(df)), df$choice)
mat[mat[, 2] > (ncol(df) -1), ] <- NA
df$newValue <- df[mat]
df
# x y choice newValue
#1 1 5 1 1
#2 2 6 2 6
#3 3 7 1 3
#4 4 8 3 NA
data
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3))
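A variant of the same idea, as a sketch: index into a matrix built from the value columns only, so the choice column can never be picked by accident. The names value_cols, vals and col_idx are made up for this illustration.

```r
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3))

value_cols <- c("x", "y")            # columns that `choice` is allowed to point at
vals <- as.matrix(df[value_cols])    # numeric matrix with one column per value column

# Indices outside 1..length(value_cols) become NA, which yields NA in the result.
col_idx <- ifelse(df$choice %in% seq_along(value_cols), df$choice, NA)

# A two-column (row, column) index matrix picks exactly one cell per row.
df$newValue <- vals[cbind(seq_len(nrow(df)), col_idx)]
df
```

Rows of an index matrix that contain NA produce NA in the result, which is why the out-of-range choice of 3 ends up as NA rather than raising an error.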

Sum values in a comma-separated string

I have a dataset that has a column called QTY in which most of the values are already summed, but a few are several integers separated by commas. How can I replace those rows with the sums of the values?
I have:
ID Name QTY
1 Abc 2
2 Bac 3
3 Cba 2, 4, 5, 8
4 Bcb 4, 1
Desired result:
ID Name QTY
1 Abc 2
2 Bac 3
3 Cba 19
4 Bcb 5
I've tried messing around with for loops a bit and using ifelse(), but I can't quite figure it out.
This looks a bit ugly but should work, assuming column QTY is a character vector:
your_df$QTY_new <- sapply(strsplit(your_df$QTY, ", "), function(x) sum(as.numeric(x)))
With a for loop, it could look like this:
library(data.table)
data <- data.table(ID = 1:4,
Name = c("Abc", "Bac", "Cba", "Bcb"),
QTY = c("2", "3", "2, 4, 5, 8", "4, 1"),
QTY2 = numeric(4))
for(i in 1:nrow(data)){
data$QTY2[i] <- sum(as.numeric(unlist(strsplit(as.character(data$QTY[i]), ', '))))
}
and the resulting DF is:
ID Name QTY QTY2
1: 1 Abc 2 2
2: 2 Bac 3 3
3: 3 Cba 2, 4, 5, 8 19
4: 4 Bcb 4, 1 5
I wrote a function that solves your problem. Let me explain how it works:
sumInRow = function(row_value, split = ",") {
# 1. split the values
row_value = strsplit(row_value, split = split)
# 2. Convert them to numeric and sum
row_sum = sapply(row_value, function(single_row) {
single_row = as.numeric(single_row)
return(sum(single_row))
})
return(row_sum)
}
By default, row_value will be a character vector because of the commas.
First, we need to split each value:
row_value = strsplit(row_value, split = split)
This returns a list containing the split pieces for every element of row_value; we'll use it in the next step.
row_sum = sapply(row_value, function(single_row) {
single_row = as.numeric(single_row)
return(sum(single_row))
})
sapply works as an iterator: for each element of the list, it applies the given function, which converts the pieces to numeric and returns their sum.
[EDIT_1]
To use it, call:
sumInRow(<your data frame>$QTY)
I hope it helps you.
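For reference, the split-then-sum steps described above can be condensed into a few lines. This is only a sketch: it assumes QTY is a character vector, splits on a comma followed by optional whitespace, and uses vapply instead of sapply so that the result is guaranteed to be numeric. The vectors qty, parts and totals are illustrative names.

```r
qty <- c("2", "3", "2, 4, 5, 8", "4, 1")

# strsplit() returns a list with one character vector of pieces per element.
parts <- strsplit(qty, ",\\s*")

# Sum the pieces of each element; numeric(1) declares one number per element.
totals <- vapply(parts, function(p) sum(as.numeric(p)), numeric(1))
totals
```

The result has one total per input string, so it can be assigned straight back to the QTY column.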
Here is one option with the tidyverse. We split the 'QTY' column on the delimiter to expand the rows (separate_rows), group by 'ID' and 'Name', and take the sum of 'QTY'.
library(tidyverse)
df1 %>%
separate_rows(QTY, convert = TRUE) %>%
group_by(ID, Name) %>%
summarise(QTY = sum(QTY))
# A tibble: 4 x 3
# Groups: ID [4]
# ID Name QTY
# <int> <chr> <int>
#1 1 Abc 2
#2 2 Bac 3
#3 3 Cba 19
#4 4 Bcb 5
data
df1 <- structure(list(ID = 1:4, Name = c("Abc", "Bac", "Cba", "Bcb"),
QTY = c("2", "3", "2, 4, 5, 8", "4, 1")), class = "data.frame", row.names = c(NA,
-4L))

Create new identifier column in data frame with values from the name of containing nested list

I would like to create a new identifier column in each data frame, with values taken from the name of the nested list that contains it.
parent <- list(
a = list(
foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))),
b = list(
foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))))
Therefore, the result for the first data frame in list a would look like this:
> foo
first second identifier
1 1 4 a
2 2 5 a
3 3 6 a
The first data frame in list b would look like this:
>foo
first second identifier
1 1 4 b
2 2 5 b
3 3 6 b
Seems like you might want something like this
Map(function(name, list) {
lapply(list, function(x) cbind(x, identifier=name))
}, names(parent), parent)
Here we use Map() and take the list and the names of the list and just cbind those identifiers into the data.frames.
We could use the tidyverse. Loop through the outer list with imap, which supplies both the values and the keys (the names of the list) as .x and .y. Then, with map2, loop through the inner list of data frames and use mutate to create the column 'identifier' from .y, i.e. the name of the outer list.
library(tidyverse)
imap(parent, ~ map2(.x, .y, ~ .x %>%
mutate(identifier = .y)))
#$a
#$a$foo
# first second identifier
#1 1 4 a
#2 2 5 a
#3 3 6 a
#$a$bar
# first second identifier
#1 1 4 a
#2 2 5 a
#3 3 6 a
#$a$puppy
# first second identifier
#1 1 4 a
#2 2 5 a
#3 3 6 a
#$b
#$b$foo
# first second identifier
#1 1 4 b
#2 2 5 b
#3 3 6 b
#$b$bar
# first second identifier
#1 1 4 b
#2 2 5 b
#3 3 6 b
#$b$puppy
# first second identifier
#1 1 4 b
#2 2 5 b
#3 3 6 b
If we want the column to be based on the data.frame name instead, loop through the outer list elements with map, then use imap on the inner list to get its keys (the names of the inner list) and create the new column 'identifier'.
map(parent, ~ imap(.x, ~ .x %>%
mutate(identifier = .y)))
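The inner-name variant can also be sketched in base R with lapply and Map. The helper name tag_inner is made up for this example, and the list is a shortened version of the one in the question.

```r
parent <- list(
  a = list(foo = data.frame(first = 1:3, second = 4:6),
           bar = data.frame(first = 1:3, second = 4:6)),
  b = list(foo = data.frame(first = 1:3, second = 4:6))
)

# For one inner list, pair each data frame with its own name and
# bind that name on as a constant 'identifier' column.
tag_inner <- function(inner) {
  Map(function(x, nm) cbind(x, identifier = nm), inner, names(inner))
}

# Apply the helper to every inner list; outer and inner names are preserved.
res <- lapply(parent, tag_inner)
res$a$bar
```

Here res$a$bar carries "bar" in its identifier column, whereas the Map() answer further up would give it "a".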

How to create a rank for a variable in a longitudinal dataset based on a condition?

I have a longitudinal dataset where each subject is represented more than once. One row represents one admission for a patient. Each admission, regardless of the subject, also has a unique "key". I need to figure out which admission is the "INDEX" admission, that is, the first admission, so that I know which rows are the subsequent RE-admissions. The variable to use is "Daystoevent"; the lowest number represents the INDEX admission. I want to create a new variable such that, for each subject, the row with the lowest "Daystoevent" is the "index" admission and each subsequent admission gets the number "1", "2", etc. I want to do this WITHOUT converting to the horizontal (wide) format.
The dataset looks like this:
Subject Daystoevent Key
A 5 rtwe
A 8 erer
B 3 tter
B 8 qgfb
A 2 sada
C 4 ccfw
D 7 mjhr
B 4 sdfw
C 1 srtg
C 2 xcvs
D 3 muyg
Would appreciate some help.
This may not be an elegant solution but will do the job:
library(dplyr)
df <- df %>%
group_by(Subject) %>%
arrange(Subject, Daystoevent) %>%
mutate(
Admission = if_else(Daystoevent == min(Daystoevent), 0, 1),
) %>%
ungroup()
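# Walk down the sorted rows: every non-index row gets the previous row's number + 1.
# Index rows (0) are left alone, so the count restarts for each subject.
for(i in 1:(nrow(df) - 1)) {
  if(df$Admission[i + 1] != 0) {
    df$Admission[i + 1] <- df$Admission[i] + 1
  }
}
# Relabel only the Admission column, so a Daystoevent of 0 would not be overwritten.
df$Admission <- ifelse(df$Admission == 0, "index", as.character(df$Admission))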
df
# # A tibble: 11 x 4
# Subject Daystoevent Key Admission
# <chr> <dbl> <chr> <chr>
# 1 A 2 sada index
# 2 A 5 rtwe 1
# 3 A 8 erer 2
# 4 B 3 tter index
# 5 B 4 sdfw 1
# 6 B 8 qgfb 2
# 7 C 1 srtg index
# 8 C 2 xcvs 1
# 9 C 4 ccfw 2
# 10 D 3 muyg index
# 11 D 7 mjhr 1
Data:
df <- tibble(
Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw", "mjhr", "sdfw", "srtg", "xcvs", "muyg")
)
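As an alternative sketch that avoids the explicit loop and keeps the original row order, the admission number can be computed as a within-subject rank using base R's ave(). It assumes the same data as above, stored as a plain data.frame, and breaks ties in Daystoevent by row order.

```r
df <- data.frame(
  Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
  Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
  Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw",
          "mjhr", "sdfw", "srtg", "xcvs", "muyg"),
  stringsAsFactors = FALSE
)

# Rank Daystoevent within each subject; subtracting 1 makes the earliest row 0.
adm <- ave(df$Daystoevent, df$Subject,
           FUN = function(d) rank(d, ties.method = "first") - 1)

# Label the earliest admission "index" and number the readmissions 1, 2, ...
df$Admission <- ifelse(adm == 0, "index", as.character(adm))

# Sorting is only for display; the data stay in long format.
df[order(df$Subject, df$Daystoevent), ]
```

Because the rank is computed per subject, the numbering restarts for each subject no matter how many admissions the previous one had.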

R ifelse loop on unique values always resolves FALSE

I am newish to R and having trouble with a for loop over unique values.
With the df:
id = c(1,2,2,3,3,4)
rank = c(1,2,1,3,3,4)
df = data.frame(id, rank)
I run:
df$dg <- logical(6)
for(i in unique(df$id)){
ifelse(!unique(df$rank), df$dg ==T, df$dg == F)
}
I am trying to mark the $dg variable as T provided that rank is different for each unique id, and F if rank is the same within each id.
I am not getting any errors, but I am only getting F for all values of $dg even though I should be getting a mix.
I have also used the following loop with the same results:
for(i in unique(df$id)){
ifelse(length(unique(df$rank)), df$dg ==T, df$dg == F)
}
I have read other similar posts but the advice has not worked for my case.
From Comments:
I want to mark dg TRUE for all instances of an id if rank changed at all for a given id. I'm looking to say, for a given ID which has anywhere between 1-13 instances, mark dg TRUE if rank differs across instances.
Update: How to identify groups (ids) that only have one rank?
After the clarification provided by the OP, this would be a solution for this particular case:
library(dplyr)
df %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
For another data set (shown below), in which one id has both duplicated and non-duplicated ranks, this would be the output:
df2 %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
#:OUTPUT:
# Source: local data frame [9 x 3]
# Groups: id [5]
#
# # A tibble: 9 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
# 7 5 1 TRUE
# 8 5 1 TRUE
# 9 5 3 TRUE
Data-no-2:
df2 <- structure(list(id = c(1, 2, 2, 3, 3, 4, 5, 5, 5), rank = c(1, 2, 1, 3, 3, 4, 1, 1, 3
)), .Names = c("id", "rank"), row.names = c(NA, -9L), class = "data.frame")
How to identify duplicated rows within each group (id)?
You can use dplyr package:
library(dplyr)
df %>%
group_by(id, rank) %>%
mutate(dg = ifelse(n() > 1, F,T))
This will give you:
# Source: local data frame [6 x 3]
# Groups: id, rank [5]
#
# # A tibble: 6 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
Note: You can simply convert it back to a data.frame().
A data.table solution would be:
library(data.table)
dt <- data.table(df)
dt$dg <- ifelse(dt[ , dg := .N, by = list(id, rank)]$dg>1,F,T)
Data:
df <- structure(list(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3,
3, 4)), .Names = c("id", "rank"), row.names = c(NA, -6L), class = "data.frame")
# > df
# id rank
# 1 1 1
# 2 2 2
# 3 2 1
# 4 3 3
# 5 3 3
# 6 4 4
N.B. Unless you want an identifier other than TRUE/FALSE, using ifelse() is redundant and adds computational cost. #DavidArenburg
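To illustrate that note, here is a sketch in base R that produces the logical column directly, with no ifelse(). It follows the rule from the update above (TRUE when an id has more than one distinct rank, or only a single row); n_ranks and n_rows are names made up for this example.

```r
df <- data.frame(id = c(1, 2, 2, 3, 3, 4),
                 rank = c(1, 2, 1, 3, 3, 4))

# Number of distinct ranks per id, repeated for every row of that id.
n_ranks <- ave(df$rank, df$id, FUN = function(r) length(unique(r)))

# Number of rows per id, repeated in the same way.
n_rows <- ave(df$rank, df$id, FUN = length)

# The comparison already yields TRUE/FALSE, so no ifelse() is needed.
df$dg <- n_ranks > 1 | n_rows == 1
df
```

The same simplification applies to the dplyr versions: the condition inside ifelse() can be assigned to dg as it stands.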
