if_else with haven_labelled column fails because of wrong class - r

I have the following data:
dat <- structure(list(value = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
label = "value: This is my label",
labels = c(`No` = 0, `Yes` = 1),
class = "haven_labelled"),
group = structure(c(1, 2, 1, 1, 2, 3, 3, 1, 3, 1, 3, 3, 1, 2, 3, 2, 1, 3, 3, 1),
label = "my group",
labels = c(first = 1, second = 2, third = 3),
class = "haven_labelled")),
row.names = c(NA, -20L),
class = c("tbl_df", "tbl", "data.frame"),
label = "test.sav")
As you can see, the data uses a special class from the tidyverse's haven package, so-called labelled columns.
Now I want to recode my initial value variable such that:
if group equals 1, value should stay the same, otherwise it should be missing
I was trying the following, but getting an error:
dat_new <- dat %>%
  mutate(value = if_else(group != 1, NA, value))
# Error: `false` must be a logical vector, not a `haven_labelled` object
I got so far as to understand that dplyr's if_else requires true and false to be of the same class, and since there is no NA equivalent for the haven_labelled class (along the lines of NA_real_ for doubles), that is probably why the code fails, right?
So, how can I recode my initial variable and preserve the labels?
I know I could replace if_else above with base R's ifelse. However, that drops all the labels and coerces the value column to plain numeric.

You can try dplyr::case_when and only handle the group == 1 case; any row that matches no left-hand side gets NA:
dat %>% mutate(value = case_when(group == 1 ~ value))
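To double-check that the value labels survive the recode (my own addition, not part of the original answer), you can inspect the labels attribute of the result:
dat %>%
  mutate(value = case_when(group == 1 ~ value)) %>%
  pull(value) %>%
  attr("labels")
# should still show the No/Yes labels if the haven_labelled attributes were kept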

You can create an NA value in the haven_labelled class with this ugly code:
haven::labelled(NA_real_, labels = attr(dat$value, "labels"))
I'd recommend writing a function for that, e.g.
labelled_NA <- function(value)
  haven::labelled(NA_real_, labels = attr(value, "labels"))
and then the code for your mutate isn't quite so ugly:
dat_new <- dat %>%
  mutate(value = if_else(group != 1, labelled_NA(value), value))
Then you get
> dat_new[1:5,]
# A tibble: 5 x 2
      value     group
  <dbl+lbl> <dbl+lbl>
1 NA         1 [first]
2 NA         2 [second]
3  0 [No]    1 [first]
4  0 [No]    1 [first]
5 NA         2 [second]

Related

R Manipulating List of Lists With Conditions / Joining Data

I have the following data showing 5 possible kids to invite to a party and what neighborhoods they live in.
I have a list of solutions as well (binary indicators of whether each kid is invited or not; e.g., the first solution invites Kelly, Gina, and Patty).
data <- data.frame(c("Kelly", "Andrew", "Josh", "Gina", "Patty"), c(1, 1, 0, 1, 0), c(0, 1, 1, 1, 0))
names(data) <- c("Kid", "Neighborhood A", "Neighborhood B")
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1), c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))
I'm looking for a way to now filter the solutions in the following ways:
a) Only keep solutions where there are at least 3 kids from both neighborhood A and neighborhood B (a kid counts toward both neighborhoods if they belong to both)
b) Only keep solutions that have at least 3 kids selected (i.e., sum >= 3)
I think I need to somehow join data to the solutions in solutions, but I'm a bit lost on how to manipulate everything since the solutions are stuck in lists. Basically looking for a way to add entries to every solution in the list indicating a) how many kids the solution has, b) how many kids from neighborhood A, and c) how many kids from neighborhood B. From there I'd have to somehow filter the lists to only keep the solutions that satisfy >= 3?
Thank you in advance!
I wrote a little function to check each solution and return TRUE or FALSE based on your requirements. Passing your solutions to this using sapply() will give you a logical vector, with which you can subset solutions to retain only those that met the requirements.
check_solution <- function(solution, data) {
  data <- data[as.logical(solution), ]
  sum(data[["Neighborhood A"]]) >= 3 && sum(data[["Neighborhood B"]]) >= 3
}
### No need for the function to test whether `sum(solution) >= 3`, since
### that will *always* be true if either neighborhood sum is >= 3.
tests <- sapply(solutions, check_solution, data = data)
# FALSE FALSE FALSE FALSE FALSE
solutions[tests]
# list()
### none of the `solutions` provided actually meet criteria
Edit: OP asked in the comments how to test against all neighborhoods in the data, and return TRUE if a specified number of neighborhoods have enough kids. Below is a solution using dplyr.
library(dplyr)
data <- data.frame(
  c("Kelly", "Andrew", "Josh", "Gina", "Patty"),
  c(1, 1, 0, 1, 0),
  c(0, 1, 1, 1, 0),
  c(1, 1, 1, 0, 1),
  c(0, 1, 1, 1, 1)
)
names(data) <- c("Kid", "Neighborhood A", "Neighborhood B", "Neighborhood C",
                 "Neighborhood D")
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1),
                  c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))
check_solution <- function(solution,
                           data,
                           min_kids = 3,
                           min_neighborhoods = NULL) {
  neighborhood_tests <- data %>%
    filter(as.logical(solution)) %>%
    summarize(across(starts_with("Neighborhood"), ~ sum(.x) >= min_kids)) %>%
    as.logical()
  # require all neighborhoods by default
  if (is.null(min_neighborhoods)) min_neighborhoods <- length(neighborhood_tests)
  sum(neighborhood_tests) >= min_neighborhoods
}
tests1 <- sapply(solutions, check_solution, data = data)
solutions[tests1]
# list()
tests2 <- sapply(
solutions,
check_solution,
data = data,
min_kids = 2,
min_neighborhoods = 3
)
solutions[tests2]
# [[1]]
# [1] 1 0 0 1 1
#
# [[2]]
# [1] 0 1 0 1 1

match every row whose `region_ID=0` with the rows whose `region_ID=1` and calculate a certain distance

I have a dataset that looks like the following:
structure(list(X = c(36, 37, 38, 39, 40, 41, 1, 2, 3, 4, 5, 6),
               Y = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
               region_ID = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)),
          row.names = c(NA, -12L), class = c("data.table", "data.frame"))
I want to match every row whose region_ID=0 with the rows whose region_ID=1 and calculate
dist_to_r1=sqrt((X - i.X)^2 + (Y - i.Y)^2))
where i. prefix refers to the latter rows. I want to do this using data table syntax.
I have been trying to do this with left joins, but couldn't make it work.
You want a full join, such that each of the six rows in region 0 is joined to each of the six rows in region 1?
In that case, you can simply set allow.cartesian = T:
data[, id := 1][region_ID == 0][data[region_ID == 1], on = "id", allow.cartesian = TRUE][
  , dist_to_r1 := sqrt((X - i.X)^2 + (Y - i.Y)^2)][]
Edit: OP clarified that only the minimum distance to a point in region 0 is required. In this case, we can do something like this:
data[, id := 1]
region0 <- data[region_ID == 0]
# function that gets the minimum distance between two regions
get_min_dist <- function(region_a, region_b) {
  region_a[region_b, on = "id", allow.cartesian = TRUE][, min(sqrt((X - i.X)^2 + (Y - i.Y)^2))]
}
# apply the function above to every region
data[,
     .(min_dist_to_zero = get_min_dist(
       region_a = region0,
       region_b = data[region_ID == .BY]
     )),
     by = region_ID]
Output:
   region_ID min_dist_to_zero
1:         0                0
2:         1               30
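As a cross-check (a sketch of my own, not part of the original answer), the same per-region minimum can be computed without the dummy id join by taking all pairwise distances to region 0 directly:
region0 <- data[region_ID == 0]
data[, .(min_dist_to_zero = min(sqrt(outer(X, region0$X, "-")^2 +
                                     outer(Y, region0$Y, "-")^2))),
     by = region_ID]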

apply custom-made function to column pairs and create summary table

I have data with ratings on many parameters by two different raters; shown here is just a snippet of the ratings on three same-prefix parameter pairs (e.g. DH and DH_ptak):
df <- structure(list(DH = c(0, 1, NA, NA, 1, 1, 1, 1, 1, 1),
DH_ptak = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
SZ = c(1, 1, NA, NA, NA, 0, 1, 0, 1, 1),
SZ_ptak = c(1, 1, NA, NA, NA, 1, 0, NA, 1, 1),
RM = c(0, 1, 1, NA, NA, NA, 0, NA, 1, NA),
RM_ptak = c(0, 1, 1, 1, 1, NA, 0, 1, NA, 1)),
row.names = c(NA, 10L), class = "data.frame")
For each parameter I want to compare the two ratings columns. I use this function to find different ratings:
compare_fun <- function(c1, c2){
  case_when(is.na(c1) & is.na(c2) ~ 0,
            is.na(c1) | is.na(c2) ~ 1,
            c1 != c2 ~ 1,
            TRUE ~ 0)
}
I can use this function to sum the differences and compute an agreement percentage agree_pct:
library(dplyr)
df %>%
  mutate(diff = compare_fun(DH, DH_ptak)) %>%
  summarise(sum = sum(diff),
            agree_pct = (nrow(df) - sum) / nrow(df) * 100)
sum agree_pct
1 2 80
The problem is that I have multiple parameters. How can I compute for all ratings-column pairs the respective sum and agree_pct in one go, ideally, to obtain a table like this:
sum agree_pct
DH 2 80
SZ 3 70
RM 5 50
This is what I would do. It mostly involves pivoting the data a few times. First I turn the row names into a column so I can keep the rows straight, then I go from wide to long with pivot_longer. I separate the column names to distinguish the two raters and assign them the names "grp1" and "grp2". Then I pivot_wider so that there are two columns, one per rater. Lastly I apply your function to the paired columns, group by the parameter of interest, and summarise.
library(tidyverse)
df %>%
  rownames_to_column("col") %>%
  pivot_longer(-col) %>%
  separate(name, into = c("var", "tmp"), sep = "_") %>%
  mutate(grp = ifelse(is.na(tmp), "grp1", "grp2")) %>%
  select(col, var, value, grp) %>%
  pivot_wider(names_from = grp, values_from = value) %>%
  mutate(diff = compare_fun(grp1, grp2)) %>%
  group_by(var) %>%
  summarise(sum = sum(diff),
            agree_pct = (nrow(df) - sum) / nrow(df) * 100)
#> # A tibble: 3 x 3
#> var sum agree_pct
#> <chr> <dbl> <dbl>
#> 1 DH 2 80
#> 2 RM 5 50
#> 3 SZ 3 70
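An alternative that avoids the reshaping (my own sketch, assuming every pair follows the <prefix> / <prefix>_ptak naming convention from the question, and that the tidyverse is already loaded) is to loop over the prefixes with purrr and apply compare_fun to each column pair directly:
library(purrr)
library(tibble)

prefixes <- c("DH", "SZ", "RM")

map_dfr(prefixes, function(p) {
  n_diff <- sum(compare_fun(df[[p]], df[[paste0(p, "_ptak")]]))
  tibble(var = p,
         sum = n_diff,
         agree_pct = (nrow(df) - n_diff) / nrow(df) * 100)
})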

find start and end idx of a time series by group in a data table

I have data table that looks like this:
data <- data.table(time = c(0, 1, 2, 3, 4, 5, 6, 7),
anom = c(0, 0, 1, 1, 1, 0, 0, 0),
gier = c(0, 0, 4, 9, 7, 0, 0, 0))
Now I am calculating some statistical values of the column gier grouped by column anom like this:
cols <- c("gier")
statFun <- function(x) list(mean = mean(x), median = median(x), std = sd(x))
statSum <- data[, unlist(lapply(.SD, statFun), recursive = FALSE), .SDcols = cols, by = anom]
This is fine, but I want to go a step further and add the start and end points of time based on where each run of anom (0 or 1) starts and ends. So in the end I have something like a new time series, but only with the start and end time of each run. The result should look like this:
res <- data.table(x.start = c(0, 2, 5),
x.end = c(1, 4, 7),
anom = c(0, 1, 0),
gier.mean = c(0, 6.666, 0),
gier.median = c(0, 7, 0),
gier.std = c(0, 2.516, 0))
How is it possible to achieve this?
Addition: is there a way to achieve the result for multiple columns and not only for gier? For example, I am able to do the following, but I don't know how to extend it with the start and end columns mentioned above. At least this way there is an extra column rn with the names of the columns for which I calculate the statistical values.
res <- data[, setDT(do.call(rbind.data.frame, lapply(.SD, statFun)), keep.rownames = TRUE), .SDcols = cols, by = anom]
You can include additional calculations outside the lapply:
library(data.table)
data[, unlist(c(lapply(.SD, statFun),
                anom = first(anom), x.start = first(time), x.end = last(time)),
              recursive = FALSE),
     by = rleid(anom), .SDcols = cols]
#   rleid gier.mean gier.median gier.std anom x.start x.end
#1:     1  0.000000           0 0.000000    0       0     1
#2:     2  6.666667           7 2.516611    1       2     4
#3:     3  0.000000           0 0.000000    0       5     7
In dplyr we can do this similarly:
library(dplyr)
data %>%
  group_by(grp = rleid(anom)) %>%
  summarise(across(all_of(cols), list(mean = mean, median = median, std = sd)),
            x.start = first(time),
            x.end = last(time))
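To illustrate the multi-column case asked about in the addition (my own sketch, using a hypothetical second measurement column named extra), the same data.table call extends automatically because .SDcols controls which columns statFun is applied to:
data[, extra := c(1, 2, 5, 6, 4, 3, 2, 1)]   # hypothetical extra column
cols <- c("gier", "extra")

data[, unlist(c(lapply(.SD, statFun),
                anom = first(anom), x.start = first(time), x.end = last(time)),
              recursive = FALSE),
     by = rleid(anom), .SDcols = cols]
# yields gier.mean/gier.median/gier.std and extra.mean/extra.median/extra.std per run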

Is there a way to count occurrences of a specific value for unique columns in a dataframe in R?

I am relatively new to R and have a dataframe (cn_data2) with several duplicated columns. It looks something like this:
Gene     breast_cancer  breast_cancer  breast_cancer  lung_cancer  lung_cancer
myc                  1              0              1            1            2
ARID1A               0              2              1            1            0
Essentially, the rows are genes and the columns are different types of cancers. What I want is to find, for each gene, the number of times a value (0, 1, or 2) occurs for each unique cancer type.
I have tried several things but haven't been able to achieve what I want. For example, cn_data2$count1 <- rowSums(cn_data2 == '1') gives me a column with the number of "1"s for each gene, but what I want is the number of "1"s for each individual disease.
Hope my question is clear! I appreciate any help, thank you!
structure(list(gene1 = structure(1:6, .Label = c("ACAP3", "ACTRT2", "AGRN",
                                                 "ANKRD65", "ATAD3A", "ATAD3B"),
                                 class = "factor"),
               glioblastoma_multiforme_Primary_Tumor = c(0, 0, 0, 0, 0, 0),
               glioblastoma_multiforme_Primary_Tumor.1 = c(-1, -1, -1, -1, -1, -1),
               glioblastoma_multiforme_Primary_Tumor.2 = c(0, 0, 0, 0, 0, 0),
               glioblastoma_multiforme_Primary_Tumor.3 = c(2, 2, 2, 2, 2, 2),
               glioblastoma_multiforme_Primary_Tumor.4 = c(0, 0, 0, 0, 0, 0)),
          class = "data.frame", row.names = c(NA, 6L))
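One possible approach (my own sketch, not from the original post, assuming the dput above is assigned to cn_data2): reshape to long format, strip the numeric suffixes that mark duplicated columns, and count values per gene and cancer type:
library(dplyr)
library(tidyr)

cn_data2 %>%
  pivot_longer(-gene1, names_to = "cancer", values_to = "value") %>%
  mutate(cancer = sub("\\.\\d+$", "", cancer)) %>%   # drop the .1, .2, ... suffixes
  count(gene1, cancer, value)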
