In R, what's the most efficient way to check if an object meets a criteria, and if it doesn't, modify it? - r

I have many cuts of my data that I eventually join together into one large dataset. However, sometimes the object is an error message because it didn't have enough sample size, causing the code to fail.
Before I do my full_joins, I want a simple way to say "If the length of any of these objects is 1, then make that object--and only those objects--have this set structure with just NAs". Is there a simple way to do that other than an if statement for each object? Or, alternatively, is there a way for R to 'skip' over the problematic rows if there's an error message (without relying on any specific characters)? I've used try(), but that doesn't always work and sometimes stops continuing to other joins.
#Here's an example of my data
library(dplyr)
object_1 <- tibble(name = c("Justin", "Corey"), month = c("Jan", "Jan"), score = c(1, 2))
object_2 <- tibble(name = c("Justin", "Corey"), month = c("Feb", "Feb"), score = c(100, 200))
object_3 <- "error message!"
object_4 <- tibble(name = c("Justin", "Corey"), month = c("Apr", "Apr"), score = c(95, 23))
object_5 <- "Another error!!"
#Here's me trying to join them, but it isn't working because of the errors
all_the_objects <- object_1 %>%
full_join(object_2) %>%
full_join(object_3) %>%
full_join(object_4) %>%
full_join(object_5)
#Here's a solution that works, but doesn't seem very elegant:
if(length(object_1) == 1) {
object_1 <- tibble(name = NA, month = NA, score = NA_real_)
} else if(length(object_2) == 1) {
object_2 <- tibble(name = NA, month = NA, score = NA_real_)
} else if(length(object_3) == 1) {
object_3 <- tibble(name = NA, month = NA, score = NA_real_)
} else if(length(object_4) == 1) {
object_4 <- tibble(name = NA, month = NA, score = NA_real_)
} else if(length(object_5) == 1) {
object_5 <- tibble(name = NA, month = NA, score = NA_real_)
}
#Now it'll work
all_the_objects <- object_1 %>%
full_join(object_2) %>%
full_join(object_3) %>%
full_join(object_4) %>%
full_join(object_5)

We may place the objects in a list and do the check at once and then join with reduce
library(dplyr)
library(purrr)
map(mget(ls(pattern = '^object_\\d+$')),
~ if(is.vector(.x)) tibble(name = NA_character_, month = NA_character_,
score = NA_real_) else .x) %>%
reduce(full_join)
-output
# A tibble: 7 × 3
name month score
<chr> <chr> <dbl>
1 Justin Jan 1
2 Corey Jan 2
3 Justin Feb 100
4 Corey Feb 200
5 <NA> <NA> NA
6 Justin Apr 95
7 Corey Apr 23

Related

Stack data (maybe pivot_longer) but complicated, R

I have data like this:
df<-structure(list(record_id = c(1, 2, 4), alcohol = c(1, 2, 1),
ethnicity = c(1, 1, 1), bilateral_vs_unilateral = c(1, 2,
2), fat_grafting = c(1, 1, 0), number_of_adm_sheets_used = c(1,
NA, NA), number_of_adm_sheets_used_2 = c(1, 1, 1), number_of_fills = c(7,
NA, NA), number_of_fills_2 = c(7, NA, 2), total_fill_volume_ml_left = c(240,
NA, NA), total_volume_ml = c(240, 300, 550), implant_size_l = c(NA_real_,
NA_real_, NA_real_), implant_size_l_2 = c(NA_real_, NA_real_,
NA_real_)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
It is info about patients with each row representing a patient that underwent breast surgery.
I'd like to change it into each row representing a particular breast (of the two). There are several variables, everything from 'number_of_adm_sheets_used' to 'implant_size_l_2' that have a column for each side. I'd like to change those to represent either. An example is 'number_of_adm_sheets_used' stood for on the left side, and 'number_of_adm_sheets_used_2' was on the right side. I'd like to combine them to become one column of sheets used that was for either side.
My expected output would look like:
Pre-
Post-
I figure its some variant of pivot_longer but I'm having trouble with a few aspects:
the real data has 68 columns
I only need a duplicate row if the column "bilateral_vs_unilateral" is a "1" (meaning bilateral)
The way I've used pivot_longer before, you'd say "cols" and pick a big range, I'm not sure how to stack pairs of columns, if that makes sense.
Luckily, despite having 68 other columns, all of the "trouble" columns are shown below. Pairing 'number_of_adm_sheets_used' with 'number_of_adm_sheets_used_2'
'number_of_fills' with 'number_of_fills_2'
'total_fill_volume_ml_left' with 'total_volume_ml'
and 'implant_size_1' with 'implant_size_1_2'
Thank you
Here is one possibility, if I'm understanding the issue correctly.
# Make long format
df.long <- df %>%
pivot_longer(cols = -record_id) %>%
mutate(subject = ifelse(str_sub(name, -2, -1) == "_2", "breast 2", NA),
name = str_remove(name, "_2")) %>%
group_by(record_id, name) %>%
mutate(subject = case_when(
subject == "breast 2" ~ subject,
n() == 2 ~ "breast 1",
n() == 1 ~ "patient"
)) %>%
ungroup()
# statistics regarding the patient
patient <- df.long %>%
filter(subject == "patient") %>%
pivot_wider(names_from = name, values_from = value) %>%
select(-subject)
# statistics regarding each breast
breasts <- df.long %>%
filter(str_detect(subject, "breast")) %>%
pivot_wider(names_from = name, values_from = value)
# merge the two data.frames
patient %>%
inner_join(breasts) %>%
select(record_id, subject, everything())
If you rename your "trouble columns" to a consistent pattern, then you can use pivot_longer()'s names_pattern argument and ".value" sentinel to pull pairs of values into rows. In my example code, I suffixed these with "_l" or "_r" for left- and right-sided variants. We can use the values_drop_na argument to keep only the valid rows for unilateral cases.
I also changed alcohol to a factor, just to demonstrate that it doesn't throw the error you noted in the bounty.
library(tidyverse)
df_long <- df %>%
mutate(alcohol = factor(alcohol)) %>%
rename(
number_of_adm_sheets_used_l = number_of_adm_sheets_used,
number_of_adm_sheets_used_r = number_of_adm_sheets_used_2,
number_of_fills_l = number_of_fills,
number_of_fills_r = number_of_fills_2,
total_fill_volume_ml_l = total_fill_volume_ml_left,
total_fill_volume_ml_r = total_volume_ml,
implant_size_l = implant_size_l,
implant_size_r = implant_size_l_2
) %>%
pivot_longer(
cols = ends_with(c("_l", "_r")),
names_to = c(".value", "side"),
names_pattern = "(.+)_(l|r)",
values_drop_na = TRUE
)
Output:
### move pivoted columns up front for illustration purposes
df_long %>%
relocate(record_id, side, number_of_adm_sheets_used:implant_size)
# A tibble: 4 x 10
record_id side number_of_adm_sheets_used number_of_fills total_fill_volume~
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 l 1 7 240
2 1 r 1 7 240
3 2 r 1 NA 300
4 4 r 1 2 550
# ... with 5 more variables: implant_size <dbl>, alcohol <fct>,
# ethnicity <dbl>, bilateral_vs_unilateral <dbl>, fat_grafting <dbl>

How to change the row names in R data.frame?

I would like to rename station in DF to something like DA056 to Happy and AB786 to Sad.
library(tidyverse)
DF1 <- data.frame(Station = rep("DA056",3), Level = 100:102)
DF2 <- data.frame(Station = rep("AB786",3), Level = 201:203)
DF <- bind_rows(DF1,DF2)
We can use factor with labels specified for corresponding levels
library(dplyr)
DF <- DF %>%
mutate(Station = factor(Station, levels = c("DA056", "AB786"),
labels = c("Happy", "Sad")))
DF$Station
#[1] Happy Happy Happy Sad Sad Sad
#Levels: Happy Sad
Or with recode
DF %>%
mutate(Station = recode(Station, DA056 = 'Happy', AB786 = 'Sad'))
# Station Level
#1 Happy 100
#2 Happy 101
#3 Happy 102
#4 Sad 201
#5 Sad 202
#6 Sad 203
If there are many values to be changed, a better option is a join after creating a key/val dataset
keyval <- data.frame(Station = c("DA056", "AB786"),
val = c("Happy", "Sad"), stringsAsFactors = FALSE)
DF %>%
left_join(keyval) %>%
mutate(Station = coalesce(val, Station))
Or with base R
DF$Station <- with(df, factor(Station, levels = c("DA056", "AB786"),
labels = c("Happy", "Sad")))
An option is to use dplyr::case_when:
library(dplyr)
DF1 <- data.frame(Station = rep("DA056",3), Level = 100:102, stringsAsFactors = F)
DF2 <- data.frame(Station = rep("AB786",3), Level = 201:203, stringsAsFactors = F)
DF <- bind_rows(DF1,DF2)
DF <- DF %>% mutate(Station = case_when( Station == "DA056" ~ "Happy",
Station == "AB786" ~ "Sad",
TRUE ~ Station))
Output
> DF
Station Level
1 Happy 100
2 Happy 101
3 Happy 102
4 Sad 201
5 Sad 202
6 Sad 203
You can do it using case_when:
DF %>%
mutate(Station = case_when(Station == "DA056" ~ "Happy", Station =="AB786" ~ "Sad"))
Another simple solution
DF$Station = ifelse(DF$Station == "DA056", "Happy", "Sad")

R - programmatically detect NA columns and return string

I have this vector of eligible columns for my script
cols <- c("country", "phone", "car")
And this dataframe
test <-
data.frame(
id = c(1, 2, 3),
country = c("us", NA, "uk"),
phone = c(1, 1, NA),
car = c(NA, 0, 1)
)
The goal is to create a new column with the result, where the condition will be based only on columns present in cols variable. In case that all values for id are NA, then res should be string nothing, if some of them are not NA, then I need to this colnames, in case that all columns are not NA then result should be string all.
result <-
data.frame(
id = c(1, 2, 3),
country = c("us", NA, NA),
phone = c(1, 1, NA),
car = c(NA, NA, NA),
res = c("country, phone", "phone", "nothing")
)
I can do it only via case_when() function
mutate(
res = case_when(
!is.na(country) & is.na(phone) & is.na(car) ~ "country",
T ~ "?"
)
You can do this in base R (rather than dplyr) using the code:
result$res <- apply(result[,cols],1, function(x){paste(cols[!is.na(x)], collapse=", ")})
result$res[results$res==""] <- "nothing"
The data which you have shared is different (test and result). So we will start with result by removing the res column.
library(dplyr)
result$res <- NULL
result %>%
mutate_all(as.character) %>%
tidyr::pivot_longer(cols = cols) %>%
group_by(id) %>%
summarise(res = toString(name[!is.na(value)])) %>%
type.convert() %>%
left_join(res, by = 'id') %>%
mutate(res = case_when(res == '' ~ 'nothing',
stringr::str_count(result, ',') ==
(length(cols) - 1) ~ 'all',
TRUE ~ as.character(result)))
# A tibble: 3 x 5
# id res country phone car
# <dbl> <chr> <fct> <dbl> <lgl>
#1 1 country, phone us 1 NA
#2 2 phone NA 1 NA
#3 3 nothing NA NA NA
We get the data in long format, get the column names which have non-NA value for each ID. We then change the res column to "all" or "nothing" if there are all or 0 matches respectively.

Rename columns of dataframe by days in R

I need to rename a dataframe by days in analysis.
names(dados) <- c("name", "day_1","Freq_1","Percent_1","day_2","Freq_2","Percent_2",
"day_3","Freq_3","Percent_3","day_4","Freq_4","Percent_4",
"day_5","Freq_5","Percent_5","day_6","Freq_6","Percent_6",
"day_7","Freq_7","Percent_7","day_8","Freq_8","Percent_8",
"day_9","Freq_9","Percent_9")
I'm doing an analysis that the data I get is in a list of dataframes, where each dataframe represents a day of analysis. I combine the dataframes and I have the columns 'name' unique and 'day_X', 'Freq_X' and 'Percent_X' for each dataframe as a return.
As return I need the columns to have the following names:
"name", "day_1","Freq_1","Percent_1","day_2","Freq_2","Percent_2","day_3","Freq_3","Percent_3"
How do I go about analyzing 50 days?
reproducible example:
day1 <- data.frame(name = c("jose", "mary", "julia"), freq = c(1,5,3), percent = c(40,30,20))
day2 <- data.frame(name = c("abner", "jose", "mary"), freq = c(3,5,4), percent = c(20,30,20))
day3 <- data.frame(name = c("abner", "jose", "mike"), freq = c(6,2,3), percent = c(40,30,70))
day4 <- data.frame(name = c("andre", "joseph", "ana"), freq = c(1,5,8), percent = c(40,30,20))
day5 <- data.frame(name = c("abner", "poli", "joseph"), freq = c(4,3,3), percent = c(10,30,10))
dates <- list(day1,day2,day4,day5)
data <- Reduce(function(x, y) merge(x, y, by = "name", all = TRUE), dates)
Here's a way to get what you want using the tidyverse suite of packages. We start by putting the data in the "long" format - but add a column with the date:
long_form <- dates %>%
imap_dfr(function(x, y) dplyr::mutate(x, day_num = y))
Now, to get the wide format you are after, we need to reformat things a bit, as done in the following code. I'm not sure what is supposed to go in the day_# variables, as #useR mentioned in the comments, so it's missing. If you have a variable called day, the code should automatically do the right thing as written.
wide_form <- long_form %>%
gather(key, value, -name,-day_num) %>%
dplyr::mutate(
key = paste(key, day_num, sep = "_")
) %>%
select(-day_num) %>%
spread(key, value)
One can use dplyr::bind_rows to merge all data frames form the list to a data frame. Please provide name to list so that day1, day2 etc can set beforehand. Finally, gather and spread is used to transform the data.
names(dates) <- paste("day", seq_along(dates), sep = "")
library(tidyverse)
bind_rows(dates,.id = "Name") %>%
group_by(Name) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
gather(Key, value, -Name,-rn) %>%
unite("Key", c("Key", "Name")) %>%
spread(Key, value) %>%
select(-rn)
Result:
# # A tibble: 3 x 12
# freq_day1 freq_day2 freq_day3 freq_day4 name_day1 name_day2 name_day3 name_day4 percent_day1 percent_day2 percent~ percent~
# * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 3 1 4 jose abner andre abner 40 20 40 10
# 2 5 5 5 3 mary jose joseph poli 30 30 30 30
# 3 3 4 8 3 julia mary ana joseph 20 20 20 10
#
Data:
Data is slightly modified from OP. I have included stringsAsFactors = FALSE argument as part of data.frame to avoid a mutate_at call to convert factor to character.
day1 <- data.frame(name = c("jose", "mary", "julia"), freq = c(1,5,3), percent = c(40,30,20), stringsAsFactors = FALSE)
day2 <- data.frame(name = c("abner", "jose", "mary"), freq = c(3,5,4), percent = c(20,30,20), stringsAsFactors = FALSE)
day3 <- data.frame(name = c("abner", "jose", "mike"), freq = c(6,2,3), percent = c(40,30,70), stringsAsFactors = FALSE)
day4 <- data.frame(name = c("andre", "joseph", "ana"), freq = c(1,5,8), percent = c(40,30,20), stringsAsFactors = FALSE)
day5 <- data.frame(name = c("abner", "poli", "joseph"), freq = c(4,3,3), percent = c(10,30,10), stringsAsFactors = FALSE)
dates <- list(day1,day2,day4,day5)

How to I cast data frame with more than 3 columns in R?

Importing from an Access database, I have data that look similar to this:
p <- data.frame(SurvDate = as.Date(c('2018-11-1','2018-11-1','2018-11-1',
'2018-11-3', '2018-11-3')),
Area = c('AF','BB','CT', 'DF', 'BB'),
pCount = c(6, 3, 0, 12, 32),
ObsTime = c('8:51','8:59','9:13', '9:24', '9:30'),
stringsAsFactors = FALSE)
I want to cast my data with Rows as SurvDate and columns to be Areas (values as pCount) and ObsTime columns next to each Area with value ObsTime.
Example:
n <- data.frame(SurvDate = as.Date(c('2018-11-1','2018-11-3')),
AF = c(6, NA),
TimeAF = c('8:51', NA),
BB = c(3, 32),
TimeBB = c('8:59', '9:30'),
CT = c(0, NA),
TimeCT = c(NA, '9:13'),
DF = c(NA,12),
TimeDF = c(NA, '9:24'))
I've tried variations on this theme, but can't get time to work.
library(reshape2)
dcast(p, SurvDate+ObsTime ~ Area)
Here is one way using tidyverse tools. Note that the output is not the same as your expected output, because it seems like you didn't put the values for CT in the right place (values spread across two dates). Approach is to unite the values so we have a single key-value pair to spread, and then separate out the columns again with mutate_at. We could also have used separate multiple times, though this would become unwieldy with too many Areas.
SurvDate <- as.Date(c('2018-11-1','2018-11-1','2018-11-1', '2018-11-3', '2018-11-3'))
Area <- c('AF','BB','CT', 'DF', 'BB')
People <- c(6, 3, 0, 12, 32)
ObsTime <- (c('8:51','8:59','9:13', '9:24', '9:30'))
p <- data.frame(SurvDate, Area, People, ObsTime, stringsAsFactors = FALSE)
library(tidyverse)
p %>%
unite(vals, People, ObsTime) %>%
spread(Area, vals) %>%
mutate_at(
.vars = vars(-SurvDate),
.funs = funs(
Time = str_extract(., "(?<=_).*$"),
Area = str_extract(., "^.*(?=_)")
)
) %>%
filter(!is.na(SurvDate)) %>%
select(SurvDate, matches("_")) %>%
select(SurvDate, order(colnames(.)))
#> SurvDate AF_Area AF_Time BB_Area BB_Time CT_Area CT_Time DF_Area
#> 1 2018-11-01 6 8:51 3 8:59 0 9:13 <NA>
#> 2 2018-11-03 <NA> <NA> 32 9:30 <NA> <NA> 12
#> DF_Time
#> 1 <NA>
#> 2 9:24
Created on 2018-04-30 by the reprex package (v0.2.0).

Resources