Stack data (maybe pivot_longer) but complicated, R - r

I have data like this:
df<-structure(list(record_id = c(1, 2, 4), alcohol = c(1, 2, 1),
ethnicity = c(1, 1, 1), bilateral_vs_unilateral = c(1, 2,
2), fat_grafting = c(1, 1, 0), number_of_adm_sheets_used = c(1,
NA, NA), number_of_adm_sheets_used_2 = c(1, 1, 1), number_of_fills = c(7,
NA, NA), number_of_fills_2 = c(7, NA, 2), total_fill_volume_ml_left = c(240,
NA, NA), total_volume_ml = c(240, 300, 550), implant_size_l = c(NA_real_,
NA_real_, NA_real_), implant_size_l_2 = c(NA_real_, NA_real_,
NA_real_)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
It is info about patients with each row representing a patient that underwent breast surgery.
I'd like to reshape it so that each row represents a particular breast (of the two). Several variables, everything from 'number_of_adm_sheets_used' to 'implant_size_l_2', have one column for each side. I'd like to combine each pair into a single column that applies to whichever side the row represents. For example, 'number_of_adm_sheets_used' refers to the left side and 'number_of_adm_sheets_used_2' to the right side. I'd like to combine them into one column of sheets used, for either side.
My expected output would look like:
(The "Pre-" and "Post-" images of the expected output are not available.)
I figure it's some variant of pivot_longer, but I'm having trouble with a few aspects:
the real data has 68 columns
I only need a duplicate row if the column "bilateral_vs_unilateral" is a "1" (meaning bilateral)
The way I've used pivot_longer before, you'd say "cols" and pick a big range, I'm not sure how to stack pairs of columns, if that makes sense.
Luckily, despite there being 68 other columns, all of the "trouble" columns are listed below. The pairs are:
'number_of_adm_sheets_used' with 'number_of_adm_sheets_used_2'
'number_of_fills' with 'number_of_fills_2'
'total_fill_volume_ml_left' with 'total_volume_ml'
'implant_size_l' with 'implant_size_l_2'
Thank you

Here is one possibility, if I'm understanding the issue correctly.
library(tidyverse)
# Make long format
df.long <- df %>%
  pivot_longer(cols = -record_id) %>%
  # flag the "_2" columns as the second breast, then drop the suffix
  mutate(subject = ifelse(str_sub(name, -2, -1) == "_2", "breast 2", NA),
         name = str_remove(name, "_2$")) %>%
  group_by(record_id, name) %>%
  # a name that occurs twice per patient is a paired (per-breast) variable
  mutate(subject = case_when(
    subject == "breast 2" ~ subject,
    n() == 2 ~ "breast 1",
    n() == 1 ~ "patient"
  )) %>%
  ungroup()
# statistics regarding the patient
patient <- df.long %>%
  filter(subject == "patient") %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-subject)
# statistics regarding each breast
breasts <- df.long %>%
  filter(str_detect(subject, "breast")) %>%
  pivot_wider(names_from = name, values_from = value)
# merge the two data.frames
patient %>%
  inner_join(breasts, by = "record_id") %>%
  select(record_id, subject, everything())

If you rename your "trouble columns" to a consistent pattern, then you can use pivot_longer()'s names_pattern argument and ".value" sentinel to pull pairs of values into rows. In my example code, I suffixed these with "_l" or "_r" for left- and right-sided variants. We can use the values_drop_na argument to keep only the valid rows for unilateral cases.
I also changed alcohol to a factor, just to demonstrate that it doesn't throw the error you noted in the bounty.
library(tidyverse)
df_long <- df %>%
mutate(alcohol = factor(alcohol)) %>%
rename(
number_of_adm_sheets_used_l = number_of_adm_sheets_used,
number_of_adm_sheets_used_r = number_of_adm_sheets_used_2,
number_of_fills_l = number_of_fills,
number_of_fills_r = number_of_fills_2,
total_fill_volume_ml_l = total_fill_volume_ml_left,
total_fill_volume_ml_r = total_volume_ml,
implant_size_l = implant_size_l,
implant_size_r = implant_size_l_2
) %>%
pivot_longer(
cols = ends_with(c("_l", "_r")),
names_to = c(".value", "side"),
names_pattern = "(.+)_(l|r)",
values_drop_na = TRUE
)
Output:
### move pivoted columns up front for illustration purposes
df_long %>%
relocate(record_id, side, number_of_adm_sheets_used:implant_size)
# A tibble: 4 x 10
record_id side number_of_adm_sheets_used number_of_fills total_fill_volume~
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 l 1 7 240
2 1 r 1 7 240
3 2 r 1 NA 300
4 4 r 1 2 550
# ... with 5 more variables: implant_size <dbl>, alcohol <fct>,
# ethnicity <dbl>, bilateral_vs_unilateral <dbl>, fat_grafting <dbl>
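With 68 columns, the renaming step could also be done programmatically rather than by hand. The following is only a sketch on hypothetical toy data: it assumes every right-side column ends in "_2" and its left-side partner has the same name without the suffix. A pair such as 'total_fill_volume_ml_left' / 'total_volume_ml' does not follow that convention and would still need a manual rename first.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): columns follow the "<name>" / "<name>_2" convention
toy <- tibble(
  record_id         = c(1, 2),
  number_of_fills   = c(7, NA),
  number_of_fills_2 = c(7, 2),
  implant_size      = c(300, NA),
  implant_size_2    = c(320, 250)
)

right_cols <- grep("_2$", names(toy), value = TRUE)  # right-side columns
left_cols  <- sub("_2$", "", right_cols)             # their left-side partners

toy_long <- toy %>%
  rename_with(~ paste0(.x, "_l"), all_of(left_cols)) %>%     # add "_l" to left side
  rename_with(~ sub("_2$", "_r", .x), all_of(right_cols)) %>% # "_2" becomes "_r"
  pivot_longer(
    cols = matches("_(l|r)$"),
    names_to = c(".value", "side"),
    names_pattern = "(.+)_(l|r)$",
    values_drop_na = TRUE   # drops a side whose paired values are all NA
  )

toy_long
```

Here record 2 keeps only its right-side row, because all of its left-side values are NA.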

Related

Creating new column based on cluster in R

Dear Stack Overflow users,
I am struggling with R. I have not used it often; I mostly use Stata instead.
My data set has several clusters.
What I want to do is make new cluster columns holding the values,
so that each cluster becomes a column and each column holds that cluster's values.
Many thanks in advance
Using dummy data, and first padding the clusters with NA so that each has an equal number of values:
library(tidyverse)
df <- data.frame(
value = rnorm(5),
cluster = c(1:4, 4)
)
n = max(table(df$cluster))
for (i in unique(df$cluster)) {
m = n - nrow(df[df$cluster == i,])
if (m > 0){
df <- rbind(df, setNames(as.data.frame(matrix(rep(c(NA, i),m), ncol = 2, byrow = T)), names(df)))
}
}
df %>%
group_by(cluster) %>%
mutate(n = 1:n()) %>%
pivot_wider(names_from = cluster, values_from = value) %>%
select(-n)
`1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl>
1 -0.0549 0.250 0.618 -0.173
2 NA NA NA -2.22
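As far as I can tell, the padding loop is not strictly required, because pivot_wider() fills missing cells with NA on its own. A minimal sketch under that assumption, using the same kind of dummy data (the helper column name `row` is made up for illustration):

```r
library(dplyr)
library(tidyr)

set.seed(1)
df <- data.frame(
  value   = rnorm(5),
  cluster = c(1:4, 4)
)

wide <- df %>%
  group_by(cluster) %>%
  mutate(row = row_number()) %>%   # position of each value within its cluster
  ungroup() %>%
  pivot_wider(names_from = cluster, values_from = value) %>%
  select(-row)

wide
```

Clusters with fewer values than the largest one simply end up with NA in the extra rows.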

How to sum every numeric column that start with the same name except the 2 last characters, in R?

I have a dataframe which contains >100 columns, some are numeric, some not.
All variables ending with "_f" or "_m" are numeric variables and I would like to sum all the pairs that start with the same pattern but end with "_f" or "_m".
Here is an example of variable names in my dataframe:
xxxxxxxxxxxxx_age1_f
xxxxxxxxxxxxx_age1_m
xxxxxxxxxxxxx_age2_f
xxxxxxxxxxxxx_age2_m
xxxxxxxxxxxxx_age3_f
xxxxxxxxxxxxx_age3_m
yyyyyyyyyy_age1_f
yyyyyyyyyy_age1_m
yyyyyyyyyy_age2_f
yyyyyyyyyy_age2_m
yyyyyyyyyy_age3_f
yyyyyyyyyy_age3_m
yyyyyyyyyy_age4_f
yyyyyyyyyy_age4_m
yyyyyyyyyy_age5_f
yyyyyyyyyy_age5_m
zzzzzzzzzzzzzzzzzzzz_age1_f
zzzzzzzzzzzzzzzzzzzz_age1_m
zzzzzzzzzzzzzzzzzzzz_age2_f
zzzzzzzzzzzzzzzzzzzz_age2_m
zzzzzzzzzzzzzzzzzzzz_age3_f
zzzzzzzzzzzzzzzzzzzz_age3_m
text_var_11
text_var_222
text_var_33333
(I'm abstracting the names here with x, y z to make my question clearer, they are not really named like that)
My first solution would be to sum each pair using dplyr::mutate() like this:
mutate( ... ) %>%
mutate( yyyyyyyyyy_age2 = yyyyyyyyyy_age2_f + yyyyyyyyyy_age2_m) %>%
mutate( yyyyyyyyyy_age3 = yyyyyyyyyy_age3_f + yyyyyyyyyy_age3_m) %>%
mutate( ... ) %>%
This will work, but there must be more intelligent way to do this without repeating this for all variable pairs.
After looking for a solution, the closest I found was this
Sum all columns whose names start with a pattern, by group
However the proposed solution doesn't work in my case for 2 reasons:
the substr() is not applicable to my problem since the lengths of the variable names change
this method assumes I only have variables I want to sum, while in my case I have many other variables that don't end with "_f" or "_m" and don't need to be summed (and cannot be summed since some are text)
I suppose the solution could be modified to apply to my case but I'm not sure how.
EDIT: here is sample data created with dput
structure(list(Groups = c("xx", "xx", "xx"), xxxxx_age0_f = c(8,
0, 7), xxxxx_age0_m = c(5, 0, 0), xxxxx_age1_f = c(1,
0, 0), xxxxx_age1_m = c(3, 2, 0), xxxxx_age2_f = c(0,
0, 2), xxxxx_age2_m = c(0, 1, 0), zzzz_age0_f = c(4,
2, NA), zzzz_age0_m = c(3, 6, NA), zzzz_age1_f = c(0,
0, NA), zzzz_age1_m = c(2, 0, NA), zzzz_age2_f = c(4,
1, NA), zzzz_age2_m = c(3, 1, NA)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
ps: this simplified example has only one categorical variable, while I have hundreds.
Updated, with OP's actual example data:
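library(tidyverse)
# Sum each _f/_m pair within each Groups value, then join the totals back on
inner_join(
  dat,
  dat %>%
    select(Groups, ends_with(c("_f", "_m"))) %>%
    pivot_longer(cols = !Groups) %>%
    mutate(name = gsub("_.$", replacement = "", name)) %>%  # drop the _f/_m suffix
    group_by(Groups, name) %>%
    summarize(value = sum(value, na.rm = TRUE), .groups = "drop") %>%
    pivot_wider(id_cols = Groups, names_from = name, values_from = value),
  by = "Groups"
)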
Previous example, prior to OP's actual example.
If your table looks like this:
dat
# A tibble: 2 x 9
zzzzzzzzzzzzzzzzzzzz_age1_f zzzzzzzzzzzzzzzzzzzz_age1_m zzzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ text_var_11 text_var_222 text_var_33333
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 -0.709 1.26 1.03 1.36 -0.140 -0.595 f o x
2 -0.202 0.164 -1.28 -1.48 -0.380 0.874 a p m
Then, you can do this:
dat %>%
select(ends_with(c("_f", "_m"))) %>%
pivot_longer(cols=everything()) %>%
mutate(name = gsub("_.$",replacement = "",name)) %>%
group_by(name) %>%
summarize(value=sum(value,na.rm=T))
To get this:
name value
<chr> <dbl>
1 zzzzzzzzzzzzzzzzzzzz_age1 0.510
2 zzzzzzzzzzzzzzzzzzzz_age2 -0.371
3 zzzzzzzzzzzzzzzzzzzz_age3 -0.240
Now, I don't know what your desired output structure is, but your mutate attempt above suggests you want to column bind these new aggregate columns. This could be done easily by taking that interim result above, and wrapping it in bind_cols, like this
bind_cols(
  dat,
  dat %>%
    select(ends_with(c("_f", "_m"))) %>%
    pivot_longer(cols = everything()) %>%
    mutate(name = gsub("_.$", replacement = "", name)) %>%
    group_by(name) %>%
    summarize(value = sum(value, na.rm = TRUE)) %>%
    pivot_wider(names_from = name, values_from = value)  # one row, one column per total
)
This does not solve all of your problem but maybe you could try something along the lines of
library(dplyr)
# For each age label, sum (within each row) every column whose name contains it
for (i in c("age1", "age2", "age3")) {
  df <- df %>%
    rowwise() %>%  # will apply your functions to rows instead of columns
    mutate("{i}" := sum(c_across(contains(i)))) %>%
    ungroup()
}
df
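If the aim is one row-wise sum per "_f"/"_m" pair, as in the repeated mutate() calls from the question, the pairs could be discovered from the column names. This is only a sketch in base R on made-up data; it assumes every "_f" column has a matching "_m" column, and the names `stems` and `pair_sums` are illustrative.

```r
# Toy data (hypothetical), including a text column that must be left alone
dat <- data.frame(
  Groups       = c("xx", "xx", "xx"),
  xxxxx_age0_f = c(8, 0, 7),
  xxxxx_age0_m = c(5, 0, 0),
  zzzz_age0_f  = c(4, 2, NA),
  zzzz_age0_m  = c(3, 6, NA),
  text_var     = c("a", "b", "c")
)

# Names shared by each pair, e.g. "xxxxx_age0"
stems <- unique(sub("_[fm]$", "", grep("_[fm]$", names(dat), value = TRUE)))

# One new column per pair: element-wise sum of the _f and _m columns
pair_sums <- lapply(
  setNames(stems, stems),
  function(s) dat[[paste0(s, "_f")]] + dat[[paste0(s, "_m")]]
)

out <- cbind(dat, as.data.frame(pair_sums))
out
```

Because the stems are derived with a regular expression rather than substr(), the differing lengths of the variable names do not matter, and columns that do not end in "_f" or "_m" are never touched.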

R - programmatically detect NA columns and return string

I have this vector of eligible columns for my script
cols <- c("country", "phone", "car")
And this dataframe
test <-
data.frame(
id = c(1, 2, 3),
country = c("us", NA, "uk"),
phone = c(1, 1, NA),
car = c(NA, 0, 1)
)
The goal is to create a new column, res, whose value depends only on the columns listed in cols. If all of those columns are NA for an id, res should be the string "nothing"; if only some of them are non-NA, res should list the names of those columns; and if none of them are NA, res should be the string "all".
result <-
data.frame(
id = c(1, 2, 3),
country = c("us", NA, NA),
phone = c(1, 1, NA),
car = c(NA, NA, NA),
res = c("country, phone", "phone", "nothing")
)
I can only do it with the case_when() function, one combination at a time:
test %>%
  mutate(
    res = case_when(
      !is.na(country) & is.na(phone) & is.na(car) ~ "country",
      TRUE ~ "?"
    )
  )
You can do this in base R (rather than dplyr) using the code:
result$res <- apply(result[, cols], 1, function(x) paste(cols[!is.na(x)], collapse = ", "))
result$res[result$res == ""] <- "nothing"
The data which you have shared is different (test and result). So we will start with result by removing the res column.
library(dplyr)
result$res <- NULL
result %>%
  mutate_all(as.character) %>%
  tidyr::pivot_longer(cols = all_of(cols)) %>%
  group_by(id) %>%
  summarise(res = toString(name[!is.na(value)])) %>%
  type.convert() %>%
  left_join(result, by = 'id') %>%
  mutate(res = case_when(res == '' ~ 'nothing',
                         stringr::str_count(res, ',') ==
                           (length(cols) - 1) ~ 'all',
                         TRUE ~ as.character(res)))
# A tibble: 3 x 5
# id res country phone car
# <dbl> <chr> <fct> <dbl> <lgl>
#1 1 country, phone us 1 NA
#2 2 phone NA 1 NA
#3 3 nothing NA NA NA
We get the data in long format, get the column names which have non-NA value for each ID. We then change the res column to "all" or "nothing" if there are all or 0 matches respectively.
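For comparison, all three cases ("nothing", a list of names, "all") could be handled in base R with a small helper. This is a sketch on made-up data with one extra row so that each case occurs; the helper name `label_present` is hypothetical.

```r
cols <- c("country", "phone", "car")

# Toy data (hypothetical): row 3 has every column present, row 4 has none
test <- data.frame(
  id      = 1:4,
  country = c("us", NA, "uk", NA),
  phone   = c(1, 1, 1, NA),
  car     = c(NA, 0, 1, NA)
)

# Label one row: "nothing", "all", or the names of the non-NA columns
label_present <- function(x) {
  present <- cols[!is.na(x)]
  if (length(present) == 0) {
    "nothing"
  } else if (length(present) == length(cols)) {
    "all"
  } else {
    paste(present, collapse = ", ")
  }
}

test$res <- apply(test[, cols], 1, label_present)
test
```

Only the columns named in cols are inspected, so other columns in the data frame do not affect the result.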

gather 3 different detections of three different variables

I have a dataframe of 96074 obs. of 31 variables.
the first two variables are id and the date, then I have 9 columns with measurement (three different KPIs with three different time properties), then various technical and geographical variables.
df <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d_1day_old = rnorm(9, 2, 1),
sum_i_1day_old = rnorm(9, 2, 1),
per_i_d_1day_old = rnorm(9, 0, 1),
sum_d_5days_old = rnorm(9, 0, 1),
sum_i_5days_old = rnorm(9, 0, 1),
per_i_d_5days_old = rnorm(9, 0, 1),
sum_d_15days_old = rnorm(9, 0, 1),
sum_i_15days_old = rnorm(9, 0, 1),
per_i_d_15days_old = rnorm(9, 0, 1)
)
I want to transform from wide to long, in order to do graphs with ggplot using facets for example.
If I had a df with just one variable with its three-time scans I would have no problem in using gather:
plotdf <- df %>%
gather(sum_d, value,
c(sum_d_1day_old, sum_d_5days_old, sum_d_15days_old),
factor_key = TRUE)
But having three different variables trips me up.
I would like to have this output:
plotdf <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d = rep(c("sum_d_1day_old", "sum_d_5days_old", "sum_d_15days_old"), 3),
values_sum_d = rnorm(9, 2, 1),
sum_i = rep(c("sum_i_1day_old", "sum_i_5days_old", "sum_i_15days_old"), 3),
values_sum_i = rnorm(9, 2, 1),
per_i_d = rep(c("per_i_d_1day_old", "per_i_d_5days_old", "per_i_d_15days_old"), 3),
values_per_i_d = rnorm(9, 2, 1)
)
with id, sum_d, sum_i and per_i_d of class factor, time of class Date, and the values of class numeric (I should add that there are no negative measurements in these variables).
what I've tried to do:
plotdf <- gather(df, key, value, sum_d_1day_old:per_i_d_15days_old, factor_key = TRUE)
gathering all of the variables in a single column
plotdf$KPI <- paste(sapply(strsplit(as.character(plotdf$key), "_"), "[[", 1),
sapply(strsplit(as.character(plotdf$key), "_"), "[[", 2), sep = "_")
creating a new column with the name of the KPI, without the time specification
plotdf %>% unite(value2, key, value) %>%
#creating a new variable with the full name of the KPI attaching the value at the end
mutate(i = row_number()) %>% spread(KPI, value2) %>% select(-i)
#spreading
But spread creates rows with NAs.
To replace them, at first I used
group_by(id, date) %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "down") %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "up") %>%
But the problem is that there are already some measurements with NAs in the original df in the variable per_i_d (44 in total), so I lose that information.
I thought that I could replace the NAs in the original df with a dummy value and then replace the NAs back, but then I thought that there could be a more efficient solution for all of my problem.
After I replaced the NAs, my idea was to use slice(1) to select only the first row of each couple id/date, then do some manipulation with separate/unite to have the output I desired.
I actually did that, but then I remembered I had those aforementioned NAs in the original df.
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+')) %>%
select(-key) %>%
spread(type,value)
gives
id time age per_i sum_d sum_i
1 1 2009-01-01 15days_old 0.8132301 0.8888928 0.077532040
2 1 2009-01-01 1day_old -2.0993199 2.8817133 3.047894196
3 1 2009-01-01 5days_old -0.4626151 -1.0002926 0.327102000
4 1 2009-01-02 15days_old 0.4089618 -1.6868523 0.866412133
5 1 2009-01-02 1day_old 0.8181313 3.7118065 3.701018419
...
EDIT:
adding non-value columns to the dataframe:
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+'),
info = paste(age,type,sep = "_")) %>%
select(-key) %>%
gather(key,value,-id,-time,-age,-type) %>%
unite(dummy,type,key) %>%
spread(dummy,value)
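The same reshape could probably be written in one step with pivot_longer() and the ".value" sentinel, which avoids the gather/spread round trip and leaves NAs from the original data untouched. A sketch, assuming the column names always have the form KPI, underscore, age (only two of the three ages are used here to keep it short):

```r
library(tidyr)

set.seed(42)
df <- data.frame(
  id   = rep(1:3, 3),
  time = rep(as.Date("2009-01-01") + 0:2, each = 3),
  sum_d_1day_old    = rnorm(9, 2, 1),
  sum_i_1day_old    = rnorm(9, 2, 1),
  per_i_d_1day_old  = rnorm(9, 0, 1),
  sum_d_5days_old   = rnorm(9, 0, 1),
  sum_i_5days_old   = rnorm(9, 0, 1),
  per_i_d_5days_old = rnorm(9, 0, 1)
)

long <- pivot_longer(
  df,
  cols = -c(id, time),
  names_to = c(".value", "age"),  # first group becomes column names, second the age
  names_pattern = "(sum_d|sum_i|per_i_d)_(\\d+days?_old)"
)

head(long)
```

The result has one row per id, time and age, with sum_d, sum_i and per_i_d as separate numeric columns, which is a shape that works with ggplot facets on age.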

How do I cast a data frame with more than 3 columns in R?

Importing from an Access database, I have data that look similar to this:
p <- data.frame(SurvDate = as.Date(c('2018-11-1','2018-11-1','2018-11-1',
'2018-11-3', '2018-11-3')),
Area = c('AF','BB','CT', 'DF', 'BB'),
pCount = c(6, 3, 0, 12, 32),
ObsTime = c('8:51','8:59','9:13', '9:24', '9:30'),
stringsAsFactors = FALSE)
I want to cast my data with Rows as SurvDate and columns to be Areas (values as pCount) and ObsTime columns next to each Area with value ObsTime.
Example:
n <- data.frame(SurvDate = as.Date(c('2018-11-1','2018-11-3')),
AF = c(6, NA),
TimeAF = c('8:51', NA),
BB = c(3, 32),
TimeBB = c('8:59', '9:30'),
CT = c(0, NA),
TimeCT = c(NA, '9:13'),
DF = c(NA,12),
TimeDF = c(NA, '9:24'))
I've tried variations on this theme, but can't get time to work.
library(reshape2)
dcast(p, SurvDate+ObsTime ~ Area)
Here is one way using tidyverse tools. Note that the output is not the same as your expected output, because it seems like you didn't put the values for CT in the right place (values spread across two dates). Approach is to unite the values so we have a single key-value pair to spread, and then separate out the columns again with mutate_at. We could also have used separate multiple times, though this would become unwieldy with too many Areas.
SurvDate <- as.Date(c('2018-11-1','2018-11-1','2018-11-1', '2018-11-3', '2018-11-3'))
Area <- c('AF','BB','CT', 'DF', 'BB')
People <- c(6, 3, 0, 12, 32)
ObsTime <- (c('8:51','8:59','9:13', '9:24', '9:30'))
p <- data.frame(SurvDate, Area, People, ObsTime, stringsAsFactors = FALSE)
library(tidyverse)
p %>%
  unite(vals, People, ObsTime) %>%  # combine count and time into one value
  spread(Area, vals) %>%            # one column per Area
  mutate_at(
    .vars = vars(-SurvDate),
    .funs = list(
      Time = ~ str_extract(., "(?<=_).*$"),  # part after the underscore
      Area = ~ str_extract(., "^.*(?=_)")    # part before the underscore
    )
  ) %>%
  filter(!is.na(SurvDate)) %>%
  select(SurvDate, matches("_")) %>%
  select(SurvDate, order(colnames(.)))
#> SurvDate AF_Area AF_Time BB_Area BB_Time CT_Area CT_Time DF_Area
#> 1 2018-11-01 6 8:51 3 8:59 0 9:13 <NA>
#> 2 2018-11-03 <NA> <NA> 32 9:30 <NA> <NA> 12
#> DF_Time
#> 1 <NA>
#> 2 9:24
Created on 2018-04-30 by the reprex package (v0.2.0).
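With a more recent tidyr, a similar result might be reached without uniting and re-splitting the values, since pivot_wider() accepts several columns in values_from. A sketch using the data from the question; note that the resulting column names (pCount_AF, ObsTime_AF, and so on) differ from the AF / TimeAF layout requested, and the counts stay numeric.

```r
library(tidyr)

p <- data.frame(
  SurvDate = as.Date(c("2018-11-1", "2018-11-1", "2018-11-1",
                       "2018-11-3", "2018-11-3")),
  Area     = c("AF", "BB", "CT", "DF", "BB"),
  pCount   = c(6, 3, 0, 12, 32),
  ObsTime  = c("8:51", "8:59", "9:13", "9:24", "9:30"),
  stringsAsFactors = FALSE
)

# One row per survey date; one count column and one time column per Area
wide <- pivot_wider(
  p,
  id_cols     = SurvDate,
  names_from  = Area,
  values_from = c(pCount, ObsTime)
)

wide
```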
