Using pivot_longer to collapse data from multiple columns in R

I've read through a number of similar posts and tutorials but am really struggling to understand the solution to my issue. I have a wide dataset, and when I make it longer I want to collapse two sets of columns at once (both duration and results).
For each participant (id), there is a category and then a series of blood test results. Each test has both a duration (in days) and a result (a numeric value).
Here's how it looks now:
id  category     duration_1  results_1  duration_2  results_2  duration_3  results_3
01  diabetic             58         32          65         56          76         87
02  prediabetic          54         32          65         25          76         35
03  unknown              46         65          65         56          21         67
How I'd like it to be is:
id  category     duration  results
01  diabetic           58       32
01  diabetic           65       56
01  diabetic           76       87
02  prediabetic        54       32
02  prediabetic        65       25
02  prediabetic        76       35
03  unknown            46       65
03  unknown            65       56
03  unknown            21       67
I can get pivot_longer to work for results, but I can't get it to pivot on both results and duration.
Any assistance would be greatly appreciated. I'm still fairly new to R.
Thanks!

One way is to split the column names into two pieces while you pivot (hence the names_sep below): the part before the underscore goes to ".value", so it becomes the output column name, and the trailing number goes into a num column, which you can then drop.
library(tidyverse)

df %>%
  tidyr::pivot_longer(!c(id, category),
                      names_to = c(".value", "num"),
                      names_sep = "_") %>%
  dplyr::select(-num)
Output
# A tibble: 9 × 4
  id    category    duration results
  <chr> <chr>          <dbl>   <dbl>
1 01    diabetic          32      23
2 01    diabetic          87      67
3 01    diabetic          98      78
4 02    prediabetic       43      45
5 02    prediabetic       34      65
6 02    prediabetic       12      12
7 03    unknown           32      54
8 03    unknown           75      45
9 03    unknown           43      34
Data
df <- structure(
  list(
    id = c("01", "02", "03"),
    category = c("diabetic", "prediabetic", "unknown"),
    duration_1 = c(32, 43, 32),
    results_1 = c(23, 45, 54),
    duration_2 = c(87, 34, 75),
    results_2 = c(67, 65, 45),
    duration_3 = c(98, 12, 43),
    results_3 = c(78, 12, 34)
  ),
  class = "data.frame",
  row.names = c(NA, -3L)
)
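As an aside, if the column names were less regular, the same reshape can be written with names_pattern instead of names_sep. This is a small variation on the answer above (same df, same output), offered as a sketch rather than a second canonical answer:

df %>%
  tidyr::pivot_longer(!c(id, category),
                      names_to = c(".value", "num"),
                      names_pattern = "(.+)_(\\d+)") %>%  # name stem to .value, trailing digits to num
  dplyr::select(-num)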

Related

R - Identify and remove duplicate rows based on two columns

I have some data that looks like this:
Course_ID Text_ID
33 17
33 17
58 17
5 22
8 22
42 25
42 25
17 26
17 26
35 39
51 39
Not having a background in programming, I'm finding it tricky to articulate my question, but here goes: I only want to keep rows where Course_ID varies but where Text_ID is the same. So for example, the final data would look something like this:
Course_ID Text_ID
5 22
8 22
35 39
51 39
As you can see, Text_ID 22 and 39 are the only ones where every Course_ID value is different. I suspect subsetting the data would be the way to go, but as I said, I'm quite a novice at this kind of thing and would really appreciate any advice on how to approach this.
Select those groups where there are no repeats of Course_ID.
In dplyr you can write this as -
library(dplyr)
df %>% group_by(Text_ID) %>% filter(n_distinct(Course_ID) == n()) %>% ungroup
# Course_ID Text_ID
# <int> <int>
#1 5 22
#2 8 22
#3 35 39
#4 51 39
and in data.table -
library(data.table)
setDT(df)[, .SD[uniqueN(Course_ID) == .N], Text_ID]
You can use ave with anyDuplicated to test whether any Course_ID is repeated within each Text_ID group, keeping only the rows where it isn't.
x[ave(x$Course_ID, x$Text_ID, FUN=anyDuplicated)==0,]
# Course_ID Text_ID
#4 5 22
#5 8 22
#10 35 39
#11 51 39
Data:
x <- read.table(header=TRUE, text="Course_ID Text_ID
33 17
33 17
58 17
5 22
8 22
42 25
42 25
17 26
17 26
35 39
51 39")
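For completeness, here is a closely related base R sketch on the same x (not part of the original answer): a Text_ID group contains a repeated Course_ID exactly when some (Course_ID, Text_ID) pair is duplicated, so those groups can be dropped directly.

# Text_IDs that contain at least one duplicated (Course_ID, Text_ID) pair
dup_ids <- x$Text_ID[duplicated(x[c("Course_ID", "Text_ID")])]
# keep only the groups without any repeats
x[!x$Text_ID %in% dup_ids, ]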
Here is my approach with rlist and dplyr:
library(dplyr)
your_data %>%
split(~ Text_ID) %>%
rlist::list.filter(length(unique(Course_ID)) == length(Course_ID)) %>%
bind_rows()
Returns:
# A tibble: 4 x 2
Course_ID Text_ID
<dbl> <dbl>
1 5 22
2 8 22
3 35 39
4 51 39
# Data used:
your_data <- structure(list(Course_ID = c(33, 33, 58, 5, 8, 42, 42, 17, 17, 35, 51), Text_ID = c(17, 17, 17, 22, 22, 25, 25, 26, 26, 39, 39)), row.names = c(NA, -11L), class = c("tbl_df", "tbl", "data.frame"))

Lookup table based on multiple conditions in R

Thank you for taking a look at my question!
I have the following (dummy) data for patient performance on 3 tasks:
patient_df = data.frame(id = seq(1:5),
                        age = c(30, 72, 46, 63, 58),
                        education = c(11, 22, 18, 12, 14),
                        task1 = c(21, 28, 20, 24, 22),
                        task2 = c(15, 15, 10, 11, 14),
                        task3 = c(82, 60, 74, 78, 78))
> patient_df
id age education task1 task2 task3
1 1 30 11 21 15 82
2 2 72 22 28 15 60
3 3 46 18 20 10 74
4 4 63 12 24 11 78
5 5 58 14 22 14 78
I also have the following (dummy) lookup table for age and education-based cutoff values to define a patient's performance as impaired or not impaired on each task:
cutoffs = data.frame(age = rep(seq(from = 35, to = 70, by = 5), 2),
                     education = c(rep("<16", 8), rep(">=16", 8)),
                     task1_cutoff = c(rep(24, 16)),
                     task2_cutoff = c(11, 11, 11, 11, 10, 10, 10, 10, 9, 13, 13, 13, 13, 12, 12, 11),
                     task3_cutoff = c(rep(71, 8), 70, rep(74, 2), rep(73, 5)))
> cutoffs
age education task1_cutoff task2_cutoff task3_cutoff
1 35 <16 24 11 71
2 40 <16 24 11 71
3 45 <16 24 11 71
4 50 <16 24 11 71
5 55 <16 24 10 71
6 60 <16 24 10 71
7 65 <16 24 10 71
8 70 <16 24 10 71
9 35 >=16 24 9 70
10 40 >=16 24 13 74
11 45 >=16 24 13 74
12 50 >=16 24 13 73
13 55 >=16 24 13 73
14 60 >=16 24 12 73
15 65 >=16 24 12 73
16 70 >=16 24 11 73
My goal is to create 3 new variables in patient_df that indicate whether or not a patient is impaired on each task with a binary indicator. For example, for id=1 in patient_df, their age is <=35 and their education is <16 years, so the cutoff value for task1 would be 24, the cutoff value for task2 would be 11, and the cutoff value for task3 would be 71, such that scores below these values would denote impairment.
I would like to do this for each id by referencing the age and education-associated cutoff value in the cutoff dataset, so that the outcome would look something like this:
> goal_patient_df
id age education task1 task2 task3 task1_impaired task2_impaired task3_impaired
1 1 30 11 21 15 82 1 1 0
2 2 72 22 28 15 60 0 0 1
3 3 46 18 20 10 74 1 1 0
4 4 63 12 24 11 78 1 0 0
5 5 58 14 22 14 78 1 0 0
In actuality, my patient_df has 600+ patients and there are 7+ tasks each with age- and education-associated cutoff values, so a 'clean' way of doing this would be greatly appreciated! My only alternative that I can think of right now is writing a TON of if_else statements or case_whens which would not be incredibly reproducible for anyone else who would use my code :(
Thank you in advance!
I would recommend putting both your lookup table and your patient_df data frame in long form. I think that will be easier to manage with multiple tasks.
Your education column is numeric, so converting it to the character values "<16" or ">=16" will help with matching against the lookup table.
Using fuzzy_inner_join will match the data with the lookup table where task and education match exactly (==), while age falls between age_low and age_high, provided you specify an age range for each lookup table row.
Finally, impaired is calculated by comparing the values from the two data frames for each task.
Please note that in the output, id 1 is missing, as it falls outside the age range of the lookup table. You can add more rows to that table to address this.
library(tidyverse)
library(fuzzyjoin)

cutoffs_long <- cutoffs %>%
  pivot_longer(cols = starts_with("task"),
               names_to = "task",
               values_to = "cutoff_value",
               names_pattern = "task(\\d+)") %>%
  mutate(age_low = age,
         age_high = age + 4) %>%
  select(-age)

patient_df %>%
  pivot_longer(cols = starts_with("task"),
               names_to = "task",
               values_to = "patient_value",
               names_pattern = "(\\d+)") %>%
  mutate(education = ifelse(education < 16, "<16", ">=16")) %>%
  fuzzy_inner_join(cutoffs_long,
                   by = c("age" = "age_low", "age" = "age_high", "education", "task"),
                   match_fun = list(`>=`, `<=`, `==`, `==`)) %>%
  mutate(impaired = +(patient_value < cutoff_value))
Output
# A tibble: 12 x 11
id age education.x task.x patient_value education.y task.y cutoff_value age_low age_high impaired
<int> <dbl> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <int>
1 2 72 >=16 1 28 >=16 1 24 70 74 0
2 2 72 >=16 2 15 >=16 2 11 70 74 0
3 2 72 >=16 3 60 >=16 3 73 70 74 1
4 3 46 >=16 1 20 >=16 1 24 45 49 1
5 3 46 >=16 2 10 >=16 2 13 45 49 1
6 3 46 >=16 3 74 >=16 3 74 45 49 0
7 4 63 <16 1 24 <16 1 24 60 64 0
8 4 63 <16 2 11 <16 2 10 60 64 0
9 4 63 <16 3 78 <16 3 71 60 64 0
10 5 58 <16 1 22 <16 1 24 55 59 1
11 5 58 <16 2 14 <16 2 10 55 59 0
12 5 58 <16 3 78 <16 3 71 55 59 0
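To get back to the wide layout asked for in goal_patient_df (one *_impaired column per task), a pivot_wider step can be appended. A sketch, assuming the fuzzy join result above has been saved in an object called joined (a name not used in the answer):

# `joined` is assumed to hold the fuzzy_inner_join result shown above
joined %>%
  select(id, task = task.x, impaired) %>%
  pivot_wider(names_from = task,
              values_from = impaired,
              names_glue = "task{task}_impaired") %>%
  left_join(patient_df, ., by = "id")

Patient 1 comes back with NA indicators because, as noted, age 30 falls outside the lookup table's age range.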

how can i make/replace some row names with bold text in a matrix in R

I have a matrix and I would like to make some row names bold to distinguish them from the other row names.
For example, I have a matrix with months as row names, and I want to make January, May and August bold.
There's the crayon package that you might want to look at, although I'm not sure there's a simple way to do this with it. Anyway, here's a simple out-of-the-box alternative in good ol' base R -
mat <- matrix(sample(100, 36), nrow = 12, dimnames = list(month.name, NULL))
bold <- c("January", "May", "August")
rownames(mat)[rownames(mat) %in% bold] <-
paste0("--------------", rownames(mat)[rownames(mat) %in% bold])
[,1] [,2] [,3]
--------------January 75 52 95
February 78 27 93
March 89 2 81
April 65 28 53
--------------May 67 15 30
June 90 19 86
July 13 39 85
--------------August 98 1 94
September 88 63 18
October 8 80 62
November 76 10 25
December 68 84 20
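If real bold text (rather than a visual marker) is wanted, here is a minimal sketch of the crayon idea mentioned above. It assumes a terminal that renders ANSI escape codes, and it uses cat() for display because print() may escape the styling codes embedded in the dimnames:

library(crayon)

mat <- matrix(sample(100, 36), nrow = 12, dimnames = list(month.name, NULL))
bold_rows <- c("January", "May", "August")

# cat() each row, wrapping the selected row names in crayon::bold()
for (m in rownames(mat)) {
  label <- if (m %in% bold_rows) bold(m) else m
  cat(label, "\t", mat[m, ], "\n")
}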

Reshaping data frame in r according to the longest row

I have a 189 by 1443 data frame containing heart rate data for 189 days for every minute of the day:
year month day `00:00` `00:01` `00:02` `00:03` `00:04` `00:05` ...
2018 04 07 NA 63 NA NA 62 NA ...
2018 04 08 57 NA 58 NA NA NA ...
2018 04 09 NA NA NA 52 NA 51 ...
I need to transform this data frame into 189 by 131 (131 being the largest number of entries in any one day), basically aligning all entries to the left, so that rows with fewer than 131 entries are padded with NAs out to column 131.
The end result would have to look like this:
year month day `1` `2` `3` `4` `5` `6` ... `131`
2018 04 07 63 62 63 64 61 60 ... 59
2018 04 08 57 58 56 55 56 55 ... NA
2018 04 09 52 51 49 50 48 52 ... NA
.
.
.
Could anyone help me with that? Sadly, I don't have a clue where to start.
See if this works for you:
library(tidyverse)

df %>%
  gather(minute, value, -c(year:day)) %>%
  drop_na() %>%
  group_by(year, month, day) %>%
  arrange(year, month, day, minute) %>%
  mutate(row = row_number()) %>%
  select(-minute) %>%
  spread(row, value)
# A tibble: 3 x 5
# Groups: year, month, day [3]
year month day `1` `2`
<dbl> <chr> <chr> <dbl> <dbl>
1 2018 04 07 63 62
2 2018 04 08 57 58
3 2018 04 09 52 51
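gather() and spread() still work but have been superseded; the same idea written with pivot_longer() and pivot_wider() would look like this (a sketch, assuming the same df):

library(tidyverse)

df %>%
  pivot_longer(-c(year, month, day), names_to = "minute", values_to = "value") %>%
  drop_na(value) %>%
  group_by(year, month, day) %>%
  arrange(minute, .by_group = TRUE) %>%  # "HH:MM" sorts chronologically as text
  mutate(row = row_number()) %>%
  ungroup() %>%
  select(-minute) %>%
  pivot_wider(names_from = row, values_from = value)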

Dealing with date format in zoo

I have a CSV data file in the following format:
Stock prices over the period of Jan 1, 2015 to Sep 26, 2017.
I use the following code to import the data as a zoo object:
sensexzoo1 <- read.zoo(file = "/home/bidyut/Downloads/SENSEX.csv",
                       format = "%d-%B-%Y", header = T, sep = ",")
It produces the following error:
Error in read.zoo(file = "/home/bidyut/Downloads/SENSEX.csv", format =
"%d-%B-%Y", : index has 679 bad entries at data rows: 1 2 3 4 5 6
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
100 ...
What is wrong with this? Please suggest a fix.
The problem is the mismatch between the header and the data. The header line has 5 fields and the remaining lines of the file have 6 fields:
head(count.fields("SENSEX.csv", sep = ","))
## [1] 5 6 6 6 6 6
When that happens, read.zoo assumes that the first field of the data holds the row names, so by default the next field (which in fact contains the Open data) is taken to be the time index.
We can address this in several alternative ways:
1) The easiest way to fix this is to add a field called Volume, say, to the header so that the header looks like this:
Date,Open,High,Low,Close,Volume
2) If you have many files of this format so that it is not feasible to modify them we can read the data in without the headers and then add them on in a second pass. The [, -5] drops the column of NAs and the [-1] on the second line drops the Date header.
z <- read.zoo("SENSEX.csv", format="%d-%B-%Y", sep = ",", skip = 1)[, -5]
names(z) <- unlist(read.table("SENSEX.csv", sep = ",", nrow = 1))[-1]
giving:
> head(z)
Open High Low Close
2015-01-01 27485.77 27545.61 27395.34 27507.54
2015-01-02 27521.28 27937.47 27519.26 27887.90
2015-01-05 27978.43 28064.49 27786.85 27842.32
2015-01-06 27694.23 27698.93 26937.06 26987.46
2015-01-07 26983.43 27051.60 26776.12 26908.82
2015-01-08 27178.77 27316.41 27101.94 27274.71
3) A third approach is to read the file in as text, use R to append ",Volume" to the first line and then read the text with read.zoo:
Lines <- readLines("SENSEX.csv")
Lines[1] <- paste0(Lines[1], ",Volume")
z <- read.zoo(text = Lines, header = TRUE, sep = ",", format="%d-%B-%Y")
Note: The first few lines of SENSEX.csv are shown below to make this self-contained (not dependent on the link in the question which could disappear in the future):
Date,Open,High,Low,Close
1-January-2015,27485.77,27545.61,27395.34,27507.54,
2-January-2015,27521.28,27937.47,27519.26,27887.90,
5-January-2015,27978.43,28064.49,27786.85,27842.32,
6-January-2015,27694.23,27698.93,26937.06,26987.46,
7-January-2015,26983.43,27051.60,26776.12,26908.82,
8-January-2015,27178.77,27316.41,27101.94,27274.71,
9-January-2015,27404.19,27507.67,27119.63,27458.38,
12-January-2015,27523.86,27620.66,27323.74,27585.27,
13-January-2015,27611.56,27670.19,27324.58,27425.73,
14-January-2015,27432.14,27512.80,27203.25,27346.82,
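For a fully self-contained run on just these sample lines, approach (3) can be reproduced directly; the [, -5] drops the empty Volume column, as in approach (2). A sketch using only a few of the rows shown above:

library(zoo)

Lines <- c(
  "Date,Open,High,Low,Close",
  "1-January-2015,27485.77,27545.61,27395.34,27507.54,",
  "2-January-2015,27521.28,27937.47,27519.26,27887.90,",
  "5-January-2015,27978.43,28064.49,27786.85,27842.32,"
)
Lines[1] <- paste0(Lines[1], ",Volume")  # pad the header to 6 fields
# %B assumes an English locale for the month names
z <- read.zoo(text = Lines, header = TRUE, sep = ",", format = "%d-%B-%Y")[, -5]
head(z)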
