R: Long to wide without time

I am working with a medication prescription dataset which I want to transform from long to wide format.
I tried to use the reshape function; however, it requires a time variable, which I don't have (at least not in a useful format, I believe).
Concept dataset:
id <- c(1, 1, 1, 2, 2, 3, 3, 3)
prescription_date <- c("17JAN2009", "02MAR2009", "20MAR2009", "05JUL2009", "10APR2009", "09MAY2009", "13JUN2009", "29MAY2009")
med <- c("A", "B", "A", "B", "A", "B", "A", "B")
df <- data.frame(id, prescription_date, med)
To create a time variable, I have tried numbering the medications per id (1st, 2nd, etc.), but I didn't succeed.
Background: I want this in a wide format to eventually create definitions for diagnoses (e.g. if a patient had >1 prescription of A, the diagnosis is confirmed). This has to be combined with factors from other datasets, hence the idea to go from long to wide.
Any help is much appreciated, thank you.

You might consider keeping the data in long format to perform some of these calculations. I would also suggest converting your dates into a date format that can be calculated upon; this will show, for instance, that the last two rows are not in chronological order:
library(dplyr)
df %>%
  mutate(prescription_date = lubridate::dmy(prescription_date)) %>%
  arrange(id, prescription_date) %>%
  group_by(id) %>%
  mutate(A_cuml = cumsum(med == "A"),
         A_ttl = sum(med == "A")) %>%
  ungroup()
# A tibble: 8 × 5
     id prescription_date med   A_cuml A_ttl
  <dbl> <date>            <chr>  <int> <int>
1     1 2009-01-17        A          1     2
2     1 2009-03-02        B          1     2
3     1 2009-03-20        A          2     2
4     2 2009-04-10        A          1     1
5     2 2009-07-05        B          1     1
6     3 2009-05-09        B          0     1
7     3 2009-05-29        B          0     1
8     3 2009-06-13        A          1     1
If you calculate summary stats for each id, you might save this in a summarized table with one row per id and use joins (e.g. left_join) to append the results of each of these summaries.
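For the diagnosis definitions, a minimal sketch of that summarise-then-join idea (the diag_A flag and other_df are hypothetical names, assuming ">1 prescriptions of A" confirms the diagnosis):
per_id <- df %>%
  mutate(prescription_date = lubridate::dmy(prescription_date)) %>%
  group_by(id) %>%
  summarise(A_ttl = sum(med == "A"),   # total "A" prescriptions per id
            diag_A = A_ttl > 1)        # hypothetical diagnosis flag
# other_df %>% left_join(per_id, by = "id")   # append to other datasets
And if a wide layout is still needed, a per-id sequence built with row_number() can stand in for the missing time variable in tidyr::pivot_wider() (a sketch, one option among several):
library(tidyr)
df %>%
  mutate(prescription_date = lubridate::dmy(prescription_date)) %>%
  arrange(id, prescription_date) %>%
  group_by(id) %>%
  mutate(seq = row_number()) %>%   # 1st, 2nd, ... prescription per id
  ungroup() %>%
  pivot_wider(names_from = seq, values_from = c(prescription_date, med))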


R dplyr: filter common values by group

I need to find common values between different groups, ideally using dplyr and R.
From my dataset here:
  group   val
  <fct> <dbl>
1 a         1
2 a         2
3 a         3
4 b         3
5 b         4
6 b         5
7 c         1
8 c         3
the expected output is
  group   val
  <fct> <dbl>
1 a         3
2 b         3
3 c         3
as only number 3 occurs in all groups.
This code does not work:
# Filter the data
dd %>%
  group_by(group) %>%
  filter(all(val)) # does not work
The example here solves a similar issue, but it has a defined vector of shared values. What if I do not know which ones are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
          "b", "b", "b",
          "c", "c")
val = c(1, 2, 3,
        3, 4, 5,
        1, 3)
dd <- data.frame(group, val)
group_by isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values have all the groups:
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups:   val [1]
#   group   val
#   <chr> <dbl>
# 1 a         3
# 2 b         3
# 3 c         3
This is one of the rare cases where we want to use data$column in a dplyr verb: n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups (it could also be pre-computed, as sketched below). n_distinct(group), by contrast, uses the grouped data piped into filter, so it gives the number of distinct groups for each value (because we group_by(val)).
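For instance, pre-computing the group total keeps the pipeline self-contained (a trivial sketch of the same logic):
n_groups <- n_distinct(dd$group)
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_groups)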
A base R approach can be:
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)), ]
Output:
  group val
3     a   3
4     b   3
8     c   3
A similar option in data.table to @GregorThomas' solution is
library(data.table)
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]

Differentiated sampling rate by group

For machine learning model training, I'm trying to sample a data frame that has a grouping variable, so that each group is treated with a different sampling rule. For instance, my data:
df = data.frame(value = 1:10, label=c("a", "a", "b", rep("c", 7)))
For groups of size under, say, 3, I want to take the whole group and no more, and for bigger groups I want to take a sample of size 3 without replacement.
So here, the result could be: df[c(1:3, 6,9,10),]
If I use group_by and sample_n, I get a size error. I thought of going "manual" with splits and differentiated sampling, then binding the rows back together, but is there a more efficient and direct way?
Use the size of the group, n(), inside sample_n:
df %>% group_by(label) %>% sample_n(min(n(), 3))
# A tibble: 6 x 2
# Groups:   label [3]
#   value label
#   <int> <fct>
# 1     1 a
# 2     2 a
# 3     3 b
# 4     5 c
# 5    10 c
# 6     4 c
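Note that in current dplyr, sample_n() has been superseded by slice_sample(), which handles this case directly: with replace = FALSE, a request larger than the group is silently truncated to the group size. A sketch:
library(dplyr)
df %>%
  group_by(label) %>%
  slice_sample(n = 3)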

How to obtain daily time series of categorical frequencies in R

I have a data frame as such:
data <- data.frame(
  daytime = c('2005-05-03 11:45:23', '2005-05-03 11:47:45',
              '2005-05-03 12:00:32', '2005-05-03 12:25:01',
              '2006-05-02 10:45:15', '2006-05-02 11:15:14',
              '2006-05-02 11:16:15', '2006-05-02 11:18:03'),
  category = c("A", "A", "A", "B", "B", "B", "B", "A"))
print(data)
              daytime category
1 2005-05-03 11:45:23        A
2 2005-05-03 11:47:45        A
3 2005-05-03 12:00:32        A
4 2005-05-03 12:25:01        B
5 2006-05-02 10:45:15        B
6 2006-05-02 11:15:14        B
7 2006-05-02 11:16:15        B
8 2006-05-02 11:18:03        A
I would like to turn this data frame into a time series of daily categorical frequencies like this:
         day cat_A_freq cat_B_freq
1 2005-05-03          3          1
2 2006-05-02          1          3
I have tried doing:
library(anytime)
data$daytime <- anytime(data$daytime)
data$day <- factor(format(data$daytime, "%D"))
table(data$day, data$category)
           A B
  05/02/06 1 3
  05/03/05 3 1
But as you can see, formatting the new variable, day, changes the appearance of the date. You can also see that the table does not return the days in the proper order (the years are out of order), which prevents an easy conversion to a time series.
Any ideas on how to get the frequencies in an easier way, or, if this is the way, how to get them in correct date order and into a data frame for easy conversion to a time series object?
A solution using the tidyverse. The format of your daytime column is good, so we can use as.Date directly, without specifying a format or using other functions.
library(tidyverse)
data2 <- data %>%
  mutate(day = as.Date(daytime)) %>%
  count(day, category) %>%
  spread(category, n)
data2
# # A tibble: 2 x 3
#   day            A     B
# * <date>     <int> <int>
# 1 2005-05-03     3     1
# 2 2006-05-02     1     3
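In current tidyr, spread() has been superseded by pivot_wider(), which can also fill missing day/category combinations with zeros. A sketch of the same pipeline:
data %>%
  mutate(day = as.Date(daytime)) %>%
  count(day, category) %>%
  pivot_wider(names_from = category, values_from = n, values_fill = 0)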

Product of several columns on a data frame by a vector using dplyr

I would like to multiply several columns on a dataframe by the values of a vector (all values within the same column should be multiplied by the same value, which will be different according to the column), while keeping the other columns as they are.
Since I'm using dplyr extensively, I thought it might be useful to use the mutate_each function so I can modify all columns at the same time, but I am completely lost on the syntax of the funs() part.
On the other hand, I've read this solution, which is simple and works fine, but it applies to all columns rather than just the selected ones.
That's what I've done so far:
Imagine that I want to multiply all columns in df except letters by the weight_df vector, as follows:
df = data.frame(
  letters = c("A", "B", "C", "D"),
  col1 = c(3, 3, 2, 3),
  col2 = c(2, 2, 3, 1),
  col3 = c(4, 1, 1, 3)
)
> df
  letters col1 col2 col3
1       A    3    2    4
2       B    3    2    1
3       C    2    3    1
4       D    3    1    3
weight_df = c(1:3)
If I use select before applying mutate_each, I get rid of the letters column (as expected), and that's not what I want (apart from the fact that the vector is applied on a per-row basis rather than per column, and I want the opposite):
df = df %>%
  select(-letters) %>%
  mutate_each(funs(. * weight_df))
> df
  col1 col2 col3
1    3    2    4
2    6    4    2
3    6    9    3
4    3    1    3
But if I don't select any particular columns, all values within letters are removed (which makes a lot of sense, by the way), and that's not what I want either (again, the vector is applied on a per-row basis rather than per column, and I want the opposite):
df = df %>%
  mutate_each(funs(. * weight_df))
> df
  letters col1 col2 col3
1      NA    3    2    4
2      NA    6    4    2
3      NA    6    9    3
4      NA    3    1    3
(Please note that this is a very simple data frame; the original one has many more rows and columns, which unfortunately are not labeled in such an easy way, and no naming pattern can be exploited.)
The problem here is that you are basically trying to operate over rows rather than columns, hence methods such as mutate_* won't work. If you are not satisfied with the many vectorized approaches proposed in the linked question, then using the tidyverse (and assuming that letters is a unique identifier), one way to achieve this is to convert to long form first, multiply a single column by group, and then convert back to wide (I don't think this will be overly efficient, though):
library(tidyr)
library(dplyr)

df %>%
  gather(variable, value, -letters) %>%
  group_by(letters) %>%
  mutate(value = value * weight_df) %>%
  spread(variable, value)
#Source: local data frame [4 x 4]
#Groups: letters [4]
#  letters  col1  col2  col3
#*  <fctr> <dbl> <dbl> <dbl>
#1       A     3     4    12
#2       B     3     4     3
#3       C     2     6     3
#4       D     3     2     9
Using dplyr: this filters to the numeric columns only, which gives flexibility for choosing columns, and it returns the new values along with all the other (non-numeric) columns.
index <- which(sapply(df, is.numeric))
df[, index] <- df[, index] %>% sweep(2, weight_df, FUN = "*")
> df
  letters col1 col2 col3
1       A    3    4   12
2       B    3    4    3
3       C    2    6    3
4       D    3    2    9
Try this:
library(plyr)
library(dplyr)
df %>% select_if(is.numeric) %>% adply(., 1, function(x) x * weight_df)
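In current dplyr (>= 1.0.0), mutate_each()/funs() are retired in favor of across(); naming the weights after the target columns makes the per-column mapping explicit. A sketch under that assumption (w is just weight_df with names attached):
library(dplyr)
w <- c(col1 = 1, col2 = 2, col3 = 3)   # weight_df, named by target column
df %>%
  mutate(across(all_of(names(w)), ~ .x * w[[cur_column()]]))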

dplyr count number of one specific value of variable

Say I have a dataset like this:
id <- c(1, 1, 2, 2, 3, 3)
code <- c("a", "b", "a", "a", "b", "b")
dat <- data.frame(id, code)
I.e.,
  id code
1  1    a
2  1    b
3  2    a
4  2    a
5  3    b
6  3    b
Using dplyr, how would I get a count of how many a's there are for each id?
I.e.,
  id countA
1  1      1
2  2      2
3  3      0
I'm trying stuff like this, which isn't working:
countA <- dat %>%
  group_by(id) %>%
  summarise(cip.completed = count(code == "a"))
The above gives me an error, "Error: no applicable method for 'group_by_' applied to an object of class "logical""
Thanks for your help!
Try the following instead:
library(dplyr)
dat %>%
  group_by(id) %>%
  summarise(cip.completed = sum(code == "a"))
Source: local data frame [3 x 2]
     id cip.completed
  (dbl)         (int)
1     1             1
2     2             2
3     3             0
This works because the logical condition code == "a" is just a series of zeros and ones, and the sum of this series is the number of occurrences.
Note that you would not necessarily use dplyr::count inside summarise anyway, as it is a wrapper for summarise that calls either n() or sum() itself; see ?dplyr::count. If you really want to use count, you could first filter the dataset to retain only the rows in which code == "a"; count would then give you all strictly positive (i.e. non-zero) counts. For instance,
dat %>% filter(code == "a") %>% count(id)
Source: local data frame [2 x 2]
     id     n
  (dbl) (int)
1     1     1
2     2     2
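If you do take the filter-then-count route, the zero counts can be restored afterwards with tidyr::complete() (a sketch):
library(tidyr)
dat %>%
  filter(code == "a") %>%
  count(id) %>%
  complete(id = unique(dat$id), fill = list(n = 0))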
