In search of a more efficient solution for converting wide data to long data - r

I want to convert my data from wide to long. I have solved the problem with reshape(), but I had to manually define which columns belong to each "gather group"; with hundreds of columns (which is the case in my data) that would be time-consuming and carry a high risk of typing errors.
Does anyone know a more efficient way to achieve this result?
id <- 1001:1003
qA2 <- c(10,5,1)
qB2 <- c(11,6,3)
qC2 <- c(10,7,5)
qA3 <- c(15,12,8)
qB3 <- c(18,15,7)
qC3 <- c(19,11,10)
df <- data.frame(id,qA2,qB2,qC2, qA3, qB3, qC3)
df
id qA2 qB2 qC2 qA3 qB3 qC3
1 1001 10 11 10 15 18 19
2 1002 5 6 7 12 15 11
3 1003 1 3 5 8 7 10
Solution using reshape() (note: reshape() comes from base R's stats package, so library(reshape2) is not actually needed):
df_test <- reshape(df, idvar = "id", direction = "long",
                   varying = list(c(2, 5), c(3, 6), c(4, 7)),
                   v.names = c("qA", "qB", "qC"), times = 2:3)
df_test <- df_test[order(df_test$id, df_test$time), ]
df_test
id time qA qB qC
1001.2 1001 2 10 11 10
1001.3 1001 3 15 18 19
1002.2 1002 2 5 6 7
1002.3 1002 3 12 15 11
1003.2 1003 2 1 3 5
1003.3 1003 3 8 7 10
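With hundreds of columns, the varying list does not have to be typed by hand; it can be built from the column names. A minimal sketch, assuming every measure column is a stem followed by a trailing time digit:
# Group column positions by their name stem (qA, qB, ...) to build `varying`
vary <- split(2:ncol(df), sub("\\d+$", "", names(df)[-1]))
reshape(df, idvar = "id", direction = "long",
        varying = vary, v.names = names(vary), times = 2:3)
The times argument could likewise be derived from the suffixes, e.g. unique(sub("^\\D+", "", names(df)[-1])).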

Using dplyr and tidyr, here is one way (not sure about the efficiency, though). Note that this approach drops the time suffix instead of keeping it as a column:
library(dplyr)
library(tidyr)
df %>%
  gather(key, value, -id) %>%
  mutate(key = sub("\\d+", "", key)) %>%
  group_by(key) %>%
  mutate(row = row_number()) %>%
  spread(key, value) %>%
  select(-row)
# A tibble: 6 x 4
# id qA qB qC
# <int> <dbl> <dbl> <dbl>
#1 1001 10 11 10
#2 1001 15 18 19
#3 1002 5 6 7
#4 1002 12 15 11
#5 1003 1 3 5
#6 1003 8 7 10

With the new version of tidyr (1.0.0) (already on CRAN, just update it):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with("q"),
               names_to = c(".value", "time"),
               names_pattern = "(q[A-Z])(\\d)")

Here is a base R one-liner:
df1 <- cbind(id = df$id,
             do.call(cbind, lapply(split.default(df[-1],
               gsub('\\d+', '', names(df)[-1])), stack))[c(TRUE, FALSE)])
df1[with(df1, order(id)), ]
# id qA.values qB.values qC.values
#1 1001 10 11 10
#4 1001 15 18 19
#2 1002 5 6 7
#5 1002 12 15 11
#3 1003 1 3 5
#6 1003 8 7 10
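One caveat: stack() discards which suffix each value came from, so df1 has no time column. If it is needed, it can be recovered from the ind column that stack() produces; a sketch:
# Recover the time suffix from the `ind` column of one stacked group
parts <- lapply(split.default(df[-1], gsub('\\d+', '', names(df)[-1])), stack)
df1 <- data.frame(id = df$id,
                  time = as.integer(gsub('\\D', '', parts[[1]]$ind)),
                  sapply(parts, `[[`, "values"))
df1[order(df1$id, df1$time), ]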

We can use names_pattern with pivot_longer:
library(tidyr)
pivot_longer(df, -id, names_to = c(".value", "time"),
             names_pattern = "(\\D+)(\\d+)")
# A tibble: 6 x 5
# id time qA qB qC
# <int> <chr> <dbl> <dbl> <dbl>
#1 1001 2 10 11 10
#2 1001 3 15 18 19
#3 1002 2 5 6 7
#4 1002 3 12 15 11
#5 1003 2 1 3 5
#6 1003 3 8 7 10
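For completeness, data.table's melt() handles several measure groups in one call via patterns(); a sketch (the group index comes out as 1/2 rather than the original 2/3 suffix, so it is remapped afterwards):
library(data.table)
long <- melt(as.data.table(df), id.vars = "id",
             measure.vars = patterns("^qA", "^qB", "^qC"),
             variable.name = "time", value.name = c("qA", "qB", "qC"))
long[, time := as.integer(time) + 1L]  # map the 1/2 group index back to 2/3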

Related

Remove rows in a group depending on multiple criteria - r

I have a dataset with some repeated values in the Date variable, so I would like to filter these rows based on several conditions. As an example, the data frame looks like:
df <- read.table(text =
"Date column_A column_B column_C Column_D
1 2020-01-01 10 15 15 20
2 2020-01-02 10 15 15 20
3 2020-01-03 10 13 15 20
4 2020-01-04 10 15 15 20
5 2020-01-05 NA 14 15 20
6 2020-01-05 7 NA NA 28
7 2020-01-06 10 15 15 20
8 2020-01-07 10 15 15 20
9 2020-01-07 10 NA NA 20
10 2020-01-08 10 15 15 20", header=TRUE)
df$Date <- as.Date(df$Date)
The conditions for filtering, applied ONLY to duplicated rows, should be:
If column_A is NA in one row and numeric in the other, keep the numeric row.
If both rows are similar (both NA or both numeric), keep the row with fewer NAs.
My best approach, after trying several options, is:
df$cnt_na <- apply(df[, 2:5], 1, function(x) sum(is.na(x)))
df <- df %>% group_by(Date) %>% slice(which.min(cnt_na)) %>% select(-cnt_na)
However, it doesn't handle the first condition. The main problem is that if I filter by !is.na(Date), I also remove other non-duplicated rows.
Thanks in advance
I would sort your table based on your conditions and then pick the first row of every group:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(cnt_na = sum(across(-Date, ~ sum(is.na(.))))) %>%  # NAs per row
  arrange(Date, is.na(column_A), cnt_na) %>%  # preferred row first within each Date
  group_by(Date) %>%
  slice_head() %>%
  ungroup()
which gives
# A tibble: 8 x 6
Date column_A column_B column_C Column_D cnt_na
<date> <int> <int> <int> <int> <int>
1 2020-01-01 10 15 15 20 0
2 2020-01-02 10 15 15 20 0
3 2020-01-03 10 13 15 20 0
4 2020-01-04 10 15 15 20 0
5 2020-01-05 7 NA NA 28 2
6 2020-01-06 10 15 15 20 0
7 2020-01-07 10 15 15 20 0
8 2020-01-08 10 15 15 20 0
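The same sort-then-keep-first logic also works in base R; a sketch:
# Count NAs per row, order so the preferred row comes first within each Date,
# then keep only the first occurrence of every Date
cnt_na <- rowSums(is.na(df[-1]))
ord <- order(df$Date, is.na(df$column_A), cnt_na)
df[ord, ][!duplicated(df$Date[ord]), ]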

Selecting distinct entries based on specific variables in R

I want to select distinct entries from my dataset based on two specific variables. In fact, I may like to create a subset and run the analysis on each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
#select the last date for the ID in each date category.
Here date_cat is the date category and date1 is the date formatted as a Date. How can I get the last date for each ID in each date_cat?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!
I am not sure if you want something like below:
subset(myData, ave(date1, id, date_cat, FUN = function(x) tail(sort(x), 1)) == date1)
which gives
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03
Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N], by = .(date_cat, id)]  # last row per group (rows are already date-ordered)
We could create a group with rleid() on the 'id' column, slice the last row, and remove the temporary grouping column:
library(dplyr)
library(data.table)
myData %>%
  group_by(grp = rleid(id)) %>%
  slice(n()) %>%
  ungroup %>%
  select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly, without creating a temporary column:
myData %>%
  filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle:
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
Using dplyr (ordering by the parsed date1 rather than the character date column, since unpadded date strings do not sort chronologically):
myData %>%
  group_by(id, date_cat) %>%
  top_n(1, date1)
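With dplyr >= 1.0, slice_max() supersedes top_n() and makes the intent explicit; a sketch:
myData %>%
  group_by(id, date_cat) %>%
  slice_max(date1, n = 1) %>%  # latest parsed date per group
  ungroup()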

How to calculate unique rows over the most recent n days

Say I want to count, for every day, the unique ids seen over the most recent 15 days. Here is the code:
library(tidyverse)
library(lubridate)
set.seed(1)
eg <- tibble(day = sample(seq(ymd('2018-01-01'), length.out = 100, by = 'day'), 300, replace = T),
             id = sample(letters[1:26], 300, replace = T),
             value = rnorm(300))
eg %>%
  group_by(day) %>%
  summarise(uniqu_id = n_distinct(id),
            recent_15_days_unique_id = 'howto',
            day_total = sum(value))
The result is
# A tibble: 95 x 4
day uniqu_id recent_15_days_unique_id day_total
<date> <int> <chr> <dbl>
1 2018-01-01 3 how -1.38
2 2018-01-02 3 how 2.01
3 2018-01-03 3 how 1.57
4 2018-01-04 6 how -1.64
5 2018-01-05 2 how -0.293
6 2018-01-06 4 how -2.08
For the 'recent_15_days_unique_id' column, the first row should count unique ids between "day - 15" and "day", i.e. between '2017-12-17' and '2018-01-01'; the second row between '2017-12-18' and '2018-01-02'. It is kind of like the 'rollsum' function, but for counting.
We can ungroup and, for every day, create a sequence of 15 days and count all the unique ids within that window:
library(dplyr)
eg %>%
  group_by(day) %>%
  summarise(uniqu_id = n_distinct(id),
            day_total = sum(value)) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(recent_15_days_unique_id =
           n_distinct(eg$id[eg$day %in% seq(day - 15, day, by = "1 day")]))
# day uniqu_id day_total recent_15_days_unique_id
# <date> <int> <dbl> <int>
#1 2018-01-02 2 0.170 2
#2 2018-01-03 2 -0.460 3
#3 2018-01-04 1 -1.53 3
#4 2018-01-05 2 1.67 5
#5 2018-01-06 2 1.52 6
#6 2018-01-07 4 -1.62 10
#7 2018-01-08 2 -0.0190 12
#8 2018-01-09 1 -0.573 12
#9 2018-01-10 2 -0.220 13
#10 2018-01-11 7 -1.73 14
Using the same logic, we can also calculate it separately using sapply:
new_eg <- eg %>%
  group_by(day) %>%
  summarise(uniqu_id = n_distinct(id),
            day_total = sum(value)) %>%
  ungroup()
sapply(new_eg$day, function(x)
  n_distinct(eg$id[as.numeric(eg$day) %in% seq(x - 15, x, by = "1 day")]))
#[1] 2 3 3 5 6 10 12 12 13 14 15 16 17 17 18 20 21 22 22 20 20 21 21 .....
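To generalize the window width, the lookup can be wrapped in a small helper (a sketch; n_recent_unique is just an illustrative name):
# Count unique ids observed in the n days up to and including `on_day`
n_recent_unique <- function(on_day, dates, ids, n = 15) {
  length(unique(ids[dates >= on_day - n & dates <= on_day]))
}
sapply(new_eg$day, n_recent_unique, dates = eg$day, ids = eg$id)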

Repeated measures in messy format, need help to tidy

I have a very large data set containing weekly weights that have been coded with week of study and the weight at that visit. There are some missing visits and the data is not currently aligned.
df <- data.frame(ID=1:3, Week_A=c(6,6,7), Weight_A=c(23,24,23), Week_B=c(7,7,8),
Weight_B=c(25,26,27), Week_C=c(8,9,9), Weight_C=c(27,26,28))
df
ID Week_A Weight_A Week_B Weight_B Week_C Weight_C
1 1 6 23 7 25 8 27
2 2 6 24 7 26 9 26
3 3 7 23 8 27 9 28
I would like to align the data by week number (ideal output below).
df_ideal <- data.frame (ID=1:3, Week_6=c(23,24,NA), Week_7=c(25,26,23),
Week_8=c(27,NA,27), Week_9=c(NA,26,28))
df_ideal
ID Week_6 Week_7 Week_8 Week_9
1 1 23 25 27 NA
2 2 24 26 NA 26
3 3 NA 23 27 28
I would appreciate some help with this, even just a starting point for getting this data into an easier-to-manage format.
A tidyverse solution:
df <- data.frame(ID=1:3,
Week_A=c(6,6,7),
Weight_A=c(23,24,23),
Week_B=c(7,7,8),
Weight_B=c(25,26,27),
Week_C=c(8,9,9),
Weight_C=c(27,26,28))
library(tidyverse)
df_long <- df %>%
  gather(key = "v", value = "value", -ID) %>%
  separate(v, into = c("v1", "v2")) %>%
  spread(v1, value) %>%
  complete(ID, Week) %>%
  arrange(Week, ID)
df_long
# A tibble: 12 x 4
# ID Week v2 Weight
# <int> <dbl> <chr> <dbl>
# 1 1 6 A 23
# 2 2 6 A 24
# 3 3 6 <NA> NA
# 4 1 7 B 25
# 5 2 7 B 26
# 6 3 7 A 23
# 7 1 8 C 27
# 8 2 8 <NA> NA
# 9 3 8 B 27
#10 1 9 <NA> NA
#11 2 9 C 26
#12 3 9 C 28
df_wide <- df_long %>%
  select(-v2) %>%
  spread(Week, Weight, sep = "_")
df_wide
# A tibble: 3 x 5
# ID Week_6 Week_7 Week_8 Week_9
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 23 25 27 NA
#2 2 24 26 NA 26
#3 3 NA 23 27 28
Personally, I'd keep using df_long instead of df_wide, as it is a tidy data frame, while df_wide is not.
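For example, per-week summaries fall straight out of the long format; a sketch:
# Mean weight per week, directly from the tidy long table
df_long %>%
  group_by(Week) %>%
  summarise(mean_weight = mean(Weight, na.rm = TRUE))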
Here is a possible approach using the data.table package:
library(data.table)
# convert into a data.table
setDT(df)
# convert into long format
mdat <- melt(df, id.vars = "ID",
             measure.vars = patterns("^Week", "^Weight", cols = names(df)))
# pivot into the desired output
ans <- dcast(mdat, ID ~ value1, value.var = "value2")
ans
ID 6 7 8 9
1: 1 23 25 27 NA
2: 2 24 26 NA 26
3: 3 NA 23 27 28
And if you really need the "Week_" in your column names, you can use
setnames(ans, names(ans)[-1L], paste0("Week_", names(ans)[-1L]))  # paste0, or paste() would insert a space
Another tidyverse solution, using a double gather with a final spread:
df %>%
  gather(k, v, -ID, -starts_with("Weight")) %>%
  separate(k, into = c("k1", "k2")) %>%
  unite(k1, k1, v) %>%
  gather(k, v, starts_with("Weight")) %>%
  separate(k, into = c("k3", "k4")) %>%
  filter(k2 == k4) %>%
  select(-k2, -k3, -k4) %>%
  spread(k1, v)
# ID Week_6 Week_7 Week_8 Week_9
#1 1 23 25 27 NA
#2 2 24 26 NA 26
#3 3 NA 23 27 28
In base R, it's a double reshape: first to long, then back to wide on a different variable:
tmp <- reshape(df, idvar = "ID",
               varying = lapply(c("Week_", "Weight_"), grep, names(df)),
               v.names = c("time", "Week"), direction = "long")
reshape(tmp, idvar = "ID", direction = "wide", sep = "_")
# ID Week_6 Week_7 Week_8 Week_9
#1.1 1 23 25 27 NA
#2.1 2 24 26 NA 26
#3.1 3 NA 23 27 28
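With tidyr >= 1.0, the whole round trip can also be written with pivot_longer() and pivot_wider(); a sketch:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-ID, names_to = c(".value", "visit"), names_sep = "_") %>%
  select(-visit) %>%
  pivot_wider(names_from = Week, values_from = Weight, names_prefix = "Week_")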

Selecting the top rows of a data frame by group

I have a data frame such as:
set.seed(1)
df <- data.frame(
sample = 1:50,
value = runif(50),
group = c(rep(NA, 20), gl(3, 10)))
I want to select the top 10 samples based on value. However, if a sample belongs to a group, I only want to include one sample from that group. If group is NA, I want to include all such samples. Arranging df by value looks like:
df_top <- df %>%
arrange(-value) %>%
top_n(10, value)
sample value group
1 46 0.7973088 3
2 49 0.8108702 3
3 22 0.8394404 1
4 2 0.8612095 NA
5 27 0.8643395 1
6 20 0.8753213 NA
7 44 0.8762692 3
8 26 0.8921983 1
9 11 0.9128759 NA
10 30 0.9606180 1
I would want to include samples 36, 22, 2, 20, 11, and the next five highest values in my data frame that continue to fit the pattern. How do I accomplish this?
I think I figured this out. Would this be the best way?
df_top <- df %>%
  arrange(-value) %>%
  group_by(group) %>%
  filter(ifelse(!is.na(group), value == max(value), value == value)) %>%
  ungroup() %>%
  top_n(10, value)
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 18 0.992 NA
2 7 0.945 NA
3 21 0.935 1
4 4 0.908 NA
5 6 0.898 NA
6 35 0.827 2
7 41 0.821 3
8 20 0.777 NA
9 15 0.770 NA
10 17 0.718 NA
A similar method that uses slice instead of filter:
library(dplyr)
df_top <- df %>%
  arrange(-value) %>%
  group_by(group) %>%
  slice(if (any(!is.na(group))) 1 else 1:n()) %>%  # one row per real group, all NA rows
  ungroup() %>%
  top_n(10, value)
Result:
# A tibble: 10 x 3
sample value group
<int> <dbl> <int>
1 21 0.9347052 1
2 35 0.8273733 2
3 41 0.8209463 3
4 18 0.9919061 NA
5 7 0.9446753 NA
6 4 0.9082078 NA
7 6 0.8983897 NA
8 20 0.7774452 NA
9 15 0.7698414 NA
10 17 0.7176185 NA
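With dplyr >= 1.0, the same idea reads a little more directly with slice_max(); a sketch:
library(dplyr)
df %>%
  group_by(group) %>%
  # keep every NA-group row, but only the best row per real group
  filter(is.na(group) | value == max(value)) %>%
  ungroup() %>%
  slice_max(value, n = 10)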
