I want to reshape the original dataframe into the target dataframe as follows.
But first, to recreate dataframes:
original <- data.frame(caseid = c("id101", 'id201', 'id202', 'id301', 'id302'),
age_child1 = c('3', '5', '8', NA, NA),
age_child2 = c('1', '7', NA, NA, NA),
age_child3 = c('2', '6', '8', '3', NA))
target <- data.frame(caseid = c('id101_1', 'id101_2', 'id101_3', 'id201_1', 'id201_2', 'id201_3', 'id202_1', 'id202_3', 'id301_3'),
age = c(3, 1, 2, 5, 7, 6, 8, 8, 3))
The caseid column represents mothers. I want to create a new caseid row for each of the children and add the respective 'age' value to the age column. If no 'age' value is available, it means there is no nth child and no new row should be created.
Thanks for the help!
You can use pivot_longer() and its various helpful options:
library(dplyr)
library(tidyr)

pivot_longer(original, cols = starts_with("age"), names_prefix = "age_child",
             values_to = "age", values_transform = as.integer) %>%
  filter(!is.na(age)) %>%
  mutate(caseid = paste0(caseid, "_", name)) %>%
  select(-name)
Output:
# A tibble: 9 × 2
caseid age
<chr> <int>
1 id101_1 3
2 id101_2 1
3 id101_3 2
4 id201_1 5
5 id201_2 7
6 id201_3 6
7 id202_1 8
8 id202_3 8
9 id301_3 3
Using reshape from base R:
original <- data.frame(caseid = c("id101", 'id201', 'id202', 'id301', 'id302'),
age_child1 = c('3', '5', '8', NA, NA),
age_child2 = c('1', '7', NA, NA, NA),
age_child3 = c('2', '6', '8', '3', NA))
a <- reshape(original, varying = c("age_child1", "age_child2", "age_child3"),
             direction = "long",
             times = c("_1", "_2", "_3"),
             v.names = "age")
a$caseid <- paste0(a$caseid, a$time)
a <- a[order(a$caseid), ][c("caseid", "age")]
a <- na.omit(a)
row.names(a) <- NULL
a
#> caseid age
#> 1 id101_1 3
#> 2 id101_2 1
#> 3 id101_3 2
#> 4 id201_1 5
#> 5 id201_2 7
#> 6 id201_3 6
#> 7 id202_1 8
#> 8 id202_3 8
#> 9 id301_3 3
Created on 2022-06-01 by the reprex package (v2.0.1)
Another pivot_longer() option extracts the child number with names_pattern and drops the missing ages via values_drop_na:
original %>%
  pivot_longer(-caseid, names_to = 'child', names_pattern = '([0-9]+$)',
               values_to = 'age', values_drop_na = TRUE) %>%
  unite(caseid, caseid, child)
# A tibble: 9 x 2
caseid age
<chr> <chr>
1 id101_1 3
2 id101_2 1
3 id101_3 2
4 id201_1 5
5 id201_2 7
6 id201_3 6
7 id202_1 8
8 id202_3 8
9 id301_3 3
Say that I have a df.
And I want to change it into a long data format.
I found this question (Long pivot for multiple variables using Pivot long), which is similar to mine.
But I got an error when I ran the code below, and I don't know why.
What I expect should look like df_expected.
library(tidyverse)
df = data.frame(
dis = 'cvd',
pollution = 'pm2.5',
lag_day = '2',
b1.x = 1,
b1_ci.x = 2,
PC.x = 3,
pc_ci.x = 4,
b1.y =5,
b1_ci.y = 6,
PC.y = 7,
pc_ci.y = 8
)
# df
# dis pollution lag_day b1.x b1_ci.x PC.x pc_ci.x b1.y b1_ci.y PC.y pc_ci.y
# 1 cvd pm2.5 2 1 2 3 4 5 6 7 8
df %>%
pivot_longer(
cols = -c(dis:lag_day),
names_to = c('.value', 'from'),
names_sep = '.'
) # error code
# Error: Input must be a vector, not NULL.
# Run `rlang::last_error()` to see where the error occurred.
# In addition: Warning message:
# Expected 2 pieces. Additional pieces discarded in 8 rows [1, 2, 3, 4, 5, 6, 7, 8].
df_expected = data.frame(
dis = 'cvd',
pollution = 'pm2.5',
lag_day = '2',
from = c('x', 'y'),
b1 = c(1,5),
b1_ci = c(2,6),
PC = c(3, 7),
pc_ci = c(4, 8)
)
# df_expected
# dis pollution lag_day from b1 b1_ci PC pc_ci
# 1 cvd pm2.5 2 x 1 2 3 4
# 2 cvd pm2.5 2 y 5 6 7 8
The problem is that names_sep is treated as a regular expression, and '.' matches any character, so each name is split at every character and the pieces tidyr keeps are empty, which triggers the error. Splitting on a literal dot (here via a character class) works:
df %>%
  pivot_longer(-c(dis, pollution, lag_day),
               names_to = c('.value', 'from'), names_sep = '[.]')
# A tibble: 2 x 8
dis pollution lag_day from b1 b1_ci PC pc_ci
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 cvd pm2.5 2 x 1 2 3 4
2 cvd pm2.5 2 y 5 6 7 8
In base R:
reshape(df, -(1:3), direction = 'long')
or even:
reshape(df, -(1:3), idvar = 1:3, direction = 'long')
dis pollution lag_day time b1 b1_ci PC pc_ci
cvd.pm2.5.2.x cvd pm2.5 2 x 1 2 3 4
cvd.pm2.5.2.y cvd pm2.5 2 y 5 6 7 8
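As in the earlier reshape() example, the compound row names (cvd.pm2.5.2.x and so on) can be dropped afterwards. A small sketch, assuming the result is saved as res (an illustrative name):
res <- reshape(df, -(1:3), idvar = 1:3, direction = 'long')
rownames(res) <- NULL
res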
How can I convert my data from this:
example <- data.frame(RTD_1_LOC = c('A', 'B'), RTD_2_LOC = c('C', 'D'),
RTD_3_LOC = c('E', 'F'), RTD_4_LOC = c('G', 'H'),
RTD_5_LOC = c('I', 'J'),RTD_1_OFF = c('1', '2'), RTD_2_OFF = c('3', '4'),
RTD_3_OFF = c('5', '6'), RTD_4_OFF = c('7', '8'),
RTD_5_OFF = c('9', '10'))
to this:
example2 <- data.frame(RTD = c(1,1,2,2,3,3,4,4,5,5),LOC = c('A', 'B','C','D','E','F','G','H','I','J'),
OFF = c(1,2,3,4,5,6,7,8,9,10))
I have been using tidyverse gather, but I end up with about 50 columns
ex <- gather(example,RTD, Location, RTD_1_LOC:RTD_5_LOC)
ex$RTD <- sub('_LOC',"",ex$RTD)
ex2 <- gather(ex, RTD, Offset, RTD_1_OFF:RTD_5_OFF)
ex2$RTD <- sub('_OFF', "", ex2$RTD)
We can use pivot_longer from tidyr and specify names_pattern to capture groups from the column names. Because the 'RTD' part should be kept as an identifier, pass a vector of 'RTD' and '.value' to names_to: 'RTD' receives the captured digits ((\d+)), while the captured word ((\w+)), i.e. 'LOC' or 'OFF', becomes a new column holding the values.
library(dplyr)
library(tidyr)
example %>%
pivot_longer(cols = everything(),
names_to = c("RTD", ".value"), names_pattern = "\\w+_(\\d+)_(\\w+)")
-output
# A tibble: 10 x 3
RTD LOC OFF
<chr> <chr> <chr>
1 1 A 1
2 2 C 3
3 3 E 5
4 4 G 7
5 5 I 9
6 1 B 2
7 2 D 4
8 3 F 6
9 4 H 8
10 5 J 10
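If numeric types are wanted for RTD and OFF (as in example2), the result can be passed through type.convert afterwards; a minimal sketch of that extra step:
example %>%
  pivot_longer(cols = everything(),
               names_to = c("RTD", ".value"), names_pattern = "\\w+_(\\d+)_(\\w+)") %>%
  type.convert(as.is = TRUE)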
I have a df where one variable is an integer. I'd like to split this column into its individual digits. See my example below
Group Number
A 456
B 3
C 18
To
Group Number Digit1 Digit2 Digit3
A 456 4 5 6
B 3 3 NA NA
C 18 1 8 NA
We can use read.fwf from base R. Find the maximum number of characters (nchar) in the 'Number' column (mx), read the 'Number' column as fixed-width fields after converting it to character (as.character), specifying 'widths' as 1 replicated mx times, and assign the output to new 'Digit' columns in the data.
mx <- max(nchar(df1$Number))
df1[paste0("Digit", seq_len(mx))] <- read.fwf(textConnection(
as.character(df1$Number)), widths = rep(1, mx))
-output
df1
# Group Number Digit1 Digit2 Digit3
#1 A 456 4 5 6
#2 B 3 3 NA NA
#3 C 18 1 8 NA
data
df1 <- structure(list(Group = c("A", "B", "C"), Number = c(456L, 3L,
18L)), class = "data.frame", row.names = c(NA, -3L))
Another base R option (I think @akrun's approach using read.fwf is much simpler)
cbind(
  df,
  with(
    df,
    type.convert(
      `colnames<-`(
        do.call(
          rbind,
          lapply(
            strsplit(as.character(Number), ""),
            `length<-`, max(nchar(Number))
          )
        ),
        paste0("Digit", seq(max(nchar(Number))))
      ),
      as.is = TRUE
    )
  )
)
which gives
Group Number Digit1 Digit2 Digit3
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Using splitstackshape::cSplit
splitstackshape::cSplit(df, 'Number', sep = '', stripWhite = FALSE, drop = FALSE)
# Group Number Number_1 Number_2 Number_3
#1: A 456 4 5 6
#2: B 3 3 NA NA
#3: C 18 1 8 NA
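If the Digit1..Digit3 names from the question are preferred over Number_1..Number_3, the split columns can be renamed afterwards. A small sketch, assuming the result is saved as out (an illustrative name):
out <- splitstackshape::cSplit(df, 'Number', sep = '', stripWhite = FALSE, drop = FALSE)
data.table::setnames(out, sub("Number_", "Digit", names(out)))
out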
Updated
I realized I could use the max function to count the character limit in each row, so I could include it in my map2 call and save a few lines of code, thanks to an accident that led to an inspiration from @ThomasIsCoding.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df %>%
rowwise() %>%
mutate(map2_dfc(Number, 1:max(nchar(Number)), ~ str_sub(.x, .y, .y))) %>%
unnest(cols = !c(Group, Number)) %>%
rename_with(~ str_replace(., "\\.\\.\\.", "Digit"), .cols = !c(Group, Number)) %>%
mutate(across(!c(Group, Number), as.numeric, na.rm = TRUE))
# A tibble: 3 x 5
Group Number Digit1 Digit2 Digit3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Data
df <- tribble(
~Group, ~Number,
"A", 456,
"B", 3,
"C", 18
)
Two base R methods:
no_cols <- max(nchar(as.character(df1$Number)))
# Using `strsplit()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(strsplit(as.character(df1$Number), ""),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
# Using `regmatches()` and `gregexpr()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(regmatches(df1$Number, gregexpr("\\d", df1$Number)),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
I've been trying to work this out for two days.
I have a dataframe with more than 2 million observations with this structure:
id = c(1,2,3,4,5,6,7,8,9,10,11,12)
group = c(1,1,1,1,2,2,2,2,3,3,3,3)
sex = c('M','F', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F')
time = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)
I would like to find, for each female, the male with the closest time, within the same group.
For example, id = 2 is a female in group 1 with time = 11; the closest male in group 1 is id = 3.
And so on for each female in each group.
I tried to use something like this
keep <- function(x){
a <- df[which.min(abs(df[which(df[,'sex'] == "M"),'time']-x[,'time'])),]
return(a)
}
apply(df, 1, keep)
But it does not work.
If someone can help me it would be great.
Are you after something like below?
library(data.table)

setDT(df)[
,
c(
.SD[sex == "F"],
.(closestM_id = id[sex == "M"][max.col(-abs(outer(
time[sex == "F"],
time[sex == "M"], "-"
)))])
), group
]
which gives
group id sex time closestM_id
1: 1 2 F 11.0 3
2: 2 6 F 15.0 5
3: 2 7 F 9.0 5
4: 2 8 F 7.4 5
5: 3 12 F 21.0 9
Data
> dput(df)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), sex = c("M",
"F", "M", "M", "M", "F", "F", "F", "M", "M", "M", "F"), time = c(10,
11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)), class = "data.frame", row.names = c(NA,
-12L))
A data.table solution using a rolling join to the nearest time.
Using the df from Thomas' answer:
setDT(df)
df[sex=="F",][,closestM_id := df[sex=="M",][df[sex=="F",],
x.id,
on = .(group, time), roll = "nearest"]]
# id group sex time closestM_id
# 1: 2 1 F 11.0 3
# 2: 6 2 F 15.0 5
# 3: 7 2 F 9.0 5
# 4: 8 2 F 7.4 5
# 5: 12 3 F 21.0 9
You could split the data.frame() into groups of males and females, then use outer() to find the absolute time differences for all combinations.
Code:
lapply(split(df, df[, "group"]), function(x){
# split by sex
tmp1 <- split(x, x[, "sex"])
# time difference for every combination
tmp2 <- abs(t(outer(tmp1[["M"]][, "time"], tmp1[["F"]][, "time"], "-")))
# find minimum for each woman (rowwise minimum)
# and connect those numbers with original ID in input data.frame
tmp3 <- tmp1[["M"]][apply(tmp2, 1, which.min), ]
# rownames to represent female ID
rownames(tmp3) <- tmp1[["F"]][, "id"]
# return
tmp3
})
# $`1`
# id group sex time
# 2 3 1 M 11.5
#
# $`2`
# id group sex time
# 6 5 2 M 13.2
# 7 5 2 M 13.2
# 8 5 2 M 13.2
#
# $`3`
# id group sex time
# 12 9 3 M 18
Now each group has its own data.frame(). The rownames() give the female's ID, and each row holds the male in that group with the smallest absolute difference in time.
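If a single data frame is preferred over the per-group list, the pieces can be bound back together. A minimal sketch, assuming the list returned by the lapply() above has been saved as res (res and female_id are illustrative names):
closest <- do.call(rbind, res)
# the row names of each piece carry the female IDs, so collect them into a column
closest$female_id <- as.numeric(unlist(lapply(res, rownames), use.names = FALSE))
closest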
Data
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12),
group = c(1,1,1,1,2,2,2,2,3,3,3,3),
sex = c('M','F', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F'),
time = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21))
Restructuring the data would help. Create a separate data frame for each sex, create a third data set with all unique pairings of males and females, then merge and subset to narrow it down to the desired pairs. expand.grid is very handy for computing these sorts of combinations; after that, dplyr functions can handle the rest of the logic.
library(dplyr)
# create one data set for females
females <- df %>%
filter(sex == "F") %>%
select(f_id = id, f_time = time, f_group = group)
# create one data set for males
males <- df %>%
filter(sex == "M") %>%
select(m_id = id, m_time = time, m_group = group)
# All possible pairings of males and females
pairs <- expand.grid(f_id = females %>% pull(f_id),
m_id = males %>% pull(m_id),
stringsAsFactors = FALSE)
# Merge in information about each individual
pairs <- pairs %>%
left_join(females, by = "f_id") %>%
left_join(males, by = "m_id") %>%
# eliminate any pairings that are in different groups
filter(f_group == m_group)
pairs
Result, potential pairs
f_id m_id f_time f_group m_time m_group
1 2 1 11.0 1 10.0 1
2 2 3 11.0 1 11.5 1
3 2 4 11.0 1 13.0 1
4 6 5 15.0 2 13.2 2
5 7 5 9.0 2 13.2 2
6 8 5 7.4 2 13.2 2
7 12 9 21.0 3 18.0 3
8 12 10 21.0 3 12.0 3
9 12 11 21.0 3 34.5 3
# compute distances and
# subset for the closest male to each female
pairs %>%
mutate(diff = abs(m_time - f_time)) %>%
group_by(f_id) %>%
filter(diff == min(diff)) %>%
select(m_id, f_id)
Output, the closest pairs
# A tibble: 5 x 2
# Groups: f_id [5]
m_id f_id
<dbl> <dbl>
1 3 2
2 5 6
3 5 7
4 5 8
5 9 12
I am newish to R and having trouble with a for loop over unique values.
with the df:
id = c(1,2,2,3,3,4)
rank = c(1,2,1,3,3,4)
df = data.frame(id, rank)
I run:
df$dg <- logical(6)
for(i in unique(df$id)){
ifelse(!unique(df$rank), df$dg ==T, df$dg == F)
}
I am trying to mark the dg variable as TRUE when rank differs within a unique id, and FALSE when rank is the same within that id.
I am not getting any errors, but I am getting FALSE for all values of dg even though I should be getting a mix.
I have also used the following loop with the same results:
for(i in unique(df$id)){
ifelse(length(unique(df$rank)), df$dg ==T, df$dg == F)
}
I have read other similar posts but the advice has not worked for my case.
From Comments:
I want to mark dg TRUE for all instances of an id if rank changed at all for that id. In other words, for a given id, which can have anywhere between 1 and 13 instances, mark dg TRUE if rank differs across those instances.
Update: How to identify groups (ids) that only have one rank?
After the clarification the OP provided, this would be a solution for this particular case:
library(dplyr)
df %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
For another data set (presented below) that also contains an id with duplicated as well as non-duplicated ranks, this would be the output:
df2 %>%
group_by(id) %>%
mutate(dg = ifelse( length(unique(rank))>1 | n() == 1, T, F))
#:OUTPUT:
# Source: local data frame [9 x 3]
# Groups: id [5]
#
# # A tibble: 9 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
# 7 5 1 TRUE
# 8 5 1 TRUE
# 9 5 3 TRUE
Data-no-2:
df2 <- structure(list(id = c(1, 2, 2, 3, 3, 4, 5, 5, 5), rank = c(1, 2, 1, 3, 3, 4, 1, 1, 3
)), .Names = c("id", "rank"), row.names = c(NA, -9L), class = "data.frame")
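As a side note, the original for loop never changes df$dg: ifelse() is called with comparisons (df$dg == T) rather than assignments, its result is discarded, and !unique(df$rank) looks at the whole rank column rather than the ranks of a single id. A base R sketch of the same per-id logic with ave() (an alternative sketch, not part of the dplyr answer above):
# TRUE if rank differs within the id, or if the id has a single instance
df$dg <- as.logical(ave(df$rank, df$id,
                        FUN = function(x) length(unique(x)) > 1 | length(x) == 1))
df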
How to identify duplicated rows within each group (id)?
You can use the dplyr package:
library(dplyr)
df %>%
group_by(id, rank) %>%
mutate(dg = ifelse(n() > 1, F,T))
This will give you:
# Source: local data frame [6 x 3]
# Groups: id, rank [5]
#
# # A tibble: 6 x 3
# id rank dg
# <dbl> <dbl> <lgl>
# 1 1 1 TRUE
# 2 2 2 TRUE
# 3 2 1 TRUE
# 4 3 3 FALSE
# 5 3 3 FALSE
# 6 4 4 TRUE
Note: You can simply convert it back to a data.frame().
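For instance, a minimal sketch of that conversion (result is just an illustrative name):
result <- df %>%
  group_by(id, rank) %>%
  mutate(dg = ifelse(n() > 1, F, T)) %>%
  ungroup() %>%
  as.data.frame()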
A data.table solution would be:
library(data.table)

dt <- data.table(df)
dt$dg <- ifelse(dt[, dg := .N, by = list(id, rank)]$dg > 1, F, T)
Data:
df <- structure(list(id = c(1, 2, 2, 3, 3, 4), rank = c(1, 2, 1, 3,
3, 4)), .Names = c("id", "rank"), row.names = c(NA, -6L), class = "data.frame")
# > df
# id rank
# 1 1 1
# 2 2 2
# 3 2 1
# 4 3 3
# 5 3 3
# 6 4 4
N.B. Unless you want identifiers other than TRUE/FALSE, using ifelse() is redundant and computationally wasteful. @DavidArenburg
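For example, sketches of the same two snippets without ifelse(), relying on the logical result directly:
# dplyr
df %>%
  group_by(id, rank) %>%
  mutate(dg = n() == 1)

# data.table, reusing dt from above
dt[, dg := .N == 1, by = list(id, rank)]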