R: Merging rows by duplicates in first column

I have a large dataset with duplicated values in the first column, like so:
ID date var1 var2
person1 052016 509 1678
person2 122016 301 NA
person1 072016 NA 45
I want to combine the rows for each ID, taking the most recent value by "date"; if that value is NA, I want the latest value that is not NA.
The output should be like this:
ID date var1 var2
person2 122016 301 NA
person1 072016 509 45
I have tried this, but it didn't work.
library(dplyr)
data %>% group_by(ID) %>% summarise_all(funs(max(data$date))) %>% funs(first(.[!is.na(.)]))
What should I use to apply a working code to the whole dataset?

A solution using dplyr.
library(dplyr)
dat2 <- dat %>%
arrange(ID, desc(date)) %>%
group_by(ID) %>%
summarise_all(funs(first(.[!is.na(.)]))) %>%
ungroup()
dat2
# # A tibble: 2 x 4
# ID date var1 var2
# <chr> <int> <int> <int>
# 1 person1 72016 509 45
# 2 person2 122016 301 NA
DATA
dat <- read.table(text = "ID date var1 var2
person1 '052016' 509 1678
person2 '122016' 301 NA
person1 '072016' NA 45",
header = TRUE, stringsAsFactors = FALSE)
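One caveat with the answer above: the date column is in MMYYYY form, and once it is read as an integer it only sorts chronologically within a single year (122015 is numerically larger than 012016, for example). The following is a minimal sketch, not the answerer's code, that assumes the dates are kept as MMYYYY strings and converts them to real dates before sorting. It also uses across() because funs() is deprecated in current dplyr. The choice of the first day of the month is an assumption made only so that as.Date() has a complete date to parse.

```r
library(dplyr)

# Example data from the question; dates kept as MMYYYY strings.
dat <- data.frame(
  ID   = c("person1", "person2", "person1"),
  date = c("052016", "122016", "072016"),
  var1 = c(509L, 301L, NA),
  var2 = c(1678L, NA, 45L),
  stringsAsFactors = FALSE
)

dat2 <- dat %>%
  # Prefix a day (assumption: day 01) so MMYYYY parses as a real Date.
  mutate(date = as.Date(paste0("01", date), format = "%d%m%Y")) %>%
  # Most recent row first within each ID.
  arrange(ID, desc(date)) %>%
  group_by(ID) %>%
  # First non-NA value per column, i.e. the most recent one available.
  summarise(across(everything(), ~ first(.x[!is.na(.x)])), .groups = "drop")

dat2
```

With the sample data this keeps var1 = 509 and var2 = 45 for person1, and leaves var2 as NA for person2 because no non-NA value exists.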

Using the tidyverse and the fill() function from tidyr.
Load data:
Mar_df <- structure(list(ID = structure(c(1L, 2L, 1L), .Label = c("person1",
"person2"), class = "factor"), date = c(52016L, 122016L, 72016L
), var1 = c(509L, 301L, NA), var2 = c(1678L, NA, 45L)), .Names = c("ID",
"date", "var1", "var2"), class = "data.frame", row.names = c(NA,
-3L))
Then:
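library(tidyverse)
Mar_df_summarised <- Mar_df %>%
arrange(ID, date) %>%
# group before filling so values are never carried from one ID to the next
group_by(ID) %>%
fill(var1, var2, .direction = "down") %>%
summarise_all(.funs = funs(last(.)))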
The result is:
# A tibble: 2 x 4
ID date var1 var2
<fctr> <int> <int> <int>
1 person1 72016 509 45
2 person2 122016 301 NA

Related

How to reshape a complicated data frame in R?

I have a complicated data frame that I'm trying to reshape.
Here is an example of the type of data frame that I have:
names <- c("var1", 'var2', "split")
values <- rnorm(8)
from <- data.frame(a = rep(1, 10),
b = c(rep(1,3), rep(2, 7)),
c = c(names, names, rep("split", 4)),
d = c(rep("NA", 5), names, rep("split", 2)),
e = c(rep("NA", 7), names),
f = c(values[1:2], "NA", values[3:8], "NA"))
And this produces something that looks like this:
> from
a b c d e f
1 1 1 var1 NA NA -0.271930473373158
2 1 1 var2 NA NA -0.0968100775823158
3 1 1 split NA NA NA
4 1 2 var1 NA NA -1.73919094720254
5 1 2 var2 NA NA -0.52398152119997
6 1 2 split var1 NA 0.856367467674763
7 1 2 split var2 NA -0.729762707907525
8 1 2 split split var1 0.561460771889416
9 1 2 split split var2 0.0432022687633195
10 1 2 split split split NA
Inside my data frame from, I want to take var1 and var2 and turn them into columns. And then use the value from column f in from as the values that correspond to var1 and var2 (reading row-wise).
In other words, I am trying to reshape this data frame into something that looks like this:
> out
a b var1 var2
1 1 1 -0.2719305 -0.09681008
2 1 2 -1.7391909 -0.52398152
3 1 2 0.8563675 -0.72976271
4 1 2 0.5614608 0.04320227
Any suggestions as to how I could do this?
We could reshape to 'long' with pivot_longer, remove the NA elements, filter to keep only the 'var' elements, and then reshape back to 'wide' with pivot_wider.
library(dplyr)
library(tidyr)
library(stringr)
library(data.table)
from %>%
type.convert(as.is = TRUE) %>%
pivot_longer(cols = c:e, values_drop_na = TRUE) %>%
filter(str_detect(value, 'var')) %>%
select(-name) %>%
mutate(rn = rowid(a, b, value)) %>%
pivot_wider(names_from = value, values_from = f) %>%
select(-rn)
-output
# A tibble: 4 × 4
a b var1 var2
<int> <int> <dbl> <dbl>
1 1 1 -0.272 -0.0968
2 1 2 -1.74 -0.524
3 1 2 0.856 -0.730
4 1 2 0.561 0.0432
data
from <- structure(list(a = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
b = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), c = c("var1",
"var2", "split", "var1", "var2", "split", "split", "split",
"split", "split"), d = c("NA", "NA", "NA", "NA", "NA", "var1",
"var2", "split", "split", "split"), e = c("NA", "NA", "NA",
"NA", "NA", "NA", "NA", "var1", "var2", "split"), f = c("-0.271930473373158",
"-0.0968100775823158", "NA", "-1.73919094720254", "-0.52398152119997",
"0.856367467674763", "-0.729762707907525", "0.561460771889416",
"0.0432022687633195", "NA")), row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"), class = "data.frame")
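The rn helper column in the answer above is what keeps the repeated (a, b) combinations apart: without a unique key, pivot_wider() cannot tell which var1 belongs with which var2 and falls back to list-columns with a warning. The sketch below shows the same step using only dplyr, with group_by() plus row_number() standing in for data.table::rowid(). The long table here is a small hand-made stand-in with rounded values, not the real intermediate result.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format table: one row per (a, b, variable name).
long <- tibble(
  a     = 1L,
  b     = c(1L, 1L, 2L, 2L, 2L, 2L),
  value = c("var1", "var2", "var1", "var2", "var1", "var2"),
  f     = c(-0.27, -0.10, -1.74, -0.52, 0.86, -0.73)
)

wide <- long %>%
  # Number repeated occurrences of each name within (a, b).
  group_by(a, b, value) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  # rn is part of the row key, so each var1 pairs with its var2.
  pivot_wider(names_from = value, values_from = f) %>%
  select(-rn)

wide
```

This avoids loading data.table just for rowid(), at the cost of one extra grouping step.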
Here is a solution with one time pivoting:
library(dplyr)
library(tidyr)
library(stringr)
from %>%
type.convert(as.is = TRUE) %>%
filter(!is.na(f)) %>%
mutate(name = str_extract_all(paste(c,d,e), 'var(.)')) %>%
select(a, b, f, name) %>%
pivot_wider(
names_from = name,
values_from = f,
values_fn = list
) %>%
unnest(cols = c(var1, var2))
a b var1 var2
<int> <int> <dbl> <dbl>
1 1 1 -0.272 -0.0968
2 1 2 -1.74 -0.524
3 1 2 0.856 -0.730
4 1 2 0.561 0.0432
This can be achieved by coupling a series of logical operations to get the values in from$f:
data.frame( a=from$a[rowSums(from == "var1", na.rm=T) == 1],
b=from$b[rowSums(from == "var1", na.rm=T) == 1],
var1=from$f[rowSums(from == "var1", na.rm=T) == 1],
var2=from$f[rowSums(from == "var2", na.rm=T) == 1] )
a b var1 var2
1 1 1 -0.2719305 -0.09681008
2 1 2 -1.7391909 -0.52398152
3 1 2 0.8563675 -0.72976271
4 1 2 0.5614608 0.04320227
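The idea is to number the rows within each variable name with row_number() and use that id when pivoting:
library(dplyr)
library(tidyr)
library(purrr) # provides invoke()
from %>%
type.convert(as.is = TRUE) %>%
filter(!is.na(f)) %>%
# first entry among c, d, e that is neither NA nor 'split' is the variable name
group_by(name = invoke(coalesce, across(c:e, ~ na_if(.x, 'split')))) %>%
mutate(id = row_number()) %>%
ungroup() %>%
pivot_wider(id_cols = c(a, b, id), values_from = f) %>%
select(-id)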
# A tibble: 4 x 4
a b var1 var2
<int> <int> <dbl> <dbl>
1 1 1 -0.272 -0.0968
2 1 2 -1.74 -0.524
3 1 2 0.856 -0.730
4 1 2 0.561 0.0432

Return value closest to date between 2 tables

I have 2 tables, both have a common ID that needs to be used to retrieve another value closest to the first table's date column.
Table_1
ID date_1
1 2/3/2021
2 4/19/2019
3 1/6/2020
Table_2
ID date_2 value
1 2/1/2021 x
1 4/19/2021 y
1 1/6/2020 z
2 5/19/2019 g
2 4/11/2019 a
3 4/11/2019 bb
3 7/17/2019 cc
3 1/16/2020 dd
And the goal is to add another column to table_1 to return the value from table_2 for the same ID that is closest to the date. In other words, I need to return the value from table_2 that shares the same ID value and has the minimum difference between date_1 and date_2.
Example:
ID date_1 result
1 2/3/2021 x
2 4/19/2021 a
3 1/6/2020 dd
I found an INDEX/MATCH approach for this in Excel, but I would like to do it in R. I'm unsure whether a join would be the best way or whether there is a more iterative way to solve this.
Please help?
Here's a way using dplyr and lubridate.
library(dplyr)
library(lubridate)
table_1 <- read.table(text='
ID date_1
1 2/3/2021
2 4/19/2019
3 1/6/2020', header=T)
table_1$date_1 <- mdy(table_1$date_1)
table_2 <- read.table(text='
ID date_2 value
1 2/1/2021 x
1 4/19/2021 y
1 1/6/2020 z
2 5/19/2019 g
2 4/11/2019 a
3 4/11/2019 bb
3 7/17/2019 cc
3 1/16/2020 dd', header=T)
table_2$date_2 <- mdy(table_2$date_2)
new_table_1 <-
table_2 %>%
left_join(table_1, by = 'ID') %>%
mutate(result = abs(date_2 - date_1)) %>%
group_by(ID) %>%
slice(which.min(result)) %>%
select(ID, date_1, value)
new_table_1
# A tibble: 3 x 3
# Groups: ID [3]
ID date_1 value
<int> <date> <chr>
1 1 2021-02-03 x
2 2 2019-04-19 a
3 3 2020-01-06 dd
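A variant of the join approach above, sketched under a few assumptions: the dates are already of class Date, the join starts from table_1, and slice_min() replaces slice(which.min(...)) so that tie handling is explicit. The column names gap and result are illustrative choices, not names from the question's data.

```r
library(dplyr)

table_1 <- data.frame(
  ID     = 1:3,
  date_1 = as.Date(c("2021-02-03", "2019-04-19", "2020-01-06"))
)

table_2 <- data.frame(
  ID     = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
  date_2 = as.Date(c("2021-02-01", "2021-04-19", "2020-01-06",
                     "2019-05-19", "2019-04-11", "2019-04-11",
                     "2019-07-17", "2020-01-16")),
  value  = c("x", "y", "z", "g", "a", "bb", "cc", "dd"),
  stringsAsFactors = FALSE
)

closest <- table_1 %>%
  # One row per (table_1 row, candidate table_2 row) with the same ID.
  left_join(table_2, by = "ID") %>%
  # Absolute distance in days between the two dates.
  mutate(gap = abs(as.numeric(date_2 - date_1))) %>%
  group_by(ID) %>%
  # Keep the single closest candidate; with_ties = FALSE picks one on ties.
  slice_min(gap, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(ID, date_1, result = value)

closest
```

For the sample tables this selects x, a and dd for IDs 1, 2 and 3. If two candidates are equally close, with_ties = FALSE keeps only the first of them; set it to TRUE to see all of them.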
You can also use the following solution. It is essential that you transform your date columns to date class before using this code:
library(dplyr)
library(lubridate)
Table_1 %>%
mutate(date_1 = mdy(date_1)) %>%
rowwise() %>%
mutate(Min = Table_2$value[Table_2$ID == ID][which.min(abs(date_1 - Table_2$date_2[Table_2$ID == ID]))])
# A tibble: 3 x 3
# Rowwise:
ID date_1 Min
<int> <date> <chr>
1 1 2021-02-03 x
2 2 2019-04-19 a
3 3 2020-01-06 dd
Data
Table_2
structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), date_2 = structure(c(18659,
18736, 18267, 18035, 17997, 17997, 18094, 18277), class = "Date"),
value = c("x", "y", "z", "g", "a", "bb", "cc", "dd")), class = "data.frame", row.names = c(NA,
-8L))

Pivot Wide with Custom Names, Original Values in the cell

I have data that is set up like the following - the CODE variable is character and needs to remain as it is because the numbers have meaning.
ID CODE
1 1.0
1 0.00
1 9.99
2 40.56
3 33.54
3 0.00
How would I use pivot_wider() to rearrange it like the following, with 4 CODE columns, where any code an ID does not have is just left blank?
ID CODE_1 CODE_2 CODE_3 CODE_4
1 1.0 0.00 9.99 "."
2 40.56 "." "." "."
3 33.54 0.00 "." "."
Thank you!
This approach should be close to what you want. You can use the tidyverse function complete() to add the levels that are not present in your original values. Here is the code:
library(tidyverse)
#Code
df <- df %>% group_by(ID) %>% mutate(Var=factor(paste0('CODE_',row_number()),
levels = paste0('CODE_',1:4),
labels = paste0('CODE_',1:4),ordered = T,
exclude = F)) %>%
complete(Var = Var) %>%
pivot_wider(names_from = Var,values_from=CODE)
Output:
# A tibble: 3 x 5
# Groups: ID [3]
ID CODE_1 CODE_2 CODE_3 CODE_4
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 9.99 NA
2 2 40.6 NA NA NA
3 3 33.5 0 NA NA
Some data used:
#Data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), CODE = c(1, 0,
9.99, 40.56, 33.54, 0)), class = "data.frame", row.names = c(NA,
-6L))
If you really want dots for missing values, you have to convert the variables to character and then replace the NAs like this:
#Code 2
df <- df %>% group_by(ID) %>% mutate(Var=factor(paste0('CODE_',row_number()),
levels = paste0('CODE_',1:4),
labels = paste0('CODE_',1:4),ordered = T,
exclude = F)) %>%
complete(Var = Var) %>%
pivot_wider(names_from = Var,values_from=CODE) %>%
mutate(across(CODE_1:CODE_4,~as.character(.))) %>%
replace(is.na(.),'.')
Output:
# A tibble: 3 x 5
# Groups: ID [3]
ID CODE_1 CODE_2 CODE_3 CODE_4
<int> <chr> <chr> <chr> <chr>
1 1 1 0 9.99 .
2 2 40.56 . . .
3 3 33.54 0 . .
We can use dcast from data.table
library(data.table)
dcast(setDT(df), ID ~ paste0("CODE_", rowid(ID)), value.var = 'CODE')
# ID CODE_1 CODE_2 CODE_3
#1: 1 1.00 0 9.99
#2: 2 40.56 NA NA
#3: 3 33.54 0 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), CODE = c(1, 0,
9.99, 40.56, 33.54, 0)), class = "data.frame", row.names = c(NA,
-6L))
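The question says that CODE is a character column whose exact formatting matters (for example "1.0" and "0.00"), while the sample data used in the answers above stores it as numeric, so that formatting is lost. Below is a minimal sketch that assumes CODE really is character. It uses the fill argument of complete() to write the dot placeholders directly, and the helper column called name is an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Assumed input: CODE stored as character, as described in the question.
df <- data.frame(
  ID   = c(1L, 1L, 1L, 2L, 3L, 3L),
  CODE = c("1.0", "0.00", "9.99", "40.56", "33.54", "0.00"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  group_by(ID) %>%
  # Label each code by its position within the ID.
  mutate(name = paste0("CODE_", row_number())) %>%
  ungroup() %>%
  # Make sure every ID has all four slots; missing ones get ".".
  complete(ID, name = paste0("CODE_", 1:4), fill = list(CODE = ".")) %>%
  pivot_wider(names_from = name, values_from = CODE)

wide
```

Because CODE never passes through a numeric type, values such as "1.0" and "0.00" come out exactly as they went in.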

R function to paste information from different rows with a common column? [duplicate]

I understand we can use the dplyr function coalesce() to unite different columns, but is there such function to unite rows?
I am struggling with a confusing incomplete/doubled dataframe with duplicate rows for the same id, but with different columns filled. E.g.
id sex age source
12 M NA 1
12 NA 3 1
13 NA 2 2
13 NA NA NA
13 F 2 NA
and I am trying to achieve:
id sex age source
12 M 3 1
13 F 2 2
You can try:
library(dplyr)
library(tidyr) # provides fill()
#Data
df <- structure(list(id = c(12L, 12L, 13L, 13L, 13L), sex = structure(c(2L,
NA, NA, NA, 1L), .Label = c("F", "M"), class = "factor"), age = c(NA,
3L, 2L, NA, 2L), source = c(1L, 1L, 2L, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
df %>%
group_by(id) %>%
fill(everything(), .direction = "down") %>%
fill(everything(), .direction = "up") %>%
slice(1)
# A tibble: 2 x 4
# Groups: id [2]
id sex age source
<int> <fct> <int> <int>
1 12 M 3 1
2 13 F 2 2
As mentioned by @A5C1D2H2I1M1N2O1R2T1, you can select the first non-NA value in each group. This can be done using dplyr:
library(dplyr)
df %>% group_by(id) %>% summarise(across(everything(), ~ na.omit(.x)[1]))
# A tibble: 2 x 4
# id sex age source
# <int> <fct> <int> <int>
#1 12 M 3 1
#2 13 F 2 2
Base R:
aggregate(.~id, df, function(x) na.omit(x)[1], na.action = 'na.pass')
Or data.table:
library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), id]

Equivalent of summarise_all for group_by and slice

I'm currently using group_by then slice, to get the maximum dates in my data. There are a few rows where the date is NA, and when using slice(which.max(END_DT)), the NAs end up getting dropped. Is there an equivalent of summarise_all, so that I can keep the NAs in my data?
ID Date INitials
1 01-01-2020 AZ
1 02-01-2020 BE
2 NA CC
I'm using
df %>%
group_by(ID) %>%
slice(which.max(Date))
I need the final results to look like below, but it's dropping the NA entirely
ID Date Initials
1 02-01-2020 BE
2 NA CC
which.max() is not suitable in this case because (1) it drops missing values and (2) it only finds the first position of maxima. Here is a general solution:
library(dplyr)
df %>%
mutate(Date = as.Date(Date, "%m-%d-%Y")) %>%
group_by(ID) %>%
filter(Date == max(Date) | all(is.na(Date)))
# # A tibble: 2 x 3
# # Groups: ID [2]
# ID Date INitials
# <int> <date> <fct>
# 1 1 2020-02-01 BE
# 2 2 NA CC
df <- structure(list(ID = c(1L, 1L, 2L), Date = structure(c(1L, 2L,
NA), .Label = c("01-01-2020", "02-01-2020"), class = "factor"),
INitials = structure(1:3, .Label = c("AZ", "BE", "CC"), class = "factor")),
class = "data.frame", row.names = c(NA, -3L))
It's dropping the NA because you're asking it to find the max date, which an NA can never be. If you want to go the which.max route, then I'd keep the sliced result, run the dataset again using filter to grab the NA(s), and bind them to that result.
df.max <- df %>% group_by(ID) %>% slice(which.max(Date)) %>% ungroup()
df.1 <- df %>% filter(is.na(Date))
df <- bind_rows(df.max, df.1)
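One case the answers above do not cover is an ID that has both dated rows and NA rows. With filter(Date == max(Date) | all(is.na(Date))), max(Date) is NA for such a group and every row of it is dropped. The sketch below is one possible alternative, assuming Date is already of class Date and that a single row per ID is wanted. ID 3 is an added, hypothetical example of the mixed case.

```r
library(dplyr)

df <- data.frame(
  ID       = c(1L, 1L, 2L, 3L, 3L),
  Date     = as.Date(c("2020-01-01", "2020-02-01", NA, "2020-03-01", NA)),
  INitials = c("AZ", "BE", "CC", "DD", "EE"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(ID) %>%
  # Within each ID: rows with a date first, latest date on top.
  arrange(is.na(Date), desc(Date), .by_group = TRUE) %>%
  # Keep the top row; an all-NA group keeps its NA row.
  slice(1) %>%
  ungroup()

res
```

Here ID 1 keeps BE, ID 2 keeps its NA row (CC), and ID 3 keeps DD rather than disappearing.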
