Merging rows in a dataframe R with duplicate id's

I have a question about merging rows in a dataframe.
I have seen a couple of questions about merging rows, but I have a hard time understanding them and applying them to my situation.
I have a dataframe with a structure like this:
person_id test_date serial_number freezer_number test_1 test_2 test_3 test_4
x 01/01/2010 c d positive NA NA NA
x 05/01/2010 a b NA positive NA NA
y 02/02/2020 e f positive NA NA NA
......................................
I want to merge the rows so that the data in the other columns remains intact (mainly the test date). The test columns and the person_id should be merged so that each individual is in 1 row with multiple tests.
This would be the ideal output:
person_id test_date serial_number freezer_number test_date2 test_1 test_2 test_3 test_4
x 01/01/2010 c d 05/01/2010 positive positive NA NA
y 02/02/2020 e f positive NA NA NA
......................................
How do I go about this? I have tried the "aggregate()" functions before, however this is very unclear to me.
Any help is appreciated, I can give more information to clarify my current code and data frame!

You could use summarize_all, grouped by person_id. This keeps, for each person_id, the first non-NA value of each variable.
I added a pivot_wider to preserve the different test_dates (as pointed out by @Andrea M).
library(dplyr)
library(tidyr) # pivot_wider() comes from tidyr
df1 <- df %>%
group_by(person_id) %>%
mutate(id = seq_along(person_id)) %>%
pivot_wider(names_from = id,
values_from = test_date,
names_prefix = "test_date") %>%
summarize_all(list(~ .[!is.na(.)][1]))
Output
> df1
# A tibble: 2 x 9
person_id serial_number freezer_number test_1 test_2 test_3 test_4 test_date1 test_date2
<chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
1 x c d positive positive NA NA 01/01/2010 05/01/2010
2 y e f positive NA NA NA 02/02/2020 NA
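The collapsing step can also be written with across(), which newer dplyr versions recommend over summarize_all. This is only a sketch on a reduced toy version of the data; the helper name first_non_na is made up for illustration and is not part of dplyr.

```r
library(dplyr)

# Reduced toy data mirroring the question (only two test columns)
df <- data.frame(
  person_id = c("x", "x", "y"),
  test_1    = c("positive", NA, "positive"),
  test_2    = c(NA, "positive", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-NA element, or NA if there is none
first_non_na <- function(v) v[!is.na(v)][1]

# One row per person, keeping the first non-NA value of every other column
collapsed <- df %>%
  group_by(person_id) %>%
  summarise(across(everything(), first_non_na))
collapsed
```

Inside a grouped summarise(), everything() skips the grouping column, so only test_1 and test_2 are collapsed here.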

What you're trying to do is reshape the data from long format (one row per test) to wide format (one row per person, with tests in separate columns). This can be done in many ways, for example with tidyr::pivot_wider().
However, there's a complicating factor: your dataset is not quite in long format, because the test results are already spread over multiple columns. So you first need to fix that.
# Load libraries
library(tidyr)
library(dplyr)
library(stringr)
# Create dataset
df <- tribble(~person_id, ~test_date, ~serial_number, ~freezer_number, ~test_1, ~test_2, ~test_3, ~test_4,
"x", "01/01/2010", "c", "d", "positive", NA, NA, NA,
"x", "05/01/2010", "a", "b", NA, "positive", NA, NA,
"y", "02/02/2020", "e", "f", "positive", NA, NA, NA)
df2 <- df %>%
# Add a column indicating test number
group_by(person_id) %>%
mutate(test_number = row_number(),
# Gather the test results into a single column
test_result = paste0(test_1, test_2, test_3, test_4) %>%
str_remove_all("NA")) %>%
select(-(test_1:test_4)) %>%
# Reshape from long to wide
pivot_wider(names_from = test_number,
values_from = c(test_date, serial_number,
freezer_number, test_result)) %>%
# Reorder the columns
relocate(ends_with("1"), .before = ends_with("2"))
df2

Related

Count number of non-blank columns and assign to individual respondents

I am trying to get a total number of friends that will become the denominator in a later step.
example data:
set.seed(24) ## for sake of reproducibility
n <- 5
data <- data.frame(id=1:n,
Q1= c("same", "diff", NA, NA, NA),
Q2= c("diff", "diff", "same", "diff", NA),
Q3= c("same", "diff", NA ,NA, "diff"),
Q4= c("diff", "same", NA, NA, NA))
I first need to create a column that contains a numeric count of how many columns each participant responded to (either "same" or "diff", not counting NAs/blanks). I have tried the following:
friendship <- total.friends <- rowSums(c(data$Q1, data$Q2, data$Q3, data$Q4)), != "")
friendship <- total.friends <-rowSums(!is.na(c(data$Q1, data$Q2, data$Q3, data$Q4)))
Neither is effective, likely because my data is not numeric. The first did count the cells but did not group by id as I require. Is there any function I can use to count the populated cells? How can I edit this to count only cells populated with "diff", so that I can then start on the second step (making the proportion)?

You could use apply:
library(dplyr) # provides the %>% pipe used below
data2 <- apply(data[,-1],MARGIN=1,function(x){c <- length(x[!is.na(x)])})
result <- as.data.frame(cbind(data[,1],data2)) %>% setNames(c("id","number"))
result will then hold the number of non-NA answers for each id.
data2 is a count of the non-NA values for each id. It uses apply with MARGIN = 1, which takes each row of your dataframe and applies a function to it. The function being applied is the c <- length(x[!is.na(x)]) part: x[!is.na(x)] filters out the NA entries in the row so that only the non-NA entries remain, and length() then counts how many entries are left.
The result of that apply is a vector with one element per row of the data. Since you have one row per id, this amounts to computing the function for each id.
Lastly, in the result line I simply bind the id column back onto the counts, so that each count is clearly identified instead of being a bare column of results.
Hope this works for you :)
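For comparison, the same counts can be obtained in base R with rowSums applied to a logical matrix, which is close to what the question attempted. This is a minimal sketch; the column names total, diff and prop are chosen here for illustration.

```r
# Example data from the question
data <- data.frame(id = 1:5,
                   Q1 = c("same", "diff", NA, NA, NA),
                   Q2 = c("diff", "diff", "same", "diff", NA),
                   Q3 = c("same", "diff", NA, NA, "diff"),
                   Q4 = c("diff", "same", NA, NA, NA))

# Keep the question columns together as a data frame, not one long vector
answers <- data[, c("Q1", "Q2", "Q3", "Q4")]

# Number of answered questions per row (non-NA cells)
data$total <- rowSums(!is.na(answers))

# Number of "diff" answers per row; NA comparisons are ignored
data$diff <- rowSums(answers == "diff", na.rm = TRUE)

# Proportion of "diff" among the answered questions
data$prop <- data$diff / data$total
data
```

The attempt in the question failed because c(data$Q1, data$Q2, ...) flattens the columns into a single vector, and rowSums needs a matrix or data frame with at least two dimensions.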

Here's a regex solution with grep:
data$count <- apply(data, 1, function(x) length(grep("[a-z]", x, value = T)))
Here using length you count the number of times grep finds a lower-case letter in any row cell.
Result:
data
id Q1 Q2 Q3 Q4 count
1 1 same diff same diff 4
2 2 diff diff diff same 4
3 3 <NA> same <NA> <NA> 1
4 4 <NA> diff <NA> <NA> 1
5 5 <NA> <NA> diff <NA> 1

You can also accomplish this using c_across and rowwise from the dplyr library:
library(dplyr)
data %>%
dplyr::rowwise() %>%
dplyr::mutate(Total = sum(!is.na(c_across(Q1:Q4)))) %>%
dplyr::ungroup()
Note: alternatively you can use starts_with("Q") inside of c_across to do this across all columns that start with "Q" (shown below).
You can also count a specific response, or compute other variables that depend on a newly created one (such as a proportion), in the same mutate statement:
data %>%
dplyr::rowwise() %>%
dplyr::mutate(Total = sum(!is.na(c_across(starts_with("Q")))),
Diff = sum(c_across(starts_with("Q")) == "diff", na.rm = T),
Prop = Diff / Total) %>%
dplyr::ungroup()
id Q1 Q2 Q3 Q4 Total Diff Prop
<int> <chr> <chr> <chr> <chr> <int> <int> <dbl>
1 1 same diff same diff 4 2 0.5
2 2 diff diff diff same 4 3 0.75
3 3 NA same NA NA 1 0 0
4 4 NA diff NA NA 1 1 1
5 5 NA NA diff NA 1 1 1

R maximum with NAs in a vector

I have a problem. Rows where PROJECT = "SNOP" are missing from df6, although they were present in df5. These rows contain only NAs for the VERSION2 variable. Can someone help me? Here is my code:
which(df5$PROJECT=="SNOP") #200 lines appear
df6 <- df5 %>%
group_by(PROJECT) %>%
filter(VERSION2 == ifelse(!all(is.na(VERSION2)), max(VERSION2, na.rm=T), NA)) %>%
ungroup()
which(df6$PROJECT=="SNOP") #missing lines PROJECT="Snop" #answer: integer(0)

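That is probably because "SNOP" has only NA values for VERSION2. For such a group the condition becomes VERSION2 == NA, which evaluates to NA rather than TRUE, and filter() drops rows whose condition is NA.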
Consider this small example.
library(dplyr)
df <- data.frame(a = rep(c(1:2), each =3), b = c(1:3, NA, NA, NA))
Using your code, we do :
df %>%
group_by(a) %>%
filter(b == ifelse(!all(is.na(b)), max(b, na.rm=TRUE), NA))
# a b
# <int> <int>
#1 1 3
Notice how a = 2 is dropped.
Now you can decide what you want to do with groups where all values are NA. For example, the code below keeps every row of such groups.
df %>%
group_by(a) %>%
filter(if(!all(is.na(b))) b == max(b, na.rm=T) else TRUE)
# a b
# <int> <int>
#1 1 3
#2 2 NA
#3 2 NA
#4 2 NA
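The underlying behaviour can be checked in isolation. The sketch below uses a small made-up data frame and, unlike the answer above, does not group the data; it only shows that a comparison with NA never returns TRUE and that filter() discards such rows unless they are kept explicitly.

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 2, 2), b = c(1, 3, NA, NA))

# Comparing against NA never yields TRUE, only NA
na_comparison <- c(1, NA) == NA

# filter() drops rows whose condition is NA, so no row survives here
nothing_kept <- df %>% filter(b == NA)

# Ungrouped variant: keep the row at the maximum of b,
# plus every row where b itself is missing
with_na <- df %>% filter(b == max(b, na.rm = TRUE) | is.na(b))
with_na
```

Adding an explicit is.na() term is one way to state which missing rows should survive; whether that matches the intended result depends on how all-NA groups should be treated.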

R coalesce down columns by identifer [duplicate]

I have a long dataset with student grades and courses going across many semesters. It has many NAs and many rows for each student. I want it to have one long row per student to fill in those NAs but keep the same column names.
Here's a sample:
library(tidyverse)
sample <- tibble(student = c("Corey", "Corey", "Sibley", "Sibley"),
fall_course_1 = c("Math", NA, "Science", NA),
fall_course_2 = c(NA, "English", NA, NA),
fall_grade_1 = c(90, NA, 98, NA),
fall_grade_2 = c(NA, 60, NA, NA))
And here's what I'd like it to look like:
library(tidyverse)
answer <- tibble(student = c("Corey", "Sibley"),
fall_course_1 = c("Math", "Science"),
fall_course_2 = c("English", NA),
fall_grade_1 = c(90, 98),
fall_grade_2 = c(60, NA))
Some semesters, some students take many classes and some just one. I've tried using coalesce(), but I can't figure it out. Any help would be appreciated!

This should do it: pivot the data long, remove the NAs, and then pivot it back to wide.
You need to convert the numeric values to character temporarily so they can go in the same column as the course labels, then type_convert() is a lazy way to put them back again.
library(dplyr)
library(tidyr)
library(readr)
reshaped <- sample %>%
mutate_if(is.numeric, as.character) %>%
pivot_longer(-student) %>%
drop_na() %>%
pivot_wider(id_cols = student, names_from = name, values_from = value) %>%
type_convert()

You could get the first non-NA value in each column for each student.
library(dplyr)
sample %>% group_by(student) %>% summarise_all(~na.omit(.)[1])
# A tibble: 2 x 5
# student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
# <chr> <chr> <chr> <dbl> <dbl>
#1 Corey Math English 90 60
#2 Sibley Science NA 98 NA
This approach returns NA if there are all NA values in a group.

Using a custom coalesce function and dplyr:
coalesce_all_columns <- function(df) {
return(coalesce(!!! as.list(df)))
}
library(dplyr)
sample %>%
group_by(student) %>%
summarise_all(coalesce_all_columns)
# A tibble: 2 x 5
student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
<chr> <chr> <chr> <dbl> <dbl>
1 Corey Math English 90 60
2 Sibley Science NA 98 NA
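To see why the custom function works: coalesce() normally takes several vectors and returns, position by position, the first non-missing value. Splicing the elements of a single column with !!! turns each element into its own length-one argument, so the result is the first non-NA value of that column. A small sketch with made-up numbers:

```r
library(dplyr)

# Usual use: first non-missing value at each position across vectors
elementwise <- coalesce(c(90, NA), c(1, 60))

# Spliced use: each element of x becomes a separate length-one argument,
# so the result is the first non-NA element of x
x <- c(NA, 60, NA)
spliced <- coalesce(!!!as.list(x))

elementwise
spliced
```

Inside summarise_all, the function receives one column of one group at a time, which is why the argument named df in the answer is really a vector.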

You could also use data.table package as follows:
library(data.table)
result <- setDT(sample)[, lapply(.SD, na.omit), by = student]
result
# 1: Corey Math English 90 60
# 2: Sibley Science <NA> 98 NA

How to subset dataframe based on multiple variables in R

I have a dataframe of 286 columns and 157355 rows. I wish to subset rows that contain one or more of several defined factor variables such as F32, F341 etc.
Once this has been completed, I wish to identify which other factor variables are most common in the subset rows.
I have tried to filter for values of interest but an error messages appears saying the data must be numeric, logical or complex, for example;
d<- a %>%
filter_at(vars(f.41202.0.0:f.41202.0.65), all_vars('F32'))
I also tried this, but the resulting dataframe had no values present;
f <- a %>%
rowwise() %>%
filter(any(c(1:280) %in% c('F32', 'F320', 'F321', 'F322', 'F323',
'F328', 'F329', 'F330', 'F331', 'F332',
'F333', 'F334', 'F338', 'F339')))
the same occurred when I tried to place all relevant variables into an ICD object;
f <- b %>%
rowwise() %>%
filter(any(c(1:286) %in% ICD))
I would greatly appreciate any suggestions, thanks
my data looks like this (sorry I can't find a way to format it better on this page);
Row.name Var1 Var2 Var3 Var4
1 F3 NA NA M87
2 NA NA M87 NA
3 NA F3 NA K17
4 NA NA F3 M87
After sub-setting rows based on F3 it should look like this;
Row.name Var1 Var2 Var3 Var4
1 F3 NA NA M87
3 NA F3 NA K17
4 NA NA F3 M87
so the same variable columns are retained, but rows without F3 are removed
then I would hope to list the other variables (other than F3) based on how common they are within that subset, in this case that would be
most common: M87
2nd most common: K17
If it helps, I am trying to identify individuals with a particular disease, then I will try to find out which other diseases those individuals most commonly have
thanks for the help

If you wish to use tidyverse, you can use filter_all to look at all of the columns. Then, check if any_vars are in a vector of diagnostic codes. In my example, I look at F3 and F320.
Afterwards, if you want to count up the number of diagnosis codes, you could reshape your data from wide to long, and then count frequencies. If you wish, you can remove NA by filter. Let me know if this is what you had in mind.
df <- data.frame(
Var1 = c("F3", NA, NA, NA),
Var2 = c(NA, NA, "F3", NA),
Var3 = c(NA, "M87", NA, "F3"),
Var4 = c("M87", NA, "K17", "M87")
)
library(tidyverse)
df %>%
filter_all(any_vars(. %in% c("F3", "F320"))) %>%
pivot_longer(cols = starts_with("Var"), names_to = "Var", values_to = "Code") %>%
filter(!is.na(Code)) %>%
count(Code, sort = TRUE)
After the filter, you should have:
Var1 Var2 Var3 Var4
1 F3 <NA> <NA> M87
2 <NA> F3 <NA> K17
3 <NA> <NA> F3 M87
After pivot_longer and count:
# A tibble: 3 x 2
Code n
<fct> <int>
1 F3 3
2 M87 2
3 K17 1
Side note: if you wish to filter based on only some of your variables (instead of selecting all variables), you can use filter_at instead, such as:
filter_at(vars(starts_with("Var")), any_vars(. %in% c("F3", "F320")))
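In newer dplyr versions (1.0.4 or later), filter_at and any_vars are superseded, and the same row filter can be written with if_any(). The sketch below also drops the codes used for filtering before counting, since the question asks for the other codes only. The names codes, subset_rows and counts are chosen for illustration.

```r
library(dplyr)

df <- data.frame(
  Var1 = c("F3", NA, NA, NA),
  Var2 = c(NA, NA, "F3", NA),
  Var3 = c(NA, "M87", NA, "F3"),
  Var4 = c("M87", NA, "K17", "M87"),
  stringsAsFactors = FALSE
)

codes <- c("F3", "F320")

# Keep rows where at least one Var column holds one of the codes
subset_rows <- df %>%
  filter(if_any(starts_with("Var"), ~ .x %in% codes))

# Flatten the remaining cells, drop NAs and the filter codes themselves,
# then tabulate the other codes from most to least common
other_codes <- unlist(subset_rows)
other_codes <- other_codes[!is.na(other_codes) & !(other_codes %in% codes)]
counts <- sort(table(other_codes), decreasing = TRUE)
counts
```

On the example data this keeps three rows and ranks M87 (2 occurrences) ahead of K17 (1 occurrence).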

Summarize data frame to return non-NA values along subsets

Hoping that someone can help me with a trick. I've found similar questions online, but none of the examples I've seen do exactly what I'm looking for or work on my data structure.
I need to remove NAs from a data frame within data subsets and compress the remaining non-NA values into one row per subset.
Example:
#create example data
a <- c(1, 1, 1, 2, 2, 2) #this is the subsetting variable in the example
b <- c(NA, NA, "B", NA, NA, "C") #max 1 non-NA value for each subset
c <- c("A", NA, NA, "A", NA, NA)
d <- c(NA, NA, 1, NA, NA, NA) #some subsets for some columns have all NA values
dat <- as.data.frame(cbind(a, b, c, d))
> desired output
a b c d
1 B A 1
2 C A <NA>
Rules of thumb:
1) Need to remove NA values from each column
2) Loop along data subsets (column "a" in example above)
3) All columns, for each subset, have a max of 1 non-NA value, but some columns may have all NA values
Ideas:
lapply or dplyr is probably helpful to loop along all columns
na.omit is likely helpful, if the subsetting column that has entries for all
rows can be ignored (something like as.data.frame(lapply(dat.admin, na.omit))). issue in returning lapply output to data frame if some subsets don't return any non-NA values
x[which.min(is.na(x))] effectively accomplishes this if laboriously applied to each individual column
Any help is appreciated to put the final pieces together! Thank you!

One solution is to use dplyr::summarise_all. The data needs to be grouped by a.
library(dplyr)
dat %>%
group_by(a) %>%
# funs() is deprecated in current dplyr; a formula lambda does the same job
summarise_all(~ .[which.min(is.na(.))])
# # A tibble: 2 x 4
# a b c d
# <fctr> <fctr> <fctr> <fctr>
# 1 1 B A 1
# 2 2 C A <NA>
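The indexing trick relies on which.min() returning the position of the first FALSE in is.na(x), i.e. the first non-NA element. If every element is NA, it returns position 1, so the result is NA. A short base R sketch with made-up vectors:

```r
x <- c(NA, NA, "B")

# is.na(x) is TRUE TRUE FALSE; FALSE counts as 0, so it is the minimum
idx <- which.min(is.na(x))
first_value <- x[idx]

# With no non-NA element, which.min() returns 1 and the value is NA
all_missing <- c(NA, NA, NA)
fallback <- all_missing[which.min(is.na(all_missing))]

idx
first_value
fallback
```

This is why the approach handles columns that are entirely NA within a subset without any special case.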

Solution with data.table and na.omit
library(data.table)
merge(setDT(dat)[,a[1],keyby=a], setDT(dat)[,na.omit(.SD),keyby=a],all.x=TRUE)
I think the merge statement can be improved

Not really sure if this is what you're looking for, but this might work for you. It at least replicates the small sample output you're looking for:
library(dplyr)
library(tidyr)
dat %>%
filter_at(vars(b:c), any_vars(!is.na(.))) %>%
group_by(a) %>%
fill(b) %>%
fill(c) %>%
filter_at(vars(b:c), all_vars(!is.na(.)))
# A tibble: 2 x 4
# Groups: a [2]
a b c d
<fctr> <fctr> <fctr> <fctr>
1 1 B A 1
2 2 C A NA
You could also use just dplyr:
dat %>%
group_by(a) %>%
# summarise_each() and funs() are deprecated; summarise_all() with a lambda is the equivalent
summarise_all(~ first(.[!is.na(.)]))