Combining rows by index in R [duplicate]

EDIT: I am aware there is a similar question that has been answered, but it does not work for me on the dataset I have provided below. The above dataframe is the result of me using the spread function. I am still not sure how to consolidate it.
EDIT2: I realized that the group_by function, which I had previously used on the data, is what was preventing the spread function from working in the way I wanted it to work originally. After using ungroup, I was able to go straight from the original dataset (not pictured below) to the 2nd dataframe pictured below.
I have a dataframe that looks like the following. I am trying to make it so that there is only 1 row for each id number.
id init_cont family 1 2 3
1 I C 1 NA NA
1 I C NA 4 NA
1 I C NA NA 3
2 I D 2 NA NA
2 I D NA 1 NA
2 I D NA NA 4
3 K C 3 NA NA
3 K C NA 4 NA
3 K C NA NA 1
I would like the resulting dataframe to look like this.
id init_cont family 1 2 3
1 I C 1 4 3
2 I D 2 1 4
3 K C 3 4 1

We can group_by the first three columns (shown as 'd', 'init_cont', 'family' in the output below) and then use summarise_all to drop the NA elements in columns 1:3:
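library(dplyr)
df1 %>%
  group_by_at(names(.)[1:3]) %>%  # group by the first three columns
  summarise_all(na.omit)          # keep the non-NA value in each remaining column
# Or, with a formula (funs() is deprecated since dplyr 0.8.0):
# summarise_all(~ .[!is.na(.)])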
# A tibble: 3 x 6
# Groups: d, init_cont [?]
# d init_cont family `1` `2` `3`
# <int> <chr> <chr> <int> <int> <int>
#1 1 I C 1 4 3
#2 2 I D 2 1 4
#3 3 K C 3 4 1
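The same consolidation can also be sketched in base R, for readers without dplyr. This is a minimal sketch rather than the answerer's code: the data frame is rebuilt from the table in the question, and first_non_na is a hypothetical helper that assumes each group has at most one non-NA value per column.

```r
# Rebuild the question's data; check.names = FALSE keeps the column names "1", "2", "3".
df1 <- data.frame(
  id        = rep(1:3, each = 3),
  init_cont = rep(c("I", "I", "K"), each = 3),
  family    = rep(c("C", "D", "C"), each = 3),
  `1` = c(1, NA, NA, 2, NA, NA, 3, NA, NA),
  `2` = c(NA, 4, NA, NA, 1, NA, NA, 4, NA),
  `3` = c(NA, NA, 3, NA, NA, 4, NA, NA, 1),
  check.names = FALSE
)

# Hypothetical helper: first non-NA element of a vector (NA if there is none).
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse each (id, init_cont, family) group to a single row.
collapsed <- aggregate(
  df1[c("1", "2", "3")],
  by  = df1[c("id", "init_cont", "family")],
  FUN = first_non_na
)

# aggregate() does not promise the original row order, so sort by id explicitly.
collapsed <- collapsed[order(collapsed$id), ]
collapsed
```

If a group could hold several non-NA values in one column, the helper would silently keep only the first, so that assumption is worth checking on real data.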

Related

Remove NA rows based on multiple columns' names in R [duplicate]

Given a small dataset as follows:
A B C
1 2 NA
NA 2 3
1 NA 3
1 2 3
How can I remove the rows in which column B or column C is NA?
The expected result looks like this:
A B C
NA 2 3
1 2 3
Another option in Base R is
df[complete.cases(df[c("B","C")]),]
A B C
2 NA 2 3
4 1 2 3
With base R:
df[!is.na(df$B) & !is.na(df$C),]
Using dplyr:
library(dplyr)
df %>%
  filter(!is.na(B), !is.na(C))
returns
# A tibble: 2 x 3
A B C
<dbl> <dbl> <dbl>
1 NA 2 3
2 1 2 3
or, using drop_na from tidyr:
library(tidyr)
df %>%
  drop_na(B, C)

Replacing NA with observed values? [duplicate]

I have a dataset that contains multiple observations per person. In some cases an individual will have their ethnicity recorded in some rows but missing in others. In R, how can I replace the NA's with the ethnicity stated in the other rows without having to manually change them?
Example:
PersonID Ethnicity
1 A
1 A
1 NA
1 NA
1 A
2 NA
2 B
2 NA
3 NA
3 NA
3 A
3 NA
Need:
PersonID Ethnicity
1 A
1 A
1 A
1 A
1 A
2 B
2 B
2 B
3 A
3 A
3 A
3 A
You could use fill from tidyr, grouping by PersonID first:
library(dplyr)
library(tidyr)
df %>%
  group_by(PersonID) %>%
  fill(Ethnicity, .direction = "downup")
# A tibble: 12 x 2
# Groups: PersonID [3]
PersonID Ethnicity
<int> <fct>
1 1 A
2 1 A
3 1 A
4 1 A
5 1 A
6 2 B
7 2 B
8 2 B
9 3 A
10 3 A
11 3 A
12 3 A
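A base R variant of the same idea can be sketched with ave. This is not the answerer's method: it assumes each person has exactly one true ethnicity, so every NA is replaced with the first observed value in that person's rows (fill with "downup" uses the nearest observed value instead). The data frame is rebuilt from the question's example.

```r
# Rebuild the example data from the question.
df <- data.frame(
  PersonID  = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  Ethnicity = c("A", "A", NA, NA, "A", NA, "B", NA, NA, NA, "A", NA),
  stringsAsFactors = FALSE
)

# Within each PersonID, replace NA with the first non-NA value of that group.
# A group that is entirely NA stays NA.
df$Ethnicity <- ave(
  df$Ethnicity, df$PersonID,
  FUN = function(x) {
    x[is.na(x)] <- x[!is.na(x)][1]
    x
  }
)
df
```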

Remove groups which do not have non-consecutive NA values in R

I have the following Data frame
group <- c(2,2,2,2,4,4,4,4,5,5,5,5)
D <- c(NA,2,NA,NA,NA,2,3,NA,NA,NA,1,1)
df <- data.frame(group, D)
df
group D
1 2 NA
2 2 2
3 2 NA
4 2 NA
5 4 NA
6 4 2
7 4 3
8 4 NA
9 5 NA
10 5 NA
11 5 1
12 5 1
I would like to keep only the groups that contain non-consecutive NA values at least once. In this case group 5 would be removed, because its NA values are all consecutive. Groups 2 and 4 remain because they do contain non-consecutive NA values (NA values separated by one or more rows with a non-NA value).
The resulting data frame would therefore look like this:
df2
group D
1 2 NA
2 2 2
3 2 NA
4 2 NA
5 4 NA
6 4 2
7 4 3
8 4 NA
any ideas :)?
How about using difference between the index of NA-values per group?
library(dplyr)
df %>% group_by(group) %>% filter(any(diff(which(is.na(D))) > 1))
## A tibble: 8 x 2
## Groups: group [2]
# group D
# <dbl> <dbl>
#1 2. NA
#2 2. 2.
#3 2. NA
#4 2. NA
#5 4. NA
#6 4. 2.
#7 4. 3.
#8 4. NA
I'm not sure this would catch all potential edge cases but it seems to work for the given example.
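To see why the diff/which test works, the check can be applied group by group in base R. This is a minimal sketch under the question's definition; has_gap is a hypothetical helper name. Within a group, which() gives the row positions of the NAs, and a difference greater than 1 between neighbouring positions means a non-NA row sits between two NAs.

```r
group <- c(2, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5)
D     <- c(NA, 2, NA, NA, NA, 2, 3, NA, NA, NA, 1, 1)
df    <- data.frame(group, D)

# Hypothetical helper: TRUE if at least two NAs are separated by a non-NA value.
# Takes the logical vector is.na(D) for one group.
has_gap <- function(na) any(diff(which(na)) > 1)

# Group 2: NAs at positions 1, 3, 4 -> gaps 2, 1 -> kept
# Group 4: NAs at positions 1, 4    -> gap 3     -> kept
# Group 5: NAs at positions 1, 2    -> gap 1     -> removed
keep <- ave(is.na(df$D), df$group, FUN = has_gap)
df2  <- df[keep, ]
df2
```

One edge case follows directly from the code: a group with zero or one NA yields an empty diff, any() returns FALSE, and the group is removed.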

spreading data in R - allowing multiple values per cell

With these data
d <- data.frame(time=1:5, side=c("r","r","r","l","l"), val = c(1,2,1,2,1))
d
time side val
1 1 r 1
2 2 r 2
3 3 r 1
4 4 l 2
5 5 l 1
We can spread to a tidy dataframe like this:
library(tidyverse)
d %>% spread(side,val)
Which gives:
time l r
1 1 NA 1
2 2 NA 2
3 3 NA 1
4 4 2 NA
5 5 1 NA
But say we have more than one val for a given time/side. For example:
d <- data.frame(time=c(1:5,5), side=c("r","r","r","l","l","l"), val = c(1,2,1,2,1,2))
time side val
1 1 r 1
2 2 r 2
3 3 r 1
4 4 l 2
5 5 l 1
6 5 l 2
Now this won't work because of duplicated values:
d %>% spread(side,val)
Error: Duplicate identifiers for rows (5, 6)
Is there an efficient way to force this behavior (or an alternative)? The output would be, e.g.:
time l r
1 1 NA 1
2 2 NA 2
3 3 NA 1
4 4 2 NA
5 5 1, 2 NA
The data.table/reshape2 equivalent of tidyr::spread is dcast. It has a more complicated syntax than spread, but it's more flexible. To accomplish your task we can use the below chunk.
We use the formula to 'spread' side by time (filling with the values in the val column), provide the fill value of NA, and specify we want to list elements together when aggregation is needed per value of time.
library(data.table)
d <- data.table(time=c(1:5,5),
side=c("r","r","r","l","l","l"),
val = c(1,2,1,2,1,2))
data.table::dcast(d, time ~ side,
value.var='val',
fill=NA,
fun.aggregate=list)
#OUTPUT
# time l r
# 1: 1 NA 1
# 2: 2 NA 2
# 3: 3 NA 1
# 4: 4 2 NA
# 5: 5 1,2 NA
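If a plain character cell such as "1, 2" is enough, the same reshaping can be sketched in base R with tapply. This is an alternative under that assumption, not a replacement for dcast: the cells become strings rather than list elements.

```r
d <- data.frame(time = c(1:5, 5),
                side = c("r", "r", "r", "l", "l", "l"),
                val  = c(1, 2, 1, 2, 1, 2))

# Cross-tabulate time by side, pasting together all values that share a cell.
# Combinations that never occur are left as NA.
wide <- tapply(d$val, d[c("time", "side")],
               function(v) paste(v, collapse = ", "))

# Turn the matrix into a data frame with time as an ordinary column.
wide_df <- data.frame(time = as.integer(rownames(wide)), wide,
                      row.names = NULL)
wide_df
```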

Replace NA in column by value corresponding to column name in separate table

I have a data frame which looks like this
data <- data.frame(ID = c(1,2,3,4,5),A = c(1,4,NA,NA,4),B = c(1,2,NA,NA,NA),C= c(1,2,3,4,NA))
> data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 NA NA 3
4 4 NA NA 4
5 5 4 NA NA
I have a mapping file as well which looks like this
reference <- data.frame(Names = c("A","B","C"),Vals = c(2,5,6))
> reference
Names Vals
1 A 2
2 B 5
3 C 6
I want my data file to be modified using the reference file in a way which would yield me this final data frame
> final_data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 2 5 3
4 4 2 5 4
5 5 4 5 6
What is the fastest way I can achieve this in R?
We can do this with Map
data[as.character(reference$Names)] <- Map(function(x,y) replace(x,
is.na(x), y), data[as.character(reference$Names)], reference$Vals)
data
# ID A B C
#1 1 1 1 1
#2 2 4 2 2
#3 3 2 5 3
#4 4 2 5 4
#5 5 4 5 6
EDIT: Based on @thelatemail's comments.
NOTE: NO external packages used
As we are looking for an efficient solution, another approach is to use set from data.table:
library(data.table)
setDT(data)
v1 <- as.character(reference$Names)
for(j in seq_along(v1)){
set(data, i = which(is.na(data[[v1[j]]])), j= v1[j], value = reference$Vals[j] )
}
NOTE: Only a single efficient external package used.
One approach is to compute a logical matrix of the target columns capturing which cells are NA. We can then index-assign the NA cells with the replacement values. The tricky part is ensuring the replacement vector aligns with the indexed cells:
im <- is.na(data[as.character(reference$Names)]);
data[as.character(reference$Names)][im] <- rep(reference$Vals,colSums(im));
data;
## ID A B C
## 1 1 1 1 1
## 2 2 4 2 2
## 3 3 2 5 3
## 4 4 2 5 4
## 5 5 4 5 6
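The alignment step can be made explicit by looking at the intermediate values. The sketch below repeats the approach above with named intermediates (cols, n_missing and fill_values are illustrative names): indexing with a logical matrix visits cells column by column, so the replacement vector must list each column's value once per NA in that column, in column order.

```r
data <- data.frame(ID = c(1, 2, 3, 4, 5),
                   A  = c(1, 4, NA, NA, 4),
                   B  = c(1, 2, NA, NA, NA),
                   C  = c(1, 2, 3, 4, NA))
reference <- data.frame(Names = c("A", "B", "C"), Vals = c(2, 5, 6))

cols <- as.character(reference$Names)
im   <- is.na(data[cols])          # logical matrix: TRUE where a cell is NA

n_missing <- colSums(im)           # NAs per column: A = 2, B = 3, C = 1

# Repeat each replacement value once per NA in its column: 2 2 5 5 5 6
fill_values <- rep(reference$Vals, n_missing)

# Column-major assignment lines the values up with the NA cells.
data[cols][im] <- fill_values
data
```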
If reference was the same wide format as data, dplyr's new (v. 0.5.0) coalesce function is built for replacing NAs; together with purrr, which offers alternate notations for *apply functions, it makes the process very simple:
library(dplyr)
# spread reference to wide, add ID column for mapping
reference_wide <- data.frame(ID = NA_real_, tidyr::spread(reference, Names, Vals))
reference_wide
# ID A B C
# 1 NA 2 5 6
# now coalesce the two column-wise and return a df
purrr::map2_df(data, reference_wide, coalesce)
# Source: local data frame [5 x 4]
#
# ID A B C
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 2 4 2 2
# 3 3 2 5 3
# 4 4 2 5 4
# 5 5 4 5 6
