Remove NA rows based on multiple columns' names in R [duplicate] - r

This question already has answers here:
Omit rows containing specific column of NA
(10 answers)
Closed 2 years ago.
Given a small dataset as follows:
A B C
1 2 NA
NA 2 3
1 NA 3
1 2 3
How could I remove rows based on the condition: columns B and C have NAs?
The expected result will look like this:
A B C
NA 2 3
1 2 3

Another option in Base R is
df[complete.cases(df[c("B","C")]),]
A B C
2 NA 2 3
4 1 2 3

With base R:
df[!is.na(df$B) & !is.na(df$C),]
Using dplyr:
df %>%
  filter(!is.na(B), !is.na(C))
returns
# A tibble: 2 x 3
A B C
<dbl> <dbl> <dbl>
1 NA 2 3
2 1 2 3
or
df %>%
  drop_na(B, C)
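For a reproducible check, here is a minimal base R sketch that rebuilds the question's data and generalizes the complete.cases approach to any set of column names. The variable names `cols` and `result` are introduced here for illustration only.

```r
# Rebuild the example data from the question (the name `df` follows the answers).
df <- data.frame(A = c(1, NA, 1, 1),
                 B = c(2, 2, NA, 2),
                 C = c(NA, 3, 3, 3))

# Columns to check for NA (hypothetical helper variable).
cols <- c("B", "C")

# Keep only the rows where none of the selected columns is NA.
# NAs in other columns, such as A, are ignored.
result <- df[complete.cases(df[cols]), ]
print(result)
```

Because only the columns listed in `cols` are inspected, the row with an NA in A is kept, which matches the expected result in the question.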

Related

Replacing NA with observed values? [duplicate]

This question already has answers here:
Filling missing value in group
(3 answers)
Replace NA with previous or next value, by group, using dplyr
(5 answers)
Closed 2 years ago.
I have a dataset that contains multiple observations per person. In some cases an individual will have their ethnicity recorded in some rows but missing in others. In R, how can I replace the NA's with the ethnicity stated in the other rows without having to manually change them?
Example:
PersonID Ethnicity
1 A
1 A
1 NA
1 NA
1 A
2 NA
2 B
2 NA
3 NA
3 NA
3 A
3 NA
Need:
PersonID Ethnicity
1 A
1 A
1 A
1 A
1 A
2 B
2 B
2 B
3 A
3 A
3 A
3 A
You could use fill from tidyr
df %>%
  group_by(PersonID) %>%
  fill(Ethnicity, .direction = "downup")
# A tibble: 12 x 2
# Groups: PersonID [3]
PersonID Ethnicity
<int> <fct>
1 1 A
2 1 A
3 1 A
4 1 A
5 1 A
6 2 B
7 2 B
8 2 B
9 3 A
10 3 A
11 3 A
12 3 A
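If tidyr is not available, the same result can be sketched in base R with ave. This assumes each person has at most one distinct non-missing ethnicity, as in the example; it does not reproduce the positional "downup" behaviour of fill when a person has several different values. The helper name `fill_single` is hypothetical.

```r
# Rebuild the example data from the question.
df <- data.frame(
  PersonID  = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  Ethnicity = c("A", "A", NA, NA, "A", NA, "B", NA, NA, NA, "A", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NAs in a vector with its first non-missing value.
# Assumption: all non-missing values within one person are identical.
fill_single <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) == 0) return(x)  # nothing to fill from, leave as-is
  x[is.na(x)] <- known[1]
  x
}

# Apply the helper separately within each PersonID.
df$Ethnicity <- ave(df$Ethnicity, df$PersonID, FUN = fill_single)
print(df)
```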

Combining rows by index in R [duplicate]

This question already has answers here:
Combining pivoted rows in R by common value
(4 answers)
Closed 4 years ago.
EDIT: I am aware there is a similar question that has been answered, but it does not work for me on the dataset I have provided below. The above dataframe is the result of me using the spread function. I am still not sure how to consolidate it.
EDIT2: I realized that the group_by function, which I had previously used on the data, is what was preventing the spread function from working in the way I wanted it to work originally. After using ungroup, I was able to go straight from the original dataset (not pictured below) to the 2nd dataframe pictured below.
I have a dataframe that looks like the following. I am trying to make it so that there is only 1 row for each id number.
id init_cont family 1 2 3
1 I C 1 NA NA
1 I C NA 4 NA
1 I C NA NA 3
2 I D 2 NA NA
2 I D NA 1 NA
2 I D NA NA 4
3 K C 3 NA NA
3 K C NA 4 NA
3 K C NA NA 1
I would like the resulting dataframe to look like this.
id init_cont family 1 2 3
1 I C 1 4 3
2 I D 2 1 4
3 K C 3 4 1
We can group_by 'id', 'init_cont' and 'family', and then use summarise_all to drop the NA elements in columns 1:3
library(dplyr)
df1 %>%
  group_by_at(names(.)[1:3]) %>%
  summarise_all(na.omit)
#Or
#summarise_all(funs(.[!is.na(.)]))
# A tibble: 3 x 6
# Groups: d, init_cont [?]
# d init_cont family `1` `2` `3`
# <int> <chr> <chr> <int> <int> <int>
#1 1 I C 1 4 3
#2 2 I D 2 1 4
#3 3 K C 3 4 1
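In dplyr 1.0.0 and later, summarise_all is superseded by across, and funs is deprecated. A sketch of the same idea with the newer interface follows; it assumes the column names shown in the question, and it takes the first non-missing value per group, so a group with several non-missing values in one column would silently keep only the first.

```r
library(dplyr)

# Rebuild the example data; check.names = FALSE keeps the numeric column names.
df1 <- data.frame(
  id        = rep(1:3, each = 3),
  init_cont = rep(c("I", "I", "K"), each = 3),
  family    = rep(c("C", "D", "C"), each = 3),
  `1` = c(1, NA, NA, 2, NA, NA, 3, NA, NA),
  `2` = c(NA, 4, NA, NA, 1, NA, NA, 4, NA),
  `3` = c(NA, NA, 3, NA, NA, 4, NA, NA, 1),
  check.names = FALSE
)

# One row per id/init_cont/family: keep the first non-NA value in every
# remaining column (NA if a group has no non-missing value at all).
collapsed <- df1 %>%
  group_by(id, init_cont, family) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")

print(collapsed)
```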

merge data by groups and by common ID (IDs duplicated outside groups) [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
This is not a duplicate of How to join (merge) data frames. The left merge has to be performed inside each group, not on the whole data set. The ids are unique within a group, not across groups, so a left merge on id alone without grouping will mess up the data.
I have data with many groups (panel data/time series). Within each group, I want to merge the data by a common ID, and apply the same merge across all the other groups that I have.
#sample data
a<-data.frame(c(1:4,1:4),1,c('a','a','a','a','b','b','b','b'))
b<-data.frame(c(2,4,2,4),10,c('a','a','b','b'))
colnames(a)<-c('id','v','group')
colnames(b)<-c('id','v1','group')
> a
id v group
1 1 1 a
2 2 1 a
3 3 1 a
4 4 1 a
5 1 1 b
6 2 1 b
7 3 1 b
8 4 1 b
> b
id v1 group
1 2 10 a
2 4 10 a
3 2 10 b
4 4 10 b
I tried to use dplyr's group_by(group) and then merge(a,b,by='id',all.x=TRUE), but I am not sure how to apply dplyr to two data sets
desired output (left merge)
id v group.x v1 group.y
1 1 a NA <NA>
2 1 a 10 a
3 1 a NA <NA>
4 1 a 10 a
1 1 b NA <NA>
2 1 b 10 b
3 1 b NA <NA>
4 1 b 10 b
You can just include group in the by argument for the join:
a %>% left_join(b, by=c("id","group"))
id v group v1
1 1 1 a NA
2 2 1 a 10
3 3 1 a NA
4 4 1 a 10
5 1 1 b NA
6 2 1 b 10
7 3 1 b NA
8 4 1 b 10
This gives you only one "group" column, but v1 will be NA for cases where there's no matching row in b, so creating two separate "group" columns is redundant. Isn't that better, given that group (presumably) represents the same underlying division of the data in both data frames?
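The same idea also works with base R's merge, which the question started from: passing both columns in by makes the match happen within each group. This is a minimal sketch using the sample data above; the final reordering is only there because merge sorts its result by the key columns.

```r
# Sample data from the question, built with named columns directly.
a <- data.frame(id = c(1:4, 1:4), v = 1,
                group = rep(c("a", "b"), each = 4))
b <- data.frame(id = c(2, 4, 2, 4), v1 = 10,
                group = c("a", "a", "b", "b"))

# Left merge on the compound key (id, group); all.x = TRUE keeps every row of a.
merged <- merge(a, b, by = c("id", "group"), all.x = TRUE)

# Restore the group-then-id ordering of the original data.
merged <- merged[order(merged$group, merged$id), ]
print(merged)
```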

Replace NA in column by value corresponding to column name in separate table

I have a data frame which looks like this
data <- data.frame(ID = c(1,2,3,4,5),A = c(1,4,NA,NA,4),B = c(1,2,NA,NA,NA),C= c(1,2,3,4,NA))
> data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 NA NA 3
4 4 NA NA 4
5 5 4 NA NA
I have a mapping file as well which looks like this
reference <- data.frame(Names = c("A","B","C"),Vals = c(2,5,6))
> reference
Names Vals
1 A 2
2 B 5
3 C 6
I want my data file to be modified using the reference file in a way which would yield me this final data frame
> final_data
ID A B C
1 1 1 1 1
2 2 4 2 2
3 3 2 5 3
4 4 2 5 4
5 5 4 5 6
What is the fastest way I can achieve this in R?
We can do this with Map
data[as.character(reference$Names)] <- Map(function(x,y) replace(x,
is.na(x), y), data[as.character(reference$Names)], reference$Vals)
data
# ID A B C
#1 1 1 1 1
#2 2 4 2 2
#3 3 2 5 3
#4 4 2 5 4
#5 5 4 5 6
EDIT: Based on @thelatemail's comments.
NOTE: NO external packages used
As we are looking for an efficient solution, another approach would be to use set from data.table
library(data.table)
setDT(data)
v1 <- as.character(reference$Names)
for(j in seq_along(v1)){
set(data, i = which(is.na(data[[v1[j]]])), j= v1[j], value = reference$Vals[j] )
}
NOTE: Only a single efficient external package used.
One approach is to compute a logical matrix of the target columns capturing which cells are NA. We can then index-assign the NA cells with the replacement values. The tricky part is ensuring the replacement vector aligns with the indexed cells:
im <- is.na(data[as.character(reference$Names)]);
data[as.character(reference$Names)][im] <- rep(reference$Vals,colSums(im));
data;
## ID A B C
## 1 1 1 1 1
## 2 2 4 2 2
## 3 3 2 5 3
## 4 4 2 5 4
## 5 5 4 5 6
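To see why the replacement vector lines up, note that indexing with a logical matrix visits cells column by column. A sketch that exposes the intermediate values (the names `cols` and `fill` are introduced here for illustration):

```r
data <- data.frame(ID = 1:5,
                   A = c(1, 4, NA, NA, 4),
                   B = c(1, 2, NA, NA, NA),
                   C = c(1, 2, 3, 4, NA))
reference <- data.frame(Names = c("A", "B", "C"), Vals = c(2, 5, 6))

cols <- as.character(reference$Names)

# Logical matrix marking the NA cells of the target columns.
im <- is.na(data[cols])

# Number of NAs per column: A has 2, B has 3, C has 1.
print(colSums(im))

# Repeat each replacement value once per NA in its column, in column order,
# giving 2 2 5 5 5 6. This matches the column-wise order of the TRUE cells in im.
fill <- rep(reference$Vals, colSums(im))

data[cols][im] <- fill
print(data)
```

This relies on reference$Names being in the same order as the columns selected from data, which holds here because the columns are selected by those very names.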
If reference was the same wide format as data, dplyr's new (v. 0.5.0) coalesce function is built for replacing NAs; together with purrr, which offers alternate notations for *apply functions, it makes the process very simple:
library(dplyr)
# spread reference to wide, add ID column for mapping
reference_wide <- data.frame(ID = NA_real_, tidyr::spread(reference, Names, Vals))
reference_wide
# ID A B C
# 1 NA 2 5 6
# now coalesce the two column-wise and return a df
purrr::map2_df(data, reference_wide, coalesce)
# Source: local data frame [5 x 4]
#
# ID A B C
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1
# 2 2 4 2 2
# 3 3 2 5 3
# 4 4 2 5 4
# 5 5 4 5 6

How can I operate on elements of a data.frame in r, that creates a new column? [duplicate]

This question already has answers here:
Idiomatic R code for partitioning a vector by an index and performing an operation on that partition
(3 answers)
Closed 7 years ago.
Suppose I have a data.frame, df.
a b d
1 2 4
1 2 5
1 2 6
2 1 5
2 3 6
2 1 1
I'd like to operate on it so that for all places where a and b are equal, I compute the mean of d.
I found that using aggregate can do this,
aggregate(d ~ a + b, df, mean)
This gives me something reasonable
a b d
1 2 5
2 1 3
2 3 6
But I would ideally like to keep my original d column, and add a new column m, so that I get the original data.frame with a new column "m" that contains the averages like,
a b d m
1 2 4 5
1 2 5 5
1 2 6 5
2 1 5 3
2 3 6 6
2 1 1 3
Any ideas on how to do this "properly" in R?
library(dplyr)
df <- read.table(text = "a b d
1 2 4
1 2 5
1 2 6
2 1 5
2 3 6
2 1 1
" , header = TRUE)
# add the per-(a, b) group mean of d as a new column m, keeping all rows
df %>%
  group_by(a , b) %>%
  mutate(m = mean(d))
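A base R alternative that also keeps every original row is ave, which returns group-wise results aligned with the input vector. A minimal sketch, assuming the same df as in the question:

```r
# Example data from the question.
df <- data.frame(a = c(1, 1, 1, 2, 2, 2),
                 b = c(2, 2, 2, 1, 3, 1),
                 d = c(4, 5, 6, 5, 6, 1))

# ave() computes mean(d) within each combination of a and b and
# returns a vector of the same length as d, so it can be added as a column.
df$m <- ave(df$d, df$a, df$b, FUN = mean)
print(df)
```

Unlike aggregate, which collapses the data to one row per group, ave leaves the row count unchanged, which is what the question asks for.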
