This question already has answers here:
How to delete rows from a dataframe that contain n*NA
(4 answers)
Closed 3 days ago.
I'm trying to impute my data and keep as many observations as I can. I want to select the observations that have at most one NA value, using the PimaIndiansDiabetes2 data from the mlbench package (loaded with data(PimaIndiansDiabetes2, package = "mlbench")).
For example:
Var1 Var2 Var3
1 NA NA
2 34 NA
3 NA NA
4 NA 55
5 NA NA
6 40 28
What I would like returned:
Var1 Var2 Var3
2 34 NA
4 NA 55
6 40 28
The code below returns every row that contains an NA value. I know that I could use merge() to join the observations with exactly one NA value to the observations without NA values, but I'm not sure how to extract the former.
na_rows <- df[!complete.cases(df), ]
A base R solution:
df[rowSums(is.na(df)) <= 1, ]
Its dplyr equivalent:
library(dplyr)
df %>%
  filter(rowSums(is.na(pick(everything()))) <= 1)
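To make the logic of the base R answer easier to follow, here is a minimal sketch on the example data from the question. The names na_per_row and kept are chosen only for illustration: is.na(df) gives a logical matrix, rowSums() counts the TRUE values in each row, and the comparison keeps rows with at most one NA.

```r
# Example data from the question
df <- data.frame(
  Var1 = 1:6,
  Var2 = c(NA, 34, NA, NA, NA, 40),
  Var3 = c(NA, NA, NA, 55, NA, 28)
)

# Number of NA values in each row
na_per_row <- rowSums(is.na(df))

# Keep only the rows with at most one NA
kept <- df[na_per_row <= 1, ]
print(kept)
```

Changing the threshold (for example to <= 2) is the only adjustment needed to allow more missing values per row.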
This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 4 years ago.
Hi everyone. I want to replace each NA value in block with the non-NA value for the same participant. I tried the code below, but it returns the original df and I don't know what went wrong.
df = data.frame(block = c('1',NA,NA,'2',NA,NA,'3',NA,NA),
subject = c('31','31','31','32','32','32','33','33','33'))
df[df$subject == 1 & is.na(df$block)] = df[df$subject == 31 &!is.na(df$block)]
# define a for loop from 1 to n subjects
for (i in 1:length(unique(df$subject))) {
# replace the block with NA in block that is not NA for the same participant
df[df$subject == i & is.na(df$block)] = df[df$subject == i & !is.na(df$block)]
}
Here is what I want to get: every row of a subject should carry that subject's non-NA block value (1 for subject 31, 2 for subject 32, 3 for subject 33).
Using the dplyr and zoo libraries, I replaced the NA values in the block column with the last preceding non-NA value (this works here because each subject's first row is non-NA):
library(dplyr)
library(zoo)
df2 <- df %>%
  do(na.locf(.))
The end result looks as follows:
df2
block subject
1 1 31
2 1 31
3 1 31
4 2 32
5 2 32
6 2 32
7 3 33
8 3 33
9 3 33
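As an alternative that fills within each subject explicitly, rather than relying on row order, here is a rough base R sketch using ave(). This is a swapped-in technique, not the zoo approach from the answer, and fill_from_group is a hypothetical helper name. It assumes each subject has at least one non-NA block value and takes the first one.

```r
# Data from the question (kept as character, not factor)
df <- data.frame(
  block   = c("1", NA, NA, "2", NA, NA, "3", NA, NA),
  subject = c("31", "31", "31", "32", "32", "32", "33", "33", "33"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NAs in x with the first non-NA value of x
fill_from_group <- function(x) {
  x[is.na(x)] <- x[!is.na(x)][1]
  x
}

# Apply the helper separately to the block values of each subject
df$block <- ave(df$block, df$subject, FUN = fill_from_group)
print(df)
```

Because the fill is done per subject, a subject whose first row is NA would still be filled from its own values instead of inheriting the previous subject's block.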
This question already has answers here:
ignore NA in dplyr row sum
(6 answers)
Closed 4 years ago.
Let's say that I have this data frame in R:
df <- read.table(text="
id a b c
1 42 3 2 NA
2 42 NA 6 NA
3 42 1 NA 7", header=TRUE)
I'd like to sum columns a, b and c into a new column d, ignoring NA values, so the result should look like this:
id a b c d
1 42 3 2 NA 5
2 42 NA 6 NA 6
3 42 1 NA 7 8
My code below doesn't work because of the NA values. Please note that I have to choose which columns to sum, since my real data frame has some columns that I don't want to include.
df %>%
  mutate(d = a + b + c)
You can use rowSums for this; it has an na.rm parameter to drop NA values.
df %>% mutate(d = rowSums(tibble(a, b, c), na.rm = TRUE))
Or, without dplyr, using just base R:
df$d <- rowSums(subset(df, select=c(a,b,c)), na.rm=TRUE)
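For reference, here is a small self-contained sketch of the base R approach on the question's data, selecting the columns by name. The vector cols and the object all_na are illustrative names. It also shows one behaviour worth knowing: with na.rm = TRUE, a row whose selected values are all NA sums to 0 rather than NA.

```r
# Data from the question
df <- data.frame(
  id = c(42, 42, 42),
  a  = c(3, NA, 1),
  b  = c(2, 6, NA),
  c  = c(NA, NA, 7)
)

# Columns to add up (the others, such as id, are left out)
cols <- c("a", "b", "c")
df$d <- rowSums(df[cols], na.rm = TRUE)
print(df)

# A row that is entirely NA sums to 0 when na.rm = TRUE
all_na <- rowSums(data.frame(a = NA_real_, b = NA_real_), na.rm = TRUE)
print(all_na)
```

If an all-NA row should stay NA instead of becoming 0, that case has to be handled separately after the sum.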
This question already has answers here:
rbindlist data.tables with different number of columns
(1 answer)
Rbind with new columns and data.table
(5 answers)
Closed 4 years ago.
I have a list of data tables with differing numbers of columns. Some of the data tables have 35 columns and others have 36.
I have these lines of code, but they generate an error:
> lst <- unlist(full_data.lst, recursive = FALSE)
> model_dat <- do.call("rbind", lst)
Error in rbindlist(l, use.names, fill, idcol) :
Item 1362 has 35 columns, inconsistent with item 1 which has 36 columns. If instead you need to fill missing columns, use set argument 'fill' to TRUE.
Any suggestions on how I can modify this so that it works properly?
Here's a minimal example of what you are trying to do.
No need to use any other package to do this. Just set fill=TRUE in rbindlist.
You can do this:
library(data.table)
df1 <- data.table(m1 = c(1, 2, 3))
df2 <- data.table(m1 = c(1, 2, 3), m2 = c(3, 4, 5))
# fill = TRUE pads the columns missing from a table with NA
df3 <- rbindlist(list(df1, df2), fill = TRUE)
print(df3)
m1 m2
1: 1 NA
2: 2 NA
3: 3 NA
4: 1 3
5: 2 4
6: 3 5
If I understood your question correctly, I can see only two options for appending your data tables.
Option A: Drop the extra variable from the dataset that has it
table$column_Name <- NULL
Option B: Create the variable, filled with missing values, in the incomplete dataset
table$column_Name <- NA
Then call rbind.
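Option B can be applied to a whole list at once. The following is only a sketch in base R on plain data frames, with made-up example tables; tables, all_cols, add_missing and combined are illustrative names, not part of any package.

```r
# Two example tables with different numbers of columns
tables <- list(
  data.frame(m1 = c(1, 2, 3)),
  data.frame(m1 = c(1, 2, 3), m2 = c(3, 4, 5))
)

# Every column name that appears in any table
all_cols <- unique(unlist(lapply(tables, names)))

# Add each missing column as NA, then put the columns in a common order
add_missing <- function(d) {
  for (col in setdiff(all_cols, names(d))) {
    d[[col]] <- NA
  }
  d[all_cols]
}

# Now every table has the same columns, so rbind works
combined <- do.call(rbind, lapply(tables, add_missing))
print(combined)
```

This does by hand what fill = TRUE in rbindlist does for data tables, so it is mainly useful when data.table is not available.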
Try to use rbind.fill from package plyr:
Input data, 3 dataframes with different number of columns
df1<-data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4,5))
df2<-data.frame(a=c(1,2,3,4,5,6),b=c(1,2,3,4,5,6),c=c(1,2,3,4,5,6))
df3<-data.frame(a=c(1,2,3),d=c(1,2,3))
full_data.lst<-list(df1,df2,df3)
The solution
library("plyr")
rbind.fill(full_data.lst)
a b c d
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 1 1 1 NA
7 2 2 2 NA
8 3 3 3 NA
9 4 4 4 NA
10 5 5 5 NA
11 6 6 6 NA
12 1 NA NA 1
13 2 NA NA 2
14 3 NA NA 3
This question already has answers here:
How to implement coalesce efficiently in R
(9 answers)
nested ifelse() is the worst; what's the best? [duplicate]
(3 answers)
Closed 5 years ago.
I need help in R with filling a new column, col4, as follows:
col4 = col1; if col1 is NA, then col4 = col2; if both col1 and col2 are NA, then col4 = col3.
id col1 col2 col3
1 10 NA NA
2 NA 12 NA
3 NA NA 13
4 NA NA 1
5 2 3 NA
Desired result:
id col4
1 10
2 12
3 13
4 1
5 2
This is easily done with coalesce from dplyr. The solution works for any number of columns:
library(dplyr)
data %>%
  mutate(col4 = coalesce(!!!data[-1]))
Result:
id col1 col2 col3 col4
1 1 10 NA NA 10
2 2 NA 12 NA 12
3 3 NA NA 13 13
4 4 NA NA 1 1
5 5 2 3 NA 2
Data:
data = read.table(text = "id col1 col2 col3
1 10 NA NA
2 NA 12 NA
3 NA NA 13
4 NA NA 1
5 2 3 NA", header = T)
Notes:
!!! shouldn't be confused with the negation operator ! (an understandable confusion). It is the splice operator from rlang, which dplyr and the rest of the tidyverse support, and it enables explicit splicing.
What this means is that instead of inputting the entire data frame into coalesce (coalesce(data[-1])), I am separating the columns of data[-1] (or elements of the list) and have each element as an input to coalesce. So this:
coalesce(!!!data[-1])
is actually equivalent to this:
coalesce(col1, col2, col3)
The advantage of writing it this way is that you don't have to know the column names nor how many columns there are to begin with.
Using dplyr::coalesce, or any of the answers at How to implement coalesce in R?:
xx$col4 = with(xx, coalesce(col1, col2, col3))
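If dplyr is not available, the same "first non-NA value" idea can be sketched in base R with Reduce(). Note that coalesce2 is a hypothetical two-argument helper written for this example, not a function from any package, and the sketch assumes all value columns have the same type.

```r
# Data from the question
data <- data.frame(
  id   = 1:5,
  col1 = c(10, NA, NA, NA, 2),
  col2 = c(NA, 12, NA, NA, 3),
  col3 = c(NA, NA, 13, 1, NA)
)

# Hypothetical helper: take x where it is not NA, otherwise y
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

# Fold the helper over all columns except id, left to right
data$col4 <- Reduce(coalesce2, data[-1])
print(data)
```

As with the spliced coalesce call, this does not require knowing the column names or how many columns there are.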
This question already has answers here:
Deleting columns from a data.frame where NA is more than 15% of the column length [duplicate]
(2 answers)
Closed 5 years ago.
I have a CSV file with headers. Some of the features (columns) are factors, some are numeric.
Many of the factor columns contain a lot of NAs, e.g.:
Num1 Fact1 Num2 Fact2 Fact3
9889 Bla 23 BBxv NA
NA NA 456 BBxz NA
NA Abcd 3 BBxx Jet
NA NA 100 BBxy NA
NA NA NA NA NA
I want to remove all factor columns that contain more than 50% NAs.
e.g. the resulting data frame should be:
Num1 Num2 Fact2
9889 23 BBxv
NA 456 BBxz
NA 3 BBxx
NA 100 BBxy
NA NA NA
Moreover, is there a way to also remove numeric columns with more than 50% NAs in the same step?
For example, after that cleanup the resulting data frame would contain only the Num2 and Fact2 columns.
Try:
dff[colMeans(is.na(dff)) <= 0.5]
Should get:
Num2 Fact2
23 BBxv
456 BBxz
3 BBxx
100 BBxy
NA <NA>
Edit:
If you're also looking to remove columns with more than 50% zeros in the same step, give the following a try:
dff[colMeans(is.na(dff)) <= 0.5 & colMeans((dff == 0), na.rm = TRUE) <= 0.5]
I hope this helps.
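To cover the first part of the question as well (dropping only factor columns), here is a sketch on the example data. It assumes the categorical columns are stored as factors; if they were read in as character, is.character would be needed instead. The names na_share, is_fact, factors_only and both are illustrative.

```r
# Example data from the question
dff <- data.frame(
  Num1  = c(9889, NA, NA, NA, NA),
  Fact1 = factor(c("Bla", NA, "Abcd", NA, NA)),
  Num2  = c(23, 456, 3, 100, NA),
  Fact2 = factor(c("BBxv", "BBxz", "BBxx", "BBxy", NA)),
  Fact3 = factor(c(NA, NA, "Jet", NA, NA))
)

# Share of NA values in each column
na_share <- colMeans(is.na(dff))

# Which columns are factors
is_fact <- vapply(dff, is.factor, logical(1))

# Drop only the factor columns with more than 50% NAs
factors_only <- dff[!(is_fact & na_share > 0.5)]
print(names(factors_only))

# Drop every column with more than 50% NAs, regardless of type
both <- dff[na_share <= 0.5]
print(names(both))
```

The first result keeps Num1 because it is numeric, while the second applies the same threshold to all columns.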