R: How to replace NA values under a specific condition [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 4 years ago.
Hi everyone. I want to replace each NA value in block with the non-NA value for the same participant. I tried this, but it returns the original df, and I don't know what went wrong.
df = data.frame(block = c('1',NA,NA,'2',NA,NA,'3',NA,NA),
                subject = c('31','31','31','32','32','32','33','33','33'))
df[df$subject == 1 & is.na(df$block)] = df[df$subject == 31 &!is.na(df$block)]
# define a for loop from 1 to n subjects
for (i in 1:length(unique(df$subject))) {
  # replace the NA in block with the non-NA block value for the same participant
  df[df$subject == i & is.na(df$block)] = df[df$subject == i & !is.na(df$block)]
}
Here is what I want to get: every NA in block replaced by the non-NA block value of the same subject.

Using the dplyr and zoo libraries, I replaced the NA values in the block column with the previous non-NA value (last observation carried forward):
library(dplyr)
library(zoo)
df2 <- df %>%
  do(na.locf(.))
The end result looks as follows:
df2
block subject
1 1 31
2 1 31
3 1 31
4 2 32
5 2 32
6 2 32
7 3 33
8 3 33
9 3 33
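Note that the na.locf call above is not grouped by subject; it gives the right result here because the non-NA value happens to sit in the first row of each subject. As a rough sketch of a per-subject fill in base R, assuming every subject has one non-NA block value somewhere in its rows (first_non_na is just a helper name made up for this example):

```r
df <- data.frame(
  block   = c("1", NA, NA, "2", NA, NA, "3", NA, NA),
  subject = c("31", "31", "31", "32", "32", "32", "33", "33", "33")
)

# Hypothetical helper: the first non-NA value of a vector (NA if there is none).
first_non_na <- function(x) x[!is.na(x)][1]

# ave() applies the helper within each subject and recycles the single
# value it returns over all rows of that subject.
df$block <- ave(df$block, df$subject, FUN = first_non_na)
df
```

Unlike carrying the last observation forward, this does not depend on where the non-NA row sits within a subject.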

Related

R - Select all rows that have one NA value at most? [duplicate]

This question already has answers here:
How to delete rows from a dataframe that contain n*NA
(4 answers)
Closed 3 days ago.
I'm trying to impute my data and keep as many observations as I can. I want to select the observations that have at most one NA value from the PimaIndiansDiabetes2 data set in the mlbench package, loaded with data(PimaIndiansDiabetes2, package = "mlbench").
For example:
Var1 Var2 Var3
1 NA NA
2 34 NA
3 NA NA
4 NA 55
5 NA NA
6 40 28
What I would like returned:
Var1 Var2 Var3
2 34 NA
4 NA 55
6 40 28
The code below returns the rows with NA values, and I know that I could use merge() to join the observations with one NA value to the observations without NA values. I'm not sure how to extract those, though.
na_rows <- df[!complete.cases(df), ]
A base R solution:
df[rowSums(is.na(df)) <= 1, ]
Its dplyr equivalent (pick() requires dplyr 1.1.0 or later):
library(dplyr)
df %>%
filter(rowSums(is.na(pick(everything()))) <= 1)
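To see what the base R filter does, here is a small sketch on the example from the question. The data frame is rebuilt by hand, on the assumption that the first column of the printed example is Var1 (with no row names shown); na_count and kept are names chosen for this example.

```r
df <- data.frame(
  Var1 = 1:6,
  Var2 = c(NA, 34, NA, NA, NA, 40),
  Var3 = c(NA, NA, NA, 55, NA, 28)
)

# is.na(df) is a logical matrix; rowSums() counts the TRUEs in each row.
na_count <- rowSums(is.na(df))

# Keep the rows with at most one missing value.
kept <- df[na_count <= 1, ]
kept
```

Rows 2, 4 and 6 remain, which matches the output asked for in the question.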

Replacing NAs with values in the previous row for each group [duplicate]

This question already has answers here:
Replace missing values (NA) with most recent non-NA by group
(7 answers)
Closed 4 years ago.
In df I would like to replace the NA values with the previous non-NA value for each id:
id<-c(1,1,1,1,2,2,2)
purchase<-c(20,NA,NA,10,NA,NA,5)
df<-data.frame(id,purchase)
id purchase
1 20
1 NA
1 NA
1 10
2 NA
2 NA
2 5
The output should ideally look like:
id purchase
1 20
1 20
1 20
1 10
2 NA
2 NA
2 5
I am aware of Replacing NAs with latest non-NA value, but it does not do it per group.
Any help would be appreciated.
Three ways (so far), all applying zoo::na.locf per group. One thing to note is that you need na.rm = FALSE; otherwise zoo::na.locf drops leading NAs and returns a shortened vector (as happens where id is 2).
Base R
do.call("rbind.data.frame",
by(df, df$id, function(x) transform(x, purchase = zoo::na.locf(purchase, na.rm=FALSE))))
# id purchase
# 1.1 1 20
# 1.2 1 20
# 1.3 1 20
# 1.4 1 10
# 2.5 2 NA
# 2.6 2 NA
# 2.7 2 5
dplyr
library(dplyr)
df %>%
group_by(id) %>%
mutate(purchase = zoo::na.locf(purchase, na.rm = FALSE))
data.table
library(data.table)
DT <- as.data.table(df)
DT[, purchase := zoo::na.locf(purchase, na.rm = FALSE), by = "id" ]
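If you would rather not depend on zoo, the same per-group fill can be sketched in base R. This is not what the answers above use; locf below is a hypothetical helper written for this example, and it keeps leading NAs, like na.rm = FALSE does.

```r
id <- c(1, 1, 1, 1, 2, 2, 2)
purchase <- c(20, NA, NA, 10, NA, NA, 5)
df <- data.frame(id, purchase)

# Hypothetical helper: last observation carried forward.
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # how many non-NA values seen so far (0 = none yet)
  c(NA, x[!is.na(x)])[idx + 1]   # position 1 is NA, so leading NAs stay NA
}

# ave() applies locf separately within each id and keeps the row order.
df$purchase <- ave(df$purchase, df$id, FUN = locf)
df
```

The rows for id 2 keep their two leading NAs, as in the desired output.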

Converting "-" to a "0" in R [duplicate]

This question already has answers here:
Replace given value in vector
(8 answers)
Closed 4 years ago.
I have a column Apps in a data frame dframe that looks like this:
Apps
1 31
2 12
3 10
4 33
5 -
I need the column to be of type integer instead of character, so I need to convert the "-" in the 5th row to a 0:
Apps
1 31
2 12
3 10
4 33
5 0
dframe$Apps[dframe$Apps == "-"] <- "0"
dframe$Apps <- as.integer(dframe$Apps)
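You can do it with ifelse and the tidyverse approach. Do the comparison and replacement on character values and convert afterwards, so that you get the numbers themselves rather than factor codes:
library(dplyr)
df %>%
  mutate(Apps = as.integer(ifelse(Apps == "-", "0", as.character(Apps))))
Apps
1 31
2 12
3 10
4 33
5 0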
Dataset:
df <- read.table(text = " Apps
1 31
2 12
3 10
4 33
5 -", header = TRUE)
dframe$Apps <- as.integer(gsub("-", "0", dframe$Apps, fixed = TRUE))
will give you an integer column as I suspect you want.
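One caveat with the gsub approach: with fixed = TRUE it replaces every "-" character, so a negative number would lose its sign. A small sketch of the difference, using made-up values (the "-5" is an assumption for illustration and is not in the question's data):

```r
x <- c("31", "-", "-5")  # made-up values; "-5" is not in the question's data

# Replaces every "-" character, including the minus sign of "-5".
every_dash <- as.integer(gsub("-", "0", x, fixed = TRUE))

# Anchored pattern: replaces only entries that consist of a lone "-".
lone_dash <- as.integer(sub("^-$", "0", x))

every_dash
lone_dash
```

For the data in the question both give the same result; the anchored pattern only matters if the column can hold negative values.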

R - counting with NA in dataframe [duplicate]

This question already has answers here:
ignore NA in dplyr row sum
(6 answers)
Closed 4 years ago.
Let's say that I have this data frame in R:
df <- read.table(text="
id a b c
1 42 3 2 NA
2 42 NA 6 NA
3 42 1 NA 7", header=TRUE)
I'd like to sum the columns a, b and c into one new column, so the result should look like this:
id a b c d
1 42 3 2 NA 5
2 42 NA 6 NA 6
3 42 1 NA 7 8
My code below doesn't work because of the NA values. Please note that I have to choose which columns to sum, since my real data frame has some columns that I don't want to add together.
df %>%
mutate(d = a + b + c)
You can use rowSums for this which has an na.rm parameter to drop NA values.
df %>% mutate(d=rowSums(tibble(a,b,c), na.rm=TRUE))
Or, without dplyr, using just base R:
df$d <- rowSums(subset(df, select=c(a,b,c)), na.rm=TRUE)
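As a quick check, here is the base R version run on the data from the question, followed by one thing to be aware of: a row in which all selected columns are NA sums to 0, not NA. The one-row data frame at the end is made up for illustration.

```r
df <- read.table(text = "
  id a  b  c
1 42 3  2 NA
2 42 NA 6 NA
3 42 1 NA  7", header = TRUE)

# Sum only the chosen columns, skipping NA values.
df$d <- rowSums(df[, c("a", "b", "c")], na.rm = TRUE)
df

# Made-up row where every selected column is missing: the sum is 0.
all_na <- rowSums(data.frame(a = NA_integer_, b = NA_integer_, c = NA_integer_),
                  na.rm = TRUE)
all_na
```

If an all-NA row should stay NA in your real data, you would have to handle that case separately.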

Removing duplicates based on two columns in R [duplicate]

This question already has answers here:
Remove duplicate column pairs, sort rows based on 2 columns [duplicate]
(3 answers)
Closed 7 years ago.
Suppose my data is as follows,
X Y
26 14
26 14
26 15
26 15
27 15
27 15
28 16
28 16
I want to remove the duplicate rows. I am able to remove the duplicate rows based on one column with either of these commands:
dat[c(T, diff(dat$X) != 0), ]
dat[c(T, diff(dat$Y) != 0), ]
But I want to remove a row only when both columns have the same values as in the previous row. I can't use unique here because the same values may occur again later on, so I want to compare each row with the one before it.
My sample output is,
x y
26 14
26 15
27 15
28 16
How can we do this in R?
Thanks
Ijaz
Using data.table v1.9.5 or later:
require(data.table) # v1.9.5+
df[!duplicated(rleidv(df, cols = c("X", "Y"))), ]
rleidv() is best understood with examples:
rleidv(c(1,1,1,2,2,3,1,1))
# [1] 1 1 1 2 2 3 4 4
A unique index is generated for each consecutive run of values.
And the same can be accomplished on a list() or data.frame() or data.table() on a specific set of columns as well. For example:
df = data.frame(a = c(1,1,2,2,1), b = c(2,3,4,4,2))
rleidv(df) # computes on both columns 'a,b'
# [1] 1 2 3 3 4
rleidv(df, cols = "a") # only looks at 'a'
# [1] 1 1 2 2 3
The rest should be fairly obvious. We just check for duplicated() values, and return the non-duplicated ones.
Using dplyr:
library(dplyr)
dat %>% filter(X != lag(X) | Y != lag(Y) | row_number() == 1)
We need to include row_number() == 1, or we lose the first row: lag() returns NA there, and filter() drops rows where the condition is NA.
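The same "compare with the previous row" idea can also be sketched in base R without any packages, extending the one-column diff() trick from the question to both columns. The data frame is rebuilt from the question's example; changed and res are names chosen for this example.

```r
dat <- data.frame(
  X = c(26, 26, 26, 26, 27, 27, 28, 28),
  Y = c(14, 14, 15, 15, 15, 15, 16, 16)
)
n <- nrow(dat)

# TRUE where a row differs from the row directly above it in X or in Y.
changed <- dat$X[-1] != dat$X[-n] | dat$Y[-1] != dat$Y[-n]

# Always keep the first row, then every row that starts a new run.
res <- dat[c(TRUE, changed), ]
res
```

Because only neighbouring rows are compared, a pair of values that shows up again later in the data is kept, which is why unique() would not be suitable here.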
