remove rows from dataframe based on value, ignoring NAs - r

I have a dataframe that I would like to delete rows from, based on a value in a specific column.
As an example, the dataframe appears something like this:
a b c d
1 1 2 3 0
2 4 NA 1 NA
3 6 4 0 1
4 NA 5 0 0
I would like to remove all rows with a value greater than 0 in column d. I have been trying to use the following code to do this:
df <- df[!df$d > 0, ]
but this appears to have the effect of deleting all the values in rows with an NA value in column d. I assumed that an na.rm = TRUE argument was needed, but I wasn't sure where to fit it into the code above.
Cheers,
Ant

We need to select the rows where d is not greater than 0 OR there is NA in d
df[with(df, !d > 0 | is.na(d)), ]
# a b c d
#1 1 2 3 0
#2 4 NA 1 NA
#4 NA 5 0 0
Or we can also use subset
subset(df, !d > 0 | is.na(d))
or dplyr filter
library(dplyr)
df %>% filter(!d > 0 | is.na(d))
The !d > 0 part can also be reversed to
subset(df, d <= 0 | is.na(d))
to get the same result.
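As a minimal sketch of why the original attempt misbehaves (using the example data from the question, rebuilt here with data.frame): comparing NA with a number gives NA, and an NA in a row index returns a row filled with NA rather than keeping or dropping the original row. The names keep, bad and good are only illustrative.

```r
# Example data from the question
df <- data.frame(a = c(1L, 4L, 6L, NA), b = c(2L, NA, 4L, 5L),
                 c = c(3L, 1L, 0L, 0L), d = c(0L, NA, 1L, 0L))

# NA > 0 is NA, so the logical index contains an NA for row 2
keep <- !df$d > 0
keep

# Indexing with that NA yields a row where every column is NA
bad <- df[keep, ]
bad

# Treating NA explicitly as "keep" preserves the original row 2
good <- df[keep | is.na(df$d), ]
good
```

This is why no na.rm argument applies here: the NA has to be handled in the logical condition itself.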

We can construct the logical vector with complete.cases
subset(df, !d > 0 | !complete.cases(d))
# a b c d
#1 1 2 3 0
#2 4 NA 1 NA
#4 NA 5 0 0
Or use subset with replace
subset(df, !replace(d, is.na(d), 0) > 0)
Or with tidyverse
library(tidyverse)
df %>%
  filter(!replace_na(d, 0) > 0)
data
df <- structure(list(a = c(1L, 4L, 6L, NA), b = c(2L, NA, 4L, 5L),
c = c(3L, 1L, 0L, 0L), d = c(0L, NA, 1L, 0L)), class = "data.frame",
row.names = c("1", "2", "3", "4"))

If you add an | with is.na(), all rows that have an NA in d will match. The condition !df$d > 0 then only decides for the values in d that are not NA. So I think you were looking for:
df[is.na(df$d) | !df$d > 0, ]
Whereas the version below won't include the rows that have an NA in column d, nor the rows that do not match the condition !df$d > 0:
df[!is.na(df$d) & !df$d > 0, ]
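A small sketch comparing the two variants on the question's data, assuming the same df as above. The which() line is an extra option not mentioned in the answers: which() silently drops NA positions, so it behaves like the second variant.

```r
df <- data.frame(a = c(1L, 4L, 6L, NA), b = c(2L, NA, 4L, 5L),
                 c = c(3L, 1L, 0L, 0L), d = c(0L, NA, 1L, 0L))

# Keeps rows where d is NA as well as rows where d is not greater than 0
with_na <- df[is.na(df$d) | !df$d > 0, ]

# Drops rows where d is NA and rows where d is greater than 0
without_na <- df[!is.na(df$d) & !df$d > 0, ]

# which() returns only the positions that are TRUE, ignoring NA
via_which <- df[which(!df$d > 0), ]

rownames(with_na)
rownames(without_na)
rownames(via_which)
```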

Related

Grouping first few rows with positive value followed by another group with negative values and so on using R

I have a dataframe that looks like this:
name strand
thrL 1
thrA 1
thrB 1
yaaA -1
yaaJ -1
talB 1
mog 1
I would like to put the first few positive values into one group, the following negative values into another group, and the next positive values into a further group, so that it looks like this:
name strand direction
thrL 1 1
thrA 1 1
thrB 1 1
yaaA -1 2
yaaJ -1 2
talB 1 3
mog 1 3
I am thinking of using dplyr, but I need some help with the R code. Thank you so much.
Using rle:
df1$direction <- with(rle(sign(df1$strand)), rep(seq_along(values), lengths))
df1
# name strand direction
#1 thrL 1 1
#2 thrA 1 1
#3 thrB 1 1
#4 yaaA -1 2
#5 yaaJ -1 2
#6 talB 1 3
#7 mog 1 3
This can be made shorter with rleid from data.table.
df1$direction <- data.table::rleid(sign(df1$strand))
We can also do this as
df1$direction <- inverse.rle(within.list(rle(sign(df1$strand)),
                             values <- seq_along(values)))
df1$direction
#[1] 1 1 1 2 2 3 3
data
df1 <- structure(list(name = c("thrL", "thrA", "thrB", "yaaA", "yaaJ",
"talB", "mog"), strand = c(1L, 1L, 1L, -1L, -1L, 1L, 1L)),
class = "data.frame", row.names = c(NA, -7L))
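To make the rle approach less opaque, here is a sketch that unpacks its pieces on the same data. The cumsum variant at the end is an alternative way to number the runs and is not taken from the answers above; runs, by_rle and by_cumsum are illustrative names.

```r
df1 <- data.frame(name = c("thrL", "thrA", "thrB", "yaaA", "yaaJ", "talB", "mog"),
                  strand = c(1L, 1L, 1L, -1L, -1L, 1L, 1L))

# rle() compresses the sign vector into runs: a value and how often it repeats
runs <- rle(sign(df1$strand))
runs$values
runs$lengths

# Number the runs 1, 2, 3, ... and repeat each number by its run length
by_rle <- rep(seq_along(runs$values), runs$lengths)

# Alternative: start a new group wherever the sign changes from the previous row
by_cumsum <- cumsum(c(TRUE, diff(sign(df1$strand)) != 0))

df1$direction <- by_rle
df1
```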

how to change Na with other columns?

I have 2 indicators:
licence age.6-17
Na 1
1 0
Na 0
0 1
How can I change Na to 1 if a person is more than 17 years old (that is, the second column is 0) and to 0 otherwise?
output
licence age.6-17
0 1
1 0
1 0
0 1
Using dplyr and ifelse:
yourdata %>% mutate(licence = ifelse(`age.6-17` == 0, 1, 0))
There is no need to change the nature of "Na" or the column name. Note that this overwrites every value of licence, which gives the desired output for the example data.
In addition, in case you need to replace only the "Na" cells and keep the other values, considering "Na" is a string here:
yourdata %>% mutate(licence = as.numeric(ifelse(licence == "Na", ifelse(`age.6-17` == 0, 1, 0), licence)))
If, however, it is <NA>, you would need is.na(licence) instead of licence == "Na".
In base you can subset with is.na and then subtract the value of age.6.17 from 1.
x <- read.table(header=TRUE, na.strings="Na", text="licence age.6-17
Na 1
1 0
Na 0
0 1")
idx <- is.na(x$licence)
x$licence[idx] <- 1-x$age.6.17[idx]
x
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
Or, in case you want to ignore what is actually stored in the column licence, you can use:
with(x, data.frame(licence=1-age.6.17, age.6.17))
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
Assuming your NAs are actual NA we can use case_when in dplyr and apply the conditions.
library(dplyr)
df %>%
  mutate(licence = case_when(is.na(licence) & age.6.17 == 0 ~ 1L,
                             is.na(licence) & age.6.17 == 1 ~ 0L,
                             TRUE ~ licence))
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
data
df <- structure(list(licence = c(NA, 1L, NA, 0L), age.6.17 = c(1L,
0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -4L))
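For comparison, a base R sketch of the same idea, assuming the missing values are real NA as in the data above: ifelse fills only the NA cells from the age indicator and leaves existing licence values untouched.

```r
df <- data.frame(licence = c(NA, 1L, NA, 0L), age.6.17 = c(1L, 0L, 0L, 1L))

# Where licence is NA use 1 - age.6.17 (1 for adults, 0 for ages 6-17),
# otherwise keep the value that is already there
df$licence <- ifelse(is.na(df$licence), 1L - df$age.6.17, df$licence)
df
```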

R programming, subsetting data, and plotting graphs

R - I have a dataframe with 0 and 1 in a column. I found the row indices at which the toggling takes place; now I want to split the data into pieces at these particular row IDs.
This is the data:
row id mode
1 0
2 0
3 1
4 1
5 0
6 0
7 0
8 1
9 1
10 1
After splitting dataframe there should be 4 new dataframes:
y[1] :
row id mode
1 0
2 0
y[2]
row id mode
3 1
4 1
y[3]
row id mode
5 0
6 0
7 0
And so on.
We can create a grouping variable based on the difference of adjacent elements in 'mode' and split the dataset based on that
split(df1, cumsum(c(TRUE, diff(df1$mode)!=0)))
#$`1`
# row id mode
#1 1 0
#2 2 0
#$`2`
# row id mode
#3 3 1
#4 4 1
#$`3`
# row id mode
#5 5 0
#6 6 0
#7 7 0
#$`4`
# row id mode
#8 8 1
#9 9 1
#10 10 1
Or another option is to use rleid from data.table
library(data.table)
split(df1, rleid(df1$mode))
Or using rle from base R
split(df1, with(rle(df1$mode), rep(seq_along(values), lengths)))
data
df1 <- structure(list(`row id` = 1:10, mode = c(0L, 0L, 1L, 1L, 0L,
0L, 0L, 1L, 1L, 1L)), .Names = c("row id", "mode"),
class = "data.frame", row.names = c(NA, -10L))
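A short sketch that shows the intermediate steps of the diff/cumsum grouping and how to pull out one of the resulting pieces. The names changed, grp and y are illustrative; y matches the naming used in the question.

```r
df1 <- data.frame(`row id` = 1:10,
                  mode = c(0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L),
                  check.names = FALSE)

# TRUE marks the first row of each run (row 1, plus every row where mode toggles)
changed <- c(TRUE, diff(df1$mode) != 0)

# The running count of toggles gives one group id per run
grp <- cumsum(changed)
grp

# split() returns a named list of data frames, one per run
y <- split(df1, grp)
length(y)
y[[3]]
```

Use double brackets (y[[3]]) to get the data frame itself; y[3] would return a list of length one.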

Using table() to create 3 variable frequency table in R

I'm new to R and seeking some help. I understand the following problem is fairly simple and have looked for similar questions. None give quite the answer I'm looking for - any help would be appreciated.
The problem:
Producing a frequency table using the table() function for three variables with data in the format:
Var1 Var2 Var3
1 0 1 0
2 0 1 0
3 1 1 1
4 0 0 1
Where, 0 = "No" and 1 = "Yes"
And the final table is in the following format with variables and values labelled:
Var3
Yes No
Var1 Yes 1 0
No 1 2
Var2 Yes 1 2
No 1 0
What I have tried so far:
Using the following code I'm able to produce a 2 variable table, with labels for the variables but not for the values (ie. No and Yes).
table(data$Var1, data$Var3, dnn = c("Var1", "Var3"))
It looks like this:
Var3
Var1 0 1
0 2 1
1 0 1
In trying to label the row and column values (0 = No and 1= Yes) I understand row.names and responseName can be used, however the following attempt to label row names gives an all arguments must have the same length error.
> table(data$Var1, data$Var2, dnn = c("Var1", "Var2"), row.names = c("No", "Yes"))
I have also tried using ftable() however the shape of the table produced using code below is not correct resulting in incorrect frequencies for the problem. The issue with labeling row & col values persists.
> ftable(data$Var1, data$Var2, data$Var3, dnn = c("Var1", "Var2", "Var3"))
Var3 0 1
Var1 Var2
0 0 0 1
1 2 0
1 0 0 0
1 0 1
Any help in using table() to produce a table of the shape desired would be greatly appreciated.
You could try tabular from library(tables) after changing the labels as shown by @thelatemail
library(tables)
data[] <- lapply(data, factor, levels=1:0, labels=c('Yes', 'No'))
tabular(Var1+Var2~Var3, data=data)
# Var3
# Yes No
#Var1 Yes 1 0
# No 1 2
#Var2 Yes 1 2
# No 1 0
data
data <- structure(list(Var1 = c(0L, 0L, 1L, 0L), Var2 = c(1L, 1L, 1L,
0L), Var3 = c(0L, 0L, 1L, 1L)), .Names = c("Var1", "Var2", "Var3"
), class = "data.frame", row.names = c("1", "2", "3", "4"))
The easiest way is probably to use the reshape2 package. First, you will need to convert your numeric information to factors so that it isn't treated as a number.
data$Var1 <- as.factor(data$Var1)
data$Var2 <- as.factor(data$Var2)
data$Var3 <- as.factor(data$Var3)
Then you can easily just apply table(data) to get the information you want. If you really want to transform it in the format you specified, then pull it as a data.frame and then transform it as required:
df <- as.data.frame(table(data))
library(reshape2)
dcast(df, Var1+Var2 ~ Var3, value.var = "Freq")
This has the output:
Var1 Var2 0 1
1 0 0 0 1
2 0 1 2 0
3 1 0 0 0
4 1 1 0 1
EDIT: You can just use ftable on the data frame once it's all factors:
> ftable(data)
Var3 0 1
Var1 Var2
0 0 0 1
1 2 0
1 0 0 0
1 0 1
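If you would rather stay with table() only, one possible sketch is to label the values as factors, build one two-way table per row variable against Var3, and stack them. This is an assumption about what "the shape desired" means, and the row labels are pasted together rather than printed in two columns; labelled, row_vars and stacked are illustrative names.

```r
data <- data.frame(Var1 = c(0L, 0L, 1L, 0L),
                   Var2 = c(1L, 1L, 1L, 0L),
                   Var3 = c(0L, 0L, 1L, 1L))

# Label the values; levels = 1:0 puts "Yes" before "No" as in the desired table
labelled <- data
labelled[] <- lapply(data, factor, levels = 1:0, labels = c("Yes", "No"))

# One 2 x 2 table per row variable against Var3, converted to plain matrices
row_vars <- c("Var1", "Var2")
tabs <- lapply(row_vars, function(v) unclass(table(labelled[[v]], labelled$Var3)))

# Stack the matrices and prefix each row label with its variable name
stacked <- do.call(rbind, tabs)
rownames(stacked) <- paste(rep(row_vars, each = 2), rownames(stacked))
stacked
```

Because both factor levels are declared up front, combinations with a count of zero still appear in the result.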

Expanding a data.frame by replacing missing values with set of all possible values in R

I want to expand my dataset by replacing each incomplete row with the set of all possible rows. Does anyone have any suggestions for an efficient way to do this?
For example, suppose X and Z can each take values 0 or 1.
Input:
id y x z
1 1 0 0 NA
2 2 1 NA 0
3 3 0 1 1
4 4 1 NA NA
Output:
id y x z
1 1 0 0 0
2 1 0 0 1
3 2 1 0 0
4 2 1 1 0
5 3 0 1 1
6 4 1 0 0
7 4 1 0 1
8 4 1 1 0
9 4 1 1 1
At the moment I just work through the original dataset row by row:
for(i in 1:N){
  if(is.na(temp.dat$x[i]) & !is.na(temp.dat$z[i])){
    # only x is missing: two candidate rows
    augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,3] <- c(0,1)
  }else if(!is.na(temp.dat$x[i]) & is.na(temp.dat$z[i])){
    # only z is missing: two candidate rows
    augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,4] <- c(0,1)
  }else if(is.na(temp.dat$x[i]) & is.na(temp.dat$z[i])){
    # both are missing: four candidate rows
    augment <- matrix(rep(temp.dat[i,],4),ncol=ncol(temp.dat),byrow=TRUE)
    augment[,3] <- c(0,0,1,1)
    augment[,4] <- c(0,1,0,1)
  }
}
You could try:
1. Creating an "indx" with the count of NAs in each row (rowSums(is.na(...))).
2. Using the "indx" to expand the rows of the original dataset (df[rep(1:nrow...)).
3. Looping over the "indx" (sapply), using it as the "times" argument in rep, and calling expand.grid on the values 0, 1 to create the "lst".
4. Splitting the expanded dataset, "df1", by "id" to create "lst2".
5. Using Map to replace the corresponding NA values in "lst2" with the values in "lst".
6. Combining the list elements with rbind.
indx <- rowSums(is.na(df[-1]))
df1 <- df[rep(1:nrow(df), 2^indx),]
lst <- sapply(indx, function(x) expand.grid(rep(list(0:1), x)))
lst2 <- split(df1, df1$id)
res <- do.call(rbind, Map(function(x, y) {x[is.na(x)] <- as.matrix(y); x},
                          lst2, lst))
row.names(res) <- NULL
res
# id y x z
#1 1 0 0 0
#2 1 0 0 1
#3 2 1 0 0
#4 2 1 1 0
#5 3 0 1 1
#6 4 1 0 0
#7 4 1 1 0
#8 4 1 0 1
#9 4 1 1 1
data
df <- structure(list(id = 1:4, y = c(0L, 1L, 0L, 1L), x = c(0L, NA,
1L, NA), z = c(NA, 0L, 1L, NA)), .Names = c("id", "y", "x", "z"
), class = "data.frame", row.names = c("1", "2", "3", "4"))
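An alternative sketch, not taken from the answer above, expands each row on its own: every NA is replaced by the candidate values 0:1 and expand.grid builds all combinations for that row. It assumes, as in the question, that every column that can be missing takes only the values 0 and 1; expand_row is a hypothetical helper name.

```r
df <- data.frame(id = 1:4, y = c(0L, 1L, 0L, 1L),
                 x = c(0L, NA, 1L, NA), z = c(NA, 0L, 1L, NA))

# For a one-row data frame, swap each NA for the candidates 0:1 and
# let expand.grid enumerate every combination of the columns
expand_row <- function(row) {
  vals <- lapply(row, function(v) if (is.na(v)) 0:1 else v)
  expand.grid(vals, KEEP.OUT.ATTRS = FALSE)
}

# Expand each row in turn and stack the pieces
res <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) expand_row(df[i, ])))
rownames(res) <- NULL
res
```

For very large data the row-by-row rbind may be slow, so the vectorised approach in the answer is likely the better fit there.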
