Keeping certain rows in a data frame with a condition in R

I have a data frame in R from which I want to remove certain rows, provided they match certain conditions. How can I do it?
I have tried using dplyr and ifelse, but my code does not give the right answer:
check8 <- distinct(df5, prod, .keep_all = TRUE)
This does not work; it returns the entire data set.
Input is:
check1 <- data.frame(ID = c(1,1,2,2,2,3,4),
                     prod = c("R","T","R","T",NA,"T","R"),
                     bad = c(0,0,0,1,0,1,0))
# ID prod bad
# 1 1 R 0
# 2 1 T 0
# 3 2 R 0
# 4 2 T 1
# 5 2 <NA> 0
# 6 3 T 1
# 7 4 R 0
Output expected:
data.frame(ID = c(1,2,3,4),
           prod = c("R","R","T","R"),
           bad = c(0,0,1,0))
# ID prod bad
# 1 1 R 0
# 2 2 R 0
# 3 3 T 1
# 4 4 R 0
I want the output such that, for IDs with more than one row (including rows where prod is NA), only the rows with prod R are kept; but if an ID has only one row, that row is kept regardless of its prod.

Using dplyr, we can use filter to select rows where prod == "R", or, when there is only one row in the group, select that row.
library(dplyr)
check1 %>%
  group_by(ID) %>%
  filter(prod == "R" | n() == 1)
# ID prod bad
# <dbl> <fct> <dbl>
#1 1 R 0
#2 2 R 0
#3 3 T 1
#4 4 R 0

Here is a solution using an anti_join:
library(dplyr)
check1 <- data.frame(ID = c(1,1,2,2,2,3,4), prod = c("R","T","R","T",NA,"T","R"), bad = c(0,0,0,1,0,1,0))
# First part: select all the IDs which contain 'R' as prod
p1 <- check1 %>%
  group_by(ID) %>%
  filter(prod == 'R')
# Second part: use anti_join to get all the rows from check1 that have
# no matching ID in p1
p2 <- anti_join(check1, p1, by = 'ID')
solution <- bind_rows(p1, p2) %>%
  arrange(ID)
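For comparison, here is a base R sketch of the same logic (my own construction, not from the thread): keep the prod == "R" rows, plus any row whose ID occurs only once.
# %in% treats the NA prod as a non-match, and the per-ID counts
# identify the single-row IDs
n_per_id <- table(check1$ID)
keep <- check1$prod %in% "R" | n_per_id[as.character(check1$ID)] == 1
check1[keep, ]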

Related

For Loop over data frame, using dplyr results in error

I have, simplified, a data frame with 71 columns and N rows. What I want to get is a frequency table of the values in the first column, based on all the other columns (which all contain dummies). Simplified (with only 4 columns), it would look like this:
df <- data.frame(sample(1:8,20,replace=T),sample(0:1,20,replace = T),sample(0:1,20,replace = T),sample(0:1,20,replace = T))
I have tried this for loop with dplyr (where x is the first column with the 8 different values). It works without problems for the first 10 or 11 columns, but after that it only generates NAs and returns the error:
freq_df <- data.frame(matrix(NA, nrow=8, ncol=71))
for (i in 2:71){
  freq_df[,i] <- df %>%
    filter(df[i]==1) %>%
    count(x) %>%
    select(n)
}
Error in `[<-.data.frame`(`*tmp*`, , i, value = list(n = c(3L, 5L, 8L, :
  replacement element 1 has 7 rows, need 8
Does anyone know why R returns this error? Thank you for your help!
Your error occurs because not all first-column values will occur where the other columns are 1. You have 8 unique values in the first column, but maybe only 7 of them remain when you filter on the 11th column == 1. So the results can have different lengths, which is the problem.
Try this instead, I think it's what you're trying to do. (If not, please clarify your goal by showing the expected output.)
names(df) = paste0("V", 1:4)
df %>%
  group_by(V1) %>%
  summarize(across(everything(), sum, .names = "{.col}_count"))
# V1 V2_count V3_count V4_count
# <int> <int> <int> <int>
# 1 1 1 0 1
# 2 2 2 1 2
# 3 3 3 3 2
# 4 4 0 0 0
# 5 5 0 0 0
# 6 6 3 1 2
# 7 7 3 1 1
# 8 8 1 1 0
In base R, we can do
names(df) <- paste0("V", 1:4)
out <- aggregate(.~ V1, df, sum, na.rm = TRUE)
names(out)[-1] <- paste0(names(out)[-1], "_count")
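For completeness, the original loop can also be repaired by tabulating against a fixed set of levels, so that every filtered column yields one count per first-column value (zeros included). A minimal sketch, assuming the first column is named V1 after renaming:
names(df) <- paste0("V", seq_len(ncol(df)))
lvls <- sort(unique(df$V1))
freq_df <- data.frame(matrix(NA_integer_, nrow = length(lvls), ncol = ncol(df)))
freq_df[, 1] <- lvls  # first column holds the values being counted
for (i in 2:ncol(df)) {
  # factor() with fixed levels yields a count (possibly 0) for every value,
  # so the replacement always has the same number of rows
  freq_df[, i] <- as.integer(table(factor(df$V1[df[[i]] == 1], levels = lvls)))
}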

Filtering rows between zero values and save as new dataframes or datatables in R

I have a large csv dataset with more than 45k rows and 19 different variables. I'd like to filter it by a specific variable (V4) so that each filtered group starts with 0 and then the next 0 will mark the start of a new group/dataframe/datatable, while keeping all other variables inside this new table as well. I need those separate groups to further analyse each case of data.
I tried:
filtered_data <- my_data %>%
  group_by("V4") %>%
  filter("V4" == 0 & "V4" != 0)
View(filtered_data)
The first "V4" == 0 seems to work, but I'm struggling with how to define the end of each filtered dataframe, e.g. how to filter from 0 to 3, then 0 to 5, etc.
How can I determine the length of each case? Is there a logical operator that saves each group before V4 turns 0 again? Or would it be better to create a loop?
Example of my_data:
V1  V2  V3  V4  ...  V19
 1           0
 2           1
 3           2
 4           3
 5           0
 6           1
 7           2
 8           3
 9           4
10           5
11           0
...
45k
Here is a way to group your rows with basic arithmetic.
I create the groups using a cumulative sum of an indicator variable (V4 is 0 or not) and split the data.frame into single dataframes using group_split.
library(dplyr)
# example data: 12000 rows in total, 4000 groups of 3 rows
df <- data.frame(V1 = 1:12000,
                 V2 = sample(LETTERS, 12000, replace = T),
                 V4 = rep(0:2, 4000))
df_split <- df %>%
  mutate(Groups = ifelse(V4 == 0, 1, 0),
         Groups = cumsum(Groups)) %>%
  group_split(Groups)
So the first group/dataframe is
> df_split[[1]]
# A tibble: 3 x 4
V1 V2 V4 Groups
<int> <chr> <int> <dbl>
1 1 L 0 1
2 2 L 1 1
3 3 Y 2 1
the second
> df_split[[2]]
# A tibble: 3 x 4
V1 V2 V4 Groups
<int> <chr> <int> <dbl>
1 4 Z 0 2
2 5 N 1 2
3 6 Y 2 2
and so on.
If you want to save each data.frame separately, you could use something like this:
# new environment that holds all data.frames
dfEnv <- new.env()
df %>%
  mutate(Groups = ifelse(V4 == 0, 1, 0),
         Groups = cumsum(Groups)) %>%
  group_by(Groups) %>%
  do({
    # save every group inside the new environment as a single data.frame
    dfEnv[[paste0("Group_", unique(.$Groups))]] <- .
  })
Now you have dfEnv$Group_1, dfEnv$Group_2, ... and so on.
Inside do() you could also use saveRDS or write.csv to save the data to disk.
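For example, a minimal sketch of that idea with write.csv (the file-name pattern is my own illustration):
df %>%
  mutate(Groups = cumsum(V4 == 0)) %>%
  group_by(Groups) %>%
  do({
    # write each group to its own CSV; do() must return a data frame,
    # hence the trailing `.`
    write.csv(., paste0("group_", unique(.$Groups), ".csv"), row.names = FALSE)
    .
  })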

Subset data based on conditional statement

I would like to know if there is a way of combining an ifelse statement and the filter function (from the dplyr package) to subset a data frame. Consider the data
df <- data.frame(id=c(1,1,1,2,2,2,2,3,3),
                 A=c(3,6,2,5,4,3,8,9,8),
                 D1=c(0,0,0,1,1,0,0,0,0),
                 D2=c(1,0,0,0,0,1,1,0,1))
I want to delete rows following D2=1 or D1=D2=0 for each id. The expected output would look like
df <- data.frame(id=c(1,2,2,2,3),
                 A=c(3,5,4,3,9),
                 D1=c(0,1,1,0,0),
                 D2=c(1,0,0,1,0))
I have made several attempts using group_by and the filter function, but it appears conditional statements are needed, and I'm finding it difficult to combine these with filter. I have come across several Q&As on subsetting data (e.g. How to subset data by filtering and grouping efficiently in R), but they do not answer my question. I greatly appreciate any help on this.
In dplyr, you can find the first index where the condition is met and, for each group, select the rows up to and including that index.
library(dplyr)
df %>%
  group_by(id) %>%
  filter(row_number() <= which(D1 == 0 & D2 == 0 | D2 == 1)[1])
# id A D1 D2
# <dbl> <dbl> <dbl> <dbl>
#1 1 3 0 1
#2 2 5 1 0
#3 2 4 1 0
#4 2 3 0 1
#5 3 9 0 0
The above works assuming that at least one row in each group satisfies the condition. For the general case, where none of the rows in a group may satisfy the condition and we want to keep all of that group's rows, we can use:
df %>%
  group_by(id) %>%
  slice({
    inds <- which(D1 == 0 & D2 == 0 | D2 == 1)[1]
    # also keep everything when the first match is the group's last row
    if (!is.na(inds) && inds < n()) -((inds + 1):n()) else seq_len(n())
  })
It doesn't seem like you need to use dplyr here (unless I'm missing something). Try this:
df <- data.frame(id=c(1,1,1,2,2,2,2,3,3),
                 A=c(3,6,2,5,4,3,8,9,8),
                 D1=c(0,0,0,1,1,0,0,0,0),
                 D2=c(1,0,0,0,0,1,1,0,1))
del <- c()
stopped <- FALSE
last_id <- NA
for (i in 1:nrow(df)) {
  if (!identical(df$id[i], last_id)) {  # a new id group starts here
    last_id <- df$id[i]
    stopped <- FALSE
  }
  if (stopped) del <- c(del, i)  # rows after the stop row are deleted
  else if (df$D2[i] == 1 | (df$D1[i] == 0 & df$D2[i] == 0)) stopped <- TRUE
}
if (length(del) > 0) df <- df[-del, ]
Pure dplyr:
df %>%
  group_by(id) %>%
  filter(row_number() == n() | rev(cumany(rev(!(D2 == 1 | (D1 == D2 & D2 == 0))))))
# # A tibble: 5 x 4
# # Groups: id [3]
# id A D1 D2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2 0 0
# 2 2 5 1 0
# 3 2 4 1 0
# 4 2 8 0 1
# 5 3 8 0 1

Filtering rows based on two conditions at the ID level

I have long data where a given subject has 4 observations. I want to include only those ids that meet the following conditions:
has at least one 3
has at least one of 1, 2, or NA
My data structure:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3),
                 a=c(NA,1,2,3, NA,3,2,0, NA,NA,1,1))
My unsuccessful attempt (I get an empty data frame):
df %>% dplyr::group_by(id) %>% filter(a==3 & a %in% c(1,2,NA))
An option is to group by 'id' and create a logical condition that returns a single TRUE/FALSE per group. Based on the OP's post, we need both the value 3 and at least one of the values 1, 2, or NA in column 'a'. So 3 %in% a returns a logical vector of length 1; for the second set, we wrap any() around a comparison with multiple values or a check for NA elements (is.na), and merge both logical outputs with &.
library(dplyr)
df %>%
  group_by(id) %>%
  filter((3 %in% a) & any(c(1, 2) %in% a | is.na(a)))
# A tibble: 8 x 2
# Groups: id [2]
# id a
# <dbl> <dbl>
#1 1 NA
#2 1 1
#3 1 2
#4 1 3
#5 2 NA
#6 2 3
#7 2 2
#8 2 0
I have done this a bit the long way to show how the idea works. You can consolidate it.
df %>%
  group_by(id) %>%
  mutate(has_3 = sum(a == 3, na.rm = T) > 0,
         keep_me = has_3 & (sum(is.na(a)) > 0 | sum(a %in% c(1, 2)) > 0)) %>%
  filter(keep_me == TRUE) %>%
  select(id, a)
id a
<dbl> <dbl>
1 1 NA
2 1 1
3 1 2
4 1 3
5 2 NA
6 2 3
7 2 2
8 2 0
As I read it, the filter should keep ids 1 and 2, so I would use a combination of all/any:
df %>%
  group_by(id) %>%
  filter(all(3 %in% a) & any(c(1,2,NA) %in% a))
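For comparison, a base R sketch of the same per-id logic (my own construction, not from the thread), using ave() to broadcast the group-level condition across each id's rows:
# the logical result is recycled within each id and coerced to 1/0
keep <- with(df, ave(a, id, FUN = function(x) {
  (3 %in% x) && (any(c(1, 2) %in% x) || anyNA(x))
}))
df[keep == 1, ]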

Dummify character column and find unique values [duplicate]

This question already has an answer here: Split a column into multiple binary dummy columns.
I have a dataframe with the following structure
test <- data.frame(col = c('a; ff; cc; rr;', 'rr; a; cc; e;'))
Now I want to create a dataframe from this which contains a named column for each of the unique values in the test dataframe. A unique value is a value terminated by the ';' character and starting after a space (the space not being part of the value). Then, for each of the rows in the column, I wish to fill the dummy columns with either a 1 or a 0, as given below:
data.frame(a = c(1,1), ff = c(1,0), cc = c(1,1), rr = c(1,0), e = c(0,1))
a ff cc rr e
1 1 1 1 1 0
2 1 0 1 1 1
I tried creating a df using for loops and the unique values in the column, but it got too messy. I have a vector available containing the unique values of the column; the problem is how to create the ones and zeros. I tried a mutate_all() call with grep(), but this did not work.
I'd use cSplit from the splitstackshape package and mtabulate from qdapTools to get this as a one-liner, i.e.
library(splitstackshape)
library(qdapTools)
mtabulate(as.data.frame(t(cSplit(test, 'col', sep = ';', 'wide'))))
# a cc ff rr e
#V1 1 1 1 1 0
#V2 1 1 0 1 1
It can also be done entirely with splitstackshape, as @A5C1D2H2I1M1N2O1R2T1 mentions in the comments:
cSplit_e(test, "col", ";", mode = "binary", type = "character", fill = 0)
Here's a possible data.table implementation. First we split the rows into columns, melt into a single column, and then spread it wide while counting the occurrences for each row:
library(data.table)
test2 <- setDT(test)[, tstrsplit(col, "; |;")]
dcast(melt(test2, measure = names(test2)), rowid(variable) ~ value, length)
# variable a cc e ff rr
# 1: 1 1 1 0 1 1
# 2: 2 1 1 1 0 1
Here's a base R approach:
x <- strsplit(as.character(test$col), ";\\s?") # split the strings
lvl <- unique(unlist(x)) # get unique elements
x <- lapply(x, factor, levels = lvl) # convert to factor
t(sapply(x, table)) # count elements and transpose
# a ff cc rr e
#[1,] 1 1 1 1 0
#[2,] 1 0 1 1 1
We can do this with the tidyverse:
library(tidyverse)
rownames_to_column(test, 'grp') %>%
  separate_rows(col) %>%
  filter(col != "") %>%
  count(grp, col) %>%
  spread(col, n, fill = 0) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 2 × 5
# a cc e ff rr
#* <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 0 1 1
#2 1 1 1 0 1
Here is a base R solution. First remove the spaces, then get all the unique values. Split the actual column and check each row for the presence of every unique value; this yields a logical matrix that can easily be converted to numeric.
test <- as.data.frame(apply(test, 2, function(x) gsub('\\s+', '', x)))
cols <- unique(unlist(strsplit(as.character(test$col), split = ';')))
yy <- strsplit(as.character(test$col), split = ';')
z <- as.data.frame(do.call(rbind, lapply(yy, function(x) cols %in% x)))
names(z) <- cols
z <- as.data.frame(lapply(z, as.integer))
Another approach with tidytext and tidyverse
library(tidyverse)
library(tidytext) #for unnest_tokens()
df <- test %>%
  unnest_tokens(word, col) %>%
  rownames_to_column(var = "row") %>%
  mutate(row = floor(parse_number(row)),
         val = 1) %>%
  spread(word, val, fill = 0) %>%
  select(-row)
df
# a cc e ff rr
#1 1 1 0 1 1
#2 1 1 1 0 1
Another simple solution without any extra packages:
x = c('a; ff; cc; rr;', 'rr; a; cc; e;')
G = lapply(strsplit(x,';'), trimws)
dict = sort(unique(unlist(G)))
do.call(rbind, lapply(G, function(g) 1*sapply(dict, function(d) d %in% g)))
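For the example above, this returns a numeric matrix with one column per unique value:
#      a cc e ff rr
# [1,] 1  1 0  1  1
# [2,] 1  1 1  0  1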
