Using ifelse to remove unwanted rows from the dataset in R

I have a dataset where I want to remove the occurrences of month 11 in the first observation year for a couple of my individuals. Is it possible to do this with ifelse()? Something like:
ifelse(ID=="1" & Month=="11" and Year=="2006", "remove these rows",
ifelse(ID=="2" & Month=="11" & Year=="2007", "remove these rows",
"nothing"))
As always, all help appreciated! :)

You don't even need ifelse() if all you want is an indicator of which rows to remove.
ind <- (Month == "11") &
       ((ID == "1" & Year == "2006") | (ID == "2" & Year == "2007"))
ind will contain a TRUE if Month is "11" and if either of the other two subclauses is TRUE.
Then you can drop those rows using !ind in any subset operation, via [ or subset().
dat <- data.frame(ID = rep(c("1","2"), each = 72),
                  Year = rep(c("2006","2007","2008"), each = 24),
                  Month = rep(as.character(1:12), times = 3))
ind <- with(dat, (Month == "11") & ((ID == "1" & Year == "2006") |
                                    (ID == "2" & Year == "2007")))
ind
dat2 <- dat[!ind, ]
Which gives
R> ind
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
R> dat2 <- dat[!ind, ]
R> nrow(dat)
[1] 144
R> nrow(dat2)
[1] 140
which is correct in terms of the example data.
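For completeness, the same drop can be written with subset(), which evaluates the condition inside dat so the with() wrapper isn't needed; this is just the equivalent of the indexing above:
dat2 <- subset(dat, !((Month == "11") & ((ID == "1" & Year == "2006") |
                                         (ID == "2" & Year == "2007"))))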

Here is a data.table solution, which will be time and memory efficient (and needs slightly less code). It will scale well to big data sets.
Assuming the columns are integer rather than factor:
library(data.table)
DT <- data.table(ID = rep(1:2, each = 72),
                 Year = rep(2006:2008, each = 24),
                 Month = rep(1:12, times = 3))
# or you could use: DT <- as.data.table(dat)
setkey(DT, ID, Year, Month)
DT[-DT[J(1:2, 2006:2007, 11), which = TRUE]]
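If memory serves, a not-join does the same thing in one step on the keyed table, avoiding the intermediate which = TRUE lookup; treat this as a sketch rather than a drop-in replacement:
# drop the rows matching (1, 2006, 11) and (2, 2007, 11)
DT[!J(1:2, 2006:2007, 11)]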

Related

How can I remove rows with same elements in all columns of a dataframe?

I have a dataframe with the following elements:
> x[3536:3540,]
V1 V2
3536 2 6
3537 13 6
3538 9 6
3539 6 6
3540 2 2
I want to remove rows with the same elements in all columns.
My desired result is as follows:
> x[3536:3540,]
V1 V2
3536 2 6
3537 13 6
3538 9 6
I tried this:
x <- x[,1] != x[,2]
But I only get boolean values for each row, not a matrix of the rows whose columns differ:
> x
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[29] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[43] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[57] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[71] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[113] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Any help would be greatly appreciated.
You need to subset/filter:
Base R:
x_new <- x[x$V1 != x$V2,]
dplyr:
library(dplyr)
x_new <- x %>%
filter(V1 != V2)
Result:
x_new
V1 V2
2 1 2
3 1 3
Data:
x <- data.frame(
V1 = c(1,1,1,1),
V2 = c(1,2,3,1)
)
The following assumes you want to subset within the specific rows shown in the original post.
library(data.table)
setDT(df)
df <- df[3536:3540][V1 != V2]
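If there are more than two columns and the goal is still to drop rows where all entries are identical, one possible sketch (note that apply() coerces the data frame to a matrix, so the columns should be of comparable types):
keep <- apply(x, 1, function(r) length(unique(r)) > 1)
x_new <- x[keep, ]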

Subsetting by index in R

I have a vector with indexes:
indexes
[1] 25 2 16 23
and another vector with logical:
logical
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
I want to keep all items of logical except those at the positions stored in indexes.
I thought this would have an easy solution, but mine doesn't work:
for(index in indexes){
  logical[index] = NULL
}
You could just use minus (-) indexing:
indexes <- c(25, 2, 16, 23)
logicals <- sample(c(T,F),25,replace=T)
logicals
#> [1] FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
#> [13] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
#> [25] FALSE
logicals[-indexes]
#> [1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE
#> [13] FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
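One small caveat: if indexes can ever be empty, logicals[-integer(0)] selects nothing rather than everything, because a zero-length index (negative or not) returns a zero-length result. A sketch of a mask-based equivalent that handles that case:
logicals[!seq_along(logicals) %in% indexes]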

Unexpected results from str_detect()

str_detect(c("abc", "xyz"), letters)) does not return expected results.
It should be a vector of
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE TRUE TRUE TRUE
But instead it returns
str_detect(c("abc", "xyz"), letters))
[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE TRUE FALSE TRUE
Why? And how do I get the desired result?
The reason for this is that str_detect recycles its arguments. It's comparing "abc" against "a", then "xyz" against "b", then "abc" against "c", and so on. You should paste "abc" and "xyz" together into a single string, or just supply "abcxyz" directly, but I'm assuming this might be a simplified version of a more complex issue.
library(stringr)
rgx <- paste0(c("abc", "xyz"), collapse = "")
str_detect(rgx, letters)
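An alternative sketch that avoids collapsing the strings: check each letter against all of the strings and combine with any(). This sidesteps the (theoretical) risk that a multi-character pattern could match across the boundary of the pasted string:
library(stringr)
sapply(letters, function(l) any(str_detect(c("abc", "xyz"), fixed(l))))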

Splitting data based on time condition

I have rows of data that are seconds apart, but I found some anomalies: the difference between some rows is 30 minutes or more. I want to split my data into multiple data frames at that condition, i.e. loop through my data frame and split wherever the time difference is above 30 minutes. I've tried this already, but it splits my data into one-row data frames.
RBD <- function(x){
  i <- 0
  while(i < length(data$Time)){
    if(data$Time[i+1] - data$Time[i] > 60*30){
      rb <- 1
    }
    else{
      rb <- 0
    }
    i <- i+1
  }
}
ListData <- Data %>%
  group_by(Data$temp) %>%
  transmute(ind = all((RBD = 1))) %>%
  .$ind
names(ListData) <- paste0('Data', seq_along(ListData))
split(Data, ListData)
My data looks like this: [screenshot of the data frame]
There's a very helpful function in base R, diff, which can do the heavy lifting for you. If this doesn't work for you, try posting a reprex and I'll see if I can help you troubleshoot.
Let's simulate some data:
library(dplyr)  # for lag(); base R's lag() does not shift a plain vector

set.seed(123)
x <- sample(1200, 100)
x <- x + sample(c(0, 0, 0, 0, 2400), 100, replace = TRUE)

RBD <- function(x){
  # flag positions whose preceding value exceeds 30 minutes (1800 seconds)
  res <- lag(x) > 60*30
  res[1] <- FALSE  # lag() puts an NA in the first position
  res
}
RBD(x)
# [1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
# [13] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# [25] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# [37] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
# [49] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [73] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
# [85] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
# [97] FALSE FALSE FALSE FALSE
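To actually split into a list of data frames, one way is to turn the flags into a running block id with cumsum() and pass that to split(). A minimal sketch, assuming a data frame df with a sorted POSIXct column Time (hypothetical names, not from the original post):
gap <- c(FALSE, diff(as.numeric(df$Time)) > 60 * 30)  # TRUE where a new block starts
block <- cumsum(gap)                                  # running block id
ListData <- split(df, block)                          # one data frame per gap-delimited block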

Create a logical or binary matrix/data.frame from a list of factors in R

I have a list of approximately 2 million elements. The list is made up of vectors of character strings. There are about 50 different character strings, so they can be considered factors. The vectors of character strings have different lengths, varying between 1 and 50 (i.e. the total number of character strings).
I want to convert the list to a logical or binary matrix/data.frame. Currently my method involves lapply and is incredibly slow; I would like to know if there is a vectorised approach.
require(dplyr); require(tidyr)
# create test data set
set.seed(123)
list1 <- list()
ListLength <- 10
elementlength <- sample(1:5, ListLength, replace = TRUE)
for(i in 1:length(elementlength)){
  list1[[i]] <- sample(letters[1:15], elementlength[i])
}
# Create data frame from list using lapply
lapply(list1, function(n){
  data.frame(type = n, value = TRUE) %>%
    spread(., key = type, value)
}) %>% bind_rows()
I don't know if there is a faster way, perhaps by preallocating the data frame and then filling it in somehow.
Type <- unique(unlist(list1, use.names = FALSE))
# Create empty data frame
TypeMat <- data.frame(matrix(NA,
                             ncol = length(Type),
                             nrow = ListLength)) %>%
  setNames(Type)
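For what it's worth, a minimal sketch of that preallocate-and-fill idea, using the Type and TypeMat objects defined above:
for (i in seq_len(ListLength)) {
  TypeMat[i, ] <- Type %in% list1[[i]]
}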
We could use mtabulate from qdapTools:
library(qdapTools)
mtabulate(list1) != 0
# a b c d e f g h i j k l m o
#[1,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#[2,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
#[3,] TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
#[5,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
#[6,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
#[8,] TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
#[9,] FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[10,] FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
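A base-R alternative sketch, in case adding a package is not an option: tabulate the unlisted values against the list element they came from, then convert the counts to logical:
tab <- table(rep(seq_along(list1), lengths(list1)), unlist(list1, use.names = FALSE))
tab > 0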
