Remove non-similar rows in R data frame with all.equal - r

I want to remove all rows in a data frame where Month and Mo columns are more than 1 month apart. I have heard you can do this with all.equal(df$Month, df$Mo, 1), but it is just returning a string. Is this possible in R?
Row Month Mo
1   1     1
2   2     4  #<- Remove

According to ?all.equal documentation, the return value of all.equal is
Either TRUE (NULL for attr.all.equal) or a vector of mode "character" describing the differences between target and current.
So no, you can't do it with all.equal, as it returns a single value. You can see more details in the docs about what the function does.
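For illustration (a small sketch, not part of the original answer), here is the kind of single value all.equal returns for two numeric vectors:
all.equal(c(1, 2), c(1, 4))
# e.g. "Mean relative difference: 0.6666667" -- one character string, not a row-wise result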
To do what you want, you can use plain R:
d <- data.frame(Row = 1:2, Month = 1:2, Mo = c(1,4)) # your data.frame
# Row Month Mo
# 1 1 1 1
# 2 2 2 4
d[!(abs(d$Month - d$Mo) > 1),] # d without rows where Month and Mo are far apart.
# Row Month Mo
# 1 1 1 1
or equivalently
d[abs(d$Month - d$Mo) <= 1,]

You can also do this with dplyr:
library(dplyr)
df %>%
  filter(Month == Mo | Month == Mo + 1 | Month == Mo - 1)
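Equivalently (a small sketch, not from the original answer), the three comparisons can be collapsed with abs():
library(dplyr)
# keep rows where Month and Mo are at most 1 apart
df %>%
  filter(abs(Month - Mo) <= 1)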

Related

Subsetting whole clusters from a dataframe

In my data.frame below, how can I subset each whole cluster of study that has any outcome larger than 1 in it?
My desired output is shown below. I tried subset(h, outcome > 1) but that doesn't give my desired output.
h = "
study outcome
a 1
a 2
a 1
b 1
b 1
c 3
c 3"
h = read.table(text = h, header = TRUE)
DESIRED OUTPUT:
"
study outcome
a 1
a 2
a 1
c 3
c 3"
Modify the subset:
Subset the 'study' values based on the first logical expression, outcome > 1.
Then use %in% on 'study' to create the final logical expression in subset.
subset(h, study %in% study[outcome > 1])
Output:
study outcome
1 a 1
2 a 2
3 a 1
6 c 3
7 c 3
If we want to limit the number of 'study' groups having an 'outcome' value greater than 1, i.e. keep only the first 'n' such 'study' values, then get the unique 'study' values from the first expression of subset, use head to take the first 'n' values, and use %in% to create the logical expression:
n <- 3
subset(h, study %in% head(unique(study[outcome > 1]), n))
Or it can be done with a group-by approach using any:
library(dplyr)
h %>%
  group_by(study) %>%
  filter(any(outcome > 1)) %>%
  ungroup
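A base R equivalent of this grouped any() filter (a hedged sketch, reusing the h defined above) uses ave():
# for each row, flag whether any outcome in that study exceeds 1
h[as.logical(ave(h$outcome, h$study, FUN = function(x) any(x > 1))), ]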

R First Row By Group When Condition Is Met

dataHAVE = data.frame(STUDENT = c(1,1,1,2,2,2,3,3,3),
                      SCORE = c(0,1,1,5,1,2,1,1,1),
                      CAT = c(3,10,7,4,5,0,4,5,1),
                      FOX = c(5,0,10,8,9,1,8,9,0))
dataWANT = data.frame(STUDENT = c(1,2,3),
                      SCORE = c(1,1,1),
                      CAT = c(10,5,4),
                      FOX = c(0,9,8))
I have 'dataHAVE' and want 'dataWANT', which takes the first row for every 'STUDENT' when 'SCORE' equals 1. I am seeking a data.table solution because the data is large. I tried this but do not know how to set the criteria for 'SCORE':
dataWANT[,.SD[1],by = key(STUDENT)]
Convert the 'data.frame' to 'data.table' (setDT), grouped by 'STUDENT', specify the logical condition in i, get the index of the first row (.I[1]), extract that column ($V1) and subset the rows
library(data.table)
setDT(dataHAVE)[dataHAVE[SCORE == 1, .I[1], STUDENT]$V1]
.I returns row index. If we don't have a grouping column, it would return a vector i.e.
setDT(dataHAVE)[SCORE == 1, .I]
#[1] 1 2 3 4 5 6
When we provide the grouping column, the result is by default returned with the column named V1 (we could override it by specifying a name):
setDT(dataHAVE)[SCORE == 1, .(colindex = .I[1]), STUDENT]
# STUDENT colindex
#1: 1 2
#2: 2 5
#3: 3 7
Now we have two columns, 'STUDENT' and 'colindex'. We are specifically interested in 'colindex', so extract it with standard procedures ($ or [[) and then use that as the row index in i:
i1 <- setDT(dataHAVE)[SCORE == 1, .(colindex = .I[1]), STUDENT]$colindex
i1
#[1] 2 5 7
We then use this for subsetting:
dataHAVE[i1]
Here is a base R option using subset + ave
subset(
  dataHAVE,
  ave(SCORE == 1, STUDENT, FUN = function(x) seq_along(x) == min(which(x)))
)
which gives
STUDENT SCORE CAT FOX
2 1 1 10 0
5 2 1 5 9
7 3 1 4 8
Solution 1. There is a straightforward solution in two lines:
dataWANT <- dataHAVE[dataHAVE$SCORE == 1, ]            # keep rows where SCORE equals 1
dataWANT <- dataWANT[!duplicated(dataWANT$STUDENT), ]  # keep the first row per STUDENT
Solution 2. However, if you prefer to solve in one line:
dataWANT <- dataHAVE[!duplicated(paste0(dataHAVE$STUDENT, dataHAVE$SCORE)) & dataHAVE$SCORE ==1, ]
That creates a logical vector marking the STUDENT/SCORE combinations that are not duplicates of preceding elements, and combines it with a test for 'SCORE' equal to 1.
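One caveat worth noting (not part of the original answer): pasting without a separator can create collisions for other data (e.g. STUDENT "1" with SCORE "11" versus STUDENT "11" with SCORE "1"), so a separator is safer:
dataWANT <- dataHAVE[!duplicated(paste(dataHAVE$STUDENT, dataHAVE$SCORE, sep = "_")) &
                       dataHAVE$SCORE == 1, ]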
You could use match to get the first row where SCORE equals 1 for each STUDENT:
library(data.table)
setDT(dataHAVE)
dataHAVE[, .SD[match(1, SCORE)], STUDENT]
# STUDENT SCORE CAT FOX
#1: 1 1 10 0
#2: 2 1 5 9
#3: 3 1 4 8
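For comparison, a dplyr sketch of the same idea (assuming the data as defined above):
library(dplyr)
dataHAVE %>%
  group_by(STUDENT) %>%
  filter(SCORE == 1) %>%
  slice(1) %>%   # first matching row per STUDENT
  ungroup()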

subsetting a dataframe by existing object

I have a predefined object grade <- "G3". I would like to subset a data frame by grabbing the 3 from the "grade" object, keeping only grade 3 rows.
Here is an example of data
id <- c(1,2,3,4,5)
grade <- c(3,3,4,4,5)
score <- c(10,5,10,5,10)
data <- data.frame("id"=id,"grade"=grade, "score"=score)
> data
id grade score
1 1 3 10
2 2 3 5
3 3 4 10
4 4 4 5
5 5 5 10
I would like to get something like this:
> data
id grade score
1 1 3 10
2 2 3 5
Thanks!
With tidyverse, we can use !! to refer to the 'grade' object in the global environment instead of the column in the 'data' environment, remove the 'G', and do a ==:
library(dplyr)
library(stringr)
data %>%
  filter(grade == str_remove(!!grade, "G"))
# id grade score
#1 1 3 10
#2 2 3 5
You can use filter, but you would likely want to change the object name so it doesn't match the variable name.
Grade <- "G3"
data <- data.frame("id" = id, "grade" = grade, "score" = score) %>%
  filter(paste0("G", grade) == Grade)
You can use readr's parse_number to extract digits from a string with a minimum of fuss, and then subset with the result:
library(readr)
data[data$grade == parse_number(grade),]
Or with base R's sub, replace the non-digit character with "" (use gsub if there may be more than one):
data[data$grade == sub("[^0-9]", "", grade),]
Or if the only other character in your string is always "G" then:
data[data$grade == sub("G", "", grade),]

Replacing missing values in time series data in R

I am new to R. I was hoping to replace the missing values for X in the data. How can I replace the missing values of "X" when "Time" = 1 or 2 with the value of "X" when "Time" = 3, for the same "SubID" and the same "Day"?
SubID: subject number
Day: each subject's day number (1,2,3...21)
Time: morning marked as 1, afternoon marked as 2, and evening marked as 3
X: only has a valid value when Time is 3, others are missing.
SubID Day Time X
1 1 1 NA
1 1 2 NA
1 1 3 7.4
1 2 1 NA
1 2 3 6.2
2 1 1 NA
2 1 2 NA
2 1 3 7.1
2 2 3 5.9
2 2 2 NA
2 2 1 NA
I was able to go as far as the following code using zoo. I have very limited experience in R. Thank you in advance!
data2 <- transform(data1,
x = na.aggregate(x,by=SubID,FUN=sum,na.rm = T))
Here's the explanation of my comment:
library(data.table)
library(zoo)
setDT(data1)
data1[order(-Time),
      Xf := na.locf(X),
      by = .(SubID, Day)]
Ok, so the setDT function makes the data1 object a data.table. Then order(-Time) orders data1 with respect to Time in descending order (because of the -). Xf := na.locf(X) creates a new column Xf by reference (which means you don't have to assign the result back to data1) using na.locf(X), a function in the zoo package that fills NAs forward with the previous value (in this case filling Time 2 and 1 with the value at Time 3). The last line specifies that we want to do this grouped by SubID and Day.
Hope it's clearer now, feel free to ask if you have further doubts.
You can sort the data by descending time and then use X[1].
library(dplyr)
df <- tibble(SubID = 1, Day = 1, Time = c(1, 2, 3), X = c(NA, NA, 2.2))
df <- df %>%
  group_by(SubID, Day) %>%
  arrange(desc(Time)) %>%
  mutate(
    X = case_when(
      is.na(X) ~ X[1],
      TRUE ~ X)
  )
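Another option (a hedged sketch, assuming the same column names as above) is tidyr::fill(), which carries the Time-3 value down within each SubID/Day group after sorting by descending Time:
library(dplyr)
library(tidyr)
df %>%
  group_by(SubID, Day) %>%
  arrange(desc(Time), .by_group = TRUE) %>%
  fill(X, .direction = "down") %>%
  ungroup()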

Take first non-0 value or last 0 value if that's all there is

Ciao,
Here is my replicating example.
HAVE <- data.frame(ID = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
                   ABSENCE = c(NA,NA,NA,0,0,0,0,0,1,NA,0,NA,0,1,2,0,0,0),
                   TIME = c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))
WANT <- data.frame(ID = c(1,2,3,4,5,6),
                   ABSENCE = c(NA,0,1,0,1,0),
                   TIME = c(NA,3,3,2,2,3))
The tall data file HAVE is the one I need to convert to WANT. Essentially, for each ID I need to identify the first non-zero value of ABSENCE, and that value goes into WANT. If all values of ABSENCE are NA, then TIME is NA. If all values of ABSENCE are 0, then I report the last possible row in WANT (as reflected in the TIME variable).
This is my attempt:
WANT <- group_by(HAVE,ID) %>% slice(seq_len(min(which(ABSENCE > 0), n())))
but I do not know how to take the last of the 0 rows if there are only 0s.
library(data.table)
setDT(HAVE)
res = unique(HAVE[, .(ID)])
# look up first ABSENCE > 0
res[, c("ABSENCE", "TIME") := unique(HAVE[ABSENCE > 0], by="ID")[.SD, on=.(ID), .(ABSENCE, TIME)]]
# if nothing was found, look up last ABSENCE == 0
res[is.na(ABSENCE), c("ABSENCE", "TIME") := unique(HAVE[ABSENCE == 0], by="ID", fromLast=TRUE)[.SD, on=.(ID), .(ABSENCE, TIME)]]
# check
all.equal(as.data.frame(res), WANT)
# [1] TRUE
ID ABSENCE TIME
1: 1 NA NA
2: 2 0 3
3: 3 1 3
4: 4 0 2
5: 5 1 2
6: 6 0 3
I'm using data.table since the tidyverse does not and never will support sub-assignment / modifying only rows selected by a condition (like the is.na(ABSENCE) here).
If the two rules can be made more consistent with each other, this should be doable in a left join or in a single group_by + slice as the OP attempted. Okay, here's one way, though it looks impossible to debug:
HAVE %>%
  arrange(ID, -(ABSENCE > 0), TIME * (ABSENCE > 0), -TIME) %>%
  distinct(ID, .keep_all = TRUE)
ID ABSENCE TIME
1 1 NA 3
2 2 0 3
3 3 1 3
4 4 0 2
5 5 1 2
6 6 0 3
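Note that WANT has TIME = NA when ABSENCE is NA (ID 1 above keeps TIME 3), so as a hedged follow-up step one extra mutate would be needed:
HAVE %>%
  arrange(ID, -(ABSENCE > 0), TIME * (ABSENCE > 0), -TIME) %>%
  distinct(ID, .keep_all = TRUE) %>%
  mutate(TIME = ifelse(is.na(ABSENCE), NA, TIME))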
Using data.table as well, based on subsetting the .I row counter:
WANT <- HAVE[
HAVE[,
if(all(is.na(ABSENCE))) .I[1] else
if(!any(ABSENCE > 0, na.rm=TRUE)) max(.I[ABSENCE==0], na.rm=TRUE) else
min(.I[ABSENCE > 0], na.rm=TRUE),
by=ID
]$V1,
]
WANT[is.na(ABSENCE), TIME := NA_integer_]
# ID ABSENCE TIME
#1: 1 NA NA
#2: 2 0 3
#3: 3 1 3
#4: 4 0 2
#5: 5 1 2
#6: 6 0 3
Here are two approaches using dplyr and custom functions. Both rely on the data being sorted by TIME.
Filter Approach
# We'll use this function inside filter() to keep only the desired rows
flag_wanted <- function(absence){
  flags <- rep(FALSE, length(absence))
  if (any(absence > 0, na.rm = TRUE)) {
    # There's a nonzero value somewhere in absence; we want the first one.
    flags[which.max(absence > 0)] <- TRUE
  } else if (any(absence == 0, na.rm = TRUE)) {
    # There's a zero value somewhere in absence; we want the last one.
    flags[max(which(absence == 0))] <- TRUE
  } else {
    # All values are NA; we want the last row.
    flags[length(absence)] <- TRUE
  }
  return(flags)
}
# After filtering, we have to flip TIME to NA if ABSENCE is NA
HAVE %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  filter(flag_wanted(ABSENCE)) %>%
  mutate(TIME = ifelse(is.na(ABSENCE), NA, TIME)) %>%
  ungroup()
# A tibble: 6 x 3
ID ABSENCE TIME
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 0. 3.
3 3. 1. 3.
4 4. 0. 2.
5 5. 1. 2.
6 6. 0. 3.
The filter() step reduces the dataframe to the rows you need. Since it doesn't modify the TIME values, we need to mutate() as well.
Summarize Approach
# This function captures the general logic of getting the value of one variable
# based on the value of another
get_wanted <- function(of_this, by_this){
  # If there are any positive values of `by_this`, use the first
  if (any(by_this > 0, na.rm = TRUE)) {
    return(of_this[which.max(by_this > 0)])
  }
  # If there are any zero values of `by_this`, use the last
  if (any(by_this == 0, na.rm = TRUE)) {
    return(of_this[max(which(by_this == 0))])
  }
  # Otherwise, use NA
  return(NA)
}
HAVE %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  summarize(TIME = get_wanted(of_this = TIME, by_this = ABSENCE),
            ABSENCE = get_wanted(of_this = ABSENCE, by_this = ABSENCE))
# A tibble: 6 x 3
ID TIME ABSENCE
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 3. 0.
3 3. 3. 1.
4 4. 2. 0.
5 5. 2. 1.
6 6. 3. 0.
The order of summarization matters because we're overwriting variables, so this approach is risky. It only produces the output WANT if you summarize TIME and then ABSENCE.
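One way to sidestep that ordering dependency (a sketch, reusing get_wanted from above) is to compute into temporary names and rename afterwards:
HAVE %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  summarize(TIME_new = get_wanted(of_this = TIME, by_this = ABSENCE),
            ABSENCE_new = get_wanted(of_this = ABSENCE, by_this = ABSENCE)) %>%
  rename(TIME = TIME_new, ABSENCE = ABSENCE_new)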
