Extract first value after a specific observation - r

I couldn't quite find what I was looking for; this comes closest: Extract rows for the first occurrence of a variable in a data frame.
So I'm trying to figure out how I could extract the row directly following a specific observation. So for every year, there will be a place in the data where the observation is "over" and then I want the first numeric value following that "over." How would I do that?
So in the minimal example below, I would want to pluck the "7" (from the threshold variable) which directly follows the "over."
Thanks much!
library(dplyr)
other.values = c(13,10,10,9,8,4,5,7,7,5)
values = c(12,15,16,7,6,4,5,8,8,4)
df = data.frame(values, other.values) %>% mutate(threshold = ifelse(values - other.values > 0, "over", values))

You can do :
library(dplyr)
df %>%
mutate(grp = cumsum(threshold != 'over')) %>%
filter(lag(threshold) == 'over' & lag(grp) != grp)
# values other.values threshold grp
#1 7 9 7 2
#2 4 5 4 6

You could try something like this, but surely there is a better way:
df$threshold[max(which(df$threshold == "over"))+1]
Basically, this returns the row index of the last match to "over" and then adds 1, so it only gives the value that follows the last "over".

EDIT:
You can subset on those rows which do not have the value over in threshold AND (&) which do have over in the row prior (lag):
library(dplyr)
df[which(!df$threshold=="over" & lag(df$threshold)=="over"),]
values other.values threshold
4 7 9 7
10 4 5 4
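The lag-based answers above can also be written without dplyr. The following is a minimal base-R sketch, assuming the same example vectors as in the question; `is_over`, `after_over` and `first_after` are names introduced here for illustration only.

```r
# Example data from the question
values <- c(12, 15, 16, 7, 6, 4, 5, 8, 8, 4)
other.values <- c(13, 10, 10, 9, 8, 4, 5, 7, 7, 5)

# ifelse() mixes "over" with numbers, so threshold is a character vector
threshold <- ifelse(values - other.values > 0, "over", values)

is_over <- threshold == "over"

# TRUE where the current row is not "over" but the previous row is;
# the first row has no previous row, hence the leading FALSE
after_over <- !is_over & c(FALSE, head(is_over, -1))

# Convert the selected entries back to numbers
first_after <- as.numeric(threshold[after_over])
first_after
# 7 4
```

If the data also has a year column, the same test would need to be applied within each year so that the shifted vector does not carry over from one year to the next.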

Related

How to return the range of values shared between two data frames in R?

I have several data frames that have the same column names: ID, then from and to, which are the start and end of a range, and a group label for each of them.
What I want is to find which values of from and to from one of the data frames are included in the range of the other one. I leave an example picture to illustrate what I want to achieve (no graph is needed for the moment).
I thought I could accomplish this using between() from the dplyr package, but it did not work. It could perhaps be done this way: if between() returns TRUE, return the maximum of the two from values and the minimum of the two to values across the data frames.
I leave example data frames and the results I'm willing to obtain.
a <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(1,500,1000,1,500,1000,1,500,1000),
to=c(400,900,1400,400,900,1400,400,900,1400),group=rep("a",9))
b <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(300,1200,1900,1400,2800,3700,1300,2500,3500),
to=c(500,1500,2000,2500,3000,3900,1400,2800,3900),group=rep("b",9))
results <- data.frame(ID = c(1,1,1,2,3),from=c(300,500,1200,1400,1300),
to=c(400,500,1400,1400,1400),group=rep("a, b",5))
I tried this function, which returns the values when there is a match, but it doesn't return the range shared between them:
f <- function(vec, id) {
if(length(.x <- which(vec >= a$from & vec <= a$to & id == a$ID))) .x else NA
}
b$fromA <- a$from[mapply(f, b$from, b$ID)]
b$toA <- a$to[mapply(f, b$to, b$ID)]
We can exploit the fact that the start and end points are in different columns and that the ranges within the same group (a or b) do not overlap. This is my solution. For clarity, I have called your mutated 'from' and 'to' columns 'point_1' and 'point_2'.
You can bind the two data frames and compare the from column with the previous row's to, lag(to), to see whether the current range starts before the previous one ends. You also compare lag(to) with the current to column to see which of the two ranges ends first.
Important: these operations do not distinguish whether the two rows being compared come from the same group (a or b). Because ranges within the same group do not overlap, such comparisons produce NA in point_1 (the new mutated 'from' column), so filtering out the NAs removes the wrong values.
Also, note that I assume that, for example, a range in 'a' cannot overlap two rows in 'b'. That doesn't happen in your 'results' table, but you should check it in your data frames.
library(dplyr)
res = rbind(a,b) %>% # Bind by rows
arrange(ID,from) %>% # arrange by ID and starting point (from)
group_by(ID) %>% # perform the following operations grouped by IDs
# Here is the trick. If the ranges for the same ID and group (i.e. 1,a) do
# not overlap, when you mutate the following cols the result will be NA for
# point_1.
mutate(point_1 = ifelse(from <= lag(to), from, NA),
point_2 = ifelse(lag(to)>=to, to, lag(to)),
groups = paste(lag(group), group, sep = ',')) %>%
filter(! is.na(point_1)) %>% # remove NAs in from
select(ID,point_1, point_2, groups) # get the result dataframe
If you play a bit with the code and leave out the filter() and select() steps, you will see how it works.
> res
# A tibble: 5 x 4
# Groups: ID [3]
ID point_1 point_2 groups
<dbl> <dbl> <dbl> <chr>
1 1 300 400 a,b
2 1 500 500 b,a
3 1 1200 1400 a,b
4 2 1400 1400 a,b
5 3 1300 1400 a,b
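The answer above assumes that a range in 'a' never overlaps more than one row in 'b'. A different technique that does not rely on that assumption is to pair every 'a' range with every 'b' range of the same ID and intersect them. The following is a minimal base-R sketch of that idea, not the answerer's method; `pairs` and `overlap` are names introduced here, and the group column is left out for brevity.

```r
# Example data from the question (group column omitted)
a <- data.frame(ID   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                from = c(1, 500, 1000, 1, 500, 1000, 1, 500, 1000),
                to   = c(400, 900, 1400, 400, 900, 1400, 400, 900, 1400))
b <- data.frame(ID   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                from = c(300, 1200, 1900, 1400, 2800, 3700, 1300, 2500, 3500),
                to   = c(500, 1500, 2000, 2500, 3000, 3900, 1400, 2800, 3900))

# All combinations of an 'a' row and a 'b' row that share the same ID
pairs <- merge(a, b, by = "ID", suffixes = c(".a", ".b"))

# The intersection of two ranges starts at the larger 'from'
# and ends at the smaller 'to'
pairs$from <- pmax(pairs$from.a, pairs$from.b)
pairs$to   <- pmin(pairs$to.a, pairs$to.b)

# Keep only the pairs whose intersection is not empty
overlap <- pairs[pairs$from <= pairs$to, c("ID", "from", "to")]
overlap <- overlap[order(overlap$ID, overlap$from), ]
rownames(overlap) <- NULL
overlap
```

On the example data this yields the same five ranges as the 'results' data frame in the question. The cost is that merge() builds every pair within an ID, which may be slow for large inputs.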

How to create new column (using dplyr's mutate) based on conditions applied on the entire piped dataframe

I am looking for a way to create a new column (using dplyr's mutate) based on certain "conditions".
library(tidyverse)
qq <- 5
df <- data.frame(rn = 1:qq,
a = rnorm(qq,0,1),
b = rnorm(qq,10,5))
myf <- function(dataframe,value){
result <- dataframe %>%
filter(rn<=value) %>%
nrow
return(result)
}
The above example is a rather simplified version for which I am trying to filter the piped dataframe (df) and obtain a new column (foo) whose values will depict how many rows there are with rn less than or equal to the current rn (each row's rn - coming from the piped df ). Below you can see the output I am getting vs the one I expect to obtain :
df %>%
mutate(
foo_i_am_getting = myf(.,rn),
foo_expected = 1:qq)
rn a b foo_i_am_getting foo_expected
1 1 -0.5403937 -4.945643 5 1
2 2 0.7169147 2.516924 5 2
3 3 -0.2610024 -7.003944 5 3
4 4 -0.9991419 -1.663043 5 4
5 5 1.4002610 15.501411 5 5
The actual calculation I am trying to perform is more cumbersome, however, if I solve the above simplified version, I believe I can handle the rest of the manipulation/calculations inside the custom function.
BONUS QUESTION : Currently the name of the column I want to apply the filter on (i.e. rn) is hardcoded in the custom function (filter(rn<=value)). It would be great if this was an argument of the custom function, to be passed 'tidyverse' style - i.e. without quotation marks - e.g. myf <- function(dataframe,rn,value)
Disclaimer: I've done my best to describe the problem at hand; however, if there are still unclear spots, please let me know so I can elaborate further.
Thanks in advance for your support!
You need to do it step by step, because right now you are passing the whole rn vector to filter instead of one value at a time:
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq)
Now we pass 1 to filter for the rn column (and the function returns the number of rows), then 2, and so on.
The function could be:
myf <- function(vec_filter, dataframe, vec_rn) {
map_dbl(vec_filter, ~ nrow(filter(dataframe, {{vec_rn}} <= .x)))
}
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq,
foo_function = myf(rn, ., rn))
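The reason the original attempt returned 5 for every row can be seen without dplyr. The following is a minimal base-R sketch, assuming the same df layout as in the question; the seed and the name `foo` are introduced here for illustration.

```r
set.seed(1)  # hypothetical seed, only to make the example reproducible
qq <- 5
df <- data.frame(rn = 1:qq,
                 a = rnorm(qq, 0, 1),
                 b = rnorm(qq, 10, 5))

# Comparing the whole column with itself is element-wise, so every row
# passes the test; this is what filter(rn <= value) saw inside myf(., rn)
sum(df$rn <= df$rn)
# 5

# Making one comparison per element gives the running count instead
foo <- vapply(df$rn, function(x) sum(df$rn <= x), integer(1))
foo
# 1 2 3 4 5
```

vapply() plays the same role here as map_dbl() in the answer: it calls the inner function once per value of rn rather than once for the whole vector.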

How to extract a subset based on value by grepl

Hi, I want to extract all the observations starting from "120.5". I am doing it in the following way.
a<-c(120.1,120.3,120.5,120.566)
Part<-c(1,2,3,4)
DFFF<-data.frame(a,Part)
lill <- subset(DFFF, grepl('^120.5', a), select = Part)
> lill
Part
3 3
I want the outcome to be 3 and 4. How can I do that in R?
Since you're only subsetting on a numerical variable, @NelsonGon's solution DFFF[DFFF$a>=120.5,] is absolutely the first option. If, for some reason, you have to use grepl, you can subset like this:
DFFF[grepl("120.5", DFFF$a), ]
a Part
3 120.500 3
4 120.566 4
But bear in mind that this only works as long as the numbers in a are not equal to or greater than 120.6; none of those values will be matched.
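The difference between matching text and comparing numbers can be made concrete. The following is a minimal sketch, assuming two extra values (120.6 and 1120.5) that are not in the question and are added here only to show where the approaches diverge; `a2`, `as_text` and the `*_match` names are hypothetical.

```r
a  <- c(120.1, 120.3, 120.5, 120.566)
a2 <- c(a, 120.6, 1120.5)   # hypothetical extra values

# grepl() works on the text form of the numbers
as_text <- as.character(a2)

# Unanchored pattern: "." matches any character and the match may start anywhere
loose_match <- grepl("120.5", as_text)

# Anchored pattern with an escaped dot: text must begin with "120.5"
prefix_match <- grepl("^120\\.5", as_text)

# Numeric comparison: everything at or above 120.5
numeric_match <- a2 >= 120.5

loose_match    # FALSE FALSE  TRUE  TRUE FALSE  TRUE
prefix_match   # FALSE FALSE  TRUE  TRUE FALSE FALSE
numeric_match  # FALSE FALSE  TRUE  TRUE  TRUE  TRUE
```

On the four original values all three approaches agree, which is why the choice only matters once the data contains values outside the 120.5 to 120.6 interval.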
In base R
ind <- which(DFFF$a >= 120.5)
lill <- DFFF$Part[ind]
Tidyverse
library(tidyverse)
DFFF %>%
filter(a >= 120.5) %>%
pull(Part)
You were close, quotes needed.
lill <- subset(DFFF, grepl('^120.5', a), select="Part")
lill
# Part
# 3 3
# 4 4

R selecting rows by conditions given in an external table

Given the following data
data_min <- data.frame("cond"=c("a","b","c"),"min"=c(1,3,1))
data <- data.frame("cond"=c("a","b","b","a","c"),"val"=c(0,2,4,7,0))
I would like to select all rows from data for which the value in val is bigger than the minimum value specified in data_min for that condition. Thus, in the given example, I expect to end up with the table
cond val
b 4
a 7
So far, I have tried
datanew <- data[which(data$cond==data_min$cond & data$val > data_min$min),]
which gives me a 7 but not b 4. I have two questions: (1) why do I get the result I get, and (2) how do I get the desired result?
You need to use match because the data.frames have different numbers of rows:
data[data_min[match(data$cond, data_min$cond),]$min < data$val,]
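To see why the original attempt returned only a 7, it helps to look at the intermediate vectors. The following is a minimal sketch using the data from the question; `recycled`, `idx` and `min_per_row` are names introduced here for illustration.

```r
data_min <- data.frame(cond = c("a", "b", "c"), min = c(1, 3, 1))
data <- data.frame(cond = c("a", "b", "b", "a", "c"), val = c(0, 2, 4, 7, 0))

# == compares position by position and recycles the shorter vector,
# so data$cond is compared with a, b, c, a, b (R warns about the lengths)
recycled <- suppressWarnings(data$cond == data_min$cond)
recycled
# TRUE TRUE FALSE TRUE FALSE

# match() instead looks up where each data$cond occurs in data_min$cond
idx <- match(data$cond, data_min$cond)
idx
# 1 2 2 1 3

# The minimum that applies to each row of data
min_per_row <- data_min$min[idx]
min_per_row
# 1 3 3 1 1

result <- data[data$val > min_per_row, ]
result
```

With recycling, the third row (b 4) is compared against condition c and dropped, which answers question (1); the lookup through match() aligns each row with its own minimum.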
You could just merge the two data frames together to make things easier:
> m=merge(data,data_min,by='cond')
> m[which(m$val > m$min), c('cond','val')]
cond val
2 a 7
4 b 4
A solution using dplyr. We can perform a join first and then filter the condition between the val and min column.
library(dplyr)
data2 <- data %>%
left_join(data_min, by = "cond") %>%
filter(val > min) %>%
select(-min)
data2
cond val
1 b 4
2 a 7

Nested subsetting with "["

I recently discovered that, after subsetting an object (i.e. a data frame) with "[", the resulting object could be subset with "[" on the same line of code (I should have realized it earlier!). Here is an example:
# Create a data frame
df1 <- as.data.frame(matrix(1:9, nrow = 3))
# Take a look at the data frame
df1
V1 V2 V3
1 1 4 7
2 2 5 8
3 3 6 9
# If I want the value which is on the 3rd row and 2nd column
df1[3,2]
[1] 6
# But I could also
df1[,2][3]
[1] 6
A few words on the second alternative. df1[,2] returns an atomic vector, which is then subset with [3].
The following data frame will be helpful to illustrate my issue. It is a simple data frame containing the names of 26 students, their respective departments, and a numeric value. A seed number is set for reproducibility.
set.seed(123)
df2 <- data.frame(name = letters, dept = sample(c("econ", "stat", "math"), 26, replace = TRUE), value = runif(26, 0, 100))
head(df2)
name dept value
1 a econ 54.40660
2 b math 59.41420
3 c stat 28.91597
4 d math 14.71136
5 e math 96.30242
6 f econ 90.22990
I would like to know who has the lowest value in the econ department. The first thing I tried was:
df2[df2$dept == "econ" & df2$value == min(df2$value),]
[1] name dept value
<0 rows> (or 0-length row.names)
It took me a while to understand what I was doing wrong, but I finally realized that the problem was that my code assumed that the person who had the lowest value overall was also from the econ department, which is not the case (and that's the answer that R gave me). Actually, the person with the lowest value overall is from the stat department.
i <- which(df2$value == min(df2$value))
df2[i,]
name dept value
9 i stat 2.461368
Of course, I can easily find the answer to my question with:
df_econ <- df2[df2$dept == "econ",]
df_econ
name dept value
1 a econ 54.40660
6 f econ 90.22990
15 o econ 14.28000
17 q econ 41.37243
18 r econ 36.88455
19 s econ 15.24447
df_econ[df_econ$value == min(df_econ$value),]
name dept value
15 o econ 14.28
But I would like to know if I can get the same result using "nested" subsetting with the [ operator. What I mean is with a code like this:
df2[df2$dept == "econ",][... ,]
I do not know how to refer to the value column at this point since the resulting data frame of the first subsetting operation df2[df2$dept == "econ",] is a data frame different from df2. I also know that the value column is the 3rd column, but I do not know how to set subsetting conditions using column indexes rather than their names.
Thank you for your help.
Here are some options:
library(dplyr)
# also in @bramtayl's answer:
df2 %>% filter(dept == "econ") %>% filter(value==min(value))
# or
df2 %>% filter(dept == "econ") %>% slice(which.min(value))
# or...
library(data.table)
setDT(df2)[dept == "econ"][value==min(value)]
# or
setDT(df2)[dept == "econ"][which.min(value)]
These packages offer convenient ways of chaining not available in base R except awkwardly, like
subset(subset(df2, dept=="econ"), value == min(value))
There may be other packages, but these two are widely used lately.
Comment. If you're just browsing data, I'd recommend aggregating at the dept level:
# dplyr:
df2 %>% group_by(dept) %>% slice(which.min(value))
# data.table:
df2[, .SD[which.min(value)], by=dept]
dept name value
1: econ o 14.280002
2: math t 13.880606
3: stat i 2.461368
Agreed that chaining is necessary:
library(magrittr)
df2 %>%
`[`(.$dept == "econ", ) %>%
`[`(.$value == min(.$value), )
Why not stick with dplyr though?
library(dplyr)
df2 %>%
filter(dept == "econ") %>%
filter(value == min(value) )
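For the part of the question about referring to the intermediate result, and about using a column index instead of a name, plain base R can also do it. The following is a minimal sketch on a small data frame hand-copied from rows printed in the question (`df_small` is a hypothetical name, not the full df2); the anonymous function gives the intermediate result the name `d`.

```r
# A few rows copied from the output shown in the question
df_small <- data.frame(
  name  = c("a", "b", "c", "f", "i", "o"),
  dept  = c("econ", "math", "stat", "econ", "stat", "econ"),
  value = c(54.40660, 59.41420, 28.91597, 90.22990, 2.461368, 14.28000)
)

# Wrap the second subsetting step in an anonymous function so that
# the result of the first step can be referred to as d
econ_min <- (function(d) d[d$value == min(d$value), ])(
  df_small[df_small$dept == "econ", ]
)

# The same, referring to the value column by position (3rd column)
econ_min_idx <- (function(d) d[d[[3]] == min(d[[3]]), ])(
  df_small[df_small$dept == "econ", ]
)

# Or in a single step: take the minimum over the econ rows only
econ_min_one <- df_small[
  df_small$dept == "econ" &
    df_small$value == min(df_small$value[df_small$dept == "econ"]), ]

econ_min
```

All three return the row for student o. A literal df2[...][...] chain cannot name its own intermediate result, which is why a wrapper function or a pipe is needed.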
