How to extract a subset based on value by grepl

Hi, I want to extract all the observations starting from "120.5". I am doing it in the following way:
a<-c(120.1,120.3,120.5,120.566)
Part<-c(1,2,3,4)
DFFF<-data.frame(a,Part)
lill <- subset(DFFF, grepl('^120.5', a), select = Part)
> lill
Part
3 3
I want the outcome to be 3 and 4. How can I do that in R?

Since you're only subsetting on a numerical variable, @NelsonGon's solution DFFF[DFFF$a >= 120.5, ] is absolutely the first option. If, for some reason, you have to use grepl, you can subset like this:
DFFF[grepl("120.5", DFFF$a), ]
a Part
3 120.500 3
4 120.566 4
But bear in mind that this only works as long as no value in a is 120.6 or greater: such values satisfy a >= 120.5 but do not contain "120.5", so the pattern will not match them.
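To make the caveat concrete, here is a minimal sketch comparing the pattern match with the numeric comparison. The extra values 120.6 and 1120.5 are not in the original question; they are added here only to show where the two approaches diverge. The pattern is also anchored and the dot escaped, since an unescaped `.` matches any character.

```r
# Original values plus two hypothetical ones (120.6 and 1120.5) for illustration
a <- c(120.1, 120.3, 120.5, 120.566, 120.6, 1120.5)

# grepl() works on text, so the numbers are matched in their character form
a_chr <- as.character(a)

# Anchored pattern with an escaped dot: "starts with 120.5"
by_pattern <- grepl("^120\\.5", a_chr)

# Numeric comparison: "is at least 120.5"
by_value <- a >= 120.5

# Side-by-side view of where the two criteria disagree
data.frame(a, by_pattern, by_value)
```

The two columns agree for the four original values and disagree for the two added ones, which is why the numeric comparison is the safer choice when the intent is "120.5 or more".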

In base R
ind <- which(DFFF$a >= 120.5)
lill <- DFFF$Part[ind]
Tidyverse
library(tidyverse)
DFFF %>%
filter(a >= 120.5) %>%
pull(Part)

You were close. With the column name quoted in select, both rows are returned:
lill <- subset(DFFF, grepl('^120.5', a), select="Part")
lill
# Part
# 3 3
# 4 4

Related

How to create new column (using dplyr's mutate) based on conditions applied on the entire piped dataframe

I am looking for a way to create a new column (using dplyr's mutate) based on certain "conditions".
library(tidyverse)
qq <- 5
df <- data.frame(rn = 1:qq,
a = rnorm(qq,0,1),
b = rnorm(qq,10,5))
myf <- function(dataframe,value){
result <- dataframe %>%
filter(rn<=value) %>%
nrow
return(result)
}
The above is a simplified version of what I am trying to do: filter the piped data frame (df) and obtain a new column (foo) whose values give the number of rows with rn less than or equal to the current row's rn (taken from the piped df). Below you can see the output I am getting versus the one I expect:
df %>%
mutate(
foo_i_am_getting = myf(.,rn),
foo_expected = 1:qq)
rn a b foo_i_am_getting foo_expected
1 1 -0.5403937 -4.945643 5 1
2 2 0.7169147 2.516924 5 2
3 3 -0.2610024 -7.003944 5 3
4 4 -0.9991419 -1.663043 5 4
5 5 1.4002610 15.501411 5 5
The actual calculation I am trying to perform is more cumbersome, however, if I solve the above simplified version, I believe I can handle the rest of the manipulation/calculations inside the custom function.
BONUS QUESTION : Currently the name of the column I want to apply the filter on (i.e. rn) is hardcoded in the custom function (filter(rn<=value)). It would be great if this was an argument of the custom function, to be passed 'tidyverse' style - i.e. without quotation marks - e.g. myf <- function(dataframe,rn,value)
Disclaimer: I've done my best to describe the problem at hand; however, if anything is still unclear, please let me know so I can elaborate further.
Thanks in advance for your support!

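You need to do it step by step, because right now you are passing the whole rn vector to filter instead of one value at a time. rn <= value is then compared elementwise, every row passes, and nrow returns 5 for all rows: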
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq)
Now 1 is passed to filter for the rn column (and the function returns the number of rows), then 2, and so on.
As a function, this could be:
myf <- function(vec_filter, dataframe, vec_rn) {
map_dbl(vec_filter, ~ nrow(filter(dataframe, {{vec_rn}} <= .x)))
}
df %>%
mutate(
foo_i_am_getting = map_dbl(.$rn, function(x) nrow(filter(., rn <= x))),
foo_expected = 1:qq,
foo_function = myf(rn, ., rn))
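For comparison, the same per-value counting can be sketched in base R without purrr. This is only an illustration of the idea, not the answer's method: `count_leq` is a hypothetical helper name, and the column is passed as a string rather than tidyverse-style.

```r
set.seed(1)  # only so that the random columns are reproducible
qq <- 5
df <- data.frame(rn = 1:qq,
                 a = rnorm(qq, 0, 1),
                 b = rnorm(qq, 10, 5))

# count_leq (hypothetical helper): for each element of `values`, count the
# rows of `dataframe` whose `column` is less than or equal to that element
count_leq <- function(dataframe, column, values) {
  vapply(values,
         function(v) sum(dataframe[[column]] <= v),
         integer(1))
}

# One count per row, computed one value at a time
df$foo <- count_leq(df, "rn", df$rn)
df$foo
```

As in the map_dbl version, the key point is that each value of rn is compared against the whole column separately, instead of comparing the two vectors elementwise.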

Extract first value after a specific observation

I couldn't quite find what I was looking for: this comes closest: Extract rows for the first occurrence of a variable in a data frame.
So I'm trying to figure out how I could extract the row directly following a specific observation. So for every year, there will be a place in the data where the observation is "over" and then I want the first numeric value following that "over." How would I do that?
So in the minimal example below, I would want to pluck the "7" (from the threshold variable) which directly follows the "over."
Thanks much!
library(dplyr)
other.values = c(13,10,10,9,8,4,5,7,7,5)
values = c(12,15,16,7,6,4,5,8,8,4)
df = data.frame(values, other.values) %>% mutate(threshold = ifelse(values - other.values > 0, "over", values))

You can do :
library(dplyr)
df %>%
mutate(grp = cumsum(threshold != 'over')) %>%
filter(lag(threshold) == 'over' & lag(grp) != grp)
# values other.values threshold grp
#1 7 9 7 2
#2 4 5 4 6

You could try something like this, but surely there is a better way:
df$threshold[max(which(df$threshold == "over"))+1]
Basically, this finds the row index of the last match to "over", adds 1, and returns the threshold value at that row.

EDIT:
You can subset on those rows which do not have the value over in threshold AND (&) which do have over in the row prior (lag):
library(dplyr)
df[which(!df$threshold=="over" & lag(df$threshold)=="over"),]
values other.values threshold
4 7 9 7
10 4 5 4
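The same "not over, but the previous row was over" condition can also be written without dplyr. The following is a minimal base R sketch using the example data from the question; `is_over`, `prev_over` and `after_over` are names introduced here for illustration.

```r
values <- c(12, 15, 16, 7, 6, 4, 5, 8, 8, 4)
other.values <- c(13, 10, 10, 9, 8, 4, 5, 7, 7, 5)
df <- data.frame(values, other.values)
df$threshold <- ifelse(df$values - df$other.values > 0, "over", df$values)

# TRUE where the row itself is "over"
is_over <- df$threshold == "over"

# TRUE where the previous row is "over"; the first row has no predecessor
prev_over <- c(FALSE, head(is_over, -1))

# Rows that directly follow a run of "over"
after_over <- df[!is_over & prev_over, ]
after_over

# The first such value only
first_after <- after_over$threshold[1]
```

Note that threshold is a character column because ifelse mixes "over" with numbers, so the extracted values may need as.numeric() before further calculation.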

R selecting rows by conditions given in an external table

Given the following data
data_min <- data.frame("cond"=c("a","b","c"),"min"=c(1,3,1))
data <- data.frame("cond"=c("a","b","b","a","c"),"val"=c(0,2,4,7,0))
I would like to select all rows from data for which the value in val is bigger than the minimum value specified in data_min for that condition. Thus, in the given example, I expect to end up with the table
cond val
b 4
a 7
So far, I have tried
datanew <- data[which(data$cond==data_min$cond & data$val > data_min$min),]
which gives me a 7 but not b 4. I have two questions: (1) why do I get the result I get, and (2) how do I get the desired result?

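You need to use match because the data frames have different numbers of rows, so == compares data$cond against a recycled data_min$cond position by position rather than looking up each condition:
data[data$val > data_min$min[match(data$cond, data_min$cond)], ]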
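To see why the original attempt returns only a 7, it can help to look at the intermediate vectors. This sketch uses the question's data; `recycled`, `idx` and `keep` are names introduced here for illustration, and the warning about mismatched lengths is suppressed only to keep the output readable.

```r
data_min <- data.frame(cond = c("a", "b", "c"), min = c(1, 3, 1))
data <- data.frame(cond = c("a", "b", "b", "a", "c"), val = c(0, 2, 4, 7, 0))

# == recycles the shorter vector: data_min$cond is reused as a, b, c, a, b,
# so rows are compared by position, not by condition
recycled <- suppressWarnings(data$cond == data_min$cond)
recycled

# match() returns, for each row of data, the matching row number in data_min
idx <- match(data$cond, data_min$cond)
idx

# Compare each value with the minimum for its own condition
keep <- data$val > data_min$min[idx]
data[keep, ]
```

With positional recycling, the third row (b 4) is compared against condition c and dropped, which answers question (1); the match-based lookup answers question (2).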

You could just merge the two data frames together to make things easier:
> m=merge(data,data_min,by='cond')
> m[which(m$val > m$min), c('cond','val')]
cond val
2 a 7
4 b 4

A solution using dplyr. We can perform a join first and then filter the condition between the val and min column.
library(dplyr)
data2 <- data %>%
left_join(data_min, by = "cond") %>%
filter(val > min) %>%
select(-min)
data2
cond val
1 b 4
2 a 7

Accessing column names in a data frame

I have a data frame with column names z_1, z_2 up to z_200. In the following example, for ease of representation, I am showing only z_1:
df <- data.frame(x=1:5, y=2:6, z_1=3:7, u=4:8)
df
i=1
tmp <- paste("z",i,sep="_")
subset(df, select=-c(tmp))
The above code will be used in a loop over i to access certain columns that need to be removed from the data frame.
When executing the above code, I get the error "Error in -c(tmp) : invalid argument to unary operator".
Thank you for your help

Try:
df[names(df)!=tmp]
Your code does not work because -c(tmp) applies unary minus to a character vector, which R cannot do; hence the error. Excluding columns with a minus sign works with numeric indices only.
Alternatively this would also work:
subset(df, select=-which(names(df)==tmp))
Because which returns a number.
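If several columns have to go, the loop can be avoided by building all the names first and excluding them in one step. This is a small sketch under the assumption that the names are known in advance; the z_2 column and the `to_drop` and `kept` names are added here only for illustration.

```r
# Example data with a second z column added for illustration
df <- data.frame(x = 1:5, y = 2:6, z_1 = 3:7, z_2 = 4:8, u = 4:8)

# All column names to remove, built in one call instead of a loop
to_drop <- paste("z", 1:2, sep = "_")

# Keep every column whose name is not in to_drop
kept <- df[, !(names(df) %in% to_drop), drop = FALSE]
names(kept)
```

Because %in% returns a logical vector, no unary minus is involved, so the character-versus-numeric problem does not arise.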

If you want to use subset and have a large number of similarly named columns to include or exclude, consider using grepl to construct a logical vector of matches to the column names (or grep to construct a numeric vector of positions just as easily). Negating the result removes the columns:
df <- data.frame(x=1:5, y=2:6, z_1=3:7, u=4:8)
df
i=1
tmp <- paste("z",i,sep="_")
subset(df, select= !grepl("^z", names(df) ) )
x y u
1 1 2 4
2 2 3 5
3 3 4 6
4 4 5 7
5 5 6 8
With negation this lets you remove (or without it include) all of the columns starting with "z" using that pattern. Or you can use grep with value =TRUE in combination with character values:
subset(df, select= c("x", grep("^z", names(df), value=TRUE ) ) )

How to select rows from data.frame with 2 conditions

I have an aggregated table:
> aggdata[1:4,]
Group.1 Group.2 x
1 4 0.05 0.9214660
2 6 0.05 0.9315789
3 8 0.05 0.9526316
4 10 0.05 0.9684211
How can I select the x value when I have values for Group.1 and Group.2?
I tried:
aggdata[aggdata[,"Group.1"]==l && aggdata[,"Group.2"]==lambda,"x"]
but that returns all x's.
More info:
I want to use this like this:
table = data.frame();
for(l in unique(aggdata[,"Group.1"])) {
for(lambda in unique(aggdata[,"Group.2"])) {
table[l,lambda] = aggdata[aggdata[,"Group.1"]==l & aggdata[,"Group.2"]==lambda,"x"]
}
}
I would appreciate any suggestions that give this result more easily!

The easiest solution is to change "&&" to "&" in your code.
> aggdata[aggdata[,"Group.1"]==6 & aggdata[,"Group.2"]==0.05,"x"]
[1] 0.9315789
My preferred solution would be to use subset():
> subset(aggdata, Group.1==6 & Group.2==0.05)$x
[1] 0.9315789

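Use & not &&. & compares elementwise, whereas && expects single values: in older versions of R it looks only at the first element of each vector, and since R 4.3.0 it is an error to give it a vector of length greater than one.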
Update: to answer the second part, use the reshape package. Something like this will do it:
tablex <- recast(aggdata, Group.1 ~ variable * Group.2, id.var=1:2)
# Now add useful column and row names
colnames(tablex) <- gsub("x_","",colnames(tablex))
rownames(tablex) <- tablex[,1]
# Finally remove the redundant first column
tablex <- tablex[,-1]
Someone with more experience using reshape may have a simpler solution.
Note: Don't use table as a variable name as it conflicts with the table() function.
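As one possible simpler route to the same wide layout, base R's tapply can build the Group.1 by Group.2 matrix directly. This is a sketch, not the answer's reshape approach, and it uses only the four rows shown in the question, so the result has a single column; `tab` is a name chosen here.

```r
# The four rows of aggdata shown in the question
aggdata <- data.frame(
  Group.1 = c(4, 6, 8, 10),
  Group.2 = c(0.05, 0.05, 0.05, 0.05),
  x = c(0.9214660, 0.9315789, 0.9526316, 0.9684211)
)

# One cell per Group.1 / Group.2 combination; each cell holds a single x,
# so mean() simply returns that value (combinations with no data become NA)
tab <- with(aggdata, tapply(x, list(Group.1, Group.2), mean))
tab

# Look up a single cell by its group labels (dimnames are character strings)
tab["6", "0.05"]
```

The row and column names come from the group values, so no renaming step is needed afterwards.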

There is a really helpful document on subsetting R data frames at:
http://www.ats.ucla.edu/stat/r/modules/subsetting.htm
Here is the relevant excerpt:
Subsetting rows using multiple conditional statements: There is no limit to how many logical statements may be combined to achieve the subsetting that is desired. The data frame x.sub1 contains only the observations for which the values of the variable y is greater than 2 and for which the variable V1 is greater than 0.6.
x.sub1 <- subset(x.df, y > 2 & V1 > 0.6)
