Searching for values in one data frame using another - r

I have 2 data frames: dfA and dfB
I would like to be able to extract whole rows from dfB that meet criteria based on dfA
Example:
if (dfA$colA == dfB$colB) && (dfB$colC >= dfA$colD) &&
(dfB$colC <= dfA$colE) { print rows from dfB }
The values from the 1st column in dfA need to be an exact match for the 2nd column in dfB
AND
the values from column 3 in dfB need to fall within a range set by columns 4 and 5 in dfA.
The output should be the rows from dfB that meet these criteria.

I am not sure about R, but I guess it works much like Pandas: just create three boolean masks, one for each criterion, then combine them into an overall mask.
Example:
1stBoolMask <- dfA$colA == dfB$colB -> returns something like ( 0 0 1 1 0 1 0 1 ... ). A "1" stands for a matching entry in dfB.
2ndBoolMask = ...
3rdBoolMask = ...
-> OverallMask = 1stBoolMask & 2ndBoolMask & 3rdBoolMask
Then apply this overall mask to dfB and you should be done. The "1"s in the resulting filter mark the matching rows of dfB.
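In R, that element-wise mask only lines up row i of dfA with row i of dfB, so it needs both frames to have compatible lengths. For the general case (checking every row of dfB against every row of dfA), a merge on the matching columns followed by a range filter does the job. A minimal sketch, using the column names from the question on made-up data:

# hypothetical data mirroring the question's column names
dfA <- data.frame(colA = c("a", "b"), colD = c(1, 5), colE = c(3, 9))
dfB <- data.frame(colB = c("a", "a", "b"), colC = c(2, 7, 6))

# join rows where dfA$colA matches dfB$colB, then keep those whose colC lies in [colD, colE]
m <- merge(dfA, dfB, by.x = "colA", by.y = "colB")
m[m$colC >= m$colD & m$colC <= m$colE, ]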

Related

R function to subset dataframe so that non-adjacent values in a column differ by >= X (starting with the first value)

I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a dataframe and only keeps rows whose values are at least, say, 10 apart, starting with the first row. Thus it would start with the first row (and store it), carry on until it finds a row with a value at least 10 higher than the first, store that row, and then start again from that value looking for the next row at least 10 higher.
So far I have an R for loop that successfully finds adjacent rows at least X values apart, but it does not have the capability of looking any further than one row down, nor of stopping once it has found the given row and starting again from there.
Here is the function I have:
# example data frame
df <- data.frame(x = 1:1000, pos = sort(sample(1:10000, 1000)))
# prep function (this only checks the adjacent row)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
  # create new objects to store the output
  new.df <- list()
  new.df1 <- data.frame()
  # iterate through each row of df
  for (i in 1:nrow(df)) {
    # keep row i if the next row's value is at least pos.diff higher,
    # if the values are not ascending, or if it is the first row
    if (isTRUE(df$pos[i + 1] >= df$pos[i] + pos.diff | df$pos[i + 1] < df$pos[i] | i == 1)) {
      # add rows that meet the conditions to the list
      new.df[[i]] <- df[i, ]
    }
  }
  # bind all rows that met the conditions
  new.df1 <- bind_rows(new.df)
  return(new.df1)
}
# test run: adjacent values of pos must be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.
It seems I misunderstood the question earlier, since we don't want to calculate the difference between consecutive rows. You can try:
nrows <- 1
previous_match <- 1
for (i in 2:nrow(df)) {
  if (df$pos[i] - df$pos[previous_match] > 10) {
    nrows <- c(nrows, i)
    previous_match <- i
  }
}
and then subset the selected rows:
df[nrows, ]
Earlier answer
We can use diff to get the difference between consecutive rows and select the rows whose difference is greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The leading TRUE selects the first row by default.
In dplyr, we can use lag to get the value from the previous row:
library(dplyr)
df %>% filter(pos - lag(pos, default = -Inf) > 10)

What is the alternative for multiple element vector to multiple element vector comparison inside while loop in R?

while (Data$City!="Mumbai" || Data$City!="Delhi" || Data$City!= "Bengaluru")
The error is the following:
In while (Data$City!="Mumbai" || Data$City!=...: the condition has length >1 and only the first element will be used.
I want to compare elements of a column with certain values/elements of a vector in a while loop and conditionally execute 'n' statements under it. What's the alternative given the limitation above: a function, a function with apply(), or ifelse?
DataO <- c("Mumbai", "Jaipur", "Delhi", "Chennai", "Bengaluru")
Data1 <- setNames(data.frame(matrix(ncol = 1, nrow = 5)), c("City"))
for (i in seq_along(DataO)) {
  while (DataO != "Mumbai" || DataO != "Delhi" || DataO != "Bengaluru") {
    Data1$City[i] <- as.character(DataO[i])
  }
}
I want to execute the statement under while() when Mumbai == Mumbai (i = 1), then for Delhi == Delhi (i = 3), and then for Bengaluru == Bengaluru (i = 5). It should skip iterations i = 2 and i = 4.
Here only the first element (i = 1) gets evaluated and added (Mumbai):
> Data1
City
1 Mumbai
2 <NA>
3 <NA>
4 <NA>
5 <NA>
The desired output :
> Data1
City
1 Mumbai
2 <NA>
3 Delhi
4 <NA>
5 Bengaluru
The crux here is: 'while something (an element/row observation) in one place (a data column/vector) matches something in another place (a data column/vector), execute statements until the condition is satisfied, and repeat this for all subsequent matches (then break out of the loop)'.
Digression: can rownames be empty (character type "") in R, i.e., is it possible to assign empty rownames (character type "") in R?
Assuming Data$City is a vector of city names, and also assuming that you want to check if at least one of those city names is present in a given list, you could:
Store all valid city names into a character vector, namely validCities.
Use the %in% operator between those two vectors in order to obtain a logical vector. This vector will be of the same length as the first one, and will say which ones of those cities are contained in the second vector.
Use the sum function to verify if there is at least one positive, i.e., check if any of the cities contained in the first vector is present in the second vector.
Example below.
Data <- data.frame(City = c('Chennai', 'Delhi', 'Bhopal', 'Pune', 'Kolkata'));
validCities <- c('Mumbai', 'Delhi', 'Bengaluru');
if (sum(Data$City %in% validCities) > 0) {
  # Your code here.
}
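For what it's worth, any() expresses the same check a little more directly; an equivalent sketch:

# TRUE as soon as at least one city is in the valid list
if (any(Data$City %in% validCities)) {
  # Your code here.
}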
Updated:
Now that you have provided your desired output, I can see that it's pretty easy. Don't get stuck on loop-focused approaches: a data.frame can easily be selected and filtered by row; just provide a condition for the rows you want to consider, and indicate which columns you want to retrieve or modify.
In this case, I'm selecting the rows whose CITY is not one of the three provided, and assigning NA to the CITY column:
data <- data.frame(CITY = c('Mumbai', 'Jaipur', 'Delhi', 'Chennai', 'Bengaluru'));
data[!(data$CITY %in% c('Mumbai', 'Delhi', 'Bengaluru')), 'CITY'] <- NA;
Output:
> data
CITY
1 Mumbai
2 <NA>
3 Delhi
4 <NA>
5 Bengaluru
Also, you could simply remove the undesired rows, in which case the remaining rows would keep their original row names:
data <- data[data$CITY %in% c('Mumbai', 'Delhi', 'Bengaluru'), , drop = FALSE];
Output:
> data
CITY
1 Mumbai
3 Delhi
5 Bengaluru

Merging multiple columns in a dataframe based on condition in R

I am very new to R, and I want to do the following:
I have a data frame that consists of ID, Col1, Col2, Col3 columns.
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text="
ID Col1 Col2 Col3
1 0 'Less than once a month' 0
2 Never 0 0
3 0 0 'Once a month'
")
I want to merge those 3 columns into one, where if there is "Never" and 0 in the other columns the value is "Never"; if there is "Once a month" and the rest are 0, then "Once a month"; and so on. All columns are mutually exclusive, meaning there cannot be "Never" and "Once a month" in the same row.
I tried to apply this loop:
for (val in df) {
if(df$Col1 == "Never" && df$Col2 == "0")
{
df$consolidated <- "Never"
} else (df$`Col1 == "0" && df$Col2 == "Less than once a month")
{
how_oft_purch_gr_pers$consolidated <- "Less than once a month"
}
}
I wanted to figure it out for two columns first, but it didn't work: all rows in the consolidated column get filled with "Less than once a month".
I want it to be like this:
ID Col1 Col2 Col3 Consolidated
1 0 Less than once a month 0 Less than once a month
2 Never 0 0 Never
3 0 0 Once a month Once a month
Any hint on what am I doing wrong?
Thank you in advance
You can think of using dplyr::coalesce after replacing 0 with NA. coalesce() finds the first non-missing value (within a row, in this case) and creates a new column. The solution can be written as:
library(dplyr)
df %>%
  mutate_at(vars(starts_with("Col")), funs(na_if(., "0"))) %>%
  mutate(Consolidated = coalesce(Col1, Col2, Col3)) %>%
  select(ID, Consolidated)
# or, more concisely:
bind_cols(df[1], Consolidated = coalesce(!!!na_if(df[-1], "0")))
# ID Consolidated
# 1 1 Less than once a month
# 2 2 Never
# 3 3 Once a month
Data:
df <- read.table(text =
"ID Col1 Col2 Col3
1 0 'Less than once a month' 0
2 Never 0 0
3 0 0 'Once a month'",
stringsAsFactors = FALSE, header = TRUE)
Even though @MKR has written a good answer, I want to point out a few errors in your code which might be the reason why it does not work.
for (val in df) {
You probably want to loop over all rows of df. However, you are in fact looping over the columns of your data frame. The reason is that a data frame is a list of vectors (your columns) which all must have the same length, so iterating over df iterates over its elements, which are the columns. See the Q&A 'For each row in data.frame'.
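A minimal sketch of what row-wise iteration looks like instead, iterating over row indices rather than over df itself:

for (i in seq_len(nrow(df))) {
  row <- df[i, ]  # a one-row data frame holding row i
  # work with row here
}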
if(df$Col1 == "Never" && df$Col2 == "0"){
Note that when using the double && instead of &, R looks only at the first element of the vectors you give it. See for example the Q&A 'Boolean Operators && and ||'.
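A quick illustration of the difference, assuming two logical vectors of length 2 (note that recent R versions turn the length > 1 case into an error rather than a warning):

c(TRUE, TRUE) &  c(TRUE, FALSE)   # element-wise: TRUE FALSE
c(TRUE, TRUE) && c(TRUE, FALSE)   # scalar: older R compares only the first elements; newer R errors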
df$consolidated <- "Never"
Here, you set the whole consolidated column of df to "Never", because you do not use the iteration variable val from above (and even if you did, it stands for a column, not a row, the way you wrote it).
} else (df$`Col1 == "0" && df$Col2 == "Less than once a month"){
You need to use else if (...), not else (...). The way you wrote it, R treats the expression in (...) as the else branch to be evaluated when the if (...) above is not true, and regards the {...} block that follows as a separate statement with nothing to do with the if ... else ... construct, because the else part has already been executed. So the {...} block is always executed, regardless of the outcome of the if (...) above.
Is df$`Col1 a typo? The backtick ` should only occur in pairs and can be used around variable names (including column names).
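A quick illustration of paired backticks around a non-syntactic column name (the data frame here is made up):

df2 <- data.frame(check.names = FALSE, `Col 1` = 1:3)
df2$`Col 1`   # the backticks come in pairs around the name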
df$consolidated <- "Less than once a month"
Here you again set a whole column to one value, like explained above.
}
}
This is a possibility using base R.
Start your result column. Initialize it with only "0".
df$coalesced <- "0"
Loop over the relevant columns of df (Col1 to Col3). Use drop = FALSE in case you ever loop over a single column: without it, R would return a vector, and for would iterate over that vector's elements rather than over the one column.
for (column in df[, c("Col1", "Col2", "Col3"), drop = FALSE]) {
This checks each element of coalesced to see whether it is already filled; if not (i.e., it is still "0"), it is filled with the current column's value (which may itself be "0").
  df$coalesced <- ifelse(df$coalesced == "0", column, df$coalesced)
}
Because the loop assigns into df$coalesced directly, the new column is already part of the data frame and no further assignment is needed.
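Putting the steps together, a sketch run against the df from the question:

df$coalesced <- "0"
for (column in df[, c("Col1", "Col2", "Col3"), drop = FALSE]) {
  # keep an already-filled value, otherwise take the current column's value
  df$coalesced <- ifelse(df$coalesced == "0", column, df$coalesced)
}
df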

How to drop a buffer of rows in a data frame around rows of a certain condition

I am trying to remove rows in a data frame that are within x rows after rows meeting a certain condition.
I have a data frame with a response variable, a measurement type that represents the condition, and time. Here's a mock data set:
data <- data.frame(rlnorm(45, 0, 1),
                   c(rep(1, 15), rep(2, 15), rep(1, 15)),
                   seq(from = as.POSIXct("2012-1-1 0:00", tz = "EST"),
                       to = as.POSIXct("2012-1-1 0:44", tz = "EST"),
                       by = "min"))
names(data) <- c('Variable', 'Type', 'Time')
In this mock case, I want to delete the first 5 rows in condition 1 after condition 2 occurs.
The way I thought about solving this problem was to generate a separate vector giving, for each Type 1 observation, its distance from the last Type 2 observation. Here's the code I wrote:
dist <- vector()
for (i in 1:nrow(data)) {
  if (data$Type[i] != 1) {
    dist[i] <- 0
  } else {
    position <- i
    tempcount <- 0
    # walk backwards until a row that is not Type 1 is reached
    while (position > 0 && data$Type[position] == 1) {
      position <- position - 1
      tempcount <- tempcount + 1
    }
    dist[i] <- tempcount
  }
}
This code will do the trick, but it's extremely inefficient. I was wondering if anyone had some cleverer, faster solutions.
If I understand you correctly, this should do the trick:
criteria1 <- which(data$Type[2:nrow(data)] == 2 & data$Type[2:nrow(data)] != data$Type[1:(nrow(data) - 1)]) + 1
criteria2 <- as.vector(sapply(criteria1, function(x) seq(x, x + 5)))
data[-criteria2, ]
How it works:
criteria1 contains the indices where Type == 2 but the previous row is not of the same type. The strange-looking subsets like 2:nrow(data) are there because we want to compare each row to the previous one, and the first row has no previous row; therefore we add +1 at the end.
criteria2 contains the sequences running from each number in criteria1 to that number + 5.
The third line performs the subset.
This might need small modifications; I wasn't exactly sure what criteria 1 and criteria 2 should be from your code. Let me know if this works or you need any more advice!
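In case the intent was instead to drop the first 5 Type 1 rows after each run of Type 2 ends (rather than rows around where a run starts), here is a hedged base R sketch, assuming at least 5 rows follow each boundary:

# indices where a Type 2 row is immediately followed by a Type 1 row
ends <- which(data$Type[-nrow(data)] == 2 & data$Type[-1] == 1)
# the 5 rows immediately after each such boundary
buffer <- unlist(lapply(ends, function(x) seq(x + 1, x + 5)))
data[-buffer, ]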

Group categories in R according to first letters of a string?

I have a dataset loaded in R, and I have one of the columns that has text. This text is not unique (any row can have the same value) but it represents a specific condition of a row, and so the first 3-5 letters of this field will represent the group where the row belongs. Let me explain with an example.
Having 3 different rows, only showing the id and the column I need to group by:
ID........... TEXTFIELD
1............ VGH2130
2............ BFGF2345
3............ VGH3321
Given the previous example, I would like to create a new column in the dataframe where the group would be set, such as:
ID........... TEXTFIELD........... NEWCOL
1............ VGH2130............. VGH
2............ BFGF2345............ BFGF
3............ VGH3321............. VGH
And to determine the groups that would be formed in this new column, I would like to make a vector with the possible groups (since all the rows will be contained in one of these groups), for example c("VGH", "BFGF", ...).
Can anyone shed some light on how to do this efficiently? (Without a for loop, since I have millions of rows and that would take ages.)
You can also try
> library(stringr)
> data$group <- str_extract(data$TEXTFIELD, "[A-Za-z]+")
> data
ID TEXTFIELD group
1 1 VGH2130 VGH
2 2 BFGF2345 BFGF
3 3 VGH3321 VGH
you can try, if df is your data.frame:
df$NEWCOL <- gsub("([A-Z]+)\\d+.*", "\\1", df$TEXTFIELD)
> df
# ID TEXTFIELD NEWCOL
#1 1 VGH2130 VGH
#2 2 BFGF2345 BFGF
#3 3 VGH3321 VGH
Does the text field always have 3 or 4 letters preceding the numbers?
you can check that by doing:
nrow(data[grepl("[A-Za-z]{1,4}\\d+", data$TEXTFIELD), ]) # number of rows where TEXTFIELD has 1-4 letters followed by digits
If yes, then:
require(stringr)
data$NEWCOL <- str_extract(data$TEXTFIELD, "[A-Za-z]{1,4}")
Final Step:
data$group <- ifelse(data$NEWCOL == "VGH", "Group Name", ifelse(data$NEWCOL == "BFGF", "Group Name", ifelse(...)))
# Complete the ifelse statement to classify all groups
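As an alternative to the nested ifelse chain, a named lookup vector maps each prefix to its label in one step; a sketch with made-up group names:

# map each prefix to a group label (the labels here are hypothetical)
group_names <- c(VGH = "Group 1", BFGF = "Group 2")
data$group <- unname(group_names[data$NEWCOL])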
