Reverse lookup for loop in R

I have a set of numbers / strings that make up other numbers / strings. I need to create a function that gives me a list of all the numbers / strings needed to create a given number / string.
Consider the following dataset
ingredients <- c('N/A', 'cat', 'bird')
product <- c('cat', 'bird', 'dog')
data <- data.frame(ingredients, product)
head(data)
If I call the function with dog, I would like it to return a list containing bird and then cat. The function should stop when ingredients = N/A (there is nothing more to look up).
It seems like some sort of for loop that appends is the right approach.
needed <- list()
for (product in list){
needed[[product]]<-df
}
df <- dplyr::bind_rows(product)

I changed your sample data so that N/A is simply NA, which lets me use the is.na function in my R code. Now the sample data is
ingredients <- c(NA, 'cat', 'bird')
product <- c('cat', 'bird', 'dog')
data <- data.frame(ingredients, product)
Code is below:
ReverseLookup <- function(input) {
  ans <- list()
  # Keep walking back while the current input is itself a product
  while (input %in% data$product) {
    if (!is.na(as.character(data[which(data$product == input), ]$ingredients))) {
      ans <- append(ans, as.character(data[which(data$product == input), ]$ingredients))
      input <- as.character(data[which(data$product == input), ]$ingredients)
    } else {
      # An NA ingredient means there is nothing more to look up
      break
    }
  }
  print(ans)
}
I create an empty list and then create a while loop that just checks if the input exists in the product column. If so, it then checks to see if the corresponding ingredient to the product input is a non-NA value. If that's the case, the ingredient will be appended to ans and will become the new input. I also added a break statement to get out of the while loop when you reach an NA.
I did a quick test on the case where there is no NA in your dataframe and it appears to be working fine. Maybe someone else here can figure out a more concise way to write this, but it should work for you.
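As a more concise variant of the same loop, here is a minimal sketch that stores the product-to-ingredient mapping in a named vector. The names lookup and reverse_lookup_iter are made up for this illustration, and the check against repeated values is an added assumption to avoid looping forever if the data ever contains a cycle.

```r
ingredients <- c(NA, "cat", "bird")
product <- c("cat", "bird", "dog")

# Named vector: names are products, values are their ingredients
lookup <- setNames(ingredients, product)

# Hypothetical helper: follow the chain until an NA ingredient is reached
reverse_lookup_iter <- function(lookup, input) {
  chain <- character(0)
  current <- unname(lookup[input])
  # Stop on NA (nothing more to look up) or on a repeat (guards against cycles)
  while (!is.na(current) && !(current %in% chain)) {
    chain <- c(chain, current)
    current <- unname(lookup[current])
  }
  chain
}

reverse_lookup_iter(lookup, "dog")
#> [1] "bird" "cat"
```

An input that is not a known product simply yields an empty character vector.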

You can likely find a way to use a tree of some type to work through nodes. But, using a recursive function in base R, I have come up with this.
I have also changed the 'N/A' to NA to make life easier. Also, I have added in stringsAsFactors = F to the data frame.
ingredients <- c(NA, 'cat', 'bird')
product <- c('cat', 'bird', 'dog')
data <- data.frame(ingredients, product, stringsAsFactors = F)
reverse_lookup <- function(data, x, last_result = NULL) {
  # After the first call, look up the ingredient of the most recent result
  if (!is.null(last_result)) {
    x <- data[data$product == last_result[length(last_result)], "ingredients"]
  }
  # Recurse until an NA ingredient is reached
  if (!is.na(x)) {
    last_result <- reverse_lookup(data, x, c(last_result, x))
  }
  last_result
}
This returns the input as well, which you can always drop off as the first element of the vector.
> reverse_lookup(data, "dog")
[1] "dog" "bird" "cat"

Related

R function used to rename columns of a data frames

I have a data frame, say acs10. I need to relabel the columns. To do so, I created another data frame, named as labelName with two columns: The first column contains the old column names, and the second column contains names I want to use, like the table below:
column_1     column_2
oldLabel1    newLabel1
oldLabel2    newLabel2
Then, I wrote a for loop to change the column names:
for (i in seq_len(nrow(labelName))){
names(acs10)[names(acs10) == labelName[i,1]] <- labelName[i,2]}
This works.
However, when I tried to put the for loop into a function, because I need to rename column names for other data frames as well, the function failed. The function I wrote looks like below:
renameDF <- function(dataF,varName){
for (i in seq_len(nrow(varName))){
names(dataF)[names(dataF) == varName[i,1]] <- varName[i,2]
print(varName[i,1])
print(varName[i,2])
print(names(dataF))
}
}
renameDF(acs10, labelName)
where dataF is the data frame whose names I need to change, and varName is another data frame in which old and new variable names are paired. I used print(names(dataF)) to debug, and the printout suggests that the function works. However, calling the function does not actually change the column names. I suspect it has something to do with scope, but I want to know how to make it work.
In your function you need to return the changed dataframe.
renameDF <- function(dataF,varName){
for (i in seq_len(nrow(varName))){
names(dataF)[names(dataF) == varName[i,1]] <- varName[i,2]
}
return(dataF)
}
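The reason the return value matters is that R modifies a copy of dataF inside the function, so the caller's data frame only changes if the result is assigned back. A minimal sketch, using small made-up stand-ins for acs10 and labelName:

```r
renameDF <- function(dataF, varName) {
  for (i in seq_len(nrow(varName))) {
    names(dataF)[names(dataF) == varName[i, 1]] <- varName[i, 2]
  }
  dataF
}

# Toy stand-ins for the data in the question (assumed for illustration)
acs10 <- data.frame(oldLabel1 = 1:2, oldLabel2 = 3:4, other = 5:6)
labelName <- data.frame(
  column_1 = c("oldLabel1", "oldLabel2"),
  column_2 = c("newLabel1", "newLabel2"),
  stringsAsFactors = FALSE
)

# Calling without assigning leaves the original untouched
invisible(renameDF(acs10, labelName))
before <- names(acs10)

# Assigning the result back is what actually renames the columns
acs10 <- renameDF(acs10, labelName)
after <- names(acs10)
```

Here before still holds the old names, while after holds newLabel1, newLabel2 and the untouched other column.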
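You can also simplify this and avoid the for loop by using match. Names that do not appear in varName are left unchanged:
renameDF <- function(dataF, varName){
  idx <- match(names(dataF), varName[[1]])
  names(dataF)[!is.na(idx)] <- as.character(varName[[2]][idx[!is.na(idx)]])
  return(dataF)
}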
This should do the whole thing in one line.
colnames(acs10)[colnames(acs10) %in% labelName$column_1] <- labelName$column_2[match(colnames(acs10)[colnames(acs10) %in% labelName$column_1], labelName$column_1)]
This also works when a column name isn't in the data dictionary (such columns keep their names), but it's a bit more convoluted:
library(tibble)
df <- tribble(~column_1,~column_2,
"oldLabel1", "newLabel1",
"oldLabel2", "newLabel2")
d <- tibble(oldLabel1 = NA, oldLabel2 = NA, oldLabel3 = NA)
fun <- function(dat, dict) {
names(dat) <- sapply(names(dat), function(x) ifelse(x %in% dict$column_1, dict[dict$column_1 == x,]$column_2, x))
dat
}
fun(d, df)
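You can create a function containing just one line of code. Note that this assumes every column of df appears in varName; columns without a match end up with NA as their name.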
renameDF <- function(df, varName){
setNames(df,varName[[2]][pmatch(names(df),varName[[1]])])
}

How to spot a phrase/word in a cell of a dataframe, using R

Given a data.frame like 'df', I would like to spot the exact phrase "keratinization [GO:0031424]" in each cell of the column 'bio_process'. Afterwards, I want to create a new vector with the 'ID' of the observations where the match occurred.
ID <- c("Q9BYP8", "Q17RH7", "Q6L8G8", "Q9BYR4")
bio_process <- c("keratinization [GO:0031424]", "NA", "keratinization [GO:0031424]", "aging [GO:0007568]; hair cycle [GO:0042633]; keratinization [GO:0031424]")
df <- as.data.frame(cbind(ID,bio_process))
In order to achieve this, I wrote a for loop and used %in% inside it, like this:
n <- 4
ids <- vector(mode = "character", length = n)
for (i in 1:n) {
if ("keratinization [GO:0031424]" %in% df$bio_process[i]) {
ids[i] <- df$ID[i]
}
}
As a result I would like the content of 'ids' vector to be like this one below.
"Q9BYP8" "Q6L8G8" "Q9BYR4"
However, %in% does not work for the cells where 'keratinization [GO:0031424]' is not the only content.
Any ideas? Thank you
You can use grepl in base R:
df$ID[grepl("keratinization \\[GO:0031424\\]",df$bio_process)]
[1] Q9BYP8 Q6L8G8 Q9BYR4
Note that I had to escape the [ and ] characters with \\ because square brackets have special meaning in regex.
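An alternative to escaping is to pass fixed = TRUE, which makes grepl treat the pattern as a literal string rather than a regular expression. A minimal sketch on the data from the question (stringsAsFactors = FALSE is added here so the IDs come back as plain character strings):

```r
ID <- c("Q9BYP8", "Q17RH7", "Q6L8G8", "Q9BYR4")
bio_process <- c(
  "keratinization [GO:0031424]",
  "NA",
  "keratinization [GO:0031424]",
  "aging [GO:0007568]; hair cycle [GO:0042633]; keratinization [GO:0031424]"
)
df <- data.frame(ID, bio_process, stringsAsFactors = FALSE)

# fixed = TRUE: the brackets are matched literally, so no escaping is needed
ids <- df$ID[grepl("keratinization [GO:0031424]", df$bio_process, fixed = TRUE)]
ids
#> [1] "Q9BYP8" "Q6L8G8" "Q9BYR4"
```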

Finding Matches Across Char Vectors in R

Given the two vectors below, is there a way to produce the desired data frame? This represents a real-world situation in which I have two data frames: the first contains a column with database values (keys), and the second contains a column of 1000+ rows, each a file name (potentials), which I need to match. The problem is that multiple files (potentials) can match any given key. I have worked with grep, merge, inner join etc. but was unable to combine them into one solution. Any advice is appreciated!
potentials <- c("tigerINTHENIGHT",
"tigerWALKINGALONE",
"bearOHMY",
"bearWITHME",
"rat",
"imatchnothing")
keys <- c("tiger",
"bear",
"rat")
desired <- data.frame(keys, c("tigerINTHENIGHT, tigerWALKINGALONE", "bearOHMY, bearWITHME", "rat"))
names(desired) <- c("key", "matches")
Pseudocode for what I think of as the solution:
#new column which is comma separated potentials
# x being the substring length i.e. x = 4 means true if first 4 letters match
function createNewColumn(keys, potentials, x){
str result = na
foreach(key in keys){
if(substring(key, 0, x) == any(substring(potentals, 0 ,x))){ //search entire potential vector
result += potential that matched + ', '
}
}
return new column with result as the value on the current row
}
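For reference, the pseudocode above could be turned into runnable R roughly as follows. This is only a sketch of the prefix comparison it describes: the name create_new_column and the choice to return NA for keys without a match are assumptions made for this illustration.

```r
potentials <- c("tigerINTHENIGHT", "tigerWALKINGALONE",
                "bearOHMY", "bearWITHME", "rat", "imatchnothing")
keys <- c("tiger", "bear", "rat")

# Hypothetical helper: for each key, collect the potentials whose first
# x characters equal the first x characters of the key
create_new_column <- function(keys, potentials, x) {
  vapply(keys, function(key) {
    hits <- potentials[substr(potentials, 1, x) == substr(key, 1, x)]
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
  }, FUN.VALUE = character(1), USE.NAMES = FALSE)
}

result <- data.frame(
  key = keys,
  matches = create_new_column(keys, potentials, x = 3),
  stringsAsFactors = FALSE
)
```

With x = 3 this reproduces the desired data frame for the sample vectors.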
We can write a small function to extract matches and then loop over the keys:
return_matches <- function(keys, potentials, fixed = TRUE) {
vapply(keys, function(k) {
paste(grep(k, potentials, value = TRUE, fixed = fixed), collapse = ", ")
}, FUN.VALUE = character(1))
}
vapply is a type-safe version of sapply: with FUN.VALUE = character(1) it either returns a character vector or fails with an error. When you set fixed = TRUE the function will run a lot faster, but the keys are then treated as literal strings rather than regular expressions. Then we can easily make the desired data.frame:
df <- data.frame(
key = keys,
matches = return_matches(keys, potentials),
stringsAsFactors = FALSE
)
df
#>         key                            matches
#> tiger tiger tigerINTHENIGHT, tigerWALKINGALONE
#> bear   bear               bearOHMY, bearWITHME
#> rat     rat                                rat
The reason for putting the loop in a function instead of running it directly is just to make the code look cleaner.
You can iterate using grep:
> Match <- sapply(keys, function(item) {
paste0(grep(item, potentials, value = TRUE), collapse = ", ")
} )
> data.frame(keys, Match, row.names = NULL)
   keys                              Match
1 tiger tigerINTHENIGHT, tigerWALKINGALONE
2  bear               bearOHMY, bearWITHME
3   rat                                rat

How to build subset query using a loop in R?

I'm trying to subset a big table across a number of columns, keeping only the rows where State_2009, State_2010, State_2011 etc. do not equal the value "Unknown".
My instinct was to do something like this (coming from a JS background), where I either build the query in a loop or continually subset the data in a loop, referencing the year as a variable.
mysubset <- data
for(i in 2009:2016){
mysubset <- subset(mysubset, paste("State_",i," != Unknown",sep=""))
}
But this doesn't work, at least because paste returns a string, giving me the error 'subset' must be logical.
Is there a better way to do this?
Using dplyr with the filter_ function should get you the correct output (note that filter_ is deprecated in recent dplyr releases):
library(dplyr)
mysubset <- data
for(i in 2009:2016)
{
mysubset <- mysubset %>%
filter_(paste("State_",i," != \"Unknown\"", sep = ""))
}
To add to Matt's answer, you could also do it like this:
cols <- paste0("State_", 2009:2016)
inds <- which(mysubset[, cols] == "Unknown", arr.ind = TRUE)[, 1]
if (length(inds) > 0) mysubset <- mysubset[-unique(inds), ]
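The same filter can also be written without a loop or negative indexing by counting "Unknown" values per row. This is a minimal base R sketch on a small made-up table (the id column and the two-year range are assumptions for illustration only):

```r
# Toy data standing in for the big table
data <- data.frame(
  id = 1:4,
  State_2009 = c("NY", "Unknown", "CA", "TX"),
  State_2010 = c("NY", "NJ", "Unknown", "TX"),
  stringsAsFactors = FALSE
)

cols <- paste0("State_", 2009:2010)

# TRUE for rows where none of the State_ columns equals "Unknown";
# na.rm = TRUE means missing values are not counted as "Unknown"
keep <- rowSums(data[cols] == "Unknown", na.rm = TRUE) == 0
mysubset <- data[keep, ]

mysubset$id
#> [1] 1 4
```

Because it uses a logical index, this version also behaves sensibly when no row contains "Unknown".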

How to filter for 'any value' in R?

Strange question, but how do I filter such that all rows of a dataframe are returned? For example, say you have the following dataframe:
Pts <- floor(runif(20, 0, 4))
Name <- c(rep("Adam",5), rep("Ben",5), rep("Charlie",5), rep("Daisy",5))
df <- data.frame(Pts, Name)
And say you want to set up a predetermined filter for this dataframe, for example:
Ptsfilter <- c("2", "1")
Which you will then run through the dataframe, to get your new filtered dataframe
dffil <- df[df$Pts %in% Ptsfilter, ]
At times, however, you don't want the dataframe to be filtered at all, and in the interests of automation and minimising workload, you don't want to have to go back and remove/comment-out every instance of this filter. You just want to be able to adjust the Ptsfilter value such that no rows will be filtered out of the dataframe, when that line of code is run.
I have experimented/guessed with things like:
Ptsfilter <- c("")
Ptsfilter <- c(" ")
Ptsfilter <- c()
to no avail.
Is there a value I can enter for Ptsfilter that will achieve this goal?
You might need to define a function to do this for you.
filterDF <- function(df, filter) {
  # An empty filter means "keep every row"
  if (length(filter) > 0) {
    return(df[df$Pts %in% filter, ])
  } else {
    return(df)
  }
}
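A short usage sketch of this idea, with deterministic stand-in data instead of runif so the row counts are predictable (the data here is assumed for illustration):

```r
filterDF <- function(df, filter) {
  if (length(filter) > 0) df[df$Pts %in% filter, ] else df
}

# Deterministic stand-in for the random data in the question
Pts <- rep(0:3, times = 5)
Name <- c(rep("Adam", 5), rep("Ben", 5), rep("Charlie", 5), rep("Daisy", 5))
df <- data.frame(Pts, Name)

# A non-empty filter keeps only the matching rows
filtered <- filterDF(df, c("2", "1"))

# An empty filter (c() is NULL, which has length 0) returns every row
unfiltered <- filterDF(df, c())

nrow(filtered)
#> [1] 10
nrow(unfiltered)
#> [1] 20
```

With this wrapper, setting Ptsfilter <- c() switches the filter off without touching the code that applies it.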

Resources