Efficient way of searching for a string in a data frame - r

I would like to come up with an efficient way of finding a string in a data.frame including the values stored in row names.
Present approach
Primitively, I can achieve that by running this code:
data(mtcars)
mtcars$rows <- row.names(mtcars)
sapply(mtcars, function(x) { grep("mazda",x, ignore.case = TRUE) })
I don't like it as it returns data for all columns:
> length(sapply(mtcars, function(x) { grep("mazda",x, ignore.case = TRUE) }))
[1] 12
I would like to prettify this code so that it only returns:
the column name for a successful match
the row name for a successful match
in the format column X row
Additional considerations
Following suggestions expressed in comments, it occurred to me that I would also like to search column names, if possible.
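For illustration, here is a minimal sketch of one way to get that output (this is not an answer from the original thread), reusing the mtcars example above: grep() each column in turn and paste the matching column and row names together; the column names themselves can be searched with a separate grep().
data(mtcars)
mtcars$rows <- row.names(mtcars)
# one grep per column; report hits as "column X row"
hits <- lapply(names(mtcars), function(col) {
  idx <- grep("mazda", mtcars[[col]], ignore.case = TRUE)
  if (length(idx)) paste(col, "X", row.names(mtcars)[idx]) else character(0)
})
unlist(hits)
[1] "rows X Mazda RX4"     "rows X Mazda RX4 Wag"
# column names can be searched separately
grep("mazda", names(mtcars), ignore.case = TRUE, value = TRUE)
character(0)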

Related

Changing dataframes in bulk? How to apply a list of operations to multiple dataframes?

So, I have 6 data frames that all share the same structure (with different values).
Now I want to create a new column in all the data frames for the country. Then I want to convert it into a long df. This is how I am going about it.
dlist <- list(child_mortality, fertility, income_capita, life_expectancy, population)
convertlong <- function(trial){
  trial$country <- rownames(trial)
  trial <- melt(trial)
  colnames(trial) <- c("country", "year", trial)
}
for(i in dlist){
  convertlong(i)
}
After running this I get:
Using country as id variables
Error in names(x) <- value :
'names' attribute [5] must be the same length as the vector [3]
That's all; it doesn't do the operations on the data frames. I am pretty sure I'm making a stupid mistake, but I looked online in forums and cannot figure it out.
Maybe you can replace
trial$country <- rownames(trial)
with
trial <- cbind(trial, rownames(trial))
Here's a tidyverse attempt -
library(tidyverse)

# Put the dataframes in a named list (lst is not a typo: unlike list, it auto-names the elements).
dlist <- dplyr::lst(child_mortality, fertility,
                    income_capita, life_expectancy, population)

# Write a function which creates a new column from the rownames
# and gets the data into long format.
# The column name for the 3rd column is passed in separately (`col`).
convertlong <- function(trial, col){
  trial %>%
    rownames_to_column('country') %>%
    pivot_longer(cols = -country, names_to = 'year', values_to = col)
}

# Use `imap` to pass each dataframe as well as its name to the function.
dlist <- imap(dlist, convertlong)

# If you want the changes to be reflected for the dataframes in the global environment:
list2env(dlist, .GlobalEnv)

Speed up string filtering in R

I have a dataframe with a column of codes separated by commas. I am currently filtering this dataframe by looking through the code column and if a code appears from the list, I keep that row. My issue is that this dataframe is expanding, as is the list of acceptable codes, so I'd like to speed this process up if possible. Ideally there would be a way to mark a row as already checked and if a good code was in the row, then to not have to check it again for all the other acceptable codes.
Current dataframe looks something like this:
Code_column
,12ab,
,12ab,123b,
,456t,345u,
,12ab,789p,
list of good codes:
good_codes <- c(',123b,', ',456t,', ',345u,')
My filtering process currently:
df %>%
  filter(sapply(`Code_column`,
                function(x) any(sapply(good_codes, str_detect, string = x))) == TRUE)
Final column
Code_column
,12ab,123b,
,456t,345u,
I think we do not need sapply here:
df[str_detect(df$Code_column, paste(good_codes, collapse = '|')), ]
[1] ",12ab,123b," ",456t,345u,"
You can collapse the good codes with | and pass the combined pattern to str_detect:
paste(good_codes, collapse = '|')
[1] ",123b,|,456t,|,345u,"

How to do a complex edit of columns of all data frames in a list?

I have a list of 185 data frames called WaFramesNumeric. Each dataframe has several hundred columns and thousands of rows. I want to edit every data frame so that it keeps all numeric columns as well as any non-numeric columns that I specify.
Using:
for(i in seq_along(WaFramesNumeric)) {
  WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][, sapply(WaFramesNumeric[[i]], is.numeric)]
}
successfully makes each dataframe contain only its numeric columns.
I've tried to amend this with lines to add specific columns. I have tried:
for (i in seq_along(WaFramesNumeric)) {
  a <- WaFramesNumeric[[i]]$Device_Name
  WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][, sapply(WaFramesNumeric[[i]], is.numeric)]
  cbind(WaFramesNumeric[[i]], a)
}
and in an attempt to call the column numbers of all integer columns as well as the specific ones and then combine based on that:
for (i in seq_along(WaFramesNumeric)) {
  f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
  m <- match("Cost_Center", colnames(WaFramesNumeric[[i]]))
  n <- match("Device_Name", colnames(WaFramesNumeric[[i]]))
  combine <- c(f, m, n)
  WaFramesNumeric[[i]][,i,combine]
}
These all return errors and I am stumped as to how I could do this. WaFramesNumeric is a copy of another list of dataframes (WaFramesNumeric <- WaFramesAll) and so I also tried adding the specific columns from the WaFramesAll but this was not successful.
I appreciate any advice you can give and I apologize if any of this is unclear.
You are mistakenly assuming that the last command in a for loop is meaningful. It is not: since you never assign the result (neither the cbind nor the indexing of WaFramesNumeric...) to anything, it is silently discarded.
Additionally, you are over-indexing your data.frame in the third code block. First, it's using i within the data.frame, even though i is an index within the list of data.frames, not the frame itself. Second (perhaps caused by this), you are trying to index three dimensions of a 2D frame. Just change the last indexing from [,i,combine] to either [,combine] or [combine].
Third problem (though perhaps not seen yet) is that match will return NA if nothing is found. Indexing a frame with an NA returns an error (try mtcars[,NA] to see). I suggest that you can replace match with grep: it returns integer(0) when nothing is found, which is what you want in this case.
for (i in seq_along(WaFramesNumeric)) {
  f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
  m <- grep("Cost_Center", colnames(WaFramesNumeric[[i]]))
  n <- grep("Device_Name", colnames(WaFramesNumeric[[i]]))
  combine <- c(f, m, n)
  WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][combine]
}
I'm not sure what you mean by "an attempt to call the column numbers of all integer columns...", but in case you want to go through a list of data frames, select some columns based on a function, and also keep given column names, you can do it like this:
df <- data.frame(a = rnorm(20), b = rnorm(20), c = letters[1:20], d = letters[1:20],
                 stringsAsFactors = FALSE)
WaFramesNumeric <- rep(list(df), 2)

Selector <- function(data, select_func, select_names) {
  select_func <- match.fun(select_func)
  idx_names <- match(select_names, colnames(data))
  idx_names <- idx_names[!is.na(idx_names)]
  idx_func <- which(sapply(data, select_func))
  idx <- unique(c(idx_func, idx_names))
  return(data[, idx])
}

res <- lapply(X = WaFramesNumeric, FUN = Selector, select_names = c("c"), select_func = is.numeric)
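For a quick check, the sketch above keeps the numeric columns a and b plus the requested column c for each list element:
names(res[[1]])
[1] "a" "b" "c"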

How to assign a subset from a data frame `a` to a subset of data frame `b`

It might be a trivial question (I am new to R), but I could not find an answer to my question, either here in SO or anywhere else. My scenario is the following.
I have a data frame df and I want to update a subset of the df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the column tag in subsets of the data frame, using a second data frame (e.g., aux) that contains two columns, namely key and value. The subsets are defined by id = n, for each n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for(i in unique(df$id)){
  indexer = df$id == i
  # here is how I tried to update the data frame:
  df[indexer,]$tag <- aux[match(df[indexer,]$tag, aux$key),]$value
}
The expected result was the df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag filled with NA's. I got no errors, but the following warning message:
In `[<-.factor`(`*tmp*`, df$id == i, value = c(NA, :
  invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but some duplicated df$tag values made match() produce misplaced updates in a number of rows. I also simulated the subsetting and it works fine. Can someone suggest a solution for this update?
UPDATE (how should the final dataset look?):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
It turned out that my data was breaking all the available built-in functions, giving me a wrong dataset in the end. So my solution (at least, a preliminary one) was the following (a sketch of this follows below):
process each subset individually;
add each resulting data frame to a list;
use rbindlist(a.list, use.names = T) to get a complete data frame with the results.
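A minimal sketch of that preliminary approach, assuming the df from the question and an aux built with plain character columns (the construction of aux below is an assumption, since the question only shows it printed):
library(data.table)
# assumed construction of aux; character columns avoid factor-level warnings
aux <- data.frame(key   = c("aaa", "bbb", "rrr", "fff"),
                  value = c("valueAA", "valueBB", "valueRR", "valueFF"),
                  stringsAsFactors = FALSE)
a.list <- list()
for (i in unique(df$id)) {
  subset_i <- df[df$id == i, ]
  # replace the whole tag column of the subset, so no partial factor assignment occurs
  subset_i$tag <- aux$value[match(subset_i$tag, aux$key)]
  a.list[[length(a.list) + 1]] <- subset_i
}
# combine the per-subset results into one complete data frame
result <- rbindlist(a.list, use.names = TRUE)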

Eliminate dataframe rows that match a character string

I have a dataframe rawdata with columns that contain ecological information. I am trying to eliminate all of the rows for which the column LatinName matches a vector of species for which I already have some data, and create a new dataframe with only the species that are missing data. So, what I'd like to do is something like:
matches <- c("Thunnus thynnus", "Balaenoptera musculus", "Homarus americanus")
# obviously these are a random subset; the real vector has ~16,000 values
rawdata_missing <- rawdata %>% filter(LatinName != "matches")
This doesn't work because the boolean operator can't be applied to a character string. Alternatively I could do something like this:
rawdata_missing <- filter(rawdata, !grepl(matches, LatinName))
This doesn't work either because !grepl also can't use the character string.
I know there are a lot of ways I could subset rawdata using the rows where LatinName IS in matches, but I can't figure out a neat way to subset rawdata such that LatinName is NOT in matches.
Thanks in advance for the help!
filteredData <- rawdata[!(rawdata$LatinName %in% matches), ]
Another way, using subset, paste, mapply and grepl:
filteredData <- subset(rawdata, mapply(grepl, rawdata$LatinName, paste(matches, collapse = "|")) == FALSE)
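Since the question starts from a dplyr pipeline, the %in% test can also go straight into filter() (a small sketch, assuming the rawdata and matches objects above):
library(dplyr)
# keep only the rows whose LatinName is NOT in the matches vector
rawdata_missing <- rawdata %>% filter(!(LatinName %in% matches))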
