R: using grepl across multiple dataframes in a list

I have a list of dataframes that each contain several of the same columns. In one of the columns, there are multiple instances where a row just contains "[]". My goal is to replace these instances with a blank.
I've attempted to do so via the map function and grepl. While it runs, there is no change to the output. Am I going in the right direction here?
Please note that I differentiate between "[]" and "[value]":
I only want to replace the empty brackets with blanks.
My code below:
first_column <- c("1", "2", "3","4")
second_column <- c("value1", "value2","[]","[value]")
first_column_2 <- c("5", "6", "7","8")
second_column_2 <- c("value1", "[]","[]","[value2]")
first_column_3<- c("9", "10", "11","12")
second_column_3 <- c("[]", "[value2]","[]","[]")
df_1 <- data.frame(first_column,second_column)
df_2 <- data.frame(first_column_2,second_column_2)
df_3 <- data.frame(first_column_3,second_column_3)
df_list <- list(df_1,df_2,df_3)
var <- c(2)
df_list <- map(df_list, ~.x[!grepl("[[]",var),])
Thanks!

We can use lapply and gsub to accomplish this. grepl returns TRUE or FALSE depending on whether the pattern matches, whereas gsub allows you to replace matches with something else. Note that instead of specifying an empty string (''), you could just as easily specify NA, but that will depend on your definition of "blank".
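A quick illustration of that difference on a small character vector (just a sketch of the behaviour of these two base R functions):
grepl('\\[\\]', c('value1', '[]', '[value]'))
# [1] FALSE  TRUE FALSE
gsub('\\[\\]', '', c('value1', '[]', '[value]'))
# [1] "value1"  ""        "[value]"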
Here I use base R's lapply, which in this case is equivalent to purrr::map (even the syntax is interchangeable here).
library(dplyr)  # needed for %>%, mutate(), across(), where()

data <- lapply(df_list, function(x) {
  x %>%
    mutate(across(where(is.character), ~ gsub('\\[\\]', '', .x)))
})
[[1]]
first_column second_column
1 1 value1
2 2 value2
3 3
4 4 [value]
[[2]]
first_column_2 second_column_2
1 5 value1
2 6
3 7
4 8 [value2]
[[3]]
first_column_3 second_column_3
1 9
2 10 [value2]
3 11
4 12

You've got a few issues:
(a) You say you want to replace "[]" with "", but your code is trying to drop those rows entirely, not replace the values. Use sub (or gsub) instead of grepl for replacing, or better yet, since you are matching a whole string, don't use regex at all.
(b) You are running grepl on the number 2: you have var <- 2 and your command is grepl("[[]", var), which is grepl("[[]", 2), and that is always FALSE because the string "2" doesn't contain brackets.
(c) Your grepl pattern searches for any string that contains a [ in it, so even if you correct (a) and (b), you'll still match strings like "[value1]" (see the quick check below).
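A quick check of points (b) and (c):
grepl("[[]", 2)           # FALSE: the string "2" contains no bracket
grepl("[[]", "[value1]")  # TRUE: the pattern matches any string containing "["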
As I said in (a), when you're matching a full string, you don't need regex at all. I'd do it like this:
library(purrr)  # for map()

df_list <- map(df_list, ~ {
  .x[[var]][.x[[var]] == "[]"] <- ""
  .x
})
df_list
# [[1]]
# first_column second_column
# 1 1 value1
# 2 2 value2
# 3 3
# 4 4 [value]
#
# [[2]]
# first_column_2 second_column_2
# 1 5 value1
# 2 6
# 3 7
# 4 8 [value2]
#
# [[3]]
# first_column_3 second_column_3
# 1 9
# 2 10 [value2]
# 3 11
# 4 12

Related

Changing column names of a data frame by changing values in R

Say I have the below data frame:
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I want to change column names which include "open" to "a", and column names which include "close" to "b".
Namely I want to obtain the below data frame:
a b
1 1 2
2 4 8
3 5 3
I have a lot of such data frames. The prefix (here "df.") changes, but "open" and "close" are fixed.
Thanks a lot.
We can create a function for reuse
f1 <- function(dat) {
  names(dat)[grep('open$', names(dat))] <- 'a'
  names(dat)[grep('close$', names(dat))] <- 'b'
  dat
}
and apply on the data
df <- f1(df)
Output:
df
a b
1 1 2
2 4 8
3 5 3
If these datasets are in a list:
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
Thanks to dear @akrun's insightful suggestion, as always, we can do it in one go: we supply character vectors to the pattern and replacement arguments of str_replace so that both substitutions are carried out at once. Each argument can be a character vector of length one or more; in the latter case the lengths of the two vectors should correspond. More to the point, as the documentation says:
References of the form \1, \2, etc will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3
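Since the question mentions many such data frames, the same call can be mapped over a list of them; here is a minimal sketch (assuming a hypothetical list df_list of frames that all share the same .open/.close layout):
df_list <- lapply(df_list, function(d)
  rename_with(d, ~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b"))))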
Another base R option using gsub + match + setNames
setNames(
  df,
  c("a", "b")[match(
    gsub("[^open|close]", "", names(df)),
    c("open", "close")
  )]
)
gives
a b
1 1 2
2 4 8
3 5 3
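For clarity, the gsub step strips every character not found in "open|close", reducing each name to either "open" or "close" before match() looks it up:
gsub("[^open|close]", "", names(df))
# [1] "open"  "close"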

List indexed by number with some elements NULL: how to convert to a data frame?

In an R program, the list length is unknown; it is generated from a for loop.
For example:
ls <- list()
ls[[1]] <- 5
ls[[3]] <- "a"
ls[[6]] <- 8
....
Some indices (ordinal numbers) are undefined.
I want to convert this to a data frame, like the following:
1 5
2 NA
3 a
4 NA
5 NA
6 8
...
Additional question: how to get the ordinal number range of this list?
A base R approach could be as follows, assuming "ls" is already there in the environment.
Explanation:
We first iterate through all the elements using lapply; in the anonymous function, wherever a NULL value is found, we replace it with NA. Once the NULL values are replaced with NA, we bind the elements row-wise with rbind via do.call. For the index column, we can use either the seq function or the colon operator to create a sequence.
dfs <- data.frame(
  col1 = do.call('rbind', lapply(ls, function(x) ifelse(is.null(x), NA, x))),
  col2 = seq(1, length(ls)),
  stringsAsFactors = FALSE
)
Alternative using unlist (instead of do.call and rbind):
dfs <- data.frame(
  col1 = unlist(lapply(ls, function(x) ifelse(is.null(x), NA, x))),
  col2 = seq(1, length(ls)),
  stringsAsFactors = FALSE
)
Output:
> dfs
# col1 col2
# 1 3 1
# 2 NA 2
# 3 6 3
# 4 NA 4
# 5 NA 5
# 6 8 6
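As for the additional question: since the list was filled by index, its ordinal numbers simply run from 1 to length(ls), so (a short sketch, using the ls built above):
seq_along(ls)         # all ordinal numbers of the list
range(seq_along(ls))  # the ordinal number range, i.e. c(1, length(ls))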

How to correct/standardize variable names if their format is not consistent

I am writing a script that loads RData files containing the results of earlier experiments and parses data frames saved in them. I've noticed that the variable names are not consistent; for instance, sometimes symbol is called gene_name or gene_symbol. The order of variables also differs between the data frames, so I can't just rename them all with colnames(df) <- c('a', 'b', ...).
I'm looking for a way to rename variables based on their name that won't give an error if that variable isn't found. The below is what I want to do, but (ideally) without needing dozens of conditional statements.
if ('gene_name' %in% colnames(df)) {
df <- df %>% dplyr::rename('symbol' = gene_name)
}
In the below example, I'd like to find an elegant way to rename the variable b to D that I can use safely on data frames that lack a variable b.
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
dfs <- list(x,y)
dfs.fixed <- lapply(dfs, function(x) ?????)
Desired result:
dfs.fixed
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Try this approach:
STEP 1
A function that substitutes a list of colnames with another string (both passed as parameters):
colnames_rep <- function(df, to_find, to_sub) {
  colnames(df)[which(colnames(df) %in% to_find)] <- to_sub
  return(df)
}
STEP 2
Use lapply to apply the function over each data.frame:
lapply(dfs,colnames_rep,to_find=c("b"),to_sub="D")
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Thanks to divibisan for the suggestion
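The same helper also covers the gene_name/gene_symbol situation described in the question; a sketch (assuming dfs here is a hypothetical list of those experiment data frames):
lapply(dfs, colnames_rep, to_find = c("gene_name", "gene_symbol"), to_sub = "symbol")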
We can use rename_at with map. Note that the columns to rename have to be selected with a tidyselect helper such as matches(), so that data frames without a b column are left untouched:
library(dplyr)
library(purrr)
map(dfs, ~ .x %>%
  rename_at(vars(matches("^b$")), sub, pattern = "^b$", replacement = "D"))
#[[1]]
# a D
#1 1 4
#2 2 5
#3 3 6
#[[2]]
# a c
#1 1 4
#2 2 5
#3 3 6
Here's an approach that is similar in concept to Terru_theTerror's, but extends it by allowing regular expressions. It might be overkill, but ...
First, we define a simple "map" that maps to the desired name (first string in each vector of the list) from any string (remaining strings in each vector). The function that does the matching accepts an argument of fixed=FALSE, in which case the 2nd and remaining strings can be regular expressions, which gives more power and responsibility.
If using fixed=TRUE (the default), then the map might look like this:
colnamemap <- list(
c("symbol", "gene_name", "gene_symbol"),
c("D", "c", "quux"),
c("bbb", "b", "ccc")
)
where "gene_name" and "gene_symbol" will both be changed to "symbol", etc. If you want to use patterns (fixed=FALSE), however, you should be as specific as possible to preclude mis- or multiple-matches (across columns).
colnamemapptn <- list(
c("symbol", "^gene_(name|symbol)$"),
c("D", "^D$", "^c$", "^quux$"),
c("bbb", "^b$", "^ccc$")
)
The function that does the actual remapping:
fixfunc <- function(df, namemap, fixed = TRUE, ignore.case = FALSE) {
  compare <- if (fixed) `%in%` else grepl
  downcase <- if (ignore.case) tolower else c
  newcn <- cn <- colnames(df)
  newnames <- sapply(namemap, `[`, 1L)
  matches <- sapply(namemap, function(nmap) {
    apply(outer(downcase(nmap[-1]), downcase(cn), Vectorize(compare)), 2, any)
  }) # dims: 1=cn; 2=map-to
  # if a rule matches more than one column, warn and disable that rule
  for (j in seq_len(ncol(matches))) {
    if (sum(matches[,j]) > 1) {
      warning("rule ", sQuote(newnames[j]), " matches multiple columns: ",
              paste(sQuote(cn[ matches[,j] ]), collapse=","))
      matches[,j] <- FALSE
    }
  }
  # if a column matches exactly one rule, rename it; if several, warn and skip
  for (i in seq_len(nrow(matches))) {
    rowmatches <- sum(matches[i,])
    if (rowmatches == 1) {
      newcn[i] <- newnames[ matches[i,] ]
    } else if (rowmatches > 1) {
      warning("column ", sQuote(cn[i]), " matches multiple rules: ",
              paste(sQuote(newnames[ matches[i,]]), collapse=","))
      matches[i,] <- FALSE
    }
  }
  if (any(matches)) colnames(df) <- newcn
  df
}
(You might extend it to ensure unique-ness, using make.names and/or make.unique. There's also ignore.case, not really tested here but easily done, I believe.)
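For instance, a minimal sketch of that uniqueness guard (fixfunc_unique is a hypothetical wrapper, not part of the function above):
fixfunc_unique <- function(df, namemap, ...) {
  out <- fixfunc(df, namemap, ...)
  # enforce syntactically valid, unique column names after remapping
  colnames(out) <- make.unique(make.names(colnames(out)))
  out
}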
I'm going to extend your sample data by including one that will match multiple patterns resulting in ambiguity:
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
z <- data.frame('cc' = 1:3, 'ccc' = 2:4)
dfs <- list(x,y,z)
where the third data.frame has two columns that match my third non-pattern vector. When there are multiple matches, I think the safer thing to do is warn about it and change none of them.
This is correct, fixed-strings only:
lapply(dfs, fixfunc, colnamemap, fixed=TRUE)
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
This incorrectly uses the strings as patterns, which causes one of them to warn about multiple matches:
lapply(dfs, fixfunc, colnamemap, fixed=FALSE)
# Warning in FUN(X[[i]], ...) :
# rule 'D' matches multiple columns: 'cc','ccc'
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
A better use of fixed=FALSE, with strict patterns instead:
lapply(dfs, fixfunc, colnamemapptn, fixed=FALSE)
# same output as the first call

Table scraped from a web page is read as a single character vector: how to convert into a dataframe?

I have scraped a large table from a web page using the rvest package, but it was read in as a single character vector:
foo<-c("A","B","C","Dog","1","2","3","Cat","4","5","6","Goat","7","8","9")
that I need to deal with as a dataframe that looks like this:
bar<-as.data.frame(cbind(Animal=c("Dog","Cat","Goat"),A=c(1,4,7),B=c(2,5,8),C=c(3,6,9)))
This might be a simple dilemma but I'd appreciate the help.
You can create a matrix from your vector and turn it into a data frame:
foo<-c("A","B","C","Dog","1","2","3","Cat","4","5","6","Goat","7","8","9")
foo <- c("Animal" , foo)
m <- matrix(foo , ncol = 4 , byrow = TRUE)
df <- as.data.frame(m[-1,] , stringsAsFactors = FALSE)
colnames(df) <- m[1,]
# I assume you want numerics for your A,B,C columns:
df[,2:4]<-apply(df[,2:4],2,as.numeric)
lapply(df,class)
$Animal
[1] "character"
$A
[1] "numeric"
$B
[1] "numeric"
$C
[1] "numeric"
Just split it into the required number of rows and rbind it. I added "Animal" at the start of foo to make the number of elements in each row equal when splitting:
foo = c("Animal", foo)
df = data.frame(do.call(rbind, split(foo, ceiling(seq_along(foo)/4))),
stringsAsFactors = FALSE)
colnames(df) = df[1,]
df = df[-1,]
df
# Animal A B C
#2 Dog 1 2 3
#3 Cat 4 5 6
#4 Goat 7 8 9
If you want the proper column types, you can try this. Split into a list, name the list, then convert the column types before coercing to data frame.
l <- setNames(split(tail(foo, -3), rep(1:4, 3)), c("Animal", foo[1:3]))
as.data.frame(lapply(l, type.convert)) ## stringsAsFactors=FALSE if desired
# Animal A B C
# 1 Dog 1 2 3
# 2 Cat 4 5 6
# 3 Goat 7 8 9
Here is a convenient tool for working with lists:
seqList <- function(character, by = 1, res = list()) {
  ### split `character` into consecutive chunks of length `by`
  if (length(character) == 0) {
    res
  } else {
    seqList(character[-c(1:by)], by = by, res = c(res, list(character[1:by])))
  }
}
Once you convert your character vector into a list it's easier to manipulate; for instance, you can do:
options(stringsAsFactors=FALSE)
foo <-c("A","B","C","Dog","1","2","3","Cat","4","5","6","Goat","7","8","9")
foo <- c("Animal",foo)
df <- data.frame(t(do.call("rbind",
lapply(1:4,function(x) do.call("cbind",lapply(seqList(foo,4),"[[",x))))))
colnames(df) <- df[1,]
df <- df[-1,]
## > df
## Animal A B C
## 2 Dog 1 2 3
## 3 Cat 4 5 6
## 4 Goat 7 8 9
Note:
I haven't tested the efficiency of the function; it might not be very efficient for a large number of elements.
Matrices might be a better tool for this job.

How to access columns with same names of all dataframes in a list?

As was suggested somewhere on Stack Overflow, to store a number of dataframes, I put them in a list. Now, how can I access particular columns - all with the same name - of all dataframes in that list (to find a maximum)?
list[1:25]["colname"]
gives NULL and
list[[1:25]]["colname"]
gives
"Error in list[[1:25]] : recursive indexing failed at level 3",
although I can get one column with
list[[1]]["colname"]
I also tried c(), but that didn't work.
I have tried several searches, but couldn't find anything relevant. I'm not really a programmer, I just need this for research. I'm learning R (with RStudio) on the fly (I have read some tutorials), so it might be that I just don't know the right words to search for.
If x is your list and Sepal.Length is the column of which you want to take the maximum over all datasets in your list:
x <- rep(list(iris),25)
max(unlist(lapply(x,function(df) max(df$Sepal.Length))))
If you want the maximum for every dataset in the list:
lapply(x,function(df) max(df$Sepal.Length))
Here's one possible way using sapply along with [[.
data <- list(data.frame(a = 1:3, b = 4:6),
data.frame(a = 10:15, b = 40:45))
sapply(data, "[[", "b")
[[1]]
[1] 4 5 6
[[2]]
[1] 40 41 42 43 44 45
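Building on that, if the end goal is the overall maximum of that column across the whole list, you can wrap the result (a small sketch using the data list above):
max(unlist(sapply(data, "[[", "b")))
# [1] 45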
One of the apply family functions will do the trick. Using an anonymous function in this case to operate on each data.frame in the list to find the max value of the first column...
# Make some reproducible data
set.seed(1)
ll <- replicate( 3 , as.data.frame( matrix( sample( 9 ) , 3 ) ) , simplify = FALSE )
#[[1]]
# V1 V2 V3
#1 1 4 3
#2 9 8 5
#3 7 2 6
#[[2]]
# V1 V2 V3
#1 4 8 5
#2 7 2 3
#3 6 9 1
#[[3]]
# V1 V2 V3
#1 6 4 2
#2 3 9 5
#3 1 7 8
# Get max value of each of the first columns - could use a quoted column name here
lapply( ll , function(x) max( x[ , 1 ] ) )
#[[1]]
#[1] 9
#[[2]]
#[1] 7
#[[3]]
#[1] 6
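And if a single overall maximum is wanted rather than one per data frame, the per-frame maxima can be wrapped once more (sketch, using the ll list above):
max(sapply(ll, function(x) max(x[, 1])))
# [1] 9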
