Preprocessing: text analysis on many columns from a dataframe


Using the following lines it is possible to preprocess text in a specific column of my dataframe:
#text to lower case
df$name <- tolower(df$name)
#remove all special characters
df$name <- gsub("[[:punct:]]", " ", df$name)
#remove long spaces
df$name <- gsub("\\s+"," ",str_trim(df$name))
I would like to apply these preprocessing rules to all columns (except id) of a dataframe like this:
df <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"), E = c("text","stg","1.2"), F = c("press","remove","22"))

If you want to do something multiple times, it is often useful to define a function.
For example, you could do the following:
library(stringr)
df <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"),
E = c("text","stg","1.2"), F = c("press","remove","22"))
# create a function so we can apply this multiple times easily.
process <- function(my_vector)
{
my_vector <- tolower(my_vector)
#remove all special characters
my_vector <- gsub("[[:punct:]]", " ", my_vector)
#remove long spaces
my_vector <- gsub("\\s+"," ",str_trim(my_vector))
# return result
return(my_vector)
}
# for all columns except 'id', apply our function.
for(x in setdiff(colnames(df),"id"))
{
df[[x]]=process(df[[x]])
}
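The same loop can also be written without an explicit for by assigning the result of lapply() back to the selected columns. This is a minimal base-R sketch, assuming the same df as above; clean_text and text_cols are illustrative names that do not appear in the original answer, and trimws() stands in for str_trim() so that no package is needed:

```r
df <- data.frame(id = c("A", "B", "C"),
                 D = c("mytext 11", "mytext +", "!!"),
                 E = c("text", "stg", "1.2"),
                 F = c("press", "remove", "22"),
                 stringsAsFactors = FALSE)

# clean_text is a hypothetical helper name, not from the original answer
clean_text <- function(x) {
  x <- tolower(x)                    # lower case
  x <- gsub("[[:punct:]]", " ", x)   # each punctuation character -> space
  gsub("\\s+", " ", trimws(x))       # trim the ends, collapse runs of whitespace
}

# every column except 'id'
text_cols <- setdiff(names(df), "id")
df[text_cols] <- lapply(df[text_cols], clean_text)
df
```

Assigning into df[text_cols] keeps the result a data frame and leaves the id column untouched.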

You can use dplyr::mutate_at() to mutate multiple columns; in this case, all columns except for id:
library(dplyr)
library(stringr)
df %>%
  mutate_at(.vars = vars(-id),
            .funs = processText)
Where processText is a function containing your desired code (define it before running the call above):
processText <- function(str) {
  str %>%
    str_to_lower() %>%
    # punctuation, whitespace and "+" each become a single space
    str_replace_all(pattern = "[[[:punct:]]]|[\\s+]", replacement = " ") %>%
    # trim the ends and collapse repeated spaces
    str_squish()
}
The output is as follows:
  id         D    E      F
1  A mytext 11 text  press
2  B    mytext  stg remove
3  C            1 2     22

Related

How to insert a part of a value of a column into a new column

I have not been programming for that long and have now encountered a problem to which I have not yet been able to find a solution.
In my dataframe there is a column that contains several pieces of information. For example, one row looks like this:
sp|O94910|AGRL1_HUMAN
or like this
sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN
Now I want to create a new column with the combination of digits between the two vertical bars.
For the upper example it would be O94910, for the lower Q13554; Q13555
I have already tried functions like str_extract_all, str_match or gsub. But nothing worked.
The "id" is the column I look at. It includes different combinations of digits. I need the one between the two |
> dput(head(anaDiff_PD_vs_CTRL$id, 10))
c("sp|O94910|AGRL1_HUMAN", "sp|P02763|A1AG1_HUMAN", "sp|P19652|A1AG2_HUMAN",
"sp|P25311|ZA2G_HUMAN", "sp|Q8NFZ8|CADM4_HUMAN", "sp|P08174|DAF_HUMAN",
"sp|Q15262|PTPRK_HUMAN", "sp|P78324|SHPS1_HUMAN;sp|Q5TFQ8|SIRBL_HUMAN;sp|Q9P1W8|SIRPG_HUMAN",
"sp|Q8N3J6|CADM2_HUMAN", "sp|P19021|AMD_HUMAN")

With dplyr and stringr you can try...
library(dplyr)
library(stringr)
dat %>%
rowwise() %>%
mutate(dig = str_extract_all(col, "(?<=sp\\|)[A-Z0-9]+(?=\\|)"),
dig = paste0(dig, collapse = "; "))
#> # A tibble: 4 x 2
#> # Rowwise:
#>   col                                         dig
#>   <chr>                                       <chr>
#> 1 sp|Q8NFZ8|CADM4_HUMAN                       Q8NFZ8
#> 2 sp|94910|AGRL1_HUMAN                        94910
#> 3 sp|O94910|AGRL1_HUMAN                       O94910
#> 4 sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN Q13554; Q13555
data
dat <- data.frame(col = c("sp|Q8NFZ8|CADM4_HUMAN", "sp|94910|AGRL1_HUMAN", "sp|O94910|AGRL1_HUMAN", "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"))
Created on 2022-02-02 by the reprex package (v2.0.1)

Here is a solution without tidyverse:
dat <- read.table(text = "
sp|Q8NFZ8|CADM4_HUMAN
sp|94910|AGRL1_HUMAN
sp|O94910|AGRL1_HUMAN
sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN")
ids <- strsplit(dat$V1, ";")
ids <- lapply(ids, function(x) gsub("sp\\|([[:alnum:]]*)\\|.*", "\\1", x))
ids <- lapply(ids, function(x) paste(x, collapse="; "))
dat$newcol <- unlist(ids)
Even with tidyverse, I would define a helper function for more clarity:
library(dplyr)
library(purrr)
extract_ids <- function(x) {
ids <- strsplit(x, ";")
ids <- map(ids, ~ gsub("sp\\|([[:alnum:]]*)\\|.*", "\\1", .))
ids <- map(ids, ~ paste(., collapse="; "))
unlist(ids)
}
dat <- dat %>% mutate(ids = extract_ids(V1))
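The extraction can also be done in base R by matching the identifiers directly instead of splitting and substituting. This is a minimal sketch, assuming each entry has the form sp|ACCESSION|NAME and that entries are joined by ";"; the names ids, hits and accession are illustrative only:

```r
ids <- c("sp|O94910|AGRL1_HUMAN",
         "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN")

# find every run of letters/digits that sits between "sp|" and the next "|"
# (lookbehind and lookahead need perl = TRUE)
hits <- regmatches(ids, gregexpr("(?<=sp\\|)[[:alnum:]]+(?=\\|)", ids, perl = TRUE))

# one string per input element; multiple accessions are joined by "; "
accession <- vapply(hits, paste, character(1), collapse = "; ")
accession
```

The result is a plain character vector of the same length as the input, so it can be assigned as a new column, for example dat$newcol <- accession.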

This solution should help if you want to change your column names in a similar fashion:
library(tidyverse)
# create test data frame with column names "sp|O94910|AGRL1_HUMAN" and "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"
col1 <- c(1,2,3,4,5)
col2 <- c(6,7,8,9,10)
df <- data.frame(col1, col2)
names(df)[1] <- "sp|O94910|AGRL1_HUMAN"
names(df)[2] <- "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"
names <- as.data.frame((str_split(colnames(df), "\\|", simplify = TRUE))) # split the column names at each "|" into a data frame with one piece per column
# keep only the pieces that contain more digits than letters or special characters; blank out the rest
for(i in 1:nrow(names)) {
for(j in 1:ncol(names)){
if ( (str_count(as.vector(str_split(names[i,j], "\\|", simplify = TRUE)), "[0-9]") >
str_count(as.vector(str_split(names[i,j], "\\|", simplify = TRUE)), "[:alpha:]|[:punct:]") )){
names[i,j] <- names[i,j]
} else {
names[i,j] <- ""
}
}
}
# combine the columns into a single column called "colnames"
names <- names %>% unite("colnames", 1:5, na.rm = TRUE, remove = TRUE, sep = ";")
# remove all ";" separators at the start and end of the strings, and collapse runs of ";" into a single ";"
for (i in 1:nrow(names)){
names[i,] <- str_replace(names[i,],"\\;+$", "") %>%
str_replace("^\\;+", "") %>%
str_replace_all("\\;+", ";")
}
# convert column with new names into a vector
new_names <- as.vector(names$colnames)
# replace old names with new names
names(df) <- new_names

How do I get the column number from a dataframe which contains specific strings?

I have a data frame df with 7 columns and a character vector z containing multiple strings.
I want a data frame containing only the columns of df whose names contain one of the strings from z.
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
How do I get the column numbers of the columns in df that match the z strings? Or how do I get a data frame with only the columns that contain the z strings?
What I want is:
print(df)
"a_means" "c_m" "f_m"
What I tried:
match(a, names(df)
and
df[,which(colnames(df) %in% colnames(df[ ,grepl(z,names(df)])]

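You can use the following, assuming the column names themselves are a_means, b_means and so on, so that their first three characters can be compared with z: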
df[,match(z, substring(colnames(df), 1, 3))]

With base R:
z <- paste(z, collapse = "|")
df[, grepl(z, names(df))] # you could use grep as well
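If the column numbers themselves are needed, grep() returns them directly. This is a minimal sketch, assuming a data frame whose column names are a_means to g_means; pattern, col_idx and subset_df are illustrative names, and the combined pattern is stored separately so that z is not overwritten:

```r
df <- data.frame(a_means = 1, b_means = 2, c_means = 3, d_means = 4,
                 e_means = 5, f_means = 6, g_means = 7)
z <- c("a_m", "c_m", "f_m")

# one regular expression that matches any of the strings in z
pattern <- paste(z, collapse = "|")

# positions of the matching columns
col_idx <- grep(pattern, names(df))
col_idx

# the corresponding subset; drop = FALSE keeps a data frame even for one match
subset_df <- df[, col_idx, drop = FALSE]
names(subset_df)
```

Because the strings in z are used as a regular expression here, any special characters in them would need escaping first.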

Combine the search patterns and use that as a pattern for stringr::str_detect() function.
library(dplyr)
library(stringr)
df <- data.frame(a_means = "a_means",
b_means = "b_means",
c_means = "c_means",
d_means = "d_means",
e_means = "e_means",
f_means = "f_means",
g_means = "g_means"
)
z <- c("a_m","c_m","f_m")
z <- paste(z, collapse = "|")
df %>% select_if(str_detect(names(df), z))
#> a_means c_means f_means
#> 1 a_means c_means f_means

You can simply do this:
library(dplyr)
df %>%
select(contains(z))
Check out help("starts_with"). You can also match to a starting prefix with starts_with() among other things.

You can use select() and matches() to subset the columns based on z:
library(dplyr)
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
df %>%
select(matches(z))
#> X.a_means. X.c_means. X.f_means.
#> 1 a_means c_means f_means

How to create a function to output how many characters in each row of a vector are contained in another vector?

I have a dataframe filled with words and their properties, and one of the columns is titled "Spelling". This has the spelling of every word, with each character separated by a space, like so:
P L A T E
M A T
R O C K
I have a separate vector of characters with a subset of letters which have been deemed important for the analysis. Like this:
important_letters <- c("P","M","E")
I need to write a function that, for every row in the dataframe, counts how many characters in the word are contained in important letters and creates a new column in the dataframe with this number. In this example, the new column would contain a 2 for plate, 1 for mat, and 0 for rock.
I have been trying to figure this out and I know the return line of the function should be something like this:
return(sum(a %in% important_letters))
Any help would be greatly appreciated.

Try this:
library(dplyr)
# your objects
Spelling <- c('P L A T E', 'M A T', 'R O C K')
important_letters <- c("P","M","E")
df <- data.frame(Spelling, stringsAsFactors = FALSE)
# first, create a new variable (field) in dataframe
df$important_letters_count <- NA
# this is the function
count_important <- function(x) {
for (i in 1:nrow(x)) {
x$important_letters_count[i] <-
sum(strsplit(x$Spelling[i], " ")[[1]] %in% important_letters)
}
x
}
# call the functions this way
df <- count_important(df)

Another option:
sum_letters <- function(words, imp_letters) {
sapply(words,
function(x) sum(unlist(strsplit(x, split = " ")) %in% imp_letters, na.rm = TRUE)
)
}
Can be called like:
df$sum_letters <- sum_letters(df$words, important_letters)
Output:
df
      words sum_letters
1 P L A T E           2
2     M A T           1
3   R O C K           0

Use purrr and stringr. The map variant map_dbl() iterates over your character vector, applies your custom function to each element, and returns a numeric vector. Use str_count() to count the matches of each important letter.
library(stringr)
library(purrr)
df <- data.frame(
Spelling = c("P L A T E", "M A T", "R O C K")
)
important_letters <- c("P", "M", "E")
map_dbl(df$Spelling, ~ sum(str_count(.x, important_letters)))

Try this:
myfunc <- function(v, imp) {
ptn <- paste0(c("[^", imp, "]"), collapse = "")
nchar(gsub(ptn, "", v))
}
# sample data
vec <- c("P L A T E", "M A T", "R O C K")
important_letters <- c("P","M","E")
myfunc(vec, important_letters)
# [1] 2 1 0
Because myfunc operates on a whole vector at once, it can be applied directly to a data frame column.
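As a minimal sketch of that last point, the function can be used to add the count as a new column. The column name n_important is an assumption made for illustration. Note that the character-class approach assumes every important letter is a single character with no special meaning inside square brackets (such as "^", "]" or "-"):

```r
myfunc <- function(v, imp) {
  # build a negated character class such as "[^PME]"
  ptn <- paste0(c("[^", imp, "]"), collapse = "")
  # delete everything that is not an important letter, then count what is left
  nchar(gsub(ptn, "", v))
}

df <- data.frame(Spelling = c("P L A T E", "M A T", "R O C K"),
                 stringsAsFactors = FALSE)
important_letters <- c("P", "M", "E")

# n_important is a hypothetical column name
df$n_important <- myfunc(df$Spelling, important_letters)
df
```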

Multiple list objects as a character variable in df. How to convert into original df with R?

I have a problem with my dataset: it contains character variables that are actually lists of values, which I want to convert into a dataframe. The original dataframe consists of several thousand rows.
I would like to split these strings into list objects in order to convert the lists into a dataframe (long format), but I lack some skills with list objects and with splitting characters.
Reproducible example:
id <- c("112")
name <- c( "{\"dog\", \"cat\",\"attashee\"}")
value <- c("{\"21000\", \"23400\", \"26800\"}")
test <- data.frame(id, name, value)
test
I would like have an outcome like this:
id <- c("112","112","112")
name <- c( "dog", "cat","attashee")
value <- c("21000", "23400", "26800")
test1 <- data.frame(id, name, value)
test1
I suppose, I need to start by erasing first and the last characters { and }:
test$name <- gsub("{", "", test$name, fixed=TRUE)
test$name <- gsub("}", "", test$name, fixed=TRUE)
I have tried to use these string-split-into-list-r, convert-a-list-formatted-as-string-in-a-list and convert-a-character-variable-to-a-list-of-list,
test$name <- strsplit(test$name, ',')[[1]]
but I get an error message (when I try this on the first row of my original data): "replacement has 91 rows, data has 1".
The fact is, I'm pretty lost here, as I would need to convert the name and value columns simultaneously (and I don't know how to convert even one column).
Any help and advice is more than appreciated.

This should work:
id <- c("112")
name <- c( "{\"dog\", \"cat\",\"attashee\"}")
value <- c("{\"21000\", \"23400\", \"26800\"}")
test <- data.frame(id, name, value)
test
name <- gsub("\\{", '', name)
name <- gsub("\\}", '', name)
name <- gsub('"', '', name)
name <- gsub('\\s+', '', name)
name <- strsplit(name, ',')[[1]]
value <- gsub("\\{", '', value)
value <- gsub("\\}", '', value)
value <- gsub('"', '', value)
value <- gsub('\\s+', '', value)
value <- strsplit(value, ',')[[1]]
# repeat the id once per extracted element (done after splitting, so the length is correct)
id <- rep(test$id, length(name))
test1 <- data.frame(id, name, value)
test1

I would parse and evaluate:
id <- c("112", "113")
name <- c( "{\"dog\", \"cat\",\"attashee\"}", "{\"dog\", \"cat\",\"attashee\"}")
value <- c("{\"21000\", \"23400\", \"26800\"}", "{\"21001\", \"23401\", \"26801\"}")
test <- data.frame(id, name, value)
clean_parse_eval <- function(x) {
eval(parse(text = gsub("\\}", ")", gsub("\\{", "c\\(", x))))
}
We then need a split-apply-combine approach to do this for every row. Of course, this isn't very fast.
library(data.table)
setDT(test)
test[, lapply(.SD, clean_parse_eval), by = id]
# id name value
#1: 112 dog 21000
#2: 112 cat 23400
#3: 112 attashee 26800
#4: 113 dog 21001
#5: 113 cat 23401
#6: 113 attashee 26801
Obviously, it would be better to avoid producing such malformed data to begin with.
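If evaluating strings as code is undesirable, the same long format can be reached with plain string splitting. This is a minimal base-R sketch, assuming that within each row name and value hold the same number of elements; split_braced, names_list, values_list and long are illustrative names that do not come from the answers above:

```r
test <- data.frame(
  id = c("112", "113"),
  name = c("{\"dog\", \"cat\",\"attashee\"}", "{\"dog\", \"cat\",\"attashee\"}"),
  value = c("{\"21000\", \"23400\", \"26800\"}", "{\"21001\", \"23401\", \"26801\"}"),
  stringsAsFactors = FALSE
)

# hypothetical helper: strip braces and quotes, then split on commas
split_braced <- function(x) {
  x <- gsub("[{}\"]", "", x)
  strsplit(x, "\\s*,\\s*")
}

names_list <- split_braced(test$name)
values_list <- split_braced(test$value)

# repeat each id once per element found in its row
long <- data.frame(
  id = rep(test$id, lengths(names_list)),
  name = unlist(names_list),
  value = unlist(values_list),
  stringsAsFactors = FALSE
)
long
```

Because strsplit() and lengths() are vectorised, this handles all rows in one pass rather than row by row.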

This will work for one column. I tried it on your name object and it does the job:
library(stringr)
name <- c( "{\"dog\", \"cat\",\"attashee\"}")
# strip the double quotes
result <- str_replace_all(name, "\"", "")
# drop the surrounding braces
result <- gsub('\\{', '', result)
result <- gsub('\\}', '', result)
# split on commas, then remove the leftover spaces
result <- strsplit(result, split = ',', fixed = TRUE)[[1]]
result <- gsub(" +", "", result)
str(result)
#chr [1:3] "dog" "cat" "attashee"

This can do the job without any libraries; the comments in the code explain each step:
id <- c("112")
name <- c( "{\"dog\", \"cat\",\"attashee\"}")
value <- c("{\"21000\", \"23400\", \"26800\"}")
convert <- function(col, isFloat = FALSE) {
# remove the two {} characters
col <- gsub("{", "", gsub("}", "", col, fixed=TRUE), fixed=TRUE)
# create a vector with the right content
ans <- eval(parse(text = paste0('c(', col, ')')))
if (isFloat)
unlist(lapply(ans, as.numeric))
else
ans
}
test <- data.frame(convert(id), convert(name), convert(value, TRUE))
# optionally you can fix the names here
names(test) <- c('id', 'name', 'value')
# final result
test
Notice that you can use the convert function to add more columns to your dataframe if you need to.
If your name variable has, for example, 1000 more entries, then do name <- paste0(name, collapse = "") to put them all into a single string, and replace each }{ with a , by doing name <- gsub("}{", ",", name, fixed=TRUE). You are then back in the original case, and the same logic applies.

assign dynamic variable names using a loop and mutate from dplyr

I would like to split a character field into individual variables, one for each character in a string.
library(dplyr)
temp1 <- data.frame(a = c('dedefdewfe' , 'rewewqreqw'))
for(i in 1:10){
temp1 <- temp1 %>%
mutate(paste('v' , i , ,sep = '') = substr(a , i , i))
}
The resulting dataframe would have 11 variables: the original a, plus v1 through v10.

tidyr::separate is good for this. You can't split on an empty string, but you can specify splitting positions ...
library(tidyr)
library(dplyr)
temp1 %>%
mutate(b=a) %>% ## make a copy
separate(b,into=paste0("v",1:10),sep=1:9)
(probably better practice to use nc <- nchar(temp1$a[1]) and then use nc, nc-1 instead of 10, 9 respectively)
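The loop from the question can also be made to work in base R, because `[[<-` accepts a computed column name where the left-hand side of `=` inside mutate() does not. This is a minimal sketch, assuming all strings have the same length; n_chars is an illustrative name:

```r
temp1 <- data.frame(a = c("dedefdewfe", "rewewqreqw"),
                    stringsAsFactors = FALSE)

# number of characters to split out (10 for this sample data)
n_chars <- max(nchar(temp1$a))

for (i in seq_len(n_chars)) {
  # build the name "v1", "v2", ... and store the i-th character of every string
  temp1[[paste0("v", i)]] <- substr(temp1$a, i, i)
}
names(temp1)
```

Within dplyr, the equivalent is to unquote the name and use the := operator, as in mutate(!!paste0("v", i) := substr(a, i, i)).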
