Changing multiple markup errors over multiple columns in R


I am stuck on a piece of code that should fix multiple text markup errors, e.g. "Ã¯Â¿Â½nt", which should be "ent", "Ã¯Â¿Â½de", which should be "ide", or "Ã¯n", which should be "in" (without quotation marks), across 47 columns in total.
A quick example of the data frame:
x <- data.frame(Name = c("PatÃ¯Â¿Â½nt", "PatÃ¯Â¿Â½nt"), Type = c("Ã¯Â¿Â½de", "Ã¯Â¿Â½de"), Role = c("Ã¯n", "Ã¯n"), stringsAsFactors = FALSE)
$x
Name Type Role
PatÃ¯Â¿Â½nt Ã¯Â¿Â½de Ã¯n
PatÃ¯Â¿Â½nt Ã¯Â¿Â½de Ã¯n
(I'm not sure the code is correct, but you get the idea of what my data frame looks like.)
Following other posted solutions, especially the one posted here, I ended up with quite a few lines of code. An excerpt:
x <- data.frame(lapply(x, function(y){ gsub("Ã¯Â¿Â½ne", "ine", y)}))
So for every markup error there is a line of code like the one above that fixes it. However, with the lapply/apply family my original column classes are changed to factors. And with the stringr package I cannot get this done, since it keeps collapsing everything into a single vector. An excerpt:
x <- str_replace(string = x, pattern="Ã¯Â¿Â½nt", replacement = "ent")
So my question here is: is there another way to replace multiple strings over multiple columns, meanwhile maintaining the original dataframe classes?
EDIT
I slightly edited the code from Calum You to:
mutate_at(.vars = vars(everything()),
.funs = ~ str_replace_all(., pattern = "Ã¯Â¿Â½nt", replacement = "")) %>%
Such that it replaces all instances within a row, instead of the first.
Next, by using the snippet below before the functions of mutate_at, the code now iterates only over character vectors instead of all vectors, i.e. numeric/factor/date etc.
df2 <- df1[, sapply(df1, class) == 'character'] %>%

So you just want to apply your replace function to every column, because the markup errors can be in any column? Try this approach, which uses mutate_at to apply your function to as many columns as you like (here to all of them)
library(tidyverse)
df <- tibble(
Name = c("PatÃ¯Â¿Â½nt", "PatÃ¯Â¿Â½nt"),
Type = c("Ã¯Â¿Â½de", "Ã¯Â¿Â½de"),
Role = c("Ã¯n", "Ã¯n")
)
df %>%
mutate_at(
.vars = vars(everything()),
.funs = ~ str_replace(., pattern = "Ã¯Â¿Â½nt", replacement = "ent")
) %>%
mutate_at(
.vars = vars(everything()),
.funs = ~ str_replace(., pattern = "Ã¯Â¿Â½de", replacement = "ide")
) %>%
mutate_at(
.vars = vars(everything()),
.funs = ~ str_replace(., pattern = "Ã¯n", replacement = "in")
)
# A tibble: 2 x 3
Name Type Role
<chr> <chr> <chr>
1 Patent ide in
2 Patent ide in
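If the list of fixes keeps growing, one call per pattern gets long. A minimal base R sketch, assuming the fixes are kept in a named character vector (the name fixes and the extra Count column are made up for illustration), loops over the character columns only, so the other column classes should be left untouched:

```r
# Hypothetical lookup table: names are the broken strings, values the fixes.
fixes <- c("Ã¯Â¿Â½nt" = "ent", "Ã¯Â¿Â½de" = "ide", "Ã¯n" = "in")

# Example data; Count is an illustrative non-character column.
x <- data.frame(
  Name  = c("PatÃ¯Â¿Â½nt", "PatÃ¯Â¿Â½nt"),
  Type  = c("Ã¯Â¿Â½de", "Ã¯Â¿Â½de"),
  Role  = c("Ã¯n", "Ã¯n"),
  Count = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Only touch character columns so other classes are preserved.
is_chr <- vapply(x, is.character, logical(1))
x[is_chr] <- lapply(x[is_chr], function(col) {
  for (bad in names(fixes)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex.
    col <- gsub(bad, fixes[[bad]], col, fixed = TRUE)
  }
  col
})

str(x)
```

The replacements are applied in the order they appear in fixes, which matters if one broken string is a substring of another.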

Related

Using purrr::map to rename columns based on another list in R

I have multiple files and I want to rename the second column of each file with a name coming from the
samples=c("sample1","sample2") vector. As I am learning the purrr::map functions, I am struggling to do the renaming inside map.
Here is an example. Any help is greatly appreciated.
library(purrr)
library(data.table)
library(dplyr)
files <- paste0("file", 1:3, ".txt")
## Create example files in a temp dir
temp <- tempdir()
walk(files, ~ write.csv(iris[1:2], file.path(temp, .x), row.names = FALSE))
files |>
map(~ fread(file.path(temp, .x)) %>% rename(test = 1, samples=2))
Of course, this does not work, but this is where I am so far.

This is one way to do it:
We use map2() to loop over both files and samples. For each file we first read in the data with fread(file.path(temp, .x)) and then pipe it into rename(., test = 1, !! sym(.y) := 2).
samples contains strings. We need to make the strings into object names with sym (or alternatively as.name()) and evaluate them with !!. If we use this kind of syntax on the lefthand side we also need the walrus operator := instead of =.
samples=c("sample1","sample2", "sample3")
files |>
map2(samples, ~ fread(file.path(temp, .x)) %>% rename(., test = 1, !! sym(.y) := 2))
If you want to rename a different column in every data.frame, it's better to construct a list of lists as below and splice each sublist into rename() with !!!. (The example below just uses the second column, but we could change that to any column number we want.)
samples = list(
list("sample1" = 2),
list("sample2" = 2),
list("sample3" = 2)
)
files |>
map2(samples, ~ fread(file.path(temp, .x)) %>% rename(., test = 1, !!! .y))
Since you are using data.table to read in the data, we don't need dplyr::rename() to rename the columns. Renaming the second column of every file in particular is easier with data.table::setnames():
samples = c("sample1", "sample2","sample3")
files |>
map2(samples, ~ fread(file.path(temp, .x)) %>% setnames(., 2, .y))
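For comparison, the same pairing of files and names can be sketched in base R with Map(), which walks over two vectors in parallel much like map2(). This is only an illustrative alternative, assuming read.csv() in place of fread(); the result name renamed is made up here.

```r
files <- paste0("file", 1:3, ".txt")
samples <- c("sample1", "sample2", "sample3")

# Create example files in a temp dir, as in the question.
temp <- tempdir()
for (f in files) {
  write.csv(iris[1:2], file.path(temp, f), row.names = FALSE)
}

# Map() calls the function once per (file, sample) pair.
renamed <- Map(function(f, s) {
  d <- read.csv(file.path(temp, f))
  names(d)[1:2] <- c("test", s)  # first column -> "test", second -> sample name
  d
}, files, samples)

lapply(renamed, names)
```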

Use dplyr to combine columns of data.frame when column names are not known

Given a tibble:
library(tibble)
myTibble <- tibble(a = letters[1:3], b = c(T, F, T), c = 1:3)
I can use transmute to paste the columns, separated by '.':
> library(dplyr)
> transmute(myTibble, concat = paste(a, b, c, sep = "."))
# A tibble: 3 x 1
concat
<chr>
1 a.TRUE.1
2 b.FALSE.2
3 c.TRUE.3
If I want to use the above transmute statement in a function that receives a tibble, I won't know the names of the tibble or the number of columns ahead of time. What dplyr syntax would allow me to paste all columns in a tibble separated by a '.'?
Please note, I can do this with something like:
> apply(myTibble, 1, paste, collapse = ".")
[1] "a.TRUE.1" "b.FALSE.2" "c.TRUE.3"
but I am trying to understand dplyr better. So, yes, this is a specific problem I am trying to solve, but I am also stumped as to why I can't solve it with dplyr, which means there is something key about dplyr column selection I don't yet understand, and I'd like to learn, so that is why I'm asking specifically about a dplyr solution.

With a little trial and error:
colNames_as_symbols <- syms(names(myTibble))
transmute(myTibble, concat = paste(!!!colNames_as_symbols, sep = '.'))
Here was the hint that put me on to the solution... From the documentation for !!!:
The big-bang operator !!! forces-splice a list of objects. The
elements of the list are spliced in place, meaning that they each
become one single argument.
vars <- syms(c("height", "mass"))
Force-splicing is equivalent to supplying the elements separately:
starwars %>% select(!!!vars)
starwars %>% select(height, mass)
In fact, the entire documentation entitled "Force parts of an expression" is fascinating reading. It can be accessed by issuing ?qq_show
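Splicing with !!! is similar in spirit to building a call with do.call() in base R, where each list element becomes one argument. A rough sketch of that analogy, not a dplyr solution (the helper name concat_all is hypothetical, and a plain data.frame stands in for the tibble):

```r
df <- data.frame(
  a = letters[1:3],
  b = c(TRUE, FALSE, TRUE),
  c = 1:3,
  stringsAsFactors = FALSE
)

# Paste all columns row-wise without knowing their names in advance.
concat_all <- function(d, sep = ".") {
  # unname() avoids a column called "sep" or "collapse" being
  # mistaken for an argument of paste().
  do.call(paste, c(unname(as.list(d)), sep = sep))
}

concat <- concat_all(df)
concat
```

Each column is passed to paste() as a separate argument, which is the same effect the spliced symbols have inside transmute().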

How to run a function (many times) that changes variable (tibble) in global env

I'm a newbie in R, so please have some patience. Tips are most welcome.
My goal is to create a tibble that holds a "Full Name" (of a person, who may have 2 to 4 names) and his/her gender. I must start from a tibble that contains typical male and female names.
Below I present a minimal working example.
My problem: I can call get_name() multiple times (in a for loop of 10,000 iterations!) and get the right answer, but I am looking for a more 'elegant' way of doing it. Unfortunately, replicate() returns a vector, which makes it unusable here.
My doubts: I know I have some issues, like the if statement that is evaluated on every call (which is redundant), but I can't find another way to do it. Any suggestions?
Any other suggestions about code structure are also welcome.
Thank you very much in advance for your help.
library(tibble)
library(dplyr)  # also provides the %>% pipe
# Dummy name list
unit_names <- tribble(
~Women, ~Man,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
set.seed(12345) # seed for test
# Create a tibble with the full names
full_name <- tibble("Full Name" = character(), "Gender" = character() )
get_name <- function() {
# Get the Number of 'Unit-names' to compose a 'Full-name'
nbr_names <- sample(2:4, 1, replace = TRUE)
# Randomize the Gender
gender <- sample(c("Women", "Man"), 1, replace = TRUE)
if (gender == "Women") {
lim_names <- sum( !is.na(unit_names$"Women"))
} else {
lim_names <- sum( !is.na(unit_names$"Man"))
}
# Sample the Fem/Man List names (may have duplicate)
sample(unlist(unit_names[1:lim_names, gender]), nbr_names, replace = TRUE) %>%
# Form a Full-name
paste ( . , collapse = " ") %>%
# Add it to the tibble (INCLUDE the Gender)
add_row(full_name, "Full Name" = . , "Gender" = gender)
}
# How can I make 10k of this?
full_name <- get_name()

If you pass a number larger than 1 to sample, this problem becomes easier to vectorise.
One thing that currently makes your problem much harder is the layout of your unit_names table: you are effectively treating male and female names as individually paired, but they clearly aren’t: hence they shouldn’t be in columns of the same table. Use a list of two vectors, for instance:
unit_names = list(
Women = c("fem1", "fem2", "fem3", "fem4", "fem5", "fem6", "fem7"),
Men = c("male1", "male2", "male3", "male4", "male5")
)
Then you can generate random names to your heart’s delight:
generate_names = function (n, unit_names) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
names = Map(sample, unit_names[genders], name_length, replace = TRUE) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
A note on style: unlike your function, the above doesn’t use any global variables. Furthermore, don’t "quote" variable names (you do this in unit_names$"Women" and for the arguments of add_row). R allows this, but it is arguably a mistake in the language specification: these are not strings, they’re variable names, and making them look like strings is misleading. You don’t quote your other variable names, after all. You do need to backtick-quote the `Full name` column name, since it contains a space. However, the use of backticks, rather than quotes, signifies that this is a variable name.
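To try the vectorised approach without loading any packages, here is a rough base R variant of the function above. It swaps Map() and the pipe for mapply() and returns a plain data.frame instead of a tibble; those substitutions are assumptions made for this sketch, not part of the answer.

```r
unit_names <- list(
  Women = paste0("fem", 1:7),
  Men   = paste0("male", 1:5)
)

generate_names <- function(n, unit_names) {
  # How many unit names each full name gets, and which gender list to use.
  name_length <- sample(2:4, n, replace = TRUE)
  genders <- sample(names(unit_names), n, replace = TRUE)
  # One draw per row: sample from the matching pool, then join with spaces.
  full <- mapply(
    function(pool, k) paste(sample(pool, k, replace = TRUE), collapse = " "),
    unit_names[genders], name_length,
    USE.NAMES = FALSE
  )
  data.frame(`Full name` = full, Gender = genders,
             check.names = FALSE, stringsAsFactors = FALSE)
}

set.seed(12345)
full_name <- generate_names(10000, unit_names)
head(full_name)
```

Because the function takes unit_names as an argument and returns a new table, nothing in the global environment has to be modified between calls.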

I am not 100% sure what you are trying to get, but if I got it right, did you try mutate from dplyr? For example:
result= mutate(data.frame,
concated_column = paste(column1, column2, column3, column4, sep = '_'))

With a LITTLE help from Konrad Rudolph, I arrived at the elegant (and vectorized ... and fast) solution that I was looking for. map2 does the necessary trick.
Here is the full working example if someone needs it:
(Just a side note: I kept the initial conversion from tibble to list because the data arrives to me as a tibble...)
Once again thanks to Konrad.
# Dummy name list
unit_names <- tribble(
~Women, ~Men,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
name_list <- list(
Women = unit_names$Women[!is.na(unit_names$Women)],
Men = unit_names$Men[!is.na(unit_names$Men)]
)
generate_names = function (n, name_list) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
#names = lapply(name_list[genders], sample, name_length) %>%
names = map2(name_list[genders], name_length, sample) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
full_name <- generate_names(10000, name_list)

Is it possible to add a third dummy variable using ifelse() in R?

I was using this code to create a new Group column based on partial strings found inside the column var for 2 groups, Sui and Swe. I had to add another group, TRD, and I've been trying to tweak the ifelse function to do this, but with no success. Is this doable? Are there any other solutions or other functions that might help me do this?
m.df <- molten.df%>% mutate(
Group = ifelse(str_detect(variable, "Sui"), "Sui", "Swedish"))
Current m.df:
var value
ADHD_iFullSuiTrim.Threshold1 0.00549427
ADHD_iFullSuiTrim.Threshold1 0.00513955
ADHD_iFullSweTrim.Threshold1 0.00466352
ADHD_iFullSweTrim.Threshold1 0.00491633
ADHD_iFullTRDTrim.Threshold1 0.00658535
ADHD_iFullTRDTrim.Threshold1 0.00609122
Desired Result:
var value Group
ADHD_iFullSuiTrim.Threshold1 0.00549427 Sui
ADHD_iFullSuiTrim.Threshold1 0.00513955 Sui
ADHD_iFullSweTrim.Threshold1 0.00466352 Swedish
ADHD_iFullSweTrim.Threshold1 0.00491633 Swedish
ADHD_iFullTRDTrim.Threshold1 0.00658535 TRD
ADHD_iFullTRDTrim.Threshold1 0.00609122 TRD
Any help or suggestion would be appreciated even if the result can be accomplished using other functions.

No ifelse() is needed. I'd use Group = str_extract(var, pattern = "(Sui)|(TRD)|(Swe)").
You could do fancier regex with a lookbehind for "iFull" and a lookahead for "Trim", but I can never remember how to do that.
A little more roundabout, but general if you want whatever is between "iFull" and "Trim" would be a replacement:
str_replace_all(var, pattern = "(.*iFull)|(Trim.*)", "")
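For reference, the lookbehind/lookahead version mentioned above could look roughly like this in base R. It is a sketch that assumes every value contains both "iFull" and "Trim"; the lookup vector labels is made up here to map the extracted code to the desired group names.

```r
var <- c("ADHD_iFullSuiTrim.Threshold1",
         "ADHD_iFullSweTrim.Threshold1",
         "ADHD_iFullTRDTrim.Threshold1")

# (?<=iFull) is a lookbehind, (?=Trim) a lookahead; perl = TRUE enables both.
# Note: regmatches() drops elements without a match, so this only lines up
# with var when every element matches.
code <- regmatches(var, regexpr("(?<=iFull).+?(?=Trim)", var, perl = TRUE))

# Hypothetical lookup from extracted code to the label wanted in Group.
labels <- c(Sui = "Sui", Swe = "Swedish", TRD = "TRD")
Group <- unname(labels[code])
Group
```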

Try to use multiple ifelse
library(dplyr)
library(stringr)
m.df <- molten.df %>%
mutate(Group = ifelse(str_detect(var, "Sui"), "Sui",
ifelse(str_detect(var, "Swe"), "Swedish", "TRD")))
Or case_when
m.df <- molten.df %>%
mutate(Group = case_when(
str_detect(var, "Sui") ~ "Sui",
str_detect(var, "Swe") ~ "Swedish",
TRUE ~ "TRD"
))
Data Preparation
molten.df <- read.table(text = "var value
'ADHD_iFullSuiTrim.Threshold1' 0.00549427
'ADHD_iFullSuiTrim.Threshold1' 0.00513955
'ADHD_iFullSweTrim.Threshold1' 0.00466352
'ADHD_iFullSweTrim.Threshold1' 0.00491633
'ADHD_iFullTRDTrim.Threshold1' 0.00658535
'ADHD_iFullTRDTrim.Threshold1' 0.00609122",
header = TRUE, stringsAsFactors = FALSE)

For future reference - provide all the necessary components for repeating the analysis e.g., packages and example data
# load ----
library(dplyr)
library(stringr)
# data ----
df=data.frame(var=c('ADHD_iFullSuiTrim.Threshold1',
'ADHD_iFullSuiTrim.Threshold1',
'ADHD_iFullSweTrim.Threshold1',
'ADHD_iFullSweTrim.Threshold1',
'ADHD_iFullTRDTrim.Threshold1',
'ADHD_iFullTRDTrim.Threshold1'),
value = c(0.00549427, 0.00513955, 0.00466352, 0.00491633, 0.00658535, 0.00609122))
df %>%
mutate(Group = case_when(str_detect(var, "Sui")~"Sui",
str_detect(var, "Swe")~"Swedish",
str_detect(var, "TRD")~"TRD"))

gather_ does not work. Shouldn't quoting and ~ing have the same effect in standard evaluation mode?

I have issues getting tidyr's gather to work in its standard evaluation version gather_ :
require(tidyr)
require(dplyr)
require(lazyeval)
df = data.frame(varName=c(1,2))
gather works:
df %>% gather(variable,value,varName)
but I'd like to be able to take the name varName from a variable in standard evaluation mode, and can't seem to get it right:
name='varName'
df %>% gather_("variable","value",interp(~v,v=name))
Error in match(x, y, 0L) : 'match' requires vector arguments
I'm also confused by the following.
This works as expected:
df %>% gather_("variable","value","varName")
The next line should be equivalent to last line (from my understanding of http://cran.r-project.org/web/packages/dplyr/vignettes/nse.html ), but doesn't work:
df %>% gather_(~variable,~value,~varName)
Error in match(x, y, 0L) : 'match' requires vector arguments

Looking at the source of tidyr:::gather_.data.frame, you can see that it is just a wrapper for reshape2::melt. As such, it only works for character or numeric arguments. Actually the following (which I would consider a bug) works:
df %>% gather_("variable", "value", 1)
As far as I can tell the nse vignette only refers to dplyr and not to tidyr.

Although this question has been answered, the following code could be used for defining keys and values for gathering purposes more generally in a function, using a vector of inputs for key and value:
data <- data.frame(a = runif(10), b = runif(10), c = runif(10))
Key <- "ColId"
Value <- "ColValue"
data %>% gather(key = KeyTmp, value = ValTmp) %>%
rename_(.dots = setNames("KeyTmp", Key) ) %>%
rename_(.dots = setNames("ValTmp", Value) )
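When the column, key and value names all arrive as strings, the same reshaping can also be sketched in base R, where selecting by string is ordinary indexing. This is an illustrative alternative using stack(), not what gather_ does internally; the variable names key and value are assumptions for the example.

```r
df <- data.frame(varName = c(1, 2))

name  <- "varName"   # column(s) to gather, given as strings
key   <- "variable"  # name for the new key column
value <- "value"     # name for the new value column

# stack() returns columns `values` and `ind` (a factor of source column names).
long <- stack(df[name])
long <- setNames(long[c("ind", "values")], c(key, value))
long[[key]] <- as.character(long[[key]])
long
```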
