Deriving a variable from a column name passed to a function - r

I've gotten hold of some really messy data and I wrote a function to do some conversions (string to numeric), and I would love to improve it. Basically the function takes a vector of messy character data and converts the data to numeric.
For example:
## say you had this
df1 <- data.frame ( V1 = c(" $25.25", "4,828", " $7,253"), V2 = c( "THIS is bad data", "725", "*error"))
numconv <- function(vec){
vec <- str_trim(vec)
vec <- gsub(",|\\$", "", vec)
if( sum(!grepl( "[0-9]",vec)) == 0){
vec <- as.numeric(vec)
}
if( sum(!grepl( "[0-9]",vec)) != 0){
print("!!ERROR STRANGE CHARACTERS!!")
}
}
df1$V1recode <- numconv(df1$V1)
df1$V2recode <- numconv(df1$V2)
[1] "!!ERROR STRANGE CHARACTERS!!"
How can I get the name of the original column inside the function, so I can paste it into the error message and have it read instead:
!!ERROR STRANGE CHARACTER IN V2!!
I've tried calling names() and colnames() within the function, but this doesn't seem to work.
Thanks in advance,
C

The old deparse(substitute(.)) trick seems to work.
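numconv <- function(vec){
nam <- deparse(substitute(vec))  # capture the caller's expression before vec is modified
vec <- gsub(" ","", vec)
vec <- gsub(",|\\$", "", vec)
if( sum(!grepl( "[0-9]",vec)) == 0){
vec <- as.numeric(vec)
}
if( sum(!grepl( "[0-9]",vec)) != 0){
print(paste("!!ERROR STRANGE CHARACTERS!!", nam) )
}
vec  # return the (possibly converted) vector so the assignment receives it
}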
df1$V2recode <- numconv(df1$V2)
# [1] "!!ERROR STRANGE CHARACTERS!! df1$V2"
(I didn't load stringr since I thought a gsub call would be more efficient.)
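If you only want the column name rather than the whole expression, one possible follow-up is to strip everything up to the last $ from the deparsed string. This is just a sketch: arg_label is a hypothetical helper name, not something from the question, and it assumes the column is always passed as df$col.

```r
# Hypothetical helper: return the expression the caller passed, as a string.
arg_label <- function(vec) {
  deparse(substitute(vec))
}

df1 <- data.frame(V2 = c("THIS is bad data", "725", "*error"))

label <- arg_label(df1$V2)        # the full expression, as text
col <- sub("^.*\\$", "", label)   # drop everything up to and including the last $

print(paste0("!!ERROR STRANGE CHARACTERS IN ", col, "!!"))
```

If the function is called with something other than a df$col expression, the label is simply whatever was typed in the call, so the message may look odd but nothing breaks.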

I feel this is a somewhat hacky way to do it, but you could use substitute and take the part of the call that follows the $. This assumes you always pass the column by name using $. Either way, you can get the column name like this and paste it into an error message as you wish...
x <- as.character(substitute(vec))[3]  # inside the function: the call df1$V2 becomes c("$", "df1", "V2")

The key is to wrap the recoding up into the function as well. That way you can keep track of which columns you're working on and so get the column names to put in your warning message. The following function recodes whatever columns of a data frame are listed in the 'col_names' argument (if left null the function applies to all of them). The function returns the original data frame, plus the recoded columns with the string in flag added to the column names.
require(stringr)
df1 <- data.frame (
V1 = c(" $25.25", "4,828", " $7,253"),
V2 = c( "THIS is bad data", "725", "*error"))
numconv <- function(df, col_names = NULL, flag = "recode"){
if(is.null(col_names)) {
col_names <- colnames(df)
}
out <- lapply(1:length(col_names), function(i) {
vec <- str_trim(df[,col_names[i]])
vec <- gsub(",|\\$", "", vec)
if( sum(!grepl( "[0-9]",vec)) == 0){
vec <- as.numeric(vec)
}
if( sum(!grepl( "[0-9]",vec)) != 0){
print(paste("!!ERROR STRANGE CHARACTERS in", col_names[i], "!!"))
}
vec
})
out <- data.frame(out, stringsAsFactors = FALSE)
colnames(out) <- paste(col_names, flag, sep = "")
cbind(df, out)
}
numconv(df1)
[1] "!!ERROR STRANGE CHARACTERS in V2 !!"
V1 V2 V1recode V2recode
1 $25.25 THIS is bad data 25.25 THIS is bad data
2 4,828 725 4828.00 725
3 $7,253 *error 7253.00 *error
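Note that the grepl("[0-9]", vec) test used throughout only checks that each entry contains at least one digit, so a value such as "*error1" would slip through. A possible alternative, sketched here with made-up sample values, is to attempt the conversion and flag the entries that come back as NA:

```r
# Sample values are illustrative only; trimws() is the base R counterpart of str_trim().
raw <- c(" $25.25", "4,828", "*error")
clean <- gsub(",|\\$", "", trimws(raw))

# as.numeric() warns on entries it cannot parse; suppress that and inspect the NAs.
num <- suppressWarnings(as.numeric(clean))
bad <- is.na(num) & !is.na(clean)

if (any(bad)) {
  print(paste("Could not convert:", paste(raw[bad], collapse = ", ")))
}
```

The bad vector also tells you which rows are affected, which could be combined with the column name in the message.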

Related

A way to strsplit and replace all of one character with several variations of alternate strings?

I am sure there is a simple solution and I am just getting too frustrated to work through it but here is the issue, simplified:
I have a string, ex: AB^AB^AB^^BAAA^^BABA^
I want to replace the ^s (so, 7 characters in the string), but iterate through many variants and be able to retain them all as strings
for example:
replacement 1: CCDCDCD to get: ABCABCABDCBAAADCBABAD
replacement 2: DDDCCCD to get: ABDABDABDCBAAACCBABAD
I imagine strsplit is the way, and I would like to do it in a for loop, any help would be appreciated!
The positions of the "^" can be found using gregexpr, see tmp
x <- "AB^AB^AB^^BAAA^^BABA^"
y <- c("CCDCDCD", "DDDCCCD")
tmp <- gregexpr(pattern = "^", text = x, fixed = TRUE)
You can then split the 'replacements' character by character using strsplit, this gives a list. Finally, iterate over that list and replace the "^" with the characters from your replacements one after the other.
sapply(strsplit(y, split = ""), function(i) {
# value must be a list: one character vector of replacements per input string
`regmatches<-`(x, m = tmp, value = list(i))
})
Result
# [1] "ABCABCABDCBAAADCBABAD" "ABDABDABDCBAAACCBABAD"
You don't really need a for loop. You can strsplit your string and your replacement, and then replace the "^" elements with the replacement vector.
str <- unlist(strsplit("AB^AB^AB^^BAAA^^BABA^", ""))
pat <- unlist(strsplit("CCDCDCD", ""))
str[str == "^"] <- pat
paste(str, collapse = "")
# [1] "ABCABCABDCBAAADCBABAD"
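To apply the same idea to several replacement strings at once, the steps above can be wrapped in a small function and run over the vector of replacements. This is a sketch; fill_carets is a hypothetical name, and the length check is an added assumption that each replacement has exactly one character per "^".

```r
x <- "AB^AB^AB^^BAAA^^BABA^"
y <- c("CCDCDCD", "DDDCCCD")

# Hypothetical helper: fill each "^" in template with successive characters of replacement.
fill_carets <- function(template, replacement) {
  chars <- strsplit(template, "")[[1]]
  repl <- strsplit(replacement, "")[[1]]
  slots <- chars == "^"
  # Assumption: one replacement character per caret.
  stopifnot(sum(slots) == length(repl))
  chars[slots] <- repl
  paste(chars, collapse = "")
}

# One result string per replacement variant.
res <- vapply(y, fill_carets, character(1), template = x, USE.NAMES = FALSE)
print(res)
```

vapply is used instead of a for loop only because it collects the results into a character vector directly; a loop that appends to a vector would do the same job.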
Another option is gsubfn:
library(gsubfn)  # also attaches proto, which provides proto()
f1 <- Vectorize(function(str1, str2) {
p <- proto(fun = function(this, x) substr(str2, count, count))
gsubfn::gsubfn("\\^", p, str1)
})
-testing
> unname(f1(x, y))
[1] "ABCABCABDCBAAADCBABAD" "ABDABDABDCBAAACCBABAD"
data
x <- "AB^AB^AB^^BAAA^^BABA^"
y <- c("CCDCDCD", "DDDCCCD")
Given x <- "AB^AB^AB^^BAAA^^BABA^" and y <- c("CCDCDCD", "DDDCCCD"), we can try utf8ToInt + intToUtf8 + replace like below
sapply(
y,
function(s) {
intToUtf8(
replace(
u <- utf8ToInt(x),
u == utf8ToInt("^"),
utf8ToInt(s)
)
)
}
)
which gives
CCDCDCD DDDCCCD
"ABCABCABDCBAAADCBABAD" "ABDABDABDCBAAACCBABAD"

Using lapply with gsub to replace words in a dataframe using another dataframe as a 'dictionary'

I have a dataframe called data where I want to replace some words in specific columns, A and B.
I have a second dataframe called dict that plays the role of a dictionary/hash, containing the words and the values to use as replacements.
I think it could be done with purrr's map(), but I want to use apply. It's for a package and I don't want to have to load another package.
The following code does not work, but it gives you the idea. I'm stuck.
columns <- c("A", "B" )
data[columns] <- lapply(data[columns], function(x){x}) %>% lapply(dict, function(y){
gsub(pattern = y[,2], replacement = y[,1], x)})
This works for a single word, but I'm not able to pass the list of changes contained in the dictionary.
data[columns] <- lapply(data[columns], gsub, pattern = "FLT1", replacement = "flt1")
@Gregor_Thomas is right: you need a for loop so that the replacements accumulate; otherwise you only replace one value at a time.
df <- data.frame("A"=c("PB1","PB2","OK0","OK0"),"B"=c("OK3","OK4","PB1","PB2"))
dict <- data.frame("pattern"=c("PB1","PB2"), "replacement"=c("OK1","OK2"))
apply(df[,c("A","B")],2, FUN=function(x) {
for (i in 1:nrow(dict)) {
x <- gsub(pattern = dict$pattern[i], replacement = dict$replacement[i],x)
}
return(x)
})
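Since apply() returns a character matrix, here is a sketch of the same loop used with lapply(), as the question asked, so the result stays a data frame. replace_all is a hypothetical helper name, and fixed = TRUE is an added assumption that the dictionary holds literal strings rather than regular expressions.

```r
df <- data.frame(A = c("PB1", "PB2", "OK0", "OK0"),
                 B = c("OK3", "OK4", "PB1", "PB2"),
                 stringsAsFactors = FALSE)
dict <- data.frame(pattern = c("PB1", "PB2"),
                   replacement = c("OK1", "OK2"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: apply every dictionary entry in turn to one column.
replace_all <- function(x, dict) {
  for (i in seq_len(nrow(dict))) {
    x <- gsub(dict$pattern[i], dict$replacement[i], x, fixed = TRUE)
  }
  x
}

columns <- c("A", "B")
df[columns] <- lapply(df[columns], replace_all, dict = dict)
print(df)
```

Only base R is used here, which fits the constraint of not loading another package.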
Or, if your dict data is too long you can generate a succession of all the gsub you need using a paste as a code generator :
paste0("df[,'A'] <- gsub(pattern = '", dict$pattern,"', replacement = '", dict$replacement,"',df[,'A'])")
It generates all the gsub lines for the "A" column :
"df[,'A'] <- gsub(pattern = 'PB1', replacement = 'OK1',df[,'A'])"
"df[,'A'] <- gsub(pattern = 'PB2', replacement = 'OK2',df[,'A'])"
Then you evaluate the code and wrap it in a lapply for the various columns :
lapply(c("A","B"), FUN = function(v) { eval(parse(text=paste0("df[,'", v,"'] <- gsub(pattern = '", dict$pattern,"', replacement = '", dict$replacement,"',df[,'",v,"'])"))) })
It's ugly but it works fine to avoid long loops.
Edit: for exact matching between df and dict, you should probably use a logical selection with == instead of gsub().
(I don't use match() here because it returns only the first match.)
df <- data.frame("A"=c("PB1","PB2","OK0","OK0","OK"),"B"=c("OK3","OK4","PB1","PB2","AB"))
dict <- data.frame("pattern"=c("PB1","PB2","OK"), "replacement"=c("OK1","OK2","ZE"))
apply(df[,c("A","B")],2, FUN=function(x) {
for (i in 1:nrow(dict)) {
x[x==dict$pattern[i]] <- dict$replacement[i]
}
return(x)
})

Loop over sentences to see if the sentence contains a trigger word

I have the following data-frame.
sentences <- c("this is app is great", "the price it too high")
df <- data.frame(sentences)
I would now like to run over each sentence in the data-frame to see whether a sentence contains a word from a list. I have the following lists set up:
product_names <- c("app", "mega")
marketing_names <- c("campaign", "marketing")
price_names <- c("price", "expensive")
I have written the following code:
for(i in 1:nrow(df)){
list = strsplit(df$sentences, " ")
for(l in list){
if(l %in"% product_names){
#Do something
print(l)
}
if(l %in"% marketing_names){
#Do something
}
if(l %in"% price_names){
#Do something
}
}
}
But this does not seem to work as I do not get a hit. I should get a hit on the first sentence. Any feedback on what I am doing wrong?
Several small things.
The column in the data frame is stored as a factor (the default before R 4.0), not as character, so it needs converting.
You split df$sentences instead of df$sentences[i], the current row.
strsplit returns a list containing a character vector, so you need to select the first element of the list to get the vector of words.
Putting it all together, the line becomes: list = strsplit(as.character(df$sentences[i]), " ")[[1]]
Finally, %in"% should read %in%, so the final code reads:
sentences <- c("this is app is great", "the price it too high")
df <- data.frame(sentences)
product_names <- c("app", "mega")
marketing_names <- c("campaign", "marketing")
price_names <- c("price", "expensive")
for(i in 1:nrow(df))
{
list = strsplit(as.character(df$sentences[i]), " ")[[1]]
for(l in list)
{
if(l %in% product_names)
{
#Do something
print(paste(l,"found in product_names"))
}
if(l %in% marketing_names){
print(paste(l,"found in marketing_names"))
}
if(l %in% price_names){
print(paste(l,"found in price_names"))
}
}
}
Double loop, loop through sentences, then loop through name types for split words, see example:
df <- data.frame(sentences = c("this is app is great", "the price it too high"),
stringsAsFactors = FALSE)
t(sapply(df$sentences, function(i){
l <- unlist(strsplit(i, " "))
sapply(list(product_names = c("app", "mega"),
marketing_names = c("campaign", "marketing"),
price_names = c("price", "expensive")), function(j){
any(l %in% j)
})
}))
# product_names marketing_names price_names
# this is app is great TRUE FALSE FALSE
# the price it too high FALSE FALSE TRUE
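If you need the matched words themselves rather than TRUE/FALSE, a possible variation is to intersect each sentence's words with each list. This is a sketch using the same sample data; the names categories, words and hits are introduced here for illustration.

```r
sentences <- c("this is app is great", "the price it too high")
categories <- list(product_names = c("app", "mega"),
                   marketing_names = c("campaign", "marketing"),
                   price_names = c("price", "expensive"))

# Split each sentence into words, then keep the trigger words found in it.
words <- strsplit(sentences, " ", fixed = TRUE)
hits <- lapply(words, function(w) lapply(categories, intersect, y = w))
names(hits) <- sentences

str(hits)
```

Each element of hits is a list with one entry per category; an empty character vector means no trigger word from that category was found.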

Turn index-based loop into name-based function

I have at my disposal a clean dataframe (1500 rows x 297 columns, named 'Data' - very inspiring) with both numeric and factor columns. However, as is often the case, my factors were encoded as numbers (each number representing a level), hence a dataframe full of numeric vectors.
To overcome this matter I also have a second dataframe (VarLabels), containing information about the columns of the 1st dataframe (which has... 297 rows as you would imagine). In there, one specific column helps me defining what should be the data class in the main dataframe (named VarLabels$TypeVar).
I wrote the following piece of code, which might not be optimal but proved to work so far:
(NB: as you can see, for data labelled 'MIX' I wish to create a copy to have one numeric and one factor)
nbcol <- ncol(Data)
indexcol <- which(colnames(VarLabels) == "TypeVar")
for(i in 1:nbcol){
if (colnames(Data)[[i]] %in% VarLabels$VarName){
if (VarLabels[i,indexcol] == "Quant"){
Data[[i]] <- as.numeric(Data[[i]])
} else if (VarLabels[i,indexcol] == "Qual") {
Data[[i]] <- as.character(Data[[i]])
Data[[i]] <- as.factor(Data[[i]])
} else if (VarLabels[i,indexcol] == "Mix") {
Data <- cbind(Data, Data[[i]])
Data[[i]] <- as.character(Data[[i]])
Data[[i]] <- as.factor(Data[[i]])
Data[[ncol(Data)]] <- as.numeric(Data[[ncol(Data)]])
colnames(Data)[[ncol(Data)]] <- paste(colnames(Data)[[i]], "Num", sep = "_")
} else {
Data[[i]] <- as.numeric(Data[[i]])
}
} else {
}
}
Do you have a neater solution, possibly using a function to reduce the number of code lines / using names instead of column index? (which may be risky if order changes in one of the two dataframes) I recently got into R and am still struggling with user-defined functions.
I read other related topics like:
Change all columns from factor to numeric in R
Function to change class of columns in R to match the class of an other dataset
Convert type of multiple columns of a dataframe at once
How do I get the classes of all columns in a data frame?
but could not apply the answers to my own problem. Any idea how to make things simple? (if possible!)
The following function does what the question asks for.
It matches input data set X column names with the new column types with a sequence of which/match statements, without needing loops. The coercion is performed with lapply loops.
The test data set is the built-in data set mtcars.
coerceCols <- function(X, VarLabels){
i <- which(VarLabels$TypeVar == "Qual")
j <- match(VarLabels$VarName[i], names(X))
X[j] <- lapply(X[j], factor)
i <- which(VarLabels$TypeVar == "Mix")
j <- match(VarLabels$VarName[i], names(X))
tmp <- X[j]
names(tmp) <- paste(names(tmp), "Num", sep = "_")
X[j] <- lapply(X[j], factor)
cbind(X, tmp)
}
Data <- mtcars
VarLabels <- data.frame(VarName = names(mtcars),
TypeVar = c("Quant", "Mix", "Quant",
"Quant", "Quant", "Quant",
"Quant", "Qual", "Qual",
"Mix", "Mix"),
stringsAsFactors = FALSE)
coerceCols(Data, VarLabels)
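For comparison, here is a sketch that stays closer to the original loop but looks columns up by name instead of by position, and also covers the "Quant" case. coerce_by_name and the small test frame are hypothetical, and skipping names that are missing from the data is an assumption about the desired behaviour.

```r
# Hypothetical helper: coerce columns of X according to labels$TypeVar, matched by name.
coerce_by_name <- function(X, labels) {
  for (k in seq_len(nrow(labels))) {
    nm <- labels$VarName[k]
    if (!nm %in% names(X)) next          # assumption: silently skip unknown names
    type <- labels$TypeVar[k]
    if (type == "Qual") {
      X[[nm]] <- factor(X[[nm]])
    } else if (type == "Mix") {
      # Keep a numeric copy first, then turn the original into a factor.
      X[[paste(nm, "Num", sep = "_")]] <- as.numeric(X[[nm]])
      X[[nm]] <- factor(X[[nm]])
    } else {
      X[[nm]] <- as.numeric(X[[nm]])     # "Quant" and anything unrecognised
    }
  }
  X
}

X <- data.frame(a = c(1, 2, 1), b = c(3, 4, 3), c = c(5, 6, 7))
labels <- data.frame(VarName = c("a", "b", "c"),
                     TypeVar = c("Quant", "Qual", "Mix"),
                     stringsAsFactors = FALSE)
out <- coerce_by_name(X, labels)
str(out)
```

Because every lookup goes through the column name, the row order of the labels table no longer has to match the column order of the data.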

LastAssigned (In R what was last right of an arrow (<-) and returning that)

I try to create an R function that returns the data frame I last created. I use R studio. The idea and current solution from this question stems from my previous question
lastAssigned <- function(match = "<- *data.frame",
remove = " *<-.*") {
f <- tempfile()
savehistory(f)
history <- readLines(f)
unlink(f)
match <- grep(match, history, value = TRUE)
sub(remove, "", match[length(match)])
}
However the current function returns the name. I now would need a function returning the data.frame.
I would be very happy if someone could please point me in the right direction.
Intended usage:
df_head <- data.frame("age" = c(19, 21),
sex = c("m", "f"))
lastAssigned()
library(data.table)
setDT(lastAssigned())
#gives error (obviously)
alittlefunction <-function(x){
sexy <- rev(c('m', 'f'))
x <- within(x, sex <- factor(sex, levels=sexy))
}
alittlefunction(lastAssigned())
#gives error (obviously)
How could I rebuild lastAssigned in a way it will work for this usage?
Not sure what you are looking for. Something like this:
getLastAssigned <- function(match = "<- *data.frame",
remove = " *<-.*") {
f <- tempfile()
savehistory(f)
history <- readLines(f)
unlink(f)
match <- grep(match, history, value = TRUE)
get(sub(remove, "", match[length(match)]), envir = .GlobalEnv)
}
df_head <- data.frame("age" = c(19, 21),
sex = c("m", "f"))
> getLastAssigned()
age sex
1 19 m
2 21 f
You can use get() here, as below:
lastAssigned <- function(match = "<- *data.frame",
remove = " *<-.*") {
f <- tempfile()
savehistory(f)
history <- readLines(f)
unlink(f)
match <- grep(match, history, value = TRUE)
return(get(sub(remove, "", match[length(match)]), envir = .GlobalEnv))
}
alittlefunction <-function(x){
sexy <- rev(c('m', 'f'))
x <- within(x, sex <- factor(sex, levels=sexy))
return(x)
}
It should generate the output you expect, unless I have misunderstood something.
Edit: As mentioned in the comments, you also need to specify the environment to avoid name conflicts; you can use pos instead of envir.
From the documentation:
The pos argument can specify the environment in which to look for the
object in any of several ways: as a positive integer (the position in
the search list); as the character string name of an element in the
search list
Running lastAssigned() returns the data frame df_head.
A simple example may help:
x <- 2
get('x')  # returns 2
I hope this is what you were expecting. Thanks
You can also use parse() here: replace the entire return line with the line below:
return(eval(parse(text =sub(remove, "", match[length(match)])), envir = .GlobalEnv ))
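Because savehistory() generally needs an interactive session, the lookup step can be tried in isolation with a hand-written history vector. This is only a sketch: the history lines and the separate environment env are made up for illustration, whereas the answers above look in .GlobalEnv.

```r
# Stand-in for the lines that readLines() would return from the saved history file.
history <- c("x <- 1",
             "df_head <- data.frame(age = c(19, 21), sex = c('m', 'f'))",
             "print(x)")

# Same two steps as in lastAssigned(): find the last data.frame assignment, keep its name.
hits <- grep("<- *data.frame", history, value = TRUE)
name <- sub(" *<-.*", "", hits[length(hits)])

# Illustrative environment holding the object; the answers use .GlobalEnv instead.
env <- new.env()
assign("df_head", data.frame(age = c(19, 21), sex = c("m", "f")), envir = env)

by_get <- get(name, envir = env)                  # look the object up by name
by_eval <- eval(parse(text = name), envir = env)  # parse the name and evaluate it

print(identical(by_get, by_eval))
```

Both routes return the same object here; get() is the more direct of the two, since eval(parse()) would evaluate any code found in the text, not just a name.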

Resources