R: Logical from 2 vectors on pattern match

Trying to clean up some dirty data (for work): my data frame has a column of customer information (for our example, let's say store and product) in a long, weird string, as well as a column for store and a column for product. I can parse the store and the product from the string. Here is where I arrive at my problem.
Let's say (consider these vectors part of a larger data frame, prefixed with data$ if that helps; I was just working with them as vectors, thinking it might speed up the code not having to pull the whole data frame):
WeirdString <- c("fname: john; lname:smith; store:Amazon Inc.; product:Echo", "fname: cindy; lname:smith; store:BestBuy; product:Ps-4","fname: jon; lname:smith; store:WALMART; product:Pants")
so I parse this to be:
WS_Store <- c("Amazon Inc.", "BestBuy", "WALMART")
WS_Prod <- c("Echo", "Ps-4", "Pants")
What's in the tables (i.e. the non-parsed columns) is:
DB_Store <- c("Amazon", "BEST BUY", "Other")
DB_Prod <- c("ECHO", "PS4", "Jeans")
I currently use a for loop to grepl the "true" string against the parsed string, element by element. This takes forever, and I know R was designed for vectorized code. So my question is: how do I eliminate the loop and use something like lapply (which I tried, and failed at, because I'm not savvy enough with lapply) or some other vectorized approach?
My current code:
Diff_Store <- logical(nrow(data)) # pre-allocate result vectors
Diff_Prod  <- logical(nrow(data))
for(i in 1:nrow(data)){ # could be i in seq_along(DB_Prod) or whatever; all vectors are the same length
  Diff_Store[i] <- !grepl(DB_Store[i], WS_Store[i], ignore.case = TRUE)
  Diff_Prod[i]  <- !grepl(DB_Prod[i],  WS_Prod[i],  ignore.case = TRUE)
}
I intend to append those columns back into the dataframe, as the true goal is to diagnose why the database has this problem.
If there's a better way than this, rather than trying to vectorize it, I'm open to it. The data in DB_Store is restricted to a specific set of "stores" (in the table it comes from), but in the string it seems to be open-ended, which is why I use the DB value as the pattern, not the x. Product is similar, but not as restricted, which is why some have dashes and some don't. I would love to match "close things" like Ps-4 vs. PS4, but I will probably just build a table of matches once I see how weird the string gets. To be clear, though, the string may genuinely not match, which is what the Pants/Jeans pair represents. The dataset is 2.5 million records with many different "stores" and "products", and I do want to make sure they match on the same line, not just "is it somewhere in the database" (which is what previous questions seem to ask: can I see if a string is in a list of strings, rather than a 1:1 comparison; and the last such question ended in a loop, which takes minutes to hours to run).
Thanks!

Please check if this works for you:
check <- function(vec_a, vec_b){
  mat <- cbind(vec_a, vec_b)
  diff <- apply(mat, 1, function(x) !grepl(pattern = x[1], x = x[2], ignore.case = TRUE))
  diff
}
Use your different vectors for stores (or products) in the arguments vec_a and vec_b, respectively (example: diff_stores <- check(DB_Store, WS_Store) ). This function will return a logical vector with TRUE values referring to items that weren't a match in the two original vectors. Is this what you wanted?
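If you would rather skip building the intermediate matrix, mapply() can walk the two vectors in parallel. This is only a sketch, reusing the DB_Store/WS_Store and DB_Prod/WS_Prod vectors from the question and the same "DB value as pattern" assumption:
Diff_Store <- !mapply(grepl, pattern = DB_Store, x = WS_Store,
                      MoreArgs = list(ignore.case = TRUE), USE.NAMES = FALSE)
Diff_Prod  <- !mapply(grepl, pattern = DB_Prod,  x = WS_Prod,
                      MoreArgs = list(ignore.case = TRUE), USE.NAMES = FALSE)
Either version gives plain logical vectors that can be cbind-ed back onto the data frame.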


R code for WhatsApp average word length per person

I am new to R. Currently, I have parsed messages from a WhatsApp chat group and now I am trying to visualize the average word length per member.
I am using this code to calculate the number of words every time "Eddy" sends a message:
for(i in grep("Eddy", chatcsv[,2], fixed=TRUE)){
  length(which(!is.na(chatcsv[i,4:111])))
}
This does not return any output or any error message.
My intention is to then sum up the total length and divide by the number of times the person has messaged. Lastly, I plan to place the averages in a vector and visualize them as a bar graph.
Thank you
Your syntax is wrong. You should use:
allnames <- chatcsv[,2] # or similar
eddyindexes <- grep("Eddy", allnames, fixed=TRUE) # returns the row indexes of Eddy's chats
eddyschats <- chatcsv[eddyindexes, 4:111]
eddysavgcharacters <- apply(eddyschats, 1, function(x) mean(nchar(x), na.rm = TRUE)) # average nchar of each of Eddy's chats
I'm thinking you are coming from a non-functional language. (Not a language that is dysfunctional, but rather one that is not a "functional language".) Your expression length(which(!is.na(chatcsv[i,4:111]))) does nothing, because it sits inside a for loop but is never assigned to a name, so its value just disappears. You would have needed to create a vector (let's say res) with res <- numeric(0) before your loop and then, within your loop, done:
res[i] <- length(which(!is.na(chatcsv[i,4:111])))
The earlier answerer was confusing grep and grepl in his comment. The grep function returns integer values; the grepl function returns logical vectors. They can both be used for indexing.
Whether that expression would give you the basis for further efforts is not clear. It would depend on the contents of chatcsv[i,4:111]. If the contents are single words, then perhaps it would succeed; if they are sentences, it would not. The length function would just return the number of non-NA values in the row vector. Only if your prior (undescribed) operations had created a clean set of "words" in that set of columns would you get meaningful results.
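Putting those pieces together, here is a minimal sketch of the average the question asks for, assuming columns 4:111 of chatcsv hold one word per cell (with NA where a message had fewer words):
eddy_rows  <- grep("Eddy", chatcsv[, 2], fixed = TRUE)
word_count <- apply(chatcsv[eddy_rows, 4:111], 1, function(x) sum(!is.na(x))) # words per message
eddy_avg   <- sum(word_count) / length(eddy_rows) # average words per message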

Swap name and content of a (lookup) vector in an one liner / library function

In my code I use lookup tables quite often, for example to have more verbose versions of column names in a data frame. For instance:
lkp <- c(speed = "Speed in mph", dist = "Stopping Distance in ft")
makePlot <- function(x = names(cars)) {
  x <- match.arg(x)
  hist(cars[[x]], xlab = lkp[[x]])
}
Now it happens that I want to reverse the lookup vector [*], which is easily done by
setNames(names(lkp), lkp)
If lkp is a bit more complicated, this becomes quite a lot of typing:
setNames(names(c(firstLkp, secondLkp, thirdLkp, youGotTheIdea)),
         c(firstLkp, secondLkp, thirdLkp, youGotTheIdea))
with a lot of redundant code. Of course I could create a temporary variable
fullLkp <- c(firstLkp, secondLkp, thirdLkp, youGotTheIdea)
setNames(names(fullLkp), fullLkp)
Or even write a simple function doing it for me
swap_names_content <- function(x) setNames(names(x), x)
However, since this seems to me to be such a common task, I was wondering whether there is already a function in one of the popular packages doing the same?
[*] A common use case for me is the use of shiny's selectInput for instance:
List of values to select from. If elements of the list are named, then that name rather than the value is displayed to the user.
That is, it is exactly the reverse of my typical lookup table.
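For concreteness, a rough sketch of that shiny use case, using the lkp vector from above (the inputId "column" and the label are just placeholders): the reversed vector puts the verbose labels in the names, so selectInput displays them while returning the short column name.
library(shiny)
rev_lkp <- setNames(names(lkp), lkp) # "Speed in mph" -> "speed", etc.
selectInput("column", "Variable to plot", choices = rev_lkp)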

R add to a list in a loop, using conditions

I have a data.frame with dim = (200, 500).
I want to run a shapiro.test on each column of my data frame and append the result to a list. This is what I'm trying:
colstoremove <- list();
for (i in range(dim(I.df.nocov)[2])) {
  x <- shapiro.test(I.df.nocov[1:200,i])
  colstoremove[[i]] <- x[2]
}
However, this is failing. Any pointers? (My background is mainly Python; I'm not much of an R user.)
Consider lapply(): when a data frame is passed to it, the operation runs on each column, and the returned list has one element per column:
colstoremove <- lapply(I.df.nocov, function(col) shapiro.test(col)[2])
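A hedged usage sketch of that idea, flattening the results into a named vector of p-values (this assumes every column of I.df.nocov is numeric, and the 0.05 cutoff is just a hypothetical threshold):
pvals <- sapply(I.df.nocov, function(col) shapiro.test(col)$p.value)
colstoremove <- names(pvals)[pvals < 0.05] # hypothetical cutoff for non-normal columns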
Here is what happens in
for (i in range(dim(I.df.nocov)[2]))
For the sake of example, I assume that I.df.nocov contains 100 rows and 5 columns.
dim(I.df.nocov) is the vector of I.df.nocov dimensions, i.e. c(100, 5)
dim(I.df.nocov)[2] is the 2nd dimension of I.df.nocov, i.e. 5
range(x) is a 2-element vector which contains the minimal and maximal values of x. For example, range(c(4,10,1)) is c(1,10). So range(dim(I.df.nocov)[2]) is c(5,5).
Therefore, the loop iterates twice: the first time with i=5, and the second time also with i=5. Not surprising that it fails!
The problem is that R's function range and Python's function with the same name do completely different things. The equivalent of Python's range is called seq. For example, seq(5) is c(1,2,3,4,5), while seq(3,5) is c(3,4,5), and seq(1,10,2) is c(1,3,5,7,9). You may also write 1:n, which is the same as seq(n), and m:n is the same as seq(m,n) (but the priority of ':' is very high, so 1:2*x is interpreted as (1:2)*x).
Generally, if something does not work in R, you should print the subexpressions from the innermost to the outermost. If some subexpression is too big to be printed, use str(x) (str means "structure"). And never assume that functions in Python and R are the same! If there is a function with the same name, it usually does a different thing.
On a side note, instead of dim(I.df.nocov)[2] you could just write ncol(I.df.nocov) (there is also a function nrow).
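For completeness, a minimal sketch of the corrected loop (assuming, as in the question, that every column of I.df.nocov is numeric):
colstoremove <- vector("list", ncol(I.df.nocov))
for (i in seq_len(ncol(I.df.nocov))) {
  colstoremove[[i]] <- shapiro.test(I.df.nocov[, i])$p.value
}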

Faster alternative methods to for-loop in R for pattern matching

I am working on a problem in which I have two data frames, data and abbreviations, and I would like to replace all the abbreviations present in data with their respective full forms. Till now I have been using for-loops in the following manner:
abb <- c()
for(i in 1:length(data$text)){
  for(j in 1:length(AbbreviationList$Abb)){
    abb <- paste("(\\b", AbbreviationList$Abb[j], "\\b)", sep="")
    data$text[i] <- gsub(abb, AbbreviationList$Fullform[j], tolower(data$text[i]))
  }
}
The abbreviation data frame can be generated using the following code:
Abbreviation <- c(c("hru", "how are you"),
                  c("asap", "as soon as possible"),
                  c("bf", "boyfriend"),
                  c("ur", "your"),
                  c("u", "you"),
                  c("afk", "away from keyboard"))
Abbreviation <- data.frame(matrix(Abbreviation, ncol=2, byrow=T), row.names=NULL)
names(Abbreviation) <- c("abb","Fullform")
And data is merely a data frame with one column containing a text string in each row, which can also be generated using the following code:
data <- data.frame(unlist(c("its good to see you, hru doing?",
                            "I am near bridge come ASAP",
                            "Can u tell me the method u used for",
                            "afk so couldn't respond to ur mails",
                            "asmof I dont know who is your bf?")))
names(data) <- "text"
Initially, I had a data frame with around 1000 observations and around 100 abbreviations, so I was able to run the analysis. But now the data has grown to almost 50000 rows and I am facing difficulty in processing it, as the two nested for-loops make the process very slow. Can you suggest some better alternatives to the for-loops and explain with an example how to use them in this situation? If this problem can be solved faster via vectorization, please suggest how to do that as well.
Thanks for the help!
This should be faster, and without side effects.
mapply(function(x, y){
  abb <- paste0("(\\b", x, "\\b)")
  gsub(abb, y, tolower(data$text))
}, abriv$Abb, abriv$Fullform)
gsub is vectorized over x, so you can give it the whole character vector in which matches are sought. Here I give it data$text.
I use mapply to avoid the side effects of the for loop.
First of all, clearly there is no need to compile the regular expressions with each iteration of the loop. Also, there is no need to actually loop over data$text: in R, very often you can use a vector where a value could do -- and R will go through all the elements of the vector and return a vector of the same length.
Abbreviation$regex <- sprintf( "(\\b%s\\b)", Abbreviation$abb )
for( j in 1:length( Abbreviation$abb ) ) {
  data$text <- gsub( Abbreviation$regex[j],
                     Abbreviation$Fullform[j], data$text,
                     ignore.case = T )
}
The above code works with the example data.
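As a further hedged sketch (not from the answers above, and assuming you are willing to use the stringr package): stringr::str_replace_all accepts a named vector of pattern-to-replacement pairs, which collapses the whole job into a single vectorized call over data$text.
library(stringr)
abbs  <- as.character(Abbreviation$abb)
fulls <- as.character(Abbreviation$Fullform)
replacements <- setNames(fulls, sprintf("\\b%s\\b", abbs)) # pattern -> replacement
data$text <- str_replace_all(tolower(data$text), replacements)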

How to access single elements in a table in R

How do I grab elements from a table in R?
My data looks like this:
      V1     V2
1 12.448 13.919
2 22.242  4.606
3 24.509  0.176
etc...
I basically just want to grab elements individually. I'm getting confused with all the R terminology, like vectors, and I just want to be able to get at the individual elements.
Is there a function where I can just do like data[v1][1] and get the element in row 1 column 1?
Try
data[1, "V1"] # Row first, quoted column name second, and case does matter
Further note: Terminology in discussing R can be crucial and sometimes tricky. Using the term "table" to refer to that structure leaves open the possibility that it was either a 'table'-classed, or a 'matrix'-classed, or a 'data.frame'-classed object. The answer above would succeed with any of them, while #BenBolker's suggestion below would only succeed with a 'data.frame'-classed object.
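To make that class distinction concrete, a small hedged illustration using the example values from the question:
df <- data.frame(V1 = c(12.448, 22.242, 24.509), V2 = c(13.919, 4.606, 0.176))
m  <- as.matrix(df)
df[1, "V1"] # works for a data.frame
m[1, "V1"]  # also works for a matrix
df$V1[1]    # works only for the data.frame (and other list-like objects)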
There is a ton of free introductory material for beginners in R: CRAN: Contributed Documentation
?"[" pretty much covers the various ways of accessing elements of things.
Under usage it lists these:
x[i]
x[i, j, ... , drop = TRUE]
x[[i, exact = TRUE]]
x[[i, j, ..., exact = TRUE]]
x$name
getElement(object, name)
x[i] <- value
x[i, j, ...] <- value
x[[i]] <- value
x$i <- value
The second item is sufficient for your purpose.
Under Arguments it points out that with [ the arguments i and j can be numeric, character or logical.
So these work:
data[1,1]
data[1,"V1"]
As does this:
data$V1[1]
and keeping in mind a data frame is a list of vectors:
data[[1]][1]
data[["V1"]][1]
will also both work.
So that's a few things to be going on with. I suggest you type in the examples at the bottom of the help page one line at a time (yes, actually type the whole thing in, one line at a time, and see what they all do; you'll pick up stuff very quickly, and the typing, rather than copy-pasting, is an important part of committing it to memory).
Maybe not as polished as the answers above, but I guess this is what you were looking for.
data[1:1, 3:3]   # works with positive integers (1:1 is just 1)
data[1:1, -3:-3] # does not do what you want: negative indices drop elements, so this gives the entire 1st row without the 3rd element
data[i:i, j:j]   # given that i and j are positive integers
Here indexing starts from 1, i.e.,
data[1:1, 1:1]   # means the top-leftmost element
