Error in using grep in SparkR - r

I am having an issue with subsetting my Spark DataFrame.
I have a DataFrame called nfe, which contains a column called ITEM_PRODUTO that is formatted as a string. I would like to subset this DataFrame based on whether the item column contains the word "AREIA". I can easily subset the data based on an exact phrase:
nfe.subset1 <- subset(nfe, nfe$ITEM_PRODUTO == "AREIA LAVADA FINA")
nfe.subset2 <- subset(nfe, nfe$ITEM_PRODUTO %in% "AREIA")
However, what I would like is a subset of all rows that contain the word "AREIA" in the ITEM_PRODUTO column. When I try to use grep, though, I receive an error message:
nfe.subset3 <- subset(nfe, grep("AREIA", nfe$ITEM_PRODUTO))
# Error in as.character.default(x) :
# no method for coercing this S4 class to a vector
I've tried multiple iterations of syntax, and tried grepl as well, but nothing seems to work. It's probably a syntax error, but could anyone help me out?
Thanks!

Standard R functions cannot be applied to a SparkDataFrame. Use either like:
where(nfe, like(nfe$ITEM_PRODUTO, "%AREIA%"))
or rlike:
where(nfe, rlike(nfe$ITEM_PRODUTO, ".*AREIA.*"))
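For comparison only, here is a minimal base R sketch on an ordinary local data.frame (not a SparkDataFrame). It shows why a logical filter such as grepl, rather than grep, is the local counterpart of like/rlike. The data and the name local_nfe are hypothetical stand-ins for illustration.

```r
# Hypothetical local data, standing in for the ITEM_PRODUTO column
items <- c("AREIA LAVADA FINA", "CIMENTO", "AREIA GROSSA")
local_nfe <- data.frame(ITEM_PRODUTO = items)

# grep() returns integer positions of the matches ...
match_positions <- grep("AREIA", items)

# ... while grepl() returns one TRUE/FALSE per element,
# which is the kind of condition subset() expects
match_flags <- grepl("AREIA", items, fixed = TRUE)

# Works on a local data.frame only; a SparkDataFrame needs like()/rlike()
areia_rows <- subset(local_nfe, grepl("AREIA", ITEM_PRODUTO, fixed = TRUE))
print(areia_rows)
```

On a SparkDataFrame the column is an S4 Column object rather than a character vector, which is why grep and grepl cannot coerce it.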

Related

Trying to find a better way to sorting the data in R

In my data frame I am trying to sort the data in descending order. I am using the below line of code for sorting my data and it works as intended.
CNS25VOL <- CNS25VOL[order(-CNS25VOL$MATVOL22), ]
However, if I refer to the same column by its index number, the code throws an error:
CNS25VOL <- CNS25VOL[order(-CNS25VOL[, 2]), ]
Error thrown is
Error in CNS25VOL[, 2] : incorrect number of dimensions
I do have a working solution, but the issue is that if the name of my column suddenly changes, the code won't work. I know that the column positions will stay the same in the data frame.
How can I handle this?
In order(-CNS25VOL[, 2]), order expects a vector, which you try to construct via [ in CNS25VOL[, 2]. A normal data.frame will return a vector consisting only of the 2nd column. A tibble, however, will return a tibble with only one column.
You can reproduce the behaviour of normal data.frames by passing the drop = TRUE argument to [, as in
CNS25VOL[, 2, drop = TRUE]
Always be aware of whether you are using a standard data.frame, a tibble, or a data.table: they look very similar but differ in the details. Also see https://tibble.tidyverse.org/reference/subsetting.html
dplyr functions tend to give you a tibble back even if you fed them a classical data.frame.
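A minimal base R sketch of the difference, using a small hypothetical data frame in place of CNS25VOL. Here drop = FALSE on a plain data.frame is used only to imitate the one-column result a tibble would give, and [[ is shown as one position-based way to get a vector regardless of the column name.

```r
# Hypothetical stand-in for CNS25VOL
CNS25VOL <- data.frame(ID = c("a", "b", "c"), MATVOL22 = c(10, 30, 20))

# On a plain data.frame, single-column [ drops to a vector by default
as_vector <- CNS25VOL[, 2]

# With drop = FALSE the result stays a one-column data frame,
# similar to what a tibble returns for the same call
as_frame <- CNS25VOL[, 2, drop = FALSE]

# [[ extracts the column by position as a vector
sorted <- CNS25VOL[order(-CNS25VOL[[2]]), ]
print(sorted)
```

Because [[2]] depends only on position, the sort keeps working if the column is renamed.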

Problems with renaming columns via variables in R

I'm having issues with a specific problem. I have a dataset of a ton of matrices that all have V1 as their column names (essentially NULL). I'm trying to write a loop to replace all of these with column names from a list, but I'm running into some issues.
To break this down to the most simple form, this code isn't functioning as I'd expect it to.
nameofmatrix <- paste('column_', i, sep = "")
colnames(eval(as.name(nameofmatrix))) <- c("test")
I would expect this to take the value of column_1 for example, and replace (in the 2nd line) with "test" as the column name.
I tried to break this down further. For example, if I run print(eval(as.name(nameofmatrix))) I get the object's columns/rows printed as expected, and if I run print(colnames(eval(as.name(nameofmatrix)))) I get NULL as expected for the column header (since it was set as V1).
I've even tried to manually type in the column name, such as colnames(column_1) <- c("test"), and this successfully renames the column. But once the variable is put in the literal name's place as shown above, it does not work the same way. I'm having difficulty finding a way to rename several matrix columns after they have been created with this method. Does anyone have any advice or suggestions?
Note, the error I'm receiving on trying to run this is
Error in eval(as.name(nameofmatrix)) <- `*vtmp*` : could not find function "eval<-"
We could return the values of the objects in a list with get (if there are multiple objects, use mget), then rename the objects in the list and update those objects in the global env with list2env:
# newnames is assumed to hold the replacement column names
list2env(lapply(mget(nameofmatrix), function(x) {
  colnames(x) <- newnames
  x
}), .GlobalEnv)
It can also be done with assign
data(mtcars)
nameofobject <- 'mtcars'
assign(nameofobject, `colnames<-`(get(nameofobject),
c('mpg1', names(mtcars)[-1])))
Now, check the names of 'mtcars'
names(mtcars)[1]
#[1] "mpg1"
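Applied to the loop from the question, the get/assign approach might look like the sketch below. The three matrices and the names in new_names are hypothetical stand-ins; only the column_ prefix and the nameofmatrix variable come from the question.

```r
# Hypothetical one-column matrices without column names
column_1 <- matrix(1:3, ncol = 1)
column_2 <- matrix(4:6, ncol = 1)
column_3 <- matrix(7:9, ncol = 1)

# Hypothetical replacement names, one per matrix
new_names <- c("alpha", "beta", "gamma")

for (i in seq_along(new_names)) {
  nameofmatrix <- paste0("column_", i)
  m <- get(nameofmatrix)        # fetch the object by its name
  colnames(m) <- new_names[i]   # rename the column on the local copy
  assign(nameofmatrix, m)       # write the copy back under the same name
}

print(colnames(column_2))
```

The replacement function colnames<- needs a real variable on its left-hand side, which is why the object is copied into m first instead of being modified through eval.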

How to remove the first row from multiple dataframes?

I have multiple dataframes and would like to remove the first row in all of them.
I have tried using a for loop but cannot understand what I am doing wrong
for (i in cities){
i <- i[-1, ]
}
I get the following error code:
Error in i[-1, ] : incorrect number of dimensions
If we assume that the only objects in your workspace are dataframes then this might succeed:
cities <- objects()
for (i in cities) { assign(i, get(i)[-1,])}
Explanation:
Two things were wrong with the original code:
One was already mentioned in the comments: "df" is not the same as df. You need to use get to convert a character value to a "true" R name that is used to retrieve an object having that name. The result of objects() is only a character vector. In R the term "name" means a "language object". See the help page: ?mode. (There is potential confusion about row names and column names, which are always of class "character".) It's not like SAS, which is a macro language that has no such distinction.
The second error was trying to get substitution for the i on the left-hand side of <-. That would have failed even if you were working with actual R names. The assign function is designed to handle character values that are then converted to R names.
Say you get a list of all the tables in your environment, and you call that list cities. You can't just iterate over each value of cities and change things, because in the list they are just character strings.
Here is what you need:
for (i in cities){
tmp <- get(i) # load the actual table
tmp <- tmp[-1, ] # remove first row
assign(i, tmp) # re-assign table to original table name
}
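A different technique from the get/assign loop above is to keep the data frames themselves in a named list and modify them with lapply, which avoids looking objects up by name. A minimal sketch with hypothetical data:

```r
# Hypothetical data frames, stored in a named list rather than as
# separate objects in the workspace
cities <- list(
  paris  = data.frame(year = 2019:2021, pop = c(1, 2, 3)),
  berlin = data.frame(year = 2019:2021, pop = c(4, 5, 6))
)

# Drop the first row of every data frame; drop = FALSE keeps the
# result a data frame even if only one column is present
cities <- lapply(cities, function(df) df[-1, , drop = FALSE])

print(sapply(cities, nrow))
```

Individual tables are then available as cities$paris, cities$berlin, and so on.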

R Code levenshteinSim() function: comparing two columns in data

I am trying to get a comparison score for two columns in an R data frame.
I use the RecordLinkage library and tried to apply the levenshteinSim() function.
The idea is to get a result similar to
levenshteinSim("GR 7G SOLID LEGGING", "GEORGE OPP SOLID LEGGING")
[1] 0.7083333,
but comparing column to column.
Tried to use it as follows:
gw$test<-levenshteinSim(gw$ITEM_DESCRIPTION, gw$ITEM_SIGNING_DESCRIPTION)
where gw is my data frame.
However I get the error:
Error in nchar(str1) : 'nchar()' requires a character vector
Is there any way to apply this function to two columns instead of two actual vectors?
I will appreciate any help.
Please check the class of both columns. It should be "character". If it is not, use as.character() on both of them. For example:
gw$ITEM_DESCRIPTION<- as.character(gw$ITEM_DESCRIPTION)
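A minimal base R sketch of the check and the conversion, assuming the columns were read in as factors. The second row of data is hypothetical, and stringsAsFactors = TRUE is set only to reproduce that situation.

```r
# Hypothetical data; stringsAsFactors = TRUE mimics columns read in as factors
gw <- data.frame(
  ITEM_DESCRIPTION = c("GR 7G SOLID LEGGING", "BLUE SHIRT"),
  ITEM_SIGNING_DESCRIPTION = c("GEORGE OPP SOLID LEGGING", "BLUE T SHIRT"),
  stringsAsFactors = TRUE
)

# Inspect the class of each column
print(sapply(gw, class))

# nchar() rejects factors; capture the error message instead of stopping
res <- tryCatch(nchar(gw$ITEM_DESCRIPTION),
                error = function(e) conditionMessage(e))
print(res)

# Convert every factor column to character in one step
is_fct <- sapply(gw, is.factor)
gw[is_fct] <- lapply(gw[is_fct], as.character)

# nchar() now works on the converted column
lens <- nchar(gw$ITEM_DESCRIPTION)
```

After the conversion, both columns can be passed to levenshteinSim() as character vectors.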

Error in R "undefined columns selected"

I am trying to initiate this code using the zoo command:
gld <- zoo(gld[,7], gld_dates)
Unfortunately I get an error message telling me this:
Error in `[.data.frame`(gld, , 7) : undefined columns selected
I want to use the zoo function to create zoo objects from my data.
The function should take two arguments: a vector of data and
a vector of dates.
This is the data I am using[LINK BROKEN].
I believe I have 7 columns in my data set. Any ideas?
The code I am trying to implement is found here[LINK BROKEN].
Is there anything wrong with this code?
You don't say what your gld_dates is exactly, but if gld starts as your original data and you want to make a zoo object of the 7th column ordering by the 1st column (dates), I can do
gld_zoo <- zoo(gld[, 7], gld[, 1])
just fine. Equivalently, but with more readability,
gld_zoo <- zoo(gld$Adj.close, gld$Date)
reminds me what each column is.
Subsetting requires the names of the subset columns to match those in the data frame. This code subsets the dataset french_fries with potatoes instead of potato:
data("french_fries")
df_potato <- french_fries[, c("potatoes")]
and it fails with:
Error in `[.data.frame`(french_fries, , c("potatoes")) :
undefined columns selected
but using the right name potato works:
df_potato <- french_fries[, c("potato")]
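One way to catch such a mismatch before subsetting is to compare the requested names against names() of the data frame. A minimal sketch, using a small hypothetical data frame ff in place of french_fries:

```r
# Hypothetical stand-in for french_fries with a column named "potato"
ff <- data.frame(time = 1:3, potato = c(2.9, 14.0, 11.0))

# Columns we intend to select; "potatoes" is the misspelled one
wanted <- c("potato", "potatoes")

# Names requested but not present in the data frame
missing_cols <- setdiff(wanted, names(ff))
print(missing_cols)

# Select only the columns that actually exist
df_potato <- ff[, intersect(wanted, names(ff)), drop = FALSE]
print(names(df_potato))
```

Running names(gld) or ncol(gld) in the same way would show whether a 7th column really exists.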
