R Code levenshteinSim() function: comparing two columns in data

I am trying to get a comparison score for two columns in an R data frame.
I use the RecordLinkage library and tried to apply the levenshteinSim() function.
The idea is to get a result similar to
levenshteinSim("GR 7G SOLID LEGGING", "GEORGE OPP SOLID LEGGING")
[1] 0.7083333,
but comparing column to column.
Tried to use it as follows:
gw$test<-levenshteinSim(gw$ITEM_DESCRIPTION, gw$ITEM_SIGNING_DESCRIPTION)
where gw is my data frame.
However I get the error:
Error in nchar(str1) : 'nchar()' requires a character vector
Is there any way to apply this function to two columns instead of two actual vectors?
I will appreciate any help.

Please check the class of both columns. It should be "character". If it is not (for example, if the columns are factors), use as.character() on both of them. For example:
gw$ITEM_DESCRIPTION<- as.character(gw$ITEM_DESCRIPTION)
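To make the class check concrete, here is a minimal sketch. The data frame values are made up, and lev_sim() is a hypothetical base-R stand-in built on adist(), so the example runs without RecordLinkage installed. It assumes the similarity is 1 minus the edit distance divided by the longer string's length, which agrees with the 0.7083333 shown in the question.

```r
# Toy data: column names follow the question, values are invented.
gw <- data.frame(
  ITEM_DESCRIPTION         = c("GR 7G SOLID LEGGING", "BLUE SHIRT"),
  ITEM_SIGNING_DESCRIPTION = c("GEORGE OPP SOLID LEGGING", "BLUE SHIRT"),
  stringsAsFactors = TRUE   # reproduces factor columns, which nchar() rejects
)
sapply(gw, class)           # both columns report "factor"

# Convert every column to character while keeping the data frame structure.
gw[] <- lapply(gw, as.character)

# Hypothetical stand-in for levenshteinSim(), using base R's adist().
lev_sim <- function(a, b) {
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# Row-by-row comparison of the two columns.
gw$test <- lev_sim(gw$ITEM_DESCRIPTION, gw$ITEM_SIGNING_DESCRIPTION)
gw$test
```

With RecordLinkage loaded, the last assignment can go back to the original call, levenshteinSim(gw$ITEM_DESCRIPTION, gw$ITEM_SIGNING_DESCRIPTION), once both columns are character.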

Related

Error in using grep in SparkR

I am having an issue with subsetting my Spark DataFrame.
I have a DataFrame called nfe, which contains a column called ITEM_PRODUTO that is formatted as a string. I would like to subset this DataFrame based on whether the item column contains the word "AREIA". I can easily subset the data based on an exact phrase:
nfe.subset1 <- subset(nfe, nfe$ITEM_PRODUTO == "AREIA LAVADA FINA")
nfe.subset2 <- subset(nfe, nfe$ITEM_PRODUTO %in% "AREIA")
However, what I would like is a subset of all rows that contain the word "AREIA" in the ITEM_PRODUTO column. When I try to use grep, though, I receive an error message:
nfe.subset3 <- subset(nfe, grep("AREIA", nfe$ITEM_PRODUTO))
# Error in as.character.default(x) :
# no method for coercing this S4 class to a vector
I've tried multiple iterations of syntax, and tried grepl as well, but nothing seems to work. It's probably a syntax error, but could anyone help me out?
Thanks!
Standard R functions cannot be applied to a SparkDataFrame. Use either like:
where(nfe, like(nfe$ITEM_PRODUTO, "%AREIA%"))
or rlike:
where(nfe, rlike(nfe$ITEM_PRODUTO, ".*AREIA.*"))

fishers exact test help creating a matrix

I am a second year M.Sc student and I am running into a bit of a snag running my statistics.
I am trying to build a contingency table and run Fisher's exact test, but I keep getting an error:
Error in fisher.test(GAL4UAS) : if 'x' is not a matrix, 'y' must be given
If anyone can see what I have done wrong/may be missing I would really appreciate it?
This is the code:
setwd("/Users/Pria/Desktop/Data Analysis/")
GAL4UAS <-- data.frame(Yes=c(20,21,19),No=c(10,9,11))
GAL4UAS <- lapply(GAL4UAS, abs)
fisher.test(GAL4UAS)
fisher.test(GAL4UAS[c(1,2)])
fisher.test(GAL4UAS[c(1,3)])
fisher.test() is anticipating a matrix as an input and not a data frame. Try putting your data into a matrix. One option among several would be:
m <- matrix(c(20,21,19,10,9,11),nrow = 3,ncol=2,byrow=FALSE)
fisher.test(m)
When you apply abs() with lapply(), the output is a list, not a data frame. apply(), by contrast, returns a matrix, which is what fisher.test() expects. So maybe you can try this:
GAL4UAS <- data.frame(Yes=c(20,21,19),No=c(10,9,11))
GAL4UAS <- apply(GAL4UAS, abs, MARGIN=c(1,2))
fisher.test(GAL4UAS)
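A side note on the question's code: `<--` is parsed as `<- -`, so the data frame was stored with negated counts, which is presumably why abs() was needed at all. A minimal sketch that avoids both steps, assuming the pairwise calls were meant to compare rows (groups) rather than columns:

```r
# Assign with a plain `<-` so the counts stay positive; no abs() is needed.
GAL4UAS <- data.frame(Yes = c(20, 21, 19), No = c(10, 9, 11))

# fisher.test() wants a matrix, so convert the data frame directly.
m <- as.matrix(GAL4UAS)
res <- fisher.test(m)
res$p.value

# Pairwise comparisons on 2 x 2 sub-tables (assumption: rows are the groups).
res12 <- fisher.test(m[c(1, 2), ])
res13 <- fisher.test(m[c(1, 3), ])
```

Note that GAL4UAS[c(1,2)] on a data frame selects columns, not rows, so the row index needs the trailing comma as in m[c(1, 2), ].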

R Apply a function to matrix where each vector (row) is an argument

I am trying to write a function to apply a function to each row in a matrix, but the problem is I need each vector(row) in the matrix to be used as an argument for the function. I'm using sapply so I can store it as a result matrix and sort it.
What I have so far is
r=apply(m,1,cosineSim(x['word',]))
where cosineSim is defined as
cosineSim <- function(v1,v2){
a <- sum(v1*v2)
b <- sqrt(sum(v1*v1))* sqrt(sum(v2*v2))
return(a/b)
}
But the problem I'm having is I can't figure out how to use each vector that's being applied as an argument for the cosine function which takes two vectors. I have one vector, but the second is supposed to be the current row that the apply function is on. I'm new to R so please forgive me if my solution is trivial. Thanks for any help.
Some sample data I'm working with includes:
the 0.41800 0.249680 -0.41242 0.121700 0.345270 -0.044457 -0.49688 -0.178620 -0.00066023 -0.656600 0.278430 -0.14767 -0.55677 0.14658 -0.0095095
. 0.15164 0.301770 -0.16763 0.176840 0.317190 0.339730 -0.43478 -0.310860 -0.44999000 -0.294860 0.166080 0.11963 -0.41328 -0.42353 0.5986800
of 0.70853 0.570880 -0.47160 0.180480 0.544490 0.726030 0.18157 -0.523930 0.10381000 -0.175660 0.078852 -0.36216 -0.11829 -0.83336 0.1191700
to 0.68047 -0.039263 0.30186 -0.177920 0.429620 0.032246 -0.41376 0.132280 -0.29847000 -0.085253 0.171180 0.22419 -0.10046 -0.43653 0.3341800
and 0.26818 0.143460 -0.27877 0.016257 0.113840 0.699230 -0.51332 -0.473680 -0.33075000 -0.138340 0.270200 0.30938 -0.45012 -0.41270 -0.0993200
in 0.33042 0.249950 -0.60874 0.109230 0.036372 0.151000 -0.55083 -0.074239 -0.09230700 -0.328210 0.095980 -0.82269 -0.36717 -0.67009 0.4290900
This is a small example of the matrix I'm working with and I'm trying to use each of those rows as a vector for my cosineSim function.
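One way this could work, as a sketch rather than a definitive answer: apply() passes each row as the first argument of the function, and any extra named arguments are forwarded unchanged, so the fixed vector can be supplied as v2. The matrix below uses only the first three values of three sample rows, and picking "the" as the reference word is an assumption made for illustration.

```r
cosineSim <- function(v1, v2) {
  a <- sum(v1 * v2)
  b <- sqrt(sum(v1 * v1)) * sqrt(sum(v2 * v2))
  a / b
}

# Truncated sample: first three values of three rows from the question.
m <- rbind(
  the = c(0.41800,  0.249680, -0.41242),
  of  = c(0.70853,  0.570880, -0.47160),
  to  = c(0.68047, -0.039263,  0.30186)
)

# Fixed reference vector (assumed choice for this example).
ref <- m["the", ]

# Each row becomes v1; ref is forwarded to every call as v2.
r <- apply(m, 1, cosineSim, v2 = ref)

# Named vector of similarities, most similar first.
sort(r, decreasing = TRUE)
```

The equivalent explicit form is apply(m, 1, function(row) cosineSim(row, ref)), which may read more clearly when the function takes several arguments.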

Bandwidth selection using NP package

New to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the variables take values in the natural numbers plus zero and have missing values. After trying to use the np (nonparametric) package, I have two problems. First, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error "number of regression data and response data do not match". Why do I get this, given that I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of the following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting data. Note the difference in various ways of subsetting. data$x and data[,i] (where i = column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data). In that case they work with column names
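The subsetting difference from EDIT2 can be checked quickly in base R. This is only a sketch with made-up data; it also keeps just the rows where both variables are present, so the two vectors stay the same length before they are handed to a function such as npregbw (that call is left out here so the example runs without the np package).

```r
# Made-up data with missing values in different rows.
dat <- data.frame(x = c(3, 4, NA, 2), y = c(1, NA, 5, 6))

# `$` and [, i] return plain vectors ...
is.atomic(dat$x)          # TRUE
is.atomic(dat[, 1])       # TRUE

# ... while single-bracket subsetting by name or position keeps a data frame.
is.data.frame(dat["x"])   # TRUE
is.data.frame(dat[1])     # TRUE

# Keep only rows where both x and y are observed.
keep <- complete.cases(dat)
x_clean <- dat$x[keep]
y_clean <- dat$y[keep]
length(x_clean) == length(y_clean)
```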

Error in R "undefined columns selected"

I am trying to initiate this code using the zoo command:
gld <- zoo(gld[,7], gld_dates)
Unfortunately I get an error message telling me this:
Error in `[.data.frame`(gld, , 7) : undefined columns selected
I want to use the zoo function to create zoo objects from my data.
The function should take two arguments: a vector of data and
a vector of dates.
This is the data I am using [link broken].
I believe I have 7 columns in my data set. Any ideas?
The code I am trying to implement is found here [link broken].
Is there anything wrong with this code?
You don't say what your gld_dates is exactly, but if gld starts as your original data and you want to make a zoo object of the 7th column ordering by the 1st column (dates), I can do
gld_zoo <- zoo(gld[, 7], gld[, 1])
just fine. Equivalently, but with more readability,
gld_zoo <- zoo(gld$Adj.close, gld$Date)
reminds me what each column is.
Subsetting by name requires the requested column names to match those in the data frame. This code subsets the french_fries dataset (from the reshape2 package) with potatoes instead of potato:
library(reshape2)
data("french_fries")
df_potato <- french_fries[, c("potatoes")]
and it fails with:
Error in `[.data.frame`(french_fries, , c("potatoes")) :
undefined columns selected
but using the right name potato works:
df_potato <- french_fries[, c("potato")]
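Before indexing by position or name, it can help to look at what the data frame actually contains. A small sketch with an invented two-column data frame (the column names Date and Adj.close are borrowed from the answer above, the values are made up):

```r
# Invented stand-in for the real data: only two columns, not seven.
gld <- data.frame(
  Date      = as.Date("2020-01-01") + 0:2,
  Adj.close = c(150.1, 151.3, 149.8)
)

# Check how many columns there are and what they are called.
ncol(gld)
names(gld)

# Asking for a column that does not exist triggers the error from the question;
# tryCatch() captures the message instead of stopping the script.
msg <- tryCatch(gld[, 7], error = function(e) conditionMessage(e))
msg

# Check by name or position before subsetting.
has_name <- "Adj.close" %in% names(gld)
has_col7 <- ncol(gld) >= 7
```

If ncol() reports fewer columns than expected, the file was probably read with the wrong separator or header setting, so inspecting str(gld) right after reading is a reasonable first step.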
