Equivalent of cut() function in SparkR

I am trying to create a variable that contains "buckets" of a numeric value in another column. For example:
nts$size_bucket<-cut(nts$loansize, c(0, 5000,10000, 25000,50000,100000,200000,300000, 500000,Inf),
c('<$5K', '5-10K', '10-25K', '25-50K', '50-100K', '100-200K', '200-300K', '300-500K', '500K+'))
In normal R, cut would work perfectly, but it doesn't appear to work with a SparkR dataframe and gives the exception:
'x' must be numeric
even though x is numeric.
Any suggestions for how to accomplish this in SparkR?
Thanks!

Related

Error in using grep in SparkR

I am having an issue with subsetting my Spark DataFrame.
I have a DataFrame called nfe, which contains a column called ITEM_PRODUTO that is formatted as a string. I would like to subset this DataFrame based on whether the item column contains the word "AREIA". I can easily subset the data based on an exact phrase:
nfe.subset1 <- subset(nfe, nfe$ITEM_PRODUTO == "AREIA LAVADA FINA")
nfe.subset2 <- subset(nfe, nfe$ITEM_PRODUTO %in% "AREIA")
However, what I would like is a subset of all rows that contain the word "AREIA" in the ITEM_PRODUTO column. When I try to use grep, though, I receive an error message:
nfe.subset3 <- subset(nfe, grep("AREIA", nfe$ITEM_PRODUTO))
# Error in as.character.default(x) :
# no method for coercing this S4 class to a vector
I've tried multiple iterations of syntax, and tried grepl as well, but nothing seems to work. It's probably a syntax error, but could anyone help me out?
Thanks!
Standard R functions cannot be applied to a SparkDataFrame. Use either like:
where(nfe, like(nfe$ITEM_PRODUTO, "%AREIA%"))
or rlike:
where(nfe, rlike(nfe$ITEM_PRODUTO, ".*AREIA.*"))
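For comparison, the same "contains" filter on an ordinary local data.frame can be sketched in base R. The rows below are made up for illustration, and nfe_local is a hypothetical local stand-in, not the SparkDataFrame from the question. grepl returns one TRUE/FALSE per row, which is what a row filter needs, whereas grep returns the positions of the matches.

```r
# Made-up rows; nfe_local is a hypothetical local data.frame, not a SparkDataFrame
nfe_local <- data.frame(
  ITEM_PRODUTO = c("AREIA LAVADA FINA", "CIMENTO", "AREIA GROSSA", "BRITA"),
  stringsAsFactors = FALSE
)

# grepl() gives a logical mask with one entry per row
has_areia <- grepl("AREIA", nfe_local$ITEM_PRODUTO, fixed = TRUE)

# grep() gives the indices of the matching rows instead
areia_idx <- grep("AREIA", nfe_local$ITEM_PRODUTO, fixed = TRUE)

# The logical mask selects the matching rows of the local data.frame
subset_local <- nfe_local[has_areia, , drop = FALSE]
print(subset_local)
```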

Alternative to loop for double brackets in data frame

I have an unusual problem with an SPSS dataset that I import to R via the haven package (I also made a post about this on GitHub). The dataset is full of variables with missing value definitions that are not included among value labels, which leads to errors in R. E.g., -77 is defined as a missing value, but not as a value label. Indexing the variable's column returns
Error: `x` and `labels` must be same type
The only way I've found to fix the issue is to apply a label, remove the missing value, then remove the label:
ds <- read_spss(sav.file, user_na=TRUE)
val_label(ds[[1]], -77) <- "temp"
na_values(ds[[1]]) <- NULL
val_label(ds[[1]], -77) <- NULL
The solution relies on double brackets (or $). I'm wondering what the fastest way to apply this to all numeric variables in a large dataset would be. I could easily do it with a for loop, but I'm looking for something faster.

Bandwidth selection using NP package

New to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the variables take values in the natural numbers plus zero and have missing values. After trying to use the non-parametric package, I have two problems. First, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error "number of regression data and response data do not match". Why do I get this, when I have the same number of elements in each vector?
Second, I would like to treat the data as ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic? It is just a vector of natural numbers and NAs.
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of the following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way the data was subset. Note the difference between the ways of subsetting: data$x and data[,i] (where i is the column number of x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors unless they take a data = argument, in which case they work with column names.
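The difference between these subsetting forms, and the effect of NAs falling in different rows of x and y, can be checked on a small made-up data frame. This is only a sketch in base R: dat and its values are invented for illustration, and np itself is not called.

```r
# Hypothetical data frame standing in for the imported CSV
dat <- data.frame(x = c(3, 4, NA, 2), y = c(1, 0, 2, NA))

v1 <- dat$x      # plain numeric vector
v2 <- dat[, 1]   # plain numeric vector (single column is dropped to a vector)
d1 <- dat["x"]   # one-column data frame
d2 <- dat[1]     # one-column data frame

is.atomic(v1)    # TRUE
is.atomic(d1)    # FALSE: a data frame is a list of columns

# x and y each have one NA, but in different rows, so dropping NAs
# separately would misalign them. Keep rows where both are observed.
ok <- complete.cases(dat$x, dat$y)
x_ok <- dat$x[ok]
y_ok <- dat$y[ok]
```

The filtered x_ok and y_ok have the same length and contain no missing values, so they could then be passed on as xdat and ydat.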

Barplot does not evaluate data in R

Thanks in advance for your response.
I am trying to create a stacked bar plot from a csv file, and I have run into the following hiccup:
First I put the csv into a variable:
test <- read.csv(file="test4.csv",sep=",",head=TRUE)
Then I try to create a bar plot using the following
barplot(test)
and I get the following error,
Error in barplot.default(test) : 'height' must be a vector or a matrix
so I try
barplot(t(test))
and it works but as expected the axes are switched, so I try
barplot(t(t(test)))
and it works, but I feel there must be a better solution than transposing the transposed.
The issue is that read.csv outputs a data frame and barplot expects either a vector or a matrix. The barplot function works when you transpose because t() coerces data frames to matrices.
If you either start with
test <- as.matrix(read.csv(file="test4.csv",sep=",",head=TRUE))
or later on do
barplot(as.matrix(test))
then you should be fine.
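A minimal sketch of that coercion, using a made-up data frame in place of test4.csv (the column names and values are assumptions, since the file itself is not shown):

```r
# Made-up data standing in for read.csv("test4.csv")
test <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

is.data.frame(test)   # TRUE: this is the type barplot() rejects
m <- as.matrix(test)  # numeric matrix, rows and columns unchanged
is.matrix(m)          # TRUE

# t(t(test)) holds the same values, it just takes a detour through t()
same_values <- all(m == t(t(test)))

# Each column of the matrix becomes one stacked bar
barplot(m)
```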

Error in R "undefined columns selected"

I am trying to initiate this code using the zoo command:
gld <- zoo(gld[,7], gld_dates)
Unfortunately I get an error message telling me this:
Error in `[.data.frame`(gld, , 7) : undefined columns selected
I want to use the zoo function to create zoo objects from my data.
The function should take two arguments: a vector of data and
a vector of dates.
This is the data I am using[LINK BROKEN].
I believe I have 7 columns in my data set. Any ideas?
The code I am trying to implement is found here[LINK BROKEN].
Is there anything wrong with this code?
You don't say what your gld_dates is exactly, but if gld starts as your original data and you want to make a zoo object of the 7th column ordering by the 1st column (dates), I can do
gld_zoo <- zoo(gld[, 7], gld[, 1])
just fine. Equivalently, but with more readability,
gld_zoo <- zoo(gld$Adj.close, gld$Date)
reminds me what each column is.
Subsetting requires the names of the subset columns to match those in the data frame. This code subsets the dataset french_fries with potatoes instead of potato:
library(reshape2)  # french_fries ships with the reshape2 package
data("french_fries")
df_potato <- french_fries[, c("potatoes")]
and it fails with:
Error in `[.data.frame`(french_fries, , c("potatoes")) :
undefined columns selected
but using the right name potato works:
df_potato <- french_fries[, c("potato")]
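One way to catch this before subsetting is to compare the requested names, or the requested column position, against what the data frame actually contains. A small sketch with a made-up data frame (prices and its columns are hypothetical, not the gld data from the question):

```r
# Hypothetical data frame with fewer columns than the code assumes
prices <- data.frame(
  Date = as.Date(c("2020-01-02", "2020-01-03")),
  Close = c(143.9, 145.8)
)

# By name: list any requested columns that do not exist
wanted <- c("Date", "Adj.Close")
missing_cols <- setdiff(wanted, names(prices))

# By position: column 7 only exists if there are at least 7 columns
has_col7 <- ncol(prices) >= 7

# Subset only with names that are actually present
safe <- prices[, intersect(wanted, names(prices)), drop = FALSE]
```

Printing names(prices) or ncol(prices) right after reading the file is often enough to see why a column index or name is rejected.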
