I have a dataset named "cog_gpa" in R subsetted from the fragile families data set that contains student GPA and some test scores used to measure cognitive abilities. I want to run a random forest to check which ones are more important than others in terms of predicting the GPA.
My dataset (cog_gpa) is a tibble (4898*5) and looks somewhat like this:
ch5dsss ch5ppvts ch5wj9ss ch5wj10ss GPA
13.0000 98 104.0000 117.0000 3.7500
9.0000 76 84.0000 84.0000 3.5000
9.3524 92 92.6763 97.9623 4.0250
When I check str(cog_gpa), I see that all the predictor variables are of the type "dbl+lbl" whereas GPA is of type "num".
I start with the following where I split by GPA:
cog_gpa_split <- initial_split(cog_gpa$GPA, prop = .7)
However, I get the following error:
Error: Can't select within an unnamed vector.
Run `rlang::last_error()` to see where the error occurred.
When I run rlang::last_error(), I get the following:
<error/rlang_error>
Can't select within an unnamed vector.
Backtrace:
1. rsample::initial_split(cog_gpa$GPA, prop = 0.7)
2. rsample::mc_cv(...)
3. tidyselect::vars_select(names(data), !!enquo(strata))
4. tidyselect:::eval_select_impl(...)
Run `rlang::last_trace()` to see the full context.
When I run rlang::last_trace(), I get the following:
<error/rlang_error>
Can't select within an unnamed vector.
Backtrace:
x
1. \-rsample::initial_split(cog_gpa$GPA, prop = 0.7)
2. \-rsample::mc_cv(...)
3. \-tidyselect::vars_select(names(data), !!enquo(strata))
4. \-tidyselect:::eval_select_impl(...)
How do I go about resolving this error? I think I need to ensure that all of my variables are of the same type (i.e. all of them should be of the type numeric) but I am not sure how to do that.
This has been happening recently, and I cannot work out how to resolve it. N.B. I am using RStudio v0.99.893.
I have created a character vector from a data.table, which I then attempt to View, and I receive the following error:
Error in View : 'names' attribute [4] must be the same length as the vector [1]
The original DT has ~10,000 observations of 12 variables, here is a subset capturing all classes:
> head(DT, 3)
HQ URL type ID1 ID2 completion date_first
1: imag image-welcome basic 444 24 0.1111111 2016-01-04 14:55:57
2: imag image-welcome basic 329 12 0.2222222 2016-03-15 11:37:21
3: imag image-confirm int 101 99 0.1111111 2016-01-06 20:55:07
as.character(sapply(DT, class))
[1] "character" "character" "character" "integer"
[5] "integer" "numeric" "c(\"POSIXct\", \"POSIXt\")"
From DT I create a character vector of the unique values of URL for a subset of interest (only 'imag' HQ):
URL.unique <- unique(DT[HQ == "imag", URL])
> class(URL.unique)
[1] "character"
> names(URL.unique)
NULL
> View(URL.unique)
Error in View : 'names' attribute [4] must be the same length as the vector [1]
> length(URL.unique)
[1] 262
Printing URL.unique in the console works fine, as does exporting it via write.table() but it is annoying that I cannot view it.
Unless there is something inherently incorrect about the above, I am resorting to reinstalling RStudio. I've already tried quitting and relaunching, in case the problem came from my habit of leaving multiple projects open on my computer for days.
Any help would be appreciated!
As noted by @Jonathan, this is currently filed with RStudio to investigate. Can confirm reinstalling and other measures did not resolve the issue, which still persists. If it is reproduced and filed as a bug, I would request @Jonathan to supply the details here for anyone else to tie into.
The workaround of View(data.frame(u = URL.unique)) does the job of launching the viewer on the data object of interest (thanks @Frank).
I am using View(as.matrix(df$col_name)) and it seems to be working well.
I have an error that I don't understand.
I have downloaded an Excel file with unemployment rates by country and by year.
Basically, column 1 is Country, column 2 is 1990, and so on.
I am trying to plot a histogram of the unemployment rate in 2005.
I use this code:
qplot(x=2005,y=Country,data=data)
But I always have this error:
Error: unexpected numeric constant in
I have tried to:
- convert all the names to character
- add a "y" before the year
- put brackets
But I still have this error.
Error: unexpected numeric constant in "qplot(y=data$2005"
Error: unexpected numeric constant in "qplot(x=y 2005"
With brackets, I have this error
Error: unexpected '[' in "qplot(x=["
Any idea? Many thanks in advance!
Edit:
Dataset: https://docs.google.com/spreadsheets/d/1frieoKODnD9sX3VCZy5c3QAjBXMY-vN7k_I9gR-gcU8/pub?gid=0
I have downloaded it (xlsx format), and changed the name of the first column
library(ggplot2)
library(readxl)
file<-"indicator_t 15-24 unemploy.xlsx"
excel_sheets(file)
data<-read_excel(file)
I've tried to plot:
qplot(x=2005,y=Total 15-24 unemployment (%),data=data)
Error: unexpected numeric constant in "qplot(x=2005,y=Total 15"
I have changed the name of the first column, and added a "y" before the years.
names2<-paste("y",names(data[,2:length(data)]))
data2<-c("Country",names2)
colnames(data)<-data2
I still have an error:
qplot(x=y2005,y=Country,data=data)
Error in eval(expr, envir, enclos) : object 'y2005' not found
There are several problems in your code, and you could certainly benefit from reading some basic references on R, such as http://tryr.codeschool.com/
What you are trying to do may be accomplished by
qplot ( x = data$"2005" , ylab="Total 15-24 unemployment (%)")
Here, the first argument specifies which data should be plotted, and ylab is used to set the y-axis label. Notice that this label must be enclosed by "quotes".
Edit:
Note also that "2005" may or may not be the name of your column. Check what your column names are with colnames(data).
Regarding the comment below, if the name of the column is actually 2005, you need to quote it as well. If you don't, R will interpret 2005 as a numerical constant:
> x$2000
Error: unexpected numeric constant in "x$2000"
> x$"2000"
[1] 1 2 4 6
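To illustrate the quoting rule, here is a minimal sketch on a made-up data frame (the values and the Country labels are invented for the example; check.names = FALSE is needed so that data.frame() keeps the column name "2000" instead of changing it to X2000):

```r
# Made-up data: a column whose name is the non-syntactic "2000"
x <- data.frame(Country = c("A", "B", "C", "D"),
                `2000`  = c(1, 2, 4, 6),
                check.names = FALSE)  # keep "2000" rather than "X2000"

names(x)       # inspect the actual column names first

# Three equivalent ways to reach the column without a parse error
a <- x$"2000"
b <- x$`2000`
d <- x[["2000"]]

# For comparison, the default behaviour renames the column
renamed <- names(data.frame(`2000` = 1))  # "X2000"
```

Backticks are the general way to refer to a name that R would otherwise parse as a number or an expression.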
I am following this tutorial by Rob Hyndman for initialization (additive).
Steps to calculate initial values are specified as:
I am running the steps above manually (with pen and paper) on the data set provided in Rob Hyndman's free online textbook. The values I got after the first two steps are:
I used the same data set in R, but the seasonal output values in R are drastically different (screenshot below).
Not sure what I am doing wrong. Any help would be appreciated.
Another interesting thing I have just noticed is that the initial level (l(t)) in the textbook is 33.8, but in the R output it is 48.24, which suggests that I am missing something while calculating manually.
EDIT:
Here is how I am calculating the moving average smooth (based on the formula used in Section 2 of this link):
After calculating it, I de-trended the data, i.e. original value minus smoothed value.
Then seasonal values: Which is
S1 =Average of Q1
S2 = Average of Q2
...
The first two values of your moving average are incorrect. You have assumed that the values prior to the first observation are zero. They are not zero, they are missing, which is quite different. It is impossible to compute the moving average for the first two observations for this reason.
The third and subsequent values of your moving average are only approximately correct because you have rounded the data to the first decimal point instead of using the data as provided in the fpp package in R.
The values obtained following this procedure are used as initial values in the optimization within ets(). So the output from ets() will not contain the initial values but the optimized values. The table in the book gives the optimized values. You will not be able to reproduce them using a simple procedure.
However, you can reproduce what is provided by HoltWinters because it does not do any optimization of initial values. Using HoltWinters, the initial seasonal values are given as:
> HoltWinters(y)$fitted[1:4,]
xhat level trend season
[1,] 43.73934 33.21330 1.207739 9.318302
[2,] 28.25863 35.65614 1.376490 -8.774002
[3,] 36.86581 37.57569 1.450688 -2.160566
[4,] 41.87604 38.83521 1.424568 1.616267
(The output in coefficients gives the final states not the initial states.)
The seasonal indices in the last column can be computed as follows:
       y MAsmooth   detrend detrend.adj
41.72746       NA        NA          NA
24.04185       NA        NA          NA
32.32810 34.41724 -2.089139   -2.160566
37.32871 35.64101  1.687695    1.616267
46.21315 36.82342  9.389730    9.318302
29.34633 38.04890 -8.702575   -8.774002
36.48291       NA        NA          NA
42.97772       NA        NA          NA
The last column is the adjusted detrended data (so they add to zero).
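As a rough sketch, the table can be reproduced in base R as below. The eight values of y are copied from the table (so they are rounded to five decimals), and the weights assume a centred 2x4 moving average for quarterly data; this is not the internal code of HoltWinters().

```r
# First eight observations, copied from the table above
y <- c(41.72746, 24.04185, 32.32810, 37.32871,
       46.21315, 29.34633, 36.48291, 42.97772)

# Centred 2x4 moving average: half weight on the two end points
w <- c(0.5, 1, 1, 1, 0.5) / 4
MAsmooth <- as.numeric(stats::filter(y, w, sides = 2))

# The first and last two values are NA (missing), not zero
detrend <- y - MAsmooth

# Subtract the mean so the seasonal indices add to zero
detrend.adj <- detrend - mean(detrend, na.rm = TRUE)

round(cbind(y, MAsmooth, detrend, detrend.adj), 6)
```

Because stats::filter() returns NA wherever the window runs past the data, the first two smoothed values stay missing instead of being computed from assumed zeros.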
I am working with data, 1st two columns are dates, 3rd column is symbol, and 4th and 5th columns are prices.
So, I created a subset of the data as follows:
test.sub<-subset(test,V3=="GOOG",select=c(V1,V4))
and then I try to plot a time series chart using the following
as.ts(test.sub)
plot(test.sub)
well, it gives me a scatter plot - not what I was looking for.
so, I tried plot(test.sub[1],test.sub[2])
and now I get the following error:
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
To make sure the no. of rows were same, I ran nrow(test.sub[1]) and nrow(test.sub[2]) and they both return equal rows, so as a newcomer to R, I am not sure what the fix is.
I also ran plot.ts(test.sub) and that works, but it doesn't show me the dates in the x-axis, which it was doing with plot(test.sub) and which is what I would like to see.
test.sub[1]
V1
1107 2011-Aug-24
1206 2011-Aug-25
1307 2011-Aug-26
1408 2011-Aug-29
1510 2011-Aug-30
1613 2011-Aug-31
1718 2011-Sep-01
1823 2011-Sep-02
1929 2011-Sep-06
2035 2011-Sep-07
2143 2011-Sep-08
2251 2011-Sep-09
2359 2011-Sep-13
2470 2011-Sep-14
2581 2011-Sep-15
2692 2011-Sep-16
2785 2011-Sep-19
2869 2011-Sep-20
2965 2011-Sep-21
3062 2011-Sep-22
3160 2011-Sep-23
3258 2011-Sep-26
3356 2011-Sep-27
3455 2011-Sep-28
3555 2011-Sep-29
3655 2011-Sep-30
3755 2011-Oct-03
3856 2011-Oct-04
3957 2011-Oct-05
4059 2011-Oct-06
4164 2011-Oct-07
4269 2011-Oct-10
4374 2011-Oct-11
4479 2011-Oct-12
4584 2011-Oct-13
4689 2011-Oct-14
str(test.sub)
'data.frame': 35 obs. of 2 variables:
$ V1:Class 'Date' num [1:35] NA NA NA NA NA NA NA NA NA NA ...
$ V4: num 0.475 0.452 0.423 0.418 0.403 ...
head(test.sub)
       V1       V4
1212 <NA> 0.474697
1313 <NA> 0.451907
1414 <NA> 0.423184
1516 <NA> 0.417709
1620 <NA> 0.402966
1725 <NA> 0.414264
Now that this is working, I'd like to add a 3rd variable to plot a 3d chart - any suggestions how I can do that. thx!
So I think there are a few things going on here that are worth talking through:
first, some example data:
test <- data.frame(End = Sys.Date()+1:5,
Start = Sys.Date()+0:4,
tck = rep("GOOG",5),
EndP= 1:5,
StartP= 0:4)
test.sub = subset(test, tck=="GOOG",select = c(End, EndP))
First, note that test and test.sub are both data frames, so calls like test.sub[1] don't really "mean" anything to R.** It's more R-ish to write test.sub[,1] by virtue of consistency with other R structures. If you compare the results of str(test.sub[1]) and str(test.sub[,1]) you'll see that R treats them slightly differently.
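A small sketch of that difference, using a hypothetical two-column data frame with fixed dates (not the asker's data):

```r
# Hypothetical stand-in for test.sub, with fixed dates for reproducibility
test.sub <- data.frame(End = as.Date("2011-10-19") + 0:4, EndP = 1:5)

one_col_df <- test.sub[1]    # one index in single brackets: a one-column data frame
date_vec   <- test.sub[, 1]  # row/column indexing: a plain Date vector

str(one_col_df)
str(date_vec)

is.data.frame(one_col_df)
is.data.frame(date_vec)
```

The first str() call reports a data.frame with one variable, the second a Date vector of length 5.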
You said you typed:
as.ts(test.sub)
plot(test.sub)
I'd guess you have extensive experience with some sort of OO-language; and while R does have some OO flavor to it, it doesn't apply here. Rather than transforming test.sub to something of class ts, this just does the transformation and throws it away, then moves on to plot the data frame you started with. It's an easy fix though:
test.sub.ts <- as.ts(test.sub)
plot(test.sub.ts)
But, this probably isn't what you were looking for either. Rather, R creates a time series that has two variables called "End" (which is the date now coerced to an integer) and "EndP". Funny business like this is part of the reason time series packages like zoo and xts have caught on so I'll detail them instead a little further down.
(Unfortunately, to the best of my understanding, R doesn't keep date stamps with its default ts class, choosing instead to keep start and end dates as well as a frequency. For more general time series work, this is rarely flexible enough)
You could perhaps get what you wanted by typing
plot(test.sub[,1], test.sub[,2])
instead of
plot(test.sub[1], test.sub[2])
since the latter runs into trouble given that you are passing two sub-data frames instead of two vectors (even though it looks like you would be).***
Anyways, with xts (and similarly for zoo):
library(xts) # You may need to install this
xtemp <- xts(test.sub[,2], test.sub[,1]) # Create the xts object
plot(xtemp)
# Dispatches a xts plot method which does all sorts of nice time series things
Hope some of this helps and sorry for the inline code that's not identified as such: still getting used to stack overflow.
Michael
**In reality, they access the lists that are used to structure a data frame internally, but that's more a code nuance than something worth relying on.
***The nitty-gritty is that when you pass plot(test.sub[1], test.sub[2]) to R, it dispatches the method plot.data.frame which takes a single data frame and tries to interpret the second data frame as an additional plot parameter which gets misinterpreted somewhere way down the line, giving your error.
The reason that you get the Error about different x and y lengths is immediately apparent if you do a traceback immediately upon raising the error:
> plot(test.sub[1],test.sub[2])
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
> traceback()
6: stop("'x' and 'y' lengths differ")
5: xy.coords(x, y, xlabel, ylabel, log)
4: plot.default(x1, ...)
3: plot(x1, ...)
2: plot.data.frame(test.sub[1], test.sub[2])
1: plot(test.sub[1], test.sub[2])
The problems in your call are manifold. First, as mentioned by @mweylandt, test.sub[1] is a data frame with a single component, not a vector composed of the contents of the first component of test.sub.
From the traceback, we see that the plot.data.frame method was called. R is quite happy to plot a data frame as long as it has at least two columns. R took you at your word and passed test.sub[1] (as a data.frame) on to plot() - test.sub[2] never gets a look in. test.sub[1] is eventually passed on to xy.coords() which correctly informs you that you have lots of rows for x but 0 rows for y because test.sub[1] only contains a single component.
It would have worked if you'd done plot(test.sub[,1], test.sub[,2], type = "l") or used the formula interface to name the variables plot(V4 ~ V1, data = test.sub, type = "l") as I show in my other Answer.
Surely it is easier to use the formula interface:
> test <- data.frame(End = Sys.Date()+1:5,
+ Start = Sys.Date()+0:4,
+ tck = rep("GOOG",5),
+ EndP= 1:5,
+ StartP= 0:4)
>
> test.sub = subset(test, tck=="GOOG",select = c(End, EndP))
> head(test.sub)
End EndP
1 2011-10-19 1
2 2011-10-20 2
3 2011-10-21 3
4 2011-10-22 4
5 2011-10-23 5
> plot(EndP ~ End, data = test.sub, type = "l")
I work extensively with time series type data and rarely, if ever, have any need for the "ts" class of objects. Packages zoo and xts are very useful, but if all you want to do is plot the data, i) get the date/time information correctly formatted/set-up as a "Date" or "POSIXt" class object, and then ii) just plot it using standard graphics and type = "l" (or type = "b" or type = "o" if you want to see the observation times).
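A minimal sketch of those two steps, assuming the dates arrive as text in the "2011-Aug-24" style shown in the question (the three rows below are illustrative, and %b depends on the locale: it matches English month abbreviations only in an English or C locale):

```r
# Illustrative rows in the question's date style
raw <- data.frame(V1 = c("2011-Aug-24", "2011-Aug-25", "2011-Aug-26"),
                  V4 = c(0.475, 0.452, 0.423),
                  stringsAsFactors = FALSE)

# i) convert the text to a proper "Date" object; a wrong format string
#    silently yields NA, so it is worth checking the result
raw$V1 <- as.Date(raw$V1, format = "%Y-%b-%d")
anyNA(raw$V1)

# ii) plot with standard graphics; the x-axis is labelled with dates
plot(V4 ~ V1, data = raw, type = "l")
```

If anyNA() reports TRUE, the format string (or the locale) does not match the text, which would be consistent with the NA values seen in str(test.sub) in the question.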