TM - Clustering data with special date variable - r

I've got the following data from TripAdvisor:
'data.frame': 682 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : Factor w/ 674 levels "id","rn106322397",..: 672 671 670 669 668 667 666 665 664 663 ...
$ quote : Factor w/ 606 levels "\"Picturesque Lake Konigssee\"",..: 389 139 113 149 384 39 176 598 199 603 ...
$ rating : Factor w/ 6 levels "1","2","3","4",..: 3 5 5 5 4 5 5 5 4 5 ...
$ date : Factor w/ 505 levels "date","Reviewed 1 August 2014\n",..: 200 200 427 427 427 443 434 351 313 494 ...
$ reviewnospace: Factor w/ 674 levels "- Good car parking facilities- Organized boat trips- Ensure that you have enough time at hand for the boat trip",..: 624 573 144 211 507 26 351 672 451 249 ...
I am trying to cluster the data by date to get two groups, winter and summer vacationers, so that I can analyse the reviews of each group afterwards. I am using the tm package and tried the following code:
> x <- read.csv ("seeganz.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
> corp <- VCorpus(VectorSource(x$reviewnospace), readerControl = list(language = "eng"))
> meta(corp,tag = "date") <- x$date
> idx <- meta(corp, "date") == 'December'
But it does not work; the result contains 0 documents:
> corp [idx]
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 1
Content: documents: 0
As the date has the structure "Reviewed 1 August 2014", how do I have to adapt this code to get, for example just the reviews from Nov - Feb?
Do you have any idea how I can solve this problem?
Thank you.

Generic approach:
Use substr(date, 10, nchar(date)) to strip the leading "Reviewed " and keep "1 August 2014"; call this new vector dateNew.
Use a standard date function, e.g. as.Date(dateNew,...), to convert dateNew into a vector of class Date, on which you can do subsetting, subtraction and other operations.
Reference, from http://www.statmethods.net/input/dates.html:
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
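The comparison in the question, meta(corp, "date") == 'December', tests the whole string "Reviewed 1 August 2014\n" against "December", so it never matches. The following is a minimal sketch of the approach above in base R, not a definitive solution. The vector date_raw is a made-up sample standing in for x$date, and the month is looked up in month.name rather than parsed with as.Date so that the result does not depend on the session locale.

```r
# Hypothetical sample mimicking x$date, including the stray header value "date"
date_raw <- c("Reviewed 1 August 2014\n", "Reviewed 15 December 2013\n",
              "Reviewed 3 February 2014\n", "date")

# Strip the "Reviewed " prefix and the trailing newline
date_new <- trimws(sub("^Reviewed ", "", date_raw))

# Take the month name (second word) and map it to 1..12 via month.name;
# anything that is not "day month year" becomes NA
month_name <- vapply(strsplit(date_new, " "),
                     function(p) if (length(p) == 3) p[2] else NA_character_,
                     character(1))
month_num <- match(month_name, month.name)

# TRUE for reviews written from November to February
is_winter <- month_num %in% c(11, 12, 1, 2)
is_winter
```

Assuming the corpus was built from the same rows in the same order, a logical vector like is_winter should be usable in place of idx, i.e. corp[is_winter].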

Related

How do I plot asset stock prices in R?

I'm trying to plot asset stock prices in R. I'm downloading the data in csv format from Yahoo Finance and then importing it to R so I can run some statistical tests on it and draw a few plots.
I'm currently trying to plot the closing price vs the date, and I'm not having a lot of success. R is just plotting it as a series of distinct points and won't join these points up with lines, despite me trying to use the argument type = "l".
price <- read.csv("~/Downloads/AAPL.csv")
plot(price$Date,price$Close,type="l")
I'm just grabbing the data from here: https://finance.yahoo.com/quote/AAPL/history?p=AAPL
I get an output like this every time, regardless of what kind of extra arguments I try.
For example, I tried to make the line red, but nothing changed at all.
Thanks!
The problem is that price$Date is a factor (categorical variable) and not a number. You can convert the date string to a POSIX timestamp with as.POSIXlt, and then compute a floating point representation from it, e.g. year + yday/366.
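A small sketch of the year + yday/366 idea, assuming the dates arrive as "YYYY-MM-DD" strings; date_strings is a made-up sample, not the real CSV column.

```r
# Hypothetical sample of the Date column as read from the CSV
date_strings <- c("2019-07-10", "2019-07-11", "2019-07-12")

lt <- as.POSIXlt(date_strings, tz = "UTC")

# $year counts years since 1900 and $yday is the zero-based day of the year
decimal_year <- lt$year + 1900 + lt$yday / 366
decimal_year
```

The resulting numeric vector can be used as the x argument of plot() with type = "l"; converting with as.Date(), as in the other answers, is usually simpler and gives properly labelled axes.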
Try this
price$Date = as.Date(price$Date)
plot(price$Date,price$Close,type="l",col=4) # the CSV column is Close; AAPL.Close is the quantmod name
or better
library(quantmod)
fro = '2014-07-31'
Apple = getSymbols('AAPL',auto.assign = F,from=fro)
chartSeries(Apple,subset = "last 3 years")
You don't need to use a package unless you want to create candlestick charts.
df <- read.csv("AAPL.csv")
> str(df)
'data.frame': 254 obs. of 7 variables:
$ Date : Factor w/ 254 levels "2019-07-10","2019-07-11",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Open : num 202 203 202 204 205 ...
$ High : num 204 204 204 206 206 ...
$ Low : num 202 202 202 204 204 ...
$ Close : num 203 202 203 205 204 ...
$ Adj.Close: num 201 199 201 203 202 ...
$ Volume : int 17897100 20191800 17595200 16947400 16866800 14107500 18582200 20929300 22277900 18355200 ...
df$Date <- as.Date(df$Date) # Otherwise it is treated as a factor variable
> str(df)
'data.frame': 254 obs. of 7 variables:
$ Date : Date, format: "2019-07-10" "2019-07-11" "2019-07-12" "2019-07-15" ...
$ Open : num 202 203 202 204 205 ...
$ High : num 204 204 204 206 206 ...
$ Low : num 202 202 202 204 204 ...
$ Close : num 203 202 203 205 204 ...
$ Adj.Close: num 201 199 201 203 202 ...
$ Volume : int 17897100 20191800 17595200 16947400 16866800 14107500 18582200 20929300 22277900 18355200 ...
plot(y=df$Close, x=df$Date, col="red", type = "l") # look at ?plot for more details

Observations becoming NA when ordering levels of factors in R with ordered()

I have a longitudinal data frame p that contains 5 variables and looks like this:
> head(p)
date.1 County.x providers beds price
1 Jan/2011 essex 258 5545 251593.4
2 Jan/2011 greater manchester 108 3259 152987.7
3 Jan/2011 kent 301 7191 231985.7
4 Jan/2011 tyne and wear 103 2649 143196.6
5 Jan/2011 west midlands 262 6819 149323.9
6 Jan/2012 essex 2 27 231398.5
The structure of my variables is the following:
'data.frame': 259 obs. of 5 variables:
$ date.1 : Factor w/ 66 levels "Apr/2011","Apr/2012",..: 23 23 23 23 23 24 24 24 25 25 ...
$ County.x : Factor w/ 73 levels "avon","bedfordshire",..: 22 24 32 65 67 22 32 67 22 32 ...
$ providers: int 258 108 301 103 262 2 9 2 1 1 ...
$ beds : int 5545 3259 7191 2649 6819 27 185 24 70 13 ...
$ price : num 251593 152988 231986 143197 149324 ...
I want to order date.1 chronologically. Before applying ordered(), this variable contains no NA observations.
> summary(is.na(p$date.1))
Mode FALSE NA's
logical 259 0
However, once I apply my function for ordering the levels corresponding to date.1:
p$date.1 = with(p, ordered(date.1, levels = c("Jun/2010", "Jul/2010",
"Aug/2010", "Sep/2010", "Oct/2010", "Nov/2010", "Dec/2010", "Jan/2011", "Feb/2011",
"Mar/2011","Apr/2011", "May/2011", "Jun/2011", "Jul/2011", "Aug/2011", "Sep/2011",
"Oct/2011", "Nov/2011", "Dec/2011" ,"Jan/2012", "Feb/2012" ,"Mar/2012" ,"Apr/2012",
"May/2012", "Jun/2012", "Jul/2012", "Aug/2012", "Sep/2012", "Oct/2012", "Nov/2012",
"Dec/2012", "Jan/2013", "Feb/2013", "Mar/2013", "Apr/2013", "May/2013",
"Jun/2013", "Jul/2013", "Aug/2013", "Sep/2013", "Oct/2013", "Nov/2013",
"Dec/2013", "Jan/2014",
"Feb/2014", "Mar/2014", "Apr/2014", "May/2014", "Jun/2014", "Jul/2014" ,"Aug/2014",
"Sep/2014", "Oct/2014", "Nov/2014", "Dec/2014", "Jan/2015", "Feb/2015", "Mar/2015",
"Apr/2015","May/2015", "Jun/2015" ,"Jul/2015" ,"Aug/2015", "Sep/2015", "Oct/2015",
"Nov/2015")))
it seems I lose some observations:
> summary(is.na(p$date.1))
Mode FALSE TRUE NA's
logical 250 9 0
Has anyone come across this problem when using ordered()? Alternatively, is there any other way to order my observations chronologically?
It is possible that some values of p$date.1 do not match any of the levels. Try using this ord.mon as the levels.
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))
Then, you can try this to see if there's any mismatch between the two.
p$date.1 %in% ord.mon
Lastly, you can also sort the data frame after transforming the date.1 column into a Date (note that you have to add an actual day of the month beforehand).
p <- p[order(as.Date(paste0("01/", p$date.1), "%d/%b/%Y")), ]
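To illustrate why values turn into NA, here is a minimal sketch with a made-up factor (date_f is hypothetical, not the real p$date.1). Any value that is not among the supplied levels becomes NA when ordered() is applied, and setdiff() lists the offending values.

```r
# All month/year labels from Jan/2010 to Dec/2015, in chronological order
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))

# The levels used in the question run from Jun/2010 to Nov/2015
lv <- ord.mon[6:71]

# Hypothetical data: two values fall outside that range
date_f <- factor(c("Jan/2011", "May/2010", "Jan/2012", "Dec/2015"))

ordered_f <- ordered(date_f, levels = lv)
is.na(ordered_f)                    # the unmatched values have become NA
setdiff(as.character(date_f), lv)   # which values caused it
```

Running the setdiff() line on the real column should show which of the nine observations carry a label (for example with stray whitespace or a different spelling) that is missing from the level list.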

can't draw the grouped value above stacked bar plot in ggplot2

I have a ggplot2 question. The code below correctly draws the stacked bar plot, without the values above each bar:
p=ggplot(data=essnn)
p+geom_bar(binwidth=0.5,stat="identity")+ #
aes(x=reorder(classname,-amount,sum), y=amount, label=amount, fill = sort(year))+
theme()
I want to add the summed amount (over all years) for each class above its bar, and here is my code:
+geom_text(aes(x=classes,y=total,label=total), data=essnnta, fill=NULL, size=3)
But an error message appears:
Error in fill = year, can not find object "year"
My question: why can the object "year" be found when I draw the stacked bar plot without the totals, but not when I add the totals for each class?
> str(essnn)
'data.frame': 48619 obs. of 15 variables:
$ id : int 2006051337 2006051337 2006051337 2006051337 2006051337 2006051337 2004070648 2006031360 2006031360 2004070062 ...
$ gender : Factor w/ 3 levels "","F","M": 3 3 3 3 3 3 3 3 3 3 ...
$ age : num 30 30 30 30 30 30 38 43 43 37 ...
$ class : Factor w/ 92 levels "100ab","100aa",..: 18 18 18 18 18 18 18 18 18 18 ...
$ classname: Factor w/ 1136 levels "cad"," Office2010",..: 111 111 111 111 111 111 116 107 107 107 ...
$ grade : num 7 5 6 8 3 4 1 4 3 2 ...
$ year : Factor w/ 6 levels "98","99","100",..: 3 3 3 3 2 2 4 5 5 3 ...
$ ses : num 212 210 211 213 207 208 217 221 220 210 ...
$ date : int 1010421 1001115 1010214 1010701 1000411 1000627 1020424 1030304 1021121 1001108 ...
$ money : num 5800 5800 5800 5800 5200 5200 3000 0 5500 5500 ...
$ discount : num 1160 1160 1160 1160 1040 1040 600 0 275 550 ...
$ amount : num 4640 4640 4640 4640 4160 ...
$ idc : Factor w/ 7 levels "在校生","校友",..: 2 2 2 2 2 2 2 7 7 7 ...
$ mdy : Date, format: "2012-04-21" "2011-11-15" "2012-02-14" "2012-07-01" ...
$ day : num 1123 1281 1190 1052 1499 ...
> str(essnnta)
'data.frame': 10 obs. of 2 variables:
$ classes: Factor w/ 10 levels "JD","JF",..: 1 7 8 4 6 10 3 5 2 9
$ total : num 55603526 43708950 43555010 35649129 33214372 ...
Your problem might be that your x-axes are not the same in the two data frames, so ggplot does not know which value corresponds to which stack. I am not sure about this, as I don't understand the way you define your x axis in the original barplot. I also find it a bit strange to define the aes outside of the ggplot function or the geom_bar, but that might just be me being used to a different kind of syntax.
All in all, I find it difficult to help you as you do not provide a reproducible example.
Here is a small bit of data, and a plot that sort of works. If you supplement your question with your data (or a subset of it), see if this works. You may also want to position the label at the top of each bar.
library(ggplot2)
essnn <- data.frame(year = c(98,99,100,101,102),
                    classname = c("a", "b", "c", "d", "e"),
                    amount = c(1e6, 2e6,3e6,4e6,5e6))
essnnta <- data.frame(total = c(10, 20, 30, 40, 50))
ggplot(data=essnn, aes(x=reorder(classname,-amount, sum), y=amount, fill = year)) +
  geom_bar(stat="identity", position = "stack") + # binwidth dropped: it has no effect with stat="identity"
  geom_text(aes(x=essnn$classname, y=essnnta$total, label=essnnta$total), size=3) # not "classes"
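A likely cause of the error, which the answer above does not spell out: an aes() added at plot level is inherited by every layer, so geom_text() also tries to evaluate the fill mapping on year, and the totals data frame has no year column. Passing fill=NULL outside aes() does not remove the inherited mapping. The sketch below uses made-up data (sales and totals are hypothetical names) and switches inheritance off for the text layer with inherit.aes = FALSE.

```r
library(ggplot2)

# Hypothetical data: two classes, two years each
sales <- data.frame(
  classname = c("a", "a", "b", "b"),
  year      = factor(c(98, 99, 98, 99)),
  amount    = c(10, 20, 5, 15)
)

# One total per class, to be printed above each stacked bar
totals <- aggregate(amount ~ classname, data = sales, FUN = sum)

p <- ggplot(sales, aes(x = classname, y = amount, fill = year)) +
  geom_bar(stat = "identity") +
  # inherit.aes = FALSE stops this layer from looking for `year` in `totals`
  geom_text(data = totals,
            aes(x = classname, y = amount, label = amount),
            inherit.aes = FALSE, vjust = -0.5, size = 3)
p
```

An alternative with the same effect is to move fill = year out of the plot-level aes() and into the aes() of geom_bar() only.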

R. Handling dates and wide format from an imported Stata file

I have been given a Stata data file (counts.dta) that contains daily counts for the years 1975 to 2006 stored in wide format. The columns are labelled month (full name of the month as a character string), day (numeric with values 1-31), and then the years from 1975 to 2006 with labels '_1975', '_1976' ... '_2006'. I assume that the underscore is a consequence of something in Stata. There are dummy counts of zero (0) inserted for the date 29 February when the year-column is not a leap year.
I want to do several things. First, convert to long form with a sensible representation for year. Second, change the tri-partite representation of the date to something more sensible.
My approach has been to change the character string month to a factor and then to get it into the correct order:
require("foreign")
counts <- read.dta(file='counts.dta')
counts[['month']] <- as.factor( counts[['month']] )
counts[['month']] <-
factor(counts[['month']], levels( counts[['month']] )[c(5,4,8,1,9,7,6,2,12,11,10,3)])
I then have
str( counts )
'data.frame': 366 obs. of 34 variables:
$ month: Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ _1975: int 515 649 745 599 445 667 725 749 646 740 ...
$ _1976: int 485 685 529 467 630 723 712 685 715 504 ...
$ _1977: int 505 437 489 588 634 734 682 537 453 673 ...
and so forth. Converting to long format
lcounts <- reshape(counts,
direction="long",
varying=list(names( counts )[3:34]),
v.names="n.counts",
idvar=c("month","day"),
timevar="Year",
times=1975:2006)
str( lcounts )
gives
'data.frame': 11712 obs. of 4 variables:
$ month : Factor w/ 12 levels "January","February",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ Year : int 1975 1975 1975 1975 1975 1975 1975 1975 1975 1975 ...
$ n.counts: int 515 649 745 599 445 667 725 749 646 740 ...
plus some further lines relating to the original Stata file.
My questions are: (1) what is now a good way to convert the factor month, numeric year and numeric day to a useful date format, so that I can determine, for example, the day of the week, the interval between two dates and so on? (2) Was there a better way to have tackled the problem from the start?
This should be pretty easy because all you have to do is paste together the columns of your data.frame and use as.Date to create a Date class vector.
Let's start with some data similar to yours:
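dat <- data.frame(month = c(rep("January",31), rep("February",29)),
                  day = c(1:31, 1:29),
                  Year = 1975,
                  n.counts = 515,
                  stringsAsFactors = TRUE) # keep month as a factor, as in the question
Then the creation of the date variable is simple:
# match() against month.name gives the calendar month number whatever the factor level order;
# as.numeric(month) would give February = 1 here because the levels are alphabetical
dat$Date <- as.Date(with(dat, paste(match(month, month.name), day, Year)), "%m %d %Y")
# 29 February 1975 does not exist, so that row gets NA
str(dat)
# 'data.frame': 60 obs. of 5 variables:
# $ month : Factor w/ 2 levels "February","January": 2 2 2 2 2 2 2 2 2 2 ...
# $ day : int 1 2 3 4 5 6 7 8 9 10 ...
# $ Year : num 1975 1975 1975 1975 1975 ...
# $ n.counts: num 515 515 515 515 515 515 515 515 515 515 ...
# $ Date : Date, format: "1975-01-01" "1975-01-02" "1975-01-03" "1975-01-04" ...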
The main focus in this thread is naturally what to do in R after data import, but here I bundle together various details on the Stata side of this.
It is longstanding advice that data of this kind are much more easily handled in Stata in a long shape and reshape long is a standard command to do that conversion for data arriving with each year's data in a separate variable (R users: please read "column" as a translation). So, if possible, you should ask a provider of such Stata files to do that before export.
What the OP calls labels such as _1975 are legal variable names in Stata, and as the OP guesses the underscore is needed because variable names in Stata may not start with numeric characters.
On the information given, it would have been possible to export the data without loss from Stata in file formats other than .dta, notably as the usual kinds of text files (.csv, etc.).
Stata's preferred way of holding daily dates is as integers with origin 0 = 1 January 1960 (so 26 March 2015 would be 20173), which presumably is trivially easy to convert to any date representation in R.
In short, the particular and indeed peculiar form of the data as presented to the OP is in no sense either required by any Stata syntax or even recommended as part of good Stata practice.
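For the R side of the point about Stata daily dates, a minimal sketch: as.Date() accepts a numeric day count together with an origin, so an integer exported from Stata can be converted directly. The variable name stata_day is made up for illustration.

```r
# A Stata daily date: number of days since 1 January 1960
stata_day <- 20173

r_date <- as.Date(stata_day, origin = "1960-01-01")
format(r_date)   # "2015-03-26"

# Going the other way: days elapsed since the Stata origin
as.numeric(as.Date("2015-03-26") - as.Date("1960-01-01"))   # 20173
```

Once the values are of class Date, intervals are plain subtraction and weekdays() returns the day of the week.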

How to split the vector into small group in R?

x<-rnorm(5000,5,3)
How can I split x into 500 groups, with ten numbers in every group?
Answer #1:
x<-rnorm(5000,5,3)
y<-matrix(nrow=500,ncol=10)
y[]<-x
Answer #2:
Skip the first step and just create the matrix directly.
y<-matrix(rnorm(5000,5,3),nrow=500,ncol=10)
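Note that matrix() fills column by column, so in the two answers above each row holds ten values that are 500 positions apart in x rather than ten consecutive ones. If consecutive runs of ten are wanted, one alternative (a sketch, not taken from the answers) is split() with a grouping index:

```r
x <- rnorm(5000, 5, 3)

# Group index 1,1,...,1,2,2,...: ten consecutive elements per group
groups <- split(x, ceiling(seq_along(x) / 10))

length(groups)         # number of groups
groups[[1]]            # the first ten values of x

# Same grouping as a 500 x 10 matrix, one group per row
y <- matrix(x, ncol = 10, byrow = TRUE)
```

The list form is convenient for sapply(groups, mean) and similar per-group summaries; the matrix form works with rowMeans(y).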
Are you looking for something like this:
# create a vector of group labels
group <- rep(sample(1:500,replace=F,size=500),10)
group.name <- paste("group",as.character(group),sep=" ")
# create a dataframe of groups and corresponding values
df <- data.frame(group=group.name,value=rnorm(5000,5,3))
# check the dataframe
str(df)
'data.frame': 5000 obs. of 2 variables:
$ group: Factor w/ 500 levels "group 1","group 10",..: 271 115 404 252 138 243 375 308 434 16 ...
$ value: num 8.55 10.14 3.71 8.79 4.17 ...
head(df)
group value
1 group 342 8.547406
2 group 201 10.135465
3 group 462 3.713305
4 group 325 8.786934
5 group 222 4.171373
6 group 317 3.478123

Resources