I would like to collect information on several stocks using a loop and save everything into a single data frame. I think I need a loop because the approach I have used (see below) is not very efficient: it retrieves information for only some of the stocks and skips others. Below is what I've tried:
library(quantmod)
library(TTR)
stocks <-c("MRO", "TSLA", "HAL", "XOM", "DIN", "DRI", "DENN","WEN", "SPCE", "DE", "DRI", "KSS", "AAL","DFS", "LYV","SPXL")
dataEnv <- new.env()
getSymbols(stocks, from = "2014-02-01",to= "2016-01-01", env=dataEnv)
plist <- eapply(dataEnv,Ad)
pframe <- do.call(merge, plist)
pframe1 <- as.data.frame(apply(pframe[,1:ncol(pframe)],2,function(x) diff(x)*100/head(x,-1)))
You can use either the tidyquant or the BatchGetSymbols package. My personal preference is the latter when dealing with data coming from Yahoo.
Using tidyquant:
library(tidyquant)
stocks <-c("MRO", "TSLA", "HAL", "XOM", "DIN", "DRI", "DENN","WEN", "SPCE", "DE", "DRI", "KSS", "AAL","DFS", "LYV","SPXL")
tq_stocks <- tq_get(stocks, from = "2014-02-01",to= "2016-01-01")
tq_stocks
# A tibble: 7,245 x 8
symbol date open high low close volume adjusted
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 MRO 2014-02-03 32.8 32.8 32.0 32.1 8983000 28.1
2 MRO 2014-02-04 32.2 32.4 31.9 32.3 10932900 28.4
3 MRO 2014-02-05 32.3 32.4 31.6 32.1 6534500 28.1
4 MRO 2014-02-06 31.7 33.0 31.6 31.8 9408400 27.9
5 MRO 2014-02-07 31.9 32.8 31.7 32.6 8184400 28.6
6 MRO 2014-02-10 32.5 32.5 32.0 32.3 5862600 28.3
7 MRO 2014-02-11 32.3 32.9 32.3 32.7 6140400 28.7
8 MRO 2014-02-12 33.0 33.3 32.8 33.3 5202500 29.2
9 MRO 2014-02-13 33.0 33.4 32.7 33.3 6755900 29.2
10 MRO 2014-02-14 33.0 33.4 32.9 33.2 6096300 29.3
tidyquant will give some warnings. You can ignore them; a ticket has been opened to address them.
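Since the question goes on to compute percentage changes of the adjusted prices, here is a minimal base R sketch of doing the same on a long table shaped like the tq_get() result. The data frame `prices`, the symbols "AAA"/"BBB" and all numbers are made up for illustration; only the column names symbol, date and adjusted follow the output above.

```r
# Toy stand-in for the long table returned by tq_get(); values are invented.
prices <- data.frame(
  symbol   = rep(c("AAA", "BBB"), each = 4),
  date     = rep(as.Date("2014-02-03") + 0:3, times = 2),
  adjusted = c(10, 11, 12.1, 12.1, 50, 45, 49.5, 49.5),
  stringsAsFactors = FALSE
)

# Percent change from the previous row within each symbol.
# The first row of each symbol has no previous price, so it is NA.
prices$ret_pct <- ave(
  prices$adjusted, prices$symbol,
  FUN = function(x) c(NA, diff(x) * 100 / head(x, -1))
)

# Reshape to one row per date and one return column per symbol.
returns_wide <- reshape(
  prices[, c("symbol", "date", "ret_pct")],
  idvar = "date", timevar = "symbol", direction = "wide"
)
returns_wide
```

With real data you would replace `prices` by `tq_stocks`; the long format keeps every symbol in one data frame, so no loop is needed.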
Using BatchGetSymbols:
library(BatchGetSymbols)
batch_stocks <- BatchGetSymbols(stocks, first.date = "2014-02-01", last.date = "2016-01-01")
str(batch_stocks)
List of 2
$ df.control: tibble [15 x 6] (S3: tbl_df/tbl/data.frame)
..$ ticker : chr [1:15] "MRO" "TSLA" "HAL" "XOM" ...
..$ src : chr [1:15] "yahoo" "yahoo" "yahoo" "yahoo" ...
..$ download.status : chr [1:15] "OK" "OK" "OK" "OK" ...
..$ total.obs : int [1:15] 483 483 483 483 483 483 483 483 483 483 ...
..$ perc.benchmark.dates: num [1:15] 1 1 1 1 1 1 1 1 1 1 ...
..$ threshold.decision : chr [1:15] "KEEP" "KEEP" "KEEP" "KEEP" ...
$ df.tickers:'data.frame': 6762 obs. of 10 variables:
..$ price.open : num [1:6762] 32.8 32.2 32.3 31.7 31.9 ...
..$ price.high : num [1:6762] 32.8 32.4 32.4 33 32.8 ...
..$ price.low : num [1:6762] 32 31.9 31.6 31.6 31.7 ...
..$ price.close : num [1:6762] 32.1 32.3 32.1 31.8 32.6 ...
..$ volume : num [1:6762] 8983000 10932900 6534500 9408400 8184400 ...
..$ price.adjusted : num [1:6762] 28.1 28.4 28.1 27.9 28.6 ...
..$ ref.date : Date[1:6762], format: "2014-02-03" "2014-02-04" "2014-02-05" "2014-02-06" ...
..$ ticker : chr [1:6762] "MRO" "MRO" "MRO" "MRO" ...
..$ ret.adjusted.prices: num [1:6762] NA 0.00873 -0.00742 -0.00903 0.02483 ...
..$ ret.closing.prices : num [1:6762] NA 0.00873 -0.00742 -0.00903 0.02483 ...
batch_stocks will be a list of 2 data.frames. The first is a control data.frame that shows whether each ticker was downloaded correctly. The second data.frame contains all the ticker data. An advantage of BatchGetSymbols is that it can run in parallel if you use it in combination with the future package. Also, if you already have the data locally, it will not download it again: if you run this 3 times in a row, it downloads the data only once and reads the rest from the temporarily stored data.
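Because the original problem was that some stocks were silently skipped, the control data.frame is worth checking after every run. The sketch below uses a toy table with the column names shown in the str() output above; the ticker "BADTICK" and the status value "NOT OK" are assumptions for illustration, so the check is written as "anything that is not OK".

```r
# Toy stand-in for batch_stocks$df.control; "BADTICK" and "NOT OK" are assumed values.
control <- data.frame(
  ticker          = c("MRO", "TSLA", "BADTICK"),
  download.status = c("OK", "OK", "NOT OK"),
  total.obs       = c(483L, 483L, 0L),
  stringsAsFactors = FALSE
)

# Tickers whose download did not report "OK".
failed <- control$ticker[control$download.status != "OK"]

if (length(failed) > 0) {
  message("Not downloaded: ", paste(failed, collapse = ", "))
}
```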
I have raw data shown below. I'm trying to move a row of data that corresponds to a label it matches to a new location in the dataframe.
dat<-read.table(text='RowLabels col1 col2 col3 col4 col5 col6
L 24363.7 25944.9 25646.1 25335.4 23564.2 25411.5
610 411.4 439 437.3 436.9 420.7 516.9
1 86.4 113.9 103.5 113.5 80.3 129
2 102.1 99.5 96.3 100.4 99.5 86
3 109.7 102.2 100.2 112.9 92.3 123.8
4 88.9 87.1 103.6 102.5 93.6 134.1
5 -50.3 -40.2 -72.3 -61.4 -27 -22.7
6 -35.3 -9.3 25.3 -0.3 15.6 -27.3
7 109.9 85.8 80.7 69.3 66.4 94
181920 652.9 729.2 652.1 689.1 612.5 738.4
1 104.3 107.3 103.5 104.2 98.3 110.1
2 103.6 102.6 100.1 103.2 88.8 117.7
3 53.5 99.1 46.7 70.3 53.9 32.5
4 93.5 107.2 98.3 99.3 97.3 121.1
5 96.8 109.3 104 102.2 98.7 112.9
6 103.6 96.9 104.7 104.4 91.5 137.7
7 97.6 106.8 94.8 105.5 84 106.4
181930 732.1 709.6 725.8 729.5 554.5 873.1
1 118.4 98.8 102.3 102 101.9 115.8
2 96.7 103.3 104.6 105.2 81.9 128.7
3 96 98.2 99.4 97.9 69.8 120.6
4 100.7 101 103.6 106.6 59.6 136.2
5 106.1 103.4 104.7 104.8 76.1 131.8
6 105 102.1 103 108.3 81 124.7
7 109.2 102.8 108.2 104.7 84.2 115.3
N 3836.4 4395.8 4227.3 4567.4 4009.9 4434.6
610 88.1 96.3 99.6 92 90 137.6
1 88.1 96.3 99.6 92 90 137.6
181920 113.1 100.6 106.5 104.2 87.3 108.2
1 113.1 100.6 106.5 104.2 87.3 108.2
181930 111.3 99.1 104.5 115.5 103.6 118.8
1 111.3 99.1 104.5 115.5 103.6 118.8
',header=TRUE)
I want to match the rows of the three N-prefix labels (610, 181920 and 181930) with their corresponding L-prefix labels. Basically, I want to move each of those rows into the matching L-prefix group as a new row, labeled 0 or 8 for example. So the result for label 610 would look like:
RowLabels col1 col2 col3 col4 col5 col6
610 411.4 439 437.3 436.9 420.7 516.9
1 86.4 113.9 103.5 113.5 80.3 129
2 102.1 99.5 96.3 100.4 99.5 86
3 109.7 102.2 100.2 112.9 92.3 123.8
4 88.9 87.1 103.6 102.5 93.6 134.1
5 -50.3 -40.2 -72.3 -61.4 -27 -22.7
6 -35.3 -9.3 25.3 -0.3 15.6 -27.3
7 109.9 85.8 80.7 69.3 66.4 94
8 88.1 96.3 99.6 92 90 137.6
Is this possible? I tried searching and found some resources pointing toward dplyr, tidyr or aggregate, but I can't find a good example that matches my case. The closest were "How to combine rows based on unique values in R?" and "Aggregate rows by shared values in a variable".
library(dplyr)
library(zoo)
df <- dat %>%
filter(grepl("^\\d+$",RowLabels)) %>%
mutate(RowLabels_temp = ifelse(grepl("^\\d{3,}$",RowLabels), as.numeric(as.character(RowLabels)), NA)) %>%
na.locf() %>%
select(-RowLabels) %>%
distinct() %>%
group_by(RowLabels_temp) %>%
mutate(RowLabels_indexed = row_number()-1) %>%
arrange(RowLabels_temp, RowLabels_indexed) %>%
mutate(RowLabels_indexed = ifelse(RowLabels_indexed==0, RowLabels_temp, RowLabels_indexed)) %>%
rename(RowLabels=RowLabels_indexed) %>%
data.frame()
df <- df %>% select(-RowLabels_temp)
df
Output is
col1 col2 col3 col4 col5 col6 RowLabels
1 411.4 439.0 437.3 436.9 420.7 516.9 610
2 86.4 113.9 103.5 113.5 80.3 129.0 1
3 102.1 99.5 96.3 100.4 99.5 86.0 2
4 109.7 102.2 100.2 112.9 92.3 123.8 3
5 88.9 87.1 103.6 102.5 93.6 134.1 4
6 -50.3 -40.2 -72.3 -61.4 -27.0 -22.7 5
7 -35.3 -9.3 25.3 -0.3 15.6 -27.3 6
8 109.9 85.8 80.7 69.3 66.4 94.0 7
9 88.1 96.3 99.6 92.0 90.0 137.6 8
...
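The key step in the pipeline above is carrying each group label (610, 181920, 181930) down to the rows that follow it, which is what the mutate() plus na.locf() pair does. A minimal base R sketch of just that step, on a made-up label vector, could look like this; it assumes the first label is itself a group label.

```r
# Made-up labels: group labels have three or more digits, members have fewer.
labels <- c("610", "1", "2", "181920", "1", "2")

# TRUE where a new group starts.
is_group <- grepl("^\\d{3,}$", labels)

# cumsum() counts how many group labels have been seen so far,
# which indexes the group each row belongs to.
group <- labels[is_group][cumsum(is_group)]
group
```

Rows from the L block and the N block that share a group label then end up in the same group, which is why the N rows appear as the last row of each group in the output.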
It sounds like you want to use the match() function, for example:
# target holds the values of column_to_reorder in the order you want (placeholder values here)
target <- c("value1", "value2", "value3")
df <- df[match(target, df$column_to_reorder), ]
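As a small worked illustration of the match() idiom, with a made-up data frame and column names (`id`, `value`) that are not from the question:

```r
# Made-up data, deliberately out of order.
df <- data.frame(
  id    = c("b", "c", "a"),
  value = c(2, 3, 1),
  stringsAsFactors = FALSE
)

# Desired row order, expressed as values of df$id.
target <- c("a", "b", "c")

# match() returns, for each target value, the row position where it occurs.
reordered <- df[match(target, df$id), ]
reordered
```

Any value in `target` that does not occur in the column yields NA from match() and therefore a row of NAs, so it is worth checking for missing values first.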
http://www.aqistudy.cn/historydata/daydata.php?city=%E8%8B%8F%E5%B7%9E&month=201504
This is the website from which I want to read data.
My code is as follows,
library(XML)
fileurl <- "http://www.aqistudy.cn/historydata/daydata.php?city=苏州&month=201404"
doc <- htmlTreeParse(fileurl, useInternalNodes = TRUE, encoding = "utf-8")
rootnode <- xmlRoot(doc)
pollution <- xpathSApply(rootnode, "/td", xmlValue)
But I got a lot of garbled output, and I don't know how to fix this problem.
I appreciate any help!
This can be simplified by using library(rvest) to read the table directly:
library(rvest)
url <- "http://www.aqistudy.cn/historydata/daydata.php?city=%E8%8B%8F%E5%B7%9E&month=201504"
doc <- read_html(url) %>%
html_table()
doc[[1]]
# 日期 AQI 范围 质量等级 PM2.5 PM10 SO2 CO NO2 O3 排名
# 1 2015-04-01 106 67~144 轻度污染 79.3 105.1 20.2 1.230 89.5 76 308
# 2 2015-04-02 74 31~140 良 48.1 79.7 18.8 1.066 51.5 129 231
# 3 2015-04-03 98 49~136 良 72.9 89.2 16.0 1.323 50.9 62 293
# 4 2015-04-04 92 56~158 良 67.6 78.2 14.3 1.506 57.4 93 262
# 5 2015-04-05 87 42~167 良 63.7 56.1 16.9 1.245 50.8 91 215
# 6 2015-04-06 46 36~56 优 29.1 30.8 10.0 0.817 37.5 98 136
# 7 2015-04-07 45 34~59 优 27.0 42.4 12.0 0.640 36.6 77 143
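The answer's URL uses the percent-encoded form of the city name, whereas the question's code passes the raw characters. If you want to build such URLs for other cities or months, a sketch with base R's URLencode() might look like the following. The variable names `city` and `month` are illustrative, and the city is written with Unicode escapes so the script does not depend on the file encoding.

```r
# City name written with Unicode escapes (the two characters of the city in the question).
city  <- "\u82cf\u5dde"
month <- "201504"

# Percent-encode the non-ASCII city name and assemble the query string.
url <- paste0(
  "http://www.aqistudy.cn/historydata/daydata.php?city=",
  URLencode(city, reserved = TRUE),
  "&month=", month
)
url
```

The result should match the encoded URL used in the answer above and can be passed to read_html() in the same way.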
I have a table like this.
X X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
1 SU 103.27 105.2 99.7 106.7 96.7 108.4 88.7 73.67
2 BS 100.17 104.5 97.6 103.6 91.7 106.2 85.5 73.66
3 DG 101.00 102.5 98.9 101.1 91.2 106.2 80.9 75.67
4 IC 97.80 103.4 97.2 102.4 88.4 103.3 85.7 70.00
5 DJ 106.20 103.1 99.1 97.7 90.7 106.2 77.5 74.00
6 GJ 97.47 101.7 98.6 101.2 89.9 105.6 81.7 73.33
7 US 99.80 105.6 98.2 0.0 81.7 103.6 84.3 68.00
8 GG 98.13 105.7 98.6 103.7 92.2 105.2 85.9 73.66
9 GO 96.13 101.2 96.8 101.7 86.4 105.7 78.1 72.66
10 CB 104.20 105.2 101.5 100.3 88.3 106.2 78.8 72.00
11 CN 107.20 95.0 96.1 98.7 88.2 103.7 78.5 71.33
12 GB 98.87 102.0 95.3 100.2 87.2 104.2 78.5 70.33
13 GN 99.57 103.3 95.6 102.6 89.2 103.7 83.2 72.00
14 JB 99.60 96.2 98.2 96.2 86.2 101.7 84.5 71.34
15 JN 93.83 98.6 98.8 95.2 87.2 102.7 83.9 70.33
16 JJ 93.63 101.7 93.2 98.1 0.0 0.0 83.9 71.00
17 SJ 0.00 0.0 0.0 0.0 0.0 106.5 81.9 73.34
These are test scores from several provinces of South Korea, one column per year.
The range of the test score was [0,110] until 2013, but it was changed to [0,100] in 2014.
My objective is to normalize the test scores into a common range, or hopefully onto some standardized scale.
Maybe I can first convert the scores from 2008 to 2013 to a 100-point scale, then subtract the column mean and divide by the column standard deviation. But then the scores are only standardized within each column.
Is there any possible way to normalize (or standardize) the test scores as a whole?
By the way, a test score of 0 means there was no test, so it must be ignored in the normalization process. Here is the data in CSV format for your convenience:
,2008,2009,2010,2011,2012,2013,2014,2015
SU,103.27,105.2,99.7,106.7,96.7,108.4,88.7,73.67
BS,100.17,104.5,97.6,103.6,91.7,106.2,85.5,73.66
DG,101,102.5,98.9,101.1,91.2,106.2,80.9,75.67
IC,97.8,103.4,97.2,102.4,88.4,103.3,85.7,70
DJ,106.2,103.1,99.1,97.7,90.7,106.2,77.5,74
GJ,97.47,101.7,98.6,101.2,89.9,105.6,81.7,73.33
US,99.8,105.6,98.2,0,81.7,103.6,84.3,68
GG,98.13,105.7,98.6,103.7,92.2,105.2,85.9,73.66
GO,96.13,101.2,96.8,101.7,86.4,105.7,78.1,72.66
CB,104.2,105.2,101.5,100.3,88.3,106.2,78.8,72
CN,107.2,95,96.1,98.7,88.2,103.7,78.5,71.33
GB,98.87,102,95.3,100.2,87.2,104.2,78.5,70.33
GN,99.57,103.3,95.6,102.6,89.2,103.7,83.2,72
JB,99.6,96.2,98.2,96.2,86.2,101.7,84.5,71.34
JN,93.83,98.6,98.8,95.2,87.2,102.7,83.9,70.33
JJ,93.63,101.7,93.2,98.1,0,0,83.9,71
SJ,0,0,0,0,0,106.5,81.9,73.34
I think the best approach is probably to convert the columns that are in the range [0-110] to the range [0-100], i.e. columns 2 to 7 (2008 through 2013; the 2013 scores still exceed 100, so that year is on the old scale). That way everything will be on the same scale. In order to do this:
Data:
df <- read.table(header=T, text=' X X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
1 SU 103.27 105.2 99.7 106.7 96.7 108.4 88.7 73.67
2 BS 100.17 104.5 97.6 103.6 91.7 106.2 85.5 73.66
3 DG 101.00 102.5 98.9 101.1 91.2 106.2 80.9 75.67
4 IC 97.80 103.4 97.2 102.4 88.4 103.3 85.7 70.00
5 DJ 106.20 103.1 99.1 97.7 90.7 106.2 77.5 74.00
6 GJ 97.47 101.7 98.6 101.2 89.9 105.6 81.7 73.33
7 US 99.80 105.6 98.2 0.0 81.7 103.6 84.3 68.00
8 GG 98.13 105.7 98.6 103.7 92.2 105.2 85.9 73.66
9 GO 96.13 101.2 96.8 101.7 86.4 105.7 78.1 72.66
10 CB 104.20 105.2 101.5 100.3 88.3 106.2 78.8 72.00
11 CN 107.20 95.0 96.1 98.7 88.2 103.7 78.5 71.33
12 GB 98.87 102.0 95.3 100.2 87.2 104.2 78.5 70.33
13 GN 99.57 103.3 95.6 102.6 89.2 103.7 83.2 72.00
14 JB 99.60 96.2 98.2 96.2 86.2 101.7 84.5 71.34
15 JN 93.83 98.6 98.8 95.2 87.2 102.7 83.9 70.33
16 JJ 93.63 101.7 93.2 98.1 0.0 0.0 83.9 71.00
17 SJ 0.00 0.0 0.0 0.0 0.0 106.5 81.9 73.34')
You could do:
# rescale the 2008-2013 columns from the 110-point scale to the 100-point scale
df[2:7] <- lapply(df[2:7], function(x) {
x / 110 * 100
})
Essentially you divide by 110, which is the maximum of [0-110], to convert to the range [0-1], and then multiply by 100 to convert that to the range [0-100].
Output:
> df
X X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
1 SU 93.88182 95.63636 90.63636 97.00000 87.90909 98.54545 88.7 73.67
2 BS 91.06364 95.00000 88.72727 94.18182 83.36364 96.54545 85.5 73.66
3 DG 91.81818 93.18182 89.90909 91.90909 82.90909 96.54545 80.9 75.67
4 IC 88.90909 94.00000 88.36364 93.09091 80.36364 93.90909 85.7 70.00
5 DJ 96.54545 93.72727 90.09091 88.81818 82.45455 96.54545 77.5 74.00
6 GJ 88.60909 92.45455 89.63636 92.00000 81.72727 96.00000 81.7 73.33
7 US 90.72727 96.00000 89.27273 0.00000 74.27273 94.18182 84.3 68.00
8 GG 89.20909 96.09091 89.63636 94.27273 83.81818 95.63636 85.9 73.66
9 GO 87.39091 92.00000 88.00000 92.45455 78.54545 96.09091 78.1 72.66
10 CB 94.72727 95.63636 92.27273 91.18182 80.27273 96.54545 78.8 72.00
11 CN 97.45455 86.36364 87.36364 89.72727 80.18182 94.27273 78.5 71.33
12 GB 89.88182 92.72727 86.63636 91.09091 79.27273 94.72727 78.5 70.33
13 GN 90.51818 93.90909 86.90909 93.27273 81.09091 94.27273 83.2 72.00
14 JB 90.54545 87.45455 89.27273 87.45455 78.36364 92.45455 84.5 71.34
15 JN 85.30000 89.63636 89.81818 86.54545 79.27273 93.36364 83.9 70.33
16 JJ 85.11818 92.45455 84.72727 89.18182 0.00000 0.00000 83.9 71.00
17 SJ 0.00000 0.00000 0.00000 0.00000 0.00000 96.81818 81.9 73.34
And now you can compare between the years. Also, as you will notice zeros will remain zeros.
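The question also asks for zeros to be ignored and for a standardization over the whole table rather than per column. One possible sketch, using three rows and three columns copied from the data above, is to turn the zeros into NA, rescale the old-scale years, and then compute z-scores from a single overall mean and standard deviation. Treating 2013 and earlier as the 110-point years is an assumption based on the question's description, and whether one pooled mean is statistically appropriate across years is a judgment call.

```r
# Three rows and three year columns taken from the question's table.
scores <- data.frame(
  X     = c("SU", "US", "SJ"),
  X2011 = c(106.7, 0, 0),
  X2013 = c(108.4, 103.6, 106.5),
  X2014 = c(88.7, 84.3, 81.9),
  stringsAsFactors = FALSE
)

num <- scores[-1]
num[num == 0] <- NA                      # 0 means "no test", so treat it as missing

# Assumption: years up to and including 2013 were scored out of 110.
old_scale <- c("X2011", "X2013")
num[old_scale] <- num[old_scale] / 110 * 100

# Standardize with one overall mean and sd, ignoring missing scores.
all_vals <- unlist(num)
z_all <- (num - mean(all_vals, na.rm = TRUE)) / sd(all_vals, na.rm = TRUE)
cbind(scores["X"], z_all)
```

Missing scores stay NA in the result instead of showing up as extreme negative z-scores.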
How can I make a residual plot according to the following (what are y_hat and e here)?
Is this a form of residual plot as well?
beeflm=lm(PBE ~ CBE + PPO + CPO + PFO +DINC + CFO+RDINC+RFP+YEAR, data = beef)
summary(beeflm)
qqnorm(residuals(beeflm))
#plot(beeflm) #in manuals I have seen they use this but it gives me multiple plot
or is this one correct?
plot(beeflm$residuals,beeflm$fitted.values)
I know through the comments that plot(beeflm,which=1) is correct but according to the stated question I should use matplot but I receive the following error:
matplot(beeflm,which=1,
+ main = "Beef: residual plot",
+ ylab = expression(e[i]), # only 1st is taken
+ xlab = expression(hat(y[i])))
Error in xy.coords(x, y, xlabel, ylabel, log = log) :
(list) object cannot be coerced to type 'double'
And when I use plot I receive the following error:
plot(beeflm,which=1,main="Beef: residual plot",ylab = expression(e[i]),xlab = expression(hat(y[i])))
Error in plot.default(yh, r, xlab = l.fit, ylab = "Residuals", main = main, :
formal argument "xlab" matched by multiple actual arguments
Also, do you know what the following means? Is there an example (or an external link) that illustrates it?
Here's the beef data.frame:
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP
1 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
2 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899
3 1927 63.0 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
4 1928 71.0 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
5 1929 71.0 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895
6 1930 74.2 48.2 59.6 66.1 62.4 48.8 90.7 68.3 874
7 1931 72.1 47.9 57.0 67.4 51.4 41.5 90.0 64.0 791
8 1932 79.0 46.0 49.5 69.7 42.8 31.4 87.8 53.9 733
9 1933 73.1 50.8 47.3 68.7 41.6 29.4 88.0 53.2 752
10 1934 70.2 55.2 56.6 62.2 46.4 33.2 89.1 58.0 811
11 1935 82.2 52.2 73.9 47.7 49.7 37.0 87.3 63.2 847
12 1936 68.4 57.3 64.4 54.4 50.1 41.8 90.5 70.5 845
13 1937 73.0 54.4 62.2 55.0 52.1 44.5 90.4 72.5 849
14 1938 70.2 53.6 59.9 57.4 48.4 40.8 90.6 67.8 803
15 1939 67.8 53.9 51.0 63.9 47.1 43.5 93.8 73.2 793
16 1940 63.4 54.2 41.5 72.4 47.8 46.5 95.5 77.6 798
17 1941 56.0 60.0 43.9 67.4 52.2 56.3 97.5 89.5 830
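Use plot(beeflm, which=1) to get the plot of residuals against fitted values. plot() on an lm object sets its own axis labels for this plot, which is why passing xlab produces the "matched by multiple actual arguments" error.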
require(graphics)
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
plot(lm.D9, which=1)
Edited
You can use matplot as given below:
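matplot(
  x = fitted(lm.D9),               # fitted values (y-hat) on the x axis
  y = residuals(lm.D9),            # residuals (e) on the y axis
  pch = 1, col = 1,                # plain points instead of matplot's default digit symbols
  xlab = expression(hat(y)[i]),
  ylab = expression(e[i])
)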
An example illustrating this using the mtcars data:
fit <- lm(mpg ~ ., data=mtcars)
plot(x=fitted(fit), y=residuals(fit))
and
par(mfrow=c(3,4)) # or 'layout(matrix(1:12, nrow=3, byrow=TRUE))'
for (coeff in colnames(mtcars)[-1])
plot(x=mtcars[, coeff], residuals(fit), xlab=coeff, ylab=expression(e[i]))
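To tie this back to the notation in the question: y_hat is the vector of fitted values and e is the vector of residuals, i.e. observed response minus fitted value. The sketch below extracts both by hand from a model on mtcars (the choice of predictors wt and hp is arbitrary) and labels the axes with plotmath expressions, which works here because plot() is called on plain vectors rather than on the lm object.

```r
fit   <- lm(mpg ~ wt + hp, data = mtcars)
y_hat <- fitted(fit)        # fitted values, one per observation
e     <- residuals(fit)     # residuals: observed mpg minus fitted mpg

# Residuals against fitted values, with the labels from the question.
plot(y_hat, e,
     main = "Residual plot",
     xlab = expression(hat(y)[i]),
     ylab = expression(e[i]))
abline(h = 0, lty = 2)      # reference line at zero
```

For the beef model the same code applies with `fit` replaced by `beeflm`.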