How to create a plot consisting of multiple residuals? - r

How can I make a residual plot according to the following (what are y_hat and e here)?
Is this a form of residual plot as well?
beeflm=lm(PBE ~ CBE + PPO + CPO + PFO +DINC + CFO+RDINC+RFP+YEAR, data = beef)
summary(beeflm)
qqnorm(residuals(beeflm))
#plot(beeflm) # the manuals use this, but it gives me multiple plots
or is this one correct?
plot(beeflm$residuals,beeflm$fitted.values)
I know from the comments that plot(beeflm,which=1) is correct, but according to the stated question I should use matplot, and with it I receive the following error:
matplot(beeflm,which=1,
+ main = "Beef: residual plot",
+ ylab = expression(e[i]), # only 1st is taken
+ xlab = expression(hat(y[i])))
Error in xy.coords(x, y, xlabel, ylabel, log = log) :
(list) object cannot be coerced to type 'double'
And when I use plot I receive the following error:
plot(beeflm,which=1,main="Beef: residual plot",ylab = expression(e[i]),xlab = expression(hat(y[i])))
Error in plot.default(yh, r, xlab = l.fit, ylab = "Residuals", main = main, :
formal argument "xlab" matched by multiple actual arguments
Also, do you know what the following means? Is there an example (or an external link) that illustrates it?
Here's the beef data.frame:
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP
1 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
2 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899
3 1927 63.0 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
4 1928 71.0 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
5 1929 71.0 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895
6 1930 74.2 48.2 59.6 66.1 62.4 48.8 90.7 68.3 874
7 1931 72.1 47.9 57.0 67.4 51.4 41.5 90.0 64.0 791
8 1932 79.0 46.0 49.5 69.7 42.8 31.4 87.8 53.9 733
9 1933 73.1 50.8 47.3 68.7 41.6 29.4 88.0 53.2 752
10 1934 70.2 55.2 56.6 62.2 46.4 33.2 89.1 58.0 811
11 1935 82.2 52.2 73.9 47.7 49.7 37.0 87.3 63.2 847
12 1936 68.4 57.3 64.4 54.4 50.1 41.8 90.5 70.5 845
13 1937 73.0 54.4 62.2 55.0 52.1 44.5 90.4 72.5 849
14 1938 70.2 53.6 59.9 57.4 48.4 40.8 90.6 67.8 803
15 1939 67.8 53.9 51.0 63.9 47.1 43.5 93.8 73.2 793
16 1940 63.4 54.2 41.5 72.4 47.8 46.5 95.5 77.6 798
17 1941 56.0 60.0 43.9 67.4 52.2 56.3 97.5 89.5 830

Use plot(beeflm, which=1) to get the plot between residuals and fitted values.
require(graphics)
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
plot(lm.D9, which=1)
Edited
You can use matplot as given below:
matplot(
x = lm.D9$fitted.values
, y = lm.D9$resid
)
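For reference, in a residual plot of this kind y_hat denotes the fitted values and e the residuals (observed minus fitted). The matplot call on the model object fails because matplot expects numeric vectors or matrices, not an lm object. Below is a minimal sketch of drawing the plot by hand with custom axis labels. It uses the built-in cars data as a stand-in, so the dataset and the names fit, y_hat and e are assumptions for illustration, not the beef model.

```r
# Stand-in model on a built-in dataset (assumption: not the beef data).
fit <- lm(dist ~ speed, data = cars)

y_hat <- fitted(fit)      # fitted values, one per observation
e     <- residuals(fit)   # residuals: observed minus fitted

# Residuals (y axis) against fitted values (x axis), with plotmath labels.
plot(x = y_hat, y = e,
     main = "Residual plot",
     xlab = expression(hat(y)[i]),
     ylab = expression(e[i]))
abline(h = 0, lty = 2)    # dashed reference line at zero
```

Plotting the two vectors directly sidesteps the "xlab matched by multiple actual arguments" error, since no axis label is set internally.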

An example illustrating this using the mtcars data:
fit <- lm(mpg ~ ., data=mtcars)
plot(x=fitted(fit), y=residuals(fit))
and
par(mfrow=c(3,4)) # or 'layout(matrix(1:12, nrow=3, byrow=TRUE))'
for (coeff in colnames(mtcars)[-1])
plot(x=mtcars[, coeff], residuals(fit), xlab=coeff, ylab=expression(e[i]))

Related

How can I organise and move row of data based on label matches?

I have raw data shown below. I'm trying to move a row of data that corresponds to a label it matches to a new location in the dataframe.
dat<-read.table(text='RowLabels col1 col2 col3 col4 col5 col6
L 24363.7 25944.9 25646.1 25335.4 23564.2 25411.5
610 411.4 439 437.3 436.9 420.7 516.9
1 86.4 113.9 103.5 113.5 80.3 129
2 102.1 99.5 96.3 100.4 99.5 86
3 109.7 102.2 100.2 112.9 92.3 123.8
4 88.9 87.1 103.6 102.5 93.6 134.1
5 -50.3 -40.2 -72.3 -61.4 -27 -22.7
6 -35.3 -9.3 25.3 -0.3 15.6 -27.3
7 109.9 85.8 80.7 69.3 66.4 94
181920 652.9 729.2 652.1 689.1 612.5 738.4
1 104.3 107.3 103.5 104.2 98.3 110.1
2 103.6 102.6 100.1 103.2 88.8 117.7
3 53.5 99.1 46.7 70.3 53.9 32.5
4 93.5 107.2 98.3 99.3 97.3 121.1
5 96.8 109.3 104 102.2 98.7 112.9
6 103.6 96.9 104.7 104.4 91.5 137.7
7 97.6 106.8 94.8 105.5 84 106.4
181930 732.1 709.6 725.8 729.5 554.5 873.1
1 118.4 98.8 102.3 102 101.9 115.8
2 96.7 103.3 104.6 105.2 81.9 128.7
3 96 98.2 99.4 97.9 69.8 120.6
4 100.7 101 103.6 106.6 59.6 136.2
5 106.1 103.4 104.7 104.8 76.1 131.8
6 105 102.1 103 108.3 81 124.7
7 109.2 102.8 108.2 104.7 84.2 115.3
N 3836.4 4395.8 4227.3 4567.4 4009.9 4434.6
610 88.1 96.3 99.6 92 90 137.6
1 88.1 96.3 99.6 92 90 137.6
181920 113.1 100.6 106.5 104.2 87.3 108.2
1 113.1 100.6 106.5 104.2 87.3 108.2
181930 111.3 99.1 104.5 115.5 103.6 118.8
1 111.3 99.1 104.5 115.5 103.6 118.8
',header=TRUE)
I want to match the three N-prefix labels (610, 181920 and 181930) with their corresponding L-prefix labels. Basically, move each of those rows of data into the matching L-prefix group as a new row, labeled 0 or 8, for example. So the result for label 610 would look like:
RowLabels col1 col2 col3 col4 col5 col6
610 411.4 439 437.3 436.9 420.7 516.9
1 86.4 113.9 103.5 113.5 80.3 129
2 102.1 99.5 96.3 100.4 99.5 86
3 109.7 102.2 100.2 112.9 92.3 123.8
4 88.9 87.1 103.6 102.5 93.6 134.1
5 -50.3 -40.2 -72.3 -61.4 -27 -22.7
6 -35.3 -9.3 25.3 -0.3 15.6 -27.3
7 109.9 85.8 80.7 69.3 66.4 94
8 88.1 96.3 99.6 92 90 137.6
Is this possible? I tried searching and found some resources pointing toward dplyr, tidyr or aggregate, but I can't find a good example that matches my case. The related questions I looked at were "How to combine rows based on unique values in R?" and "Aggregate rows by shared values in a variable".
library(dplyr)
library(zoo)
df <- dat %>%
filter(grepl("^\\d+$",RowLabels)) %>%
mutate(RowLabels_temp = ifelse(grepl("^\\d{3,}$",RowLabels), as.numeric(as.character(RowLabels)), NA)) %>%
na.locf() %>%
select(-RowLabels) %>%
distinct() %>%
group_by(RowLabels_temp) %>%
mutate(RowLabels_indexed = row_number()-1) %>%
arrange(RowLabels_temp, RowLabels_indexed) %>%
mutate(RowLabels_indexed = ifelse(RowLabels_indexed==0, RowLabels_temp, RowLabels_indexed)) %>%
rename(RowLabels=RowLabels_indexed) %>%
data.frame()
df <- df %>% select(-RowLabels_temp)
df
Output is
col1 col2 col3 col4 col5 col6 RowLabels
1 411.4 439.0 437.3 436.9 420.7 516.9 610
2 86.4 113.9 103.5 113.5 80.3 129.0 1
3 102.1 99.5 96.3 100.4 99.5 86.0 2
4 109.7 102.2 100.2 112.9 92.3 123.8 3
5 88.9 87.1 103.6 102.5 93.6 134.1 4
6 -50.3 -40.2 -72.3 -61.4 -27.0 -22.7 5
7 -35.3 -9.3 25.3 -0.3 15.6 -27.3 6
8 109.9 85.8 80.7 69.3 66.4 94.0 7
9 88.1 96.3 99.6 92.0 90.0 137.6 8
...
It sounds like you want to use the match() function, for example:
target<-c(the values of your target order)
df<-df[match(target, df$column_to_reorder),]
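A small runnable sketch of the match() idea, using a made-up three-row data frame (the names demo, id, value and target are hypothetical and not taken from the question):

```r
# Hypothetical data: three labelled rows in their original order.
demo <- data.frame(id = c("a", "b", "c"), value = c(1, 2, 3))

# The order we want the rows to end up in.
target <- c("c", "a", "b")

# match() returns, for each target label, the row position where it occurs
# in demo$id; indexing with those positions reorders the rows.
reordered <- demo[match(target, demo$id), ]
reordered
```

Note that match() only returns the first occurrence of each label, so it suits labels that are unique within the column being reordered.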

as.Date conversion returns NA

I am using this data. When I convert the Date column with as.Date(), I get NA.
This is the code I am using. Can someone please advise what I am doing wrong?
dataf <- read.csv("DB.csv")
dataf$Date <- as.Date(dataf$Date, format = "%b-%Y")
str(dataf)
'data.frame': 55 obs. of 9 variables:
$ Date : Date, format: NA NA ...
$ Sydney : num 85.3 88.2 87 84.4 84.2 84.8 83.2 82.6 81.4 81.8 ...
$ Melbourne: num 60.7 62.1 60.8 60.9 60.9 62.4 62.5 63.2 63.1 64 ...
$ Brisbane : num 64.2 69.4 70.7 71.7 71.2 72 72.6 73.3 73.6 75 ...
$ Adelaide : num 62.2 63.9 64.8 65.9 67.1 68.6 68.6 69.3 70 71.6 ...
$ Perth : num 48.3 50.6 52.5 53.7 54.7 57.1 59.4 62.6 65.2 70 ...
$ Hobart : num 61.2 66.5 68.7 71.8 72.3 74.6 75.8 76.9 76.7 79.1 ...
$ Darwin : num 40.5 43.3 45.5 45.2 46.8 49.7 53.6 54.7 56.1 60.2 ...
$ Canberra : num 68.3 70.9 69.9 70.1 68.6 69.7 70.3 69.9 70.1 71.7 ...
In addition to the good suggestions in the comments, you could try lubridate::parse_date_time(), which can handle incomplete dates:
as.Date("01-2017", format="%m-%Y")
# [1] NA
as.POSIXct("01-2017", format="%m-%Y")
# [1] NA
as.POSIXlt("01-2017", format="%m-%Y")
# [1] NA
library(lubridate)
parse_date_time("01-2017", "my")
# [1] "2017-01-01 UTC"
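A base R alternative, sketched under the assumption that any day of the month will do: as.Date() returns NA because a month and a year alone do not identify a date, so prepend a day before parsing. The vector month_year is a made-up example.

```r
# Month-year strings have no day, so as.Date() cannot build a full date.
month_year <- c("01-2017", "02-2017")

# Prepend an arbitrary day (the 1st here) and parse all three components.
dates <- as.Date(paste0("01-", month_year), format = "%d-%m-%Y")
format(dates, "%Y-%m-%d")
```

For strings with abbreviated month names, as in the question's "%b-%Y" format, the same trick applies with format = "%d-%b-%Y", but matching of month names depends on the current locale.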

(statistics) 2-way table normalization

I have a table like this.
X X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
1 SU 103.27 105.2 99.7 106.7 96.7 108.4 88.7 73.67
2 BS 100.17 104.5 97.6 103.6 91.7 106.2 85.5 73.66
3 DG 101.00 102.5 98.9 101.1 91.2 106.2 80.9 75.67
4 IC 97.80 103.4 97.2 102.4 88.4 103.3 85.7 70.00
5 DJ 106.20 103.1 99.1 97.7 90.7 106.2 77.5 74.00
6 GJ 97.47 101.7 98.6 101.2 89.9 105.6 81.7 73.33
7 US 99.80 105.6 98.2 0.0 81.7 103.6 84.3 68.00
8 GG 98.13 105.7 98.6 103.7 92.2 105.2 85.9 73.66
9 GO 96.13 101.2 96.8 101.7 86.4 105.7 78.1 72.66
10 CB 104.20 105.2 101.5 100.3 88.3 106.2 78.8 72.00
11 CN 107.20 95.0 96.1 98.7 88.2 103.7 78.5 71.33
12 GB 98.87 102.0 95.3 100.2 87.2 104.2 78.5 70.33
13 GN 99.57 103.3 95.6 102.6 89.2 103.7 83.2 72.00
14 JB 99.60 96.2 98.2 96.2 86.2 101.7 84.5 71.34
15 JN 93.83 98.6 98.8 95.2 87.2 102.7 83.9 70.33
16 JJ 93.63 101.7 93.2 98.1 0.0 0.0 83.9 71.00
17 SJ 0.00 0.0 0.0 0.0 0.0 106.5 81.9 73.34
These are test scores from some provinces of South Korea for each year.
The range of the test score was [0,110] until 2013, but it was changed to [0,100] in 2014.
My objective is to normalize the test scores onto a common range or, hopefully, some standardized scale.
Maybe I can first convert the scores from 2008 to 2013 to a 100-point scale, then subtract the column mean and divide by the standard deviation of each column. But then the scores are only standardized within each column.
Is there any possible way to normalize (or standardize) the test score as a whole?
By the way, a test score of 0 means there was no test, so it must be ignored in the normalization process. Here is the data in CSV format for your convenience:
,2008,2009,2010,2011,2012,2013,2014,2015
SU,103.27,105.2,99.7,106.7,96.7,108.4,88.7,73.67
BS,100.17,104.5,97.6,103.6,91.7,106.2,85.5,73.66
DG,101,102.5,98.9,101.1,91.2,106.2,80.9,75.67
IC,97.8,103.4,97.2,102.4,88.4,103.3,85.7,70
DJ,106.2,103.1,99.1,97.7,90.7,106.2,77.5,74
GJ,97.47,101.7,98.6,101.2,89.9,105.6,81.7,73.33
US,99.8,105.6,98.2,0,81.7,103.6,84.3,68
GG,98.13,105.7,98.6,103.7,92.2,105.2,85.9,73.66
GO,96.13,101.2,96.8,101.7,86.4,105.7,78.1,72.66
CB,104.2,105.2,101.5,100.3,88.3,106.2,78.8,72
CN,107.2,95,96.1,98.7,88.2,103.7,78.5,71.33
GB,98.87,102,95.3,100.2,87.2,104.2,78.5,70.33
GN,99.57,103.3,95.6,102.6,89.2,103.7,83.2,72
JB,99.6,96.2,98.2,96.2,86.2,101.7,84.5,71.34
JN,93.83,98.6,98.8,95.2,87.2,102.7,83.9,70.33
JJ,93.63,101.7,93.2,98.1,0,0,83.9,71
SJ,0,0,0,0,0,106.5,81.9,73.34
I think the best approach is probably to convert columns 2 to 7 (X2008 to X2013), i.e. the ones in the range [0-110], to the range [0-100]. That way everything will be on the same scale. In order to do this:
Data:
df <- read.table(header=T, text=' X X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
1 SU 103.27 105.2 99.7 106.7 96.7 108.4 88.7 73.67
2 BS 100.17 104.5 97.6 103.6 91.7 106.2 85.5 73.66
3 DG 101.00 102.5 98.9 101.1 91.2 106.2 80.9 75.67
4 IC 97.80 103.4 97.2 102.4 88.4 103.3 85.7 70.00
5 DJ 106.20 103.1 99.1 97.7 90.7 106.2 77.5 74.00
6 GJ 97.47 101.7 98.6 101.2 89.9 105.6 81.7 73.33
7 US 99.80 105.6 98.2 0.0 81.7 103.6 84.3 68.00
8 GG 98.13 105.7 98.6 103.7 92.2 105.2 85.9 73.66
9 GO 96.13 101.2 96.8 101.7 86.4 105.7 78.1 72.66
10 CB 104.20 105.2 101.5 100.3 88.3 106.2 78.8 72.00
11 CN 107.20 95.0 96.1 98.7 88.2 103.7 78.5 71.33
12 GB 98.87 102.0 95.3 100.2 87.2 104.2 78.5 70.33
13 GN 99.57 103.3 95.6 102.6 89.2 103.7 83.2 72.00
14 JB 99.60 96.2 98.2 96.2 86.2 101.7 84.5 71.34
15 JN 93.83 98.6 98.8 95.2 87.2 102.7 83.9 70.33
16 JJ 93.63 101.7 93.2 98.1 0.0 0.0 83.9 71.00
17 SJ 0.00 0.0 0.0 0.0 0.0 106.5 81.9 73.34')
You could do:
# columns 2 to 7 are X2008..X2013, the years scored on [0-110]
df[2:7] <- lapply(df[2:7], function(x) {
x / 110 * 100
})
Essentially, you divide by 110, which is the max in [0-110], in order to convert to the range [0-1], and then multiply by 100 to convert that to the range [0-100].
Output:
> df
X X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
1 SU 93.88182 95.63636 90.63636 97.00000 87.90909 98.54545 88.7 73.67
2 BS 91.06364 95.00000 88.72727 94.18182 83.36364 96.54545 85.5 73.66
3 DG 91.81818 93.18182 89.90909 91.90909 82.90909 96.54545 80.9 75.67
4 IC 88.90909 94.00000 88.36364 93.09091 80.36364 93.90909 85.7 70.00
5 DJ 96.54545 93.72727 90.09091 88.81818 82.45455 96.54545 77.5 74.00
6 GJ 88.60909 92.45455 89.63636 92.00000 81.72727 96.00000 81.7 73.33
7 US 90.72727 96.00000 89.27273 0.00000 74.27273 94.18182 84.3 68.00
8 GG 89.20909 96.09091 89.63636 94.27273 83.81818 95.63636 85.9 73.66
9 GO 87.39091 92.00000 88.00000 92.45455 78.54545 96.09091 78.1 72.66
10 CB 94.72727 95.63636 92.27273 91.18182 80.27273 96.54545 78.8 72.00
11 CN 97.45455 86.36364 87.36364 89.72727 80.18182 94.27273 78.5 71.33
12 GB 89.88182 92.72727 86.63636 91.09091 79.27273 94.72727 78.5 70.33
13 GN 90.51818 93.90909 86.90909 93.27273 81.09091 94.27273 83.2 72.00
14 JB 90.54545 87.45455 89.27273 87.45455 78.36364 92.45455 84.5 71.34
15 JN 85.30000 89.63636 89.81818 86.54545 79.27273 93.36364 83.9 70.33
16 JJ 85.11818 92.45455 84.72727 89.18182 0.00000 0.00000 83.9 71.00
17 SJ 0.00000 0.00000 0.00000 0.00000 0.00000 96.81818 81.9 73.34
And now you can compare between the years. Also, as you will notice, zeros will remain zeros.
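The question also asks that zero scores (no test) be ignored and that the scores be standardized as a whole. One possible sketch, assuming that "as a whole" means a single pooled mean and standard deviation over all years after rescaling: mark zeros as NA, rescale the years scored on [0-110], then compute z-scores with na.rm = TRUE. The small data frame scores is a two-row excerpt of the table, and all variable names are illustrative.

```r
# Two rows and four years taken from the table in the question.
scores <- data.frame(
  X     = c("SU", "SJ"),
  X2012 = c(96.7, 0),
  X2013 = c(108.4, 106.5),
  X2014 = c(88.7, 81.9),
  X2015 = c(73.67, 73.34)
)
old_scale <- c("X2012", "X2013")   # years scored on [0-110]
num_cols  <- names(scores)[-1]     # every column except the province code

# Treat 0 as "no test" so it does not distort the mean or standard deviation.
scores[num_cols] <- lapply(scores[num_cols], function(x) replace(x, x == 0, NA))

# Bring the old-scale years onto [0-100]; NA values stay NA.
scores[old_scale] <- lapply(scores[old_scale], function(x) x / 110 * 100)

# Standardize with one pooled mean and SD over all available scores.
vals <- as.matrix(scores[num_cols])
z <- (vals - mean(vals, na.rm = TRUE)) / sd(vals, na.rm = TRUE)
z
```

Whether pooling across years is appropriate depends on whether the tests are comparable from year to year; standardizing within each column instead removes year-to-year differences in difficulty.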

read.table, read.csv or scan for reading text file in R?

I am confused about which of the following I should use (as of now, all of them give me errors):
> beef = read.csv("beef.txt", header = TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
> beef = scan("beef.txt")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '%'
> beef=read.table("beef.txt", header = FALSE, sep = " ")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 8 elements
> beef=read.table("beef.txt", header = TRUE, sep = " ")
Error in read.table("beef.txt", header = TRUE, sep = " ") :
more columns than column names
Here's the top portion of beef.txt file and the rest is very similar.
% http://lib.stat.cmu.edu/DASL/Datafiles/agecondat.html
%
% Datafile Name: Agricultural Economics Studies
% Datafile Subjects: Agriculture , Economics , Consumer
% Story Names: Agricultural Economics Studies
% Reference: F.B. Waugh, Graphic Analysis in Agricultural Economics,
% Agricultural Handbook No. 128, U.S. Department of Agriculture, 1957.
% Authorization: free use
% Description: Price and consumption per capita of beef and pork
% annually from 1925 to 1941 together with other variables relevant to
% an economic analysis of price and/or consumption of beef and pork
% over the period.
% Number of cases: 17
% Variable Names:
%
% PBE = Price of beef (cents/lb)
% CBE = Consumption of beef per capita (lbs)
% PPO = Price of pork (cents/lb)
% CPO = Consumption of pork per capita (lbs)
% PFO = Retail food price index (1947-1949 = 100)
% DINC = Disposable income per capita index (1947-1949 = 100)
% CFO = Food consumption per capita index (1947-1949 = 100)
% RDINC = Index of real disposable income per capita (1947-1949 = 100)
% RFP = Retail food price index adjusted by the CPI (1947-1949 = 100)
%
% The Data:
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP
1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
1926 59.7 59.4 63.3 63.3 68 52.6 92.1 69.6 899
1927 63 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
1928 71 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
1929 71 49 55 68.7 65.6 55.1 91.1 75.2 895
When I used fread, the data was read in very strangely, as shown below. Any idea how it can be formatted as expected?
> library(data.table)
> beef=fread("beef.txt", header = T, sep = " ")
> beef
YEAR V2 V3 V4
1: 1925 NA NA NA
2: 1926 NA NA NA
3: 1927 NA NA NA
4: 1928 NA NA NA
5: 1929 NA NA NA
6: 1930 NA NA NA
7: 1931 NA NA NA
8: 1932 NA NA NA
9: 1933 NA NA NA
10: 1934 NA NA NA
11: 1935 NA NA NA
12: 1936 NA NA NA
13: 1937 NA NA NA
14: 1938 NA NA NA
15: 1939 NA NA NA
16: 1940 NA NA NA
17: 1941 NA NA NA
PBE\tCBE\tPPO\tCPO\tPFO\tDINC\tCFO\tRDINC\tRFP
1: 59.7\t58.6\t60.5\t65.8\t65.8\t51.4\t90.9\t68.5\t877
2: 59.7\t59.4\t63.3\t63.3\t68\t52.6\t92.1\t69.6\t899
3: 63\t53.7\t59.9\t66.8\t65.5\t52.1\t90.9\t70.2\t883
4: 71\t48.1\t56.3\t69.9\t64.8\t52.7\t90.9\t71.9\t884
5: 71\t49\t55\t68.7\t65.6\t55.1\t91.1\t75.2\t895
6: 74.2\t48.2\t59.6\t66.1\t62.4\t48.8\t90.7\t68.3\t874
7: 72.1\t47.9\t57\t67.4\t51.4\t41.5\t90\t64\t791
8: 79\t46\t49.5\t69.7\t42.8\t31.4\t87.8\t53.9\t733
9: 73.1\t50.8\t47.3\t68.7\t41.6\t29.4\t88\t53.2\t752
10: 70.2\t55.2\t56.6\t62.2\t46.4\t33.2\t89.1\t58\t811
11: 82.2\t52.2\t73.9\t47.7\t49.7\t37\t87.3\t63.2\t847
12: 68.4\t57.3\t64.4\t54.4\t50.1\t41.8\t90.5\t70.5\t845
13: 73\t54.4\t62.2\t55\t52.1\t44.5\t90.4\t72.5\t849
14: 70.2\t53.6\t59.9\t57.4\t48.4\t40.8\t90.6\t67.8\t803
15: 67.8\t53.9\t51\t63.9\t47.1\t43.5\t93.8\t73.2\t793
16: 63.4\t54.2\t41.5\t72.4\t47.8\t46.5\t95.5\t77.6\t798
17: 56\t60\t43.9\t67.4\t52.2\t56.3\t97.5\t89.5\t830
And when I use read.table as suggested in the comments, I get strange output (the data is not read in as neatly as expected):
> beef=read.table("beef.txt", header = TRUE, sep = " ", comment.char="%")
> beef
YEAR X X.1 X.2
1 1925 NA NA NA
2 1926 NA NA NA
3 1927 NA NA NA
4 1928 NA NA NA
5 1929 NA NA NA
6 1930 NA NA NA
7 1931 NA NA NA
8 1932 NA NA NA
9 1933 NA NA NA
10 1934 NA NA NA
11 1935 NA NA NA
12 1936 NA NA NA
13 1937 NA NA NA
14 1938 NA NA NA
15 1939 NA NA NA
16 1940 NA NA NA
17 1941 NA NA NA
PBE.CBE.PPO.CPO.PFO.DINC.CFO.RDINC.RFP
1 59.7\t58.6\t60.5\t65.8\t65.8\t51.4\t90.9\t68.5\t877
2 59.7\t59.4\t63.3\t63.3\t68\t52.6\t92.1\t69.6\t899
3 63\t53.7\t59.9\t66.8\t65.5\t52.1\t90.9\t70.2\t883
4 71\t48.1\t56.3\t69.9\t64.8\t52.7\t90.9\t71.9\t884
5 71\t49\t55\t68.7\t65.6\t55.1\t91.1\t75.2\t895
6 74.2\t48.2\t59.6\t66.1\t62.4\t48.8\t90.7\t68.3\t874
7 72.1\t47.9\t57\t67.4\t51.4\t41.5\t90\t64\t791
8 79\t46\t49.5\t69.7\t42.8\t31.4\t87.8\t53.9\t733
9 73.1\t50.8\t47.3\t68.7\t41.6\t29.4\t88\t53.2\t752
10 70.2\t55.2\t56.6\t62.2\t46.4\t33.2\t89.1\t58\t811
11 82.2\t52.2\t73.9\t47.7\t49.7\t37\t87.3\t63.2\t847
12 68.4\t57.3\t64.4\t54.4\t50.1\t41.8\t90.5\t70.5\t845
13 73\t54.4\t62.2\t55\t52.1\t44.5\t90.4\t72.5\t849
14 70.2\t53.6\t59.9\t57.4\t48.4\t40.8\t90.6\t67.8\t803
15 67.8\t53.9\t51\t63.9\t47.1\t43.5\t93.8\t73.2\t793
16 63.4\t54.2\t41.5\t72.4\t47.8\t46.5\t95.5\t77.6\t798
17 56\t60\t43.9\t67.4\t52.2\t56.3\t97.5\t89.5\t830
So, thanks to the comments, it turns out the separator wasn't a space but a tab. Here's the correct answer:
> beef=read.table("beef.txt", header = TRUE, sep = "\t", comment.char="%")
> beef
YEAR....PBE CBE PPO CPO PFO DINC CFO RDINC RFP
1 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
2 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899
3 1927 63 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
4 1928 71 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
5 1929 71 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895
6 1930 74.2 48.2 59.6 66.1 62.4 48.8 90.7 68.3 874
7 1931 72.1 47.9 57.0 67.4 51.4 41.5 90.0 64.0 791
8 1932 79 46.0 49.5 69.7 42.8 31.4 87.8 53.9 733
9 1933 73.1 50.8 47.3 68.7 41.6 29.4 88.0 53.2 752
10 1934 70.2 55.2 56.6 62.2 46.4 33.2 89.1 58.0 811
11 1935 82.2 52.2 73.9 47.7 49.7 37.0 87.3 63.2 847
12 1936 68.4 57.3 64.4 54.4 50.1 41.8 90.5 70.5 845
13 1937 73 54.4 62.2 55.0 52.1 44.5 90.4 72.5 849
14 1938 70.2 53.6 59.9 57.4 48.4 40.8 90.6 67.8 803
15 1939 67.8 53.9 51.0 63.9 47.1 43.5 93.8 73.2 793
16 1940 63.4 54.2 41.5 72.4 47.8 46.5 95.5 77.6 798
17 1941 56 60.0 43.9 67.4 52.2 56.3 97.5 89.5 830
beef=read.table("beef.txt", header = TRUE, sep = " ", comment.char="%")
beef=read.table("beef.txt", header = TRUE, sep = "\t", comment.char="%") #after update
A recent blog post shows that fread is the fastest; otherwise the functions behave much the same. The link: http://statcompute.wordpress.com/2014/02/11/efficiency-of-importing-large-csv-files-in-r/
In your case it doesn't really matter; use the one you find most comfortable.
An example using fread follows (assuming tab separators):
library(data.table)
a = fread("data.csv", skip=26)
a
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP
1: 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
2: 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899
3: 1927 63.0 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
4: 1928 71.0 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
5: 1929 71.0 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895
The correct answer is as follows:
beef=read.table("beef.txt", header = TRUE, sep = "", comment.char="%")
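To illustrate why sep = "" works where sep = "\t" does not: with sep = "", read.table splits on any run of whitespace (spaces or tabs), so a file that mixes the two is still parsed into separate columns. The sketch below uses a made-up four-column excerpt in which YEAR and PBE are separated by spaces and the remaining columns by tabs; that exact layout is an assumption about the file, inferred from the YEAR....PBE column name shown above.

```r
# Hypothetical excerpt: spaces after the first column, tabs elsewhere.
raw <- c(
  "% The Data:",
  "YEAR    PBE\tCBE\tPPO",
  "1925    59.7\t58.6\t60.5",
  "1926    59.7\t59.4\t63.3"
)

# sep = "" treats any whitespace as a delimiter: four clean columns.
beef_ws <- read.table(text = raw, header = TRUE, sep = "", comment.char = "%")

# sep = "\t" splits on tabs only, so YEAR and PBE stay fused in one column.
beef_tab <- read.table(text = raw, header = TRUE, sep = "\t", comment.char = "%")

names(beef_ws)
names(beef_tab)
str(beef_ws)
```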
Here is an alternative in base using readLines. This approach is much more complex, but returns numeric data ready for analysis. However, you must manually count the columns in the original data file and later reassign column names.
EDIT
At the bottom I added a generalized version that does not require manually counting the columns or manually adding column names.
Note that either version works regardless of whether the data are delimited with spaces or tabs.
Here is the code for the original version:
my.data <- readLines('c:/users/mmiller21/simple R programs/beef.txt')
ncols <- 10
header.info <- ifelse(substr(my.data, 1, 1) == '%', 1, 0)
my.data2 <- my.data[header.info==0]
my.data3 <- data.frame(matrix(unlist(strsplit(my.data2[-1], "[^0-9,.]+")), ncol=ncols, byrow=TRUE), stringsAsFactors = FALSE)
my.data4 <- as.data.frame(apply(my.data3, 2, function(x) as.numeric(x)))
colnames(my.data4) <- c('YEAR', 'PBE', 'CBE', 'PPO', 'CPO', 'PFO', 'DINC', 'CFO', 'RDINC', 'RFP')
> my.data4
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP
[1,] 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
[2,] 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899
[3,] 1927 63.0 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
[4,] 1928 71.0 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
[5,] 1929 71.0 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895
Here are the contents of the original data file:
% http://lib.stat.cmu.edu/DASL/Datafiles/agecondat.html
%
% Datafile Name: Agricultural Economics Studies
% Datafile Subjects: Agriculture , Economics , Consumer
% Story Names: Agricultural Economics Studies
% Reference: F.B. Waugh, Graphic Analysis in Agricultural Economics,
% Agricultural Handbook No. 128, U.S. Department of Agriculture, 1957.
% Authorization: free use
% Description: Price and consumption per capita of beef and pork
% annually from 1925 to 1941 together with other variables relevant to
% an economic analysis of price and/or consumption of beef and pork
% over the period.
% Number of cases: 17
% Variable Names:
%
% PBE = Price of beef (cents/lb)
% CBE = Consumption of beef per capita (lbs)
% PPO = Price of pork (cents/lb)
% CPO = Consumption of pork per capita (lbs)
% PFO = Retail food price index (1947-1949 = 100)
% DINC = Disposable income per capita index (1947-1949 = 100)
% CFO = Food consumption per capita index (1947-1949 = 100)
% RDINC = Index of real disposable income per capita (1947-1949 = 100)
% RFP = Retail food price index adjusted by the CPI (1947-1949 = 100)
%
% The Data:
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP
1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
1926 59.7 59.4 63.3 63.3 68 52.6 92.1 69.6 899
1927 63 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
1928 71 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
1929 71 49 55 68.7 65.6 55.1 91.1 75.2 895
Here is the code for the generalized version:
my.data <- readLines('c:/users/mmiller21/simple R programs/beef.txt')
header.info <- ifelse(substr(my.data, 1, 1) == '%', 1, 0)
my.data2 <- my.data[header.info==0]
ncols <- length(read.table(textConnection(my.data2[1])))
my.data3 <- data.frame(matrix(unlist(strsplit(my.data2[-1], "[^0-9,.]+")), ncol=ncols, byrow=TRUE), stringsAsFactors = FALSE)
my.data4 <- as.data.frame(apply(my.data3, 2, function(x) as.numeric(x)))
#colnames(my.data4) <- c('YEAR', 'PBE', 'CBE', 'PPO', 'CPO', 'PFO', 'DINC', 'CFO', 'RDINC', 'RFP')
#my.data4
colnames(my.data4) <- read.table(textConnection(my.data2[1]), colClasses = c('character'))
my.data4
colSums(my.data4)
sum(my.data4$PPO)

Multiple regression in R: Variable not found in data.frame

Here's my data.frame :: beef
> head(beef)
YEAR....PBE CBE PPO CPO PFO DINC CFO RDINC RFP
1 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
2 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899
3 1927 63 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
4 1928 71 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
5 1929 71 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895
6 1930 74.2 48.2 59.6 66.1 62.4 48.8 90.7 68.3 874
And
dput(head(beef))
structure(list(YEAR....PBE = structure(1:6, .Label = c("1925 59.7",
"1926 59.7", "1927 63", "1928 71", "1929 71", "1930 74.2",
"1931 72.1", "1932 79", "1933 73.1", "1934 70.2",
"1935 82.2", "1936 68.4", "1937 73", "1938 70.2",
"1939 67.8", "1940 63.4", "1941 56"), class = "factor"),
CBE = c(58.6, 59.4, 53.7, 48.1, 49, 48.2), PPO = c(60.5,
63.3, 59.9, 56.3, 55, 59.6), CPO = c(65.8, 63.3, 66.8, 69.9,
68.7, 66.1), PFO = c(65.8, 68, 65.5, 64.8, 65.6, 62.4), DINC = c(51.4,
52.6, 52.1, 52.7, 55.1, 48.8), CFO = c(90.9, 92.1, 90.9,
90.9, 91.1, 90.7), RDINC = c(68.5, 69.6, 70.2, 71.9, 75.2,
68.3), RFP = c(877L, 899L, 883L, 884L, 895L, 874L)), .Names = c("YEAR....PBE",
"CBE", "PPO", "CPO", "PFO", "DINC", "CFO", "RDINC", "RFP"), row.names = c(NA,
6L), class = "data.frame")
I want to create a multiple linear regression model for PBE depending on the other variables. Following the tutorial in this link, I think I should use something like the following code:
> lm(formula = PBE ~ CBE + PBO + CPO + PFO +
+ DINC + CFO+RDINC+RFP+YEAR, data = beef)
Error in eval(expr, envir, enclos) : object 'PBE' not found
So I decided to try the following, but each attempt gives an error:
> lm(formula=PBE~YEAR,data=beef)
Error in eval(expr, envir, enclos) : object 'PBE' not found
> lm(formula=beef$PBE~beef$YEAR)
Error in model.frame.default(formula = beef$PBE ~ beef$YEAR, drop.unused.levels = TRUE) :
invalid type (NULL) for variable 'beef$PBE'
Can you please give me some insight into where the typo or error lies?
P.S.: I read the file using beef=read.table("beef.txt", header = TRUE, sep = "\t", comment.char="%") and the file looks like the following:
% http://lib.stat.cmu.edu/DASL/Datafiles/agecondat.html
%
% Datafile Name: Agricultural Economics Studies
% Datafile Subjects: Agriculture , Economics , Consumer
% Story Names: Agricultural Economics Studies
% Reference: F.B. Waugh, Graphic Analysis in Agricultural Economics,
% Agricultural Handbook No. 128, U.S. Department of Agriculture, 1957.
% Authorization: free use
% Description: Price and consumption per capita of beef and pork
% annually from 1925 to 1941 together with other variables relevant to
% an economic analysis of price and/or consumption of beef and pork
% over the period.
% Number of cases: 17
% Variable Names:
%
% PBE = Price of beef (cents/lb)
% CBE = Consumption of beef per capita (lbs)
% PPO = Price of pork (cents/lb)
% CPO = Consumption of pork per capita (lbs)
% PFO = Retail food price index (1947-1949 = 100)
% DINC = Disposable income per capita index (1947-1949 = 100)
% CFO = Food consumption per capita index (1947-1949 = 100)
% RDINC = Index of real disposable income per capita (1947-1949 = 100)
% RFP = Retail food price index adjusted by the CPI (1947-1949 = 100)
%
% The Data:
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP
1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
1926 59.7 59.4 63.3 63.3 68 52.6 92.1 69.6 899
1927 63 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
1928 71 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
1929 71 49 55 68.7 65.6 55.1 91.1 75.2 895
1930 74.2 48.2 59.6 66.1 62.4 48.8 90.7 68.3 874
1931 72.1 47.9 57 67.4 51.4 41.5 90 64 791
Here's the result of View(beef) as suggested by Patrick:
You need to go back and look at the file that you loaded these data into R from. The output from head() suggests that the first variable is YEAR....PBE and that the PBE data has gotten merged with the YEAR variable, probably because of some issue with the delimiters in use in the file you read in. Go back and check the file carefully.
One way to do this from within R is to use count.fields(), which you pass the filename to check. Do read ?count.fields as you will potentially need to set the sep and quote arguments in order to match the file you read the data from. The function will tell you how many fields (variables) it finds; compare this with the known number of variables.
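A minimal sketch of the count.fields() check, assuming a made-up excerpt written to a temporary file in which the first two columns are separated by spaces and the rest by tabs (the excerpt and the name tmp are illustrative, not the real file):

```r
# Hypothetical excerpt: header plus two data rows, mixed delimiters.
raw <- c(
  "YEAR    PBE\tCBE\tPPO",
  "1925    59.7\t58.6\t60.5",
  "1926    59.7\t59.4\t63.3"
)
tmp <- tempfile(fileext = ".txt")
writeLines(raw, tmp)

# Number of fields per line when splitting on tabs only ...
n_tab <- count.fields(tmp, sep = "\t")
# ... and when splitting on any whitespace.
n_ws <- count.fields(tmp, sep = "")

n_tab
n_ws
unlink(tmp)  # remove the temporary file
```

If the count under the separator you used is lower than the known number of variables, some columns have been merged, which is what the YEAR....PBE name indicates.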
From your edit, it is clear that something like what I describe above has happened:
> names(beef)
[1] "YEAR....PBE" "CBE" "PPO" "CPO" "PFO"
[6] "DINC" "CFO" "RDINC" "RFP"
It seems the file is not all/fully/truly Tab-delimited. I was able to read the bit of data you included with:
beef <- read.table("file.name", header = TRUE, sep = "", comment.char = "%")
> head(beef)
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP
1 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
2 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899
3 1927 63.0 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
4 1928 71.0 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
5 1929 71.0 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895
6 1930 74.2 48.2 59.6 66.1 62.4 48.8 90.7 68.3 874
> str(beef)
'data.frame': 7 obs. of 10 variables:
$ YEAR : int 1925 1926 1927 1928 1929 1930 1931
$ PBE : num 59.7 59.7 63 71 71 74.2 72.1
$ CBE : num 58.6 59.4 53.7 48.1 49 48.2 47.9
$ PPO : num 60.5 63.3 59.9 56.3 55 59.6 57
$ CPO : num 65.8 63.3 66.8 69.9 68.7 66.1 67.4
$ PFO : num 65.8 68 65.5 64.8 65.6 62.4 51.4
$ DINC : num 51.4 52.6 52.1 52.7 55.1 48.8 41.5
$ CFO : num 90.9 92.1 90.9 90.9 91.1 90.7 90
$ RDINC: num 68.5 69.6 70.2 71.9 75.2 68.3 64
$ RFP : int 877 899 883 884 895 874 791
