I am an R novice and currently analyzing my data, so forgive me if my error is basic.
I am trying to use the sm.density.compare function in the sm package to compare the abundance and diversity of parasites across host species and regions.
The data I am trying to analyze is similar to the iris dataset. The function works on the iris data, but when I run it on my own data I get the error "Error in x * w : non-numeric argument to binary operator".
Here is my code:
sm.density.compare(Data_Sheets_FINAL$Total.Endos, Data_Sheets_FINAL$Species)
The species data is broken into three groups (AS, CS, and TSE). Here is my Total.Endos data:
[1] 221 46 413 477 29 294 196 298 592 331 20 339 36 123 119 158 34 258 264 160 224 184 452
[24] 103 17 133 128 311 13 98 387 152 74 1058 13 110 66 9 17 5 22 530 146 73 44 277
[47] 75 27 68 49 115 67 104 108 256 762 93 21 1604 47 13 79 213 32 15 10 38 369 108
[70] 270 70 432 246 14 72 12 34 79 167
Any ideas?
This is the error message that you get if your Species column is stored as character strings rather than as a factor. Try:
Data_Sheets_FINAL$Species = factor(Data_Sheets_FINAL$Species)
sm.density.compare(Data_Sheets_FINAL$Total.Endos, Data_Sheets_FINAL$Species)
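A minimal sketch of the check and the fix, using a small made-up data frame (the name toy and its values are hypothetical, not the asker's data). It only shows the type conversion; the sm.density.compare call is left out so the snippet runs without the sm package.

```r
# Hypothetical stand-in for Data_Sheets_FINAL
toy <- data.frame(
  Total.Endos = c(221, 46, 413, 477, 29, 294),
  Species = c("AS", "CS", "TSE", "AS", "CS", "TSE"),
  stringsAsFactors = FALSE
)

class(toy$Species)                  # "character": the grouping is plain strings

# Convert the grouping column to a factor, as suggested above
toy$Species <- factor(toy$Species)

class(toy$Species)                  # "factor"
levels(toy$Species)                 # "AS" "CS" "TSE"
```

Checking class() on the grouping column before calling the plotting function is a quick way to tell whether this is the cause.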
I'd like to do several manipulations with the datasets that are built into R from the packages that I have. First, I made a vector of the datasets' names, but when I tried to filter for the datasets that have only one column, I got an error saying that the length of the argument is 0. Here is the code:
for (i in datasets){
  if (ncol(i)==1){
    dataset <- i
    datasets <- c(dataset, datasets)
  }
}
It treats the names of the datasets as a character vector.
Here is the head of the aforementioned vector: [1] ability.cov airmiles AirPassengers airquality anscombe attenu. It's silly, but how could I treat the entries as dataframes?
I don't fully understand your logic, but based on your code, you want to identify which datasets have exactly one column by testing ncol(x) == 1. If that's true, then you need to deal with some issues:
The various structures of the datasets. ncol returns the number of columns for a data.frame or a matrix, but not for a time series. For example, ncol(anscombe) returns 8 but ncol(AirPassengers) returns NULL. If you decide to use ncol, you need to coerce each dataset to a data.frame first by using as.data.frame.
Indexing the character vector of dataset names. You need to pass the dataset itself, not its name as a character string, to as.data.frame. One way of doing this is eval(parse(text=the_name)).
The way to store the result. You can use c() to combine the results, but the datasets will be flattened into vectors and lose their initial structures. I recommend using a list to preserve the data frame structure of each dataset.
Here is one possible solution based on those considerations:
datasets <- c("ability.cov", "airmiles", "AirPassengers", "airquality", "anscombe", "attenu")
# One slot per name; slots for multi-column datasets stay NULL
single_col_datasets <- vector('list', length(datasets))
for (i in seq_along(datasets)){
  # Look the object up by name and coerce it so that ncol() is always defined
  df <- as.data.frame(eval(parse(text = datasets[i])))
  if (ncol(df) == 1){
    names(df) <- datasets[i]
    single_col_datasets[[i]] <- df
  }
}
# Drop the empty slots, then pair the result with the original names
not.null.element <- single_col_datasets[lengths(single_col_datasets) != 0]
new.datasets <- list(not.null.element, datasets)
Here is the result:
new.datasets
[[1]]
[[1]][[1]]
airmiles
1 412
2 480
3 683
4 1052
5 1385
6 1418
7 1634
8 2178
9 3362
10 5948
11 6109
12 5981
13 6753
14 8003
15 10566
16 12528
17 14760
18 16769
19 19819
20 22362
21 25340
22 25343
23 29269
24 30514
[[1]][[2]]
AirPassengers
1 112
2 118
3 132
4 129
5 121
6 135
7 148
8 148
9 136
10 119
11 104
12 118
13 115
14 126
15 141
16 135
17 125
18 149
19 170
20 170
21 158
22 133
23 114
24 140
25 145
26 150
27 178
28 163
29 172
30 178
31 199
32 199
33 184
34 162
35 146
36 166
37 171
38 180
39 193
40 181
41 183
42 218
43 230
44 242
45 209
46 191
47 172
48 194
49 196
50 196
51 236
52 235
53 229
54 243
55 264
56 272
57 237
58 211
59 180
60 201
61 204
62 188
63 235
64 227
65 234
66 264
67 302
68 293
69 259
70 229
71 203
72 229
73 242
74 233
75 267
76 269
77 270
78 315
79 364
80 347
81 312
82 274
83 237
84 278
85 284
86 277
87 317
88 313
89 318
90 374
91 413
92 405
93 355
94 306
95 271
96 306
97 315
98 301
99 356
100 348
101 355
102 422
103 465
104 467
105 404
106 347
107 305
108 336
109 340
110 318
111 362
112 348
113 363
114 435
115 491
116 505
117 404
118 359
119 310
120 337
121 360
122 342
123 406
124 396
125 420
126 472
127 548
128 559
129 463
130 407
131 362
132 405
133 417
134 391
135 419
136 461
137 472
138 535
139 622
140 606
141 508
142 461
143 390
144 432
[[2]]
[1] "ability.cov" "airmiles" "AirPassengers" "airquality" "anscombe" "attenu"
You can use the get function:
for (i in datasets){
if (ncol(get(i))==1){
dataset <- i
datasets <- c(dataset, datasets)
}
}
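Note that ncol() returns NULL for a plain time series such as AirPassengers, so a bare ncol(get(i)) == 1 test can still stop with the length-zero error. As a sketch that combines get() with the as.data.frame() coercion described earlier (the names is_single_col and single_col are only illustrative):

```r
datasets <- c("ability.cov", "airmiles", "AirPassengers",
              "airquality", "anscombe", "attenu")

# TRUE for each name whose object has exactly one column once coerced
is_single_col <- vapply(
  datasets,
  function(nm) ncol(as.data.frame(get(nm))) == 1,
  logical(1)
)

# Named list holding the matching datasets as data frames
single_col <- lapply(datasets[is_single_col],
                     function(nm) as.data.frame(get(nm)))
names(single_col) <- datasets[is_single_col]

names(single_col)   # "airmiles" "AirPassengers"
```

Keeping the results in a separate named list also avoids modifying the datasets vector while looping over it.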
I have fit a linear regression model to some data in Stata and now I want to generate the Residual Autocorrelation Plot with respect to the variable id.
Below you can find the variables generated from the regression:
clear
input id response pred_response stud_res
101 72 57.55613 1.512287
102 61 51.24638 1.010817
103 49 56.94838 -0.8237054
104 48 43.1188 0.5078933
105 51 60.35182 -0.9997848
106 49 43.1188 0.6123365
107 50 43.60501 0.6678697
108 58 67.50063 -1.00277
109 50 45.17883 0.5053187
110 51 45.66593 0.5525671
111 59 62.28483 -0.3425483
112 65 52.94175 1.259024
113 57 59.49549 -0.2584414
114 53 59.00929 -0.6238151
115 74 68.10928 0.6212816
116 50 54.2797 -0.4418168
117 84 68.35238 1.671826
118 46 50.27308 -0.4435438
119 52 48.0915 0.4033695
120 64 58.04234 0.6188389
121 59 45.17972 1.444254
122 55 54.51646 0.0500989
124 46 44.33432 0.1745929
125 52 51.48948 0.0526441
126 63 64.71586 -0.1833892
127 52 51.00238 0.1038181
128 42 43.84811 -0.1929091
129 57 63.62279 -0.6922547
130 23 42.75415 -2.098808
131 65 58.88685 0.6355278
132 38 48.45526 -1.100601
133 59 54.77137 0.4510341
134 26 43.72021 -1.880954
135 53 60.46791 -0.7770496
136 50 40.68689 0.9796554
137 56 51.9748 0.4227943
138 49 65.43971 -1.751305
139 76 68.83858 0.7565064
140 68 66.53456 0.1536334
141 60 49.66532 1.077015
142 46 43.72021 0.2374953
143 57 59.85926 -0.2981544
144 45 48.45615 -0.3568231
145 46 45.42282 0.0596576
146 64 67.13597 -0.3291895
147 40 41.9024 -0.1997022
148 62 64.7104 -0.283202
149 13 45.78748 -3.629334
150 79 63.25813 1.66337
151 61 59.86015 0.1180355
152 46 42.02484 0.4124526
153 50 45.66593 0.4487194
154 48 51.61103 -0.3727813
155 65 59.37306 0.5858857
156 62 69.08168 -0.748562
157 56 54.5228 0.1524598
158 54 52.09724 0.196739
159 72 60.46156 1.209799
160 57 60.83167 -0.4032753
161 50 41.6593 0.8780965
162 65 55.97507 0.9392686
163 56 66.28511 -1.086957
201 54 49.5392 0.4779044
202 57 50.02451 0.7322617
203 48 49.18 -0.1222386
204 41 41.66019 -0.0684602
205 34 38.38376 -0.4576099
206 54 54.511 -0.0545433
207 38 40.68777 -0.2798446
208 49 41.77539 0.7603746
209 58 54.63255 0.3589811
210 14 47.24063 -3.676064
211 40 39.47226 0.0554914
212 13 39.71537 -2.931103
213 51 45.17426 0.611295
214 44 54.39491 -1.084383
216 42 48.08604 -0.6381954
217 55 46.38978 0.8958285
301 62 63.86043 -0.1954589
302 37 43.23401 -0.6509517
303 46 44.57196 0.147607
304 59 59.8538 -0.0890346
305 35 41.66019 -0.6924483
306 70 66.77221 0.3416052
307 56 58.15843 -0.2244185
308 45 46.99207 -0.2117317
309 50 47.47739 0.2635025
310 52 46.87598 0.5302449
311 52 59.84834 -0.8546749
312 83 49.78776 3.674294
313 57 54.03025 0.3084902
314 38 44.57196 -0.680949
315 40 48.81446 -0.9177504
410 48 39.59927 0.8789283
415 50 40.92999 0.9539063
605 42 36.31649 0.6024827
end
When I generate this graph, the default range for the vertical axis is set to encompass the estimated autocorrelation values. However, I want to extend this axis range over all allowable correlation values (i.e., from negative one to positive one). Unfortunately, when I do this, the axis labels do not adjust to the new range, and the labels get squashed.
Below is my code and output:
* Generate the residual autocorrelation plot
* (taken with respect to id variable)
tsset id
ac stud_res, lags(12) yscale(r(-1,1)) ///
title("Residual Autocorrelation Plot") ///
ytitle("Estimated Autocorrelation")
How can I get a plot with the desired extension to the vertical axis, but without having the labels squashed only onto the range of the plot values?
You have two choices and both involve adjusting the ylabel() option while removing yscale():
ac stud_res, lags(12) ylabel(-1(0.4)1) title("Residual Autocorrelation Plot") ///
ytitle("Estimated Autocorrelation")
and
ac stud_res, lags(12) ylabel(#5) title("Residual Autocorrelation Plot") ///
ytitle("Estimated Autocorrelation")
I conducted a Kaplan-Meier analysis in R looking at the survival of fibres in a fatigue test. I have not predefined the upper limit for the restricted mean. How does R calculate or decide the upper limit in order to calculate the restricted mean?
I am using the following code:
fit = survfit(Surv(cyclicdata[,1], cyclicdata[,2]) ~ cyclicdata[,3])
print(fit, print.rmean=TRUE,rmean="common")
From experience with this calculation over the past few years, I believe that the restricted mean is by default calculated as the mean of the longest lives from each group.
For example, I have run into this with two groups having lives as below:
Group 1 Lives:
22 23 25 26 30 32 32 34 37 38 40 43 45 48 48 54
56 59 60 62 70 72 73 73 76 77 78 78 82 86 86 92
92 92 95 98 99 102 104 106 107 112 114 114 115 119 120 123
132 134 135 151 154 157 169 180
Group 2 Lives:
5 7 30 41 44 56 59 64 67 79 86 101 110 120 120 123
125 138 150 163 163 164 167 199 201 214 235 236 237 242 245 270
272 274 282 283 284 287 296 300 300 310 314 321 322 325 340 342
345 355 371 375 376 398 414 419 422 428 442 444 449 474 511 516 549
552 560 563 581 608 618 628 637 638 675 685 702 782 782 817 885
886 946 947 951
When I run my survival fit and print the output, I get this:
* restricted mean with upper limit = 566
This is equal to:
> mean(c(max(Group1$Lives), max(Group2$Lives)))
[1] 565.5
or
> (951+180)/2
[1] 565.5
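If you would rather not depend on the default, print.survfit also accepts a number for rmean, which sets the upper limit explicitly. A sketch using the aml data that ships with the survival package (the limit of 100 is arbitrary and chosen only for illustration; it is not taken from the fibre data above):

```r
library(survival)

# aml ships with the survival package; x is the treatment group
fit <- survfit(Surv(time, status) ~ x, data = aml)

# Restricted mean with an explicit upper limit of 100 instead of the default
print(fit, print.rmean = TRUE, rmean = 100)

# The same summary as a matrix, for further use
tab <- summary(fit, rmean = 100)$table
tab
```

The printed footnote then reports the limit you supplied, so the result no longer depends on the longest observed lives.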
Consider the following two code snippets.
A:
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv", destfile = "./data/gdp.csv", method = "curl" )
gdp <- read.csv('./data/gdp.csv', header=F, skip=5, nrows=190) # Specify nrows, get correct answer
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv", destfile = "./data/education.csv", method = "curl" )
education = read.csv('./data/education.csv')
mergedData <- merge(gdp, education, by.x='V1', by.y='CountryCode')
# No need to remove unranked countries because we specified nrows
# No need to convert V2 from factor to numeric
sortedMergedData = arrange(mergedData, -V2)
sortedMergedData[13,1] # Get KNA, correct answer
B:
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv", destfile = "./data/gdp.csv", method = "curl" )
gdp <- read.csv('./data/gdp.csv', header=F, skip=5) # Don't specify nrows, get incorrect answer
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv", destfile = "./data/education.csv", method = "curl" )
education = read.csv('./data/education.csv')
mergedData <- merge(gdp, education, by.x='V1', by.y='CountryCode')
mergedData = mergedData[which(mergedData$V2 != ""),] # Remove unranked countries
mergedData$V2 = as.numeric(mergedData$V2) # make V2 a numeric column
sortedMergedData = arrange(mergedData, -V2)
sortedMergedData[13,1] # Get SRB, incorrect answer
I would think the two code snippets would be identical, except that in A you never add the unranked countries to your dataframe and in B you add them but then remove them. Why is the sorting different for these two code snippets?
The file downloads are from Coursera's Getting and Cleaning Data class (Quiz 3, Question 3).
Edit: To avoid security concerns, I've pasted the raw .csv files below
gdp.csv - http://pastebin.com/raw.php?i=4aRZwBRd
education.csv - http://pastebin.com/raw.php?i=0pbhDCSX
Edit2: The problem is occurring in the as.numeric step. For case B, here is mergedData$V2 before and after mergedData$V2 = as.numeric(mergedData$V2) is applied:
> mergedData$V2
[1] 161 105 60 125 32 26 133 172 12 27 68 162 25 140 128 59 76 93
[19] 138 111 69 169 149 96 7 153 113 167 117 165 11 20 36 2 99 98
[37] 121 30 182 166 81 67 102 51 4 183 33 72 48 64 38 159 13 103
[55] 85 43 155 5 185 109 6 114 86 148 175 176 110 42 178 77 160 37
[73] 108 71 139 58 16 10 46 22 47 122 40 9 116 92 3 50 87 145
[91] 120 189 178 15 146 56 136 83 168 171 70 163 84 74 94 82 62 147
[109] 141 132 164 14 188 135 129 137 151 130 118 154 127 152 34 123 144 39
[127] 126 18 23 107 55 66 44 89 49 41 187 115 24 61 45 97 54 52
[145] 8 142 19 73 119 35 174 157 100 88 186 150 63 80 21 158 173 65
[163] 124 156 31 143 91 170 184 101 79 17 190 95 106 53 78 1 75 180
[181] 29 57 177 181 90 28 112 104 134
194 Levels: .. Not available. 1 10 100 101 102 103 104 105 106 107 ... Note: Rankings include only those economies with confirmed GDP estimates. Figures in italics are for 2011 or 2010.
> mergedData$V2 = as.numeric(mergedData$V2)
> mergedData$V2
[1] 72 10 149 32 118 111 41 84 26 112 157 73 110 49 35 147 166 185
[19] 46 17 158 80 58 188 159 63 19 78 23 76 15 105 122 104 191 190
[37] 28 116 94 77 172 156 7 139 126 95 119 162 135 153 124 69 37 8
[55] 176 130 65 137 97 14 148 20 177 57 87 88 16 129 90 167 71 123
[73] 13 161 47 146 70 4 133 107 134 29 127 181 22 184 115 138 178 54
[91] 27 101 90 59 55 144 44 174 79 83 160 74 175 164 186 173 151 56
[109] 50 40 75 48 100 43 36 45 61 38 24 64 34 62 120 30 53 125
[127] 33 91 108 12 143 155 131 180 136 128 99 21 109 150 132 189 142 140
[145] 170 51 102 163 25 121 86 67 5 179 98 60 152 171 106 68 85 154
[163] 31 66 117 52 183 82 96 6 169 81 103 187 11 141 168 3 165 92
[181] 114 145 89 93 182 113 18 9 42
Can anyone explain why the numbers change when I apply as.numeric()?
The real reason for the different results is that in the second case the full dataset includes some footer notes, which read.csv also read. Because of the character elements in the footer, most of the columns ended up as class 'factor'. This could have been avoided either by
reading only the data rows with the nrows argument of read.csv (as in snippet A), since skip only drops lines at the top of the file
using stringsAsFactors=FALSE in the read.csv call, along with leaving out the footer lines.
Calling as.numeric() on a factor returns the integer codes of its levels rather than the numbers printed as labels, and the levels are sorted as strings ("1", "10", "100", ...), so the column ended up ordered by the factor's levels.
If you have already read the files without skipping the lines, convert the columns to their respective classes. If a column should be numeric, convert it with as.numeric(as.character(df$column)) or as.numeric(levels(df$column))[df$column].
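A small sketch of the difference, using a made-up factor of rank labels (the values are illustrative, not taken from the GDP file):

```r
# A factor whose labels look like numbers
f <- factor(c("161", "105", "60", "2"))

levels(f)                              # "105" "161" "2" "60" (sorted as strings)

# as.numeric() on the factor returns the level codes, not the labels
codes <- as.numeric(f)
codes                                  # 2 1 4 3

# Both recommended conversions recover the original numbers
values_a <- as.numeric(as.character(f))
values_b <- as.numeric(levels(f))[f]
values_a                               # 161 105 60 2
```

The codes are positions in the sorted level set, which is why the ranks in the question appear to change.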
Like the title says I need a for loop which will write every number from 1 to 200 that is evenly divided by 3.
Every other method posted so far generates the 1:200 vector then throws away two thirds of it. What a waste. In an attempt to be eco-conscious, this method does not waste any electrons:
seq(3,200,by=3)
You don't need a for loop; use the which function instead, as in:
which(1:200 %% 3 == 0)
[1] 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81
[28] 84 87 90 93 96 99 102 105 108 111 114 117 120 123 126 129 132 135 138 141 144 147 150 153 156 159 162
[55] 165 168 171 174 177 180 183 186 189 192 195 198
Two other alternatives:
c(1:200)[c(F, F, T)]
c(1:200)[1:200 %% 3 == 0]
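If an explicit for loop is required, as the question asks, one possible sketch is the following (the name divisible is illustrative). It collects the matches in a vector and checks them against the seq() answer above.

```r
divisible <- integer(0)

for (n in 1:200) {
  # Keep n only when it leaves no remainder on division by 3
  if (n %% 3 == 0) {
    divisible <- c(divisible, n)
  }
}

print(divisible)

# Same values as the vectorised answer
all(divisible == seq(3, 200, by = 3))   # TRUE
```

Growing a vector inside a loop is fine at this size, but the vectorised versions are the more idiomatic choice in R.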