Good morning,
I need to read the following .data file: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data
For this, I tried the following without success:
f <- file("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data", open = "r", encoding = "UTF-16LE")
data <- read.table(f, dec = ",", header = FALSE)
Thank you a lot for your help!
I would try to use the coatless/ucidata package to access the data.
https://github.com/coatless/ucidata
Here you can see how the package loads and processes the data file:
https://github.com/coatless/ucidata/blob/master/data-raw/heart_disease_build.R
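If you would rather read the file directly without the package, the comma-separated processed.cleveland.data file (which uses "?" as the missing-value marker) is usually an easier target than the raw cleveland.data; a minimal sketch, assuming that processed file is acceptable and the usual 14 column names apply:
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
          "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
# "?" marks missing values in this file
heart <- read.csv(url, header = FALSE, na.strings = "?", col.names = cols)
head(heart)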
If you wish to try out the package, you will need devtools installed. Here is what you can try:
# install.packages("devtools")
devtools::install_github("coatless/ucidata")
# load data
data("heart_disease_cl", package = "ucidata")
# show beginning rows of data
head(heart_disease_cl)
Output
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
1 63 Male typical angina 145 233 1 probable/definite hypertrophy 150 No 2.3 downsloping 0 fixed defect 0
2 67 Male asymptomatic 160 286 0 probable/definite hypertrophy 108 Yes 1.5 flat 3 normal 2
3 67 Male asymptomatic 120 229 0 probable/definite hypertrophy 129 Yes 2.6 flat 2 reversable defect 1
4 37 Male non-anginal pain 130 250 0 normal 187 No 3.5 downsloping 0 normal 0
5 41 Female atypical angina 130 204 0 probable/definite hypertrophy 172 No 1.4 upsloping 0 normal 0
6 56 Male atypical angina 120 236 0 normal 178 No 0.8 upsloping 0 normal 0
I found another solution with RCurl (shown here with a different UCI dataset, the heart failure clinical records):
library(RCurl)
download <- getURL("http://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv")
data <- read.csv(text = download)
head(data)
# Output:
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine
1 75 0 582 0 20 1 265000 1.9
2 55 0 7861 0 38 0 263358 1.1
3 65 0 146 0 20 0 162000 1.3
4 50 1 111 0 20 0 210000 1.9
5 65 1 160 1 20 0 327000 2.7
6 90 1 47 0 40 1 204000 2.1
serum_sodium sex smoking time DEATH_EVENT
1 130 1 0 4 1
2 136 1 0 6 1
3 129 1 1 7 1
4 137 1 0 7 1
5 116 0 0 8 1
6 132 1 1 8 1
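For a plain CSV served over HTTP like this one, base R can usually read the URL directly, making RCurl optional; a small sketch using the same heart-failure URL:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv"
data <- read.csv(url)  # read.csv accepts a URL as its file argument
head(data)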
I am doing a meta-analysis in R. For each study (variable StudyID) I have multiple effect sizes. For some studies I have the same effect size multiple times depending on the level of acquaintance (variable Familiarity) between the subjects.
head(dat)
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
1 1 3.0 5.0 1 0.0462 4 0 44 1
2 1 5.0 2.5 1 0.1335 4 0 44 1
3 1 2.5 3.0 1 -0.1239 4 0 44 1
4 1 2.5 3.5 1 0.2062 4 0 44 1
5 1 2.5 3.0 1 -0.0370 4 0 44 1
6 1 3.0 5.0 1 -0.3850 4 0 44 1
Those are the first rows of the data set. In total there are over 50 studies. Most studies look like study 1, with the same value in "Familiarity" for all effect sizes. In some studies, there are effect sizes with multiple levels of familiarity, for example study 36, as seen below.
dat[142:160, ]
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
142 36 1.0 4.5 0 0.1233 5.00 0 311 1
143 36 3.5 3.0 0 0.0428 5.00 0 311 1
144 36 1.0 4.5 0 0.0986 5.00 0 311 1
145 36 1.0 4.5 1 -0.0520 5.00 0 311 1
146 36 1.5 2.5 1 -0.0258 5.00 0 311 1
147 36 3.5 3.0 1 0.1104 5.00 0 311 1
148 36 1.0 4.5 1 0.0282 5.00 0 311 1
149 36 1.0 4.5 2 -0.1724 5.00 0 311 1
150 36 3.5 3.0 2 0.2646 5.00 0 311 1
151 36 1.0 4.5 2 -0.1426 5.00 0 311 1
152 37 3.0 4.0 1 0.0118 5.35 0 123 0
153 37 1.0 4.5 1 -0.3205 5.35 0 123 0
154 37 2.5 3.0 1 -0.2356 5.35 0 123 0
155 37 3.0 2.0 1 0.1372 5.35 0 123 0
156 37 2.5 2.5 1 -0.1401 5.35 0 123 0
157 37 3.0 3.5 1 -0.3334 5.35 0 123 0
158 37 2.5 2.5 1 0.0317 5.35 0 123 0
159 37 1.0 3.0 1 -0.3025 5.35 0 123 0
160 37 1.0 3.5 1 -0.3248 5.35 0 123 0
Now, for those studies that include multiple levels of familiarity, I want to take only the rows with one level of familiarity (two separate versions: one with the lower and one with the higher familiarity).
I think this should be possible with the dplyr package, but I have no working code so far.
In a second step I would like to give those rows unique studyIDs for each level of familiarity (so that study 36 becomes three "different" studies).
Thank you in advance!
If you want to use dplyr, you could create an alternate ID or case number by using group_indices():
df <- df %>%
  mutate(case_num = group_indices(.dots = c("studyID", "Familiarity")))
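On newer versions of dplyr, where the .dots interface is deprecated, the same case number can be built with a grouped mutate instead; a minimal sketch, assuming dplyr >= 1.0:
library(dplyr)
df <- df %>%
  group_by(studyID, Familiarity) %>%
  mutate(case_num = cur_group_id()) %>%  # unique integer per studyID/Familiarity combination
  ungroup()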
You could do:
library(dplyr)
df %>%
  group_by(studyID) %>%
  mutate(nDist = n_distinct(Familiarity) > 1) %>%
  ungroup() %>%
  mutate(
    studyID = case_when(nDist ~ paste(studyID, Familiarity, sep = "_"),
                        TRUE ~ as.character(studyID)),
    nDist = NULL
  )
Output:
# A tibble: 19 x 9
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
<chr> <dbl> <dbl> <int> <dbl> <dbl> <int> <int> <int>
1 36_0 1 4.5 0 0.123 5 0 311 1
2 36_0 3.5 3 0 0.0428 5 0 311 1
3 36_0 1 4.5 0 0.0986 5 0 311 1
4 36_1 1 4.5 1 -0.052 5 0 311 1
5 36_1 1.5 2.5 1 -0.0258 5 0 311 1
6 36_1 3.5 3 1 0.110 5 0 311 1
7 36_1 1 4.5 1 0.0282 5 0 311 1
8 36_2 1 4.5 2 -0.172 5 0 311 1
9 36_2 3.5 3 2 0.265 5 0 311 1
10 36_2 1 4.5 2 -0.143 5 0 311 1
11 37 3 4 1 0.0118 5.35 0 123 0
12 37 1 4.5 1 -0.320 5.35 0 123 0
13 37 2.5 3 1 -0.236 5.35 0 123 0
14 37 3 2 1 0.137 5.35 0 123 0
15 37 2.5 2.5 1 -0.140 5.35 0 123 0
16 37 3 3.5 1 -0.333 5.35 0 123 0
17 37 2.5 2.5 1 0.0317 5.35 0 123 0
18 37 1 3 1 -0.302 5.35 0 123 0
19 37 1 3.5 1 -0.325 5.35 0 123 0
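The first part of the question (keeping only one familiarity level per study) can be handled with a grouped filter; a minimal sketch, assuming dat holds the full data set and that "lower"/"higher" refer to the minimum and maximum Familiarity within each study:
library(dplyr)
# version keeping only the lowest familiarity level within each study
dat_low <- dat %>%
  group_by(studyID) %>%
  filter(Familiarity == min(Familiarity)) %>%
  ungroup()
# version keeping only the highest familiarity level within each study
dat_high <- dat %>%
  group_by(studyID) %>%
  filter(Familiarity == max(Familiarity)) %>%
  ungroup()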
I have been trying to generate a dummy variable from the date column for an interval.
Sample data
Date <- seq(as.Date("1988-01-01"), as.Date("2018-12-31"), by="1 day")
DATASET <- data.frame(rnorm(11323), Date)
I would like to create an interval from 20-04 to 20-08 for each year, coded as 1. I would be grateful for a hint with the code for doing this.
You could compare the day of the year. In base R that would be:
DATASET$day_of_year <- as.integer(format(DATASET$Date, "%j"))
DATASET$flag <- +(with(DATASET, ifelse(as.integer(format(Date, "%Y")) %% 4 == 0,
                                       day_of_year %in% 111:233, day_of_year %in% 110:232)))
For leap years, 20-04 is the 111th day of the year and 20-08 is the 233rd; for the rest of the years they are the 110th and 232nd, respectively. We assign 1 when the date falls between those two values.
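For reference, a leap-year-free alternative is to compare the zero-padded month-day string directly; a small sketch, assuming the 20-04 to 20-08 window is inclusive at both ends:
md <- format(DATASET$Date, "%m-%d")
# lexicographic comparison works here because month and day are zero-padded
DATASET$flag <- +(md >= "04-20" & md <= "08-20")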
Maybe you can try the following code to get codes for the interval between 20-04 and 20-08 for each year:
DATASET <- within(DATASET,
                  code <- ave(as.numeric(format(DATASET$Date, "%m%d")),
                              as.numeric(format(DATASET$Date, "%Y")),
                              FUN = function(x) ifelse(x >= 420 & x <= 820, 1, 0)))
and part of the result (around the interval boundaries) is shown below:
> DATASET
    rnorm.11323.       Date code
1   -0.326546058 1988-01-01    0
2   -0.561589735 1988-01-02    0
3   -0.417091199 1988-01-03    0
4   -0.482488496 1988-01-04    0
5    0.039820482 1988-01-05    0
6   -0.285270230 1988-01-06    0
...
109 -0.109549779 1988-04-18    0
110 -0.011355562 1988-04-19    0
111 -1.462229758 1988-04-20    1
112  1.006583367 1988-04-21    1
...
232 -1.038850200 1988-08-19    1
233  0.963345582 1988-08-20    1
234  0.036574589 1988-08-21    0
235 -2.613751531 1988-08-22    0
...
So I have a dataset that includes the lung capacity of certain individuals. I am trying to analyze the data distributions and relations. The only problem is that the data is somewhat incomplete: some of the rows include "N/A" as the lung capacity, which results in a mean and sd of "N/A" for the different subsets. How would I form a subset so that it only includes the data that isn't N/A?
I've tried this:
fData1 = read.table("lung.txt",header=TRUE)
fData2= fData1[fData1$fev!="N/A"]
but this gives me an "undefined columns selected" error.
How can I make it so that I have a data set that excludes the rows with "N/A"?
Here is the beginning of my data set:
id age fev height male smoke
1 72 1.2840 66.5 1 1
2 81 2.5530 67.0 0 0
3 90 2.3830 67.0 1 0
4 72 2.6990 71.5 1 0
5 70 2.0310 62.5 0 0
6 72 2.4100 67.5 1 0
7 75 3.5860 69.0 1 0
8 75 2.9580 67.0 1 0
9 67 1.9160 62.5 0 0
10 70 NA 66.0 0 1
One option is to apply the operations excluding the NA values:
dat <- read.table("lung.txt", header = TRUE, na.strings = c("NA", "N/A"))
mean(dat$fev, na.rm = TRUE)  # mean of fev column
sd(dat$fev, na.rm = TRUE)
If you simply want to get rid of the NAs:
fData1 <- na.omit(fData1)
fData1 <- na.exclude(fData1) # same result
If you'd like to keep the rows with NAs, here are two options:
fData2 <- fData1[is.na(fData1$fev), ]
fData2 <- subset(fData1, is.na(fData1$fev))
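Conversely, to keep only the rows where fev is present (a minimal sketch, assuming fData1 was read as above):
fData2 <- fData1[!is.na(fData1$fev), ]  # keep rows with a non-missing fev
fData2 <- subset(fData1, !is.na(fev))   # same result with subset()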
If you just want to filter out rows with NA values, you can use complete.cases():
> df
id age fev height male smoke
1 1 72 1.284 66.5 1 1
2 2 81 2.553 67.0 0 0
3 3 90 2.383 67.0 1 0
4 4 72 2.699 71.5 1 0
5 5 70 2.031 62.5 0 0
6 6 72 2.410 67.5 1 0
7 7 75 3.586 69.0 1 0
8 8 75 2.958 67.0 1 0
9 9 67 1.916 62.5 0 0
10 10 70 NA 66.0 0 1
> df[complete.cases(df), ]
id age fev height male smoke
1 1 72 1.284 66.5 1 1
2 2 81 2.553 67.0 0 0
3 3 90 2.383 67.0 1 0
4 4 72 2.699 71.5 1 0
5 5 70 2.031 62.5 0 0
6 6 72 2.410 67.5 1 0
7 7 75 3.586 69.0 1 0
8 8 75 2.958 67.0 1 0
9 9 67 1.916 62.5 0 0
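If you are already working in the tidyverse, tidyr's drop_na() does the same job; a sketch, assuming df is the data frame shown above:
library(tidyr)
df_complete <- drop_na(df)       # drop rows with NA in any column
df_fev <- drop_na(df, fev)       # drop rows with NA in fev only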
I read a text file in R that looks like the example below, with 1354896 rows and 5 columns.
I tried read.table() and read.delim() to load the file; however, the format changes after loading: everything is transformed into a single column.
OffsetY=0
GridCornerUL=258 182
GridCornerUR=8450 210
GridCornerLR=8419 8443
GridCornerLL=228 8414
Axis-invertX=0
AxisInvertY=0
swapXY=0
DatHeader=[19..65528] PA-D 102 Full:CLS=8652 RWS=8652 XIN=1 YIN=1 VE=30 2.0 11/04/03 12:49:30 50205710 M10 HG-U133_Plus_2.1sq 6
Algorithm=Percentile
AlgorithmParameters=Percentile:75;CellMargin:2;OutlierHigh:1.500;OutlierLow:1.004;AlgVersion:6.0;FixedCellSize:TRUE;FullFeatureWidth:7;FullFeatureHeight:7;IgnoreOutliersInShiftRows:FALSE;FeatureExtraction:TRUE;PoolWidthExtenstion:2;PoolHeightExtension:2;UseSubgrids:FALSE;RandomizePixels:FALSE;ErrorBasis:StdvMean;StdMult:1.000000
[INTENSITY]
NumberCells=1354896
CellHeader=X Y MEAN STDV NPIXELS
0 0 147.0 23.5 25
1 0 10015.0 1276.7 25
2 0 160.0 24.7 25
3 0 9710.0 1159.8 25
4 0 85.0 14.0 25
5 0 171.0 21.0 25
6 0 11648.0 1678.4 25
7 0 163.0 30.7 25
8 0 12044.0 1430.1 25
9 0 169.0 25.7 25
10 0 11646.0 1925.6 25
11 0 176.0 30.7 25
After reading, the format is changed.
I want to retain the format of rows and columns.
I want to remove all the content before [INTENSITY] (OffsetY, GridCornerUL, and so on) shown in the first file.
You could try:
txt <- readLines("file.txt")
df <- read.csv(text = txt[-(1:grep("NumberCells=\\d+", txt))], check.names = FALSE)
write.csv(df, tf <- tempfile(fileext = ".csv"), row.names = FALSE)
read.csv(tf, check.names = FALSE) # just to verify...
# CellHeader=X Y MEAN STDV NPIXELS
# 1 0 0 147.0 23.5 25
# 2 1 0 10015.0 1276.7 25
# 3 2 0 160.0 24.7 25
# 4 3 0 9710.0 1159.8 25
# 5 4 0 85.0 14.0 25
# 6 5 0 171.0 21.0 25
# 7 6 0 11648.0 1678.4 25
# 8 7 0 163.0 30.7 25
# 9 8 0 12044.0 1430.1 25
# 10 9 0 169.0 25.7 25
# 11 10 0 11646.0 1925.6 25
# 12 11 0 176.0 30.7 25
This omits everything before and including NumberCells=1354896.
As you are using Linux, another option would be to pipe awk output into read.table or fread:
read.table(pipe("awk 'NR==1, /NumberCells/ {next}{print}' Hashim.txt"),
header=TRUE, check.names=FALSE)
# CellHeader=X Y MEAN STDV NPIXELS
#1 0 0 147 23.5 25
#2 1 0 10015 1276.7 25
#3 2 0 160 24.7 25
#4 3 0 9710 1159.8 25
#5 4 0 85 14.0 25
#6 5 0 171 21.0 25
#7 6 0 11648 1678.4 25
#8 7 0 163 30.7 25
#9 8 0 12044 1430.1 25
#10 9 0 169 25.7 25
#11 10 0 11646 1925.6 25
#12 11 0 176 30.7 25
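Since fread was mentioned: data.table's fread can also jump straight to the header line by matching a string, which avoids the external awk call; a sketch, assuming the header row is the first line containing "CellHeader":
library(data.table)
dt <- fread("file.txt", skip = "CellHeader")  # start reading at the first line containing "CellHeader"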
If NumberCells= always appears immediately before the header row, then you can exploit this to tell you the number of lines to skip:
dat<-readLines("file.txt")
read.table(textConnection(dat), header=TRUE, skip=grep("NumberCells", dat))
# CellHeader.X Y MEAN STDV NPIXELS
#1 0 0 147 23.5 25
#2 1 0 10015 1276.7 25
#3 2 0 160 24.7 25
#4 3 0 9710 1159.8 25
#5 4 0 85 14.0 25
#6 5 0 171 21.0 25
#7 6 0 11648 1678.4 25
#8 7 0 163 30.7 25
#9 8 0 12044 1430.1 25
#10 9 0 169 25.7 25
#11 10 0 11646 1925.6 25
#12 11 0 176 30.7 25
Edit
Because your files have a lot of rows, you may want to limit the number of lines that readLines reads in. To do this, you need to know the maximum number of lines before your header row. For instance, if you know your header row will always come within the first 200 lines of the file, you can do:
dat<-readLines("file.txt", n=200)
read.table("file.txt", header=TRUE, skip=grep("NumberCells", dat))
I am new to R and want to use the mlogit function.
However, after putting my data into a data frame and running
x <- mlogit.data(mlogit, choice="PlacedN", shape="long", alt.var="RaceID")
I get the error: duplicate 'row.names' are not allowed.
I can upload my file if needed. I've spent days trying to get this to work, so any help will be appreciated.
You may want to put "RaceID" into the alt.levels argument instead of alt.var. From the mlogit.data help file:
alt.levels
the name of the alternatives: if null, for a wide data.frame, they are guessed from the variable names and the choice variable (both should be the same), for a long data.frame, they are guessed from the alt.var argument.
Give this a try.
library(mlogit)
m <- read.csv("mlogit.csv")
mlogd <- mlogit.data(m, choice="PlacedN", shape="long", alt.levels="RaceID")
head(mlogd)
# RaceID PlacedN RSP TrA JoA aDS bDS mDS aDH bDH mDH LDH MR eMR
# 1.RaceID 20119552 TRUE 3.00 13 12 0 0 0 0 0 0 0 0 131
# 2.RaceID 20119552 FALSE 4.00 23 26 91 94 94 139 153 145 153 150 150
# 3.RaceID 20119552 FALSE 0.83 15 15 99 127 99 150 153 150 153 159 159
# 4.RaceID 20119552 FALSE 18.00 21 15 0 0 0 0 0 0 0 0 131
# 5.RaceID 20119552 FALSE 16.00 16 12 92 127 92 134 135 134 135 136 136
# 6.RaceID 20119617 TRUE 2.50 12 10 0 0 0 0 0 0 0 0 152