Removing data with a non-numeric column value in R

So I have a dataset that includes the lung capacity of certain individuals. I am trying to analyze the distributions and relationships in the data. The only problem is that the data is somewhat incomplete: some of the rows have "N/A" as the lung capacity. This causes an issue because the mean and sd for the different subsets always come out as NA. How would I form a subset that only includes the data that isn't N/A?
I've tried this:
fData1 = read.table("lung.txt",header=TRUE)
fData2= fData1[fData1$fev!="N/A"]
but this gives me an "undefined columns selected" error.
How can I make it so that I have a data set that excludes the rows with "N/A"?
Here is the beginning of my data set:
id age fev height male smoke
1 72 1.2840 66.5 1 1
2 81 2.5530 67.0 0 0
3 90 2.3830 67.0 1 0
4 72 2.6990 71.5 1 0
5 70 2.0310 62.5 0 0
6 72 2.4100 67.5 1 0
7 75 3.5860 69.0 1 0
8 75 2.9580 67.0 1 0
9 67 1.9160 62.5 0 0
10 70 NA 66.0 0 1

One option is to apply the operations while excluding the NA values:
dat <- read.table("lung.txt", header = TRUE, na.strings = c("NA", "N/A")) # treat both "NA" and "N/A" as missing
mean(dat$fev, na.rm = TRUE) # mean of the fev column
sd(dat$fev, na.rm = TRUE)
If you simply want to get rid of the NAs:
fData1 <- na.omit(fData1)
fData1 <- na.exclude(fData1) # same result
If you'd like to keep only the rows with NAs (for example to inspect them), here are two options:
fData2 <- fData1[is.na(fData1$fev), ]
fData2 <- subset(fData1, is.na(fData1$fev))
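Conversely, to keep only the rows where fev is present (which is what the question ultimately needs), the same idioms work with the negation; a minimal sketch:
fData2 <- fData1[!is.na(fData1$fev), ]
fData2 <- subset(fData1, !is.na(fev)) # same result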

If you just want to filter out rows with NA values, you can use complete.cases():
> df
id age fev height male smoke
1 1 72 1.284 66.5 1 1
2 2 81 2.553 67.0 0 0
3 3 90 2.383 67.0 1 0
4 4 72 2.699 71.5 1 0
5 5 70 2.031 62.5 0 0
6 6 72 2.410 67.5 1 0
7 7 75 3.586 69.0 1 0
8 8 75 2.958 67.0 1 0
9 9 67 1.916 62.5 0 0
10 10 70 NA 66.0 0 1
> df[complete.cases(df), ]
id age fev height male smoke
1 1 72 1.284 66.5 1 1
2 2 81 2.553 67.0 0 0
3 3 90 2.383 67.0 1 0
4 4 72 2.699 71.5 1 0
5 5 70 2.031 62.5 0 0
6 6 72 2.410 67.5 1 0
7 7 75 3.586 69.0 1 0
8 8 75 2.958 67.0 1 0
9 9 67 1.916 62.5 0 0
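If you are already using the tidyverse, tidyr offers a one-liner that drops rows with NA in selected columns; a brief sketch, assuming tidyr is installed:
library(tidyr)
df_clean <- drop_na(df, fev) # drop rows where fev is NA; drop_na(df) would check every column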

Related

Cannot read .data file under R?

Good morning,
I need to read the following .data file: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data
For this, I tried the following without success:
f <-file("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data", open="r" ,encoding="UTF-16LE")
data <- read.table(f, dec=",", header=F)
Thank you a lot for your help!
I would try to use the coatless/ucidata package to access the data.
https://github.com/coatless/ucidata
Here you can see how the package loads and processes the data file:
https://github.com/coatless/ucidata/blob/master/data-raw/heart_disease_build.R
If you wish to try out the package, you will need devtools installed. Here is what you can try:
# install.packages("devtools")
devtools::install_github("coatless/ucidata")
# load data
data("heart_disease_cl", package = "ucidata")
# show beginning rows of data
head(heart_disease_cl)
Output
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
1 63 Male typical angina 145 233 1 probable/definite hypertrophy 150 No 2.3 downsloping 0 fixed defect 0
2 67 Male asymptomatic 160 286 0 probable/definite hypertrophy 108 Yes 1.5 flat 3 normal 2
3 67 Male asymptomatic 120 229 0 probable/definite hypertrophy 129 Yes 2.6 flat 2 reversable defect 1
4 37 Male non-anginal pain 130 250 0 normal 187 No 3.5 downsloping 0 normal 0
5 41 Female atypical angina 130 204 0 probable/definite hypertrophy 172 No 1.4 upsloping 0 normal 0
6 56 Male atypical angina 120 236 0 normal 178 No 0.8 upsloping 0 normal 0
I found another solution with RCurl (here applied to a different UCI dataset):
library(RCurl)
download <- getURL("http://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv")
data <- read.csv(text = download)
head(data)
#Output :
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine
1 75 0 582 0 20 1 265000 1.9
2 55 0 7861 0 38 0 263358 1.1
3 65 0 146 0 20 0 162000 1.3
4 50 1 111 0 20 0 210000 1.9
5 65 1 160 1 20 0 327000 2.7
6 90 1 47 0 40 1 204000 2.1
serum_sodium sex smoking time DEATH_EVENT
1 130 1 0 4 1
2 136 1 0 6 1
3 129 1 1 7 1
4 137 1 0 7 1
5 116 0 0 8 1
6 132 1 1 8 1
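If you specifically want the Cleveland heart-disease data without installing an extra package, note that the raw cleveland.data file is awkwardly formatted, but the processed version in the same UCI directory is plain comma-separated text; a sketch, assuming processed.cleveland.data still has the usual 14-column layout with "?" marking missing values:
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
          "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
heart <- read.csv(url, header = FALSE, na.strings = "?", col.names = cols) # no header row in the file
head(heart)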

Use cases with higher value on one variable for each case of another variable in R

I am doing a meta-analysis in R. For each study (variable StudyID) I have multiple effect sizes. For some studies I have the same effect size multiple times depending on the level of acquaintance (variable Familiarity) between the subjects.
head(dat)
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
1 1 3.0 5.0 1 0.0462 4 0 44 1
2 1 5.0 2.5 1 0.1335 4 0 44 1
3 1 2.5 3.0 1 -0.1239 4 0 44 1
4 1 2.5 3.5 1 0.2062 4 0 44 1
5 1 2.5 3.0 1 -0.0370 4 0 44 1
6 1 3.0 5.0 1 -0.3850 4 0 44 1
Those are the first rows of the data set. In total there are over 50 studies. Most studies look like study 1, with the same value of "Familiarity" for all effect sizes. In some studies, there are effect sizes at multiple levels of familiarity, for example study 36 as seen below.
dat[142:160, ]
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
142 36 1.0 4.5 0 0.1233 5.00 0 311 1
143 36 3.5 3.0 0 0.0428 5.00 0 311 1
144 36 1.0 4.5 0 0.0986 5.00 0 311 1
145 36 1.0 4.5 1 -0.0520 5.00 0 311 1
146 36 1.5 2.5 1 -0.0258 5.00 0 311 1
147 36 3.5 3.0 1 0.1104 5.00 0 311 1
148 36 1.0 4.5 1 0.0282 5.00 0 311 1
149 36 1.0 4.5 2 -0.1724 5.00 0 311 1
150 36 3.5 3.0 2 0.2646 5.00 0 311 1
151 36 1.0 4.5 2 -0.1426 5.00 0 311 1
152 37 3.0 4.0 1 0.0118 5.35 0 123 0
153 37 1.0 4.5 1 -0.3205 5.35 0 123 0
154 37 2.5 3.0 1 -0.2356 5.35 0 123 0
155 37 3.0 2.0 1 0.1372 5.35 0 123 0
156 37 2.5 2.5 1 -0.1401 5.35 0 123 0
157 37 3.0 3.5 1 -0.3334 5.35 0 123 0
158 37 2.5 2.5 1 0.0317 5.35 0 123 0
159 37 1.0 3.0 1 -0.3025 5.35 0 123 0
160 37 1.0 3.5 1 -0.3248 5.35 0 123 0
Now, for those studies that include multiple levels of familiarity, I want to take only the rows with a single level of familiarity (in two separate versions: one with the lower and one with the higher familiarity).
I think this should be possible with the dplyr package, but I have no real code so far.
In a second step I would like to give those rows unique studyIDs for each level of familiarity (so that, for example, study 36 becomes three "different" studies).
Thank you in advance!
If you want to use dplyr, you could create an alternate ID or casenum by using group_indices:
df <- df %>%
  mutate(case_num = group_indices(.dots = c("studyID", "Familiarity")))
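For the first part of the question (keeping only one level of familiarity per study, in two separate versions), a minimal dplyr sketch, assuming the data frame is called dat as in the question:
library(dplyr)
# version 1: for each study, keep only the rows at its lowest level of familiarity
dat_low <- dat %>%
  group_by(studyID) %>%
  filter(Familiarity == min(Familiarity)) %>%
  ungroup()
# version 2: for each study, keep only the rows at its highest level of familiarity
dat_high <- dat %>%
  group_by(studyID) %>%
  filter(Familiarity == max(Familiarity)) %>%
  ungroup()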
You could do:
library(dplyr)
df %>%
  group_by(studyID) %>%
  mutate(nDist = n_distinct(Familiarity) > 1) %>%
  ungroup() %>%
  mutate(
    studyID = case_when(nDist ~ paste(studyID, Familiarity, sep = "_"),
                        TRUE ~ as.character(studyID)),
    nDist = NULL
  )
Output:
# A tibble: 19 x 9
studyID A.C.Extent Visibility Familiarity p_t_cov group.size same.sex N published
<chr> <dbl> <dbl> <int> <dbl> <dbl> <int> <int> <int>
1 36_0 1 4.5 0 0.123 5 0 311 1
2 36_0 3.5 3 0 0.0428 5 0 311 1
3 36_0 1 4.5 0 0.0986 5 0 311 1
4 36_1 1 4.5 1 -0.052 5 0 311 1
5 36_1 1.5 2.5 1 -0.0258 5 0 311 1
6 36_1 3.5 3 1 0.110 5 0 311 1
7 36_1 1 4.5 1 0.0282 5 0 311 1
8 36_2 1 4.5 2 -0.172 5 0 311 1
9 36_2 3.5 3 2 0.265 5 0 311 1
10 36_2 1 4.5 2 -0.143 5 0 311 1
11 37 3 4 1 0.0118 5.35 0 123 0
12 37 1 4.5 1 -0.320 5.35 0 123 0
13 37 2.5 3 1 -0.236 5.35 0 123 0
14 37 3 2 1 0.137 5.35 0 123 0
15 37 2.5 2.5 1 -0.140 5.35 0 123 0
16 37 3 3.5 1 -0.333 5.35 0 123 0
17 37 2.5 2.5 1 0.0317 5.35 0 123 0
18 37 1 3 1 -0.302 5.35 0 123 0
19 37 1 3.5 1 -0.325 5.35 0 123 0

glmmTMB, post-hoc testing and glht

I am using glmmTMB to analyze a negative binomial generalized linear mixed model (GLMM) where the dependent variable is count data (CT), which is over-dispersed.
There are 115 samples (rows) in the relevant data frame. There are two fixed effects (F1, F2) and a random intercept (R), within which is nested a further random effect (NR). There is also an offset, consisting of the natural logarithm of the total counts in each sample (LOG_TOT).
An example of a data frame, df, is:
CT F1 F2 R NR LOG_TOT
77 0 0 1 1 12.9
167 0 0 2 6 13.7
289 0 0 3 11 13.9
253 0 0 4 16 13.9
125 0 0 5 21 13.7
109 0 0 6 26 13.6
96 1 0 1 2 13.1
169 1 0 2 7 13.7
190 1 0 3 12 13.8
258 1 0 4 17 13.9
101 1 0 5 22 13.5
94 1 0 6 27 13.5
89 1 25 1 4 13.0
166 1 25 2 9 13.6
175 1 25 3 14 13.7
221 1 25 4 19 13.8
131 1 25 5 24 13.5
118 1 25 6 29 13.6
58 1 75 1 5 12.9
123 1 75 2 10 13.4
197 1 75 3 15 13.7
208 1 75 4 20 13.8
113 1 8 1 3 13.2
125 1 8 2 8 13.7
182 1 8 3 13 13.7
224 1 8 4 18 13.9
104 1 8 5 23 13.5
116 1 8 6 28 13.7
122 2 0 1 2 13.1
115 2 0 2 7 13.6
149 2 0 3 12 13.7
270 2 0 4 17 14.1
116 2 0 5 22 13.5
94 2 0 6 27 13.7
73 2 25 1 4 12.8
61 2 25 2 9 13.0
185 2 25 3 14 13.8
159 2 25 4 19 13.7
125 2 25 5 24 13.6
75 2 25 6 29 13.5
121 2 8 1 3 13.0
143 2 8 2 8 13.8
219 2 8 3 13 13.9
191 2 8 4 18 13.7
98 2 8 5 23 13.5
115 2 8 6 28 13.6
110 3 0 1 2 12.8
123 3 0 2 7 13.6
210 3 0 3 12 13.9
354 3 0 4 17 14.4
160 3 0 5 22 13.7
101 3 0 6 27 13.6
69 3 25 1 4 12.6
112 3 25 2 9 13.5
258 3 25 3 14 13.8
174 3 25 4 19 13.5
171 3 25 5 24 13.9
117 3 25 6 29 13.7
38 3 75 1 5 12.1
222 3 75 2 10 14.1
204 3 75 3 15 13.5
235 3 75 4 20 13.7
241 3 75 5 25 13.8
141 3 75 6 30 13.9
113 3 8 1 3 12.9
90 3 8 2 8 13.5
276 3 8 3 13 14.1
199 3 8 4 18 13.8
111 3 8 5 23 13.6
109 3 8 6 28 13.7
135 4 0 1 2 13.1
144 4 0 2 7 13.6
289 4 0 3 12 14.2
395 4 0 4 17 14.6
154 4 0 5 22 13.7
148 4 0 6 27 13.8
58 4 25 1 4 12.8
136 4 25 2 9 13.8
288 4 25 3 14 14.0
113 4 25 4 19 13.5
162 4 25 5 24 13.7
172 4 25 6 29 14.1
2 4 75 1 5 12.3
246 4 75 3 15 13.7
247 4 75 4 20 13.9
114 4 8 1 3 13.1
107 4 8 2 8 13.6
209 4 8 3 13 14.0
190 4 8 4 18 13.9
127 4 8 5 23 13.5
101 4 8 6 28 13.7
167 6 0 1 2 13.4
131 6 0 2 7 13.5
369 6 0 3 12 14.5
434 6 0 4 17 14.9
172 6 0 5 22 13.8
126 6 0 6 27 13.8
90 6 25 1 4 13.1
172 6 25 2 9 13.7
330 6 25 3 14 14.2
131 6 25 4 19 13.7
151 6 25 5 24 13.9
141 6 25 6 29 14.2
7 6 75 1 5 12.2
194 6 75 2 10 14.2
280 6 75 3 15 13.7
253 6 75 4 20 13.8
45 6 75 5 25 13.4
155 6 75 6 30 13.9
208 6 8 1 3 13.5
97 6 8 2 8 13.5
325 6 8 3 13 14.3
235 6 8 4 18 14.1
112 6 8 5 23 13.6
188 6 8 6 28 14.1
The random and nested random effects are treated as factors. The fixed effect F1 takes the values 0, 1, 2, 3, 4 and 6. The fixed effect F2 takes the values 0, 8, 25 and 75. I am treating the fixed effects as continuous, rather than ordinal, because I would like to identify monotonic, unidirectional changes in the dependent variable CT rather than up-and-down changes.
I previously used the lme4 package to analyze the data as a mixed model:
library(lme4)
m1 <- lmer(CT ~ F1*F2 + (1|R/NR) + offset(LOG_TOT),
           data = df, verbose = FALSE)
Followed by the use of glht in the multcomp package for post-hoc analysis employing the formula approach:
library(multcomp)
glht_fixed1 <- glht(m1, linfct = c(
"F1 == 0",
"F1 + 8*F1:F2 == 0",
"F1 + 25*F1:F2 == 0",
"F1 + 75*F1:F2 == 0",
"F1 + (27)*F1:F2 == 0"))
glht_fixed2 <- glht(m1, linfct = c(
"F2 + 1*F1:F2 == 0",
"F2 + 2*F1:F2 == 0",
"F2 + 3*F1:F2 == 0",
"F2 + 4*F1:F2 == 0",
"F2 + 6*F1:F2 == 0",
"F2 + (3.2)*F1:F2 == 0"))
glht_omni <- glht(m1)
Here is the corresponding negative binomial glmmTMB model, which I now prefer:
library(glmmTMB)
m2 <- glmmTMB(CT ~ F1*F2 + (1|R/NR) + offset(LOG_TOT),
              data = df, verbose = FALSE, family = "nbinom2")
According to this suggestion by Ben Bolker (https://stat.ethz.ch/pipermail/r-sig-mixed-models/2017q3/025813.html), the best approach to post hoc testing with glmmTMB is to use lsmeans (or its more recent successor, emmeans).
I followed Ben's suggestion, running
source(system.file("other_methods","lsmeans_methods.R",package="glmmTMB"))
and I can then use emmeans on the glmmTMB object. For example,
as.glht(emmeans(m2,~(F1 + 27*F1:F2)))
General Linear Hypotheses
Linear Hypotheses:
Estimate
3.11304347826087, 21 == 0 -8.813
But this does not seem correct. I can also change F1 and F2 to factors and then try this:
as.glht(emmeans(m2, ~(F1 + 27*F1:F2)))
General Linear Hypotheses
Linear Hypotheses:
Estimate
0, 0 == 0 -6.721
1, 0 == 0 -6.621
2, 0 == 0 -6.342
3, 0 == 0 -6.740
4, 0 == 0 -6.474
6, 0 == 0 -6.967
0, 8 == 0 -6.694
1, 8 == 0 -6.651
2, 8 == 0 -6.227
3, 8 == 0 -6.812
4, 8 == 0 -6.371
6, 8 == 0 -6.920
0, 25 == 0 -6.653
1, 25 == 0 -6.648
2, 25 == 0 -6.282
3, 25 == 0 -6.766
4, 25 == 0 -6.338
6, 25 == 0 -6.702
0, 75 == 0 -6.470
1, 75 == 0 -6.642
2, 75 == 0 -6.091
3, 75 == 0 -6.531
4, 75 == 0 -5.762
6, 75 == 0 -6.612
But, again, I am not sure how to bend this output to my will. If some kind person could tell me how to correctly carry over the use of formulae in glht and linfct to the emmeans scenario with glmmTMB, I would be very grateful. I have read all the manuals and vignettes until I am blue in the face (or it feels that way, at least), but I am still at a loss. In my defense (culpability?), I am a statistical tyro, so many apologies if I am asking a question with very obvious answers here.
The glht software and post hoc testing carries directly over to the glmmADMB package, but glmmADMB is 10x slower than glmmTMB. I need to perform multiple runs of this analysis, each with 300,000 examples of the negative binomial mixed model, so speed is essential.
Many thanks for your suggestions and help!
The second argument (specs) to emmeans is not the same as the linfct argument in glht, so you can't use it in the same way. You have to call emmeans() using it the way it was intended. The as.glht() function converts the result to a glht object, but it really is not necessary to do that as the emmeans summary yields similar results.
I think the results you were trying to get are obtainable via
emmeans(m2, ~ F2, at = list(F2 = c(0, 8, 25, 75)))
(using the original model with the predictors as quantitative variables). This will compute the adjusted means holding F1 at its average, and at each of the specified values of F2.
Please look at the documentation for emmeans(). In addition, there are lots of vignettes that provide explanations and examples -- starting with https://cran.r-project.org/web/packages/emmeans/vignettes/basics.html.
Following the advice of my excellent statistical consultant, I think the solution below provides what I had previously obtained using glht and linfct.
The slopes for F1 are calculated at the various levels of F2 by using contrast and emmeans to compute the differences in the dependent variable between two values of F1 separated by one unit (i.e. c(0,1)). (Since the regression is linear, the two values of F1 are arbitrary, provided they are separated by one unit, e.g. c(3,4).) Vice versa for the slopes of F2.
Thus, slopes of F1 at F2 = 0, 8, 25, 75 and 27 (27 is the average of F2):
contrast(emmeans(m1, specs="F1", at=list(F1=c(0,1), F2=0)),list(c(-1,1)))
(above equivalent to: summary(m1)$coefficients$cond["F1",])
contrast(emmeans(m1, specs="F1", at=list(F1=c(0,1), F2=8)),list(c(-1,1)))
contrast(emmeans(m1, specs="F1", at=list(F1=c(0,1), F2=25)),list(c(-1,1)))
contrast(emmeans(m1, specs="F1", at=list(F1=c(0,1), F2=75)),list(c(-1,1)))
contrast(emmeans(m1, specs="F1", at=list(F1=c(0,1), F2=27)),list(c(-1,1)))
and slopes of F2 at F1 = 0, 1, 2, 3, 4, 6 and 3.2 (3.2 is the average of F1, excluding the zero value):
contrast(emmeans(m1, specs="F2", at=list(F2=c(0,1), F1=0)),list(c(-1,1)))
(above equivalent to: summary(m1)$coefficients$cond["F2",])
contrast(emmeans(m1, specs="F2", at=list(F2=c(0,1), F1=1)),list(c(-1,1)))
contrast(emmeans(m1, specs="F2", at=list(F2=c(0,1), F1=2)),list(c(-1,1)))
contrast(emmeans(m1, specs="F2", at=list(F2=c(0,1), F1=3)),list(c(-1,1)))
contrast(emmeans(m1, specs="F2", at=list(F2=c(0,1), F1=4)),list(c(-1,1)))
contrast(emmeans(m1, specs="F2", at=list(F2=c(0,1), F1=6)),list(c(-1,1)))
contrast(emmeans(m1, specs="F2", at=list(F2=c(0,1), F1=3.2)),list(c(-1,1)))
Interaction of F1 and F2 slopes at F1 = 0 and F2 = 0
contrast(emmeans(m1, specs=c("F1","F2"), at=list(F1=c(0,1),F2=c(0,1))),list(c(1,-1,-1,1)))
(above equivalent to: summary(m1)$coefficients$cond["F1:F2",])
From the resulting emmGrid objects provided by contrast(), one can pick out, as desired, the estimate of the slope (estimate), the standard error of the estimated slope (SE), the Z score for the difference of the estimated slope from a null-hypothesized slope of zero (z.ratio, calculated by emmGrid as estimate divided by SE), and the corresponding P value (p.value, calculated by emmGrid as 2*pnorm(-abs(z.ratio))).
For example:
contrast(emmeans(m1, specs="F1", at=list(F2=c(0,1), F1=0)),list(c(-1,1)))
yields:
NOTE: Results may be misleading due to involvement in interactions
contrast estimate SE df z.ratio p.value
c(-1, 1) 0.001971714 0.002616634 NA 0.754 0.4511
Postscript, added 1.25 years later:
The above gives the correct solutions, but as Russell Lenth pointed out, the answers are more easily obtained using emtrends. However, I have selected this answer as correct since there may be some didactic value in showing how to calculate slopes with emmeans by finding the change in the predicted dependent variable when the independent variable changes by one unit.
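For reference, a sketch of the emtrends approach mentioned above, using the glmmTMB model m2 from the question (the at values are the ones used in this question; argument names as in current emmeans):
library(emmeans)
# slope of F1 at selected values of F2 (27 is the average of F2)
emtrends(m2, ~ F2, var = "F1", at = list(F2 = c(0, 8, 25, 75, 27)))
# slope of F2 at selected values of F1 (3.2 is the average of F1, excluding zero)
emtrends(m2, ~ F1, var = "F2", at = list(F1 = c(0, 1, 2, 3, 4, 6, 3.2)))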

How can I get the recapture probabilities in R (which package to use)?

I'm trying to find a way to estimate the recapture probabilities in my data. Here is an example directly from the package FSA in R.
library(FSA)
## First example -- capture histories summarized with capHistSum()
data(CutthroatAL)
ch1 <- capHistSum(CutthroatAL,cols2use=-1) # ignore first column of fish ID
ex1 <- mrOpen(ch1)
summary(ex1)
summary(ex1,verbose=TRUE)
confint(ex1)
confint(ex1,verbose=TRUE)
If you type summary(ex1, verbose=TRUE), you'll get this result:
# Observables:
# m n R r z
# i=1 0 89 89 26 NA
# i=2 22 352 352 96 4
# i=3 94 292 292 51 6
# i=4 41 233 233 46 16
# i=5 58 259 259 100 4
# i=6 99 370 370 99 5
# i=7 91 290 290 44 13
# i=8 52 134 134 13 5
# i=9 18 140 0 NA NA
# Estimates (phi.se includes sampling and individual variability):
# M M.se N N.se phi phi.se B B.se
# i=1 NA NA NA NA 0.411 0.088 NA NA
# i=2 36.6 6.4 561.1 117.9 0.349 0.045 198.6 48.2
# i=3 127.8 13.4 394.2 44.2 0.370 0.071 526.3 119.7
# i=4 120.7 20.8 672.2 138.8 0.218 0.031 154.1 30.2
# i=5 68.3 4.1 301.0 21.8 0.437 0.041 304.7 25.4
# i=6 117.5 7.3 436.1 30.3 0.451 0.069 357.2 61.2
# i=7 175.1 24.6 553.7 84.3 0.268 0.072 106.9 36.2
# i=8 100.2 24.7 255.3 65.4 NA NA NA NA
# i=9 NA NA NA NA NA NA NA NA
Since, "Observables" is not in a list, I cannot extract automatically the numbers. Is it possible?
I have the same type of dataset, but the output won't show me a probability of recapture. I have an open population. That's why I try to use this package.
Here's a look of the typical dataset:
head(CutthroatAL)
# id y1998 y1999 y2000 y2001 y2002 y2003 y2004 y2005 y2006
# 1 1 0 0 0 0 0 0 0 0 1
# 2 2 0 0 0 0 0 0 0 0 1
# 3 3 0 0 0 0 0 0 0 0 1
# 4 4 0 0 0 0 0 0 0 0 1
# 5 5 0 0 0 0 0 0 0 0 1
# 6 6 0 0 0 0 0 0 0 0 1
I also tried the mra package and its F.cjs.estim() function, but I don't have survival information...
I haven't found any function in Rcapture that allows me to print a capture probability.
I'm trying to find the information pj described on page 38 of the book Handbook of Capture-Recapture Analysis.
I haven't found one in the RMark package either.
So how can I estimate recapture probabilities in R?
Thanks,
If you just want to capture the "Observable" values in the summary, you can do it the same way the function does. If you look at the source for FSA:::summary.mrOpen, you can see that you can grab those values with
ex1$df[, c("m", "n", "R", "r", "z")]
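If you also need the estimates (M, N, phi, B) programmatically, the first step is to look at what the returned object actually contains; a hedged sketch (the slot names below are not documented as a public interface, so confirm them with str() on your own object):
str(ex1, max.level = 1) # shows the components stored in the mrOpen result
obs <- ex1$df[, c("m", "n", "R", "r", "z")] # the observables, as used by summary.mrOpen
obs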

How to retain the row and column format of a text file (.cel) in R

I read a text file in R that looks like the one below, with 1354896 rows and 5 columns.
I tried read.table() and read.delim() to load the file; however, the format changes after loading and everything ends up in a single column.
OffsetY=0
GridCornerUL=258 182
GridCornerUR=8450 210
GridCornerLR=8419 8443
GridCornerLL=228 8414
Axis-invertX=0
AxisInvertY=0
swapXY=0
DatHeader=[19..65528] PA-D 102 Full:CLS=8652 RWS=8652 XIN=1 YIN=1 VE=30 2.0 11/04/03 12:49:30 50205710 M10 HG-U133_Plus_2.1sq 6
Algorithm=Percentile
AlgorithmParameters=Percentile:75;CellMargin:2;OutlierHigh:1.500;OutlierLow:1.004;AlgVersion:6.0;FixedCellSize:TRUE;FullFeatureWidth:7;FullFeatureHeight:7;IgnoreOutliersInShiftRows:FALSE;FeatureExtraction:TRUE;PoolWidthExtenstion:2;PoolHeightExtension:2;UseSubgrids:FALSE;RandomizePixels:FALSE;ErrorBasis:StdvMean;StdMult:1.000000
[INTENSITY]
NumberCells=1354896
CellHeader=X Y MEAN STDV NPIXELS
0 0 147.0 23.5 25
1 0 10015.0 1276.7 25
2 0 160.0 24.7 25
3 0 9710.0 1159.8 25
4 0 85.0 14.0 25
5 0 171.0 21.0 25
6 0 11648.0 1678.4 25
7 0 163.0 30.7 25
8 0 12044.0 1430.1 25
9 0 169.0 25.7 25
10 0 11646.0 1925.6 25
11 0 176.0 30.7 25
After reading, the format is changed: everything ends up in a single column.
I want to retain the format of rows and columns.
I want to remove all the content before [INTENSITY] (OffsetY, GridCornerUL, and so on) shown in the first file.
You could try:
txt <- readLines("file.txt")
# drop everything up to and including the "NumberCells=..." line, then parse the rest
df <- read.csv(text = txt[-(1:grep("NumberCells=\\d+", txt))], check.names = FALSE)
write.csv(df, tf <- tempfile(fileext = ".csv"), row.names = FALSE)
read.csv(tf, check.names = FALSE) # just to verify...
# CellHeader=X Y MEAN STDV NPIXELS
# 1 0 0 147.0 23.5 25
# 2 1 0 10015.0 1276.7 25
# 3 2 0 160.0 24.7 25
# 4 3 0 9710.0 1159.8 25
# 5 4 0 85.0 14.0 25
# 6 5 0 171.0 21.0 25
# 7 6 0 11648.0 1678.4 25
# 8 7 0 163.0 30.7 25
# 9 8 0 12044.0 1430.1 25
# 10 9 0 169.0 25.7 25
# 11 10 0 11646.0 1925.6 25
# 12 11 0 176.0 30.7 25
This omits everything before and including NumberCells=1354896.
As you are using Linux, another option would be to pipe the awk output into read.table() or fread():
read.table(pipe("awk 'NR==1, /NumberCells/ {next}{print}' Hashim.txt"),
header=TRUE, check.names=FALSE)
# CellHeader=X Y MEAN STDV NPIXELS
#1 0 0 147 23.5 25
#2 1 0 10015 1276.7 25
#3 2 0 160 24.7 25
#4 3 0 9710 1159.8 25
#5 4 0 85 14.0 25
#6 5 0 171 21.0 25
#7 6 0 11648 1678.4 25
#8 7 0 163 30.7 25
#9 8 0 12044 1430.1 25
#10 9 0 169 25.7 25
#11 10 0 11646 1925.6 25
#12 11 0 176 30.7 25
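The same awk pipe also works with data.table::fread, which the answer mentions; a sketch, assuming a data.table version that supports the cmd argument and the same file name as above:
library(data.table)
# skip everything up to and including the NumberCells= line, then read the remainder
dt <- fread(cmd = "awk 'NR==1, /NumberCells/ {next}{print}' Hashim.txt",
            header = TRUE, check.names = FALSE)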
If NumberCells= always appears immediately before the header row, then you can exploit this to tell you the number of lines to skip:
dat<-readLines("file.txt")
read.table(textConnection(dat), header=TRUE, skip=grep("NumberCells", dat))
# CellHeader.X Y MEAN STDV NPIXELS
#1 0 0 147 23.5 25
#2 1 0 10015 1276.7 25
#3 2 0 160 24.7 25
#4 3 0 9710 1159.8 25
#5 4 0 85 14.0 25
#6 5 0 171 21.0 25
#7 6 0 11648 1678.4 25
#8 7 0 163 30.7 25
#9 8 0 12044 1430.1 25
#10 9 0 169 25.7 25
#11 10 0 11646 1925.6 25
#12 11 0 176 30.7 25
Edit
Because your files have a lot of rows, you may want to limit the number of lines that readLines reads in. To do this, you need to know the maximum number of lines before your header row. For instance, if you know your header row will always come within the first 200 lines of the file, you can do:
dat<-readLines("file.txt", n=200)
read.table("file.txt", header=TRUE, skip=grep("NumberCells", dat))
