Extracting a table from a text file in R

I am trying to extract tables from text files and have found several earlier posts here that address similar questions. However, none seem to work efficiently for my problem. The most helpful answer I have found is to an earlier question of mine: R: removing header, footer and sporadic column headings when reading csv file
An example dummy text file contains:
>
>
> ###############################################################################
>
> # Display AICc Table for the models above
>
>
> collect.models(, adjust = FALSE)
model npar AICc DeltaAICc weight Deviance
13 P1 19 94 0.00 0.78 9
12 P2 21 94 2.64 0.20 9
10 P3 15 94 9.44 0.02 9
2 P4 11 94 619.26 0.00 9
>
>
> ###############################################################################
>
> # the three lines below count the number of errors in the code above
>
> cat("ERROR COUNT:", .error.count, "\n")
ERROR COUNT: 0
> options(error = old.error.fun)
> rm(.error.count, old.error.fun, new.error.fun)
>
> ##########
>
>
I have written the following code to extract the desired table:
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[-c(grep(bottom, my.data):length(my.data))]
my.data <- my.data[-c(1:grep(top, my.data))]
my.data <- my.data[c(1:(length(my.data)-4))]
aa <- as.data.frame(my.data)
aa
write.table(my.data, 'c:/users/mmiller21/simple R programs/dummy.log.extraction.txt', quote = FALSE, col.names = FALSE, row.names = FALSE)
my.data2 <- read.table('c:/users/mmiller21/simple R programs/dummy.log.extraction.txt', header = TRUE, row.names = c(1))
my.data2
model npar AICc DeltaAICc weight Deviance
13 P1 19 94 0.00 0.78 9
12 P2 21 94 2.64 0.20 9
10 P3 15 94 9.44 0.02 9
2 P4 11 94 619.26 0.00 9
I would prefer to avoid having to write and then read my.data to obtain the desired data frame. Prior to that step the current code returns a vector of strings for my.data:
[1] " model npar AICc DeltaAICc weight Deviance" "13 P1 19 94 0.00 0.78 9"
[3] "12 P2 21 94 2.64 0.20 9" "10 P3 15 94 9.44 0.02 9"
[5] "2 P4 11 94 619.26 0.00 9"
Is there some way I can convert the above vector of strings into a data frame like that in dummy.log.extraction.txt without writing and then reading my.data?
The line:
aa <- as.data.frame(my.data)
returns the following, which looks like what I want:
# my.data
# 1 model npar AICc DeltaAICc weight Deviance
# 2 13 P1 19 94 0.00 0.78 9
# 3 12 P2 21 94 2.64 0.20 9
# 4 10 P3 15 94 9.44 0.02 9
# 5 2 P4 11 94 619.26 0.00 9
However:
dim(aa)
# [1] 5 1
If I can split aa into columns then I think I will have what I want without having to write and then read my.data.
I found the post Extracting Data from Text Files. However, in the posted answer the table in question seems to have a fixed number of rows, whereas in my case the number of rows can vary between 1 and 20. Also, I would prefer to use base R. In my case I think the number of rows between bottom and the last row of the table is constant (here 4).
I also found the post How to extract data from a text file using R or PowerShell? However, in my case the column widths are not fixed, and I do not know how to split the strings (or rows) so that there are only seven columns.
Given all of the above, perhaps my question is really how to split the object aa into columns. Thank you for any advice or assistance.
EDIT:
The actual logs are produced by a supercomputer and contain up to 90,000 lines. However, the number of lines varies greatly among logs. That is why I was making use of top and bottom.

Maybe your real log file is totally different and more complex, but with this one you can use read.table directly; you just have to play with the right parameters: comment.char = ">" throws away all the console-prompt lines, skip = 1 skips the leading prompt line, header = TRUE picks up the column names, nrows = 4 stops before the ERROR COUNT line, and row.names = 1 takes the first column as row names.
data <- read.table("c:/users/mmiller21/simple R programs/dummy.log",
                   comment.char = ">",
                   nrows = 4,
                   skip = 1,
                   header = TRUE,
                   row.names = 1)
str(data)
## 'data.frame': 4 obs. of 6 variables:
## $ model : Factor w/ 4 levels "P1","P2","P3",..: 1 2 3 4
## $ npar : int 19 21 15 11
## $ AICc : int 94 94 94 94
## $ DeltaAICc: num 0 2.64 9.44 619.26
## $ weight : num 0.78 0.2 0.02 0
## $ Deviance : int 9 9 9 9
data
## model npar AICc DeltaAICc weight Deviance
## 13 P1 19 94 0.00 0.78 9
## 12 P2 21 94 2.64 0.20 9
## 10 P3 15 94 9.44 0.02 9
## 2 P4 11 94 619.26 0.00 9

read.table and its family now have an option to read text:
> df <- read.table(text = paste(my.data, collapse = "\n"))
> df
model npar AICc DeltaAICc weight Deviance
13 P1 19 94 0.00 0.78 9
12 P2 21 94 2.64 0.20 9
10 P3 15 94 9.44 0.02 9
2 P4 11 94 619.26 0.00 9
> summary(df)
model npar AICc DeltaAICc weight Deviance
P1:1 Min. :11.0 Min. :94 Min. : 0.00 Min. :0.000 Min. :9
P2:1 1st Qu.:14.0 1st Qu.:94 1st Qu.: 1.98 1st Qu.:0.015 1st Qu.:9
P3:1 Median :17.0 Median :94 Median : 6.04 Median :0.110 Median :9
P4:1 Mean :16.5 Mean :94 Mean :157.84 Mean :0.250 Mean :9
3rd Qu.:19.5 3rd Qu.:94 3rd Qu.:161.90 3rd Qu.:0.345 3rd Qu.:9
Max. :21.0 Max. :94 Max. :619.26 Max. :0.780 Max. :9
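Note that text = also accepts a character vector directly (each element is treated as one line), so the paste(..., collapse = "\n") step is optional:
df <- read.table(text = my.data)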

It looks strange that you have to read an R console log. Anyhow, you can use the fact that your table lines begin with a number and extract the interesting lines using a regular expression like ^[0-9]+. Then read.table, as shown by @kohske, does the rest.
ll <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
idx <- which(grepl('^[0-9]+', ll))
idx <- c(min(idx) - 1, idx) ## prepend the header line
read.table(text = ll[idx])
model npar AICc DeltaAICc weight Deviance
13 P1 19 94 0.00 0.78 9
12 P2 21 94 2.64 0.20 9
10 P3 15 94 9.44 0.02 9
2 P4 11 94 619.26 0.00 9

Thank you to those who posted answers. Because of the size, complexity and variability of the actual log files, I think I need to continue to make use of the variables top and bottom. However, I used elements of dickoa's answer to come up with the following.
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[-c(grep(bottom, my.data):length(my.data))]
my.data <- my.data[-c(1:grep(top, my.data))]
x <- read.table(text=my.data, comment.char = ">")
x
# model npar AICc DeltaAICc weight Deviance
# 13 P1 19 94 0.00 0.78 9
# 12 P2 21 94 2.64 0.20 9
# 10 P3 15 94 9.44 0.02 9
# 2 P4 11 94 619.26 0.00 9
Here is even simpler code:
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
x <- read.table(text=my.data, comment.char = ">")
x
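For reuse across many logs, here is a sketch of how the steps above could be wrapped into a helper (the function name and arguments are my own invention, not from any package):
extract.table <- function(log.file, top, bottom) {
  txt <- readLines(log.file)
  ## keep only the lines from the top marker to the bottom marker
  txt <- txt[grep(top, txt):grep(bottom, txt)]
  ## prompt lines begin with ">", so read.table can drop them as comments
  read.table(text = txt, comment.char = ">")
}
x <- extract.table('c:/users/mmiller21/simple R programs/dummy.log', top, bottom)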

Related

Is there another way to calculate within-subject Hedges' g (and error)?

I'm carrying out a meta-analysis of within-subject (crossover) studies. I've read some papers that used the esc package (the esc_mean_sd function, more precisely) to calculate Hedges' g for this purpose. However, its output doubles the "n" of each study.
Note that "n" in the data is n = 12 for all three studies, while in the output it is n = 24.
ID mean_exp mean_con sd_exp sd_con n
1 A 150 130 15 22 12
2 B 166 145 10 8 12
3 C 179 165 11 14 12
# What I did:
e1 <- esc_mean_sd(data[1,2], data[1,4], data[1,6],
                  data[1,3], data[1,5], data[1,6],
                  r = .9, es.type = "g")
e2 <- esc_mean_sd(data[2,2], data[2,4], data[2,6],
                  data[2,3], data[2,5], data[2,6],
                  r = .9, es.type = "g")
e3 <- esc_mean_sd(data[3,2], data[3,4], data[3,6],
                  data[3,3], data[3,5], data[3,6],
                  r = .9, es.type = "g")
data2 <- combine_esc(e1, e2, e3)
colnames(data2) <- c("study","es","weight","n","se","var","lCI","uCI","measure")
head(data2, 3)
# study es weight n se var lCI uCI measure
# 1 1.80 4.18 24 0.489 0.239 0.842 2.76 g
# 2 4.53 1.60 24 0.791 0.626 2.983 6.08 g
# 3 2.14 3.71 24 0.519 0.269 1.126 3.16 g

Wrong degrees of freedom in lsmeans and SE calculation in R

I have this sample data:
Sample Replication Days
1 1 10
1 1 14
1 1 13
1 1 14
2 1 NA
2 1 5
2 1 18
2 1 20
1 2 16
1 2 NA
1 2 18
1 2 21
2 2 15
2 2 7
2 2 12
2 2 14
I have four observations for each sample, with a total of 64 samples in each of the two replications. In total, I have 512 values across the two replications. I also have some missing values, designated as NA. I performed an ANOVA on the mean value of each Sample in each Rep, which I generated using
library(tidyverse)
df <- Data %>% group_by(Sample, Rep) %>% summarise(Mean = mean(Days, na.rm = TRUE))
curve.anova <- aov(Mean~Rep+Sample, data=df)
The result of the ANOVA is:
> summary(curve.anova)
Df Sum Sq Mean Sq F value Pr(>F)
Rep 1 6.1 6.071 2.951 0.0915 .
Sample 63 1760.5 27.945 13.585 <2e-16 ***
Residuals 54 111.1 2.057
I created a table of mean and SE values:
ANOVA<-lsmeans(curve.anova, ~Sample)
ANOVA<-summary(ANOVA)
write.csv(ANOVA, file="Desktop/ANOVA.csv")
A few lines from the file are:
Sample lsmean SE df lower.CL upper.CL
1 24.875 1.014145417 54 22.84176086 26.90823914
2 25.5 1.014145417 54 23.46676086 27.53323914
3 31.32575758 1.440722628 54 28.43728262 34.21423253
4 26.375 1.014145417 54 24.34176086 28.40823914
5 26.42424242 1.440722628 54 23.53576747 29.31271738
6 25.5 1.014145417 54 23.46676086 27.53323914
7 28.375 1.014145417 54 26.34176086 30.40823914
8 24.875 1.014145417 54 22.84176086 26.90823914
9 21.16666667 1.014145417 54 19.13342752 23.19990581
10 23.875 1.014145417 54 21.84176086 25.90823914
The df for all 64 samples is 54, and the error bars in the ggplot are mostly equal for all the Samples. The SE values are larger than my manually calculated values. Based on the ANOVA results, df = 54 is for the residuals.
I want to double-check that the ANOVA results are correct and that I am correctly generating the lsmeans and SE to plot a bar graph in ggplot with confidence-interval error bars.
I will appreciate any help. Thank you!
After reading your comments, I think your workflow has an issue. Basically, when you apply your ANOVA test, you are applying it to the means of the different samples.
So, in your example, when you are doing :
curve.anova <- aov(Mean~Rep+Sample, data=df)
You are comparing these values:
> df
# A tibble: 4 x 3
# Groups: Sample [2]
Sample Replication Mean
<dbl> <dbl> <dbl>
1 1 1 12.8
2 1 2 18.3
3 2 1 14.3
4 2 2 12
So, basically, you are comparing two groups with two values per group.
And when you tried to remove the Replication grouping, you got an error, because the output of:
df = Data %>% group_by(Sample) %>% summarise(Mean = mean(Days, na.rm = TRUE))
is now:
# A tibble: 2 x 2
Sample Mean
<dbl> <dbl>
1 1 15.1
2 2 13
So, applying an ANOVA to that dataset means that you are comparing two groups with one value each. That is why you can't compute residuals and SE.
Instead, you should do it on the full dataset without trying to calculate the mean first:
anova_data <- aov(Days~Sample+Replication, data=Data)
anova_data2 <- aov(Days~Sample, data=Data)
And their output are:
> summary(anova_data)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 16.07 16.071 0.713 0.416
Replication 1 9.05 9.054 0.402 0.539
Residuals 11 247.80 22.528
2 observations deleted due to missingness
> summary(anova_data2)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 16.07 16.07 0.751 0.403
Residuals 12 256.86 21.41
2 observations deleted due to missingness
Now, you can apply lsmeans:
A_d = summary(lsmeans(anova_data, ~Sample))
A_d2 = summary(lsmeans(anova_data2, ~Sample))
> A_d
Sample lsmean SE df lower.CL upper.CL
1 15.3 1.8 11 11.29 19.2
2 12.9 1.8 11 8.91 16.9
Results are averaged over the levels of: Replication
Confidence level used: 0.95
> A_d2
Sample lsmean SE df lower.CL upper.CL
1 15.1 1.75 12 11.33 19.0
2 13.0 1.75 12 9.19 16.8
Confidence level used: 0.95
It does not change the mean or the SE much (which is good, because it means your replicates are consistent and there is not too much variability between them), but it reduces the confidence interval.
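If you want to check that SE by hand (a quick sketch, not in the original answer): lsmeans uses the pooled residual mean square, and there are 7 non-missing Days values per Sample here, so:
mse <- deviance(anova_data2) / df.residual(anova_data2)  # residual mean square, ~21.4
sqrt(mse / 7)  # ~1.75, matching the SE column of A_d2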
So, to plot it, you can:
library(ggplot2)
ggplot(A_d, aes(x = as.factor(Sample), y = lsmean)) +
  geom_bar(stat = "identity", colour = "black") +
  geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)
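Since the original question asked for confidence-interval error bars, here is a variant (sketch) that uses the lower.CL / upper.CL columns already present in A_d:
ggplot(A_d, aes(x = as.factor(Sample), y = lsmean)) +
  geom_bar(stat = "identity", colour = "black") +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = .5)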
Based on your initial question, if you want to check that the output of the ANOVA is correct, you can mimic it with fake data like this:
d2 <- data.frame(Sample = c(rep(1,10), rep(2,10)),
                 Days = c(rnorm(10, mean = 3), rnorm(10, mean = 8)))
Then,
curve.d2 <- aov(Days ~ Sample, data = d2)
ANOVA2 <- lsmeans(curve.d2, ~Sample)
ANOVA2 <- summary(ANOVA2)
And you get the following output:
> summary(curve.d2)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 139.32 139.32 167.7 1.47e-10 ***
Residuals 18 14.96 0.83
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> ANOVA2
Sample lsmean SE df lower.CL upper.CL
1 2.62 0.288 18 2.02 3.23
2 7.90 0.288 18 7.29 8.51
Confidence level used: 0.95
And for the plot:
ggplot(ANOVA2, aes(x = as.factor(Sample), y = lsmean)) +
  geom_bar(stat = "identity", colour = "black") +
  geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)
As you can see, the lsmeans we get for d2 are close to the 3 and 8 we set in the first place, so I think your output is correct. Maybe your data just do not present any significant differences, and the SE computations come out the same because the distributions of your data are similar. It is what it is.
I hope this answer helps you.
Data
Data = data.frame(Sample = c(rep(1,4), rep(2,4), rep(1,4), rep(2,4)),
                  Replication = c(rep(1,8), rep(2,8)),
                  Days = c(10,14,13,14,NA,5,18,20,16,NA,18,21,15,7,12,14))

Create and plot a table which preserves the ordering of the factor

When creating and plotting a table, the names are numeric values, and I would like them to stay in numeric order.
Code :
library(plyr)
set.seed(1234)
# create a random vector of different categories
number_of_categories <- 11
probability_of_each_category <- c(0.1, 0.05, 0.05, 0.08, 0.01,
                                  0.1, 0.2, 0.3, 0.01, 0.02, 0.08)
number_of_samples <- 1000
x <- sample(LETTERS[1:number_of_categories],
            number_of_samples,
            replace = TRUE,
            prob = probability_of_each_category)
# just a vector of zeros and ones
outcome <- rbinom(number_of_samples, 1, 0.4)
# I want x to be 1,2,...,11 so that it demonstrates the issue when
# creating the table
x <- mapvalues(x,
               c(LETTERS[1:number_of_categories]),
               seq(1:number_of_categories))
# the table shows the ordering
prop.table(table(x))
plot(table(x, outcome))
Table :
> prop.table(table(x))
x
1 10 11 2 3 4 5 6 7 8 9
0.105 0.023 0.078 0.044 0.069 0.083 0.018 0.097 0.195 0.281 0.007
Plot :
I would like the plot and the table in the order
1 3 4 5 ... 10 11
Rather than
1 10 11 2 3 4 5 6 7 8 9
You can either convert x to numeric before feeding it to table
plot(table(as.numeric(x), outcome))
Or order the table's rows by the as.numeric of the rownames
t <- table(x, outcome)
t <- t[order(as.numeric(rownames(t))),]
plot(t)
A simple way to solve this problem is to format the numbers to include a leading zero during mapvalues(), using sprintf().
x <- mapvalues(x,
               c(LETTERS[1:number_of_categories]),
               sprintf("%02d", seq(1:number_of_categories)))
# the table shows the ordering
prop.table(table(x))
plot(table(x, outcome))
...and the output:
> prop.table(table(x))
x
01 02 03 04 05 06 07 08 09 10 11
0.104 0.067 0.038 0.073 0.019 0.112 0.191 0.291 0.011 0.019 0.075
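Another base-R option (a sketch, not from the answers above), applied to the original unpadded x from the question: make it a factor whose levels are already in numeric order, so table() and plot() keep that ordering without reformatting the labels:
## fix the level order once, instead of padding the labels
x <- factor(x, levels = as.character(1:number_of_categories))
prop.table(table(x))
plot(table(x, outcome))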

How to apply a function from a package to a dataframe

How can I apply a package function to a data frame?
I have a data set (df) with two columns (total and n), to which I would like to apply the pois.exact function (pois.exact(x, pt = 1, conf.level = 0.95)) from the epitools package, with x = df$n and pt = df$total, to get a "new" data frame (new_df) with three more columns containing the corresponding rounded rates and lower and upper confidence limits.
df <- data.frame("total" = c(35725302,35627717,34565295,36170648,38957933,36579643,29628394,18212075,39562754,1265055), "n" = c(24,66,166,461,898,1416,1781,1284,329,12))
> df
total n
1 35725302 24
2 35627717 66
3 34565295 166
4 36170648 461
5 38957933 898
6 36579643 1416
7 29628394 1781
8 18212075 1284
9 39562754 329
10 1265055 12
In fact, the actual data frame is much longer.
For example, for the first row the desired results are:
require (epitools)
round(pois.exact(24, pt = 35725302, conf.level = 0.95) * 100000, 2)[3:5]
rate lower upper
1 0.07 0.04 0.1
The new data frame, with the results of applying the pois.exact function added, should look like this:
> new_df
total n incidence lower_95IC uppper_95IC
1 35725302 24 0.07 0.04 0.10
2 35627717 66 0.19 0.14 0.24
3 34565295 166 0.48 0.41 0.56
4 36170648 461 1.27 1.16 1.40
5 38957933 898 2.31 2.16 2.46
6 36579643 1416 3.87 3.67 4.08
7 29628394 1781 6.01 5.74 6.03
8 18212075 1284 7.05 6.67 7.45
9 9562754 329 3.44 3.08 3.83
Thanks.
df %>%
  cbind(pois.exact(df$n, df$total)) %>%
  dplyr::select(total, n, rate, lower, upper)
# total n rate lower upper
# 1 35725302 24 1488554.25 1488066.17 1489042.45
# 2 35627717 66 539813.89 539636.65 539991.18
# 3 34565295 166 208224.67 208155.26 208294.10
# 4 36170648 461 78461.28 78435.71 78486.85
# 5 38957933 898 43383.00 43369.38 43396.62
# 6 36579643 1416 25833.08 25824.71 25841.45
# 7 29628394 1781 16635.82 16629.83 16641.81
# 8 18212075 1284 14183.86 14177.35 14190.37
# 9 39562754 329 120251.53 120214.06 120289.01
# 10 1265055 12 105421.25 105237.62 105605.12
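To reproduce the incidence columns from the question (per 100,000 and rounded), a small variant along these lines should work (untested sketch, assuming the (x, pt) argument order from pois.exact's signature):
## bind the rounded per-100,000 rate and CI columns onto df
res <- round(pois.exact(df$n, pt = df$total)[, c("rate", "lower", "upper")] * 100000, 2)
new_df <- setNames(cbind(df, res), c("total", "n", "incidence", "lower_95IC", "upper_95IC"))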

R Programming issue intervals

I'm trying to figure out a formula to average the max and min numbers inside each interval.
x <- sample(10:40,100,rep=TRUE)
factorx<- factor(cut(x, breaks=nclass.Sturges(x)))
xout<-as.data.frame(table(factorx))
xout<- transform(xout, cumFreq = cumsum(Freq), relative = prop.table(Freq))
Using the above code in the R editor program, I get the following:
xout
factorx Freq cumFreq relative
1 (9.97,13.8] 14 14 0.14
2 (13.8,17.5] 13 27 0.13
3 (17.5,21.2] 16 43 0.16
4 (21.2,25] 5 48 0.05
5 (25,28.8] 11 59 0.11
6 (28.8,32.5] 8 67 0.08
7 (32.5,36.2] 16 83 0.16
8 (36.2,40] 17 100 0.17
What I want to know is if there is a way to calculate the midpoint of each interval. For example, for the first interval it would be:
(13.8 + 9.97)/2
I believe it's called the class midpoint in statistics.
Here's a one-liner that is probably close to what you want (it splits each label on the comma, turns the "(" and "]" characters into spaces, strips the whitespace, and averages the two numbers):
> sapply(strsplit(levels(xout$factorx), ","), function(x) sum(as.numeric(gsub("[[:space:]]", "", chartr(old = "(]", new = "  ", x))))/2)
[1] 11.885 15.650 19.350 23.100 26.900 30.650 34.350 38.100
#One possible solution is to split by (,] (xout is your dataframe)
x1<-strsplit(as.character(xout$factorx),",|\\(|]")
x2<-do.call(rbind,x1)
xout$lower=as.numeric(x2[,2])
xout$higher=as.numeric(x2[,3])
xout$ave<-rowMeans(xout[,c("lower","higher")])
> head(xout,3)
      factorx Freq cumFreq relative lower higher    ave
1 (9.97,13.7]   15      15     0.15  9.97   13.7 11.835
2 (13.7,17.5]   14      29     0.14 13.70   17.5 15.600
3 (17.5,21.2]   12      41     0.12 17.50   21.2 19.350
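If only the midpoints are needed, a different sketch (mine, not from the answers above): let hist() pick Sturges-style breaks and return the midpoints directly. Note that hist() prettifies its breaks, so they may not match cut()'s exactly:
h <- hist(x, breaks = "Sturges", plot = FALSE)
h$mids  # class midpoints of hist()'s bins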
