Extracting columns from text file - r

I load a text file (tree.txt) into R with the content below (copied and pasted from the output of the JWEKA J48 command).
I use the following command to load the text file:
data3 <- read.table(file.choose(), header = FALSE, sep = ",")
I would like to put each column into a separate variable named COL1, COL2, ..., COL8 (there are 8 columns in this example). If you load the file into Excel with delimited separation, each row is split across these columns; this is the required result.
Each COLn will contain the relevant characters of the tree in this example.
How can I separate the text file into these columns automatically, while ignoring the header and footer content of the file?
Here is the text file content:
[[1]]
J48 pruned tree
------------------
MSTV <= 0.4
| MLTV <= 4.1: 3 -2
| MLTV > 4.1
| | ASTV <= 79
| | | b <= 1383:00:00 2 -18
| | | b > 1383
| | | | UC <= 05:00 1 -2
| | | | UC > 05:00 2 -2
| | ASTV > 79:00:00 3 -2
MSTV > 0.4
| DP <= 0
| | ALTV <= 09:00 1 (170.0/2.0)
| | ALTV > 9
| | | FM <= 7
| | | | LBE <= 142:00:00 1 (27.0/1.0)
| | | | LBE > 142
| | | | | AC <= 2
| | | | | | e <= 1058:00:00 1 -5
| | | | | | e > 1058
| | | | | | | DL <= 04:00 2 (9.0/1.0)
| | | | | | | DL > 04:00 1 -2
| | | | | AC > 02:00 1 -3
| | | FM > 07:00 2 -2
| DP > 0
| | DP <= 1
| | | UC <= 03:00 2 (4.0/1.0)
| | | UC > 3
| | | | MLTV <= 0.4: 3 -2
| | | | MLTV > 0.4: 1 -8
| | DP > 01:00 3 -8
Number of Leaves : 16
Size of the tree : 31
An example of the COL1 content will be:
MSTV
|
|
|
|
|
|
|
|
MSTV
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
COL2 content will be:
MLTV
MLTV
|
|
|
|
|
|
>
DP
|
|
|
|
|
|
|
|
|
|
|
|
DP
|
|
|
|
|
|

Try this:
# Strip the first and last four lines (header and footer), keeping one string per remaining line
cleaned.txt <- capture.output(cat(paste0(tail(head(readLines("FILE_LOCATION"), -4), -4), collapse = '\n'), sep = '\n'))
# Read as a fixed-width file, guessing 4-character columns from the longest line
cleaned.df <- read.fwf(file = textConnection(cleaned.txt),
                       header = FALSE,
                       widths = rep.int(4, max(nchar(cleaned.txt)/4)),
                       strip.white = TRUE)
# Drop columns that contain nothing but NA
cleaned.df <- cleaned.df[, colSums(is.na(cleaned.df)) < nrow(cleaned.df)]
For the cleaning process, I end up using a combination of head and tail to remove the four lines at the top and at the bottom. There's probably a more efficient way to do this outside of R, but this isn't so bad; generally, I'm just making the file readable to R.
Your file looks like a fixed-width file so I use read.fwf, and use textConnection() to point the function to the cleaned output.
Finally, I'm not sure how your data is actually structured, but when I copied it from Stack Overflow, it pasted with a bunch of whitespace at the end of each line. I'm using some tricks to guess how wide the file is, and then removing the extraneous columns, with these two lines:
widths = rep.int(4, max(nchar(cleaned.txt)/4))
cleaned.df <- cleaned.df[, colSums(is.na(cleaned.df)) < nrow(cleaned.df)]
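For example, if the longest cleaned line were 60 characters wide (60 is just an illustrative number here), this would build fifteen 4-character fields for read.fwf:
rep.int(4, 60 / 4)
#  [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4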
Next, I'm creating the data in the way you would like it structured.
for (i in colnames(cleaned.df)) {
  assign(i, subset(cleaned.df, select = i))
  assign(i, capture.output(cat(paste0(unlist(get(i)[get(i) != ""])), sep = ' ', fill = FALSE)))
}
rm(i)
rm(cleaned.df)
rm(cleaned.txt)
What this does is loop over each column header in your data frame.
For each column, it uses assign() to put that column's data into its own data frame; in your case, they are named V1 through V15.
Next, it uses a combination of cat() and paste() with unlist() and capture.output() to concatenate each of those data frames into a single character vector, so they are now character vectors instead of data frames.
Keep in mind that because you wanted a space between entries, I'm using a space as the separator. But because this is a fixed-width file, some cells are completely blank, and I remove those using
get(i)[get(i)!=""]
(Your question said you wanted COL2 to be: MLTV MLTV | | | | | | > DP | | | | | | | | | | | | DP | | | | | |).
If we just use get(i), there will be a leading whitespace in the output.
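If you would rather not create a separate variable per column with assign(), here is a minimal alternative sketch that keeps everything in a single named list (the COL1, COL2, ... names are the ones asked for in the question); the idea is the same: drop the blank cells and collapse the rest with spaces:
cols <- lapply(cleaned.df, function(col) {
  col <- as.character(col)                             # factors -> character, just in case
  paste(col[!is.na(col) & col != ""], collapse = " ")  # drop blank cells, join with spaces
})
names(cols) <- paste0("COL", seq_along(cols))          # COL1, COL2, ...
cols$COL2                                              # one character string per original column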

Related

Relabel of rowname column in R dataframe

When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
| V1 | V2 | Trial |
+--------+--------------+--------------+-------+
| [1,] | 0.130880519 | 0.02085533 | 1 |
| [2,] | 0.197243133 | -0.000502744 | 1 |
| [3,] | -0.045241653 | 0.106888902 | 1 |
| [4,] | 0.328759949 | -0.106559163 | 1 |
| [5,] | 0.040894969 | 0.114073454 | 1 |
| [1,]1 | 0.103130056 | 0.013655756 | 2 |
| [2,]1 | 0.133080106 | 0.038049071 | 2 |
| [3,]1 | 0.067975054 | 0.03036033 | 2 |
| [4,]1 | 0.132437217 | 0.022887103 | 2 |
| [5,]1 | 0.124950463 | 0.007144698 | 2 |
| [1,]2 | 0.202996317 | 0.004181205 | 3 |
| [2,]2 | 0.025401354 | 0.045672932 | 3 |
| [3,]2 | 0.169469266 | 0.002551237 | 3 |
| [4,]2 | 0.2303046 | 0.004936579 | 3 |
| [5,]2 | 0.085702254 | 0.020814191 | 3 |
+--------+--------------+--------------+-------+
We can use parse_number to extract the first occurrence of numbers in the row names:
library(dplyr)
df1 %>%
  mutate(newcol = readr::parse_number(row.names(df1)))
Or, in base R, use sub to capture the digits after the [ in the row names:
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))
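For example, applied to row names of the form shown in the question, the base R version returns the number inside the brackets (wrap it in as.integer() if you need it numeric):
rn <- c("[1,]", "[2,]1", "[5,]2")
sub("^\\[(\\d+).*", "\\1", rn)
# [1] "1" "2" "5"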

Data Cleanup with R: Getting rid of extra fullstops

I am cleaning the data using R.
Below is my data format
Input
1) 100 | 101.25 | 102.25. | . | .. | 201.5. |
2) 200.05. | 200.56. | 205 | .. | . | 3000 |
3) 300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65. |
expected output:
1) 100 | 101.25 | 102.25 |NA | NA |201.5
2) 200.05|200.26 | 205 |NA | NA |3000
3) 300.98|300.26 |2001.26 |NA |0.2 |5.65
There are extra full stops in the table that I am trying to clean up, while keeping the decimal numbers intact.
I tried a replace-all in R, but that clears all the full stops and distorts the decimal numbers.
If the trailing full stop is really the only manifestation of the problem, then you may try just removing it with sub:
x <- c("101.25", "200.56.", "300.26")
x <- sub("\\.$", "", x)
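With the small example above, only the trailing full stops are removed and the decimal points are preserved:
x
# [1] "101.25" "200.56" "300.26"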
You can use a look-ahead to remove the dots (.) that are immediately followed by a space or a |:
x <- '1) 100 | 101.25 | 102.25. | . | .. | 201.5. |
2) 200.05. | 200.56. | 205 | .. | . | 3000 |
3) 300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65. |'
y <- gsub("([.]+)(?=[[:blank:]|])","",x,perl = TRUE)
cat(y)
# 1) 100 | 101.25 | 102.25 | | | 201.5 |
# 2) 200.05 | 200.56 | 205 | | | 3000 |
# 3) 300.98 | 300.26 | 2001.56| | 0.2| 5.65 |
Regex explanation:
([.]+) - matches one or more literal dots (the run of trailing full stops)
(?=[[:blank:]|]) - a look-ahead requiring the next character to be a blank or a |, so decimal points inside numbers are untouched
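If you also want the expected output from the question, with NA in the dot-only cells, a possible follow-up sketch (building on y above) is to drop the "1)"-style row labels and the trailing pipes, and then let read.table turn the blank fields into NA:
lines <- strsplit(y, "\n")[[1]]
lines <- sub("^\\d+\\)\\s*", "", lines)  # drop the "1) ", "2) ", ... row labels
lines <- sub("\\|\\s*$", "", lines)      # drop the trailing | on each line
df <- read.table(text = lines, sep = "|", strip.white = TRUE)
df                                       # the dot-only cells are now NA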

Converting comma separated list to dataframe

If I have a list similar to x <- c("Name,Age,Gender", "Rob,21,M", "Matt,30,M"), how can I convert it to a data frame where Name, Age, and Gender become the column headers?
Currently my approach is,
dataframe <- data.frame(matrix(unlist(x), nrow=3, byrow=T))
which gives me
matrix.unlist.user_data...nrow...num_rows..byrow...T.
1 Name,Age,Gender
2 Rob,21,M
3 Matt,30,M
and doesn't help me at all.
How can I get something which resembles the following from the list mentioned above?
+------+-----+--------+
| name | age | gender |
+------+-----+--------+
| ...  | ... | ...    |
| ...  | ... | ...    |
| ...  | ... | ...    |
+------+-----+--------+
We paste the strings into a single string with \n and use either read.csv or read.table from base R:
read.table(text=paste(x, collapse='\n'), header = TRUE, stringsAsFactors = FALSE, sep=',')
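With the example x above, this gives a regular data frame with the expected headers:
  Name Age Gender
1  Rob  21      M
2 Matt  30      M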
Alternatively,
data.table::fread(paste(x, collapse = "\n"))
Name Age Gender
1: Rob 21 M
2: Matt 30 M
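If you prefer to stay close to your original matrix idea, here is a sketch along those lines; note that every column stays character this way (including Age), which is why the read.table/fread route is usually preferable:
m <- do.call(rbind, strsplit(x, ","))        # 3 x 3 character matrix
dataframe <- setNames(as.data.frame(m[-1, , drop = FALSE],
                                    stringsAsFactors = FALSE),
                      m[1, ])                # first row becomes the column names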

How do I present data from a text file to CrossTable in R?

I am having trouble importing data into R so that the CrossTable function (gmodels package) will run a simple chi-squared test. Thank you for any tips on how to import the data in the correct way: the test is fine when I enter the data manually, but not when I import it into a table - see below. /OT
> library(gmodels)
> library(MASS)
> #When I enter the data manually there's no problem running a simple chi-squared:
> CA<-c(42,100,10,5)
> noCA<-c(20,0,140,40)
> regionalca<-cbind(CA,noCA)
> regionalca
CA noCA
[1,] 42 20
[2,] 100 0
[3,] 10 140
[4,] 5 40
> CrossTable(regionalca, fisher=FALSE, chisq=TRUE, expected=TRUE, , sresid=TRUE, format="SPSS")
Cell Contents
|-------------------------|
| Count |
| Expected Values |
| Chi-square contribution |
| Row Percent |
| Column Percent |
| Total Percent |
| Std Residual |
|-------------------------|
Total Observations in Table: 357
|
| CA | noCA | Row Total |
-------------|-----------|-----------|-----------|
[1,] | 42 | 20 | 62 |
| 27.266 | 34.734 | |
| 7.962 | 6.250 | |
| 67.742% | 32.258% | 17.367% |
| 26.752% | 10.000% | |
| 11.765% | 5.602% | |
| 2.822 | -2.500 | |
-------------|-----------|-----------|-----------|
[2,] | 100 | 0 | 100 |
| 43.978 | 56.022 | |
| 71.366 | 56.022 | |
| 100.000% | 0.000% | 28.011% |
| 63.694% | 0.000% | |
| 28.011% | 0.000% | |
| 8.448 | -7.485 | |
-------------|-----------|-----------|-----------|
[3,] | 10 | 140 | 150 |
| 65.966 | 84.034 | |
| 47.482 | 37.274 | |
| 6.667% | 93.333% | 42.017% |
| 6.369% | 70.000% | |
| 2.801% | 39.216% | |
| -6.891 | 6.105 | |
-------------|-----------|-----------|-----------|
[4,] | 5 | 40 | 45 |
| 19.790 | 25.210 | |
| 11.053 | 8.677 | |
| 11.111% | 88.889% | 12.605% |
| 3.185% | 20.000% | |
| 1.401% | 11.204% | |
| -3.325 | 2.946 | |
-------------|-----------|-----------|-----------|
Column Total | 157 | 200 | 357 |
| 43.978% | 56.022% | |
-------------|-----------|-----------|-----------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 246.0862 d.f. = 3 p = 4.595069e-53
Minimum expected frequency: 19.78992
> #But when I try to import the data from a .txt file, it becomes unacceptable:
> regionalca<-read.table(file="låtsas ca.txt", header=TRUE)
> regionalca
CA noCA
1 43 20
2 100 1
3 10 140
4 5 40
> CrossTable(regionalca, fisher=FALSE, chisq=TRUE, expected=TRUE, , sresid=TRUE, format="SPSS")
Error in margin.table(x, margin) : 'x' is not an array
> #I would really like to run the test on this table:
> regionalca<-read.table(file="låtsas ca.txt", header=TRUE)
> regionalca
region CA noCA
1 south 43 20
2 southwest 100 0
3 mid 10 140
4 north 5 40
> #Which ob
> CrossTable(regionalca, fisher=FALSE, chisq=TRUE, expected=TRUE, , sresid=TRUE, format="SPSS")
Error in if (any(x < 0) || any(is.na(x))) stop("all entries of x must be nonnegative and finite") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(left, right) : ‘<’ not meaningful for factors
>
The error is very explicit:
if (any(x < 0) || any(is.na(x)))
stop("all entries of x must be nonnegative and finite")
Your inputs are not eligible for CrossTable (gmodels package). I can reproduce the error using your data by introducing a negative value:
CA <- c(-1,100,10,5) ## -1 as the first value
So you need to remove these values beforehand, or replace them with another value. For example:
regionalca <- regionalca[rowSums(!regionalca < 0) == ncol(regionalca) &
                           rowSums(!is.na(regionalca)) == ncol(regionalca), ]
The problem is that read.table creates a data.frame, whereas what you need is a matrix. Note that cbind() defaults to a matrix class output, which is why your manually entered example works. It is also spelled out in the error you printed:
Error in margin.table(x, margin) : 'x' is not an array (and a matrix is an array, in this case).
So, in order to fix it, you need to change your code as follows:
regionalca<-as.matrix(read.table(file="låtsas ca.txt", header=TRUE))
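If you use the version of the file that also contains the region column (your last read.table call), one possible sketch, assuming the region names sit in the first column, is to turn them into row names so that the remaining columns stay numeric:
regionalca <- as.matrix(read.table(file = "låtsas ca.txt", header = TRUE,
                                   row.names = 1))
CrossTable(regionalca, fisher = FALSE, chisq = TRUE, expected = TRUE,
           sresid = TRUE, format = "SPSS")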

Hmisc Table Creation

Just starting out with R and trying to figure out what works for my needs when it comes to creating "summary tables." I am used to Custom Tables in SPSS, and the CrossTable function in the package gmodels gets me almost where I need to be; not to mention it is easy to navigate for someone just starting out in R.
That said, it seems like the Hmisc package is very good at creating various summary tables and exporting to LaTeX (ultimately what I need to do).
My questions are: 1) Can you create the table below easily with Hmisc? 2) If so, can I interact variables (2 in the column)? And finally, 3) can I access the p-values of significance tests (chi-square)?
Thanks in advance,
Brock
Cell Contents
|-------------------------|
| Count |
| Row Percent |
| Column Percent |
|-------------------------|
Total Observations in Table: 524
| asq[, 23]
asq[, 4] | 1 | 2 | 3 | 4 | 5 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 76 | 54 | 93 | 46 | 54 | 323 |
| 23.529% | 16.718% | 28.793% | 14.241% | 16.718% | 61.641% |
| 54.286% | 56.250% | 63.265% | 63.889% | 78.261% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 64 | 42 | 54 | 26 | 15 | 201 |
| 31.841% | 20.896% | 26.866% | 12.935% | 7.463% | 38.359% |
| 45.714% | 43.750% | 36.735% | 36.111% | 21.739% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 140 | 96 | 147 | 72 | 69 | 524 |
| 26.718% | 18.321% | 28.053% | 13.740% | 13.168% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
The gmodels package has a function called CrossTable, which is very nice for those used to SPSS and SAS output. Try this example:
library(gmodels) # run install.packages("gmodels") if you haven't installed the package yet
x <- sample(c("up", "down"), 100, replace = TRUE)
y <- sample(c("left", "right"), 100, replace = TRUE)
CrossTable(x, y, format = "SPSS")
This should provide you with an output just like the one you displayed on your question, very SPSS-y. :)
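Regarding your third question: the p-value itself does not require CrossTable at all; you can always get it from base R's chisq.test on the same cross-tabulation, for example:
chisq.test(table(x, y))$p.value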
If you are coming from SPSS, you may be interested in the package Deducer (http://www.deducer.org). It has a contingency table function:
> library(Deducer)
> data(tips)
> tables<-contingency.tables(
+ row.vars=d(smoker),
+ col.vars=d(day),data=tips)
> tables<-add.chi.squared(tables)
> print(tables,prop.r=T,prop.c=T,prop.t=F)
================================================================================================================
==================================================================================
========== Table: smoker by day ==========
| day
smoker | Fri | Sat | Sun | Thur | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|
No Count | 4 | 45 | 57 | 45 | 151 |
Row % | 2.649% | 29.801% | 37.748% | 29.801% | 61.885% |
Column % | 21.053% | 51.724% | 75.000% | 72.581% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Yes Count | 15 | 42 | 19 | 17 | 93 |
Row % | 16.129% | 45.161% | 20.430% | 18.280% | 38.115% |
Column % | 78.947% | 48.276% | 25.000% | 27.419% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 19 | 87 | 76 | 62 | 244 |
Column % | 7.787% | 35.656% | 31.148% | 25.410% | |
Large Sample
Test Statistic DF p-value | Effect Size est. Lower (%) Upper (%)
Chi Squared 25.787 3 <0.001 | Cramer's V 0.325 0.183 (2.5) 0.44 (97.5)
-----------
================================================================================================================
You can export the counts and the test to LaTeX or HTML using the xtable package:
> library(xtable)
> xtable(drop(extract.counts(tables)[[1]]))
> test <- contin.tests.to.table((tables[[1]]$tests))
> xtable(test)
