R: read.table separator every XX characters - r

I have been unable to find a solution to my problem.
I am trying to read a text file using R. The file contains a single row whose values are separated by a fixed number of characters rather than by a delimiter.
000341656.0000000000000000004.6000000000000000009.0000000000000000050.9566787004000000052.0000000000000000072.8621215573000000007.0000000000000000050.0361010830000000047.2490974729000000054.5560183531000000006.0000000000000000049.9711191336000000047.0397111913000000043.1488475260000000023.0000000000000000046.6281588448000000040.1516245487000000038.4653540241000000002.0000000000000000046.2129963899000000041.9963898917000000037.3850068798000000030.0000000000000000046.0144404332000000040.0324909747000000027.0930952140000000003.0000000000000000043.3971119134000000032.4801444043000000010.4757238771
Each value is 20 characters wide: 9 integer digits, a decimal point, and 10 decimal digits.
The file contains between 22 and 30 such values (the decimal separator is '.').
What I am unable to figure out is how to split the values and get rid of the extra 0s.
Any help is highly appreciated.

You can read data in a fixed-width format with read.fwf:
> read.fwf("./d.txt", widths=rep(20,30))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 341656 4.6 9 50.95668 52 72.86212 7 50.0361 47.2491 54.55602 6 49.97112
2 341656 4.6 9 50.95668 52 72.86212 7 50.0361 47.2491 54.55602 6 49.97112
V13 V14 V15 V16 V17 V18 V19 V20 V21 V22
1 47.03971 43.14885 23 46.62816 40.15162 38.46535 2 46.213 41.99639 37.38501
2 47.03971 43.14885 23 46.62816 40.15162 38.46535 2 46.213 41.99639 37.38501
V23 V24 V25 V26 V27 V28 V29 V30
1 30 46.01444 40.03249 27.0931 3 43.39711 32.48014 10.47572
2 30 46.01444 40.03249 27.0931 3 43.39711 32.48014 10.47572
You need to know how many fields there are and how wide each one is. You didn't say how many lines are in the file, so I copied your line in twice (hence the duplicated rows).
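Since the file may hold anywhere from 22 to 30 values, the field count can be derived from the line length instead of being hard-coded. A minimal sketch, assuming every field is exactly 20 characters wide; the temporary file and the three sample values are made up for illustration:

```r
# Build a hypothetical one-line file with three 20-character fields
fields <- sprintf("%020.10f", c(341656, 4.6, 50.9566787004))
tmp <- tempfile(fileext = ".txt")
writeLines(paste0(fields, collapse = ""), tmp)

# Derive the number of fields from the length of the first line
n_fields <- nchar(readLines(tmp, n = 1)) %/% 20

# One width per field; read.fwf converts the padded strings to numbers
vals <- read.fwf(tmp, widths = rep(20, n_fields))
print(vals)
```

If the file has several lines of different lengths, the widths would have to be worked out per line, which this sketch does not attempt.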

library(stringi)
library(magrittr)  # provides %>%, which stringi does not export
stri_match_all_regex(
"000341656.0000000000000000004.6000000000000000009.0000000000000000050.9566787004000000052.0000000000000000072.8621215573000000007.0000000000000000050.0361010830000000047.2490974729000000054.5560183531000000006.0000000000000000049.9711191336000000047.0397111913000000043.1488475260000000023.0000000000000000046.6281588448000000040.1516245487000000038.4653540241000000002.0000000000000000046.2129963899000000041.9963898917000000037.3850068798000000030.0000000000000000046.0144404332000000040.0324909747000000027.0930952140000000003.0000000000000000043.3971119134000000032.4801444043000000010.4757238771",
".{20}"
) %>%
unlist() %>%
as.numeric()
## [1] 341656.00000 4.60000 9.00000 50.95668 52.00000
## [6] 72.86212 7.00000 50.03610 47.24910 54.55602
## [11] 6.00000 49.97112 47.03971 43.14885 23.00000
## [16] 46.62816 40.15162 38.46535 2.00000 46.21300
## [21] 41.99639 37.38501 30.00000 46.01444 40.03249
## [26] 27.09310 3.00000 43.39711 32.48014 10.47572
Also:
as.numeric(readChar("~/Data/20.txt", rep(20, file.size("~/Data/20.txt")/20)))
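For a route that needs no extra packages, substring() can cut the string at fixed offsets. A sketch, assuming the whole line is already held in a character variable (here a made-up three-field line) and that its length is a multiple of 20:

```r
# Hypothetical line made of three zero-padded 20-character fields
line <- paste0(sprintf("%020.10f", c(341656, 4.6, 50.9566787004)), collapse = "")

# Start position of every 20-character chunk
starts <- seq(1, nchar(line), by = 20)

# Cut the chunks out and convert; leading zeros vanish on conversion
vals <- as.numeric(substring(line, starts, starts + 19))
print(vals)
```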

Related

R - aggregated data.table columns differently

I am given a large data.table that needs to be aggregated by its first column.
The requirements are the following:
For several columns, one just has to form the sum for each category (given in column 1).
For other columns, one has to calculate the mean.
There is a 1-1 correspondence between the entries in the first and second columns, so the entries of the second column should be kept.
The following is a possible example of such a data-table. Let's assume that columns 3-9 need to be summed up and columns 10-12 need to be averaged.
library(data.table)
set.seed(1)
a<-matrix(c("cat1","text1","cat2","text2","cat3","text3"),nrow=3,byrow=TRUE)
M<-do.call(rbind, replicate(1000, a, simplify=FALSE)) # stack 1000 copies of a
M<-cbind(M,matrix(sample(c(1:100),3000*10,replace=TRUE ),ncol=10))
M <- as.data.table(M)
The result should be a table of the form
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1: cat1 text1 27 81 78 95 27 22 12 76 18 76
2: cat2 text2 38 48 70 100 11 97 8 53 56 33
3: cat3 text3 58 18 66 24 14 73 18 27 92 70
but with the entries replaced by the corresponding sums and averages, respectively.
M[, names(M)[-c(1,2)] := lapply(.SD, as.numeric),
.SDcols = names(M)[-c(1,2)]][,
c(lapply(.SD[, ((3:9)-2), with=FALSE], sum),
lapply(.SD[, ((10:12)-2), with=FALSE], mean)),
by = eval(names(M)[c(1,2)])]
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#> 1: cat1 text1 51978 49854 48476 49451 49620 49870 50248 50.193 51.516 49.694
#> 2: cat2 text2 50607 50097 50572 50507 48960 51419 48905 49.700 49.631 48.863
#> 3: cat3 text3 51033 50060 49742 50345 51532 51299 50957 50.192 50.227 50.689
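The same pattern may be easier to follow on a tiny table with named columns. A sketch, assuming hypothetical column names (grp, label, s1, s2, m1) and columns that are already numeric, so the as.numeric conversion step is not needed:

```r
library(data.table)

# Small made-up table: two grouping columns, two to sum, one to average
DT <- data.table(
  grp   = c("cat1", "cat1", "cat2", "cat2"),
  label = c("text1", "text1", "text2", "text2"),
  s1    = c(1, 2, 3, 4),
  s2    = c(10, 20, 30, 40),
  m1    = c(5, 7, 1, 3)
)

sum_cols  <- c("s1", "s2")  # columns to sum per group
mean_cols <- "m1"           # columns to average per group

# Combine the two lists of per-group results into one row per group
res <- DT[, c(lapply(.SD[, sum_cols, with = FALSE], sum),
              lapply(.SD[, mean_cols, with = FALSE], mean)),
          by = .(grp, label)]
print(res)
```

Selecting by name rather than by position keeps the call readable when the column order changes.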

Skipping columns in R read.table()

I want to skip the first three columns. I couldn't quite understand the posts about colClasses because I'm new to R.
YDL025C YDL025C 1 -0.1725 -0.5375 -0.4970 -0.3818 -0.5270 -0.4260 -0.6929 -0.4020 -0.3263 -0.3373 -0.3532 -0.2771 -0.2732 -0.3307 -0.4660 -0.4314 -0.3135
YKL032C YKL032C 1 -0.2364 0.0794 0.1678 0.2389 0.3847 0.2625 0.1889 0.2681 0.0363 -0.1992 -0.0521 -0.0307 0.0584 0.2817 0.2239 -0.0253 0.0751
If you have to use read.table and you want to filter on the way in, you can use colClasses as follows. You have 20 columns. Say the first 2 are character, the rest are numeric, and you want to drop columns 4, 5 and 6. You construct a vector of length 20 detailing that information. Columns marked "NULL" are not read in.
x<- read.table(file="datat.txt",
colClasses = c(rep("character", 2),
rep("numeric", 1),
rep("NULL", 3),
rep("numeric", 14)),
header = FALSE)
x
V1 V2 V3 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1 YDL025C YDL025C 1 -0.3818 -0.5270 -0.4260 -0.6929 -0.4020 -0.3263 -0.3373 -0.3532 -0.2771 -0.2732 -0.3307 -0.4660 -0.4314 -0.3135
2 YKL032C YKL032C 1 0.2389 0.3847 0.2625 0.1889 0.2681 0.0363 -0.1992 -0.0521 -0.0307 0.0584 0.2817 0.2239 -0.0253 0.0751
As commented above, it is easier to remove the columns after reading the file in. For example:
mydf <- read.table("mydf.txt")
Then,
mydf[, 4:ncol(mydf)]
will remove the first 3 columns.
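For the question as asked (skipping the first three columns on the way in), the same colClasses idea applies with "NULL" in the first three positions. A sketch using a shortened inline copy of the sample rows; count.fields() is used here so the number of columns does not have to be typed in by hand:

```r
# Shortened copy of the sample data: 3 leading columns, 2 numeric columns
txt <- "YDL025C YDL025C 1 -0.1725 -0.5375
YKL032C YKL032C 1 -0.2364 0.0794"

# Count the whitespace-separated fields per line
con <- textConnection(txt)
n_cols <- max(count.fields(con))
close(con)

# "NULL" skips a column entirely; the remaining ones are read as numeric
x <- read.table(text = txt, header = FALSE,
                colClasses = c(rep("NULL", 3), rep("numeric", n_cols - 3)))
print(x)
```

The surviving columns keep their original numbering (V4, V5, ...), which makes it easy to see what was dropped.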

Column Mean by Factors

I would like to create a table of column means by Strain factors
I have the following data:
Age Strain 103 3 163 39
V2 28 101CD -3.4224173012 -0.3360570164 -9.2417448649 -3.6094766494
V3 28 101CD -3.6487198656 -0.7948262475 -4.6350611123 -1.9232938265
V4 28 101CD -7.0936427264 -0.1981243536 -9.2063428591 -3.367139071
V5 28 101CD -5.9245254437 -0.1161875584 -7.3830396092 -4.7980771085
V6 30 101HFD -9.4618204696 -5.0355557149 -3.9915005349 -0.9271933496
V7 30 101HFD -8.805867863 -2.667103793 -2.2489197384 -1.5169130813
V8 30 101HFD -10.9841335945 -2.9617657815 -3.3460597574 -1.121806194
V9 30 101HFD -10.4612747952 -4.3759351258 -4.4322637085 -0.772499965
V10 30 101HFD -9.2871507889 -1.2664335711 -4.3142098012 -1.3791233817
V11 30 101HFD -10.9443983294 -2.4651954898 -4.7759052834 -1.0954401254
V12 29 103CD -2.7492530803 -2.0659306194 -2.5698186069 -1.4978280502
V13 29 103CD -6.4401905692 -2.1098420514 -3.4349220483 -0.8836564768
V14 29 103CD -6.479929929 -2.4792621691 -3.368774934 -0.7756932376
V15 29 103CD -3.6586850957 -1.9145944032 -3.0911223702 -1.2730896376
V16 29 103CD -7.1377230731 -1.413139617 -2.9203340711 -1.3152010161
V17 29 103HFD -9.4624093184 -1.3265834556 -4.1871313168 -1.0108235293
V18 29 103HFD -7.336764023 -0.8712499419 -4.204313727 -1.4450582002
V19 29 103HFD -7.036723106 -0.7546877382 -6.0432957599 -1.4161366956
V20 29 103HFD -9.4449207581 -0.9226067311 -4.6305567775 -1.320094489
V21 29 103HFD -9.6383454033 -1.9620356763 -3.0214290407 -0.8602682738
And, I want to end up with this:
Age Strain 103 3 163 39
V1 28 101CD -3.4224173012 -0.3360570164 -9.2417448649 -3.6094766494
V2 30 101HFD -9.4618204696 -5.0355557149 -3.9915005349 -0.9271933496
V3 29 103CD -2.7492530803 -2.0659306194 -2.5698186069 -1.4978280502
V4 29 103HFD -9.4624093184 -1.3265834556 -4.1871313168 -1.0108235293
Where [1,] is the mean of all columns for all samples with Strain=101CD, [2,] is the mean of all columns for samples with Strain=101HFD, etc.
I have attempted to use:
> ave <- aggregate(data, as.list(factor(data$Age)), mean)
Error in aggregate.data.frame(data, as.list(factor(data$Age)), mean) : arguments must have same length
and
> ave <- sapply(split(data, data$Strain), mean)
101CD 101HFD 103CD 103HFD 32CD 40CD 40HFD 43CD 43HFD 44CD 44HFD
NA NA NA NA NA NA NA NA NA NA NA
...
97HFD 98CD 98HFD 99CD 99HFD
NA NA NA NA NA
There were 50 or more warnings (use warnings() to see the first 50)
and
> ave <- daply(data, data$Strain, mean)
Error in parse(text = x) : <text>:1:4: unexpected symbol
1: 101CD
I feel like there should be a fairly straightforward way to accomplish this, but I have been unable to find a solution.
You can use dplyr. Here we group_by Strain, then use summarise_each to summarise each column, with the function mean with na.rm set to TRUE:
library(dplyr)
data %>% group_by(Strain) %>%
summarise_each(funs(mean(., na.rm=TRUE)))
Source: local data frame [4 x 6]
Strain Age X103 X3 X163 X39
(fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
1 101CD 28 -5.022326 -0.3612988 -7.616547 -3.424497
2 101HFD 30 -9.990774 -3.1286649 -3.851476 -1.135496
3 103CD 29 -5.293156 -1.9965538 -3.076994 -1.149094
4 103HFD 29 -8.583833 -1.1674327 -4.417345 -1.210476
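In current dplyr releases summarise_each() and funs() are deprecated, and across() is the replacement. A sketch of the equivalent call, assuming dplyr 1.0 or later and a small made-up data frame with the same column layout:

```r
library(dplyr)

# Made-up data in the same shape as the question's table
df <- data.frame(
  Age    = c(28, 28, 30, 30),
  Strain = c("101CD", "101CD", "101HFD", "101HFD"),
  X103   = c(-3, -5, -9, -11)
)

# across(everything()) covers every non-grouping column
res <- df %>%
  group_by(Strain) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
print(res)
```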
Exploit the fact that a data.frame is a special kind of list, so it can be passed directly as the by argument (here grouping by Age, as in your first attempt):
aggregate(data, data[, "Age", drop = FALSE], mean)
drop = FALSE is required so that the result of the selection remains a data.frame. data[, "Age"] is equivalent to data[, "Age", drop = TRUE] and will return a vector.
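To group by Strain, as the question asks, the formula interface of aggregate() is another base-R option. A sketch on a small made-up data frame; note that Age is averaged along with the measurement columns:

```r
# Made-up data in the same shape as the question's table
df <- data.frame(
  Age    = c(28, 28, 30, 30),
  Strain = c("101CD", "101CD", "101HFD", "101HFD"),
  X103   = c(-3, -5, -9, -11)
)

# "." stands for every column not named on the right-hand side
res <- aggregate(. ~ Strain, data = df, FUN = mean)
print(res)
```

The formula interface drops rows containing NA by default, which may or may not be what is wanted for the real data.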

R: format csv file as.data.frame

I have a CSV file I'm importing, but some columns are not being read with the correct type. It's very strange and I can't figure it out. It looks as if the top row is causing the columns to be read as character instead of numeric. I believe it is picking up the formatting from the V1/time1 column?
> dde = read.csv("dde.csv",header=F, sep=",",stringsAsFactors=FALSE)
> dde <- na.omit(dde)
> dde <- as.data.frame(dde)
> time1 = as.POSIXct(strptime(paste(dde$V1, sep=" "),format="%m/%d/%Y %I:%M:%S %p"))
> head(dde)
V1 V2 V3 V4 V5 V6 V7 V8
1 9/7/2014 9:20:00 PM 105.061 136.099 169.961 98.391 96.515 112.802 87.277
2 9/7/2014 9:26:00 PM 105.068 136.074 169.954 98.399 96.521 112.790 87.276
3 9/7/2014 9:31:00 PM 105.078 136.107 170.031 98.414 96.528 112.813 87.287
4 9/7/2014 9:35:00 PM 105.068 136.102 170.001 98.424 96.516 112.789 87.289
5 9/7/2014 9:41:00 PM 105.074 136.109 169.994 98.422 96.519 112.821 87.300
6 9/7/2014 9:45:00 PM 105.091 136.114 170.028 98.420 96.539 112.829 87.302
V9 V10 V11 V12 V13 V14 V15 V16 V17
1 1.29531 0.80054 1.38283 1.40974 1.20601 1.55867 1.61761 0.93644 1.08825
2 1.29503 0.80041 1.38256 1.40949 1.20607 1.55817 1.61749 0.93643 1.08828
3 1.29514 0.80026 1.38256 1.40963 1.20607 1.55828 1.61796 0.93650 1.08832
4 1.29520 0.80038 1.38250 1.40957 1.20594 1.55819 1.61791 0.93666 1.08835
5 1.29517 0.80042 1.38259 1.40965 1.20590 1.55843 1.61777 0.93658 1.08840
6 1.29519 0.80046 1.38275 1.40969 1.20588 1.55860 1.61780 0.93648 1.08834
V18 V19 V20 V21 V22 V23 V24 V25 V26
1 0.93103 0.83073 1.72682 1.77693 1.50608 1.94649 1.01918 0.87190 1.12698
2 0.93106 0.83075 1.72689 1.77693 1.50593 1.94627 1.01912 0.87187 1.12676
3 0.93109 0.83069 1.72704 1.77693 1.50638 1.94661 1.01929 0.87202 1.12684
4 0.93110 0.83082 1.72687 1.77693 1.50645 1.94631 1.01941 0.87213 1.12694
5 0.93101 0.83080 1.72681 1.77693 1.50613 1.94643 1.01934 0.87199 1.12701
6 0.93097 0.83070 1.72706 1.77693 1.50613 1.94680 1.01927 0.87190 1.12696
V27 V28 V29 V30
1 0.85511 0.90400 0.77324 1268.81
2 0.85520 0.90390 0.77332 1268.81
3 0.85517 0.90405 0.77328 1268.81
4 0.85515 0.90415 0.77333 1268.81
5 0.85508 0.90423 0.77344 1268.81
6 0.85513 0.90412 0.77334 1268.81
> V22 = xts(dde$V22, order.by=time1)
> V22 <-to.minutes(V22[,1],240,'minutes')
> V22 <- align.time(xts(V22),5 * 60)
>
> V2 = xts(dde$V2, order.by=time1)
> V2 <-to.minutes(V2[,1],240,'minutes')
Error in to.period(x, "minutes", k = k, name = name, ...) :
unsupported type
> V2 <- align.time(xts(V2),5 * 60)
>
> class(dde$V22)
[1] "numeric"
> class(dde$V2)
[1] "character"
> typeof(dde$V22)
[1] "double"
> typeof(dde$V2)
[1] "character"
When I do the things you said you did in the comments, it works for me:
dde <- read.csv("dde.csv", header=FALSE, stringsAsFactors=FALSE)
dde <- na.omit(dde)
colnames(dde)[1] <- "time"
colnames(dde)[2] <- "test1"
time1 = as.POSIXct(strptime(paste(dde$time, sep=" "),format="%m/%d/%Y %I:%M:%S %p"))
test1 = xts(dde[, c("test1")], order.by=time1)
test1 <- to.minutes(test1[,1],240,'minutes')
test1 <- align.time(xts(test1),5 * 60)
test1
# minutes.Open minutes.High minutes.Low minutes.Close
#2014-09-07 21:30:00 105.061 105.068 105.061 105.068
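If a column still comes in as character, one way to narrow it down is to look for entries that do not convert to numbers. A sketch on a made-up vector; the "n/a" entry is an assumption for illustration, not something known about dde.csv:

```r
# Hypothetical column contents as read.csv might return them
v2 <- c("105.061", "105.068", "n/a", "105.078")

# Entries that become NA on conversion, but were not NA to begin with
as_num <- suppressWarnings(as.numeric(v2))
bad <- which(is.na(as_num) & !is.na(v2))

# Show the offending values
print(v2[bad])
```

Running the same check on dde$V2 would show which rows keep the column from being numeric, if any.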

random forest error NA not permitted in predictors

Hi, I am using the following R script to build a random forest:
# load the necessary libraries
library(randomForest)
testPP<-numeric()
# load the dataset
QdataTrain <- read.csv('train.csv',header = FALSE)
QdataTest <- read.csv('test.csv',header = FALSE)
QdataTrainX <- subset(QdataTrain,select=-V1)
QdataTrainY<-as.factor(QdataTrain$V1)
QdataTestX <- subset(QdataTest,select=-V1)
QdataTestY<-as.factor(QdataTest$V1)
mdl <- randomForest(QdataTrainX, QdataTrainY)
where I am getting the following error:
Error in randomForest.default(QdataTrainX, QdataTrainY) :
NA not permitted in predictors
However, I see no occurrence of NA in my data.
For reference, here is my data:
https://docs.google.com/file/d/0B0iDswLYaZ0zUFFsT01BYlRZU0E/edit
Does anyone know why this error is being thrown? I'll keep looking in the meantime.
Thanks in advance for any help!
The given data does contain some missing values (7 in total):
sapply(QdataTrainX, function(x) sum(is.na(x)))
## V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29
## 0 0 0 0 0 0 1 1 1 1 1 1 1
Therefore columns V23 to V29 have one missing value each.
which(is.na(QdataTrainX$V23))
## 318
This gives the row number of the missing value.
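One way to proceed once the incomplete rows are known is to drop them from both the predictors and the response before fitting. A sketch on a small made-up data frame (X and y stand in for QdataTrainX and QdataTrainY); whether dropping rows is acceptable depends on the data:

```r
# Made-up predictors with one missing value, plus a matching response
X <- data.frame(V2 = c(1, 2, 3, 4), V23 = c(0.5, NA, 0.7, 0.9))
y <- factor(c("a", "b", "a", "b"))

# TRUE for rows without any NA
keep <- complete.cases(X)

# Drop the incomplete rows from predictors and response alike
X_clean <- X[keep, , drop = FALSE]
y_clean <- y[keep]
print(which(!keep))
```

The randomForest package also provides na.roughfix() for simple imputation, if removing rows is not desirable.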
