Import fixed width data in R

I have a problem importing a file into R. When correctly organized, the file should contain 5 million records and 22 columns, but I cannot separate the data into columns properly. I tried this code:
content <- scan("filepath", what = "character", sep = "~")  # read the file as a single string
# Split the string into lines: each record is exactly 211 characters,
# about 5 million records in total
lines <- regmatches(content, gregexpr(".{211}", content))
x <- tempfile()
library(erer)
write.list(lines, x)   # write one record per line to a temporary file
data <- read.fwf(x, widths = c(12,9,9,3,4,8,1,1,3,3,3,1,12,14,13,30,8,9,12,6,6,27))
unlink(x)
Each record contains numbers and letters. I don't know what to correct so that the data separates into columns properly.
Every row looks like this:
1000100060040000000000808040512000000188801072010010010000000000000 CABANILLAS GONZALES MARIA MANUEL CABANILLAS MARIA GONZALES 00000000000000000000000
I want to separate the rows according to the widths specified in the function. They include some padding spaces that I do not want in the final output.
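One way this could be approached, as a sketch (assuming the raw file really is one unbroken string of 211-character records; "filepath" is a placeholder): read the whole file as one string, cut it into 211-character pieces, and let read.fwf strip the padding via strip.white, which it passes through to read.table.
# Read the whole file as one string, then cut it into 211-character records
raw <- readChar("filepath", file.info("filepath")$size)
starts <- seq(1, nchar(raw) - 210, by = 211)
recs <- substring(raw, starts, starts + 210)

# Write one record per line and parse with the fixed widths
tmp <- tempfile()
writeLines(recs, tmp)
data <- read.fwf(tmp, widths = c(12,9,9,3,4,8,1,1,3,3,3,1,12,14,13,30,8,9,12,6,6,27),
                 strip.white = TRUE)   # drop the padding spaces
unlink(tmp)
Note that the widths shown sum to 194, so the last 17 characters of each 211-character record would be ignored; add a final width if they are needed.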

R: recognizing and importing multiple tables from a single Excel file

I have read all the posts I could find on problems like this, but I did not succeed.
I need to extract tables of different layouts from a single sheet in Excel, for each sheet of the file.
Any help or ideas would be greatly appreciated.
A sample of the data file and its structure can be found here.
I would use readxl. The code below reads just one sheet, but it is easy enough to adapt to read multiple or different sheets.
First we just want to read the sheet. Obviously you should change the path to reflect where you saved your file:
library(readxl)
sheet = read_excel("~/Downloads/try.xlsx", col_names = LETTERS[1:12])
If you didn't know you had 12 columns, then using read_excel without specifying the column names would give you enough information to find that out. The different tables in the sheet are separated by one or two blank rows. You can find the blank rows by testing each row to see if all of the cells in that row are NA using the apply function.
blanks = which(apply(sheet, 1, function(row)all(is.na(row))))
blanks
[1] 7 8 17 26 35 41 50 59 65 74 80 86 95 98
So you could extract the first table by taking rows 1 to 6 (one less than the first blank row at 7), the second table by taking rows 9 to 16, and so on.
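From there, a rough sketch of how you might split the sheet into a list of tables using those blank rows (my addition, assuming sheet and blanks as defined above):
bounds <- c(0, blanks, nrow(sheet) + 1)   # sentinel boundaries around each table
tables <- list()
for (i in seq_len(length(bounds) - 1)) {
  first <- bounds[i] + 1
  last <- bounds[i + 1] - 1
  if (first <= last)                      # skip runs of consecutive blank rows
    tables[[length(tables) + 1]] <- sheet[first:last, ]
}
Each element of tables is then one of the blocks between blank rows; consecutive blank rows (such as 7 and 8 above) produce no empty elements.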

Checking for number of items in a string in R

I have a very large csv file (1.4 million rows). It is supposed to have 22 fields and 21 commas in each row. It was created by compiling quarterly text files into one large text file so that I could import it into SQL. In some past files, one field was missing. I don't have the time to go row by row and check for this.
In R, is there a way to verify that each row has 22 fields or 21 commas? Below is a small sample data set. The possibly missing field is the 0 in the 10th slot.
32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1
You can use the base R function count.fields to do this:
count.fields(tmp, sep=",")
[1] 22 22
The input for this function is the name of a file or a connection. Below, I supplied a textConnection (see the data section at the end of this answer). For large files, you would probably want to feed the result into table:
table(count.fields(tmp, sep=","))
Note that this can also be used to count the number of rows in a file using length, similar to the output of wc -l in the *nix OSs.
data
tmp <- textConnection(
"32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1"
)
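On the real file, the same idea might look like this (a sketch; "quarterly_compiled.csv" is a placeholder path):
n <- count.fields("quarterly_compiled.csv", sep = ",")
table(n)          # distribution of field counts across all rows
which(n != 22)    # row numbers that do not have 22 fields
length(n)         # total row count, like wc -l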
Alternatively, assuming df is your dataframe:
apply(df, 1, length)
This will give you the length of each row. Note, though, that every row of a data frame has the same number of columns, so this checks the parsed result rather than the raw file; count.fields on the raw file is the more direct test.

R: lack of value in second column of first row causing read.table to recognise a 2D file as 1D

I have a series of data frames in my R environment that I have read in as follows:
x <- list.files(pattern = "nuc_occupancy_region")
for (i in seq_along(x)) {
  print(x[i])
  assign(x[i], read.table(x[i], sep = '\t', header = TRUE, fill = TRUE))
}
ESC=ls()[grep(ls(), pattern='ESC_nuc')]
MEF=ls()[grep(ls(), pattern='MEF_nuc')]
The list of files MEF often have missing data:
e.g., from the command line:
head MEF_nuc_occupancy_regionCybb9049012-9053217chrX.txt
9049012 26
9049013
9049014 29
9049015
9049016 26
etc.
The above file is not a problem, as the missing values will be read as NAs and I can deal with that later.
However, in other files the second value of the first row is missing:
117755994
117755995
117755996
117755997 6
117755998 6
117755999 6
so despite the fact that each file has 2 columns, the lack of a second value in the first row causes some of them to be recognised as having a single column:
read.table("example.txt", sep = '\t', header = TRUE, fill = TRUE)
117755994
117755995
117755996
117755997
6
117755998
6
117755999
6
Is there some way to avoid this as I need all the data frames to be in 2D?
Thanks
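Before reaching for Python, it may be worth noting (my addition, not from the original answer) that read.table only guesses the column count from the first five lines when col.names is absent; supplying two column names should force a two-column result even when the early rows lack their second value. A sketch, with made-up column names pos and occ:
# col.names fixes the column count, so short rows are filled with NA
df <- read.table("example.txt", sep = '\t', fill = TRUE,
                 col.names = c("pos", "occ"))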
I ended up just sorting it out with Python, since readlines() does not care how many columns a line has:
import os

# For every file in the directory, pad lines that are missing their
# second column with a 0, writing to a new '...formatted' file
for fname in os.listdir('.'):
    with open(fname) as f:
        lines = f.readlines()
    with open(fname + 'formatted', 'a') as corrected:
        for line in lines:
            fields = line.rstrip('\n').split()
            if len(fields) < 2:
                fields.append(0)
            corrected.write("{}\t{}\n".format(fields[0], fields[1]))

R readr package - written and read in file doesn't match source

I apologize in advance for the somewhat lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and redo the paring each time I want to work on it. (Obviously it's all scripted, but it takes about 45 minutes, so I'd like to avoid repeating it if possible.)
So I wrote out the data and read it in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
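For what it's worth (my addition, hedged): the "embedded null" failures suggest the raw data itself contains NUL bytes or stray delimiters that break the CSV round trip. If the goal is simply to avoid re-reading and re-paring the source files each session, a binary round trip with readr's write_rds/read_rds sidesteps CSV parsing entirely (a sketch; the path is a placeholder):
write_rds(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.rds')
sa_all2 <- read_rds('D:\\Open_Payments\\data\\written_files\\sa_all.rds')
Unlike CSV, the rds file preserves column types and cell contents byte for byte, so table(sa_all2$src) should match the original exactly.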

Collecting data into one row from different CSV files by name

It's hard to explain exactly what I want to achieve with my script, but let me try.
I have 20 different csv files, so I loaded them into R:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
then, with your help, I combined them into one table (all_data) and removed all of the duplicates:
data_rd <- subset(transform(all_data, X = sub("\\..*", "", X)),
!duplicated(X))
I now have one master table which includes all of the names (Accession):
Accession
AT1G19570
AT5G38480
AT1G07370
AT4G23670
AT5G10450
AT4G09000
AT1G22300
AT1G16080
AT1G78300
AT2G29570
Now I would like to find each accession in the other csv files and put that accession's data in the same row. There are about 20 csv files, each with about 20 columns, so in some cases this might give me 400 columns. It doesn't matter how long it takes; it has to be done. Is it even possible to do in R?
Example:
First csv Second csv Third csv
Accession Size Length Weight Size Length Weight Size Length Weight
AT1G19570 12 23 43 22 77 666 656 565 33
AT5G38480
AT1G07370 33 22 33 34 22
AT4G23670
AT5G10450
AT4G09000 12 45 32
AT1G22300
AT1G16080
AT1G78300 44 22 222
AT2G29570
It looks like a hard task to do. Probably it has to be done with a loop. Any ideas?
This is a merge loop. Here is rough R code that will inefficiently grow with each merge.
Begin as before:
tbls <- list.files(pattern = "*.csv")
list_of_data <- lapply(tbls, read.csv)
tbl <- list_of_data[[1]]
for (i in 2:length(list_of_data)) {
  tbl <- merge(tbl, list_of_data[[i]], by = "Accession", all = TRUE)
}
The matching column names (those not used as the key) will be renamed with a suffix (.x, .y, and so on), and the all=TRUE argument ensures that whenever a new Accession key is merged, a new row is made and the missing cells are filled with NA.
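The same loop can be written more compactly with Reduce, which folds merge over the whole list (a sketch under the same assumption that every file has an Accession column):
merged <- Reduce(function(a, b) merge(a, b, by = "Accession", all = TRUE),
                 list_of_data)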
