Reading a fixed length text file in R

I currently work in SAS, but I have used R for a very long time. I have some fixed-width text files to read. They read easily in SAS, but reading them in R has been hell.
The file looks somewhat like this:
DP JAMES SILVA REY
2014
6
0
1723713652
2
0
DP ALEJANDRA NARVAEZ
2014
6
0
1723713456
6
0
DP NANYER PICHARDO
2014
6
0
1723713991
1
0
DP GABRIELA ANASI CASTILLO
2014
6
0
1723713240
3
0
It is not clear here; I have attached the file, please see the attachment.
It reads easily in SAS using infile and input.
SAS Code:
infile "filename.txt" lrecl=32767 ;
input
#001 park_cd $5.
#006 Title $15.
#021 first_name $25.
#046 middle_name $25.
#071 last_name $25.
#096 suffix $15.
#111 ADDRESS_1 $60.
#171 ADDRESS_2 $60.
#231 ADDRESS_3 $60.
#261 CITY $30.
#291 STATE_PROVINCE $2.
#293 ZIP $9.
#302 Ticket_Year $11.
#314 product_id $12.
#327 UNIT_PRICE $13.
#340 PURCHASE_DT $26.
#366 PURCHASE_QTY $12.
#378 TOTAL_PURCHASE_AMT $14. ;
run;
Now, to do the same in R, I have tried various things:
1) First, read.fwf.
Code:
dat1=read.fwf("D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header = FALSE, sep = "\t",fill = TRUE,
skip = 0, col.names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "), fileEncoding = "ASCII")
But it returns NA for most of the fields, and the few values it does read are in the wrong positions.
head(dat1) gives this output:
park_cd Title first_name middle_name
1 DP JAMES SILVA
2
3 <NA>
4 <NA> <NA> <NA>
5 <NA> <NA>
6 2014 <NA> <NA>
last_name suffix
1 REY
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
ADDRESS_1.
1
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
ADDRESS_2 ADDRESS_3 CITY
1 NA NA
2 <NA> NA NA
3 <NA> NA NA
4 <NA> NA NA
5 <NA> NA NA
6 <NA> NA NA
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA
TOTAL_PURCHASE_AMT.
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
2) Next, I used the SAScii package to reuse the SAS input code in R.
Code:
sas_imp <- "input
#001 park_cd $5.
#006 Title $15.
#021 first_name $25.
#046 middle_name $25.
#071 last_name $25.
#096 suffix $15.
#111 ADDRESS_1 $60.
#171 ADDRESS_2 $60.
#231 ADDRESS_3 $60.
#261 CITY $30.
#291 STATE_PROVINCE $2.
#293 ZIP $9.
#302 Ticket_Year $11.
#314 product_id $12.
#327 UNIT_PRICE $13.
#340 PURCHASE_DT $26.
#366 PURCHASE_QTY $12.
#378 TOTAL_PURCHASE_AMT $14. ;"
sas_imp.tf <- tempfile()
writeLines(sas_imp, con = sas_imp.tf)
parse.SAScii(sas_imp.tf)
read.SAScii("filename.txt", sas_imp.tf)
It too gives the same useless output as above.
3) Then I tried the LaF package and the laf_open_fwf function:
library(LaF)
data <- laf_open_fwf(filename="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
column_types=rep("character",18),
column_names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
column_widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14))
Then I convert it like this:
library(ffbase)
my.data <- laf_to_ffdf(data)
head(as.data.frame(my.data))
But it gives output:
park_cd Title first_name middle_name last_name
1 DP JAMES SILVA REY
2 \r\n \r\n
3 JANDR A NARVAEZ
4 \r\n \r \n \r\n \r\n 20
5 PICHARDO
6 \r\n \r\n \r\n \r\n 2014\r\n 6\r\n
suffix
1
2 \r\n \r\n
3
4 14\r\n
5
6 0\r\n
ADDRESS_1.
1
2 2014\r\n 6\r\n 0\r\n 172
3
4 6\r\n 0\r\n 1723713456\r\n 6\r\n
5
6 1723713991\r\n 1\r\n 0\r\nDP
ADDRESS_2 ADDRESS_3 CITY
1 \r *\003
2 3713652\r\n 2\r\n 0\r\nDP A L *\003
3 \r\n *\003
4 0\r\nDP NANYER *\003
5 \r\n *\003
6 GABRIELA ANASI *\003
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
2 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
3 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
4 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
5 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
6 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
TOTAL_PURCHASE_AMT.
1 \001
2 \001
3 \001
4 \001
5 \001
6 \001
4) Lastly, read.table.ffdf:
library(ff)
library(stringr)
my.data1 <- read.table.ffdf(file="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
FUN="read.fwf",
widths = c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header=F, VERBOSE=TRUE,
col.names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
fileEncoding = "UTF-8",
transFUN=function(x){
z <- sapply(x, function(y) {
y <- str_trim(y)
y[y==""] <- NA
factor(y)})
as.data.frame(z)
} )
But the result is the same.
I found the last solution on this page: http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html
What am I doing wrong? Am I specifying the widths incorrectly?
Or am I getting the idea wrong altogether?
I have worked with many things in R and just can't believe that something so easy in SAS is so tough in R. I must be missing something simple. If you have any ideas about this kind of problem, please help me, folks. Thanks in advance.

Update
Please see here for what I use nowadays for this problem:
Faster way to read fixed-width files
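For instance, readr::read_fwf handles this kind of layout; a minimal sketch, assuming the file really is one fixed-width record per line (yours, per the answer below, may not be), reusing the widths from the SAS layout in the question:
library(readr)
widths <- c(5, 15, 25, 25, 25, 15, 60, 60, 60, 30, 2, 9, 11, 12, 13, 26, 12, 14)
cols <- c("park_cd", "Title", "first_name", "middle_name", "last_name", "suffix",
          "ADDRESS_1", "ADDRESS_2", "ADDRESS_3", "CITY", "STATE_PROVINCE",
          "ZIP", "Ticket_Year", "product_id", "UNIT_PRICE", "PURCHASE_DT",
          "PURCHASE_QTY", "TOTAL_PURCHASE_AMT")
dat <- read_fwf("filename.txt", col_positions = fwf_widths(widths, col_names = cols))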
For posterity, the original answer is retained below as a how-to guide for bootstrapping solutions while desperate 😅
Here's the FW -> .csv converter I created in Python to destroy these awful files:
It also includes a checkLength function that can help get at what @RobertLong suggested: that your underlying file might be faulty. If that's the case, you may be in trouble if the damage is pervasive and unpredictable (i.e. there are no consistent mistake patterns in your file that you can Ctrl+H to fix).
Please note that dictfile must be formatted correctly (I wrote this for myself, not necessarily to be as robust as possible).
import os
import csv

# Set the correct working directory
os.chdir('/home/michael/...')  # match the format of your OS

def checkLength(ffile):
    """
    Used to check that all lines in the file have the same length
    (and so don't cause any issues below)
    """
    with open(ffile, 'r') as ff:
        firstrow = 1
        troubles = 0
        for rows in ff:
            if firstrow:
                length = len(rows)
                firstrow = 0
            elif len(rows) != length:
                print(rows)
                print(len(rows))
                troubles = 1
    return troubles

def fixed2csv(infile, outfile, dictfile):
    """
    This function takes a file name for a fixed-width dataset as input and
    converts it to .csv format according to slices and column names
    specified in dictfile

    Parameters
    ==========
    infile: string of input file name from which fixed-width data is to be read
      e.g. 'fixed_width.dat'
    outfile: string of output file name to which comma-separated data is to be saved
      e.g. 'comma_separated.csv'
    dictfile: .csv-formatted dictionary file name from which to read the following:
      * widths: field widths
      * column names: names of columns to be written to the output .csv
      * types: object types (character, integer, etc)
      column order must be: col_names,widths,types
    """
    with open(dictfile, 'r') as dictf:
        fieldnames = ("col_names", "widths", "types")  # types used in R later
        ddict = csv.DictReader(dictf, fieldnames)
        slices = []
        colNames = []
        wwidths = []
        for rows in ddict:
            wwidths.append(int(rows['widths']))
            colNames.append(rows['col_names'])
        # turn the widths into 0-based slice objects, one per field
        offset = 0
        for w in wwidths:
            slices.append(slice(offset, offset + w))
            offset += w
    with open(infile, 'r') as fixedf:
        with open(outfile, 'w') as csvf:
            csvfile = csv.writer(csvf)
            csvfile.writerow(colNames)
            for rows in fixedf:
                csvfile.writerow([rows[s] for s in slices])
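A hypothetical round trip, assuming a dictionary file dict.csv whose rows look like park_cd,5,character: run fixed2csv('fixed_width.dat', 'comma_separated.csv', 'dict.csv') in Python, then pick up the result in R:
# read the converted file; keep everything as character and trim the padding
dat <- read.csv("comma_separated.csv", colClasses = "character", strip.white = TRUE)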
Good luck, and curses be on whoever it is that is proliferating these FW format data files.

The file you uploaded is not a fixed-width file:
I am not a SAS user, but from looking at the SAS code in your post, the column widths in the code do not match up with those in the file.
Moreover, some lines are entirely blank.
It appears that there are many carriage return / line feeds which do not belong there; in particular, they seem to be used in places as a delimiter. There should be one CRLF at the end of each line, and that's it.
Since you say that SAS opens it, I suggest you save to CSV format in SAS and then open that in R. Alternatively, you could remove the superfluous CRLFs with a good text editor/processor, leaving a single CRLF at the end of each line. Since it appears that each "real" line begins with "DP", you could try replacing -CRLF-DP with, say, -tab-DP, then deleting all remaining -CRLF-s, then replacing all -tab-s with -CRLF- (this relies on there being no -tab-s in the file already).
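A minimal sketch of that recipe in R, assuming every real record starts with "DP " and the file contains no tabs (filenames are placeholders):
lines <- readLines("filename.txt")               # stray breaks split the records
txt <- paste(lines, collapse = "\n")             # rejoin into a single string
txt <- gsub("\r", "", txt, fixed = TRUE)         # drop stray carriage returns
txt <- gsub("\nDP ", "\tDP ", txt, fixed = TRUE) # protect real record starts
txt <- gsub("\n", "", txt, fixed = TRUE)         # delete the spurious line breaks
txt <- gsub("\t", "\n", txt, fixed = TRUE)       # restore one break per record
writeLines(txt, "filename_fixed.txt")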

Related

Split text string into column based on variable

I have a dataframe with a text column that I would like to split into multiple columns, since the text string contains multiple variables, such as location, education, distance, etc.
Dataframe:
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
df = data.frame(text.string)
df
text.string
1 &location=NY&distance=30&education=University
2 &location=CA&distance=30&education=Highschool&education=University
3 &location=MN&distance=10&industry=Healthcare
4 &location=VT&distance=30&education=University&industry=IT&industry=Business
I can split this using cSplit: cSplit(df, 'text.string', sep = "&"):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1: NA location=NY distance=30 education=University NA NA
2: NA location=CA distance=30 education=Highschool education=University NA
3: NA location=MN distance=10 industry=Healthcare NA NA
4: NA location=VT distance=30 education=University industry=IT industry=Business
The problem is that the text string may contain multiples of the same variable, or miss a certain variable. With cSplit, the grouping of the variables per column becomes all mixed up. I would like to avoid this and keep them grouped together.
So it would look similar to this (education and industry no longer appear spread across multiple columns):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1 NA location=NY distance=30 education=University <NA> NA
2 NA location=CA distance=30 education=Highschool education=University <NA> NA
3 NA location=MN distance=10 <NA> industry=Healthcare NA
4 NA location=VT distance=30 education=University industry=IT industry=Business NA
Taking into account @NicE's comment:
This is one way, following your example:
library(data.table)
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
clean <- strsplit(text.string, "&|=")
out <- lapply(clean, function(x) {
  # drop the leading empty string, then fill a 2-row matrix:
  # row 1 holds the variable names, row 2 the values
  ma <- data.table(matrix(x[!x == ""], nrow = 2, byrow = FALSE))
  setnames(ma, as.character(ma[1, ]))  # first row becomes the column names
  ma[-1, ]                             # keep only the values row
})
out <- rbindlist(out, fill = T)
out
location distance education education industry industry
1: NY 30 University NA NA NA
2: CA 30 Highschool University NA NA
3: MN 10 NA NA Healthcare NA
4: VT 30 University NA IT Business

How to treat missing values in multiple format as NA while loading on R?

I have a data set which has missing values in the form of NULL, blank cells, and !VALUE. I have to load this data set into R and want to treat all kinds of missing values as NA. Can the na.strings argument of read.csv or read.table take multiple values?
Yes, na.strings can take multiple values in read.csv/read.table.
Ex:
dsa <- read.table("C://Users/....../dsa.txt",sep=",",header=T)
ID Name Date
1 1 AA 2011
2 2 BB 2012
3 3 CC 2013
dsa <- read.table("C://Users/....../dsa.txt",sep=",",header=T,na.strings = c("AA","2"))
ID Name Date
1 1 <NA> 2011
2 NA BB 2012
3 3 CC 2013
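Applied to the case in the question, a call along these lines should do it (a sketch; the file name is a placeholder):
# treat NULL, empty cells, and !VALUE all as NA while reading
ds <- read.csv("myset.csv", header = TRUE,
               na.strings = c("NULL", "", "!VALUE"))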

R-text connection

I need to create a table with the following data (I don't want to use a csv to import it).
toread<-"DomicLabProv cluster
BUENOS AIRES 1
CHUBUT 1
FORMOSA 1
LA PAMPA 1
SAN JUAN 1
CAPITAL FEDERAL 1
MISIONES 1
SAN LUIS 1
SANTA FE 1
ENTRE RIOS 2
JUJUY 2
LA RIOJA 2
SANTIAGO DEL ESTERO 2
CHACO 2
CORDOBA 2
CORRIENTES 2
SALTA 2
TIERRA DEL FUEGO 2
TUCUMAN 2
CATAMARCA 3
MENDOZA 3
NEUQUEN 3
RIO NEGRO 3
SANTA CRUZ 3
"
As you can see, the first and second fields are separated by tabs.
When I try:
read.table(textConnection(toread), header = TRUE)
I get the following error message:
Error in scan(...) : line 2 did not have 2 elements
I think this is related to the fact that the names in DomicLabProv contain spaces, for example "BUENOS AIRES". Is there a way to overcome this issue? The spaces inside the names were typed with the space bar, while the separators between fields were made with the tab key.
Thanks.
No need for a textConnection; pass the string to the function read.table via the text option instead:
read.delim(text = toread)
(read.delim is the same as read.table but uses tabs as delimiters, and defaults to having a header.)
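For completeness, the equivalent call spelled out through read.table itself (a sketch):
read.table(text = toread, header = TRUE, sep = "\t")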
This works if your text is indeed delimited by the tab character '\t'. If that isn’t the case, a bit more work is required, as you need to manually split the columns, while taking care not to split fields like “LA PAMPA”, which also contains a space. This is finicky and best avoided by having the data in the right format to start with. In your particular case, we can use the information that the second column is numeric (but we first need to remove the header, since that doesn’t conform):
header = strsplit(sub('\n.*', '', toread), ' +')[[1]]        # first line, split on runs of spaces
no_header = sub('^.*?\n', '', toread)                        # drop the header line
no_header = gsub(' +(?=\\d)', '\t', no_header, perl = TRUE)  # put a tab before the numeric column
data = read.delim(text = no_header, header = FALSE)          # now two tab-separated columns
colnames(data) = header
With a little regex, we can convert it to CSV format:
read.csv(text = sub('\\s+(\\S+$)', ',\\1', readLines(textConnection(toread))))
# DomicLabProv cluster
# 1 BUENOS AIRES 1
# 2 CHUBUT 1
# 3 FORMOSA 1
# 4 LA PAMPA 1
# 5 SAN JUAN 1
# 6 CAPITAL FEDERAL 1
# 7 MISIONES 1
# 8 SAN LUIS 1
# 9 SANTA FE 1
# 10 ENTRE RIOS 2
# 11 JUJUY 2
# 12 LA RIOJA 2
# 13 SANTIAGO DEL ESTERO 2
# 14 CHACO 2
# 15 CORDOBA 2
# 16 CORRIENTES 2
# 17 SALTA 2
# 18 TIERRA DEL FUEGO 2
# 19 TUCUMAN 2
# 20 CATAMARCA 3
# 21 MENDOZA 3
# 22 NEUQUEN 3
# 23 RIO NEGRO 3
# 24 SANTA CRUZ 3
sub looks for whitespace characters \\s+ followed by a group that it captures (...), which consists of anything but whitespace characters \\S+ followed by the end of the line $. It replaces them with a comma , followed by the captured group \\1.

readr::read_csv(), empty strings as NA not working

I was trying to load a CSV file with readr::read_csv() in which some entries are blank. I set na = "" in read_csv(), but it still loads them as blank entries.
d1 <- read_csv("sample.csv",na="") # want to load empty string as NA
where sample.csv looks like the following:
Name,Age,Weight,City
Sam,13,30,
John,35,58,CA
Doe,20,50,IL
Ann,18,45,
d1 should look like the following (using read_csv()):
Name Age Weight City
1 Sam 13 30 NA
2 John 35 58 CA
3 Doe 20 50 IL
4 Ann 18 45 NA
The first and fourth rows of City should have NA (as shown above), but they actually show up blank.
Based on the comments and verifying myself, the solution was to upgrade to readr_0.2.2.
Thanks to fg nu, akrun, and Richard Scriven.

extracting values of a column into a string and replacing values in a data frame column

More than the programming, I am lost on the right approach for this problem. I have two data frames, each with a market name column. Unfortunately the names vary by a few characters/symbols between the two columns, e.g. Albany.Schenectady.Troy = ALBANY, Boston.Manchester = BOSTON.
I want to standardize the market names in both data frames so I can perform merge operations later.
I thought of tackling the problem in two steps:
1) Create a vector of the unique market names from both tables and use it to create a lookup table. Something that looks like:
Table 1 Markets > "Albany.Schenectady.Troy" , "Albuquerque.Santa.Fe", "Atlanta" . . . .
Table2 Markets > "SPOKANE" , "BOSTON" . . .
I tried marketnamesvector <- paste(unique(Table1$Market, sep = "", collapse = ",")) but that doesn't produce the desired output.
2) Change Market names in Table 2 to equivalent market names in Table 1. For any market name not available in Table 1, Table 2 should retain the same value in market name.
I know I could use a looping function like the one below, but I think I still need a lookup table.
replacefunc <- function(data, oldvalue, newvalue) {
  newdata <- data
  for (i in unique(oldvalue)) newdata[data == i] <- newvalue[oldvalue == i]
  newdata
}
Table 1: This table is 90 rows x 2 columns and has 90 unique market names.
Market Leads Investment Leads1 Leads2 Leads3
1 Albany.Schenectady.Troy NA NA NA NA NA
2 Albuquerque.Santa.Fe NA NA NA NA NA
3 Atlanta NA NA NA NA NA
4 Austin NA NA NA NA NA
5 Baltimore NA NA NA NA NA
Table 2 : This table is 150K rows x 20 columns and has 89 unique market names.
> df
Spot.ID Date Hour Time Local.Date Broadcast.Week Local.Hour Local.Time Market
2 13072765 6/30/14 0 12:40 AM 2014-06-29 1 21 9:40 PM SPOKANE
261 13072946 6/30/14 5 5:49 AM 2014-06-30 1 5 5:49 AM BOSTON
356 13081398 6/30/14 10 10:52 AM 2014-06-30 1 7 7:52 AM SPOKANE
389 13082306 6/30/14 11 11:25 AM 2014-06-30 1 8 8:25 AM SPOKANE
438 13082121 6/30/14 8 8:58 AM 2014-06-30 1 8 8:58 AM BOSTON
469 13081040 6/30/14 9 9:17 AM 2014-06-30 1 9 9:17 AM ALBANY
482 13080104 6/30/14 12 12:25 PM 2014-06-30 1 9 9:25 AM SPOKANE
501 13082120 6/30/14 9 9:36 AM 2014-06-30 1 9 9:36 AM BOSTON
617 13080490 6/30/14 13 1:23 PM 2014-06-30 1 10 10:23 AM SPOKANE
Assume that the data is in data frames df1, df2. The goal is to adjust the market names to be the same, they are currently slightly different.
First, list the markets: use the following commands to list the unique names in df1 and df2.
mk1 <- sort(unique(df1$market))
mk2 <- sort(unique(df2$market))
dmk12 <- setdiff(mk1,mk2)
dmk21 <- setdiff(mk2,mk1)
Use dmk12 and dmk21 to identify the differing markets. Decide which names to use and how they match up; say df1 has "Atlanta, GA" where df2 has "Atlanta", and we standardize on the df1 spelling. Then use
df2[df2$market=="Atlanta","market"] = "Atlanta, GA"
The format is
df_to_change[df_to_change[,"column"]=="old data", "column"] = "new data"
If you only have 90 names to correct, I would write out 90 change lines like the one above.
After adjusting all the names, take sort(unique(...)) of each market column again and use setdiff twice to confirm all the names are the same.
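If writing 90 individual change lines feels tedious, a named-vector lookup does the same renaming in bulk; a sketch with hypothetical name pairs (the real pairs come from inspecting dmk12 and dmk21):
# names are the df2 spellings, values the df1 spellings to standardize on
lookup <- c("ATLANTA" = "Atlanta, GA",
            "ALBANY"  = "Albany.Schenectady.Troy",
            "BOSTON"  = "Boston.Manchester")
hit <- df2$market %in% names(lookup)         # rows with a known df2 spelling
df2$market[hit] <- lookup[df2$market[hit]]   # unmatched names stay unchanged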
