readr::read_csv(), empty strings as NA not working - r

I was trying to load a CSV file with readr::read_csv() in which some entries are blank. I set na = "" in read_csv(), but it still loads them as blank entries.
d1 <- read_csv("sample.csv", na = "") # want to load empty strings as NA
where sample.csv looks like the following:
Name,Age,Weight,City
Sam,13,30,
John,35,58,CA
Doe,20,50,IL
Ann,18,45,
d1 should look like the following (using read_csv()):
Name Age Weight City
1 Sam 13 30 NA
2 John 35 58 CA
3 Doe 20 50 IL
4 Ann 18 45 NA
The first and fourth rows of City should contain NA (as shown above), but in fact they show up blank.

Based on the comments and verifying myself, the solution was to upgrade to readr_0.2.2.
Thanks to fg nu, akrun and Richard Scriven
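For readers coming from other languages, the empty-string-to-NA mapping that readr performs can be sketched with Python's csv module. This is a minimal illustration, not part of the original thread; the inline data mirrors sample.csv from the question:

```python
import csv
import io

# Sample data mirroring the question's sample.csv
data = """Name,Age,Weight,City
Sam,13,30,
John,35,58,CA
Doe,20,50,IL
Ann,18,45,
"""

def read_csv_na(f, na=("",)):
    """Read CSV rows, mapping any value listed in `na` to None
    (analogous to readr's na= argument)."""
    reader = csv.DictReader(f)
    return [{k: (None if v in na else v) for k, v in row.items()}
            for row in reader]

rows = read_csv_na(io.StringIO(data))
print(rows[0]["City"])  # None for the blank City entries
```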

Related

Remove specific value in R or Linux

I have a tab-separated file with several columns, shown below. The last column contains a comma followed by one or more characters.
1 100 Japan Na pa,cd
2 120 India Ca pa,ces
5 110 Japan Ap pa,cres
1 540 China Sn pa,cd
1 111 Nepal Le pa,b
I want to keep only the part of the last column before the comma, so the file looks like:
1 100 Japan Na pa
2 120 India Ca pa
5 110 Japan Ap pa
1 540 China Sn pa
1 111 Nepal Le pa
I have looked at sed but cannot find a way to remove them.
Regards
In R you can read the file with a tab separator and remove the values after the comma:
result <- transform(read.table('file1.txt', sep = '\t'), V5 = sub(',.*', '', V5))
V5 is used assuming it is the 5th column that you want to change the value.
We can also use read.table() (note: read.tsv() is not a base R function):
df1 <- read.table('file1.txt', sep = "\t")
df1$V5 <- sub("^([^,]+),.*", "\\1", df1$V5)
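The same clean-up can be sketched outside R as well, for example in Python with the re module. This is a minimal sketch, not from the original answers; the sample lines mirror the file above:

```python
import re

lines = [
    "1\t100\tJapan\tNa\tpa,cd",
    "2\t120\tIndia\tCa\tpa,ces",
]

# Drop everything from the comma to the end of the last field,
# the same effect as R's sub(',.*', '', V5) on the fifth column
cleaned = [re.sub(r',[^\t]*$', '', line) for line in lines]
print(cleaned[0])  # 1	100	Japan	Na	pa
```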

Format function output as data frame

I use the following to sum several measures of TABR per Julian date in 5 separate years:
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
Which produces output like:
2015 2016 2017 2018 2019
33 NA NA NA 2 NA
....
80 NA 1 NA 21 NA
81 NA 47 NA 25 NA
82 NA 12 1 9 NA
But I want to convert these results into a dataframe with 6 columns Julian + 2015-2019.
I tried:
TABR_Day<-as.data.frame(TABR_YearDay)
But that does not seem to produce a fully realized data frame: there is no column for Julian, and if I want to refer to an individual variable I have to quote its name, like:
hist(TABR_Day$"2017")
Can you help me transition the function output to a dataframe with 6 viable columns?
The Julian values are present in the rownames of the result. Since column names usually shouldn't start with a number, we can prepend 'Year_' to make them 'Year_2015' etc., and then construct the final data frame.
TABR_YearDay<-with(wsmr, tapply(TABR, list(Julian, Year),sum))
colnames(TABR_YearDay) <- paste0('Year_', colnames(TABR_YearDay))
TABR_Day <- data.frame(Julian = rownames(TABR_YearDay), TABR_YearDay)

Remove rows whose cells do not match the column class

Similarly to removing NA values, I need to remove rows whose cell values do not match the column class. For the example below, I want to remove the rows for Andy, Aaron and Dorothy. Andy's Gender is 12, but it should only be "Male" or "Female". As for Aaron, Status is NA, so I would like to remove that row too. And lastly, Dorothy's age is "abc" instead of a number.
Name Age Gender Status
Tom 12 Male Married
Dom 41 Male Single
Kelvin 23 Male Married
Tim 12 Male Single
Andy 42 12 Single
Aaron 12 Male NA
Dorothy abc Female Married
Nathan 34 Male Single
Sorry for the formatting, I'm new to Stack Overflow.
Each column should have a class assigned to it, but in this case the mismatched values prevent that. The solution provided by Adam Quek was helpful:
For columns that should be numeric, e.g. dat$Age <- as.numeric(as.character(dat$Age))
For columns that should be factors, e.g. dat$Gender <- factor(dat$Gender, levels = c("Male", "Female"))
The code above converts the abnormal values to NA, and finally na.exclude(dat) should do the work.
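The same keep-only-valid-rows idea can be sketched in Python. This is a minimal illustration of the approach, not part of the original answer; the data mirrors a few rows of the table above, and the validity rules (digits-only Age, fixed Gender levels, non-missing Status) are assumptions matching the question:

```python
rows = [
    {"Name": "Tom",     "Age": "12",  "Gender": "Male",   "Status": "Married"},
    {"Name": "Andy",    "Age": "42",  "Gender": "12",     "Status": "Single"},
    {"Name": "Aaron",   "Age": "12",  "Gender": "Male",   "Status": None},
    {"Name": "Dorothy", "Age": "abc", "Gender": "Female", "Status": "Married"},
]

def is_valid(row):
    """Keep a row only if Age is numeric, Gender is a known level,
    and Status is present (mirrors coercing to NA and dropping in R)."""
    if not str(row["Age"]).isdigit():
        return False
    if row["Gender"] not in ("Male", "Female"):
        return False
    if row["Status"] is None:
        return False
    return True

kept = [r for r in rows if is_valid(r)]
print([r["Name"] for r in kept])  # ['Tom']
```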

Create weight node and edges lists from a normal dataframe in R?

I'm trying to use visNetwork to create a node diagram. However, my data is not in the correct format and I haven't been able to find any help on this on the internet.
My current data frame looks similar to this:
name town car color age school
John Bringham Swift Red 22 Brighton
Sarah Bringham Corolla Red 33 Rustal
Beth Burb Swift Blue 43 Brighton
Joe Spring Polo Black 18 Riding
I want to use this to create the nodes and edges lists needed to build a vis network.
I know that the "nodes" list will be made from the unique values in the "name" column but I'm not sure how I would use the rest of the data to create the "edges" list?
I was thinking it may be possible to group by each column and then read back the matches, but I am not sure how to implement this. The idea is to weight each edge by how many values the two rows have in common.
For example, Joe will not match with anyone because he shares no values with any of the others, while John and Sarah will have a weight of 2 because they share two values (town and color).
Also open to solutions in python!
One option is to compare row by row and count the number of common values.
For instance for John (first row) and Sarah (second row):
sum(df[1,] == df[2,])
# 2
Then use combn() from the utils package to generate every pairwise combination of names:
nodes <- as.data.frame(matrix(combn(df$name, 2), ncol = 2, byrow = TRUE))
nodes$V1 <- as.character(nodes$V1)
nodes$V2 <- as.character(nodes$V2)
nodes$weight <- NA
(nodes)
# V1 V2 weight
#1 John Sarah NA
#2 John Beth NA
#3 John Joe NA
#4 Sarah Beth NA
#5 Sarah Joe NA
#6 Beth Joe NA
Finally, a loop to calculate the weight for each pair:
for(n in 1:nrow(nodes)){
name1 <- df[df$name == nodes$V1[n],]
name2 <- df[df$name == nodes$V2[n],]
nodes$weight[n] <- sum(name1 == name2)
}
# V1 V2 weight
#1 John Sarah 2
#2 John Beth 2
#3 John Joe 0
#4 Sarah Beth 0
#5 Sarah Joe 0
#6 Beth Joe 0
This nodes data frame is really an edge list; after renaming V1/V2 to from/to, it should be the kind of data frame you can pass as the edges argument of visNetwork().
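Since the question is also open to Python solutions, here is a minimal sketch of the same pairwise-weight idea using itertools.combinations. The toy data mirrors the question's data frame; the dict layout is an assumption for illustration:

```python
from itertools import combinations

# Toy data mirroring the question's data frame: name -> remaining columns
people = {
    "John":  ["Bringham", "Swift",   "Red",   22, "Brighton"],
    "Sarah": ["Bringham", "Corolla", "Red",   33, "Rustal"],
    "Beth":  ["Burb",     "Swift",   "Blue",  43, "Brighton"],
    "Joe":   ["Spring",   "Polo",    "Black", 18, "Riding"],
}

# One edge per pair of names, weighted by the number of matching columns
edges = [
    (a, b, sum(x == y for x, y in zip(people[a], people[b])))
    for a, b in combinations(people, 2)
]
print(edges[0])  # ('John', 'Sarah', 2)
```

This produces the same weights as the R loop above (John-Sarah and John-Beth share 2 values; Joe matches nobody).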

Reading a fixed length text file in R

I currently work in SAS, but I have used R for a very long time. I have some fixed-width text files to read. They read easily in SAS, but doing the same in R has been a real struggle.
The file looks somewhat like this:
DP JAMES SILVA REY
2014
6
0
1723713652
2
0
DP ALEJANDRA NARVAEZ
2014
6
0
1723713456
6
0
DP NANYER PICHARDO
2014
6
0
1723713991
1
0
DP GABRIELA ANASI CASTILLO
2014
6
0
1723713240
3
0
It is not clear here; I have attached the file.
It reads easily in SAS using infile input.
SAS Code:
infile "filename.txt" lrecl=32767 ;
input
#001 park_cd $5.
#006 Title $15.
#021 first_name $25.
#046 middle_name $25.
#071 last_name $25.
#096 suffix $15.
#111 ADDRESS_1 $60.
#171 ADDRESS_2 $60.
#231 ADDRESS_3 $60.
#261 CITY $30.
#291 STATE_PROVINCE $2.
#293 ZIP $9.
#302 Ticket_Year $11.
#314 product_id $12.
#327 UNIT_PRICE $13.
#340 PURCHASE_DT $26.
#366 PURCHASE_QTY $12.
#378 TOTAL_PURCHASE_AMT $14. ;
run;
Now, to do the same in R, I have been trying various things.
1) First, read.fwf:
Code:
dat1=read.fwf("D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header = FALSE, sep = "\t",fill = TRUE,
skip = 0, col.names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "), fileEncoding = "ASCII")
But it returns mostly NA values, and the few values it does read are in the wrong positions.
head(dat1) gives this output:
park_cd Title first_name middle_name
1 DP JAMES SILVA
2
3 <NA>
4 <NA> <NA> <NA>
5 <NA> <NA>
6 2014 <NA> <NA>
last_name suffix
1 REY
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 <NA> <NA>
ADDRESS_1.
1
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
ADDRESS_2 ADDRESS_3 CITY
1 NA NA
2 <NA> NA NA
3 <NA> NA NA
4 <NA> NA NA
5 <NA> NA NA
6 <NA> NA NA
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA
TOTAL_PURCHASE_AMT.
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
2) Next I used the SAScii package to call the SAS code in R.
Code:
sas_imp <- "input
#001 park_cd $5.
#006 Title $15.
#021 first_name $25.
#046 middle_name $25.
#071 last_name $25.
#096 suffix $15.
#111 ADDRESS_1 $60.
#171 ADDRESS_2 $60.
#231 ADDRESS_3 $60.
#261 CITY $30.
#291 STATE_PROVINCE $2.
#293 ZIP $9.
#302 Ticket_Year $11.
#314 product_id $12.
#327 UNIT_PRICE $13.
#340 PURCHASE_DT $26.
#366 PURCHASE_QTY $12.
#378 TOTAL_PURCHASE_AMT $14. ;"
sas_imp.tf <- tempfile()
writeLines (sas_imp , con = sas_imp.tf )
parse.SAScii( sas_imp.tf )
read.SAScii( "filename.txt" , sas_imp.tf )
It too gives the same useless output as above.
3) Then I used the LaF package and the laf_open_fwf() function:
library(LaF)
data <- laf_open_fwf(filename="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
column_types=rep("character",18),
column_names=c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
column_widths=c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14))
Then I convert it like :
library(ffbase)
my.data <- laf_to_ffdf(data)
head(as.data.frame(my.data))
But it gives output:
park_cd Title first_name middle_name last_name
1 DP JAMES SILVA REY
2 \r\n \r\n
3 JANDR A NARVAEZ
4 \r\n \r \n \r\n \r\n 20
5 PICHARDO
6 \r\n \r\n \r\n \r\n 2014\r\n 6\r\n
suffix
1
2 \r\n \r\n
3
4 14\r\n
5
6 0\r\n
ADDRESS_1.
1
2 2014\r\n 6\r\n 0\r\n 172
3
4 6\r\n 0\r\n 1723713456\r\n 6\r\n
5
6 1723713991\r\n 1\r\n 0\r\nDP
ADDRESS_2 ADDRESS_3 CITY
1 \r *\003
2 3713652\r\n 2\r\n 0\r\nDP A L *\003
3 \r\n *\003
4 0\r\nDP NANYER *\003
5 \r\n *\003
6 GABRIELA ANASI *\003
STATE_PROVINCE X.ZIP Ticket_Year product_id UNIT_PRICE PURCHASE_DT PURCHASE_QTY
1 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
2 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
3 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
4 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
5 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
6 ÐÆ *\003 "ADDR ,"\001 *\003 \n <N
TOTAL_PURCHASE_AMT.
1 \001
2 \001
3 \001
4 \001
5 \001
6 \001
4) Lastly, read.table.ffdf:
library(ff)
library(stringr)
my.data1 <- read.table.ffdf(file="D:/Cedar_response/Cedar_Fair_DP_2014_haunt_results.txt",
FUN="read.fwf",
widths = c(5,15,25,25,25,15,60,60,60,30,2,9,11,12,13,26,12,14),
header=F, VERBOSE=TRUE,
col.names = c("park_cd","Title","first_name","middle_name","last_name","suffix",
"ADDRESS_1 ","ADDRESS_2","ADDRESS_3","CITY","STATE_PROVINCE",
" ZIP","Ticket_Year","product_id","UNIT_PRICE","PURCHASE_DT",
"PURCHASE_QTY","TOTAL_PURCHASE_AMT "),
fileEncoding = "UTF-8",
transFUN=function(x){
z <- sapply(x, function(y) {
y <- str_trim(y)
y[y==""] <- NA
factor(y)})
as.data.frame(z)
} )
But the result is the same.
I found that last approach on this page: http://r.789695.n4.nabble.com/read-table-ffdf-and-fixed-width-files-td4673220.html
What am I doing wrong? Am I specifying the widths incorrectly, or am I getting the idea wrong altogether?
I have worked with many things in R and just can't believe that something this easy in SAS is so hard in R. I must be missing something simple. If you have any ideas, please help. Thanks in advance.
Update
Please see here for what I use nowadays for this problem:
Faster way to read fixed-width files
for posterity, original answer retained below as a how-to guide for bootstrapping solutions while desperate 😅
Here's the FW -> .csv converter I created in Python to destroy these awful files:
It also includes a checkLength function that can help diagnose what @RobertLong suggested: that your underlying file might be faulty. If so, you may be in trouble if the damage is pervasive and unpredictable (i.e. there are no consistent mistake patterns in the file that you can ctrl+H to fix).
Please note that dictfile must be formatted correctly (I wrote this for myself, not necessarily to be as robust as possible).
import os
import csv

# Set the working directory
os.chdir('/home/michael/...')  # match the format of your OS

def checkLength(ffile):
    """
    Check that all lines in the file have the same length
    (so they don't cause any issues below).
    """
    with open(ffile, 'r') as ff:
        firstrow = 1
        troubles = 0
        for rows in ff:
            if firstrow:
                length = len(rows)
                firstrow = 0
            elif len(rows) != length:
                print(rows)
                print(len(rows))
                troubles = 1
    return troubles

def fixed2csv(infile, outfile, dictfile):
    """
    Take the name of a fixed-width data file as input and convert it
    to .csv format according to the widths and column names in dictfile.

    Parameters
    ==========
    infile: name of the input file from which fixed-width data is read
        e.g. 'fixed_width.dat'
    outfile: name of the output file to which comma-separated data is saved
        e.g. 'comma_separated.csv'
    dictfile: .csv-formatted dictionary file from which to read the following:
        * widths: field widths
        * column names: names of columns to be written to the output .csv
        * types: object types (character, integer, etc.)
        column order must be: col_names,widths,types
    """
    with open(dictfile, 'r') as dictf:
        fieldnames = ("col_names", "widths", "types")  # types used in R later
        ddict = csv.DictReader(dictf, fieldnames)
        slices = []
        colNames = []
        wwidths = []
        for rows in ddict:
            wwidths.append(int(rows['widths']))
            colNames.append(rows['col_names'])
        # Build half-open slices from the cumulative widths
        offset = 0
        for w in wwidths:
            slices.append(slice(offset, offset + w))
            offset += w
    with open(infile, 'r') as fixedf:
        with open(outfile, 'w') as csvf:
            csvfile = csv.writer(csvf)
            csvfile.writerow(colNames)
            for rows in fixedf:
                csvfile.writerow([rows[s] for s in slices])
Good luck, and curses be on whoever it is that is proliferating these FW format data files.
The file you uploaded is not a fixed-width file.
I am not a SAS user, but from looking at the SAS code in your post, the column widths in the code do not match those in the file.
Moreover, some lines are entirely blank.
It appears that there are many carriage return / line feed pairs which do not belong there; in particular, they seem to be used in places as a delimiter. There should be one CRLF at the end of each line, and that's it.
Since you say that SAS opens the file, I suggest you save it to CSV format from SAS and then open that in R. Alternatively, you could remove the superfluous CRLFs with a good text editor, leaving a single CRLF at the end of each line. Since each "real" line appears to begin with "DP", you could replace -CRLF-DP with (say) -tab-, then delete all remaining -CRLF-s, then replace all -tab-s with -CRLF- (this relies on there being no tabs already in the file).
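One quick way to sanity-check the declared widths against the actual file, in the spirit of the Python converter above, is to slice a single line and eyeball the pieces. This is a minimal sketch; the widths come from the SAS code in the question, and the padded dummy record is illustrative, not from the real file:

```python
# Field widths declared in the SAS code from the question
widths = [5, 15, 25, 25, 25, 15, 60, 60, 60, 30, 2, 9, 11, 12, 13, 26, 12, 14]

def make_slices(widths):
    """Turn a list of field widths into slice objects for one fixed-width line."""
    slices, offset = [], 0
    for w in widths:
        slices.append(slice(offset, offset + w))
        offset += w
    return slices

# A padded dummy record for illustration (real lines should be 419 chars wide)
line = "DP   JAMES          SILVA" + " " * 400
fields = [line[s].strip() for s in make_slices(widths)]
print(fields[:3])  # ['DP', 'JAMES', 'SILVA']
```

If the slices come back with values spilling across field boundaries, the declared widths (or the file itself, as noted above) are wrong.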
