sink a data frame to .txt file - r

I have a 4-column data frame named as mytable with hundreds of rows.
It looks like
id name count rate
234 uert e#3 erwafrw23 weq 34 2
324 awrt%rw-fref-sfr-32 eq 78 4
329 jiowerfhguy qwhrb 90 8
123 234huib|f|wer fwfqwasgre 54 3
So, as it shows, the name column has spaces and special characters, which is why I can't use write.table to save the data frame.
I tried
sink('myfile.txt')
print(mytable,right=F)
sink()
But I ran into a problem: sometimes the name is so long that the four columns can't fit together on the same line, i.e. the third or fourth column may wrap to the next line.
Is there any way to adjust the width of a table sinked to a .txt file? Or, besides sink(), is there any other code to save a data frame to a .txt file? Thanks.

Seems like write.table() should be OK. Just specify a separator, like ",", or something else not appearing in your name column:
my.df <- data.frame(ID=c(234,324,329,123),
name = c("uert e#3 erwafrw23 weq"," awrt%rw-fref-sfr-32 eq","jiowerfhguy qwhrb","234huib|f|wer fwfqwasgre"),
count = c(34,78,90,54), rate = c(2,4,8,3))
write.table(my.df, file = "my.df.txt", sep = ",", col.names = colnames(my.df))
# read it back in
my.df2 <- read.table(file = "my.df.txt",sep = ",", header = TRUE, stringsAsFactors = FALSE)
all(my.df == my.df2)
TRUE

You seem confused about the difference between a file and the console output. There is no limitation to the width of lines with write.table, at least not one you will approach in normal use. You can control the console screen width with options(width=72) and use capture.output(print(mytable)) so the output meets whatever unstated width requirements you might have.
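A minimal sketch of that approach, using a small stand-in data frame (built from the question's sample rows) and an assumed output file name:

```r
# Stand-in for the poster's data frame
mytable <- data.frame(id    = c(234, 329),
                      name  = c("uert e#3 erwafrw23 weq", "jiowerfhguy qwhrb"),
                      count = c(34, 90),
                      rate  = c(2, 8))

options(width = 200)   # raise the console line width so rows don't wrap
out <- capture.output(print(mytable, right = FALSE))
writeLines(out, "myfile.txt")   # one header line plus one line per row
```

Unlike sink(), this keeps the captured lines in a variable first, so you can check the width before writing.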


Decimal read in does not change

I am trying to read in a .csv file with, for example, a column like this:
These values are meant to represent thousands of hours, not two or three hours and so on.
When I try to change the read-in options through
read.csv(file, sep = ";", dec = ".") nothing changes. It doesn't matter whether I set dec = "." or dec = ","; it always keeps the numbers shown above.
You can use the following code:
library(readr)
df <- read_csv('data.csv', locale = locale(grouping_mark = "."))
df
Output:
# A tibble: 4 × 1
  `X-ray`
    <dbl>
1    2771
2    3783
3    1267
4    7798
As you can see, the values are now parsed as thousands.
An elegant way (in my opinion) is to create a new class, which you then use in the reading process.
This way, you stay flexible when your data is (really) messed up and the decimal/thousand separator is not equal over all (numeric) columns.
# Define a new class of numbers
setClass("newNumbers")
# Define substitution of dots to nothing
setAs("character", "newNumbers", function(from) as.numeric(gsub("\\.", "", from)))
# Now read
str(data.table::fread( "test \n 1.235 \n 1.265", colClasses = "newNumbers"))
# Classes ‘data.table’ and 'data.frame': 2 obs. of 1 variable:
# $ test: num 1235 1265
The solution proposed by Quinten will work; however, it's worth adding that the function designed to process numbers with a grouping mark is col_number().
with(asNamespace("readr"),
     read_delim(
       I("X-ray hours\n---\n2.771\n3.778\n3,21\n"),
       delim = ";",
       col_names = c("x_ray_hours"),
       col_types = cols(x_ray_hours = col_number()),
       na = c("---"),
       skip = 1
     ))
There is no need to define a specific locale to handle this case alone. Moreover, a locale setting applies to the whole dataset, while the intention here is to handle only that specific column. From the docs:
?readr::parse_number
This drops any non-numeric characters before or after the first number.
Also, if the columns use ";" as a separator, read_delim is more appropriate.
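For a single character vector, readr::parse_number() can also be applied directly; a small sketch with invented values (note that readr flips the decimal mark to "," automatically when the grouping mark is set to "."):

```r
library(readr)

# "." is used here as a thousands separator, "---" as a missing-value marker
x <- c("2.771", "3.783", "---")
hrs <- parse_number(x, locale = locale(grouping_mark = "."), na = "---")
hrs   # 2771 3783 NA
```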

Struggling to use read_tsv() in place of read.csv()

ANSWERED: Thank you so much Bob; ffs, the issue was not specifying comment = '#'. Why this works, when 'skip' should've skipped the offending lines, remains a mystery. Also see Gray's comment re: Excel's 'Text to Columns' feature for a non-R solution.
Hey folks,
this has been a demon on my back for ages.
The data I work with is always a collection of tab-delimited .txt files, so my analyses always begin with gathering the file paths to each, feeding those into read.csv(), and binding into a df.
dat <- list.files(
  path = 'data',
  pattern = '*.txt',
  full.names = TRUE,
  recursive = TRUE
) %>%
  map_df( ~read.csv( ., sep='\t', skip=16) ) # actual data begins at line 16
This does exactly what I want, but I've been transitioning to tidyverse over the last few years.
I don't mind using utils::read.csv(); since my datasets are usually small, the speed benefit of readr wouldn't be felt. But, for consistency's sake, I'd rather use readr.
When I do the same, but sub readr::read_tsv(), i.e.,
dat <-
.... same call to list.files()
%>%
map_df( ~read_tsv( ., skip=16 ))
I always get an empty (0x0) table. But it seems to be 'reading' the data, because I get a warning print out of 'Parsed with column specification: cols()' for every column in my data.
Clearly I'm misunderstanding here, but I don't know what about it I don't understand, which has made my search for answers challenging & fruitless.
So... what am I doing wrong here?
Thanks in advance!
edit: an example snippet of (one of) my data files was requested; hope this formats well!
# KLIBS INFO
# > KLibs Commit: 11a7f8331ba14052bba91009694f06ae9e1cdd3d
#
# EXPERIMENT SETTINGS
# > Trials Per Block: 72
# > Blocks Per Experiment: 8
#
# SYSTEM INFO
# > Operating System: macOS 10.13.4
# > Python Version: 2.7.15
#
# DISPLAY INFO
# > Screen Size: 21.5" diagonal
# > Resolution: 1920x1080 # 60Hz
# > View Distance: 57 cm
PID search_type stimulus_type present_absent response rt error
3 time COLOUR present absent 5457.863881 TRUE
3 time COLOUR absent absent 5357.009108 FALSE
3 time COLOUR present present 2870.76412 FALSE
3 time COLOUR absent absent 5391.404728 FALSE
3 time COLOUR present present 2686.6131 FALSE
3 time COLOUR absent absent 5306.652878 FALSE
edit: Using Jukob's suggestion
files <- list.files(
  path = 'data',
  pattern = '*.txt',
  full.names = TRUE,
  recursive = TRUE
)

for (i in 1:length(files)) {
  print(read_tsv(files[i], skip = 16))
}
prints:
Parsed with column specification:
cols()
# A tibble: 0 x 0
... for each file
If I print files, I do get the correct list of file paths. If I remove skip=16 I get:
Parsed with column specification:
cols(
`# KLIBS INFO` = col_character()
)
Warning: 617 parsing failures.
row col expected actual file
15 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
16 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
17 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
18 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
19 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
... ... ......... .......... ........................................
See problems(...) for more details.
... for each file
FWIW I was able to solve the problem using your snippet by doing something along the following lines:
# Didn't work for me since when I copy and paste your snippet,
# the tabs become spaces, but I think in your original file
# the tabs are preserved so this should work for you
read_tsv("dat.tsv", comment = "#")
# This works for my case
read_table2("dat.tsv", comment = "#")
Didn't even need to specify skip argument!
But also, no idea why using skip and not comment will fail... :(
Could you try the following code? The value of i may give you some idea of which file has the problem.
files <- list.files(path = "path", full.names = TRUE, pattern = ".txt")
for (i in 1:length(files)) {
  print(read_tsv(files[i], skip = 16))
}

loading seer data into R

I am trying to load SEER data from ASCII files. There is only a .sas load file, which I am trying to convert into an R load command.
the .sas load file looks like this:
filename seer9 './yr1973_2015.seer9/*.TXT';
data in;
infile seer9 lrecl=362;
input
# 1 PUBCSNUM $char8. /* Patient ID */
# 9 REG $char10. /* SEER registry */
# 19 MAR_STAT $char1. /* Marital status at diagnosis */
# 20 RACE1V $char2. /* Race/ethnicity */
# 23 NHIADE $char1. /* NHIA Derived Hisp Origin */
# 24 SEX $char1. /* Sex */
I have the following code to try and replicate a similar loading process:
data <- read.table("OTHER.TXT",
col.names = c("pubcsnum", "reg", "mar_stat", "race1v", "nhaide", "sex"),
sep = c(1, 9, 19, 20, 23, 24))
If I use the sep argument I get the following error:
Error in read.table("OTHER.TXT", col.names = c("pubcsnum", "reg", "mar_stat", :
  invalid 'sep' argument
If I don't use the sep argument I get the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  line 1 did not have 133 elements
Does anyone have experience loading seer data? Does anyone have a suggestion why this isn't working?
*Of note: when I use the fill = TRUE argument, the second error (line 1 did not have 133 elements) no longer occurs, BUT the data are not correct when I evaluate the first few observations. I further confirmed this by evaluating a known variable, sex:
> summary(data$sex)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.000e+00 2.000e+00 3.020e+03 7.852e+18 9.884e+13 2.055e+20
where the actual values are 1/2, so the summary is nonsensical.
The other comments and answers point out most of this, but here's a more complete answer for your exact problem. I have heard of many people struggling with these ASCII files (including with the related, but not very simple, packages), and I wanted to answer for anyone else searching.
Fixed width files
These SEER "ASCII" files are actually fixed width text files (ASCII is an encoding standard not a file format). This means that there is no delimiter character (e.g. "," or "\t") that separates the fields (in a .csv or .tsv).
Instead, each field is defined by a start and end position in the line (sometimes a start position and the field width/length). This is what we see the in .sas file that you summarize:
input
# 1 PUBCSNUM $char8. /* Patient ID */
# 9 REG $char10. /* SEER registry */
...
What does this mean?
the first field, Patient ID, starts at position 1 and has a length of 8 (from $char8., similar to precision in SQL schemas etc.), which means it ends at position 8.
the second field, SEER registry ID, starts at position 9 (1 + 8 from the previous field) and has a length of 10 (again, from $char10.), which means it ends at position 18.
etc.
The # numbers consistently increase, so the fields don't overlap.
Reading fixed width files
I find the readr::read_fwf() function nice and simple, mostly because it has a couple of helper functions, namely fwf_positions(), which tells it how to define each field by start and end (or by widths, with fwf_widths()).
So, to read just these two fields from the file we can do:
read_fwf(<file>, fwf_positions(start=c(1, 9), end=c(8, 18), col_names=c("patient_id", "registry_id")))
Where col_names is only there to rename the columns.
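A self-contained sketch of the same call, using two invented lines of data so it doesn't depend on the actual SEER files: each line has an 8-character ID followed by a 10-character registry field.

```r
library(readr)

# Two made-up fixed-width records: positions 1-8 = ID, 9-18 = registry
lines <- c("00000001SEER000001",
           "00000002SEER000002")

df <- read_fwf(I(paste(lines, collapse = "\n")),
               fwf_positions(start = c(1, 9), end = c(8, 18),
                             col_names = c("patient_id", "registry_id")))
df
```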
Helper script.
I have struggled with these before so I actually wrote some code that reads that .sas file and extracts the start positions, widths, column names and descriptions.
Here is the entire thing, just replace the file name:
## Script to read the SEER file dictionary and use it to read SEER ASCII data files.
library(tidyverse)
library(stringr)
#### Reading the file dictionary ----
## https://seer.cancer.gov/manuals/read.seer.research.nov2017.sas
sas.raw <- read_lines("https://seer.cancer.gov/manuals/read.seer.research.nov2017.sas")
sas.df <- tibble(raw = sas.raw) %>%
## remove first few rows by insisting an # that defines the start index of that field
filter(str_detect(raw, "#")) %>%
## extract out the start, width and column name+description fields
mutate(start = str_replace(str_extract(raw, "# [[:digit:]]{1,3}"), "# ", ""),
width = str_replace(str_extract(raw, "\\$char[[:digit:]]{1,2}"), "\\$char", ""),
col_name = str_extract(raw, "[[:upper:]]+[[:upper:][:digit:][:punct:]]+"),
col_desc = str_trim(str_replace(str_replace(str_extract(raw, "\\/\\*.+\\*\\/"), "\\/\\*", ""), "\\*\\/", "" )) ) %>%
## coerce to integers (mutate_at()/funs() is defunct in current dplyr)
mutate(across(c(start, width), as.integer)) %>%
## calculate the end position
mutate(end = start + width - 1)
column_mapping <- sas.df %>%
select(col_name, col_desc)
#### read the file with the start+end positions----
## CHANGE THIS LINE
file_path = "data/test_COLRECT.txt"
## read the file with the fixed width positions
data.df <- read_fwf(file_path,
fwf_positions(sas.df$start, sas.df$end, sas.df$col_name))
## result is a tibble
Hope that helps!
Fixed-width files such as those described by that .sas file are read with the read.fwf function in the utils package (attached by default). I'm afraid that nicely formatted webpage hosted by Princeton is simply wrong about how to use read.table for this purpose. There are really no separators, just positions. In the case in point you could have used (assuming you have a directory named "yr1973_2015.seer9" in your working directory):
library(utils) #not really needed, just correcting my faulty memory
inputdf <- read.fwf( "yr1973_2015.seer9/OTHER.TXT",
    widths = c(8, 10, 1, 2, -1, 1, 1), # field widths from the .sas file; -1 skips the unused column 22
    col.names = c("pubcsnum", "reg", "mar_stat", "race1v", "nhiade", "sex"))
You would lose most of the information, since the lrecl value tells us there are 362 characters per line, but this would be a good test case and you could then switch to the SAScii functions.... and thanks to @AnthonyDamico:
packageDescription("SAScii")
#---------------
Package: SAScii
Type: Package
Title: Import ASCII files directly into R using only a SAS
input script
Version: 1.0
Date: 2012-08-18
Authors@R: person( "Anthony Joseph" , "Damico" , role = c(
"aut" , "cre" ) , email = "ajdamico@gmail.com" )
Description: Using any importation code designed for SAS
users to read ASCII files into sas7bdat files, the
SAScii package parses through the INPUT block of a
(.sas) syntax file to design the parameters needed
for a read.fwf function call. This allows the user
to specify the location of the ASCII (often a .dat)
file and the location of the .sas syntax file, and
then load the data frame directly into R in just one
step.
License: GPL (>= 2)
URL: https://github.com/ajdamico/SAScii
Depends: R (>= 2.14)
LazyLoad: Yes
Packaged: 2012-08-17 08:35:18 UTC; AnthonyD
Author: Anthony Joseph Damico [aut, cre]
Maintainer: Anthony Joseph Damico <ajdamico@gmail.com>
Repository: CRAN
Date/Publication: 2012-08-17 10:55:15
Built: R 3.4.0; ; 2017-04-20 18:55:31 UTC; unix
-- File: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/SAScii/Meta/package.rds
I wasn't absolutely sure that the trailing info would be effectively ignored on those long lines, but I checked with this slight modification of the first example on the ?read.fwf page:
> ff <- tempfile()
> cat(file = ff, "12345689", "98765489", sep = "\n")
> read.fwf(ff, widths = c(1,2,3))
V1 V2 V3
1 1 23 456
2 9 87 654
> unlink(ff)
I checked my memory that using Anthony's name as a search term could be helpful, and found that his website has been updated. Check out:
http://asdfree.com/surveillance-epidemiology-and-end-results-seer.html

How do I import XLSX files into R Dataframe While Maintaining Carriage Returns and Line Breaks?

I want to ingest all files in the working directory and scan all rows for line breaks or carriage returns. Instead of eliminating them, I'd like to divert them into a new output file for manual review. Here's what I have so far:
library(plyr)
library(dplyr)
library(readxl)
filenames <- list.files(pattern = "Sara Lee.*\\.xlsx$", ignore.case = TRUE)
read_excel_filename <- function(filename){
ret <- read_excel(filename, col_names = TRUE, skip = 5, trim_ws = FALSE)
ret
}
import.list <- ldply(filenames, read_excel_filename)
returnornewline <- import.list[((import.list$"CUSTOMER SEGMENT")=="[\r\n]"|(import.list$"SECTOR NAME")=="[\r\n]"|
(import.list$"LOCATION NAME")=="[\r\n]"|(import.list$"LOCATION ID")=="[\r\n]"|
(import.list$"ADDRESS")=="[\r\n]"|(import.list$"CITY")=="[\r\n]"|
(import.list$"STATE")=="[\r\n]"|(import.list$"ZIP CODE")=="[\r\n]"|
(import.list$"DISTRIBUTOR NAME")=="[\r\n]"|(import.list$"REDISTRIBUTOR NAME")=="[\r\n]"|
(import.list$"TRANS DATE")=="[\r\n]"|(import.list$"DIST. INVOICE")=="[\r\n]"|
(import.list$"ITEM MIN")=="[\r\n]"|(import.list$"ITEM LABEL")=="[\r\n]"|
(import.list$"ITEM DESC")=="[\r\n]"|(import.list$"PACK SIZE")=="[\r\n]"|
(import.list$"REBATEABLE UOM")=="[\r\n]"|(import.list$"QUANTITY")=="[\r\n]"|
(import.list$"SALES VOLUME")=="[\r\n]"|(import.list$"X__1")=="[\r\n]"|
(import.list$"X__2")=="[\r\n]"|(import.list$"X__3")=="[\r\n]"|
(import.list$"VA PER")=="[\r\n]"|(import.list$"VA PER CODE")=="[\r\n]"|
(import.list$"TOTAL REBATE")=="[\r\n]"|(import.list$"TOTAL ADMIN FEE")=="[\r\n]"|
(import.list$"TOTAL INVOICED")=="[\r\n]"|(import.list$"STD VA PER")=="[\r\n]"|
(import.list$"STD VA PER CODE")=="[\r\n]"|(import.list$"EXC TYPE CODE")=="[\r\n]"|
(import.list$"EXC EXC VA PER")=="[\r\n]"|(import.list$"EXC VA PER CODE")=="[\r\n]"), ]
now <- Sys.time()
carriage_return_file_name <- paste(format(now,"%Y%m%d"),"ROWS with Carriage Returns or New Lines.csv",sep="_")
write.csv(returnornewline, carriage_return_file_name, row.names = FALSE)
Here's some sample data:
Customer Segment Address
BuyFood 123 Main St.\r
BigKetchup 679 Smith Dr.\r
DownUnderMeat 410 Crocodile Way
BuyFood 123 Main St.
I thought the trim_ws = FALSE argument would handle this, but it hasn't.
Apologies for the column spam, I've yet to figure out an easier way to scan all the columns without listing them. Any help on that issue is appreciated as well.
EDIT: Added some sample data. I don't know how to show a carriage return in the address other than the regex of it. It doesn't look like that in the real sample data, that's just for our use here. Please let me know if that's not clear. The desired output would take the first 2 rows of data where there's a carriage return and output it to the csv file listed at the end of the code block.
EDIT 2: I used the code provided in the suggestion in place of the original long list of columns, as follows. However, this doesn't give me a new variable containing a data frame of rows with new lines or carriage returns. When I look at my global environment in RStudio, I see another variable under Data called "returnornewline", but it shows as a large list, unlike the import.list variable, which shows a data frame. This shouldn't be the case, because I've only added a carriage return in the first row of the first spreadsheet of the data, so the list should not be so large:
returnornewline <- lapply(import.list, function(x) lapply(x, function(s) grep("\r", s)))
# (the original column-by-column comparison, identical to the code above, left commented out)
EDIT 3: I need to be able to take all rows in the newly created data frame "import.list" and scan them for any instances of carriage returns or new lines within all the rows. The example above is rudimentary, but the concept stands. In the example, I'd expect for the script to read the first two rows and say "hey, these rows have carriage returns, add this to the variable assigned to this line of code and at the end of the script output this data to a csv." The remaining two rows in the sample data above have no need to be output because they have no carriage returns in their data.
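One way to scan every column without listing each one is dplyr's if_any() helper (available in dplyr >= 1.0.0). A hedged sketch, with import.list rebuilt below as a small stand-in from the sample data:

```r
library(dplyr)

# Stand-in for import.list, built from the question's sample data
import.list <- data.frame(
  "Customer Segment" = c("BuyFood", "BigKetchup", "DownUnderMeat", "BuyFood"),
  Address = c("123 Main St.\r", "679 Smith Dr.\r",
              "410 Crocodile Way", "123 Main St."),
  check.names = FALSE)

# Keep any row where any column contains a carriage return or newline;
# grepl() tests the pattern rather than comparing it literally with ==
returnornewline <- import.list %>%
  filter(if_any(everything(), ~ grepl("[\r\n]", .x)))

nrow(returnornewline)   # the two rows with a trailing "\r"
```

Note that the original `== "[\r\n]"` comparison tests whether a cell literally equals the five characters [\r\n]; grepl() treats it as the intended regular expression.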

fread - multiple separators in a string

I'm trying to read a table using fread.
The .txt file has text which looks like:
"No","Comment","Type"
"0","he said:"wonderful|"","A"
"1","Pr/ "d/s". "a", n) ","B"
The R code I'm using is dataset0 <- fread("data/test.txt", stringsAsFactors = F), with the development version of the data.table package.
I expected to see a dataset with three columns; however:
Error in fread(input = "data/stackoverflow.txt", stringsAsFactors = FALSE) :
Line 3 starting <<"1","Pr/ ">> has more than the expected 3 fields.
Separator 3 occurs at position 26 which is character 6 of the last field: << n) ","B">>.
Consider setting 'comment.char=' if there is a trailing comment to be ignored.
How to solve it?
The development version of data.table handles files like this where the embedded quotes have not been escaped. See point 10 on the wiki page.
I just tested it on your input and it works.
$ more unescaped.txt
"No","Comment","Type"
"0","he said:"wonderful."","A"
"1","The problem is: reading table, and also "a problem, yes." keep going on.","A"
> DT = fread("unescaped.txt")
> DT
No Comment Type
1: 0 he said:"wonderful." A
2: 1 The problem is: reading table, and also "a problem, yes." keep going on. A
> ncol(DT)
[1] 3
Use readLines to read line by line, then replace delimiter and read.table:
# read with no sep
x <- readLines("test.txt")
# introduce new sep - "|"
x <- gsub("\",\"", "\"|\"", x)
# read with new sep
read.table(text = x, sep = "|", header = TRUE)
# No Comment Type
# 1 0 he said:"wonderful." A
# 2 1 The problem is: reading table, and also "a problem, yes." keep going on. A
