Loading SEER data into R

I am trying to load SEER data from ASCII files. There is only a .sas load file which I am trying to convert into an R load command.
the .sas load file looks like this:
filename seer9 './yr1973_2015.seer9/*.TXT';
data in;
infile seer9 lrecl=362;
input
@ 1   PUBCSNUM  $char8.   /* Patient ID */
@ 9   REG       $char10.  /* SEER registry */
@ 19  MAR_STAT  $char1.   /* Marital status at diagnosis */
@ 20  RACE1V    $char2.   /* Race/ethnicity */
@ 23  NHIADE    $char1.   /* NHIA Derived Hisp Origin */
@ 24  SEX       $char1.   /* Sex */
I have the following code to try and replicate a similar loading process:
data <- read.table("OTHER.TXT",
                   col.names = c("pubcsnum", "reg", "mar_stat", "race1v", "nhaide", "sex"),
                   sep = c(1, 9, 19, 20, 23, 24))
If I use the sep argument I get the following error:
Error in read.table("OTHER.TXT", col.names = c("pubcsnum", "reg", "mar_stat",
:invalid 'sep' argument
If I don't use the sep argument I get the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
:
line 1 did not have 133 elements
Does anyone have experience loading SEER data? Does anyone have a suggestion as to why this isn't working?
Of note: when I use the fill = TRUE argument, the second error (line 1 did not have 133 elements) doesn't occur anymore, BUT the data are not correct when I evaluate the first few observations. I further confirmed this by evaluating a known variable, sex:
> summary(data$sex)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000e+00 2.000e+00 3.020e+03 7.852e+18 9.884e+13 2.055e+20
where the only valid values are 1 and 2, so the summary is nonsensical.

The other comments and answers point out most of this, but here's a more complete answer for your exact problem. I have heard of many people struggling with these ASCII files (and with the related, not always simple, packages), so I wanted to write this up for anyone else searching.
Fixed width files
These SEER "ASCII" files are actually fixed width text files (ASCII is an encoding standard, not a file format). This means that there is no delimiter character (e.g. "," or "\t") that separates the fields, as in a .csv or .tsv.
Instead, each field is defined by a start and end position in the line (or sometimes a start position and the field width/length). This is what we see in the .sas file that you summarize:
input
@ 1   PUBCSNUM  $char8.   /* Patient ID */
@ 9   REG       $char10.  /* SEER registry */
...
What does this mean?
the first field, Patient ID, starts at position 1 and has a length of 8 (from $char8., similar to precision in SQL schemas etc.), which means it ends at position 8.
the second field, SEER registry ID, starts at position 9 (1 + 8 from the previous field) and has a length of 10 (again from $char10.), which means it ends at position 18.
etc.
Note that the @ pointer values consistently increase, so the fields don't overlap.
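In base-R terms (a hypothetical 19-character record, just to make the arithmetic concrete):

```r
# one made-up record: 8-char patient ID, 10-char registry, 1-char marital status
line <- "01234567REGISTRY012"

pubcsnum <- substr(line, 1, 8)    # $char8.  -> positions 1-8
reg      <- substr(line, 9, 18)   # $char10. -> positions 9-18
mar_stat <- substr(line, 19, 19)  # $char1.  -> position 19

c(pubcsnum, reg, mar_stat)  # -> "01234567", "REGISTRY01", "2"
```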
Reading fixed width files
I find the readr::read_fwf() function nice and simple, mostly because it has a couple of helper functions, namely fwf_positions(), which tells it how to define each field by start and end (or by widths, with fwf_widths()).
So, to read just these two fields from the file we can do:
read_fwf(<file>, fwf_positions(start=c(1, 9), end=c(8, 18), col_names=c("patient_id", "registry_id")))
Where col_names is only there to rename the columns.
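As a self-contained sketch (two made-up 18-character records; col_types is pinned to character here so leading zeros survive — readr would otherwise guess a numeric type):

```r
library(readr)

# write a tiny fake fixed-width file
tmp <- tempfile(fileext = ".txt")
writeLines(c("00000001REGISTRY01",
             "00000002REGISTRY02"), tmp)

df <- read_fwf(tmp,
               fwf_positions(start = c(1, 9), end = c(8, 18),
                             col_names = c("patient_id", "registry_id")),
               col_types = cols(.default = col_character()))
df$patient_id
# [1] "00000001" "00000002"
```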
Helper script.
I have struggled with these before so I actually wrote some code that reads that .sas file and extracts the start positions, widths, column names and descriptions.
Here is the entire thing, just replace the file name:
## Script to read the SEER file dictionary and use it to read SEER ASCII data files.
library(tidyverse)
library(stringr)

#### Reading the file dictionary ----
## https://seer.cancer.gov/manuals/read.seer.research.nov2017.sas
sas.raw <- read_lines("https://seer.cancer.gov/manuals/read.seer.research.nov2017.sas")

sas.df <- tibble(raw = sas.raw) %>%
  ## remove the first few rows by insisting on an @ that defines the start index of that field
  filter(str_detect(raw, "@")) %>%
  ## extract the start, width and column name + description fields
  mutate(start    = str_replace(str_extract(raw, "@ [[:digit:]]{1,3}"), "@ ", ""),
         width    = str_replace(str_extract(raw, "\\$char[[:digit:]]{1,2}"), "\\$char", ""),
         col_name = str_extract(raw, "[[:upper:]]+[[:upper:][:digit:][:punct:]]+"),
         col_desc = str_trim(str_replace(str_replace(str_extract(raw, "\\/\\*.+\\*\\/"), "\\/\\*", ""), "\\*\\/", ""))) %>%
  ## coerce to integers
  mutate_at(vars(start, width), list(as.integer)) %>%
  ## calculate the end position
  mutate(end = start + width - 1)

column_mapping <- sas.df %>%
  select(col_name, col_desc)

#### Read the file with the start + end positions ----
## CHANGE THIS LINE
file_path <- "data/test_COLRECT.txt"

## read the file with the fixed width positions
data.df <- read_fwf(file_path,
                    fwf_positions(sas.df$start, sas.df$end, sas.df$col_name))
## result is a tibble
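To see what those regexes are doing, here is the same extraction in base R on a single (made-up) dictionary line:

```r
# a sample line in the style of the SEER .sas dictionary
sas_line <- "  @ 9        REG                  $char10.            /* SEER registry */"

start <- as.integer(sub("^\\s*@ ([0-9]+).*", "\\1", sas_line))     # the "@ n" pointer
width <- as.integer(sub(".*\\$char([0-9]+)\\..*", "\\1", sas_line)) # the $charN. informat
end   <- start + width - 1L

c(start = start, width = width, end = end)  # -> start 9, width 10, end 18
```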
Hope that helps!

Fixed width files such as those described by that .sas file are read with the read.fwf function, which lives in the utils package (attached by default), not in foreign as I first remembered. I'm afraid that nicely formatted webpage hosted by Princeton is simply wrong about how to use read.table for this purpose. There are really no separators, just positions. Note that read.fwf's widths argument takes field widths (a negative value skips unused columns), not the start positions from the .sas file. In the case in point you could have used (assuming you have a directory named "yr1973_2015.seer9" in your working directory):
library(utils) # not really needed, utils is attached by default
inputdf <- read.fwf("yr1973_2015.seer9/OTHER.TXT",
                    widths = c(8, 10, 1, 2, -1, 1, 1), # field widths; -1 skips the unused position 22
                    col.names = c("pubcsnum", "reg", "mar_stat", "race1v", "nhaide", "sex"))
You would lose most of the information, since the lrecl value tells us there are 362 characters per line, but this would be a good test case and you could then switch to the SAScii functions... and thanks to @AnthonyDamico:
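On the widths arithmetic: those field widths (with negative skips for gaps) can be derived mechanically from the @ pointers and $charN. lengths. A base-R sketch for the six fields in the question:

```r
start <- c(1, 9, 19, 20, 23, 24)  # "@ n" pointers from the .sas file
len   <- c(8, 10, 1, 2, 1, 1)     # $charN. lengths

# gap between the end of each field and the start of the next
gap <- c(start[-1], max(start + len)) - (start + len)

# interleave field widths with negative skips for the gaps, drop zero entries
widths <- as.vector(rbind(len, -gap))
widths <- widths[widths != 0]
widths  # -> 8 10 1 2 -1 1 1 (position 22 is unused, hence the -1)
```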
packageDescription("SAScii")
#---------------
Package: SAScii
Type: Package
Title: Import ASCII files directly into R using only a SAS
input script
Version: 1.0
Date: 2012-08-18
Authors@R: person( "Anthony Joseph" , "Damico" , role = c(
"aut" , "cre" ) , email = "ajdamico@gmail.com" )
Description: Using any importation code designed for SAS
users to read ASCII files into sas7bdat files, the
SAScii package parses through the INPUT block of a
(.sas) syntax file to design the parameters needed
for a read.fwf function call. This allows the user
to specify the location of the ASCII (often a .dat)
file and the location of the .sas syntax file, and
then load the data frame directly into R in just one
step.
License: GPL (>= 2)
URL: https://github.com/ajdamico/SAScii
Depends: R (>= 2.14)
LazyLoad: Yes
Packaged: 2012-08-17 08:35:18 UTC; AnthonyD
Author: Anthony Joseph Damico [aut, cre]
Maintainer: Anthony Joseph Damico <ajdamico@gmail.com>
Repository: CRAN
Date/Publication: 2012-08-17 10:55:15
Built: R 3.4.0; ; 2017-04-20 18:55:31 UTC; unix
-- File: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/SAScii/Meta/package.rds
I wasn't absolutely sure that the trailing info would be effectively ignored on those long lines, but I checked with this slight modification of the first example on the ?read.fwf page:
> ff <- tempfile()
> cat(file = ff, "12345689", "98765489", sep = "\n")
> read.fwf(ff, widths = c(1,2,3))
V1 V2 V3
1 1 23 456
2 9 87 654
> unlink(ff)
I checked my memory that using Anthony's name as a search term could be helpful and found that his website has been updated. Check out:
http://asdfree.com/surveillance-epidemiology-and-end-results-seer.html

Related

Decimal read in does not change

I am trying to read in a .csv file with, for example, a column like this:
These values are meant to represent thousands of hours, not two or three hours and so on.
When I try to change the read-in options through
read.csv(file, sep = ";", dec = ".")
nothing changes. It doesn't matter whether I set dec = "." or dec = ",", it always keeps the numbers as above.
You can use the following code:
library(readr)
df <- read_csv('data.csv', locale = locale(grouping_mark = "."))
df
Output:
# A tibble: 4 × 1
`X-ray`
<dbl>
1 2771
2 3783
3 1267
4 7798
As you can see, the values are now read as thousands.
An elegant way (in my opinion) is to create a new class, which you then use in the reading process.
This way, you stay flexible when your data is (really) messed up and the decimal/thousand separator is not consistent across all (numeric) columns.
# Define a new class of numbers
setClass("newNumbers")
# Define substitution of dots to nothing
setAs("character", "newNumbers", function(from) as.numeric(gsub("\\.", "", from)))
# Now read
str(data.table::fread( "test \n 1.235 \n 1.265", colClasses = "newNumbers"))
# Classes ‘data.table’ and 'data.frame': 2 obs. of 1 variable:
# $ test: num 1235 1265
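The same custom-class trick works with base read.csv() as well, since read.table's colClasses falls back on an as() coercion method for unknown classes. A small self-contained check (the newNumbers class is re-declared so the snippet stands alone):

```r
library(methods)  # for setClass/setAs; attached by default in most sessions

# define substitution of dots to nothing, as above
setClass("newNumbers")
setAs("character", "newNumbers",
      function(from) as.numeric(gsub("\\.", "", from)))

# read.csv applies the as(..., "newNumbers") coercion per column
df <- read.csv(text = "test\n1.235\n1.265", colClasses = "newNumbers")
df$test
# [1] 1235 1265
```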
The solution proposed by Quinten will work; however, it's worth adding that the function designed to process numbers with a grouping mark is col_number().
with(asNamespace("readr"),
read_delim(
I("X-ray hours\n---\n2.771\n3.778\n3,21\n"),
delim = ";",
col_names = c("x_ray_hours"),
col_types = cols(x_ray_hours = col_number()),
na = c("---"),
skip = 1
))
There is no need to define a specific locale to handle this case. Also, a locale setting applies to the whole file, while the intention here is to handle only that one column. From the docs:
?readr::parse_number
This drops any non-numeric characters before or after the first number.
Also if the columns use ; as a separator, read_delim is more appropriate.
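For a single column, readr::parse_number() can also be applied directly to a character vector after reading (a minimal sketch, assuming readr is installed; note that locale() switches the decimal mark to "," when the grouping mark is set to "."):

```r
library(readr)

# "." as grouping mark: "2.771" is read as 2771
parse_number(c("2.771", "3.778"), locale = locale(grouping_mark = "."))
# [1] 2771 3778
```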

How to read a txt file that contains different tables in it

I have to collect data in R, that has been given to me in a xls format, but when I open it with Excel it says that the extension and the format don't match, the file suggests I should save it as a .txt file.
The file I have to use typically contains 3 sections, with different tables in them, which have different sizes and column names. The sections are announced by a title between square brackets. This is a simplified version of my file.
I am only interested in the third section, called '[DATA]'. So far I have manually saved it as an xlsx file and worked my way to use the data I was interested in using read_excel. After reading the whole sheet in R I collected the row where the title '[DATA]' was (it can vary from file to file, I can't select a row number as in readLines), then I could select the table underneath after taking the column names (T, Time, Tension etc.) as my new dataframe's column names. I'd like to be able to do something similar starting from a txt file, because I have a lot of files to work with and they are formatted exactly the same way.
I've tried several functions to read the file as a .txt, like
1) A = data.table::fread(file, header = F, fill = F, sep = '\t')
2) A = read.delim(file)
3) A = data.frame(readLines(file))
4) A = read.table(file)
with the following results:
1) It saves the first table from SETUP and stops early, with this error message: "Stopped early on line 25. Expected 24 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<Number of Duts: 24>>". If I set fill = TRUE I get the same result as 3).
2) It makes one big column of all the cells, line after line and cell by cell. It becomes difficult to rearrange the data into a table from there.
3) It makes a big column again, but each line of the file is a cell in the dataframe, and the content of the cell is a string of all the numbers separated by \t. Example for line 8: experiment1\group1\t0\t7200\t0.001\t"
4) I get this error message: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 4 did not have 2 elements
I don't know which of these functions is the best lead for this task.
NB: The numbers displayed in the error messages might be different from what I would get with the example, but I don't even get the error messages with the example. (When I made it, Excel required me to put an apostrophe in the cell so the 'minus' sign wouldn't be seen as a formula, so I did. I then saved the file as txt and xls, and even added the xls extension to the txt file to create a mismatch between extension and content, like in my original file. It works in any case.)
Thanks for your help!
You said text file and show a spreadsheet, so I'll demonstrate on a multi-table CSV file:
csvtext <- '[SETUP]
ExpName:
GroupName:
,,
Experiment,Group,Voltage
1,1,1
2,2,2
3,3,3
,,
[RESULT]
Group,Dev,V3
1,1,1
3,3,3
4,4,4
,,
[Data]
"mpg","cyl","disp"
21,6,160
21,6,160
22.8,4,108
'
Read it in as text:
# you may use something like
# rawtext <- readLines("path/to/file.csv")
rawtext <- readLines(textConnection(csvtext))
str(rawtext)
# chr [1:21] "[SETUP]" "ExpName:" "GroupName:" ",," "Experiment,Group,Voltage" "1,1,1" "2,2,2" "3,3,3" ",," "[RESULT]" ...
We can now split the data based on the "empty" lines, then drop these empty lines:
spltext <- split(rawtext, cumsum(!grepl("[^,\\s]", rawtext)))
spltext <- lapply(spltext, function(z) if (grepl("[^,\\s]", z[1])) z else z[-1])
str(spltext)
# List of 5
# $ 0: chr [1:3] "[SETUP]" "ExpName:" "GroupName:"
# $ 1: chr [1:4] "Experiment,Group,Voltage" "1,1,1" "2,2,2" "3,3,3"
# $ 2: chr [1:5] "[RESULT]" "Group,Dev,V3" "1,1,1" "3,3,3" ...
# $ 3: chr [1:5] "[Data]" "\"mpg\",\"cyl\",\"disp\"" "21,6,160" "21,6,160" ...
# $ 4: chr(0)
(Note that the $ 0 indicates that the name is "0" not 0, so we'll need to use string-numbers for indexing later.)
From here, since you only want the [Data] section, then
read.csv(text = spltext[["3"]][-1])
# mpg cyl disp
# 1 21.0 6 160
# 2 21.0 6 160
# 3 22.8 4 108
I made it work on any of my files (txt) doing this:
rawtext <- readLines(file)
# separation of the sections, with an empty line between them
spltext <- split(rawtext, cumsum(!grepl("[^,\t]", rawtext)))
# removing the cells coded by a former empty line \t
spltext <- lapply(spltext, function(z) if (grepl("[^,\t]", z[1])) z else z[-1])
# the element indexed by "3" is the one that contains the DATA table
data <- read.delim(text = spltext[["3"]][-1], header = TRUE, check.names = FALSE)  # check.names = FALSE keeps the original column titles

Struggling to use read_tsv() in place of read.csv()

ANSWERED: Thank you so much Bob, ffs the issue was not specifying comment = '#'. Why this works, when 'skip' should've skipped the offending lines, remains a mystery. Also see Gray's comment re: Excel's 'Text to Columns' feature for a non-R solution.
Hey folks,
this has been a demon on my back for ages.
The data I work with is always a collection of tab-delimited .txt files, so my analyses always begin with gathering the file paths to each, feeding those into read.csv(), and binding the results into a df.
dat <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
) %>%
map_df( ~read.csv( ., sep='\t', skip=16) ) # actual data begins at line 16
This does exactly what I want, but I've been transitioning to tidyverse over the last few years.
I don't mind using utils::read.csv(); since my datasets are usually small, the speed benefit of readr wouldn't be felt. But, for consistency's sake, I'd rather use readr.
When I do the same, but sub readr::read_tsv(), i.e.,
dat <-
.... same call to list.files()
%>%
map_df( ~read_tsv( ., skip=16 ))
I always get an empty (0x0) table. But it seems to be 'reading' the data, because I get a warning print out of 'Parsed with column specification: cols()' for every column in my data.
Clearly I'm misunderstanding here, but I don't know what about it I don't understand, which has made my search for answers challenging & fruitless.
So... what am I doing wrong here?
Thanks in advance!
edit: an example snippet of (one of) my data files was requested, hope this formats well!
# KLIBS INFO
# > KLibs Commit: 11a7f8331ba14052bba91009694f06ae9e1cdd3d
#
# EXPERIMENT SETTINGS
# > Trials Per Block: 72
# > Blocks Per Experiment: 8
#
# SYSTEM INFO
# > Operating System: macOS 10.13.4
# > Python Version: 2.7.15
#
# DISPLAY INFO
# > Screen Size: 21.5" diagonal
# > Resolution: 1920x1080 @ 60Hz
# > View Distance: 57 cm
PID search_type stimulus_type present_absent response rt error
3 time COLOUR present absent 5457.863881 TRUE
3 time COLOUR absent absent 5357.009108 FALSE
3 time COLOUR present present 2870.76412 FALSE
3 time COLOUR absent absent 5391.404728 FALSE
3 time COLOUR present present 2686.6131 FALSE
3 time COLOUR absent absent 5306.652878 FALSE
edit: Using Jukob's suggestion
files <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
)
for (i in 1:length(files)) {
print(read_tsv(files[i], skip=16))
}
prints:
Parsed with column specification:
cols()
# A tibble: 0 x 0
... for each file
If I print files, I do get the correct list of file paths. If I remove skip=16 I get:
Parsed with column specification:
cols(
`# KLIBS INFO` = col_character()
)
Warning: 617 parsing failures.
row col expected actual file
15 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
16 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
17 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
18 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
19 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
... ... ......... .......... ........................................
See problems(...) for more details.
... for each file
FWIW I was able to solve the problem using your snippet by doing something along the following lines:
# Didn't work for me since when I copy and paste your snippet,
# the tabs become spaces, but I think in your original file
# the tabs are preserved so this should work for you
read_tsv("dat.tsv", comment = "#")
# This works for my case
read_table2("dat.tsv", comment = "#")
Didn't even need to specify skip argument!
But also, no idea why using skip and not comment will fail... :(
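For anyone else hitting this, here is a minimal inline reproduction (made-up three-column snippet, assuming readr) showing that comment = "#" simply drops the metadata lines, however many there are, so the real header row is found:

```r
library(readr)

# two commented metadata lines, then a tab-separated header and one data row
txt <- "# SYSTEM INFO\n# > Python Version: 2.7.15\nPID\tsearch_type\trt\n3\ttime\t5457.86\n"

df <- read_tsv(I(txt), comment = "#")
df
```

Unlike skip = 16, this doesn't depend on the comment block having the same length in every file.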
Could you try the following code? The value of i may give you an idea of which file has the problem.
files <- list.files(path = "path", full.names = TRUE, pattern = ".txt")
for (i in 1:length(files)) {
  print(read_tsv(files[i], skip = 16))
}

fread - multiple separators in a string

I'm trying to read a table using fread.
The txt file has text which look like:
"No","Comment","Type"
"0","he said:"wonderful|"","A"
"1","Pr/ "d/s". "a", n) ","B"
The R code I'm using is dataset0 <- fread("data/test.txt", stringsAsFactors = FALSE), with the development version of the data.table package.
Expect to see a dataset with three columns; however:
Error in fread(input = "data/stackoverflow.txt", stringsAsFactors = FALSE) :
Line 3 starting <<"1","Pr/ ">> has more than the expected 3 fields.
Separator 3 occurs at position 26 which is character 6 of the last field: << n) ","B">>.
Consider setting 'comment.char=' if there is a trailing comment to be ignored.
How can I solve this?
The development version of data.table handles files like this where the embedded quotes have not been escaped. See point 10 on the wiki page.
I just tested it on your input and it works.
$ more unescaped.txt
"No","Comment","Type"
"0","he said:"wonderful."","A"
"1","The problem is: reading table, and also "a problem, yes." keep going on.","A"
> DT = fread("unescaped.txt")
> DT
No Comment Type
1: 0 he said:"wonderful." A
2: 1 The problem is: reading table, and also "a problem, yes." keep going on. A
> ncol(DT)
[1] 3
Use readLines to read the file line by line, then replace the delimiter and use read.table:
# read with no sep
x <- readLines("test.txt")
# introduce new sep - "|"
x <- gsub("\",\"", "\"|\"", x)
# read with new sep
read.table(text = x, sep = "|", header = TRUE)
# No Comment Type
# 1 0 he said:"wonderful." A
# 2 1 The problem is: reading table, and also "a problem, yes." keep going on. A

sink a data frame to .txt file

I have a 4-column data frame named as mytable with hundreds of rows.
It looks like
id name count rate
234 uert e#3 erwafrw23 weq 34 2
324 awrt%rw-fref-sfr-32 eq 78 4
329 jiowerfhguy qwhrb 90 8
123 234huib|f|wer fwfqwasgre 54 3
As it shows, the name column has spaces and special characters, so I can't use write.table to save the data.frame.
I tried
sink('myfile.txt')
print(mytable,right=F)
sink()
But I ran into a problem: sometimes the name is so long that the four columns can't fit together on the same page, i.e. the third or fourth column may run onto the next page.
Is there any way to adjust the width of the table sunk to the .txt file? Or, besides sink(), is there any other code that can be used to save a data frame to a .txt file? Thanks.
Seems like write.table() should be OK; just specify a separator, like ",", or something else not appearing in your name column:
my.df <- data.frame(ID = c(234, 324, 329, 123),
                    name = c("uert e#3 erwafrw23 weq", " awrt%rw-fref-sfr-32 eq",
                             "jiowerfhguy qwhrb", "234huib|f|wer fwfqwasgre"),
                    count = c(34, 78, 90, 54),
                    rate = c(2, 4, 8, 3))
write.table(my.df, file = "my.df.txt", sep = ",", col.names = colnames(my.df))
# read it back in
my.df2 <- read.table(file = "my.df.txt",sep = ",", header = TRUE, stringsAsFactors = FALSE)
all(my.df == my.df2)
TRUE
You seem confused about the difference between a file and the console output. There is no limitation on the width of lines with write.table, at least not one you will approach in normal use. You can control the console screen width with options(width = 72) and use capture.output(print(mytable)) so the output meets whatever unstated width requirements you might have.
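A sketch of that approach (hypothetical small table; the captured print-out is written to a temp file):

```r
mytable <- data.frame(id = c(234, 324),
                      name = c("uert e#3 erwafrw23 weq", "awrt%rw-fref-sfr-32 eq"),
                      count = c(34, 78),
                      rate = c(2, 4))

options(width = 200)  # wide enough that no column wraps to a second block
out <- capture.output(print(mytable, right = FALSE))

tf <- tempfile(fileext = ".txt")
writeLines(out, tf)
readLines(tf)  # header line plus one line per row
```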
