Struggling to use read_tsv() in place of read.csv() - r

ANSWERED: Thank you so much Bob, ffs the issue was not specifying comment='#'. Why this works, when 'skip' should've skipped the offending lines, remains a mystery. Also see Gray's comment re: Excel's 'Text to Columns' feature for a non-R solution.
Hey folks,
this has been a demon on my back for ages.
The data I work with is always a collection of tab-delimited .txt files, so my analyses always begin with gathering the file paths to each one, feeding those into read.csv(), and binding the results into a df.
dat <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
) %>%
map_df( ~read.csv( ., sep='\t', skip=16) ) # actual data begins at line 16
This does exactly what I want, but I've been transitioning to tidyverse over the last few years.
I don't mind using utils::read.csv(); since my datasets are usually small, the speed benefit of readr wouldn't be felt. But for consistency's sake, I'd rather use readr.
When I do the same, but sub readr::read_tsv(), i.e.,
dat <-
.... same call to list.files()
%>%
map_df( ~read_tsv( ., skip=16 ))
I always get an empty (0x0) table. But it seems to be 'reading' the data, because I get a warning printout of 'Parsed with column specification: cols()' for every column in my data.
Clearly I'm misunderstanding here, but I don't know what about it I don't understand, which has made my search for answers challenging & fruitless.
So... what am I doing wrong here?
Thanks in advance!
edit: an example snippet of (one of) my data files was requested, hope this formats well!
# KLIBS INFO
# > KLibs Commit: 11a7f8331ba14052bba91009694f06ae9e1cdd3d
#
# EXPERIMENT SETTINGS
# > Trials Per Block: 72
# > Blocks Per Experiment: 8
#
# SYSTEM INFO
# > Operating System: macOS 10.13.4
# > Python Version: 2.7.15
#
# DISPLAY INFO
# > Screen Size: 21.5" diagonal
# > Resolution: 1920x1080 # 60Hz
# > View Distance: 57 cm
PID search_type stimulus_type present_absent response rt error
3 time COLOUR present absent 5457.863881 TRUE
3 time COLOUR absent absent 5357.009108 FALSE
3 time COLOUR present present 2870.76412 FALSE
3 time COLOUR absent absent 5391.404728 FALSE
3 time COLOUR present present 2686.6131 FALSE
3 time COLOUR absent absent 5306.652878 FALSE
edit: Using Jukob's suggestion
files <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
)
for (i in 1:length(files)) {
print(read_tsv(files[i], skip=16))
}
prints:
Parsed with column specification:
cols()
# A tibble: 0 x 0
... for each file
If I print files, I do get the correct list of file paths. If I remove skip=16 I get:
Parsed with column specification:
cols(
`# KLIBS INFO` = col_character()
)
Warning: 617 parsing failures.
row col expected actual file
15 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
16 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
17 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
18 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
19 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
... ... ......... .......... ........................................
See problems(...) for more details.
... for each file

FWIW I was able to solve the problem using your snippet by doing something along the following lines:
# Didn't work for me since when I copy and paste your snippet,
# the tabs become spaces, but I think in your original file
# the tabs are preserved so this should work for you
read_tsv("dat.tsv", comment = "#")
# This works for my case
read_table2("dat.tsv", comment = "#")
Didn't even need to specify skip argument!
But also, no idea why using skip and not comment will fail... :(
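For reference, plugging the comment argument back into the original pipeline would look something like this (a sketch, assuming the same data/ folder layout as in the question; comment = '#' drops the metadata lines, so the header row is picked up automatically and skip is no longer needed):
library(tidyverse) # readr, purrr, dplyr
dat <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
) %>%
map_df( ~read_tsv( ., comment='#' ) )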

Could you try the following code? The value of i may give you some idea of which file has a problem.
files <- list.files(path = "path", full.names = T, pattern = ".csv")
for (i in 1:length(files)){
print(read_tsv(files[i], skip = 16))
}

Related

Decimal read in does not change

I am trying to read in a .csv file with, for example, such a column:
These values are meant to represent thousands of hours, not two or three hours and so on.
When I try to change the read-in options with
read.csv(file, sep = ";", dec = ".") nothing changes. It doesn't matter whether I set dec = "." or dec = ",", it always keeps the numbers shown above.
You can use the following code:
library(readr)
df <- read_csv('data.csv', locale = locale(grouping_mark = "."))
df
Output:
# A tibble: 4 × 1
`X-ray`
<dbl>
1 2771
2 3783
3 1267
4 7798
As you can see, the values are now read as thousands.
An elegant way (in my opinion) is to create a new class, which you then use in the reading process.
This way, you stay flexible when your data is (really) messed up and the decimal/thousand separator is not consistent across all (numeric) columns.
# Define a new class of numbers
setClass("newNumbers")
# Define substitution of dots to nothing
setAs("character", "newNumbers", function(from) as.numeric(gsub("\\.", "", from)))
# Now read
str(data.table::fread( "test \n 1.235 \n 1.265", colClasses = "newNumbers"))
# Classes ‘data.table’ and 'data.frame': 2 obs. of 1 variable:
# $ test: num 1235 1265
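For what it's worth, the same setAs() conversion should also work with base read.csv(), since colClasses falls back to methods::as() for classes it doesn't recognise. A sketch, assuming a hypothetical data.csv with one dot-grouped column:
# Reuses the "newNumbers" class and setAs() conversion defined above
df <- read.csv("data.csv", colClasses = "newNumbers")
str(df)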
The solution proposed by Quinten will work; however, it's worth adding that the function designed to parse numbers with a grouping mark is col_number().
with(asNamespace("readr"),
read_delim(
I("X-ray hours\n---\n2.771\n3.778\n3,21\n"),
delim = ";",
col_names = c("x_ray_hours"),
col_types = cols(x_ray_hours = col_number()),
na = c("---"),
skip = 1
))
There is no need to define a specific locale to handle this case alone. Also, a locale setting applies to the whole dataset, while the intention here is to handle only that one column. From the docs:
?readr::parse_number
This drops any non-numeric characters before or after the first number.
Also, if the columns are separated by ;, read_delim() is more appropriate.
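For a quick feel of what this does, parse_number() can also be called directly on character vectors; a small sketch using values like those in the question (the locale call shows the locale-based alternative for comparison):
library(readr)
# default locale treats "." as the decimal mark
parse_number("2.771") # 2.771
# a European-style locale treats "." as the grouping mark instead
parse_number("2.771", locale = locale(decimal_mark = ",", grouping_mark = ".")) # 2771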

How to merge files in a directory with r?

Good afternoon,
I have a folder with 231 .csv files and I would like to merge them in R. Each file is one spectrum with 2 columns (Wavenumber and Reflectance), but as they come straight from the spectrometer they don't have colnames. So they look like this when I import them:
C_Sycamore = read.csv("#C_SC_1_10 average.CSV", header = FALSE)
head(C_Sycamore)
V1 V2
1 399.1989 7.750676e+001
2 401.1274 7.779499e+001
3 403.0559 7.813432e+001
4 404.9844 7.837078e+001
5 406.9129 7.837600e+001
6 408.8414 7.822227e+001
The first column (Wavenumber) is identical in all 231 files and all spectra contain exactly 1869 rows. Therefore, it should be possible to merge the whole folder into one big dataframe, right? At least this would be very practical for me.
So here is what I tried. I set the working directory to the corresponding folder, define an empty variable d, store all the file names in file.list, and then loop through the names in file.list. First, I want to change the colnames of every file to "Wavenumber" and the corresponding file name itself, so I use deparse(substitute(i)). Then, I want to read in the file and merge it with the others. And then I could probably do merge(d, read.csv(i, header = FALSE), by = "Wavenumber"), but I don't even get this far.
d = NULL
file.list = list.files()
for(i in file.list){
colnames(i) = c("Wavenumber", deparse(substitute(i)))
d = merge(d, read.csv(i, header = FALSE))
}
When I run this I get the error
"Error in `colnames<-`(`*tmp*`, value = c("Wavenumber", deparse(substitute(i)))) :
So I tried running it without the "colnames()" line, which does not produce an error, but doesn't work either. Instead of my desired dataframe I get an empty dataframe with only two columns and the message:
"reread"#S_BE_1_10 average.CSV" "#S_P_1_10 average.CSV""
This kind of programming is new to me. So I am thankful for all useful suggestions. Also I am happy to share more data if it helps.
Thanks in advance.
Solution
library(tidyr)
library(purrr)
path <- "your/path/to/folder"
# in one pipeline:
C_Sycamore <- path %>%
# get csv full paths; (?i) makes the pattern case insensitive
list.files(pattern = "(?i)\\.csv$", full.names = TRUE) %>%
# create a named vector: you need it to assign ids in the next step.
# and remove the file extension to get clean colnames
set_names(tools::file_path_sans_ext(basename(.))) %>%
# read file one by one, bind them in one df and create id column
map_dfr(read.csv, col.names = c("Wavenumber", "V2"), .id = "colname") %>%
# pivot to create one column for each .id
pivot_wider(names_from = colname, values_from = V2)
Explanation
I would suggest not to change the working directory.
I think it's better if you read from that folder instead.
You can read each CSV file in a loop and bind them together by row. You can use map_dfr to loop over each item and then bind every dataframe by row (that's what the _dfr stands for).
Note that I've used .id = to create a new column called colname. It gets populated out of the names of the vector you're looping over. (That's why we added the names with set_names)
Then, to have one row for each Wavenumber, you need to reshape your data. You can use pivot_wider.
In the end, you will have a dataframe with as many rows as there are Wavenumbers and as many columns as the number of CSVs plus 1 (the Wavenumber column).
Reproducible example
To double check my results, you can use this reproducible example:
path <- tempdir()
csv <- "399.1989,7.750676e+001
401.1274,7.779499e+001
403.0559,7.813432e+001
404.9844,7.837078e+001
406.9129,7.837600e+001
408.8414,7.822227e+001"
write(csv, file.path(path, "file1.csv"))
write(csv, file.path(path, "file2.csv"))
You should expect this output:
C_Sycamore
#> # A tibble: 5 x 3
#> Wavenumber file1 file2
#> <dbl> <dbl> <dbl>
#> 1 401. 77.8 77.8
#> 2 403. 78.1 78.1
#> 3 405. 78.4 78.4
#> 4 407. 78.4 78.4
#> 5 409. 78.2 78.2
Thanks a lot to @Konrad Rudolph for the suggestions!!
No need for a loop here, simply use lapply.
First set your working directory to the file location:
library(dplyr)
files_to_upload <- list.files(pattern = "*.csv")
theData_list <- lapply(files_to_upload, read.csv)
C_Sycamore <- bind_rows(theData_list)
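Note that bind_rows() stacks the spectra on top of each other. If the wide layout from the question (one column per file, joined on the shared wavenumber column) is wanted instead, the same list can be merged; a sketch building on the objects above, where the renaming assumes each file has exactly two columns:
# name each element after its file, rename the columns, then merge on Wavenumber
names(theData_list) <- tools::file_path_sans_ext(files_to_upload)
renamed <- lapply(names(theData_list), function(nm) {
df <- theData_list[[nm]]
names(df) <- c("Wavenumber", nm)
df
})
C_Sycamore_wide <- Reduce(function(x, y) merge(x, y, by = "Wavenumber"), renamed)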

Creating a loop to download and write met data to csv

I'm quite a novice at using R but I'm trying to self-teach and learn as I go along. I'm trying to create a loop to download and save multiple met data files individually as csv files using the worldmet package.
I have two variables, the met site code and the years of interest. I have included code to create a list of the years in question:
Startyear <- "2018"
Endyear <- "2020"
Yearlist <- seq(as.numeric(Startyear), as.numeric(Endyear))
and I have a .csv file listing all the required site codes, which I have read into R. See below a simplified version of the dataframe; in total, however, there are 204 rows. This dataframe is called 'siteinfo'.
code station ctry
037760-99999 GATWICK UK
037690-99999 CHARLWOOD UK
038760-99999 SHOREHAM UK
038820-99999 HERSTMONCEUX WEST END UK
037810-99999 BIGGIN HILL UK
An example of the code to import one year's worth of met data for one site is as follows:
importNOAA(code="037760-99999",year=2019,hourly=TRUE,precip=FALSE,PWC=FALSE,parallel=FALSE,quiet=FALSE)
I understand that I likely need a nested loop to change both variables, but I am unsure if I am going about this correctly. I also understand that I need to have quotation marks around the code value for it to be read correctly; however, I was wondering if there's a quick way to include this as part of the code rather than editing all 204 values in the csv?
Would I also need a separate loop following downloading the files, or can this be included into one piece of code?
The current code I have, and I am sure there is a lot wrong with this so I appreciate any guidance, is as follows
for(i in 1:siteinfo$code) {
for(j in 1:Yearlist){
importNOAA(code=i,year=j,hourly = TRUE, precip= FALSE, PWC= FALSE, parallel = TRUE, quiet = FALSE)
}}
This currently isn't working, so if you could help me piece this together, and if possible provide any explanation of where I have gone wrong or how I can improve my coding, I would be extremely grateful!
You can avoid loops altogether (better for large data sets and files) with some functions in dplyr and purrr. I get an error for invalid parameters when I try to run your importNOAA code, so I am using a simpler call to that function.
met_data <- siteinfo %>%
full_join(data.frame(year = Yearlist), by = character(0)) %>%
group_by(code, year) %>%
mutate(dat = list(data.frame(code, year))) %>%
mutate(met = purrr::map(dat, function(df) {
importNOAA(code = df$code, year = df$year, hourly=TRUE, quiet=FALSE)
}) ) %>%
select(-dat)
This code returns a tbl_df where the last column is a list of data.frames, each containing the data for a year-code combination. You can use met_data %>% summarize(met) to expand the data into one big data.frame to save to a csv, or if you want to write them all to individual csvs, use lapply:
lapply(1:nrow(met_data), function(x) {
write.csv(met_data$met[[x]],
file = paste(met_data$station[x], "_", met_data$year[x], ".csv", sep = ""))})
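A sketch of the "one big data.frame" route mentioned above, using tidyr::unnest() on the met list-column (names_repair is there only because importNOAA() may return columns that clash with the grouping columns):
library(tidyr)
all_met <- met_data %>%
unnest(met, names_repair = "unique")
write.csv(all_met, "all_met_data.csv", row.names = FALSE)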
You can't use a for loop like for(i in 1:siteinfo$code){}...
Just a short example:
for(i in 1:mtcars$mpg){
print(i)
}
output:
numerical expression has 32 elements: only the first used
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20
[1] 21
So just use the index, like this:
for (i in seq_len(nrow(siteinfo))) {
  for (j in seq_along(Yearlist)) {
    importNOAA(code = siteinfo$code[i], year = Yearlist[j], hourly = TRUE, precip = FALSE, PWC = FALSE, parallel = TRUE, quiet = FALSE)
  }
}
Maybe that works.

loading seer data into R

I am trying to load SEER data from ASCII files. There is only a .sas load file which I am trying to convert into an R load command.
the .sas load file looks like this:
filename seer9 './yr1973_2015.seer9/*.TXT';
data in;
infile seer9 lrecl=362;
input
# 1 PUBCSNUM $char8. /* Patient ID */
# 9 REG $char10. /* SEER registry */
# 19 MAR_STAT $char1. /* Marital status at diagnosis */
# 20 RACE1V $char2. /* Race/ethnicity */
# 23 NHIADE $char1. /* NHIA Derived Hisp Origin */
# 24 SEX $char1. /* Sex */
I have the following code to try and replicate a similar loading process:
data <- read.table("OTHER.TXT",
col.names = c("pubcsnum", "reg", "mar_stat", "race1v", "nhaide", "sex"),
sep = c(1, 9, 19, 20, 23, 24))
If I use the sep argument I get the following error:
Error in read.table("OTHER.TXT", col.names = c("pubcsnum", "reg", "mar_stat",
:invalid 'sep' argument
If I don't use the sep argument I get the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
:
line 1 did not have 133 elements
Does anyone have experience loading seer data? Does anyone have a suggestion why this isn't working?
Of note: when I use the fill = TRUE argument, the second error ("line 1 did not have 133 elements") doesn't occur anymore, BUT the data is not correct when I evaluate the first few observations. I further confirmed this by evaluating a known variable, sex:
> summary(data$sex)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000e+00 2.000e+00 3.020e+03 7.852e+18 9.884e+13 2.055e+20
where the actual values are 1/2, so the summary is nonsensical.
So the other comments and answers point out most of this, but here's a more complete answer for your exact problem. I have heard of many people struggling with these ASCII files (including with the related, but not very simple, packages) and I wanted to answer this for anyone else searching.
Fixed width files
These SEER "ASCII" files are actually fixed width text files (ASCII is an encoding standard not a file format). This means that there is no delimiter character (e.g. "," or "\t") that separates the fields (in a .csv or .tsv).
Instead, each field is defined by a start and end position in the line (sometimes a start position and the field width/length). This is what we see the in .sas file that you summarize:
input
# 1 PUBCSNUM $char8. /* Patient ID */
# 9 REG $char10. /* SEER registry */
...
What does this mean?
the first Patient ID field starts at position 1 and has a length of 8 (from $char8, similar to precision in SQL schemas etc.) which means it ends at position 8.
the second field, SEER registry ID, starts at position 9 (1 + 8 from the previous field) and has a length of 10 (again from $char10) which means it ends at position 18.
etc.
Where the # number consistently increases so the fields don't overlap.
Reading fixed width files
I find the readr::read_fwf() function nice and simple, mostly because it has a couple of helper functions, namely fwf_positions() that tell it how to define each field by start and end (or widths, with fwf_widths()).
So, to read just these two fields from the file we can do:
read_fwf(<file>, fwf_positions(start=c(1, 9), end=c(8, 18), col_names=c("patient_id", "registry_id")))
Where col_names is only there to rename the columns.
Helper script.
I have struggled with these before so I actually wrote some code that reads that .sas file and extracts the start positions, widths, column names and descriptions.
Here is the entire thing, just replace the file name:
## Script to read the SEER file dictionary and use it to read SEER ASCII data files.
library(tidyverse)
library(stringr)
#### Reading the file dictionary ----
## https://seer.cancer.gov/manuals/read.seer.research.nov2017.sas
sas.raw <- read_lines("https://seer.cancer.gov/manuals/read.seer.research.nov2017.sas")
sas.df <- tibble(raw = sas.raw) %>%
## remove the first few rows by insisting on a # that defines the start index of that field
filter(str_detect(raw, "#")) %>%
## extract out the start, width and column name+description fields
mutate(start = str_replace(str_extract(raw, "# [[:digit:]]{1,3}"), "# ", ""),
width = str_replace(str_extract(raw, "\\$char[[:digit:]]{1,2}"), "\\$char", ""),
col_name = str_extract(raw, "[[:upper:]]+[[:upper:][:digit:][:punct:]]+"),
col_desc = str_trim(str_replace(str_replace(str_extract(raw, "\\/\\*.+\\*\\/"), "\\/\\*", ""), "\\*\\/", "" )) ) %>%
## coerce to integers
mutate_at(vars(start, width), funs(as.integer)) %>%
## calculate the end position
mutate(end = start + width - 1)
column_mapping <- sas.df %>%
select(col_name, col_desc)
#### read the file with the start+end positions----
## CHANGE THIS LINE
file_path = "data/test_COLRECT.txt"
## read the file with the fixed width positions
data.df <- read_fwf(file_path,
fwf_positions(sas.df$start, sas.df$end, sas.df$col_name))
## result is a tibble
Hope that helps!
Fixed width files such as those described by that .sas file are read with the read.fwf function in the utils package (not foreign, as I first misremembered). I'm afraid that nicely formatted webpage hosted by Princeton is simply wrong about how to use read.table for this purpose. There are really no separators, just positions. In the case in point you could have used (assuming you have a directory named "yr1973_2015.seer9" in your working directory):
library(utils) #not really needed, just correcting my faulty memory
inputdf <- read.fwf( "yr1973_2015.seer9/OTHER.TXT",
widths = c(8, 10, 1, 2, -1, 1, 1), # field widths from the .sas layout; -1 skips the unused column at position 22
col.names = c("pubcsnum", "reg", "mar_stat", "race1v", "nhaide", "sex"))
You would lose most of the information since the lrecl value tells us there are 362 characters per line, but this would be a good test case and you could then switch to the SAScii functions... and thanks to @AnthonyDamico:
packageDescription("SAScii")
#---------------
Package: SAScii
Type: Package
Title: Import ASCII files directly into R using only a SAS
input script
Version: 1.0
Date: 2012-08-18
Authors@R: person( "Anthony Joseph" , "Damico" , role = c(
"aut" , "cre" ) , email = "ajdamico@gmail.com" )
Description: Using any importation code designed for SAS
users to read ASCII files into sas7bdat files, the
SAScii package parses through the INPUT block of a
(.sas) syntax file to design the parameters needed
for a read.fwf function call. This allows the user
to specify the location of the ASCII (often a .dat)
file and the location of the .sas syntax file, and
then load the data frame directly into R in just one
step.
License: GPL (>= 2)
URL: https://github.com/ajdamico/SAScii
Depends: R (>= 2.14)
LazyLoad: Yes
Packaged: 2012-08-17 08:35:18 UTC; AnthonyD
Author: Anthony Joseph Damico [aut, cre]
Maintainer: Anthony Joseph Damico <ajdamico@gmail.com>
Repository: CRAN
Date/Publication: 2012-08-17 10:55:15
Built: R 3.4.0; ; 2017-04-20 18:55:31 UTC; unix
-- File: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/SAScii/Meta/package.rds
I wasn't absolutely sure that the trailing info would be effectively ignored on those long lines, but I checked with this slight mod of the first example on the ?read.fwf page:
> ff <- tempfile()
> cat(file = ff, "12345689", "98765489", sep = "\n")
> read.fwf(ff, widths = c(1,2,3))
V1 V2 V3
1 1 23 456
2 9 87 654
> unlink(ff)
I remembered that using Anthony's name as a search term could be helpful, and found that his website has been updated. Check out:
http://asdfree.com/surveillance-epidemiology-and-end-results-seer.html
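For completeness, a minimal sketch of the SAScii route mentioned above (the data-file path is hypothetical; the .sas dictionary URL is the same one used earlier):
library(SAScii)
sas_script <- "https://seer.cancer.gov/manuals/read.seer.research.nov2017.sas"
# inspect the start/width/column-name layout parsed out of the SAS INPUT block
layout <- parse.SAScii(sas_script)
head(layout)
# read the fixed-width SEER file using that same layout
seer_other <- read.SAScii("yr1973_2015.seer9/OTHER.TXT", sas_script)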

sink a data frame to .txt file

I have a 4-column data frame named mytable with hundreds of rows.
It looks like
id name count rate
234 uert e#3 erwafrw23 weq 34 2
324 awrt%rw-fref-sfr-32 eq 78 4
329 jiowerfhguy qwhrb 90 8
123 234huib|f|wer fwfqwasgre 54 3
So as it shows, the name column has spaces and special characters, so I can't use write.table() to save the data frame.
I tried
sink('myfile.txt')
print(mytable,right=F)
sink()
But I ran into a problem: sometimes the name is so long that the four columns can't fit together on the same line, i.e. the third or fourth column may run onto the next line.
Is there any method that can adjust the width of the table sunk to the .txt file? Or, besides sink(), is there any other code that can be used to save a data frame to a .txt file? Thanks.
Seems like write.table() should be OK. Just specify a separator, like ",", or something else not appearing in your name column:
my.df <- data.frame(ID=c(234,324,329,123),
name = c("uert e#3 erwafrw23 weq"," awrt%rw-fref-sfr-32 eq","jiowerfhguy qwhrb","234huib|f|wer fwfqwasgre"),
count = c(34,78,90,54), rate = c(2,4,8,3))
write.table(my.df, file = "my.df.txt", sep = ",", col.names = colnames(my.df))
# read it back in
my.df2 <- read.table(file = "my.df.txt",sep = ",", header = TRUE, stringsAsFactors = FALSE)
all(my.df == my.df2)
TRUE
You seem confused about the difference between a file and the console output. There is no limitation on line width with write.table, at least none you will approach in normal use. You can control the console screen width with options(width=72) and use capture.output(print(mytable)) so the output meets whatever unstated width requirements you might have.
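A small sketch of that capture.output() route, using the OP's mytable object and file name:
# widen the console print width so all four columns stay on one line,
# then write the printed table to a text file
options(width = 200)
writeLines(capture.output(print(mytable, right = FALSE)), "myfile.txt")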
