Have a CSV file which has a column which has a variable list of items separated by a |.
I use the code below:
violations <- inspections %>% head(100) %>%
select(`Inspection ID`,Violations) %>%
separate_rows(Violations,sep = "|")
but this only creates a new row for each character in the field (including spaces)
What am I missing here on how to separate this column?
It's hard to help without a better description of your data and an example of what the correct output would look like. That said, I think part of your confusion is due to the documentation in separate_rows. A similar function, separate, documents its sep argument as:
If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
but the documentation for the sep argument in separate_rows doesn't say the same thing though I think it has the same behavior. In regular expressions, | has special meaning so it must be escaped as \\|.
df <- tibble(
Inspection_ID = c(1, 2, 3),
Violations = c("A", "A|B", "A|B|C"))
separate_rows(df, Violations, sep = "\\|")
Yields
# A tibble: 6 x 2
Inspection_ID Violations
<dbl> <chr>
1 1 A
2 2 A
3 2 B
4 3 A
5 3 B
6 3 C
Not sure what your data looks like, but you may want to replace sep = "|" with sep = "\\|". Good luck!
Using sep=‘\|’ with the separate_rows function allowed me to separate pipe delimited values
ANSWERED: Thank you so much Bob, ffs the issue was not specifying comment='#'. Why this works, when 'skip' should've skipped the offending lines remains a mystery. Also see Gray's comment re: Excel's 'Text to Columns' feature for a non-R solution.
Hey folks,
this has been a demon on my back for ages.
The data I work with is always a collection of tab delimited .txt files, so my analysis always begin with gathering the file paths to each and feeding those into read.csv() and binding to a df.
dat <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
) %>%
map_df( ~read.csv( ., sep='\t', skip=16) ) # actual data begins at line 16
This does exactly what I want, but I've been transitioning to tidyverse over the last few years.
I don't mind using utils::read.csv(), where my datasets are usually small the speed benefit of readr wouldn't be felt. But, for consistency's sake I'd rather use readr.
When I do the same, but sub readr::read_tsv(), i.e.,
dat <-
.... same call to list.files()
%>%
map_df( ~read_tsv( ., skip=16 ))
I always get an empty (0x0) table. But it seems to be 'reading' the data, because I get a warning print out of 'Parsed with column specification: cols()' for every column in my data.
Clearly I'm misunderstanding here, but I don't know what about it I don't understand, which has made my search for answers challenging & fruitless.
So... what am I doing wrong here?
Thanks in advance!
edit: a example snippet of (one of) my data files was requested, hope this formats well!
# KLIBS INFO
# > KLibs Commit: 11a7f8331ba14052bba91009694f06ae9e1cdd3d
#
# EXPERIMENT SETTINGS
# > Trials Per Block: 72
# > Blocks Per Experiment: 8
#
# SYSTEM INFO
# > Operating System: macOS 10.13.4
# > Python Version: 2.7.15
#
# DISPLAY INFO
# > Screen Size: 21.5" diagonal
# > Resolution: 1920x1080 # 60Hz
# > View Distance: 57 cm
PID search_type stimulus_type present_absent response rt error
3 time COLOUR present absent 5457.863881 TRUE
3 time COLOUR absent absent 5357.009108 FALSE
3 time COLOUR present present 2870.76412 FALSE
3 time COLOUR absent absent 5391.404728 FALSE
3 time COLOUR present present 2686.6131 FALSE
3 time COLOUR absent absent 5306.652878 FALSE
edit: Using Jukob's suggestion
files <- list.files(
path = 'data',
pattern = '*.txt',
full.names = TRUE,
recursive = TRUE
)
for (i in 1:length(files)) {
print(read_tsv(files[i], skip=16))
}
prints:
Parsed with column specification:
cols()
# A tibble: 0 x 0
... for each file
If I print files, I do get the correct list of file paths. If I remove skip=16 I get:
Parsed with column specification:
cols(
`# KLIBS INFO` = col_character()
)
Warning: 617 parsing failures.
row col expected actual file
15 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
16 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
17 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
18 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
19 -- 1 columns 21 columns 'data/raw/2019/colour/p1.2019-02-28.txt'
... ... ......... .......... ........................................
See problems(...) for more details.
... for each file
FWIW I was able to solve the problem using your snippet by doing something along the following line:
# Didn't work for me since when I copy and paste your snippet,
# the tabs become spaces, but I think in your original file
# the tabs are preserved so this should work for you
read_tsv("dat.tsv", comment = "#")
# This works for my case
read_table2("dat.tsv", comment = "#")
Didn't even need to specify skip argument!
But also, no idea why using skip and not comment will fail... :(
Could your try following code? The value of i may give you some idea for which file there is a problem.
files <- list.files(path = "path", full.names = T, pattern = ".csv")
for (i in 1:length(files)){
print(read_tsv(files[i], skip = 16))
}
I imported a dataset that unfortunately did not have any separators defined, nor in columns or in rows. I have tried to look for an option to define a specific row separator but could not find one that could be applicable to this situation.
df1 <- data.frame("V1" = "{lat:45.493,lng:-76.4886,alt:22400,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:328,spd:null,postime:2019-01-15 16:10:39},
{lat:45.5049,lng:-76.5285,alt:23425,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50}")
df2 <- data.frame("V1" = "{lat:45.493,lng:-76.4886,alt:22400,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:328,spd:null,postime:2019-01-15 16:10:39},
{lat:45.5049,lng:-76.5285,alt:23425,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50},
{lat:45.5049,lng:-76.5285,alt:24000,call:COFPQ,icao:C056P,registration:X-VLMP,sqk:6232,trak:321,spd:null,postime:2019-01-15 16:11:50}")
newdf <- rbind(df1,df2)
This is a model of the data that I am currently struggling with. Ideally, the row separators in this case would have to be defined as "},{" and the column separators as ",". I tried subsetting this pattern to a tab and defining a different separator but this either returned an error (tried with separate_rows from TidyR) or simply did nothing.
Hope you guys can help
This looks like incomplete (incorrect) JSON, so I suggest you bring it up-to-spec and then parse it with known tools. Some problems, easily mitigated:
sqk should have a comma-separator, perhaps a copy/paste issue. This might be generalized as any "number-letter" progression depending on your process. (Edit: your update seems to have resolved this issue, so I'll remove it. If you still need it, I recommend you go with a very literal gsub("([^,])sqk:", "\\1,sql:", s).)
Labels (e.g., lat, alt, sql) should all be double-quoted.
Non-numeric data needs to be quoted, specifically the dates.
Exception to 3: null should remain unquoted.
There are multiple "dicts" that need to be within a "list", i.e., from {...},{...} to [{...},{...}].
Side note with your data: I read them in with stringsAsFactors=FALSE, since we don't need factors.
fixjson <- function(s) {
gsub(",+", ",",
paste(
gsub('"sqk":([^,]+)', '"sqk":"\\1"',
gsub("\\s*\\b([A-Za-z]+)\\s*(?=:)", '"\\1"', # note 2
gsub('(?<=:)"(-?[0-9.]+|null)"', "\\1", # notes 3, 4
gsub("(?<=:)([^,]+)\\b", "\"\\1\"", # quote all data
s, perl = TRUE), perl = TRUE), perl = TRUE)),
collapse = "," )
)
}
fixjson(df1$V1)
# [1] "{\"lat\":45.493,\"lng\":-76.4886,\"alt\":22400,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":328,\"spd\":null,\"postime\":\"2019-01-15 16:10:39\"},\n {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":23425,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"},\n {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":24000,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"},\n {\"lat\":45.5049,\"lng\":-76.5285,\"alt\":24000,\"call\":\"COFPQ\",\"icao\":\"C056P\",\"registration\":\"X-VLMP\",\"sqk\":\"6232\",\"trak\":321,\"spd\":null,\"postime\":\"2019-01-15 16:11:50\"}"
From here, we use a well-defined json parser (from either jsonlite or RJSONIO, both use similar APIs):
jsonlite::fromJSON(paste("[", fixjson(df1$V1), "]", sep=""))
# lat lng alt call icao registration sqk trak spd postime
# 1 45.4930 -76.4886 22400 COFPQ C056P X-VLMP 6232 328 NA 2019-01-15 16:10:39
# 2 45.5049 -76.5285 23425 COFPQ C056P X-VLMP 6232 321 NA 2019-01-15 16:11:50
# 3 45.5049 -76.5285 24000 COFPQ C056P X-VLMP 6232 321 NA 2019-01-15 16:11:50
# 4 45.5049 -76.5285 24000 COFPQ C056P X-VLMP 6232 321 NA 2019-01-15 16:11:50
From here, rbind as needed. (Note that the null literal was translated into R's NA, which is "as it should be" in my opinion.)
Follow-on suggestion: you can use as.POSIXct directly on your postime column; I hope you are certain all of your data are in the same timezone since the field contains no hint.
Lastly, you mentioned something about non-ASCII characters gumming up the works. My recent edit included a little added robustness for spaces introduced from the use of iconv (e.g., the use of \\s*), so the following might suffice for you:
jsonlite::fromJSON( paste("[", fixjson(iconv(df2$V1, "latin1", "ASCII", sub="")), "]") )
(Use of iconv suggested by https://stackoverflow.com/a/9935242/3358272)
I'm trying to read a table using fread.
The txt file has text which look like:
"No","Comment","Type"
"0","he said:"wonderful|"","A"
"1","Pr/ "d/s". "a", n) ","B"
R codes I'm using is: dataset0 <- fread("data/test.txt", stringsAsFactors = F) with the development version of data.table R package.
Expect to see a dataset with three columns; however:
Error in fread(input = "data/stackoverflow.txt", stringsAsFactors = FALSE) :
Line 3 starting <<"1","Pr/ ">> has more than the expected 3 fields.
Separator 3 occurs at position 26 which is character 6 of the last field: << n) ","B">>.
Consider setting 'comment.char=' if there is a trailing comment to be ignored.
How to solve it?
The development version of data.table handles files like this where the embedded quotes have not been escaped. See point 10 on the wiki page.
I just tested it on your input and it works.
$ more unescaped.txt
"No","Comment","Type"
"0","he said:"wonderful."","A"
"1","The problem is: reading table, and also "a problem, yes." keep going on.","A"
> DT = fread("unescaped.txt")
> DT
No Comment Type
1: 0 he said:"wonderful." A
2: 1 The problem is: reading table, and also "a problem, yes." keep going on. A
> ncol(DT)
[1] 3
Use readLines to read line by line, then replace delimiter and read.table:
# read with no sep
x <- readLines("test.txt")
# introduce new sep - "|"
x <- gsub("\",\"", "\"|\"", x)
# read with new sep
read.table(text = x, sep = "|", header = TRUE)
# No Comment Type
# 1 0 he said:"wonderful." A
# 2 1 The problem is: reading table, and also "a problem, yes." keep going on. A
I have a 4-column data frame named as mytable with hundreds of rows.
It looks like
id name count rate
234 uert e#3 erwafrw23 weq 34 2
324 awrt%rw-fref-sfr-32 eq 78 4
329 jiowerfhguy qwhrb 90 8
123 234huib|f|wer fwfqwasgre 54 3
so as it shows, the name has spaces and special characters. so I can't use write.table to save the data.frame.
I tried
sink('myfile.txt')
print(mytable,right=F)
sink()
But I met a problem that sometimes the name is so long that the four column can't show together in the same page, i.e. the third or fourth column may run to the next page.
Is there any method can adjust the width of table sinked to .txt file? Or besides sink(), any other code can be used to save a data frame to .txt file? Thanks.
seems like write.table() should be OK. just specify a seperator, like ",", or something else not appearing in your name column:
my.df <- data.frame(ID=c(234,324,329,123),
name = c("uert e#3 erwafrw23 weq"," awrt%rw-fref-sfr-32 eq","jiowerfhguy qwhrb","234huib|f|wer fwfqwasgre"),
count = c(34,78,90,54), rate = c(2,4,8,3))
write.table(my.df, file = "my.df.txt", sep = ",", col.names = colnames(my.df))
# read it back in
my.df2 <- read.table(file = "my.df.txt",sep = ",", header = TRUE, stringsAsFactors = FALSE)
all(my.df == my.df2)
TRUE
You seem confused about the difference between a file and the console output. There is no limitation to the width of lines with write.table, at least not ones you will approach in normal use. You can control the console screen width with options(width=72) and use capture.output(print(mytable)) so the ouput meets whatever unstated width requirements you might have.