fread() with commas outside square brackets as separators

I'm trying to use fread() to get some data from a website. The data is conveniently set up with comma separators, but I get the error:
1: In fread("https://website.com/") :
Stopped early on line 56. Expected 5 fields but found 6. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<0,1,1,x[[0], [1]],0>>
This is because the entries before line 56 had a blank in column 4, something like <<1,1,1,,0>>, whereas line 56 has a value containing a comma in column 4, so it gets split into two columns. Now, I want the whole x[[y], [z]] to end up in one cell; in other words, I want my data separated by commas, but not by the commas inside square brackets.
Edit: The real website is private, so it makes no sense to link it here, but it simply contains data in CSV format. Something like:
field1,field2,field3,field4,field5
1,0,0,,1
0,0,0,,1
1,1,0,,1
1,1,0,,1
............
0,1,1,x[[0], [1]],0
0,1,0,x[[0], [1]],1
1,0,1,,1
0,0,1,x[[1], [0]],0
............
The problem is that x[[0], [1]] is supposed to be all in one cell, but because of the comma delimiter it is split across two cells.
Is there any way to do this with fread()? Or with any other function that serves a similar purpose?
Thank you in advance and sorry if the question is somewhat basic, I'm just getting started with R.

Instead of reading your CSV file directly from that private website of yours with fread, you can download the CSV first and then:
Read the lines of the CSV (without any special parsing); this is equivalent to my csv_lines <- read_lines(my_weird_csv_text);
Then, split those read lines according to the regex "(?!\\])(\\,)(?!\\s\\[)" instead of the single comma "," (this ensures that commas within those expressions with "[[" and "]]" are not used as split characters);
Finally, from the first row of the resulting matrix (split_lines) define the column names of a new dataframe/tibble that has been coerced from split_lines.
I hope it's clear.
Basically, we had to circumvent straightforward reading functions such as fread (or equivalents) by reading line by line and then splitting on a regex that handles your special cases.
library(readr)
library(stringr)
library(tibble)
my_weird_csv_text <-
"field1,field2,field3,field4,field5
1,0,0,,1
0,0,0,,1
1,1,0,,1
1,1,0,,1
0,1,1,x[[0], [1]],0
0,1,0,x[[0], [1]],1
1,0,1,,1
0,0,1,x[[1], [0]],0"
# Read the raw lines, with no field parsing yet
csv_lines <- read_lines(my_weird_csv_text)

# Split on commas that are not followed by " [", leaving the commas
# inside "x[[0], [1]]" untouched
split_lines <- stringr::str_split(csv_lines, "(?!\\])(\\,)(?!\\s\\[)", simplify = TRUE)

# The first row holds the header; the remaining rows are the data
tbl <- as_tibble(split_lines[-1, ])
colnames(tbl) <- split_lines[1, ]
tbl
#> # A tibble: 8 x 5
#> field1 field2 field3 field4 field5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 0 0 "" 1
#> 2 0 0 0 "" 1
#> 3 1 1 0 "" 1
#> 4 1 1 0 "" 1
#> 5 0 1 1 x[[0], [1]] 0
#> 6 0 1 0 x[[0], [1]] 1
#> 7 1 0 1 "" 1
#> 8 0 0 1 x[[1], [0]] 0

A suggestion:
From the documentation:
'fread' is for regular delimited files; i.e., where every row has the same number of
columns.
If the number of columns varies or is irregular because of errors in file generation, an alternative like readLines would enable you to process the file line by line, perhaps using regular-expression functions like gsub to patch the irregular rows before parsing, as sketched below.
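For instance, a minimal sketch of that route, assuming the CSV has been downloaded to a local file (the name data.csv is hypothetical): protect the in-bracket commas with a placeholder, parse normally, then restore them.
library(data.table)
# Read the file line by line, with no field parsing
raw_lines <- readLines("data.csv")
# Protect the comma inside "x[[0], [1]]" with a placeholder...
protected <- gsub("], [", "]; [", raw_lines, fixed = TRUE)
# ...parse as a regular 5-column CSV...
dt <- fread(text = protected)
# ...then restore the original comma in the affected column
dt[, field4 := gsub("]; [", "], [", field4, fixed = TRUE)]
dt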

Related

How to calculate the number of elements in a string for each observation

Here is a representation of my dataset
mydata<-data.frame(ID=1:3, str=c("ANN_ABL_ABL","ABL", "SLE_ANN"))
I want to calculate the number of elements in the string of each observation, in order to have a dataset like below.
ID str number_of_elements
1 1 ANN_ABL_ABL 3
2 2 ABL 1
3 3 SLE_ANN 2
A base R option
transform(
mydata,
number_of_elements = nchar(gsub("[^_]","",str))+1
)
gives
ID str number_of_elements
1 1 ANN_ABL_ABL 3
2 2 ABL 1
3 3 SLE_ANN 2
A possible solution, using stringr::str_count:
library(tidyverse)
mydata<-data.frame(ID=1:3, str=c("ANN_ABL_ABL","ABL", "SLE_ANN"))
mydata %>%
mutate(n = str_count(str, "_") + 1)
#> ID str n
#> 1 1 ANN_ABL_ABL 3
#> 2 2 ABL 1
#> 3 3 SLE_ANN 2
The scan function is set up to pull apart lines of text. Its default first parameter is a file name, but a text parameter was added a few years ago. You can cook up an identical function whose first parameter is text; I also chose to make the default expected input type "character".
scant <- function(txt, ...){scan(text=txt, what="", quiet=TRUE, ...) }
I went through those gymnastics to allow the scant function to work within an lapply call:
lengths( lapply(mydata$str, scant, sep="_") )
I could have used an anonymous, throwaway function to do this in one line, but I decided instead to put this helper function in my .Rprofile setup. For many years I had a somewhat similar read.txt function that used a textConnection to supply character data to the read.table function. It became unnecessary when the text parameter was added to scan.
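To attach the counts as the requested column, a usage sketch building on scant:
transform(mydata, number_of_elements = lengths(lapply(str, scant, sep = "_")))
#>   ID         str number_of_elements
#> 1  1 ANN_ABL_ABL                  3
#> 2  2         ABL                  1
#> 3  3     SLE_ANN                  2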

Replacing empty cells in a column with values from another column in R

I am trying to pull the cell values from the StudyID column into the empty cells of the SigmaID column, but I am running into an odd issue with the output.
This is how my data looks before running commands.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44
ST04238 0 50
LM04829 1 24 LM23921
ST91124 0 89
ST29001 0 55
I tried accomplishing this by writing the syntax in three ways, because I wasn't sure if there is a problem with the way the logic was set up. All three produce the same output.
df$SigmaID <- ifelse(test = df$SigmaID != "", yes = df$SigmaID, no = df$StudyID)
df$SigmaID <- ifelse(df$SigmaID == "", df$StudyID, df$SigmaID)
df %>% mutate(SigmaID = ifelse(Gender == 0, StudyID, SigmaID))
Output: instead of pulling the values from the StudyID column, it is populating one- to four-digit numbers.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44 5
ST04238 0 50 4908
LM04829 1 24 LM23921
ST91124 0 89 209
ST29001 0 55 4092
I have tried recoding the empty spaces to NA and then calling on NA in the logic, but this produced the same output as seen above. I'm wondering if it could have anything to do with variable type or variable attributes, or whether something's off about how it's reading the characters in StudyID. Would appreciate feedback on this issue!
Here is how to do it:
df$SigmaID[df$SigmaID == ""] = df$StudyID[df$SigmaID == ""]
df$SigmaID[df$SigmaID == ""] selects only the entries where SigmaID is empty, and the matching StudyID values are assigned to exactly those entries.
I also recommend using data.table instead of data.frame. It is faster and has some useful syntax features:
library(data.table)
setDT(df)  # setDT converts a data.frame to a data.table by reference
df[SigmaID == "", SigmaID := StudyID]
Following up on this! As it turns out, R (before version 4.0.0) converts strings to factors by default when building data frames, and when ifelse() is handed a factor it returns the underlying integer level codes rather than the labels, which is exactly where those stray numbers came from.
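A minimal illustration of that pitfall, with a constructed example rather than the original data:
ifelse(TRUE, factor("LM24008"), "")
#> [1] 1
The factor's integer code comes back instead of its label. There are a few ways of addressing this. One is to convert the factor columns back to character: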
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)
Another method is to prevent the conversion when reading the file in:
df <- read.csv("/insert file pathway here", stringsAsFactors = FALSE)
This is what I found to be helpful! I'm sure there are additional methods of troubleshooting this as well.

Breaking up a checkbox column containing multiple answers into separate columns in R

I have multiple checkboxes in a data set that need to be divided. One of these checkbox questions asks which states someone practices in (so there are 50 checkbox options). The data is exported in the format below:
ID q143
1 1,4,6
But I need it in this format: a true/false (0/1) column for each individual checkbox (unchecked/checked):
ID q143_1 q143_2 q143_3 q143_4 q143_5 q143_6
1 1 0 0 1 0 1
Since this is such a large number of columns that need to be made, any ideas on how to separate this easily?
I was thinking of if/then statements, but I think that would take a while.
Thanks in advance!
We could use strsplit
# Turn the comma-separated answers of the (single) row into numbers
v1 <- as.numeric(strsplit(df1$q143, ",")[[1]])
# 1 for every value in the observed range that was checked, 0 otherwise
v2 <- +(Reduce(`:`, range(v1)) %in% v1)
# Bind the ID column to the named 0/1 indicator columns
cbind(df1[1], setNames(as.data.frame.list(v2), paste0(names(df1)[2], "_", seq_along(v2))))
# ID q143_1 q143_2 q143_3 q143_4 q143_5 q143_6
#1 1 1 0 0 1 0 1
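Since the question mentions 50 states, here is a sketch of the same idea applied row-wise over a fixed number of boxes (the value of n_boxes and a multi-row df1 are assumptions):
n_boxes <- 50  # assumed total number of checkbox options
# One 0/1 row per observation, one column per checkbox
mat <- t(sapply(strsplit(as.character(df1$q143), ","),
                function(x) +(seq_len(n_boxes) %in% as.numeric(x))))
colnames(mat) <- paste0("q143_", seq_len(n_boxes))
cbind(df1[1], mat)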

strsplit a character array and convert to dataframe simultaneously

I have what feels like a difficult data manipulation problem, and am hoping to get some guidance. Here is a test version of what my current array looks like, as well as what dataframe I hope to obtain:
dput(test)
c("<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>", "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>")
test
[1] "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>"
[2] "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
desired_df
quarter oncourt-id time-minutes time-seconds id
1 1 NA 12 0 1
2 2 NA 10 NA 1
There are a few problems I am dealing with:
the character array "test" has backslashes where there should be nothing, but I was having difficulty using gsub in this format: gsub("\", "", test).
not every element in test has the same number of entries, note in the example that the 2nd element doesn't have time-seconds, and so for the dataframe I would prefer it to return NA.
I have tried using strsplit(test, " ") to first split on spaces, which only exist between different column entries, but then I am returned a list of lists that is just as difficult to deal with.
You've got XML there. You could parse it, then run rbindlist on the result. This will probably be a lot less hassle than trying to split the name-value pairs as strings.
dflist <- lapply(test, function(x) {
  # Parse the XML tag; its attributes come back as a named vector
  df <- as.data.frame.list(XML::xmlToList(x))
  # Turn empty attribute strings into NA
  is.na(df) <- df == ""
  df
})
# Stack the rows, filling the missing columns (e.g. time-seconds) with NA
data.table::rbindlist(dflist, fill = TRUE)
# quarter oncourt.id time.minutes time.seconds id
# 1: 1 NA 12 0 1
# 2: 2 NA 10 NA 1
Note: You will need the XML and data.table packages for this solution.

Lists of term-frequency pairs into a matrix in R

I have a large data set in the following format, where each line is a document, encoded as word:frequency-in-the-document pairs separated by spaces; lines can be of variable length:
aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...
E.g., in the first document, "aword" occurs 3 times. What I ultimately want to do is create a little search engine where the documents (in the same format) matching a query are ranked; I thought about using TfIdf and the tm package (based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Otherwise, I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch is that I already have these data indexed in this format (and I'd rather use them, unless the format is truly alien and cannot be converted).
What I've tried so far is to import the lines and split them:
docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")
I figured I would put something like this in a loop:
doclist2 <- strsplit(doclist, ":", fixed=TRUE)
and somehow get the paired values into an array, and then run a loop that populates a matrix (pre-filled with zeroes: matrix(0,x,y)) by fetching the appropriate values from the word:freq pairs (would that in itself be a good way to construct the matrix?). But this way of converting does not seem like a good approach; the lists keep getting more complicated, and I still wouldn't know how to get to the point where I can populate the matrix.
What I (think I) would need in the end is a matrix like this:
      doc1 doc2 doc3 doc4 ...
aword    3    0    0    0
bword    2    4    0    0
cword   15   20    0    0
dword    2    0    0    0
fword    0    1    0    0
...
which I could then convert into a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I don't know what these things are called (I've been googling for a day, on the theme of "term document vector/array/pairs", "two-dimensional array", "list into matrix" etc).
What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution would be too obvious or doable with built-in functions: what is the actual term for the format that I described above, where there are those term:frequency pairs on a line, and each line is a document?
Here's an approach that gets you the output you say you might want:
## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on a spaces and colons
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(x)
  cbind(document = names(B)[x],
        matrix(B[[x]], ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("word", "count"))))))
## Convert to a data.frame
out <- data.frame(out)
out
# document word count
# 1 document1 aword 3
# 2 document1 bword 2
# 3 document1 cword 15
# 4 document1 dword 2
# 5 document2 bword 4
# 6 document2 cword 20
# 7 document2 fword 1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))
## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
# document
# word document1 document2
# aword 3 0
# bword 2 4
# cword 15 20
# dword 2 0
# fword 0 1
Note: The answer was edited to use matrices in the creation of "out" to minimize the number of calls to read.table (used in an earlier version), which would be a major bottleneck with bigger data.
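From there, if you need an actual TermDocumentMatrix object for the tm tutorial, a sketch of the conversion (assuming the tm package and its slam dependency are installed):
library(tm)
# xtabs() returns a contingency table; coerce it to a plain matrix first
m <- as.matrix(xtabs(count ~ word + document, out))
# Hand the matrix to tm with plain term-frequency weighting
tdm <- as.TermDocumentMatrix(m, weighting = weightTf)
inspect(tdm)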
