I'm trying to use fread() to get some data from a website. The data is conveniently set up with comma separators, but I get the error:
1: In fread("https://website.com/") :
Stopped early on line 56. Expected 5 fields but found 6. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<0,1,1,x[[0], [1]],0>>
This is because the rows before line 56 have a blank in column 4, so something like <<1,1,1,,1>>, whereas line 56 contains a comma inside column 4, so fread splits it into two columns. I want the whole x[[y], [z]] expression to stay in one cell; in other words, I want my data split on commas, but not on commas inside square brackets.
Edit: The real website is private, so it makes no sense to link it here, but it simply contains data in CSV format. Something like:
field1,field2,field3,field4,field5
1,0,0,,1
0,0,0,,1
1,1,0,,1
1,1,0,,1
............
0,1,1,x[[0], [1]],0
0,1,0,x[[0], [1]],1
1,0,1,,1
0,0,1,x[[1], [0]],0
............
The problem arises with the fact that x[[0], [1]] is supposed to be all in one cell, but because of the comma delimiter it is split across two cells.
Is there any way to do this with fread()? Or with any other function that serves a similar purpose?
Thank you in advance and sorry if the question is somewhat basic, I'm just getting started with R.
Instead of reading your CSV file directly from that private website of yours with fread, you can download the CSV first and then:
Read the lines of the CSV (without any special parsing); that is equivalent to my csv_lines <- read_lines(my_weird_csv_text);
Then, split those lines according to the regex "(?!\\])(\\,)(?!\\s\\[)" instead of the single comma "," (this ensures that commas within those expressions with "[[" and "]]" are not used as split characters);
Finally, use the first row of the resulting matrix (split_lines) as the column names of a new dataframe/tibble coerced from the remaining rows of split_lines.
I hope it's clear.
Basically, we had to circumvent straightforward reading functions such as fread by reading line by line and then splitting on a regex that handles your special cases.
library(readr)
library(data.table)
library(stringr)
library(tibble)
my_weird_csv_text <-
"field1,field2,field3,field4,field5
1,0,0,,1
0,0,0,,1
1,1,0,,1
1,1,0,,1
0,1,1,x[[0], [1]],0
0,1,0,x[[0], [1]],1
1,0,1,,1
0,0,1,x[[1], [0]],0"
csv_lines <- read_lines(my_weird_csv_text)
split_lines <- stringr::str_split(csv_lines, "(?!\\])(\\,)(?!\\s\\[)", simplify = TRUE)
as_tibble(split_lines[-1, ]) %>%
`colnames<-`(split_lines[1, ]) -> tbl
tbl
#> # A tibble: 8 x 5
#> field1 field2 field3 field4 field5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 0 0 "" 1
#> 2 0 0 0 "" 1
#> 3 1 1 0 "" 1
#> 4 1 1 0 "" 1
#> 5 0 1 1 x[[0], [1]] 0
#> 6 0 1 0 x[[0], [1]] 1
#> 7 1 0 1 "" 1
#> 8 0 0 1 x[[1], [0]] 0
A suggestion:
From the documentation:
'fread' is for regular delimited files; i.e., where every row has the same number of
columns.
If the number of columns varies or is irregular because of errors in file generation, an alternative like readLines would let you process the file line by line, perhaps using regex-based functions like gsub, etc.
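A minimal sketch of that line-by-line approach (the file name "my_file.csv" is a placeholder; the trick of temporarily protecting the bracketed commas is just one way to do it):

```r
# Read the raw lines; no field parsing is attempted yet
raw_lines <- readLines("my_file.csv")  # placeholder path

# Protect commas inside the x[[...], [...]] expressions by temporarily
# replacing "], [" with "]; [" so a plain comma split is safe
protected <- gsub("\\], \\[", "]; [", raw_lines)

# Split on the remaining commas, then restore the protected separator
fields <- strsplit(protected, ",")
fields <- lapply(fields, function(f) gsub("\\]; \\[", "], [", f))

# Assemble a data frame: the first line holds the column names
df <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
names(df) <- fields[[1]]
```

With the sample data above, column field4 then keeps values like "x[[0], [1]]" intact in a single cell.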
I have what feels like a difficult data manipulation problem, and am hoping to get some guidance. Here is a test version of what my current array looks like, as well as what dataframe I hope to obtain:
dput(test)
c("<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>", "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>")
test
[1] "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>"
[2] "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
desired_df
quarter oncourt-id time-minutes time-seconds id
1 1 NA 12 0 1
2       2         NA           10           NA  1
There are a few problems I am dealing with:
the character array "test" has backslashes where there should be nothing, but I was having difficulty removing them with gsub in this form: gsub("\", "", test).
not every element in test has the same number of entries, note in the example that the 2nd element doesn't have time-seconds, and so for the dataframe I would prefer it to return NA.
I have tried using strsplit(test, " ") to first split on spaces, which only exist between different column entries, but then I am returned with a list of lists that is just as difficult to deal with.
You've got XML there. You could parse it, then run rbindlist on the result. This will probably be a lot less hassle than trying to split the name-value pairs as strings.
dflist <- lapply(test, function(x) {
df <- as.data.frame.list(XML::xmlToList(x))
is.na(df) <- df == ""
df
})
data.table::rbindlist(dflist, fill = TRUE)
# quarter oncourt.id time.minutes time.seconds id
# 1: 1 NA 12 0 1
# 2: 2 NA 10 NA 1
Note: You will need the XML and data.table packages for this solution.
Is there a method to concatenate two dfm matrices that have different numbers of both columns and rows? It can be done with some additional coding, so I am not interested in an ad hoc solution but in a general and elegant one, if it exists.
An example:
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
gives an error.
The 'tm' package can concatenate its document-term matrices out of the box, but it is too slow for my purposes.
Also recall that 'dfm' from 'quanteda' is an S4 class.
Should work "out of the box", if you are using the latest version:
packageVersion("quanteda")
## [1] ‘0.9.6.9’
dfm1 <- dfm(c(doc1 = "This is one sample text sample."), verbose = FALSE)
dfm2 <- dfm(c(doc2 = "Surprise! This is one sample text sample."), verbose = FALSE)
rbind(dfm1, dfm2)
## Document-feature matrix of: 2 documents, 6 features.
## 2 x 6 sparse Matrix of class "dfmSparse"
## is one sample surprise text this
## doc1 1 1 2 0 1 1
## doc2 1 1 2 1 1 1
See also ?selectFeatures where features is a dfm object (there are examples in the help file).
Added:
Note that this will correctly align the two texts in a common feature set, unlike the normal rbind methods for matrices, whose columns must match. For the same reasons, rbind() does not actually work in the tm package for DocumentTermMatrix objects with different terms:
require(tm)
dtm1 <- DocumentTermMatrix(Corpus(VectorSource(c(doc1 = "This is one sample text sample."))))
dtm2 <- DocumentTermMatrix(Corpus(VectorSource(c(doc2 = "Surprise! This is one sample text sample."))))
rbind(dtm1, dtm2)
## Error in f(init, x[[i]]) : Numbers of columns of matrices must match.
This almost gets it, but seems to duplicate the repeated feature:
as.matrix(rbind(c(dtm1, dtm2)))
## Terms
## Docs one sample sample. text this surprise!
## 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1
I am creating a data set to compute the aggregate values for different combinations of words using regex. Each row has a unique regex value which I want to check against another dataset and find the number of times it appeared in it.
The first dataset (df1) looks like this :
word1 word2 pattern
air 10 (^|\\s)air(\\s.*)?\\s10($|\\s)
airport 20 (^|\\s)airport(\\s.*)?\\s20($|\\s)
car 30 (^|\\s)car(\\s.*)?\\s30($|\\s)
The other dataset (df2) from which I want to match this looks like
sl_no query
1 air 10
2 airport 20
3 airport 20
3 airport 20
3 car 30
The final output I want should look like
word1 word2 total_occ
air 10 1
airport 20 3
car 30 1
I am able to do this by using apply in R
process <-
function(x)
{
length(grep(x[["pattern"]], df2$query))
}
df1$total_occ <- apply(df1, 1, process)
but find it time-consuming since my dataset is pretty big.
I found out that the "mclapply" function from the "parallel" package can run such things on multiple cores, so I am trying to get lapply working first. It's giving me this error:
lapply(df,process)
Error in x[, "pattern"] : incorrect number of dimensions
Please let me know what changes I should make to run lapply correctly.
Why not just lapply() over the pattern?
Here I've just pulled out your pattern but this could just as easily be df$pattern
pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
"(^|\\s)airport(\\s.*)?\\s20($|\\s)",
"(^|\\s)car(\\s.*)?\\s30($|\\s)")
Using your data for df2
txt <- "sl_no query
1 'air 10'
2 'airport 20'
3 'airport 20'
3 'airport 20'
3 'car 30'"
df2 <- read.table(text = txt, header = TRUE)
Just iterate on pattern directly
> lapply(pattern, grep, x = df2$query)
[[1]]
[1] 1
[[2]]
[1] 2 3 4
[[3]]
[1] 5
If you want more compact output as suggested in your question, you'll need to run lengths() over the returned list (thanks to #Frank for pointing out the new lengths() function). Eg
lengths(lapply(pattern, grep, x = df2$query))
which gives
> lengths(lapply(pattern, grep, x = df2$query))
[1] 1 3 1
You can add this to the original data via
dfnew <- cbind(df1[, 1:2],
Count = lengths(lapply(pattern, grep, x = df2$query)))
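Once the lapply() version works, parallelising it with mclapply() is a small change (a sketch; mc.cores = 2 is an arbitrary choice, and note that on Windows mclapply falls back to serial execution):

```r
library(parallel)

# Same iteration as before, but spread across cores:
# each worker greps one pattern against df2$query
counts <- lengths(mclapply(pattern, grep, x = df2$query, mc.cores = 2))

dfnew <- cbind(df1[, 1:2], Count = counts)
```

The ... arguments of mclapply are passed through to grep exactly as with lapply, so the results should be identical to the serial version.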
I have a data frame A in the following format
user item
10000000 1 # each user is a 8 digits integer, item is up to 5 digits integer
10000000 2
10000000 3
10000001 1
10000001 4
..............
What I want is a list B, with users' names as the name of list elements, list element is a vector of items corresponding to this user.
e.g
B = list(c(1,2,3),c(1,4),...)
I also need to attach names to B. To apply association rule learning, the items need to be converted to characters.
Originally I used tapply(A$user, A$item, c), which makes it incompatible with the association rules package. See my post:
data format error in association rule learning R
But #sgibb's solution also seems to generate an array, not a list.
library("arules")
temp <- as(C, "transactions") # C is output using #sgibb's solution
throws error: Error in as(C, "transactions") :
no method or default for coercing “array” to “transactions”
Have a look at tapply:
df <- read.table(textConnection("
user item
10000000 1
10000000 2
10000000 3
10000001 1
10000001 4"), header=TRUE)
B <- tapply(df$item, df$user, FUN=as.character)
B
# $`10000000`
# [1] "1" "2" "3"
#
# $`10000001`
# [1] "1" "4"
EDIT: I do not know the arules package, but here the solution proposed by #alexis_laz:
library("arules")
as(split(df$item, df$user), "transactions")
# transactions in sparse format with
# 2 transactions (rows) and
# 4 items (columns)
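If you want to sanity-check the result, arules provides inspect() to print the transactions (a quick check; the exact output layout depends on your arules version):

```r
library(arules)

# Coerce the per-user item lists into transactions, as above,
# then print them for inspection
tr <- as(split(df$item, df$user), "transactions")
inspect(tr)
```

Each row of the inspected output should correspond to one user, with that user's items collected in a single itemset.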