I am creating a dataset to compute aggregate values for different combinations of words using regex. Each row has a unique regex pattern that I want to check against another dataset to count how many times it appears there.
The first dataset (df1) looks like this:
word1 word2 pattern
air 10 (^|\\s)air(\\s.*)?\\s10($|\\s)
airport 20 (^|\\s)airport(\\s.*)?\\s20($|\\s)
car 30 (^|\\s)car(\\s.*)?\\s30($|\\s)
The other dataset (df2), which I want to match against, looks like
sl_no query
1 air 10
2 airport 20
3 airport 20
3 airport 20
3 car 30
The final output I want should look like
word1 word2 total_occ
air 10 1
airport 20 3
car 30 1
I am able to do this using apply in R:
process <- function(x) {
  length(grep(x[["pattern"]], df2$query))
}
df1$total_occ <- apply(df1, 1, process)
but I find it time-consuming since my dataset is pretty big.
I found out that the mclapply() function from the parallel package can run such things on multiple cores, so I am trying to get lapply() working first. It gives me an error:
lapply(df,process)
Error in x[, "pattern"] : incorrect number of dimensions
Please let me know what changes I should make to run lapply correctly.
Why not just lapply() over the pattern?
Here I've just pulled out your patterns into a vector, but this could just as easily be df1$pattern:
pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
"(^|\\s)airport(\\s.*)?\\s20($|\\s)",
"(^|\\s)car(\\s.*)?\\s30($|\\s)")
Using your data for df2
txt <- "sl_no query
1 'air 10'
2 'airport 20'
3 'airport 20'
3 'airport 20'
3 'car 30'"
df2 <- read.table(text = txt, header = TRUE)
Just iterate on pattern directly
> lapply(pattern, grep, x = df2$query)
[[1]]
[1] 1
[[2]]
[1] 2 3 4
[[3]]
[1] 5
If you want more compact output as suggested in your question, you'll need to run lengths() over the returned output (thanks to @Frank for pointing out the new function lengths()). E.g.
lengths(lapply(pattern, grep, x = df2$query))
which gives
> lengths(lapply(pattern, grep, x = df2$query))
[1] 1 3 1
You can add this to the original data via
dfnew <- cbind(df1[, 1:2],
Count = lengths(lapply(pattern, grep, x = df2$query)))
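Since the question asked about mclapply(): the lapply() call above drops straight into parallel::mclapply(). A minimal sketch, assuming a Unix-alike system (mclapply() relies on forking, so on Windows it only runs serially with mc.cores = 1):
library(parallel)
## same computation as above, with the patterns spread over two cores
dfnew <- cbind(df1[, 1:2],
               Count = lengths(mclapply(df1$pattern, grep, x = df2$query,
                                        mc.cores = 2)))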
I work with a sheet of data that lists a variety of scientific publications. Rows are publications,
columns are a variety of metrics describing each publication (author names and positions, PubMed IDs, date, etc.).
I want to filter the publications by author and extract parts of them. The caveat is the format:
all author names (5-80 per cell) are lumped together in one cell for each row.
I managed to solve this with str_which, saving the coordinates for each author and extracting them later. This only works manually. When I try to automate the process using a loop over a list of authors, I fail to save the output.
I am at a bit of a loss on how to store the results without overwriting previous ones.
sampleDat <-
data.frame(var1 = c("Doe J, Maxwell M, Kim HE", "Cronauer R, Carst W, Theobald U", "Theobald U, Hey B, Joff S"),
var2 = c(1:3),
var3 = c("2016-01", "2016-03", "2017-05"))
The list of names that I want the coordinates for:
namesOfInterest <-
list(c("Doe J", "Theobald U"))
The manual extraction, requiring me to type the exact name and output object:
Doe <- str_which(sampleDat$var1, "Doe J")
Theobald <- str_which(sampleDat$var1, "Theobald U")
One of many attempts that does not replicate the manual version:
results <- c()
for (i in namesOfInterest) {
results[i] <- str_which(sampleDat$var1, i)
}
The for loop is set up incorrectly (it needs to be something like for(i in 1:n){do something}). Also, even if you fix that, you'll get an error related to the fact that str_which returns a vector of varying length, indicating the position of each of the matches it makes (and it can make multiple matches). Thus, indexing a vector in a loop won't work here, because whenever an author has multiple matches, more than one entry will be saved to a single element, throwing an error.
Solve this by working with lists, because lists can hold vectors of arbitrary length. Index the list with double bracket notation: [[.
library(stringr)
sampleDat <-
data.frame(var1 = c("Doe J, Maxwell M, Kim HE", "Cronauer R, Carst W, Theobald U", "Theobald U, Hey B, Joff S"),
var2 = c(1:3),
var3 = c("2016-01", "2016-03", "2017-05"))
# no need for list here. a simple vector will do
namesOfInterest <- c("Doe J", "Theobald U")
# initialize list
results <- vector("list", length = length(namesOfInterest))
# loop over list, saving output of `str_which` in each list element.
# seq_along(x) is like 1:length(x), but safe when x has length zero
for (i in seq_along(namesOfInterest)) {
results[[i]] <- str_which(sampleDat$var1, namesOfInterest[i])
}
which returns:
> results
[[1]]
[1] 1
[[2]]
[1] 2 3
The way to understand the output above is that the ith element of the list, results[[i]], contains the output of str_which(sampleDat$var1, namesOfInterest[i]), where namesOfInterest[i] is always exactly one author. However, the length of results[[i]] can be greater than one:
> sapply(results, length)
[1] 1 2
indicating that a single author can be mentioned multiple times. In the example above, sapply counts the length of each vector along the list results, showing that namesOfInterest[1] has one paper and namesOfInterest[2] has two.
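Equivalently, the base function lengths() gives these counts directly:
lengths(results)
# [1] 1 2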
Here is another approach. If you want to know which author appears in which publication, you can do the following as well. First, assign a unique ID to each publication. Then split the authors and create a long-format data frame. Group by author and aggregate the publication IDs (pub_id) into a string (character). If you need to extract particular authors, you can subset rows of this data frame (foo).
library(tidyverse)
mutate(sampleDat, pub_id = 1:n()) %>%
separate_rows(var1, sep = ",\\s") %>%
group_by(var1) %>%
summarize(pub_id = toString(pub_id)) -> foo
var1 pub_id
<chr> <chr>
1 Carst W 2
2 Cronauer R 2
3 Doe J 1
4 Hey B 3
5 Joff S 3
6 Kim HE 1
7 Maxwell M 1
8 Theobald U 2, 3
filter(foo, var1 %in% c("Doe J", "Theobald U"))
var1 pub_id
<chr> <chr>
1 Doe J 1
2 Theobald U 2, 3
If you want the index as numeric, you can tweak the idea above and do the following. You can then subset rows for the targeted names with filter(), as shown after the output below.
mutate(sampleDat, pub_id = 1:n()) %>%
separate_rows(var1, sep = ",\\s") %>%
group_by(var1) %>%
summarize(pub_id = list(pub_id)) %>%
unnest(pub_id)
var1 pub_id
<chr> <int>
1 Carst W 2
2 Cronauer R 2
3 Doe J 1
4 Hey B 3
5 Joff S 3
6 Kim HE 1
7 Maxwell M 1
8 Theobald U 2
9 Theobald U 3
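For example, keeping only the targeted authors (a sketch reusing the pipeline above):
mutate(sampleDat, pub_id = 1:n()) %>%
  separate_rows(var1, sep = ",\\s") %>%
  group_by(var1) %>%
  summarize(pub_id = list(pub_id)) %>%
  unnest(pub_id) %>%
  filter(var1 %in% c("Doe J", "Theobald U"))
  var1       pub_id
  <chr>       <int>
1 Doe J           1
2 Theobald U      2
3 Theobald U      3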
How can I filter 180 .csv files in my working directory based on a matching ID in another df named 'Camera' in R? When I try to incorporate my one-by-one file filtering code (see step 3b) into a for loop (see step 3a), I get the error:
Error in paste("i")$SegmentID : $ operator is invalid for atomic vectors.
I'm quite new to for loops, so I really appreciate your help! All 180 files have unique names and differ in length, but they have the same column structure and names. They look like:
df 'File1' df 'Camera'
ID Speed Location ID Time
1 30 4 1 10
2 35 5 3 11
3 40 6 5 12
4 30 7
5 35 8
Filtered df 'File1'
ID Speed Location
1 30 4
3 40 6
5 35 8
These are some samples of my code:
#STEP 1: read files
filenames <- list.files(path="06-06-2017_0900-1200uur",
pattern="*.csv")
# STEP 2: import files
for(i in filenames){
filepath <- file.path("06-06-2017_0900-1200uur",paste(i))
assign(i, read.csv2(filepath, header = TRUE, skip = 1))
}
# STEP 3a: delete rows that do not match ID in df 'Cameras'
for(i in filesnames){
paste("i") <- paste("i")[paste("i")$ID %in% Cameras$ID,]
}
#STEP 3b: filtering one by one
File1 <- File1[File1$ID %in% Camera$ID,]
Here is an approach that makes use of lists (generally a better way to go). First, use the full.names argument in list.files() so the returned paths include the directory:
fns <- list.files(
  path = "06-06-2017_0900-1200uur",
  pattern = "\\.csv$",
  full.names = TRUE
)
Now you have a vector of your filenames. Next, apply read.csv2 to each filename:
dat <- lapply(fns, read.csv2, header = TRUE, skip = 1)
Now you have a list of data frames (the output from calling read.csv2). Finally, apply subset() to each data frame to keep only those rows whose ID matches the Camera data:
out <- lapply(dat, function(x) subset(x, ID %in% Camera$ID))
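If you want to keep track of which filtered data frame came from which file, one small addition (not in the original answer) is to name the list after the files:
## name each element after its source file, e.g. out[["File1.csv"]]
names(out) <- basename(fns)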
If I understand the question, the output should be a data frame from file1 where the ID for all rows matches one of the rows in the Camera file.
This is easily accomplished with the sqldf() package and structured query language.
rawFile1 <- "ID Speed Location
1 30 4
2 35 5
3 40 6
4 30 7
5 35 8
"
rawCamera <- " ID Time
1 10
3 11
5 12
"
file1 <- read.table(textConnection(rawFile1),header=TRUE)
Camera <- read.table(textConnection(rawCamera),header=TRUE)
library(sqldf)
sqlStmt <- "select * from file1 where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")
...and the output:
ID Speed Location
1 1 30 4
2 3 40 6
3 5 35 8
To extend this logic to a number of csv files, first we obtain the list of files from the subdirectory where they are stored using the list.files() function. For example, if the files were in a data subdirectory of the R working directory, one might use the following function call.
theFiles <- list.files("./data/",".csv",full.names=TRUE)
We can read these files with read.table() to create a list() of data frames.
theData <- lapply(theFiles,function(x) {
read.table(x,header=TRUE)})
To combine the files into a single data frame, we execute do.call().
combinedData <- do.call(rbind,theData)
Now we can read the camera data and use sqldf to keep only the IDs matching the camera data.
Camera <- read.table(...,header=TRUE)
library(sqldf)
sqlStmt <- "select * from combinedData where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")
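For comparison, the same filter as the question's step 3b, written in base R against the combined data (no SQL needed):
combinedData[combinedData$ID %in% Camera$ID, ]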
I need to search through a text string for keywords and then assign a category in an R dataframe. This creates a problem where I have keywords from more than one category. I would like to easily extract rows where more than one category is represented so that I can manually evaluate them and assign the correct category.
To do this, I have tried to add a count column to show how many categories are represented in each string.
Using a combination of the two solutions linked below, I have managed to get part of the way, but I am still not getting the correct output:
Partial animal string matching in R
Count occurrences of specific words from a dataframe row in R
I have created an example below. I would like the following rules to be applied:
if the string has cat or lion, wcount gets 1 - only one group represented (feline)
if the string has dog or wolf, wcount gets 1 - only one group represented (canine)
if the string has (cat or lion) AND (dog or wolf), wcount gets 2 - two groups represented (feline and canine)
I can then easily pull out rows where wcount > 1
id <- c(1:5)
text <- c('saw a cat',
'found a dog',
'saw a cat by a dog',
'There was a lion',
'Huge wolf'
)
dataset <- data.frame(id,text)
SearchGrp<-list(c("(cat|lion)", "feline"),
c("(dog|wolf)","canine"))
output_vector<- character (nrow(dataset))
for (i in seq_along(SearchGrp)){
output_vector[grepl(x=dataset$text, pattern = SearchGrp[[i]][1],ignore.case = TRUE)]<-SearchGrp[[i]][2]}
dataset$type<-output_vector
keyword_temp <- unlist(lapply(SearchGrp, function(x) new<-{x[1]}))
keyword<-paste(keyword_temp[1],"|",keyword_temp[2])
library(stringr)
getCount <- function(data,keyword)
{
wcount <- str_count(dataset$text, keyword)
return(data.frame(data,wcount))
}
getCount(dataset,keyword)
Here is a base R method to get the count across types.
dataset$wcnt <- rowSums(sapply(c("dog|wolf", "cat|lion"),
function(x) grepl(x, dataset$text)))
Here, sapply runs through the regular expressions of each type and feeds each one to grepl. This returns a matrix whose columns are logical vectors indicating whether a particular type (e.g., "dog|wolf") was found. rowSums sums the logicals along the rows to get the type-variety count.
This returns
dataset
id text wcnt
1 1 saw a cat 1
2 2 found a dog 1
3 3 saw a cat by a dog 2
4 4 There was a lion 1
5 5 Huge wolf 1
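The rows needing manual review, as described in the question, can then be pulled out with, e.g.:
subset(dataset, wcnt > 1)
#   id               text wcnt
# 3  3 saw a cat by a dog    2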
If you want the intermediate step, returning logical vectors as variables in your data.frame, you would probably want to set your values up in a named vector and then cbind the result.
# construct named vector
myTypes <- c("canine"="dog|wolf", "feline"="cat|lion")
# cbind sapply results of logicals to original data.frame
dataset <- cbind(dataset, sapply(myTypes, function(x) grepl(x, dataset$text)))
This returns
dataset
id text canine feline
1 1 saw a cat FALSE TRUE
2 2 found a dog TRUE FALSE
3 3 saw a cat by a dog TRUE TRUE
4 4 There was a lion FALSE TRUE
5 5 Huge wolf TRUE FALSE
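Since rowSums() treats TRUE as 1, the wcnt column from the first snippet can be recovered from these logical columns:
## rowSums() coerces the logical columns to 0/1, reproducing wcnt
dataset$wcnt <- rowSums(dataset[, c("canine", "feline")])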
I have what feels like a difficult data manipulation problem, and am hoping to get some guidance. Here is a test version of what my current array looks like, as well as what dataframe I hope to obtain:
dput(test)
c("<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>", "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>")
test
[1] "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>"
[2] "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
desired_df
quarter oncourt-id time-minutes time-seconds id
1 1 NA 12 0 1
2 2 NA 10 NA 1
There are a few problems I am dealing with:
the character array "test" has backslashes where there should be nothing, but I was having difficulty using gsub in this format: gsub("\", "", test).
not every element in test has the same number of entries, note in the example that the 2nd element doesn't have time-seconds, and so for the dataframe I would prefer it to return NA.
I have tried using strsplit(test, " ") to first split on spaces, which only exist between different column entries, but then I am returned a list of lists that is just as difficult to deal with.
You've got XML there. You could parse it, then run rbindlist on the result. This will probably be a lot less hassle than trying to split the name-value pairs as strings.
dflist <- lapply(test, function(x) {
df <- as.data.frame.list(XML::xmlToList(x))
is.na(df) <- df == ""
df
})
data.table::rbindlist(dflist, fill = TRUE)
# quarter oncourt.id time.minutes time.seconds id
# 1: 1 NA 12 0 1
# 2: 2 NA 10 NA 1
Note: You will need the XML and data.table packages for this solution.
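One caveat not covered above: xmlToList() returns the attribute values as character, so the columns print like numbers but are not numeric. A small sketch to convert them, using utils::type.convert():
res <- as.data.frame(data.table::rbindlist(dflist, fill = TRUE))
## parse each character column into numeric where possible
res[] <- lapply(res, type.convert, as.is = TRUE)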
I have a large data set in the following format, where each line is a document, encoded as word:frequency-in-the-document pairs separated by spaces; lines can be of variable length:
aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...
E.g., in the first document, "aword" occurs 3 times. What I ultimately want to do is create a little search engine where documents (in the same format) matching a query are ranked; I thought about using tf-idf and the tm package (based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Otherwise, I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch here is that I already have the data indexed in this format (and I'd rather use these data, unless the format is truly alien and cannot be converted).
What I've tried so far is to import the lines and split them:
docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")
I figured I would put something like this in a loop:
doclist2 <- strsplit(doclist, ":", fixed=TRUE)
and somehow get the paired values into an array, and then run a loop that populates a matrix (pre-filled with zeroes: matrix(0,x,y)) by fetching the appropriate values from the word:freq pairs (would that in itself be a good way to construct a matrix?). But this way of converting does not seem like a good approach; the lists keep getting more complicated, and I still wouldn't know how to get to the point where I can populate the matrix.
What I (think I) would need in the end is a matrix like this:
doc1 doc2 doc3 doc4 ...
aword 3 0 0 0
bword 2 4 0 0
cword 15 20 0 0
dword 2 0 0 0
fword 0 1 0 0
...
which I could then convert into a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I don't know what these things are called (I've been googling for a day, on the theme of "term document vector/array/pairs", "two-dimensional array", "list into matrix" etc).
What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution would be too obvious or doable with built-in functions: what is the actual term for the format that I described above, where there are those term:frequency pairs on a line, and each line is a document?
Here's an approach that gets you the output you say you might want:
## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on a spaces and colons
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(x)
cbind(document = names(B)[x], matrix(B[[x]], ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("word", "count"))))))
## Convert to a data.frame
out <- data.frame(out)
out
# document word count
# 1 document1 aword 3
# 2 document1 bword 2
# 3 document1 cword 15
# 4 document1 dword 2
# 5 document2 bword 4
# 6 document2 cword 20
# 7 document2 fword 1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))
## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
# document
# word document1 document2
# aword 3 0
# bword 2 4
# cword 15 20
# dword 2 0
# fword 0 1
Note: "out" is built with matrices rather than repeated calls to read.table, which would be a major bottleneck with bigger data.
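To continue with the linked tutorial, the xtabs result can be converted to a tm object. A minimal sketch, assuming the tm package is installed (as.TermDocumentMatrix() accepts a plain matrix plus a weighting function):
library(tm)
## coerce the xtabs table to a plain matrix, then to a TermDocumentMatrix
m <- as.matrix(xtabs(count ~ word + document, out))
tdm <- as.TermDocumentMatrix(m, weighting = weightTf)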