New to R and struggling to produce results on 10,000 rows; the real data set has about 1M rows. Is there a better option than a loop? I read about vectorization and attempted tapply, with no success.
The data set has a column of free-form text and a category associated with the text. I need to parse the text into distinct words and then compute statistics on word frequency, to see whether words can predict the category with a certain degree of accuracy. I read in the data via read.table and create a data.frame called data.
The function below attempts to parse the text and count the occurrences of each word:
library(plyr)

data <- data.frame(category = c("cat1", "cat2", "cat3", "cat4"),
                   text = c("The quick brown fox",
                            "Jumps over the fence",
                            "The quick car hit a fence",
                            "Jumps brown"))

parsefunc <- function(data){
  finalframe <- data.frame()
  for (i in 1:nrow(data)){
    description <- strsplit(as.character(data[i, 2]), " ")[[1]]
    category <- rep(data[i, 1], length(description))
    worddataframe <- data.frame(description, category)
    finalframe <- rbind(finalframe, worddataframe)
  }
  m1 <- ddply(finalframe, c("description", "category"), nrow)
  m2 <- ddply(m1, 'description', transform, totalcount = sum(V1), percenttotal = V1/sum(V1))
  m3 <- m2[(m2$totalcount > 10) & (m2$percenttotal > 0.8), ]
  m3
}
This will build your finalframe and do something close to your m1, m2, and m3 steps. You'll have to edit it to do exactly what you want. I used a longer data set of 40k rows to make sure it performs all right:
# long data set
data <- data.frame(Category = rep(paste0('cat', 1:4), 10000),
                   Text = rep(c('The quick brown fox',
                                'Jumps over the fence',
                                'The quick car hit a fence',
                                'Jumps brown cars'), 10000),
                   stringsAsFactors = FALSE)
# split into words
wordbag <- strsplit(data$Text, split = ' ')
# find the appropriate category for each word
categoryvar <- rep(data$Category, lengths(wordbag))
# stick them in a data frame and aggregate
newdf <- data.frame(category = categoryvar, word = tolower(unlist(wordbag)))
agg <- aggregate(list(wordcount = rep(1, nrow(newdf))),
                 list(category = newdf$category, word = newdf$word), sum)
# find the total count over the entire data set and merge it in
wordagg <- aggregate(list(totalwordcount = rep(1, nrow(newdf))),
                     list(word = newdf$word), sum)
agg <- merge(x = agg, y = wordagg, by = 'word')
# find percentages and do whatever else you need
agg$percentageofword <- agg$wordcount/agg$totalwordcount
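If you want the same cutoffs as the m3 step in your function (a word seen more than 10 times in total, with one category accounting for more than 80% of its occurrences), the last step is just a subset on agg:
# same thresholds as m3 above
result <- agg[agg$totalwordcount > 10 & agg$percentageofword > 0.8, ]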
I am working with a corpus of speeches spanning several years (aggregated to the person-year level). I want to remove words that occur fewer than 4 times in a given year (not remove them from the whole corpus, but only for the years in which they do not meet the threshold).
I have tried the following:
DT$text <- ifelse(grepl("1998", DT$session), mgsub(DT$text, words_remove_1998, ""), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), str_remove_all(DT$text, words_remove_1998), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, words_remove_1998), DT$text)
and
DT$text <- ifelse(grepl("1998", DT$session), drop_element(DT$text, words_remove_1998), DT$text)
However, none of these seem to work. mgsub() just substitutes the whole speech with "" for 1998, while the other options give error messages. The reason removeWords() does not work is that my words_remove_1998 vector is too large. I have tried to split the word vector into chunks and loop over them (see the code below), but R does not appear to like this either (it runs forever).
group <- 100
n <- length(words_remove_1998)
r <- rep(1:ceiling(n/group),each=group)[1:n]
d <- split(words_remove_1998,r)
for (i in 1:length(d)) {
  DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, c(paste(d[[i]]))), DT$text)
}
Any suggestions for how to solve this?
Thank you for your help!
Reproducible example:
text <- rbind(c("i like ice cream"), c("banana ice cream is my favourite"), c("ice cream is not my thing"))
name <- rbind(c("Arnold Ford"), c("Arnold Ford"), c("Leslie King"))
session <- rbind("1998", "1999", "1998")
DT <- cbind(name, session, text)
words_remove_1998 <- c("like", "ice", "cream")
# desired result after removing the 1998 words:
newtext <- rbind(c("i"), c("banana ice cream is my favourite"), c("is not my thing"))
DT <- cbind(DT, newtext)
My real word vector that I want removed contains 30k elements.
I ended up not using any of the wrapper functions, as none of them could handle the size of the data. Instead I did it the old-fashioned and simple way: separate the text into several rows, count the occurrences of each word per session (year) and person, then remove the rows whose counts fall below a threshold (the same limit I used to build the vector of words I wanted to remove). Lastly, I aggregate the data back to its initial level (person-year).
This only works because I am removing words according to a threshold. If I had a list of words to remove that I could not derive in this way, I would have been in more trouble.
library(tidyr)  # separate_rows()
library(dplyr)  # group_by(), mutate(), left_join()

DT_separate <- separate_rows(DT, text)

df <- DT_separate %>%
  dplyr::group_by(session, text) %>%
  dplyr::mutate(count = dplyr::n())

df <- df[df$count > 5, ]

df <- aggregate(
  text ~ x,  # where x is a person-year id
  data = df,
  FUN = paste, collapse = ' '
)

names(df)[names(df) == 'text'] <- 'text2'

DT <- left_join(DT, df, by = "x")
DT$text <- DT$text2
DT <- DT[, !(colnames(DT) %in% c("text2"))]
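For reference, here is a minimal sketch of the chunked-regex route the question gestures at: build word-boundary patterns from slices of words_remove_1998 and apply plain gsub() to the 1998 rows only. It assumes DT has first been converted to a data.frame with named columns (the reproducible example builds it as a character matrix), and that the removal words are plain tokens with no regex metacharacters; it is an illustration, not the approach used above.
# illustration only: chunked word-boundary removal with gsub()
DT <- as.data.frame(DT, stringsAsFactors = FALSE)
names(DT)[1:3] <- c("name", "session", "text")
chunks <- split(words_remove_1998,
                ceiling(seq_along(words_remove_1998) / 100))
is_1998 <- grepl("1998", DT$session)
for (chunk in chunks) {
  # assumes the words contain no regex metacharacters
  pattern <- paste0("\\b(", paste(chunk, collapse = "|"), ")\\b")
  DT$text[is_1998] <- gsub(pattern, "", DT$text[is_1998])
}
# tidy up the leftover whitespace
DT$text <- trimws(gsub("\\s+", " ", DT$text))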
I currently have the following problem. I work with Web of Science scientific publication and citation data, which has the following structure: a variable "SR" is a string with the name of a publication, and "CR" is a string containing all references cited in that article, separated by ";".
My task is to create an edgelist between all publications and their corresponding citations, where every publication-citation combination sits in its own row. I currently do it with the following code:
# Some minimal data for example
pub <- c("pub1", "pub2", "pub3")
cit <- c("cit1;cit2;cit3;cit4","cit1;cit4;cit5","cit5;cit1")
M <- cbind(pub,cit)
colnames(M) <- c("SR","CR")
# Create an edgelist
cit_el <- data.frame()
for (i in seq(1, nrow(M), 1)) {
  cit <- data.frame(strsplit(as.character(M[i, "CR"]), ";", fixed = TRUE), stringsAsFactors = FALSE)
  colnames(cit)[1] <- c("SR")
  cit$SR_source <- M[i, "SR"]
  cit <- unique(cit)
  cit_el <- rbind(cit_el, cit)
}
However, for large data sets with 10k+ publications (which tend to have 50+ citations each), the script runs for 15+ minutes. I know that loops are usually an inefficient way of coding in R, yet I haven't found an alternative that produces what I want.
Does anyone know a trick to make this faster?
This is my attempt; I haven't compared the speeds of the different approaches yet.
First comes artificial data with 10k pubs, 100k possible citations, and at most 80 citations per pub.
library(data.table)
library(stringr)
pubCount = 10000
citCount = 100000
maxCitPerPub = 80
pubList <- paste0("pub", seq(pubCount))
citList <- paste0("cit", seq(citCount))
cit <- sapply(sample(seq(maxCitPerPub), pubCount, replace = TRUE),
              function(x) str_c(sample(citList, x), collapse = ";"))
data <- data.table(pub = pubList,
                   cit = cit)
For processing, I use stringr::str_split_fixed to split the citations into columns and use data.table::melt to collapse the columns.
temp <- data.table(pub = pubList, str_split_fixed(data$cit, ";", maxCitPerPub))
result <- melt(temp, id.vars = "pub")[, variable:= NULL][value!='']
Not sure if this is any quicker, but if I'm understanding correctly this should give the desired result:
rbindlist(lapply(1:nrow(M), function(i) {
  data.frame(SR_source = M[i, 'SR'], SR = strsplit(M[i, 'CR'], ';')[[1]])
}))
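For what it's worth, here is a rough timing sketch of the two approaches on the artificial data above; it adapts the lapply version to the data object rather than M, and assumes the microbenchmark package is acceptable (system.time() works just as well):
library(microbenchmark)
microbenchmark(
  split_and_melt = {
    # split citations into columns, then melt to long format
    temp <- data.table(pub = pubList, str_split_fixed(data$cit, ";", maxCitPerPub))
    melt(temp, id.vars = "pub")[, variable := NULL][value != ""]
  },
  lapply_rbindlist = {
    # row-by-row strsplit, combined at the end
    rbindlist(lapply(seq_len(nrow(data)), function(i) {
      data.frame(SR_source = data$pub[i], SR = strsplit(data$cit[i], ";")[[1]])
    }))
  },
  times = 5
)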
My first attempt at building a bipartite graph of co-authors' PubMed publications (226 records). The following is a sample of the input file (just one CSV line):
11810598;Chêne G, Angelini E, Cotte L, Lang JM, Morlat P, Rancinan C, May T, Journot V, Raffi F, Jarrousse B, Grappin M, Lepeu G, Molina JM;2002;Mar;Role of long-term nucleoside-analogue therapy in lipodystrophy and metabolic disorders in human immunodeficiency virus-infected patients.
> InputFile = 'JMMolina_PubMed.csv'
# Read the CSV input file into the initial JMMpubs data frame
> setwd('~/Dropbox/R')
> JMMpubs <- read.csv(file = InputFile, header = FALSE, sep = ";", strip.white = TRUE)
> names(JMMpubs) <- c("ID", "AuthList", "Year", "Month", "Title")
# build a new data frame IdAuth with one Id line for each co-author;
# the first article, which has 13 co-authors, will therefore generate 13 lines with the same Id
> Authors <- strsplit(as.character(JMMpubs$AuthList), split = ", ")
> IdAuth <- data.frame(Id = rep(JMMpubs$ID, sapply(Authors,length)), Author = unlist(Authors))
# Now I would like to export this data to Gephi
# The nodes of the graph should be the UNIQUE names in Authors
> UniqueAuthors <- unique(unlist(Authors))
The edges of the graph should be each row of IdAuth. I would like to associate the year of publication (JMMpubs$Year) with each edge, to paint recent edges red and older ones in paler hues.
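A minimal sketch of that export (my own assumption, not code from the answer below): write two CSVs in Gephi's import format, a node table holding both the publication IDs and the unique author names, and an edge table built from IdAuth with the publication year carried along so it can drive the edge colour.
# hypothetical export step, assuming the objects built above
nodes <- data.frame(Id    = c(as.character(JMMpubs$ID), UniqueAuthors),
                    Label = c(as.character(JMMpubs$ID), UniqueAuthors))
edges <- data.frame(Source = IdAuth$Id,
                    Target = IdAuth$Author,
                    Year   = rep(JMMpubs$Year, sapply(Authors, length)))
write.csv(nodes, "nodes.csv", row.names = FALSE)
write.csv(edges, "edges.csv", row.names = FALSE)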
I had a similar issue; my solution was the following.
For this to work you will, as far as I can tell, need to reshuffle your data a bit.
If I understand correctly, you need the authors connected to each ID.
The original answer is in this post https://stackoverflow.com/a/16177624/8080865 by user1317221_G.
I would set up the data frame as:
df3 <- data.frame(Author = c("fawf", "ewew", "wewe", "wrewe", "zare"),
                  Id = c("11", "11", "11"))  # ... etc.

### tnet solution
# create an identifier data frame for each author
dfnames <- data.frame(i = as.numeric(df3$Id),
                      value = df3$Author)

library(tnet)
tdf <- as.tnet(cbind(df3[, 1], df3[, 2]), type = "binary two-mode tnet")
relations <- projecting_tm(tdf, method = "sum")

# match the original names back in
relations[["i"]] <- dfnames[match(relations[['i']], dfnames[['i']]), 'value']
relations[["j"]] <- dfnames[match(relations[['j']], dfnames[['i']]), 'value']
# clean up names
names(relations) <- c("source" , "target", "weight")
I hope this helps you figure out your answer.
I am working on predicting citation counts for articles. The problem is that I need information about journals from ISI Web of Knowledge. They gather this information (journal impact factor, eigenfactor, ...) year by year, but there is no way to download all of the journal information for one year at once. There is only a "mark all" option, which always marks just the first 500 journals in the list (that list can then be downloaded). I am programming this project in R. So my question is: how can I retrieve this information all at once, or at least in an efficient and tidy way? Thank you for any ideas.
I used RSelenium to scrape WOS to get citation data and make a plot similar to this one by Kieran Healy (but mine was for archaeology journals, so my code is tailored to that):
Here's my code (from a slightly bigger project on github):
# set up browser and selenium
library(devtools)
install_github("ropensci/rselenium")
library(RSelenium)
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()
# go to http://apps.webofknowledge.com/
# refine search by journal... perhaps arch?eolog* in 'topic'
# then: 'Research Areas' -> archaeology -> refine
# then: 'Document types' -> article -> refine
# then: 'Source title' -> choose your favourite journals -> refine
# must have <10k results to enable citation data
# click 'create citation report' tab at the top
# do the first page manually to set the 'save file' and 'do this automatically',
# then let loop do the work after that
# before running the loop, get URL of first page that we already saved,
# and paste in next line, the URL will be different for each run
remDr$navigate("http://apps.webofknowledge.com/CitationReport.do?product=UA&search_mode=CitationReport&SID=4CvyYFKm3SC44hNsA2w&page=1&cr_pqid=7&viewType=summary")
Here's the loop to automate collecting data from the next several hundred pages of WOS results...
# Loop to get citation data for each page of results; each iteration will save a txt file.
# I used selectorgadget to check the css ids, they might be different for you.
for(i in 1:1000){
# click on 'save to text file'
result <- try(
webElem <- remDr$findElement(using = 'id', value = "select2-chosen-1")
); if(class(result) == "try-error") next;
webElem$clickElement()
# click on 'send' on pop-up window
result <- try(
webElem <- remDr$findElement(using = "css", "span.quickoutput-action")
); if(class(result) == "try-error") next;
webElem$clickElement()
# refresh the page to get rid of the pop-up
remDr$refresh()
# advance to the next page of results
result <- try(
webElem <- remDr$findElement(using = 'xpath', value = "(//form[@id='summary_navigation']/table/tbody/tr/td[3]/a/i)[2]")
); if(class(result) == "try-error") next;
webElem$clickElement()
print(i)
}
# there are many duplicates, but the code below will remove them
# copy the folder to your hard drive, and edit the setwd line below
# to match the location of your folder containing the hundreds of text files.
Read all text files into R...
# move them manually into a folder of their own
setwd("/home/two/Downloads/WoS")
# get text file names
my_files <- list.files(pattern = ".txt")
# make list object to store all text files in R
my_list <- vector(mode = "list", length = length(my_files))
# loop over file names and read each file into the list
my_list <- lapply(seq(my_files), function(i) read.csv(my_files[i],
skip = 4,
header = TRUE,
comment.char = " "))
# check to see it worked
my_list[1:5]
Combine list of dataframes from the scrape into one big dataframe
# use data.table for speed
install_github("rdatatable/data.table")
library(data.table)
my_df <- rbindlist(my_list)
setkey(my_df)
# filter only a few columns to simplify
my_cols <- c('Title', 'Publication.Year', 'Total.Citations', 'Source.Title')
my_df <- my_df[,my_cols, with=FALSE]
# remove duplicates
my_df <- unique(my_df)
# what journals do we have?
unique(my_df$Source.Title)
Make abbreviations for journal names, make article titles all upper case ready for plotting...
# get names
long_titles <- as.character(unique(my_df$Source.Title))
# get abbreviations automatically, perhaps not the obvious ones, but it's fast
short_titles <- unname(sapply(long_titles, function(i){
theletters = strsplit(i,'')[[1]]
wh = c(1,which(theletters == ' ') + 1)
theletters[wh]
paste(theletters[wh],collapse='')
}))
# manually disambiguate the journals that now only have 'A' as the short name
short_titles[short_titles == "A"] <- c("AMTRY", "ANTQ", "ARCH")
# remove 'NA' so it's not confused with an actual journal
short_titles[short_titles == "NA"] <- ""
# add abbreviations to big table
journals <- data.table(Source.Title = long_titles,
short_title = short_titles)
setkey(journals) # need a key to merge
my_df <- merge(my_df, journals, by = 'Source.Title')
# make article titles all upper case, easier to read
my_df$Title <- toupper(my_df$Title)
## create new column that is 'decade'
# first make a lookup table to get a decade for each individual year
year1 <- 1900:2050
my_seq <- seq(year1[1], year1[length(year1)], by = 10)
indx <- findInterval(year1, my_seq)
ind <- seq(1, length(my_seq), by = 1)
labl1 <- paste(my_seq[ind], my_seq[ind + 1], sep = "-")[-42]
dat1 <- data.table(data.frame(Publication.Year = year1,
decade = labl1[indx],
stringsAsFactors = FALSE))
setkey(dat1, 'Publication.Year')
# merge the decade column onto my_df
my_df <- merge(my_df, dat1, by = 'Publication.Year')
Find the most cited paper by decade of publication...
df_top <- my_df[ave(-my_df$Total.Citations, my_df$decade, FUN = rank) <= 10, ]
# inspecting this df_top table is quite interesting.
Draw the plot in a similar style to Kieran's; this code comes from Jonathan Goodwin, who also reproduced the plot for his field (1, 2):
######## plotting code from from Jonathan Goodwin ##########
######## http://jgoodwin.net/ ########
# format of data: Title, Total.Citations, decade, Source.Title
# THE WRITERS AUDIENCE IS ALWAYS A FICTION,205,1974-1979,PMLA
library(ggplot2)
ws <- df_top
ws <- ws[order(ws$decade,-ws$Total.Citations),]
ws$Title <- factor(ws$Title, levels = unique(ws$Title)) #to preserve order in plot, maybe there's another way to do this
g <- ggplot(ws, aes(x = Total.Citations,
y = Title,
label = short_title,
group = decade,
colour = short_title))
g <- g + geom_text(size = 4) +
facet_grid (decade ~.,
drop=TRUE,
scales="free_y") +
theme_bw(base_family="Helvetica") +
theme(axis.text.y=element_text(size=8)) +
xlab("Number of Web of Science Citations") + ylab("") +
labs(title="Archaeology's Ten Most-Cited Articles Per Decade (1970-)", size=7) +
scale_colour_discrete(name="Journals")
g #adjust sizing, etc.
Another version of the plot, but with no code: http://charlesbreton.ca/?page_id=179
I have multiple data files that I'm interested in cleaning up and then obtaining means from, to run a repeated measures ANOVA on.
Here's example data; in the real data there are 4500 rows and another column called Actresponse, which sometimes contains a 9 that I trim around: https://docs.google.com/file/d/0B20HmmYd0lsFVGhTQ0EzRFFmYXc/edit?pli=1
I have just discovered plyr and how awesome it is for manipulating data, but the way I'm using it right now looks rather stupid to me. There are 4 different things I'm interested in that I want to read into a data frame. I've read them into 4 separate data frames to start; I'm wondering if there is a way I can combine this and read all the means into one data frame (a row for each Reqresponse of each file) with fewer lines of code. Basically, can I achieve what I've done here without rewriting a lot of the same code 4 times?
PMScoreframe <- lapply(list.files(pattern='^[2-3].txt'),function(ff){
data <- read.table(ff, header=T, quote="\"")
data <- data[-c(seq(from = 1, to = 4001, by=500), seq(from = 2, to = 4002, by=500)), ]
ddply(data[data$Reqresponse==9,],.(Condition,Reqresponse),summarise,Score=mean(Score))
})
PMRTframe <- lapply(list.files(pattern='^[2-3].txt'),function(ff){
data <- read.table(ff, header=T, quote="\"")
data <- data[data$RT>200,]
data <- ddply(data,.(Condition),function(x) x[!abs(scale(x$RT)) > 3,])
ddply(data[data$Reqresponse==9,],.(Condition,Reqresponse,Score),summarise,RT=mean(RT))
})
OtherScoreframe <- lapply(list.files(pattern='^[2-3].txt'),function(ff){
data <- read.table(ff, header=T, quote="\"")
data <- data[-c(seq(from = 1, to = 4001, by=500), seq(from = 2, to = 4002, by=500)), ]
select <- rep(TRUE, nrow(data))
index <- which(data$Reqresponse==9|data$Actresponse==9|data$controlrepeatedcue==1)
select[unique(c(index,index+1,index+2))] <- FALSE
data <- data[select,]
ddply(data[data$Reqresponse=="a"|data$Reqresponse=="b",],.(Condition,Reqresponse),summarise,Score=mean(Score))
})
OtherRTframe <- lapply(list.files(pattern='^[2-3].txt'),function(ff){
data <- read.table(ff, header=T, quote="\"")
data <- data[-c(seq(from = 1, to = 4001, by=500), seq(from = 2, to = 4002, by=500)), ]
select <- rep(TRUE, nrow(data))
index <- which(data$Reqresponse==9|data$Actresponse==9|data$controlrepeatedcue==1)
select[unique(c(index,index+1,index+2))] <- FALSE
data <- data[select,]
data <- data[data$RT>200,]
data <- ddply(data,.(Condition),function(x) x[!abs(scale(x$RT)) > 3,])
ddply(data[data$Reqresponse=="a"|data$Reqresponse=="b",],.(Condition,Reqresponse,Score),summarise,RT=mean(RT))
})
I think this deals with what you're trying to do. Basically, I think you need to read all the data in once, then work with that one data frame. There are several questions dealing with how to read it all in; here is how I would do it so that I keep a record of which file each row comes from, which can also be used for grouping:
filenames <- list.files(".", pattern="^[2-3].txt")
import <- mdply(filenames, read.table, header = T, quote = "\"")
import$file <- filenames[import$X1]
Now import is a big dataframe with all your files in it (I'm assuming your pattern recognition etc for reading in files is correct). You can then do summaries based on whatever criteria you like.
I'm not sure what you're trying to achieve in line 3 of your code above, but for the ddply below that, you just need to do:
ddply(import[import$Reqresponse==9,],.(Condition,Reqresponse,file),summarise,Score=mean(Score))
There's so much going on in the rest of your code that it's hard to make out exactly what you want.
I think the important thing is that to make this efficient, and easier to follow, you need to read your data in once, then work on that dataset - making subsets if necessary, doing summary stats or whatever else it is.
As an example of how you can work with this, here's an attempt to deal with your problem of handling trials (rows?) that have Reqresponse == 9 and the following two. There are probably more efficient ways of doing this, but it is loosely based on how you were doing it, to show briefly how to work with the larger data frame. It is now modified to also remove the first two trials of each file:
import.clean <- ddply(import, .(file), function(x) {
  index <- which(x$Reqresponse == 9)
  if (length(index) > 0) {
    index <- unique(c(index, index + 1, index + 2, 1, 2))
  } else {
    index <- c(1, 2)
  }
  x <- x[-index, ]
  return(x)
})
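As a usage sketch (assuming import.clean keeps the same column names as your files, plus the file column added when reading), the "other" score summary from your third block then becomes a single ddply call with file as an extra grouping variable:
# per-file score summary on the cleaned data, mirroring OtherScoreframe above
otherscore <- ddply(import.clean[import.clean$Reqresponse == "a" | import.clean$Reqresponse == "b", ],
                    .(Condition, Reqresponse, file),
                    summarise, Score = mean(Score))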