Fetching a large number of followers and followees in R
I am trying to get the followers of XYZ and sort them in descending order of their own follower count (say, keeping the top 400 followers by follower count). The code is:
library(twitteR)
user <- getUser("XYZ")
followers <- user$getFollowers()
b <- twListToDF(followers)
# Keep only the columns of interest, then sort by follower count (descending).
# Ordering on the numeric column itself avoids calling order() on a data frame.
final_df <- b[, c("id", "name", "screenName", "followersCount")]
sort_fc <- final_df[order(final_df$followersCount, decreasing = TRUE),]
colnames(sort_fc) <- c('id','name','s_name','fol_count')
The first problem is that this does not return all of the followers. I suspect this is due to Twitter's 15-minute API rate-limit window. Is there any way to work around it?
Second, I want to get the followees (friends) of each of these followers and again sort them in descending order of follower count, say keeping the top 100 followees per follower.
The follow-up code below collects the followees of all these followers in one data frame, in which odd-numbered columns hold the followees' screen names and even-numbered columns hold those followees' follower counts:
alpha <- as.character(sort_fc[1:400,]$s_name)
# One slot per follower (the loop runs 400 times, not 10).
user_followees <- vector("list", 400)
# 100 rows x 800 columns: for follower i, column 2i-1 holds followee names
# and column 2i holds the followees' follower counts.
m <- data.frame(matrix(NA, nrow=100, ncol=800))
colnames(m) <- sprintf("%d",1:800)
for(i in 1:400)
{
user <- getUser(alpha[i])
Sys.sleep(61)
user_followees[[i]] <- user$getFriends(n=100)
friends_df <- twListToDF(user_followees[[i]])
# Sort this follower's followees by follower count, descending.
friends_df <- friends_df[order(friends_df$followersCount, decreasing=TRUE),]
j <- 2*i-1
k <- 2*i
# Fill only as many rows as were returned; users with fewer than 100 friends keep NA padding.
rows <- seq_len(nrow(friends_df))
m[rows,j] <- friends_df$screenName
m[rows,k] <- friends_df$followersCount
}
The error I'm getting over here is:
[1] "Unauthorized"
Error in twInterfaceObj$doAPICall(cmd, params, method, ...) :
Error: Unauthorized
I'm using Sys.sleep(61) so that I make no more than one request per minute (the Twitter API limit is 15 requests per 15-minute window, so I assume this is enough).
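An "Unauthorized" response from a per-user call is often what the API returns for protected accounts, although nothing in this thread confirms that this is the cause here. A minimal sketch, under that assumption, of a loop that paces its requests and skips users whose fetch fails instead of aborting. `fetch_all` and `fetch_friends` are hypothetical names: `fetch_friends` stands in for the `getUser(...)` plus `getFriends(n=100)` step above, and `pause` is the number of seconds to wait between calls.

```r
# Call fetch_friends() once per screen name, pausing between calls and
# leaving a NULL entry (instead of stopping) when a call raises an error.
fetch_all <- function(screen_names, fetch_friends, pause = 61) {
  results <- vector("list", length(screen_names))
  names(results) <- screen_names
  for (i in seq_along(screen_names)) {
    res <- tryCatch(
      fetch_friends(screen_names[i]),
      error = function(e) {
        message("Skipping ", screen_names[i], ": ", conditionMessage(e))
        NULL
      }
    )
    # Assigning NULL would delete the slot, so only store real results.
    if (!is.null(res)) results[[i]] <- res
    if (i < length(screen_names)) Sys.sleep(pause)
  }
  results
}

# Offline demonstration with a fake fetcher that fails for one user
# (hypothetical data, no network access).
fake_fetch <- function(name) {
  if (name == "protected_user") stop("Unauthorized")
  data.frame(screenName = paste0(name, "_friend"), followersCount = 10)
}
out <- fetch_all(c("alice", "protected_user", "bob"), fake_fetch, pause = 0)
```

With this shape, the users that could not be fetched are the `NULL` entries of `out`, so they can be listed or retried later without rerunning the whole loop.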
The session info:
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] twitteR_1.1.6 rjson_0.2.12 ROAuth_0.9.3 digest_0.6.3 RCurl_1.95-4.1
[6] bitops_1.0-5
I am a novice in R and need this to work in order to build an interest graph, so I would be glad of any help.
Thanks a lot in advance.
Related
R: SpatialPointsDataFrame code no longer working. Error in !res[[1]] : invalid argument type
I have been following this workflow to convert coordinates from Eastings / Northings to Latitude / Longitude in R. Up until today it has been working fine. Here is a reproducible example:
require(rgdal)
# create test coordinates
x <- 259269
y <- 074728
# create test dataframe
dat <- data.frame(x, y)
class(dat) # "data.frame"
### shortcuts
ukgrid <- "+init=epsg:27700"
latlong <- "+init=epsg:4326"
### Create coordinates object
coords <- cbind(Easting = as.numeric(as.character(x)), Northing = as.numeric(as.character(y)))
class(coords) # matrix
dat_SP <- SpatialPointsDataFrame(coords, data = dat, proj4string = CRS("+init=epsg:27700"))
# Error in !res[[1]] : invalid argument type
# Following steps ----
# Convert
dat_SP_LL <- spTransform(dat_SP, CRS(latlong))
# replace Lat, Long
dat_SP_LL@data$Long <- coordinates(dat_SP_LL)[, 1]
dat_SP_LL@data$Lat <- coordinates(dat_SP_LL)[, 2]
I think this may be related to the proj4string argument, but have been unable to resolve it. Any help is appreciated.
My session info:
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3
[5] readr_1.3.1 tibble_3.0.1 tidyverse_1.3.0 ggspatial_1.1.2
[9] tmap_3.0 rnrfa_2.0.3 gdalUtils_2.0.3.2 zoon_0.6.5
[13] biomod2_3.4.6 sdm_1.0-89 SDMTools_1.1-221.1 SSDM_0.2.8
[17] odbc_1.2.2 DBI_1.1.0 rgeos_0.5-3 rgdal_1.5-8
[21] tidyr_1.1.0 ggplot2_3.3.1 knitr_1.25 raster_3.0-7
[25] sp_1.3-1
I encountered the same problem. The bug is related to sp::CRS. I solved it by reinstalling the 'sp' package.
I am having the same problem. It comes from the CRS function. If you look at the function, it comes from line 34 (marked with asterisks in the code below), but I don't know how to fix it. It might be a bug. I mainly wrote this "answer" to draw more attention, as I don't have the ability to comment.
Edit: You can set doCheckCRSArgs = F in the CRS function, but I still get an error when I try to spTransform:
source crs creation failed: generic error of unknown origin
function (projargs = NA_character_, doCheckCRSArgs = TRUE)
{
    if (!is.na(projargs) && !nzchar(projargs))
        projargs <- NA_character_
    stopifnot(is.logical(doCheckCRSArgs))
    stopifnot(length(doCheckCRSArgs) == 1L)
    stopifnot(is.character(projargs))
    if (!is.na(projargs)) {
        if (length(grep("^[ ]*\\+", projargs)) == 0)
            stop(paste("PROJ4 argument-value pairs must begin with +:", projargs))
    }
    if (!is.na(projargs)) {
        if (length(grep("latlon", projargs)) != 0)
            stop("northings must follow eastings: ", projargs)
        if (length(grep("lonlat", projargs)) != 0) {
            projargs <- sub("lon", "long", projargs)
            warning("'lonlat' changed to 'longlat': ", projargs)
        }
    }
    if (is.na(projargs))
        uprojargs <- projargs
    else uprojargs <- paste(unique(unlist(strsplit(projargs, " "))), collapse = " ")
    if (length(grep("= ", uprojargs)) != 0)
        stop(paste("No spaces permitted in PROJ4 argument-value pairs:", uprojargs))
    if (length(grep(" [:alnum:]", uprojargs)) != 0)
        stop(paste("PROJ4 argument-value pairs must begin with +:", uprojargs))
    if (doCheckCRSArgs) {
        if (!is.na(uprojargs) && requireNamespace("rgdal", quietly = TRUE)) {
            res <- rgdal::checkCRSArgs(uprojargs)
            ********if (!res[[1]])*********
                stop(res[[2]])
            uprojargs <- res[[2]]
        }
    }
    res <- new("CRS", projargs = uprojargs)
    res
}
parLapply and Part of Speech tagging
I am trying to use parLapply along with the openNLP R package to do part-of-speech tagging of a corpus of ~600k documents. However, while I was able to successfully tag a different set of ~90k documents, I get a strange error after ~25 mins of running the same code over the ~600k documents:
Error in checkForRemoteErrors(val) : 10 nodes produced errors; first error: no word token annotations found
The documents are simply digital newspaper articles, where I run the tagger over the body field (after cleaning). This field is nothing but raw text which I save into a list of strings.
Here's my code:
# I set the Java heap size (memory) allocation - I experimented with different sizes
# (JVM flag syntax: no space after the dash, single-letter unit)
options(java.parameters = "-Xmx3g")
library(NLP)
library(openNLP)
library(parallel)
# Convert the corpus into a list of strings
myCorpus <- lapply(contentCleaned, function(x){x <- as.String(x)})
# tag Corpus Function
tagCorpus <- function(x, ...){
  s <- as.String(x) # This is a repeat and may not be required
  WTA <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, WTA, a2)
  a3 <- annotate(s, PTA, a2)
  word_subset <- a3[a3$type == "word"]
  POStags <- unlist(lapply(word_subset$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[word_subset], POStags), collapse = " ")
  list(text = s, POStagged = POStagged, POStags = POStags, words = s[word_subset])
}
# I have 12 cores in my box
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()-2))
# I tried both exporting the word token annotator and not
clusterEvalQ(cl, {
  library(openNLP);
  library(NLP);
  PTA <- Maxent_POS_Tag_Annotator();
  WTA <- Maxent_Word_Token_Annotator()
})
# Each cluster node has the following description:
[[1]]
An annotator inheriting from classes
  Simple_Word_Token_Annotator Annotator
with description
  Computes word token annotations using the Apache OpenNLP Maxent tokenizer employing the default model for language 'en'.
clusterEvalQ(cl, sessionInfo())
# ClusterEvalQ outputs for each worker:
[[1]]
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] NLP_0.1-11 openNLP_0.2-6
loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-4 compiler_3.4.4 parallel_3.4.4 rJava_0.9-10
packageDescription('openNLP') # Version: 0.2-6
packageDescription('parallel') # Version: 3.4.4
startTime <- Sys.time()
print(startTime)
corpus.tagged <- parLapply(cl, myCorpus, tagCorpus)
endTime <- Sys.time()
print(endTime)
endTime - startTime
Kindly note that I have consulted many web forums, and the one that stood out is: parallel parLapply setup. However, this doesn't seem to address my issue. Furthermore, I am confused about why the setup works with the ~90k articles but not the ~600k articles (I have a total of 12 cores and 64GB memory). Any advice is much appreciated.
I have managed to get this to work by directly using the qdap package (https://github.com/trinker/qdap) by Tyler Rinker. It took ~20 hours to run. Here's how the function pos from the qdap package does this in a one-liner:
corpus.tagged <- qdap::pos(myCorpus, parallel = TRUE, cores = detectCores()-2)
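The thread does not establish why the original `parLapply` run failed. One possibility, stated here only as an assumption, is that some documents were empty or whitespace-only after cleaning, so that the tokenizer found no words in them. A minimal base-R sketch of screening such documents out before tagging (`has_tokens` is a hypothetical helper, not part of openNLP or qdap):

```r
# TRUE for a document that contains at least one non-whitespace character.
has_tokens <- function(doc) {
  txt <- as.character(doc)
  length(txt) == 1L && !is.na(txt) && grepl("[^[:space:]]", txt)
}

# Made-up sample corpus containing empty, blank and missing entries.
docs <- list("A short article.", "", "   \n\t", NA_character_, "Another one.")

keep     <- vapply(docs, has_tokens, logical(1))
taggable <- docs[keep]    # documents that are safe to pass to the tagger
skipped  <- which(!keep)  # positions of the documents that were filtered out
```

Keeping `skipped` makes it possible to line the tagged results back up with the original corpus afterwards.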
Splitting a dataframe by group and printing group-specific rows to individual HTML files using pander and rapport
Say I have a tall dataframe with many rows per group, like so:
df <- data.frame(group = factor(rep(c("a","b","c"), each = 5)),
                 v1 = sample(1:100, 15, replace = TRUE),
                 v2 = sample(1:100, 15, replace = TRUE),
                 v3 = sample(1:100, 15, replace = TRUE))
What I want to do is split df into length(levels(df$group)) separate dataframes, e.g.,
df_a <- df[df$group == "a",]; df_b <- df[df$group == "b",] ; ...
And then print each dataframe in a separate HTML/PDF/DOCX file (probably using Rmarkdown and knitr). I want to do this because I have a large dataframe and want to create a personalized report for each group a, b, c, etc. Thanks.
Update (11/18/14)
Following @daroczig's advice in this thread and another thread, I attempted to make my own template that would simply print a nicely formatted table of all columns and rows per group, to substitute into the "correlations" template call in the original sapply() function. I want to make my own template rather than just printing the nice table (e.g., the answer @Thomas graciously provided) because I'd like to build additional customization into the template once the simple printing works. Anyway, I've certainly butchered it:
<!--head
meta:
  title: Sample Report
  author: Nicapyke
  description: This is a demo
  packages: ~
inputs:
- name: eachgroup
  class: character
  standalone: TRUE
  required: TRUE
head-->
### Records received up to present for Group <%= eachgroup %>
<%= pandoc.table(df[df$group == eachgroup, ]) %>
Then, after saving that as groupreport.rapport in my working directory, I wrote the following R code, modeled after @daroczig's response:
allgroups <- unique(df$group)
library(rapport)
for (eachgroup in allgroups) {
  rapport.docx("FILEPATHHERE", eachgroup = eachgroup)
}
I received the error:
Error in openFileInOS(f.out) : File not found!
I'm not sure what happened. I see from the pander documentation that this means it's looking for a system file, but that doesn't mean much to me.
Anyway, this error doesn't get at the root of the problem, which is 1) what should go in the input section of the custom template YAML header, and 2) which R code should go in the rapport template vs. in the R script. I realize I may be making a number of errors that reveal my lack of experience with rapport and pander. Thanks for your patience!
N.B.:
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.8 dplyr_0.3.0.2 rapport_0.51 yaml_2.1.13 pander_0.5.1 plyr_1.8.1 lattice_0.20-29
loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1 digest_0.6.4 evaluate_0.5.5 formatR_1.0 grid_3.1.2
[7] lazyeval_0.1.9 magrittr_1.0.1 parallel_3.1.2 Rcpp_0.11.3 reshape_0.8.5 stringr_0.6.2
[13] tools_3.1.2
A slightly off-topic, but still R/markdown, one-liner for separate reports with report templates:
> library(rapport)
> sapply(levels(df$group), function(g) rapport.html('correlations', data = df[df$group == g, ], vars = c('v1', 'v2', 'v3')))
Exported to */tmp/RtmpYyRLjf/rapport-correlations-1-0.[md|html]* under 0.683 seconds.
Exported to */tmp/RtmpYyRLjf/rapport-correlations-2-0.[md|html]* under 0.888 seconds.
Exported to */tmp/RtmpYyRLjf/rapport-correlations-3-0.[md|html]* under 1.063 seconds.
The rapport package can run (predefined or custom) report templates on any (sub)dataset in markdown, then export it to HTML/docx/PDF/other formats. For a quick demo, I've uploaded the resulting documents:
rapport-correlations-1-0.html
rapport-correlations-2-0.html
rapport-correlations-3-0.html
You can do this with by (or split) and xtable (from the xtable package). Here I create xtable objects of each subset, and then loop over them to print them to file:
library('xtable')
s <- by(df, df$group, xtable)
for(i in seq_along(s)) print(s[[i]], file = paste0('df',names(s)[i],'.tex'))
If you use the stargazer package, you can get a nice summary of the dataframe instead of the dataframe itself in just one line:
library('stargazer')
by(df, df$group, stargazer, out = paste0('df',unique(df$group),'.tex'))
You should be able to easily include each of these files in, e.g., a PDF report. You could also use HTML markup using either xtable or stargazer.
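The splitting step common to these answers can also be sketched in base R alone. The example below is an illustration only: it writes one CSV per group to a temporary directory, the `df_<group>.csv` file-name pattern is an assumption, and the `write.csv` call could be swapped for whichever table writer is preferred.

```r
set.seed(1)
df <- data.frame(group = factor(rep(c("a", "b", "c"), each = 5)),
                 v1 = sample(1:100, 15, replace = TRUE),
                 v2 = sample(1:100, 15, replace = TRUE),
                 v3 = sample(1:100, 15, replace = TRUE))

# One data frame per level of df$group, named by level.
pieces <- split(df, df$group)

# Write each piece to its own file and collect the paths, named by group.
out_dir <- tempdir()
paths <- vapply(names(pieces), function(g) {
  path <- file.path(out_dir, paste0("df_", g, ".csv"))
  write.csv(pieces[[g]], path, row.names = FALSE)
  path
}, character(1))
```

Because `split()` names the list by group level, the same names can drive file names, report titles, or template inputs.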
data.table binary search fails if j refers to a numeric (extra) key column?
Consider a data.table
library(data.table)
dt = data.table(id = rep(c('a','b'), each=2), val = rep(c(1,2,3), times=c(1,2,1)))
# > dt
#    id val
# 1:  a   1
# 2:  a   2
# 3:  b   2
# 4:  b   3
that we want to subset by id. If we key by that column alone, no problem.
setkey(dt, id)
dt[J('a'), val]
#    id val
# 1:  a   1
# 2:  a   2
dt[J('a'), range(val)]
#    id V1
# 1:  a  1
# 2:  a  2
But if dt happens to be keyed also by the numeric column val, then that extra key column no longer seems to work in j.
setkey(dt, id, val)
dt[J('a'), val]
#    id val
# 1:  a   1
dt[J('a'), range(val)]
#    id V1
# 1:  a  1
# 2:  a  1
## I would have expected same results here as when key(dt) == "id" only
Some values seem to be missing now... unless we resort to vector scan (which can be slow, and returns vectors here)
dt[id == 'a', val]
# [1] 1 2
dt[id == 'a', range(val)]
# [1] 1 2
or unless we explicitly set by (which throws a warning).
dt[J('a'), range(val), by = id]
#    id V1
# 1:  a  1
# 2:  a  2
# Warning message:
# In `[.data.table`(dt, J("a"), range(val), by = id) :
#   by is not necessary in this query; it equals all the join columns
#   in the same order. j is already evaluated by group of x that each
#   row of i matches to (by-without-by, see ?data.table). Setting by
#   will be slower because a subset of x is taken and then grouped
#   again. Consider removing by, or changing it.
What's going on please?
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] data.table_1.9.2
loaded via a namespace (and not attached):
[1] plyr_1.8.1 Rcpp_0.11.0 reshape2_1.2.2 stringr_0.6.2
[5] tools_3.0.1
Update: Added test in commit 1309 to catch regressions on this issue at any later stages. Closes #734. From NEWS:
Added tests (1351.1 and 1351.2) to catch any future regressions on particular case of binary search based subset reported here on SO. Thanks to Scott for the post. The regression was contained to v1.9.2 AFAICT. Closes #734.
Scott, thanks for the report (and the follow-up comment). It seems to have occurred in 1.9.2 alone. I tested it on the current development version v1.9.3 and things seem to work as intended. Please check the README file for installation instructions.
I've added an issue, #734, to remind us to add a test covering this usage so that we don't miss it again during any future changes.
Clustering with CLARA in R: Mysterious dataset causes problems. Why?
I have discovered a dataset that causes R's CLARA algorithm to fail (i.e. enter an apparently infinite loop). Why does it fail? What I find strange is not only that the dataset is extremely boring (basically just a normal variable with a dummy-coded categorical variable), but also that changing nearly ANYTHING about the dataset or the call to cluster::clara avoids the problem. Anyone have insight here?
I apologize for not including the random seed and program I used to generate this dataset; unfortunately, the code that generated it is a rather large set of programs drawing from parallel random number generators (generated using the rlecuyer package). I haven't been able to replicate the generation process on a standard PC, so the raw dataset is given below. The bug does appear to replicate across operating systems and versions of R. An example sessionInfo() is shown below.
One possible clue is that the algorithm seemed slightly more likely to fail when there was no true group structure in the continuous random variable (first column). Even so, I had to go through thousands of iterations before finding a dataset that failed.
# self contained example library(cluster) strangeData <- c(-0.866285627488327,1,0,0 ,-0.510951849076121,0,0,1 ,-1.17167514108559,0,1,0 ,-0.376762528272389,1,0,0 ,0.78146409125879,0,1,0 ,1.61782054548358,0,0,1 ,1.13870520606875,1,0,0 ,-1.13486644861899,0,0,1 ,-0.206927567036607,0,0,0 ,-0.207869481212349,0,0,1 ,-0.744628008025243,0,1,0 ,0.731433838031448,0,0,1 ,1.62303204478421,0,1,0 ,0.336307634488861,1,0,0 ,1.81605021862297,1,0,0 ,-0.750411588420933,1,0,0 ,-0.0238025825151884,0,1,0 ,-1.15317047678752,0,0,1 ,-1.16535479706466,0,1,0 ,-0.902957290471365,0,1,0 ,0.757365905566296,0,0,0 ,0.327956357831982,1,0,0 ,-1.21209036793528,1,0,0 ,-0.827813655803123,0,0,1 ,1.29462065083151,1,0,0 ,-1.18875004333554,0,0,1 ,0.30436829237833,1,0,0 ,0.686761122942272,0,0,1 ,0.348318524543705,0,0,0 ,0.00772114748248065,0,1,0 ,-0.0654531320746511,0,1,0 ,-0.871825468631758,1,0,0 ,-0.24260740899128,1,0,0 ,1.6316283625827,0,0,0 ,-0.157380038359798,1,0,0 ,-1.13594228151195,0,0,1 ,0.397846678544776,0,0,0 ,-1.79295787039272,0,1,0 ,0.659894781274625,0,0,0 ,-0.170073590508157,0,0,0 ,-0.813943074670405,0,0,0 ,-0.382343008302335,0,1,0 ,1.00965255945214,1,0,0 ,-0.325407591983216,0,0,0 ,-0.192348595055317,0,1,0 ,-0.776963720822561,0,0,1 ,0.412818265336597,0,0,0 ,-0.372215722029941,0,1,0 ,-0.626596060936548,1,0,0 ,1.77998358890641,0,0,1 ,1.28471804407001,1,0,0 ,2.30521501151449,0,0,1 ,-0.0434857353479769,0,1,0 ,0.129337095832911,0,0,0 ,0.490060486314237,0,0,0 ,0.0219239584969476,0,1,0 ,2.16246213170809,1,0,0 ,-1.6463259663705,0,1,0 ,-1.25945578842163,0,1,0 ,-0.0508902150466008,1,0,0 ,-0.761232947984076,0,0,1 ,-1.0844017226841,1,0,0 ,0.768163697492781,1,0,0 ,0.0512967198361093,0,0,0 ,0.496505407614941,0,1,0 ,-0.875095658442669,0,0,0 ,0.356165420230564,0,1,0 ,-1.48736915282316,0,0,0 ,0.605835916959207,1,0,0 ,0.150248856364222,0,0,1 ,0.436072890825432,0,0,0 ,-1.2477307042077,0,1,0 ,-0.925825618397317,0,0,1 ,-0.865719152703743,0,0,0 ,1.49904943876406,0,0,0 ,-2.000710979317,0,0,0 ,-0.592352714624327,0,0,0 
,0.393289445874399,0,1,0 ,-0.394123590756249,1,0,0 ,-0.352673107317825,0,0,1 ,0.695357507195699,0,0,1 ,-0.0040347449129765,0,1,0 ,-0.127791150003591,0,0,1 ,0.474647838412314,1,0,0 ,0.621985643440749,0,1,0 ,-0.0963944386034655,0,0,0 ,-0.0769711593827088,0,0,1 ,1.52062637216907,0,1,0 ,-0.0171131048538696,0,0,1 ,-0.614781568866571,0,0,1 ,-0.428078937440796,0,0,1 ,1.46996872384325,0,0,1 ,1.33611500009041,0,1,0 ,0.0849546480618341,1,0,0 ,1.16189138672992,0,1,0 ,-0.0892214257894575,0,0,1 ,0.639365524696794,0,0,1 ,0.601135082783273,1,0,0 ,0.507344807255368,1,0,0 ,-0.933432615481704,0,1,0 ,1.0605270726388,1,0,0 ,-0.846884297959019,0,1,0 ,0.069757909351883,0,0,0 ,0.971079516592758,0,0,1 ,2.13895542286282,0,1,0 ,-1.07521573310271,0,0,0 ,-0.578762770935955,0,0,1 ,0.138245534658141,0,1,0 ,0.472499258515774,0,1,0 ,-0.955048693548485,0,1,0 ,-0.541270774805467,0,0,0 ,-0.965725893217509,0,1,0 ,1.23937418822633,0,0,0 ,0.572605101793754,0,0,1 ,1.25059599135706,0,0,0 ,0.224476582002008,0,1,0 ,-0.0466083676818374,1,0,0 ,1.58857119911245,0,0,1 ,-1.85121285531852,1,0,0 ,0.506446004536966,0,0,0 ,0.00529547247104709,0,1,0 ,-0.67761610257275,0,1,0 ,-0.050520398113479,1,0,0 ,1.09977836234059,0,0,0 ,-0.729074108182476,0,0,0 ,-0.0290602916500079,0,0,0 ,0.610785941871284,1,0,0 ,0.224899327520079,1,0,0 ,1.01797352389614,0,0,0 ,0.258960174013477,1,0,0 ,-0.0173109255243736,1,0,0 ,1.15675575673647,0,0,0 ,-1.40075651100777,0,1,0 ,1.29973838021484,0,0,0 ,-2.19964416776033,0,1,0 ,1.43949364450352,0,0,1 ,1.07342747391928,0,0,1 ,-1.20048803661996,0,0,0 ,-0.346455976764648,0,0,0 ,-0.480774942257163,0,1,0 ,-0.743609323884268,0,1,0 ,1.54622738113784,0,1,0 ,-0.403898403935946,0,0,0 ,-0.833433145942438,1,0,0 ,-0.219164470718774,1,0,0 ,0.12696814119287,1,0,0 ,-0.322315905835392,0,0,0 ,-0.350597318362615,1,0,0 ,-0.657947808709103,0,0,1 ,-1.3599769112632,0,1,0 ,-0.101630823838928,0,0,0 ,-1.34660046284197,1,0,0 ,-2.42751597395552,0,1,0 ,-0.890211205480196,1,0,0 ,-1.24558895213474,0,0,1 ,0.18700333641652,1,0,0 
,2.5082251629743,1,0,0 ,1.33292648093108,1,0,0 ,-0.38816796929655,0,0,1 ,-0.340299350153559,0,0,1 ,-0.360408688017778,0,0,1 ,0.184527132500589,0,0,1 ,-0.945284123204619,0,0,0 ,-0.447915791428278,0,0,0 ,0.960837246768647,0,1,0 ,0.261390181284487,0,0,1 ,1.99391928572872,1,0,0 ,1.63328530659267,0,1,0 ,0.841678645192176,0,1,0 ,0.807348533253569,0,1,0 ,-1.69473655701435,0,0,0 ,-2.95460714558617,1,0,0 ,-1.64969453370585,0,0,1 ,-0.484414672869815,0,1,0 ,-0.526272870810604,0,0,0 ,0.051830062343814,0,0,1 ,1.26507516617792,1,0,0 ,-1.41617967114385,0,0,0 ,0.236754759425436,0,0,0 ,0.904212403266786,0,0,1 ,2.42448681816761,0,0,1 ,-1.57101731154911,1,0,0 ,-0.471525015273919,0,0,1 ,-0.777691881154585,0,0,0 ,-0.33500864971305,1,0,0 ,-0.804758127002811,1,0,0 ,1.31836909690498,0,0,1 ,0.0609487604963864,0,0,0 ,-0.443936034513707,0,0,0 ,0.740834846236723,0,0,1 ,-2.14041576208016,0,0,0 ,-0.650625084741614,0,0,1 ,-0.314809179050786,0,0,0 ,0.623191053756259,0,0,0 ,-0.861527811786575,0,0,0 ,-0.495712634432739,0,0,0 ,-1.15138427720019,0,0,0 ,-0.0368657006311513,0,0,1 ,0.808099625439217,0,1,0 ,2.04358789173449,0,1,0 ,-0.361230312246911,0,1,0 ,-1.43757174215673,0,0,1 ,-0.0126368333853388,0,0,1 ,-0.55062321407905,0,0,1 ,-0.598669467196556,0,0,0 ,0.553538760296604,0,1,0 ,0.0331404550587805,0,1,0 ,-1.01641207011743,0,0,0 ,-0.969763233966749,1,0,0 ,-0.115985687817581,0,0,0 ,-1.44923317467671,0,1,0 ,0.58359088336307,0,0,1 ,1.02177931523912,0,0,0 ,-0.772903258762401,0,1,0 ,-0.833203951806085,1,0,0 ,-0.121756851474623,0,0,1 ,0.333735580183243,0,0,0 ,0.841447311750965,0,0,1 ,-0.202542681685737,1,0,0 ,-0.363835977102042,0,0,1 ,-0.763208653073085,0,0,0 ,-0.404233770949925,0,0,0 ,-0.626967850505697,0,0,1 ,1.51902583641424,0,0,1 ,0.152334670353581,1,0,0 ,-0.37793809368425,0,1,0 ,0.958745025661511,1,0,0 ,-1.44235110382376,0,1,0 ,0.0234173973335753,0,0,1 ,0.381965794662393,0,0,0 ,-0.987421186811441,1,0,0 ,0.680391323949574,1,0,0 ,-0.200195019135693,1,0,0 ,-1.11687467522201,0,0,1 ,0.484415933230296,0,0,0 
,0.66465930418491,1,0,0 ,0.135520835347015,0,0,0 ,-0.0135948380436641,0,0,0 ,-0.174610953673981,1,0,0 ,-0.385258324842704,0,0,1 ,0.0736764406605341,0,1,0 ,0.433497723474607,0,0,1 ,-1.57962309060262,0,0,1 ,0.630090656840841,1,0,0 ,0.973500666410068,0,1,0 ,-0.509883991863271,0,1,0 ,0.776678864294335,1,0,0 ,1.06439033722933,0,0,1 ,0.631572825803999,0,0,1 ,0.736226086691134,0,0,0 ,-1.23321773011943,1,0,0 ,-0.388575379622945,0,1,0 ,1.0632151634506,0,0,1 ,1.05814098386583,0,0,0 ,0.408184424450004,0,0,1 ,0.531436889048738,1,0,0 ,2.20381966762526,0,0,1 ,2.11422588577572,0,0,1 ,-0.531704962557588,0,0,0 ,1.34561800389927,0,0,1 ,0.273769623933743,1,0,0 ,-0.372910934670834,0,0,1 ,-0.470566520010902,1,0,0 ,-0.75477217389578,0,1,0 ,-0.501842228377673,0,0,1 ,-1.25532930322808,0,0,0 ,-0.286477761094775,1,0,0 ,-0.823694457831787,0,0,1 ,0.797314566796799,0,0,1 ,-0.0600523761224243,0,1,0 ,0.657605378335186,0,0,0 ,-0.725821254759635,0,1,0 ,-1.1218762657447,0,1,0 ,1.02390098776472,1,0,0 ,-0.125900354616813,0,1,0 ,-0.110600677983877,0,0,1 ,0.362848077657443,0,1,0 ,1.75245676080733,0,0,1 ,-1.53945786644643,0,0,0 ,-1.69041842719508,1,0,0 ,2.21366606351434,0,1,0 ,1.59672297659057,0,0,1 ,1.36855862766991,0,0,0 ,-0.59109080681349,1,0,0 ,0.344628944020705,0,0,0 ,-0.547730633367544,1,0,0 ,3.28229260418702,0,1,0 ,0.186377905391717,0,0,1 ,-0.85647024545773,0,0,0 ,-2.10613283819929,1,0,0 ,1.0659329233981,1,0,0 ,0.197594622321815,0,0,0 ,-1.5165240972921,0,0,1 ,-1.10653359569001,0,0,0 ,-0.702450947236347,0,1,0 ,0.561612881169714,0,1,0 ,0.0618497778342936,0,1,0 ,-1.61352989112514,0,0,1 ,-0.380008609813976,0,1,0 ,0.485668785864403,1,0,0 ,1.44073309607887,0,0,0 ,-0.631502388683978,0,1,0 ,0.924636979461146,0,1,0 ,-0.385053889933482,1,0,0 ,1.7335479754834,1,0,0 ,0.804525293681982,0,0,1 ,-0.991585017557505,0,1,0 ,1.35969572742296,0,1,0 ,0.76841232389423,0,1,0 ,-0.133117031661697,1,0,0 ,-1.57093944605932,1,0,0 ,0.0463315954106568,1,0,0 ,0.0400112040904778,0,1,0 ,0.542965613556832,0,0,1 
,-0.951132065936325,0,1,0 ,1.0236860490532,1,0,0 ,-1.65166967479243,1,0,0 ,-0.31690124841045,1,0,0 ,-0.236230990447918,1,0,0 ,2.15914099741492,1,0,0 ,1.31763979031855,0,0,1 ,-0.607279994041944,0,0,1 ,1.13453737158436,0,0,1 ,0.6701910306058,0,1,0 ,-0.564864937395887,1,0,0 ,0.926205103954728,1,0,0 ,-0.980200111995638,0,1,0 ,0.437252455381278,0,1,0 ,-1.13416837126914,0,1,0 ,0.780011292450117,0,0,0 ,0.558179248192722,0,0,1 ,-0.788560632585296,0,0,1 ,1.42188103521719,1,0,0 ,-1.25403745185199,0,1,0 ,-0.961543192416867,0,0,1 ,0.272005872845942,0,0,0 ,-0.3754160451173,1,0,0 ,-1.40457905909558,0,0,0 ,0.0564752608649083,0,1,0 ,0.0717952713169958,1,0,0 ,1.27361011457374,0,1,0 ,-1.13881319792865,0,0,1 ,-1.43642545765708,0,0,0 ,1.56419289350468,0,0,0 ,-0.539863901494306,1,0,0 ,-0.649284486134111,0,0,1 ,-0.816485163999317,0,0,0 ,-0.586168990002293,0,0,0 ,1.33236676326001,0,0,0 ,0.897539764461409,1,0,0) Here is the analysis part strangeData <- matrix(strangeData,byrow=T,ncol=4) # Remove any column (and probably any row) and it works # Decrease samples below 42 and it works # Set sampsize to 199 or 201 and it works # Change distance to euclidean and it works # set pamLike to F and it works resClaraManDum <- clara( x=strangeData ,k=2 ,metric='manhattan' ,samples=42 ,sampsize=200 ,pamLike=T ) And relevant information about my system: sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] cluster_1.14.4 loaded via a namespace (and not attached): [1] tools_3.0.2