r: very large matrix.csr to matrix: Integer Overflow

The following code (specifically as.matrix) fails only when opening very large libsvm files. It works fine on smaller files.
rawmforCluster=read.matrix.csr(filePath)
sparseforCluster=rawmforCluster$x
str(sparseforCluster)
sparseMatrixforCluster=as.matrix(sparseforCluster)
The structure of sparseforCluster is
Formal class 'matrix.csr' [package "SparseM"] with 4 slots
..@ ra : num [1:4860285] 1 1 2 1 1 1 1 1 1 1 ...
..@ ja : int [1:4860285] 77 668 716 1086 1202 1306 1527 2184 2545 2729 ...
..@ ia : int [1:659095] 1 18 25 26 31 36 52 59 67 72 ...
..@ dimension: int [1:2] 659094 3778
The error I get is
Error in double(nrow * ncol) : vector size cannot be NA
In addition: Warning message:
In nrow * ncol : NAs produced by integer overflow
Question
How do I coerce the data into a matrix or (second best) a data.table?
(or should I be seeking another solution?)
Update
I have found that the standard solution is to reduce the size of the matrix by removing sparse (low frequency) terms. This is not an option in my case as some low frequency terms may be highly relevant to some subsets.
I have also read about the bigmemory package. However, this does not seem to work with matrix.csr
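The overflow in the warning can be reproduced from the dimensions in the str() output alone. The following is a minimal sketch of the arithmetic, not a fix; the variable names are made up for illustration:

```r
# Dimensions taken from the str() output of the matrix.csr object
n_row <- 659094L
n_col <- 3778L

# Integer multiplication overflows and yields NA (with a warning),
# which is what double(nrow * ncol) then complains about
overflowed <- suppressWarnings(n_row * n_col)
is.na(overflowed)                # TRUE

# The same product computed in double precision
n_cells <- as.numeric(n_row) * n_col
n_cells                          # 2490057132
n_cells > .Machine$integer.max   # TRUE: the limit is 2147483647

# Approximate size of a dense double matrix with that many cells
dense_gib <- n_cells * 8 / 2^30
round(dense_gib, 1)              # about 18.6 GiB
```

So even if the multiplication did not overflow, the dense matrix would need roughly 18.6 GiB of memory, which is worth keeping in mind when deciding whether a dense representation is the right target.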

Related

if statement with three outcomes

I'd like to make a new column in which the value depends on other columns.
There are three possible outcomes
Distance < Min_disp = 0
Distance < Max_disp = Distance
Distance > Max_disp = Max_disp
I have tried using an if-statement, with multiple outcomes, but receive a warning.
Warning messages:
1: In if (Noord_2015_moved$Distance < Noord_2015_moved$Min_disp) { :
the condition has length > 1 and only the first element will be used
2: In if (Noord_2015_moved$Distance < Noord_2015_moved$Max_disp) { :
the condition has length > 1 and only the first element will be used
And indeed it only prints "Max_disp".
This is the code I've used
if (Noord_2015_moved$Distance < Noord_2015_moved$Min_disp) {
0
} else if (Noord_2015_moved$Distance < Noord_2015_moved$Max_disp) {
Noord_2015_moved$Distance
} else {
Noord_2015_moved$Max_disp
}
I have also tried running it in three separate steps, but then I run into the problem that I don't know how to tell R to only apply part of the df$column, because now I get the error
number of items to replace is not a multiple of replacement length
Noord_2015_moved <- mutate(Noord_2015_moved, Actual_disp = ifelse(Distance < Min_disp, 0, NA))
Noord_2015_moved$Actual_disp[Noord_2015_moved$Distance < Noord_2015_moved$Max_disp] <- Noord_2015_moved$Distance
Noord_2015_moved$Actual_disp[is.na(Noord_2015$Actual_disp)] <- Noord_2015_moved$Max_disp
And this is my data
'data.frame': 301 obs. of 15 variables:
$ Transmitter: Factor w/ 18 levels "A69-1601-22313",..: 1 1 1 1 1 1 1 2 2 2 ...
$ Date : Date, format: "2015-03-03" "2015-03-08" "2015-03-11" "2015-05-18" ...
$ Date_time : Factor w/ 279544 levels "1-03-15 0:00",..: 198302 258702 18684 85140 190788 182641 208718 26315 198759 205744 ...
$ Receiver : Factor w/ 17 levels "uitzetpunt 1-noord",..: 8 5 8 5 6 7 6 8 5 8 ...
$ Station : Factor w/ 17 levels "10","11","12",..: 15 12 15 12 13 14 13 15 12 15 ...
$ Traject : Factor w/ 53 levels "","10-10","10-9",..: 53 50 41 50 40 44 45 53 50 41 ...
$ Interval : num 83.4 12.7 42.6 25.2 217.4 ...
$ Distance : num 1540 6480 6480 6480 4690 4220 4220 1540 6480 6480 ...
$ Min_speed : num 0.02 0.51 0.15 0.26 0.02 0.73 0.52 0.01 0.02 0.02 ...
$ Min_speed2 : num 0.00556 0.14167 0.04167 0.07222 0.00556 ...
$ Length : int 47 47 47 47 47 47 47 45 45 45 ...
$ Activity : chr "Low" "Low" "Low" "Low" ...
$ Moved : chr "Yes" "Yes" "Yes" "Yes" ...
$ Min_disp : num 160 4080 1200 2080 160 5840 4160 80 160 160 ...
$ Max_disp : num 240 6120 1800 3120 240 8760 6240 120 240 240 ...
if() isn't vectorized. It works on a single condition, not a whole vector. That's what the warning "the condition has length > 1 and only the first element will be used" is telling you. You could use if() for this purpose, but you would need to put it in a for loop to check each row one at a time. Doable, but not efficient.
ifelse is a vectorized version of if, and is good for a problem like this. Here you would nest two ifelse calls:
Noord_2015_moved$Actual_disp = ifelse(
Noord_2015_moved$Distance < Noord_2015_moved$Min_disp, 0,
ifelse(Noord_2015_moved$Distance < Noord_2015_moved$Max_disp, Noord_2015_moved$Distance,
Noord_2015_moved$Max_disp
))
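To see the nested ifelse at work on something small, here is a sketch on a three-row toy data frame. The data frame toy and its values are invented for illustration; only the column names follow the question:

```r
# Toy data: one row for each of the three possible outcomes
toy <- data.frame(
  Distance = c(100, 200, 5000),
  Min_disp = c(160, 160, 160),
  Max_disp = c(240, 240, 240)
)

# ifelse() evaluates the conditions element by element
toy$Actual_disp <- ifelse(
  toy$Distance < toy$Min_disp, 0,
  ifelse(toy$Distance < toy$Max_disp, toy$Distance, toy$Max_disp)
)

toy$Actual_disp  # 0 200 240
```

Row 1 falls below Min_disp and becomes 0, row 2 lies between the two bounds and keeps its Distance, and row 3 exceeds Max_disp and is capped at it.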
I see you already use mutate in one of your attempts. If you're using dplyr, mutate adds a column to the data frame and means you don't need to type out the data frame's name to reference existing columns. This code is equivalent to my code above:
Noord_2015_moved = Noord_2015_moved %>% mutate(
Actual_disp = ifelse(Distance < Min_disp, 0,
ifelse(Distance < Max_disp, Distance, Max_disp)
)
)
In addition to nesting ifelse calls, you can use dplyr::case_when, which handles multiple outcomes more readably:
Noord_2015_moved = Noord_2015_moved %>% mutate(
Actual_disp = case_when(
Distance < Min_disp ~ 0,
Distance < Max_disp ~ Distance,
Distance > Max_disp ~ Max_disp,
TRUE ~ NA_real_
)
)
Here is a short reference.

r: Reading libsvm files with library (e1071)

I have generated a libsvm file in scala using the org.apache.spark.mllib.util.MLUtils package.
The file format is as follows:
49.0 109:2.0 272:1.0 485:1.0 586:1.0 741:1.0 767:1.0
49.0 109:2.0 224:1.0 317:1.0 334:1.0 450:1.0 473:1.0 592:1.0 625:1.0 647:1.0 681:1.0 794:1.0
17.0 26:1.0 109:1.0 143:1.0 198:2.0 413:1.0 476:1.0 582:1.0 586:1.0 611:1.0 629:1.0 737:1.0
12.0 255:1.0 394:1.0
etc etc
I read the file into r using e1071 package as follows:
m= read.matrix.csr(filename)
The structure of the resultant matrix.csr is as follows:
$ x:Formal class 'matrix.csr' [package "SparseM"] with 4 slots
.. ..@ ra : num [1:31033] 2 1 1 1 1 1 2 1 1 1 ...
.. ..@ ja : int [1:31033] 109 272 485 586 741 767 109 224 317 334 ...
.. ..@ ia : int [1:2996] 1 7 18 29 31 41 49 65 79 83 ...
.. ..@ dimension: int [1:2] 2995 796
$ y: Factor w/ 51 levels "0.0","1.0","10.0",..: 45 45 10 5 42 25 23 41 23 25 ...
When I convert to a dense matrix with as.matrix(m) it produces one column and two rows, each with an uninterpretable (by me) object in it.
When I simply try to save the matrix.csr back to file (without doing any intermediate processing), I get the following error:
Error in abs(x) : non-numeric argument to mathematical function
I am guessing that the libsvm format is incompatible but I'm really not sure.
Any help would be much appreciated.
OK, the short of it:
m= read.matrix.csr(filename)$x
because read.matrix.csr returns a list with two elements: the matrix and a vector.
In other words, the target/label/class is separated out from the features matrix.
NOTE for fellow R neophytes: in CRAN documentation, the "Value" subheading describes the return value of the function:
Value
If the data file includes no y variable, read.matrix.csr returns
an object of class matrix.csr,
else a list with components:
x object of class matrix.csr
y vector of numeric values or factor levels, depending on fac
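A small sketch of what this looks like in practice, assuming the e1071 and SparseM packages are installed. The three-line libsvm file written to a temporary path is invented for illustration:

```r
library(e1071)    # read.matrix.csr
library(SparseM)  # matrix.csr class and its as.matrix method

# Write a tiny libsvm-format file: label followed by index:value pairs
tmp <- tempfile(fileext = ".libsvm")
writeLines(c("49.0 1:2.0 3:1.0",
             "17.0 2:1.0",
             "12.0 1:1.0 3:1.0"), tmp)

res <- read.matrix.csr(tmp)

# Because the file has labels, the result is a list, not a matrix.csr
features <- res$x   # sparse feature matrix (class matrix.csr)
labels   <- res$y   # factor of labels (default fac = TRUE)

# Convert only the x component to a dense matrix
dense <- as.matrix(features)
dim(dense)          # 3 3
```

Calling as.matrix on the whole list is what produces the odd two-row, one-column result described in the question. Presumably the abs(x) error on saving has the same cause, since the list rather than its x component was passed on.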

Measuring distance between centroids R

I want to create a matrix of the distance (in metres) between the centroids of every country in the world. Country names or country IDs should be included in the matrix.
The matrix is based on a shapefile of the world downloaded here: http://gadm.org/version2
Here is some rough info on the shapefile I'm using (I'm using shapefile@data$UN as my ID):
> str(shapefile@data)
'data.frame': 174 obs. of 11 variables:
$ FIPS : Factor w/ 243 levels "AA","AC","AE",..: 5 6 7 8 10 12 13
$ ISO2 : Factor w/ 246 levels "AD","AE","AF",..: 61 17 6 7 9 11 14
$ ISO3 : Factor w/ 246 levels "ABW","AFG","AGO",..: 64 18 6 11 3 10
$ UN : int 12 31 8 51 24 32 36 48 50 84 ...
$ NAME : Factor w/ 246 levels "Afghanistan",..: 3 15 2 11 6 10 13
$ AREA : int 238174 8260 2740 2820 124670 273669 768230 71 13017
$ POP2005 : int 32854159 8352021 3153731 3017661 16095214 38747148
$ REGION : int 2 142 150 142 2 19 9 142 142 19 ...
$ SUBREGION: int 15 145 39 145 17 5 53 145 34 13 ...
$ LON : num 2.63 47.4 20.07 44.56 17.54 ...
$ LAT : num 28.2 40.4 41.1 40.5 -12.3 ...
I tried this:
library(rgdal) # provides readOGR
library(rgeos) # provides gCentroid
shapefile <- readOGR("./Map/Shapefiles/World/World Map", layer = "TM_WORLD_BORDERS-0.3") # Read in world shapefile
row.names(shapefile) <- as.character(shapefile@data$UN)
centroids <- gCentroid(shapefile, byid = TRUE, id = as.character(shapefile@data$UN)) # create centroids
dist_matrix <- as.data.frame(geosphere::distm(centroids))
The result looks something like this:
V1 V2 V3 V4
1 0.0 4296620.6 2145659.7 4077948.2
2 4296620.6 0.0 2309537.4 219442.4
3 2145659.7 2309537.4 0.0 2094277.3
4 4077948.2 219442.4 2094277.3 0.0
1) Instead of the first column (1,2,3,4) and row (V1, V2, V3, V4) I would like to have country IDs (shapefile@data$UN) or names (shapefile@data$NAME). How does that work?
2) I'm not sure of the unit of the values returned. Is it metres, kilometres, etc.?
3) Is geosphere::distm preferable to geosphere::distGeo in this instance?
1.
This should work to add the column and row names to your matrix, just as you did when adding the row names to shapefile:
crnames<-as.character(shapefile@data$UN)
colnames(dist_matrix)<- crnames
rownames(dist_matrix)<- crnames
2.
The default distance function in distm is distHaversine (check ?distm for your installed version), whose Earth-radius argument r is given in metres. So I assume the output is in metres as well.
3.
Look at the documentation for distGeo and distHaversine and decide the level of accuracy you want in your results. To look at the docs in R itself just enter ?distGeo.
edit: answer to q1 may be wrong since the matrix data may be aggregated, looking at alternatives
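As a self-contained sketch of the naming step, assuming the geosphere package is installed: the three longitude/latitude pairs and UN codes below are the first three rows of the str() output in the question, while the object names pts, ids and d are made up. The distance function is passed explicitly so the result does not depend on the package default:

```r
library(geosphere)

# First three centroids from the question's str() output: (LON, LAT)
pts <- rbind(c(2.63, 28.2),
             c(47.4, 40.4),
             c(20.07, 41.1))
ids <- c("12", "31", "8")   # matching UN codes, in the same row order

# Pairwise great-circle distances in metres
d <- distm(pts, fun = distHaversine)

# Label rows and columns with the country IDs
dimnames(d) <- list(ids, ids)

d["12", "31"]   # distance between UN 12 and UN 31, in metres
```

The labels are only correct if ids is in the same order as the rows passed to distm, which is the concern raised in the edit note above.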

Subsetting two corresponding variables if both of them are true

I'm trying to create a vector with two columns that contain the following strings given that the data in BOTH columns are true. I tried, unsuccessfully with:
CrimesAndLocation <- table(c(Crimes_Data$Primary.Type=="ARSON","ASSAULT","BATTERY","BURGLARY","HOMICIDE","HUMAN TRAFFICKING","KIDNAPPING","ROBBERY",Crimes_Data$Location.Description=="RESIDENCE")))
I'm trying to get an output where:
Primary.Type, is one of the 8 specific felonies listed above. Thus, it should not show all 32 possible felonies, just out of the 8 listed above
Location.Description, is RESIDENCE
This is the goal of what I'm trying to do:
COLUMN 1 COLUMN 2
"ARSON" "RESIDENCE"
"KIDNAPPING" "RESIDENCE"
"BATTERY" "RESIDENCE"
"HOMICIDE" "RESIDENCE"
"ASSAULT" "RESIDENCE"
...
UPDATE: > str(Crimes_Data) :
'data.frame': 293036 obs. of 22 variables:
$ ID : int 10248194 10251162 10248198 10248242 10248228 10248223 10248192 10248157 10249529 10252453 ...
$ Case.Number : Factor w/ 293015 levels "F218264","HA168845",..: 292354 292350 292363 292359 292368 292366 292351 292348 292364 292816 ...
$ Date : Factor w/ 124573 levels "01/01/2015 01:00:00 AM",..: 94544 94542 94539 94536 94535 94535 94535 94535 94529 94528 ...
$ Block : Factor w/ 27983 levels "0000X E 100TH PL",..: 13541 7650 22635 1317 13262 9623 12854 8232 24201 14279 ...
$ IUCR : Factor w/ 334 levels "0110","0130",..: 49 139 321 33 251 82 38 282 97 38 ...
$ Primary.Type : Factor w/ 32 levels "ARSON","ASSAULT",..: 3 7 24 3 18 31 3 13 17 3 ...
$ Description : Factor w/ 313 levels "$500 AND UNDER",..: 111 281 119 35 131 1 260 193 274 260 ...
$ Location.Description: Factor w/ 121 levels "","ABANDONED BUILDING",..: 95 19 110 48 97 110 106 110 110 99 ...
$ Arrest : Factor w/ 2 levels "false","true": 1 1 2 1 2 2 1 2 2 1 ...
$ Domestic : Factor w/ 2 levels "false","true": 2 1 1 1 1 1 1 1 1 1 ...
$ Beat : int 835 333 733 634 1121 1432 1024 735 414 2535 ...
$ District : int 8 3 7 6 11 14 10 7 4 25 ...
$ Ward : int 18 5 6 21 27 1 22 17 7 26 ...
$ Community.Area : int 70 43 68 49 23 22 30 67 46 23 ...
$ FBI.Code : Factor w/ 26 levels "01A","01B","02",..: 11 17 26 6 21 8 11 25 9 11 ...
$ X.Coordinate : int 1154209 1190610 1172166 1176493 1153156 1159961 1154332 1163770 1193570 NA ...
$ Y.Coordinate : int 1852321 1856955 1858813 1841948 1904451 1915955 1887190 1857568 1852889 NA ...
$ Year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ Updated.On : Factor w/ 442 levels "01/01/2015 12:39:07 PM",..: 288 288 288 288 288 288 288 288 288 288 ...
$ Latitude : num 41.8 41.8 41.8 41.7 41.9 ...
$ Longitude : num -87.7 -87.6 -87.6 -87.6 -87.7 ...
$ Location : Factor w/ 173646 levels "","(41.644604096, -87.610728247)",..: 31318 40835 45858 15601 116871 140063 84837 42961 32176 1 ...
This is a good job for the dplyr package. The filter function will filter a data frame according to any number of logical expressions that you feed it. The following should work for you:
library(dplyr)
filter(
Crimes_Data,
Primary.Type %in% c("ARSON", "ASSAULT", "BATTERY",
"BURGLARY", "HOMICIDE", "HUMAN TRAFFICKING",
"KIDNAPPING", "ROBBERY"),
Location.Description == "RESIDENCE"
)
If you'd rather not use dplyr, you can do it the old fashioned way with base R, like this:
type.bool <- Crimes_Data$Primary.Type %in% c("ARSON", "ASSAULT", "BATTERY",
"BURGLARY", "HOMICIDE",
"HUMAN TRAFFICKING", "KIDNAPPING",
"ROBBERY")
location.bool <- Crimes_Data$Location.Description == "RESIDENCE"
Crimes_Data[type.bool & location.bool, ]
Instead of an integer vector of indices, the [ subsetting operator can take a boolean vector. In that case, it will only return the rows of the data frame for which the corresponding elements of the boolean vector are TRUE.
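Here is a minimal sketch of that logical subsetting on a four-row toy data frame. The data frame crimes and its rows are invented; the column names and the felony list follow the question:

```r
# Toy data with the two columns of interest
crimes <- data.frame(
  Primary.Type = c("ARSON", "THEFT", "BATTERY", "ROBBERY"),
  Location.Description = c("RESIDENCE", "RESIDENCE", "STREET", "RESIDENCE"),
  stringsAsFactors = TRUE
)

felonies <- c("ARSON", "ASSAULT", "BATTERY", "BURGLARY", "HOMICIDE",
              "HUMAN TRAFFICKING", "KIDNAPPING", "ROBBERY")

# One logical value per row: TRUE only if both conditions hold
keep <- crimes$Primary.Type %in% felonies &
  crimes$Location.Description == "RESIDENCE"
keep  # TRUE FALSE FALSE TRUE

# Keep the matching rows and just the two columns
result <- crimes[keep, c("Primary.Type", "Location.Description")]
```

THEFT is dropped because it is not in the felony list, and BATTERY is dropped because it did not occur at a residence.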
Thanks for the str() aka "structure" output update, it makes it clearer to be able to help you.
To obtain a list of observations where
these eight felonies : "ARSON","ASSAULT","BATTERY","BURGLARY","HOMICIDE","HUMAN TRAFFICKING","KIDNAPPING","ROBBERY"
occurred at RESIDENCE
Try breaking up the task into slightly smaller parts:
Step 1:
ViolentCrimes = subset(Crimes_Data, Primary.Type == "ARSON" | Primary.Type == "ASSAULT" | Primary.Type == "BATTERY" | Primary.Type == "BURGLARY" | Primary.Type == "HOMICIDE" | Primary.Type == "HUMAN TRAFFICKING" | Primary.Type == "KIDNAPPING" | Primary.Type == "ROBBERY")
Step 2:
ViolentCrimesResidence = subset(ViolentCrimes, Location.Description == "RESIDENCE", select = c(Primary.Type, Location.Description))
Result:
ViolentCrimesResidence holds two columns with Column 1 being a list of Primary.Type and column 2 is Location.Description, where Column 1 only has values from the eight felonies of interest and column2 only "RESIDENCE"
Explanation
Step 1:
From R website's examples about subset and OR condition:
PineTreeGrade3Data<-subset(StudentData, SchoolName=="Pine Tree Elementary" | Grade==3)
Whereas we have:
ViolentCrimes = subset(Crimes_Data, Primary.Type == "ARSON" |
we use the subset() function
Crimes_Data is the existing data frame as input
next are the conditions, which simply take the pattern VectorName == "Some string", in this case Primary.Type == "ARSON"
But we want observations for the other types too, so use the "or" condition to include them
in R, "or" is written with the | symbol. So we use this repeatedly to include each of the other felonies of interest
the equal sign = is synonymous with <- here and assigns (saves) this subset result to a new data frame we call ViolentCrimes.
note I prefer using = because it takes fewer keystrokes to type than <-; either is correct
Step 2:
ViolentCrimesResidence = subset(ViolentCrimes, Location.Description == "RESIDENCE", select = c(Primary.Type, Location.Description))
the input is the ViolentCrimes data frame we made previously, which contains only the eight violent crimes, the eight felonies "ARSON", "ASSAULT"...
now we are interested in, out of all these violent crimes, which ones occurred at home, so use condition Location.Description == "RESIDENCE"
but a further option of subset() we didn't use before, is the select = ... option
we do a select = c(Variable1, Variable2) to choose just the Primary.Type and Location.Description vectors
note that if you don't want to limit the columns (aka variables), you simply omit the select = ... option
thus it saves this new subset into ViolentCrimesResidence
So, now in R when you:
ViolentCrimesResidence
You will see a two-column output you wanted of the eight felonies of interest, that happened in RESIDENCE.

R String split by spaces

I have a dataframe containing 2 columns, of which one of them is a string that contain spaces. I have used strsplit to split this string into character tokens based on spaces. I defined a function for that (split), which I want to apply on the entire data frame:
split <- function (str) strsplit(str, "\\s+")[[1]]
data.frame(raw_rt_data, apply(raw_rt_data$stimulusitem1,2, split) )
Here is more info about my dataframe :
str(raw_rt_data)
'data.frame': 5372 obs. of 2 variables:
$ stimulusitem1: Factor w/ 4313 levels "ABILITY TAX",..: 2483 3645 1339 2455 2769 3033 3998 2712 1313 250 ...
$ latency : int 4051 1266 2145 2959 1086 2956 3814 4924 4771 2654 ...
> head(raw_rt_data)
stimulusitem1 latency
1 MORNING BUBBLE 4051
2 SYSTEM MEN 1266
3 FRIEND PAIN 2145
4 MOMMY TINYURL 2959
5 PEACE INFORMATION 1086
6 PUBLIC SCRITS 2956
The problem is that executing the above code yields an error:
Error in apply(raw_rt_data$stimulusitem1, 2, split) :
dim(X) must have a positive length
What am I doing wrong? The desired result should be 2 new added columns: one containing the first token and the other one containing the 2nd token.
Any help appreciated
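One possible sketch, not a definitive answer: apply() expects an array with dimensions, while raw_rt_data$stimulusitem1 is a plain factor vector, which would explain the dim(X) error. strsplit() is already vectorized over a character vector, so no apply() is needed. The code below rebuilds the first three rows from head() and assumes every string holds exactly two tokens; the column names token1 and token2 are made up:

```r
# First three rows from the question's head() output
raw_rt_data <- data.frame(
  stimulusitem1 = factor(c("MORNING BUBBLE", "SYSTEM MEN", "FRIEND PAIN")),
  latency = c(4051L, 1266L, 2145L)
)

# strsplit() needs character input, so convert the factor first;
# the result is a list with one character vector per row
tokens <- strsplit(as.character(raw_rt_data$stimulusitem1), "\\s+")

# Assumption: each element has exactly two tokens, so rbind gives 2 columns
token_matrix <- do.call(rbind, tokens)

raw_rt_data$token1 <- token_matrix[, 1]
raw_rt_data$token2 <- token_matrix[, 2]

raw_rt_data$token1  # "MORNING" "SYSTEM" "FRIEND"
```

If some strings have more or fewer than two tokens, the rbind step would recycle or misalign values, so the token counts should be checked first (for example with lengths(tokens)).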
