How are vector operations performed on the 20newsgroups_vectorized data set?

When I fetch the 20newsgroups_vectorized data with
import pandas as pd
from sklearn.datasets import fetch_20newsgroups_vectorized
newsgroups = fetch_20newsgroups_vectorized(subset='all')
labels = newsgroups.target_names
target = newsgroups.target
target = pd.DataFrame([labels[i] for i in target], columns=['label'])
data = newsgroups.data
data is a <class 'scipy.sparse.csr.csr_matrix'> with the shape
(18846, 130107)
How can I subset the data by target names (for example, extract only 'rec.sport.baseball') and use vector operations on those sparse row vectors (for example, calculate the mean vector or the distances)?

Unfortunately, fetch_20newsgroups_vectorized has no option for selecting documents by target name. fetch_20newsgroups does (its categories argument), but you then have to vectorize the data yourself.
Here is how you can do it:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
newsgroups_train = fetch_20newsgroups(subset='all',
categories=['rec.sport.baseball'])
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
print(vectors.shape)
# (994, 13986)
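As a complement, the rows of an already vectorized CSR matrix can also be selected with a boolean mask built from the target array, after which the mean vector and distances can be computed without densifying the rows. The following is only a minimal sketch on a tiny synthetic matrix: the helper names (rows_for_label, mean_vector, euclidean_to) and the toy data are assumptions made for illustration, not part of scikit-learn.

```python
import numpy as np
from scipy.sparse import csr_matrix


def rows_for_label(data, target, target_names, name):
    """Return the sparse rows of `data` whose label equals `name`."""
    idx = list(target_names).index(name)
    mask = np.asarray(target) == idx
    return data[mask]  # boolean row indexing keeps the CSR format


def mean_vector(rows):
    """Column-wise mean of sparse rows, returned as a dense 1-D array."""
    return np.asarray(rows.mean(axis=0)).ravel()


def euclidean_to(rows, vec):
    """Euclidean distance from each sparse row to the dense vector `vec`."""
    # ||r - v||^2 = ||r||^2 - 2 r.v + ||v||^2, so rows never become dense
    sq_norms = np.asarray(rows.multiply(rows).sum(axis=1)).ravel()
    cross = rows @ vec
    return np.sqrt(np.maximum(sq_norms - 2.0 * cross + vec @ vec, 0.0))


# Toy stand-ins for newsgroups.data, newsgroups.target, newsgroups.target_names
data = csr_matrix(np.array([[1.0, 0.0, 0.0],
                            [0.0, 2.0, 0.0],
                            [3.0, 0.0, 0.0],
                            [0.0, 0.0, 4.0]]))
target = np.array([1, 0, 1, 2])
target_names = ['alt.atheism', 'rec.sport.baseball', 'sci.space']

baseball_rows = rows_for_label(data, target, target_names, 'rec.sport.baseball')
centroid = mean_vector(baseball_rows)
distances = euclidean_to(baseball_rows, centroid)
```

With the real data set, the same three calls would take newsgroups.data, newsgroups.target and newsgroups.target_names in place of the toy arrays.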

Related

Is there a way to write a function in R that writes a dataframe like in this script?

I wrote a script in R which reads a csv file to a dataframe, then manipulates that dataframe to create a new one. The current script is as follows:
#read in the csv file for a node
node00D = read.csv("00D.csv")
#create a new data frame with the specific measurement we want to decrease file size
node00D_smaller = node00D[node00D$sensor=="BMP180",]
#remove the columns that we don't need
keeps = c("node_id","timestamp","parameter","value")
node00D = node00D_smaller[keeps]
#fix row names
rownames(node00D) = 1:nrow(node00D)
#convert the timestamp column from a factor to a date and then truncate to the hour
node00D$timestamp = as.POSIXlt(node00D$timestamp)
node00D$timestamp = trunc(node00D$timestamp,"hour")
rm(keeps,node00D_smaller)
#get average temperature for each hour
library(plyr)
node00D$timestamp = as.character(node00D$timestamp)
node00D_means = ddply(node00D, .(timestamp), summarize,
mean=round(mean(value),2))
node00D$timestamp = as.POSIXlt(node00D$timestamp)
node00D_means$timestamp = as.POSIXlt(node00D_means$timestamp)
write.csv(node00D_means,"00D_Edit.csv")
#load lat long data
latlong = read.csv("Node.Lat.Lon.csv")
node00D_means$Node = "00D"
node00D_means = merge(node00D_means,latlong,by="Node")
I have to do this for up to 100 nodes, and so I tried writing a function with argument 'node' which would perform this. In this example, I would input getNodeData(00D). However, when I do this there are issues actually creating the data frames. The function runs but does not create any new objects. Is there a way to turn this script into a function so that I can more easily perform it 100 times?
You could do something like
fun1 <- function(node.num){
### (1). Load the data
dat <- read.csv(paste0(node.num, ".csv"))
dat_smaller <- dat[dat$sensor=="BMP180",]
### (2). Here proceed with your code and substitute node00D(_smaller) by dat(_smaller) ###
# ------------------------------------------------------------------- #
### (3). Then define dat_mean and save the .csv
library(plyr)
dat$timestamp=as.character(dat$timestamp)
dat_means=ddply(dat,.(timestamp),summarize,mean=round(mean(value),2))
dat$timestamp=as.POSIXlt(dat$timestamp)
dat_means$timestamp=as.POSIXlt(dat_means$timestamp)
write.csv(dat_means,paste0(node.num, "_Edit.csv"))
### (4). And similarly for lat long
latlong=read.csv("Node.Lat.Lon.csv")
dat_means$Node=node.num
dat_means=merge(dat_means,latlong,by="Node")
}
This function saves the .csv file but does not explicitly return anything. If you want it to return something, e.g. dat_means, add the line return(dat_means) before the closing brace and assign the result when you call the function.
Appendix
To run the function for all of your nodes, you can for instance use a loop:
### (1.) First, create an object containing all your nodes, e.g.
nodes.vector <- c("00D", ...)
### (2.) Run a loop, or use one of the apply functions
for(k in seq_along(nodes.vector)){
fun1(nodes.vector[k])
}
# Or
sapply(nodes.vector, fun1)
I don't know your data, but if the nodes are listed in latlong$Node, you can use that column as your nodes.vector.
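The underlying pattern (a function that returns its result, and a caller that collects one result per node) is not specific to R. As a rough illustration only, here is the same idea in Python using the standard library; the timestamp format, the second sensor name and the in-memory stand-ins for the per-node files are all assumptions, not details from the question.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def hourly_means(csv_file, sensor="BMP180"):
    """Return {hour: mean value} for one node's readings of `sensor`."""
    sums = defaultdict(lambda: [0.0, 0])  # hour -> [running total, count]
    for row in csv.DictReader(csv_file):
        if row["sensor"] != sensor:
            continue
        ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
        hour = ts.replace(minute=0, second=0)  # truncate to the hour
        acc = sums[hour]
        acc[0] += float(row["value"])
        acc[1] += 1
    return {hour: round(total / n, 2) for hour, (total, n) in sums.items()}


# Hypothetical in-memory stand-ins for files such as "00D.csv"
fake_files = {
    "00D": "node_id,timestamp,sensor,parameter,value\n"
           "00D,2017-01-01 10:05:00,BMP180,temperature,20.0\n"
           "00D,2017-01-01 10:45:00,BMP180,temperature,22.0\n"
           "00D,2017-01-01 11:10:00,BMP180,temperature,30.0\n"
           "00D,2017-01-01 11:20:00,OTHER,temperature,99.0\n",
}

# The caller keeps the returned values; nothing depends on side effects
results = {node: hourly_means(io.StringIO(text))
           for node, text in fake_files.items()}
```

Because the function returns its result, the caller decides where each node's table ends up, which is the piece the original function was missing.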

Igraph Write Communities

We are using igraph and R to detect communities in a network. The detection using cluster_walktrap is working great:
e <- cluster_walktrap(g)
com <-membership(e)
print(com)
write.csv2(com, file ="community.csv", sep=",")
The result prints fine with print, showing each vertex number and the community it belongs to, but writing the result to the csv file fails with the error: cannot coerce class "membership" to a data.frame
How can I write the result of membership to a file?
Thanks
Convert the membership object to numeric. write.csv and write.csv2 expect a data frame or matrix. The command tries to coerce the object into a data frame, which the class membership resists. Since membership really is just a vector, you can convert it to numeric. Either:
write.csv2(as.numeric(com), file ="community.csv")
Or:
com <- as.numeric(com)
write.csv2(com, file ="community.csv")
Oh, and you don't need the sep = "," argument for write.csv.
If you want to create a table of vertex names/numbers and groups, use one of:
com <- cbind(V(g), e$membership) # V(g) is the vertex sequence, giving the vertex ids
com <- cbind(V(g)$name, e$membership) # use this instead to get names if your vertices are labeled
I don't know if you guys resolved the problem but I did the following using R:
```
# applying the community method
com = spinglass.community(graph_builted,
weights = graph_builted$weights,
implementation = "orig",
update.rule = "config")
# creating a data frame to store the results
type = c(0)
labels = c(0)
groups = c(0)
res2 = data.frame(type, labels, groups)
labels = com$names # here you get the vertices names
groups = com$membership # here you get the communities indices
# here you save the information
res = data.frame(type = "spinGlass1", labels, groups)
res2 = rbind(res2, res)
# then you save the .csv file
write.csv(res2, "spinglass-communities.csv")
```
That resolves the problem for me.
Best regards.

How to group repeating sequences of numbers using R

The simplest description of what I am trying to do is that I have a column in a data.frame like 1,2,3,..., n, 1,2,3,...n,.... and I want to group the first 1...n as 1, the second 1...n as 2 and so on.
The full context is: I am using the R spcosa package to do equal area stratification composite sampling on parcels of land. I start with a shape file from a GIS that contains a number of polygons (land parcels). The end result I want is a GIS file with each of the strata and sample locations in a GIS file format, with each stratum and sample location labeled by land parcel, stratum and sample id. So far I can do all this except one bit, which is identifying the stratum that each sample belongs to and including it in the sample label. The sample label needs to look like "parcel#-strata#-composite#" (where # is the number). In practice I don't need this actual label, but the parts as separate attributes in the GIS file.
The basic work flow is as follows:
For each individual polygon using spcosa::stratify I divide it into a number of equal area strata like
strata.CSEA <- stratify(poly[i,], nStrata = n, nTry = 1, equalArea = TRUE, nGridCells = x)
Note that spcosa::stratify generates a CompactStratificationEqualArea object. I coerce this to a SpatialPixelData then use rasterToPolygon to be able to output it as a GIS file.
I then generate the sample locations as follows:
samples.SPRC <- spsample(strata.CSEA, n = n, type = "composite")
spcosa::spsample creates a SamplingPatternRandomComposite object. I coerce this to a SpatialPointsDataFrame
samples.SPDF <- as(samples.SPRC, "SpatialPointsDataFrame")
and add two columns to the @data slot
samples.SPDF@data$Strata <- "this is the bit I can't do yet"
samples.SPDF@data$CEA <- poly[i,]$name
I can then write samples.SPDF as a GIS file (i.e. with writeOGR) with all the wanted attributes.
As above, the part I can't sort out is how the sample ids relate to the strata ids. The sample points are a vector like 1,2,3...n, 1,2,3...n,.... How do I extract which sample goes with which stratum? As the actual strata numbers are arbitrary, I can just group (as per my simple question above), but ideally I would like to use the numbering of the actual strata so everything lines up.
To give any contributors access to a hands on example I copy below the code from the spcosa documentation slightly modified to generate the correct objects.
# Note: the example below requires the 'rgdal'-package You may consider the 'maptools'-package as an alternative
if (require(rgdal)) {
# read a vector representation of the `Farmsum' field
shpFarmsum <- readOGR(
dsn = system.file("maps", package = "spcosa"),
layer = "farmsum"
)
# stratify `Farmsum' into 50 strata
# NB: increase argument 'nTry' to get better results
set.seed(314)
myStratification <- stratify(shpFarmsum, nStrata = 50, nTry = 1, equalArea = TRUE)
# sample two sampling units per stratum
mySamplingPattern <- spsample(myStratification, n = 2, type = "composite")
# plot the resulting sampling pattern on
# top of the stratification
plot(myStratification, mySamplingPattern)
}
Maybe order() function can help you
n <- 10
dat <- data.frame(col1 = rep(1:n, 2), col2 = rnorm(2*n))
head(dat)
dat[order(dat$col1), ]
I did not get where the "ID" (1,2,3...n) is to be found, so let's assume you have your SpatialPolygonsDataFrame called shpFarmsum with an attribute data column "ID". You can access this column via shpFarmsum$ID. Therefore, if you want to create individual subsets for each ID, this is one way to go:
for (i in unique(shpFarmsum$ID)) {
tempSubset <- shpFarmsum[shpFarmsum$ID == i,]
writeOGR(tempSubset, ".", paste0("subset_", i), driver = "ESRI Shapefile")
}
I added the line writeOGR(... so all subsets are written to your working directory. However, you can change this line or add further analysis inside the for loop.
How it works
unique(shpFarmsum$ID) extracts all occurring IDs (comparable to your 1,2,3...n).
In each repetition of the for loop, another one of these IDs is used to create a subset of the whole SpatialPolygonsDataFrame, which you can use for further analysis.
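For the simple form of the question (labelling the first run 1...n as 1, the second as 2, and so on), one possible approach is to start a new group whenever the sequence stops increasing. The sketch below shows that idea in Python; it is not taken from the answers above, the function name group_repeats is made up for illustration, and it assumes each run is strictly increasing.

```python
def group_repeats(seq):
    """Label each restart of an increasing run with a new group number."""
    groups = []
    current = 0
    previous = None
    for x in seq:
        # A value that does not exceed its predecessor starts a new run
        if previous is None or x <= previous:
            current += 1
        groups.append(current)
        previous = x
    return groups


sample_ids = [1, 2, 3, 1, 2, 3, 1, 2, 3]
group_ids = group_repeats(sample_ids)
print(group_ids)  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

The resulting group numbers are arbitrary run counters; mapping them onto the actual strata numbering would still need information from the stratification object.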

Vectorise an imported variable in R

I have imported a CSV file to R but now I would like to extract a variable into a vector and analyse it separately. Could you please tell me how I could do that?
I know that the summary() function gives a rough idea but I would like to learn more.
I apologise if this is a trivial question but I have watched a number of tutorial videos and have not seen that anywhere.
Read data into data frame using read.csv. Get names of data frame. They should be the names of the CSV columns unless you've done something wrong. Use dollar-notation to get vectors by name. Try reading some tutorials instead of watching videos, then you can try stuff out.
d = read.csv("foo.csv")
names(d)
v = d$whatever # for example
hist(v) # for example
This is totally trivial stuff.
I assume you have used the read.csv() or the read.table() function to import your data into R. (You can get help directly in R with ?, e.g. ?read.csv.)
So normally, you have a data.frame. And if you check the documentation the data.frame is described as a "[...]tightly coupled collections of variables which share many of the properties of matrices and of lists[...]"
So basically you can already handle your data as vector.
A quick search on SO turned up these two posts, among others:
Converting a dataframe to a vector (by rows) and
Extract Column from data.frame as a Vector
And I am sure there are more relevant ones. Try some good tutorials on R (videos are not so instructive in this case).
There are a ton of good ones on the Internet, e.g.:
* http://www.introductoryr.co.uk/R_Resources_for_Beginners.html (which lists some)
or
* http://tryr.codeschool.com/
Anyways, one way to deal with your csv would be:
#import the data to R as a data.frame
mydata = read.csv(file="SomeFile.csv", header = TRUE, sep = ",",
quote = "\"",dec = ".", fill = TRUE, comment.char = "")
#extract a column to a vector
firstColumn = mydata$col1 # extract the column named "col1" of mydata to a vector
#This previous line is equivalent to:
firstColumn = mydata[,"col1"]
#extract a row
firstline = mydata[1,] #extract the first row of mydata (this is still a one-row data.frame)
Edit: In some cases[1], you might need to coerce the data into a vector by applying functions such as as.numeric or as.character:
firstline=as.numeric(mydata[1,])#extract the first row of mydata to a vector
#Note: the entire row *has to be* numeric or compatible with that class
[1] e.g. it happened to me when I wanted to extract a row of a data.frame inside a nested function

Rpy2 - List of List of Dataframes

I'm trying to figure out how to use python to do file parsing from XML files into a data structure to pass into R.
What I need to create in R is a List of Lists of Dataframes:
Nodes = data.frame()
Edges = data.frame()
NetworkCompListA = list()
NetworkCompListA[['Nodes']] = Nodes
NetworkCompListA[['Edges']] = Edges
Networks = list()
Networks[['NetA']] = NetworkCompListA
Networks[['NetB']] = NetworkCompListB
I know how to create a dataframe from the examples in the Rpy2 documentation.
import rpy2.robjects as robjects
import rpy2.rlike.container as rlc
od = rlc.OrdDict([('value', robjects.IntVector((1,2,3))),
('letter', robjects.StrVector(('x', 'y', 'z')))])
df = robjects.DataFrame(od)
How do I insert 'df' into a List and then insert that list into another list in python and then write that out to an rdata file to load into another instance of R?
Thanks!
The class ListVector requires an object that implements iteritems() (such as a dict, or an OrderedDict). Note that in R, data.frames are just lists with the (loose) constraint that all elements should be vectors of the same length (a matrix with the matching number of rows can also be accepted), and with row names and column names (the list's names being the column names).
import rpy2.robjects as robjects
from rpy2.robjects.vectors import ListVector, DataFrame
# rpy2's OrdDict was added because there was no ordered dict
# in Python's stdlib. It should be gone by rpy2-2.5
from collections import OrderedDict
od = OrderedDict((('a', 1), ('b', 2)))
df = DataFrame(od)
# wrap the data frame in a list, then wrap that list in another list
od_l = OrderedDict((('df', df),))
df_in_list = ListVector(od_l)
df_in_list_in_list = ListVector(OrderedDict((('df_in_list', df_in_list),)))
# one way to write the result out (the object and file names are only examples):
# bind the object to a name in R's global environment, then call R's save()
robjects.globalenv['Networks'] = df_in_list_in_list
robjects.r('save(Networks, file="networks.RData")')
