Clustering strings in R (is it possible?)

I have a dataset with a column that is currently being treated as a factor with 1000+ levels; the values are free-text strings. I would like to clean up this data.
Some values are strings like "-18 + 5 = -13" and "5 - 18 = -13"; I would like the clustering to group these differently than, say, "R3no4".
Is this possible in R? I looked at the natural language processing task view http://cran.r-project.org/web/views/NaturalLanguageProcessing.html but I need to be pushed in the right direction.
The dataset is from the KDD Cup 2010.
I would like to create meaningful new columns from this column to aid in building a predictive model. For example, it would be nice to know whether the string contains a certain operation, or whether it contains no operations and instead describes the problem.
my data frame looks like this:
str(data1)
'data.frame': 809694 obs. of 19 variables:
$ Row : int 1 2 3 4 5 6 7 8 9 10 ...
$ Anon.Student.Id : Factor w/ 574 levels "02i5jCrfQK","02ZjVTxC34",..: 7 7 7 7 7 7 7 7 7 7 ...
$ Problem.Hierarchy : Factor w/ 138 levels "Unit CTA1_01, Section CTA1_01-1",..: 80 80 80 80 80 80 80 80 80 80 ...
$ Problem.Name : Factor w/ 1084 levels "1PTB02","1PTB03",..: 377 377 378 378 378 378 378 378 378 378 ...
$ Problem.View : int 1 1 1 1 2 2 3 3 4 4 ...
$ Step.Name : Factor w/ 187539 levels "-(-0.24444444-y) = -0.93333333",..: 116742 177541 104443 64186 58776 58892 153246 153078 45114 163923 ...
I'm most interested in the Step.Name feature, since it contains the greatest number of unique factor values.
and some example values for step name:
[97170] (1+7)/4 = x
[97171] (1-sqrt(1^2-4*2*-6))/4 = x
[97172] (1-sqrt(1^2-(-48)))/4 = x
[97173] (1-sqrt(1-(-48)))/4 = x
[97174] (1-sqrt(49))/4 = x
[97175] (1-7)/4 = x
[97176] x^2+15x+44 = 0
[97177] a-factor-node
[97178] b-factor-node
[97179] c-factor-node
[97180] num1-factor-node
[97181] num2-factor-node
[97182] den1-factor-node
[97183] (-15±sqrt((-15)^2-4*1*44))/2 = x
[97184] (-15+sqrt((-15)^2-4*1*44))/2 = x
[97185] (-15+sqrt((-15)^2-176))/2 = x
[97186] (-15+sqrt(225-176))/2 = x
[97187] (-15+sqrt(49))/2 = x
[97188] (-15+7)/2 = x
[97189] (-15-sqrt((-15)^2-4*1*44))/2 = x
[97190] (-15-sqrt((-15)^2-176))/2 = x
[97191] (-15-sqrt(225-176))/2 = x
[97192] (-15-sqrt(49))/2 = x
[97193] (-15-7)/2 = x
[97194] 2x^2+x = 0
[97195] a-factor-node
[97196] b-factor-node
[97197] c-factor-node
[97198] num1-factor-node
[97199] num2-factor-node
[97200] den1-factor-node
[97201] (-1±sqrt((-1)^2-4*2*0))/4 = x
[97202] (-1+sqrt((-1)^2-4*2*0))/4 = x
[97203] (-1+sqrt((-1)^2-0))/4 = x
[97204] (-1+sqrt((-1)^2))/4 = x
[97205] (-1+1)/4 = x
[97206] (-1-sqrt((-1)^2-4*2*0))/4 = x
[97207] (-1-sqrt((-1)^2-0))/4 = x
[97208] (-1-sqrt((-1)^2))/4 = x
[97209] (-1-1)/4 = x
[97210] x^2-6x = 0
[97211] a-factor-node
[97212] b-factor-node

Clustering is just scoring each instance in a data array according to some metric, sorting the data array according to this calculated score, then slicing it into some number of segments and assigning a label to each one.
In other words, you can cluster any data for which you can formulate some meaningful function to calculate similarity of each data point w/r/t the others; this is usually referred to as a similarity metric.
There are a lot of these, but only a small subset of them are useful to evaluate strings. Of these, perhaps the most commonly used is Levenshtein Distance (aka Edit Distance).
This metric is expressed as an integer: it increments by one unit (+1) for each 'edit'--inserting, deleting, or substituting a character--required to transform one string into another. Summing those individual edits gives you the Levenshtein Distance.
The R Package vwr includes an implementation:
> library(vwr)
> levenshtein.distance('cat', 'hat')
hat
1
> levenshtein.distance('cat', 'catwalk')
catwalk
4
> levenshtein.distance('catwalk', 'sidewalk')
sidewalk
4
> # using a data set supplied with the vwr package
> EW = english.words
> ew1 = sample(EW, 20) # randomly select 20 words from EW
> # the second argument is a vector of words, returns a vector of distances
> dx = levenshtein.distance('cat', ew1)
> dx
furriers graves crooned cursively gabled caparisons drainpipes
8 5 6 8 5 8 9
patricians medially beholder chirpiness fluttered bobolink lamentably
8 7 8 9 8 8 8
depredations alights unearthed thimbles supersede dissembler
10 6 7 8 9 10
While Levenshtein Distance can be used to cluster your data, whether it should be used for your data is a question I'll leave to you (i.e., the primary use case for L/D is clearly pure text data).
(Perhaps the next-most-common similarity metric that operates on strings is Hamming Distance. Hamming Distance (unlike Levenshtein) requires that the two strings be of equal length, hence it won't work for your data.)
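If you do go this route, here is a minimal sketch of the clustering step. Base R's adist() also computes Levenshtein distances, so vwr is not strictly required; the sample size, linkage method, and number of clusters below are arbitrary choices, and data1 / Step.Name are simply the names from your str() output.

# Cluster a sample of the distinct Step.Name strings by Levenshtein distance.
steps <- unique(as.character(data1$Step.Name))
steps <- sample(steps, 500)           # a full 187,539 x 187,539 distance matrix is too big

d   <- as.dist(adist(steps))          # pairwise Levenshtein (edit) distances
hc  <- hclust(d, method = "average")  # hierarchical clustering on those distances
grp <- cutree(hc, k = 20)             # slice the tree into, say, 20 groups

head(split(steps, grp), 3)            # inspect a few of the resulting clusters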

Perhaps:
> grepl("[[:alpha:]]", c("-18 + 5 = -13", "5 - 18 = -13","R3no4") )
[1] FALSE FALSE TRUE
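Building on that idea, the same kind of regular-expression test can be turned directly into the new feature columns you mention. This is only a sketch; the patterns below are guesses at what counts as an "operation" in your data.

steps <- as.character(data1$Step.Name)

data1$step_has_letters <- grepl("[[:alpha:]]", steps)  # e.g. "R3no4", "a-factor-node"
data1$step_has_equals  <- grepl("=", steps)            # looks like an equation step
data1$step_has_sqrt    <- grepl("sqrt", steps)         # contains a square root
data1$step_has_op      <- grepl("[-+*/^]", steps)      # contains an arithmetic operator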

Related

How to count number of instances above a value within a given range in R?

I have a rather large dataset looking at SNPs across an entire genome. I am trying to generate a heatmap that scales based on how many SNPs have a BF (Bayes factor) value over 50 within a sliding window of x base pairs across the genome. For example, there might be 5 SNPs of interest within the first 1,000,000 base pairs, then 3 in the next 1,000,000, and so on until I reach the end of the genome, which would be used to generate a single-row heatmap. Currently, my data are set out like so:
SNP BF BP
0001_107388 11.62814713 107388
0001_193069 2.333472447 193069
0001_278038 51.34452334 278038
0001_328786 5.321968927 328786
0001_523879 50.03245434 523879
0001_804477 -0.51777189 804477
0001_990357 6.235452787 990357
0001_1033297 3.08206707 1033297
0001_1167609 -2.427835577 1167609
0001_1222410 52.96447989 1222410
0001_1490205 10.98099565 1490205
0001_1689133 3.75363951 1689133
0001_1746080 3.519987207 1746080
0001_1746450 -2.86666016 1746450
0001_1777011 0.166999413 1777011
0001_2114817 3.266942137 2114817
0001_2232084 50.43561123 2232084
0001_2332903 -0.15022324 2332903
0001_2347062 -1.209000033 2347062
0001_2426273 1.230915683 2426273
where SNP = the SNP ID, BF = the Bayes factor, and BP = the position on the genome (I've fudged a couple of > 50 values in there to make the data suitable for this example).
The issue is that I don't have a SNP for each genome position, otherwise I could simply split the windows of interest based on line count and then count however many lines in the BF column are over 50. Is there any way I can count the number of SNPs of interest within different windows of the genome positions? Preferably in R, but I have no issue with using other languages like Python or Bash if it gets the job done.
Thanks!
library(slider)
library(dplyr)

my_data %>%
  mutate(count = slide_index(BF, BP, ~ sum(.x > 50), .before = 999999))
This counts how many BF values exceed 50 within the trailing window of the last 1,000,000 base pairs (BP) at each SNP.
SNP BF BP count
1 0001_107388 11.6281471 107388 0
2 0001_193069 2.3334724 193069 0
3 0001_278038 51.3445233 278038 1
4 0001_328786 5.3219689 328786 1
5 0001_523879 50.0324543 523879 2
6 0001_804477 -0.5177719 804477 2
7 0001_990357 6.2354528 990357 2
8 0001_1033297 3.0820671 1033297 2
9 0001_1167609 -2.4278356 1167609 2
10 0001_1222410 52.9644799 1222410 3
11 0001_1490205 10.9809957 1490205 2
12 0001_1689133 3.7536395 1689133 1
13 0001_1746080 3.5199872 1746080 1
14 0001_1746450 -2.8666602 1746450 1
15 0001_1777011 0.1669994 1777011 1
16 0001_2114817 3.2669421 2114817 1
17 0001_2232084 50.4356112 2232084 1
18 0001_2332903 -0.1502232 2332903 1
19 0001_2347062 -1.2090000 2347062 1
20 0001_2426273 1.2309157 2426273 1
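If the heatmap actually needs fixed, non-overlapping 1 Mb bins (5 SNPs of interest in the first 1,000,000 bp, 3 in the next, and so on) rather than a trailing window per SNP, a sketch along these lines may be closer to what you described; the bin width is an assumption you can change.

library(dplyr)

bin_width <- 1e6

window_counts <- my_data %>%
  mutate(window = floor((BP - 1) / bin_width)) %>%        # 0-based 1 Mb bin index
  group_by(window) %>%
  summarise(n_snps_over_50 = sum(BF > 50), .groups = "drop")

window_counts   # one row per occupied 1 Mb window; feed this to the heatmap

Windows containing no SNPs at all will not appear as rows here; add them back (for example with tidyr::complete()) if the heatmap needs every window.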

How to apply a function to multiple columns to create multiple new columns in R?

I've this list of sequences aqi_range and a dataframe df:
aqi_range = list(0:50,51:100,101:250)
df
PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max
1 85.6 3 264 75.7 3 240
2 105. 6 243 76.4 3 191
3 95.8 19 287 48.4 8 134
4 85.5 50 166 64.8 32 103
5 55.9 24 117 46.7 19 77
6 37.5 6 116 31.3 3 87
7 26 5 69 15.5 3 49
8 82.3 34 169 49.6 25 120
9 170 68 272 133 67 201
10 254 189 323 226 173 269
Now I've created these two pretty simple functions that I want to apply to this dataframe to calculate the AQI (Air Quality Index) for each pollutant.
# a = column from the dataframe (PM10_mean, PM2.5_mean)
# b = list of sequences defined above
min_max_diff <- function(a, b) {
  for (i in b) {
    if (a %in% i) {
      min_val = min(i)
      max_val = max(i)
      return(max_val - min_val)
    }
  }
}

# a = column from the dataframe (PM10_mean, PM2.5_mean)
# b = list of sequences defined above
c_low <- function(a, b) {
  for (i in b) {
    if (a %in% i) {
      min_val = min(i)
      return(min_val)
    }
  }
}
Basically, the first function min_max_diff takes a value from the df$PM10_mean / df$PM2.5_mean column, looks it up in the list aqi_range, and returns the difference between the max and min of the sequence in which the value falls. Similarly, the second function c_low just returns the minimum value of that sequence.
I want to apply this kind of manipulation (formula defined below) to the PM10_mean column to create a new column PM10_AQI:
df$PM10_AQI = min_max_diff(df$PM10_mean,aqi_range) / (df$PM10_max - df$PM10_min) / * (df$PM10_mean - df$PM10_min) + c_low(df$PM10_mean,aqi_range)
I hope it explains it properly.
If your problem is just how to apply the given transformation to several columns of a data frame, you can write a for loop, construct the name of each variable involved in the transformation with string functions (sub() is useful here), and refer to the columns with the [ notation rather than the $ notation, since [ accepts strings to specify columns.
Below is an example of such code, using a small sample of 3 observations.
(Note that I modified the definition of the AQI ranges: I now just define the breaks where the range changes, assuming the ranges are based on integer values. Your functions min_max_diff() and c_low() are collapsed into a single function that returns both the min and the max of the AQI range in which the values fall, again assuming integer AQI values.)
# Definition of the AQI ranges (which are assumed to be based on integer values)
# Note that if the number of AQI ranges is k, the number of breaks is k+1
# Each break value defines the minimum of the range
# The maximum of each range is computed as the "minimum of the NEXT range" - 1
# (again this assumes integer values in AQI ranges)
# The values (e.g. PM10_mean) whose AQI range is searched for are assumed
# to be smaller than the largest break value.
aqi_range_breaks = c(0, 51, 101, 251)
# Example data (top 3 rows of the data frame you provided)
df = data.frame(PM10_mean  = c(85.6, 105.0, 95.8),
                PM10_min   = c(3, 6, 19),
                PM10_max   = c(264, 243, 287),
                PM2.5_mean = c(75.7, 76.4, 48.4),
                PM2.5_min  = c(3, 3, 8),
                PM2.5_max  = c(240, 191, 134))
# Function that returns the minimum and maximum AQI values
# of the AQI range where the given values are found
# `values`: array of values that are searched for in the AQI ranges
# defined by the second parameter.
# `aqi_range_breaks`: breaks defining the minimum values of each AQI range
# plus one last value defining a value never attained by `values`.
# (all values in this parameter defining the AQI ranges are assumed integer values)
find_aqi_range_min_max <- function(values, aqi_range_breaks) {
  aqi_range_groups = findInterval(values, aqi_range_breaks)
  return(list(min = aqi_range_breaks[aqi_range_groups],
              max = aqi_range_breaks[aqi_range_groups + 1] - 1))
}
# Run the variable transformation on the selected `_mean` columns
vars_mean = c("PM10_mean", "PM2.5_mean")
for (vmean in vars_mean) {
  vmin = sub("_mean$", "_min", vmean)
  vmax = sub("_mean$", "_max", vmean)
  vaqi = sub("_mean$", "_AQI", vmean)
  aqi_range_min_max = find_aqi_range_min_max(df[, vmean], aqi_range_breaks)
  df[, vaqi] = (aqi_range_min_max$max - aqi_range_min_max$min) /
               (df[, vmax] - df[, vmin]) / (df[, vmean] - df[, vmin]) +
               aqi_range_min_max$min
}
Note how the findInterval() function has been used to find the range where an array of values fall. That was the key to make your transformation work for a data frame column.
The expected output of this process is:
PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max PM10_AQI PM2.5_AQI
1 85.6 3 264 75.7 3 240 51.00227 51.002843893
2 105.0 6 243 76.4 3 191 101.00635 51.003550930
3 95.8 19 287 48.4 8 134 51.00238 0.009822411
Please check the formula that computes AQI because you had a syntax error in it (look for / *, which I have replaced with / in the formula in my code).
Note that the use of $ in the regular expression used in sub() to match the string "_mean" is used to replace the "_mean" string only when it occurs at the end of the variable name.
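If you prefer to stay within dplyr, the same loop can be written with across(). This is only a sketch, assuming dplyr >= 1.0 and the find_aqi_range_min_max() function defined above; the new columns come out named PM10_mean_AQI and PM2.5_mean_AQI, and the formula is kept exactly as in the loop.

library(dplyr)

df <- df %>%
  mutate(across(ends_with("_mean"),
                function(x) {
                  rng  <- find_aqi_range_min_max(x, aqi_range_breaks)
                  stem <- sub("_mean$", "", cur_column())
                  lo   <- df[[paste0(stem, "_min")]]   # matching _min column
                  hi   <- df[[paste0(stem, "_max")]]   # matching _max column
                  (rng$max - rng$min) / (hi - lo) / (x - lo) + rng$min
                },
                .names = "{.col}_AQI"))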

Visualizing network of sentences in Textrank

I'm using the Textrank method explained here to get a summary of a text. Is there a way to plot the output of textrank_sentences as a network of all the textrank_ids connected to each other?
library(textrank)
data(joboffer)
library(udpipe)
tagger <- udpipe_download_model("english")   # assuming the English model is the one needed
tagger <- udpipe_load_model(tagger$file_model)
joboffer <- udpipe_annotate(tagger, job_rawtxt)   # job_rawtxt is the raw job-offer text
joboffer <- as.data.frame(joboffer)
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id","paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"))
terminology <- terminology[, c("textrank_id", "lemma")]
tr <- textrank_sentences(data = sentences, terminology = terminology)
This question is rather old, but is a good question and deserves an answer.
Yes! textrank returns all the information that you need. Just look
at the output of str(tr). Part of it says:
$ sentences_dist:Classes ‘data.table’ and 'data.frame': 666 obs. of 3 variables:
..$ textrank_id_1: int [1:666] 1 1 1 1 1 1 1 1 1 1 ...
..$ textrank_id_2: int [1:666] 2 3 4 5 6 7 8 9 10 11 ...
..$ weight : num [1:666] 0.1429 0.4167 0 0.0625 0 ...
This gives which sentences are connected, in the form of a lower triangular matrix. Two sentences are connected if the weight of their connection is greater than zero. To visualize this, take the pairs with non-zero weight as an edge list and build a graph from it.
Links = which(tr$sentences_dist$weight > 0)
EdgeList = cbind(tr$sentences_dist$textrank_id_1[Links],
tr$sentences_dist$textrank_id_2[Links])
library(igraph)
SGraph1 = graph_from_edgelist(EdgeList, directed=FALSE)
set.seed(42)
plot(SGraph1)
We see that 11 of the nodes (sentences) are not connected to any other node.
For example, sentences 15 and 36
tr$sentences$sentence[c(36,15)]
[1] "Contact:"
[2] "Integration of the models into the existing architecture."
But other nodes do connect up; for example, node 1 is connected to node 2.
tr$sentences$sentence[c(1,2)]
[1] "Statistical expert / data scientist / analytical developer"
[2] "BNOSAC (Belgium Network of Open Source Analytical Consultants),
is a Belgium consultancy company specialized in data analysis and
statistical consultancy using open source tools."
because those sentences share the (important) words "statistical", "data", and "analytical".
The singleton nodes take up a lot of space in the graph, making the other nodes rather crowded, so I will also show the graph with those removed.
which(degree(SGraph1) == 0)
[1] 4 7 15 20 21 23 25 26 29 30 36
SGraph2 = delete.vertices(SGraph1, which(degree(SGraph1) == 0))
set.seed(42)
plot(SGraph2)
That shows the relations between sentences somewhat better, but I expect that you can find a nicer layout for the graph that better shows the relations. However, that is not the thrust of the question and I leave it to you to make the graph pretty.
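One small refinement worth trying (a sketch reusing the objects built above): attach the textrank weights to the edges so that stronger sentence connections are drawn thicker.

# Rebuild the graph, keep the non-zero weights as an edge attribute,
# drop the isolated sentences, and scale the edge width by weight.
SGraph3 <- graph_from_edgelist(EdgeList, directed = FALSE)
E(SGraph3)$weight <- tr$sentences_dist$weight[Links]
SGraph3 <- delete.vertices(SGraph3, which(degree(SGraph3) == 0))

set.seed(42)
plot(SGraph3, edge.width = 5 * E(SGraph3)$weight)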

Error in `colnames<-`(`*tmp*`, value = c("x", "y", "x_1", "x_2", "y_1", : length of 'dimnames' [2] not equal to array extent for Panel Data

I encounter the above error while attempting to run the Granger causality test for panel data using the pgrangertest function from the plm package. I read several questions by users facing a similar issue and tried the suggestions given there; however, none of them solved my problem.
Essentially, I have panel data which looks something like this:
>head(granger_data)
panel_id time_id close_close_ret log_volume
25-2 25 2 0.004307257 4.753590
25-3 25 3 -0.001912046 8.249836
25-4 25 4 0.011417821 8.628377
25-5 25 5 0.018744691 9.134754
25-6 25 6 -0.024913157 8.920122
25-7 25 7 -0.008604260 8.724370
str(granger_data)
'data.frame': 105209 obs. of 4 variables:
$ panel_id : Factor w/ 938 levels "25","26","27",..: 1 1 1 1 1 1 1 1 1 1 ...
$ time_id : Factor w/ 323 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ close_close_ret: num NA 0.00431 -0.00191 0.01142 0.01874 ...
$ log_volume : num 4.88 4.75 8.25 8.63 9.13 ...
Now, I want to run the granger causality test for panel data using the pgrangertest function from the plm package and while doing so, I encounter the following problem:
> vol_ret <- pgrangertest(log_volume ~ close_close_ret,data = granger_data)
Error in `colnames<-`(`*tmp*`, value = c("x", "y", "x_1", "y_1")) :
length of 'dimnames' [2] not equal to array extent
I even read the source code of the function and tried to understand where the error came from, but I couldn't figure it out.
The panel Granger test requires a time series of length 5+3*order per individual; otherwise the second-order moments of the individual Wald statistics do not exist. pgrangertest in package plm has had a check for that since version 1.7-0 of the package. From its NEWS file:
pgrangertest: better detection of infeasibility of test due to lacking data.
It gives an informative error message if you supply too short a time series for an individual, as in the case you encountered, e.g.:
Error in pgrangertest(inv ~ value, data = pG, order = 1) :
Condition for test = "Ztilde" not met for all individuals: length of
time series must be larger than 5+3*order (>5+3*1=8)
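A practical workaround is to drop the individuals whose series are too short before calling pgrangertest. A sketch, assuming dplyr, a test of order 1, and the column names from your data; adjust as needed:

library(dplyr)
library(plm)

lag_order <- 1
min_len   <- 5 + 3 * lag_order   # each individual's series must be longer than this

granger_ok <- granger_data %>%
  group_by(panel_id) %>%
  filter(sum(complete.cases(close_close_ret, log_volume)) > min_len) %>%
  ungroup() %>%
  as.data.frame()

vol_ret <- pgrangertest(log_volume ~ close_close_ret, data = granger_ok, order = lag_order)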

create a new column conditional on distance traveled between points in R

I am trying to create a new column conditional on another column, a bit like a moving average or moving window, but based on the distance traveled between points. Take for example row 2, with a CO2 of 399.935: I would like the mean of all the points within 100 m (traveled) of that point. In my example (looking at the CumDist column), rows 1, 3, 4 and 5 would be selected to calculate the mean. The CumDist column (multiply by 100,000 to get the units in meters) is the cumulative distance traveled. I have 5000 points, and obviously the width (number of rows) of the moving window will vary.
I tested over() from the sp package, but it's problematic if the same road is taken more than once. I looked on the web for other solutions and I did not find anything that could help me.
dput(DF)
structure(list(CO2 = c(399.9350305, 399.9350305, 399.9350305,
400.0320031, 400.0320031, 400.0320031, 399.7718229, 399.7718229,
399.7718229, 399.3855075, 399.3855075, 399.3855075, 399.4708139,
399.4708139, 399.4708139, 400.0362474, 400.0362474, 400.0362474,
399.7556753, 399.7556753), lon = c(-103.7093538, -103.709352,
-103.7093492, -103.7093467, -103.7093455, -103.7093465, -103.7093482,
-103.7093596, -103.7094074, -103.7094625, -103.7094966, -103.709593,
-103.709649, -103.7096717, -103.7097349, -103.7097795, -103.709827,
-103.7099007, -103.709924, -103.7099887), lat = c(49.46972027,
49.46972153, 49.46971675, 49.46971533, 49.46971307, 49.4697124,
49.46970636, 49.46968214, 49.46960921, 49.46955984, 49.46953621,
49.46945809, 49.46938994, 49.46935281, 49.46924309, 49.46918635,
49.46914762, 49.46912566, 49.46912407, 49.46913321),distDiff = c(0.000342016147509882,
0.000191466419697602, 0.000569046320857002, 0.000240367540492089,
0.000265977754839834, 0.000103953049523505, 0.000682968856240796,
0.0028176007969857, 0.00882013898948418, 0.00678966015562509,
0.00360774024245839, 0.011149423290729, 0.00859796340323456,
0.00444526066124642, 0.0130344010874029, 0.00709037369666853,
0.00551435348701512, 0.00587377717110946, 0.00169806309901329,
0.00479849401022625), CumDist = c(0.000342016147509882, 0.000533482567207484,
0.00110252888806449, 0.00134289642855657, 0.00160887418339641,
0.00171282723291991, 0.00239579608916071, 0.00521339688614641,
0.0140335358756306, 0.0208231960312557, 0.0244309362737141, 0.0355803595644431,
0.0441783229676777, 0.0486235836289241, 0.0616579847163269, 0.0687483584129955,
0.0742627119000106, 0.08013648907112, 0.0818345521701333, 0.0866330461803596
)), .Names = c("X12CO2_dry", "coords.x1", "coords.x2", "V1",
"CumDist"), row.names = 2:21, class = "data.frame")
thanks, Martin
Man you beat me to it with a cleaner solution mra68.
Here's mine using a few loops.
####################
for (j in 1:nrow(DF)) {            # Loop through all rows of your dataset
  CO2list <- NULL                  # Need to make a variable before storing to it in the loop
  for (i in 1:nrow(DF)) {          # Loop through all distances in the table
    if (abs(DF$CumDist[i] - DF$CumDist[j]) <= 0.001) {
      # Check to see if the difference in CumDist is <= 100/100000 for all entries;
      # CumDist[j] is the point with the 100 meter window around it
      CO2list <- c(CO2list, DF$X12CO2_dry[i])
      # Store your CO2 entries that are within the 100 meter window to a vector
    }
  }
  DF$CO2AVG[j] <- mean(CO2list)    # Get the mean of your list and store it in a column named CO2AVG
}
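For what it's worth, the same result can be written without the explicit inner loop (still quadratic in the number of rows, so the cumulative-sum answer below will scale better on 5000 points):

# Vectorised form of the double loop: mean CO2 within 100 m (0.001 in CumDist units).
DF$CO2AVG <- sapply(DF$CumDist, function(d) {
  mean(DF$X12CO2_dry[abs(DF$CumDist - d) <= 0.001])
})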
In the approach below, n[i] is the index of the first row whose CumDist is within 0.001 (100 m) below that of row i, and m[i] is the index of the first row whose CumDist is more than 0.001 above it. The window that belongs to the i-th row therefore starts at n[i] and ends at m[i]-1. Hence the sum of the CO2 values in the i-th window is CumCO2[m[i]]-CumCO2[n[i]]. (Notice that the indices into CumCO2 are shifted by 1 because of the leading 0.) Dividing this CO2 sum by the window size m[i]-n[i] gives the values meanCO2 for the new column:
n <- sapply(df$CumDist, function(x) {
  which.max(df$CumDist >= x - 0.001)
})
m <- sapply(df$CumDist, function(x) {
  which.max(c(df$CumDist, Inf) > x + 0.001)
})
CumCO2  <- c(0, cumsum(df$X12CO2_dry))
meanCO2 <- (CumCO2[m] - CumCO2[n]) / (m - n)
> n
[1] 1 1 1 2 3 3 5 8 9 10 11 12 13 14 15 16 17 18 19 20
> m
[1] 4 5 7 7 8 8 8 9 10 11 12 13 14 15 16 17 18 19 20 21
> meanCO2
[1] 399.9350 399.9593 399.9835 399.9932 399.9606 399.9606 399.9453 399.7718 399.7718 399.3855 399.3855 399.3855 399.4708 399.4708 399.4708 400.0362
[17] 400.0362 400.0362 399.7557 399.7557
>
