Matrix to Matrix Votes conversion - r

I have the output of a classification algorithm in the form of probabilities. It looks like this:
> head(testProb)
B U
1 0.98 0.02
2 0.80 0.20
3 0.14 0.86
4 0.91 0.09
5 0.25 0.75
6 0.86 0.14
When I checked the class, it was:
> class(testProb)
[1] "matrix" "votes"
If I take a subset of it, for example the 1st column:
a <- testProb[,1]
> head(a)
1 2 3 4 5 6
0.98 0.80 0.14 0.91 0.25 0.86
> class(a)
[1] "numeric"
I have another classification output matrix, but its class is not "matrix" "votes". How can I convert it to "matrix" "votes", so that taking a subset gives values in the same form as in the subset above?
> head(prediction)
B U
[1,] 0.9413505 0.05864955
[2,] 0.8758474 0.12415256
[3,] 0.2271516 0.77284845
[4,] 0.9227356 0.07726441
[5,] 0.1838987 0.81610128
[6,] 0.9253403 0.07465969
> class (prediction)
[1] "matrix"
> a <- prediction[,1]
> head(a)
[1] 0.9413505 0.8758474 0.2271516 0.9227356 0.1838987 0.9253403
> class(a)
[1] "numeric"
In this case as well, I want a to look the way it does in the first case. Your help will be appreciated.
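A minimal sketch of one way to get there, assuming the only differences between the two objects are the class attribute and the missing row names (testProb prints row names 1, 2, 3, ..., while prediction prints [1,], [2,], ..., which is why the first subset comes back as a named vector). The small prediction matrix below is made-up illustration data, not the original output.

```r
# Illustrative stand-in for the real prediction matrix (values are made up)
prediction <- matrix(c(0.94, 0.88, 0.23,
                       0.06, 0.12, 0.77),
                     ncol = 2,
                     dimnames = list(NULL, c("B", "U")))

# Give the rows names so that a column subset keeps them as vector names
rownames(prediction) <- seq_len(nrow(prediction))

# Set the class attribute to match the first object
class(prediction) <- c("matrix", "votes")

a <- prediction[, 1]
class(prediction)  # "matrix" "votes"
names(a)           # "1" "2" "3"
```

Note that the named subset comes from the row names rather than from the "votes" class itself, so setting the row names alone may already be enough.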

Related

How to repeat a Function and store values in R using the function sim.msm

I would like to simulate 10000 results from the function below and store the values. The function is available in the msm package for R.
sim.msm(qmatrix,15)
Result:
$states
[1] 1 2 3 2 3 2 2
$times
[1] 0.000000 1.538988 2.240587 9.695302 11.002184 14.998754 15.000000
$qmatrix
[,1] [,2] [,3]
[1,] -0.11 0.10 0.01
[2,] 0.05 -0.15 0.10
[3,] 0.02 0.07 -0.09
This is only one simulation. I need 10000 like this.
I would be grateful if someone could help me.
replicate() repeats the same expression N times. Here N = 10:
replicate(10, sim.msm(qmatrix,15), simplify = FALSE)
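Once the simulations are stored, each one is an element of a list and can be processed with the usual list tools. The sketch below uses simulate_once(), a hypothetical stand-in that returns a list with the same $states and $times components, so that it runs without the msm package; in practice the call inside replicate() would be sim.msm(qmatrix, 15) as above.

```r
set.seed(1)

# Hypothetical stand-in for sim.msm(qmatrix, 15): returns states and times
simulate_once <- function() {
  n <- sample(2:5, 1)
  list(states = sample(1:3, n, replace = TRUE),
       times  = sort(runif(n, 0, 15)))
}

# simplify = FALSE keeps every simulation as its own list element
sims <- replicate(100, simulate_once(), simplify = FALSE)

length(sims)      # number of stored simulations
sims[[1]]$states  # states of the first simulation

# Example of post-processing: the last state visited in each simulation
final_states <- vapply(sims, function(s) s$states[length(s$states)], integer(1))
```

With simplify = FALSE the result stays a plain list, which is convenient here because the simulations can have different lengths.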

Confusion on base reshape example

Can you please explain this behavior?
Why are wide and wide2 not identical, and why does reshape work on wide but not on wide2?
wide <- reshape(Indometh, v.names = "conc", idvar = "Subject",
timevar = "time", direction = "wide")
wide
# Subject conc.0.25 conc.0.5 conc.0.75 conc.1 conc.1.25 conc.2 conc.3 conc.4 conc.5 conc.6 conc.8
# 1 1 1.50 0.94 0.78 0.48 0.37 0.19 0.12 0.11 0.08 0.07 0.05
# 12 2 2.03 1.63 0.71 0.70 0.64 0.36 0.32 0.20 0.25 0.12 0.08
# 23 3 2.72 1.49 1.16 0.80 0.80 0.39 0.22 0.12 0.11 0.08 0.08
# 34 4 1.85 1.39 1.02 0.89 0.59 0.40 0.16 0.11 0.10 0.07 0.07
# 45 5 2.05 1.04 0.81 0.39 0.30 0.23 0.13 0.11 0.08 0.10 0.06
# 56 6 2.31 1.44 1.03 0.84 0.64 0.42 0.24 0.17 0.13 0.10 0.09
reshape(wide) # ok
wide2 <- wide[,1:ncol(wide)]
reshape(wide2) # Error in match.arg(direction, c("wide", "long")) : argument "direction" is missing, with no default
Some diagnosis:
identical(wide,wide2) # FALSE
dplyr::all_equal(wide,wide2) # TRUE
all.equal(wide,wide2)
# [1] "Attributes: < Names: 1 string mismatch >" "Attributes: < Length mismatch: comparison on first 2 components >"
# [3] "Attributes: < Component 2: Modes: list, numeric >" "Attributes: < Component 2: names for target but not for current >"
# [5] "Attributes: < Component 2: Length mismatch: comparison on first 5 components >" "Attributes: < Component 2: Component 1: Modes: character, numeric >"
# [7] "Attributes: < Component 2: Component 1: target is character, current is numeric >" "Attributes: < Component 2: Component 2: Modes: character, numeric >"
# [9] "Attributes: < Component 2: Component 2: target is character, current is numeric >" "Attributes: < Component 2: Component 3: Modes: character, numeric >"
# [11] "Attributes: < Component 2: Component 3: target is character, current is numeric >" "Attributes: < Component 2: Component 4: Numeric: lengths (11, 1) differ >"
# [13] "Attributes: < Component 2: Component 5: Modes: character, numeric >" "Attributes: < Component 2: Component 5: Lengths: 11, 1 >"
# [15] "Attributes: < Component 2: Component 5: Attributes: < Modes: list, NULL > >" "Attributes: < Component 2: Component 5: Attributes: < Lengths: 1, 0 > >"
# [17] "Attributes: < Component 2: Component 5: Attributes: < names for target but not for current > >" "Attributes: < Component 2: Component 5: Attributes: < current is not list-like > >"
# [19] "Attributes: < Component 2: Component 5: target is matrix, current is numeric >"
Because the subset operation on the wide data.frame removes the custom attributes that reshape adds and later uses to automatically perform the opposite reshaping.
In fact, as you can see, the attribute list of wide contains reshapeWide, which stores all the information needed to revert the reshape:
> names(attributes(wide))
[1] "row.names" "names" "class" "reshapeWide"
> attributes(wide)$reshapeWide
$v.names
[1] "conc"
$timevar
[1] "time"
$idvar
[1] "Subject"
$times
[1] 0.25 0.50 0.75 1.00 1.25 2.00 3.00 4.00 5.00 6.00 8.00
$varying
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] "conc.0.25" "conc.0.5" "conc.0.75" "conc.1" "conc.1.25" "conc.2" "conc.3" "conc.4" "conc.5" "conc.6" "conc.8"
while wide2 does not:
> names(attributes(wide2))
[1] "names" "class" "row.names"
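Following from that, one possible workaround (a sketch, not the only option) is to copy the reshapeWide attribute back onto the subset before calling reshape again. Passing direction = "long" together with the varying, v.names, timevar, idvar and times arguments explicitly would be the alternative.

```r
wide <- reshape(Indometh, v.names = "conc", idvar = "Subject",
                timevar = "time", direction = "wide")
wide2 <- wide[, 1:ncol(wide)]

# Copy the bookkeeping attribute from the original wide data frame
attr(wide2, "reshapeWide") <- attr(wide, "reshapeWide")

# reshape() can now find the information it needs to go back to long format
long2 <- reshape(wide2)
nrow(long2)
```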

How to apply findAssocs against each row of data.frame

I created a data.frame that holds my words and their frequencies. Now I would like to run findAssocs against every row of my data frame, but I cannot get my code to work. Any help is appreciated.
Here is an example of my data.frame term.df
term.df <- data.frame(word = names(v),freq=v)
word freq
ounce 8917
pack 6724
count 4992
organic 3696
frozen 2534
free 1728
I created a TermDocumentMatrix tdm and the following code works as expected.
findAssocs(tdm, 'frozen', 0.20)
I would like to append the output of findAssocs as a new column.
Here's the code I tried:
library(dplyr)
library(tm)
library(pbapply)
#I would like to append all findings in a new column
res <- merge(do.call(rbind.data.frame, pblapply(term.df, findAssocs(tdm, term.df$word , 0.18))),
term.df[, c("word")], by.x="list.q", by.y="word", all.x=TRUE)
EDIT:
As for the output, the single statement above gets me something like this:
$yogurt
greek ellenos fat chobani dannon fage yoplait nonfat wallaby
0.62 0.36 0.25 0.24 0.24 0.24 0.24 0.22 0.20
I was hoping it would be possible to add a single column (ASSOC) to my original table and put the results in it as comma-separated name:value tuples, but I'm really open to ideas.
I think a structure that would be simplest to handle would be a nested list:
lapply(seq_len(nrow(text.df)), function(i) {
list(word=text.df$word[i],
freq=text.df$freq[i],
assoc=findAssocs(tdm, as.character(text.df$word[i]), 0.7)[[1]])
})
# [[1]]
# [[1]]$word
# [1] "oil"
#
# [[1]]$freq
# [1] 3
#
# [[1]]$assoc
# 15.8 opec clearly late trying who winter analysts
# 0.87 0.87 0.80 0.80 0.80 0.80 0.80 0.79
# said meeting above emergency market fixed that prices
# 0.78 0.77 0.76 0.75 0.75 0.73 0.73 0.72
# agreement buyers
# 0.71 0.70
#
#
# [[2]]
# [[2]]$word
# [1] "opec"
#
# [[2]]$freq
# [1] 2
#
# [[2]]$assoc
# meeting emergency oil 15.8 analysts buyers above
# 0.88 0.87 0.87 0.85 0.85 0.83 0.82
# said ability they prices. agreement but clearly
# 0.82 0.80 0.80 0.79 0.76 0.74 0.74
# december. however, late production sell trying who
# 0.74 0.74 0.74 0.74 0.74 0.74 0.74
# winter quota that through bpd market
# 0.74 0.73 0.73 0.73 0.70 0.70
#
#
# [[3]]
# [[3]]$word
# [1] "xyz"
#
# [[3]]$freq
# [1] 1
#
# [[3]]$assoc
# numeric(0)
In my experience this will be easier to handle than a nested string because you can still access the word associations for each row of your original text.df object by accessing the corresponding element in the outputted list.
If you really want to keep a data frame structure, then you could pretty easily convert the findAssocs output to a string representation, for instance using toJSON:
library(RJSONIO)
text.df$assoc <- sapply(text.df$word, function(x) toJSON(findAssocs(tdm, x, 0.7)[[1]], collapse=""))
text.df
# word freq
# 1 oil 3
# 2 opec 2
# 3 xyz 1
# assoc
# 1 { "15.8": 0.87,"opec": 0.87,"clearly": 0.8,"late": 0.8,"trying": 0.8,"who": 0.8,"winter": 0.8,"analysts": 0.79,"said": 0.78,"meeting": 0.77,"above": 0.76,"emergency": 0.75,"market": 0.75,"fixed": 0.73,"that": 0.73,"prices": 0.72,"agreement": 0.71,"buyers": 0.7 }
# 2 { "meeting": 0.88,"emergency": 0.87,"oil": 0.87,"15.8": 0.85,"analysts": 0.85,"buyers": 0.83,"above": 0.82,"said": 0.82,"ability": 0.8,"they": 0.8,"prices.": 0.79,"agreement": 0.76,"but": 0.74,"clearly": 0.74,"december.": 0.74,"however,": 0.74,"late": 0.74,"production": 0.74,"sell": 0.74,"trying": 0.74,"who": 0.74,"winter": 0.74,"quota": 0.73,"that": 0.73,"through": 0.73,"bpd": 0.7,"market": 0.7 }
# 3 [ ]
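If the comma-separated name:value form from the question is preferred over JSON, a named numeric vector such as the one findAssocs returns for a term can be collapsed with paste. This is a small sketch; format_assoc is a hypothetical helper name, and the example vector is copied from the yogurt output in the question rather than computed.

```r
# Hypothetical helper: collapse a named numeric vector into "name:value, ..."
format_assoc <- function(x) {
  paste(names(x), x, sep = ":", collapse = ", ")
}

# Values taken from the question's example output
assoc <- c(greek = 0.62, ellenos = 0.36, fat = 0.25)
format_assoc(assoc)       # "greek:0.62, ellenos:0.36, fat:0.25"

# A term with no associations gives an empty string
format_assoc(numeric(0))  # ""
```

Applied per word, this could replace the toJSON call inside the sapply above, for example format_assoc(findAssocs(tdm, x, 0.7)[[1]]).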
Data:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
(text.df <- data.frame(word=c("oil", "opec", "xyz"), freq=c(3, 2, 1), stringsAsFactors=FALSE))
# word freq
# 1 oil 3
# 2 opec 2
# 3 xyz 1

function of corr.test() and cor() to get pearson correlation and P value in R

I am trying to get a matrix of P values and a matrix of correlations.
For example, I used the code below to create data:
library(psych)
xx <- matrix(rnorm(16), 4, 4)
xx
[,1] [,2] [,3] [,4]
[1,] 1.2349830 -0.23417979 -1.0380279 0.2119736
[2,] 0.9540995 0.05405983 0.4438048 1.8375497
[3,] 0.1583041 -1.29936451 -0.6030342 -0.4052208
[4,] 0.4524374 1.03351913 1.3253830 -0.4829464
When I tried to run
corr.test(xx)
I got the error below:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA-NA’
But I never set the row or column names, so they should be NULL. I checked the row and column names with the code below:
> any(duplicated(colnames(xx)))
[1] FALSE
> any(duplicated(rownames(xx)))
[1] FALSE
However, when I used the other correlation function, cor(), to get the correlation matrix of xx:
co<-cor(xx)
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.24090246 -0.28707770 0.58664566
[2,] 0.2409025 1.00000000 0.79523833 0.06658293
[3,] -0.2870777 0.79523833 1.00000000 0.04730974
[4,] 0.5866457 0.06658293 0.04730974 1.00000000
That works fine, so I thought I could use corr.p() to get the p values quickly. But I got the same error again!
corr.p(co,16)
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA-NA’
I am confused: I never set the rownames or colnames, and I checked for duplicate 'row.names' as well, so why do I still get the error? Did I miss an important point?
library(psych)
set.seed(123)
xx <- matrix(rnorm(16), 4, 4)
xx <- as.data.frame(xx)
out <- corr.test(xx)
print(out, short = FALSE)
# Call:corr.test(x = xx)
# Correlation matrix
# V1 V2 V3 V4
# V1 1.00 -0.02 0.96 -0.49
# V2 -0.02 1.00 -0.27 -0.75
# V3 0.96 -0.27 1.00 -0.22
# V4 -0.49 -0.75 -0.22 1.00
# Sample Size
# [1] 4
# Probability values (Entries above the diagonal are adjusted for multiple tests.)
# V1 V2 V3 V4
# V1 0.00 1.00 0.25 1
# V2 0.98 0.00 1.00 1
# V3 0.04 0.73 0.00 1
# V4 0.51 0.25 0.78 0
# To see confidence intervals of the correlations, print with the short=FALSE option
# Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
# lower r upper p
# V1-V2 -0.96 -0.02 0.96 0.98
# V1-V3 -0.04 0.96 1.00 0.04
# V1-V4 -0.99 -0.49 0.89 0.51
# V2-V3 -0.98 -0.27 0.93 0.73
# V2-V4 -0.99 -0.75 0.75 0.25
# V3-V4 -0.97 -0.22 0.94 0.78
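The conversion with as.data.frame gives the columns default names (V1 to V4), whereas the bare matrix had none. If only the unadjusted p values are needed, they can also be computed in base R with cor.test, one pair of columns at a time. The sketch below is a swapped-in alternative to corr.test, not a description of what psych does internally.

```r
set.seed(123)
xx <- matrix(rnorm(16), 4, 4)

colnames(xx)                 # NULL: the matrix has no column names
names(as.data.frame(xx))     # "V1" "V2" "V3" "V4"

# Matrix of unadjusted Pearson p values, filled pair by pair
n <- ncol(xx)
p <- matrix(NA_real_, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) p[i, j] <- cor.test(xx[, i], xx[, j])$p.value
  }
}
round(p, 2)
```

The diagonal is left as NA because a column's correlation with itself is not tested.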

How to get list of lists in R with my data

Here is a function that works fine, but it gives the output in a format I don't want.
frequencies <- {}
for (k in (1:4))
{
  interval <- (t(max_period_set[k]))
  intervals <- round(quantile(interval,c(0,0.05,0.15,0.25,0.35,0.45,0.55,0.65,0.75,0.85,0.95,1.0)))
  frequency <- {}
  for (i in (2:length(intervals)))
  {
    count = 0;
    for (r in (1:length(interval)))
    {
      if (r == length(interval))
      {
        if (interval[r] >= intervals[i-1] && interval[r] <= intervals[i])
        {
          count = count + 1
        }
      }
      else
      {
        if (interval[r] >= intervals[i-1] && interval[r] < intervals[i])
        {
          count = count + 1
        }
      }
    }
    frequency <- c(frequency,count)
  }
  frequencies[[length(frequencies)+1]] <- frequency
}
The output is as follows:
> frequencies
[[1]]
[1] 2 6 5 4 5 4 6 5 5 5 3
[[2]]
[1] 1 7 5 4 5 4 5 6 5 5 3
[[3]]
[1] 3 5 5 4 5 4 6 5 5 5 3
[[4]]
[1] 3 5 5 4 4 6 5 5 5 5 3
I would like to have it in a format as follows:
[[],[],[],[]], that is, a list of lists whose first element I can access with frequencies[1] to get the first list, and so on.
If that is not possible, how can I access the values of the first list in my current format? frequencies[1] does not give me the first list's values back.
Thanks for your help!
Another question:
Now I can access the data, but R prints the last element in a different format:
[[1]]
[1] 1.00 0.96 0.84 0.74 0.66 0.56 0.48 0.36 0.26 0.16 0.06 0.00
[[2]]
[1] 1.00 0.98 0.84 0.74 0.66 0.56 0.48 0.38 0.26 0.16 0.06 0.00
[[3]]
[1] 1.00 0.94 0.84 0.74 0.66 0.56 0.48 0.36 0.26 0.16 0.06 0.00
[[4]]
[1] 1.000000e+00 9.400000e-01 8.400000e-01 7.400000e-01 6.600000e-01 5.800000e-01 4.600000e-01 3.600000e-01 2.600000e-01 1.600000e-01 6.000000e-02 1.110223e-16
Why is this happening with the precision? The first three lines are as they should be, but the last line is odd. The values are plain two-decimal numbers, so they could be printed with two digits after the decimal point.
frequencies is a list, so you need
frequencies[[1]]
to access the first element. If the list were named, you could also index by element name.
Lists are the most general data structure, and the only one that can
be nested: lists within lists within ...
be ragged: does not require rectangular dimensions
so you should try to overcome any initial aversion to the fact that they are different. These are very powerful data structures, and are used a lot behind the scenes.
Edit: Also, a number of base functions as well as add-on packages can post-process lists. It starts with something basic like do.call(), goes on to lapply, and ends all the way over at the plyr package. Keep reading -- there are many ways to skin the same cat, some better than others.
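A short illustration of the indexing difference and of the post-processing functions mentioned, using a small made-up list in place of the real frequencies (the element names first and second are chosen only for the example):

```r
# Made-up stand-in for the real frequencies list
frequencies <- list(first = c(2, 6, 5), second = c(1, 7, 5))

frequencies[[1]]        # the first numeric vector itself
frequencies[["first"]]  # the same element, indexed by name
frequencies$second      # name-based shortcut
frequencies[1]          # single bracket: still a list, of length one

# Post-processing: one value per element, or all elements bound into a matrix
totals   <- sapply(frequencies, sum)
combined <- do.call(rbind, frequencies)
```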
While I completely agree with Dirk on the usefulness of lists, you can, if all of your lists are the same length, convert them to a data frame using as.data.frame(). You can then index them by column i with frequencies[,i] or by row j with frequencies[j,].
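A minimal sketch of that conversion, again with made-up values. The names set1, set2, set3 are assumptions added so the columns get readable names; each list element becomes one column of the data frame.

```r
# Made-up stand-in for the real frequencies list (all elements the same length)
frequencies <- list(c(2, 6, 5), c(1, 7, 5), c(3, 5, 5))
names(frequencies) <- paste0("set", seq_along(frequencies))

freq_df <- as.data.frame(frequencies)

freq_df[, 1]  # column 1: the first original vector
freq_df[2, ]  # row 2: the second value from every vector
```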

Resources