Confusion on a base R reshape example

Can you please explain this behavior: why are wide and wide2 not identical, and why does reshape work on wide but not on wide2?
wide <- reshape(Indometh, v.names = "conc", idvar = "Subject",
timevar = "time", direction = "wide")
wide
# Subject conc.0.25 conc.0.5 conc.0.75 conc.1 conc.1.25 conc.2 conc.3 conc.4 conc.5 conc.6 conc.8
# 1 1 1.50 0.94 0.78 0.48 0.37 0.19 0.12 0.11 0.08 0.07 0.05
# 12 2 2.03 1.63 0.71 0.70 0.64 0.36 0.32 0.20 0.25 0.12 0.08
# 23 3 2.72 1.49 1.16 0.80 0.80 0.39 0.22 0.12 0.11 0.08 0.08
# 34 4 1.85 1.39 1.02 0.89 0.59 0.40 0.16 0.11 0.10 0.07 0.07
# 45 5 2.05 1.04 0.81 0.39 0.30 0.23 0.13 0.11 0.08 0.10 0.06
# 56 6 2.31 1.44 1.03 0.84 0.64 0.42 0.24 0.17 0.13 0.10 0.09
reshape(wide) # ok
wide2 <- wide[,1:ncol(wide)]
reshape(wide2) # Error in match.arg(direction, c("wide", "long")) : argument "direction" is missing, with no default
Some diagnosis:
identical(wide,wide2) # FALSE
dplyr::all_equal(wide,wide2) # TRUE
all.equal(wide,wide2)
# [1] "Attributes: < Names: 1 string mismatch >" "Attributes: < Length mismatch: comparison on first 2 components >"
# [3] "Attributes: < Component 2: Modes: list, numeric >" "Attributes: < Component 2: names for target but not for current >"
# [5] "Attributes: < Component 2: Length mismatch: comparison on first 5 components >" "Attributes: < Component 2: Component 1: Modes: character, numeric >"
# [7] "Attributes: < Component 2: Component 1: target is character, current is numeric >" "Attributes: < Component 2: Component 2: Modes: character, numeric >"
# [9] "Attributes: < Component 2: Component 2: target is character, current is numeric >" "Attributes: < Component 2: Component 3: Modes: character, numeric >"
# [11] "Attributes: < Component 2: Component 3: target is character, current is numeric >" "Attributes: < Component 2: Component 4: Numeric: lengths (11, 1) differ >"
# [13] "Attributes: < Component 2: Component 5: Modes: character, numeric >" "Attributes: < Component 2: Component 5: Lengths: 11, 1 >"
# [15] "Attributes: < Component 2: Component 5: Attributes: < Modes: list, NULL > >" "Attributes: < Component 2: Component 5: Attributes: < Lengths: 1, 0 > >"
# [17] "Attributes: < Component 2: Component 5: Attributes: < names for target but not for current > >" "Attributes: < Component 2: Component 5: Attributes: < current is not list-like > >"
# [19] "Attributes: < Component 2: Component 5: target is matrix, current is numeric >"

This is because subsetting the wide data frame removes the custom attributes that reshape added and later uses to perform the opposite reshaping automatically.
In fact, as you can see, the attribute list of wide contains a reshapeWide component storing all the information needed to revert the reshape:
> names(attributes(wide))
[1] "row.names" "names" "class" "reshapeWide"
> attributes(wide)$reshapeWide
$v.names
[1] "conc"
$timevar
[1] "time"
$idvar
[1] "Subject"
$times
[1] 0.25 0.50 0.75 1.00 1.25 2.00 3.00 4.00 5.00 6.00 8.00
$varying
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] "conc.0.25" "conc.0.5" "conc.0.75" "conc.1" "conc.1.25" "conc.2" "conc.3" "conc.4" "conc.5" "conc.6" "conc.8"
while wide2 does not:
> names(attributes(wide2))
[1] "names" "class" "row.names"
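If you need reshape() to keep working on a subsetted copy, one workaround (a sketch, not the only option) is to copy the reshapeWide attribute back by hand:

```r
# Sketch: restore the attribute that subsetting dropped, so reshape()
# can again infer the long-format parameters from it.
wide <- reshape(Indometh, v.names = "conc", idvar = "Subject",
                timevar = "time", direction = "wide")
wide2 <- wide[, 1:ncol(wide)]                      # subsetting drops "reshapeWide"
attr(wide2, "reshapeWide") <- attr(wide, "reshapeWide")
long2 <- reshape(wide2)                            # works again
```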

How to find and replace min value in dataframe with text in r

I have a dataframe with 20 columns, and I would like to identify the minimum value in each column and replace it with text such as "min". Appreciate any help.
sample data :
a b c
-0.05 0.31 0.62
0.78 0.25 -0.01
0.68 0.33 -0.04
-0.01 0.30 0.56
0.55 0.28 -0.03
Desired output
a b c
min 0.31 0.62
0.78 min -0.01
0.68 0.33 min
-0.01 0.30 0.56
0.55 0.28 -0.03
You can apply a function to each column that replaces the minimum value with a string. This returns a matrix, which could be converted into a data frame if desired. As IceCreamToucan pointed out, all values will be of type character, since a matrix can hold only a single type:
apply(df, 2, function(x) {
  x[x == min(x)] <- 'min'
  return(x)
})
a b c
[1,] "min" "0.31" "0.62"
[2,] "0.78" "min" "-0.01"
[3,] "0.68" "0.33" "min"
[4,] "-0.01" "0.3" "0.56"
[5,] "0.55" "0.28" "-0.03"
You can use the method below, but know that this converts all your columns to character, since vectors must have elements which all have the same type.
library(dplyr)
df %>%
  mutate_all(~ replace(.x, which.min(.x), 'min'))
# a b c
# 1 min 0.31 0.62
# 2 0.78 min -0.01
# 3 0.68 0.33 min
# 4 -0.01 0.3 0.56
# 5 0.55 0.28 -0.03
apply(df, MARGIN = 2, FUN = function(x) { x[which.min(x)] <- 'min'; return(x) })  # note: which.min() replaces only the first minimum in case of ties
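If you want a data frame back rather than the matrix that apply() returns, here is a base-R sketch (no dplyr needed) using the sample data from the question:

```r
df <- data.frame(a = c(-0.05, 0.78, 0.68, -0.01, 0.55),
                 b = c(0.31, 0.25, 0.33, 0.30, 0.28),
                 c = c(0.62, -0.01, -0.04, 0.56, -0.03))
# Replace the (first) minimum of each column, then rebuild a data.frame;
# all columns become character, as discussed above.
out <- as.data.frame(lapply(df, function(x) replace(as.character(x), which.min(x), "min")))
```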

Matrix to Matrix Votes conversion

I have the output of a classification algorithm in the form of probabilities. It looks like:
> head(testProb)
B U
1 0.98 0.02
2 0.80 0.20
3 0.14 0.86
4 0.91 0.09
5 0.25 0.75
6 0.86 0.14
When I checked the class, it was:
> class(testProb)
[1] "matrix" "votes"
If I take a subset of it, for example the 1st column:
a <- testProb[,1]
> head(a)
1 2 3 4 5 6
0.98 0.80 0.14 0.91 0.25 0.86
> class(a)
[1] "numeric"
I have another such classification output matrix, but it does not have the class "matrix" "votes". How can I convert it into "matrix" "votes", so that when I take a subset I get the values in the same form as before?
> head(prediction)
B U
[1,] 0.9413505 0.05864955
[2,] 0.8758474 0.12415256
[3,] 0.2271516 0.77284845
[4,] 0.9227356 0.07726441
[5,] 0.1838987 0.81610128
[6,] 0.9253403 0.07465969
> class (prediction)
[1] "matrix"
> a <- prediction[,1]
> head(a)
[1] 0.9413505 0.8758474 0.2271516 0.9227356 0.1838987 0.9253403
> class(a)
[1] "numeric"
In this case as well, I want a in the same form as in the first case. Your help will be appreciated.
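Not a definitive answer, but the named output in the first case most likely comes from the matrix's rownames rather than from the "votes" class: subsetting a matrix that has rownames returns a named vector. A sketch with made-up values:

```r
# Hypothetical probabilities standing in for the real prediction matrix
prediction <- matrix(c(0.94, 0.88, 0.23, 0.06, 0.12, 0.77), ncol = 2,
                     dimnames = list(NULL, c("B", "U")))
rownames(prediction) <- seq_len(nrow(prediction))  # add rownames like testProb has
class(prediction) <- c("matrix", "votes")          # mimic the class, if you also want that
a <- prediction[, 1]
head(a)  # now a named numeric vector with names "1" "2" "3"
```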

How to apply findAssoc against each row of data.frame

I created a data.frame that holds my words and their frequencies. Now I would like to run findAssocs against every row of my frame, but I cannot get my code to work. Any help is appreciated.
Here is an example of my data.frame term.df
term.df <- data.frame(word = names(v),freq=v)
word freq
ounce 8917
pack 6724
count 4992
organic 3696
frozen 2534
free 1728
I created a TermDocumentMatrix tdm and the following code works as expected.
findAssocs(tdm, 'frozen', 0.20)
I would like to append the output of findAssocs as a new column.
Here's the code I tried:
library(dplyr)
library(tm)
library(pbapply)
#I would like to append all findings in a new column
res <- merge(do.call(rbind.data.frame, pblapply(term.df, findAssocs(tdm, term.df$word , 0.18))),
term.df[, c("word")], by.x="list.q", by.y="word", all.x=TRUE)
EDIT:
As for the output: the single statement above gets me something like this.
$yogurt
greek ellenos fat chobani dannon fage yoplait nonfat wallaby
0.62 0.36 0.25 0.24 0.24 0.24 0.24 0.22 0.20
I was hoping it would be possible to add a single column to my original table (ASSOC) and put the results as comma separated name:value tuples but I'm really open to ideas.
I think a structure that would be simplest to handle would be a nested list:
lapply(seq_len(nrow(text.df)), function(i) {
  list(word = text.df$word[i],
       freq = text.df$freq[i],
       assoc = findAssocs(tdm, as.character(text.df$word[i]), 0.7)[[1]])
})
# [[1]]
# [[1]]$word
# [1] "oil"
#
# [[1]]$freq
# [1] 3
#
# [[1]]$assoc
# 15.8 opec clearly late trying who winter analysts
# 0.87 0.87 0.80 0.80 0.80 0.80 0.80 0.79
# said meeting above emergency market fixed that prices
# 0.78 0.77 0.76 0.75 0.75 0.73 0.73 0.72
# agreement buyers
# 0.71 0.70
#
#
# [[2]]
# [[2]]$word
# [1] "opec"
#
# [[2]]$freq
# [1] 2
#
# [[2]]$assoc
# meeting emergency oil 15.8 analysts buyers above
# 0.88 0.87 0.87 0.85 0.85 0.83 0.82
# said ability they prices. agreement but clearly
# 0.82 0.80 0.80 0.79 0.76 0.74 0.74
# december. however, late production sell trying who
# 0.74 0.74 0.74 0.74 0.74 0.74 0.74
# winter quota that through bpd market
# 0.74 0.73 0.73 0.73 0.70 0.70
#
#
# [[3]]
# [[3]]$word
# [1] "xyz"
#
# [[3]]$freq
# [1] 1
#
# [[3]]$assoc
# numeric(0)
In my experience this will be easier to handle than a nested string because you can still access the word associations for each row of your original text.df object by accessing the corresponding element in the outputted list.
If you really want to keep a data frame structure, then you could pretty easily convert the findAssocs output to a string representation, for instance using toJSON:
library(RJSONIO)
text.df$assoc <- sapply(text.df$word, function(x) toJSON(findAssocs(tdm, x, 0.7)[[1]], collapse=""))
text.df
# word freq
# 1 oil 3
# 2 opec 2
# 3 xyz 1
# assoc
# 1 { "15.8": 0.87,"opec": 0.87,"clearly": 0.8,"late": 0.8,"trying": 0.8,"who": 0.8,"winter": 0.8,"analysts": 0.79,"said": 0.78,"meeting": 0.77,"above": 0.76,"emergency": 0.75,"market": 0.75,"fixed": 0.73,"that": 0.73,"prices": 0.72,"agreement": 0.71,"buyers": 0.7 }
# 2 { "meeting": 0.88,"emergency": 0.87,"oil": 0.87,"15.8": 0.85,"analysts": 0.85,"buyers": 0.83,"above": 0.82,"said": 0.82,"ability": 0.8,"they": 0.8,"prices.": 0.79,"agreement": 0.76,"but": 0.74,"clearly": 0.74,"december.": 0.74,"however,": 0.74,"late": 0.74,"production": 0.74,"sell": 0.74,"trying": 0.74,"who": 0.74,"winter": 0.74,"quota": 0.73,"that": 0.73,"through": 0.73,"bpd": 0.7,"market": 0.7 }
# 3 [ ]
Data:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
(text.df <- data.frame(word=c("oil", "opec", "xyz"), freq=c(3, 2, 1), stringsAsFactors=FALSE))
# word freq
# 1 oil 3
# 2 opec 2
# 3 xyz 1
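If you would rather have the comma-separated name:value tuples mentioned in the question than JSON, paste() can collapse each findAssocs result into one string. A sketch on a mock named vector (standing in for a findAssocs result, so it runs without tm; the values are hypothetical):

```r
# Mock findAssocs()-style output: a named numeric vector
assoc <- c(greek = 0.62, ellenos = 0.36, fat = 0.25)
# Collapse to a single "name:value, name:value, ..." string for a character column
paste(names(assoc), assoc, sep = ":", collapse = ", ")
# "greek:0.62, ellenos:0.36, fat:0.25"
```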

Using corr.test() and cor() to get Pearson correlations and p-values in R

I am trying to get a matrix of p-values and a matrix of correlations.
For example, I used the code below to create data
library(psych)
xx <- matrix(rnorm(16), 4, 4)
xx
[,1] [,2] [,3] [,4]
[1,] 1.2349830 -0.23417979 -1.0380279 0.2119736
[2,] 0.9540995 0.05405983 0.4438048 1.8375497
[3,] 0.1583041 -1.29936451 -0.6030342 -0.4052208
[4,] 0.4524374 1.03351913 1.3253830 -0.4829464
when I tried to run
corr.test(xx)
But I got the error as below:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA-NA’
But I had never set row or column names before, and I thought they could be NULL. I checked the row and column names with the code below:
> any(duplicated(colnames(xx)))
[1] FALSE
> any(duplicated(rownames(xx)))
[1] FALSE
However, I used the other correlation function cor() to get the matrix of correlation of xx:
co<-cor(xx)
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.24090246 -0.28707770 0.58664566
[2,] 0.2409025 1.00000000 0.79523833 0.06658293
[3,] -0.2870777 0.79523833 1.00000000 0.04730974
[4,] 0.5866457 0.06658293 0.04730974 1.00000000
Now it works fine, so I thought maybe I could use corr.p() to get the p-values quickly. But I got the same error again:
corr.p(co,16)
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA-NA’
I am confused: I never set the rownames and colnames, and I checked for duplicate 'row.names' as well. Why do I still get the error? Did I miss an important point?
Converting the matrix to a data frame gives the columns default names (V1 ... V4), after which corr.test() runs:
library(psych)
set.seed(123)
xx <- matrix(rnorm(16), 4, 4)
xx <- as.data.frame(xx)
out <- corr.test(xx)
print(out, short = FALSE)
# Call:corr.test(x = xx)
# Correlation matrix
# V1 V2 V3 V4
# V1 1.00 -0.02 0.96 -0.49
# V2 -0.02 1.00 -0.27 -0.75
# V3 0.96 -0.27 1.00 -0.22
# V4 -0.49 -0.75 -0.22 1.00
# Sample Size
# [1] 4
# Probability values (Entries above the diagonal are adjusted for multiple tests.)
# V1 V2 V3 V4
# V1 0.00 1.00 0.25 1
# V2 0.98 0.00 1.00 1
# V3 0.04 0.73 0.00 1
# V4 0.51 0.25 0.78 0
# To see confidence intervals of the correlations, print with the short=FALSE option
# Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
# lower r upper p
# V1-V2 -0.96 -0.02 0.96 0.98
# V1-V3 -0.04 0.96 1.00 0.04
# V1-V4 -0.99 -0.49 0.89 0.51
# V2-V3 -0.98 -0.27 0.93 0.73
# V2-V4 -0.99 -0.75 0.75 0.25
# V3-V4 -0.97 -0.22 0.94 0.78
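Our reading of the error (an interpretation, not confirmed by the psych documentation): corr.test() labels its confidence-interval rows with pairs of column names such as "V1-V2"; on a matrix with no colnames these all become "NA-NA", hence the duplicate 'row.names' error. Giving the matrix column names is therefore a minimal alternative to converting it to a data frame:

```r
set.seed(123)
xx <- matrix(rnorm(16), 4, 4)
colnames(xx) <- paste0("V", 1:4)  # name the columns before calling corr.test()
# corr.test(xx) should now run without the duplicate 'row.names' error
```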

How to get list of lists in R with my data

Here is the code, which works fine but gives the output in a format I don't want.
frequencies <- {}
for (k in 1:4) {
  interval <- t(max_period_set[k])
  intervals <- round(quantile(interval, c(0, 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 1.0)))
  frequency <- {}
  for (i in 2:length(intervals)) {
    count <- 0
    for (r in 1:length(interval)) {
      if (r == length(interval)) {
        if (interval[r] >= intervals[i-1] && interval[r] <= intervals[i]) {
          count <- count + 1
        }
      } else {
        if (interval[r] >= intervals[i-1] && interval[r] < intervals[i]) {
          count <- count + 1
        }
      }
    }
    frequency <- c(frequency, count)
  }
  frequencies[[length(frequencies) + 1]] <- frequency
}
The output is as follows:
> frequencies
[[1]]
[1] 2 6 5 4 5 4 6 5 5 5 3
[[2]]
[1] 1 7 5 4 5 4 5 6 5 5 3
[[3]]
[1] 3 5 5 4 5 4 6 5 5 5 3
[[4]]
[1] 3 5 5 4 4 6 5 5 5 5 3
I would like to have it in a format as follows:
[[],[],[],[]], i.e. a list of lists whose first element I can access like frequencies[1] to get the first list, etc.
If that is not possible, how can I access the first list's values in my current format? frequencies[1] does not give me the first list's values back.
Thanks for your help!
Another question:
now I can access the data, but R represents the last line in a different format:
[[1]]
[1] 1.00 0.96 0.84 0.74 0.66 0.56 0.48 0.36 0.26 0.16 0.06 0.00
[[2]]
[1] 1.00 0.98 0.84 0.74 0.66 0.56 0.48 0.38 0.26 0.16 0.06 0.00
[[3]]
[1] 1.00 0.94 0.84 0.74 0.66 0.56 0.48 0.36 0.26 0.16 0.06 0.00
[[4]]
[1] 1.000000e+00 9.400000e-01 8.400000e-01 7.400000e-01 6.600000e-01 5.800000e-01 4.600000e-01 3.600000e-01 2.600000e-01 1.600000e-01 6.000000e-02 1.110223e-16
Why is this happening to the accuracy? The first three lines are as they should be, but the last line is odd; the numbers should be representable with two digits after the decimal point.
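On the 1.110223e-16: that value is essentially zero and is ordinary binary floating-point rounding error. Numbers like 0.94 and 0.06 have no exact binary representation, so repeated subtraction can leave a tiny residue; R then prints the whole vector in one common format, which is why the entire line switches to scientific notation. Rounding before printing is a simple fix, sketched here:

```r
# Decimal fractions are not exact in binary, so this need not be exactly 0
x <- 1 - 0.94 - 0.06
# Rounding to the intended precision removes the residue
round(x, 2)  # 0
```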
frequencies is a list, so you need
frequencies[[1]]
to access the first element. If list were named, you could also index by element name.
Lists are the most general data structure, and the only one that can
- be nested: lists within lists within ...
- be ragged: no rectangular dimensions required

so you should try to overcome your initial aversion to the fact that they are different. These are very powerful data structures, and are used a lot behind the scenes.
Edit: Also, a number of base functions as well as add-on packages can post-process lists. It starts with something basic like do.call(), goes on to lapply(), and ends all the way over at the plyr package. Keep reading -- there are many ways to skin the same cat, some better than others.
While I completely agree with Dirk on the usefulness of lists: if all of your lists are the same length, you can convert them to a data frame using as.data.frame(), and then index by column i with frequencies[, i] or by row j with frequencies[j, ].
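A side note on the counting loops in the question: they can be condensed with findInterval() and tabulate(). This is a sketch on hypothetical data (max_period_set is not available here), and rightmost.closed = TRUE only approximates the loop's special-casing of the last element:

```r
interval <- c(3, 7, 12, 18, 25, 31, 40, 55, 61, 70)  # hypothetical data
intervals <- round(quantile(interval, c(0, 0.25, 0.5, 0.75, 1)))
# Bin each value; rightmost.closed = TRUE puts the maximum into the last bin
bins <- findInterval(interval, intervals, rightmost.closed = TRUE)
frequency <- tabulate(bins, nbins = length(intervals) - 1)
```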
