Singular value decomposition (SVD) to tfidf dataframe using mapReduce - jupyter-notebook

Here is my dataframe. I used np.linalg.svd, but it doesn't use MapReduce, so I want a similar function to np.linalg.svd that does.
# there are 5 documents
print(tfidf)
A portion of the output:
[{'poorly': 0.03095072908527116, 'respect': 0.0, 'got': 0.0, 'interpretation': 0.0, 'pretty': 0.0, 'regular': 0.03095072908527116, 'issues': 0.03095072908527116, 'glad': 0.0, 'lunar': 0.06190145817054232, 'one': 0.0, 'complex': 0.0, 'rockets': 0.0, 'might': 0.0, 'possible': 0.03095072908527116, 'ritual': 0.0, 'luck': 0.0, 'quite': 0.03095072908527116, 'crash': 0.03095072908527116, 'play': 0.0, 'least': 0.0, 'contest': 0.0, 'fighters': 0.0, 'corps': 0.0, 'result': 0.0, 'low': 0.017620975612964523, 'would': 0.0, 'flying': 0.0, 'missions': 0.03095072908527116...}]
Then applying np.linalg.svd:
X = pd.DataFrame(tfidf).T
Output:
0 1 2 3 4
poorly 0.030951 0.000000 0.000000 0.000000 0.000000
respect 0.000000 0.000000 0.025547 0.000000 0.000000
got 0.000000 0.050905 0.000000 0.000000 0.029558
interpretation 0.000000 0.000000 0.000000 0.000000 0.051917
pretty 0.000000 0.000000 0.000000 0.000000 0.051917
... ... ... ... ... ...
g 0.000000 0.000000 0.051093 0.000000 0.000000
due 0.030951 0.000000 0.000000 0.000000 0.000000
development 0.000000 0.000000 0.051093 0.000000 0.000000
...
Now my goal is to do the same process, but with a function that uses MapReduce.
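One common way to express an SVD in map/reduce terms, when the matrix is tall and skinny (many terms, only 5 documents here), is to go through the Gram matrix. The sketch below is an illustration under stated assumptions, not a drop-in replacement for np.linalg.svd. The function names map_outer and mapreduce_svd are hypothetical, and Python's built-in map and functools.reduce stand in for a real MapReduce framework. The map step turns each row of X into a small outer product, the reduce step sums them into X^T X, and the small eigenproblem is then solved locally.

```python
from functools import reduce
import numpy as np

def map_outer(row):
    """Map step: one row of X -> its d x d outer product."""
    return np.outer(row, row)

def mapreduce_svd(rows, tol=1e-10):
    """Thin SVD of a tall-skinny matrix given as an iterable of rows."""
    rows = [np.asarray(r, dtype=float) for r in rows]
    # Reduce step: summing the outer products gives the Gram matrix X^T X.
    gram = reduce(np.add, map(map_outer, rows))
    # The Gram matrix is only d x d, so its eigenproblem is solved locally.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, V = eigvals[order], eigvecs[:, order]
    s = np.sqrt(np.clip(eigvals, 0.0, None))   # singular values
    keep = s > tol * s[0]                      # drop numerically zero directions
    s, V = s[keep], V[:, keep]
    # Second map step: each row of U depends only on the matching row of X.
    U = np.array(list(map(lambda r: (r @ V) / s, rows)))
    return U, s, V.T

# Usage on random data shaped like the question's matrix (terms x 5 documents).
rng = np.random.default_rng(0)
X = rng.random((50, 5))
U, s, Vt = mapreduce_svd(X)
print(np.allclose(U @ np.diag(s) @ Vt, X))
print(np.allclose(s, np.linalg.svd(X, compute_uv=False)))
```

With the dataframe from the question, the rows could be passed as X.to_numpy(). Note that forming X^T X squares the condition number, so very small singular values are less accurate than with np.linalg.svd, and singular vectors may differ from NumPy's by a sign.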

Related

How to get all dots in boxplot using ggplot?

I would like to display a boxplot with a dot for each of my data points.
Here is a downsampled subset of my data:
value value1 value2 value3 value4 value5 value6 value7 value8 value9 value10 value11 value12 value13 value14 value15 value16 value17 value18 value19 value20 value21 value22 value23 value24 value25 value26 value27 value28 value29 value30 value31 value32 value33 value34 value35 value36 value37 value38 value39 value40 value41 value42 value43 value44 value45 value46 value47 value48 value49 value50 value51 value52 value53 value54 value55 value56 value57 value58 value59 value60 value61 value62 value63 value64 value65 value66 value67 value68 value69 value70 value71 value72 value73 value74 value75 value76 value77 value78 value79 value80 value81 value82 value83 value84 value85 value86 value87 value88 value89 value90 value91 value92 value93
1 DLBCL 1994.95631 2621.3410 753.2132 0.000000 11197.10111 0.000000 176.337991 2000.983371 862.402989 8491.35251 0.000000 0.000000 0.000000 0.000000 0.000000 1293.604484 431.201495 11022.058175 6899.22391 1557.191604 0.00000 0.0000000 491.33939 0.00000 935.4880 473.089640 117093.3704 267.06673 0.000000 1201.315893 546.473181 817.685797 5550.213652 5864.340327 0.000000 756.0793 1186.963254 0.000000 0.000000 182.35834 0.000000 0.000000 2.221214e+04 546.4731813 0.000000 22467.36115 25197.16560 4527.61569 47851.49797 0.0000000 809.029514 1780.444881 466.4264055 2854.851275 2178.702289 0.000000 1155.2188880 0.000000 0.000000 0.000000 0.0000000 325.947587 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0000000 0.000000 5219.72808 0.000000 1092.946363 1914.235537 0.00000 41395.343 5012.19294 0.0000 0.00000 0.000000 0.00000 211214.036 771.94114 5792.9344 155407.942 586.647915 904.81625 5221.03431 26527.2485 118750.28 103149.05
2 HL 2685.55082 3282.5779 4598.1600 4183.367213 1465.89302 0.000000 66.245848 0.000000 161.991801 61.34601 161.991801 0.000000 485.975403 404.979503 80.995901 80.995901 161.991801 6164.020846 4211.78683 17549.958130 2601.72383 1143.4715367 1292.08891 2101.51526 8785.9960 157.980575 25628.0113 2257.43413 426.060627 3572.830049 410.593080 11519.416962 23630.893343 47042.419019 2594.830952 5964.8488 3901.738003 0.000000 0.000000 376.79150 0.000000 833.100691 1.251683e+05 3797.9859885 4500.351000 231.24480 901.51959 8990.54496 21686.09505 0.0000000 50.655417 0.000000 5081.5230881 766.069601 8594.091339 4754.510950 578.6497823 0.000000 0.000000 540.128957 5906.6921396 1897.982677 0.000000 0.000000 0.00000 517.142472 0.000000 90.021493 0.000000 0.000000 395.929041 51.1553056 0.000000 5501.47987 569.641498 1180.455105 1258.479657 0.00000 31700.549 8406.06103 650.9810 198.52612 1888.006678 183.67574 130532.228 108.74974 3400.4110 58514.733 4600.624542 1019.75167 0.00000 20734.9505 163994.61 181005.92
3 HL 3937.68099 5174.0505 14309.5447 17201.448539 6027.55676 0.000000 1566.266081 246.848582 9575.025066 966.94533 5745.015039 5106.680035 5745.015039 8298.355057 5745.015039 8936.690061 3830.010026 2595.831304 0.00000 3842.016327 932.01765 0.0000000 0.00000 0.00000 12463.7614 2256.666225 105760.7753 165061.07726 2014.690206 296.397390 808.979015 0.000000 684.694530 0.000000 1120.551505 47009.4381 0.000000 0.000000 0.000000 809.86996 0.000000 6565.731474 1.992851e+03 2831.4265541 0.000000 911.22915 0.00000 0.00000 0.00000 0.0000000 0.000000 0.000000 345.2403404 1811.236269 0.000000 1561.277973 0.0000000 0.000000 736.098023 3192.598806 0.0000000 0.000000 0.000000 0.000000 0.00000 9897.983156 0.000000 3015.232206 0.000000 1210.472305 3120.347631 2015.7947507 0.000000 89720.16482 0.000000 0.000000 0.000000 984.42025 23569.292 794.98586 570.0480 0.00000 0.000000 482.52095 42461.843 571.37679 3573.1872 25446.846 1519.791401 0.00000 0.00000 57004.8004 153509.90 112514.3
And here is my code:
data2=read.table("/../data.txt",sep="\t",header=TRUE )
data2 %>%
ggplot( aes(x=name, y=value, value1, value2, value3, value4, value5, value6, value7, value8, value9, value10, value11, value12, value13, value14, value15, value16, value17, value18, value19, value20, value21, value22, value23, value24, value25, value26, value27, value28, value29, value30, value31, value32, value33, value34, value35, value36, value37, value38, value39, value40, value41, value42, value43, value44, value45, value46, value47, value48, value49, value50, value51, value52, value53, value54, value55, value56, value57, value58, value59, value60, value61, value62, value63, value64, value65, value66, value67, value68, value69, value70, value71, value72, value73, value74, value75, value76, value77, value78, value79, value80, value81, value82, value83, value84, value85, value86, value87, value88, value89, value90, value91, value92, value93, fill=name)) +
geom_boxplot() +
scale_fill_viridis(discrete = TRUE, alpha=0.6) +
geom_jitter(color="black", size=0.4, alpha=0.9) +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("Distribution of ... ") +
xlab("")
I got a plot but not all of my data appeared. I suspect only the first column (value) is taken into account.
What did I miss? Does anyone know a trick to get all my dots?
Thanks a lot!
You can try reshaping the data to long format:
library(ggplot2)
library(dplyr)
library(tidyr)
#Code
data2 %>%
rename(key=value) %>%
pivot_longer(-key) %>%
ggplot(aes(x=key,y=value,fill=name))+
geom_boxplot() +
#scale_fill_viridis(discrete = TRUE, alpha=0.6) +
geom_jitter(color="black", size=0.4, alpha=0.9) +
#theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("Distribution of total EBV gene expression for each PTCL subtype ") +
xlab("")
Output:

How can I calculate the progressive mean in a vector BUT restarting when a condition is met?

Given the following vector,
x<-c(0,0,0,5.0,5.1,5.0,6.5,6.7,6.0,0,0,0,0,3.8,4.0,3.6)
I would like to have a vector with the cumulative mean, like
cumsum(x)/seq_along(x)
but restarting the computation each time the difference between two subsequent values is greater than 1.3 or less than -1.3. My aim is to obtain a vector like
d<-c(0,0,0,5,5.05,5.03,6.5,6.6,6.37,0,0,0,0,3.8,3.9,3.8)
You can use cumsum(abs(diff(x)) > 1.3) to define groups, which are then used in aggregate to restart cumsum(x)/seq_along(x) each time the difference is greater than 1.3 or less than -1.3.
unlist(aggregate(x, list(c(0, cumsum(abs(diff(x)) > 1.3))),
function(x) cumsum(x)/seq_along(x))[,2])
# [1] 0.000000 0.000000 0.000000 5.000000 5.050000 5.033333 6.500000 6.600000
# [9] 6.400000 0.000000 0.000000 0.000000 0.000000 3.800000 3.900000 3.800000
Maybe you can try ave + findInterval, like below:
ave(x,findInterval(seq_along(x),which(abs(diff(x))>1.3)+1),FUN = function(v) cumsum(v)/seq_along(v))
which gives
[1] 0.000000 0.000000 0.000000 5.000000 5.050000 5.033333 6.500000 6.600000
[9] 6.400000 0.000000 0.000000 0.000000 0.000000 3.800000 3.900000 3.800000

RStudio cannot find the "setup" function

Here is my code:
library(useful)
nbt.data=read.table("a.txt",sep=",",header=TRUE,row.names=1)
nbt.data=log(nbt.data+1)
corner(nbt.data)
Hi_2338_1 Hi_2338_2 Hi_2338_3 Hi_2338_4 Hi_2338_5
A1BG 2.310553 0.00000000 0.000000 1.0116009 0
A1BG-AS1 0.000000 0.00000000 1.497388 0.3074847 0
A1CF 0.000000 0.04879016 0.000000 0.0000000 0
A2LD1 0.000000 0.00000000 0.000000 0.2546422 0
A2M 0.000000 0.00000000 0.000000 0.0000000 0
dim(nbt.data)
[1] 23730 301
nbt=new("seurat",raw.data=nbt.data)
nbt=setup(nbt,project="NBT",min.cells = 3,names.field = 2,names.delim = "_",min.genes = 1000,is.expr=1,)
Error in setup(nbt, project = "NBT", min.cells = 3, names.field = 2, names.delim = "_", :
could not find function "setup"

Interpreting GARCH(1,1) results

How should I read the results I got from my Garch-model?
Does this mean that none of my external regressors had any impact?
Conditional Variance Dynamics
-----------------------------------
GARCH Model  : sGARCH(1,1)
Mean Model   : ARFIMA(1,0,0)
Distribution : ghyp

Optimal Parameters
------------------------------------
          Estimate  Std. Error     t value  Pr(>|t|)
mu        0.007363    0.005930    1.241772  0.214321
ar1       0.088732    0.031644    2.804059  0.005046
omega     0.000002    0.000004    0.534986  0.592660
alpha1    0.049419    0.015292    3.231603  0.001231
beta1     0.863324    0.006810  126.771068  0.000000
vxreg1    0.000000    0.002269    0.000004  0.999996
vxreg2    0.000000    0.000000    1.724852  0.084554
vxreg3    0.000000    0.000000    0.595820  0.551296
vxreg4    0.000000    0.000317    0.000032  0.999975
vxreg5    0.000000    0.000157    0.000064  0.999949
vxreg6    0.000000    0.000000    0.027500  0.978061
vxreg7    0.000000    0.000000    0.000000  1.000000
vxreg8    0.000000    0.000067    0.000149  0.999881
skew      0.206258    0.233379    0.883791  0.376809
shape     1.882771    0.128208   14.685333  0.000000
ghlambda -0.464969    0.508095   -0.915121  0.360128

Error when using the "ward" method with the pvclust R package

I am having some trouble with a cluster analysis that I am trying to do with the pvclust package.
Specifically, I have a data matrix composed of species (rows) and sampling stations (columns). I want to perform a cluster analysis (CA) in order to group my sampling stations according to species abundance (which I have previously log(x+1) transformed).
After preparing my matrix, I tried to run the CA with the pvclust package, using Ward's clustering method and Bray-Curtis as the distance index. However, every time I get the following error message:
"Error in hclust(distance, method = method.hclust) :
invalid clustering method"
I then tried the same analysis using another clustering method, and I had no problem. I also tried the same analysis using the hclust function from the vegan package, and it ran without any problems too.
To make my problem clearer, here is part of my matrix and the script that I used to perform the analysis:
P1 P2 P3 P4 P5 P6
1 10.8750000 3.2888889 2.0769231 1.4166667 3.2395833 5.333333
3 0.3645833 0.3027778 0.3212038 0.7671958 0.4993676 0.000000
4 0.0000000 0.0000000 2.3500000 0.0000000 0.0000000 0.264000
5 0.0000000 0.7333333 0.2692308 0.0000000 0.2343750 0.000000
6 0.0000000 0.9277778 0.0000000 0.2936508 0.7291667 0.000000
7 0.4166667 6.3500000 1.0925463 0.5476190 0.1885169 0.000000
8 1.6250000 0.0000000 0.0000000 0.0000000 5.2187500 0.000000
9 0.0000000 0.8111111 0.0000000 0.0000000 0.0000000 0.000000
10 2.6770833 0.6666667 2.3304890 4.5906085 2.9652778 0.000000
15 1.8020833 0.9666667 1.4807137 3.3878968 0.1666667 0.000000
16 17.8750000 4.9555556 1.4615385 6.5000000 7.8593750 7.666667
19 4.5312500 1.0555556 3.5766941 6.7248677 2.3196181 0.000000
20 0.0000000 0.6777778 0.5384615 0.0000000 0.0000000 0.000000
21 0.0000000 0.9777778 0.0000000 0.2500000 0.0000000 0.000000
24 1.2500000 3.0583333 0.1923077 0.0000000 4.9583333 0.000000
25 0.0000000 0.0000000 2.5699634 0.0000000 0.0000000 0.000000
26 6.6666667 2.2333333 24.8730020 55.9980159 17.6239583 0.000000
Here P1-P6 are my sampling stations, and the leftmost row numbers are my different species. I'll refer to this example matrix as "platforms".
I then used the following lines of code:
dist <- function(x, ...){
vegdist(x, ...)
}
result<-pvclust(platforms,method.dist = "bray",method.hclust = "ward")
Note that I ran the first three lines of code because the Bray-Curtis index isn't originally available in the pvclust package. Running them allowed me to specify the Bray-Curtis index in the pvclust function.
Does anyone know why it doesn't work with the pvclust package?
Any help will be much appreciated.
Kind regards,
Marie
There are two related issues:
1. When setting method.hclust you need to pass hclust-compatible methods. In theory pvclust checks for ward and converts it to ward.D, but you probably want to pass the (correct) names of either ward.D or ward.D2.
2. You cannot overwrite dist in that fashion. However, you can pass a custom function to pvclust.
For instance, this should work:
library(vegan)
library(pvclust)
sample.data <- "P1 P2 P3 P4 P5 P6
10.8750000 3.2888889 2.0769231 1.4166667 3.2395833 5.3333330
0.3645833 0.3027778 0.3212038 0.7671958 0.4993676 0.0000000
0.0000000 0.0000000 2.3500000 0.0000000 0.0000000 0.2640000
0.0000000 0.7333333 0.2692308 0.0000000 0.2343750 0.0000000
0.0000000 0.9277778 0.0000000 0.2936508 0.7291667 0.0000000
0.4166667 6.3500000 1.0925463 0.5476190 0.1885169 0.0000000
1.6250000 0.0000000 0.0000000 0.0000000 5.2187500 0.0000000
0.0000000 0.8111111 0.0000000 0.0000000 0.0000000 0.0000000
2.6770833 0.6666667 2.3304890 4.5906085 2.9652778 0.0000000
1.8020833 0.9666667 1.4807137 3.3878968 0.1666667 0.0000000
17.8750000 4.9555556 1.4615385 6.5000000 7.8593750 7.6666670
4.5312500 1.0555556 3.5766941 6.7248677 2.3196181 0.0000000
0.0000000 0.6777778 0.5384615 0.0000000 0.0000000 0.0000000
0.0000000 0.9777778 0.0000000 0.2500000 0.0000000 0.0000000
1.2500000 3.0583333 0.1923077 0.0000000 4.9583333 0.0000000
0.0000000 0.0000000 2.5699634 0.0000000 0.0000000 0.0000000
6.6666667 2.2333333 24.8730020 55.9980159 17.6239583 0.0000000"
platforms <- read.table(text = sample.data, header = TRUE)
result <- pvclust(platforms,
method.dist = function(x){
# pvclust clusters columns, while vegdist() compares rows, so transpose first
vegdist(t(x), "bray")
},
method.hclust = "ward.D")
