Read a tsv file and change the first column [duplicate]

This question already has answers here:
Only read selected columns
(5 answers)
Closed 5 years ago.
I have the following file and I am interested only in the 1st and the last (14th) columns:
sp0000001-mRNA-1 f0651baa110098a342ff92218202e4d0 1016 Pfam PF00226 DnaJ domain 76 137 7.5E-18 T 02-05-2017 IPR001623 DnaJ domain
sp0000001-mRNA-1 f0651baa110098a342ff92218202e4d0 1016 Pfam PF05266 Protein of unknown function (DUF724) 832 1015 3.8E-41 T 02-05-2017 IPR007930 Protein of unknown function DUF724
sp0000001-mRNA-1 f0651baa110098a342ff92218202e4d0 1016 Pfam PF11926 Domain of unknown function (DUF3444) 419 607 2.6E-56 T 02-05-2017 IPR024593 Domain of unknown function DUF3444
sp0000005-mRNA-1 8db7c080b2bc76bf090fec8662fcae20 243 Pfam PF01472 PUA domain 155 232 1.3E-19 T 02-05-2017 IPR002478 PUA domain GO:0003723
sp0000006-mRNA-1 edf5c2bb6341fe44b3da447099a5b2df 282 Pfam PF03083 Sugar efflux transporter for intercellular exchange 198 261 1.4E-15 T 02-05-2017 IPR004316 SWEET sugar transporter GO:0016021
sp0000006-mRNA-1 edf5c2bb6341fe44b3da447099a5b2df 282 Pfam PF03083 Sugar efflux transporter for intercellular exchange 7 91 1.1E-25 T 02-05-2017 IPR004316 SWEET sugar transporter GO:0016021
sp0000006-mRNA-2 edf5c2bb6341fe44b3da447099a5b2df 282 Pfam PF03083 Sugar efflux transporter for intercellular exchange 198 261 1.4E-15 T 02-05-2017 IPR004316 SWEET sugar transporter GO:0016021
sp0000006-mRNA-2 edf5c2bb6341fe44b3da447099a5b2df 282 Pfam PF03083 Sugar efflux transporter for intercellular exchange 7 91 1.1E-25 T 02-05-2017 IPR004316 SWEET sugar transporter GO:0016021
sp0000006-mRNA-3 51ff56e496d48682f7af1b2478190834 235 Pfam PF03083 Sugar efflux transporter for intercellular exchange 130 214 9.6E-24 T 02-05-2017 IPR004316 SWEET sugar transporter GO:0016021
sp0000006-mRNA-3 51ff56e496d48682f7af1b2478190834 235 Pfam PF03083 Sugar efflux transporter for intercellular exchange 7 91 7.5E-26 T 02-05-2017 IPR004316 SWEET sugar transporter GO:0016021
sp0000007-mRNA-1 ed1eda6e176feb124dbef8934b633df0 553 Pfam PF03106 WRKY DNA -binding domain 281 338 2.6E-26 T 02-05-2017 IPR003657 WRKY domain GO:0003700|GO:0006355|GO:0043565
As a result, I want to get the following file:
sp0000001,n/a
sp0000005,GO:0003723
sp0000006,GO:0016021
sp0000007,GO:0003700
sp0000007,GO:0006355
sp0000007,GO:0043565
I tried to read the input file in the following way:
> interproscan <- read.csv(file="ed.tsv", sep = "\t")[1,14]
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
What would be the best way to solve the problem?

Seems like duplicate row names. I tried to save your tsv file, but it didn't save as a tab-separated file for me.
Anyway, try this: set the row names to NULL:
> interproscan <- read.csv(file="ed.tsv", sep = "\t", row.names=NULL)[c(1,14)]
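To go all the way to the desired output (stripping the -mRNA-N suffix, splitting pipe-separated GO terms into separate rows, filling missing ones with n/a, and dropping duplicates), here is a minimal sketch, assuming the file has no header row and the GO terms sit in the 14th field (rows without GO terms are simply shorter):
# read columns 1 and 14; fill pads the shorter rows
interproscan <- read.csv("ed.tsv", sep = "\t", header = FALSE,
                         row.names = NULL, fill = TRUE,
                         stringsAsFactors = FALSE)[c(1, 14)]
names(interproscan) <- c("id", "go")
interproscan$id <- sub("-mRNA-.*$", "", interproscan$id)  # sp0000001-mRNA-1 -> sp0000001
interproscan$go[is.na(interproscan$go) | interproscan$go == ""] <- "n/a"
# one row per GO term, duplicates removed
go_list <- strsplit(interproscan$go, "|", fixed = TRUE)
out <- unique(data.frame(id = rep(interproscan$id, lengths(go_list)),
                         go = unlist(go_list)))
write.table(out, "out.csv", sep = ",", row.names = FALSE,
            col.names = FALSE, quote = FALSE)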

Related

Automize portfolios volatilities computation in R

Thanks for reading my post. I have a series of portfolios created from combinations of several stocks, and I need to compute the volatility of those portfolios using the historical daily performance of each stock. Since I have all the combinations in one data frame (called Final_output) and all the stock returns in another data frame (called perf, where the columns are stocks and the rows are days), I don't know what the most efficient way to automate the process would be. Below you can find an extract:
> Final_output
ISIN_1 ISIN_2 ISIN_3 ISIN_4
2 CH0595726594 CH1111679010 XS1994697115 CH0587331973
3 CH0595726594 CH1111679010 XS1994697115 XS2027888150
4 CH0595726594 CH1111679010 XS1994697115 XS2043119358
5 CH0595726594 CH1111679010 XS1994697115 XS2011503617
6 CH0595726594 CH1111679010 XS1994697115 CH1107638921
7 CH0595726594 CH1111679010 XS1994697115 XS2058783270
8 CH0595726594 CH1111679010 XS1994697115 JE00BGBBPB95
> perf
CH0595726594 CH1111679010 XS1994697115 CH0587331973
626 0.0055616769 -0.0023656130 1.363791e-03 1.215922e-03
627 0.0086094443 0.0060037334 0.000000e+00 2.519220e-03
628 0.0053802380 0.0009027081 0.000000e+00 7.508635e-04
629 -0.0025213543 -0.0022046297 4.864050e-05 1.800720e-04
630 0.0192416817 0.0093401627 -6.079767e-03 3.800836e-03
631 -0.0101224820 0.0051741294 6.116956e-03 -1.345184e-03
632 -0.0013293793 -0.0100475153 -4.494163e-03 -1.746106e-03
633 0.0036350604 0.0012999350 3.801130e-03 -5.997121e-05
634 0.0030097434 -0.0011484496 -1.187614e-03 -2.069131e-03
635 0.0002034381 0.0030493901 -1.851762e-03 -3.806280e-04
636 -0.0035594427 0.0167455769 -2.148123e-04 -4.709560e-04
637 0.0007654623 -0.0051958237 -3.711191e-04 1.604010e-04
638 0.0107592678 -0.0016260163 4.298764e-04 3.397951e-03
639 0.0050953486 -0.0007403020 2.011738e-03 8.790770e-04
640 0.0008532851 -0.0071121648 -9.746114e-04 5.389598e-04
641 -0.0068204614 0.0133810874 -9.755622e-05 -1.346674e-03
642 0.0091395678 0.0102591793 1.717157e-03 -1.977785e-03
643 0.0027520640 -0.0157912638 1.256440e-03 -1.301119e-04
644 -0.0048902196 0.0039494471 -1.624514e-03 -3.373340e-03
645 -0.0116838833 0.0062450826 6.625549e-04 1.205255e-03
646 0.0004566442 -0.0018570102 -3.456636e-03 4.474138e-03
647 0.0041586368 0.0085679315 4.435933e-03 1.957455e-03
648 0.0007575758 0.0002912621 0.000000e+00 2.053306e-03
649 0.0046429473 -0.0138309230 -4.435798e-03 1.541798e-03
650 0.0049731250 -0.0488164953 4.181975e-03 -9.733133e-04
651 0.0008497451 -0.0033110870 2.724477e-04 -7.555498e-04
652 0.0004494831 0.0049831300 -8.657588e-04 -1.790813e-04
653 -0.0058905751 0.0020143588 8.178287e-04 -1.213991e-03
654 0.0000000000 0.0167525773 4.864050e-05 9.365068e-04
655 0.0010043186 0.0048162231 0.000000e+00 -2.110146e-03
656 -0.0024079462 -0.0100403633 -2.431907e-03 -9.176600e-04
657 -0.0095544604 -0.0193670047 0.000000e+00 -8.935435e-03
658 0.0008123477 0.0114339172 2.437835e-03 5.530483e-03
659 0.0022828734 -0.0015415446 -3.239300e-03 2.765060e-03
660 0.0049096523 -0.0001029283 3.199079e-02 2.327835e-03
661 -0.0027702226 -0.0357198003 9.456712e-04 3.189602e-04
662 -0.0008081216 -0.0139311449 -2.891020e-02 -1.295363e-03
663 -0.0033867462 0.0068745264 -2.529552e-03 -1.496588e-04
664 -0.0015216068 -0.0558572120 -3.023653e-03 -7.992975e-03
665 0.0052829422 0.0181072771 4.304652e-03 -3.319519e-03
666 0.0084386054 0.0448545861 -8.182748e-04 4.279284e-03
667 -0.0076664829 -0.0059415480 -2.047362e-04 6.059936e-03
668 -0.0062108665 -0.0039847073 7.313506e-04 5.993467e-04
669 -0.0053350948 0.0068119154 -1.042631e-02 -2.056524e-03
670 -0.0263588067 0.0245395479 -2.188962e-02 -6.732491e-03
671 -0.0021511018 0.0220649895 1.412435e-02 1.702085e-03
672 0.0205058100 -0.0007179119 3.057527e-03 -1.002423e-02
673 0.0096862280 -0.0194488633 1.207407e-03 -1.553899e-03
674 0.0007143951 -0.0068557672 6.227450e-03 1.790274e-03
675 -0.0021926470 -0.0051114507 -6.267498e-03 -1.035691e-03
676 0.0076655765 -0.0139300847 6.583825e-03 3.059472e-03
677 -0.0032457653 0.0180480206 -4.635495e-03 1.064002e-03
678 0.0036633764 0.0060676410 -2.762676e-04 5.364970e-04
679 -0.0008111122 -0.0013635410 -1.065898e-03 1.214059e-03
680 0.0050228311 0.0055141267 3.003507e-03 1.121643e-03
681 -0.0007067495 0.0147281558 -2.699002e-03 -1.514035e-04
682 -0.0024248548 0.0002573473 -2.113685e-03 -1.423409e-03
683 -0.0002025624 0.0138417207 -4.374895e-03 1.415328e-04
684 -0.0141822418 -0.0169517332 -3.578920e-03 -1.799234e-03
685 -0.0005651749 -0.0259693324 -5.926428e-03 -3.635333e-03
686 0.0004112688 0.0133043570 -1.545642e-03 1.981828e-03
687 -0.0150565262 -0.0107757493 -1.717916e-02 -1.328749e-02
688 0.0039129754 -0.0441013167 -8.376631e-03 -5.653841e-04
689 0.0019748467 0.0115063340 -2.835394e-02 7.868428e-03
690 0.0072614108 0.0358764014 3.586897e-02 7.960077e-03
691 -0.0003604531 0.0106119001 1.024769e-04 -2.733651e-04
What I need to do is look up each portfolio (each row of Final_output is a portfolio, i.e. a 4-stock portfolio) in perf and compute the volatility (standard deviation) of that portfolio using the stocks' historical daily performance over the last three months. (Of course, here I have pasted only 4 stocks' performance for simplicity.) Once done for the first row, I need to do the same for all the other rows (portfolios).
Below is the formula I used for computing the volatility:
#formula for computing the volatility
sqrt(t(weights) %*% covariance_matrix %*% weights)
#where covariance_matrix is
cov(portfolio_component_monthly_returns)
#all the portfolios are equally weighted
weights <- c(0.25, 0.25, 0.25, 0.25)
What I have been trying to do since yesterday is to automate the process for all the rows; indeed, I have more than 10,000 rows. I'm an R novice, so even after experimenting and searching the web I have no results and no idea how to automate it. Would someone have a clue how to do it?
I hope I have been as clear as possible; if anything is unclear, do not hesitate to ask me.
Many thanks
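A minimal sketch of one way to automate this, assuming every ISIN in Final_output has a matching column in perf and that perf already covers exactly the last three months of daily returns:
weights <- c(0.25, 0.25, 0.25, 0.25)  # equally weighted 4-stock portfolios
# for each row (portfolio), pull the 4 return columns, build the 4x4
# covariance matrix, and apply the volatility formula above
Final_output$volatility <- apply(Final_output, 1, function(isins) {
  covariance_matrix <- cov(perf[, isins])
  sqrt(t(weights) %*% covariance_matrix %*% weights)
})
With roughly 10,000 rows this should still run in seconds, since each iteration is only a 4x4 covariance computation.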

select best indices from the result of ensemble using mRMR

I am using the R package mRMRe for feature selection and trying to get the indices of the most common features from the results of the ensemble:
ensemble <- mRMR.ensemble(data = dd, target_indices = target_idx, solution_count = 5, feature_count = 30)
features_indices = as.data.frame(solutions(ensemble))
This gives me the data below:
MR_1 MR_2 MR_3 MR_4 MR_5
2793 2794 2796 2795 2918
1406 1406 1406 1406 1406
2798 2800 2798 2798 2907
2907 2907 2907 2907 2800
2709 2709 2709 2709 2709
1350 2781 1582 1350 1582
2781 1350 2781 2781 636
2712 2712 2712 2712 2781
636 636 636 636 2779
2067 2067 2067 2067 2712
2328 2328 2357 2357 2067
2357 783 2328 2328 2328
772 2357 772 772 772
I want to use some sort of voting logic to select the most frequent index for each row across all columns.
For example, in the data above:
1. For the first row there is no match, so select the first value.
2. There are rows where some index occurs at least twice; select that one.
3. In case of a tie, check whether any index occurs three times; if yes, select that one, otherwise select the first-occurring of the tied indices.
Maybe I am making it too complex, but basically I want to select the best index for each row of the data frame.
Can someone please help me on this?
Here's a simple solution using apply:
apply(df, 1, function(x) { names(which.max(table(x))) })
which gives:
[1] "2793" "1406" "2798" "2907" "2709" "1350" "2781" "2712" "636" "2067" "2328" "2328" "772"
For each row, the function table counts occurrences of each unique element, then we return the name of the element with the maximum number of occurrences (if there is a tie, the first one is selected).
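Note that names() returns character strings, as in the output above; if you need numeric indices, wrap the result in as.numeric():
best_idx <- as.numeric(apply(df, 1, function(x) names(which.max(table(x)))))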

Natural Neighbor Interpolation in R

I need to conduct Natural Neighbor Interpolation (NNI) via R in order to smooth my numeric data. For example, say I have very noisy data; my goal is to use NNI to model the data smoothly.
I have several hundred rows of data (one observation for each postcode), alongside latitudes and longitudes. I've made up some data below:
Postcode lat lon Value
200 -35.277272 149.117136 7
221 -35.201372 149.095065 38
800 -12.801028 130.955789 27
801 -12.801028 130.955789 3
804 -12.432181 130.84331 29
810 -12.378451 130.877014 20
811 -12.376597 130.850489 3
812 -12.400091 130.913672 42
814 -12.382572 130.853877 32
820 -12.410444 130.856124 39
821 -12.426641 130.882367 39
822 -12.799278 131.131697 49
828 -12.474896 130.907378 38
829 -14.460879 132.280002 34
830 -12.487233 130.972637 8
831 -12.480066 130.984006 49
832 -12.492269 130.990891 29
835 -12.48138 131.029173 33
836 -12.525546 131.103025 40
837 -12.460094 130.842663 39
838 -12.709507 130.995407 28
840 -12.717562 130.351316 22
841 -12.801028 130.955789 8
845 -13.038663 131.072091 19
846 -13.226806 131.098416 50
847 -13.824123 131.835799 11
850 -14.464497 132.262021 2
851 -14.464497 132.262021 23
852 -14.92267 133.064654 36
854 -16.81839 137.14707 17
860 -19.648306 134.186642 3
861 -18.94406 134.318373 8
862 -20.231104 137.762232 28
870 -12.436101 130.84059 24
871 -12.436101 130.84059 16
Is there any kind of package that will do this? I should mention that the only predictors I am using in this model are latitude and longitude. If there isn't a package that can do this, how can I implement it manually? I've searched extensively and I can't figure out how to implement this in R. I have seen one or two other SO posts, but they haven't helped me figure this out.
Please let me know if there's anything I must add to the question. Thanks.
I suggest the following:
1. Reproject the data to the corresponding UTM zone.
2. Use the R whitebox package (an interface to WhiteboxTools) to interpolate the points using natural neighbour interpolation.
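A rough sketch of both steps is below. The sf calls are standard; the whitebox wrapper name and its arguments are assumptions on my part, so check ?whitebox for the exact interface:
library(sf)
library(whitebox)
# Step 1: reproject from WGS84 lat/lon to a projected CRS. Most of the
# sample points sit in UTM zone 52S (EPSG:32752); data spanning several
# zones may need a single national grid instead.
pts <- st_as_sf(df, coords = c("lon", "lat"), crs = 4326)
pts_utm <- st_transform(pts, 32752)
st_write(pts_utm, "points_utm.shp")
# Step 2: natural neighbour interpolation onto a 1 km raster.
# NOTE: the function and argument names here are assumed, not verified.
wbt_natural_neighbour_interpolation(input = "points_utm.shp",
                                    field = "Value",
                                    output = "nni.tif",
                                    cell_size = 1000)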

Decoding Unknown Data Type

I have received some encoded data from an Arduino via PySerial. I have access to an application which decodes the data, but I need to know what it is doing, and I do not have access to its source code.
Data file contents:
%N|nkNsnrNlnzNqnEOknJOlM
%VA#_##hpZzbdIvzegvxefvkeavdeXvXeXvPeMvReDvlM
%PaA#gH#lnMO#QaLN#mbzM#cbmM#^beM#Pb_M#Fb]M#xaUM#balM
%Ma##HI#FzJP#auPO#~uPO#{uPO#}uMO#vuN#wuyN#uuqN#xulM
%knOOinSOXnMOAnFOcmxNYmlNBm_NslSNqlHNclnM^N
%PezuReouLeluDeju~diuFe`uBeXuAeUu~dJuxdAu^N
%MM#NaJM#`MM#t`VM#h`aM#f`fM#Y`jM#O`mM#G`uM#{_BN#u_^N
%rN#tuhN#nu[N#kuRN#huEN#au{M#[uqM#Nu^M#CuFM#ttuL#at^N
%XlPMMlvLMlWLPlBLVllKMlWKDlCKKlrJNl[J`lHJPO
%pd|trdrttdjtudbtmd_tkd[tkdWtmdOtldGtvdHtPO
Output from application:
86 31 -48 97 -51 33 -1109 -3121
-984 -358 551 -1108 584 -378 -1111 -3117
-1758 -631 973 -1967 1034 -671 -1128 -3123
-1670 -601 908 -1875 976 -642 -1151 -3130
-1672 -602 890 -1885 976 -645 -1181 -3144
-1685 -607 877 -1890 976 -643 -1191 -3156
-1692 -616 869 -1904 973 -650 -1214 -3169
-1704 -616 863 -1914 959 -649 -1229 -3181
-1712 -627 861 -1928 953 -651 -1231 -3192
-1710 -636 853 -1950 945 -648 -1245 -3218
-1712 -646 845 -1970 946 -652 -1256 -3248
-1710 -657 842 -1985 936 -658 -1267 -3274
-1716 -660 845 -1996 923 -661 -1267 -3305
-1724 -662 854 -2008 914 -664 -1264 -3326
-1730 -663 865 -2010 901 -671 -1258 -3348
-1722 -672 870 -2023 891 -677 -1267 -3369
-1726 -680 874 -2033 881 -690 -1276 -3389
-1727 -683 877 -2041 862 -701 -1269 -3406
-1730 -694 885 -2053 838 -716 -1266 -3429
-1736 -703 898 -2059 821 -735 -1248 -3448
I have tried several encodings like ASCII, UTF-8, and UUEncoding but none have given me any tangible results.
Does anyone have an idea as to what this could be?
Thanks in advance,
Cheers

R row summing to a new column [duplicate]

This question already has answers here:
Sum rows in data.frame or matrix
(7 answers)
Closed 7 years ago.
I'm new to R and this is probably a very simple question.
I just haven't been able to make rowsum/apply work.
My task is to add up all the different expense categories in my data frame and create a new variable with this total, like this:
(not the original)
Food Dress Car
235 564 532
452 632 719
... ... ...
and then
Food Dress Car Total
235 564 532 1331
452 632 719 1803
... ... ... ...
I have tried rowsum, apply and aggregate, and can't get it right.
You can use addmargins after converting to a matrix:
addmargins(as.matrix(df1),2)
# Food Dress Car Sum
#[1,] 235 564 532 1331
#[2,] 452 632 719 1803
Or use rowSums
df1$Total <- rowSums(df1)
Or with Reduce
df1$Total <- Reduce(`+`, df1)
With apply functions:
cbind(dat, Total = apply(dat, 1, sum))
Food Dress Car Total
1 235 564 532 1331
2 452 632 719 1803
or with just a rowSums call:
cbind(dat, Total = rowSums(dat))
Food Dress Car Total
1 235 564 532 1331
2 452 632 719 1803
