Merge column-wise from file_list in R

I have 96 files in file_list
file_list <- list.files(pattern = "*.mirna")
They all have the same columns, but the number of rows varies. Example file:
> head(test1)
seq name freq mir start end mism add t5 t3 s5 s3 DB
1 TGGAGTGTGATAATGGTGTTT seq_100003_x4 4 hsa-miR-122-5p 15 35 11TC 0 0 g GCTGTGGA TTTGTGTC miRNA
2 TGTAAACATCCCCGACCGGAAGCT seq_100045_x4 4 hsa-miR-30d-5p 6 29 17CT 0 0 CT TTGTTGTA GAAGCTGT miRNA
3 CTAGACTGAAGCTCCTTGAAAA seq_100048_x4 4 hsa-miR-151a-3p 47 65 0 I-AAA 0 gg CCTACTAG GAGGACAG miRNA
4 AGGCGGAGACTTGGGCAATTGC seq_100059_x4 4 hsa-miR-25-5p 14 35 0 0 0 C TGAGAGGC ATTGCTGG miRNA
5 AAACCGTTACCATTACTGAAT seq_100067_x4 4 hsa-miR-451a 17 35 0 I-AT 0 gtt AAGGAAAC AGTTTAGT miRNA
6 TGAGGTAGTAGCTTGTGCTGTT seq_10007_x24 24 hsa-let-7i-5p 6 27 12CT 0 0 0 TGGCTGAG TGTTGGTC miRNA
precursor ambiguity
1 hsa-mir-122 1
2 hsa-mir-30d 1
3 hsa-mir-151a 1
4 hsa-mir-25 1
5 hsa-mir-451a 1
6 hsa-let-7i 1
second file
> head(test2)
seq name freq mir start end mism add t5 t3 s5 s3 DB
1 ATTGCACTTGTCCTGGCCTGT seq_1000013_x1 1 hsa-miR-92a-3p 49 69 14TC 0 t 0 AAAGTATT CTGTGGAA miRNA
2 AAACCGTTACTATTACTGAGA seq_1000094_x1 1 hsa-miR-451a 17 36 11TC I-A 0 tt AAGGAAAC AGTTTAGT miRNA
3 TGAGGTAGCAGATTGTATAGTC seq_1000169_x1 1 hsa-let-7f-5p 8 28 9CT I-C 0 t GGGATGAG AGTTTTAG miRNA
4 TGGGTCTTTGCGGGCGAGAT seq_100019_x12 12 hsa-miR-193a-5p 21 40 0 0 0 ga GGGCTGGG ATGAGGGT miRNA
5 TGAGGTAGTAGATTGTATAGTG seq_100035_x12 12 hsa-let-7f-5p 8 28 0 I-G 0 t GGGATGAG AGTTTTAG miRNA
6 TGAAGTAGTAGGTTGTGTGGTAT seq_1000437_x1 1 hsa-let-7b-5p 6 26 4AG I-AT 0 t GGGGTGAG GGTTTCAG miRNA
precursor ambiguity
1 hsa-mir-92a-2 1
2 hsa-mir-451a 1
3 hsa-let-7f-2 1
4 hsa-mir-193a 1
5 hsa-let-7f-2 1
6 hsa-let-7b 1
I would like to create a unique ID consisting of the columns mir and seq:
hsa-miR-122-5p_TGGAGTGTGATAATGGTGTTT
Then I would like to merge all 96 files based on this ID and take the freq column from each file:
ID freq_file1 freq_file2 ...
hsa-miR-122-5p_TGGAGTGTGATAATGGTGTTT 4 12
If an ID is not present in a specific file, the freq should be NA.

We can use Reduce with merge on a list of data.frames. The ID is built from the mir and seq columns, and all = TRUE keeps IDs that are missing from some files, filling their freq with NA:
lst <- lapply(mget(ls(pattern = "test\\d+")),
              function(x) subset(transform(x, ID = paste(mir, seq, sep = "_")),
                                 select = c("ID", "freq")))
Reduce(function(...) merge(..., by = "ID", all = TRUE), lst)
NOTE: In the above, I assumed that the "test1", "test2" objects were already created in the global environment by reading the files in 'file_list'. If not, we can read the files directly into a list instead of creating additional data.frame objects, i.e.
library(data.table)
lst <- lapply(file_list, function(x)
  fread(x, select = c("mir", "seq", "freq"))[,
    list(ID = paste(mir, seq, sep = "_"), freq = freq)])
Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), lst)
Or, instead of fread (from data.table), use read.csv/read.table and apply merge as before on 'lst'.
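For completeness, the whole pipeline can be sketched end-to-end on two made-up data frames (the test1/test2 values and the freq_file* column names below are illustrative, not taken from the real files):

```r
# two toy "files" (hypothetical values, same columns as the real ones)
test1 <- data.frame(seq = c("TGGAGTGTGATAATGGTGTTT", "TGTAAACATCCCCGACCGGAAGCT"),
                    mir = c("hsa-miR-122-5p", "hsa-miR-30d-5p"),
                    freq = c(4, 4), stringsAsFactors = FALSE)
test2 <- data.frame(seq = c("TGGAGTGTGATAATGGTGTTT", "ATTGCACTTGTCCTGGCCTGT"),
                    mir = c("hsa-miR-122-5p", "hsa-miR-92a-3p"),
                    freq = c(12, 1), stringsAsFactors = FALSE)

# build the mir_seq ID and keep only ID + freq for each file
lst <- lapply(list(test1, test2), function(x)
  data.frame(ID = paste(x$mir, x$seq, sep = "_"), freq = x$freq))

# full outer join: all = TRUE fills freq with NA where an ID is absent
merged <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), lst)
names(merged) <- c("ID", paste0("freq_file", seq_along(lst)))
merged
```

Here hsa-miR-30d-5p appears only in the first toy file, so its freq_file2 entry comes out as NA.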

Related

How to extract observation values for each cluster of k-means

I have data that come from two distribution functions (mixture data). I fit k-means to the data with 2 centers and get the cluster assignments. Instead of the cluster number for each observation, I would like to divide my data into two groups: the first group contains the observations that come from the first cluster, and likewise for the second group (my data is a two-dimensional matrix).
Here is my try:
kme <- kmeans(Sim, 2)
kme$cluster
which gives this:
kme$cluster
[1] 1 2 2 1 1 1 2 2 2 1 2 2 1 2 1 2 1 2 2 1 2 1 2 2 2 1 2 1 2 1 1 2 1 1 1 2 2 1 1 1 1 1 2 2 1 1 1 2 2 1 2 1 2 2 2
[56] 1 2 1 2 2 1 2 1 1 2 2 1 2 2 1 1 2 2 1 2 2 1 1 1 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 1 1 2 2 1 2
I know this means the first observation of my matrix comes from the first cluster and the second and third come from the second cluster. Instead of this, I want two groups: one with the observations (the values, not the cluster numbers) of the first cluster, and the other with those of the second cluster.
For example,
[,1] [,2] [,3]
[1,] 0.8026952 0.8049413 1
[2,] 0.4333745 0.5063472 2
[3,] 0.3587946 0.4091627 2
[4,] 0.9067146 0.9211618 1
[5,] 0.6663730 0.6644439 1
[6,] 0.9752217 0.8299001 1
Hence, I want it like this:
Group_1
[,1] [,2]
[1,] 0.8026952 0.8049413
[2,] 0.9067146 0.9211618
[3,] 0.6663730 0.6644439
[4,] 0.9752217 0.8299001
Group_2
          [,1]      [,2]
[1,] 0.4333745 0.5063472
[2,] 0.3587946 0.4091627
## my data
structure(c(0.8026952064848, 0.433374540465373, 0.35879457564118,
0.906714606331661, 0.666372966486961, 0.975221659988165, 0.146514602801487,
0.185211665343342, 0.266845172200967, 0.9316249943804, 0.458760005421937,
0.260092565789819, 0.546946153900359, 0.320214906940237, 0.998543527442962,
0.264783770404576, 0.940526409307495, 0.218771387590095, 0.00109510733232848,
0.909367726704406, 0.195467973826453, 0.853418850837688, 0.257240866776556,
0.18492349224921, 0.0350681275368262, 0.743108308431699, 0.120800079312176,
0.536067422405767, 0.387076289858669, 0.859893148997799, 0.962759922724217,
0.0288314732712864, 0.878663770621642, 0.98208610656754, 0.98423704248853,
0.0850008164197942, 0.415692074922845, 0.725441533140838, 0.514739896170795,
0.564903213409707, 0.65493689605431, 0.551635805051774, 0.20452569425106,
0.0509099354967475, 0.646801606381046, 0.656341063790023, 0.706781879998744,
0.244539211907925, 0.43318469475677, 0.848426640266553, 0.26359805940462,
0.730860544172275, 0.405211122473702, 0.401496034115553, 0.432796132021846,
0.654138915939257, 0.00803712895140052, 0.991968845921972, 0.0311756118742527,
0.0648601313587278, 0.733741108178729, 0.0431173096876591, 0.619796682847664,
0.804308546474203, 0.0934691624715924, 0.520366458455101, 0.833598382357762,
0.373484763782471, 0.261487311183624, 0.822368689114228, 0.88254910800606,
0.261728620579622, 0.109025254459585, 0.661885950024542, 0.231851563323289,
0.46855820226483, 0.909970719134435, 0.799321972066537, 0.646252158097923,
0.233985049184412, 0.309839888018159, 0.129971102112904, 0.0901338488329202,
0.460395671925082, 0.274646409088746, 0.675003502921675, 0.00289221783168614,
0.336108531044562, 0.371105678845197, 0.607435576152056, 0.156731446506456,
0.246894558891654, 0.418194083335386, 0.000669385509081014, 0.929943428778418,
0.972200238145888, 0.503282874496368, 0.126382717164233, 0.683936105109751,
0.21720214970307, 0.804941252722838, 0.506347232734472, 0.409162739287115,
0.921161751145135, 0.664443932378791, 0.829900114789874, 0.0660539097664178,
0.296326436845226, 0.120007439729838, 0.768823563807157, 0.449026418114183,
0.268668511775742, 0.733763495587273, 0.365402223476625, 0.97980160509396,
0.335119241818387, 0.929315469866307, 0.253016166717649, 0.00521095494948787,
0.870041067705, 0.215020805969677, 0.858896143709886, 0.167998804405928,
0.204213777320881, 0.050652931423494, 0.731499125526297, 0.166061290725948,
0.520575411719918, 0.370579454420263, 0.655607928337889, 0.978414469097905,
0.00268175014874324, 0.937587480238656, 0.992468047261219, 0.856301580636229,
0.106064732119751, 0.530228247677302, 0.502227925225818, 0.66462369930413,
0.526988978414104, 0.394591213637187, 0.623968017885322, 0.222666427921132,
0.0707407196787662, 0.715361864683925, 0.561951996212598, 0.874765155771585,
0.217631973951671, 0.576708062239157, 0.910641489550344, 0.215463715360162,
0.761807500922947, 0.417110771840405, 0.497162608159201, 0.530665309105489,
0.689703677933362, 0.00811876221245061, 0.991245541114815, 0.0518070069187705,
0.0733367055960226, 0.803126294581356, 0.0291602667026993, 0.724848517465592,
0.682316094846719, 0.0914714514707226, 0.426956537783392, 0.826985575416605,
0.3128962286514, 0.295208624024388, 0.58934716401092, 0.856718183582533,
0.183019143019377, 0.302561606994597, 0.666755501118539, 0.176298329811281,
0.389183841328174, 0.86253900906311, 0.753736534075238, 0.627220192419063,
0.319958512526359, 0.321602248149364, 0.161772830672492, 0.103166641060684,
0.339980194505715, 0.218533019046996, 0.689884789678819, 0.00251942038852481,
0.174792447835404, 0.509071373135409, 0.647835095901117, 0.22572898134156,
0.287369659385574, 0.538675651472693, 0.000995476493411555, 0.939528694637273,
0.961510166904661, 0.452822116916426, 0.2061782381611, 0.722694525115558,
0.328404467661884), .Dim = c(100L, 2L))
I hope this is what you are looking for.
I had to transform the matrix to a data frame so that the structure is preserved when we use the split function; otherwise it would split the whole matrix element by element, because a matrix is really a vector with a dim attribute, so it behaves like a vector.
split divides a data frame or a vector into groups defined by f, which in your case are the unique cluster values:
kme <- kmeans(Sim, 2)
kme$cluster
Sim2 <- as.data.frame(cbind(Sim, kme$cluster))
split(Sim2, Sim2$V3) |>
  setNames(paste("Group", sort(unique(kme$cluster))))
$`Group 1`
V1 V2 V3
2 0.4333745405 0.5063472327 1
3 0.3587945756 0.4091627393 1
7 0.1465146028 0.0660539098 1
8 0.1852116653 0.2963264368 1
9 0.2668451722 0.1200074397 1
11 0.4587600054 0.4490264181 1
12 0.2600925658 0.2686685118 1
14 0.3202149069 0.3654022235 1
16 0.2647837704 0.3351192418 1
18 0.2187713876 0.2530161667 1
19 0.0010951073 0.0052109549 1
21 0.1954679738 0.2150208060 1
23 0.2572408668 0.1679988044 1
24 0.1849234922 0.2042137773 1
25 0.0350681275 0.0506529314 1
27 0.1208000793 0.1660612907 1
29 0.3870762899 0.3705794544 1
32 0.0288314733 0.0026817501 1
36 0.0850008164 0.1060647321 1
37 0.4156920749 0.5302282477 1
43 0.2045256943 0.2226664279 1
44 0.0509099355 0.0707407197 1
48 0.2445392119 0.2176319740 1
49 0.4331846948 0.5767080622 1
51 0.2635980594 0.2154637154 1
53 0.4052111225 0.4171107718 1
54 0.4014960341 0.4971626082 1
55 0.4327961320 0.5306653091 1
57 0.0080371290 0.0081187622 1
59 0.0311756119 0.0518070069 1
60 0.0648601314 0.0733367056 1
62 0.0431173097 0.0291602667 1
65 0.0934691625 0.0914714515 1
66 0.5203664585 0.4269565378 1
68 0.3734847638 0.3128962287 1
69 0.2614873112 0.2952086240 1
72 0.2617286206 0.1830191430 1
73 0.1090252545 0.3025616070 1
75 0.2318515633 0.1762983298 1
76 0.4685582023 0.3891838413 1
80 0.2339850492 0.3199585125 1
81 0.3098398880 0.3216022481 1
82 0.1299711021 0.1617728307 1
83 0.0901338488 0.1031666411 1
84 0.4603956719 0.3399801945 1
85 0.2746464091 0.2185330190 1
87 0.0028922178 0.0025194204 1
88 0.3361085310 0.1747924478 1
89 0.3711056788 0.5090713731 1
91 0.1567314465 0.2257289813 1
92 0.2468945589 0.2873696594 1
93 0.4181940833 0.5386756515 1
94 0.0006693855 0.0009954765 1
97 0.5032828745 0.4528221169 1
98 0.1263827172 0.2061782382 1
100 0.2172021497 0.3284044677 1
$`Group 2`
V1 V2 V3
1 0.8026952 0.8049413 2
4 0.9067146 0.9211618 2
5 0.6663730 0.6644439 2
6 0.9752217 0.8299001 2
10 0.9316250 0.7688236 2
13 0.5469462 0.7337635 2
15 0.9985435 0.9798016 2
17 0.9405264 0.9293155 2
20 0.9093677 0.8700411 2
22 0.8534189 0.8588961 2
26 0.7431083 0.7314991 2
28 0.5360674 0.5205754 2
30 0.8598931 0.6556079 2
31 0.9627599 0.9784145 2
33 0.8786638 0.9375875 2
34 0.9820861 0.9924680 2
35 0.9842370 0.8563016 2
38 0.7254415 0.5022279 2
39 0.5147399 0.6646237 2
40 0.5649032 0.5269890 2
41 0.6549369 0.3945912 2
42 0.5516358 0.6239680 2
45 0.6468016 0.7153619 2
46 0.6563411 0.5619520 2
47 0.7067819 0.8747652 2
50 0.8484266 0.9106415 2
52 0.7308605 0.7618075 2
56 0.6541389 0.6897037 2
58 0.9919688 0.9912455 2
61 0.7337411 0.8031263 2
63 0.6197967 0.7248485 2
64 0.8043085 0.6823161 2
67 0.8335984 0.8269856 2
70 0.8223687 0.5893472 2
71 0.8825491 0.8567182 2
74 0.6618860 0.6667555 2
77 0.9099707 0.8625390 2
78 0.7993220 0.7537365 2
79 0.6462522 0.6272202 2
86 0.6750035 0.6898848 2
90 0.6074356 0.6478351 2
95 0.9299434 0.9395287 2
96 0.9722002 0.9615102 2
99 0.6839361 0.7226945 2
Add the kme$cluster values to the original data frame, then create a new data frame with each column based on the value in kme$cluster.
From what I understand, without a data sample:
library(tidyverse)
Sim <- Sim %>%
  as.data.frame() %>%          # Sim is a matrix, so convert first
  mutate(cluster_group = kme$cluster)
df_final <- data.frame(Group1 = Sim %>%
                         filter(cluster_group == 1) %>%
                         pull(value),
                       Group2 = Sim %>%
                         filter(cluster_group == 2) %>%
                         pull(value))
Here value is the column of values used for the k-means in Sim.
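If only the raw observations per cluster are needed, plain logical row indexing on the matrix avoids the data-frame round trip entirely. A minimal sketch with made-up data (Sim here is a stand-in for the matrix in the question):

```r
set.seed(1)
Sim <- cbind(runif(100), runif(100))   # stand-in for the real 100x2 matrix
kme <- kmeans(Sim, 2)

# keep the matrix structure by selecting rows with the cluster vector
Group_1 <- Sim[kme$cluster == 1, , drop = FALSE]
Group_2 <- Sim[kme$cluster == 2, , drop = FALSE]
```

Every observation lands in exactly one group, so the two row counts add up to nrow(Sim).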

Column-specific arguments to lapply in data.table .SD when applying rbinom

I have a data.table for which I want to add columns of random binomial numbers based on one column as number of trials and multiple probabilities based on other columns:
require(data.table)
DT <- data.table(
  ID = letters[sample.int(26, 10, replace = TRUE)],
  Quantity = as.integer(100 * runif(10))
)
prob.vecs <- LETTERS[1:5]
DT[,(prob.vecs):=0]
set.seed(123)
DT[,(prob.vecs):=lapply(.SD, function(x){runif(.N,0,0.2)}), .SDcols=prob.vecs]
DT
ID Quantity A B C D E
1: b 66 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000
2: l 9 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927
3: u 38 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487
4: d 27 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909
5: o 81 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895
6: f 44 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121
7: d 81 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682
8: t 81 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249
9: x 79 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453
10: j 43 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554
Now I want to add five columns Quantity_A, Quantity_B, Quantity_C, Quantity_D, Quantity_E,
which apply rbinom with the corresponding probability and the quantity from the second column.
So, for example, the first entry for Quantity_A would be:
set.seed(741)
sum(rbinom(66,1,0.05751550))
> 2
This problem seems very similar to this post: How do I pass column-specific arguments to lapply in data.table .SD? but I cannot seem to make it work. My try:
DT[,(paste0("Quantity_", prob.vecs)):= mapply(function(x, Quantity){sum(rbinom(Quantity, 1 , x))}, .SD), .SDcols = prob.vecs]
Error in rbinom(Quantity, 1, x) :
argument "Quantity" is missing, with no default
Any ideas?
I seem to have found a work-around, though I am not quite sure why it works (it probably has something to do with the function rbinom not being vectorized in both arguments).
First define an index:
DT[,Index:=.I]
and then do it by index:
DT[, (paste0("Quantity_", prob.vecs)) :=
     lapply(.SD, function(x) sum(rbinom(Quantity, 1, x))),
   .SDcols = prob.vecs, by = Index]
set.seed(789)
ID Quantity A B C D E Index Quantity_A Quantity_B Quantity_C Quantity_D Quantity_E
1: c 37 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000 1 0 4 7 8 0
2: c 51 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927 2 3 5 9 19 3
3: r 7 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487 3 0 0 2 2 0
4: v 53 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909 4 8 4 16 12 3
5: d 96 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895 5 17 3 12 0 4
6: u 52 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121 6 1 3 8 6 0
7: m 43 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682 7 6 1 7 6 2
8: z 3 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249 8 1 0 2 1 1
9: m 3 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453 9 1 0 0 0 0
10: o 4 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554 10 0 0 0 0 0
The numbers look about right to me.
A solution without the index would still be appreciated.
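One index-free alternative (a sketch, relying on the identity that the sum of n Bernoulli(p) draws has the same distribution as a single Binomial(n, p) draw): rbinom is vectorized over both size and prob, so a single call per probability column suffices:

```r
library(data.table)
set.seed(123)
# rebuild a toy version of DT with five probability columns
DT <- data.table(ID = letters[1:10],
                 Quantity = as.integer(100 * runif(10)))
prob.vecs <- LETTERS[1:5]
DT[, (prob.vecs) := lapply(prob.vecs, function(nm) runif(.N, 0, 0.2))]

# one vectorized rbinom() per column: size = Quantity, prob = p, row by row
DT[, (paste0("Quantity_", prob.vecs)) :=
     lapply(.SD, function(p) rbinom(.N, Quantity, p)),
   .SDcols = prob.vecs]
```

The draws are not numerically identical to the by = Index version under the same seed, but they follow the same distribution, and the per-row grouping disappears.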

Binary representation of breast cancer wisconsin database

I want to produce a binary representation of the well-known breast cancer Wisconsin database.
The initial data set has 31 numerical variables, and one categorical variable.
id_number diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean
1 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419
2 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812
3 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069
4 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597
5 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809
I want to produce a binary representation of this dataframe by:
transforming the diagnosis column (levels M, B) into two columns, diagnosis_M and diagnosis_B, and putting 1 or 0 in the relevant row depending on the value in the initial column (M or B);
taking the mean of each numerical column and splitting it into two columns depending on whether the values are greater or lower than that mean. E.g. for the column radius_mean: a column radius_mean_great in which we put 1 if the value > mean and 0 otherwise, and a column radius_mean_low, inversely.
library(mlbench)
library("RCurl")
library("curl")
UCI_data_URL <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')
names <- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean','concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst')
breast.cancer.fr <- read.table(textConnection(UCI_data_URL), sep = ',', col.names = names)
Well, there are several ways to binarize the data set. I found the following; I hope it helps:
df <- breast.cancer.fr[, 3:32]
df2 <- matrix(NA, ncol = 2 * ncol(df), nrow = nrow(df))
for (i in 1:ncol(df)) {
  df2[, 2 * i - 1] <- as.numeric(df[, i] > mean(df[, i]))
  df2[, 2 * i]     <- as.numeric(df[, i] <= mean(df[, i]))
}
colnames(df2) <- c(rbind(paste0(names(df), "_great"), paste0(names(df), "_low")))
library(dplyr)
df3 <- select(breast.cancer.fr, id_number, diagnosis) %>%
  mutate(diagnosis_M = as.numeric(diagnosis == "M"),
         diagnosis_B = as.numeric(diagnosis == "B"))
df <- cbind(df3[, -2], df2)
df[1:10, 1:7]
id_number diagnosis_M diagnosis_B radius_mean_great radius_mean_low texture_mean_great texture_mean_low
1 842302 1 0 1 0 0 1
2 842517 1 0 1 0 0 1
3 84300903 1 0 1 0 1 0
4 84348301 1 0 0 1 1 0
5 84358402 1 0 1 0 0 1
6 843786 1 0 0 1 0 1
7 844359 1 0 1 0 1 0
8 84458202 1 0 0 1 1 0
9 844981 1 0 0 1 1 0
10 84501001 1 0 0 1 1 0
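The explicit loop can also be replaced with one vectorized comparison per direction via sweep. A sketch on a made-up two-column data frame (the values are illustrative, standing in for breast.cancer.fr[, 3:32]):

```r
# toy numeric data standing in for the real feature columns
df <- data.frame(radius_mean  = c(17.99, 20.57, 11.42),
                 texture_mean = c(10.38, 17.77, 20.38))

# compare every column against its own mean in one step
great <- sweep(df, 2, colMeans(df), ">") * 1   # 1 if value > mean
low   <- 1 - great                             # 1 if value <= mean
colnames(great) <- paste0(names(df), "_great")
colnames(low)   <- paste0(names(df), "_low")

# interleave the _great/_low pairs in the original column order
df2 <- cbind(great, low)[, c(rbind(colnames(great), colnames(low)))]
```

Each _great/_low pair is complementary by construction, so the two columns always sum to 1 in every row.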

Not-equal to character in R

What is the command for printing the rows that are not equal to a specific pattern? From the data below I would like to print the number of rows where the t5 column does not start with d-.
I tried
dim(df[df$t5 !="d-",])
df:
name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
6 seq_10002_x17 17 hsa-miR-10a-5p 23 44 5GT 0 d-T 0 TATATACC TGTGTAAG miRNA 1
19 seq_100091_x3 3 hsa-miR-142-3p 54 74 0 u-CA d-TG 0 AGGGTGTA TGGATGAG miRNA 1
20 seq_100092_x1 1 hsa-miR-142-3p 54 74 0 u-CT d-TG 0 AGGGTGTA TGGATGAG miRNA 1
23 seq_100108_x5 5 hsa-miR-10a-5p 23 44 4NC 0 d-T 0 TATATACC TGTGTAAG miRNA 1
26 seq_100113_x1219 1219 hsa-miR-577 15 36 0 0 u-G 0 AGAGTAGA CCTGATGA miRNA 1
28 seq_100121_x1 1 hsa-miR-192-5p 25 45 1CT u-CT d-C d-A GGCTCTGA AGCCAGTG miRNA 1
df1 <- df[!grepl("^d-", df[, 8]), ]
nrow(df1)
print(df1)
There is one row in your data that has a t5 entry that does not start with "d-". To find this row, you could try:
df[!grepl("^(d-)",df$t5),]
# name freq mir start end mism add t5 t3 s5 s3 DB ambiguity
#26 seq_100113_x1219 1219 hsa-miR-577 15 36 0 0 u-G 0 AGAGTAGA CCTGATGA miRNA 1
If you only want to know the row number, you can get it with rownames()
> rownames(df[!grepl("^(d-)",df$t5),])
#[1] "26"
or with which(),
> which(!grepl("^(d-)",df$t5))
#[1] 5
depending on whether you want the row number counting from the top of your data frame or the row name shown on the left.
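A regex-free alternative (base R since 3.3.0) is startsWith, which does a literal prefix comparison on the t5 values from the question:

```r
# the t5 column from the example data
t5 <- c("d-T", "d-TG", "d-TG", "d-T", "u-G", "d-C")

which(!startsWith(t5, "d-"))   # position of the non-matching row: 5
sum(!startsWith(t5, "d-"))     # number of such rows: 1
```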

Optimization of an R loop taking 18 hours to run

I've got an R script that works and does what I want, but it takes a huge amount of time to run. Here is an explanation of what the code does, and the code itself.
I've got a vector of 200000 lines containing street addresses (strings): data.
Example :
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a 131x2 matrix of string elements which are 5-grams (parts of words) and the ids of the bags of n-grams (example of a 5-gram bag: ["stack", "tacko", "ackov", "ckove", "kover", "overf", ...]): list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I also have a 200000x31 numerical matrix initialized with 0: idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the string addresses and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the corresponding column, which represents the id of the bag that contains that 5-gram.
Example :
In the address "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" is in the bag ["resid","eside","dence",...] whose id is 5, so I put 1 in the column called 5. The corresponding line of the idv_x_bags matrix will therefore look like the following:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does this:
idv_x_bags <- matrix(0, nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for (i in 1:nrow(idv_x_bags)) {
  for (ngram in list_ngrams$ngram) {
    if (grepl(ngram, data[i, ])) {
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram = '%s'", ngram))[[1]]
      idv_x_bags[i, as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I want, but it takes about 18 hours, which is huge. I tried to recode it in C++ using the Rcpp library but ran into many problems. I also tried to recode it using apply, but couldn't make it work.
Here is what I did :
apply(cbind(data, 1:nrow(data)), 1, function(x) {
  apply(list_ngrams, 1, function(y) {
    if (grepl(y[2], x[1])) {
      idv_x_bags[x[2], str_trim(as.character(y[1]))] <- 1
    }
  })
})
I need some help with coding my loop using apply or some other method that run faster that the current one. Thank you very much.
Check this one, and run the simple example step by step to see how it works.
My n-grams don't make much sense, but it will work with actual n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen <- data.frame(sen = c("this is a good thing", "this is bad"),
                     stringsAsFactors = FALSE)
dt_ngr <- data.frame(id_ngr = c(2, 2, 2, 3, 3, 3),
                     ngr = c("th", "go", "tt", "drf", "ytu", "bad"),
                     stringsAsFactors = FALSE)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
  data.frame() %>%
  rename(sen = Var1, id_ngr = Var2) %>%
  left_join(dt_ngr, by = "id_ngr") %>%
  group_by(sen, id_ngr, ngr) %>%
  do(data.frame(match = grepl(.$ngr, .$sen))) %>%
  group_by(sen, id_ngr) %>%
  summarise(sum_success = sum(match)) %>%
  mutate(match = ifelse(sum_success > 0, 1, 0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
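A base-R alternative worth trying: invert the two loops so that grepl runs vectorized over all 200000 addresses at once, leaving only 131 iterations. A sketch with toy stand-ins for the objects described in the question (the addresses and n-grams here are illustrative):

```r
# toy stand-ins for the question's data / list_ngrams / idv_x_bags objects
data <- data.frame(address = c("15 rue andre lalande residence evry",
                               "3 avenue victor hugo paris"),
                   stringsAsFactors = FALSE)
list_ngrams <- data.frame(idSac = c(5, 5, 6),
                          ngram = c("resid", "eside", "avenu"),
                          stringsAsFactors = FALSE)

bag_ids <- sort(unique(list_ngrams$idSac))
idv_x_bags <- matrix(0, nrow = nrow(data), ncol = length(bag_ids),
                     dimnames = list(NULL, bag_ids))

# 131 vectorized grepl() calls instead of 200000 x 131 scalar ones;
# fixed = TRUE avoids regex interpretation of the n-grams
for (k in seq_len(nrow(list_ngrams))) {
  hit <- grepl(list_ngrams$ngram[k], data$address, fixed = TRUE)
  idv_x_bags[hit, as.character(list_ngrams$idSac[k])] <- 1
}
```

This also removes the per-iteration sqldf lookups, which are likely a large share of the 18 hours.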
