Levenshtein logic to get all the strings with minimum difference - r

Suppose I have a data frame with values
Mtemp:
-----+
code |
-----+
Ram |
John |
Tracy|
Aman |
I want to compare it with the data frame
M2:
------+
code |
------+
Vivek |
Girish|
Rum |
Rama |
Johny |
Stacy |
Jon |
I want a result such that, for each value in Mtemp, I get at most 2 possible matches in M2 within a Levenshtein distance of 2.
I have used
library(stringdist)  # provides amatch()
tp<-as.data.frame(amatch(Mtemp$code,M2$code,method = "lv",maxDist = 2))
tp$orig<-Mtemp$code
colnames(tp)<-c('Res','orig')
and I am getting the result as follows
Res |orig
-----+-----
3 |Ram
5 |John
6 |Tracy
4 |Aman
Please let me know a way to get 2 values (if possible) for every Mtemp string within a Levenshtein distance of 2.
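The idea of "up to two closest matches within distance 2" can be sketched independently of any package. The following is a minimal Python illustration, not the stringdist API: levenshtein and closest_matches are hypothetical helper names, and ties are broken alphabetically as an arbitrary choice. It computes the distance from each query to every candidate, keeps those within max_dist, and returns the top_n closest.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def closest_matches(queries, candidates, max_dist=2, top_n=2):
    """For each query, return up to top_n candidates within max_dist, closest first."""
    result = {}
    for q in queries:
        # Sort by (distance, candidate) so ties are broken alphabetically.
        scored = sorted((levenshtein(q, c), c) for c in candidates)
        result[q] = [c for d, c in scored if d <= max_dist][:top_n]
    return result


mtemp = ["Ram", "John", "Tracy", "Aman"]
m2 = ["Vivek", "Girish", "Rum", "Rama", "Johny", "Stacy", "Jon"]
matches = closest_matches(mtemp, m2)
for name, found in matches.items():
    print(name, found)
```

The comparison is case-sensitive, so a query may end up with fewer than two matches, or none, when nothing in M2 is close enough.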

Related

WGCNA package: value matching function output contains wrong NAs

I use the WGCNA package to analyze co-expressed genes. Here I am trying to form a data frame, analogous to the expression data, that will hold the clinical traits, and I use the following code:
Table traitData
| x | sample | NoduleperPlant |
|- |- |- |
| 1 | 1021_verbena_rep_1 | 2 |
| 2 | 1021_verbena_rep_2 | 3 |
| 3 | 1021_verbena_rep_3 | 1 |
| 4 | 1021_camporegio_rep_1 | 2 |
| 5 | 1021_camporegio_rep_2 | 3 |
| 6 | 1021_camporegio_rep_3 | 4 |
| 7 | BL225C_camporegio_rep_1 | 5 |
| 8 | BL225C_camporegio_rep_2 | 4 |
| 9 | BL225C_camporegio_rep_3 | 1 |
Table dfxpr (some of the genes are presented in table)
|FIELD1 |aacC-1|aacC4-1|aapJ-1|aapM-1|aapP-1|aapQ-1|aarF-1|
|-----------------------|------|-------|------|------|------|------|------|
|X1021_verbena_rep_1 |42 |46 |12412 |935 |3354 |2876 |550 |
|X1021_verbena_rep_2 |52 |37 |11775 |946 |2970 |2824 |514 |
|X1021_verbena_rep_3 |12 |22 |5077 |397 |1462 |1228 |230 |
|X1021_camporegio_rep_1 |52 |71 |12983 |1454 |3408 |3248 |707 |
|X1021_camporegio_rep_2 |20 |65 |9240 |803 |2807 |3146 |445 |
|X1021_camporegio_rep_3 |28 |53 |11030 |1065 |3480 |3410 |582 |
|BL225C_camporegio_rep_1|29 |19 |6346 |375 |938 |768 |118 |
|BL225C_camporegio_rep_2|51 |62 |12938 |781 |1765 |1629 |291 |
|BL225C_camporegio_rep_3|52 |43 |6462 |504 |1120 |1091 |238 |
# This CSV file has 3 columns: the first holds non-relevant information,
# the second the sample names, and the third the measured trait values.
traitData = read.csv("NodulPerPlantTraitForLowGroup.csv");
# remove columns that hold information I do not need.
allTraits = traitData[, -1];
allTraits = allTraits[, 1:2];
# Form a data frame analogous to expression data that will hold the clinical traits.
lowNoduleSamples = rownames(dfxpr) # dfxpr is a data frame with 9 observations (samples) and 6398 variables (genes)
traitRows = match(lowNoduleSamples, allTraits$sample); # this line returns NAs although I expect every sample to match
datTraits = allTraits[traitRows, -1]; # so this line yields NAs too
rownames(datTraits) = allTraits[traitRows, 1];
collectGarbage();
How can I fix the problem?
I added drop = FALSE to this line: datTraits = allTraits[traitRows, -1]
datTraits = allTraits[traitRows, -1, drop = FALSE]
I realized that allTraits contains only 2 columns; when I remove the first one, only one column is left, and R simplifies the result to a plain vector unless I add the drop = FALSE argument.
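Separately, one thing visible in the two tables above is that some row names of dfxpr start with an "X" (X1021_verbena_rep_1) while the sample column of traitData does not (1021_verbena_rep_1). Whether that is what produced the NAs in the asker's session is an assumption, but an exact-match lookup cannot pair such names. A small Python sketch of the effect, where match is a hypothetical stand-in for R's match() and the regular expression is just one possible cleanup:

```python
import re


def match(values, table):
    """Stand-in for R's match(): 1-based position of the first hit, None where R gives NA."""
    first_pos = {}
    for i, name in enumerate(table, start=1):
        first_pos.setdefault(name, i)
    return [first_pos.get(v) for v in values]


# Two row names as shown in dfxpr and two sample names as shown in traitData.
expr_rows = ["X1021_verbena_rep_1", "BL225C_camporegio_rep_1"]
trait_samples = ["1021_verbena_rep_1", "BL225C_camporegio_rep_1"]

raw = match(expr_rows, trait_samples)

# Hypothetical cleanup: drop a leading "X" only when it is followed by a digit.
cleaned_rows = [re.sub(r"^X(?=\d)", "", name) for name in expr_rows]
cleaned = match(cleaned_rows, trait_samples)

print(raw)
print(cleaned)
```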

Selecting columns in R depending on user input

I have a data frame col_metaData in R that looks like this:
sample | b | c | ...
____________________
S1 | 1 | 1 | ...
S2 | 1 | 2 | ...
S3 | 2 | 2 | ...
S4 | 3 | 3 | ...
I want to write a function that returns the samples that have any of the given values in a given column. For example,
fun(b,c(1,2))
should return
S1 S2 S3
while
fun(c,c(2,3))
should return
S2 S3 S4
and so on. If the column would have been fixed (say, b), I could simply do:
col_metaData[col_metaData$b %in% inputList,]$sample
But since there can be many more columns (so I can't use if-else), I was looking for a different way to do this. Can someone please help me? Thanks.
I solved it. Just in case anyone comes looking for an answer, we can use this:
col_metaData[col_metaData[,b] %in% inputList,]$sample
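Notice [,b] instead of $b: here b is a variable holding the column name as a string (for example "b"), whereas $b always looks for a column literally named b.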
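The same "pass the column name as a value" idea, sketched in Python for illustration. The list-of-dicts layout and the function name samples_with are assumptions made for this example, not part of the original R code.

```python
def samples_with(rows, column, wanted):
    """Return the sample names whose value in the given column is one of wanted."""
    wanted = set(wanted)
    return [row["sample"] for row in rows if row[column] in wanted]


# The first four rows of col_metaData from the question.
col_meta_data = [
    {"sample": "S1", "b": 1, "c": 1},
    {"sample": "S2", "b": 1, "c": 2},
    {"sample": "S3", "b": 2, "c": 2},
    {"sample": "S4", "b": 3, "c": 3},
]

by_b = samples_with(col_meta_data, "b", [1, 2])
by_c = samples_with(col_meta_data, "c", [2, 3])
print(by_b)
print(by_c)
```

Because the column is looked up by name at call time, no if-else chain over the columns is needed.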

Count merged observations and calculate fraction

I merged two data sets using Stata and now I need to find the fraction and number of projects matched. To do this, I am assuming that I will need to calculate two counts.
How do I get both of the counts to display at the same time, and then divide one by the other?
Below is an example of my _merge variable:
4022. | master only (1) |
4023. | matched (3) |
4024. | using only (2) |
4025. | using only (2) |
4026. | using only (2) |
4027. | matched (3) |
4028. | matched (3) |
4029. | matched (3) |
4030. | matched (3) |
I would first like to count and store all of the observations under _merge, then count those that don't say "master only", and finally divide the second count by the first.
For example:
count1 count2 fraction
6019 4020 .66 (4020/6019)
With count1 being everything under _merge, while count2 being everything that was matched (excludes master only).
Using the following toy example:
clear
webuse autosize
merge 1:1 make using http://www.stata-press.com/data/r14/autoexpense
First it is a good idea to confirm the value which corresponds to "master only":
list _merge
+-----------------+
| _merge |
|-----------------|
1. | matched (3) |
2. | matched (3) |
3. | matched (3) |
4. | master only (1) |
5. | matched (3) |
|-----------------|
6. | matched (3) |
+-----------------+
list _merge, nolabel
+--------+
| _merge |
|--------|
1. | 3 |
2. | 3 |
3. | 3 |
4. | 1 |
5. | 3 |
|--------|
6. | 3 |
+--------+
Then generate the three variables by first counting the relevant observations and dividing:
count if _merge
generate count1 = r(N)
count if _merge != 1
generate count2 = r(N)
generate fraction = count2 / count1
display count1
6
display count2
5
display fraction
.83333331
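As a cross-check of the arithmetic, here is the same computation on the six _merge codes listed above, written as a small Python sketch (the variable names simply mirror the Stata ones):

```python
# Numeric _merge codes from "list _merge, nolabel": 1 = master only, 3 = matched.
merge_codes = [3, 3, 3, 1, 3, 3]

count1 = len(merge_codes)                            # every observation
count2 = sum(1 for code in merge_codes if code != 1)  # everything except master only
fraction = count2 / count1

print(count1, count2, round(fraction, 4))
```

The fraction is count2 / count1 = 5/6, so it can never exceed 1.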

SQLite find table row where a subset of columns satisfies a specified constraint

I have the following SQLite table
CREATE TABLE visits(urid INTEGER PRIMARY KEY AUTOINCREMENT,
hash TEXT,dX INTEGER,dY INTEGER,dZ INTEGER);
Typical content would be
# select * from visits;
urid | hash | dX | dY | dZ
------+-----------+-------+--------+------
1 | 'abcd' | 10 | 10 | 10
2 | 'abcd' | 11 | 11 | 11
3 | 'bcde' | 7 | 7 | 7
4 | 'abcd' | 13 | 13 | 13
5 | 'defg' | 20 | 21 | 17
What I need to do here is identify the urid of the table row that satisfies the constraint
hash = 'abcd' AND (nearby >= (abs(dX - tX) + abs(dY - tY) + abs(dZ - tZ)))
with the smallest deviation, in the sense of the smallest sum of absolute distances.
In the present instance with
nearby = 7
tX = tY = tZ = 12
there are three rows that meet the above constraint but with different deviations
urid | hash | dX | dY | dZ | deviation
------+-----------+-------+--------+--------+---------------
1 | 'abcd' | 10 | 10 | 10 | 6
2 | 'abcd' | 11 | 11 | 11 | 3
4 | 'abcd' | 13 | 13 | 13 | 3
in which case I would like urid = 2 or urid = 4 to be reported; I don't actually care which one.
Left to my own devices, I would fetch the full set of matching rows and then drill down to the one that satisfies my secondary constraint (smallest deviation) in my own Java code. However, I suspect that this is unnecessary and that it can be done in SQL alone. My knowledge of SQL is sadly too limited here. I hope that someone here can put me on the right path.
I have now managed to do the following
CREATE TEMP TABLE h1(urid INTEGER,devi INTEGER);
INSERT INTO h1 SELECT urid,(abs(dX - 12) + abs(dY - 12) + abs(dZ - 12)) FROM visits WHERE hash = 'abcd';
which gives
--SELECT * FROM h1
urid | devi |
-------+-----------+
1 | 6 |
2 | 3 |
4 | 3 |
following which I issue
SELECT urid FROM h1 ORDER BY devi ASC LIMIT 1;
which yields urid = 2, the result I am after. Whilst this works, I would like to know if there is a better/simpler way of doing this.
You're so close! You have all of the components you need, you just have to put them together into a single query.
Consider:
SELECT urid
, (abs(dX - :tx) + abs(dY - :ty) + abs(dZ - :tz)) AS devi
FROM visits
WHERE hash=:hashval AND devi <= :nearby
ORDER BY devi
LIMIT 1
Line by line: first you list the columns and computed values you want from the visits table (:tx, :ty and :tz are placeholders; in your code you prepare the statement and then bind values to the placeholders before executing it).
Then in the WHERE clause you restrict the returned rows to those matching the particular hash (that column should have an index for best results, for example CREATE INDEX visits_idx_hash ON visits(hash)) and having a devi no greater than the value of the :nearby placeholder. (I think devi <= :nearby is clearer than :nearby >= devi.)
Then you say that you want the results sorted in increasing order of devi, and LIMIT the returned results to a single row because you don't care about any others. (If no rows meet the WHERE constraints, nothing is returned.)
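To show the prepare-and-bind step end to end, here is a minimal sketch using Python's standard sqlite3 module against an in-memory copy of the example table (the asker's code is Java, so treat this purely as an illustration). It binds a separate placeholder per coordinate and uses <= so that the filter matches the nearby >= deviation constraint from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits(urid INTEGER PRIMARY KEY AUTOINCREMENT, "
    "hash TEXT, dX INTEGER, dY INTEGER, dZ INTEGER)"
)
# Sample rows from the question; urid is assigned 1..5 in insertion order.
conn.executemany(
    "INSERT INTO visits(hash, dX, dY, dZ) VALUES (?, ?, ?, ?)",
    [("abcd", 10, 10, 10), ("abcd", 11, 11, 11), ("bcde", 7, 7, 7),
     ("abcd", 13, 13, 13), ("defg", 20, 21, 17)],
)

query = """
SELECT urid,
       (abs(dX - :tx) + abs(dY - :ty) + abs(dZ - :tz)) AS devi
FROM visits
WHERE hash = :hashval AND devi <= :nearby
ORDER BY devi
LIMIT 1
"""
params = {"tx": 12, "ty": 12, "tz": 12, "hashval": "abcd", "nearby": 7}
row = conn.execute(query, params).fetchone()  # None when nothing qualifies
print(row)
conn.close()
```

Rows 2 and 4 tie on deviation, so either may come back; adding urid as a second ORDER BY key would make the choice deterministic.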

Reorder rows by predetermined strings in R

I know this is a simple question so I apologize in advance.
If I have a dataframe like this:
| name | count | class |
|-------|-------|-------------|
| bob | 1 | first grade |
| adam | 5 | college |
| suzie | 7 | high school |
and I want to reorder the rows by class, as in:
| name | count | class |
|-------|-------|-------------|
| bob | 1 | first grade |
| suzie | 7 | high school |
| adam | 5 | college |
I can't use order() since I don't want the class reordered alphabetically.
I tried this, but it failed:
target <- c("first grade", "high school", "college")
df[match(target, df$class),]
This should be straightforward...but reordering is usually reserved for when the values in the columns have some sort of alphanumeric structure. Here, the structure is to be defined by me.
I suppose I could append a new column, with number assignments for class, then sort by that. But there has got to be a more graceful way??
Make class a factor with the levels in the order you want, then use order().
df$class = factor(df$class, levels = target)
df[order(df$class), ]
I think you can do this via an ordered factor.
First extract your variable of interest
d <- df$class
Then turn it into an ordered factor with the levels in the order you wish
x <- ordered(factor(d), levels=c('first grade','high school','college'))
Then use this to order your df
df[order(x),]
Job done, go play a board game.
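Your match needs to be modified a little to work. match(target, df$class) returns only the first row for each class, so rows are dropped when a class occurs more than once; ranking every row by its position in target handles that: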
df[order(match(df$class, target)),]
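All three answers come down to the same idea: map each class to its rank in the desired order and sort by that rank. A small Python sketch of that idea, where the list-of-dicts layout and the names rank and ordered_rows are assumptions for illustration:

```python
target = ["first grade", "high school", "college"]

rows = [
    {"name": "bob", "count": 1, "class": "first grade"},
    {"name": "adam", "count": 5, "class": "college"},
    {"name": "suzie", "count": 7, "class": "high school"},
]

# Position of each class in the custom order, e.g. "college" -> 2.
rank = {cls: i for i, cls in enumerate(target)}

# sorted() is stable, so rows sharing a class keep their original relative order.
ordered_rows = sorted(rows, key=lambda row: rank[row["class"]])
print([row["name"] for row in ordered_rows])
```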
