Reorder rows by predetermined strings in R

I know this is a simple question so I apologize in advance.
If I have a dataframe like this:
| name | count | class |
|-------|-------|-------------|
| bob | 1 | first grade |
| adam | 5 | college |
| suzie | 7 | high school |
and I want to reorder the rows by class, as in:
| name | count | class |
|-------|-------|-------------|
| bob | 1 | first grade |
| suzie | 7 | high school |
| adam | 5 | college |
I can't use order() since I don't want the class reordered alphabetically.
I tried this, but it failed:
target <- c("first grade", "high school", "college")
df[match(target, df$class),]
This should be straightforward...but reordering is usually reserved for when the values in the columns have some sort of alphanumeric structure. Here, the structure is to be defined by me.
I suppose I could append a new column, with number assignments for class, then sort by that. But there has got to be a more graceful way??

Make class a factor with the levels in the order you want, then use order().
df$class = factor(df$class, levels = target)
df[order(df$class), ]

I think you can do this via an ordered factor.
First create a factor variable from your variable of interest
d <- df$class
Then order the factor by the order you wish
x <- ordered(factor(d), levels=c('first grade','high school','college'))
Then use this to order your df
df[order(x),]
Job done, go play a board game.

Your match needs to be modified a little to work:
df[order(match(df$class, target)),]
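One likely reason the original attempt fails on real data: match(target, df$class) returns only the first row for each target value, so any row with a repeated class is dropped. A minimal sketch, assuming a hypothetical extra row (carl) that is not in the question's data:

```r
# Toy data from the question plus one hypothetical duplicate-class row (carl)
df <- data.frame(
  name  = c("bob", "adam", "suzie", "carl"),
  count = c(1, 5, 7, 2),
  class = c("first grade", "college", "high school", "college"),
  stringsAsFactors = FALSE
)
target <- c("first grade", "high school", "college")

# One index per target value: only the first "college" row survives
by_target <- df[match(target, df$class), ]

# One rank per row: every row is kept and sorted by its position in target
by_match <- df[order(match(df$class, target)), ]

# The factor approach gives the same ordering
by_factor <- df[order(factor(df$class, levels = target)), ]
```

Here by_target has three rows while by_match and by_factor keep all four, with ties left in their original order.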


How to match two columns in one dataframe using values in another dataframe in R

I have two dataframes. One is a set of ≈4000 entries that looks similar to this:
| grade_col1 | grade_col2 |
| --- | --- |
| A-| A-|
| B | 86|
| C+| C+|
| B-| D |
| A | A |
| C-| 72|
| F | 96|
| B+| B+|
| B | B |
| A-| A-|
The other is a set of ≈700 entries that look similar to this:
| grade | scale |
| --- | --- |
| A+|100|
| A+| 99|
| A+| 98|
| A+| 97|
| A | 96|
| A | 95|
| A | 94|
| A | 93|
| A-| 92|
| A-| 91|
| A-| 90|
| B+| 89|
| B+| 88|
...and so on.
What I'm trying to do is create a new column that shows whether grade_col2 matches grade_col1 with a binary, 0-1 output (0 = no match, 1 = match). Most of grade_col2 is shown by letter grade. But every once in a while an entry in grade_col2 was accidentally entered as a numeric grade instead. I want this match column to give me a "1" even when grade_col2 is a numeric grade instead of a letter grade. In other words, if grade_col1 is B and grade_col2 is 86, I want this to still be read as a match. Only when grade_col1 is F and grade_col2 is 96 would this not be a match (similar to when grade_col1 is B- and grade_col2 is D = not a match).
The second data frame gives me the information I need to translate between one and the other (entries between 97-100 are A+, between 93-96 are A, and so on). I just don't know how to run a script that uses this information to find matches through all ≈4000 entries. Theoretically, I could do this manually, but the real dataset is so lengthy that this isn't realistic.
I had been thinking of using nested if_else statements with dplyr. But once I got past the first "if" statement, I got stuck. I'd appreciate any help with this people can offer.
You can do this using a join.
Let your first dataframe be grades_df and your second dataframe be lookup_df, then you want something like the following:
library(dplyr)
output = grades_df %>%
  # join on the lookup, keeping every row of the grades table;
  # scale is converted to character so its type matches grade_col2
  left_join(mutate(lookup_df, scale = as.character(scale)),
            by = c(grade_col2 = "scale")) %>%
  # use the looked-up letter grade where grade_col2 was numeric
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  # indicator column: 1 = match, 0 = no match
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
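The same lookup can be sketched in base R with match(), which may help when checking the join result by hand. This assumes an abbreviated toy lookup (only a few of the ≈700 rows, with the B/86 pair taken from the question's description) and that both grade columns are stored as character; idx and grade_col2b are illustrative names:

```r
# A few rows from the question's grades table
grades_df <- data.frame(
  grade_col1 = c("A-", "B", "C+", "B-", "A", "F"),
  grade_col2 = c("A-", "86", "C+", "D", "A", "96"),
  stringsAsFactors = FALSE
)
# Abbreviated toy lookup (assumption: the real table has one row per score)
lookup_df <- data.frame(
  grade = c("A", "A", "A-", "B+", "B"),
  scale = c(96, 95, 92, 89, 86),
  stringsAsFactors = FALSE
)

# Position of each grade_col2 value in the lookup; NA for letter grades
idx <- match(grades_df$grade_col2, as.character(lookup_df$scale))

# Translate numeric entries to letters, leave letter entries untouched
grade_col2b <- ifelse(is.na(idx), grades_df$grade_col2, lookup_df$grade[idx])

# 1 = match, 0 = no match
grades_df$indicator <- as.integer(grades_df$grade_col1 == grade_col2b)
```

With this toy data the B/86 row is flagged as a match and the F/96 row is not.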

Count merged observations and calculate fraction

I merged two data sets using Stata and now I need to find the fraction and number of projects matched. To do this, I am assuming that I will need to calculate two counts.
How do I get both of the counts to display at the same time, and then divide one by the other?
Below is an example of my _merge variable:
4022. | master only (1) |
4023. | matched (3) |
4024. | using only (2) |
4025. | using only (2) |
4026. | using only (2) |
4027. | matched (3) |
4028. | matched (3) |
4029. | matched (3) |
4030. | matched (3) |
I would first like to count and store all of the variables under _merge, and then count those that don't say "master only". Then divide the two by each other.
For example:
count1 count2 fraction
6019 4020 .66 (4020/6019)
With count1 being everything under _merge, while count2 being everything that was matched (excludes master only).
Using the following toy example:
clear
webuse autosize
merge 1:1 make using http://www.stata-press.com/data/r14/autoexpense
First it is a good idea to confirm the value which corresponds to "master only":
list _merge
+-----------------+
| _merge |
|-----------------|
1. | matched (3) |
2. | matched (3) |
3. | matched (3) |
4. | master only (1) |
5. | matched (3) |
|-----------------|
6. | matched (3) |
+-----------------+
list _merge, nolabel
+--------+
| _merge |
|--------|
1. | 3 |
2. | 3 |
3. | 3 |
4. | 1 |
5. | 3 |
|--------|
6. | 3 |
+--------+
Then generate the three variables by first counting the relevant observations and dividing:
count if _merge
generate count1 = r(N)
count if _merge != 1
generate count2 = r(N)
generate fraction = count2 / count1
display count1
6
display count2
5
display fraction
.83333331

Levenshtein logic to get all the strings with minimum difference

Suppose I have a dataframe with values
Mtemp:
-----+
code |
-----+
Ram |
John |
Tracy|
Aman |
I want to compare it with the dataframe
M2:
------+
code |
------+
Vivek |
Girish|
Rum |
Rama |
Johny |
Stacy |
Jon |
I want a result where each value in Mtemp gets at most 2 possible matches in M2 with a Levenshtein distance of 2.
I have used
tp<-as.data.frame(amatch(Mtemp$code,M2$code,method = "lv",maxDist = 2))
tp$orig<-Mtemp$code
colnames(tp)<-c('Res','orig')
and I am getting the following result
Res |orig
-----+-----
3 |Ram
5 |John
6 |Tracy
4 |Aman
Please let me know a way to get 2 values (if possible) for every Mtemp string with a Levenshtein distance of 2.
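One possible approach, sketched with base R's adist() (from utils) rather than the stringdist::amatch() call used above, since amatch() returns only a single best index per string. It assumes "distance 2" means a distance of at most 2, in line with maxDist = 2; the names d and top_matches are illustrative:

```r
Mtemp <- data.frame(code = c("Ram", "John", "Tracy", "Aman"),
                    stringsAsFactors = FALSE)
M2 <- data.frame(code = c("Vivek", "Girish", "Rum", "Rama",
                          "Johny", "Stacy", "Jon"),
                 stringsAsFactors = FALSE)

# Full Levenshtein distance matrix: rows = Mtemp strings, columns = M2 strings
d <- adist(Mtemp$code, M2$code)

# For each Mtemp string, keep up to 2 M2 strings within distance 2
top_matches <- lapply(seq_along(Mtemp$code), function(i) {
  idx <- order(d[i, ])        # nearest first; ties keep M2 order
  idx <- idx[d[i, idx] <= 2]  # drop candidates that are too far away
  M2$code[head(idx, 2)]       # at most 2 candidates
})
names(top_matches) <- Mtemp$code
```

top_matches is a list keyed by the Mtemp string; an entry is empty when no M2 string is within the distance limit.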

Combine DataFrame rows into a new column

I am wondering if there is simple way to achieve this in Julia besides iterating over the rows in a for-loop.
I have a table with two columns that looks like this:
| Name | Interest |
|------|----------|
| AJ | Football |
| CJ | Running |
| AJ | Running |
| CC | Baseball |
| CC | Football |
| KD | Cricket |
...
I'd like to create a table where each Name in first column is matched with a combined Interest column as follows:
| Name | Interest |
|------|----------------------|
| AJ | Football, Running |
| CJ | Running |
| CC | Baseball, Football |
| KD | Cricket |
...
How do I achieve this?
UPDATE: OK, so after trying a few things including print_joint and grpby, I realized that the easiest way to do this would be the by() function. I'm 99% there.
by(myTable, :Name, df->DataFrame(Interest = string(df[:Interest])))
This gives me my :Interest column as "UTF8String[\"Running\"]", and I can't figure out which method I should use instead of string() (or where to typecast) to get the desired ASCIIString output.

Code new variable based on grep return in R

I have a variable actor which is a string and contains values like "military forces of guinea-bissau (1989-1992)" and a large range of other different values that are fairly complex. I have been using grep() to find character patterns that match different types of actors. For example I would like to code a new variable actor_type as 1 when actor contains "military forces of", doesn't contain "mutiny of", and the string variable country is also contained in the variable actor.
I am at a loss as to how to conditionally create this new variable without resorting to some type of horrible for loop. Help me!
Data looks roughly like this:
| | actor | country |
|---+----------------------------------------------------+-----------------|
| 1 | "military forces of guinea-bissau" | "guinea-bissau" |
| 2 | "mutiny of military forces of guinea-bissau" | "guinea-bissau" |
| 3 | "unidentified armed group (guinea-bissau)" | "guinea-bissau" |
| 4 | "mfdc: movement of democratic forces of casamance" | "guinea-bissau" |
If your data is in a data.frame df:
> ifelse(!grepl('mutiny of' , df$actor) & grepl('military forces of',df$actor) & apply(df,1,function(x) grepl(x[2],x[1])),1,0)
[1] 1 0 0 0
grepl returns a logical vector and this can be assigned to whatever, e.g. df$actor_type.
Breaking that apart:
!grepl('mutiny of', df$actor) and grepl('military forces of', df$actor) satisfy your first two requirements. The last piece, apply(df,1,function(x) grepl(x[2],x[1])), goes row by row and greps for country in actor. Note that it relies on actor and country being the first and second columns of df.
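A variant of the same idea that refers to columns by name and treats every pattern as a literal string, in case a country name contains regex metacharacters. This is a sketch on the four example rows, with has_country as an illustrative name:

```r
df <- data.frame(
  actor = c("military forces of guinea-bissau",
            "mutiny of military forces of guinea-bissau",
            "unidentified armed group (guinea-bissau)",
            "mfdc: movement of democratic forces of casamance"),
  country = rep("guinea-bissau", 4),
  stringsAsFactors = FALSE
)

# Pair each country with its own actor; fixed = TRUE disables regex matching
has_country <- mapply(grepl, df$country, df$actor,
                      MoreArgs = list(fixed = TRUE))

# 1 when all three conditions hold, 0 otherwise
df$actor_type <- as.integer(
  grepl("military forces of", df$actor, fixed = TRUE) &
    !grepl("mutiny of", df$actor, fixed = TRUE) &
    has_country
)
```

On these rows df$actor_type comes out as 1 0 0 0, matching the output above.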
