How to match two columns in one dataframe using values in another dataframe in R - r

I have two dataframes. One is a set of ≈4000 entries that looks similar to this:
| grade_col1 | grade_col2 |
| --- | --- |
| A-| A-|
| B | 86|
| C+| C+|
| B-| D |
| A | A |
| C-| 72|
| F | 96|
| B+| B+|
| B | B |
| A-| A-|
The other is a set of ≈700 entries that look similar to this:
| grade | scale |
| --- | --- |
| A+|100|
| A+| 99|
| A+| 98|
| A+| 97|
| A | 96|
| A | 95|
| A | 94|
| A | 93|
| A-| 92|
| A-| 91|
| A-| 90|
| B+| 89|
| B+| 88|
...and so on.
What I'm trying to do is create a new column that shows whether grade_col2 matches grade_col1 with a binary, 0-1 output (0 = no match, 1 = match). Most of grade_col2 is shown by letter grade. But every once in awhile an entry in grade_col2 was accidentally entered as a numeric grade instead. I want this match column to give me a "1" even when grade_col2 is a numeric grade instead of a letter grade. In other words, if grade_col1 is B and grade_col2 is 86, I want this to still be read as a match. Only when grade_col1 is F and grade_col2 is 96 would this not be a match (similar to when grade_col1 is B- and grade_col2 is D = not a match).
The second data frame gives me the information I need to translate between one and the other (entries between 97-100 are A+, between 93-96 are A, and so on). I just don't know how to run a script that uses this information to find matches through all ≈4000 entries. Theoretically, I could do this manually, but the real dataset is so lengthy that this isn't realistic.
I had been thinking of using nested if_else statements with dplyr. But once I got past the first "if" statement, I got stuck. I'd appreciate any help with this people can offer.

You can do this using a join.
Let your first dataframe be grades_df and your second dataframe be lookup_df, then you want something like the following:
output = grades_df %>%
# join on look up, keeping everything grades table
left_join(lookup_df, by = c(grade_col2 = "scale")) %>%
# combine grade_col2 from grades_df and grade from lookup_df
mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
# indicator column
mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))

Related

WGCNA package: value matching function output contains wrong NAs

I use WGCNA package for analyzing the co-expressed genes. Here I try to Form a data frame analogous to expression data that will hold the clinical traits. and i use the following codes:
table for traitData
| x | sample | NoduleperPlant |
|- |- |- |
| 1 | 1021_verbena_rep_1 | 2 |
| 2 | 1021_verbena_rep_2 | 3 |
| 3 | 1021_verbena_rep_3 | 1 |
| 4 | 1021_camporegio_rep_1 | 2 |
| 5 | 1021_camporegio_rep_2 | 3 |
| 6 | 1021_camporegio_rep_3 | 4 |
| 7 | BL225C_camporegio_rep_1 | 5 |
| 8 | BL225C_camporegio_rep_2 | 4 |
| 9 | BL225C_camporegio_rep_3 | 1 |
Table dfxpr (some of the genes are presented in table)
|FIELD1 |aacC-1|aacC4-1|aapJ-1|aapM-1|aapP-1|aapQ-1|aarF-1|
|-----------------------|------|-------|------|------|------|------|------|
|X1021_verbena_rep_1 |42 |46 |12412 |935 |3354 |2876 |550 |
|X1021_verbena_rep_2 |52 |37 |11775 |946 |2970 |2824 |514 |
|X1021_verbena_rep_3 |12 |22 |5077 |397 |1462 |1228 |230 |
|X1021_camporegio_rep_1 |52 |71 |12983 |1454 |3408 |3248 |707 |
|X1021_camporegio_rep_2 |20 |65 |9240 |803 |2807 |3146 |445 |
|X1021_camporegio_rep_3 |28 |53 |11030 |1065 |3480 |3410 |582 |
|BL225C_camporegio_rep_1|29 |19 |6346 |375 |938 |768 |118 |
|BL225C_camporegio_rep_2|51 |62 |12938 |781 |1765 |1629 |291 |
|BL225C_camporegio_rep_3|52 |43 |6462 |504 |1120 |1091 |238 |
traitData = read.csv("NodulPerPlantTraitForLowGroup.csv"); #this csv file contains 3 columns as the first column is non-relevant information, second column contains the names of samples and the third column holds the values measured for the traits.
# remove columns that hold information I do not need.
allTraits = traitData[, -1];
allTraits = allTraits[, 1:2];
# Form a data frame analogous to expression data that will hold the clinical traits.
lowNoduleSamples = rownames(dfxpr) #dfxpr is a data frame containing 9 observations (i.e. samples) and 6398 variables (i.e. genes)
traitRows = match(lowNoduleSamples, allTraits$sample); #here is the line i get wrong values as NAs while i know they all should match
datTraits = allTraits[traitRows, -1]; #then this lines result NAs too
rownames(datTraits) = allTraits[traitRows, 1];
collectGarbage();
how can I fix the problem?
I have Added a "drop = FALSE" to this line: datTraits = allTraits[traitRows, -1]
datTraits = allTraits[traitRows, -1, drop = FALSE]
I realized that my allTraits contains only 2 columns; when I remove the first one, I'm left with just one column and R converts that into a single vector unless I add the drop = FALSE argument.

Split a single row into multiple rows keeping the delimiter intact

I am trying to split a single row in my data set into multiple rows by keeping the delimiter intact.
This is a sample of my input data set
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1. Teams must be split into two |
| | 2. Teams must have ten players in each team |
| | 3. Each player must bring their own gear |
|---------------------|----------------------------------------------- |
When I use Strsplit function, I get the following output:
df = data.frame(rules =unlist(strsplit(as.character(df$Rules),"?=[[digits]]", perl = T)))
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1 |
|--------------------------------------------------------------------- |
1 | .Teams must be split into two |
|--------------------------------------------------------------------- |
| 1 | 2 |
|--------------------------------------------------------------------- |
1 | .Teams must have ten players in each team |
|--------------------------------------------------------------------- |
My desired Output
|---------------------|----------------------------------------------- |
| Group | Rules |
|---------------------|----------------------------------------------- |
| 1 | 1.Teams must be split into two |
|--------------------------------------------------------------------- |
| 1 | 2.Teams must have ten players in each team |
|--------------------------------------------------------------------- |
Here is a way to collapse each number with the following character string in column Rules. It throws warnings, not errors.
grp <- cumsum(!is.na(as.numeric(df$Rules)))
res <- lapply(split(df, grp), function(X){
data.frame(Group = X[[1]][1],
Rules = paste(X[[2]], collapse = ""))
})
res <- do.call(rbind, res)
res
# Group Rules
#1 1 1.Teams must be split into two
#2 1 2.Teams must have ten players in each team
Data.
df <- data.frame(Group = rep(1, 4),
Rules = c(1, ".Teams must be split into two",
2, ".Teams must have ten players in each team"),
stringsAsFactors = FALSE)

Count merged observations and calculate fraction

I merged two data sets using Stata and now I need to find the fraction and number of projects matched. To do this, I am assuming that I will need to calculate two counts.
How do I get both of the counts to display at the same time, and then divide one by the other?
Below is an example of my _merge variable:
4022. | master only (1) |
4023. | matched (3) |
4024. | using only (2) |
4025. | using only (2) |
4026. | using only (2) |
4027. | matched (3) |
4028. | matched (3) |
4029. | matched (3) |
4030. | matched (3) |
I would first like to count and store all of the variables under _merge, and then count those that don't say "master only". Then divide the two by each other.
For example:
count1 count2 fraction
6019 4020 .66 (4020/6019)
With count1 being everything under _merge, while count2 being everything that was matched (excludes master only).
Using the following toy example:
clear
webuse autosize
merge 1:1 make using http://www.stata-press.com/data/r14/autoexpense
First it is a good idea to confirm the value which corresponds to "master only":
list _merge
+-----------------+
| _merge |
|-----------------|
1. | matched (3) |
2. | matched (3) |
3. | matched (3) |
4. | master only (1) |
5. | matched (3) |
|-----------------|
6. | matched (3) |
+-----------------+
list _merge, nolabel
+--------+
| _merge |
|--------|
1. | 3 |
2. | 3 |
3. | 3 |
4. | 1 |
5. | 3 |
|--------|
6. | 3 |
+--------+
Then generate the three variables by first counting the relevant observations and dividing:
count if _merge
generate count1 = r(N)
count if _merge != 1
generate count2 = r(N)
generate fraction = count2 / count1
display count1
6
display count2
5
display fraction
1.2

Addition of calculated field in rpivotTable

I want to create a calculated field to use with the rpivotTable package, similar to the functionality seen in excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A | P | X | 173247 | 34649 |
| A | P | Y | 451598 | 225799 |
| A | P | Z | 759695 | 463414 |
| A | Q | X | 358040 | 225565 |
| A | Q | Y | 102068 | 36744 |
| A | Q | Z | 994961 | 228841 |
| A | R | X | 454672 | 231883 |
| A | R | Y | 275994 | 124197 |
| A | R | Z | 691100 | 165864 |
| B | P | X | 755594 | 302238 |
| . | . | . | . | . |
| . | . | . | . | . |
+--------------+--------+---------+-------------+-----------------+
(my actual table has many more columns, both dimensions and measures, time, etc. and I need to define multiple such "calculated columns")
If I want to calculate defect rate (which would be Defective Units/Total Units) and I want to aggregate by either of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't seem to work and summed up defect rates (i.e., sum(Defective_Units/Total_Units)), instead of sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving my defect rates greater than 1. Is there anywhere I can declare a calculated field, which is just a formula evaluated post aggregation?
You're lucky - the creator of pivottable.js foresaw cases like yours (and mine, earlier today) by implementing an aggregator called "Sum over Sum" and a few more, likewise, cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll use "Sum over Sum" as parameter "aggregatorName", and the columns whose quotient we want in the "vals" parameter.
Here's a meaningless usage example from the mtcars data for reproducibility:
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars,rows="gear", cols=c("cyl","carb"),
aggregatorName = "Sum over Sum",
vals =c("mpg","disp"),
width="100%", height="400px")

Combine DataFrame rows into a new column

I am wondering if there is simple way to achieve this in Julia besides iterating over the rows in a for-loop.
I have a table with two columns that looks like this:
| Name | Interest |
|------|----------|
| AJ | Football |
| CJ | Running |
| AJ | Running |
| CC | Baseball |
| CC | Football |
| KD | Cricket |
...
I'd like to create a table where each Name in first column is matched with a combined Interest column as follows:
| Name | Interest |
|------|----------------------|
| AJ | Football, Running |
| CJ | Running |
| CC | Baseball, Football |
| KD | Cricket |
...
How do I achieve this?
UPDATE: OK, so after trying a few things including print_joint and grpby, I realized that the easiest way to do this would be by() function. I'm 99% there.
by(myTable, :Name, df->DataFrame(Interest = string(df[:Interest])))
This gives me my :Interest column as "UTF8String[\"Running\"]", and I can't figure out which method I should use instead of string() (or where to typecast) to get the desired ASCIIString output.

Resources