R: two data frame merge with 2 variables and several other conditions - r

I am a beginner in R. Below is an example of a data table (C) that I created in JMP by joining tables A and B on the columns A1 and B;C. During the join, the entries of table B whose OP column contains CLO are dropped, while column J from table A is updated.
I am trying to create data frame C with the merge command in R. I used the following expression:
C <- merge(B,A, BY=c("A1","B;C"),all.x = TRUE)
but I don't get data frame C. I would appreciate any help from the community to solve this.
Data Frame A
A1 | B;C | D |E |F |G | H | I |J |K |L | M |
------|------|---|--|---|---|---|------------|---|----|----|---|
ABCD |SD;TH |HO |2 |FA | |ENG| 201808:SPR |54 |PRO |VAC |MAA|
JCBW |RF;TH |HO |2 |FU |VIN|FUT| 504278:SPR |4 |PRO |VAC |MAA|
TVGH |ED;UJ |HO |2 |FU |VIN|FUT| 504276:SPR |4 |PRO |VAC |MAA|
IGHE |WR;RE |HO |3 |IN | |SPE| 504278:SPR |73 |PRO |VAC |MAA|
UUUU |DF;TH |HO |3 |FU | |FUT| 357193:IT |13 |INT |VAC |MAA|
JFLD |YO;TH |HO |3 |CH |BRI|CHE| 476306:SPR |6 |PRO |VAC |MAA|
Data frame B
OWN|COM|OP |GR |J | A1 | B;C | D|E |F |G |H | I |K |L |M
---|---|---|---|--|-----|-----|--|--|--|---|---|-----------|---|---|----
SUP|X |CLO|ARE|16|59HUW|BB;TH|HO|8 |FA|MIC|SPE|90278:SPR |INT|VAC|MAA
SUP|X |OPE|ARE|75|ABCD |SD;TH|HO|8 |FU|MIC|ENG|201808:SPR |INT|VAC|MAA
SUP|X |CLO|ARE|4 |59HVG|BB;RE|HO|8 |FA|MIC|SPE|6074278:SPR|INT|VAC|MAA
PAD|X |CLO|PEN|30|9RHSG|BV;TH|HO|2 |FA| |SPE|201808:SPR |PRO|VAC|MAA
PAD|X |OPE|PEN|99|UUUU |DF;TH|HO|8 |FU|MIC|FUT|357193:IT |PRO|VAC|MAA
PAD|X |OPE|PEN|65|IGHE |WR;RE|HO|8 |IN| |SPE|504278:SPR |PRO|VAC|MAA
PAD|X |CLO|PEN|13|S9K7E|FN;TH|HO|8 |FA|MIC|FUT|394290:SPR |PRO|VAC|MAA
Data frame C
OWN|COM|OP |GR |J |A1 | B;C |D |E |F | G |H | I | K |L |M
---|---|---|---|---|----|-----|--|--|--|---|---|----------|---|---|----
SUP|x |OPE|ARE|99 |ABCD|SD;TH|HO|8 |FU|MIC|ENG|201808:SPR|INT|VAC|MAA
PAD|x |OPE|PEN|120|UUUU|DF;TH|HO|8 |FU|MIC|FUT|357193:IT |PRO|VAC|MAA
PAD|x |OPE|PEN|73 |IGHE|WR;RE|HO|8 |IN| |SPE|504278:SPR|PRO|VAC|MAA
| | | |4 |JCBW|RF;TH|HO|2 |FU|VIN|FUT|504278:SPR|PRO|VAC|MAA
| | | |25 |TVGH|ED;UJ|HO|2 |FU|VIN|FUT|504276:SPR|PRO|VAC|MAA
| | | |15 |JFLD|YO;TH|HO|3 |CH|BRI|CHE|476306:SPR|PRO|VAC|MAA
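A possible starting point, sketched on a trimmed-down, hypothetical version of A and B (only a few of the rows and columns above). R argument names are case-sensitive, so BY= is not recognised as by= and merge() falls back to joining on every column the two frames share. The sketch assumes the frames were built with check.names = FALSE so that the key column really is named B;C. It puts A first with all.x = TRUE, because C appears to keep every row of A. Both J columns are kept under the made-up suffixes .A and .B, since the rule JMP used to update J is not clear from the tables.

```r
# Toy stand-ins for A and B (hypothetical, reduced to a few columns and rows).
A <- data.frame(A1    = c("ABCD", "JCBW", "UUUU"),
                `B;C` = c("SD;TH", "RF;TH", "DF;TH"),
                J     = c(54, 4, 13),
                check.names = FALSE, stringsAsFactors = FALSE)
B <- data.frame(OP    = c("CLO", "OPE", "OPE"),
                J     = c(16, 75, 99),
                A1    = c("59HUW", "ABCD", "UUUU"),
                `B;C` = c("BB;TH", "SD;TH", "DF;TH"),
                check.names = FALSE, stringsAsFactors = FALSE)

# Lowercase `by` restricts the join to the two key columns.
# all.x = TRUE keeps every row of A; rows of B without a match in A are dropped.
C <- merge(A, B, by = c("A1", "B;C"), all.x = TRUE,
           suffixes = c(".A", ".B"))
C
```

How J.A and J.B should be combined into a single J column depends on the update rule, which would have to be added as a separate step.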

Related

Change outliers from black to colour in grouped box plot in ggplot2

I have a grouped box plot in which I want to change the outlier dots from the default of black to the colour of the boxes keeping everything else the same. There is a previous thread that provides a solution for this for a standard box plot that I am able to implement.
Coloring boxplot outlier points in ggplot2?
However, I want to do it for a grouped box plot.
Below is some example data and code for the grouped box plot.
|ID |Time |Metabolite | Concentration|
|:--|:----|:----------|-------------:|
|1 |1 |A | 40|
|1 |1 |B | 36|
|1 |1 |C | 28|
|1 |2 |A | 13|
|1 |2 |B | 150|
|1 |2 |C | 32|
|1 |3 |A | 45|
|1 |3 |B | 15|
|1 |3 |C | 15|
|2 |1 |A | 7|
|2 |1 |A | 9|
|2 |1 |B | 236|
|2 |1 |C | 33|
|2 |2 |A | 33|
|2 |2 |B | 48|
|2 |2 |C | 39|
|2 |3 |A | 15|
|2 |3 |C | 126|
|3 |1 |A | 13|
|3 |1 |B | 41|
|3 |1 |C | 37|
|3 |2 |A | 3|
|3 |2 |B | 218|
|3 |2 |C | 27|
|3 |3 |A | 7|
|3 |3 |B | 27|
|3 |3 |C | 3|
|4 |1 |A | 4|
|4 |1 |B | 7|
|4 |1 |C | 33|
|4 |2 |A | 133|
|4 |2 |B | 4|
|4 |2 |C | 10|
|4 |3 |A | 122|
|4 |3 |B | 27|
|4 |3 |C | 14|
|5 |1 |A | 7|
|5 |1 |B | 22|
|5 |1 |C | 43|
|5 |2 |A | 3|
|5 |2 |B | 6|
|5 |2 |C | 158|
|5 |3 |A | 48|
|5 |3 |B | 7|
|5 |3 |C | 24|
|6 |1 |A | 15|
|6 |1 |B | 30|
|6 |1 |C | 15|
|6 |2 |A | 27|
|6 |2 |B | 187|
|6 |2 |C | 9|
|6 |3 |A | 31|
|6 |3 |B | 40|
|6 |3 |C | 41|
|7 |1 |A | 37|
|7 |1 |B | 30|
|7 |1 |C | 28|
|7 |2 |A | 142|
|7 |2 |B | 40|
|7 |2 |C | 7|
|7 |3 |A | 45|
|7 |3 |B | 3|
|8 |3 |C | 45|
|8 |1 |A | 34|
|8 |1 |B | 8|
|8 |1 |C | 46|
|8 |2 |A | 167|
|8 |2 |B | 25|
|8 |2 |C | 34|
|8 |3 |A | 27|
|9 |3 |B | 28|
|9 |3 |C | 36|
|9 |1 |A | 44|
|9 |1 |B | 26|
|9 |1 |C | 20|
|9 |2 |A | 11|
|9 |2 |B | 18|
|9 |2 |C | 176|
|9 |3 |A | 1|
|9 |3 |B | 40|
|9 |3 |C | 10|
|10 |1 |A | 8|
|10 |1 |B | 49|
|10 |1 |C | 193|
|10 |2 |A | 13|
|10 |2 |B | 13|
|10 |2 |C | 28|
|10 |3 |A | 50|
|10 |3 |B | 47|
|10 |3 |C | 46|
|11 |1 |A | 21|
|11 |1 |B | 34|
|11 |1 |C | 28|
|11 |2 |A | 13|
|11 |2 |B | 32|
|11 |2 |C | 47|
|11 |3 |A | 15|
|11 |3 |B | 42|
|11 |3 |C | 9|
ggplot(df, aes(x=Time, y=Concentration, fill=Metabolite)) +
  geom_boxplot()
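One commonly used approach, sketched here on simulated data rather than the table above, is to draw the boxplots twice: a first layer that maps colour to the grouping variable, so that the outlier points pick up the group colour, and a second layer on top with black outlines and the outliers hidden. The sketch assumes Time is treated as a factor so that the boxes are dodged within each time point; the simulated df is purely illustrative.

```r
library(ggplot2)

# Simulated stand-in for the data in the question (hypothetical values).
set.seed(1)
df <- data.frame(
  Time          = rep(1:3, each = 30),
  Metabolite    = rep(c("A", "B", "C"), times = 30),
  Concentration = rexp(90, rate = 1 / 30)
)

p <- ggplot(df, aes(x = factor(Time), y = Concentration, fill = Metabolite)) +
  # Layer 1: outlines and outliers coloured by Metabolite (legend suppressed).
  geom_boxplot(aes(colour = Metabolite), show.legend = FALSE) +
  # Layer 2: same boxes redrawn with black outlines, outliers not drawn,
  # so only the coloured outlier points of layer 1 remain visible.
  geom_boxplot(outlier.shape = NA, colour = "black")

p
```

With the default palettes the outline colours of the first layer match the fill colours, so the outliers end up in the colour of their box.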

Function to eliminate rows from a dataframe with certain condition in R

Hello everyone!
I will try to explain my problem, which I find quite difficult. I hope you can help me.
I have a data frame, let's call it DF1, that looks like this:
|Symbol | Date | Volume | Price|
|-------|-----------|--------|------|
|A |2014-01-01 | 0 | 4 |
|A |2014-01-02 | 7 | 7 |
|A |2014-01-03 | 8 | 9 |
|A |2014-01-04 | 1 | 5 |
|B |2014-01-01 |45 | 6 |
|B |2014-01-02 |0 | 11 |
|B |2014-01-03 |34 | 8 |
|B |2014-01-04 |45 | 5 |
|C |2014-01-01 |4 | 6 |
|C |2014-01-02 |0 | 5 |
|C |2014-01-03 |14 | 25 |
|D |2014-01-01 |31 | 4 |
|D |2014-01-02 |7 | 6 |
|D |2014-01-03 |18 | 3 |
|D |2014-01-04 |15 | 7 |
|E |2014-01-01 |13 | 8 |
|E |2014-01-02 |0 | 9 |
From this data frame I create a new one, let's call it RM, with the following lines of code:
RM <- DF1 %>% group_by(Date) %>%
  mutate(weight = Volume/sum(Volume),
         R_i = weight*(log(Price)-log(lag(Price)))) %>%
  summarise(RM = sum(R_i, na.rm = TRUE))
From RM, I select only the dates of interest:
RM_reg <- subset(RM, Date >= "2014-03-05" & Date <= "2014-09-03")
Finally, RM_reg looks like this:
| Date | RM |
|-----------|----|
|2014-03-05 | 0 |
|2014-03-06 | 7 |
|2014-03-07 | 8 |
|2014-03-08 | 1 |
|2014-03-09 | 45 |
|2014-03-10 | 0 |
|2014-03-11 | 34 |
|2014-03-12 | 45 |
|2014-03-13 | 4 |
|2014-03-14 | 0 |
|2014-03-15 | 14 |
|2014-03-16 | 31 |
Note that the RM values shown are only examples, not the actual values. With my original data frame, RM_reg has 125 rows.
Then, from DF1, I extract the rows for which the Symbol column equals A:
DF_A <- DF1 %>%
  filter(Symbol=="A")
I add a column of returns to DF_A:
RA <- DF_A %>% group_by(Symbol) %>%
  mutate(Ret_i = log(Price) - lag(log(Price)))
I remove the first row, which is NA:
AR <- na.omit(RA)
And from AR, I select only the dates of interest:
AR_reg <- subset(AR, Date >= "2014-03-05" & Date <= "2014-09-03")
AR_reg looks like this:
|Symbol | Date | Volume | Price | Ret_i |
|-------|-----------|--------|-------|-------|
|A |2014-03-05 | 1 | 5 | 2 |
|A |2014-03-06 | 3 | 8 | 3 |
|A |2014-03-07 | 7 | 4 | 4 |
|A |2014-03-08 |3 | 6 | 5 |
|A |2014-03-09 |34 | 7 | 1 |
|A |2014-03-10 |45 | 34 | 4 |
|A |2014-03-11 |4 | 5 | 3 |
|A |2014-03-12 |9 | 7 | 5 |
|A |2014-03-13 |8 | 6 | 6 |
|A |2014-03-14 |4 | 4 | 1 |
|A |2014-03-15 |0 | 7 | 4 |
|A |2014-03-16 |4 | 7 | 7 |
Note that the values shown for AR_reg are also only examples. With my original data frame, AR_reg also has 125 rows.
Finally, because RM_reg and AR_reg have the same number of rows, I can regress the Ret_i column of AR_reg on the RM column of RM_reg:
mod <- lm(AR_reg$Ret_i ~ RM_reg$RM)
I need to do the same for every other Symbol in DF1, in this case "B", "C", "D" and "E". The problem is that the Symbols do not all have the same number of rows, and equal length is a necessary condition for the regression: I need 125 observations of returns for each Symbol.
My idea is to drop the Symbols for which the data frame equivalent to AR_reg does not have 125 rows, but I do not know how to do this. I suppose a function has to be written, but that is a topic I have not mastered yet.
Thank you very much for reading. I hope the problem is clear. Any help or suggestion will be much appreciated.
Join DF1 with RM by Date, keep only the rows between the dates of interest, calculate Ret_i for each Symbol, drop the NA values, and build a list of models (one per Symbol).
The complete code would look like this:
library(dplyr)
DF1$Date <- as.Date(DF1$Date)
# Per-Date sum of the volume-weighted R_i terms
RM <- DF1 %>%
  group_by(Date) %>%
  mutate(weight = Volume/sum(Volume),
         R_i = weight*(log(Price)-log(lag(Price)))) %>%
  summarise(RM = sum(R_i, na.rm = TRUE))
# One regression of Ret_i on RM per Symbol
result <- DF1 %>%
  left_join(RM, by = 'Date') %>%
  filter(between(Date, as.Date("2014-03-05"), as.Date("2014-09-03"))) %>%
  group_by(Symbol) %>%
  mutate(Ret_i = log(Price) - lag(log(Price))) %>%
  na.omit() %>%
  summarise(model = list(lm(Ret_i~RM)))
result
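The question also asks how to drop the Symbols that do not have the required number of observations. A minimal base R sketch of that step follows, on hypothetical toy data in which a threshold of 3 rows stands in for the 125 observations. The names DF, n_required and DF_complete are made up for illustration.

```r
# Hypothetical toy data: A and B have 3 rows each, C has only 2.
DF <- data.frame(
  Symbol = c("A", "A", "A", "B", "B", "B", "C", "C"),
  Ret_i  = c(0.1, -0.2, 0.3, 0.0, 0.1, -0.1, 0.2, 0.4),
  stringsAsFactors = FALSE
)
n_required <- 3  # stands in for the 125 observations in the question

# Count rows per Symbol, then keep only Symbols with exactly that many rows.
n_per_symbol <- table(DF$Symbol)
keep <- names(n_per_symbol)[as.vector(n_per_symbol) == n_required]
DF_complete <- DF[DF$Symbol %in% keep, ]
DF_complete
```

In a dplyr pipeline the same idea can be written as group_by(Symbol) %>% filter(n() == 125), placed after the na.omit() step.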

orientdb cluster / console

I have installed OrientDB with Docker.
All objects appear normal in the Studio interface, but the console does not list the clusters correctly. Only the id appears:
orientdb {db=GratefulDeadConcerts}> list clusters
CLUSTERS (collections)
+----+-----+-----+
|# |NAME |COUNT|
+----+-----+-----+
|0 | | |
|1 | | |
|2 | | |
|3 | | |
|4 | | |
|5 | | |
|6 | | |
|7 | | |
|8 | | |
|9 | | |
|10 | | |
|11 | | |
|12 | | |
|13 | | |
|14 | | |
|15 | | |
|16 | | |
|17 | | |
|18 | | |
|19 | | |
|20 | | |
|21 | | |
|22 | | |
|23 | | |
|24 | | |
|25 | | |
|26 | | |
|27 | | |
|28 | | |
|29 | | |
|30 | | |
|31 | | |
|32 | | |
|33 | | |
|34 | | |
|35 | | |
|36 | | |
|37 | | |
|38 | | |
|39 | | |
|40 | | |
|41 | | |
|42 | | |
|43 | | |
|44 | | |
|45 | | |
|46 | | |
|47 | | |
|48 | | |
|49 | | |
|50 | | |
|51 | | |
|52 | | |
+----+-----+-----+
| |TOTAL| 0|
+----+-----+-----+
Cluster names appear correctly in the Studio interface.
Why is there a difference between the console and Studio?

How to compare comma separated string in one column with the comma separated strings other dataframe

I have two df as below
df1:
M1 |
-------+
a,b,c |
a |
b,c |
c,b,a |
b,a,d |
d,a,b,c|
a,d,c |
b |
c,d |
d,a |
df2:
X1 |X2
--------+---
a |1
b |2
c |3
d |4
a,b |5
a,c |6
a,d |7
b,c |8
b,d |9
c,d |10
a,b,c |11
a,c,d |12
a,b,d |13
b,c,d |14
a,b,c,d |15
I would like to match the values in df1$M1 against df2$X1, ignoring the order of the comma-separated items, and put the corresponding X2 value in a new column M2, as below:
df1:
M1 |M2
--------+---
a,b,c |11
a |1
b,c |8
c,b,a |11
b,a,d |13
d,a,b,c |15
a,d,c |12
b |2
c,d |10
d,a |7
Can someone help me
M1 and X1 have to be stored as characters. You can check with str(df1) and convert if necessary with df1$M1 <- as.character(df1$M1), and likewise df2$X1 <- as.character(df2$X1) for df2.
Then, create new columns with the values in alphabetical order:
df1$Ordered <- sapply(lapply(strsplit(df1$M1, ","), sort),paste,collapse=",")
df2$Ordered <- sapply(lapply(strsplit(df2$X1, ","), sort),paste,collapse=",")
Then perform a join like so:
merge(df1, df2, by="Ordered")
If you want to include all the values in df1 regardless of whether they have a matching value in df2, add the all.x = TRUE argument. Same logic applies adding all = TRUE (include everything from both data frames), or all.y = TRUE for df2.
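Note that merge() sorts its result by the key, so the rows of df1 come back in a different order. If the original row order matters, a lookup with match() on the same sorted keys is one alternative. The sketch below rebuilds the two data frames from the question; the helper name normalise_key is made up for illustration, and it assumes the strings contain no stray whitespace.

```r
df1 <- data.frame(
  M1 = c("a,b,c", "a", "b,c", "c,b,a", "b,a,d",
         "d,a,b,c", "a,d,c", "b", "c,d", "d,a"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  X1 = c("a", "b", "c", "d", "a,b", "a,c", "a,d", "b,c", "b,d", "c,d",
         "a,b,c", "a,c,d", "a,b,d", "b,c,d", "a,b,c,d"),
  X2 = 1:15,
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on commas, sort the items, glue them back,
# so that "c,b,a" and "a,b,c" produce the same key.
normalise_key <- function(x) {
  vapply(strsplit(x, ",", fixed = TRUE),
         function(parts) paste(sort(parts), collapse = ","),
         character(1))
}

# Look up each df1 key in df2 and pull the matching X2, keeping df1's order.
df1$M2 <- df2$X2[match(normalise_key(df1$M1), normalise_key(df2$X1))]
df1
```

Values of M1 with no counterpart in df2 would get NA in M2.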

Recode Variable in R after matching with another Data Frame

I have 2 dataframes in R,
DF1
|attr1|attr2|attr3|
|5 |4 |9 |
|4 |30 |2 |
|5 |18 |1 |
|3 |1 |7 |
|6 |30 |0 |
|8 |18 |12 |
Now I'm trying to recode the values in the attr2 column of this data frame so that a value is recoded as 1 if it is present in the Var1 column of DF2, and as 0 otherwise. The second data frame is simply a count of the top 2 unique values within attr2.
DF2
|Var1|Freq|
|30 |2 |
|18 |2 |
I want the result to be in the format of something as follows:
|attr1|attr2|attr3|
|5 |0 |9 |
|4 |1 |2 |
|5 |1 |1 |
|3 |0 |7 |
|6 |1 |0 |
|8 |1 |12 |
Thanks for the help!
We can use
library(dplyr)
DF1 %>%
mutate(attr2 = as.integer(attr2 %in% DF2$Var1))
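For reference, a self-contained base R version of the same recoding, with the two data frames rebuilt from the tables in the question:

```r
DF1 <- data.frame(attr1 = c(5, 4, 5, 3, 6, 8),
                  attr2 = c(4, 30, 18, 1, 30, 18),
                  attr3 = c(9, 2, 1, 7, 0, 12))
DF2 <- data.frame(Var1 = c(30, 18), Freq = c(2, 2))

# %in% returns TRUE/FALSE per row; as.integer() turns that into 1/0.
DF1$attr2 <- as.integer(DF1$attr2 %in% DF2$Var1)
DF1
```

Unlike the mutate() call, this overwrites attr2 in DF1 directly, so assign to a new column instead if the original values are still needed.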
