conditionally transpose select rows using tidyverse


I have a dataset that I'm working with that I'm attempting to reshape using tidyverse.
From:
|Name |eval |test |type | score|
|:----|:------|:----|:---------|-----:|
|John |first |1 |pretest | 10|
|John |first |1 |posttest | 15|
|John |first |2 |pretest | 20|
|John |first |2 |posttest | 30|
|John |second |1 |pretest | 35|
|John |second |1 |posttest | 50|
|John |second |2 |pretest | 5|
|John |second |2 |posttest | 10|
|Jane |first |1 |pretest | 40|
|Jane |first |1 |posttest | 20|
|Jane |first |2 |pretest | 10|
|Jane |first |2 |posttest | 20|
To:
|Name |eval |new_name | pre_test| post_test|
|:----|:------|:-------------|--------:|---------:|
|John |first |John_first_1 | 10| 15|
|John |first |John_first_2 | 20| 30|
|John |second |John_second_1 | 35| 50|
|John |second |John_second_2 | 5| 10|
|Jane |first |Jane_first_1 | 40| 20|
|Jane |first |Jane_first_2 | 10| 20|
I tried using group_by on Name, eval, and test so that each group would essentially be pre_test vs. post_test for a given person.
I also tried using unite on Name, eval, test, and type, but if I do a spread after that, each unique name ends up as its own column.
I also tried doing a unite on Name, eval, and test first, and then a spread using key = (the new united name) and value = score, but the output isn't what I wanted.
I know a loop could be written to take every other value and put it into a new column, but I'm trying to see if there's a tidyverse way to go about this.
Thanks!!
library(tidyverse)
Name <- c('John', 'John', 'John', 'John',
'John', 'John', 'John', 'John',
'Jane', 'Jane', 'Jane', 'Jane')
eval <- c('first', 'first', 'first', 'first',
'second', 'second', 'second', 'second',
'first', 'first', 'first', 'first')
test <- c('1', '1', '2', '2',
'1', '1', '2', '2',
'1', '1', '2', '2')
type <- c('pretest', 'posttest', 'pretest', 'posttest',
'pretest', 'posttest', 'pretest', 'posttest',
'pretest', 'posttest', 'pretest', 'posttest')
score <- c(10, 15, 20, 30, 35, 50, 5, 10, 40, 20, 10, 20)
df <- data.frame(Name, eval, test, type, score)
df %>%
unite(temp, Name, eval, test) %>%
spread(key=type, value=score)
Edit to show the original table that akrun's code worked on
From:
|Name |eval |test |type | score|
|:----|:------|:----|:---------|-----:|
|John |first |1 |pretest | 10|
|John |first |1 |posttest | 15|
|John |first |2 |pretest | 20|
|John |first |2 |postttest | 30|
|John |second |1 |pretest | 35|
|John |second |1 |posttest | 50|
|John |second |2 |pretest | 5|
|John |second |2 |postttest | 10|
|Jane |first |1 |pretest | 40|
|Jane |first |1 |posttest | 20|
|Jane |first |2 |pretest | 10|
|Jane |first |2 |postttest | 20|

We can collapse the repeated 't's in the 'type' column so the values are spelled consistently, then use unite with remove = FALSE to keep the initial columns as well, and finally spread
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(type = str_replace(type, "t{2,}", "t")) %>%
unite(new_name, Name, eval, test, remove = FALSE) %>%
spread(type, score)
# new_name Name eval test postest pretest
#1 Jane_first_1 Jane first 1 20 40
#2 Jane_first_2 Jane first 2 20 10
#3 John_first_1 John first 1 15 10
#4 John_first_2 John first 2 30 20
#5 John_second_1 John second 1 50 35
#6 John_second_2 John second 2 10 5
In tidyr 1.0.0, pivot_wider was introduced as a more general version of spread (which is now superseded and may be deprecated in the future). So, instead of the spread line at the end, use
...%>%
pivot_wider(names_from = type, values_from = score)
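For completeness, here is a minimal sketch of the whole pipeline with pivot_wider. It assumes the clean data from the question (no 'postttest' typos, so the str_replace step is skipped). The final select, which renames the columns to pre_test and post_test to match the requested layout, is my own addition and not part of the code above.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, built more compactly
df <- data.frame(
  Name  = c(rep("John", 8), rep("Jane", 4)),
  eval  = c(rep("first", 4), rep("second", 4), rep("first", 4)),
  test  = rep(c("1", "1", "2", "2"), 3),
  type  = rep(c("pretest", "posttest"), 6),
  score = c(10, 15, 20, 30, 35, 50, 5, 10, 40, 20, 10, 20),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # build the combined key but keep Name, eval and test as well
  unite(new_name, Name, eval, test, remove = FALSE) %>%
  # one column per value of `type`, filled with `score`
  pivot_wider(names_from = type, values_from = score) %>%
  # reorder and rename to match the requested layout (assumption)
  select(Name, eval, new_name, pre_test = pretest, post_test = posttest)

wide
```

Unlike spread, pivot_wider keeps the rows in order of first appearance, so John's rows come before Jane's here.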

How about something like this:
library(tidyverse)
data <- tibble(
Name = c(rep("John", 8), rep("Jane", 4)),
eval = c(rep("first", 4), rep("second", 4), rep("first", 4)),
type = rep(c("pretest", "posttest"), 6),
score = c(10, 15, 20, 30, 35, 50, 5, 10, 40, 20, 10, 20)
)
data %>%
group_by(Name, eval, type) %>%
mutate(num = 1:n(),
new_name = str_c(Name, "_", eval, "_", num)) %>%
ungroup() %>%
dplyr::select(new_name, type, score) %>%
spread(type, score)
Which yields:
# A tibble: 6 x 3
new_name posttest pretest
<chr> <dbl> <dbl>
1 Jane_first_1 20 40
2 Jane_first_2 20 10
3 John_first_1 15 10
4 John_first_2 30 20
5 John_second_1 50 35
6 John_second_2 10 5

Related

data.table, filter >= median per group and keep two lowest

Situation & Goal
I have a large table that looks like this (simplified):
|MainCat |SubCat | Value|
|:-------|:------|-----:|
|A |Y | 50|
|A |Z | 60|
|A |ZZZZ | 80|
|A |XX | 90|
|A |X | 100|
|B |XYXY | 15|
|B |XXX | 50|
|B |YY | 60|
|B |ZZZ | 150|
|B |ZZ | 400|
Now I want to filter each group (MainCat) and keep only the two lowest values (Value) that are greater than or equal to the group median:
|MainCat |SubCat | Value|Comment |
|:-------|:------|-----:|:---------------------|
|A |Y | 50|- |
|A |Z | 60|- |
|A |ZZZZ | 80|Median, First to keep |
|A |XX | 90|Second to keep |
|A |X | 100|- |
|B |XYXY | 15|- |
|B |XXX | 50|- |
|B |YY | 60|Median, First to keep |
|B |ZZZ | 150|Second to keep |
|B |ZZ | 400|- |
Expected result:
|MainCat |SubCat | Value|
|:-------|:------|-----:|
|A |ZZZZ | 80|
|A |XX | 90|
|B |YY | 60|
|B |ZZZ | 150|
My (failed) attempt
I tried df2[Value >= df2[MainCat==MainCat, median(Value, na.rm=TRUE)]] but this calculates a Median for all values, without grouping. Can somebody help? As performance is key, I prefer a data.table solution if possible. Thank you very much.
MWE
Base data:
df2 = structure(list(MainCat = c("A", "A", "A", "A", "A", "B", "B",
"B", "B", "B"), SubCat = c("Y", "Z", "ZZZZ", "XX", "X", "XYXY",
"XXX", "YY", "ZZZ", "ZZ"), Value = c(50, 60, 80, 90, 100, 15,
50, 60, 150, 400)), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Result:
data.table(MainCat=c("A","A","B","B"),
SubCat=c("ZZZZ", "XX", "YY", "ZZZ"),
Value=c(80,90,60,150))

Group by 'MainCat' and get the row index (.I) of the rows whose 'Value' is at least the group median, extract the index column ($V1), and subset the data with it. Then order by 'MainCat' and 'Value' and take the first two rows of each 'MainCat' group with head.
library(data.table)
df2[df2[, .I[Value >= median(Value, na.rm = TRUE)],.(MainCat)]$V1
][order(MainCat, Value), head(.SD, 2), MainCat]
-output
MainCat SubCat Value
<char> <char> <num>
1: A ZZZZ 80
2: A XX 90
3: B YY 60
4: B ZZZ 150
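An alternative sketch, not benchmarked against the code above: order first, then filter against the median and take the first two rows inside each group via .SD. It assumes df2 is rebuilt with data.table() as shown here; whether it is faster on a large table is something you would need to measure.

```r
library(data.table)

# Same data as in the question
df2 <- data.table(
  MainCat = rep(c("A", "B"), each = 5),
  SubCat  = c("Y", "Z", "ZZZZ", "XX", "X", "XYXY", "XXX", "YY", "ZZZ", "ZZ"),
  Value   = c(50, 60, 80, 90, 100, 15, 50, 60, 150, 400)
)

# Sort by group and value, then within each group keep the rows at or
# above the group median and return the first two of them
res <- df2[order(MainCat, Value),
           head(.SD[Value >= median(Value, na.rm = TRUE)], 2),
           by = MainCat]
res
```

This does the filtering and the head in a single grouped step, at the cost of subsetting .SD per group, which can be slow when there are many small groups.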

Function to eliminate rows from a dataframe with certain condition in R

Hi everyone!
I will try to explain my problem, which I find very difficult. I hope you can help me:
I have a data frame, let's call it DF1, that looks like this:
|Symbol |Date |Volume |Price |
|:------|:----------|------:|-----:|
|A |2014-01-01 | 0 | 4 |
|A |2014-01-02 | 7 | 7 |
|A |2014-01-03 | 8 | 9 |
|A |2014-01-04 | 1 | 5 |
|B |2014-01-01 |45 | 6 |
|B |2014-01-02 |0 | 11 |
|B |2014-01-03 |34 | 8 |
|B |2014-01-04 |45 | 5 |
|C |2014-01-01 |4 | 6 |
|C |2014-01-02 |0 | 5 |
|C |2014-01-03 |14 | 25 |
|D |2014-01-01 |31 | 4 |
|D |2014-01-02 |7 | 6 |
|D |2014-01-03 |18 | 3 |
|D |2014-01-04 |15 | 7 |
|E |2014-01-01 |13 | 8 |
|E |2014-01-02 |0 | 9 |
Having this dataframe I create a new dataframe, let's call it DF2, through the following lines of code:
RM <- DF1 %>% group_by(Date) %>%
mutate(weight = Volume/sum(Volume),
R_i = weight*(log(Price)-log(lag(Price)))) %>%
summarise(RM = sum(R_i, na.rm = TRUE))
And from RM, I select only the dates that are of interest to me:
RM_reg <- subset(RM, Date >= "2014-03-05" & Date <= "2014-09-03")
Finally, RM_reg looks like this:
|Date |RM |
|:----------|---:|
|2014-03-05 | 0 |
|2014-03-06 | 7 |
|2014-03-07 | 8 |
|2014-03-08 | 1 |
|2014-03-09 | 45 |
|2014-03-10 | 0 |
|2014-03-11 | 34 |
|2014-03-12 | 45 |
|2014-03-13 | 4 |
|2014-03-14 | 0 |
|2014-03-15 | 14 |
|2014-03-16 | 31 |
It should be noted that the values in the RM_reg column are not the actual values, but only examples. Starting from my original dataframe, RM_reg has 125 rows.
Then, from dataframe DF1, I extract the rows for which the Symbol column is equal to A through the following code:
DF_A <- DF1 %>%
filter(Symbol=="A")
And I add a column of returns to the dataframe DF_A, through the following code:
RA <- DF_A %>% group_by(Symbol)%>%
mutate(Ret_i = log(Price) - lag(log(Price)))
I eliminate the first row, which is NA:
AR <- na.omit(RA)
And from AR, I select only the dates that are of interest to me:
AR_reg <- subset(AR, Date >= "2014-03-05" & Date <= "2014-09-03")
AR_reg looks like this:
|Symbol |Date |Volume |Price |Ret_i |
|:------|:----------|------:|-----:|-----:|
|A |2014-03-05 | 1 | 5 | 2 |
|A |2014-03-06 | 3 | 8 | 3 |
|A |2014-03-07 | 7 | 4 | 4 |
|A |2014-03-08 |3 | 6 | 5 |
|A |2014-03-09 |34 | 7 | 1 |
|A |2014-03-10 |45 | 34 | 4 |
|A |2014-03-11 |4 | 5 | 3 |
|A |2014-03-12 |9 | 7 | 5 |
|A |2014-03-13 |8 | 6 | 6 |
|A |2014-03-14 |4 | 4 | 1 |
|A |2014-03-15 |0 | 7 | 4 |
|A |2014-03-16 |4 | 7 | 7 |
It should be noted that the values in the AR_reg column are not the actual values, but only examples. Starting from my original dataframe, AR_reg also has 125 rows.
Finally, because RM_reg and AR_reg have the same number of rows, I can regress the Ret_i column of AR_reg on the RM column of RM_reg through the following code:
mod <- lm(AR_reg$Ret_i ~ RM_reg$RM)
What I need is to do the same as described above for all the other Symbols in the dataframe DF1, in this case "B", "C", "D", and "E". The problem is that the Symbols do not all have the same number of rows, and having the same number is a necessary condition for the regression: I need 125 observations of returns for each Symbol.
My idea is to eliminate the Symbols for which the dataframe equivalent to AR_reg does not have 125 rows, but I do not know how to do this. I suppose a function has to be written, but that is a subject I have not yet mastered.
Thank you very much for reading. I hope my explanation is clear. Any help or suggestion will be much appreciated.

Join DF1 with RM by Date, keep only the data between the specified dates, then for each Symbol calculate Ret_i, drop the NA values, and create a list of models.
The complete code would look like:
library(dplyr)
DF1$Date <- as.Date(DF1$Date)
RM <- DF1 %>%
group_by(Date) %>%
mutate(weight = Volume/sum(Volume),
R_i = weight*(log(Price)-log(lag(Price)))) %>%
summarise(RM = sum(R_i, na.rm = TRUE))
result <- DF1 %>%
left_join(RM, by = 'Date') %>%
filter(between(Date, as.Date("2014-03-05"), as.Date("2014-09-03"))) %>%
group_by(Symbol) %>%
mutate(Ret_i = log(Price) - lag(log(Price))) %>%
na.omit() %>%
summarise(model = list(lm(Ret_i~RM)))
result
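The code above fits a model for every Symbol regardless of how many rows it has. If you also want to drop the Symbols that do not have the full number of observations, as the question asks, one option is a grouped filter on n(). This is a minimal sketch on made-up toy data; toy, n_required and complete_only are hypothetical names, and n_required would be 125 in your case.

```r
library(dplyr)

# Toy data (hypothetical): symbol "C" has fewer rows than the others
toy <- data.frame(
  Symbol = c("A", "A", "A", "B", "B", "B", "C", "C"),
  Ret_i  = c(0.1, 0.2, -0.1, 0.3, 0.0, 0.1, 0.2, 0.4),
  stringsAsFactors = FALSE
)

n_required <- 3  # 125 in the question

# Keep only the symbols whose group has exactly the required row count
complete_only <- toy %>%
  group_by(Symbol) %>%
  filter(n() == n_required) %>%
  ungroup()

unique(complete_only$Symbol)
```

In the pipeline above, the same filter(n() == 125) step could go right after na.omit() and before summarise(), so that only complete Symbols reach lm.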

r parser translating symbol_function_call as a symbol

If I parse do.call(knitr::kable,args=args), the function kable inside do.call is parsed as a SYMBOL and not as a SYMBOL_FUNCTION_CALL.
Why shouldn't it be the latter?
tf <- tempfile()
cat('do.call(knitr::kable,args=args)',file = tf)
parsed <- utils::getParseData(parse(tf))
knitr::kable(parsed)
| | line1| col1| line2| col2| id| parent|token |terminal |text |
|:--|-----:|----:|-----:|----:|--:|------:|:--------------------|:--------|:-------|
|18 | 1| 1| 1| 31| 18| 0|expr |FALSE | |
|1 | 1| 1| 1| 7| 1| 3|SYMBOL_FUNCTION_CALL |TRUE |do.call |
|3 | 1| 1| 1| 7| 3| 18|expr |FALSE | |
|2 | 1| 8| 1| 8| 2| 18|'(' |TRUE |( |
|7 | 1| 9| 1| 20| 7| 18|expr |FALSE | |
|4 | 1| 9| 1| 13| 4| 7|SYMBOL_PACKAGE |TRUE |knitr |
|5 | 1| 14| 1| 15| 5| 7|NS_GET |TRUE |:: |
|6 | 1| 16| 1| 20| 6| 7|SYMBOL |TRUE |kable |
|8 | 1| 21| 1| 21| 8| 18|',' |TRUE |, |
|11 | 1| 22| 1| 25| 11| 18|SYMBOL_SUB |TRUE |args |
|12 | 1| 26| 1| 26| 12| 18|EQ_SUB |TRUE |= |
|13 | 1| 27| 1| 30| 13| 15|SYMBOL |TRUE |args |
|15 | 1| 27| 1| 30| 15| 18|expr |FALSE | |
|14 | 1| 31| 1| 31| 14| 18|')' |TRUE |) |

If you just have kable, it's a symbol. That symbol could point to a function or a value. It's not clear what it is until you actually evaluate it.
However, if you have kable(), it's clear that you expect kable to be a function and that you are calling it.
The do.call hides from the parser the fact that you are trying to call a function, and that intention isn't realized until run-time.
Things can get funny if you do something like
sum <- 5
sum
# [1] 5
sum(1:3)
# [1] 6
Here sum is behaving both like a regular variable and like a function. We've actually created a shadow variable in our global environment that masks the sum function from base. But because the parser treats sum and sum() differently, and R looks specifically for a function when evaluating a call, we can still get at both meanings.
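To see the difference directly, here is a small sketch that parses two snippets from text and reports the token assigned to kable. The helper tokens_for is a made-up name for illustration. Nothing is evaluated, so knitr does not need to be installed.

```r
# Return the parser token(s) assigned to the text "kable" in a snippet
tokens_for <- function(code) {
  # keep.source = TRUE is needed so that parse data is recorded
  pd <- utils::getParseData(parse(text = code, keep.source = TRUE))
  pd[pd$text == "kable", "token"]
}

tokens_for("kable")     # bare name: SYMBOL
tokens_for("kable(x)")  # name followed by "(": SYMBOL_FUNCTION_CALL
```

The token depends only on whether the name is directly followed by an opening parenthesis, which is why passing kable as an argument to do.call yields SYMBOL.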

Recode Variable in R after matching with another Data Frame

I have 2 dataframes in R,
DF1
|attr1|attr2|attr3|
|----:|----:|----:|
|5 |4 |9 |
|4 |30 |2 |
|5 |18 |1 |
|3 |1 |7 |
|6 |30 |0 |
|8 |18 |12 |
Now, I'm trying to recode the values in the attr2 column of this dataframe so that a value is recoded as 1 if it is present in the Var1 column of DF2, and as 0 otherwise. The second dataframe is simply a count of the top 2 unique values within attr2
DF2
|Var1|Freq|
|---:|---:|
|30 |2 |
|18 |2 |
I want the result to be in the format of something as follows:
|attr1|attr2|attr3|
|----:|----:|----:|
|5 |0 |9 |
|4 |1 |2 |
|5 |1 |1 |
|3 |0 |7 |
|6 |1 |0 |
|8 |1 |12 |
Thanks for the help!

We can use
library(dplyr)
DF1 %>%
mutate(attr2 = as.integer(attr2 %in% DF2$Var1))
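Here is a self-contained version of the above with the data from the question, as a minimal sketch. DF2 is typed in by hand here; how it was actually built is not shown in the question.

```r
library(dplyr)

DF1 <- data.frame(
  attr1 = c(5, 4, 5, 3, 6, 8),
  attr2 = c(4, 30, 18, 1, 30, 18),
  attr3 = c(9, 2, 1, 7, 0, 12)
)

# Top-2 values of attr2 with their counts (entered manually, an assumption)
DF2 <- data.frame(Var1 = c(30, 18), Freq = c(2, 2))

# %in% gives TRUE/FALSE per row; as.integer turns that into 1/0
recoded <- DF1 %>%
  mutate(attr2 = as.integer(attr2 %in% DF2$Var1))

recoded
```

The other columns are left untouched, so the result has the same shape as DF1 with only attr2 replaced.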

R: two data frame merge with 2 variables and several other conditions

I am a beginner in R. Here is an example of a data table (C) that I created using JMP. I joined tables A and B on the A1 and B;C columns to create C. In table B, the rows whose OP column contains CLO are dropped during the join, while the J column from table A is updated during the join.
I am trying to create the data frame C using the merge command in R. I used the following expression:
C <- merge(B,A, BY=c("A1","B;C"),all.x = TRUE) but I don't get the data frame C that I expect. I would appreciate any help from the community to solve this.
Data Frame A
A1 | B;C | D |E |F |G | H | I |J |K |L | M |
------|------|---|--|---|---|---|------------|---|----|----|---|
ABCD |SD;TH |HO |2 |FA | |ENG| 201808:SPR |54 |PRO |VAC |MAA|
JCBW |RF;TH |HO |2 |FU |VIN|FUT| 504278:SPR |4 |PRO |VAC |MAA|
TVGH |ED;UJ |HO |2 |FU |VIN|FUT| 504276:SPR |4 |PRO |VAC |MAA|
IGHE |WR;RE |HO |3 |IN | |SPE| 504278:SPR |73 |PRO |VAC |MAA|
UUUU |DF;TH |HO |3 |FU | |FUT| 357193:IT |13 |INT |VAC |MAA|
JFLD |YO;TH |HO |3 |CH |BRI|CHE| 476306:SPR |6 |PRO |VAC |MAA|
Data frame B
OWN|COM|OP |GR |J | A1 | B;C | D|E |F |G |H | I |K |L |M
---|---|---|---|--|-----|-----|--|--|--|---|---|-----------|---|---|----
SUP|X |CLO|ARE|16|59HUW|BB;TH|HO|8 |FA|MIC|SPE|90278:SPR |INT|VAC|MAA
SUP|X |OPE|ARE|75|ABCD |SD;TH|HO|8 |FU|MIC|ENG|201808:SPR |INT|VAC|MAA
SUP|X |CLO|ARE|4 |59HVG|BB;RE|HO|8 |FA|MIC|SPE|6074278:SPR|INT|VAC|MAA
PAD|X |CLO|PEN|30|9RHSG|BV;TH|HO|2 |FA| |SPE|201808:SPR |PRO|VAC|MAA
PAD|X |OPE|PEN|99|UUUU |DF;TH|HO|8 |FU|MIC|FUT|357193:IT |PRO|VAC|MAA
PAD|X |OPE|PEN|65|IGHE |WR;RE|HO|8 |IN| |SPE|504278:SPR |PRO|VAC|MAA
PAD|X |CLO|PEN|13|S9K7E|FN;TH|HO|8 |FA|MIC|FUT|394290:SPR |PRO|VAC|MAA
Data frame C
OWN|COM|OP |GR |J |A1 | B;C |D |E |F | G |H | I | K |L |M
---|---|---|---|---|----|-----|--|--|--|---|---|----------|---|---|----
SUP|x |OPE|ARE|99 |ABCD|SD;TH|HO|8 |FU|MIC|ENG|201808:SPR|INT|VAC|MAA
PAD|x |OPE|PEN|120|UUUU|DF;TH|HO|8 |FU|MIC|FUT|357193:IT |PRO|VAC|MAA
PAD|x |OPE|PEN|73 |IGHE|WR;RE|HO|8 |IN| |SPE|504278:SPR|PRO|VAC|MAA
| | | |4 |JCBW|RF;TH|HO|2 |FU|VIN|FUT|504278:SPR|PRO|VAC|MAA
| | | |25 |TVGH|ED;UJ|HO|2 |FU|VIN|FUT|504276:SPR|PRO|VAC|MAA
| | | |15 |JFLD|YO;TH|HO|3 |CH|BRI|CHE|476306:SPR|PRO|VAC|MAA