Replacing column with another data frame based on name matching - r

Hi, I am a bit new, so I am not sure if I am doing this right, but I looked around on Stack Overflow and couldn't find any code or advice that worked for my case.
I have a data frame mainDF that looks like this:
Person  ABG  SEP  CLC  XSP  APP  WED  GSH
SP-1    2.1  3.0  1.3  1.8  1.4  2.5  1.4
SP-2    2.5  2.1  2.0  1.9  1.2  1.2  2.1
SP-3    2.3  3.1  2.5  1.5  1.1  2.6  2.1
I have another data frame, TranslateDF, that holds the conversion info for the abbreviated column names, and I want to use it to replace the abbreviated names with the full names.
Do note that the translation data frame may contain extraneous entries, or it may be missing the entry for a column; if a column in mainDF does not get a full name, it should be dropped from the data. Here is TranslateDF:
Abbreviated  Full Naming
ABG          All barbecue grill
SEP          shake eel peel
CLC          cold loin cake
XSP          xylophone spear pint
APP          apple pot pie
HUM          hall united meat
LPL          lending porkloin
Ideally, the resulting data frame would look like this:
Person  All barbecue grill  shake eel peel  cold loin cake  xylophone spear pint  apple pot pie
SP-1    2.1                 3.0             1.3             1.8                   1.4
SP-2    2.5                 2.1             2.0             1.9                   1.2
SP-3    2.3                 3.1             2.5             1.5                   1.1
I would appreciate any help on this, thank you!

You can pass a named vector to select(), which will rename and select in one step. Wrapping it in any_of() ensures the call won't fail if any of those columns don't exist in the main data frame:
library(dplyr)
df1 %>%
  select(Person, any_of(setNames(df2$Abbreviated, df2$Full_Naming)))
# A tibble: 3 x 6
Person `All barbecue grill` `shake eel peel` `cold loin cake` `xylophone spear pint` `apple pot pie`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SP-1 2.1 3 1.3 1.8 1.4
2 SP-2 2.5 2.1 2 1.9 1.2
3 SP-3 2.3 3.1 2.5 1.5 1.1
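As an aside (not part of the original answer), here is a minimal sketch of the named vector that setNames() builds for select(); the names become the new column names and the values are the existing abbreviated ones, which is why any_of() simply skips abbreviations (HUM, LPL) that df1 does not have. The lookup object is introduced here purely for illustration, using the df1/df2 defined under Data below:
library(dplyr)
# Hypothetical intermediate object, shown only to make the renaming visible:
lookup <- setNames(df2$Abbreviated, df2$Full_Naming)
# lookup is roughly:
#   c(`All barbecue grill` = "ABG", `shake eel peel` = "SEP", ...,
#     `hall united meat` = "HUM", `lending porkloin` = "LPL")
df1 %>%
  select(Person, any_of(lookup))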
Data:
df1 <- structure(list(Person = c("SP-1", "SP-2", "SP-3"), ABG = c(2.1,
2.5, 2.3), SEP = c(3, 2.1, 3.1), CLC = c(1.3, 2, 2.5), XSP = c(1.8,
1.9, 1.5), APP = c(1.4, 1.2, 1.1), WED = c(2.5, 1.2, 2.6), GSH = c(1.4,
2.1, 2.1)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), spec = structure(list(cols = list(
Person = structure(list(), class = c("collector_character",
"collector")), ABG = structure(list(), class = c("collector_double",
"collector")), SEP = structure(list(), class = c("collector_double",
"collector")), CLC = structure(list(), class = c("collector_double",
"collector")), XSP = structure(list(), class = c("collector_double",
"collector")), APP = structure(list(), class = c("collector_double",
"collector")), WED = structure(list(), class = c("collector_double",
"collector")), GSH = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df2 <- structure(list(Abbreviated = c("ABG", "SEP", "CLC", "XSP", "APP",
"HUM", "LPL"), Full_Naming = c("All barbecue grill", "shake eel peel",
"cold loin cake", "xylophone spear pint", "apple pot pie", "hall united meat",
"lending porkloin")), class = "data.frame", row.names = c(NA,
-7L))

How about this:
mainDF <- structure(list(Person = c("SP-1", "SP-2", "SP-3"), ABG = c(2.1,
2.5, 2.3), SEP = c(3, 2.1, 3.1), CLC = c(1.3, 2, 2.5), XSP = c(1.8,
1.9, 1.5), APP = c(1.4, 1.2, 1.1), WED = c(2.5, 1.2, 2.6), GSH = c(1.4,
2.1, 2.1)), row.names = c(NA, 3L), class = "data.frame")
translateDF <- structure(list(Abbreviated = c("ABG", "SEP", "CLC", "XSP", "APP",
"HUM", "LPL"), `Full Naming` = c("All barbecue grill", "shake eel peel",
"cold loin cake", "xylophone spear pint", "apple pot pie", "hall united meat",
"lending porkloin")), row.names = c(NA, 7L), class = "data.frame")
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
mainDF %>%
  pivot_longer(-Person,
               names_to = "Abbreviated",
               values_to = "vals") %>%
  left_join(translateDF) %>%
  select(-Abbreviated) %>%
  na.omit() %>%
  pivot_wider(names_from = `Full Naming`, values_from = "vals")
#> Joining, by = "Abbreviated"
#> # A tibble: 3 × 6
#> Person `All barbecue grill` `shake eel peel` `cold loin cake` `xylophone spe…`
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 SP-1 2.1 3 1.3 1.8
#> 2 SP-2 2.5 2.1 2 1.9
#> 3 SP-3 2.3 3.1 2.5 1.5
#> # … with 1 more variable: `apple pot pie` <dbl>
Created on 2022-04-24 by the reprex package (v2.0.1)
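A brief aside on why na.omit() drops WED and GSH here (a sketch of the intermediate step, not part of the original answer): after the left_join(), every abbreviation that is missing from translateDF has NA in `Full Naming`, so removing those rows before pivoting wider is what discards the untranslated columns:
library(dplyr)
library(tidyr)
mainDF %>%
  pivot_longer(-Person, names_to = "Abbreviated", values_to = "vals") %>%
  left_join(translateDF, by = "Abbreviated") %>%
  filter(is.na(`Full Naming`)) %>%
  distinct(Abbreviated)
# returns the two untranslated abbreviations: WED and GSH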

library(tidyverse)
mainDF %>%
  rename_with(~ str_replace_all(., set_names(TranslateDF[, 2], TranslateDF[, 1]))) %>%
  select(Person, which(!(names(.) %in% names(mainDF))))
Person All barbecue grill shake eel peel cold loin cake xylophone spear pint apple pot pie
1 SP-1 2.1 3.0 1.3 1.8 1.4
2 SP-2 2.5 2.1 2.0 1.9 1.2
3 SP-3 2.3 3.1 2.5 1.5 1.1
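For reference (an aside, not part of the original answer), the replacement vector that str_replace_all() receives here is oriented the opposite way to the select()/any_of() answer above: the abbreviations are the names (the patterns) and the full names are the values (the replacements). A tiny sketch, assuming TranslateDF is a plain data.frame so that TranslateDF[, 1] and TranslateDF[, 2] return character vectors:
library(stringr)
repl <- rlang::set_names(TranslateDF[, 2], TranslateDF[, 1])
# roughly c(ABG = "All barbecue grill", SEP = "shake eel peel", ...)
str_replace_all(c("ABG", "WED"), repl)
# "All barbecue grill" "WED"  -- names without a translation pass through unchanged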

Related

Perform calculation using rows with the same value [duplicate]

This question already has answers here:
Calculate difference between values in consecutive rows by group (4 answers)
Closed 1 year ago.
I have the following dataframe structure
Company Name stock price Date
HP 1.2 10/05/2020
HP 1.4 11/05/2020
APPL 1.1 05/03/2020
APPL 1.2 06/03/2020
FB 5 15/08/2020
FB 5.2 16/08/2020
FB 5.3 17/08/2020
...and so on for multiple companies and their stock prices for different dates.
I wish to calculate daily returns for each stock, and I am trying to figure out how to iterate over this data frame for each company. That is, when we are done with APPL we start over again for FB, setting the first row to N/A since there is no previous price to compare with, and so on, as shown below.
Company Name stock price Date Daily Returns
HP 1.2 10/05/2020 N/A
HP 1.4 11/05/2020 0.2
APPL 1.1 05/03/2020 N/A
APPL 1.2 06/03/2020 0.1
FB 5 15/08/2020 N/A
FB 5.2 16/08/2020 0.2
FB 5.3 17/08/2020 0.1
Is there a more efficient solution to tackle this than extracting a list of unique company names and then cycling through each of them to perform this calculation?
You should use dplyr for this kind of task:
library(dplyr)
df %>%
  arrange(Company_Name, Date) %>%
  group_by(Company_Name) %>%
  mutate(Daily_Returns = stock_price - lag(stock_price)) %>%
  ungroup()
This returns
Company_Name stock_price Date Daily_Returns
<chr> <dbl> <chr> <dbl>
1 HP 1.2 10/05/2020 NA
2 HP 1.4 11/05/2020 0.2
3 APPL 1.1 05/03/2020 NA
4 APPL 1.2 06/03/2020 0.100
5 FB 5 15/08/2020 NA
6 FB 5.2 16/08/2020 0.200
7 FB 5.3 17/08/2020 0.100
First we order the data by Company_Name and Date.
Then we group it by Company_Name, so every calculation starts over again for a new company.
Then we calculate the daily returns by subtracting the previous day's price (here we use lag; see the short sketch below).
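As a quick illustration of lag() (an aside, not part of the original answer): it shifts a vector down by one position and pads the front with NA, which is exactly why the first row of every group ends up as NA:
library(dplyr)
x <- c(1.2, 1.4)   # HP's prices in date order
lag(x)             # NA 1.2
x - lag(x)         # NA 0.2  -- the daily return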
Data
structure(list(Company_Name = c("HP", "HP", "APPL", "APPL", "FB",
"FB", "FB"), stock_price = c(1.2, 1.4, 1.1, 1.2, 5, 5.2, 5.3),
Date = c("10/05/2020", "11/05/2020", "05/03/2020", "06/03/2020",
"15/08/2020", "16/08/2020", "17/08/2020")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
cols = list(Company_Name = structure(list(), class = c("collector_character",
"collector")), stock_price = structure(list(), class = c("collector_double",
"collector")), Date = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))

R format the wide table to long table

cutoff KM KM_lo KM_hi rstm rstm_lo rstm_hi
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017-01-01 2.1 1.4 4.9 7.2 3.9 10.2
2 2017-04-01 3.5 2.1 4.7 8.9 6.6 10.8
3 2017-07-01 3.7 2.8 4.2 7.2 6.2 8.4
How do I convert this to a long table? I am struggling to get it into the format I want; I tried the gather and melt functions. The output table would look something like this:
cutoff VAR Val Val-hi Val-lo
<chr> <chr> <dbl> <dbl> <dbl>
1 2017-01-01 KM 2.1 4.9 1.4
2 2017-01-01 rstm 7.2 4.7 3.9
3 2017-07-01 KM 3.7 4.2 2.8
Sample data
structure(list(cutoff = c("2017-01-01", "2017-04-01", "2017-07-01"
), KM = c(2.1, 3.5, 3.7), KM_lo = c(1.4, 2.1, 2.8), KM_hi = c(4.9,
4.7, 4.2), rstm = c(7.2, 8.9, 7.2), rstm_lo = c(3.9, 6.6, 6.2
), rstm_hi = c(10.2, 10.8, 8.4)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
We may do
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
  rename_with(~ str_c(., "_none"), c("KM", "rstm")) %>%
  pivot_longer(cols = -cutoff, names_to = c("VAR", ".value"),
               names_sep = "_") %>%
  rename_with(~ c("Val", "Val-lo", "Val-hi"), 3:5)
Output:
# A tibble: 6 × 5
cutoff VAR Val `Val-lo` `Val-hi`
<chr> <chr> <dbl> <dbl> <dbl>
1 2017-01-01 KM 2.1 1.4 4.9
2 2017-01-01 rstm 7.2 3.9 10.2
3 2017-04-01 KM 3.5 2.1 4.7
4 2017-04-01 rstm 8.9 6.6 10.8
5 2017-07-01 KM 3.7 2.8 4.2
6 2017-07-01 rstm 7.2 6.2 8.4
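For context (an aside, not part of the original answer): the "_none" renaming makes sure every column name splits into two pieces at the underscore, and the special ".value" sentinel tells pivot_longer() to keep each second piece as its own output column. A tiny self-contained sketch with made-up column names:
library(tidyr)
toy <- data.frame(id = 1, a_lo = 10, a_hi = 20, b_lo = 30, b_hi = 40)
pivot_longer(toy, -id, names_to = c("VAR", ".value"), names_sep = "_")
# one row per VAR (a, b), with lo and hi kept as separate columns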
Here is another pivot_longer approach:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(
    -cutoff,
    names_to = c("VAR", ".value"),
    names_pattern = "(.+)_(.+)"
  ) %>%
  na.omit()
cutoff VAR lo hi
<chr> <chr> <dbl> <dbl>
1 2017-01-01 KM 1.4 4.9
2 2017-01-01 rstm 3.9 10.2
3 2017-04-01 KM 2.1 4.7
4 2017-04-01 rstm 6.6 10.8
5 2017-07-01 KM 2.8 4.2
6 2017-07-01 rstm 6.2 8.4
library(tidyverse)
df <-
structure(
list(
cutoff = c("2017-01-01", "2017-04-01", "2017-07-01"),
KM = c(2.1, 3.5, 3.7),
KM_lo = c(1.4, 2.1, 2.8),
KM_hi = c(4.9, 4.7, 4.2),
rstm = c(7.2, 8.9, 7.2),
rstm_lo = c(3.9, 6.6, 6.2),
rstm_hi = c(10.2, 10.8, 8.4)
),
row.names = c(NA,-3L),
class = c("tbl_df",
"tbl", "data.frame")
)
df %>%
  pivot_longer(cols = -cutoff) %>%
  separate(col = name, into = c("name", "suffix"), sep = "_", remove = TRUE) %>%
  mutate(id = data.table::rleid(name)) %>%
  pivot_wider(id_cols = c(id, cutoff, name), names_from = suffix,
              names_prefix = "VAL_", values_from = value) %>%
  select(-id) %>%
  rename(VAL = VAL_NA)
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 6 rows [1, 4, 7,
#> 10, 13, 16].
#> # A tibble: 6 x 5
#> cutoff name VAL VAL_lo VAL_hi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2017-01-01 KM 2.1 1.4 4.9
#> 2 2017-01-01 rstm 7.2 3.9 10.2
#> 3 2017-04-01 KM 3.5 2.1 4.7
#> 4 2017-04-01 rstm 8.9 6.6 10.8
#> 5 2017-07-01 KM 3.7 2.8 4.2
#> 6 2017-07-01 rstm 7.2 6.2 8.4
Created on 2021-09-28 by the reprex package (v2.0.1)

Match same strings over two different vectors

Say we have two different datasets:
Dataset A:
ids name price
1234 bread 1.5
245r7 butter 1.2
123984 red wine 5
43498 beer 1
235897 cream 1.8
Dataset B:
ids name price
24908 lait 1
1234,089 pain 1.7
77289,43498 bière 1.5
245r7 beurre 1.4
My goal is to match all the products sharing at least one ID and bring them together into a new dataset that should look as follows:
id a_name b_name a_price b_price
1234 bread pain 1.5 1.7
245r7 butter beurre 1.2 1.4
43498 beer bière 1 1.5
Is this feasible using stringr or any other R package?
You can create a long dataset with separate_rows and then do a join.
library(dplyr)
library(tidyr)
B %>%
  separate_rows(ids, sep = ',') %>%
  inner_join(A, by = 'ids')
# ids name.x price.x name.y price.y
# <chr> <chr> <dbl> <chr> <dbl>
#1 1234 pain 1.7 bread 1.5
#2 43498 bière 1.5 beer 1
#3 245r7 beurre 1.4 butter 1.2
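If you want the exact column layout the question asks for (id, a_name, b_name, a_price, b_price), one way is to set a suffix in the join and rename afterwards; this is a sketch, not part of the original answer:
library(dplyr)
library(tidyr)
B %>%
  separate_rows(ids, sep = ",") %>%
  inner_join(A, by = "ids", suffix = c("_b", "_a")) %>%
  select(id = ids, a_name = name_a, b_name = name_b,
         a_price = price_a, b_price = price_b)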
We can use the sqldf package here:
library(sqldf)
sql <- "SELECT a.ids AS id, a.name AS a_name, b.name AS b_name, a.price AS a_price,
b.price AS b_price
FROM df_a a
INNER JOIN df_b b
ON ',' || b.ids || ',' LIKE '%,' || a.ids || ',%'"
output <- sqldf(sql)
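A quick note on the join condition (an aside, not part of the original answer): wrapping both id lists in commas turns b.ids = '77289,43498' into ',77289,43498,', so the LIKE pattern '%,43498,%' can only match whole ids and never a substring such as '3449'. The same idea expressed in R:
grepl(",43498,", paste0(",", "77289,43498", ","), fixed = TRUE)  # TRUE
grepl(",3449,",  paste0(",", "77289,43498", ","), fixed = TRUE)  # FALSE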
As separate_rows (my favorite) has already been suggested by Ronak Shah, here is another strategy using strsplit() and unnest():
library(tidyr)
library(dplyr)
df_B %>%
  mutate(ids = strsplit(as.character(ids), ",")) %>%
  unnest() %>%
  inner_join(df_A, by = "ids")
ids name.x price.x name.y price.y
<chr> <chr> <dbl> <chr> <chr>
1 1234 pain 1.7 bread 1.5
2 43498 bière 1.5 beer 1
3 245r7 beurre 1.4 butter 1.2
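A small aside (not part of the original answer): in current tidyr versions, calling unnest() without naming the columns is deprecated, so spelling out the list-column is the more up-to-date form:
library(dplyr)
library(tidyr)
df_B %>%
  mutate(ids = strsplit(as.character(ids), ",")) %>%
  unnest(ids) %>%
  inner_join(df_A, by = "ids")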
data:
df_A <- structure(list(ids = c("1234", "245r7", "123984", "43498", "235897"
), name = c("bread", "butter", "red", "beer", "cream"), price = c("1.5",
"1.2", "wine", "1", "1.8")), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L), problems = structure(list(
row = 3L, col = NA_character_, expected = "3 columns", actual = "4 columns",
file = "'test'"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")))
df_B <- structure(list(ids = c("24908", "1234,089", "77289,43498", "245r7"
), name = c("lait", "pain", "bière", "beurre"), price = c(1,
1.7, 1.5, 1.4)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L))

Replacing certain values in a column with words

My goal is to replace a specific column's numeric values with words, based on the range each value falls in, for use in a future categorical test. I'm trying to change the data frame below.
Let's call this data frame DF:
SubjectID  ColumnA  ColumnB  ColumnC
Subject1   38       2.3      2.1
Subject2   12       2.1      2.0
Subject3   1        1.1      1.9
Subject4   34       3.2      1.5
Subject5   1        1.7      1.5
Subject6   56       3.9      1.7
I want to achieve a data frame such as the one here:
SubjectID  ColumnA  ColumnB  ColumnC
Subject1   Mid      2.3      2.1
Subject2   Low      2.1      2.0
Subject3   Low      1.1      1.9
Subject4   Mid      3.2      1.5
Subject5   Low      1.7      1.5
Subject6   High     3.9      1.7
So in this case, I only want to change ColumnA's values, based on the range each value lies in.
For this example:
"Low" represents a value lower than 30.
"Mid" represents a value between 30 and 50.
"High" represents a value higher than 50.
What would be the best way to do this?
We could use case_when
library(dplyr)
DF <- DF %>%
  mutate(ColumnA = case_when(ColumnA < 30 ~ "Low",
                             between(ColumnA, 30, 50) ~ "Mid",
                             TRUE ~ "High"))
DF
SubjectID ColumnA ColumnB ColumnC
1 Subject1 Mid 2.3 2.1
2 Subject2 Low 2.1 2.0
3 Subject3 Low 1.1 1.9
4 Subject4 Mid 3.2 1.5
5 Subject5 Low 1.7 1.5
6 Subject6 High 3.9 1.7
Another convenient option, which avoids writing multiple conditions, is cut() from base R:
cut(DF$ColumnA, breaks = c(-Inf, 30, 50, Inf), labels = c("Low", "Mid", "High"))
[1] Mid Low Low Mid Low High
Levels: Low Mid High
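If you want to store the result back in DF, mirroring the case_when() output above, a minimal sketch:
DF$ColumnA <- as.character(cut(DF$ColumnA, breaks = c(-Inf, 30, 50, Inf),
                               labels = c("Low", "Mid", "High")))
# as.character() keeps ColumnA as plain text instead of a factor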
data
DF <- structure(list(SubjectID = c("Subject1", "Subject2", "Subject3",
"Subject4", "Subject5", "Subject6"), ColumnA = c(38L, 12L, 1L,
34L, 1L, 56L), ColumnB = c(2.3, 2.1, 1.1, 3.2, 1.7, 3.9), ColumnC = c(2.1,
2, 1.9, 1.5, 1.5, 1.7)), class = "data.frame", row.names = c(NA,
-6L))
If you prefer a base R solution you can use nested ifelse:
DF$ColumnA <- ifelse(DF$ColumnA < 30, "Low",
ifelse(DF$ColumnA >= 30 & DF$ColumnA <= 50, "Mid", "High"))
Result:
DF
SubjectID ColumnA ColumnB ColumnC
1 Subject1 Mid 2.3 2.1
2 Subject2 Low 2.1 2.0
3 Subject3 Low 1.1 1.9
4 Subject4 Mid 3.2 1.5
5 Subject5 Low 1.7 1.5
6 Subject6 High 3.9 1.7

Getting p values for groupwise correlation using the dplyr package

I am trying to run correlations between some variables in a data frame. One column (Group) is character and the rest are numeric.
dataframe<-
Group V1 V2 V3 V4 V5
NG -4.5 3.5 2.4 -0.5 5.5
NG -5.4 5.5 5.5 1.0 2.0
GL 2.0 1.5 -3.5 2.0 -5.5
GL 3.5 6.5 -2.5 1.5 -2.5
GL 4.5 1.5 -6.5 1.0 -2.0
Following is my code:
library(dplyr)
dataframe %>%
  group_by(Group) %>%
  summarize(COR = cor(V3, V4))
Here is my output:
Group COR
<chr> <dbl>
1 GL 0.1848529
2 NG 0.1559912
How do I edit this code to get the p-values? Any help would be appreciated! I have looked elsewhere but nothing is working. Thanks!
You should try the corrplot package if you want to see the pairwise correlations:
library(corrplot)
df_cor <- cor(df[,sapply(df, is.numeric)])
corrplot(df_cor, method="color", type="upper", order="hclust")
In the resulting plot, positive correlations are displayed in blue and negative correlations in red, and the color intensity is proportional to the correlation coefficient.
#sample data
> dput(df)
structure(list(Group = structure(c(2L, 2L, 1L, 1L, 1L), .Label = c("GL",
"NG"), class = "factor"), V1 = c(-4.5, -5.4, 2, 3.5, 4.5), V2 = c(3.5,
5.5, 1.5, 6.5, 1.5), V3 = c(2.4, 5.5, -3.5, -2.5, -6.5), V4 = c(-0.5,
1, 2, 1.5, 1), V5 = c(5.5, 2, -5.5, -2.5, -2)), .Names = c("Group",
"V1", "V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA,
-5L))
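The corrplot answer visualises the correlations but does not give the p-values the question asks for. As a hedged sketch (not from the original answer), one way is to call cor.test() inside summarize(); note that cor.test() needs at least three complete pairs per group, so it is wrapped in tryCatch() here because the NG group in the sample data has only two rows:
library(dplyr)
dataframe %>%
  group_by(Group) %>%
  summarize(COR = cor(V3, V4),
            p_value = tryCatch(cor.test(V3, V4)$p.value,
                               error = function(e) NA_real_))
# GL (three rows) gets a p-value; NG (two rows) returns NA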
