Replacing certain values in a column with words - r

My goal is to replace the numeric values in a specific column with words, based on the range each value falls in, so that I can use the column in a future categorical test. I'm trying to change the data frame below.
Let's call this data frame DF:
SubjectID  ColumnA  ColumnB  ColumnC
Subject1   38       2.3      2.1
Subject2   12       2.1      2.0
Subject3   1        1.1      1.9
Subject4   34       3.2      1.5
Subject5   1        1.7      1.5
Subject6   56       3.9      1.7
To achieve a dataframe such as the one here:
SubjectID  ColumnA  ColumnB  ColumnC
Subject1   Mid      2.3      2.1
Subject2   Low      2.1      2.0
Subject3   Low      1.1      1.9
Subject4   Mid      3.2      1.5
Subject5   Low      1.7      1.5
Subject6   High     3.9      1.7
So in this case, I only want to change the values in ColumnA, based on the range each value lies in.
For this example:
Low represents a value lower than 30.
Mid represents a value between 30 and 50.
High represents a value higher than 50.
What would be the best way to do this?

We could use case_when
library(dplyr)
DF <- DF %>%
  mutate(ColumnA = case_when(ColumnA < 30 ~ "Low",
                             between(ColumnA, 30, 50) ~ "Mid",
                             TRUE ~ "High"))
DF
SubjectID ColumnA ColumnB ColumnC
1 Subject1 Mid 2.3 2.1
2 Subject2 Low 2.1 2.0
3 Subject3 Low 1.1 1.9
4 Subject4 Mid 3.2 1.5
5 Subject5 Low 1.7 1.5
6 Subject6 High 3.9 1.7
Another convenient option, which avoids writing multiple expressions, is cut from base R:
cut(DF$ColumnA, breaks = c(-Inf, 30, 50, Inf), labels = c("Low", "Mid", "High"))
[1] Mid Low Low Mid Low High
Levels: Low Mid High
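One thing worth checking with cut is how values exactly on a boundary are treated. By default the intervals are closed on the right, so a value of 30 would be labelled "Low", while the rule in the question (and the case_when call) puts 30 into "Mid". The sample data has no values on a boundary, so the results agree here. The sketch below uses a made-up vector x (not part of the question's data) to show the difference; the last variant assumes the column holds whole numbers only.

```r
# Hypothetical boundary values, not part of the question's data
x <- c(29, 30, 50, 51)
labs <- c("Low", "Mid", "High")

# Default right = TRUE: intervals are (-Inf, 30], (30, 50], (50, Inf]
default_cut <- as.character(
  cut(x, breaks = c(-Inf, 30, 50, Inf), labels = labs)
)

# right = FALSE: intervals are [-Inf, 30), [30, 50), [50, Inf)
left_closed <- as.character(
  cut(x, breaks = c(-Inf, 30, 50, Inf), labels = labs, right = FALSE)
)

# For whole numbers only (assumption): moving the first break to 29
# reproduces "Low below 30, Mid from 30 to 50, High above 50"
integer_rule <- as.character(
  cut(x, breaks = c(-Inf, 29, 50, Inf), labels = labs)
)

print(data.frame(x, default_cut, left_closed, integer_rule))
```

If the column can contain non-integer values, the case_when or ifelse versions state the boundaries explicitly and may be the safer choice.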
data
DF <- structure(list(SubjectID = c("Subject1", "Subject2", "Subject3",
"Subject4", "Subject5", "Subject6"), ColumnA = c(38L, 12L, 1L,
34L, 1L, 56L), ColumnB = c(2.3, 2.1, 1.1, 3.2, 1.7, 3.9), ColumnC = c(2.1,
2, 1.9, 1.5, 1.5, 1.7)), class = "data.frame", row.names = c(NA,
-6L))

If you prefer a base R solution you can use nested ifelse:
DF$ColumnA <- ifelse(DF$ColumnA < 30, "Low",
                     ifelse(DF$ColumnA >= 30 & DF$ColumnA <= 50, "Mid", "High"))
Result:
DF
SubjectID ColumnA ColumnB ColumnC
1 Subject1 Mid 2.3 2.1
2 Subject2 Low 2.1 2.0
3 Subject3 Low 1.1 1.9
4 Subject4 Mid 3.2 1.5
5 Subject5 Low 1.7 1.5
6 Subject6 High 3.9 1.7
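Note that ifelse and case_when return a character column, while cut returns a factor. If the later categorical test needs the categories in a fixed order, the character column can be converted explicitly. A minimal sketch, using the labels from the result above (the variable names are only for illustration):

```r
# Labels as produced by the case_when / ifelse answers
column_a <- c("Mid", "Low", "Low", "Mid", "Low", "High")

# Set the level order explicitly; otherwise levels are sorted alphabetically
column_a_f <- factor(column_a, levels = c("Low", "Mid", "High"))

print(levels(column_a_f))
print(table(column_a_f))
```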

Related

Replacing column with another data frame based on name matching

Hi, I am a bit new, so I am not sure if I am doing this right, but I looked around on Stack Overflow and couldn't find code or advice that worked for my case.
I have a dataframe mainDF that looks like this:
Person  ABG  SEP  CLC  XSP  APP  WED  GSH
SP-1    2.1  3.0  1.3  1.8  1.4  2.5  1.4
SP-2    2.5  2.1  2.0  1.9  1.2  1.2  2.1
SP-3    2.3  3.1  2.5  1.5  1.1  2.6  2.1
I have another data frame, TranslateDF, that maps the abbreviated column names to their full names, and I want to replace the abbreviated names in mainDF with the full names.
Note that the translating data frame may contain extra entries, or it may be missing entries for some columns. If a column in mainDF has no full name, it should be dropped from the data.
Abbreviated  Full Naming
ABG          All barbecue grill
SEP          shake eel peel
CLC          cold loin cake
XSP          xylophone spear pint
APP          apple pot pie
HUM          hall united meat
LPL          lending porkloin
Ideally, the resulting data frame would be:
Person  All barbecue grill  shake eel peel  cold loin cake  xylophone spear pint  apple pot pie
SP-1    2.1                 3.0             1.3             1.8                   1.4
SP-2    2.5                 2.1             2.0             1.9                   1.2
SP-3    2.3                 3.1             2.5             1.5                   1.1
I would appreciate any help on this, thank you!
You can pass a named vector to select() which will rename and select in one step. Wrapping with any_of() ensures it won't fail if any columns don't exist in the main data frame:
library(dplyr)
df1 %>%
  select(Person, any_of(setNames(df2$Abbreviated, df2$Full_Naming)))
# A tibble: 3 x 6
Person `All barbecue grill` `shake eel peel` `cold loin cake` `xylophone spear pint` `apple pot pie`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SP-1 2.1 3 1.3 1.8 1.4
2 SP-2 2.5 2.1 2 1.9 1.2
3 SP-3 2.3 3.1 2.5 1.5 1.1
Data:
df1 <- structure(list(Person = c("SP-1", "SP-2", "SP-3"), ABG = c(2.1,
2.5, 2.3), SEP = c(3, 2.1, 3.1), CLC = c(1.3, 2, 2.5), XSP = c(1.8,
1.9, 1.5), APP = c(1.4, 1.2, 1.1), WED = c(2.5, 1.2, 2.6), GSH = c(1.4,
2.1, 2.1)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), spec = structure(list(cols = list(
Person = structure(list(), class = c("collector_character",
"collector")), ABG = structure(list(), class = c("collector_double",
"collector")), SEP = structure(list(), class = c("collector_double",
"collector")), CLC = structure(list(), class = c("collector_double",
"collector")), XSP = structure(list(), class = c("collector_double",
"collector")), APP = structure(list(), class = c("collector_double",
"collector")), WED = structure(list(), class = c("collector_double",
"collector")), GSH = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df2 <- structure(list(Abbreviated = c("ABG", "SEP", "CLC", "XSP", "APP",
"HUM", "LPL"), Full_Naming = c("All barbecue grill", "shake eel peel",
"cold loin cake", "xylophone spear pint", "apple pot pie", "hall united meat",
"lending porkloin")), class = "data.frame", row.names = c(NA,
-7L))
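For comparison, the same rename-and-drop step can be sketched in base R with intersect() and match(). This is only an illustration: the two small data frames below are cut-down stand-ins for the question's data, and the object names keep and out are made up for the example.

```r
# Cut-down stand-ins for the question's data (assumption: fewer columns)
mainDF <- data.frame(Person = c("SP-1", "SP-2", "SP-3"),
                     ABG = c(2.1, 2.5, 2.3),
                     SEP = c(3.0, 2.1, 3.1),
                     WED = c(2.5, 1.2, 2.6),
                     stringsAsFactors = FALSE)
translateDF <- data.frame(Abbreviated = c("ABG", "SEP", "HUM"),
                          Full_Naming = c("All barbecue grill",
                                          "shake eel peel",
                                          "hall united meat"),
                          stringsAsFactors = FALSE)

# Columns of mainDF that have a translation (WED has none and is dropped;
# HUM is in the lookup but not in mainDF, so it is ignored)
keep <- intersect(names(mainDF), translateDF$Abbreviated)
out <- mainDF[c("Person", keep)]

# Look up the full name for each kept abbreviation
names(out)[-1] <- translateDF$Full_Naming[match(keep, translateDF$Abbreviated)]
print(out)
```

Because keep is built from the column names that actually exist in both places, extra or missing rows in the lookup table do not cause an error.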
How about this:
mainDF <- structure(list(Person = c("SP-1", "SP-2", "SP-3"), ABG = c(2.1,
2.5, 2.3), SEP = c(3, 2.1, 3.1), CLC = c(1.3, 2, 2.5), XSP = c(1.8,
1.9, 1.5), APP = c(1.4, 1.2, 1.1), WED = c(2.5, 1.2, 2.6), GSH = c(1.4,
2.1, 2.1)), row.names = c(NA, 3L), class = "data.frame")
translateDF <- structure(list(Abbreviated = c("ABG", "SEP", "CLC", "XSP", "APP",
"HUM", "LPL"), `Full Naming` = c("All barbecue grill", "shake eel peel",
"cold loin cake", "xylophone spear pint", "apple pot pie", "hall united meat",
"lending porkloin")), row.names = c(NA, 7L), class = "data.frame")
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
mainDF %>%
  pivot_longer(-Person,
               names_to = "Abbreviated",
               values_to = "vals") %>%
  left_join(translateDF) %>%
  select(-Abbreviated) %>%
  na.omit() %>%
  pivot_wider(names_from = `Full Naming`, values_from = "vals")
#> Joining, by = "Abbreviated"
#> # A tibble: 3 × 6
#> Person `All barbecue grill` `shake eel peel` `cold loin cake` `xylophone spe…`
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 SP-1 2.1 3 1.3 1.8
#> 2 SP-2 2.5 2.1 2 1.9
#> 3 SP-3 2.3 3.1 2.5 1.5
#> # … with 1 more variable: `apple pot pie` <dbl>
Created on 2022-04-24 by the reprex package (v2.0.1)
library(tidyverse)
mainDF %>%
  rename_with(~str_replace_all(., set_names(TranslateDF[, 2], TranslateDF[, 1]))) %>%
  select(Person, which(!(names(.) %in% names(mainDF))))
Person All barbecue grill shake eel peel cold loin cake xylophone spear pint apple pot pie
1 SP-1 2.1 3.0 1.3 1.8 1.4
2 SP-2 2.5 2.1 2.0 1.9 1.2
3 SP-3 2.3 3.1 2.5 1.5 1.1

R format the wide table to long table

cutoff KM KM_lo KM_hi rstm rstm_lo rstm_hi
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017-01-01 2.1 1.4 4.9 7.2 3.9 10.2
2 2017-04-01 3.5 2.1 4.7 8.9 6.6 10.8
3 2017-07-01 3.7 2.8 4.2 7.2 6.2 8.4
How do I convert this to a long table? I am struggling to get it into the format I want. I tried the gather and melt functions. The output table would look something like this:
cutoff VAR Val Val-hi Val-lo
<chr> <chr> <dbl> <dbl> <dbl>
1 2017-01-01 KM 2.1 4.9 1.4
2 2017-01-01 rstm 7.2 4.7 3.9
3 2017-07-01 KM 3.7 4.2 2.8
Sample data
structure(list(cutoff = c("2017-01-01", "2017-04-01", "2017-07-01"
), KM = c(2.1, 3.5, 3.7), KM_lo = c(1.4, 2.1, 2.8), KM_hi = c(4.9,
4.7, 4.2), rstm = c(7.2, 8.9, 7.2), rstm_lo = c(3.9, 6.6, 6.2
), rstm_hi = c(10.2, 10.8, 8.4)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
We may do
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
  rename_with(~ str_c(., "_none"), c("KM", "rstm")) %>%
  pivot_longer(cols = -cutoff, names_to = c("VAR", ".value"),
               names_sep = "_") %>%
  rename_with(~ c("Val", "Val-lo", "Val-hi"), 3:5)
Output:
# A tibble: 6 × 5
cutoff VAR Val `Val-lo` `Val-hi`
<chr> <chr> <dbl> <dbl> <dbl>
1 2017-01-01 KM 2.1 1.4 4.9
2 2017-01-01 rstm 7.2 3.9 10.2
3 2017-04-01 KM 3.5 2.1 4.7
4 2017-04-01 rstm 8.9 6.6 10.8
5 2017-07-01 KM 3.7 2.8 4.2
6 2017-07-01 rstm 7.2 6.2 8.4
Here is another pivot_longer approach. Note that the output below keeps only the lo and hi columns:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(
    -cutoff,
    names_to = c("VAR", ".value"),
    names_pattern = "(.+)_(.+)"
  ) %>%
  na.omit()
cutoff VAR lo hi
<chr> <chr> <dbl> <dbl>
1 2017-01-01 KM 1.4 4.9
2 2017-01-01 rstm 3.9 10.2
3 2017-04-01 KM 2.1 4.7
4 2017-04-01 rstm 6.6 10.8
5 2017-07-01 KM 2.8 4.2
6 2017-07-01 rstm 6.2 8.4
library(tidyverse)
df <-
structure(
list(
cutoff = c("2017-01-01", "2017-04-01", "2017-07-01"),
KM = c(2.1, 3.5, 3.7),
KM_lo = c(1.4, 2.1, 2.8),
KM_hi = c(4.9, 4.7, 4.2),
rstm = c(7.2, 8.9, 7.2),
rstm_lo = c(3.9, 6.6, 6.2),
rstm_hi = c(10.2, 10.8, 8.4)
),
row.names = c(NA,-3L),
class = c("tbl_df",
"tbl", "data.frame")
)
df %>%
  pivot_longer(cols = -cutoff) %>%
  separate(col = name, into = c("name", "suffix"), sep = "_", remove = TRUE) %>%
  mutate(id = data.table::rleid(name)) %>%
  pivot_wider(id_cols = c(id, cutoff, name), names_from = suffix,
              names_prefix = "VAL_", values_from = value) %>%
  select(-id) %>%
  rename(VAL = VAL_NA)
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 6 rows [1, 4, 7,
#> 10, 13, 16].
#> # A tibble: 6 x 5
#> cutoff name VAL VAL_lo VAL_hi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2017-01-01 KM 2.1 1.4 4.9
#> 2 2017-01-01 rstm 7.2 3.9 10.2
#> 3 2017-04-01 KM 3.5 2.1 4.7
#> 4 2017-04-01 rstm 8.9 6.6 10.8
#> 5 2017-07-01 KM 3.7 2.8 4.2
#> 6 2017-07-01 rstm 7.2 6.2 8.4
Created on 2021-09-28 by the reprex package (v2.0.1)
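If you would rather avoid extra packages, the same reshaping can be sketched in base R by building one block of rows per variable and stacking the blocks. This assumes the suffixes are always _lo and _hi and that the base names (KM, rstm) are known in advance; the output column names Val_lo and Val_hi are chosen here only for illustration.

```r
df <- data.frame(
  cutoff  = c("2017-01-01", "2017-04-01", "2017-07-01"),
  KM      = c(2.1, 3.5, 3.7),
  KM_lo   = c(1.4, 2.1, 2.8),
  KM_hi   = c(4.9, 4.7, 4.2),
  rstm    = c(7.2, 8.9, 7.2),
  rstm_lo = c(3.9, 6.6, 6.2),
  rstm_hi = c(10.2, 10.8, 8.4),
  stringsAsFactors = FALSE
)

vars <- c("KM", "rstm")  # base names; the _lo/_hi suffixes are assumed fixed

# Build one block of rows per variable, then stack the blocks
long <- do.call(rbind, lapply(vars, function(v) {
  data.frame(cutoff = df$cutoff,
             VAR    = v,
             Val    = df[[v]],
             Val_lo = df[[paste0(v, "_lo")]],
             Val_hi = df[[paste0(v, "_hi")]],
             stringsAsFactors = FALSE)
}))

# Sort so that rows for the same cutoff sit together
long <- long[order(long$cutoff, long$VAR), ]
rownames(long) <- NULL
print(long)
```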

R Make scatter plots (ggplot) from columns based on attributes from rows

I have the following type of table:
df0 <- read.table(text = 'Sample Method Mg Al Ca Ti
Sa A 5.5 2.2 33 0.2
Sb A 4.2 1.2 44 0.1
Sc A 1.1 0.5 25 0.3
Sd A 3.3 1.3 31 0.5
Se A 6.2 0.2 55 0.6
Sa B 5.2 2 35 0.25
Sb B 4.6 1.3 48 0.1
Sc B 1.6 0.8 22 0.32
Sd B 3.1 1.6 29 0.4
Se B 6.8 0.3 51 0.7
Sa C 5.6 2.5 30 0.2
Sb C 4.1 1.2 41 0.15
Sc C 1 0.6 22 0.4
Sd C 3.2 1.5 30 0.5
Se C 6.8 0.1 51 0.65', header = T, stringsAsFactors = F)
The table contains chemical compositions. I would like to use Method A as the reference (x-axis) and make automated scatter plots with the data from Methods B and C on the y-axis, each with a linear trend line, plus a 1:1 reference line that would correspond to a perfect match.
In other words, I would like to produce plots like this:
I think a solution could start from transforming the data frame into:
df <- read.table(text = 'Sample Mg_A Al_A Ca_A Ti_A Mg_B Al_B Ca_B Ti_B Mg_C Al_C Ca_C Ti_C
Sa 5.5 2.2 33 0.2 5.2 2 35 0.25 5.6 2.5 30 0.2
Sb 4.2 1.2 44 0.1 4.6 1.3 48 0.1 4.1 1.2 41 0.15
Sc 1.1 0.5 25 0.3 1.6 0.8 22 0.32 1 0.6 22 0.4
Sd 3.3 1.3 31 0.5 3.1 1.6 29 0.4 3.2 1.5 30 0.5
Se 6.2 0.2 55 0.6 6.8 0.3 51 0.7 6.8 0.1 51 0.65
', header = T, stringsAsFactors = F)
But I don't know how to go further.
Any help would be appreciated.
Best, Anne-Christine
You can use the following code:
library(tidyverse)
df0 %>%
  pivot_wider(names_from = Method, values_from = c(Mg, Al, Ca, Ti)) %>%
  pivot_longer(cols = -Sample) %>% # wide to long data format
  separate(name, c("key", "number"), sep = "_") %>%
  group_by(number) %>% # group the values according to number
  mutate(row = row_number()) %>% # for creating unique IDs
  pivot_wider(names_from = number, values_from = value) %>%
  ggplot() +
  geom_point(aes(x = A, y = B, color = "A vs B")) +
  geom_point(aes(x = A, y = C, color = "A vs C")) +
  geom_abline(slope = 1, intercept = 0) +
  geom_smooth(aes(x = A, y = B, color = "A vs B"), method = lm, se = FALSE, fullrange = TRUE) +
  geom_smooth(aes(x = A, y = C, color = "A vs C"), method = lm, se = FALSE, fullrange = TRUE) +
  facet_wrap(key ~ ., scales = "free") +
  theme_bw() +
  ylab("B or C") +
  xlab("A")
Data
df0 = structure(list(Sample = c("Sa", "Sb", "Sc", "Sd", "Se", "Sa",
"Sb", "Sc", "Sd", "Se", "Sa", "Sb", "Sc", "Sd", "Se"), Method = c("A",
"A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C",
"C"), Mg = c(5.5, 4.2, 1.1, 3.3, 6.2, 5.2, 4.6, 1.6, 3.1, 6.8,
5.6, 4.1, 1, 3.2, 6.8), Al = c(2.2, 1.2, 0.5, 1.3, 0.2, 2, 1.3,
0.8, 1.6, 0.3, 2.5, 1.2, 0.6, 1.5, 0.1), Ca = c(33L, 44L, 25L,
31L, 55L, 35L, 48L, 22L, 29L, 51L, 30L, 41L, 22L, 30L, 51L),
Ti = c(0.2, 0.1, 0.3, 0.5, 0.6, 0.25, 0.1, 0.32, 0.4, 0.7,
0.2, 0.15, 0.4, 0.5, 0.65)), class = "data.frame", row.names = c(NA,
-15L))
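If you also want the numbers behind the trend lines, one possible base R sketch is to put the Method A values next to the B and C values with merge() and fit a line per element and method. The names ref, oth, cmp and slopes are made up for this example, and a plain lm() fit is assumed to be what "linear trend" means here.

```r
df0 <- read.table(text = "Sample Method Mg Al Ca Ti
Sa A 5.5 2.2 33 0.2
Sb A 4.2 1.2 44 0.1
Sc A 1.1 0.5 25 0.3
Sd A 3.3 1.3 31 0.5
Se A 6.2 0.2 55 0.6
Sa B 5.2 2 35 0.25
Sb B 4.6 1.3 48 0.1
Sc B 1.6 0.8 22 0.32
Sd B 3.1 1.6 29 0.4
Se B 6.8 0.3 51 0.7
Sa C 5.6 2.5 30 0.2
Sb C 4.1 1.2 41 0.15
Sc C 1 0.6 22 0.4
Sd C 3.2 1.5 30 0.5
Se C 6.8 0.1 51 0.65", header = TRUE, stringsAsFactors = FALSE)

elements <- c("Mg", "Al", "Ca", "Ti")

# Reference values (Method A), renamed to Mg_A, Al_A, ...
ref <- df0[df0$Method == "A", c("Sample", elements)]
names(ref)[-1] <- paste0(elements, "_A")

# Methods to compare against the reference
oth <- df0[df0$Method != "A", ]

# One row per Sample and Method (B or C), with the A values alongside
cmp <- merge(oth, ref, by = "Sample")

# Slope of the linear fit of each method against A, per element
slopes <- sapply(elements, function(el) {
  sapply(split(cmp, cmp$Method), function(d) {
    unname(coef(lm(d[[el]] ~ d[[paste0(el, "_A")]]))[2])
  })
})
print(slopes)  # rows: methods B and C, columns: elements
```

A slope close to 1 means the method tracks Method A closely for that element, which is what the 1:1 line in the plot shows visually.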

Creating a new dataframe based on two conditions from two other dataframes

I'm fairly new to coding and have been asked to create a new dataframe based on two existing dataframes. Dataframe 1 is the original and dataframe 2 is a subset of the original. The new dataframe needs to be a copy of the original with certain scores removed if they meet certain conditions from df2, i.e. identify the task type in df2 and remove the corresponding value from df1 if the sample ID matches.
E.g. Dataframe1:
sample_id Low Mid High
13420 NA 2.4 3.7
43905 7.5 NA NA
52078 5.6 3.2 5.6
43292 10 NA 1.9
79327 5.7 3.2 NA
Dataframe2:
Sample Task type
13420 High
52078 Low
52078 Mid
43292 High
79327 Low
New dataframe:
13420 NA 2.4 NA
43905 7.5 NA NA
52078 NA NA 5.6
43292 10 NA NA
79327 NA 3.2 NA
Can anyone help, please? I've tried a few conditional statements, but have had no luck.
sample data
df1 <- data.table::fread("sample_id Low Mid High
13420 NA 2.4 3.7
43905 7.5 NA NA
52078 5.6 3.2 5.6
43292 10 NA 1.9
79327 5.7 3.2 NA")
df2 <- data.table::fread("Sample Tasktype
13420 High
52078 Low
52078 Mid
43292 High
79327 Low")
code
library( data.table )
#or make data.frames a data.table
data.table::setDT(df1);data.table::setDT(df2)
#melt df1 to long format
df1.melt <- melt( df1, id.vars = "sample_id" )
#update join the molten dataset with df2, updating the value with NA
df1.melt[ df2, value := NA, on = .(sample_id = Sample, variable = Tasktype) ]
#and cast df1 with the new values back to wide format
dcast( df1.melt, sample_id ~ variable, value.var = "value" )
output
# sample_id Low Mid High
# 1: 13420 NA 2.4 NA
# 2: 43292 10.0 NA NA
# 3: 43905 7.5 NA NA
# 4: 52078 NA NA 5.6
# 5: 79327 NA 3.2 NA
Here is an approach with dplyr and tidyr:
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(-sample_id) %>%
  left_join(df2, by = c("sample_id" = "Sample",
                        "name" = "Task.type"),
            keep = TRUE) %>%
  mutate(value = ifelse(!is.na(Task.type) & Task.type == name,
                        NA_real_, value)) %>%
  dplyr::select(-c(Sample, Task.type)) %>%
  pivot_wider(id_cols = c("sample_id"))
## A tibble: 5 x 4
# sample_id Low Mid High
# <int> <dbl> <dbl> <dbl>
#1 13420 NA 2.4 NA
#2 43905 7.5 NA NA
#3 52078 NA NA 5.6
#4 43292 10 NA NA
#5 79327 NA 3.2 NA
Sample Data
df1<-structure(list(sample_id = c(13420L, 43905L, 52078L, 43292L,
79327L), Low = c(NA, 7.5, 5.6, 10, 5.7), Mid = c(2.4, NA, 3.2,
NA, 3.2), High = c(3.7, NA, 5.6, 1.9, NA)), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(Sample = c(13420L, 52078L, 52078L, 43292L, 79327L
), Task.type = structure(c(1L, 2L, 3L, 1L, 2L), .Label = c("High",
"Low", "Mid"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
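A base R alternative, sketched here under the assumption that every task type in df2 is also a column name in df1, is to turn each row of df2 into a (row, column) position and blank those cells with matrix indexing. The names scores, idx and new_df are made up for the example.

```r
df1 <- data.frame(sample_id = c(13420L, 43905L, 52078L, 43292L, 79327L),
                  Low  = c(NA, 7.5, 5.6, 10, 5.7),
                  Mid  = c(2.4, NA, 3.2, NA, 3.2),
                  High = c(3.7, NA, 5.6, 1.9, NA))
df2 <- data.frame(Sample = c(13420L, 52078L, 52078L, 43292L, 79327L),
                  Task.type = c("High", "Low", "Mid", "High", "Low"),
                  stringsAsFactors = FALSE)

# Score columns as a numeric matrix
scores <- as.matrix(df1[, -1])

# One (row, column) pair per entry of df2
idx <- cbind(match(df2$Sample, df1$sample_id),
             match(as.character(df2$Task.type), colnames(scores)))

# Ignore df2 entries that have no matching sample or column
idx <- idx[complete.cases(idx), , drop = FALSE]

# Blank the matched cells and rebuild the data frame
scores[idx] <- NA
new_df <- data.frame(sample_id = df1$sample_id, scores)
print(new_df)
```

This keeps the original row order of df1, which the melt/dcast route does not.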

Arrange one variable according to other variable

I have two data frames: Transcolmax (data frame 1) and Transcolmean (data frame 2). I want to arrange Transcolmean according to Transcolmax. The tables are shown below. The third table is not the desired output; I included it only for better understanding. The fourth table is the desired output. I want to create the new table from the same 3x3 matrices (dput).
Transcolmax(dataframe 1)
MSFT 10 7 11
AAPL 12 6 5
GOOGL 9.5 11 8
Transcolmean (dataframe 2)
MSFT 2 1.5 3
AAPL 1 1.2 2.5
GOOGL 5 1 1.7
Arrange companies according to Transcolmax (high to low)
AAPL GOOGL MSFT
MSFT MSFT GOOGL
GOOGL AAPL AAPL
Arrange Transcolmean values according to Transcolmax (high to low) (desired output)
1 1 3
2 1.5 1.7
5 1.2 2.5
df1 = read.table(text="MSFT 10 7 11
AAPL 12 6 5
GOOGL 9.5 11 8")
df2 = read.table(text="MSFT 2 1.5 3
AAPL 1 1.2 2.5
GOOGL 5 1 1.7")
df2[,1]<-NULL
df1[,1]<-NULL
for (i in 1:ncol(df1)) {
  df2[, i] <- df2[order(df1[, i], decreasing = TRUE), i]
}
Output:
1 1 3
2 1.5 1.7
5 1.2 2.5
We can use mapply to do this
mapply(function(x, y) y[order(-x)], as.data.frame(Transcolmax[,-1]),
as.data.frame(Transcolmean[,-1]))
# v2 v3 v4
#[1,] 1 1.0 3.0
#[2,] 2 1.5 1.7
#[3,] 5 1.2 2.5
data
Transcolmax <- structure(list(v1 = c("MSFT", "AAPL", "GOOGL"), v2 = c(10, 12,
9.5), v3 = c(7L, 6L, 11L), v4 = c(11L, 5L, 8L)), .Names = c("v1",
"v2", "v3", "v4"), class = "data.frame", row.names = c(NA, -3L
))
Transcolmean<- structure(list(v1 = c("MSFT", "AAPL", "GOOGL"), v2 = c(2L, 1L,
5L), v3 = c(1.5, 1.2, 1), v4 = c(3, 2.5, 1.7)), .Names = c("v1",
"v2", "v3", "v4"), class = "data.frame", row.names = c(NA, -3L
))
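To see what the ordering step does, it can help to compute the row order once and apply it to both the company names and the Transcolmean values. The sketch below reproduces the intermediate company table as well as the desired output; the names ord, companies and means are made up for this example.

```r
Transcolmax <- data.frame(v1 = c("MSFT", "AAPL", "GOOGL"),
                          v2 = c(10, 12, 9.5),
                          v3 = c(7, 6, 11),
                          v4 = c(11, 5, 8),
                          stringsAsFactors = FALSE)
Transcolmean <- data.frame(v1 = c("MSFT", "AAPL", "GOOGL"),
                           v2 = c(2, 1, 5),
                           v3 = c(1.5, 1.2, 1),
                           v4 = c(3, 2.5, 1.7),
                           stringsAsFactors = FALSE)

cols <- names(Transcolmax)[-1]

# Row order (high to low) of each Transcolmax column
ord <- lapply(Transcolmax[cols], order, decreasing = TRUE)

# Company names in that order (the intermediate table from the question)
companies <- sapply(ord, function(o) Transcolmax$v1[o])

# Transcolmean values in the same order (the desired output)
means <- mapply(function(o, y) y[o], ord, Transcolmean[cols])

print(companies)
print(means)
```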

Resources