Sorting dataframe by column of letters and numbers - r

I have been attempting to sort my dataframe by the first column - or day - with multiple different methods listed below to no avail. I suspect it could be because it is attempting to order by the first number but I am unsure how I would change that to get it to order the rows properly. The dataset is as follows:
df1
      day    sample1 sample2
[1,]  day0        22      11
[2,]  day11       23      15
[3,]  day15       25      14
[4,]  day2        21      13
[5,]  day8        20      17
...
I am looking to order the entire row by day. I have tried the following
df[sort(as.character(df$day)),]
df[order(as.character(df$day)),]
mixedorder(as.character(df$day)) (gtools package)
mixedorder merely returned a vector of index numbers rather than the sorted data frame.
Current Code:
df_0$day = metadata_df[,3]
df_0 <- df_0[,c(8,1:7)]
df1 <- aggregate(df_0[,2:ncol(df_0)], df_0[1], mean)
df1 <- df1[mixedorder(as.character(df1$day)),]
df1$day <- factor(df1$day, levels = unique(df1$day))
rownames(df1) <- 1:nrow(df1)
##Plotting expression levels
Plot1 <- ggplot() +
geom_line(data=df1, aes(x=day, y=sample1, group=1, color="blue"))+
geom_line(data=df2, aes(x=day, y=sample1, group=2, color="red"))
Note that I have done the same transformations with df2 as I have with df1. Both df1 and df2 are the same, except with slightly different values in them.

mixedorder() from the gtools package returns the ordering index, which can then be used to reorder the rows:
library(gtools)
df1 <- df[mixedorder(as.character(df$day)),]
df1
# day sample1 sample2
#1 day0 22 11
#4 day2 21 13
#5 day8 20 17
#2 day11 23 15
#3 day15 25 14
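It is not clear how the OP is plotting. One option is to convert 'day' to a factor whose levels follow the sorted order, reshape to long format, and plot: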
library(tidyverse)
df1 %>%
mutate(day = factor(day, levels = unique(day))) %>%
gather(key, val, -day) %>%
ggplot(., aes(x = day, y = val, color = key)) +
geom_point()
data
df <- structure(list(day = structure(1:5, .Label = c("day0", "day11",
"day15", "day2", "day8"), class = "factor"), sample1 = c(22L,
23L, 25L, 21L, 20L), sample2 = c(11L, 15L, 14L, 13L, 17L)), .Names = c("day",
"sample1", "sample2"), class = "data.frame", row.names = c(NA,
-5L))
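As a rough base R alternative to mixedorder, the numeric part of the label can be extracted and used as the sort key. This is only a sketch and assumes every label has the form "day<number>"; the names day_num and df_sorted are made up for illustration.

```r
# Toy data matching the example in the question
df <- data.frame(
  day = c("day0", "day11", "day15", "day2", "day8"),
  sample1 = c(22L, 23L, 25L, 21L, 20L),
  sample2 = c(11L, 15L, 14L, 13L, 17L)
)

# Strip the "day" prefix and sort on the remaining number
day_num <- as.numeric(sub("^day", "", as.character(df$day)))
df_sorted <- df[order(day_num), ]

# Fix the factor levels in sorted order so plots keep this ordering
df_sorted$day <- factor(df_sorted$day, levels = unique(df_sorted$day))
```

Sorting the character labels directly puts "day11" before "day2" because strings are compared character by character, which is why a numeric key (or mixedorder) is needed.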

Related

Remove characters including and after third hyphen

I have a dataframe df and want to remove everything including and after the third '-' in the column 'case_id':
df
case_id unit
TCGA-3A-01-03-9441 27
TCGA-9C-01-04-9641 15
TCGA-1E-01-05-9471 6
This is the desired output:
df
case_id unit
TCGA-3A-01 27
TCGA-9C-01 15
TCGA-1E-01 6
We could use str_replace
library(stringr)
library(dplyr)
df1 %>%
mutate(case_id = str_replace(case_id, "^(([^-]+-){2}[^-]+)-.*", "\\1"))
-output
case_id unit
1 TCGA-3A-01 27
2 TCGA-9C-01 15
3 TCGA-1E-01 6
data
df1 <- structure(list(case_id = c("TCGA-3A-01-03-9441", "TCGA-9C-01-04-9641",
"TCGA-1E-01-05-9471"), unit = c(27L, 15L, 6L)),
class = "data.frame", row.names = c(NA,
-3L))
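The same pattern also works with base R's sub, so stringr is not strictly required. The sketch below additionally shows a split-and-rejoin variant; the variable names trimmed and trimmed2 are only illustrative.

```r
case_id <- c("TCGA-3A-01-03-9441", "TCGA-9C-01-04-9641", "TCGA-1E-01-05-9471")

# Keep the first three hyphen-separated fields, drop the third hyphen onward
trimmed <- sub("^(([^-]+-){2}[^-]+)-.*", "\\1", case_id)

# Alternative: split on "-" and paste the first three pieces back together
parts <- strsplit(case_id, "-", fixed = TRUE)
trimmed2 <- vapply(parts, function(p) paste(head(p, 3), collapse = "-"), character(1))
```

Both variants leave a value unchanged when it contains fewer than three hyphens.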

Dplyr merge rows based on one column value and sum other columns

My current df looks like the following:
WEEK COUNT COUNT2 PERCENTAGE
2017-53 10 15 .05
2018-00 5 10 .1
2018-01 7 9 .1
....
2018-52 10 12 .06
2019-00 6 10 .05
....
What I would like to do is merge the first week of each year (week 00) into the final week of the previous year, summing COUNT, COUNT2, and PERCENTAGE. The pairs I would like to combine are: 2017-53 and 2018-00, 2018-52 and 2019-00, 2019-52 and 2020-00. These should be merged into 2017-53, 2018-52, and 2019-52 respectively. My expected output would be the following:
WEEK COUNT COUNT2 PERCENTAGE
2017-53 15 25 .15
2018-01 7 9 .1
....
2018-52 16 22 .11
....
With tidyverse: convert 'WEEK' to Date class, arrange by that column, and extract the 'year'. Then create a grouping on 'WEEK' based on the difference between adjacent 'year' values (where the year changes, the row takes the previous row's 'WEEK'), and summarise to get the sum of the columns matching 'COUNT' or 'PERCENTAGE'.
library(stringr)
library(lubridate)
library(dplyr) #1.0.0
df1 %>%
mutate(Date = as.Date(str_c(WEEK, "-01"), format = '%Y-%U-%w')) %>%
arrange(Date) %>%
mutate(year = year(Date)) %>%
group_by(WEEK = case_when(lag(year, default = first(year)) - year < 0 ~
lag(WEEK), TRUE ~ WEEK)) %>%
summarise(across(matches("COUNT|PERCENTAGE"), sum))
# A tibble: 3 x 4
# WEEK COUNT COUNT2 PERCENTAGE
# <chr> <int> <int> <dbl>
#1 2017-53 15 25 0.15
#2 2018-01 7 9 0.1
#3 2018-52 16 22 0.11
data
df1 <- structure(list(WEEK = c("2017-53", "2018-00", "2018-01", "2018-52",
"2019-00"), COUNT = c(10L, 5L, 7L, 10L, 6L), COUNT2 = c(15L,
10L, 9L, 12L, 10L), PERCENTAGE = c(0.05, 0.1, 0.1, 0.06, 0.05
)), class = "data.frame", row.names = c(NA, -5L))
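A simpler base R sketch of the same idea is shown below. It assumes that, after sorting, every "-00" week directly follows the final week of the previous year, so the previous row's label can be reused as the grouping key. The names is_week0, prev, grp and res are made up for illustration.

```r
df1 <- data.frame(
  WEEK = c("2017-53", "2018-00", "2018-01", "2018-52", "2019-00"),
  COUNT = c(10L, 5L, 7L, 10L, 6L),
  COUNT2 = c(15L, 10L, 9L, 12L, 10L),
  PERCENTAGE = c(0.05, 0.1, 0.1, 0.06, 0.05),
  stringsAsFactors = FALSE
)

# Zero-padded "YYYY-WW" labels sort correctly as plain strings
df1 <- df1[order(df1$WEEK), ]

# A week-00 row borrows the label of the row before it (assumption stated above)
is_week0 <- grepl("-00$", df1$WEEK)
prev <- c(NA, head(df1$WEEK, -1))
grp <- df1$WEEK
grp[is_week0 & !is.na(prev)] <- prev[is_week0 & !is.na(prev)]
df1$grp <- grp

# Sum the numeric columns within each merged week
res <- aggregate(cbind(COUNT, COUNT2, PERCENTAGE) ~ grp, data = df1, FUN = sum)
names(res)[1] <- "WEEK"
```

If the data has gaps (for example a week 00 whose preceding year-end week is missing), this shortcut would merge the wrong rows, so the date-based grouping above is the safer choice.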
You could use colSums() as is shown here, but it's a bit convoluted. I'd recommend using aggregate and pipes, as is shown further down in the same link.
Hope this helps!

Transform to wide format from long in R

I have a data frame in R which looks like below
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look like this:
              Jan  Feb
A_Demand       10   40
A_Inventory    20   60
A_coverage
B_Demand            30
B_Inventory         40
B_coverage
A_coverage and B_coverage will be calculated in Excel using a formula. The problem I need help with is pivoting the data frame from long format (the original format) to wide format.
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
mutate(coverage = NA_real_) %>%
pivot_longer(-c(Model,Month), names_to = "Variable") %>%
pivot_wider(id_cols = c(Model, Variable), names_from = Month ) %>%
unite(Variable, c(Model,Variable), sep = "_")
## A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA
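For comparison, a base R sketch without dplyr or tidyr: stack Demand and Inventory into one long column, then cross-tabulate with tapply. It returns a matrix rather than a tibble and leaves out the empty coverage rows; the names long and wide are only illustrative.

```r
data <- data.frame(
  Model = c("A", "B", "A"),
  Month = c("Jan", "Feb", "Feb"),
  Demand = c(10L, 30L, 40L),
  Inventory = c(20L, 40L, 60L)
)

# Stack the two measures and build the "Model_Variable" row label
long <- data.frame(
  Variable = c(paste(data$Model, "Demand", sep = "_"),
               paste(data$Model, "Inventory", sep = "_")),
  Month = rep(data$Month, 2),
  value = c(data$Demand, data$Inventory)
)

# One cell per Variable/Month pair; combinations with no data become NA
wide <- tapply(long$value, list(long$Variable, long$Month), sum)

# tapply orders columns alphabetically, so restore calendar order by hand
wide <- wide[, c("Jan", "Feb")]
```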

Failed to use map2 with mutate with purrr and dplyr

I am reading a list of files from my computer and applying several transformations to them with purrr and dplyr. Everything works great, but I also have a vector with the ID of each data frame created, and I want to add a column holding that ID to each data frame.
Loading libraries
library(readr)
library(lubridate)
library(dplyr)
library(purrr)
Reading list of files to be read and modified
ArchivosTemp <- list.files(pattern = "Tem.csv")
For reproducibility
Let's say the list of data frames, called Temperaturas, produced by the first step of the code is
Temperaturas <- list(structure(list(`Date/Time` = c("01-07-2016 14:55", "01-07-2016 15:55",
"01-07-2016 16:55", "01-07-2016 17:55", "01-07-2016 18:55", "01-07-2016 19:55"
), Unit = c("C", "C", "C", "C", "C", "C"), Value = c(28L, 24L,
25L, 25L, 25L, 25L), a = c(68L, 682L, 182L, 182L, 182L, 182L)), .Names = c("Date/Time",
"Unit", "Value", "a"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame")), structure(list(`Date/Time` = c("12-06-2016 19:44",
"12-06-2016 20:44", "12-06-2016 21:44", "12-06-2016 22:44", "12-06-2016 23:44",
"13-06-2016 0:44"), Unit = c("C", "C", "C", "C", "C", "C"), Value = c(31L,
29L, 27L, 26L, 26L, 24L), a = c(129L, 131L, 632L, 633L, 133L,
633L)), .Names = c("Date/Time", "Unit", "Value", "a"), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame")), structure(list(
`Date/Time` = c("07-06-16 7:54:01", "07-06-16 8:54:01", "07-06-16 9:54:01",
"07-06-16 10:54:01", "07-06-16 11:54:01", "07-06-16 12:54:01"
), Unit = c("C", "C", "C", "C", "C", "C"), Value = c(23L,
19L, 25L, 27L, 30L, 34L), a = c("119", "116", "119", "119",
"118", "113")), .Names = c("Date/Time", "Unit", "Value",
"a"), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)))
and a vector with the ID of each element of the list
IDs <- c("H1F102", "H1F105", "H1F106")
The purrr code that is working so far
a <- ArchivosTemp %>% map(read_csv) %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura))
Since you can't read the CSVs from my computer, let's replace ArchivosTemp %>% map(read_csv) with the list that I made above
a <- Temperaturas %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura))
Then I want each of the 3 data frames to have a column called ID holding its corresponding element of the IDs vector. I tried this:
a <- Temperaturas %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura)) %>% map2(y = IDs,~mutate(.x, ID = y.))
but it does not work. Any ideas about what I am doing wrong?
Expected outcome
As an example this is the results I expect using only the first data frame
a <- Temperaturas %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura))
mutate(a[[1]], ID = IDs[1])
which turns into
# A tibble: 6 x 3
Date.Time Temperatura ID
<dttm> <int> <chr>
1 2020-07-01 16:14:55 28 H1F102
2 2020-07-01 16:15:55 24 H1F102
3 2020-07-01 16:16:55 25 H1F102
4 2020-07-01 16:17:55 25 H1F102
5 2020-07-01 16:18:55 25 H1F102
6 2020-07-01 16:19:55 25 H1F102
The problem is just the argument name: map2 names its arguments .x and .y, so both the `y = IDs` argument and the `y.` inside the formula need to be `.y`. With that change it works for me:
map2(.y = IDs, ~ mutate(.x, ID = .y))
Also, if you eventually need to bind all elements of the list into a single data frame, you can name the list with the IDs vector via set_names and then pass the .id argument to map_df. map_df applies the function to each element, row-binds the results with bind_rows, and stores the list names in a new column named by .id:
Temperaturas %>%
set_names(IDs) %>%
map_df(~ transmute(.x, Date.Time=dmy_hms(`Date/Time`), Temperatura=Value), .id="ID")
# A tibble: 18 x 3
# ID Date.Time Temperatura
# <chr> <dttm> <int>
# 1 H1F102 2020-07-01 16:14:55 28
# 2 H1F102 2020-07-01 16:15:55 24
# 3 H1F102 2020-07-01 16:16:55 25
# 4 H1F102 2020-07-01 16:17:55 25
# 5 H1F102 2020-07-01 16:18:55 25
# 6 H1F102 2020-07-01 16:19:55 25
# 7 H1F105 2020-06-12 16:19:44 31
# 8 H1F105 2020-06-12 16:20:44 29
# 9 H1F105 2020-06-12 16:21:44 27
#10 H1F105 2020-06-12 16:22:44 26
#11 H1F105 2020-06-12 16:23:44 26
#12 H1F105 2020-06-13 16:00:44 24
#13 H1F106 2016-06-07 07:54:01 23
#14 H1F106 2016-06-07 08:54:01 19
#15 H1F106 2016-06-07 09:54:01 25
#16 H1F106 2016-06-07 10:54:01 27
#17 H1F106 2016-06-07 11:54:01 30
#18 H1F106 2016-06-07 12:54:01 34
Note that transmute serves here as shorthand for the rename %>% mutate %>% select sequence.
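The pairing that map2 performs can also be sketched in base R with Map, which walks two inputs in parallel in the same way. The data frames below are toy stand-ins using the first two temperatures of each list element; frames, with_id and combined are illustrative names, not part of the original code.

```r
# Toy stand-ins for the three data frames in the list
frames <- list(
  data.frame(Temperatura = c(28L, 24L)),
  data.frame(Temperatura = c(31L, 29L)),
  data.frame(Temperatura = c(23L, 19L))
)
IDs <- c("H1F102", "H1F105", "H1F106")

# Pair the i-th data frame with the i-th ID and add it as a column
with_id <- Map(function(d, id) {
  d$ID <- id
  d
}, frames, IDs)

# Optionally stack everything into one data frame
combined <- do.call(rbind, with_id)
```

Here the function arguments d and id play the roles of .x and .y in the map2 formula.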

How to merge specific rows that match a grep pattern

I have a dataframe as follows:
Jen Rptname freq
AKT bilb1 23
AKT bilb1 234
DFF bilb22 987
DFF bilf34 7
DFF jhs23 623
AKT j45 53
JFG jhs98 65
I know how to group the whole dataframe based on individual columns, but how do I merge individual rows based on a grep pattern (in this case bilb.* and jhs.*)?
I want to be able to merge the rows (and therefore also add the frequencies together) with bilb* and separately the rows with jhs* so that I end up with
AKT bilb 257
DFF bilb 987
DFF bilf34 7
DFF jhs 623
AKT j45 53
JFG jhs 65
This is so that the aggregation is by Jen and Rptname so I can see how many of the same Rptnames are in each Jen
We can use grep to get the index of the 'Rptname' elements that contain 'bilb' or 'jhs', remove the numeric part with sub, and use aggregate to get the sum of 'freq' by 'Rptname':
indx <- grep('bilb|jhs', df1$Rptname)
df1$Rptname[indx] <- sub('\\d+', '', df1$Rptname[indx])
aggregate(freq~Rptname, df1, FUN=sum)
# Rptname freq
#1 bilb 1244
#2 bilf34 7
#3 j45 53
#4 jhs 688
Update
Suppose your dataset is 'df2'
df2$grp <- gsub("([A-Z]+|[a-z]+)[^A-Z]+", "\\1", df2$Rptname)
aggregate(freq~grp+Jen, df2, FUN=sum)
data
df1 <- structure(list(Rptname = c("bilb1", "bilb1", "bilb22",
"bilf34",
"jhs23", "j45", "jhs98"), freq = c(23L, 234L, 987L, 7L, 623L,
53L, 65L)), .Names = c("Rptname", "freq"), class = "data.frame",
row.names = c(NA, -7L))
df2 <- structure(list(Jen = c("AKT", "AKT", "AKT", "DFF", "DFF",
"DFF",
"DFF", "DFF", "DFF", "AKT", "JFG", "JFG", "JFG"), Rptname = c("bilb1",
"bilb1", "bilb22", "bilb22", "bilb1", "BTBy", "bilf34", "BTBx",
"jhs23", "j45", "jhs98", "BTBfd", "BTBx"), freq = c(23L, 234L,
22L, 987L, 18L, 18L, 7L, 9L, 623L, 53L, 65L, 19L, 14L)),
.Names = c("Jen",
"Rptname", "freq"), class = "data.frame", row.names = c(NA, -13L))
This is similar to akrun's answer, and I like his use of aggregate better than my creation of an intermediate vector (here 'dat' holds the same data as 'df1' above):
> inter <- tapply(dat$freq, sub("^(bilb|jhs)(.+)$", "\\1", dat$Rptname) ,sum)
> final <- data.frame( nams = names(inter), sums = inter)
> final
nams sums
bilb bilb 1244
bilf34 bilf34 7
j45 j45 53
jhs jhs 688
My pattern requires that 'bilb' and 'jhs' be at the beginning of the value. Remove the "^" if that was not intended, but if you do, add a "(.*)" and switch to "\\2" in the replacement.
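To get the output requested in the question (grouped by both Jen and Rptname, with only the bilb and jhs entries collapsed), the anchored pattern can be combined with aggregate. This is a sketch on the seven rows shown in the question; the data frame name df and the result name res are illustrative.

```r
df <- data.frame(
  Jen = c("AKT", "AKT", "DFF", "DFF", "DFF", "AKT", "JFG"),
  Rptname = c("bilb1", "bilb1", "bilb22", "bilf34", "jhs23", "j45", "jhs98"),
  freq = c(23L, 234L, 987L, 7L, 623L, 53L, 65L),
  stringsAsFactors = FALSE
)

# Collapse names starting with "bilb" or "jhs" to the bare prefix;
# everything else (bilf34, j45) is left untouched
df$Rptname <- sub("^(bilb|jhs).*$", "\\1", df$Rptname)

# Sum freq within each Jen/Rptname combination
res <- aggregate(freq ~ Jen + Rptname, data = df, FUN = sum)
```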
