From rows to columns, pivot_wider error [duplicate] - r

Hi guys, I have a data frame that looks like this:
Sample Value
10 152365
10 236548
10 232547
10 145987
22 98564
22 98745
22 236547
And I would like to make it like this
10 22
152365 98564
236548 98745
232547 236547
145987
I have tried pivot_wider, but since I have over 100,000 values it gives me an error saying that duplicate values were found, so it does not work, whereas spread simply freezes...
Can you help me?
Thanks
Lore

Assuming the blank under 22 can be NA, this works:
library(dplyr)
library(tidyr) # pivot_wider
quux %>%
  group_by(Sample) %>%
  mutate(rn = row_number()) %>% # row index within each Sample makes every cell unique
  ungroup() %>%
  pivot_wider(id_cols = rn, names_from = Sample, values_from = Value) %>%
  select(-rn)
# # A tibble: 4 x 2
# `10` `22`
# <int> <int>
# 1 152365 98564
# 2 236548 98745
# 3 232547 236547
# 4 145987 NA
Data
quux <- structure(list(Sample = c(10L, 10L, 10L, 10L, 22L, 22L, 22L), Value = c(152365L, 236548L, 232547L, 145987L, 98564L, 98745L, 236547L)), class = "data.frame", row.names = c(NA, -7L))
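The same padding idea can be sketched in base R, which may help when the tidyverse route is slow on large inputs. This is a minimal sketch, not a benchmarked alternative: it splits Value by Sample and pads the shorter vectors with NA. The names cols, n, and wide are illustrative and do not come from the answer above.

```r
# Sample data in the same shape as quux above
quux <- data.frame(
  Sample = c(10L, 10L, 10L, 10L, 22L, 22L, 22L),
  Value  = c(152365L, 236548L, 232547L, 145987L, 98564L, 98745L, 236547L)
)

# One vector of values per sample
cols <- split(quux$Value, quux$Sample)

# Length of the longest vector; shorter ones are padded with NA up to this length
n <- max(lengths(cols))

# `length<-` extends each vector with NA; check.names = FALSE keeps "10" and "22" as names
wide <- as.data.frame(lapply(cols, `length<-`, n), check.names = FALSE)
wide
```

Because no key has to be matched across rows, duplicated values within a sample are not a problem here.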

Related

Calculate or Filter the wrong date entries of two Date columns in R

I am trying to work out how to filter out wrong entries, or calculate the difference between two date columns of the same data frame in R. The scenario: I have a Patient table with two columns, Patient_admit and Patient_discharge. How can I find the rows where the date entered for Patient_discharge is before Patient_admit? In the example data frame below, the entries for patients 2 and 6 are incorrect.
Output of dput(head(patient)):
structure(list(id = c(1003L, 1005L, 1006L, 1007L, 1010L, 1010L
), date_admit = structure(c(115L, 18L, 138L,
91L, 34L, 278L), .Label = c("01/01/2020", "01/02/2020", "01/03/2020",............,
date_discharge = structure(c(143L, 130L, 181L, 156L, 198L,
86L), .Label = c("01/01/2020", "01/01/2021", "01/02/2020",
............., class = "factor")), row.names = c(NA, 6L), class = "data.frame")
The list of dates is very long, so I replaced the rest with "..........." for readability. Thanks
Another possible solution, based on lubridate::dmy:
library(dplyr)
library(lubridate)
df %>%
  filter(dmy(Patient_admit) <= dmy(Patient_discharge))
#> Patient_ID Patient_admit Patient_discharge
#> 1 1 20/10/2020 21/10/2020
#> 2 3 21/10/2021 22/10/2021
#> 3 4 25/11/2022 25/11/2022
#> 4 5 25/11/2022 26/11/2022
First convert your dates to a date-time class using strptime. Then calculate the difference in days using difftime and keep only the rows where that difference is not negative. You can use the following code:
library(dplyr)
df %>%
  mutate(Patient_admit = strptime(Patient_admit, "%d/%m/%Y"),
         Patient_discharge = strptime(Patient_discharge, "%d/%m/%Y")) %>%
  mutate(diff_days = difftime(Patient_discharge, Patient_admit, units = "days")) %>%
  filter(diff_days >= 0) %>%
  select(-diff_days)
Output:
Patient_ID Patient_admit Patient_discharge
1 1 2020-10-20 2020-10-21
2 3 2021-10-21 2021-10-22
3 4 2022-11-25 2022-11-25
4 5 2022-11-25 2022-11-26
Data
df <- data.frame(Patient_ID = c(1,2,3,4,5,6),
Patient_admit = c("20/10/2020", "22/10/2021", "21/10/2021", "25/11/2022", "25/11/2022", "05/10/2020"),
Patient_discharge = c("21/10/2020", "20/10/2021", "22/10/2021", "25/11/2022", "26/11/2022", "20/09/2020"))
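Since the question also asks how to find the wrong entries rather than only drop them, here is a minimal base R sketch that keeps the difference as a column and lists the offending rows. It assumes the dates are day/month/year strings as in df above; the names admit, discharge, stay_days, and bad are illustrative.

```r
df <- data.frame(
  Patient_ID = c(1, 2, 3, 4, 5, 6),
  Patient_admit = c("20/10/2020", "22/10/2021", "21/10/2021",
                    "25/11/2022", "25/11/2022", "05/10/2020"),
  Patient_discharge = c("21/10/2020", "20/10/2021", "22/10/2021",
                        "25/11/2022", "26/11/2022", "20/09/2020")
)

# Parse day/month/year strings into Date objects
admit     <- as.Date(df$Patient_admit, format = "%d/%m/%Y")
discharge <- as.Date(df$Patient_discharge, format = "%d/%m/%Y")

# Length of stay in days; a negative value means discharge precedes admission
df$stay_days <- as.numeric(discharge - admit)

# Rows with an impossible date order
bad <- df[df$stay_days < 0, ]
bad
```

With factor columns, as in the dput output from the question, wrapping the column in as.character() before parsing makes the conversion explicit.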

Dplyr merge rows based on one column value and sum other columns

My current df looks like the following:
WEEK COUNT COUNT2 PERCENTAGE
2017-53 10 15 .05
2018-00 5 10 .1
2018-01 7 9 .1
....
2018-52 10 12 .06
2019-00 6 10 .05
....
What I would like to do is merge week 00 of each year into the final week of the previous year, combining COUNT, COUNT2, and PERCENTAGE. The pairs of weeks I currently have that I would like to combine are: 2017-53 and 2018-00, 2018-52 and 2019-00, 2019-52 and 2020-00. These should be merged into 2017-53, 2018-52, and 2019-52. My expected output would be the following:
WEEK COUNT COUNT2 PERCENTAGE
2017-53 15 25 .15
2018-01 7 9 .1
....
2018-52 16 22 .11
....
With the tidyverse: convert 'WEEK' to Date class, arrange by that column, and extract the 'year'. Then build a grouping variable from 'WEEK': where the 'year' increases from one row to the next, the row takes the previous row's 'WEEK'. Finally, summarise to get the sum of the columns whose names match 'COUNT' or 'PERCENTAGE'.
library(stringr)
library(lubridate)
library(dplyr) #1.0.0
df1 %>%
  mutate(Date = as.Date(str_c(WEEK, "-01"), format = '%Y-%U-%w')) %>%
  arrange(Date) %>%
  mutate(year = year(Date)) %>%
  group_by(WEEK = case_when(lag(year, default = first(year)) - year < 0 ~
                              lag(WEEK), TRUE ~ WEEK)) %>%
  summarise(across(matches("COUNT|PERCENTAGE"), sum))
# A tibble: 3 x 4
# WEEK COUNT COUNT2 PERCENTAGE
# <chr> <int> <int> <dbl>
#1 2017-53 15 25 0.15
#2 2018-01 7 9 0.1
#3 2018-52 16 22 0.11
data
df1 <- structure(list(WEEK = c("2017-53", "2018-00", "2018-01", "2018-52",
"2019-00"), COUNT = c(10L, 5L, 7L, 10L, 6L), COUNT2 = c(15L,
10L, 9L, 12L, 10L), PERCENTAGE = c(0.05, 0.1, 0.1, 0.06, 0.05
)), class = "data.frame", row.names = c(NA, -5L))
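A base R sketch of the same merge that avoids parsing week numbers as dates. It assumes that the WEEK strings sort chronologically as text and that every week-00 row should fold into the row immediately before it. The names is_week0, prev_week, target, and merged are illustrative. Summing PERCENTAGE follows the expected output in the question.

```r
df1 <- data.frame(
  WEEK = c("2017-53", "2018-00", "2018-01", "2018-52", "2019-00"),
  COUNT = c(10L, 5L, 7L, 10L, 6L),
  COUNT2 = c(15L, 10L, 9L, 12L, 10L),
  PERCENTAGE = c(0.05, 0.1, 0.1, 0.06, 0.05),
  stringsAsFactors = FALSE
)

# "YYYY-WW" labels sort chronologically as plain text
df1 <- df1[order(df1$WEEK), ]

# Week-00 rows take the label of the preceding row (the last week of the previous year)
is_week0  <- grepl("-00$", df1$WEEK)
prev_week <- c(NA, head(df1$WEEK, -1))
target    <- ifelse(is_week0 & !is.na(prev_week), prev_week, df1$WEEK)

# Sum every measure column within each target week
merged <- aggregate(df1[c("COUNT", "COUNT2", "PERCENTAGE")],
                    by = list(WEEK = target), FUN = sum)
merged
```

A week-00 row with no preceding row keeps its own label, so the first row of the data is never dropped.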
You could use colSums() as is shown here, but it's a bit convoluted. I'd recommend using aggregate and pipes, as is shown further down in the same link.
Hope this helps!

Transform to wide format from long in R

I have a data frame in R which looks like below
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look like this:
Jan Feb
A_Demand 10 40
A_Inventory 20 60
A_coverage
B_Demand 30
B_Inventory 40
B_coverage
A_coverage and B_coverage will be calculated in Excel using a formula. The problem I need help with is pivoting the data frame from long format (the original format) to wide format.
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
  mutate(coverage = NA_real_) %>%
  pivot_longer(-c(Model, Month), names_to = "Variable") %>%
  pivot_wider(id_cols = c(Model, Variable), names_from = Month) %>%
  unite(Variable, c(Model, Variable), sep = "_")
## A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA
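One likely reason the reshape() attempt in the question did not work is that Demand and Inventory are listed in idvar, which leaves no measure column to spread across months. A minimal base R sketch of a working variant follows: it first stacks the two measures into one value column and then reshapes with a single id. The names long and wide are illustrative, and the coverage rows are left out here.

```r
data <- data.frame(
  Model = c("A", "B", "A"),
  Month = c("Jan", "Feb", "Feb"),
  Demand = c(10L, 30L, 40L),
  Inventory = c(20L, 40L, 60L),
  stringsAsFactors = FALSE
)

# Stack Demand and Inventory into one value column, labelled Model_Measure
long <- data.frame(
  Variable = paste(rep(data$Model, 2),
                   rep(c("Demand", "Inventory"), each = nrow(data)),
                   sep = "_"),
  Month = rep(data$Month, 2),
  value = c(data$Demand, data$Inventory),
  stringsAsFactors = FALSE
)

# Spread months into columns; Variable is the only id
wide <- reshape(long, idvar = "Variable", timevar = "Month", direction = "wide")

# Drop the "value." prefix that reshape() adds, then sort by Variable
names(wide) <- sub("^value\\.", "", names(wide))
wide <- wide[order(wide$Variable), ]
wide
```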

Sorting dataframe by column of letters and numbers

I have been trying to sort my data frame by the first column (day) using the methods listed below, to no avail. I suspect the rows are being ordered by the first digit rather than by the whole number, but I am unsure how to change that so the rows are ordered properly. The dataset is as follows:
df1
[day][sample1][sample2]
[1,]day0 22 11
[2,]day11 23 15
[3,]day15 25 14
[4,]day2 21 13
[5,]day8 20 17
...
I am looking to order the entire row by day. I have tried the following
df[sort(as.character(df$day)),]
df[order(as.character(df$day)),]
mixedorder(as.character(df$day)) (gtools package)
mixedorder merely returned a vector of index numbers.
Current Code:
df_0$day = metadata_df[,3]
df_0 <- df_0[,c(8,1:7)]
df1 <- aggregate(df_0[,2:ncol(df_0)], df_0[1], mean)
df1 <- df1[mixedorder(as.character(df1$day)),]
df1$day <- factor(df1$day, levels = unique(df1$day))
rownames(df1) <- 1:nrow(df1)
##Plotting expression levels
Plot1 <- ggplot() +
geom_line(data=df1, aes(x=day, y=sample1, group=1, color="blue"))+
geom_line(data=df2, aes(x=day, y=sample1, group=2, color="red"))
Note that I have done the same transformations with df2 as I have with df1. Both df1 and df2 are the same, except with slightly different values in them.
mixedorder returns the ordering index, which can be used to reorder the rows:
library(gtools)
df1 <- df[mixedorder(as.character(df$day)),]
df1
# day sample1 sample2
#1 day0 22 11
#4 day2 21 13
#5 day8 20 17
#2 day11 23 15
#3 day15 25 14
It is not clear how the OP is plotting. One option is to fix the factor levels in the sorted order and plot from long format:
library(tidyverse)
df1 %>%
  mutate(day = factor(day, levels = unique(day))) %>%
  gather(key, val, -day) %>%
  ggplot(., aes(x = day, y = val, color = key)) +
  geom_point()
data
df <- structure(list(day = structure(1:5, .Label = c("day0", "day11",
"day15", "day2", "day8"), class = "factor"), sample1 = c(22L,
23L, 25L, 21L, 20L), sample2 = c(11L, 15L, 14L, 13L, 17L)), .Names = c("day",
"sample1", "sample2"), class = "data.frame", row.names = c(NA,
-5L))
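If gtools is not available, the same ordering can be sketched in base R by sorting on the numeric part of the label. This assumes every label has the form "day" followed by digits; day_num and sorted are illustrative names.

```r
df <- data.frame(
  day = c("day0", "day11", "day15", "day2", "day8"),
  sample1 = c(22L, 23L, 25L, 21L, 20L),
  sample2 = c(11L, 15L, 14L, 13L, 17L),
  stringsAsFactors = FALSE
)

# Strip the "day" prefix and order by the remaining number
day_num <- as.integer(sub("^day", "", df$day))
sorted <- df[order(day_num), ]

# Fix the factor levels in sorted order so plots keep this order on the axis
sorted$day <- factor(sorted$day, levels = unique(sorted$day))
sorted
```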

Combine data depending on the value of one column

I have a data frame in R
year group sales
1 2000 1 20
2 2001 1 25
3 2002 1 23
4 2003 1 30
5 2001 2 50
6 2002 2 55
I want to group the data by group, or create some kind of object: one array for each group that stores the year and the sales. Then I will try to save it as a JSON file with this structure:
[{"group": 1, "sales":[[2000,20],[2001, 25], [2002,23], [2003, 30]]},
{"group": 2, "sales":[[2001, 50], [2002,55]]}]
Is it possible to do it automatically?
Thanks a lot
We can use data.table to paste the 'year' and 'sales' columns grouped by 'group'. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'group', we use sprintf to paste 'year' and 'sales' together inside square brackets ([]), collapse the output to a single string with toString (a wrapper for paste(..., collapse=', ')), paste the outer [] around it, and use toJSON. Note that 'sales' is then emitted as a JSON string rather than as a nested array.
library(jsonlite)
library(data.table)
toJSON(setDT(df1)[, list(sales= paste0('[',toString(sprintf('[%d,%d]',
year, sales)),']')), by = group])
#[{"group":1,"sales":"[[2000,20], [2001,25], [2002,23], [2003,30]]"},
#{"group":2,"sales":"[[2001,50], [2002,55]]"}]
The paste by group can be done using base R. We split the dataset by the 'group' column to create a list. Loop through the list with lapply, paste, the 'year', 'sales' column as mentioned above. Create a data.frame with the first element of 'group' and the string from the paste step, rbind the list elements to create a single data.frame and then use toJSON.
toJSON(
do.call(rbind,
lapply(
split(df1, df1$group),
function(x) data.frame(group=x$group[1L],
sales=paste0('[',
toString(sprintf('[%d,%d]', x$year, x$sales)),
']')))))
data
df1 <- structure(list(year = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L
), group = c(1L, 1L, 1L, 1L, 2L, 2L), sales = c(20L, 25L, 23L,
30L, 50L, 55L)), .Names = c("year", "group", "sales"),
class = "data.frame", row.names = c(NA, -6L))
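To get 'sales' as a real nested array, as in the structure the question asks for, one option is to hand jsonlite a matrix per group instead of a pasted string. This is a minimal sketch, assuming jsonlite's default row-wise serialisation of matrices; the names groups and json are illustrative.

```r
library(jsonlite)

df1 <- data.frame(
  year  = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L),
  group = c(1L, 1L, 1L, 1L, 2L, 2L),
  sales = c(20L, 25L, 23L, 30L, 50L, 55L)
)

# One list entry per group: a scalar group id and a year/sales matrix
groups <- lapply(split(df1, df1$group), function(x) {
  list(
    group = x$group[1L],
    sales = unname(as.matrix(x[c("year", "sales")])) # drop dimnames before serialising
  )
})

# unname() turns the named list into a JSON array; auto_unbox makes group a scalar
json <- toJSON(unname(groups), auto_unbox = TRUE)
json
```

Writing the result to disk can then be done with writeLines(json, "sales.json"), where the file name is a placeholder.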
Since the other answer uses data.table, I thought it would be an interesting exercise to try this in dplyr. The first version is not the optimal way, but it illustrates do(), which I'm not convinced is documented well enough. I have also shown the more appropriate summarise solution.
df <-read.table(textConnection('
year group sales expenses
2000 1 20 19
2001 1 25 19
2002 1 23 20
2003 1 30 15
2001 2 50 27
2002 2 55 30
'),header=TRUE)
library(dplyr)
library(jsonlite)
df %>%
  group_by(group) %>%
  do(
    sales = group_by(., year) %>% select(sales) %>% apply(MARGIN = 2, identity),
    expenses = group_by(., year) %>% select(expenses) %>% apply(MARGIN = 2, identity)
  )
# summarise version: one year/value matrix per group and measure
df %>%
  group_by(group) %>%
  summarise(
    sales = list(apply(data.frame(year, sales), MARGIN = 2, identity)),
    expenses = list(apply(data.frame(year, expenses), MARGIN = 2, identity))
  ) %>%
  jsonlite::toJSON()
