How to sum a variable by group with NA? - r

I have a large data set like this:
ID  Number
153 31
    28
    31
    30
104 31
    30
254 31
266 31
and I want to compute the sum of Number by ID, counting the rows with a missing (NA) ID towards the ID above them. That is, I want to get this:
ID Number
153 120
104 61
254 31
266 31
I tried aggregate but I don't get the expected result. Some help would be appreciated.

One option is to convert the blanks to NA, then fill the NA elements in 'ID' with the nearest non-NA value above, group by 'ID', and get the sum of 'Number':
library(tidyverse)
df1 %>%
mutate(ID = na_if(ID, "")) %>%
fill(ID) %>%
group_by(ID) %>%
summarise(Number = sum(Number))
# A tibble: 4 x 2
# ID Number
# <chr> <int>
#1 104 61
#2 153 120
#3 254 31
#4 266 31
Or, without using fill, create a grouping variable with a logical expression and cumsum, and then do the sum:
df1 %>%
group_by(grp = cumsum(ID != "")) %>%
summarise(ID = first(ID), Number = sum(Number)) %>%
select(-grp)
data
df1 <- structure(list(ID = c("153", "", "", "", "104", "", "254", "266"
), Number = c(31L, 28L, 31L, 30L, 31L, 30L, 31L, 31L)), row.names = c(NA,
-8L), class = "data.frame")
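The cumsum grouping idea above does not depend on the tidyverse. As a minimal base R sketch (the names grp and res are illustrative and not part of the answer), each non-blank ID starts a new group and tapply sums Number within each group:

```r
# Sample data from the question: blank IDs belong to the ID above them
df1 <- data.frame(
  ID = c("153", "", "", "", "104", "", "254", "266"),
  Number = c(31L, 28L, 31L, 30L, 31L, 30L, 31L, 31L)
)

# A new group starts at every non-blank ID: 1 1 1 1 2 2 3 4
grp <- cumsum(df1$ID != "")

# One row per group: the ID that opened the group and the group total
res <- data.frame(
  ID = df1$ID[df1$ID != ""],
  Number = as.vector(tapply(df1$Number, grp, sum))
)
print(res)
```

Because the groups are numbered in order of appearance, the result keeps the original ID order (153, 104, 254, 266) instead of sorting the IDs.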

Or do it straightforwardly :) in base R, with reversed cumulative sums:
cbind(df1[df1$ID != "", "ID", drop = FALSE],
Number = rev(diff(c(0, rev((rev(cumsum(rev(df1$Number)))[df1$ID != ""]))))))


How to dynamically pass column name in e_bar (echarts4r) function?

I have a data frame that contains a count for each continent by year. Below is the data frame.
# A tibble: 4 x 4
continent year_2020 year_2021 year_2022
<chr> <dbl> <dbl> <dbl>
1 Asia 35 177 350
2 Europe 45 47 84
3 Australia 26 46 58
4 Africa 15 20 25
And this is the R script I used to create the graph
stack %>%
e_charts(continent) %>%
e_bar(year_2020) %>%
e_bar(year_2021) %>%
e_bar(year_2022)
[Image: bar graph]
My question is how to pass these column names dynamically. The data frame above is a sample dataset, and the number of year columns keeps increasing. My idea is to show a maximum of 3 bars per continent.
What I tried was to define a start year and an end year, so that the bar graph is drawn based on the input and the column names are not hard-coded in the e_bar function.
start_year <- "2020"
end_year <- "2022"
year_val <- paste0("year_",start_year:end_year)
year_val1 <- year_val[1]
year_val2 <- year_val[2]
year_val3 <- year_val[3]
stack %>%
e_charts(continent) %>%
e_bar(sym(year_val1)) %>%
e_bar(sym(year_val2)) %>%
e_bar(sym(year_val3))
But I got the error below:
Error in chr_as_locations():
! Can't subset columns that don't exist.
x Column sym(year_val1) doesn't exist.
I need help on how to pass the year columns dynamically.
Thanks
One option would be to switch to the "underscored" version of e_bar, i.e. e_bar_, which allows you to pass the name of the series as a character string:
library(echarts4r)
stack |>
e_charts(continent) |>
e_bar_(year_val1) |>
e_bar_(year_val2) |>
e_bar_(year_val3)
DATA
stack <- structure(list(continent = c("Asia", "Europe", "Australia", "Africa"), year_2020 = c(35L, 45L, 26L, 15L), year_2021 = c(
177L, 47L,
46L, 20L
), year_2022 = c(350L, 84L, 58L, 25L)), class = "data.frame", row.names = c(
"1",
"2", "3", "4"
))
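If the number of year columns is not fixed, the series can also be added in a loop instead of writing one e_bar_ call per column. The following is only a sketch, assuming echarts4r is installed and that the last three year_ columns are the ones to show; the names year_cols and chart are made up for illustration:

```r
library(echarts4r)

# Sample data from the question
stack <- data.frame(
  continent = c("Asia", "Europe", "Australia", "Africa"),
  year_2020 = c(35L, 45L, 26L, 15L),
  year_2021 = c(177L, 47L, 46L, 20L),
  year_2022 = c(350L, 84L, 58L, 25L)
)

# Collect every year_* column name, then keep at most the last three
year_cols <- tail(grep("^year_", names(stack), value = TRUE), 3)

# Start from the base chart and add one bar series per column name;
# e_bar_ takes the series name as a character string
chart <- Reduce(
  function(e, col) e_bar_(e, col),
  year_cols,
  init = e_charts(stack, continent)
)
```

Printing chart renders the widget. When a new year column is added to the data, year_cols picks it up without any change to the plotting code.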

Calculating Percent Change in R for Multiple Variables

I'm trying to calculate percent change in R, with each of the time points included in the column label (table below). I have dplyr loaded, and my dataset is loaded in R under the name data. Below is the code I'm using, but it's not calculating correctly. I want to create a new data frame called data_per_chg that contains the percent change of each variable from its "v1" value. For instance, for the wbc variable, I would like to calculate the percent change of wbc.v1 from wbc.v1, wbc.v2 from wbc.v1, wbc.v3 from wbc.v1, etc., and do that for all the remaining variables in my dataset. I assume I can use a loop to do this, but I'm pretty new to R, so I'm not sure how to proceed. Any guidance will be greatly appreciated.
id wbc.v1 wbc.v2 wbc.v3 rbc.v1 rbc.v2 rbc.v3 hct.v1 hct.v2 hct.v3
a1 23 63 30 23 56 90 13 89 47
a2 81 45 46 N/A 18 78 14 45 22
a3 NA 27 14 29 67 46 37 34 33
data_per_chg<-data%>%
group_by(id%>%
arrange(id)%>%
mutate(change=(wbc.v2-wbc.v1)/(wbc.v1))
data_per_chg
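Assuming the missing values should all be NA, convert the "N/A" strings to NA first, then compute the change of every non-v1 column relative to the matching v1 column: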
library(dplyr)
library(stringr)
data <- data %>%
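mutate(across(where(is.character), ~ na_if(.x, "N/A"))) %>% # applied per column, since recent dplyr versions reject na_if() on a whole data frame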
type.convert(as.is = TRUE) %>%
mutate(across(-c(id, matches("\\.v1$")), ~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change"))
-output
data
id wbc.v1 wbc.v2 wbc.v3 rbc.v1 rbc.v2 rbc.v3 hct.v1 hct.v2 hct.v3 wbc.v2_change wbc.v3_change rbc.v2_change rbc.v3_change hct.v2_change hct.v3_change
1 a1 23 63 30 23 56 90 13 89 47 1.7391304 0.3043478 1.434783 2.9130435 5.84615385 2.6153846
2 a2 81 45 46 NA 18 78 14 45 22 -0.4444444 -0.4320988 NA NA 2.21428571 0.5714286
3 a3 NA 27 14 29 67 46 37 34 33 NA NA 1.310345 0.5862069 -0.08108108 -0.1081081
If we want to keep the 'v1' columns as well
data %>%
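mutate(across(where(is.character), ~ na_if(.x, "N/A"))) %>% # applied per column, since recent dplyr versions reject na_if() on a whole data frame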
type.convert(as.is = TRUE) %>%
mutate(across(ends_with('.v1'), ~ .x - .x,
.names = "{str_replace(.col, 'v1', 'v1change')}")) %>%
transmute(id, across(ends_with('change')),
across(-c(id, matches("\\.v1$"), ends_with('change')),
~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change")) %>%
select(id, starts_with('wbc'), starts_with('rbc'), starts_with('hct'))
-output
id wbc.v1change wbc.v2_change wbc.v3_change rbc.v1change rbc.v2_change rbc.v3_change hct.v1change hct.v2_change hct.v3_change
1 a1 0 1.7391304 0.3043478 0 1.434783 2.9130435 0 5.84615385 2.6153846
2 a2 0 -0.4444444 -0.4320988 NA NA NA 0 2.21428571 0.5714286
3 a3 NA NA NA 0 1.310345 0.5862069 0 -0.08108108 -0.1081081
data
data <- structure(list(id = c("a1", "a2", "a3"), wbc.v1 = c(23L, 81L,
NA), wbc.v2 = c(63L, 45L, 27L), wbc.v3 = c(30L, 46L, 14L), rbc.v1 = c("23",
"N/A", "29"), rbc.v2 = c(56L, 18L, 67L), rbc.v3 = c(90L, 78L,
46L), hct.v1 = c(13L, 14L, 37L), hct.v2 = c(89L, 45L, 34L), hct.v3 = c(47L,
22L, 33L)), class = "data.frame", row.names = c(NA, -3L))
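Since the question asks about a loop, here is a minimal base R sketch of the same calculation. It assumes that every time-point column is named <variable>.v<number> and that v1 is the baseline; the helper names time_cols, later_cols and base_col are illustrative only:

```r
# Sample data from the question, with "N/A" stored as text in rbc.v1
data <- data.frame(
  id = c("a1", "a2", "a3"),
  wbc.v1 = c(23, 81, NA), wbc.v2 = c(63, 45, 27), wbc.v3 = c(30, 46, 14),
  rbc.v1 = c("23", "N/A", "29"), rbc.v2 = c(56, 18, 67), rbc.v3 = c(90, 78, 46),
  hct.v1 = c(13, 14, 37), hct.v2 = c(89, 45, 34), hct.v3 = c(47, 22, 33)
)

# Turn the "N/A" strings into real NA before converting to numeric
data$rbc.v1 <- as.numeric(replace(data$rbc.v1, data$rbc.v1 == "N/A", NA))

# All time-point columns, and the ones that are not the v1 baseline
time_cols <- grep("\\.v[0-9]+$", names(data), value = TRUE)
later_cols <- time_cols[!grepl("\\.v1$", time_cols)]

# For each later time point, compare against the v1 column of the same variable
for (col in later_cols) {
  base_col <- sub("\\.v[0-9]+$", ".v1", col)
  data[[paste0(col, "_change")]] <- (data[[col]] - data[[base_col]]) / data[[base_col]]
}
print(data)
```

Rows whose baseline is NA get NA for every change column of that variable, which matches the output shown above.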

Failed to use map2 with mutate with purrr and dplyr

I am reading a list of files from my computer and doing several transformations on them with purrr and dplyr. Everything works great, but I have a vector with the ID of each data frame created, and I want to add a column with the corresponding ID to each data frame.
Loading libraries
library(readr)
library(lubridate)
library(dplyr)
library(purrr)
Reading list of files to be read and modified
ArchivosTemp <- list.files(pattern = "Tem.csv")
For reproducibility
let's say the list of data frames called Temperaturas, created by the first line of the code, is
Temperaturas <- list(structure(list(`Date/Time` = c("01-07-2016 14:55", "01-07-2016 15:55",
"01-07-2016 16:55", "01-07-2016 17:55", "01-07-2016 18:55", "01-07-2016 19:55"
), Unit = c("C", "C", "C", "C", "C", "C"), Value = c(28L, 24L,
25L, 25L, 25L, 25L), a = c(68L, 682L, 182L, 182L, 182L, 182L)), .Names = c("Date/Time",
"Unit", "Value", "a"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame")), structure(list(`Date/Time` = c("12-06-2016 19:44",
"12-06-2016 20:44", "12-06-2016 21:44", "12-06-2016 22:44", "12-06-2016 23:44",
"13-06-2016 0:44"), Unit = c("C", "C", "C", "C", "C", "C"), Value = c(31L,
29L, 27L, 26L, 26L, 24L), a = c(129L, 131L, 632L, 633L, 133L,
633L)), .Names = c("Date/Time", "Unit", "Value", "a"), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame")), structure(list(
`Date/Time` = c("07-06-16 7:54:01", "07-06-16 8:54:01", "07-06-16 9:54:01",
"07-06-16 10:54:01", "07-06-16 11:54:01", "07-06-16 12:54:01"
), Unit = c("C", "C", "C", "C", "C", "C"), Value = c(23L,
19L, 25L, 27L, 30L, 34L), a = c("119", "116", "119", "119",
"118", "113")), .Names = c("Date/Time", "Unit", "Value",
"a"), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
)))
and a vector with the ID of each element of the list
IDs <- c("H1F102", "H1F105", "H1F106")
The purrr code that is working so far
a <- ArchivosTemp %>% map(read_csv) %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura))
Since you can't read the CSVs from my computer, let's replace ArchivosTemp %>% map(read_csv) with the list that I made above
a <- Temperaturas %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura))
Then I want each of the 3 data frames to have a column called ID with its corresponding element of the IDs vector. I tried this:
a <- Temperaturas %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura)) %>% map2(y = IDs,~mutate(.x, ID = y.))
but it does not work. Any ideas what I am doing wrong?
Expected outcome
As an example this is the results I expect using only the first data frame
a <- Temperaturas %>% map(~rename(.x, Temperatura = Value, Date.Time = `Date/Time`)) %>% map(~mutate(.x, Date.Time = dmy_hms(Date.Time))) %>% map(~select(.x, Date.Time, Temperatura))
mutate(a[[1]], ID = IDs[1])
which turns into
# A tibble: 6 x 3
Date.Time Temperatura ID
<dttm> <int> <chr>
1 2020-07-01 16:14:55 28 H1F102
2 2020-07-01 16:15:55 24 H1F102
3 2020-07-01 16:16:55 25 H1F102
4 2020-07-01 16:17:55 25 H1F102
5 2020-07-01 16:18:55 25 H1F102
6 2020-07-01 16:19:55 25 H1F102
You have a minor parameter problem with map2: its parameters are named .x and .y, so changing y to .y works for me:
map2(.y = IDs, ~ mutate(.x, ID = .y))
Besides, if you eventually need to bind all elements of the list into a single data frame, you can name the list with the IDs vector using set_names and then specify the .id parameter in map_df. map_df maps over the list and binds the resulting data frames by row (bind_rows), and .id stores the list names in a new column with that name:
Temperaturas %>%
set_names(IDs) %>%
map_df(~ transmute(.x, Date.Time=dmy_hms(`Date/Time`), Temperatura=Value), .id="ID")
# A tibble: 18 x 3
# ID Date.Time Temperatura
# <chr> <dttm> <int>
# 1 H1F102 2020-07-01 16:14:55 28
# 2 H1F102 2020-07-01 16:15:55 24
# 3 H1F102 2020-07-01 16:16:55 25
# 4 H1F102 2020-07-01 16:17:55 25
# 5 H1F102 2020-07-01 16:18:55 25
# 6 H1F102 2020-07-01 16:19:55 25
# 7 H1F105 2020-06-12 16:19:44 31
# 8 H1F105 2020-06-12 16:20:44 29
# 9 H1F105 2020-06-12 16:21:44 27
#10 H1F105 2020-06-12 16:22:44 26
#11 H1F105 2020-06-12 16:23:44 26
#12 H1F105 2020-06-13 16:00:44 24
#13 H1F106 2016-06-07 07:54:01 23
#14 H1F106 2016-06-07 08:54:01 19
#15 H1F106 2016-06-07 09:54:01 25
#16 H1F106 2016-06-07 10:54:01 27
#17 H1F106 2016-06-07 11:54:01 30
#18 H1F106 2016-06-07 12:54:01 34
Also, you can use transmute as a shorthand for rename %>% mutate %>% select.
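To see the .x/.y pairing in isolation, here is a small self-contained sketch. It assumes purrr is installed and uses two made-up data frames (frames, ids and with_id are illustrative names, not from the question):

```r
library(purrr)

# Two tiny stand-ins for the temperature data frames
frames <- list(
  data.frame(Temperatura = c(28L, 24L)),
  data.frame(Temperatura = c(31L, 29L))
)
ids <- c("H1F102", "H1F105")

# map2 walks both inputs in parallel: .x is the data frame, .y is its ID
with_id <- map2(frames, ids, ~ cbind(.x, ID = .y))

# Stack the pieces into one data frame once each carries its ID
combined <- do.call(rbind, with_id)
print(combined)
```

The two inputs must have the same length (or one of them length 1); the first data frame gets the first ID, the second gets the second, and so on.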

Reshaping before and after data

The following data is a very small part from a series of tests before and after a treatment. Right now my data is like this:
Subject Var1 Var2 Var3 Var4
1 A-pre 25 27 23 0
2 A-post 25 26 25 120
3 B-pre 30 28 27 132
4 B-post 30 28 26 140
and I need to reshape it like this:
Subject Var1.pre Var1.post Var2.pre Var2.post Var3.pre Var3.post Var4.pre Var4.post
1 A 25 25 27 26 23 25 0 120
2 B 30 30 28 28 27 26 132 140
I have read many questions on SO and the documentation of data wrangling packages in R, like reshape2, but I could not find anything similar. Any ideas?
Here is the code for replicating the first table:
dat<-structure(list(Subject = structure(c(2L, 1L, 4L, 3L), .Label = c("A-post",
"A-pre", "B-post", "B-pre"), class = "factor"), Var1 = c(25L,
25L, 30L, 30L), Var2 = c(27L, 26L, 28L, 28L), Var3 = c(23L, 25L,
27L, 26L), Var4 = c(0L, 120L, 132L, 140L)), .Names = c("Subject",
"Var1", "Var2", "Var3", "Var4"), row.names = c(NA, -4L), class = "data.frame")
You can use dcast from the devel version of data.table, i.e. v1.9.5, after splitting the 'Subject' column into two using tstrsplit with split as '-'. We use dcast to reshape from 'long' to 'wide' format. The dcast function from data.table can take multiple value.var columns, i.e. 'Var1' to 'Var4'.
library(data.table)#v1.9.5+
#convert the data.frame to data.table with `setDT(dat)`
#split the 'Subject' column with tstrsplit and create two columns
setDT(dat)[, c('Subject', 'New') :=tstrsplit(Subject, '-')]
#change the New column class to 'factor' and specify the levels in order
#so that while using dcast we get the 'pre' column before 'post'
dat[, New:= factor(New, levels=c('pre', 'post'))]
#reshape the dataset
dcast(dat, Subject~New, value.var=grep('^Var', names(dat), value=TRUE),sep=".")
# Subject Var1.pre Var1.post Var2.pre Var2.post Var3.pre Var3.post Var4.pre
#1: A 25 25 27 26 23 25 0
#2: B 30 30 28 28 27 26 132
# Var4.post
#1: 120
#2: 140
NOTE: Instructions to install the devel version are here
An option using dplyr/tidyr would be to split the 'Subject' column into two by separate, convert the 'wide' format to 'long' format using gather, unite the 'Var' column (i.e. Var1 to Var4) and 'New' ('VarNew') and spread the 'long' format to 'wide'.
library(dplyr)
library(tidyr)
dat %>%
separate(Subject, into=c('Subject', 'New')) %>% #split to two columns
gather(Var, Val, Var1:Var4)%>% #change from wide to long. Similar to melt
unite(VarNew, Var, New, sep=".") %>% #unite two columns to form a single
spread(VarNew, Val)#change from 'long' to 'wide'
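In newer versions of tidyr, the gather/unite/spread chain above can be collapsed into a single pivot_wider call, because pivot_wider accepts several value columns at once. A sketch, assuming a tidyr version that provides pivot_wider (1.0.0 or later) and a character Subject column; the name wide is illustrative:

```r
library(tidyr)

# Sample data from the question, with Subject as plain character
dat <- data.frame(
  Subject = c("A-pre", "A-post", "B-pre", "B-post"),
  Var1 = c(25L, 25L, 30L, 30L),
  Var2 = c(27L, 26L, 28L, 28L),
  Var3 = c(23L, 25L, 27L, 26L),
  Var4 = c(0L, 120L, 132L, 140L)
)

wide <- dat %>%
  # split "A-pre" into Subject = "A" and New = "pre"
  separate(Subject, into = c("Subject", "New"), sep = "-") %>%
  # spread every Var column over the pre/post values, naming them Var1.pre etc.
  pivot_wider(names_from = New, values_from = Var1:Var4, names_sep = ".")
print(wide)
```

The result has one row per subject and one column per combination of variable and time point.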

Identifying Duplicate/Unique Teams (and Restructuring Data) in R

I have a data set that looks like this:
Person Team
1 30
2 30
3 30
4 30
11 40
22 40
1 50
2 50
3 50
4 50
15 60
16 60
17 60
1 70
2 70
3 70
4 70
11 80
22 80
My overall goal is to organize the team identification codes so that it is easy to see which teams are duplicates of one another and which teams are unique. I want to summarize the data so that it looks like this:
Team Duplicate1 Duplicate2
30 50 70
40 80
60
As you can see, teams 30, 50, and 70 have identical members, so they share a row. Similarly, teams 40 and 80 have identical members, so they share a row. Only team 60 (in this example) is unique.
In situations where teams are duplicated, I don't care which team id goes in which column. Also, there may be more than 2 duplicates of a team. Teams range in size from 2 members to 8 members.
This answer gives the output data format you asked for. I left the duplicate teams in a single variable because I think it's a better way to handle an arbitrary number of duplicates.
require(dplyr)
df %>%
arrange(Team, Person) %>% # this line is necessary in case the rest of your data isn't sorted
group_by(Team) %>%
summarize(players = paste0(Person, collapse = ",")) %>%
group_by(players) %>%
summarize(teams = paste0(Team, collapse = ",")) %>%
mutate(
# first team ID of each comma-separated list; sub() is vectorized, so every row uses its own comma position
original_team = sub(",.*$", "", teams),
# remaining team IDs, or NA when the team has no duplicates
dup_teams = ifelse(grepl(",", teams), sub("^[^,]*,", "", teams), NA)
)
The result:
Source: local data frame [3 x 4]
players teams original_team dup_teams
1 1,2,3,4 30,50,70 30 50,70
2 11,22 40,80 40 80
3 15,16,17 60 60 NA
Not exactly the format you're wanting, but pretty useful:
# using MrFlick's data
library(dplyr)
dd %>% group_by(Team) %>%
arrange(Person) %>%
summarize(team.char = paste(Person, collapse = "_")) %>%
group_by(team.char) %>%
arrange(team.char, Team) %>%
mutate(duplicate = 1:n())
Source: local data frame [6 x 3]
Groups: team.char
Team team.char duplicate
1 40 11_22 1
2 80 11_22 2
3 60 15_16_17 1
4 30 1_2_3_4 1
5 50 1_2_3_4 2
6 70 1_2_3_4 3
(Edited in the arrange(Person) line in case the data isn't already sorted; got the idea from @Reed's answer.)
Using this for your sample data
dd<-structure(list(Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L,
4L, 15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L), Team = c(30L, 30L,
30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L, 60L, 60L, 60L, 70L, 70L,
70L, 70L, 80L, 80L)), .Names = c("Person", "Team"),
class = "data.frame", row.names = c(NA, -19L))
You could try a table()/interaction() to find duplicate groups. For example
tt <- with(dd, table(Team, Person))
grp <- do.call("interaction", c(data.frame(unclass(tt)), drop=TRUE))
split(rownames(tt), grp)
this returns
$`1.1.1.1.0.0.0.0.0`
[1] "30" "50" "70"
$`0.0.0.0.0.1.1.1.0`
[1] "60"
$`0.0.0.0.1.0.0.0.1`
[1] "40" "80"
so the group "names" are really just membership indicators for each person. You could easily rename them with setNames() if you like. Either way, it collapses the appropriate teams together.
Two more base R options (though not exactly the desired output):
DF2 <- aggregate(Person ~ Team, DF, toString)
> split(DF2$Team, DF2$Person)
$`1, 2, 3, 4`
[1] 30 50 70
$`11, 22`
[1] 40 80
$`15, 16, 17`
[1] 60
Or
( DF2$DupeGroup <- as.integer(factor(DF2$Person)) )
Team Person DupeGroup
1 30 1, 2, 3, 4 1
2 40 11, 22 2
3 50 1, 2, 3, 4 1
4 60 15, 16, 17 3
5 70 1, 2, 3, 4 1
6 80 11, 22 2
Note that the expected output as shown in the question would require adding NAs or empty strings to some of the column entries, because in a data.frame all columns must have the same number of rows. That is different for lists, as you can see in some of the answers.
The second option, but using data.table, since aggregate tends to be slow for large data:
library(data.table)
setDT(DF)[, toString(Person), by=Team][,DupeGroup := .GRP, by=V1][]
Team V1 DupeGroup
1: 30 1, 2, 3, 4 1
2: 40 11, 22 2
3: 50 1, 2, 3, 4 1
4: 60 15, 16, 17 3
5: 70 1, 2, 3, 4 1
6: 80 11, 22 2
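The padding with NAs mentioned above can be sketched in base R to get the exact Team / Duplicate1 / Duplicate2 layout from the question. This is one possible approach rather than the method of any answer here, and the names members, groups and wide are illustrative:

```r
# Sample data from the question
dd <- data.frame(
  Person = c(1L, 2L, 3L, 4L, 11L, 22L, 1L, 2L, 3L, 4L,
             15L, 16L, 17L, 1L, 2L, 3L, 4L, 11L, 22L),
  Team = c(30L, 30L, 30L, 30L, 40L, 40L, 50L, 50L, 50L, 50L,
           60L, 60L, 60L, 70L, 70L, 70L, 70L, 80L, 80L)
)

# One sorted membership string per team, e.g. "1,2,3,4" for team 30
members <- tapply(dd$Person, dd$Team, function(p) paste(sort(p), collapse = ","))

# Teams that share the same membership string end up in the same group
groups <- unname(split(names(members), as.vector(members)))

# Pad every group with NA so all rows have the same number of columns
n <- max(lengths(groups))
padded <- lapply(groups, function(g) c(g, rep(NA, n - length(g))))
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- c("Team", paste0("Duplicate", seq_len(n - 1)))
print(wide)
```

The team IDs come out as character strings, and the number of Duplicate columns grows with the largest group, so the same code covers more than two duplicates.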
Using uniquecombs from the mgcv package:
library(mgcv)
library(magrittr) # for the pipe %>%
# Using MrFlick's data
team_names <- sort(unique(dd$Team))
unique_teams <- with(dd, table(Team, Person)) %>% uniquecombs %>% attr("index")
printout <- unstack(data.frame(team_names, unique_teams))
> printout
$`1`
[1] 60
$`2`
[1] 40 80
$`3`
[1] 30 50 70
Now you could use something like this answer to print it in tabular form (note that the groups are column-wise, not row-wise as in your question):
attributes(printout) <- list(names = names(printout)
, row.names = 1:max(sapply(printout, length))
, class = "data.frame")
> printout
1 2 3
1 60 40 30
2 <NA> 80 50
3 <NA> <NA> 70
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
