Pivoting data frame in R [duplicate] - r

I have a data frame that looks like the following:
Day Minutes Status
1 0 Play
1 10 Eat
1 30 Move
1 50 Transport
2 0 Play
2 20 Transport
2 50 Sleep
Is it possible to pivot the table to have my Day as an index while the column names are the status and the values are the minutes?
Desired output:
Day Play Eat Move Transport Play Transport Sleep
1
2

You can use pivot_wider from tidyr (part of the tidyverse). names_from takes the column that supplies the new column names (Status), and values_from takes the column that fills in the cells (Minutes).
library(tidyverse)
df %>%
  pivot_wider(names_from = "Status", values_from = "Minutes")
Output
Day Play Eat Move Transport Sleep
<int> <int> <int> <int> <int> <int>
1 1 0 10 30 50 NA
2 2 0 NA NA 20 50
Data
df <- structure(list(Day = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), Minutes = c(0L,
10L, 30L, 50L, 0L, 20L, 50L), Status = c("Play", "Eat", "Move",
"Transport", "Play", "Transport", "Sleep")), class = "data.frame", row.names = c(NA,
-7L))
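If tidyr is not available, base R's reshape() is one alternative. The following is only a sketch on the same data: reshape() prefixes the new columns with "Minutes.", so the prefix is stripped afterwards to match the output above. The object name wide is just an illustrative choice.

```r
# Same data as in the answer above
df <- data.frame(
  Day = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Minutes = c(0L, 10L, 30L, 50L, 0L, 20L, 50L),
  Status = c("Play", "Eat", "Move", "Transport", "Play", "Transport", "Sleep")
)

# One row per Day, one column per Status; missing combinations become NA
wide <- reshape(df, idvar = "Day", timevar = "Status", direction = "wide")

# reshape() names the new columns "Minutes.<Status>"; drop that prefix
names(wide) <- sub("^Minutes\\.", "", names(wide))
rownames(wide) <- NULL
wide
```

As with pivot_wider, Status values that never occur on a given Day show up as NA.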

Related

Count an observation based on condition of another variable [duplicate]

I have a dataset of regional patents. I want to count how many Appln_id values have more than one Person_id and how many have only one Person_id.
Appln_id 3 3 3 10 10 10 10 2 4 4
Person_id 23 22 24 49 50 55 51 101 122 104
Here Appln_id 3 has three different Person_id values (23, 22, 24) and Appln_id 2 has only one (101). So I want to count how many Appln_id values have more than one Person_id and how many have only one.
First, count the number of unique Person_id values for each Appln_id.
library(dplyr)
result <- df %>% group_by(Appln_id) %>% summarise(n = n_distinct(Person_id))
result
# Appln_id n
#* <int> <int>
#1 2 1
#2 3 3
#3 4 2
#4 10 4
Now you can count how many of them have only 1 Person_id and how many of them have more than that.
sum(result$n == 1)
#[1] 1
sum(result$n > 1)
#[1] 3
data
df <- structure(list(Appln_id = c(3L, 3L, 3L, 10L, 10L, 10L, 10L, 2L,
4L, 4L), Person_id = c(23L, 22L, 24L, 49L, 50L, 55L, 51L, 101L,
122L, 104L)), class = "data.frame", row.names = c(NA, -10L))
We can use data.table
library(data.table)
setDT(df)[, .(n = uniqueN(Person_id)), by = Appln_id]

group two variables(in rows) in R to create one variable [duplicate]

I have a data frame like this:
Disease Genemutation Mean. Total.No.of.pateints No.of.pateints.
cancertype1 BRCA1 1 10 2
cancertype2 BRCA2 5 10 3
cancertype3 BRCA2 7 10 4
cancertype1 BRCA1 8 10 1
cancertype3 BRCA2 4 10 4
cancertype2 BRCA1 6 10 1
How do I create a new group called cancertype4 (from cancertype3 and cancertype2) whose number of patients is the sum over the two merged groups?
We can use replace with %in% to replace those values (assuming 'Disease' is character class)
library(dplyr)
# df1 is assumed to hold the question's data, with the total column named TotalNoofpateints
df1 %>%
  group_by(Disease = replace(Disease,
    Disease %in% c("cancertype2", "cancertype3"), "cancertype4")) %>%
  summarise(TotalNoofpateints = sum(TotalNoofpateints))
Output
# A tibble: 2 x 2
# Disease TotalNoofpateints
# <chr> <int>
#1 cancertype1 20
#2 cancertype4 40
Here is a base R option using aggregate
aggregate(
  Total.No.of.pateints ~ Disease,
  transform(
    df,
    Disease = replace(Disease, Disease %in% c("cancertype2", "cancertype3"), "cancertype4")
  ),
  sum
)
giving
Disease Total.No.of.pateints
1 cancertype1 20
2 cancertype4 40
Data
df <- structure(list(Disease = c("cancertype1", "cancertype2", "cancertype3",
"cancertype1", "cancertype3", "cancertype2"), Genemutation = c("BRCA1",
"BRCA2", "BRCA2", "BRCA1", "BRCA2", "BRCA1"), Mean. = c(1L, 5L,
7L, 8L, 4L, 6L), Total.No.of.pateints = c(10L, 10L, 10L, 10L,
10L, 10L), No.of.pateints. = c(2L, 3L, 4L, 1L, 4L, 1L)), class = "data.frame", row.names = c(NA,
-6L))
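The dplyr answer above refers to df1 and TotalNoofpateints, whereas the posted data uses df and Total.No.of.pateints. Below is a sketch of the same replace approach adapted to the posted column names. It sums both patient columns, which is an assumption about what the asker wants, and the object name merged is illustrative.

```r
library(dplyr)

# Relevant columns of the posted data
df <- data.frame(
  Disease = c("cancertype1", "cancertype2", "cancertype3",
              "cancertype1", "cancertype3", "cancertype2"),
  Total.No.of.pateints = rep(10L, 6),
  No.of.pateints. = c(2L, 3L, 4L, 1L, 4L, 1L)
)

merged <- df %>%
  # relabel cancertype2 and cancertype3 as cancertype4 while grouping
  group_by(Disease = replace(Disease,
    Disease %in% c("cancertype2", "cancertype3"), "cancertype4")) %>%
  # sum both patient-count columns within each (relabelled) disease
  summarise(across(c(Total.No.of.pateints, No.of.pateints.), sum))

merged
```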

How to get the difference between groups with a dataframe in long format in R?

I have a simple data frame with 2 IDs (N = 2) and 2 periods (T = 2), for example:
year id points
1 1 10
1 2 12
2 1 20
2 2 18
How does one achieve the following data frame (preferably using dplyr or any tidyverse solution)?
id points_difference
1 10
2 6
Notice that the points_difference column is the difference within each ID across time (namely T2 - T1).
Additionally, how can this be generalized to multiple columns and multiple IDs (with only 2 periods)?
year id points scores
1 1 10 7
1 ... ... ...
1 N 12 8
2 1 20 9
2 ... ... ...
2 N 12 9
id points_difference scores_difference
1 10 2
... ... ...
N 0 1
If you are on dplyr 1.0.0 or higher, summarise can return multiple rows per group, so this also works with more than 2 periods. (From dplyr 1.1.0 onward, returning multiple rows from summarise is deprecated in favour of reframe.) You can do:
library(dplyr)
df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  summarise(across(c(points, scores), diff, .names = '{col}_difference'))
# id points_difference scores_difference
# <int> <int> <int>
#1 1 10 2
#2 1 -7 1
#3 2 6 2
#4 2 -3 3
data
df <- structure(list(year = c(1L, 1L, 2L, 2L, 3L, 3L), id = c(1L, 2L,
1L, 2L, 1L, 2L), points = c(10L, 12L, 20L, 18L, 13L, 15L), scores = c(2L,
3L, 4L, 5L, 5L, 8L)), class = "data.frame", row.names = c(NA, -6L))
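For the strict two-period case in the question, a variant that always returns one row per id is to subtract the first value from the last. This is a sketch on the question's first table, assuming a reasonably recent dplyr (for the {.col} placeholder); df2 and diffs are illustrative names.

```r
library(dplyr)

# The question's two-period example
df2 <- data.frame(
  year   = c(1L, 1L, 2L, 2L),
  id     = c(1L, 2L, 1L, 2L),
  points = c(10L, 12L, 20L, 18L)
)

diffs <- df2 %>%
  arrange(id, year) %>%   # make sure period 1 comes before period 2
  group_by(id) %>%
  # every column except year (and the grouping column id): last minus first
  summarise(across(-year, ~ last(.x) - first(.x), .names = "{.col}_difference"))

diffs
```

Because across(-year, ...) picks up every remaining column, adding further measurement columns such as scores should need no change to the pipeline.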

How would I add a Total row for each value in a specific column that does calculations based upon other columns?

Assume I have this data frame
What I want is this
For each value of the month variable, I want to add a row that holds the sum of the total variable over all persons in that month, together with that month's unique days_in_month value.
Is there a quick and easy way to do this that does not involve multiple spreads and gathers with adorn_totals, after which I would have to change days_in_month back to its original value once the totals were summed?
One option would be to group by 'month' and 'days_in_month' and apply adorn_totals to each group with group_modify (group_map in current dplyr returns a list, which cannot be passed on to select)
library(dplyr)
library(janitor)
df1 %>%
  group_by(month, days_in_month) %>%
  group_modify(~ .x %>%
    adorn_totals("row")) %>%
  select(names(df1))
# A tibble: 10 x 4
# Groups: month, days_in_month [2]
# month person total days_in_month
# <int> <chr> <int> <int>
# 1 1 John 7 31
# 2 1 Jane 18 31
# 3 1 Tim 20 31
# 4 1 Cindy 11 31
# 5 1 Total 56 31
# 6 2 John 18 28
# 7 2 Jane 13 28
# 8 2 Tim 15 28
# 9 2 Cindy 9 28
#10 2 Total 55 28
If we need other statistics, we can compute them inside group_modify
library(tibble)
df1 %>%
  group_by(month, days_in_month) %>%
  group_modify(~ bind_rows(.x, tibble(person = "Mean", total = mean(.x$total))))
data
df1 <- structure(list(month = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), person = c("John",
"Jane", "Tim", "Cindy", "John", "Jane", "Tim", "Cindy"), total = c(7L,
18L, 20L, 11L, 18L, 13L, 15L, 9L), days_in_month = c(31L, 31L,
31L, 31L, 28L, 28L, 28L, 28L)), class = "data.frame", row.names = c(NA,
-8L))
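If janitor is not an option, the total rows can also be built with a plain summarise and appended with bind_rows. This is a sketch using dplyr only; totals and result are illustrative names.

```r
library(dplyr)

df1 <- data.frame(
  month = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  person = c("John", "Jane", "Tim", "Cindy", "John", "Jane", "Tim", "Cindy"),
  total = c(7L, 18L, 20L, 11L, 18L, 13L, 15L, 9L),
  days_in_month = c(31L, 31L, 31L, 31L, 28L, 28L, 28L, 28L)
)

# One "Total" row per month; days_in_month is kept by grouping on it
totals <- df1 %>%
  group_by(month, days_in_month) %>%
  summarise(person = "Total", total = sum(total), .groups = "drop")

# Append the total rows and place each one after its month's person rows
result <- bind_rows(df1, totals) %>%
  arrange(month, person == "Total")

result
```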

Reducing multiple rows to 1 by index in R [duplicate]

I am relatively new to R. I am working with a dataset that has multiple data points per timestamp, but they are spread over multiple rows. I am trying to make a single row for each timestamp with a column for each variable.
Example dataset
Time Variable Value
10 Speed 10
10 Acc -2
10 Energy 10
15 Speed 9
15 Acc -1
20 Speed 9
20 Acc 0
20 Energy 2
I'd like to convert this to
Time Speed Acc Energy
10 10 -2 10
15 9 -1 (blank or N/A)
20 9 0 2
These are measured values so they are not always complete.
I have tried ddply to extract each individual value into an array and recombine, but the columns are different lengths. I have tried aggregate, but I can't figure out how to keep the variable and value linked. I know I could do this with a for loop type solution, but that seems a poor way to do this in R. Any advice or direction would help. Thanks!
Assuming the data frame is named df:
library(tidyr)
spread(df, Variable, Value)
Typically a job for dcast in reshape2. First, we make your example reproducible:
df <- structure(list(Time = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
Variable = structure(c(3L, 1L, 2L, 3L, 1L, 3L, 1L, 2L), .Label = c("Acc",
"Energy", "Speed"), class = "factor"), Value = c(10L, -2L, 10L,
9L, -1L, 9L, 0L, 2L)), .Names = c("Time", "Variable", "Value"),
class = "data.frame", row.names = c(NA, -8L))
Then:
library(reshape2)
dcast(df, Time ~ ...)
Time Acc Energy Speed
10 -2 10 10
15 -1 NA 9
20 0 2 9
With dplyr you can reorder the columns (purely cosmetic) with:
library(dplyr)
dcast(df, Time ~ ...) %>% select(Time, Speed, Acc, Energy)
Time Speed Acc Energy
10 10 -2 10
15 9 -1 NA
20 9 0 2
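Since these answers were written, spread() has been superseded in tidyr by pivot_wider(). Here is a sketch of the same reshape with the newer function; Variable is stored as character here (an assumption, the dput above uses a factor) so that the new columns appear in order of first appearance.

```r
library(tidyr)

df <- data.frame(
  Time = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
  Variable = c("Speed", "Acc", "Energy", "Speed", "Acc", "Speed", "Acc", "Energy"),
  Value = c(10L, -2L, 10L, 9L, -1L, 9L, 0L, 2L)
)

# One row per Time, one column per Variable; unmeasured values become NA
wide <- pivot_wider(df, names_from = Variable, values_from = Value)
wide
```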
