Removing NA values and creating new columns when reshaping a dataframe - r

I have the following example df. The actual df has 80 rows and 10 columns:
Fruits PRS_001_Person_ABCD PRS_002_Person_ABCD PRS_015_Person_ABCD PRS_016_Person_ABCD
Apple 0.5 1.3 NA NA
Orange 0.2 NA 0.021 NA
Grape NA 0.06 NA 0.7
Berry NA NA 0.3 0.04
Apple NA 1.3 0.5 NA
I would like to have the following data frame:
Fruits Value1 Value2 Person1 Person2
Apple 0.5 1.3 Product 1 Product 2
Orange 0.2 0.021 Product 1 Product 15
Grape 0.06 0.7 Product 2 Product 16
Berry 0.3 0.04 Product 15 Product 16
Apple 1.3 0.5 Product 2 Product 15

Using tidyr and dplyr:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(rn = row_number()) %>%
pivot_longer(-c(rn,Fruits), values_to = "Value", values_drop_na = TRUE) %>%
mutate(Person = paste("Product",
str_remove(str_extract(name, "[[:digit:]]+"),"^0+"))) %>%
group_by(Fruits, rn) %>%
mutate(row = row_number()) %>%
pivot_wider(id_cols = c(Fruits, rn), names_from = row, values_from = c(Value, Person), names_sep = "") %>%
ungroup %>% select(-rn)
#> # A tibble: 5 x 5
#> Fruits Value1 Value2 Person1 Person2
#> <chr> <dbl> <dbl> <chr> <chr>
#> 1 Apple 0.5 1.3 Product 1 Product 2
#> 2 Orange 0.2 0.021 Product 1 Product 15
#> 3 Grape 0.06 0.7 Product 2 Product 16
#> 4 Berry 0.3 0.04 Product 15 Product 16
#> 5 Apple 1.3 0.5 Product 2 Product 15
Data:
read.table(text = "Fruits PRS_001_Person_ABCD PRS_002_Person_ABCD PRS_015_Person_ABCD PRS_016_Person_ABCD
Apple 0.5 1.3 NA NA
Orange 0.2 NA 0.021 NA
Grape NA 0.06 NA 0.7
Berry NA NA 0.3 0.04
Apple NA 1.3 0.5 NA ",
stringsAsFactors=FALSE, header = TRUE) -> df1

Here is another option with tidyverse
library(dplyr)
library(tidyr)
library(stringr)
library(hacksaw)
df1 %>%
mutate(across(-Fruits,
~ case_when(!is.na(.x)~ sprintf("Product %d",
readr::parse_number(cur_column()))), .names = "Person_{.col}")) %>%
unite(Person, starts_with("Person"), na.rm = TRUE) %>%
separate_wider_delim(Person, delim = '_', names = c("Person1", "Person2")) %>%
relocate(matches("Person\\d+"), .after = Fruits) %>%
as.data.frame %>%
shift_row_values() %>%
rename_with(~ str_c("Value", seq_along(.x)), contains("_")) %>%
select(Fruits, starts_with("Value"), starts_with("Person"),
-where(~ all(is.na(.x)))) %>%
as_tibble %>%
type.convert(as.is = TRUE)
Output:
# A tibble: 5 × 5
Fruits Value1 Value2 Person1 Person2
<chr> <dbl> <dbl> <chr> <chr>
1 Apple 0.5 1.3 Product 1 Product 2
2 Orange 0.2 0.021 Product 1 Product 15
3 Grape 0.06 0.7 Product 2 Product 16
4 Berry 0.3 0.04 Product 15 Product 16
5 Apple 1.3 0.5 Product 2 Product 15
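For comparison, the same reshaping can be sketched in base R without any packages. This is only an illustrative alternative, not one of the answers above. It assumes every row has exactly two non-NA values, as in the example data, and the names `vals`, `prod`, `keep`, `rows` and `res` are made up for the sketch.

```r
df1 <- read.table(text = "Fruits PRS_001_Person_ABCD PRS_002_Person_ABCD PRS_015_Person_ABCD PRS_016_Person_ABCD
Apple 0.5 1.3 NA NA
Orange 0.2 NA 0.021 NA
Grape NA 0.06 NA 0.7
Berry NA NA 0.3 0.04
Apple NA 1.3 0.5 NA",
header = TRUE, stringsAsFactors = FALSE)

# Numeric part of the table, one column per PRS_* variable
vals <- as.matrix(df1[-1])

# Product number taken from the column name, e.g. PRS_015_Person_ABCD -> 15
prod <- as.integer(sub("^PRS_([0-9]+)_.*$", "\\1", colnames(vals)))

# For each row, the positions of the two non-NA columns
# (assumption: exactly two per row, otherwise apply() will not return a matrix)
keep <- t(apply(vals, 1, function(r) unname(which(!is.na(r)))))
rows <- seq_len(nrow(vals))

res <- data.frame(
  Fruits  = df1$Fruits,
  Value1  = vals[cbind(rows, keep[, 1])],
  Value2  = vals[cbind(rows, keep[, 2])],
  Person1 = paste("Product", prod[keep[, 1]]),
  Person2 = paste("Product", prod[keep[, 2]])
)
res
```

If some rows could hold a different number of non-NA values, the long-then-wide approach with pivot_longer and pivot_wider is the safer choice, because it pads missing positions with NA.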

Related

Reshaping data frame with many NAs

I have a data frame in R with four variables:
id var1 var2 var3
1 NA 0.4 NA
1 0.8 NA NA
2 0.7 NA NA
2 NA 0.5 NA
2 NA NA 0.1
3 NA 0.5 NA
3 NA NA 0.2
There are repeated entries per id and each observation only contains one data value besides the id. I would like to obtain one observation per id with all of the data values "collected".
The output should look like this:
id var1 var2 var3
1 0.8 0.4 NA
2 0.7 0.5 0.1
3 NA 0.5 0.2
I have played around with pivot_wider, data.table, gather, but am not getting anywhere. It seems to me that this should be very simple. Like some sort of collapse. Grateful for any pointers.
Or using summarise per group:
library(dplyr)
df |>
group_by(id) |>
summarise(across(everything(), ~ first(na.omit(.))))
Output:
# A tibble: 3 × 4
id var1 var2 var3
<int> <dbl> <dbl> <dbl>
1 1 0.8 0.4 NA
2 2 0.7 0.5 0.1
3 3 NA 0.5 0.2
Thanks to @Darren Tsai for the data.
You can use tidyr::fill by groups and then subset unique rows.
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
fill(var1:var3, .direction = "downup") %>%
distinct() %>%
ungroup()
# # A tibble: 3 × 4
# id var1 var2 var3
# <int> <dbl> <dbl> <dbl>
# 1 1 0.8 0.4 NA
# 2 2 0.7 0.5 0.1
# 3 3 NA 0.5 0.2
Data
df <- read.table(text = "
id var1 var2 var3
1 NA 0.4 NA
1 0.8 NA NA
2 0.7 NA NA
2 NA 0.5 NA
2 NA NA 0.1
3 NA 0.5 NA
3 NA NA 0.2", header = TRUE)
You can first pivot_longer, then remove NA, and finally pivot_wider back again:
library(tidyverse)
df %>%
pivot_longer(-id) %>%
na.omit() %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 3 × 4
id var2 var1 var3
<dbl> <dbl> <dbl> <dbl>
1 1 0.4 0.8 NA
2 2 0.5 0.7 0.1
3 3 0.5 NA 0.2
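The "collapse" idea behind these answers (keep the first non-NA value per id and column) can also be sketched in base R with aggregate(). This is an illustrative alternative rather than one of the posted answers; `first_non_na` is a hypothetical helper name, and the sketch assumes that when an id has several non-NA values in one column, taking the first is acceptable.

```r
df <- read.table(text = "
id var1 var2 var3
1 NA 0.4 NA
1 0.8 NA NA
2 0.7 NA NA
2 NA 0.5 NA
2 NA NA 0.1
3 NA 0.5 NA
3 NA NA 0.2", header = TRUE)

# Hypothetical helper: first non-missing value, or NA if the group has none
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA
}

# One row per id; every other column is collapsed with first_non_na
res <- aggregate(df[-1], by = list(id = df$id), FUN = first_non_na)
res
```

Unlike the pivot_longer / pivot_wider round trip, this keeps the columns in their original order (var1, var2, var3).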

How to count different groups using dplyr in R

I have the following df structure:
category difference factor
a -0.12 1
a -0.12 2
b -0.17 3
b -0.21 4
I want to categorise this data so that I can identify each category by a number and rank the rows within each category according to decreasing differences. The expected result is something like this:
category difference factor catCount rank
a -0.12 1 2 2
a -0.12 2 2 1
b -0.17 3 1 2
b -0.21 4 1 1
I'm using the following code to achieve this:
df %>% group_by(category) %>% mutate(categoryNumber = n_distinct(category)) %>% mutate(rank = rank(difference, ties.method = 'last'))
but I am getting this output:
category difference factor catCount rank
a -0.12 1 2 2
a -0.12 2 2 1
b -0.17 3 2 2
b -0.21 4 2 1
Any suggestions for this?
Use this:
df %>% group_by(category, catcnt = dense_rank(desc(category))) %>%
mutate(rank = rank(difference, ties.method = 'last'))
# A tibble: 4 x 5
# Groups: category [2]
category difference factor catcnt rank
<chr> <dbl> <int> <int> <int>
1 a -0.12 1 2 2
2 a -0.12 2 2 1
3 b -0.17 3 1 2
4 b -0.21 4 1 1
Counting n_distinct(category) within each category group will always give 1. Try this:
library(dplyr)
df %>%
arrange(category, difference) %>%
group_by(category) %>%
mutate(catCount = cur_group_id(),
rank = row_number()) %>%
ungroup()
# category difference factor catCount rank
# <chr> <dbl> <int> <int> <int>
#1 a -0.12 1 1 1
#2 a -0.12 2 1 2
#3 b -0.21 4 2 1
#4 b -0.17 3 2 2
Here catCount is a unique number for each category, whereas rank numbers the rows within each category in ascending order of difference (most negative first).
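The two pieces used in these answers, a per-category identifier and a within-category rank, can also be sketched in base R. This is an illustrative alternative, not the posters' code. It assumes the identifier should follow the sorted category labels (a = 1, b = 2) and that ties are broken by row order; the data frame is rebuilt from the question's table.

```r
df <- data.frame(
  category   = c("a", "a", "b", "b"),
  difference = c(-0.12, -0.12, -0.17, -0.21),
  factor     = 1:4
)

# Per-category identifier: integer codes of the factor levels (sorted labels)
df$catCount <- as.integer(factor(df$category))

# Within-category rank, smallest difference first; ties keep their row order
df$rank <- ave(df$difference, df$category,
               FUN = function(x) rank(x, ties.method = "first"))
df
```

To number the categories in the opposite direction, as in the question's expected output (a = 2, b = 1), the factor levels could be reversed before converting to integer.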

How to create a new dataframe from an existing one grouping by day and individual and calculating proportions?

I have a dataframe with info about the state (State) of different individuals (ID) over time (Datetime). Below I show an example of what I have:
df <- data.frame(ID=c(rep(c("A"),8),rep(c("B"),8)),
Datetime=c("2020-08-05 12:00:00","2020-08-05 17:00:00","2020-08-05 18:03:00","2020-08-05 22:54:00","2020-08-06 01:08:00","2020-08-06 13:26:00","2020-08-06 19:04:00","2020-08-08 11:00:00",
"2020-08-04 03:00:00","2020-08-04 15:00:00","2020-08-04 23:00:00","2020-08-06 14:00:00","2020-08-06 17:00:00","2020-08-06 20:00:00","2020-08-07 04:00:00","2020-08-07 16:00:00"),
State=c(1,2,1,1,1,1,2,2,1,1,1,2,2,1,1,1))
df$Datetime <- as.POSIXct(df$Datetime,format="%Y-%m-%d %H:%M:%S", tz="UTC")
df
ID Datetime State
1 A 2020-08-05 12:00:00 1
2 A 2020-08-05 17:00:00 2
3 A 2020-08-05 18:03:00 1
4 A 2020-08-05 22:54:00 1
5 A 2020-08-06 01:08:00 1
6 A 2020-08-06 13:26:00 1
7 A 2020-08-06 19:04:00 2
8 A 2020-08-08 11:00:00 2
9 B 2020-08-04 03:00:00 1
10 B 2020-08-04 15:00:00 1
11 B 2020-08-04 23:00:00 1
12 B 2020-08-06 14:00:00 2
13 B 2020-08-06 17:00:00 2
14 B 2020-08-06 20:00:00 1
15 B 2020-08-07 04:00:00 1
16 B 2020-08-07 16:00:00 1
I want to calculate the proportion of time by day that each one of my individuals has spent in state 1 and 2. That is, I would like to get this:
ID DateTime State.1 State.2
1 A 2020-08-05 0.75 0.25 # Individual `A` was in 3 out of the four records (=rows) in state `1` for this day.
2 A 2020-08-06 0.66 0.33
3 A 2020-08-08 0.00 1.00
4 B 2020-08-04 1.00 0.00
5 B 2020-08-06 0.33 0.66
6 B 2020-08-07 1.00 0.00
However, I don't know exactly how to do all this at once, and my dataframe is too large to do it manually.
Does anyone know how to do it?
Thanks in advance
Does this work:
library(lubridate)
library(dplyr)
df %>% mutate(date = format(Datetime, format = '%Y-%m-%d')) %>% group_by(ID, date) %>%
summarise(State.1 = sum(+(State == 1))/n(), State.2 = sum(+(State == 2))/n())
`summarise()` regrouping output by 'ID' (override with `.groups` argument)
# A tibble: 6 x 4
# Groups: ID [2]
ID date State.1 State.2
<chr> <chr> <dbl> <dbl>
1 A 2020-08-05 0.75 0.25
2 A 2020-08-06 0.667 0.333
3 A 2020-08-08 0 1
4 B 2020-08-04 1 0
5 B 2020-08-06 0.333 0.667
6 B 2020-08-07 1 0
Updated answer to include missing dates:
library(tidyr) # for complete()
df %>% mutate(date = format(Datetime, format = '%Y-%m-%d')) %>% group_by(ID, date) %>%
summarise(State.1 = sum(+(State == 1))/n(), State.2 = sum(+(State == 2))/n()) %>%
ungroup() %>% complete(ID, nesting(date))
`summarise()` regrouping output by 'ID' (override with `.groups` argument)
# A tibble: 10 x 4
ID date State.1 State.2
<chr> <chr> <dbl> <dbl>
1 A 2020-08-04 NA NA
2 A 2020-08-05 0.75 0.25
3 A 2020-08-06 0.667 0.333
4 A 2020-08-07 NA NA
5 A 2020-08-08 0 1
6 B 2020-08-04 1 0
7 B 2020-08-05 NA NA
8 B 2020-08-06 0.333 0.667
9 B 2020-08-07 1 0
10 B 2020-08-08 NA NA
Maybe this works; I just want to use functions I recently learned:
df %>%
mutate(dt = substring(Datetime, 1, 10), val = 1) %>%
pivot_wider(
id_cols = c(ID, dt),
names_from = State,
values_from = val,
names_glue = "State.{State}",
values_fn = sum,
values_fill = 0
) %>%
mutate(
State.1 = State.1/(State.1 + State.2),
State.2 = 1 - State.1
)
# A tibble: 6 x 4
ID dt State.1 State.2
<chr> <chr> <dbl> <dbl>
1 A 2020-08-05 0.75 0.25
2 A 2020-08-06 0.667 0.333
3 A 2020-08-08 0 1
4 B 2020-08-04 1 0
5 B 2020-08-06 0.333 0.667
6 B 2020-08-07 1 0
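The proportions computed above are simply per-day means of a logical condition, so `sum(+(State == 1))/n()` gives the same number as `mean(State == 1)`. A minimal base R sketch of that idea follows. It is not one of the posted answers, and to stay short it uses only individual A's rows from the question's data; the variable names `dt`, `state`, `day`, `share1` and `prop` are chosen for illustration.

```r
dt <- as.POSIXct(c("2020-08-05 12:00:00", "2020-08-05 17:00:00",
                   "2020-08-05 18:03:00", "2020-08-05 22:54:00",
                   "2020-08-06 01:08:00", "2020-08-06 13:26:00",
                   "2020-08-06 19:04:00", "2020-08-08 11:00:00"),
                 format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
state <- c(1, 2, 1, 1, 1, 1, 2, 2)

# Calendar day of each record, as text
day <- format(dt, "%Y-%m-%d")

# Share of records in state 1 per day: the mean of a logical vector
share1 <- tapply(state == 1, day, mean)

# Both states at once: row-wise proportions of the day-by-state count table
prop <- prop.table(table(day, state), margin = 1)
prop
```

For several individuals, the grouping key would have to combine ID and day, which is what group_by(ID, date) does in the dplyr answers.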

dplyr to calculate fraction by group

There are only 2 farms, but tons of fruit. I'm trying to see which farm has been performing better over 3 years, where performance is simply farm_i / (farm1 + farm2). For example, for fruit == peach in y2019, farm1's share was 20% vs. 80% for farm2.
sample data:
df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
farm = as.factor(c(1,2,1,2,1,2,1,2)), 'y2019' = c(0,0,3,12,0,7,4,6),
'y2018' = c(5,3,0,0,8,2,0,0),'y2017' = c(4,5,7,15,0,0,0,0) )
> df
fruit farm y2019 y2018 y2017
1 apple 1 0 5 4
2 apple 2 0 3 5
3 peach 1 3 0 7
4 peach 2 12 0 15
5 pear 1 0 8 0
6 pear 2 7 2 0
7 lime 1 4 0 0
8 lime 2 6 0 0
desired output:
out
fruit farm y2019 y2018 y2017
1 apple 1 0.0 0.625 0.444444
2 apple 2 0.0 0.375 0.555556
3 peach 1 0.2 0.000 0.318182
4 peach 2 0.8 0.000 0.681818
5 pear 1 0.0 0.800 0.000000
6 pear 2 1.0 0.200 0.000000
7 lime 1 0.4 0.000 0.000000
8 lime 2 0.6 0.000 0.000000
This is as far as I could go:
df %>%
group_by(fruit) %>%
summarise(across(where(is.numeric), sum))
We can group by 'fruit', then mutate across the columns that start with 'y' to divide each element by the group sum of its column; if all values in the group are 0, return 0 instead
library(dplyr)
df %>%
group_by(fruit) %>%
mutate(across(starts_with('y'), ~ if(all(. == 0)) 0 else ./sum(.)))
# A tibble: 8 x 5
# Groups: fruit [4]
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 0 0.625 0.444
#2 apple 2 0 0.375 0.556
#3 peach 1 0.2 0 0.318
#4 peach 2 0.8 0 0.682
#5 pear 1 0 0.8 0
#6 pear 2 1 0.2 0
#7 lime 1 0.4 0 0
#8 lime 2 0.6 0 0
NOTE: Here, we just used dplyr package and it is done in a single step
Or another option is adorn_percentages from janitor
library(janitor)
library(purrr)
df %>%
group_split(fruit) %>%
map_dfr(adorn_percentages, denominator = "col") %>%
as_tibble
Or using data.table
library(data.table)
setDT(df)[, (3:5) := lapply(.SD, function(x) if(all(x == 0)) 0
else x/sum(x, na.rm = TRUE)), .SDcols = 3:5, by = fruit][]
Or using base R
grpSums <- rowsum(df[3:5], df$fruit)
df[3:5] <- df[3:5]/grpSums[match(df$fruit, row.names(grpSums)),]
We can use prop.table to calculate the proportions for each fruit.
library(dplyr)
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), prop.table),
#to replace `NaN` with 0
across(where(is.numeric), ~ tidyr::replace_na(.x, 0)))
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 0 0.625 0.444
#2 apple 2 0 0.375 0.556
#3 peach 1 0.2 0 0.318
#4 peach 2 0.8 0 0.682
#5 pear 1 0 0.8 0
#6 pear 2 1 0.2 0
#7 lime 1 0.4 0 0
#8 lime 2 0.6 0 0
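The group-wise share, including the "all zeros gives 0" rule, can also be sketched in base R with ave(). This is an illustrative variant, not one of the posted answers; `share` and `ycols` are hypothetical names, and the sketch assumes the year columns contain no NA and no negative values.

```r
df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
                 farm = as.factor(c(1, 2, 1, 2, 1, 2, 1, 2)),
                 y2019 = c(0, 0, 3, 12, 0, 7, 4, 6),
                 y2018 = c(5, 3, 0, 0, 8, 2, 0, 0),
                 y2017 = c(4, 5, 7, 15, 0, 0, 0, 0))

ycols <- c("y2019", "y2018", "y2017")

# Hypothetical helper: each value divided by the group total,
# or all zeros when the group total is 0 (avoids 0/0 = NaN)
share <- function(x) if (sum(x) == 0) rep(0, length(x)) else x / sum(x)

out <- df
# ave() applies share() within each fruit and returns values in row order
out[ycols] <- lapply(df[ycols], function(col) ave(col, df$fruit, FUN = share))
out
```

The zero check matters because a plain division by the group sum returns NaN for fruits with no production in a given year.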

How to replace unique values with index number using mutate function?

I would like to replace unique values with an index number using dplyr::mutate.
I am grouping by a couple of different variables to access the appropriate subset of my dataframe.
head(df)
group start_time end_time
1 group1 0 0.4
2 group1 0 0.4
3 group1 0 0.4
4 group1 0.4 0.8
5 group1 0.4 0.8
6 group2 0.0 0.4
7 group2 0.4 0.8
8 group2 0.8 1.02
I group_by 'group,' and then by 'start_time.' Sometimes a given group has only one start_time, sometimes two start_times, or sometimes three. I need to create a new variable, 'idx,' for each unique start_time. But I can't think how to do it.
new_df <- df %>%
group_by(group, start_time) %>%
mutate(idx = row_number()) %>%
as.data.frame
Creating a new variable using row_number() isn't right. It gives me:
idx
1
2
3
1
2
1
1
1
But I want:
idx
1
1
1
2
2
1
2
3
I thought of replacing each unique value in group_by with a number? And repeating?
We can use match after grouping by 'group'
library(tidyverse)
df %>%
group_by(group) %>%
mutate(idx = match(start_time, unique(start_time)))
# A tibble: 8 x 4
# Groups: group [2]
# group start_time end_time idx
# <chr> <dbl> <dbl> <int>
#1 group1 0 0.4 1
#2 group1 0 0.4 1
#3 group1 0 0.4 1
#4 group1 0.4 0.8 2
#5 group1 0.4 0.8 2
#6 group2 0 0.4 1
#7 group2 0.4 0.8 2
#8 group2 0.8 1.02 3
Or another option is group_indices
df %>%
group_split(group) %>%
map_df(~ .x %>%
mutate(idx = group_indices(., start_time)))
NOTE: If the 'idx' needs to be created outside the 'group', then remove the group_by step
NOTE2: In the OP's example, both (with/without group_by) give the same output
We can actually do this easily using R's factor type. A factor variable is stored as integers that refer to a table of levels which holds the actual values. We can then use as.integer or as.numeric to convert from factor back to a number. When you do that, the levels table is lost and you're left with only the integers that would refer back to it; normally this is undesired (you want your actual values, not the encoded values) but in this case it's desirable since identical values will be encoded with the same number:
df <- structure(list(group = c("group1", "group1", "group1", "group1",
"group1", "group2", "group2", "group2"), start_time = c(0, 0,
0, 0.4, 0.4, 0, 0.4, 0.8), end_time = c(0.4, 0.4, 0.4, 0.8, 0.8,
0.4, 0.8, 1.02)), class = "data.frame", row.names = c(NA, -8L
))
df %>%
mutate(idx = as.integer(factor(start_time)))
group start_time end_time idx
1 group1 0.0 0.40 1
2 group1 0.0 0.40 1
3 group1 0.0 0.40 1
4 group1 0.4 0.80 2
5 group1 0.4 0.80 2
6 group2 0.0 0.40 1
7 group2 0.4 0.80 2
8 group2 0.8 1.02 3
As an added benefit, this works just as well in base R:
df$idx <- as.integer(factor(df$start_time))
df
group start_time end_time idx
1 group1 0.0 0.40 1
2 group1 0.0 0.40 1
3 group1 0.0 0.40 1
4 group1 0.4 0.80 2
5 group1 0.4 0.80 2
6 group2 0.0 0.40 1
7 group2 0.4 0.80 2
8 group2 0.8 1.02 3
Another option is data.table::frank (short for fast rank)
df %>%
group_by(group) %>%
mutate(idx = data.table::frank(start_time, ties.method = 'dense'))
# # A tibble: 8 x 4
# # Groups: group [2]
# group start_time end_time idx
# <chr> <dbl> <dbl> <int>
# 1 group1 0 0.4 1
# 2 group1 0 0.4 1
# 3 group1 0 0.4 1
# 4 group1 0.4 0.8 2
# 5 group1 0.4 0.8 2
# 6 group2 0 0.4 1
# 7 group2 0.4 0.8 2
# 8 group2 0.8 1.02 3
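The per-group version of the index can also be written in base R by combining ave() with the match/unique idiom. This is an illustrative sketch, not one of the posted answers; it numbers start times in order of first appearance within each group, whereas factor() and a dense rank number them in sorted order. On the example data the two orderings coincide.

```r
df <- data.frame(
  group      = c("group1", "group1", "group1", "group1",
                 "group1", "group2", "group2", "group2"),
  start_time = c(0, 0, 0, 0.4, 0.4, 0, 0.4, 0.8),
  end_time   = c(0.4, 0.4, 0.4, 0.8, 0.8, 0.4, 0.8, 1.02)
)

# Within each group, the position of each start_time among the
# group's unique start times (order of first appearance)
df$idx <- ave(df$start_time, df$group,
              FUN = function(x) match(x, unique(x)))
df
```

Note that the ungrouped `as.integer(factor(start_time))` would give different numbers from this if, say, one group only contained the start time 0.8: the global encoding would label it 3, while the per-group index would label it 1.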
