R: Dataframe Manipulation

I have the following dataframe:
ID   COUNT OF STOCK   YEAR
A1   10               2000
A1   20               2000
A1   18               2000
A1   15               2001
A1   30               2001
A2   35               2002
A2   50               2001
A2   10               2002
A2   22               2002
A3   11               2001
A3   15               2001
A3   28               2000
I would like to change the dataframe to the one shown below by grouping on ID and YEAR (the year is then used to compute the number of years from 2020) to find the sum of COUNT OF STOCK:
ID   Sum of COUNT OF STOCK   number of years from 2020 (2020-year)
A1   48                      20
A1   45                      19
A2   67                      18
A2   50                      19
A3   26                      19
A3   28                      20
Thanks in advance!!

This is pretty straightforward. To work with those verbose column names you will have to quote them with backticks, though, which might be a challenge.
library(dplyr)

dat %>%
  group_by(ID, YEAR) %>%
  summarise(
    `Sum of COUNT OF STOCK` = sum(`COUNT OF STOCK`),
    `number of years from 2020 (2020-year)` = 2020 - first(YEAR)
  ) %>%
  select(-YEAR)
Output:
ID `Sum of COUNT OF STOCK` `number of years from 2020 (2020-year)`
<chr> <int> <dbl>
1 A1 48 20
2 A1 45 19
3 A2 50 19
4 A2 67 18
5 A3 28 20
6 A3 26 19
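If the backticks get unwieldy, a sketch of one alternative is to rename to simpler names up front (the names stock, stock_sum, and years_from_2020 here are made up for illustration):
# rename first so the later steps need no backticks
library(dplyr)

dat %>%
  rename(stock = `COUNT OF STOCK`) %>%
  group_by(ID, YEAR) %>%
  summarise(stock_sum = sum(stock),
            years_from_2020 = 2020 - first(YEAR)) %>%
  select(-YEAR)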

Simply do this.
library(dplyr)

df %>%
  group_by(D, number_of_years = 2020 - YEAR) %>%
  summarise(Sum_of_stock = sum(COUNT_OF_STOCK))
# A tibble: 6 x 3
# Groups: D [3]
D number_of_years Sum_of_stock
<chr> <dbl> <int>
1 A1 19 45
2 A1 20 48
3 A2 18 67
4 A2 19 50
5 A3 19 26
6 A3 20 28
Data
df <- read.table(text = "D COUNT_OF_STOCK YEAR
A1 10 2000
A1 20 2000
A1 18 2000
A1 15 2001
A1 30 2001
A2 35 2002
A2 50 2001
A2 10 2002
A2 22 2002
A3 11 2001
A3 15 2001
A3 28 2000", header = T)

Related

With known proportion/percentage and sample size, populate original data in R

I have a dataset where I have a known sample_size and percent "yes" for each year/location/ID combination. I am trying to use sample_size and percent to back-calculate the full dataset, which would be a binary Y_N column (where 1 is yes and 0 is no), corresponding to the percent in the percent column.
Here is an example of the dataset:
table <- "year location ID sample_size percent
1 2000 A 1a 10 40
2 2001 A 1a 10 30
3 2000 B 2a 10 70
4 2001 B 2a 10 90
5 2005 C 1a 20 10
6 2006 C 1a 20 50"
#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df
And here is a subset of what that would look like expanded for the first two year/location/ID combinations:
table <- "year location ID Y_N
1 2000 A 1a 1
2 2000 A 1a 1
3 2000 A 1a 1
4 2000 A 1a 1
5 2000 A 1a 0
6 2000 A 1a 0
7 2000 A 1a 0
8 2000 A 1a 0
9 2000 A 1a 0
10 2000 A 1a 0
11 2001 A 1a 1
12 2001 A 1a 1
13 2001 A 1a 1
14 2001 A 1a 0
15 2001 A 1a 0
16 2001 A 1a 0
17 2001 A 1a 0
18 2001 A 1a 0
19 2001 A 1a 0
20 2001 A 1a 0"
Is there a way to do this, say with dplyr mutate(), so that the expanded dataset can be grouped by each unique year/location/ID combination?
You could first compute the number of yes and no, then reshape to long and afterwards use tidyr::uncount like so:
library(dplyr)
library(tidyr)

df |>
  mutate(n_1 = sample_size * percent / 100, n_0 = sample_size - n_1) |>
  select(-sample_size, -percent) |>
  pivot_longer(c(n_1, n_0), names_to = "Y_N", values_to = "count", names_prefix = "n_") |>
  uncount(count)
#> # A tibble: 80 × 4
#> year location ID Y_N
#> <int> <chr> <chr> <chr>
#> 1 2000 A 1a 1
#> 2 2000 A 1a 1
#> 3 2000 A 1a 1
#> 4 2000 A 1a 1
#> 5 2000 A 1a 0
#> 6 2000 A 1a 0
#> 7 2000 A 1a 0
#> 8 2000 A 1a 0
#> 9 2000 A 1a 0
#> 10 2000 A 1a 0
#> # … with 70 more rows
Using a bazooka, you could also do it with base loops:
# two example rows with the same structure as the question's data
table <- data.frame(1, 2, 3, 4, 5)
table[1, ] <- c("2000", "A", "1a", 10, 40)
table[2, ] <- c("2001", "A", "1a", 10, 50)
table2 <- table[1, -5]        # template without the percent column
table3 <- table2[1, ]         # accumulator, seeded with a dummy row

for (i in 1:nrow(table)) {
  nb_row <- as.numeric(table[i, 4])       # sample size for this row
  percentage <- as.numeric(table[i, 5])   # percent "yes" for this row
  n_yes <- nb_row * percentage / 100      # number of 1s to emit
  table2[1:nb_row, ] <- 1                 # grow the template to nb_row rows
  table2$X1 <- table[i, 1]
  table2$X2 <- table[i, 2]
  table2$X3 <- table[i, 3]
  table2[1:n_yes, 4] <- 1
  table2[(n_yes + 1):nb_row, 4] <- 0
  table3 <- rbind(table3, table2)
}
table3 <- table3[-1, ]                    # drop the dummy seed row
> table3
X1 X2 X3 X4
2 2000 A 1a 1
3 2000 A 1a 1
4 2000 A 1a 1
5 2000 A 1a 1
6 2000 A 1a 0
7 2000 A 1a 0
8 2000 A 1a 0
9 2000 A 1a 0
10 2000 A 1a 0
11 2000 A 1a 0
12 2001 A 1a 1
13 2001 A 1a 1
14 2001 A 1a 1
15 2001 A 1a 1
16 2001 A 1a 1
17 2001 A 1a 0
18 2001 A 1a 0
19 2001 A 1a 0
20 2001 A 1a 0
21 2001 A 1a 0
You can use rbinom row-wise (note this draws randomly, so the per-group counts only match the percentages in expectation):
library(dplyr)
library(purrr)
library(tidyr)

df %>%
  mutate(Y_N = pmap(select(., c(sample_size, percent)),
                    ~ rbinom(..1, 1, prob = ..2 / 100))) %>%
  unnest(Y_N) %>%
  select(-c(sample_size, percent))
# A tibble: 80 × 4
year location ID Y_N
<int> <chr> <chr> <int>
1 2000 A 1a 0
2 2000 A 1a 0
3 2000 A 1a 0
4 2000 A 1a 1
5 2000 A 1a 0
6 2000 A 1a 1
7 2000 A 1a 0
8 2000 A 1a 0
9 2000 A 1a 1
10 2000 A 1a 0
# … with 70 more rows
# ℹ Use `print(n = ...)` to see more rows
Or tabulate if you want the exact counts:
df %>%
  mutate(Y_N = pmap(select(., c(sample_size, percent)),
                    ~ tabulate(seq_len(..1 * ..2 / 100), nbins = ..1))) %>%
  unnest(Y_N) %>%
  select(-c(sample_size, percent))
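For comparison, a base R sketch of the same exact-count expansion (no extra packages, assuming the df from the question):
# repeat each row sample_size times, then emit the right mix of 1s and 0s
n_yes <- df$sample_size * df$percent / 100
expanded <- df[rep(seq_len(nrow(df)), df$sample_size),
               c("year", "location", "ID")]
expanded$Y_N <- unlist(Map(function(n, k) rep(c(1, 0), times = c(k, n - k)),
                           df$sample_size, n_yes))
rownames(expanded) <- NULL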

How to delete missing observations for a subset of columns: the R equivalent of dropna(subset) from python pandas

Consider a dataframe in R where I want to drop row 6 because it has missing observations for the variables var1:var3, even though it has valid observations for id and year. See the code below.
In python, this can be done in two ways:
use df.dropna(subset = ['var1', 'var2', 'var3'], inplace=True)
use df.set_index(['id', 'year']).dropna()
How to do this in R with tidyverse?
library(tidyverse)

df <- tibble(id = seq(1, 10), year = seq(2001, 2010),
             var1 = sample(1:100, 10, replace = TRUE),
             var2 = sample(1:100, 10, replace = TRUE),
             var3 = sample(1:100, 10, replace = TRUE))
df[3, 4] <- NA
df[6, 3:5] <- NA
df[8, 3:4] <- NA
df[10, 4:5] <- NA
We may use complete.cases:
library(dplyr)

df %>%
  filter(if_any(var1:var3, complete.cases))
Output:
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 48 55 82
2 2 2002 22 83 67
3 3 2003 89 NA 19
4 4 2004 56 1 38
5 5 2005 17 58 35
6 7 2007 4 30 94
7 8 2008 NA NA 36
8 9 2009 97 100 80
9 10 2010 37 NA NA
We can use pmap for this case also:
library(dplyr)
library(purrr)

df %>%
  filter(!pmap_lgl(., ~ {
    x <- c(...)[-c(1, 2)]
    all(is.na(x))
  }))
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 90 55 77
2 2 2002 77 5 18
3 3 2003 17 NA 70
4 4 2004 72 33 33
5 5 2005 10 55 77
6 7 2007 22 81 17
7 8 2008 NA NA 46
8 9 2009 93 28 100
9 10 2010 50 NA NA
Or we could also use the complete.cases function in pmap, as suggested by @akrun:
df %>%
  filter(pmap_lgl(select(., 3:5), ~ any(complete.cases(c(...)))))
You can use if_any in filter:
library(dplyr)

df %>% filter(if_any(var1:var3, Negate(is.na)))
# id year var1 var2 var3
# <int> <int> <int> <int> <int>
#1 1 2001 14 99 43
#2 2 2002 25 72 76
#3 3 2003 90 NA 15
#4 4 2004 91 7 32
#5 5 2005 69 42 7
#6 7 2007 57 83 41
#7 8 2008 NA NA 74
#8 9 2009 9 78 23
#9 10 2010 93 NA NA
In base R, we can use rowSums to select rows which have at least one non-NA value among the var columns.
cols <- grep('var', names(df))
df[rowSums(!is.na(df[cols])) > 0, ]
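The same idea inverts if you instead want only the complete rows:
# keep rows where every var column is non-NA
df[rowSums(is.na(df[cols])) == 0, ]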
If you are looking for complete cases, use the following (the kernel of this is based on other answers):
library(tidyverse)

df <- tibble(id = seq(1, 10), year = seq(2001, 2010),
             var1 = sample(1:100, 10, replace = TRUE),
             var2 = sample(1:100, 10, replace = TRUE),
             var3 = sample(1:100, 10, replace = TRUE))
df[3, 4] <- NA
df[6, 3:5] <- NA
df[8, 3:4] <- NA
df[10, 4:5] <- NA
df %>% filter(!if_any(var1:var3, is.na))
#> # A tibble: 6 x 5
#> id year var1 var2 var3
#> <int> <int> <int> <int> <int>
#> 1 1 2001 13 28 26
#> 2 2 2002 61 77 58
#> 3 4 2004 95 38 58
#> 4 5 2005 38 34 91
#> 5 7 2007 85 46 14
#> 6 9 2009 45 60 40
Created on 2021-06-24 by the reprex package (v2.0.0)
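For completeness, tidyr's drop_na() is the closest single-call analogue of pandas' dropna(subset=...) when you want complete cases on a subset of columns:
library(dplyr)
library(tidyr)

# drop rows with an NA in any of var1:var3
df %>% drop_na(var1:var3)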

Fill NAs with either last or next non NA value in R

I am trying to fill NA values in a column with other non-NA values within the same group in R.
So my data looks something like this:
df
id year pop
1 E1 2000 NA
2 E2 2000 NA
3 E2 2001 NA
4 E2 2003 120
5 E2 2005 125
6 E3 1999 115
7 E3 2001 300
8 E3 2003 NA
9 E4 2004 10
10 E4 2005 NA
11 E4 2008 NA
12 E4 2009 9
13 E5 2002 12
14 E5 2003 80
And I want NA values to take either the last or the next non-NA value of pop within the same group of id, to look something like this:
df.desired
id year pop
1 E1 2000 NA
2 E2 2000 120
3 E2 2001 120
4 E2 2003 120
5 E2 2005 125
6 E3 1999 115
7 E3 2001 300
8 E3 2003 300
9 E4 2004 10
10 E4 2005 10
11 E4 2008 9
12 E4 2009 9
13 E5 2002 12
14 E5 2003 80
I tried different things with both zoo::na.locf() and dplyr::fill(), but I keep hitting two main issues: (1) I get errors because some groups contain only NA (like id == "E1" here), and (2) I can only choose either the last or the next non-NA value.
These are some examples of what I've tried:
library(tidyverse)
library(zoo)

df.desired <- df %>%
  group_by(id) %>%
  arrange(year) %>%
  mutate(pop_imputated = pop) %>%
  fill(pop_imputated) %>%
  ungroup()

df.desired <- df %>%
  group_by(id) %>%
  arrange(year) %>%
  mutate(pop_imputated = zoo::na.locf(pop)) %>%
  fill(pop_imputated) %>%
  ungroup()
Any ideas?
Thanks a lot!
Here is an answer that would match your expected output exactly: it will impute to the nearest non-missing value, either upward or downward.
Here is the code, using a spiced-up version of your example:
library(tidyverse)
df = structure(list(id = c("E1", "E2", "E2", "E2", "E2", "E3", "E3", "E3", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E4", "E5", "E5"),
year = c(2000L, 2000L, 2001L, 2003L, 2005L, 1999L, 2001L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2018L, 2019L, 2002L, 2003L),
pop = c(NA, NA, NA, 120L, 125L, 115L, 300L, NA, 10L, NA, NA, NA, NA, 9L, NA, 8L, 12L, 80L),
pop_exp = c(NA, 120L, 120L, 120L, 125L, 115L, 300L, 300L, 10L, 10L, 10L, 9L, 9L, 9L, 9L, 8L, 12L, 80L)),
class = "data.frame", row.names = c(NA, -18L))
fill_nearest <- function(x){
  keys <- which(!is.na(x))
  if (length(keys) == 0) return(NA)
  # index of the nearest non-missing position, for every position of x
  b <- map_dbl(seq_along(x), ~ keys[which.min(abs(.x - keys))])
  x[b]
}
df %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(pop_imputated = fill_nearest(pop)) %>%
  ungroup()
#> # A tibble: 18 x 5
#> id year pop pop_exp pop_imputated
#> <chr> <int> <int> <int> <int>
#> 1 E1 2000 NA NA NA
#> 2 E2 2000 NA 120 120
#> 3 E2 2001 NA 120 120
#> 4 E2 2003 120 120 120
#> 5 E2 2005 125 125 125
#> 6 E3 1999 115 115 115
#> 7 E3 2001 300 300 300
#> 8 E3 2003 NA 300 300
#> 9 E4 2004 10 10 10
#> 10 E4 2005 NA 10 10
#> 11 E4 2006 NA 10 10
#> 12 E4 2007 NA 9 9
#> 13 E4 2008 NA 9 9
#> 14 E4 2009 9 9 9
#> 15 E4 2018 NA 9 9
#> 16 E4 2019 8 8 8
#> 17 E5 2002 12 12 12
#> 18 E5 2003 80 80 80
Created on 2021-05-13 by the reprex package (v2.0.0)
Since this uses a purrr loop, it might get a bit slow on a huge dataset, though.
EDIT: I suggested adding this option to tidyr::fill(): https://github.com/tidyverse/tidyr/issues/1119. The issue also contains a tweaked version of this function that uses the year column as the reference for calculating the "distance" between values. For instance, you would rather have row 15 be 8 than 9, because that year is much closer.
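A minimal sketch of such a year-aware variant (the fill_nearest_by helper here is a hypothetical name, not the exact code from the issue) could look like this:
library(dplyr)
library(purrr)

# like fill_nearest(), but measures "nearest" in units of `by` (e.g. year)
# rather than in row positions; all-NA groups are returned unchanged
fill_nearest_by <- function(x, by){
  keys <- which(!is.na(x))
  if (length(keys) == 0) return(x)
  b <- map_dbl(seq_along(x), ~ keys[which.min(abs(by[.x] - by[keys]))])
  x[b]
}

df %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(pop_imputated = fill_nearest_by(pop, year)) %>%
  ungroup()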
The runner library has a built-in function fill_run which may also be used:
library(dplyr)

df %>%
  group_by(id) %>%
  mutate(pop = runner::fill_run(pop, run_for_first = TRUE))
#> Warning in runner::fill_run(pop, run_for_first = TRUE): All x values are NA
#> # A tibble: 14 x 3
#> # Groups: id [5]
#> id year pop
#> <chr> <int> <int>
#> 1 E1 2000 NA
#> 2 E2 2000 120
#> 3 E2 2001 120
#> 4 E2 2003 120
#> 5 E2 2005 125
#> 6 E3 1999 115
#> 7 E3 2001 300
#> 8 E3 2003 300
#> 9 E4 2004 10
#> 10 E4 2005 10
#> 11 E4 2008 10
#> 12 E4 2009 9
#> 13 E5 2002 12
#> 14 E5 2003 80
Created on 2021-05-13 by the reprex package (v2.0.0)
Have you tried changing the .direction argument of the tidyr::fill function? You can use "downup" (first downwards, then upwards) or, vice versa, "updown":
library(dplyr)
library(tidyr)

df %>%
  group_by(id) %>%
  mutate(pop_imputated = pop) %>%
  fill(pop_imputated, .direction = "downup") %>%
  ungroup()
# A tibble: 14 x 4
id year pop pop_imputated
<chr> <int> <int> <int>
1 E1 2000 NA NA
2 E2 2000 NA 120
3 E2 2001 NA 120
4 E2 2003 120 120
5 E2 2005 125 125
6 E3 1999 115 115
7 E3 2001 300 300
8 E3 2003 NA 300
9 E4 2004 10 10
10 E4 2005 NA 10
11 E4 2008 NA 10
12 E4 2009 9 9
13 E5 2002 12 12
14 E5 2003 80 80
It looks similar to your desired output.
I hope this is what you were looking for. I filled all NA values with the last non-NA value in each group.
library(dplyr)

df %>%
  group_by(id) %>%
  mutate(across(pop, ~ coalesce(.x, last(.x[!is.na(.x)]))))
# A tibble: 14 x 3
# Groups: id [5]
id year pop
<chr> <int> <int>
1 E1 2000 NA
2 E2 2000 125
3 E2 2001 125
4 E2 2003 120
5 E2 2005 125
6 E3 1999 115
7 E3 2001 300
8 E3 2003 300
9 E4 2004 10
10 E4 2005 9
11 E4 2008 9
12 E4 2009 9
13 E5 2002 12
14 E5 2003 80
Another solution uses na.locf (NA last observation carried forward); as it operates top-down, we first need to re-arrange the dataframe so that the first pop value is non-NA:
library(dplyr)
library(zoo)

df %>%
  arrange(desc(id)) %>%
  mutate(pop = na.locf(pop))
id year pop
13 E5 2002 12
14 E5 2003 80
9 E4 2004 10
10 E4 2005 10
11 E4 2008 10
12 E4 2009 9
6 E3 1999 115
7 E3 2001 300
8 E3 2003 300
2 E2 2000 300
3 E2 2001 300
4 E2 2003 120
5 E2 2005 125
1 E1 2000 125
We can of course restore the original order:
df %>%
  arrange(desc(id)) %>%
  mutate(pop = na.locf(pop)) %>%
  arrange(id)
Since na.approx accepts approx arguments (see ?approx and ?na.approx), we can use na.approx with method = "constant" and rule = 2. Also, sort the data back to the original order if you want exactly the output shown in the question.
library(dplyr)
library(zoo)

df %>%
  group_by(id) %>%
  arrange(year) %>%
  mutate(pop_imputated = na.approx(pop, method = "constant", rule = 2, na.rm = FALSE)) %>%
  ungroup() %>%
  arrange(id, year)
giving:
# A tibble: 14 x 4
id year pop pop_imputated
<chr> <int> <int> <dbl>
1 E1 2000 NA NA
2 E2 2000 NA 120
3 E2 2001 NA 120
4 E2 2003 120 120
5 E2 2005 125 125
6 E3 1999 115 115
7 E3 2001 300 300
8 E3 2003 NA 300
9 E4 2004 10 10
10 E4 2005 NA 10
11 E4 2008 NA 10
12 E4 2009 9 9
13 E5 2002 12 12
14 E5 2003 80 80
Note
Lines <- " id year pop
1 E1 2000 NA
2 E2 2000 NA
3 E2 2001 NA
4 E2 2003 120
5 E2 2005 125
6 E3 1999 115
7 E3 2001 300
8 E3 2003 NA
9 E4 2004 10
10 E4 2005 NA
11 E4 2008 NA
12 E4 2009 9
13 E5 2002 12
14 E5 2003 80"
df <- read.table(text = Lines)

Using fill() with a conditional

library(tidyverse)

df <- tibble(X = c("A1", "A2", "A3", "A4", "A5", "A5", "A6", "A7", "A8", "A8", "A9", "A9"),
             Y = c(31, 52, 45, 86, NA, 50, 93, 85, 59, NA, 85, NA),
             Z = c(70, 64, 51, 38, 18, NA, 76, 54, NA, 69, NA, 96),
             D = c(1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2))
> df
# A tibble: 12 x 4
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A5 NA 18 2
6 A5 50 NA 2
7 A6 93 76 1
8 A7 85 54 1
9 A8 59 NA 2
10 A8 NA 69 2
11 A9 85 NA 2
12 A9 NA 96 2
Column X has duplicate values that sometimes repeat twice, and column D flags those occurrences. Columns Y and Z hold scores. I want those scores to repeat within the duplicated observations in column X. I tried using the fill() method; my output is below.
df %>%
  filter(D == 1) %>%
  bind_rows(df %>%
              filter(D != 1) %>%
              fill(c("Y", "Z"), .direction = "downup"))
# A tibble: 12 x 4
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A6 93 76 1
6 A7 85 54 1
7 A5 50 18 2
8 A5 50 18 2
9 A8 59 18 2
10 A8 59 69 2
11 A9 85 69 2
12 A9 85 96 2
However, whichever .direction option I use, I cannot get the correct numbers. For example, in the output above, Z for A9 should repeat 96 twice; the same issue occurs with A8.
My desired output is below:
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A6 93 76 1
6 A7 85 54 1
7 A5 50 18 2
8 A5 50 18 2
9 A8 59 69 2
10 A8 59 69 2
11 A9 85 96 2
12 A9 85 96 2
You could do:
library(tidyverse)

df %>%
  group_by(X) %>%
  mutate(across(Y:Z, ~ first(na.omit(.))))
Output:
# A tibble: 12 x 4
# Groups: X [9]
X Y Z D
<chr> <dbl> <dbl> <dbl>
1 A1 31 70 1
2 A2 52 64 1
3 A3 45 51 1
4 A4 86 38 1
5 A5 50 18 2
6 A5 50 18 2
7 A6 93 76 1
8 A7 85 54 1
9 A8 59 69 2
10 A8 59 69 2
11 A9 85 96 2
12 A9 85 96 2
You could also use fill like below, but in my experience this can be quite slow:
df %>%
  group_by(X) %>%
  fill(Y, Z, .direction = 'downup')
You can use group_by and mutate to replace each NA with the other value in its group:
df %>%
  dplyr::group_by(X) %>%
  dplyr::mutate(
    Y = dplyr::case_when(
      is.na(Y) ~ Y[!is.na(Y)],
      TRUE ~ Y
    ),
    Z = dplyr::case_when(
      is.na(Z) ~ Z[!is.na(Z)],
      TRUE ~ Z
    )
  )

Summarize Table based on a Threshold

It might be a very simple problem, but I failed to solve it with the dplyr functions I know. Here's the data:
tab1 <- read.table(header=TRUE, text="
Col1 A1 A2 A3 A4 A5
ID1 43 52 33 25 59
ID2 27 41 20 71 22
ID3 37 76 36 27 44
ID4 23 71 62 25 63
")
tab1
Col1 A1 A2 A3 A4 A5
1 ID1 43 52 33 25 59
2 ID2 27 41 20 71 22
3 ID3 37 76 36 27 44
4 ID4 23 71 62 25 63
I intend to get a long-format table like the following by keeping only the values lower than 30:
Col1 Col2 Val
ID1 A4 25
ID2 A1 27
ID2 A3 20
ID2 A5 22
ID3 A4 27
ID4 A1 23
ID4 A4 25
Or, if you insist on dplyr-ness, you can gather the data first and then filter as desired:
library(dplyr)
library(tidyr)

tab1 %>%
  gather(Col2, Val, -Col1) %>%
  filter(Val < 30)
# Col1 Col2 Val
# 1 ID2 A1 27
# 2 ID4 A1 23
# 3 ID2 A3 20
# 4 ID1 A4 25
# 5 ID3 A4 27
# 6 ID4 A4 25
# 7 ID2 A5 22
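Note that gather() has since been superseded in tidyr; the pivot_longer() equivalent is below. It returns the same rows, ordered row-wise (as in the desired output) rather than column-wise:
library(dplyr)
library(tidyr)

tab1 %>%
  pivot_longer(-Col1, names_to = "Col2", values_to = "Val") %>%
  filter(Val < 30)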
Use the reshape2 package with melt:
library(reshape2)

tab2 <- melt(tab1)
tab2[tab2$value < 30, ]
Output:
Col1 variable value
2 ID2 A1 27
4 ID4 A1 23
10 ID2 A3 20
13 ID1 A4 25
15 ID3 A4 27
16 ID4 A4 25
18 ID2 A5 22
Using base R:
# filter each row's values, dropping the ID column first so the
# comparison stays numeric
x <- apply(tab1[-1], 1, function(y) y[y < 30])
data.frame(Col1 = rep(tab1$Col1, sapply(x, length)),
           Col2 = names(unlist(x)),
           Val = unlist(x))
Col1 Col2 Val
1 ID1 A4 25
2 ID2 A1 27
3 ID2 A3 20
4 ID2 A5 22
5 ID3 A4 27
6 ID4 A1 23
7 ID4 A4 25
