Using tidyverse and pipes, how do I assign fixed rows in R?

Given this dataframe
X1 X2
2001 NA
abc 10
def 12
xo 13
2002 NA
abc 10
efd 22
dd 23
2005 NA
a 30
All the years have NA in X2. My goal is to get this data frame to become
X1 X2 Date
abc 10 2001
def 12 2001
xo 13 2001
abc 10 2002
efd 22 2002
dd 23 2002
a 30 2005
That is, the years become their own column and the NA rows are dropped.
What I tried
a = read_csv("given.csv")
a %>% mutate(Date = ifelse(is.na(X2), X1, NA))
This turns the first dataframe to
X1 X2 Date
2001 NA 2001
abc 10 NA
def 12 NA
xo 13 NA
2002 NA 2002
abc 10 NA
efd 22 NA
dd 23 NA
2005 NA 2005
a 30 NA
I'm not sure how to replace each NA in the Date column with the year value above it. After that, I think I can just use drop_na and the result will be what I want.

Another option:
library(dplyr)
library(zoo)
a %>%
mutate(Date = na.locf(case_when(is.na(X2) ~ X1))) %>%
na.omit
Output:
X1 X2 Date
2 abc 10 2001
3 def 12 2001
4 xo 13 2001
6 abc 10 2002
7 efd 22 2002
8 dd 23 2002
10 a 30 2005
If you want to reset row numbers just use filter(!is.na(X2)) instead of na.omit.
P.S. You can of course just load tidyverse and do something like:
library(tidyverse)
a %>%
mutate(Date = case_when(is.na(X2) ~ X1)) %>%
fill(Date) %>%
drop_na
Note, however, that fill can be noticeably slower than the na.locf function from zoo on large data, so benchmark both if speed matters.

We can create a grouping column from the cumulative sum of a flag that marks the elements of 'X1' consisting only of digits (^\\d+$), set 'Date' to the first element of 'X1' within each group, ungroup, and remove the NA rows
library(dplyr)
library(stringr)
a %>%
group_by(grp = cumsum(str_detect(X1, '^\\d+$'))) %>%
mutate(Date = first(X1)) %>%
ungroup %>%
select(-grp) %>%
na.omit
# A tibble: 7 x 3
# X1 X2 Date
# <chr> <int> <chr>
#1 abc 10 2001
#2 def 12 2001
#3 xo 13 2001
#4 abc 10 2002
#5 efd 22 2002
#6 dd 23 2002
#7 a 30 2005
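To make the grouping step above concrete, here is a minimal base R sketch of the same idea. It assumes the sample data from the question, uses the pattern ^[0-9]+$ in place of ^\\d+$, and drops the year rows by the digit flag rather than by NA in X2 (equivalent for this data). The names is_year, grp and res are illustrative only.

```r
# Sample data from the question (X1 is character, year rows have NA in X2)
a <- data.frame(
  X1 = c("2001", "abc", "def", "xo", "2002", "abc", "efd", "dd", "2005", "a"),
  X2 = c(NA, 10L, 12L, 13L, NA, 10L, 22L, 23L, NA, 30L),
  stringsAsFactors = FALSE
)

# TRUE for rows whose X1 consists only of digits (the year header rows)
is_year <- grepl("^[0-9]+$", a$X1)

# Cumulative sum turns the flags into a group id: 1 1 1 1 2 2 2 2 3 3
grp <- cumsum(is_year)

# Look up each row's year through its group id, then drop the header rows
a$Date <- a$X1[is_year][grp]
res <- a[!is_year, ]
res
```

Each year row starts a new group, so indexing the vector of years by the group id carries the year down to every row beneath it.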
Or using data.table with zoo
library(data.table)
library(zoo)
na.omit(setDT(a)[, Date := na.locf(fifelse(is.na(X2), X1, NA_character_))])
data
a <- structure(list(X1 = c("2001", "abc", "def", "xo", "2002", "abc",
"efd", "dd", "2005", "a"), X2 = c(NA, 10L, 12L, 13L, NA, 10L,
22L, 23L, NA, 30L)), class = "data.frame", row.names = c(NA,
-10L))

Related

R subset dataframe where no observations of certain variables

I have a dataframe that looks like
country sector data1 data2
France 1 7 .
France 2 10 .
belgium 1 12 7
belgium 2 14 8
I want to drop any column that is missing for a country in all of its sectors. In this example I would like to exclude data2, because it is missing for both sector 1 and sector 2 in France. To be clear, this means I would also be throwing out the values of data2 for belgium.
My expected output would look like
country sector data1
France 1 7
France 2 10
belgium 1 12
belgium 2 14
data2 is now excluded because it was missing for every sector in France.
We may group by country, create logical columns that flag where the number of NA elements equals the group size, ungroup, set the flagged columns to NA, and then drop those columns in select
library(dplyr)
library(stringr)
df1 %>%
group_by(country) %>%
mutate(across(everything(), ~ sum(is.na(.x)) == n(),
.names = "{.col}_lgl")) %>%
ungroup %>%
mutate(across(names(df1)[-1], ~ if(any(get(str_c(cur_column(),
"_lgl")) )) NA else .x)) %>%
select(c(where(~ !is.logical(.x) && any(complete.cases(.x)))))
Output:
# A tibble: 4 × 3
country sector data1
<chr> <int> <int>
1 France 1 7
2 France 2 10
3 belgium 1 12
4 belgium 2 14
If we don't use group_by, the steps can be simplified as shown in Maël's post, i.e. do the grouping with a base R function inside select; either tapply or ave can work
df1 %>%
select(where(~ !any(tapply(is.na(.x), df1[["country"]],
FUN = all))))
data
df1 <- structure(list(country = c("France", "France", "belgium", "belgium"
), sector = c(1L, 2L, 1L, 2L), data1 = c(7L, 10L, 12L, 14L), data2 = c(NA,
NA, 7L, 8L)), row.names = c(NA, -4L), class = "data.frame")
In base R:
df1 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 2
belgium 1 12 7
belgium 2 14 8")
df2 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 NA
belgium 1 12 7
belgium 2 14 8")
df1[!sapply(df1, \(x) any(ave(x, df1$country, FUN = \(y) all(is.na(y)))))]
# country sector data1 data2
# 1 France 1 7 NA
# 2 France 2 10 2
# 3 belgium 1 12 7
# 4 belgium 2 14 8
df2[!sapply(df2, \(x) any(ave(x, df2$country, FUN = \(y) all(is.na(y)))))]
# country sector data1
# 1 France 1 7
# 2 France 2 10
# 3 belgium 1 12
# 4 belgium 2 14
Note: \(x) is shorthand for function(x), available from R 4.1.0.
For a simpler base R solution, you can apply a function over the column names and keep only the columns that contain no NA at all. Note that this drops a column with any missing value, regardless of country:
keep_cols <- sapply(names(data), \(x) all(!is.na(data[[x]])))
data <- data[, keep_cols, drop = FALSE]
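As a hedged alternative, the per-country check can be sketched in base R with tapply and vapply. This assumes the df2 example above (data2 missing for every France row); the names all_na_in_some_group and result are illustrative only.

```r
# Example data: data2 is missing for every sector in France
df2 <- data.frame(
  country = c("France", "France", "belgium", "belgium"),
  sector  = c(1L, 2L, 1L, 2L),
  data1   = c(7L, 10L, 12L, 14L),
  data2   = c(NA, NA, 7L, 8L)
)

# For each column: is there at least one country where every value is NA?
all_na_in_some_group <- vapply(
  df2,
  function(col) any(tapply(is.na(col), df2$country, all)),
  logical(1)
)

# Keep only the columns that are not entirely missing for any country
result <- df2[, !all_na_in_some_group, drop = FALSE]
result
```

Because the check works on is.na(col), which is always logical, it behaves the same for character and numeric columns.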

Apply function over list then iterate over second variable, in r

I am trying to have a function apply over a list and iterate over a second variable in the function, in r.
Here is an example:
Create the data
A <- data.frame(var = 1:3, year = 2000:2002)
B <- data.frame(var = 4:6, year = 2000:2002)
C <- data.frame(var = 7:9, year = 2000:2002)
ABC <- list(A, B, C)
> ABC
[[1]]
var year
1 1 2000
2 2 2001
3 3 2002
[[2]]
var year
1 4 2000
2 5 2001
3 6 2002
[[3]]
var year
1 7 2000
2 8 2001
3 9 2002
Write the function sum, which simply filters for a start year and sums the 'var' values. (Sorry, this simple function got messier in this example than I had intended.)
library(dplyr)
sum <- function(dat, start.year) {
dat %>%
filter(year >= start.year) %>%
select(var) %>%
colSums() %>%
data.frame(row.names = NULL) %>%
rename(var = '.') %>%
mutate(start = start.year)
}
Now I can apply the function to the list (and bind_rows to get a neat output):
lapply(ABC, sum, 2000) %>%
bind_rows()
var start
1 6 2000
2 15 2000
3 24 2000
What I want to do however is iterate over start.year creating dataframes for start.year = c(2000, 2001, 2002). This would ideally give:
var start
1 6 2000
2 15 2000
3 24 2000
4 5 2001
5 11 2001
6 17 2001
7 3 2002
8 6 2002
9 9 2002
I have looked at map2, but its documentation talks about using vectors of the same length. That would work in this case, but imagine my list had 4 items in it and only 3 records per item, so I assume map2 is doing something different. I also thought about a nested for loop. When I started writing that, however, I realized I would be appending to lists in R, and that seemed wrong. I assume this is an easy thing to do. Any help would be appreciated.
We can do this with a nested lapply/map
library(purrr)
map_dfr(2000:2002, ~ map_dfr(ABC, sum, .x))
# var start
#1 6 2000
#2 15 2000
#3 24 2000
#4 5 2001
#5 11 2001
#6 17 2001
#7 3 2002
#8 6 2002
#9 9 2002
Or, inspired by @thelatemail's suggestion with Map
map2_dfr(rep(ABC, 3), rep(2000:2002,each=length(ABC)), sum)
With lapply
do.call(rbind, lapply(2000:2002, function(x) do.call(rbind, lapply(ABC, sum, x))))
# var start
#1 6 2000
#2 15 2000
#3 24 2000
#4 5 2001
#5 11 2001
#6 17 2001
#7 3 2002
#8 6 2002
#9 9 2002
Or as @thelatemail mentioned
do.call(rbind, Map(sum, ABC, start.year=rep(2000:2002,each=length(ABC))))
If the OP's function can be changed, another option is
library(dplyr)
library(tidyr)
map_dfr(ABC, ~ .x %>%
crossing(year2 = 2000:2002) %>%
filter(year >= year2) %>%
group_by(year2) %>%
summarise(var = base::sum(var)))
Or instead of doing this in a list, we can bind them together with bind_rows then do a group by sum after crossing with the input 'years'
bind_rows(ABC, .id = 'grp') %>%
group_by(grp) %>%
crossing(year2 = 2000:2002) %>%
filter(year >= year2) %>%
group_by(grp, year2) %>%
summarise(var = base::sum(var))
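The "every list element with every start year" iteration used in the answers above can also be sketched in base R by building the combinations explicitly with expand.grid. This is only a sketch: sum_from is a hypothetical, simplified stand-in for the question's sum function, and grid and res are illustrative names.

```r
A <- data.frame(var = 1:3, year = 2000:2002)
B <- data.frame(var = 4:6, year = 2000:2002)
C <- data.frame(var = 7:9, year = 2000:2002)
ABC <- list(A, B, C)

# Hypothetical stand-in for the question's function: sum 'var' from start.year on
sum_from <- function(dat, start.year) {
  data.frame(var = sum(dat$var[dat$year >= start.year]), start = start.year)
}

# All (list index, start year) pairs; the first column varies fastest
grid <- expand.grid(i = seq_along(ABC), start = 2000:2002)

# Apply the function to each pair and stack the one-row results
res <- do.call(rbind, Map(function(i, s) sum_from(ABC[[i]], s), grid$i, grid$start))
res
```

Because the pairs are enumerated explicitly, the list and the vector of start years do not need to have the same length.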

Index multiple vectors into table in R

I have three vectors:
position <- c(13, 13, 24, 20, 24, 6, 13)
my_string_allele <- c("T>A", "T>A", "G>C", "C>A", "A>G", "A>G", "G>T")
position_ref <- c("12006", "1108", "13807", "1970", "9030", "2222", "4434")
I want to create a table (starting from the smallest position) as shown below. I want to account for the number of occurrence for each my_string_allele column for each position and have their corresponding position_ref in position_ref column. What would be the simplest way to do this?
position  T>A  position_ref  G>C  position_ref  C>A  position_ref  A>G  position_ref  G>T  position_ref
6                                                                   1    2222
13        2    12006, 1108                                                             1    4434
20                                              1    1970
24                           1    13807                             1    9030
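Here is a spread() method that reshapes the data to wide format, followed by mutate_all() to count the number of occurrences. (spread() has since been superseded by pivot_wider() in tidyr, and funs() is deprecated in dplyr, so recent versions may warn.)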
Data
library(tidyverse)
df <- data.frame(position, my_string_allele, position_ref, stringsAsFactors = F)
Code
df %>% group_by(position, my_string_allele) %>%
mutate(position_ref = paste(position_ref, collapse = ", ")) %>%
distinct() %>%
spread(my_string_allele, position_ref) %>%
mutate_all(funs(N = if_else(is.na(.), NA_integer_, lengths(str_split(., ", ")))))
Output
position `A>G` `C>A` `G>C` `G>T` `T>A` `A>G_N` `C>A_N` `G>C_N` `G>T_N` `T>A_N`
<dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
1 6 2222 NA NA NA NA 1 NA NA NA NA
2 13 NA NA NA 4434 12006, 1108 NA NA NA 1 2
3 20 NA 1970 NA NA NA NA 1 NA NA NA
4 24 9030 NA 13807 NA NA 1 NA 1 NA NA
(You can sort the columns by their column names to get the output you show in the question.)
Full disclosure: I am adapting part of @DarrenTsai's answer to also provide the number of occurrences (since it is missing from his answer). Using data.table:
library(data.table)
library(magrittr) # provides %>%, which data.table does not export
df <- data.frame(position, my_string_allele, position_ref, stringsAsFactors = F)
setDT(df)
df[, `:=`(position_ref = paste(.N, paste(position_ref, collapse = ", "))),
by = c("position", "my_string_allele")] %>%
unique(., by = c("position", "my_string_allele", "position_ref")) %>%
dcast(position ~ my_string_allele, value.var = "position_ref")
Result:
position A>G C>A G>C G>T T>A
1: 6 1 2222 <NA> <NA> <NA> <NA>
2: 13 <NA> <NA> <NA> 1 4434 2 12006, 1108
3: 20 <NA> 1 1970 <NA> <NA> <NA>
4: 24 1 9030 <NA> 1 13807 <NA> <NA>
With dplyr (largely based on @DarrenTsai's answer, so please upvote his as well):
library(dplyr)
df %>% group_by(position, my_string_allele) %>%
mutate(position_ref = paste(n(), paste(position_ref, collapse = ", "))) %>%
distinct() %>%
tidyr::spread(my_string_allele, position_ref)
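For comparison, here is a minimal base R sketch that produces the counts and the collapsed references as two separate position-by-allele matrices, using only table and tapply on the three vectors from the question. The names counts and refs are illustrative; this does not reproduce the interleaved column layout shown in the question.

```r
position <- c(13, 13, 24, 20, 24, 6, 13)
my_string_allele <- c("T>A", "T>A", "G>C", "C>A", "A>G", "A>G", "G>T")
position_ref <- c("12006", "1108", "13807", "1970", "9030", "2222", "4434")

# Number of occurrences of each allele at each position (0 where absent)
counts <- table(position, my_string_allele)

# Collapsed position_ref values per cell (NA where the combination is absent)
refs <- tapply(position_ref, list(position, my_string_allele),
               paste, collapse = ", ")

counts
refs
```

Both results have the positions as rows, sorted from smallest to largest, and one column per allele.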

Fill numeric variable while preserving group

[EDITED to reflect a better example]
Say I have a dataframe like this:
df <- data.frame(x = c("A","A","B", "B"), year = c(2001,2004,2002,2005))
> df
x year
1 A 2001
2 A 2004
3 B 2002
4 B 2005
How can I increment year by 1 while preserving x? I would like to fill in year so that the sequence is this:
x year
1 A 2001
2 A 2002
3 A 2003
4 A 2004
5 B 2002
6 B 2003
7 B 2004
8 B 2005
Can anyone recommend a good way of doing this?
@useR recommended this approach:
> data.frame(year = min(df$year):max(df$year)) %>%
full_join(df) %>%
fill(x)
Joining, by = "year"
year x
1 2001 A
2 2002 B
3 2003 B
4 2004 A
5 2005 B
However that does not match the desired output.
One option uses tidyr::complete together with dplyr::lead:
library(tidyverse)
df <- data.frame(x = LETTERS[1:3], year = c(2001,2004,2007))
df %>% mutate(nextYear = ifelse(is.na(lead(year)),year, lead(year)-1)) %>%
group_by(x) %>%
complete(year = seq(year, nextYear, by=1)) %>%
select(-nextYear) %>%
as.data.frame()
# x year
# 1 A 2001
# 2 A 2002
# 3 A 2003
# 4 B 2004
# 5 B 2005
# 6 B 2006
# 7 C 2007
Edited: The solution for modified data
df <- data.frame(x = c("A","A","B", "B"), year = c(2001,2004,2002,2005))
library(tidyverse)
df %>% group_by(x) %>%
complete(year = seq(min(year), max(year), by=1)) %>%
as.data.frame()
# x year
# 1 A 2001
# 2 A 2002
# 3 A 2003
# 4 A 2004
# 5 B 2002
# 6 B 2003
# 7 B 2004
# 8 B 2005
Using base R (with a little help from zoo), on the original example with x = LETTERS[1:3] and year = c(2001, 2004, 2007):
full_df = data.frame(year = min(df$year):max(df$year))
df = merge(df, full_df, all = TRUE)
df = df[order(df$year), ]
df$x = zoo::na.locf(df$x)
df
# year x
# 1 2001 A
# 2 2002 A
# 3 2003 A
# 4 2004 B
# 5 2005 B
# 6 2006 B
# 7 2007 C
Using the "tidyverse"
df <- data.frame(x = LETTERS[1:3], year = c(2001,2004,2007))
library(dplyr)
library(tidyr)
df = df %>% mutate(year = factor(year, levels = min(year):max(year))) %>%
complete(year) %>%
fill(x) %>%
mutate(year = as.numeric(as.character(year)))
df
# # A tibble: 7 x 2
# year x
# <dbl> <fctr>
# 1 2001 A
# 2 2002 A
# 3 2003 A
# 4 2004 B
# 5 2005 B
# 6 2006 B
# 7 2007 C
We can first split by x, then create a year vector for each x group, join with each group df, fill down x, then finally rbind all group df's together.
library(dplyr)
library(tidyr)
df %>%
split(.$x) %>%
lapply(function(y) data.frame(year = min(y$year):max(y$year)) %>%
full_join(y) %>%
fill(x)) %>%
unname() %>%
do.call(rbind, .)
Result:
year x
1 2001 A
2 2002 A
3 2003 A
4 2004 A
5 2002 B
6 2003 B
7 2004 B
8 2005 B
Here's a pretty simple base R method with tapply and stack.
stack(tapply(df$year, df["x"], function(x) min(x):max(x)))
Here, tapply splits the year vector by df$x groups and then constructs a sequence from the min to the max year. This returns a named list which is fed to stack to produce the following.
values ind
1 2001 A
2 2002 A
3 2003 A
4 2004 A
5 2002 B
6 2003 B
7 2004 B
8 2005 B
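If the column names and order of the desired output matter, the stacked result can be tidied afterwards. The sketch below uses split and lapply instead of tapply to build the named list (an equivalent substitution for this data), then renames and reorders the columns; filled is an illustrative name.

```r
df <- data.frame(x = c("A", "A", "B", "B"), year = c(2001, 2004, 2002, 2005))

# Named list with one full year sequence per group of x
year_seqs <- lapply(split(df$year, df$x), function(y) seq(min(y), max(y)))

# stack() turns the named list into columns 'values' and 'ind'
filled <- stack(year_seqs)

# Reorder and rename to match the desired output: x first, then year
filled <- setNames(filled[, c("ind", "values")], c("x", "year"))
filled$x <- as.character(filled$x)
filled
```

stack returns the group labels as a factor, so the last step converts them back to character to match the input.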
If you are curious how you might do this in data.table, it's also pretty straightforward:
library(data.table)
setDT(df)[, .(year=min(year):max(year)), by=x]
which returns
x year
1: A 2001
2: A 2002
3: A 2003
4: A 2004
5: B 2002
6: B 2003
7: B 2004
8: B 2005

Speeding up rowwise comparison

I have a data.frame like the following:
id year x y v1
1 2006 12 1 0.8510703
1 2007 12 1 0.5954527
1 2008 12 2 -1.9312854
1 2009 12 1 0.1558393
1 2010 8 1 0.9051487
2 2001 12 2 -0.5480566
2 2002 12 2 -0.7607420
2 2003 3 2 -0.8094283
2 2004 3 2 -0.1732794
I would like to sum up (grouped by id) v1 of consecutive years (so 2010 and 2009, 2009 and 2008 and so on) only if x and y match. Expected output:
id year res
1 2010 NA
1 2009 NA
1 2008 NA
1 2007 1.4465230
2 2004 -0.9827077
2 2003 NA
2 2002 -1.3087987
The oldest year per id is removed, as there is no preceding year.
I have a slow lapply solution in place but would like to speed things up, as my data is rather large.
Data:
set.seed(1)
dat <- data.frame(id = c(rep(1,5),rep(2,4)),year = c(2006:2010,2001:2004),
x = c(12,12,12,12,8,12,12,3,3), y = c(1,1,2,1,1,2,2,2,2),
v1 = rnorm(9))
Current Solution:
require(dplyr)
myfun <- function(dat) { do.call(rbind,lapply(rev(unique(dat$year)[-1]),
function(z) inner_join(dat[dat$year==z,2:5],
dat[dat$year==z-1,2:5],
by=c("x","y")) %>%
summarise(year = z, res = ifelse(nrow(.) < 1,NA,sum(v1.x,v1.y)))))
}
dat %>% group_by(id) %>% do(myfun(.))
Here is a data.table solution, I think.
library(data.table)
datNew <- setDT(dat)[, .(year=year, res=(v1+shift(v1)) * NA^(x != shift(x) | y != shift(y))),
by=id][, .SD[-1], by=id][]
id year res
1: 1 2007 -0.4428105
2: 1 2008 NA
3: 1 2009 NA
4: 1 2010 NA
5: 2 2002 -0.3330393
6: 2 2003 NA
7: 2 2004 1.3141061
Here, the j statement contains a list with two elements: the year and the computed res. The res expression adds each value to its lagged value, obtained with shift, and multiplies the sum by NA or 1 depending on whether x and y differ from their lagged values (NA^TRUE is NA, while NA^FALSE is 1). This calculation is performed by id. The output is fed to a second chain, which uses .SD[-1] by id to drop the first observation of each id, whose res is always NA.
You can adjust the order efficiently using setorder if desired.
setorder(datNew, id, -year)
datNew
id year res
1: 1 2010 NA
2: 1 2009 NA
3: 1 2008 NA
4: 1 2007 -0.4428105
5: 2 2004 1.3141061
6: 2 2003 NA
7: 2 2002 -0.3330393
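The NA^ masking used in the data.table answer above can be seen in isolation with a small base R sketch. The vectors here are made-up toy values rather than the question's data, and the lag is built by hand instead of with shift.

```r
# Toy values (not from the question)
v1 <- c(1, 2, 3, 4)
x  <- c(12, 12, 8, 8)

# Hand-made lag by one position: the first element has no predecessor
prev_v1 <- c(NA, head(v1, -1))
prev_x  <- c(NA, head(x, -1))

# NA^TRUE is NA, NA^FALSE is 1, NA^NA is NA
mask <- NA^(x != prev_x)

# The sum survives only where x equals its lagged value
res <- (v1 + prev_v1) * mask
res
```

Multiplying by the mask keeps the sum where the condition is FALSE and turns it into NA where the condition is TRUE or unknown, which avoids an explicit ifelse.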
Assuming the years are sorted within each id, as in the example:
library(dplyr)
dat %>%
group_by(id) %>%
mutate(res = v1 + lag(v1), #sum of each value and its lag
res = ifelse(x == lag(x) & y == lag(y), res, NA)) %>% #NA if x and y don't match
slice(-1) #drop the first year
You can use %>% select(id, year, res), and %>% arrange(id, desc(year)) at the end if you want.
