Divide data into chunks with multiple values in each chunk in R

I have a dataframe with observations from three years' time, with a column df$week that indicates the week of each observation. (The week count of the second year continues from the count of the first, so the data contains 207 weeks.)
I would like to divide the data into longer time periods, adding a column df$period that groups the observations from several weeks together.
If a period were three weeks long, and the data included 13 observations over six weeks' time, the idea would be to divide
weeks <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6)
into
periods <- list(c(1, 1, 1, 2, 2, 3, 3), c(4, 5, 5, 6, 6, 6))
periods
[[1]]
[1] 1 1 1 2 2 3 3
[[2]]
[1] 4 5 5 6 6 6
so that the data frame would look something like
> df
week period
1 1 1
2 1 1
3 1 1
4 2 1
5 2 1
6 3 1
7 3 1
8 4 2
9 5 2
10 5 2
11 6 2
12 6 2
13 6 2
The data contains over 13k rows, so I would need some sort of map in the style of
mapPeriod <- function(weeks, fun) {
  out <- numeric(length(weeks))
  for (i in seq_along(weeks)) {
    out[i] <- fun(weeks[[i]])
  }
  out
}
I just don't know what to put in fun to divide the weeks into the desired sequences of periods. Can the function rep be of assistance here? How?
I would be very grateful for all input and suggestions.

Integer division maps each week onto a three-week block, and split() then divides the vector on those block ids:
split(weeks, f = (weeks - 1) %/% 3)
$`0`
[1] 1 1 1 2 2 3 3
$`1`
[1] 4 5 5 6 6 6
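To answer the rep() question directly: rep() can build the period labels if you first count how many observations fall into each three-week block. A minimal sketch of that idea, using table() to get the counts (not part of the original answer):
counts <- table((weeks - 1) %/% 3)
rep(seq_along(counts), as.integer(counts))
# [1] 1 1 1 1 1 1 1 2 2 2 2 2 2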
From the comments, to add the period column itself:
weeks <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6)
df <- data.frame(weeks)
library(data.table)
df$period <- data.table::rleid((weeks - 1) %/% 3)
# weeks period
# 1 1 1
# 2 1 1
# 3 1 1
# 4 2 1
# 5 2 1
# 6 3 1
# 7 3 1
# 8 4 2
# 9 5 2
# 10 5 2
# 11 6 2
# 12 6 2
# 13 6 2
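Since the real data has 207 consecutive weeks, the period labels can also be computed directly from the week numbers, without rleid(). A minimal sketch (note that, unlike rleid(), this keeps calendar periods aligned even if some period had no observations):
df$period <- (df$weeks - 1) %/% 3 + 1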

Related

Sorting specific columns of a dataframe by their names in R

df is a test dataframe and I need to sort the last three columns in ascending order (without hardcoding the order).
df <- data.frame(X = c(1, 2, 3, 4, 5),
                 Z = c(1, 2, 3, 4, 5),
                 Y = c(1, 2, 3, 4, 5),
                 A = c(1, 2, 3, 4, 5),
                 C = c(1, 2, 3, 4, 5),
                 B = c(1, 2, 3, 4, 5))
Desired output:
> df
X Z Y A B C
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 3 3 3 3 3 3
4 4 4 4 4 4 4
5 5 5 5 5 5 5
I'm aware of the order() function but I can't seem to find the right way to implement it to get the desired output.
Update:
Base R:
cbind(df[1:3], df[4:6][, order(colnames(df[4:6]))])
First answer:
We could use relocate from dplyr:
https://dplyr.tidyverse.org/reference/relocate.html
It is designed for rearranging columns. Here we relocate by index: we take the last column (index 6, which is B) and put it before position 5 (which is C).
library(dplyr)
df %>%
relocate(6, .before = 5)
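A more general variant of the same relocate() idea, assuming the goal is always "sort the last three columns by name" rather than moving one known column (a sketch, not from the original answer):
library(dplyr)
df %>%
  relocate(all_of(sort(names(df)[4:6])), .after = 3)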
An alternative:
library(dplyr)
df %>%
select(order(colnames(df))) %>%
relocate(4:6, .before = 1)
Note that this sorts all of the column names first, so the first three columns come back as X Y Z rather than the original X Z Y:
X Y Z A B C
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 3 3 3 3 3 3
4 4 4 4 4 4 4
5 5 5 5 5 5 5
In base R, a selection of the first columns, then sorting the last three names:
df[, c(names(df)[1:(ncol(df)-3)], sort(names(df)[ncol(df)-2:0]))]
We want to reorder the columns based on the column names, so names(df) is the natural thing to work with. The complicating factor is that order() returns a vector of positions rather than names, and we need to keep the first three columns in their original order. We accomplish this by creating a vector of the first three column names, followed by the remaining column names sorted with sort() (which returns the values rather than their locations in the vector), and then use this vector with the [ form of the extract operator.
df <- data.frame(X = c(1, 2, 3, 4, 5),
                 Z = c(1, 2, 3, 4, 5),
                 Y = c(1, 2, 3, 4, 5),
                 A = c(1, 2, 3, 4, 5),
                 C = c(1, 2, 3, 4, 5),
                 B = c(1, 2, 3, 4, 5))
df[,c(names(df[1:3]),sort(names(df[4:6])))]
...and the output:
> df[,c(names(df[1:3]),sort(names(df[4:6])))]
X Z Y A B C
1 1 1 1 1 1 1
2 2 2 2 2 2 2
3 3 3 3 3 3 3
4 4 4 4 4 4 4
5 5 5 5 5 5 5
# flag the last three columns
to_order <- seq(ncol(df)) > ncol(df) - 3
# rank() gives each name's alphabetical rank (order() only coincidentally
# matches it here); zeroing out the unflagged columns makes order() keep
# them first, in their original order, and sort only the flagged ones
df[order(to_order * rank(names(df)))]
#> X Z Y A B C
#> 1 1 1 1 1 1 1
#> 2 2 2 2 2 2 2
#> 3 3 3 3 3 3 3
#> 4 4 4 4 4 4 4
#> 5 5 5 5 5 5 5
Created on 2021-12-24 by the reprex package (v2.0.1)

Counting occurrences for each element without any loop

I have the following vector V1 <- c(1, 2, 3, 1, 4, 5, 5, 2, 1).
I would like to have a vector V2 of the same size that for each element indicates how often the corresponding element in V1 occurs, i.e. V2 should be c(3, 2, 1, 3, 1, 2, 2, 2, 3).
I know a multitude of ways to obtain this result; however, all of them involve a loop, e.g.
V2 <- numeric(length(V1))
for (k in seq_along(V1)) {
  V2[k] <- length(which(V1 == V1[k]))
}
My question is: Is there a more elegant way of doing it, i.e. without using a loop?
Summing up the solutions provided
# @bouncyball
ave(V1, V1, FUN = length)
# [1] 3 2 1 3 1 2 2 2 3
# @AntoniosK
table(V1)[V1]
# V1
# 1 2 3 1 4 5 5 2 1
# 3 2 1 3 1 2 2 2 3
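Note that table(V1)[V1] indexes the table by position, which lines up here only because V1's values happen to be exactly 1 through 5. Indexing by name is safer for arbitrary values (a small added variant, not part of the original answers):
as.vector(table(V1)[as.character(V1)])
# [1] 3 2 1 3 1 2 2 2 3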
# @Roman (as.tibble() is deprecated; tibble() does the same job)
library(tidyverse)
tibble(value = V1) %>% group_by(value) %>% mutate(V2 = n())
# A tibble: 9 x 2
# Groups: value [5]
# value V2
# <dbl> <int>
# 1 1 3
# 2 2 2
# 3 3 1
# 4 1 3
# 5 4 1
# 6 5 2
# 7 5 2
# 8 2 2
# 9 1 3

Count the occurrence of one vector's values in another vector, including non-matching values, in R

I have 2 vectors:
v1 <- c(1, 2, 3, 4, 1, 3, 5, 6, 4)
v2 <- c(1, 2, 3, 4, 5, 6, 7)
I want to count how often each value of v2 occurs in v1. The expected result is:
1 2 3 4 5 6 7
2 1 2 2 1 1 0
I know there is a function that can do this:
table(v1[v1 %in% v2])
However, it only lists the matched values:
1 2 3 4 5 6
2 1 2 2 1 1
How can I show all the values in v2?
You can do
table(factor(v1, levels=unique(v2)))
# 1 2 3 4 5 6 7
# 2 1 2 2 1 1 0
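For comparison, the same counts can be computed without factor(), by counting matches per value of v2 (an alternative sketch, not from the original answer):
sapply(setNames(v2, v2), function(x) sum(v1 == x))
# 1 2 3 4 5 6 7
# 2 1 2 2 1 1 0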

(tidyverse) Extract substring groupwise for each string in column given positions to include

I have a dataframe of DNA alignments. Each alignment has a label and can be composed of 3 or more isolates. My goal is to mutate the alignment column to remove the positions that are all gaps (denoted by "-") in isolates 1, 3, and 4 of each alignment. All alignments will always contain isolates 1, 3, and 4, and sometimes only those three will be in the alignment.
what I have:
test_df <- data.frame(isolate = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
                      label = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
                      alignment = c("--atc-a", "at----a", "--ataga", "--attga",
                                    "a---ggg", "acgttgg", "a---tgg", "a---tgg", "aggatgg"))
> test_df
isolate label alignment
1 1 1 --atc-a
2 2 1 at----a
3 3 1 --ataga
4 4 1 --attga
5 1 2 a---ggg
6 2 2 acgttgg
7 3 2 a---tgg
8 4 2 a---tgg
9 5 2 aggatgg
what I want:
> test_df
isolate label alignment
1 1 1 atc-a
2 2 1 ----a
3 3 1 ataga
4 4 1 attga
5 1 2 aggg
6 2 2 atgg
7 3 2 atgg
8 4 2 atgg
9 5 2 atgg
what I've tried:
I can get a list of sites that I want to keep for each alignment like so:
library(tidyverse)
library(stringr)
test_df %>%
  mutate(positions = str_locate_all(alignment, "[^-]")) %>%
  group_by(label) %>%
  filter(isolate %in% c(1, 3, 4)) %>%
  summarise(pos_to_keep = list(unique(unlist(Reduce(rbind, positions)))))
but then I'm not sure how to proceed in order to slice all the alignments.
This is one way to get to your solution; there might be a quicker way out there.
library(dplyr)
library(stringr)  # needed for str_locate_all()
test_df <- data.frame(isolate = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
                      label = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
                      alignment = c("--atc-a", "at----a", "--ataga", "--attga",
                                    "a---ggg", "acgttgg", "a---tgg", "a---tgg", "aggatgg"),
                      stringsAsFactors = FALSE)
# Get the correct positions
labelGroups <- test_df %>%
  mutate(positions = str_locate_all(alignment, "[^-]")) %>%
  filter(isolate %in% c(1, 3, 4)) %>%
  group_by(label) %>%
  summarise(pos_to_keep = list(unique(sort(unlist(positions)))))
# Make a function to extract the relevant letters
getletters <- function(wordlist, indexlist) {
  n <- length(indexlist)
  lapply(seq_len(n), function(i) {
    # pick out the kept positions of word i and glue them back together
    paste0(sapply(indexlist[[i]], function(x) substr(wordlist[i], x, x)), collapse = "")
  })
}
# Try it
test_df %>%
  left_join(labelGroups, by = "label") %>%
  mutate(newAlignment = getletters(alignment, pos_to_keep))
# isolate label alignment pos_to_keep newAlignment
# 1 1 1 --atc-a 3, 4, 5, 6, 7 atc-a
# 2 2 1 at----a 3, 4, 5, 6, 7 ----a
# 3 3 1 --ataga 3, 4, 5, 6, 7 ataga
# 4 4 1 --attga 3, 4, 5, 6, 7 attga
# 5 1 2 a---ggg 1, 5, 6, 7 aggg
# 6 2 2 acgttgg 1, 5, 6, 7 atgg
# 7 3 2 a---tgg 1, 5, 6, 7 atgg
# 8 4 2 a---tgg 1, 5, 6, 7 atgg
# 9 5 2 aggatgg 1, 5, 6, 7 atgg
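An alternative sketch that avoids the join entirely: split each alignment into a character matrix per label, keep the columns that are not all gaps in isolates 1, 3, and 4, and paste the kept columns back together. The helper drop_all_gap() below is hypothetical, not from the original answer:
library(dplyr)

drop_all_gap <- function(aln, iso) {
  # one row per sequence, one column per alignment position
  m <- do.call(rbind, strsplit(aln, ""))
  ref <- m[iso %in% c(1, 3, 4), , drop = FALSE]
  # keep positions that are not gaps in every reference isolate
  keep <- colSums(ref == "-") < nrow(ref)
  apply(m[, keep, drop = FALSE], 1, paste, collapse = "")
}

test_df %>%
  group_by(label) %>%
  mutate(alignment = drop_all_gap(alignment, isolate)) %>%
  ungroup()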

Deleting incomplete cases across multiple rows in RStudio

Say I have a longitudinal data set as below
ID <- c(1, 1, 2, 2, 3, 3, 4, 4)
time <- c(1, 2, 1, 2, 1, 2, 1, 2)
value <- c(7, 5, 9, 2, NA, 3, 7, NA)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
In this data set, we have 4 cases with data at two time points (let's say pre and post treatment).
Something I want to do is set criteria to delete any case that is not complete for both time points. In this example, I would want to delete ID 3 (who is missing time point 1) and ID 4 (who is missing time point 2), like below:
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
I am not having much luck. I've tried variants of complete.cases() and which() to no avail.
I'm still new to R and would be hugely appreciative if anyone could help me out.
Edit: Thank you Ronak for answering my question. Upon reflection of my real data, I have encountered a second problem. My actual data is more reflected by the below:
ID <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8)
time <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1)
value <- c(7, 5, 9, 2, NA, 3, 7, NA, 8, 9, 7, 6)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
where I would also want to remove cases 5, 6, 7, and 8. These IDs have an entry for time 1 but not time 2. Hopefully this makes sense.
Thanks a heap
If you switch your data to wide format (where each time point is represented as its own column), then you can use na.omit. Using dplyr and tidyr functions:
library(dplyr)
mydata <- mydata %>%
tidyr::spread(key=time, value=value) %>% # reformat to wide
na.omit() %>% # delete cases with missingness on any variable (i.e. any time point)
tidyr::gather(key="time", value="value", -ID) # put it back in long format
> mydata
ID time value
1 1 1 7
2 2 1 9
3 1 2 5
4 2 2 2
Note that this will work (it will keep only cases with complete data for both time 1 and time 2) even when you have a time point missing without an explicit NA present in the data, like this:
> mydata
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
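spread() and gather() have since been superseded in tidyr; the same idea written with their replacements, pivot_wider() and pivot_longer(), would look roughly like this:
library(dplyr)
library(tidyr)
mydata %>%
  pivot_wider(names_from = time, values_from = value) %>%
  na.omit() %>%
  pivot_longer(-ID, names_to = "time", values_to = "value")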
You can do this easily with sqldf.
library(sqldf)
sqldf('select *
       from (select ID, count(*) as cnt
             from mydata
             where value is not null
             group by ID
             having cnt > 1) t1
       inner join mydata t2 on t1.ID = t2.ID')
This selects the IDs that have a count greater than 1 and no NA among their values, and then joins back to the original data.
@Ronak already provided
mydata[!mydata$ID %in% mydata$ID[is.na(mydata$value)], ]
For the second part, you can group over each ID and filter on its frequency:
k2 <- data.frame(table(mydata$ID))
k2$Var1[k2$Freq > 1]
and then do something like
mydata[mydata$ID %in% k2$Var1[k2$Freq > 1],]
See the updated answer
# Eliminates ID cases with NA
mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]
library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
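A dplyr equivalent of the same logic, dropping groups with any NA and keeping only IDs observed at both time points, could be sketched as:
library(dplyr)
mydata %>%
  group_by(ID) %>%
  filter(!any(is.na(value)), n() == 2) %>%
  ungroup()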
