R: How to automatically create flag variables for sequences of values? - r

Suppose you're given the following data frame:
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"))
I want to create the variables flag_1, ..., flag_7 in a, each holding the number of times the respective value occurs. For a, I would expect the following result:
var flag_1 flag_2 flag_3 flag_4 flag_5
",1,2,3," 1. 1. 1. 0. 0.
",2,3,5," 0. 1. 1. 0. 1.
",1,3,5,5," 1. 0. 1. 0. 2.
I managed to get the result using a nested for-loop and an if-condition but there must be a nicer (more aesthetic and better performing) solution.

One option would be to do strsplit, get the table and then cbind with original data
cbind(a, do.call(rbind, lapply(strsplit(as.character(a$var), ","),
function(x) table(factor(x[nzchar(x)], levels = 1:5, labels = paste0("flag_", 1:5))))))
# var flag_1 flag_2 flag_3 flag_4 flag_5
#1 ,1,2,3, 1 1 1 0 0
#2 ,2,3,5, 0 1 1 0 1
#3 ,1,3,5,5, 1 0 1 0 2
Another option is with tidyverse
library(tidyverse)
str_extract_all(a$var, "[0-9]") %>%
map(~ as.integer(.x) %>%
as_tibble) %>%
bind_rows(.id = 'grp') %>%
count(grp, value = factor(value, levels = min(value):max(value))) %>%
spread(value, n, drop = FALSE, fill = 0) %>%
select(-grp) %>%
bind_cols(a, .) %>%
rename_at(vars(matches("^[0-9]+$")), ~ paste0("flag_", .))
# var flag_1 flag_2 flag_3 flag_4 flag_5
#1 ,1,2,3, 1 1 1 0 0
#2 ,2,3,5, 0 1 1 0 1
#3 ,1,3,5,5, 1 0 1 0 2

First, don't make the strings into factors. Nothing good comes from that.
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"),
stringsAsFactors = FALSE)
Getting from the strings to your table is simple enough if we take it in small steps. Here, I've written (or renamed) a function per step and then gone through the steps using lapply one at a time. You can string it all together in a pipeline if you like, but it would be roughly these steps.
First, I extract the numbers from the strings. That involves splitting on commas and getting rid of empty strings; you get those because the strings begin and end with a comma, and otherwise that step wouldn't be necessary. Then we translate the strings into numbers and count how often we see each (with the as.numeric and table functions, respectively). After that, it is just a question of mapping the observed counts into a vector that also includes the values we haven't observed.
pick_indices <- function(str) unlist(strsplit(str, split = ","))
remove_empty <- function(chrs) chrs[nchar(chrs) > 0]
get_indices <- as.numeric
to_counts <- table
to_flag_vect <- function(counts, len) {
vec <- rep(0, len)
names(vec) <- 1:len
vec[names(counts)] <- counts
vec
}
strings <- lapply(a$var, pick_indices)
cleaned <- lapply(strings, remove_empty)
indices <- lapply(cleaned, get_indices)
counts <- lapply(indices, to_counts)
flags <- lapply(counts, to_flag_vect, len = 5)
We now have the flag-counts in a list, so to make it into the table you want, with the column names you want, we simply do this:
tbl <- do.call(rbind, flags)
colnames(tbl) <- paste0("flag_", 1:5)
tbl
Done.
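The steps above can also be collapsed into a single helper. The following is only a minimal sketch, not the answer's own code: count_flags is a hypothetical name, and it swaps table for base R's tabulate(), which assumes all values are positive integers no larger than len.

```r
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"),
                stringsAsFactors = FALSE)

# Hypothetical helper: count how often each of 1..len occurs in one string
count_flags <- function(str, len) {
  parts <- strsplit(str, ",", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]             # drop the empty leading piece
  tabulate(as.integer(parts), nbins = len)  # counts for 1..len, zeros included
}

# vapply returns one column per input string, so transpose to one row per string
flags <- t(vapply(a$var, count_flags, integer(5), len = 5, USE.NAMES = FALSE))
colnames(flags) <- paste0("flag_", 1:5)
result <- cbind(a, flags)
result
```

Because tabulate() always returns a vector of length nbins, no separate step is needed to fill in the values that were not observed.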

Split and unlist the values into a factor with appropriate levels
x = strsplit(a$var, ",")
xp = factor(unlist(x), levels = seq_len(5))
Create an index that maps the values of xp to the rows they came from
i = rep(seq_along(x), lengths(x))
use xtabs() to cross-tabulate the entries by row
xt = xtabs(~ i + xp)
and cbind() the matrix representation of the result to the original
> cbind(a, unclass(xt))
var 1 2 3 4 5
1 ,1,2,3, 1 1 1 0 0
2 ,2,3,5, 0 1 1 0 1
3 ,1,3,5,5, 1 0 1 0 2

Related

How to perform the equivalent of Excel rolling sumifs in dplyr?

In the below reproducible code, I would like to add a column for SumIfs using dplyr as detailed in the below image, whereby the Excel sumifs() formula in column H of the image has conditions with the tops of the specified ranges "anchored", for a "rolling" calculation as you move down row-wise. Any recommendations for how to do the same in dplyr? I'm sure it requires grouping but unsure of how to handle conditions. The blue below shows the current reproducible code output, the yellow shows what I would like to add, and the non-highlighted shows the underlying XLS formulas.
Now using my words: to derive Sumifs, for each row one-at-a-time rolling from top-to-bottom of the array sequentially, sum all values in column D from the top of the column D range to the current row in the Column D range that have a column C "Code1" value less than the current row column C "Code1" value. So for example in deriving the value of 3 in cell G6: add the 1 in cell D3 (because its Code1 of 0 (cell C3) is < Code1 of 3 (cell C6)) to the 2 in cell D5 (because its Code1 of 1 (cell C5) is < Code1 of 3 (cell C6)).
Reproducible code:
library(dplyr)
myData <-
data.frame(
Name = c("B","R","R","R","R","B","A","A","A"),
Group = c(0,1,1,2,2,0,0,0,0),
Code1 = c(0,1,1,3,3,4,-1,0,0),
Code2 = c(1,0,2,0,1,2,1,0,0)
)
CountIfs <- function(x,y) {
out <- integer(length(x))
for(i in seq_along(x)) {
cond1 <- y[1:i] > 0
cond2 <- x[1:i] == x[i]
out[i] <- sum(cond1*cond2)
}
out
}
myDataRender <-
myData %>%
mutate(CountIfs = CountIfs(Code1, Code2))
print.data.frame(myDataRender)
Tsai's solution can be adapted to situations where both the top and the bottom of the XLS sumifs() ranges are anchored (fixed, not rolling), where the first XLS formula in the image would be =SUMIFS(D$3:D$11,C$3:C$11,"<"&C3). For those of us transitioning from XLS to R:
myData %>% mutate(SumIfs = sapply(1:n(), function(x) sum(Code2[1:n()][Code1[1:n()] < Code1[x]])))
You could use map() or imap() from purrr:
library(dplyr)
library(purrr)
# (1)
myData %>%
mutate(SumIfs = map_dbl(1:n(), ~ sum(Code2[1:.x][Code1[1:.x] < Code1[.x]])))
# (2)
myData %>%
mutate(SumIfs = imap_dbl(Code1, ~ sum(Code2[1:.y][Code1[1:.y] < .x])))
# Name Group Code1 Code2 SumIfs
# 1 B 0 0 1 0
# 2 R 1 1 0 1
# 3 R 1 1 2 1
# 4 R 2 3 0 3
# 5 R 2 3 1 3
# 6 B 0 4 2 4
# 7 A 0 -1 1 0
# 8 A 0 0 0 1
# 9 A 0 0 0 1
If you don't want to rely on purrr, the map() solution can be adapted directly for the base sapply() version:
myData %>%
mutate(SumIfs = sapply(1:n(), \(x) sum(Code2[1:x][Code1[1:x] < Code1[x]])))
Here is another way using map2_dbl() with the row number.
library(dplyr)
library(purrr)
myData %>%
mutate(SumIfs = map2_dbl(Code1, row_number(),
~ sum(if_else(Code1 < .x & row_number() <= .y, Code2, 0))))
Also using base Map(), this will scale to as many criteria as you want.
library(dplyr)
myData %>%
mutate(SumIfs = unlist(Map(\(x, y) sum(if_else(Code1 < x & row_number() <= y, Code2, 0)),
Code1, row_number())))
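For comparison with the CountIfs loop in the question, the same rolling rule can be written as a plain base R function. This is a sketch under the question's stated rule (sum Code2 from the top down to the current row wherever Code1 is less than the current row's Code1); rolling_sumif is a hypothetical name, not something from the answers above.

```r
myData <- data.frame(
  Code1 = c(0, 1, 1, 3, 3, 4, -1, 0, 0),
  Code2 = c(1, 0, 2, 0, 1, 2, 1, 0, 0)
)

# Hypothetical helper: for each row i, sum values[1:i] where keys[1:i] < keys[i]
rolling_sumif <- function(keys, values) {
  out <- numeric(length(keys))
  for (i in seq_along(keys)) {
    window <- seq_len(i)  # rows from the top down to the current row
    out[i] <- sum(values[window][keys[window] < keys[i]])
  }
  out
}

myData$SumIfs <- rolling_sumif(myData$Code1, myData$Code2)
myData
```

The loop does the same work as the sapply() and map_dbl() versions; it is mainly useful when the condition needs to be stepped through while debugging.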

looping within a variable in panel data using loop in R

I have a panel data like id <- c(1,1,1,1,1,1,1,2,2,2,2,2), intm <- c(1,1,0,0,1,0,0,0,0,0,1,1). The data frame is like
dta <- data.frame(cbind(id,intm)) which gives:
id intm
1 1 1
2 1 1
3 1 0
4 1 0
5 1 1
6 1 0
7 1 0
8 2 0
9 2 0
10 2 0
11 2 1
12 2 1
I would like to replace the subsequent values of "intm" variable by the first value within the ID variable. That is for ID=1, the first value is 1, so the intm should have all values as 1 and for ID=2, intm should have all values as 0. The data should be like
id <- c(1,1,1,1,1,1,1,2,2,2,2,2),intm <- c(1,1,1,1,1,1,1,0,0,0,0,0) with data frame
dta <- data.frame(cbind(id,intm))
How can I do this in R by looping or any other means? I have a big data set.
Consider the following code:
new_column <- c(); i <- 1; # new column to be created
# loop
for (j in unique(dta$id)){ # let's separate the unique values of ID
index <- which(dta$id==j) # which row index satisfy id==1, or id==2, ...?
value <- dta$intm[index[1]] # which value of intm corresponds to the first value of the index?
new_column[i:tail(index,n = 1)] <- rep(value,nrow(dta[dta$id==j,])) # repeat this value once for every row that has this ID
i <- tail(index,n = 1)+1 # the new_column component must start with its last value + 1
}
dta <- cbind(dta,new_column)
Alternatively, you can use the subset() function, i.e
rep(value,nrow(subset(dta,dta$id==j)))
You can use dplyr to do this.
There are other fancier ways to do this but I think dplyr is more graceful.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
df = data.frame(id, intm)
library(dplyr)
df2 = df %>% group_by(id) %>% do({
.$intm = .$intm[1, drop = TRUE]
.
})
You can also try the data.table library, which should be faster.
BTW: you do not need cbind to make a data.frame.
Consider ave + head
dta$intm <- with(dta, ave(intm, id, FUN=function(x) head(x, 1)))
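A fully vectorised base R variant is also possible. The sketch below relies on the fact that match(x, x) returns, for every element, the position of its first occurrence; first_row is a name introduced here for illustration only.

```r
id   <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
intm <- c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1)
dta  <- data.frame(id, intm)

# For each row, the index of the first row that has the same id
first_row <- match(dta$id, dta$id)

# Overwrite intm with the value found at that first row
dta$intm <- dta$intm[first_row]
dta
```

This avoids both the explicit loop and the per-group function call, which may matter on a big data set, and it does not require the rows to be sorted by id.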

summarize results on a vector of different length of the original - Pivot table r

I would like to use the vector:
time.int<-c(1,2,3,4,5) #vector to be use as a "guide"
and the database:
time<-c(1,1,1,1,5,5,5)
value<-c("s","s","s","t","d","d","d")
dat1<- as.data.frame(cbind(time,value))
to create the following vector, which I can then add to the first vector "time.int" into a second database.
freq<-c(4,0,0,0,3) #wished result
This vector counts the events that belong to each time interval: there are four 1s in "time", so the first value is 4, and so on.
Ideally I would like to generalize it so that I can choose the interval width, for example counting in a new vector the events in "time" over every 3 values of time.int.
EDIT for generalization
time.int<-c(1,2,3,4,5,6)
time<-c(1,1,1,2,5,5,5,6)
value<-c("s","s","s","t", "t","d","d","d")
dat1<- data.frame(time,value)
let's say I want it every 2 seconds (every 2 time.int)
freq<-c(4,0,4) #wished result
or every 3
freq<-c(4,4) #wished result
I know how to do that in excel, with a pivot table.
Sorry if this is a duplicate; I could not find a fitting question on this website, and I do not even know how to phrase this or where to start.
The following will produce vector freq.
freq <- sapply(time.int, function(x) sum(x == time))
freq
[1] 4 0 0 0 3
BTW, don't use the construct as.data.frame(cbind(.)). Use instead
dat1 <- data.frame(time,value)
In order to generalize the code above to segments of time.int of any length, I believe the following function will do it. Note that since you've changed the data the output for n == 1 is not the same as above.
fun <- function(x, y, n){
inx <- lapply(seq_len(length(x) %/% n), function(m) seq_len(n) + n*(m - 1))
sapply(inx, function(i) sum(y %in% x[i]))
}
freq1 <- fun(time.int, time, 1)
freq1
[1] 3 1 0 0 3 1
freq2 <- fun(time.int, time, 2)
freq2
[1] 4 0 4
freq3 <- fun(time.int, time, 3)
freq3
[1] 4 4
We can use the table function to count the event number and use merge to create a data frame summarizing the information. event_dat is the final output.
# Create example data
time.int <- c(1,2,3,4,5)
time <- c(1,1,1,1,5,5,5)
# Count the event using table and convert to a data frame
event <- as.data.frame(table(time))
# Convert the time.int to a data frame
time_dat <- data.frame(time = time.int)
# Merge the data
event_dat <- merge(time_dat, event, by = "time", all = TRUE)
# Replace NA with 0
event_dat[is.na(event_dat)] <- 0
# See the result
event_dat
time Freq
1 1 4
2 2 0
3 3 0
4 4 0
5 5 3
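The counting and the "every n values" generalization can also be combined using tabulate() and tapply(). This is a sketch with a hypothetical helper name, count_by_block. Unlike fun above, it keeps a final shorter block when length(guide) is not a multiple of n.

```r
time.int <- c(1, 2, 3, 4, 5, 6)
time     <- c(1, 1, 1, 2, 5, 5, 5, 6)

# Hypothetical helper: count events per block of n consecutive guide values
count_by_block <- function(guide, events, n) {
  # Count events per guide value; zeros are kept for values never observed
  per_value <- tabulate(match(events, guide), nbins = length(guide))
  # Assign guide positions 1..n to block 1, n+1..2n to block 2, and so on
  block <- ceiling(seq_along(guide) / n)
  as.vector(tapply(per_value, block, sum))
}

count_by_block(time.int, time, 1)
count_by_block(time.int, time, 2)
count_by_block(time.int, time, 3)
```

With the edited example data, these calls should give the same counts as freq1, freq2 and freq3 above.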

Concatenating positions into genomic segments

I would like to concatenate all rows which have a similarity score of more than 0.955. The Abo and Bel columns represent the similarity score with the rows above and below, respectively. In the following input df I have 10 genomic probes (NAME column), which are concatenated into just 4 genomic segments (dfout).
df <- " NAME Abo Bel Chr GD Position
BovineHD0100009217 NA 1.0000000 1 0 31691781
BovineHD0100009218 1.0000000 0.6185430 1 0 31695808
BovineHD0100019600 0.6185430 0.9973510 1 0 69211537
BovineHD0100019601 0.9973510 1.0000000 1 0 69213650
BovineHD0100019602 1.0000000 1.0000000 1 0 69214650
BovineHD0100019603 1.0000000 0.6600000 1 0 69217942
BovineHD0100047112 0.6600000 1.0000000 1 0 93797691
BovineHD0100026604 1.0000000 1.0000000 1 0 93815774
BovineHD0100026605 1.0000000 0.4649007 1 0 93819471
BovineHD0100029861 0.4649007 NA 1 0 105042452"
df <- read.table(text=df, header=T)
My expected output dfout:
dfout <- "Chr start end startp endp nprob
1 31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
1 69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
1 93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
1 105042452 105042452 BovineHD0100029861 BovineHD0100029861 1"
dfout <- read.table(text=dfout, header=T)
Any ideas?
I couldn't think of any pretty solution using basic dataframe manipulation, so here's a bad-looking one that works:
First, add stringsAsFactors to df creation:
df <- read.table(text=df, header=T, stringsAsFactors = FALSE)
start <- df$Position[1]
end <- integer()
output <- NULL
count <- 1
for (i in 1:(nrow(df)-1)) {
if(df$Bel[i] < 0.955) {
end <- df$Position[i]
output <- rbind(output, c(start, end, df$NAME[which(df$Position == start)], df$NAME[which(df$Position == end)], count))
start <- df$Position[i+1]
count <- 0
}
count <- count + 1
}
end <- df$Position[nrow(df)]
output <- as.data.frame(rbind(output, c(start, end, df$NAME[which(df$Position == start)], df$NAME[which(df$Position == end)], count)))
colnames(output) <- c("start", "end", "startp", "endp", "nprob")
The basic idea here is to loop through the rows and check whether the next row should be added to the current segment (Bel >= 0.955) or whether a new segment should start (Bel < 0.955). When a new segment has to be started, the end row is defined, the finished segment is added to the output, and the start of the next segment is set. A counter tracks the number of rows used to create each segment (nprob).
Finally the last segment is added, outside the for loop, and the output receives its column names and is converted to a dataframe. I did not use Chr because 1. They are all equal, 2. if they weren't you didn't give any way to choose/summarize them.
Result:
> output
start end startp endp nprob
1 31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
2 69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
3 93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
4 105042452 105042452 BovineHD0100029861 BovineHD0100029861 1
I'm pretty sure that you or someone else can work on this to make it shorter and more concise.
Here is a dplyr version. First we need to define the groups, which is what the mutate step does; then a simple summarise within the groups gives the result.
library(dplyr)
df %>%
mutate(
Abo955=ifelse(Abo<0.955,NA,Abo),
myGroup=cumsum(is.na(Abo955)*1)) %>%
group_by(myGroup) %>%
summarise(
Chr=min(Chr),
start=min(Position),
end=max(Position),
startp=first(NAME),
lastp=last(NAME),
nprob=n()) %>%
select(-myGroup)
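The same grouping idea (start a new group whenever Abo is missing or below the cutoff) can be sketched in base R without dplyr. The names new_segment, seg, first_row, last_row and segments are introduced here for illustration. The sketch assumes, as the example does, that the rows are already ordered by position and that a segment never spans two chromosomes.

```r
txt <- "NAME Abo Bel Chr GD Position
BovineHD0100009217 NA 1.0000000 1 0 31691781
BovineHD0100009218 1.0000000 0.6185430 1 0 31695808
BovineHD0100019600 0.6185430 0.9973510 1 0 69211537
BovineHD0100019601 0.9973510 1.0000000 1 0 69213650
BovineHD0100019602 1.0000000 1.0000000 1 0 69214650
BovineHD0100019603 1.0000000 0.6600000 1 0 69217942
BovineHD0100047112 0.6600000 1.0000000 1 0 93797691
BovineHD0100026604 1.0000000 1.0000000 1 0 93815774
BovineHD0100026605 1.0000000 0.4649007 1 0 93819471
BovineHD0100029861 0.4649007 NA 1 0 105042452"
df <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# A row opens a new segment when its score with the row above is NA or too low
new_segment <- is.na(df$Abo) | df$Abo < 0.955
seg <- cumsum(new_segment)  # running segment id, one value per row

first_row <- !duplicated(seg)                   # first row of each segment
last_row  <- !duplicated(seg, fromLast = TRUE)  # last row of each segment

segments <- data.frame(
  Chr    = df$Chr[first_row],
  start  = df$Position[first_row],
  end    = df$Position[last_row],
  startp = df$NAME[first_row],
  endp   = df$NAME[last_row],
  nprob  = as.vector(table(seg))  # number of probes per segment
)
segments
```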
This solution is purely based on logical vectors and works with the provided example.
As Molx said, let's add stringsAsFactors=F
df <- read.table(text=df, header=T, stringsAsFactors = F)
And just so that the logical evaluations work, let's change the NAs to 0s
df[is.na(df)] <- 0
Now, for the consecutive rows that will be concatenated, let's find the "start" and "end" rows using logical evaluations
starts <- df$Bel >= 0.955 & df$Abo < 0.955
ends <- df$Bel < 0.955 & df$Abo >= 0.955
With this we can already construct a data.frame concatenating rows that need to be concatenated
concatenated <- data.frame(Chr = df[starts, "Chr"],
start = df[starts, "Position"],
end = df[ends, "Position"],
startp = df[starts, "NAME"],
endp = df[ends, "NAME"],
nprob = which(ends) - which(starts) + 1
)
And let's also construct a data.frame with the rows that are not concatenated, i.e. the ones that do not have the desired similarity score with either the row above or the row below
notConcatenate <- df$Abo < 0.955 & df$Bel < 0.955
non_concatenated <- data.frame(Chr = df[notConcatenate, "Chr"],
start = df[notConcatenate, "Position"],
end = df[notConcatenate, "Position"],
startp = df[notConcatenate, "NAME"],
endp = df[notConcatenate, "NAME"],
nprob = 1
)
And finally bind the two data.frames
dfout <- rbind(concatenated,non_concatenated)
Resulting in
> dfout
Chr start end startp endp nprob
1 1 31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
2 1 69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
3 1 93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
4 1 105042452 105042452 BovineHD0100029861 BovineHD0100029861 1
NOTE: This code assumes that correlated probes are within the same chromosome
Cheers!

Providing variable name in function call and cbinding to existing data frame in R

The goal is to have the last argument of the function call to provide the name for the new column to be bound to the original data frame.
Referring to this and this previous question and building upon the minimal working example of the first.
GroupId <- c(1,1,1,1,2,2,2,3,3)
IndId <- c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <- c(1,2,1,2,2,2,1,2,2)
df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)
df
ValidGroupC <- c(1,1,1,1,0,0,0,0,0)
df <- data.frame(df, ValidGroupC)
df
library(dplyr)
grouptest <- function(object, group, ind, type, new){
groupvar <- deparse(substitute(group))
indvar <- deparse(substitute(ind))
typevar <- deparse(substitute(type))
eval(substitute(
tmp <- object[, c(groupvar, indvar, typevar)] %.%
group_by(group, ind) %.%
mutate(type1 = any(type == 1)) %.%
group_by(group, add = FALSE) %.%
mutate(tmp2 = all(type1) * 1) %.%
select(-type1)
))
new <- tmp[, 4] # this is the relevant part
tmp <- cbind(object, new) # this is the relevant part
}
df <- grouptest(df, GroupId, IndId, PropertyType, ValidGroup)
df
So most of the code is already a product of the referenced questions. The relevant part for this question is at the end, where I take the 4th column of the calculations stored in tmp and put it in a new object, whose name should be taken from the new argument in the function call, and which I then bind to the original data frame.
My question: why is the last column of the final df not named ValidGroup ?
I don't get what is wrong - new should be replaced by ValidGroup, but it isn't?
I have tried putting the two lines inside the eval(), which results in Error in cbind(df, ValidGroup) : object 'ValidGroup' not found.
I have tried putting another eval(substitute()) around the two lines, same Error.
I have tried numerous other variations of where to put the lines, using a deparsed newvar, naming the tmp also new, . . .
You want to change the last two lines highlighted in your function to:
object[, new] <- tmp[, 4]
object
Then, when you call the function specify the new argument as a character string:
> df <- grouptest(df, GroupId, IndId, PropertyType, "ValidGroup")
> df
GroupId IndId IndGroupProperty PropertyType ValidGroupC ValidGroup
1 1 1 1 1 1 1
2 1 1 2 2 1 1
3 1 2 1 1 1 1
4 1 2 2 2 1 1
5 2 3 3 2 0 0
6 2 4 3 2 0 0
7 2 4 4 1 0 0
8 3 5 5 2 0 0
9 3 5 6 2 0 0
If the object is always a data.frame, why don't you simply make a new one?
tmp <- data.frame(object, new=tmp[,4])
names(tmp)[ncol(tmp)] <- as.character(match.call()$new)
return(tmp)
Edit: Changed the code to accept a name instead of a character string for the argument new. I still don't think this is a good idea, though. You should at least have an optional argument to switch the second line to just names(tmp)[ncol(tmp)] <- new, in line with @hadley's reasoning in this thread.
I suspect that you are looking for the assign function:
assign(deparse(substitute(new)), tmp[,4])
So apparently I misunderstood the question. Here's another approach. Instead of using cbind, you can just add a new column to your existing object.
object[, deparse(substitute(new))] <- tmp[,4]
object
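To see the deparse(substitute()) idea in isolation, here is a minimal self-contained sketch. add_column is a hypothetical helper that leaves out the grouping logic of grouptest and only shows how an unquoted argument becomes a column name.

```r
# Hypothetical helper: bind `values` to `object` under the name given by `new`
add_column <- function(object, values, new) {
  # Capture the unquoted argument as text, e.g. ValidGroup -> "ValidGroup"
  new_name <- deparse(substitute(new))
  object[[new_name]] <- values
  object
}

df <- data.frame(GroupId = c(1, 1, 2), IndId = c(1, 2, 3))
df <- add_column(df, c(1, 1, 0), ValidGroup)
names(df)
```

In the original function, new <- tmp[, 4] only creates a local variable literally called new, so cbind() labels the column with that symbol rather than with the value the caller passed in. Taking the name as text and assigning with [[<- avoids this.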
