count new elements in vector list - r

I want to count, for each year, the new elements that weren't present in any previous year.
Sample data:
var1 <- list('2003' = 1:3, '2004' = c(4:3), '2005' = c(6,4,1), '2006' = 1:4 )
I would like to get the output
newcount <- list('2003' = 0, '2004' = 1, '2005' = 1, '2006' = 0)
Unsuccessful code:
newcount <- mapply(setdiff, var1, seq_along(var1), function(i)
{if (i > 1) {Reduce(union, var1[i-1], accumulate=T)}}, length)

Almost there, but it's better to use vector indexing to handle the offset and add the always-known initial element afterwards:
lapply(c(list(`2003` = integer(0)),
         mapply(setdiff, var1[-1],
                Reduce(union, var1, accumulate = TRUE)[-length(var1)])),
       length)
$`2003`
[1] 0
$`2004`
[1] 1
$`2005`
[1] 1
$`2006`
[1] 0

Assuming that var1 is sorted by year, and that for 2003 you'd like 3 instead of 0, you could try:
newcount <- lapply(seq_along(var1), function(x) {
  prev <- unlist(var1[seq_len(x - 1)])
  # Improvement suggested by plannapus
  sum(!var1[[x]] %in% prev)  # length(which(!var1[[x]] %in% prev))
})
names(newcount) <- names(var1)
newcount
# $`2003`
# [1] 3
# $`2004`
# [1] 1
# $`2005`
# [1] 1
# $`2006`
# [1] 0
OK, if you're absolutely sure that 2003 should be 0 (which I see as an exception to your logic), then you could do the following:
newcount <- c(0, lapply(seq_along(var1)[-1], function(x) {
  prev <- unlist(var1[seq_len(x - 1)])
  sum(!var1[[x]] %in% prev)
}))
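As in the variant above, you can reattach the year names afterwards if you want the same named-list shape as the desired output:
names(newcount) <- names(var1)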

How to get consistent output when checking for infinite values

# works fine
check = c(1,2,3,4, Inf)
out = check[-which(check == Inf)]
print(out)
# [1] 1 2 3 4
# does not work fine
check = c(1,2,3,4)
out = check[-which(check == Inf)]
print(out)
# numeric(0)
The first example creates an out variable with the correct values 1, 2, 3, 4. The second example creates an empty out variable, because which() returns integer(0), and removing integer(0) from the check vector apparently leaves 0 elements.
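To spell out the gotcha: which() returns integer(0) when nothing matches, and indexing with -integer(0) selects zero elements rather than leaving the vector untouched:
check <- c(1, 2, 3, 4)
which(check == Inf)           # integer(0): nothing matches
check[-which(check == Inf)]   # numeric(0): zero elements selected, not "nothing removed"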
I know how to write this in several lines but is there a one-liner for this?
Try is.finite():
# example 1
check <- c(1, 2, 3, 4, Inf)
out <- check[ is.finite(check) ]
out
# [1] 1 2 3 4
# example 2
check <- c(1, 2, 3, 4)
out <- check[ is.finite(check) ]
out
# [1] 1 2 3 4
Related post about: is.finite().
check = c(1,2,3,4)
out = check[!is.infinite(check)]
print(out)
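One subtle difference between the two approaches: is.finite() also drops NA and NaN values, while !is.infinite() keeps them, so pick whichever matches how you want missing values handled:
x <- c(1, NA, Inf)
x[is.finite(x)]      # 1      (NA dropped as well)
x[!is.infinite(x)]   # 1 NA   (NA kept)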
Not sure whether this is technically a one-liner...
out = if (any(is.infinite(check))) { check[-which(is.infinite(check))] } else { check }

Improvement and parallelization of a code that produces combinations of specified string characters

First of all, let me try to explain what the following code does.
From a list like the one below, it takes the contained string "1MAKK" and finds all possible combinations of the positions of the characters specified in chars.
Here is an example of the initial setup:
# Initial list
lst1 = list("P1"=list("1MAKK") )
chars = c("M","K")
classes = c("class.1","class.35")
# Get the P name
p_name = names(lst1[1])
# Get the string sequence
p_seq = unlist(lst1[[1]][1])
The classes vector is nothing more than a set of labels corresponding to the chars vector, used only for naming.
The main code then takes these variables, p_name and p_seq, and produces a data frame with all possible position combinations of the specified characters.
This is the code:
library(stringr) # str_locate
library(purrr) # map2
# Functions
move_one <- function(seq){
  if (grepl("1", seq))
    seq = paste0(substring(seq, 2), 1)
  else
    seq
}
# Move the number one from the first to last position
seq = move_one(p_seq)
# Get the positions of each character in the string
pos = unlist(map2(
  .f = function(a, p) str_locate_all(p, a),
  .x = chars,
  .y = seq),
  recursive = FALSE
)
# Check if there is a letter that didn't exist in the sequence and add zeros at the respective list item
for (x in 1:length(pos)) {
  ifelse(is.na(pos[[x]][1]), pos[[x]] <- rbind(pos[[x]], c(0, 0)), pos[[x]] <- pos[[x]])
}
# Calculate all possible combinations and transpose the arrays inside the list
ind1 = pmap(
  .f = function(x) lapply(1:nrow(pos[[x]]), combn, x = as.list(pos[[x]][, 1])),
  .l = list(1:length(pos))
)
ind1 = pmap(
  .f = function(x) lapply(ind1[[x]], t),
  .l = list(1:length(ind1))
)
# Add Zero at each first element
z = pmap(
  .f = function(x) lapply(ind1[[x]][1], rbind, 0),
  .l = list(1:length(ind1))
)
# Merge the list with the zeros and the complete one
ind1 = map2(
  .f = function(a, b) { a[1] <- b[1]; a },
  .x = ind1,
  .y = z
)
# Create a vector for each letter combination
ind1 = pmap(
  .f = function(x) unlist(lapply(ind1[[x]], function(i) do.call(paste, c(as.data.frame(i), sep = ':')))),
  .l = list(1:length(ind1))
)
# Get the position of the class.1
isClass1 = grep("class.1", classes)
# Check if the seq is the first one
isFirst = grepl("1",seq)
# Set only 1 and 0 in the vector of UNIMOD.1 if is the first peptide
ifelse(isFirst , ind1[[isClass1]] <- c("1","0") , ind1[[isClass1]] <- c("0") )
# expand.grid for all these vectors inside ind1
ind2 = expand.grid(ind1)
# Apply column names in ind2
colnames(ind2) = classes
# Add a column with the p_name and seq
ind3 = cbind( "p_name"=rep(p_name, nrow(ind2) ) , "seq"=rep( gsub('.{1}$','',seq) , nrow(ind2) ) , ind2 )
The result for that specific input will be
> ind3
p_name seq class.1 class.35
1 P1 MAKK 1 3
2 P1 MAKK 0 3
3 P1 MAKK 1 4
4 P1 MAKK 0 4
5 P1 MAKK 1 0
6 P1 MAKK 0 0
7 P1 MAKK 1 3:4
8 P1 MAKK 0 3:4
As you can see, I tried to use lapply, map2 and pmap instead of for loops, both to make it faster and to give it a chance to run on more than one CPU core in the final version.
So this is where I need your help and your opinion.
My actual initial list does not hold just a single string; it looks like the one below, with the difference that there are thousands of inner lists (Px where x = 1, 2, 3, ..., 2000) and each Px can hold around a hundred sequences.
p_list = list( "P1" = list( c("1MAK","ERTD","FTRWDSE" )) , "P2" = list( c("1MERTDF","DFRGRSDFG","DFFF")) )
The first question, and probably the easiest one to answer, is how I can run (apply) the above code on such a list.
Secondly, how can I run this calculation in parallel and use more than one CPU core of a server that has 24 of them, in order to save some time?
P.S.: The final result is expected to be the combination of all the individual results (using rbind, maybe) into a single data frame like the one shown previously.
Any improvement, idea or suggestion is welcomed.
Thank you.
First part
This is basically the code I would use for one string. At the end, you get list-columns (which is nice if you know what they are).
library(purrr)
x <- "MAKK"
chars <- set_names(c("M", "K"), c("class.1", "class.35"))
get_0_and_all_combn <- function(x) {
  map(seq_along(x), function(i) combn(x, i, simplify = FALSE)) %>%
    unlist(recursive = FALSE) %>%
    c(0L, .)
}
get_0_and_all_combn(3:4)
[[1]]
[1] 0
[[2]]
[1] 3
[[3]]
[1] 4
[[4]]
[1] 3 4
get_pos_combn <- function(x, chars) {
  x.spl <- strsplit(x, "")[[1]]
  map(chars, function(chr) {
    which(x.spl == chr) %>%
      get_0_and_all_combn()
  }) %>%
    expand.grid()
}
get_pos_combn(x, chars)
class.1 class.35
1 0 0
2 1 0
3 0 3
4 1 3
5 0 4
6 1 4
7 0 3, 4
8 1 3, 4
get_pos_combn_with_infos <- function(seq, chars, p_name) {
  cbind.data.frame(p_name, seq, get_pos_combn(seq, chars))
}
get_pos_combn_with_infos(x, chars, p_name)
p_name seq class.1 class.35
1 P1 MAKK 0 0
2 P1 MAKK 1 0
3 P1 MAKK 0 3
4 P1 MAKK 1 3
5 P1 MAKK 0 4
6 P1 MAKK 1 4
7 P1 MAKK 0 3, 4
8 P1 MAKK 1 3, 4
So now, if you want me to finish my answer, I would need to know which chars and classes correspond to your complete example:
p_list = list( "P1" = list( c("1MAK","ERTD","FTRWDSE" )) ,
"P2" = list( c("1MERTDF","DFRGRSDFG","DFFF")) )
Also, are you sure you want to make "P1" and "P2" lists of length 1?
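Regarding the second (parallelization) part of the question: a minimal sketch of one common approach, assuming the get_pos_combn_with_infos() helper defined above, that each sequence can be processed independently, and that you are on a Unix-alike where parallel::mclapply() can fork worker processes (on Windows you would set up a cluster with makeCluster()/parLapply() instead). The run_in_parallel() wrapper and its internals are illustrative names, not from the original code:
library(parallel)

run_in_parallel <- function(p_list, chars, n_cores = 24) {
  # one (p_name, seq) task per sequence
  tasks <- do.call(rbind, lapply(names(p_list), function(nm) {
    data.frame(p_name = nm, seq = unlist(p_list[[nm]]), stringsAsFactors = FALSE)
  }))
  # process the tasks on n_cores workers; the leading "1" handling (move_one)
  # and the class.1 special case from the original code are omitted for brevity
  res <- mclapply(seq_len(nrow(tasks)), function(i) {
    get_pos_combn_with_infos(tasks$seq[i], chars, tasks$p_name[i])
  }, mc.cores = n_cores)
  # combine the per-sequence data frames
  do.call(rbind, res)
}

# result <- run_in_parallel(p_list, chars, n_cores = 24)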

R within() order of operation and logic

I am trying to understand how the within() function in R "works." For example, in the code below I try to make a new variable named "FEELS" based on a condition. The first two uses of the within() function do not work. The third use of the within() function works, but I am not confident I understand the logic of "why" it works. Any help is appreciated.
DF <- data.frame(DATE = seq(as.Date("2015-01-01"), as.Date("2015-12-31"), "month"), TEMP = c(30, 40, 50, 60, 70, 80, 90, 100, 90, 80, 70, 60))
DF <- within(DF, {
FEELS[30 <= TEMP & TEMP <= 50] <- "Cold"
FEELS[60 <= TEMP & TEMP <= 70] <- "Good"
FEELS[80 <= TEMP & TEMP <= 100] <- "Hot"
})
DF <- within(DF, {
FEELS <- "Cold"
FEELS[60 <= TEMP & TEMP <= 70] <- "Good"
FEELS[80 <= TEMP & TEMP <= 100] <- "Hot"
})
DF
DF <- within(DF, {
FEELS <- NA
FEELS[60 <= TEMP & TEMP <= 70] <- "Good"
FEELS[80 <= TEMP & TEMP <= 100] <- "Hot"
FEELS[is.na(FEELS)] <- "Cold"
})
DF
Let's break these down one by one.
1. This one simply results in an error message:
Error in FEELS[30 <= TEMP & TEMP <= 50] <- "Cold" : object 'FEELS' not found
That makes perfect sense. You haven't yet defined FEELS, so subsetting it results in an error.
2. This one's interesting and can be seen more clearly if you do it outside of 'within'
FEELS <- "cold"
tf <- 60 <= DF$TEMP & DF$TEMP <= 70
tf
[1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
FEELS[tf] <- "Good"
FEELS
[1] "cold" NA NA "Good" "Good" NA NA NA NA NA "Good"
[12] "Good"
R starts with a vector of length one containing "cold", but your subsetting forces it to extend and place "Good" in every element where the condition is TRUE. R doesn't have any values for the elements where it's FALSE, so it puts NA there.
3. The last one is pretty straightforward. You start with an NA vector, which is extended in the same way as the one in 2. You then replace all the NAs that are left with "Cold".
When you create an object inside within(DF, {...}), it does not automatically have the same length as the columns of DF. Instead, it will be "recycled" at the end of {...} to fill out the column:
within(data.frame(A=1:6), { B = 1; C = 1:2 })
# A C B
# 1 1 1 1
# 2 2 2 1
# 3 3 1 1
# 4 4 2 1
# 5 5 1 1
# 6 6 2 1
If, before the end of {...}, you want to modify an object as if it were a full column, it must have the correct length:
within(data.frame(A=1:6), {
D = 1
D[ A < 3 ] = 0
D2 = rep(1, length(A))
D2[A < 3 ] = 0
})
# A D2 D
# 1 1 0 0
# 2 2 0 0
# 3 3 1 NA
# 4 4 1 NA
# 5 5 1 NA
# 6 6 1 NA
To understand why D2 gave the expected output while D did not, try examining the objects in steps, using browser() as suggested by #sebastian-c or following the steps as illustrated in his answer.
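For example, a quick non-interactive version of that idea is to print FEELS at each step of the second block and watch it grow (a sketch using print(), where you could equally drop in browser()):
tmp <- within(DF, {
  FEELS <- "Cold"
  print(FEELS)   # length 1: just "Cold"
  FEELS[60 <= TEMP & TEMP <= 70] <- "Good"
  print(FEELS)   # now length 12: "Cold", "Good" where the condition was TRUE, NA elsewhere
  FEELS[80 <= TEMP & TEMP <= 100] <- "Hot"
})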
In the OP's case, initializing with rep and then making several substitutions is one option. Another would be to use cut, which is designed for assigning labels to intervals of ordered data.
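A sketch of the cut() route on the question's data, assuming the intervals intended there (up to 50 is "Cold", above 50 up to 70 is "Good", above 70 is "Hot"):
DF <- within(DF, {
  FEELS <- cut(TEMP,
               breaks = c(-Inf, 50, 70, Inf),
               labels = c("Cold", "Good", "Hot"))
})
DF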

Determining if two columns contain either a negative and positive value in R

Here is code to generate some sample data:
set.seed(1000)
dat <- data.frame(A = runif(5, -5, 5),
                  B = runif(5, -5, 5),
                  C = runif(5, -5, 5))
Data:
A B C
1 -1.721213 -4.3226204 -1.500625
2 2.588465 2.3871486 2.554616
3 -3.860636 0.8353505 -1.829158
4 1.907551 -2.8422860 3.658128
5 0.164024 -2.4387760 2.641607
I want to compare columns pairwise, row by row, and return TRUE or FALSE depending on whether one value is negative and the other positive. For example, in row 4 of columns A and B we would get TRUE, because one is positive and the other is negative. If it were swapped (a negative value in A for that row but a positive one in B) it would also return TRUE.
However, if either of the two values being compared is between -1 and 1, the comparison won't happen and the result will just be NA. Here is what the final output should look like for this example:
A B C AB BC AC
1 -1.721213 -4.3226204 -1.500625 FALSE FALSE FALSE
2 2.588465 2.3871486 2.554616 FALSE FALSE FALSE
3 -3.860636 0.8353505 -1.829158 NA NA FALSE
4 1.907551 -2.8422860 3.658128 TRUE TRUE FALSE
5 0.164024 -2.4387760 2.641607 NA TRUE NA
I have tried to compare rows using this kind of logic:
if((dat$A > 1 & dat$B < -1) | (dat$A < -1 & dat$B > -1) == TRUE)...
But I think there must be a more efficient way of doing it.
Edit after seeing some of the answers:
There are many great answers here that all have worked when I tested them out. I like the answer by mpalanco the best because of its readability and succinctness. However, I have chosen the answer by DMC since it can be generalized to other cases and may be useful for someone else searching for an answer to this or similar questions.
Here is a result that generalizes in case you have more columns than just {A, B, C} and you want to look at all pairs of columns:
library(tidyr)
library(dplyr)
# get original names
orig_names <- names(dat)
# add a row indicator
dat <- dat %>% mutate(k = row_number())
dat2 <- dat %>%
  # reshape to long
  gather(letter, value, A:C) %>%
  # change value to {-1, 1}
  mutate(
    value = ifelse(value <= -1, -1, ifelse(value >= 1, 1, NA)),
    letter = as.character(letter)
  )
# create a placeholder data frame for result
d_new <- expand.grid(
  V1 = orig_names,
  V2 = orig_names,
  k = 1:nrow(dat),
  stringsAsFactors = FALSE
) %>%
  filter(V1 < V2)
# compute result
result <- d_new %>%
  left_join(dat2, by = c("V1" = "letter", "k" = "k")) %>%
  left_join(dat2, by = c("V2" = "letter", "k" = "k")) %>%
  mutate(
    result = (value.x != value.y),
    combo = paste0(V1, V2)
  ) %>%
  select(-starts_with("value"), -V1, -V2) %>%
  spread(combo, result)
# join with original data
dat %>% left_join(result)
## A B C k AB AC BC
## 1 -1.721213 -4.3226204 -1.500625 1 FALSE FALSE FALSE
## 2 2.588465 2.3871486 2.554616 2 FALSE FALSE FALSE
## 3 -3.860636 0.8353505 -1.829158 3 NA FALSE NA
## 4 1.907551 -2.8422860 3.658128 4 TRUE FALSE TRUE
## 5 0.164024 -2.4387760 2.641607 5 NA NA TRUE
It might be best to use some functions:
is_between <- function(x, a, b) {
  x > a & x < b
}
makeCol <- function(col1, col2) {
  ifelse(
    is_between(col1, -1, 1) | is_between(col2, -1, 1),
    NA,
    !as.logical(sign(col1) + sign(col2))
  )
}
dat$AB <- makeCol(dat$A, dat$B)
dat$BC <- makeCol(dat$B, dat$C)
dat$AC <- makeCol(dat$A, dat$C)
To make this more DRY, you could follow #akrun's lead in his comment and do something like:
combn(seq_along(dat), 2, function(x) { makeCol(dat[, x[1]], dat[, x[2]]) })
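A sketch of how that combn() call could be wired up end to end, keeping only the original value columns and naming the result columns from the column-name pairs (the dat0 and pair_idx names are just for illustration):
dat0 <- dat[, c("A", "B", "C")]                  # just the original value columns
pair_idx <- combn(seq_along(dat0), 2)            # index pairs: (1,2), (1,3), (2,3)
res <- combn(seq_along(dat0), 2,
             function(x) makeCol(dat0[, x[1]], dat0[, x[2]]))
colnames(res) <- apply(pair_idx, 2,
                       function(x) paste0(names(dat0)[x], collapse = ""))
cbind(dat0, res)   # adds AB, AC, BC (note the AC/BC order differs from the question)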
This would be my solution.
dat[dat$A*dat$B<0 & abs(dat$A)>=1 & abs(dat$B)>=1,4]<-TRUE
dat[dat$A*dat$B>0 & abs(dat$A)>=1 & abs(dat$B)>=1,4]<-FALSE
dat[dat$C*dat$B<0 & abs(dat$C)>=1 & abs(dat$B)>=1,5]<-TRUE
dat[dat$C*dat$B>0 & abs(dat$C)>=1 & abs(dat$B)>=1,5]<-FALSE
dat[dat$A*dat$C<0 & abs(dat$A)>=1 & abs(dat$C)>=1,6]<-TRUE
dat[dat$A*dat$C>0 & abs(dat$A)>=1 & abs(dat$C)>=1,6]<-FALSE
colnames(dat)<-c("A","B","C","AB","BC","AC")
A very basic step by step approach:
# To keep original data
dat2 <- dat
# Assign NA to values between -1 and 1
dat[ifelse(dat >-1 & dat < 1, TRUE, FALSE)] <- NA
# Create three new columns
dat[4:6] <- data.frame(AB = dat$A*dat$B, BC = dat$B*dat$C, AC = dat$A*dat$C)
# If they are positive, the result will be FALSE
dat[4:6] <- ifelse(dat[, 4:6] > 0, FALSE, TRUE)
# Final result
cbind(dat2[, 1:3], dat[, 4:6])
Output:
A B C AB BC AC
1 -1.721213 -4.3226204 -1.500625 FALSE FALSE FALSE
2 2.588465 2.3871486 2.554616 FALSE FALSE FALSE
3 -3.860636 0.8353505 -1.829158 NA NA FALSE
4 1.907551 -2.8422860 3.658128 TRUE TRUE FALSE
5 0.164024 -2.4387760 2.641607 NA TRUE NA

How to use lists?

I am trying to use lists in R as dictionaries for computing winning percentages for basketball teams. Basically, for each win I'd like to increment the team's win count, and for each game I'd like to increment the team's games-played count. Somehow, the answers I'm getting seem reasonable but are incorrect, and I can't figure out why the program logically doesn't give the expected outputs. Any suggestions or tips would be appreciated. The code I'm using is below:
games <- read.csv(game_pathname, header = FALSE)
names(games) <- c("GameDate", "DateCount", "HomeID", "AwayID", "HomePts", "AwayPts", "HomeAbbr", "AwayAbbre", "HomeName", "AwayName")
wins = list()
total = list()
for (team in unique(games$HomeName)) {
  wins[team] <- 0
  total[team] <- 0
}
for (i in 1:nrow(games)) {
  if (games$HomePts[i] > games$AwayPts[i]) {
    wins[games$HomeName[i]] <- wins[[games$HomeName[i]]] + 1
  } else {
    wins[games$AwayName[i]] <- wins[[games$AwayName[i]]] + 1
  }
  total[games$HomeName[i]] <- total[[games$HomeName[i]]] + 1
  total[games$AwayName[i]] <- total[[games$AwayName[i]]] + 1
}
for (team in unique(games$HomeName)) {
  print(paste(team, wins[[team]] / total[[team]]))
}
Looking at the code and building a toy example, I don't see any problems in the algorithm. In the simulation below I used three teams, where one is a complete loser, another breaks even, and the third is a champion.
games <- data.frame(HomeName = c("a", "b", "c"),
                    HomePts = c(1, 2, 3),
                    AwayPts = c(3, 1, 2),
                    AwayName = c("c", "a", "b"))
wins = list()
total = list()
for (team in unique(games$HomeName)) {
  wins[team] <- 0
  total[team] <- 0
}
for (i in 1:nrow(games)) {
  if (games$HomePts[i] > games$AwayPts[i]) {
    wins[games$HomeName[i]] <- wins[[games$HomeName[i]]] + 1
  } else {
    wins[games$AwayName[i]] <- wins[[games$AwayName[i]]] + 1
  }
  total[games$HomeName[i]] <- total[[games$HomeName[i]]] + 1
  total[games$AwayName[i]] <- total[[games$AwayName[i]]] + 1
}
for (team in unique(games$HomeName)) {
  print(paste(team, wins[[team]] / total[[team]]))
}
games
wins
total
The output of your algorithm is below:
[1] "a 0"
[1] "b 0.5"
[1] "c 1"
> games
HomeName HomePts AwayPts AwayName
1 a 1 3 c
2 b 2 1 a
3 c 3 2 b
> wins
$`a`
[1] 0
$b
[1] 1
$c
[1] 2
> total
$`a`
[1] 2
$b
[1] 2
$c
[1] 2
However, it is not very "R-style", as using for loops and direct manipulation of list indices is considered not quite comme il faut :)
You can get similar results with e.g. the dplyr package, which is part of the tidyverse. The code below compares the results of the games, splits them into home and away data frames, binds those row-wise, and finally groups by team name to calculate the mean win rate. Please see below:
library(dplyr)
df <- games %>% mutate(hwins = (HomePts > AwayPts), awins = !hwins)
df_home <- df %>% select(HomeName, hwins) %>% rename(name = HomeName, wins = hwins)
df_away <- df %>% select(AwayName, awins) %>% rename(name = AwayName, wins = awins)
df <- bind_rows(df_home, df_away) %>% group_by(name) %>% summarise(mean_wins = mean(wins))
df
Output:
# A tibble: 3 x 2
name mean_wins
<fct> <dbl>
1 a 0
2 b 0.5
3 c 1
