I'm new to R and I'm having an issue trying to perform a "countif" as in Excel. What I have is below: there are two vectors, vector1 holding the possible values of vector2. Vector1 contains team_ids representing the possible teams that may win a game in a tournament; vector2 is the result of a simulation.
The reason why I can't use a table to summarize the simulation is that many teams won't be represented in game63, but I would still like to return a 0.
In the end, I would like to add a vector possible_teams_prob that counts the number of times each item in possible_teams appears in game63. This way I can combine them into a final table that lists the teams along with their probabilities of winning game63.
> possible_teams <- seq(1,64)
> possible_teams
[1] 1 2 3 4 5 6 7 8 9 10 11 12
[13] 13 14 15 16 17 18 19 20 21 22 23 24
[25] 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48
[49] 49 50 51 52 53 54 55 56 57 58 59 60
[61] 61 62 63 64
> game63[1:20]
[1] 4 8 4 4 3 20 2 3 3 1 3 20
[13] 3 8 2 4 3 1 14 3
Interesting question. In general, one can use the fact that R evaluates TRUE as 1 and FALSE as 0 to do a lot of COUNTIF-type work. Here, though, you want it applied along a vector. Writing a loop would certainly work, but this is R, so we want a vectorized version, which leads to the apply family. In this case, the following seems to be what you want:
f2 <- function(V1, V2) sum(V1 == V2)
vapply(possible_teams, f2, V2 = game63, FUN.VALUE = double(1))
which returns
[1] 2 2 7 4 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
This works by setting up a function that computes the "countif" between two vectors. It won't work on its own, since it expects two vectors and these two don't recycle nicely, but vapply iterates the function down the length of the first vector, which is what you wanted.
sapply will work as well, and doesn't require a FUN.VALUE ("target value") definition, but can be slower because of that. Your case is small enough that it doesn't really matter.
> microbenchmark(sapply(possible_teams, f2, V2 = game63), vapply(possible_teams, f2, V2 = game63, FUN.VALUE = double(1)), times = 1000L, control = list(order = 'block'))
Unit: microseconds
expr min lq mean median uq max neval
sapply(possible_teams, f2, V2 = game63) 89.351 92.926 103.31433 95.309 100.371 945.629 1000
vapply(possible_teams, f2, V2 = game63, FUN.VALUE = double(1)) 61.057 64.631 73.80298 67.610 71.779 1223.510 1000
Try this:
# recreate your data
allteams <- seq(64)
# summarize the game63 data to get counts by team (absent teams are simply missing here)
temp <- tapply(game63, game63, length)
# line the counts up against all teams (NA where a team never appears)
answer <- temp[match(allteams, names(temp))]
# replace missing values with 0 and label the result by team id
answer[is.na(answer)] <- 0
names(answer) <- allteams
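As a follow-up (a sketch that is not part of either answer above; the names counts and n_sims are illustrative): the table() limitation mentioned in the question goes away if game63 is converted to a factor whose levels are all the possible teams, and either set of counts can then be combined into the final table of teams and win probabilities:
counts <- as.vector(table(factor(game63, levels = possible_teams)))
n_sims <- length(game63)
possible_teams_prob <- data.frame(team = possible_teams,
                                  wins = counts,
                                  prob = counts / n_sims)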
I have a data frame with distance in the first column and class in the second:
data.tab <- read.table(text = "
644 1
76 1
78 1
350 1
45 1
37 2
366 2
46 2
71 3
28 3
97 3
30 3
55 3
65 3
116 3
30 3
18 4
143 4
99 4")
I want to reshape it into a new data frame by padding with zeros according to the longest class. The result should be:
data.tab <- read.table(text = "
1 644 76 78 350 45 0 0 0
2 37 366 46 0 0 0 0 0
3 71 28 97 30 55 65 116 30
4 18 143 99 0 0 0 0 0")
This essentially boils down to a simple long-to-wide reshape:
library(tidyverse)
data.tab %>%
  group_by(V2) %>%
  mutate(col = paste0("V", 1:n())) %>%
  spread(col, V1, fill = 0) %>%
  ungroup()
## A tibble: 4 x 8
# V1 V2 V3 V4 V5 V6 V7 V8
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 644 76 78 350 45 0 0 0
#2 37 366 46 0 0 0 0 0
#3 71 28 97 30 55 65 116 30
#4 18 143 99 0 0 0 0 0
Using df as name instead of data.tab:
MAX <- max(table(df$V2))
t(sapply(split(df$V1, df$V2), function(x) c(x, rep(0, MAX-length(x)))))
(The idea is to split V1 into groups defined by V2, making the vectors equal in length by adding 0's at the end when necessary, and then combining that into a single matrix. sapply does the last bit automatically but columnwise, so t is needed.)
Another way, using length<-:
U <- unstack(df) # a hack learned from G.Grothendieck's answer
U <- with(df, split(V1,V2)) # more readable version of the above
M <- max(lengths(U))
R <- t(sapply(U, "length<-", M)) # setting all lengths equal
replace(R, is.na(R), 0) # replacing NAs by zeroes
And a (rather unreadable) one-liner doing the same thing:
"[<-"(R<-t(sapply(U<-unstack(df),"length<-",max(lengths(U)))),is.na(R),0)
1) xtabs Using only base R, create a sequence number column within each class and then use xtabs to rearrange it into a table. Finally, convert that to a data frame. Omit the last line of code if a table is sufficient.
data.tab2 <- transform(data.tab, seq = ave(V2, V2, FUN = seq_along))
xt <- xtabs(V1 ~ V2 + seq, data.tab2)
as.data.frame.matrix(xt)
giving:
1 2 3 4 5 6 7 8
1 644 76 78 350 45 0 0 0
2 37 366 46 0 0 0 0 0
3 71 28 97 30 55 65 116 30
4 18 143 99 0 0 0 0 0
2) ts Another base R solution is to convert the elements of each class to a ts series, giving a multivariate time series tt with NAs at the ends of the shorter ones. Convert those NAs to 0 in the second line of code and then convert that to a data frame in the last line.
tt <- do.call("cbind", lapply(unstack(data.tab), ts))
tt[] <- ifelse(is.na(tt), 0, tt)
as.data.frame(t(tt))
3) tapply Using data.tab2 from (1), use tapply to create the matrix mat and then convert that to a data frame. Omit the last line of code if a matrix is sufficient.
mat <- with(data.tab2, tapply(V1, list(V2, seq), c, default = 0))
as.data.frame(mat)
Note
A comment claimed ifelse would be slower than a suggested alternative but benchmarking it showed no overall difference on the data in the question. Of course performance may not be very important here in the first place.
library(rbenchmark)
benchmark(
ifelse = {
tt <- do.call("cbind", lapply(unstack(data.tab), ts))
tt[] <- ifelse(is.na(tt), 0, tt)
as.data.frame(t(tt))
},
replace = {
tt <- do.call("cbind", lapply(unstack(data.tab), ts))
tt[is.na(tt)] <- 0
as.data.frame(t(tt))
}
)[1:4]
giving:
test replications elapsed relative
1 ifelse 100 0.25 1
2 replace 100 0.25 1
Using data.table's transpose (transposing the split list twice, with fill = 0, pads every class out to the longest length):
library(data.table)
cbind(sort(unique(data.tab$V2)),
      do.call(rbind, transpose(transpose(split(data.tab$V1, data.tab$V2), fill = 0))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] 1 644 76 78 350 45 0 0 0
#[2,] 2 37 366 46 0 0 0 0 0
#[3,] 3 71 28 97 30 55 65 116 30
#[4,] 4 18 143 99 0 0 0 0 0
I'm currently working on a script which will eventually plot the accumulation of losses from cell divisions. Firstly I generate a matrix of values and then I add the number of times 0 occurs in each column - a 0 represents a loss.
However, I am now thinking that a nice plot would be a degradation curve. So, given the following example:
>losses_plot_data <- melt(full_losses_data, id=c("Divisions", "Accuracy"), value.name = "Losses", variable.name = "Size")
> full_losses_data
Divisions Accuracy 20 15 10 5 2
1 0 0 0 0 3 25
2 0 0 0 1 10 39
3 0 0 1 3 17 48
4 0 0 1 5 23 55
5 0 1 3 8 29 60
6 0 1 4 11 34 64
7 0 2 5 13 38 67
8 0 3 7 16 42 70
9 0 4 9 19 45 72
10 0 5 11 22 48 74
Is there a way I can easily turn this table into 100 minus the numbers shown? If I can plot that data instead of my current data, I would have a lovely degradation curve from 100% down to however many cells have been lost.
Assuming you do not want to do that for the first column:
fld <- full_losses_data
fld[, 2:ncol(fld)] <- 100 - fld[, -1]
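Since the end goal is a plot, here is a rough sketch of how the degradation curves could be drawn afterwards (assuming reshape2 and ggplot2 are available and fld keeps the column layout shown in the question; remaining_plot_data is just an illustrative name):
library(reshape2)
library(ggplot2)
# re-melt the transformed table so each size column becomes one curve
remaining_plot_data <- melt(fld, id = c("Divisions", "Accuracy"),
                            value.name = "Remaining", variable.name = "Size")
ggplot(remaining_plot_data, aes(x = Divisions, y = Remaining, colour = Size)) +
  geom_line()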
Quite a number of questions I've asked today.
I'd like to calculate the confidence interval (99% level, not 95%) for the mean value of the variable age in two data frames, infert_control and infert_patient, where:
infert_control = subset(infert$age, infert$case == 0)
infert_patient = subset(infert$age, infert$case == 1)
infert is a built-in R dataset; for those not familiar with it, here it is: case 0 denotes the control-group patients, case 1 the actual patients.
> infert
education age parity induced case spontaneous stratum pooled.stratum
1 0-5yrs 26 6 1 1 2 1 3
2 0-5yrs 42 1 1 1 0 2 1
3 0-5yrs 39 6 2 1 0 3 4
4 0-5yrs 34 4 2 1 0 4 2
5 6-11yrs 35 3 1 1 1 5 32
6 6-11yrs 36 4 2 1 1 6 36
7 6-11yrs 23 1 0 1 0 7 6
8 6-11yrs 32 2 0 1 0 8 22
9 6-11yrs 21 1 0 1 1 9 5
10 6-11yrs 28 2 0 1 0 10 19
11 6-11yrs 29 2 1 1 0 11 20
...
239 12+ yrs 38 6 0 0 2 74 63
240 12+ yrs 26 2 1 0 1 75 49
241 12+ yrs 31 1 1 0 0 76 45
242 12+ yrs 31 2 0 0 1 77 53
243 12+ yrs 25 1 0 0 1 78 41
244 12+ yrs 31 1 0 0 1 79 45
245 12+ yrs 34 1 0 0 0 80 47
246 12+ yrs 35 2 2 0 0 81 54
247 12+ yrs 29 1 0 0 1 82 43
248 12+ yrs 23 1 0 0 1 83 40
What would be the correct way to solve this?
I've already calculated the Mean value of column age for both infert_control and infert_patient, plus the standard deviation of each subset.
You could use bootstrap for this:
library(boot)
set.seed(42)
boot_mean <- boot(infert_control, function(x, i) mean(x[i]), R=1e4)
quantile(boot_mean$t, probs=c(0.005, 0.995))
# 0.5% 99.5%
# 30.47273 32.58182
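The boot package can also report the interval directly from the same boot object; the percentile type corresponds to the quantile() call above:
boot.ci(boot_mean, conf = 0.99, type = "perc")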
Or if you don't want to use a library:
set.seed(42)
R <- 1e4
boot_mean <- colMeans(
matrix(
sample(infert_control, R * length(infert_control), TRUE),
ncol=R))
quantile(boot_mean, probs=c(0.005, 0.995))
# 0.5% 99.5%
#30.42424 32.55152
So many answers...
The standardized mean of a random sample follows a t-distribution, not a normal distribution (although t -> N as df -> Inf), so the interval should be based on qt rather than qnorm:
cl <- function(data, p) {
  n <- length(data)
  cl <- qt(p/2, n-1, lower.tail=F) * sd(data)/sqrt(n)
  m <- mean(data)
  return(c(lower = m-cl, upper = m+cl))
}
cl.control <- cl(infert_control,0.01)
cl.control
# lower upper
# 30.42493 32.55689
cl.patient <- cl(infert_patient,0.01)
cl.patient
# lower upper
# 30.00221 33.05803
aggregate(age~case,data=infert,cl,p=0.01) # much better way...
# case age.lower age.upper
# 1 0 30.42493 32.55689
# 2 1 30.00221 33.05803
Also, the quantile functions (e.g. qt(...) and qnorm(...)) return the lower tail by default, so your limits would be reversed unless you set lower.tail=F.
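For example (illustrative only, with 10 degrees of freedom):
qt(0.005, df = 10)                       # lower-tail quantile: negative
qt(0.005, df = 10, lower.tail = FALSE)   # upper-tail quantile: positive, as used in cl() above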
You could easily calculate the confidence interval manually:
infert_control <- subset(infert$age, infert$case == 0)
# calculate needed values
m <- mean(infert_control)
s <- sd(infert_control)
n <- length(infert_control)
# calculate error for normal distribution (choose your distribution here, e.g. qt for the t-distribution)
a <- 0.995 # 99% CI => 0.5% in each tail
error <- qnorm(a)*s/sqrt(n)
# calculate CI
ci_lower <- m-error
ci_upper <- m+error
See also http://en.wikipedia.org/wiki/Confidence_interval (sorry for a wikipedia link, but it has a good explanation and shows you the formula)
... or as small function:
cifun <- function(data, ALPHA){
c(mean(data) - qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data)),
mean(data) + qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data)))
}
cifun(infert_control, 0.01)
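As a cross-check (not part of the original answer), R's built-in t.test() reports a confidence interval directly; it uses the t-distribution rather than the normal, so it will be slightly wider than cifun()'s result:
t.test(infert_control, conf.level = 0.99)$conf.int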
Hi, this is an extension of the question asked here:
Conditional cumulative sum
Suppose I have the following vector. I'd like to calculate the running total within each block, where the blocks are delimited by the zeros (each zero resets the total).
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1);
The expected answer is d <- c(0,0,0,1,4,8,13,12,14,17,12,20,0,0,-2,-5,-2,3,0,0,0,-1,-2,-3,-4).
I'd like to do it in a vectorized way as my vector is rather large. So far I have been trying to use rle to achieve this without much success.
Many thanks.
This will work:
aux <- split(d, cumsum(d == 0))
v <- unlist(sapply(aux, cumsum))
1 2 31 32 33 34 35 36 37 38 39 310 4 51 52 53 54 55 6 7 81 82 83 84 85
0 0 0 1 4 8 13 12 14 17 12 20 0 0 -2 -5 -2 3 0 0 0 -1 -2 -3 -4
as.vector(v)
[1] 0 0 0 1 4 8 13 12 14 17 12 20 0 0 -2 -5 -2 3 0 0 0 -1 -2 -3 -4
here as.vector() just drops the names that unlist() attached to the elements.
This should work, with no loops, and it is very fast because all the work happens in vectorized primitives rather than an R-level loop.
sum_from <- function(value, from) {
  # index of the most recent block start (positions where 'from' is TRUE)
  i <- cummax(seq_along(value) * from)
  # running total, ignoring anything before the first block start
  cv <- cumsum(value * cummax(from))
  # subtract the total accumulated before the current block started
  cv - c(0, 0, cv[-length(cv)])[i + 1]
}
d <- c(0,0,0,1,3,4,5,-1,2,3,-5,8,0,0,-2,-3,3,5,0,0,0,-1,-1,-1,-1)
all(sum_from(d,d==0)==c(0,0,0,1,4,8,13,12,14,17,12,20,0,0,-2,-5,-2,3,0,0,0,-1,-2,-3,-4))
results match:
> all(sum_from(d,d==0)==c(0,0,0,1,4,8,13,12,14,17,12,20,0,0,-2,-5,-2,3,0,0,0,-1,-2,-3,-4))
[1] TRUE
>
What's the right way to do this in R:
for(row in 1:10)
{
counts[row] <- length(otherData[otherData[["some property"]] == otherList[row],])
}
In other words, put into each row of a new anything (matrix, data.frame, whatever) the count of those rows in another anything (matrix, data.frame, whatever) that equal the corresponding entry in some other list (again speaking abstractly, not literally a list object)?
E.g. say x = otherData is
a b c
d 1 2 3
e 1 3 4
f 2 5 6
g 1 5 3
And say the "otherList" is the first column of x, so I want to count how many of x's rows have each of 1, 2, 3, etc. as their first entry.
So I want counts to be
3,
1,
0,
(0s as long as this counts list goes)
Note it's more important that I be able to select out that data subset than that I get its length; I need to use the subset for other computations as well, though again I want to select it out row by row and have the output of whatever computations I do stored in the corresponding row of the results matrix (in this case counts).
I can obviously do this with a for loop, but what's the clever way to skip the loop?
Apologies if this duplicates another question. This seems like a very basic question, but I'm not sure what terms to search for. This question seems similar and would work for getting lengths, though I'm not clear on how to apply it in the general case.
EDIT
Here's an example. We select certain rows of x (here x is like otherData in my description above) that satisfy some row-dependent condition, in this case having a first-column entry equal to row, but the point is that "== row" could be replaced with any condition on row, e.g. "<= otherlist[row]-2" etc.
> x
condition value
1 2 25
2 9 72
3 41 60
4 41 61
5 25 38
6 41 10
7 41 43
8 41 26
9 41 46
10 12 263
11 26 136
12 24 107
13 9 70
14 12 62
15 12 136
16 34 44
17 12 53
18 32 14
19 32 148
20 4 34
> results = 0*1:20
> results
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> for(row in 1:20) {
+ results[row] = length(x[x[["condition"]]==row,2]) }
> results
[1] 0 1 0 1 0 0 0 0 2 0 0 4 0 0 0 0 0 0 0 0
Edited:
sapply( 1:20, function(z) sum(x[["condition"]] == z) )
#[1] 0 1 0 1 0 0 0 0 2 0 0 4 0 0 0 0 0 0 0 0
You would be able to substitute a different logical test and the sum would be the number of qualifying rows. (I was never able to figure out why you were using column number 2.) If you were hoping to select out a subset of rows that met a condition (which your example was not illustrating) then you could use this:
x[ x[,1] == test, ]  # e.g.
> x[ x$condition == 9, ]
condition value
2 9 72
13 9 70
Or if you only wanted the column 'value' that corresponded to the tested 'condition' column, then use:
> x[ x[['condition']] == 9, "value" ]
[1] 72 70
If you want to apply functions to selected (disjoint) subsets of x, and you can create a factor variable as long as the data frame, then you can use aggregate or by to process the split-up lists (see the aggregate sketch after the output below). If you want to use the sapply formalism above, here's an example that computes the separate means for subsets of "value" for rows having rownames that are in "condition":
> sapply( rownames(x), function(z) mean( x[x[["condition"]] == z , "value"]) )
[1] NaN 25.0 NaN 34.0 NaN NaN NaN NaN 71.0 NaN NaN 128.5 NaN NaN NaN NaN
[17] NaN NaN NaN NaN
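For completeness, the aggregate() route mentioned above; unlike the sapply() call, it only returns rows for conditions that actually occur:
aggregate(value ~ condition, data = x, FUN = mean)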
What about table?
table(factor(x[, 1], x[1, ]))
#
# 1 2 3
# 3 1 0
Update
Using the second x table in your question, same solution:
table(factor(x$condition, rownames(x)))
#
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 0 1 0 1 0 0 0 0 2 0 0 4 0 0 0 0 0 0 0 0
Also, try match:
match(x$condition, rownames(x))
# [1] 2 9 NA NA NA NA NA NA NA 12 NA NA 9 12 12 NA 12 NA NA 4
table(match(x$condition, rownames(x)))
#
# 2 4 9 12
# 1 1 2 4
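And if the zero counts are needed too, in row order (a small sketch building on the match() idea above; tabulate() is not part of the original answer):
tabulate(na.omit(match(x$condition, rownames(x))), nbins = nrow(x))
# [1] 0 1 0 1 0 0 0 0 2 0 0 4 0 0 0 0 0 0 0 0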
> a <- c(seq(1,10))
> a
[1] 1 2 3 4 5 6 7 8 9 10
> d <- cbind(a,a)
> d
a a
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6
[7,] 7 7
[8,] 8 8
[9,] 9 9
[10,] 10 10
> d[,2]
[1] 1 2 3 4 5 6 7 8 9 10
> d[,2] <- d[,1]*2
> d
a a
[1,] 1 2
[2,] 2 4
[3,] 3 6
[4,] 4 8
[5,] 5 10
[6,] 6 12
[7,] 7 14
[8,] 8 16
[9,] 9 18
[10,] 10 20
>