Trying to generate a data frame in a for loop - R

Currently I'm working on the following code:
data <- rep(1:3, times = c(10, 4, 6))
for (i in 1:5) {
  samp <- sample(data, 4)
  data <- exclude(data, samp)  # exclude() is presumably a user-defined helper that removes the sampled elements from data
  print(samp)
  for (i in 1:3) {
    prsamp <- sum(samp == i) / 4
    print(prsamp)
  }
  if (length(data) == 0) {
    break
  }
}
This currently prints out five vectors of length four, with the corresponding probabilities of each number occurring in each vector.
> source("buffoon.R")
> buffoon(20, 4, 3, c(10,4,6))
[1] 1 1 2 3
[1] 0.5
[1] 0.25
[1] 0.25
[1] 1 3 3 2
[1] 0.25
[1] 0.25
[1] 0.5
[1] 2 1 1 1
[1] 0.75
[1] 0.25
[1] 0
[1] 3 1 2 3
[1] 0.25
[1] 0.25
[1] 0.5
[1] 1 3 1 1
[1] 0.75
[1] 0
[1] 0.25
So, for instance, the first sample 1 1 2 3 gives a probability of 0.5 for 1, 0.25 for 2, and 0.25 for 3. I would like to turn this output into a data frame that lists each sample vector in column 1 and the corresponding vector of element probabilities in column 2, but I'm running into many errors. I've been researching this issue for a few hours now with no success. Any help is appreciated.
My ideal data frame would look like this:
  Sample  Probability Dist
1   1123    0.5 0.25 0.25
2   1332   0.25 0.25  0.5
and so on, down to row 5.

The first thing you will have to do is create an empty data frame. Second, have your for loop write into this data frame instead of simply printing the results. Also, don't reuse i as the variable of an inner for loop when the outer loop already uses i. I suggest you try the following:
data <- rep(1:3, times = c(10, 4, 6))
datafr <- data.frame(Sample = rep(NA, 5), Probability.Dist = rep(NA, 5))
for (i in 1:5) {
  samp <- sample(data, 4)
  data <- exclude(data, samp)
  # easy way of getting your wanted sample layout
  datafr$Sample[i] <- samp[1] * 1000 + samp[2] * 100 + samp[3] * 10 + samp[4]
  prsamp <- rep(0, 3)
  for (j in 1:3) {
    prsamp[j] <- sum(samp == j) / 4
  }
  datafr$Probability.Dist[i] <- toString(prsamp)
  if (length(data) == 0) {
    break
  }
}
datafr
# Sample Probability.Dist
#1 1231 0.5, 0.25, 0.25
#2 2132 0.25, 0.5, 0.25
#3 1313 0.5, 0, 0.5
#4 2111 0.75, 0.25, 0
#5 3131 0.5, 0, 0.5
I also have to advise you against storing three values in a single column of a data frame. For further analysis, and even for readability, it is much better to give each value its own column, as sketched below.
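For example, a minimal sketch of that wide layout, keeping the question's exclude() helper (assumed to remove the sampled elements from data):
data <- rep(1:3, times = c(10, 4, 6))
datafr <- data.frame(Sample = rep(NA_character_, 5),
                     p1 = rep(NA_real_, 5), p2 = rep(NA_real_, 5), p3 = rep(NA_real_, 5))
for (i in 1:5) {
  samp <- sample(data, 4)
  data <- exclude(data, samp)                 # assumed user-defined helper, as in the question
  datafr$Sample[i] <- paste(samp, collapse = "")
  datafr[i, c("p1", "p2", "p3")] <- sapply(1:3, function(j) sum(samp == j) / 4)
  if (length(data) == 0) break
}
datafr
Each probability then lives in its own numeric column, which is much easier to filter, summarise, or plot than a comma-separated string.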


Counting values within levels

I have a set of levels in R that I generate with cut, e.g. fractional values between 0 and 1 broken into 0.1-wide bins:
> frac <- cut(c(0, 1), breaks=10)
> levels(frac)
[1] "(-0.001,0.1]" "(0.1,0.2]" "(0.2,0.3]" "(0.3,0.4]" "(0.4,0.5]"
[6] "(0.5,0.6]" "(0.6,0.7]" "(0.7,0.8]" "(0.8,0.9]" "(0.9,1]"
Given a vector v containing continuous values between [0.0, 1.0], how do I count the frequency of elements in v that fall within each level in levels(frac)?
I may customize the number of breaks and/or the interval from which I am making levels, so I'm looking for a way to do this with standard R commands that builds a two-column data frame: one column for the levels as factors, and a second column with the fraction (or percentage) of the total elements of v that fall in each level.
Note: The following does not work:
> table(frac)
frac
(-0.001,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
1 0 0 0 0 0
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
0 0 0 1
If I use cut on v directly, then I do not get the same levels when I run cut on different vectors, because the range of values — their minimum and maximum — is going to be different between arbitrary vectors, and so while I may have the same number of breaks, the level intervals will not be the same.
My goal is to take different vectors and bin them to the same set of levels. Hopefully this helps clarify my question. Thanks for any assistance.
Amend frac to actually represent your desired intervals, and then use the table function:
x = runif(100) # For example.
frac = cut(x, breaks = seq(0, 1, 0.1))
table(frac)
Result:
frac
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8]
14 9 8 10 8 12 7 7
(0.8,0.9] (0.9,1]
16 9
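To get the two-column data frame of fractions that the question asks for, the counts can be normalised and converted, e.g. (a small follow-up sketch, not part of the original answer):
counts <- table(frac)
data.frame(level    = factor(names(counts), levels = levels(frac)),
           fraction = as.numeric(counts) / length(x))
# or, more compactly: as.data.frame(prop.table(counts))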
Prepend the extremes c(0, 1) to v, apply the same cut, and then drop those two helper entries:
library(dplyr)

# dummy data
set.seed(1)
v <- round(runif(7), 2)

# result
data.frame(v,
           vFrac = cut(c(0, 1, v), breaks = 10)[-c(1, 2)]) %>%
  group_by(vFrac) %>%
  mutate(vFreq = n())
# Source: local data frame [10 x 3]
# Groups: vFrac [8]
#
# v vFrac vFreq
# <dbl> <fctr> <int>
# 1 0.27 (0.2,0.3] 1
# 2 0.37 (0.3,0.4] 1
# 3 0.57 (0.5,0.6] 1
# 4 0.91 (0.9,1] 2
# 5 0.20 (0.1,0.2] 1
# 6 0.90 (0.8,0.9] 1
# 7 0.94 (0.9,1] 2
Another option is to build the breaks yourself and let hist() do the counting (with plot = FALSE):
frac <- seq(0, 1, by = 0.1)
ranges <- paste(head(frac, -1), frac[-1], sep = " - ")
freq <- hist(v, breaks = frac, include.lowest = TRUE, plot = FALSE)
data.frame(range = ranges, frequency = freq$counts)
Use findInterval instead of cut:
v <- data.frame(v = runif(100, 0, 1))
library(plyr)
v$x <- findInterval(v$v, seq(0, 1, by = 0.1)) * 0.1
ddply(v, .(x), summarize, n = length(x))
Or count each bin directly with sapply():
frac <- seq(0, 1, 0.1)
set.seed(42); v <- rnorm(10, 0.5, 0.2)
sapply(1:(length(frac) - 1), function(i) sum(frac[i] < v & frac[i + 1] >= v))
#[1] 0 0 0 1 3 2 1 1 1 1

round but .5 should be floored

From the R help page for round(): "Note that for rounding off a 5, the IEC 60559 standard is expected to be used, 'go to the even digit'. Therefore round(0.5) is 0 and round(-1.5) is -2."
> round(0.5)
[1] 0
> round(1.5)
[1] 2
> round(2.5)
[1] 2
> round(3.5)
[1] 4
> round(4.5)
[1] 4
But I need all values ending in .5 to be rounded down. All other values should be rounded as the round() function already does.
Example:
round(3.5) = 3
round(8.6) = 9
round(8.1) = 8
round(4.5) = 4
Is there a fast way to do it?
Per Dietrich Epp's comment, you can use the ceiling() function with an offset to get a fast, vectorized, correct solution:
round_down <- function(x) ceiling(x - 0.5)
round_down(seq(-2, 3, by = 0.5))
## [1] -2 -2 -1 -1 0 0 1 1 2 2 3
I think this is faster and much simpler than many of the other solutions shown here.
As noted by Carl Witthoft, this adds much more bias to your data than simple rounding. Compare:
mean(round_down(seq(-2, 3, by = 0.5)))
## [1] 0.2727273
mean(round(seq(-2, 3, by = 0.5)))
## [1] 0.4545455
mean(seq(-2, 3, by = 0.5))
## [1] 0.5
What is the application for such a rounding procedure?
Check whether the remainder x %% 1 equals 0.5, and then floor or round accordingly:
x <- seq(1, 3, 0.1)
ifelse(x %% 1 == 0.5, floor(x), round(x))
[1] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
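One caveat worth flagging (an addition, not part of the original answer): the exact comparison x %% 1 == 0.5 is reliable when the .5 values are written as literals, since those are represented exactly as doubles, but a fractional part produced by other arithmetic can be off by a tiny amount. The next answer guards against this with an all.equal()-based test.
c(1.5, 2.5, 3.5) %% 1 == 0.5   # literal .5 values are exact
# [1] TRUE TRUE TRUE
# for computed values, prefer a tolerance-based check such as
# isTRUE(all.equal(x %% 1, 0.5))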
I'll join the circus too:
rndflr <- function(x) {
  sel <- vapply(x - floor(x), function(y) isTRUE(all.equal(y, 0.5)), FUN.VALUE = logical(1))
  x[sel] <- floor(x[sel])
  x[!sel] <- round(x[!sel])
  x
}
rndflr(c(3.5,8.6,8.1,4.5))
#[1] 3 9 8 4
The next function works by subtracting a small number from the elements whose fractional part equals 0.5 before rounding, ensuring that they are rounded downwards. (It relies, harmlessly but in a slightly obfuscated manner, on the fact that a logical vector in R is converted to a vector of 0s and 1s when multiplied by a numeric vector.)
f <- function(x) {
  round(x - 0.1 * (x %% 1 == 0.5))
}
x <- c(0.5,1,1.5,2,2.5,2.01,2.99)
f(x)
[1] 0 1 1 2 2 2 3
The following function (not golfed) is very simple: it checks whether the remaining decimal part is .5 or less. You could easily make it more general by taking the 0.5 limit as an argument:
nice.round <- function(x, myLimit = 0.5) {
  bX <- x
  intX <- as.integer(x)
  decimals <- x %% intX
  if (is.na(decimals)) {
    decimals <- 0
  }
  if (decimals <= myLimit) {
    x <- floor(x)
  } else {
    x <- round(x)
  }
  if (bX > 0.5 & bX < 1) {
    x <- 1
  }
  return(x)
}
Tests
Currently, this function does not work properly with values between 0.5 and 1.0.
> nice.round(1.5)
[1] 1
> nice.round(1.6)
[1] 2
> nice.round(10000.624541)
[1] 10001
> nice.round(0.4)
[1] 0
> nice.round(0.6)
[1] 1

Summing values in columns based on other values in R

I'm relatively new to R and am having trouble creating a vector that sums certain values based on other values. I'm not quite sure what the problem is. I don't receive an error, but the output is not what I was looking for. Here is a reproducible example:
fakeprice <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <-c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
fakedata <- data.frame(fakeprice, fakeconversion)
fake.list <- sort(unique(fakedata$fakeprice))
fake.sum <- vector(,5)
So, fakedata looks like:
fakeprice fakeconversion
1 1 0.20
2 2 0.15
3 2 0.07
4 1 0.25
5 NA NA
6 5 0.40
7 4 0.36
8 4 NA
9 3 0.67
10 3 0.42
11 NA 0.01
I think the problem lies in the NAs, but I'm not quite sure (there are quite a few in the original data set). Here are the for loops with nested if statements. I kept getting an error when the price was 'NA' and so I added the is.na():
for (i in fake.list) {
  sum <- 0
  for (j in fakedata$fakeprice) {  # note: j iterates over the *values* of fakeprice, not over row indices
    if (is.na(fakedata$fakeprice[j]) == TRUE) {
      NULL
    } else {
      if (fakedata$fakeprice[j] == fake.list[i]) {
        sum <- sum + fakedata$fakeconversion[j]
      }
    }
  }
  fake.sum[i] <- sum
}
sumdata <- data.frame(fake.list, fake.sum)
I'm looking for an output that adds up fakeconversion for each unique price. So, for fakeprice=1, fake.sum=0.45. The resulting data I am looking for would look like:
fake.list fake.sum
1 1 0.45
2 2 0.22
3 3 1.09
4 4 0.36
5 5 0.40
What I get, however, is:
sumdata
fake.list fake.sum
1 1 0.90
2 2 0.44
3 3 0.00
4 4 0.00
5 5 0.00
Any help is very much appreciated!
aggregate(fakedata$fakeconversion, list(price = fakedata$fakeprice), sum, na.rm = TRUE)
The na.rm = TRUE deals with the NA conversion value at fakeprice 4.
The aggregate function works by splitting your data by the grouping variable(s) and then running a function, FUN, on each subset.
So:
aggregate(x, by, FUN, ...)
x is what you wish to run FUN on; by can be given a list if you wish to split the data by multiple columns.
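As a side note (not from the original answer), the same summary can be computed with tapply in base R, or with dplyr if you prefer that style:
# base R
tapply(fakedata$fakeconversion, fakedata$fakeprice, sum, na.rm = TRUE)
#    1    2    3    4    5
# 0.45 0.22 1.09 0.36 0.40

# dplyr
library(dplyr)
fakedata %>%
  filter(!is.na(fakeprice)) %>%
  group_by(fakeprice) %>%
  summarise(fake.sum = sum(fakeconversion, na.rm = TRUE))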

Efficiently change elements in data based on neighbouring elements

Let me delve right in. Imagine you have data that looks like this:
df <- data.frame(one   = c(1, 1, NA, 13),
                 two   = c(2, NA, 10, 14),
                 three = c(NA, NA, 11, NA),
                 four  = c(4, 9, 12, NA))
This gives us:
df
# one two three four
# 1 1 2 NA 4
# 2 1 NA NA 9
# 3 NA 10 11 12
# 4 13 14 NA NA
The columns of each row are measurements in weeks 1, 2, 3 and 4 respectively. Suppose the numbers represent some accumulated measure since the last time a measurement happened. For example, in row 1, the "4" in column "four" represents a cumulative value for weeks 3 and 4.
Now I want to "even out" these numbers (feel free to correct my terminology here) by spreading each measurement evenly over the weeks before it whenever no measurement took place in the preceding weeks. For instance, row 1 should read
1 2 2 2
since the 4 in the original data represents the cumulative value of 2 weeks (week "three" and "four"), and 4/2 is 2.
The final end result should look like this:
df
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
I struggle a bit with how best to approach this. One candidate solution would be to get the indices of all missing values, count the lengths of the NA runs, and use that to fill in the values somehow. However, my real data is large, and I think such a strategy might be time consuming. Is there an easier and more efficient way?
A base R solution would be to first identify the indices that need to be replaced, then determine the groupings of those indices, and finally assign the grouped values with the ave function:
clean <- function(x) {
  # positions that are NA, or that immediately follow an NA (the value closing a run)
  to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
  # group each run of NAs together with the value that ends it
  groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))
  # spread the closing value evenly over its group
  x[to.rep] <- ave(x[to.rep], groups, FUN = function(y) {
    rep(tail(y, 1) / length(y), length(y))
  })
  return(x)
}
t(apply(df, 1, clean))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
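For intuition, here is what the intermediate pieces look like on the first row of df (an illustrative trace added here, not part of the original answer):
x <- c(1, 2, NA, 4)                              # first row of df
which(is.na(x) | c(FALSE, head(is.na(x), -1)))   # to.rep: 3 4
cumsum(c(TRUE, head(!is.na(x[c(3, 4)]), -1)))    # groups: 1 1
# ave() then replaces positions 3 and 4 with 4 / 2 = 2, giving 1 2 2 2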
If efficiency is important (your question implies it is), then an Rcpp solution could be a good option:
library(Rcpp)
cppFunction(
  "NumericVector cleanRcpp(NumericVector x) {
    const int n = x.size();
    NumericVector y(x);
    int consecNA = 0;
    for (int i = 0; i < n; ++i) {
      if (R_IsNA(x[i])) {
        ++consecNA;
      } else if (consecNA > 0) {
        const double replacement = x[i] / (consecNA + 1);
        for (int j = i - consecNA; j <= i; ++j) {
          y[j] = replacement;
        }
        consecNA = 0;
      } else {
        consecNA = 0;
      }
    }
    return y;
  }")
t(apply(df, 1, cleanRcpp))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
We can compare performance on a larger instance (10000 x 100 matrix):
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
all.equal(apply(mat, 1, clean), apply(mat, 1, cleanRcpp))
# [1] TRUE
system.time(apply(mat, 1, clean))
# user system elapsed
# 4.918 0.035 4.992
system.time(apply(mat, 1, cleanRcpp))
# user system elapsed
# 0.093 0.016 0.120
In this case the Rcpp solution provides roughly a 40x speedup compared to the base R implementation.
Here's a base R solution that's nearly as fast as josilber's Rcpp function:
spread_left <- function(df) {
  nc <- ncol(df)
  # flatten row-wise with a -Inf sentinel ending each row, then reverse
  x <- rev(as.vector(t(as.matrix(cbind(df, -Inf)))))
  ii <- cumsum(!is.na(x))   # run id: each non-NA starts a new run (reading right to left)
  f <- tabulate(ii)         # length of each run
  v <- x[!duplicated(ii)]   # the value that starts each run
  xx <- v[ii] / f[ii]       # spread that value evenly over its run
  xx[xx == -Inf] <- NA      # runs headed by the sentinel correspond to trailing NAs
  m <- matrix(rev(xx), ncol = nc + 1, byrow = TRUE)[, seq_len(nc)]
  as.data.frame(m)
}
spread_left(df)
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
It manages to be relatively fast by vectorizing everything and completely avoiding time-expensive calls to apply(). (The downside is that it's also relatively obfuscated; to see how it works, do debug(spread_left) and then apply it to the small data.frame df in the OP.)
Here are benchmarks for all currently posted solutions:
library(rbenchmark)
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
df <- as.data.frame(mat)
## First confirm that it produces the same results
identical(spread_left(df), as.data.frame(t(apply(mat, 1, clean))))
# [1] TRUE
## Then compare its speed
benchmark(josilberR    = t(apply(mat, 1, clean)),
          josilberRcpp = t(apply(mat, 1, cleanRcpp)),
          Josh         = spread_left(df),
          Henrik       = t(apply(df, 1, fn)),
          replications = 10)
# test replications elapsed relative user.self sys.self
# 4 Henrik 10 38.81 25.201 38.74 0.08
# 3 Josh 10 2.07 1.344 1.67 0.41
# 1 josilberR 10 57.42 37.286 57.37 0.05
# 2 josilberRcpp 10 1.54 1.000 1.44 0.11
Another base R possibility. I first create a grouping variable (grp), over which the 'spread' is then done with ave:
fn <- function(x) {
  grp <- rev(cumsum(!is.na(rev(x))))
  res <- ave(x, grp, FUN = function(y) sum(y, na.rm = TRUE) / length(y))
  res[grp == 0] <- NA
  res
}
t(apply(df, 1, fn))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
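To see what the grouping variable does, here is a short trace on the first row of df (added for illustration, not part of the original answer):
x <- c(1, 2, NA, 4)                  # first row of df
rev(cumsum(!is.na(rev(x))))          # grp: 3 2 1 1
# the NA in week three shares group 1 with the 4 that follows it,
# so ave() assigns 4 / 2 = 2 to both positions, giving 1 2 2 2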
I was thinking that if NAs are relatively rare, it might be better to make the edits by reference. (I'm guessing this is how the Rcpp approach works.) Here's how it can be done in data.table, borrowing Henrik's function almost verbatim and converting to long format:
require(data.table) # 1.9.5
fill_naseq <- function(df) {
  # switch to long format
  DT  <- data.table(id = (1:nrow(df)) * ncol(df), df)
  mDT <- setkey(melt(DT, id.vars = "id"), id)
  mDT[, value := as.numeric(value)]
  mDT[, badv  := is.na(value)]
  mDT[
    # subset to rows that need modification
    badv | shift(badv),
    # apply Henrik's function, more or less
    value := {
      g <- ave(!badv, id, FUN = function(x) rev(cumsum(rev(x)))) + id
      ave(value, g, FUN = function(x) { n <- length(x); x[n] / n })
    }]
  # revert to wide format
  (setDF(dcast(mDT, id ~ variable)[, id := NULL]))
}
identical(fill_naseq(df),spread_left(df)) # TRUE
To show the best-case scenario for this approach, I simulated so that NAs are very infrequent:
nr <- 1e4
nc <- 100
nafreq <- 1/1e4
mat <- matrix(sample(
  c(NA, 1:3),
  nr * nc,
  replace = TRUE,
  prob = c(nafreq, rep((1 - nafreq) / 3, 3))
), nrow = nr)
df <- as.data.frame(mat)
benchmark(F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 1 F 10 3.82 1.394 3.72
# 2 Josh 10 2.74 1.000 2.70
# I don't have Rcpp installed and so left off josilber's even faster approach
So, it's still slower. However, with data kept in a long format, reshaping wouldn't be necessary:
DT  <- data.table(id = (1:nrow(df)) * ncol(df), df)
mDT <- setkey(melt(DT, id.vars = "id"), id)
mDT[, value := as.numeric(value)]

fill_naseq_long <- function(mDT) {
  mDT[, badv := is.na(value)]
  mDT[badv | shift(badv), value := {
    g <- ave(!badv, id, FUN = function(x) rev(cumsum(rev(x)))) + id
    ave(value, g, FUN = function(x) { n <- length(x); x[n] / n })
  }]
  mDT
}
benchmark(F2 = fill_naseq_long(mDT), F = fill_naseq(df), Josh = spread_left(df),
          replications = 10)[1:5]
# test replications elapsed relative user.self
# 2 F 10 3.98 8.468 3.81
# 1 F2 10 0.47 1.000 0.45
# 3 Josh 10 2.72 5.787 2.69
Now it's a little faster. And who doesn't like keeping their data in long format? This also has the advantage of working even if we don't have the same number of observations per "id".

How can I vectorize this task in R?

For a specific task, I have written the following R script:
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp  <- as.factor(c(1, 1, 2, 2, 1, 1, 1))

cut   <- unique(pred)
cut_n <- length(cut)
n     <- length(pred)

class_1 <- numeric(cut_n)
class_2 <- numeric(cut_n)

curr_cut  <- cut[1]
class_1_c <- 0
class_2_c <- 0
j <- 1

for (i in 1:n) {
  if (curr_cut != pred[i]) {
    j <- j + 1
    curr_cut <- pred[i]
  }
  if (grp[i] == levels(grp)[1])
    class_1_c <- class_1_c + 1
  else
    class_2_c <- class_2_c + 1
  class_1[j] <- class_1_c
  class_2[j] <- class_2_c
}

cat("index:", cut, "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
My goal above is to compute, for each unique value in pred, the cumulative number of times each level of grp has appeared. For example, I get the following output for the script above:
index: 0.1 0.2 0.3
class1: 2 3 5
class2: 1 2 2
I am a beginner in R and I have few questions about this:
How can I make this code faster and simpler?
Is it possible to vectorize this and avoid the for loop?
Is there a different "R-esque" way of doing this?
Any help would be greatly appreciated. Thanks!
You can start by getting the unique group/pred counts using a table:
table(grp, pred)
# pred
# grp 0.1 0.2 0.3
# 1 2 1 2
# 2 1 1 0
Of course this isn't exactly what you wanted. You want cumulative totals, so we can adjust this result by applying a cumulative sum across each row (transposed to better match your data layout)
t(apply(table(grp, pred), 1, cumsum))
# grp 0.1 0.2 0.3
# 1 2 3 5
# 2 1 2 2
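If you also want the three separate vectors that the original script printed, they can be pulled straight from that table (a small follow-up sketch, not part of the original answer):
tab <- t(apply(table(grp, pred), 1, cumsum))
as.numeric(colnames(tab))   # index:  0.1 0.2 0.3
tab["1", ]                  # class1: 2 3 5
tab["2", ]                  # class2: 1 2 2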
