I have this code:
df[, -1] = apply(df[, -1], 2, function(x){x * log(x)})
df looks like:
sample a b c
a2 2 1 2
a3 3 0 45
The problem I am having is that some of the values in df are 0, and you cannot take ln(0). So I would like to tell my program to return 0 whenever it would otherwise try to take ln(0).
You could use ifelse here:
df[,-1] = apply(df[,-1], 2, function(x){ ifelse(x != 0, x*log(x), 0) })
You can take advantage of floating point precision by adding a tiny amount, smaller than the floating point error, to x before taking the log. Because log(.Machine$double.xmin) is still a finite number, a zero input gives 0 * log(0 + .Machine$double.xmin), which is exactly 0. The results for other numbers are only changed by amounts smaller than the floating point error, meaning for practical purposes not at all.
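A quick check of both claims (a minimal sketch of the idea, base R only):
0 * log(0 + .Machine$double.xmin)        # finite log, so the product is exactly 0
#> [1] 0
log(2 + .Machine$double.xmin) == log(2)  # the offset is below double precision
#> [1] TRUE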
Avoiding the iteration and using .Machine$double.xmin for a very, very small number,
df <- data.frame(sample = c("a2", "a3"),
                 a = 2:3,
                 b = c(1L, 0L),
                 c = c(2L, 45L))
df
#> sample a b c
#> 1 a2 2 1 2
#> 2 a3 3 0 45
df[-1] <- df[-1] * log(df[-1] + .Machine$double.xmin)
df
#> sample a b c
#> 1 a2 1.386294 0 1.386294
#> 2 a3 3.295837 0 171.299812
To check the results, let's use another approach: change the 0 values to 1 so that x * log(x) returns 0 for them:
df2 <- data.frame(sample = c("a2", "a3"),
                  a = 2:3,
                  b = c(1L, 0),
                  c = c(2L, 45L))
df2[df2 == 0] <- 1
df2[-1] <- df2[-1] * log(df2[-1])
df2
#> sample a b c
#> 1 a2 1.386294 0 1.386294
#> 2 a3 3.295837 0 171.299812
Because the change is less than floating point error, the results are identical according to R:
identical(df, df2)
#> [1] TRUE
Related
I have a data.table with a list column "c":
df <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))
df
a c
1: 1 1
2: 2 1,2
3: 3 1,2,3
I want to create separate columns for the values in "c".
I create a set of new columns F_1, F_2, F_3:
mmax <- max(df$a)
flux <- paste("F", 1:mmax, sep = "_")
df[, (flux) := 0]
df
a c F_1 F_2 F_3
1: 1 1 0 0 0
2: 2 1,2 0 0 0
3: 3 1,2,3 0 0 0
I want to dispatch values in "c" to columns F_1, F_2, F_3 like this:
df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
What I have tried:
comp_vect <- function(vec, mmax){
  vec <- vec %>% unlist()
  n <- length(vec)
  answr <- c(vec, rep(0, l = mmax - n))
}
df[ , ..flux := mapply(comp_vect, c, mmax)]
The expected data.table is:
> df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
I followed a radically different approach: I rbind the list column and then dcast it, which gives the desired result. The last step is to set the names.
library(data.table)
df <- data.table(a = 1:3, d = list(1L, c(1L, 2L), c(1L, 2L, 3L)))
df2 <- df[, rbind(d), by = a][, dcast(.SD, a ~ V1, fill = 0)]
setnames(df2, 2:4, flux)[]
a F_1 F_2 F_3
1: 1 1 0 0
2: 2 1 2 0
3: 3 1 2 3
where flux is the vector of names that you defined in your question.
Please notice that I avoided using the column name c, as it may be confused with the function c().
Solution:
for(idx in seq(max(sapply(df$c, length)))){ # maximum number of values across all elements of the list
  set(x = df,
      i = NULL,
      j = paste0("F_", idx), # column name
      value = sapply(df$c, function(x){
        if(is.na(x[idx])){
          return(0) # 0 instead of NA
        } else {
          return(x[idx])
        }
      })
  )
}
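Applied to the example data, df should now match the expected output from the question:
df
   a     c F_1 F_2 F_3
1: 1     1   1   0   0
2: 2   1,2   1   2   0
3: 3 1,2,3   1   2   3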
Explanation:
We can extract the values from a list like this:
sapply(df$c, function(ll) return(ll[1])) # first value
[1] 1 1 1
sapply(df$c, function(ll) return(ll[2])) # second value
[1] NA 2 2
sapply(df$c, function(ll) return(ll[3])) # third value
[1] NA NA 3
We see that if there is no value, we have a NA.
We need an iterator to extract all values at the position idx. For that, we'll find the number of values in each element of df$c (the list) and keep the maximum.
max(sapply(df$c, length))
[1] 3
If we want zeros instead of NAs, we need a function inside the sapply to convert them:
vec <- c(NA, 5, 1, NA)
sapply(vec, function(x) if(is.na(x)) return(0) else return(x))
[1] 0 5 1 0
I'm trying to create a function where I can pass another function as an argument and apply it to a variable number of columns, after removing zeros. I'm not too comfortable with ellipses yet, and I'm guessing this is where the problem is arising. Right now the function takes all the values in the specified columns, summarizes them with the selected function, and then mutates that single value into every row. Instead, I'd like the function to be applied across each row (e.g. like rowMeans).
Example:
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
df <- data.frame(a, b, c, d)
FUNexcludeZero <- function(function_name, ...){
  # Match function name
  FUN <- match.fun(function_name)
  # get all the values - I'm sure this is the problem, need to somehow turn it back into a df?
  vals <- unlist(list(...))
  # Remove 0's and perform function
  valsNo0 <- vals[vals != 0]
  compiledVals <- FUN(valsNo0)
  return(compiledVals)
}
df %>%
mutate(foo = FUNexcludeZero(function_name = 'sd', a, b))
a b c d foo
1 1 0 5 2 1.457738
2 2 4 4 0 1.457738
3 3 3 3 1 1.457738
4 4 0 2 0 1.457738
5 5 1 1 4 1.457738
df %>%
mutate(foo = FUNexcludeZero(function_name = 'min', a, b))
a b c d foo
1 1 0 5 2 1
2 2 4 4 0 1
3 3 3 3 1 1
4 4 0 2 0 1
5 5 1 1 4 1
# Try row-function (same error occurs with rowMeans)
df %>%
mutate(foo = FUNexcludeZero(function_name = 'pmin', a, b))
Error in mutate_impl(.data, dots) :
Column `foo` must be length 5 (the number of rows) or one, not 8
For function_name = 'sd' the column should be c(NA, 1.41, 0, NA, 2.828) and the min and pmin should be c(1, 2, 3, 4, 1). I'm 100% sure the error has something to do with the list/unlist, but any other way I try it I end up with an error.
I am not sure if this is exactly what you want. You need to perform a row-wise operation on the two vectors, so I used the apply function. This should work for any number of equal-length vectors.
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
#df <- data.frame(a, b, c, d) #not used
FUNexcludeZero <- function(function_name, ...){
  # Match function name
  FUN <- match.fun(function_name)
  # combine the vectors into a matrix
  df <- cbind(...)
  # remove 0's from each row and apply the function to what remains
  compiledVals <- apply(df, 1, function(x) {
    x <- x[x != 0]
    FUN(x)
  })
  return(compiledVals)
}
FUNexcludeZero(function_name = 'sd', a, b)
#[1] NA 1.414214 0.000000 NA 2.828427
FUNexcludeZero(function_name = 'min', a, b)
#[1] 1 2 3 4 1
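As a usage note (my own addition, reusing the a and b vectors above), passing 'mean' should give the row-wise mean of the non-zero values, which is the rowMeans-style behaviour you were after:
FUNexcludeZero(function_name = 'mean', a, b)
#[1] 1 3 3 4 3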
I wanted to create a vector of counts if possible.
For example: I have a vector
x <- c(3, 0, 2, 0, 0)
How can I create a frequency vector for all integers between 0 and 3? Ideally I wanted to get a vector like this:
3 0 1 1
which gives me the counts of 0, 1, 2, and 3 respectively.
Much appreciated!
You can do
table(factor(x, levels=0:3))
Simply using table(x) is not enough.
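To illustrate the difference (my own quick check):
table(x)                        # only reports values that occur: 0, 2, 3
table(factor(x, levels = 0:3))  # reports counts for all of 0:3, i.e. 3 0 1 1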
Or with tabulate, which is faster:
tabulate(factor(x, levels = min(x):max(x)))
You can do this using rle (I made this in minutes, so sorry if it's not optimized enough).
x = c(3, 0, 2, 0, 0)
r = rle(x)
f = function(x) sum(r$lengths[r$values == x])
s = sapply(FUN = f, X = as.list(0:3))
data.frame(x = 0:3, freq = s)
# x freq
#1 0 3
#2 1 0
#3 2 1
#4 3 1
You can just use table():
a <- table(x)
a
#x
#0 2 3
#3 1 1
Then you can subset it:
a[names(a)==0]
#0
#3
Or convert it into a data.frame if you're more comfortable working with that:
u<-as.data.frame(table(x))
u
# x Freq
#1 0 3
#2 2 1
#3 3 1
Edit 1:
For levels:
a<- as.data.frame(table(factor(x, levels=0:3)))
I am trying to create a column in a very large data frame (~ 2.2 million rows) that calculates the cumulative sum of 1's for each factor level, and resets when a new factor level is reached. Below is some basic data that resembles my own.
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)
I would like the output variable, cum.goodp, to look like this:
cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1)
I get that there is a lot out there using the canonical split-apply-combine approach, which is conceptually intuitive, but I tried using the following:
k <- transform(df, cum.goodp = goodp*ave(goodp, c(0L, cumsum(diff(goodp != 0))), FUN = seq_along, by = itemcode))
When I try to run this code it's very very slow. I get that transform is part of the reason why (the 'by' doesn't help either). There are over 70K different values for the itemcode variable, so it should probably be vectorized. Is there a way to vectorize this, using cumsum? If not, any help whatsoever would be truly appreciated. Thanks so much.
A base R approach is to calculate cumsum over the whole vector, and capture the geometry of the sub-lists using run-length encoding. Figure out the start of each group, and create new groups
start <- c(TRUE, itemcode[-1] != itemcode[-length(itemcode)]) | !goodp
f <- cumsum(start)
Summarize these as a run-length encoding, and calculate the overall sum
r <- rle(f)
x <- cumsum(goodp)
Then use the geometry to get the offset that each embedded sum needs to be corrected by
offset <- c(0, x[cumsum(r$lengths)])
and calculate the updated value
x - rep(offset[-length(offset)], r$lengths)
Here's a function:
cumsumByGroup <- function(x, f) {
  start <- c(TRUE, f[-1] != f[-length(f)]) | !x
  r <- rle(cumsum(start))
  x <- cumsum(x)
  offset <- c(0, x[cumsum(r$lengths)])
  x - rep(offset[-length(offset)], r$lengths)
}
Here's the result applied to the sample data
> cumsumByGroup(goodp, itemcode)
[1] 0 1 2 0 1 1 2 0 0 1 1 1 2 0 1
and its performance:
> n <- 1 + rpois(1000000, 1)
> goodp <- sample(c(0, 1), sum(n), TRUE)
> itemcode <- rep(seq_along(n), n)
> system.time(cumsumByGroup(goodp, itemcode))
user system elapsed
0.55 0.00 0.55
The dplyr solution takes about 70s.
@alexis_laz's solution is both elegant and 2 times faster than mine:
cumsumByGroup1 <- function(x, f) {
  start <- c(TRUE, f[-1] != f[-length(f)]) | !x
  cs <- cumsum(x)
  cs - cummax((cs - x) * start)
}
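As a quick sanity check (my own addition), the two versions should agree, e.g. on the question's goodp and itemcode vectors:
> identical(cumsumByGroup(goodp, itemcode), cumsumByGroup1(goodp, itemcode))
[1] TRUE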
With the modified example input/output you could use the following base R approach (among others):
transform(df, cum.goodpX = ave(goodp, itemcode, cumsum(goodp == 0), FUN = cumsum))
# itemcode goodp cum.goodp cum.goodpX
#1 a1 0 0 0
#2 a1 1 1 1
#3 a1 1 2 2
#4 a1 0 0 0
#5 a1 1 1 1
#6 a2 1 1 1
#7 a2 1 2 2
#8 a3 0 0 0
#9 a4 0 0 0
#10 a4 1 1 1
#11 a5 1 1 1
#12 a6 1 1 1
#13 a6 1 2 2
#14 a6 0 0 0
#15 a6 1 1 1
Note: I added column cum.goodp to the input df and created a new column cum.goodpX so you can easily compare the two.
But of course you can use many other approaches with packages, either what #MartinMorgan suggested or for example using dplyr or data.table, to name just two options. Those may be a lot faster than base R approaches for large data sets.
Here's how it would be done in dplyr:
library(dplyr)
df %>%
  group_by(itemcode, grp = cumsum(goodp == 0)) %>%
  mutate(cum.goodpX = cumsum(goodp))
A data.table option was already provided in the comments to your question.
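For completeness, a data.table version along the same lines might look like this (my own sketch, not necessarily the one from the comments):
library(data.table)
# group on itemcode and on the runs delimited by zeros, then cumsum within each group
setDT(df)[, cum.goodpX := cumsum(goodp), by = .(itemcode, grp = cumsum(goodp == 0))]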
I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(20, 25),
for ID 2 it would return NA because you can't calculate a variance on just one entry, for ID 3 the var would be var(15, 32, 16), and for ID 4 it would again return NA because it has no non-zero values at all to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len <- nrow(d)
variances <- numeric(len)
for (i in 1:len){
  # get all nonzero values in the ith row of d into a vector nonzerodat here
  currentvar <- var(nonzerodat)
  variances[i] <- currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
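If you want the variances attached to the data frame (my own follow-up, using the same var_nonzero helper):
Data$variance <- apply(Data[, -1], 1, var_nonzero)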
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
  melt(., id.var = "id") %>%
  group_by(id) %>%
  summarise(variance = var(value[value != 0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA
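IDs 3 and 4 come back as NA because each has only one non-zero value, and var() of a single value is NA:
var(7)
[1] NA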