Let's say I have a data frame consisting of one variable (x):
df <- data.frame(x=c(1,2,3,3,5,6,7,8,9,9,4,4))
I want to know how many numbers are less than each of 2, 3, 4, 5, 6, 7.
I know how to do this manually using
# This will tell you how many numbers in df less than 4
xnew <- length(df[ which(df$x < 4), ])
My question is: how can I automate this with a for loop or some other method? I also need to store the results in an array as follows:
i length
2 1
3 2
4 4
5 6
6 7
7 8
Thanks
One way is to loop over the thresholds (2:7) with sapply, test which elements of df$x are less than (<) each threshold, and sum the resulting logical vector. Binding the counts to the thresholds with cbind gives the matrix output:
res <- cbind(i=2:7, length=sapply(2:7, function(y) sum(df$x <y)))
Or you can vectorize: build a matrix in which each threshold (2:7) fills one column, repeated once per row of df, and compare df$x against it with <. Because df$x is recycled down each column, every column holds the comparison against one threshold, and colSums returns the counts.
length <- colSums(df$x <matrix(2:7, nrow=nrow(df), ncol=6, byrow=TRUE))
#or
#length <- colSums(df$x < `dim<-`(rep(2:7,each=nrow(df)),c(12,6)))
cbind(i=2:7, length=length)
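A more compact way to build the same comparison matrix is outer, which applies < to every (value, threshold) pair. This is only a minimal sketch using the df from the question; the names thresholds, less_than and counts are illustrative, not part of the original answer.

```r
df <- data.frame(x = c(1, 2, 3, 3, 5, 6, 7, 8, 9, 9, 4, 4))
thresholds <- 2:7

# outer() returns a 12 x 6 logical matrix:
# one row per element of df$x, one column per threshold
less_than <- outer(df$x, thresholds, "<")

# Column sums count how many elements fall below each threshold
counts <- colSums(less_than)
res <- cbind(i = thresholds, length = counts)
res
```

This avoids hard-coding the number of thresholds (the ncol=6 above), so changing thresholds is enough to change the table.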
num = c(2,3,4,5,6,7)
res = sapply(num, function(u) length(df$x[df$x < u]))
data.frame(number=num,
numberBelow=res)
A vectorized solution:
findInterval(2:7*(1-.Machine$double.eps),sort(df$x))
The .Machine$double.eps factor shrinks each threshold very slightly, which ensures that you count only the numbers strictly lower than the threshold rather than lower than or equal to it.
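If your R version has the left.open argument of findInterval (I believe it arrived in R 3.3.0, so treat that as an assumption), the strict comparison can be requested directly instead of nudging the thresholds. A sketch:

```r
df <- data.frame(x = c(1, 2, 3, 3, 5, 6, 7, 8, 9, 9, 4, 4))

# left.open = TRUE makes the intervals (v[i], v[i+1]], so the returned index
# is the number of sorted values strictly less than each threshold
counts <- findInterval(2:7, sort(df$x), left.open = TRUE)
counts
```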
I am trying to run a summation on each row of dataframe. Let's say I want to take the sum of 100n^2, from n=1 to n=4.
> df <- data.frame(n = seq(1:4),a = rep(100))
> df
n a
1 1 100
2 2 100
3 3 100
4 4 100
Simpler example:
Let's make fun our example summation function. I can pull 100 out because I can just multiply it in later.
fun <- function(x) {
i <- seq(1,x,1)
sum(i^2) }
I want to then apply this function to each row to the dataframe, where df$n provides the upper bound of the summation.
The desired outcome would be as follows, in df$b:
> df
n a b
1 1 100 1
2 2 100 5
3 3 100 14
4 4 100 30
To achieve these results I've tried the apply function
apply(df$n,1,fun)
and also with df converted into a matrix
mat <- as.matrix(df)
apply(mat[1,],1,fun)
Both return an error:
Error in seq.default(1, x, 1) : 'to' must be of length 1
I understand this error, in that I understand why seq requires a 'to' value of length 1. I don't know how to go forward.
I have also tried the same while reading the dataframe as a matrix.
Maybe less simple example:
In my case I only need to multiply the results above, df$b, by 100 (or df$a) to get my final answer for each row. In other cases, though, the second value might be more entrenched, for example a^i. How would I call on both variables, a and n?
Underlying question:
My underlying goal is to apply a summation to each row of a dataframe (or a matrix). The above questions stem from my attempt to do so using seq(), as I saw advised in an answer on this site. I will gladly accept an answer that obviates the above questions with a different way to run a summation.
seq does not accept vectors for its from and to arguments, so fun has to be called once per element. We can loop over df$n with sapply:
df$b <- sapply(df$n, fun)
df$b
#[1] 1 5 14 30
Or we can wrap the function with Vectorize:
Vectorize(fun)(df$n)
#[1] 1 5 14 30
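For the case where both n and a are needed inside the summation, one possible approach is mapply, which passes the matching elements of several vectors to the function together. The sketch below is not from the original answer; the column names b and c and the a^i term are just the examples mentioned in the question.

```r
df <- data.frame(n = 1:4, a = 100)

# mapply() walks over df$n and df$a in parallel, one row at a time
df$b <- mapply(function(n, a) sum(a * seq_len(n)^2), df$n, df$a)
df$b
#[1]  100  500 1400 3000

# Same pattern when 'a' sits inside the sum, e.g. the sum of a^i for i = 1..n
df$c <- mapply(function(n, a) sum(a^seq_len(n)), df$n, df$a)
```

seq_len(n) is used instead of seq(1, n, 1) so that each call receives a single upper bound.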
I have a list of 10 data.frames. For each data.frame I need to divide one column by another and then calculate the relative standard deviation of the result.
I would like to use lapply.
Here is an example of one of the data.frames contained in the list:
df <- read.table(text = 'X Y
2 4
5 3
1 2
7 1
4 2
6 1', header = TRUE)
I have to perform the following operations with lapply for all my 10 data.frames:
ratio <- df$X / df$Y
sd <- sd(ratio)
We can do this by looping over the list with lapply: extract the columns of interest, divide them to get the 'ratio', and then take the sd of that ratio. (It could be done in a single step too.)
lapply(lst, function(x) {
  ratio <- x$X/x$Y
  sd(ratio)
})
where 'lst' is the list of 'data.frame's.
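The question asks for the relative standard deviation, which the code above stops short of. Here is a sketch that assumes the usual definition, sd / mean * 100; the two small data frames in lst are made-up stand-ins for the real list of 10.

```r
# Two small example data frames standing in for the real list
lst <- list(
  d1 = data.frame(X = c(2, 5, 1, 7, 4, 6), Y = c(4, 3, 2, 1, 2, 1)),
  d2 = data.frame(X = c(3, 6, 9), Y = c(1, 2, 4))
)

# Relative standard deviation (assumed here: sd / mean * 100) of X / Y
rsd <- sapply(lst, function(d) {
  ratio <- d$X / d$Y
  sd(ratio) / mean(ratio) * 100
})
rsd
```

sapply returns a named numeric vector, one value per data frame; lapply would return the same values as a list.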
I have a data frame of two columns
set.seed(120)
df <- data.frame(m1 = runif(500,1,30),n1 = round(runif(500,10,25),0))
and I wish to add a third column that uses column n1 and m1 to generate a normal distribution and then to get the standard deviation of that normal distribution. I mean to use the values in each row of the columns n1 as the number of replicates (n) and m1 as the mean.
How can I write a function to do this? I have tried to use apply
stdev <- function(x,y) sd(rnorm(n1,m1))
df$Sim <- apply(df,1,stdev)
But this does not work. Any pointers would be much appreciated.
Many thanks,
Matt
Your data frame input looks like:
# > head(df)
# m1 n1
# 1 12.365323 15
# 2 4.654487 15
# 3 10.993779 24
# 4 24.069388 22
# 5 6.684450 18
# 6 15.056766 16
I mean to use the values in each row of the columns n1 and m1 as the number of replicates (n) and as the mean.
First, here is how to use apply:
apply(df, 1, function(x) sd(rnorm(n = x[2], mean = x[1])))
But a better way is to use mapply:
mapply(function(x,y) sd(rnorm(n = x, mean = y)), df$n1, df$m1)
apply is designed for matrix input; given a data frame, it first coerces it to a matrix, which adds overhead (and would turn every value into a character string if any column were non-numeric).
Another option
lapply(Map(rnorm,n=df$n1,mean=df$m1),sd)
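To store the result as the third column the question asks for, one option is vapply over the row indices, which guarantees a plain numeric vector of the right length. This is a sketch using the data from the question, not part of the original answers.

```r
set.seed(120)
df <- data.frame(m1 = runif(500, 1, 30), n1 = round(runif(500, 10, 25), 0))

# One simulated sample per row: n1 draws from a normal with mean m1
# (the standard deviation of rnorm defaults to 1)
df$Sim <- vapply(
  seq_len(nrow(df)),
  function(i) sd(rnorm(n = df$n1[i], mean = df$m1[i])),
  numeric(1)
)
head(df)
```

Because rnorm is called with its default sd of 1, the values in Sim should scatter around 1 whatever m1 is; pass an explicit sd argument if a different spread is intended.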
I have a matrix defined as the pair-wise differences between the elements of an array:
a <- as.matrix(dist(c(1,2,3,4,5)))
I need to compute, without looping, the sum of the pair-wise differences between the first two elements, the first three elements, etc. That is, I need to arrive at the array:
v <- c(1,4,10,20)
Try
head(cumsum(cumsum(1:5)),-1)
#[1] 1 4 10 20
I don't know if you indeed want to call the cumulative sum function twice, as I think "the sum of the pair-wise differences between the first two elements, the first three elements, etc." should result in:
c(1, 3, 6, 10)
Anyway, this should work with non-sequential x as well for your required output:
> cumsumdiff <- function (x) cumsum(cumsum(sapply(x[-1], `-`, x[1])))
> cumsumdiff(1:5)
[1] 1 4 10 20
Or based on @Jota's suggestion using the distance matrix:
> cumsumdiff <- function(x) cumsum(cumsum(unname(as.matrix(dist(x))[1, -1])))
> cumsumdiff(1:5)
[1] 1 4 10 20
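If the intended quantity is the sum over all pairs among the first k elements (my reading of the question, so treat it as an assumption), it can be read off the lower triangle of the distance matrix. The function name pairsum is hypothetical. Note that the cumsum(cumsum(...)) versions only use differences from the first element, so they agree with this only for evenly spaced input such as 1:5.

```r
# Sum of |x[i] - x[j]| over all pairs i < j <= k, for k = 2, ..., length(x)
pairsum <- function(x) {
  a <- as.matrix(dist(x))
  # Row k of the lower triangle holds the distances from x[k] to x[1..k-1],
  # i.e. the pairs that are new when the k-th element is added
  new_pairs <- rowSums(a * lower.tri(a))
  unname(cumsum(new_pairs)[-1])
}

pairsum(1:5)
#[1]  1  4 10 20
pairsum(c(1, 2, 4))
#[1] 1 6
```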
I have a data frame
x <- data.frame(id=letters[1:3],val0=c(100,200,300),val1=c(400,500,600),val2=c(700,800,900))
I want to divide the odd value columns (val0, val2) by a specific number n1 (say) and the even value columns (val1) by another number n2 (say). So, the result I want is:
> n1 <- 2
> n2 <- 5
id val0 val1 val2
a 50 80 350
b 100 100 400
c 150 120 450
Can someone suggest how to do this?
Thanks.
You can use the function seq() to generate the column numbers and then subset those columns. For even columns start with 2 and for odd columns start with 3 (column 1 is id). Then replace the selected columns with the same columns divided by the number you are interested in.
x[,seq(2,ncol(x),2)]<-x[,seq(2,ncol(x),2)]/n1
x[,seq(3,ncol(x),2)]<-x[,seq(3,ncol(x),2)]/n2
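The same idea can be written once for any number of value columns by pairing each column with its divisor. A sketch, with cols and divisors as illustrative names; it assumes the divisors simply alternate n1, n2, n1, ... starting at the first value column.

```r
x <- data.frame(id = letters[1:3], val0 = c(100, 200, 300),
                val1 = c(400, 500, 600), val2 = c(700, 800, 900))
n1 <- 2
n2 <- 5

cols <- 2:ncol(x)                              # skip the id column
divisors <- rep_len(c(n1, n2), length(cols))   # n1, n2, n1, ...
x[cols] <- Map(`/`, x[cols], divisors)         # divide each column by its divisor
x
```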
A slightly disguised for loop (skipping the non-numeric id column):
x[-1] <- lapply(seq_len(ncol(x) - 1), function(i) x[[i + 1]]/ifelse(i%%2, n1, n2))
And just for kicks:
x[-1] <- lapply(seq_len(ncol(x) - 1), function(i) x[[i + 1]]/if(i%%2) n1 else n2)