This is an easy question, but I had problems solving it so please don't laugh at me.
I'm given a task to re-create my own function for mean in R instead of using the in-built mean function.
The condition for my function is that I need to use map_dbl to handle any iteration in my function.
I know that mean = (sum of all elements)/(number of elements)
The question is, does anyone know how to calculate the sum of all elements using map_dbl?
A bit overkill:
library(purrr)  # provides map_dbl
x <- c(1:10)
counter <- 0
mapsum <- map_dbl(x, ~{counter <<- counter + .x})
mapsum
[1] 1 3 6 10 15 21 28 36 45 55
tail(mapsum,1)
[1] 55
As mentioned in the comments, this works, but sum/mean is a reduce operation, not a map operation.
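To illustrate the "reduce, not map" point, here is a minimal sketch using base R's Reduce, which folds a vector into a single value; purrr's reduce plays the same role. The name my_mean is hypothetical, and this does not satisfy the map_dbl requirement of the original task, it only shows the more natural shape of the operation.

```r
x <- 1:10

# Fold the vector with `+` to get the sum, without calling sum()
total <- Reduce(`+`, x)

# Hypothetical helper: mean as (reduced sum) / (number of elements)
my_mean <- function(v) Reduce(`+`, v) / length(v)

total
my_mean(x)
```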
I am trying to run a summation on each row of dataframe. Let's say I want to take the sum of 100n^2, from n=1 to n=4.
> df <- data.frame(n = seq(1:4),a = rep(100))
> df
n a
1 1 100
2 2 100
3 3 100
4 4 100
Simpler example:
Let's make fun our example summation function. I can pull 100 out because I can just multiply it in later.
fun <- function(x) {
i <- seq(1,x,1)
sum(i^2) }
I then want to apply this function to each row of the dataframe, where df$n provides the upper bound of the summation.
The desired outcome would be as follows, in df$b:
> df
n a b
1 1 100 1
2 2 100 5
3 3 100 14
4 4 100 30
To achieve these results I've tried the apply function
apply(df$n,1,fun)
and also with df converted into a matrix
mat <- as.matrix(df)
apply(mat[1,],1,fun)
Both return an error:
Error in seq.default(1, x, 1) : 'to' must be of length 1
I understand this error, in that I understand why seq requires a 'to' value of length 1. I don't know how to go forward.
I have also tried the same while reading the dataframe as a matrix.
Maybe less simple example:
In my case I only need to multiply the results above, df$b, by 100 (or df$a) to get my final answer for each row. In other cases, though, the second value might be more entrenched, for example a^i. How would I call on both variables, a and n?
Underlying question:
My underlying goal is to apply a summation to each row of a dataframe (or a matrix). The above questions stem from my attempt to do so using seq(), as I saw advised in an answer on this site. I will gladly accept an answer that obviates the above questions with a different way to run a summation.
seq does not accept a vector for from and to, so we can loop over the values of df$n and call the function once per value:
df$b <- sapply(df$n, fun)
df$b
#[1] 1 5 14 30
Or we can Vectorize
Vectorize(fun)(df$n)
#[1] 1 5 14 30
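For the "less simple" case where the summand depends on both columns, a minimal sketch with mapply, which walks several vectors in parallel. The function name fun2 and the summand a * i^2 are assumptions chosen to match the 100n^2 example in the question.

```r
df <- data.frame(n = 1:4, a = rep(100, 4))

# Hypothetical two-argument version: sum of a * i^2 for i = 1..n
fun2 <- function(n, a) {
  i <- seq_len(n)
  sum(a * i^2)
}

# mapply passes df$n[k] and df$a[k] together for each row k
df$b <- mapply(fun2, df$n, df$a)
df$b
```

The same pattern would cover a summand such as a^i by changing only the body of fun2.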
Say I have a series of numbers:
seq1<-c(1:20,25:40,48:60)
How can I return a vector that lists points in which the sequence was broken, like so:
c(21,24)
[1] 21 24
c(41,47)
[1] 41 47
Thanks for any help.
To show my miserably failing attempt:
nums<-min(seq1):max(seq1) %in% seq1
which(nums==F)[1]
res.vec<-vector()
counter<-0
res.vec2<-vector()
counter2<-0
for (i in 2:length(seq1)){
if(nums[i]==F & nums[i-1]!=F){
counter<-counter+1
res.vec[counter]<-seq1[i]
}
if(nums[i]==T & nums[i-1]!=T){
counter2<-counter2+1
res.vec2[counter2]<-seq1[i]
}
}
cbind(res.vec,res.vec2)
I have changed the general function a bit, so I think this should be a separate answer.
You could try
seq1<-c(1:20,25:40,48:60)
myfun<-function(data,threshold){
cut<-which(c(1,diff(data))>threshold)
return(cut)
}
You get the points you have to care about using
myfun(seq1,1)
[1] 21 37
To make the result easier to reuse, it is convenient to store it in an object.
pru<-myfun(seq1,1)
So you can now call
df<-data.frame(pos=pru,value=seq1[pru])
df
pos value
1 21 25
2 37 48
You get a data frame with the position and the value of the breaks for your desired threshold. If you want a list instead of a data frame, it works like this:
list(pos=pru,value=seq1[pru])
$pos
[1] 21 37
$value
[1] 25 48
Function diff will give you the differences between successive values
> x <- c(1,2,3,5,6,3)
> diff(x)
[1] 1 1 2 1 -3
Now look for those values that are not equal to one for "breakpoints" in your sequence.
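A minimal sketch of how the diff idea can produce the missing ranges asked for in the question (21 to 24 and 41 to 47). The names idx and gaps and the column names are hypothetical; the vector is assumed to be sorted and integer-valued.

```r
seq1 <- c(1:20, 25:40, 48:60)

# Positions whose next element jumps by more than 1
idx <- which(diff(seq1) > 1)

# The gap starts one past the value before the jump
# and ends one before the value after it
gaps <- data.frame(
  first_missing = seq1[idx] + 1,
  last_missing  = seq1[idx + 1] - 1
)
gaps
```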
Taking into account the comments made here, for general use you could write:
fun<-function(data,threshold){
t<-which(c(1,diff(data)) != threshold)
return(t)
}
Consider that data could be any numerical vector (such as a data frame column). I would also consider using grep with a similar approach but it all depends on user preference.
I am using mapply(function, args) on a big dataset. After every 100 iterations I need to add a delay of 1 second. So the question is whether it is possible to show an iteration count or a progress bar within mapply(function, args).
Thanks
No, but if you switch to using the corresponding functions from plyr you can add a progress bar to the function call.
Without you giving us a minimal, reproducible example I'm not going to the effort of finding the exact plyr equivalent, but it will be one of the m*ply functions:
> ls(pos=2,pattern="m.*ply")
[1] "maply" "mdply" "mlply" "m_ply"
If you know the total number of iterations in advance, you could just add another argument to mapply as an iteration counter. In this example I added z. This example makes the command line sleep for 1 second every 3 iterations:
mapply(function(x, y, z) {
  if (z %% 3 == 0) {
    Sys.sleep(1)
    cat(paste0("Iteration ", z, " ...sleeping\n"))
  }
  x * y
}, x = 1:10, y = 1:10, z = 1:10)
#Iteration 3 ...sleeping
#Iteration 6 ...sleeping
#Iteration 9 ...sleeping
# [1] 1 4 9 16 25 36 49 64 81 100
If you need more convincing wrap the statement in system.time(). I get a runtime of 3.002 seconds.
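The same counter argument can also drive a progress bar. This is a sketch using base R's txtProgressBar and setTxtProgressBar from the utils package; the names n, pb, and the extra argument i are hypothetical, and the multiplication stands in for the real work.

```r
n <- 10

# Text progress bar running from 0 to n
pb <- txtProgressBar(min = 0, max = n, style = 3)

res <- mapply(function(x, y, i) {
  setTxtProgressBar(pb, i)  # advance the bar to the current iteration
  x * y
}, x = 1:n, y = 1:n, i = seq_len(n))

close(pb)
res
```

A pause could be added inside the function with the same i %% 100 == 0 style of test used for z above.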
I'm quite new to R and I would like to learn how to write a Loop to create and process several columns.
I imported a table into R that contains data with 23 variables. For all of these variables I want to calculate the per capita value, multiply this by 1000, and either write the data into a new table or into the same table as the old data.
To do this for only one column, my operation looked like this:
agriculture<-cbind(agriculture,"Total_value_per_capita"=agriculture$Total/agriculture$Total.Population*1000)
Now I'm asking how to do this in a loop for the 23 variables so that I won't have to write 23 similar lines of code.
I think the solution might look quite similar to the code pasted in this thread:
loop to create several matrix in R (maybe using paste)
but I couldn't get it working with my code.
So any suggestion would be very helpful.
I would always favor an appropriate *ply function over loops in R. In this case sapply could be your friend:
df <- data.frame( a=sample(10), b=sample(10), c=sample(10) )
df.per.capita <- as.data.frame(
sapply(
df[ colnames(df) != "c" ], function(x){ x/df$c *1000 }
)
)
For more complicated cases, you should definitely have a look at the plyr package.
This can be done using the sweep function. Using Beasterfield's data generation, but setting the seed first, you can obtain the same results:
set.seed(001)
df <- data.frame( a=sample(10), b=sample(10), c=sample(10) )
per.capita <- sweep(df[,colnames(df) != "c"], 1, STATS=df$c, FUN='/')*1000
per.capita
a b
1 300.0000 300.0000
2 2000.0000 1000.0000
3 833.3333 1000.0000
4 7000.0000 10000.0000
5 222.2222 555.5556
6 1000.0000 875.0000
7 1285.7143 1142.8571
8 1200.0000 800.0000
9 3333.3333 333.3333
10 250.0000 2250.0000
Comparing with Beasterfield's results:
all.equal(df.per.capita, per.capita)
[1] TRUE
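If the new columns should be appended to the same table, as in the one-column cbind line from the question, a plain loop over the column names is one possible sketch. The toy data and the Wheat column are made up for illustration; only Total, Total.Population, and the _value_per_capita suffix come from the question.

```r
# Made-up stand-in for the imported table
agriculture <- data.frame(
  Total.Population = c(1000, 2000, 4000),
  Total            = c(50, 80, 100),
  Wheat            = c(10, 30, 20)
)

# Every column except the population is treated as a variable to scale
vars <- setdiff(names(agriculture), "Total.Population")

for (v in vars) {
  new_name <- paste0(v, "_value_per_capita")
  agriculture[[new_name]] <- agriculture[[v]] / agriculture$Total.Population * 1000
}

agriculture
```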
Could someone please point out how we can apply multiple functions to the same column using tapply (or any other method, plyr, etc.) so that the results are obtained in distinct columns? For example, if I have a dataframe with
User MoneySpent
Joe 20
Ron 10
Joe 30
...
I want to get the result as the sum of MoneySpent plus the number of occurrences.
I used a function like --
f <- function(x) c(sum(x), length(x))
tapply(df$MoneySpent, df$User, f)
But this does not split it into columns, gives something like say,
Joe Joe 100, 5 # The sum=100, number of occurrences = 5, but it gets juxtaposed
Thanks in advance,
Raj
You can certainly do stuff like this using ddply from the plyr package:
library(plyr)  # provides ddply and summarise
dat <- data.frame(x = rep(letters[1:3],3),y = 1:9)
ddply(dat,.(x),summarise,total = NROW(piece), count = sum(y))
x total count
1 a 3 12
2 b 3 15
3 c 3 18
You can keep listing more summary functions, beyond just two, if you like. Note I'm being a little tricky here in calling NROW on an internal variable in ddply called piece. You could have just done something like length(y) instead. (And probably should; referencing the internal variable piece isn't guaranteed to work in future versions, I think. Do as I say, not as I do and just use length().)
ddply() is conceptually the clearest, but sometimes it is useful to use tapply instead for speed reasons, in which case the following works:
do.call( rbind, tapply(df$MoneySpent, df$User, f) )
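A self-contained sketch of the tapply approach on the three rows shown in the question. Naming the elements returned by f (total and count are hypothetical names) gives the resulting matrix readable column headers.

```r
df <- data.frame(
  User       = c("Joe", "Ron", "Joe"),
  MoneySpent = c(20, 10, 30)
)

# Return a named vector so rbind produces labelled columns
f <- function(x) c(total = sum(x), count = length(x))

# tapply gives one named vector per user; rbind stacks them into a matrix
res <- do.call(rbind, tapply(df$MoneySpent, df$User, f))
res
```

The result is a matrix with one row per user; as.data.frame(res) converts it if a data frame is needed.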