Populate column by adding onto row above using lag() in R

I want to populate an existing column with values that continually add onto the row above.
This is easy in Excel, but I haven't figured out a good way to automate it in R.
If we had two columns in Excel, A and B, cell B2 would be =B1+A2, cell B3 would be =B2+A3, and so on. How can I do this in R?
#example dataframe
df <- data.frame(A = 0:9, B = c(50,0,0,0,0,0,0,0,0,0))
#desired output
desired <- data.frame(A = 0:9, B = c(NA, 51, 53, 56, 60, 65, 71, 78, 86, 95))
I tried using the lag() function, but it didn't give the correct output.
df <- df %>%
  mutate(B = B + lag(A))
So I made a for loop that works, but I feel like there's a better solution.
for (i in 2:nrow(df)) {
  df$B[i] <- df$B[i-1] + df$A[i]
}
Eventually, I want to iterate this function over every n rows of the whole dataframe, essentially so the summation resets every n rows. (any tips on how to do that would be greatly appreciated!)

This might be close to what you need, and uses tidyverse. Specifically, it uses accumulate from purrr.
If you want the sum to reset to zero every n rows, you can use group_by ahead of time.
It was not entirely clear how you'd like to handle the first row; here, it will just use the first B value and ignore the first A value, which looked similar to what you had in the post.
n <- 5
library(tidyverse)
df %>%
  group_by(grp = ceiling(row_number() / n)) %>%
  mutate(B = accumulate(A[-1], sum, .init = B[1]))
Output
       A     B   grp
   <int> <dbl> <dbl>
 1     0    50     1
 2     1    51     1
 3     2    53     1
 4     3    56     1
 5     4    60     1
 6     5     0     2
 7     6     6     2
 8     7    13     2
 9     8    21     2
10     9    30     2

cumsum() can be used to get the result you need.
df$B <- cumsum(df$B + df$A)
df
A B
1 0 50
2 1 51
3 2 53
4 3 56
5 4 60
6 5 65
7 6 71
8 7 78
9 8 86
10 9 95
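If you also want the reset-every-n-rows behaviour from the question, here is a base R sketch using ave(), starting again from the original df; grp and x are names introduced here for illustration. It reproduces the accumulate() result above, which takes B from each group's first row and A from the remaining rows:
n   <- 5
grp <- ceiling(seq_len(nrow(df)) / n)        # group index: 1 for rows 1..n, 2 for the next n, ...
x   <- ifelse(duplicated(grp), df$A, df$B)   # first row of each group keeps B, later rows contribute A
df$B <- ave(x, grp, FUN = cumsum)            # cumulative sum restarted within each group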

Related

R: Creating Random Samples From Entries in Neighboring Row

I am working with the R programming language.
I have the following data set:
my_data = data.frame(id = c(1,2,3,4,5), n = c(15,3,51,8,75))
I want to create a new variable that generates a single random integer for each row based on the corresponding value of "n". I tried to do this with the following code:
my_data$rand = sample.int(my_data$n,1)
But this is not working (the same random number is repeated 5 times).
I also tried to define a function to this:
my_function <- function(x){sample.int(x,1)}
transform(my_data, new_column= my_function(my_data$n) )
But this is also not working (the same random number is again repeated 5 times).
In the end, I am trying to achieve something like this:
my_data$rand = c(sample.int(15,1), sample.int(3,1), sample.int(51,1), sample.int(8,1), sample.int(75,1))
Can someone please show me how to do this for larger datasets without having to manually specify each "sample.int" command?
Thanks!
When you say "based on value of n" what do you mean by that exactly? Based on n how?
Guess #1: at each row, you want to draw one random number with possible values 1 to n.
Guess #2: at each row, you want to draw n random numbers with possible values between 0 and 1.
The second option is harder, but option #1 can be done with a loop:
my_data = data.frame(id = c(1,2,3,4,5), n = c(15,3,51,8,75))
my_data$rand = NA
set.seed(123)
for (i in 1:nrow(my_data)) {
  my_data$rand[i] = sample(1:(my_data$n[i]), size = 1)
}
my_data
id n rand
1 1 15 15
2 2 3 3
3 3 51 51
4 4 8 6
5 5 75 67
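For the second interpretation (n draws between 0 and 1 per row), the rows produce vectors of different lengths, so one sketch is to keep them in a list rather than a regular column (assuming my_data as above):
set.seed(123)
draws <- lapply(my_data$n, runif)   # runif(15), runif(3), runif(51), ... one vector per row
lengths(draws)
# [1] 15  3 51  8 75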
We can use sapply to go over all rows in my_data, and generate one sample.int per iteration.
my_data$rand <- sapply(1:nrow(my_data), function(x) sample.int(my_data[x, 2], 1))
id n rand
1 1 15 7
2 2 3 2
3 3 51 28
4 4 8 6
5 5 75 9
You can do this efficiently by a single call to runif(), multiplying by n, and rounding up:
transform(my_data, rand = ceiling(runif(n) * n))
id n rand
1 1 15 13
2 2 3 1
3 3 51 41
4 4 8 1
5 5 75 9
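Not from the original answers, but a type-stable variant of the sapply() approach is vapply(), which guarantees one integer per row:
my_data$rand <- vapply(my_data$n, sample.int, integer(1), size = 1)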

How many times does the value for column B appear for a value in column A?

I am having the hardest time coming up with code that lets me match a topic (column B) to a name (column A) and create a frequency column for the number of times B has appeared with A. Columns A and B are codes for longer names.
I thought maybe using the count function from plyr, but I can't make it work. Maybe you can give me an idea of what code I could use?
For example, I have a table:
Col A   Col B
1       38
1       6
1       38
2       38
2       7
2       7
2       8
2       7
The result that I am looking for is
Col A   Col B   freq
1       38      2
1       6       1
2       38      1
2       7       3
2       8       1
So the number 38 has appeared with "1" two times, 6 has appeared one time, and so on.
I have 600 rows of data and can't come up with code that even comes close.
Thank you so much for your help!
Summarise and count using dplyr:
library(dplyr)
df2 <- df %>%
  group_by(col1, col2) %>%
  summarise(count = n()) %>%
  ungroup()
returns:
   col1  col2 count
  <dbl> <dbl> <int>
1     1     6     1
2     1    38     2
3     2     7     3
4     2     8     1
5     2    38     1
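For reference, dplyr's count() collapses the same group_by()/summarise(n()) pattern into one call (a sketch using the same df and column names assumed in the answer above):
library(dplyr)
df %>% count(col1, col2, name = "count")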

perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have the result of a group_by and summarize: a data frame that I want to do some further manipulation on, by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
                   group=as.factor(rep(c("a","b","c"),2)),
                   sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
                      total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I wrote the fractions in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table to get the total for each run of df:
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First you want to merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (this uses dplyr, plus the %<>% assignment pipe from magrittr):
library(dplyr)
library(magrittr)
df2 %<>% mutate(percent = sum / total)
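The same idea entirely in dplyr, with left_join() in place of merge(), is sketched below (assuming df and total as defined in the question):
library(dplyr)
df %>%
  left_join(total, by = "run") %>%
  mutate(percent = sum / total) %>%   # sum here is the column, not base::sum
  select(-total)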
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

adding row/column total data when aggregating data using plyr and reshape2 package in R

Most of the time at work I create aggregate tables using the flow below:
library(plyr)
library(reshape2)

set.seed(1)
temp.df <- data.frame(var1=sample(letters[1:5],100,replace=TRUE),
                      var2=sample(11:15,100,replace=TRUE))
temp.output <- ddply(temp.df,
                     c("var1","var2"),
                     function(df) {
                       data.frame(count=nrow(df))
                     })
temp.output.all <- ddply(temp.df,
                         c("var2"),
                         function(df) {
                           data.frame(var1="all",
                                      count=nrow(df))
                         })
temp.output <- rbind(temp.output,temp.output.all)
temp.output[,"var1"] <- factor(temp.output[,"var1"],levels=c(letters[1:5],"all"))
temp.output <- dcast(temp.output,formula=var2~var1,value.var="count",fill=0)
I'm starting to feel silly writing this "boilerplate" code every time I need the row/column totals in a new aggregate table. Is there some way to skip it?
Looking at your desired output (now that I'm in front of a computer), perhaps you should look at the margins argument of dcast:
library(reshape2)
dcast(temp.df, var2 ~ var1, value.var = "var2",
      fun.aggregate = length, margins = "var1")
# var2 a b c d e (all)
# 1 11 3 1 6 4 2 16
# 2 12 1 3 6 5 5 20
# 3 13 5 9 3 6 1 24
# 4 14 4 7 3 6 2 22
# 5 15 0 5 1 5 7 18
Also look into the addmargins function in base R.
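A sketch of that addmargins() route, using a plain contingency table of temp.df; note the extra column is labelled Sum rather than (all):
addmargins(table(temp.df$var2, temp.df$var1), margin = 2)  # adds a column of row totals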

Removing NAs when multiplying columns

This is a really simple question, but I am hoping someone will be able to help me avoid extra lines of unnecessary code. I have a simple dataframe:
Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
What I want to do is produce an extra column which is the multiplication of A, B and C, which I will then cbind to the original dataframe.
So, I would normally use:
attach(Df.1)
D<-A*B*C
But obviously where the NAs are in column C, I get an NA in variable D. I don't want to exclude the NA rows; rather, I want to ignore the NA values in column C (so D would simply be A*B where C is missing, or A*B*C where C is available).
I know I could simply replace the NAs with 1s so the calculation remains unchanged, or use if statements, but I was wondering what the simplest way of doing this is.
Any ideas?
You can use prod which has an na.rm argument. To do it by row use apply:
apply(Df.1,1,prod,na.rm=TRUE)
[1] 10 60 14 120 72 36
As @James said, prod and apply will work, but you don't need to waste memory storing the result in a separate variable, or even cbinding it:
Df.1$D = apply(Df.1, 1, prod, na.rm=T)
Assigning the new variable in the data frame directly will work.
> Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
> Df.1
A B C
1 5 1 2
2 4 5 3
3 7 2 NA
4 6 4 5
5 8 9 NA
6 4 1 9
> Df.1$D = apply(Df.1, 1, prod, na.rm=T)
> Df.1$D
[1] 10 60 14 120 72 36
> Df.1
A B C D
1 5 1 2 10
2 4 5 3 60
3 7 2 NA 14
4 6 4 5 120
5 8 9 NA 72
6 4 1 9 36
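The question also mentions replacing the NAs with 1s; a vectorized sketch of that idea, which avoids the row-wise apply() (C_filled is a name introduced here for illustration):
C_filled <- ifelse(is.na(Df.1$C), 1, Df.1$C)  # an NA in C contributes a neutral factor of 1
Df.1$D   <- Df.1$A * Df.1$B * C_filled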
