Running (cumulative) product in R? [duplicate]

This question already has answers here:
Multiplying elements of a column in skipping an element after each iteration
(3 answers)
Closed 3 years ago.
x <- as.data.frame(1:5)
With the above data frame I want to create a new column that holds a running product, i.e. the first element should be
1*2*3*4*5 = 120 then
2*3*4*5 = 120 then
3*4*5 = 60
and so on.
How can I do this in R?
result should be
> x[,"result"] <- c(120,120,60,20,5)
> x
1:5 result
1 1 120
2 2 120
3 3 60
4 4 20
5 5 5

We can use cumprod
rev(cumprod(rev(x[[1]])))
#[1] 120 120 60 20 5
Or
rev(Reduce(`*`, rev(x[[1]]), accumulate = TRUE))
Also, purrr (loaded with the tidyverse) provides a convenient wrapper, accumulate, which can run in the backward direction:
library(tidyverse)
x %>%
  mutate(result = accumulate(`1:5`, `*`, .dir = "backward"))
# 1:5 result
#1 1 120
#2 2 120
#3 3 60
#4 4 20
#5 5 5
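
To actually store the result as the new column the question asks for, a minimal sketch using the base R one-liner:
x <- as.data.frame(1:5)
x[,"result"] <- rev(cumprod(rev(x[[1]])))
x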

To do so while simply adding a new column to your data:
data <- data.frame(list(x = 1:5))
data
x
1 1
2 2
3 3
4 4
5 5
# product of each row's value up through 5 (the column maximum)
data$prod <- apply(data, 1, function(x) prod(x:5))
data
x prod
1 1 120
2 2 120
3 3 60
4 4 20
5 5 5
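
Note that prod(x:5) hardcodes the upper bound 5; a hedged generalization that derives the bound from the data instead:
# Sketch: use the column maximum instead of hardcoding 5
data$prod <- sapply(data$x, function(v) prod(v:max(data$x)))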

Related

Count the amount of times value A occurs without value B and vice versa

I'm having trouble figuring out how to do the opposite of the answer to this question (and in R, not Python).
Count the amount of times value A occurs with value B
Basically I have a dataframe with a lot of combinations of pairs of columns like so:
df <- data.frame(id1 = c("1","1","1","1","2","2","2","3","3","4","4"),
                 id2 = c("2","2","3","4","1","3","4","1","4","2","1"))
I want to count, how often all the values in column A occur in the whole dataframe without the values from column B. So the results for this small example would be the output of:
df_result <- data.frame(id1 = c("1","1","1","2","2","2","3","3","4","4"),
                        id2 = c("2","3","4","1","3","4","1","4","2","1"),
                        count = c("4","5","5","3","5","4","2","3","3","3"))
The important criterion here is that the final results dataframe is collapsed by pair (so in my example rows 1 and 2 are duplicates; they are collapsed, and the count is the total frequency with which 1 is observed without 2). For tallying the occurrences, both columns must be examined, i.e. column order doesn't matter for calculating the frequency: if column A has 1 and B has 2, this counts the same as if column A has 2 and B has 1.
I can do this very slowly by filtering for each pair, but it's not really feasible for my real data where I have many many different pairs.
Any guidance is greatly appreciated.
First paste the two id columns together into id12 for later matching. Then use sapply to go through all rows and count the records where id1 appears in id12 but id2 doesn't. Sum that value, output only the distinct records, and finally remove the id12 column.
library(dplyr)
df %>%
  mutate(id12 = paste0(id1, id2),
         count = sapply(1:nrow(.), function(x)
           sum(grepl(id1[x], id12) & !grepl(id2[x], id12)))) %>%
  distinct() %>%
  select(-id12)
Or in base R completely:
id12 <- paste0(df$id1, df$id2)
df$count <- sapply(1:nrow(df), function(x) sum(grepl(df$id1[x], id12) & !grepl(df$id2[x], id12)))
df <- df[!duplicated(df),]
Output
id1 id2 count
1 1 2 4
2 1 3 5
3 1 4 5
4 2 1 3
5 2 3 5
6 2 4 4
7 3 1 2
8 3 4 3
9 4 2 3
10 4 1 3
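
One caveat (an observation, not part of the original answer): grepl treats each id as a pattern, so grepl("1", "11") is TRUE and multi-character ids would be miscounted. A substring-safe base R sketch that compares whole ids, starting from the original df:
# TRUE for rows in which val appears in either id column
has_id <- function(val) df$id1 == val | df$id2 == val
# count rows containing id1[i] but not id2[i]
df$count <- sapply(seq_len(nrow(df)), function(i)
  sum(has_id(df$id1[i]) & !has_id(df$id2[i])))
df <- df[!duplicated(df), ]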
A full tidyverse version (using map_int so count comes back as an integer vector rather than a list):
library(tidyverse)
df %>%
  mutate(id = paste(id1, id2),
         count = map_int(cur_group_rows(),
                         ~ sum(str_detect(id, id1[.x]) &
                                 str_detect(id, id2[.x], negate = TRUE))))
A more efficient approach would be to work on a tabulation format:
tab = crossprod(table(rep(seq_len(nrow(df)), ncol(df)), c(df$id1, df$id2)))
tab
#    1 2 3 4
#  1 7 3 2 2
#  2 3 6 1 2
#  3 2 1 4 1
#  4 2 2 1 5
So now we have the number of times each value appears with another (irrespective of their order in the two columns). From here, we need a way to subset the above table for each pair and subtract the pair's co-occurrence count from the first id's total number of appearances.
Make a grid of all combinations:
gr = expand.grid(id1 = colnames(tab), id2 = rownames(tab), stringsAsFactors = FALSE)
Create 2-column matrices to subset the table:
id1.ij = cbind(match(gr$id1, colnames(tab)),
               match(gr$id1, rownames(tab)))
id2.ij = cbind(match(gr$id1, colnames(tab)),
               match(gr$id2, rownames(tab)))
Subtract the respective values:
cbind(gr, count = tab[id1.ij] - tab[id2.ij])
# id1 id2 count
#1 1 1 0
#2 2 1 3
#3 3 1 2
#4 4 1 3
#5 1 2 4
#6 2 2 0
#7 3 2 3
#8 4 2 3
#9 1 3 5
#10 2 3 5
#11 3 3 0
#12 4 3 4
#13 1 4 5
#14 2 4 4
#15 3 4 3
#16 4 4 0
Of course, if we do not need the full grid of values, we can set:
gr = unique(df)
which results in:
# id1 id2 count
#1 1 2 4
#3 1 3 5
#4 1 4 5
#5 2 1 3
#6 2 3 5
#7 2 4 4
#8 3 1 2
#9 3 4 3
#10 4 2 3
#11 4 1 3
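
For this example, the same arithmetic can also be written compactly with table(): each pair's count is the total number of appearances of id1 minus its co-occurrences with id2. A sketch, assuming the two ids in a row are always distinct, as here:
ids <- c(df$id1, df$id2)
totals <- table(ids)                                  # appearances of each id
pair_key <- function(a, b) paste(pmin(a, b), pmax(a, b))
co <- table(pair_key(df$id1, df$id2))                 # co-occurrences per unordered pair
res <- unique(df)
res$count <- as.integer(totals[res$id1] - co[pair_key(res$id1, res$id2)])
res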

Conditional Subset, Manipulate and Replace

Following on from a previous question here, I extracted the following data.frame
DF <- data.frame(A = c("One","Two","Three","Four","Five"),
                 B = c(1,1,2,2,3),
                 D = c(10,2,3,-5,5))
subset(DF, B %in% c(1,3))
A B D
1 One 1 10
2 Two 1 2
5 Five 3 5
but now I want to (for example) multiply the numbers by (say) five and replace them in the original data.frame
The following code
subset(DF, B %in% c(1,3))[,2:3] * 5
B D
1 5 50
2 5 10
5 15 25
gives me the numbers I want, but how do I get them back to
A B D
1 One 5 50
2 Two 5 10
3 Three 2 3
4 Four 2 -5
5 Five 15 25
The answer is staring me in the face (i.e. the index numbers... but how do I get to them)?
You can do
DF[DF$B %in% c(1, 3), 2:3] <- DF[DF$B %in% c(1, 3), 2:3] * 5
DF
# A B D
#1 One 5 50
#2 Two 5 10
#3 Three 2 3
#4 Four 2 -5
#5 Five 15 25
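
For a dplyr flavour of the same replacement (a hedged sketch; across() needs dplyr >= 1.0), compute the row condition once, then scale B and D where it holds:
library(dplyr)
DF %>%
  mutate(hit = B %in% c(1, 3)) %>%                    # row condition, computed first
  mutate(across(c(B, D), ~ ifelse(hit, .x * 5, .x))) %>%
  select(-hit)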

transform all but one variable in data.frame [duplicate]

This question already has answers here:
Exclude column in `dplyr` `mutate_at` while using data in this column
(2 answers)
Closed 3 years ago.
I have a data.frame that looks like this
library(dplyr)
set.seed(1)
s <- 100
example_df <- data.frame(id = 1:s) %>%
  mutate(x  = sample(0:10, s, replace = TRUE),
         x1 = sample(0:10, s, replace = TRUE),
         x2 = sample(0:10, s, replace = TRUE),
         x3 = sample(0:10, s, replace = TRUE),
         x4 = sample(0:10, s, replace = TRUE))
Now what I'd like to do is the following:
example_df %>%
  mutate(d_x1 = (x - x1)^2,
         d_x2 = (x - x2)^2,
         d_x3 = (x - x3)^2,
         d_x4 = (x - x4)^2)
How do I achieve the same result without explicitly spelling out every operation? mutate_at?
Thanks.
Yes, you can use mutate_at like
library(dplyr)
example_df %>%
  mutate_at(vars(matches("x\\d+")), list(d = ~ (x - .)^2)) %>%
  head()
# id x x1 x2 x3 x4 x1_d x2_d x3_d x4_d
#1 1 8 6 7 6 9 4 1 4 1
#2 2 3 7 3 1 5 16 0 4 4
#3 3 6 4 9 4 10 4 9 4 16
#4 4 0 5 7 9 1 25 49 81 1
#5 5 1 7 4 9 4 36 9 64 9
#6 6 6 0 4 7 3 36 4 1 9
Or in base R, we can use lapply
cols <- grep("x\\d+", names(example_df), value = TRUE)
example_df[paste0("d_", cols)] <- lapply(example_df[cols],
                                         function(x) (example_df$x - x)^2)
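
In current dplyr, mutate_at is superseded by across(); a sketch assuming the same x<digit> column pattern (the new columns come out as d_x1 ... d_x4, matching the names in the question):
library(dplyr)
example_df %>%
  mutate(across(matches("x\\d+"), ~ (x - .x)^2, .names = "d_{.col}")) %>%
  head()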

randomly select rows based on limited random numbers

Seems simple but I can't figure it out.
I have a bunch of animal location data (217 individuals) as a single dataframe. I'm trying to randomly select X locations per individual for further analysis with the caveat that X is within the range of 6-156.
So I'm trying to set up a loop that first randomly selects a value within the range of 6-156, then uses that value (say 56) to randomly extract 56 locations from the first individual animal, and so on.
for (i in unique(ANIMALS$ID)) {
  sub <- sample(6:156, 1)
  sub2 <- i[sample(nrow(i), sub), ]
}
This approach didn't seem to work so I tried tweaking it...
for (i in unique(ANIMALS$ID)) {
  sub <- sample(6:156, 1)
  rand <- i[sample(1:nrow(i), sub, replace = FALSE), ]
}
This did not work either... Any suggestions or previous postings would be helpful!
Head of the datafile...ANIMALS is the name of the df, ID indicates unique individuals
> FID X Y MONTH DAY YEAR HOUR MINUTE SECOND ELKYR SOURCE ID animalid
1 0 510313 4813290 9 5 2008 22 30 0 342008 FG 1 1
2 1 510382 4813296 9 6 2008 1 30 0 342008 FG 1 1
3 2 510385 4813311 9 6 2008 2 0 0 342008 FG 1 1
4 3 510385 4813394 9 6 2008 3 30 0 342008 FG 1 1
5 4 510386 4813292 9 6 2008 2 30 0 342008 FG 1 1
6 5 510386 4813431 9 6 2008 4 1 0 342008 FG 1 1
Here's one way using mapply. This function takes two lists (or something that can be coerced into a list) and applies function FUN to corresponding elements.
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
# calculate number of samples for individual animal
num.samples.per.animal <- sample(3:6, length(unique(xy$animal)), replace = TRUE)
num.samples.per.animal
[1] 6 3 4 4 6 3 3 6 3 5
# subset random x number of rows from each animal
result <- do.call("rbind",
                  mapply(num.samples.per.animal, split(xy, f = xy$animal),
                         FUN = function(x, y) {
                           y[sample(1:nrow(y), x), ]
                         }, SIMPLIFY = FALSE))
result
animal loc
7 1 0.99483999
1 1 0.50951321
10 1 0.36505294
6 1 0.34058842
8 1 0.26489107
9 1 0.47418823
13 2 0.27213396
12 2 0.28087775
15 2 0.22130069
23 3 0.33646632
21 3 0.02395097
28 3 0.53079981
29 3 0.85287600
35 4 0.84534073
33 4 0.87370167
31 4 0.85646813
34 4 0.11642335
46 5 0.59624723
48 5 0.15379729
45 5 0.57046122
42 5 0.88799675
44 5 0.62171858
49 5 0.75014593
60 6 0.86915983
54 6 0.03152932
56 6 0.66128549
64 7 0.85420774
70 7 0.89262455
68 7 0.40829671
78 8 0.19073661
72 8 0.20648832
80 8 0.71778913
73 8 0.77883677
75 8 0.37647108
74 8 0.65339300
82 9 0.39957202
85 9 0.31188471
88 9 0.10900795
100 10 0.55282999
95 10 0.10145296
96 10 0.09713218
93 10 0.64900866
94 10 0.76099256
EDIT
Here is another (more straightforward) approach that also handles cases where the number of rows is less than the number of samples to be allocated.
set.seed(357)
result <- do.call("rbind",
                  by(xy, INDICES = xy$animal, FUN = function(x) {
                    avail.obs <- nrow(x)
                    num.rows <- sample(3:15, 1)
                    while (num.rows > avail.obs) {
                      message("Sample would be larger than available data points; repeating sampling.")
                      num.rows <- sample(3:15, 1)
                    }
                    x[sample(1:avail.obs, num.rows), ]
                  }))
result
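
A hedged simplification of the while loop: cap the draw at the rows available, so sample() can never ask for more rows than exist (assumes every animal has at least 3 rows):
result <- do.call("rbind",
                  by(xy, INDICES = xy$animal, FUN = function(x) {
                    num.rows <- sample(3:min(15, nrow(x)), 1)
                    x[sample(nrow(x), num.rows), ]
                  }))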
I like Stack Overflow because I learn so much. @RomanLustrik provided a simple solution; mine is straightforward as well:
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
newVec <- NULL  # will accumulate the sampled rows
for (i in unique(xy$animal)) {
  # Sample a number between 1 and 10 (or 6 and 156, if you need)
  samp <- sample(1:10, 1)
  # Determine which rows of xy correspond with animal i
  rows <- which(xy$animal == i)
  # From xy, sample samp times from the rows associated with animal i
  newVec1 <- xy[sample(rows, samp, replace = TRUE), ]
  # Append everything to the same data frame
  newVec <- rbind(newVec, newVec1)
}
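
Note that replace = TRUE above can draw the same location twice for an animal. A hedged variant without replacement, capping samp at the rows available:
newVec <- NULL
for (i in unique(xy$animal)) {
  rows <- which(xy$animal == i)
  samp <- sample(1:min(10, length(rows)), 1)
  newVec <- rbind(newVec, xy[sample(rows, samp, replace = FALSE), ])
}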

Read csv with two headers into a data.frame

Apologies for the seemingly simple question, but I can't seem to find a solution to the following re-arrangement problem.
I'm used to using read.csv to read in files with a header row, but I have an excel spreadsheet with two 'header' rows - cell identifier (a, b, c ... g) and three sets of measurements (x, y and z; 1000s each) for each cell:
 a           b
 x  y  z    x  y  z
10  1  5   22  1  6
12  2  6   21  3  5
12  2  7   11  3  7
13  1  4   33  2  8
12  2  5   44  1  9
csv file below:
a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9
How can I get to a data.frame in R as shown below?
cell x y z
a 10 1 5
a 12 2 6
a 12 2 7
a 13 1 4
a 12 2 5
b 22 1 6
b 21 3 5
b 11 3 7
b 33 2 8
b 44 1 9
Use base R reshape():
temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
# time x y z id
# 1.0 0 10 1 5 1
# 2.0 0 12 2 6 2
# 3.0 0 12 2 7 3
# 4.0 0 13 1 4 4
# 5.0 0 12 2 5 5
# 1.1 1 22 1 6 1
# 2.1 1 21 3 5 2
# 3.1 1 11 3 7 3
# 4.1 1 33 2 8 4
# 5.1 1 44 1 9 5
Basically, you just skip the first row, where the letters a-g appear every third column. Since the repeated sub-column names are all the same, R automatically appends a grouping number (.1, .2, ...) to every repeated name after the first three columns; we only need to add a matching grouping suffix (.0) to the first three columns ourselves.
You can either then create an "id" variable, or, as I've done here, just use the row names for the IDs.
You can change the "time" variable to your "cell" variable as follows:
# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])
Then, drop the "time" column:
OUT$time = NULL
Update
To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:
temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE,
                  nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(OUT$time, labels=GROUPS)  # "time" is the column created by reshape()
OUT$time = NULL
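
For completeness, a hedged modern alternative using tidyr's pivot_longer(), assuming the same file layout (one group row, then repeated x/y/z headers):
library(dplyr)
library(tidyr)
temp <- read.csv("path/to/file.csv", skip = 1)       # columns: x, y, z, x.1, y.1, z.1
names(temp)[1:3] <- paste0(names(temp)[1:3], ".0")   # suffix the first group too
GROUPS <- read.csv("path/to/file.csv", header = FALSE,
                   nrows = 1, stringsAsFactors = FALSE)
GROUPS <- GROUPS[!is.na(GROUPS)]
temp %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = c(".value", "cell"), names_sep = "\\.") %>%
  mutate(cell = GROUPS[as.integer(cell) + 1])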
