How to get a cumulative sum on a large data frame in R

I have a large data frame (48 x 100). Is there an elegant function in R that transforms this data frame into a "custom data frame"?
a = c(2, 3, 5)
b = c(2, 3, 5)
c = c(2, 3, 5)
df <- rbind(a,b,c)
Now I want the cumulative sum (cumsum) of df, computed down each column.
I know it's easy to do with a loop, but is there a function?

Like this:
a <- c(2, 3, 5)
b <- c(2, 3, 5)
c <- c(2, 3, 5)
df <- data.frame(rbind(a,b,c))
df <- cumsum(df)
so this dataframe:
> df
X1 X2 X3
a 2 3 5
b 2 3 5
c 2 3 5
becomes this:
> cumsum(df)
X1 X2 X3
a 2 3 5
b 4 6 10
c 6 9 15
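One caveat worth sketching: the question's first snippet builds df with rbind() alone, which yields a matrix rather than a data frame, and cumsum() on a matrix returns a flat vector. A minimal sketch, assuming you want to stay with a matrix, applies cumsum per column with apply(); the names m, flat, col_cumsum and row_cumsum are illustrative only.

```r
a <- c(2, 3, 5)
b <- c(2, 3, 5)
c <- c(2, 3, 5)

m <- rbind(a, b, c)   # a 3 x 3 matrix, not a data frame

# cumsum() on a matrix drops the dimensions and returns a plain vector
flat <- cumsum(m)

# Cumulative sum down each column (same values as the data frame result above)
col_cumsum <- apply(m, 2, cumsum)

# Cumulative sum along each row: apply() returns each row's result as a column,
# so transpose to get back the original orientation
row_cumsum <- t(apply(m, 1, cumsum))
```

If the data is already a data frame, cumsum(df) as shown above is enough as long as all columns are numeric.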

Related

In R, How to make a function that finds if there is a matching pairs?

I want to make a function that detects whether there are matching pairs of numbers. I want to simulate x and y many times and count the matches that occur.
x<-sample(1:6,6)
y<-sample(1:6,6)
x;y
For example, I have x <- c(2, 5, 6, 4, 3, 1) and y <- c(2, 1, 6, 5, 4, 3). The numbers 2 and 6 match in position, so there are 2 pairs. If there is no match between x and y, the result should be 0. I can use sum(x==y) to count the pairs for a single x and y.
How can I make a function that finds the number of identical pairs for many x and y?
You can just use
f<-function(n,k) {
sapply(1:k, \(i) sum(sample(n) == sample(n)))
}
where k is the number of iterations and n is the range (in your case 6). The \(i) shorthand requires R 4.1 or later; on older versions write function(i) instead.
Example Usage:
f(n=6, k=100)
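As a rough sketch of the same idea split into two steps, the snippet below first counts the matches for the fixed x and y from the question and then repeats the count on freshly shuffled vectors. The helper name count_matches, the seed, and the 1000 repetitions are assumptions made for illustration.

```r
# Count positions where two equal-length vectors hold the same value
count_matches <- function(x, y) sum(x == y)

# The fixed example from the question: 2 and 6 match in position
x <- c(2, 5, 6, 4, 3, 1)
y <- c(2, 1, 6, 5, 4, 3)
fixed_example <- count_matches(x, y)

# Repeat the experiment on fresh random permutations of 1:6
set.seed(42)  # arbitrary seed, only to make the run reproducible
sims <- replicate(1000, count_matches(sample(6), sample(6)))

# Frequency of each match count across the simulations
table(sims)
```

Keeping the counting step in its own function makes it easy to reuse on real data as well as on simulated permutations.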
In base R the following function would do the trick. The length of the vector is given by the size argument, and the number of trials is given by n:
n_pairs <- function(size, n) {
colSums(replicate(n, sample(size)) == replicate(n, sample(size)))
}
So, for example we can see:
set.seed(1)
n_pairs(size = 6, n = 5)
#> [1] 2 0 1 1 1
hist(n_pairs(6, 100), breaks = 0:6)
mean(n_pairs(6, 1000))
#> [1] 1.013
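Note that rbinom offers a quick approximation with the same mean (size * 1/size = 1), but it does not give the same distribution: matches within a permutation are not independent (for example, exactly size - 1 matches can never occur), so treat this only as a rough stand-in: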
rbinom(n, size, 1/size)
Created on 2022-04-26 by the reprex package (v2.0.1)
Maybe this one (removed first answer):
x<- c(2, 5, 6, 4, 3, 1)
y<- c(2, 1, 6, 5, 4, 3)
lst = list(x,y)
pairs <- outer(lst,lst,Vectorize(function(x,y){x[x==y]}))
pairs[1,2]
[[1]]
[1] 2 6
A possible solution with the dplyr package:
library(tidyverse)
x <- c(2, 5, 6, 4, 3, 1)
y <- c(2, 1, 6, 5, 4, 3)
df <- tibble(x = x,
             y = y) %>%
  mutate(pair = case_when(x == y ~ "PAIR",
                          TRUE ~ "NOT"))
The dataset:
# A tibble: 6 x 3
x y pair
<dbl> <dbl> <chr>
1 2 2 PAIR
2 5 1 NOT
3 6 6 PAIR
4 4 5 NOT
5 3 4 NOT
6 1 3 NOT
Filtering:
df %>%
filter(pair == "PAIR")
Output:
# A tibble: 2 x 3
x y pair
<dbl> <dbl> <chr>
1 2 2 PAIR
2 6 6 PAIR
Will this give you what you want? Make a table out of the values that are paired.
table(x[x==y])
x <- sample(1:6,1000, TRUE)
y <- sample(1:6,1000, TRUE)
table(x[x==y])
# 1 2 3 4 5 6
# 37 26 32 28 30 33

How to arrange elements of a vector based on a square matrix

I have a vector that results from a square matrix as below
P = as.vector(matrix(c(1,2,3,4),nrow=2))
What would be the simplest way to arrange this vector so that I get the three columns below (each column is shown here as one comma-separated line)?
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4
1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4,1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4
I have been able to build the first 2 columns with:
library(tidyverse)
df <- expand.grid(rep(c(1, 2, 3, 4),2))
df1 <- df %>% arrange_all()
df = expand.grid(a = df1[,1], b = df1[,1])
df[,c(2,1)]
The last column should repeat the following block, as a whole, all the way down:
1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4
Does this work:
as.vector(apply(matrix(P, nrow = 2), 2, rep, 4))
[1] 1 2 1 2 1 2 1 2 3 4 3 4 3 4 3 4
paste(as.vector(apply(matrix(P, nrow = 2), 2, rep, 4)), collapse = ',')
[1] "1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4"
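The answer above produces one 16-value block of the third column. A minimal sketch of all three 64-row columns, assuming they follow the repetition pattern visible in the question, can be built with rep(); the names block, col1, col2, col3 and result are illustrative.

```r
P <- as.vector(matrix(c(1, 2, 3, 4), nrow = 2))

# Third column: each column of the 2 x 2 matrix repeated 4 times (16 values),
# then that whole 16-value block repeated 4 times (64 values)
block <- as.vector(apply(matrix(P, nrow = 2), 2, rep, 4))
col3 <- rep(block, times = 4)

# First column: every element held for 16 consecutive rows
col1 <- rep(P, each = 16)

# Second column: every element held for 2 rows, and that 8-value run repeated 8 times
col2 <- rep(rep(P, each = 2), times = 8)

result <- data.frame(col1, col2, col3)
```

The each and times arguments of rep() cover both kinds of repetition, so no sorting or expand.grid step is needed.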

Using a column as a column index to extract value from a data frame in R

I am trying to use the values in one column as column indices to extract values from a data frame. My problem is similar to a topic on R-bloggers, whose script I copy here:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)
However, instead of column names in choice, I have column index numbers, so my data frame looks like this:
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3),
stringsAsFactors = FALSE)
I tried using this solution:
df$newValue <-
df[cbind(
seq_len(nrow(df)),
match(df$choice, colnames(df))
)]
Instead of giving me an output that looks like this:
# x y choice newValue
# 1 1 5 1 1
# 2 2 6 2 6
# 3 3 7 1 3
# 4 4 8 3 NA
my newValue column returns all NAs:
# x y choice newValue
# 1 1 5 1 NA
# 2 2 6 2 NA
# 3 3 7 1 NA
# 4 4 8 3 NA
What should I modify in the code so that it would read my choice column as column index?
Since choice already holds the column numbers to extract, match is not needed here. However, the data frame also contains the choice column itself, which should not be used as a source of values, so we need to set the out-of-range indices to NA before subsetting the data frame.
mat <- cbind(seq_len(nrow(df)), df$choice)
mat[mat[, 2] > (ncol(df) -1), ] <- NA
df$newValue <- df[mat]
df
# x y choice newValue
#1 1 5 1 1
#2 2 6 2 6
#3 3 7 1 3
#4 4 8 3 NA
data
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c(1, 2, 1, 3))
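To make the reasoning concrete, the sketch below shows why the match-based attempt returns only NAs and then repeats the positional lookup in a slightly different form. The vector value_cols and the names idx_by_name, col_idx and picked are assumptions introduced for illustration, not part of the original answer.

```r
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c(1, 2, 1, 3))

# match() looks for the values 1, 2, 3 among the names "x", "y", "choice",
# finds none of them, and so returns NA for every row
idx_by_name <- match(df$choice, colnames(df))

# Treat choice as a position instead; positions that do not point at a
# value column are set to NA
value_cols <- c("x", "y")   # assumed set of valid source columns
col_idx <- ifelse(df$choice <= length(value_cols), df$choice, NA)

# Index the numeric part with a (row, column) matrix; a row of the index
# matrix that contains NA yields NA in the result
picked <- as.matrix(df[value_cols])[cbind(seq_len(nrow(df)), col_idx)]
```

Restricting the lookup to value_cols keeps the result independent of how many extra columns the data frame gains later.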

Row-wise sum for columns with certain names

I have a sample data:
SampleID a b d f ca k l cb
1 0.1 2 1 2 7 1 4 3
2 0.2 3 2 3 4 2 5 5
3 0.5 4 3 6 1 3 9 2
I need the row-wise sum of columns that have something in common in their names, e.g. the row-wise sum of a and ca, or of b and cb. The problem is that I have a large data.frame, so ideally I would specify only what the column headers have in common and the code would pick just those columns to sum.
Thank you in advance for any assistance.
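We can select the columns whose names contain 'a' with grep, subset those columns, and take rowSums; the same works for the 'b' columns. The SampleID name is dropped before matching (it also contains an 'a'), so 1 is added back to the matched positions: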
rowSums(df1[grep('a', names(df1)[-1])+1])
rowSums(df1[grep('b', names(df1)[-1])+1])
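For a large data frame, the same idea could be wrapped in a small helper that takes the common part of the column names as a pattern. This is only a sketch: the function name sum_matching, its exclude argument, and the choice of the anchored patterns "a$" and "b$" are assumptions made for illustration.

```r
df1 <- data.frame(SampleID = c(1, 2, 3),
                  a = c(0.1, 0.2, 0.5), b = c(2, 3, 4), d = c(1, 2, 3),
                  f = c(2, 3, 6), ca = c(7, 4, 1), k = c(1, 2, 3),
                  l = c(4, 5, 9), cb = c(3, 5, 2))

# Row-wise sum of all columns whose names match a regular expression,
# skipping any columns listed in `exclude`
sum_matching <- function(data, pattern, exclude = "SampleID") {
  cols <- setdiff(grep(pattern, names(data), value = TRUE), exclude)
  rowSums(data[cols])
}

sum_a <- sum_matching(df1, "a$")   # names ending in "a": a and ca
sum_b <- sum_matching(df1, "b$")   # names ending in "b": b and cb
```

Anchoring the pattern with $ avoids accidental matches on names that merely contain the letter somewhere in the middle.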
If you want the output as a data frame, try using dplyr
# Recreating your sample data
df <- data.frame(SampleID = c(1, 2, 3),
a = c(0.1, 0.2, 0.5),
b = c(2, 3, 4),
d = c(1, 2, 3),
f = c(2, 3, 6),
ca = c(7, 4, 1),
k = c(1, 2, 3),
l = c(4, 5, 9),
cb = c(3, 5, 2))
Process the data
# load dplyr
library(dplyr)
# Sum across columns 'a' and 'ca' (sum(a, ca))
df2 <- df %>%
select(contains('a'), -SampleID) %>% # 'select' function to choose the columns you want
mutate(row_sum = rowSums(.)) # 'mutate' function to create a new column 'row_sum' with the sum of the selected columns. You can drop the selected columns by using 'transmute' instead.
df2 # have a look
a ca row_sum
1 0.1 7 7.1
2 0.2 4 4.2
3 0.5 1 1.5

How can I plot a line/bar for following type of data?

b <- data.frame(head=c("a", "b", "c", "d", "e"),
ab=c(1, 2, 3, 4, 5), bc=c(4, 5, 6, 7, 8), ca=c(2, 3, 4, 5, 6))
and so on.
I want to make one plot per head value (5 individual plots in this case), e.g. a plot for a showing its values of ab, bc and ca, the same for b, and so on.
This would be easier if the table were transposed, but it is difficult with the data in this shape.
For example, if the data were arranged like this:
b <- data.frame(head=c("ab", "bc", "ca"),
a=c(1, 4, 2), b=c(2, 5, 3), c=c(3, 6, 4), d=c(4, 7, 5), e=c(5, 8, 6))
then it would be simple to plot a with the command barplot(b$a). But how can I make the same plot from the data in the original layout shown in the first snippet?
You could use reshape2 to transform the dataset b to your expected b
library(reshape2)
d1 <- dcast(melt(b,id.var="head"), variable~head, value.var="value")
d1
# variable a b c d e
#1 ab 1 2 3 4 5
#2 bc 4 5 6 7 8
#3 ca 2 3 4 5 6
Or in this case:
b1 <- t(b[,-1])
colnames(b1) <- b[,1]
b1
# a b c d e
#ab 1 2 3 4 5
#bc 4 5 6 7 8
#ca 2 3 4 5 6
If you want to plot 5 barplots on the same window:
library(ggplot2)
mb <- melt(b, id.var="head")
ggplot(mb, aes(head, value))+
geom_bar(aes(fill=variable), position="dodge", stat="identity") +
theme_bw()
If you need 5 individual bar plots using the original b dataset, you could try:
pdf("barplots.pdf")
apply(b[,-1], 1, function(x) barplot(x))
dev.off()
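A variation on the snippet above, sketched under the assumption that each plot should carry a title and bar labels, loops over the rows by name instead of using apply. The output file name barplots_titled.pdf and the object name vals are hypothetical.

```r
b <- data.frame(head = c("a", "b", "c", "d", "e"),
                ab = c(1, 2, 3, 4, 5), bc = c(4, 5, 6, 7, 8), ca = c(2, 3, 4, 5, 6))

# Numeric part as a matrix, with the head values as row names
vals <- as.matrix(b[, -1])
rownames(vals) <- as.character(b$head)

pdf("barplots_titled.pdf")   # hypothetical output file, one page per plot
for (h in rownames(vals)) {
  # vals[h, ] is a named vector (ab, bc, ca), so barplot() labels the bars itself
  barplot(vals[h, ], main = paste("head =", h), ylab = "value")
}
dev.off()
```

Looping over row names keeps the head value available for the title, which apply over rows does not pass to the plotting function.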
barplot can be used with the original b data frame:
barplot(as.matrix(b[,-1]), beside=TRUE, legend.text=b$head)
For the other grouping, transpose the data (as pointed out by @akrun):
barplot(t(as.matrix(b[,-1])), beside=TRUE, legend.text=names(b)[2:4], names.arg=b$head)
