I have a dataframe in wide format, and I want to subtract specific columns from different series of columns. Ideally I'd like the results to be in a new dataframe.
For example:
From this sample dataframe (dfOld), I would like to subtract column D from each of columns A, B and C, and column H from each of columns E, F and G. In the real dataset, this pattern keeps going and needs to be iterated.
Sample Data:
dfOld <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), A = c(2, 3, 4,5,4,6,7,1,9,12), B = c(3, 4, 5,2,4,5,1,7,0,8), C = c(5, 6, 7,2,4,1,5,4,6,13), D = c(68, 7, 8,2,1,5,7,9,78,7), E = c(2, 3, 42,5,4,6,7,1,9,12), F = c(37, 4, 5,2,48,5,1,7,60,8), G = c(5, 6, 7,2,4,1,5,4,6,13), H = c(35, 7, 8,2,1,5,7,9,78,7))
The results would ideally be in a new dataframe whose columns hold the differences and are named A-D, B-D, C-D, E-H, F-H and G-H.
In Excel, the formula would be "=B2-$E2" dragged down the rows and across 3 columns, then repeated for "=F2-$I2" etc., using the "$" sign to lock the column.
In R, I've only been able to do this manually, kind of like the answer previously posted for a similar question (Subtracting two columns to give a new column in R)
dfOld$`A-D` <- dfOld$A - dfOld$D
dfOld$`B-D` <- dfOld$B - dfOld$D
dfOld$`C-D` <- dfOld$C - dfOld$D
dfOld$`E-H` <- dfOld$E - dfOld$H
dfOld$`F-H` <- dfOld$F - dfOld$H
dfOld$`G-H` <- dfOld$G - dfOld$H
And then separated the new columns out into a new dataset.
However, this obviously isn't scalable for my much larger dataset, and I'd really like to learn how else to do this kind of operation that's so easy in Excel (although still not scalable for large datasets).
Part of the answer may already be here: Subtract a column in a dataframe from many columns in R
But this answer (and several other similar ones) changes the values in the same dataframe, and the columns keep the same names.
I haven't been able to adapt it so that the new values go into new columns with new names (and ideally in a new dataframe).
Another part of the answer may be here:
Iterative function to subtract columns from a specific column in a dataframe and have the values appear in a new column
These answers put the subtracted results in new columns with new names, but every column in the dataframe has the same column subtracted from it (A, B, C, D, E, F, G, H each minus C). I can't seem to adapt it so that it works over specific series of columns (A, B, C each minus D, then E, F, G each minus H, etc.).
Thanks in advance for your help.
Probably others have better ways - but here is one possibility.
Load two libraries and convert dfOld to a data.table:
library(data.table)
library(magrittr)
setDT(dfOld)
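Get the column groups and store them as a list:
# Columns (ID excluded) whose position is not a multiple of 4: the ones to subtract from
lv = names(dfOld)[-1][seq(1,ncol(dfOld)-1)%%4>0]
# Group them in threes
lv = split(lv, ceiling(seq_along(lv)/3))
# Name each group after every 4th column: the one being subtracted
names(lv) = names(dfOld)[-1][seq(1,ncol(dfOld)-1)%%4==0]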
lv looks like this:
> lv
$D
[1] "A" "B" "C"
$H
[1] "E" "F" "G"
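The modulo indexing above relies on every block being three value columns followed by one reference column. As a rough sketch under that same assumption, the mapping can also be built block by block. The names cols, block_size and blocks are made up here for illustration and are not part of the answer:

```r
# Hypothetical names; assumes blocks of 3 value columns + 1 reference column
cols <- c("A", "B", "C", "D", "E", "F", "G", "H")   # names(dfOld)[-1] in the example
block_size <- 4

# Split the column names into consecutive blocks of four
blocks <- split(cols, ceiling(seq_along(cols) / block_size))

# In each block the last name is the reference column,
# the remaining names are the columns it gets subtracted from
lv <- lapply(blocks, function(b) head(b, -1))
names(lv) <- vapply(blocks, function(b) tail(b, 1), character(1), USE.NAMES = FALSE)

str(lv)
```

If the real data has irregular blocks, writing the list out by hand (list(D = c("A", "B", "C"), ...)) is probably safer than any positional rule.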
This is a bit convoluted, but basically: for each element of the lv list, I melt the relevant columns of dfOld into long format so that all subtractions for that group happen at once. Then I keep only the variables I need, and bind the resulting list of data.tables into a single data.table using rbindlist.
res = rbindlist(lapply(names(lv), function(x) {
  melt(dfOld, id = c("ID", x), measure.vars = lv[[x]]) %>%
    .[, `:=`(nc = value - get(x), variable = paste0(variable, "-", x))] %>%
    .[, .(ID, variable, nc)]
}))
The last step is simple: just dcast back to wide format.
dcast(res,ID~variable, value.var="nc")
Output
    ID A-D B-D C-D E-H F-H G-H
 1:  1 -66 -65 -63 -33   2 -30
 2:  2  -4  -3  -1  -4  -3  -1
 3:  3  -4  -3  -1  34  -3  -1
 4:  4   3   0   0   3   0   0
 5:  5   3   3   3   3  47   3
 6:  6   1   0  -4   1   0  -4
 7:  7   0  -6  -2   0  -6  -2
 8:  8  -8  -2  -5  -8  -2  -5
 9:  9 -69 -78 -72 -69 -18 -72
10: 10   5   1   6   5   1   6
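For comparison, the same list-of-groups idea (lv above) can probably be used without reshaping at all. This is only a base R sketch, not the approach of the answer; groups, diffs and dfNew are names chosen here for illustration:

```r
# Sample data from the question
dfOld <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  A = c(2, 3, 4, 5, 4, 6, 7, 1, 9, 12),
  B = c(3, 4, 5, 2, 4, 5, 1, 7, 0, 8),
  C = c(5, 6, 7, 2, 4, 1, 5, 4, 6, 13),
  D = c(68, 7, 8, 2, 1, 5, 7, 9, 78, 7),
  E = c(2, 3, 42, 5, 4, 6, 7, 1, 9, 12),
  F = c(37, 4, 5, 2, 48, 5, 1, 7, 60, 8),
  G = c(5, 6, 7, 2, 4, 1, 5, 4, 6, 13),
  H = c(35, 7, 8, 2, 1, 5, 7, 9, 78, 7)
)

# Reference column -> columns it is subtracted from (same shape as lv)
groups <- list(D = c("A", "B", "C"), H = c("E", "F", "G"))

# For each group, subtract the reference column from every listed column
diffs <- lapply(names(groups), function(ref) {
  out <- dfOld[groups[[ref]]] - dfOld[[ref]]
  names(out) <- paste0(groups[[ref]], "-", ref)
  out
})

# check.names = FALSE keeps names such as "A-D" instead of converting them to "A.D"
dfNew <- data.frame(ID = dfOld$ID, do.call(cbind, diffs), check.names = FALSE)
dfNew
```

Because the result is assembled from scratch, dfOld is left untouched, which is what the question asked for.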
First, I create a function that does the simple calculation, taking the dataframe and the two column names as inputs. Then I use purrr's map2 to pass the function (replicated once per subtraction needed, which in this case is 6) together with the list of parameters for each column pair, and invoke applies the function to each set of parameters. This leaves a list of dataframes (each one holding ID and a single result column). Finally, I use reduce to join them back into one dataframe and update the column names.
library(tidyverse)
subtract <- function(x, a, b){
  x %>%
    mutate(!! a := !!rlang::parse_expr(a) - !!rlang::parse_expr(b)) %>%
    dplyr::select(ID, which(colnames(x)==a))
}
col_names <- c("ID", "A-D", "B-D", "C-D", "E-H", "F-H", "G-H")
map2(
  flatten(list(rep(list(
    subtract
  ), 6))),
  list(
    expression(a = "A", b = "D"),
    expression(a = "B", b = "D"),
    expression(a = "C", b = "D"),
    expression(a = "E", b = "H"),
    expression(a = "F", b = "H"),
    expression(a = "G", b = "H")
  ),
  ~ invoke(.x, c(list(dfOld), as.list(.y)))
) %>%
  reduce(left_join, by = "ID") %>%
  set_names(col_names)
Output
   ID A-D B-D C-D E-H F-H G-H
1   1 -66 -65 -63 -33   2 -30
2   2  -4  -3  -1  -4  -3  -1
3   3  -4  -3  -1  34  -3  -1
4   4   3   0   0   3   0   0
5   5   3   3   3   3  47   3
6   6   1   0  -4   1   0  -4
7   7   0  -6  -2   0  -6  -2
8   8  -8  -2  -5  -8  -2  -5
9   9 -69 -78 -72 -69 -18 -72
10 10   5   1   6   5   1   6
Data
dfOld <- structure(
list(
ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
A = c(2,
3, 4, 5, 4, 6, 7, 1, 9, 12),
B = c(3, 4, 5, 2, 4, 5, 1, 7, 0,
8),
C = c(5, 6, 7, 2, 4, 1, 5, 4, 6, 13),
D = c(68, 7, 8, 2,
1, 5, 7, 9, 78, 7),
E = c(2, 3, 42, 5, 4, 6, 7, 1, 9, 12),
F = c(37,
4, 5, 2, 48, 5, 1, 7, 60, 8),
G = c(5, 6, 7, 2, 4, 1, 5, 4, 6,
13),
H = c(35, 7, 8, 2, 1, 5, 7, 9, 78, 7)
),
class = "data.frame",
row.names = c(NA,-10L)
)
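If dplyr 1.0.0 or later is available, across() with its .names argument might get to the same result with less machinery. This is a sketch of a different technique from the map2/invoke approach above, and it still spells out each group by hand, so it only suits a small number of groups:

```r
library(dplyr)

# Sample data from the question
dfOld <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  A = c(2, 3, 4, 5, 4, 6, 7, 1, 9, 12),
  B = c(3, 4, 5, 2, 4, 5, 1, 7, 0, 8),
  C = c(5, 6, 7, 2, 4, 1, 5, 4, 6, 13),
  D = c(68, 7, 8, 2, 1, 5, 7, 9, 78, 7),
  E = c(2, 3, 42, 5, 4, 6, 7, 1, 9, 12),
  F = c(37, 4, 5, 2, 48, 5, 1, 7, 60, 8),
  G = c(5, 6, 7, 2, 4, 1, 5, 4, 6, 13),
  H = c(35, 7, 8, 2, 1, 5, 7, 9, 78, 7)
)

# transmute() keeps only the columns named here, so dfOld itself is not modified.
# "{.col}" in .names is replaced by the name of the column being processed.
dfNew <- dfOld %>%
  transmute(
    ID,
    across(c(A, B, C), ~ .x - D, .names = "{.col}-D"),
    across(c(E, F, G), ~ .x - H, .names = "{.col}-H")
  )
dfNew
```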
This question follows on from a previous one. There, it was suggested to use rollapply to calculate the sum of the 1st, 2nd and 3rd entries of a vector; then the 2nd, 3rd and 4th; and so on.
My question is how to calculate the sum of the 1st, 2nd and 3rd; then the 4th, 5th and 6th. That is, rolling without overlapping. Can this be easily done, please?
Same idea. You just need to specify the by argument. Default is 1.
x <-c(1, 5, 4, 5, 7, 8, 9, 2, 1)
zoo::rollapply(x, 3, by = 3, sum)
#[1] 10 20 12
#or another Base R option
sapply(split(x, ceiling(seq_along(x)/3)), sum)
# 1 2 3
#10 20 12
Using tapply in base R:
set.seed(1)
vec <- sample(10, 20, replace = TRUE)
#[1] 3 4 6 10 3 9 10 7 7 1 3 2 7 4 8 5 8 10 4 8
# (drawn under R < 3.6.0; later versions give different values for the same seed)
unname(tapply(vec, (seq_along(vec)-1) %/% 3, sum))
# [1] 13 22 24 6 19 23 12
Alternatively,
colSums(matrix(vec[1:(ceiling(length(vec)/3)*3)], nrow = 3), na.rm = TRUE)
#[1] 13 22 24 6 19 23 12
vec[1:(ceiling(length(vec)/3)*3)] fills in the vector with NA if the length is not divisible by 3. Then, you simply ignore NAs in colSums.
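To make the padding step visible, here is a small sketch that spells it out. It types in the vec values printed above so that it does not depend on the random seed; n, padded_len, padded and sums are names introduced only for this illustration:

```r
# Values as printed above (typed in directly so the seed does not matter)
vec <- c(3, 4, 6, 10, 3, 9, 10, 7, 7, 1, 3, 2, 7, 4, 8, 5, 8, 10, 4, 8)
n <- 3

# Round the length up to the next multiple of n: 20 -> 21
padded_len <- ceiling(length(vec) / n) * n

# Indexing past the end of a vector returns NA, which pads the last window
padded <- vec[seq_len(padded_len)]
tail(padded, 3)

# One window per column; na.rm = TRUE ignores the padding
sums <- colSums(matrix(padded, nrow = n), na.rm = TRUE)
sums
```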
Yet another one using cut and aggregate:
x <- ceiling(length(vec)/3)*3
df <- data.frame(vec=vec[1:x], col=cut(1:x, breaks = seq(0,x,3)))
aggregate(vec~col, df, sum, na.rm = TRUE)[[2]]
#[1] 13 22 24 6 19 23 12
We can use roll_sum from RcppRoll, which is implemented in C++. Note that this computes every overlapping window first and then keeps every third result:
library(RcppRoll)
roll_sum(x, n=3)[c(TRUE, FALSE, FALSE)]
#[1] 10 20 12
data
x <-c(1, 5, 4, 5, 7, 8, 9, 2, 1)
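The keep-every-third-window trick does not depend on RcppRoll. As a rough base R sketch, swapping in embed() for roll_sum() (so this is not the answer's method), the same steps look like this:

```r
x <- c(1, 5, 4, 5, 7, 8, 9, 2, 1)
n <- 3

# embed() puts one overlapping window of length n on each row
windows <- embed(x, n)

# Sum of every overlapping window: 7 values for 9 inputs
rolling <- rowSums(windows)

# The logical index is recycled, so this keeps windows 1, 4, 7, ...
non_overlapping <- rolling[c(TRUE, FALSE, FALSE)]
non_overlapping
```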
You can define the window size and do:
x <-c(1, 5, 4, 5, 7, 8, 9, 2, 1)
n <- 3
diff(c(0, cumsum(x)[slice.index(x, 1)%%n == 0]))
P.S. This uses the input from the answer by @Sotos.
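Broken into steps, the idea seems to be: take the running total, keep it only at the end of each block, and difference those totals. A sketch, using seq_along(x) in place of slice.index(x, 1) (an assumption made here for readability; running, ends and block_sums are illustrative names):

```r
x <- c(1, 5, 4, 5, 7, 8, 9, 2, 1)
n <- 3

# Running total of the whole vector
running <- cumsum(x)

# Keep the running total at positions 3, 6, 9, ... (the end of each block)
ends <- running[seq_along(x) %% n == 0]

# Differences between consecutive block ends are the block sums
block_sums <- diff(c(0, ends))
block_sums
```

One thing to keep in mind: if the length of x is not a multiple of n, the trailing incomplete block is dropped rather than summed.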
I want to count the number of zeros in each column of an R data frame and express it as a percentage. This percentage should be added as the last row of the original data frame.
example
x <- c(0, 4, 6, 0, 10)
y <- c(3, 0, 9, 12, 15)
z <- c(3, 6, 9, 0, 15)
data_a <- cbind(x,y,z)
I want to see the zeros in each column expressed as a percentage.
Thanks
x <- c(0, 4, 6, 0, 10)
y <- c(3, 0, 9, 12, 15)
z <- c(3, 6, 9, 0, 15)
data_a <- cbind(x,y,z)
#This is a matrix not a data.frame.
res <- colSums(data_a==0)/nrow(data_a)*100
If you must, rbind to the matrix (usually not really a good idea).
rbind(data_a, res)
# x y z
# 0 3 3
# 4 0 6
# 6 9 9
# 0 12 0
# 10 15 15
# res 40 20 20
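If the data really is a data frame rather than a matrix, a similar sketch should work. Here data_a is rebuilt with data.frame() instead of cbind(), and pct_zero and out are names made up for the example:

```r
# Same values as above, but stored as a data frame
data_a <- data.frame(
  x = c(0, 4, 6, 0, 10),
  y = c(3, 0, 9, 12, 15),
  z = c(3, 6, 9, 0, 15)
)

# The mean of a logical column is the proportion of TRUEs
pct_zero <- colMeans(data_a == 0) * 100

# Append the percentages as an extra row and label that row
out <- rbind(data_a, pct_zero)
rownames(out)[nrow(out)] <- "pct_zero"
out
```

The same caveat applies: mixing a summary row into the data makes later calculations on the columns easy to get wrong.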
Here is one more method using lapply. It works on a data frame rather than a matrix, so data_a is converted first; multiplying by 100 gives a percentage.
lapply(as.data.frame(data_a), function(x){ length(which(x==0))/length(x)*100})
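A combination of prop.table and some *apply work can give you the same answer as @Roland's, provided every column contains at least one zero and no negative values (the code picks the first entry of each table, which is the share of the column's smallest value):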
> prop <- apply(data_a, 2, function(x) prop.table(table(x))*100)
> rbind(data_a, sapply(prop, "[", 1))
x y z
[1,] 0 3 3
[2,] 4 0 6
[3,] 6 9 9
[4,] 0 12 0
[5,] 10 15 15
[6,] 40 20 20
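Picking the first table entry by position returns the share of the smallest value in the column, so it is only the zero share when a zero is actually present. A hedged variation that looks the entry up by name instead, using a made-up matrix data_b whose y column has no zeros:

```r
# Hypothetical example data: column y contains no zeros
data_b <- cbind(x = c(0, 4, 6, 0, 10), y = c(3, 1, 9, 12, 15))

pct_zero <- apply(data_b, 2, function(col) {
  p <- prop.table(table(col)) * 100
  # Look up the "0" entry by name; report 0 when the column has no zeros
  if ("0" %in% names(p)) as.numeric(p["0"]) else 0
})
pct_zero
```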
This is probably inelegant, but this is how I went about it when my columns had NAs:
#Returns the number of zeroes in each column
numZero <- colSums(vars == 0, na.rm = TRUE)
#Returns the number of NA entries in each column
numNA <- colSums(is.na(vars))
#Returns total sample size
numSamp <- rep(nrow(vars), ncol(vars))
#Combine the three
varCheck <- as.data.frame(cbind(numZero, numNA, numSamp))
#Number of non-NA observations for that variable
varCheck$numTotal <- varCheck$numSamp - varCheck$numNA
#Proportion of zeroes among the non-NA observations
varCheck$pctZero <- varCheck$numZero / varCheck$numTotal
#Check which variables are more than 99% zeroes
varCheck[which(varCheck$pctZero > 0.99),]
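For just the proportion, the bookkeeping above can probably be collapsed into a single colMeans call, since na.rm = TRUE already divides by the number of non-NA entries. A sketch with made-up data (this vars is a hypothetical stand-in for the real one):

```r
# Hypothetical data with missing values
vars <- data.frame(
  a = c(0, 0, NA, 5),
  b = c(1, NA, NA, 0),
  c = c(0, 0, 0, 0)
)

# Proportion of zeroes among the non-NA entries of each column
pctZero <- colMeans(vars == 0, na.rm = TRUE)
pctZero

# Columns that are more than 99% zeroes
names(pctZero)[pctZero > 0.99]
```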
Say I have a matrix of 100 rows x 5 columns containing numbers between 0 and 100, for instance:
1 5 10 15 3
2 15 3 8 27
1 22 34 45 35
28 27 32 3 8
......
I would like to find repeated "patterns" of numbers (mainly couples or triplets).
So in my example I would have the couple 3,15 appearing twice and the triplet 3, 8, 27 also appearing twice (I don't care about the order).
How would you implement that in R?
I would like to have couples and triplets separately and have their count.
thanks
nico
Here is one way. For each row of your 100-row matrix, you find all pairs/triples of numbers (using combn) and do a frequency count (using table) of the pairs/triples. The pasteSort function I defined creates a string out of a vector after sorting it, so the order within a pair or triple does not matter. We apply this function to each pair/triple in each row, and collect all pairs/triples from the matrix before doing the frequency count. Note that if a pair repeats on the same row, it's counted as a "repeat".
> mtx <- matrix( c(1,5,10,15,3,
2, 15, 3, 8, 27,
1, 22, 34, 45, 35,
28, 27, 32, 3, 8), ncol=5, byrow=TRUE)
> pasteSort <- function( x ) do.call(paste, as.list( sort( x) ) )
> pairs <- c( apply(mtx, 1, function(row) apply( combn(row, 2), 2, pasteSort)) )
> pairFreqs <- table(pairs)
> pairFreqs[ pairFreqs > 1 ]
3 15 3 27  3 8 8 27
   2    2    2    2
> triples <- c( apply(mtx, 1, function(row) apply( combn(row, 3), 2, pasteSort)) )
> tripleFreqs <- table( triples )
> tripleFreqs[ tripleFreqs > 1 ]
3 8 27
     2
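Since the pair and triple code differ only in the subset size, the two steps could be folded into one helper. This is a sketch along the same lines as the answer; count_combos is a hypothetical name, and it uses combn's FUN argument instead of a second apply:

```r
mtx <- matrix(c( 1,  5, 10, 15,  3,
                 2, 15,  3,  8, 27,
                 1, 22, 34, 45, 35,
                28, 27, 32,  3,  8), ncol = 5, byrow = TRUE)

# Hypothetical helper: count every k-subset of each row, ignoring order
count_combos <- function(mtx, k) {
  keys <- unlist(lapply(seq_len(nrow(mtx)), function(i) {
    # Sort each subset before pasting so "15 3" and "3 15" give the same key
    combn(mtx[i, ], k, FUN = function(v) paste(sort(v), collapse = " "))
  }))
  freqs <- table(keys)
  freqs[freqs > 1]   # keep only the subsets seen more than once
}

pairFreqs <- count_combos(mtx, 2)
tripleFreqs <- count_combos(mtx, 3)
pairFreqs
tripleFreqs
```

As in the answer, a subset that occurs twice within one row would also be counted as repeated.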