Take sum of rows for every 3 columns in a dataframe - R

I have searched high and low and tried multiple options, but did not get the desired output described below.
I have a dataframe df3 with dates as headers and values between 0 and 1, built as shown below:
df = data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) = c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 = data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 = cbind(df2,df)
Now I need df4, in which the sum of each group of 3 consecutive columns forms one column, repeated dynamically across the remaining columns.
df4
Options I tried:
a) rbind.data.frame(apply(matrix(df3, nrow = n - 1), 1, sum))
b) col_list <- list(c("1/1/2018","1/2/2018","1/3/2018"), c("1/4/2018","1/5/2018","1/6/2018"))
   lapply(col_list, function(x) sum(df3[, x])) %>% data.frame

One way would be to split df3 every 3 columns using split.default. To split the data we generate a grouping sequence with rep, then take rowSums of each sub-dataframe, and finally cbind the results together.
cbind(df3[1], sapply(split.default(df3[-1],
      rep(1:ncol(df3), each = 3, length.out = ncol(df3) - 1)), rowSums))
# CUST_ID 1 2
#1 A 1 1
#2 B 2 0
#3 C 2 1
#4 D 1 1
#5 E 2 2
#6 F 2 2
FYI, the sequence generated from rep is
rep(1:ncol(df3), each = 3, length.out = ncol(df3) - 1)
#[1] 1 1 1 2 2 2
This makes it possible to split every 3 columns.
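To see what the split produces before the rowSums step, you can inspect the intermediate list (a quick look; the exact values depend on the random data):
split.default(df3[-1], rep(1:2, each = 3))
# a named list of two data.frames: $`1` holds columns 1/1/2018 to 1/3/2018,
# $`2` holds columns 1/4/2018 to 1/6/2018; sapply(..., rowSums) then
# collapses each 6-row block into a single summed column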
The results are different because OP used sample without set.seed.
If rep seems too long then we can generate the same sequence of columns using gl
gl(ncol(df3[-1])/3, 3)
#[1] 1 1 1 2 2 2
#Levels: 1 2
So the final code would be
cbind(df3[1], sapply(split.default(df3[-1], gl(ncol(df3[-1])/3, 3)), rowSums))
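As a quick sanity check that the grouping scales, with nine date columns gl would produce three groups of three:
gl(9/3, 3)
#[1] 1 1 1 2 2 2 3 3 3
#Levels: 1 2 3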

We can use seq to create an index, get the subsets of columns in a list, Reduce each by taking the sum, and create the new columns:
df4 <- df3[1]
df4[paste0('col', c('123', '456'))] <- lapply(seq(2, ncol(df3), by = 3),
    function(i) Reduce(`+`, df3[i:min(i + 2, ncol(df3))]))
df4
# CUST_ID col123 col456
#1 A 2 2
#2 B 3 3
#3 C 1 3
#4 D 2 3
#5 E 2 1
#6 F 0 1
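For intuition, Reduce(`+`, ...) adds the columns element-wise, which is the same arithmetic rowSums performs on a numeric block; a quick check, reusing df3:
all.equal(Reduce(`+`, df3[2:4]), unname(rowSums(df3[2:4])))
#[1] TRUE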
data
set.seed(123)
df <- data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) <- c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 <- data.frame(c("A","B","C","D","E","F"))
colnames(df2) <- c("CUST_ID")
df3 <- cbind(df2, df)

Related

Is there a way to count values by presence per rows in R?

I want a way to count values in a dataframe based on their presence by row
a = data.frame(c('a','b','c','d','f'),
               c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first row and in the third row, totaling two appearances. I've written code that counts a given value when its presence in a row is TRUE, but I want it to do this automatically for all the values present in the dataframe:
# count the value "a" and attribute the count to the b dataframe
b = data.frame(unique(unlist(a)))
b$count = 0
for (i in 1:nrow(a)) {
  if (TRUE %in% apply(a[i, ], 2, function(x) x %in% 'a')) {
    b$count[1] = b$count[1] + 1
  }
}
b$count[1]
[1] 2
The problem is that I would have to do this manually for every value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values separately from each column, unlisting to a vector, and getting the frequency count with table. If needed, convert the table object to a two-column data.frame with stack:
stack(table(unlist(lapply(a, unique))))[2:1]
Output:
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
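On this data the column-wise and row-wise versions happen to agree; a quick check (a sketch, using c() to drop the tables' extra attributes so only names and counts are compared):
identical(c(table(unlist(lapply(a, unique)))),
          c(table(unlist(apply(a, 1, unique)))))
#[1] TRUE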
Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d

Add new variable to specific position in dataframe without specifying a numbered position

In Stata, I can create a variable after or before another one. E.g. gen age=., after(sex)
I would like to do the same in R. Is it possible?
My database has 300 variables, so I don't want to count them to find the numbered position, and the position may also change from time to time.
You could do:
library(tibble)
data <- data.frame(a = c(1,2,3), b = c(1,2,3), c = c(1,2,3))
add_column(data, d = "", .after = "b")
#   a b d c
# 1 1 1   1
# 2 2 2   2
# 3 3 3   3
Or another way could be:
data.frame(append(data, list(d = ""), after = match("b", names(data))))
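If you are already using the tidyverse, dplyr (1.0.0 or later) also has relocate for moving an existing column; a minimal sketch, assuming the column d has already been added:
library(dplyr)
data$d <- ""
data <- relocate(data, d, .after = b)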
First add the new column to the end of your data frame. Then find the index of the column after which you want the new column to appear, and reorder the columns to splice it in:
df$new_col <- ...
index <- match("col_before", names(df))
df <- df[, c(names(df)[1:index], "new_col", names(df)[(index + 1):(ncol(df) - 1)])]
Sample:
df <- data.frame(v1=c(1:3), v2=c(4:6), v3=c(7:9))
df$new_col <- c(7,7,7)
index <- match("v2", names(df))
df <- df[, c(names(df)[1:index], "new_col", names(df)[(index + 1):(ncol(df) - 1)])]
df
v1 v2 new_col v3
1 1 4 7 7
2 2 5 7 8
3 3 6 7 9
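The reordering logic is easy to wrap in a small helper so you don't repeat it; a sketch (insert_after is a hypothetical name, not from any package):
# Hypothetical helper: add `values` as column `new_name`,
# placed immediately after the column named `after`.
insert_after <- function(df, new_name, values, after) {
  df[[new_name]] <- values
  idx <- match(after, names(df))
  df[, append(names(df)[-ncol(df)], new_name, after = idx)]
}
insert_after(data.frame(v1 = 1:3, v2 = 4:6, v3 = 7:9),
             "new_col", c(7, 7, 7), "v2")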

Counting the Number of Dates When Larger Than Another Date - R [duplicate]

This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
I have no idea where to start with this, but what I am trying to do is create a new column whose value is the number of times that row's key appears in the key column.
For example
# Existing Data
key newcol
a ?
a ?
a ?
b ?
b ?
c ?
c ?
c ?
Would like the output to look like
key newcol
a 3
a 3
a 3
b 2
b 2
c 3
c 3
c 3
Thanks!
This can be achieved with the doBy package like so:
require(doBy)
#original data frame
df <- data.frame(key = c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'))
#add counter
df$count <- 1
#use summaryBy to count number of instances of key
counts <- summaryBy(count ~ key, data = df, FUN = sum, var.names = 'newcol', keep.names = TRUE)
#merge counts into original data frame
df <- merge(df, counts, by = 'key', all.x = TRUE)
df then looks like:
> df
key count newcol
1 a 1 3
2 a 1 3
3 a 1 3
4 b 1 2
5 b 1 2
6 c 1 3
7 c 1 3
8 c 1 3
If key is a vector like this key <- rep(c("a", "b", "c"), c(3,2,3)), then you can get what you want by using table to count occurrences of key elements
> N <- table(key)
> data.frame(key, newcol=rep(N,N))
key newcol
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
6 c 3
7 c 3
8 c 3
On the other hand, if key is a data.frame, then...
key.df <- data.frame(key = rep(letters[1:3], c(3, 2, 3)))
N <- table(key.df$key)
data.frame(key=key.df, newcol=rep(N, N))
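If the goal is simply to add the group count to the original data frame without a merge, ave is a compact base R option (a sketch, using the df from the doBy answer):
df$newcol <- ave(seq_along(df$key), df$key, FUN = length)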

Why are there differences in using merge and %in%?

I have two datasets that I'd like to merge via two identifying variables (up and ver_u):
df1 looks like this:
up ver_u
257001 1
1010 1
101010 1
100316 1
df2 looks like this:
up ver_u code_uc quantity
500116 1 395884 1
100116 1 36761 2
160116 1 81308 3
100116 1 76146 1
113216 1 6338 1
101116 1 33887 1
What I would like to do is take the subset of df2 whose up and ver_u match those in df1. I did this in two different ways and got different answers.
First method:
pur <- merge(df2, df1,by=c("up","ver_u"))
Second method:
test <- df2[(df2$up %in% df1$up) & (df2$ver_u %in% df1$ver_u),]
They give different numbers of observations and I don't see why.
When I used merge on the dataframe test with the following code, I got the same number of observations, but the two resulting dataframes were still different.
pur1 = merge(test, df1,by=c("up","ver_u"))
Is there some systematic differences of using merge and %in%?
Would greatly appreciate any insight on this.
Because merge matches both columns jointly, row by row, while %in% checks each column independently against all values of the other. Example:
#dummy data
df1 <- data.frame(x = c(1,2,3),
y = c(2,3,4))
df1
# x y
# 1 1 2
# 2 2 3
# 3 3 4
df2 <- data.frame(x = c(2,3,1,3),
y = c(3,1,4,1))
df2
# x y
# 1 2 3
# 2 3 1
# 3 1 4
# 4 3 1
# using merge
merge(df1, df2, by = c("x", "y"))
# x y
# 1 2 3
# using %in%
df1[(df1$x %in% df2$x) & (df1$y %in% df2$y), ]
# x y
# 2 2 3
# 3 3 4
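If you want %in%-style subsetting that respects the pairing of the two columns, one way is to compare composite keys, for example with interaction (a sketch; for real data, paste the columns with an unambiguous separator to avoid collisions):
df1[interaction(df1$x, df1$y) %in% interaction(df2$x, df2$y), ]
# x y
# 2 2 3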

Filtering a R DataFrame with repeated values in columns

I have an R DataFrame and I want to make another DF from it, but only with the values which appear more than X times in a given column.
>DataFrame
Value Column
1 a
4 a
2 b
6 c
3 c
4 c
9 a
1 d
For example, I want a new DataFrame only with the values in Column which appear more than 2 times, to get something like this:
>NewDataFrame
Value Column
1 a
4 a
6 c
3 c
4 c
9 a
Thank you very much for your time.
We can use table to get the count of values in 'Column' and subset the dataset ('DataFrame') based on the names in 'tbl' that have a count greater than 'n':
n <- 2
tbl <- table(DataFrame$Column) > n
NewDataFrame <- subset(DataFrame, Column %in% names(tbl)[tbl])
# Value Column
#1 1 a
#2 4 a
#4 6 c
#5 3 c
#6 4 c
#7 9 a
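For reference, tbl is a named logical vector marking which levels pass the threshold, and names(tbl)[tbl] extracts the qualifying ones:
tbl
#    a     b     c     d
# TRUE FALSE  TRUE FALSE
names(tbl)[tbl]
#[1] "a" "c"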
Or using ave from base R (counting with seq_along so the comparison is numeric):
NewDataFrame <- DataFrame[with(DataFrame, ave(seq_along(Column), Column, FUN = length) > n), ]
Or using data.table
library(data.table)
NewDataFrame <- setDT(DataFrame)[, .SD[.N > n], by = Column]
Or
NewDataFrame <- setDT(DataFrame)[, if(.N > n) .SD, by = Column]
Or with dplyr:
library(dplyr)
NewDataFrame <- DataFrame %>%
  group_by(Column) %>%
  filter(n() > 2) %>%
  ungroup()
