Sum the values of groups of 4 contiguous columns in R

Starting from a table of 372 columns and 12,000 rows in R, I need to create a new table in which each column holds the row-wise sum of a block of four columns of the original table: columns 1:4, then 5:8, then 9:12, and so on up to column 372. Here is a short example:
Input:
m = structure(c(3L, 1L, 2L, 6L, 3L, 1L, 1L, 8L, 1L, 5L, 2L, 1L, 3L, 7L,
1L, 1L), .Dim = c(2L, 8L), .Dimnames = list(c("r1", "r2"), c("a", "b",
"c", "d", "e", "f", "g", "h")))
Which looks like this:
a b c d e f g h
r1 3 2 3 1 1 2 3 1
r2 1 6 1 8 5 1 7 1
Expected output:
A B
r1 9 7
r2 16 14
So, A = a+b+c+d, and B = e+f+g+h. Easy to do with a small table in Excel. Columns a-d correspond to one group, e-h to another, if that helps.

The question is currently underspecified, but supposing you have a matrix...
m = structure(c(3L, 1L, 2L, 6L, 3L, 1L, 1L, 8L, 1L, 5L, 2L, 1L, 3L,
7L, 1L, 1L), .Dim = c(2L, 8L), .Dimnames = list(c("r1", "r2"),
c("a", "b", "c", "d", "e", "f", "g", "h")))
Make your column mapping:
map = data.frame(old = colnames(m), new = rep(LETTERS, each=4, length.out=ncol(m)))
old new
1 a A
2 b A
3 c A
4 d A
5 e B
6 f B
7 g B
8 h B
And then rowsum by it:
res = rowsum(t(m), map$new)
r1 r2
A 9 16
B 7 14
We have to transpose the data with t here because R has rowsum but no colsum. You can transpose it back afterwards, like t(res).
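One caveat for the full 372-column table: rep(LETTERS, each=4, length.out=ncol(m)) recycles after 26 groups, so groups 27 onward would reuse the labels A, B, ... and rowsum would merge them with the earlier ones. A minimal sketch that sidesteps this with an integer group index (the names grp and res are illustrative, not part of the answer above):

```r
# Example matrix from the question
m <- matrix(c(3L, 1L, 2L, 6L, 3L, 1L, 1L, 8L, 1L, 5L, 2L, 1L, 3L, 7L, 1L, 1L),
            nrow = 2, dimnames = list(c("r1", "r2"), letters[1:8]))

# Integer group index: columns 1-4 -> 1, columns 5-8 -> 2, and so on
grp <- (seq_len(ncol(m)) - 1) %/% 4 + 1

# rowsum() sums rows by group, so transpose in and transpose back out
res <- t(rowsum(t(m), grp))
res
```

The result keeps the original row names and has one column per group, labelled by group number.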

A base R solution, suppose df is your data frame:
ncols = ncol(df)
do.call(cbind, lapply(seq(1, ncols, 4), function(i) rowSums(df[i:(i+3)])))
# [,1] [,2]
# r1 9 7
# r2 16 14

Another way:
df <- data.frame(t(matrix(colSums(matrix(t(df), nrow=4)),nrow=nrow(df))))
## X1 X2
##1 9 7
##2 16 14
First transpose the data to a 4 x (ncol(df)/4 * nrow(df)) matrix, where each column now holds one group of four columns for one row of the original data frame.
Sum each column using colSums.
Transpose the data back to a data frame with the original number of rows.

You can do this in a vectorised way if you transform your original data to a matrix with 4 columns, then use rowSums on that, and then transform it back to match the rows of the original data frame. Here it is in one long command:
df <- read.table(header = TRUE, text = "a b c d e f g h
3 2 3 1 1 2 3 1
1 6 1 8 5 1 7 1")
matrix(rowSums(matrix(as.vector(t(as.matrix(df))),
ncol = 4, byrow = TRUE)), ncol = ncol(df) / 4, byrow = TRUE)
# [,1] [,2]
#[1,] 9 7
#[2,] 16 14
Edit: To preserve the row names, e.g. if rownames(df) <- c("r1", "r2"), just apply them to the resulting matrix (the row order is preserved), i.e. run rownames(result) <- rownames(df).
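The same reshaping idea can be wrapped in a small helper that also carries the row names over. This is only a sketch: the function name sum_col_groups and its width argument are made up for illustration, and it assumes the number of columns is a multiple of width.

```r
# Hypothetical helper: sum every `width` contiguous columns, row by row
sum_col_groups <- function(df, width = 4) {
  stopifnot(ncol(df) %% width == 0)
  # Flatten row by row, then lay the values out `width` per line
  long <- matrix(as.vector(t(as.matrix(df))), ncol = width, byrow = TRUE)
  # One sum per (row, group); fold back to one line per original row
  out <- matrix(rowSums(long), ncol = ncol(df) / width, byrow = TRUE)
  rownames(out) <- rownames(df)
  out
}

df <- data.frame(a = c(3, 1), b = c(2, 6), c = c(3, 1), d = c(1, 8),
                 e = c(1, 5), f = c(2, 1), g = c(3, 7), h = c(1, 1),
                 row.names = c("r1", "r2"))
result <- sum_col_groups(df)
result
```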

Related

How to move two specific rows to top of dataframe?

Below I have a DF.
A B C D
a 4 2 2
g 5 2 2
d 7 65 7
e 3 6 7
I would like to make this DF so that column A has "g" in the first row, and "d" in the second row. I would like to do this by calling the value in column A (rather than an index). How can I do this?
Ideal output
A B C D
g 5 2 2
d 7 65 7
a 4 2 2
e 3 6 7
We may convert to a factor with the levels specified in the desired order before arranging
library(forcats)
library(dplyr)
DF %>%
arrange(fct_relevel(A, 'g', 'd'))
A B C D
1 g 5 2 2
2 d 7 65 7
3 a 4 2 2
4 e 3 6 7
With fct_relevel, we can specify the order of specific levels without specifying the rest of the levels
> with(DF, fct_relevel(A, 'g', 'd'))
[1] a g d e
Levels: g d a e
data
DF <- structure(list(A = c("a", "g", "d", "e"), B = c(4L, 5L, 7L, 3L
), C = c(2L, 2L, 65L, 6L), D = c(2L, 2L, 7L, 7L)), class = "data.frame",
row.names = c(NA,
-4L))
Another possible solution:
library(dplyr)
df <- data.frame(
stringsAsFactors = FALSE,
A = c("a", "g", "d", "e"),
B = c(4L, 5L, 7L, 3L),
C = c(2L, 2L, 65L, 6L),
D = c(2L, 2L, 7L, 7L)
)
df %>% arrange(match(A, c("g", "d", setdiff(A, c("g", "d")))))
#> A B C D
#> 1 g 5 2 2
#> 2 d 7 65 7
#> 3 a 4 2 2
#> 4 e 3 6 7
Try the code below
with(
df,
df[match(c("g","d",A[!A%in%c("g","d")]),A),]
)
and you will see
A B C D
2 g 5 2 2
3 d 7 65 7
1 a 4 2 2
4 e 3 6 7
Just to add a base R solution if you are not interested in external packages, you can specify the row order directly:
# Sample Data
DF <- structure(list(A = c("a", "g", "d", "e"), B = c(4L, 5L, 7L, 3L
), C = c(2L, 2L, 65L, 6L), D = c(2L, 2L, 7L, 7L)), class = "data.frame",
row.names = c(NA, -4L))
A hard code for this example:
DF2 <- DF[c(2,3,1,4),]
A more generalizable example:
# Specify desired rows (which() returns them in their existing order in DF)
rownums <- which(DF$A %in% c("g","d"))
# Specify other rows
otherrows <- setdiff(seq_len(nrow(DF)), rownums)
# Organize
DF2 <- DF[c(rownums,otherrows),]
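If the top rows must appear in the order the values are given (rather than the order they already have in the data), match() can be used instead. A minimal sketch, where move_to_top is a made-up helper name and only the first matching row for each value is moved:

```r
# Hypothetical helper: move the rows whose `col` equals `values` to the top,
# in the order given by `values`
move_to_top <- function(df, col, values) {
  first <- match(values, df[[col]])      # first matching row for each value
  first <- first[!is.na(first)]          # ignore values that are not present
  rest <- setdiff(seq_len(nrow(df)), first)
  df[c(first, rest), , drop = FALSE]
}

DF <- data.frame(A = c("a", "g", "d", "e"), B = c(4L, 5L, 7L, 3L),
                 C = c(2L, 2L, 65L, 6L), D = c(2L, 2L, 7L, 7L),
                 stringsAsFactors = FALSE)
top <- move_to_top(DF, "A", c("g", "d"))
top
```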

Give percentage by group in R

For a sample dataframe:
df1 <- structure(list(i.d = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), group = c(1L,
1L, 2L, 1L, 3L, 3L, 2L, 2L, 1L), cat = c(0L, 0L, 1L, 1L, 0L,
0L, 1L, 0L, NA)), .Names = c("i.d", "group", "cat"), class = "data.frame", row.names = c(NA,
-9L))
I wish to add an additional column to my dataframe ("pc.cat") which records the percentage '1s' in column cat BY the group ID variable.
For example, there are four values in group 1 (i.d's a, b, d and i). The value for 'i' is NA, so it can be ignored for now. Only one of the three remaining values is 1, so the percentage would read 33.33 (to 2 dp). This value would be populated into column 'pc.cat' for all rows in group 1 (including the NA row). The process would then be repeated for the other groups (2 and 3).
If anyone could help me with the code for this I would greatly appreciate it.
This can be accomplished with the ave function:
df1$pc.cat <- ave(df1$cat, df1$group, FUN=function(x) 100*mean(na.omit(x)))
df1
# i.d group cat pc.cat
# 1 a 1 0 33.33333
# 2 b 1 0 33.33333
# 3 c 2 1 66.66667
# 4 d 1 1 33.33333
# 5 e 3 0 0.00000
# 6 f 3 0 0.00000
# 7 g 2 1 66.66667
# 8 h 2 0 66.66667
# 9 i 1 NA 33.33333
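Since the question asks for the value to 2 dp, the same ave call can be combined with round. A minimal sketch on a plain rebuild of the sample data (i.d is kept as character here, which is an assumption that does not affect the calculation):

```r
df1 <- data.frame(i.d = letters[1:9],
                  group = c(1L, 1L, 2L, 1L, 3L, 3L, 2L, 2L, 1L),
                  cat = c(0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, NA))

# Percentage of 1s per group, ignoring NA, written back to every row of the group
df1$pc.cat <- ave(as.numeric(df1$cat), df1$group,
                  FUN = function(x) round(100 * mean(x, na.rm = TRUE), 2))
df1
```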
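With data.table, excluding the NA rows before taking the per-group mean (this returns one row per group):
library(data.table)
setDT(df1)
df1[!is.na(cat), 100 * mean(cat), by=group]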
With data.table:
library(data.table)
DT <- data.table(df1)
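# mean(..., na.rm = TRUE) divides by the number of non-NA values, so group 1 gives 33.33 rather than 25
DT[, list(pc.cat = 100 * mean(cat, na.rm = TRUE)), by = "group"]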

Remove the rows that have the same column A value but different column B value from df (but not vice-versa) in R

I’m trying to remove all the rows that have the same value in the "lan" column of my dataframe but different value for my "id" column (but not vice-versa).
Using an example dataset:
require(dplyr)
t <- structure(list(id = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), lan = structure(c(1L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 1L,
7L), .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"),
value = c(0.22988498, 0.848989831, 0.538065821, 0.916571913,
0.304183372, 0.983348167, 0.356128559, 0.054102854, 0.400934593,
0.001026817, 0.488452667)), .Names = c("id", "lan", "value"
), class = "data.frame", row.names = c(NA, -11L))
t
I need to get rid of rows 1 and 10 because they have the same lan (a) but different id.
I've tried the following, without success:
a<-t[(!duplicated(t$id)),]
c<-a[duplicated(a$lan)|duplicated(a$lan, fromLast=TRUE),]
d<-t[!(t$lan %in% c$lan),]
Thanks for your help!
And an alternative using dplyr:
t2 <- t %>%
group_by(lan,id) %>%
summarise(value=sum(value)) %>%
group_by(lan) %>%
summarise(number=n()) %>%
filter(number>1) %>%
select(lan)
> t[!t$lan %in% t2$lan ,]
id lan value
2 2 b 0.84898983
3 2 c 0.53806582
4 3 d 0.91657191
5 3 d 0.30418337
6 4 e 0.98334817
7 4 e 0.35612856
8 4 e 0.05410285
9 4 f 0.40093459
11 4 g 0.48845267
You could use duplicated on "lan" to get a logical index of every row whose lan value occurs more than once, then do the same on both columns together ('id', 'lan') and negate it to get the rows whose (id, lan) pair is unique. Rows where both indices are TRUE share a lan with another row but have a different id, so negate that and subset.
indx1 <- with(t, duplicated(lan)|duplicated(lan,fromLast=TRUE))
indx2 <- !(duplicated(t[1:2])|duplicated(t[1:2],fromLast=TRUE))
t[!(indx1 & indx2),]
# id lan value
#2 2 b 0.84898983
#3 2 c 0.53806582
#4 3 d 0.91657191
#5 3 d 0.30418337
#6 4 e 0.98334817
#7 4 e 0.35612856
#8 4 e 0.05410285
#9 4 f 0.40093459
#11 4 g 0.48845267
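Another way to express the same rule is to count the distinct ids within each lan and keep only the rows where that count is 1. This is a different technique from the duplicated approach above, shown as a sketch; the names dat, n_ids and kept are illustrative, and the value column is left out for brevity.

```r
dat <- data.frame(
  id  = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L),
  lan = c("a", "b", "c", "d", "d", "e", "e", "e", "f", "a", "g")
)

# Number of distinct ids seen for each lan, repeated on every row of that lan
n_ids <- ave(dat$id, dat$lan, FUN = function(x) length(unique(x)))

# Keep rows whose lan belongs to exactly one id
kept <- dat[n_ids == 1, ]
kept
```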

Data roll up along with transpose

I want to roll up my data at the unique customer ID level, with each observation transposed against it, as shown below
Below is the snapshot of my data
basedata <- structure(list(customer = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L), .Label = c("a", "b", "d"), class = "factor"), obs = c(12L,
11L, 12L, 10L, 3L, 5L, 7L, 8L, 1L)), .Names = c("customer", "obs"
), class = "data.frame", row.names = c(NA, -9L))
Or
customer obs
a 12
a 11
a 12
a 10
b 3
b 5
b 7
d 8
d 1
I want to convert it in the following form
customer obs1 obs2 obs3 obs4
a 12 11 12 10
b 3 5 7 -
d 8 1 - -
I used the following code
basedata$shopping <- unlist(tapply(rawdata$customer, rawdata$customer,
function (x) seq(1, len = length(x))))
reshape(basedata, idvar = "customer", direction = "wide")
It gives the following error
Error in `[.data.frame`(data, , timevar) : undefined columns selected
How can I do it in R and excel?
Thank You
x <- structure(list(customer = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L), .Label = c("a", "b", "d"), class = "factor"), obs = c(12L,
11L, 12L, 10L, 3L, 5L, 7L, 8L, 1L)), .Names = c("customer", "obs"
), class = "data.frame", row.names = c(NA, -9L))
I chose to use a couple of extra packages (plyr and reshape2) because I find them easier and more general to use than reshape from the base package.
library(plyr)
library(reshape2)
## add observation number
x2 <- ddply(x,"customer",transform,num=1:length(customer))
## reshape
dcast(x2,customer~num,value.var="obs")
A base R way, assuming dat is the data,
> s <- split(dat$obs, dat$customer)
> df <- data.frame(do.call(rbind, lapply(s, function(x){ length(x) <- 4; x })))
> names(df) <- paste0('obs', seq(df))
> df
# obs1 obs2 obs3 obs4
# a 12 11 12 10
# b 3 5 7 NA
# d 8 1 NA NA
If you want the unique customer ID to be a column,
> df2 <- cbind(customer = rownames(df), df)
> rownames(df2) <- seq(nrow(df2))
> df2
# customer obs1 obs2 obs3 obs4
# 1 a 12 11 12 10
# 2 b 3 5 7 NA
# 3 d 8 1 NA NA
I'm assuming that "basedata" and "rawdata" are supposed to be the same (or at least copies of each other). If that's the case, you're simply missing specifying what the timevar argument for reshape should be.
Continuing from where you left off:
rawdata$shopping <- unlist(tapply(rawdata$customer, rawdata$customer,
function (x) seq(1, len = length(x))))
## rawdata$shopping <- with(rawdata, ave(seq_along(customer), customer, FUN = seq_along))
Here's the actual reshaping step:
reshape(rawdata, idvar = "customer", timevar="shopping", direction = "wide")
# customer obs.1 obs.2 obs.3 obs.4
# 1 a 12 11 12 10
# 5 b 3 5 7 NA
# 8 d 8 1 NA NA
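Put together as a self-contained sketch (customer is rebuilt here as a plain character column, which is an assumption, and the per-customer counter is created with ave rather than tapply):

```r
basedata <- data.frame(
  customer = c("a", "a", "a", "a", "b", "b", "b", "d", "d"),
  obs = c(12L, 11L, 12L, 10L, 3L, 5L, 7L, 8L, 1L)
)

# Running observation number within each customer: 1, 2, 3, ...
basedata$shopping <- ave(seq_along(basedata$customer), basedata$customer,
                         FUN = seq_along)

# timevar tells reshape which column distinguishes the repeated observations
wide <- reshape(basedata, idvar = "customer", timevar = "shopping",
                direction = "wide")
wide
```

Customers with fewer than four observations get NA in the remaining obs columns.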

R- How to merge multiple dataframes of different lengths?

I have been stuck with this issue for a while now. Need some help.
I am reading the following files (there can be more than 3 files) into a dataframe.
My input files look like the following:
file1:
someName someMOD someID
A T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P),S762(P) 1
B S495(P) 2
C S162(P),Q159(D) 3
D S45(P),C47(C),S48(P),S26(P) 4
E S18(P) 5
file2:
someName someMOD someID
C S162(P),Q159(D) 3
D S45(P),C47(C),S48(P),S26(P) 4
F S182(P) 6
E S18(P) 5
Z Q100(P) 9
A T754(P),M691(O),S694(P),S739(P),S740(P) 1
file3:
someName someMOD someID
A T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P) 1
B S495(P) 2
D S45(P),C47(C),S48(P),S26(P) 4
E S18(P) 5
F S182(P) 6
L Z182(P) 8
C S162(P),Q159(D) 3
My Code:
fileList <- dir(pattern="*.xls")
i<-1
j<-1
a<-list()
mybigtable<-data.frame
for (f in 1:length(fileList)){
fileName <- fileList[f]
X <-read.xls(fileName)
if(regexpr("Drug_Rep", fileName)[1]>0){
a[[i]]<-X
}
i=i+1
}
else{
#Don't do anything
}
}
#Now i want to merge my dataframes
mymerge <- function(x, y)
merge(x, y, by=c("someName", "someID"), all=TRUE))
Reduce(mymerge,a) #passing my list of dataframes 'a'
I did dput() on my 'a' list:
list(structure(list(someName = structure(c(1L, 2L, 4L, 5L, 6L,
7L, 3L), .Label = c("A", "B", "C", "D", "E", "F", "L"), class = "factor"),
someMOD = structure(c(6L, 5L, 4L, 2L, 3L, 7L, 1L), .Label = c("S162(P),Q159(D)",
"S18(P)", "S182(P)", "S45(P),C47(C),S48(P),S26(P)", "S495(P)",
"T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P)",
"Z182(P)"), class = "factor"), someID = c(1L, 2L, 4L, 5L,
6L, 8L, 3L)), .Names = c("someName", "someMOD", "someID"), class = "data.frame", row.names = c(NA,
-7L)), structure(list(someName = structure(1:5, .Label = c("A",
"B", "C", "D", "E"), class = "factor"), someMOD = structure(c(5L,
4L, 1L, 3L, 2L), .Label = c("S162(P),Q159(D)", "S18(P)", "S45(P),C47(C),S48(P),S26(P)",
"S495(P)", "T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P),S762(P)"
), class = "factor"), someID = 1:5), .Names = c("someName", "someMOD",
"someID"), class = "data.frame", row.names = c(NA, -5L)), structure(list(
someName = structure(c(2L, 3L, 5L, 4L, 6L, 1L), .Label = c("A",
"C", "D", "E", "F", "Z"), class = "factor"), someMOD = structure(c(2L,
5L, 4L, 3L, 1L, 6L), .Label = c("Q100(P)", "S162(P),Q159(D)",
"S18(P)", "S182(P)", "S45(P),C47(C),S48(P),S26(P)", "T754(P),M691(O),S694(P),S739(P),S740(P)"
), class = "factor"), someID = c(3L, 4L, 6L, 5L, 9L, 1L)), .Names = c("someName",
"someMOD", "someID"), class = "data.frame", row.names = c(NA,
-6L)))
What is my mistake in populating a list? Any help is really appreciated.
I am just trying to get an output like the following:
The problem with the code I gave you before is that merge gets confused if there are any duplicate column names, and you're merging more than 3 datasets. You'll have to rename your someMOD columns so they don't clash. A for loop works as well as anything for this purpose.
dupvars <- which(!names(a[[1]]) %in% c("someName", "someID"))
for(i in seq_along(a))
names(a[[i]])[dupvars] <- paste0(names(a[[i]])[dupvars], i)
# and then merge
Reduce(mymerge, a)
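A self-contained sketch of the rename-then-merge idea on toy data. The three small data frames below are invented stand-ins for the real files, and the names keys, renamed and merged are illustrative:

```r
# Toy stand-ins for the list of data frames read from the files
a <- list(
  data.frame(someName = c("A", "B"), someMOD = c("S1(P)", "S2(P)"), someID = c(1L, 2L)),
  data.frame(someName = c("A", "C"), someMOD = c("S3(P)", "S4(P)"), someID = c(1L, 3L)),
  data.frame(someName = c("B", "C"), someMOD = c("S5(P)", "S6(P)"), someID = c(2L, 3L))
)
keys <- c("someName", "someID")

# Suffix every non-key column with the index of its data frame
renamed <- lapply(seq_along(a), function(i) {
  d <- a[[i]]
  other <- !names(d) %in% keys
  names(d)[other] <- paste0(names(d)[other], i)
  d
})

# Full outer join across the whole list
merged <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), renamed)
merged
```

Each input contributes its own someMOD column, with NA where a name is missing from that input.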
Perhaps the problem is that you're actually not trying to merge in the standard sense, but reshape. In this case, you can rbind all the data.frames together after adding a "time" variable, and use dcast from "reshape2" to get what you're after:
Add a "time" variable and rbind the data.frames together
temp <- do.call(rbind,
lapply(seq_along(a),
function(x) data.frame(a[[x]], time = x)))
head(temp)
# someName someMOD someID time
# 1 A T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P) 1 1
# 2 B S495(P) 2 1
# 3 D S45(P),C47(C),S48(P),S26(P) 4 1
# 4 E S18(P) 5 1
# 5 F S182(P) 6 1
# 6 L Z182(P) 8 1
Transform the data.frame from a "long" format to a "wide" format
library(reshape2)
dcast(temp, someName + someID ~ time, value.var="someMOD")
# someName someID 1
# 1 A 1 T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P)
# 2 B 2 S495(P)
# 3 C 3 S162(P),Q159(D)
# 4 D 4 S45(P),C47(C),S48(P),S26(P)
# 5 E 5 S18(P)
# 6 F 6 S182(P)
# 7 L 8 Z182(P)
# 8 Z 9 <NA>
# 2
# 1 T754(P),M691(O),S692(P),S694(P),S739(P),S740(P),S759(P),S762(P)
# 2 S495(P)
# 3 S162(P),Q159(D)
# 4 S45(P),C47(C),S48(P),S26(P)
# 5 S18(P)
# 6 <NA>
# 7 <NA>
# 8 <NA>
# 3
# 1 T754(P),M691(O),S694(P),S739(P),S740(P)
# 2 <NA>
# 3 S162(P),Q159(D)
# 4 S45(P),C47(C),S48(P),S26(P)
# 5 S18(P)
# 6 S182(P)
# 7 <NA>
# 8 Q100(P)
