I'm somewhat new to R programming and am in need of assistance.
I'm looking to take the sum of 4 columns in a dataframe and list these totals in a simple table.
Essentially, take the sum of 4 columns (A, B, C, D) and list the total in a table (table = column 1: A, B, C, D column 2: sum of column A, B, C, D) - something along the lines of:
A = 3
B = 4
C = 4
D = 3
Does anyone know how to get this output? Also, the less "manual" the response, the better (i.e. trying to avoid having to input several lines of code to get this output if possible).
Thank you.
If your data looks like this:
a <- c(1:4)
b <- c(2:5)
c <- c(3:6)
d <- c(4:7)
df <- data.frame(a,b,c,d)
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Use
> res <- sapply(df,sum)
to get
a b c d
10 14 18 22
To apply the function only to numeric columns, try
> res <- colSums(df[sapply(df,is.numeric)])
There is also colSums, combined here with Filter to keep only the numeric columns:
colSums(Filter(is.numeric, df))
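The question asks for a two-column table rather than a named vector. A minimal sketch of that last step, using the sample data from the answer above (the names totals and totals_table are just illustrative):

```r
# Sample data from the answer above
df <- data.frame(a = 1:4, b = 2:5, c = 3:6, d = 4:7)

# Named vector of column sums, numeric columns only
totals <- colSums(Filter(is.numeric, df))

# Turn the named vector into a two-column table:
# column 1 holds the column names, column 2 holds the sums
totals_table <- data.frame(column = names(totals), total = unname(totals))
totals_table
```

This stays a one-liner in practice, since both steps can be nested if you don't need the intermediate vector.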
I have a data frame with 4 columns, each column representing a different treatment. Each column is filled with protein numbers, and the columns have different numbers of rows. Is there a way to compare all 4 columns and get, as a result, a new column saying which of the columns each value is found in? I know some values will appear in two or maybe even three of the columns, and I was wondering if there's a way to get this as the end result in a new column.
I tried Data$A %in% Data$B, but this just gives me TRUE or FALSE between two columns. I was looking for something like match or even contains, but all the options seem to give only a TRUE or FALSE answer.
What I need is something like this.
A B C
1 DSFG DSFG DSGG
2 DDEG DDED DDEE
3 HUGO HUGI HUGO
So if this is my table, I want the result like this
D(?) E
1 DSFG A,B
2 DSGG C
4 DDEG A
5 DDED B
6 DDEE C
7 HUGO A,C
8 HUGI B
Solution
An idea via base R is to use stack to convert to long, and aggregate to get the required output.
aggregate(ind ~ values, stack(df), toString)
# values ind
#1 DDED B
#2 DDEE C
#3 DDEG A
#4 DSFG A, B
#5 DSGG C
#6 HUGI B
#7 HUGO A, C
NOTE: Your columns need to be character (not factor) for this to work. If they are factors, convert them first with df[] <- lapply(df, as.character).
Explanations
Stacking turns data into "long format":
stack(df)
values ind
1 DSFG A
2 DDEG A
3 HUGO A
4 DSFG B
5 DDED B
6 HUGI B
7 DSGG C
8 DDEE C
9 HUGO C
toString() simply joins the elements of a vector into one string, separated by a comma and a space:
toString(c("A", "B", "C"))
[1] "A, B, C"
Aggregating returns a vector of "ind"s for each value, and these are then turned into a string using the function above:
aggregate(ind ~ values, stack(df), FUN=toString)
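Putting the pieces together, here is a self-contained sketch of the stack + aggregate approach. It assumes the three-column example from the question, and deliberately creates factor columns first to show the conversion step from the note above:

```r
# Example data from the question, built with factor columns on purpose
df <- data.frame(A = c("DSFG", "DDEG", "HUGO"),
                 B = c("DSFG", "DDED", "HUGI"),
                 C = c("DSGG", "DDEE", "HUGO"),
                 stringsAsFactors = TRUE)

# Convert every column to character so the values can be stacked together
df[] <- lapply(df, as.character)

# Long format: one row per (value, source column) pair
long <- stack(df)

# For each distinct value, join the names of the columns it appears in
res <- aggregate(ind ~ values, long, toString)
res
```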
Doing it the tidy way:
Input
df <- data.frame(A = c("DSFG", "DDEG", "HUGO"), B = c("DSFG", "DDED", "HUGI"), C = c("DSGG", "DDEE", "HUGO"))
Summarizing data
library(tidyverse)
df %>%
gather("Column", "Value", 1:3) %>%
group_by(Value) %>%
summarise(Cols = paste(Column, collapse = ","))
Output
Value Cols
DDED B
DDEE C
DDEG A
DSFG A,B
DSGG C
HUGI B
HUGO A,C
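In current versions of tidyr, gather() is superseded by pivot_longer(). A sketch of the same pipeline with the newer function, assuming the dplyr and tidyr packages are installed (the object name out is just illustrative):

```r
library(dplyr)
library(tidyr)

df <- data.frame(A = c("DSFG", "DDEG", "HUGO"),
                 B = c("DSFG", "DDED", "HUGI"),
                 C = c("DSGG", "DDEE", "HUGO"))

out <- df %>%
  # Long format: one row per cell, with the source column name in "Column"
  pivot_longer(everything(), names_to = "Column", values_to = "Value") %>%
  group_by(Value) %>%
  # Collapse the column names for each distinct value
  summarise(Cols = paste(Column, collapse = ","))
out
```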
Coming from Sum the values according to labels in R.
I've been told that working with 2-dimensional tables is significantly different from working with 1-dimensional ones. For example, given:
a a,b a,b,c c
d 5 2 1 2
d,e 2 1 1 1
And we want to achieve:
a b c
d 12 5 5
e 4 2 2
So how can this be achieved using R?
A little bit convoluted, but it should work:
m <- as.matrix(data.frame('a'=c(5,2),'a,b'=c(2,1),
'a,b,c'=c(1,1),'c'=c(2,1),
check.names = FALSE,row.names=c('d','d,e')))
colNamesSplits <- strsplit(colnames(m),',')
rowNamesSplits <- strsplit(rownames(m),',')
colNms <- unique(unlist(colNamesSplits))
rowNms <- unique(unlist(rowNamesSplits))
colIdxs <- unlist(sapply(1:length(colNamesSplits),
function(i) rep.int(i,length(colNamesSplits[[i]]))))
rowIdxs <- unlist(sapply(1:length(rowNamesSplits),
function(i) rep.int(i,length(rowNamesSplits[[i]]))))
colIdxsMapped <- unlist(sapply(colNamesSplits, function(n) match(n,colNms)))
rowIdxsMapped <- unlist(sapply(rowNamesSplits, function(n) match(n,rowNms)))
# let's create the fully expanded matrix
expanded <- as.matrix(m[rowIdxs,colIdxs])
rownames(expanded) <- rowNms[rowIdxsMapped]
colnames(expanded) <- colNms[colIdxsMapped]
# aggregate expanded by cols :
expanded <- do.call(cbind,lapply(split(1:ncol(expanded),colnames(expanded)),
function(ii) rowSums(expanded[,ii,drop=FALSE])))
# aggregate expanded by rows :
expanded <- do.call(rbind,lapply(split(1:nrow(expanded),rownames(expanded)),
function(ii) colSums(expanded[ii,,drop=FALSE])))
> expanded
a b c
d 12 5 5
e 4 2 2
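If the expansion above feels heavy, the same result can probably be had with two indicator matrices and a matrix product. This is a different technique from the answer above, sketched under the assumption that labels are separated by plain commas with no spaces; the helper name indicator is made up for illustration:

```r
# Same input as above, built directly as a matrix
m <- matrix(c(5, 2, 2, 1, 1, 1, 2, 1), nrow = 2,
            dimnames = list(c("d", "d,e"), c("a", "a,b", "a,b,c", "c")))

# Hypothetical helper: rows are the single labels, columns are the combined
# labels, and a cell is 1 when the single label occurs in the combined one
indicator <- function(labels) {
  parts <- strsplit(labels, ",", fixed = TRUE)
  singles <- unique(unlist(parts))
  hits <- sapply(parts, function(p) as.numeric(singles %in% p))
  matrix(hits, nrow = length(singles), dimnames = list(singles, labels))
}

# Each cell of m is added to every (row label, column label) pair it carries
result <- indicator(rownames(m)) %*% m %*% t(indicator(colnames(m)))
result
```

The left product sums rows by row label and the right product sums columns by column label, which is what the two aggregation steps in the answer do after expanding.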
I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df<-data.frame()
for(i in 1:dim(a)[1]){
s<-seq(a[i,1],a[i,2])
df<-rbind(df,data.frame(s,rep(a[i,3],length(s))))
}
colnames(df)<-c("V1","V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
Use mapply to create a list of your ranges from "start" to "end".
Use rep + lengths to repeat the "groups" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(indf[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
}
If you want some larger sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of different unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
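Another base R possibility, not from the answers above, is to skip mapply entirely and build the values with rep and sequence. A minimal sketch, assuming that end is never smaller than start in any row:

```r
a <- data.frame(start = 1:4, end = 3:6, group = c("A", "B", "C", "D"))

# Number of elements in each start:end range (assumes end >= start)
n <- a$end - a$start + 1L

expanded <- data.frame(
  # Repeat each label once per element of its range
  group  = rep(a$group, n),
  # sequence(n) gives 1..n[1], 1..n[2], ...; shift each run to begin at start
  values = rep(a$start, n) + sequence(n) - 1L
)
expanded
```

Like the mapply version, this does not depend on having only one row per group.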
I would like to process all rows in data frame df by applying function f to every row. As function f returns numeric vector with two elements I would like to assign individual elements to new columns in df.
Here is a sample df, a trivial function f returning two elements, and my attempt using apply:
df <- data.frame(a = 1:3, b = 3:5)
f <- function (a, b) {
c(a + b, a * b)
}
df[, c('apb', 'amb')] <- apply(df, 1, function(x) f(a = x[1], b = x[2]))
This does not work: the results are assigned column by column instead of row by row:
> df
a b apb amb
1 1 3 4 8
2 2 4 3 8
3 3 5 6 15
You could also use Reduce instead of apply. With two columns it calls f once on the whole columns rather than once per row, which is generally more efficient. You just need to modify your function slightly to use cbind instead of c:
f <- function (a, b) {
cbind(a + b, a * b) # modified to use `cbind` instead of `c`
}
df[c('apb', 'amb')] <- Reduce(f, df)
df
# a b apb amb
# 1 1 3 4 3
# 2 2 4 6 8
# 3 3 5 8 15
Note: This will only work nicely if you have only two columns (as in your example). If you have more columns in your data set, run this only on a subset of two columns.
apply returns one column per row of df, so you need to transpose the result to get what you want:
df[, c('apb', 'amb')] <- t(apply(df, 1, function(x) f(a = x[1], b = x[2])))
> df
a b apb amb
1 1 3 4 3
2 2 4 6 8
3 3 5 8 15
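To see why the transpose is needed, it may help to look at the shape of what apply returns. The sketch below also shows mapply as a possible alternative, which passes the columns to f directly instead of converting each row to a vector first (the names raw and res are just illustrative):

```r
df <- data.frame(a = 1:3, b = 3:5)
f <- function(a, b) c(a + b, a * b)

# apply over rows returns a 2 x 3 matrix: one column per row of df
raw <- apply(df, 1, function(x) f(a = x[1], b = x[2]))
dim(raw)

# mapply calls f on matching elements of the two columns;
# the result is again 2 x 3, so it is transposed before assignment
res <- t(mapply(f, df$a, df$b))
df[c("apb", "amb")] <- res
df
```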
This seems like a simple problem but I can't seem to figure it out. I'd like to remove duplicates from a dataframe (df) if two columns have the same values, even if those values are in the reverse order. What I mean is, say you have the following data frame:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)
a b
1 A A
2 A B
3 A B
4 B C
5 B A
6 B A
7 C B
8 C B
If I now remove duplicates, I get the following data frame:
df[duplicated(df),]
a b
3 A B
6 B A
8 C B
However, I would also like to remove row 6 in this data frame, since "A", "B" is the same as "B", "A". How can I do this automatically?
Ideally I could specify which two columns to compare since the data frames could have varying columns and can be quite large.
Thanks!
Extending Ari's answer so that you can specify which columns to compare when the data frame has other columns as well:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)
df$c = sample(1:10,8)
df$d = sample(LETTERS,8)
df
a b c d
1 A A 10 B
2 A B 8 S
3 A B 7 J
4 B C 3 Q
5 B A 2 I
6 B A 6 U
7 C B 4 L
8 C B 5 V
cols = c(1,2)
newdf = df[,cols]
for (i in 1:nrow(df)){
newdf[i, ] = sort(df[i,cols])
}
df[!duplicated(newdf),]
a b c d
1 A A 10 B
2 A B 8 S
4 B C 3 Q
One solution is to first sort each row of df:
for (i in 1:nrow(df))
{
df[i, ] = sort(df[i, ])
}
df
a b
1 A A
2 A B
3 A B
4 B C
5 A B
6 A B
7 B C
8 B C
At that point it's just a matter of removing the duplicated elements:
df = df[!duplicated(df),]
df
a b
1 A A
2 A B
4 B C
As thelatemail mentioned in the comments, your code actually keeps the duplicates. You need to use !duplicated to remove them.
The other answers use a for loop to assign a value for each and every row. While this is not an issue if you have 100 rows, or even a thousand, you're going to be waiting a while if you have large data of the order of 1M rows.
Stealing from the other linked answer (which uses data.table), you could try something like the following, which builds an order-independent key for each row with pmin and pmax:
df[!duplicated(data.frame(list(do.call(pmin,df),do.call(pmax,df)))),]
A comparison benchmark with a larger dataset (df2):
df2 <- df[sample(1:nrow(df),50000,replace=TRUE),]
system.time(
df2[!duplicated(data.frame(list(do.call(pmin,df2),do.call(pmax,df2)))),]
)
# user system elapsed
# 0.07 0.00 0.06
system.time({
for (i in 1:nrow(df2))
{
df2[i, ] = sort(df2[i, ])
}
df2[!duplicated(df2),]
}
)
# user system elapsed
# 42.07 0.02 42.09
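The pmin/pmax one-liner above uses every column of df. If the data frame has extra columns and you only want to compare two of them, something like the following should work. It is a sketch that assumes the key columns are character (not factor); key_cols, lo and hi are illustrative names:

```r
df <- data.frame(a = c("A", "A", "A", "B", "B", "B", "C", "C"),
                 b = c("A", "B", "B", "C", "A", "A", "B", "B"),
                 c = 1:8,
                 stringsAsFactors = FALSE)

# Columns to treat as an unordered pair
key_cols <- c("a", "b")

# Element-wise smaller and larger value of each pair, so that
# ("B", "A") and ("A", "B") produce the same key
lo <- do.call(pmin, unname(as.list(df[key_cols])))
hi <- do.call(pmax, unname(as.list(df[key_cols])))

# Keep the first row for each key; all other columns are left untouched
deduped <- df[!duplicated(data.frame(lo, hi)), ]
deduped
```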
Using apply will be a better option than loops.
newDf <- data.frame(t(apply(df,1,sort)))
All you need to do now is remove duplicates.
newDf <- newDf[!duplicated(newDf),]
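Note that newDf holds the sorted values, so the original column order within each row is lost. If you want to keep the original rows (and any other columns), one option is to use the sorted pairs only as a key. A minimal sketch, assuming character columns named a and b:

```r
df <- data.frame(a = c("A", "A", "A", "B", "B", "B", "C", "C"),
                 b = c("A", "B", "B", "C", "A", "A", "B", "B"),
                 id = 1:8,
                 stringsAsFactors = FALSE)

# Sort the two key columns within each row; apply returns one column
# per row, so the result is transposed back to one row per row of df
sorted_pairs <- t(apply(df[c("a", "b")], 1, sort))

# duplicated() on a matrix compares whole rows
result <- df[!duplicated(sorted_pairs), ]
result
```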