Combining Rows - Summing Certain Columns and Not Others in R

I have a data set that has repeated names in column 1 and then 3 other columns that are numeric.
I want to combine the rows of repeated names into one row and sum 2 of the columns while leaving the other alone. Is there a simple way to do this? I have been trying to figure it out with sapply and lapply and have read a lot of the Q&As here, but can't seem to find a solution.
Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff",
"Jeff", "Bill", "Mark")
data.Point.1 <- c(3,4,3,3,4,3,3,6,2)
data.Point.2 <- c(6,9,2,5,7,4,8,2,9)
data.Point.3 <- c(2,2,8,6,4,3,3,3,1)
data <- data.frame(Name, data.Point.1, data.Point.2, data.Point.3)
The data looks like this:
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 6 2
2 Hank 4 9 2
3 Tom 3 2 8
4 Jeff 3 5 6
5 Hank 4 7 4
6 Jeff 3 4 3
7 Jeff 3 8 3
8 Bill 6 2 3
9 Mark 2 9 1
I'd like to get it to look like this (summing columns 3 and 4 and leaving data.Point.1 alone):
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 23 14
2 Hank 4 16 6
3 Tom 3 2 8
8 Bill 6 2 3
9 Mark 2 9 1
Any help would be great. Thanks!

Another, somewhat more straightforward solution uses the dplyr package:
library(dplyr)
data <- data %>% group_by(Name, data.Point.1) %>% # group the columns you want to "leave alone"
summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) # sum columns 3 and 4
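If you want to sum over all other columns except those you want to "leave alone", replace summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) with summarise_each(funs(sum)). In recent dplyr versions, where summarise_each and funs are deprecated, summarise(across(everything(), sum)) is the equivalent.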
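For comparison, the same idea of grouping by the columns you want to "leave alone" can be sketched in base R with aggregate(). This is only a sketch: it assumes data.Point.1 is constant within each Name (true for the sample data), and the object name res is introduced here purely for illustration.

```r
# Sample data from the question
Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff",
          "Jeff", "Bill", "Mark")
data <- data.frame(Name,
                   data.Point.1 = c(3, 4, 3, 3, 4, 3, 3, 6, 2),
                   data.Point.2 = c(6, 9, 2, 5, 7, 4, 8, 2, 9),
                   data.Point.3 = c(2, 2, 8, 6, 4, 3, 3, 3, 1))

# Group by Name and data.Point.1 (the column left alone), sum the other two
res <- aggregate(cbind(data.Point.2, data.Point.3) ~ Name + data.Point.1,
                 data = data, FUN = sum)

# Assumption: one result row per Name, so match() can restore the
# order in which the names first appear in the input
res <- res[match(unique(data$Name), res$Name), ]
res
```

If data.Point.1 varied within a Name, this would return one row per (Name, data.Point.1) pair, and the match() step would no longer be appropriate.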

I'd do it this way using data.table:
setDT(data)[, c(data.Point.1 = data.Point.1[1L],
lapply(.SD, sum)), by=Name,
.SDcols = -"data.Point.1"]
# Name data.Point.1 data.Point.2 data.Point.3
# 1: Jeff 3 23 14
# 2: Hank 4 16 6
# 3: Tom 3 2 8
# 4: Bill 6 2 3
# 5: Mark 2 9 1
We group by Name and, for each group, take the first element of data.Point.1. For the rest of the columns, we compute the sum by using the base function lapply to loop over the columns of .SD, which stands for Subset of Data. The columns in .SD are specified by .SDcols, from which we exclude data.Point.1, so that all the other columns are provided to .SD.
Check the HTML vignettes for detailed info.

You could try
library(data.table)
setDT(data)[, list(data.Point.1=data.Point.1[1L],
data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)), by=Name]
# Name data.Point.1 data.Point.2 data.Point.3
#1: Jeff 3 23 14
#2: Hank 4 16 6
#3: Tom 3 2 8
#4: Bill 6 2 3
#5: Mark 2 9 1
or using base R
data$Name <- factor(data$Name, levels=unique(data$Name))
res <- do.call(rbind,lapply(split(data, data$Name), function(x) {
x[1, 3:4] <- colSums(x[3:4]) # write the group's column sums into its first row only
x[1,]} ))
Or using dplyr, you can use summarise_each to apply the function that needs to be applied on multiple columns, and cbind the output with the 'summarise' output for a single column
library(dplyr)
res1 <- data %>%
group_by(Name) %>%
summarise(data.Point.1=data.Point.1[1L])
res2 <- data %>%
group_by(Name) %>%
summarise_each(funs(sum), 3:4)
cbind(res1, res2[-1])
# Name data.Point.1 data.Point.2 data.Point.3
#1 Jeff 3 23 14
#2 Hank 4 16 6
#3 Tom 3 2 8
#4 Bill 6 2 3
#5 Mark 2 9 1
EDIT
The data created and the data shown initially differed in the original post. After the edit on OP's post (by @dimitris_ps), you can get the expected result by replacing group_by(Name) with group_by(Name, data.Point.1) in the res2 <- .. code.

Related

Sort data.frame or data.table using vector of column names [duplicate]

I have a data.frame (a data.table in fact) that I need to sort by multiple columns. The names of columns to sort by are in a vector. How can I do it? E.g.
DF <- data.frame(A= 5:1, B= 11:15, C= c(3, 3, 2, 2, 1))
DF
A B C
5 11 3
4 12 3
3 13 2
2 14 2
1 15 1
sortby <- c('C', 'A')
DF[order(sortby),] ## How to do this?
The desired output is the following but using the sortby vector as input.
DF[with(DF, order(C, A)),]
A B C
1 15 1
2 14 2
3 13 2
4 12 3
5 11 3
(Solutions for data.table are preferable.)
EDIT: I'd rather avoid importing additional packages provided that base R or data.table don't require too much coding.
With data.table:
setorderv(DF, sortby)
which gives:
> DF
A B C
1: 1 15 1
2: 2 14 2
3: 3 13 2
4: 4 12 3
5: 5 11 3
For completeness, with setorder:
setorder(DF, C, A)
The advantage of using setorder/setorderv is that the data is reordered by reference, which makes it very fast and memory efficient. Both functions work on data.tables as well as on data.frames.
If you want to combine ascending and descending ordering, you can use the order-parameter of setorderv:
setorderv(DF, sortby, order = c(1L, -1L))
which subsequently gives:
> DF
A B C
1: 1 15 1
2: 3 13 2
3: 2 14 2
4: 5 11 3
5: 4 12 3
With setorder you can achieve the same with:
setorder(DF, C, -A)
Using dplyr, you can use arrange_at which accepts string column names :
library(dplyr)
DF %>% arrange_at(sortby)
# A B C
#1 1 15 1
#2 2 14 2
#3 3 13 2
#4 4 12 3
#5 5 11 3
Or, with newer versions of dplyr (using all_of() to select columns from a character vector):
DF %>% arrange(across(all_of(sortby)))
In base R, we can use
DF[do.call(order, DF[sortby]), ]
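The base R idiom above can also be extended to mixed ascending/descending sorts, mirroring setorderv(DF, sortby, order = c(1L, -1L)). The following is a minimal sketch, assuming the sort columns are plain atomic vectors; it relies on order() accepting a vector for decreasing when method = "radix". The names keys, idx and res are introduced here for illustration.

```r
DF <- data.frame(A = 5:1, B = 11:15, C = c(3, 3, 2, 2, 1))
sortby <- c("C", "A")

# Drop the column names so they cannot clash with order()'s own arguments
keys <- unname(as.list(DF[sortby]))

# One decreasing flag per sort column: C ascending, A descending
idx <- do.call(order, c(keys, list(decreasing = c(FALSE, TRUE),
                                   method = "radix")))
res <- DF[idx, ]
res
```

Unlike setorderv, this returns a reordered copy rather than modifying DF by reference.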
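Also possible with dplyr, at least for a single sort column (get() looks up one name, so it cannot take the whole sortby vector):
DF %>%
arrange(get(sortby[1]))
But Ronak's answer is more elegant.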

Reshape data frame with different column lengths into two columns replicating column ID

I have the following data frame, with different row lengths:
myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
c("Walter","NA","NA","NA","NA"),
c("Walter","Jesse","NA","NA","NA"),
c("Gus","Tuco","Mike","NA","NA"),
c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))
data.frame(ID,myvar)
ID V1 V2 V3 V4 V5
1 Walter NA NA NA NA
2 Walter NA NA NA NA
3 Walter Jesse NA NA NA
4 Gus Tuco Mike NA NA
5 Gus Mike Hank Saul Flynn
My goal is to reshape this data frame into a two-column data frame. The first column would be the ID and the other one would be the character name. Note that the ID must correspond to the row in which the character name was originally placed. I'm expecting the following result:
ID V
1 Walter
2 Walter
3 Walter
3 Jesse
4 Gus
4 Tuco
4 Mike
5 Gus
5 Mike
5 Hank
5 Saul
5 Flynn
I've tried dcast {reshape2} but it didn't return what I need. It is worth noting that my original data frame is quite big. Any tips? Cheers.
You could use unlist
res <- subset(data.frame(ID,value=unlist(myvar[-1],
use.names=FALSE)), value!='NA')
res
# ID value
#1 1 Walter
#2 2 Walter
#3 3 Walter
#4 4 Gus
#5 5 Gus
#6 3 Jesse
#7 4 Tuco
#8 5 Mike
#9 4 Mike
#10 5 Hank
#11 5 Saul
#12 5 Flynn
NOTE: The NAs are 'character' elements in the dataset. It is better to create the data without quotes around them so that they are real NAs and can be removed with na.omit, is.na, complete.cases, etc.
data
myvar <- data.frame(ID,myvar)
myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
c("Walter","NA","NA","NA","NA"),
c("Walter","Jesse","NA","NA","NA"),
c("Gus","Tuco","Mike","NA","NA"),
c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))
df <- data.frame(ID, myvar)
Using base reshape. (I'm converting your "NA" character strings to NA which you may not have to do, this is just due to how you created this example)
df[df == 'NA'] <- NA
na.omit(reshape(df, direction = 'long', varying = list(2:6))[, c('ID','V1')])
# ID V1
# 1.1 1 Walter
# 2.1 2 Walter
# 3.1 3 Walter
# 4.1 4 Gus
# 5.1 5 Gus
# 3.2 3 Jesse
# 4.2 4 Tuco
# 5.2 5 Mike
# 4.3 4 Mike
# 5.3 5 Hank
# 5.4 5 Saul
# 5.5 5 Flynn
or using reshape2
library('reshape2')
## na.omit(melt(df, id.vars = 'ID')[, c('ID','value')])
## or better yet as ananda suggests:
melt(df, id.vars = 'ID', na.rm = TRUE)[, c('ID','value')]
# ID value
# 1 1 Walter
# 2 2 Walter
# 3 3 Walter
# 4 4 Gus
# 5 5 Gus
# 8 3 Jesse
# 9 4 Tuco
# 10 5 Mike
# 14 4 Mike
# 15 5 Hank
# 20 5 Saul
# 25 5 Flynn
You get warnings that the factor levels are not the same across the columns, but that's fine.
Fix your "NA" strings so that they are actually NA first (assuming mydf is the combined data frame, i.e. data.frame(ID, myvar)):
mydf[mydf == "NA"] <- NA
Using some subsetting to do it all in one fell swoop:
data.frame(ID=mydf$ID[row(mydf[-1])[!is.na(mydf[-1])]], V=mydf[-1][!is.na(mydf[-1])])
# ID V
#1 1 Walter
#2 2 Walter
#3 3 Walter
#4 4 Gus
#5 5 Gus
#6 3 Jesse
#7 4 Tuco
#8 5 Mike
#9 4 Mike
#10 5 Hank
#11 5 Saul
#12 5 Flynn
Or much more readable in base R:
sel <- which(!is.na(mydf[-1]), arr.ind=TRUE)
data.frame(ID=mydf$ID[sel[,1]], V=mydf[-1][sel])
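Since mydf is not defined in the snippets above, here is a self-contained sketch of the which(..., arr.ind = TRUE) approach. It assumes mydf is simply data.frame(ID, myvar) with the "NA" strings converted to real NA, and it adds one step that is not in the answer: the index matrix is sorted by row and column so that the result comes out in the order shown in the question. The name long is introduced here for illustration.

```r
myvar <- as.data.frame(rbind(c("Walter", "NA", "NA", "NA", "NA"),
                             c("Walter", "NA", "NA", "NA", "NA"),
                             c("Walter", "Jesse", "NA", "NA", "NA"),
                             c("Gus", "Tuco", "Mike", "NA", "NA"),
                             c("Gus", "Mike", "Hank", "Saul", "Flynn")),
                       stringsAsFactors = FALSE)
myvar[myvar == "NA"] <- NA               # turn "NA" strings into real NA
mydf <- data.frame(ID = factor(1:5), myvar, stringsAsFactors = FALSE)

# (row, col) positions of all non-missing names
sel <- which(!is.na(mydf[-1]), arr.ind = TRUE)
# Added step: order by original row, then by column within the row
sel <- sel[order(sel[, "row"], sel[, "col"]), ]

long <- data.frame(ID = mydf$ID[sel[, "row"]],
                   V  = as.matrix(mydf[-1])[sel],
                   stringsAsFactors = FALSE)
long
```

Sorting the small index matrix is usually cheaper than sorting the final data frame, although for very large data either step may still be noticeable.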
Using tidyr
library("tidyr")
library("dplyr") # select() and filter() come from dplyr
myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
c("Walter","NA","NA","NA","NA"),
c("Walter","Jesse","NA","NA","NA"),
c("Gus","Tuco","Mike","NA","NA"),
c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))
myvar <- data.frame(ID,myvar)
myvar %>%
gather(key, value, V1:V5) %>% # key holds V1..V5, value holds the names
select(ID, value) %>%
filter(value != "NA")
If your NAs are coded as NA instead of "NA", then we can in fact use the na.rm = TRUE option in gather. E.g.:
myvar[myvar == "NA"] <- NA
myvar %>%
gather(key, value, V1:V5, na.rm = TRUE) %>%
select(ID, value)
gives
ID value
1 1 Walter
2 2 Walter
3 3 Walter
4 4 Gus
5 5 Gus
6 3 Jesse
7 4 Tuco
8 5 Mike
9 4 Mike
10 5 Hank
11 5 Saul
12 5 Flynn
Since you are dealing with a huge data set, time performance matters; even sorting afterwards could take a long time.
Here's my solution. You would be better off using data.table, but here I'll use reshape2.
First solution:
myvar <- as.data.frame(rbind(c("Walter","NA","NA","NA","NA"),
c("Walter","NA","NA","NA","NA"),
c("Walter","Jesse","NA","NA","NA"),
c("Gus","Tuco","Mike","NA","NA"),
c("Gus","Mike","Hank","Saul","Flynn")))
ID <- as.factor(c(1:5))
dat = data.frame(ID,myvar)
dat[] <- lapply(dat, function(x) {x[x=="NA"]=NA; x})
str(dat$V5)
library(dplyr)
library(reshape2)
dat2 <- melt(dat, id.vars="ID", measure.vars = paste0("V", 1:5), na.rm=TRUE)
dat2
dat2[, c('ID', 'value')]
The second solution needs some preprocessing. For huge data, I would recommend data.table.
datB <- t(dat)
datB
colnames(datB) <- datB["ID", ]
datB <- datB[-1,]
melt(datB, measure.vars = 1:5, na.rm=TRUE)[, c('Var2', 'value')]
You do not need to sort afterwards.

get z standardized score within each group

Here is the data.
set.seed(23)
data<-data.frame(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
ID group value
1 1 1 0.4133934
2 2 2 0.6444651
3 3 3 0.1350871
4 4 1 0.5924411
5 5 2 0.3439465
6 6 3 0.3673059
7 7 1 0.3202062
8 8 2 0.8883733
9 9 3 0.7506174
10 10 1 0.3301955
11 11 2 0.7365258
12 12 3 0.1502212
I want to get z-standardized scores within each group, so I tried:
library(weights)
data_split<-split(data, data$group) #split the dataframe
stan<-lapply(data_split, function(x) stdz(x$value)) #compute z-scores within group
However, this isn't quite what I want, because I'd like to add the result as a new variable next to 'value'.
How can I do that? Kindly provide some suggestions (sample code). Any help is greatly appreciated.
Use this instead:
within(data, stan <- ave(value, group, FUN=stdz))
No need to call split nor lapply.
One way using data.table package:
library(data.table)
library(weights)
set.seed(23)
data <- data.table(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
setkey(data, ID)
dataNew <- data[, list(ID, stan = stdz(value)), by = 'group']
the result is:
group ID stan
1: 1 1 -0.6159312
2: 1 4 0.9538398
3: 1 7 -1.0782747
4: 1 10 0.7403661
5: 2 2 -1.2683237
6: 2 5 0.7839781
7: 2 8 0.8163844
8: 2 11 -0.3320388
9: 3 3 0.6698418
10: 3 6 0.8674548
11: 3 9 -0.2131335
12: 3 12 -1.3241632
I tried Ferdinand.Kraft's solution but it didn't work for me. I think the stdz function isn't included in the basic R install. Moreover, the within part troubled me in a large dataset with many variables. I think the easiest way is:
data$value.s <- ave(data$value, data$group, FUN=scale)
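A variant of the ave() approach that needs no extra package is sketched below. It swaps in a small helper, zscore (a hypothetical name, not part of base R or the weights package), which returns a plain vector instead of the one-column matrix that scale() produces. The two tapply() calls are just a sanity check that each group ends up with mean 0 and standard deviation 1.

```r
set.seed(23)
data <- data.frame(ID = 1:12,
                   group = rep(1:3, times = 4),
                   value = rnorm(12, mean = 0.5, sd = 0.3))

# Hypothetical helper: standardize a vector to mean 0, sd 1
zscore <- function(x) (x - mean(x)) / sd(x)

# ave() applies zscore within each group and keeps the original row order
data$value.z <- ave(data$value, data$group, FUN = zscore)

# Sanity check: per-group mean and sd of the new column
group_means <- tapply(data$value.z, data$group, mean)
group_sds   <- tapply(data$value.z, data$group, sd)
```

Note that sd() uses the n - 1 denominator, as scale() does by default.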
Add the new column inside your function, and have the function return the whole data frame.
stanL<-lapply(data_split, function(x) {
x$stan <- stdz(x$value)
x
})
stan <- do.call(rbind, stanL)

How to use "cast" in reshape without aggregation

In many uses of cast I've seen, an aggregation function such as mean is used.
What if you simply want to reshape without information loss?
For example, if I want to take this long format:
ID condition Value
John a 2
John a 3
John b 4
John b 5
John a 6
John a 2
John b 1
John b 4
To this wide-format without any aggregation:
ID a b
John 2 4
John 3 5
Alex 6 1
Alex 2 4
I suppose this assumes that observations are paired, and that a missing value would mess this up, but any insight is appreciated.
In such cases you can add a sequence number:
library(reshape2)
DF$seq <- with(DF, ave(Value, ID, condition, FUN = seq_along))
dcast(ID + seq ~ condition, data = DF, value.var = "Value")
The last line gives:
ID seq a b
1 John 1 2 4
2 John 2 3 5
3 John 3 6 1
4 John 4 2 4
(Note that we used the sample input from the question but the sample output in the question does not correspond to the sample input.)
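The sequence-number trick does not depend on reshape2. As a rough sketch, the same result can be obtained with base R's reshape(), using the sample input from the question (where every ID is John). The name wide is introduced here for illustration, and the value columns come out as Value.a and Value.b rather than a and b.

```r
DF <- data.frame(ID = "John",
                 condition = c("a", "a", "b", "b", "a", "a", "b", "b"),
                 Value = c(2, 3, 4, 5, 6, 2, 1, 4))

# Number the observations within each ID/condition combination
DF$seq <- with(DF, ave(Value, ID, condition, FUN = seq_along))

# ID + seq identify a row of the wide table; condition spreads into columns
wide <- reshape(DF, idvar = c("ID", "seq"), timevar = "condition",
                direction = "wide")
wide <- wide[order(wide$seq), ]
wide
```

Because each (ID, seq, condition) combination is unique, no aggregation function is needed and no values are lost.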

How to aggregate some columns while keeping other columns in R?

I have a data frame like this:
id no age
1 1 7 23
2 1 2 23
3 2 1 25
4 2 4 25
5 3 6 23
6 3 1 23
and I hope to aggregate the data frame by id into a form like this (just sum no for rows that share the same id, but keep age):
id no age
1 1 9 23
2 2 5 25
3 3 7 23
How to achieve this using R?
Assuming that your data frame is named df.
aggregate(no~id+age, df, sum)
# id age no
# 1 1 23 9
# 2 3 23 7
# 3 2 25 5
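To get exactly the row and column order shown in the question, the aggregate() result can be reordered afterwards. The sketch below rebuilds the sample data frame from the question and assumes, as in that sample, that age is constant within each id; the name res is introduced here for illustration.

```r
df <- data.frame(id  = c(1, 1, 2, 2, 3, 3),
                 no  = c(7, 2, 1, 4, 6, 1),
                 age = c(23, 23, 25, 25, 23, 23))

# Sum no within each id/age combination
res <- aggregate(no ~ id + age, data = df, FUN = sum)

# Restore the question's layout: rows by id, columns as id, no, age
res <- res[order(res$id), c("id", "no", "age")]
res
```

If age could differ within an id, grouping by id + age would yield several rows per id, so a different rule for picking the age to keep would be needed.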
Even better, data.table:
library(data.table)
# convert your object to a data.table (by reference) to unlock data.table syntax
setDT(DF)
DF[ , .(sum_no = sum(no), unq_age = unique(age)), by = id]
Alternatively, you could use ddply from plyr package:
require(plyr)
ddply(df,.(id,age),summarise,no = sum(no))
In this particular example the results are identical. However, this is not always the case; the difference between the two functions is outlined here. Both functions have their uses and are worth exploring, which is why I felt this alternative should be mentioned.

Resources