Sum multiple columns [duplicate] - r

I am trying to write a function that sums the columns of a data frame, grouped by the values in the first two columns. For example, I have a data.table M:
Crs gr P_7 P_8
38 1 3 16
38 1 12 45
38 1 9 28
40 2 3 9
40 2 14 29
40 1 4 3
40 2 8 2
I want to sum the columns by column 1 (Crs) first and then by column 2 (gr). The result should be:
Crs gr P_7 P_8
38 1 24 89
40 2 25 40
40 1 4 3
Currently I am using,
M <- M[, list(sum(P_7),sum(P_8)), by=list(Crs,gr)]
The problem with this is that I have to spell out the column names, which won't be fixed. How can I do this without naming the columns?
Thanks in advance!

You're looking for this:
M[, lapply(.SD, sum), by = list(Crs, gr)]
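A minimal sketch of how this plays out on the sample data from the question, assuming the data.table package is installed. The `cols` helper and the `^P_` pattern are illustrative assumptions, useful when only some of the non-grouping columns should be summed:

```r
library(data.table)

# Sample data from the question
M <- data.table(
  Crs = c(38, 38, 38, 40, 40, 40, 40),
  gr  = c(1, 1, 1, 2, 2, 1, 2),
  P_7 = c(3, 12, 9, 3, 14, 4, 8),
  P_8 = c(16, 45, 28, 9, 29, 3, 2)
)

# .SD holds every column not named in `by`, so no column names are hard-coded
res <- M[, lapply(.SD, sum), by = .(Crs, gr)]

# Optional: restrict .SD to a subset of columns chosen by name pattern
cols <- grep("^P_", names(M), value = TRUE)
res_subset <- M[, lapply(.SD, sum), by = .(Crs, gr), .SDcols = cols]

print(res)
```

Groups appear in order of first occurrence, which matches the row order the question asks for.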

The package plyr has some magic for situations just like this. Use a combination of ddply and numcolwise, like this:
library(plyr)
ddply(dat, .(Crs, gr), numcolwise(sum))
results in:
Crs gr P_7 P_8
1 38 1 24 89
2 40 1 4 3
3 40 2 25 40


For loop to iterate through columns in data.table [duplicate]

I am trying to write a "for" loop that iterates through each column in a data.table and prints a frequency table. However, I keep getting an error:
library(datasets)
data(cars)
cars <- as.data.table(cars)
for (i in names(cars)){
print(table(cars[,i]))
}
Error in `[.data.table`(cars, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.
When I use each column individually like below, I do not have any problem:
> table(cars[,dist])
2 4 10 14 16 17 18 20 22 24 26 28 32 34 36 40 42 46 48 50 52 54 56 60 64 66
1 1 2 1 1 1 1 2 1 1 4 2 3 3 2 2 1 2 1 1 1 2 2 1 1 1
68 70 76 80 84 85 92 93 120
1 1 1 1 1 1 1 1 1
My data is quite large (8921483x52), that is why I want to use the "for" loop and run everything at once then look at the result.
I included the cars dataset (which is easier to run) to demonstrate my code.
If I convert the dataset to a data.frame, the "for" loop runs without problems. But I want to know why it does not work with data.table, because I am learning it and I believe it works better with large datasets.
If by chance someone has seen a post that already answers this, please let me know; I have been searching for several hours.
Some solution found here
My personal preference is the apply function, though:
library(datasets)
library(data.table)
data(cars)
cars <- as.data.table(cars)
# Tabulate each column (margin 2) of cars
apply(cars, 2, table)
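To make your loop work, look the column up by the name stored in i, for example with cars[[i]] (cars[, get(i)] also works):
library(datasets)
library(data.table)
data(cars)
cars <- as.data.table(cars)
for (i in names(cars)) {
  # cars[[i]] extracts the column whose name is stored in i as a plain vector
  print(table(cars[[i]]))
}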
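The error arises because data.table evaluates the second argument (j) as an expression among the table's columns, so a bare `i` is looked up as a column called "i". The sketch below, assuming the data.table package is installed, shows the `..` prefix that the error message itself suggests, plus a loop-free alternative with `lapply`. The names `cars_dt`, `col`, `one_col` and `freq` are illustrative only.

```r
library(data.table)

cars_dt <- as.data.table(datasets::cars)

# The .. prefix tells data.table to look `col` up in the calling scope,
# so this selects the column whose name is stored in the variable
col <- "dist"
one_col <- cars_dt[, ..col]        # a one-column data.table
dist_table <- table(one_col[[1]])  # frequency table of that column

# A data.table is also a list of columns, so lapply visits each column in turn
freq <- lapply(cars_dt, table)

print(freq$speed)
```

On a table with millions of rows, `lapply` over the columns avoids the matrix conversion that `apply` performs.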

Frequency distribution using binCounts

I have a dataset of customer ages and I want to make a frequency distribution with bins 9 years wide.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table shared below; variable names can differ (as you wish).
Could I use binCounts for this? If so, could you help me with the code, as I am not sure what bx and idxs mean in this call?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know binCounts or even the package it is in, but here is a base R solution:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit <- c(37, 46, 55, 64, 73, 82, 91, 100)
Labels <- paste(head(lowerlimit, -1) + 1, lowerlimit[-1], sep = "-")  # add one to get 38, 47, etc.
group <- cut(Ages, lowerlimit, Labels)  # determine which group each age belongs to
tab <- table(group)  # form a frequency table
as.data.frame(tab)  # transform the table into a data frame
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
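On the binCounts part of the question: a hedged sketch, assuming binCounts comes from the matrixStats package, where `bx` is the ordered vector of bin boundaries (one more boundary than bins) and `idxs` optionally restricts the count to a subset of `x`. With `right = FALSE` the bins are left-closed, so boundaries 38, 47, ..., 101 give the bins 38-46, ..., 92-100. The base R lines compute the same left-closed counts without the package; the matrixStats call only runs if that package is available.

```r
Ages <- c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
          70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
          84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)

# Bin boundaries: 38, 47, 56, 65, 74, 83, 92, 101 (8 boundaries, 7 bins)
bx <- seq(38, 101, by = 9)

# Base R: findInterval() assigns each age to the left-closed bin [bx[i], bx[i+1])
counts <- tabulate(findInterval(Ages, bx), nbins = length(bx) - 1)

# Labels such as "38-46": lower boundary to (upper boundary - 1)
result <- data.frame(
  Age   = paste(head(bx, -1), bx[-1] - 1, sep = "-"),
  Count = counts
)
print(result)

# Assumption: matrixStats is installed; skipped otherwise
if (requireNamespace("matrixStats", quietly = TRUE)) {
  counts_ms <- matrixStats::binCounts(Ages, bx = bx, right = FALSE)
  print(counts_ms)
}
```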

aggregate over multiple columns [duplicate]

Hey, I have some data that looks like this:
ExpNum Compound Peak Tau SS
1 a 100 30 50
2 a 145 23 45
3 b 78 45 56
4 b 45 43 23
5 c 344 23 56
I'd like to find the mean for each Compound name.
What I have:
Norm_Table$Norm_Peak = (aggregate(data[[3]],by=list(Compound),FUN=normalization))
This works, but I repeat this code three times, changing only the data[[x]] index. Would lapply work here, or a for loop?
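A dplyr solution (summarize_each() and funs() are deprecated in current dplyr, so this uses across(), available since dplyr 1.0.0):
library(dplyr)
data %>%
  group_by(Compound) %>%
  summarize(across(-ExpNum, mean))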
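For comparison, base R's `aggregate` can also handle all measurement columns in a single call through its formula interface, so neither `lapply` nor a loop is needed. A small sketch on the sample data from the question; the name `dat` is an illustrative stand-in for the asker's `data`, and `mean` stands in for the custom `normalization` function, which the post does not show.

```r
# Sample data from the question
dat <- data.frame(
  ExpNum   = 1:5,
  Compound = c("a", "a", "b", "b", "c"),
  Peak     = c(100, 145, 78, 45, 344),
  Tau      = c(30, 23, 45, 43, 23),
  SS       = c(50, 45, 56, 23, 56)
)

# cbind(...) on the left-hand side lists the columns to summarise;
# the right-hand side names the grouping column
means <- aggregate(cbind(Peak, Tau, SS) ~ Compound, data = dat, FUN = mean)
print(means)
```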

How to make data in a single column (long) with multiple, nested group categories wide

I've got a mess of data and am trying to efficiently wrangle it into shape. Here's a simplified short sample of the general format of my data.frame right now. The main difference is that I have a few more data labels like Label1 for my sampling units - each has a set of data similar to the data.frame I'm including but in my situation they are all in the same data.frame. I don't think that will complicate the reformatting so I've just included the single sampling unit of mock data here. StatsType levels Ave, Max, and Min are effectively nested within MeasureType.
tastycheez<-data.frame(
Day=rep((1:3),9),
StatsType=rep(c(rep("Ave",3),rep("Max",3),rep("Min",3)),3),
MeasureType=rep(c("Temp","H2O","Tastiness"),each=9),
Data_values=1:27,
Label1=rep("SamplingU1",27))
Ultimately, I would like a data frame where for each sampling unit and each Day there are columns holding the Data_values for my categories, like this:
Day Label1 Ave.Temp Ave.H2O Ave.Tastiness Max.Temp ...
1 SamplingU1 1 10 19 4 ...
2 SamplingU1 2 11 20 5 ...
I think some combination of functions from reshape,dplyr,tidyr, and/or data.table could do the job but I can't figure out how to code it. Here's what I've tried:
First, I spread the tastycheez (yum!), and that got me partway:
test<-spread(tastycheez,StatsType,Data_values)
Now I'm trying to spread it again or to cast, but with no luck:
test2<-spread(test,MeasureType,(Ave,Max,Min))
test2 <- recast(Day ~ MeasureType+c(Ave,Max,Min), data=test)
(I also tried melting the tastycheez but the results were a sticky, gooey mess and my tongue got burnt. that doesn't seem to be the right function for this.)
If you hate my puns please excuse them, I really can't figure this out!
Here are a couple related questions:
Combining two subgroups of data in the same dataframe
How can I spread repeated measures of multiple variables into wide format?
reshape2: you could use dcast from reshape2:
library(reshape2)
dcast(tastycheez,
Day + Label1 ~ paste(StatsType, MeasureType, sep="."),
value.var = "Data_values")
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
tidyr: borrowing @DavidArenburg's comment, here's the tidyr way:
library(tidyr)
tastycheez %>%
unite(temp, StatsType, MeasureType, sep = ".") %>%
spread(temp, Data_values)
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
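Both answers rest on the same idea: merge the two nested category columns into one key, then spread on that key. As a dependency-free illustration of that idea, here is a sketch using base R's `reshape`; the intermediate names `long`, `wide` and `key` are assumptions made for this example. Note that the columns come out in order of first appearance rather than alphabetically.

```r
tastycheez <- data.frame(
  Day = rep(1:3, 9),
  StatsType = rep(c(rep("Ave", 3), rep("Max", 3), rep("Min", 3)), 3),
  MeasureType = rep(c("Temp", "H2O", "Tastiness"), each = 9),
  Data_values = 1:27,
  Label1 = rep("SamplingU1", 27)
)

# Step 1: combine the nested categories into a single key such as "Ave.Temp"
long <- transform(tastycheez, key = paste(StatsType, MeasureType, sep = "."))
long <- long[, c("Day", "Label1", "key", "Data_values")]

# Step 2: one row per Day/Label1, one column per key
wide <- reshape(long, idvar = c("Day", "Label1"), timevar = "key",
                direction = "wide")

# reshape() prefixes the new columns with "Data_values."; strip that prefix
names(wide) <- sub("^Data_values\\.", "", names(wide))
print(wide)
```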

How to update values with dplyr

I'm currently trying to update values in a data.frame using dplyr, but I don't know whether it is possible to replace only a subset of values.
# the net4 table
head(net4)
Source: local data frame [6 x 4]
temps2 NNET NET ave
1 18 2 4 36
2 18 2 4 36
3 22 2 4 44
4 18 2 4 36
5 22 2 4 44
6 27 3 4 36
# I would like to do the same command line as below:
subs <- (net4$ave < 10 & net4$ave!=net4$temps2)
net4$ave[subs] <- with(net4[subs,], temps2/NNET*NET)
Thanks
Use mutate and ifelse:
mutate(net4,
  ave = ifelse(ave < 10 & ave != temps2, temps2 / NNET * NET, ave)
)
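To see that the `ifelse` form matches the subset assignment from the question, here is a small base R check. The three-row `net4` below is made-up illustrative data (the rows shown in the question never satisfy `ave < 10`), and base `transform` stands in for `mutate` so the sketch runs without dplyr.

```r
# Hypothetical data: rows 2 and 3 satisfy ave < 10 & ave != temps2
net4 <- data.frame(
  temps2 = c(18, 22, 27),
  NNET   = c(2, 2, 3),
  NET    = c(4, 4, 4),
  ave    = c(36, 5, 9)
)

# Approach from the question: logical index plus subset assignment
a <- net4
subs <- a$ave < 10 & a$ave != a$temps2
a$ave[subs] <- with(a[subs, ], temps2 / NNET * NET)

# ifelse approach: keep ave where the condition is FALSE, recompute it otherwise
b <- transform(net4,
               ave = ifelse(ave < 10 & ave != temps2, temps2 / NNET * NET, ave))

print(b)
```

Both approaches leave rows that fail the condition untouched; `ifelse` simply returns the old `ave` for them.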
