This question already has answers here:
Calculate cumulative sum (cumsum) by group
(5 answers)
Closed 5 years ago.
I was just wondering how I can go about calculating a cumulative sum with conditions (i.e. by group) on a matrix. Here's what I mean: let's say we have a matrix with a column called ID and a column called VALUE, as follows:
ID | VALUE
------------------------------
2 | 50
7 | 19
5 | 32
2 | 21
8 | 56
7 | 5
7 | 12
2 | 16
5 | 42
I wish to compute the cumulative sum on this matrix based on the ID column. This means the cumulative sum column (or vector) would look like:
ID | CUMULATIVE SUM
----------------------------------
2 | 50
7 | 19
5 | 32
2 | 71
8 | 56
7 | 24
7 | 36
2 | 87
5 | 74
Is there a way to do this? A search hasn't turned up much (I've found answers for data frames/data tables, but nothing on 'conditions' with matrices), so any help would be appreciated.
There are a number of ways to do this; here I use data.table. I edited your data slightly to use a , as the separator and dropped the dashed separator row:
R> suppressMessages(library(data.table))
R> dat <- fread(" ID , VALUE
2 , 50
7 , 19
5 , 32
2 , 21
8 , 56
7 , 5
7 , 12
2 , 16
5 , 42")
R> dat[, cumsum(VALUE), by=ID]
ID V1
1: 2 50
2: 2 71
3: 2 87
4: 7 19
5: 7 24
6: 7 36
7: 5 32
8: 5 74
9: 8 56
R>
After that, it is a standard group-by (which you can do in many different ways) followed by a cumulative sum within each group.
The reordering you see is a side effect of the grouping: rows are collected group by group. If you need to keep the original row order, you can.
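For example, assigning with data.table's := keeps the original row order, and base R's ave() gives the same result and also works directly on a matrix column. A minimal sketch, continuing from the dat above (the column name CUMSUM and the matrix m are my own illustration, not from the question):

dat[, CUMSUM := cumsum(VALUE), by = ID]   # adds a column; row order unchanged
dat

ave(dat$VALUE, dat$ID, FUN = cumsum)      # base R: cumulative sums returned in the original order

# on a numeric matrix m with columns "ID" and "VALUE":
# cbind(m, CUMSUM = ave(m[, "VALUE"], m[, "ID"], FUN = cumsum))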
This question already has an answer here:
The rules of subsetting
(1 answer)
Closed 8 years ago.
Good day
I have a data set that I read in from a txt file:
> MyData
Xdat Ydat
1 1 12
2 2 23
3 3 34
4 4 45
5 5 56
6 6 67
7 7 78
I need to extract the rows of this set where the 2nd column (Ydat) is greater than 40.
Resulting in
MyData2
Xdat Ydat
4 4 45
5 5 56
6 6 67
7 7 78
Simple subsetting will do it:
MyData[which(MyData[,2]>40),]
As @DavidArenburg points out, this works fine:
MyData[(MyData[,2]>40),]
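For completeness, a base R subset() call gives the same rows with slightly more readable syntax (just another spelling of the same filter, not a different method):

MyData2 <- subset(MyData, Ydat > 40)   # keep rows where Ydat exceeds 40
MyData2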
For example, I have a variable containing the ages of different people:
ed <- c(51,51,44,25,49,52,21,45,29,46,34,33,30,28,50,21,25,21,22,51,25,52,39,53,52,23,20,23,34)
but it turns out I want to summarize this large vector into seven rows:
---------------
scale | h
---------------
20 - 25 | 9
26 - 30 | 3
31 - 35 | 3
36 - 40 | 1
41 - 45 | 2
46 - 50 | 3
51 - 55 | 7
Are there any libraries that make it easy to create such scales (bins)?
I have tried doing it with conditionals and it is very tedious.
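In case it helps, base R can do this without any extra library; a minimal sketch using cut() and table() (the break points below are my reading of the table above, so adjust them if your bin edges differ):

ed <- c(51,51,44,25,49,52,21,45,29,46,34,33,30,28,50,21,25,21,22,
        51,25,52,39,53,52,23,20,23,34)

# bin the ages into 5-year intervals and count how many fall into each
h <- table(cut(ed,
               breaks = c(19, 25, 30, 35, 40, 45, 50, 55),
               labels = c("20-25", "26-30", "31-35", "36-40",
                          "41-45", "46-50", "51-55")))
h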
I have the following vector:
x <- c(54.11, 58.09, 60.82, 86.59, 89.92, 91.61,
95.03, 95.03, 96.77, 98.52, 100.29, 102.07,
102.07, 107.51, 113.10, 130.70, 130.70, 138.93,
147.41, 149.57, 153.94, 158.37, 165.13, 201.06,
       208.67, 235.06, 240.53, 251.65, 254.47, 254.47, 333.29)
I want to get the following stem and leaf plot in R:
Stem Leaf
5 4 8
6 0
8 6 9
9 1 5 5 6 8
10 0 2 2 7
11 3
13 0 0 8
14 7 9
15 3 8
16 5
20 1 8
23 5
24 0
25 1 4 4
33 3
However, when I try the stem() function in R, I get the following:
> stem(x)
The decimal point is 2 digit(s) to the right of the |
0 | 566999
1 | 000000011334
1 | 55567
2 | 0144
2 | 555
3 | 3
> stem(x, scale = 2)
The decimal point is 1 digit(s) to the right of the |
4 | 48
6 | 1
8 | 7025579
10 | 02283
12 | 119
14 | 7048
16 | 5
18 |
20 | 19
22 | 5
24 | 1244
26 |
28 |
30 |
32 | 3
Question: Am I missing an argument in the stem() function? If not, is there another solution?
I believe what you want is a little non-standard: a stem-and-leaf plot should have equally spaced stems down its left side, and you're asking for irregularly spaced ones. I understand your frustration that 54 and 58 are grouped within the 40s, but the stem-and-leaf plot is really just a textual representation of a horizontal histogram, and the numbers on the side reflect the "bins", which will often begin/end outside of the known data. Think of the stem(x, scale=2) left-hand numbers as 40-59, 60-79, etc.
You probably already tried this, but
stem(x, scale=3)
# The decimal point is 1 digit(s) to the right of the |
# 5 | 48
# 6 | 1
# 7 |
# 8 | 7
# 9 | 025579
# 10 | 0228
# 11 | 3
# 12 |
# 13 | 119
# 14 | 7
# 15 | 048
# 16 | 5
# 17 |
# 18 |
# 19 |
# 20 | 19
# 21 |
# 22 |
# 23 | 5
# 24 | 1
# 25 | 244
# 26 |
# 27 |
# 28 |
# 29 |
# 30 |
# 31 |
# 32 |
# 33 | 3
This is a good start, and is "proper" in that the bins are equally sized.
If you must remove the empty rows (which to me are still statistically meaningful, relevant, and informative), note that stem() prints straight to the console by default. You will therefore need to capture the console output (this might cause problems in rmarkdown documents), filter out the empty rows, and re-cat the rest to the console.
cat(Filter(function(s) grepl("decimal|\\|.*[0-9]", s),
capture.output(stem(x, scale=3))),
sep="\n")
# The decimal point is 1 digit(s) to the right of the |
# 5 | 48
# 6 | 1
# 8 | 7
# 9 | 025579
# 10 | 0228
# 11 | 3
# 13 | 119
# 14 | 7
# 15 | 048
# 16 | 5
# 20 | 19
# 23 | 5
# 24 | 1
# 25 | 244
# 33 | 3
(My grepl regex could likely be improved to handle something akin to "if there is a pipe, then it must be followed by one or more digits", but I think this suffices for now.)
There are some discrepancies, in that you want 6 | 0, but your 60.82 rounds to 61 (ergo the "1"). If you really want 60.82 to appear as 6 | 0, truncate the values first with stem(trunc(x), scale=3). It's not exact, but I'm guessing that's because your sample output was assembled by hand.
This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 7 years ago.
Suppose I have a data frame with 2 variables which I'm trying to run some basic summary stats on. I would like to run a loop to give me the difference between minimum and maximum seconds values for each unique value of number. My actual data frame is huge and contains many values for 'number' so subsetting and running individually is not a realistic option. Data looks like this:
df <- data.frame(number=c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4,5,5,5,5),
seconds=c(1,4,8,1,5,11,23,1,8,1,9,11,24,44,112,1,34,55,109))
number seconds
1 1 1
2 1 4
3 1 8
4 2 1
5 2 5
6 2 11
7 2 23
8 3 1
9 3 8
10 4 1
11 4 9
12 4 11
13 4 24
14 4 44
15 4 112
16 5 1
17 5 34
18 5 55
19 5 109
My current code only returns the difference between the minimum and maximum seconds for the entire data frame:
ZZ <- unique(df$number)
for (i in ZZ){
Y <- max(df$seconds) - min(df$seconds)
}
Since you have a lot of data, performance should matter, so you should use a data.table instead of a data.frame:
library(data.table)
dt <- as.data.table(df)
dt[, .(spread = (max(seconds) - min(seconds))), by=.(number)]
number spread
1: 1 7
2: 2 22
3: 3 7
4: 4 111
5: 5 108
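If you would rather stay in base R (the duplicate link above covers the *apply family), the same grouped spread can be computed with aggregate() or tapply(); a small sketch, assuming the df defined above:

# one row per number, spread = max - min of seconds
aggregate(seconds ~ number, data = df,
          FUN = function(s) max(s) - min(s))

# or a named vector, using diff(range(.)) for the same max - min
tapply(df$seconds, df$number, function(s) diff(range(s)))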