R row summing to a new column [duplicate]

I'm new to R and this is probably a very simple question; I just haven't been able to make rowsum/apply work.
My task is to add up all the different expense categories in my data frame and create a new variable holding this total, going from this (not the original data):
Food Dress Car
235 564 532
452 632 719
... ... ...
and then
Food Dress Car Total
235 564 532 1331
452 632 719 1803
... ... ... ...
I have tried rowsum, apply and aggregate and can't get any of them right.

You can use addmargins after converting to a matrix:
addmargins(as.matrix(df1),2)
# Food Dress Car Sum
#[1,] 235 564 532 1331
#[2,] 452 632 719 1803
Or use rowSums
df1$Total <- rowSums(df1)
Or with Reduce
df1$Total <- Reduce(`+`, df1)
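Note that rowSums(df1) and the Reduce version assume every column of df1 is numeric. If your real data frame also carries non-expense columns (a hypothetical id column, say), a minimal sketch would sum only the expense columns:
# sum only the named expense columns; any extra id column is left out
df1$Total <- rowSums(df1[, c("Food", "Dress", "Car")])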

With apply functions:
cbind(dat, Total = apply(dat, 1, sum))
Food Dress Car Total
1 235 564 532 1331
2 452 632 719 1803
or simply with rowSums:
cbind(dat, Total = rowSums(dat))
Food Dress Car Total
1 235 564 532 1331
2 452 632 719 1803

Related

Remove row with specific value

I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data I use data[row, column]. But when I want only the rows in which Quarter equals (for example) 4, I use data[Quarter == 4, ]. Why is it not data[, Quarter == 4], since Quarter is a column and not a row?
I did not find an answer on the internet that really explains the why.
Thank you.
You have used data.table to create your data, so you can write:
sales[Customer != 891,]
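Note that this returns a new data.table rather than modifying sales in place, so to keep the result you would assign it back:
sales <- sales[Customer != 891]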
data[Quarter == 4, ] (note the ==, not =) returns all columns for the rows where Quarter is equal to 4. The comma is what makes this a row selection rather than a column selection.
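Applied to your data, a minimal example of that row filter:
sales[Quarter == 4, ]          # data.table syntax
sales[sales$Quarter == 4, ]    # data.frame equivalent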
When you use indexing, i.e. data[row, column], you are telling R to look for a specific row or column index.
Rows: sales[sales$Customer %in% c(192,964), ] translates to "search the column Customer for rows whose values are 192 or 964 and isolate them". Note that data.table allows sales[Customer %in% c(192, 964), ], but data frames don't (use sales[sales$Customer %in% c(192,964), ]).
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns: sales[, "Customer"] translates to "find the column named Customer and isolate all of its rows":
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine the two: sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] isolates all values of Customer and Producttype that fall in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...

Automize portfolios volatilities computation in R

Thanks for reading my post. I have a series of portfolios created from combinations of several stocks, and I need to compute the volatility of those portfolios using the historical daily performance of each stock. Since I have all the combinations in one data frame (called Final_output) and all the stock returns in another (called perf, where the columns are stocks and the rows are days), I don't know what the most efficient way to automate the process would be. Below you can find an extract:
> Final_output
ISIN_1 ISIN_2 ISIN_3 ISIN_4
2 CH0595726594 CH1111679010 XS1994697115 CH0587331973
3 CH0595726594 CH1111679010 XS1994697115 XS2027888150
4 CH0595726594 CH1111679010 XS1994697115 XS2043119358
5 CH0595726594 CH1111679010 XS1994697115 XS2011503617
6 CH0595726594 CH1111679010 XS1994697115 CH1107638921
7 CH0595726594 CH1111679010 XS1994697115 XS2058783270
8 CH0595726594 CH1111679010 XS1994697115 JE00BGBBPB95
> perf
CH0595726594 CH1111679010 XS1994697115 CH0587331973
626 0.0055616769 -0.0023656130 1.363791e-03 1.215922e-03
627 0.0086094443 0.0060037334 0.000000e+00 2.519220e-03
628 0.0053802380 0.0009027081 0.000000e+00 7.508635e-04
629 -0.0025213543 -0.0022046297 4.864050e-05 1.800720e-04
630 0.0192416817 0.0093401627 -6.079767e-03 3.800836e-03
631 -0.0101224820 0.0051741294 6.116956e-03 -1.345184e-03
632 -0.0013293793 -0.0100475153 -4.494163e-03 -1.746106e-03
633 0.0036350604 0.0012999350 3.801130e-03 -5.997121e-05
634 0.0030097434 -0.0011484496 -1.187614e-03 -2.069131e-03
635 0.0002034381 0.0030493901 -1.851762e-03 -3.806280e-04
636 -0.0035594427 0.0167455769 -2.148123e-04 -4.709560e-04
637 0.0007654623 -0.0051958237 -3.711191e-04 1.604010e-04
638 0.0107592678 -0.0016260163 4.298764e-04 3.397951e-03
639 0.0050953486 -0.0007403020 2.011738e-03 8.790770e-04
640 0.0008532851 -0.0071121648 -9.746114e-04 5.389598e-04
641 -0.0068204614 0.0133810874 -9.755622e-05 -1.346674e-03
642 0.0091395678 0.0102591793 1.717157e-03 -1.977785e-03
643 0.0027520640 -0.0157912638 1.256440e-03 -1.301119e-04
644 -0.0048902196 0.0039494471 -1.624514e-03 -3.373340e-03
645 -0.0116838833 0.0062450826 6.625549e-04 1.205255e-03
646 0.0004566442 -0.0018570102 -3.456636e-03 4.474138e-03
647 0.0041586368 0.0085679315 4.435933e-03 1.957455e-03
648 0.0007575758 0.0002912621 0.000000e+00 2.053306e-03
649 0.0046429473 -0.0138309230 -4.435798e-03 1.541798e-03
650 0.0049731250 -0.0488164953 4.181975e-03 -9.733133e-04
651 0.0008497451 -0.0033110870 2.724477e-04 -7.555498e-04
652 0.0004494831 0.0049831300 -8.657588e-04 -1.790813e-04
653 -0.0058905751 0.0020143588 8.178287e-04 -1.213991e-03
654 0.0000000000 0.0167525773 4.864050e-05 9.365068e-04
655 0.0010043186 0.0048162231 0.000000e+00 -2.110146e-03
656 -0.0024079462 -0.0100403633 -2.431907e-03 -9.176600e-04
657 -0.0095544604 -0.0193670047 0.000000e+00 -8.935435e-03
658 0.0008123477 0.0114339172 2.437835e-03 5.530483e-03
659 0.0022828734 -0.0015415446 -3.239300e-03 2.765060e-03
660 0.0049096523 -0.0001029283 3.199079e-02 2.327835e-03
661 -0.0027702226 -0.0357198003 9.456712e-04 3.189602e-04
662 -0.0008081216 -0.0139311449 -2.891020e-02 -1.295363e-03
663 -0.0033867462 0.0068745264 -2.529552e-03 -1.496588e-04
664 -0.0015216068 -0.0558572120 -3.023653e-03 -7.992975e-03
665 0.0052829422 0.0181072771 4.304652e-03 -3.319519e-03
666 0.0084386054 0.0448545861 -8.182748e-04 4.279284e-03
667 -0.0076664829 -0.0059415480 -2.047362e-04 6.059936e-03
668 -0.0062108665 -0.0039847073 7.313506e-04 5.993467e-04
669 -0.0053350948 0.0068119154 -1.042631e-02 -2.056524e-03
670 -0.0263588067 0.0245395479 -2.188962e-02 -6.732491e-03
671 -0.0021511018 0.0220649895 1.412435e-02 1.702085e-03
672 0.0205058100 -0.0007179119 3.057527e-03 -1.002423e-02
673 0.0096862280 -0.0194488633 1.207407e-03 -1.553899e-03
674 0.0007143951 -0.0068557672 6.227450e-03 1.790274e-03
675 -0.0021926470 -0.0051114507 -6.267498e-03 -1.035691e-03
676 0.0076655765 -0.0139300847 6.583825e-03 3.059472e-03
677 -0.0032457653 0.0180480206 -4.635495e-03 1.064002e-03
678 0.0036633764 0.0060676410 -2.762676e-04 5.364970e-04
679 -0.0008111122 -0.0013635410 -1.065898e-03 1.214059e-03
680 0.0050228311 0.0055141267 3.003507e-03 1.121643e-03
681 -0.0007067495 0.0147281558 -2.699002e-03 -1.514035e-04
682 -0.0024248548 0.0002573473 -2.113685e-03 -1.423409e-03
683 -0.0002025624 0.0138417207 -4.374895e-03 1.415328e-04
684 -0.0141822418 -0.0169517332 -3.578920e-03 -1.799234e-03
685 -0.0005651749 -0.0259693324 -5.926428e-03 -3.635333e-03
686 0.0004112688 0.0133043570 -1.545642e-03 1.981828e-03
687 -0.0150565262 -0.0107757493 -1.717916e-02 -1.328749e-02
688 0.0039129754 -0.0441013167 -8.376631e-03 -5.653841e-04
689 0.0019748467 0.0115063340 -2.835394e-02 7.868428e-03
690 0.0072614108 0.0358764014 3.586897e-02 7.960077e-03
691 -0.0003604531 0.0106119001 1.024769e-04 -2.733651e-04
What I need to do is look up each portfolio (each row of Final_output is a portfolio, i.e. a 4-stock portfolio) in perf and compute the volatility (standard deviation) of that portfolio using the stocks' historical daily performance over the last three months. (Of course, here I have pasted only 4 stocks' performances for simplicity.) Once done for the first row, I need to do the same for all the other rows (portfolios).
Below is the formula I used for computing the volatility:
# formula for computing the volatility
sqrt(t(weights) %*% covariance_matrix %*% weights)
# where covariance_matrix is
cov(portfolio_component_monthly_returns)
# all the portfolios are equiponderated (equally weighted)
weights <- c(0.25, 0.25, 0.25, 0.25)
Since yesterday I have been trying to automate the process for all the rows; indeed I have more than 10,000 of them. I'm new to R, and even after experimenting and searching the web I have no results and no idea how to automate it. Would someone have a clue how to do it?
I hope to have been as clear as possible; if not, do not hesitate to ask.
Many thanks
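A minimal sketch of one way to automate this, assuming (as in the extract) that the columns of perf are named by ISIN and each row of Final_output lists the four ISINs of one portfolio; the three-month window is whatever subset of perf's rows you pass in, and the column name volatility is just illustrative:
# volatility of one equally weighted portfolio, given its four ISINs
portfolio_vol <- function(isins, perf, weights) {
  returns <- as.matrix(perf[, isins])   # daily returns of those stocks
  covariance_matrix <- cov(returns)     # sample covariance matrix
  sqrt(t(weights) %*% covariance_matrix %*% weights)[1, 1]
}

weights <- rep(0.25, 4)  # equally weighted, as in the question

# apply() passes each row of Final_output as a character vector of ISINs
Final_output$volatility <- apply(Final_output, 1, portfolio_vol,
                                 perf = perf, weights = weights)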

Row wise operation on data.table

Let's say I'd like to calculate the magnitude of the range over a few columns, on a row-by-row basis.
set.seed(1)
dat <- data.frame(x=sample(1:1000,1000),
y=sample(1:1000,1000),
z=sample(1:1000,1000))
Using data.frame(), I would do something like this:
dat$diff_range <- apply(dat,1,function(x) diff(range(x)))
To put it more simply, I'm looking for this operation, over each row:
diff(range(dat[i, ]))  # for i in 1:nrow(dat)
If I were doing this for the entire table, it would be something like:
setDT(dat)[,diff_range := apply(dat,1,function(x) diff(range(x)))]
But how would I do it for only named (or numbered) rows?
pmax and pmin find the parallel max and min across columns in a vectorized way, which is much better than splitting the data and working with each row separately. It's also pretty concise:
dat[, r := do.call(pmax,.SD) - do.call(pmin,.SD)]
x y z r
1: 266 531 872 606
2: 372 685 967 595
3: 572 383 866 483
4: 906 953 437 516
5: 201 118 192 83
---
996: 768 945 292 653
997: 61 231 965 904
998: 771 145 18 753
999: 841 148 839 693
1000: 857 252 218 639
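Note that := requires a data.table, so this assumes setDT(dat) has been run, as in the question. To answer the named/numbered-rows part, you can restrict the update with i; a sketch:
library(data.table)
setDT(dat)
# fill r only for rows 1 to 10; all other rows get NA
dat[1:10, r := do.call(pmax, .SD) - do.call(pmin, .SD), .SDcols = c("x", "y", "z")]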
How about this (where D is the question's dat as a data.table, e.g. D <- as.data.table(dat)):
D[,list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I][c(1:4,15:18)]
# I V1
#1: 1 971
#2: 2 877
#3: 3 988
#4: 4 241
#5: 15 622
#6: 16 684
#7: 17 971
#8: 18 835
#actually this will be faster
D[c(1:4,15:18),list(I=.I,x,y,z)][,diff(range(x,y,z)),by=I]
Use .I to give you an index to call with the by= parameter; then you can run the function on each row. The second call pre-filters by any list of row numbers, or you can add a key and filter on that if your real table looks different.
You can do it by subsetting before/during the function. If you only want every second row, for example:
dat_Diffs <- apply(dat[seq(2,1000,by=2),],1,function(x) diff(range(x)))
Or for rownames 1:10 (since no names were specified, they are just the row numbers counting up):
dat_Diffs <- apply(dat[rownames(dat) %in% 1:10,],1,function(x) diff(range(x)))
But why not just calculate per row then subset later?
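For example, a sketch of that calculate-then-subset approach, assuming dat is still the plain data.frame from the question:
dat$diff_range <- apply(dat, 1, function(x) diff(range(x)))  # all rows
dat[rownames(dat) %in% 1:10, "diff_range"]                   # then pick rows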

R How to remove duplicates from a list of lists

I have a list of lists that contain the following 2 variables:
> dist_sub[[1]]$zip
[1] 901 902 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
[26] 929 930 931 933 934 935 936 937 938 939 940 955 961 962 963 965 966 968 969 970 975 981
> dist_sub[[1]]$hu
[1] 4990 NA 168 13224 NA 3805 NA 6096 3884 4065 NA 16538 NA 12348 10850 NA
[17] 9322 17728 NA 13969 24971 5413 47317 7893 NA NA NA NA NA 140 NA 4
[33] NA NA NA NA NA 13394 8939 NA 3848 7894 2228 17775 NA NA NA
> dist_sub[[2]]$zip
[1] 921 934 952 956 957 958 959 960 961 962 965 966 968 969 970 971
> dist_sub[[2]]$hu
[1] 17728 140 4169 32550 18275 NA 22445 0 13394 8939 3848 7894 2228 17775 NA 12895
Is there a way to remove duplicates such that if a zipcode appears in one list it is removed from the other lists according to a specific criterion?
Example: zipcode 00921 is present in the two lists above. I'd like to keep it only in the list with the lowest sum of hu (housing units). In this case I would keep zipcode 00921 in the 2nd list only, since the sum of hu is 162,280 in list 2 versus 256,803 in list 1.
Any help is very much appreciated.
Here is a simulated dataset for your problem so that others can use it too.
dist_sub <- list(list("zip" = 1:10,
                      "hu" = rnorm(10)),
                 list("zip" = 8:12,
                      "hu" = rnorm(5)),
                 list("zip" = c(1, 3, 11, 7),
                      "hu" = rnorm(4)))
Here's a solution that I was able to come up with. I realized that loops were really the cleaner way to do this:
do.this <- function(x) {
  for (k in 1:(length(x) - 1)) {
    for (l in (k + 1):length(x)) {
      # positions in set k whose zip also appears in set l
      to.remove <- which(x[[k]][["zip"]] %in% x[[l]][["zip"]])
      if (length(to.remove) > 0) {  # guard: x[-integer(0)] would drop everything
        x[[k]][["zip"]] <- x[[k]][["zip"]][-to.remove]
        x[[k]][["hu"]] <- x[[k]][["hu"]][-to.remove]
      }
    }
  }
  return(x)
}
The idea is really simple: for each set of zips we keep removing the elements that are repeated in any set after it. We do this until the penultimate set because the last set will be left with no repeats in anything before it.
Your criterion, i.e. keeping a zip in the list with the lowest sum of hu, can easily be implemented using the function above. What you need to do is reorder the list dist_sub by sum of hu, like so:
sum_hu <- sapply(dist_sub, function (k) sum(k[["hu"]], na.rm=TRUE))
dist_sub <- dist_sub[order(sum_hu, decreasing=TRUE)]
Now dist_sub is sorted by sum_hu, which means that for each set, the sets that come before it have a larger sum_hu. Therefore, if the sets at positions i and j (i < j) have a value a in common, a should be removed from the ith element, which is exactly what the function does. Does that make sense?
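Putting the two pieces together on the simulated data, a minimal usage sketch:
sum_hu <- sapply(dist_sub, function(k) sum(k[["hu"]], na.rm = TRUE))
dist_sub <- do.this(dist_sub[order(sum_hu, decreasing = TRUE)])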
PS: I've called the function do.this because I usually like writing generic solutions while this was a very specific question, albeit, an interesting one.

creating vector from 'if' function using apply in R

I'm trying to create a new vector in R using an if function to pull out only certain values for the new array. Basically, I want to segregate data by day of week for each of several cities. How do I use the apply function to get, say, only Tuesdays in a new array for each city? Thanks
It sounds as though you don't want if or apply at all. The solution is simpler:
Suppose that your data frame is data. Then subset(data, Weekday == 3) should work.
You don't want to use the R if. Instead use the subsetting function [
dat <- read.table(text=" Date Weekday Holiday Atlanta Chicago Houston Tulsa
1 1/1/2008 3 1 313 313 361 123
2 1/2/2008 4 0 735 979 986 310
3 1/3/2008 5 0 690 904 950 286
4 1/4/2008 6 0 610 734 822 281
5 1/5/2008 7 0 482 633 622 211
6 1/6/2008 1 0 349 421 402 109", header=TRUE)
dat[ dat$Weekday==3, ]
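If you then want this per city, the Weekday filter is row-wise and independent of the city columns, so you can pair it with column selection; a small sketch (assuming Weekday 3 codes Tuesday, as above):
tuesdays <- dat[dat$Weekday == 3, ]
tuesdays[, c("Date", "Atlanta")]   # just one city's counts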
