Using tapply and cumsum function for multiple vectors in R - r

I have a data frame with four columns.
country date pangolin_lineage n cum_country
1 Albania 2020-09-05 B.1.236 1 1
2 Algeria 2020-03-02 B.1 2 2
3 Algeria 2020-03-08 B.1 1 3
4 Algeria 2020-06-09 B.1.1.119 1 4
5 Algeria 2020-06-15 B.1 1 5
6 Algeria 2020-06-15 B.1.36 1 6
I wished to calculate the cumulative sum of n across country and date. I was able to do that with this code:
date_country$cum_country <- as.numeric(unlist(tapply(date_country$n, date_country$country, cumsum)))
I now, however, would like to do the same thing, but the cumulative sum across country, pangolin_lineage, and date. I have tried to add another vector into the above function, but it seems you can only input one index input and one vector input for tapply. I get this error:
date_country$cum_country_pangol <- as.numeric(unlist(tapply(date_country$n, date_country$country, date_country$pangolin_lineage, cumsum)))
Error in match.fun(FUN) :
'date_country$pangolin_lineage' is not a function, character or symbol
Does anyone have any ideas how to use cumsum in tapply across multiple vectors (country, pangolin_lineage, date)?

If there is more than one grouping variable, wrap them in a list. Note, though, that tapply is a summarising function, so with a function like cumsum it returns the result split up by group (one vector per country/lineage combination) rather than a single column.
tapply(date_country$n, list(date_country$country, date_country$pangolin_lineage), cumsum)
But this is much easier with ave: if we want to create a new column, we can avoid the hassle of unlist etc. by just using ave
ave(date_country$n, date_country$country,
date_country$pangolin_lineage, FUN = cumsum)
#[1] 1 2 3 1 4 1
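To make the ave approach concrete, here is a minimal sketch that rebuilds the six sample rows from the question and adds both running totals. It assumes the running total should follow date order within each group, so the rows are sorted first; that sorting step is an addition, not part of the original answer.

```r
# Sample rows copied from the question; date_country is the asker's data frame name.
date_country <- data.frame(
  country = c("Albania", "Algeria", "Algeria", "Algeria", "Algeria", "Algeria"),
  date = as.Date(c("2020-09-05", "2020-03-02", "2020-03-08",
                   "2020-06-09", "2020-06-15", "2020-06-15")),
  pangolin_lineage = c("B.1.236", "B.1", "B.1", "B.1.1.119", "B.1", "B.1.36"),
  n = c(1, 2, 1, 1, 1, 1)
)

# Assumption: sort by country and date so each running total follows time order.
date_country <- date_country[order(date_country$country, date_country$date), ]

# ave() returns a vector aligned with the rows of the data frame,
# so no unlist() is needed and the row order does not matter for correctness.
date_country$cum_country <- ave(date_country$n, date_country$country, FUN = cumsum)
date_country$cum_country_pangol <- ave(
  date_country$n,
  date_country$country, date_country$pangolin_lineage,
  FUN = cumsum
)

print(date_country)
```

Because ave writes each group's result back into the positions that group occupies, it also works when the rows of one country are not contiguous, which the tapply plus unlist version does not guarantee.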

Related

How to match third (or whatever) from the bottom in a rolling fashion in R?

Here is my example data frame with the expected output.
data.frame(index=c("3435pear","3435grape","3435apple","3435avocado","3435orange","3435kiwi","3436grapefruit","3436apple","3436banana","3436grape","3437apple","3437grape","3437avocado","3437orange","3438apple","3439apple","3440apple"),output=c("na","na","na","na","na","na","na","na","na","na","na","na","na","na","3435apple","3436apple","3437apple"))
index output
1 3435pear na
2 3435grape na
3 3435apple na
4 3435avocado na
5 3435orange na
6 3435kiwi na
7 3436grapefruit na
8 3436apple na
9 3436banana na
10 3436grape na
11 3437apple na
12 3437grape na
13 3437avocado na
14 3437orange na
15 3438apple 3435apple
16 3439apple 3436apple
17 3440apple 3437apple
I want to match the fruit that is third from the bottom as I go down the column. If there are not three previous fruits it should return NA. Once the 4th apple appears it matches the apple 3 before it, then the 5th apple appears it matches the one 3 before that one, and so on.
I was trying to use rollapply, match, and tail to make this work, but I don't know how to reference the current row for the matching. In excel I would use the large, if, and row functions to do this. Excel makes my computer grind for hours to calculate everything and I know R could do this in minutes(seconds?).
You can do this:
library(dplyr)
df %>%
mutate(fruit = gsub("[0-9]", "", index)) %>%
group_by(fruit) %>%
mutate(new_output = lag(index, 3)) %>%
ungroup() %>%
select(-fruit)
By each group of fruit, your new_output gives you the index value lagged by 3. I preserved the output column and saved my results in new_output so that you can compare.

How to add rows to dataframe R with rbind

I know this is a classic question and there are also similar ones in the archive, but I feel like the answers did not really apply to this case. Basically I want to take one dataframe (covid cases in Berlin per district), calculate the sum of the columns and create a new dataframe with a column representing the name of the district and another one representing the total number. So I wrote
covid_bln <- read.csv('https://www.berlin.de/lageso/gesundheit/infektionsepidemiologie-infektionsschutz/corona/tabelle-bezirke-gesamtuebersicht/index.php/index/all.csv?q=', sep=';')
c_tot<-data.frame('district'=c(), 'number'=c())
for (n in colnames(covid_bln[3:14])){
x<-data.frame('district'=c(n), 'number'=c(sum(covid_bln$n)))
c_tot<-rbind(c_tot, x)
next
}
print(c_tot)
This works properly for the names, but it returns the number of cases for the 8th district in every row instead of each district's own total. If you have any suggestions, even involving other functions, that would be great. Thank you.
Here's a base R solution:
number <- colSums(covid_bln[3:14])
district <- names(covid_bln[3:14])
c_tot <- cbind.data.frame(district, number)
# If you don't want rownames:
rownames(c_tot) <- NULL
This gives us:
district number
1 mitte 16030
2 friedrichshain_kreuzberg 10679
3 pankow 10849
4 charlottenburg_wilmersdorf 10664
5 spandau 9450
6 steglitz_zehlendorf 9218
7 tempelhof_schoeneberg 12624
8 neukoelln 14922
9 treptow_koepenick 6760
10 marzahn_hellersdorf 6960
11 lichtenberg 7601
12 reinickendorf 9752
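As for why the original loop repeated one district's total: covid_bln$n does not use the loop variable n. It looks for a column literally named "n", and since $ allows partial matching on data frames, it most likely resolved to neukoelln, the only column name starting with "n". Indexing with [[n]] uses the value of n instead. A minimal sketch, using a hypothetical three-district stand-in (covid_demo) rather than the real download:

```r
# covid_demo is a hypothetical stand-in for covid_bln with made-up numbers.
covid_demo <- data.frame(
  id = 1:3,
  mitte = c(1, 2, 3),
  pankow = c(10, 20, 30),
  neukoelln = c(100, 200, 300)
)

c_tot <- data.frame(district = character(), number = numeric())
for (n in colnames(covid_demo)[2:4]) {
  # covid_demo[[n]] selects the column whose name is stored in n.
  x <- data.frame(district = n, number = sum(covid_demo[[n]]))
  c_tot <- rbind(c_tot, x)
}
print(c_tot)
```

The loop now yields one total per district, although colSums as in the answer above is shorter and avoids growing a data frame row by row.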
Here is a solution using the tidyverse.
The final result is ordered alphabetically by district.
library(tidyverse)
c_tot <- covid_bln %>%
select( mitte:reinickendorf) %>%
gather(district, number, mitte:reinickendorf) %>%
group_by(district) %>%
summarise(number = sum(number))
The result is
# A tibble: 12 x 2
district number
* <chr> <int>
1 charlottenburg_wilmersdorf 10736
2 friedrichshain_kreuzberg 10698
3 lichtenberg 7644
4 marzahn_hellersdorf 7000
5 mitte 16064
6 neukoelln 14982
7 pankow 10885
8 reinickendorf 9784
9 spandau 9486
10 steglitz_zehlendorf 9236
11 tempelhof_schoeneberg 12656
12 treptow_koepenick 6788

Use dplyr to compute lagging difference

My data frame consists of three columns: state name, year, and the tax receipt for each year and each state. Below is an example for just one state.
year RealTaxRevs
1 1971 8335046
2 1972 9624026
3 1973 10498935
4 1974 10052305
5 1975 8708381
6 1976 8911262
7 1977 10759032
I'd like to compute the change in tax receipt from one year to the next, for each state. I used the following code:
data %>% group_by(state) %>% summarise(diff(RealTaxRevs, lag = 1, differences = 1))
but it gives me "Error: expecting a single value".
Could anyone explain this error message, and help me do this correctly using dplyr? Thank you.
If you want to use a diff-like function, consider using the zoo library as well. Then your code can look like the following:
library(zoo)
diff(as.zoo(1:4), na.pad=T)
In a data frame setting it would be like:
library(dplyr)
dat <- data.frame(a=c(8335046, 9624026, 10498935, 10052305, 8708381, 8911262, 10759032))
dat %>% mutate(b=diff(as.zoo(a), na.pad=T))
# a b
# 1 8335046 NA
# 2 9624026 1288980
# 3 10498935 874909
# 4 10052305 -446630
# 5 8708381 -1343924
# 6 8911262 202881
# 7 10759032 1847770
This way you can easily increase the number of lags without continually adding NA:
dat %>% mutate(b2=diff(as.zoo(a), lag=2, na.pad=T))
# a b2
# 1 8335046 NA
# 2 9624026 NA
# 3 10498935 2163889
# 4 10052305 428279
# 5 8708381 -1790554
# 6 8911262 -1141043
# 7 10759032 2050651
We can use data.table
library(data.table)
setDT(data)[, Diffs := RealTaxRevs - shift(RealTaxRevs), by = state]
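On the error itself: in the dplyr version the asker was using, summarise expects one value per group, while diff returns one value fewer than the number of rows. Since the goal is a new column with one value per row, mutate is the better fit. A minimal sketch in plain dplyr, using a hypothetical two-state sample (the state labels "A" and "B" and the second state's figures are made up for illustration):

```r
library(dplyr)

# Hypothetical sample; column names follow the question.
data <- data.frame(
  state = rep(c("A", "B"), each = 3),
  year = rep(1971:1973, times = 2),
  RealTaxRevs = c(8335046, 9624026, 10498935, 100, 150, 120)
)

result <- data %>%
  group_by(state) %>%
  arrange(year, .by_group = TRUE) %>%          # make sure years are in order within each state
  mutate(change = RealTaxRevs - dplyr::lag(RealTaxRevs)) %>%  # first year of each state is NA
  ungroup()

print(result)
```

Calling dplyr::lag explicitly avoids picking up stats::lag by accident, which behaves differently on plain vectors.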

How to select specific rows from a split list in R based on column condition

I am new to R and to programming in general and am looking for feedback on how to approach what is probably a fairly simple problem in R.
I have the following dataset:
df <- data.frame(county = rep(c("QU","AN","GY"), 3),
park = (c("Downtown","Queens", "Oakville","Squirreltown",
"Pinhurst", "GarbagePile","LottaTrees","BigHill",
"Jaynestown")),
hectares = c(12,42,6,18,92,6,4,52,12))
df<-transform(df, parkrank = ave(hectares, county,
FUN = function(x) rank(x, ties.method = "first")))
Which returns a dataframe looking like this:
county park hectares parkrank
1 QU Downtown 12 2
2 AN Queens 42 1
3 GY Oakville 6 1
4 QU Squirreltown 18 3
5 AN Pinhurst 92 3
6 GY GarbagePile 6 2
7 QU LottaTrees 4 1
8 AN BigHill 52 2
9 GY Jaynestown 12 3
I want to use this to create a two-column data frame that lists each county and the park name corresponding to a specific rank (e.g. if when I call my function I add "2" as a variable, shows the second biggest park in each county).
I am very new to R and programming and have spent hours looking over the built in R help files and similar questions here on stack overflow but I am clearly missing something. Can anyone give a simple example of where to begin? It seems like I should be using split then lapply or maybe tapply, but everything I try leaves me very confused :(
Thanks.
Try,
df2 <- function(A,x) {
# A is the name of the data.frame() and x is the rank No
df <- A[A[,4]==x,]
return(df)
}
> df2(df,2)
county park hectares parkrank
1 QU Downtown 12 2
6 GY GarbagePile 6 2
8 AN BigHill 52 2
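Since the question asks for a two-column result (county and park), a variation that selects columns by name may be closer to what is wanted. This is a sketch only; park_by_rank is a hypothetical helper name, not part of the answer above.

```r
df <- data.frame(
  county = rep(c("QU", "AN", "GY"), 3),
  park = c("Downtown", "Queens", "Oakville", "Squirreltown", "Pinhurst",
           "GarbagePile", "LottaTrees", "BigHill", "Jaynestown"),
  hectares = c(12, 42, 6, 18, 92, 6, 4, 52, 12)
)
df <- transform(df, parkrank = ave(hectares, county,
                                   FUN = function(x) rank(x, ties.method = "first")))

# Hypothetical helper: keep rows with the requested rank, return county and park only.
park_by_rank <- function(data, rank_no) {
  out <- data[data$parkrank == rank_no, c("county", "park")]
  rownames(out) <- NULL  # drop the original row numbers
  out
}

park_by_rank(df, 2)
```

Referring to columns by name rather than by position (A[,4]) keeps the function working if the column order changes.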

zoo object aggregation

Dear Community,
the data I receive will be in a data frame:
Var_1 Var_2 Date VaR_3 VaR_4 VaR_5 Var_6
1 4 2010-01-18 7 apple 10 sweet
2 5 2010-07-19 8 orange 11 sour
3 6 2010-01-18 9 kiwi 12 juicy
... ... ... ... ... ... ...
I would like to use zoo, since it seems to be a flexible object class. I'm just starting with R and I tried to read the description (vignettes) for the package.
Questions:
Given the above data as a data frame, which method is recommended to convert the complete df into a zoo object, telling zoo that it shall use the third column as date column (dates can occur multiple times in the data)?
How do I aggregate all other columns monthly, except columns 4 and 6 using zoo built-in functions? Is zoo able to automatically discard categorical variables and just use those columns that are suited for aggregation?
How do I aggregate all numeric columns monthly, for each category in column 4 (column 6 shall not be included, since it is non-numeric).
Thanks for your support.
zoo objects are time series and are normally numeric vectors or matrices. It seems that what you really have is a bunch of different time series where column 5 identifies which series it is. That is, there is an apple series, an orange series, a kiwi series, etc., and each of them has several columns.
Dropping the last column since it's not numeric, using the third column as the index and splitting on column 5, we have:
# create test data
Lines <- "Var_1 Var_2 Date VaR_3 VaR_4 VaR_5 Var_6
1 4 2010-01-18 7 apple 10 sweet
2 5 2010-07-19 8 orange 11 sour
3 6 2010-01-18 9 kiwi 12 juicy"
cat(Lines, "\n", file = "data.txt")
library(zoo)
z <- read.zoo("data.txt", header = TRUE, index = 3, split = "VaR_4",
colClasses = c(Var_6 = "NULL"))
The result is:
> z
Var_1.apple Var_2.apple VaR_3.apple VaR_5.apple Var_1.kiwi
2010-01-18 1 4 7 10 3
2010-07-19 NA NA NA NA NA
Var_2.kiwi VaR_3.kiwi VaR_5.kiwi Var_1.orange Var_2.orange
2010-01-18 6 9 12 NA NA
2010-07-19 NA NA NA 2 5
VaR_3.orange VaR_5.orange
2010-01-18 NA NA
2010-07-19 8 11
The above assumes that the dates are unique for any given value of column 5. If that is not the case, include the aggregate = mean argument or some other value for aggregate.
To now aggregate it into a monthly zoo series we have:
aggregate(z, as.yearmon, mean)
It would also be possible to convert it straight away to monthly by using the FUN = as.yearmon argument:
zm <- read.zoo("data.txt", header = TRUE, index = "Date", split = "VaR_4",
FUN = as.yearmon, colClasses = c(Var_6 = "NULL"), aggregate = mean)
See ?read.zoo, vignette("zoo-read"), ?aggregate.zoo and the other vignettes and help files as well.
