dplyr summarize with variable arguments in R

I want to calculate many different statistics using summarize. How can I do something like the example below?
E.g. in this example, I want to generate a table with, for each month, counts of the days that have a temperature below 60, 61, ... and so on up to 90 degrees.
aq = airquality
aq %>% group_by(Month) %>% summarize(num_days_60 = sum(Temp < 60), num_days_61 = sum(Temp < 61), ..., num_days_90 = sum(Temp < 90))
The output should look like this (with columns continuing all the way up to num_days_90, for example):
Month num_days_60 num_days_61 ...
    5           8           8
    6           0           0
    7           0           0
    8           0           0
    9           0           0
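One possible approach (a sketch, assuming dplyr >= 1.0 and purrr; the threshold range and column naming are taken from the question): build a named vector of thresholds and let summarise() unpack a one-row tibble of per-threshold counts into columns.
library(dplyr)
library(purrr)

aq <- airquality
thresholds <- set_names(60:90, paste0("num_days_", 60:90))

# For each month, count the days with Temp below each threshold; map_dfc()
# returns a one-row tibble, which summarise() unpacks into columns
aq %>%
  group_by(Month) %>%
  summarise(map_dfc(thresholds, ~ sum(Temp < .x)))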


Problem finding number of elements in a dataframe in R

I have downloaded the data frame casos_hosp_uci_def_sexo_edad_provres_60_mas.csv from this webpage; it describes the number of people infected with Covid-19 in Spain, classified by province, age group, gender, and so on. I read the data frame as:
db<-read.csv(file = 'casos_hosp_uci_def_sexo_edad_provres.csv')
The first five rows are shown below:
  provincia_iso sexo grupo_edad      fecha num_casos num_hosp num_uci num_def
1             A    H        0-9 2020-01-01         0        0       0       0
2             A    H      10-19 2020-01-01         0        0       0       0
3             A    H      20-29 2020-01-01         0        0       0       0
4             A    H      30-39 2020-01-01         0        0       0       0
5             A    H      40-49 2020-01-01         0        0       0       0
The first four columns of the data frame give the name of the province, the gender, the age group, and the date; the last four columns give the number of people who got ill, were hospitalized, were admitted to the ICU, or died.
I want to use R to find the day with the highest number of contagions. To do that, I have to sum the elements of the fifth column, num_casos, for each different value of the column fecha.
I have already been able to calculate the number of sick males as hombresEnfermos = sum(db[which(db$sexo == "H"), 5]). However, I think there has to be a better way to find the days with the most contagions than counting manually, but I cannot figure out how.
Can someone please help me?
Using dplyr to get the total by date:
library(dplyr)
db %>% group_by(fecha) %>% summarise(total = sum(num_casos))
Two alternatives in base R:
data.frame(fecha = sort(unique(db$fecha)),
           total = sapply(split(db, f = db$fecha),
                          function(x) sum(x[['num_casos']])))
Or more simply,
aggregate(db$num_casos, list(db$fecha), FUN=sum)
An alternative in data.table:
library(data.table)
db <- as.data.table(db)
db[, list(total=sum(num_casos)), by = fecha]
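To then answer the original question directly (the day with the most cases), one sketch, assuming dplyr >= 1.0 for slice_max():
library(dplyr)

# Total cases per day, keeping only the day(s) with the highest total
db %>%
  group_by(fecha) %>%
  summarise(total = sum(num_casos)) %>%
  slice_max(total, n = 1)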

How can I create a vector with values that are obtained by a function that returns different values for every row?

I have a function club_points(club) that returns the total points of a club. Now I want to make a data frame with the clubs in the rows and each club's club_points value in a column. Is there a way to iterate my function so the points are automatically assigned in the same row as the club?
After some research I believe I have to use the apply family... but since I am new I don't know how to do it.
            teams total_points
1         Rio Ave            0
2      Moreirense            0
3       Sp Lisbon            0
4         Tondela            0
5        Boavista            0
6       Guimaraes            0
7         Setubal            0
8         Estoril            0
9      Belenenses            0
10         Chaves            0
11       Maritimo            0
12 Pacos Ferreira            0
13          Porto            0
14         Arouca            0
15        Benfica            0
16       Feirense            0
17       Sp Braga            0
18       Nacional            0
This is the current format of my data frame final_pos; I would like to apply the club_points function to fill the total_points column.
Do you mean something like
final_pos$total_points <- Vectorize(club_points, "club")(final_pos$teams)
or
final_pos$total_points <- sapply(final_pos$teams,club_points)
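If club_points(club) returns a single number, vapply() is a type-safe variant of the sapply() call above (a sketch under that assumption):

# vapply() additionally checks that club_points() returns exactly one numeric per team
final_pos$total_points <- vapply(final_pos$teams, club_points, numeric(1))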

Cumulative values for columns based on previous row [duplicate]

This question already has an answer here:
Sum of previous rows in a column R
(1 answer)
Closed 3 years ago.
Assume I need to calculate a cumulative value based on another column in the same row and also on the value of the same column in the previous row. Example: obtaining cumulative time from time intervals.
> data <- data.frame(interval=runif(10),time=0)
> data
     interval time
1  0.95197753    0
2  0.73623490    0
3  0.63938696    0
4  0.32085833    0
5  0.92621764    0
6  0.02801951    0
7  0.09071334    0
8  0.60624511    0
9  0.35364178    0
10 0.79759991    0
I can generate the cumulative value of time using the (ugly) code below:
for (i in 1:nrow(data)) {
  data[i, "time"] <- data[i, "interval"] + ifelse(i == 1, 0, data[i - 1, "time"])
}
> data
     interval      time
1  0.95197753 0.9519775
2  0.73623490 1.6882124
3  0.63938696 2.3275994
4  0.32085833 2.6484577
5  0.92621764 3.5746754
6  0.02801951 3.6026949
7  0.09071334 3.6934082
8  0.60624511 4.2996533
9  0.35364178 4.6532951
10 0.79759991 5.4508950
Is it possible to do this without the for iteration, using a single command?
Maybe what you are looking for is cumsum():
library(tidyverse)
data <- data %>%
  mutate(time = cumsum(interval))
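Since cumsum() is a base R function, the same result works without loading any packages (a one-line sketch):

# Running total of the interval column
data$time <- cumsum(data$interval)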

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100           1             0
100           2             0
100           3             0
496           2             6
511           2            10
846           1             4
846           2             6
846           3             1
338           1             6
338           2             0
Now I require the output as:
KId sales_month quantity_sold result
100           1             0      1
100           2             0      1
100           3             0      1
496           2             6      1
511           2            10      1
846           1             4      1
846           2             6      1
846           3             1      0
338           1             6      1
338           2             0      1
Here, the calculation has to go as follows: if the quantity sold in March (month 3) is less than 60% of the combined quantity sold in January (month 1) and February (month 2), then the result should be 1; otherwise it should be 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I suggest the dplyr package, which offers the nice feature of grouping rows and mutating columns in your data frame.
resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  mutate(monthMinus1Qty = lag(quantity_sold, 1),
         monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  mutate(result = ifelse(quantity_sold / previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)
Adding
select(KId, sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NAs in the result column are due to 0/0 division or to there being no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add year column and adjust group_by() arguments appropriately.
For more information on the dplyr package, follow this link.
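For reference, here is a slightly more compact sketch of the same logic, computing the two lags inline with default = 0 (which mirrors the na.rm = TRUE behaviour above); this is an assumption-based rewrite, not the answerer's original code:

library(dplyr)

resultData <- data %>%
  group_by(KId) %>%
  arrange(sales_month, .by_group = TRUE) %>%
  mutate(
    # Sum of the previous two months, treating missing months as 0
    previous2MonthsQty = lag(quantity_sold, 1, default = 0) +
                         lag(quantity_sold, 2, default = 0),
    result = ifelse(quantity_sold / previous2MonthsQty >= 0.6, 0, 1)
  ) %>%
  select(KId, sales_month, quantity_sold, result)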

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
  Date   Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol       1       3    1       0       13
2 23Ap cital       1       5    3       1        6
3 23Ap gerol       0       3    0       0        9
4 23Ap   mix       0       5    0       0        8
5 23Ap cital       0       5    1       0       13
6 23Ap cella       0       5    0       1        4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
   Date Trt Treated Control Dead DeadinC AliveinC
4  23Ap mix       0       5    0       0        8
8  23Ap mix       0       5    1       0        8
10 23Ap mix       0       2    3       0        5
20 23Ap mix       0       0    0       0       18
25 23Ap mix       0       2    1       0       15
28 23Ap mix       0       1    0       0       12
So for G.test(x) to work when x is a matrix, the matrix must be constructed as 2 columns containing numbers, with 1 row per population. If I use the apply() function I can run the G.test on each row, provided my data set contains only two columns of numbers. I want to look only at Treated and Control, for example, but I'm not sure how to omit columns so that G.test ignores the headers and the other columns. I've tried the following, but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
The G.test spits out this when you compare two numbers:
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question is: is there some way to add up all the df and G values that I get for each row (once I'm able to get all these numbers), for each treatment? And is there some way to have R report the G, df and p-values in a table to be summed, rather than printing them row by row as above?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).
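On the follow-up question (reporting G, df and p-values in a table): G.test() returns a standard htest object, so the by() results can be stacked into a data frame. A sketch, assuming the usual htest fields statistic, parameter and p.value:

library(RVAideMemoire)

# One G-test per treatment, then stack the results into a summary table
res <- by(bio, bio$Trt, function(x) G.test(as.matrix(x[, 3:4])))
summary_tab <- do.call(rbind, lapply(names(res), function(trt) {
  g <- res[[trt]]
  data.frame(Trt = trt,
             G = unname(g$statistic),   # G statistic
             df = unname(g$parameter),  # degrees of freedom
             p.value = g$p.value)
}))
summary_tab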
