Problem finding number of elements in a dataframe in R

I have downloaded the data frame casos_hosp_uci_def_sexo_edad_provres_60_mas.csv from this webpage; it describes the number of people infected with Covid-19 in Spain, broken down by province, age, gender, and so on. I read the data frame as:
db <- read.csv(file = 'casos_hosp_uci_def_sexo_edad_provres.csv')
The first five rows are shown below:
  provincia_iso sexo grupo_edad      fecha num_casos num_hosp num_uci num_def
1             A    H        0-9 2020-01-01         0        0       0       0
2             A    H      10-19 2020-01-01         0        0       0       0
3             A    H      20-29 2020-01-01         0        0       0       0
4             A    H      30-39 2020-01-01         0        0       0       0
5             A    H      40-49 2020-01-01         0        0       0       0
The first four columns give the province, the gender, the age group, and the date; the last four columns give the number of people who fell ill, were hospitalized, were admitted to the ICU, or died.
I want to use R to find the day with the highest number of infections. To do that, I have to sum the elements of the fifth column, num_casos, for each distinct value of the column fecha.
I have already been able to calculate the number of sick males with hombresEnfermos = sum(db[which(db$sexo == "H"), 5]). However, there must be a better way to find the days with the most infections than counting by hand, and I cannot work out what it is.
Can someone please help me?

Using dplyr to get the total by date:
library(dplyr)
db %>% group_by(fecha) %>% summarise(total = sum(num_casos))
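To then pick out the single day with the highest total, a small follow-up sketch (not part of the original answer) using dplyr's slice_max():
db %>%
  group_by(fecha) %>%
  summarise(total = sum(num_casos)) %>%
  slice_max(total, n = 1)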
Two alternatives in base R:
data.frame(fecha = sort(unique(db$fecha)),
           total = sapply(split(db, f = db$fecha), function(x) sum(x[['num_casos']])))
Or more simply,
aggregate(db$num_casos, list(db$fecha), FUN=sum)
An alternative in data.table:
library(data.table)
db <- as.data.table(db)
db[, list(total=sum(num_casos)), by = fecha]
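Chaining a second [ onto the data.table call picks out the day with the most cases directly; again a follow-up sketch, not part of the original answer:
db[, list(total = sum(num_casos)), by = fecha][which.max(total)]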

Related

Calculate the length of an interval if data are equal to zero

I have a dataframe with time points and the corresponding measure of activity in different subjects. Each time point is a 5-minute interval.
time     Subject1  Subject2
06:03:00 6,682129  8,127075
06:08:00 3,612061  20,58838
06:13:00 0         0
06:18:00 0,9030762 0
06:23:00 0         0
06:28:00 0         0
06:33:00 0         0
06:38:00 0         7,404663
06:43:00 0         11,55835
...
I would like to calculate the length of each interval that contains zero activity, as in the example below:
           Subject 1 Subject 2
Interval_1         1         5
Interval_2         5
I have the impression that I should solve this using loops and conditions, but as I am not so experienced with loops I do not know where to start. Do you have any idea to solve this? Any help is really appreciated!
You can use rle() to find runs of consecutive values and the length of the runs. We need to filter the results to only runs where the value is 0:
result = lapply(df[-1], \(x) with(rle(x), lengths[values == 0]))
result
# $Subject1
# [1] 1 5
#
# $Subject2
# [1] 5
As different subjects can have different numbers of 0-runs, the results make more sense in a list than a rectangular data frame.
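For a self-contained check, here is a minimal sketch of the data above, assuming the file was read with dec = "," so that the activity values are numeric:
df <- data.frame(
  time = c("06:03:00", "06:08:00", "06:13:00", "06:18:00", "06:23:00",
           "06:28:00", "06:33:00", "06:38:00", "06:43:00"),
  Subject1 = c(6.682129, 3.612061, 0, 0.9030762, 0, 0, 0, 0, 0),
  Subject2 = c(8.127075, 20.58838, 0, 0, 0, 0, 0, 7.404663, 11.55835)
)
lapply(df[-1], \(x) with(rle(x), lengths[values == 0]))  # reproduces the result above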

Find average of a column in a data set based on value of another column in R

I am trying to find the average age (the values in the first column) weighted by how many subjects are in the AD, MCI, and Normal columns of the data set below. Basically, I need the average age for subjects in the AD column, the MCI column, and then the Normal column. Is there an R function that averages one column based on the presence of a nonzero count in another column? Thanks!
table(ADNI$AGE, ADNI$DX)
       AD MCI Normal
  55.6  1   0      0
  55.9  1   0      0
  56.2  2   1      0
  56.3  0   1      0
  57.8  3   1      0
  58.4  0   0      2
There isn't a single function that does this, but a weighted mean solves the problem. Treating the table above as a data frame data whose first column is age and whose remaining columns are the counts:
For the average age of AD: sum(data[, 1] * data[, 2]) / sum(data[, 2])
For the average age of MCI: sum(data[, 1] * data[, 3]) / sum(data[, 3])
For the average age of Normal: sum(data[, 1] * data[, 4]) / sum(data[, 4])
You don't have to worry about the zeros: ages with a zero count contribute nothing to the weighted sum.
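Equivalently, base R's weighted.mean() does the same computation. A minimal sketch that rebuilds the counts printed in the question to show it:
ages <- c(55.6, 55.9, 56.2, 56.3, 57.8, 58.4)
counts <- data.frame(AD     = c(1, 1, 2, 0, 3, 0),
                     MCI    = c(0, 0, 1, 1, 1, 0),
                     Normal = c(0, 0, 0, 0, 0, 2))
sapply(counts, function(w) weighted.mean(ages, w))  # average age per diagnosis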
You could use the dplyr package to do this, assuming ADNI is the raw data frame with one row per subject. The code below groups the rows by DX, then for each group divides the summed ages by the group's row count:
library(dplyr)
ADNI %>%
group_by(DX) %>%
summarise(avg_age = sum(AGE)/n())
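For completeness, base R gets the same per-group averages in one call; a sketch assuming the same raw ADNI data frame:
tapply(ADNI$AGE, ADNI$DX, mean)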

Reverse cumsum with breaks with non-sequential numbers

Looking to fill a matrix with a reverse cumsum. There are multiple breaks that must be maintained.
I have provided a sample matrix of what I want to accomplish. The first column is the data; the second column is the desired result. You will see that column 2 is updated to reflect the number of items that are left; where column 1 has 0s, the previous number is carried through.
update <- matrix(c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3,
                   rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2), ncol = 2)
I have tried multiple ways to create a sequence or loop, using numerous packages (e.g. zoo). What makes it difficult is that the numbers in column 1 can be anywhere between 0 and X, but are always less than the corresponding value in column 2.
Any help or tips would be appreciated.
EDIT: Column 2 starts with a given value, which can represent any starting quantity (e.g. inventory at the beginning of a month). Column 1 then represents "purchases" made; thus, column 2 should reflect the total number of items remaining.
The following will report the purchase and inventory balance as described:
starting_inventory <- 100
df <- data.frame(purchases=c(rep(0,4),rep(1,2),2,rep(0,2),1,3))
df$cum_purchases <- cumsum(df$purchases)
df$remaining_inventory <- starting_inventory - df$cum_purchases
Result:
   purchases cum_purchases remaining_inventory
1          0             0                 100
2          0             0                 100
3          0             0                 100
4          0             0                 100
5          1             1                  99
6          1             2                  98
7          2             4                  96
8          0             4                  96
9          0             4                  96
10         1             5                  95
11         3             8                  92
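As a sanity check, the same cumulative-sum idea reproduces column 2 of the sample matrix exactly, assuming its starting inventory is 10:
update <- matrix(c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3,
                   rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2), ncol = 2)
all(10 - cumsum(update[, 1]) == update[, 2])  # TRUE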

Factor Level issues after filling data frame using match

I am using two large data files, each having >2m records. The sample data frames are:
x <- data.frame(ItemID = c(1, 2, 1, 1, 3, 4, 2, 3, 4, 1),
                SessionID = c(111, 112, 111, 112, 113, 114, 114, 115, 115, 115),
                Avg = c(1.0, 0.45, 0.5, 0.5, 0.46, 0.34, 0.5, 0.6, 0.10, 0.15),
                Category = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
y <- data.frame(ItemID = c(1, 2, 3, 4, 3, 4, 5, 7),
                Category = c("1", "0", "S", "120", "S", "120", "512", "621"))
I successfully filled x$Category using the following command:
x$Category <- y$Category[match(x$ItemID,y$ItemID)]
but
x$Category
gave me
[1] 1 0 1 1 S 120 0 S 120 1
Levels: 0 1 120 512 621 S
In x there are only four distinct categories, but the levels show six. Similarly, the frequency table shows 512 and 621 with zero frequency. I am using the same data for classification, where it reports six classes instead of four, which negatively affects the F-measure, recall, etc.
table(x$Category)
  0   1 120 512 621   S
  2   4   2   0   0   2
while I want
table(x$Category)
  0   1 120   S
  2   4   2   2
I tried merging this and this, along with a number of other questions, but it gives me an error message. I found here (Practical limits of R data frame) that this is a limitation of R.
I would omit the Category column from your x data.frame, since it seems to serve only as a placeholder until values from the y data.frame are filled in. Then you can use left_join from dplyr with ItemID as the key variable, followed by droplevels() as suggested by TingITangIBob.
This gets you close, but my table does not exactly match yours:
dplyr::select(x, -Category) %>%
  dplyr::left_join(y, by = "ItemID") %>%
  droplevels()
  0   1 120   S
  2   4   4   4
I think this is due to the repeated ItemIDs in y (3 and 4 each appear twice): left_join keeps every match, so the corresponding x rows are duplicated, which inflates the 120 and S counts.
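A follow-up sketch, not part of the original answer: de-duplicating y before the join removes the inflated counts and matches the table you want:
joined <- dplyr::select(x, -Category) %>%
  dplyr::left_join(dplyr::distinct(y), by = "ItemID") %>%
  droplevels()
table(joined$Category)
#   0   1 120   S
#   2   4   2   2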

dplyr: summarize with variable arguments

I want to calculate many different statistics using summarize. How can I do something like the example below?
E.g. in this example, I want to generate a table with counts, for each month, of the days with a temperature less than 60, 61, ... up to 90 degrees.
aq = airquality
aq %>% group_by(Month) %>% summarize(num_days_60 = sum(Temp < 60), num_days_61 = sum(Temp < 61), ..., num_days_90 = sum(Temp < 90))
The output should look like this (columns continuing up to num_days_90):
Month num_days_60 num_days_61 ...
    5           8           8
    6           0           0
    7           0           0
    8           0           0
    9           0           0
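A hedged sketch of one way to do this (not from the original thread): build a named list of quosures and splice it into summarize() with !!!, the standard tidy-eval pattern from "Programming with dplyr":
library(dplyr)

aq <- airquality
thresholds <- 60:90
# one quosure per threshold, named num_days_60 ... num_days_90
exprs <- setNames(lapply(thresholds, function(t) quo(sum(Temp < !!t))),
                  paste0("num_days_", thresholds))
aq %>% group_by(Month) %>% summarize(!!!exprs)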
