Unable to Group and Sum Properly in R

I have data similar to this sample data:
   Cities Country    Date Cases
1      BE       A 2/12/20    12
2      BD       A 2/12/20   244
3      BF       A 2/12/20     1
4               V 2/12/20    13
5               Q 2/13/20     2
6               D 2/14/20     4
7      GH       N 2/15/20     6
8      DA       N 2/15/20   624
9      AG       J 2/15/20   204
10     FS       U 2/16/20   433
11     FR       U 2/16/20    38
I want to group the data by date and country and then sum each country's daily cases. However, when I try something like the following, it returns a single total summed over all rows instead:
my_data %>%
  group_by(Country, Date) %>%
  summarize(Cases = sum(Cases))

Your summarize function is likely being called from another package (plyr?). Try calling dplyr::summarize explicitly, like this:
my_data %>%
  group_by(Country, Date) %>%
  dplyr::summarize(Cases = sum(Cases))
# A tibble: 7 x 3
# Groups:   Country [7]
  Country Date    Cases
  <fct>   <fct>   <int>
1 A       2/12/20   257
2 D       2/14/20     4
3 J       2/15/20   204
4 N       2/15/20   630
5 Q       2/13/20     2
6 U       2/16/20   471
7 V       2/12/20    13
I sympathize with you; this can be very frustrating. I have gotten into the habit of always using dplyr::select, dplyr::filter, and dplyr::summarize. Otherwise you spend needless time wondering why your code isn't working.
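For reference, here is a minimal sketch of how the masking arises and how to check for it (my own addition, assuming plyr is installed; my_data is the data above):
library(dplyr)
library(plyr)  # loaded after dplyr, so plyr's summarize masks dplyr's

environment(summarize)
# <environment: namespace:plyr>  shows which package the name currently resolves to

# Explicit namespacing side-steps the conflict regardless of load order:
my_data %>%
  dplyr::group_by(Country, Date) %>%
  dplyr::summarize(Cases = sum(Cases))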

We can also use aggregate from base R:
aggregate(Cases ~ Country + Date, my_data, sum)
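On the sample data above this should return something like the following (the row order may differ, since aggregate sorts by the grouping variables):
  Country    Date Cases
1       A 2/12/20   257
2       V 2/12/20    13
3       Q 2/13/20     2
4       D 2/14/20     4
5       J 2/15/20   204
6       N 2/15/20   630
7       U 2/16/20   471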

Is there an R function to convert numeric values from a vector into observations (rows) in a dataframe?

First time asking a question here, sorry if I'm not clear enough.
Here's my data:
df <- data.frame(Year = c("2018","2018","2019","2019","2018","2018","2019","2019"),
                 Area = c("CF","CF","CF","CF","NY","NY","NY","NY"),
                 Birth = c(1000,1100,1100,1000,2000,2100,2100,2000),
                 Gender = c("F","M","F","M","F","M","F","M"))
df
#   Year Area Birth Gender
# 1 2018   CF  1000      F
# 2 2018   CF  1100      M
# 3 2019   CF  1100      F
# 4 2019   CF  1000      M
# 5 2018   NY  2000      F
# 6 2018   NY  2100      M
# 7 2019   NY  2100      F
# 8 2019   NY  2000      M
where Birth is the number of new babies born.
What I want to do is create a classification model that predicts how likely a newborn baby is to be male/female, with area/year as predictors.
Yes, I know it arguably should be a regression with Y as Birth and X as the others; however, I have somehow fallen into this situation.
With the given data, I already know the result: 50% of observations are male and 50% are female. What I want to know is the probability of a baby being male/female, not which observation (row) is male/female, which I already know.
Is there a way to make each birth an observation, i.e. 1000+1100+1100+1000+2000+2100+2100+2000 = 12400 rows of data? That would be something like: the 1st observation is a female baby born in CF in 2018, the 2nd observation is a male baby born in CF in 2018, and so on, for all 12400.
Or any suggestion to deal with this?
We may use uncount:
library(dplyr)
library(tidyr)
df %>%
  uncount(Birth) %>%
  as_tibble()
Output:
# A tibble: 12,400 x 3
   Year  Area  Gender
   <chr> <chr> <chr>
 1 2018  CF    F
 2 2018  CF    F
 3 2018  CF    F
 4 2018  CF    F
 5 2018  CF    F
 6 2018  CF    F
 7 2018  CF    F
 8 2018  CF    F
 9 2018  CF    F
10 2018  CF    F
# … with 12,390 more rows
Or using base R
# repeat each row Birth times, then renumber Birth within each set of copies
transform(df[rep(seq_len(nrow(df)), df$Birth), ], Birth = sequence(df$Birth))
You could use dplyr and summarize:
library(tidyverse)
df_expanded <- df %>%
  group_by(Year, Area, Gender) %>%
  summarize(expanded = 1:Birth)
# A tibble: 12,400 x 4
# Groups:   Year, Area, Gender [8]
   Year  Area  Gender expanded
   <chr> <chr> <chr>     <int>
 1 2018  CF    F             1
 2 2018  CF    F             2
 3 2018  CF    F             3
 4 2018  CF    F             4
 5 2018  CF    F             5
 6 2018  CF    F             6
 7 2018  CF    F             7
 8 2018  CF    F             8
 9 2018  CF    F             9
10 2018  CF    F            10
# … with 12,390 more rows
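A caveat not in the original answer: returning more than one row per group from summarize() was deprecated in dplyr 1.1.0, which added reframe() for exactly this situation. A sketch of the same idea with reframe():
library(dplyr)
df %>%
  group_by(Year, Area, Gender) %>%
  reframe(expanded = 1:Birth)  # one row per birth; the result comes back ungrouped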
uncount is without a doubt the best solution for this problem, but one alternative to the solutions already shown could be:
library(dplyr)
library(tidyr)
df %>%
  mutate(Birth = lapply(Birth, function(n) 1:n)) %>%
  unnest(Birth)
This returns
# A tibble: 12,400 x 4
   Year  Area  Birth Gender
   <chr> <chr> <int> <chr>
 1 2018  CF        1 F
 2 2018  CF        2 F
 3 2018  CF        3 F
 4 2018  CF        4 F
 5 2018  CF        5 F
 6 2018  CF        6 F
 7 2018  CF        7 F
 8 2018  CF        8 F
 9 2018  CF        9 F
10 2018  CF       10 F
# ... with 12,390 more rows

How to calculate a mean based on conditions in a for loop in R

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like R to calculate average activity based on the age of the colony in each row of the data frame. Specifically, I want it to calculate the average activity of only the colonies that are the same age as or older than the colony in that row, not including the activity of the colony in that row itself. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data: that would include colony 25077 and colony 4865, and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric", nrow(test.df))
for (i in 1:10){
  test.avg[i] <- mean(subset(test.df$activity, test.df$age >= age[i])[-i])
}
R returns a vector of values where half of them are correct and the other half are not (I'm not even sure how it calculated those incorrect numbers). The numbers that are correct are also out of order compared to how they're listed in the dataframe. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
  mutate(result = map_dbl(age, ~ mean(activity[age > .x])))
#>    colony age activity   result
#> 1   29683  21       19 39.00000
#> 2   25077  23       45 33.00000
#> 3   28695   4       78 39.37500
#> 4    4865  25       33      NaN
#> 5   19858   7        2 42.00000
#> 6    2235   4       49 39.37500
#> 7    1948  12       22 29.50000
#> 8    1849  14       21 32.33333
#> 9    2370   9      112 28.00000
#> 10  23196   7       61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#>    colony age activity   result
#> 1   29683  21       19 39.00000
#> 2   25077  23       45 33.00000
#> 3   28695   4       78 39.37500
#> 4    4865  25       33      NaN
#> 5   19858   7        2 42.00000
#> 6    2235   4       49 39.37500
#> 7    1948  12       22 29.50000
#> 8    1849  14       21 32.33333
#> 9    2370   9      112 28.00000
#> 10  23196   7       61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)
The issue in your solution is that the index i refers to rows of the original data frame, but you apply it after subsetting, so the positions no longer match.
Try something like this: first store the current row's age, then exclude the current row and average the activity of the remaining colonies with age greater than or equal to that value. (Note that this keeps same-age colonies, which the age > x solutions above exclude, so the two approaches differ wherever ages are tied; see the check after the loop.)
for (i in 1:10){
  test.avg[i] <- {amin = age[i]; mean(subset(test.df[-i, ], age >= amin)$activity)}
}
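As a quick check of the tie behaviour (my own addition, not part of the original answer), compare row 3, one of the two age-4 colonies:
test.avg <- vector("numeric", nrow(test.df))
for (i in 1:10){
  test.avg[i] <- {amin = age[i]; mean(subset(test.df[-i, ], age >= amin)$activity)}
}
test.avg[3]
# [1] 40.44444  (includes the other age-4 colony, unlike the 39.375 from age > x above)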
You can use map_df:
library(tidyverse)
test.df %>%
  mutate(map_df(1:nrow(test.df), ~
    test.df %>%
      filter(age >= test.df$age[.x]) %>%
      summarise(av_acti = mean(activity))))

Is there a way to filter out duplicate/repeated entries by particular groups?

Some context first:
I'm working with a data set which includes health-related data. It includes questionnaire scores pre- and post-treatment. However, some clients reappear within the data for further treatment. I've provided a mock example of the data in the code section.
I have tried to come up with a solution in dplyr, as this is the package I'm most familiar with, but I haven't achieved what I want.
#Example/mock data
ClientNumber<-c("4355", "2231", "8894", "9002", "4355", "2231", "8894", "9002", "4355", "2231")
Pre_Post<-c(1,1,1,1,2,2,2,2,1,1)
QuestionnaireScore<-c(62,76,88,56,22,30, 35,40,70,71)
df<-data.frame(ClientNumber, Pre_Post, QuestionnaireScore)
df$ClientNumber<-as.character(df$ClientNumber)
df$Pre_Post<-as.factor(df$Pre_Post)
View(df)
#tried solution
df2 <- df %>%
  group_by(ClientNumber) %>%
  filter(Pre_Post == 1 | Pre_Post == 2)
#this doesn't work, or needs more code to it
As you can see, the first four client numbers each have both a pre- and a post-treatment score. This is good. However, client numbers 4355 and 2231 appear again at the end (you could say they have relapsed and started new treatment). These two clients do not have a post-treatment score.
I only want to analyse clients that have both a pre and a post score, so I need to keep clients who have completed treatment while excluding those that reappear in the data without a post-treatment score. In relation to the example I've provided, I want to include the first eight rows for analysis and exclude the last two, as they do not have a post-treatment score.
If these cases are to be kept in order, you could try:
library(dplyr)
df %>%
  group_by(ClientNumber) %>%
  filter(!duplicated(Pre_Post) & n_distinct(Pre_Post) == 2)
  ClientNumber Pre_Post QuestionnaireScore
  <fct>           <dbl>              <dbl>
1 4355                1                 62
2 2231                1                 76
3 8894                1                 88
4 9002                1                 56
5 4355                2                 22
6 2231                2                 30
7 8894                2                 35
8 9002                2                 40
The n_distinct() check won't hurt to keep: it is what removes cases who have a pre score but no post score, if any exist in the data.
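To see this, here is a quick check with a hypothetical extra client "1111" who has only a pre score (my own addition, not from the question's data):
df_extra <- rbind(df, data.frame(ClientNumber = "1111",
                                 Pre_Post = factor(1, levels = c(1, 2)),
                                 QuestionnaireScore = 50))
df_extra %>%
  group_by(ClientNumber) %>%
  filter(!duplicated(Pre_Post) & n_distinct(Pre_Post) == 2)
# client "1111" is dropped because n_distinct(Pre_Post) is 1 for that group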
First arrange by ClientNumber, then group_by, and finally filter using dplyr::lead and dplyr::lag:
library(dplyr)
df %>%
  arrange(ClientNumber) %>%
  group_by(ClientNumber) %>%
  filter(Pre_Post == 1 & lead(Pre_Post) == 2 | Pre_Post == 2 & lag(Pre_Post) == 1)
# A tibble: 8 x 3
# Groups:   ClientNumber [4]
  ClientNumber Pre_Post QuestionnaireScore
  <fct>           <dbl>              <dbl>
1 2231                1                 76
2 2231                2                 30
3 4355                1                 62
4 4355                2                 22
5 8894                1                 88
6 8894                2                 35
7 9002                1                 56
8 9002                2                 40
Another option is to create groups of 2 for every ClientNumber and select only those groups which have 2 rows in them.
library(dplyr)
df %>%
  arrange(ClientNumber) %>%
  group_by(ClientNumber, group = cumsum(Pre_Post == 1)) %>%
  filter(n() == 2) %>%
  ungroup() %>%
  select(-group)
#  ClientNumber Pre_Post QuestionnaireScore
#  <chr>        <fct>                 <dbl>
#1 2231         1                        76
#2 2231         2                        30
#3 4355         1                        62
#4 4355         2                        22
#5 8894         1                        88
#6 8894         2                        35
#7 9002         1                        56
#8 9002         2                        40
The same can be translated to base R using ave:
new_df <- df[order(df$ClientNumber), ]
subset(new_df, ave(Pre_Post, ClientNumber, cumsum(Pre_Post == 1), FUN = length) == 2)

cbind arguments in large dataframe

I have searched unsuccessfully for several days for an answer to this question: I have a dataframe with 279 columns and want to generate subtotals using aggregate(), or indeed, anything suitable. Here is a subset:
  LGA    off.cat         sub.cat                                Jan1995 Feb1995
1 Albury Homicide        Murder *                                     0       0
2 Albury Homicide        Attempted murder                             0       0
3 Albury Homicide        Murder accessory, conspiracy                 0       0
4 Albury Homicide        Manslaughter *                               0       0
5 Albury Assault         Domestic violence related assault            7       7
6 Albury Assault         Non-domestic violence related assault       29      20
7 Albury Assault         Assault Police                              12       3
8 Albury Sexual offences Sexual assault                               4       3
The full dataframe contains dozens of LGA values, and many more date columns. I would like to obtain subtotals for each unique LGA value grouped by unique values of off.cat and sub.cat, summed over all dates. I tried using cbind in aggregate, but found no way to generate the 276 date column names that would not cause errors. Explicit column names worked fine. Apologies for the lack of clarity in the earlier post, and thanks to those who valiantly tried to interpret my meaning.
Your question is a bit unclear, but you may be successful using the formula syntax of aggregate. Here's an example:
df <- data.frame(group = letters[1:5],
                 x = 1:5,
                 y = 6:10,
                 z = 11:15)
  group x  y  z
1     a 1  6 11
2     b 2  7 12
3     c 3  8 13
4     d 4  9 14
5     e 5 10 15
We now sum all three variables x, y and z by the levels of group, using setdiff to get a vector of the column names except group and pasting them together for as.formula. Note that this formula adds the three columns together into a single x + y + z total per group:
aggregate(as.formula(paste(paste(setdiff(names(df), c("group")), collapse = "+"), "~ group")),
          data = df, sum)
  group x + y + z
1     a        18
2     b        21
3     c        24
4     d        27
5     e        30
Hope this helps.
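If, instead of one combined total, you want each column summed separately (closer to the cbind attempt described in the question), here are two sketches of my own using the toy df above; for real column names that are not syntactic, the pasted names would need backticks:
# dot shorthand: aggregate every non-grouping column on its own
aggregate(. ~ group, data = df, sum)

# or build the cbind() formula programmatically from the column names
value_cols <- setdiff(names(df), "group")
f <- as.formula(paste0("cbind(", paste(value_cols, collapse = ", "), ") ~ group"))
aggregate(f, data = df, sum)
#   group x  y  z
# 1     a 1  6 11
# 2     b 2  7 12
# ...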

From panel data to cross-sectional data using averages

I am very new to R so I am not sure how basic my question is, but I am stuck at the following point.
I have data that has a panel structure, similar to this:
Country Year Outcome Country-characteristic
A       1990      10                     40
A       1991      12                     40
A       1992      14                     40
B       1991      10                     60
B       1992      12                     60
For some reason I need to put this into a cross-sectional structure such that I get averages over all years for each country. In the end, it should look like this:
Country Outcome Country-characteristic
A            12                     40
B            11                     60
Has anybody faced a similar problem? I was playing with lapply(table$country, table$outcome, mean) but that did not work the way I wanted.
Two tips: (1) when you ask a question, you should provide a reproducible example for the data too (as I did with read.table below); (2) it's not a good idea to use "-" in column names; use "_" instead.
You can get a summary using the dplyr package:
df1 <- read.table(text="Country Year Outcome Countrycharacteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60", header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
  group_by(Country) %>%
  summarize(Outcome = mean(Outcome), Countrycharacteristic = mean(Countrycharacteristic))
# A tibble: 2 x 3
  Country Outcome Countrycharacteristic
  <chr>     <dbl>                 <dbl>
1 A            12                    40
2 B            11                    60
We can do this in base R with aggregate:
aggregate(. ~ Country, df1[-2], mean)
#  Country Outcome Countrycharacteristic
#1       A      12                    40
#2       B      11                    60
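With many columns, recent dplyr (1.0.0 or later) can also average everything at once via across() instead of naming each column; a sketch of my own on the same df1:
library(dplyr)
df1 %>%
  group_by(Country) %>%
  summarize(across(-Year, mean))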
