I am trying to sum the values in each date column per country using mutate_at and the sum function. The dataset is given below:
Country/Region 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20
Chili 0 0 0 3 1 2
Chili 1 0 1 4 2 1
China 23 26 123 12 56 70
China 45 25 56 23 16 18
I am using the following code, but instead of summing all the column values, I am getting zeroes.
tb <- confirmed_raw %>% group_by(`Country/Region`) %>%
filter(`Country/Region` != "Cruise Ship") %>%
select(-`Province/State`, -Lat, -Long) %>%
mutate_at(vars(-group_cols()), ~sum)
The output which I want is:
Country/Region 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20
Chili 2 0 1 7 3 3
China 68 51 179 35 72 88
But instead of the above, all the date columns come out as 0. How can I solve this?
Can you try summarise_all instead of mutate_at(vars(-group_cols()), ~sum)?
tb %>% group_by(`Country/Region`) %>% summarise_all(sum)
PS: I guess you have a few typos here; for example, tb[1,1] should return 1, not 2. Also, the example code does not correspond entirely to the data (there is no Cruise Ship or Province/State in it). Still, ignoring those, I found that this generates the expected output.
For completeness, another option:
tb %>% group_by(`Country/Region`) %>% mutate_all(sum) %>% distinct(`Country/Region`,.keep_all = TRUE)
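For reference, here is a minimal sketch of the same aggregation written with summarise() and across(), assuming dplyr 1.0 or later is installed. The data frame `confirmed` is a hypothetical three-date subset of the sample rows above, not the asker's full `confirmed_raw`.

```r
library(dplyr)

# Hypothetical subset of the sample data (first three date columns only)
confirmed <- data.frame(
  `Country/Region` = c("Chili", "Chili", "China", "China"),
  `1/22/20` = c(0, 1, 23, 45),
  `1/23/20` = c(0, 0, 26, 25),
  `1/24/20` = c(0, 1, 123, 56),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# One row per country; across(everything()) skips the grouping column
totals <- confirmed %>%
  group_by(`Country/Region`) %>%
  summarise(across(everything(), sum), .groups = "drop")

print(totals)
```

With these rows the sums are 1, 0, 1 for Chili and 68, 51, 179 for China, which matches the point in the PS that the first expected value should be 1.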
Below is the sample data
indcode <- c(71,72,81,82,99,000000,71,72,81,82,99,000000)
year <- c(2020,2020,2020,2020,2020,2020,2021,2021,2021,2021,2021,2021)
employment <- c(3,5,7,9,2,26,4,6,8,10,3,31)
test <- data.frame(indcode, year, employment)
The task is to create a new column that holds the 000000 value for each year. I know that this involves a pivot wider, but getting the 000000 value to repeat is my struggle. Below is the desired result. I am hoping not to have 000000 (Total, all industries) as a row, since it would essentially be a duplicate.
Year indcode employment total
2020 71 3 26
2020 72 5 26
2020 81 7 26
2020 82 9 26
2020 99 2 26
2021 71 4 31
and so on...
We could do this by detecting one or more zeros (+) from the start (^) to the end ($) of the string in 'indcode' to subset the 'employment' for each 'year' (grouped) to create a new column and then filter out the 0 rows
library(dplyr)
library(stringr)
test %>%
group_by(year) %>%
mutate(total = employment[str_detect(indcode, '^0+$')]) %>%
ungroup %>%
filter(str_detect(indcode, "^0+$", negate = TRUE))
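As a possible alternative, here is a sketch that avoids the regular expression. It relies on the fact that the numeric literal 000000 is stored as 0 in `indcode`, and it assumes there is exactly one total row per year (which holds for the sample data but is an assumption for real data).

```r
library(dplyr)

indcode <- c(71, 72, 81, 82, 99, 000000, 71, 72, 81, 82, 99, 000000)
year <- c(2020, 2020, 2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021, 2021, 2021)
employment <- c(3, 5, 7, 9, 2, 26, 4, 6, 8, 10, 3, 31)
test <- data.frame(indcode, year, employment)

result <- test %>%
  group_by(year) %>%
  # pick the employment value of the total row (indcode 0) within each year
  mutate(total = employment[indcode == 0]) %>%
  ungroup() %>%
  # drop the total rows themselves
  filter(indcode != 0)

print(result)
```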
I am looking to make a new variable to mark which of my data is duplicated, selecting the oldest datapoint to be the "original". My dataframe is not ordered by date, but by ID.
ID Name Number Datetime (dd/mm/yyy/hh/MM)
1 ace 114 15.03.2019 15:26
2 bert 197 18.03.2019 07:28
3 vance 245 16.03.2019 14:03
4 chad 116 17.03.2019 02:02
5 chad 116 18.03.2019 18:23
6 ace 114 12.03.2019 23:15
Ordering the dataframe works and selecting the duplicated lines also works, but not in combination, which leads to the originals not being the first presentation. Even if I order the dataframe before marking the representations, the dataframe seems to be unordered for the next command, and linking the two commands with %>% is not working.
df %>% arrange(Datetime)
df$representations <- if_else(duplicated(df$number, .keep_all =TRUE), 1, 0)
df$represntations <- df %>%
arrange(Datetime) %>%
if_else(duplicated(df$number, .keep_all =TRUE), 1, 0)
How can I be sure that the originals will be the first datapoint for each number (like this)?
ID Name Number Datetime (dd/mm/yyy/hh/MM) representation
1 ace 114 15.03.2019 15:26 1
2 bert 197 18.03.2019 07:28 0
3 vance 245 16.03.2019 14:03 0
4 chad 116 17.03.2019 02:02 0
5 chad 116 18.03.2019 18:23 1
6 ace 114 12.03.2019 23:15 0
Try the code below:
df <- df %>%
arrange(Datetime) %>%
# duplicated() has no .keep_all argument; the column is Number, not number
mutate(representations = if_else(duplicated(Number), 1, 0)) %>%
arrange(ID)
library(dplyr)
df %>%
arrange(`Datetime(dd/mm/yyy/hh/MM)`) %>%
mutate(flag = duplicated(Number)*1) %>%
arrange(ID)
1 1 ace 114 15.03.2019 1
2 2 bert 197 18.03.2019 0
3 3 vance 245 16.03.2019 0
4 4 chad 116 17.03.2019 0
5 5 chad 116 18.03.2019 1
6 6 ace 114 12.03.2019 0
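Sorting on the raw text only works while the day-first strings happen to sort chronologically. The sketch below parses the timestamps first; it assumes the column is stored as character in the form "dd.mm.yyyy HH:MM" and uses a hypothetical helper column `parsed` that is dropped again at the end.

```r
library(dplyr)

df <- data.frame(
  ID = 1:6,
  Name = c("ace", "bert", "vance", "chad", "chad", "ace"),
  Number = c(114, 197, 245, 116, 116, 114),
  Datetime = c("15.03.2019 15:26", "18.03.2019 07:28", "16.03.2019 14:03",
               "17.03.2019 02:02", "18.03.2019 18:23", "12.03.2019 23:15"),
  stringsAsFactors = FALSE
)

flagged <- df %>%
  # parse to a real date-time so ordering is chronological, not alphabetical
  mutate(parsed = as.POSIXct(Datetime, format = "%d.%m.%Y %H:%M", tz = "UTC")) %>%
  arrange(parsed) %>%
  # the earliest row per Number is the original (0); later ones are flagged 1
  mutate(representation = as.integer(duplicated(Number))) %>%
  arrange(ID) %>%
  select(-parsed)

print(flagged)
```

For the six sample rows, the flag is 1 for IDs 1 and 5 and 0 elsewhere, as in the desired output.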
I ended up using this code and the sample I checked seemed to be correct, thank you! (even though the as.Date changed the year from 2019 to 2020, but the order is correct)
# split time and date, so as.Date can be used
emerge$date <- as.Date(sapply(strsplit(as.character(emerge$Falleinzeitdatum.Notfall), " "), "[", 1), format = "%d.%m.%y")
# arrange as proposed
emerge <- emerge %>%
arrange(date) %>%
mutate(re = if_else(duplicated(Patientennummer, .keep_all = TRUE), 1, 0))
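The year shift mentioned in the comment is most likely caused by the format string: %y reads a two-digit year, so only "20" is consumed from "2019", whereas %Y reads the full four-digit year. A small base R sketch, using one timestamp from the sample data as an assumed input:

```r
raw <- "15.03.2019 15:26"

# keep only the date part, as in the code above
date_part <- sapply(strsplit(raw, " "), "[", 1)

# %y consumes two digits ("20"), giving the year 2020; the rest is ignored
two_digit <- as.Date(date_part, format = "%d.%m.%y")

# %Y consumes the full four-digit year
four_digit <- as.Date(date_part, format = "%d.%m.%Y")

print(two_digit)
print(four_digit)
```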
I'm trying to collect data on which events have happened prior to a specific event (i.e. bDragons), which can recur over the full observation. This is just an excerpt of one observation where a dragon is taken more than once, and I want to be able to pull insights on each one over many observations. So in the data set below, I would want to know that only 1 outer turret was taken prior to the first dragon at Time == 12.891. The next is taken at 20.215, with 4 towers and a drake before it.
ID TeamObj Time Type Lane League Year Season bResult rResult gamelength Gold
1 1 bTowers 9.397 OUTER_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
2 1 bDragons 12.891 AIR_DRAGON <NA> CBLoL 2017 Summer 1 0 34 NA
3 1 bTowers 16.215 OUTER_TURRET BOT_LANE CBLoL 2017 Summer 1 0 34 NA
4 1 bTowers 16.591 INNER_TURRET BOT_LANE CBLoL 2017 Summer 1 0 34 NA
5 1 bTowers 19.830 OUTER_TURRET MID_LANE CBLoL 2017 Summer 1 0 34 NA
6 1 bDragons 20.215 EARTH_DRAGON <NA> CBLoL 2017 Summer 1 0 34 NA
7 1 bBarons 22.512 BARON_NASHOR <NA> CBLoL 2017 Summer 1 0 34 NA
8 1 bTowers 23.962 INNER_TURRET MID_LANE CBLoL 2017 Summer 1 0 34 NA
9 1 bTowers 24.707 INNER_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
10 1 bTowers 24.962 BASE_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
I'd want this for every TeamObj of that type, but the issue comes up when I try to group_by address and filter by Time <= which(Team == "bDragons"): the wrong things get filtered out, or I can't summarize based on count(Type) or anything else. I'm looking for help with writing some type of recurring function, or a better way to record and summarize that. I want to fit the observations into a linear model later on, but I can't get past square one.
Am I thinking about my filter incorrectly? My summarize? tst3 %>% group_by(ID) %>% filter(Time <= which(Team == "bDragons")) %>% summarize(count(Type))
Something like:
ID dragonID dragonType Time Baron_Nashor Base_Turret Inner_Turret Nexus_Turret Outer_Turret
1 1 AIR_DRAGON 12.891 N/A N/A N/A N/A 1
2 2 EARTH_DRAGON 20.215 N/A N/A 1 N/A 3
and so on, if that is clear. Want to be able to use each as an observation.
How about the following
library(dplyr)
library(tidyr)   # unnest(), spread()
library(stringr) # str_detect()
tst3 %>%
group_by(ID) %>%
# arrange(Time) %>% # uncomment if needed
mutate(
Type = factor(Type),
dragonID = cumsum(dplyr::lag(TeamObj == 'bDragons', default = 1))) %>%
group_by(ID, dragonID) %>%
summarize(
dragonType = last(Type),
Time = last(Time),
tmp = list(as.data.frame(table(Type)))) %>%
unnest() %>%
spread(Type, Freq, fill = 0) %>%
# select(-ends_with("DRAGON")) %>%
group_by(ID) %>%
mutate_at(vars(BARON_NASHOR:OUTER_TURRET), cumsum) %>%
filter(str_detect( dragonType, "DRAGON"))
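A simpler way to see the idea is to keep running counts per event type and then keep only the dragon rows. The sketch below is a reduced illustration, not a replacement for the pipeline above: it tracks only two tower types, uses a hypothetical data frame `tst` built from the excerpt, and refers to the column as `TeamObj` (the question's filter used `Team`, which is not in the data shown).

```r
library(dplyr)

tst <- data.frame(
  ID = 1,
  TeamObj = c("bTowers", "bDragons", "bTowers", "bTowers", "bTowers",
              "bDragons", "bBarons", "bTowers", "bTowers", "bTowers"),
  Time = c(9.397, 12.891, 16.215, 16.591, 19.830,
           20.215, 22.512, 23.962, 24.707, 24.962),
  Type = c("OUTER_TURRET", "AIR_DRAGON", "OUTER_TURRET", "INNER_TURRET",
           "OUTER_TURRET", "EARTH_DRAGON", "BARON_NASHOR", "INNER_TURRET",
           "INNER_TURRET", "BASE_TURRET"),
  stringsAsFactors = FALSE
)

before_dragon <- tst %>%
  group_by(ID) %>%
  arrange(Time, .by_group = TRUE) %>%
  mutate(
    # running counts of each tower type up to and including the current row
    outer_before = cumsum(Type == "OUTER_TURRET"),
    inner_before = cumsum(Type == "INNER_TURRET"),
    # sequential number of each dragon within the game
    dragonID = cumsum(TeamObj == "bDragons")
  ) %>%
  filter(TeamObj == "bDragons") %>%
  ungroup() %>%
  select(ID, dragonID, dragonType = Type, Time, outer_before, inner_before)

print(before_dragon)
```

For the excerpt this yields one outer turret before the first dragon, and three outer turrets plus one inner turret before the second, in line with the desired table.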
library(dplyr)
library(forcats)
Using the simple dataframe and code below, I want to create a table with total rows and sub-rows. For example, the first row would be "Region1" from the NEW column and 70 from the TotNumber column, then below that would be three rows for "Town1", "Town2", and "Town3", and their associated numbers from the Number column, and the same for "Region2" and "Region3". I attached a pic of the desired table...
I'm also looking for a solution using dplyr and Tidyverse.
Number<-c(10,30,30,10,56,30,40,50,33,10)
Town<-c("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF<-data_frame(Town,Number)
DF<-DF%>%mutate_at(vars(Town),funs(as.factor))
To create Region variable...
DF<-DF%>%mutate(NEW=fct_collapse(Town,
Region1=c("Town1","Town2","Town3"),
Region2=c("Town4","Town5","Town6"),
Region3=c("Town7","Town8","Town9","Town10")))%>%
group_by(NEW)%>%
summarise(TotNumber=sum(Number))
Modifying your last pipes and adding some additional steps:
library(dplyr)
library(forcats)
DF%>%mutate(NEW=fct_collapse(Town,
Region1=c("Town1","Town2","Town3"),
Region2=c("Town4","Town5","Town6"),
Region3=c("Town7","Town8","Town9","Town10")),
NEW = as.character(NEW)) %>%
group_by(NEW) %>%
mutate(TotNumber=sum(Number)) %>%
ungroup() %>%
split(.$NEW) %>%
lapply(function(x) rbind(setNames(x[1,3:4], names(x)[1:2]), x[1:2])) %>%
do.call(rbind, .)
Results:
# A tibble: 13 × 2
Town Number
* <chr> <dbl>
1 Region1 70
2 Town1 10
3 Town2 30
4 Town3 30
5 Region2 96
6 Town4 10
7 Town5 56
8 Town6 30
9 Region3 133
10 Town7 40
11 Town8 50
12 Town9 33
13 Town10 10
Data:
Number<-c(10,30,30,10,56,30,40,50,33,10)
Town<-c("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF<-data_frame(Town,Number) %>%
mutate_at(vars(Town),funs(as.factor))
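Since the question asked for a dplyr/Tidyverse solution, here is a sketch that builds the same table with bind_rows() instead of split/lapply/rbind. The helper columns `Region` and `ord` are hypothetical names introduced only for ordering, and case_when() stands in for fct_collapse() so that only dplyr is needed.

```r
library(dplyr)

DF <- tibble(
  Town = paste0("Town", 1:10),
  Number = c(10, 30, 30, 10, 56, 30, 40, 50, 33, 10)
)

# Town rows, tagged with their region and original position
with_region <- DF %>%
  mutate(
    Region = case_when(
      Town %in% c("Town1", "Town2", "Town3") ~ "Region1",
      Town %in% c("Town4", "Town5", "Town6") ~ "Region2",
      TRUE ~ "Region3"
    ),
    ord = row_number()
  )

# One total row per region; ord = 0 sorts it above its towns
region_totals <- with_region %>%
  group_by(Region) %>%
  summarise(Number = sum(Number), .groups = "drop") %>%
  mutate(Town = Region, ord = 0L)

result <- bind_rows(region_totals, with_region) %>%
  arrange(Region, ord) %>%
  select(Town, Number)

print(result)
```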
I have a dataset with a million records that I need to aggregate after first subsetting the data. It is difficult to provide a good reproducible sample because in this case, the sample size would be rather large - but I will try anyway.
A random sample of the data that I am working with looks like this:
> df
auto_id user_id month
164537 7124 240249 10
151635 7358 226423 9
117288 7376 172463 9
177119 6085 199194 11
128904 7110 141608 9
157194 7143 241964 9
71303 6090 141646 7
72480 6808 175910 7
108705 6602 213098 8
97889 7379 185516 8
184906 6405 212580 12
37242 6057 197905 8
157284 6548 162928 9
17910 6885 194180 10
70660 7162 161827 7
8593 7375 207061 8
28712 6311 176373 10
144194 7324 142715 9
73106 7196 176153 7
67065 7392 171039 7
77954 7116 161489 7
59842 7107 162637 7
101819 5994 182973 9
183546 6427 142029 12
102881 6477 188129 8
In every month, many users appear more than once, so first we should subset by month and make a frequency table of the users and the number of trips taken (unfortunately, in the random sample above there is only one trip per user, but in the larger dataset, this is not the case):
full_data <- full_data[full_data$month == 7,]
users <- as.data.frame(table(full_data$user_id))
head(users)
Var1 Freq
1 100231 10
2 100744 17
3 111281 1
4 111814 2
5 113716 3
6 117493 3
As we can see, in the full data set, in the month of July (month = 7), users have taken multiple trips. Now the important part: subsetting only the top 10% of these users (the top 10% in terms of Freq).
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
Now the new dataframe - topten - can be summed and we get the amount of trips taken by the top ten percent of users
sum(topten$Freq)
[1] 12147
In the end the output should look like this
> output
month trips
1 7 12147
2 8 ...
3 9 ...
4 10 ...
5 11 ...
6 12 ...
Is there a way to automate this process using dplyr - I mean specifically the subsetting by the top ten percent ? I have tried
output <- full_data %>%
group_by(month) %>%
summarise(n = n())
But this only aggregates total trips by month. Could someone suggest a way to integrate this part into the query in dplyr ? :
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
The code below counts the number of rows for each user_id in each month, and then selects the 10% of users with the most rows in each month and sums them. Let me know if it solves your problem.
library(dplyr)
full_data %>% group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
UPDATE: Following up on your comment, let's do a check with some fake data. Below we have 30 different values of user_id and 10,000 total rows. I've also used the prob argument so that the probability of a user_id being selected is proportional to its value (i.e., user_id 1 is the least likely to be chosen and user_id 30 is the most likely to be chosen).
set.seed(3)
full_data = data.frame(user_id=sample(1:30,10000, replace=TRUE, prob=1:30),
month=sample(1:12, 10000, replace=TRUE))
Let's look at the number of rows for each user_id for month==1. The code below counts the number of rows for each user_id and sorts from most to least common. Note that the three most common values of user_id (28,29,26) comprise 171 rows (60+57+54). Since there are 30 different values of user_id, the top three users represent the top 10% of users:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
arrange(desc(n)) %>% as.data.frame
month user_id n
1 1 28 60
2 1 29 57
3 1 26 54
4 1 30 53
5 1 27 49
6 1 22 43
7 1 21 41
8 1 20 40
9 1 23 40
10 1 24 38
11 1 25 38
12 1 19 37
13 1 18 33
14 1 16 28
15 1 15 27
16 1 17 27
17 1 14 26
18 1 9 20
19 1 12 20
20 1 13 20
21 1 10 17
22 1 11 17
23 1 6 15
24 1 7 13
25 1 8 13
26 1 4 9
27 1 5 7
28 1 2 3
29 1 3 2
30 1 1 1
So now let's take the next step and select the top 10% of users. To answer the question in your comment, filter(percent_rank(n) >= 0.9) keeps only the top 10% of user_id, based on the value of n (which is the number of rows for each user_id). percent_rank is one of several ranking functions in dplyr that have different ways of dealing with ties (which may be the reason you're not getting the results you expect). See ?percent_rank for details:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9)
month user_id n
1 1 26 54
2 1 28 60
3 1 29 57
And the sum of n (the total number of trips for the top 10%) is:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
month n_trips
1 1 171
So it looks like the code does what we'd naively expect, but maybe the issue is related to how ties are dealt with. Let me know if you're still getting anomalous results in your real data or if I've misunderstood what you're trying to accomplish.
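If ties are the cause, one option is to mirror the base R logic from the question exactly: sort users by trip count within each month and keep a fixed number of rows, round(number of users / 10). The sketch below uses small made-up data (hypothetical, not the asker's) so the result can be checked by hand; how tied users at the cutoff are ordered is left to the sort and is an assumption worth reviewing on real data.

```r
library(dplyr)

# Made-up data: in month 7, user 1 has 10 trips, user 2 has 9, ..., user 10 has 1;
# in month 8, user 1 has 20 trips, ..., user 20 has 1
full_data <- rbind(
  data.frame(month = 7, user_id = rep(1:10, times = 10:1)),
  data.frame(month = 8, user_id = rep(1:20, times = 20:1))
)

top_trips <- full_data %>%
  count(month, user_id) %>%                    # trips per user per month, in column n
  group_by(month) %>%
  arrange(desc(n), .by_group = TRUE) %>%       # most active users first
  filter(row_number() <= round(n() / 10)) %>%  # keep the top 10% of users by count
  summarise(trips = sum(n))

print(top_trips)
```

Here month 7 keeps one user (10 trips) and month 8 keeps two users (20 + 19 = 39 trips). Unlike percent_rank(), this always keeps a fixed number of users per month, even when several users share the same count at the boundary.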