Using dplyr to create new groups inside a column - r

This is my dataframe:
mydf<-structure(list(DS_FAIXA_ETARIA = c("Inválido", "16 anos", "17 anos",
"18 anos", "19 anos", "20 anos", "21 a 24 anos", "25 a 29 anos",
"30 a 34 anos", "35 a 39 anos"), n = c(5202L, 48253L, 67401L,
79398L, 88233L, 90738L, 149634L, 198848L, 238406L, 265509L)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
I would like to group the following observations into a single group called "16 a 20 anos":
"16 anos", "17 anos",
"18 anos", "19 anos", "20 anos"
In other words, I would like to "merge" rows 2-6 and sum their values in the n column, so that a single row represents the sum of rows 2-6.
Is it possible to do this using group_by and then summarise(sum(DS_FAIXA_ETARIA)) verbs from dplyr?
This would be the output that I want:
mydf<-structure(list(DS_FAIXA_ETARIA = c("Inválido","16 a 20 anos" ,"21 a 24 anos", "25 a 29 anos",
"30 a 34 anos", "35 a 39 anos"), n = c(5202L,374023L , 149634L, 198848L, 238406L, 265509L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Many thanks

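This should do the job. First compute the sum with summarise.
Then append it to the original dataframe with add_row, keep the last five rows with slice_tail, and sort with arrange. Note that slice_tail(n=5) also drops the Inválido row, as the output shows.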
library(dplyr)
df1 <- mydf %>%
summarise(`16 a 20 anos`= sum(n[2:6]))
mydf %>%
add_row(DS_FAIXA_ETARIA=names(df1), n=df1$`16 a 20 anos`[1]) %>%
slice_tail(n=5) %>%
arrange(DS_FAIXA_ETARIA)
Output:
DS_FAIXA_ETARIA n
<chr> <int>
1 16 a 20 anos 374023
2 21 a 24 anos 149634
3 25 a 29 anos 198848
4 30 a 34 anos 238406
5 35 a 39 anos 265509

We create a grouping variable that keeps 'Inválido' in its own group (0) and puts the elements consisting only of digits (\\d+) followed by a space and 'anos' into one shared group, then summarise by pasting the first and last elements together while taking the sum of 'n'.
library(dplyr)
library(stringr)
mydf %>%
group_by(grp = replace(cumsum(!str_detect(DS_FAIXA_ETARIA,
'^\\d+\\s+anos$')), DS_FAIXA_ETARIA == 'Inválido', 0)) %>%
summarise(DS_FAIXA_ETARIA = if(n() > 1)
str_c(DS_FAIXA_ETARIA[c(1, n())], collapse="_") else
DS_FAIXA_ETARIA, n = sum(n), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 6 x 2
# DS_FAIXA_ETARIA n
# <chr> <int>
#1 Inválido 5202
#2 16 anos_20 anos 374023
#3 21 a 24 anos 149634
#4 25 a 29 anos 198848
#5 30 a 34 anos 238406
#6 35 a 39 anos 265509
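
For comparison, here is a minimal sketch of the plain group_by() plus summarise() route the question asks about. It assumes the five labels to merge are known in advance; the names to_merge and result are illustrative and not from the original post. The label is recoded first, and a factor in order of first appearance keeps the rows in their original order.

```r
library(dplyr)

mydf <- tibble(
  DS_FAIXA_ETARIA = c("Inválido", "16 anos", "17 anos", "18 anos", "19 anos",
                      "20 anos", "21 a 24 anos", "25 a 29 anos",
                      "30 a 34 anos", "35 a 39 anos"),
  n = c(5202L, 48253L, 67401L, 79398L, 88233L, 90738L,
        149634L, 198848L, 238406L, 265509L)
)

# Labels to collapse into one group (assumed to be known up front)
to_merge <- c("16 anos", "17 anos", "18 anos", "19 anos", "20 anos")

result <- mydf %>%
  # Recode the five single-year labels to the new group label
  mutate(DS_FAIXA_ETARIA = if_else(DS_FAIXA_ETARIA %in% to_merge,
                                   "16 a 20 anos", DS_FAIXA_ETARIA),
         # Levels in order of first appearance, so summarise() keeps row order
         DS_FAIXA_ETARIA = factor(DS_FAIXA_ETARIA,
                                  levels = unique(DS_FAIXA_ETARIA))) %>%
  group_by(DS_FAIXA_ETARIA) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  mutate(DS_FAIXA_ETARIA = as.character(DS_FAIXA_ETARIA))

result
```

Unlike the slice_tail approach, this keeps the Inválido row, and it does not depend on the rows to merge being adjacent.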

Related

Conditional subtraction in R data frame

I have a fairly straightforward need, but I can't find a previously asked question that is similar enough. I've been trying with dplyr, but can't figure it out.
julian year
088 22
049 19
041 22
105 18
125 22
245 20
What I want is for each value where data$julian < 105, subtract '1' from data$year, so that
julian year
088 21
049 18
041 21
105 18
125 22
245 20
The OP asked about using dplyr, so here is one option with dplyr:
library(dplyr)
df1 <- df1 %>%
mutate(year = case_when(as.numeric(julian) < 105 ~ year -1,
TRUE ~ as.numeric(year)))
-output
df1
julian year
1 088 21
2 049 18
3 041 21
4 105 18
5 125 22
6 245 20
data
df1 <- structure(list(julian = c("088", "049", "041", "105", "125",
"245"), year = c(22L, 19L, 22L, 18L, 22L, 20L)), row.names = c(NA,
-6L), class = "data.frame")
Another option with base R:
idx <- as.numeric(df$julian) < 105  # julian is character, so convert before comparing
df$year[idx] <- df$year[idx] - 1
Output
julian year
1 088 21
2 049 18
3 041 21
4 105 18
5 125 22
6 245 20
Data
df <- structure(list(julian = c("088", "049", "041", "105", "125",
"245"), year = c(22L, 19L, 22L, 18L, 22L, 20L)), row.names = c(NA,
-6L), class = "data.frame")
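
A short sketch of why the numeric conversion matters when julian is stored as text. It assumes the column is character, as the leading zeros in the printed data suggest; the variable name early is illustrative. Comparing a character value with a number coerces the number to a string, so the comparison is alphabetical and only gives the intended answer when every value is zero-padded to the same width.

```r
df <- data.frame(
  julian = c("088", "049", "041", "105", "125", "245"),
  year = c(22L, 19L, 22L, 18L, 22L, 20L),
  stringsAsFactors = FALSE
)

# Character vs number: 105 is coerced to "105" and compared as text
unpadded_cmp <- "88" < 105               # FALSE, because "8" sorts after "1"
numeric_cmp  <- as.numeric("88") < 105   # TRUE, the intended comparison

# Convert once, then subtract 1 from year only where julian < 105
early <- as.numeric(df$julian) < 105
df$year[early] <- df$year[early] - 1L

df
```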

How to pipe a filter and a select for a data.frame

I have data that looks like this.
Sample data can be built using this code:
df<-structure(list(ID = c(1, 1, 1, 2, 2, 3, 3, 3), Date = c("Day 1",
"Day 7", "Day 29", "Day_8", "Day9", "Day7", "Day.1", "Day 21"
), Score = c("A", "B", "E", "D", "F", "G", "A", "B"), Pass = c("Y",
"Y", "N", "Y", "N", "N", "Y", "Y")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
How can I write a pipeline that does both the filter and the selection? The filter I want is the earliest date, and the selected columns are (ID, Date, Score). If it is doable, I would also like to clean the data a little, as it is all over the place right now. The final data might look like this:
Could anyone give me some guidance on this, if possible in both base R and the tidyverse?
My thought is:
df1 <- df %>% filter() %>% select(-Pass)
Part II:
If Date is an actual calendar date, how should I get max(Date)?
A new data set can be built using:
df2<- structure(list(Subject = c("39-903", "39-903", "39-903",
"39-903", "39-903", "39-903", "39-903", "39-903", "39-905",
"39-906", "39-907", "304-902", "301-902", "301-903", "301-904"
), DT = c("30 Apr 2019", "25 Jun 2019", "23 Jul 2019", "24 Oct 2019",
"19 Dec 2019", "27 Jan 2020", "05 Apr 2020", "29 Apr 2020", "",
"03 Dec 2018", "12 Jul 2019", "29 Apr 2020", "30 Dec 2019", "13 Jan 2020",
"8 Jun 2020")), class = "data.frame", row.names = c(NA, -15L))
I tried
df2<- df2%>% group_by(Subject)%>% mutate(Date=dmy("DT")) %>% filter (Date==max(Date)
and got the warning "All formats failed to parse. No formats found.", so my attempt did not work.
You can also try:
library(dplyr)
library(readr) # parse_number() comes from readr, not tidyr
#Code
newdf <- df %>% group_by(ID) %>%
mutate(Date=gsub('\\.','_',Date),
Val=parse_number(Date)) %>%
filter(Val==min(Val)) %>%
select(c(ID,Date,Score))
Output:
# A tibble: 3 x 3
# Groups: ID [3]
ID Date Score
<dbl> <chr> <chr>
1 1 Day 1 A
2 2 Day_8 D
3 3 Day_1 A
Update: More versatile solution:
#Code 2
newdf <- df %>% group_by(ID) %>%
mutate(Val=as.numeric(gsub("[^0-9-]", "", Date))) %>%
filter(Val==min(Val)) %>%
select(c(ID,Date,Score))
Output:
# A tibble: 3 x 3
# Groups: ID [3]
ID Date Score
<dbl> <chr> <chr>
1 1 Day 1 A
2 2 Day_8 D
3 3 Day.1 A
You can try with this:
library(dplyr)
df %>%
mutate(Date = as.numeric(stringr::str_extract(Date, "\\d+"))) %>%
group_by(ID) %>%
slice_min(Date) %>%
ungroup() %>%
select(-Pass)
#> # A tibble: 3 x 3
#> ID Date Score
#> <dbl> <dbl> <chr>
#> 1 1 1 A
#> 2 2 8 D
#> 3 3 1 A
For Date I believe it's better if you keep just the number instead of Day X.
We can use base R
df1 <- transform(df, Date = as.numeric(gsub("\\D+", "", Date)))
df1[with(df1, ave(Date, ID, FUN = min) == Date),]
# ID Date Score Pass
#1 1 1 A Y
#4 2 8 D Y
#7 3 1 A Y
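
The question also asks for some cleaning of the Date labels. A minimal sketch of one way to do that, assuming every label contains exactly one run of digits and that a uniform "Day N" form is what is wanted; the helper column Day and the name earliest are illustrative, not from the original post.

```r
library(dplyr)
library(stringr)

df <- tibble(
  ID = c(1, 1, 1, 2, 2, 3, 3, 3),
  Date = c("Day 1", "Day 7", "Day 29", "Day_8", "Day9", "Day7", "Day.1", "Day 21"),
  Score = c("A", "B", "E", "D", "F", "G", "A", "B"),
  Pass = c("Y", "Y", "N", "Y", "N", "N", "Y", "Y")
)

earliest <- df %>%
  # Pull out the day number, then rebuild a consistent label from it
  mutate(Day = as.integer(str_extract(Date, "\\d+")),
         Date = paste("Day", Day)) %>%
  group_by(ID) %>%
  filter(Day == min(Day)) %>%   # earliest day within each ID
  ungroup() %>%
  select(ID, Date, Score)

earliest
```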
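For Part II, pass the column unquoted, as dmy(DT), because dmy("DT") tries to parse the literal string "DT". If we need to keep the NA elements as well, we can use an OR (|) condition: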
library(dplyr)
library(lubridate)
df2%>%
group_by(Subject)%>%
mutate(Date=dmy(DT)) %>%
filter (Date==max(Date) |is.na(Date))
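
A variation on the same idea, sketched on a shortened copy of df2. It sorts each group by descending date and keeps the first row, which avoids calling max() on a group that contains NA. This assumes the session locale uses English month abbreviations, and that one row per Subject is wanted (ties are not kept); the name latest is illustrative.

```r
library(dplyr)
library(lubridate)

# Shortened version of df2 from the question
df2 <- data.frame(
  Subject = c("39-903", "39-903", "39-903", "39-905", "301-904"),
  DT = c("30 Apr 2019", "29 Apr 2020", "27 Jan 2020", "", "8 Jun 2020"),
  stringsAsFactors = FALSE
)

latest <- df2 %>%
  mutate(Date = dmy(DT, quiet = TRUE)) %>%        # "" becomes NA without a warning
  group_by(Subject) %>%
  arrange(desc(Date), .by_group = TRUE) %>%       # newest first, NA sorted last
  slice(1) %>%                                    # first row of each Subject
  ungroup()

latest
```

A Subject whose only date is missing still appears in the result, with NA in the Date column.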

Density plot using population data for a specific year

Is it possible to create a density plot using this population data? Age_group is a categorical variable. Does it have to be numeric to create a density plot?
library(tidyverse)
df <- structure(list(year = c(1971, 1971, 1971, 1971, 1971, 1971, 1971,
1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971
), age_group = structure(2:19, .Label = c("All ages", "0 to 4 years",
"5 to 9 years", "10 to 14 years", "15 to 19 years", "20 to 24 years",
"25 to 29 years", "30 to 34 years", "35 to 39 years", "40 to 44 years",
"45 to 49 years", "50 to 54 years", "55 to 59 years", "60 to 64 years",
"65 to 69 years", "70 to 74 years", "75 to 79 years", "80 to 84 years",
"85 to 89 years", "90 to 94 years", "95 to 99 years", "100 years and over",
"Median age"), class = "factor"), population = c(1836149, 2267794,
2329323, 2164092, 1976914, 1643264, 1342744, 1286302, 1284154,
1252545, 1065664, 964984, 785693, 626521, 462065, 328583, 206174,
101117)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-18L))
You can convert the text to numeric ranges, e.g.:
library(tidyverse) # if not already loaded
df %>%
# These extract the 1st and 3rd "word" of age_group
# Uses stringr::word(), loaded as part of tidyverse
mutate(age_min = word(age_group, 1) %>% as.numeric,
age_max = word(age_group, 3) %>% as.numeric) %>%
head
# A tibble: 6 x 5
year age_group population age_min age_max
<dbl> <fct> <dbl> <dbl> <dbl>
1 1971 0 to 4 years 1836149 0 4
2 1971 5 to 9 years 2267794 5 9
3 1971 10 to 14 years 2329323 10 14
4 1971 15 to 19 years 2164092 15 19
5 1971 20 to 24 years 1976914 20 24
6 1971 25 to 29 years 1643264 25 29
From that, you could display the data in ggplot in a number of ways, for example with age_min on the x axis:
... %>%
ggplot(aes(age_min, population)) +
geom_step()
... %>%
ggplot(aes(age_min, population)) +
geom_col()
... %>%
ggplot(aes(age_min, y = population)) +
geom_density(stat = "identity")
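
Putting the pieces together, here is a self-contained sketch on the first four age groups only. The midpoint column age_mid and the bar width of 5 are assumptions made for this illustration; they are not part of the original answer.

```r
library(dplyr)
library(stringr)
library(ggplot2)

# First four rows of the question's data
pop <- tibble(
  age_group = c("0 to 4 years", "5 to 9 years", "10 to 14 years", "15 to 19 years"),
  population = c(1836149, 2267794, 2329323, 2164092)
)

pop_num <- pop %>%
  mutate(age_min = as.numeric(word(age_group, 1)),   # 1st word: lower bound
         age_max = as.numeric(word(age_group, 3)),   # 3rd word: upper bound
         age_mid = (age_min + age_max) / 2)          # centre of each 5-year band

# One bar per age band, centred on the band midpoint
p <- ggplot(pop_num, aes(age_mid, population)) +
  geom_col(width = 5)

p
```

Because the data are already counts per band, this draws the population distribution directly rather than estimating a density from individual ages.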

aggregate all columns in r

This is how I total up their grades (there are more columns than shown here):
test<-with(data,table(Student,Subject))
test <- cbind(test)
test <- as.data.frame(test)
row.names Maths Science English Geography History Art ...
1 George 64 70 40 50 60 70
2 Anna 40 20 65 54 30 50
3 Scott 30 64 30 40 50 20
...
Summarize <- data.frame(
aggregate(.~Maths, data=test, min),
aggregate(.~English, data=test, max),
aggregate(.~Science, data=test, mean))
Is there a way to select all the columns at once and aggregate them (range and mean) into a new data frame?
Min Mean Max
Maths 30 60 90
Science
English
Geography
...
Thanks in advance !
Try:
library(dplyr)
library(tidyr)
df %>%
summarise_each(funs(min=min(., na.rm=TRUE), max=max(., na.rm=TRUE),
mean=mean(., na.rm=TRUE)), -Student) %>%
gather(Var, Value, Maths_min:Art_mean) %>%
separate(Var, c("Subject", "Var")) %>%
spread(Var, Value)
# Subject max mean min
#1 Art 70 46.66667 20
#2 English 65 45.00000 30
#3 Geography 54 48.00000 40
#4 History 60 46.66667 30
#5 Maths 64 44.66667 30
#6 Science 70 51.33333 20
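
summarise_each() and funs() are deprecated in recent dplyr releases. A rough equivalent using pivot_longer() is sketched below; it reshapes to one row per student and subject first, then summarises per subject. The column names Subject and Score and the object name res are choices made for this sketch.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Student = c("George", "Anna", "Scott"),
  Maths = c(64L, 40L, 30L), Science = c(70L, 20L, 64L),
  English = c(40L, 65L, 30L), Geography = c(50L, 54L, 40L),
  History = c(60L, 30L, 50L), Art = c(70L, 50L, 20L),
  stringsAsFactors = FALSE
)

res <- df %>%
  # Long format: one row per (Student, Subject) pair
  pivot_longer(-Student, names_to = "Subject", values_to = "Score") %>%
  group_by(Subject) %>%
  summarise(Min = min(Score, na.rm = TRUE),
            Mean = mean(Score, na.rm = TRUE),
            Max = max(Score, na.rm = TRUE),
            .groups = "drop")

res
```

This works for any number of subject columns, since everything except Student is pivoted.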
Update
Or you could use aggregate with melt
library(reshape2)
res <- aggregate(value~variable, melt(df, id="Student"),
FUN=function(x) c(Min=min(x, na.rm=TRUE), Mean=mean(x, na.rm=TRUE),
Max=max(x, na.rm=TRUE)))
res1 <- do.call(`data.frame`, res)
colnames(res1) <- gsub(".*\\.", "", colnames(res1))
res1
# variable Min Mean Max
#1 Maths 30 44.66667 64
#2 Science 20 51.33333 70
#3 English 30 45.00000 65
#4 Geography 40 48.00000 54
#5 History 30 46.66667 60
#6 Art 20 46.66667 70
Or using only base R
res2 <- do.call(`data.frame`,
aggregate(values~ind, stack(df, select=-1),
FUN=function(x) c(Min=min(x, na.rm=TRUE), Mean=mean(x, na.rm=TRUE),
Max=max(x, na.rm=TRUE))))
colnames(res2) <- gsub(".*\\.", "", colnames(res2))
res2
# ind Min Mean Max
#1 Art 20 46.66667 70
#2 English 30 45.00000 65
#3 Geography 40 48.00000 54
#4 History 30 46.66667 60
#5 Maths 30 44.66667 64
#6 Science 20 51.33333 70
data
df <- structure(list(Student = c("George", "Anna", "Scott"), Maths = c(64L,
40L, 30L), Science = c(70L, 20L, 64L), English = c(40L, 65L,
30L), Geography = c(50L, 54L, 40L), History = c(60L, 30L, 50L
), Art = c(70L, 50L, 20L)), .Names = c("Student", "Maths", "Science",
"English", "Geography", "History", "Art"), class = "data.frame", row.names = c(NA,
-3L))

strsplit by variable separator

I have some strings of data separated by " " that need to be split into columns. Is there an easy way to split the data at every nth separator, where n varies? For example, the first value in x tells you that the first 4 values in y correspond to the first trial. The second value in x tells you that the next 3 values in y correspond to the second trial, and so on.
x <- c("4 3 3", "3 3 3 2 3")
y <- c("110 88 77 66 55 44 33 22 33 44 11 22 11", "44 55 66 33 22 11 22 33 44 55 66 77 88 66 77 88")
The goal is something like this:
structure(list(session = 1:2, trial.1 = structure(1:2, .Label = c("110 88 77",
"44 55 66"), class = "factor"), trial.2 = structure(c(2L, 1L), .Label = c("33 22 11",
"66 55 44"), class = "factor"), trial.3 = structure(1:2, .Label = c("22 33 44",
"23 33 44"), class = "factor"), trial.4 = structure(c(NA, 1L), .Label = "55 66", class = "factor"),
trial.5 = structure(c(NA, 1L), .Label = "77 88 66", class = "factor")), .Names = c("session",
"trial.1", "trial.2", "trial.3", "trial.4", "trial.5"), class = "data.frame", row.names = c(NA,
-2L))
Ideally, any extra values from y need to be dropped from the resulting data frame, and the uneven row lengths should be filled with NA's.
This may be useful:
dumx<-strsplit(x,' ')
dumy<-strsplit(y,' ')
dumx<-lapply(dumx,function(x)(cumsum(as.numeric(x))))
dumx<-lapply(dumx,function(x){mapply(seq,c(1,x+1)[-(length(x)+1)],x,SIMPLIFY=FALSE)})
ans<-mapply(function(x,y){lapply(x,function(w,z){z[w]},z=y)},dumx,dumy)
I will leave it to you to convert the resulting list to a data frame :)
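
One possible way to finish the job, sketched in base R. It follows the prose description in the question (the first trial takes the first 4 values), not the 3-value example in the goal structure. It assumes y always holds at least as many values as x asks for; the names sizes, values, trials and result are illustrative.

```r
x <- c("4 3 3", "3 3 3 2 3")
y <- c("110 88 77 66 55 44 33 22 33 44 11 22 11",
       "44 55 66 33 22 11 22 33 44 55 66 77 88 66 77 88")

sizes  <- lapply(strsplit(x, " "), as.integer)  # trial lengths per session
values <- strsplit(y, " ")                      # individual values per session

# For each session, cut the values into consecutive chunks of the given sizes
trials <- Map(function(n, v) {
  trial_id <- rep(seq_along(n), times = n)  # trial index for each used value
  v <- v[seq_along(trial_id)]               # drop any extra values from y
  as.vector(tapply(v, trial_id, paste, collapse = " "))
}, sizes, values)

# Pad shorter sessions with NA so every row has the same number of trials
max_trials <- max(lengths(trials))
padded <- lapply(trials, function(t) c(t, rep(NA_character_, max_trials - length(t))))

mat <- do.call(rbind, padded)
colnames(mat) <- paste0("trial.", seq_len(max_trials))
result <- data.frame(session = seq_along(trials), mat, stringsAsFactors = FALSE)

result
```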

Resources