Aggregating Dataset to "ignore" categorical variable - r

I have a dataset which is structured like this:
Neighborhood, var1, var2, COUNTRY, DAY, categ 1, categ 2
1 700 724 AL 0 YES YES
1 500 200 FR 0 YES NO
....
1 701 659 IT 1 NO YES
1 791 669 IT 1 NO YES
....
2 239 222 GE 0 YES NO
and so on...
So the hierarchy is "Neighborhood > DAY > COUNTRY": for every neighborhood, for every day, and for every country I have an observation of var1, var2, categ1 and categ2.
I'm not interested in analyzing the country for the moment, so I want to aggregate over the COUNTRY field by summing var1 and var2 (the categorical variables categ1 and categ2 are not influenced by the country), and get a dataset that, for each Neighborhood and each DAY, gives me var1, var2, categ1 and categ2.
I'm quite new to R-programming and basically don't know a lot of packages (I would write a program in c++, but I'm forcing myself to learn R)...
So do you have any idea on how to do this?
Data
df1 <- structure(list(Neighborhood = c(1L, 1L, 1L, 1L, 2L),
var1 = c(700L, 500L, 701L, 791L, 239L),
var2 = c(724L, 200L, 659L, 669L, 222L),
COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
DAY = c(0L, 0L, 1L, 1L, 0L),
`categ 1` = c("YES", "YES", "NO", "NO", "YES"),
`categ 2` = c("YES", "NO", "YES", "YES", "NO")),
.Names = c("Neighborhood", "var1", "var2", "COUNTRY", "DAY", "categ 1", "categ 2"),
class = "data.frame", row.names = c(NA, -5L))
EDIT: @akrun
When I try your command, the result is:
aggregate(.~Neighborhood+DAY+COUNTRY, data= df1[!grepl("^categ", names(df1))], mean)
Neighborhood, DAY, COUNTRY, var1, var2
1 1 0 AL 700 724
2 1 0 FR 500 200
3 2 0 GE 239 222
4 1 1 IT 746 664
But (in this example) what I would like to have is:
Neighborhood, DAY, var1, var2
1 1 0 1200 924 //where var1=700+500....
2 1 1 1492 1328
3 2 0 239 222

If we are not interested in the 'categ' columns, we can grep them out and use aggregate
aggregate(.~Neighborhood+DAY, data= df1[!grepl("^(categ|COUNTRY)", names(df1))], sum)
# Neighborhood DAY var1 var2
#1 1 0 1200 924
#2 2 0 239 222
#3 1 1 1492 1328
Or using dplyr
library(dplyr)
df1 %>%
group_by(Neighborhood, DAY) %>%
# across() replaces the deprecated summarise_each()/funs() idiom
summarise(across(matches("^var"), sum))
# Neighborhood DAY var1 var2
# (int) (int) (int) (int)
#1 1 0 1200 924
#2 1 1 1492 1328
#3 2 0 239 222
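The question also asks to keep categ1 and categ2 in the result. A minimal sketch of one way to do that with dplyr, assuming the categorical values are meant to be constant within each Neighborhood/DAY group, is to sum the numeric columns and take the first categorical value per group. Note that in the sample data `categ 2` is not actually constant for Neighborhood 1, DAY 0, so `first()` here is an assumption about the intended behaviour, not something the original answer states.

```r
library(dplyr)

# Sample data from the question (check.names = FALSE keeps the spaces in "categ 1")
df1 <- data.frame(
  Neighborhood = c(1L, 1L, 1L, 1L, 2L),
  var1 = c(700L, 500L, 701L, 791L, 239L),
  var2 = c(724L, 200L, 659L, 669L, 222L),
  COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
  DAY = c(0L, 0L, 1L, 1L, 0L),
  `categ 1` = c("YES", "YES", "NO", "NO", "YES"),
  `categ 2` = c("YES", "NO", "YES", "YES", "NO"),
  check.names = FALSE
)

res <- df1 %>%
  group_by(Neighborhood, DAY) %>%
  summarise(
    across(matches("^var"), sum),         # sum var1 and var2 over countries
    across(starts_with("categ"), first),  # assumption: keep the first value per group
    .groups = "drop"
  )

res
```

If the categorical columns can genuinely differ within a group, adding them to `group_by()` instead would keep each combination as its own row.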

Related

Aggregate over consecutive years

New to r and I'm having difficulty getting the counts I'm after. I have a dataset that contains several columns of various counts per year. Here is an example:
huc_code_eight  year  count_1  count_2
6010105         1946  4        4
6010105         1947  6        0
6010105         1948  2        0
6010105         1957  4        4
6020001         1957  2        0
8010203         1957  0        0
I want to aggregate these counts based upon consecutive years, grouped by huc_code_eight. The expected output would look like:
huc_code_eight  year         count_1  count_2
6010105         1946 - 1948  12       4
6010105         1957         4        4
6020001         1957         2        0
8010203         1957         0        0
I would like to avoid iterating through the data and summing these manually, but, though I've found many examples of aggregating in r, I've been unable to successfully refactor them to fit my use case.
Any help would be greatly appreciated!
Here is a data.table approach
Convert to a data.table, compute the difference from the previous year within each huc_code_eight (set to 1 where it is NA), and create a run-length id:
dat <- setDT(dat)[, yr:= year-shift(year),by=huc_code_eight][is.na(yr), yr:=1][,grp:=rleid(huc_code_eight,yr)]
Create the character year (a range where the group has more than one row) and the sum of counts, by id:
dat[,.(
year = fifelse(.N>1,paste0(min(year),"-",max(year)),paste0(year, collapse="")),
count_1=sum(count_1),count_2=sum(count_2)),
by=.(grp,huc_code_eight)][,grp:=NULL][]
Output:
huc_code_eight year count_1 count_2
1: 6010105 1946-1948 12 4
2: 6010105 1957 4 4
3: 6020001 1957 2 0
4: 8010203 1957 0 0
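One caveat with the run-length id above: it groups rows by the value of the year difference, so two equal gaps larger than 1 in a row (for example 1946, 1950, 1954) would end up in the same group. A sketch of a variant that starts a new group whenever the gap to the previous year is not exactly 1 is shown below. It rebuilds the sample data as `dat` and assumes the rows can be sorted by huc_code_eight and year.

```r
library(data.table)

dat <- data.table(
  huc_code_eight = c(6010105L, 6010105L, 6010105L, 6010105L, 6020001L, 8010203L),
  year    = c(1946L, 1947L, 1948L, 1957L, 1957L, 1957L),
  count_1 = c(4L, 6L, 2L, 4L, 2L, 0L),
  count_2 = c(4L, 0L, 0L, 4L, 0L, 0L)
)

# Sort so that diff(year) compares neighbouring years within each code
setorder(dat, huc_code_eight, year)

# A new group starts at the first row and wherever the gap is not exactly 1
dat[, grp := cumsum(c(TRUE, diff(year) != 1L)), by = huc_code_eight]

res <- dat[, .(
  year    = if (.N > 1L) paste(min(year), max(year), sep = "-") else as.character(year),
  count_1 = sum(count_1),
  count_2 = sum(count_2)
), by = .(huc_code_eight, grp)][, grp := NULL][]

res
```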
We can create a grouping column based on difference of adjacent elements in 'year' along with 'huc_code_eight' and then summarise
library(dplyr)
library(stringr)
df1 %>%
group_by(huc_code_eight) %>%
mutate(year_grp = cumsum(c(TRUE, diff(year) != 1))) %>%
group_by(year_grp, .add = TRUE) %>%
summarise(year = if(n() > 1)
str_c(range(year), collapse = ' - ') else as.character(year),
across(starts_with('count'), ~ sum(.x, na.rm = TRUE)), .groups = 'drop') %>%
dplyr::select(-year_grp)
Output:
# A tibble: 4 × 4
huc_code_eight year count_1 count_2
<int> <chr> <int> <int>
1 6010105 1946 - 1948 12 4
2 6010105 1957 4 4
3 6020001 1957 2 0
4 8010203 1957 0 0
data
df1 <- structure(list(huc_code_eight = c(6010105L, 6010105L, 6010105L,
6010105L, 6020001L, 8010203L), year = c(1946L, 1947L, 1948L,
1957L, 1957L, 1957L), count_1 = c(4L, 6L, 2L, 4L, 2L, 0L), count_2 = c(4L,
0L, 0L, 4L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))

How to remove duplicates if specific column has value in r

I need to delete some rows in my dataset based on the given condition.
Kindly go through the sample data for reference.
ID Date Dur
123 01/05/2000 3
123 08/04/2002 6
564 04/04/2012 2
741 01/08/2011 5
789 02/03/2009 1
789 08/01/2010 NA
789 05/05/2011 NA
852 06/06/2015 3
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
My main concern is the Dur column. Within each ID group that has more than one row, I have to delete the rows where Dur is not NA.
i.e. IDs 123, 789 and 852 have more than one row, and some of those rows have a Dur value. I need to remove those rows, which means all rows of ID 123 and the first row of IDs 789 and 852.
I don't want to delete IDs that have a single row with a Dur value (564, 741), or any rows where Dur is NA.
Expected Output:
ID Date Dur
564 04/04/2012 2
741 01/08/2011 5
789 08/01/2010 NA
789 05/05/2011 NA
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
Kindly suggest a code to solve the issue.
Thanks in Advance!
One way would be to select rows where the number of rows in the group is 1 or where Dur is NA.
This can be written in dplyr as :
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 1 | is.na(Dur))
# ID Date Dur
# <int> <chr> <int>
#1 564 04/04/2012 2
#2 741 01/08/2011 5
#3 789 08/01/2010 NA
#4 789 05/05/2011 NA
#5 852 03/02/2016 NA
#6 155 03/02/2008 NA
#7 155 01/01/2009 NA
#8 159 07/07/2008 NA
Using data.table :
library(data.table)
setDT(df)[, .SD[.N == 1 | is.na(Dur)], ID]
and base R :
subset(df, ave(is.na(Dur), ID, FUN = function(x) length(x) == 1 | x))
data
df <- structure(list(ID = c(123L, 123L, 564L, 741L, 789L, 789L, 789L,
852L, 852L, 155L, 155L, 159L), Date = c("01/05/2000", "08/04/2002",
"04/04/2012", "01/08/2011", "02/03/2009", "08/01/2010", "05/05/2011",
"06/06/2015", "03/02/2016", "03/02/2008", "01/01/2009", "07/07/2008"
), Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -12L))
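To make the condition easier to follow, here is a step-by-step base R sketch of the same rule (keep a row if its ID has only one row, or if Dur is NA). The intermediate names `group_size` and `keep` are illustrative and not part of the answers above.

```r
df <- data.frame(
  ID = c(123L, 123L, 564L, 741L, 789L, 789L, 789L, 852L, 852L, 155L, 155L, 159L),
  Date = c("01/05/2000", "08/04/2002", "04/04/2012", "01/08/2011", "02/03/2009",
           "08/01/2010", "05/05/2011", "06/06/2015", "03/02/2016", "03/02/2008",
           "01/01/2009", "07/07/2008"),
  Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)
)

# Number of rows in each ID group, repeated for every row of that group
group_size <- ave(seq_along(df$ID), df$ID, FUN = length)

# Keep single-row IDs as they are; in larger groups keep only the NA rows
keep <- group_size == 1 | is.na(df$Dur)

res <- df[keep, ]
res
```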
We can use .I in data.table
library(data.table)
setDT(df1)[df1[, .I[.N == 1| is.na(Dur)], ID]$V1]

How to undummy a dataset with R

This is the library I am using for creating dummies
install.packages("fastDummies")
library(fastDummies)
This is the dataset
winners <- data.frame(
city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"),
year = c(1990, 2000, 1990),
crime = 1:3)
Let's then create super dummies out of these cities:
dummy_cols(winners, select_columns = c("city"))
The results are
city year crime city_SaoPaulito city_NewAmsterdam city_BeatifulCow
1 SaoPaulito 1990 1 1 0 0
2 NewAmsterdam 2000 2 0 1 0
3 BeatifulCow 1990 3 0 0 1
So the question is: I want to return to the previous dataset. Any ideas?
Thanks in advance!
We can use dcast
library(data.table)
dcast(setDT(winners), crime ~ city, length)
If we need to get the input, it would be
subset(df1, select = 1:3)
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
Or with melt
melt(setDT(df1), measure = patterns("_"))[value == 1, .(city, year, crime)]
# city year crime
#1: SaoPaulito 1990 1
#2: NewAmsterdam 2000 2
#3: BeatifulCow 1990 3
data
df1 <- structure(list(city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"
), year = c(1990L, 2000L, 1990L), crime = 1:3, city_SaoPaulito = c(1L,
0L, 0L), city_NewAmsterdam = c(0L, 1L, 0L), city_BeatifulCow = c(0L,
0L, 1L)), class = "data.frame", row.names = c("1", "2", "3"))
If you are going to have only one city as 1 in each row, you can just skip the dummy columns
df[, 1:3]
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
If you can have multiple cities, one way using dplyr and tidyr::gather is
library(dplyr)
df %>%
tidyr::gather(key, value, starts_with("city_")) %>%
filter(value == 1) %>%
select(-value, -key)
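The answers above rely on the original city column still being present. If only the dummy columns are left, a base R sketch for rebuilding the column could look like the following. It assumes exactly one dummy is 1 in each row and that the dummy columns share the "city_" prefix that fastDummies produced in the question. The names `dummies`, `dummy_names` and `restored` are illustrative.

```r
# Data with the original city column already dropped
dummies <- data.frame(
  year = c(1990L, 2000L, 1990L),
  crime = 1:3,
  city_SaoPaulito   = c(1L, 0L, 0L),
  city_NewAmsterdam = c(0L, 1L, 0L),
  city_BeatifulCow  = c(0L, 0L, 1L)
)

dummy_names <- grep("^city_", names(dummies), value = TRUE)

# For each row, find the position of the dummy column holding the 1
idx <- max.col(as.matrix(dummies[dummy_names]), ties.method = "first")

restored <- data.frame(
  city = sub("^city_", "", dummy_names[idx]),       # strip the prefix to get the level
  dummies[setdiff(names(dummies), dummy_names)],   # keep the non-dummy columns
  stringsAsFactors = FALSE
)

restored
```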

Problems of joining datasets on R

I have a dataset containing variables and a quantity of goods sold: for some days, however, there are no values.
I created a dataset with all 0 values in sales and all NA in the rest. How can I add those lines to the initial dataset?
At the moment, I have this:
sales
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
4 1 2018 11 0 987
sales.NA
day month year employees holiday sales
1 1 2018 NA NA 0
2 1 2018 NA NA 0
3 1 2018 NA NA 0
4 1 2018 NA NA 0
I would like to create a new dataset, inserting the days where I have no observations, value 0 to sales, and NA on all other variables. Like this
new.data
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
3 1 2018 NA NA 0
4 1 2018 11 0 987
I tried something like this
merge(sales.NA,sales, all.y=T, by = c("day","month","year"))
But it does not work
Using dplyr, you could use a "right_join". For example:
library(dplyr)
sales <- data.frame(day = c(1,2,4),
month = c(1,1,1),
year = c(2018, 2018, 2018),
employees = c(14, 25, 11),
holiday = c(0,1,0),
sales = c(1058, 2174, 987)
)
sales.NA <- data.frame(day = c(1,2,3,4),
month = c(1,1,1,1),
year = c(2018,2018,2018, 2018)
)
right_join(sales, sales.NA)
This leaves you with
day month year employees holiday sales
1 1 1 2018 14 0 1058
2 2 1 2018 25 1 2174
3 3 1 2018 NA NA NA
4 4 1 2018 11 0 987
This leaves NA in sales where you want 0, but that could be fixed by including the sales data in sales.NA, or you could use "tidyr"
library(tidyr)
right_join(sales, sales.NA) %>% mutate(sales = replace_na(sales, 0))
Here is another data.table solution:
library(data.table)
jvars = c("day","month","year")
merge(sales.NA[, ..jvars], sales, by = jvars, all.x = TRUE)[is.na(sales), sales := 0L][]
day month year employees holiday sales
1: 1 1 2018 14 0 1058
2: 2 1 2018 25 1 2174
3: 3 1 2018 NA NA 0
4: 4 1 2018 11 0 987
Or with some neater syntax:
sales[sales.NA[, ..jvars], on = jvars][is.na(sales), sales := 0][]
Reproducible data:
sales <- structure(list(day = c(1L, 2L, 4L), month = c(1L, 1L, 1L), year = c(2018L,
2018L, 2018L), employees = c(14L, 25L, 11L), holiday = c(0L,
1L, 0L), sales = c(1058L, 2174L, 987L)), row.names = c(NA, -3L
), class = c("data.table", "data.frame"))
sales.NA <- structure(list(day = 1:4, month = c(1L, 1L, 1L, 1L), year = c(2018L,
2018L, 2018L, 2018L), employees = c(NA, NA, NA, NA), holiday = c(NA,
NA, NA, NA), sales = c(0L, 0L, 0L, 0L)), row.names = c(NA, -4L
), class = c("data.table", "data.frame"))
This answer uses the data.table package, since I am more familiar with its syntax, but regular data.frames should work pretty much the same way. I would also switch to a proper date format, which will make life easier for you down the line.
In fact, this way you would not need the sales.NA table at all: every day without sales data shows up with NAs after the join against the full date sequence.
library(data.table)
dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"),by = "day" ))
dt.sales <- data.table(day = c(1,2,4)
, month = c(1,1,1)
, year = c(2018,2018,2018)
, employees = c(14, 25, 11)
, holiday = c(0,1,0)
, sales = c(1058, 2174, 987)
)
dt.sales[, Date := as.Date(paste(year,month,day, sep = "-")) ]
merge( x = dt.dates
, y = dt.sales
, by.x = "Date"
, by.y = "Date"
, all.x = TRUE
)
> Date day month year employees holiday sales
1: 2018-01-01 1 1 2018 14 0 1058
2: 2018-01-02 2 1 2018 25 1 2174
3: 2018-01-03 NA NA NA NA NA NA
4: 2018-01-04 4 1 2018 11 0 987
...
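For completeness, the merge() attempt from the question can also be made to work in base R. With all.y = TRUE and sales as the second table, only the days present in sales are guaranteed to be kept, and the non-key columns of both tables get duplicated with .x/.y suffixes. A sketch that avoids both problems merges only the key columns of the full day list and keeps all of its rows. The name `all_days` is illustrative and stands in for the key columns of sales.NA.

```r
sales <- data.frame(
  day = c(1L, 2L, 4L), month = 1L, year = 2018L,
  employees = c(14L, 25L, 11L), holiday = c(0L, 1L, 0L),
  sales = c(1058L, 2174L, 987L)
)

# Only the key columns of the "all days" table are needed
all_days <- data.frame(day = 1:4, month = 1L, year = 2018L)
keys <- c("day", "month", "year")

# all.x = TRUE keeps every day; missing days get NA in the other columns
new.data <- merge(all_days, sales, by = keys, all.x = TRUE)

# Days without an observation get 0 sales
new.data$sales[is.na(new.data$sales)] <- 0L

new.data
```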

Using dplyr to summarize by multiple groups

I'm trying to use dplyr to summarize a dataset based on 2 groups: "year" and "area". This is what the dataset looks like:
Year Area Num
1 2000 Area 1 99
2 2001 Area 3 85
3 2000 Area 1 60
4 2003 Area 2 90
5 2002 Area 1 40
6 2002 Area 3 30
7 2004 Area 4 10
...
The end result should look something like this:
Year Area Mean
1 2000 Area 1 100
2 2000 Area 2 80
3 2000 Area 3 89
4 2001 Area 1 80
5 2001 Area 2 85
6 2001 Area 3 59
7 2002 Area 1 90
8 2002 Area 2 88
...
Excuse the values for "mean", they're made up.
The code for the example dataset:
df <- structure(list(
Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
Area = structure(c(1L, 3L, 1L, 2L, 1L, 3L, 4L),
.Label = c("Area 1", "Area 2", "Area 3", "Area 4"),
class = "factor"),
Num = structure(c(7L, 5L, 4L, 6L, 3L, 2L, 1L),
.Label = c("10", "30", "40", "60", "85", "90", "99"),
class = "factor")),
.Names = c("Year", "Area", "Num"),
class = "data.frame", row.names = c(NA, -7L))
df$Num <- as.numeric(df$Num)
Things I've tried:
df.meanYear <- df %>%
group_by(Year) %>%
group_by(Area) %>%
summarize_each(funs(mean(Num)))
But it just replaces every value with the mean, instead of the intended result.
If possible please do provide alternate means (i.e. non-dplyr) methods, because I'm still new with R.
Is this what you are looking for?
library(dplyr)
df <- group_by(df, Year, Area)
df <- summarise(df, avg = mean(Num))
We can use data.table
library(data.table)
setDT(df)[, .(avg = mean(Num)) , by = .(Year, Area)]
I had a similar problem in my code; I fixed it with the .groups argument:
df %>%
group_by(Year,Area) %>%
summarise(avg = mean(Num), .groups="keep")
Also verified with the added example (as.numeric corrupted Num values, so I used as.numeric(as.character(df$Num)) to fix it):
Year Area avg
<dbl> <fct> <dbl>
1 2000 Area 1 79.5
2 2001 Area 3 85
3 2002 Area 1 40
4 2002 Area 3 30
5 2003 Area 2 90
6 2004 Area 4 10
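Since the question also asks for a non-dplyr alternative, here is a base R sketch using aggregate() with a formula. It rebuilds the example data with Num as a factor, as in the question, and converts it through as.character() first so that the real values are used rather than the factor codes.

```r
df <- data.frame(
  Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
  Area = factor(c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1", "Area 3", "Area 4")),
  Num  = factor(c("99", "85", "60", "90", "40", "30", "10"))
)

# as.numeric() on a factor returns the level codes, so go through character
df$Num <- as.numeric(as.character(df$Num))

# Mean of Num for every Year/Area combination present in the data
res <- aggregate(Num ~ Year + Area, data = df, FUN = mean)
res
```

Only combinations that occur in the data appear in the result, which matches the dplyr and data.table outputs above.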

Resources