How to undummy a dataset with R

This is the library I am using for creating dummies
install.packages("fastDummies")
library(fastDummies)
This is the dataset
winners <- data.frame(
city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"),
year = c(1990, 2000, 1990),
crime = 1:3)
Let's then create super dummies out of these cities:
dummy_cols(winners, select_columns = c("city"))
The results are
city year crime city_SaoPaulito city_NewAmsterdam city_BeatifulCow
1 SaoPaulito 1990 1 1 0 0
2 NewAmsterdam 2000 2 0 1 0
3 BeatifulCow 1990 3 0 0 1
So the question is: how do I get back to the previous dataset? Any ideas?
Thanks in advance!

We can use dcast
library(data.table)
dcast(setDT(winners), crime ~ city, length)
If we need to get the input, it would be
subset(df1, select = 1:3)
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
Or with melt
melt(setDT(df1), measure = patterns("_"))[value == 1, .(city, year, crime)]
# city year crime
#1: SaoPaulito 1990 1
#2: NewAmsterdam 2000 2
#3: BeatifulCow 1990 3
data
df1 <- structure(list(city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"
), year = c(1990L, 2000L, 1990L), crime = 1:3, city_SaoPaulito = c(1L,
0L, 0L), city_NewAmsterdam = c(0L, 1L, 0L), city_BeatifulCow = c(0L,
0L, 1L)), class = "data.frame", row.names = c("1", "2", "3"))

If you are going to have only one city as 1 in each row, you can just skip the dummy columns
df[, 1:3]
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
If a row can have multiple cities, one way using dplyr and tidyr::gather is
library(dplyr)
df %>%
tidyr::gather(key, value, starts_with("city_")) %>%
filter(value == 1) %>%
select(-value, -key)
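The answers above rely on the original city column still being present. As a minimal sketch for the case where only the dummy columns remain, the base R code below rebuilds city from the dummy column names. It assumes exactly one dummy equals 1 in each row; the data frame `dummies` and the names `dummy_names`, `hit` and `restored` are made up for illustration.

```r
# Hypothetical input: only the dummy columns survive (no original city column).
dummies <- data.frame(
  year = c(1990, 2000, 1990),
  crime = 1:3,
  city_SaoPaulito = c(1L, 0L, 0L),
  city_NewAmsterdam = c(0L, 1L, 0L),
  city_BeatifulCow = c(0L, 0L, 1L)
)

# Names of the dummy columns, identified by their "city_" prefix.
dummy_names <- grep("^city_", names(dummies), value = TRUE)

# For each row, the index of the dummy column that holds the 1.
hit <- max.col(as.matrix(dummies[dummy_names]), ties.method = "first")

# Strip the prefix to recover the city and keep the non-dummy columns.
restored <- data.frame(
  city = sub("^city_", "", dummy_names[hit]),
  dummies[setdiff(names(dummies), dummy_names)],
  stringsAsFactors = FALSE
)
print(restored)
```

With the example data, restored has the columns city, year and crime. If a row could contain several 1s, this sketch would silently keep only the first match, so it only fits the one-city-per-row case.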

Related

Aggregate over consecutive years

I'm new to R and having difficulty getting the counts I'm after. I have a dataset that contains several columns of various counts per year. Here is an example:
huc_code_eight  year  count_1  count_2
6010105         1946  4        4
6010105         1947  6        0
6010105         1948  2        0
6010105         1957  4        4
6020001         1957  2        0
8010203         1957  0        0
I want to aggregate these counts based upon consecutive years, grouped by huc_code_eight. The expected output would look like:
huc_code_eight  year         count_1  count_2
6010105         1946 - 1948  12       4
6010105         1957         4        4
6020001         1957         2        0
8010203         1957         0        0
I would like to avoid iterating through the data and summing these manually. Although I've found many examples of aggregating in R, I've been unable to adapt them to my use case.
Any help would be greatly appreciated!
Here is a data.table approach
Set as data.table, then, within each huc_code_eight, start a new group whenever the year is not exactly one more than the previous year (rows are assumed to be sorted by year within each code)
library(data.table)
dat <- setDT(dat)[, grp := cumsum(c(TRUE, diff(year) != 1)), by = huc_code_eight]
Create the character year (a range if necessary) and the sum of the counts, by group
dat[, .(
year = fifelse(.N > 1, paste0(min(year), "-", max(year)), paste0(year, collapse = "")),
count_1 = sum(count_1), count_2 = sum(count_2)),
by = .(huc_code_eight, grp)][, grp := NULL][]
Output:
huc_code_eight year count_1 count_2
1: 6010105 1946-1948 12 4
2: 6010105 1957 4 4
3: 6020001 1957 2 0
4: 8010203 1957 0 0
We can create a grouping column based on difference of adjacent elements in 'year' along with 'huc_code_eight' and then summarise
library(dplyr)
library(stringr)
df1 %>%
group_by(huc_code_eight) %>%
mutate(year_grp = cumsum(c(TRUE, diff(year) != 1))) %>%
group_by(year_grp, .add = TRUE) %>%
summarise(year = if(n() > 1)
str_c(range(year), collapse = ' - ') else as.character(year),
across(starts_with('count'), sum, na.rm = TRUE), .groups = 'drop') %>%
dplyr::select(-year_grp)
Output:
# A tibble: 4 × 4
huc_code_eight year count_1 count_2
<int> <chr> <int> <int>
1 6010105 1946 - 1948 12 4
2 6010105 1957 4 4
3 6020001 1957 2 0
4 8010203 1957 0 0
data
df1 <- structure(list(huc_code_eight = c(6010105L, 6010105L, 6010105L,
6010105L, 6020001L, 8010203L), year = c(1946L, 1947L, 1948L,
1957L, 1957L, 1957L), count_1 = c(4L, 6L, 2L, 4L, 2L, 0L), count_2 = c(4L,
0L, 0L, 4L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
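For readers who prefer base R, the same idea (start a new group whenever the year does not directly follow the previous one, then sum within groups) might be sketched as below. This is only an illustration that assumes the rows can be sorted by huc_code_eight and year; the names `run`, `pieces` and `res` are made up here.

```r
df1 <- data.frame(
  huc_code_eight = c(6010105L, 6010105L, 6010105L, 6010105L, 6020001L, 8010203L),
  year = c(1946L, 1947L, 1948L, 1957L, 1957L, 1957L),
  count_1 = c(4L, 6L, 2L, 4L, 2L, 0L),
  count_2 = c(4L, 0L, 0L, 4L, 0L, 0L)
)

# Sort so that consecutive years sit next to each other within each code.
df1 <- df1[order(df1$huc_code_eight, df1$year), ]

# Within each code, bump the run id whenever the gap to the previous year is not 1.
df1$run <- ave(df1$year, df1$huc_code_eight,
               FUN = function(y) cumsum(c(TRUE, diff(y) != 1)))

# One piece per (code, run) combination; summarise each piece to a single row.
pieces <- split(df1, list(df1$huc_code_eight, df1$run), drop = TRUE)
res <- do.call(rbind, lapply(pieces, function(p) data.frame(
  huc_code_eight = p$huc_code_eight[1],
  run = p$run[1],
  year = if (nrow(p) > 1) paste(range(p$year), collapse = " - ") else as.character(p$year),
  count_1 = sum(p$count_1),
  count_2 = sum(p$count_2),
  stringsAsFactors = FALSE
)))

# Restore a readable row order and drop the helper column.
res <- res[order(res$huc_code_eight, res$run), ]
res$run <- NULL
rownames(res) <- NULL
print(res)
```

Because the run id is computed from the gap to the previous year, a series such as 1946, 1947, 1950, 1951 would be split into two ranges rather than four separate rows.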

r transfer values from one dataset to another by ID

I have two datasets. The first dataset looks like this:
ID Weight State
1 12.34 NA
2 11.23 IA
2 13.12 IN
3 12.67 MA
4 10.89 NA
5 14.12 NA
The second dataset is a lookup table for state values by ID
ID State
1 WY
2 IA
3 MA
4 OR
4 CA
5 FL
As you can see there are two different state values for ID 4, which is normal.
What I want to do is replace the NAs in dataset1 State column with State values from dataset 2. Expected dataset
ID Weight State
1 12.34 WY
2 11.23 IA
2 13.12 IN
3 12.67 MA
4 10.89 OR,CA
5 14.12 FL
Since ID 4 has two state values in dataset2, these two values are collapsed, separated by a comma, and used to replace the NA in dataset1. Any suggestion on accomplishing this is much appreciated. Thanks in advance.
Collapse df2 value and join it with df1 by 'ID'. Use coalesce to use non-NA value from the two state columns.
library(dplyr)
df1 %>%
left_join(df2 %>%
group_by(ID) %>%
summarise(State = toString(State)), by = 'ID') %>%
mutate(State = coalesce(State.x, State.y)) %>%
select(-State.x, -State.y)
# ID Weight State
#1 1 12.3 WY
#2 2 11.2 IA
#3 2 13.1 IN
#4 3 12.7 MA
#5 4 10.9 OR, CA
#6 5 14.1 FL
In base R with merge and transform.
merge(df1, aggregate(State~ID, df2, toString), by = 'ID') |>
transform(State = ifelse(is.na(State.x), State.y, State.x))
Tidyverse way:
library(tidyverse)
df1 %>%
left_join(df2 %>%
group_by(ID) %>%
summarise(State = toString(State)) %>%
ungroup(), by = 'ID') %>%
transmute(ID, Weight, State = coalesce(State.x, State.y))
Base R alternative:
na_idx <- which(is.na(df1$State))
df1$State[na_idx] <- with(
aggregate(State ~ ID, df2, toString),
State[match(df1$ID, ID)]
)[na_idx]
Data:
df1 <- structure(list(ID = c(1L, 2L, 2L, 3L, 4L, 5L), Weight = c(12.34,
11.23, 13.12, 12.67, 10.89, 14.12), State = c(NA, "IA", "IN",
"MA", NA, NA)), row.names = c(NA, -6L), class = "data.frame")
df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 4L, 5L), State = c("WY",
"IA", "MA", "OR", "CA", "FL")), class = "data.frame", row.names = c(NA,
-6L))

Dividing each row by the previous one in R

I have an R data frame:
city hour value
0 NY 0 12
1 NY 12 24
2 LA 0 3
3 LA 12 9
I want, for each city, to divide each row by the previous one and write the result into a new dataframe. The desired output is:
city ratio
NY 2
LA 3
You can try aggregate like below
aggregate(value ~ city, df, function(x) x[-1] / x[-length(x)])
which gives
city value
1 LA 3
2 NY 2
Data
> dput(df)
structure(list(city = c("NY", "NY", "LA", "LA"), hour = c(0L,
12L, 0L, 12L), value = c(12L, 24L, 3L, 9L)), class = "data.frame", row.names = c("0",
"1", "2", "3"))
You can use lag to get the previous value, divide each value by its previous value for each city, and drop the NA rows.
library(dplyr)
df %>%
arrange(city, hour) %>%
group_by(city) %>%
summarise(value = value/lag(value)) %>%
na.omit()
# city value
# <chr> <dbl>
#1 LA 3
#2 NY 2
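In data.table we can do this via shift:
library(data.table)
# store the ratio in a new column so the integer value column is left untouched
setDT(df)[order(city, hour), ratio := value / shift(value), by = city]
na.omit(df)[, .(city, ratio)]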
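When a city has more than two rows, "divide by the previous row" means pairing every value with the one just before it inside the same city. A minimal base R sketch of that step, using ave and a shifted copy of the values, could look like this; the column name `ratio` and the object `result` are chosen here for illustration only.

```r
df <- data.frame(
  city = c("NY", "NY", "LA", "LA"),
  hour = c(0L, 12L, 0L, 12L),
  value = c(12L, 24L, 3L, 9L),
  stringsAsFactors = FALSE
)

# Sort so that "previous row" means the previous hour within the same city.
df <- df[order(df$city, df$hour), ]

# Within each city, divide each value by the one before it (NA for the first row).
df$ratio <- ave(as.numeric(df$value), df$city,
                FUN = function(v) v / c(NA, head(v, -1)))

# Keep only the rows that have a predecessor.
result <- na.omit(df[c("city", "ratio")])
rownames(result) <- NULL
print(result)
```

The first row of each city has no predecessor, so its ratio is NA and it is dropped by na.omit.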

Problems of joining datasets on R

I have a dataset containing variables and a quantity of goods sold: for some days, however, there are no values.
I created a dataset with all 0 values in sales and all NA in the rest. How can I add those lines to the initial dataset?
At the moment, I have this:
sales
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
4 1 2018 11 0 987
sales.NA
day month year employees holiday sales
1 1 2018 NA NA 0
2 1 2018 NA NA 0
3 1 2018 NA NA 0
4 1 2018 NA NA 0
I would like to create a new dataset, inserting the days where I have no observations, value 0 to sales, and NA on all other variables. Like this
new.data
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
3 1 2018 NA NA 0
4 1 2018 11 0 987
I tried using something like this:
merge(sales.NA,sales, all.y=T, by = c("day","month","year"))
But it does not work.
Using dplyr, you could use right_join. For example:
sales <- data.frame(day = c(1,2,4),
month = c(1,1,1),
year = c(2018, 2018, 2018),
employees = c(14, 25, 11),
holiday = c(0,1,0),
sales = c(1058, 2174, 987)
)
sales.NA <- data.frame(day = c(1,2,3,4),
month = c(1,1,1,1),
year = c(2018,2018,2018, 2018)
)
library(dplyr)
right_join(sales, sales.NA)
This leaves you with
day month year employees holiday sales
1 1 1 2018 14 0 1058
2 2 1 2018 25 1 2174
3 3 1 2018 NA NA NA
4 4 1 2018 11 0 987
This leaves NA in sales where you want 0, but that could be fixed by including the sales data in sales.NA, or you could use replace_na from tidyr:
library(tidyr)
right_join(sales, sales.NA) %>% mutate(sales = replace_na(sales, 0))
Here is another data.table solution:
jvars = c("day","month","year")
merge(sales.NA[, ..jvars], sales, by = jvars, all.x = TRUE)[is.na(sales), sales := 0L][]
day month year employees holiday sales
1: 1 1 2018 14 0 1058
2: 2 1 2018 25 1 2174
3: 3 1 2018 NA NA 0
4: 4 1 2018 11 0 987
Or with some neater syntax:
sales[sales.NA[, ..jvars], on = jvars][is.na(sales), sales := 0][]
Reproducible data:
sales <- structure(list(day = c(1L, 2L, 4L), month = c(1L, 1L, 1L), year = c(2018L,
2018L, 2018L), employees = c(14L, 25L, 11L), holiday = c(0L,
1L, 0L), sales = c(1058L, 2174L, 987L)), row.names = c(NA, -3L
), class = c("data.table", "data.frame"))
sales.NA <- structure(list(day = 1:4, month = c(1L, 1L, 1L, 1L), year = c(2018L,
2018L, 2018L, 2018L), employees = c(NA, NA, NA, NA), holiday = c(NA,
NA, NA, NA), sales = c(0L, 0L, 0L, 0L)), row.names = c(NA, -4L
), class = c("data.table", "data.frame"))
That's an answer using the data.table package, since I am more familiar with the syntax, but regular data.frames should work pretty much the same. I would also switch to a proper date format, which will make life easier for you down the line.
Actually, this way you would not need the sales.NA table at all: the days without observations are simply the rows that contain NAs after the join.
library(data.table)
dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"),by = "day" ))
dt.sales <- data.table(day = c(1,2,4)
, month = c(1,1,1)
, year = c(2018,2018,2018)
, employees = c(14, 25, 11)
, holiday = c(0,1,0)
, sales = c(1058, 2174, 987)
)
dt.sales[, Date := as.Date(paste(year,month,day, sep = "-")) ]
merge( x = dt.dates
, y = dt.sales
, by.x = "Date"
, by.y = "Date"
, all.x = TRUE
)
Date day month year employees holiday sales
1: 2018-01-01 1 1 2018 14 0 1058
2: 2018-01-02 2 1 2018 25 1 2174
3: 2018-01-03 NA NA NA NA NA NA
4: 2018-01-04 4 1 2018 11 0 987
...
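The merge attempt in the question uses all.y = TRUE with sales as the second table, so only the days that already have observations are kept, and the employees, holiday and sales columns appear twice. A base R sketch of the intended result, under the assumption that only the key columns of the full calendar are needed, might look like this; `all_days` is a made-up name standing in for the key columns of sales.NA.

```r
sales <- data.frame(
  day = c(1, 2, 4),
  month = c(1, 1, 1),
  year = c(2018, 2018, 2018),
  employees = c(14, 25, 11),
  holiday = c(0, 1, 0),
  sales = c(1058, 2174, 987)
)

# Hypothetical calendar table: just the key columns for every day of interest.
all_days <- data.frame(
  day = c(1, 2, 3, 4),
  month = c(1, 1, 1, 1),
  year = c(2018, 2018, 2018, 2018)
)

# Keep every calendar day (all.x = TRUE); unmatched days get NA in the other columns.
new.data <- merge(all_days, sales, by = c("day", "month", "year"), all.x = TRUE)

# Days without observations should report 0 sales rather than NA.
new.data$sales[is.na(new.data$sales)] <- 0
print(new.data)
```

Only sales is filled with 0; employees and holiday stay NA for the added day, which matches the new.data layout requested in the question.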

Using dplyr to summarize by multiple groups

I'm trying to use dplyr to summarize a dataset based on 2 groups: "year" and "area". This is what the dataset looks like:
Year Area Num
1 2000 Area 1 99
2 2001 Area 3 85
3 2000 Area 1 60
4 2003 Area 2 90
5 2002 Area 1 40
6 2002 Area 3 30
7 2004 Area 4 10
...
The end result should look something like this:
Year Area Mean
1 2000 Area 1 100
2 2000 Area 2 80
3 2000 Area 3 89
4 2001 Area 1 80
5 2001 Area 2 85
6 2001 Area 3 59
7 2002 Area 1 90
8 2002 Area 2 88
...
Excuse the values for "mean", they're made up.
The code for the example dataset:
df <- structure(list(
Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
Area = structure(c(1L, 3L, 1L, 2L, 1L, 3L, 4L),
.Label = c("Area 1", "Area 2", "Area 3", "Area 4"),
class = "factor"),
Num = structure(c(7L, 5L, 4L, 6L, 3L, 2L, 1L),
.Label = c("10", "30", "40", "60", "85", "90", "99"),
class = "factor")),
.Names = c("Year", "Area", "Num"),
class = "data.frame", row.names = c(NA, -7L))
df$Num <- as.numeric(df$Num)
Things I've tried:
df.meanYear <- df %>%
group_by(Year) %>%
group_by(Area) %>%
summarize_each(funs(mean(Num)))
But it just replaces every value with the mean, instead of the intended result.
If possible, please also provide alternative (i.e. non-dplyr) methods, because I'm still new to R.
Is this what you are looking for?
library(dplyr)
df <- group_by(df, Year, Area)
df <- summarise(df, avg = mean(Num))
We can use data.table
library(data.table)
setDT(df)[, .(avg = mean(Num)) , by = .(Year, Area)]
I had a similar problem in my code; I fixed it with the .groups argument:
df %>%
group_by(Year,Area) %>%
summarise(avg = mean(Num), .groups="keep")
Also verified with the added example (as.numeric corrupted Num values, so I used as.numeric(as.character(df$Num)) to fix it):
Year Area avg
<dbl> <fct> <dbl>
1 2000 Area 1 79.5
2 2001 Area 3 85
3 2002 Area 1 40
4 2002 Area 3 30
5 2003 Area 2 90
6 2004 Area 4 10
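Since the question also asks for a non-dplyr route, here is a minimal base R sketch using aggregate with a two-variable formula. It rebuilds the example data in a simplified form (an assumption made for readability) and converts the factor via as.character first, because as.numeric on a factor returns the internal level codes rather than the printed numbers.

```r
df <- data.frame(
  Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
  Area = factor(c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1", "Area 3", "Area 4")),
  Num = factor(c("99", "85", "60", "90", "40", "30", "10"))
)

# Convert the factor labels (not the level codes) to numbers.
df$Num <- as.numeric(as.character(df$Num))

# Mean of Num for every Year/Area combination that occurs in the data.
res <- aggregate(Num ~ Year + Area, data = df, FUN = mean)

# Order by Year, then Area, to match the layout of the desired result.
res <- res[order(res$Year, res$Area), ]
rownames(res) <- NULL
print(res)
```

Only combinations that actually occur in the data are returned, so the result has six rows for this sample, the same groups as in the dplyr output above.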
