I am using the url link to download this dataset:
https://files.hawaii.gov/dbedt/census/census_2020/data/redistricting/PLtable1_2020-county.xlsx
So in R I am coding it as:
url_dbedt_dicennial <- "https://files.hawaii.gov/dbedt/census/census_2020/data/redistricting/PLtable1_2020-county.xlsx"
# download the xls to a temporary file
temp <- tempfile(fileext = ".xlsx")
download.file(url = url_dbedt_dicennial, destfile = temp, mode = "wb")
# data from dbedt dicennial (look at each step to understand)
library(dplyr) # provides %>%
data_in_dbedt_dicennial <- temp %>%
  readxl::read_excel(range = cellranger::as.cell_limits("A6:H15")) %>%
  t()
The generated output is the following:
What I am struggling with after the transpose is how to relabel the columns as "time", "HI", "HON", "HAW", "KAU", "MAU" and then drop V1, V3, V8, and V9. I know I can eliminate columns manually one by one, but is there a cleverer way of doing it? County should be relabeled as time.
Eventually I want to use the mutate function for the time variable, that is,
mutate(time)
and convert the data into time series with
tsbox::ts_long()
State of Hawaii should be labeled as "HI", Hawaii County as "HAW", City and County of Honolulu as "HON", Kauai County as "KAU", and Maui County 1/ as "MAU".
So this turned out to be a little more complicated than I first thought, in part because of t(), which is really designed to work with matrices. Fortunately, I was able to find some guidance elsewhere on SO, where I found transpose_df(). Though this works, I imagine this could be cleaned up a bit.
library(dplyr) # %>%, rename(), mutate(), across()
data_in_dbedt_dicennial <- temp %>%
  readxl::read_excel(range = cellranger::as.cell_limits("A6:H15")) %>%
  na.omit()
transpose_df <- function(df) {
  t_df <- data.table::transpose(df)
  colnames(t_df) <- rownames(df)
  rownames(t_df) <- colnames(df)
  t_df <- t_df %>%
    tibble::rownames_to_column(.data = .) %>%
    tibble::as_tibble(.)
  return(t_df)
}
data_in_dbedt_dicennial <- transpose_df(data_in_dbedt_dicennial) %>%
  .[-1, ] %>%
  rename(
    Year = rowname, HI = `1`, HAW = `2`,
    HON = `3`, KAU = `4`, MAU = `5`
  ) %>%
  mutate(across(everything(), as.integer))
Output:
# A tibble: 7 × 6
Year HI HAW HON KAU MAU
1 1960 632772 61332 500409 28176 42855
2 1970 769913 63468 630528 29761 46156
3 1980 964691 92053 762565 39082 70991
4 1990 1108229 120317 836231 51177 100504
5 2000 1211537 148677 876156 58463 128241
6 2010 1360301 185079 953207 67091 154924
7 2020 1455271 200629 1016508 73298 164836
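As a possible cleanup of the transpose step, here is a minimal sketch that reshapes with tidyr instead of t(). It runs on a two-row toy table whose values are copied from the output above. The column name County and the codes lookup vector are assumptions for illustration, not something read from the spreadsheet.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the spreadsheet layout: one row per area, one column per year.
# Values are copied from the output above; the column name `County` is assumed.
wide <- tibble::tibble(
  County = c("State of Hawaii", "Hawaii County"),
  `1960` = c(632772, 61332),
  `1970` = c(769913, 63468)
)

# Hypothetical lookup from the long labels to the short codes.
codes <- c("State of Hawaii" = "HI", "Hawaii County" = "HAW")

transposed <- wide %>%
  mutate(County = unname(codes[County])) %>%                 # relabel the areas
  pivot_longer(-County, names_to = "Year", values_to = "pop") %>%
  pivot_wider(names_from = County, values_from = pop) %>%    # areas become columns
  mutate(Year = as.integer(Year))

transposed
```

Going long and then wide keeps everything a tibble, so no row names or V1-style columns need to be cleaned up afterwards.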
I have a problem that sounds easy, but I could not find a solution in R. I would like to shift values according to the year of release: the first column holds the release (delivery) year, and the remaining columns are the years in which devices broke (the values are the numbers of broken devices).
This is a solution in Python:
import numpy as np
def f(x):
    shifted = np.argmin((x.index.astype(int) < x.name[0]))
    return x.shift(-shifted)
df = df.set_index(['Delivery Year', 'Freq']).apply(f, axis=1)
df.columns = [f'Year.{i + 1}' for i in range(len(df.columns))]
df = df.reset_index()
df
I would like to have it in R too.
# TEST
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1976` = c(10,NA,NA,NA),
`Year.1977` = c(5,3,NA,NA),
`Year.1978` = c(10,NA,8,NA),
`Year.1979` = c(13,10,5,14)
)
data
# DESIRED
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1` = c(10,3,8,14),
`Year.2` = c(5,NA,5,NA),
`Year.3` = c(10,10,NA,NA),
`Year.4` = c(13,NA,NA,NA)
)
data
In addition, would it also be possible to express the number of broken devices as a percentage of the Freq column?
Thank you
Using tidyverse
library(dplyr)
library(tidyr)
data %>%
  pivot_longer(!c(Delivery.Year, Freq)) %>%
  separate(name, c("Lab", "Year")) %>%
  select(-Lab) %>%
  mutate_all(as.numeric) %>%
  filter(Year >= Delivery.Year) %>%
  group_by(Delivery.Year, Freq) %>%
  mutate(ind = paste0("Year.", row_number()),
         per = value / Freq) %>%
  ungroup() %>%
  pivot_wider(id_cols = c(Delivery.Year, Freq), names_from = ind, values_from = c(value, per))
I first pivoted the data into long form and split the original column names (Year.1976, Year.1977, etc.) to keep just the year, dropping the "Year" prefix. Then I converted all columns to numeric so that comparisons such as Year >= Delivery.Year work. Next I created a column holding the titles you requested (Year.1, Year.2, etc.) and calculated the percentage. Finally, pivot_wider() puts the data in the format you requested. One thing to note: I was unsure whether you wanted both the original values and the percentages or just the percentages. If you only want the percentages, values_from = per should do it.
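For comparison, the same shift can be sketched in base R without reshaping. This is only an illustration on the TEST data from the question: for each row it keeps the values from the delivery year onward and pads the right-hand side with NA. The names year_cols, col_years, shifted and result are made up for this sketch.

```r
data <- data.frame(
  Delivery.Year = c("1976", "1977", "1978", "1979"),
  Freq = c(120, 100, 80, 60),
  Year.1976 = c(10, NA, NA, NA),
  Year.1977 = c(5, 3, NA, NA),
  Year.1978 = c(10, NA, 8, NA),
  Year.1979 = c(13, 10, 5, 14)
)

# Columns holding yearly counts, and the calendar year each one stands for.
year_cols <- grep("^Year\\.\\d+$", names(data), value = TRUE)
col_years <- as.integer(sub("^Year\\.", "", year_cols))

# For each row: drop the columns before the delivery year, pad with NA on the right.
shifted <- t(vapply(seq_len(nrow(data)), function(i) {
  vals <- unlist(data[i, year_cols], use.names = FALSE)
  keep <- vals[col_years >= as.integer(data$Delivery.Year[i])]
  c(keep, rep(NA_real_, length(vals) - length(keep)))
}, numeric(length(year_cols))))

colnames(shifted) <- paste0("Year.", seq_along(year_cols))
result <- cbind(data[c("Delivery.Year", "Freq")], shifted)
result
```

Dividing the shifted columns by Freq (for example `result[colnames(shifted)] / result$Freq`) would give the shares asked about at the end of the question.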
library(dplyr)
f <- function(df) {
years <- paste0("Year.",sort(as.vector(na.omit(as.integer(stringr::str_extract(colnames(df), "\\d+"))))))
df1 <- df %>% select(years)
df2 <- df %>% select(-years)
val <- c()
firstyear <- years[1]
for (k in 1:nrow(df1) ) {
vec <- as.numeric(as.vector(df1[k,]))
val[k] <- (as.numeric(suppressWarnings(na.omit(vec))))[1]
}
df1[firstyear] <- val
colnames(df1) <- c(paste0("Year.",seq(1:ncol(df1))))
df <- cbind(df2,df1)
print(df)
}
> f(data)
Delivery.Year Freq Year.1 Year.2 Year.3 Year.4
1 1976 120 10 5 10 13
2 1977 100 3 3 NA 10
3 1978 80 8 NA 8 5
4 1979 60 14 NA NA 14
I have two data frames as following:
df <- data.frame(month = c("mazda", "yamaha"),
april = c(11,12),
may = c(14,15))
df_whole <- data.frame(month = c("car", "bikes"),
april = c(.1,.2),
may = c(.5,.2))
What I want to do is create a row called total_car, which would be the product of car and mazda. However, if df_whole doesn't have a car row, I would still like to create a total_car row, which would be mazda*0. The output I want to get to is below. How could I calculate something like this in R?
df_car <- data.frame(month = c("mazda", "yamaha","total"),
april = c(11,12,1.1),
may = c(14,15,7))
It sounds like a SQL-style join would work better to combine your tables and get the summary data.
Here's a tidyverse solution (note the inner_join(), which works like a SQL inner join):
library(dplyr)
library(tidyr)
df <- data_frame(month = c("mazda", "yamaha"),
april = c(11,12),
may = c(14,15))
df_whole <- data_frame(month = c("car", "bikes"),
april = c(.1,.2),
may = c(.5,.2))
# adds car row if missing - there is probably a better way to handle the missing data
if (!"car" %in% df_whole$month) df_whole <- bind_rows(df_whole, data_frame(month = "car"))
# converts NAs to 0
df_whole[is.na(df_whole)] <- 0
# convert to long format
df_long <- df %>%
rename("brand" = "month") %>%
gather(key = "month", value = "val", april:may)
df_whole_long <- df_whole %>%
rename("type" = "month") %>%
gather(key = "month", value = "val", april:may)
# calculate the multiplication
dat <- inner_join(df_long, df_whole_long, by = "month") %>% # combine dfs
filter(brand == "mazda", type == "car") %>% # filter out key rows
mutate(total = val.x * val.y) # do multiplication and add into new column
# reformat and append
df_car <- bind_rows(df,
bind_cols(data_frame(month = "total"),
dat %>%
select(month, total) %>%
spread(month, total)))
df_car is:
month april may
1 mazda 11.0 14
2 yamaha 12.0 15
3 total 1.1 7
I used tidyverse data frames (data_frame() rather than data.frame()) so strings weren't converted to factors (the same can be achieved by adding stringsAsFactors = F to data.frame()) and I converted your data to long format, which would probably be a better structure for your database anyway.
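If the long format is more than you need, the same total row can be sketched in base R by pulling out the two rows and multiplying them directly. This is a minimal illustration on the data from the question. value_cols, car, mazda and total are names introduced here, and the zero fallback covers the case where df_whole has no car row.

```r
df <- data.frame(month = c("mazda", "yamaha"),
                 april = c(11, 12),
                 may = c(14, 15))
df_whole <- data.frame(month = c("car", "bikes"),
                       april = c(.1, .2),
                       may = c(.5, .2))

value_cols <- c("april", "may")

# Multipliers for "car"; fall back to zeros when df_whole has no such row.
car <- if (any(df_whole$month == "car")) {
  unlist(df_whole[df_whole$month == "car", value_cols])
} else {
  setNames(rep(0, length(value_cols)), value_cols)
}

mazda <- unlist(df[df$month == "mazda", value_cols])

# One-row data frame with the element-wise product, appended to df.
total <- cbind(data.frame(month = "total"), as.data.frame(as.list(mazda * car)))
df_car <- rbind(df, total)
df_car
```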
I am very new to programming and am learning how to use dplyr, and I am wondering how to solve this problem:
I have this dataframe:
countries <- c("USA","Canada","Denmark","Albania", "Turkey","France", "Italy")
values <- c(1, 1, 3, 3,7,8,9)
old_df <- data.frame(countries, values, stringsAsFactors = FALSE)
I want to collapse the rows of my dataset that share the same value, to obtain this:
countries <- c("USA , Canada","Denmark , Albania", "Turkey","France", "Italy")
values <- c(1,3,7,8,9)
new_df <- data.frame(countries, values, stringsAsFactors = FALSE)
Because I am using dplyr, I think that the best way to solve my problem could be:
library(dplyr)
new_df <- group_by(values) %>%
transmute(countries = countries) %>%
ungroup
Thank you in advance for any clue about how to solve this.
library(dplyr)
old_df %>%
group_by(values) %>%
summarise(countries = paste0(countries, collapse = ", "))
# # A tibble: 5 x 2
# values countries
# <dbl> <chr>
# 1 1 USA, Canada
# 2 3 Denmark, Albania
# 3 7 Turkey
# 4 8 France
# 5 9 Italy
The point here is that for each unique value in values you want to combine some of your rows, so you need to use summarise (i.e. you want to end up with one row per values value).
You can use summarise(countries = paste0(sort(countries), collapse = ", ")) if you want to apply an alphabetical order when you combine countries.
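If you would rather not depend on dplyr, a base R sketch of the same grouping uses aggregate(), which passes extra arguments (here collapse) on to the function it applies per group. It uses old_df as defined in the question.

```r
countries <- c("USA", "Canada", "Denmark", "Albania", "Turkey", "France", "Italy")
values <- c(1, 1, 3, 3, 7, 8, 9)
old_df <- data.frame(countries, values, stringsAsFactors = FALSE)

# One row per distinct value; the countries in each group are pasted together.
new_df <- aggregate(countries ~ values, data = old_df, FUN = paste, collapse = ", ")
new_df
```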
I feel like I have a bit of a complicated problem (or at least for me it is!).
I have a table of prices which will need to be read from a csv which will look exactly like this:
V1 <- c("","Destination","Spain","Spain","Spain","Portugal","Portugal","Portugal","Italy","Italy","Italy")
V2 <- c("","Min_Duration",rep(c(1,3,6),3))
V3 <- c("","Max_Duration",rep(c(2,5,10),3))
V4 <- c("Full-board","Level_1",runif(9,100,200))
V5 <- c("Full-board","Level_2",runif(9,201,500))
V6 <- c("Full-board","Level_3",runif(9,501,1000))
V7 <- c("Half-board","Level_1",runif(9,100,200))
V8 <- c("Half-board","Level_2",runif(9,201,500))
V9 <- c("Half-board","Level_3",runif(9,501,1000))
Lookup_matrix <- as.data.frame(cbind(V1,V2,V3,V4,V5,V6,V7,V8))
The prices in the above table will of course come out a bit strange as they're completely random - but we can ignore that...
I also have a table like this:
Destination <- c("Spain", "Italy", "Portugal")
Duration <- c(2,4,8)
Level <- c(1,3,3)
Board <- c("Half-board","Half-board","Full-board")
Price <- "Empty"
Price_matrix <- as.data.frame(cbind(Destination,Duration,Level,Board,Price))
My question is - how do I populate the 'Price' column of the price matrix with the corresponding prices that can be found in the lookup matrix? Please note that the duration variable of the price matrix will have to fit into a range found between the 'Min_Duration' and 'Max_Duration' columns in the lookup matrix.
In Excel I would use an Index,Match formula. But I'm stumped with R.
Thanks in advance,
Dan
Here is a tidyverse possibility
First, please note that I rename your input objects; both Price_matrix and Lookup_matrix are data.frames (not matrices).
df1 <- Price_matrix
df2 <- Lookup_matrix
Next we need to fix the column names of df2 = Lookup_matrix.
# Fix column names
colnames(df2) <- gsub("^_", "", apply(df2[1:2, ], 2, paste0, collapse = "_"))
df2 <- df2[-(1:2), ]
We now basically do a left join of df1 and df2; to get df2 into a suitable format we reshape it from wide to long, extract Price values for every Board and Level, and expand entries from Min_Duration to Max_Duration. Then we join by Destination, Duration, Level and Board.
Note that in your example, V9 (Half-board, Level_3) is left out of the cbind() call that builds Lookup_matrix, so there is no Half-board Level 3 price to look up; we therefore get Price = NA for the Italy entry.
library(tidyverse)
left_join(
df1 %>%
mutate_if(is.factor, as.character) %>%
select(-Price),
df2 %>%
mutate_if(is.factor, as.character) %>%
gather(key, Price, -Destination, -Min_Duration, -Max_Duration) %>%
separate(key, into = c("Board", "Level"), sep = "_", extra = "merge") %>%
mutate(Level = sub("Level_", "", Level)) %>%
rowwise() %>%
mutate(Duration = list(seq(as.numeric(Min_Duration), as.numeric(Max_Duration)))) %>%
unnest() %>%
select(-Min_Duration, -Max_Duration) %>%
mutate(Duration = as.character(Duration)))
#Joining, by = c("Destination", "Duration", "Level", "Board")
# Destination Duration Level Board Price
#1 Spain 2 1 Half-board 119.010942545719
#2 Italy 4 3 Half-board <NA>
#3 Portugal 8 3 Full-board 764.536124917446
Using data.table:
library(data.table)
nms = trimws(do.call(paste, transpose(Lookup_matrix[1:2, ])))# column names
cat(do.call(paste, c(collapse="\n", Lookup_matrix[-(1:2), ])), file = "mm.csv")
# Rewrite the data in the correct format. You do not have to.
# Just doing Lookup_matrix1 = setNames(Lookup_matrix[-(1:2),],nms) is enough
# but it will not have rectified the column classes.
Lookup_matrix1 = fread("mm.csv", col.names = nms)
melt(Lookup_matrix1, 1:3)[,
c("Board", "Level") := .(sub("[.]", "-", sub("\\.Leve.*", "", variable)), sub("\\D+", "", variable))][
Price_matrix[, -5], on=c("Destination", "Board", "Level", "Min_Duration <= Duration", "Max_Duration >= Duration")]
Destination Min_Duration Max_Duration variable value Board Level
1: Spain 2 2 Half.board.Level_1 105.2304 Half-board 1
2: Italy 4 4 <NA> NA Half-board 3
3: Portugal 8 8 Full.board.Level_3 536.5132 Full-board 3
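The idea shared by both answers (match on the exact keys, then require Duration to fall between Min_Duration and Max_Duration) can also be sketched in base R with merge() followed by a filter. The lookup table below is a small long-format stand-in with hypothetical prices, not the random values from the question, and the names lookup, queries, candidates and matched are introduced only for this illustration.

```r
# Long-format lookup: one row per destination / duration band / board / level.
lookup <- data.frame(
  Destination  = c("Spain", "Spain", "Italy"),
  Min_Duration = c(1, 3, 3),
  Max_Duration = c(2, 5, 5),
  Board        = c("Half-board", "Half-board", "Half-board"),
  Level        = c(1, 1, 3),
  Price        = c(110, 150, 700), # hypothetical prices
  stringsAsFactors = FALSE
)

queries <- data.frame(
  Destination = c("Spain", "Italy"),
  Duration    = c(2, 4),
  Level       = c(1, 3),
  Board       = c("Half-board", "Half-board"),
  stringsAsFactors = FALSE
)
queries$id <- seq_len(nrow(queries)) # remember the original row order

# Step 1: join on the exact-match keys (every duration band is a candidate).
candidates <- merge(queries, lookup, by = c("Destination", "Board", "Level"))

# Step 2: keep the band that contains the requested duration.
matched <- candidates[candidates$Duration >= candidates$Min_Duration &
                        candidates$Duration <= candidates$Max_Duration, ]

# Step 3: write prices back in the original order (NA when nothing matched).
queries$Price <- matched$Price[match(queries$id, matched$id)]
queries
```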
I have one data frame outlining pollution levels continuously measured from two sites.
Dates <- as.data.frame(seq(as.Date("2015/01/01"), as.Date("2017/01/01"),"day"))
Pollution_Site.A <- as.data.frame(c(seq(from = 1, to = 366, by = 1),
(seq(from = 366, to = 1, by = -1))))
Pollution_Site.B <- as.data.frame(c(seq(from = 0, to = 365, by = 1),
(seq(from = 365, to = 0, by = -1))))
df1 <- cbind(Dates,Pollution_Site.A,Pollution_Site.B)
colnames(df1) <- c("Dates","Site.A","Site.B")
I have a separate data frame highlighting when surveyors (each site has one unique surveyor) visited each site.
Site<- c("Site.A","Site.A","Site.B","Site.B")
Survey_Dates <- as.data.frame(as.POSIXct(c("2014/08/17","2016/08/01",
"2015/02/01","2016/10/31")))
df2 <- as.data.frame(cbind(Site,Survey_Dates))
colnames(df2) <- c("Site","Survey_Dates")
What I want to do is (i) define a high pollution event for each site (perhaps some form of 'apply' function would be better for doing this iteratively across multiple sites):
High_limit_Site.A <- 1.5*median(df1$Site.A)
High_limit_Site.B <- 1.5*median(df1$Site.B)
Then I want to (ii) subset the second data frame to show which surveyors have visited the site before and after a high pollution event within 1 year (provided there is pollution data as well). I presume something along the lines of the 'difftime' function will work here, but I am not sure how I would apply it.
Finally, I would like (iii) the subsetted data frame to highlight whether the surveyor was out before or after the pollution event.
So in the example above, the desired output should only contain Site B. This is because Site A's first survey date precedes the first pollution measurement AND was over a year before the high pollution event. Thank you in advance for any help on this.
You need to pivot df1 to long format and then join it with df2 on Site, which acts like a cross-join within each site:
library(dplyr)
library(tidyr)
df1 %>% gather(key=Site, value=Pollution, -Dates) %>%
group_by(Site) %>%
mutate(HighLimit=as.numeric(Pollution>1.5*median(Pollution))) %>%
filter(HighLimit==1) %>%
# this will function as cross-join because Site is not a unique ID
left_join(df2, by=c("Site")) %>%
mutate(Time_Lag = as.numeric(as.Date(Survey_Dates)-as.Date(Dates)),
Been_Before = ifelse(Time_Lag>0, "after", "before")) %>%
filter(abs(Time_Lag)<365) %>%
group_by(Site, Survey_Dates, Been_Before) %>%
summarise(Event_date_min=min(Dates),
Event_date_max=max(Dates))
Here you can see the earliest and latest event corresponding to each visit:
# A tibble: 3 x 5
# Groups: Site, Survey_Dates [?]
Site Survey_Dates Been_Before Event_date_min Event_date_max
<chr> <dttm> <chr> <date> <date>
1 Site.A 2016-08-01 after 2015-10-03 2016-04-01
2 Site.B 2015-02-01 before 2015-10-02 2016-01-30
3 Site.B 2016-10-31 after 2015-11-01 2016-04-02
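On the 'apply' part of step (i) in the question: the per-site thresholds can be computed for every site column at once with sapply(). This is a minimal sketch that rebuilds df1 from the question more compactly. site_cols and high_limits are names chosen for the example.

```r
Dates <- seq(as.Date("2015/01/01"), as.Date("2017/01/01"), "day")
df1 <- data.frame(
  Dates  = Dates,
  Site.A = c(seq(1, 366), seq(366, 1)),
  Site.B = c(seq(0, 365), seq(365, 0))
)

# Every column except the date column is treated as a site.
site_cols <- setdiff(names(df1), "Dates")

# Named vector of thresholds, one per site: 1.5 times that site's median.
high_limits <- sapply(df1[site_cols], function(x) 1.5 * median(x))
high_limits
```

Adding another site then only means adding another column to df1. The grouped mutate() in the answer above applies the same threshold per site after reshaping.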
Just to build on the answer @dmi3kno posted above, I can then keep only the sites that have both a "before" and an "after" visit.
Output_df <- df1 %>% gather(key=Site, value=Pollution, -Dates) %>%
group_by(Site) %>%
mutate(HighLimit=as.numeric(Pollution>1.5*median(Pollution))) %>%
filter(HighLimit==1) %>%
left_join(df2, by=c("Site")) %>%
mutate(Time_Lag = as.numeric(as.Date(Survey_Dates)-as.Date(Dates)),
Been_Before = ifelse(Time_Lag>0, "after", "before")) %>%
filter(abs(Time_Lag)<365) %>%
group_by(Site, Survey_Dates, Been_Before) %>%
summarise(Event_date_min=min(Dates),
Event_date_max=max(Dates))
Then using dplyr again:
Final_df <- Output_df %>%
group_by(Site) %>%
filter(all(c("before", "after") %in% Been_Before))