I am running regressions on panel data at different geographical levels in the US and the euro area, using weights. The model essentially looks like this:
lm(log(POP25) ~ log(EMPLOY25), weights = weights, data = data)
The weights are the 2007 observations of POP25 for every grouping. This is the code and data for Europe. (For this dataset I don't experience any trouble.)
require(dplyr)
Data <- Data |>
group_by(NUTS_ID) |>
mutate(weights = POP25[TIME==2007])
Data 1:
structure(list(...1 = 1:6, TIME = 2007:2012, NUTS_ID = c("AT",
"AT", "AT", "AT", "AT", "AT"), NUMBER = c(1L, 1L, 1L, 1L, 1L,
1L), POP15 = c(5529.1, 5549.3, 5558.5, 5572.1, 5601.1, 5620.8
), POP20 = c(5047.1, 5063.2, 5072.6, 5090, 5127.1, 5151.9), POP25 = c(4544,
4560.7, 4571.3, 4587.8, 4621.5, 4639), EMPLOY15 = c(3863.6, 3928.7,
3909.3, 3943.9, 3982.3, 4013.4), EMPLOY20 = c(3676.2, 3737, 3723.8,
3761.9, 3802.3, 3835), EMPLOY25 = c(3333.5, 3390.4, 3384.7, 3424.6,
3454.4, 3486.4), weights = c(4544, 4544, 4544, 4544, 4544, 4544
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L), groups = structure(list(NUTS_ID = "AT", .rows = structure(list(
1:6), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr",
"list"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-1L), .drop = TRUE))
However, I am not able to run the same code on the data for US counties.
US_County <- US_County |>
group_by(NAME) |>
mutate(weights = POP25[year==2007])
Data 2:
structure(list(NAME = c("Ada County, Idaho", "Ada County, Idaho",
"Ada County, Idaho", "Ada County, Idaho", "Ada County, Idaho",
"Ada County, Idaho"), GEOID = c(16001, 16001, 16001, 16001, 16001,
16001), year = c(2007, 2008, 2009, 2010, 2011, 2012), POP25 = c(205888,
208506, 212770, 212272, 216058, 220856), EMPLOY25 = c(161385,
160303, 152131, 155292, 155574, 164830), State = c("Idaho", "Idaho",
"Idaho", "Idaho", "Idaho", "Idaho"), StateID = c(16, 16, 16,
16, 16, 16)), row.names = c(NA, 6L), class = "data.frame")
With the last dataset I get this error message, and I can't figure out what it means.
Error in `mutate()`:
! Problem while computing
`weights = POP25[year == 2007]`.
x `weights` must be size 5 or 1,
I don't know if there is anything wrong with the data. I have tried specifying the column classes so that they are the same across the datasets. I have also tried removing all NA observations, but with no luck.
Am I doing something wrong in my code?
Are there any other ways to do the same?
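One plausible cause, assuming it applies to the full US data (the six rows shown above do not reproduce the error), is that for some NAME group the expression POP25[year == 2007] does not return exactly one value: a county with no 2007 row gives a zero-length vector, and a duplicated county name gives more than one. The sketch below uses a made-up data frame called toy to count the 2007 rows per group and to take only the first match, so that a missing year becomes NA rather than an error.

```r
library(dplyr)

# Hypothetical toy panel (not the real data): county "B" has no 2007 row.
toy <- data.frame(
  NAME  = c("A", "A", "B", "B"),
  year  = c(2007, 2008, 2008, 2009),
  POP25 = c(100, 110, 200, 210)
)

# Diagnostic: how many 2007 rows does each group have?
# Any group where this is not exactly 1 would make the original mutate() fail.
check <- toy |>
  group_by(NAME) |>
  summarise(n_2007 = sum(year == 2007), .groups = "drop")

# Taking the first match always yields a length-one value:
# the 2007 population if present, NA otherwise.
weighted <- toy |>
  group_by(NAME) |>
  mutate(weights = POP25[year == 2007][1]) |>
  ungroup()
```

Rows with an NA weight would still need to be dropped or handled before calling lm(). If the cause is instead duplicated county names across states, grouping by GEOID rather than NAME may be the safer choice.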
I have a dataset with all natural disasters that occurred over a certain time period. I would like to summarize them by year and state. When summarizing, I would like to create a variable (= d_disasters) that shows the unique types of natural disasters; e.g. for Texas, I would expect it to show only Hurricane.
I am currently using dplyr::group_by and dplyr::summarize to summarize my data by year and by state, and dplyr::mutate and purrr::map_int to create new variables with the total number of natural disasters per year ($n_disasters using length) and the number of unique natural disasters ($n_distinct using n_distinct()).
Starting dataset:
structure(list(year = c(1998, 1998, 1998, 1998, 1998), country = c("US",
"US", "US", "US", "US"), state = c("Texas", "Texas", "California",
"New York", "New York"), deaths = c(12, 5, 9, 10, 18), injured = c(3,
1, 3, 5, 9), disastertype = c("Hurricane", "Hurricane", "Wild fire",
"Flood", "Epidemic")), class = "data.frame", row.names = c(NA,
-5L))
Result dataset:
structure(list(year = c(1998, 1998, 1998), state = c("California",
"New York", "Texas"), u_disastertype = c("Wild fire", "Flood, Epidemic",
"Hurricane"), disastertype = c("Wild fire", "Flood, Epidemic",
"Hurricane, Hurricane"), deaths = c(9, 28, 17), injured = c(3,
14, 4), n_distinct = c(1L, 2L, 1L), n_disasters = c(1L, 2L, 2L
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L), groups = structure(list(year = 1998, .rows = structure(list(
1:3), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr",
"list"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-1L), .drop = TRUE))
EDIT: Edited for clarification.
Try aggregate. This takes the output of 3 aggregates and puts them together.
list2 <- function(x){ c(unique(x),length(table(x))) }
lt <- list(year=dat$year, county=dat$country, state=dat$state )
data.frame( aggregate( dat[,c(4,5)], lt, sum ),
setNames( aggregate( dat$disastertype, lt, list2 )[,4, drop=F], colnames(dat)[6] ),
setNames( aggregate( dat$disastertype, lt, length )[,4, drop=F], "n_disasters") )
year county state deaths injured disastertype n_disasters
1 1998 US California 9 3 Wild fire, 1 1
2 1998 US New York 28 14 Flood, Epidemic, 2 2
3 1998 US Texas 17 4 Hurricane, 1 2
Not sure if you want to keep the n_... columns or not though...
EDIT: added "n_disasters"
EDIT2: added suggestion to include "distinct disasters"
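Here is a solution using dplyr with group_by and summarise. The key part is to compute u_disastertype = toString(unique(disastertype)) before disastertype = paste(disastertype, collapse = ', '): summarise evaluates its arguments in order, so once disastertype has been overwritten with the collapsed string, later expressions no longer see the original values.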
naturaldisaster2 <- naturaldisaster %>%
group_by(year, state) %>%
summarise(
u_disastertype = toString(unique(disastertype)),
disastertype = paste(disastertype, collapse = ', '),
deaths=sum(deaths),
injured=sum(injured)
)
The answer is based on this Stackoverflow answer to a similar question, where only one operation was run on the column whereas I am running two operations on the same column: https://stackoverflow.com/a/46367425/11045110
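The question also asked for the counts n_distinct and n_disasters, which the summarise call above does not produce. A minimal sketch of how they could be added in the same call, assuming the question's sample data is stored in naturaldisaster (rebuilt here so the block runs on its own); every expression that needs the raw disastertype values is placed before the line that overwrites the column:

```r
library(dplyr)

# Sample data from the question, rebuilt as a plain data frame.
naturaldisaster <- data.frame(
  year         = c(1998, 1998, 1998, 1998, 1998),
  country      = c("US", "US", "US", "US", "US"),
  state        = c("Texas", "Texas", "California", "New York", "New York"),
  deaths       = c(12, 5, 9, 10, 18),
  injured      = c(3, 1, 3, 5, 9),
  disastertype = c("Hurricane", "Hurricane", "Wild fire", "Flood", "Epidemic")
)

result <- naturaldisaster |>
  group_by(year, state) |>
  summarise(
    # These three use the original disastertype values ...
    u_disastertype = toString(unique(disastertype)),
    n_distinct     = n_distinct(disastertype),
    n_disasters    = n(),
    # ... so they must come before the column is overwritten here.
    disastertype   = toString(disastertype),
    deaths         = sum(deaths),
    injured        = sum(injured),
    .groups = "drop"
  )
```

The .groups = "drop" argument returns an ungrouped result; leave it out if the grouped output shown in the question is what you want.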
I have a network with some directed and some undirected edges. I'm trying to use igraph to plot it using the arrow.mode parameter, but the graph always shows arrows with the default parameters.
Here are some example data:
spearRhoP_lagged4 <- structure(list(Var1 = c("ARISA_538.9", "ARISA_538.9", "ARISA_666.4",
"ARISA_686.9", "ARISA_538.9", "ARISA_594.1"), Var2 = c("ARISA_666.4",
"ARISA_686.9", "ARISA_686.9", "ARISA_666.4", "ARISA_561.8", "ARISA_561.8"
), rho = c(0.280885191364122, 0.415365287156247, 0.614493076574831,
0.312630564055403, 0.295296877306726, 0.381890811408216), p = c(0.00206314544835896,
2.9098006351119e-06, 1.35005674822095e-13, 0.000567475872663549,
0.00116911931220592, 1.98010880043619e-05), delay = c(0, 0, 0,
1, 0, 0), fdr = c(0.0135393920048557, 7.97032347878478e-05, 2.83511917126399e-11,
0.00503534929264839, 0.00898225813036257, 0.000366902513022),
arrow = c("-", "-", "-", ">", "-", "-")), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
Var1 = c("ARISA_538.9", "ARISA_538.9", "ARISA_538.9", "ARISA_594.1",
"ARISA_666.4", "ARISA_686.9"), Var2 = c("ARISA_561.8", "ARISA_666.4",
"ARISA_686.9", "ARISA_561.8", "ARISA_686.9", "ARISA_666.4"
), .rows = list(5L, 1L, 2L, 6L, 3L, 4L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
Then I build the graph
LaggedSpearGraph <- graph_from_data_frame(spearRhoP_lagged4)
Lastly I plot the graph, telling it that I want the arrow direction to be specified by the parameter arrow
plot(LaggedSpearGraph,
vertex.size=2,
arrow.mode = E(LaggedSpearGraph)$arrow)
I get an output that looks like this.
But what I want is a network where there is only one edge with an arrow on it.
Any suggestions?
You need to add edge. as a prefix, i.e. use edge.arrow.mode instead of arrow.mode:
LaggedSpearGraph <- graph_from_data_frame(spearRhoP_lagged4, directed = TRUE)
plot(LaggedSpearGraph,
vertex.size=10,
edge.arrow.mode = E(LaggedSpearGraph)$arrow)
See here:
https://github.com/igraph/igraph/issues/954
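As a smaller illustration of the same idea, here is a sketch on a made-up three-edge graph (the data frame edges and the attribute name arrow_code are assumptions for the example). According to the igraph plotting documentation, edge.arrow.mode accepts either characters such as ">" and "-" or integer codes, where 0 means no arrow and 2 means a forward arrow, so the character column can also be translated into codes explicitly:

```r
library(igraph)

# Hypothetical edge list: only the second edge should get an arrow.
edges <- data.frame(
  from  = c("a", "a", "b"),
  to    = c("b", "c", "c"),
  arrow = c("-", ">", "-")
)

# Extra columns of the data frame become edge attributes.
g <- graph_from_data_frame(edges, directed = TRUE)

# Translate the character flags into igraph's integer codes:
# 0 = no arrow, 2 = forward arrow.
E(g)$arrow_code <- ifelse(E(g)$arrow == ">", 2, 0)

# Note the "edge." prefix on the plotting parameter.
plot(g, vertex.size = 10, edge.arrow.mode = E(g)$arrow_code)
```

Passing E(g)$arrow directly should draw the same picture; the integer version just makes the intended mode of each edge explicit.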
So I'm running a package in which the output of the function I'm using is something similar to this:
area ID structure
1 150 1 house
I have several of these which I get by looping through some stuff. Basically this is my loop function:
for (k in 1:length(models)) {
for (l in 1:length(patients)) {
print(result[[l]][[k]])
tableData[[l]][[k]] <- do.call(rbind, result[[l]][[k]])
}
}
So print(result[[l]][[k]]) gives the output I showed at the beginning. My problem is putting all of these into one data frame. So far it doesn't work with do.call, which I have read is the function to use when combining lists into data frames.
So where am I going wrong here?
Updated:
dput() output (area = value in this case):
list(list(structure(list(value = 0.0394797760472196, ID = "1 house",
structure = "house", model = structure(1L, .Label = "wood", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L), class = "data.frame"),
structure(list(value = 0.0394797760472196, ID = "1 house",
structure = "house", model = structure(1L, .Label = "stone", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L), class = "data.frame")),
list(structure(list(value = 0.0306923865158472, ID = "2 house",
structure = "house", model = structure(1L, .Label = "wood", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L), class = "data.frame"),
structure(list(value = 0.0306923865158472, ID = "2 house",
structure = "house", model = structure(1L, .Label = "stone", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L
), class = "data.frame")))
Edit: I initially used purrr::map_dfr to solve this problem, but purrr::reduce is much more appropriate.
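Because the list is nested, the data frames sit two levels deep. bind_rows() accepts lists of data frames, so reducing over the outer list with it binds everything into one data frame. Here's a solution using the purrr and dplyr packages, with your dput list assigned to the variable my_list: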
library(purrr)
library(dplyr)
my_df <- reduce(my_list, bind_rows)
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
my_df
#> value ID structure model
#> 1 0.03947978 1 house house wood
#> 2 0.03947978 1 house house stone
#> 3 0.03069239 2 house house wood
#> 4 0.03069239 2 house house stone
I find map-ing with purrr way more intuitive than do.call. Let me know if this helps!
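For comparison, the do.call approach from the question can also be made to work in base R if the nested list is flattened by one level first. This is a sketch under the assumption that my_list has the same shape as the dput output (a list of lists of one-row data frames); make_row is a hypothetical helper used only to rebuild a shortened version of that data:

```r
# Hypothetical helper that builds one-row data frames like those in the question.
make_row <- function(value, id, model) {
  data.frame(value = value, ID = id, structure = "house",
             model = factor(model), stringsAsFactors = FALSE)
}

my_list <- list(
  list(make_row(0.0395, "1 house", "wood"), make_row(0.0395, "1 house", "stone")),
  list(make_row(0.0307, "2 house", "wood"), make_row(0.0307, "2 house", "stone"))
)

# Remove one level of nesting: a list of lists becomes a flat list of data frames.
flat <- unlist(my_list, recursive = FALSE)

# rbind() all data frames at once; factor levels are merged automatically.
my_df <- do.call(rbind, flat)
```

In the original loop, do.call(rbind, result[[l]][[k]]) was applied to a single data frame, so rbind received its columns rather than a list of data frames, which is probably why it did not give the expected result.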
I wish to create a constituency map of the UK in leaflet which when hovered over provides constituency name and detailed result.
The data is contained in a tibble with 2 columns:
constituency, which contains the name
result, which is a list column within each cell containing a data.frame including candidate name, party, votes, % and order.
I have included a sample of two constituencies below
df <- structure(list(constituency = c("Knowsley", "Bristol West"),
result = list(structure(list(name = c("George Howarth", "James Spencer",
"Neil Miney", "Carl Cashman", "Steve Baines"), party = c("Labour",
"Conservative", "UKIP", "LD", "Green"), votes = c(47351L,
5137L, 1285L, 1189L, 521L), pc = c(85.34, 9.26, 2.32, 2.14,
0.94), order = 1:5), .Names = c("name", "party", "votes",
"pc", "order"), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame")), structure(list(name = c("Thangam Debbonaire",
"Annabel Tall", "Molly Scott Cato", "Stephen Williams", "Jodian Rodgers"
), party = c("Labour", "Conservative", "Green", "LD", "Money Free Party"
), votes = c(47213L, 9877L, 9216L, 5201L, 101L), pc = c(65.93,
13.79, 12.87, 7.26, 0.14), order = 1:5), .Names = c("name",
"party", "votes", "pc", "order"), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame")))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .Names = c("constituency", "result"))
If I just wanted the constituency in the label, I would have written something like this, which I could then apply to the leaflet output:
labels <- sprintf(
df$constituency
) %>% lapply(htmltools::HTML)
But I wish to add in the result details.
How about this?
labels = lapply(1:length(df$result), function(i) {
tmp = format(df$result[[i]])
tmp = tmp[3:length(tmp)]
tmp[1] = df$constituency[i]
htmltools::HTML(paste(tmp, collapse = "<br>"))
})
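The answer above relies on how format() prints a tibble, which can change between tibble versions. An alternative sketch that builds each label from the columns directly is shown below; make_label is a hypothetical helper, and the data is a shortened copy of the question's sample, rebuilt as a base data frame with a list column:

```r
# Shortened version of the question's data.
df <- data.frame(constituency = c("Knowsley", "Bristol West"))
df$result <- list(
  data.frame(name  = c("George Howarth", "James Spencer"),
             party = c("Labour", "Conservative"),
             votes = c(47351L, 5137L),
             pc    = c(85.34, 9.26)),
  data.frame(name  = c("Thangam Debbonaire", "Annabel Tall"),
             party = c("Labour", "Conservative"),
             votes = c(47213L, 9877L),
             pc    = c(65.93, 13.79))
)

# Hypothetical helper: one HTML string per constituency,
# with the name in bold and one line per candidate.
make_label <- function(constituency, result) {
  rows <- sprintf("%s (%s): %d votes, %.2f%%",
                  result$name, result$party, result$votes, result$pc)
  paste(c(sprintf("<strong>%s</strong>", constituency), rows),
        collapse = "<br>")
}

label_text <- unname(mapply(make_label, df$constituency, df$result))

# Mark the strings as HTML so leaflet renders the tags instead of escaping them.
labels <- lapply(label_text, htmltools::HTML)
```

The resulting labels list can be passed to the label argument of a leaflet layer in the same way as the constituency-only version from the question.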
Here is some sample data for which I want to encode the gender of the names over time:
names_to_encode <- structure(list(names = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("names", "year"), row.names = c(NA, -6L), class = "data.frame")
Here is a minimal set of the Social Security data, limited to just those names from 1890 and 1990:
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
I've defined a function which subsets the Social Security data given a year or range of years. In other words, it calculates whether a name was male or female over a given time period by figuring out the proportion of male and female births with that name. Here is the function along with a helper function:
require(plyr)
require(dplyr)
select_ssa <- function(years) {
# If we get only one year (1890) convert it to a range of years (1890-1890)
if (length(years) == 1) years <- c(years, years)
# Calculate the male and female proportions for the given range of years
ssa_select <- ssa_demo %.%
filter(year >= years[1], year <= years[2]) %.%
group_by(name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = sapply(proportion_female, male_or_female))
return(ssa_select)
}
# Helper function to determine whether a name is male or female in a given year
male_or_female <- function(proportion_female) {
if (proportion_female > 0.5) {
return("female")
} else if(proportion_female == 0.5000) {
return("either")
} else {
return("male")
}
}
Now what I want to do is use plyr, specifically ddply, to subset the data to be encoded by year, and merge each of those pieces with the value returned by the select_ssa function. This is the code I have.
ddply(names_to_encode, .(year), merge, y = select_ssa(year), by.x = "names", by.y = "name", all.x = TRUE)
When calling select_ssa(year), this command works just fine if I hard code a value like 1890 as the argument to the function. But when I try to pass it the current value for year that ddply is working with, I get an error message:
Error in filter_impl(.data, dots(...), environment()) :
(list) object cannot be coerced to type 'integer'
How can I pass the current value of year on to ddply?
I think you're making things too complicated by trying to do a join inside ddply. If I were to use dplyr I would probably do something more like this:
names_to_encode <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("name", "year"), row.names = c(NA, -6L), class = "data.frame")
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
names_to_encode$name <- as.character(names_to_encode$name)
names_to_encode$year <- as.integer(names_to_encode$year)
tmp <- left_join(ssa_demo,names_to_encode) %.%
group_by(year,name) %.%
summarise(female = sum(female),
male = sum(male)) %.%
mutate(proportion_male = round((male / (male + female)), digits = 4),
proportion_female = round((female / (male + female)), digits = 4)) %.%
mutate(gender = ifelse(proportion_female == 0.5,"either",
ifelse(proportion_female > 0.5,"female","male")))
Note that dplyr 0.1.1 is still a little finicky about the types of join columns, so I had to convert them. I think I saw some activity on GitHub suggesting that this was either fixed in the dev version or is at least something they're working on.
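The %.% operator used above was later removed from dplyr. For readers on a current version, here is a rough sketch of the same join-based idea with the native pipe; it is an adaptation rather than the original answer's code, it classifies each name within a single year instead of a range of years, and it uses case_when in place of the nested ifelse:

```r
library(dplyr)

names_to_encode <- data.frame(
  name = c("john", "john", "jane", "jane", "madison", "madison"),
  year = c(1890L, 1990L, 1890L, 1990L, 1890L, 2012L)
)

ssa_demo <- data.frame(
  name   = c("jane", "jane", "john", "john", "madison", "madison"),
  year   = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L),
  female = c(372, 771, 56, 81, 0, 1407),
  male   = c(0, 8, 8502, 29066, 14, 145)
)

# Classify every name/year combination in the Social Security data.
gender_by_year <- ssa_demo |>
  mutate(
    proportion_female = round(female / (male + female), digits = 4),
    gender = case_when(
      proportion_female > 0.5  ~ "female",
      proportion_female == 0.5 ~ "either",
      TRUE                     ~ "male"
    )
  )

# Attach the classification to the names to encode;
# name/year pairs without a match (madison in 2012) get NA.
encoded <- left_join(names_to_encode, gender_by_year, by = c("name", "year"))
```

Spelling out the join columns in by avoids relying on the automatic matching of shared column names.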