Is there an R function for including column into row data - r

I would like to perform a chi-square test in R on data read from a CSV file. To do that, I want to transform the data frame from the following structure
Observed Values East West North South
Males 50 142 131 70
Females 435 1523 1356 750
to the following structure:
Row Observed value Region
1 1 East
2 1 East
3 1 East
...
435 0 East
where 1 = male and 0 = female.
I have been trying to use the stack and data.frame functions to create the new table. I need this table to perform the chi-square test in R. The code I am trying is below:
Stacked_data <- stack(data)
library(dummies)
df1 <- data.frame(id = 1:0, Observed.Values )
df2 <- cbind(Stacked_data, dummy(df1$id, sep = "_"))
The expected result will contain 2 columns (observed value and region). Observed value will hold the categorical code (male = 1, female = 0), and Region will hold the region for the respective observed value.
So that when I run
table(Region,Observed Values)
It will produce
Observed Values
Region 1 0
East 50 435
West 142 1523
North 131 1356
South 70 750

Update: based on your expected output, you don't need much at all. Using obs from below, all you need to get your output (on which you can run chisq.test) is:
obs2 <- t(obs[,-1])
dimnames(obs2) <- list(Region = rownames(obs2), Observed = c('1', '0')) # rows of obs are Males (1), then Females (0)
obs2
# Observed
# Region 1 0
# East 50 435
# West 142 1523
# North 131 1356
# South 70 750
But, then again, if all you need is to run a chisq.test on them, it doesn't matter which orientation you use:
### original frame you provided
chisq.test(obs[,-1])
# Pearson's Chi-squared test
# data: as.matrix(obs[, -1])
# X-squared = 1.5959, df = 3, p-value = 0.6603
### transposed/re-labeled frame
chisq.test(obs2)
# Pearson's Chi-squared test
# data: obs2
# X-squared = 1.5959, df = 3, p-value = 0.6603
No difference. Perhaps all you needed was the [,-1] part?
Here's an attempt, though I don't know that it's exactly what you expect. (Input data is at the bottom of this answer.)
library(dplyr)
library(tidyr)
out1 <- obs %>%
gather(Region, v, -Observed) %>%
rowwise() %>%
do( tibble(Region = .$Region, Observed = rep(1L * (.$Observed == "Males"), .$v)) ) %>%
ungroup() %>%
mutate(Row = row_number())
out1
# # A tibble: 4,457 x 3
# Region Observed Row
# <chr> <int> <int>
# 1 East 1 1
# 2 East 1 2
# 3 East 1 3
# 4 East 1 4
# 5 East 1 5
# 6 East 1 6
# 7 East 1 7
# 8 East 1 8
# 9 East 1 9
# 10 East 1 10
# # ... with 4,447 more rows
We can verify that it is reversible with
xtabs(~ Observed + Region, data = out1)
# Region
# Observed East North South West
# 0 435 1356 750 1523
# 1 50 131 70 142
(even though the columns and rows are in a different order than in the input, the numbers match).
Data:
obs <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
Observed East West North South
Males 50 142 131 70
Females 435 1523 1356 750 ")
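If you would rather avoid the tidyverse, the same expansion can be sketched in base R with rep(). This is only a sketch under a few assumptions: it rebuilds obs with data.frame() so it runs on its own, and the names counts, long and tab are mine, not from the question.

```r
# Same numbers as obs above, rebuilt so the snippet is self-contained
obs <- data.frame(
  Observed = c("Males", "Females"),
  East = c(50, 435), West = c(142, 1523),
  North = c(131, 1356), South = c(70, 750)
)

counts <- as.matrix(obs[, -1])  # 2 x 4 matrix of cell counts

# Repeat each cell's labels as many times as its count (column-major order)
long <- data.frame(
  Region   = rep(colnames(counts)[col(counts)], times = as.vector(counts)),
  Observed = rep(as.integer(obs$Observed[row(counts)] == "Males"),
                 times = as.vector(counts))
)

nrow(long)  # 4457, the sum of all eight counts

# Cross-tabulate the long data again and test it
tab <- table(long$Region, long$Observed)
chisq.test(tab)
```

Since tab holds the same eight counts as the original table, the test on it should agree with the chisq.test calls shown earlier.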

Related

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset, and I would like to join data_occ to it based on which band the income falls inside. Where more than one observation (income) falls within a range, I would like to keep the lowest income.
I was attempting to use data.table but was having trouble. I would also like to keep all columns from both data.frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate data frame joined for ease of understanding. You can omit it and chain the two pipelines together with %>%.
Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7
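For comparison, the logic both answers rely on (match on id, keep incomes inside the band, then take the lowest income per band) can be sketched in plain base R. The incomes below are hand-picked, hypothetical values so the result is reproducible, and the names candidates, inside and result are my own. Note that merge() on id builds every band/income pair first, so this is only sensible for small data.

```r
income_range <- data.frame(
  id        = "France",
  inc_lower = c(10, 21, 31, 41, 51, 61, 71),
  inc_high  = c(20, 30, 40, 50, 60, 70, 80),
  perct     = c(1, 2, 3, 4, 5, 6, 7)
)

# Hypothetical, hand-picked incomes (not the random sample from the question)
data_occ <- data.frame(
  id     = c(rep("France", 10), rep("Belgium", 2)),
  income = c(19, 12, 25, 38, 33, 47, 55, 66, 64, 78, 15, 45),
  occ    = rep(c("manager", "clerk", "manual"), times = 4)
)

# 1. Pair every band with every income that has the same id
candidates <- merge(income_range, data_occ, by = "id")

# 2. Keep only the pairs where the income lies inside the band
inside <- candidates[candidates$income >= candidates$inc_lower &
                       candidates$income <= candidates$inc_high, ]

# 3. Sort by band and income, then keep the first (lowest) income per band
inside <- inside[order(inside$perct, inside$income), ]
result <- inside[!duplicated(inside$perct), ]
result
```

A band with no matching income would simply be missing from result here, whereas the left join in the answers above keeps it with NA values.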

How to create rate on R

I want to change my data so that it gives me the rate of pedestrians relative to each state's population. I am using a linear model and my summary values look like this:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.087061 0.029876 2.914 0.00438 **
intersection 0.009192 0.003086 2.978 0.00362 **
Here, my beta value for intersection is .009192, which is not meaningful on its own because it does not account for population: compared with a state that has a smaller population, this value might be nothing.
Below is a condensed version of my data without all the columns I use, but here is the link to the CSV in case someone wants to download it.
> head(c)
# A tibble: 6 x 15
STATE STATENAME PEDS PERSONS PERMVIT PERNOTMVIT COUNTY COUNTYNAME CITY DAY MONTH YEAR LATITUDE LONGITUD
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Alabama 0 3 3 0 81 LEE (81) 2340 7 2 2019 32.7 -85.3
2 1 Alabama 0 2 2 0 55 ETOWAH (55) 1280 23 1 2019 34.0 -86.1
3 1 Alabama 0 4 4 0 29 CLEBURNE (29) 0 22 1 2019 33.7 -85.4
4 1 Alabama 1 1 1 1 55 ETOWAH (55) 2562 22 1 2019 34.0 -86.1
5 1 Alabama 0 1 1 0 3 BALDWIN (3) 0 18 1 2019 30.7 -87.8
6 1 Alabama 0 2 2 0 85 LOWNDES (85) 0 7 1 2019 32.2 -86.4
# … with 1 more variable: FATALS <dbl>
Here is the code I have that runs through the process I am doing. I don't see how I can change it so that each value is a rate (values like peds or type_int)
#Libraries
rm(list=ls()) # this is to clear anything in memory
library(leaflet)
library(tidyverse)
library(ggmap)
library(leaflet.extras)
library(htmltools)
library(ggplot2)
library(maps)
library(mapproj)
library(mapdata)
library(zoo)
library(tsibble)
setwd("~/Desktop/Statistics790/DataSets/FARS2019NationalCSV")
df <- read.csv("accident.csv")
state <- unique(df$STATE)
for(i in state){
df1<- df %>%
filter(STATE==i) %>%
dplyr::select(c(STATE,PEDS,DAY,MONTH,YEAR,TYP_INT)) %>%
mutate(date = as.Date(as.character(paste(YEAR, MONTH, DAY, sep = "-"),"%Y-%m-%d"))) %>% # create a date
group_by(date) %>% # Group by State id and date
# summarise_at(.vars = vars(PEDS), sum)
summarise(pedday=sum(PEDS),intersection=mean(TYP_INT))
#ts1<-ts(df,start=c(2019,1,1), frequency=365)
setwd("~/Desktop/Statistics790/States_ts/figures")
plots<-df1 %>%
ggplot()+
geom_line(aes(x=date,y=pedday))+ylim(0,13)+
theme_bw()
ggsave(paste0("state_",i,".png"),width=8,height=6, )
ts1<-ts(df1,start=c(2019,1,1), frequency=365)
setwd("~/Desktop/Statistics790/States_ts")
ts1 %>% write.csv(paste0("state_",i,".csv"),row.names = F)
#Plots
}
#date1<- as.character(df$date)
#df1<- df%>% filter(STATE=="1")
#ts2<-xts(df,order.by = as.Date(df$date,"%Y-%m-%d"))
setwd("~/Desktop/Statistics790/States_ts")
cat("\f")
#df <- read.csv(paste0("state_1.csv"))
#print("------Linear Model------")
#summary(lm(pedday~weather,data=df))
for(i in state){
print(paste0("-------------------------Analysis for State: ",i," -------------------------------"))
df <- read.csv(paste0("state_",i,".csv"))
print("------Linear Model------")
print(summary(lm(pedday~intersection,data=df)))
}
Collating my answers from the comments: you need to get state population data from an outside source such as the US Census (https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html#par_textimage_1574439295), read it in, join it to your dataset, and then calculate the rate as pedestrians per population, scaled for ease of reading on the graph. You can also make your code faster by taking some of your calculations out of the loop. The code below assumes the census data is saved as 'census.csv' and has a column 'Geographic Area' for the state and a column 'X2019' for the most recent population data available.
library(dplyr)
scale <- 1e5 # example scaling factor (rate per 100,000 people); choose whatever suits your plot
pop <- read.csv('census.csv') # read.csv() turns the header 'Geographic Area' into 'Geographic.Area'
df <- read.csv('accident.csv') %>%
  left_join(pop, by = c('STATENAME' = 'Geographic.Area')) %>%
  mutate(rate = (PEDS / X2019) * scale) %>%
  mutate(date = as.Date(paste(YEAR, MONTH, DAY, sep = "-"), format = "%Y-%m-%d"))
The left_join will match state names and give each row a population value depending on its state, regardless of how many rows there are.
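To make the arithmetic concrete, here is a small base R sketch of the same idea on made-up toy data. The populations are round numbers chosen only so the result is easy to check by hand, and the names accidents, pop, totals and rate_per_100k are assumptions of mine, not columns from the FARS or census files.

```r
# Made-up toy data: round numbers chosen only to make the arithmetic easy to check
accidents <- data.frame(
  STATENAME = c("Alabama", "Alabama", "Alaska"),
  PEDS      = c(0, 1, 2)
)
pop <- data.frame(
  state      = c("Alabama", "Alaska"),
  population = c(1000000, 500000)
)

# Total pedestrians per state, then attach each state's population
totals <- aggregate(PEDS ~ STATENAME, data = accidents, FUN = sum)
totals <- merge(totals, pop, by.x = "STATENAME", by.y = "state", all.x = TRUE)

# Rate per 100,000 residents
totals$rate_per_100k <- totals$PEDS / totals$population * 1e5
totals
```

With these toy numbers the state with half the population and twice the pedestrians ends up with four times the rate (0.4 versus 0.1 per 100,000), which is the kind of comparison the raw counts hide.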

R Panel data: Create new variable based on ifelse() statement and previous row

My question refers to the following (simplified) panel data, for which I would like to create some sort of xrd_stock.
#Setup data
library(tidyverse)
firm_id <- c(rep(1, 5), rep(2, 3), rep(3, 4))
firm_name <- c(rep("Cosco", 5), rep("Apple", 3), rep("BP", 4))
fyear <- c(seq(2000, 2004, 1), seq(2003, 2005, 1), seq(2005, 2008, 1))
xrd <- c(49,93,121,84,37,197,36,154,104,116,6,21)
df <- data.frame(firm_id, firm_name, fyear, xrd)
#Define variables
growth = 0.08
depr = 0.15
For a new variable called xrd_stock I'd like to apply the following mechanics:
1. Each firm_id should be handled separately: group_by(firm_id)
2. Where fyear is at its minimum, calculate xrd_stock as: xrd/(growth + depr)
3. Otherwise, calculate xrd_stock as: xrd + (1-depr) * [xrd_stock from previous row]
With the following code, I already succeeded with step 1. and 2. and parts of step 3.
df2 <- df %>%
ungroup() %>%
group_by(firm_id) %>%
arrange(firm_id, fyear, decreasing = TRUE) %>% #Ensure that data is arranged w/ in asc(fyear) order; not required in this specific example as df is already in correct order
mutate(xrd_stock = ifelse(fyear == min(fyear), xrd/(growth + depr), xrd + (1-depr)*lag(xrd_stock))))
Difficulties occur in the else part of the function, such that R returns:
Error: Problem with `mutate()` input `xrd_stock`.
x object 'xrd_stock' not found
i Input `xrd_stock` is `ifelse(...)`.
i The error occured in group 1: firm_id = 1.
Run `rlang::last_error()` to see where the error occurred.
From this error message, I understand that R cannot refer to the just created xrd_stock in the previous row (logical when considering/assuming that R is not strictly working from top to bottom); however, when simply putting a 9 in the else part, my above code runs without any errors.
Can anyone help me with this problem so that results look eventually as shown below. I am more than happy to answer additional questions if required. Thank you very much to everyone in advance, who looks at my question :-)
Target results (Excel-calculated):
id name fyear xrd xrd_stock Calculation for xrd_stock
1 Cosco 2000 49 213 =49/(0.08+0.15)
1 Cosco 2001 93 274 =93+(1-0.15)*213
1 Cosco 2002 121 354 …
1 Cosco 2003 84 385 …
1 Cosco 2004 37 364 …
2 Apple 2003 197 857 =197/(0.08+0.15)
2 Apple 2004 36 764 =36+(1-0.15)*857
2 Apple 2005 154 803 …
3 BP 2005 104 452 …
3 BP 2006 116 500 …
3 BP 2007 6 431 …
3 BP 2008 21 388 …
Arrange the data by fyear so that the minimum year is always the first row for each firm; you can then use purrr::accumulate to carry the previous value forward.
library(dplyr)
df %>%
arrange(firm_id, fyear) %>%
group_by(firm_id) %>%
mutate(xrd_stock = purrr::accumulate(xrd[-1], ~.y + (1-depr) * .x,
.init = first(xrd)/(growth + depr)))
# firm_id firm_name fyear xrd xrd_stock
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 Cosco 2000 49 213.
# 2 1 Cosco 2001 93 274.
# 3 1 Cosco 2002 121 354.
# 4 1 Cosco 2003 84 385.
# 5 1 Cosco 2004 37 364.
# 6 2 Apple 2003 197 857.
# 7 2 Apple 2004 36 764.
# 8 2 Apple 2005 154 803.
# 9 3 BP 2005 104 452.
#10 3 BP 2006 116 500.
#11 3 BP 2007 6 431.
#12 3 BP 2008 21 388.
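The same recurrence can also be sketched without purrr, using base R's Reduce() with accumulate = TRUE and ave() for the per-firm grouping. The helper name stock_for_firm is my own, and the data frame is rebuilt here (without firm_name) only so the snippet runs on its own.

```r
growth <- 0.08
depr   <- 0.15

df <- data.frame(
  firm_id = c(rep(1, 5), rep(2, 3), rep(3, 4)),
  fyear   = c(2000:2004, 2003:2005, 2005:2008),
  xrd     = c(49, 93, 121, 84, 37, 197, 36, 154, 104, 116, 6, 21)
)

# Running stock for one firm; xrd must already be in ascending fyear order.
# The first value seeds the series, each later value is
# xrd + (1 - depr) * previous stock.
stock_for_firm <- function(xrd) {
  Reduce(function(prev, x) x + (1 - depr) * prev,
         xrd[-1],
         init = xrd[1] / (growth + depr),
         accumulate = TRUE)
}

df <- df[order(df$firm_id, df$fyear), ]
df$xrd_stock <- ave(df$xrd, df$firm_id, FUN = stock_for_firm)
round(df$xrd_stock)
```

The rounded values match the Excel-calculated target column in the question (213, 274, 354, ... for firm 1).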

How to Merge Shapefile and Dataset?

I want to create a spatial map showing drug mortality rates by US county, but I'm having trouble merging the drug mortality dataset, crude_rate, with the shapefile, usa_county_df. Can anyone help out?
I've created a key variable, "County", in both sets to merge on but I don't know how to format them to make the data mergeable. How can I make the County variables correspond? Thank you!
head(crude_rate, 5)
Notes County County.Code Deaths Population Crude.Rate
1 Autauga County, AL 1001 74 975679 7.6
2 Baldwin County, AL 1003 440 3316841 13.3
3 Barbour County, AL 1005 16 524875 Unreliable
4 Bibb County, AL 1007 50 420148 11.9
5 Blount County, AL 1009 148 1055789 14.0
head(usa_county_df, 5)
long lat order hole piece id group County
1 -97.01952 42.00410 1 FALSE 1 0 0.1 1
2 -97.01952 42.00493 2 FALSE 1 0 0.1 2
3 -97.01953 42.00750 3 FALSE 1 0 0.1 3
4 -97.01953 42.00975 4 FALSE 1 0 0.1 4
5 -97.01953 42.00978 5 FALSE 1 0 0.1 5
crude_rate$County <- as.factor(crude_rate$County)
usa_county_df$County <- as.factor(usa_county_df$County)
merge(usa_county_df, crude_rate, "County")
[1] County long lat order hole
[6] piece id group Notes County.Code
[11] Deaths Population Crude.Rate
<0 rows> (or 0-length row.names)
My take on this: first, you cannot expect a full answer with code because you did not provide a link to your data. Next time, please provide a full description of the problem along with the data.
I just used the data you provided here to illustrate.
require(tidyverse)
# Load the data
crude_rate = read.csv("county_crude.csv", header = TRUE)
usa_county = read.csv("usa_county.csv", header = TRUE)
# Create the variable "county_join" within the county_crude to "left_join" on with the usa_county data. Note that you have to have the same type of data variable between the two tables and the same values as well
crude_rate = crude_rate %>%
mutate(county_join = c(1:5))
# Join the dataframes using a left join on the county_join and County variables
df_all = usa_county %>%
left_join(crude_rate, by = c("County"="county_join")) %>%
distinct(order,hole,piece,id,group, .keep_all = TRUE)
Data link: county_crude
Data link: usa_county
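The point about matching key types and values can be illustrated with a small base R sketch. The two crude_rate rows are copied from the question; the polygon points and the fips column are hypothetical, since the question does not show which county identifier the shapefile data actually carries.

```r
# Rows copied from head(crude_rate) in the question
crude_rate <- data.frame(
  County      = c("Autauga County, AL", "Baldwin County, AL"),
  County.Code = c(1001L, 1003L),
  Deaths      = c(74L, 440L)
)

# Hypothetical polygon points; 'fips' is an assumed column, stored as text
shape_pts <- data.frame(
  long = c(-86.9, -86.4, -87.9),
  lat  = c(32.6, 32.7, 30.7),
  fips = c("1001", "1001", "1003")
)

# Bring the key to the same type as County.Code before merging
shape_pts$fips <- as.integer(shape_pts$fips)

# Keep every polygon point and attach the county's values to it
merged <- merge(shape_pts, crude_rate,
                by.x = "fips", by.y = "County.Code", all.x = TRUE)
merged
```

Every polygon point keeps its row and picks up the county's values, which is what a map layer needs. A key that merely counts rows (1, 2, 3, ...) can never line up with county names or codes, which would explain the 0-row result in the question.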

Collapsing Levels of a Factor Variable in one column while summing the counts in another

I originally had a very wide data set (4 rows with 158 columns), which I used reshape::melt() on to create a long data set (624 rows x 3 columns).
Now, however, I have a data set like this:
demo <- data.frame(region = as.factor(c("North", "South", "East", "West")),
criteria = as.factor(c("Writing_1_a", "Writing_2_a", "Writing_3_a", "Writing_4_a",
"Writing_1_b", "Writing_2_b", "Writing_3_b", "Writing_4_b")),
counts = as.integer(c(18, 27, 99, 42, 36, 144, 99, 9)))
Which produces a table similar to the one below:
region criteria counts
North Writing_1_a 18
South Writing_2_a 27
East Writing_3_a 99
West Writing_4_a 42
North Writing_1_b 36
South Writing_2_b 144
East Writing_3_b 99
West Writing_4_b 9
Now what I want to create is something like this:
goal <- data.frame(region = as.factor(c("North", "South", "East", "West")),
criteria = as.factor(c("Writing_1", "Writing_2", "Writing_3", "Writing_4")),
counts = as.integer(c(54, 171, 198, 51)))
Meaning that when I collapse the criteria columns it sums the counts:
region criteria counts
North Writing_1 54
South Writing_2 171
East Writing_3 198
West Writing_4 51
I have tried using forcats::fct_collapse() and forcats::fct_recode(), but to no avail. I'm positive I'm just not doing it right. Thank you in advance for any assistance you can provide.
You can think about what exactly you're trying to do to change factor levels—fct_collapse would manually collapse several levels into one level, and fct_recode would manually change the labels of individual levels. What you're trying to do is change all the labels based on applying some function, in which case fct_relabel is appropriate.
You can write out an anonymous function when you call fct_relabel, or just pass it the name of a function and that function's argument(s). In this case, you can use stringr::str_remove to find and remove a regex pattern, and regex such as _[a-z]$ to remove any underscore and then lowercase letter that appear at the end of a string. That way it should scale well with your real data, but you can adjust it if not.
library(tidyverse)
...
new_crits <- demo %>%
mutate(crit_no_digits = fct_relabel(criteria, str_remove, "_[a-z]$"))
new_crits
#> region criteria counts crit_no_digits
#> 1 North Writing_1_a 18 Writing_1
#> 2 South Writing_2_a 27 Writing_2
#> 3 East Writing_3_a 99 Writing_3
#> 4 West Writing_4_a 42 Writing_4
#> 5 North Writing_1_b 36 Writing_1
#> 6 South Writing_2_b 144 Writing_2
#> 7 East Writing_3_b 99 Writing_3
#> 8 West Writing_4_b 9 Writing_4
Verifying that this new variable has only the levels you want:
levels(new_crits$crit_no_digits)
#> [1] "Writing_1" "Writing_2" "Writing_3" "Writing_4"
And then summarizing based on that new factor:
new_crits %>%
group_by(crit_no_digits) %>%
summarise(counts = sum(counts))
#> # A tibble: 4 x 2
#> crit_no_digits counts
#> <fct> <int>
#> 1 Writing_1 54
#> 2 Writing_2 171
#> 3 Writing_3 198
#> 4 Writing_4 51
Created on 2018-11-04 by the reprex package (v0.2.1)
A dplyr solution using regular expressions:
demo %>%
mutate(criteria = gsub("(_a)|(_b)", "", criteria)) %>%
group_by(region, criteria) %>%
summarize(counts = sum(counts)) %>%
arrange(criteria) %>%
as.data.frame
region criteria counts
1 North Writing_1 54
2 South Writing_2 171
3 East Writing_3 198
4 West Writing_4 51
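A base R sketch of the same collapse-and-sum idea, for anyone not using the tidyverse: strip the trailing suffix with sub() and add up the counts with aggregate(). The demo data is rebuilt here so the snippet runs on its own; the name collapsed is my own.

```r
demo <- data.frame(
  region   = rep(c("North", "South", "East", "West"), times = 2),
  criteria = c("Writing_1_a", "Writing_2_a", "Writing_3_a", "Writing_4_a",
               "Writing_1_b", "Writing_2_b", "Writing_3_b", "Writing_4_b"),
  counts   = c(18L, 27L, 99L, 42L, 36L, 144L, 99L, 9L)
)

# Drop a trailing "_<lowercase letter>" so the _a and _b variants share a label
demo$criteria <- sub("_[a-z]$", "", demo$criteria)

# Sum the counts within each region/criteria combination
collapsed <- aggregate(counts ~ region + criteria, data = demo, FUN = sum)
collapsed
```

The pattern "_[a-z]$" is the same one used with fct_relabel above, so it should cope with suffixes beyond _a and _b as long as they are a single lowercase letter.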
