I can't figure out how to word this well enough to search for a solution. I have election data for Democratic and Republican candidates. The data has two rows per county, one for each of the two candidates.
I need a data frame with one row per county, with the second row for each county turned into a new column. I've tried to unnest the data frame, but that doesn't work. I've seen suggestions to combine unnest and mutate, but I can't figure that out. Transposing the data frame didn't help either, and ungrouping was also unsuccessful.
# Load Michigan 2020 by-county election data
# Data: https://mielections.us/election/results/DATA/2020GEN_MI_CENR_BY_COUNTY.xls
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
# Remove unnecessary columns
election <- within(election, rm('ElectionDate','OfficeCode.Text.','DistrictCode.Text.','StatusCode','CountyCode','OfficeDescription','PartyOrder','PartyName','CandidateID','CandidateFirstName','CandidateMiddleName','CandidateFormerName','WriteIn.W..Uncommitted.Z.','Recount...','Nominated.N..Elected.E.'))
# Remove offices other than POTUS
election <- election[-c(167:2186),]
# Keep only DEM and REP parties
election <- election %>%
filter(PartyDescription == "Democratic" |
PartyDescription == "Republican")
I'd like it to look like this:
dplyr
library(dplyr)
library(tidyr) # pivot_wider
election %>%
select(CountyName, PartyDescription, CandidateLastName, CandidateVotes) %>%
slice(-(167:2186)) %>%
filter(PartyDescription %in% c("Democratic", "Republican")) %>%
pivot_wider(id_cols = CountyName, names_from = CandidateLastName, values_from = CandidateVotes)
# # A tibble: 83 x 25
# CountyName Biden Trump Richer LaFave Cambensy Wagner Metsa Markkanen Lipton Strayhorn Carlone Frederick Bernstein Diggs Hubbard Meyers Mosallam Vassar `O'Keefe` Schuitmaker Dewaelsche Stancato Gates Land
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 ALCONA 2142 4848 NA NA NA NA NA NA 1812 1748 4186 4209 1818 1738 4332 4114 1696 1770 4273 4187 1682 1733 4163 4223
# 2 ALGER 2053 3014 NA NA 2321 2634 NA NA 1857 1773 2438 2470 1795 1767 2558 2414 1757 1769 2538 2444 1755 1757 2458 2481
# 3 ALLEGAN 24449 41392 NA NA NA NA NA NA 20831 19627 37681 38036 20043 19640 38805 37375 18820 19486 37877 39052 19081 19039 37322 38883
# 4 ALPENA 6000 10686 NA NA NA NA NA NA 5146 4882 8845 8995 5151 4873 9369 8744 4865 4935 9212 8948 4816 4923 9069 9154
# 5 ANTRIM 5960 9748 NA NA NA NA NA NA 5042 4798 8828 8886 4901 4797 9108 8737 4686 4810 9079 8867 4679 4781 8868 9080
# 6 ARENAC 2774 5928 NA NA NA NA NA NA 2374 2320 4626 4768 2396 2224 4833 4584 2215 2243 5025 4638 2185 2276 4713 4829
# 7 BARAGA 1478 2512 NA NA NA NA 1413 2517 1267 1212 2057 2078 1269 1233 2122 2003 1219 1243 2090 2056 1226 1228 2072 2074
# 8 BARRY 11797 23471 NA NA NA NA NA NA 9794 9280 20254 20570 9466 9215 20885 20265 9060 9324 21016 20901 8967 9121 20346 21064
# 9 BAY 26151 33125 NA NA NA NA NA NA 23209 22385 26021 26418 23497 22050 27283 25593 21757 22225 27422 25795 21808 21999 26167 26741
# 10 BENZIE 5480 6601 NA NA NA NA NA NA 4704 4482 5741 5822 4584 4479 6017 5681 4379 4449 5979 5756 4392 4353 5704 5870
# # ... with 73 more rows
@r2evans had the right idea, but slicing the data before filtering lost a lot of the voting data. I hadn't realized that before.
# Load Michigan 2020 by-county election data
# Data: https://mielections.us/election/results/DATA/2020GEN_MI_CENR_BY_COUNTY.xls
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
# That's an ugly dataset...let's make it better
election <- election[-c(1:5,7:9,11,13:15,17:19)]
library(dplyr)
library(tidyr)
election <- election %>%
  filter(CandidateLastName %in% c("Biden", "Trump")) %>%
  select(CountyName, PartyDescription, CandidateLastName, CandidateVotes) %>%
  pivot_wider(id_cols = CountyName, names_from = CandidateLastName, values_from = CandidateVotes)
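As a minimal sketch of the filter-then-pivot pattern, the toy data frame below reuses a few vote counts from the output above. The names `toy` and `wide` are illustrative and not part of the original data. Filtering to the two presidential candidates first and then widening should give one row per county:

```r
library(dplyr)
library(tidyr)

# Toy data: two presidential rows per county, plus one row from another race
toy <- data.frame(
  CountyName        = c("ALCONA", "ALCONA", "ALGER", "ALGER", "ALGER"),
  CandidateLastName = c("Biden", "Trump", "Biden", "Trump", "Cambensy"),
  CandidateVotes    = c(2142L, 4848L, 2053L, 3014L, 2321L)
)

wide <- toy %>%
  # Filter by candidate rather than by row position
  filter(CandidateLastName %in% c("Biden", "Trump")) %>%
  # One row per county, one column per remaining candidate
  pivot_wider(id_cols = CountyName,
              names_from = CandidateLastName,
              values_from = CandidateVotes)

print(wide)
```

Naming `id_cols` explicitly avoids relying on argument position, which recent tidyr versions no longer accept for this argument.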
I used this method to gather mean and sd results successfully before (see here). I then tried the same method to gather my gene-count DEG data with "logFC", "cil", "cir" and "ajustP_value", but it failed: something is wrong with my result.
Here is what I did:
data_1<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_1) <- c(paste0("Gene_", 1:25))
rownames(data_1)<-NULL
head(data_1)
A<-paste0(1:48,"_logFC")
data_logFC<-data.frame(A=A,data_1)
#
data_2<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_2) <- c(paste0("Gene_", 1:25))
rownames(data_2)<-NULL
B_L<-paste0(1:48,"_CI.L")
data_CIL<-data.frame(A=B_L,data_2)
data_CIL[1:48,1:6]
#
data_3<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_3) <- c(paste0("Gene_", 1:25))
rownames(data_3)<-NULL
C_R<-paste0(1:48,"_CI.R")
data_CIR<-data.frame(A=C_R,data_3)
data_CIR[1:48,1:6]
#
data_4<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_4) <- c(paste0("Gene_", 1:25))
rownames(data_4)<-NULL
D<-paste0(1:48,"_adj.P.Val")
data_ajustP<-data.frame(A=D,data_4)
data_ajustP[1:48,1:6]
# combine data_logFC data_CIL data_CIR data_ajustP
data <- bind_rows(list(
logFC = data_logFC,
CIL = data_CIL,
CIR =data_CIR,
AJSTP=data_ajustP
), .id = "stat")
data[1:10,1:6]
data_DEG<- data %>%
pivot_longer(-c(stat,A), names_to = "Gene", values_to = "value") %>%pivot_wider(names_from = "stat", values_from = "value")
head(data_DEG,100)
str(data_DEG$CIL)
> head(data_DEG,100)
# A tibble: 100 x 6
A Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 1_logFC Gene_1 504 NA NA NA
2 1_logFC Gene_2 100 NA NA NA
3 1_logFC Gene_3 689 NA NA NA
4 1_logFC Gene_4 779 NA NA NA
5 1_logFC Gene_5 397 NA NA NA
6 1_logFC Gene_6 1152 NA NA NA
7 1_logFC Gene_7 780 NA NA NA
8 1_logFC Gene_8 155 NA NA NA
9 1_logFC Gene_9 142 NA NA NA
10 1_logFC Gene_10 1150 NA NA NA
# … with 90 more rows
Why are there so many NAs?
Can somebody help me? I would be very grateful.
EDIT:
I had mixed up the real sample grouping of my data, so I was reshaping without a correct index: column A was different for every statistic, so the rows never lined up.
Here is the correct method:
data[1:10,1:6]
data<-separate(data,A,c("Name","stat2"),"_")
data<-data[,-3]
data_DEG<- data %>%
pivot_longer(-c(stat,Name), names_to = "Gene", values_to = "value") %>%pivot_wider(names_from = "stat", values_from = "value")
head(data_DEG,10)
tail(data_DEG,10)
> head(data_DEG,10)
# A tibble: 10 x 6
Name Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 1 Gene_1 504 1116 774 278
2 1 Gene_2 100 936 448 887
3 1 Gene_3 689 189 718 933
4 1 Gene_4 779 943 690 19
5 1 Gene_5 397 976 40 135
6 1 Gene_6 1152 304 343 647
7 1 Gene_7 780 1076 796 1024
8 1 Gene_8 155 645 469 180
9 1 Gene_9 142 256 889 1047
10 1 Gene_10 1150 976 1194 670
> tail(data_DEG,10)
# A tibble: 10 x 6
Name Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 48 Gene_16 448 633 1080 1122
2 48 Gene_17 73 772 14 388
3 48 Gene_18 652 999 699 912
4 48 Gene_19 600 1163 512 241
5 48 Gene_20 428 1119 1142 348
6 48 Gene_21 66 553 240 82
7 48 Gene_22 753 1119 630 117
8 48 Gene_23 1017 305 1120 447
9 48 Gene_24 432 1175 447 670
10 48 Gene_25 482 394 371 696
It's a perfect result!!
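To illustrate why the first attempt produced NAs, here is a minimal sketch with two values taken from the output above. The names `long`, `bad` and `good` are illustrative. pivot_wider treats every column not named in names_from or values_from as part of the row key, so a key such as "1_logFC" never matches "1_CI.L":

```r
library(tidyr)

# One gene, one sample, two statistics, already in long form
long <- data.frame(
  stat  = c("logFC", "CIL"),
  A     = c("1_logFC", "1_CI.L"),
  Gene  = "Gene_1",
  value = c(504, 1116)
)

# Key is (A, Gene): A differs per statistic, so each row stays separate
bad <- pivot_wider(long, names_from = stat, values_from = value)

# Strip the statistic suffix so both rows share the key (Name, Gene)
long$Name <- sub("_.*$", "", long$A)
good <- pivot_wider(long[c("stat", "Name", "Gene", "value")],
                    names_from = stat, values_from = value)

print(bad)
print(good)
```

Here `bad` keeps two rows, each with one NA, while `good` collapses them into a single complete row.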
A sample of my data is available here.
I am trying to calculate the growth rate (change in weight (wt) over time) for each squirrel.
When I have my data in wide format:
squirrel fieldBirthDate date1 date2 date3 date4 date5 date6 age1 age2 age3 age4 age5 age6 wt1 wt2 wt3 wt4 wt5 wt6 litterid
22922 2017-05-13 2017-05-14 2017-06-07 NA NA NA NA 1 25 NA NA NA NA 12 52.9 NA NA NA NA 7684
22976 2017-05-13 2017-05-16 2017-06-07 NA NA NA NA 3 25 NA NA NA NA 15.5 50.9 NA NA NA NA 7692
22926 2017-05-13 2017-05-16 2017-06-07 NA NA NA NA 0 25 NA NA NA NA 10.1 48 NA NA NA NA 7719
I am able to calculate growth rate with the following code:
library(dplyr)
#growth rate between weight 1 and weight 3, divided by age when weight 3 is recorded
growth <- growth %>%
mutate (g.rate=((wt3-wt1)/age3))
#growth rate between weight 1 and weight 2, divided by age when weight 2 is recorded
merge.growth <- merge.growth %>%
mutate (g.rate=((wt2-wt1)/age2))
However, when the data is in long format (a format needed for the analysis I am running afterwards):
squirrel litterid date age wt
22922 7684 2017-05-13 0 NA
22922 7684 2017-05-14 1 12
22922 7684 2017-06-07 25 52.9
22976 7692 2017-05-13 1 NA
22976 7692 2017-05-16 3 15.5
22976 7692 2017-06-07 25 50.9
22926 7719 2017-05-14 0 10.1
22926 7719 2017-06-08 25 48
I cannot use the mutate function I used above. I am hoping to create a new column that includes growth rate as follows:
squirrel litterid date age wt g.rate
22922 7684 2017-05-13 0 NA NA
22922 7684 2017-05-14 1 12 NA
22922 7684 2017-06-07 25 52.9 1.704
22976 7692 2017-05-13 1 NA NA
22976 7692 2017-05-16 3 15.5 NA
22976 7692 2017-06-07 25 50.9 1.609
22926 7719 2017-05-14 0 10.1 NA
22926 7719 2017-06-08 25 48 1.516
22758 7736 2017-05-03 0 8.8 NA
22758 7736 2017-05-28 25 43 1.368
22758 7736 2017-07-05 63 126 1.860
22758 7736 2017-07-23 81 161 1.879
22758 7736 2017-07-26 84 171 1.930
I have been calculating the growth rates (growth between each wt and the first time it was weighed) in Excel; however, I would like to do the calculations in R instead since I have a large number of squirrels to work with. I suspect if/else statements or loops might be the way to go here, but I am not well versed in that sort of coding. Any suggestions or ideas are welcome!
You can use group_by to calculate this for each squirrel:
group_by(df, squirrel) %>%
mutate(g.rate = (wt - nth(wt, which.min(is.na(wt)))) /
(age - nth(age, which.min(is.na(wt)))))
That leaves NaNs where the age term is zero (the row of each squirrel's first recorded weight). If you store the result back in df, you can change those to NAs with df$g.rate[is.nan(df$g.rate)] <- NA.
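A self-contained sketch of the same idea, using a few rows from the question. The helper `at_first_weighing` and the name `out` are hypothetical, introduced only for this illustration. It assumes rows are already ordered by age within each squirrel:

```r
library(dplyr)

# Subset of the long-format data from the question
df <- data.frame(
  squirrel = c(22922, 22922, 22922, 22926, 22926),
  age      = c(0, 1, 25, 0, 25),
  wt       = c(NA, 12, 52.9, 10.1, 48)
)

# Hypothetical helper: value of x at the first row where wt was recorded
at_first_weighing <- function(x, wt) x[which(!is.na(wt))[1]]

out <- df %>%
  group_by(squirrel) %>%
  mutate(
    # Growth since the first weighing, per day elapsed since that weighing
    g.rate = (wt - at_first_weighing(wt, wt)) /
             (age - at_first_weighing(age, wt)),
    # 0/0 on the baseline row gives NaN; report it as NA instead
    g.rate = ifelse(is.nan(g.rate), NA_real_, g.rate)
  ) %>%
  ungroup()

print(out)
```

For squirrel 22922 the last row works out to (52.9 - 12) / (25 - 1), which rounds to the 1.704 shown in the desired output.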
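An alternative uses data.table and its shift function, which takes the value from the previous row. Note that this divides the change since the previous weighing by age, rather than measuring growth from the first weighing: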
library(data.table)
df= data.table(df)
df[,"growth":=(wt-shift(wt,1))/age,by=.(squirrel)]
I have been trying to calculate the growth rate comparing quarter 1 from one year to quarter 1 for the following year.
In Excel the formula would look like this: ((B6-B2)/B2)*100.
What is the best way to accomplish this in R? I know how to get the differences from period to period, but cannot accomplish it with 4 time periods' difference.
Here is the code:
date <- c("2000-01-01","2000-04-01", "2000-07-01",
"2000-10-01","2001-01-01","2001-04-01",
"2001-07-01","2001-10-01","2002-01-01",
"2002-04-01","2002-07-01","2002-10-01")
value <- c(1592,1825,1769,1909,2022,2287,2169,2366,2001,2087,2099,2258)
df <- data.frame(date,value)
Which will produce this data frame:
date value
1 2000-01-01 1592
2 2000-04-01 1825
3 2000-07-01 1769
4 2000-10-01 1909
5 2001-01-01 2022
6 2001-04-01 2287
7 2001-07-01 2169
8 2001-10-01 2366
9 2002-01-01 2001
10 2002-04-01 2087
11 2002-07-01 2099
12 2002-10-01 2258
Here's an option using the dplyr package:
# Convert date column to date format
df$date = as.POSIXct(df$date)
library(dplyr)
library(lubridate)
In the code below, we first group by month, which allows us to operate on each quarter separately. The arrange function just makes sure that the data within each quarter is ordered by date. Then we add the yearOverYear column using mutate which calculates the ratio of the current year to the previous year for each quarter.
df = df %>% group_by(month=month(date)) %>%
arrange(date) %>%
mutate(yearOverYear=value/lag(value,1))
date value month yearOverYear
1 2000-01-01 1592 1 NA
2 2001-01-01 2022 1 1.2701005
3 2002-01-01 2001 1 0.9896142
4 2000-04-01 1825 4 NA
5 2001-04-01 2287 4 1.2531507
6 2002-04-01 2087 4 0.9125492
7 2000-07-01 1769 7 NA
8 2001-07-01 2169 7 1.2261164
9 2002-07-01 2099 7 0.9677271
10 2000-10-01 1909 10 NA
11 2001-10-01 2366 10 1.2393924
12 2002-10-01 2258 10 0.9543533
If you prefer to have the data frame back in overall date order after adding the year-over-year values:
df = df %>% group_by(month=month(date)) %>%
arrange(date) %>%
mutate(yearOverYear=value/lag(value,1)) %>%
ungroup() %>% arrange(date)
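If the series has no missing quarters, a shorter variant is to lag by four rows directly, which mirrors the Excel formula ((B6-B2)/B2)*100. This is a sketch under that assumption; the name `out` is illustrative:

```r
library(dplyr)

df <- data.frame(
  date  = seq(as.Date("2000-01-01"), by = "quarter", length.out = 12),
  value = c(1592, 1825, 1769, 1909, 2022, 2287,
            2169, 2366, 2001, 2087, 2099, 2258)
)

out <- df %>%
  arrange(date) %>%
  # Compare each quarter with the row four quarters earlier, as a percentage
  mutate(yoy = (value - lag(value, 4)) / lag(value, 4) * 100)

print(out)
```

The first four rows have no earlier year to compare against, so they come out as NA. Grouping by month, as in the code above, is the safer choice when some quarters may be missing.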
Or using data.table
library(data.table) # v1.9.5+
setDT(df)[, .(date, yoy = (value-shift(value))/shift(value)*100),
by = month(date)
][order(date)]
Here's a very simple solution:
YearOverYear<-function (x,periodsPerYear){
if(NROW(x)<=periodsPerYear){
stop("too few rows")
}
else{
indexes<-1:(NROW(x)-periodsPerYear)
return(c(rep(NA,periodsPerYear),(x[indexes+periodsPerYear]-x[indexes])/x[indexes]))
}
}
> cbind(df,YoY=YearOverYear(df$value,4))
date value YoY
1 2000-01-01 1592 NA
2 2000-04-01 1825 NA
3 2000-07-01 1769 NA
4 2000-10-01 1909 NA
5 2001-01-01 2022 0.27010050
6 2001-04-01 2287 0.25315068
7 2001-07-01 2169 0.22611645
8 2001-10-01 2366 0.23939235
9 2002-01-01 2001 -0.01038576
10 2002-04-01 2087 -0.08745081
11 2002-07-01 2099 -0.03227294
12 2002-10-01 2258 -0.04564666
df$yoy <- c(rep(NA,4),(df$value[5:nrow(df)]-df$value[1:(nrow(df)-4)])/df$value[1:(nrow(df)-4)]*100)
df
## date value yoy
## 1 2000-01-01 1592 NA
## 2 2000-04-01 1825 NA
## 3 2000-07-01 1769 NA
## 4 2000-10-01 1909 NA
## 5 2001-01-01 2022 27.010050
## 6 2001-04-01 2287 25.315068
## 7 2001-07-01 2169 22.611645
## 8 2001-10-01 2366 23.939235
## 9 2002-01-01 2001 -1.038576
## 10 2002-04-01 2087 -8.745081
## 11 2002-07-01 2099 -3.227294
## 12 2002-10-01 2258 -4.564666
Another base R solution. It requires the date column to be in Date format, so that the month can be used as a grouping variable and the growth-rate function applied within each group.
# set date to a Date object
df$date <- as.Date(df$date)
# order by date
df <- df[order(df$date), ]
# function to calculate differences
f <- function(x) c(NA, 100*diff(x)/x[-length(x)])
df$yoy <- ave(df$value, format(df$date, "%m"), FUN=f)
# date value yoy
# 1 2000-01-01 1592 NA
# 2 2000-04-01 1825 NA
# 3 2000-07-01 1769 NA
# 4 2000-10-01 1909 NA
# 5 2001-01-01 2022 27.010050
# 6 2001-04-01 2287 25.315068
# 7 2001-07-01 2169 22.611645
# 8 2001-10-01 2366 23.939235
# 9 2002-01-01 2001 -1.038576
# 10 2002-04-01 2087 -8.745081
# 11 2002-07-01 2099 -3.227294
# 12 2002-10-01 2258 -4.564666
or
c(rep(NA, 4), 100 * diff(df$value, lag = 4) / head(df$value, -4))
I have two data frames with different dimensions:
1:
head(x)
Year GDP_deflator
1 1825 NA
2 1826 NA
3 1827 NA
4 1828 NA
5 1829 NA
6 1829 NA
7 1830 NA
8 1830 NA
9 1830 NA
10 1831 NA
dim(x)
1733 2
2:
head(dataDef)
Year GDP_deflator
1 1825 1.788002
2 1826 1.884325
3 1827 2.016997
4 1828 1.802907
5 1829 1.781999
6 1830 1.866437
7 1831 1.960316
8 1832 2.029601
9 1833 1.880957
10 1834 1.845750
dim(dataDef)
101 2
I would like to substitute values from dataDef$GDP_deflator column into x$GDP_deflator column conditioned on Year column. In other words, I would like the answer to be:
head (x)
Year GDP_deflator
1 1825 1.788002
2 1826 1.884325
3 1827 2.016997
4 1828 1.802907
5 1829 1.781999
6 1829 1.781999
7 1830 1.866437
8 1830 1.866437
9 1830 1.866437
10 1831 1.960316
So the repeating years (i.e. 1830) get the same value, 1.866437. Any suggestions?
Best Regards
One possibility is to use match:
x$GDP_deflator <- dataDef$GDP_deflator[match(x$Year, dataDef$Year)]
You want to merge the two data.frames. It's a many-to-one merge.
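A minimal sketch of that many-to-one merge, using a few of the years from the question and comparing it with the match approach. The names `merged` and `via_match` are illustrative. The all-NA GDP_deflator column is dropped from x before merging so the two columns do not clash:

```r
# Many rows per year, deflator not yet filled in
x <- data.frame(
  Year         = c(1825, 1826, 1829, 1829, 1830, 1830, 1830, 1831),
  GDP_deflator = NA
)

# One row per year
dataDef <- data.frame(
  Year         = c(1825, 1826, 1829, 1830, 1831),
  GDP_deflator = c(1.788002, 1.884325, 1.781999, 1.866437, 1.960316)
)

# Left join on Year: every row of x is kept, repeated years get the same value
merged <- merge(x["Year"], dataDef, by = "Year", all.x = TRUE)

# Same lookup with match, which preserves the original row order of x
via_match <- dataDef$GDP_deflator[match(x$Year, dataDef$Year)]

print(merged)
```

merge sorts its result by the key column, so match is the simpler option when the original row order of x matters.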
I have a data frame like ba.
I need to split the data frame by region and then merge the pieces by date.
It works if I do it manually, as shown below. But if there are more than two regions, I need to do the extraction with sapply and then merge, and I'm not sure how to do that with a loop or sapply. Please advise how I can extract by "region" and then merge dynamically, even when there are more than two regions (e.g. betasol, alpha, atpTax).
> ba
date region AveElapsedTime
1 2012-05-19 betasol 1372
2 2012-05-22 atpTax 1652
3 2012-06-02 betasol 1630
4 2012-06-02 atpTax 1552
5 2012-06-07 betasol 1408
6 2012-06-12 betasol 1471
7 2012-06-15 betasol 1384
8 2012-06-21 betasol 1390
9 2012-06-22 atpTax 1252
10 2012-06-23 betasol 1442
> dfa <- ba[ba$region == "atpTax", c("date", "AveElapsedTime")]
> dfb <- ba[ba$region == "betasol", c("date", "AveElapsedTime")]
> merge(dfa, dfb, by="date", all=TRUE)
date AveElapsedTime.x AveElapsedTime.y
1 2012-05-19 NA 1372
2 2012-05-22 1652 NA
3 2012-06-02 1552 1630
4 2012-06-07 NA 1408
5 2012-06-12 NA 1471
6 2012-06-15 NA 1384
7 2012-06-21 NA 1390
8 2012-06-22 1252 NA
9 2012-06-23 NA 1442
extractfun <- function(z, ab) {
df[z] <- ab[ab$region == z, c("date","region")]
}
sapply(unique(ba$region), FUN=extractfun, ab=avg_data)
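library(reshape)
# One row per date, one column per region; name the value column explicitly
cast(ba, date ~ region, value = "AveElapsedTime")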
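If you would rather stay in base R and keep the extract-then-merge approach from the question, a sketch along these lines should work for any number of regions. The first five rows come from the question; the `alpha` row is made up purely to show a third region, and the names `parts` and `wide` are illustrative:

```r
ba <- data.frame(
  date   = c("2012-05-19", "2012-05-22", "2012-06-02",
             "2012-06-02", "2012-06-07", "2012-06-02"),
  region = c("betasol", "atpTax", "betasol",
             "atpTax", "betasol", "alpha"),   # "alpha" row is hypothetical
  AveElapsedTime = c(1372, 1652, 1630, 1552, 1408, 1500)
)

# One data frame per region, holding only date and the measurement
parts <- split(ba[c("date", "AveElapsedTime")], ba$region)

# Rename the measurement column after its region so the merged names stay distinct
parts <- Map(function(d, nm) setNames(d, c("date", nm)), parts, names(parts))

# Full outer join of all pieces on date
wide <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), parts)

print(wide)
```

split does the per-region extraction that the sapply attempt was aiming for, and Reduce applies the two-table merge repeatedly across the resulting list.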