This is just one of those things that I can't figure out how to word when searching for a solution. I have some election data for Democratic and Republican candidates. The data has two rows per county, one row per candidate.
I need a data frame with one row per county, with a new column created from the second row for each county. I've tried to unnest the data frame, but that doesn't work. I've seen something about using unnest and mutate together, but I can't figure that out. Transposing the data frame didn't help either, and neither did ungroup.
# Load Michigan 2020 by-county election data
# Data: https://mielections.us/election/results/DATA/2020GEN_MI_CENR_BY_COUNTY.xls
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
# Remove unnecessary columns
election <- within(election, rm('ElectionDate','OfficeCode.Text.','DistrictCode.Text.','StatusCode','CountyCode','OfficeDescription','PartyOrder','PartyName','CandidateID','CandidateFirstName','CandidateMiddleName','CandidateFormerName','WriteIn.W..Uncommitted.Z.','Recount...','Nominated.N..Elected.E.'))
# Remove offices other than POTUS
election <- election[-c(167:2186),]
# Keep only DEM and REP parties
election <- election %>%
  filter(PartyDescription == "Democratic" |
         PartyDescription == "Republican")
I'd like it to look like this:
dplyr
library(dplyr)
library(tidyr) # pivot_wider
election %>%
  select(CountyName, PartyDescription, CandidateLastName, CandidateVotes) %>%
  slice(-(167:2186)) %>%
  filter(PartyDescription %in% c("Democratic", "Republican")) %>%
  pivot_wider(id_cols = CountyName,
              names_from = CandidateLastName,
              values_from = CandidateVotes)
# # A tibble: 83 x 25
# CountyName Biden Trump Richer LaFave Cambensy Wagner Metsa Markkanen Lipton Strayhorn Carlone Frederick Bernstein Diggs Hubbard Meyers Mosallam Vassar `O'Keefe` Schuitmaker Dewaelsche Stancato Gates Land
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 ALCONA 2142 4848 NA NA NA NA NA NA 1812 1748 4186 4209 1818 1738 4332 4114 1696 1770 4273 4187 1682 1733 4163 4223
# 2 ALGER 2053 3014 NA NA 2321 2634 NA NA 1857 1773 2438 2470 1795 1767 2558 2414 1757 1769 2538 2444 1755 1757 2458 2481
# 3 ALLEGAN 24449 41392 NA NA NA NA NA NA 20831 19627 37681 38036 20043 19640 38805 37375 18820 19486 37877 39052 19081 19039 37322 38883
# 4 ALPENA 6000 10686 NA NA NA NA NA NA 5146 4882 8845 8995 5151 4873 9369 8744 4865 4935 9212 8948 4816 4923 9069 9154
# 5 ANTRIM 5960 9748 NA NA NA NA NA NA 5042 4798 8828 8886 4901 4797 9108 8737 4686 4810 9079 8867 4679 4781 8868 9080
# 6 ARENAC 2774 5928 NA NA NA NA NA NA 2374 2320 4626 4768 2396 2224 4833 4584 2215 2243 5025 4638 2185 2276 4713 4829
# 7 BARAGA 1478 2512 NA NA NA NA 1413 2517 1267 1212 2057 2078 1269 1233 2122 2003 1219 1243 2090 2056 1226 1228 2072 2074
# 8 BARRY 11797 23471 NA NA NA NA NA NA 9794 9280 20254 20570 9466 9215 20885 20265 9060 9324 21016 20901 8967 9121 20346 21064
# 9 BAY 26151 33125 NA NA NA NA NA NA 23209 22385 26021 26418 23497 22050 27283 25593 21757 22225 27422 25795 21808 21999 26167 26741
# 10 BENZIE 5480 6601 NA NA NA NA NA NA 4704 4482 5741 5822 4584 4479 6017 5681 4379 4449 5979 5756 4392 4353 5704 5870
# # ... with 73 more rows
@r2evans had the right idea, but slicing the data before filtering lost a lot of the voting data; I hadn't realized that before.
# Load Michigan 2020 by-county election data
# Data: https://mielections.us/election/results/DATA/2020GEN_MI_CENR_BY_COUNTY.xls
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
# That's an ugly dataset...let's make it better
election <- election[-c(1:5,7:9,11,13:15,17:19)]
library(dplyr)
library(tidyr) # pivot_wider
election <- election %>%
  filter(CandidateLastName %in% c("Biden", "Trump")) %>%
  select(CountyName, PartyDescription, CandidateLastName, CandidateVotes) %>%
  pivot_wider(id_cols = CountyName,
              names_from = CandidateLastName,
              values_from = CandidateVotes)
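For anyone reproducing the reshape without downloading the full file, here is a self-contained sketch on toy data (county names and vote counts taken from the printed output above; the rest of the dataset is omitted):

```r
library(dplyr)
library(tidyr)

# Two rows per county (one per candidate), as in the question
toy <- tribble(
  ~CountyName, ~PartyDescription, ~CandidateLastName, ~CandidateVotes,
  "ALCONA",    "Democratic",      "Biden",            2142,
  "ALCONA",    "Republican",      "Trump",            4848,
  "ALGER",     "Democratic",      "Biden",            2053,
  "ALGER",     "Republican",      "Trump",            3014
)

# One row per county, with candidates spread into columns
wide <- toy %>%
  pivot_wider(id_cols = CountyName,
              names_from = CandidateLastName,
              values_from = CandidateVotes)
```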
A sample of my data is available here.
I am trying to calculate the growth rate (change in weight (wt) over time) for each squirrel.
When I have my data in wide format:
squirrel fieldBirthDate date1 date2 date3 date4 date5 date6 age1 age2 age3 age4 age5 age6 wt1 wt2 wt3 wt4 wt5 wt6 litterid
22922 2017-05-13 2017-05-14 2017-06-07 NA NA NA NA 1 25 NA NA NA NA 12 52.9 NA NA NA NA 7684
22976 2017-05-13 2017-05-16 2017-06-07 NA NA NA NA 3 25 NA NA NA NA 15.5 50.9 NA NA NA NA 7692
22926 2017-05-13 2017-05-16 2017-06-07 NA NA NA NA 0 25 NA NA NA NA 10.1 48 NA NA NA NA 7719
I am able to calculate growth rate with the following code:
library(dplyr)

# growth rate between weight 1 and weight 3, divided by age when weight 3 is recorded
growth <- growth %>%
  mutate(g.rate = (wt3 - wt1) / age3)

# growth rate between weight 1 and weight 2, divided by age when weight 2 is recorded
merge.growth <- merge.growth %>%
  mutate(g.rate = (wt2 - wt1) / age2)
However, when the data is in long format (a format needed for the analysis I am running afterwards):
squirrel litterid date age wt
22922 7684 2017-05-13 0 NA
22922 7684 2017-05-14 1 12
22922 7684 2017-06-07 25 52.9
22976 7692 2017-05-13 1 NA
22976 7692 2017-05-16 3 15.5
22976 7692 2017-06-07 25 50.9
22926 7719 2017-05-14 0 10.1
22926 7719 2017-06-08 25 48
I cannot use the mutate function I used above. I am hoping to create a new column that includes growth rate as follows:
squirrel litterid date age wt g.rate
22922 7684 2017-05-13 0 NA NA
22922 7684 2017-05-14 1 12 NA
22922 7684 2017-06-07 25 52.9 1.704
22976 7692 2017-05-13 1 NA NA
22976 7692 2017-05-16 3 15.5 NA
22976 7692 2017-06-07 25 50.9 1.609
22926 7719 2017-05-14 0 10.1 NA
22926 7719 2017-06-08 25 48 1.516
22758 7736 2017-05-03 0 8.8 NA
22758 7736 2017-05-28 25 43 1.368
22758 7736 2017-07-05 63 126 1.860
22758 7736 2017-07-23 81 161 1.879
22758 7736 2017-07-26 84 171 1.930
I have been calculating the growth rates (growth between each wt and the first time the squirrel was weighed) in Excel, but I would like to do the calculations in R instead, since I have a large number of squirrels to work with. I suspect if/else logic might be the way to go here, but I am not well versed in that sort of coding. Any suggestions or ideas are welcome!
You can use group_by to calculate this for each squirrel:
group_by(df, squirrel) %>%
mutate(g.rate = (wt - nth(wt, which.min(is.na(wt)))) /
(age - nth(age, which.min(is.na(wt)))))
That leaves NaNs where the age term is zero, but you can change those to NAs if you want with df$g.rate[is.nan(df$g.rate)] <- NA.
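To make this concrete, here is the same idiom run end-to-end on the first two squirrels from the question's long data, including the NaN-to-NA cleanup:

```r
library(dplyr)

# First two squirrels from the question's long-format data
df <- tribble(
  ~squirrel, ~litterid, ~date,        ~age, ~wt,
  22922,     7684,      "2017-05-13",  0,   NA,
  22922,     7684,      "2017-05-14",  1,   12,
  22922,     7684,      "2017-06-07",  25,  52.9,
  22976,     7692,      "2017-05-13",  1,   NA,
  22976,     7692,      "2017-05-16",  3,   15.5,
  22976,     7692,      "2017-06-07",  25,  50.9
)

# Growth from the first recorded weight, per squirrel
df <- group_by(df, squirrel) %>%
  mutate(g.rate = (wt - nth(wt, which.min(is.na(wt)))) /
                  (age - nth(age, which.min(is.na(wt))))) %>%
  ungroup()

# The first-weighing row of each squirrel comes out NaN (0/0); convert to NA
df$g.rate[is.nan(df$g.rate)] <- NA
```

The resulting g.rate for squirrel 22922 at age 25 is (52.9 - 12) / (25 - 1), matching the 1.704 in the desired output.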
An alternative uses data.table and its shift() function, which references the previous row. Note this computes the growth rate between consecutive weighings (dividing by the age difference), rather than the rate since the first weighing:
library(data.table)
df <- data.table(df)
df[, growth := (wt - shift(wt, 1)) / (age - shift(age, 1)), by = .(squirrel)]
I want to conditionally replace missing revenue up to 16th July 2017 with zero using tidyverse.
My Data
library(tidyverse)
library(lubridate)
df<- tribble(
~Date, ~Revenue,
"2017-07-01", 500,
"2017-07-02", 501,
"2017-07-03", 502,
"2017-07-04", 503,
"2017-07-05", 504,
"2017-07-06", 505,
"2017-07-07", 506,
"2017-07-08", 507,
"2017-07-09", 508,
"2017-07-10", 509,
"2017-07-11", 510,
"2017-07-12", NA,
"2017-07-13", NA,
"2017-07-14", NA,
"2017-07-15", NA,
"2017-07-16", NA,
"2017-07-17", NA,
"2017-07-18", NA,
"2017-07-19", NA,
"2017-07-20", NA
)
df$Date <- ymd(df$Date)
Date up to which I want to conditionally replace NAs
max.date <- ymd("2017-07-16")
Output I desire
# A tibble: 20 × 2
Date Revenue
   <date>     <dbl>
1 2017-07-01 500
2 2017-07-02 501
3 2017-07-03 502
4 2017-07-04 503
5 2017-07-05 504
6 2017-07-06 505
7 2017-07-07 506
8 2017-07-08 507
9 2017-07-09 508
10 2017-07-10 509
11 2017-07-11 510
12 2017-07-12 0
13 2017-07-13 0
14 2017-07-14 0
15 2017-07-15 0
16 2017-07-16 0
17 2017-07-17 NA
18 2017-07-18 NA
19 2017-07-19 NA
20 2017-07-20 NA
The only way I could work this out was to split the df into several parts, update the NAs, and then rbind the whole lot back together.
Could someone please help me do this efficiently using the tidyverse?
We can mutate the 'Revenue' column to replace the NA values with 0, using a logical condition that checks whether the element is NA and the 'Date' is less than or equal to 'max.date':
df %>%
mutate(Revenue = replace(Revenue, is.na(Revenue) & Date <= max.date, 0))
# A tibble: 20 x 2
# Date Revenue
# <date> <dbl>
# 1 2017-07-01 500
# 2 2017-07-02 501
# 3 2017-07-03 502
# 4 2017-07-04 503
# 5 2017-07-05 504
# 6 2017-07-06 505
# 7 2017-07-07 506
# 8 2017-07-08 507
# 9 2017-07-09 508
#10 2017-07-10 509
#11 2017-07-11 510
#12 2017-07-12 0
#13 2017-07-13 0
#14 2017-07-14 0
#15 2017-07-15 0
#16 2017-07-16 0
#17 2017-07-17 NA
#18 2017-07-18 NA
#19 2017-07-19 NA
#20 2017-07-20 NA
It can be achieved with data.table by specifying the logical condition in 'i' and assigning (:=) the 'Revenue' to 0:
library(data.table)
setDT(df)[is.na(Revenue) & Date <= max.date, Revenue := 0]
Or with base R
df$Revenue[is.na(df$Revenue) & df$Date <= max.date] <- 0
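The same condition also reads naturally with dplyr's if_else; a self-contained sketch, rebuilding the question's df inline:

```r
library(dplyr)
library(lubridate)

# Rebuild the question's data: 11 days of revenue, then 9 days of NA
df <- tibble(
  Date    = ymd("2017-07-01") + 0:19,
  Revenue = c(500:510, rep(NA_real_, 9))
)
max.date <- ymd("2017-07-16")

# Zero out NAs only on or before max.date; later NAs stay NA
df <- df %>%
  mutate(Revenue = if_else(is.na(Revenue) & Date <= max.date, 0, Revenue))
```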
I have a very simple question in R, but I cannot find the solution in previous answers, or I missed it. I want a sort of VLOOKUP (as in Excel) formula, but only for specific rows in a data frame. Let's say I have a data frame like the following:
id obs year a1 a2 b1 b2 c
604 43 2003 NA NA NA NA NA
605 43 2004 NA NA NA NA NA
606 43 2005 9000 6421 1748365 0.1616 36872152
769 55 2003 NA NA NA NA NA
770 55 2004 NA NA NA NA NA
771 55 2005 2500 12449 NA NA 125992307
844 61 2003 1800 11633 157977428 0.0089 69901689
845 61 2004 2200 14841 228966763 0.0012 86853166
846 61 2005 2500 15559 345889717 0.0081 103029905
2209 178 2003 NA NA NA NA NA
2210 178 2004 200 45093 NA NA 11668685
2211 178 2005 250 47202 610500 0.1605 12813908
Then, I apply a formula to all the complete cases in the data, so for this particular example I get a matrix with 5 rows of results (and 2 results per observation), shown here:
id x y
606 8000 30
844 1700 90
845 8000 61
846 400 82
2211 600 30
So now, what I basically want is: only for rows in year 2005 in the data frame, check where there is a match (by id) in the matrix, and modify a specific column in the data frame (which I created beforehand as "value") with its corresponding result in the "y" column of the matrix. Consider here some points: (a) for the non-complete cases it should give NA; (b) I only want year 2005 to be modified, since other years will be modified later with follow-up formulas that produce different matrix results. Given this, to my knowledge, functions like merge, match, cbind or the plyr ones will affect the whole column, and I am not looking for that. Other options like %in% or %l% didn't work either, or I am using them mistakenly. This is what I tried so far, with no success:
df$value [c(df$year==2005)] <- matrix[,3[matrix[,1]==df$id]]
df$value [c(df$year==2005)] <- matrix[,3][matrix[,1]==df$id]
Maybe a loop is the solution, but I am still learning how to build them and my attempts were unfruitful too.
Here the result that I would expect, for better understanding.
id obs year a1 a2 b1 b2 c value
604 43 2003 NA NA NA NA NA NA
605 43 2004 NA NA NA NA NA NA
606 43 2005 9000 6421 1748365 0.1616 36872152 30
769 55 2003 NA NA NA NA NA NA
770 55 2004 NA NA NA NA NA NA
771 55 2005 2500 12449 NA NA 125992307 NA
844 61 2003 1800 11633 157977428 0.0089 69901689 NA
845 61 2004 2200 14841 228966763 0.0012 86853166 NA
846 61 2005 2500 15559 345889717 0.0081 103029905 82
2209 178 2003 NA NA NA NA NA NA
2210 178 2004 200 45093 NA NA 11668685 NA
2211 178 2005 250 47202 610500 0.1605 12813908 30
Thanks a lot for any hint and keep on doing the great job. I was checking this web for about a year already and it helped me a lot!!!
Using akrun's data, you could also use:
ifelse(df1$year == 2005 & rowSums(sapply(df1[-(1:3)], is.na)) == 0,
       m1[match(df1$id, m1[, "id"]), "y"],
       NA)
#[1] NA NA 30 NA NA NA NA NA 82 NA NA 30
i.e. if the year is 2005 and there is no NA in the row, take the respective "y" from the matrix, else NA.
You could try the following, where df1 is the data.frame and m1 the matrix:
indx <- which(df1$year==2005)
Update
I guess I missed one of the conditions, i.e. complete.cases (though with the example dataset it doesn't change the results). The new indx should be:
indx <- which(df1$year==2005 & !rowSums(is.na(df1[-(1:3)]))) #inspired from #alexis_laz answer
df1$value <- NA
df1$value[indx[df1$id[indx] %in% m1[,"id"] ]] <- m1[, "y"][m1[,"id"] %in% df1$id[indx]]
df1
# id obs year a1 a2 b1 b2 c value
#1 604 43 2003 NA NA NA NA NA NA
#2 605 43 2004 NA NA NA NA NA NA
#3 606 43 2005 9000 6421 1748365 0.1616 36872152 30
#4 769 55 2003 NA NA NA NA NA NA
#5 770 55 2004 NA NA NA NA NA NA
#6 771 55 2005 2500 12449 NA NA 125992307 NA
#7 844 61 2003 1800 11633 157977428 0.0089 69901689 NA
#8 845 61 2004 2200 14841 228966763 0.0012 86853166 NA
#9 846 61 2005 2500 15559 345889717 0.0081 103029905 82
#10 2209 178 2003 NA NA NA NA NA NA
#11 2210 178 2004 200 45093 NA NA 11668685 NA
#12 2211 178 2005 250 47202 610500 0.1605 12813908 30
data
df1 <- structure(list(id = c(604L, 605L, 606L, 769L, 770L, 771L, 844L,
845L, 846L, 2209L, 2210L, 2211L), obs = c(43L, 43L, 43L, 55L,
55L, 55L, 61L, 61L, 61L, 178L, 178L, 178L), year = c(2003L, 2004L,
2005L, 2003L, 2004L, 2005L, 2003L, 2004L, 2005L, 2003L, 2004L,
2005L), a1 = c(NA, NA, 9000L, NA, NA, 2500L, 1800L, 2200L, 2500L,
NA, 200L, 250L), a2 = c(NA, NA, 6421L, NA, NA, 12449L, 11633L,
14841L, 15559L, NA, 45093L, 47202L), b1 = c(NA, NA, 1748365L,
NA, NA, NA, 157977428L, 228966763L, 345889717L, NA, NA, 610500L
), b2 = c(NA, NA, 0.1616, NA, NA, NA, 0.0089, 0.0012, 0.0081,
NA, NA, 0.1605), c = c(NA, NA, 36872152L, NA, NA, 125992307L,
69901689L, 86853166L, 103029905L, NA, 11668685L, 12813908L)), .Names = c("id",
"obs", "year", "a1", "a2", "b1", "b2", "c"), class = "data.frame", row.names = c(NA,
-12L))
m1 <- structure(c(606L, 844L, 845L, 846L, 2211L, 8000L, 1700L, 8000L,
400L, 600L, 30L, 90L, 61L, 82L, 30L), .Dim = c(5L, 3L), .Dimnames = list(
NULL, c("id", "x", "y")))
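With current dplyr, the same lookup can also be sketched as a left_join plus a conditional mutate. A trimmed-down version (fewer columns than df1, same logic; the column subset here is illustrative):

```r
library(dplyr)

# Trimmed-down stand-ins for df1 and m1 (only the columns the logic needs)
df1 <- data.frame(
  id   = c(604, 606, 771, 846, 2211),
  year = c(2003, 2005, 2005, 2005, 2005),
  a1   = c(NA, 9000, 2500, 2500, 250),
  b1   = c(NA, 1748365, NA, 345889717, 610500)
)
m1 <- data.frame(id = c(606, 846, 2211), y = c(30, 82, 30))

# Join brings in y for matching ids; the mutate keeps it only for
# complete 2005 rows, everything else becomes NA
df1 <- df1 %>%
  left_join(m1, by = "id") %>%
  mutate(value = if_else(year == 2005 & complete.cases(a1, b1),
                         y, NA_real_)) %>%
  select(-y)
```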
If I were in your shoes, I would probably write a for loop and a function to loop through every record, since it seems like several different pieces of logic apply depending on the condition.
Here is my understanding of your 'specification':
work on only on the rows which obeys certain criteria (year equals 2005 in this case) instead of affecting the whole column.
Here is some code. It is a bit long, but the idea is to break the data frame into two parts and then put them back together using melt/dcast:
mytext1 <- "id obs year a1 a2 b1 b2 c
604 43 2003 NA NA NA NA NA
605 43 2004 NA NA NA NA NA
606 43 2005 9000 6421 1748365 0.1616 36872152
769 55 2003 NA NA NA NA NA
770 55 2004 NA NA NA NA NA
771 55 2005 2500 12449 NA NA 125992307
844 61 2003 1800 11633 157977428 0.0089 69901689
845 61 2004 2200 14841 228966763 0.0012 86853166
846 61 2005 2500 15559 345889717 0.0081 103029905
2209 178 2003 NA NA NA NA NA
2210 178 2004 200 45093 NA NA 11668685
2211 178 2005 250 47202 610500 0.1605 12813908"
mytext2 <- "id x y
606 8000 30
844 1700 90
845 8000 61
846 400 82
2211 600 30"
data.1 <- read.table(text=mytext1, header=TRUE)
data.2 <- read.table(text=mytext2, header=TRUE)
require(reshape2)
# all.x=TRUE keeps 2005 rows with no match in data.2 (e.g. id 771)
a <- merge(x=subset(data.1, year==2005), y=data.2, by="id", all.x=TRUE)
b <- subset(data.1, year!=2005)
a.new <- melt(a, id.vars=c('id'))
b.new <- melt(b, id.vars=c('id'))
result.new <- rbind(a.new, b.new)
result <- dcast(result.new, id ~ variable)
Now you have a result like this:
> result
     id obs year   a1    a2        b1     b2         c    x  y
1   604  43 2003   NA    NA        NA     NA        NA   NA NA
2   605  43 2004   NA    NA        NA     NA        NA   NA NA
3   606  43 2005 9000  6421   1748365 0.1616  36872152 8000 30
4   769  55 2003   NA    NA        NA     NA        NA   NA NA
5   770  55 2004   NA    NA        NA     NA        NA   NA NA
6   771  55 2005 2500 12449        NA     NA 125992307   NA NA
7   844  61 2003 1800 11633 157977428 0.0089  69901689   NA NA
8   845  61 2004 2200 14841 228966763 0.0012  86853166   NA NA
9   846  61 2005 2500 15559 345889717 0.0081 103029905  400 82
10 2209 178 2003   NA    NA        NA     NA        NA   NA NA
11 2210 178 2004  200 45093        NA     NA  11668685   NA NA
12 2211 178 2005  250 47202    610500 0.1605  12813908  600 30
You still need to rename the columns, either at the end or before putting the parts back together. :)