Pivot wider to new column based on condition in R

I have a dataset associating a single application number with a series of different applicants from different countries. One column holds each applicant's country of origin. I want to condense everything down to 2 columns:
column 1 = count of applicants within the USA
column 2 = count of applicants outside the USA
I guessed I would need to use ifelse, but I haven't managed to get anything to work so far. Can someone please help?
Thanks!!
P.S. If anyone knows how I could do this and also produce a list of the countries outside the USA, as Sotos did in "Pivot wider returning 1 column?", that would be even better, but that's just a bonus :)

Like so?
# toy data: country "A" stands in for the USA
df <- data.frame(app_num = c(1, 1, 1, 2, 2),
                 country = LETTERS[c(1:4, 1)])
library(tidyverse)
df %>%
  count(A = if_else(country == "A", "USA", "Other")) %>%
  pivot_wider(names_from = A, values_from = n)
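If the counts are wanted per application number, along with the list of non-USA countries asked for in the P.S., something like the following might work. This is only a sketch on the toy data above: "A" again stands in for the USA, and the column names usa, other and other_countries are made up for illustration.

```r
library(dplyr)

# Toy data from the answer above; "A" stands in for the USA.
df <- data.frame(app_num = c(1, 1, 1, 2, 2),
                 country = c("A", "B", "C", "D", "A"))

per_app <- df %>%
  group_by(app_num) %>%
  summarise(
    usa = sum(country == "A"),      # applicants within the USA
    other = sum(country != "A"),    # applicants outside the USA
    # comma-separated list of the distinct non-USA countries
    other_countries = toString(unique(country[country != "A"])),
    .groups = "drop"
  )

print(per_app)
```

Summing a logical vector counts its TRUE values, so no ifelse or pivot is needed for this variant.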

Related

Lead and lag issue using dplyr

I have a data frame with 365 rows, one for each day of the calendar year. I am trying to shift the county name columns up by one row. The data frame doesn't contain any missing values.
I tried using the following code to shift it, but the resulting table has values that are all NA.
covid_shift <- covid_pivot %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
Does anyone know what might be the issue?
covid_pivot is grouped by date, and each of these groups has a single row, so lead() and lag() have no neighboring row to take a value from within the group and return NA.
Try:
covid_shift <- covid_pivot %>%
  ungroup() %>%
  mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
You might also consider using across():
covid_pivot %>%
  ungroup() %>%
  mutate(across(-date, ~lag(.x)))
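The effect of the grouping can be reproduced on a small made-up table. The four rows and their values below are hypothetical stand-ins for covid_pivot, not the real data. The sketch also shows lead(), since shifting a column up by one row, as the question describes, corresponds to lead() rather than lag().

```r
library(dplyr)

# Hypothetical stand-in for covid_pivot: one row per date, grouped by date.
covid_pivot <- data.frame(
  date = as.Date("2020-01-01") + 0:3,
  Maricopa = c(10, 20, 30, 40),
  Cook = c(1, 2, 3, 4)
) %>%
  group_by(date)

# Each group has one row, so lag() finds no previous row in any group.
grouped_lag <- covid_pivot %>%
  mutate(Maricopa = lag(Maricopa))

# After ungroup(), lag() shifts values down and lead() shifts them up.
shifted_down <- covid_pivot %>%
  ungroup() %>%
  mutate(across(-date, lag))

shifted_up <- covid_pivot %>%
  ungroup() %>%
  mutate(across(-date, lead))
```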

Using data table or dplyr instead of loops

Here is my code, which uses a for loop:
for (i in 1:length(mc_1$code)) {
  cmc1 = mc_1$code[i]
  cmc2 = mc_1[mc_1$code == cmc1, ]
  cmc3 = cmc2[order(cmc2[, 2], cmc2[, 3]), ]
  mc_1[mc_1$code == cmc1, ]$region = last(cmc3$region)
}
Each value of the variable "code" has a different number of rows in mc_1. mc_1 also has year and month columns (columns 2 and 3) and another column, say, region. "region" can differ for the same "code" across months and years.
For each "code", I want to select only the most recent region by year and month (that's why I use "order") and assign that region to all the rows for that code.
The for loop works, but for efficiency and code length, how can I rewrite it using something like data.table or dplyr?
You can try this using the dplyr package and the fact that n() returns the number of rows in each group, so region[n()] is the last region of the group once the rows are sorted by year and month:
mc_1 %>%
  group_by(code) %>%
  arrange(year, month) %>%
  mutate(region = region[n()])
Hope it helps!!
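As a quick check of the idea, here is a sketch on invented data; the codes, dates and region labels are hypothetical. This variant orders inside mutate() instead of calling arrange(), so the original row order of mc_1 is left untouched.

```r
library(dplyr)

# Hypothetical sample: two codes with regions that change over time.
mc_1 <- data.frame(
  code   = c("a", "a", "a", "b", "b"),
  year   = c(2020, 2019, 2020, 2018, 2018),
  month  = c(1, 12, 6, 3, 1),
  region = c("N", "S", "E", "W", "X")
)

out <- mc_1 %>%
  group_by(code) %>%
  # sort the group's regions by year and month, then take the last one
  mutate(region = region[order(year, month)][n()]) %>%
  ungroup()

print(out)
```

For code "a" the latest row is June 2020, so every "a" row gets region "E"; for code "b" the latest row is March 2018, so every "b" row gets "W".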

How to group twice?

I'd like to know how to group twice in a data set. I must answer the following question: "For each state, which municipalities have the lowest and the highest infections and death rates?". This question is part of a homework (https://github.com/umbertomig/intro-prob-stat-FGV/blob/master/assignments/hw6.Rmd) and I don't know how to do it. I've tried to use top_n, but I am not sure if this is the best way.
I wanted to generate a data set in which, for each state, there were four municipalities (two with the highest rates of infection and death from coronavirus and two with the lowest). This is what I have done so far:
library(tidyverse)
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
brazilcorona_hl_rates <- select(brazilcorona, (estado:emAcompanhamentoNovos)) %>%
  filter(data >= "2020-05-15") %>%
  subset(!(coduf == 76)) %>%
  mutate(av_inf = (casosAcumulado/populacaoTCU2019)*100000,
         av_dth = (obitosAcumulado/populacaoTCU2019)*100000)
brazilcorona_hilow_rates <- brazilcorona_hl_rates %>%
  group_by(estado) %>%
  summarize(top_dth = top_n(1, av_dth))
In my example, I find, for each state, the two cities with the highest and the two with the lowest values of the column "obitosAcumulado". To solve your problem, simply replace it with the column that holds the rate you are interested in.
rm(list = ls())
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
# remove rows where municipio is NA
brazilcorona <- brazilcorona[!is.na(brazilcorona$municipio), ]
# here I use the column "obitosAcumulado", but you should use the one you want
brazilcorona$obitosAcumulado <- as.numeric(brazilcorona$obitosAcumulado)
states <- as.list(unique(brazilcorona$estado))
result <- lapply(states, FUN = function(x) {
  df <- brazilcorona[brazilcorona$estado == x, ]
  df <- df[order(df$obitosAcumulado, decreasing = TRUE), ]
  # state name, the two municipalities with the highest values, then the two with the lowest
  return(c(paste(x), as.character(df[1:2, "municipio"]),
           as.character(df[(nrow(df) - 1):nrow(df), "municipio"])))
})
I hope it helps you...
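A grouped dplyr version of the same idea could use slice_max() and slice_min(), which are available from dplyr 1.0.0. The sketch below runs on a small invented table rather than the real brazilcorona data: the column names estado, municipio and av_dth follow the question, but the values and the extreme label column are assumptions for illustration.

```r
library(dplyr)

# Hypothetical rates: one row per municipality, two states.
rates <- data.frame(
  estado    = rep(c("SP", "RJ"), each = 5),
  municipio = paste0("m", 1:10),
  av_dth    = c(5, 1, 9, 3, 7, 2, 8, 4, 6, 10)
)

# Two municipalities with the highest rate in each state.
highest <- rates %>%
  group_by(estado) %>%
  slice_max(av_dth, n = 2, with_ties = FALSE) %>%
  mutate(extreme = "highest")

# Two municipalities with the lowest rate in each state.
lowest <- rates %>%
  group_by(estado) %>%
  slice_min(av_dth, n = 2, with_ties = FALSE) %>%
  mutate(extreme = "lowest")

hilow <- bind_rows(highest, lowest) %>%
  ungroup() %>%
  arrange(estado, desc(av_dth))

print(hilow)
```

On the real data there are several dates per municipality, so it would probably be necessary to keep a single date (for example the latest one) per municipality before slicing.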

Dplyr solution for difference in row values based on two factor levels in separate columns

I am trying to use dplyr to calculate the difference between two row values based on factor levels in a large data frame. In practical terms, I want the vote distance between each pair of groups for each party within each country. For the data below, I would like to end up with a data frame whose rows give the difference between the vote values for each group pair, for each party level within each country level. The lag function does not seem to work with my data because the number of factor levels varies by country, meaning each country has a different total number of groups and parties. A small sample of the setup is below.
df1 <- data.frame(id = c(1:12),
country = c("a","a","a","a","a","a","b","b","b","b","b","b"),
group = c("x","y","z","x","y","z","x","y","z","x","y","z"),
party = c("d","d","d","e","e","e","d","d","d","e","e","e"),
vote = c(.15,.02,.7, .5, .6, .22,.47,.33,.09,.83,.77,.66))
This is how I would like the end product to look.
df2 <- data.frame(id= c(1:12),
country = c("a","a","a","b","b","b","a","a","a","b","b","b"),
group1 = c("x","x","y","x","x","y","x","x","y","x","x","y"),
group2 = c("y","z","z","y","z","z","y","z","z","y","z","z"),
party = c("d","d","d","d","d","d","e","e","e","e","e","e"),
dist = c(.13,-.55,-.68,.14,.38,.24,-.1,.28,.38,.06,.17,.11))
I have tried dcast previously, and if I fill with the column I want, it doesn't line up and produces NA or 0 where there should be values. The lag function doesn't work in my case because the number of parties and groups differs by country and is not fixed. Whenever I have tried different intervals for the lag, the values are in some instances compared across countries or across parties rather than across groups.
I have found solutions outside of dplyr, but for parsimony in presenting code I am wondering if there is a way in dplyr. Also, the code I have is incredibly long and clunky, and uses six or seven packages just for this problem.
Thanks
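We can use combn to compute the pairwise differences. Note that inside mutate() the result lines up only when each country/party group has exactly three rows (three rows give three pairs); with four or more rows per group, mutate() raises a size error.
library(dplyr)
df1 %>%
  group_by(country, party) %>%
  mutate(dist = combn(vote, 2, FUN = function(x) x[1] - x[2]))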
Another way is to use a self-join:
library(tidyverse)
df1 %>%
  left_join(df1 %>% select(-id), by = c("country", "party"), suffix = c("1", "2")) %>%
  # keep each pair once (x-y, but not also y-x or x-x)
  filter(group1 < group2) %>%
  mutate(dist = vote1 - vote2)
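For groups of varying size, the combn idea can be combined with reframe(), which is allowed to return any number of rows per group. This is a sketch that assumes dplyr 1.1.0 or later (where reframe() was introduced) and that every country/party combination has at least two groups; the result name pairwise is made up for illustration.

```r
library(dplyr)

df1 <- data.frame(
  id = 1:12,
  country = c("a","a","a","a","a","a","b","b","b","b","b","b"),
  group = c("x","y","z","x","y","z","x","y","z","x","y","z"),
  party = c("d","d","d","e","e","e","d","d","d","e","e","e"),
  vote = c(.15,.02,.7, .5, .6, .22,.47,.33,.09,.83,.77,.66)
)

pairwise <- df1 %>%
  group_by(country, party) %>%
  reframe(
    # combn() returns a 2-row matrix: one column per pair of groups
    group1 = combn(as.character(group), 2)[1, ],
    group2 = combn(as.character(group), 2)[2, ],
    # difference in vote for the same pairs, in the same order
    dist = combn(vote, 2, FUN = function(v) v[1] - v[2])
  )

print(pairwise)
```

Because the group labels and the differences come from the same combn() ordering, each dist value sits next to the pair it belongs to, whatever the number of groups.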

Data frame subset according to matching values in R

I have a data.frame with information on racing performance on horses. I have a variable Competition.year that has a "Total" row and then a row for each year the horse competed. I also have a variable Competition.age that describes the age the horses were in each specific year they competed.
I am trying to create a subsetted df based on each horse's best racing time and the age it was when it achieved that time. In the "Total" row, the racing time included is the horse's best one. So I need to tell R that, when the race time in the "Total" row equals the time in one of the yearly rows, the age from that yearly row should be included in the new data frame. I am super new to R, so I have no idea where to even begin. I've tried some things I've seen in other questions, but I can't get it right. Any help would be much appreciated!
My df looks like this:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
The desired df should have 223 rows (since that is the total number of horses I have) with the columns Name, Competition.year=="Total", Time.record.auto.start, Time.record.volt.start and Competition.age.
Firstly, I had to change your sample data to make sure all 5 variables only had 14 observations each. I did this by removing the final NA in the Competition.age variable. I also had to swap around the 94.3 and 98.3 values in the Time.record.volt.start variable so that the values lined up with what was expected in the Total column for the horse with Name equal to Pelson Poika.
Here is the corrected data:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
And here is a simple dplyr solution, which I think does what you want.
library(dplyr)
# best times per horse, taken from the "Total" rows
df1 <- travdata %>%
  filter(Competition.year == "Total") %>%
  select(Name, Time.record.auto.start, Time.record.volt.start)
# yearly rows only
df2 <- travdata %>%
  filter(Competition.year != "Total")
# keep the yearly rows whose times match the horse's best times
df3 <- inner_join(
  df1,
  df2,
  by = c("Name", "Time.record.auto.start", "Time.record.volt.start")
)
The data frame df3 should contain what you were after. Note that a yearly row is kept only when both its auto-start and its volt-start time match the "Total" row.
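If the two start types should be handled independently, one possible variant looks up the age for the best volt-start time only. This is a sketch, assuming each horse has exactly one "Total" row; the helper column best_volt and the result name best_volt_age are invented for illustration.

```r
library(dplyr)

travdata <- data.frame(
  "Name" = c(rep("Muuttuva", 3), rep("Pelson Poika", 7), rep("Muusan Muisto", 4)),
  "Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
  "Time.record.auto.start" = c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
  "Time.record.volt.start" = c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
  "Competition.age" = c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7)
)

best_volt_age <- travdata %>%
  group_by(Name) %>%
  # best volt-start time, read from the horse's "Total" row
  mutate(best_volt = Time.record.volt.start[Competition.year == "Total"]) %>%
  # keep the yearly rows in which that time was achieved
  filter(Competition.year != "Total",
         Time.record.volt.start == best_volt) %>%
  ungroup() %>%
  select(Name, Competition.year, Time.record.volt.start, Competition.age)

print(best_volt_age)
```

filter() drops rows where the comparison is NA, so years without a volt-start time are excluded automatically. The same pattern could be repeated for Time.record.auto.start.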
