'Join columns must be present in data' error

I am having this weird issue.
The following code works:
Jakarta_Covid <- left_join(DKI_Jakarta, Covid_DF,
by = c("Sub_District" = "Sub_District"))
However, the code chunk below gives me the error 'Join columns must be present in data.
x Problem with Sub_District.'
Jakarta_Death <- Covid_DF %>%
inner_join(DKI_Jakarta, by=c("Sub_District"="Sub_District")) %>%
group_by(Sub_District, Month) %>%
summarise(`Covid Death Per 10,000 Population` = (((sum(Death))/(Total_Population))*10000))
Jakarta_Death <- Jakarta_Death %>% left_join(DKI_Jakarta,
by=c("Sub_District"="Sub_District"))
How can I calculate the 'Covid Death Per 10,000 Population' from the two data frames? I also need the Geometry column in DKI_Jakarta so that I can plot the result on a map later on.

left_join(DKI_Jakarta, Covid_DF, by = "Sub_District")
If the join column has the same name in both tables, you only need to list it once in by=.
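For the per-10,000 calculation asked about in the question, one possible approach is to aggregate the deaths first and then join the result onto the district table, so the district columns (including Geometry, if DKI_Jakarta is a spatial table) stay attached. This is a minimal sketch, not a diagnosis of the original error: the two small tibbles below are hypothetical stand-ins for DKI_Jakarta and Covid_DF, and only the column names come from the question.

```r
library(dplyr)

# Hypothetical stand-ins for the two tables in the question
DKI_Jakarta <- tibble(
  Sub_District = c("A", "B"),
  Total_Population = c(20000, 50000)
)
Covid_DF <- tibble(
  Sub_District = c("A", "A", "B", "B"),
  Month = c("Jan", "Jan", "Jan", "Feb"),
  Death = c(1, 3, 5, 10)
)

# Step 1: total deaths per sub-district and month
deaths <- Covid_DF %>%
  group_by(Sub_District, Month) %>%
  summarise(Death = sum(Death), .groups = "drop")

# Step 2: join onto the district table and compute the rate
Jakarta_Death <- DKI_Jakarta %>%
  left_join(deaths, by = "Sub_District") %>%
  mutate(`Covid Death Per 10,000 Population` = Death / Total_Population * 10000)
```

Keeping the district table on the left side of the join means every district row is kept even when it has no matching deaths.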

Related

Lead and lag issue using dplyr

I have a data frame with 365 rows, one for each day of the calendar year. I am trying to shift the county name columns up by one row. The data frame doesn't contain any missing values.
I tried using the following code to shift it, but the resulting table has values that are all NA.
covid_shift <- covid_pivot %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
Does anyone know what might be the issue?

covid_pivot is grouped by date, and each of these groups has only one row, so lag() and lead() have no neighbouring row within the group and return NA.
Try:
covid_shift <- covid_pivot %>%
ungroup() %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
You might also consider using across()
covid_pivot %>%
ungroup() %>%
mutate(across(-date, ~lag(.x)))
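The effect of the grouping can be reproduced on a small example. This is a sketch on made-up data: the three-row covid_pivot below is a hypothetical stand-in that is grouped by date, as the answer assumes the real one is.

```r
library(dplyr)

# Hypothetical stand-in: one row per date, grouped by date
covid_pivot <- tibble(
  date = as.Date("2020-01-01") + 0:2,
  Maricopa = c(1, 2, 3)
) %>%
  group_by(date)

# lag() runs inside each one-row group, so there is never a previous row
grouped <- covid_pivot %>%
  mutate(Maricopa = lag(Maricopa))

# After ungroup(), lag() sees the whole column
ungrouped <- covid_pivot %>%
  ungroup() %>%
  mutate(Maricopa = lag(Maricopa))
```

Calling is_grouped_df(covid_pivot) is a quick way to check whether a data frame still carries a grouping.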

How to combine multiple rows into 1 row in R?

I am having trouble combining multiple rows into 1 row.
I want one row of symptoms for each VAERS_ID. However, because the number of rows per VAERS_ID varies, I am having trouble.
I have tried this:
test= data %>%
select(VAERS_ID, SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5) %>%
group_by(VAERS_ID) %>%
mutate(Grp = paste0(SYMPTOM1,SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5, collapse = ",")) %>%
distinct(VAERS_ID, Grp, .keep_all = TRUE)
This gives me the original data, plus another column labeled Grp containing all of the symptoms for each VAERS_ID pasted together, with a comma between each set.
Any help would be appreciated.

Your approach seems right, but since the data cannot be copied and tested, I am not able to reproduce your error. Here are some changes you can try.
You want all symptoms in one place for each VAERS_ID, which is a common real-world use case. If you don't need the original data in the output, simply use this:
data %>%
group_by(VAERS_ID) %>%
summarise(Symptoms = paste0(SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5, collapse = ","))
With mutate you get the original data back, since it only adds a new column.
To address the message about ungrouping, add %>% ungroup() at the end, or add .groups = "drop" inside summarise().
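If the goal is one comma-separated string per VAERS_ID with empty slots skipped, a variation on the same idea is to stack the symptom columns with c() before pasting. This is a minimal sketch on a hypothetical three-row table with only two symptom columns; the real data has more columns and would list them all inside c().

```r
library(dplyr)

# Hypothetical stand-in for the symptom table
data <- tibble(
  VAERS_ID = c(1, 1, 2),
  SYMPTOM1 = c("Fever", "Headache", "Rash"),
  SYMPTOM2 = c("Cough", NA, NA)
)

symptoms <- data %>%
  group_by(VAERS_ID) %>%
  summarise(
    # c() stacks every symptom value in the group; na.omit() drops the NAs
    Symptoms = paste(na.omit(c(SYMPTOM1, SYMPTOM2)), collapse = ", "),
    .groups = "drop"
  )
```

Because c() concatenates column by column, the symptoms appear in column order within each ID rather than row order.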

I need to use a loop in R but don't know where to start

I have a calculation that I have to perform for 23 people (each person has a varying number of rows, so it is difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into 5 time categories (20% each) so that I can look at their reaction times in more detail.
I could do this by hand, but it would take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process, even just a little. I have tried to understand the examples, but I'm afraid I don't have the skill. By hand I would do it as shown below, filtering by each subject in turn.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished, and used that to create a variable (testDuration) from which to derive the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==1) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin = testDuration/5
)
Subj2 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==2) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin = testDuration/5
)

I'm not positive that I understand your code, but this function can be called with any Subject value and returns the corresponding output:
myfunction <- function(subjectNumber){
Subj <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==subjectNumber) %>%
summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
mutate(testDuration = testEnd -testStart) %>%
mutate(timeBin = testDuration/5)
return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output look like.

I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
group_by(Subject) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime),
testDuration = testEnd - testStart,
timeBin = testDuration/5,
.groups = "drop"
)
There is no need for separate mutate calls in your code, by the way. You can keep doing column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you define, there is no real need for a select statement beforehand, either.
Update:
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
pivot_longer(-Subject)
The code above takes all columns except Subject and pivots them into two columns: name, containing the former column name, and value, containing the former value.
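To go one step further and label each trial with the fifth of the test it falls into, the same group_by idea can be combined with mutate, which keeps one row per trial. This is only a sketch under assumptions: the small rtTrialsYA_s1 below is made-up data, the column bin is a hypothetical name, and assigning trials by their onset time is a guess at what is wanted.

```r
library(dplyr)

# Hypothetical stand-in for the trial data (times in ms)
rtTrialsYA_s1 <- tibble(
  Subject = c(1, 1, 1, 2, 2),
  RetRating.OnsetTime = c(0, 100, 400, 0, 50),
  RetRating.RT = c(50, 60, 100, 20, 50),
  RetRating.RTTime = c(50, 160, 500, 20, 100)
)

binned <- rtTrialsYA_s1 %>%
  group_by(Subject) %>%
  mutate(
    testStart = min(RetRating.OnsetTime),
    testDuration = max(RetRating.RTTime) - testStart,
    timeBin = testDuration / 5,
    # Bin 1 covers the first 20% of the subject's test, bin 5 the last 20%;
    # pmin() keeps a trial starting exactly at the end inside bin 5
    bin = pmin(floor((RetRating.OnsetTime - testStart) / timeBin) + 1, 5)
  ) %>%
  ungroup()
```

From here, group_by(Subject, bin) followed by summarise(mean(RetRating.RT)) would give a per-bin reaction time for each subject.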

How to group twice?

I'd like to know how to group twice in a data set. I must answer the following question: "For each state, which municipalities have the lowest and the highest infections and death rates?". This question is part of a homework (https://github.com/umbertomig/intro-prob-stat-FGV/blob/master/assignments/hw6.Rmd) and I don't know how to do it. I've tried to use top_n, but I am not sure if this is the best way.
I want to generate a data set in which, for each state, there are four municipalities (the two with the highest rates of infection and death from coronavirus and the two with the lowest). This is what I have done so far:
library(tidyverse)
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
brazilcorona_hl_rates <- select(brazilcorona, (estado:emAcompanhamentoNovos)) %>%
filter(data >= "2020-05-15") %>%
subset(!(coduf == 76)) %>%
mutate(av_inf = (casosAcumulado/populacaoTCU2019)*100000,
av_dth = (obitosAcumulado/populacaoTCU2019)*100000)
brazilcorona_hilow_rates <- brazilcorona_hl_rates %>%
group_by(estado) %>%
summarize(top_dth = top_n(1, av_dth))

In my example, I find, for each state, the two cities with the highest and the two with the lowest values of the column "obitosAcumulado". To solve your problem, simply replace it with the column you want to rank by.
rm(list=ls())
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
#
#remove NA's from municipios
brazilcorona<-brazilcorona[!is.na(brazilcorona$municipio),]
#here I am gonna use the column "obitosAcumulado" but you should use the one you want
brazilcorona$obitosAcumulado<-as.numeric(brazilcorona$obitosAcumulado)
states<-as.list(unique(brazilcorona$estado))
result<-lapply(states,FUN=function(x){
df<-brazilcorona[brazilcorona$estado==x,]
df<-df[order(df$obitosAcumulado,decreasing = T),]
return(c(paste(x),as.character(df[1:2,"municipio"]),
as.character(df[(nrow(df)-1):nrow(df),"municipio"])))})
I hope it helps you...
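The same "highest and lowest within each state" selection can also be written with dplyr, assuming dplyr 1.0.0 or later for slice_max() and slice_min(). This is a minimal sketch on a made-up table: rates and the column extreme are hypothetical names, and only estado, municipio and av_dth mirror the question.

```r
library(dplyr)

# Hypothetical stand-in: one death rate per municipality
rates <- tibble(
  estado = c("SP", "SP", "SP", "RJ", "RJ"),
  municipio = c("a", "b", "c", "d", "e"),
  av_dth = c(1, 5, 3, 2, 8)
)

# Highest rate within each state
highest <- rates %>%
  group_by(estado) %>%
  slice_max(av_dth, n = 1, with_ties = FALSE) %>%
  mutate(extreme = "highest")

# Lowest rate within each state
lowest <- rates %>%
  group_by(estado) %>%
  slice_min(av_dth, n = 1, with_ties = FALSE) %>%
  mutate(extreme = "lowest")

hilow <- bind_rows(highest, lowest) %>%
  ungroup() %>%
  arrange(estado, extreme)
```

Repeating the two slices with the infection-rate column gives the other half of the four municipalities per state.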

Why numbers not mapped to each row?

So I am trying to find the number of occurrences of each name in another dataset. The code I am trying to run is:
Data$Count <- grep(Data$Name,OtherDataSet$LeadName) %>% length()
The issue is when I run this, the number for the first name gets mapped to each spot in that column. Why is this happening?

library(tidyverse)
# tibble() replaces the deprecated data_frame()
Data <- tibble(Name=c("Dog","Cat","Bird"))
OtherDataSet <- tibble(LeadName=c("Frog","Cat","Catfish","BirdOfPrey","Bird","Bird"))
Data <- Data %>% mutate(Count=map(.x = Name,~str_detect(.,pattern = OtherDataSet$LeadName)) %>% map_int(~sum(.)))
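As for why the original line fills the column with one number: grep() uses only the first element when it is given a vector of patterns (it warns about this), and length() then returns a single value that is recycled into every row. A base R sketch of a per-name count, using plain vectors as stand-ins for the two columns, might look like this. It counts every LeadName entry that contains the name, so "Catfish" counts toward "Cat"; comparing with == instead would count exact matches only.

```r
# Plain vectors standing in for Data$Name and OtherDataSet$LeadName
Name <- c("Dog", "Cat", "Bird")
LeadName <- c("Frog", "Cat", "Catfish", "BirdOfPrey", "Bird", "Bird")

# One grepl() call per name, so every name gets its own count
counts <- vapply(
  Name,
  function(nm) sum(grepl(nm, LeadName, fixed = TRUE)),
  integer(1)
)
```

The result is a named integer vector, one element per name, which can be assigned to a column of the same length.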
