I am trying to make a graph showing the average temperature in Australia from 1950 to 2000. My dataset contains a "Country" column which includes Australia but also other countries as well. The dataset also includes the year and average temperature for every country. How would I go about excluding all the other data to make a graph just for Australia?
(Screenshot: example of the dataset.)
You just need to subset your data so that it only contains observations about Australia. I can't see the details of your dataset from your picture, but let's assume that your dataset is called d and that the column of d indicating which country an observation belongs to is called country. Then you could do the following in base R:
d_aus <- d[d$country == "Australia", ]
Or using dplyr you could do:
library(dplyr)
d_aus <- d %>%
  filter(country == "Australia")
Then d_aus would be the dataset containing only the observations about Australia (in which `d$country == "Australia"`), which you could use to make your graph.
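To go from the subset to the actual graph, here is a minimal ggplot2 sketch. The column names `year` and `avg_temp` are assumptions (the screenshot isn't visible), so swap in whatever your dataset actually uses:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the real dataset; your actual column names may differ
d <- data.frame(
  country  = rep(c("Australia", "Japan"), each = 3),
  year     = rep(c(1950, 1975, 2000), times = 2),
  avg_temp = c(21.1, 21.4, 21.9, 14.2, 14.5, 15.0)
)

# Subset to Australia and restrict to the 1950-2000 window
d_aus <- d %>%
  filter(country == "Australia", year >= 1950, year <= 2000)

# Line plot of average temperature over time
p <- ggplot(d_aus, aes(x = year, y = avg_temp)) +
  geom_line() +
  labs(x = "Year", y = "Average temperature",
       title = "Average temperature in Australia, 1950-2000")
```

`p` can then be printed or saved with `ggsave()`.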
This should do the job. Just change the column names to match those of your dataset.
library("ggplot2")
library("dplyr")
data %>%
  filter(Country == "Australia", Year %in% 1950:2000) %>%
  ggplot(aes(x = Year, y = Temp)) +
  geom_point()
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df$countryName = as.character(df$countryName)
I processed the dataset:
df$countryName[df$countryName == "United States"] <- "United States of America"
Here I changed "United States" to "United States of America" so that it matches the name used in the population data. Then I calculated deaths per population:
df8$death_pop <- df8$death / df8$PopTotal
How can I find the 10 countries with the highest death/population ratio?
Using base R:
df8[order(df8$death_pop, decreasing = TRUE)[1:10],]
This orders your data.frame by death_pop and extracts the first 10 rows.
Using the package dplyr there is the function top_n(), which gives you the desired result. I added arrange(desc(death_pop)) to give you a sorted output; remove that part if you don't need it.
df8 %>% top_n(10, death_pop) %>% arrange(desc(death_pop))
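In more recent dplyr (1.0.0 and later), `slice_max()` does the same thing and returns the rows already sorted in descending order. A sketch on toy data standing in for df8:

```r
library(dplyr)

# Toy stand-in for df8 with a death_pop column
df8 <- data.frame(country   = paste0("country_", 1:15),
                  death_pop = (15:1) / 100)

# Ten rows with the largest death_pop, sorted descending
top10 <- df8 %>% slice_max(death_pop, n = 10)
```

Note that `top_n()` has been superseded by `slice_max()` in current dplyr, so the latter is the safer choice going forward.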
I'd like to know how to group twice in a data set. I must answer the following question: "For each state, which municipalities have the lowest and the highest infection and death rates?". This question is part of a homework assignment (https://github.com/umbertomig/intro-prob-stat-FGV/blob/master/assignments/hw6.Rmd) and I don't know how to do it. I've tried to use top_n, but I am not sure if this is the best way.
I wanted to generate a data set in which, for each state, there were four municipalities (two with the highest rates of infection and death from coronavirus and two with the smallest). This is what I have done so far:
library(tidyverse)
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
brazilcorona_hl_rates <- select(brazilcorona, estado:emAcompanhamentoNovos) %>%
  filter(data >= "2020-05-15") %>%
  subset(!(coduf == 76)) %>%
  mutate(av_inf = (casosAcumulado / populacaoTCU2019) * 100000,
         av_dth = (obitosAcumulado / populacaoTCU2019) * 100000)
brazilcorona_hilow_rates <- brazilcorona_hl_rates %>%
  group_by(estado) %>%
  summarize(top_dth = top_n(1, av_dth))
In my example, I find, for each state, the two municipalities with the maximum and the two with the minimum values of the column "obitosAcumulado"; to solve your problem, simply change it to the column containing the information you want to extract.
rm(list=ls())
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
# remove NAs from municipio
brazilcorona <- brazilcorona[!is.na(brazilcorona$municipio), ]
# here I'm going to use the column "obitosAcumulado", but you should use the one you want
brazilcorona$obitosAcumulado <- as.numeric(brazilcorona$obitosAcumulado)
states <- as.list(unique(brazilcorona$estado))
result <- lapply(states, FUN = function(x) {
  df <- brazilcorona[brazilcorona$estado == x, ]
  df <- df[order(df$obitosAcumulado, decreasing = TRUE), ]
  return(c(paste(x),
           as.character(df[1:2, "municipio"]),
           as.character(df[(nrow(df) - 1):nrow(df), "municipio"])))
})
I hope it helps you...
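The same two-highest/two-lowest extraction per state can also be written with grouped `slice_max()`/`slice_min()` from dplyr (1.0.0+), which answers the "group twice" question more directly. A sketch on toy data reusing the column names from the brazilcorona file:

```r
library(dplyr)

# Toy data with the same columns as the real file
toy <- data.frame(
  estado          = rep(c("SP", "RJ"), each = 4),
  municipio       = paste0("m", 1:8),
  obitosAcumulado = c(10, 40, 5, 20, 7, 1, 9, 3)
)

# Two municipalities with the highest and two with the lowest value per state
highest <- toy %>% group_by(estado) %>% slice_max(obitosAcumulado, n = 2)
lowest  <- toy %>% group_by(estado) %>% slice_min(obitosAcumulado, n = 2)
result  <- bind_rows(highest, lowest) %>% arrange(estado, desc(obitosAcumulado))
```

Substitute `av_inf` or `av_dth` for `obitosAcumulado` to rank by the computed rates instead.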
I'm a noob in R programming.
I have 2010 census data in this link: census data.
This is my dataframe: dataframe.
What I'd like to do is add the population column 'P001001' from the census data for each state into the dataframe. I'm not able to figure out how to map the state abbreviations in the dataframe to the full names in the census data, and add the respective population to each row for that state in the data frame. The data is for all of the states. What should be the simplest way to do this?
Thanks in advance.
Use the built-in datasets for US states, state.abb and state.name; see State name to abbreviation in R.
Here's a simple bit of code which gives you a tidyverse approach to the problem:
1) add the state abbreviation to the census table
2) left join the census with the df by state abbreviation
library(tibble)
library(dplyr)

census <- tibble(name = c("Colorado", "Alaska"),
                 p001001 = c(100000, 200000))

census <- census %>%
  mutate(state_abb = state.abb[match(name, state.name)])

df <- tibble(date = c("2011-01-01", "2011-02-01"),
             state = rep("CO", 2),
             avg = c(123, 1234))

df <- df %>%
  left_join(census, by = c("state" = "state_abb"))
I have the following data set:
df = data.frame("Country" = rep(sample(c("USA", "Germany", "Japan", "Slovakia", "Togo")),2))
df$Value = sample(c(1:1000), 10)
Now I want to randomly sample, say, 3 countries from that df. That means I want all 6 rows pertaining to those 3 countries: every time a country is sampled, I should always get all (here, two) rows that pertain to it.
How could I do it? The following code doesn't always work, sometimes returning only 2 countries:
df %>% filter(Country %in% sample(Country, 3))
Thanks!
We can wrap 'Country' in unique() to remove the duplicates and use that in sample() to make sure there are always 3 sampled countries:
library(dplyr)
df %>%
filter(Country %in% sample(unique(Country), 3))
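A quick sanity check that the fix always yields 3 countries and all of their rows (the seed is only for reproducibility of this sketch):

```r
library(dplyr)

set.seed(1)
# Same construction as in the question: each country appears exactly twice
df <- data.frame(Country = rep(sample(c("USA", "Germany", "Japan",
                                        "Slovakia", "Togo")), 2))
df$Value <- sample(1:1000, 10)

# Sample 3 distinct countries, keeping every row for each sampled country
sampled <- df %>%
  filter(Country %in% sample(unique(Country), 3))
```

Because sample() now draws from the five distinct names rather than the ten rows, it can never return the same country twice.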
I have some data (download link: http://spreadsheets.google.com/pub?key=0AkBd6lyS3EmpdFp2OENYMUVKWnY1dkJLRXAtYnI3UVE&output=xls) that I'm trying to filter. I had reconfigured the data so that instead of one row per country, and one column per year, each row of the data frame is a country-year combination (i.e. Afghanistan, 1960, NA).
Now that I've done that, I want to create a subset of the initial data that excludes any country that has 10+ years of missing contraceptive use data.
I had thought to create a list of the unique country names in a second data frame, and then add a variable to that frame that holds the # of rows for each country that have an NA for contraceptive use (i.e. for Afghanistan it would have 46). My first thought (being most fluent in VB.net) was to use a for loop to iterate through the countries, get the NA count for that country, and then update the second data frame with that value.
In that vein I tried the following:
for (x in cl) {
  x$rc = nrow(subset(BCU, BCU$Country == x$Country))
}
After that failed, a little more Googling brought me to a question on here (forgot to grab the link) that suggested using by(). Based on that I tried:
by(cl, 1:nrow(cl), cl$rc <- nrow(subset(BCU, BCU$Country == cl$Country
& BCU$Contraceptive_Use == "NA")))
(cl is the second data frame listing the country names, and BCU is the initial contraceptive use data frame)
I'm fairly new to R (the problem I'm working is for an R course on Udacity), so I'll freely admit this may not be the best approach, but I'm still curious how to do this sort of aggregation.
They all seem to have >= 10 years of missing data (unless I miscalculated somewhere):
library(tidyr)
library(dplyr)
dat <- read.csv("contraceptive use.csv", stringsAsFactors=FALSE, check.names=FALSE)
dat <- rename(gather(dat, year, value, -1),
country=`Contraceptive prevalence (% of women ages 15-49)`)
dat %>%
group_by(country) %>%
summarise(missing_count=sum(is.na(value))) %>%
arrange(desc(missing_count)) -> missing
sum(missing$missing_count >= 10)
## [1] 213
length(unique(dat$country))
## [1] 213
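For completeness, the per-country NA count can also be done in base R with tapply(), which is closer to the loop the question attempted. A sketch on toy data with the same long shape (country / year / value); the threshold is lowered to 2 purely for the illustration, where the real analysis uses 10:

```r
# Toy long-format data: one row per country-year combination
dat <- data.frame(
  country = rep(c("Afghanistan", "Norway"), each = 3),
  year    = rep(c(1960, 1961, 1962), times = 2),
  value   = c(NA, NA, 5, 1, NA, 2)
)

# Count missing values per country (named vector)
missing_count <- tapply(dat$value, dat$country, function(v) sum(is.na(v)))

# Keep only countries below the threshold (2 here, 10 in the real data)
keep <- names(missing_count)[missing_count < 2]
dat_subset <- dat[dat$country %in% keep, ]
```

This avoids the second data frame entirely: the named vector from tapply() serves as the lookup, and %in% does the subsetting.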