Ratio calculation via data set in R
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df$countryName = as.character(df$countryName)
I processed the dataset.
df$countryName[df$countryName == "United States"] <- "United States of America"
I renamed "United States" to "United States of America" so that it matches the name used in the population data.
df8$death_pop <- df8$death / df8$PopTotal
I calculated the deaths per capita (death/pop) for every country.
How can I find the 10 countries with the highest death/pop?
Using base R:
df8[order(df8$death_pop, decreasing = TRUE)[1:10],]
This orders your data.frame by death_pop and extracts the first 10 rows.
Using the dplyr package, the function top_n gives you the desired result. I added arrange(desc()) to sort the output; remove that part if you don't need it.
library(dplyr)
df8 %>% top_n(10, death_pop) %>% arrange(desc(death_pop))
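In dplyr 1.0.0 and later, top_n() is superseded by slice_max() and slice_min(). The sketch below shows the same idea on a small made-up data frame; toy, its values, and the choice of the top 3 are assumptions for illustration, not the real df8.

```r
library(dplyr)

# Hypothetical toy data standing in for df8 (all values are made up)
toy <- data.frame(
  countryName = c("A", "B", "C", "D", "E"),
  death       = c(10, 200, 30, 5, 80),
  PopTotal    = c(1000, 40000, 1500, 5000, 2000),
  stringsAsFactors = FALSE
)
toy$death_pop <- toy$death / toy$PopTotal

# slice_max() keeps the n rows with the largest death_pop,
# returned in descending order of death_pop
top3 <- toy %>% slice_max(death_pop, n = 3)

# Base R equivalent, for comparison
top3_base <- toy[order(toy$death_pop, decreasing = TRUE)[1:3], ]
```

For the question above, replacing toy with df8 and n = 3 with n = 10 should give the ten countries with the highest ratio.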
Related
Making a graph from a dataset
I am trying to make a graph showing the average temperature in Australia from 1950 to 2000. My dataset contains a "Country" column that includes Australia as well as other countries. The dataset also includes years and the average temperature for every country. How would I go about excluding all the other data to make a graph just for Australia? (Image: example of the dataset)
You just need to subset your data so that it contains only the observations about Australia. I can't see the details of your dataset from your picture, but let's assume that your dataset is called d and that the column identifying the country of each observation is called country. Then you could do the following using base R:
d_aus <- d[d$country == "Australia", ]
Or, using dplyr:
library(dplyr)
d_aus <- d %>% filter(country == "Australia")
Then d_aus is the dataset containing only the observations about Australia (those for which d$country == "Australia"), which you can use to make your graph.
This should do the job. Change the column names to match yours.
library("ggplot2")
library("dplyr")
data %>%
  filter(Country == "Australia" & Year %in% 1950:2000) %>%
  ggplot(aes(x = Year, y = Temp)) +
  geom_point()
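The two answers can also be combined without any packages. This is a minimal base R sketch on made-up data; the data frame d, its column names Country, Year and Temp, and the temperature values are all assumptions standing in for the asker's dataset.

```r
# Hypothetical toy data: two countries, years 1945 to 2005, made-up temperatures
years <- 1945:2005
d <- data.frame(
  Country = rep(c("Australia", "Brazil"), each = length(years)),
  Year    = rep(years, 2),
  Temp    = c(21 + 0.01 * (years - 1945), 24 + 0.01 * (years - 1945)),
  stringsAsFactors = FALSE
)

# Keep only Australian rows within the requested year range
d_aus <- d[d$Country == "Australia" & d$Year >= 1950 & d$Year <= 2000, ]

# Line plot of average temperature over time
plot(d_aus$Year, d_aus$Temp, type = "l",
     xlab = "Year", ylab = "Average temperature")
```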
How to group twice?
I'd like to know how to group twice in a data set. I must answer the following question: "For each state, which municipalities have the lowest and the highest infections and death rates?". This question is part of a homework (https://github.com/umbertomig/intro-prob-stat-FGV/blob/master/assignments/hw6.Rmd) and I don't know how to do it. I've tried to use top_n, but I am not sure if this is the best way. I wanted to generate a data set in which, for each state, there are four municipalities (the two with the highest rates of infection and death from coronavirus and the two with the lowest). This is what I have done so far:
library(tidyverse)
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
brazilcorona_hl_rates <- select(brazilcorona, (estado:emAcompanhamentoNovos)) %>%
  filter(data >= "2020-05-15") %>%
  subset(!(coduf == 76)) %>%
  mutate(av_inf = (casosAcumulado/populacaoTCU2019)*100000,
         av_dth = (obitosAcumulado/populacaoTCU2019)*100000)
brazilcorona_hilow_rates <- brazilcorona_hl_rates %>%
  group_by(estado) %>%
  summarize(top_dth = top_n(1, av_dth))
In my example, I find the two cities per state with the maximum and the two with the minimum values of the column "obitosAcumulado". To solve your problem, simply replace it with the column you want to rank by.
rm(list = ls())
brazilcorona <- read.csv("https://raw.githubusercontent.com/umbertomig/intro-prob-stat-FGV/master/datasets/brazilcorona.csv")
# remove NAs from municipio
brazilcorona <- brazilcorona[!is.na(brazilcorona$municipio), ]
# here I use the column "obitosAcumulado", but you should use the one you want
brazilcorona$obitosAcumulado <- as.numeric(brazilcorona$obitosAcumulado)
states <- as.list(unique(brazilcorona$estado))
result <- lapply(states, FUN = function(x) {
  df <- brazilcorona[brazilcorona$estado == x, ]
  df <- df[order(df$obitosAcumulado, decreasing = TRUE), ]
  # state name, the two top municipalities, then the two bottom ones
  return(c(paste(x),
           as.character(df[1:2, "municipio"]),
           as.character(df[(nrow(df) - 1):nrow(df), "municipio"])))
})
I hope it helps.
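A dplyr version of the same idea might look like the sketch below. It assumes dplyr 1.0.0 or later (for slice_max() and slice_min()) and a table that already has one row per municipality; the data frame rates, the municipality names, and the av_dth values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: one row per municipality (names and rates are made up)
rates <- data.frame(
  estado    = c("SP", "SP", "SP", "RJ", "RJ", "RJ"),
  municipio = c("a", "b", "c", "d", "e", "f"),
  av_dth    = c(5, 1, 3, 2, 9, 4),
  stringsAsFactors = FALSE
)

by_state <- rates %>% group_by(estado)

# Within each state, take the municipality with the highest and the lowest rate
highest <- by_state %>%
  slice_max(av_dth, n = 1, with_ties = FALSE) %>%
  mutate(type = "highest")
lowest <- by_state %>%
  slice_min(av_dth, n = 1, with_ties = FALSE) %>%
  mutate(type = "lowest")

# Stack both results and sort by state, then by type
hilow <- bind_rows(highest, lowest) %>%
  ungroup() %>%
  arrange(estado, type)
```

With the real data there are several dates per municipality, so the rows would first need to be reduced to one date (for example the latest) before ranking. Repeating the two slice calls with av_inf would give the infection-rate counterpart.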
Sample grouped rows from a data frame/tibble in R
I have the following data set:
df = data.frame("Country" = rep(sample(c("USA", "Germany", "Japan", "Slovakia", "Togo")), 2))
df$Value = sample(c(1:1000), 10)
Now I want to randomly sample, let's say, 3 countries from that df, which means I want all 6 rows pertaining to those 3 countries. So every time I sample from the variable Country, I should get all (here two) rows that pertain to each sampled country. How can I do this? The following code doesn't always work; it sometimes returns only 2 countries:
df %>% filter(Country %in% sample(Country, 3))
Thanks!
We can wrap 'Country' in unique to remove the duplicates and sample from that, which ensures that there are always 3 distinct sampled countries:
library(dplyr)
df %>% filter(Country %in% sample(unique(Country), 3))
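The reason the original attempt sometimes returns only 2 countries is that each country appears twice in the column, so sample(Country, 3) can draw the same country more than once. The base R sketch below illustrates this; the seed and the number of repetitions are arbitrary choices made only for reproducibility.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

df <- data.frame(
  Country = rep(c("USA", "Germany", "Japan", "Slovakia", "Togo"), 2),
  Value   = sample(1:1000, 10),
  stringsAsFactors = FALSE
)

# Sampling from the raw column: the number of distinct countries can drop below 3
n_distinct_raw <- replicate(1000, length(unique(sample(df$Country, 3))))

# Sampling from the unique values: always exactly 3 distinct countries
n_distinct_unique <- replicate(1000, length(unique(sample(unique(df$Country), 3))))

# Keep every row belonging to the 3 sampled countries
picked  <- sample(unique(df$Country), 3)
sampled <- df[df$Country %in% picked, ]
```

Running table(n_distinct_raw) shows how often the naive version falls short of 3 countries.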
Summing across rows conditional on groups with dplyr using select, group_by, and mutate
Problem: I'm making an aggregate market share variable in a car market with 286 distinct models sold and a total of 501 cars sold. This group share is based only on the car characteristics cat = "compact", "midsize", "large" and yr = 77, 78, 79, 80, 81, and on the share s, a small double variable; there are 15 groups in the market in total.
Closest answer I've found: by mishabalyasin on community.rstudio, "Calculating rowwise totals and proportions using tidyeval?".
Applying the principle of select-split-combine, the closest I've come to the correct answer is the 15 groups (15 x 3 (cat, yr, s)):
df <- blp %>% select(cat, yr, s) %>% group_by(cat, yr) %>% summarise(group_share = sum(s))
# in my actual data, this is what fills in the group share to get what I want, but this isn't the desired pipeline-based answer
blp$group_share = 0 # initializing the group_share, the 50th col
for (i in 1:501) {
  for (j in 1:15) {
    if ((blp[i, 31] == df[j, 1]) && (blp[i, 3] == df[j, 2])) { # if (sameCat & sameYr) {blpGS = dfGS}
      blp[i, 50] = df[j, 3]
    }
  }
}
This is great, but I know this can be done in one fell swoop... Hopefully, the idea is clear from what I've described above. A simple fix may be a loop with conditions on cat and yr, and that would help, but I'm really trying to get better at data wrangling with dplyr, so any insight along those lines toward a pipelined answer would be wonderful.
Example for the site: the example below doesn't work with the code I provided, but this is the "look" of my data. There is a problem with the share being a factor.
# 45 obs, 3 cats, 5 yrs
cat = rep(c("compact", "midsize", "large"), 15)
yr = rep(c(77, 78, 79, 80, 81), 9)
s = rep(c(.001, .0005, .002, .0001, .0002), 9)
blp = as.data.frame(cbind(unlist(lapply(cat, as.character, stringsAsFactors = FALSE)), as.numeric(yr), unlist(as.numeric(s))))
names(blp) <- c("cat", "yr", "s")
head(blp)
# note: one example of a group share would be summing the share from
(group_share.blp.large.81.s = (blp[cat == "large" & yr == 81, ]))
# works thanks to akrun: applying the code I provided for what leads to the 15 groups
df <- blp %>% select(cat, yr, s) %>% group_by(cat, yr) %>% summarise(group_share = sum(as.numeric(as.character(s))))
# manually filling doesn't work, but this is what I'd want if I didn't want pipelining
blp$group_share = 0
for (i in 1:45) {
  if (((blp[i, 1]) == (df[j, 1])) && (as.numeric(blp[i, 2]) == as.numeric(df[j, 2]))) { # if (sameCat & sameYr) {blpGS = dfGS}
    blp[i, 4] = df[j, 3]
  }
}
If I understood your problem correctly, this should help. The only difference is that instead of summarize, which returns only the grouping columns and the summarized one, you can use mutate to keep the original columns and add the aggregate as a new one.
# Sample input
## 45 obs, 3 cats, 5 yrs
library(dplyr)
cat <- rep(c("compact", "midsize", "large"), 15)
yr <- rep(c(77, 78, 79, 80, 81), 9)
s <- rep(c(.001, .0005, .002, .0001, .0002), 9)
# Calculation
blp <- data.frame(cat, yr, s, stringsAsFactors = FALSE) %>% # create the data frame
  group_by(cat, yr) %>% # group by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>% # sum of shares per category/year
  ungroup()
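For comparison, base R's ave() does the same kind of group-wise fill without any packages: it applies a function within each group and returns a vector as long as the input, which is what mutate() does after group_by(). The small data frame below is made up for illustration and is not the asker's data.

```r
# Hypothetical toy data: 5 cars in 3 (cat, yr) groups, made-up shares
blp <- data.frame(
  cat = c("compact", "compact", "large", "large", "large"),
  yr  = c(77, 77, 77, 78, 78),
  s   = c(0.1, 0.2, 0.3, 0.4, 0.5),
  stringsAsFactors = FALSE
)

# ave() sums s within each (cat, yr) combination and repeats the
# group total on every row of that group
blp$group_share <- ave(blp$s, blp$cat, blp$yr, FUN = sum)
```

Each row keeps its own s and gains the total of its group, so no loop over rows and groups is needed.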
R fill in variable for a specific observation in a data frame
I have some data (download link: http://spreadsheets.google.com/pub?key=0AkBd6lyS3EmpdFp2OENYMUVKWnY1dkJLRXAtYnI3UVE&output=xls) that I'm trying to filter. I had reconfigured the data so that instead of one row per country and one column per year, each row of the data frame is a country-year combination (i.e. Afghanistan, 1960, NA). Now that I've done that, I want to create a subset of the initial data that excludes any country that has 10+ years of missing contraceptive use data. I had thought to create a list of the unique country names in a second data frame, and then add a variable to that frame that holds the number of rows for each country that have an NA for contraceptive use (i.e. for Afghanistan it would be 46). My first thought (being most fluent in VB.net) was to use a for loop to iterate through the countries, get the NA count for each country, and then update the second data frame with that value. In that vein I tried the following:
for (x in cl) {
  x$rc = nrow(subset(BCU, BCU$Country == x$Country))
}
After that failed, a little more Googling brought me to a question on here (forgot to grab the link) that suggested using by(). Based on that I tried:
by(cl, 1:nrow(cl), cl$rc <- nrow(subset(BCU, BCU$Country == cl$Country & BCU$Contraceptive_Use == "NA")))
(cl is the second data frame listing the country names, and BCU is the initial contraceptive use data frame.)
I'm fairly new to R (the problem I'm working on is for an R course on Udacity), so I'll freely admit this may not be the best approach, but I'm still curious how to do this sort of aggregation.
They all seem to have >= 10 years of missing data (unless I miscalculated somewhere):
library(tidyr)
library(dplyr)
dat <- read.csv("contraceptive use.csv", stringsAsFactors = FALSE, check.names = FALSE)
dat <- rename(gather(dat, year, value, -1), country = `Contraceptive prevalence (% of women ages 15-49)`)
dat %>%
  group_by(country) %>%
  summarise(missing_count = sum(is.na(value))) %>%
  arrange(desc(missing_count)) -> missing
sum(missing$missing_count >= 10)
## [1] 213
length(unique(dat$country))
## [1] 213
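The filtering step the question asks for, dropping countries with 10 or more missing years, can be sketched in base R as below. The data frame is made up (two countries, 12 years each) and only mirrors the long country/year/value layout used above; on the real data the answer's counts suggest no country would pass the filter.

```r
# Hypothetical toy data in long format: country "A" has 11 missing years, "B" has none
dat <- data.frame(
  country = rep(c("A", "B"), each = 12),
  year    = rep(1960:1971, 2),
  value   = c(rep(NA, 11), 5, 1:12),
  stringsAsFactors = FALSE
)

# Count the missing values per country
missing_count <- tapply(is.na(dat$value), dat$country, sum)

# Keep only countries with fewer than 10 missing years
keep     <- names(missing_count)[missing_count < 10]
dat_kept <- dat[dat$country %in% keep, ]
```

Note that missing values are tested with is.na(); comparing against the string "NA", as in the question's attempt, does not detect them.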