I have the following data set:
df = data.frame("Country" = rep(sample(c("USA", "Germany", "Japan", "Slovakia", "Togo")),2))
df$Value = sample(c(1:1000), 10)
Now I want to randomly sample, let's say, 3 countries from that df and keep all 6 rows that belong to those 3 countries. In other words, whenever I sample on the Country variable, I should always get all rows (here two) that belong to each sampled country.
How can I do this? The following code doesn't work every time and sometimes returns only 2 countries:
df %>% filter(Country %in% sample(Country, 3))
Thanks!
We can wrap 'Country' in unique() to remove the duplicates before sampling. That way sample() draws from the 5 distinct countries rather than from all 10 rows, so it always returns 3 different values of 'Country':
library(dplyr)
df %>%
filter(Country %in% sample(unique(Country), 3))
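As a quick check of the idea above, here is a minimal sketch on a toy copy of the data. The names toy, picked and result, the fixed Value column and the seed are only for illustration and are not part of the original question.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the draw is repeatable

# Toy stand-in for df: 5 countries, 2 rows each
toy <- data.frame(
  Country = rep(c("USA", "Germany", "Japan", "Slovakia", "Togo"), 2),
  Value = 1:10
)

# Sample from the distinct country names, not from the 10 rows
picked <- sample(unique(toy$Country), 3)

# Keep every row that belongs to one of the sampled countries
result <- toy %>% filter(Country %in% picked)

n_distinct(result$Country)  # 3
nrow(result)                # 6, because each country has 2 rows
```

Because sample() draws without replacement by default, sampling from the unique names cannot return the same country twice, which is what went wrong with sample(Country, 3).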
Related
I am trying to make a graph showing the average temperature in Australia from 1950 to 2000. My dataset has a "Country" column that contains Australia but also other countries. The dataset also includes the year and the average temperature for every country. How would I go about excluding all the other data to make a graph just for Australia?
Example of the dataset
You just need to subset your data so that it only contains observations about Australia. I can't see the details of your dataset from your picture, but let's assume that your dataset is called d and that the column of d saying which country each observation is about is called country. Then you could do the following using base R:
d_aus <- d[d$country == "Australia", ]
Or using dplyr you could do:
library(dplyr)
d_aus <- d %>%
filter(country == "Australia")
Then d_aus would be the dataset containing only the observations about Australia (those for which d$country == "Australia"), which you could use to make your graph.
This should do the job. Change the column names to match those in your data.
library("ggplot2")
library("dplyr")
data %>%
  filter(Country == "Australia" & Year %in% 1950:2000) %>%
  ggplot(aes(x = Year, y = Temp)) +
  geom_point()
I have a dataframe in R in which every observation is a paper. The continent column denotes which continent the paper looks at and the impacts column denotes which "impacts" it mentions (0 for not mentioned).
It looks like this:
What I am trying to do is this:
I was successful in producing the data in that second picture by doing the following:
water <- with(df,tapply(Water,Continent,FUN=sum))
colnames(water) <- "Water"
temp<- with(df,tapply(Temperature,Continent,FUN=sum))
colnames(temp) <- "Temperature"
new.df <- cbind(water,temp)
new.df2 <- cbind(newColName = rownames(new.df), new.df)
rownames(new.df) <- 1:nrow(new.df)
new.df2 <- melt(new.df2)
names(new.df2) <- c("Continent", "Impact", "Occurrences")
However, I have 50 columns I would like to do this for, and I am struggling to find a method that does it succinctly. In sum, what I am trying to do is find an easy way to:
sum each "impact" column (every column except the paper number and continent columns) BY the continent column
melt it so that for each Continent and each impact it tells me the number of occurrences that impact appears in that continent
I think a tidyverse approach would look like this, but hard to know for sure without testing on a portion of your data:
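library(dplyr)  # provides the %>% pipe used below

# df1 is your data frame; every column except `paper#` and Continent is an impact
tidyr::pivot_longer(
  df1,
  -c(`paper#`, Continent),
  names_to = "Impact",
  values_to = "Occurrences"
) %>%
  dplyr::group_by(Continent, Impact) %>%
  dplyr::summarise(Occurrences = sum(Occurrences), .groups = "drop")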
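Since the real data is not available, here is a small self-contained sketch of the same reshape-then-sum approach. The four rows of df1 are invented stand-ins (only the column names `paper#`, Continent, Water and Temperature come from the question), so treat it as an illustration rather than a tested solution for the full 50 columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the real data: one row per paper, 0/1 per impact
df1 <- data.frame(
  `paper#` = 1:4,
  Continent = c("Africa", "Africa", "Asia", "Asia"),
  Water = c(1, 0, 1, 1),
  Temperature = c(0, 0, 1, 0),
  check.names = FALSE  # keep the non-syntactic column name `paper#`
)

out <- df1 %>%
  # Stack all impact columns into Impact / Occurrences pairs
  pivot_longer(-c(`paper#`, Continent),
               names_to = "Impact", values_to = "Occurrences") %>%
  # Sum the occurrences of each impact within each continent
  group_by(Continent, Impact) %>%
  summarise(Occurrences = sum(Occurrences), .groups = "drop")

out
```

Because the impact columns are selected by exclusion, the same code should work unchanged however many impact columns there are.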
I have a dataset associating a single application number with a series of different applicants from different countries. I have a column with each applicant's country of origin as the possible value. I want to condense everything down to 2 columns:
column 1 = count of applicants within USA
column 2 = count of applicants Outside USA
I guessed I would need to use an ifelse but I haven't managed to get anything to work so far, can someone please help?
Thanks!!
PS. If anyone knows how I could do this and also produce a list of the countries outside the USA, like @Sotos has done in "Pivot wider returning 1 column?", that would be even better, but that's just a bonus :)
Like so?
df <- data.frame(app_num = c(1,1,1,2,2),
country = c(LETTERS[c(1:4,1)]))
library(tidyverse)
df %>%
count(A = if_else(country == "A", "USA", "Other")) %>%
pivot_wider(names_from = A, values_from = n)
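For the bonus part of the question (also listing the countries outside the USA), one possible sketch is below. It assumes the counts are wanted per application number, which the question does not state outright, and the example countries and the column names usa, other and other_countries are made up for illustration.

```r
library(dplyr)

# Hypothetical example data: two applications with several applicants each
df <- data.frame(
  app_num = c(1, 1, 1, 2, 2),
  country = c("USA", "Germany", "Japan", "France", "USA"),
  stringsAsFactors = FALSE
)

out <- df %>%
  group_by(app_num) %>%
  summarise(
    usa = sum(country == "USA"),    # applicants within the USA
    other = sum(country != "USA"),  # applicants outside the USA
    # comma-separated list of the distinct non-USA countries
    other_countries = toString(unique(country[country != "USA"]))
  )

out
```

Dropping the group_by() line gives a single row of totals across all applications instead.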
I've got this dataset: link. I'm making a choropleth out of it, but certain names do not match the world data frame (from mapproj, converted with map_data from ggplot2). One of them is Trinidad and Tobago, which I want to replace with two other rows (or four if there are two matches, or six, and so on), identical to the original except for one detail: the name. One should be named Trinidad and the other Tobago.
I think there's probably some function (if it were an entire row, I'd use separate_rows() from tidyr) that looks something like this: function_name("Trinidad and Tobago", sep = " and "), but I don't know it.
These are the packages I loaded additionally (it's best if these packages are used instead of others, or at least something in tidyverse):
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(mapproj)
I realise that the way I'm asking this isn't very clear, so if I can help in any way...
If you can create a tibble with the names of the countries that you want to separate (called new_names_tbl here), you can join it onto your covid_data:
library(dplyr)  # left_join, mutate, case_when, select and %>%
library(tibble)
library(tidyr)
covid_data <- tibble::tribble(
~location, ~cases,
"Afghanistan", 12,
"USA", 34,
"Trinidad and Tobago", 45
)
new_names_tbl <- tibble::tribble(
~location, ~new_names_variable,
"Trinidad and Tobago", "Trinidad",
"Trinidad and Tobago", "Tobago"
)
covid_data %>%
left_join(new_names_tbl, by = c("location" = "location")) %>%
mutate(location = case_when(is.na(new_names_variable) ~ location, TRUE ~ new_names_variable)) %>%
select(-new_names_variable)
Here I explicitly stated that covid_data and new_names_tbl are to be joined by the variable location.
This should do the trick!
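Since the question mentions separate_rows(), here is a hedged alternative sketch that uses it directly. It splits only the names listed in to_split (a hypothetical helper vector), because splitting every location on " and " would also break up names that should stay whole. The example data mirrors the toy covid_data above.

```r
library(dplyr)
library(tidyr)
library(tibble)

covid_data <- tribble(
  ~location,             ~cases,
  "Afghanistan",         12,
  "USA",                 34,
  "Trinidad and Tobago", 45
)

# Hypothetical list of names that should be split into separate rows
to_split <- c("Trinidad and Tobago")

split_data <- bind_rows(
  # rows that are left untouched
  covid_data %>% filter(!location %in% to_split),
  # rows that are duplicated, one per part of the name
  covid_data %>%
    filter(location %in% to_split) %>%
    separate_rows(location, sep = " and ")
)

split_data
```

Each split row keeps the original values of the other columns, so both Trinidad and Tobago end up with cases = 45 here.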
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df$countryName = as.character(df$countryName)
I processed the dataset.
df$countryName[df$countryName == "United States"] <- "United States of America"
I renamed "United States" to "United States of America" here so that it matches the name used in the population data.
df8$death_pop <- df8$death / df8$PopTotal
I calculated deaths relative to total population (death/pop).
How can I find the 10 countries with the highest death/pop?
Using base R:
df8[order(df8$death_pop, decreasing = TRUE)[1:10],]
This orders your data.frame by death_pop and extracts the first 10 rows.
Using the package dplyr, there is the function top_n, which gives you the desired result. I added arrange(desc(death_pop)) to give you sorted output. Remove this part if you don't need it.
df8 %>% top_n(10, death_pop) %>% arrange(desc(death_pop))
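In dplyr 1.0.0 and later, top_n is superseded by slice_max, which also returns the rows already sorted in descending order. A minimal sketch on made-up numbers follows; the four rows of df8 below are invented for illustration, and n = 2 is used only to keep the example small.

```r
library(dplyr)

# Hypothetical stand-in for df8 with invented values
df8 <- data.frame(
  countryName = c("A", "B", "C", "D"),
  death_pop = c(0.002, 0.010, 0.001, 0.005),
  stringsAsFactors = FALSE
)

# Rows with the largest death_pop, highest first (use n = 10 on the real data)
top2 <- df8 %>% slice_max(death_pop, n = 2)

top2$countryName  # "B" "D"
```

Note that slice_max keeps ties by default, so it can return more than n rows; pass with_ties = FALSE if exactly n rows are needed.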