Creating a 2 column df from a loop of unique values [duplicate] - r

This question already has an answer here:
How to count the number of unique values by group? [duplicate]
(1 answer)
Closed last year.
I am looking to create a matrix babyList storing the number of unique names in the babynames package for every year in that package. I am only interested in using a loop to do this.
library(babynames)
# Create the data frame that will hold the output: one row per year
babyList <- data.frame(matrix(ncol = 2, nrow = diff(range(babynames$year)) + 1))
colnames(babyList) <- c("years","unique names")
The for loop is giving me trouble. Here is what I know it will need:
Pseudo code
for (i in unique(babynames$year)) {
length(unique(babynames$name[babynames$year == i]))
}
How can I put this all together in a correct structure?

I would suggest using dplyr::summarise() instead:
library(babynames)
library(dplyr)
babynames %>%
group_by(year) %>%
summarise(unique_names = length(unique(name)))
which gives the number of unique names for each year in the babynames dataset:
# A tibble: 138 x 2
year unique_names
* <dbl> <int>
1 1880 1889
2 1881 1830
3 1882 2012
4 1883 1962
...
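Since the question asks specifically for a loop, here is a minimal sketch of one way to structure it. To keep it self-contained it uses a small made-up data frame, `toy`, as a stand-in for babynames (the values are hypothetical; only the `year` and `name` column names match the real data). The result column is called `unique_names` here rather than "unique names" to avoid a space in the name.

```r
# Toy stand-in for babynames (hypothetical values, same column names)
toy <- data.frame(
  year = c(1880, 1880, 1880, 1881, 1881, 1882),
  name = c("Mary", "Anna", "Mary", "Mary", "Emma", "Anna"),
  stringsAsFactors = FALSE
)

# Loop over the distinct years, not over every row of the data
years <- sort(unique(toy$year))

# Pre-allocate the result: one row per year
babyList <- data.frame(years = years, unique_names = NA_integer_)

for (i in seq_along(years)) {
  # Names recorded in this year, with duplicates dropped before counting
  babyList$unique_names[i] <- length(unique(toy$name[toy$year == years[i]]))
}

babyList
```

Replacing `toy` with `babynames` should give the same counts as the summarise() call, assuming the package data has the `year` and `name` columns used in the question.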

Related

Remove rows in a dataframe based on number of element per factor in one column in R [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 6 years ago.
I'm using the dplyr package in R and have grouped my data by 3 variables (Year, Site, Brood).
I want to get rid of groups made up of fewer than 3 rows. For example, in the following sample I would like to remove the rows for brood '2'. I have a lot of data, so while I could painstakingly do it by hand, it would be much more helpful to automate it in R.
Year Site Brood Parents
1996 A 1 1
1996 A 1 1
1996 A 1 0
1996 A 1 0
1996 A 2 1
1996 A 2 0
1996 A 3 1
1996 A 3 1
1996 A 3 1
1996 A 3 0
1996 A 3 1
I hope this makes sense and thank you very much in advance for your help! I'm new to R and stackoverflow so apologies if the way I've worded this question isn't very good! Let me know if I need to provide any other information.
One way to do it is to use the magic n() function within filter:
library(dplyr)
my_data <- data.frame(Year=1996, Site="A", Brood=c(1,1,2,2,2))
my_data %>%
group_by(Year, Site, Brood) %>%
filter(n() >= 3)
The n() function gives the number of rows in the current group (or the number of rows total if there is no grouping).
Throwing the data.table approach here to join the party:
library(data.table)
setDT(my_data)
my_data[ , if (.N >= 3L) .SD, by = .(Year, Site, Brood)]
You can also do this using base R:
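# `folder` is the path to the directory containing test.csv
temp <- read.csv(paste0(folder, "test.csv"), header = TRUE, sep = ",")
# Count rows per group (the count lands in the Parents column)
matches <- aggregate(Parents ~ Year + Site + Brood, temp, FUN = length)
# After the merge, Parents.x is the original value and Parents.y the group size
temp <- merge(temp, matches, by = c("Year", "Site", "Brood"))
temp <- temp[temp$Parents.y >= 3, c(1, 2, 3, 4)]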

How would you substring a column using tidyverse in R? [duplicate]

This question already has answers here:
Extract the first 2 Characters in a string
(4 answers)
Closed 2 years ago.
I need to rename a column in R and then shorten its values. For example, the initial data frame contains "2017-18" and "2018-19", and I need to keep only the first four digits, essentially cutting off the "-##" portion. I've attempted to use substr(), but I get an error about converting to character.
data <- read_excel("nba.xlsx")
data1<- data %>%
rename(year=season) %>%
select(year)
data1 <- data1 + as.numeric(substr(year,1,4))
Above is the code I currently have; I have tried rearranging it and moving things around. Any help would be greatly appreciated. Thank you!
Use str_replace from stringr:
library(dplyr)
library(stringr)
df <- tibble(season = c("2017-18", "2018-19"))
df %>% mutate(year = str_replace(season, "-.*", ""))
# A tibble: 2 x 2
season year
<chr> <chr>
1 2017-18 2017
2 2018-19 2018
Alternatively, use str_sub:
df %>% mutate(year = str_sub(season, 1, 4))
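The substr() attempt in the question can also be made to work in base R. The sketch below assumes the column has already been renamed to `year` and uses a small made-up data frame, `data1`, in place of the Excel file. The key difference from the original attempt is that the column is referenced as `data1$year` and assigned back to itself, instead of being added to the whole data frame.

```r
# Hypothetical stand-in for the data read from nba.xlsx
data1 <- data.frame(year = c("2017-18", "2018-19"), stringsAsFactors = FALSE)

# Keep the first four characters and convert them to a number
data1$year <- as.numeric(substr(data1$year, 1, 4))

data1$year
```

Drop the as.numeric() call if the year should stay a character column, as in the str_replace output above.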

Can I sort a column (Character) by last part of a string / value? [duplicate]

This question already has answers here:
R Sort strings according to substring
(2 answers)
Closed 3 years ago.
I have a data.frame with a column (character) that has a list of values such as (the prefix refers to the season and suffix a year):
Wi_1984,
Su_1985,
Su_1983,
Wi_1982,
Su_1986,
Su_1984,
I want to keep the column type and format as it is, but what I would like to do is order the df by this column in ascending season_year order. So I would like to produce:
Wi_1982,
Su_1983,
Su_1984,
Wi_1984,
Su_1985,
Su_1986,
Using normal sorting will arrange by Wi_ or Su_ and not by _1984 i.e. _year. Any help much appreciated. If this could be done in dplyr / tidyverse that would be grand.
We can use parse_number to get the numeric part and use that in arrange
library(dplyr)
library(readr)
df1 %>%
arrange(parse_number(col1))
Or, if digits can also appear in the prefix, extract only the trailing number (str_extract comes from stringr):
library(stringr)
df1 %>%
arrange(as.numeric(str_extract(col1, "\\d+$")))
To answer based on @zx8754's comment, you can do:
library(dplyr)
library(tidyr)
df %>%
separate(X1, into = c('season', 'year')) %>%
arrange_at(vars(c(2, 1)))
which gives,
# A tibble: 6 x 2
season year
<chr> <chr>
1 Wi 1982
2 Su 1983
3 Su 1984
4 Wi 1984
5 Su 1985
6 Su 1986
In base R, we can extract the numeric part using sub and order
df[order(as.integer(sub(".*?(\\d+)", "\\1", df$col))), ]
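Ordering by the year alone leaves rows that share a year (here Wi_1984 and Su_1984) in their original relative order. A minimal base R sketch that keeps the column unchanged and breaks ties by the season prefix is shown below. The names `df1` and `col1` follow the answers above, and the alphabetical tie-break is an assumption that happens to reproduce the output requested in the question.

```r
df1 <- data.frame(
  col1 = c("Wi_1984", "Su_1985", "Su_1983", "Wi_1982", "Su_1986", "Su_1984"),
  stringsAsFactors = FALSE
)

# Split each value into its two parts without modifying the column itself
year   <- as.integer(sub(".*_", "", df1$col1))  # text after the underscore
season <- sub("_.*", "", df1$col1)              # text before the underscore

# Order by year first, then alphabetically by season prefix within a year
sorted <- df1[order(year, season), , drop = FALSE]

sorted$col1
```

If the seasons need a non-alphabetical order within a year, `season` could be converted to a factor with explicit levels before calling order().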

R: Selecting cases when a column condition is met [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 4 years ago.
I have a data frame like this:
x y z country
1 4 1 USA
3 1 1 Canada
0 1 1 Spain
0 2 3 USA
4 1 1 Canada
I need to select the rows for countries that appear at least 1000 times in the data frame. Let's say, for example, that only USA and Canada meet that condition. The problem is that I have more than 40 countries and 500000 cases, so I can't do it case by case.
I suppose that I need a for loop to do so, but I can't figure out how to write it.
First get the names of the countries you want. Then subset by those names.
tab <- table(df$country)
mycountries <- names(tab[tab >= 1000])
df <- df[df$country %in% mycountries, ]
With data.table, and assuming your data frame is named df, we can create a variable named count that holds the total number of rows for each country, and then subset to only those countries with at least 1000 rows:
library(data.table)
setDT(df)
df[ , count := .N, by=country]
df[count >= 1000]
One possible solution using dplyr, which lists the qualifying countries together with their counts:
library(dplyr)
df %>%
group_by(country) %>%
summarise(count = n()) %>%
filter(count >= 1000) %>%
arrange(desc(count))
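To keep the original rows rather than a summary, the group size can be attached to every row and used as a filter. The sketch below does this in base R with ave(), which needs no loop. The data frame is a shortened copy of the sample in the question, and `min_rows` is a hypothetical threshold set to 2 so that the toy data shows an effect (the question uses 1000).

```r
df <- data.frame(
  x = c(1, 3, 0, 0, 4),
  country = c("USA", "Canada", "Spain", "USA", "Canada"),
  stringsAsFactors = FALSE
)

min_rows <- 2  # hypothetical threshold; the question uses 1000

# For every row, the number of rows that share its country
n_per_country <- ave(seq_along(df$country), df$country, FUN = length)

# Keep only rows whose country appears often enough
kept <- df[n_per_country >= min_rows, ]

kept
```

With the sample data, Spain appears once and is dropped, while the USA and Canada rows are kept.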

Count by Year Based on Criterion in R

I am trying to count the number of rows with values fcoli>15 and produce a vector sorting these counts by year.
Some sample data:
Year <- c(1996,1996,1997,1997,1998,1999,1999,1999)
fcoli <- c(45,13,96,10,52,53,64,5)
sample <- data.frame(Year,fcoli)
I have been able to count the number of rows one year at a time using:
nrow(subset(sample, sample$fcoli > 15 & sample$Year == 1996))
However I have not been able to use this criterion to produce counts for all the years at once. My actual data consists of over 20 years of data and so I would rather not have to manually iterate this code for each year.
Any suggestions? Thanks!
Here is a simple enough answer.
Year <- c(1996,1996,1997,1997,1998,1999,1999,1999)
fcoli <- c(45,13,96,10,52,53,64,5)
sample <- data.frame(Year,fcoli)
aggregate(fcoli~Year,FUN=length,data=sample[sample$fcoli>15,])
A dplyr alternative:
library(dplyr)
df1 %>% # df1 is your data frame
filter(fcoli>15) %>%
group_by(Year)%>%
summarise(freq=n())
Source: local data frame [4 x 2]
Year freq
1 1996 1
2 1997 1
3 1998 1
4 1999 2
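Because the question asks for a vector, another option is tapply(), sketched below on the sample data. Summing the logical condition `fcoli > 15` within each year counts the rows that satisfy it. Unlike filtering first, this keeps a year whose count is zero in the result.

```r
Year  <- c(1996, 1996, 1997, 1997, 1998, 1999, 1999, 1999)
fcoli <- c(45, 13, 96, 10, 52, 53, 64, 5)

# TRUE counts as 1 and FALSE as 0, so the per-year sum is the per-year count
counts <- tapply(fcoli > 15, Year, sum)

counts
```

The result is named by year, so a single value can be looked up with, for example, `counts["1999"]`.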
