Adding a column with consecutive numbers in R

I apologize if this question is abhorrently simple, but I'm looking for a way to just add a column of consecutive integers to a data frame (if my data frame has 200 observations, for example, starting with 1 for the first observation, and ending with 200 on the last one).
How can I do this?

For a dataframe (df) you could use
df$observation <- 1:nrow(df)
but if you have a matrix you would rather want to use
ma <- cbind(ma, "observation"=1:nrow(ma))
because using the first option ($ assignment) on a matrix will coerce your data into a list.
Source: http://r.789695.n4.nabble.com/adding-column-of-ordered-numbers-to-matrix-td2250454.html
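A minimal sketch of both cases side by side. The toy objects df and ma are made up for illustration, and seq_len(nrow(x)) is swapped in for 1:nrow(x) on the assumption that you may want zero-row input handled gracefully (1:0 counts down and yields c(1, 0)).

```r
# Toy data; the names df and ma are illustrative only
df <- data.frame(x = c(10, 20, 30))
ma <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# Data frame: assign the new column directly
df$observation <- seq_len(nrow(df))

# Matrix: bind a new named column; the result is still a matrix
ma <- cbind(ma, observation = seq_len(nrow(ma)))

# seq_len() also behaves for a zero-row data frame,
# where 1:nrow() would produce c(1, 0) and fail on assignment
empty <- df[0, ]
empty$observation <- seq_len(nrow(empty))
```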

Or use dplyr.
library(dplyr)
df %>% mutate(observation = 1:n())
You might want it to be the first column of df.
df %>% mutate(observation = 1:n()) %>% select(observation, everything())
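As a possible variation, row_number() together with the .before argument of mutate (available in dplyr 1.0.0 and later, as far as I know) should do both steps in one call. The toy df here is an assumption for illustration only.

```r
library(dplyr)

# Toy data, not from the question
df <- data.frame(x = c(10, 20, 30), y = c("alpha", "beta", "gamma"))

# row_number() numbers the rows 1..n; .before = 1 places the new column first
out <- df %>% mutate(observation = row_number(), .before = 1)
```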

If you are using the tidyverse ecosystem, tibble::rowid_to_column is probably the function you need.
library(tidyverse)
dat <- tibble(x = c(10, 20, 30),
              y = c('alpha', 'beta', 'gamma'))
dat %>% rowid_to_column(var='observation')
# A tibble: 3 x 3
  observation     x y
        <int> <dbl> <chr>
1           1    10 alpha
2           2    20 beta
3           3    30 gamma

Related

Balance observations in data.frame by factor level [duplicate]

This question already has answers here:
Take random sample by group
I would like to subsample a dataframe that has an imbalanced number of observations by factor level.
The output I want is another dataframe built from data from the original one where the number of observations by factor level is similar across factor levels (doesn't need to be exactly the same number for each level, but roughly similar).
I am not sure if this is called "thinning" the data, or "undersampling" the data.
Consider for instance this dataframe:
data <- data.frame(id = 1:1000,
                   class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))
How can I slice rows so that I extract ~200 rows, 50 for each class A, B, C and D?
I can do this manually, but I would like to find a method that I can use with larger datasets and based on a factor with more levels.
I would also be thankful for advice on the name of what I need (thinning? undersampling? stratified sampling?). Thanks!
You can use slice_sample in dplyr:
library(dplyr)
data %>%
  group_by(class) %>%
  slice_sample(n = 50)
In dplyr 1.1.0 and above:
slice_sample(data, n = 50, by = class)
A base R option is to split the data by group, use lapply to sample 50 rows from each piece, and then combine the pieces back together with rbind, like this:
df = lapply(split(data, data$class), function(x) x[sample(nrow(x), 50),])
df_sampled = do.call(rbind, df)
# Check number of observations
library(dplyr)
df_sampled %>%
  group_by(class) %>%
  summarise(n = n())
#> # A tibble: 4 × 2
#>   class     n
#>   <chr> <int>
#> 1 A        50
#> 2 B        50
#> 3 C        50
#> 4 D        50
Created on 2023-02-17 with reprex v2.0.2
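One caveat with the base R version: sample(nrow(x), 50) errors if a class has fewer than 50 rows. A hedged sketch of one way around that is to cap the sample size at the group size. The helper name sample_capped and the target n_per_class = 60 are hypothetical, chosen only so that the cap actually kicks in for classes C and D.

```r
set.seed(42)  # only to make the sketch reproducible

data <- data.frame(id = 1:1000,
                   class = c(rep("A", 700), rep("B", 200), rep("C", 50), rep("D", 50)))

n_per_class <- 60  # hypothetical target, larger than the smallest classes

# Hypothetical helper: sample up to n rows, but never more than the group has
sample_capped <- function(x, n) {
  x[sample(nrow(x), min(n, nrow(x))), , drop = FALSE]
}

df_sampled <- do.call(rbind, lapply(split(data, data$class), sample_capped, n = n_per_class))

# Rows kept per class
counts <- table(df_sampled$class)
```

With this data the large classes are cut down to the target while the small ones are kept whole, which gives the "roughly similar" counts asked for.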

Obtaining a summary of grouped counts in R

This should be simple but I have been stumped by it: I am trying to figure out an efficient method for obtaining summary stats of a grouped count. Here's a toy example:
df = tibble(pid = c(1,2,2,3,3,3,4,4,4,4), y = rnorm(10))
df %>% group_by(pid) %>% count(pid)
which outputs the expected
# A tibble: 4 × 2
# Groups:   pid [4]
    pid     n
  <dbl> <int>
1     1     1
2     2     2
3     3     3
4     4     4
However, what if I want a summary of those grouped counts? Attempting to mutate a new variable or use add_count hasn't worked, I assume because the variables are different sizes. For instance:
df %>% group_by(pid) %>% count(pid) %>% mutate(count = summary(n))
generates an error. What would be a simple way to generate summary statistics of the grouped counts (e.g., min, max, mean, etc.)?
mutate is for adding columns to a data frame. You don't want that here; you need to pull the column out of the data frame.
df %>%
  count(pid) %>%
  pull(n) %>%
  summary()
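If you would rather keep the result as a data frame, a sketch along the same lines is to summarise the counts instead of pulling them out. The column names min_n, max_n and mean_n are my own choice for illustration.

```r
library(dplyr)

df <- tibble(pid = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4), y = rnorm(10))

# count() gives one row per pid; summarise() then collapses those counts
count_stats <- df %>%
  count(pid) %>%
  summarise(min_n = min(n), max_n = max(n), mean_n = mean(n))
```

For the toy data the counts are 1, 2, 3 and 4, so this gives a minimum of 1, a maximum of 4 and a mean of 2.5.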

how to determine the number of unique values based on multiple criteria dplyr

I've got a df that looks like:
df(site=c(A,B,C,D,E), species=c(1,2,3,4), Year=c(1980:2010).
I would like to calculate the number of different years in which each species appears at each site, creating a new column called nYear. I've tried filtering by group and using mutate combined with n_distinct, but it is not quite working.
Here is part of the code I have been using:
Df1 <- Df %>%
filter(Year>1985)%>%
mutate(nYear = n_distinct(Year[Year %in% site]))%>%
group_by(Species,Site, Year) %>%
arrange(Species, .by_group=TRUE)
ungroup()
The approach is good, with a few things to correct.
First, let's make some reproducible data (your code gave errors).
df <- data.frame("site"=LETTERS[1:5], "species"=1:5, "Year"=1981:2010)
You should have used summarise instead of mutate when you're looking to summarise values across groups. It will give you a shortened tibble as an output, with only the groups and the summary figures present (fewer columns and rows).
mutate on the other hand aims to modify an existing tibble, keeping all rows and columns by default.
The order of your functions in the chains also needs to change.
df %>%
  filter(Year > 1985) %>%
  group_by(species, site) %>%
  summarise(nYear = length(unique(Year))) %>% # instead of mutate
  arrange(species, .by_group = TRUE) %>%
  ungroup()
That is, first group_by(species, site), not Year, and then summarise and arrange.
# A tibble: 5 × 3
  species site  nYear
    <int> <chr> <int>
1       1 A         5
2       2 B         5
3       3 C         5
4       4 D         5
5       5 E         5
You can use distinct() on the filtered frame, and then count by your groups of interest:
distinct(Df %>% filter(Year>1985)) %>%
count(Site, Species,name = "nYear")
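Note that distinct() with no columns named looks at every column, so the approach above relies on the frame containing only site, species and year. A sketch that names the columns explicitly, assuming the lower-case column names from the reproducible data in the other answer:

```r
library(dplyr)

# Reproducible data as in the other answer (site and species recycle over 30 years)
df <- data.frame(site = LETTERS[1:5], species = 1:5, Year = 1981:2010)

n_years <- df %>%
  filter(Year > 1985) %>%
  distinct(site, species, Year) %>%   # one row per site/species/year combination
  count(site, species, name = "nYear")
```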

R and dplyr: case_when throws 'incorrect length error' despite not being asked to evaluate group

I have a panel dataset where some groups have observations starting at an earlier year than others and would like to calculate the change in value from the earliest possible time period. I expected that by using case_when within mutate, R would not try to evaluate the code for groups where the earlier dates do not exist, but this does not seem to be the case. I have included a reprex below.
library("dplyr")
dataset <- data.frame(names=c("a","a","a","b","b"),
                      values=c(2,3,4,2,3),
                      dates=c("2010","2011","2012","2011","2012"))
dataset_calc <- dataset %>%
  group_by(names) %>%
  mutate(new_val = case_when(names=="a" ~ values-values[dates=="2010"],
                             TRUE ~ values-values[dates=="2011"]))
Is there a better solution for what I would like to do?
The resulting dataframe should be something like:
names values dates new_val
1 a 2 2010 0
2 a 3 2011 1
3 a 4 2012 2
4 b 2 2011 0
5 b 3 2012 1
If you arrange the data by date, then you can just subtract off the first value within each group
dataset %>%
  group_by(names) %>%
  arrange(dates) %>%
  mutate(new_val = values - first(values))
If you wanted to hard-code different reference years, you would want to apply case_when to the year rather than to the values. For example:
dataset %>%
  group_by(names) %>%
  mutate(
    ref_year = case_when(names=="a" ~ "2010", TRUE~"2011"),
    new_val = values - values[dates==ref_year],
    ref_year = NULL
  )
(You don't need the temporary ref_year variable; I just added it here to make it clearer how the code works.)
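Another possible variation, if you would rather not depend on row order at all: pick the reference value as the one at the earliest date in each group. This sketch assumes the dates strings are four-digit years that convert cleanly with as.integer.

```r
library(dplyr)

dataset <- data.frame(names  = c("a", "a", "a", "b", "b"),
                      values = c(2, 3, 4, 2, 3),
                      dates  = c("2010", "2011", "2012", "2011", "2012"))

dataset_calc <- dataset %>%
  group_by(names) %>%
  # which.min() finds the position of the earliest year within each group
  mutate(new_val = values - values[which.min(as.integer(dates))]) %>%
  ungroup()
```

For the example data this reproduces the new_val column shown in the question (0, 1, 2 for group a and 0, 1 for group b).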

Simple column splitting and joining using dplyr [duplicate]

This question already has answers here:
How to reshape data from long to wide format
I was wondering how I might simply split a numerical column by a second grouping variable in a dataset, then cbind the numerical column. This would most likely be a simple extension of the separate function for dplyr. For example, changing X below:
Y <- rbind(2,5,3,6,3,2)
Z <- rbind("A", "A", "A", "B", "B", "B")
X <- data.frame(Y,Z)
Into
A B
2 6
5 3
3 2
Then ideally extract the rowMeans into a new vector. (An issue also arises here when there is only one character in Z, given rowMeans requires 2).
This would need to be infinitely expandable based on the number of unique variables in Z. e.g., if Z had A, B, and C, then the final data.frame would require 3 columns. This would allow me to capture the row means from infinite number of groups in Z.
Thanks in advance,
Conal
Looks like a job for tidyr::spread.
library(dplyr)
library(tidyr)
X2 <- X %>%
  group_by(Z) %>%
  mutate(ID = 1:n()) %>%
  spread(Z, Y) %>%
  select(-ID)
X2
# A tibble: 3 x 2
      A     B
* <dbl> <dbl>
1     2     6
2     5     3
3     3     2
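In current tidyr, spread is superseded by pivot_wider, so a sketch of the same idea with the newer function, plus the rowMeans step the question asks about, might look like this. The names X_wide and row_means are mine, and X is rebuilt from plain vectors for simplicity.

```r
library(dplyr)
library(tidyr)

X <- data.frame(Y = c(2, 5, 3, 6, 3, 2),
                Z = c("A", "A", "A", "B", "B", "B"))

X_wide <- X %>%
  group_by(Z) %>%
  mutate(ID = row_number()) %>%   # position within each group, used to line up rows
  ungroup() %>%
  pivot_wider(names_from = Z, values_from = Y) %>%
  select(-ID)

# X_wide stays a data frame even with a single group, so rowMeans still works
row_means <- rowMeans(X_wide)
```

This gets one column per distinct value of Z, however many there are, and for the example data the row means are 4, 4 and 2.5.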

Resources