Add non-occurring factors to a data frame in R

I have a dataframe of factors and corresponding values like this:
df <- data.frame(week = factor(c(1,2,49,50)), occurrences = c(1,4,2,3))
week occurrences
1 1 1
2 2 4
3 49 2
4 50 3
I want to add factors for all the "missing" weeks in 1-53, each with a corresponding occurrences value of 0. What is the best way to do this? I have to do this for several data frames that may not be "missing" the same factors, so I would like to generalize it in a function.

You can use rbind() to append the necessary rows to your df. In this example, I first create the data frame to be added and then append it, for clarity. setdiff() will return the week numbers not currently present in your week column:
df_to_app <- data.frame(week = factor(setdiff(1:53, df$week)), occurrences = 0)
df <- rbind(df, df_to_app)
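If you want the result ordered by week afterwards (rbind() simply appends the new rows at the end), here is a small sketch of one way, converting the factor back to numeric for sorting:
# sort rows by the numeric value of the week factor
df <- df[order(as.numeric(as.character(df$week))), ]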
I hope that helps!

Here's an approach with tidyr::complete. First, we need to add the additional levels to our week column; we can use forcats::fct_expand for that. Then tidyr::complete will fill the data.frame with those levels, and we can use the fill = argument to indicate that we want 0 for the new rows.
library(tidyverse)
df %>%
  mutate(week = fct_expand(week, paste0(1:53))) %>%
  complete(week, fill = list(occurrences = 0))
# A tibble: 53 x 2
week occurrences
<fct> <dbl>
1 1 1
2 2 4
3 49 2
4 50 3
5 3 0
6 4 0
7 5 0
8 6 0
9 7 0
10 8 0
# … with 43 more rows
Or with a right join to a data.frame containing all weeks:
library(dplyr)
library(tidyr) # for replace_na()
df %>%
  right_join(data.frame(week = as.factor(1:53))) %>%
  mutate(occurrences = replace_na(occurrences, 0))
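Since you mention wanting to generalize this in a function, here is a minimal sketch wrapping the complete() approach in a reusable helper (the name fill_missing_weeks and the default 1:53 range are illustrative assumptions, not part of the answers above):
library(dplyr)
library(tidyr)
library(forcats)
# Hypothetical helper: add a 0-occurrence row for every level of
# all_weeks that is absent from df$week.
fill_missing_weeks <- function(df, all_weeks = 1:53) {
  df %>%
    mutate(week = fct_expand(week, as.character(all_weeks))) %>%
    complete(week, fill = list(occurrences = 0))
}
fill_missing_weeks(df)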

Related

mutate string into numeric, ignore alphabetical order of factor

I am trying to recode factor levels into numbers using the mutate function, but I want to ignore the alphabetical order in which the factors appear. There are multiple rows with the same factor level, and I want each of them to be assigned, in the new column, the number of the row in which that level first appeared in the dataframe.
Example:
library(stringi)
library(dplyr)
set.seed(234)
data <- stri_rand_strings(20, 1)
data <- as.data.frame(data)
data2 <- data %>% mutate(num = as.numeric(factor(data)))
data2
Expected outcome:
dat <- data2[, -2]
order <- c(1, 2, 3, 2, 4, 5)
expected_result <- cbind.data.frame(head(dat), order)
expected_result
I think you can just create a new factor and set the levels to the unique values of data2$data in your example:
new_fac <- factor(data2$data, levels = unique(data2$data))
The numeric values can be obtained:
new_order <- as.numeric(new_fac)
And this is what your final result would look like:
head(data.frame(new_fac, new_order))
new_fac new_order
1 k 1
2 m 2
3 1 3
4 m 2
5 4 4
6 d 5
Or in your example with dplyr, you can do:
data %>%
  mutate(num = as.numeric(factor(data, levels = unique(data))))
You could accomplish this with a helper table that contains the row number of the first time each string appears in your table, i.e.:
library(stringi)
library(tidyverse)
# generate data
data<-stri_rand_strings(20,1)
data<-as.data.frame(data)
Create helper table:
factorlevels <- data %>% unique() %>% mutate(order = row_number())
... and inner join it to data:
data %>% inner_join(factorlevels)
Output:
> data %>% inner_join(factorlevels)
Joining, by = "data"
data order
1 k 1
2 m 2
3 1 3
4 m 2
5 4 4
6 d 5
7 v 6
8 i 7
9 v 6
10 H 8
11 Y 9
12 X 10
13 a 11
14 a 11
15 0 12
16 R 13
17 J 14
18 j 15
19 8 16
20 s 17
I am sure that there is a one-liner approach to this, but I could not figure it out right away.
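There is indeed a one-liner: match() against unique() gives the first-appearance order directly. A small sketch, using the same data as above:
library(dplyr)
# match() returns each value's position in unique(data),
# i.e. the order in which the strings first appear
data %>% mutate(num = match(data, unique(data)))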

Repeat (duplicate) just one row twice in R

I'm trying to duplicate just the second row in a dataframe, so that row will appear twice. A dplyr or tidyverse approach would be great. I've tried using slice(), but I can only get it to either duplicate the row I want and remove all the other data, or duplicate all the data, not just the second row.
So I want something like df1 below:
df <- data.frame(t = c(1, 2, 3, 4, 5),
                 r = c(2, 3, 4, 5, 6))
df1 <- data.frame(t = c(1, 2, 2, 3, 4, 5),
                  r = c(2, 3, 3, 4, 5, 6))
Thanks!
Here's also a tidyverse approach with uncount:
library(tidyverse)
df %>%
  mutate(nreps = if_else(row_number() == 2, 2, 1)) %>%
  uncount(nreps)
Basically, the idea is to set the number of times you want each row to occur: here row number 2 (hence row_number() == 2) occurs twice and all the others only once, but you could construct a more complex rule where each row gets a different number of repetitions. Then you uncount this variable (called nreps in the code).
Output:
t r
1 1 2
2 2 3
2.1 2 3
3 3 4
4 4 5
5 5 6
One way with slice would be:
library(dplyr)
df %>% slice(sort(c(row_number(), 2)))
# t r
#1 1 2
#2 2 3
#3 2 3
#4 3 4
#5 4 5
#6 5 6
Also:
df %>% slice(sort(c(seq_len(n()), 2)))
In base R, this can be written as:
df[sort(c(seq(nrow(df)), 2)), ]
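If you need this for arbitrary rows or repeat counts, the base R version generalizes naturally. A small sketch (dup_row, row_to_dup and n_times are illustrative names, not from the answers above):
# duplicate row `row_to_dup` so that it appears `n_times` in total,
# keeping the original row order
dup_row <- function(df, row_to_dup, n_times = 2) {
  df[sort(c(seq_len(nrow(df)), rep(row_to_dup, n_times - 1))), ]
}
dup_row(df, 2)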

Automate filtering to subset data based on multiple columns

Here is a data set I am trying to subset:
df <- data.frame(
  id = c(1:5),
  ax1 = c(5, 3, 7, -1, 9),
  bx1 = c(0, 1, -1, 0, 3),
  cx1 = c(2, 1, 5, -1, 5),
  dx1 = c(3, 7, 2, 1, 8))
The data set has a variable x1 that is measured at different time points, denoted by ax1, bx1, cx1 and dx1. I am trying to subset these data by deleting the rows with -1 in any of these columns (i.e. ax1, bx1, cx1, dx1). I would like to know if there is a way to automate the filtering (or the filter function) to perform this task. I am familiar with situations where the focus is to filter rows based on a single column (or variable).
For the current case, I made an attempt starting with
mutate_at(vars(ends_with("x1")))
to select the required columns, but I am not sure how to combine this with the filter function to produce the desired result. The expected output would have the 3rd and 4th rows deleted. I appreciate any help on this. A similar case was resolved here, but not through an automated process; I want to adapt the automation to the case of large data with many columns.
You can use filter() with across().
library(dplyr)
df %>%
  filter(across(ends_with("x1"), ~ .x != -1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8
It's equivalent to filter_at() with all_vars(), which has been superseded in dplyr 1.0.0.
df %>%
  filter_at(vars(ends_with("x1")), all_vars(. != -1))
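Note that in more recent dplyr releases (if_all() was introduced in 1.0.4, and across() inside filter() was deprecated later on), the same "all columns must pass" condition is written with if_all():
df %>%
  filter(if_all(ends_with("x1"), ~ .x != -1))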
Using base R:
With rowSums
cols <- grep('x1$', names(df))
df[rowSums(df[cols] == -1) == 0, ]
# id ax1 bx1 cx1 dx1
#1 1 5 0 2 3
#2 2 3 1 1 7
#5 5 9 3 5 8
Or with apply:
df[!apply(df[cols] == -1, 1, any), ]
Using filter_at:
library(dplyr)
df %>%
  filter_at(vars(ax1:dx1), ~ . != -1)
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8

Remove duplicate rows based on conditions from multiple columns (decreasing order) in R

I have a 3-columns data.frame (variables: ID.A, ID.B, DISTANCE). I would like to remove the duplicates under a condition: keeping the row with the smallest value in column 3.
It is the same problem as here:
R, conditionally remove duplicate rows
(A similar one: Remove duplicates based on 2nd column condition)
But in my situation there is a second problem: I have to remove rows when the couples (ID.A, ID.B) are duplicated, and not only when ID.A is duplicated.
I tried several things, such as:
df <- ddply(df, 1:3, function(df) return(df[df$DISTANCE == min(df$DISTANCE), ]))
but it didn't work
Example :
This dataset
id.a id.b dist
1 1 1 12
2 1 1 10
3 1 1 8
4 2 1 20
5 1 1 15
6 3 1 16
Should become:
id.a id.b dist
3 1 1 8
4 2 1 20
6 3 1 16
Using dplyr, and a suitable modification of Remove duplicated rows using dplyr:
library(dplyr)
df %>%
  group_by(id.a, id.b) %>%
  arrange(dist) %>% # in each group, arrange in ascending order by distance
  filter(row_number() == 1)
Another way of achieving the solution and retaining all the columns:
df %>%
  arrange(dist) %>%
  distinct(id.a, id.b, .keep_all = TRUE)
# id.a id.b dist
# 1 1 1 8
# 2 3 1 16
# 3 2 1 20
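On dplyr >= 1.0.0, the group-wise minimum can also be taken with slice_min(), which replaces the arrange() plus filter(row_number() == 1) pattern. A brief sketch:
library(dplyr)
df %>%
  group_by(id.a, id.b) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()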

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets.
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat), dat$ID, FUN = sample, 1), ]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too (note that a matrix has no $ and, by default, no rownames, hence the extra index juggling):
dat[as.numeric(tapply(as.character(seq_len(nrow(dat))), dat[, "ID"], FUN = sample, 1)), ]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
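As an aside, sample_n() has been superseded by slice_sample() since dplyr 1.0.0, so on current versions the same idea reads:
df %>% group_by(ID) %>% slice_sample(n = 1)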
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]
