Remove rows below certain row number/condition by group - r

I'm trying to subset a dataframe in R. It contains several categories. The first few rows for each category need to be removed. The number of rows to remove is inconsistent, but there is a row that indicates the cutoff. How do I remove everything above the cutoff (including that row) for each group?
Example data:
category <- c(rep("A", 3), rep("B", 5), rep("C", 4))
info <- as.character(c("Junk", "Border", "Useful",
"This", "is", "Useless", "Border", "Yes please",
"Unwanted", "Row", "Border", "Required"))
example_df <- data.frame(category, info)
example_df$row_number <- 1:nrow(example_df)
I can extract the row numbers of the border and the start of each group:
border_rows <- which(example_df$info == "Border")
library(dplyr)
start_rows <- example_df %>%
group_by(category) %>%
slice(1)
start_rows <- start_rows$row_number
I've tried the following, but this only removes the first two rows (i.e. the ones that need to be removed for group A).
for(i in 1:length(border_rows)) {
new_df <- example_df[-(start_rows[i]:border_rows[i]), ]
}

You can do this with the dplyr package:
library(dplyr)
example_df %>%
group_by(category) %>%
filter(row_number() > which(info == "Border")) %>%
ungroup()
# A tibble: 3 x 2
category info
<fct> <fct>
1 A Useful
2 B Yes please
3 C Required
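For comparison, here is a minimal base R sketch of the same idea, assuming each category contains exactly one "Border" row. It counts, within each category, how many "Border" rows have been seen so far and keeps only the rows that come after it. The names is_border, seen and kept are illustrative and do not come from the answer above.

```r
# Rebuild the example data from the question
category <- c(rep("A", 3), rep("B", 5), rep("C", 4))
info <- c("Junk", "Border", "Useful",
          "This", "is", "Useless", "Border", "Yes please",
          "Unwanted", "Row", "Border", "Required")
example_df <- data.frame(category, info, stringsAsFactors = FALSE)

# TRUE on the cutoff rows
is_border <- example_df$info == "Border"

# Running count of "Border" rows within each category
seen <- ave(as.integer(is_border), example_df$category, FUN = cumsum)

# Keep rows after the cutoff, dropping the cutoff row itself
kept <- example_df[seen > 0 & !is_border, ]
kept
```

If a category could contain more than one "Border" row, this sketch drops every such row, so it would need adjusting for that case.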


Correlate two variables between two dataframe with different sample size

I would like to make a correlation between two dataframes.
Let me show an example:
Here are my two dataframes
set.seed(123)
v1<- c(rep("a", 4), rep("b", 3))
v2<- c(rnorm(4), rnorm(3))
df1 <- data.frame(v1, v2)
v1<- c(rep("a", 3), rep("b", 5))
v2<- c(rnorm(3), rnorm(5))
df2 <- data.frame(v1, v2)
I would like to correlate "a" modalities between df1 and df2 and do the same for "b" and so on. I could do it manually but I have hundreds of different modalities in my variable v1.
Another issue is sometimes I have different sizes between both dataframes.
For instance, in df1 we have "a" repeated four times while in df2 it is repeated three times.
I would always like to correlate using the smaller of the two group sizes and discard the extra rows from the dataframe that has more, as in a related Stack Overflow example.
Let me know if you need more details
To calculate the correlation between the "a" modalities in df1 and df2, you can first subset the data for each dataframe by the values in the v1 column:
df1_a <- df1[df1$v1 == "a", ]
df2_a <- df2[df2$v1 == "a", ]
Then, you can calculate the correlation using the cor() function, which calculates the Pearson correlation coefficient between two variables:
cor(df1_a$v2, df2_a$v2)
If the sizes of the data frames are different, you can take the minimum of the two sizes and use that to subset the data before calculating the correlation:
n <- min(nrow(df1_a), nrow(df2_a))
cor(df1_a$v2[1:n], df2_a$v2[1:n])
To do this for all modalities in the v1 column, you can use a for loop to iterate over all unique values in the v1 column and calculate the correlations for each modality:
modalities <- unique(df1$v1)
correlations <- list()
for (modality in modalities) {
df1_subset <- df1[df1$v1 == modality, ]
df2_subset <- df2[df2$v1 == modality, ]
n <- min(nrow(df1_subset), nrow(df2_subset))
correlations[[modality]] <- cor(df1_subset$v2[1:n], df2_subset$v2[1:n])
}
correlations
This will create a list of correlations for each modality, with the modality names as the names of the list elements.
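The loop above can also be wrapped in a small function. This is only a sketch: cor_by_modality is a hypothetical helper name, and two choices in it are assumptions rather than part of the answer, namely that only modalities present in both data frames are used, and that NA is returned when fewer than two paired values are available.

```r
set.seed(123)
df1 <- data.frame(v1 = c(rep("a", 4), rep("b", 3)), v2 = rnorm(7))
df2 <- data.frame(v1 = c(rep("a", 3), rep("b", 5)), v2 = rnorm(8))

# Hypothetical helper: per-modality correlation, truncated to the smaller group
cor_by_modality <- function(x, y, key = "v1", value = "v2") {
  # Only modalities that occur in both data frames (assumption)
  shared <- intersect(unique(as.character(x[[key]])),
                      unique(as.character(y[[key]])))
  vapply(shared, function(m) {
    a <- x[[value]][x[[key]] == m]
    b <- y[[value]][y[[key]] == m]
    n <- min(length(a), length(b))
    # A correlation needs at least two pairs (assumption: return NA otherwise)
    if (n < 2) return(NA_real_)
    cor(a[seq_len(n)], b[seq_len(n)])
  }, numeric(1))
}

correlations <- cor_by_modality(df1, df2)
correlations
```

The result is a named numeric vector rather than a list, which tends to be easier to turn into a data frame when there are hundreds of modalities.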
I hope this helps!
This is one approach using dplyr
First, tag each data frame with a group id (grp) and bind the two together.
Then get each group's size and the row number within the group, so that rows beyond the smaller size can be filtered out.
Finally, compute the correlation for the matching groups.
library(dplyr)
bind_rows(df1 %>% mutate(grp = 1), df2 %>% mutate(grp = 2)) %>%
group_by(v1, grp) %>%
mutate(size = n(), rown = row_number()) %>%
group_by(v1) %>%
mutate(minsize = min(size)) %>%
filter(rown <= minsize) %>%
summarize(corr = cor(v2[grp == 1], v2[grp == 2]))
# A tibble: 2 × 2
v1 corr
<chr> <dbl>
1 a 0.819
2 b -0.693

How to sum values of cells by their column and row names to create a matrix in R?

I have a data matrix where row and column names are similar. However, row names are repeated. I want to sum cell values by each unique combination of row name and column name (for example, the sum of all cells with dimension Row1 * Col1) and create a matrix in R. The new matrix will contain the sum of all cells for each unique combination of row and column names. Thank you
Example Dataset:
You can do it like this; see the comments in the code as well. You also need the packages that G. Grothendieck listed in their answer.
df <- data.frame(Row = c("r_1", "r_2",etc.),
Col = c("Col1", "Col2",etc.),
Value = c(1, 2, etc.))
row_names <- unique(df$Row)
col_names <- unique(df$Col)
#this will give you all possible combinations
combinations <- expand.grid(Row = row_names, Col = col_names)
result <- df %>% group_by(Row, Col) %>% summarize(Value = sum(Value))
result <- left_join(combinations, result, by = c("Row" = "Row", "Col" = "Col"))
result[is.na(result)] <- 0
names(result)[1] <- "R" # choose the name that suits you best
names(result)[2] <- "C" # choose the name that suits you best
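If the end goal is a Row-by-Col matrix rather than a long table, base R's xtabs() can sum and reshape in one step. A minimal sketch, with made-up values standing in for the elided example data (the rows of df below are assumptions, not the asker's data):

```r
# Illustrative long-format data; the values are invented for this sketch
df <- data.frame(Row   = c("r_1", "r_2", "r_1", "r_2"),
                 Col   = c("Col1", "Col1", "Col1", "Col2"),
                 Value = c(1, 2, 3, 4))

# Sum Value over every Row/Col combination; missing combinations become 0
mat <- xtabs(Value ~ Row + Col, data = df)
mat
```

The result is a contingency table that can be indexed like a matrix, for example mat["r_1", "Col1"].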
Here is a full tidyverse solution in a single pipe:
library(dplyr)
library(tidyr)
rn <- c("name1", "name2", "name3",
"name1", "name2", "name3") # necessary as duplicate names are not allowed in data.frame
# and will be dropped at type cast
matrix(1:18, ncol=3) %>%
`colnames<-`(c("name1", "name2", "name3")) %>%
`rownames<-`(rn) %>% # sample matrix, like the one in your question
data.frame() %>%
mutate(rn = rn, .before = 1) %>% # restore the row names, saved in a variable before the type cast
pivot_longer(-rn) %>%
group_by(rn, name) %>%
summarise(value=sum(value)) %>%
pivot_wider()
It will result in:
# A tibble: 3 x 4
# Groups: rn [3]
rn name1 name2 name3
<chr> <int> <int> <int>
1 name1 5 17 29
2 name2 7 19 31
3 name3 9 21 33
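When the data is already a numeric matrix and only the row names repeat, base R's rowsum() may be a shorter route. A minimal sketch using the same sample matrix as above (the object names m and summed are illustrative):

```r
# Same 6 x 3 sample matrix, with each row name appearing twice
m <- matrix(1:18, ncol = 3,
            dimnames = list(rep(c("name1", "name2", "name3"), 2),
                            c("name1", "name2", "name3")))

# Add up all rows that share a row name; the result stays a matrix
summed <- rowsum(m, group = rownames(m))
summed
```

If column names repeated as well, the same call could be applied to the transposed result, but that case is not covered here.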

Create columns based on other columns names R

I need to operate on columns based on a condition on their names. In the following reproducible example, for each column whose name ends with 'x', I create a new column that multiplies that variable by 2:
library(dplyr)
set.seed(8)
id <- seq(1,700, by = 1)
a1_x <- runif(700, 0, 10)
a1_y <- runif(700, 0, 10)
a2_x <- runif(700, 0, 10)
df <- data.frame(id, a1_x, a1_y, a2_x)
#Create variables manually: For every column that ends with X, I need to create one column that multiplies the respective column by 2
df <- df %>%
mutate(a1_x_new = a1_x*2,
a2_x_new = a2_x*2)
Since I'm working with several columns, I need to automate this process. Does anybody know how to achieve this? Thanks in advance!
Try this:
df %>% mutate(
across(ends_with("x"), ~ .x*2, .names = "{.col}_new")
)
Thanks @RicardoVillalba for the correction.
You could use transmute and across to generate the new columns for those column names ending in "x". Then, use rename_with to add the "_new" suffix and bind_cols back to the original data frame.
library(dplyr)
df <- df %>%
transmute(across(ends_with("x"), ~ . * 2)) %>%
rename_with(., ~ paste0(.x, "_new")) %>%
bind_cols(df, .)
Result:
head(df)
id a1_x a1_y a2_x a1_x_new a2_x_new
1 1 4.662952 0.4152313 8.706219 9.325905 17.412438
2 2 2.078233 1.4834044 3.317145 4.156466 6.634290
3 3 7.996580 1.4035441 4.834126 15.993159 9.668252
4 4 6.518713 7.0844794 8.457379 13.037426 16.914759
5 5 3.215092 3.5578827 8.196574 6.430184 16.393149
6 6 7.189275 5.2277208 3.712805 14.378550 7.425611
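The same rule can also be written in base R without across(). This is a minimal sketch on a 5-row version of the data (the row count and the name x_cols are choices made for illustration only):

```r
set.seed(8)
df <- data.frame(id   = 1:5,
                 a1_x = runif(5, 0, 10),
                 a1_y = runif(5, 0, 10),
                 a2_x = runif(5, 0, 10))

# Names of all columns ending in "x"
x_cols <- grep("x$", names(df), value = TRUE)

# Create one "<name>_new" column per match, holding twice the original value
df[paste0(x_cols, "_new")] <- df[x_cols] * 2
names(df)
```

Because the column names are computed, the same two lines keep working when more columns ending in "x" are added.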

Rowwise find most frequent term in dataframe column and count occurrences

I am trying to find the most frequent category within every row of a dataframe. A category can consist of multiple words separated by a /.
library(tidyverse)
library(DescTools)
# example data
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
"apple,NA,apple,chocolate",
"shoes/socks,NA,NA,NA",
"apple,apple,chocolate,chocolate")
df <- data.frame(id, categories)
# the solution I would like to achieve
solution <- df %>%
mutate(winner = c("apple", "apple", "shoes/socks", "apple"),
winner_count = c(1, 2, 1, 2))
Based on the answers to "Write a function that finds the most common word in a string of text using R", I have tried the following:
trial <- df %>%
rowwise() %>%
mutate(winner = names(which.max(table(categories %>% str_split(",")))),
winner_count = which.max(table(categories %>% str_split(",")))[[1]])
I also tried to follow the approach from "How to find the most repeated word in a vector with R", but it does not give me the required results either:
trial2 <- df %>%
mutate(winner = DescTools::Mode(str_split(categories, ","), na.rm = T))
I am mainly struggling because my most frequent category is not just one word but something like "shoes/socks" and the fact that I also have NAs. I don't want the NAs to be the "winner".
I don't care too much about the ties right now. I already have a follow up process in place where I handle the cases that have winner_count = 2.
Split the categories on commas into separate rows, count the occurrences of each category per id, drop the NA values, and select the top-occurring row for each id:
library(dplyr)
library(tidyr)
df %>%
separate_rows(categories, sep = ',') %>%
count(id, categories, name = 'winner_count') %>%
filter(categories != 'NA') %>%
group_by(id) %>%
slice_max(winner_count, n = 1, with_ties = FALSE) %>%
ungroup() %>%
rename(winner = categories) %>%
left_join(df, by = 'id') -> result
result
# id winner winner_count categories
# <dbl> <chr> <int> <chr>
#1 1 apple 1 apple,shoes/socks,trousers/jeans,chocolate
#2 2 apple 2 apple,NA,apple,chocolate
#3 3 shoes/socks 1 shoes/socks,NA,NA,NA
#4 4 apple 2 apple,apple,chocolate,chocolate
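A base R sketch of the same steps, working on one string at a time, may make the logic easier to follow. top_category is a hypothetical helper name, and returning NA when a row contains only "NA" entries is an assumption, since the question does not cover that case. Ties go to the alphabetically first category because table() sorts its names.

```r
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
                "apple,NA,apple,chocolate",
                "shoes/socks,NA,NA,NA",
                "apple,apple,chocolate,chocolate")
df <- data.frame(id, categories, stringsAsFactors = FALSE)

# Hypothetical helper: most frequent comma-separated entry in one string
top_category <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  parts <- parts[parts != "NA"]          # the NAs here are literal "NA" strings
  if (length(parts) == 0) {
    return(list(winner = NA_character_, count = NA_integer_))
  }
  counts <- table(parts)
  list(winner = names(counts)[which.max(counts)],
       count  = as.integer(max(counts)))
}

winners <- lapply(df$categories, top_category)
df$winner <- vapply(winners, function(w) w$winner, character(1))
df$winner_count <- vapply(winners, function(w) w$count, integer(1))
df[c("id", "winner", "winner_count")]
```

Splitting only on the comma leaves "shoes/socks" intact, which is why the slash inside a category causes no trouble.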

Subset rows with first observation after a given occurrence

I am trying to accomplish the following:
1. Group data by id.
2. Remove any rows after '3' occurs.
3. Find the closest '1', '2' or NA that precedes '3' and only keep that row.
My data:
data <- data.frame(
id=c(1,1,1,1,1, 2,2,2,2, 3,3,3),
a=c(NA,1,2,3,3, NA,3,2,3, 1,5,3))
Desired output:
desired <- data.frame(
id=c(1,2,3), a=c(2,NA,1))
For steps 1-2, I have tried:
data %>% group_by(id) %>% slice(if(first(a) == 3))
but that seems quite off.
Thank you.
This breaks the problem into separate steps:
library(dplyr)
data %>%
group_by(id) %>%
filter(row_number()<first(which(a==3))) %>% # drop things past a 3
filter(a %in% c(1,2,NA)) %>% # only keep 1,2 or NA
filter(row_number()==n()) # choose the last row in each group
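The three filters can be mirrored in base R with split() and lapply(). This is a sketch rather than a replacement for the pipeline above: pick_row is a hypothetical helper name, and a group with no 3 at all returns zero rows here, which matches what the dplyr version would do but is not something the question specifies.

```r
data <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  a  = c(NA, 1, 2, 3, 3, NA, 3, 2, 3, 1, 5, 3))

# Hypothetical helper: process the rows of a single id
pick_row <- function(g) {
  first3 <- which(g$a == 3)[1]                 # which() ignores NA comparisons
  if (is.na(first3)) return(g[0, , drop = FALSE])
  before <- g[seq_len(first3 - 1), , drop = FALSE]            # rows before the first 3
  ok <- before[before$a %in% c(1, 2, NA), , drop = FALSE]     # keep 1, 2 or NA
  ok[nrow(ok), , drop = FALSE]                                # last remaining row
}

result <- do.call(rbind, lapply(split(data, data$id), pick_row))
rownames(result) <- NULL
result
```

Note that %in% is used instead of == so that NA values count as a match; a comparison with == would return NA for those rows and they would be dropped.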
