I have a long dataset of around 15,000 rows that looks like this:
df <- data.frame("id" = c(3,3,3,55,55,55,63,63,63), "name" = c("house","home","apartment","boat","ship","sailboat","car","automobile","truck"))
I am trying to develop a function that searches for a string within the "name" column of this data frame and returns all strings that share the same "id".
For example, an input of "house" returns "house", "home", "apartment" because they all have the same id as "house" (3).
input = "house"
library(dplyr)
df %>%
group_by(id) %>%
filter(input %in% name) %>%
ungroup()
# # A tibble: 3 × 2
# id name
# <dbl> <chr>
# 1 3 house
# 2 3 home
# 3 3 apartment
As a function:
foo = function(data, input) {
data %>%
group_by(id) %>%
filter(input %in% name) %>%
ungroup()
}
foo(df, "home")
# A tibble: 3 × 2
# id name
# <dbl> <chr>
# 1 3 house
# 2 3 home
# 3 3 apartment
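For comparison, here is a minimal base R sketch of the same lookup, in case loading dplyr is not an option. The helper name `names_with_same_id` is made up for illustration, and it assumes an exact match on `name` rather than a partial or case-insensitive search.

```r
df <- data.frame(
  id = c(3, 3, 3, 55, 55, 55, 63, 63, 63),
  name = c("house", "home", "apartment", "boat", "ship", "sailboat",
           "car", "automobile", "truck"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every name that shares an id with `input`
names_with_same_id <- function(data, input) {
  # ids of all rows whose name equals the input exactly
  ids <- unique(data$id[data$name == input])
  # all names belonging to those ids
  data$name[data$id %in% ids]
}

names_with_same_id(df, "house")
names_with_same_id(df, "zeppelin")  # no match gives an empty character vector
```

An input that does not occur in the data returns `character(0)` rather than an error, which may or may not be what you want on the full 15,000 rows.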
I might just be going about it the wrong way, but I'm having trouble pulling all of the female scores and all of the male scores into their own respective data frames.
I don't need any of the exam information, so really I could just get every 'f' and its corresponding score and every 'm' and its corresponding score into a data frame.
library(tidyverse)
data <- tribble(~"X",~"Exam1",~"X.1",~"Exam2",~"X.2",
"n","Score","Gender","Score","Gender",
"1","45","m","66","f",
"2","60","f","73","m")
# Create informative column names
Colnames <- colnames(data) %>% str_c(.,dplyr::slice(data,1) %>% unlist,sep = "_")
# Set column names
data <- data %>%
setNames(Colnames) %>%
dplyr::slice(-1)
# Arrange data by exam type by first getting exam "number"
colnames(data) %>%
str_extract("\\d+") %>%
str_subset("\\d") %>%
unique %>%
# Split and arrange data by exams
purrr::map_df(~{
data %>%
dplyr::select(matches(str_c("X_n|",.x))) %>%
dplyr::mutate(Exam = str_c("Exam ",.x)) %>%
dplyr::rename_all(~c("Serial number","Exam score","Gender","Exam"))
}) %>%
# Split data by gender
dplyr::group_by(Gender) %>%
dplyr::group_split()
Output:
[[1]]
# A tibble: 2 × 4
`Serial number` `Exam score` Gender Exam
<chr> <chr> <chr> <chr>
1 2 60 f Exam 1
2 1 66 f Exam 2
[[2]]
# A tibble: 2 × 4
`Serial number` `Exam score` Gender Exam
<chr> <chr> <chr> <chr>
1 1 45 m Exam 1
2 2 73 m Exam 2
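As an alternative to the approach above, here is a rough sketch using `pivot_longer()` with the `.value` sentinel. It assumes the columns have already been renamed to the pattern `Exam<k>_Score` / `Exam<k>_Gender` (these names and the object `exams` are chosen for illustration, not taken from the original data) and that scores are numeric.

```r
library(dplyr)
library(tidyr)

# Assumed cleaned layout: one Score/Gender column pair per exam
exams <- tibble::tribble(
  ~n, ~Exam1_Score, ~Exam1_Gender, ~Exam2_Score, ~Exam2_Gender,
   1,           45,           "m",           66,           "f",
   2,           60,           "f",           73,           "m"
)

# One row per (n, Exam); ".value" turns the suffixes into Score and Gender columns
long <- exams %>%
  pivot_longer(-n, names_to = c("Exam", ".value"), names_sep = "_")

# Named list with one data frame per gender
by_gender <- split(long, long$Gender)
by_gender$f
by_gender$m
```

`split()` returns a named list, so the female and male tables can be accessed as `by_gender$f` and `by_gender$m` instead of by position.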
I have a large data frame that shows the distance between strings and their counts.
For example, in row 1, you see the distance between apple and pple as well as the number of times I have counted apple (counts_col1 = 100) and the number of times I've counted pple (counts_col2 = 2).
library(tidyverse)
df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
col2 = c("pple","app","app", "bananna", "banan", "banan"),
distance = c(1,2,3,1,1,2),
counts_col1 = c(100,100,2,200,200,2),
counts_col2 = c(2,50,50,2,20,20))
df
#> # A tibble: 6 × 5
#> col1 col2 distance counts_col1 counts_col2
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 1 100 2
#> 2 apple app 2 100 50
#> 3 pple app 3 2 50
#> 4 banana bananna 1 200 2
#> 5 banana banan 1 200 20
#> 6 bananna banan 2 2 20
Created on 2022-03-15 by the reprex package (v2.0.1)
Now I want to cluster the apples and the bananas based on the string that has the maximum number of counts, which is the apple (100) and the banana (200).
I want my data to look something like this:
cluster elements sum_counts
apple apple 152
NA pple NA
NA app NA
banana banana 222
NA bananna NA
NA banan NA
The format of the output does not have to be like this. I am really struggling to break down this problem and cluster the groups.
Any help or comments are really appreciated!
You can try using random walk clustering from igraph:
count_df <- data.table::melt(
data.table::as.data.table(df),
measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
value.name = c("col", "counts")
) %>%
select(col, counts) %>%
unique()
df %>%
igraph::graph_from_data_frame(directed = FALSE) %>%
igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
# igraph::components() %>%
igraph::membership() %>%
split(names(.), .) %>%
map_dfr(
~tibble(col = .x) %>%
semi_join(count_df, ., by = "col") %>%
arrange(desc(counts)) %>%
summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
)
cluster elements sum_count
1 apple apple, app, pple 152
2 banana banana, banan, bananna 222
This works on this toy example, but I think your example is too simple and probably does not reflect your main problem. It might be even easier if you are only interested in finding connected components (if two words are connected, they are in the same cluster). In that case, you would replace walktrap.community with components.
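The connected-components variant could look roughly like the sketch below. It is only an illustration under the assumption that any edge between two words means "same cluster", so the distance column is ignored; the object names `g`, `comp`, and `clusters` are made up here.

```r
library(dplyr)

df <- tibble(
  col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
  col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
  distance = c(1, 2, 3, 1, 1, 2),
  counts_col1 = c(100, 100, 2, 200, 200, 2),
  counts_col2 = c(2, 50, 50, 2, 20, 20)
)

# One row per distinct word with its count
count_df <- bind_rows(
  select(df, col = col1, counts = counts_col1),
  select(df, col = col2, counts = counts_col2)
) %>%
  distinct()

# The first two columns of df are used as the edge list
g <- igraph::graph_from_data_frame(df, directed = FALSE)
comp <- igraph::components(g)

clusters <- tibble(
  col = igraph::V(g)$name,
  component = as.integer(comp$membership)
) %>%
  left_join(count_df, by = "col") %>%
  group_by(component) %>%
  arrange(desc(counts), .by_group = TRUE) %>%
  summarise(cluster = first(col),        # most frequent word names the cluster
            elements = list(col),
            sum_counts = sum(counts))
clusters
```

Unlike walktrap, this never splits a connected group, so a single spurious edge between two word families would merge them into one cluster.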
Here is one approach. I first add a group identifier for the sets (I presume you have this in your actual data). After reshaping to a longer dataset, I group by this id and identify the "word" that has the largest count. I then inner join the initial df with this set of key rows, summarize, and rename. All the variants are pushed into a list column.
df <- df %>% mutate(id=c(1,1,1,2,2,2))
df %>% inner_join(
rbind(
df %>% select(id,distance,col=col1, counts=counts_col1),
df %>% select(id,distance,col=col2, counts=counts_col2)
) %>%
group_by(id) %>%
slice_max(counts) %>%
distinct(col),
by=c("col1"="col")
) %>%
group_by(col1) %>%
summarize(variants = list(c(col2, cur_group()$col1)),
total = min(counts_col1) + sum(counts_col2)) %>%
rename_all(~c("cluster", "elements", "sum_counts"))
# A tibble: 2 x 3
cluster elements sum_counts
<chr> <list> <dbl>
1 apple <chr [3]> 152
2 banana <chr [3]> 222
A similar approach in data.table (this also depends on having that id column):
library(data.table)
setDT(df)
df[rbind(
df[,.(id,col=col1,counts=counts_col1)],
df[,.(id,col=col2,counts=counts_col2)]
)[order(-counts),.SD[1], by=id],on=.(col1=col)][
, .(elements=list(c(col2,.BY$cluster)),
sum_counts = min(counts_col1) + sum(counts_col2)),
by=.(cluster=col1)]
cluster elements sum_counts
<char> <list> <num>
1: banana bananna,banan,banana 222
2: apple pple,app,apple 152
Assuming the below dataframe:
df<-data.frame(a=c("red", "blue", "yellow", "orange"), b=c(1,4,5,7), c=c(2,7,4,1), d=c(4,3,8,1))
Using dplyr, I would like to get the value of column a corresponding to each max and min in columns b, c and d.
For max, this would return orange, blue and yellow.
I was able to get the index of the max value but couldn't get the value from column a:
df %>% summarise(across(-c(1), ~which.max(.x)))
Try this. Reshape the data to long format, group by the name variable (which holds the original column names), and then filter each group to keep the row with the maximum value. Here is the code:
library(tidyverse)
#Code
newdf <- df %>% pivot_longer(-a) %>% group_by(name) %>% filter(value==max(value))
Output:
# A tibble: 3 x 3
# Groups: name [3]
a name value
<fct> <chr> <dbl>
1 blue c 7
2 yellow d 8
3 orange b 7
We can use slice_max after reshaping to 'long' format:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-a) %>%
group_by(name) %>%
slice_max(value)
Output:
# A tibble: 3 x 3
# Groups: name [3]
# a name value
# <chr> <chr> <dbl>
#1 orange b 7
#2 blue c 7
#3 yellow d 8
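Since the question also asks about the minimum, here is a small sketch that stays close to the original `which.max` attempt and returns both in one call. It relies on `across()` accepting a named list of functions; the result is a single row with columns suffixed `_max` and `_min`, and ties resolve to the first matching row.

```r
library(dplyr)

df <- data.frame(
  a = c("red", "blue", "yellow", "orange"),
  b = c(1, 4, 5, 7),
  c = c(2, 7, 4, 1),
  d = c(4, 3, 8, 1),
  stringsAsFactors = FALSE
)

# For every column except a, look up the value of a at the max and min position
res <- df %>%
  summarise(across(-a, list(max = ~ a[which.max(.x)],
                            min = ~ a[which.min(.x)])))
res
#   b_max b_min c_max  c_min  d_max  d_min
#  orange   red  blue orange yellow orange
```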
I have the following data set:
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015,016,017,018,019,020,021,022,023,024,025,026,027,028,029)
dat1<-data.frame(id,s02)
I would like to build a new data set from dat1. The code should automatically create n columns named s02__0, s02__1, s02__2, s02__3, s02__4 (here n = 5) and, for each id in dat1, place that id's s02 values into these columns in order. Each resulting row is identified by a second key, id_2, which numbers the rows within an id. If an id has fewer s02 values than there are cells in its row, the remaining cells should be filled with ##N/A##. If an id has more than n values, a new row with id_2 incremented by one should hold the extra values, and any blank cells in it are again filled with ##N/A##.
From the dataset above, I would like the following output:
id<-c(1,2,3,3,4,4,4,4,4,4)
id_2<-c(1,1,1,2,1,2,3,4,5,6)
s02__0<-c(1,1,1,6,1,6,11,16,21,26)
s02__1<-c(2,2,2,7,2,7,12,17,22,27)
s02__2<-c(3,3,3,"##N/A##",3,8,13,18,23,28)
s02__3<-c(4,4,4,"##N/A##",4,9,14,19,24,29)
s02__4<-c("##N/A##",5,5,"##N/A##",5,10,15,20,25,"##N/A##")
dat2<-data.frame(id,id_2,s02__0,s02__1,s02__2,s02__3,s02__4)
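This can produce what you want for the first three ids. Note that the id2 step splits on s02<=5, so it assumes s02 runs 1, 2, ... within each id and yields at most two rows per id: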
library(tidyverse)
#Data
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007)
dat1<-data.frame(id,s02)
#Code
dat2 <- dat1 %>% group_by(id) %>% mutate(id2 = ifelse(s02<=5,1,2)) %>% ungroup() %>%
group_by(id,id2) %>% mutate(val=1:n()-1,nid = cur_group_id()) %>% ungroup() %>%
select(-id2) %>% mutate(id=paste0(id,'.',nid),val=paste0('s02','.',val)) %>% select(-nid) %>%
pivot_wider(names_from = c(val),values_from = s02) %>%
mutate(id=gsub("\\..*","", id)) %>% group_by(id) %>%
mutate(id2=1:n()) %>% select(order(colnames(.)))
dat2
# A tibble: 4 x 7
# Groups: id [3]
id id2 s02.0 s02.1 s02.2 s02.3 s02.4
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 2 3 4 NA
2 2 1 1 2 3 4 5
3 3 1 1 2 3 4 5
4 3 2 6 7 NA NA NA
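To cover ids with more than two rows (such as id 4 in the question), one option is to derive the row and column position from the within-id row number instead of hard-coding the split. This is only a sketch: `n`, `pos`, and `col` are names introduced here, the `"##N/A##"` fill forces the value columns to character, and it assumes the rows of dat1 are already in the desired order within each id.

```r
library(dplyr)
library(tidyr)

# Full data from the question, written compactly
id  <- c(rep(1, 4), rep(2, 5), rep(3, 7), rep(4, 29))
s02 <- c(1:4, 1:5, 1:7, 1:29)
dat1 <- data.frame(id, s02)

n <- 5  # number of s02__ columns per output row

dat2 <- dat1 %>%
  group_by(id) %>%
  mutate(pos  = row_number() - 1,           # 0-based position within id
         id_2 = pos %/% n + 1,              # which output row the value lands in
         col  = paste0("s02__", pos %% n)) %>%  # which column it lands in
  ungroup() %>%
  select(-pos) %>%
  pivot_wider(names_from = col, values_from = s02) %>%
  # pivot_wider leaves NA in empty cells; swap in the requested placeholder
  mutate(across(starts_with("s02__"),
                ~ replace_na(as.character(.x), "##N/A##")))
dat2
```

Changing `n` changes the number of s02__ columns without touching the rest of the pipeline.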