Count string length using an external table in R

Suppose you have a table of data:
library(tibble)
df <- tibble(person = c("Alice", "Bob", "Mary"),
             colour = c("Red", "Green", "Blue"),
             city   = c("London", "Paris", "New York"))
# A tibble: 3 x 3
  person colour city
  <chr>  <chr>  <chr>
1 Alice  Red    London
2 Bob    Green  Paris
3 Mary   Blue   New York
And a second table which contains the field names and the maximum string length of each field:
len <- tibble(field_name   = c("person", "colour", "city"),
              field_length = c(12, 4, 6))
# A tibble: 3 x 2
  field_name field_length
  <chr>             <dbl>
1 person               12
2 colour                4
3 city                  6
How can I check, for each field listed in len, whether the strings in the matching column of df are less than or equal to len$field_length characters long, returning the rows which fail the test?
As an example:
Row 1 in df would pass because:
'Alice' is <= 12 characters long,
'Red' is <= 4 characters long and
'London' is <= 6 characters long.
However,
Row 2 would fail because:
'Green' is > 4 characters long, and
Row 3 would fail because:
'New York' is > 6 characters long.
Thus the returned data frame should only contain Rows 2 and 3 of the original df.

A dplyr solution with c_across():
library(dplyr)
df %>%
  rowwise() %>%
  filter(any(nchar(c_across(everything())) > len$field_length)) %>%
  ungroup()
# # A tibble: 2 x 3
#   person colour city
#   <chr>  <chr>  <chr>
# 1 Bob    Green  Paris
# 2 Mary   Blue   New York

Using base R with mapply():
df[rowSums(mapply(function(x, y) nchar(x) > y, df, len$field_length)) > 0, ]
# A tibble: 2 x 3
#   person colour city
#   <chr>  <chr>  <chr>
# 1 Bob    Green  Paris
# 2 Mary   Blue   New York
If the column names in df are not in the same order as len$field_name, use df[len$field_name] inside mapply(), as sketched below.
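A minimal sketch of that reordering variant (assuming every name in len$field_name is a column of df):
# Select/reorder the columns of df to match the order of len$field_name
df_ord <- df[len$field_name]
df[rowSums(mapply(function(x, y) nchar(x) > y, df_ord, len$field_length)) > 0, ]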
In the tidyverse, we can reshape the data to long format, join it with the len data by column name, keep the rows which fail the test, and reshape back to wide format.
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row) %>%
  left_join(len, by = c('name' = 'field_name')) %>%
  group_by(row) %>%
  filter(any(nchar(value) > field_length)) %>%
  dplyr::select(-field_length) %>%
  pivot_wider()

It's easier to solve your problem in terms of two matrices. First, the length of each of your entries:
nchar(as.matrix(df))
     person colour city
[1,]      5      3    6
[2,]      3      5    5
[3,]      4      4    8
And a corresponding matrix of allowed lengths, with one row per row of df and one column per field:
allowed = t(replicate(nrow(df), len$field_length[match(colnames(df), len$field_name)]))
allowed
     [,1] [,2] [,3]
[1,]   12    4    6
[2,]   12    4    6
[3,]   12    4    6
Then do an element-wise comparison and keep only the rows where at least one entry exceeds its allowed length (rowMeans() of the logical matrix greater than zero):
df[rowMeans(nchar(as.matrix(df)) > allowed)>0,]
# A tibble: 2 x 3
  person colour city
  <chr>  <chr>  <chr>
1 Bob    Green  Paris
2 Mary   Blue   New York
If your column order already matches the row order of len, as in your example, you can skip the match() and sweep the allowed lengths across the columns directly (thanks to @zx8754 for pointing it out):
df[rowMeans(sweep(nchar(as.matrix(df)), 2, len$field_length, ">")) > 0, ]
# A tibble: 2 x 3
  person colour city
  <chr>  <chr>  <chr>
1 Bob    Green  Paris
2 Mary   Blue   New York

Pivot df into the same format as len and join the two. After this, it is trivial to compare each string to the field_length.
library(tidyverse)
test_result_df <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = 'field_name') %>%
  left_join(len, by = 'field_name') %>%
  mutate(test_passed = str_length(value) <= field_length) %>%
  group_by(id) %>%
  summarise(all_passed = all(test_passed))

df[!test_result_df$all_passed, ]
# A tibble: 2 x 3
  person colour city
  <chr>  <chr>  <chr>
1 Bob    Green  Paris
2 Mary   Blue   New York

Related

Searching a long dataframe for a string and returning all other strings with matching identifier

I have a long dataset of around 15,000 rows that looks like this:
df <- data.frame("id"   = c(3, 3, 3, 55, 55, 55, 63, 63, 63),
                 "name" = c("house", "home", "apartment", "boat", "ship", "sailboat", "car", "automobile", "truck"))
I am trying to develop a function that searches for a string within the "name" column of this data frame and returns all strings that share its "id".
For example, an input of "house" returns "house", "home", and "apartment", because they all share the same id as "house" (3).
input = "house"
library(dplyr)
df %>%
  group_by(id) %>%
  filter(input %in% name) %>%
  ungroup()
# # A tibble: 3 × 2
#      id name
#   <dbl> <chr>
# 1     3 house
# 2     3 home
# 3     3 apartment
As a function,
foo <- function(data, input) {
  data %>%
    group_by(id) %>%
    filter(input %in% name) %>%
    ungroup()
}
foo(df, "home")
# A tibble: 3 × 2
#      id name
#   <dbl> <chr>
# 1     3 house
# 2     3 home
# 3     3 apartment

How to sum all values of a cell if it corresponds with a specific value in another cell?

I might just be going about it the wrong way, but I'm having trouble pulling all of the female scores and all of the male scores into their own respective data frames.
I don't need any of the exam information, so really I could just get every 'f' and its corresponding score, and every 'm' and its corresponding score, into a data frame.
library(tidyverse)
data <- tribble(~"X", ~"Exam1", ~"X.1",   ~"Exam2", ~"X.2",
                "n",  "Score",  "Gender", "Score",  "Gender",
                "1",  "45",     "m",      "66",     "f",
                "2",  "60",     "f",      "73",     "m")
# Create informative column names
Colnames <- colnames(data) %>% str_c(., dplyr::slice(data, 1) %>% unlist, sep = "_")
# Set column names
data <- data %>%
  setNames(Colnames) %>%
  dplyr::slice(-1)
# Arrange data by exam type by first getting the exam "number"
colnames(data) %>%
  str_extract("\\d|\\d\\d") %>%
  str_subset("\\d") %>%
  unique %>%
  # Split and arrange data by exams
  purrr::map_df(~{
    data %>%
      dplyr::select(matches(str_c("X_n|", .x))) %>%
      dplyr::mutate(Exam = str_c("Exam ", .x)) %>%
      dplyr::rename_all(~c("Serial number", "Exam score", "Gender", "Exam"))
  }) %>%
  # Split data by gender
  dplyr::group_by(Gender) %>%
  dplyr::group_split()
Output:
[[1]]
# A tibble: 2 × 4
  `Serial number` `Exam score` Gender Exam
  <chr>           <chr>        <chr>  <chr>
1 2               60           f      Exam 1
2 1               66           f      Exam 2

[[2]]
# A tibble: 2 × 4
  `Serial number` `Exam score` Gender Exam
  <chr>           <chr>        <chr>  <chr>
1 1               45           m      Exam 1
2 2               73           m      Exam 2

Clustering similar strings based on another column in R

I have a large data frame that shows the distance between strings and their counts.
For example, in row 1 you see the distance between apple and pple, as well as the number of times I have counted apple (counts_col1 = 100) and the number of times I've counted pple (counts_col2 = 2).
library(tidyverse)
df <- tibble(col1        = c("apple", "apple", "pple", "banana", "banana", "bananna"),
             col2        = c("pple", "app", "app", "bananna", "banan", "banan"),
             distance    = c(1, 2, 3, 1, 1, 2),
             counts_col1 = c(100, 100, 2, 200, 200, 2),
             counts_col2 = c(2, 50, 50, 2, 20, 20))
df
#> # A tibble: 6 × 5
#>   col1    col2    distance counts_col1 counts_col2
#>   <chr>   <chr>      <dbl>       <dbl>       <dbl>
#> 1 apple   pple           1         100           2
#> 2 apple   app            2         100          50
#> 3 pple    app            3           2          50
#> 4 banana  bananna        1         200           2
#> 5 banana  banan          1         200          20
#> 6 bananna banan          2           2          20
Created on 2022-03-15 by the reprex package (v2.0.1)
Now I want to cluster the apples and the bananas based on the string that has the maximum number of counts, which is the apple (100) and the banana (200).
I want my data to look something like this:
cluster  elements  sum_counts
apple    apple     152
NA       pple      NA
NA       app       NA
banana   banana    222
NA       bananna   NA
NA       banan     NA
The format of the output does not have to be like this. I am really struggling to break down this problem and cluster the groups.
Any help or comments are really appreciated!
You can try using random walk clustering from igraph:
count_df <- data.table::melt(
  data.table::as.data.table(df),
  measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
  value.name = c("col", "counts")
) %>%
  select(col, counts) %>%
  unique()

df %>%
  igraph::graph_from_data_frame(directed = FALSE) %>%
  igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
  # igraph::components() %>%
  igraph::membership() %>%
  split(names(.), .) %>%
  map_dfr(
    ~ tibble(col = .x) %>%
      semi_join(count_df, ., by = "col") %>%
      arrange(desc(counts)) %>%
      summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
  )
  cluster               elements sum_count
1   apple       apple, app, pple       152
2  banana banana, banan, bananna       222
This works on this toy example, but I think your example is too simple and probably does not reflect your real problem. It might be even easier if you are interested in finding connected components (if two words are connected, they are in the same cluster); then you would need to replace walktrap.community with components, as sketched below.
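A minimal sketch of that connected-components variant (an assumption about what is wanted; it reuses the count_df built above and drops the random-walk weights, since components only needs the edges):
df %>%
  igraph::graph_from_data_frame(directed = FALSE) %>%
  igraph::components() %>%
  purrr::pluck("membership") %>%   # named vector: word -> component id
  split(names(.), .) %>%
  purrr::map_dfr(
    ~ tibble(col = .x) %>%
      semi_join(count_df, ., by = "col") %>%
      arrange(desc(counts)) %>%
      summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
  )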
Here is one approach, where I initially add a group identifier for the sets (I presume you have this in your actual data), and then, after making a longer-format dataset, I group by this id and identify the "word" that has the largest count. I then use an inner join between the initial df and this resulting set of key rows holding the largest-count word, summarize, and rename. I push all the variants into a list column.
df <- df %>% mutate(id = c(1, 1, 1, 2, 2, 2))

df %>%
  inner_join(
    rbind(
      df %>% select(id, distance, col = col1, counts = counts_col1),
      df %>% select(id, distance, col = col2, counts = counts_col2)
    ) %>%
      group_by(id) %>%
      slice_max(counts) %>%
      distinct(col),
    by = c("col1" = "col")
  ) %>%
  group_by(col1) %>%
  summarize(variants = list(c(col2, cur_group()$col1)),
            total = min(counts_col1) + sum(counts_col2)) %>%
  rename_all(~c("cluster", "elements", "sum_counts"))

# A tibble: 2 x 3
  cluster elements  sum_counts
  <chr>   <list>         <dbl>
1 apple   <chr [3]>        152
2 banana  <chr [3]>        222
A similar approach in data.table (which also depends on having that id column):
setDT(df)
df[rbind(
    df[, .(id, col = col1, counts = counts_col1)],
    df[, .(id, col = col2, counts = counts_col2)]
  )[order(-counts), .SD[1], by = id], on = .(col1 = col)][
  , .(elements   = list(c(col2, .BY$cluster)),
      sum_counts = min(counts_col1) + sum(counts_col2)),
  by = .(cluster = col1)]

   cluster              elements sum_counts
    <char>                <list>      <num>
1:  banana  bananna,banan,banana        222
2:   apple        pple,app,apple        152

Finding max values in columns using dplyr and return element from different column

Assuming the below dataframe:
df <- data.frame(a = c("red", "blue", "yellow", "orange"),
                 b = c(1, 4, 5, 7),
                 c = c(2, 7, 4, 1),
                 d = c(4, 3, 8, 1))
Using dplyr, I would like to get the element of a corresponding to the max and min of each of columns b, c and d.
For max, this would return orange, blue and yellow.
I was able to get the index of the max value but couldn't get the value from column a:
df %>% summarise(across(-c(1), ~which.max(.x)))
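A minimal sketch of one way to finish that which.max() idea, indexing into column a inside across() (an assumption about the intended output shape; it returns one row with the a value at each column's maximum):
df %>%
  summarise(across(b:d, ~ a[which.max(.x)]))
# b = orange, c = blue, d = yellow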
Try this. Reshape the data to long format and then group by the name variable, which contains the column names. After that, filter to get the maximum values and thereby identify the corresponding observations in a. Here is the code:
library(tidyverse)
# Code
newdf <- df %>%
  pivot_longer(-a) %>%
  group_by(name) %>%
  filter(value == max(value))
Output:
# A tibble: 3 x 3
# Groups:   name [3]
  a      name  value
  <fct>  <chr> <dbl>
1 blue   c         7
2 yellow d         8
3 orange b         7
We can use slice_max after reshaping to 'long' format
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-a) %>%
  group_by(name) %>%
  slice_max(value)
-output
# A tibble: 3 x 3
# Groups:   name [3]
#  a      name  value
#  <chr>  <chr> <dbl>
#1 orange b         7
#2 blue   c         7
#3 yellow d         8

Is there R code for the following data wrangling and transformation?

I have the following data set:
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015,016,017,018,019,020,021,022,023,024,025,026,027,028,029)
dat1<-data.frame(id,s02)
I would like to create a new data set based on dat1. The code should create n s02 columns automatically, named s02__0, s02__1, s02__2, s02__3 and s02__4 (so n == 5 in my case). Then, based on the id in dat1, it should allocate each s02 value to the respective s02__0 to s02__4 column, with the resulting rows uniquely identified by another id_2 created from the number of rows. If there are fewer s02 values than slots in a row, the remaining cells should be filled with NA. If there are more s02 values than n, another row with an incremented id_2 is created to accommodate the extra s02 values, and every blank cell is again filled with NA.
From the dataset above, I would like to have the following output:
id     <- c(1, 2, 3, 3, 4, 4, 4, 4, 4, 4)
id_2   <- c(1, 1, 1, 2, 1, 2, 3, 4, 5, 6)
s02__0 <- c(1, 1, 1, 6, 1, 6, 11, 16, 21, 26)
s02__1 <- c(2, 2, 2, 7, 2, 7, 12, 17, 22, 27)
s02__2 <- c(3, 3, 3, NA, 3, 8, 13, 18, 23, 28)
s02__3 <- c(4, 4, 4, NA, 4, 9, 14, 19, 24, 29)
s02__4 <- c(NA, 5, 5, NA, 5, 10, 15, 20, 25, NA)
dat2 <- data.frame(id, id_2, s02__0, s02__1, s02__2, s02__3, s02__4)
This can produce what you want (shown here on a reduced dat1 with ids 1 to 3):
library(tidyverse)
# Data
id   <- c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
s02  <- c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007)
dat1 <- data.frame(id, s02)
# Code
dat2 <- dat1 %>%
  group_by(id) %>%
  mutate(id2 = ifelse(s02 <= 5, 1, 2)) %>%
  ungroup() %>%
  group_by(id, id2) %>%
  mutate(val = 1:n() - 1, nid = cur_group_id()) %>%
  ungroup() %>%
  select(-id2) %>%
  mutate(id = paste0(id, '.', nid), val = paste0('s02', '.', val)) %>%
  select(-nid) %>%
  pivot_wider(names_from = c(val), values_from = s02) %>%
  mutate(id = gsub("\\..*", "", id)) %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  select(order(colnames(.)))
dat2
# A tibble: 4 x 7
# Groups:   id [3]
  id      id2 s02.0 s02.1 s02.2 s02.3 s02.4
  <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1         1     1     2     3     4    NA
2 2         1     1     2     3     4     5
3 3         1     1     2     3     4     5
4 3         2     6     7    NA    NA    NA
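For the full 45-row dat1 above (where id 4 has 29 values), a more general sketch is to derive id_2 and the slot name from integer division, then pivot wider; taking n = 5 and the s02__ prefix from the question is my assumption:
library(dplyr)
library(tidyr)

n <- 5  # number of s02__ columns requested
dat2 <- dat1 %>%
  group_by(id) %>%
  mutate(id_2 = (row_number() - 1) %/% n + 1,              # output row within each id
         slot = paste0("s02__", (row_number() - 1) %% n)   # target s02__ column
  ) %>%
  ungroup() %>%
  pivot_wider(names_from = slot, values_from = s02)        # unfilled slots become NA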
