How to arrange my dataframe by number of characters in column? - r

I have a dataset:
id value
1 "include details"
2 "language"
2 "describe what you've tried"
How could I arrange it by the number of characters in the string column value? %>% arrange(value) doesn't work. How can I do that?

I would use the stringr package:
Data:
df <- data.frame(id = c(1,2,3),
value = c("include details","language","describe what you've tried"))
Code:
library(dplyr) # provides %>% and arrange()
library(stringr) # provides str_count()
df %>%
arrange(str_count(value))
Output:
id value
1 2 language
2 1 include details
3 3 describe what you've tried

arrange it by nchar -
library(dplyr)
df %>% arrange(nchar(value))
# id value
#1 2 language
#2 1 include details
#3 2 describe what you've tried
Or in descending order -
df %>% arrange(desc(nchar(value)))
# id value
#1 2 describe what you've tried
#2 1 include details
#3 2 language
Or in base R -
df[order(nchar(df$value)), ]
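As a minimal sketch, assuming the three-row data from the question, the dplyr and base R approaches can be checked against each other. The `decreasing = TRUE` variant is an addition here, mirroring `desc()` in base R; `stringsAsFactors = FALSE` is set so that `nchar()` also works on R versions before 4.0, where strings defaulted to factors.

```r
library(dplyr)

# Sample data from the question (ids as originally posted)
df <- data.frame(
  id = c(1, 2, 2),
  value = c("include details", "language", "describe what you've tried"),
  stringsAsFactors = FALSE
)

# Ascending by string length: dplyr and base R
asc_dplyr <- df %>% arrange(nchar(value))
asc_base  <- df[order(nchar(df$value)), ]

# Descending by string length: dplyr and base R
desc_dplyr <- df %>% arrange(desc(nchar(value)))
desc_base  <- df[order(nchar(df$value), decreasing = TRUE), ]

asc_dplyr$value
desc_base$value
```

The only visible difference is that base R indexing keeps the original row names, while `arrange()` renumbers them.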

Related

Counting number of strings despite multiple elements in one cell

I have a vector A <- c("Tom; Jerry", "Lisa; Marc")
and am trying to identify the number of occurrences of every name.
I already used the code:
sort(table(unlist(strsplit(A, ""))), decreasing = TRUE)
However, this code is only able to create output like this:
Tom; Jerry: 1 - Lisa; Marc: 1
I am looking for a way to count every name, even though two names are present in one cell. Consequently, my preferred result would be:
Tom: 1 Jerry: 1 Lisa: 1 Marc:1
The split should be ; followed by zero or more spaces (\\s*)
sort(table(unlist(strsplit(A, ";\\s*"))), decreasing = TRUE)
-output
Jerry Lisa Marc Tom
1 1 1 1
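To see why the split pattern matters, here is a small base R sketch using the vector from the question. An empty pattern in `strsplit` splits each string into single characters, whereas `";\\s*"` splits on the semicolon and any whitespace after it. The variable names `chars`, `name_list` and `counts` are illustrative only.

```r
A <- c("Tom; Jerry", "Lisa; Marc")

# An empty pattern splits every string into single characters
chars <- unlist(strsplit(A, ""))
head(chars, 3)

# Splitting on ";" plus optional whitespace yields the individual names
name_list <- unlist(strsplit(A, ";\\s*"))
counts <- sort(table(name_list), decreasing = TRUE)
counts
```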
Use separate_rows to split the strings, group_by the names and summarise them:
library(tidyverse)
data.frame(A) %>%
separate_rows(A, sep = "; ") %>%
group_by(A) %>%
summarise(N = n())
# A tibble: 4 × 2
A N
<chr> <int>
1 Jerry 1
2 Lisa 1
3 Marc 1
4 Tom 1

R: Using regular expression to keep rows of data with 6 digits

mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
name = c("Adam", "Jane", "TJ", "Joyce"))
> mydat
id name
1 372303 Adam
2 KN5232 Jane
3 231244 TJ
4 283472-3822 Joyce
In my dataset, I want to keep the rows where id is a 6 digit number. For those that contain a 6 digit number followed by - and a 4 digit number, I just want to keep the first 6.
My final data should look like this:
> mydat2
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
I am using the following grep("^[0-9]{6}$", c("372303", "KN5232", "231244", "283472-3822")) but this does not account for the case where I want to only keep the first 6 digits before the -.
One method would be to split at - and then extract with filter or subset
library(dplyr)
library(tidyr)
library(stringr)
mydat %>%
separate_rows(id, sep = "-") %>%
filter(str_detect(id, '^\\d{6}$'))
-output
# A tibble: 3 × 2
id name
<chr> <chr>
1 372303 Adam
2 231244 TJ
3 283472 Joyce
You can extract the first standalone 6-digit number from each ID and then only keep the items with 6-digit codes only:
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),name = c("Adam", "Jane", "TJ", "Joyce"))
library(stringr)
mydat$id <- str_extract(mydat$id, "\\b\\d{6}\\b")
mydat[grepl("^\\d{6}$",mydat$id),]
Output:
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
The \b\d{6}\b matches 6-digit codes as standalone numbers since \b are word boundaries.
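You could also extract the first 6-digit run with a very simple regex (\\d{6}), convert it to numeric (as I would expect you would anyway) and remove the NAs. Note that str_extract is used here rather than str_extract_all: the latter returns a list, which as.numeric cannot coerce when an id has no match.
E.g.
library(dplyr)
library(stringr)
mydat |>
mutate(id = as.numeric(str_extract(id, "\\d{6}"))) |>
na.omit()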
Output:
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
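A base R variant without extra packages is also possible. This is only a sketch, assuming that valid ids are either exactly six digits or six digits followed by `-` and exactly four digits, as described in the question. It filters first and strips the suffix afterwards.

```r
mydat <- data.frame(
  id = c("372303", "KN5232", "231244", "283472-3822"),
  name = c("Adam", "Jane", "TJ", "Joyce"),
  stringsAsFactors = FALSE
)

# Keep ids that are 6 digits, optionally followed by "-" and 4 digits
keep <- grepl("^[0-9]{6}(-[0-9]{4})?$", mydat$id)
mydat2 <- mydat[keep, ]

# Drop the optional "-dddd" suffix from the ids that were kept
mydat2$id <- sub("-[0-9]{4}$", "", mydat2$id)
mydat2
```

Because the pattern is anchored at both ends, ids with longer digit runs or other suffixes are dropped rather than truncated.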

R - Return column name for row where first given value is found

I am trying to find the first occurrence of a FALSE in a dataframe for each row value. My rows are specific occurrences and the columns are dates. I would like to be able to find the date of first FALSE so that I can use that value to find a return date.
An example structure of my dataframe:
df <- data.frame(ID = c(1,2,3), '2001' = c(TRUE, TRUE, TRUE),
'2002' = c(FALSE, TRUE, FALSE), '2003' = c(TRUE, FALSE, TRUE),
check.names = FALSE) # keep 2001 etc. as column names instead of X2001
I want to end up with a second dataframe or list that contains the ID and the column name that identifies the first instance of a FALSE.
For example :
ID | Date
1 | 2002
2 | 2003
3 | 2002
I do not know the mechanism to find such a result.
The actual dataframe contains a couple thousand rows so I unfortunately can't do it by hand.
I am a new R user so please don't refrain from suggesting things you might expect a more experienced R user to have already thought about.
Thanks in advance
Try this using tidyverse functions. You can reshape the data to long format and then filter for FALSE values. Since an ID can have several FALSE values, the second filter keeps only the first one per ID. Here is the code:
library(dplyr)
library(tidyr)
#Code
newdf <- df %>% pivot_longer(-ID) %>%
group_by(ID) %>%
filter(value==F) %>%
filter(!duplicated(value)) %>% select(-value) %>%
rename(Myname=name)
Output:
# A tibble: 3 x 2
# Groups: ID [3]
ID Myname
<dbl> <chr>
1 1 2002
2 2 2003
3 3 2002
Another option, without duplicated(), is to use row_number() to keep only the first FALSE per ID (row_number()==1):
library(dplyr)
library(tidyr)
#Code 2
newdf <- df %>% pivot_longer(-ID) %>%
group_by(ID) %>%
filter(value==F) %>%
mutate(V=ifelse(row_number()==1,1,0)) %>%
filter(V==1) %>%
select(-c(value,V)) %>% rename(Myname=name)
Output:
# A tibble: 3 x 2
# Groups: ID [3]
ID Myname
<dbl> <chr>
1 1 2002
2 2 2003
3 3 2002
Or using base R with apply() and a generic function:
#Code 3
out <- data.frame(df[,1,drop=F],Res=apply(df[,-1],1,function(x) names(x)[min(which(x==F))]))
Output:
ID Res
1 1 2002
2 2 2003
3 3 2002
We can use max.col with ties.method = 'first' after inverting the logical values.
cbind(df[1], Date = names(df[-1])[max.col(!df[-1], ties.method = 'first')])
# ID Date
#1 1 2002
#2 2 2003
#3 3 2002
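One caveat worth checking on real data: if a row contains no FALSE at all, max.col with ties.method = 'first' still returns column 1, because all inverted values tie. The sketch below is a hedged alternative that returns NA in that case. `first_false` is a hypothetical helper name, and the data frame is built with `check.names = FALSE` so the year column names are kept as written.

```r
# check.names = FALSE keeps "2001" etc. as column names instead of "X2001"
df <- data.frame(ID = c(1, 2, 3),
                 "2001" = c(TRUE, TRUE, TRUE),
                 "2002" = c(FALSE, TRUE, FALSE),
                 "2003" = c(TRUE, FALSE, TRUE),
                 check.names = FALSE)

# Hypothetical helper: name of the first FALSE in a named logical vector,
# or NA when the vector contains no FALSE
first_false <- function(x) {
  idx <- which(!x)
  if (length(idx) == 0) NA_character_ else names(x)[idx[1]]
}

# Apply the helper row by row over the date columns
flags <- as.matrix(df[-1])
result <- data.frame(ID = df$ID, Date = apply(flags, 1, first_false))
result
```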

Adding prefixes to some variables without touching others?

I'd like to produce a data frame like df3 from df1, ie adding a prefix (important_) to variables without one, whilst not touching variables with certain prefixes (gea_, win_, hea_). Thus far I've only managed something like df2 where the important_ variables end up in a separate dataframe, but I'd like all variables in the same data frame. Any thoughts on it would be much appreciated.
What I've got:
library(dplyr)
df1 <- data.frame("hea_income"=c(45000,23465,89522),"gea_property"=c(1,1,2) ,"win_state"=c("AB","CA","GA"), "education"=c(1,2,3), "commute"=c(13,32,1))
df2 <- df1 %>% select(-contains("gea_")) %>% select(-contains("win_")) %>% select(-contains("hea_")) %>% setNames(paste0('important_', names(.)))
What I would like:
df3 <- data.frame("hea_income"=c(45000,23465,89522),"gea_property"=c(1,1,2) ,"win_state"=c("AB","CA","GA"), "important_education"=c(1,2,3), "important_commute"=c(13,32,1))
An option would be rename_at
dfN <- df1 %>%
rename_at(4:5, funs(paste0("important_", .)))
identical(dfN, df3)
#[1] TRUE
We can also use a regex if we want to specify the variables by name rather than by numeric index. Here the assumption is that the columns to rename are exactly those that don't already contain a _
df1 %>%
rename_at(vars(matches("^[^_]*$")), funs(paste0("important_", .)))
# hea_income gea_property win_state important_education important_commute
#1 45000 1 AB 1 13
#2 23465 1 CA 2 32
#3 89522 2 GA 3 1
Or with matches and -
df1 %>%
rename_at(vars(-matches("_")), funs(paste0("important_", .)))
# hea_income gea_property win_state important_education important_commute
#1 45000 1 AB 1 13
#2 23465 1 CA 2 32
#3 89522 2 GA 3 1
All three solutions above give the expected output as shown in the OP's post.
Here's another possibility:
names(df1) <- names(df1) %>% {ifelse(grepl("_",.),.,paste0("important_",.))}
# > df1
# hea_income gea_property win_state important_education important_commute
# 1 45000 1 AB 1 13
# 2 23465 1 CA 2 32
# 3 89522 2 GA 3 1
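In dplyr 1.0.0 and later, rename_at() is superseded and funs() is deprecated, so a version with rename_with() may be preferable. This is a sketch rather than the original answers' method: it excludes only the three prefixes named in the question instead of every name containing an underscore.

```r
library(dplyr)

df1 <- data.frame(hea_income = c(45000, 23465, 89522),
                  gea_property = c(1, 1, 2),
                  win_state = c("AB", "CA", "GA"),
                  education = c(1, 2, 3),
                  commute = c(13, 32, 1))

# Prefix every column whose name does not start with a protected prefix
df3 <- df1 %>%
  rename_with(~ paste0("important_", .x),
              .cols = !matches("^(gea|win|hea)_"))

names(df3)
```

Anchoring the pattern with `^` means a column such as `my_var` would still receive the prefix, which the simpler `matches("_")` test would skip.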

Using regex to extract email address after # in dplyr pipe and then groupby to count occurrences [duplicate]

I have a dataframe with a column called registrant_email. I want to extract the part of each email address after the # symbol, then group by it (e.g. gmail, yahoo, hotmail) and count the occurrences of each.
registrant_email
chamukan#yahoo.com
tmrsons1974#yahoo.com
123ajumohan#gmail.com
123#websiterecovery.org
salesdesk#2techbrothers.com
salesdesk#2techbrothers.com
I can already extract the part after # using the code below:
sub(".*#", "", df$registrant_email)
How can I use it in a dplyr pipe and then count the occurrences of each domain?
tidyr::separate is useful for splitting columns:
library(tidyr)
library(dplyr)
# separate email into `user` and `domain` columns
df %>% separate(registrant_email, into = c('user', 'domain'), sep = '#') %>%
# tally occurrences for each level of `domain`
count(domain)
## # A tibble: 4 x 2
## domain n
## <chr> <int>
## 1 2techbrothers.com 2
## 2 gmail.com 1
## 3 websiterecovery.org 1
## 4 yahoo.com 2
By first splitting into a character matrix and then coercing it to a data.frame, we can use common dplyr idioms:
library(dplyr)
library(stringr)
str_split_fixed(df$registrant_email, pattern = "#", n =2) %>%
data.frame %>% group_by(X2) %>% count(X1)
The result is as follows
X2 X1 n
<fctr> <fctr> <int>
1 2techbrothers.com salesdesk 2
2 gmail.com 123ajumohan 1
3 websiterecovery.org 123 1
4 yahoo.com chamukan 1
5 yahoo.com tmrsons1974 1
If you want to set variable names for better code comprehension, you can use
str_split_fixed(df$registrant_email, pattern = "#", n =2) %>%
data.frame %>% setNames(c("local", "domain")) %>%
group_by(domain) %>% count(local)
We can use base R methods for this
# comment.char = "" stops read.table from treating # as a comment marker
aggregate(V1~V2, read.table(text = df1$registrant_email,
sep="#", comment.char = "", stringsAsFactors=FALSE), FUN = length)
# V2 V1
#1 2techbrothers.com 2
#2 gmail.com 1
#3 websiterecovery.org 1
#4 yahoo.com 2
Or use the OP's method and wrap it with table
as.data.frame(table(sub(".*#", "", df1$registrant_email)))
# Var1 Freq
#1 2techbrothers.com 2
#2 gmail.com 1
#3 websiterecovery.org 1
#4 yahoo.com 2
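The OP's own `sub()` call can also be dropped straight into a dplyr pipe. The following is a minimal sketch, assuming the six addresses shown in the question are stored in a data frame called `df`. The column name `domain` and the object name `domain_counts` are illustrative choices.

```r
library(dplyr)

df <- data.frame(
  registrant_email = c("chamukan#yahoo.com", "tmrsons1974#yahoo.com",
                       "123ajumohan#gmail.com", "123#websiterecovery.org",
                       "salesdesk#2techbrothers.com",
                       "salesdesk#2techbrothers.com"),
  stringsAsFactors = FALSE
)

# Strip everything up to and including the separator, then count each domain
domain_counts <- df %>%
  mutate(domain = sub(".*#", "", registrant_email)) %>%
  count(domain, sort = TRUE)

domain_counts
```

`count(domain)` is shorthand for `group_by(domain) %>% summarise(n = n())`, so no explicit grouping step is needed.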
