How to count unique values over multiple columns using R?


Let's say I have the following df:
1              2                    3
home, work     work, home           home, work
leisure, work  work, home, leisure  work, home
home, leisure  work, home           home, work
I want to count all unique values over the entire data.frame (not by column or by row; I'm interested in the cell values).
So the output should look like this:
                     freq
home, work           3
leisure, work        1
home, leisure        1
work, home           3
work, home, leisure  1
I have not found a way to do that. The count() function seems to work only with single columns.
Thank you very much for the help:)

You could unlist the data frame and use table to get the counts in base R:
stack(table(unlist(df)))
#Same as
#stack(table(as.matrix(df)))
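The base R idea above can be sketched as a self-contained example. The data frame mirrors the sample data used in this answer; the names counts and value are illustrative choices, not part of the original code, and the sorting step is an optional addition.

```r
# Sample data, matching the structure used in this answer
df <- data.frame(
  X1 = c("home,work", "leisure,work", "home,leisure"),
  X2 = c("work,home", "work,home,leisure", "work,home"),
  X3 = c("home,work", "work,home", "home,work"),
  stringsAsFactors = FALSE
)

# Flatten every cell into one vector, then tabulate the distinct values
counts <- as.data.frame(table(value = unlist(df)), stringsAsFactors = FALSE)

# Optional: most frequent values first, ties broken alphabetically
counts <- counts[order(-counts$Freq, counts$value), ]
counts
```

Because table() compares whole strings, "home,work" and "work,home" stay separate, which is what the question asks for.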
If you prefer the tidyverse, reshape the data to long format with pivot_longer and then use count:
df %>%
tidyr::pivot_longer(cols = everything()) %>%
dplyr::count(value)
# A tibble: 5 x 2
# value n
# <chr> <int>
#1 home,leisure 1
#2 home,work 3
#3 leisure,work 1
#4 work,home 3
#5 work,home,leisure 1
data
df <- structure(list(X1 = c("home,work", "leisure,work", "home,leisure"
), X2 = c("work,home", "work,home,leisure", "work,home"), X3 = c("home,work",
"work,home", "home,work")), class = "data.frame", row.names = c(NA, -3L))

With tidyverse, we can use gather (now superseded by pivot_longer, but still available in tidyr):
library(dplyr)
library(tidyr)
df %>%
gather %>%
count(value)
# value n
#1 home,leisure 1
#2 home,work 3
#3 leisure,work 1
#4 work,home 3
#5 work,home,leisure 1
data
df <- structure(list(X1 = c("home,work", "leisure,work", "home,leisure"
), X2 = c("work,home", "work,home,leisure", "work,home"), X3 = c("home,work",
"work,home", "home,work")), class = "data.frame", row.names = c(NA, -3L))

Related

How to identify the text that are in common between sentences?

I would like to find the text or string that appeared in 3 of my columns.
> dput(df1)
structure(list(Jan = "The price of oil declined.", Feb = "The price of gold declined.",
Mar = "Prices remained unchanged."), row.names = c(NA, -1L
), class = c("tbl_df", "tbl", "data.frame"))
I want to get something like
Word Count
The 2
price 3
declined 2
of 2
Thank you.

You can count the occurrence of each word in the text and keep only the ones that occur more than once.
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = everything()) %>%
separate_rows(value, sep = '\\s+') %>%
mutate(value = tolower(gsub('[[:punct:]]', '', value))) %>%
count(value) %>%
filter(n > 1)

Maybe this:
setNames(
data.frame(
table(
unlist(
strsplit(trimws(tolower(stack(df1)$values), whitespace = '\\.'), '\\s+', perl = TRUE)
)
)
),
c('words', 'Frequency')
)
stack(df1) reshapes the data frame from wide to long, and its values column holds all the sentences. tolower lowercases them, trimws(..., whitespace = '\\.') strips the leading and trailing periods, and strsplit splits each sentence on whitespace. unlist flattens the result into a single vector, table counts each word, data.frame converts the counts to a data frame, and setNames renames the columns.
Output:
# words Frequency
#1 declined 2
#2 gold 1
#3 of 2
#4 oil 1
#5 price 2
#6 prices 1
#7 remained 1
#8 the 2
#9 unchanged 1
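The same pipeline can be written one step at a time, which also makes it easy to keep only the words that occur more than once, as the question asks. This is a minimal base R sketch; the intermediate names (sentences, cleaned, words, result, common) are illustrative, and removing all punctuation with gsub is a slight variation on the trimws call above.

```r
# Sample data from the question, as a plain data frame
df1 <- data.frame(
  Jan = "The price of oil declined.",
  Feb = "The price of gold declined.",
  Mar = "Prices remained unchanged.",
  stringsAsFactors = FALSE
)

sentences <- unlist(df1, use.names = FALSE)               # one sentence per element
cleaned   <- tolower(gsub("[[:punct:]]", "", sentences))  # drop punctuation, lowercase
words     <- unlist(strsplit(cleaned, "\\s+"))            # split on whitespace, flatten

# Tabulate and rename the columns
result <- setNames(
  as.data.frame(table(words), stringsAsFactors = FALSE),
  c("words", "Frequency")
)

# Keep only the words that appear more than once
common <- result[result$Frequency > 1, ]
common
```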

This code won't process the data exactly as you may wish: for example, it does not treat "price" and "Prices" as the same word. Handling that would be more complicated.
> data.frame(table(strsplit(tolower(gsub("\\.|\\,","",paste(as.character(unlist(df)),collapse=" ")))," ")))
Var1 Freq
1 declined 2
2 gold 1
3 of 2
4 oil 1
5 price 2
6 prices 1
7 remained 1
8 the 2
9 unchanged 1
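As a rough illustration of what merging "price" and "prices" could look like, here is a deliberately naive sketch. It assumes that dropping one trailing "s" from words longer than three letters is good enough, which holds for this sample but not for English in general; a proper stemmer would be needed for real text. The names words, normalized and freq are illustrative.

```r
sentences <- c(
  "The price of oil declined.",
  "The price of gold declined.",
  "Prices remained unchanged."
)

# Lowercase, strip punctuation, split on whitespace
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", sentences)), "\\s+"))

# Crude normalization (assumption): remove a single trailing "s"
# from words longer than three letters, so "prices" becomes "price"
normalized <- ifelse(nchar(words) > 3, sub("s$", "", words), words)

freq <- sort(table(normalized), decreasing = TRUE)
freq
```

With this normalization "price" is counted three times, matching the count the question expects for that word.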

Base R solution:
setNames(
data.frame(
table(
unlist(strsplit(tolower(do.call(c, df1)), "\\s+|[[:punct:]]"))
)
),
c("Words", "Frequency")
)

Adapting string variables to specific characteristics in R

I have the following data:
id code
1 I560
2 K980
3 R30
4 F500
5 650
I would like to do the following two things to the column code:
i) select the two numbers after the letter and
ii) remove those observations that do not start with a letter. So in the end, the data frame should look like this:
id code
1 I56
2 K98
3 R30
4 F50

In base R, you could do:
subset(transform(df, code = sub('([A-Z]\\d{2}).*', '\\1', code)),
grepl('^[A-Z]', code))
Or using tidyverse functions
library(dplyr)
library(stringr)
df %>%
mutate(code = str_extract(code, '[A-Z]\\d{2}')) %>%
filter(str_detect(code, '^[A-Z]'))
# id code
#1 1 I56
#2 2 K98
#3 3 R30
#4 4 F50

An option with substr from base R
df1$code <- substr(df1$code, 1, 3)
df1[grepl('^[A-Z]', df1$code),]
# id code
#1 1 I56
#2 2 K98
#3 3 R30
#4 4 F50
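For reference, the same two steps can be sketched against the codes exactly as they appear in the question. Filtering before truncating is a design choice made here for readability; the names df, starts_with_letter and result are illustrative.

```r
# Input as shown in the question
df <- data.frame(
  id   = 1:5,
  code = c("I560", "K980", "R30", "F500", "650"),
  stringsAsFactors = FALSE
)

# ii) keep only the rows whose code starts with an uppercase letter
starts_with_letter <- grepl("^[A-Z]", df$code)
result <- df[starts_with_letter, ]

# i) keep the letter plus the two digits that follow it
result$code <- substr(result$code, 1, 3)
result
```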
data
df1 <- structure(list(id = 1:5, code = c("I560", "K980", "R30", "F500",
"650")), row.names = c(NA, -5L), class = "data.frame")

R coalesce down columns by identifier [duplicate]

I have a long dataset with student grades and courses going across many semesters. It has many NAs and many rows for each student. I want it to have one long row per student to fill in those NAs but keep the same column names.
Here's a sample:
library(tidyverse)
sample <- tibble(student = c("Corey", "Corey", "Sibley", "Sibley"),
fall_course_1 = c("Math", NA, "Science", NA),
fall_course_2 = c(NA, "English", NA, NA),
fall_grade_1 = c(90, NA, 98, NA),
fall_grade_2 = c(NA, 60, NA, NA))
And here's what I'd like it to look like:
library(tidyverse)
answer <- tibble(student = c("Corey", "Sibley"),
fall_course_1 = c("Math", "Science"),
fall_course_2 = c("English", NA),
fall_grade_1 = c(90, 98),
fall_grade_2 = c(60, NA))
Some semesters, some students take many classes and some just one. I've tried using coalesce(), but I can't figure it out. Any help would be appreciated!

This should do it: pivot the data to long format, remove the NAs, and then pivot it back to wide.
You need to convert the numeric values to character temporarily so they can go in the same column as the course labels, then type_convert() is a lazy way to put them back again.
library(dplyr)
library(tidyr)
library(readr)
reshaped <- sample %>%
mutate_if(is.numeric, as.character) %>%
pivot_longer(-student) %>%
drop_na() %>%
pivot_wider(id_cols = student, names_from = name, values_from = value) %>%
type_convert()

You could get the first non-NA value in each column for each student.
library(dplyr)
sample %>% group_by(student) %>% summarise_all(~na.omit(.)[1])
# A tibble: 2 x 5
# student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
# <chr> <chr> <chr> <dbl> <dbl>
#1 Corey Math English 90 60
#2 Sibley Science NA 98 NA
This approach returns NA if all values in a group are NA.
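The "first non-NA value per column, per student" idea can also be sketched in base R without any packages. The helper first_non_na and the names sample_df and collapsed are hypothetical, introduced only for this illustration.

```r
sample_df <- data.frame(
  student       = c("Corey", "Corey", "Sibley", "Sibley"),
  fall_course_1 = c("Math", NA, "Science", NA),
  fall_course_2 = c(NA, "English", NA, NA),
  fall_grade_1  = c(90, NA, 98, NA),
  fall_grade_2  = c(NA, 60, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Split by student, collapse each column to one value, and stack the rows
collapsed <- do.call(
  rbind,
  lapply(split(sample_df, sample_df$student), function(g) {
    as.data.frame(lapply(g, first_non_na), stringsAsFactors = FALSE)
  })
)
rownames(collapsed) <- NULL
collapsed
```

Indexing an empty vector with [1] yields an NA of the same type, which is why a column with no values for a student stays NA instead of raising an error.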

Using a custom coalesce function and dplyr:
coalesce_all_columns <- function(df) {
return(coalesce(!!! as.list(df)))
}
library(dplyr)
sample %>%
group_by(student) %>%
summarise_all(coalesce_all_columns)
# A tibble: 2 x 5
student fall_course_1 fall_course_2 fall_grade_1 fall_grade_2
<chr> <chr> <chr> <dbl> <dbl>
1 Corey Math English 90 60
2 Sibley Science NA 98 NA

You could also use data.table package as follows:
library(data.table)
result <- setDT(sample)[, lapply(.SD, na.omit), student]
result
# 1: Corey Math English 90 60
# 2: Sibley Science <NA> 98 NA

Group, summarize and transpose

I have a dataframe that looks like this:
ctgroup (dataframe)
Camera Trap Name Animal Name a_sum
1 CAM27 Chicken 1
2 CAM27 Dog 1
3 CAM27 Dog 4
4 CAM28 Cat 3
5 CAM28 Dog 22
6 CAM28 Dog 1
*a_sum = No. of animals recorded in a camera
So essentially I want to group by two fields (Camera Trap Name, Animal Name), sum the counts in the column "a_sum", and transpose the data so that Animal Name becomes the columns and Camera Trap Name the rows. I want to display all the animal names as columns, with 0 where no data are available,
i.e.,
Camera trap name Dog Cat Wolf Chicken
CAM28 23 4 1 4
CAM27 5 0 0 4
I tried using the following code
dcast (ctgroup, Camera.Trap.name + Animal.name, value.var = "a_sum")
And I got the following error:
In dcast(ctgroup, Camera.Trap.name + Scientific.name, value.var = "a_sum") :
The dcast generic in data.table has been passed a grouped_df and will attempt to redirect to the reshape2::dcast; please note that reshape2 is deprecated, and this redirection is now deprecated as well. Please do this redirection yourself like reshape2::dcast(ctgroup). In the next version, this warning will become an error.
I don't think I know enough to construct the correct code for carrying out this work.

With data.table ...
# Load data.table.
require(data.table)
# Create data.set.
df <- data.frame(Camera = c("CAM27", "CAM27", "CAM27", "CAM28", "CAM28", "CAM28"),
Animal = c("Chicken", "Dog", "Dog", "Cat", "Dog", "Dog"),
a_sum = c(1, 1, 4, 3, 22, 1))
# Set the data.frame as a data.table.
setDT(df)
# Cast by `Camera` and `Animal` and sum `a_sum`.
dcast(df, Camera ~ Animal, value.var = "a_sum", fun.aggregate = sum)
# Camera Cat Chicken Dog
# 1: CAM27 0 1 5
# 2: CAM28 3 0 23
# If you want to coerce back to a data.frame.
setDF(df)

The dplyr approach:
library(dplyr)
library(tidyr)
ctgroup %>%
group_by(Camera, Animal) %>%
summarize(a_sum = sum(a_sum)) %>%
pivot_wider(id_cols = Camera, names_from = Animal, values_from = a_sum, values_fill = list(a_sum = 0))
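If no extra packages are wanted, base R's xtabs produces the same kind of camera-by-animal table of sums. This is a sketch under the assumption that the columns are named Camera, Animal and a_sum, as in the answers above; wide and wide_df are illustrative names.

```r
ctgroup <- data.frame(
  Camera = c("CAM27", "CAM27", "CAM27", "CAM28", "CAM28", "CAM28"),
  Animal = c("Chicken", "Dog", "Dog", "Cat", "Dog", "Dog"),
  a_sum  = c(1, 1, 4, 3, 22, 1),
  stringsAsFactors = FALSE
)

# Sum a_sum for every Camera/Animal combination; missing combinations become 0
wide <- xtabs(a_sum ~ Camera + Animal, data = ctgroup)

# Convert the contingency table to a regular data frame (cameras as row names)
wide_df <- as.data.frame.matrix(wide)
wide_df
```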

Creating a new column of consecutive token (like n-gram) in R

I have this dataset;
A B
URBAN 1
PLAN 2
I would like a new column added, like this:
A A' B
URBAN URB 1
URBAN RBA 1
URBAN BAN 1
PLAN PLA 2
PLAN LAN 2
How do I make the A' column in R?

dat=read.table(text="A B
URBAN 1
PLAN 2",h=T,stringsAsFactors=F)
library(zoo)
d=lapply(dat$A,function(y)
rollapply(1:nchar(y),3,function(x)substr(y,min(x),max(x))))
data.frame(dat[rep(seq_len(nrow(dat)),lengths(d)),],A1=unlist(d),row.names = NULL)
A B A1
1 URBAN 1 URB
2 URBAN 1 RBA
3 URBAN 1 BAN
4 PLAN 2 PLA
5 PLAN 2 LAN

Here is one possible way. I am sure there are more concise ways to handle this, but the following works. For each row in mydf, substr() creates the three-letter elements; the Map() part produces them. Since Map() also yields some unwanted shorter elements, a second lapply() keeps only those with exactly three characters. Finally, unnest() expands the elements of each list into long-format data.
library(tidyverse)
mydf %>%
mutate(whatever = lapply(1:nrow(mydf), function(x) {
unlist(Map(function(j, k) substr(mydf$A[x], start = j, stop = k),
1:nchar(mydf$A[x]), 3:nchar(mydf$A[x])))
}) %>%
lapply(function(x) x[nchar(x) ==3])) %>%
unnest(whatever)
A B whatever
1 URBAN 1 URB
2 URBAN 1 RBA
3 URBAN 1 BAN
4 PLAN 2 PLA
5 PLAN 2 LAN
DATA
mydf <- structure(list(A = c("URBAN", "PLAN"), B = 1:2), .Names = c("A",
"B"), class = "data.frame", row.names = c(NA, -2L))

Here is an option with str_match_all:
library(stringr)
merge(stack(lapply(setNames(str_match_all(mydf$A, "(?=(...))"),
mydf$A), `[`, , 2))[2:1], mydf, by.x = 'ind', by.y = 'A')
Or using a similar idea with tidyverse:
library(purrr)
library(dplyr)
library(tidyr)
mydf %>%
mutate(Anew = str_match_all(A, "(?=(...))") %>%
map(~.x[,2])) %>%
unnest(Anew)
# A B Anew
#1 URBAN 1 URB
#2 URBAN 1 RBA
#3 URBAN 1 BAN
#4 PLAN 2 PLA
#5 PLAN 2 LAN
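For comparison, the sliding three-character window can be sketched in base R alone. The helper char_ngrams and its n argument are hypothetical names introduced for this illustration; the row expansion mirrors the rep/lengths idea from the first answer.

```r
dat <- data.frame(A = c("URBAN", "PLAN"), B = 1:2, stringsAsFactors = FALSE)

# Hypothetical helper: all substrings of length n, moving one character at a time
char_ngrams <- function(word, n = 3) {
  if (nchar(word) < n) return(character(0))  # too short: no n-grams
  starts <- seq_len(nchar(word) - n + 1)
  substring(word, starts, starts + n - 1)
}

grams <- lapply(dat$A, char_ngrams)

# Repeat each original row once per n-gram, then attach the n-grams
result <- data.frame(
  dat[rep(seq_len(nrow(dat)), lengths(grams)), ],
  A1 = unlist(grams),
  row.names = NULL
)
result
```

Indexing by row position rather than by the values in B keeps the expansion correct even when B is not a running row number.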
