How to merge attributes on a frequency table in R? - r

Assume that I have two variables. See the dummy data below:
Out of 250 records:
SEX
Male : 100
Female : 150
HAIR
Short : 110
Long : 140
The code I currently use is provided below. A separate table is created for each variable:
sexTable <- table(myDataSet$Sex)
hairTable <- table(myDataSet$Hair)
View(sexTable):
|------------------|------------------|
| Level | Frequency |
|------------------|------------------|
| Male | 100 |
| Female | 150 |
|------------------|------------------|
View(hairTable)
|------------------|------------------|
| Level | Frequency |
|------------------|------------------|
| Short | 110 |
| Long | 140 |
|------------------|------------------|
My question is how to merge the two tables in R into the following format, and how to calculate the percentage frequency for each group of levels:
|---------------------|------------------|------------------|
| Variables | Level | Frequency |
|---------------------|------------------|------------------|
| Sex(N=250) | Male | 100 (40%) |
| | Female | 150 (60%) |
| Hair(N=250) | Short | 110 (44%) |
| | Long | 140 (56%) |
|---------------------|------------------|------------------|

We can use bind_rows after converting each table to a data.frame:
library(dplyr)
bind_rows(list(sex = as.data.frame(sexTable),
Hair = as.data.frame(hairTable)), .id = 'Variables')
Using a reproducible example
tbl1 <- table(mtcars$cyl)
tbl2 <- table(mtcars$vs)
bind_rows(list(sex = as.data.frame(tbl1),
Hair = as.data.frame(tbl2)), .id = 'Variables')%>%
mutate(Variables = replace(Variables, duplicated(Variables), ""))
If we also need the percentages
dat1 <- transform(as.data.frame(tbl1),
Freq = sprintf('%d (%0.2f%%)', Freq, as.numeric(prop.table(tbl1) * 100)))
dat2 <- transform(as.data.frame(tbl2),
Freq = sprintf('%d (%0.2f%%)', Freq, as.numeric(prop.table(tbl2) * 100)))
bind_rows(list(sex = dat1, Hair = dat2), .id = 'Variables')
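To get closer to the exact layout requested in the question (a "Sex(N=250)" label on the first row of each block and whole-number percentages), the same idea can be wrapped in a small helper. This is only a sketch in base R: `freq_block` is a hypothetical helper name, and `myDataSet` is rebuilt here from the dummy counts in the question.

```r
# Hypothetical helper: builds one block of the summary table for one variable.
freq_block <- function(x, label) {
  tab <- table(x)
  pct <- as.numeric(prop.table(tab)) * 100
  data.frame(
    # Label with the total N on the first row only, blanks below it
    Variables = c(sprintf("%s(N=%d)", label, sum(tab)),
                  rep("", length(tab) - 1)),
    Level     = names(tab),
    Frequency = sprintf("%d (%.0f%%)", as.integer(tab), pct),
    stringsAsFactors = FALSE
  )
}

# Dummy data matching the counts given in the question (assumed layout)
myDataSet <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), c(100, 150)),
                levels = c("Male", "Female")),
  Hair = factor(rep(c("Short", "Long"), c(110, 140)),
                levels = c("Short", "Long"))
)

result <- rbind(freq_block(myDataSet$Sex, "Sex"),
                freq_block(myDataSet$Hair, "Hair"))
result
```

Setting the factor levels explicitly keeps the rows in the order shown in the question; otherwise `table()` sorts the levels alphabetically.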

Related

R: Subset Large Data Frame with Multiple Conditions

I have a large data frame with 12 million rows and 5 columns. I want to subset the large data frame with multiple conditions. I need to do this multiple times with different criteria, so I created a Look-Up Table and a for loop.
The code below loops through and subsets the large data frame, saving each iteration as a list within a list. After the loop completes, I combine the lists into a data frame.
My current set-up functions, but it is painfully slow (about 15 minutes for 8 loops). Subsetting is actually taking more time than it took to calculate the mean and SD for the 12 million-row table!
Any advice on how to speed this up?
>scaled
| chr | site | Average_CPMn | SD_CPMn |
|------|------|--------------|---------|
| chrI | 1 | 0.071 | 0.070 |
| chrI | 2 | 0.120 | 0.111 |
| chrI | 3 | 0.000 | 0.000 |
| chrI | 4 | 0.000 | 0.000 |
| chrI | 5 | 0.000 | 0.000 |
| chrI | 6 | 0.156 | 0.056 |
...12,000,000 rows
>genes.df
| Gene | Chromosome | Meta_Start | Meta_Stop |
|---------|------------|------------|-----------|
| YGL234W | chrVII | 55982 | 59390 |
| YGR061C | chrVII | 611389 | 616465 |
| YMR120C | chrXIII | 507002 | 509780 |
| YLR359W | chrXII | 843782 | 846230 |
library(dplyr)
library(readr)
scaled <- read_rds("~/Desktop/scaled.rds")
subset_list <- list()
for (i in 1:nrow(genes.df)) {
  # keep the rows on this gene's chromosome that fall inside its start/stop range
  subset <- scaled %>%
    dplyr::filter(chr == genes.df$Chromosome[i] & site >= genes.df$Meta_Start[i] & site <= genes.df$Meta_Stop[i]) %>%
    dplyr::mutate(Gene = genes.df$Gene[i])
  subset_list[[i]] <- subset
}
# combine gene-list into single dataframe
counts_subset <- as.data.frame(do.call(rbind, subset_list)) %>%
  left_join(genes.df, by = "Gene")
You haven't shared the data or a sample, so it is difficult to demonstrate. However, I suggest using semi_join (if you only want to subset) or left_join (if you want to add columns instead), somewhat like this in the tidyverse:
library(dplyr)
library(tidyr)
# column names must match genes.df exactly: Meta_Start, Meta_Stop
scaled %>% semi_join(genes.df %>% pivot_longer(c(Meta_Start, Meta_Stop)) %>%
group_by(Gene, Chromosome) %>%
complete(value = seq(min(value), max(value), 1)) %>%
ungroup %>% select(-name), by = c('chr' = 'Chromosome', 'site' = 'value'))
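As an alternative to expanding every gene into one row per site, a range join can match each site against the gene intervals directly and attach the gene name in the same step. The following is a minimal sketch, assuming dplyr 1.1.0 or later (which provides `join_by()`); the rows in `scaled` and `genes.df` are small made-up stand-ins for the real 12-million-row data, and the speed on the full data has not been measured here.

```r
library(dplyr)

# Made-up stand-in for the large table
scaled <- tibble(
  chr          = c("chrI", "chrI", "chrVII", "chrVII", "chrVII"),
  site         = c(1, 2, 55990, 60000, 611400),
  Average_CPMn = c(0.071, 0.120, 0.300, 0.000, 0.156),
  SD_CPMn      = c(0.070, 0.111, 0.050, 0.000, 0.056)
)

# Look-up table with the same column names as in the question
genes.df <- tibble(
  Gene       = c("YGL234W", "YGR061C"),
  Chromosome = c("chrVII", "chrVII"),
  Meta_Start = c(55982, 611389),
  Meta_Stop  = c(59390, 616465)
)

# Keep rows of `scaled` whose site lies within [Meta_Start, Meta_Stop]
# of a gene on the same chromosome; the gene columns come along.
counts_subset <- inner_join(
  scaled, genes.df,
  by = join_by(chr == Chromosome, between(site, Meta_Start, Meta_Stop))
)
counts_subset
```

Because the join already carries `Gene`, `Meta_Start` and `Meta_Stop`, the separate loop, `rbind` and `left_join` steps from the question are not needed in this variant.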

R Programming: How to drop variable labels as first column name in the table output from the fre function of the expss package?

I'm doing exploratory analysis of survey data and the dataframe is a haven labelled dataset, that is, each variable already has value labels and variable labels.
I want to store frequency tables in a list, and then name each list element with the variable label. I'm using the expss package. The problem is that the first column name of each output table contains this description: values2labels(df$var). How can this description be dropped from the table?
Reproducible example:
# Dataframe
df <- data.frame(sex = c(1, 1, 2, 2, 1, 2, 2, 2, 1, 2),
agegroup= c(1, 3, 1, 2, 3, 3, 2, 2, 2, 1),
weight = c(100, 20, 400, 300, 50, 50, 80, 250, 100, 100))
library(expss)
# Variable labels
var_lab(df$sex) <-"Sex"
var_lab(df$agegroup) <-"Age group"
# Value labels
val_lab(df$sex) <- make_labels("1 Male
2 Female")
val_lab(df$agegroup) <- make_labels("1 1-29
2 30-49
3 50 and more")
# Save variable labels
var_labels1 <- var_lab(df$sex)
var_labels2 <- var_lab(df$agegroup)
# Drop variable labels
var_lab(df$sex) <- NULL
var_lab(df$agegroup) <- NULL
# Save frequencies
f1 <- fre(values2labels(df$sex))
f2 <- fre(values2labels(df$agegroup))
# Note: I use the function 'values2labels' from the 'expss' package to display the value
# labels instead of the values of the variable. In this example, since I created the value
# labels manually, I don't need that function, but when I import haven labelled data, I need it to
# display value labels by default.
# Add frequencies on list
my_list <- list(f1, f2)
# Name lists elements as variable labels
names(my_list) <- c(var_labels1,
var_labels2)
In the following output, how can I get rid of the first column name on both tables: values2labels(df$sex) and values2labels(df$agegroup) ?
$Sex
| values2labels(df$sex) | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| --------------------- | ----- | ------------- | ------- | ------------ | ----------------------- |
| Female | 6 | 60 | 60 | 60 | 60 |
| Male | 4 | 40 | 40 | 40 | 100 |
| #Total | 10 | 100 | 100 | 100 | |
| <NA> | 0 | | 0 | | |
$`Age group`
| values2labels(df$agegroup) | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
| -------------------------- | ----- | ------------- | ------- | ------------ | ----------------------- |
| 1-29 | 3 | 30 | 30 | 30 | 30 |
| 30-49 | 4 | 40 | 40 | 40 | 70 |
| 50 and more | 3 | 30 | 30 | 30 | 100 |
| #Total | 10 | 100 | 100 | 100 | |
| <NA> | 0 | | 0 | | |
You need to set var_lab to empty string instead of NULL:
library(expss)
a = 1:2
val_lab(a) = c("One" = 1, "Two" = 2)
var_lab(a) = ""
fre(values2labels(a))
# | | Count | Valid percent | Percent | Responses, % | Cumulative responses, % |
# | ------ | ----- | ------------- | ------- | ------------ | ----------------------- |
# | One | 1 | 50 | 50 | 50 | 50 |
# | Two | 1 | 50 | 50 | 50 | 100 |
# | #Total | 2 | 100 | 100 | 100 | |
# | <NA> | 0 | | 0 | | |
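Applied to the list workflow from the question, the same fix might look like the sketch below. It assumes that `var_lab(x) <- ""` behaves inside `lapply()` the same way it does for the standalone vector above; `labels` and `freq_tables` are illustrative names, and the value labels are written as named vectors (as in the answer) rather than with `make_labels`.

```r
library(expss)

df <- data.frame(
  sex      = c(1, 1, 2, 2, 1, 2, 2, 2, 1, 2),
  agegroup = c(1, 3, 1, 2, 3, 3, 2, 2, 2, 1)
)
var_lab(df$sex)      <- "Sex"
var_lab(df$agegroup) <- "Age group"
val_lab(df$sex)      <- c("Male" = 1, "Female" = 2)
val_lab(df$agegroup) <- c("1-29" = 1, "30-49" = 2, "50 and more" = 3)

# Save the variable labels before blanking them
labels <- vapply(df, var_lab, character(1), USE.NAMES = FALSE)

# One frequency table per column, with an empty variable label
freq_tables <- lapply(df, function(x) {
  var_lab(x) <- ""   # empty string instead of NULL
  fre(values2labels(x))
})

# Name the list elements after the saved variable labels
names(freq_tables) <- labels
```

The columns of `df` itself keep their original labels, because the label is blanked only on the copy inside the anonymous function.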

How to generate Z-test including totals of variables in columns in expss?

Two questions in fact.
How to add totals for variables in columns in expss?
Is it possible to perform Z-test for variables in columns including total as a different category?
Below is the code I ran, but it didn't work. I couldn't even add totals to the right or left side of the column variable.
test_table = tab_significance_options(data = df, compare_type = "subtable", bonferroni = TRUE, subtable_marks = "both") %>%
tab_cells(VAR1) %>%
tab_total_statistic("w_cpct") %>%
tab_cols(VAR2) %>%
tab_stat_cpct() %>%
tab_cols(total(VAR2)) %>%
tab_last_sig_cpct() %>%
tab_pivot(stat_position = "outside_columns")
I would be grateful for any advice.
To compare with the first column, you additionally need to specify "first_column" in 'compare_type'. Second, for a correct result, one of the total statistics should be cases. Taking all of the above into account:
library(expss)
data(mtcars)
test_table = mtcars %>%
tab_significance_options(compare_type = c("first_column", "subtable"), bonferroni = TRUE, subtable_marks = "both") %>%
tab_total_statistic(c("u_cases", "w_cpct")) %>%
tab_cells(gear) %>%
tab_cols(total(am), am) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot()
test_table
# | | | #Total | am | |
# | | | | 0 | 1 |
# | | | | A | B |
# | ---- | ---------------- | ------ | -------- | -------- |
# | gear | 3 | 46.9 | 78.9 + | |
# | | 4 | 37.5 | 21.1 < B | 61.5 > A |
# | | 5 | 15.6 | | 38.5 |
# | | #Total cases | 32 | 19 | 13 |
# | | #Total wtd. cpct | 100 | 100 | 100 |

Grouping data based on repetitive records using R

I have a dataset which contains repetitive records/common records. It looks something like this:
| Vendor | Buyer | Amount |
|--------|:-----:|-------:|
| A | P | 100 |
| B | P | 150 |
| C | Q | 300 |
| A | P | 290 |
I need to group similar records together but I do not want to summarize my amount. I want each amount value to be represented individually. The output should look something like this:
| Vendor | Buyer | Amount |
|--------|:-----:|-------:|
| A | P | 100 |
| A | P | 290 |
| | | |
| B | P | 150 |
| | | |
| C | Q | 300 |
I thought of using split(), but since my original data has too many records, the split function creates too many lists and it becomes tedious to create new datasets from them. How can I achieve the above stated output with any other method?
EDIT:
Let us assume that we have an additional column called date and the dataset now looks like this:
| Vendor | Buyer | Amount | Date |
|--------|:-----:|-------:|-----------|
| A | P | 100 | 3/6/2019 |
| B | P | 150 | 7/6/2018 |
| C | Q | 300 | 4/21/2018 |
| A | P | 290 | 6/5/2018 |
Once each buyer and vendor is grouped together, I need to arrange the dates in ascending order within each buyer and vendor group, so that it looks something like this:
| Vendor | Buyer | Amount | Date |
|--------|:-----:|-------:|-----------|
| A | P | 290 | 6/5/2018 |
| A | P | 100 | 3/6/2019 |
| | | | |
| B | P | 150 | 7/6/2018 |
| | | | |
| C | Q | 300 | 4/21/2018 |
and then remove the single transactions to get the final table containing only
| Vendor | Buyer | Amount | Date |
|--------|:-----:|-------:|----------|
| A | P | 290 | 6/5/2018 |
| A | P | 100 | 3/6/2019 |
In the following we sort the data frame and add a group column which allows easy subsequent processing of individual groups. For example, to process the groups without creating a large split of DF:
for(g in unique(DFout$group)) {
DFsub <- subset(DFout, group == g)
# ... process DFsub ...
}
1) Base R Sort the data and then assign the group column using cumsum on the non-duplicated elements. No packages are needed.
o <- with(DF, order(Vendor, Buyer))
DFo <- DF[o, ]
DFout <- transform(DFo, group = cumsum(!duplicated(data.frame(Vendor, Buyer))))
DFout
giving:
Vendor Buyer Amount group
1 A P 100 1
4 A P 290 1
2 B P 150 2
3 C Q 300 3
I am not sure this is such a good idea to do in the first place but if you really want to add a row of NAs after each group:
ix <- unname(unlist(tapply(DFout$group, DFout$group, function(x) c(x, NA))))
ix[!is.na(ix)] <- seq_len(nrow(DFout))
DFout[ix, ]
2) data.table Convert to data.table, set the key (which sorts it) and use rleid to assign the group number.
library(data.table)
DT <- data.table(DF)
setkey(DT, Vendor, Buyer)
DT[, group := rleid(Vendor, Buyer)]
3) sqldf Another approach is to use SQL. This requires the development version of RSQLite on github. Here dense_rank acts similarly to rleid above.
library(sqldf)
sqldf("select *, dense_rank() over (order by Vendor, Buyer) as [group]
from DF
order by Vendor, Buyer")
giving:
Vendor Buyer Amount group
1 A P 100 1
2 A P 290 1
3 B P 150 2
4 C Q 300 3
Note
DF <- structure(list(Vendor = structure(c(1L, 2L, 3L, 1L), .Label = c("A",
"B", "C"), class = "factor"), Buyer = structure(c(1L, 1L, 2L,
1L), .Label = c("P", "Q"), class = "factor"), Amount = c(100L,
150L, 300L, 290L)), class = "data.frame", row.names = c(NA, -4L
))
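The EDIT in the question (sorting by date within each vendor/buyer group and then dropping groups with a single transaction) can be handled with the same sort-then-flag idea. The following is a base R sketch under two assumptions: the dates are month/day/year strings as shown in the question, and the names `n_in_group` and `repeated` are illustrative only.

```r
# Data from the EDIT in the question
DF <- data.frame(
  Vendor = c("A", "B", "C", "A"),
  Buyer  = c("P", "P", "Q", "P"),
  Amount = c(100, 150, 300, 290),
  Date   = c("3/6/2019", "7/6/2018", "4/21/2018", "6/5/2018"),
  stringsAsFactors = FALSE
)

# Parse the dates so they sort chronologically rather than as text
DF$Date <- as.Date(DF$Date, format = "%m/%d/%Y")

# Sort by group, then by date within each group
DFo <- DF[order(DF$Vendor, DF$Buyer, DF$Date), ]

# Number of rows in each Vendor/Buyer group, repeated for every row
n_in_group <- ave(seq_len(nrow(DFo)), DFo$Vendor, DFo$Buyer, FUN = length)

# Keep only groups with more than one transaction
repeated <- DFo[n_in_group > 1, ]
repeated
```

With the example data, only the two A/P rows remain, with the 2018 transaction listed before the 2019 one.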

Combine dplyr mutate function with a search through the whole table

I'm quite new to R and especially to the tidyverse. I'm trying to write a script with which we can rewrite a list of taxa. We already have one that uses quite a lot of for and if loops, and I want to try to simplify it with the tidyverse, but I'm kind of stuck on how to do that.
What I have is a table that looks something like this (really simplified):
taxon_file<- tibble(name = c( "cockroach","cockroach2", "grasshopper", "spider", "lobster", "insect", "crustacea", "arachnid"),
Id = c(445,448,446,778,543,200,400,300),
parent_ID = c(200,200,200,300,400,200,400,300),
rank = c("genus","genus","genus","genus","genus","order","order","order")
)
+-------------+-----+-----------+----------+
| name | Id | parent_ID | rank |
+=============+=====+===========+==========+
| cockroach | 445 | 200 | genus |
| cockroach2 | 448 | 200 | genus |
| grasshopper | 446 | 200 | genus |
| spider | 778 | 300 | genus |
| lobster | 543 | 400 | genus |
| insect | 200 | 200 | order |
| crustacea | 400 | 400 | order |
| arachnid | 300 | 300 | order |
+-------------+-----+-----------+----------+
Now I want to rearrange it so that I get a new column where I can add the order that matches the parent_ID (so when parent_ID == ID then write name in column order). The end result should look kinda like this
+-------------+------------+------+-----------+
| name | order | Id | parent_ID |
+=============+============+======+===========+
| cockroach | insect | 445 | 200 |
| cockroach2 | insect | 448 | 200 |
| grasshopper | insect | 446 | 200 |
| spider | arachnid | 778 | 300 |
| lobster | crustacea | 543 | 400 |
+-------------+------------+------+-----------+
I tried to combine mutate with an ifelse statement but this just adds NA's to the whole order column.
The tibble is named taxon_file:
taxon_file %>%
mutate(order = ifelse(parent_ID == Id, name, NA))
I know this will not work because it doesn't search the whole data set for the correct row (that's what I did before with all the for loops). Maybe someone can point me in the right direction?
One way is to filter each rank type into 2 separate data frames, subset them using select, and merge the 2.
# load the tidyverse first, since tibble() comes from it
library(tidyverse)
df <- tibble(name = c( "cockroach","cockroach2", "grasshopper", "spider", "lobster", "insect", "crustacea", "arachnid"),
Id = c(445,448,446,778,543,200,400,300),
parent_ID = c(200,200,200,300,400,200,400,300),
rank = c("genus","genus","genus","genus","genus","order","order","order"))
df_order <- df %>%
filter(rank == "order") %>%
select(order = name, parent_ID)
df_genus <- df %>%
filter(rank == "genus") %>%
select(name, Id, parent_ID) %>%
merge(df_order, by = "parent_ID")
Result:
parent_ID name Id order
1 200 cockroach 445 insect
2 200 cockroach2 448 insect
3 200 grasshopper 446 insect
4 300 spider 778 arachnid
5 400 lobster 543 crustacea
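The "search the whole table" step the question asks about can also be written as a direct lookup with `match()`, which returns, for each `parent_ID`, the position of the row whose `Id` equals it. This is a base R sketch rather than the method used in the answer above; `genus_only` is an illustrative name. The same expression should also work inside `mutate()`, as in `mutate(order = name[match(parent_ID, Id)])`.

```r
taxon_file <- data.frame(
  name      = c("cockroach", "cockroach2", "grasshopper", "spider",
                "lobster", "insect", "crustacea", "arachnid"),
  Id        = c(445, 448, 446, 778, 543, 200, 400, 300),
  parent_ID = c(200, 200, 200, 300, 400, 200, 400, 300),
  rank      = c("genus", "genus", "genus", "genus", "genus",
                "order", "order", "order"),
  stringsAsFactors = FALSE
)

# For every row, find the row whose Id equals this row's parent_ID
# and take that row's name as the order
taxon_file$order <- taxon_file$name[match(taxon_file$parent_ID, taxon_file$Id)]

# Keep only the genus rows, with the columns in the requested layout
genus_only <- taxon_file[taxon_file$rank == "genus",
                         c("name", "order", "Id", "parent_ID")]
genus_only
```

Unlike the merge-based version, this keeps the genus rows in their original order and does not rely on the order rows having `parent_ID` equal to their own `Id`.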
