How to sort frequencies in CreateTableOne - r

Here is a representation of my dataset:
mydata <- data.frame(ID = 1:37, var = c(rep("A", 12), rep("B", 8), rep("C", 17)))
I calculated the frequency of each level of the variable var:
library(tableone)
CreateTableOne(data = mydata["var"])
What I want is to sort the frequencies in decreasing order, like below:
          Overall
  n       37
  var (%)
     C    17 (45.9)
     A    12 (32.4)
     B     8 (21.6)

Change the factor levels to be in decreasing order of frequency.
library(tableone)
mydata2 <- transform(mydata, var = factor(var, names(sort(-table(var)))))
CreateTableOne("var", data = mydata2)
giving:
          Overall
  n       37
  var (%)
     C    17 (45.9)
     A    12 (32.4)
     B     8 (21.6)
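An alternative sketch, if you don't mind an extra dependency (this assumes the forcats package is installed; it is not part of the original answer): fct_infreq reorders factor levels by decreasing frequency, so you get the same result without building the table yourself.
library(tableone)
library(forcats)  # fct_infreq() orders factor levels by decreasing frequency

mydata2 <- transform(mydata, var = fct_infreq(var))
CreateTableOne("var", data = mydata2)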

Related

Using barplot in RStudio

When I try this code for a barplot (L$neighbourhood is the apartment neighbourhood in Paris, for example Champs-Elysées or Batignolles, which is string data, and L$price is the numeric apartment price):
barplot(L$neighbourhood, L$price, main = "TITLE", xlab = "Neighbourhood", ylab = "Price")
But I get an error:
Error in barplot.default(L$neighbourhood, L$price, main = "TITLE",
xlab = "Neighbourhood", : 'height' must be a vector or a matrix
Can we not use string data as input to the barplot function in R? How can I fix this error, please?
It is quite unclear what you want to barplot. Let's assume you want to see the average price per neighborhood. If that's what you're after, you can proceed like this.
First, some illustrative data:
set.seed(123)
Neighborhood <- sample(LETTERS[1:4], 10, replace = T)
Price <- sample(10:100, 10, replace = T)
df <- data.frame(Neighborhood, Price)
df
Neighborhood Price
1 C 23
2 C 34
3 C 99
4 B 100
5 C 78
6 B 100
7 B 66
8 B 18
9 C 81
10 A 35
Now compute the averages by neighborhood using the function aggregate and store the result in a new dataframe:
df_new <- aggregate(x = df$Price, by = list(df$Neighborhood), FUN = mean)
df_new
Group.1 x
1 A 35
2 B 71
3 C 63
And finally you can plot the average prices in variable x and add the neighborhood names from the Group.1 column:
barplot(df_new$x, names.arg = df_new$Group.1)
An even simpler solution is this, using tapply and mean:
df_new <- tapply(df$Price, df$Neighborhood, mean)
barplot(df_new, names.arg = names(df_new))
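If you use ggplot2, here is a minimal sketch of the same idea (assuming ggplot2 >= 3.3, where stat_summary takes the fun argument), letting the plot compute the per-neighborhood mean for you:
library(ggplot2)
ggplot(df, aes(x = Neighborhood, y = Price)) +
  stat_summary(fun = mean, geom = "bar") +  # bar height = mean price per neighborhood
  labs(title = "TITLE", x = "Neighbourhood", y = "Average price")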

Selecting 10 names based on 10 highest numbers of other column

I want to select the top 10 most-voted restaurants and plot them together.
So I want to create a plot that shows the restaurant names and their votes.
I used:
topTenVotes <- top_n(dataSet, 10, Votes)
and it showed me all the columns in the dataset for the top 10 highest votes; however, I want just the number of votes and the restaurant names.
My question is how to select only the top 10 highest votes and their restaurant names, and plot them together.
Expected output:
Restaurant Names Votes
A 300
B 250
C 230
D 220
E 210
F 205
G 200
H 194
I 160
J 120
K 34
And then a bar plot that shows these restaurant names and their votes
Another simple approach with base functions is to create a named vector:
df <- data.frame(Names = LETTERS, Votes = sample(40:400, length(LETTERS)))
x <- df$Votes
names(x) <- df$Names # x <- setNames(df$Votes, df$Names) is another approach
barplot(sort(x, decreasing = TRUE)[1:10], xlab = "Restaurant Name", ylab = "Votes")
Or a one-line solution with base functions:
barplot(sort(xtabs(Votes ~ Names, df), decreasing = TRUE)[1:10], xlab = "Restaurant Names")
I'm not seeing a data set to use, so here's a minimal example to show how it might work:
library(tidyverse)
df <-
  tibble(
    restaurant = c("res1", "res2", "res3", "res4"),
    votes = c(2, 5, 8, 6)
  )
df %>%
  arrange(-votes) %>%
  head(3) %>%
  ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
  geom_col() +
  coord_flip()
The top_n command also works in this case but is designed for grouped data.
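If your dplyr version is 1.0 or newer, slice_max is a drop-in alternative to arrange plus head for this selection (a hedged sketch, not part of the original answer):
df %>%
  slice_max(votes, n = 3) %>%  # keep the 3 highest vote counts
  ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
  geom_col() +
  coord_flip()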
It's more efficient, though less readable, to use base functions:
# toy data
d <- data.frame(Names = sample(LETTERS, size = 15), value = rnorm(n = 15, mean = 25, sd = 10))
head(d)
Names value
1 D 25.592749
2 B 28.362303
3 H 1.576343
4 L 28.718517
5 S 27.648078
6 Y 29.364797
# reorder by, and retain, the top 10
newdata <- data.frame()
for (i in 1:10) {
  newdata <- rbind(newdata, d[which(d$value == sort(d$value, decreasing = TRUE)[1:10][i]), ])
}
newdata
Names value
8 W 45.11330
13 K 36.50623
14 P 31.33122
15 T 30.28397
6 Y 29.36480
7 Q 29.29337
4 L 28.71852
10 Z 28.62501
2 B 28.36230
5 S 27.64808
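For what it's worth, the same top-10 selection can be written without the loop by ordering once (a small equivalent sketch; it assumes ties need no special handling):
# order by value once, then keep the first 10 rows
top10 <- head(d[order(d$value, decreasing = TRUE), ], 10)
barplot(setNames(top10$value, top10$Names), xlab = "Restaurant Name", ylab = "Votes")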

Capture the output of arules::inspect as data.frame

In "Zero frequent items" when using the eclat to mine frequent itemsets, the OP is interested in the groupings/clusterings based on how frequent they are ordered together. This grouping can be inspected by the arules::inspect function.
library(arules)
dataset <- read.transactions("8GbjnHK2.txt", sep = ";", rm.duplicates = TRUE)
f <- eclat(dataset,
           parameter = list(
             supp = 0.001,
             maxlen = 17,
             tidLists = TRUE))
inspect(head(sort(f, by = "support"), 10))
The data set can be downloaded from https://pastebin.com/8GbjnHK2.
However, the output cannot be easily saved to another object as a data frame.
out <- inspect(f)
So how do we capture the output of inspect(f) for use as a data frame?
We can use the methods labels to extract the associations/groupings and quality to extract the quality measures (support and count). We can then use cbind to store these into a data frame.
out <- cbind(labels = labels(f), quality(f))
head(out)
# labels support count
# 1 {3031093,3059242} 0.001010 16
# 2 {3031096,3059242} 0.001073 17
# 3 {3060614,3060615} 0.001010 16
# 4 {3022540,3072091} 0.001010 16
# 5 {3061698,3061700} 0.001073 17
# 6 {3031087,3059242} 0.002778 44
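Because out is now an ordinary data frame, it can be sorted or filtered with the usual tools, for example (illustrative only):
# the ten most frequent itemsets, ordered by support
head(out[order(out$support, decreasing = TRUE), ], 10)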
Coercing the itemsets to a data.frame also creates the required output.
> head(as(f, "data.frame"))
items support count
1 {3031093,3059242} 0.001010101 16
2 {3031096,3059242} 0.001073232 17
3 {3060614,3060615} 0.001010101 16
4 {3022540,3072091} 0.001010101 16
5 {3061698,3061700} 0.001073232 17
6 {3031087,3059242} 0.002777778 44

R- Trimming a string in a dataframe after a particular pattern

I am having trouble figuring out how to trim the end off of a string in a data frame.
I want to trim everything down to a "base" name: numbers and letters, a period, then a number. My goal is to trim everything in my data frame to this "base" name, then sum the values with the same "base". I was thinking it would be possible to trim, then merge and sum the values.
i.e. from:
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7.1 2
B0228.7.2 12
B0350.2h.1 30
B0350.2h.2 2
B0350.2i 15
2RSSE.1a 3
2RSSE.1b 10
R02F11.11 4
to
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7 14
B0350.2 47
2RSSE.1 13
R02F11.11 4
Thank you for any help!
Here is a solution using the dplyr and stringr packages. You first create a column with your extracted base pattern, and then use the group_by and summarise functions from dplyr to get the sum of values for each name:
library(dplyr)
library(stringr)
df2 <- df %>%
  mutate(Gene_name = str_extract(Gene_name, "[[:alnum:]]+\\.\\d+")) %>%
  group_by(Gene_name) %>%
  summarise(Values = sum(Values))
Gene_name Values
<chr> <int>
1 2RSSE.1 13
2 B0222.5 4
3 B0222.6 16
4 B0228.7 14
5 B0350.2 47
6 R02F11.11 4
As someone else has also suggested, I would get the gene names first, and then search for them in the original data.frame:
df <- data.frame(Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2", "B0350.2h.1", "B0350.2h.2", "B0350.2i", "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
                 Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
                 stringsAsFactors = FALSE)
pat <- "(^[[:alnum:]]+\\.[[:digit:]]*)"
cap.pos <- regexpr(pat, df$Gene_name)
cap.gene <- unique(substr(df$Gene_name, cap.pos, (cap.pos + attributes(cap.pos)$match.length - 1)))
do.call(rbind, lapply(cap.gene, function(nm) {
  sumval <- sum(df[grepl(nm, df$Gene_name, fixed = TRUE), ]$Values, na.rm = TRUE)
  data.frame(Gene_name = nm, Value = sumval)
}))
The result matches your requested output:
Gene_name Value
1 B0222.5 4
2 B0222.6 16
3 B0228.7 14
4 B0350.2 47
5 2RSSE.1 13
6 R02F11.11 4
You can also convert Gene_name to a factor and change the levels.
# coerce the vector to a factor
Gene_name <- as.factor(Gene_name)
# view the levels
levels(Gene_name)
# to make B0228.7.1 into B0228.7, replace that level by name
levels(Gene_name)[levels(Gene_name) == "B0228.7.1"] <- "B0228.7"
You can repeat this for the levels that need to change; rows that end up with the same level will then be treated as the same category when you aggregate the values.
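For completeness, a compact base-R sketch of the same idea, using the df defined above (it assumes every "base" name is letters/digits, a period, then digits): strip the suffix with sub() and sum with aggregate().
df$base <- sub("^([[:alnum:]]+\\.[[:digit:]]+).*", "\\1", df$Gene_name)  # keep only the base name
aggregate(Values ~ base, data = df, FUN = sum)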

Calculate the mean per subject and repeat the value for each subject's row

This is the first time I am asking a question on Stack Overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observations. Basically, I have 83 subjects, and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate (RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate (RT ~ su, data, mean)
as R returns this error:
Error in $<-.data.frame(tmp, "mean", value = list(su = 1:83, RT = c(378.1328125, :
  replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
  ID = sample(letters[1:5], size = 14, replace = TRUE),
  Value = rnorm(14))[order(ID)]
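Applied to your own columns (assuming your data frame is called data with columns su and RT, as in your aggregate call), the same pattern would be:
library(data.table)
setDT(data)                          # convert to a data.table in place
data[, Mean := mean(RT), by = su][]  # add the per-subject mean as a new column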
Staying in Base R, ave is intended for this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
Simply merge your aggregated means back into the full data frame, joining by the subject (note that merge sorts the result by su unless you pass sort = FALSE):
aggdf <- aggregate(RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
data <- merge(data, aggdf, by = "su")
Another compelling way of handling this without generating extra data objects is to use group_by from the dplyr package:
# Generating some data
library(dplyr)
data <- data.table::data.table(
  su = sample(letters[1:5], size = 14, replace = TRUE),
  RT = rnorm(14))[order(su)]
# Performing
data %>%
  group_by(su) %>%
  mutate(Mean = mean(RT)) %>%
  ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978
