I have data as follows: each experiment leads to the appearance of a composition, and each composition belongs to one or more categories. I want to plot the number of occurrences of each composition:
DF <- read.table(text = " Comp Category
Comp1 1
Comp2 1
Comp3 4,2
Comp4 1,3
Comp1 1,2
Comp3 3 ", header = TRUE)
barplot(table(DF$Comp))
This worked perfectly for me.
After that: since a composition can belong to one or more categories, the categories are comma-separated. I want to barplot the compositions on the x-axis and the number of occurrences on the y-axis, with each bar showing the percentage of each category.
My idea was to duplicate the lines that contain commas, i.e. to repeat each line N+1 times, where N is its number of commas.
DF = table(DF$Category,DF$Comp)
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
DF <- as.data.frame(unclass(DF))
DF$cat <- unlist(cats)
DF <- aggregate(. ~ cat, DF, FUN = sum)
For Comp1, for example, it gives:
1 2 3 4
Comp1 2 1 0 0
But if I apply this method, the total over the categories (3) no longer corresponds to the total number of occurrences of the composition (Comp1 = 2).
How should I proceed in such a case? Is the solution to divide by the number of commas + 1? If so, how do I do that in my code, and is there a simpler way?
Thanks a lot !
Producing your plot requires two steps, as you already noticed. First, one needs to prepare the data, then one can create the plot.
Preparing the data
You have already shown your efforts of bringing the data to a suitable form, but let me propose an alternative way.
First, I have to make sure that the Category column of the data frame is a character and not a factor. I also store a vector of all the categories that appear in the data frame:
DF$Category <- as.character(DF$Category)
cats <- unique(unlist(strsplit(DF$Category, ",")))
I then need to summarise the data. For this purpose, I need a function that gives, for each value of Comp, the percentage of each category, scaled such that the sum of the values equals the number of rows in the original data with that Comp.
The following function returns this information for the entire data frame in the form of another data frame (the output needs to be a data frame, because I want to use the function with do() later).
cat_perc <- function(cats, vec) {
  # count how many entries of vec mention each category
  nums <- sapply(cats, function(cat) sum(grepl(cat, vec)))
  # convert the counts to percentages
  perc <- nums / sum(nums)
  # rescale so that the values sum to the number of entries
  final <- perc * length(vec)
  df <- as.data.frame(as.list(final))
  names(df) <- cats
  return(df)
}
Running the function on the complete data frame gives:
cat_perc(cats, DF$Category)
## 1 4 2 3
## 1 2.666667 0.6666667 1.333333 1.333333
The values sum up to six, which is indeed the total number of rows in the original data frame.
Now we want to run that function for each value of Comp, which can be done using the dplyr package:
library(dplyr)
plot_data <- group_by(DF, Comp) %>%
  do(cat_perc(cats, .$Category))
plot_data
## Source: local data frame [4 x 5]
## Groups: Comp [4]
##
## Comp 1 4 2 3
## (fctr) (dbl) (dbl) (dbl) (dbl)
## 1 Comp1 1.333333 0.0000000 0.6666667 0.0000000
## 2 Comp2 1.000000 0.0000000 0.0000000 0.0000000
## 3 Comp3 0.000000 0.6666667 0.6666667 0.6666667
## 4 Comp4 0.500000 0.0000000 0.0000000 0.5000000
This first groups the data by Comp and then applies the function cat_perc to the subset of the data frame for each given Comp.
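As a side note, in current dplyr releases do() is superseded. If you are on dplyr 0.8 or later, group_modify() expresses the same step; a minimal sketch, reusing cat_perc and cats from above:
library(dplyr)
# group_modify() hands each group's rows to the function as .x;
# cat_perc() already returns a one-row data frame, which is what it expects
plot_data <- DF %>%
  group_by(Comp) %>%
  group_modify(~ cat_perc(cats, .x$Category))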
I will plot the data with the ggplot2 package, which requires the data to be in the so-called long format. This means that each data point to be plotted should correspond to a row in the data frame. (As it is now, each row contains 4 data points.) This can be done with the tidyr package as follows:
library(tidyr)
plot_data <- gather(plot_data, Category, value, -Comp)
head(plot_data)
## Source: local data frame [6 x 3]
## Groups: Comp [4]
##
## Comp Category value
## (fctr) (chr) (dbl)
## 1 Comp1 1 1.333333
## 2 Comp2 1 1.000000
## 3 Comp3 1 0.000000
## 4 Comp4 1 0.500000
## 5 Comp1 4 0.000000
## 6 Comp2 4 0.000000
As you can see, there is now a single data point per row, characterised by Comp, Category and the corresponding value.
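If you are on tidyr 1.0 or later, note that gather() is superseded there; pivot_longer() performs the same reshaping. A sketch of the equivalent call:
library(tidyr)
# same long format as gather(plot_data, Category, value, -Comp)
plot_data <- pivot_longer(plot_data, cols = -Comp,
                          names_to = "Category", values_to = "value")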
Plotting the data
Now that everything is ready, we can plot the data using ggplot:
library(ggplot2)
ggplot(plot_data, aes(x = Comp, y = value, fill = Category)) +
  geom_bar(stat = "identity")
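In ggplot2 2.2.0 and later, geom_col() is shorthand for geom_bar(stat = "identity"), so the same plot can be written as:
ggplot(plot_data, aes(x = Comp, y = value, fill = Category)) +
  geom_col()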
Related
I have seen this question, Subsetting a data frame based on a logical condition on a subset of rows, and this page: https://statisticsglobe.com/filter-data-frame-rows-by-logical-condition-in-r
I want to subset a data.frame according to a specific pattern in one of its columns.
data <- data.frame(x1 = c(3, 7, 1, 8, 5),  # Create example data
                   x2 = letters[1:5],
                   group = c("ga1", "ga2", "gb1", "gc3", "gb1"))
data # Print example data
#   x1 x2 group
# 1  3  a   ga1
# 2  7  b   ga2
# 3  1  c   gb1
# 4  8  d   gc3
# 5  5  e   gb1
I want to subset data according to group. One subset should be the rows containing a in their group, one those containing b, and one those containing c. Maybe something with grepl?
The result should look like this:
data.a
#   x1 x2 group
# 1  3  a   ga1
# 2  7  b   ga2
data.b
#   x1 x2 group
# 3  1  c   gb1
# 5  5  e   gb1
data.c
#   x1 x2 group
# 4  8  d   gc3
I would be interested in how to produce one of these subsets; perhaps a loop would work too.
I modified the example from here https://statisticsglobe.com/filter-data-frame-rows-by-logical-condition-in-r
Extract the part of the data which you want to split on:
sub('\\d+', '', data$group)
#[1] "ga" "ga" "gb" "gc" "gb"
and use the above in split to divide the data into groups.
new_data <- split(data, sub('\\d+', '', data$group))
new_data
#$ga
# x1 x2 group
#1 3 a ga1
#2 7 b ga2
#$gb
# x1 x2 group
#3 1 c gb1
#5 5 e gb1
#$gc
# x1 x2 group
#4 8 d gc3
It is better to keep the data in a list; however, if you want a separate data frame for each group, you can use list2env:
list2env(new_data, .GlobalEnv)
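One advantage of keeping the list is that you can operate on all groups at once instead of juggling separate variables. A small illustration:
# access a single group ...
new_data$ga
# ... or apply a function to every group in one go
sapply(new_data, nrow)
# ga gb gc
#  2  2  1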
We can use group_split() with str_remove() in tidyverse:
library(dplyr)
library(stringr)
data %>%
  group_split(grp = str_remove(group, "\\d+$"), .keep = FALSE)
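One caveat worth noting: group_split() returns an unnamed list. If you want the group labels as element names, as in the split() answer above, you can set them yourself; a sketch:
grp <- str_remove(data$group, "\\d+$")
data %>%
  group_split(grp = grp, .keep = FALSE) %>%
  setNames(sort(unique(grp)))  # groups come back in sorted order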
Good question. This solution uses inputs and outputs that closely match the request: "I want to subset data according to group. One subset should be the rows containing a in their group, one containing b in their group and one c. Maybe something with grepl?".
The code below uses the data frame that was provided (named data), applies grep(), and subsets by group:
ga <- grep("ga", data$group) # separate the data by group type
gb <- grep("gb", data$group)
gc <- grep("gc", data$group)
ga1 <- data[ga,] # subset ga
gb1 <- data[gb,] # subset gb
gc1 <- data[gc,] # subset gc
print(ga1)
print(gb1)
print(gc1)
Windows and Jupyter Lab were used. The output closely matches the output shown above.
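Since the question mentions that a loop might work too, the same grep() idea generalizes with lapply over the prefixes. A minimal sketch (the prefixes vector is written out by hand here):
prefixes <- c("ga", "gb", "gc")
subsets <- lapply(prefixes, function(g) data[grep(g, data$group), ])
names(subsets) <- prefixes
subsets$ga  # same rows as ga1 above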
With a data frame like below:
set.seed(100)
df <- data.frame(id = sample(1:5, 6, replace = TRUE),
                 prop1 = rep(c("A", "B"), 3),
                 prop2 = sample(c(TRUE, FALSE), 6, replace = TRUE),
                 prop3 = sample(3:6, 6, replace = TRUE))
> df
id prop1 prop2 prop3
1 2 A FALSE 4
2 2 B TRUE 4
3 3 A FALSE 6
4 1 B TRUE 5
5 3 A FALSE 3
6 3 B FALSE 4
I need to do an aggregation by id such that, for each column prop1 to propN, histogram-like data is generated as follows.
For each id,
prop1 needs to capture the ratio of its discrete values ("A"s and "B"s) over all records with the same id, accessible by name like prop1[["A"]] and prop1[["B"]]
prop2 needs to capture the ratio of its discrete values (TRUEs and FALSEs) over all records with the same id, accessible by name like prop2[["TRUE"]] and prop2[["FALSE"]]
prop3 needs to capture the ratio of its discrete values (3, 4, 5, 6) over all records with the same id, accessible by name like prop3[["3"]], prop3[["4"]], prop3[["5"]], prop3[["6"]]
How can this aggregation for prop1 to propN be done in the above format, using base R?
Update: adding a representation of the output (shown as a spreadsheet view in the original post).
I'm not certain about the right data type to represent the output and the various components in it. In reality, the desired output should be usable as a look-up table for the per-id distributions, for further computation.
Here is an idea which uses a custom function, defined below.
It splits the data frame by id and applies prop.table(table(...)) to find the ratios. The n acts as a column index, identifying which column you want the ratios for. If n is 2, for example, then fun1 applies the formula to column 2 for each element of the list (effectively for each id). Finally, we loop over 2:ncol(df) (in your case 2:4) to get the ratios for all columns of interest, for each id.
# convert to factors to make sure you will get 0 frequencies with table as well
df[-1] <- lapply(df[-1], as.factor)

fun1 <- function(df, n) {
  as.data.frame(t(sapply(split(df, df$id), function(i) prop.table(table(i[, n])))))
}

data.frame(id = unique(sort(df$id)),
           do.call(cbind, sapply(2:ncol(df), function(i) fun1(df, i))))
# id A B FALSE. TRUE. X3 X4 X5 X6
#1 1 0.0000000 1.0000000 0.0 1.0 0.0000000 0.0000000 1 0.0000000
#2 2 0.5000000 0.5000000 0.5 0.5 0.0000000 1.0000000 0 0.0000000
#3 3 0.6666667 0.3333333 1.0 0.0 0.3333333 0.3333333 0 0.3333333
Another way to structure this would be to create a list and name each element of the list with the column names of your original df, i.e.
l1 <- sapply(2:ncol(df), function(i)fun1(df, i))
names(l1) <- names(df[-1])
#so you can extract each one separately,
l1[['prop1']]
# A B
#1 0.0000000 1.0000000
#2 0.5000000 0.5000000
#3 0.6666667 0.3333333
I think you want this:
library(reshape)

# first, convert the property columns to factors
df[-1] <- lapply(df[-1], as.factor)

# second, melt the data to long format, with one prop/value pair per row
df <- melt(df, id = c("id"), variable_name = "prop")
df$prop <- as.factor(df$prop)

# third, make the histograms with ggplot2
library(ggplot2)
h <- ggplot(df, aes(x = id))
h + geom_bar(stat = "count", aes(fill = id)) + facet_grid(~ prop + value)
To create some plots, I've already summarized my data using the following approach, which includes all the needed information.
# Load Data
RawDataSet <- read.csv("http://pastebin.com/raw/VP6cF31A", sep=";")
# Load packages
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
# summarising the data
new.df <- RawDataSet %>%
  group_by(UserEmail, location, context) %>%
  tally() %>%
  mutate(n2 = n * c(1, -1)[(location == "NOT_WITHIN") + 1L]) %>%
  group_by(UserEmail, location) %>%
  mutate(p = c(1, -1)[(location == "NOT_WITHIN") + 1L] * n / sum(n))
With some other analysis I've identified distinct user groups, and I would like the plot to show the data in that order.
The order is based on the UserEmail and is defined by the following:
order <- c("28","27","25","23","22","21","20","16","12","10","9","8","5","4","2","1","29","19","17","15","14","13","7","3","30","26","24","18","11","6")
Asking for the type of my new.df with typeof(new.df) says that it is a list. I've already tried some approaches like order_by and with_order, but so far I have not managed to order my new.df according to my order vector. Of course, the ordering could also be done in the summarising step.
Is there any way to do so?
I couldn't bring myself to create a vector named order, out of respect for the R function of that name. Instead, use match to construct an index that serves as the basis for ordering:
sorted.df <- new.df[order(match(new.df$UserEmail,
                                as.integer(c("28","27","25","23","22","21","20","16",
                                             "12","10","9","8","5","4","2","1","29",
                                             "19","17","15","14","13","7","3","30",
                                             "26","24","18","11","6")))), ]
head(sorted.df)
#---------------
Source: local data frame [6 x 6]
Groups: UserEmail, location [4]
UserEmail location context n n2 p
(int) (fctr) (fctr) (int) (dbl) (dbl)
1 28 NOT_WITHIN Clicked A 16 -16 -0.8421053
2 28 NOT_WITHIN Clicked B 3 -3 -0.1578947
3 28 WITHIN Clicked A 2 2 1.0000000
4 27 NOT_WITHIN Clicked A 4 -4 -0.8000000
5 27 NOT_WITHIN Clicked B 1 -1 -0.2000000
6 27 WITHIN Clicked A 1 1 1.0000000
(I didn't load plyr or reshape2, since at least one of those packages has a nasty habit of interacting poorly with the dplyr functions.)
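For what it's worth, the same match() idea reads a little more cleanly inside dplyr's arrange(); user_order below is just a hypothetical name for the ordering vector from the question:
user_order <- c(28, 27, 25, 23, 22, 21, 20, 16, 12, 10, 9, 8, 5, 4, 2, 1,
                29, 19, 17, 15, 14, 13, 7, 3, 30, 26, 24, 18, 11, 6)
sorted.df <- new.df %>%
  arrange(match(UserEmail, user_order))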
My initial data looks like this:
ID <- c(1, 2, 3, 4)
Value <- c("1,2", "0,-1", 1, "")
Data <- data.frame(ID, Value)
I want to compute a mean value from Value for every row, and if a row has no Value, I would like to use the mean of the other rows instead.
My idea for computing the mean as a first step was:
library(stringr)
AverageMean<-mean(as.numeric(str_split(Data$Value, ",")))
But it throws an error.
The final data should look something like this:
ID <- c(1, 2, 3, 4)
Value <- c("1,2", "0,-1", 1, "")
AverageMean <- c(1.5, -0.5, 1, 0.666)
FinalData <- data.frame(ID, Value, AverageMean)
Based on the info, and working from your code: first apply str_split to the column in question; the output is a list. To get the mean of each list element, use lapply with mean. Then unlist the result, replace the last value Val[length(Val)] with the mean of all the other values, and create the new column AverageMean from it.
Val <- unlist(lapply(str_split(Data$Value, ","),
function(x) mean(as.numeric(x), na.rm=TRUE)))
Val[length(Val)] <- mean(Val[-length(Val)], na.rm=TRUE)
Data$AverageMean <- Val
Data
# ID Value AverageMean
#1 1 1,2 1.5000000
#2 2 0,-1 -0.5000000
#3 3 1 1.0000000
#4 4 0.6666667
Update
If you have multiple missing values and want to replace that with the mean of the column,
Data <- data.frame(ID=1:5, Value=c("1,2", "0,-1", 1, "", ""))
Val <- unlist(lapply(str_split(Data$Value, ","),
function(x) mean(as.numeric(x), na.rm=TRUE)))
The above steps are the same. Then create a logical index with is.na and replace all missing values with the mean of the non-missing values, selected by negating the logical index with !is.na:
Val[is.na(Val)] <- mean(Val[!is.na(Val)])
Data$AverageMean <- Val
Data
# ID Value AverageMean
#1 1 1,2 1.5000000
#2 2 0,-1 -0.5000000
#3 3 1 1.0000000
#4 4 0.6666667
#5 5 0.6666667
I have a correlation matrix (called correl) that is 390 x 390, and I would like to scan for values between 0.80 and 0.99. I have written the following loop:
cc1 <- NA  # creates an NA vector to store values between 0.80 and 0.99
cc2 <- NA  # creates an NA vector to store the desired values
p <- dim(correl)[2]  # dim returns the size of the correlation matrix
i <- 1
while (i <= p) {
  cc1 <- correl[, correl[, i] >= 0.80 & correl[, i] < 1.00]
  cc2 <- cbind(cc2, cc1)
  i <- i + 1
}
The problem I am having is that undesired correlations (those below 0.80) also end up in cc2.
#Sample of what I mean:
SPY.Adjusted AAPL.Adjusted CHL.Adjusted CVX.Adjusted
1 SPY.Adjusted 1.0000000 0.83491778 0.6382930 0.8568000
2 AAPL.Adjusted 0.8349178 1.00000000 0.1945304 0.1194307
3 CHL.Adjusted 0.6382930 0.19453044 1.0000000 0.2991739
4 CVX.Adjusted 0.8568000 0.11943067 0.2991739 1.0000000
5 GE.Adjusted 0.6789054 0.13729877 0.3356743 0.5219169
6 GOOGL.Adjusted 0.5567947 0.10986655 0.2552149 0.2128337
I only want to return the correlations within the desired range (0.80 to 0.99), without losing the row or column names, as otherwise I would not know which is which.
Let's create a simple reproducible example:
m = matrix(runif(100), ncol=10)
rownames(m) = LETTERS[1:10]
colnames(m) = rownames(m)
The tricky part is getting a nice return structure that contains the variable names, so I would collapse the matrix into a standard data frame:
dd = data.frame(cor = as.vector(m),
                id1 = rownames(m),
                id2 = rep(rownames(m), each = nrow(m)))
Remove duplicate entries
dd = dd[as.vector(upper.tri(m, TRUE)),]
Then select as usual
dd[dd$cor > 0.8 & dd$cor < 0.99,]
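For comparison, base R's which() with arr.ind = TRUE gets at the same information directly, keeping the names via the returned row/column indices. A sketch on the same toy matrix m:
# positions of qualifying cells, restricted to the upper triangle
idx <- which(m > 0.8 & m < 0.99 & upper.tri(m), arr.ind = TRUE)
data.frame(id1 = rownames(m)[idx[, "row"]],
           id2 = colnames(m)[idx[, "col"]],
           cor = m[idx])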
Glad you found an answer, but here's another that puts the results in a tidy data frame, in case others are looking for this.
This solution uses the corrr package (along with dplyr functions that are attached with it):
library(corrr)
mtcars %>%
  correlate() %>%
  shave() %>%
  stretch(na.rm = TRUE) %>%
  filter(between(r, .8, .99))
#> # A tibble: 3 × 3
#> x y r
#> <chr> <chr> <dbl>
#> 1 cyl disp 0.9020329
#> 2 cyl hp 0.8324475
#> 3 disp wt 0.8879799
Explanation:
mtcars is the data.
correlate() creates a correlation data frame.
shave() is optional and removes the upper triangle (to remove duplicates).
stretch() converts the data frame (in matrix format) to a long format.
filter(between(r, .8, .99)) selects only the correlations between .8 and .99.
If I understood your problem correctly, one wouldn't expect a symmetric matrix as the return object. For every one of your variables, you want to extract the other variables that are highly correlated with it; but their number differs from variable to variable, so you cannot work with a matrix.
If you insist on a matrix/data frame, I would rather replace the small correlations with NA:
correl[correl<0.8] <- NA
and then access the column names of the variables highly correlated with a given variable (e.g. the one in the first row) like this:
colnames(correl)[!is.na(correl[1,])]
(Although then the NA step is kind of useless, as you could get the colnames directly with the constraint
colnames(correl)[correl[1, ] > 0.8]
)
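Building on that, if you want the list of highly correlated variables for every row at once, a minimal sketch (assuming correl still holds the correlations, with or without the NA replacement; the < 1 condition drops the diagonal):
high_cor <- lapply(seq_len(nrow(correl)), function(i)
  colnames(correl)[which(correl[i, ] >= 0.8 & correl[i, ] < 1)])
names(high_cor) <- rownames(correl)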