R - histogram data in aggregation - r

With a data frame like below:
set.seed(100)
df <- data.frame(id = sample(1:5, 6, replace = TRUE),
prop1 = rep(c("A", "B"), 3),
prop2 = sample(c(TRUE, FALSE), 6, replace = TRUE),
prop3=sample(3:6, 6, replace = TRUE))
> df
  id prop1 prop2 prop3
1  2     A FALSE     4
2  2     B  TRUE     4
3  3     A FALSE     6
4  1     B  TRUE     5
5  3     A FALSE     3
6  3     B FALSE     4
I need to aggregate by id such that, for each column prop1 to propN, histogram data is generated as follows.
For each id,
prop1 needs to capture the ratio of each discrete value ("A", "B") across all records with the same id, accessible by name like prop1[["A"]] & prop1[["B"]]
prop2 needs to capture the ratio of each discrete value ("TRUE", "FALSE") across all records with the same id, accessible by name like prop2[["TRUE"]] & prop2[["FALSE"]]
prop3 needs to capture the ratio of each discrete value ("3", "4", "5", "6") across all records with the same id, accessible by name like prop3[["3"]], prop3[["4"]], prop3[["5"]], prop3[["6"]]
How can I get the aggregation for prop1 to propN done in the above format, using base R?
Update: Adding output representation.
I'm not certain about the right data type to represent the output and its various components. However, a spreadsheet view of the output would be as follows. In reality, the desired output is in a form that can be used as a look-up table for the distribution on a per-id basis for further computation.

Here is an idea which uses a custom function defined as follows:
It splits the data frame based on the id and applies the formula (prop.table(table(...))) for finding the ratio. The n acts as an index so as to identify for which column you need the ratio. If n is 2 for example, then fun1 will apply the formula of finding the ratio to column 2 for each element of the list (effectively for each id). Finally, we apply the function via looping through 2:ncol(df) (so in your case 2:4) in order to get the ratio for all columns of interest, for each id.
#convert to factors to make sure you will get 0 frequencies with table as well
df[-1] <- lapply(df[-1], as.factor)
fun1 <- function(df, n){as.data.frame(t(sapply(split(df, df$id), function(i)
prop.table(table(i[,n])))))}
# lapply (rather than sapply) always returns a list of data frames
data.frame(id = unique(sort(df$id)),
do.call(cbind, lapply(2:ncol(df), function(i)fun1(df, i))))
# id A B FALSE. TRUE. X3 X4 X5 X6
#1 1 0.0000000 1.0000000 0.0 1.0 0.0000000 0.0000000 1 0.0000000
#2 2 0.5000000 0.5000000 0.5 0.5 0.0000000 1.0000000 0 0.0000000
#3 3 0.6666667 0.3333333 1.0 0.0 0.3333333 0.3333333 0 0.3333333
Another way to structure this, would be to create a list and name each element of the list with the column names of your original df. i.e.
l1 <- lapply(2:ncol(df), function(i)fun1(df, i))
names(l1) <- names(df[-1])
#so you can extract each one separately,
l1[['prop1']]
# A B
#1 0.0000000 1.0000000
#2 0.5000000 0.5000000
#3 0.6666667 0.3333333
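As a hedged sketch of the look-up idea, here is a different route from fun1: a two-way table(id, column) normalised by row with prop.table(..., margin = 1). The data frame is rebuilt by hand from the printed df above, and the name lookup is made up for this illustration.

```r
# Rebuild the example data by hand (values copied from the printed df above)
df <- data.frame(id = c(2, 2, 3, 1, 3, 3),
                 prop1 = rep(c("A", "B"), 3),
                 prop2 = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE),
                 prop3 = c(4, 4, 6, 5, 3, 4))
# Factors keep zero-frequency levels in the tables
df[-1] <- lapply(df[-1], as.factor)

# One id-by-level table of row proportions per property column
# ('lookup' is a hypothetical name used only in this sketch)
lookup <- lapply(df[-1], function(col) prop.table(table(df$id, col), margin = 1))

# Share of "A" in prop1 among the rows with id 3
lookup[["prop1"]]["3", "A"]
```

Indexing by character row and column names makes the result usable as a look-up table, e.g. lookup[["prop3"]]["1", "5"].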

I think you want this:
library(reshape)
# first, convert the property columns to factors
df[-1] <- lapply(df[-1],as.factor)
# second, reshape to long format: one row per id, prop and value
df <- melt(df,id=c("id"),variable_name = "prop")
df$prop <- as.factor(df$prop)
# third, make the bar charts with ggplot2
library(ggplot2)
h <- ggplot(df,aes(x=id))
h + geom_bar(stat="count", aes(fill=factor(id))) + facet_grid(~ prop + value)

Related

how to create a row that is calculated from another row automatically like how we do it in excel?

Does anyone know how to have a row in R that is calculated from another row automatically? i.e.
let's say in Excel, I want to make a row C, which is made up of (B2/B1)
e.g. C1 = B2/B1
C2 = B3/B2
...
Cn = Bn+1/Bn
But in Excel, we only need to enter one calculation and then drag it down. How do we do it in R?
In R you work with columns as vectors so the operations are vectorized. The calculations as described could be implemented by the following commands, given a data.frame df (i.e. a table) and the respective column names as mentioned:
df["C1"] <- df["B2"]/df["B1"]
df["C2"] <- df["B3"]/df["B2"]
In R you usually would name the columns according to the content they hold. With that, you refer to the columns by their name, although you can also address the first column as df[, 1], the first row as df[1, ] and so on.
EDIT 1:
There are multiple ways - and certainly some more elegant ways to get it done - but for understanding I kept it in simple base R:
Example dataset for demonstration:
df <- data.frame("B1" = c(1, 2, 3),
"B2" = c(2, 4, 6),
"B3" = c(4, 8, 12))
Column calculation:
for (i in 1:(ncol(df) - 1)) {
col_name <- paste0("C", i)
df[col_name] <- df[, i+1]/df[, i]
}
Output:
  B1 B2 B3 C1 C2
1  1  2  4  2  2
2  2  4  8  2  2
3  3  6 12  2  2
So you iterate through the available columns B1/B2/B3. Dynamically create a column name in every iteration, based on the number of the current iteration, and then calculate the respective column contents.
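The same column ratios can probably be computed without a loop. The sketch below assumes the three-column example above and relies on data frame arithmetic being element-wise when both operands have the same dimensions; the name ratios is made up for this illustration.

```r
df <- data.frame(B1 = c(1, 2, 3),
                 B2 = c(2, 4, 6),
                 B3 = c(4, 8, 12))

# Columns 2..n divided element-wise by columns 1..(n-1)
ratios <- df[-1] / df[-ncol(df)]
names(ratios) <- paste0("C", seq_along(ratios))

df <- cbind(df, ratios)
df
```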
EDIT 2:
Rowwise, as you actually meant it apparently, works similarly:
a <- c(10,15,20, 1)
df <- data.frame(a)
for (i in 1:nrow(df)) {
df$b[i] <- df$a[i+1]/df$a[i]
}
Output:
   a        b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4  1       NA
You can do this just using vectors, without a for loop.
a <- c(10,15,20, 1)
df <- data.frame(a)
df$b <- c(df$a[-1], 0) / df$a
print(df)
   a        b
1 10 1.500000
2 15 1.333333
3 20 0.050000
4  1 0.000000
Explanation:
In the example data, df$a is the vector 10 15 20 1.
df$a[-1] is the same vector with its first element removed, 15 20 1.
Then c() adds a new element to the end so that the vector has the same length as before:
c(df$a[-1],0) which is 15 20 1 0
What we want for column b is this vector divided by the original df$a.
So:
df$b <- c(df$a[-1], 0) / df$a
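If the last row should be NA rather than 0 (matching the loop version, where there is no next value to divide), a small variation is to pad with NA instead. This is only a sketch on the same example vector:

```r
a <- c(10, 15, 20, 1)
df <- data.frame(a)

# Next value divided by current value; the last row has no next value, so it is NA
df$b <- c(df$a[-1] / df$a[-nrow(df)], NA)
df
```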

Plot many categories

I have data as follows: each experiment leads to the appearance of a composition, and each composition belongs to one or more categories. I want to plot the number of occurrences of each composition:
DF <- read.table(text = " Comp Category
Comp1 1
Comp2 1
Comp3 4,2
Comp4 1,3
Comp1 1,2
Comp3 3 ", header = TRUE)
barplot(table(DF$Comp))
So this worked perfectly for me.
Next, since each composition belongs to one or more categories, the categories are separated by commas. I want a bar plot with the composition on X and the number of compositions on Y, and within each bar the % of each category.
My idea was to duplicate each line that contains a comma, repeating it N+1 times, where N is the number of commas.
DF = table(DF$Category,DF$Comp)
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
DF <- as.data.frame(unclass(DF))
DF$cat <- unlist(cats)
DF <- aggregate(. ~ cat, DF, FUN = sum)
It gives me, for example, for Comp1:
      1 2 3 4
Comp1 2 1 0 0
But if I apply this method, the total number of categories (3) won't correspond to the total number of compositions (Comp1 = 2).
How should I proceed in such a case? Is the solution to divide by the number of commas + 1? If yes, how do I do this in my code, and is there a simpler way?
Thanks a lot !
Producing your plot requires two steps, as you already noticed. First, one needs to prepare the data, then one can create the plot.
Preparing the data
You have already shown your efforts of bringing the data to a suitable form, but let me propose an alternative way.
First, I have to make sure that the Category column of the data frame is a character and not a factor. I also store a vector of all the categories that appear in the data frame:
DF$Category <- as.character(DF$Category)
cats <- unique(unlist(strsplit(DF$Category, ",")))
I then need to summarise the data. For this purpose, I need a function that gives, for each value in Comp, the percentage for each category, scaled such that the values sum to the number of rows in the original data with that Comp.
The following function returns this information for the entire data frame in the form of another data frame (the output needs to be a data frame, because I want to use the function with do() later).
cat_perc <- function(cats, vec) {
# percentages
nums <- sapply(cats, function(cat) sum(grepl(cat, vec)))
perc <- nums/sum(nums)
final <- perc * length(vec)
df <- as.data.frame(as.list(final))
names(df) <- cats
return(df)
}
Running the function on the complete data frame gives:
cat_perc(cats, DF$Category)
## 1 4 2 3
## 1 2.666667 0.6666667 1.333333 1.333333
The values sum up to six, which is indeed the total number of rows in the original data frame.
Now we want to run that function for each value of Comp, which can be done using the dplyr package:
library(dplyr)
plot_data <-
group_by(DF, Comp) %>%
do(cat_perc(cats, .$Category))
plot_data
## plot_data
## Source: local data frame [4 x 5]
## Groups: Comp [4]
##
## Comp 1 4 2 3
## (fctr) (dbl) (dbl) (dbl) (dbl)
## 1 Comp1 1.333333 0.0000000 0.6666667 0.0000000
## 2 Comp2 1.000000 0.0000000 0.0000000 0.0000000
## 3 Comp3 0.000000 0.6666667 0.6666667 0.6666667
## 4 Comp4 0.500000 0.0000000 0.0000000 0.5000000
This first groups the data by Comp and then applies the function cat_perc to only the subset of the data frame with a given Comp.
I will plot the data with the ggplot2 package, which requires the data to be in the so-called long format. This means that each data point to be plotted should correspond to a row in the data frame. (As it is now, each row contains 4 data points.) This can be done with the tidyr package as follows:
library(tidyr)
plot_data <- gather(plot_data, Category, value, -Comp)
head(plot_data)
## Source: local data frame [6 x 3]
## Groups: Comp [4]
##
## Comp Category value
## (fctr) (chr) (dbl)
## 1 Comp1 1 1.333333
## 2 Comp2 1 1.000000
## 3 Comp3 1 0.000000
## 4 Comp4 1 0.500000
## 5 Comp1 4 0.000000
## 6 Comp2 4 0.000000
As you can see, there is now a single data point per row, characterised by Comp, Category and the corresponding value.
Plotting the data
Now that everything is ready, we can plot the data using ggplot:
library(ggplot2)
ggplot(plot_data, aes(x = Comp, y = value, fill = Category)) +
geom_bar(stat = "identity")
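For comparison, the idea raised in the question (dividing by the number of commas + 1) can be sketched in base R. Note that this is a different weighting from cat_perc above: here each row splits a weight of 1 equally among its own categories, so the numbers per category differ, although each bar still sums to the row count of its Comp. The names long and weighted are made up for this illustration, and the data are retyped from the question.

```r
DF <- data.frame(Comp = c("Comp1", "Comp2", "Comp3", "Comp4", "Comp1", "Comp3"),
                 Category = c("1", "1", "4,2", "1,3", "1,2", "3"),
                 stringsAsFactors = FALSE)

cats <- strsplit(DF$Category, ",", fixed = TRUE)
n_cats <- lengths(cats)  # number of commas + 1 for each row

# One row per (Comp, single category); each original row contributes a total weight of 1
long <- data.frame(Comp = rep(DF$Comp, n_cats),
                   Category = unlist(cats),
                   weight = rep(1 / n_cats, n_cats))

# Category x Comp matrix of summed weights
weighted <- xtabs(weight ~ Category + Comp, data = long)
colSums(weighted)  # equals the number of rows per Comp

# Stacked bars: one bar per Comp, one segment per category
barplot(weighted, legend.text = TRUE)
```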

subset of data frame based on multiple conditions

I'm having trouble with a particular task in my code. I have a data frame as
n <- 6
set.seed(123)
df <- data.frame(x=paste0("x",seq_along(1:n)), A=sample(c(-2:2),n,replace=TRUE), B=sample(c(-1:3),n,replace=TRUE))
#
# x A B
# 1 x1 -1 1
# 2 x2 1 3
# 3 x3 0 1
# 4 x4 2 1
# 5 x5 2 3
# 6 x6 -2 1
and a decision tree as
A>0;Y;Y;N;N
B>1;Y;N;Y;N
C;1;2;2;1
that I load by
dt <- read.csv2("tmp.csv", header=FALSE)
I'd like to loop over all the possible combinations of (A>0) and (B>1) and assign the C value to the subset of rows of df that satisfy each combination. So, here's what I did
nr <- 3
nc <- 5
cond <- dt[1:(nr-1),1,drop=FALSE]
rule <- dt[nr,1,drop=FALSE]
subdf <- vector(mode="list",2^(nr-1))
for (i in 2:nc) {
check <- paste0("")
for (j in 1:(nr-1)) {
case <- paste0(dt[j,1])
if (dt[j,i]=="N")
case <- paste0("!",case)
check <- paste0(check, "(", case, ")" )
if (j<(nr-1))
check <- paste0(check, "&")
}
subdf[i] <- subset(df,check)
subdf[i]$C <- dt[nr,i]
}
unlist(subdf)
Unfortunately, I got an error from subset, since it cannot parse the conditions from my string statements. What should I do?
Your issue is in how you create the subset: the subset command expects a logical vector and you gave it a string ('check'). So the simplest solution here is to add a 'parse'. I feel there is a more elegant way to solve this problem and I hope someone will come along and do it, but you can fix the final part of your code with the following
mysubset <- subset(df,with(df,eval(parse(text=check))))
if(nrow(mysubset)>0){
mysubset$C <- dt[nr,i]
}
subdf[[i]]<-mysubset
I have added the parse/eval part to generate a vector of booleans to subset only the 'TRUE' cases, and added a check for whether C could be added (will give error if there are no rows).
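To see the parse/eval step in isolation, here is a minimal sketch on one hand-written condition string. The data frame is retyped from the printed df in the question, and keep is a name made up for this illustration.

```r
df <- data.frame(x = paste0("x", 1:6),
                 A = c(-1, 1, 0, 2, 2, -2),
                 B = c(1, 3, 1, 1, 3, 1))

# A condition string of the kind the loop builds in 'check'
check <- "(A>0)&(!B>1)"

# parse() turns the string into an expression; eval() evaluates it with
# the columns of df visible as variables, giving one logical per row
keep <- eval(parse(text = check), envir = df)

df[keep, ]
```

Because comparison binds tighter than ! in R, "!B>1" is read as !(B > 1), which is what the decision table intends.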
Based on the previous answer, I came up with a more elegant/practical way of generating a vector of combined rules, and then applying them all to the data, using apply/lapply.
##create list of formatted rules
#format each 'building' block separately,
#based on rows in 'dt'.
part_conditions <- apply(dt[-nrow(dt),],MARGIN=1,FUN=function(x){
res <- sprintf("(%s%s)", ifelse(x[-1]=="Y","","!"), x[1])
})
# > part_conditions
# 1 2
# [1,] "(A>0)" "(B>1)"
# [2,] "(A>0)" "(!B>1)"
# [3,] "(!A>0)" "(B>1)"
# [4,] "(!A>0)" "(!B>1)"
#combine to vector of conditions
conditions <- apply(part_conditions, MARGIN=1,FUN=paste, collapse="&")
# > conditions
# [1] "(A>0)&(B>1)" "(A>0)&(!B>1)" "(!A>0)&(B>1)" "(!A>0)&(!B>1)"
#for each condition, test in data whether condition is 'T'
temp <- sapply(conditions, function(rule){
return(with(df, eval(parse(text=rule))))
}
)
rules <- as.numeric(t(dt[nrow(dt),-1]))
#then find which of the (in this case) four is 'T', and put the appropriate rule
#in df
df$C <- rules[apply(temp,1,which)]
> df
x A B C
1 x1 -1 1 1
2 x2 1 3 1
3 x3 0 1 1
4 x4 2 1 2
5 x5 2 3 1
6 x6 -2 1 1
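If building and parsing strings feels fragile, the same assignment can be sketched without eval/parse by encoding each row as a "Y"/"N" key and matching it against the decision table's columns. This assumes the two conditions are hard-coded as A > 0 and B > 1; rule_keys and rule_values are made-up names holding the table columns retyped from the question.

```r
df <- data.frame(x = paste0("x", 1:6),
                 A = c(-1, 1, 0, 2, 2, -2),
                 B = c(1, 3, 1, 1, 3, 1))

# Columns 2..5 of the decision table, retyped by hand:
# answers to (A>0, B>1) and the C value each combination maps to
rule_keys   <- c("YY", "YN", "NY", "NN")
rule_values <- c(1, 2, 2, 1)

# Encode each row of df as its own two-letter key
row_keys <- paste0(ifelse(df$A > 0, "Y", "N"),
                   ifelse(df$B > 1, "Y", "N"))

# Look up the C value whose key matches
df$C <- rule_values[match(row_keys, rule_keys)]
df
```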

Weighted Sampling with multiple probability vectors in R

I have a question similar to this one:
Weighted sampling with 2 vectors
I now have a dataset which contains 1000 observations and 4 columns for each observation. I want to sample 200 observations from the original dataset with replacement.
But the PROBLEM is: I need to assign a different probability vector to each column. For example, for the first column I want equal probability c(0.001,0.001,0.001,0.001...). For the second column, I want something different like c(0.0005,0.0002,......). Of course, each probability vector sums to 1.
I know sample can do this with one vector, but I am not sure about other commands. Please HELP me!
Thank you in advance!
Colamonkey
data frame with sample probabilities
# in your case the rows are 1000 and the columns 4,
# but it is just to show the procedure
samp_prob <- data.frame(A = rep(.25, 4), B = c(.5, .1, .2, .2), C = c(.3, .6, .05, .05))
data frame of values to sample from with replacement
df <- data.frame(a = 1:4, b = 2:5, c = 3:6)
sampling
sam <- mapply(function(x, y) sample(x, 200, replace = TRUE, prob = y), df, samp_prob)
head(sam)
a b c
[1,] 4 5 6
[2,] 1 2 4
[3,] 1 2 4
[4,] 4 4 4
[5,] 4 4 4
[6,] 1 2 4
# you can also write (it is equivalent):
mapply(df, samp_prob, FUN = sample, size = 200, replace = TRUE)
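A quick way to check that the prob argument behaves as intended is to draw a large sample for one column and compare the empirical frequencies with the requested weights. This is only a sanity-check sketch; the seed, the sample size of 10000 and the names vals, probs and draws are arbitrary choices for this illustration.

```r
set.seed(1)

vals  <- 1:4
probs <- c(.5, .1, .2, .2)

# Weighted sampling with replacement for a single column
draws <- sample(vals, 10000, replace = TRUE, prob = probs)

# Empirical frequencies should be close to probs
round(prop.table(table(draws)), 2)
```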

Mean of Cell in R

My Initial Data looks like this:
ID<-c(1,2,3,4)
Value<-c("1,2","0,-1",1,"","")
Data<-data.frame(ID, Value)
I want to create a MeanValue from Value for every row. If a row has no value, I would like to use the mean of the other rows' means instead.
My idea for computing the mean in the first step was:
library(stringr)
AverageMean<-mean(as.numeric(str_split(Data$Value, ",")))
But it produces an error.
The final data should look something like:
ID<-c(1,2,3,4)
Value<-c("1,2","0,-1",1,"","")
AverageMean<-c(1.5,-0.5,1,0.666,0.666)
FinalData<-data.frame(ID, Value, AverageMean)
Based on the info, and working on your code, first you do str_split on the concerned column and the output is a list. For getting the mean for individual list elements, you can use lapply with mean. Then unlist it and replace the last value Val[length(Val)] with the mean of all other values and create a new column AverageMean based on the above.
Val <- unlist(lapply(str_split(Data$Value, ","),
function(x) mean(as.numeric(x), na.rm=TRUE)))
Val[length(Val)] <- mean(Val[-length(Val)], na.rm=TRUE)
Data$AverageMean <- Val
Data
#  ID Value AverageMean
#1  1   1,2   1.5000000
#2  2  0,-1  -0.5000000
#3  3     1   1.0000000
#4  4         0.6666667
Update
If you have multiple missing values and want to replace them with the mean of the column,
Data <- data.frame(ID=1:5, Value=c("1,2", "0,-1", 1, "", ""))
Val <- unlist(lapply(str_split(Data$Value, ","),
function(x) mean(as.numeric(x), na.rm=TRUE)))
The above steps are the same. Create a logical index with is.na and replace all those missing values by the mean of values that are not missing by negating the logical index !is.na.
Val[is.na(Val)] <- mean(Val[!is.na(Val)])
Data$AverageMean <- Val
Data
#  ID Value AverageMean
#1  1   1,2   1.5000000
#2  2  0,-1  -0.5000000
#3  3     1   1.0000000
#4  4         0.6666667
#5  5         0.6666667
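The same steps can be sketched in base R without stringr, assuming Value is stored as character. Here strsplit returns an empty vector for an empty string, so its mean is NaN, which is.na also treats as missing. The names parts and row_means are made up for this illustration.

```r
Data <- data.frame(ID = 1:5,
                   Value = c("1,2", "0,-1", "1", "", ""),
                   stringsAsFactors = FALSE)

# Split each cell on commas; an empty cell yields character(0)
parts <- strsplit(Data$Value, ",", fixed = TRUE)

# Mean per row; the mean of an empty vector is NaN
row_means <- vapply(parts, function(p) mean(as.numeric(p)), numeric(1))

# Replace missing row means by the mean of the available ones
row_means[is.na(row_means)] <- mean(row_means, na.rm = TRUE)

Data$AverageMean <- row_means
Data
```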
