combine rowPerc, colPerc in one matrix? - r

Suppose I have sales data for different products, in different categories and different months, and I want to see what percentage of the sales, or of the number of items, falls in each category.
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 8 1 X K John
1 A 2 6 9 2 X K John
1 A 2 5 8 3 X K John
2 B 2 4 6 1 X L Sam
2 B 2 3 4 2 X L Sam
2 B 2 5 7 3 X L Sam
3 C 2 5 11 1 X M John
3 C 2 5 11 2 X L John
3 C 2 5 11 3 X K John
4 D 2 8 10 1 Y M John
4 D 2 8 10 2 Y K John
4 D 2 5 7 3 Y K John
5 E 2 5 9 1 Y M Sam
5 E 2 5 9 2 Y L Sam
5 E 2 5 9 3 Y M Sam
6 F 2 4 7 1 Z M Kyle
6 F 2 5 8 2 Z L Kyle
6 F 2 5 8 3 Z M Kyle
Applying table to Category and Mode shows how many times each category occurs under each mode:
K L M
X 4 4 1
Y 2 1 3
Z 0 1 2
Now rowPerc and colPerc give the percentages either row-wise or column-wise.
But what if I want to know, for example, what percentage of the total data X/K makes up? Here that is 22.22% (the matrix holds 18 observations in total).
Is there a way to get a matrix in which each cell is its percentage of the total data?
Something like this:
K L M
X 22.22 22.22 5.55
Y 11.11 5.55 16.67
Z 0.00 5.55 11.11
so that the whole matrix sums to 100%, rather than each row or column.
I hope I explained it clearly. Thanks

If df is your data.frame
with(df, prop.table(table(Category, Mode))*100)
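A minimal reproducible sketch of the line above, rebuilding only the Category and Mode columns from the question's 18 rows. The name df follows the answer; the helper names counts, pct, row_pct and col_pct, and the round() call, are added here purely for illustration.

```r
# Rebuild only the two columns needed, in the same row order as the question
df <- data.frame(
  Category = rep(c("X", "Y", "Z"), times = c(9, 6, 3)),
  Mode = c("K", "K", "K", "L", "L", "L", "M", "L", "K",
           "M", "K", "K", "M", "L", "M",
           "M", "L", "M")
)

# Counts per Category/Mode cell
counts <- with(df, table(Category, Mode))

# With no margin argument, prop.table() divides every cell by the grand total
pct <- round(prop.table(counts) * 100, 2)
pct

# For comparison: margin = 1 gives row percentages, margin = 2 column percentages
row_pct <- round(prop.table(counts, margin = 1) * 100, 2)
col_pct <- round(prop.table(counts, margin = 2) * 100, 2)
```

Only the margin argument changes between the three variants, so the same table object can serve all of them.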

Related

Assign unique non-repeated ID to nested groups with the same values in R

I have run across similar questions, but have not been able to find an answer for my specific needs.
I have a data set with a nested group design and I need to include a unique non-repeating ID to nested groups that can have identical values. While I regularly conduct this type of data wrangling, both the structure of this data set as well as the required outcome are beyond my skillset at this time.
Below I have provided an example data set (df) and what the results should look like.
I used the below code in my actual data set, but realized that it fails under certain circumstances...which are exaggerated in the example data set provided here. I prefer the ID to be sequentially numbered.
df$ID = cumsum(c(TRUE, diff(df$LENGTH) != 0))
I am open to all options (e.g., library(data.table), library(boot), etc) as it would be great if others find this post useful. However, I prefer solutions that do not require the installation and loading of additional packages.
Thanks in advance for your help.
Take care.
df <- read.table(text = "GROUP REGION TIME LENGTH
a x 1 3
a x 2 3
a x 3 3
a y 4 3
a y 5 3
a y 6 3
a z 7 2
a z 8 2
b z 1 2
b z 2 2
b x 3 2
b x 4 2
c x 1 2
c x 2 2
c y 3 2
c y 4 2
c x 5 2
c x 6 2
c z 7 1", header = TRUE)
result <- read.table(text = "GROUP REGION TIME LENGTH ID
a x 1 3 1
a x 2 3 1
a x 3 3 1
a y 4 3 2
a y 5 3 2
a y 6 3 2
a z 7 2 3
a z 8 2 3
b z 1 2 4
b z 2 2 4
b x 3 2 5
b x 4 2 5
c x 1 2 6
c x 2 2 6
c y 3 2 7
c y 4 2 7
c x 5 2 8
c x 6 2 8
c z 7 1 9", header = TRUE)
Paste GROUP and REGION columns and use rle to create a sequential ID column.
transform(df, ID = with(rle(paste(GROUP, REGION)), rep(seq_along(values), lengths)))
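To make the rle step easier to follow, here is a sketch that unpacks it on a shortened copy of the question's data (groups b and c only). The intermediate name runs is introduced here for illustration and is not part of the answer.

```r
# A shortened copy of the question's data (groups b and c only)
df <- read.table(text = "GROUP REGION TIME LENGTH
b z 1 2
b z 2 2
b x 3 2
b x 4 2
c x 1 2
c x 2 2
c y 3 2
c y 4 2
c x 5 2
c x 6 2
c z 7 1", header = TRUE)

# The LENGTH-based attempt sees only one change (2 -> 1), so it yields two IDs
length_id <- cumsum(c(TRUE, diff(df$LENGTH) != 0))

# rle() on the combined key finds each run of identical GROUP/REGION pairs
runs <- rle(paste(df$GROUP, df$REGION))
runs$values   # one entry per run, in order of appearance
runs$lengths  # number of consecutive rows in each run

# Repeat each run's index by its length to get one ID per row
df$ID <- rep(seq_along(runs$values), runs$lengths)
df
```

Because the key combines both columns, "b x" followed by "c x" starts a new run, and the second block of "c x" gets its own ID rather than reusing the first one.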
In data.table we can use rleid.
library(data.table)
setDT(df)[, ID := rleid(GROUP, REGION)]
# GROUP REGION TIME LENGTH ID
# 1: a x 1 3 1
# 2: a x 2 3 1
# 3: a x 3 3 1
# 4: a y 4 3 2
# 5: a y 5 3 2
# 6: a y 6 3 2
# 7: a z 7 2 3
# 8: a z 8 2 3
# 9: b z 1 2 4
#10: b z 2 2 4
#11: b x 3 2 5
#12: b x 4 2 5
#13: c x 1 2 6
#14: c x 2 2 6
#15: c y 3 2 7
#16: c y 4 2 7
#17: c x 5 2 8
#18: c x 6 2 8
#19: c z 7 1 9
Another base R option, but without rle
transform(
df,
ID = cumsum(c(1, (s <- paste0(GROUP, REGION))[-1] != head(s, -1)))
)
gives
GROUP REGION TIME LENGTH ID
1 a x 1 3 1
2 a x 2 3 1
3 a x 3 3 1
4 a y 4 3 2
5 a y 5 3 2
6 a y 6 3 2
7 a z 7 2 3
8 a z 8 2 3
9 b z 1 2 4
10 b z 2 2 4
11 b x 3 2 5
12 b x 4 2 5
13 c x 1 2 6
14 c x 2 2 6
15 c y 3 2 7
16 c y 4 2 7
17 c x 5 2 8
18 c x 6 2 8
19 c z 7 1 9
With dplyr (note that rleid still comes from data.table)
library(dplyr)
library(data.table)
df %>%
  mutate(ID = rleid(GROUP, REGION))

Output colMeans in columns rather than rows

When I use the colMeans function on a dataset, R outputs the means into rows rather than the original column format. Here is an example:
Year J F M A M J J A S O N D
1851 4 6 3 6 9 7 1 2 8 9 5 0
1852 3 8 5 5 5 3 2 8 6 7 4 2
1853 5 7 4 8 6 9 4 4 4 2 1 2
When I use the function
colMeans(df)
The output is returned as:
Year Mean
J 4
F 7
M 4
A 6
etc...
How can I develop the script to ensure the output is organised in columns like the original data rather than rows? It should look like:
J F M A M J J A S O N D
4 7 4 6 etc............
considering input as
dft <- read.table(header = TRUE, text = "Year J F M A M J J A S O N D
1851 4 6 3 6 9 7 1 2 8 9 5 0
1852 3 8 5 5 5 3 2 8 6 7 4 2
1853 5 7 4 8 6 9 4 4 4 2 1 2",stringsAsFactors=FALSE)
you could try
t(round(colMeans(dft),0))
which gives
Year J F M A M.1 J.1 J.2 A.1 S O N D
[1,] 1852 4 7 4 6 7 6 2 5 6 6 3 1
and then get rid of the Year field if you want to.
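One way to drop the Year field is to remove it before averaging. This is a sketch; the names monthly, means_wide and means_df are introduced here for illustration.

```r
dft <- read.table(header = TRUE, text = "Year J F M A M J J A S O N D
1851 4 6 3 6 9 7 1 2 8 9 5 0
1852 3 8 5 5 5 3 2 8 6 7 4 2
1853 5 7 4 8 6 9 4 4 4 2 1 2")

# Drop the Year column by name before averaging
monthly <- dft[, names(dft) != "Year"]

# colMeans() returns a named numeric vector; t() turns it into a 1-row matrix
means_wide <- t(round(colMeans(monthly)))

# as.data.frame() keeps the same one-row, many-column layout
means_df <- as.data.frame(means_wide)
means_df
```

The repeated month initials are renamed by read.table (M.1, J.1, J.2, A.1), which is why the column names differ slightly from the original header.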

how to delete dataframe's row with 3 of 5 equal element?

I have a dataframe with 5 columns and many many rows, that have repetition of elements only for the first 3 columns (in short, it is a volume built by several volumes, and so there are same coordinates (x,y,z) with different labels, and I would like to eliminate the repeated coordinates).
How can I eliminate these with R commands?
Thanks
AV
You can use the duplicated function, e.g.:
# create an example data.frame
Lab1<-letters[1:10]
Lab2<-LETTERS[1:10]
x <- c(3,4,3,3,4,2,4,3,9,0)
y <- c(3,4,3,5,4,2,1,5,7,2)
z <- c(8,7,8,8,4,3,1,8,6,3)
DF <- data.frame(Lab1,Lab2,x,y,z)
> DF
Lab1 Lab2 x y z
1 a A 3 3 8
2 b B 4 4 7
3 c C 3 3 8
4 d D 3 5 8
5 e E 4 4 4
6 f F 2 2 3
7 g G 4 1 1
8 h H 3 5 8
9 i I 9 7 6
10 j J 0 2 3
# remove rows having repeated x,y,z
DF2 <- DF[!duplicated(DF[,c('x','y','z')]),]
> DF2
Lab1 Lab2 x y z
1 a A 3 3 8
2 b B 4 4 7
4 d D 3 5 8
5 e E 4 4 4
6 f F 2 2 3
7 g G 4 1 1
9 i I 9 7 6
10 j J 0 2 3
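duplicated keeps the first row of each repeated coordinate by default. The sketch below shows two variations on the same example data: keeping the last row instead, via the fromLast argument, and dropping every coordinate that occurs more than once. The names coords, keep_first, keep_last and keep_unique are introduced here for illustration.

```r
DF <- data.frame(
  Lab1 = letters[1:10],
  Lab2 = LETTERS[1:10],
  x = c(3, 4, 3, 3, 4, 2, 4, 3, 9, 0),
  y = c(3, 4, 3, 5, 4, 2, 1, 5, 7, 2),
  z = c(8, 7, 8, 8, 4, 3, 1, 8, 6, 3)
)
coords <- DF[, c("x", "y", "z")]

# Default: keep the first row of each repeated coordinate
keep_first <- DF[!duplicated(coords), ]

# fromLast = TRUE scans from the bottom, so the last row of each group is kept
keep_last <- DF[!duplicated(coords, fromLast = TRUE), ]

# Drop every row whose coordinate occurs more than once
keep_unique <- DF[!(duplicated(coords) | duplicated(coords, fromLast = TRUE)), ]
```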
EDIT:
To choose which of the rows sharing the same coordinates is kept, you can use, for example, the by function (although it is less efficient than the previous approach):
res <- by(DF,
          INDICES=paste(DF$x,DF$y,DF$z,sep='|'),
          FUN=function(equalRows){
            # equalRows is a data.frame with the rows having the same x,y,z
            # for example, here we choose the first row ordering by Lab1 then Lab2
            row <- equalRows[order(equalRows$Lab1,equalRows$Lab2),][1,]
            return(row)
          })
DF2 <- do.call(rbind.data.frame,res)
> DF2
Lab1 Lab2 x y z
0|2|3 j J 0 2 3
2|2|3 f F 2 2 3
3|3|8 a A 3 3 8
3|5|8 d D 3 5 8
4|1|1 g G 4 1 1
4|4|4 e E 4 4 4
4|4|7 b B 4 4 7
9|7|6 i I 9 7 6

Delete certain rows in a group of rows in R

Suppose I have this dataset
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 0 0 1 X K John
1 A 2 0 0 2 X K John
1 A 2 5 8 3 X K John
1 A 2 5 8 4 X L Sam
2 B 2 3 4 1 X L Sam
2 B 2 0 0 2 X L Sam
2 B 2 0 0 3 X M John
2 B 2 0 0 4 X L John
3 C 2 0 0 1 X K John
3 C 2 8 10 2 Y M John
3 C 2 8 10 3 Y K John
3 C 2 0 0 4 Y K John
5 E 2 0 0 1 Y M Sam
5 E 2 5 5 2 Y L Sam
5 E 2 5 9 3 Y M Sam
5 E 2 0 0 4 Z M Kyle
5 E 2 5 8 5 Z L Kyle
5 E 2 5 8 6 Z M Kyle
I want to delete rows with zeroes in the sales and Profit columns, by Id group.
So, for a given Id, if two or more consecutive rows have zero values for both sales and Profit, those rows get deleted. The dataset then becomes:
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 8 3 X K John
1 A 2 5 8 4 X L Sam
2 B 2 3 4 1 X L Sam
3 C 2 0 0 1 X K John
3 C 2 8 10 2 Y M John
3 C 2 8 10 3 Y K John
3 C 2 0 0 4 Y K John
5 E 2 0 0 1 Y M Sam
5 E 2 5 5 2 Y L Sam
5 E 2 5 9 3 Y M Sam
5 E 2 0 0 4 Z M Kyle
5 E 2 5 8 5 Z L Kyle
5 E 2 5 8 6 Z M Kyle
I can remove all rows that have zero values for sales and Profit with
df1 = df[!(df$sales==0 & df$Profit==0),]
But how do I delete rows only within a certain group, in this case by Id?
P.S. The idea is to delete the entries for products that started selling only after a few months, or were abandoned after a few months, within a yearly cycle.
Here's an approach using rleid from "data.table":
library(data.table)
as.data.table(mydf)[, N := .N, by = .(Id, rleid(sales == 0 & Profit == 0))][
!(sales == 0 & Profit == 0 & N >= 2)]
## Id Name Price sales Profit Month Category Mode Supplier N
## 1: 1 A 2 5 8 3 X K John 2
## 2: 1 A 2 5 8 4 X L Sam 2
## 3: 2 B 2 3 4 1 X L Sam 1
## 4: 3 C 2 0 0 1 X K John 1
## 5: 3 C 2 8 10 2 Y M John 2
## 6: 3 C 2 8 10 3 Y K John 2
## 7: 3 C 2 0 0 4 Y K John 1
## 8: 5 E 2 0 0 1 Y M Sam 1
## 9: 5 E 2 5 5 2 Y L Sam 2
## 10: 5 E 2 5 9 3 Y M Sam 2
## 11: 5 E 2 0 0 4 Z M Kyle 1
## 12: 5 E 2 5 8 5 Z L Kyle 2
## 13: 5 E 2 5 8 6 Z M Kyle 2
Here's how to do it with dplyr. Basically, I keep only the lines whose sales are not zero, or whose previous and following lines (within the same Id) both have non-zero sales.
library(dplyr)
table1 %>%
  group_by(Id) %>%
  mutate(Lag=lag(sales),Lead=lead(sales)) %>%
  rowwise() %>%
  mutate(Min=min(Lag,Lead,na.rm=TRUE)) %>%
  filter(sales>0|Min>0) %>%
  select(-Lead,-Lag,-Min)
Id Name Price sales Profit Month Category Mode Supplier
(int) (chr) (int) (int) (int) (int) (chr) (chr) (chr)
1 1 A 2 5 8 3 X K John
2 1 A 2 5 8 4 X L Sam
3 2 B 2 3 4 1 X L Sam
4 3 C 2 0 0 1 X K John
5 3 C 2 8 10 2 Y M John
6 3 C 2 8 10 3 Y K John
7 3 C 2 0 0 4 Y K John
8 5 E 2 0 0 1 Y M Sam
9 5 E 2 5 5 2 Y L Sam
10 5 E 2 5 9 3 Y M Sam
11 5 E 2 0 0 4 Z M Kyle
12 5 E 2 5 8 5 Z L Kyle
13 5 E 2 5 8 6 Z M Kyle
Data
table1 <-read.table(text="
Id,Name,Price,sales,Profit,Month,Category,Mode,Supplier
1,A,2,0,0,1,X,K,John
1,A,2,0,0,2,X,K,John
1,A,2,5,8,3,X,K,John
1,A,2,5,8,4,X,L,Sam
2,B,2,3,4,1,X,L,Sam
2,B,2,0,0,2,X,L,Sam
2,B,2,0,0,3,X,M,John
2,B,2,0,0,4,X,L,John
3,C,2,0,0,1,X,K,John
3,C,2,8,10,2,Y,M,John
3,C,2,8,10,3,Y,K,John
3,C,2,0,0,4,Y,K,John
5,E,2,0,0,1,Y,M,Sam
5,E,2,5,5,2,Y,L,Sam
5,E,2,5,9,3,Y,M,Sam
5,E,2,0,0,4,Z,M,Kyle
5,E,2,5,8,5,Z,L,Kyle
5,E,2,5,8,6,Z,M,Kyle
",sep=",",stringsAsFactors =FALSE, header=TRUE)
UPDATE
To filter on more than one column with these criteria, here's how to do it. In the present case, the result is the same because when sales are 0, profits are also 0.
library(dplyr)
table1 %>%
group_by(Id) %>%
mutate(LagS=lag(sales),LeadS=lead(sales),LagP=lag(Profit),LeadP=lead(Profit)) %>%
rowwise() %>%
mutate(MinS=min(LagS,LeadS,na.rm=TRUE),MinP=min(LagP,LeadP,na.rm=TRUE)) %>%
filter(sales>0|MinS>0|Profit>0|MinP>0) %>% # "|" means OR
select(-LeadS,-LagS,-MinS,-LeadP,-LagP,-MinP)
I can't do it in one line, but here it is in three:
x <- df$sales==0 & df$Profit==0
y <- cumsum(c(1,head(x,-1)!=tail(x,-1)))
df[ave(x,df$Id,y,FUN=sum)<2,]
# Id Name Price sales Profit Month Category Mode Supplier
# 3 1 A 2 5 8 3 X K John
# 4 1 A 2 5 8 4 X L Sam
# 5 2 B 2 3 4 1 X L Sam
# 9 3 C 2 0 0 1 X K John
# 10 3 C 2 8 10 2 Y M John
# 11 3 C 2 8 10 3 Y K John
# 12 3 C 2 0 0 4 Y K John
# 13 5 E 2 0 0 1 Y M Sam
# 14 5 E 2 5 5 2 Y L Sam
# 15 5 E 2 5 9 3 Y M Sam
# 16 5 E 2 0 0 4 Z M Kyle
# 17 5 E 2 5 8 5 Z L Kyle
# 18 5 E 2 5 8 6 Z M Kyle
This works by first identifying all rows where sales and Profit are both 0 (x). The variable y groups consecutive TRUE and FALSE values. The ave() function splits the first input variable (x) according to the subsequent input variables (df$Id and y) then applies the function within groups. Since the function is sum(), it will add up all the TRUE values in x, then it returns a vector of the same length as x, so we just need to keep all the rows where the result is less than 2.
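The three steps can be traced on a small toy input. The vectors id and x below are hypothetical values made up for illustration, with TRUE standing for a row where sales and Profit are both 0; n_zero and keep are names introduced here.

```r
# Hypothetical toy input: two Ids, with TRUE marking an all-zero row
id <- c(1, 1, 1, 1, 2, 2, 2)
x  <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)

# Run counter: increases by one each time x switches between TRUE and FALSE
y <- cumsum(c(1, head(x, -1) != tail(x, -1)))
y
# 1 1 2 2 2 3 4

# Number of TRUE values in each (id, run) block, repeated for every row
n_zero <- ave(as.numeric(x), id, y, FUN = sum)
n_zero
# 2 2 0 0 0 1 0

# Rows in a block containing fewer than two zero rows are kept
keep <- n_zero < 2
keep
# FALSE FALSE TRUE TRUE TRUE TRUE TRUE
```

Note that y is computed across the whole vector, so a run may span two Ids (rows 3 to 5 here); passing id to ave() as well is what splits such a run at the Id boundary.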
Here is my solution:
aux <- lapply(tapply(df$sales + df$Profit, df$Id, rle), function(x)
with(x, cbind(rep(values, lengths), rep(lengths, lengths))))
df[!(do.call(rbind, aux)[,1]==0 & do.call(rbind, aux)[,2] >= 2),]
Id Name Price sales Profit Month Category Mode Supplier
3 1 A 2 5 8 3 X K John
4 1 A 2 5 8 4 X L Sam
5 2 B 2 3 4 1 X L Sam
9 3 C 2 0 0 1 X K John
10 3 C 2 8 10 2 Y M John
11 3 C 2 8 10 3 Y K John
12 3 C 2 0 0 4 Y K John
13 5 E 2 0 0 1 Y M Sam
14 5 E 2 5 5 2 Y L Sam
15 5 E 2 5 9 3 Y M Sam
16 5 E 2 0 0 4 Z M Kyle
17 5 E 2 5 8 5 Z L Kyle
18 5 E 2 5 8 6 Z M Kyle

Arrange a matrix with regard to the rownames and colnames

Imagine we have an M*N matrix (M rows and N columns), like
b a d c e
a 2 1 4 3 5
b 3 2 5 4 6
c 1 3 3 2 4
I want to write a function that takes the above matrix and returns the following matrix:
a b c d e
a 1 2 3 4 5
b 2 3 4 5 6
c 3 1 2 3 4
Here the first M*M part of the matrix (3*3 in this case) is symmetric in terms of rownames and colnames; the matrix is 3*5 in total, and the remaining 3*2 block is pushed to the end.
For an N x M matrix where N <= M and all row names are contained in col names, this will bring the columns with names existing in row names to the front in the same order as the row names, and leave the rest of the columns in their original order after that:
mat_ord <- function(mx) mx[, c(rownames(mx), setdiff(colnames(mx), rownames(mx)))]
mat_ord(mx)
produces:
a b c d e
a 1 2 3 4 5
b 2 3 4 5 6
c 3 1 2 3 4
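For a self-contained run, the matrix from the question can be built as follows. This is a sketch; the construction of mx and the name res are assumptions added here, since the answer does not show how mx was created.

```r
# The question's matrix: rows a, b, c and columns b, a, d, c, e
mx <- matrix(
  c(2, 1, 4, 3, 5,
    3, 2, 5, 4, 6,
    1, 3, 3, 2, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), c("b", "a", "d", "c", "e"))
)

# The approach assumes every row name also appears among the column names
stopifnot(all(rownames(mx) %in% colnames(mx)))

mat_ord <- function(mx) {
  # Columns named after rows go first, in row order; the rest keep their order
  mx[, c(rownames(mx), setdiff(colnames(mx), rownames(mx)))]
}

res <- mat_ord(mx)
res
```

Indexing by name, rather than by position, is what lets the same function work regardless of how the columns were originally arranged.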
To see the difference, consider mx2 which has rows and columns ordered differently than mx:
e a b d c
b 6 2 3 5 4
a 5 1 2 4 3
c 4 3 1 3 2
And with mat_ord(mx2) we get:
b a c e d
b 3 2 4 6 5
a 2 1 3 5 4
c 1 3 2 4 3
UPDATE: this sorts rows and columns while ensuring symmetry on first N cols/rows:
mat_ord2 <- function(mx) mx[sort(rownames(mx)), c(sort(rownames(mx)), sort(setdiff(colnames(mx), rownames(mx))))]
mat_ord2(mx2)
produces:
a b c d e
a 1 2 3 4 5
b 2 3 4 5 6
c 3 1 2 3 4
