R - nested aggregate - r

I have a table d1 like this (three columns: JB, Y and P):
JB Y P
AA 11 1
BB 11 2
AA 12 3
BB 12 4
AA 13 3
CC 12 4
CC 13 2
DD 11 1
DD 12 1
DD 13 3
BB 12 3
and what I am trying to do is get a nested aggregate. I mean the result should look like this:
JB Y Average (P)
AA 11 1
AA 12 3
AA 13 3
BB 11 2
BB 12 3.5
CC 12 4
CC 13 2
DD 11 1
DD 12 1
DD 13 3
The nested aggregate first aggregates by Y and then by JB, and gives the mean of P. I'm not sure if this is possible. I know how to get a simple aggregate, but I wonder if there is a way to analyse the data in two (or more) steps.

Here is a solution using data.table:
library(data.table)
dt <- data.table(
JB = c("AA", "BB", "AA", "BB", "AA", "CC", "CC", "DD", "DD", "DD", "BB"),
Y = c(11, 11, 12, 12, 13, 12, 13, 11, 12, 13, 12),
P = c(1, 2, 3, 4, 3, 4, 2, 1, 1, 3, 3))
dt[order(JB), .(avg = mean(P)), by = .(JB, Y)]
The .() in the middle is data.table's alias for list(); naming the element inside it (avg = ...) names the aggregation result. If ordering is not necessary, you may omit the first part and just call
dt[, .(avg = mean(P)), by = .(JB, Y)]

We can use data.table
library(data.table)
setDT(df)[, list(P= mean(P)) , .(JB, Y)]

By the looks of it, this is a vanilla aggregate problem, so you have lots of tools available.
In base R, the obvious candidate is aggregate.
aggregate(P ~ JB + Y, mydf, mean)
You can also use the "dplyr" package, as suggested by @eipi10, if that is more your style:
library(dplyr)
mydf %>% group_by(JB, Y) %>% summarise(P = mean(P))
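To check the answers above against the question's data, here is a minimal base R sketch. It rebuilds the table as a data frame (named mydf only to match the aggregate call above) and groups by JB and Y together, which is what the "nested" aggregate amounts to. The ordering step at the end is an addition for readability and is not part of the original answers.

```r
# Rebuild the question's table (the name mydf is an assumption).
mydf <- data.frame(
  JB = c("AA", "BB", "AA", "BB", "AA", "CC", "CC", "DD", "DD", "DD", "BB"),
  Y  = c(11, 11, 12, 12, 13, 12, 13, 11, 12, 13, 12),
  P  = c(1, 2, 3, 4, 3, 4, 2, 1, 1, 3, 3)
)

# One mean of P per (JB, Y) combination.
res <- aggregate(P ~ JB + Y, data = mydf, FUN = mean)

# Sort by JB, then Y, so the rows line up with the layout in the question.
res <- res[order(res$JB, res$Y), ]
rownames(res) <- NULL
print(res)
```

Only the BB/12 group has more than one row (P = 4 and P = 3), so it is the only group whose mean differs from a single input value.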

Related

Update a third column for each common row identified by two different columns - R

I want to update the column Participation_type (five different values are available) based on common rows identified by two different columns (custID and accntID). I have over 130000 records with around 7000 different combinations of custID and accntID. I was wondering whether I could use random sampling from the five Participation_type values to populate this variable, but I'm not sure how.
Also, there is no visible pattern in the combinations of custID and accntID (say, a regular repetition of the same values), so I believe vectorization will not work.
Sample data:
library(data.table)
df <- data.table(custID = rep(c("a", "b"), times = 2),
accntID = rep(c(4, 7), times = 2),
Batch_ID = c(1, 1, 2, 2),
Participation_type = character(4))
custID accntID Batch_ID
a 4 1
b 7 1
c 8 1
b 7 2
a 4 2
d 4 1
Final Data: The output should be as mentioned below.
custID accntID Batch_ID Participation_type
a 4 1 BEN
b 7 1 AC
c 8 1 RC
b 7 2 AC
a 4 2 BEN
d 4 1 BEN
Thanks a lot for your suggestions and help.
We create a key/value dataset and then join with the original dataset to create the 'Participation_type'
Create the key/value dataset based on the unique combinations of 'custID', 'accntID' and the possible values of 'Participation_type'
library(data.table)
keyvalDat <- data.table(custID = c('a', 'b', 'c', 'd', 'e'),
accntID = c(4, 7, 8, 9, 10),
Participation_type = c("BEN", "AC", "RC", "O", "A"))
then join with the original dataset
df[keyvalDat, Participation_type := Participation_type, on = .(custID, accntID)]
df
# custID accntID Batch_ID Participation_type
#1: a 4 1 BEN
#2: b 7 1 AC
#3: c 8 1 RC
#4: a 4 2 BEN
#5: b 7 2 AC
data
df <- data.table(custID = c('a', 'b', 'c', 'a', 'b'),
accntID = c(4, 7, 8, 4, 7), Batch_ID = rep(1:2, c(3, 2)))
v1 <- c("BEN", "AC", "RC")
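The question also asks about random sampling. A minimal base R sketch of that idea follows, assuming each unique (custID, accntID) combination should receive one randomly drawn type that is then shared by all of its rows. The five type labels are taken from the key/value table above; set.seed() is there only to make the sketch reproducible.

```r
set.seed(1)  # assumption: fixed seed, purely for reproducibility

# The six rows shown in the question's sample table.
df <- data.frame(
  custID   = c("a", "b", "c", "b", "a", "d"),
  accntID  = c(4, 7, 8, 7, 4, 4),
  Batch_ID = c(1, 1, 1, 2, 2, 1)
)
types <- c("BEN", "AC", "RC", "O", "A")

# One row per unique (custID, accntID) combination, each with a random type.
keys <- unique(df[, c("custID", "accntID")])
keys$Participation_type <- sample(types, nrow(keys), replace = TRUE)

# Look up every row's combination in the key table (a vectorized join).
idx <- match(paste(df$custID, df$accntID), paste(keys$custID, keys$accntID))
df$Participation_type <- keys$Participation_type[idx]
print(df)
```

The lookup is fully vectorized, so no visible pattern in the combinations is needed; the same key table could equally be joined with the data.table syntax shown above.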

How do I reorganize this data.frame in R

I have the following data.frame.
u = c("aa", "bb", "cc", "dd")
v = c(1, 6, 9, 10)
w = c(2, 7, "", 11)
x = c(3, 8, "", 12)
y = c(4, "", "", 13)
z = c(5, "", "", "")
df = data.frame(cbind(u, v, w, x, y, z))
df
u v w x y z
1 aa 1 2 3 4 5
2 bb 6 7 8
3 cc 9
4 dd 10 11 12 13
I want the final product to be reorganized as such
1 aa
2 aa
3 aa
4 aa
5 aa
6 bb
7 bb
8 bb
9 cc
10 dd
11 dd
12 dd
13 dd
14 dd
I have the following script worked up but I'm missing something. I would appreciate guidance on what I'm missing.
dat <- df[,-1]
dat <- dat[,!apply (is.na(dat), 2, all)]
dat[is.na(dat)]="|"
dat <- apply(dat, 1, paste, collapse="|")
dat <- gsub("\\|\\|","", dat)
dat <- trimws(gsub("\\|$","",dat))
all.dat <- unlist(strsplit(dat,"\\|"))
dat.tmp <- data.frame(matrix(ncol = 2, nrow = length(all.dat)))
col1 <- df[,1]
for(i in 1:length(dat)){
tmp <- dat[i]
tmp <- unlist(strsplit(tmp, "\\|"))
for(j in 1:length(tmp)){
dat.tmp[i,1] <- tmp[j]
dat.tmp[i,2] <- as.character(col1[i])
}
print(i)
}
dat.tmp
You can use the reshape() function in the stats package.
df[] <- lapply(df, as.character) # pre-process: make every column character
df[df == ""] <- NA # pre-process: treat empty strings as missing
df.new <- reshape(df, idvar = "u", direction = "long", varying = list(2:ncol(df)),
v.names = "vars")
df.new <- df.new[!is.na(df.new$vars), ] # drop the missing values
df.new <- df.new[order(df.new$u, df.new$time), ] # group the rows by u
rownames(df.new) <- NULL # reset the row names
You can also use the melt() function in reshape2
library(reshape2)
# using the pre-processed df from above
df.new <- melt(df, id.vars = "u", na.rm = TRUE)
This is a fairly strange data structure, since every variable is a factor variable. A second method is to explicitly construct the two vectors of the desired data.frame using t and as.integer and rep for the second variable.
# transpose numeric values and convert to integer vector. repeat categorical
dat <- data.frame(val=as.integer(t(df[-1])), cat=rep(df[,1], each=ncol(df)-1L))
Now, drop the NA values
dat <- dat[!is.na(dat$val),]
dat
val cat
1 1 aa
2 2 aa
3 3 aa
4 4 aa
5 5 aa
6 6 bb
7 7 bb
8 8 bb
11 9 cc
16 10 dd
17 11 dd
18 12 dd
19 13 dd
ind <- apply(df, 1, function(x) sum(!is.na(as.numeric(x[-1]))))
as.data.frame(rep(df$u, ind))
1 aa
2 aa
3 aa
4 aa
5 aa
6 bb
7 bb
8 bb
9 cc
10 dd
11 dd
12 dd
13 dd
Here is a dplyr/tidyr solution
library(dplyr)
library(tidyr)
df[] <- lapply(df, gsub, pattern = "^$|^ $", replacement = NA)
df <- gather(df, id, value, v:z, na.rm = TRUE) %>%
arrange(u) %>%
select(u)

Making contingency table

I'm having trouble with a contingency table.
I want to convert this kind of table:
dat <- read.csv(text="Gatunek,Obecnosc,Lokalizacja,Frekwencja
Koń dziki,TAK,Polska,11
Koń dziki,NIE,Polska,14
Koń dziki,TAK,Kujawy,39
Koń dziki,NIE,Kujawy,31",header=TRUE)
# Gatunek Obecnosc Lokalizacja Frekwencja
#Koń dziki TAK Polska 11
#Koń dziki NIE Polska 14
#Koń dziki TAK Kujawy 39
#Koń dziki NIE Kujawy 31
to this:
Don't be afraid, it's just Polish.
For the moment I only have a table which looks like this:
xtabs should do the trick:
x <- data.frame(a = c(1, 2, 1, 2), b = c("a", "a", "b", "b"), c = c(11, 14, 39, 31))
xtabs(c ~ a + b, data = x)
# b
#a a b
# 1 11 39
# 2 14 31
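Applied to the question's own columns, the same call might look like the sketch below. The data frame is rebuilt by hand rather than with read.csv, the constant Gatunek column is left out, and putting Obecnosc on the rows and Lokalizacja on the columns is an assumption, since the target table is not shown in the question.

```r
# The question's data, without the constant Gatunek column.
dat <- data.frame(
  Obecnosc    = c("TAK", "NIE", "TAK", "NIE"),
  Lokalizacja = c("Polska", "Polska", "Kujawy", "Kujawy"),
  Frekwencja  = c(11, 14, 39, 31)
)

# Left-hand side: the counts to sum; right-hand side: the two classifying factors.
tab <- xtabs(Frekwencja ~ Obecnosc + Lokalizacja, data = dat)
print(tab)
```

Swapping the two terms on the right-hand side of the formula transposes the table.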

How can I plot a line/bar for following type of data?

b <- data.frame(head=c("a", "b", "c", "d", "e"),
ab=c(1, 2, 3, 4, 5), bc=c(4, 5, 6, 7, 8), ca=c(2, 3, 4, 5, 6))
and so on.
I want one plot per head value (5 individual plots in this case): a plot for a showing its values of ab, bc and ca, the same for b, and so on.
The problem is that this is easy to plot if the table is transposed, but difficult with the data laid out this way.
For example, if the data were arranged like this:
b <- data.frame(head=c("ab", "bc", "ca"),
a=c(1, 4, 2), b=c(2, 5, 3), c=c(3, 6, 4), d=c(4, 7, 5), e=c(5, 8, 6))
then it would be simple to plot a with the command barplot(b$a). But how can I make the same plot from the data as presented in the first block?
You could use reshape2 to transform the dataset b to your expected b
library(reshape2)
d1 <- dcast(melt(b,id.var="head"), variable~head, value.var="value")
d1
# variable a b c d e
#1 ab 1 2 3 4 5
#2 bc 4 5 6 7 8
#3 ca 2 3 4 5 6
Or in this case:
b1 <- t(b[,-1])
colnames(b1) <- b[,1]
b1
# a b c d e
#ab 1 2 3 4 5
#bc 4 5 6 7 8
#ca 2 3 4 5 6
If you want to plot 5 barplots on the same window:
library(ggplot2)
mb <- melt(b, id.var="head")
ggplot(mb, aes(head, value))+
geom_bar(aes(fill=variable), position="dodge", stat="identity") +
theme_bw()
If you need 5 individual bar plots using the original b dataset, you could try:
pdf("barplots.pdf")
apply(b[,-1], 1, function(x) barplot(x))
dev.off()
'barplot' can be used with original b data.frame:
barplot(as.matrix(b[,-1]), beside=T, legend.text=b$head)
For the other grouping, transpose the data (as pointed out by @akrun):
barplot(t(as.matrix(b[,-1])), beside=T, legend.text=names(b)[2:4], names.arg=b$head)

Merge select columns from multiple tables using common identifiers in R

I would like to combine (merge) selected columns from multiple tables with the following organization.
Here are two datasets as examples of what I want to combine:
"dataset1"
A B C D E F (header)
1 2 3 4 5 F1(1st row)
6 7 8 9 10 F2(2nd row)
11 12 13 14 15 F3 (3rd row)
....
"dataset2"
A B C D E F (header)
16 17 18 19 20 F1(1st row)
21 22 23 24 25 F2(2nd row)
26 27 28 29 30 F3 (3rd row)
....
Here, the headers of all the datasets (I have more than 100) are identical, and I want to use the names in column F (F1, F2, F3, ... up to more than F200) as a unique identifier.
For example, if I combine column "A" from all the datasets using column F as the identifier, the result should look like this. To show where the data came from, each header also needs to be changed to the dataset ID.
dataset1 dataset2 F (header)
1 16 F1 (1st row)
6 21 F2 (2nd row)
11 26 F3 (3rd row)
....
Note that the datasets contain different numbers of rows, so some of the values corresponding to F1~F200 could be missing. In that case I want to put NA or leave the cell empty.
To this end, I tried the following code
x <- merge(dataset1, dataset2, by="F", all=T)
But this way I cannot extract only column A; it merges every column.
Similarly, I also tried
x <- Reduce(function(x, y) merge(x, y, all=TRUE, by=("F")), list(dataset1, dataset2))
This gave me results identical to the previous code. To extract only column A, I then tried the following, but it did not work.
x <- Reduce(function(x, y) merge(x, y, all=TRUE, by=("F")), list(dataset1[,1], dataset2[,1]))
I also have no idea how to change each header to the name of the dataset it came from.
Please understand that I have just started to learn the basics of R.
I'm using RStudio 0.98507, and all datasets (more than a hundred) are currently loaded and present in the "Global Environment".
Thank you very much!
Here's one solution with the following four sample data frames:
dataset1 <- data.frame(A = c(1, 6, 11),
B = c(2, 7, 12),
C = c(3, 8, 12),
D = c(4, 9, 13),
E = c(5, 10, 14),
F = c("F1", "F2", "F3"))
dataset2 <- data.frame(A = c(16, 21, 26),
B = c(17, 22, 27),
C = c(18, 23, 28),
D = c(19, 24, 29),
E = c(20, 25, 30),
F = c("F1", "F2", "F3"))
dataset3 <- data.frame(A = c(30, 61),
B = c(57, 90),
C = c(38, 33),
D = c(2, 16),
E = c(77, 25),
F = c("F1", "F2"))
dataset4 <- data.frame(A = c(36, 61),
B = c(47, 30),
C = c(37, 33),
D = c(45, 10),
E = c(66, 29),
F = c("F1", "F2"))
First combine them into a list:
datasets <- list(dataset1, dataset2, dataset3, dataset4)
Then rename all the columns except the F column. This is because later when we merge the data frames together, if the columns all have the same names then merge will try to differentiate them by adding .x or .y to the names -- which is fine when you're only merging two data sets, but gets confusing with more than two.
for (i in seq_along(datasets)) {
for (j in seq_along(colnames(datasets[[i]]))) {
if (colnames(datasets[[i]])[j] != "F") {
colnames(datasets[[i]])[j] <- paste(colnames(datasets[[i]])[j], i, sep = ".")
}
}
}
This gives us data frames whose column headers look like this:
datasets[[1]]
## A.1 B.1 C.1 D.1 E.1 F
## 1 1 2 3 4 5 F1
## 2 6 7 8 9 10 F2
## 3 11 12 12 13 14 F3
Then use Reduce:
df <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "F"), datasets)
And select the columns you want, in this case all the columns with A in the column name:
df[, c("F", grep("A", names(df), value = TRUE))]
## F A.1 A.2 A.3 A.4
## 1 F1 1 16 30 36
## 2 F2 6 21 61 61
## 3 F3 11 26 NA NA
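If the merged columns should carry the dataset names as headers, as the question asks, one possible variation is to keep only F and the wanted column from each data frame and rename that column before merging. The sketch below is a minimal illustration with cut-down sample data; the helper pick_column and the named list are hypothetical names introduced here, not part of the answer above.

```r
# Cut-down sample data (only columns A, B and F are kept for brevity).
dataset1 <- data.frame(A = c(1, 6, 11),   B = c(2, 7, 12),   F = c("F1", "F2", "F3"))
dataset2 <- data.frame(A = c(16, 21, 26), B = c(17, 22, 27), F = c("F1", "F2", "F3"))
dataset3 <- data.frame(A = c(30, 61),     B = c(57, 90),     F = c("F1", "F2"))

# A named list, so each data frame carries its dataset ID.
datasets <- list(dataset1 = dataset1, dataset2 = dataset2, dataset3 = dataset3)

# Hypothetical helper: keep F plus one column, renamed to the dataset ID.
pick_column <- function(name, column = "A") {
  piece <- datasets[[name]][, c("F", column)]
  names(piece) <- c("F", name)
  piece
}

pieces <- lapply(names(datasets), pick_column)

# Full outer merge on F; identifiers missing from a dataset become NA.
merged <- Reduce(function(x, y) merge(x, y, by = "F", all = TRUE), pieces)
print(merged)
```

With the data frames already loaded in the global environment, something like mget(paste0("dataset", 1:3)) should build the same named list without typing each name, assuming the objects follow that naming pattern.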

Resources