Subsetting character data in R

I have a data frame with several columns of varied character data. I want to find the average of each combination of that character data. I think I'm closing in on a solution, but am having trouble figuring out how to loop over characters. An example bit of data would be like:
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11
And trying to get it in the form of:
P1 P2 P3 Avg
a w j 11.667
a w k 10
a d j 20
a d k 7.5
a d l 23
b x L 20
b y k 17.5
c z j 15
c z k 11
c w l 45
I think the idea is something like:
test <- read.table("clipboard",header=T)
newdata <- subset(test,
Var1=='a'
& Var2=='w'
& Var3=='j',
select=M1
)
row.names(newdata)<-NULL
newdata2 <- as.data.frame(matrix(data=NA,nrow=3,ncol=4))
names(newdata2) <- c("P1","P2","P3","Avg")
newdata2[1,1] <- 'a'
newdata2[1,2] <- 'w'
newdata2[1,3] <- 'j'
newdata2[1,4] <- mean(newdata$M1)
Which works for the first line, but I'm not entirely sure how to automate this to loop over each character combination across the columns. Unless, of course, there's a similar apply-like function to use in this case?

library(dplyr)
newdata2 = summarise(group_by(test,Var1,Var2,Var3),Avg=mean(M1))
And the result:
> newdata2
Source: local data frame [10 x 4]
Groups: Var1, Var2
Var1 Var2 Var3 Avg
1 a d j 20.00000
2 a d k 7.50000
3 a d l 23.00000
4 a w j 11.66667
5 a w k 10.00000
6 b x L 20.00000
7 b y k 17.50000
8 c w l 45.00000
9 c z j 15.00000
10 c z k 11.00000

Using the base aggregate function:
mydata <- read.table(header=TRUE, text="
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11")
aggdata <-aggregate(mydata$M1, by=list(mydata$Var1,mydata$Var2,mydata$Var3) , FUN=mean, na.rm=TRUE)
output:
> aggdata
Group.1 Group.2 Group.3 x
1 a d j 20.00000
2 a w j 11.66667
3 c z j 15.00000
4 a d k 7.50000
5 a w k 10.00000
6 b y k 17.50000
7 c z k 11.00000
8 a d l 23.00000
9 c w l 45.00000
10 b x L 20.00000
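A possible variation on the aggregate answer (a sketch, not taken from the answers above): the formula interface of aggregate keeps the original column names instead of Group.1, Group.2, Group.3 and x. The rename to Avg and the final sort are only assumptions about the layout asked for in the question.

```r
mydata <- read.table(header = TRUE, text = "
Var1 Var2 Var3 M1
a w j 20
a w j 15
a w k 10
a w j 0
b x L 30
b x L 10
b y k 20
b y k 15
c z j 20
c z j 10
c z k 11
c w l 45
a d j 20
a d k 4
a d l 23
a d k 11")

# Mean of M1 for every observed combination of Var1, Var2 and Var3
avg <- aggregate(M1 ~ Var1 + Var2 + Var3, data = mydata, FUN = mean)

# Rename the value column and sort by the grouping columns
names(avg)[names(avg) == "M1"] <- "Avg"
avg <- avg[order(avg$Var1, avg$Var2, avg$Var3), ]
avg
```

Note that "L" and "l" are treated as different values, so the result has 10 rows.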

Related

Getting the length of a list

I am attempting to decipher a list res which has structure as per below:
How would I go about converting this to a 21 (row) by 2 (column) dataframe?
I can do it by manually hard-coding the 21:
data.frame(matrix(unlist(res), nrow=21 ))
However, I would like to use length(res), which unfortunately returns 1.
As it is a list use [[ to index it to get the matrix and then convert to dataframe.
data.frame(res[[1]])
Or use unlist with recursive = FALSE
data.frame(unlist(res[[1]], recursive = FALSE))
Using a reproducible example,
res <- list(matrix(letters,ncol = 2))
data.frame(res[[1]])
# X1 X2
#1 a n
#2 b o
#3 c p
#4 d q
#5 e r
#6 f s
#7 g t
#8 h u
#9 i v
#10 j w
#11 k x
#12 l y
#13 m z
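To see why length(res) returns 1, here is a small sketch using the same reproducible example (the names n_rows and df are only for illustration): length() counts the elements of the list, while the row count lives on the matrix stored inside it.

```r
res <- list(matrix(letters, ncol = 2))

length(res)      # number of list elements, not rows
dim(res[[1]])    # rows and columns of the matrix inside the list

# The row count can be taken from the matrix instead of hard-coding it
n_rows <- nrow(res[[1]])
df <- data.frame(matrix(unlist(res), nrow = n_rows))
```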
You can also use magrittr::extract2
res %>% magrittr::extract2(1)
## A tibble: 21 x 2
# V1 V2
# <chr> <chr>
# 1 O M
# 2 W S
# 3 C Q
# 4 L C
# 5 M K
# 6 R M
# 7 U Q
# 8 I T
# 9 K J
#10 H V
## … with 11 more rows
or use purrr::flatten_dfc
purrr::flatten_dfc(res)
## A tibble: 21 x 2
# V1 V2
# <chr> <chr>
# 1 O M
# 2 W S
# 3 C Q
# 4 L C
# 5 M K
# 6 R M
# 7 U Q
# 8 I T
# 9 K J
#10 H V
## … with 11 more rows
Sample data
library(tibble)  # provides as_tibble()
set.seed(2018)
res <- list(
as_tibble(matrix(sample(LETTERS, 21 * 2, replace = TRUE), nrow = 21, ncol = 2))
)

Join 3 columns of different lengths in R

I have 3 columns: 2 are the same length and 1 is shorter.
Here are the columns:
column1 <- letters[1:10]
column2 <- letters[1:15]
column3 <- letters[1:15]
I want all 3 columns to be joined together, with the 5 missing values in column1 filled with NA.
What can I do to achieve this? Use a tibble?
You can change the length of a vector:
column1 <- letters[1:10]
column2 <- letters[1:15]
length(column1) <- length(column2)
Now
> column1
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" NA NA NA NA NA
We can wrap it in a function:
cbind_dif <- function(x = list()){
# Find the length of the longest vector
max_length <- max(lengths(x))
# Pad each vector with NA up to max_length
res <- lapply(x, function(v){
length(v) <- max_length
return(v)
})
return(as.data.frame(res))
}
# Example usage:
> cbind_dif(list(column1 = column1, column2 = column2))
column1 column2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 g g
8 h h
9 i i
10 j j
11 <NA> k
12 <NA> l
13 <NA> m
14 <NA> n
15 <NA> o
n <- max(length(column1), length(column2), length(column3))
data.frame(column1[1:n],column2[1:n],column3[1:n])
column1.1.n. column2.1.n. column3.1.n.
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
7 g g g
8 h h h
9 i i i
10 j j j
11 <NA> k k
12 <NA> l l
13 <NA> m m
14 <NA> n n
15 <NA> o o
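A compact variation on the two answers above (a sketch; the names cols, padded and joined are only for illustration): putting the vectors in a named list keeps clean column names, and the replacement function `length<-` pads each one with NA.

```r
column1 <- letters[1:10]
column2 <- letters[1:15]
column3 <- letters[1:15]

cols <- list(column1 = column1, column2 = column2, column3 = column3)

# Length of the longest vector
n <- max(lengths(cols))

# `length<-`(v, n) extends v to length n, filling with NA
padded <- lapply(cols, `length<-`, n)

joined <- as.data.frame(padded, stringsAsFactors = FALSE)
joined
```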
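Using cbind.fill from the rowr package you can do it easily. Pass fill = NA, because by default the shorter inputs are recycled rather than padded.
library(rowr)
new <- cbind.fill(column1, column2, column3, fill = NA)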
I hope this helps
column1 <- letters[1:10]
column2 <- letters[1:15]
column3 <- letters[1:15]
tibble(a = c(column1, rep(NA, length(column2) - length(column1))), b = column2, c = column3)
# A tibble: 15 × 3
a b c
<chr> <chr> <chr>
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
7 g g g
8 h h h
9 i i i
10 j j j
11 NA k k
12 NA l l
13 NA m m
14 NA n n
15 NA o o

R - Adding a total row in Excel output

I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb=WB, sheet="Data", x=X, withFilter=F, bandedRows=F, firstColumn=T)
X contains a data.frame with 8 character variables and 1 numeric variable. Therefore the total row should only contain a total for the numeric column. Ideally I could switch on the Excel total row feature while writing the table to the workbook object, as I did with firstColumn, rather than adding a total row manually.
I searched for a solution both on StackOverflow and in the official openxlsx documentation, but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na
library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.
Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create a row using lapply
totals <- lapply(df, function(col) {
# Sum numeric columns; leave NA for everything else
if (is.numeric(col)) sum(col) else NA
})
and add it to df using rbind()
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>
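Since the question asks for openxlsx, here is a minimal sketch of how the manually built total row could be written out with writeDataTable. The three-column data frame, the name out and the temporary file path are assumptions for illustration. This writes the totals as an ordinary last row of the table; it does not switch on Excel's own total row feature.

```r
# Cut-down version of the sample data (assumed for illustration)
df <- data.frame(A = c("a", "f", "w"),
                 H = c(5, 9, 8),
                 I = c("j", "s", "f"),
                 stringsAsFactors = FALSE)

# Sum numeric columns, NA elsewhere, then append as a new row
totals <- lapply(df, function(col) if (is.numeric(col)) sum(col) else NA)
out <- rbind(df, totals)

# Write the result only if openxlsx is installed
if (requireNamespace("openxlsx", quietly = TRUE)) {
  wb <- openxlsx::createWorkbook()
  openxlsx::addWorksheet(wb, "Data")
  openxlsx::writeDataTable(wb, sheet = "Data", x = out,
                           withFilter = FALSE, bandedRows = FALSE,
                           firstColumn = TRUE)
  openxlsx::saveWorkbook(wb, file.path(tempdir(), "totals.xlsx"),
                         overwrite = TRUE)
}
```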

Getting the maximum common words in R

I have data of the form:
ID A1 A2 A3 ... A100
1 john max karl ... kevin
2 kevin bosy lary ... rosy
3 karl lary bosy ... hale
.
.
.
10000 isha john lewis ... dave
I want to get one ID for each ID such that both of them have maximum number of common attributes(A1,A2,..A100)
How can I do this in R ?
Edit: Let's call the output a MatchId:
ID MatchId
1 70
2 4000
.
.
10000 3000
I think this gets what you're looking for:
library(dplyr)
# make up some data
set.seed(1492)
# rbind_all() is defunct in current dplyr; bind_rows() replaces it
bind_rows(lapply(1:15, function(i) {
x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
colnames(x) <- c("ID", sprintf("A%d", 1:10))
x
})) -> dat
print(dat)
## Source: local data frame [15 x 11]
##
## ID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
## 1 1 H F E C B A R J Z N
## 2 2 Q P E M L Z C G V Y
## 3 3 Q J D N B T L K G Z
## 4 4 D Y U F V O I C A W
## 5 5 T Z D I J F R C B S
## 6 6 Q D H U P V O E R N
## 7 7 C L I M E K N S X Z
## 8 8 M J S E N O F Y X I
## 9 9 R H V N M T Q X L S
## 10 10 Q H L Y B W S M P X
## 11 11 M N J K B G S X V R
## 12 12 W X A H Y D N T Q I
## 13 13 K H V J D X Q W A U
## 14 14 M U F H S T W Z O N
## 15 15 G B U Y E L A Q W O
# get commons
# for every pair of distinct IDs, count the attribute values they share
bind_rows(lapply(1:15, function(i) {
bind_rows(lapply(setdiff(1:15, i), function(j) {
data.frame(id1=i,
id2=j,
common=length(intersect(c(t(dat[i, 2:11])),
c(t(dat[j, 2:11])))))
}))
})) -> commons
commons %>%
group_by(id1) %>%
top_n(1, common) %>%
filter(row_number()==1) %>%
select(ID=id1, MatchId=id2)
## Source: local data frame [15 x 2]
## Groups: ID
##
## ID MatchId
## 1 1 5
## 2 2 7
## 3 3 5
## 4 4 12
## 5 5 1
## 6 6 9
## 7 7 8
## 8 8 7
## 9 9 10
## 10 10 9
## 11 11 9
## 12 12 13
## 13 13 12
## 14 14 8
## 15 15 2
Using similar data to that provided by @hrbrmstr:
set.seed(1492)
dat <- do.call(rbind, lapply(1:15, function(i) {
x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
colnames(x) <- c("ID", sprintf("A%d", 1:10))
x
}))
You could achieve the same using base R only
Res <- sapply(seq_len(nrow(dat)),
function(x) apply(dat[-1], 1,
function(y) length(intersect(unlist(dat[x, -1]), y))))
diag(Res) <- -1
cbind(dat[1], MatchId = max.col(Res, ties.method = "first"))
# ID MatchId
# 1 1 5
# 2 2 7
# 3 3 5
# 4 4 12
# 5 5 1
# 6 6 9
# 7 7 8
# 8 8 7
# 9 9 10
# 10 10 9
# 11 11 9
# 12 12 13
# 13 13 12
# 14 14 8
# 15 15 2
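To make the idea behind both answers easier to follow, here is a tiny worked sketch with three made-up rows (the data and the names attrs, common and match_id are assumptions for illustration): build a matrix of pairwise common counts, exclude self-matches on the diagonal, then pick the column with the largest count in each row.

```r
dat <- data.frame(ID = 1:3,
                  A1 = c("john", "kevin", "karl"),
                  A2 = c("max", "john", "max"),
                  A3 = c("karl", "lary", "bosy"),
                  stringsAsFactors = FALSE)

# One character vector of attribute values per row
attrs <- lapply(seq_len(nrow(dat)),
                function(i) unlist(dat[i, -1], use.names = FALSE))

# common[i, j] = number of values shared by row i and row j
n_common <- function(i, j) length(intersect(attrs[[i]], attrs[[j]]))
common <- outer(seq_along(attrs), seq_along(attrs), Vectorize(n_common))

# A row must not match itself
diag(common) <- -1

# For each row, take the first ID with the highest count
match_id <- dat$ID[max.col(common, ties.method = "first")]
data.frame(ID = dat$ID, MatchId = match_id)
```

For 10000 rows the full matrix has 10^8 cells, so this direct approach may need to be done in chunks.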
If I understand correctly, the requirement is to obtain the maximum number of common attributes for each ID.
Frequency tables can be obtained using table() and recursively in lapply(), assuming that ID column is unique - slight modification is necessary if not (unique(df$ID) rather than df$ID in lapply()). The maximum frequencies can be taken and, if there is a tie, only the first one is chosen. Finally they are combined by do.call().
df <- read.table(header = T, text = "
ID A1 A2 A3 A100
1 john max karl kevin
2 kevin bosy lary rosy
3 karl lary bosy hale
10000 isha john lewis dave")
do.call(rbind, lapply(df$ID, function(x) {
tbl <- table(unlist(df[df$ID == x, 2:ncol(df)]))
data.frame(ID = x, MatchId = tbl[tbl == max(tbl)][1])
}))
# ID MatchId
#john 1 1
#kevin 2 1
#karl 3 1
#isha 10000 1

Flatten matrix in R to four columns (indexes and upper/lower triangles)

I'm using the cor.prob() function that's been posted several times around the mailing list to get a matrix of correlations (lower diagonal) and p-values (upper diagonals):
cor.prob <- function (X, dfr = nrow(X) - 2) {
R <- cor(X)
above <- row(R) < col(R)
r2 <- R[above]^2
Fstat <- r2 * dfr/(1 - r2)
R[above] <- 1 - pf(Fstat, 1, dfr)
R[row(R) == col(R)] <- NA
R
}
d <- data.frame(x=1:5, y=c(10,16,8,60,80), z=c(10,9,12,2,1))
cor.prob(d)
> cor.prob(d)
x y z
x NA 0.04856042 0.107654038
y 0.8807155 NA 0.003523594
z -0.7953560 -0.97945703 NA
How would I collapse the above correlation matrix (with the correlations in the lower half, p-values in the upper half) into a four-column matrix: two indexes, the correlation, and the p-value? E.g.:
i j cor pval
x y .88 .048
x z -.79 .107
y z -.97 0.0035
I've seen the answer to a previous question like this, but it will only give me a 3-column matrix, not a four-column matrix with separate columns for the p-value and correlation.
Any help is appreciated!
Well, it's not a matrix, because you can't mix characters and numerics. But:
This is my first attempt (before your label swap):
m <- cor.prob(d)
ut <- upper.tri(m)
lt <- lower.tri(m)
d <- data.frame(i=rep(row.names(m),ncol(m))[as.vector(ut)],
j=rep(colnames(m),each=nrow(m))[as.vector(ut)],
cor=m[ut],
p=m[lt])
now apply the correction I suggested below and you get
d <- data.frame(i=rep(row.names(m),ncol(m))[as.vector(ut)],
j=rep(colnames(m),each=nrow(m))[as.vector(ut)],
cor=m[ut],
p=t(m)[ut])
finally your label swap, use row()/col(), and write it as a function:
f1 <- function(m) {
ut <- upper.tri(m)
data.frame(i = rownames(m)[row(m)[ut]],
j = rownames(m)[col(m)[ut]],
# correlations sit in the lower triangle; t(m) reads them in upper-triangle order
cor=t(m)[ut],
# p-values sit in the upper triangle
p=m[ut])
}
then
m<-matrix(1:25,5,dimnames=list(letters[1:5],letters[1:5]))
> m
a b c d e
a 1 6 11 16 21
b 2 7 12 17 22
c 3 8 13 18 23
d 4 9 14 19 24
e 5 10 15 20 25
> f1(m)
i j cor p
1 a b 2 6
2 a c 3 11
3 b c 8 12
4 a d 4 16
5 b d 9 17
6 c d 14 18
7 a e 5 21
8 b e 10 22
9 c e 15 23
10 d e 20 24
Can you explain what you expected if it wasn't this?
cd <- cor.prob(d)
dcd <- as.data.frame( which( row(cd) < col(cd), arr.ind=TRUE) )
dcd$pval <- cd[row(cd) < col(cd)]
dcd$cor <- t(cd)[row(cd) < col(cd)]  # transpose so the order matches pval for any matrix size
dcd[[2]] <-dimnames(cd)[[2]][dcd$col]
dcd[[1]] <-dimnames(cd)[[2]][dcd$row]
dcd
#--------------------
row col pval cor
1 x y 0.048560420 0.8807155
2 x z 0.107654038 -0.7953560
3 y z 0.003523594 -0.9794570
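Putting the pieces together, here is a self-contained sketch run on the data from the question. The function name flatten_cor is only for illustration; it assumes, as in the question, that correlations are in the lower triangle and p-values in the upper triangle.

```r
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)   # p-values go in the upper triangle
  R[row(R) == col(R)] <- NA
  R
}

flatten_cor <- function(m) {
  ut <- upper.tri(m)
  data.frame(i    = rownames(m)[row(m)[ut]],
             j    = colnames(m)[col(m)[ut]],
             cor  = t(m)[ut],   # lower triangle, read in upper-triangle order
             pval = m[ut],      # upper triangle
             stringsAsFactors = FALSE)
}

d <- data.frame(x = 1:5, y = c(10, 16, 8, 60, 80), z = c(10, 9, 12, 2, 1))
res <- flatten_cor(cor.prob(d))
res
```

Using t(m)[ut] rather than m[lower.tri(m)] keeps each correlation on the same row as its p-value for matrices larger than 3 x 3.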
