data.matrix() when characters are involved - R

In order to calculate the highest contribution of a row per ID, I have a beautiful script which works when the IDs are numeric. Today, however, I found out that IDs can also contain characters (for instance ABC10101). For the function to work, the dataset is converted to a matrix, but data.matrix(df) does not support characters. Can the code be altered so that the function works with all kinds of IDs (character, numeric, etc.)? Currently I have a quick workaround which converts IDs to numeric when the ID is a character, but that will slow the process down for large datasets.
Example with code (the function extracts the first entry with the highest contribution, so if two entries have the same contribution it selects the first):
Note: in this example ID is interpreted as a factor and data.matrix() converts it to a numeric value. In the code below the ID column should be of type character, and the output should be as shown at the bottom. The order of the IDs must remain the same.
tc <- textConnection('
ID contribution uniqID
ABCUD022221 40 101
ABCUD022221 40 102
ABCUD022222 20 103
ABCUD022222 10 104
ABCUD022222 90 105
ABCUD022223 75 106
ABCUD022223 15 107
ABCUD022223 10 108 ')
df <- read.table(tc,header=TRUE)
#Function that needs to be altered
uniqueMaxContr <- function(m, ID = 1, contribution = 2) {
  t(
    vapply(
      split(1:nrow(m), m[, ID]),
      function(i, x, contribution)
        x[i, , drop = FALSE][which.max(x[i, contribution]), ],
      m[1, ],
      x = m, contribution = contribution
    )
  )
}
df <- data.matrix(df)  # only works when ID is numeric
highestdf <- uniqueMaxContr(df)
highestdf <- as.data.frame(highestdf)
In this case the outcome should be:
ID contribution uniqID
ABCUD022221 40 101
ABCUD022222 90 105
ABCUD022223 75 106
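For reference, the matrix conversion can be sidestepped entirely by keeping the data in a data.frame and using split()/lapply(), so character IDs survive untouched. This is only a minimal sketch (the name uniqueMaxContrDF is made up here), and it assumes the split groups come back in the same order as the original IDs, as they do for this sample data:
uniqueMaxContrDF <- function(d, ID = "ID", contribution = "contribution") {
  # take each ID's rows and keep the first row with the maximal contribution
  do.call(rbind, lapply(
    split(d, d[[ID]]),
    function(g) g[which.max(g[[contribution]]), , drop = FALSE]
  ))
}
highestdf <- uniqueMaxContrDF(df)  # ID keeps its original type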

Others might be able to make it more concise, but this is my attempt at a data.table solution:
tc <- textConnection('
ID contribution uniqID
ABCUD022221 40 101
ABCUD022221 40 102
ABCUD022222 20 103
ABCUD022222 10 104
ABCUD022222 90 105
ABCUD022223 75 106
ABCUD022223 15 107
ABCUD022223 10 108 ')
df <- read.table(tc,header=TRUE)
library(data.table)
dt <- as.data.table(df)
setkey(dt, uniqID)  # keep the original order via uniqID
dt2 <- dt[, list(contribution = max(contribution)), by = ID]  # max contribution per ID
setkeyv(dt2, c("ID", "contribution"))
setkeyv(dt, c("ID", "contribution"))
dt[dt2, mult = "first"]  # join back, keeping the first matching row per ID
## ID contribution uniqID
## [1,] ABCUD022221 40 101
## [2,] ABCUD022222 90 105
## [3,] ABCUD022223 75 106
EDIT -- more concise solution
You can use .SD, which is the subset of the data.table for each group, and then use which.max to extract a single row. In one line:
dt[,.SD[which.max(contribution)],by=ID]
## ID contribution uniqID
## [1,] ABCUD022221 40 101
## [2,] ABCUD022222 90 105
## [3,] ABCUD022223 75 106
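One nice side effect of the .SD approach is that there is no matrix conversion at all, so the ID column keeps its original type. A quick check, assuming the df from the question (ID comes back as character or factor depending on how read.table was called):
dt <- as.data.table(df)
res <- dt[, .SD[which.max(contribution)], by = ID]
class(res$ID)  # "character" or "factor", never data.matrix()'s numeric codes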

Related

How to get a specific column from a matrix in r?

I have a matrix as follows; how can I extract the desired column with `[`?
MX <- matrix(101:112,ncol=3)
MX[,2]
# [1] 105 106 107 108
`[`(MX, c(1:4,2))
# [1] 101 102 103 104 102
Obviously, it does not extract the 2nd column as one might intuitively guess; it treats the matrix as a plain vector and picks the elements at those positions.
What I am really asking is how to express MX[,2] with `[`.
Please advise. Thanks
Keep the row index blank:
`[`(MX, ,2)
#[1] 105 106 107 108
Or, if we need to extract selected rows (1:4) of a specific column (2), specify the row and column indices as separate arguments, without concatenating them; c() would collapse the row and column indices into a single vector instead of two:
`[`(MX, 1:4, 2)
#[1] 105 106 107 108
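As a sanity check that the prefix forms agree with the usual bracket syntax, both of these should print TRUE (base R only):
identical(`[`(MX, , 2), MX[, 2])        # blank row index selects the whole column
identical(`[`(MX, 1:4, 2), MX[1:4, 2])  # separate row and column indices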

Custom sorting of a dataframe in R

I have a binomial dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently sorted to show the instances belonging to the 0 group and then the ones belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? I mean to show a row that belongs to the 0 group, then a row belonging to the 1 group, then the 0 group again, and so on...
All I can think of is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4, sample(1:200, 10, rep=TRUE)),
                 addme = sample(0:1, 10, rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
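The trick here is that rbind() interleaves the two index vectors column by column, recycling the shorter one (hence the warning), and unique() then drops the recycled duplicates. A toy illustration with made-up data:
addme <- c(0, 1, 0, 1, 1, 1)  # made-up grouping vector: two 0s, four 1s
idx <- as.vector(rbind(which(addme == 0), which(addme == 1)))
idx          # 1 2 3 4 1 5 3 6 -- the shorter vector is recycled, with a warning
unique(idx)  # 1 2 3 4 5 6    -- 0,1,0,1 order, leftover 1s at the end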
Here's another way using dplyr, which also makes it suitable for within-group ordering. It's probably pretty quick, too. If there are unbalanced numbers of 0s and 1s, it will leave the extras at the end.
library(dplyr)
df %>%
  arrange(addme) %>%
  mutate(n0 = sum(addme == 0),
         orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
  arrange(orderme) %>%
  select(-n0, -orderme)
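For comparison, the same interleaving idea can be written in base R with ave(); a minimal sketch under the same assumptions (a df with a 0/1 addme column):
# rank each row within its addme group, then shift the 1-group by 0.5,
# so sorting alternates 0,1,0,1,... with any extras falling at the end
df[order(ave(df$addme, df$addme, FUN = seq_along) + 0.5 * df$addme), ]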

Alternative to for loop R

I have written a function that compares the similarity of IP addresses and lets the user select the level of detail in the octet. For example, for the addresses 255.255.255.0 and 255.255.255.1, a user could specify that they only want to compare the first octet; the first and second; the first, second, and third; and so on.
The function is below:
did.change.ip = function(vec, detail){
  counter = 2
  result.vec = FALSE
  r.list = strsplit(vec, '.', fixed=TRUE)
  for (i in vec){
    if (counter > length(vec)){
      break
    }
    first = as.numeric(r.list[[counter-1]][1:detail])
    second = as.numeric(r.list[[counter]][1:detail])
    if (sum(first == second) == detail){
      result.vec = append(result.vec, FALSE)
    } else {
      result.vec = append(result.vec, TRUE)
    }
    counter = counter + 1
  }
  return(result.vec)
}
It's really slow once the data starts getting larger. For a dataset of 500,000 rows, the system.time() results are:
   user  system elapsed
 208.36    0.59  209.84
Are there any R power users who have insight on how to write this more efficiently? I know lapply() is the preferred method for looping over vectors/dataframes, but I'm stumped as to how to access the previous element of a vector for this purpose. I've tried to sketch something out quickly, but it returns a syntax error:
test = function(vec, detail){
  rlist = strsplit(vec, '.', fixed=TRUE)
  r.value = vapply(rlist, function(x, detail) ifelse(x[1:detail]==x[1:detail] TRUE, FALSE))
}
I've created some sample data for testing purposes below:
stack.data=structure(list(V1 = c("247.116.209.66", "195.121.47.105", "182.136.49.12",
"237.123.100.50", "120.30.174.18", "29.85.72.70", "18.186.76.177",
"33.248.142.26", "109.97.92.50", "217.138.155.145", "20.203.156.2",
"71.1.51.190", "31.225.208.60", "55.25.129.73", "211.204.249.244",
"198.137.15.53", "234.106.102.196", "244.3.87.9", "205.242.10.22",
"243.61.212.19", "32.165.79.86", "190.207.159.147", "157.153.136.100",
"36.151.152.15", "2.254.210.246", "3.42.1.208", "30.11.229.18",
"72.187.36.103", "98.114.189.34", "67.93.180.224")), .Names = "V1", class = "data.frame", row.names = c(NA,
-30L))
Here's another solution just using base R.
did.change.ip <- function(vec, detail=4){
  ipv <- scan(text=paste(vec, collapse="\n"),
              what=c(replicate(detail, integer()), replicate(4-detail, NULL)),
              sep=".", quiet=TRUE)
  c(FALSE, rowSums(vapply(ipv[!sapply(ipv, is.null)],
                          diff, integer(length(vec)-1)) != 0) > 0)
}
Here we use scan() to break each IP address up into its numbers. Then we look down each octet for differences using diff(). This seems faster than the original proposal, but slightly slower than @josilber's stringr solution (benchmarked with microbenchmark on 3,000 IP addresses):
Unit: milliseconds
   expr       min        lq    median        uq       max neval
   orig 35.251886 35.716921 36.019354 36.700550 90.159992   100
   scan  2.062189  2.116391  2.170110  2.236658  3.563771   100
 strngr  2.027232  2.075018  2.136114  2.200096  3.535227   100
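For context, a comparison like the one above could be set up roughly as follows. This is only a sketch: the ips test vector is assumed, did.change.ip.orig is a hypothetical name for the original for-loop version (renamed so it doesn't clash with the scan() version), and did.change.josilber is defined in the stringr answer below:
library(microbenchmark)
ips <- sample(stack.data$V1, 3000, replace=TRUE)  # assumed 3,000-address test vector
microbenchmark(
  orig   = did.change.ip.orig(ips, 3),   # hypothetical: the original loop version
  scan   = did.change.ip(ips, 3),
  strngr = did.change.josilber(ips, 3),
  times  = 100
)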
The simplest way I can think of to do this is to build a transformed vector that only includes the parts of the IP you want. Then it's a one-liner to check if each element is equal to the one before it:
library(stringr)
did.change.josilber <- function(vec, detail) {
  s <- str_extract(vec, paste0("^(\\d+\\.){", detail, "}"))
  return(s != c(s[1], s[1:(length(s)-1)]))
}
This seems reasonably efficient for 500,000 rows:
set.seed(144)
big.vec <- sample(stack.data[,1], 500000, replace=T)
system.time(did.change.josilber(big.vec, 3))
# user system elapsed
# 0.527 0.030 0.554
The biggest issue with your code is that you call append() on every iteration, which requires reallocating your vector 500,000 times. You can read more about this in the second circle of The R Inferno.
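To illustrate the point, here is a sketch of the original loop with the result vector preallocated instead of grown one element at a time; the logic is otherwise unchanged:
did.change.ip.prealloc <- function(vec, detail){
  r.list <- strsplit(vec, '.', fixed=TRUE)
  result.vec <- logical(length(vec))  # preallocated; element 1 stays FALSE
  for (i in 2:length(vec)) {
    first  <- as.numeric(r.list[[i-1]][1:detail])
    second <- as.numeric(r.list[[i]][1:detail])
    result.vec[i] <- !all(first == second)  # TRUE if any compared octet changed
  }
  result.vec
}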
Not sure if all you want is counts, but this is potentially a solution:
library(dplyr)
library(tidyr)
# split IP addresses into "octets"
octets <- stack.data %>%
  separate(V1, c("first", "second", "third", "fourth"))
# how many share both their first and second octets?
octets %>%
  group_by(first, second) %>%
  summarize(n = n())
first second n
1 109 97 1
2 120 30 1
3 157 153 1
4 18 186 1
5 182 136 1
6 190 207 1
7 195 121 1
8 198 137 1
9 2 254 1
10 20 203 1
11 205 242 1
12 211 204 1
13 217 138 1
14 234 106 1
15 237 123 1
16 243 61 1
17 244 3 1
18 247 116 1
19 29 85 1
20 3 42 1
21 30 11 1
22 31 225 1
23 32 165 1
24 33 248 1
25 36 151 1
26 55 25 1
27 67 93 1
28 71 1 1
29 72 187 1
30 98 114 1
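The same tally can be cross-checked in base R once octets exists:
# base R cross-check of the dplyr count above
head(table(paste(octets$first, octets$second, sep=".")))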

Apply over all columns and rows of two different dataframes in R

I'm trying to apply a function over all rows and columns of two dataframes, but I don't know how to solve it with apply.
I think the following script explains what I intend to do and the way I tried to solve it. Any advice would be warmly appreciated! Please note that simplefunction is only intended to be an example function, to keep things simple.
# some data and a function
df1<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
df2<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
simplefunction<-function(a,b){a+b}
# apply to a single cell
simplefunction(df1[1,2],df2[1,2])
# apply over all columns
apply(?)
## apply over all columns and rows
# create df to receive results
df3<-df2
# loop it
for (i in 2:5)df3[i]<-apply(?)
My first mapply answer!! For your simple example you have...
mapply( FUN = `+` , df1[,-1] , df2[,-1] )
# a b c
# [1,] 60 35 75
# [2,] 57 39 92
# [3,] 72 71 48
# [4,] 31 19 85
# [5,] 47 66 58
You can extend it like so...
mapply( FUN = function(x,y,z,etc){ simplefunctioncodehere} , df1[,-1] , df2[,-1] , ... other dataframes here )
The data frames are passed to the function in order, so in this example df1 would be x, df2 would be y, and z and etc would be other data frames that you specify, in that order. Hopefully that makes sense. mapply will take the first-row, first-column values of all data frames and apply the function, then the first-row, second-column values, and so on.
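For instance, with a hypothetical third data frame df3 of the same shape, the extended call could look like this sketch:
# made-up third data frame with the same dimensions as df1/df2
df3 <- data.frame(name=df1$name, a=6:10, b=6:10, c=6:10)
mapply(FUN = function(x, y, z) x + y + z, df1[,-1], df2[,-1], df3[,-1])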
You can also use Reduce:
set.seed(45) # for reproducibility
Reduce(function(x,y) { x + y}, list(df1[, -1], df2[,-1]))
# a b c
# 1 53 22 23
# 2 64 28 91
# 3 19 56 51
# 4 38 41 53
# 5 28 42 30
You can just do:
df1[,-1] + df2[,-1]
Which gives:
a b c
1 52 24 37
2 65 63 62
3 31 90 89
4 90 35 33
5 51 33 45
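If you need the name column back in the result, it can be bound on again, e.g.:
cbind(df1["name"], df1[,-1] + df2[,-1])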

R: how to detect patterns in a matrix by row

I have a big matrix with 4 columns, containing normalized values (by column: mean ~ 0 and standard deviation = 1).
I would like to see if there is a pattern in the matrix, and if so, cluster the rows by pattern. By pattern I mean the ordering of values within a given row. For example, for row N:
if the value in column 1 < column 2 < column 3 < column 4, then it is, say, pattern 1
Basically there are 4^4 = 256 possible patterns (in theory).
Is there a way in R to do this?
Thanks in advance
Rad
Yes. (Although the number of distinct orderings is only 24 = 4*3*2: after one value is chosen there are only three possible second values, and after the second is specified there are only two orderings left.) The order() function applied to each row should give the desired permutations of 1, 2, 3, 4:
mtx <- matrix(rnorm(10000), ncol=4)
res <- apply(mtx, 1, function(x) paste( order(x), collapse=".") )
> table(res)
res
1.2.3.4 1.2.4.3 1.3.2.4 1.3.4.2 1.4.2.3 1.4.3.2
98 112 95 120 114 118
2.1.3.4 2.1.4.3 2.3.1.4 2.3.4.1 2.4.1.3 2.4.3.1
101 114 105 102 104 122
3.1.2.4 3.1.4.2 3.2.1.4 3.2.4.1 3.4.1.2 3.4.2.1
105 82 107 90 97 86
4.1.2.3 4.1.3.2 4.2.1.3 4.2.3.1 4.3.1.2 4.3.2.1
99 93 100 108 118 110
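To go from these labels to the clusters the question asks about, split() groups the row indices by pattern; a short sketch using the same mtx and res objects:
groups <- split(seq_len(nrow(mtx)), res)  # row indices grouped by ordering pattern
length(groups)    # up to 24 distinct patterns
str(groups[1:2])  # peek at the first two clusters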
