Bin and transpose in R

I am still getting the hang of R and coding in general, so bear with me on this.
My problem: this is a dimension reduction idea I have, consisting of three steps. I need help with the first two.
bin rows
transpose the binned rows into new columns, so the number of columns is multiplied by the bin size and the number of rows is divided by it
Perform PCA to then reduce columns
So the data would go from this:
A B C D
1 W1 X1 Y1 Z1
2 W2 X2 Y2 Z2
3 W3 X3 Y3 Z3
4 W4 X4 Y4 Z4
5 W5 X5 Y5 Z5
6 W6 X6 Y6 Z6
so, if I bin by 2 and transpose it would look something like this:
A A B B C C D D
1 W1 W2 X1 X2 Y1 Y2 Z1 Z2
2 W3 W4 X3 X4 Y3 Y4 Z3 Z4
3 W5 W6 X5 X6 Y5 Y6 Z5 Z6
I'm pretty sure I need to nest bin and transpose in some sort of function, but I'm not sure which comes first, or really at all how to approach this, so any suggestions will help!
I really hope this makes some sense, let me know how I can rephrase if needed!
EDIT
I am working with integer datatypes, here is a snippet of my actual data I'd like to bin and expand.
> head(dataset[1:4])
EMG1 EMG2 EMG3 EMG4
1 32744 32571 32935 32279
2 32788 32934 32767 32624
3 32828 33202 32587 32377
4 32870 33269 32423 32954
5 32838 33319 32126 32721
6 32903 33502 32652 32151

Assuming these letter-digit entries are not supposed to be stand-ins for numerics, I would first run this:
dat[] <- lapply(dat, as.character) # ensures we get rid of factors
This uses recycling of logical indices inside a function that gets serially applied across your dataframe to create two lists from each column. That is then coerced to a dataframe. The initial result res has rather odd names which get shortened with some simple regex work.
res <- data.frame( lapply(dat,
function(cl){list( list(cl[c(TRUE,FALSE)],
list(cl[!c(TRUE,FALSE)]) )) }))
names(res) <- sub("\\..+$", "", names(res))
> res
A A B B C C D D
1 W1 W2 X1 X2 Y1 Y2 Z1 Z2
2 W3 W4 X3 X4 Y3 Y4 Z3 Z4
3 W5 W6 X5 X6 Y5 Y6 Z5 Z6
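For numeric data such as the EMG snippet in the question, the same bin-and-widen step can be sketched with matrix(), which may be easier to generalize to bin sizes other than 2. This is only a sketch under assumptions: bin_widen and the _1/_2 column suffixes are made-up names, and the row count is assumed to be an exact multiple of the bin size.

```r
# Hypothetical helper: fold every `k` consecutive rows of `df` into one wide row.
bin_widen <- function(df, k) {
  stopifnot(nrow(df) %% k == 0)  # assumes the row count is a multiple of k
  # Reshape each column into a k-column matrix filled row by row, so rows
  # 1..k land in the first output row, rows (k+1)..2k in the second, etc.
  parts <- lapply(df, function(col) matrix(col, ncol = k, byrow = TRUE))
  out <- as.data.frame(do.call(cbind, parts))
  # Name the new columns after the source column plus the position in the bin.
  names(out) <- paste0(rep(names(df), each = k), "_", seq_len(k))
  out
}

# The six rows shown in the question's EDIT.
dataset <- data.frame(
  EMG1 = c(32744L, 32788L, 32828L, 32870L, 32838L, 32903L),
  EMG2 = c(32571L, 32934L, 33202L, 33269L, 33319L, 33502L),
  EMG3 = c(32935L, 32767L, 32587L, 32423L, 32126L, 32652L),
  EMG4 = c(32279L, 32624L, 32377L, 32954L, 32721L, 32151L)
)

wide <- bin_widen(dataset, 2)
```

The widened frame stays numeric, so it could then be handed to prcomp() for the PCA step.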


How to speed up sequential analysis in R?

I need to analyze a dataset sequentially, using sub-results of the preceding operations.
As I am familiar with R I decided to work with it, and one of the solutions I tried is a for loop.
The dataset I loop through has around 8 million rows with 4 columns.
I use a data.table and the variables are of type character, e.g. "XXXXXXXXX".
I tried to loop through it, but it takes approx. 0.7 seconds per cycle, of which the "<-" operation takes half a second.
Can anybody recommend a better technique? Potentially Rcpp, apply or whatever?
Thx for your support,
Holger
'%!in%' <- function(x,y)!('%in%'(x,y))
library('data.table')
dt_loop <- data.table(
paste0("XXXXXXXXXX", 1:8000000),
paste0("YXXXXXXXXX", 1:8000000),
paste0("ZXXXXXXXXX", 1:8000000),
paste0("AXXXXXXXXX", 1:8000000)
)
colnames(dt_loop)[colnames(dt_loop)=="V1"] <- "m"
colnames(dt_loop)[colnames(dt_loop)=="V2"] <- "c"
colnames(dt_loop)[colnames(dt_loop)=="V3"] <- "ma"
colnames(dt_loop)[colnames(dt_loop)=="V4"] <- "unused"
for(i in 1:nrow(dt_loop)){
m <- dt_loop$m[i]
c <- dt_loop$m[i]
if(m %!in% dt_loop$ma[1:i] & c %!in% dt_loop$ma[1:i]){
dt_loop$ma[i] <- m
} else {
if(m %in% dt_loop$ma[1:i]){
dt_loop$ma[i] <- m
} else {
dt_loop$ma[i] <- c
}
}
}
This is a self-join Cartesian product solution. I modified your code to get somewhat meaningful results. I also think that with 8 million rows you are going to have performance trouble whenever the nth iteration depends on the iterations before it.
Changes in data structure:
Used sample to get some repeats in the data.table
Simplified the column renaming with the data.table function setnames()
Added an ID field
Removed unused column.
'%!in%' <- function(x,y)!('%in%'(x,y))
library('data.table')
# Generate Data -----------------------------------------------------------
set.seed(1)
n_rows <- 10
dt_loop <- data.table(
sample(paste0("X", 1:n_rows), n_rows, replace = T),
sample(paste0("Y", 1:n_rows), n_rows, replace = T),
sample(paste0("X", 1:n_rows), n_rows, replace = T)
)
setnames(dt_loop, c('m', 'c', 'ma'))
dt_loop[, ID := .I]
I made significant changes to your loop.
Assigned c <- dt_loop$c[i] as I don't know what using m there did.
Removed the first if statement because of the new assignment of c.
# Original loop with Minor Mod --------------------------------------------
for(i in 1:nrow(dt_loop)){
m <- dt_loop$m[i]
c <- dt_loop$c[i] #changed to c instead of m
#Removed first ifelse condition
#as it didn't make sense as originally constructed
# if(m %!in% dt_loop$ma[1:i] & c %!in% dt_loop$ma[1:i]){
# dt_loop$ma2[i] <- m
# } else {
if(m %in% dt_loop$ma[1:i]){
dt_loop$ma2[i] <- m
} else {
dt_loop$ma2[i] <- c
}
# }
}
dt_loop
m c ma ID ma2
1: X3 Y3 X10 1 Y3
2: X4 Y2 X3 2 Y2
3: X6 Y7 X7 3 Y7
4: X10 Y4 X2 4 X10
5: X3 Y8 X3 5 X3
6: X9 Y5 X4 6 Y5
7: X10 Y8 X1 7 X10
8: X7 Y10 X4 8 X7
9: X7 Y4 X9 9 X7
10: X1 Y8 X4 10 X1
The self-join seems to be faster than the loop when I up the rows to 10,000, but it still slows down. Of note is that you can see when there is duplication with ma because the cartesian product expands the results so you get N == 2.
I believe that there are ways to get the self-join to work so that you only get the Nth row which should relieve some pressure.
dt_loop[dt_loop
, on = .(ID <= ID
, ma = m)
, .(.N
,i.ma2 #for comparison - remove
,ma3 = ifelse(is.na(x.ID), i.c, i.m)
,i.ID, i.m, i.c, i.ma
,x.ID, x.m, x.c, x.ma
)
, by = .EACHI
, allow.cartesian = T]
ID ma N i.ma2 ma3 i.ID i.m i.c i.ma x.ID x.m x.c x.ma
1: 1 X3 0 Y3 Y3 1 X3 Y3 X10 NA <NA> <NA> <NA>
2: 2 X4 0 Y2 Y2 2 X4 Y2 X3 NA <NA> <NA> <NA>
3: 3 X6 0 Y7 Y7 3 X6 Y7 X7 NA <NA> <NA> <NA>
4: 4 X10 1 X10 X10 4 X10 Y4 X2 1 X3 Y3 X10
5: 5 X3 2 X3 X3 5 X3 Y8 X3 2 X4 Y2 X3
6: 5 X3 2 X3 X3 5 X3 Y8 X3 5 X3 Y8 X3
7: 6 X9 0 Y5 Y5 6 X9 Y5 X4 NA <NA> <NA> <NA>
8: 7 X10 1 X10 X10 7 X10 Y8 X1 1 X3 Y3 X10
9: 8 X7 1 X7 X7 8 X7 Y10 X4 3 X6 Y7 X7
10: 9 X7 1 X7 X7 9 X7 Y4 X9 3 X6 Y7 X7
11: 10 X1 1 X1 X1 10 X1 Y8 X4 7 X10 Y8 X1
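Because the modified loop above only reads ma and writes to the separate column ma2, no iteration depends on an earlier result, so it can also be sketched without a join. The idea, which is not part of the original answer, is that m[i] occurs in ma[1:i] exactly when the first position of m[i] in ma is at most i. The vectors below are copied from the printed ten-row example; c_ is a made-up name used to avoid masking c(). This does not apply to the loop in the question, which overwrites ma as it goes.

```r
# Data copied from the ten-row example printed above.
m  <- c("X3", "X4", "X6", "X10", "X3", "X9", "X10", "X7", "X7", "X1")
c_ <- c("Y3", "Y2", "Y7", "Y4", "Y8", "Y5", "Y8", "Y10", "Y4", "Y8")
ma <- c("X10", "X3", "X7", "X2", "X3", "X4", "X1", "X4", "X9", "X4")

# Position of the first occurrence of each m value in ma (NA if absent).
first_pos <- match(m, ma)

# m[i] is "already seen" when that first occurrence is at row i or earlier.
seen <- !is.na(first_pos) & first_pos <= seq_along(m)

# Same rule as the loop: keep m when seen, otherwise fall back to c.
ma2 <- ifelse(seen, m, c_)
```

On these ten rows the result agrees with the ma2 column produced by the loop.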

R shifting values to the right

I'm facing a very simple problem I believe.
Example data:
example=data.frame(x=c(1,2,3,4,5,6,7,8), y=c(1,2,3,4,5,6,7,8), z=c(1,2,3,4,5,6,7,8))
which looks like this :
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
This is the ideal case. Because my data is coming in a single column format, I am formatting it into a dataframe like this. Now sometimes (not in every observation) there is an additional value that would always come at the same place (eg always in the same column after formatting it into a DF).
It looks like this :
data.frame(x=c("X1","X2","X3","X4","X5","X6","X7","X8"),
y=c("Y1","Y2","Y3","A38","A15","A8","Y7","Y8"),
z=c("Z1","Z2","Z3","Y4","Y5","Y6","Z7","Z8"),
aa=c(NA,NA,NA,"Z4","Z5","Z6",NA,NA))
resulting in :
x y z aa
1 X1 Y1 Z1 <NA>
2 X2 Y2 Z2 <NA>
3 X3 Y3 Z3 <NA>
4 X4 A38 Y4 Z4
5 X5 A15 Y5 Z5
6 X6 A8 Y6 Z6
7 X7 Y7 Z7 <NA>
8 X8 Y8 Z8 <NA>
The "aa" column was created in the first place only because I had an additional value (the A something) in the original single-column file. I want to save the A values in their own column (they always start with an A, for instance) and I do not really care about the "aa" one. I just want to shift all the values like this:
x y z aa A
1 X1 Y1 Z1 <NA> <NA>
2 X2 Y2 Z2 <NA> <NA>
3 X3 Y3 Z3 <NA> <NA>
4 X4 Y4 Z4 Z4 A38
5 X5 Y5 Z5 Z5 A15
6 X6 Y6 Z6 Z6 A8
7 X7 Y7 Z7 <NA> <NA>
8 X8 Y8 Z8 <NA> <NA>
For the sake of the example I have put only a few columns here, but in the real case there can be as many as 300 columns, so almost that many values to shift left, up to the point where we find NA in that row.
I have worked around this type of thing so far using this code :
format_A_things <- function(df, col) {
x <<- grep("A", df$ColumnWhereAareExpected) #selecting lines starting with an A
if (length(x) > 0){
for (i in x){
df[i,"SpecificColumnforA"] = df[i,col]
for (j in col:(ncol(df)-2)){
if (is.na(df[i,j]) | is.na(df[i,j+1])){
df[i,j] = NA
} else {
df[i,j] = df[i,j+1]
}
}
}
}
return(df)
}
This is working fine except that it's very slow (I may have several hundred lines to go through and it takes tens of minutes, which is not ideal). I know I've probably used the slowest way to do this type of thing, hence my question to you guys. How do I achieve such a result?
Thanks ahead of time for your help !
Based on comments, here is how I transformed my function to perform the action :
countString <- function(df, col, str) {
sel <<- grepl(str, df[,col], fixed = TRUE)
while (any(sel)){
df[sel,"Column"] = df[sel,"Column"] + 1
# shift the remaining columns one position to the left
df[sel, col:(ncol(df)-2)] <- df[sel, (col+1):(ncol(df)-1)]
sel <<- grepl(str, df[,col], fixed = TRUE)
}
return(df)
}
I would have hoped that, with the while loop and the selection being reset at the end of each pass, it would have gone through several occurrences if needed. It does not, so after calling my function I check the selection again, and if any(sel) is true I call the function again.
Anyways, it is much faster than my previous method which was the primary intent here.
Thanks everyone for helping out.
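For the four-column example from the question, the whole shift can be sketched without any row loop by assigning a block of columns at once. This is a sketch under assumptions: shift_left is a made-up name, the extra values are assumed to match the pattern "^A", and the last column is left untouched, as in the desired output shown in the question.

```r
df <- data.frame(
  x  = c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"),
  y  = c("Y1", "Y2", "Y3", "A38", "A15", "A8", "Y7", "Y8"),
  z  = c("Z1", "Z2", "Z3", "Y4", "Y5", "Y6", "Z7", "Z8"),
  aa = c(NA, NA, NA, "Z4", "Z5", "Z6", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: `col` is the index of the column where the extra
# value shows up; matching rows are saved to a new column A and shifted left.
shift_left <- function(df, col, pattern = "^A") {
  n <- ncol(df)                     # index of the last original column
  sel <- grepl(pattern, df[[col]])  # rows that carry the extra value
  df$A <- NA_character_
  df$A[sel] <- df[[col]][sel]       # keep the extra value in its own column
  if (any(sel) && col < n) {
    # Move columns (col+1)..n one position to the left, all rows at once.
    df[sel, col:(n - 1)] <- df[sel, (col + 1):n]
  }
  df
}

out <- shift_left(df, col = 2)
```

All selected rows are handled in one assignment, so the cost no longer grows with a nested loop over rows and columns.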

Adding data by row into an empty matrix and handling missing data

I have an empty matrix with a certain number of columns that I'm trying to fill row by row with the output vectors of a for loop. However, some of the output vectors are shorter than their slots in the matrix, and I just want to fill up those "empty spaces" with NAs.
For example:
matrix.names <- c("x1", "x2", "x3", "x4", "y1", "y2", "y3", "y4", "z1", "z2", "z3", "z4")
my.matrix <- matrix(ncol = length(matrix.names))
colnames(my.matrix) <- matrix.names
This would be the output from one iteration:
x <- c(1,2)
y <- c(4,2,1,5)
z <- c(1)
Where I would want it in the matrix like this:
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
The output from the next iteration would be, for example:
x <- c(1,1,1,1)
y <- c(0,4)
z <- c(4,1,3)
And added as a new row in the matrix:
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
[2,] 1 1 1 1 0 4 NA NA 4 1 3 NA
It's not really a concern if I have a 0, it's just where there is no data. Also, the data is saved in such a way that whatever is there is listed in the row first, followed by NAs in empty slots. In other words, I'm not worried if an NA may pop up first.
Also, is such a thing better handled in data frames rather than matrices?
Not the most efficient answer, just a try.
Logic: extend each vector with NAs so it has at least 4 elements (the exception is when x/y/z already has length 4, which makes it longer than needed). Therefore, while rbinding, I only extract the first 4 elements.
x[length(x)+1:4] <- NA
y[length(y)+1:4] <- NA
z[length(z)+1:4] <- NA
my.matrix <- rbind(my.matrix,c(x[1:4],y[1:4],z[1:4]))
Note : the exception I mentioned above is like below :
> x <- c(1,1,1,1)
> x
[1] 1 1 1 1
> x[length(x)+1:4] <- NA
> x
[1] 1 1 1 1 NA NA NA NA # therefore I extract only the first four
Here is an option to do this programmatically
d1 <- stack(mget(c("x", "y", "z")))[2:1]
nm <- with(d1, paste0(ind, ave(seq_along(ind),ind, FUN = seq_along)))
my.matrix[,match(nm,colnames(my.matrix), nomatch = 0)] <- d1$values
my.matrix
# x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
#[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
Or another option is stri_list2matrix from stringi
library(stringi)
m1 <- as.numeric(stri_list2matrix(list(x,y, z)))
Change the 'x', 'y', 'z' values
m2 <- as.numeric(stri_list2matrix(list(x,y, z)))
rbind(m1, m2)
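A base R sketch of the same padding idea, without extra packages: assigning to length() pads a vector with NA. The helper names pad and make_row are made up for illustration, and each of x, y and z is assumed to have at most 4 values.

```r
matrix.names <- c("x1", "x2", "x3", "x4", "y1", "y2", "y3", "y4",
                  "z1", "z2", "z3", "z4")

# Extend a vector to length n; the new slots are filled with NA.
pad <- function(v, n) {
  length(v) <- n
  v
}

# Build one matrix row from the three outputs of an iteration.
make_row <- function(x, y, z, n = 4) c(pad(x, n), pad(y, n), pad(z, n))

# Collect the rows in a list inside the loop and bind them once at the end.
rows <- list(
  make_row(c(1, 2), c(4, 2, 1, 5), c(1)),
  make_row(c(1, 1, 1, 1), c(0, 4), c(4, 1, 3))
)
my.matrix <- do.call(rbind, rows)
colnames(my.matrix) <- matrix.names
```

Binding once at the end avoids growing the matrix with rbind() on every iteration, and it also avoids the all-NA first row that matrix(ncol = ...) creates.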

cast() dataset and return two values

Aloha,
I am trying to cast() a dataset in which every unique combination of W-X-Y returns the max number of Z AND the associated week. For example:
W X Y week Z
w1 x1 y1 1 0
w1 x1 y1 2 0.1
w1 x1 y1 3 0.2
w2 x2 y1 1 0.5
w2 x2 y1 2 0.7
w2 x2 y1 3 0.3
w3 x1 y1 1 0.1
w3 x1 y1 2 0.2
w3 x1 y1 3 0.5
w4 x2 y2 1 0.7
w4 x2 y2 2 0.3
w4 x2 y2 3 0.1
w5 x1 y2 1 0.3
w5 x1 y2 2 0.1
w5 x1 y2 3 0.2
Can I do this w/cast()? I am able to return just the max number of Z per unique W-X-Y combination, but not the week with the following:
cast(foo, W + X + Y ~ ., max, value="Z")
For the above dataset, I would like the output to look as such:
W X Y week Z
w1 x1 y1 3 0.2
w2 x2 y1 2 0.7
w3 x1 y1 3 0.5
w4 x2 y2 1 0.7
w5 x1 y2 1 0.3
Mahalo for your suggestions!
cast is not the right tool for this. Consider instead the functions in the plyr package:
library("plyr")
ddply(foo, .(W, X, Y), summarise, week=week[which.max(Z)], Z=max(Z))
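If plyr is not available, roughly the same result can be sketched in base R by splitting on the three grouping columns and keeping the row with the largest Z in each group. This is an alternative to the ddply call above, not a replacement for it; foo is rebuilt here from the table in the question.

```r
foo <- read.table(text = "W X Y week Z
w1 x1 y1 1 0
w1 x1 y1 2 0.1
w1 x1 y1 3 0.2
w2 x2 y1 1 0.5
w2 x2 y1 2 0.7
w2 x2 y1 3 0.3
w3 x1 y1 1 0.1
w3 x1 y1 2 0.2
w3 x1 y1 3 0.5
w4 x2 y2 1 0.7
w4 x2 y2 2 0.3
w4 x2 y2 3 0.1
w5 x1 y2 1 0.3
w5 x1 y2 2 0.1
w5 x1 y2 3 0.2", header = TRUE, stringsAsFactors = FALSE)

# One data frame per W-X-Y combination that actually occurs in the data.
groups <- split(foo, list(foo$W, foo$X, foo$Y), drop = TRUE)

# Keep the whole row with the largest Z, so week comes along with it.
best <- do.call(rbind, lapply(groups, function(g) g[which.max(g$Z), ]))
best <- best[order(best$W), ]
```

Like the ddply version, which.max() returns the first maximum, so ties within a group resolve to the earliest row.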

Selecting values from a 3-column dataframe in R

I have a 3-column data frame, the variables being x, y and z. x is a list of places, y is a list of times, and z is a list of names. The names do not start at the same initial time across the places:
x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2
How do I find the first z for every x? I want the output matrix or dataframe to be:
x z
x1 z2
x2 z5
x3 z3
EDITED, after example data was supplied
You can use function ddply() in package plyr
dat <- "x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2"
df <- read.table(textConnection(dat), header=TRUE, stringsAsFactors=FALSE)
library(plyr)
ddply(df, .(x), function(x)x[!is.na(x$z), ][1, "z"])
x V1
1 x1 z2
2 x2 z5
3 x3 z3
If you don't want to use plyr
t(data.frame(lapply(split(df, as.factor(df$x)), function(k) head(k$z[!is.na(k$z)], 1))))
[,1]
x1 "z2"
x2 "z5"
x3 "z3"
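Another base R sketch, assuming the rows are already ordered by time within each place as in the example: drop the rows where z is NA, then keep the first remaining row per x with duplicated(). The names ok and first_z are made up for illustration.

```r
df <- read.table(text = "x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2", header = TRUE, stringsAsFactors = FALSE)

# Rows that actually have a name.
ok <- df[!is.na(df$z), ]

# duplicated() flags every repeat of x, so negating it keeps the first row per place.
first_z <- ok[!duplicated(ok$x), c("x", "z")]
```

This returns a data frame with columns x and z, which is the shape asked for in the question.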
