efficient rbind alternative with applied function - r

To take a step back, my ultimate goal is to read in around 130,000 images into R with a pixel size of HxW and then to make a dataframe/datatable containing the rgb of each pixel of each image on a new row. So the output will be something like this:
> head(train_data, 10)
image_no r g b pixel_no
1: 00003e153.jpg 0.11764706 0.1921569 0.3098039 1
2: 00003e153.jpg 0.11372549 0.1882353 0.3058824 2
3: 00003e153.jpg 0.10980392 0.1843137 0.3019608 3
4: 00003e153.jpg 0.11764706 0.1921569 0.3098039 4
5: 00003e153.jpg 0.12941176 0.2039216 0.3215686 5
6: 00003e153.jpg 0.13333333 0.2078431 0.3254902 6
7: 00003e153.jpg 0.12549020 0.2000000 0.3176471 7
8: 00003e153.jpg 0.11764706 0.1921569 0.3098039 8
9: 00003e153.jpg 0.09803922 0.1725490 0.2901961 9
10: 00003e153.jpg 0.11372549 0.1882353 0.3058824 10
I currently have a piece of code to do this in which I apply a function to get the rgb for each pixel of a specified image, returning the result in a dataframe:
#function to get rgb from image file paths
get_rgb_table <- function(link){
img <- readJPEG(toString(link))
# Creating the data frame
rgb_image <- data.frame(r = as.vector(img[1:H, 1:W, 1]),
g = as.vector(img[1:H, 1:W, 2]),
b = as.vector(img[1:H, 1:W, 3]))
#add pixel id
rgb_image$pixel_no <- row.names(rgb_image)
#add image id
train_rgb <- cbind(sub('.*/', '',link),rgb_image)
colnames(train_rgb)[1] <- "image_no"
return(train_rgb)
}
I call this function on another dataframe which contains the links to all the images:
train_files <- list.files(path="~/images/", pattern=".jpg",all.files=T, full.names=T, no.. = T)
train <- data.frame(matrix(unlist(train_files), nrow=length(train_files), byrow=T))
The train dataframe looks like this:
> head(train, 10)
link
1 C:/Documents/image/00003e153.jpg
2 C:/Documents/image/000155de5.jpg
3 C:/Documents/image/00021ddc3.jpg
4 C:/Documents/image/0002756f7.jpg
5 C:/Documents/image/0002d0f32.jpg
6 C:/Documents/image/000303d4d.jpg
7 C:/Documents/image/00031f145.jpg
8 C:/Documents/image/00053c6ba.jpg
9 C:/Documents/image/00057a50d.jpg
10 C:/Documents/image/0005d01c8.jpg
I finally get the result I want with the following loop:
for(i in 1:length(train[,1])){
train_data <- rbind(train_data,get_rgb_table(train[i,1]))
}
However, this last bit of code is very inefficient. An optimization of how the function is applied and/or of the rbind would help. I think the function get_rgb_table() itself is quick, and that the problem is with the loop and the rbind. I have tried using apply() but can't manage to do this on each row and put the result in one dataframe without running out of memory. Any help on this would be great. Thanks!

This is very difficult to answer given the vagueness of the question, but I'll make a reproducible example of what I think you're asking and will give a solution.
Say I have a function that returns a data frame:
MyFun <- function(x)randu[1:x,]
And I have a data frame df that will act as an input to the function.
# a b
# 1 1 21
# 2 2 22
# 3 3 23
# 4 4 24
# 5 5 25
# 6 6 26
# 7 7 27
# 8 8 28
# 9 9 29
# 10 10 30
From your question, it looks like only one column will be used as input. So, I apply the function to each row of this data frame using lapply then I bind the results together using do.call and rbind like this:
do.call(rbind, lapply(df$a, MyFun))
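A minimal, self-contained sketch of the same pattern follows. The construction of df is an assumption chosen to match the printout above; randu is a dataset that ships with R. The point is that every piece is built first and rbind is called once, rather than growing a data frame inside a loop.

```r
# Toy input mirroring the data frame printed above (assumed construction)
df <- data.frame(a = 1:10, b = 21:30)

# A function that returns a data frame; randu is a built-in dataset
MyFun <- function(x) randu[1:x, ]

# Build all pieces first, then bind them in a single call
pieces <- lapply(df$a, MyFun)
result <- do.call(rbind, pieces)

dim(result)
```

Applied to the question, this would presumably look like do.call(rbind, lapply(train_files, get_rgb_table)). With around 130,000 images the combined table may still be too large for memory, so this only removes the repeated copying caused by rbind inside the loop.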

Related

Sorting dataframe in R in reverse order - Column name as a variable [duplicate]

I've looked and looked and the answer either does not work for me, or it's far too complex and unnecessary.
I have data, it can be any data, here is an example
chickens <- read.table(textConnection("
feathers beaks
2 3
6 4
1 5
2 4
4 5
10 11
9 8
12 11
7 9
1 4
5 9
"), header = TRUE)
I need to, very simply, sort the data by the 1st column in descending order. It's pretty straightforward, but I have found two things below that both do not work and give me an error which says:
"Error in order(var) : Object 'var' not found."
They are:
chickens <- chickens[order(-feathers),]
and
chickens <- chickens[sort(-feathers),]
I'm not sure what I'm doing wrong. I can get it to work if I put the df name in front of the varname, but that won't work if I put a minus sign in front of the varname to imply descending sort.
I'd like to do this as simply as possible, i.e. no boolean logic variables, nothing like that. Something akin to SPSS's
SORT BY varname (D)
The answer is probably right in front of me, I apologize for the basic question.
Thank you!
You need to use dataframe name as prefix
chickens[order(chickens$feathers),]
To sort in descending order, use the decreasing argument of order():
chickens[order(chickens$feathers, decreasing = TRUE),]
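As a rough sketch of the base R options, the three calls below should all sort the question's data by feathers in descending order. The variable names by_arg, by_neg and by_with are only illustrative. The negation form works here because feathers is numeric.

```r
# Data from the question
chickens <- read.table(text = "feathers beaks
2 3
6 4
1 5
2 4
4 5
10 11
9 8
12 11
7 9
1 4
5 9", header = TRUE)

# 1. Prefix the column with the data frame name and use decreasing = TRUE
by_arg <- chickens[order(chickens$feathers, decreasing = TRUE), ]

# 2. Negate the (numeric) column, still with the data frame prefix
by_neg <- chickens[order(-chickens$feathers), ]

# 3. Use with() so the column can be referenced without the prefix
by_with <- with(chickens, chickens[order(-feathers), ])

by_with$feathers
```

The original error arises because order(-feathers) on its own looks for a variable called feathers in the calling environment rather than inside the data frame.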
In base R, the syntax requires the data frame name as a prefix, as @dmi3kno has shown. You can also use with() to avoid repeating the data frame name and $, as mentioned by @joran.
However, you can also do this with data.table :
library(data.table)
setDT(chickens)[order(-feathers)]
#Also
#setDT(chickens)[order(feathers, decreasing = TRUE)]
# feathers beaks
# 1: 12 11
# 2: 10 11
# 3: 9 8
# 4: 7 9
# 5: 6 4
# 6: 5 9
# 7: 4 5
# 8: 2 3
# 9: 2 4
#10: 1 5
#11: 1 4
and dplyr :
library(dplyr)
chickens %>% arrange(desc(feathers))

Is there a way to create a permutation of a vector without using the sample() function in R?

I hope you are having a nice day. I would like to know if there is a way to create a permutation (rearrangement) of the values in a vector in R?
My professor gave us an assignment in which we are supposed to create functions for a randomization test, one using sample() to create a permutation and one not using the sample() function. So far all of my efforts have been fruitless, as every answer that I can find resorts to the sample() function. I have tried several other methods, such as indexing with runif() and writing my own functions, but to no avail. Alas, I have accepted defeat and come here for salvation.
While using the sample() function, the code looks like:
#create the groups
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
#create a permutation of the combined vector without replacement using the sample function()
permsample <-sample(c(a,b),replace=FALSE)
permsample
[1] 2 5 6 1 7 7 3 8 6 3 5 9 2 7 4 8 1 5
And, for reference, the entire code of my function looks like:
PermutationTtest <- function(a, b, P){
sample.t.value <- t.test(a, b)$statistic
perm.t.values<-matrix(rep(0,P),P,1)
N <-length(a)
M <-length(b)
for (i in 1:P)
{
permsample <-sample(c(a,b),replace=FALSE)
pgroup1 <- permsample[1:N]
pgroup2 <- permsample[(N+1) : (N+M)]
perm.t.values[i]<- t.test(pgroup1, pgroup2)$statistic
}
return(mean(perm.t.values))
}
How would I achieve the same thing, but without using the sample() function and within the confines of base R? The only hint my professor gave was "use indices." Thank you very much for your help and have a nice day.
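You can use runif() to generate a value between 1 and the number of remaining elements plus one. The floor() function returns the integer part of that number, which gives a random index into the remaining elements. At each iteration, I append the element in the rn-th position of the remaining vector to the new one, remove it, and decrease the range of the random number.
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
# avoid calling the vector `c`, which would mask the base function c()
pool <- c(a, b)
index <- length(pool)
perm <- c()
for (i in seq_along(pool)) {
  # max = index + 1 so that floor() can return index itself;
  # with max = index the last remaining position would never be drawn
  rn <- floor(runif(1, min = 1, max = index + 1))
  perm <- append(perm, pool[rn])
  pool <- pool[-rn]
  index <- index - 1
}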
It is easier to see what is going on if we use consecutive numbers:
a <- 1:8
b <- 9:17
ab <- c(a, b)
ab
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Now draw 17 (length(ab)) random numbers and use them to order ab:
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 5 13 11 12 6 1 17 3 10 2 8 16 7 4 9 15 14
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 14 11 5 15 10 7 13 9 17 8 2 6 1 4 16 12 3
For each permutation just draw another 17 random numbers.
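To connect this back to the question, here is a sketch of the permutation loop with sample() swapped out for the order(runif()) idea. The names permute_no_sample and PermutationTtestNoSample are hypothetical, and unlike the function in the question this version returns the whole vector of permuted t statistics rather than their mean.

```r
# Hypothetical helper: permute a vector by ordering uniform random keys
permute_no_sample <- function(x) {
  x[order(runif(length(x)))]
}

# Sketch of the question's loop without sample(); returns all P statistics
PermutationTtestNoSample <- function(a, b, P) {
  N <- length(a)
  M <- length(b)
  perm.t.values <- numeric(P)
  for (i in seq_len(P)) {
    permsample <- permute_no_sample(c(a, b))
    pgroup1 <- permsample[1:N]
    pgroup2 <- permsample[(N + 1):(N + M)]
    perm.t.values[i] <- t.test(pgroup1, pgroup2)$statistic
  }
  perm.t.values
}

a <- c(2, 5, 5, 6, 6, 7, 8, 9)
b <- c(1, 1, 2, 3, 3, 4, 5, 7, 7, 8)
set.seed(1)
perm.t <- PermutationTtestNoSample(a, b, P = 100)
length(perm.t)
```

The observed statistic t.test(a, b)$statistic can then be compared against perm.t in whatever way the assignment requires.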

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in "possible" called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID. Instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
library(data.table)
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
With sample data to work with, the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
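For comparison, the same check can be sketched in base R without an explicit loop, using outer() on the data from the question. This assumes the interval bounds are inclusive; the column names Between and Match follow the question's wording, and hits is an illustrative name.

```r
# Data from the question (SNR omitted, since it is not needed)
truth <- data.frame(ID = 1:5,
                    Start = c(213466, 32238, 218934, 222774, 68137),
                    Stop  = c(213468, 32240, 218936, 222776, 68139))
possible <- data.frame(ID = 1:8,
                       Times = c(32239.76, 32241.14, 68138.72, 111233.93,
                                 128395.28, 146180.31, 188433.35, 198714.7))

# hits[i, j] is TRUE when possible$Times[i] lies within truth interval j
hits <- outer(possible$Times, truth$Start, ">=") &
        outer(possible$Times, truth$Stop, "<=")

# A time is "between" if it falls inside at least one interval
possible$Between <- as.integer(rowSums(hits) > 0)

# A truth row has a "match" if at least one time falls inside it
truth$Match <- as.integer(colSums(hits) > 0)
```

The hits matrix has one cell per pair of rows, so for very large inputs the loop above or a data.table join is likely to be easier on memory. The "within 2 seconds" variant could be handled by widening Start and Stop by 2 before the comparison.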

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need is to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now on a small scale I can write a code to change it like I did with ATR
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function which I feel like may work but I am not sure how to logically build this argument, it was posted on the questions board here
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x <0,0)}))
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)}))
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with their fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
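If neither package is available, the same lookup can be sketched in base R with match(). The column names code and num in the lookup table are assumptions made for this illustration; the values come from the question.

```r
# Lookup table: alphabetical code -> numeric identifier (values from the question)
ref <- data.frame(code = c("ZYO", "KAO", "MKU", "SAG"),
                  num  = c(64, 24, 32, 42),
                  stringsAsFactors = FALSE)

# Small stand-in for the main data frame
g3 <- data.frame(SAG = c("", "SAG", "", "SAG"),
                 KAO = c("KAO", "KAO", "", ""),
                 stringsAsFactors = FALSE)

# For each column, find the position of every code in ref$code and
# pull the corresponding number; codes with no match become NA
g3[] <- lapply(g3, function(x) ref$num[match(x, ref$code)])

g3
```

Assigning into g3[] keeps the result a data frame with the original column names.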

looping over the name of the columns in R for creating new columns

I am trying to loop over the column names of an existing dataframe and create new columns based on each of the old columns. Here is my sample data:
sample<-list(c(10,12,17,7,9,10),c(NA,NA,NA,10,12,13),c(1,1,1,0,0,0))
sample<-as.data.frame(sample)
colnames(sample)<-c("x1","x2","D")
>sample
x1 x2 D
10 NA 1
12 NA 1
17 NA 1
7 10 0
9 12 0
10 13 0
Now, I am trying to use a for loop to generate two variables, x1.imp and x2.imp, that take the values from the D=0 rows when D=1 and the values from the D=1 rows when D=0. (Here I actually don't need a for loop, but for my original dataset with a large number of columns (variables), I really need the loop.) It is based on the following condition:
for (i in names(sample[,1:2])){
sample$i.imp<-with (sample, ifelse (D==1, i[D==0],i[D==1]))
i=i+1
return(sample)
}
Error in i + 1 : non-numeric argument to binary operator
However, the following works, but it doesn't name the new columns x1.imp and x2.imp:
for(i in sample[,1:2]){
impt.i<-with(sample,ifelse(D==1,i[D==0],i[D==1]))
i=i+1
print(as.data.frame(impt.i))
}
impt.i
1 7
2 9
3 10
4 10
5 12
6 17
impt.i
1 10
2 12
3 13
4 NA
5 NA
6 NA
Note that I already know the solution without loop [here]. I want with loop.
Expected output:
x1 x2 D x1.imp x2.imp
10 NA 1 7 10
12 NA 1 9 12
17 NA 1 10 13
7 10 0 10 NA
9 12 0 12 NA
10 13 0 17 NA
I would greatly appreciate your valuable input in this regard.
This is nuts, but since you are asking for it... Your code with minimum changes would be:
for (i in colnames(sample)[1:2]){
sample[[paste0(i, '.impt')]] <- with(sample, ifelse(D==1, get(i)[D==0],get(i)[D==1]))
}
A few comments:
replaced names(sample[,1:2]) with the more elegant colnames(sample)[1:2]
the $ is for interactive usage. Instead, when programming, i.e. when the column name is to be interpreted, you need to use [ or [[, hence I replaced sample$i.imp with sample[[paste0(i, '.impt')]]
inside with, i[D==0] will not give you x1[D==0] when i is "x1", hence the need to dereference it using get.
you should not name your data.frame sample as it is also the name of a pretty common function
This should work,
test <- sample[,"D"] == 1
for (.name in names(sample)[1:2]){
newvar <- paste(.name, "impt", sep=".")
sample[[newvar]] <- ifelse(test, sample[!test, .name],
sample[test, .name])
}
sample
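Both answers rely on ifelse() recycling the shorter vectors, which only lines up because the two groups have the same size and are stored in contiguous blocks. A sketch that makes the swap explicit is shown below; it still assumes equal group sizes, and the names dat, is_d1 and new_col are illustrative (dat avoids masking the sample() function).

```r
# Data built from the question's code
dat <- data.frame(x1 = c(10, 12, 17, 7, 9, 10),
                  x2 = c(NA, NA, NA, 10, 12, 13),
                  D  = c(1, 1, 1, 0, 0, 0))

is_d1 <- dat$D == 1

for (nm in c("x1", "x2")) {
  new_col <- rep(NA_real_, nrow(dat))
  # Rows with D == 1 receive the values from the D == 0 rows, and vice versa
  new_col[is_d1]  <- dat[[nm]][!is_d1]
  new_col[!is_d1] <- dat[[nm]][is_d1]
  # [[ with a pasted name creates the column x1.impt, x2.impt, ...
  dat[[paste0(nm, ".impt")]] <- new_col
}

dat
```

If the groups differ in size, the assignments will warn or recycle, so a real dataset would need an explicit rule for pairing rows.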
