Sorting dataframe in R in reverse order - Column name as a variable [duplicate]

I've looked and looked and the answer either does not work for me, or it's far too complex and unnecessary.
I have data (it can be any data); here is an example:
chickens <- read.table(textConnection("
feathers beaks
2 3
6 4
1 5
2 4
4 5
10 11
9 8
12 11
7 9
1 4
5 9
"), header = TRUE)
I need to, very simply, sort the data by the first column in descending order. It's pretty straightforward, but the two approaches below both fail and give me an error which says:
"Error in order(var) : Object 'var' not found."
They are:
chickens <- chickens[order(-feathers),]
and
chickens <- chickens[sort(-feathers),]
I'm not sure what I'm missing. I can get it to work if I put the data frame name in front of the variable name, but that won't work if I also put a minus sign in front of the variable name to get a descending sort.
I'd like to do this as simply as possible, i.e. no boolean logic variables, nothing like that. Something akin to SPSS's
SORT BY varname (D)
The answer is probably right in front of me, I apologize for the basic question.
Thank you!

You need to use the data frame name as a prefix:
chickens[order(chickens$feathers),]
To reverse the order, order() has a decreasing argument:
chickens[order(chickens$feathers, decreasing = TRUE),]
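Since feathers is numeric, you can also negate it inside order() once the prefix is there, which gives the same descending sort the question was aiming for:
chickens[order(-chickens$feathers), ]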

In base R, the syntax needs the data frame name as a prefix, as @dmi3kno has shown. You can also use with() to avoid repeating the data frame name and $, as mentioned by @joran.
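For example, a with() version of the descending sort (equivalent to the calls above):
with(chickens, chickens[order(-feathers), ])
# or: chickens[with(chickens, order(feathers, decreasing = TRUE)), ]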
However, you can also do this with data.table:
library(data.table)
setDT(chickens)[order(-feathers)]
#Also
#setDT(chickens)[order(feathers, decreasing = TRUE)]
# feathers beaks
# 1: 12 11
# 2: 10 11
# 3: 9 8
# 4: 7 9
# 5: 6 4
# 6: 5 9
# 7: 4 5
# 8: 2 3
# 9: 2 4
#10: 1 5
#11: 1 4
and with dplyr:
library(dplyr)
chickens %>% arrange(desc(feathers))

Related

Is there an R function to redefine a variable so I can use the spread function?

I'm new to R and I have the following problem. Maybe it's a really easy question, but I don't know the terms to search for an answer.
My problem:
I have several persons; each person is assigned a study number (SN), and each SN has one or more tests performed. Each test can have multiple results.
My data is long at the moment, but I need it to be wide (one row for each SN).
For example:
What I have:
SN testnumbers result
1 1 1234 6
2 1 1234 9
3 2 4567 6
4 3 5678 9
5 3 8790 9
What I want:
SN test1result1 test1result2 test2result1
1 1 6 6 NA
2 2 6 NA NA
3 3 9 NA 9
So I think I need to renumber the testnumbers to test 1, test 2, etc. within each SN in order to use the spread function, but I don't know how.
I did manage to renumber the test numbers from 1 to the last unique test number, but the wide dataframe still looks awful.
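One possible sketch (not from the thread): number the repeated results within each test, renumber the tests within each SN, build a key such as test1result1, and then spread. The data frame name long and the helper columns resultno, testno and key below are made up for illustration:
library(dplyr)
library(tidyr)

long <- data.frame(SN          = c(1, 1, 2, 3, 3),
                   testnumbers = c(1234, 1234, 4567, 5678, 8790),
                   result      = c(6, 9, 6, 9, 9))

long %>%
  group_by(SN, testnumbers) %>%
  mutate(resultno = row_number()) %>%                           # result 1, 2, ... within each test
  group_by(SN) %>%
  mutate(testno = match(testnumbers, unique(testnumbers))) %>%  # test 1, 2, ... within each SN
  ungroup() %>%
  transmute(SN, key = paste0("test", testno, "result", resultno), result) %>%
  spread(key, result)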

efficient rbind alternative with applied function

To take a step back, my ultimate goal is to read around 130,000 images into R, each with a pixel size of HxW, and then to make a dataframe/data.table containing the RGB of each pixel of each image on a new row. So the output will be something like this:
> head(train_data, 10)
image_no r g b pixel_no
1: 00003e153.jpg 0.11764706 0.1921569 0.3098039 1
2: 00003e153.jpg 0.11372549 0.1882353 0.3058824 2
3: 00003e153.jpg 0.10980392 0.1843137 0.3019608 3
4: 00003e153.jpg 0.11764706 0.1921569 0.3098039 4
5: 00003e153.jpg 0.12941176 0.2039216 0.3215686 5
6: 00003e153.jpg 0.13333333 0.2078431 0.3254902 6
7: 00003e153.jpg 0.12549020 0.2000000 0.3176471 7
8: 00003e153.jpg 0.11764706 0.1921569 0.3098039 8
9: 00003e153.jpg 0.09803922 0.1725490 0.2901961 9
10: 00003e153.jpg 0.11372549 0.1882353 0.3058824 10
I currently have a piece of code to do this in which I apply a function to get the rgb for each pixel of a specified image, returning the result in a dataframe:
#function to get rgb from an image file path
library(jpeg)  # for readJPEG()

get_rgb_table <- function(link){
  img <- readJPEG(toString(link))
  # Creating the data frame (H and W are the image height and width in pixels,
  # defined elsewhere in the script)
  rgb_image <- data.frame(r = as.vector(img[1:H, 1:W, 1]),
                          g = as.vector(img[1:H, 1:W, 2]),
                          b = as.vector(img[1:H, 1:W, 3]))
  #add pixel id
  rgb_image$pixel_no <- row.names(rgb_image)
  #add image id (the file name without its directory path)
  train_rgb <- cbind(sub('.*/', '', link), rgb_image)
  colnames(train_rgb)[1] <- "image_no"
  return(train_rgb)
}
I call this function on another dataframe which contains the links to all the images:
train_files <- list.files(path="~/images/", pattern=".jpg",all.files=T, full.names=T, no.. = T)
train <- data.frame(matrix(unlist(train_files), nrow=length(train_files), byrow=T))
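As an aside, the same one-column data frame can be built more directly (a sketch; link matches the column name printed below):
train <- data.frame(link = train_files, stringsAsFactors = FALSE)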
The train dataframe looks like this:
> head(train, 10)
link
1 C:/Documents/image/00003e153.jpg
2 C:/Documents/image/000155de5.jpg
3 C:/Documents/image/00021ddc3.jpg
4 C:/Documents/image/0002756f7.jpg
5 C:/Documents/image/0002d0f32.jpg
6 C:/Documents/image/000303d4d.jpg
7 C:/Documents/image/00031f145.jpg
8 C:/Documents/image/00053c6ba.jpg
9 C:/Documents/image/00057a50d.jpg
10 C:/Documents/image/0005d01c8.jpg
I finally get the result I want with the following loop:
# note: train_data must already exist (e.g. as an empty data frame) before the loop
for(i in 1:length(train[, 1])){
  train_data <- rbind(train_data, get_rgb_table(train[i, 1]))
}
However, this last bit of code is very inefficient. An optimization of how the function is applied and/or of the rbind would help. I think the function get_rgb_table() itself is quick, but the problem is with the loop and the rbind. I have tried using apply() but can't manage to do this on each row and put the result in one dataframe without running out of memory. Any help on this would be great. Thanks!
This is very difficult to answer given the vagueness of the question, but I'll make a reproducible example of what I think you're asking and will give a solution.
Say I have a function that returns a data frame:
MyFun <- function(x) randu[1:x, ]
And I have a data frame df that will act as input to the function:
# a b
# 1 1 21
# 2 2 22
# 3 3 23
# 4 4 24
# 5 5 25
# 6 6 26
# 7 7 27
# 8 8 28
# 9 9 29
# 10 10 30
From your question, it looks like only one column will be used as input. So I apply the function to each row of this data frame using lapply, then bind the results together using do.call and rbind, like this:
do.call(rbind, lapply(df$a, MyFun))
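Applied to the question's objects, the same pattern replaces the loop and the repeated rbind (a sketch, assuming the file paths sit in the first column of train):
train_data <- do.call(rbind, lapply(train[, 1], get_rgb_table))
# data.table's rbindlist() is usually faster when binding many pieces:
# library(data.table)
# train_data <- rbindlist(lapply(train[, 1], get_rgb_table))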

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For every row in "truth" that finds a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID; instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
  possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
                    c(sample(5:20, size = 100, replace = T)),
                    c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
  #get boolean vector to show if any of the 'truth' rows are a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  #if any are true then update the match and between vectors
  if(any(match.vec)){
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
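The same 0/1 columns can also be filled without an explicit loop; a compact base-R sketch, using the Start/Stop/Times column names from the question rather than the random data above:
possible$Between <- as.integer(sapply(possible$Times, function(t)
  any(t >= truth$Start & t <= truth$Stop)))
truth$Match <- as.integer(mapply(function(s, e)
  any(possible$Times >= s & possible$Times <= e),
  truth$Start, truth$Stop))
# for the "within 2 seconds" variant, widen the interval:
# any(t >= truth$Start - 2 & t <= truth$Stop + 2)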

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need is to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now, on a small scale, I can write code to change it, like I did with ATR:
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time-consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function, which I feel may work, but I am not sure how to build the argument logically. It was posted on the questions board here:
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x < 0, 0)}))
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)})
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[, lapply(.SD, function(x) ref$g4[chmatch(x, ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with its fmatch function:
require(fastmatch)
g3[, lapply(.SD, function(x) ref$g4[fmatch(x, ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
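For comparison, the same lookup can be done in base R with match(), which likewise leaves NA where no code is found (a sketch, assuming ref and g3 are plain data frames as in the question):
g3[] <- lapply(g3, function(x) ref$g4[match(x, ref$num.individuals)])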

loop over columns with semi like columnnames

I have the following variable and dataframe
welltypes <- c("LC","HC")
qccast <- data.frame(
LC_mean=1:10,
HC_mean=10:1,
BC_mean=rep(0,10)
)
Now I only want to see the welltypes I selected (in this case LC and HC, but it could also be different ones).
for(i in 1:length(welltypes)){
qccast$welltypes[i]_mean
}
This does not work, I know.
But how do I loop over those columns?
And it has to happen variable-wise, because welltypes is of an unknown size.
The second argument to $ needs to be a column name of the first argument. I haven't run the code, but I would expect welltypes[i]_mean to be a syntax error. $ is similar to [[, so you can use paste to create the column name string and subset via [[.
For example:
qccast[[paste(welltypes[i],"_mean",sep="")]]
Depending on the rest of your code, you may be able to do something like this instead.
for(i in paste(welltypes, "_mean", sep = "")){
  qccast[[i]]
}
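For example (a sketch), collecting the selected columns into a named list instead of discarding them:
selected <- list()
for(i in paste(welltypes, "_mean", sep = "")){
  selected[[i]] <- qccast[[i]]
}
str(selected)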
Here's another strategy:
qccast[ sapply(welltypes, grep, names(qccast)) ]
LC_mean HC_mean
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
Another easy way to access the given welltypes:
qccast[,paste(welltypes, '_mean', sep = "")]
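paste0() is shorthand for paste(..., sep = ""), so an equivalent form is:
qccast[, paste0(welltypes, "_mean")]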
