How do I convert hex to numbers keeping the sign (+/-) in R?

I am a first time poster so sorry if the format is not exactly as required.
I have a data frame that looks something like this, with each row containing three columns of hex strings:
id x
1 1 FFF8
2 2 FFBC
3 3 FFAE
4 4 0068
If I understand correctly, "FFF8" should convert to "-8"; however, all I have managed to do is convert it to the positive equivalent, "65528".
I have used:
dataframe$x<-as.numeric(dataframe$x)
I haven't found any R function that can maintain the minus sign, as intended.
Can anyone kindly help with converting the hex strings into a number whilst maintaining the intended minus sign?
Many thanks in advance.

If you're assuming that the high bit indicates a negative number (that is, the strings are 16-bit two's-complement values), then
strtoi(dat$x, base=16)
# [1] 65528 65468 65454 104
dat$x2 <- strtoi(dat$x, base=16)
dat$x3 <- ifelse(bitwAnd(dat$x2, 0x8000) > 0, -0xFFFF-1, 0) + dat$x2
dat
# id x x2 x3
# 1 1 FFF8 65528 -8
# 2 2 FFBC 65468 -68
# 3 3 FFAE 65454 -82
# 4 4 0068 104 104
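The same idea can be wrapped in a small helper. This is only a sketch: hex_to_signed and its bits argument are names made up for illustration, and it assumes the strings are two's-complement values of a fixed width. Because strtoi returns NA for values that do not fit in an R integer, it is only suitable for widths below 32 bits.

```r
# Hypothetical helper: interpret hex strings as signed two's-complement
# integers of a given width (assumed, not stated in the question).
hex_to_signed <- function(hex, bits = 16) {
  u <- strtoi(hex, base = 16L)            # unsigned value
  # values with the top bit set wrap around to the negative range
  ifelse(u >= 2^(bits - 1), u - 2^bits, u)
}

hex_to_signed(c("FFF8", "FFBC", "FFAE", "0068"))
# [1]  -8 -68 -82 104
```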

Related

Creating sublists from one bigger list

I am writing my Thesis in R and I would like, if possible, some help in a problem that I have.
I have a table, which is called tkalp, with 2 columns and 3001 rows and after a 'subset' command that I wrote this table contains now 1084 rows and called kp. Some values of kp are:
As you can see, some values in column V1 follow each other with a step of 2 and some do not.
So my difficulty is:
1. I would like to 'break' this big list/table into smaller lists/tables that contain only continuous numbers. For this difficulty, I tried to implement it with these commands but it didn't go as planned:
for (n in 1:nrow(kp)) {
kp1 <- subset(kp, (kp[n+1,1] - kp[n,1]) == 2)
}
2. After completing this task I would like to keep only the sublists that contain more than 10 rows.
Any idea or help is more than welcome! Thank you very much
EDIT
I have uploaded a picture of my table and I have separated the numbers that I want to be contained in different tables. And I would like to do that for all the original table.
blue is one smaller table than the original
black another
yellow another
red another
And after I create all those smaller tables I would like to keep only the tables that contain more than 10 numbers. For example I don't want to keep the yellow table since it contains only 4 numbers.
Thank you again
What about
df <- data.frame(V1=c(1,3,5,10,12,14, 20, 22), V2=runif(8))
df$diff <- c(2,diff(df$V1))
df$numSubset <- cumsum(df$diff != 2) + 1
iter <- seq(max(df$numSubset))
# requires the purrr and dplyr packages
listOfSubsets <- purrr::map(iter, function(i) dplyr::filter(df, numSubset == i))
Then you loop through the list and select only those you want. Btw purrr also provides a means to filter the list you get without looping. Check the documentation of purrr.
With base R
kp=data.frame(V1=c(seq(8628,8618,by=-2),seq(8576,8566,by=-2),78,76),V2=runif(14))
kp$diffV1=c(-2,diff(kp$V1))
kp$group=cumsum(ifelse(kp$diffV1/-2==1,0,1))+1
lkp=split(kp,kp$group)
# > kp
# V1 V2 diffV1 group
# 1 8628 0.74304325 -2 1
# 2 8626 0.84658101 -2 1
# 3 8624 0.74540089 -2 1
# 4 8622 0.83551473 -2 1
# 5 8620 0.63605222 -2 1
# 6 8618 0.92702915 -2 1
# 7 8576 0.81978587 -42 2
# 8 8574 0.01661538 -2 2
# 9 8572 0.52313859 -2 2
# 10 8570 0.39997951 -2 2
# 11 8568 0.61444445 -2 2
# 12 8566 0.23570017 -2 2
# 13 78 0.58397923 -8488 3
# 14 76 0.03634809 -2 3
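For the second part of the question (keeping only the sublists with more than 10 rows), base R's Filter can be applied to the result of split. A minimal sketch on made-up data, assuming an ascending step of 2 as described in the question; for descending data like the example above, compare the differences with -2 instead.

```r
# Toy data (invented): runs of 12, 4 and 11 values with step 2
kp <- data.frame(V1 = c(seq(100, 122, by = 2),
                        seq(200, 206, by = 2),
                        seq(300, 320, by = 2)))
kp$V2 <- seq_len(nrow(kp))

# A new group starts wherever the gap to the previous row is not 2
kp$group <- cumsum(c(TRUE, diff(kp$V1) != 2))
lkp <- split(kp, kp$group)

# Keep only the sub-tables with more than 10 rows
big <- Filter(function(d) nrow(d) > 10, lkp)
names(big)
# [1] "1" "3"
```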

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID. Instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
With sample data to work with, the following solution provides what I believe you are asking for. It should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
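A loop-free variant in base R is also possible for data of moderate size. The sketch below uses the numbers from the question and compares every time against every interval with outer; the column names Between and Match follow the question's wording, and the bounds are assumed to be inclusive. Note that outer builds a full times-by-intervals matrix, so it may not suit very large inputs.

```r
truth <- data.frame(ID = 1:5,
                    Start = c(213466, 32238, 218934, 222774, 68137),
                    Stop  = c(213468, 32240, 218936, 222776, 68139))
possible <- data.frame(ID = 1:8,
                       Times = c(32239.76, 32241.14, 68138.72, 111233.93,
                                 128395.28, 146180.31, 188433.35, 198714.7))

# inside[i, j] is TRUE when possible$Times[i] lies in truth interval j
inside <- outer(possible$Times, truth$Start, ">=") &
          outer(possible$Times, truth$Stop,  "<=")

possible$Between <- as.integer(rowSums(inside) > 0)  # time falls in any interval
truth$Match      <- as.integer(colSums(inside) > 0)  # interval contains any time
```

For the "within 2 seconds" variant mentioned in the question, the same comparison could be run against truth$Start - 2 and truth$Stop + 2.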

Alternative to for loop and indexing?

I have a large data set of 3 columns, Order, Discharge, Date (numeric). There are 20 years of daily Discharge values for each Order, which can extend beyond 100.
> head(dat)
Order Discharge date
1 0.04712 6574
2 0.05108 6574
3 0.00000 6574
4 0.00000 6574
5 3.54100 6574
6 3.61500 6574
For a given Order x, I would like to replace the Discharge value with the average of the Discharge at x+1 and x-1 for that date. I have been doing this in a crude manner with a for loop and indexing, but it takes over an hour to process. I know there has to be a better way.
x <- 4
for(i in min(dat[,3]):max(dat[,3]))
dat[,2][dat[,3] == i & dat[,1] == x ] <-
mean(c(dat[,2][dat[,3] == i & dat[,1] == x + 1],
dat[,2][dat[,3] == i & dat[,1] == x - 1]))
Gives
> head(dat)
Order Discharge date
1 0.04712 6574
2 0.05108 6574
3 0.00000 6574
4 1.77050 6574
5 3.54100 6574
6 3.61500 6574
Where the Discharge at Order 4, for date 6574 has been replaced with 1.77050. It works, but it's ridiculously slow.
I should specify that I don't need to do this calculation on every Order, but only a select few (only 8 out of a total of 117). Based on the answer, I have the following.
dat$NewDischarge <- by(dat$Discharge,dat$date,function(x)
colMeans(cbind(c(x[-1],NA), x,
c(NA, x[-length(x)])), na.rm=T))
I am trying to figure out a way still to only have the values of the select Orders to be calculated and am stuck in the rut of a for loop and indexing on date and Orders.
I would go about it as follows:
1. Ensure that Order is a factor.
2. For each Order, you now have a sub-problem:
2.1. Sort the sub-data-frame by date.
2.2. Each Discharge-mean can be produced in a vectorised way as:
rowMeans(cbind(c(Discharge[-1], NA), Discharge, c(NA, Discharge[-length(Discharge)])))
3. The sub-problem can be handled with a simple for-loop or the function by. I would prefer by.
4. Your data has been rearranged, but you can easily reorder it.
For point 2.2, imagine it (or try it) with a simple vector and see the effect of the cbind operation. It also forces you to consider the edge cases: how are the first and last Discharge values calculated (there are no preceding or following dates)?
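Trying the shifted-vector idea on a simple vector might look like the sketch below. It uses rowMeans because each row of the cbind result holds one element together with its two neighbours, so the row-wise mean gives one value per element. Using na.rm = TRUE is an assumption about how the two ends should be treated.

```r
x <- c(1, 2, 4, 8)

# Each row: the following value, the value itself, the preceding value
shifted <- cbind(nxt = c(x[-1], NA),
                 cur = x,
                 prv = c(NA, x[-length(x)]))
shifted

# Row-wise mean; the NA at each end is dropped rather than propagated
m <- rowMeans(shifted, na.rm = TRUE)
m
```

Without na.rm = TRUE the first and last results would be NA, which is the edge case the answer asks you to think about.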
There are several ways to solve your particular dilemma, but the basic question to ask when confronted with a slow for loop is, "How do I use vectorization to replace this loop?" (Well, maybe you should ask "Should I...?" first.) In your case, you're looping across dates, but there's no need to explicitly do that, since just grabbing all of the rows where dat$Order==x will implicitly grab all the dates.
The dataset you posted only has one date, but I can generate some fake data to illustrate:
generate.data <- function(n.order, n.date){
dat <- expand.grid(Order=seq_len(n.order), date=seq_len(n.date))
dat$Discharge <- rlnorm(n.order * n.date)
dat[, c("Order", "Discharge", "date")]
}
dat <- generate.data(10, 5)
head(dat)
# Order Discharge date
# 1 1 2.1925563 1
# 2 2 0.4093022 1
# 3 3 2.5525497 1
# 4 4 1.9274013 1
# 5 5 1.1941986 1
# 6 6 1.2407451 1
tail(dat)
# Order Discharge date
# 45 5 1.4344575 5
# 46 6 0.5757580 5
# 47 7 0.4986190 5
# 48 8 1.2076292 5
# 49 9 0.3724899 5
# 50 10 0.8288401 5
Here's all the rows where dat$Order==4, across all dates:
dat[dat$Order==4, ]
# Order Discharge date
# 4 4 1.9274013 1
# 14 4 3.5319072 2
# 24 4 0.2374532 3
# 34 4 0.4549798 4
# 44 4 0.7654059 5
You can just take the Discharge column, and you'll have the left-hand side of your assignment:
dat[dat$Order==4, ]$Discharge
# [1] 1.9274013 3.5319072 0.2374532 0.4549798 0.7654059
Now you just need the right side, which has two components: the x-1 discharges and the x+1 discharges. You can grab these the same way you grabbed the x discharges:
dat[dat$Order==4-1, ]$Discharge
# [1] 2.5525497 1.9143963 0.2800546 8.3627810 7.8577635
dat[dat$Order==4+1, ]$Discharge
# [1] 1.1941986 4.6076114 0.3963693 0.4190957 1.4344575
To obtain the new values, you need the parallel mean. R doesn't have a pmean function, but you can cbind these and take the rowMeans:
rowMeans(cbind(dat[dat$Order==4-1, ]$Discharge, dat[dat$Order==4+1, ]$Discharge))
# [1] 1.8733741 3.2610039 0.3382119 4.3909383 4.6461105
So, in the end you have:
dat[dat$Order==4, ]$Discharge <- rowMeans(cbind(dat[dat$Order==4-1, ]$Discharge,
dat[dat$Order==4+1, ]$Discharge))
You can even use %in% to make this work across all of your x values.
Note that this assumes your data is ordered.
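To handle several selected Orders without relying on row order, the neighbours can be aligned by date with match. This is a sketch only: replace_with_neighbour_mean is a made-up helper name, the data are generated as above, and it assumes every selected Order has both neighbours present for each date.

```r
set.seed(1)
dat <- expand.grid(Order = 1:10, date = 1:5)
dat$Discharge <- rlnorm(nrow(dat))

# Hypothetical helper: for each Order in `orders`, replace Discharge with the
# mean of Orders x - 1 and x + 1 on the same date.
replace_with_neighbour_mean <- function(dat, orders) {
  out <- dat
  for (x in orders) {
    target <- which(dat$Order == x)
    below  <- dat[dat$Order == x - 1, ]
    above  <- dat[dat$Order == x + 1, ]
    # align neighbours to the target rows by date, so sorting does not matter
    lo <- below$Discharge[match(dat$date[target], below$date)]
    hi <- above$Discharge[match(dat$date[target], above$date)]
    out$Discharge[target] <- rowMeans(cbind(lo, hi))
  }
  out
}

res <- replace_with_neighbour_mean(dat, c(4, 7))
```

The loop runs once per selected Order (8 in the question) rather than once per date, and it always reads from the original dat, so adjacent selected Orders would not see each other's replaced values.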

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates for two datasets of points, one with 41 points and another with 110,357,407, and the values in the rows represent the distances between these two sets of points (the distance of each point on list 1 to every single point on list 2). The first column is a list of points (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to change this matrix into just 3 columns, the first column would be similar to the first column of the matrix with the 110,357,407 rows, the second would be the 41 data points (each matched up with a distance each of the first points to all of the others) and the third would be the distance between those points. So it would look something like this
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between the back and all of the first value of pres are complete, pres will change to 2 and will eventually work its way up to 41)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
row = rep(rownames(output3), ncol(output3)),
value = as.vector(output3))
But there won’t be the same number of rows for each column, so I received an error (and I don’t think it would have really worked with my pres column needs). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with in the forum). I also looked into some of the melting and reshaping but I was very confused about the functions and couldn’t figure out how to implement them appropriately (or if they even are appropriate for what I need). I would really appreciate any help on this as I’ve been struggling with it for a long time.
Edit: Just to be a little more clear about what I need. Take these two smaller data sets
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks like this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution though, since you have > 4.2 billion elements in the vector of distances. You can work on subsets of the full dataset at a time to get around this problem.
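As a check on the small example from the question, the same manual approach can be written out with rep instead of expand.grid. A minimal sketch; the data frame below just re-enters the 5 x 3 matrix from the question, and the column names Back, Pres and Dist follow the desired output.

```r
df <- data.frame(Back = 1:5,
                 `1` = c(3427, 3432, 3486, 3449, 3483),
                 `2` = c(3444, 3486, 3479, 3438, 3486),
                 `3` = c(3451, 3476, 3486, 3484, 3486),
                 check.names = FALSE)

n_pres <- ncol(df) - 1
long <- data.frame(
  Back = rep(df$Back, times = n_pres),           # 1..5 repeated per pres point
  Pres = rep(seq_len(n_pres), each = nrow(df)),  # 1,1,1,1,1,2,2,...
  Dist = unlist(df[-1], use.names = FALSE)       # distances, column by column
)
head(long, 7)
```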
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.
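Doing it in pieces could be sketched roughly as follows. write_long_in_chunks, its arguments and the chunk size are all invented for illustration, and it assumes the first column holds the back-point ids and the remaining columns hold the distances. The rows come out chunk by chunk, so all pres values for the first block of back points appear before the next block starts.

```r
# Hypothetical helper: reshape a wide distance table to Back/Pres/Dist
# in row blocks, appending each block to a CSV file.
write_long_in_chunks <- function(df, path, chunk_rows = 1e6) {
  n_pres <- ncol(df) - 1
  for (s in seq(1, nrow(df), by = chunk_rows)) {
    rows <- s:min(s + chunk_rows - 1, nrow(df))
    piece <- data.frame(
      Back = rep(df[rows, 1], times = n_pres),
      Pres = rep(seq_len(n_pres), each = length(rows)),
      Dist = unlist(df[rows, -1], use.names = FALSE)
    )
    # header only for the first block; later blocks are appended
    write.table(piece, path, sep = ",", row.names = FALSE,
                col.names = (s == 1), append = (s != 1))
  }
  invisible(path)
}
```

With a file too large to read in one go, the input itself would also need to be read in blocks; that part is not shown here.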

how to select matrix element in R?

Reading the data the following way
data<-read.csv("userStats.csv", sep=",", header=F)
I tried to select an element at the specific position.
The example of the data (first five rows) is the following (V2 is the date and V3 is the day of week):
V1 V2
1 00002781A2ADA816CDB0D138146BD63323CCDAB2 2010-09-04
2 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-04
3 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-07
4 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-08
5 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-17
V3 V4 V5 V6 V7 V8 V9
1 Saturday 2 2 615 1 1 47
2 Saturday 2 2 77 1 1 43
3 Tuesday 1 3 201 1 1 117
4 Wednesday 1 1 44 1 1 74
5 Friday 1 1 3 1 1 18
I tried to divide 6th column with 9th column in the first row the following way:
data[1,6]/data[1,9]
but it returned an error
[1] NA
Warning message:
In Ops.factor(data[1, 6], data[1, 9]) : / not meaningful for factors
Then I tried to select just one element
> data[2,9]
[1] 43
11685 Levels: 0 1 2 3 ... 55311
but don't know what these Levels are and what causes an error. Does anyone know how to select an element at the specific position data[row, column]?
Thank you!
My favorite tool to check variable class is str().
What you have there is a data frame and at least one of the columns you're trying to work with is a factor. See Dirk's answer on how to change classes of a column.
Command
data[1,6]/data[1,9]
is selecting the value in the first row of sixth column and dividing with the value in first row of the ninth column. Is this what you want? If you want to use values from the entire column (and not just the first row), you would write
data[6] / data[9]
or
data[, 6] / data[, 9]
Both forms divide the two columns element-wise; on a data.frame the first returns a one-column data frame and the second a plain vector.
The standard modeling data structure in R is a data.frame.
The data.frame objects can hold various types: numeric, character, factor, ...
Now, when reading data via read.csv() et al, you can get bitten by the default value of the stringsAsFactors option. I presume that at least one row in your data had text in that column, so R decided to encode the whole column as a factor and presto! you can no longer do direct mathematical operations on it.
In short, do summary(data) and/or a sweep of class() over all the columns. Convert as necessary, or set the stringsAsFactors option to a different value, or both.
Once your data is numeric, you can divide, slice, dice, ... as you please.
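A small sketch of the conversion, on made-up inline data rather than userStats.csv: one stray text entry ("n/a", invented here) turns the whole column into a factor, and going through as.character before as.numeric recovers the numbers. Calling as.numeric directly on a factor would return the internal level codes instead of the values.

```r
csv_text <- "a,615,47\nb,77,43\nc,n/a,117"
data <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = TRUE)
str(data)   # V2 shows up as a factor because of the "n/a" entry

# Convert via character; entries that are not numbers become NA
data$V2 <- suppressWarnings(as.numeric(as.character(data$V2)))

data[1, 2] / data[1, 3]   # now an ordinary numeric division
```

Reading with stringsAsFactors = FALSE would give a character column instead, which can be passed to as.numeric directly.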
