I have a series of numbers in two columns, with the titles "a" and "b".
I want R to change the values in column "b" wherever a value in column "a" differs by more than 10 from a neighboring cell.
For example:
a | b
-----------
1 | 1
2 | 1
3 | 1
4 | 1
21 | 1
22 | 1
23 | 1
24 | 1
... | ...
Then I would like R to change the values in column "b" to
a | b
-----------
1 | 1
2 | 1
3 | 1
4 | 0
21 | 0
22 | 1
23 | 1
24 | 1
... | ...
Because the values 4 and 21 in the a-column differ from each other by more than 10.
Any help would be greatly appreciated.
df <- data.frame(a = c(1:4, 21:24), b = 1)
# check whether differences are greater than 10
diffs <- diff(df$a) > 10
# create `b`
df$b <- as.integer(!(c(FALSE, diffs) | c(diffs, FALSE)))
The result:
a b
1 1 1
2 2 1
3 3 1
4 4 0
5 21 0
6 22 1
7 23 1
8 24 1
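For comparison, the same neighbor-gap logic can be sketched in Python with pandas (a cross-language sketch, not part of the original answer; the data is the example above):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 21, 22, 23, 24], "b": 1})
# gaps[i] is True when a[i] - a[i-1] > 10 (first element is NaN, which compares False)
gaps = df["a"].diff().gt(10)
# a row is flagged if the big gap sits on either side of it
flag = gaps | gaps.shift(-1, fill_value=False)
df["b"] = (~flag).astype(int)
print(df["b"].tolist())  # [1, 1, 1, 0, 0, 1, 1, 1]
```

As in the R answer, `diff` produces one fewer element than the input, so the flag has to be aligned once forward and once backward to mark both rows adjacent to the gap.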
An alternative approach:
df <- data.frame(a = c(1:4, 21:24), b = 1L)
local({
  w10 <- with(df, which(diff(a) > 10))
  df$b[c(w10, w10 + 1)] <<- 0L
})
I am trying to find out how to efficiently compute the minimum of runtime_sec over a subset of the hour column, potentially using an anonymous function. Currently I have a long-winded approach: I create a new dataframe and then join it back to the existing one, but I would like to do this without saving out an intermediate dataframe. I've been looking at the map (purrr) functions but am having trouble understanding them. Apologies in advance if this is confusing; this is my first post on here.
Existing df:
| index | hour | runtime_sec |
|-----: |-----:| -----------:|
| 1 | 6 | 50 |
| 1 | 7 | 100 |
| 1 | 8 | 120 |
| 1 | 9 | 90 |
| 1 | 10 | 100 |
| 1 | 11 | 100 |
| 2 | 10 | 100 |
Current code:
df_min <- df %>%
  group_by(index) %>%
  filter(hour >= 8 & hour < 10) %>%
  summarize(min_ref = min(runtime_sec))
df_join <- df %>%
  left_join(df_min, by = "index")
Desired output:
| index | hour | runtime_sec | min_ref |
|----: |----: | ----: | ----: |
| 1 | 6 | 50 | 90 |
| 1 | 7 | 100 | 90 |
| 1 | 8 | 120 | 90 |
| 1 | 9 | 90 | 90 |
| 1 | 10 | 100 | 90 |
| 1 | 11 | 100 | 90 |
| 2 | 10 | 100 | 100 |
dat %>%
  group_by(index) %>%
  mutate(min_ref = if (any(hour >= 8 & hour < 10)) min(runtime_sec[hour >= 8 & hour < 10]) else NA) %>%
  ungroup()
# # A tibble: 7 x 4
# index hour runtime_sec min_ref
# <int> <int> <int> <int>
# 1 1 6 50 90
# 2 1 7 100 90
# 3 1 8 120 90
# 4 1 9 90 90
# 5 1 10 100 90
# 6 1 11 100 90
# 7 2 10 100 NA
Your expectation of min_ref = 100 for index == 2 contradicts your own rule: for that row the hour is not < 10, so no data meets your condition. If you expect it to match, then you should be using hour <= 10, in which case hour >= 8 & hour <= 10 can be replaced with between(hour, 8, 10).
You can reduce the code slightly if you accept that Inf is a reasonable "minimum" when no values qualify:
dat %>%
  group_by(index) %>%
  mutate(min_ref = suppressWarnings(min(runtime_sec[hour >= 8 & hour < 10]))) %>%
  ungroup()
# # A tibble: 7 x 4
# index hour runtime_sec min_ref
# <int> <int> <int> <dbl>
# 1 1 6 50 90
# 2 1 7 100 90
# 3 1 8 120 90
# 4 1 9 90 90
# 5 1 10 100 90
# 6 1 11 100 90
# 7 2 10 100 Inf
though this just shortens the code a little.
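The same filter-then-broadcast idea can be sketched in pandas as well (a cross-language sketch using the column names from the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "index": [1, 1, 1, 1, 1, 1, 2],
    "hour": [6, 7, 8, 9, 10, 11, 10],
    "runtime_sec": [50, 100, 120, 90, 100, 100, 100],
})
# per-group minimum over the hour window, then broadcast back by mapping
window = df[(df["hour"] >= 8) & (df["hour"] < 10)]
mins = window.groupby("index")["runtime_sec"].min()
df["min_ref"] = df["index"].map(mins)  # groups with no window rows get NaN
```

As with the first R answer, index 2 gets NaN because none of its rows fall in the 8-to-10 window; there is no intermediate join step because `map` broadcasts the group minima directly.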
I have two data.tables like so:
tests
id | test | score
=================
1 | 1 | 90
1 | 2 | 100
2 | 1 | 70
2 | 2 | 80
3 | 1 | 100
3 | 2 | 95
cheaters
id | test | score
=================
1 | 2 | 100
3 | 1 | 100
3 | 2 | 95
Say I now want to include a boolean column in tests to tell whether that particular test was cheated on, so the output would be like this:
tests
id | test | score | cheat
=========================
1 | 1 | 90 | FALSE
1 | 2 | 100 | TRUE
2 | 1 | 70 | FALSE
2 | 2 | 80 | FALSE
3 | 1 | 100 | TRUE
3 | 2 | 95 | TRUE
Is there an easy way to do this? The tables are keyed on id and test.
Create the cheat column with an initial value of FALSE, then join with cheaters and update cheat to TRUE where there is a match:
library(data.table)
setkey(setDT(tests), id, test)
setkey(setDT(cheaters), id, test)
tests[, cheat := FALSE][cheaters, cheat := TRUE]
tests
# id test score cheat
#1: 1 1 90 FALSE
#2: 1 2 100 TRUE
#3: 2 1 70 FALSE
#4: 2 2 80 FALSE
#5: 3 1 100 TRUE
#6: 3 2 95 TRUE
Or without setting the keys, use on parameter to specify the columns to join on:
setDT(tests)
setDT(cheaters)
tests[, cheat := FALSE][cheaters, cheat := TRUE, on = .(id, test)]
tests
# id test score cheat
#1: 1 1 90 FALSE
#2: 1 2 100 TRUE
#3: 2 1 70 FALSE
#4: 2 2 80 FALSE
#5: 3 1 100 TRUE
#6: 3 2 95 TRUE
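For readers coming from pandas, the same flag-by-join idea can be sketched with merge's indicator option (a cross-language sketch built on the example data, not part of the original answer):

```python
import pandas as pd

tests = pd.DataFrame({"id": [1, 1, 2, 2, 3, 3],
                      "test": [1, 2, 1, 2, 1, 2],
                      "score": [90, 100, 70, 80, 100, 95]})
cheaters = pd.DataFrame({"id": [1, 3, 3], "test": [2, 1, 2],
                         "score": [100, 100, 95]})

# left join on the key columns; indicator=True adds a "_merge" column
# that is "both" exactly when the row also appears in cheaters
merged = tests.merge(cheaters[["id", "test"]], on=["id", "test"],
                     how="left", indicator=True)
tests["cheat"] = (merged["_merge"] == "both").to_numpy()
print(tests["cheat"].tolist())  # [False, True, False, False, True, True]
```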
I have the following vector:
x <- c(54.11, 58.09, 60.82, 86.59, 89.92, 91.61,
       95.03, 95.03, 96.77, 98.52, 100.29, 102.07,
       102.07, 107.51, 113.10, 130.70, 130.70, 138.93,
       147.41, 149.57, 153.94, 158.37, 165.13, 201.06,
       208.67, 235.06, 240.53, 251.65, 254.47, 254.47, 333.29)
I want to get the following stem and leaf plot in R:
Stem Leaf
5 4 8
6 0
8 6 9
9 1 5 5 6 8
10 0 2 2 7
11 3
13 0 0 8
14 7 9
15 3 8
16 5
20 1 8
23 5
24 0
25 1 4 4
33 3
However, when I try the stem() function in R, I get the following:
> stem(x)
The decimal point is 2 digit(s) to the right of the |
0 | 566999
1 | 000000011334
1 | 55567
2 | 0144
2 | 555
3 | 3
> stem(x, scale = 2)
The decimal point is 1 digit(s) to the right of the |
4 | 48
6 | 1
8 | 7025579
10 | 02283
12 | 119
14 | 7048
16 | 5
18 |
20 | 19
22 | 5
24 | 1244
26 |
28 |
30 |
32 | 3
Question: Am I missing an argument in the stem() function? If not, is there another solution?
I believe what you want is a little non-standard: a stem-and-leaf plot should have equally spaced numbers/digits on its left, and you're asking for irregularly spaced ones. I understand your frustration that 54 and 58 are grouped within the 40s, but the stem-and-leaf graph is really just a textual representation of a horizontal histogram, and the numbers on the side reflect "bins" that will often begin/end outside the known data. Think of stem(x, scale = 2)'s left-hand numbers as the bins 40-59, 60-79, etc.
You probably already tried this, but
stem(x, scale=3)
# The decimal point is 1 digit(s) to the right of the |
# 5 | 48
# 6 | 1
# 7 |
# 8 | 7
# 9 | 025579
# 10 | 0228
# 11 | 3
# 12 |
# 13 | 119
# 14 | 7
# 15 | 048
# 16 | 5
# 17 |
# 18 |
# 19 |
# 20 | 19
# 21 |
# 22 |
# 23 | 5
# 24 | 1
# 25 | 244
# 26 |
# 27 |
# 28 |
# 29 |
# 30 |
# 31 |
# 32 |
# 33 | 3
This is a good start, and is "proper" in that the bins are equally sized.
If you must remove the empty rows (which to me are still statistically significant, relevant, and informative), then, because stem's default is to print to the console, you'll need to capture the console output (this might cause problems in rmarkdown docs), filter out the empty rows, and re-cat the rest to the console:
cat(Filter(function(s) grepl("decimal|\\|.*[0-9]", s),
           capture.output(stem(x, scale = 3))),
    sep = "\n")
# The decimal point is 1 digit(s) to the right of the |
# 5 | 48
# 6 | 1
# 8 | 7
# 9 | 025579
# 10 | 0228
# 11 | 3
# 13 | 119
# 14 | 7
# 15 | 048
# 16 | 5
# 20 | 19
# 23 | 5
# 24 | 1
# 25 | 244
# 33 | 3
(My grepl regex could likely be improved to handle something akin to "if there is a pipe, then it must be followed by one or more digits", but I think this suffices for now.)
There are some discrepancies: you want 6 | 0, but your 60.82 rounds to 61 (hence the "1"). If you really want 60.82 to appear as 6 | 0, then truncate first with stem(trunc(x), scale = 3). It's not exact, but I'm guessing your sample output is hand-jammed.
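If you do want the irregular, gap-free display exactly as in your hand-made table, it's easy to build directly; here's a small Python sketch (not using R's stem() at all) that truncates each value so 60.82 lands in 6 | 0:

```python
from collections import defaultdict

x = [54.11, 58.09, 60.82, 86.59, 89.92, 91.61, 95.03, 95.03, 96.77,
     98.52, 100.29, 102.07, 102.07, 107.51, 113.10, 130.70, 130.70,
     138.93, 147.41, 149.57, 153.94, 158.37, 165.13, 201.06, 208.67,
     235.06, 240.53, 251.65, 254.47, 254.47, 333.29]

stems = defaultdict(list)
for v in x:
    t = int(v)            # truncate, mirroring stem(trunc(x))
    stems[t // 10].append(t % 10)
for s in sorted(stems):   # only stems that actually occur are printed
    print(s, "|", " ".join(str(d) for d in sorted(stems[s])))
```

The output starts with `5 | 4 8` and ends with `33 | 3`, with no empty rows in between, matching the desired plot; the trade-off, as noted above, is that the stems are no longer equally spaced.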
I have an H2O frame R object like this
h2odf
A | B | C | D
--|---|---|---
1 | NA| 2 | 0
2 | 1 | 2 | 0
3 | NA| 2 | 0
4 | 3 | 2 | 0
I want to remove all those rows where B is NA (1st and 3rd row). I have tried
na <- is.na(h2odf[, "B"])
h2odf <- h2odf[!na,]
and
h2odf <- h2odf[!is.na(h2odf$B),]
and
h2odf <- subset(h2odf, B!=NA)
This works for R Dataframe but not H2O. Giving this error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
DistributedException from localhost/127.0.0.1:54321: 'Cannot set illegal UUID value'
Desired output is
h2odf
A | B | C | D
--|---|---|---
2 | 1 | 2 | 0
4 | 3 | 2 | 0
One option I have is to convert it into an R dataframe, remove the rows, and convert it back to an H2O frame. But that is taking a long time because the input file is close to 4.5 GB. Is it possible to do this in the H2O frame hex object itself?
I am running RStudio on an AWS cluster.
> class(h2odf)
[1] "H2OFrame"
> h2odf
A B C D
1 1 NA 2 0
2 2 1 2 0
3 3 NA 2 0
4 4 3 2 0
[4 rows x 4 columns]
> h2odf[!is.na(as.numeric(as.character(h2odf$B))),]
A B C D
1 2 1 2 0
2 4 3 2 0
[2 rows x 4 columns]
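For reference, the same NA-row filter expressed on a plain pandas DataFrame looks like this (a sketch of the logic only; it does not address the H2O-specific coercion used in the answer above):

```python
import pandas as pd

h2odf = pd.DataFrame({"A": [1, 2, 3, 4],
                      "B": [None, 1, None, 3],
                      "C": [2, 2, 2, 2],
                      "D": [0, 0, 0, 0]})
# keep only the rows where B is not missing
out = h2odf[h2odf["B"].notna()]
print(out)
```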
I have a dataset with SKU IDs and their counts. I need to feed this data into a machine learning algorithm in a way that SKU IDs become columns and COUNTs sit at the intersection of transaction ID and SKU ID. Can anyone suggest how to achieve this transformation?
CURRENT DATA
TransID SKUID COUNT
1 31 1
1 32 2
1 33 1
2 31 2
2 34 -1
DESIRED DATA
TransID 31 32 33 34
1 1 2 1 0
2 2 0 0 -1
In R, we can use either xtabs
xtabs(COUNT~., df1)
# SKUID
#TransID 31 32 33 34
# 1 1 2 1 0
# 2 2 0 0 -1
Or dcast
library(reshape2)
dcast(df1, TransID~SKUID, value.var="COUNT", fill=0)
# TransID 31 32 33 34
#1 1 1 2 1 0
#2 2 2 0 0 -1
Or spread
library(tidyr)
spread(df1, SKUID, COUNT, fill=0)
In Pandas, you can use pivot:
>>> df.pivot('TransID', 'SKUID').fillna(0)
COUNT
SKUID 31 32 33 34
TransID
1 1 2 1 0
2 2 0 0 -1
To avoid ambiguity, it is best to explicitly label your variables:
df.pivot(index='TransID', columns='SKUID').fillna(0)
You can also perform a groupby and then unstack SKUID:
>>> df.groupby(['TransID', 'SKUID']).COUNT.sum().unstack('SKUID').fillna(0)
SKUID 31 32 33 34
TransID
1 1 2 1 0
2 2 0 0 -1
In GraphLab/SFrame, the relevant commands are unstack and unpack.
import sframe  # or import graphlab

sf = sframe.SFrame({'TransID': [1, 1, 1, 2, 2],
                    'SKUID': [31, 32, 33, 31, 34],
                    'COUNT': [1, 2, 1, 2, -1]})
sf2 = sf.unstack(['SKUID', 'COUNT'], new_column_name='dict_counts')
out = sf2.unpack('dict_counts', column_name_prefix='')
The missing values can be filled by column:
for c in out.column_names():
    out[c] = out[c].fillna(0)
out.print_rows()
+---------+----+----+----+----+
| TransID | 31 | 32 | 33 | 34 |
+---------+----+----+----+----+
| 1 | 1 | 2 | 1 | 0 |
| 2 | 2 | 0 | 0 | -1 |
+---------+----+----+----+----+
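One caveat on the pandas route: pivot raises an error if a (TransID, SKUID) pair appears more than once. pivot_table with an explicit aggregator is the more general form (a sketch of the same reshape on the example data):

```python
import pandas as pd

df = pd.DataFrame({"TransID": [1, 1, 1, 2, 2],
                   "SKUID": [31, 32, 33, 31, 34],
                   "COUNT": [1, 2, 1, 2, -1]})
# duplicate (TransID, SKUID) pairs are summed instead of raising,
# and missing intersections are filled with 0
wide = df.pivot_table(index="TransID", columns="SKUID",
                      values="COUNT", aggfunc="sum", fill_value=0)
print(wide)
```

This mirrors the groupby-sum-unstack chain shown above in a single call.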