How to convert DataFrame's DateTime element to Int64 milliseconds in Julia?

using TimeSeries, DataFrames
s="DateTime,Open,High,Low,Close,Volume
2020/01/05 16:14:01,20,23,19,20,30
2020/01/05 16:14:11,23,27,19,22,20
2020/01/05 17:14:01,24,28,19,23,10
2020/01/05 18:14:01,25,29,20,24,40
2020/01/06 08:02:01,26,30,22,25,50"
ta=readtimearray(IOBuffer(s),format="yyyy/mm/dd HH:MM:SS")
df = DataFrame(ta)
df.ms = Dates.millisecond.(df.timestamp)
df
The output is strange: every ms value is zero. Why?
5×7 DataFrame
│ Row │ timestamp │ Open │ High │ Low │ Close │ Volume │ ms │
│ │ DateTime │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Int64 │
├─────┼─────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┤
│ 1 │ 2020-01-05T16:14:01 │ 20.0 │ 23.0 │ 19.0 │ 20.0 │ 30.0 │ 0 │
│ 2 │ 2020-01-05T16:14:11 │ 23.0 │ 27.0 │ 19.0 │ 22.0 │ 20.0 │ 0 │
│ 3 │ 2020-01-05T17:14:01 │ 24.0 │ 28.0 │ 19.0 │ 23.0 │ 10.0 │ 0 │
│ 4 │ 2020-01-05T18:14:01 │ 25.0 │ 29.0 │ 20.0 │ 24.0 │ 40.0 │ 0 │
│ 5 │ 2020-01-06T08:02:01 │ 26.0 │ 30.0 │ 22.0 │ 25.0 │ 50.0 │ 0 │

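df.ms = Dates.value.(df.timestamp)
Dates.millisecond returns only the millisecond part of a DateTime, which is 0 for every timestamp here because they all fall on whole seconds.
Note that Dates.value counts milliseconds from Julia's own epoch (0000-12-31T00:00:00) rather than from the standard Unix epoch. One way to get Unix time in seconds would be Int.(Dates.datetime2unix.(Dates.DateTime.(df.timestamp))); for Unix milliseconds, Dates.value.(df.timestamp) .- Dates.UNIXEPOCH should work.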

Use Dates.value.(df.timestamp). As you have a vector of DateTime values, it will give you the number of milliseconds. If you had a Date object (date only, without time), you would get a number of days via Dates.value.
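For readers who want to check the arithmetic behind the two epochs outside Julia, here is a small Python sketch. It assumes that Dates.value of a DateTime counts milliseconds on the Rata Die day scale, on which 1970-01-01T00:00:00 corresponds to 62135683200000 (the value of Julia's Dates.UNIXEPOCH constant). The helper names rata_die_ms and unix_ms are hypothetical and exist only for this illustration.

```python
from datetime import datetime

MS_PER_DAY = 86_400_000
# Rata Die day number of 1970-01-01; Python's toordinal() uses the same day count.
UNIX_EPOCH_MS = datetime(1970, 1, 1).toordinal() * MS_PER_DAY


def rata_die_ms(dt):
    """Milliseconds on the Rata Die scale (assumed to match Julia's Dates.value)."""
    midnight = datetime(dt.year, dt.month, dt.day)
    since_midnight = dt - midnight
    ms_today = since_midnight.seconds * 1000 + since_midnight.microseconds // 1000
    return dt.toordinal() * MS_PER_DAY + ms_today


def unix_ms(dt):
    """Milliseconds since the Unix epoch, obtained by shifting the Rata Die value."""
    return rata_die_ms(dt) - UNIX_EPOCH_MS


stamp = datetime(2020, 1, 5, 16, 14, 1)  # first timestamp from the question
print(rata_die_ms(stamp))
print(unix_ms(stamp))
```

The only difference between the two numbers is the constant offset, so converting a whole column is a single subtraction.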

Related

What can R do about a messy data format?

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.
I will post the dataset example here just in case the question is deleted.
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
As you can see this is not the right way to post data. As a user wrote in a comment,
It must've taken a bit of time to format the data the way you're
showing it here. Unfortunately this is not a good format for us to
copy & paste.
I believe this says it all. The asker is well intentioned and spent some time and effort trying to be nice, but the result is not good.
What can R code do to make that table usable, if anything? Will it take a great deal of trouble?

Using data.table::fread:
x = '
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
'
fread(gsub('[\\+-]+\\n', '', x), drop = c(1,7))
# Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01 A A1 0 0
# 2: 2018-06-03 A A2 0 1
# 3: 2018-06-03 A A3 0 2
# 4: 2018-06-03 A A4 1 1
# 5: 2018-06-03 A A5 2 1
# 6: 2018-06-04 A A6 0 3
# 7: 2018-06-01 B B1 0 1
# 8: 2018-06-02 B B2 0 2
# 9: 2018-06-03 B B3 0 3
The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.
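The same two steps (discard the horizontal rules, then discard the empty fields produced by the delimiters at the line ends) can be sketched in Python for comparison. This is a minimal illustration using only the standard library, not a port of fread; the function name parse_ascii_table is hypothetical and the sample table is shortened.

```python
def parse_ascii_table(text):
    """Parse a '+---+' bordered, '|' delimited table into a list of row dicts."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        # Step 1: skip horizontal rules such as +------+------+
        if not line or set(line) <= set("+-"):
            continue
        # Step 2: strip the leading and trailing '|' so no empty fields appear
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)
    header, *body = rows
    return [dict(zip(header, cells)) for cells in body]


table = """
+------------+------+------+----------+
| Date       | Emp1 | Case | Priority |
+------------+------+------+----------+
| 2018-06-01 | A    | A1   | 0        |
| 2018-06-03 | A    | A2   | 0        |
+------------+------+------+----------+
"""
records = parse_ascii_table(table)
print(records[0])
```

All values come back as strings; converting columns to dates or integers is a separate step, just as in the R answers.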

The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.
The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.
Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.
dat <- read.table(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)
But as you can see there are some issues with the result.
dat
X Date Emp1 Case Priority PriorityCountinLast7days X.1
1 NA 2018-06-01 A A1 0 0 NA
2 NA 2018-06-03 A A2 0 1 NA
3 NA 2018-06-03 A A3 0 2 NA
4 NA 2018-06-03 A A4 1 1 NA
5 NA 2018-06-03 A A5 2 1 NA
6 NA 2018-06-04 A A6 0 3 NA
7 NA 2018-06-01 B B1 0 1 NA
8 NA 2018-06-02 B B2 0 2 NA
9 NA 2018-06-03 B B3 0 3 NA
Because separators start and end each data row, R believes they mark extra columns, which is not what the author of the original question meant.
So the second step is to keep only the real columns. I will do this by subsetting the columns by number, which is easy because the extra columns are usually the first and the last.
dat <- dat[-c(1, ncol(dat))]
dat
Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-03 A A3 0 2
4 2018-06-03 A A4 1 1
5 2018-06-03 A A5 2 1
6 2018-06-04 A A6 0 3
7 2018-06-01 B B1 0 1
8 2018-06-02 B B2 0 2
9 2018-06-03 B B3 0 3
That wasn't too hard, much better.
In this case one problem remains: column Date must be coerced to class Date.
dat$Date <- as.Date(dat$Date)
And the result is satisfactory.
str(dat)
'data.frame': 9 obs. of 5 variables:
$ Date : Date, format: "2018-06-01" "2018-06-03" ...
$ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
$ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
$ Priority : int 0 0 0 1 2 0 0 0 0
$ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3
Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.
The whole process took only 3 lines of base R code.
Finally, the end result in dput format, like it should be in the first place.
dat <-
structure(list(Date = structure(c(17683, 17685, 17685, 17685,
17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A",
"A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2",
"A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0,
0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1,
1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")

The issue isn't so much how many lines of code it takes, two or five, not much difference. The question is more whether it will work beyond the example you posted here.
I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.
I've since come across a couple more cases and added them to the test suite.
I've also included a table drawn using box-drawing characters. You don't come across this much these days, but for completeness' sake it's here.
x1 <- "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
"
x2 <- "
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Date | Emp1 | Case | Priority | PriorityCountinLast7days
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
2018-06-01 | A | A|1 | 0 | 0
2018-06-03 | A | A|2 | 0 | 1
2018-06-02 | B | B|2 | 0 | 2
2018-06-03 | B | B|3 | 0 | 3
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
"
x3 <- "
Maths | English | Science | History | Class
0.1 | 0.2 | 0.3 | 0.2 | Y2
0.9 | 0.5 | 0.7 | 0.4 | Y1
0.2 | 0.4 | 0.6 | 0.2 | Y2
0.9 | 0.5 | 0.2 | 0.7 | Y1
"
x4 <- "
Season | Team | W | AHWO
-------------------------------------
1 | 2017/2018 | TeamA | 2 | 1.75
2 | 2017/2018 | TeamB | 1 | 1.85
3 | 2017/2018 | TeamC | 1 | 1.70
4 | 2016/2017 | TeamA | 1 | 1.49
5 | 2016/2017 | TeamB | 3 | 1.51
6 | 2016/2017 | TeamC | 2 | N/A
"
x5 <- "
A B C
┌───┬───┬───┐
A │ 5 │ 1 │ 3 │
├───┼───┼───┤
B │ 2 │ 5 │ 3 │
├───┼───┼───┤
C │ 3 │ 4 │ 4 │
└───┴───┴───┘
"
x6 <- "
------------------------------------------------------------
|date |Material |Description |
|----------------------------------------------------------|
|10/04/2013 |WM.5597394 |PNEUMATIC |
|11/07/2013 |GB.D040790 |RING |
------------------------------------------------------------
------------------------------------------------------------
|date |Material |Description |
|----------------------------------------------------------|
|08/06/2013 |WM.4M01004A05 |TOUCHEUR |
|08/06/2013 |WM.4M010108-1 |LEVER |
------------------------------------------------------------
"
My go at a function
f <- function(x=x6, header=TRUE, rem.dup.header=header,
na.strings=c("NA", "N/A"), stringsAsFactors=FALSE, ...) {
# read each row as a character string
x <- scan(text=x, what="character", sep="\n", quiet=TRUE)
# keep only lines containing alphanumerics
x <- x[grep("[[:alnum:]]", x)]
# remove vertical bars with trailing or leading space
x <- gsub("\\|? | \\|?", " ", x)
# remove vertical bars at beginning and end of string
x <- gsub("\\|?$|^\\|?", "", x)
# remove vertical box-drawing characters
x <- gsub("\U2502|\U2503|\U2505|\U2507|\U250A|\U250B", " ", x)
if (rem.dup.header) {
dup.header <- x == x[1]
dup.header[1] <- FALSE
x <- x[!dup.header]
}
# read the result as a table
read.table(text=paste(x, collapse="\n"), header=header,
na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...)
}
lapply(c(x1, x2, x3, x4, x5, x6), f)
Output
[[1]]
Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-02 B B2 0 2
4 2018-06-03 B B3 0 3
[[2]]
Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A|1 0 0
2 2018-06-03 A A|2 0 1
3 2018-06-02 B B|2 0 2
4 2018-06-03 B B|3 0 3
[[3]]
Maths English Science History Class
1 0.1 0.2 0.3 0.2 Y2
2 0.9 0.5 0.7 0.4 Y1
3 0.2 0.4 0.6 0.2 Y2
4 0.9 0.5 0.2 0.7 Y1
[[4]]
Season Team W AHWO
1 2017/2018 TeamA 2 1.75
2 2017/2018 TeamB 1 1.85
3 2017/2018 TeamC 1 1.70
4 2016/2017 TeamA 1 1.49
5 2016/2017 TeamB 3 1.51
6 2016/2017 TeamC 2 NA
[[5]]
A B C
A 5 1 3
B 2 5 3
C 3 4 4
[[6]]
date Material Description
1 10/04/2013 WM.5597394 PNEUMATIC
2 11/07/2013 GB.D040790 RING
3 08/06/2013 WM.4M01004A05 TOUCHEUR
4 08/06/2013 WM.4M010108-1 LEVER
x3 is from here (will have to look at the edit history).
x4 is from here
x6 is from here

md_table <- scan(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+",
what = "", sep = "", comment.char = "+", quiet = TRUE)
## it is clear that there are 5 columns
mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5]
# [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
# [2,] "2018-06-01" "A" "A1" "0" "0"
# [3,] "2018-06-03" "A" "A2" "0" "1"
# [4,] "2018-06-03" "A" "A3" "0" "2"
# [5,] "2018-06-03" "A" "A4" "1" "1"
# [6,] "2018-06-03" "A" "A5" "2" "1"
# [7,] "2018-06-04" "A" "A6" "0" "3"
# [8,] "2018-06-01" "B" "B1" "0" "1"
# [9,] "2018-06-02" "B" "B2" "0" "2"
#[10,] "2018-06-03" "B" "B3" "0" "3"
## a data frame with all character columns
dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
# Date Emp1 Case Priority PriorityCountinLast7days
#1 2018-06-01 A A1 0 0
#2 2018-06-03 A A2 0 1
#3 2018-06-03 A A3 0 2
#4 2018-06-03 A A4 1 1
#5 2018-06-03 A A5 2 1
#6 2018-06-04 A A6 0 3
#7 2018-06-01 B B1 0 1
#8 2018-06-02 B B2 0 2
#9 2018-06-03 B B3 0 3
## or maybe just use `type.convert` on some columns?
dat[] <- lapply(dat, type.convert)

Well, about this specific dataset I used the import feature in RStudio, but I took one additional step beforehand.
Copy the dataset into a Notepad file.
Replace all | characters with ,
Import the Notepad file into RStudio using read.csv (columns separated by ,).
But if you mean having R handle it fully in one step, then I have no idea.

As was suggested, you could use dput to save the content of a data frame to a file, open the file in a text editor and paste its content. An example with the mtcars dataset, limited to the first 10 rows:
dput(head(mtcars, 10), file = 'reproducible.txt')
The content of reproducible.txt can be used to make a data frame/tibble as shown below. In this format the data is machine readable, but it is hard for a human to understand at first glance (without pasting it into R).
df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")

Change values in a data set in Julia

I am converting a function in R to Julia, but I do not know how to convert the following R code:
x[x==0]=4
Basically, x contains rows of numbers, but whenever there is a 0, I need to change it to a 4. The data set x comes from a binomial distribution. Can someone help me define the above code in Julia?

Use .== (broadcasted ==); see "Dot Syntax for Vectorizing Functions" in the Julia manual.
With vector:
julia> x = round.(Int, rand(5)) # notice how round is also broadcasted here
5-element Array{Int64,1}:
0
0
1
0
1
julia> x .== 0
5-element BitArray{1}:
true
true
false
true
false
julia> x[x .== 0] = 4
4
julia> x
5-element Array{Int64,1}:
4
4
1
4
1
With matrix:
julia> y = round.(Int, rand(5, 5))
5×5 Array{Int64,2}:
0 1 1 0 0
1 0 1 1 1
0 0 0 0 1
1 1 0 0 0
0 1 0 1 1
julia> y[y .== 0] = 4
4
julia> y
5×5 Array{Int64,2}:
4 1 1 4 4
1 4 1 1 1
4 4 4 4 1
1 1 4 4 4
4 1 4 1 1
With dataframe:
julia> using DataFrames
julia> df = DataFrame(x = round.(Int, rand(5)), y = round.(Int, rand(5)))
5×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 0 │ 0 │
│ 2 │ 0 │ 1 │
│ 3 │ 0 │ 0 │
│ 4 │ 0 │ 1 │
│ 5 │ 1 │ 0 │
julia> df[:x][df[:x] .== 0] = 4
4
julia> df
5×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 4 │ 0 │
│ 2 │ 4 │ 1 │
│ 3 │ 4 │ 0 │
│ 4 │ 4 │ 1 │
│ 5 │ 1 │ 0 │

The simplest solution is to use the replace! function:
replace!(x, 0=>4)
Use replace(x, 0=>4) (without the !) to do the same thing, but creating a copy of the vector.
Note that these functions only exist in Julia 0.7 and later!

Two small issues too long for a comment are:
In Julia 0.7 you should write x[x .== 0] .= 4 (using a second dot in assignment also)
In general it is faster to use e.g. foreach or a loop than to allocate a vector with x .== 0, e.g.:
julia> using BenchmarkTools
julia> x = rand(1:4, 10^8);
julia> function f1(x)
x[x .== 4] .= 0
end
f1 (generic function with 1 method)
julia> function f2(x)
foreach(i -> x[i] == 0 && (x[i] = 4), eachindex(x))
end
f2 (generic function with 1 method)
julia> @benchmark f1($x)
BenchmarkTools.Trial:
memory estimate: 11.93 MiB
allocs estimate: 10
--------------
minimum time: 137.889 ms (0.00% GC)
median time: 142.335 ms (0.00% GC)
mean time: 143.145 ms (1.08% GC)
maximum time: 160.591 ms (0.00% GC)
--------------
samples: 35
evals/sample: 1
julia> @benchmark f2($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 86.904 ms (0.00% GC)
median time: 87.916 ms (0.00% GC)
mean time: 88.504 ms (0.00% GC)
maximum time: 91.289 ms (0.00% GC)
--------------
samples: 57
evals/sample: 1

Convert "2016-08-28T17:12:41.795Z" to "2016-08-28 17:12:41.795"

I have a data frame with date and time column
data published_at
0 0.0 2015-11-05T12:55:34.685Z
1 0.0 2015-11-05T12:55:44.695Z
2 0.0 2015-11-05T12:56:25.328Z
3 0.0 2015-11-05T12:56:35.333Z
4 0.0 2015-11-05T12:56:45.332Z
I wanted to convert into the following format
data published_at
0 0.0 2015-11-05 12:55:34.685
1 0.0 2015-11-05 12:55:44.695
2 0.0 2015-11-05 12:56:25.328
3 0.0 2015-11-05 12:56:35.333
4 0.0 2015-11-05 12:56:45.332

This will replace T with a space and remove Z if your data frame is called df:
# apply both substitutions and store the result back in the column
df$published_at <- gsub("Z", "", gsub("T", " ", df$published_at))
Note that the results are just strings, not dates/timestamps.
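If real timestamps are wanted rather than edited strings, the value can be parsed first and then printed in the target layout. Below is a minimal Python sketch of that idea using only the standard library; the function name reformat_timestamp is hypothetical, and the sketch assumes every value has the exact shape shown in the question (fractional seconds followed by a literal Z).

```python
from datetime import datetime


def reformat_timestamp(stamp):
    """Turn '2015-11-05T12:55:34.685Z' into '2015-11-05 12:55:34.685'."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    # %f prints six digits (microseconds); keep only the first three (milliseconds).
    return parsed.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


published_at = ["2015-11-05T12:55:34.685Z", "2015-11-05T12:55:44.695Z"]
print([reformat_timestamp(s) for s in published_at])
```

Parsing also validates the input: a malformed value raises an error instead of passing through silently, which plain string substitution would not catch.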

How get a matrix from two columns of pair data in r? The diagonal is zero and keep lower-diagonal part?

I have a data frame that looks like the following:
ID_r ID_c SCORE
A1 A2 0.2
A1 A3 0.2
A1 A4 0.3
A1 A5 0.2
A1 A6 0.2
A2 A3 0.6
A2 A4 0.2
A2 A5 0.2
A2 A6 0.2
A3 A4 0.2
A3 A5 0.2
A3 A6 0.2
A4 A5 0.2
A4 A6 0.9
A5 A6 0.2
ID_r<-c('A1','A1','A1','A1','A1','A2','A2','A2','A2','A3','A3','A3','A4','A4','A5')
ID_c<-c('A2','A3','A4','A5','A6','A3','A4','A5','A6','A4','A5','A6','A5','A6','A6')
SCORE<-c(0.2,0.2,0.3,0.2,0.2,0.6,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.9,0.2)
I want to use the two ID columns to generate a matrix
like the following (keep only the lower-triangular part, with zeros on the diagonal). I want to export this matrix to a CSV file to be used by other software.
A1 A2 A3 A4 A5 A6
A1 0.0 . . . . .
A2 0.2 0.0 . . . .
A3 0.2 0.6 0.0 . . .
A4 0.3 0.2 0.2 0.0 . .
A5 0.2 0.2 0.2 0.2 0.0 .
A6 0.2 0.2 0.2 0.9 0.2 0.0
Thanks in advance.

You can do something like this.
library(dplyr); library(tidyr)
df$ID_r <- as.character(df$ID_r)
df$ID_c <- as.character(df$ID_c)
ID <- unique(c(df$ID_r, df$ID_c))
diagDf <- data.frame(ID_r = ID, ID_c = ID, SCORE = "0.0")
newDf <- rbind(df, diagDf) %>% arrange(ID_r, ID_c)
resultDf <- spread(newDf, ID_r, SCORE, fill = ".")
names(resultDf)[1] <- ""
resultDf
A1 A2 A3 A4 A5 A6
1 A1 0.0 . . . . .
2 A2 0.2 0.0 . . . .
3 A3 0.2 0.6 0.0 . . .
4 A4 0.3 0.2 0.2 0.0 . .
5 A5 0.2 0.2 0.2 0.2 0.0 .
6 A6 0.2 0.2 0.2 0.9 0.2 0.0
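The reshaping logic can also be written out by hand, which makes the three cases (diagonal, lower triangle, upper triangle) explicit. The Python sketch below is an illustration under assumptions, not the answer's method: it uses a shortened three-ID sample, assumes each pair is stored once with the smaller ID first as in the question, and sorts the IDs as plain strings.

```python
import csv
import io

pairs = [("A1", "A2", 0.2), ("A1", "A3", 0.2), ("A2", "A3", 0.6)]

# Collect the IDs in sorted order; they label both rows and columns.
ids = sorted({name for r, c, _ in pairs for name in (r, c)})
score = {(r, c): s for r, c, s in pairs}

rows = [[""] + ids]
for i, row_id in enumerate(ids):
    cells = []
    for j, col_id in enumerate(ids):
        if i == j:
            cells.append("0.0")  # diagonal
        elif j < i:
            # lower triangle: the pair is stored as (smaller ID, larger ID)
            cells.append(str(score[(col_id, row_id)]))
        else:
            cells.append(".")  # upper triangle is left as a placeholder
    rows.append([row_id] + cells)

# Write to an in-memory buffer; a real export would open a file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

With IDs such as A10, string sorting would place A10 before A2, so a custom sort key would be needed in that case.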

A loop for subsetting with which in data frame for another dataframe?

I searched many posts, such as "Subsetting a dataframe in R by multiple conditions", about using the "which" function to select values from a data frame, but could not find a solution. The problem is as follows:
I have the following data set with thousands of cases:
> head(Datos)
tipo estacion hora usos
1 hábil A.SANIN X4 11
2 hábil ALAMOS X4 4
3 hábil AMANECER X4 45
4 hábil AMERICAS X4 2
5 hábil ATANASIO X4 10
6 hábil BELALCAZAR X4 5
. . . .
. . . .
. . . .
The variable to subset in the data frame above is "usos". The variable "tipo" takes the values "hábil", "Sábado" and "Festivo". The variable "estacion" has 60 levels and the variable "hora" has 22 values: X4, X5, X6, ..., X23. Since I need the quartiles for every combination of "tipo", "estacion" and "hora", I used the "aggregate" function to calculate the critical values and got this:
> head(todo)
Group.1 Group.2 Group.3 y1 y2
1 hábil X4 A.SANIN 1.5 21.5
2 Sábado X4 A.SANIN 4.0 12.0
3 Festivo X4 A.SANIN 0.0 0.0
4 hábil X5 A.SANIN 66.0 130.0
5 Sábado X5 A.SANIN 40.0 96.0
6 Festivo X5 A.SANIN 7.5 43.5
. .
. .
. .
Each row is a different case, and y1 and y2 are my critical values. For each row of the data frame "todo", I need to select the values of "usos" in the data frame "Datos" that are less than y1 or greater than y2. This has to be done in a loop, since "todo" has 3480 combinations, that is, 3480 rows, and the results should be stored in another matrix.
For example, for the first case is as follows:
print(which(subset(Datos$usos,Datos$tipo=="hábil"&Datos$hora=="X4"&Datos$estacion=="A.SANIN")<todo$y1[1] | subset(Datos$usos,Datos$tipo=="hábil"&Datos$hora=="X4"&Datos$estacion=="A.SANIN")>todo$y2[1]))
I need to do that for all rows of the data frame "todo" and apply it to "usos" in the data frame "Datos".
THANK YOU!

I had a little trouble understanding what you are saying, but I think this is what you want. First, merge todo with Datos:
# Rename the columns of todo to match Datos
names(todo)<-c('tipo','hora','estacion','y1','y2')
# Merge the two.
Datos.y.todo<-merge(Datos,todo)
Now, you can easily subset based on your criteria:
with(Datos.y.todo, Datos.y.todo[usos<y1 | usos>y2, ])
I believe the above answers your question. Let me illustrate with your data.
# Load in your data.
Datos<-read.table(textConnection('tipo estacion hora usos
1 hábil A.SANIN X4 11
2 hábil ALAMOS X4 4
3 hábil AMANECER X4 45
4 hábil AMERICAS X4 2
5 hábil ATANASIO X4 10
6 hábil BELALCAZAR X4 5'),header=TRUE)
todo<-read.table(textConnection('Group.1 Group.2 Group.3 y1 y2
1 hábil X4 A.SANIN 1.5 21.5
2 Sábado X4 A.SANIN 4.0 12.0
3 Festivo X4 A.SANIN 0.0 0.0
4 hábil X5 A.SANIN 66.0 130.0
5 Sábado X5 A.SANIN 40.0 96.0
6 Festivo X5 A.SANIN 7.5 43.5'),header=TRUE)
As you mentioned, Datos contains many of the same "cases" repeatedly, so let's add two rows to Datos to make the example clearer:
# usos must stay numeric so that the comparisons with y1 and y2 work
Datos<-rbind(Datos,data.frame(tipo=c('hábil','Sábado'),estacion='A.SANIN',hora='X4',usos=c(23,3)))
# tipo estacion hora usos
# 1 hábil A.SANIN X4 11 # This one is neither below y1 nor above y2
# 2 hábil ALAMOS X4 4
# 3 hábil AMANECER X4 45
# 4 hábil AMERICAS X4 2
# 5 hábil ATANASIO X4 10
# 6 hábil BELALCAZAR X4 5
# 7 hábil A.SANIN X4 23 # This one is above y2 (21.5), so we want to find it.
# 8 Sábado A.SANIN X4 3 # This one is below y1 (4.0), so we want to find it.
Now run the code I previously gave you:
# Rename the columns of todo to match Datos
names(todo)<-c('tipo','hora','estacion','y1','y2')
# Merge the two.
Datos.y.todo<-merge(Datos,todo)
# Notice how the y1 and y2 values are now repeated for easy comparison.
# tipo estacion hora usos y1 y2
# 1 hábil A.SANIN X4 11 1.5 21.5 # We don't want this row.
# 2 hábil A.SANIN X4 23 1.5 21.5 # We want this row.
# 3 Sábado A.SANIN X4 3 4.0 12.0 # We want this row.
And finally, you filter the rows you want:
with(Datos.y.todo, Datos.y.todo[usos<y1 | usos>y2, ])
# tipo estacion hora usos y1 y2
# 2 hábil A.SANIN X4 23 1.5 21.5
# 3 Sábado A.SANIN X4 3 4.0 12.0
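The merge-then-filter idea does not depend on R. As a rough Python sketch with the standard library only, the thresholds can be stored in a dictionary keyed by (tipo, hora, estacion), and each observation is looked up and compared. The variable names mirror the question, the sample rows are a shortened subset, and the name outliers is hypothetical.

```python
datos = [
    {"tipo": "hábil", "estacion": "A.SANIN", "hora": "X4", "usos": 11},
    {"tipo": "hábil", "estacion": "A.SANIN", "hora": "X4", "usos": 23},
    {"tipo": "Sábado", "estacion": "A.SANIN", "hora": "X4", "usos": 3},
    {"tipo": "hábil", "estacion": "ALAMOS", "hora": "X4", "usos": 4},
]
# (tipo, hora, estacion) -> (y1, y2), the critical values per group
todo = {
    ("hábil", "X4", "A.SANIN"): (1.5, 21.5),
    ("Sábado", "X4", "A.SANIN"): (4.0, 12.0),
}

outliers = []
for row in datos:
    key = (row["tipo"], row["hora"], row["estacion"])
    if key not in todo:
        continue  # like merge() by default, rows without a matching group are dropped
    y1, y2 = todo[key]
    if row["usos"] < y1 or row["usos"] > y2:
        outliers.append({**row, "y1": y1, "y2": y2})

print(outliers)
```

Each observation needs one dictionary lookup, so no explicit loop over the 3480 threshold rows is required.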
