Change values in a data set in Julia


I am converting a function in R to Julia, but I do not know how to convert the following R code:
x[x==0]=4
Basically, x contains rows of numbers, but whenever there is a 0, I need to change it to a 4. The data set x comes from a binomial distribution. Can someone help me define the above code in Julia?

Use .== (the broadcasted ==); see "Dot Syntax for Vectorizing Functions" in the Julia manual.
With vector:
julia> x = round.(Int, rand(5)) # notice how round is also broadcasted here
5-element Array{Int64,1}:
0
0
1
0
1
julia> x .== 0
5-element BitArray{1}:
true
true
false
true
false
julia> x[x .== 0] = 4
4
julia> x
5-element Array{Int64,1}:
4
4
1
4
1
With matrix:
julia> y = round.(Int, rand(5, 5))
5×5 Array{Int64,2}:
0 1 1 0 0
1 0 1 1 1
0 0 0 0 1
1 1 0 0 0
0 1 0 1 1
julia> y[y .== 0] = 4
4
julia> y
5×5 Array{Int64,2}:
4 1 1 4 4
1 4 1 1 1
4 4 4 4 1
1 1 4 4 4
4 1 4 1 1
With dataframe:
julia> using DataFrames
julia> df = DataFrame(x = round.(Int, rand(5)), y = round.(Int, rand(5)))
5×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 0 │ 0 │
│ 2 │ 0 │ 1 │
│ 3 │ 0 │ 0 │
│ 4 │ 0 │ 1 │
│ 5 │ 1 │ 0 │
julia> df[:x][df[:x] .== 0] = 4
4
julia> df
5×2 DataFrames.DataFrame
│ Row │ x │ y │
├─────┼───┼───┤
│ 1 │ 4 │ 0 │
│ 2 │ 4 │ 1 │
│ 3 │ 4 │ 0 │
│ 4 │ 4 │ 1 │
│ 5 │ 1 │ 0 │

The simplest solution is to use the replace! function:
replace!(x, 0=>4)
Use replace(x, 0=>4) (without the !) to do the same thing, but creating a copy of the vector.
Note that these functions are only available in Julia 0.7 and later.
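A minimal sketch of the difference between the two calls, assuming Julia 0.7 or later. The vector contents and the variable names y and z are made up for illustration:

```julia
x = [0, 1, 0, 2, 1]

# replace returns a modified copy; x itself is not changed by this call
y = replace(x, 0 => 4)

# replace! overwrites the zeros in x itself
replace!(x, 0 => 4)

# several pairs can be passed at once; each element is replaced at most once,
# so this swaps 1s and 2s in a copy of x
z = replace(x, 1 => 2, 2 => 1)
```

The same calls work on matrices, since replace and replace! operate element by element.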

Two small issues, too long for a comment:
In Julia 0.7 you should write x[x .== 0] .= 4 (the assignment needs a dot as well).
In general it is faster to use foreach or a loop than to allocate a temporary Boolean vector with x .== 0, e.g.:
julia> using BenchmarkTools
julia> x = rand(1:4, 10^8);
julia> function f1(x)
x[x .== 4] .= 0
end
f1 (generic function with 1 method)
julia> function f2(x)
foreach(i -> x[i] == 0 && (x[i] = 4), eachindex(x))
end
f2 (generic function with 1 method)
julia> @benchmark f1($x)
BenchmarkTools.Trial:
memory estimate: 11.93 MiB
allocs estimate: 10
--------------
minimum time: 137.889 ms (0.00% GC)
median time: 142.335 ms (0.00% GC)
mean time: 143.145 ms (1.08% GC)
maximum time: 160.591 ms (0.00% GC)
--------------
samples: 35
evals/sample: 1
julia> @benchmark f2($x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 86.904 ms (0.00% GC)
median time: 87.916 ms (0.00% GC)
mean time: 88.504 ms (0.00% GC)
maximum time: 91.289 ms (0.00% GC)
--------------
samples: 57
evals/sample: 1
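The same idea can be written as an explicit loop. This is only a sketch: the function name replace_zeros! is hypothetical, and no timings are claimed for it.

```julia
# In-place replacement that never allocates a Boolean mask.
# `replace_zeros!` is a made-up helper name, not part of Base.
function replace_zeros!(x::AbstractArray, new_value)
    for i in eachindex(x)
        if x[i] == 0
            x[i] = new_value
        end
    end
    return x
end

v = [0, 3, 0, 1]
replace_zeros!(v, 4)   # v is modified in place

m = [0 1; 2 0]
replace_zeros!(m, 4)   # eachindex covers every element, so matrices work too
```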

Related

How to convert DataFrame's DateTime element to Int64 milliseconds in Julia?

using TimeSeries, DataFrames, Dates
s="DateTime,Open,High,Low,Close,Volume
2020/01/05 16:14:01,20,23,19,20,30
2020/01/05 16:14:11,23,27,19,22,20
2020/01/05 17:14:01,24,28,19,23,10
2020/01/05 18:14:01,25,29,20,24,40
2020/01/06 08:02:01,26,30,22,25,50"
ta=readtimearray(IOBuffer(s),format="yyyy/mm/dd HH:MM:SS")
df = DataFrame(ta)
df.ms = Dates.millisecond.(df.timestamp)
df
The output is strange: every ms value is zero. Why?
5×7 DataFrame
│ Row │ timestamp │ Open │ High │ Low │ Close │ Volume │ ms │
│ │ DateTime │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Int64 │
├─────┼─────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼───────┤
│ 1 │ 2020-01-05T16:14:01 │ 20.0 │ 23.0 │ 19.0 │ 20.0 │ 30.0 │ 0 │
│ 2 │ 2020-01-05T16:14:11 │ 23.0 │ 27.0 │ 19.0 │ 22.0 │ 20.0 │ 0 │
│ 3 │ 2020-01-05T17:14:01 │ 24.0 │ 28.0 │ 19.0 │ 23.0 │ 10.0 │ 0 │
│ 4 │ 2020-01-05T18:14:01 │ 25.0 │ 29.0 │ 20.0 │ 24.0 │ 40.0 │ 0 │
│ 5 │ 2020-01-06T08:02:01 │ 26.0 │ 30.0 │ 22.0 │ 25.0 │ 50.0 │ 0 │

df.ms = Dates.value.(df.timestamp)
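Dates.millisecond returns only the millisecond component of a DateTime (0 to 999), which is 0 here because every timestamp falls on a whole second.
Note that Dates.value counts from Julia's internal epoch (0000-12-31T00:00:00) rather than the standard Unix epoch. One way to get seconds since the Unix epoch is Int.(Dates.datetime2unix.(Dates.DateTime.(df.timestamp))); multiply by 1000 for milliseconds.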

Use Dates.value.(df.timestamp). Because you have a vector of DateTime values, it gives you the number of milliseconds. If you had Date objects (date only, without time), Dates.value would give you a number of days instead.
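A small sketch comparing the functions mentioned in these answers on a single timestamp taken from the question's data. Only the Dates standard library is assumed; the variable names are illustrative:

```julia
using Dates

t = DateTime(2020, 1, 5, 16, 14, 1)

# millisecond component only (0 to 999); zero for a whole-second timestamp
ms_part = Dates.millisecond(t)

# total milliseconds counted from Julia's internal epoch
ms_total = Dates.value(t)

# datetime2unix returns seconds since the Unix epoch as a Float64;
# scale by 1000 and round to get integer milliseconds
unix_ms = round(Int, Dates.datetime2unix(t) * 1000)

# on a Date, Dates.value counts days, so consecutive dates differ by 1
day_diff = Dates.value(Date(2020, 1, 6)) - Dates.value(Date(2020, 1, 5))
```

For a whole column, each call can be broadcast with a dot, as in Dates.value.(df.timestamp).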

R Statistics Trying to Call Element from Vector Using Elements In Another Dataframe

I probably worded this question terribly, and I know there is a really simple solution, but I can't find it because I don't know the right words to search for. Apologies in advance for the rough code. Here's what I am trying to do.
X1 <- c(0,0,1,1,0)
X2 <- c(1,0,0,1,0)
X3 <- c(0,1,1,1,1)
lookup <- data.frame (X1, X2, X3)
#This above here creates a data frame with 5 rows and 3 columns with binary answers.
Match <- 1:(15)
P1 <- rep(1:5, each=3)
X1 <- rep(1:3,length.out=15)
X1 <- paste("X", X1, sep="")
Data <- data.frame(Match, X1, P1)
#This above creates a dataframe where it shows every possible match up of row and column for a total of 15 rows (5 people with 3 items).
What I want to do is pull the element from the lookup table into a new column that shows the result of the match up of P1 and X1. Something like this:
Data$Result <- lookup[1,'X3']
The above works like I want it to but it only works for row 1 and X3 (question 3). But when I try to replace those things to change by row depending on what the column values are it's just a mess either returning null or not the result at all. Here's what I tried:
Data$Result <- lookup["P1","X1"] #this doesn't work
Data$Result <- lookup[Data$P1,Data$X1] #and this doesn't work
Data$Result <- lookup[P1,X1] #and this doesn't work
I'm sure there's a really easy answer that I'm missing. It would be great if someone could help me with this.

I think you are looking for a left join.
See the suggestion below.
Also, a small note on your variable naming: I would recommend using only lowercase letters, no capitals, because it makes coding easier (e.g., data$match).
X1 <- c(0,0,1,1,0)
X2 <- c(1,0,0,1,0)
X3 <- c(0,1,1,1,1)
P1 <- as.integer(1:5)
lookup <- data.frame (P1, X1, X2, X3) #I have added this column because I think this is how your lookup is corresponding to your data
require(dplyr)
require(tidyr)
lookup_long <- gather(lookup, 'X' , 'answer', X1:X3) #making data tidy (one observation/variable per row)
left_join(Data, lookup_long, by = 'P1')
Match X1 P1 X answer
1 1 0 P1 X1 0
2 1 0 P1 X2 1
3 1 0 P1 X3 0
4 2 0 P2 X1 0
5 2 0 P2 X2 0
6 2 0 P2 X3 1
7 3 1 P3 X1 1
8 3 1 P3 X2 0
9 3 1 P3 X3 1
10 4 1 P4 X1 1
11 4 1 P4 X2 1
12 4 1 P4 X3 1
13 5 0 P5 X1 0
14 5 0 P5 X2 0
15 5 0 P5 X3 1
16 6 0 P1 X1 0
17 6 0 P1 X2 1
18 6 0 P1 X3 0
19 7 0 P2 X1 0
20 7 0 P2 X2 0
21 7 0 P2 X3 1
22 8 1 P3 X1 1
23 8 1 P3 X2 0
24 8 1 P3 X3 1
25 9 1 P4 X1 1
26 9 1 P4 X2 1
27 9 1 P4 X3 1
28 10 0 P5 X1 0
29 10 0 P5 X2 0
30 10 0 P5 X3 1
31 11 0 P1 X1 0
32 11 0 P1 X2 1
33 11 0 P1 X3 0
34 12 0 P2 X1 0
35 12 0 P2 X2 0
36 12 0 P2 X3 1
37 13 1 P3 X1 1
38 13 1 P3 X2 0
39 13 1 P3 X3 1
40 14 1 P4 X1 1
41 14 1 P4 X2 1
42 14 1 P4 X3 1
43 15 0 P5 X1 0
44 15 0 P5 X2 0
45 15 0 P5 X3 1

Concatenation + collecting of ranges in Julia

According to the documentation, [A; B; C; ...] calls vcat(). So concatenating ranges this way also collects their values:
julia> [1:4; 6:9; 20:23]
12-element Array{Int64,1}:
1
2
3
4
6
7
8
9
20
21
22
23
I tried to call vcat() on a comprehension of ranges, but it does not collect the values:
julia> vcat([i:i+3 for i in [1,6,20]])
3-element Array{UnitRange{Int64},1}:
1:4
6:9
20:23
Is there a simple way to collect all values from a comprehension of ranges?

Simply add ... (the splat operator), so that each range is passed to vcat as a separate argument:
julia> vcat([i:i+3 for i in [1,6,20]]...)
12-element Array{Int64,1}:
1
2
3
4
6
7
8
9
20
21
22
23
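As a sketch, here is the splatted call next to two alternatives from Base that avoid splatting. The variable names are illustrative, and no performance claim is made for any of them:

```julia
ranges = [i:i+3 for i in [1, 6, 20]]

# splatting passes each range to vcat as its own argument
a = vcat(ranges...)

# reduce applies vcat across the collection without splatting,
# which avoids a very long argument list when there are many ranges
b = reduce(vcat, ranges)

# flatten lazily chains the ranges; collect materialises the values
c = collect(Iterators.flatten(ranges))
```

All three produce the same 12-element vector as [1:4; 6:9; 20:23].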

Remove rows that match a value

I'm trying to filter out some data. Say several columns contain numeric values, and any row where all of those columns equal zero must go. I've thought about performing multiple matches with which, like so:
match1 <- match(which(storm$FATALITIES==0), which(storm$INJURIES==0))
match2 <- match(which(storm$CROPDMG==0), which(storm$CROPDMGEXP==0))
match3 <- match(which(storm$PROPDMG==0), which(storm$PROPDMGEXP==0))
match4 <- match(match1, match2)
matchF <- match(match4, match3)
but it clearly doesn't work, since it's giving a position relative to the last vector...
The data looks something like this:
BGN_DATE STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
1 4/18/1950 0:00:00 AL TORNADO 0 15 25.0 K 3
2 4/18/1950 0:00:00 AL TORNADO 0 0 0.0 K 0
3 2/20/1951 0:00:00 AL TORNADO 0 2 25.0 K 0
4 6/8/1951 0:00:00 AL TORNADO 0 2 0.0 K 0
5 11/15/1951 0:00:00 AL TORNADO 0 0 0.0 K 0
6 11/15/1951 0:00:00 AL TORNADO 1 6 2.5 K 0
7 11/16/1951 0:00:00 AL TORNADO 0 1 2.5 K 0
CROPDMGEXP LATITUDE LONGITUDE REFNUM
1 3040 8812 1
2 3042 8755 2
3 3340 8742 3
4 3458 8626 4
5 3412 8642 5
6 3450 8748 6
7 3405 8631 7
I'm interested in removing all rows that are 0 for INJURIES, FATALITIES, CROPDMG and PROPDMG (all of them simultaneously). I've already filtered out NA with complete.cases().
Thanks

Here are a couple ways. One interactive and very intuitive:
subset(storm, INJURIES != 0 |
FATALITIES != 0 |
CROPDMG != 0 |
PROPDMG != 0)
and one programmatic, hence more flexible/scalable:
fields <- c('INJURIES', 'FATALITIES', 'CROPDMG', 'PROPDMG')
keep <- rowSums(storm[fields] != 0) > 0
storm[keep, ]

Convert missing values (-9) to NAs in a Plink PED file when reading into R

I have two files: pedigree.ped and pedigree.map. These two file formats can be used by Plink.
In my case I want to use them with R, and I think I must convert them to an R format first. For example, missing values in Plink are coded differently from missing values in R.
How can I convert these two files to use them in R? How can I change the missing values to NA?
Sample of my data:
ped file:
1 1 0 0 1.02 A A G G 0 0
1 2 0 0 0.51 T G C C A A
2 3 1 2 -9 0 0 A G T T
...
The first column is id_family, the second is id_individual, the third and fourth are the father and mother of id_individual, the fifth is the quantitative trait (-9 is the missing value), and the remaining columns are genotypes (SNP alleles). The missing value for the genotype columns is 0; for the quantitative trait it is -9.
map file:
1 rs1 0 100000
1 rs2 0 100100
1 rs3 0 100200
The first column is the chromosome id (1-22, X, Y, or 0 if unplaced), the second is the rs# or SNP identifier, the third is the genetic distance (morgans), and the fourth is the base-pair position (bp units).

Assuming the data in the ped file is read into an R data frame -
> my.dataframe
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 1 1 0 0 1.02 A A G G 0 0
2 1 2 0 0 0.51 T G C C A A
3 2 3 1 2 -9.00 0 0 A G T T
Now check for invalid/missing values per column and assign NA. For example, take the 5th column -
my.dataframe[my.dataframe[,5] == -9, 5] <- NA
> my.dataframe
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 1 1 0 0 1.02 A A G G 0 0
2 1 2 0 0 0.51 T G C C A A
3 2 3 1 2 NA 0 0 A G T T
Similarly assign NA to required entries.
Note: R functions treat NAs in a special way. Look into the respective function arguments. Some related keywords to watch for - na.rm, na.pass, na.fail, na.omit etc.

Define NA values when reading ped files into R, e.g.:
read.table(text = "
1 1 0 0 1.02 A A G G 0 0
1 2 0 0 0.51 T G C C A A
2 3 1 2 -9 0 0 A G T T",
na.strings = c("NA", "-9"), sep = "\t")
# result
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 1 1 0 0 1.02 A A G G 0 0
# 2 1 2 0 0 0.51 T G C C A A
# 3 2 3 1 2 NA 0 0 A G T T
Also, use the --tab option when running plink, so that columns are separated by tabs and the two alleles of each genotype are separated by a space.
--tab Delimit --recode and --recode12 with tabs
