I'm a new Julia user.
I'm trying to write some code for a startup.
begin
using Pkg
Pkg.add("CSV")
Pkg.add("DataFrames")
using Statistics
using DataFrames
using CSV
using Dates
v=[1:12]
resultat=CSV.File("ResultatPropre.csv";header=true; delim=";")
println(resultat)
i=1
while i <= 12
a=now() ##Dates.millisecond
dt = CSV.read(resultat[i,1]*"_"*resultat[i,2]*"_"*resultat[i,3]*"_"*resultat[i,4]*"_1.csv")
x=resultat[i,6]
moy="MOYENNE"
if x==moy
for c in eachcol(dt)
println(mean(dt[:c]))
end
else
for c in eachcol(dt)
println(median(dt[:c]))
end
end
v[i]=now()-a
close(dt)
i = i + 1
end
CSV.write("OUT1.csv", DataFrame(hcat(resultat,v)), writeheader=false)
close(resultat)
end
I don't know whether this code is correct, but I don't get any error message.
The output file OUT1.csv is empty.
Why?
Sorry if this is hard to understand; I'm not fluent in English.
Thank you.
As with your previous question, this is hard to debug because it relies on files that are stored on your local machine. It is easiest for others to help you if you can create a minimal working example that reproduces the error you're getting.
From what you posted, there is an obvious issue though with what you're trying to do:
resultat=CSV.File("ResultatPropre.csv";header=true; delim=";")
This will return a CSV.File object, and not a DataFrame as you might expect. Consider the following example:
julia> using DataFrames, CSV
julia> CSV.write("out.csv", DataFrame(rand(5, 2))) # example data
"out.csv"
julia> resultat = CSV.File("out.csv")
5-element CSV.File{false}:
CSV.Row{false}: (x1 = 0.8579220366916582, x2 = 0.6209363986752581)
CSV.Row{false}: (x1 = 0.25341118271903995, x2 = 0.13828085618933872)
CSV.Row{false}: (x1 = 0.67532944746357, x2 = 0.7830459406731047)
CSV.Row{false}: (x1 = 0.268297279369758, x2 = 0.9701649420771219)
CSV.Row{false}: (x1 = 0.8369770803698637, x2 = 0.77439272213442)
This is probably not what you expected, given that you hcat resultat and your v vector later on.
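If a DataFrame is what you want, one option is to materialize the CSV.File and attach the timings as a new column rather than using hcat. This is only a sketch: the file path, the sample values and the elapsed_ms column name are made up for illustration, and it assumes recent versions of CSV.jl and DataFrames.jl.

```julia
using CSV, DataFrames

# Hypothetical stand-in for ResultatPropre.csv.
path = joinpath(mktempdir(), "out.csv")
CSV.write(path, DataFrame(x1 = [0.1, 0.2, 0.3], x2 = [1.0, 2.0, 3.0]))

f = CSV.File(path)     # an iterator of rows, not a DataFrame
df = DataFrame(f)      # materialize the rows into a DataFrame

timings = [5, 7, 9]    # made-up per-row timings, one per row of df
df.elapsed_ms = timings  # add them as a new column instead of hcat

println(df)
```

The new column must have exactly as many entries as the table has rows.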
The line that will actually error in your code however is:
close(dt)
At this point, dt is a DataFrame, as you've created it by calling CSV.read on a csv file, and you are calling the close function on this DataFrame. However, a close method does not exist for DataFrames:
julia> close(DataFrame(rand(5,2)))
ERROR: MethodError: no method matching close(::DataFrame)
The result of CSV.read is a DataFrame that is stored in memory, and there is no "open" file handle anywhere that needs to be closed after performing operations on your DataFrame - CSV.read is different from calling open on some text file that you then iterate through.
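To illustrate, here is a minimal sketch of the body of your loop without any close call. It assumes a recent CSV.jl, where CSV.read takes the sink type (DataFrame) as its second argument; the temporary file and its contents are invented for the example.

```julia
using CSV, DataFrames, Statistics, Dates

# Hypothetical stand-in for one of the per-row data files.
path = joinpath(mktempdir(), "sample_1.csv")
CSV.write(path, DataFrame(a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 60.0]); delim = ';')

start = now()

# The file is read completely into memory; no handle stays open afterwards.
dt = CSV.read(path, DataFrame; delim = ';')

col_means = [mean(col) for col in eachcol(dt)]
col_medians = [median(col) for col in eachcol(dt)]

elapsed = now() - start   # a Dates.Millisecond value

println(col_means)
println(col_medians)
println(Dates.value(elapsed), " ms")
# Nothing to close here: dt is plain in-memory data.
```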
I'm running a large number of meta-analyses with metafor. To get an overview of the results, I wanted to put together vectors containing the main estimates (to combine them in a dataframe later on). Yet, for some of these calculations, I do not have enough primary studies yet, so R will not be able to create a model for this particular domain. Hence, I will get an error message when I try to create a vector at the end.
library(metafor)
r1<-c(NA,NA)
n1<-c(NA,NA)
data1<-data.frame(r1,n1)
escalc1<-escalc(measure="COR", ri=r1,ni=n1, data = data1, method=REML)
rma1<-rma(yi,vi, data=escalc1)
#note the program will not be able to calculate rma1, because k = 0.
r2<-c(.3,.2)
n2<-c(100,200)
data2<-data.frame(r2,n2)
escalc2<-escalc(measure="COR", ri=r2,ni=n2, data = data2, method=REML)
rma2<-rma(yi,vi, data=escalc2)
#it will create an object for rma2 though
estimates<-c(rma1$beta, rma2$beta)
#as rma2 exists but rma1 doesn't, R will not let me create a vector here
Is there a way to tell R to check if the object exists first and to put in NAs for all cases where no object has been created yet? Specifically, I want R to replace rma1$beta (which does not exist) with NA in the last line of code. Is that possible?
You can use tryCatch to tell R what to do as an alternative if an error occurs, e.g.,
library(metafor)
r1<-c(NA,NA)
n1<-c(NA,NA)
data1<-data.frame(r1,n1)
escalc1<-escalc(measure="COR", ri=r1,ni=n1, data = data1)
e1 <- tryCatch({
rma1<-rma(yi,vi, data=escalc1);
rma1$beta}, error = function(e) NA)
r2<-c(.3,.2)
n2<-c(100,200)
data2<-data.frame(r2,n2)
escalc2<-escalc(measure="COR", ri=r2,ni=n2, data = data2)
e2 <- tryCatch({
rma2<-rma(yi,vi, data=escalc2);
rma2$beta}, error = function(e) NA)
estimates<-c(e1, e2)
#[1] NA 0.2356358
I am trying to use the aggregate function to compute the mean of a variable by group
using Distributions, PooledArrays
N=Int64(2e9/8); K=100;
pool = [@sprintf "id%03d" k for k in 1:K]
pool1 = [@sprintf "id%010d" k for k in 1:(N/K)]
function randstrarray(pool, N)
PooledArray(PooledArrays.RefArray(rand(UInt8(1):UInt8(K), N)), pool)
end
using JuliaDB
DT = IndexedTable(Columns([1:N;]), Columns(
id1 = randstrarray(pool, N),
v3 = rand(round.(rand(Uniform(0,100),100),4), N) # numeric e.g. 23.5749
));
res = IndexedTables.aggregate(mean, DT, by=(:id1,), with=:v3)
However, I get the error
MethodError: no method matching mean(::Float64, ::Float64)
Closest candidates are:
mean(!Matched::Union{Function, Type}, ::Any) at statistics.jl:19
mean(!Matched::AbstractArray{T,N} where N, ::Any) where T at statistics.jl:57
mean(::Any) at statistics.jl:34
in at base\<missing>
in #aggregate#144 at IndexedTables\src\query.jl:119
in aggregate_to at IndexedTables\src\query.jl:148
however
IndexedTables.aggregate(+ , DT, by=(:id1,), with=:v3)
works fine
Edit:
res = IndexedTables.aggregate_vec(mean, DT, by=(:id1,), with=:v3)
from help:
help?> IndexedTables.aggregate_vec
aggregate_vec(f::Function, x::IndexedTable)
Combine adjacent rows with equal indices using a function from vector to scalar, e.g. mean.
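The difference between the two functions is that aggregate_vec hands the whole vector of values for a group to the function. The following plain-Julia sketch shows that idea without IndexedTables; the ids and values are made up, and the Dict-based grouping is only an illustration, not how the library works internally.

```julia
using Statistics

# Made-up sample data: one id and one value per row.
ids = ["id001", "id002", "id001", "id002", "id001"]
v3 = [1.0, 10.0, 2.0, 20.0, 6.0]

# Collect all values of each group into one vector.
groups = Dict{String, Vector{Float64}}()
for (id, x) in zip(ids, v3)
    push!(get!(() -> Float64[], groups, id), x)
end

# Apply a vector-to-scalar function (here mean) once per group.
group_means = Dict(id => mean(xs) for (id, xs) in groups)

println(group_means)
```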
Old answer:
(I am keeping it because it was a pleasant exercise (for me) in creating a helper type and functions when something doesn't work the way we want. Maybe it could help someone in the future :)
I am not sure how you would like to aggregate the mean. My idea is to calculate the "center of gravity" of points with equal mass.
center of two points: G = (A+B)/2
adding (aggregating) a third point C gives (2G+C)/3 (2G because G's mass is A's mass plus B's mass)
etc.
struct Atractor
center::Float64
mass::Int64
end
" two points create new atractor with double mass "
mediocre(a::Float64, b::Float64) = Atractor((a+b)/2, 2)
# pls forgive me function's name! :)
" aggregate new point to atractor "
function mediocre(a::Atractor, b::Float64)
mass = a.mass + 1
Atractor((a.center*a.mass+b)/mass, mass)
end
Test:
tst_array = rand(Float64, 100);
isapprox(mean(tst_array), reduce(mediocre, tst_array).center)
true # at least in my tests! :)
mean(tst_array) == reduce(mediocre, tst_array).center # sometimes true
For aggregate function we need a little more work:
import Base.convert
" we need method for convert Atractor to Float64 because aggregate
function wants to store result in Float64 "
convert(::Type{Float64}, x::Atractor) = x.center
And now it (probably :P) works
res = IndexedTables.aggregate(mediocre, DT, by=(:id1,), with=:v3)
id1 │
────────┼────────
"id001" │ 45.9404
"id002" │ 47.0032
"id003" │ 46.0846
"id004" │ 47.2567
...
I hope you can see that aggregating the mean this way has an impact on precision! (there are more addition and division operations)
You need to tell it how to reduce two numbers to one. mean is for arrays. So just use an anonymous function:
res = IndexedTables.aggregate((x,y)->(x+y)/2, DT, by=(:id1,), with=:v3)
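One caveat worth checking before relying on this: folding values pairwise with (x+y)/2 only equals the arithmetic mean when a group has at most two values. A small sketch with made-up numbers, using foldl from Base to mimic a left-to-right two-argument reduction:

```julia
using Statistics

xs = [1.0, 2.0, 6.0]   # made-up values of one group

# Left-to-right pairwise averaging: ((1 + 2) / 2 + 6) / 2
pairwise = foldl((x, y) -> (x + y) / 2, xs)

# The arithmetic mean: (1 + 2 + 6) / 3
true_mean = mean(xs)

println(pairwise)    # 3.75
println(true_mean)   # 3.0
```

Later values get more weight than earlier ones, so for larger groups a vector-based aggregation is probably what you want.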
I'd really like to help you, but it took me 10 minutes to install all the packages and another few minutes to run the code and figure out what it actually does (or doesn't). It would be great if you'd provide a "minimal working example" which focuses on the problem. In fact, the only requirement to reproduce your problem is seemingly IndexedTables and two random arrays.
(Sorry, this is not a complete answer, but too long to be a comment.)
Anyway, if you read the docstring of IndexedTables.aggregate, you see that it requires a function which takes two arguments and returns a single value:
help?> IndexedTables.aggregate
aggregate(f::Function, arr::IndexedTable)
Combine adjacent rows with equal indices using the given 2-argument
reduction function, returning the result in a new array.
You see in the error message you posted, that there is
no method matching mean(::Float64, ::Float64)
Since I don't know what you expect to be calculated, I now assume that you want to calculate the mean value of the two numbers. In this case you can define another method for mean():
Base.mean(x, y) = (x+y) / 2
This will fulfil the aggregate function signature requirements. But I am not sure if this is what you want.
In R, I am using readHTMLTable to read in tables from the web. The tables I want occur at indexes 16 and 17, [[16]] and [[17]].
Here is a small sample of the data for you to work with:
These are some of the urls that contain the HTML tables.
url1 = "http://www.basketball-reference.com/leagues/NBA_1980.html"
url2 = "http://www.basketball-reference.com/leagues/NBA_1981.html"
url3 = "http://www.basketball-reference.com/leagues/NBA_1982.html"
And here, I read in the tables to variables named x1, x2, and x3.
x1 = readHTMLTable(url1)
x2 = readHTMLTable(url2)
x3 = readHTMLTable(url3)
If you look at the summary of each of these summary(x1), summary(x2), summary(x3) and count down through the indexes, the tables I want are the ones named "team" and "opponent", which occur on line 16 and line 17.
I have been trying to write a loop that would cycle through these and name the "team" table from each to a variables named team.1980, team.1981, and team.1982, respectively. The "opponent" tables would follow the same trend, opp.1980, and so forth.
This is the code for the loop I have been trying:
for(i in 1:3) {
for (j in 1980:1982) {
nam1 = paste0("team.", j)
nam2 = paste0("opp.", j)
assign(nam1, paste0("x.", i)[[16]])
assign(nam2, paste0("x.", i)[[17]])
}
}
I think the theory behind this loop works, however the problem occurs with the two assign functions:
assign(nam1, paste0("x.", i)[[16]])
assign(nam2, paste0("x.", i)[[17]])
When I run the loop, I get the error message
Error in paste0("x.", i)[[16]] : subscript out of bounds
which is the same error I get if I just run:
paste0("x", 1)[[16]]
> paste0("x", 1)[[16]]
Error in paste0("x", 1)[[16]] : subscript out of bounds
So I am pretty sure this is where my problem is. Does anyone know how I could cycle through variables and pull out indexes from each?
Please keep in mind that I am rather new to R, so simplicity would be much appreciated! Thanks in advance!
The output from readHTMLTable() is a list and the elements can be referenced by name; index isn't necessary. (Though you can use it.)
Suppose x1, x2, and x3 are defined as in your post. Then you can just do this:
for (i in 1:3) {
year <- 1980 + i - 1
eval(parse(text=paste0("team.", year, " <- x", i, '[["team"]]')))
eval(parse(text=paste0("opp.", year, " <- x", i, '[["opponent"]]')))
}
This evaluates the parsed text that's constructed dynamically in the loop. It creates 6 data frames: team.<year> and opp.<year> for each of the years 1980 to 1982.
Let's take a closer look at what it's doing...
First a string is constructed using paste0() to concatenate the values into a string with no separator. The first call to paste0() in the first iteration yields this string:
'team.1980 <- x1[["team"]]'
Calling parse() on this tells R to turn that string into an object called an expression. Expressions can be evaluated using eval(). So this string gets turned into an R statement and executed, thereby assigning team.1980.
This process continues for each of the 3 iterations.
This may not be the best approach, but it should work in your situation. I assume you have more than just these 6, otherwise you might as well just write them as individual assignments.
I was wondering whether there is a pre-built function for an operation I currently do by hand: creating a new vector from two original ones by taking, for example, one element from each in turn:
x = 1:5
y = 10:14
output = c(1,10,2,11,3,12,4,13,5,14)
For now I've been using:
output = c(rbind(x,y))
but it seems a bit dodgy to me and it is case specific to this mixing. I can't do for example:
output = c(1,2,10,3,4,11,5,1,12,...
thanks
I have some data that is badly formatted. Specifically I have numeric columns that have some elements with spurious text in them (e.g. "8 meters" instead of "8"). I want to use readtable to read in the data, make the necessary fixes to the data and then convert the column to a Float64 so that it behaves correctly (comparison, etc).
There seems to have been a macro called @transform that would do the conversion but it has been deleted. How do I do this now?
My best solution at the moment is to clean up the data, write it out as a csv and then re-read it using readtable and specify eltypes. But that is horrible.
What else can I do?
There is no need to run things via a csv file. You can change or update the DataFrame directly.
using DataFrames
# Lets make up some data
df=DataFrame(A=rand(5),B=["8", "9 meters", "4.5", "3m", "12.0"])
# And then make a function to clean the data
function fixdata(arr)
result = DataArray(Float64, length(arr))
reg = r"[0-9]+\.*[0-9]*"
for i = 1:length(arr)
m = match(reg, arr[i])
if m == nothing
result[i] = NA
else
result[i] = float64(m.match)
end
end
result
end
# Then just apply the function to the column to clean the data
# and then replace the column with the cleaned data.
df[:B] = fixdata(df[:B])
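In current Julia versions DataArray, NA and float64 no longer exist, so the function above will not run as written. A possible modern equivalent, assuming missing as the replacement for NA and parse(Float64, ...) for the conversion, could look like the sketch below; the sample strings are made up and the regex is a slightly tightened variant of the one above.

```julia
# Extract the first number found in each string; use missing when there is none.
function fixdata(arr)
    reg = r"[0-9]+\.?[0-9]*"
    result = Vector{Union{Missing, Float64}}(missing, length(arr))
    for (i, s) in enumerate(arr)
        m = match(reg, s)
        if m !== nothing
            result[i] = parse(Float64, m.match)
        end
    end
    return result
end

raw = ["8", "9 meters", "4.5", "3m", "12.0", "n/a"]   # made-up sample column
cleaned = fixdata(raw)

println(cleaned)
```

With a recent DataFrames.jl the cleaned vector would then be assigned back with df[!, :B] = fixdata(df[!, :B]).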
Let's say you have a DataFrame df with a column B that contains strings to convert.
First, define a function that parses a string as a float and returns NA on failure:
string_to_float(str) = try parse(Float64, str) catch; NA end
Then transform that column:
df[:B] = map(string -> string_to_float(string), df[:B])
an alternative shorter version
df[:B] = map(string_to_float, df[:B])
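On current Julia versions the same idea can be written without try/catch. This is a sketch that assumes missing in place of NA; tryparse and something are both functions from Base, and the sample strings are made up.

```julia
# tryparse returns nothing when the string is not a valid number;
# something replaces that nothing with missing.
string_to_float(s) = something(tryparse(Float64, s), missing)

raw = ["8", "4.5", "9 meters"]   # made-up sample column
cleaned = map(string_to_float, raw)

println(cleaned)
```

Note that a value such as "9 meters" becomes missing here rather than 9.0, because the whole string has to be a valid number.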