Plotting multiple data files with Gadfly - julia

I'm trying to plot several data files using Gadfly. I tried using a DataFrame, but with no result.
I tried the code below, but it only plots the last data file. I don't know how to plot all the data files in a single plot using Gadfly.
data1 = CSV.read("DOC.CSV")
data2 = CSV.read("DODC.CSV")
data3 = CSV.read("DOTC.CSV")
data4 = CSV.read("DTC.CSV")
data5 = CSV.read("DTDC.CSV")
data6 = CSV.read("DTTC.CSV")
data = [data1, data2, data3, data4, data5, data6]
ddframe = DataFrame()
for i in 1:6
    ddframe[Symbol("Data" * string(i))] = DataFrame(x=data[i][!, 1], y=data[i][!, 2], label="Data " * string(i))
end
p = plot(ddframe, x="x", y="y", color="label", Geom.point, Geom.line,
         Guide.xlabel("Wavelength (nm)"), Guide.ylabel("Absorbance (UA)"),
         Guide.title("Absorbance Spectrum for Cyanine dyes"))

I assume your data is like this.
│ x │ y │
├───┼───┤
│ 1 │ 1 │
│ 2 │ 3 │
│ 3 │ 5 │
│ 4 │ 7 │
│ 5 │ 9 │
Therefore, the data is an array of dataframes.
Further, I assume each dataframe can be represented as one line (wavelength vs. absorbance).
Based on your code, you want to combine every dataframe into a single plot, so what you need is the layer() function.
plot(
    layer(data[1],
          x=:x, y=:y,
          Geom.point, Geom.line,
          Theme(default_color="red")),
    layer(data[2],
          x=:x, y=:y,
          Geom.point, Geom.line,
          Theme(default_color="blue"))
)
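With six files, writing each layer() by hand gets verbose. A minimal sketch of an alternative, assuming data is the array of six dataframes read above: stack everything into one long dataframe with a label column and let color=:label draw and color one curve per file, which is what the loop in the question appears to be aiming for.
# Sketch: assumes `data` is the Vector of six DataFrames from the question.
ddframe = vcat([DataFrame(x=data[i][!, 1], y=data[i][!, 2], label="Data $i") for i in 1:6]...)

p = plot(ddframe, x=:x, y=:y, color=:label,
         Geom.point, Geom.line,
         Guide.xlabel("Wavelength (nm)"),
         Guide.ylabel("Absorbance (UA)"),
         Guide.title("Absorbance Spectrum for Cyanine dyes"))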

Related

Julia: how to remove rows by condition with a search across the entire dataframe

Sometimes there is a need to delete rows by some condition in a large dataframe, where it is impractical to manually specify all columns by name. For example, this can happen when "no data" is encoded as some specific numeric value, say -1. It would be useful to be able to delete all rows where this value occurs, similar to how dropmissing() removes rows with missing values. I have not found a convenient way to replace an arbitrary value with missing, or to directly delete all rows where the specified value occurs.
Simple example:
df = DataFrame(A=[2, -1, 3, 3], B=[2, 5, 7, -1], C=3)
I want to delete all rows where -1 values occur, or at least extract all rows where this value does not occur. Something like this:
df_clear = df[df .>= 0, :]
or:
df_clear = df[findall(x -> x>0, df), :]
A similar syntax works for replacing values, but it doesn't seem to be applicable for deleting rows:
df .= ifelse.(df .< 0, 0, df)
What is the most elegant way to solve this problem? Is there a beautiful solution to perform such a check on a range of columns or on all columns except for the specified one?
You could do it e.g. like this (using the ifelse function you used; the approach assumes you do not have missing values in your data frame):
julia> dropmissing!(ifelse.(df .== -1, missing, df))
2×3 DataFrame
 Row │ A      B      C
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      2      3
   2 │     3      7      3
A more direct approach is to use the subset function by keeping rows in which -1 is not present:
julia> subset(df, All() .=> ByRow(!=(-1)))
2×3 DataFrame
 Row │ A      B      C
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      2      3
   2 │     3      7      3
Finally, you can do the following using indexing syntax:
julia> df[all.(!=(-1), eachrow(df)), :]
2×3 DataFrame
 Row │ A      B      C
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2      2      3
   2 │     3      7      3
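The question also asks about limiting the check to a range of columns, or to all columns except a specified one. A minimal sketch of the same subset idea with narrower selectors, assuming the df from the example (Not and Between are standard DataFrames.jl column selectors):
# keep rows where -1 does not appear in any column except :C
subset(df, names(df, Not(:C)) .=> ByRow(!=(-1)))
# keep rows where -1 does not appear in the columns :A through :B
subset(df, names(df, Between(:A, :B)) .=> ByRow(!=(-1)))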

Adding a (fixed) new row to the top of each dataset in a list of N datasets using apply

I have N data sets which were loaded into RStudio and stored in the list object "datasets". The problem is that what I want to be the top row (or the headers) of each of them currently sits in their third rows.
The initial version of this question only had the paragraph below describing what each of the N datasets looks like, but I realized that is not nearly simple enough, so now I am including a screenshot of one of them right below that paragraph.
Each dataset is 503 by 31, and that 3rd row is "Y", "X1", "X2", ..., "X30" in every dataset. The first row in each of them is a row of dummy variables, all either 1 or 0 depending on a condition. The 2nd row of each is blank in the first spot, then '1', '2', '3', ..., '30'.
What I want to do from here is add a new row, equivalent to the 3rd row, to the top of each of these N dataframe elements within the list object datasets, or, even better, set proper headers on them instead. Alternatively, find a way to delete or drop the 2nd row and then swap the 1st row with the new 2nd row.
I also took the liberty of adding that new row to the source csv-formatted dataset and screenshotting it, to demonstrate what the dataframe in R should look like after I apply whatever answer to this question works.
Would it be possible for me to somehow combine an rbind() function with one of the apply functions to accomplish this task?
p.s. What is below the 3rd row of each dataframe is just 500 rows of observations on each of the 31 variables.
I already tried to add the aforementioned column names to each dataframe using the following:
lapply(datasets, function(i){
  colnames(i) <- c("Y", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9",
                   "X10", "X11", "X12", "X13", "X14", "X15", "X16", "X17",
                   "X18", "X19", "X20", "X21", "X22", "X23", "X24", "X25",
                   "X26", "X27", "X28", "X29", "X30")
})
But, to my surprise, this didn't result in any permanent changes to datasets at all.
p.s. The 2nd thing I do in this script (after setting the WorkSpace) is load the following libraries with the following unorthodox method:
# load all necessary packages using only 1 command/line
library_list <- c(library(stats),library(plyr),library(dplyr),
library(tidyverse),library(tibble),library(readr),
library(leaps),library(lars),library(stringi),
library(purrr),library(parallel), library(vroom))
I just run rm(library_list) immediately afterwards and it's like I never did it that weird way. I do it this way because my hands are disabled, so the fewer clicks it takes to run each line individually, the better!
If I understand you correctly this should work:
library(janitor)
library(purrr)
library(dplyr)
# create a list
df1 <- read.table(header = FALSE,
text = '
1 0 1 1 0
1 2 3 4 5
X1 X2 X3 X4 X5
no no no no no')
df2 <- read.table(header = FALSE,
text = '
1 1 0 0 0
6 7 8 9 10
X1 X2 X3 X4 X5
no no no no no')
my_list <- list(df1, df2)
Base R
# create a custom function and then use it with lapply
my_renamer <- function(df, row=3){
names(df) <- df[row,]
df
}
lapply(my_list, function(x) my_renamer(x, 3))
Or with purrr and janitor's row_to_names():
map(my_list, ~ row_to_names(., 3, remove_row = FALSE, remove_rows_above = FALSE))
Or with lapply and janitor:
lapply(my_list, function(x) row_to_names(x, 3, remove_row = FALSE, remove_rows_above = FALSE))
[[1]]
X1 X2 X3 X4 X5
1 1 0 1 1 0
2 1 2 3 4 5
3 X1 X2 X3 X4 X5
4 no no no no no
[[2]]
X1 X2 X3 X4 X5
1 1 1 0 0 0
2 6 7 8 9 10
3 X1 X2 X3 X4 X5
4 no no no no no
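One more note, since the lapply() attempt in the question appeared to do nothing: lapply() returns a new list rather than modifying its input in place, so the result has to be assigned back. A minimal sketch under the assumptions from the question (the list is called datasets, the header text sits in row 3, and the two rows above it should be dropped), using janitor's row_to_names():
library(janitor)
# promote row 3 to the header, drop it and the two rows above it,
# and assign the result back so the change actually persists
datasets <- lapply(datasets, function(df) {
  row_to_names(df, 3, remove_row = TRUE, remove_rows_above = TRUE)
})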

Can CategoricalArrays be used with Julia DataFrames to convert multiple columns from strings to categorical?

I have a fairly large survey dataset (100+ columns and 5000 rows) that has a bunch of string variables.
I can use the following function to convert individual columns one by one,
function fix_df_column(df)
    levels = ["x"]
    Colname = categorical(df[!, :Colname]; levels, ordered = false)
    df[!, :Colname] = Colname
    # df
end
but I would like to be able to iterate across the whole dataframe and convert everything automatically.
The only documentation I can find relates to arrays (https://dataframes.juliadata.org/stable/man/categorical/), and the only examples I can find change single columns, not multiple ones.
Does anyone know a simpler way to achieve this?
Thanks
Yes, you can do it. Assuming that you want to convert all columns that contain strings (without missing values) and you want automatic assignment of levels, you can do:
transform!(df, names(df, AbstractString) .=> categorical, renamecols=false)
For example:
julia> df = DataFrame(x1=["a", "b"], x2=[1, 2], x3=[missing, "x"], x4=["c", "d"])
2×4 DataFrame
 Row │ x1      x2     x3       x4
     │ String  Int64  String?  String
─────┼────────────────────────────────
   1 │ a           1  missing  c
   2 │ b           2  x        d

julia> transform!(df, names(df, AbstractString) .=> categorical, renamecols=false)
2×4 DataFrame
 Row │ x1    x2     x3       x4
     │ Cat…  Int64  String?  Cat…
─────┼────────────────────────────
   1 │ a         1  missing  c
   2 │ b         2  x        d
and you can see that only :x1 and :x4 are changed.
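If you also want to catch string columns that allow missing values (like :x3 above), a slightly wider column selector should do it; a minimal sketch, assuming the df from the example:
# also select columns whose element type is a subtype of Union{AbstractString, Missing},
# so :x3 gets converted as well (categorical keeps the missing values)
transform!(df, names(df, Union{AbstractString, Missing}) .=> categorical, renamecols=false)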

Call variables by name and column number in a data.frame

I have a data frame with columns I want to reorder. However, in different iterations of my script, the total number of columns may change.
> Fruit
Vendor  A  B  C  D  E  ...  Apples  Oranges
Otto    4  5  2  5  2  ...       3        4
Fruit2<-Fruit[c(32,33,2:5)]
So instead of manually adapting the code each time (the positions 32 and 33 change), I'd like to do something like the following:
Fruit2<-Fruit[,c("Apples", "Oranges", 2:5)]
I tried a couple of syntaxes but could not get it to do what I want. I know this is a simple syntax issue, but I could not find the solution yet.
The idea is to mix variable names with a numeric vector when referencing the columns for the new data frame. I don't want to spell out the whole vector in variable names because in reality there are 30 variables.
I'm not sure how your data is stored in R, so this is what I used:
Fruit <- data.frame("X1" = c("A", 4), "X2" = c("B", 5), "X3" = c("C", 2),
                    "X4" = c("D", 5), "X5" = c("E", 2), "X6" = c("Apples", 3),
                    "X7" = c("Oranges", 4),
                    row.names = c("Vendor", "Otto"), stringsAsFactors = FALSE)

        X1  X2  X3  X4  X5      X6       X7
Vendor   A   B   C   D   E  Apples  Oranges
Otto     4   5   2   5   2       3        4
Then use:
indexes <- which(Fruit[1,]%in%c("Apples","Oranges"))
Fruit2<- Fruit[,c(indexes,2:5)]
Fruit[1,] references the Vendor row, %in% returns a logical vector, and which() turns that vector into the column indexes.
This gives:
> Fruit2
            X6       X7  X2  X3  X4  X5
Vendor  Apples  Oranges   B   C   D   E
Otto         3        4   5   2   5   2
Make sure your data are not being stored as factors, otherwise this will not work. Or you could change the Vendor row to column names as per the comment above.
The answer, as I found out, is to use the dplyr package.
It is very powerful.
The solution to the problem above would be:
Fruit2 <- Fruit %>% select(Apples, Oranges, A:E)
This allows dynamic selection of columns, and of ranges of columns, even if the positions of the columns change.
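If you would rather stay in base R, mixing names and positions also works once the names are converted to indexes with match(); a minimal sketch, assuming the columns are literally named Apples and Oranges as in the question:
# turn the column names into positions, then combine them with 2:5
name_idx <- match(c("Apples", "Oranges"), names(Fruit))
Fruit2 <- Fruit[, c(name_idx, 2:5)]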

Dividing one column in my dataset by a fixed value in R

I'm pretty new to R and here's a (maybe simple) question:
I have big .dat datasets and I add two of them together to get the sum of their values. The datasets look roughly like this:
#stud1
AMR X1 X2 X3...
1 3 4 10
2 4 5 2
#stud2
AMR X1 X2 X3
1 6 4 6
2 1 2 1
So what I did is
> studAll <- stud1 + stud2
and the result was:
# studAll:
AMR X1 X2 X3
2 9 8 16
4 5 7 3
MY PROBLEM NOW IS:
The AMR column is not meant to change, so my idea was to divide that column by 2 to get back to the original values. Or is there an easier solution than my idea?
If I understand your question correctly, you want to make a new dataframe which adds all the columns except AMR?
You could do it the long way:
studAll$X1 <- stud1$X1 + stud2$X1
and repeat for each X...
Or this would work, provided the AMR column is the same across all of them:
# set up
stud1 <- data.frame(c(1, 2), c(3, 4), c(4, 5), c(10, 2))
stud2 <- stud1
cols <- c("AMR", "X1", "X2", "X3")
colnames(stud1) <- cols
colnames(stud2) <- cols
# add them
studAll <- stud1 + stud2
# put the AMR column from stud1 back into studAll
# (this assumes the AMR column is the same in all studs)
studAll$AMR <- stud1$AMR
You could also select all columns other than AMR and add only those; see for example http://www.r-tutor.com/r-introduction/data-frame
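A minimal sketch of that last suggestion, matching the columns by name so it works regardless of where AMR sits:
# add every column except AMR; AMR is copied unchanged from stud1
cols_to_add <- setdiff(names(stud1), "AMR")
studAll <- stud1
studAll[cols_to_add] <- stud1[cols_to_add] + stud2[cols_to_add]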
