How to write matrix data to Excel in Julia?

I have variables stored in a Julia notebook:
sheet1 = [1 2 3; 4 5 6] and sheet2 = [1 1 1; 2 2 2]
How do I write these variables to an Excel workbook, each on a different worksheet? And what if I want them on the same worksheet?
XLSX.addsheet! didn't work. Thanks.

If you prefer a one-liner, this can be done as follows (each keyword argument names a sheet, and its value is a tuple of column vectors and column labels):
XLSX.writetable("file.xlsx",
    sheet1=(collect(eachcol(sheet1)), 'a':'c'),
    sheet2=(collect(eachcol(sheet2)), string.("col", 1:3)))

Modifying an example from the documentation:
using XLSX

filename = "myfile.xlsx"

# Some example data to try writing to .xlsx
columns = Vector()
push!(columns, [1, 2, 3])
push!(columns, ["a", "b", "c"])
labels = ["column_1", "column_2"]

XLSX.openxlsx(filename, mode="w") do xf
    sheet = xf[1]
    # Write our data to sheet 1
    XLSX.writetable!(sheet, columns, labels, anchor_cell=XLSX.CellRef("A1"))
    # Write the same data, but to a different place in the sheet
    XLSX.writetable!(sheet, columns, labels, anchor_cell=XLSX.CellRef("D1"))
    # Add a new sheet, which we will then access with xf[2]
    XLSX.addsheet!(xf)
    # Write the same data, but to sheet 2, in yet another position
    XLSX.writetable!(xf[2], columns, labels, anchor_cell=XLSX.CellRef("B2"))
end
You were on the right track with XLSX.addsheet!, but you then have to write to that second sheet; the reference to it is stored in the second position of xf in this example. The example also shows how to write to different positions within the same sheet, using the anchor_cell keyword argument.

Related

Beginner using pipes

I am a beginner and I'm trying to find the most efficient way to change the name of the first column for many CSV files that I will be creating. Once I have created the CSV files, I am loading them into R as follows:
data <- read.csv('filename.csv')
I have used the names() function to do the name change of a single file:
names(data)[1] <- 'Y'
However, I would like to find the most efficient way of combining/piping this name change to read.csv so the same name change is applied to every file when they are opened. I tried to write a 'simple' function to do this:
addName <- function(data) {
  names(data)[1] <- 'Y'
  data
}
However, I do not yet fully understand the syntax for writing a function and I can't get this to work.
Note
If you were expecting your original addName function to "mutate" an existing object like so
x <- data.frame(Column_1 = c(1, 2, 3), Column_2 = c("a", "b", "c"))
# Try (unsuccessfully) to change title of "Column_1" to "Y" in x.
addName(x)
# Print x.
x
please be aware that R passes by value rather than by reference, so x itself would remain unchanged:
Column_1 Column_2
1 1 a
2 2 b
3 3 c
Any "mutation" would be achieved by overwriting x with the return value of the function
x <- addName(x)
# Print x.
x
in which case x itself would obviously be changed:
Y Column_2
1 1 a
2 2 b
3 3 c
Answer
Anyway, here's a solution that compactly incorporates pipes (%>% from the magrittr package) and a custom function. Please note that without the linebreaks and comments, which I have added for clarity, this could be condensed to only a few lines of code.
# The dplyr package helps with easy renaming, and it includes the magrittr pipe.
library(dplyr)
# ...
filenames <- c("filename1.csv", "filename2.csv", "filename3.csv")
# A function to take a CSV filename and give back a renamed dataset taken from that file.
addName <- function(filename) {
  # Read in the named file as a data.frame.
  read.csv(file = filename) %>%
    # Take the resulting data.frame, and rename its first column as "Y";
    # quotes are optional, unless the name contains spaces: "My Column"
    # or `My Column` are needed then.
    dplyr::rename(Y = 1)
}
# Get a list of all the renamed datasets, as taken by addName() from each of the filenames.
all_files <- sapply(filenames, FUN = addName,
                    # Keep the list structure, in which each element is a
                    # data.frame.
                    simplify = FALSE,
                    # Name each list element by its filename, to help keep track.
                    USE.NAMES = TRUE)
In fact, you could easily rename any columns you desire, all in one fell swoop:
dplyr::rename(Y = 1, 'X' = 2, "Z" = 3, "Column 4" = 4, `Column 5` = 5)
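For instance, as a quick sketch on a hypothetical five-column data frame d (not from the original question), that call would sit at the end of a pipe:
library(dplyr)
d <- as.data.frame(matrix(1:10, nrow = 2))  # hypothetical data; columns V1..V5
d %>% rename(Y = 1, X = 2, Z = 3, "Column 4" = 4, `Column 5` = 5)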
This will read a vector of filenames, change the name of the first column of each one to "Y" and store all of the files in a list.
filenames <- c("filename1.csv", "filename2.csv")

addName <- function(filename) {
  data <- read.csv(filename)
  names(data)[1] <- 'Y'
  data
}

files <- list()
for (i in seq_along(filenames)) {
  files[[i]] <- addName(filenames[i])
}
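For what it's worth, the same list can be built more compactly with lapply:
files <- lapply(filenames, addName)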

Reading in a CSV file with a character-type variable, but expanding scientific notation (readr)

I have a CSV file with a variable (id). In Excel, when I check the format of the cells, some cells are the General type and some are Scientific:
# id
# 1 ge189839898 #general format cell in excel
# 2 we7267178 #general format cell in excel
# 3 2.8E+12 #scientific format cell in excel
When I read the file into R using read_csv, it thinks that the column is character (which it is and what I want) but it means 2.8E+12 is also a character.
options(digits = 22, scipen = 9999)
library(tidyverse)
dfcsv <- read_csv("file.csv")
# where dfcsv looks like:
dfcsv <- data.frame(id = c("ge189839898",
                           "we7267178",
                           "2.8E+12"))
dfcsv
# id
# 1 ge189839898
# 2 we7267178
# 3 2.8E+12
Is there a way to automatically read in the CSV so that variables with mixed types are correctly identified, i.e. it would return a character variable, but with the scientific notation expanded:
# id
# 1 ge189839898
# 2 we7267178
# 3 2800000000000
I don't think guess_max is what I am after here. I would also prefer not to use grep/sprintf type solutions (if possible), as I think that is trying to fix a problem I shouldn't have. I found these problematic ids by chance, so I would like an automated way of doing this at the reading-in stage.
The cleanest solution would probably be to go into the CSV file and make the conversion there, but I want to do it through R.
Thanks
id <- c("ge189839898", "we7267178", "2.8E+12")

func <- function(x) {
  # Values that parse as numbers keep their parsed value; everything else is NA.
  poss_num <- suppressWarnings(as.numeric(x))
  isna <- is.na(poss_num)
  # Re-format the parseable values with scientific notation turned off.
  x[!isna] <- format(poss_num[!isna], scientific = FALSE)
  x
}

func(id)
# [1] "ge189839898"   "we7267178"     "2800000000000"
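A minimal sketch of wiring this into the reading-in step (using the file name and id column from the question):
library(readr)
dfcsv <- read_csv("file.csv", col_types = cols(id = col_character()))
# Expand any scientific-notation values after reading:
dfcsv$id <- func(dfcsv$id)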

Manually assigning column names using read_excel in R

First off, I'm new to R, so this question won't be using if statements or anything specific. I loaded an xls document into R, and now have to open a specific sheet and call specific columns from one particular sheet. Let's call the sheet "fruit" and the columns: "apples", "oranges", "bananas". I also have to skip the first two rows, so I might have done that wrong as well. Please let me know if I did.
I'm trying this :
fruit_types <- read_excel('fruittypes.xls', sheet = "Fruit",
                          col_names = c("apples", "oranges", "bananas"), skip = 2)
It just isn't working. And no, I am not allowed to just select columns, I have to call them by their name.
To reproduce a sample example for your understanding, I have created a raw Excel sheet.
Code:
setwd ("folderpath")
library(readxl)
data <- read_excel("data.xlsx", sheet = "Fruits", skip = 2)
data <- subset (data , select = c("Apple","Banana","Oranges"))
print (data)
# A tibble: 3 x 3
Apple Banana Oranges
<dbl> <dbl> <dbl>
1 1 4 7
2 2 5 8
3 3 6 9
You could try using "range" instead. E.g., using the raw Excel sheet created by Arun kumar mahesh, the code would be as follows:
setwd("folderpath")
library(readxl)
data <- read_excel("data.xlsx", sheet = "Fruits", range = "A3:E6")

How to select a particular dataframe from a list of dataframes in Python equivalent to R?

I have a list of dataframes in R, with which I'm trying to select a particular dataframe as follows:
x = listOfdf$df1$df2$df3
Now I am trying hard to find an equivalent way to do so in Python, i.e. the syntax for how a particular DataFrame can be selected from a list of DataFrames in pandas.
Found a solution to select a particular dataframe/dataframe_column from a list of dataframes.
In R : x = listOfdf$df1$df2$df3
In Python : x = listOfdf['df1']['df2']['df3']
Thank you :)
I see you've already answered your own question, and that's cool. However, as jezrael hints in his comment, you should really consider using a dictionary. That might sound a bit scary coming from R (been there myself; now I prefer Python in most ways), but it will be worth your effort.
First of all, a dictionary is a way of mapping a value or variable to a key (like a name). You use curly brackets { } to build the dictionary, and use square brackets [ ] to index it.
Let's say that you have two dataframes like this:
import pandas as pd
import numpy as np

np.random.seed(123)

# Reproducible input - Dataframe 1
rows = 10
df_1 = pd.DataFrame(np.random.randint(90, 110, size=(rows, 2)), columns=list('AB'))
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df_1['dates'] = datelist
df_1 = df_1.set_index(['dates'])
df_1.index = pd.to_datetime(df_1.index)

# Reproducible input - Dataframe 2
rows = 10
df_2 = pd.DataFrame(np.random.randint(10, 20, size=(rows, 2)), columns=list('CD'))
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df_2['dates'] = datelist
df_2 = df_2.set_index(['dates'])
df_2.index = pd.to_datetime(df_2.index)
With a limited number of dataframes you can easily organize them in a dictionary this way:
myFrames = {'df_1': df_1,
            'df_2': df_2}
Now you have a reference to your dataframes, as well as your own defined names or keys. You'll find a more elaborate explanation here.
Here's how you use it:
print(myFrames['df_1'])
You can also use that reference to make changes to one of your dataframes, and add that to your dictionary:
df_3 = myFrames['df_1']
df_3 = df_3*10
myFrames.update({'df_3': df_3})
print(myFrames)
Now let's say that you have a whole bunch of dataframes that you'd like to organize the same way. You can make a list of the names of all available dataframes as described below. However, you should be aware that using eval() is often discouraged, for many reasons.
Anyway, here we go: First you get a list of strings of all dataframe names like this:
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
It's more than likely that you won't be interested in ALL of them if you've got a lot going on at the same time. So let's say that the names of all your dataframes of particular interest start with 'df_'. You can isolate them like this:
dfNames = []
for elem in alldfs:
    if str(elem)[:3] == 'df_':
        dfNames.append(elem)
Now you can use that list in combination with eval() to make a dictionary:
myFrames2 = {}
for dfName in dfNames:
    myFrames2[dfName] = eval(dfName)
Now you can loop through that dictionary and do something with each of them.
You could, as an example, take the last column of each dataframe, multiply by 10, and make a new dataframe with those values:
j = 1
for key in myFrames2.keys():
    # Build a new column name for your brand new df
    colName = []
    colName.append('column_' + str(j))
    if j == 1:
        # First, make a new df by referencing the dictionary
        df_new = myFrames2[key]
        # Subset the last column and make sure it doesn't
        # turn into a pandas Series instead of a DataFrame in the process
        df_new = df_new.iloc[:, -1].to_frame()
        # Set new column names
        df_new.columns = colName[:]
    else:
        # df_new already exists, so you can add
        # new columns and names for the rest of the columns
        df_new[colName] = myFrames2[key].iloc[:, -1].to_frame()
    j = j + 1
print(df_new)
Hope you'll find this useful!
And by the way... For your next question, please provide some reproducible code as well as a few words about what solutions you have tried yourself. You can read more about how to ask an excellent question here.
And here is the whole thing for an easy copy&paste:
#%%
# Imports
import pandas as pd
import numpy as np
np.random.seed(123)
# Reproducible input - Dataframe 1
rows = 10
df_1 = pd.DataFrame(np.random.randint(90,110,size=(rows, 2)), columns=list('AB'))
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df_1['dates'] = datelist
df_1 = df_1.set_index(['dates'])
df_1.index = pd.to_datetime(df_1.index)
##%%
# Reproducible input - Dataframe 2
rows = 10
df_2 = pd.DataFrame(np.random.randint(10,20,size=(rows, 2)), columns=list('CD'))
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df_2['dates'] = datelist
df_2 = df_2.set_index(['dates'])
df_2.index = pd.to_datetime(df_2.index)
print(df_1)
print(df_2)
##%%
# If you don't have that many dataframes, you can organize them in a dictionary like this:
myFrames = {'df_1': df_1,
            'df_2': df_2}
# Now you can reference df_1 in that collection by using:
print(myFrames['df_1'])
# You can also use that reference to make changes to one of your dataframes,
# and add that to your dictionary
df_3 = myFrames['df_1']
df_3 = df_3*10
myFrames.update({'df_3': df_3})
# And now you have a happy little family of dataframes:
print(myFrames)
##%%
# Now let's say that you have a whole bunch of dataframes that you'd like to organize the same way.
# You can make a list of the names of all available dataframes like this:
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
##%%
# It's likely that you won't be interested in all of them if you've got a lot going on.
# Let's say that all your dataframes of interest start with 'df_'.
# You get them like this:
dfNames = []
for elem in alldfs:
    if str(elem)[:3] == 'df_':
        dfNames.append(elem)
##%%
# Now you can use that list in combination with eval() to make a dictionary:
myFrames2 = {}
for dfName in dfNames:
    myFrames2[dfName] = eval(dfName)
##%%
# And now you can reference each dataframe by name in that new dictionary:
myFrames2['df_1']
##%%
# Loop through that dictionary and do something with each of them.
j = 1
for key in myFrames2.keys():
    # Build a new column name for your brand new df
    colName = []
    colName.append('column_' + str(j))
    if j == 1:
        # First, make a new df by referencing the dictionary
        df_new = myFrames2[key]
        # Subset the last column and make sure it doesn't
        # turn into a pandas Series instead of a DataFrame in the process
        df_new = df_new.iloc[:, -1].to_frame()
        # Set new column names
        df_new.columns = colName[:]
    else:
        # df_new already exists, so you can add
        # new columns and names for the rest of the columns
        df_new[colName] = myFrames2[key].iloc[:, -1].to_frame()
    j = j + 1
print(df_new)

How to import a large dataset in R, splitting and filtering by 3 different criteria

I'm dealing with a couple of txt files of climatological data with 3 parameters that differentiate each chunk of data (parameter measured, station of measurement, and year). Each file has more than a million lines. In the past I manually selected each parameter, one at a time, for a given station and year, and read it into R using read.fwf; but with files of this size that is absurd and inefficient. Is there any way to automate this process, taking into account that the file has an "FF" as an indicator every time a new parameter for a station and a given year starts? I also need to generate separate files or datasets, named according to the station, year, and parameter, so they can be used afterwards.
Format of the file to read (described from a screenshot of the file, not reproduced here):
Circled in red is the FF, I guess intended to mark the start of a new set of records.
Circled in black is the name of the parameter measured (there are 8 different parameter classes in total).
Circled in blue is the year of measurement.
Circled in green is the number or identifier of the station of measurement.
In the past, I read just what I needed with read.fwf, given the fixed widths in the data, but that separation does not apply to the header of each table.
PRUEBA3 <- read.fwf("SanIgnacio_Pmax24h.txt", header = FALSE,
                    widths = c(5,4,4,6,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,10,2),
                    skip = 1)
Thanks; any help will be appreciated.
You will need to make a function that loops through the txt files. (The output that you linked to was produced by a database; I assume you don't have access to it.)
Here is how the function could look, using the fast fread from data.table and a foreach loop (you can make the loop parallel by registering a parallel backend and changing %do% into %dopar%):
library(data.table)
library(foreach)

myfiles = dir(pattern = ".txt$")

res = foreach(i = seq_along(myfiles)) %do% {
  x = fread(myfiles[i], na.strings = c("", " "))
  # get row indices for start and end dates
  # the "V" variables are column indices; I assume these don't change per file
  start.dia = x[, grep("DIA", V2)] + 2
  end.dia = x[, grep("MEDIA", V2)] - 2
  # get name of station
  estacion.detect = x[, grep("ESTACION", V9)]
  estacion.name = x[estacion.detect, V10]
  mydf = x[start.dia : end.dia, estacion := estacion.name]
  # remove empty rows and columns
  junkcol = which(colSums(is.na(mydf)) == nrow(mydf))
  junkrow = which(rowSums(is.na(mydf)) == ncol(mydf))
  if (length(junkcol) > 0) {
    mydf = mydf[, !junkcol, with = FALSE]
  }
  if (length(junkrow) > 0) {
    mydf = mydf[!junkrow, ]
  }
  # further data cleaning ...
  mydf  # explicitly return the cleaned table for this file
}
# bind all files
all = rbindlist(res)
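If you then need one dataset or file per station, parameter, and year, a sketch of the final step could be a by-group fwrite. This assumes hypothetical parametro and anio columns have been extracted inside the loop, the same way estacion is above:
# Write one CSV per station/parameter/year combination.
all[, fwrite(.SD, paste(estacion, parametro, anio, "csv", sep = ".")),
    by = .(estacion, parametro, anio)]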
