Importing google sheets file that contains multiple headers (Using R) - r

I am trying to import a Google Sheets cross-tabular file into R. It contains three header rows that I would like to combine (first row = Year, second row = Quarter, third row = Week).
Most R packages let you select only one header row and 'skip' rows until the observations start, but I can't find one that lets you select multiple header rows at once.
Can anyone help?

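You might read the first 3 rows first (n_max = 3, with col_names = FALSE so that the first row is not consumed as column names), create tidy headers from them, then read the rest (with skip = 2, so that the third header row serves as temporary column names), and finally add your headers to the data. Something like:
library(readr)
# read only the three header rows, as data rather than as column names
headers = read_csv(file, col_names = FALSE, n_max = 3)
names(headers) = paste0(headers[1,], headers[2,], headers[3,])
data = read_csv(file, skip = 2)
names(data) = names(headers)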
(the code probably needs some debugging, though.)
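The same idea can be sketched in base R on a small in-memory example. The sample data, the variable names and the "_" separator are made up for illustration; a real Google Sheets export would first have to be downloaded or saved as CSV.

```r
# Hypothetical sample: three header rows (year, quarter, week), then data.
csv_text <- "2017,2017,2018
Q1,Q1,Q1
W1,W2,W1
10,20,30
40,50,60"

# Read the three header rows as plain text so nothing is converted.
header_rows <- read.csv(text = csv_text, header = FALSE, nrows = 3,
                        colClasses = "character")

# Collapse each column of header_rows into one name, e.g. "2017_Q1_W1".
combined_names <- unname(vapply(header_rows, paste, character(1),
                                collapse = "_"))

# Read the observations, skipping the three header rows.
data <- read.csv(text = csv_text, header = FALSE, skip = 3,
                 col.names = combined_names, check.names = FALSE)
print(names(data))
```

check.names = FALSE keeps names that start with a digit as they are; without it R would rewrite them into syntactically valid names.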

Related

R - Export large dataframe into CSV

Beginner here: I have a list (see screenshot) called Coins_list from which I want to export the second dataframe stored in it called data into a csv. When I use the code
write.csv(Coins_list$data, file = "Coins_list_full_data.csv")
I get a huge CSV with a bunch of numbers from the column named price, which apparently contains more data frames, if I read the output correctly. How can I export this data frame to CSV correctly? See the screenshot for more details.
EDIT: I was able to get the first four rows into a CSV by running df2 <- Coins_list$data and then write.csv(df2[1:4,], file="BTC_row.csv"). However, it now looks like R puts the prices of all four rows into a list c( ) and repeats it in each row. Any idea how to change that?
(I would post this as a comment, but I don't have enough reputation.)
Hey, for starters you could try to flatten the JSON by going further than response list$content and looking at what is inside the content with another $.
Otherwise, you could try getting data$price and see what pops up from there.
Something like this:
symbols = data$symbol
df = data.frame()
for (i in seq_along(symbols)) {
  # build the rows for one symbol, then append them to the result
  x = data.frame(price = data$price[[i]], symbol = symbols[i])
  df = rbind(df, x)
}
to get a data frame with price and symbol. I don't know how the data is nested, so I'm just guessing.
It would be helpful to know from where you got the data for reproducibility.

Importing data in R from Excel with information contained in header

As title says, I am trying to import data from Excel to R, where part of the information is contained in the header.
In a very simplified form, the Excel file I have looks like this:
GROUP;1234
MONTH;"Jan"
PERSON;SEX;AGE;INCOME
John;m;26;20000
Michael;m;24;40000
Phillip;m;25;15000
Laura;f;27;72000
Total;;;147000
After reading in to R, it should be a "clean" dataset that looks like this.
GROUP;MONTH;PERSON;SEX;AGE;INCOME
1234;Jan;John;m;26;20000
1234;Jan;Michael;m;24;40000
1234;Jan;Phillip;m;25;15000
1234;Jan;Laura;f;27;72000
I have several files that look like this. The number of persons however varies in each file. The last line contains a summary that should be skipped. There might be empty lines between the list and summary line.
Any help is highly appreciated. Thank you very much.
Excel files can be read using readxl::read_excel()
One of the parameters is skip, using which you can skip certain number of rows defined by you.
For your data, you need to skip the first two lines that contain GROUP and MONTH.
You will get the data in following format.
PERSON;SEX;AGE;INCOME;
John;m;26;20000
Michael;m;24;40000
Phillip;m;25;15000
Laura;f;27;72000
After this, you can manually add the columns GROUP and MONTH
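Thank you very much for your help. The hint from @Aurèle was the missing piece of the puzzle. The solution I have now come up with is as follows:
library(readxl)
# first row: GROUP label and value
group <- read_excel("TEST1.xlsx", col_names = c("C1","GROUP") ,n_max = 1)
group <- group[,2]
# second row: MONTH label and value
month <- read_excel("TEST1.xlsx", col_names = c("C1","MONTH") ,skip = 1, n_max = 1)
month <- month[,2]
# skip the rows above the first person; adjust to your layout (3 for the simplified example above)
data <- read_excel("TEST1.xlsx", col_names = c("NAME","SEX","AGE","INCOME") , skip = 4)
# drop empty rows and the Total row; comparing with != NA always yields NA, so use is.na()
data <- data[!is.na(data$AGE),]
data <- cbind(data,group,month)
data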
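For a semicolon-separated text version of the same layout, the steps can be sketched in base R as below. This is only an illustration under assumptions: the in-memory sample (shortened to two persons, with a blank line before the total) and all variable names are made up, and a real .xlsx file would still need readxl as above.

```r
# Hypothetical sample mirroring the layout in the question.
raw <- c("GROUP;1234",
         "MONTH;\"Jan\"",
         "PERSON;SEX;AGE;INCOME",
         "John;m;26;20000",
         "Michael;m;24;40000",
         "",
         "Total;;;60000")

# The first two lines carry the metadata as label;value pairs.
meta <- read.csv2(text = raw[1:2], header = FALSE, colClasses = "character")

# The remaining lines are the table; blank lines are skipped explicitly.
people <- read.csv2(text = raw[-(1:2)], header = TRUE,
                    blank.lines.skip = TRUE)

# The summary row has no AGE, so it can be dropped with is.na().
people <- people[!is.na(people$AGE), ]

# Prepend the metadata as constant columns.
clean <- cbind(GROUP = meta$V2[1], MONTH = meta$V2[2], people)
print(clean)
```

Filtering on a column that is always filled for real persons (AGE here) removes the summary line without having to know how many persons a file contains.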

CSV with multiple datasets/different-number-of-columns

Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV-files. Mine are from SAP BusinessObjects and hold challenges different to those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV-files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
get line numbers of header data by counting number of columns
get headers by reading every line index held in vector described above
read data by calculating skip and nrows between data sets in index described by header lines as above.
give the read data column names from read header
I need help getting me on the right track to avoid loops/making the code more readable/compact when reading headers and datasets.
These CSVs are formatted as normal CSVs, except that they contain a more or less arbitrary number of subtables. The structure is different for each dataset I export. In the current example I will suppose there are five tables included in the CSV.
To give you an idea, here is some fictitious sample data with line numbers. Separators and quotes have been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
33212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways going about reading each data set. What I have come up with so far:
filepath <- file.path(Sys.getenv("USERPROFILE"), "SAMPLE.CSV")
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
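# the header is the single line at iHeaders[4], so skip everything before it and read one row
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] - 1, nrows = 1)
# the data runs from the line after the header up to the line before the next header
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5]-iHeaders[4]-1)
names(data) <- unlist(header[1, ])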
As mentioned in the intro of this post, I have made a couple of functions that make it easier to get the headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now - one is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe headernames (no æøå % etc).
The other returns a list of string vectors holding all headers:
GetHeaders <- function(filepath, linenums) {
# init an empty list of length(linenums)
l.headers <- vector(mode = "list", length = length(linenums))
for(i in seq_along(linenums)) {
# read.csv2(filepath, skip = linenums[i]-1, nrows = 1)
l.headers[[i]] <- GetHeader(filepath, linenums[i])
}
l.headers
}
What I struggle with is how to read in all possible datasets in one go. Specifically the last set is a bit hard to wrap my head around if I should write a common function, where I only know the line number of header, and not the number of lines in the following data.
Also, what is the best data structure for such a structure as described? The data in the subtables are all relevant to each other (can be used to normalize parts of the data). I understand that I must do manual work for each read CSV, but as I have to read TONS of these files, some common functions to structure them in a predictable manner at each pass would be excellent.
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.
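One possible direction, sketched under a strong assumption: a new subtable starts wherever the number of fields changes from one line to the next, so two adjacent tables with the same column count would not be separated. The function name read_stacked_csv and the sample file are hypothetical. Because the end of each block is derived from the start of the next one (or from the end of the file), the last table needs no special handling, and the result is a plain list of data frames.

```r
# Split a CSV that stacks several tables into a list of data frames.
# Assumption: a new table starts wherever the field count changes.
read_stacked_csv <- function(path, sep = ",") {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]      # drop blank lines

  # Count fields per line on the cleaned lines so indices stay aligned.
  con <- textConnection(lines)
  on.exit(close(con))
  n_fields <- utils::count.fields(con, sep = sep, quote = "\"")

  # First line of each block, and the line just before the next block.
  starts <- which(c(TRUE, diff(n_fields) != 0))
  ends <- c(starts[-1] - 1, length(lines))

  # Parse each block separately; its first line is taken as the header.
  lapply(seq_along(starts), function(i) {
    utils::read.csv(text = lines[starts[i]:ends[i]], header = TRUE,
                    sep = sep, strip.white = TRUE)
  })
}

# Usage with a small made-up file containing three stacked tables.
tmp <- tempfile(fileext = ".csv")
writeLines(c("n,Name,Species", "90,Mickey,Mouse", "45,Minnie,Mouse",
             "Code,Species", "RT,rat",
             "Last update", "2017/05/09 02:09:14"), tmp)
tables <- read_stacked_csv(tmp)
print(length(tables))
```

A list keeps related subtables together per file; naming the elements (for example after the first header field) would be a natural next step when many files are processed the same way.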

Reading a folder of csv files with specific name endings followed by filtering with dplyr and combining output into single dataframe

I have a folder containing 13000 csv files.
Problem 1: I need to read all the files ending with -postings.csv. All the -postings.csv files have same number of variables and same format.
So far I have the following
name_post = list.files(pattern="-postings.csv")
for (i in 1:length(name_post)) assign(name_post[i], read.csv(name_post[i], header=TRUE))
This creates around 600 dataframes.
Problem 2: I need to filter the 600 data frames using the following rules:
1) column_name1 != "" (remove all empty rows)
2) column_name2 ==124 (only keep rows with values equal to 124)
So far I have only done this on a single file, but need a way to get this done on all 600 dataframes. (I use filter which is part of the dplyr package. I am open for other solutions)
filter(`random_name-postings.csv`, column_name1 != "", column_name2 == 124)
Problem 3: I need to combine the filtering output from problem 2 into a single dataframe.
I have not done this since I have issues solving problem 2.
Any help is much appreciated :)
Rather than working with the data frames as 600 separate variables, which isn't a good idea, you can combine them into one data frame as soon as you read them in. map_df from the purrr package is a good way to do this.
library(purrr)
name_post = list.files(pattern = "-postings\\.csv$")
combined = map_df(name_post, read.csv, header = TRUE)
After that, you can perform your filtering on the combined dataset.
library(dplyr)
combined_filtered = combined %>%
filter(column_name1 != "", column_name2 == 124)
Note that if you want to know which file each row originally came from, you could turn name_post into a named vector and use .id = "filename", which would add a filename column to your output.
names(name_post) = name_post
combined = map_df(name_post, read.csv, header = TRUE, .id = "filename")

How to read csv files in matlab as you would in R?

I have a data set that is saved as a .csv file that looks like the following:
Name,Age,Password
John,9,\i1iiu1h8
Kelly,20,\771jk8
Bob,33,\kljhjj
In R I could open this file by the following:
X = read.csv("file.csv",header=TRUE)
Is there a default command in Matlab that reads .csv files with both numeric and string variables? csvread seems to only like numeric variables.
One step further, in R I could use the attach function to create variables associated with the columns and column headers of the data set, i.e.,
attach(X)
Is there something similar in Matlab?
Although this question is close to being an exact duplicate, the solution suggested in the link provided by @NathanG (i.e., using xlsread) is only one possible way to solve your problem. The author in the link also suggests using textscan, but doesn't provide any information about how to do it, so I thought I'd add an example here:
%# First we need to get the header-line
fid1 = fopen('file.csv', 'r');
Header = fgetl(fid1);
fclose(fid1);
%# Convert Header to cell array
Header = regexp(Header, '([^,]*)', 'tokens');
Header = cat(2, Header{:});
%# Read in the data
fid1 = fopen('file.csv', 'r');
D = textscan(fid1, '%s%d%s', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid1);
Header should now be a row vector of cells, where each cell stores a header. D is a row vector of cells, where each cell stores a column of data.
There is no way I'm aware of to "attach" D to Header. If you wanted, you could put them both in the same structure though, ie:
S.Header = Header;
S.Data = D;
Matlab's new table class makes this easy:
x = readtable('file.csv');
By default this will parse the headers, and use them as column names (also called variable names):
>> x
x =
Name Age Password
_______ ___ ___________
'John' 9 '\i1iiu1h8'
'Kelly' 20 '\771jk8'
'Bob' 33 '\kljhjj'
You can select a column using its name etc.:
>> x.Name
ans =
'John'
'Kelly'
'Bob'
Available since MATLAB R2013b.
See www.mathworks.com/help/matlab/ref/readtable.html
I liked this approach, supported by Matlab 2012.
path='C:\folder1\folder2\';
data = 'data.csv';
data = dataset('xlsfile', fullfile(path, data));
Of course you could also do the following:
[data,path] = uigetfile('C:\folder1\folder2\*.csv');
data = dataset('xlsfile', fullfile(path, data));

Resources