I have been given an oddly structured dataset that I need to prepare for visualisation in GIS. The data come from historic newspapers from different locations in China, published between 1921 and 1937. The Excel table is structured as follows:
There is a sheet for each location; each sheet has a column for every year; and the variables for each newspaper are organised in blocks of rows (a title row plus seven variables), separated by a blank row. Here's a sample from one of the sheets:
,1921年,1922年,1923年
,,,
Title of Newspaper 1,遠東報(Yuan Dong Bao),,
(Language),漢文,,
(Ideology),東支鉄道機関紙,,
(Owner),(総経理)史秉臣,,
(Editior),張福臣,,
(Publication Frequency),日刊,,
(Circulation),"1,000",,
(Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,
,,,
Title of Newspaper 2,ノーウォスチ・ジーズニ(Nōuosuchi Jīzuni),ノォウォスチ・ジーズニ,ノウウスチジーズニ
(Language),露文,露文,露文
(Ideology),政治、社会、文学、商工新聞,社会民主,社会民主
(Owner),タワリシエスウオ・ペチャヤチャ,ぺチヤチ合名会社,ぺチヤチ合名会社
(Editior),イ・エフ・ブロクミユレル,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー
(Publication Frequency),日刊,日刊,日刊
(Circulation),"約3,000","約3,000","3,000"
(Others),1909年創刊、猶太人会より補助を受く、「エス・エル」党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓せるものにして記事多く比較的正確なり、日本お対露干渉排斥排日記事を掲載す,1909年創刊、エス・エル党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓し記事多く比較的正確にして最も有力なるものなり、日本お対露干渉排斥及一般排日記事を掲載し「チタ」政府を擁護す,1909年創刊、過激派に益々接近し長春会議以後は「ダリタ」通信と相待って過激派系の両翼たりし感あり、紙面整頓し記事比較的正確且つ金力に於て猶太人系の後援を有し最も有力なる新聞たり、一般排日記事を掲載し支那側に媚を呈す、「チタ」政権の擁護をなし当地に於ける機関紙たりと自任す
,,,
Title of Newspaper 3,北満洲(Kita Manshū),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun)
(Language),邦文,邦文,邦文
(Ideology),,,
(Owner),合資組織,社長 児玉右二,株式組織 (社長)児玉右二
(Editior),木下猛、遠藤規矩郎,編集長代理 阿武信一,(副社長)磯部検三 (主筆)阿武信一
(Publication Frequency),日刊,日刊,日刊
(Circulation),"1,000内外",,
(Others),大正3年7月創刊,大正11年1月創刊の予定、西比利亜新聞、北満洲及哈爾賓新聞の合同せるものなり,大正11年1月創刊
Yes, it's also in numerous non-latin languages, which makes it a little bit more challenging.
I want to create a new matrix for every year, then rotate the table to turn the block of rows for each newspaper into columns, so that I end up with each row corresponding to one newspaper. Finally, I need to generate a new column that gives me the location of the newspaper. I would also like to add a unique identifier for each newspaper and another column that states the year, in case I decide to merge the entire dataset into a single matrix. I did the transformation manually in Excel, but the entire dataset contains data from several thousand newspapers, so I need to automate the process. Here is what I want to achieve (sans unique identifier and year column):
Title of Newspaper,Language,Ideology,Owner,Editor,Publication Frequency,Circulation,Others,Location
直隷公報(Zhi Li Gong Bao),漢文,直隷省公署の公布機関,直隷省,,日刊,2500,光緒22年創刊、官報の改称,Tientsin
大公報(Da Gong Bao),漢文,稍親日的,合資組織,樊敏鋆,日刊,,光緒28年創刊、倪嗣仲の機関にて現に王祝山其の全権を握り居れり、9年夏該派の没落と共に打撃を受け少しく幹部を変更して再発行せり、但し資金は依然王より供給し居れり,Tientsin
天津日々新聞(Tianjin Ri Ri Xin Wen),漢文,日支親善,方若,郭心培,日刊,2000,光緒27年創刊、親日主義を以て一貫す、國聞報の後身なり民国9年安直戦争中直隷派の圧迫を受けたるも遂に屈せさりし,Tientsin
時聞報(Shi Wen Bao),漢文,中立,李大義,王石甫,,1000,光緒30年創刊、紙面相当価値あり,Tientsin
Is there a way of doing this in R? How will I go about it?
I've outlined a plan in a comment above. This is untested code that makes it more concrete; I'll keep testing until it works.
inps <- readLines("~/Documents/R_code/Tientsin unformatted.txt")
inp2 <- inps[ -(1:2) ]  # drop the year header and the first blank row
# identify the groupings with cumsum(grepl(",,,", inp2)) as the second arg to split
inp.df <- lapply( split(inp2, cumsum(grepl(",,,", inp2))),
                  function(chunk) read.csv(text = chunk, header = FALSE) )
library(data.table)  # only needed if you use rbindlist; not needed for do.call(rbind, ...)
# make a list of one-line dataframes as below
# finally run rbindlist or do.call(rbind, ...)
in.dt <- do.call( rbind, inp.df )  # rbind checks the ordering of columns
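As a quick sanity check of the grouping idea (toy lines standing in for `inp2`, not the real data): separator rows containing ",,," bump the cumsum, so the lines between two separators share a group id.

```r
# Toy vector: data rows end in ",,", the separator row is ",,,"
toy <- c("a,x,,", "b,y,,", ",,,", "c,z,,")
grp <- cumsum(grepl(",,,", toy))   # 0, 0, 1, 1
groups <- split(toy, grp)
# groups[["0"]] holds the first block; note the separator row itself lands
# at the start of the next group and still needs dropping before read.csv
```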
This is the step that makes a one line dataframe from a set of text lines:
txt <- 'Title of Newspaper 1,遠東報(Yuan Dong Bao),,
(Language),漢文,,
(Ideology),東支鉄道機関紙,,
(Owner),(総経理)史秉臣,,
(Editior),張福臣,,
(Publication Frequency),日刊,,
(Circulation),"1,000",,
(Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,'
temp <- read.table(text = txt, sep = ",", colClasses = c(rep("character", 2), NA, NA))
in1 <- setNames( data.frame(as.list(temp$V2)), temp$V1)
in1
#-------------------------
Title of Newspaper 1 (Language) (Ideology) (Owner) (Editior) (Publication Frequency) (Circulation)
1 遠東報(Yuan Dong Bao) 漢文 東支鉄道機関紙 (総経理)史秉臣 張福臣 日刊 1,000
(Others)
1 1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す
So it looks like the column names of the individually constructed items would need further processing to make them successfully "bindable" by plyr::rbind.fill or data.table::rbindlist.
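A hedged sketch of that name clean-up (norm_names is my own hypothetical helper, not part of the plan above): strip the parentheses and the running number from the title label, so every block yields identical column names and do.call(rbind, ...) succeeds.

```r
# Hypothetical helper: harmonise the labels that become column names
norm_names <- function(x) {
  x <- sub("^Title of Newspaper \\d+$", "Title of Newspaper", x)  # drop running number
  gsub("[()]", "", x)                                             # strip parentheses
}

# Two stand-in one-row blocks carrying the raw labels as names
b1 <- data.frame("Title of Newspaper 1" = "A", "(Language)" = "x",
                 check.names = FALSE)
b2 <- data.frame("Title of Newspaper 2" = "B", "(Language)" = "y",
                 check.names = FALSE)
names(b1) <- norm_names(names(b1))
names(b2) <- norm_names(names(b2))
out <- do.call(rbind, list(b1, b2))  # binds cleanly once the names match
```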
I want to edit an existing Excel file using R. For example, ExcelFile_1 has the data, and I need to place the data from ExcelFile_1 into another file called ExcelFile_2, based on the column and row names.
ExcelFile_1:
Store Shipped Qty
1111 100
2222 200
ExcelFile_2:
Store Shipped Qty
1111
2222
If I'm working with a data frame, I generally do
ExcelFile_2$`Shipped Qty` <-
ExcelFile_1$`Shipped Qty`[match(ExcelFile_2$`Store`, ExcelFile_1$`Store`)]
The above line works for my data frame, but I do not know how to apply this while writing into a worksheet using the XLConnect package. All I can see are the options below.
writeWorksheet(object,data,sheet,startRow,startCol,header,rownames)
I do not want to edit the data as a data frame and save it as another worksheet in an existing/new Excel file, because I want to preserve the formatting of ExcelFile_2.
For example: I want to change the value of ExcelFile_2 cell "B2" using the values from another sheet.
Could anyone please help me with the above problem?
Assuming your files are stored in your home directory and named one.xlsx and two.xlsx, you can do the following:
library(XLConnect)
# Load content of the first sheet of one.xlsx
df1 <- readWorksheetFromFile("~/one.xlsx", 1)
# Do what you like to df1 ...
# Write df1 to the first sheet of two.xlsx
wb2 <- loadWorkbook("~/two.xlsx")
writeWorksheet(wb2, df1, sheet = 1)
saveWorkbook(wb2)
If needed, you can also use startRow and startCol in both readWorksheetFromFile() and writeWorksheet() to specify exact rows and columns and header to specify if you want to read/write the headers.
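The match() step itself can be checked with plain data frames before any Excel file is involved (column names taken from the question; a minimal sketch):

```r
# Stand-ins for the two worksheets
df1 <- data.frame(Store = c(1111, 2222), "Shipped Qty" = c(100, 200),
                  check.names = FALSE)
df2 <- data.frame(Store = c(2222, 1111), "Shipped Qty" = NA_real_,
                  check.names = FALSE)

# Look each Store of df2 up in df1 and copy the matching quantity across
df2$`Shipped Qty` <- df1$`Shipped Qty`[match(df2$Store, df1$Store)]
df2  # 2222 -> 200, 1111 -> 100
```

writeWorksheet(wb2, df2, sheet = 1) would then write the filled frame back into the formatted workbook, as above.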
I am new to R and I have run into a problem. I have a folder with 50 csv files, each representing a city. I want to import each csv file into RStudio as an independent data frame, to eventually plot all 50 cities in one time-series plot.
There are four things I want to do to each csv file, but in the end, have it automated that these four actions are done to each of the 50 csv files.
Skip the first 25 rows of the csv file
Combine the Date and Time column for each csv file
Remove the rows where the cells in column 3 are empty
Change the name of column 3 from "ug/m3" to "CO"
After skipping, the first row will be the header
I used the code below on one csv file to see if it would work. Everything worked except for city[,3][!(is.na(city[,3]))].
city1 <- read.csv("path",
skip = 25)
city1$rtime <- strptime(paste(city1$Date, city1$Time), "%m/%d/%Y %H:%M")
colnames(city1)[3] <- "CO"
city[,3][!(is.na(city[,3]))] ## side note: help with this would be appreciated; I wasn't sure if something goes before the comma especially.
I am not sure how to combine everything in an efficient manner in a function.
I would appreciate suggestions on an efficient manner to perform the 4 actions ( in a function statement maybe) to each csv file while importing them to R.
Use this function for each csv you want to read
read_combine <- function(yourfile){
  file <- read.csv(yourfile, skip = 25)
  file$rtime <- strptime(paste(file$Date, file$Time), "%m/%d/%Y %H:%M")
  colnames(file)[3] <- "CO"
  file <- file[!is.na(file$CO), ]  # keep the whole rows, not just the CO column
  file
}
where yourfile is the "path" to the csv file.
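On the side note in the question: the row index goes before the comma, so the whole rows with an empty column 3 are dropped, rather than just the column being extracted as a vector. A self-contained illustration with made-up values:

```r
# Toy frame standing in for one city's data after the rename to CO
city <- data.frame(Date = c("1/1/1990", "1/2/1990", "1/3/1990"),
                   Time = c("00:00", "01:00", "02:00"),
                   CO   = c(0.4, NA, 0.7))

# Rows go before the comma: keep rows where CO is not NA, all columns
city <- city[!is.na(city$CO), ]
nrow(city)  # 2
```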
So I'm back with an even more adventurous approach for my thousands-of-.CSV-files manipulation with R. I can import, merge every ten files, rename column headers, save a new .CSV etc., but the result is still too cumbersome to manipulate analytically.
What I need is: every 10 files put into a matrix OR merged into a single file (see below for a file example). The columns are Frequency, Channel A (and later Channel B); simply F, A, and B, and the F values are the same for every file (hence I was thinking matrix). In the end I'll end up with the headers
| *F* | *A1* | *B1* | *A2* | *B2* | *A3* | *B3* |
etc... to 10.
Inside the matrix/bind_col loop, is it possible before write.csv to do some maths on the values A1-10? A few new columns with summary statistics, e.g. the mean, for each Frequency. I need others too, but I'll sort that myself.
+------------+-------------+
| Frequency | Channel A |
| (MHz) | (dBV) |
+------------+-------------+
0.00000000,-27.85117000
0.00007629,-28.93283000
0.00015259,-32.89576000
0.00022888,-43.54568000
---
Continued...
---
19.99977312,-60.59710000
19.99984941,-48.58142000
19.99992571,-43.29094000
Thanks for your time, I know I've spent too much of it debugging and now I'm looking for a more elegant method.
PS: How's my formatting? Table and .CSV style blunder!
Tough to answer without a better example of what each file looks like, and what you want your output to be, as well as some example code.
Are the files small enough that you can load all 1000 at once?
Something like the following is where I would start if it were me.
library(data.table)
filenames <- list.files(pattern = "\\.csv$")
list_data <- vector(mode = "list", length = length(filenames))
i <- 1
for (file in filenames){
list_data[[i]] <- fread(file)
i <- i + 1
}
dat <- rbindlist(list_data, use.names = TRUE, fill = TRUE)
After that, you can use all of the useful data.table features.
dat[, .(meanA = mean(A), stdevA = sd(A)), by = Frequency]
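If the wide F | A1 | B1 | ... layout from the question is still wanted instead of one long table, base merge() on F can build it; a hedged sketch with made-up numbers for two files:

```r
# Stand-ins for two parsed files, each with columns F, A, B
f1 <- data.frame(F = c(0, 1), A = c(-27.8, -28.9), B = c(-30.1, -31.2))
f2 <- data.frame(F = c(0, 1), A = c(-26.5, -29.0), B = c(-29.9, -30.8))
files <- list(f1, f2)

wide <- files[[1]]
names(wide)[2:3] <- c("A1", "B1")
for (i in seq_along(files)[-1]) {
  fi <- files[[i]]
  names(fi)[2:3] <- paste0(c("A", "B"), i)   # A2/B2, A3/B3, ...
  wide <- merge(wide, fi, by = "F")          # F values are shared, so rows line up
}

# Example of the per-Frequency maths mentioned in the question
wide$A_mean <- rowMeans(wide[grep("^A", names(wide))])
```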
I have used R for various things over the past year but due to the number of packages and functions available, I am still sadly a beginner. I believe R would allow me to do what I want to do with minimal code, but I am struggling.
What I want to do:
I have roughly a hundred different excel files containing data on students. Each excel file represents a different school but contains the same variables. I need to:
Import the data into R from Excel
Add a variable to each file containing the filename
Merge all of the data (add observations/rows - do not need to match on variables)
I will need to do this for multiple sets of data, so I am trying to make this as simple and easy to replicate as possible.
What the Data Look Like:
Row 1 Title
Row 2 StudentID Var1 Var2 Var3 Var4 Var5
Row 3 11234 1 9/8/2011 343 159-167 32
Row 4 11235 2 9/16/2011 112 152-160 12
Row 5 11236 1 9/8/2011 325 164-171 44
Row 1 is meaningless and Row 2 contains the variable names. The files have different numbers of rows.
What I have so far:
At first I simply tried to import data from excel. Using the XLSX package, this works nicely:
dat <- read.xlsx2("FILENAME.xlsx", sheetIndex=1,
sheetName=NULL, startRow=2,
endRow=NULL, as.data.frame=TRUE,
header=TRUE)
Next, I focused on figuring out how to merge the files (also thought this is where I should add the filename variable to the datafiles). This is where I got stuck.
setwd("FILE_PATH_TO_EXCEL_DIRECTORY")
filenames <- list.files(pattern=".xls")
do.call("rbind", lapply(filenames, read.xlsx2, sheetIndex=1, colIndex=6, header=TRUE, startrow=2, FILENAMEVAR=filenames));
I set my directory, make a list of all the excel file names in the folder, and then try to merge them in one statement using the a variable for the filenames.
When I do this I get the following error:
Error in data.frame(res, ...) :
arguments imply differing number of rows: 616, 1, 5
I know there is a problem with my application of lapply - the startrow is not being recognized as an option and the FILENAMEVAR is trying to merge the list of 5 sample filenames as opposed to adding a column containing the filename.
What next?
If anyone can refer me to a useful resource or function, critique what I have so far, or point me in a new direction, it would be GREATLY appreciated!
I'll post my comment as an answer (with bdemerast having picked up on the typo). The solution is untested, as xlsx will not run happily on my machine.
You need to pass a single FILENAMEVAR to read.xlsx2.
lapply(filenames, function(x) read.xlsx2(file=x, sheetIndex=1, colIndex=6, header=TRUE, startRow=2, FILENAMEVAR=x))
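The FILENAMEVAR trick relies on read.xlsx2 forwarding extra arguments through to data.frame. If that feels fragile, adding the column explicitly works too; a sketch with a hypothetical stand-in reader (read_school replaces read.xlsx2 here only so the logic can be run without Excel files):

```r
# Hypothetical reader: pretend this body is read.xlsx2(x, sheetIndex = 1, ...)
read_school <- function(x) {
  dat <- data.frame(StudentID = c(11234, 11235), Var1 = c(1, 2))
  dat$source_file <- x   # the filename column the question asks for
  dat
}

all_schools <- do.call(rbind,
                       lapply(c("school1.xlsx", "school2.xlsx"), read_school))
```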