I have been given an oddly structured dataset that I need to prepare for visualisation in GIS. The data is from historic newspapers from different locations in China, published between 1921 and 1937. The Excel table is structured as follows:
1. There is a sheet for each location.
2. Each sheet has a column for every year, and the variables for each newspaper are organised in blocks of 8 rows (a title row plus 7 attribute rows), separated by a blank row. Here's a sample from one of the sheets:
,1921年,1922年,1923年
,,,
Title of Newspaper 1,遠東報(Yuan Dong Bao),,
(Language),漢文,,
(Ideology),東支鉄道機関紙,,
(Owner),(総経理)史秉臣,,
(Editior),張福臣,,
(Publication Frequency),日刊,,
(Circulation),"1,000",,
(Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,
,,,
Title of Newspaper 2,ノーウォスチ・ジーズニ(Nōuosuchi Jīzuni),ノォウォスチ・ジーズニ,ノウウスチジーズニ
(Language),露文,露文,露文
(Ideology),政治、社会、文学、商工新聞,社会民主,社会民主
(Owner),タワリシエスウオ・ペチャヤチャ,ぺチヤチ合名会社,ぺチヤチ合名会社
(Editior),イ・エフ・ブロクミユレル,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー,イ・エフ・ブロクミユレル(本名クリオリン)、記者(社員)チエルニヤエフスキー
(Publication Frequency),日刊,日刊,日刊
(Circulation),"約3,000","約3,000","3,000"
(Others),1909年創刊、猶太人会より補助を受く、「エス・エル」党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓せるものにして記事多く比較的正確なり、日本お対露干渉排斥排日記事を掲載す,1909年創刊、エス・エル党の過激派に接近せる主張をなす、哈爾賓諸新聞中最も紙面整頓し記事多く比較的正確にして最も有力なるものなり、日本お対露干渉排斥及一般排日記事を掲載し「チタ」政府を擁護す,1909年創刊、過激派に益々接近し長春会議以後は「ダリタ」通信と相待って過激派系の両翼たりし感あり、紙面整頓し記事比較的正確且つ金力に於て猶太人系の後援を有し最も有力なる新聞たり、一般排日記事を掲載し支那側に媚を呈す、「チタ」政権の擁護をなし当地に於ける機関紙たりと自任す
,,,
Title of Newspaper 3,北満洲(Kita Manshū),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun),哈爾賓日々新聞(Harbin Nichi-Nichi Shimbun)
(Language),邦文,邦文,邦文
(Ideology),,,
(Owner),合資組織,社長 児玉右二,株式組織 (社長)児玉右二
(Editior),木下猛、遠藤規矩郎,編集長代理 阿武信一,(副社長)磯部検三 (主筆)阿武信一
(Publication Frequency),日刊,日刊,日刊
(Circulation),"1,000内外",,
(Others),大正3年7月創刊,大正11年1月創刊の予定、西比利亜新聞、北満洲及哈爾賓新聞の合同せるものなり,大正11年1月創刊
Yes, it's also in numerous non-latin languages, which makes it a little bit more challenging.
I want to create a new matrix for every year, then rotate the table to turn the rows for each newspaper into columns, so that I end up with each row corresponding to one newspaper. Finally, I need to generate a new column that gives me the location of the newspaper. I would also like to add a unique identifier for each newspaper and another column that states the year, just in case I decide to merge the entire dataset into a single matrix. I did the transformation manually in Excel, but the entire dataset contains data from several thousand newspapers, so I need to automate the process. Here is what I want to achieve (sans unique identifier and year column):
Title of Newspaper,Language,Ideology,Owner,Editor,Publication Frequency,Circulation,Others,Location
直隷公報(Zhi Li Gong Bao),漢文,直隷省公署の公布機関,直隷省,,日刊,2500,光緒22年創刊、官報の改称,Tientsin
大公報(Da Gong Bao),漢文,稍親日的,合資組織,樊敏鋆,日刊,,光緒28年創刊、倪嗣仲の機関にて現に王祝山其の全権を握り居れり、9年夏該派の没落と共に打撃を受け少しく幹部を変更して再発行せり、但し資金は依然王より供給し居れり,Tientsin
天津日々新聞(Tianjin Ri Ri Xin Wen),漢文,日支親善,方若,郭心培,日刊,2000,光緒27年創刊、親日主義を以て一貫す、國聞報の後身なり民国9年安直戦争中直隷派の圧迫を受けたるも遂に屈せさりし,Tientsin
時聞報(Shi Wen Bao),漢文,中立,李大義,王石甫,,1000,光緒30年創刊、紙面相当価値あり,Tientsin
Is there a way of doing this in R? How will I go about it?
I've outlined a plan in a comment above. This is untested code that makes it more concrete. I'll keep testing till it works
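inps <- readLines("~/Documents/R_code/Tientsin unformatted.txt")
inp2 <- inps[-(1:2)]  # drop the year header row and the first blank row
# Identify groupings with cumsum() over the blank separator rows and use that as the second arg to split().
# The pattern is anchored: a row such as "(Ideology),,," also contains ",,," but is not a separator.
is.blank <- grepl("^,+$", inp2)
blocks <- split(inp2[!is.blank], cumsum(is.blank)[!is.blank])
inp.df <- lapply(blocks, function(x) read.csv(text = x, header = FALSE, colClasses = "character"))
library(data.table) # only needed if you use rbindlist; not needed for do.call(rbind, ...)
# make a list of one-line data frames as below
# finally run rbindlist or do.call(rbind, ...)
in.dt <- do.call(rbind, inp.df) # rbind checks for ordering of columns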
This is the step that makes a one line dataframe from a set of text lines:
txt <- 'Title of Newspaper 1,遠東報(Yuan Dong Bao),,
(Language),漢文,,
(Ideology),東支鉄道機関紙,,
(Owner),(総経理)史秉臣,,
(Editior),張福臣,,
(Publication Frequency),日刊,,
(Circulation),"1,000",,
(Others),1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す,,'
temp=read.table(text=txt, sep="," , colClasses=c(rep("character", 2), NA, NA))
in1 <- setNames( data.frame(as.list(temp$V2)), temp$V1)
in1
#-------------------------
Title of Newspaper 1 (Language) (Ideology) (Owner) (Editior) (Publication Frequency) (Circulation)
1 遠東報(Yuan Dong Bao) 漢文 東支鉄道機関紙 (総経理)史秉臣 張福臣 日刊 1,000
(Others)
1 1908年創刊、東支鉄道に支那か勢力を扶殖せし以来著しく排日記事を記載す
So it looks like the column names of the individually constructed items would need further processing (for example, dropping the newspaper number and the parentheses) to make them "bindable" by plyr::rbind.fill or data.table::rbindlist.
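The name clean-up and the per-year reshaping could be sketched roughly as follows. This is only an illustration under stated assumptions: `sheet` is a shortened ASCII stand-in for one sheet (three fields instead of eight), `clean_names()` and `block_to_rows()` are hypothetical helper names, and `location` is assumed to come from the sheet name. It uses base R only.

```r
# Shortened ASCII stand-in for one sheet: a year header row, then blocks
# separated by blank rows. Real sheets have 8 rows per block.
sheet <- c(
  ",1921,1922,1923",
  ",,,",
  "Title of Newspaper 1,Yuan Dong Bao,,",
  "(Language),Chinese,,",
  "(Circulation),\"1,000\",,",
  ",,,",
  "Title of Newspaper 2,Kita Manshu,Harbin Nichi-Nichi Shimbun,Harbin Nichi-Nichi Shimbun",
  "(Language),Japanese,Japanese,Japanese",
  "(Circulation),\"1,000\",,"
)
location <- "Harbin"  # assumption: in practice taken from the sheet name

# Hypothetical helper: make the field labels identical across blocks.
clean_names <- function(x) {
  x <- sub(" [0-9]+$", "", x)  # "Title of Newspaper 1" -> "Title of Newspaper"
  gsub("[()]", "", x)          # "(Language)" -> "Language"
}

raw   <- read.csv(text = sheet, header = FALSE, colClasses = "character")
years <- as.character(unlist(raw[1, -1]))  # year labels from the first row
body  <- raw[-1, ]

# Separator rows are entirely empty; cumsum() over them numbers the blocks.
is_blank <- rowSums(body != "") == 0
blocks   <- split(body[!is_blank, ], cumsum(is_blank)[!is_blank])

# Hypothetical helper: one block -> one row per year in which the paper appears.
block_to_rows <- function(b, paper_id) {
  fields <- clean_names(b[[1]])
  rows <- lapply(seq_along(years), function(j) {
    vals <- b[[j + 1]]
    if (all(vals == "")) return(NULL)  # paper not listed in that year
    row <- setNames(data.frame(as.list(vals), stringsAsFactors = FALSE), fields)
    row$Year     <- years[j]
    row$Location <- location
    row$ID       <- paste(location, paper_id, sep = "_")
    row
  })
  do.call(rbind, Filter(Negate(is.null), rows))
}

result <- do.call(rbind, Map(block_to_rows, blocks, seq_along(blocks)))
rownames(result) <- NULL
result
```

Splitting `result` by `Year` then gives one table per year. If some blocks lack a field, `do.call(rbind, ...)` will fail on the mismatched names; `data.table::rbindlist(..., fill = TRUE)` or `plyr::rbind.fill` tolerate that.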
I have searched through Stackoverflow and the web and found some similar solutions to what I would think would be a very simple problem but nothing that addresses this. However, maybe I am just not thinking about it in correct "R" terms so here goes... Please help.
I have a few odd CSV files which I have to process every day.
Here is a mock up of the data as it comes in:
This is worthless and I want to get rid of it,,,,,,,,
This is worthless and I want to get rid of it,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345604,10.21.1151.12.0,Daisy,Petal,Stem,Data,Data,Data,
20345627,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20345600,10.21.1151.12.0,Samson,Petal,Stem,Data,Data,Data,
20345623,10.21.1151.12.0,Cloud,Petal,Stem,Data,Data,Data,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345704,10.21.1151.12.0,Simmons,Petal,Stem,Data,Data,Data,
20345677,10.21.1151.12.0,Butle,Petal,Stem,Data,Data,Data,
20347600,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20745623,10.21.1151.12.0,Unicorn,Petal,Stem,Data,Data,Data,
NOTES on the raw files:
they are all standard csvs
The number of columns may vary from file to file or day to day but the headers should always start with the same initial column name (in this example, "Header1").
Each file will have between 2 and 10 leading lines which are worthless and which I don't need.
The actual headers will appear within the first 10 rows
All of the data after the first header row is part of Group1 and I want to add a new column "Group" with that as the data
Eventually (5000 to 100,000 rows later), another set of the same header row will appear. All of the data after this second header row is part of Group2 and I want to alter the data in the new Group column to match (i.e. - change to putting "Group2" in that column).
In the end I would like to end up with this (given the initial data above):
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,NEWFIELD
20345604,10.21.1151.12.0,Daisy,Petal,Stem,Data,Data,Data,Group1
20345627,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,Group1
20345600,10.21.1151.12.0,Samson,Petal,Stem,Data,Data,Data,Group1
20345623,10.21.1151.12.0,Cloud,Petal,Stem,Data,Data,Data,Group1
20345704,10.21.1151.12.0,Simmons,Petal,Stem,Data,Data,Data,Group2
20345677,10.21.1151.12.0,Butle,Petal,Stem,Data,Data,Data,Group2
20347600,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,Group2
20745623,10.21.1151.12.0,Unicorn,Petal,Stem,Data,Data,Data,Group2
I have tried to treat the data as a connection stream with a series of if/else statements to perform the identification of the headers, groups, adding the new columns, etc. but I am having issues putting it back into a form I can use with proper headers.
Group <- "Start"
processFile = function(datafilepath) {
  con = file(datafilepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      print("EOF")
      break
    }
    if (grepl("Header1", line) & Group == "Start") {
      colnames(result) <- data.frame(paste(line,",","Group"))
      print("Initial Headers found, Switching to Group1")
      Group <- "Group1"
    } else if (grepl("Systems.Name", line) & Group == "Group1") {
      print("Switching to Group2")
      Group <- "Group2"
    } else if (Group == "Start") {print("At Start")}
    if (Group != "Start") {
      indresult <- (paste(line,",", Group))
      result <- rbind(result, indresult)
    }
  }
  return(result)
  close(con)
}
This code fails to load the headers correctly and I am not finding a method for loading the headers directly and then the data after that. I am fairly certain the column additions should work if the other can be done but I can't get to the point of verifying the resulting data will be seen as a complete dataframe until I can get past this.
Main questions: Is this the correct method to go about this and, if so, how do I get the data into a data frame so that I can use it?
Thanks,
Solution I am using currently:
The earlier solution with fread was the closest but I had a hard time wrapping my brain around it and the := assignment operator isn't recognized on my setup.
Thus, here is what I eventually used:
# This line removes all rows before the first appearance of "Header1"
Data <- fread(paste(Folder, File, sep = ""), skip = "Header1")
Group <- "Group1"
# Add an additional column to the data frame, to be filled in below
Data$Group <- ""
# Loop through each row and add Group. I had tried using simply "Data" instead of 1:nrow(Data), but in that case R only took the initial column of Data and not each row itself.
for (dataline in 1:nrow(Data)) {
  if (Data[dataline,]$"Header1" == "Header1" & Group == "Group1") {
    # Reached the second row of headers, indicating a Group change
    Group <- "Group2"
    next
  }
  # Assign Group
  Data[dataline,]$Group <- Group
}
# Remove duplicate header rows (the column is Header1, not Header)
Data <- Data[!(Data$Header1 == "Header1"),]
It is slow (takes about 4-5 minutes to run through on 50,000 rows) but it at least is automatic and gets what I need. If there is a way of speeding it up, please feel free to add. Thanks!
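On the speed question: the row-by-row loop can usually be replaced by a single vectorised `cumsum()` over the rows that repeat the header. The sketch below is a hedged illustration on a tiny stand-in data frame (the values and the two-column layout are assumptions based on the mock-up); the same three lines should also work on the table returned by `fread()`.

```r
# Tiny stand-in for the table after reading with skip = "Header1":
# the first header row became the column names, the second is still in the data.
Data <- data.frame(
  Header1 = c("20345604", "20345627", "Header1", "20345704", "20345677"),
  Header3 = c("Daisy", "Rose", "Header3", "Simmons", "Butle"),
  stringsAsFactors = FALSE
)

# TRUE on every repeated header row
is_header <- Data$Header1 == "Header1"

# cumsum() counts how many repeated headers precede each row:
# 0 before the first repeat (Group1), 1 after it (Group2), and so on.
Data$Group <- paste0("Group", cumsum(is_header) + 1)

# Drop the repeated header rows themselves
Data <- Data[!is_header, ]
Data
```

Because nothing is assigned row by row, this avoids the repeated copying that makes the loop slow, and it also handles a third or fourth group without changes.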
Something like this:
x = 'This is worthless and I want to get rid of it,,,,,,,,
This is worthless and I want to get rid of it,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
This line may or may not be here,,,,,,,,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345604,10.21.1151.12.0,Daisy,Petal,Stem,Data,Data,Data,
20345627,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20345600,10.21.1151.12.0,Samson,Petal,Stem,Data,Data,Data,
20345623,10.21.1151.12.0,Cloud,Petal,Stem,Data,Data,Data,
Header1,Header2,Header3,Header4,Header5,Header6,Header12,Header13,
20345704,10.21.1151.12.0,Simmons,Petal,Stem,Data,Data,Data,
20345677,10.21.1151.12.0,Butle,Petal,Stem,Data,Data,Data,
20347600,10.21.1151.12.0,Rose,Petal,Stem,Data,Data,Data,
20745623,10.21.1151.12.0,Unicorn,Petal,Stem,Data,Data,Data,'
require(data.table)
require(zoo) # for na.locf
o = fread(x, skip = 5, sep = ',', header = FALSE) # header = FALSE keeps the header rows as data, so columns are V1, V2, ...
# count how many headers
nh = nrow(o[grepl('Header1', V1) & grepl('Header2', V2)])
# add header id
o[grepl('Header1', V1) & grepl('Header2', V2), group := 1:nh]
# fill down header
o[, group := na.locf(group, na.rm = FALSE)]
# remove rows containing 'Header*'
o = o[!grepl('Header1', V1) & !grepl('Header2', V2) ]
o
V1 V2 V3 V4 V5 V6 V7 V8 V9 group
1: 20345604 10.21.1151.12.0 Daisy Petal Stem Data Data Data NA 1
2: 20345627 10.21.1151.12.0 Rose Petal Stem Data Data Data NA 1
3: 20345600 10.21.1151.12.0 Samson Petal Stem Data Data Data NA 1
4: 20345623 10.21.1151.12.0 Cloud Petal Stem Data Data Data NA 1
5: 20345704 10.21.1151.12.0 Simmons Petal Stem Data Data Data NA 2
6: 20345677 10.21.1151.12.0 Butle Petal Stem Data Data Data NA 2
7: 20347600 10.21.1151.12.0 Rose Petal Stem Data Data Data NA 2
8: 20745623 10.21.1151.12.0 Unicorn Petal Stem Data Data Data NA 2
x should be the path to your csv file.
Also, check out data.table::fread for more arguments that might be useful here.
You could further use setnames() to change the column names, and perhaps convert columns from character to numeric if the original dataset contains numeric data.
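For setups where data.table's `:=` is not available, roughly the same idea can be sketched in base R. This is an illustration, not a tested drop-in: `lines` is a shortened stand-in for `readLines()` on the real file, and it assumes every header row starts with "Header1," and every row ends in a single trailing comma, as in the mock-up.

```r
# Shortened stand-in for readLines("yourfile.csv")
lines <- c(
  "This is worthless and I want to get rid of it,,,,",
  "This line may or may not be here,,,,",
  "Header1,Header2,Header3,",
  "20345604,10.21.1151.12.0,Daisy,",
  "20345627,10.21.1151.12.0,Rose,",
  "Header1,Header2,Header3,",
  "20345704,10.21.1151.12.0,Simmons,"
)

is_header <- grepl("^Header1,", lines)  # every header row, wherever it sits
first     <- which(is_header)[1]        # position of the first header row
group     <- cumsum(is_header)          # 0 = junk, 1 = Group1, 2 = Group2, ...

# Data rows are everything after the first header that is not itself a header
data_idx <- seq_along(lines) > first & !is_header

# Strip the trailing comma so no empty extra column is created
clean <- sub(",$", "", lines)

# Parse the first header row plus all data rows in one call
o <- read.csv(text = c(clean[first], clean[data_idx]),
              header = TRUE, stringsAsFactors = FALSE)
o$Group <- paste0("Group", group[data_idx])
o
```

Since the header position is found with `grepl()`, the number of junk lines at the top does not need to be known in advance, and the real header names are kept instead of V1, V2, and so on.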
I have uploaded a data set called "Obtained Dataset". It usually begins with 16 rows naming numeric and character variables (some other files of a similar nature have fewer than 16). Each variable is a column header for the data, which starts from the 17th row onwards in this specific file.
Obtained dataset & Required Dataset
In the data itself, the 1st column is the x-axis, the 2nd column is the y-axis and the 3rd column is depth (these are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so forth, as given in the first 16 rows of the file.
Now I want R code that can convert it into the format shown in the required data set. Also, if a different data set has fewer than 16 lines of names, say GR 1 LIN and RHOB 1 LIN are missing, then I want it to still create those columns, filled with NA for every row.
Currently I have managed to export this file to Excel, manually clean the data, rename the columns correspondingly, save it as CSV and then use read.csv("filename") etc., but it is simply not possible to do this for 400 files.
Any advice on how to proceed will be of great help.
I have noticed that you have probably posted this question before, in a different format. This is a public forum, and people are happy to help. However, it's your job to make life easier for those helping you, and you are requested to put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Lets start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in the first 6 lines, followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, and we want to skip the first 6 rows with undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
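The question also asks that files lacking some variables still get those columns, filled with NA. One possible way to do that after the renaming step is sketched below. The `expected` vector and the small `temp` table are assumptions made for illustration: X, Y and DEPTH are hypothetical names for the three standard columns, and the values are dummy data.

```r
# Full set of columns every cleaned file should end up with (assumed order)
expected <- c("X", "Y", "DEPTH", "GR 1 LIN", "CAL 1 LIN", "RHOB 1 LIN")

# Dummy file in which "GR 1 LIN" and "RHOB 1 LIN" are missing
temp <- data.frame(
  X = 1:3, Y = 4:6, DEPTH = c(10, 20, 30),
  `CAL 1 LIN` = c(8.1, 8.3, 8.2),
  check.names = FALSE  # keep the spaces in the column names
)

# Add every missing column, filled with NA for all rows
for (nm in setdiff(expected, names(temp))) {
  temp[[nm]] <- NA
}

# Put the columns into the same order for every file
temp <- temp[expected]
temp
```

With the columns present and ordered identically in every file, the cleaned files can later be stacked with `rbind()` without further name matching.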
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check if the solution is working. If it is working, then move to the next step; otherwise make the necessary changes.
Step4: Automating:
First we need to create a list of all the 400 files.
The easiest way (and the easiest to explain) is to copy the 400 files into a directory, and then set that as the working directory (using setwd).
Now first we'll create a vector with all file names:
fileNameList <- dir()
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
  temp <- read.table(file=fileName, sep ="\t", skip = 6)
  names(temp) <- namesVec
  # Build the output name from the input name so that each file gets its own cleaned copy
  write.table(temp,file=paste("clean",fileName,sep="-"),row.names = FALSE,sep = "\t",quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
Hope this helps!