First of all, I'd like to say that I'm completely new to R, and I'm just trying to accomplish this one task.
So, what I'm trying to do is create a network diagram from a weighted matrix. I made an example:
The CSV is a simple correlation matrix that looks like this:
,A,B,C,D,E,F,G
A,1,0.9,0.64,0.43,0.38,0.33,0.33
B,0.9,1,0.64,0.33,0.43,0.38,0.38
C,0.64,0.64,1,0.59,0.69,0.64,0.64
D,0.43,0.33,0.59,1,0.28,0.23,0.28
E,0.38,0.43,0.69,0.28,1,0.95,0.9
F,0.33,0.38,0.64,0.23,0.95,1,0.9
G,0.33,0.38,0.64,0.28,0.9,0.9,1
I tried to draw the wanted result by myself and came up with this:
To be more precise, I drew the diagram first; then, using a ruler, I took note of the distances, derived the weights from them with an equation, and made the CSV table.
The higher the value is, the closer the two points are to each other.
However, whatever I do, the best result I get is this:
And this is how I'm trying to accomplish it, using this tutorial:
First of all, I import my matrix:
> matrix <- read.csv(file = 'test_dataset.csv')
But after printing the matrix out with head(), the last row already somehow seems to be cut off:
> head(matrix)
ï.. A B C D E F G
1 A 1.00 0.90 0.64 0.43 0.38 0.33 0.33
2 B 0.90 1.00 0.64 0.33 0.43 0.38 0.38
3 C 0.64 0.64 1.00 0.59 0.69 0.64 0.64
4 D 0.43 0.33 0.59 1.00 0.28 0.23 0.28
5 E 0.38 0.43 0.69 0.28 1.00 0.95 0.90
6 F 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 8
I then proceed with removing the first column so the matrix is square again...
> matrix <- data.matrix(matrix)[,-1]
> head(matrix)
A B C D E F G
[1,] 1.00 0.90 0.64 0.43 0.38 0.33 0.33
[2,] 0.90 1.00 0.64 0.33 0.43 0.38 0.38
[3,] 0.64 0.64 1.00 0.59 0.69 0.64 0.64
[4,] 0.43 0.33 0.59 1.00 0.28 0.23 0.28
[5,] 0.38 0.43 0.69 0.28 1.00 0.95 0.90
[6,] 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 7
Then I create the graph with igraph and try to plot it:
> library(igraph)
> network <- graph_from_adjacency_matrix(matrix, weighted=TRUE, mode="undirected", diag=FALSE)
> plot(network)
And the result above appears...
So, after spending the last few hours googling and trying way, way more things, this is the closest I've been able to get.
So I'm asking for your help, thank you very much!
This is all fine.
head() just prints the first 6 rows of a matrix or data frame; nothing has been cut. If you want to see all of it, use print(), type the name of the matrix variable, or pass n, e.g. head(matrix, n = 10).
graph_from_adjacency_matrix produces a link between two nodes if the value is non-zero. That's why you are getting every node linked to every other node.
To get what that tutorial is doing you need to add a line like
matrix[matrix<0.5] <- 0
to remove the edges for correlations below a cut off before you create the graph.
It's still not going to produce a chart like your hand-drawn one (where distance roughly reflects correlation); it will just clump nodes together when their correlation is above 0.5.
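Putting the whole thing together, a minimal sketch (assuming the file is named test_dataset.csv as in your question, and a 0.5 cut-off):

library(igraph)

# row.names = 1 uses the first column as row names, so no manual column
# removal is needed; fileEncoding = "UTF-8-BOM" also avoids the stray
# "ï.." column name you saw.
m <- as.matrix(read.csv("test_dataset.csv", row.names = 1,
                        fileEncoding = "UTF-8-BOM"))

# Drop weak correlations so only strong associations become edges.
m[m < 0.5] <- 0

network <- graph_from_adjacency_matrix(m, weighted = TRUE,
                                       mode = "undirected", diag = FALSE)

# layout_with_fr treats edge weights as attraction, so more strongly
# correlated nodes tend to end up closer together.
plot(network, layout = layout_with_fr(network))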
I was trying to find associations between the top 10 most frequent words and the rest of the frequent words in the input text.
When I look at the individual output of findAssocs():
findAssocs(dtm, "good", corlimit=0.4)
The output clearly shows the word 'good' for which the associations were sought:
$good
better got hook next content fit person
0.44 0.44 0.44 0.44 0.43 0.43 0.43
But when I try to automate this process for a character vector having top 10 words:
t10 <- c("busi", "entertain", "topic", "interact", "track", "content", "paper", "media", "game", "good")
the output is a list of correlations for each of those elements, but without the word for which the associations were sought. The sample output is below (please notice that the word at t10[i] is not printed, unlike the individual output above where 'good' was clearly shown):
t10_words <- vector("list", 10)  # pre-allocate the result list
for(i in 1:10) {
  t10_words[i] <- as.list(findAssocs(dtm, t10[i], corlimit=0.4))
}
> t10_words
[[1]]
littl descript disrupt enter model
0.50 0.48 0.48 0.48 0.48
[[2]]
immers anyth effect full holodeck iot problem say startrek such suspect wow
0.68 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48
[[3]]
area captur give overal like alon avid begin
0.51 0.47 0.47 0.47 0.44 0.43 0.43 0.43
circuit cloud collaboration communic communiti concis confus defin
0.43 0.43 0.43 0.43 0.43 0.43 0.43 0.43
discord doesnt drop enablesupport esport event everi everyon
0.43 0.43 0.43 0.43 0.43 0.43 0.43 0.43
How do I print the output along with the actual association word?
Can somebody please help me with this?
Thanks.
After running your for loop, add the following piece of code:
names(t10_words) <- t10
This will name the lists with the words specified in t10.
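Alternatively, if I recall the tm API correctly, findAssocs() also accepts a character vector of terms and returns a list already named by those terms, so the loop can be replaced by a single call:

t10_words <- findAssocs(dtm, t10, corlimit = 0.4)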
I am having trouble interpolating the values of two data series. The first column holds a reference time. The second column holds the times at which the values of P130 (third column) were recorded. I want to interpolate new values of P130 at the reference times (the results column).
The reference time and timeP130 share the same first and last values, and both advance in variable steps, so there is no regular pattern.
Reference_time timeP130 P130 results
0.0001 0.0001 0.2194 0.2194
0.000694 0.003 0.25 0.22552
0.00138889 0.0035 0.26 0.23164
0.00208333 0.006 0.24 0.23776
0.00277778 0.009 0.245 0.24388
0.003 0.009 0.255 0.25
0.00416667 0.0125 0.27 ETC
0.00486111 0.015 0.21
0.00555556 0.018 0.20
0.00625 0.0208 0.2194
0.00694444 0.021 0.2194
0.00763889 0.0211 0.2194
0.00833333 0.0215 0.2194
0.00902778 0.022 0.2195
0.00972222 0.0327 0.2591
0.0104167 0.0433 0.3664
0.0111111 0.0839 0.4068
0.0118056 2.5 0.4087
0.0125 0.27
0.0141944
0.0158889
0.0165833
0.0182778
2.5 0.4087
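Base R's approx() performs exactly this kind of piecewise-linear interpolation. A minimal sketch, assuming the columns are available as vectors named ref_time, time_p130 and p130 (the names are mine):

# Known points are (time_p130, p130); evaluate the interpolant at the
# reference times. Because the first and last times coincide, every
# reference time falls inside the known range.
results <- approx(x = time_p130, y = p130, xout = ref_time)$y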
I want to save the following output I get in the R console into a csv or txt file.
Discordancy measures (critical value 3.00)
0.17 3.40 1.38 0.90 1.62 0.13 0.15 1.69 0.34 0.39 0.36 0.68 0.39
0.54 0.70 0.70 0.79 2.08 1.14 1.23 0.60 2.00 1.81 0.77 0.35 0.15
1.55 0.78 2.87 0.34
Heterogeneity measures (based on 100 simulations)
30.86 14.23 3.75
Goodness-of-fit measures (based on 100 simulations)
glo gev gno pe3 gpa
-3.72 -12.81 -19.80 -32.06 -37.66
This is the outcome I get when I run the following:
Heter<-regtst(regsamlmu(-extremes), nsim=100)
where Heter is a list (i.e., is.list(Heter) returns TRUE)
You could use capture.output:
capture.output(regtst(regsamlmu(-extremes), nsim=100), file="myoutput.txt")
Or for capturing output coming from several consecutive commands:
sink("myfile.txt")
#
# [commands generating desired output]
#
sink()
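For this particular output the sink() version could look like this (print() makes the printing explicit, since auto-printing can be suppressed, for example inside functions or loops):

sink("myfile.txt")
print(regtst(regsamlmu(-extremes), nsim=100))
sink()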
You could make a character vector which you write to a file. Each entry in the vector will be separated by a newline character.
out <- capture.output(regtst(regsamlmu(-extremes), nsim=100))
write(out, "output.txt", sep="\n")
If you would like to add more lines, just do something like c(out, "hello Kostas").
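writeLines() is a close base-R alternative that writes one element of the character vector per line:

writeLines(out, "output.txt")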
The product of one simulation is a large data.frame, with fixed columns and rows. I ran several hundreds of simulations, with each result stored in a separate RData file (for efficient reading).
Now I want to gather all those files together and compute statistics for each field of this data.frame in the "cells" structure, which is basically a nested list of vectors (one vector of length simcount per cell). This is how I do it:
#colscount, rowscount - number of columns and rows from each simulation
#simcount - number of simulations
#colnames - names of the columns of the simulation's data frame
#simfilenames - vector of filenames, one per simulation
cells<-as.list(rep(NA, colscount))
for(i in 1:colscount)
{
cells[[i]]<-as.list(rep(NA,rowscount))
for(j in 1:rowscount)
{
cells[[i]][[j]]<-rep(NA,simcount)
}
}
names(cells)<-colnames
addcells<-function(simnr)
# This function reads and appends simdata to "simnr" position in each cell in the "cells" structure
{
simdata <- readRDS(simfilenames[[simnr]])
for(i in 1:colscount)
{
for(j in 1:rowscount)
{
if (!is.na(simdata[j,i]))
{
cells[[i]][[j]][simnr]<-simdata[j,i]
}
}
}
}
library(plyr)
a_ply(1:simcount,1,addcells)
The problem is the huge difference in timing. Reading one simulation file takes:
> system.time(dane<-readRDS(path.cat(args$rdatapath,pliki[[simnr]]))$dane)
user system elapsed
0.088 0.004 0.093
While
> system.time(addcells(1))
user system elapsed
147.328 0.296 147.644
I would expect both commands to have comparable execution times (or at least the latter to be at most 10 times slower). I guess I am doing something very inefficient there, but what? The whole cells data structure is rather big; it takes around 1 GB of memory.
I need to transpose the data in this way because later I compute many descriptive statistics on the results (means, sd, quantiles, and maybe histograms), so it is important that the data for each cell is stored as a (one-dimensional) vector.
Here is profiling output:
> summaryRprof('/tmp/temp/rprof.out')
$by.self
self.time self.pct total.time total.pct
"[.data.frame" 71.98 47.20 129.52 84.93
"names" 11.98 7.86 11.98 7.86
"length" 10.84 7.11 10.84 7.11
"addcells" 10.66 6.99 151.52 99.36
".subset" 10.62 6.96 10.62 6.96
"[" 9.68 6.35 139.20 91.28
"match" 6.06 3.97 11.36 7.45
"sys.call" 4.68 3.07 4.68 3.07
"%in%" 4.50 2.95 15.86 10.40
"all" 4.28 2.81 4.28 2.81
"==" 2.34 1.53 2.34 1.53
".subset2" 1.28 0.84 1.28 0.84
"is.na" 1.06 0.70 1.06 0.70
"nargs" 0.62 0.41 0.62 0.41
"gc" 0.54 0.35 0.54 0.35
"!" 0.42 0.28 0.42 0.28
"dim" 0.34 0.22 0.34 0.22
".Call" 0.12 0.08 0.12 0.08
"readRDS" 0.10 0.07 0.12 0.08
"cat" 0.10 0.07 0.10 0.07
"readLines" 0.04 0.03 0.04 0.03
"strsplit" 0.04 0.03 0.04 0.03
"addParaBreaks" 0.02 0.01 0.04 0.03
It looks like indexing the list structure takes a lot of time. But I can't make it an array, because not all cells are numeric, and R doesn't easily support hash maps...
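For what it's worth, the profile points at [.data.frame: simdata[j,i] dispatches the expensive data-frame subscript method once per cell. A minimal sketch of a faster fill, extracting each column as an atomic vector first (note it also needs <<-, since the original addcells only modifies a local copy of cells, so its changes are silently discarded):

addcells_fast <- function(simnr)
{
  simdata <- readRDS(simfilenames[[simnr]])
  for(i in 1:colscount)
  {
    # one cheap column extraction instead of rowscount data.frame subscripts
    col <- simdata[[i]]
    for(j in which(!is.na(col)))
    {
      cells[[i]][[j]][simnr] <<- col[j]
    }
  }
}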