How do you silently save an inspect object in R's tm package?

When I save the inspect() object in R's tm package it prints to screen. It does save the data that I want in the data.frame, but I have thousands of documents to analyze and the printing to screen is eating up my memory.
library(tm)
data("crude")
matrix <- TermDocumentMatrix(crude,control=list(removePunctuation = TRUE,
stopwords=TRUE))
out= data.frame(inspect(matrix))
I have tried every trick that I can think of. capture.output() changes the object (not the desired effect), as does sink(). dev.off() does not work. invisible() does nothing. suppressWarnings(), suppressMessages(), and try() unsurprisingly do nothing. There are no silent or quiet options in the inspect command.
The closest that I can get is
out= capture.output(inspect(matrix))
out= data.frame(out)
which notably does not give the same data.frame, but pretty easily could be if I need to go down this route. Any other (less hacky) suggestions would be helpful. Thanks.
Windows 7
64- bit R-3.0.1
tm package is the most recent version (0.5-9.1).

Assign inside the capture then:
capture.output(out <- data.frame(inspect(matrix))) -> .null # discarding this
But really, inspect is for visual inspection, so maybe try
as.data.frame(as.matrix(matrix))
instead (btw matrix is a very unfortunate name for a variable, as that's a base function).
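As a minimal sketch of the capture trick in base R (no tm needed): noisy() below is a hypothetical stand-in for any function that, like inspect(), prints its argument and also returns it. The toy matrix and its row and column names are made up for illustration. Because the assignment is evaluated inside capture.output(), the printed text is swallowed while out still ends up in the calling environment.

```r
# noisy() is a hypothetical stand-in for inspect(): it prints, then returns its input.
noisy <- function(x) {
  print(x)
  invisible(x)
}

# Toy term-by-document counts (made-up data).
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("oil", "price"), c("doc1", "doc2", "doc3")))

# The assignment is evaluated inside capture.output(), so nothing reaches the console;
# the captured lines land in `discarded` and can be ignored.
discarded <- capture.output(out <- data.frame(noisy(m)))

str(out)
```

If the object can be converted directly, as with as.matrix() above, that route avoids the capture step altogether.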

Using this input (variable name changed from your question, as a variable named "matrix" can be confusing):
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude,control=list(removePunctuation = TRUE,
stopwords=TRUE))
Then this will avoid printing to screen
m <- as.matrix(tdm)
and then I would personally do something like
require(data.table)
data.table(m, keep.rownames=TRUE)
# rn 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
# 1: 100000 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
# 2: 108 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 3: 111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 4: 115 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# 5: 12217 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
# ---
# 996: yesterday 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 0 0 0 0 0
# 997: yesterdays 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
# 998: york 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
# 999: zero 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0
# 1000: zone 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0

Related

Why does my import urlopen from urllib.request not go through?

I have the below code that I eventually want to webscrape and analyse.
My code has been running for almost an hour and it doesn't seem to pull through from this site.
import bs4 as bs
from urllib.request import urlopen as ureq
my_url2 = 'https://www.dreamteamfc.com/g/#tournament/stats-centre-stats'
ureq(my_url2)
The data you're looking for is loaded from another URL via Ajax, so BeautifulSoup doesn't see it in the page HTML. Also, use the requests module for fetching pages and JSON data: it handles compression, redirects etc. automatically.
To load the data, use the following example:
import json
import requests
url = "https://nuk-data.s3.eu-west-1.amazonaws.com/json/players_tournament.json"
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some data to screen:
for player in data:
    print(
        "{:<15} {:<15} {}".format(
            player["first_name"], player["last_name"], player["cost"]
        )
    )
Prints:
Cristiano Ronaldo 7000000
Goran Pandev 1000000
David Marshall 2000000
Jesús Navas 3000000
Kasper Schmeichel 3000000
Sergio Ramos 5000000
Raúl Albiol 2000000
Giorgio Chiellini 3500000
...and so on.
EDIT: To load the data into a dataframe, you can use pd.json_normalize:
import json
import requests
import pandas as pd
url = "https://nuk-data.s3.eu-west-1.amazonaws.com/json/players_tournament.json"
data = requests.get(url).json()
df = pd.json_normalize(data)
print(df)
df.to_csv("data.csv", index=None)
Prints:
id first_name last_name squad_id cost status positions locked injury_type injury_duration suspension_length cname stats.round_rank stats.season_rank stats.games_played stats.total_points stats.avg_points stats.high_score stats.low_score stats.last_3_avg stats.last_5_avg stats.selections stats.owned_by stats.MIN stats.SMR stats.SMB stats.GS stats.ASS stats.YC stats.RC stats.PM stats.PS stats.CS stats.GC stats.star_man_awards stats.7_plus_ratings stats.goals stats.assists stats.cards stats.clean_sheets tournament_stats.star_man_awards tournament_stats.7_plus_ratings tournament_stats.goals tournament_stats.assists tournament_stats.cards tournament_stats.clean_sheets
0 14937 Cristiano Ronaldo 359 7000000 playing [4] 0 None None None None 0 0 9 0 0 0 0 0 0 22760 41.3 764 0 0 15 0 1 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0
1 15061 Goran Pandev 504 1000000 playing [4] 0 None None None None 0 0 0 0 0 0 0 0 0 50 0.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 15144 David Marshall 115 2000000 playing [1] 0 None None None None 0 0 0 0 0 0 0 0 0 166 0.3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 17740 Jesús Navas 118 3000000 playing [3] 0 None None None None 0 0 0 0 0 0 0 0 0 154 0.3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 17745 Kasper Schmeichel 369 3000000 playing [1] 0 None None None None 0 0 9 0 0 0 0 0 0 3261 5.9 810 0 0 0 0 1 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0
5 17861 Sergio Ramos 118 5000000 playing [2] 0 None None None None 0 0 9 0 0 0 0 0 0 14647 26.6 712 0 0 1 0 1 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0
...and so on.
And saves data.csv.

Separate a string of characters space-separated of dataframe in different columns

I am pretty new at using R and I have some data that I need to tidy a bit before I can use it. Basically I have a dataframe with a bunch of rows and columns and in every cell of this dataframe I have a string of 20 numbers of 1 and zeroes ("0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0").
Now I am trying to separate every number of a field having each number in a new column (1 field would be 20 columns). After that I would like to convert these newly separated strings into numbers. I will show a small sample of the data. Here I would need the numbers separated in 40 columns and 3 rows:
df<-data.frame(
"V1" = c("0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 ","0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ","1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 "),
"V2" = c("0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ","0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 ","0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 "))
As you can see a good way to separate each number of a string would be treating the space as a delimiter, but I am not having any luck with that. I tried my luck with df<-lapply(strsplit(df, " "), as.numeric) but the dataframe can't be treated with this function. I tried then df<-lapply(strsplit(as.character(df), " "), as.numeric)
That way it separates correctly but making the full dataframe as a character messes up the data.
I suppose that it's easier than I think but I still lack skill in this code.
Easier option is read.table (no packages used)
read.table(text = as.character(df$V1), header = FALSE)
For multiple columns, use lapply
lapply(df, function(x) read.table(text = as.character(x), header = FALSE))
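The lapply() call above returns a list with one data frame per original column. A minimal sketch of combining those pieces into a single all-numeric data frame follows; the shortened three-value strings and the V1_1-style column names are assumptions made for illustration, not part of the original data.

```r
# Shortened sample data (three values per cell instead of twenty).
df <- data.frame(
  V1 = c("0 0 1 ", "1 0 0 "),
  V2 = c("0 1 0 ", "0 0 1 "),
  stringsAsFactors = FALSE
)

# One data frame of integer columns per original column.
parts <- lapply(df, function(x) read.table(text = as.character(x), header = FALSE))

# Bind the pieces side by side and give every column an explicit name.
wide <- do.call(cbind, parts)
names(wide) <- unlist(lapply(names(parts), function(n) {
  paste0(n, "_", seq_along(parts[[n]]))
}))

str(wide)
```

read.table() already converts the values to integers, so no separate as.numeric() step is needed here.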
You can use cSplit from splitstackshape to convert multiple columns into separate columns.
splitstackshape::cSplit(df, names(df), " ")
# V1_01 V1_02 V1_03 V1_04 V1_05 V1_06 V1_07 V1_08 V1_09 V1_10 V1_11
#1: 0 0 0 0 0 0 0 0 0 0 0
#2: 0 0 0 1 0 0 0 0 0 0 0
#3: 1 0 0 0 0 0 0 0 0 0 0
# V1_12 V1_13 V1_14 V1_15 V1_16 V1_17 V1_18 V1_19 V1_20 V2_01 V2_02
#1: 0 0 0 1 0 0 0 0 0 0 0
#2: 0 0 0 0 0 0 0 0 0 0 0
#3: 0 0 0 0 0 0 0 0 0 0 0
# V2_03 V2_04 V2_05 V2_06 V2_07 V2_08 V2_09 V2_10 V2_11 V2_12 V2_13
#1: 0 0 0 0 1 0 0 0 0 0 0
#2: 0 0 0 0 0 0 0 0 0 0 0
#3: 0 0 0 0 0 0 0 1 0 0 0
# V2_14 V2_15 V2_16 V2_17 V2_18 V2_19 V2_20
#1: 0 0 0 0 0 0 0
#2: 0 0 0 0 0 1 0
#3: 0 0 0 0 0 0 0
Note that I have used names(df) here since you want to convert all the columns into separate columns. If you have additional columns and want to separate only few of them, you can also do
splitstackshape::cSplit(df, c("V1", "V2"), " ")
I found both answers equally good, but I think cSplit made the subsequent processing easier. What I finally did to obtain the result:
library(splitstackshape)
df<-cSplit(df, names(df), " ")
df<-lapply(df,as.numeric)
df<-as.data.frame(df)
I suppose that this can be done with fewer lines of code, but this way is more understandable for me. Thank you very much for your answers!

Can't name my columns as dates

I have a range of dates where some products were bought. I create a sort of a pivot table relating the products and the dates, but there are dates where nothing was sold. I can find the missing dates and even add them to the main data frame, the problem is that instead of keeping the date format, they adopt the integer format (with the integer being the distance to origin) and I can't order them. The code I'm using is this:
upper.bound<- paste("01", month[1], 2013, sep="-")
lower.bound <- paste("30", month[4], 2013, sep="-")
dates <- seq(as.Date(upper.bound, "%d-%m-%Y"), as.Date(lower.bound, "%d-%m-%Y"), "days")
diff <- setdiff(dates, as.Date(colnames(export_f_ub), "%Y-%m-%d"))
len <- dim(as.matrix(diff))[1]*11
aux <- data.frame()
aux <- seq(0,0,length.out=len)
dim(aux) <- c(11, dim(as.matrix(diff))[1])
col_dates <- as.Date(diff, origin="1970-01-01")
colnames(aux)<- c(col_dates)
This was an attempt to build a matrix of zeros and then bind it to the main one. It doesn't work, though: in the result the column names come out as numbers rather than dates.
I've never seen someone try to assign a Date vector as column names of a matrix. Dimension names must always be character strings, so in general this is not something you should be doing.
That being said, in terms of effect, the intuitive expectation would be that the column-name-assignment machinery in R would at some point coerce the Date vector to character along the lines of as.character(), and thus you'd get the text representation of the dates, rather than a stringification of their underlying double values.
Calling `colnames<-`() on a matrix eventually calls `dimnames<-`() which drops into the C code by running .Primitive("dimnames<-"). I haven't really looked into the C implementation, but we can guess that at some point it pulls out the double values underlying the Date vector, coerces them to character, and that's why you end up with numbers as your column names.
The correct approach here is to call as.character() yourself when assigning the names:
col_dates <- as.Date(c('2013-06-03','2013-06-04','2013-06-05','2013-06-06','2013-06-08','2013-06-22','2013-07-07','2013-07-08','2013-07-11','2013-07-13','2013-07-23','2013-07-25','2013-07-26','2013-08-27','2013-09-03','2013-09-04','2013-09-05','2013-09-06','2013-09-07','2013-09-09','2013-09-10','2013-09-11','2013-09-13','2013-09-14','2013-09-15','2013-09-16','2013-09-18','2013-09-20','2013-09-21','2013-09-22','2013-09-24','2013-09-30'));
aux <- matrix(0,11L,length(col_dates));
colnames(aux) <- as.character(col_dates);
aux;
## 2013-06-03 2013-06-04 2013-06-05 2013-06-06 2013-06-08 2013-06-22 2013-07-07 2013-07-08 2013-07-11 2013-07-13 2013-07-23 2013-07-25 2013-07-26 2013-08-27 2013-09-03 2013-09-04 2013-09-05 2013-09-06 2013-09-07 2013-09-09 2013-09-10 2013-09-11 2013-09-13 2013-09-14 2013-09-15 2013-09-16 2013-09-18 2013-09-20 2013-09-21 2013-09-22 2013-09-24 2013-09-30
## [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [7,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [8,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [9,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [10,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [11,] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
But you should be aware that the column names do not retain the Date class or internal representation (referring to the double values); they are pure character strings. If you want to recreate the Date vector from the column names, you'll have to run them through as.Date().
And by the way, dim(as.matrix(x))[1] for a vector x is an unnecessarily roundabout way of getting length(x).
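Since the original goal was to order the columns by date, here is a minimal sketch of the round trip: store the dates as character column names, then rebuild a Date vector from them with as.Date() and use it for ordering. The three dates and the 2-row matrix are made up for illustration.

```r
# Deliberately unsorted dates (made-up sample).
col_dates <- as.Date(c("2013-06-05", "2013-06-03", "2013-06-04"))

aux <- matrix(0, nrow = 2, ncol = length(col_dates))
colnames(aux) <- as.character(col_dates)   # names are plain character strings

# Recover Date objects from the names and reorder the columns chronologically.
recovered <- as.Date(colnames(aux))
aux_sorted <- aux[, order(recovered), drop = FALSE]

colnames(aux_sorted)
```

Because the names use the ISO year-month-day format, sorting them as plain strings would give the same order; going through as.Date() is the safer choice for other formats.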

Generating confusion matrix with HResults HTK Tool for handwriting recognition ICFHR’s example

I am studying how HTK Tools works with handwriting recognition. Following the ICFHR–2010 TUTORIAL I run examples for "Spanish-Numbers" corpus and received the resulting HMMs (files stored in folder hmm and listed in HMMsList), and res32.mlf with results of recognition received with HVite. Also I have master label file SamplesRef.mlf.
And now I want to see recognition results statistics, i.e. studying HResults tool.
When I run HResults as
HResults -I SamplesRef.mlf HMMsList res32.mlf
I see
====================== HTK Results Analysis =======================
Date: Tue Mar 31 15:21:11 2015
Ref : SamplesRef.mlf
Rec : res32.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=2, N=2]
WORD: %Corr=77.78, Acc=77.78 [H=7, D=0, S=2, I=0, N=9]
===================================================================
But if I add option -p in order to have confusion matrix I see the following error message:
~/icfhr$ HResults -p -I SamplesRef.mlf HMMsList res32.mlf
ERROR [+3331] Index: Label millones not in list[0 of 19]
FATAL ERROR - Terminating program HResults
I understand that message means that there is no HMM with name "millones" and I found that in my res32.mlf samples looks like:
"’*’/210341.rec"
mil
seiscientos
cincuenta
y
siete
millones
.
If I change res32.mlf with text editor to res33.mlf with content like:
"’*’/210341.rec"
m
i
l
s
e
i
s
c
i
... and so on.
And use samples.mlf (instead of SamplesRef.mlf) which inside looks like:
"*/210341.lab"
m
i
l
#
q
u
i
n
i
e
n
t
o
s
#
c
... and so on.
I have the desired result:
~/icfhr$ HResults -p -I samples.mlf HMMsList res33.mlf
====================== HTK Results Analysis =======================
Date: Tue Mar 31 15:35:42 2015
Ref : samples.mlf
Rec : res33.mlf
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=2, N=2]
WORD: %Corr=79.63, Acc=77.78 [H=43, D=5, S=6, I=1, N=54]
------------------------ Confusion Matrix -------------------------
a c d e i l m n o s t u v y Del [ %c / %e]
# 0 0 0 0 0 1 1 0 0 0 0 0 0 0 5 [ 0.0/3.7]
a 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
d 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
e 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0
i 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0
m 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
n 0 1 0 0 0 0 0 6 0 0 0 0 0 0 0 [85.7/1.9]
o 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0
q 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 [ 0.0/1.9]
s 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0
t 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0
u 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 [50.0/1.9]
v 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
y 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 [50.0/1.9]
Ins 0 0 0 0 0 0 0 0 0 1 0 0 0 0
===================================================================
So, the main question is:
What is the simplest way (without text editor) to make mlf-files adapted for making confusion matrix?
(I suppose I miss some option of some HTK tool… but which tool and which option?)
Any useful ideas would be highly appreciated.
To use the -p option, you need to provide the list of class labels, not your HMMs. For example, if you're trying to recognize the words Yes, No and Never, then your "HMMsList" file should be written as:
Yes
No
Never
This holds regardless of the HMMs that actually constitute the words.
In other words, your "HMMsList" file should really be a "LabelsList".
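One way to build such a labels list without a text editor is to collect the distinct labels from the reference MLF. The sketch below is only an illustration in R, not an HTK feature. It assumes a word-level MLF with one bare label per line (no timestamps or scores), and the second sample name is hypothetical.

```r
# Assumption: word-level MLF, one bare label per line (no timestamps or scores).
mlf_lines <- c(
  "#!MLF!#",
  "\"*/210341.lab\"",
  "mil", "seiscientos", "cincuenta", "y", "siete", "millones",
  ".",
  "\"*/000000.lab\"",   # hypothetical second sample
  "mil", "y",
  "."
)

# Drop the header, the quoted file-name lines, and the "." terminators.
is_label <- !grepl("^(#!MLF!#|\"|\\.$)", mlf_lines)
labels <- unique(mlf_lines[is_label])

# Write one label per line; "LabelsList" is the name suggested in the answer above.
labels_file <- file.path(tempdir(), "LabelsList")
writeLines(labels, labels_file)
```

For a real file, mlf_lines would come from readLines("SamplesRef.mlf"), and the written list would be passed to HResults in place of HMMsList.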

incorrect number of dimensions even if the size of the array is the same

I have been working on this program for many days and decided to rewrite it today, but this problem keeps bothering me.
I thought csm[1,] and Prank[1,] had the same dimensions.
Can anyone help me with this problem?
Prank<-read.csv("result.csv")
nrP<-nrow(Prank)
ncP<-ncol(Prank)
csm<-matrix(0,nrP*3,ncP)
ccsm<-matrix(0,nrP*3,ncP)
nrC<-nrow(csm)
ncC<-ncol(csm)
nrP
[1] 30
ncP
[1] 144
nrC
[1] 90
ncC
[1] 144
Prank[1,]
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 P16 P17 P18 P19 P20 P21 P22 P23 P24 P25 P26 P27 P28 P29 P30 P31 P32
1 4 2 3 1 4 2 3 1 4 2 3 1 3 1 4 2 4 2 3 1 4 1 3 2 4 1 3 2 4 2 3 1
P33 P34 P35 P36 P37 P38 P39 P40 P41 P42 P43 P44 P45 P46 P47 P48 P49 P50 P51 P52 P53 P54 P55 P56 P57 P58 P59 P60 P61
1 4 1 3 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
P62 P63 P64 P65 P66 P67 P68 P69 P70 P71 P72 P73 P74 P75 P76 P77 P78 P79 P80 P81 P82 P83 P84 P85 P86 P87 P88 P89 P90
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
P91 P92 P93 P94 P95 P96 P97 P98 P99 P100 P101 P102 P103 P104 P105 P106 P107 P108 P109 P110 P111 P112 P113 P114 P115
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
P116 P117 P118 P119 P120 P121 P122 P123 P124 P125 P126 P127 P128 P129 P130 P131 P132 P133 P134 P135 P136 P137 P138
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
P139 P140 P141 P142 P143 P144
1 0 0 0 0 0 0
csm[1,]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[59] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[117] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
csm[1,]<-Prank[1,]
csm[1,]
Error in csm[1, ] : incorrect number of dimensions
The problem is that Prank[1, ] is a data.frame (i.e. a list) so when you try to assign it to the first row of csm, it has the unexpected side-effect of converting csm to a list. At that point, doing csm[1, ] does not make any sense (a list has a single dimension) hence the error.
A solution is to unlist Prank[1, ] before assigning:
csm[1,] <- unlist(Prank[1,])
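A minimal sketch of the fix on a small made-up data frame (three columns instead of 144) shows that the row of a data frame is itself a data frame, and that unlist() turns it into a plain numeric vector, so the matrix stays a numeric matrix after the assignment.

```r
# Small made-up stand-in for the data read from result.csv.
Prank <- data.frame(P1 = c(4, 1), P2 = c(2, 3), P3 = c(3, 2))
csm <- matrix(0, nrow(Prank) * 3, ncol(Prank))

row1 <- Prank[1, ]
is.data.frame(row1)   # the row is still a data frame, i.e. a list
is.numeric(row1)      # so it is not a numeric vector

# unlist() flattens the one-row data frame into a numeric vector.
csm[1, ] <- unlist(row1)

csm[1, ]
```

If the whole data frame is numeric, converting it once with data.matrix(Prank) and indexing the resulting matrix is another option.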
read.csv() returns a data.frame, so Prank[1,] is a one-row data frame (a list) rather than a numeric vector, and the assignment
csm[1,]<-Prank[1,]
will cause csm to be coerced to a list because Prank[1,] is not a numeric vector. Note that is.numeric(Prank[1,]) is FALSE for a data frame row, so convert the row to a numeric vector before assigning, and make sure every column of Prank is numeric.
Revised suggestion: take a look at the data frame (head(Prank)) and it may be obvious that one or more columns are not numeric. To inspect the classes of each field in Prank, you can use
lapply(Prank,class)
or
sapply(Prank,class)
If all the fields in Prank are integer or numeric, you can coerce them all to numeric via
Prank[] <- lapply(Prank,as.numeric)
If not all the fields are numeric, you will want to coerce the problem fields to numeric, or remove the offending fields from Prank (e.g. Prank$ProblemField <- NULL) before the assignment.
