I want to do some text analysis of Instagram photo descriptions, which often contain a mix of text, hashtags, @username mentions, URLs etc. and are written in different languages. My extraction method seems to have stored the text as UTF-8. My first task is to convert all the strings to latin1 before identifying the languages and then translating. The conversion works for the German example below but fails for the Italian one. I have a feeling it's because the Italian text contains an emoji code, ðŸ˜, which the converter can't handle. Any help or pointers to good tutorials would be appreciated!
t.de <- "mein bozen. schönheit liegt im auge des betrachters. worauf
richtest DU tagtäglich deinen blick? auf die vielen kleinen wunder des
lebens\\, auf das lächeln eines gesprächspartners\\n#lavitaebella
#perspektivenwechsel #entscheidedich #blickwinkel
#schönheitliegtimaugedesbetrachters"
t.it <- "ðŸ˜\u008dðŸ˜\u008dCi si può innamorare di una ricamatice???
ðŸ˜\u008dðŸ˜\u008d Personalizzazioni di tutti i tipi!! #ricamo #ricamato
#personalizzato #elefante #palloncini #tenero #baby #nascita #amoredimamma
#cuccioloso"
iconv(t.de, "UTF-8", "latin1") # this works fine
iconv(t.it, "UTF-8", "latin1") # this doesn't work
library(stringi) # provides stri_encode()
rawde <- stri_encode(t.de, from = "UTF-8", to = "latin1", to_raw = TRUE)
stri_encode(rawde, "latin1") # the 'stringi' approach also works for t.de but not for t.it
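A minimal Python sketch (Python rather than R, purely as an illustration) of what may be going on with the "emoji code". The sequence ðŸ˜\u008d is what the four UTF-8 bytes of a single emoji look like when they are read as Windows-1252/Latin-1. The helper names undo_mojibake and fits_latin1 are made up for this example, and the assumption that the text was mis-decoded exactly once is a guess about how the data was extracted. The repaired character is U+1F60D, which has no Latin-1 representation, so a strict conversion to latin1 cannot succeed for it.

```python
def undo_mojibake(text: str) -> str:
    """Map each character back to the single byte it came from, then decode as UTF-8."""
    raw = bytearray()
    for ch in text:
        try:
            raw += ch.encode("cp1252")   # printable Windows-1252 characters
        except UnicodeEncodeError:
            raw += ch.encode("latin-1")  # C1 controls such as U+008D
    return raw.decode("utf-8")


def fits_latin1(text: str) -> bool:
    """Report whether every character of text exists in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


garbled = "ðŸ˜\u008d"                 # the fragment from the Italian example
repaired = undo_mojibake(garbled)
print(hex(ord(repaired)))             # 0x1f60d
print(fits_latin1("schönheit"))       # True
print(fits_latin1(repaired))          # False
```

If this guess is right, the emoji has to be removed or replaced before converting to latin1, or the text kept in UTF-8 throughout.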
I'm building a GUI in Scilab based on a tutorial by the Openeering team. The GUI needs to plot the response of a system on the right side of the figure window. It draws an initial graph of the system, and a button redraws the graph using the parameters entered in some text boxes of the GUI.
The code where I initially store the data shown in the text boxes is:
// Ordered list of default values
values1=[0.00074,0.3,8.7,0.48,3.9,0.0035];
// Positioning
l1=30;l2=100;l3=110;
for k=1:size(labels1,2)
uicontrol("parent",sistem_graf,"style","text","string",labels1(k),...
"position",[l1,guih1-k*20+guih1o,l2,20],...
"horizontalalignment","left","fontsize",14,"background",[1 1 1]);
guientry(k)=uicontrol("parent",sistem_graf,"style","edit","string",string(values1(k)),...
"position",[l3,guih1-k*20+guih1o,180,20],...
"horizontalalignment","left","fontsize",14,"background",[.9 .9 .9],"tag",labels1(k));
end
guientry(k) is the control for each text box.
The button is generated with the following code:
// Adding a button
huibutton=uicontrol(sistem_graf,"style","pushbutton",...
"position",[110 100 100 20],"String","Graficar",...
"BackgroundColor",[.9 .9 .9], "fontsize",14,...
"Callback","Calcula_Sistema");
The callback of the button, a function called "Calcula_Sistema", is:
function Calcula_Sistema()
// Read the system parameters
parametros=[];
la=findobj("tag,""La [H]"); parametros.la=la;
ra=findobj("tag","Ra [Ohm]"); parametros.ra=ra;
in=findobj("tag","In [A]"); parametros.in=in;
par=findobj("tag","Par [N*m]"); parametros.par=par;
ke=findobj("tag","Ke [V/krpm]"); parametros.ke=ke;
j=findobj("tag","j [N/m^2]"); parametros.j=j;
// Read the system times
/* Tsim=[];
Tini=findobj("tag","Tinicio [s]"); Tsim.Tini=evstr(Tini);
Tfin=findobj("tag","Tfin [s]"); Tsim.Tini=evstr(Tfin);
Tpaso=findobj("tag", "Tpaso [s]"); Tsim.Tpaso=evstr(Tpaso);
*/
Sis_Motor(parametros.la,parametros.ra,parametros.in,...
parametros.par,parametros.ke,parametros.j);
endfunction
When I press the button to generate the new graph, I get this error:
at line 8 of function Sis_Motor ( F:\Users\valery\Documents\MEGAsync\UCV\Postgrado en Controles Industriales\Trabajo de Grado\Proyecto\Cálculos\Aplicación Scilab\Ventana.sce line 96 )
at line 18 of function Calcula_Sistema ( F:\Users\valery\Documents\MEGAsync\UCV\Postgrado en Controles Industriales\Trabajo de Grado\Proyecto\Cálculos\Aplicación Scilab\Ventana.sce line 142 )
syslin: Input arguments #2 y #3 incompatibles: Are expected of the same size.
Line 8 of Sis_Motor refers to line 8 of the following code, which is the whole Sis_Motor function:
function [Wn,Zita,ftr,fta]=Sis_Motor(in,par,la,ra,ke,j)
kt=par/in;
n=kt/(j*la);
b=j/10;
d=[((b*ra+ke*kt)/(j*la)) ((b*la+j*ra)/(j*la)) 1];
dpoly=poly(d,'s','c');
t=[0:0.001:0.2];
fta=syslin('c',n,dpoly);
ftr=fta/(1+fta);
[Wn,Zita]=damp(ftr);
graf=csim('step',t,ftr);
delete(gca());
plot2d(t,graf);
legend('Respuesta al escalón');
// Vertical line.
set(gca(),"auto_clear","off");
graf_eje=gca();
graf_eje.axes_bounds=[1/3,0,2/3,1];
endfunction
I also tried changing the second line of Sis_Motor, which is
n=kt/(j*la);
to
n=[kt/(j*la) 0 0];
but that didn't work and the same error keeps appearing.
I suppose the error comes from how the inputs of the text boxes are processed, but I don't know how to solve it.
Can anyone help me with that?
Update 1:
findobj wasn't finding the tags, exactly as @Stephane Mottelet said. It is solved now.
findobj yields a handle to the uicontrol. To recover the numeric value of edit boxes you have to write (here e.g. for ra)
ra=findobj("tag","Ra [Ohm]"); parametros.ra=evstr(ra.string);
If it still fails maybe the object is not found and findobj yields an empty matrix. Just insert disp(ra) after the findobj call to be sure that the tag is (or is not) found.
I'm trying to create a field in the Team entity so that its zone/territory can be selected.
Looking in the database, I see that the field comp_secterr uses the entry type with id 53 (entry type Territory). I found this with the following SQL:
SELECT * FROM Custom_Edits WHERE ColP_ColName LIKE ('comp_secterr');
but when creating a new field, the select offers no option for entry type 53:
<select class="EDIT" size="1" name="entrytype" id="entrytype" onchange="document.EntryForm.hiddenmode.value=1;document.EntryForm.submit();"><option value="15">Texto con búsqueda de casilla de verificación</option><option value="25">Producto</option><option value="27">Selección inteligente</option><option value="28">Selección múltiple</option><option value="42">Fecha sólo</option><option value="51">Divisa</option><option value="56">Selección de búsqueda avanzada</option><option value="57">Minutos</option><option value="59">Símbolos de las monedas</option><option value="63">Selección de grupo de usuarios</option><option value="10" selected="">Texto</option><option value="44">Procedimiento almacenado</option><option value="45">Casilla</option><option value="50">Número de teléfono</option><option value="11">Texto MultiLínea</option><option value="12">Dirección de correo electrónico</option><option value="13">URL WWW</option><option value="21">Selección</option><option value="22">Selección de usuario</option><option value="23">Selección de Equipo</option><option value="31">Número entero</option><option value="32">Numérico</option><option value="41">Fecha y hora</option></select>
Entry Type 53 is reserved for Territory fields, but is not accessible via the front-end. To use it, you need to update the Custom_Edits table directly using a query similar to this:
UPDATE custom_Edits SET colp_entrytype = 53 WHERE ...
Make sure you do a full metadata refresh after as well.
While it is not absolutely necessary, it is also good for consistency to name Territory fields _secterr (E.g. comp_secterr)
Six Ticks Support
I have a Qt application that receives strings in JSON objects from the Disqus API:
{ "title": "Swiftkey pr\u0102\u0160dit votre choix d\u2019emoji gr\u0102\u02d8ce au clavier Swiftmoji" }
(there's more but I only write what matters here)
Then I put the title string in a QString:
// Assuming that "reply" is the QNetworkReply * containing the Disqus API response.
QByteArray disqusReply = reply->readAll();
// disqusReply == "{ \"title\": \"Swiftkey pr\u0102\u0160dit votre choix d\u2019emoji gr\u0102\u02d8ce au clavier Swiftmoji\" }"
QJsonDocument doc = QJsonDocument::fromJson(disqusReply);
QJsonObject obj = doc.object();
QString title = obj["title"].toString();
Later I display it in a QML Text. It should display "Swiftkey prédit votre choix d'emoji grâce au clavier Swiftmoji" but it displays "Swiftkey prĂŠdit votre choix d'emoji grĂ˘ce au clavier Swiftmoji" instead.
As you can see, there are encoding issues: two successive Unicode characters appear where a single character is expected ("ĂŠ" instead of "é" and "Ă˘" instead of "â" here). How can I display the right characters, and which encoding conversions do I have to perform (with Qt or QML) to solve this?
Additional information: the bug occurs on Windows 10 64-bit.
EDIT : you can find the bug here: https://disqus.com/api/3.0/threads/list.json?since=2016-05-18T14%3A08%3A27%2B00%3A00&forum=frandroid&api_key=7o0xSBOEzN2AG6yxcJgeJbeEbACBfGhgnoIRHu7umbifKAvXQpisYKT3KSXF9nPN
I think the problem is double encoding (or something similar) on the server side, not in the client. The JSON should contain pr\u00e9dit instead of pr\u0102\u0160dit. If you use clean_title instead of title from the JSON response, you'll get the right string, because its encoding is correct.
UPD:
As I said in the comments, there are two replies for the same news item. The one with id 4836688567 has the wrongly encoded string, and the one with id 4836587900 has the correct one. Many news items appear twice with different encodings.
First:
{
"feed":"https://frandroid.disqus.com/httpwwwfrandroidcomandroidapplications358721_swiftkey_predire_choix_demoji_grace_clavier_swiftmoji/latest.rss",
"identifiers":[],
"dislikes":0,
"likes":0,
"message":"",
"id":"4836688567",
"createdAt":"2016-05-18T09:08:43",
"category":"448171",
"author":"3938134",
"userScore":0,
"isSpam":false,
"signedLink":"http://disq.us/?url=http%3A%2F%2Fwww.frandroid.com%2Fandroid%2Fapplications%2F358721_swiftkey-predire-choix-demoji-grace-clavier-swiftmoji&key=nqCbe6jgfwM-skLyqTf3lg",
"isDeleted":false,
"raw_message":"",
"isClosed":false,
"link":"http://www.frandroid.com/android/applications/358721_swiftkey-predire-choix-demoji-grace-clavier-swiftmoji",
"slug":"httpwwwfrandroidcomandroidapplications358721_swiftkey_predire_choix_demoji_grace_clavier_swiftmoji",
"forum":"frandroid",
"clean_title":"Swiftkey pr\u0102\u0160dit votre choix d\u2019emoji gr\u0102\u02d8ce au clavier Swiftmoji",
"posts":0,
"userSubscription":false,
"title":"Swiftkey pr\u0102\u0160dit votre choix d\u2019emoji gr\u0102\u02d8ce au clavier Swiftmoji",
"highlightedPost":null
}
Second:
{
"feed":"https://frandroid.disqus.com/swiftkey_predit_votre_choix_d8217emoji_grace_au_clavier_swiftmoji/latest.rss",
"identifiers":["358721 http://www.frandroid.com/?p=358721"],
"dislikes":0,
"likes":1,
"message":"",
"id":"4836587900",
"createdAt":"2016-05-18T08:16:30",
"category":"448171",
"author":"3938134",
"userScore":0,
"isSpam":false,
"signedLink":"http://disq.us/?url=http%3A%2F%2Fwww.frandroid.com%2Fandroid%2Fapplications%2Fgoogle-apps%2F358721_swiftkey-predire-choix-demoji-grace-clavier-swiftmoji&key=UU8IrLN_UDXEggF6wHjAYg",
"isDeleted":false,
"raw_message":"",
"isClosed":false,
"link":"http://www.frandroid.com/android/applications/google-apps/358721_swiftkey-predire-choix-demoji-grace-clavier-swiftmoji",
"slug":"swiftkey_predit_votre_choix_d8217emoji_grace_au_clavier_swiftmoji",
"forum":"frandroid",
"clean_title":"Swiftkey pr\u00e9dit votre choix d\u2019emoji gr\u00e2ce au clavier Swiftmoji",
"posts":13,
"userSubscription":false,
"title":"Swiftkey pr\u00e9dit votre choix d’emoji gr\u00e2ce au clavier Swiftmoji",
"highlightedPost":null
}
As you can see, the difference is in the URL the news item comes from. Why some of them are encoded wrongly is still an open question.
UPD 2:
Or maybe it's an RSS bug. Take the word prédit. In the second variant, the RSS feed returns XML whose content is already correctly encoded as é, and the feed link looks normal. In the first variant, the RSS feed returns ĂŠ and its feed link looks abnormal, as if the whole URL had been taken and encoded a second time.
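The double-encoding idea can be checked outside Qt. Below is a minimal Python sketch, not a Qt fix. The escapes in the broken title are consistent with UTF-8 bytes that were decoded as ISO-8859-2 (Latin-2): é is the bytes C3 A9 in UTF-8, and in ISO-8859-2 those two bytes are Ă (U+0102) and Š (U+0160). The function name repair_latin2_mojibake is made up for this example, and applying such a repair on the client is only a workaround under the assumption that the server really did this.

```python
import re


def repair_latin2_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes decoded as ISO-8859-2' for each run of non-ASCII characters."""
    def fix(match):
        run = match.group(0)
        try:
            return run.encode("iso8859_2").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return run  # not mojibake of this kind (e.g. U+2019), leave it alone

    return re.sub(r"[^\x00-\x7f]+", fix, text)


# Forward direction: how the broken escapes could have been produced.
print("é".encode("utf-8").decode("iso8859_2") == "\u0102\u0160")  # True

broken = ("Swiftkey pr\u0102\u0160dit votre choix d\u2019emoji "
          "gr\u0102\u02d8ce au clavier Swiftmoji")
print(repair_latin2_mojibake(broken))
```

Correctly encoded text passes through unchanged, because a lone é or â is not valid UTF-8 once re-encoded as ISO-8859-2.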
Have you tried
QByteArray disqusReply
= QString::fromUtf8(reply->readAll().data()).toLocal8Bit();
? You might prefer delaying this conversion until actually writing the string into the QML text.
QJsonDocument requires UTF-8 encoded strings. Is your document encoded in UTF-8 when you load it?
I have the following somewhere in a page:
<asp:Localize ID="locChangePasswordPrompt" runat="server"
Text="Change Your Password" meta:resourcekey="locChangePasswordPrompt" />
I am localizing using a SQL Server database, and I have stored the corresponding French values in the database.
If you run this query:
SELECT TOP 1000 [ResourceType]
,[CultureCode]
,[ResourceKey]
,[ResourceValue]
,[Preserve]
FROM [CLeX].[dbo].[StringResource]
where resourcekey like 'locChangePasswordPrompt%'
You get the values:
ResourceType                 CultureCode  ResourceKey                   ResourceValue               Preserve
common/UserPreferences.aspx  en           locChangePasswordPrompt                                   1
common/UserPreferences.aspx  en           locChangePasswordPrompt.Text  Change Your Passwordss      1
common/UserPreferences.aspx  en-US        locChangePasswordPrompt                                   1
common/UserPreferences.aspx  en-US        locChangePasswordPrompt.Text  Change Your Passwordss      1
common/UserPreferences.aspx  fr           locChangePasswordPrompt                                   1
common/UserPreferences.aspx  fr           locChangePasswordPrompt.Text  Changez votre mot de passe  1
However, I am still not able to get the FRENCH values at all. In fact, not even the English values are being pulled from the DB. Localize simply pulls the text from its text attribute.
What could possibly be the reason?
Did you configure your DB Resource Provider?
I am trying to extract text from part of a website. The div node which contains the text also contains several children each with their own text or other content. However, I only want the text from the top node not from its children!
This is what the relevant page section looks like:
<div class="body-text">
<div id="other" class="other"></div>
<div id="other2" class="other2"></div>
<div id="other3" class="other3">
<span>irrelevant text</span>
</div>
<h2>heading2</h2>
-Text which I want to get. There are also text parts which are linked.
</div>
This is my code which gets me the "messy" text. I tried /text() but this will truncate my text whenever part of it is linked. So I cannot use it. I also tried something with /div/node()[not(self::div)] but have not managed to get it to work. Could anybody help?
library(RCurl) # getURL()
library(XML) # htmlTreeParse(), xpathSApply()
webpage <- getURL(url)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE, encoding = 'UTF-8')
body <- xpathSApply(pagetree, "//div[@class='body-text']", xmlValue)
1) Posted example
Try searching for nodes of text() or a/text() within the body-text division, removing any trivial nodes that only contain white space:
## input
Text <- '<div class="body-text">
<div id="other" class="other"></div>
<div id="other2" class="other2"></div>
<div id="other3" class="other3">
<span>irrelevant text</span>
</div>
<h2>heading2</h2>
-Text which I want to get. There are also text parts which are linked.
</div>'
library(XML)
pagetree <- htmlTreeParse(Text, asText = TRUE, useInternalNodes = TRUE)
## process it - xpth is the Xpath expression and xpathSApply() runs it
trim <- function(x) gsub("^\\s+|\\s+$", "", x) # trim whitespace from start & end
xpth <- "( //div[@class='body-text']/text() |
//div[@class='body-text']/a/text() ) [ normalize-space() != '' ]"
txt <- trim(xpathSApply(pagetree, xpth, xmlValue))
The result is the following:
> txt
[1] "-Text which I want to get. There are also text parts which are linked."
2) Example provided by poster in comments. Using this as Text
Text <- '<div class="body-text"> text starts here
<a class="footnote" href="link"> text continues here <sup>1</sup> </a>
and continues here</div>'
and repeating the above code we get:
> txt
[1] "text starts here" "text continues here" "and continues here"
EDIT: Have modified the above based on comments by poster. Main change was the xpath expression, xpth and final point which illustrates the same code with the example provided by the poster in the comments.
EDIT: Have moved the filtering out of whitespace-only nodes from R to Xpath. This lengthens the Xpath expression a bit but eliminates the R Filter() step. Also simplified and reduced the presentation slightly.
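For readers without R, the same selection rule can be sketched with Python's standard html.parser: keep text that is a direct child of the body-text div, or a direct child of an a element directly inside it, and drop whitespace-only nodes. This is only an illustration of the rule, not a translation of the XML package's behaviour. The names BodyTextExtractor, extract_body_text and VOID_TAGS are made up for this example, and the void-tag list is a small assumption rather than a complete one.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class BodyTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as (tag, is_body_text_div)
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            is_body = tag == "div" and dict(attrs).get("class") == "body-text"
            self.stack.append((tag, is_body))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray or void end tags.
        if any(open_tag == tag for open_tag, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        direct = self.stack[-1][1]  # parent is the body-text div
        via_link = (len(self.stack) >= 2 and self.stack[-1][0] == "a"
                    and self.stack[-2][1])  # parent is an <a> inside that div
        if direct or via_link:
            self.texts.append(text)


def extract_body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


sample = ('<div class="body-text"> text starts here\n'
          '<a class="footnote" href="link"> text continues here <sup>1</sup> </a>\n'
          'and continues here</div>')
print(extract_body_text(sample))
# ['text starts here', 'text continues here', 'and continues here']
```

As with the XPath version, the footnote number inside sup is skipped because it is not a direct child of the a element's text.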
There are a few possible solutions to this problem, but, first, it's necessary to clarify which nodes you want to select. You say:
I only want the text from the top node not from its children!
But this is not true! All of the element nodes found in the article text (e.g. a, em, etc.) are themselves children of the body-text div. What you really want to do is select all of the text found within a certain section of the div. Conveniently, your source document (linked in the comments above) contains comment nodes that mark the start and end of the article. They look like this:
<!-- inizio TESTO -->article text<!-- fine TESTO -->
In fact, you really only need the start marker, because there is no additional content after the end marker.
Selecting text after the start marker
The following expression selects the desired nodes:
//div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()
Testing on the following stripped-down document:
<div class="body-text">
<div class="fb-like-button" id="fb-like-head"></div>
<h2><!-- inizio OCCHIELLO -->IRAN<!-- fine OCCHIELLO --></h2>
<h1><!-- title -->"A Isfahan colpito sito nucleare"<br/>Londra annuncia azioni dure<!-- fine TITOLO --></h1>
<h3><!-- summary -->Secondo il<em>Times</em>, fonti di intelligence...<br/><strong><br/></strong><!-- fine SOMMARIO --></h3>
<div class="sidebar">Sidebar text...</div>
<!-- inizio TESTO --><strong>TEHERAN</strong> - L'esplosione avvenuta
lunedì scorso in Iran a Isfahan <sup>1</sup> avrebbe colpito un
sito nucleare. Lo hanno riferito fonti dell'intelligence israeliana al quotidiano britannico <em>The Times</em>, secondo le
quali alcune immagini satellitari "mostrano chiaramente colonne di fumo e la distruzione" di una struttura nucleare di Isfahan.
Sale, intanto, la tensione con la Gran Bretagna: dopo <a href="http://www.repubblica.it" class="footnote">l'assalto all'
ambasciata britannica <sup>2</sup></a> ieri...<!-- fine TESTO -->
</div>
Returns the following text nodes:
[#text: TEHERAN]
[#text: - L'esplosione avvenuta
]
[#text: lunedì scorso in Iran a Isfahan ]
[#text: 1]
[#text: avrebbe colpito un
sito nucleare. Lo hanno riferito fonti dell'intelligence israeliana al quotidiano britannico ]
[#text: The Times]
[#text: , secondo le
quali alcune immagini satellitari "mostrano chiaramente colonne di fumo e la distruzione" di una struttura nucleare di Isfahan.
Sale, intanto, la tensione con la Gran Bretagna: dopo ]
[#text: l'assalto all'
ambasciata britannica ]
[#text: 2]
[#text: ieri...]
[#text:
]
This is a node-set, which you can iterate, etc. I do not know R, so I cannot provide those details.
Selecting text between the start and end markers
If there could be content after the end marker that should be excluded -- there isn't in the provided example -- then use the following expression:
//div[@class='body-text']//text()[preceding::comment()[.=' inizio TESTO '] and
following::comment()[.=' fine TESTO ']]
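The "between two comment markers" idea can also be sketched without XPath, using Python's standard html.parser. This is a rough illustration under simplifying assumptions: the class name MarkedTextExtractor and the function text_between_markers are made up, each marker is assumed to occur once, and unlike the expression above the sketch does not check that the text sits inside the body-text div.

```python
from html.parser import HTMLParser


class MarkedTextExtractor(HTMLParser):
    """Collect text nodes that occur between a start and an end comment."""

    def __init__(self, start, end):
        super().__init__()
        self.start, self.end = start, end
        self.inside = False
        self.texts = []

    def handle_comment(self, data):
        # data is the comment body, including its surrounding spaces
        if data == self.start:
            self.inside = True
        elif data == self.end:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts.append(data)


def text_between_markers(html, start=" inizio TESTO ", end=" fine TESTO "):
    parser = MarkedTextExtractor(start, end)
    parser.feed(html)
    parser.close()
    return parser.texts


doc = ('<div class="body-text"><h2>IRAN</h2>'
       '<!-- inizio TESTO --><strong>TEHERAN</strong> - testo <em>The Times</em> ieri...'
       '<!-- fine TESTO --><p>footer</p></div>')
print(text_between_markers(doc))
# ['TEHERAN', ' - testo ', 'The Times', ' ieri...']
```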
Selecting text between the start and end markers (Kayessian Formula)
Note that the previous expression can be expressed more directly as the intersection of two node-sets: 1) all text nodes after the start marker and 2) all text nodes before the end marker. There is a general formula for performing intersection in XPath 1.0:
$set1[count(.|$set2)=count($set2)]
The general idea here, in English, is that if you add an element from $set1 into $set2 and the size of $set2 does not change, then that node must have already been in $set2. The set of all nodes from $set1 for which this is the case is the intersection of $set1 and $set2.
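The counting argument can be made concrete with a small Python sketch that applies the same test to plain lists. It is only a model of the formula: the strings t1 to t6 are stand-ins for text nodes in document order, and kayessian_intersection is a name invented for this example.

```python
def kayessian_intersection(set1, set2):
    """Keep each node of set1 whose union with set2 is no larger than set2."""
    nodes2 = set(set2)
    return [node for node in set1 if len({node} | nodes2) == len(nodes2)]


after_start = ["t3", "t4", "t5", "t6"]  # text nodes following the start marker
before_end = ["t1", "t2", "t3", "t4"]   # text nodes preceding the end marker
print(kayessian_intersection(after_start, before_end))  # ['t3', 't4']
```

Adding t3 or t4 to before_end leaves its size at four, so they are kept; adding t5 or t6 grows it to five, so they are dropped.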
In your specific case:
$set1 = //div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()
$set2 = //div[@class='body-text']/comment()[.=' fine TESTO ']/preceding::text()
Putting it all together:
//div[@class='body-text']/comment()[.=' inizio TESTO ']/following::text()[
count(.|//div[@class='body-text']/comment()[.=' fine TESTO ']/preceding::text())
=
count(//div[@class='body-text']/comment()[.=' fine TESTO ']/preceding::text())]