Issue with encoding of Cyrillic character strings - r

I have some Cyrillic strings in my dataframe that I can't manage to read accurately.
This is how the dataframe looks after I load the csv:
unique(transactions$orders)
[1] "ÌÈÏÑ-ÏÏ30Å" "ÈÍÒ-ÏÏ30Å" "ÊÈÁÑ-ÏÏ30Å" "ÊÈÁÑ-ÏÏ50Å" "ÌÈÏÑ-ÏÏ50Å" "ÊÈÁÑ-ÏÏ53Å" "ÈÍÒ-ÏÏ53Å"
[8] "ÌÈÏÑ-ÏÏ53Å" "ÈÍÒ-ÏÏ30" "ÊÈÁÑ-ÏÏ30" "ÌÈÏÑ-ÏÏ30" "ÌÈÏÑ-ÏÏ50" "ÈÍÒ-ÏÏ10" "ÊÈÁÑ-ÏÏ50"
[15] "ÈÍÒ-ÏÏ40" "ÊÈÁÑ-ÏÏ53" "ÈÍÒ-ÏÏ53" "ÌÈÏÑ-ÏÏ53" "ÊÈÁÑ-ÏÏ10" "ÈÍÒ-ÏÏ30Ï" "ÊÈÁÑ-ÏÏ50Ï"
[22] "ÌÈÏÑ-ÏÏ30Ï" "ÊÈÁÑ-ÏÏ30Ï" "ÌÈÏÑ-ÏÏ50Ï" "ÈÍÒ-ÏÏ50" "ÌÈÏÑ-ÏÏ10" "ÊÈÁÑ-ÏÏ53Ï" "ÈÍÒ-ÏÏ53Ï"
Any ideas how I can fix this?
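One possible reading, offered as an assumption rather than a diagnosis: the garbled text looks like Windows-1251 (CP1251) bytes displayed as Latin-1, since Ì, È, Ï and Ñ occupy the byte positions of М, И, П and С in that code page. If that is the case, declaring the encoding when reading the file may be enough. The sketch below builds a hypothetical one-row CSV from raw CP1251 bytes and reads it back with fileEncoding. The temporary file is invented for illustration, and the encoding name that iconv accepts can differ between platforms.

```r
# Hypothetical CSV: header "orders" plus one value stored as CP1251 bytes
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("orders\n"),
    as.raw(c(0xCC, 0xC8, 0xCF, 0xD1)),  # four Cyrillic capitals in CP1251
    charToRaw("-"),
    as.raw(c(0xCF, 0xCF)),
    charToRaw("30"),
    as.raw(0xC5),
    charToRaw("\n")),
  tmp
)

# Assumption: the source file is Windows-1251, so declare it on import
transactions <- read.csv(tmp, fileEncoding = "CP1251",
                         stringsAsFactors = FALSE)
unique(transactions$orders)
```

With a real file, the same fileEncoding argument would go on the original read.csv call. If the strings are still garbled, the file is probably in a different encoding.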

Related

List.files based on numbers

I am trying to create a list of files on which I want to run a function. I created a pattern which matches 35 files which I want to use.
mypattern <- paste0("NBS_NLoans_since2009_", seq(1, 35),".xls")
[1] "NBS_NLoans_since2009_1.xls" "NBS_NLoans_since2009_2.xls" "NBS_NLoans_since2009_3.xls" "NBS_NLoans_since2009_4.xls"
[5] "NBS_NLoans_since2009_5.xls" "NBS_NLoans_since2009_6.xls" "NBS_NLoans_since2009_7.xls" "NBS_NLoans_since2009_8.xls"
[9] "NBS_NLoans_since2009_9.xls" "NBS_NLoans_since2009_10.xls" "NBS_NLoans_since2009_11.xls" "NBS_NLoans_since2009_12.xls"
[13] "NBS_NLoans_since2009_13.xls" "NBS_NLoans_since2009_14.xls" "NBS_NLoans_since2009_15.xls" "NBS_NLoans_since2009_16.xls"
[17] "NBS_NLoans_since2009_17.xls" "NBS_NLoans_since2009_18.xls" "NBS_NLoans_since2009_19.xls" "NBS_NLoans_since2009_20.xls"
[21] "NBS_NLoans_since2009_21.xls" "NBS_NLoans_since2009_22.xls" "NBS_NLoans_since2009_23.xls" "NBS_NLoans_since2009_24.xls"
[25] "NBS_NLoans_since2009_25.xls" "NBS_NLoans_since2009_26.xls" "NBS_NLoans_since2009_27.xls" "NBS_NLoans_since2009_28.xls"
[29] "NBS_NLoans_since2009_29.xls" "NBS_NLoans_since2009_30.xls" "NBS_NLoans_since2009_31.xls" "NBS_NLoans_since2009_32.xls"
[33] "NBS_NLoans_since2009_33.xls" "NBS_NLoans_since2009_34.xls" "NBS_NLoans_since2009_35.xls"
Then I used the pattern to get those files from my directory, but I got only one file. I have tried different patterns, but I get either one file or more than 35. Thanks for any suggestions.
list.files(pattern = mypattern)
[1] "NBS_NLoans_since2009_1.xls"
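The pattern argument of list.files is documented as a single regular expression, and the output above is consistent with only the first element of mypattern being used. A minimal sketch of two workarounds follows. The scratch directory and the 40 empty files are hypothetical stand-ins for the real folder.

```r
# Scratch directory with hypothetical files numbered 1 to 40
dir <- file.path(tempdir(), "nbs_demo")
dir.create(dir, showWarnings = FALSE)
invisible(file.create(
  file.path(dir, paste0("NBS_NLoans_since2009_", 1:40, ".xls"))
))

wanted <- paste0("NBS_NLoans_since2009_", seq(1, 35), ".xls")

# Option 1: list everything, then keep only the exact names wanted
found <- intersect(wanted, list.files(dir))

# Option 2: one regular expression covering 1-9, 10-29 and 30-35
rx <- "^NBS_NLoans_since2009_([1-9]|[12][0-9]|3[0-5])\\.xls$"
found_rx <- list.files(dir, pattern = rx)

length(found)
setequal(found, found_rx)
```

Option 1 avoids regular expressions entirely and keeps the files in the order of wanted, which is usually the simpler choice when the exact names are already known.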

Substitution of strings results in incorrect names

I'd like to change several strings in a vector. In my case, the all.images object contains:
# Original character vector
all.images <-c("S2B2A_20171003_124_IndianaIIPR00911120170922_BOA_10.tif",
"S2B2A_20181028_124_IndianaIIPR0065820181024_BOA_10.tif",
"S2B2A_20170715_124_SantaMariaCalcasPR0033420170731_BOA_10.tif",
"S2B2A_20180928_124_NSraAparecidaBortolettoPR0042720180912_BOA_10.tif",
"S2A2A_20170610_124_LagoaAmarelaPR0022020170619_BOA_10.tif",
"S2A2A_20160705_124_AguaSumidaPR001320160629_BOA_10.tif",
"S2A2A_20181023_124_SaoPedroGabrielGarciaPR001720181031_BOA_10.tif",
"S2B2A_20180908_124_NSraAparecidaBortolettoPR001920180911_BOA_10.tif",
"S2A2A_20180824_124_NSraAparecidaBortolettoPR0043320180911_BOA_10.tif",
"S2A2A_20170720_124_VoAnaPR001520170802_BOA_10.tif",
"S2B2A_20180322_124_SaoMateusPR0021920180314_BOA_10.tif",
"S2A2A_20181212_124_NSradeFatimaJoaoBatistaPR002320181128_BOA_10.tif",
"S2A2A_20180413_081_SantaFeSebastiaoFogacaPR0021920180427_BOA_10.tif",
"S2B2A_20170913_124_PerdizesPR0034920170905_BOA_10.tif",
"S2A2A_20170610_124_TresMeninasPR001820170601_BOA_10.tif",
"S2B2A_20180428_081_SantaFeSebastiaoFogacaPR0021020180501_BOA_10.tif",
"S2B2A_20180508_081_SantaFeSebastiaoFogacaPR0022320180427_BOA_10.tif",
"S2A2A_20170809_124_VoAnaPR001620170803_BOA_10.tif",
"S2B2A_20180819_124_PontalIIPR0012220180801_BOA_10.tif",
"S2B2A_20181214_081_NSradeFatimaJoaoBatistaPR002320181128_BOA_10.tif",
"S2A2A_20180423_081_SantaFeSebastiaoFogacaPR0033920180427_BOA_10.tif",
"S2A2A_20180814_124_PontalIIPR0012220180801_BOA_10.tif",
"S2B2A_20170715_124_VoAnaPR0015A20170803_BOA_10.tif",
"S2A2A_20160615_124_AguaSumidaPR0011220160627_BOA_10.tif",
"S2A2A_20170720_124_SantaMariaCalcasPR0022820170726_BOA_10.tif",
"S2A2A_20180913_124_SantaMariaCalcasPR001620180829_BOA_10.tif",
"S2B2A_20170804_124_NSraAparecidaBortolettoPR0035720170811_BOA_10.tif",
"S2A2A_20170809_124_SantaFeBaracatPR001920170801_BOA_10.tif",
"S2B2A_20180322_124_NSradeFatimaGlebaAPR001320180403_BOA_10.tif",
"S2B2A_20180508_081_SantaFeSebastiaoFogacaPR0021920180427_BOA_10.tif")
My idea is to: 1) remove S2B2A_ and _BOA_10.tif; 2) convert the 8 digits after S2B2A_ into a date (e.g. 2017-09-05); 3) move the three digits that follow the date (e.g. 124 or 081) to the end; and 4) split the remaining characters at the capital-letter code and the trailing date (e.g. AguaSumidaPR0011220160627 to AguaSumida-PR00112-2016-06-27).
But when I try:
sub("^\\w+_(\\d+)_(\\d+)_([A-Za-z]+)([A-Z]{2}\\d{3})(\\d)(\\d{4})(\\d{2})(\\d+)_.*",
"\\3_\\4_\\5_\\6-\\7-\\8_\\1_\\2", all.images)
[1] "IndianaII_PR009_1_1120-17-0922_20171003_124"
[2] "IndianaII_PR006_5_8201-81-024_20181028_124"
...
[28] "SantaFeBaracat_PR001_9_2017-08-01_20170809_124"
[29] "NSradeFatimaGlebaA_PR001_3_2018-04-03_20180322_124"
[30] "SantaFeSebastiaoFogaca_PR002_1_9201-80-427_20180508_081"
I get incorrect dates (e.g. in [30] 9201-80-427_20180508_081), and my desired output is:
[1] "IndianaII_PR009111_2017-09-22_2017-10-03_124"
[2] "IndianaII_PR00658_2018-10-24_2018-10-28_124"
...
[28] "SantaFeBaracat_PR0019_2017-08-01_2017-08-09_124"
[29] "NSradeFatimaGlebaA_PR0013_2018-04-03_2018-03-22_124"
[30] "SantaFeSebastiaoFogaca_PR00219_2018-04-27_2018-05-08_081"
Can anyone help with this?
I think this handles those exceptions in the comments on your answer using look ahead:
sub("^\\w+_(\\d{4})(\\d{2})(\\d{2})_(\\d+)_([A-Za-z]+)([A-Z]{2}\\w+)(?=\\d{8})+(\\d{4})(\\d{2})(\\d+)_.*",
"\\5_\\6_\\7-\\8-\\9_\\1-\\2-\\3_\\4", all.images, perl = TRUE)

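To see why the first attempt fails: it hard-codes three digits after the two capital letters, but the identifiers vary in length (PR009111, PR0015A, PR00219), so the date groups start in the wrong place. The lookahead version anchors on the last position that is followed by exactly eight digits. The sketch below applies it to three of the file names from the question. The trailing + on the lookahead is left out here because it is not needed for the match.

```r
imgs <- c(
  "S2B2A_20171003_124_IndianaIIPR00911120170922_BOA_10.tif",
  "S2B2A_20170715_124_VoAnaPR0015A20170803_BOA_10.tif",
  "S2B2A_20180508_081_SantaFeSebastiaoFogacaPR0021920180427_BOA_10.tif"
)

pattern <- paste0(
  "^\\w+_(\\d{4})(\\d{2})(\\d{2})_(\\d+)_",  # first date (groups 1-3), 3-digit code (4)
  "([A-Za-z]+)([A-Z]{2}\\w+)",               # site name (5), identifier (6)
  "(?=\\d{8})(\\d{4})(\\d{2})(\\d+)_.*"      # trailing 8-digit date (groups 7-9)
)

out <- sub(pattern, "\\5_\\6_\\7-\\8-\\9_\\1-\\2-\\3_\\4", imgs, perl = TRUE)
out
# [1] "IndianaII_PR009111_2017-09-22_2017-10-03_124"
# [2] "VoAna_PR0015A_2017-08-03_2017-07-15_124"
# [3] "SantaFeSebastiaoFogaca_PR00219_2018-04-27_2018-05-08_081"
```

perl = TRUE is required because lookaheads are not supported by the default regular expression engine.
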
How to turn rvest output into table

Brand new to R, so I'll try my best to explain this.
I've been playing with data scraping using the "rvest" package. In this example, I'm scraping US state populations from a table on Wikipedia. The code I used is:
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
forecasthtml = html_nodes(statepop, "td")
forecasttext = html_text(forecasthtml)
forecasttext
The resulting output was as follows:
[2] "7000100000000000000♠1"
[3] " California"
[4] "39,250,017"
[5] "37,254,503"
[6] "7001530000000000000♠53"
[7] "738,581"
[8] "702,905"
[9] "12.15%"
[10] "7000200000000000000♠2"
[11] "7000200000000000000♠2"
[12] " Texas"
[13] "27,862,596"
[14] "25,146,105"
[15] "7001360000000000000♠36"
[16] "763,031"
[17] "698,487"
[18] "8.62%"
How can I turn these strings of text into a table that is set up similar to the way it is presented on the original Wikipedia page (with columns, rows, etc)?
Try using rvest's html_table function.
Note that there are five tables on the page, so you will need to specify which table you want to parse.
library(rvest)
statepop = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population")
# find all of the tables on the page
tables <- html_nodes(statepop, "table")
# convert the first table into a data frame
# ([[1]] selects a single node; tables[1] would return a list of one data frame)
table1 <- html_table(tables[[1]])
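Because the live Wikipedia page changes over time, here is a minimal offline sketch of the same approach. The small HTML table is a hypothetical stand-in for the page, using two population figures from the question. It also shows one way to turn the comma-formatted population strings into numbers.

```r
library(rvest)

# Hypothetical two-row table standing in for the Wikipedia page
page <- read_html("
  <table>
    <tr><th>State</th><th>Population</th></tr>
    <tr><td>California</td><td>39,250,017</td></tr>
    <tr><td>Texas</td><td>27,862,596</td></tr>
  </table>")

tables <- html_nodes(page, "table")  # every <table> on the page
pop <- html_table(tables[[1]])       # [[1]] selects a single node

# Strip thousands separators before converting to numbers
pop$Population <- as.numeric(gsub(",", "", pop$Population))
pop
```

On the real page, the index in tables[[1]] may need adjusting to pick the table of interest.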

Case insensitive sort of vector of string in R

I have the following vector:
mylist <- c("MBT.LN.ID", "ISA51VG.LN.ID", "R848.LN.ID", "sHz.LN.ID", "FK565.LN.ID",
"bCD.LN.ID", "MALP2s.LN.ID", "ADX.LN.ID", "AddaVax.LN.ID", "FCA.LN.ID",
"Pam3CSK4.LN.ID", "D35.LN.ID", "ALM.LN.ID", "K3.LN.ID", "K3SPG.LN.ID",
"MPLA.LN.ID", "DMXAA.LN.ID", "cGAMP.LN.ID", "Poly_IC.LN.ID",
"cdiGMP.LN.ID")
I'd like to sort them alphabetically in a case-insensitive manner.
The expected output is this:
[1] "AddaVax.LN.ID" "ADX.LN.ID" "ALM.LN.ID" "bCD.LN.ID" "cdiGMP.LN.ID" "cGAMP.LN.ID"
[7] "D35.LN.ID" "DMXAA.LN.ID" "FCA.LN.ID" "FK565.LN.ID" "ISA51VG.LN.ID" "K3.LN.ID"
[13] "K3SPG.LN.ID" "MALP2s.LN.ID" "MBT.LN.ID" "MPLA.LN.ID" "Pam3CSK4.LN.ID" "Poly_IC.LN.ID"
[19] "R848.LN.ID" "sHz.LN.ID"
I tried this but it failed (using R 3.2.0 alpha):
> sort(mylist)
[1] "ADX.LN.ID" "ALM.LN.ID" "AddaVax.LN.ID" "D35.LN.ID"
[5] "DMXAA.LN.ID" "FCA.LN.ID" "FK565.LN.ID" "ISA51VG.LN.ID"
[9] "K3.LN.ID" "K3SPG.LN.ID" "MALP2s.LN.ID" "MBT.LN.ID"
[13] "MPLA.LN.ID" "Pam3CSK4.LN.ID" "Poly_IC.LN.ID" "R848.LN.ID"
[17] "bCD.LN.ID" "cGAMP.LN.ID" "cdiGMP.LN.ID" "sHz.LN.ID"
Try
mylist[order(tolower(mylist))]
As noted by @Pascal, this is documented in help(Comparison), and sort is locale-specific. One option is to switch your collation locale (for example Sys.setlocale("LC_COLLATE", "en_US.UTF-8"); locale names are platform-specific), but that could be inconvenient. Another option is gtools::mixedsort, which could also be useful because your strings also contain numbers.
library(gtools)
mixedsort(mylist)
# [1] "AddaVax.LN.ID" "ADX.LN.ID" "ALM.LN.ID" "bCD.LN.ID" "cdiGMP.LN.ID" "cGAMP.LN.ID" "D35.LN.ID" "DMXAA.LN.ID" "FCA.LN.ID" "FK565.LN.ID"
# [11] "ISA51VG.LN.ID" "K3.LN.ID" "K3SPG.LN.ID" "MALP2s.LN.ID" "MBT.LN.ID" "MPLA.LN.ID" "Pam3CSK4.LN.ID" "Poly_IC.LN.ID" "R848.LN.ID" "sHz.LN.ID"
> library(searchable)
> sort(ignore.case(mylist))
[1] "AddaVax.LN.ID" "ADX.LN.ID" "ALM.LN.ID" "bCD.LN.ID" "cdiGMP.LN.ID"
[6] "cGAMP.LN.ID" "D35.LN.ID" "DMXAA.LN.ID" "FCA.LN.ID" "FK565.LN.ID"
[11] "ISA51VG.LN.ID" "K3.LN.ID" "K3SPG.LN.ID" "MALP2s.LN.ID" "MBT.LN.ID"
[16] "MPLA.LN.ID" "Pam3CSK4.LN.ID" "Poly_IC.LN.ID" "R848.LN.ID" "sHz.LN.ID"
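If the result should not depend on the session locale at all, one option is to combine tolower with the radix method of order, which always compares in the C locale. This is a minimal sketch on a subset of the vector from the question.

```r
x <- c("sHz.LN.ID", "ADX.LN.ID", "bCD.LN.ID",
       "AddaVax.LN.ID", "cGAMP.LN.ID", "cdiGMP.LN.ID")

# Compare lower-cased copies byte by byte, independent of LC_COLLATE
sorted <- x[order(tolower(x), method = "radix")]
sorted
# [1] "AddaVax.LN.ID" "ADX.LN.ID"     "bCD.LN.ID"     "cdiGMP.LN.ID"
# [5] "cGAMP.LN.ID"   "sHz.LN.ID"
```

The original strings are returned unchanged. Only the sort key is lower-cased.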

Compatibility issue mac/PC on date format

Everything works on my PC at work, but at home on my Mac I run into a problem.
I entered my data in Excel.
It displays dates as dd/mm/yy even when I type dd/mm/yyyy, but it keeps the value the way I typed it (dd/mm/yyyy).
I save the file as a CSV and read it into a data.frame.
Here is the problem:
data$ddn
[1] 29/11/58 25/07/64 25/09/67 03/01/82 15/05/58 29/07/78 22/03/69 23/01/60 15/12/60 16/06/64
[11] 10/12/60 23/08/78 13/04/67 29/11/59 25/09/56 10/10/87 22/06/60 21/06/76 01/11/63 08/07/69
[21] 22/05/52 06/05/69 04/03/64 08/04/75 09/03/54 22/04/69 29/04/71 18/03/79 14/06/71 03/06/60
71 Levels: 01/11/63 01/12/40 02/07/48 03/01/82 03/05/68 03/06/60 04/03/64 05/01/62 ... 31/07/70
> class(data$ddn)
[1] "factor"
data$ddn<-as.Date(data$ddn,format="%d/%m/%Y")  # this syntax works perfectly on my PC
data$ddn
[1] "0058-11-29" "0064-07-25" "0067-09-25" "0082-01-03" "0058-05-15" "0078-07-29" "0069-03-22"
[8] "0060-01-23" "0060-12-15" "0064-06-16" "0060-12-10" "0078-08-23" "0067-04-13" "0059-11-29"
[15] "0056-09-25" "0087-10-10" "0060-06-22" "0076-06-21" "0063-11-01" "0069-07-08" "0052-05-22"
[22] "0069-05-06" "0064-03-04" "0075-04-08" "0054-03-09" "0069-04-22" "0071-04-29" "0079-03-18"
[29] "0071-06-14" "0060-06-03"
data$ddn<-as.Date(data$ddn,format="%d/%m/%y")
data$ddn
[1] "2058-11-29" "2064-07-25" "2067-09-25" "1982-01-03" "2058-05-15" "1978-07-29" "1969-03-22"
[8] "2060-01-23" "2060-12-15" "2064-06-16" "2060-12-10" "1978-08-23" "2067-04-13" "2059-11-29"
[15] "2056-09-25" "1987-10-10" "2060-06-22" "1976-06-21" "2063-11-01" "1969-07-08" "2052-05-22"
[22] "1969-05-06" "2064-03-04" "1975-04-08" "2054-03-09" "1969-04-22" "1971-04-29" "1979-03-18"
[29] "1971-06-14" "2060-06-03"
R chooses to put 19 or 20 before the year and I do not know why.
And if I change the cell format in the original data (text or standard instead of date), 29/11/58 becomes 20056 (again, I am perplexed).
I thought it was an Excel problem, but the CSV that works with R on my PC doesn't work on my Mac.
How can I fix this R compatibility problem between PC and Mac?
Thanks.
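Some background that may explain the output: according to ?strptime, %y on input maps 00 to 68 to 20xx and 69 to 99 to 19xx, which matches the 2058 and 1969 results above. The number 20056 is consistent with an Excel serial date counted from 1 January 1904, the date system used by older Excel for Mac workbooks. The sketch below assumes that every date in the column falls in the 1900s, which is an assumption about this data set and would need adjusting if any year is 2000 or later.

```r
ddn <- c("29/11/58", "03/01/82", "22/03/69")

# Default behaviour: two-digit years 00-68 become 20xx
as.Date(ddn, format = "%d/%m/%y")
# [1] "2058-11-29" "1982-01-03" "1969-03-22"

# Assumption: all years are 19xx, so insert the century explicitly
full <- sub("(\\d{2})$", "19\\1", ddn)
fixed <- as.Date(full, format = "%d/%m/%Y")
fixed
# [1] "1958-11-29" "1982-01-03" "1969-03-22"

# The serial number seen in Excel, read with the 1904 origin
serial <- as.Date(20056, origin = "1904-01-01")
serial
# [1] "1958-11-29"
```

If the column was read as a factor, as in the question, as.character(data$ddn) can be passed to sub in place of ddn.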
