How to transliterate Polish alphabet with US-ASCII? - transliteration

Is there a more or less standard way to transliterate Polish alphabet with the original ASCII (US-ASCII) characters?
This question can be broken into two related, more precise questions:
How to transliterate the 32 letters of the Polish alphabet with only the 26 letters of the basic Latin alphabet, while maximizing understanding by a Polish reader?
Is there a reversible way to transliterate any Polish text with US-ASCII characters?
I can see that most Polish websites just remove the diacritics in their URLs. For example:
Świętosław Milczący → Swietoslaw Milczacy
Dzierżykraj Łaźniński → Dzierzykraj Lazninski
Józef Soćko → Jozef Socko
This is hardly reversible, but is it the most readable transliteration for Polish readers?
In some other cases, a more complicated ad hoc transliteration might be used, like Wałęsa → Wawensa. Are there any standard rules for this latter kind of transformation?
P.S. Just to clarify, I'm interested in transliteration rules (like ł → w, ę → en), not the implementation. Something like this table.

You could encode the presence of diacritics as a ternary number and store it alongside the plain ASCII transliteration to make it reversible.
URLs often contain some additional IDs, even this one: 48686148/how-to-transliterate-polish-alphabet-with-us-ascii
Here is an example implementation:
trans_table = {
    'A': ('A', 0), 'a': ('a', 0),
    'Ą': ('A', 1), 'ą': ('a', 1),
    'B': ('B', 0), 'b': ('b', 0),
    'C': ('C', 0), 'c': ('c', 0),
    'Ć': ('C', 1), 'ć': ('c', 1),
    'D': ('D', 0), 'd': ('d', 0),
    'E': ('E', 0), 'e': ('e', 0),
    'Ę': ('E', 1), 'ę': ('e', 1),
    'F': ('F', 0), 'f': ('f', 0),
    'G': ('G', 0), 'g': ('g', 0),
    'H': ('H', 0), 'h': ('h', 0),
    'I': ('I', 0), 'i': ('i', 0),
    'J': ('J', 0), 'j': ('j', 0),
    'K': ('K', 0), 'k': ('k', 0),
    'L': ('L', 0), 'l': ('l', 0),
    'Ł': ('L', 1), 'ł': ('l', 1),
    'M': ('M', 0), 'm': ('m', 0),
    'N': ('N', 0), 'n': ('n', 0),
    'Ń': ('N', 1), 'ń': ('n', 1),
    'O': ('O', 0), 'o': ('o', 0),
    'Ó': ('O', 1), 'ó': ('o', 1),
    'P': ('P', 0), 'p': ('p', 0),
    'R': ('R', 0), 'r': ('r', 0),
    'S': ('S', 0), 's': ('s', 0),
    'Ś': ('S', 1), 'ś': ('s', 1),
    'T': ('T', 0), 't': ('t', 0),
    'U': ('U', 0), 'u': ('u', 0),
    'W': ('W', 0), 'w': ('w', 0),
    'Y': ('Y', 0), 'y': ('y', 0),
    'Z': ('Z', 0), 'z': ('z', 0),
    'Ź': ('Z', 1), 'ź': ('z', 1),
    'Ż': ('Z', 2), 'ż': ('z', 2),
}

def pol2ascii(text):
    plain = []
    diacritics = []
    for c in text:
        ascii_char, diacritic = trans_table.get(c, (c, 0))
        plain.append(ascii_char)
        diacritics.append(str(diacritic))
    # Prefix '1' so leading zero digits survive the round trip through int().
    return ''.join(plain) + '_' + hex(int('1' + ''.join(reversed(diacritics)), 3))[2:]

reverse_trans_table = {
    (ascii_char, diacritic): pol_char
    for pol_char, (ascii_char, diacritic) in trans_table.items()
}

def ascii2pol(text):
    plain, diacritics = text.rsplit('_', 1)
    diacritics = int(diacritics, base=16)
    res = []
    for c in plain:
        diacritic = diacritics % 3
        diacritics = diacritics // 3
        pol_char = reverse_trans_table.get((c, diacritic), c)
        res.append(pol_char)
    return ''.join(res)

TESTS = '''
Świętosław Milczący
Dzierżykraj Łaźniński
Józef Soćko
'''

for l in TESTS.strip().splitlines():
    plain = pol2ascii(l)
    original = ascii2pol(plain)
    print(original, plain)
    assert original == l

Ad. 1. The Polish alphabet consists of only two groups of letters: plain Latin letters and Latin letters with diacritics. Therefore, the usual way to transliterate Polish letters is simply to drop the diacritic from the second group, for example:
ą --> a
ć --> c
ż --> z
ź --> z
...
This is the most readable transliteration for Polish readers.
Ad. 2. Definitely not.
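A minimal Python sketch of this diacritic-stripping rule (the two translation strings below cover the nine Polish diacritic letters and their uppercase forms):

```python
# Map each Polish diacritic letter to its plain Latin counterpart.
PL_ASCII = str.maketrans(
    'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ',
    'acelnoszzACELNOSZZ'
)

def strip_polish_diacritics(text):
    # Characters outside the table (spaces, plain letters) pass through.
    return text.translate(PL_ASCII)

print(strip_polish_diacritics('Świętosław Milczący'))  # Swietoslaw Milczacy
```

Note that o/ó and z/ż/ź collapse to the same letters, which is exactly why this mapping is not reversible.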

Related

How can I condense a long list of items into categories for a repeated logit regression?

I'm using a program called Apollo to make an ordered logit model. In this model, you have to specify a list of variables like this:
apollo_beta = c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0,
b_var2_dum1 = 0,
b_var2_dum2 = 0,
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
I want to do two things:
Firstly, I want to be able to specify these beforehand:
specification1 = c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0,
b_var2_dum1 = 0,
b_var2_dum2 = 0,
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
And then be able to call it:
apollo_beta = specification1
Secondly, I want to be able to make categories:
var1 <- c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0)
var2 <- c(
b_var2_dum1 = 0,
b_var2_dum2 = 0)
var3 <- c(
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
And then be able to use those in the specification:
specification1 = c(
var1,
var2,
var3)
And then:
apollo_beta = specification1
I know you might not have the best knowledge of the very niche programme Apollo. I am not quite sure if this is even possible, but since it would save me days (maybe weeks) of work, can anyone give me a hint on what I might be doing wrong? I worry I have a list within a list.
Since I have to make 60 specifications of the same model with different variations of 6 variables, it would be a lot of code and lot of work if I can't shorten it like this.
Any tips would be greatly appreciated.
Data:
df <- data.frame(
var1_dum1 = c(0, 1, 0),
var1_dum2 = c(1, 0, 0),
var1_dum3 = c(0, 0, 1),
var2_dum1 = c(0, 1, 0),
var2_dum2 = c(1, 0, 0),
var3_dum1 = c(1, 1, 0),
var3_dum2 = c(1, 0, 0),
var3_dum3 = c(0, 1, 0),
var3_dum4 = c(0, 0, 1)
)
So there is a dataset with these variables. In Apollo you specify "database = df" first, so it already refers to the variables.
The names in apollo_beta don't refer to the variables directly, so technically you can call them whatever you want. I just want to name them after the variables, since I will refer to them later.
My question is simple: can I condense the long list down to simply "specification1"? It's really a question about the R language: whether the items of the vector behave the same way as when they are written out in full.
In other words, would calling apollo_beta in the three examples above lead to the same result? If not, how do I change the code so that it does?
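To the narrow R question: yes. c() applied to named numeric vectors flattens them into a single named vector, so the condensed specification1 is equivalent to writing all the elements out by hand; there is no list within a list. The same grouping idea, sketched with Python dicts purely for illustration:

```python
# Hypothetical beta groups mirroring the named vectors in the question.
var1 = {'b_var1_dum1': 0, 'b_var1_dum2': 0, 'b_var1_dum3': 0}
var2 = {'b_var2_dum1': 0, 'b_var2_dum2': 0}
var3 = {'b_var3_dum1': 0, 'b_var3_dum2': 0,
        'b_var3_dum3': 0, 'b_var3_dum4': 0}

# Merging the groups yields the same flat mapping as writing it out in full,
# just as c(var1, var2, var3) yields one flat named vector in R.
specification1 = {**var1, **var2, **var3}
apollo_beta = specification1
print(len(apollo_beta))  # 9
```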

R Function for Identifying NA Values Incorrectly Entered as Zeroes

I have a data set with a number of columns like this:
pop <- data.table(group_id = c(1, 1, 1, 1, 1),
N = c(4588, 4589, 0, 4590, 4588),
N_surv_1 = c(0, 0, 4264, 4266, 4264),
N_surv_2 = c(3703, 0, 0, 3710, 3715),
N_surv_3 = c(NA, 3054, 3159, 0, 0) )
group_id N N_surv_1 N_surv_2 N_surv_3
1: 1 4588 0 3703 NA
2: 1 4589 0 0 3054
3: 1 0 4264 0 3159
4: 1 4590 4266 3710 0
5: 1 4588 4264 3715 0
The number of rows per group varies and each row represents a measurement for an entity specified by group_id for a particular point in time. I believe the data was incorrectly entered such that in some cases an NA value indicates a missing value, but in other cases a 0 was entered to indicate an NA value. There are legitimate zero values in the dataset, but I can identify the erroneous ones by looking for differences in column values above a particular threshold. For example
1
3
5
0
3
Might be a legit zero but
50
46
50
0
47
probably wouldn't be.
I think the best solution then would be to look for a string of zeroes followed or proceeded by a large jump and relabel the zeroes as NA. How could I do something like this in R?
dcarlson's advice is spot on. You'll need to think harder on your definition of true zeros.
library(data.table)
pop <- data.table(group_id = c(1, 1, 1, 1, 1),
N = c(4588, 4589, 0, 4590, 4588),
N_surv_1 = c(0, 0, 4264, 4266, 4264),
N_surv_2 = c(3703, 0, 0, 3710, 3715),
N_surv_3 = c(NA, 3054, 3159, 0, 0) )
#Difference approach
pop[c(diff(N),NA)>100,N:=NA,by=group_id]
#This won't handle two zeros in a row that should both be NA.
pop <- data.table(group_id = c(1, 1, 1, 1, 1),
N = c(4588, 4589, 0, 4590, 4588),
N_surv_1 = c(0, 0, 4264, 4266, 4264),
N_surv_2 = c(3703, 0, 0, 3710, 3715),
N_surv_3 = c(NA, 3054, 3159, 0, 0) )
This will use a rolling mean with na.rm=TRUE and a specified cut-off value (here 100):
pop[frollmean(N,3,fill = NA,na.rm=TRUE,align = "left")-N>100,N:=NA]
pop[frollmean(N,3,fill = NA,na.rm=TRUE,align = "right")-N>100,N:=NA]
Need to use the right and left rolling mean to get 'em all.
#But, this uses the column names an excessive number of times (6 times for one operation.) You're likely to generate a typo messing up your data table if you do that.
#Let's start again.
pop <- data.table(group_id = c(1, 1, 1, 1, 1),
N = c(4588, 4589, 0, 4590, 4588),
N_surv_1 = c(0, 0, 4264, 4266, 4264),
N_surv_2 = c(3703, 0, 0, 3710, 3715),
N_surv_3 = c(NA, 3054, 3159, 0, 0) )
RollReplace <- function(dt,colName,maxDiffAllowed){
dt[frollmean(get(colName),3,fill = NA,na.rm=TRUE,align = "left")-get(colName)>maxDiffAllowed,
eval(colName):=NA]
dt[frollmean(get(colName),3,fill = NA,na.rm=TRUE,align = "right")-get(colName)>maxDiffAllowed,
eval(colName):=NA]
}
RollReplace(pop,colName='N',100)
RollReplace(pop,colName='N_surv_1',100)
Still, you want to be careful.
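For comparison outside data.table, the same rolling-mean idea can be sketched in plain Python (None stands in for NA; the window size of 3 and threshold of 100 mirror the assumptions above):

```python
def flag_suspect_zeros(values, threshold=100, window=3):
    """Replace a value with None when a nearby rolling mean (which, like
    frollmean, includes the value itself) exceeds it by more than
    `threshold`. Left- and right-aligned windows catch both edges."""
    def mean(chunk):
        chunk = [v for v in chunk if v is not None]
        return sum(chunk) / len(chunk) if chunk else 0.0
    out = list(values)
    for i, v in enumerate(values):
        if v is None:
            continue
        left = mean(values[i:i + window])                    # window starting at i
        right = mean(values[max(0, i - window + 1):i + 1])   # window ending at i
        if left - v > threshold or right - v > threshold:
            out[i] = None
    return out

print(flag_suspect_zeros([4588, 4589, 0, 4590, 4588]))
# [4588, 4589, None, 4590, 4588]
```

Like the frollmean approach, this can still miss runs of zeros wider than the window, so the caveat above applies here too.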

R Manipulating List of Lists With Conditions / Joining Data

I have the following data showing 5 possible kids to invite to a party and the neighborhoods they live in.
I also have a list of solutions (binary indicators of whether each kid is invited or not; e.g., the first solution invites Kelly, Gina, and Patty).
data <- data.frame(c("Kelly", "Andrew", "Josh", "Gina", "Patty"), c(1, 1, 0, 1, 0), c(0, 1, 1, 1, 0))
names(data) <- c("Kid", "Neighborhood A", "Neighborhood B")
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1), c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))
I'm looking for a way to now filter the solutions in the following ways:
a) Only keep solutions where there are at least 3 kids from both neighborhood A and neighborhood B (one kid can count as one for both if they're part of both)
b) Only keep solutions that have at least 3 kids selected (i.e., sum >= 3)
I think I need to somehow join data to the solutions in solutions, but I'm a bit lost on how to manipulate everything since the solutions are stuck in lists. Basically looking for a way to add entries to every solution in the list indicating a) how many kids the solution has, b) how many kids from neighborhood A, and c) how many kids from neighborhood B. From there I'd have to somehow filter the lists to only keep the solutions that satisfy >= 3?
Thank you in advance!
I wrote a little function to check each solution and return TRUE or FALSE based on your requirements. Passing your solutions to this using sapply() will give you a logical vector, with which you can subset solutions to retain only those that met the requirements.
check_solution <- function(solution, data) {
data <- data[as.logical(solution),]
sum(data[["Neighborhood A"]]) >= 3 && sum(data[["Neighborhood B"]]) >= 3
}
### No need for the function to test whether `sum(solution) >= 3`, since
### this will *always* be true if either neighborhood sum is >= 3.
tests <- sapply(solutions, check_solution, data = data)
# FALSE FALSE FALSE FALSE FALSE
solutions[tests]
# list()
### none of the `solutions` provided actually meet criteria
Edit: OP asked in the comments how to test against all neighborhoods in the data, and return TRUE if a specified number of neighborhoods have enough kids. Below is a solution using dplyr.
library(dplyr)
data <- data.frame(
c("Kelly", "Andrew", "Josh", "Gina", "Patty"),
c(1, 1, 0, 1, 0),
c(0, 1, 1, 1, 0),
c(1, 1, 1, 0, 1),
c(0, 1, 1, 1, 1)
)
names(data) <- c("Kid", "Neighborhood A", "Neighborhood B", "Neighborhood C",
"Neighborhood D")
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1),
c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))
check_solution <- function(solution,
data,
min_kids = 3,
min_neighborhoods = NULL) {
neighborhood_tests <- data %>%
filter(as.logical(solution)) %>%
summarize(across(starts_with("Neighborhood"), ~ sum(.x) >= min_kids)) %>%
as.logical()
# require all neighborhoods by default
if (is.null(min_neighborhoods)) min_neighborhoods <- length(neighborhood_tests)
sum(neighborhood_tests) >= min_neighborhoods
}
tests1 <- sapply(solutions, check_solution, data = data)
solutions[tests1]
# list()
tests2 <- sapply(
solutions,
check_solution,
data = data,
min_kids = 2,
min_neighborhoods = 3
)
solutions[tests2]
# [[1]]
# [1] 1 0 0 1 1
#
# [[2]]
# [1] 0 1 0 1 1

gt table how to add a border to column by spanner column label?

I want to add a border to the left of a group of columns that share the same column spanner label, and I don't know how to do it.
I tried this:
%>% tab_style(
  style = list(
    cell_borders(
      sides = "left",
      color = "black",
      weight = px(3)
    )
  ),
  locations = cells_column_spanners(everything())
)
but it only adds the border to the column spanner label itself, not to the entire columns.
Do you have any idea how to do it?
I currently get the result at the top, and I want the result at the bottom:
Thanks for your help!
Example data (the gt object prints a lot of lines, so I can't paste it here):
x<-structure(list(A = c("1", "2", "3"),
`ONE||N` = c(0, 0, 0), `ONE||%` = c(0, 0, 0), `TWO||N` = c(0,
0, 0), `TWO||%` = c(0, 0, 0), `THREE||N` = c(0, 0, 0), `THREE||%` = c(0,
0, 0), `THREE||Δ` = c(0, 0, 0), `FOUR||N` = c(0, 0, 0),
`FOUR||%` = c(0, 0, 0), `TOTAL||%` = c(0, 0, 0)))
I didn't manage to do it by spanner label, so I did it with the column labels instead. If anyone knows how to add a style to a group of columns (the entire columns, not only the top!) by their spanner label, feel free to share! :)

Remove part of string from column names [closed]

Here's the data:
structure(list(Fasta.headers = c("Person01050.1", "Person01080.1",
"Person01090.1", "Person01100.4", "Person01140.1", "Person01220.1"),
ToRemove.Gr_1 = c(0, 1107200, 17096000, 0, 0, 0), ToRemove.Gr_10 = c(0,
37259000, 1104800000, 783870, 0, 1308600), ToRemove.Gr_11 = c(1835800,
53909000, 623960000, 0, 0, 0), ToRemove.Gr_12 = c(0, 19117000,
808600000, 0, 0, 719400), ToRemove.Gr_13 = c(2544200, 2461400,
418770000, 0, 0, 0), ToRemove.Gr_14 = c(5120400, 1373700,
117330000, 0, 0, 0), ToRemove.Gr_15 = c(6623500, 0, 73336000,
0, 0, 0), ToRemove.Gr_16 = c(0, 0, 31761000, 0, 0, 0), ToRemove.Gr_17 = c(13475000,
0, 29387000, 0, 0, 0), ToRemove.Gr_18 = c(7883300, 0, 27476000,
0, 0, 0), ToRemove.Gr_19 = c(82339000, 3254700, 50825000,
0, 0, 0), ToRemove.Gr_2 = c(1584100, 84847000, 5219500000,
6860700, 0, 8337700), ToRemove.Gr_20 = c(205860000, 0, 67685000,
0, 0, 0), ToRemove.Gr_21 = c(867120000, 1984400, 2.26e+08,
0, 0, 10502000)), .Names = c("Fasta.headers", "ToRemove.Gr_1",
"ToRemove.Gr_10", "ToRemove.Gr_11", "ToRemove.Gr_12", "ToRemove.Gr_13",
"ToRemove.Gr_14", "ToRemove.Gr_15", "ToRemove.Gr_16", "ToRemove.Gr_17",
"ToRemove.Gr_18", "ToRemove.Gr_19", "ToRemove.Gr_2", "ToRemove.Gr_20",
"ToRemove.Gr_21"), row.names = c(NA, 6L), class = "data.frame")
As already column names suggests part "ToRemove" should be removed from the name and only Gr_* should stay behind.
I would appreciate two solutions to this problem: one that deletes part of a column name based on a given string, and one based on a specific character, such as a dot, that removes the whole part before or after it.
We can use sub
names(df1)[-1] <- sub(".*\\.", "", names(df1)[-1])
If we need to keep a leading . as well, replace with .:
names(df1)[-1] <- sub(".*\\.", ".", names(df1)[-1])
To match the pattern exactly, we can also match zero or more characters that are not a dot ([^.]*) from the start (^) of the string, followed by a dot (\\. - the dot is escaped because as a metacharacter it would match any character), and replace the match with blank ("")
sub("^[^.]*\\.", "", names(df1)[-1])
#[1] "Gr_1" "Gr_10" "Gr_11" "Gr_12" "Gr_13" "Gr_14" "Gr_15" "Gr_16"
#[9] "Gr_17" "Gr_18" "Gr_19" "Gr_2" "Gr_20" "Gr_21"
Since the prefix 'ToRemove' was already mentioned above, we can also remove it as a fixed string:
sub("ToRemove.", "", names(df1)[-1], fixed = TRUE)
Also, if we instead need to remove all characters from the . onward:
sub("\\..*", "", names(df1)[-1])
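The same patterns carry over to Python's re module, sketched here for illustration:

```python
import re

names = ["Fasta.headers", "ToRemove.Gr_1", "ToRemove.Gr_10"]

# Greedy .*\. consumes everything up to and including the last dot.
print([re.sub(r".*\.", "", n) for n in names[1:]])      # ['Gr_1', 'Gr_10']

# Anchored version: strip the leading run of non-dot characters plus the dot.
print([re.sub(r"^[^.]*\.", "", n) for n in names[1:]])  # ['Gr_1', 'Gr_10']

# Fixed-string removal of the known prefix (like fixed = TRUE in sub()).
print([n.replace("ToRemove.", "") for n in names[1:]])  # ['Gr_1', 'Gr_10']
```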
