Combining categories of a categorical variable in R or Stata

I would like to combine some Brazilian political party names from a categorical variable (partido_pref) that was wrongly coded.
The categories that I would like to combine are "PC do B" and "PCdoB", and "PT do B" and "PTdoB". The parties with and without space are the same parties.
I would rather do it in Stata, but I can also work in R.
Below you will find the list of political parties.
. tab partido_pref
partido_pref | Freq. Percent Cum.
---------------+-----------------------------------
DEM | 2,267 2.14 2.14
NA | 34,848 32.84 34.98
Não disponível | 2 0.00 34.98
Outra situação | 19 0.02 35.00
PAN | 6 0.01 35.00
PC do B | 260 0.25 35.25
PCB | 2 0.00 35.25
PCdoB | 7 0.01 35.26
PCO | 1 0.00 35.26
PDT | 3,933 3.71 38.97
PFL | 6,811 6.42 45.39
PHS | 194 0.18 45.57
PL | 2,525 2.38 47.95
PMDB | 14,833 13.98 61.93
PMN | 410 0.39 62.31
PP | 5,467 5.15 67.47
PPB | 1,661 1.57 69.03
PPL | 10 0.01 69.04
PPS | 2,493 2.35 71.39
PR | 1,861 1.75 73.14
PRB | 298 0.28 73.43
PRN | 9 0.01 73.43
PRONA | 26 0.02 73.46
PRP | 273 0.26 73.72
PRTB | 121 0.11 73.83
PSB | 2,905 2.74 76.57
PSC | 480 0.45 77.02
PSD | 816 0.77 77.79
PSDB | 11,316 10.66 88.45
PSDC | 121 0.11 88.57
PSL | 273 0.26 88.83
PSOL | 4 0.00 88.83
PST | 48 0.05 88.87
PSTU | 1 0.00 88.88
PT | 5,258 4.96 93.83
PT do B | 139 0.13 93.96
PTB | 5,383 5.07 99.03
PTC | 140 0.13 99.17
PTdoB | 10 0.01 99.18
PTN | 108 0.10 99.28
PV | 702 0.66 99.94
Recusa | 2 0.00 99.94
Sem partido | 62 0.06 100.00
---------------+-----------------------------------
Total | 106,105 100.00
Thank you in advance!

One option is fct_collapse from forcats. Note that the result has to be assigned back, and the new level names can be whichever spellings you want to keep:
library(forcats)
df1$partido_pref <- fct_collapse(df1$partido_pref,
                                 "PCdoB" = c("PC do B", "PCdoB"),
                                 "PTdoB" = c("PT do B", "PTdoB"))

If your problem is just getting rid of whitespace, in Stata (note that subinstr() takes a fourth argument; . means replace all occurrences):
replace partido_pref = subinstr(partido_pref, " ", "", .)
See help string_functions for more options.
R is more flexible, but Stata handles this level of simple text management just fine.
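For the R route, the same whitespace idea works in base R without forcats; a minimal sketch, with a toy vector standing in for partido_pref (assigning duplicated level names merges the corresponding levels):

```r
# Collapse levels that differ only by internal spaces, e.g. "PC do B" -> "PCdoB".
partido_pref <- factor(c("PC do B", "PCdoB", "PT do B", "PTdoB", "PT"))
levels(partido_pref) <- gsub(" ", "", levels(partido_pref))
table(partido_pref)
# PCdoB    PT PTdoB
#     2     1     2
```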


summarytools::freq gives unintended results when variables are factors without NA

Since my data has summarized counts, I am using the freq function from summarytools with weights.
With weights, the freq function works fine for summarizing a column when:
the column is numeric or integer, or
the column is a factor with NA or NaN values.
But when the column is a factor without NA or NaN values, the summary takes a level away from the column and reports its count under NA.
I came across this issue in a live case and have reproduced a sample.
library(data.table)
library(summarytools)
dt <- data.table(
  A = as.integer(c(5, 3, 4, 5, 6, 1, 2, NA, 3, NaN)),
  B = c(5, 3, 4, 5, 6, 1, 2, NA, 3, NaN),
  C = as.factor(c(5, 3, 4, 5, 6, 1, 2, NA, 3, NA)),
  D = as.factor(c(5, 3, 4, 5, 6, 1, 2, NaN, 3, NaN)),
  E = as.factor(c(5, 3, 4, 5, 6, 1, 2, 5, 3, 3)),
  Frequency = c(10, 20, 30, 40, 5, 60, 7, 80, 99, 10)
)
str(dt)
# Whether Frequency is integer or numeric does not matter;
# a factor without NaN or NA values is what makes the difference.
writeLines("\n\n\n Without weights: No errors")
#summarytools::freq(dt[,1:5]) #Commented to minimize clutter
writeLines("\n\n\n With weights, Column E shows incorrect values but not C and D")
summarytools::freq(dt[,1:5],weights=dt$Frequency)
Without weights: No errors
With weights, Column E shows incorrect values but not C and D
1 NaN value(s) converted to NA
0 NaN value(s) converted to NA
Weighted Frequencies
dt$A
Weights: weights
Freq % Valid % Valid Cum. % Total % Total Cum.
1 60.00 22.14 22.14 16.62 16.62
2 7.00 2.58 24.72 1.94 18.56
3 119.00 43.91 68.63 32.96 51.52
4 30.00 11.07 79.70 8.31 59.83
5 50.00 18.45 98.15 13.85 73.68
6 5.00 1.85 100.00 1.39 75.07
<NA> 90.00 24.93 100.00
Total 361.00 100.00 100.00 100.00 100.00
dt$B
Type: Numeric
Freq % Valid % Valid Cum. % Total % Total Cum.
1 60.00 22.14 22.14 16.62 16.62
2 7.00 2.58 24.72 1.94 18.56
3 119.00 43.91 68.63 32.96 51.52
4 30.00 11.07 79.70 8.31 59.83
5 50.00 18.45 98.15 13.85 73.68
6 5.00 1.85 100.00 1.39 75.07
<NA> 90.00 24.93 100.00
Total 361.00 100.00 100.00 100.00 100.00
dt$C
Type: Factor
Freq % Valid % Valid Cum. % Total % Total Cum.
1 60.00 22.14 22.14 16.62 16.62
2 7.00 2.58 24.72 1.94 18.56
3 119.00 43.91 68.63 32.96 51.52
4 30.00 11.07 79.70 8.31 59.83
5 50.00 18.45 98.15 13.85 73.68
6 5.00 1.85 100.00 1.39 75.07
<NA> 90.00 24.93 100.00
Total 361.00 100.00 100.00 100.00 100.00
dt$D
Type: Factor
Freq % Valid % Valid Cum. % Total % Total Cum.
1 60.00 22.14 22.14 16.62 16.62
2 7.00 2.58 24.72 1.94 18.56
3 119.00 43.91 68.63 32.96 51.52
4 30.00 11.07 79.70 8.31 59.83
5 50.00 18.45 98.15 13.85 73.68
6 5.00 1.85 100.00 1.39 75.07
<NA> 90.00 24.93 100.00
Total 361.00 100.00 100.00 100.00 100.00
dt$E
Type: Factor
Freq % Valid % Valid Cum. % Total % Total Cum.
1 60.00 16.85 16.85 16.62 16.62
2 7.00 1.97 18.82 1.94 18.56
3 129.00 36.24 55.06 35.73 54.29
4 30.00 8.43 63.48 8.31 62.60
5 130.00 36.52 100.00 36.01 98.61
<NA> 5.00 1.39 100.00
Total 361.00 100.00 100.00 100.00 100.00
A fix was issued for this. You can install the latest version from GitHub with:
devtools::install_github("dcomtois/summarytools")
or, to get the latest development version:
devtools::install_github("dcomtois/summarytools", ref = "dev-current")
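Until the fixed version is installed, a base-R sketch of the weighted tabulation sidesteps the bug entirely; this reuses column E and the Frequency weights from the example above:

```r
# Weighted frequency table for a factor, computed directly with tapply.
x <- factor(c(5, 3, 4, 5, 6, 1, 2, 5, 3, 3))   # column E
w <- c(10, 20, 30, 40, 5, 60, 7, 80, 99, 10)   # Frequency weights
freq_tab <- tapply(w, x, sum)                  # total weight per level
pct <- round(100 * freq_tab / sum(freq_tab), 2)
freq_tab
#   1   2   3   4   5   6
#  60   7 129  30 130   5
```

Note that level 6 correctly keeps its weight of 5 here, instead of being shifted into NA.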

Convert a string to a number using jq

After parsing some json I have numbers like the following
1 BTC 1.1 -1.27 4.5 12483.315628 209496088918
2 XRP -1.14 20.92 153.78 3.0061025564 116453842357
3 ETH -1.08 13.41 40.64 847.89295234 82049924696.0
4 BCH 0.51 -9.21 -5.22 2025.07027989 34210094446.0
5 ADA 1.12 14.9 205.14 0.9950722725 25799309000.0
6 XEM -0.02 20.4 100.84 1.4710629893 13239566903.0
and I would like to convert them to numbers before printing them because they're not that readable this way.
In the last but one column I'd like to truncate precision to 3 digits after the dot and in the last column I'd like to divide by 1M. In the json these fields are strings. I'm forced to use jq.
The context is this:
curl -s "https://api.coinmarketcap.com/v1/ticker/?convert=EUR&limit=20" | jq -r '.[] | [.rank, .symbol, .percent_change_1h, .percent_change_24h, .percent_change_7d, .price_eur, .market_cap_eur] | @tsv'
I added an issue on the project's GitHub, so check there for updates too.
jq + awk solution (assuming a Linux environment):
curl -s "https://api.coinmarketcap.com/v1/ticker/?convert=EUR&limit=20" \
| jq -r '.[] | [.rank, .symbol, .percent_change_1h, .percent_change_24h,
.percent_change_7d, .price_eur, .market_cap_eur] | @tsv' \
| awk '{ $6 = sprintf("%.3f", $6); $7 = $7 / 10^6 }1' | column -t
The output:
1 BTC 1.62 0.76 5.63 12618.627 211768
2 XRP -4.04 13.64 142.7 2.885 111773
3 ETH -1.12 11.55 40.07 845.206 81790.2
4 BCH 1.01 -7.0 -4.13 2043.695 34524.9
5 ADA -2.79 10.38 194.99 0.966 25046.2
6 XEM -1.64 17.26 96.98 1.445 13006.7
7 XLM -3.54 -9.96 267.75 0.660 11804
8 TRX -0.48 152.86 454.55 0.169 11132.6
9 LTC -1.64 -3.04 -2.97 197.775 10801.2
10 MIOTA 5.79 4.76 19.54 3.481 9675.58
11 DASH 0.73 8.52 13.65 1045.762 8155.11
12 NEO 0.13 9.17 69.28 87.454 5684.52
13 EOS -0.25 27.87 22.32 9.652 5628.04
14 XMR 0.32 0.28 6.36 333.080 5183.6
15 BTG 1.12 2.98 -2.53 231.167 3870.93
16 QTUM 0.28 3.58 11.99 50.186 3702.88
17 XRB 4.29 15.4 160.91 25.188 3356.32
18 ETC -3.23 10.74 24.11 30.394 3005.25
19 ICX -3.59 9.69 37.95 6.352 2398.27
20 LSK 5.42 9.61 -1.45 18.794 2192.41
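Since the question is specifically about jq, a pure-jq sketch using tonumber drops the awk step; field names match the original query, and multiplying by 1000 before floor truncates (rather than rounds) to three decimals:

```shell
curl -s "https://api.coinmarketcap.com/v1/ticker/?convert=EUR&limit=20" \
| jq -r '.[] | [.rank, .symbol, .percent_change_1h, .percent_change_24h,
                .percent_change_7d,
                (.price_eur      | tonumber * 1000 | floor / 1000),
                (.market_cap_eur | tonumber / 1e6)] | @tsv' \
| column -t
```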

Apply a function for each pair of vector elements in R

I have a data frame of elements imported from mongo database, sample:
 #    Mcg   Gvh   Alm   Mit   Vac   Nuc
 1   0.62  0.42  0.47  0.16  0.52  0.22
 2   0.60  0.40  0.52  0.46  0.53  0.22
 3   0.47  0.32  0.64  0.30  0.63  0.22
 4   0.46  0.39  0.54  0.22  0.48  0.25
 5   0.50  0.46  0.52  0.16  0.52  0.26
data = data.frame(
Mcg = c(0.62, 0.60, 0.47, 0.46, 0.50),
Gvh = c(0.42, 0.40, 0.32, 0.39, 0.46),
Alm = c(0.47, 0.52, 0.64, 0.54, 0.52),
Mit = c(0.16, 0.46, 0.30, 0.22, 0.16),
Vac = c(0.52, 0.53, 0.63, 0.48, 0.52),
Nuc = c(0.22, 0.22, 0.22, 0.25, 0.26),
row.names = c(1:5)
)
I need to compute number of shared nearest neighbors for each pair of examples based on distance matrix and if the value is above given threshold save those examples indexes in a list named clusters.
 #     1     2     3     4     5
 1  0.00  0.31  0.31  0.19  0.14
 2  0.31  0.00  0.27  0.28  0.32
 3  0.31  0.27  0.00  0.21  0.26
 4  0.19  0.28  0.21  0.00  0.11
 5  0.14  0.32  0.26  0.11  0.00
dists = as.matrix(
dist(data, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
)
I already have a working code with 'for' loops:
# get indexes of k nearest neighbors
Neighbors <- function(id, k) {
as.integer( names( sort( dists[id, -id] )[1:k] ) )
}
threshold = 5
k = 10
clusters = vector("list", (k - threshold))
for (y in 1:(nrow(data) - 1)) {
neighbors.e1 = Neighbors(y, k)
for (z in (y+1):(nrow(data))) {
neighbors.e2 = Neighbors(z, k)
snn = sum(table(c(neighbors.e1, neighbors.e2)) == 2)
if (snn > threshold) {
cluster.id = snn - threshold
if (!(y %in% clusters[[cluster.id]])) {
clusters[[cluster.id]] = c(clusters[[cluster.id]], y)
}
if (!(z %in% clusters[[cluster.id]])) {
clusters[[cluster.id]] = c(clusters[[cluster.id]], z)
}
}
}
}
This solution is really slow, so I want to optimize it by replacing the loops with an apply-family function.
Here is a 'clusters' list for a complete dataset (1484 rows)
> clusters
[[1]]
[1] 1 3 6 7 19 2 23 8 16 17 18 20 4 5 14 21 24 15 9 10 12 22 11 13
[[2]]
[1] 1 17 24 2 3 7 8 16 18 20 6 4 10 12 13 22 5 19 21 23 9 15 11 14
[[3]]
[1] 4 11 7 16 19 8 18 10 12 13 14 22 20 17 24
[[4]]
[1] 5 21 7 8 18 20
[[5]]
NULL
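No answer is recorded for this question here, but a vectorized sketch of the pairwise SNN computation is possible: precompute each point's neighbors once (the nested loops recompute them repeatedly), then count shared neighbors per pair with combn. This reuses the question's dists matrix, with k lowered to 3 so the 5-row sample has enough neighbors:

```r
# Precompute each point's k nearest neighbors, then count shared neighbors
# for every unordered pair without nested loops.
data <- data.frame(
  Mcg = c(0.62, 0.60, 0.47, 0.46, 0.50),
  Gvh = c(0.42, 0.40, 0.32, 0.39, 0.46),
  Alm = c(0.47, 0.52, 0.64, 0.54, 0.52),
  Mit = c(0.16, 0.46, 0.30, 0.22, 0.16),
  Vac = c(0.52, 0.53, 0.63, 0.48, 0.52),
  Nuc = c(0.22, 0.22, 0.22, 0.25, 0.26)
)
dists <- as.matrix(dist(data, method = "euclidean"))
k <- 3
nn <- lapply(seq_len(nrow(dists)), function(i)
  as.integer(names(sort(dists[i, -i])[1:k])))
pairs <- combn(length(nn), 2)                 # all unordered index pairs
# snn[m] is the shared-neighbor count for the m-th pair, in combn order
snn <- apply(pairs, 2, function(p) length(intersect(nn[[p[1]]], nn[[p[2]]])))
```

The thresholding and cluster bookkeeping from the question can then be applied to snn and pairs in one pass.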

How to read .txt tables into R

I have a .txt file which I think was output from STATA, but I'm not sure. It is a list of tables formatted like this:
Q1 | Freq. Percent Cum.
------------+-----------------------------------
answer | 35 21.08 21.08
text | 4 2.41 23.49
words | 35 21.08 44.5
something | 38 22.89 67.47
blah | 54 32.53 100.00
------------+-----------------------------------
Total | 166 100.00
Q2 | Freq. Percent Cum.
------------------+-----------------------------------
foo | 1 0.60 0.60
blahblah | 11 6.63 7.23
etc | 26 15.66 22.89
more text | 82 49.40 72.29
answer | 7 4.22 76.51
survey response | 39 23.49 100.00
------------------+-----------------------------------
Total | 166 100.00
Q3 | Freq. Percent Cum.
------------+-----------------------------------
option | 7 4.22 4.22
text | 24 14.46 18.67
blahb | 25 15.06 33.73
more text | 82 49.40 83.13
etc | 28 16.87 100.00
------------+-----------------------------------
Total | 166 100.00
It continues for about 200 questions and their respective survey answers. Does anyone know how I can quickly read each survey question into separate data frames in R?
No need for scan():
txt <- " Q1 | Freq. Percent Cum.
------------+-----------------------------------
answer | 35 21.08 21.08
text | 4 2.41 23.49
words | 35 21.08 44.5
something | 38 22.89 67.47
blah | 54 32.53 100.00
------------+-----------------------------------
Total | 166 100.00
Q2 | Freq. Percent Cum.
------------------+-----------------------------------
foo | 1 0.60 0.60
blahblah | 11 6.63 7.23
etc | 26 15.66 22.89
more text | 82 49.40 72.29
answer | 7 4.22 76.51
survey response | 39 23.49 100.00
------------------+-----------------------------------
Total | 166 100.00
Q3 | Freq. Percent Cum.
------------+-----------------------------------
option | 7 4.22 4.22
text | 24 14.46 18.67
blahb | 25 15.06 33.73
more text | 82 49.40 83.13
etc | 28 16.87 100.00
------------+-----------------------------------
Total | 166 100.00"
library(purrr)
library(dplyr)  # needed for mutate() and select() in the combined data frame step
You can just as easily read from a file vs the above text vector. The main goal here is to remove cruft from the data and get it into a form we can work with, so we get rid of the dashed lines and the Total line, and convert spaces to commas. This makes a big assumption about your data format so it needs to be consistent.
readLines(textConnection(txt)) %>%
discard(~grepl("(----|Total)", .)) %>%
gsub("[[:space:]]*\\|[[:space:]]*", ",", .) %>%
gsub("[[:space:]][[:space:]]+", ",", .) %>%
gsub("^,", "", .) -> lines
There's a blank line between the tables. This is another assumption the code makes. We find that blank line and extract the lines that are between blanks (including the start and end of the text). Then we read that into a data frame with read.csv.
starts <- c(1, which(lines=="")+1)
ends <- c(which(lines=="")-1, length(lines))
map2(starts, ends, function(start, end) {
read.csv(textConnection(lines[start:end]), stringsAsFactors=FALSE)
})
This results in a list of data frames:
## [[1]]
## Q1 Freq. Percent Cum.
## 1 answer 35 21.08 21.08
## 2 text 4 2.41 23.49
## 3 words 35 21.08 44.50
## 4 something 38 22.89 67.47
## 5 blah 54 32.53 100.00
##
## [[2]]
## Q2 Freq. Percent Cum.
## 1 foo 1 0.60 0.60
## 2 blahblah 11 6.63 7.23
## 3 etc 26 15.66 22.89
## 4 more text 82 49.40 72.29
## 5 answer 7 4.22 76.51
## 6 survey response 39 23.49 100.00
##
## [[3]]
## Q3 Freq. Percent Cum.
## 1 option 7 4.22 4.22
## 2 text 24 14.46 18.67
## 3 blahb 25 15.06 33.73
## 4 more text 82 49.40 83.13
## 5 etc 28 16.87 100.00
But, I think this might be more useful as one big data frame:
map2_df(starts, ends, function(start, end) {
df <- read.csv(textConnection(lines[start:end]), stringsAsFactors=FALSE)
colnames(df) %>%
tolower() %>%
gsub("\\.", "", .) -> cols
question <- cols[1]
cols[1] <- "text"
setNames(df, cols) %>%
mutate(question=question) %>%
mutate(n=1:nrow(.)) %>%
select(question, n, text, freq, percent, cum) %>%
mutate(percent=percent/100, cum=cum/100)
})
## question n text freq percent cum
## 1 q1 1 answer 35 0.2108 0.2108
## 2 q1 2 text 4 0.0241 0.2349
## 3 q1 3 words 35 0.2108 0.4450
## 4 q1 4 something 38 0.2289 0.6747
## 5 q1 5 blah 54 0.3253 1.0000
## 6 q2 1 foo 1 0.0060 0.0060
## 7 q2 2 blahblah 11 0.0663 0.0723
## 8 q2 3 etc 26 0.1566 0.2289
## 9 q2 4 more text 82 0.4940 0.7229
## 10 q2 5 answer 7 0.0422 0.7651
## 11 q2 6 survey response 39 0.2349 1.0000
## 12 q3 1 option 7 0.0422 0.0422
## 13 q3 2 text 24 0.1446 0.1867
## 14 q3 3 blahb 25 0.1506 0.3373
## 15 q3 4 more text 82 0.4940 0.8313
## 16 q3 5 etc 28 0.1687 1.0000

renaming a column in a dataframe using a value from the global environment in R

I have a data frame (summary_transposed_no_time) and I want to rename one of the columns to a name I have stored as a value.
summary_transposed_no_time looks like this:
| A | B | C | D
------ | ------ | ------ | ------ | ------
area_1 | 0.870 | 0.435 | 0.968 | 0.679
area_2 | 0.456 | 0.259 | 0.906 | 0.467
area_3 | 0.298 | 0.256 | 0.457 | 0.768
area_4 | 0.994 | 0.987 | 0.365 | 0.765
My value is stored in test_col and is set to "B", so I have tried the following code with no luck:
summary_transposed_no_time <- names(summary_transposed_no_time)[c(test_col)]<-c("test")
Desired output
| A | test | C | D
------ | ------ | ------ | ------ | ------
area_1 | 0.870 | 0.435 | 0.968 | 0.679
area_2 | 0.456 | 0.259 | 0.906 | 0.467
area_3 | 0.298 | 0.256 | 0.457 | 0.768
area_4 | 0.994 | 0.987 | 0.365 | 0.765
I think you need (I have replaced summary_transposed_no_time with x):
names(x)[match(test_col, names(x))] <- "test"
A demo with the built-in trees data:
x <- trees[1:5, ]
# Girth Height Volume
#1 8.3 70 10.3
#2 8.6 65 10.3
#3 8.8 63 10.2
#4 10.5 72 16.4
#5 10.7 81 18.8
names(x)[match("Girth", names(x))] <- "test"
# test Height Volume
#1 8.3 70 10.3
#2 8.6 65 10.3
#3 8.8 63 10.2
#4 10.5 72 16.4
#5 10.7 81 18.8
