R Markdown export diacritics

I am trying to export a summary of categorical variables through R Markdown. The summary output is in Czech with diacritics, but R isn't able to encode it correctly.
Example:
summary(data$Et6_d1q_ii)
3 a více hodin 1 - 2 hodiny méně než 1 hodinu žádný NA's
           113          240               196   111 6932
Is there any way to set the encoding globally so the output is readable? I wasn't able to find it anywhere.
Thank you!
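A hedged sketch of two common starting points (the file name data.csv is a placeholder, locale names are OS-specific, and the right fix depends on where the mis-encoding happens). Saving the .Rmd file itself as UTF-8 also matters for knitr.
# 1. Declare the file's encoding when reading, so strings are stored correctly:
data <- read.csv("data.csv", fileEncoding = "UTF-8")
# 2. Switch the locale so Czech characters print correctly:
Sys.setlocale("LC_CTYPE", "Czech")           # Windows locale name
# Sys.setlocale("LC_CTYPE", "cs_CZ.UTF-8")   # Linux/macOS equivalent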

Related

Reading values from a data frame in R

I have been trying to find a way to combine these lines into one:
R code:
Datensatz_LR[29,]
Datensatz_LR[48,]
Datensatz_LR[63,]
Datensatz_LR[100,]
Output:
> Datensatz_LR[29,]
   Word
29  Dog
> Datensatz_LR[48,]
   Word
48  cat
> Datensatz_LR[63,]
   Word
63 land
> Datensatz_LR[100,]
    Word
100 shoe
Datensatz_LR is the data set that I have imported into R. I can work with it just fine, but I want to have less code, and thought that maybe I could write something like this
pseudocode:
Datensatz_LR[29,48,63,100]
desired output of the pseudocode:
Datensatz_LR[29,48,63,100]
     Word
29    Dog
48    cat
63   land
100  shoe
Is there a way to write a more elegant version of this? Any help would be greatly appreciated. I apologise for the simple question; I am still relatively new to R.
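A minimal sketch of the usual idiom: pass a vector of row numbers as the first index.
# select several rows at once; drop = FALSE keeps the result a data frame
# (with a single column it would otherwise collapse to a plain vector)
Datensatz_LR[c(29, 48, 63, 100), , drop = FALSE]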

Creating LDA model using gensim from bag-of-words vectors

I want to create a topic model from data provided by JSTOR (e.g. https://www.jstor.org/dfr/about/sample-datasets). However, because of copyright, they do not allow full-text access. Instead, I can request a list of unigrams with their frequencies in the document (supplied as plain .txt), e.g.:
his 295
old 181
he 165
age 152
p 110
from 79
life 74
de 71
petrarch 58
book 51
courtier 47
This should be easy to convert to a bag-of-words vector. However, I have only found examples of gensim LDA models being built from full text. Would it be possible to pass it these vectors instead?
Yes. You only need to convert each (word, frequency) pair to (word_id, frequency) and pass a list of such tuples as the corpus to any gensim model. To map a word to a number, first collect the vocabulary of the whole corpus: if there are V distinct words, each one can be represented as an integer between 1 and V.
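A minimal sketch in Python of that conversion (the parse_counts helper and the file names are illustrative assumptions; it expects each line of the .txt to be a word followed by its count, as in the sample above):
from gensim import models

def parse_counts(path):
    # read "word count" lines into a dict such as {"his": 295, "old": 181}
    counts = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                word, freq = line.rsplit(None, 1)
                counts[word] = int(freq)
    return counts

docs = [parse_counts(p) for p in ["doc1.txt", "doc2.txt"]]  # hypothetical files

# give each word a stable integer id (gensim ids are 0-based) and express
# every document as a list of (word_id, frequency) tuples
token2id = {}
corpus = []
for counts in docs:
    corpus.append([(token2id.setdefault(w, len(token2id)), n)
                   for w, n in counts.items()])

id2word = {i: w for w, i in token2id.items()}
lda = models.LdaModel(corpus, num_topics=10, id2word=id2word)
print(lda.print_topics())
gensim never needs the running text; any iterable of such bag-of-words lists works as a corpus.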

Single line user input in R

I want to read numeric values from the user in R, all entered on a single line. readline() does read the values but returns them as a single character string, so I can't do statistical operations on them, whereas scan() doesn't take multiple numeric values on one line. Please help.
Sample Input
630 135 146 233 144 498 729 120 511 670
Can you suggest a way to prompt the user for these values and store them as numeric, so that I can perform basic statistical operations on them?
as.numeric(unlist(strsplit(readline()," ")))
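A slightly fuller sketch of the same one-liner (the prompt text is illustrative; splitting on the regex "\\s+" tolerates repeated spaces):
x <- as.numeric(strsplit(readline("Enter values: "), "\\s+")[[1]])
mean(x)   # numeric operations now work
sd(x)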

R code chunk printing extra line in Markdown

I'm creating a data analysis report using Markdown and knitr.
When I run a code chunk containing a table,
addmargins(table(x$gender, exclude=NULL))
This is what I get:
##
## Female Male <NA> Sum
##     49   53    0 102
This is what I want:
## Female Male <NA> Sum
##     49   53    0 102
Markdown naturally outputs a lot of white space, and I'm trying to provide as condensed an output as possible since these reports need to be printed. These extra lines add up to be a lot of extra pages.
As far as I've seen, this seems to happen only with tables, and not with other code. It seems that table() is causing the problem by inserting the extra line above the table. Any way to disable this quirk?
I believe table() is printing a blank line for your dimension names. If you specify dnn=NULL, it should go away.
addmargins(table(x$gender, exclude=NULL, dnn=NULL))
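A reproducible sketch of the difference, using made-up data matching the counts above:
x <- data.frame(gender = c(rep("Female", 49), rep("Male", 53)))
addmargins(table(x$gender, exclude = NULL))              # prints the blank dimnames line
addmargins(table(x$gender, exclude = NULL, dnn = NULL))  # no blank line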

read.csv not working as expected in R

I am stumped. Normally, read.csv works as expected, but I have come across an issue where the behavior is unexpected. It most likely is user error on my part, but any help will be appreciated.
Here is the URL for the file
http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip
Here is my code to get the file, unzip, and read it in:
URL <- "http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip"
download.file(URL, destfile="temp.zip")
unzip("temp.zip")
tmp <- read.table("sfa0910.csv",
                  header = TRUE, stringsAsFactors = FALSE,
                  sep = ",", row.names = NULL)
Here is my problem. When I open the csv data in Excel, the data look as expected. When I read the data into R, the first column is actually named row.names. R is reading in one extra column of data, but I can't figure out where the "error" occurs that causes row.names to become a column. Simply put, it looks like the data shifted over.
However, what is strange is that the last column in R does appear to contain the proper data.
Here are a few rows from the first few columns:
tmp[1:5,1:7]
  row.names UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP
1    100654      R     4496       R     1044       R       23
2    100663      R    10646       R     1496       R       14
3    100690      R      380       R        5       R        1
4    100706      R     6119       R      774       R       13
5    100724      R     4638       R     1209       R       26
Any thoughts on what I could be doing wrong?
My tip: use count.fields() as a quick diagnostic when delimited files do not behave as expected.
First, count the number of fields using table():
table(count.fields("sfa0910.csv", sep = ","))
#  451  452
#    1 6852
That tells you that all but one of the lines contain 452 fields. So which is the aberrant line?
which(count.fields("sfa0910.csv", sep = ",") != 452)
# [1] 1
The first line is the problem. On inspection, all lines except the first are terminated by 2 commas.
The question now is: what does that mean? Is there supposed to be an extra field in the header row which was omitted? Or were the 2 commas appended to the other lines in error? It may be best to contact whoever generated the data, if possible, to clarify the ambiguity.
I may have a fix, based on mnel's comments:
dat <- readLines(paste("sfa", '0910', ".csv", sep = ""))
# count the commas on each line
ncommas <- sapply(seq_along(dat),
                  function(x) sum(attributes(gregexpr(',', dat[x])[[1]])$match.length))
> head(ncommas)
[1] 450 451 451 451 451 451
All lines after the first have an extra separator, which Excel ignores.
# strip the final comma from every line except the header
for (i in seq_along(dat)[-1]) {
  dat[i] <- gsub('(.*),', '\\1', dat[i])   # greedy match removes only the last comma
}
write(dat, 'temp.csv')
tmp <- read.table('temp.csv', header = TRUE, stringsAsFactors = FALSE, sep = ",")
> tmp[1:5,1:7]
  UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP SCUGFFP
1 100654        R    4496        R    1044        R      23
2 100663        R   10646        R    1496        R      14
3 100690        R     380        R       5        R       1
4 100706        R    6119        R     774        R      13
5 100724        R    4638        R    1209        R      26
The moral of the story: listen to Joshua Ulrich ;)
Quick fix: open the file in Excel and save it. This will also delete the extra separators.
Alternatively:
# read just the header line and split it into column names
dat <- readLines(paste("sfa", '0910', ".csv", sep = ""), n = 1)
dum.names <- unlist(strsplit(dat, ','))
# read the body with a dummy name for the extra trailing field, then drop it
tmp <- read.table(paste("sfa", '0910', ".csv", sep = ""),
                  header = FALSE, stringsAsFactors = FALSE,
                  col.names = c(dum.names, 'XXXX'), sep = ",", skip = 1)
tmp1 <- tmp[, -dim(tmp)[2]]
I know you've found an answer, but since your answer helped me figure out the following, I'll share it:
If you read into R a file with a different number of columns on different rows, like this:
1,2,3,4,5
1,2,3,4
1,2,3
it is read in with the missing columns filled with NAs, like this:
1,2,3,4,5
1,2,3,4,NA
1,2,3,NA,NA
BUT!
If the row with the most columns is not the first row, like this:
1,2,3,4
1,2,3,4,5
1,2,3
then it is read in a rather confusing way:
1,2,3,4
1,2,3,4
5,NA,NA,NA
1,2,3,NA
(overwhelming before you figure out the problem and quite simple after!)
Just hope it may help someone!
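A reproducible sketch of the wrap-around (one caveat I believe applies: read.table guesses the number of columns from the first five lines of input, so the widest row must come later than that for the confusion to appear; the text= argument avoids needing a file):
txt <- c("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4,5")
read.csv(text = paste(txt, collapse = "\n"), header = FALSE)
# the fifth field of the last row wraps onto a new record padded with NAs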
If you are using local data, also make sure that it's in the right place. To be safe, put it, for instance, in your working directory, which you can set via
setwd("C:/[User]/[MyFolder]")
directly in your R console.
