convert a docx text into dataset or matrix using R [closed] - r

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 days ago.
I have a 150-page Microsoft Word document listing information on books. I include an example right here:
The document contains several pieces of information for each book, and hundreds of different books are listed in this docx file. I want to extract the information and convert it into a classic dataset with columns such as "Title (en)", "Title (de)" ... "Abstract" and so on. That way all the info would be organized in a dataset with one row per book and one column per attribute (English name, abstract, and so on). The catch is that not all books have the same pieces of information: sometimes the Abstract (for example) is missing, so I would need that cell to be empty.
How can I do that in R? I am not new to R, but I have never worked with text, so I am not sure what the best approach would be here.
Thanks
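Since the example from the document is not shown, the following is only a rough sketch of one possible approach. It assumes the text has already been read into R as a character vector of paragraphs, that each attribute sits on its own line as "Field: value", and that books are separated by blank lines. The sample data, the `fields` vector and the `parse_book` helper are all hypothetical.

```r
# Hypothetical stand-in for paragraphs extracted from the .docx file.
paragraphs <- c(
  "Title (en): Book One",
  "Title (de): Buch Eins",
  "Abstract: A short summary of the first book.",
  "",
  "Title (en): Book Two",
  "Title (de): Buch Zwei"
)

# The columns wanted in the final dataset (assumed field labels).
fields <- c("Title (en)", "Title (de)", "Abstract")

# Number the records: every blank line starts a new book.
book_id <- cumsum(paragraphs == "") + 1
keep <- paragraphs != ""

# Split "Field: value" lines and return the values in the order of `fields`;
# a field that does not occur in a record comes back as NA.
parse_book <- function(lines) {
  sep <- regexpr(": ", lines, fixed = TRUE)
  keys <- substr(lines, 1, sep - 1)
  values <- substr(lines, sep + 2, nchar(lines))
  unname(values[match(fields, keys)])
}

rows <- lapply(split(paragraphs[keep], book_id[keep]), parse_book)
books <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(books) <- fields
books
```

Missing attributes end up as NA, which is R's usual way of representing an empty cell. If the real document uses a different layout, the splitting rules would need to be adapted.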

Related

Why are these states not displaying the data when the data exists with the usmap package? [closed]

Closed 2 years ago.
I'm trying to plot election results on the US map with the usmap package, but even though the dataset is complete, I get a plot that shows missing values for some states. The states are greyed out and I'm not sure why this is happening.
library(usmap)
library(ggplot2)

plot_usmap(data = data_total, values = 'percent_biden') +
  scale_fill_continuous(low = 'red', high = 'blue', name = 'Percent for Biden') +
  theme(legend.position = 'right') +
  ggtitle("Total Popular Vote of Final Results")
You are incorrectly assuming that usmap will infer any format for state names. For instance, both of these produce a working map,
usmap::plot_usmap(data=data.frame(state=c("alabama","new york"),s=c(5,15)), values="s")
usmap::plot_usmap(data=data.frame(state=c("AL","NY"),s=c(5,15)), values="s")
whereas inferring from your pic of data, you are trying
usmap::plot_usmap(data=data.frame(state=c("alabama","new-york"),s=c(5,15)), values="s")
# ^ dash, not space
So I believe you need to clean up your data and fix your state names.
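As a minimal sketch of that cleanup, assuming the only problem is dashes where spaces belong and that the names live in a column called state (the sample values below are made up for illustration):

```r
# Hypothetical stand-in for the asker's data; the values are invented.
data_total <- data.frame(
  state = c("alabama", "new-york", "north-carolina"),
  percent_biden = c(40, 60, 50),
  stringsAsFactors = FALSE
)

# Replace every dash with a space so the names match what usmap expects.
data_total$state <- gsub("-", " ", data_total$state, fixed = TRUE)
data_total$state
```

If the real data has other inconsistencies in the state names, those would need their own fixes.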

Find duplicate registers in R [closed]

Closed 5 years ago.
I have an Excel file with a list of emails and the channels that collected them. How can I find out how many emails per channel are duplicated using R, and automate it (so that every time I import a different file I just have to run it and get the results)?
Thank you!!
Assuming the "df" dataframe has the relevant variables under the names "channel" and "email", then:
To get the number of unique channel-email pairs:
dim(unique(df[c("channel", "email")]))[1]
To get the sum of all channel-email observations:
sum(table(df$channel, df$email))
To get the number of duplicates, simply subtract the former from the latter:
sum(table(df$channel, df$email)) - dim(unique(df[c("channel", "email")]))[1]
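The numbers above are totals across all channels. For a per-channel breakdown, a minimal sketch in base R might look like the following. The sample data is made up, the column names channel and email follow the assumption above, and `count_dups` is a hypothetical helper name.

```r
# Count, per channel, how many rows repeat an earlier channel-email pair.
count_dups <- function(df) {
  is_dup <- duplicated(df[c("channel", "email")])
  vapply(split(is_dup, df$channel), sum, integer(1))
}

# Hypothetical sample data standing in for the imported Excel sheet.
df <- data.frame(
  channel = c("web", "web", "web", "store", "store", "app"),
  email = c("a@x.com", "a@x.com", "b@x.com", "c@x.com", "c@x.com", "d@x.com"),
  stringsAsFactors = FALSE
)

dups_per_channel <- count_dups(df)
dups_per_channel
```

Wrapping the logic in a function means a newly imported file only needs one call to `count_dups()`.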

Reading all observations from a csv file [closed]

Closed 5 years ago.
I have imported this file into R. The only problem is that there are 380 observations and it only reads the first 100. How can I get the rest? Here is my code:
BPL16_17 <- read.csv("BPL16:17.csv")
BPL16_17
Thanks
Personally I always recommend using readr::read_csv over read.csv.
I was unsure why read.csv would be limited to 100 columns (it has not been for many years, my mistake). In any case, read_csv has no such limit and handles data frames much better, especially dates and times, and it does not create factors by default.
https://github.com/tidyverse/readr
Also a great resource is this chapter from the R for data science book which is available online always for free.
http://r4ds.had.co.nz/data-import.html
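One possibility, not confirmed in this thread, is that all rows were read and only the console output was cut short (printing is limited by the max.print option). A quick way to check is to count the rows instead of printing the whole data frame. The file below is a made-up stand-in with 380 rows and hypothetical column names.

```r
# Write a hypothetical 380-row CSV to a temporary file.
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(match_id = 1:380, goals = rep(0:3, 95)),
  path,
  row.names = FALSE
)

dat <- read.csv(path)

# Count the rows rather than relying on what the console shows.
nrow(dat)

# Look at the last few rows to confirm the end of the file was read.
tail(dat, 3)
```

If nrow() reports all 380 rows for the real file, the data is complete and only the display was truncated.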

Find strings that start and end with certain characters [closed]

Closed 5 years ago.
I'm working on a text-mining project with data from twitter. In my data frame, many words are converted into Unicode characters, e.g.
<U+0E2B><U+0E25><U+0E07><U+0E1E>
I want to collect every converted word like the one above and put them all into one large string so I can deal with them separately.
Is there any way I can find all the strings that start with <U+ and end with > using R?
Your request is a bit imprecise, so I'm taking the liberty to make a few assumptions on how you want the output.
text <- "Words <Q+0E2B><U+0E2B2>, 1 < 2, <p>
<U+0E2B><U+0E25><U+0E07><U+0E1E> </p> some more words"
regmatches(text, gregexpr("<U\\+[0-9A-Z]{4}>", text))
# "<U+0E2B>" "<U+0E25>" "<U+0E07>" "<U+0E1E>"
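To get the single large string mentioned in the question, the matches can be flattened and pasted together. This is a small sketch that reuses the same text and pattern and assumes the codes should be joined without a separator:

```r
text <- "Words <Q+0E2B><U+0E2B2>, 1 < 2, <p>
<U+0E2B><U+0E25><U+0E07><U+0E1E> </p> some more words"

# regmatches() returns a list with one element per input string.
matches <- regmatches(text, gregexpr("<U\\+[0-9A-Z]{4}>", text))

# Flatten the list and join all codes into one string.
combined <- paste(unlist(matches), collapse = "")
combined
```

Changing the collapse argument (for example to a space) would keep the individual codes easier to tell apart.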

pdf to txt, tsv or vcf in R, ubuntu [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I have the following link:
[1] https://drive.google.com/open?id=0ByCmoyvCype7ODBMQjFTSlNtTzQ
This is a PDF file. The author of a paper provided the list of mutations in this format.
I need to annotate the mutations in this file.
I need a TXT, TSV or VCF file that can be read by ANNOVAR.
Can you help me convert this using R or other software on Ubuntu?
In principle this is a job for tabulizer but I couldn't get it to work in this instance; I suspect the single table over so many pages confused it.
You can read it into R as text with the pdftools package easily enough:
library(pdftools)
txt <- pdf_text("selection.pdf")
Now txt is a character vector, with each element holding the text of a single page of the original document. You might be able to do something fancy with regular expressions to convert this into more meaningful data.
However, it makes more sense to ask the original author for their data in an appropriate format. Publishing a 561-page PDF of tabular data is just nuts.
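If the author's data is not available, a rough sketch of the regular-expression route could look like this. It assumes every table row comes out as one line of whitespace-separated fields with the same number of columns. The sample strings stand in for the output of pdf_text(), and the column names are hypothetical.

```r
# Hypothetical stand-in for pdf_text() output: one string per page.
txt <- c(
  "chr1  12345  A  G\nchr2  67890  C  T\n",
  "chr3  11111  G  A\n"
)

# Break the pages into lines, trim them, and drop empty lines.
lines <- trimws(unlist(strsplit(txt, "\n", fixed = TRUE)))
lines <- lines[nzchar(lines)]

# Split each line on runs of whitespace.
fields <- strsplit(lines, "[[:space:]]+")

# Keep only rows with the expected number of columns (assumed to be 4).
fields <- fields[lengths(fields) == 4]

mutations <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(mutations) <- c("chrom", "pos", "ref", "alt")

# Write a tab-separated file that other tools can read.
out <- tempfile(fileext = ".tsv")
write.table(mutations, out, sep = "\t", quote = FALSE, row.names = FALSE)
```

Real PDF text usually also contains headers, footers and wrapped cells, so the filtering step would likely need more work.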
