Stemming a text column in a dataframe with R
I have a dataframe with this structure:
#Load lexicon
Lexicon_DF <- read.csv("LexiconFrancais.csv",header=F, sep=";")
The structure of "LexiconFrancais.csv" is like this:
French Translation (Google Translate);Positive;Negative
un dos;0;0
abaque;0;0
abandonner;0;1
abandonné;0;1
abandon;0;1
se calmer;0;0
réduction;0;0
abba;1;0
abbé;0;0
abréger;0;0
abréviation;0;0
> Lexicon_DF
V1 V2 V3
1 French Translation (Google Translate) Positive Negative
2 un dos 0 0
3 abaque 0 0
4 abandonner 0 1
5 abandonné 0 1
6 abandon 0 1
7 se calmer 0 0
8 réduction 0 0
9 abba 1 0
10 abbé 0 0
11 abréger 0 0
12 abréviation 0 0
I try to stem the first column of the dataframe; to do so I did:
Lexicon_DF <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
But after this command I find only the first column in the Lexicon_DF dataframe; the other two columns disappear.
> Lexicon_DF <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
> Lexicon_DF
[1] "French Translation (Google Translate)" "un dos" "abaqu"
[4] "abandon" "abandon" "abandon"
[7] "se calm" "réduct" "abba"
[10] "abbé" "abreg" "abrévi"
How can I do the stemming without losing the other two columns?
Thank you
You are trying to replace the whole content of Lexicon_DF with the output of wordStem.
Try this:
Lexicon_DF$V1 <- SnowballC::wordStem(Lexicon_DF[[1]], language = 'fr')
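To make the fix concrete, here is a minimal self-contained sketch, assuming the SnowballC package is installed; a few lexicon rows are inlined in place of "LexiconFrancais.csv". The key point is assigning the stemmed result back into the word column only, so the other columns survive:

```r
# Minimal sketch: stem one column of a data frame in place,
# keeping all other columns (assumes SnowballC is installed).
Lexicon_DF <- data.frame(
  V1 = c("abandonner", "abandonné", "abréviation"),
  V2 = c(0, 0, 0),
  V3 = c(1, 1, 0),
  stringsAsFactors = FALSE
)

# Assign the stemmed words back into V1 instead of overwriting
# the whole data frame:
Lexicon_DF$V1 <- SnowballC::wordStem(Lexicon_DF$V1, language = "fr")
Lexicon_DF
#        V1 V2 V3
# 1 abandon  0  1
# 2 abandon  0  1
# 3  abrévi  0  0
```

Note also that the file is read with header=F, so the header line itself lands in row 1 and gets stemmed along with the words; if the file's first line really is a header, reading it with header=TRUE would avoid that.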
Related
How to read a specific .Matrix file in R
I have a .Matrix file. I have been told it is similar to a .csv file, and when I take a look in a web browser it looks like this (the data rows are extremely long; truncated here with "…"):

%TransMat_H0004.E1.L1.S1.B1.T1
CLUSTER,,3,3,2,2,1,1,3,1,1,1,1,3,2,3,1,2,2,1,1,3,3,1,2,1,3,1,1,2,1,…
tSNE-1,,8.13846968090103,12.8635212043927,10.3864480425066,7.17083119797853,-72.7452686458686,-49.7960088439495,45.63460621346,…

I tried to read it with read.csv:

test = read.csv('TransMat_H0004.E1.L1.S1.B1.T1.Matrix', sep='')
str(test)
'data.frame': 33141 obs. of 1 variable:
 $ X.TransMat_H0004.E1.L1.S1.B1.T1: Factor w/ 33141 levels "A1BG,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"| truncated,..: 13453 31099 31100 1 2 3 4 5 6 7 ...

How should I read it in the right format, say, with the first field of each 'sequence' (list? I guess?) as the row name? Thanks in advance!

Sorry, I cannot provide the data link because it is unpublished, but I can tell you what the data look like:

%TransMat_H0004.E1.L1.S1.B1.T1
cluster,1,2,3,2,3….
tsne-1,-41,-80…..
tsne-2,-41,-80…..
tsne-3,-41,-80…..

(and the rest all start with a gene name followed by numbers, such as)

genea,0,2,1,0…
….
genez,0,2,1,0

My desired output is to remove the first 4 rows (cluster, tsne-1, tsne-2, tsne-3) and extract the gene transcript matrix, such as:

  V1 V2 V3 V4 V5
1  0  0  0  0  0
2  0  0  0  0  0
3  0  0  0  0  0
4  0  0  0  0  0
5  0  0  0  0  0
I figured this out with:

read.csv("E2.Matrix", skip=1)

since the first row is an annotation, according to the bioinformatics technician who arranged the .Matrix file. Thanks! # Stephan
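Building on that, here is a sketch of the full extraction, under assumptions about the file layout (the real file is unavailable, so a tiny inline stand-in is used; field counts and row labels are simplified). The annotation line is skipped, the first field becomes the row name, and the cluster/tSNE rows are dropped to leave only the gene matrix:

```r
# Tiny inline stand-in for the .Matrix file (layout assumed):
txt <- "%TransMat_H0004.E1.L1.S1.B1.T1
CLUSTER,3,3,2
tSNE-1,8.1,12.9,10.4
tSNE-2,-41,-80,3
tSNE-3,-41,-80,3
A1BG,0,0,0
A1CF,0,1,0"

# Skip the '%TransMat...' annotation line and use the first
# field of each row as the row name:
mat <- read.csv(text = txt, skip = 1, header = FALSE, row.names = 1)

# Drop the non-gene rows to keep only the transcript matrix:
genes <- mat[!rownames(mat) %in% c("CLUSTER", "tSNE-1", "tSNE-2", "tSNE-3"), ]
genes
#      V2 V3 V4
# A1BG  0  0  0
# A1CF  0  1  0
```

With the real file, `text = txt` would be replaced by the file name; the exact row labels to drop should be checked against the file's actual first column.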
Using DocumentTermMatrix on a Vector of First and Last Names
I have a column in my data frame (df) as follows:

> people = df$people
> people[1:3]
[1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
[2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
[3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"

The column has 4k+ unique first/last/nick names, as a list of full names on each row as shown above. I would like to create a DocumentTermMatrix for this column in which full-name matches are found and only the names that occur the most are used as columns. I have tried the following code:

> people_list = strsplit(people, ", ")
> corp = Corpus(VectorSource(people_list))
> dtm = DocumentTermMatrix(corp, people_dict)

where people_dict is a list of the most commonly occurring people (~150 full names) from people_list, as follows:

> people_dict[1:3]
[[1]]
[1] "Christian Slater"

[[2]]
[1] "Tara Reid"

[[3]]
[1] "Stephen Dorff"

However, the DocumentTermMatrix function does not seem to be using people_dict at all, because I have far more columns than there are entries in people_dict. Also, I think DocumentTermMatrix is splitting each name string into multiple strings; for example, "Danny Devito" becomes a column for "Danny" and one for "Devito".

> inspect(actors_dtm[1:5,1:10])
<<DocumentTermMatrix (documents: 5, terms: 10)>>
Non-/sparse entries: 0/50
Sparsity           : 100%
Maximal term length: 9
Weighting          : term frequency (tf)

    Terms
Docs 'g. 'jojo' 'ole' 'piolin' 'rampage' 'spank' 'stevvi' a.d. a.j. aaliyah
   1   0      0     0        0         0       0        0    0    0       0
   2   0      0     0        0         0       0        0    0    0       0
   3   0      0     0        0         0       0        0    0    0       0
   4   0      0     0        0         0       0        0    0    0       0
   5   0      0     0        0         0       0        0    0    0       0

I have read through all the tm documentation I can find, and I have spent hours searching on Stack Overflow for a solution. Please help!
The default tokenizer splits text into individual words. You need to provide a custom tokenizer function:

commasplit_tokenizer <- function(x) unlist(strsplit(as.character(x), ", "))

Note that you do not separate the actors before creating the corpus.

people <- character(3)
people[1] <- "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
people[2] <- "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
people[3] <- "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"

people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")

The control options didn't work with plain Corpus, so I used VCorpus:

corp = VCorpus(VectorSource(people))
dtm = DocumentTermMatrix(corp, control = list(tokenize = commasplit_tokenizer, dictionary = people_dict, tolower = FALSE))

All of the options are passed within control, including:

tokenize - a function
dictionary
tolower = FALSE

Results:

as.matrix(dtm)
     Terms
Docs  Nia Long Stephen Dorff Uma Thurman
   1         0             1           0
   2         0             0           0
   3         0             0           1

I hope this helps.
Converting a Term Document Matrix to a Term Document Matrix supported by tm library
I have a csv file where I have all of my documents, stemmed, in Term Document Matrix form, plus a categorical variable with the sentiment. I'd like to use tm's capabilities (term frequencies etc.). Is there a way to do so, given the data I started with?

# given:
dtm = read.csv(file_path, na.strings = "")
dtm$rating = as.factor(dtm$rating)

str(dtm)
# 'data.frame': 2000 obs. of 2002 variables:
#  $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
#  $ abl: int 0 0 0 0 0 0 0 0 0 0 ...
#  ...

head(dtm)
# ID abl absolut absorb accept
# 1  1   0       0      0
# 2  2   0       0      1

# I'd like to achieve...
tdm <- TermDocumentMatrix(dtm, control = list(removePunctuation = TRUE, stopwords = TRUE))
Can you use as.TermDocumentMatrix(df, weighting = weightTf) (in the R package tm) to do what you seek?
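To flesh that suggestion out, here is a hedged sketch of how the conversion could look, assuming dtm is the data frame from the question (an ID column, one integer count column per stemmed term, and a rating factor); a tiny stand-in frame is built here for illustration. as.TermDocumentMatrix expects a matrix with rows = terms and columns = documents, so the non-term columns are dropped and the matrix transposed:

```r
library(tm)

# Tiny stand-in for the data frame described in the question:
dtm <- data.frame(ID = 1:2,
                  abl = c(0L, 0L),
                  absolut = c(0L, 1L),
                  rating = factor(c("pos", "neg")))

# Drop the non-term columns, coerce to a numeric matrix, and
# transpose so that rows = terms, columns = documents:
m <- as.matrix(dtm[, setdiff(names(dtm), c("ID", "rating"))])
rownames(m) <- dtm$ID
tdm <- as.TermDocumentMatrix(t(m), weighting = weightTf)
```

inspect(tdm) then exposes tm's usual term-frequency tooling, while the rating column stays available in the original data frame for sentiment modelling. The column names to drop ("ID", "rating") are taken from the question's str() output and would need adjusting for other data.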
Conditional input using read.table or readLines
I'm struggling with using readLines() and read.table() to get a well-formatted data frame in R. I want to read files like this, which are hockey stats. I'd like to get a nicely formatted data frame; however, specifying a fixed number of lines to read is difficult because in other files like this the number of players differs. Also, non-players, marked as #.AC, #.HC and so on, should not be read in. I tried something like this:

LINES <- 19
stats <- read.table(file=Datei, skip=11, header=FALSE, stringsAsFactors=FALSE, encoding="UTF-8", nrows=LINES)

but as mentioned above, the value for LINES is different each time. I also tried readLines as in this post, but had no luck with it. Is there a way to integrate a condition into read.table, like (pseudo code):

if (first character == "AC") { break read.table }

Sorry if this looks strange, I don't have much experience in scripting or coding. Any help is appreciated, thanks a lot! Greetz!
Your data show a couple of difficulties which should be handled in sequence, which means you should not try to read the entire file with one command.

Read plain lines and find the start and stop rows

Depending on the specification of the files you read in, my suggestion is to first find the first row you actually want to read by some indicator. This can be a line number which is always the same or, as in my example, two lines after the line "TEAM STATS". Finding the last line is then simple: just look for the first line containing only whitespace after the start line:

lines <- readLines( Datei )
start <- which(lines == "TEAM STATS") + 2
end <- start + min( grep( "^\\s+$", lines[ start:length(lines) ] ) ) - 2

Read the data into a data.frame

In your case you meet a couple of complications:

- Your header line starts with a #, which by default is recognized as a comment character, causing the line to be ignored. But even if you switch this behavior off (comment.char = ""), # is not a valid column name.
- If we tell read.table to split the columns along whitespace, you end up with one more column in the data than in the header, since the PLAYER column contains whitespace in its cells. So for the moment it's best to just ignore the header line and let read.table proceed with its default behavior (comment.char = "#"). We also let the PLAYER column be split into two and fix this later.
- You won't be able to use the first column as row.names, since its values are not unique.
- The rows have unequal length, since the POS column is not filled everywhere.
tab <- read.table( text = lines[ start:end ], fill = TRUE, stringsAsFactors = FALSE )

# fix the PLAYER column
tab$V2 <- paste( tab$V2, tab$V3 )
tab <- tab[-3]

Fix the header

Just split the start line at multiple whitespaces and reset the first entry (#) to a valid column name:

colns <- strsplit( lines[start], "\\s+" )[[1]]
colns[1] <- "code"
colnames(tab) <- colns

Fix cases where "POS" was empty

This is done by finding the rows whose last cell contains NA and shifting them by one cell to the right:

colsToFix <- which( is.na(tab[, "SHO%"]) )
tab[ colsToFix, 4:ncol(tab) ] <- tab[ colsToFix, 3:(ncol(tab)-1) ]
tab[ colsToFix, 3 ] <- NA

> str(tab)
'data.frame': 25 obs. of 20 variables:
 $ code  : chr "93" "91" "61" "88" ...
 $ PLAYER: chr "Eichelkraut, Flori" "Müller, Lars" "Alt, Sebastian" "Gross, Arthur" ...
 $ POS   : chr "F" "F" "D" "F" ...
 $ GP    : chr "8" "6" "7" "8" ...
 $ G     : int 10 1 4 3 4 2 0 2 1 0 ...
 $ A     : int 5 11 5 5 3 4 6 3 3 4 ...
 $ PTS   : int 15 12 9 8 7 6 6 5 4 4 ...
 $ PIM   : int 12 10 12 6 2 36 37 29 6 0 ...
 $ PPG   : int 3 0 1 1 1 1 0 0 1 0 ...
 $ PPA   : int 1 5 2 2 1 2 4 2 1 1 ...
 $ SHG   : int 0 1 0 1 1 0 0 0 0 0 ...
 $ SHA   : int 0 0 1 0 1 0 0 1 0 0 ...
 $ GWG   : int 2 0 1 0 0 0 0 0 0 0 ...
 $ FG    : int 1 0 1 1 1 0 0 0 0 0 ...
 $ OTG   : int 0 0 0 0 0 0 0 0 0 0 ...
 $ UAG   : int 1 0 1 0 0 0 0 0 0 0 ...
 $ ENG   : int 0 0 0 0 0 0 0 0 0 0 ...
 $ SHOG  : int 0 0 0 0 0 0 0 0 0 0 ...
 $ SHOA  : num 0 0 0 0 0 0 0 0 0 0 ...
 $ SHO%  : num 0 0 0 0 0 0 0 0 0 0 ...
using graph.adjacency() in R
I have sample code in R as follows:

library(igraph)
rm(list=ls())
dat=read.csv(file.choose(),header=TRUE,row.names=1,check.names=T) # read .csv file
m=as.matrix(dat)
net=graph.adjacency(adjmatrix=m,mode="undirected",weighted=TRUE,diag=FALSE)

where I used a csv file as input containing the following data:

      23732 23778 23824 23871 58009 58098 58256
23732     0     8     0     1     0    10     0
23778     8     0     1    15     0     1     0
23824     0     1     0     0     0     0     0
23871     1    15     0     0     1     5     0
58009     0     0     0     1     0     7     0
58098    10     1     0     5     7     0     1
58256     0     0     0     0     0     1     0

After this I used the following command to check the weight values:

E(net)$weight

The expected output is something like this:

> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1

But I'm getting weird values, and different ones every time:

> E(net)$weight
[1] 2.121996e-314 2.121996e-313 1.697597e-313 1.291034e-57 1.273197e-312 5.092790e-313 2.121996e-314 2.121996e-314 6.320627e-316 2.121996e-314 1.273197e-312 2.121996e-313
[13] 8.026755e-316 9.734900e-72 1.273197e-312 8.027076e-316 6.320491e-316 8.190221e-316 5.092790e-313 1.968065e-62 6.358638e-316

I'm unable to find what I am doing wrong. Please help me get the correct expected result, and also tell me why this weird output appears and why it differs on every run.

Thanks, Nitin
Just a small working example below, much clearer than CSV input:

library('igraph')
adjm1 <- matrix(sample(0:1, 100, replace=TRUE, prob=c(0.9, 0.1)), nc=10)
g1 <- graph.adjacency(adjm1)
plot(g1)

P.S. ?graph.adjacency has a lot of good examples (remember to run library('igraph') first).

Related threads:
Creating co-occurrence matrix
Co-occurrence matrix using SAC?
The problem seems to be due to the data type of the matrix elements: graph.adjacency expects elements of type numeric. (Not sure if it's a bug.) After you do

m <- as.matrix(dat)

set its mode to numeric with:

mode(m) <- "numeric"

And then do:

net <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)

> E(net)$weight
[1]  8  1 10  1 15  1  1  5  7  1