I'm trying to cluster items (find similar items) based on their attributes. I initially had a CSV of the format:
Item | Attribute1 | Attribute2.....about 200 attributes
Since it's a mixed-format set of attributes (INT, String, ...), I decided to concatenate the attributes, and now I have:
Item | ConcatenatedAttributes.
My clustering code is:
uniqueItem <- unique(as.character(data$ConcatenatedAttributes))
distanceMatrix <- stringdistmatrix(uniqueItem ,uniqueItem ,method = "jw")
rownames(distanceMatrix ) <- uniqueItem
hc <- hclust(as.dist(distanceMatrix ))
dfClust <- data.frame(uniqueItem , cutree(hc, k=200))
Now I want to see which Items have been clustered together based on the similarity of their ConcatenatedAttributes field. How can I do that?
So, something like:
ClusterNumber | Item |
You want to group_by your data frame on the cluster number.
One obvious way is to use a for loop. Most R fans will suggest learning dplyr.
But IMHO, your idea of concatenating everything into one unmanageable field and then abusing a string distance is just horrible.
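As a rough sketch of the grouping step, the cluster numbers can be merged back onto the items and then split by cluster. The toy data below is hypothetical, and utils::adist (plain edit distance) stands in for stringdistmatrix only so the example runs without extra packages; it is not the Jaro-Winkler distance used in the question.

```r
# Hypothetical toy data standing in for the asker's data frame
data <- data.frame(
  Item = c("A", "B", "C", "D"),
  ConcatenatedAttributes = c("red small 10", "red small 11",
                             "blue large 99", "red small 10"),
  stringsAsFactors = FALSE
)

uniqueItem <- unique(data$ConcatenatedAttributes)

# Assumption: adist() replaces stringdistmatrix(..., method = "jw") here
distanceMatrix <- adist(uniqueItem)
hc <- hclust(as.dist(distanceMatrix))

# One row per unique attribute string, with a named cluster column
dfClust <- data.frame(
  ConcatenatedAttributes = uniqueItem,
  ClusterNumber = cutree(hc, k = 2),
  stringsAsFactors = FALSE
)

# Attach the cluster number to every Item that shares the attribute string
clustered <- merge(data, dfClust, by = "ConcatenatedAttributes")
clustered <- clustered[order(clustered$ClusterNumber), c("ClusterNumber", "Item")]

# Named list: one character vector of Items per cluster
itemsByCluster <- split(clustered$Item, clustered$ClusterNumber)
```

Printing clustered gives the ClusterNumber | Item layout asked for, and itemsByCluster gives the same information as a list keyed by cluster number.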
My df has a comments column, and I need to search the comments for multiple names using key words (the comments contain a lot of irrelevant information and do not necessarily give the full name). I was able to accomplish this with nested ifelse, but there is a nesting limit of 50 and my list has grown to more than 200 names, so the code is very tedious. I also don't want to edit the code each time; instead I want to upload an Excel file with the list of names and key search terms.
I am currently using this statement - which should give clear understanding of what the relevant columns contain
comdata$name <- ifelse(grepl('jen', comdata$comments), 'Jennifer A',
                ifelse(grepl('rick', comdata$comments) | grepl('richard', comdata$comments), 'richard',
                ifelse(grepl('summ', comdata$comments), 'Summer', 'Others')))
Is it possible to do this with a loop or some other way if I create a list of the names and the possible 'key' search terms?
Basically, I need the correct syntax for the code below, which currently just gives 'Other' for most of the rows in comdata$name:
comdata$name <- ifelse(comdata$comments %like% name_list$Key.1, name_list$FullName, 'Other')
Create a key/val dataset and use regex_left_join
keyval <- data.frame(comments = c("jen", "rick"),
name = c("Jennifer A", "richard"))
library(fuzzyjoin)
regex_left_join(comdata, keyval, by = "comments")
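If a plain loop is preferred, a base R sketch along these lines may also work. It assumes a lookup table shaped like the name_list in the question (columns Key.1 and FullName), with alternative keys for one person written as a single regular expression; the sample comments are made up for illustration.

```r
# Hypothetical sample data
comdata <- data.frame(
  comments = c("spoke to jen today", "richard called",
               "rick emailed", "no name here"),
  stringsAsFactors = FALSE
)

# Lookup table; in practice this could be read from a spreadsheet
name_list <- data.frame(
  Key.1    = c("jen", "rick|richard"),
  FullName = c("Jennifer A", "richard"),
  stringsAsFactors = FALSE
)

# Start with the fallback label, then fill in matches key by key.
# Rows that already have a name are skipped, so the first matching
# key wins, as in the nested ifelse version.
comdata$name <- "Others"
for (k in seq_len(nrow(name_list))) {
  hit <- grepl(name_list$Key.1[k], comdata$comments) & comdata$name == "Others"
  comdata$name[hit] <- name_list$FullName[k]
}
```

Adding a name then only means adding a row to name_list, not editing the code.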
I am trying to trace back a pedigree and I have a package to do it for specific individuals but instead, I need to use a list of 2000 animals. What I need is all the ancestors of each individual 5 generations back .
Here it is an example:
library(ggenealogy)
data(sbGeneal)
getAncestors("5601T", sbGeneal, 5)
I need to use a list of individuals instead of writing one by one the name of the animals.
Would it be possible?
Have you tried something like this?
library(ggenealogy)
data(sbGeneal)
lst <- sapply(sbGeneal[,1], function(x) getAncestors(x, sbGeneal, 5))
This computes all the results and stores them in a list, lst. It is just a rough idea; you may need to adjust the code.
To retrieve those values:
lst$`5601T`
lst$Adams
would be the same as
getAncestors("5601T", sbGeneal, 5)
getAncestors("Adams", sbGeneal, 5)
I have a data frame that includes 43 different countries.
To summarize my data frame: the row names look like AUS1, AUS2, AUS3, ... BRA1, BRA2, ... GER1, GER2 ... GER56, and there is a variable, Country, which holds the country codes.
I need to find each country's export value. I can compute them separately, but it takes a lot of time because I have 14 different years. I would therefore like to use a for loop, but I cannot find a way to write one for the process below.
This is my code to find export for single country.
##AUT
AUT <- filter(wiot, wiot$Country == "AUT")
exportAUT <- sum(AUT$TOT) - sum(select(AUT, starts_with("AUT")))
##BEL
BEL <- filter(wiot, wiot$Country == "BEL")
exportBEL <- sum(BEL$TOT) - sum(select(BEL, starts_with("BEL")))
Trying to create individually named objects for this set of results is the path to madness in R. Instead create a list with a more generic name and then put results in the "leaves" (individual element) inside the list:
export <- list()
for (i in unique(wiot$Country)) {
  # rows belonging to country i
  sub <- wiot[wiot$Country == i, ]
  export[[i]] <- sum(sub$TOT) - sum(select(sub, starts_with(i)))
  # or maybe: export[[i]] <- sum(sub$TOT) - sum(sub[grepl(i, names(sub))])
}
This is a guess, since I'm not able to figure out how the rows and columns are referenced in your data.frame object. It would be much easier to debug this if you provided a less ambiguous description of the data object named wiot. Use either the output of str(wiot) or show output of dput(head(wiot))
Consider base R's by to build a named list of export calculations:
export_list <- by(wiot, wiot$Country, function(sub)
  sum(sub$TOT) - sum(select(sub, starts_with(sub$Country[1])))
)
export_list[["AUT"]]
export_list[["BEL"]]
export_list[["GER"]]
...
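For reference, the same calculation can be sketched in base R with split and sapply. The layout of wiot below is an assumption (a Country code column, numeric columns whose names start with a country code, and a TOT column), since the question does not show the real structure.

```r
# Hypothetical miniature of wiot; the real column layout is not shown
wiot <- data.frame(
  Country = c("AUT", "AUT", "BEL"),
  AUT1    = c(1, 2, 3),
  BEL1    = c(4, 5, 6),
  TOT     = c(5, 7, 9),
  stringsAsFactors = FALSE
)

# For each country: total output minus the columns belonging to itself
export <- sapply(split(wiot, wiot$Country), function(sub) {
  own_cols <- startsWith(names(sub), sub$Country[1])
  sum(sub$TOT) - sum(sub[own_cols])
})

export[["AUT"]]
export[["BEL"]]
```

The result is a named numeric vector with one export value per country code, so no separately named objects are needed.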
I am using the function textstat_keyness, which looks at the most frequently appearing words for a specific group of documents in comparison with all the other groups of documents. Basically, you input the target group of documents, and the output is a dataset containing the words ordered from most to least important, plus some other columns with statistics.
I have a character vector with the names of all the document groups I want to apply the keyness analysis to:
interest_list <- c(unique(data$interest))
(it looks like: chr "0", "340", "456", etc.; basically each number corresponds to a group of documents)
I can easily apply textstat_keyness to a single group of documents as follows:
keyness <- dfm(dfmat_data, groups = "group_interest")
# Calculate keyness and determine audience as target group, compare frequencies of words
# between target and reference documents.
result_keyness <- textstat_keyness(keyness, target = "17627")
The problem is that I don't want to run textstat_keyness for each group individually, as I have around 100 groups.
I was thinking of using a for loop, but I am not sure how to create a list of all the data frames generated by textstat_keyness.
I wrote this so far, but I don't know how to store all the results I would obtain:
for(i in interest_list) {textstat_keyness(keyness, target = i )
}
Otherwise, I tried with lapply, but it doesn't work:
keylist <- lapply(keyness, textstat_keyness(keyness, target = interest_list ))
Any idea how I can obtain my list of data frames in an efficient way?
thank you very much,
Carlo
An alternative to the for loop provided by JaiPizGon is a solution with lapply.
keylist <- lapply(interest_list, function(i) textstat_keyness(keyness, target = i))
Note that lapply is essentially a for loop that always returns a list.
The notation used by JaiPizGon is also correct, but you should be careful about growing objects in R; see chapter 2 of "The R Inferno".
So if you are more comfortable using a for loop, I suggest specifying the size of the list prior to assignment, i.e.:
keylist <- vector("list", length(interest_list))
for(i in seq_along(interest_list)) {
keylist[[i]] <- textstat_keyness(keyness, target = interest_list[i])
}
Have you tried initializing a list and assigning the result of textstat_keyness function?
Code:
keylist <- list()
for (i in 1:length(interest_list)) {
keylist[[i]] <- textstat_keyness(keyness, target = interest_list[i])
}
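Whichever loop is used, it may help to name the list elements after the groups so that a result can be looked up by group rather than by position. The sketch below shows only that pattern: toy_keyness is a hypothetical stand-in for textstat_keyness so the example runs without quanteda, and its output columns are made up.

```r
interest_list <- c("0", "340", "456")

# Hypothetical stand-in for textstat_keyness(keyness, target = i);
# it just returns a small data frame tagged with the target group
toy_keyness <- function(target) {
  data.frame(feature = c("alpha", "beta"),
             target  = target,
             stringsAsFactors = FALSE)
}

keylist <- lapply(interest_list, toy_keyness)

# Name each element after its group
names(keylist) <- interest_list

# Look up one group's result by name
keylist[["340"]]

# Optionally stack everything into a single data frame
all_keyness <- do.call(rbind, keylist)
```

With the real function, the call inside lapply would be the one from the answers above; the naming and stacking steps stay the same.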
I have to create a table where I analyse 9 variables in a bigger data set. For each variable, I have to state how it is scaled, what the measure of central tendency is, and what the dispersion measure is.
Because the measures differ depending on how the variable is scaled, I would like to name the measure inside the corresponding cell of the table I'm writing. Example:
"Median: (median(GB$government,na.rm=T)"
or
"Median:" (median(GB$government, na.rm=T)
This doesn't work, RStudio warns me because of an unexpected symbol. The code I have is this (it includes specify_decimal because I have to include two decimals of each value - that function works flawlessly so don't mind it :)
MZT <- c("Median:" specify_decimal(median(GB$government,na.rm=T),2),
specify_decimal(Modus(GB$local),2),specify_decimal(Modus(GB$gender),2),
specify_decimal(mean(GB$height,na.rm=T),2),
specify_decimal(mean(GB$weight,na.rm=T),2),specify_decimal(mean(GB$age,na.rm=T),2),
specify_decimal(mean(GB$education,na.rm=T),2),
specify_decimal(median(GB$income,na.rm=T),2),
specify_decimal(median(GB$father_educ,na.rm=T),2))
/ edit: I now understand how kable works :D
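The "unexpected symbol" error arises because a string and a function call are placed side by side with nothing joining them. One way to combine a label with a computed value is paste0, sketched below. The GB data frame here is a made-up miniature, and specify_decimal is a hypothetical stand-in for the asker's helper, assumed to format a number with a fixed count of decimals.

```r
# Hypothetical stand-in for the asker's helper
specify_decimal <- function(x, k) formatC(x, format = "f", digits = k)

# Made-up miniature of the GB data set
GB <- data.frame(
  government = c(3, NA, 5, 4),
  height     = c(1.5, 1.75, NA, 2)
)

# paste0() joins the label and the formatted value into one string
MZT <- c(
  paste0("Median: ", specify_decimal(median(GB$government, na.rm = TRUE), 2)),
  paste0("Mean: ",   specify_decimal(mean(GB$height, na.rm = TRUE), 2))
)

MZT
```

Each element of MZT is then a single character string that can go straight into a table cell.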
One way to make custom tables in R is to use the knitr::kable() function, along with R Markdown. Here is a trivial example that prints a table comparing sample and theoretical values for an exponential distribution where lambda = 0.2.
library(knitr)
Statistic <- c("Mean","Std. Deviation")
Sample <- c(round(5.220134,2),round(5.4018713,2))
Theoretical <- c(5.0,5.0)
theTable <- data.frame(Statistic,Sample,Theoretical)
rownames(theTable) <- NULL
kable(theTable)
...and the text based output:
> kable(theTable)
|Statistic | Sample| Theoretical|
|:--------------|------:|-----------:|
|Mean | 5.22| 5|
|Std. Deviation | 5.40| 5|
>
When run in R Markdown, the output looks like this:
Explanation
I used the following technique to create the table.
Data Frame is used as the container to hold the data
Each column in the table is a column of data in the data frame
The first column stores the names of the rows
The second thru n-th columns store different values related to each row