How to abbreviate long names in a dataframe for R? - r

I'm working with a dataframe of organization names, many of which are longer than 25 characters. I'm trying to make a bar graph (with plotly) of all of these organizations, but the names get cut off because they're so long. I've already tried adjusting the margins like the following:
plot_ly(x = number, y = org_name, type = 'bar') %>%
layout(margin = list(l = 150))
It works, but the bar graph doesn't look nice, so the alternative I'm trying is to abbreviate any organization name longer than 25 characters. However, I'm having a hard time doing so. One way I tried is to create a new column called abbrv, use substring to get the first 25 characters of the organization name, append "...", and put the result in the column. For organization names that aren't longer than 25 characters, I would just put an NA in the abbrv column, like the following:
for(i in dataframe.name$org_name){
if(nchar(i) > 25){
dataframe.name$abbrv <- paste0(substring(i, 0, 25), "...")
}
else{
dataframe.name$abbrv <- "NA"
}
The only issue with this approach is that once I have the abbrv column (if it works), how do I make sure plotly displays the abbrv value when the organization name is longer than 25 characters and the normal organization name otherwise?
Anyway, that was one approach I tried, but it doesn't quite work: the abbrv column ends up with "NA" in ALL of the rows, no matter how long the organization names are. Another approach I tried is to use a replace function, such as:
for(i in dataframe.name$org_name){
if(nchar(i) > 25){
dataframe.name[i].replace(
to_replace=i,
value= abbreviate(i)
)
}
But I get errors for that one as well. At this point, I'm not sure how to abbreviate the long names in my dataframe. If anyone can help me out, that'll be great! Thanks.
*******Edit*******
So now I'm using this code:
for(i in 1:nrow(dfname)){
if(nchar(dfname$orgname[i]) > 25){
dfname$abbrv.column <- substring(dfname$orgname[i], 0, 25)
}
else{
dfname$abbrv.column <- dfname$orgname
}
}
This isn't quite working, though, because all of the entries end up as the same organization name.

dataframe.name$abbrv is the whole column (a vector with one entry per row), not a single name, so each pass through the loop overwrites every row.
That is why all entries in dataframe.name$abbrv end up as "NA": the last name in the dataframe is 25 characters or fewer, so the final pass assigns "NA" to the entire column.
@brettljausn has a decent suggestion: just do away with the NAs completely and only truncate where the character count exceeds 25.
Something like this should work a treat:
dataframe.name$abbrv <- substring( dataframe.name$org_name, 0, 25 )
I would try to use abbreviate first though:
dataframe.name$abbrv <- abbreviate( dataframe.name$org_name )
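Since the question also asks how to show the short form only for long names, here is a minimal sketch of a vectorized version using ifelse(). The sample organization names are made up for illustration; the 25-character cutoff and the abbrv column name come from the question.

```r
# Hypothetical sample data; in practice this would be the real dataframe.
dataframe.name <- data.frame(
  org_name = c("Short Org", "An Extremely Long Organization Name Incorporated"),
  stringsAsFactors = FALSE
)

# Vectorized: every row is handled at once, so no loop is needed.
# Long names are cut to 25 characters plus "...", short names are kept as-is.
dataframe.name$abbrv <- ifelse(
  nchar(dataframe.name$org_name) > 25,
  paste0(substr(dataframe.name$org_name, 1, 25), "..."),
  dataframe.name$org_name
)

print(dataframe.name$abbrv)
```

Because abbrv then holds either the full name or the shortened one, it could be passed to plot_ly as y in place of org_name, with no NA handling needed.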

Base R has abbreviate(). Here minlength = 8 asks for abbreviations of at least 8 characters, counting the ".":
> abbreviate(names(iris), minlength = 8)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
  "Spl.Lngt"   "Spl.Wdth"   "Ptl.Lngt"   "Ptl.Wdth"    "Species"

Related

How to concatenate NOT as character in R?

I want to build iris$Sepal.Length from strings so I can use it in a function to get the Sepal Length column from the iris data frame. But when I use the paste function, paste("iris$", colnames(iris[3])), the result is a character string (with quotes), like "iris$Sepal.Length". I need the result not as a character. I have tried noquote(), as.data.frame(), etc., but it doesn't work.
freq <- function(y) {
for (i in iris) {
count <-1
y <- paste0("iris$",colnames(iris[count]))
data.frame(as.list(y))
print(y)
span = seq(min(y),max(y), by = 1)
freq = cut(y, breaks = span, right = FALSE)
table(freq)
count = count +1
}
}
freq(1)
The crux of your problem isn't making that object not be a string, it's convincing R to do what you want with the string. You can do this with, e.g., eval(parse(text = foo)). Isolating out a small working example:
y <- "iris$Sepal.Length"
data.frame(as.list(y)) # does not display the values of iris$Sepal.Length
data.frame(as.list(eval(parse(text = y)))) # DOES display the values of iris$Sepal.Length
That said, I wanted to point out some issues with your function:
The input variable appears not to do anything (because it is immediately overwritten), which may not have been intended.
The for loop seems broken, since it resets count to 1 on each pass, which I think you didn't mean. Relatedly, it iterates over all i in iris, but then it doesn't use i in any meaningful way other than to keep a count. Instead, you could do something like for(count in 1:length(iris)), which would establish the count variable and iterate it for you as well.
It's generally better to avoid for loops in R where you can; the apply() family offers functions for applying a function to (e.g.) every column of a data frame. As a very simple version of this, something like apply(iris, 2, table) will apply the table function along margin 2 (the columns) of iris and, in this case, place the results in a list. The idea would be to build your function to do what you want to a single vector, then pass each vector through the function with something from the apply() family. For instance:
cleantable <- function(x) {
myspan = seq(min(x), max(x)) # if unspecified, by = 1
myfreq = cut(x, breaks = myspan, right = FALSE)
table(myfreq)
}
apply(iris[1:4], 2, cleantable) # can only use first 4 columns since 5th isn't numeric
would do what I think you were trying to do on the first 4 columns of iris. This way of programming will be generally more readable and less prone to mistakes.
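As an aside, if the goal is only to fetch a column whose name is held in a string, a common alternative to eval(parse()) is the [[ operator. This is a small sketch using the built-in iris data; the variable names col_name and sepal_length are just for illustration.

```r
# The column name is held in a string, as in the question.
col_name <- colnames(iris)[1]

# [[ accepts the string directly, so no eval(parse()) is needed.
sepal_length <- iris[[col_name]]

# This is the same vector that iris$Sepal.Length returns.
print(identical(sepal_length, iris$Sepal.Length))
```

This keeps the column name as ordinary data, which tends to be easier to debug than building code as text.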

min() does not work as expected

I am trying to get the minimum of a column.
The data has been split into groups using the "abbr" factor. My objective is to return the data in column 2 corresponding to the minimum in the column whose number is passed as the argument. If it helps, this is part of the Coursera R Programming introductory course.
The minimum is supposed to be somewhere around 8, but it shows 10.
Please help me here.
Here's the link to the CSV file, which I read with read.csv:
https://drive.google.com/file/d/0Bxkj3-FNtxqrLW14MFZCeEl6UGc/view?usp=sharing
best <- function(abbr, outvar){
## outcome is a dataframe consisting of a column labelled "State" (one of many)
## outvar is the desired column number
statecol <- split(outcome, outcome$State) ##state is a factor which will be inputted as abbr
dislist <- statecol[[abbr]][,2][statecol[[abbr]][, outvar] ==
min(statecol[[abbr]][, outvar])] ##continuation of prev line
dislist
}
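I think the NA handling is tripping you up. The file codes missing values as "Not Available", so a column containing them is likely read as text rather than numbers, and min() then compares strings (where "10" sorts before "8"). Declare "Not Available" as the NA string when reading the file and pass na.rm=TRUE to min():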
filedata<-read.table(file.choose(),quote='"',sep=",",dec=".",header=TRUE,stringsAsFactors=FALSE, na.strings="Not Available")
f<-function(df,abbr,outVar,na.rm=TRUE){
outlist<-split(df,df["State"])
tempCol<-outlist[[abbr]][outVar]
outlist[[abbr]][,2][which(tempCol==min(tempCol,na.rm=na.rm))]
}
f(filedata,"AK",44)
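To see the effect in isolation, here is a small sketch with made-up data (the hospital names and rates are invented; only the "Not Available" coding and the State column come from the thread). It reads the same text twice, once without and once with na.strings.

```r
# Hypothetical miniature of the file: one rate is coded as "Not Available".
csv_text <- "Hospital,State,Rate
A,AK,10.1
B,AK,8.3
C,AK,Not Available
D,TX,9.0"

# Without na.strings, the Rate column is read as character,
# so min() compares strings instead of numbers.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
raw_min <- min(raw$Rate[raw$State == "AK"])

# With na.strings, the column is numeric and the missing value becomes NA.
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = "Not Available")
ak <- clean[clean$State == "AK", ]
best_hospital <- ak$Hospital[which(ak$Rate == min(ak$Rate, na.rm = TRUE))]

print(raw_min)
print(best_hospital)
```

Wrapping the comparison in which() drops the NA produced by comparing a missing rate, which is also what the answer's function does.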

Dynamically call dataframe column & conditional replacement in R

First question post. Please excuse any formatting issues that may be present.
What I'm trying to do is conditionally replace a factor level in a dataframe column. The reason is a Unicode difference between a right single quotation mark (U+2019) and an apostrophe (U+0027).
All of the columns that need this replacement begin with "INN8", so I'm using
grep("INN8", colnames(demoDf)) -> apostropheFixIndices
for(i in apostropheFixIndices) {
levels(demoDfFinal[i]) <- c(levels(demoDf[i]), "I definitely wouldn't")
(insert code here)
}
to get the indices in order to perform the conditional replacement.
I've taken a look at a myriad of questions that involve naming variables on the fly, as well as how to assign values to dynamic variables, and have explored the R-FAQ on turning a string into a variable and looked into Ari Friedman's suggestion that named elements in a list are preferred. However, I'm unsure how to execute this, as well as the significance of the best-practice suggestion.
I know I need to do something along the lines of
demoDf$INN8xx[demoDf$INN8xx=="I definitely wouldn’t"] <- "I definitely wouldn't"
but the iterations I've tried so far haven't worked.
Thank you for your time!
If I understand you correctly, then you don't want to rename the columns. Then this might work:
demoDf <- data.frame(A=rep("I definitely wouldn’t",10) , B=rep("I definitely wouldn’t",10))
newDf <- apply(demoDf, 2, function(col) {
gsub(pattern="’", replacement = "'", x = col)
})
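It just checks all columns for the wrong symbol. Note that apply() returns a character matrix here, so wrap the result in as.data.frame() if you need a data frame.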
Or if you have a vector containing the column indices you want to check then you could go with
# Let's say you identified columns 2, 5 and 8
cols <- c(2,5,8)
sapply(cols, function(col) {
demoDf[,col] <<- gsub(pattern="’", replacement = "'", x = demoDf[,col])
})
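If the columns are factors, as in the question, another option is to edit the factor levels directly and select the columns by their "INN8" prefix. The following is only a sketch under that assumption: the column names INN8a, INN8b and other, and the sample values, are invented for illustration.

```r
# Hypothetical data: two INN8 columns stored as factors, plus one other column.
demoDf <- data.frame(
  INN8a = c("I definitely wouldn\u2019t", "I might"),
  INN8b = c("I might", "I definitely wouldn\u2019t"),
  other = c("x", "y"),
  stringsAsFactors = TRUE
)

# Indices of the columns whose names start with "INN8".
fix_cols <- grep("^INN8", colnames(demoDf))

# Rewrite the levels of each selected factor; the columns stay factors.
demoDf[fix_cols] <- lapply(demoDf[fix_cols], function(col) {
  levels(col) <- gsub("\u2019", "'", levels(col), fixed = TRUE)
  col
})

print(levels(demoDf$INN8a))
```

Assigning back into demoDf[fix_cols] avoids the <<- operator and leaves the untouched columns alone.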

Project Euler #22, off by 158,055

I'm currently working through Project Euler problem 22 which has the following challenge:
Using names.txt (right click and 'Save Link/Target As...'), a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.
For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 × 53 = 49714.
What is the total of all the name scores in the file?
The file can be downloaded using the above link. I've written the below code to solve the problem:
rm(list=ls())
library(splitstackshape)
#read in data from http://projecteuler.net/problem=22
names=sort(t(read.table("names.txt",sep=",")))
#letters to numbers conversion vectors
from=LETTERS[seq(1,26)]
to=as.character(seq(1,26))
#function to replace all letters with corresponding numbers
gsub2 = function(pattern, replacement, x, ...){
for(i in 1:length(pattern))
x = gsub(pattern[i],paste(replacement[i]," ",sep=""), x, ...)
x
}
#create df, run function, create row number var for later calculation
df=data.frame(names=names)
df$name.num = gsub2(from,to,df$names)
df$rownum=seq(1,nrow(df))
#split letter values, add across rows, multiply by row number to get name score and sum
df=concat.split(df,"name.num"," ")
df$name.sum=rowSums(df[,4:15],na.rm=TRUE)
df$name.score=df$name.sum*df$rownum
print(sum(df$name.score,na.rm=TRUE))
My result appears to be off by 158,055 (I get 871040227 where it should be 871198282). I've spot-checked parts of it, and it appears that the list of names is sorted correctly and that the name scores are being computed correctly (for instance, I also get COLIN=49714). I've also read other threads troubleshooting this problem on SO, but they're mostly in Python and the problems seem to be different from mine. My suspicion is that either the names.txt file is somehow not being read in right or that the method I'm using (concat.split from the splitstackshape package) to split df$name.num is incorrect, though it seems to be working correctly.
Any ideas?
Also, any suggestions on how to improve/simplify my code are more than welcome!
I used to have fun doing the Euler problems in R. Here's my solution to 22.
namesscore<-function(name) {
score<-0;
for(s in 1:nchar(name)) {
score<-score + which(substr(name,s,s)==LETTERS[1:26])
}
score
}
names<-scan("prob022.txt", "character", sep=",", quote="\"", na.strings="")
name.pos <- rank(names)
name.val <- sapply(names,namesscore)
sum(name.pos*name.val)
# [1] 871198282
There is a name "NA" in the list, which may cause you problems: by default, read.table treats the string "NA" as a missing value.
As pointed out by @MrFlick, there's an 'NA' in the names list, so you need to handle it.
x = sort(scan('http://projecteuler.net/project/names.txt', what = '', sep =',', na.strings = ""))
s = sapply(x, function(w){
match(w, x) * sum(match(strsplit(w, '')[[1]], LETTERS))
})
print(sum(s))
# 871198282
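To check the scoring logic on something small, here is a sketch on a made-up three-name list that includes "NA" (the real answer needs the full names.txt). The helper name_value is a hypothetical name; na.strings = "" is the same setting used in the answers above.

```r
# A tiny stand-in for names.txt, including the troublesome name "NA".
csv_line <- '"MARY","NA","COLIN"'

# na.strings = "" keeps the literal name "NA" instead of treating it as missing.
name_list <- sort(scan(text = csv_line, what = "", sep = ",",
                       na.strings = "", quiet = TRUE))

# Alphabetical value of one name: A = 1, B = 2, ..., Z = 26.
name_value <- function(name) sum(match(strsplit(name, "")[[1]], LETTERS))

# Score = position in the sorted list times alphabetical value.
scores <- seq_along(name_list) * vapply(name_list, name_value, integer(1))

print(name_value("COLIN"))
print(sum(scores))
```

If the "NA" entry were dropped or turned into a missing value, its score and the positions of every later name would be lost, which is consistent with a total that comes out too low.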

The way R handles subsetting

I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
##Create a empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specify the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE)
}
While calculating the mean, I first tried to subset using the $ sign, new_frame$variable, and the subset function, subset(new_frame, select = variable), but both would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other ways of subsetting didn't work? It took me a really long time to figure it out, and even though I managed to make it work, I still don't know why the other ways failed. I'd really like to look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
Both new_frame$variable and subset(new_frame, select = variable) look for a column in the dataframe with the name variable.
On the other hand, new_frame[, variable] uses the value of the variable argument of f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case, if $ accepted either variables or column names, which would you expect: the dd$var column or the dd$value column (because var == "value")? That's why dd[, var] is different: it only takes character vectors, not expressions referring to column names, so var is evaluated first and dd[, var] gives you dd$value.
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
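The difference can be reproduced inside a function with a few lines. This is a minimal sketch; the function name column_mean and the sample data are made up for illustration.

```r
dd <- data.frame(id = 1:4, value = c(10, 20, 30, 40))

column_mean <- function(data, variable) {
  # $ takes the name literally: it looks for a column called "variable".
  by_dollar <- data$variable
  # [[ evaluates the argument, so it uses the string stored in variable.
  by_bracket <- data[[variable]]
  list(by_dollar = by_dollar,
       by_bracket_mean = mean(by_bracket, na.rm = TRUE))
}

result <- column_mean(dd, "value")
print(is.null(result$by_dollar))
print(result$by_bracket_mean)
```

Since dd has no column literally named variable, the $ lookup comes back as NULL, while the [[ lookup finds the value column.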
