I'm currently working through Project Euler problem 22 which has the following challenge:
Using names.txt (right click and 'Save Link/Target As...'), a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.
For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 × 53 = 49714.
What is the total of all the name scores in the file?
The file can be downloaded using the above link. I've written the below code to solve the problem:
rm(list=ls())
library(splitstackshape)
#read in data from http://projecteuler.net/problem=22
names=sort(t(read.table("names.txt",sep=",")))
#letters to numbers conversion vectors
from=LETTERS[seq(1,26)]
to=as.character(seq(1,26))
#function to replace all letters with corresponding numbers
gsub2 = function(pattern, replacement, x, ...){
for(i in 1:length(pattern))
x = gsub(pattern[i],paste(replacement[i]," ",sep=""), x, ...)
x
}
#create df, run function, create row number var for later calculation
df=data.frame(names=names)
df$name.num = gsub2(from,to,df$names)
df$rownum=seq(1,nrow(df))
#split letter values, add across rows, multiply by row number to get name score and sum
df=concat.split(df,"name.num"," ")
df$name.sum=rowSums(df[,4:15],na.rm=TRUE)
df$name.score=df$name.sum*df$rownum
print(sum(df$name.score,na.rm=TRUE))
My result appears to be off by 158,055 (I get 871040227 where it should be 871198282). I've spot-checked parts of it, and it appears that the list of names is sorted correctly and that the name scores are computed correctly (for instance, I also get COLIN=49714). I've also read other threads troubleshooting this problem on SO, but they're mostly in Python and the problems seem to be different from mine. My suspicion is that either the names.txt file is somehow not being read in correctly or that the method I'm using to split df$name.num (concat.split from the splitstackshape package) is incorrect, though it seems to be working correctly.
Any ideas?
Also, any suggestions on how to improve/simplify my code are more than welcome!
I used to have fun doing the Euler problems in R. Here's my solution to 22.
namesscore<-function(name) {
score<-0;
for(s in 1:nchar(name)) {
score<-score + which(substr(name,s,s)==LETTERS[1:26])
}
score
}
names<-scan("prob022.txt", "character", sep=",", quote="\"", na.strings="")
name.pos <- rank(names)
name.val <- sapply(names,namesscore)
sum(name.pos*name.val)
# [1] 871198282
There is a name "NA" in the list which may cause you problems.
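To see why that name matters, here is a minimal sketch on a made-up three-name string (not the real file). It assumes the default na.strings = "NA" of scan() is what turns the name into a missing value; txt, default_read and fixed_read are illustrative names only.

```r
# Hypothetical miniature of names.txt: quoted, comma-separated names
txt <- '"MARY","NA","COLIN"'

# Default na.strings = "NA": the name NA is read as a missing value
default_read <- scan(text = txt, what = "", sep = ",", quiet = TRUE)

# na.strings = "": no string is treated as missing, so the name survives
fixed_read <- scan(text = txt, what = "", sep = ",", na.strings = "", quiet = TRUE)

print(is.na(default_read))
print(is.na(fixed_read))
```

A missing value in the sorted list would shift positions or drop out of the sum, which would be consistent with a total that comes out too low.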
As pointed out by @MrFlick, there's an 'NA' in the names list, so you need to handle it.
x = sort(scan('http://projecteuler.net/project/names.txt', what = '', sep =',', na.strings = ""))
s = sapply(x, function(w){
match(w, x) * sum(match(strsplit(w, '')[[1]], LETTERS))
})
print(sum(s))
# 871198282
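As a quick sanity check of the letter-value step, the COLIN example from the problem statement can be reproduced on its own. This is only a sketch; name_value is a hypothetical helper name, and the position 938 is taken from the problem text rather than computed from the file.

```r
# Alphabetical value of one upper-case name: A = 1, ..., Z = 26
name_value <- function(w) {
  sum(match(strsplit(w, "")[[1]], LETTERS))
}

colin_value <- name_value("COLIN")   # 3 + 15 + 12 + 9 + 14
colin_score <- 938 * colin_value     # position quoted in the problem statement

print(colin_value)
print(colin_score)
```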
I want to build the string iris$SepalLength so I can use it in a function to get the Sepal Length column from the iris data frame. But when I use paste("iris$", colnames(iris[3])), the result is a character string (with quotes), such as "iris$SepalLength". I need the result not to be a character string. I have tried noquote(), as.data.frame(), etc., but nothing works.
freq <- function(y) {
for (i in iris) {
count <-1
y <- paste0("iris$",colnames(iris[count]))
data.frame(as.list(y))
print(y)
span = seq(min(y),max(y), by = 1)
freq = cut(y, breaks = span, right = FALSE)
table(freq)
count = count +1
}
}
freq(1)
The crux of your problem isn't making that object not be a string, it's convincing R to do what you want with the string. You can do this with, e.g., eval(parse(text = foo)). Isolating out a small working example:
y <- "iris$Sepal.Length"
data.frame(as.list(y)) # does not display iris$Sepal.Length
data.frame(as.list(eval(parse(text = y)))) # DOES display iris$Sepal.Length
That said, I wanted to point out some issues with your function:
The input variable appears to not do anything (because it is immediately overwritten), which may not have been intended.
The for loop seems broken, since it resets count to 1 on each pass, which I think you didn't mean. Relatedly, it iterates over all i in iris, but then it doesn't use i in any meaningful way other than to keep a count. Instead, you could do something like for(count in 1 : length(iris)), which would establish the count variable and iterate it for you as well.
It's generally better to avoid for loops in R entirely; there's a whole family of functions for applying a function to (e.g.) every column of a data frame. As a very simple version of this, something like apply(iris, 2, table) will apply the table function along margin 2 (the columns) of iris and, in this case, place the results in a list. The idea would be to build your function to do what you want to a single vector, then pass each vector through the function with something from the apply() family. For instance:
cleantable <- function(x) {
myspan = seq(min(x), max(x)) # if unspecified, by = 1
myfreq = cut(x, breaks = myspan, right = FALSE)
table(myfreq)
}
apply(iris[1:4], 2, cleantable) # can only use first 4 columns since 5th isn't numeric
would do what I think you were trying to do on the first 4 columns of iris. This way of programming will be generally more readable and less prone to mistakes.
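For the original "column name as a string" problem, a sketch that avoids eval(parse()) is to index with [[ ]], which accepts a string. The names col_name and unit_table are made up for illustration, and rounding the breaks outward with floor()/ceiling() is my own assumption so that every value falls into a bin.

```r
# Column name held as a string (hypothetical variable name)
col_name <- "Sepal.Length"
by_name <- iris[[col_name]]   # same vector as iris$Sepal.Length

# Frequency table in bins one unit wide; breaks are rounded outward
# so that the minimum and maximum are both covered
unit_table <- function(x) {
  breaks <- seq(floor(min(x)), ceiling(max(x)))
  table(cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE))
}

# Apply to the four numeric columns; the result is a list of tables
tables <- lapply(iris[1:4], unit_table)
print(tables[[col_name]])
```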
I was reading the post "read.csv and skip last column in R" but did not find my answer there, so I tried to ask directly in an answer ... but that's not the right way (thanks to mjuarez for taking the time to get me back on track).
The original question was:
I have read several other posts about how to import csv files with read.csv but skipping specific columns. However, all the examples I have found had very few columns, and so it was easy to do something like:
columnHeaders <- c("column1", "column2", "column_to_skip")
columnClasses <- c("numeric", "numeric", "NULL")
data <- read.csv(fileCSV, header = FALSE, sep = ",", col.names = columnHeaders, colClasses = columnClasses)
All the answers were good, but they do not work for what I intended to do. So I asked myself and others:
Could data <- read_csv(fileCSV)[,(ncol(data)-1)] work as a single call?
I'm trying to read, in one line of R, all columns except the last one (for example, the first 5 of 6 columns). To do so, I would like to use "-" with the column count. Is that possible? How can I do that?
Thanks!
In base R it has to be a two-step operation. Example:
> data <- read.csv("test12.csv")
> data
# 3 columns are returned
a b c
1 1/02/2015 1 3
2 2/03/2015 2 4
# last column is excluded
> data[,-ncol(data)]
a b
1 1/02/2015 1
2 2/03/2015 2
One cannot write data <- read.csv("test12.csv")[,-ncol(data)] in base R, because data does not exist yet when ncol(data) is evaluated.
But if you know the number of columns in your csv (say 3 in my case), then you can write:
df <- read.csv("test12.csv")[,-3]
df
a b
1 1/02/2015 1
2 2/03/2015 2
The right hand side of an assignment is processed first so this line from the question:
data <- read.csv(fileCSV)[,(ncol(data)-1)]
is trying to use data before it is defined. Also note that the expression above asks for only the second-to-last field. To get all but the last field:
data <- read.csv(fileCSV)
data <- data[-ncol(data)]
If you know the name of the last field, say it is lastField, then this works and unlike the code above does not read the whole file and then remove the last field but rather only reads in fields other than the last. Also it is only one line of code.
read.csv(fileCSV, colClasses = c(lastField = "NULL"))
If you don't know the name of the last field but you do know how many fields there are, say n, then either of these would work:
read.csv(fileCSV)[-n]
read.csv(fileCSV, colClasses = replace(rep(NA, n), n, "NULL"))
Another way to do it without first reading in the last field is to first read in the header and first line to calculate the number of fields (assuming that all records have the same number) and then re-read the file using that.
n <- ncol(read.csv(fileCSV, nrows = 1))
and then use either of the two preceding statements involving n.
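Put together, the header-first approach might look like the sketch below. It writes a small temporary file so it can run on its own; csv_path, col_classes and dat are illustrative names, and the file contents mirror the toy test12.csv from the other answer.

```r
# Small stand-in for fileCSV so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1/02/2015,1,3", "2/03/2015,2,4"), csv_path)

# Step 1: read only the header and first record to count the fields
n <- ncol(read.csv(csv_path, nrows = 1))

# Step 2: re-read, letting R guess every class (NA) except the last,
# which "NULL" tells read.csv to skip entirely
col_classes <- replace(rep(NA, n), n, "NULL")
dat <- read.csv(csv_path, colClasses = col_classes)

print(names(dat))
```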
It's not possible in one line as the data variable is not yet initialized when you call it. So the command ncol(data) will trigger an error.
You would need to use two lines of code to first load your data into the data variable and then remove the last column by either using data[,-ncol(data)] or data[,1:(ncol(data)-1)].
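If the goal is only to avoid naming an intermediate variable, one workaround is to hand the result of read.csv() to a small helper, so that ncol() is evaluated on the data frame that was just read. This is a sketch, not something from the posts above; drop_last is a hypothetical helper and the temporary file stands in for fileCSV.

```r
# Small stand-in for fileCSV so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1/02/2015,1,3", "2/03/2015,2,4"), csv_path)

# Hypothetical helper: drop the last column of whatever it is given
drop_last <- function(d) d[-ncol(d)]

# The read and the column drop now sit in a single expression
dat <- drop_last(read.csv(csv_path))
print(names(dat))
```

The whole file is still read before the column is dropped, so this saves typing rather than memory.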
Not a single function, but at least a single line, using dplyr (disclaimer: I never use dplyr or magrittr, so a more optimized solution must exist using these libraries)
library(dplyr)
dat = read.table(fileCSV) %>% select(., which(names(.) != names(.)[ncol(.)]))
I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
##Create a empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specify the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE) # (*) see the note below
}
*While calculating the mean I first tried to subset using the $ sign, new_frame$variable, and the subset function, subset(new_frame, select = variable), but it would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other subseting didn't work? It took me a really long time to figure it out and even though I managed to make it work I still don't know why it didn't work in the other ways and I really wanna look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
new_frame$variable looks for a column in the data frame with the literal name variable; subset(new_frame, select = variable) also tries the column names first.
On the other hand, new_frame[, variable] uses the value of the variable argument passed to f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case, if $ accepted either variables or column names, which would you expect: the dd$var column or the dd$value column (because var == "value")? That's why dd[, var] is different: it evaluates var and uses the resulting character vector, rather than treating var as a literal column name. You will get dd$value with dd[, var].
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
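The difference is easy to reproduce inside a function. The sketch below uses a made-up data frame and a hypothetical helper col_lookup that returns what each form of indexing finds when the column name arrives as a string argument.

```r
dd <- data.frame(id = 1:4, value = c(10, 20, 30, 40))

# Hypothetical helper: try three ways of reaching the column
col_lookup <- function(df, variable) {
  list(
    dollar  = df$variable,           # looks for a column literally named "variable"
    bracket = mean(df[, variable]),  # uses the string stored in variable
    double  = mean(df[[variable]])   # same idea, always returns one column
  )
}

res <- col_lookup(dd, "value")
print(is.null(res$dollar))
print(res$bracket)
```

Since dd has no column called variable, the $ form yields NULL, while both bracket forms pick up the value column.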
I'm writing a function for a data set called opps on part number sales data, and I'm trying to break the data down into smaller data sets that are specific to the part numbers. I am trying to name the data sets as the argument "modNum". Here is what I have so far-
# modNum (Modified Product Number) takes a product number that looks
# like "950-0004-00" and makes it "opQty950.0004.00"
productNumber <- function(prodNum,modNum){
path <- "C:/Users/Data/"
readFile <- paste(path,"/opps.csv",sep="")
oppsQty <- read.csv(file=readFile,sep=",")
oppsQty$Line.Created.date <- as.Date(as.character(oppsQty$Line.Created),
"%m/%d/%Y")
modNum <- oppsQty[oppsQty$Part.Number=="prodNum",]
}
productNumber(280-0213-00,opQty280.0213.00)
#Error: object 'opQty910.0002.01' not found
The line I believe I'm having problems with is
modNum <- oppsQty[oppsQty$Part.Number=="prodNum",]
and it's because, in order for the code to work, there have to be quotation marks around prodNum, but when I put the quotation marks in the code,
prodNum is no longer seen as the argument to be filled in. When I put the quotation marks inside the argument like this:
productNumber(280-0213-00,"opQty280.0213.00")
I still have a problem.
How can I get around this?
I have tried rewriting the oppsQty$Part.Number variable to be numeric (shown below) so that I can eliminate the quotation marks altogether, but I still have errors...
productNumber <- function(prodNum,nameNum){
path <- "C:/Users/Data"
readFile <- paste(path,"/opps.csv",sep="")
oppsQty <- read.csv(file=readFile,sep=",")
oppsQty$Line.Created.date <- as.Date(as.character(oppsQty$Line.Created),
"%m/%d/%Y")
#ifelse(oppsQty$Part.Number=="Discount",
# oppsQty$Part.Number=="000000000",
# oppsQty$Part.Number)
oppsQty$Part <- paste(substr(oppsQty$Part.Number,1,3),
substr(oppsQty$Part.Number,5,8),
substr(oppsQty$Part.Number,10,11),sep = "")
oppsQty$Part <- as.numeric(oppsQty$Part)
oppsQty$Part[is.na(oppsQty$Part)] <- 0
nameNum <- oppsQty[oppsQty$Part==prodNum,]
}
> productNumber(401110201,opQty401.1102.01)
Warning message:
In productNumber(401110201, opQty401.1102.01) : NAs introduced by coercion
Help is much appreciated!
Thank you!
At the moment you are passing prodNum as a numeric value, thus
280-0213-00 is evaluated as 67 (280-213-0= 67)
You should pass (and consider) prodNum as a character string (as this is what you intend)
ie. "280-0213-00"
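A minimal sketch of that advice, on a made-up data frame rather than the real opps.csv: the part number is passed as a string, the comparison inside the function uses the argument itself (prodNum without quotes, since "prodNum" in quotes is just the literal text), and the subset is returned so the caller chooses its name. subset_by_part and the sample rows are assumptions for illustration.

```r
# Made-up stand-in for the opps data
opps <- data.frame(
  Part.Number = c("280-0213-00", "950-0004-00", "280-0213-00"),
  Qty = c(5, 2, 7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the rows for one part number
subset_by_part <- function(data, prodNum) {
  data[data$Part.Number == prodNum, ]   # prodNum unquoted: the argument's value
}

# The caller picks the object name when assigning the result
opQty280.0213.00 <- subset_by_part(opps, "280-0213-00")

# Without quotes the part number is plain arithmetic
numeric_arg <- 280-0213-00

print(opQty280.0213.00)
print(numeric_arg)
```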
I have a function that computes some things and then assigns that to a matrix. This matrix receives its name from a paste statement (based on some other current values). I then want to assign the dimnames to the matrix, but don't know how to make the pasted name be understood.
Here is what is going on:
someComputations <- function(labs) {
### bunch of computations, leading to X, Y, and Z:
matName <- paste("rhoMat_", X, sep = "") # this yields rhoMat_15 if X equals 15
assign(matName, Y %*% Z)
assign(dimnames(matName), labs) # labs is a list of row labels and column labels
return(matName)
}
This works well, including the first assign statement, and then it breaks down.
I have tried all kinds of approaches, such as eval(parse(text = matName)), as.name(matName), substitute(matName), but to no avail.
Since I don't know the actual name of the matrix (because matNum is not given), I can't hardcode the name into the function--so I am stuck with its character name matName. How can I make R understand I want to set the dimnames of the matrix rhoMat_15, rather than of matName?
Thanks, Peter
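# dimnames(get(matName)) <- labs fails, since R has no replacement function `get<-`;
# build the labelled copy and store it back under the same name instead
assign(matName, `dimnames<-`(get(matName), labs))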
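One way to sidestep the problem is to set the dimnames on an ordinarily named local matrix first and only then assign() it under the computed name. The sketch below is an assumption-laden rewrite of the function from the question: X, Y and Z are passed in as arguments instead of being computed, and the envir argument is my addition so that the matrix ends up in the caller's environment rather than disappearing when the function returns.

```r
# Sketch: X, Y, Z are arguments here only to keep the example runnable
some_computations <- function(X, Y, Z, labs, envir = parent.frame()) {
  matName <- paste("rhoMat_", X, sep = "")  # e.g. "rhoMat_15"
  mat <- Y %*% Z
  dimnames(mat) <- labs                     # label the local copy
  assign(matName, mat, envir = envir)       # store it under the computed name
  matName
}

Y <- matrix(1:4, nrow = 2)
Z <- diag(2)
labs <- list(c("r1", "r2"), c("c1", "c2"))

created <- some_computations(15, Y, Z, labs)
print(created)
print(get(created))
```

Returning the matrix itself, or a named list of matrices, is often easier to work with than creating variables by name, but that is a design choice beyond the question.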