How to load CSV as factor in R

How to load CSV as factor in R - r

I have a file called metadata.csv that I want to load into R and convert to a factor.
I begin with:
metadata <- read.csv(file="metadata.csv", header=T, stringsAsFactors=T)
And this loads the CSV just fine. I've printed out metadata here:
> metadata
Filename Genre Date Gender
1 Austen_Emma.txt Social Early Female
2 Bronte_Eyre.txt Social Middle Female
3 Dickens_Expectations.txt Social Late Male
4 Eliot_Mill.txt Social Late Female
5 Lewis_Monk.txt Gothic Early Male
6 Radcliffe_Italian.txt Gothic Early Female
7 Shelley_Frankenstein.txt Gothic Middle Female
8 Stoker_Dracula.txt Gothic Late Male
9 Thackeray_Vanity.txt Social Middle Male
10 Trollope_Vicar.txt Social Middle Male
Now I want to convert it to a factor:
as.factor(metadata)
This gives me the following error:
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

metadata is a dataframe which is a special type of list made up of vectors of equal length. You can only use as.factor() on vectors. Therefore you must class as.factor() on each vector in the dataframe. This can be done using the lapply function:
metadata <- data.frame(lapply(metadata, factor))
This will convert each column to a factor (check this by class(metadata[, 1])). The overall structure of metadata will still be a dataframe.

read.csv puts data into a data.frame
You cannot convert a data.frame into a factor. That's very basic R stuff.
It's like you're trying to change of a bunch of .doc files into PDFs by converting your computer into a PDF. It just doesn't make sense.
The error is asking "Have you called sort on a list?" Yes, you have. as.factor calls sort, and your data.frame is a list.

Related

Criteria for deciding which character columns should be converted to factors

I have been working through the book "Analyzing Baseball Data with R" by Marchi and Albert and am wondering about an issue which they don't address.
Many of the datasets I need to import are fairly large (though not really "Big" in the sense of "Big Data"). For example, the Retrosheet Game Logs have 1 csv file per year dating back to 1871 where each file has a row for each game played that year, and 161 columns. When I read it into a dataframe using read.csv() using the default setting on stringsAsFactors fully 75 of the 161 columns become factors. Some of these columns conceptually are factors (such as one containing "D" or "N" for day or night games) but others are probably better left as strings (many of the columns contain names of starting pitchers, closers, etc.) I know how to convert columns from factors to strings or vice versa, but I don't want to have to scan through 161 columns, making an explicit decision for 75 of them.
The reason I think it important is that I've noticed that conceptually small dataframes obtained by subsetting these game logs are surprisingly large given the need to retain the full factor information. For example, given the dataframe GL2016 obtained from downloading, unzipping and the reading in the file, object.size(GL2016) is about 2.8 MB, and when I use:
df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",])
to extract the home day games played by the Cleveland Indians in 2016, I get a df with 26 rows. 26/2428 (where 2428 is the number of rows in the whole dataframe) is slightly more than 1%, but object.size(df) is around 1.3 MB, which is far more than 1% of the size of GL2016.
I came up with an ad-hoc solution. I first defined a function:
big.factor <- function(v,k){is.factor(v) && length(levels(v)) > k}
And then used mutate_if from dplyr like thus:
GL2016 %>% mutate_if(function(v){big.factor(v,30)},as.character) -> GL2016
30 is the number of teams in the MLB and I somewhat arbitrarily decided that any factor with more than 30 levels should probably be treated as a string.
After this code has been run, the number of factor variables has been reduced from 75 to 12. It works in the sense that even though now GL2016 is around 3.2 MB (slightly larger than before), if I now subset the dataframe to pull out the Cleveland day games, the resulting dataframe is just 0.1 MB.
Questions:
1) What criteria (hopefully less ad-hoc than what I used above) are relevant for deciding which character columns should be converted to factors when importing a large data set?
2) I am aware of the cost in terms of memory footprint of converting all character data to factors, but am I incurring any hidden costs (say in processing time) when I convert most of these factors back into strings?

Essentially, I think what you need to do is:
df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",])
df <- droplevels(df)
droplevelsfunction will remove all the unused factor levels, and thus reduce the size of df immensely.

R readr package - written and read in file doesn't match source

I apologize in advance for the somewhat lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, then manipulated a bit to make them smaller (column removal), and then stuck them all together using rbind. I would like to write my pared down file out to an external hard drive so I don't have to read in all the data each time I want to work on it and doing the paring then. (Obviously, its all scripted but, it takes about 45 minutes to do this so I'd like to avoid it if possible.)
So I wrote out the data and read it in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.

R Refer to (part of) data frame using string in R

I have a large data set in which I have to search for specific codes depending on what i want. For example, chemotherapy is coded by ~40 codes, that can appear in any of 40 columns called (diag1, diag2, etc).
I am in the process of writing a function that produces plots depending on what I want to show. I thought it would be good to specify what I want to plot in a input data frame. Thus, for example, in case I only want to plot chemotherapy events for patients, I would have a data frame like this:
Dataframe name: Style
Name SearchIn codes PlotAs PlotColour
Chemo data[substr(names(data),1,4)=="diag"] 1,2,3,4,5,6 | red
I already have a function that searches for codes in specific parts of the data frame and flags the events of interest. What i cannot do, and need your help with, is referring to a data frame (Style$SearchIn[1]) using codes in a data frame as above.
> Style$SearchIn[1]
[1] data[substr(names(data),1,4)=="diag"]
Levels: data[substr(names(data),1,4)=="diag"]
I thought perhaps get() would work, but I cant get it to work:
> get(Style$SearchIn[1])
Error in get(vars$SearchIn[1]) : invalid first argument
enter code here
or
> get(as.character(Style$SearchIn[1]))
Error in get(as.character(Style$SearchIn[1])) :
object 'data[substr(names(data),1,5)=="TDIAG"]' not found
Obviously, running data[substr(names(data),1,5)=="TDIAG"] works.
Example:
library(survival)
ex <- data.frame(SearchIn="lung[substr(names(lung),1,2) == 'ph']")
lung[substr(names(lung),1,2) == 'ph'] #works
get(ex$SearchIn[1]) # does not work

It is not a good idea to store R code in strings and then try to eval them when needed; there are nearly always better solutions for dynamic logic, such as lambdas.
I would recommend using a list to store the plot specification, rather than a data.frame. This would allow you to include a function as one of the list's components which could take the input data and return a subset of it for plotting.
For example:
library(survival);
plotFromSpec <- function(data,spec) {
filteredData <- spec$filter(data);
## ... draw a plot from filteredData and other stuff in spec ...
};
spec <- list(
Name='Chemo',
filter=function(data) data[,substr(names(data),1,2)=='ph'],
Codes=c(1,2,3,4,5,6),
PlotAs='|',
PlotColour='red'
);
plotFromSpec(lung,spec);
If you want to store multiple specifications, you could create a list of lists.

Have you tried using quote()
I'm not entirely sure what you want but maybe you could store the things you're trying to get() like
quote(data[substr(names(data),1,4)=="diag"])
and then use eval()
eval(quote(data[substr(names(data),1,4)=="diag"]), list(data=data))
For example,
dat <- data.frame("diag1"=1:10, "diag2"=1:10, "other"=1:10)
Style <- list(SearchIn=c(quote(data[substr(names(data),1,4)=="diag"]), quote("Other stuff")))
> head(eval(Style$SearchIn[[1]], list(data=dat)))
diag1 diag2
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6

R convert names to numbers

I have a data frame with donations and names of donors.
**donation** **Donor**
25.00 Steve Smith
20.00 Jack Johnson
50.00 Mary Jackson
... ...
I'm trying to do some clustering using the pvclust package. Unfortunately the package doesn't seem to take non-numerical data.
> rs1.pv1 <- parPvclust(cl, rs1, nboot=10)
Error in cor(x, method = "pearson", use = use.cor) : 'x' must be numeric
I have two questions.
1) Is there another package or method that would do this better?
2) Is there a way to "normalize" the donor names list? Ie get a list of unique donor names, assign each an id number and then insert the id number into the data frame in place of the character name.

For number 2:
#If donor is a factor then
as.numeric(donor)
#will transform your factor to numeric.
#If it isn't, tranform it to a factor and the to numeric
as.numeric(as.factor(donor))
However, I'm not sure that transforming the donor list to a numeric and then using cor makes sense at all.
HTH

How about rs1 <- transform(rs1, Donor=as.numeric(factor(Donor))) ? (Warning: I haven't thought about what you're doing enough to know whether that makes sense -- so I'm only answering question #2, not question #1). Typically Donor would already be a factor (this is what e.g. read.table or read.csv would do by default), so the factor() part would be redundant.

Creating a vector from a file in R

I am new to R and my question should be trivial. I need to create a word cloud from a txt file containing the words and their occurrence number. For that purposes I am using the snippets package.
As it can be seen at the bottom of the link, first I have to create a vector (is that right that words is a vector?) like bellow.
> words <- c(apple=10, pie=14, orange=5, fruit=4)
My problem is to do the same thing but create the vector from a file which would contain words and their occurrence number. I would be very happy if you could give me some hints.
Moreover, to understand the format of the file to be inserted I write the vector words to a file.
> write(words, file="words.txt")
However, the file words.txt contains only the values but not the names(apple, pie etc.).
$ cat words.txt
10 14 5 4
Thanks.

words is a named vector, the distinction is important in the context of the cloud() function if I read the help correctly.
Write the data out correctly to a file:
write.table(words, file = "words.txt")
Create your word occurrence file like the txt file created. When you read it back in to R, you need to do a little manipulation:
> newWords <- read.table("words.txt", header = TRUE)
> newWords
x
apple 10
pie 14
orange 5
fruit 4
> words <- newWords[,1]
> names(words) <- rownames(newWords)
> words
apple pie orange fruit
10 14 5 4
What we are doing here is reading the file into newWords, the subsetting it to take the one and only column (variable), which we store in words. The last step is to take the row names from the file read in and apply them as the "names" on the words vector. We do the last step using the names() function.

Yes, 'vector' is the proper term.
EDIT:
A better method than write.table would be to use save() and load():
save(words. file="svwrd.rda")
load(file="svwrd.rda")
The save/load combo preserved all the structure rather than doing coercion. The write.table followed by names()<- is kind of a hassle as you can see in both Gavin's answer here and my answer on rhelp.
Initial answer:
Suggest you use as.data.frame to coerce to a dataframe an then write.table() to write to a file.
write.table(as.data.frame(words), file="savew.txt")
saved <- read.table(file="savew.txt")
saved
words
apple 10
pie 14
orange 5
fruit 4

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex