Reading items from a dataset - r

I downloaded a CSV file in which every cell either contains an item or is empty. When I write the code:
groceries_data <- read.transactions("groceries.csv")
I see this surprising result:
summary(groceries_data)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
8146 columns (items) and a density of 0.0004401248
But when I write the code:
groceries_data <- read.transactions("groceries.csv", sep = ",")
Then the result is:
summary(groceries_data)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
which is the correct result from the book. But logically it seems the first command should work, not the second. What is going wrong here?

That function isn't intended to read CSV by default. See help(read.transactions); for the sep argument it states:
a character string specifying how fields are separated in the data file. The default ("") splits at whitespaces.
So unless you tell it to split on commas, it splits on every run of whitespace. If you've got spaces in many product names, every word of every product name becomes its own column.
By specifying the sep argument as a comma, it imports the CSV file correctly, as you wanted.
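A quick way to confirm the import worked (a sketch, assuming the arules package is installed and groceries.csv is in the working directory):
library(arules)
# With sep = ",", each comma-separated field is one item
groceries_data <- read.transactions("groceries.csv", sep = ",")
head(itemLabels(groceries_data))   # whole product names, not single words
dim(groceries_data)                # should be 9835 x 169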

Related

A cell in a CSV is (wrongly) read as a character vector of length 2 in R

I have a data frame that I read in from a .csv (or .xlsx; I've tried both), and one of the variables in the data frame is a vector of dates.
Generate the data with this:
Name <- rep("Date", 15)
num <- 1:15
Name <- paste(Name, num, sep = "_")
data1 <- data.frame(
  Name,
  Due.Date = seq(as.Date("2020/09/24"), as.Date("2020/10/08"), by = "days")
)
When I reference one of the cells directly, like this: str(project_dates$Due.Date[241]), it reads the date normally.
However, the exact position of the important dates varies from project to project, so I wrote a command that identifies where the important dates are in the sheet, like this: str(project_dates[str_detect(project_dates$Name, "Date_17"), "Due.Date"])
This code worked on a few projects, but on the current project it now returns a character vector of length 2: one value is the date and the other is NA. To make matters worse, the position of the date and the NA is not fixed across cells (the date is the first value in some cells and the second in others; otherwise I would just reference, e.g., the first item in the vector).
What is going on here, but more importantly, how do I fix this?!
Clarification on the second command:
When I was originally reading from an Excel file, the command was project_dates[str_detect(project_dates$Name, "Date_17"), "Due.Date"]$Due.Date because it was returning a 1x1 tibble, and I needed the value in the tibble.
When I switched to reading in data as a csv, I had to remove the $Due.Date because the command was now reading the value as an atomic vector, so the $ operator was no longer valid.
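For context, here is a minimal sketch of that difference (hypothetical data; the tibble package is assumed):
library(tibble)
df <- data.frame(Name = c("Date_17", "Date_18"),
                 Due.Date = as.Date(c("2020-10-10", "2020-10-11")))
tb <- as_tibble(df)
df[df$Name == "Date_17", "Due.Date"]   # data.frame drops to an atomic Date vector
tb[tb$Name == "Date_17", "Due.Date"]   # tibble stays a 1x1 tibble, so $Due.Date is needed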
Help me, Oh Blessed 1's (with) Knowledge! You're my only hope!
(Edited to include an image of the data like the one that generates the error.)
I feel sheepish.
I was able to remove the NAs with:
data1 <- data1[!is.na(data1$Due.Date), ]
I assumed that command would listwise delete the rows with any missing values, so if the cell contained the 2-length vector, then I would lose the whole row of data. Instead, it removed the NA from the cell, leaving only the date.
Thank you to everyone who commented and offered help!

I want to read a csv file allowing duplicate of row names [duplicate]

I am trying to read a csv file with repeated row names but could not. The error message I am getting is Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed.
The code I am using is:
S1N657 <- read.csv("S1N657.csv", header = TRUE, fill = TRUE, col.names = c("dam", "anim", "temp"))
An example of my data is given below:
did <- rep("1N657", 10)
aid <- c(101, 102, 103, 104, 105, 106, 107, 108, 109, 110)
temp <- c(36, 38, 37, 39, 35, 37, 36, 34, 39, 38)
data <- cbind(did, aid, temp)
Any help will be appreciated.
The function is seeing duplicate row names, so you need to deal with that. Probably the easiest way is with row.names = NULL, which forces row numbering: it treats your first column as data rather than as row names, and adds row numbers (consecutive integers starting with 1).
read.csv("S1N657.csv", header = TRUE, fill = TRUE, col.names = c("dam", "anim", "temp"), row.names = NULL)
Try this:
S1N657 <- read.csv("S1N657.csv", header = TRUE, fill = TRUE,
                   col.names = c("dam", "anim", "temp"), row.names = NULL)[, -1]
Guessing your csv file was converted from xlsx: add a comma to the end of the first row, remove the last row, and you're done.
An issue I had recently was that the number of columns in the header row did not match the number of columns I had in the data itself. For example, my data was tab-delimited and all of the data rows had a trailing tab character. The header row (which I had manually added) did not.
I wanted the rows to be auto-numbered, but instead it was treating my first column as the row names. From the docs:
row.names: a vector of row names. This can be a vector giving the actual row names, or a single number giving the column of the table which contains the row names, or character string giving the name of the table column containing the row names.
If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names. Otherwise if row.names is missing, the rows are numbered.
Using row.names = NULL forces row numbering. Missing or NULL row.names generate row names that are considered to be ‘automatic’ (and not preserved by as.matrix).
Adding an extra tab character to the header row made the header row have the same number of columns as the data rows, thus solving the problem.
In short, check your column names. If your first row is the names of columns, you may be missing one or more names.
Example:
"a","b","c"
a,b,c,d
a,b,c,d
The example above will cause a row.name error because each row has 4 values, but only 3 columns are named.
This happened to me when I was building a csv from an online resource.
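One way to spot this mismatch before reading is count.fields(), which reports the number of fields on each line (a diagnostic sketch, using the file name from the earlier question):
# If the first line has one fewer field than the rest, read.csv will
# treat the first column as row names and may hit the duplicate error.
table(count.fields("S1N657.csv", sep = ","))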
I was getting the same "duplicate 'row.names' are not allowed" error for a small CSV. The problem was that somewhere outside the 14x14 chart area I wanted, there was a stray cell containing a space or other data.
I discovered this when I ran it with row.names = NULL and saw multiple rows of blank data below my table (and therefore multiple duplicate, all-blank row names).
The solution was to delete all rows/columns outside the table area, and it worked!
In my case the problem came from the Excel file. Although it seemed perfectly organized, it did not work, and I always got the message: Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed.
I copy-pasted my Excel matrix into a new empty Excel sheet and tried reading it again: it worked! No more error message!

Combining CSV files and splitting the column into 2 columns using R

I have 40 CSV files with only 1 column each. I want to combine the data from all 40 files into 1 CSV file with 2 columns.
The data format is like this (image omitted): each row holds two space-separated numbers in a single column.
I want to split this column by space and combine all 40 CSV files into 1 file. I want to preserve the number format as well.
I tried the code below, but the number format is not fixed and an extra 3rd column gets added for negative numbers. Not sure why.
My code:
filenames <- list.files(path = "C://R files", full.names = TRUE)
merged <- data.frame(do.call("rbind", lapply(filenames, read.csv, header = FALSE)))
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1)), " ", fixed = FALSE))
write.csv(data, "export1.csv", row.names = FALSE, na = "NA")
The output I got (image omitted) puts the negative numbers into an extra column. I just want to split by space into 2 columns, keeping the exact number format of the input.
The problem is that the source data is delimited by:
one space when the second number is negative, and
two spaces when the second number is positive (space for the absent minus sign).
The trick is to split the string on one or more spaces:
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1)), " +", fixed = FALSE))
I'm a bit OCD about charsets, unreliable files, etc., so I tend to use splitters such as "[[:space:]]+" instead, since it'll catch whitespace variants rather than just the space " " or tab "\t".
(In regex-speak, the + says "one or more". Other modifiers include ? as zero or one, and * as zero or more.)
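A minimal illustration of the difference (toy strings standing in for the merged column):
x <- c("1.234  5.678", "1.234 -5.678")   # two spaces vs. one space
strsplit(x, " ")               # the two-space line yields an empty middle field
strsplit(x, "[[:space:]]+")    # one-or-more whitespace: exactly two fields per line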

How to ignore comma in text para when saving in .csv format?

I am trying to extract data from NCBI using different functions in the rentrez package. However, I have an issue because the function extract_from_esummary() in rentrez returns a matrix in which the text of a column gets split into adjacent columns when saved to a .csv file (as shown in the image), because "," is recognized as a delimiter.
library(rentrez)
PM.ID <- c("25979833", "25667274", "23792568", "22435913")
p.data <- entrez_summary(db = "pubmed", id = PM.ID)
pubrecord.table <- extract_from_esummary(esummaries = p.data,
                                         elements = c("uid", "title", "fulljournalname",
                                                      "pubtype"))
From the image example above, in the column for PMID 25979833 the journal name spills into the next column: European journal of cancer (Oxford lands in one column and England : 1990) in the next. When I did a dput(pubrecord.table), I understood that the split happens because the words are separated by a comma. How can I make R understand that European journal of cancer (Oxford, England : 1990) belongs in a single column? I have a similar issue with the Title and Pubtype fields, where the long text contains commas and the csv format breaks it up. How can I clean the data so that it ends up in the appropriate columns?
I thought this looked like a bug in extract_from_esummary. I searched the package's issues on GitHub for "comma" and found this one, which says:
This is not really a problem with rentrez, just a property of NCBI records and R objects.
In this case, the pubtype field is variably-sized.
When you try and write the matrix it represents the vectors like you'd type them in (c(..., ...)) which adds a comma which breaks the csv format.
In this case, you can collapse the vectors and unlist each matrix row to allow them to be written out
The issue page has code examples as well.
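A sketch of that workaround (assuming pubrecord.table is the matrix built above; the "; " separator is an arbitrary choice):
# Collapse each cell (e.g. a multi-valued pubtype vector) into a single
# string so that write.csv has exactly one atomic value per cell.
flat <- apply(pubrecord.table, c(1, 2),
              function(cell) paste(unlist(cell), collapse = "; "))
write.csv(flat, "pubrecords.csv")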

Importing csv file into R - numeric values read as characters

I am aware that there are similar questions on this site; however, none of them seem to answer my question sufficiently.
This is what I have done so far:
I have a csv file which I open in Excel. I manipulate the columns algebraically to obtain a new column "A". I import the file into R using read.csv(), and the entries in column A are stored as factors; I want them stored as numeric. I found this question on the topic:
Imported a csv-dataset to R but the values becomes factors
Following the advice, I include stringsAsFactors = FALSE as an argument in read.csv(); however, as Hong Ooi suggested on the page linked above, this doesn't cause the entries in column A to be stored as numeric values.
A possible solution is to use the advice given in the following page:
How to convert a factor to an integer\numeric without a loss of information?
however, I would like a cleaner solution, i.e. a way to import the file so that the entries of column A are stored as numeric values.
Cheers for any help!
Whatever algebra you are doing in Excel to create the new column could probably be done more effectively in R.
Please try the following: read the raw file (before any Excel manipulation) into R using read.csv(..., stringsAsFactors = FALSE). [If that does not work, please take a look at ?read.table (which read.csv wraps); there may be some other underlying issue.]
For example:
delim = "," # or is it "\t" ?
dec = "." # or is it "," ?
myDataFrame <- read.csv("path/to/file.csv", header=TRUE, sep=delim, dec=dec, stringsAsFactors=FALSE)
Then, let's say your numeric column is column 4:
myDataFrame[, 4] <- as.numeric(myDataFrame[, 4]) # you can also refer to the column by "itsName"
Lastly, if you need any help accomplishing in R the same tasks you've done in Excel, there are plenty of folks here who would be happy to help you out.
In read.table (and its relatives) it is the na.strings argument which specifies which strings are to be interpreted as missing values, NA. The default value is na.strings = "NA".
If missing values in an otherwise numeric column are coded as something other than "NA", e.g. "." or "N/A", those rows will be interpreted as character, and then the whole column is converted to character.
Thus, if your missing values are anything other than "NA", you need to specify them in na.strings.
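For example (the file name and missing-value codes here are hypothetical):
# Treat ".", "N/A", and empty strings as missing in addition to "NA",
# so otherwise-numeric columns stay numeric instead of becoming character.
df <- read.csv("data.csv", na.strings = c("NA", ".", "N/A", ""))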
If you're dealing with large datasets (i.e. datasets with a high number of columns), the solution noted above can be cumbersome to do by hand, and it requires you to know a priori which columns are numeric.
Try this instead.
char_data <- read.csv(input_filename, stringsAsFactors = FALSE)
num_data <- data.frame(data.matrix(char_data))
numeric_columns <- sapply(num_data, function(x) mean(as.numeric(is.na(x))) < 0.5)
final_data <- data.frame(num_data[, numeric_columns], char_data[, !numeric_columns])
The code does the following:
Imports your data as character columns.
Creates an instance of your data as numeric columns.
Identifies which columns from your data are numeric (assuming columns with less than 50% NAs upon converting your data to numeric are indeed numeric).
Merges the numeric and character columns into a final dataset.
This essentially automates the import of your .csv file by preserving the data types of the original columns (as character and numeric).
Including this in the read.csv command worked for me: strip.white = TRUE
(I found this solution here.)
A data.table version based on the code from dmanuge:
library(data.table)

convNumValues <- function(ds) {
  ds <- data.table(ds)
  dsnum <- data.table(data.matrix(ds))
  num_cols <- sapply(dsnum, function(x) mean(as.numeric(is.na(x))) < 0.5)
  nds <- data.table(dsnum[, .SD, .SDcols = names(num_cols)[num_cols]],
                    ds[, .SD, .SDcols = names(num_cols)[!num_cols]])
  return(nds)
}
I had a similar problem. Based on Joshua's premise that Excel was the problem, I looked at it and found that the numbers were formatted with commas between every third digit. Reformatting without the commas fixed the problem.
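If editing the Excel file isn't an option, the same fix can be applied after import (a sketch; x stands in for the affected character column):
x <- c("1,234", "12,345,678", "987")
as.numeric(gsub(",", "", x, fixed = TRUE))   # strip thousands separators, then convert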
I had a similar situation with my data file when I read it in as a csv: all the numeric values were turned into character. In my file there was the word "Filtered" in place of NA. I converted "Filtered" to NA in the vim editor of a Linux terminal with the command %s/Filtered/NA/g, saved the file, and read it into R again; all the values were then numeric rather than character.
It looks like the character value "Filtered" was forcing all the values into character format.
Charu
Hello @Shawn Hemelstrand, here are the steps in detail:
Take an example matrix file.csv containing the word "Filtered".
Open file.csv in a Linux terminal:
vi file.csv
Press Esc, then Shift+:
Type the following command at the bottom:
%s/Filtered/NA/g
Press Enter.
Press Esc, then Shift+: again, and type wq at the bottom (this saves the file and quits the vim editor).
Then read the file in the R script:
data <- read.csv("file.csv", sep = ",", header = TRUE)
str(data)
All columns that were previously character type were now numeric.
In case you need more help, it would be easier to share your txt or csv file.
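The same substitution can also be done without leaving R, by telling read.csv to treat "Filtered" as a missing-value code (a sketch; the file name is hypothetical):
data <- read.csv("file.csv", na.strings = c("NA", "Filtered"))
str(data)   # columns that contained "Filtered" now import as numeric with NAs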
