Read list of names from CSV into R

I have a text file of names, separated by commas, and I want to read it into R as anything (a data frame or a vector is fine). When I try read.csv, it just reads the names in as headers of separate columns, with 0 rows of data. With header = FALSE it still reads them in as separate columns. I could work with that, but what I really want is a single column with one row per name. As it is, when I try to print the data frame, it prints the useless column headers and then no values. It seems like it should be usable as-is, but one column of names would clearly be easier to work with.

Since the OP asked me to, I'll post the comment above as an answer.
It's very simple, and it comes from some practice in reading in sequences of data, numeric or character, using scan.
# scan() reads a flat sequence of values; sep = ',' splits on commas
dat <- scan(file = your_filename, what = 'character', sep = ',')
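If you want the one-column data frame described in the question rather than a character vector, one extra line does it:
# wrap the scanned vector in a single-column data frame
names_df <- data.frame(name = dat, stringsAsFactors = FALSE)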

You can use read.csv and let it read the string in as a header row, then just extract the names (using names) and put them into a data.frame:
data.frame(x = names(read.csv("FILE")))
For example:
write.table("qwerty,asdfg,zxcvb,poiuy,lkjhg,mnbvc",
"FILE", col.names = FALSE, row.names = FALSE, quote = FALSE)
data.frame(x = names(read.csv("FILE")))
       x
1 qwerty
2  asdfg
3  zxcvb
4  poiuy
5  lkjhg
6  mnbvc
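One caveat: read.csv runs header names through make.names() by default, so names containing spaces or punctuation would get mangled (e.g. "John Smith" becomes "John.Smith"). Passing check.names = FALSE avoids this; a minimal sketch, reusing the FILE example above:
write.table("John Smith,Mary Jane", "FILE", col.names = FALSE, row.names = FALSE, quote = FALSE)
data.frame(x = names(read.csv("FILE", check.names = FALSE)))
#            x
# 1 John Smith
# 2  Mary Jane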

Something like that?
Make some test data:
# test data
list_of_names <- c("qwerty","asdfg","zxcvb","poiuy","lkjhg","mnbvc")
list_of_names <- paste(list_of_names, collapse = ",")
list_of_names
# write to temp file
tf <- tempfile()
writeLines(list_of_names, tf)
You need this part:
# read from file
line_read <- readLines(tf)
line_read
list_of_names_new <- unlist(strsplit(line_read, ","))
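For reference, the result is a plain character vector:
list_of_names_new
# [1] "qwerty" "asdfg"  "zxcvb"  "poiuy"  "lkjhg"  "mnbvc"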

Related

How do I extract specific rows from a CSV and format the data in R?

I have a CSV file that contains thousands of lines like this:
1001;basket/files/legobrick.mp3
4096;basket/files/sunshade.avi
2038;data/lists/blockbuster.ogg
2038;data/random/noidea.dat
I want to write this to a new CSV file but include only rows which contain '.mp3' or '.avi'. The output file should be just one column and look like this:
"basket/files/legobrick.mp3#1001",
"basket/files/sunshade.avi#4096",
So the first column should be suffixed to the second column and separated by a hash symbol and each line should be quoted and separated by a comma as shown above.
The source CSV file does not contain a header with column names. It's just data.
Can someone tell me how to code this in R?
Edit (following marked answer): This question is not a duplicate because it involves filtering rows, and the output format is completely different, requiring different processing methods. The marked answer is also completely different, which really backs up my assertion that this is not a duplicate.
You can do it in the following way:
#Read the file with ; as separator
df <- read.csv2(text = text, header = FALSE, stringsAsFactors = FALSE)
#Filter the rows which end with "avi" or "mp3"
inds <- grepl("avi$|mp3$", df$V2)
#Create a new dataframe by pasting those rows with a separator
df1 <- data.frame(new_col = paste(df$V2[inds], df$V1[inds], sep = "#"))
df1
# new_col
#1 basket/files/legobrick.mp3#1001
#2 basket/files/sunshade.avi#4096
#Write the csv
write.csv(df1, "/path/of/file.csv", row.names = FALSE)
Or if you want it as a text file you can do
write.table(df1, "path/test.txt", row.names = FALSE, col.names = FALSE, eol = ",\n")
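Since write.table quotes character columns by default, each line comes out quoted and comma-terminated, matching the requested format. A quick check (same hypothetical path):
cat(readLines("path/test.txt"), sep = "\n")
# "basket/files/legobrick.mp3#1001",
# "basket/files/sunshade.avi#4096",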
data
text = "1001;basket/files/legobrick.mp3
4096;basket/files/sunshade.avi
2038;data/lists/blockbuster.ogg
2038;data/random/noidea.dat"
See whether the code below helps. It assumes the data has already been read into a data frame df with columns named ID and file_path:
library(tidyverse)
df %>%
  filter(grepl("\\.mp3|\\.avi", file_path)) %>%
  mutate(file_path = paste(file_path, ID, sep = "#")) %>%
  pull(file_path) %>%
  dput()
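dput only prints the vector; to write the quoted, comma-terminated file the question asks for, a hedged follow-up (res is a hypothetical variable holding the pipeline's result, i.e. drop the dput and assign instead):
# res <- df %>% ... %>% pull(file_path)
writeLines(paste0('"', res, '",'), "output.txt")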
A data.table answer:
library(data.table)
dt <- fread("file.csv")  # fread auto-detects the ";" separator
fwrite(dt[V2 %like% "mp3$|avi$", .(paste0(V2, "#", V1))], "output.csv", col.names = FALSE)

Data frame writes to a single column in write.csv/write.table R

I'm trying to write a csv file (I would use write.xlsx, but for some reason that was giving me Java memory errors.... no matter, I'll do this instead), but if I used the following data frame:
id <- c(1,2,3,4,5)
email <- c('jim@chase.com','steve@aol.com','stacy@gmail.com/','chris@yahoo.com','emilio@verizon.net/')
sample <- data.frame(id,email)
write.table(sample, 'Me\\Raw List.csv',
            row.names = TRUE, col.names = TRUE, append = FALSE)
I get the data all in a single-column CSV, along with a row identifier like this:
id "email"
1 1 "jim#chase.com"
2 2 "steve#aol.com"
3 3 "stacy#gmail.com/"
4 4 "chris#yahoo.com"
5 5 "emilio#verizon.net/"
My question is two-part: 1) How do I separate this data into columns; and 2) How do I remove the row identifiers so that I can just use my id?
write.table()'s default separator is a single space " " (check the docs).
Use write.csv() or write.csv2() along with the parameter row.names = FALSE instead.
write.csv(sample, file = "my_dir/my_file.csv", row.names = FALSE)
row.names = FALSE ensures that the running row identifier (basically an id like the one you already have) is not written.
You may use write.table() as well but you'll have to pass additional parameters:
write.table(sample, file = "test.csv", row.names = FALSE, sep = ",", dec = ".")
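Reading the file back confirms the fix (same hypothetical path as above):
read.csv("test.csv")  # now parses into two proper columns, id and email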

Issues importing csv data into R where the data contains additional commas

I have a very large data set that for illustrative purposes looks something like the following.
Cust_ID , Sales_Assistant , Store
123 , Mary, Worthington, 22
456 , Jack, Charles , 42
The real data has many more columns and millions of rows. I'm using the following code to import it into R but it is falling over because one or more of the columns has a comma in the data (see Sales_Assistant above).
df <- read.csv("C:/dataextract.csv", header = TRUE , as.is = TRUE , sep = "," , na.strings = "NA" , quote = "" , fill = TRUE , dec = "." , allowEscapes = FALSE , row.names=NULL)
Adding row.names=NULL imported all the data but it split the Sales_Assistant column over two columns and threw all the other data out of alignment. If I run the code without this I get an error...
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
...and the data won't load.
Can you think of a way around this that doesn't involve tackling the data at source, or opening it in a text editor? Is there a solution in R?
First and foremost, it is a csv file, so "Mary, Worthington" is read as two columns. If you have commas in your values, consider saving the data as tsv (tab-separated values) instead.
However, if your data has the same number of commas in every row, with good alignment in some sense, I would consider ignoring the first row (the column names, as the file is read) and reassigning proper column names to the data frame.
For instance, in your case you can replace Sales_Assistant by
Sales_Assistant_First_Name, Sales_Assistant_Last_Name
which makes perfect sense. Then I could basically do
df <- df[-1, ]
colnames(df) <- c("Cust_ID" , "Sales_Assistant_First_Name" , "Sales_Assistant_Last_Name", "Store")
df <- read.csv("C:/dataextract.csv", skip = 1, header = FALSE)
df_cnames <- read.csv("C:/dataextract.csv", nrow = 1, header = FALSE)
df <- within(df, V2V3 <- paste(V2, V3, sep = ''))
df <- subset(df, select = (c("V1", "V2V3", "V4")))
colnames(df) <- df_cnames
It may need some modification depending on the actual source.
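A hedged alternative sketch that tolerates any number of commas inside the name: read the raw lines and split each on its first and last comma only (path and final column names taken from the question):
lines <- readLines("C:/dataextract.csv")[-1]  # drop the header row
m <- regmatches(lines, regexec("^([^,]+),(.*),([^,]+)$", lines))
df <- as.data.frame(do.call(rbind, lapply(m, `[`, -1)), stringsAsFactors = FALSE)
df[] <- lapply(df, trimws)  # the sample rows have stray spaces around fields
colnames(df) <- c("Cust_ID", "Sales_Assistant", "Store")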

Get column types of excel sheet automatically

I have an excel file with several sheets, each one with several columns, so I would like not to specify the type of each column separately, but to have it detected automatically. I want to read them the way stringsAsFactors = FALSE would do it, because it interprets the type of each column correctly. With my current method, a column with "0.492 ± 0.6" is interpreted as a number (returning NA), "because" the stringsAsFactors option is not available in read_excel. So here I wrote a workaround, which works more or less well, but which I cannot use in real life, because I am not allowed to create a new file. Note: I need some columns as numbers or integers, and others, containing only text, as characters, just as stringsAsFactors does in my read.csv example.
library(readxl)
file= "myfile.xlsx"
firstread<-read_excel(file, sheet = "mysheet", col_names = TRUE, na = "", skip = 0)
#firstread has the problem of the a column with "0.492 ± 0.6",
#being interpreted as number (returns NA)
colna<-colnames(firstread)
# read every column as character
colnumt<-ncol(firstread)
textcol<-rep("text", colnumt)
secondreadchar<-read_excel(file, sheet = "mysheet", col_names = TRUE,
col_types = textcol, na = "", skip = 0)
# another column, with the number 0.532, is now 0.5319999999999999
# and several other similar cases.
# read again with stringsAsFactors
# critical step, in real life, I "cannot" write a csv file.
write.csv(secondreadchar, "allcharac.txt", row.names = FALSE)
stringsasfactor<-read.csv("allcharac.txt", stringsAsFactors = FALSE)
colnames(stringsasfactor)<-colna
# column with "0.492 ± 0.6" now is character, as desired, others numeric as desired as well
Here is a script that imports all the data in your excel file. It puts each sheet's data in a list called dfs:
library(readxl)
# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")
# Loop through the sheet names and get the data in each sheet
dfs <- lapply(all_sheets, function(x) {
  # Get the number of columns in the current sheet
  col_num <- NCOL(read_excel(path = "myfile.xlsx", sheet = x))
  # Get the dataframe with columns as text
  df <- read_excel(path = "myfile.xlsx", sheet = x, col_types = rep('text', col_num))
  # Convert to data.frame
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  # Get numeric fields by trying to convert them into
  # numeric values. If it returns NA then not a numeric field.
  # Otherwise numeric.
  cond <- apply(df, 2, function(x) {
    x <- x[!is.na(x)]
    all(suppressWarnings(!is.na(as.numeric(x))))
  })
  numeric_cols <- names(df)[cond]
  df[, numeric_cols] <- sapply(df[, numeric_cols], as.numeric)
  # Return df in desired format
  df
})
# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets
The process goes as follows:
First, you get all the sheets in the file with excel_sheets and then loop through the sheet names to create dataframes. For each of these dataframes, you initially import the data as text by setting the col_types parameter to text. Once you have gotten the dataframe's columns as text, you can convert the structure from a tibble to a data.frame. After that, you then find columns that are actually numeric columns and convert them into numeric values.
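For example, to inspect what the conversion produced for the sheet named in the question:
str(dfs[["mysheet"]])  # shows which columns ended up character vs numeric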
Edit:
As of late April, a new version of readxl was released, and the read_excel function got two enhancements pertinent to this question. The first is that you can have the function guess the column types for you by passing "guess" to the col_types parameter. The second enhancement (a corollary of the first) is that a guess_max parameter was added to read_excel. This new parameter lets you set the number of rows used for guessing the column types. Essentially, what I wrote above can be shortened to the following:
library(readxl)
# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")
dfs <- lapply(all_sheets, function(sheetname) {
  suppressWarnings(read_excel(path = "myfile.xlsx",
                              sheet = sheetname,
                              col_types = 'guess',
                              guess_max = Inf))
})
# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets
I would recommend that you update readxl to the latest version to shorten your script and as a result avoid possible annoyances.
I hope this helps.

read.table returns extra rows

I am working with text files of many long rows, each with a varying number of elements. The elements in a row are separated by \t, and of course the rows are terminated by \n. I'm using read.table to read the text files. An example sample file is here: https://www.dropbox.com/s/6utslbnwerwhi58/samplefile.txt
The sample file has 60 rows.
Code to read the file:
sampleData <- read.table("samplefile.txt", as.is=TRUE, fill = TRUE);
dim(sampleData);
The dim returns 70 rows when in fact it should be 60. When I try nrows=60 like
sampleData <- read.table("samplefile.txt", as.is=TRUE, fill = TRUE, nrows = 60);
dim(sampleData);
it does work, however, I don't know if doing so will delete some of the information. My suspicion is that the last portions of some of the rows are added to new rows. I don't know why that would be the case, however, as I have fill = TRUE;
I have also tried
na.strings = "NA", fill=TRUE, strip.white=TRUE, blank.lines.skip =
TRUE, stringsAsFactors=FALSE, quote = "", comment.char = ""
but to no avail.
Does anyone have any idea what might be going on?
Your suspicion is essentially right: read.table determines the number of columns from the first five lines of input, so any later row with more fields than that has its overflow wrapped onto an extra row when fill = TRUE is set, which inflates the row count. In the absence of a reproducible example, try something like this:
# Make some fake data
R <- c("1 2 3 4","2 3 4","4 5 6 7 8")
writeLines(R, "samplefile.txt")
# read line by line
r <- readLines("samplefile.txt")
# split by sep
sp <- strsplit(r, " ")
# Make each into a list of dataframes (for rbind.fill)
sp <- lapply(sp, function(x) as.data.frame(t(x)))
# now bind
library(plyr)
rbind.fill(sp)
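The result should look roughly like this, with NA padding where rows were short:
#   V1 V2 V3   V4   V5
# 1  1  2  3    4 <NA>
# 2  2  3  4 <NA> <NA>
# 3  4  5  6    7    8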
If this is similar to your actual problem, anyway.
