The data I have is shown below, in a .csv file.
id.airwaybill_number.order_number.org_pincode.product_type.inscan_date.pickup_date.actual_weight.original_act_weight.chargeable_weight.collectable_value.declared_value.code.name.active.center_shortcode.center_shortcode.if.sc.center_shortcode...NULL csc.center_shortcode sc.center_shortcode..rts_status.reverse_pickup.ref_airwaybill_number.dest_pincode.pincode.item_description.length.breadth.height.volumetric_weight.city_name.city_name.state_shortcode.state_shortcode.zone_shortcode.zone_shortcode
"61773384 147200492 SLP759809537 110008 ppd 2016-03-02 04:38:56 2016-03-01 0.25 0.25 0.5 0 424 92006 JASPER INFOTECH PRIVATE LIMITED activ 0 NULL 37.5 DLT MPS MPS 0 0 NULL 403516 403516 Vimarsh Rechargeable Tube With Charger Emergency Light 10 10 10 0.2 DELHI MAPUSA DL GA NCR WS"
When I import it into R using:
y <- read.csv("x.csv", sep = "\t")
y <- read.table("x.csv", sep = "\t")
All the data ends up in a single cell. This is a sample of a much larger dataset, and I want to import the data column-wise rather than into one cell.
Please help.
Your file is a little odd, in that it seems to have a mix of delimiters (some \t, some _, and some ,), and as @Sun Bee mentions in the comments, your header doesn't seem to match up with your data. For those reasons, it might be worth working on the file "from scratch" rather than relying on something like read.table or fread.
First, read in the file as text:
con <- file( "x.csv" )
input <- readLines( con )
close( con )
Then perform a few tasks on it. First, split the text in each line by any of \t, ,, and _.
data <- sapply( input, strsplit, "\t|,|_" )
If you take a look at the lengths of each element, you'll see that the first (the header) is an odd one out, meaning the values won't line up with the header names.
sapply( data, length )
My suggestion here is to remove that first row, and go without a header for now.
data <- data[ -1 ]
Then bind the list together row-wise to make a matrix* (which you can convert to a data.frame if you prefer). I'm removing the row names here because I assume you don't need them.
data <- do.call( rbind, data )
row.names(data) <- NULL
What results from the above is something that I'd say represents your data well, albeit without column names. You can take the first line of your file and work with it to extract proper column names if you wish, but I'm not seeing exactly how they should line up with the values, so I won't attempt it here.
*NOTE: it is the conversion to a data.frame, not rbind itself, that turns character columns into the factor class by default; you can prevent that by specifying options( stringsAsFactors = FALSE ) beforehand, or by passing stringsAsFactors = FALSE to the conversion call.
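To finish with a data.frame, here is a minimal sketch (it assumes every line split into the same number of fields, which the sapply(data, length) check above should confirm; the V1, V2, ... names are just R's defaults standing in for the real header):

# Convert the character matrix to a data.frame without factors:
df <- as.data.frame(data, stringsAsFactors = FALSE)
# Optionally let R re-guess the column types (numeric, integer, ...):
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)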
ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one will serve as the column names. All values are separated by commas and wrapped in quotation marks except the first one on each line, and I think it is that that creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
But I can also import all the columns and drop the ones I don't want later; that's no problem.
The data comes in a .csv file that was given to me.
If I open this file in Excel, the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (btw I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the dataframe has the columns' names nicely separated but the values are all in the first one
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it from there, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect by hand.
Both read.delim and fread give warning messages; I can include them if they might be useful.
Update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it still splits the regex and msg columns in two, but ne and class are now distinct.
I tried with the input you provided using read.csv and had no problems; each column is accessible when subsetting. As for your other options, you're getting the quote option wrong: it needs to be "\"", i.e. the double-quote character escaped, as in df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
I am trying to convert a dataframe to a character array in R.
This works, but the text file only contains about 83 records:
data <- readLines("https://www.r-bloggers.com/wp-content/uploads/2016/01/vent.txt")
df <- data.frame(data)
textdata <- df[df$data, ]
This does not work, maybe because it has 3k records?
trump_posts <- read.csv(file="C:\\Users\\TAFer\\Documents\\R\\TrumpFBStatus1.csv",
sep = ",", stringsAsFactors = TRUE)
trump_text <- trump_posts[trump_posts$Facebook.Status, ]
All I know is that I have a dataframe called trump_posts. The frame has a single column called Facebook.Status. I just want to turn it into a character array so I can run an analysis on it.
Any help would be very much appreciated.
Thanks
If Facebook.Status is a character vector you can directly perform your analysis on it.
Or you can try:
trump_text <- as.character(trump_posts$Facebook.Status)
I think you are somehow confusing data.frame syntax with data.table syntax. For a data.frame, you reference a column vector as df$col. For a data.table, something close to what you wrote does work: dt[, col] or dt[, dt$col]. Also, if you want a character vector right away, set stringsAsFactors = FALSE in your read.csv; otherwise you'll need an extra conversion, for example dt[, as.character(col)] or as.character(df$col).
And on a side note, size of vector is almost never an issue, unless you hit the limits of your hardware.
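To illustrate the syntax difference with a small sketch (the column name col is hypothetical):

# data.frame: columns are extracted with $ or [[ ]]
df <- data.frame(col = c("a", "b"), stringsAsFactors = FALSE)
df$col        # character vector
df[["col"]]   # same thing

# data.table: columns can also be referenced bare inside [ ]
library(data.table)
dt <- as.data.table(df)
dt[, col]     # also a character vector, no quoting needed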
I have 40 CSV files with only 1 column each. I want to combine all 40 files data into 1 CSV file with 2 columns.
Each file's data is a single column whose values contain two space-separated numbers. I want to split this column by space and combine all 40 CSV files into one file, preserving the number format as well.
I tried the code below, but the number format is not preserved and an extra third column gets added for negative numbers. Not sure why.
My code:
filenames <- list.files(path="C://R files", full.names=TRUE)
merged <- data.frame(do.call("rbind", lapply(filenames, read.csv, header = FALSE)))
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," ",fixed=FALSE))
write.csv(data, "export1.csv", row.names=FALSE, na="NA")
In the output I got, the negative numbers are put into an extra column. I just want to split by space into two columns, keeping the exact number format of the input.
The problem is that the source data is delimited by:
one space when the second number is negative, and
two spaces when the second number is positive (space for the absent minus sign).
The trick is to split the string on one or more spaces:
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," +",fixed=FALSE))
I'm a bit OCD on charsets, unreliable files, etc, so I tend to use splitters such as "[[:space:]]+" instead, since it'll catch whitespace-variants instead of the space " " or tab "\t".
(In regex-speak, the + says "one or more". Other modifiers include ? as zero or one, and * as zero or more.)
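A quick demonstration with made-up values:

x <- c("1.23  4.56",  # two spaces before the positive number
       "7.89 -0.12")  # one space before the negative number
# Splitting on a single literal space leaves an empty field in row 1:
strsplit(x, " ", fixed = TRUE)
# Splitting on one or more whitespace characters gives two clean fields:
do.call(rbind, strsplit(x, "[[:space:]]+"))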
I have uploaded a data set called "Obtained Dataset". It usually has 16 rows of numeric and character variable names (some other files of a similar nature have fewer than 16); each of these is the header of a data column, and the data itself starts from the 17th row onwards in this specific file.
Obtained dataset & Required Dataset
In the data, the 1st column is the x-axis, the 2nd column is the y-axis, and the 3rd column is depth (these are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so on, as given in the first 16 rows of the file.
Now I want R code that converts it into the format shown in the required data set. Also, if a different data set has fewer than 16 lines of names, say GR 1 LIN and RHOB 1 LIN are missing, I still want it to create those columns, filled with NA for all rows.
Currently I export each file to Excel, manually clean the data, rename the columns correspondingly, save it as csv, and then read.csv("filename"), but it is simply not possible to do this for 400 files.
Any advice how to proceed will be of great help.
I have noticed that you have probably posted this question before, in a different format. This is a public forum, and people are happy to help; however, it's your job to simplify the lives of others, and you are requested to put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Lets start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in the first 6 lines, followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, and we want to skip the first 6 rows with undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check that the solution is working. If it is, move on to the next step; otherwise make the necessary changes.
Step4: Automating:
First we need a list of all 400 files.
The easiest way (also the easiest to explain) is to copy the 400 files into one directory and set that as the working directory (using setwd).
Now first we'll create a vector with all file names:
fileNameList <- dir()
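If the directory holds anything besides the 400 data files, it may be safer to filter by extension (assuming, hypothetically, that they are .txt files):

fileNameList <- dir(pattern = "\\.txt$")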
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
  temp <- read.table(file = fileName, sep = "\t", skip = 6)
  names(temp) <- namesVec
  # Write each cleaned file under a "clean-" prefix of its own name:
  write.table(temp, file = paste("clean", fileName, sep = "-"),
              row.names = FALSE, sep = "\t", quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
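You also asked about files where some of the 16 named columns are missing. A hedged sketch of how the read step inside convertFiles could handle that (it assumes a short file is simply missing its last columns, since a header-less table cannot tell us which ones are absent):

temp <- read.table(file = fileName, sep = "\t", skip = 6)
# Pad with NA columns up to the expected 16 before naming them:
if (ncol(temp) < length(namesVec)) {
  temp[(ncol(temp) + 1):length(namesVec)] <- NA
}
names(temp) <- namesVec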
Hope this helps!
I have two csv files: one containing measurements at several points and one containing the descriptions of the single points. There are about 100 different points and tens of thousands of measurements, but for simplicity let's assume there are only two points and two measurements.
data.csv:
point1,point2,date
25,80,11.06.2013
26,70,10.06.2013
description.csv:
point,name,description
point1,tempA,Temperature in room A
point2,humidA,Humidity in room A
Now I read both of the csv's into dataframes. Then I change the column names in the dataframe to make it more readable.
options(stringsAsFactors=F)
DataSource <- read.csv("data.csv")
DataDescription <- read.csv("description.csv")
for (name.source in names(DataSource)) {
  count <- 1
  for (name.target in DataDescription$point) {
    if (name.source == name.target) {
      names(DataSource)[names(DataSource) == name.source] <- DataDescription[count, 'name']
    }
    count <- count + 1
  }
}
So, my questions now are: Is there a way to do this without the loops? And would you change the names for readability as I did or not? If not, why?
The trick with replacements is sometimes to match the indexing on both sides of the assignment:
names(DataSource)[match(DataDescription$point, names(DataSource))] <-
DataDescription$name[match(DataDescription$point, names(DataSource))]
#> DataSource
tempA humidA date
1 25 80 11.06.2013
2 26 70 10.06.2013
Earlier effort :
names(DataSource)[match(DataDescription$point, names(DataSource))] <-
gsub(" ", "_", DataDescription$description)[
match(DataDescription$point, names(DataSource))]
#> DataSource
Temperature_in_room_A Humidity_in_room_A date
1 25 80 11.06.2013
2 26 70 10.06.2013
Notice that I did not put non-syntactic names on that dataframe; to do so would have been a disservice. Anando Mahto's comment is well considered: I would not want to do this unless it were at the very end of data processing, or a side excursion on the way to a plotting effort. In that case I might not substitute the underscores, and where you wanted plotting labels there might be a further need to insert "\n" to fold the text within space constraints.
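If you cannot rely on the points and the columns being in matching order, a more defensive sketch (not the code above, just an alternative) matches in the other direction and only renames columns that actually appear in the description:

# For each column of DataSource, find its row in DataDescription (NA if none):
m <- match(names(DataSource), DataDescription$point)
# Rename only the matched columns:
names(DataSource)[!is.na(m)] <- DataDescription$name[na.omit(m)]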
OK, I ordered the columns in the first one and the rows in the second one to work around the problem with the order of the points. Now the description only needs to contain the same points as the data source. Here is my final code:
# set options to get strings right
options(stringsAsFactors=F)
# read in original data
DataOriginal <- read.csv("data.csv", sep = ";")
DataDescriptionOriginal <- read.csv("description.csv", sep = ";")
# sort the data
DataOrdered <- DataOriginal[,order(names(DataOriginal))]
DataDescriptionOrdered <- DataDescriptionOriginal[order(DataDescriptionOriginal$point),]
# copy data into final dataframe and replace names
Data <- DataOrdered
names(Data)[match(DataDescriptionOrdered$point, names(Data))] <- gsub(" ", "_", DataDescriptionOrdered$description)[match(DataDescriptionOrdered$point, names(Data))]
Thanks a lot to everyone who contributed to finding a good solution for me!