R: Read in a .csv file and convert it into a multiple-column data frame

I am new to R and currently having plenty of trouble just reading in a .csv file and converting it into a data.frame with 7 columns. Here is what I am doing:
gene_symbols_table <- as.data.frame(read.csv(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header=TRUE, sep=","))
After that I get a data.frame with dim = 46761 x 1, but I need it to be 46761 x 7. I tried the advice in the following Stack Overflow threads:
How can you read a CSV file in R with different number of columns
read.delim() - errors "more columns than column names" and "header and 'col.names' are of different lengths"
Split a column of a data frame to multiple columns
But somehow nothing is working in my case.
Here is how the table looks:
> head(gene_symbols_table, 3)
  input.reason.matches.organism.name.primaryIdentifier.symbol.briefDescription.class.secondaryIdentifier
1 WBGene00008675 MATCH 1 Caenorhabditis elegans WBGene00008675 irld-26 Gene F11A5.7
2 WBGene00008676 MATCH 1 Caenorhabditis elegans WBGene00008676 oac-15 Gene F11A5.8
3 WBGene00008677 MATCH 1 Caenorhabditis elegans WBGene00008677 Gene F11A5.9
The .csv file in Excel looks like this:
input | reason | matches | organism.name | primaryIdentifier | symbol | briefDescription
WBGene00008675 | MATCH | 1 | Caenorhabditis elegans | WBGene00008675 | irld-26 | ...
...
The following code:
gene_symbols_table <- read.table(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header=FALSE, sep=",", col.names = paste0("V", seq_len(7)), fill = TRUE)
seems to be working; however, when I check dim I can see right away that it is wrong: 20124 x 7. Then:
  V1
1 input;reason;matches;organism.name;primaryIdentifier;symbol;briefDescription;class;secondaryIdentifier
2 WBGene00008675;MATCH;1;Caenorhabditis elegans;WBGene00008675;irld-26;;Gene;F11A5.7
3 WBGene00008676;MATCH;1;Caenorhabditis elegans;WBGene00008676;oac-15;;Gene;F11A5.8
  V2 V3 V4 V5
1
2
3
So this is wrong as well: everything has ended up in a single V1 column and V2 through V5 are empty.
Other attempts at read.table gave me the errors described in the second Stack Overflow thread above.
I have also tried splitting the one-column data.frame into 7 columns, but so far without success.

Judging by that output, the sep seems to be a space or a semicolon, not a comma. So either try specifying that, or try fread from the data.table package, which detects the separator automatically:
library(data.table)
gene_symbols_table <- as.data.frame(fread(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv", header=TRUE))
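If you prefer to stay in base R, a minimal sketch that just sets the separator explicitly (path as in the question; the exact column count depends on the file's actual header):
# A sketch assuming the separator is ";", as the semicolons in the
# printed one-column output suggest:
gene_symbols_table <- read.csv(
  "/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv",
  header = TRUE, sep = ";"
)
dim(gene_symbols_table)  # 46761 rows; one column per ";"-separated field in the header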

Related

Read comma separated csv file with fields containing commas using fread in r

I have a csv file separated by commas. However, there are fields containing commas, like company names such as "Apple, Inc", and those fields get split into two columns, which leads to the following error when using fread.
"Stopped early on line 5. Expected 26 fields but found 27."
Any suggestions on how to appropriately load this file?
Example rows are as follows. It seems that some fields contain a comma without quotes, but in those fields the comma is followed by whitespace.
100,Microsoft,azure.com
300,IBM,ibm.com
500,Google,google.com
100,Amazon, Inc,amazon.com
400,"SAP, Inc",sap.com
1) Using the test file created in the Note at the end, and assuming that the file has no semicolons (use some other character if it does), read in the lines, replace the first and last commas with semicolons, and then read it as a semicolon-separated file.
L <- readLines("firms.csv")
read.table(text = sub(",(.*),", ";\\1;", L), sep = ";")
## V1 V2 V3
## 1 100 Microsoft azure.com
## 2 300 IBM ibm.com
## 3 500 Google google.com
## 4 100 Amazon, Inc amazon.com
## 5 400 SAP, Inc sap.com
2) Another approach is to use gsub to replace every comma-followed-by-space with semicolon-followed-by-space, then use chartr to swap every comma with a semicolon and vice versa, and then read it in as a semicolon-separated file.
L <- readLines("firms.csv")
read.table(text = chartr(",;", ";,", gsub(", ", "; ", L)), sep = ";")
## V1 V2 V3
## 1 100 Microsoft azure.com
## 2 300 IBM ibm.com
## 3 500 Google google.com
## 4 100 Amazon, Inc amazon.com
## 5 400 SAP, Inc sap.com
3) Another possibility if there are not too many such rows is to locate them and then put quotes around the offending fields in a text editor. Then it can be read in normally.
which(count.fields("firms.csv", sep = ",") != 3)
## [1] 4
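To inspect the offending lines themselves before hand-editing, a small follow-up sketch using the same count.fields call:
# Locate the lines with the wrong field count and print them
bad <- which(count.fields("firms.csv", sep = ",") != 3)
readLines("firms.csv")[bad]
## [1] "100,Amazon, Inc,amazon.com"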
Note
Lines <- '100,Microsoft,azure.com
300,IBM,ibm.com
500,Google,google.com
100,Amazon, Inc,amazon.com
400,"SAP, Inc",sap.com
'
cat(Lines, file = "firms.csv")
Works fine for me. Can you provide a reproducible example?
library(data.table)
# Create example and write out
df_out <- data.frame("X" = c("A", "B", "C"),
                     "Y" = c("a,A", "b,B", "C"))
write.csv(df_out, file = "df.csv", row.names = F)
# Read in CSV with fread
df_in <- fread("./df.csv")
df_in
X Y
1: A a,A
2: B b,B
3: C C
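That said, rows with unquoted embedded commas (like the Amazon, Inc line in the question) will still make fread stop early. One hedged workaround is to pre-process the lines with the sub() trick from the first answer and hand the cleaned text to fread via its text= argument (available in data.table >= 1.11.0); firms.csv is the test file from the Note above:
library(data.table)
# Protect the two "real" commas by turning them into semicolons,
# then let fread parse on ";":
L <- readLines("firms.csv")
DT <- fread(text = sub(",(.*),", ";\\1;", L), sep = ";", header = FALSE)
DT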

Removing a tweet/row if it contains any non-English word

I want to remove the whole tweet/row from a data frame if it contains any non-English word.
My data frame looks like
text
1 | morning why didnt i go to sleep earlier oh well im seEING DNP TODAY!! JIP UHH <f0><U+009F><U+0092><U+0096><f0><U+009F><U+0092><U+0096>
2 | #natefrancis00 #SimplyAJ10 <f0><U+009F><U+0098><U+0086><f0><U+009F><U+0098><U+0086> if only Alan had a Twitter hahaha
3 | #pchirsch23 #The_0nceler #livetennis Whoa whoa let’s not take this too far now
4 | #pchirsch23 #The_0nceler #livetennis Well Pat that’s just not true
5 | One word #Shame on you! #Ji allowing looters to become president
The expected dataframe should be like this:
text
3 | #pchirsch23 #The_0nceler #livetennis Whoa whoa let’s not take this too far now
4 | #pchirsch23 #The_0nceler #livetennis Well Pat that’s just not true
5 | One word #Shame on you! #Ji allowing looters to become president.
You want to preserve the alphanumeric characters along with some punctuation like #, !, etc.
If the offending rows mostly contain <unicode> tags, then this should do:
For a data frame df with a text column, using grep:
new_str <- grep(df$text, pattern = "<.+>", value = TRUE, invert = TRUE)
new_str[new_str != ""]
To put it back into your original column text, you can work with just the indices you need and set the others to NA:
idx <- grep(df$text, pattern = "<.+>", invert = TRUE)
df$text[-idx] <- NA
For cleaning the tweets further, you can use the gsub function; see this post on cleaning tweets in R. A self-contained version of the approach is sketched below.
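Putting the pieces together, a minimal runnable sketch of the grep() approach (the sample rows below are abbreviated stand-ins for the data above):
# Toy data frame standing in for the tweets
df <- data.frame(
  text = c("morning why didnt i go to sleep earlier <U+009F><U+0092>",
           "#pchirsch23 #The_0nceler #livetennis Well Pat that's just not true",
           "One word #Shame on you! #Ji allowing looters to become president"),
  stringsAsFactors = FALSE
)
# Keep only the rows without <...> unicode placeholders
df_clean <- df[grep("<.+>", df$text, invert = TRUE), , drop = FALSE]
df_clean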

How to remove multiple commas but keep one in between two values in a csv file?

I have a csv file with millions of records like below
1,,,,,,,,,,a,,,,,,,,,,,,,,,,4,,,,,,,,,,,,,,,456,,,,,,,,,,,,,,,,,,,,,3455,,,,,,,,,,
1,,,,,,,,,,b,,,,,,,,,,,,,,,,5,,,,,,,,,,,,,,,467,,,,,,,,,,,,,,,,,,,,,3445,,,,,,,,,,
2,,,,,,,,,,c,,,,,,,,,,,,,,,,6,,,,,,,,,,,,,,,567,,,,,,,,,,,,,,,,,,,,,4656,,,,,,,,,,
I have to remove the extra commas between two values and keep only one. The output for the sample input should look like
1,a,4,456,3455
1,b,5,467,3445
2,c,6,567,4656
How can I achieve this using the shell, since I need to automate it for other files too?
I need to load this data into a database. Can we do it using R?
sed method:
sed -e "s/,\+/,/g" -e "s/,$//" input_file > output_file
This turns multiple commas into a single comma and also removes the trailing comma on each line.
Edited to address modified question.
R solution.
The original solution provided was just processing text. Assuming that your rows are in a character vector, you can handle multiple rows with:
# Create Data
Row1 = "1,,,,,,,a,,,,,,,,,,4,,,,,,,,,456,,,,,,,,,,,3455,,,,,,,"
Row2 = "2,,,,,,,b,,,,,,,,,,5,,,,,,,,,567,,,,,,,,,,,4566,,,,,,,"
Rows = c(Row1, Row2)
CleanedRows = gsub(",+", ",", Rows) # Compress multiple commas
CleanedRows = sub(",\\s*$", "", CleanedRows) # Remove final comma if any
[1] "1,a,4,456,3455" "2,b,5,567,4566"
But if you are trying to read this from a csv and compress the rows,
## Create sample data
Data = read.csv(text = "1,,,,,,,a,,,,,,,,,,4,,,,,,,,,456,,,,,,,,,,,3455,,,,,,,
2,,,,,,,b,,,,,,,,,,5,,,,,,,,,567,,,,,,,,,,,4566,,,,,,,",
header=FALSE)
Your code would probably say
Data = read.csv("YourFile.csv", header=FALSE)
Data = Data[which(!is.na(Data[1,]))]
Data
V1 V8 V18 V27 V38
1 1 a 4 456 3455
2 2 b 5 567 4566
Note: This assumes that the non-blank fields are in the same place in every row.
Use tr -s:
echo 'a,,,,,,,,b,,,,,,,,,,c' | tr -s ','
Output:
a,b,c
If the input line has trailing commas, tr -s ',' would squeeze those trailing commas into one comma, but getting rid of that last one requires adding a little sed code: tr -s ',' | sed 's/,$//'.
Speed. Tests on a 10,000,000 line test file consisting of the first line in the OP example, repeated.
3 seconds. tr -s ',' (but leaves trailing comma)
9 seconds. tr -s ',' | sed 's/,$//'
30 seconds. sed -e "s/,\+/,/g" -e "s/,$//" (Jean-François Fabre's answer.)
If you have a file that's really a CSV file, it might have quoting of commas in a few different ways, which can make regex-based CSV parsing unhappy.
I generally use and recommend csvkit which has a nice set of CSV parsing utilities for the shell. Docs at http://csvkit.readthedocs.io/en/latest/
Your exact issue is answered in csvkit with this set of commands. First, csvstat shows what the file looks like:
$ csvstat -H --max tmp.csv | grep -v None
1. column1: 2
11. column11: c
27. column27: 6
42. column42: 567
63. column63: 4656
Then, now that you know that all of the data is in those columns, you can run this:
$ csvcut -c 1,11,27,42,63 tmp.csv
1,a,4,456,3455
1,b,5,467,3445
2,c,6,567,4656
to get your desired answer.
Can we do it using R?
Provided your input is as shown, i.e., you want to skip the same columns in all rows, you can analyze the first line and then define column classes in read.table:
text <- "1,,,,,,,,,,a,,,,,,,,,,,,,,,,4,,,,,,,,,,,,,,,456,,,,,,,,,,,,,,,,,,,,,3455,,,,,,,,,,
1,,,,,,,,,,b,,,,,,,,,,,,,,,,5,,,,,,,,,,,,,,,467,,,,,,,,,,,,,,,,,,,,,3445,,,,,,,,,,
2,,,,,,,,,,c,,,,,,,,,,,,,,,,6,,,,,,,,,,,,,,,567,,,,,,,,,,,,,,,,,,,,,4656,,,,,,,,,,"
tmp <- read.table(text = text, nrows = 1, sep = ",")
colClasses <- sapply(tmp, class)
colClasses[is.na(unlist(tmp))] <- "NULL"
Here I assume there are no actual NA values in the first line. If there could be, you'd need to adjust it slightly.
read.table(text = text, sep = ",", colClasses = colClasses)
# V1 V11 V27 V42 V63
#1 1 a 4 456 3455
#2 1 b 5 467 3445
#3 2 c 6 567 4656
Obviously, you'd specify a file instead of text.
This solution is fairly efficient for smallish to moderately sized data. For large data, substitute the second read.table with fread from package data.table (but that applies regardless of the skipping columns problem).
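For completeness, a hedged sketch of the same two-pass idea with fread, using its select= argument to read only the non-empty columns (reusing the text variable from above; assumes data.table >= 1.11.0 for text=, and with a real file you would pass file = instead):
library(data.table)
# Pass 1: read a single row to find which columns actually hold data
tmp  <- fread(text = text, nrows = 1, header = FALSE)
vals <- unlist(tmp)
keep <- which(!is.na(vals) & vals != "")
# Pass 2: read only those columns
DT <- fread(text = text, header = FALSE, select = keep)
DT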

How to find and replace double quotes in R data frame

I have a data frame that looks like this (sorry, I can't replicate the actual data frame with code as the double quotes don't show up. Vx are variables):
V1, V2, V3, V4
home, 15, "grand", terminal,
"give", 32, "cuz", good,
"miles", 5, "before", ten,
yes, 45, "sorry," fine
Question: how might I fix the double quote issue for my entire data frame, which I've imported using the read.csv function, so that all the double quotes are removed?
What I'm looking for is the excel or word equivalent of FIND + REPLACE: Find the double quote, and replace with nothing.
Notes:
1) I've confirmed it's a data frame by running is.data.frame() function
2) The actual data frame has hundreds of columns, so going through each one and declaring the type of column it is isn't feasible
3) I tried using the following, and it didn't work: as.data.frame(sapply(my_data, function(x) gsub("\"", "", x)))
4) I confirmed that this isn't a simple print issue by testing with SQL on the data frame. It won't find columns in double quotes unless I use LIKE instead of =.
Thanks in advance!
7/7/15 EDIT 01: as requested by @alexforrence, here is the dput output for a couple of columns:
billing_first_name billing_last_name billing_company
3 NA
4 Peldi Guilizzoni NA
5 NA
6 "James Andrew" Angus NA
7 NA
8 Nova Spivack NA
Here is a solution using dplyr and stringr. Note that purely numerical columns will be character columns afterwards. It's not clear to me from your description whether there are purely numerical columns. If there are then you'd probably want to treat them separately, or alternatively convert back into numbers afterwards.
require(dplyr)
require(stringr)
df <- data.frame(V1 = c("home", "\"give\"", "\"miles\"", "yes"),
                 V2 = c(15, 32, 5, 45),
                 V3 = c("\"grand\"", "\"cuz\"", "\"before\"", "\"sorry\""),
                 V4 = c("terminal", "good", "ten", "fine"))
df
## V1 V2 V3 V4
## 1 home 15 "grand" terminal
## 2 "give" 32 "cuz" good
## 3 "miles" 5 "before" ten
## 4 yes 45 "sorry" fine
df %>% mutate_each(funs(str_replace_all(., "\"", "")))
## V1 V2 V3 V4
## 1 home 15 grand terminal
## 2 give 32 cuz good
## 3 miles 5 before ten
## 4 yes 45 sorry fine
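Note that mutate_each() and funs() are deprecated in current dplyr; a sketch of the modern equivalent, assuming dplyr >= 1.0 (df and the packages are as loaded above):
df %>% mutate(across(everything(), ~ str_replace_all(.x, "\"", "")))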
You can identify empty strings (cells that print as "") using nchar():
a <- ""
nchar(a)==0
[1] TRUE
In addition to the above, I ran into a very strange problem. Using the tips, I wrote this very short program:
setClass("char.with.deleted.quotes")
setAs("character", "char.with.deleted.quotes",
function(from) as.character(gsub('„',"xxx", as.character(from), fixed = TRUE)))
TMP = read.csv2("./test.csv", header=TRUE, sep=";", dec=",",
colClasses = c("character","char.with.deleted.quotes"))
temp <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
print(temp)
with the output:
> source('test.R')
[1] "This is some „Test" "And another „Test"
[1] " "
Number Name
1 X-23 This is some „Test
2 K-33.01 And another „Test
which reads the dummy csv:
Number;Name
X-23;This is some „Test
K-33.01;And another „Test
My goal is to get rid of this low double quote („) before the word Test. However, this so far does not work, and it fails precisely because of this character.
If instead I choose to replace a different part of the string, it does work, either with read.csv2 and the above class definition, or directly with gsub, saving the result into the temp variable.
Now what is really strange is the following. After running the program I copied the two lines "temp <- gsub" and "print(temp)" manually into the command line:
> source('test.R')
[1] "This is some „Test" "And another „Test"
[1] "This is some „Test" "And another „Test"
[1] " "
Number Name
1 X-23 This is some „Test
2 K-33.01 And another „Test
>
> temp <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
> print(temp)
[1] "This is some xxxTest" "And another xxxTest"
This, for whatever reason, works, and it also works if I modify the data frame directly:
> TMP$Name <- gsub('„', "xxx", TMP$Name, fixed=TRUE)
> print(TMP)
Number Name
1 X-23 This is some xxxTest
2 K-33.01 And another xxxTest
But if I repeat this command in the program and run it again, it does not work. And I really have no idea why.
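A hedged guess: this looks like an encoding mismatch in source() rather than a gsub problem. When a script containing the non-ASCII „ character is sourced with the wrong encoding, the literal inside the script no longer matches the character in the data, while the very same line typed at the console does. Assuming test.R is saved as UTF-8, this may fix it:
# Tell source() the script's encoding so the '„' literal in test.R
# matches the character in the data read from test.csv
source('test.R', encoding = 'UTF-8')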

How to read and print first head of a file in R?

I want to print the head of a file in R. I know how to use read.table and the other input methods supported by R. I just want to know the R alternatives to the Unix commands cat and head, which read in a file and print some of it.
Thank you,
SangChul
read.table() takes an nrows argument for just this purpose:
read.table(header=TRUE, text="
a b
1 2
3 4
", nrows=1)
# a b
# 1 1 2
If you are instead reading in (possibly less structured) files with readLines(), you can use its n argument instead:
readLines(textConnection("a b
1 2 3 4 some other things
last"), n=1)
# [1] "a b"
