colsplit in R: separate one column into two

I have a CSV file ("sumCounts") loaded into R which contains a column called "transcript". An example of a row in this column is shown below:
TR43890|c0_g1_i1
I want to split this column into two columns called "transcript" and "isoform" along the pipe "|" character.
sumCounts <- colsplit(transcript, "|", c("transcript", "isoform"))
I keep getting the following error: Error in str_split_fixed(string, pattern, n = length(names)) : object 'transcript' not found

Your question doesn't contain quite enough information to know whether this will work, but I'm assuming your data is read into a data object named sumCounts, with a column named transcript that you want separated into two. If that's the case, then Hadley Wickham's tidyr package will do what you want:
install.packages("tidyr")
require(tidyr)
#sumCounts <- read.csv("sumCounts.csv")
## Toy example:
sumCounts <- data.frame(
  "transcript" = c(
    "TR43890|c0_g1_i1",
    "TR43890|c0_g1_i1",
    "TR43890|c0_g1_i1"
  )
)
## Note that the sep= argument takes a regular expression, in which
## the pipe is a special character and must be escaped:
separate(sumCounts, transcript, c("transcript", "isoform"), sep="\\|")
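For reference, the original colsplit() call can also be made to work; the error comes from referring to transcript outside its data frame, and the pipe again needs escaping because the pattern is treated as a regular expression. A sketch, assuming colsplit() comes from reshape2 (whose str_split_fixed() call appears in the error message):
library(reshape2)
## reference the column via its data frame and escape the pipe:
colsplit(sumCounts$transcript, "\\|", c("transcript", "isoform"))
Note that assigning this result back to sumCounts, as in the original call, would replace the whole data frame with just these two columns.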

Related

Can I convert nested lists to a data frame without altering column names?

I'm investigating the ability of R to read files with unfriendly column names. One file type I'm looking at is JSON, which can be read with jsonlite or rjson. The jsonlite package works wonderfully and is not involved in my question.
To start with, here are two pieces of data that contain unfriendly column names. This is a small example of my actual data, which also contains numeric columns.
library(rjson)
library(dplyr)
csv_file <- 'English,中文 (Simplified),中文 (Traditional),فارسى,Русский язык,Tiếng Việt
Alice,蔼,佩佩,آیسا,Аделаида,Anh Đào
Amy,安,姗姗,افسانه,Анна,Anh Thư
Barbara,芳,安娜,اندیشه,Валентина,Bấc
Carol,淑芬,寶珠,نادره,Вера,Bạch Tuyểt
Elizabeth,菊,小蘭,بهشته,Зинаида,Bảo Châu
Jaclyn,兰,小鳳,تبسم,Изабэлла,Chín
Jane,丽丽,惠娟,توسکا,Капитолина,Cúc
Judy,梅,淑儀,روژین,Лариса,Dậu
Katie,敏,燕玲,زرّین تاج,Маргарита,Diễm'
json_file <- '[{"English":"Alice","中文 (Simplified)":"蔼","中文 (Traditional)":"佩佩","فارسى":"آیسا","Русский язык":"Аделаида","Tiếng Việt":"Anh Đào"},
{"English":"Amy","中文 (Simplified)":"安","中文 (Traditional)":"姗姗","فارسى":"افسانه","Русский язык":"Анна","Tiếng Việt":"Anh Thư"},
{"English":"Barbara","中文 (Simplified)":"芳","中文 (Traditional)":"安娜","فارسى":"اندیشه","Русский язык":"Валентина","Tiếng Việt":"Bấc"},
{"English":"Carol","中文 (Simplified)":"淑芬","中文 (Traditional)":"寶珠","فارسى":"نادره","Русский язык":"Вера","Tiếng Việt":"Bạch Tuyểt"},
{"English":"Elizabeth","中文 (Simplified)":"菊","中文 (Traditional)":"小蘭","فارسى":"بهشته","Русский язык":"Зинаида","Tiếng Việt":"Bảo Châu"},
{"English":"Jaclyn","中文 (Simplified)":"兰","中文 (Traditional)":"小鳳","فارسى":"تبسم","Русский язык":"Изабэлла","Tiếng Việt":"Chín"},
{"English":"Jane","中文 (Simplified)":"丽丽","中文 (Traditional)":"惠娟","فارسى":"توسکا","Русский язык":"Капитолина","Tiếng Việt":"Cúc"},
{"English":"Judy","中文 (Simplified)":"梅","中文 (Traditional)":"淑儀","فارسى":"روژین","Русский язык":"Лариса","Tiếng Việt":"Dậu"},
{"English":"Katie","中文 (Simplified)":"敏","中文 (Traditional)":"燕玲","فارسى":"زرّین تاج","Русский язык":"Маргарита","Tiếng Việt":"Diễm"}]'
I can easily create an unaltered data frame from the CSV data (but notice the check.names = FALSE), and this is the standard against which I compare other files to see whether they were read in correctly. I additionally read the JSON data using rjson::fromJSON().
# make data frame from csv data
df_csv = read.csv(text = csv_file, check.names = FALSE)
# read json data using rjson
# giving a list of lists
raw_json <- rjson::fromJSON(json_file)
The rjson package returns a nested list of lists, which means that making it a data frame with something simple like as.data.frame() won't work. I thought of using rbind() inside of do.call() to create a data frame, which I'll then compare to the CSV data using dplyr::setdiff(). In this case I can use check.names = FALSE inside the data.frame() function.
# make data frame from json data using data.frame and rbind
df_json1 <- data.frame(do.call(rbind, raw_json), check.names = FALSE)
# compare using setdiff
setdiff(df_csv, df_json1)
The setdiff() comparison gives an error because all the columns in the JSON-sourced data frame are lists.
Error in `setdiff()`:
! `x` and `y` are not compatible.
✖ Incompatible types for column `English`: character vs list.
✖ Incompatible types for column `中文 (Simplified)`: character vs list.
✖ Incompatible types for column `中文 (Traditional)`: character vs list.
✖ Incompatible types for column `فارسى`: character vs list.
✖ Incompatible types for column `Русский язык`: character vs list.
✖ Incompatible types for column `Tiếng Việt`: character vs list.
After some reading and discussing, I discovered that there's a version of rbind() for data frames, rbind.data.frame(). So, I tried using that inside of my do.call(), obviating the need for the surrounding data.frame().
# make data frame from json using rbind.data.frame
# gives a data frame whose columns are vectors
df_json2 <- do.call(rbind.data.frame, raw_json)
#compare using setdiff
setdiff(df_csv, df_json2)
This causes an error because the names were changed to be R-friendly. I no longer have the ability to use check.names = FALSE. (Do I? I looked and can't find a way.)
Error in `setdiff()`:
! `x` and `y` are not compatible.
✖ Cols in `y` but not `x`: `中文..Simplified.`, `中文..Traditional.`,
`Русский.язык`, `Tiếng.Việt`.
✖ Cols in `x` but not `y`: `中文 (Simplified)`, `中文 (Traditional)`, `Русский
язык`, `Tiếng Việt`.
Here is my dilemma: I'd like to create a data frame identical to the CSV-generated one, but can't seem to get it with these approaches. How can I take the output of rjson::fromJSON() and arrive at the same data frame I get from the CSV file?
NB: I realize the column name changes seem to be only an issue of spaces and parentheses. Things are cleaner here because I've pasted all the Unicode directly into the post; when I read data files from external sources, much stranger things happen. For example, the Tiếng Việt column gets changed to Tiê.ng.Viê.t, I suppose because of the multiple diacritics. Hence my question about comparing the files without having to guess everything that can go wrong and preprocess it away.
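One way this could be approached (a sketch, not a verified answer, with m and df_json3 as illustrative names): unlist each column of the rbind() result so the list columns become ordinary character vectors, then rebuild the data frame with check.names = FALSE so the original names survive.
## assuming raw_json is the list returned by rjson::fromJSON() above
m <- do.call(rbind, raw_json)                 # matrix whose cells are length-1 lists
df_json3 <- data.frame(apply(m, 2, unlist),   # unlist each column into a character vector
                       check.names = FALSE,   # keep the original column names
                       stringsAsFactors = FALSE)
setdiff(df_csv, df_json3)                     # should return zero rows if the contents match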

Importing a CSV file with read.csv, but the function recognizes the wrong number of columns

I tried to import the CSV file from here: https://covid19.who.int/WHO-COVID-19-global-table-data.csv using the read.csv function:
WHO_data <- read.csv("https://covid19.who.int/WHO-COVID-19-global-table-data.csv")
But the WHO_data I got has only 12 columns and treats the first column as row names.
I tried another method, reading a tibble instead of a data frame:
library(readr)
WHO_data <- read_csv("https://covid19.who.int/WHO-COVID-19-global-table-data.csv")
It then gives the warning below:
Warning: 1 parsing failure.
row col expected actual file
1 -- 12 columns 13 columns 'https://covid19.who.int/WHO-COVID-19-global-table-data.csv'
Can anyone explain why this happens and how to fix it?
The file seems to be improperly formatted: there is an extra comma at the end of the second line. You can read the raw lines, remove the comma, and then pass the text to read.csv. For example:
file <- "https://covid19.who.int/WHO-COVID-19-global-table-data.csv"
rows <- readLines(file)
rows[2] <- gsub(",$", "", rows[2])
WHO_data <- read.csv(text=rows)
Here is another solution based on the data.table package. If you want to return a data.frame (as opposed to data.table), you can additionally specify the argument data.table=FALSE to the fread function:
library(data.table)
file <- "https://covid19.who.int/WHO-COVID-19-global-table-data.csv"
WHO_data <- fread(file, select=1:12, fill=TRUE)
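For instance, the same call returning a plain data.frame could look like this (WHO_data_df is just an illustrative name):
WHO_data_df <- fread(file, select = 1:12, fill = TRUE, data.table = FALSE)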

Read a CSV file with quotation marks and regex in R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first will serve as the column names. The fields are all separated by commas, and every value except the first is in quotation marks, which I think is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and drop the ones I don't want later; that's no problem.
The data comes in a .csv file that was given to me.
If I open this file in Excel, the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (btw I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values are all in the first column.
Then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split in two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it that way, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines long, so it would be hard to inspect.
Both read.delim and fread give warning messages; I can include them if they might be useful.
Update:
Using
library(data.table)
df <- fread("file.csv", quote = "")
gives me an output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.
I tried read.csv with the input you provided and had no problems; each column is accessible when subsetting. As for your other options, you're getting the quote option wrong: it needs to be "\"", i.e. the double-quote character must be escaped, as in df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 row and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"

Vectorise an imported variable in R

I have imported a CSV file into R, but now I would like to extract a variable into a vector and analyse it separately. Could you please tell me how I could do that?
I know that the summary() function gives a rough idea but I would like to learn more.
I apologise if this is a trivial question but I have watched a number of tutorial videos and have not seen that anywhere.
Read data into data frame using read.csv. Get names of data frame. They should be the names of the CSV columns unless you've done something wrong. Use dollar-notation to get vectors by name. Try reading some tutorials instead of watching videos, then you can try stuff out.
d = read.csv("foo.csv")
names(d)
v = d$whatever # for example
hist(v) # for example
This is totally trivial stuff.
I assume you have used the read.csv() or the read.table() function to import your data into R. (You can get help directly in R with ?, e.g. ?read.csv.)
So normally you have a data.frame. And if you check the documentation, a data.frame is described as "[...] tightly coupled collections of variables which share many of the properties of matrices and of lists [...]".
So basically you can already handle your data as vectors.
A quick search on SO turned up these two posts, among others:
Converting a dataframe to a vector (by rows) and
Extract Column from data.frame as a Vector
And I am sure there are more relevant ones. Try some good tutorials on R (videos are not as instructive in this case).
There are plenty of good ones on the Internet, e.g.:
* http://www.introductoryr.co.uk/R_Resources_for_Beginners.html (which lists some)
or
* http://tryr.codeschool.com/
Anyway, one way to deal with your CSV would be:
#import the data to R as a data.frame
mydata = read.csv(file="SomeFile.csv", header = TRUE, sep = ",",
quote = "\"",dec = ".", fill = TRUE, comment.char = "")
#extract a column to a vector
firstColumn = mydata$col1 # extract the column named "col1" of mydata to a vector
#This previous line is equivalent to:
firstColumn = mydata[,"col1"]
#extract a row to a vector
firstline = mydata[1,] #extract the first row of mydata to a vector
Edit: In some cases[1], you might need to coerce the data into a vector by applying functions such as as.numeric or as.character:
firstline=as.numeric(mydata[1,])#extract the first row of mydata to a vector
#Note: the entire row *has to be* numeric or compatible with that class
[1] e.g. it happened to me when I wanted to extract a row of a data.frame inside a nested function
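Once a column has been extracted (e.g. firstColumn above), the usual summary functions apply to the vector directly. A small sketch, assuming the column is numeric:
summary(firstColumn)   # min, quartiles, mean, max
hist(firstColumn)      # quick look at the distribution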

R: Import CSV with column names that contain spaces

The CSV file looks like this (modified for brevity). Several columns have spaces in their titles, and R can't seem to handle them.
Alias;Type;SerialNo;DateTime;Main status; [...]
E1;E-70;781733;01/04/2010 11:28;8; [...]
Here is the code I am trying to execute:
s_data <- read.csv2( file=f_name )
attach(s_data)
s_df = data.frame(
scada_id=ID,
plant=PlantNo,
date=DateTime,
main_code=Main status,
seco_code=Additional Status,
main_text=MainStatustext,
seco_test=AddStatustext,
duration=Duration)
detach(s_data)
I have also tried substituting
main_code=Main\ status
and
main_code="Main status"
Unless you specify check.names=FALSE, R will convert column names that are not valid variable names (e.g. ones that contain spaces or special characters, or start with numbers) into valid variable names, e.g. by replacing spaces with dots. Try names(s_data). If you do use check.names=FALSE, then surround the names with single back-quotes (`).
I would also recommend using rename from the reshape package (or, these days, dplyr::rename).
s_data <- read.csv2( file=f_name )
library(reshape)
s_df <- rename(s_data, c(ID="scada_id",
    PlantNo="plant", DateTime="date", Main.status="main_code",
    Additional.status="seco_code", MainStatustext="main_text",
    AddStatustext="seco_test", Duration="duration"))
For what it's worth, the tidyverse tools (i.e. readr::read_csv) have the opposite default; they don't transform the column names to make them legal R symbols unless you explicitly request it.
s_data <- read.csv( file=f_name , check.names=FALSE)
I believe spaces get replaced by dots "." when importing CSV files. So you'd write e.g. Main.status. You can check by entering names(s_data) to see what the names are.
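To make the two approaches concrete, a minimal sketch (assuming f_name points at the semicolon-separated file shown above):
## default: spaces are converted to dots
s_data <- read.csv2(file = f_name)
s_df <- data.frame(main_code = s_data$Main.status)
## keep the original names and use back-quotes for the non-syntactic ones
s_data <- read.csv2(file = f_name, check.names = FALSE)
s_df <- data.frame(main_code = s_data$`Main status`)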
