issue with reading and writing a csv file in R language

I have a table in csv format, the data is the following:
1 3 1 2
1415_at 1 8.512147859 8.196725061 8.174426394 8.62388149
1411_at 2 9.119200527 9.190318548 9.149239039 9.211401637
1412_at 3 10.03383593 9.575728316 10.06998673 9.735217522
1413_at 4 5.925999419 5.692092375 5.689299161 7.807354922
When I read it with:
m <- read.csv("table.csv")
and print the values of m, I notice that they change to:
X X.1 X1 X3 X1.1 X2
1 1415_at 1 8.512148 8.196725 8.174426 8.623881
I made some manipulation to keep only those columns that are labelled 1 or 2, so I do that with:
smallerdat <- m[ grep("^X$|^X.1$|^X1$|^X2$|1\\.|2\\." , names(m) ) ]
write.csv(smallerdat,"table2.csv")
it writes the file with those annoying headers and an added first column, which I do not need:
X X.1 X1 X1.1 X2
1 1415_at 1 8.512148 8.174426 8.623881
so when I open that data in Excel the headers are still X, X.1 and so on. What I need is for the headers to remain the same as:
1 1 2
1415_at 1 8.196725061 8.174426394 8.62388149
Any help?
Please also notice the first column that is added automatically. I do not need it, so how can I get rid of that column?

There are two issues here.
For reading your CSV file, use:
m <- read.csv("table.csv", check.names = FALSE)
Notice that by doing this, though, you can't use the column names as easily. You have to quote them with backticks instead, and will most likely still run into problems because of duplicated column names:
m$1
# Error: unexpected numeric constant in "m$1"
m$`1`
# [1] 8.512148 9.119201 10.033836 5.925999
For writing your "m" object to a CSV file, use:
write.csv(m, "table2.csv", row.names = FALSE)
After reading your file in using the method in step 1, you can subset as follows. If you wanted the first column and any columns named "3" or "4", you can use:
m[names(m) %in% c("", "3", "4")]
# 3 4
# 1 1415_at 1 8.196725 8.623881
# 2 1411_at 2 9.190319 9.211402
# 3 1412_at 3 9.575728 9.735218
# 4 1413_at 4 5.692092 7.807355
Update: Fixing the names before using write.csv
If you don't want to start from step 1 for whatever reason, you can still fix your problem. While you've succeeded in taking a subset with your grep statement, that doesn't change the column names (not sure why you would expect that it should). You have to do this by using gsub or one of the other regex solutions.
Here are the names of the columns with the way you have read in your CSV:
names(m)
# [1] "X" "X.1" "X1" "X3" "X1.1" "X2"
You want to:
Remove all "X"s
Remove all ".some-number"
So, here's a workaround:
# Change the names in your original dataset
names(m) <- gsub("^X|\\.[0-9]+$", "", names(m))
# Create a temporary object to match desired names
getme <- names(m) %in% c("", "1", "2")
# Subset your data
smallerdat <- m[getme]
# Reassign names to your subset
names(smallerdat) <- names(m)[getme]
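Putting the pieces of this answer together, the whole round trip can be sketched as below. This is only an illustration under stated assumptions: the temporary files are made up, and the first two columns are given the hypothetical names id and n (in the original file those headers are blank).

```r
# Hypothetical input: same values as in the question, but the first two
# columns get the assumed names "id" and "n"
infile <- tempfile(fileext = ".csv")
writeLines(c("id,n,1,3,1,2",
             "1415_at,1,8.512147859,8.196725061,8.174426394,8.62388149",
             "1411_at,2,9.119200527,9.190318548,9.149239039,9.211401637"),
           infile)

# check.names = FALSE keeps the headers exactly as written in the file
m <- read.csv(infile, check.names = FALSE)

# Keep the identifier column and every column labelled 1 or 2
keep <- names(m) %in% c("id", "1", "2")
smallerdat <- m[keep]
# Subsetting may make duplicated names unique, so restore the originals
names(smallerdat) <- names(m)[keep]

# row.names = FALSE drops the extra first column of row numbers
outfile <- tempfile(fileext = ".csv")
write.csv(smallerdat, outfile, row.names = FALSE)
header <- readLines(outfile, n = 1)
```

The variable header then holds the first line of the written file, which is a quick way to check the column labels before opening the file in Excel.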

I am not sure I understand what you are attempting to do, but here is some code that reads a csv file with missing headers for the first two columns, selects only columns with a header of 1 or 2 and then writes that new data file retaining the column names of 1 or 2.
# first read in only the headers and deal with the missing
# headers for columns 1 and 2
b <- readLines('c:/users/Mark W Miller/simple R programs/missing_headers.csv',
n = 1)
b <- unlist(strsplit(b, ","))
b[1] <- 'name1'
b[2] <- 'name2'
b <- gsub(" ","", b, fixed=TRUE)
b
# read in the rest of the data file
my.data <- (
read.table(file = "c:/users/mark w miller/simple R programs/missing_headers.csv",
na.string=NA, header = F, skip=1, sep=','))
colnames(my.data) <- b
# select the columns with names of 1 or 2
my.data <- my.data[names(my.data) %in% c("1", "2")]
# retain the original column names of 1 or 2
names(my.data) <- floor(as.numeric(names(my.data)))
# write the new data file with original column names
write.csv(
my.data, "c:/users/mark w miller/simple R programs/missing_headers_out.csv",
row.names=FALSE, quote=FALSE)
Here is the input data file. Note the commas with missing names for columns 1 and 2:
, , 1, 3, 1, 2
1415_at, 1, 8.512147859, 8.196725061, 8.174426394, 8.62388149
1411_at, 2, 9.119200527, 9.190318548, 9.149239039, 9.211401637
1412_at, 3, 10.03383593, 9.575728316, 10.06998673, 9.735217522
1413_at, 4, 5.925999419, 5.692092375, 5.689299161, 7.807354922
Here is the output data file:
1,1,2
8.512147859,8.174426394,8.62388149
9.119200527,9.149239039,9.211401637
10.03383593,10.06998673,9.735217522
5.925999419,5.689299161,7.807354922

Related

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that was converted from a document-term matrix to a data frame. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag" that distinguishes the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my data frame), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.

Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"

If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"

You can use setnames from the data.table package; it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4

extracting variable from file names in R

I have files that contain multiple rows. I want to add two new columns that I create by extracting variables from the file name and combining them with the existing columns.
For example, I have a bunch of files that are named something like this:
file1[1000,1001].txt
file1[2000,1001].txt
Between the [] there are always 2 numbers separated by a comma.
The file itself has multiple columns, for example column1 and column2.
For each file, I want to extract the 2 values in the name of the file and then use them as variables to make 2 new columns that use those values to modify the existing ones.
for example
file1[1000,2000]
the file contains two columns
column1 column2
1 2
2 4
I want at the end to add the first file name value to column 1 to create column3 and add the second file name value to column 2 to create column 4, ending up with something like this
column1 column2 column3 column4
1 2 1001 2002
2 4 1002 2004
Thanks for the help. I am almost there, with just a few more issues.
The original file has 2 columns, "X_Parameter" and "Y_Parameter", and the file name is "test(64084,4224).txt".
Your code works great at extracting the two values V1 "64084" and V2 "4224" from the file name. I then add these values to the original data set. This yields 4 columns: "X_Parameter", "Y_Parameter", "V1" and "V2".
setwd("~/Desktop/txt/")
txt_names = list.files(pattern = ".txt")
for (i in 1:length(txt_names)){assign(txt_names[i], read.delim(txt_names[i]))
DS1 <- read.delim(file = txt_names[i], header = TRUE, stringsAsFactors = TRUE)
require(stringr)
remove_text <- str_extract(txt_names, pattern = "\\[[0-9,0-9]+\\]")
step1 <- gsub("(\\[)", "", remove_text)
step2 <- gsub("(\\])", "", step1)
DS2<-as.data.frame(do.call("rbind", (str_split(step2, ","))))
DS1$V1<-DS2$V1
DS1$V2<-DS2$V2
My issue arises when trying to sum "X_Parameter" and "V1" to make "absoluteX", and to sum "Y_Parameter" and "V2" to make "absoluteY", for each row.
Below are the two ways I have tried, with the errors:
DS1$absoluteX<-DS1$X_Parameter+DS1$V1
error
In Ops.factor(DS1$X_Parameter, DS1$V1) : ‘+’ not meaningful for factors
other try was
DS1$absoluteX<-rowSums(DS1[,c("X_Parameter","V1")])
error
Error in rowSums(DS1[, c("X_Parameter", "V1")]) : 'x' must be numeric
I have tried using
as.numeric(DS1$V1)
that causes all values to become 1
Any thoughts? Thanks

You can extract the numbers from a vector of file names as follows (not sure it is the shortest possible code, but it seems to work)
fnams<-c("file1[1000,2000].txt","file1[1500,2500].txt")
opsqbr<-regexpr("\\[",fnams)
comm<-regexpr(",",fnams)
clsqbr<-regexpr("\\]",fnams)
reslt<-data.frame(col1=as.numeric(substring(fnams,opsqbr+1,comm-1)),
col2=as.numeric(substring(fnams,comm+1,clsqbr-1)))
reslt
Which yields
col1 col2
1 1000 2000
2 1500 2500
Once you have this data frame, it is easy to read the files one by one and do the addition.
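The reading and adding step is not spelled out in this answer, so here is one way it could look. It is a sketch under assumptions: the folder, the two demo files and the column names column1 to column4 are invented to match the example in the question, and regmatches with regexec is used in place of the regexpr calls above.

```r
# Hypothetical demo folder holding two small tab-separated files
demo_dir <- file.path(tempdir(), "bracket_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo <- data.frame(column1 = c(1, 2), column2 = c(2, 4))
for (f in c("file1[1000,2000].txt", "file1[1500,2500].txt")) {
  write.table(demo, file.path(demo_dir, f), sep = "\t",
              row.names = FALSE, quote = FALSE)
}

fnams <- list.files(demo_dir, pattern = "\\.txt$")
# Each element holds the full match followed by the two captured numbers
nums <- regmatches(fnams, regexec("\\[([0-9]+),([0-9]+)\\]", fnams))

results <- lapply(seq_along(fnams), function(i) {
  d <- read.delim(file.path(demo_dir, fnams[i]))
  d$column3 <- d$column1 + as.numeric(nums[[i]][2])  # first number in the name
  d$column4 <- d$column2 + as.numeric(nums[[i]][3])  # second number in the name
  d
})
names(results) <- fnams
```

Keeping the data frames in a named list, rather than creating one variable per file with assign, makes it easier to loop over them afterwards.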

## set path to wherever your files are
setwd("path")
## make a vector with names of your files
txt_names <- list.files(pattern = "\\.txt$") # use this to make a complete list of names
## read your files in
for (i in 1:length(txt_names)) assign(txt_names[i], read.csv(txt_names[i], sep = "whatever your separator is"))
## for now I'm making a dummy vector and data frame
txt_names <- c("[1000,2000]")
ds1 <- data.frame(column1 = c(1,2), column2 = c(2,4))
## grab the text you require from the file names
require(stringr)
remove_text <- str_extract(txt_names, pattern = "\\[[0-9,0-9]+\\]")
step1 <- gsub("(\\[)", "", remove_text)
step2 <- gsub("(\\])", "", step1)
## step2 should look like this
step2
# [1] "1000,2000"
## split each string and convert to data frame with two columns
ds2 <- as.data.frame(do.call("rbind", (str_split(step2, ","))))
## cbind with the file
df <- cbind(ds1, ds2)
## coerce factor columns to numeric
df$V1 <- as.numeric(as.character(df$V1))
df$V2 <- as.numeric(as.character(df$V2))
## perform the operation to change the columns
df$V1 <- df$column1 + df$V1
df$V2 <- df$column2 + df$V2
Now you have a data.frame with two columns, each containing the file name parts you need. Just repeat them (with rep) to match the number of rows of each of your data frames, and cbind.
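The coercion through as.character matters because as.numeric applied directly to a factor returns the internal level codes, which is why the values in the question all became 1. A small sketch with made-up values (v and x_parameter are illustrative names):

```r
# A factor with a single level, as happens when every row gets the same value
v <- factor(c("64084", "64084", "64084"))

as.numeric(v)                # level codes, not the numbers
as.numeric(as.character(v))  # the numbers themselves

# With the values converted properly, the row-wise sum works
x_parameter <- c(1, 2, 3)
absoluteX <- x_parameter + as.numeric(as.character(v))
```

Reading or building the data with stringsAsFactors = FALSE avoids the factor in the first place, after which a plain as.numeric is enough.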

Use a vector/index as a row name in a dataframe using rbind

I think I'm missing something super simple, but I seem to be unable to find a solution directly relating to what I need: I've got a data frame that has a letter as the row name and two columns of numerical values. As part of a loop I'm running, I create a new vector (from an index) that has both a letter and a number (e.g. "f2"), which I then need to be the name of a new row, and then add two numbers next to it (based on some other section of code, but I'm fine with that). What I get instead is the name of the vector/index as the row name, and I'm not sure if I'm missing a feature of rbind or something else to make it easy.
Example code:
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
rownames(data.frame) <- row.names
data.frame
index.vector <- "f2"
#what I want the data frame to look like with the new row
data.frame <- rbind(data.frame, "f2" = c(6,11))
data.frame
#what the data frame looks like when I attempt to use a vector as a row name
data.frame <- rbind(data.frame, index.vector = c(6,11))
data.frame
#"why" I can't just type "f" every time
index.vector2 = paste(index.vector, "2", sep="")
data.frame <- rbind(data.frame, index.vector2 = c(6,11))
data.frame
In my loop the "index.vector" is a random sample, hence where I can't just write the letter/number in as a row name, so need to be able to create the row name from a vector or from the index of the sample.
The loop runs and a random number of new rows will be created, so I can't specify what number the row is that needs a new name - unless there's a way to just do it for the newest or bottom row every time.
Any help would be appreciated!

Not elegant, but works:
new_row <- data.frame(setNames(list(6, 11), colnames(data.frame)), row.names = paste(index.vector, "2", sep=""))
data.frame <- rbind(data.frame, new_row)
data.frame
# vector.1 vector.2
# a 1 2
# b 2 3
# c 3 4
# d 4 5
# e 5 6
# f22 6 11

I understood the problem but was not able to resolve the issue directly, so I am suggesting an alternative way to achieve the same result.
Alternative solution: collect your row labels alongside the data binding in your loop, and then assign the row names to your data frame at the end.
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
#loop starts
index.vector <- "f2"
data.frame <- rbind(data.frame,c(6,11))
row.names<-append(row.names,index.vector)
#loop ends
rownames(data.frame) <- row.names
data.frame
output:
vector.1 vector.2
a 1 2
b 2 3
c 3 4
d 4 5
e 5 6
f2 6 11
Hope this would be helpful.

If you manipulate the data frame with rbind, then the newest elements will always be at the "bottom" of your data frame. Hence you could also set a single row name by
rownames(data.frame)[nrow(data.frame)] = "new_name"
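Applied to the example from the question, this could look as follows. A minimal sketch: the object is called df here (instead of data.frame) to avoid masking the function, and df1 and df2 are illustrative names. The second option, assigning to a row name that does not exist yet, is an alternative that is not mentioned in the answers above.

```r
df <- data.frame(vector.1 = 1:5, vector.2 = 2:6, row.names = letters[1:5])
index.vector <- "f2"

# Option 1: bind the new row, then rename the last row from the variable
df1 <- rbind(df, c(6, 11))
rownames(df1)[nrow(df1)] <- index.vector

# Option 2: assign to a row name that does not exist yet, which adds the row
df2 <- df
df2[index.vector, ] <- c(6, 11)
```

Both variants take the row name from the value of index.vector, so they work inside a loop where that value changes on every iteration.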

how can i read a csv file containing some additional text data

I need to read a csv file in R, but the file contains text information in some rows instead of comma-separated values, so I cannot read that file using read.csv(fileName).
The content of the file is as follows:
name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz
I need to store only the values belonging to each name,date pair as a data frame. How can I read the file to do that?
Actually my required output is
>dataFrame1
abc,2,saa
anan,3,ds
ama,ds,az
>dataFrame2
snans,32,asa
asa,2,saz

You can read the data with scan and use grep and sub functions to extract the important values.
The text:
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"
These commands generate a data frame with name and date values.
# read the text
lines <- scan(text = text, what = character())
# find strings starting with 'name' or 'date'
nameDate <- grep("^name|^date", lines, value = TRUE)
# extract the values
values <- sub("^name:|^date:", "", nameDate)
# create a data frame
dat <- as.data.frame(matrix(values, ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("name", "date"))))
The result:
> dat
name date
1 russel 21-2-1991
2 rus 23-3-1998
Update
To extract the values from the strings, which do not contain name and date information, the following commands can be used:
# read data
lines <- readLines(textConnection(text))
# split lines
splitted <- strsplit(lines, ",")
# find positions of 'name' lines
idx <- grep("^name", lines)[-1]
# create grouping variable
grp <- cut(seq_along(lines), c(0, idx, length(lines)))
# extract values
values <- tapply(splitted, grp, FUN = function(x)
lapply(x, function(y)
if (length(y) == 3) y))
# create a list of data frames
dat <- lapply(values, function(x) as.data.frame(matrix(unlist(x),
ncol = 3, byrow = TRUE)))
The result:
> dat
$`(0,6]`
V1 V2 V3
1 abc 2 saa
2 anan 3 ds
3 ama ds az
$`(6,8]`
V1 V2 V3
1 snans 32 asa
2 asa 2 saz

I would first read the entire file as a character vector, i.e. one string for each line in the file; this can be done using readLines. Next you have to find the places where the data for a new date starts, i.e. look for the ",," separator lines; see grep for that. Then take the first entry of each data block, e.g. using str_extract from the stringr package. Finally, you need to split all the remaining data strings; see strsplit for that.
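The steps described above can be sketched in base R as follows. This is one possible reading of the description and not a definitive implementation: the lines are typed in directly instead of coming from readLines, read.csv(text = ...) stands in for strsplit, and the names is_header, block, keep and frames are made up.

```r
# In real use: lines <- readLines("yourfile.csv")
lines <- c("name:russel date:21-2-1991",
           "abc,2,saa", "anan,3,ds", "ama,ds,az",
           ",,",
           "name:rus date:23-3-1998",
           "snans,32,asa", "asa,2,saz")

# Header lines start a new block; cumsum gives each line its block number
is_header <- grepl("^name:", lines)
block <- cumsum(is_header)

# Drop header lines, the ",," separators and any empty lines
keep <- !is_header & lines != ",," & nzchar(lines)

# Parse each block of data lines into its own data frame
frames <- lapply(split(lines[keep], block[keep]), function(x)
  read.csv(text = x, header = FALSE, stringsAsFactors = FALSE))

# Label each data frame with the name found on its header line
names(frames) <- sub("^name:([^ ]+).*", "\\1", lines[is_header])
```

The result is a list with one data frame per name, so frames$russel and frames$rus correspond to dataFrame1 and dataFrame2 in the question.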

R: losing column names when adding rows to an empty data frame

I am just starting with R and encountered a strange behaviour: when inserting the first row in an empty data frame, the original column names get lost.
example:
a<-data.frame(one = numeric(0), two = numeric(0))
a
#[1] one two
#<0 rows> (or 0-length row.names)
names(a)
#[1] "one" "two"
a<-rbind(a, c(5,6))
a
# X5 X6
#1 5 6
names(a)
#[1] "X5" "X6"
As you can see, the column names one and two were replaced by X5 and X6.
Could somebody please tell me why this happens and is there a right way to do this without losing column names?
A shotgun solution would be to save the names in an auxiliary vector and then add them back when finished working on the data frame.
Thanks
Context:
I created a function which gathers some data and adds them as a new row to a data frame received as a parameter.
I create the data frame, iterate through my data sources, passing the data.frame to each function call to be filled up with its results.

The rbind help page specifies that:
For ‘cbind’ (‘rbind’), vectors of zero
length (including ‘NULL’) are ignored
unless the result would have zero rows
(columns), for S compatibility.
(Zero-extent matrices do not occur in
S3 and are not ignored in R.)
So, in fact, a is ignored in your rbind instruction. Not totally ignored, it seems, because as it is a data frame the rbind function is called as rbind.data.frame :
rbind.data.frame(c(5,6))
# X5 X6
#1 5 6
Maybe one way to insert the row could be :
a[nrow(a)+1,] <- c(5,6)
a
# one two
#1 5 6
But there may be a better way to do it depending on your code.

I was close to giving up on this issue.
1) Create the data frame with stringsAsFactors set to FALSE, or you run straight into the next issue.
2) Don't use rbind; I have no idea why it messes up the column names. Simply do it this way:
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df <- data.frame(a = character(0), b=character(0), c=numeric(0))
df[nrow(df)+1,] <- c("d","gsgsgd",4)
#Warning messages:
#1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
# invalid factor level, NAs generated
#2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
# invalid factor level, NAs generated
df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)
df[nrow(df)+1,] <- c("d","gsgsgd",4)
df
# a b c
#1 d gsgsgd 4

Workaround would be:
a <- rbind(a, data.frame(one = 5, two = 6))
?rbind states that merging objects demands matching names:
It then takes the classes of the
columns from the first data frame, and
matches columns by name (rather than
by position)

FWIW, an alternative design might have your functions building vectors for the two columns, instead of rbinding to a data frame:
ones <- c()
twos <- c()
Modify the vectors in your functions:
ones <- append(ones, 5)
twos <- append(twos, 6)
Repeat as needed, then create your data.frame in one go:
a <- data.frame(one=ones, two=twos)
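A related pattern, which is not taken from the answers here, is to collect one small data frame per iteration in a list and bind them all once at the end. A minimal sketch; the loop body and the values are placeholders for whatever your functions compute.

```r
# Collect one single-row data frame per data source
rows <- vector("list", 3)
for (i in 1:3) {
  rows[[i]] <- data.frame(one = i, two = i^2)  # placeholder computation
}

# Bind everything in one go; the column names come from the pieces
a <- do.call(rbind, rows)
```

Because every piece already carries the names one and two, there is no empty data frame whose names could get lost.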

One way to make this work generically and with the least amount of re-typing the column names is the following. This method doesn't require hacking the NA or 0.
rs <- data.frame(i=numeric(), square=numeric(), cube=numeric())
for (i in 1:4) {
calc <- c(i, i^2, i^3)
# append calc to rs
names(calc) <- names(rs)
rs <- rbind(rs, as.list(calc))
}
rs will have the correct names
> rs
i square cube
1 1 1 1
2 2 4 8
3 3 9 27
4 4 16 64
>
Another way to do this more cleanly is to use data.table:
> df <- data.frame(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are messed up
X1 X2
1 1 2
> df <- data.table(a=numeric(0), b=numeric(0))
> rbind(df, list(1,2)) # column names are preserved
a b
1: 1 2
Notice that a data.table is also a data.frame.
> class(df)
"data.table" "data.frame"

You can do this:
Give one row to the initial data frame:
df=data.frame(matrix(nrow=1,ncol=length(newrow)))
Add your new row and take out the NAs:
newdf=na.omit(rbind(newrow,df))
But watch out that your newrow does not have NAs, or it will be removed too.
Cheers
Agus

I use the following solution to add a row to an empty data frame:
d_dataset <-
data.frame(
variable = character(),
before = numeric(),
after = numeric(),
stringsAsFactors = FALSE)
d_dataset <-
rbind(
d_dataset,
data.frame(
variable = "test",
before = 9,
after = 12,
stringsAsFactors = FALSE))
print(d_dataset)
variable before after
1 test 9 12
HTH.
Kind regards
Georg

Researching this venerable R annoyance brought me to this page. I wanted to add a bit more explanation to Georg's excellent answer (https://stackoverflow.com/a/41609844/2757825), which not only solves the problem raised by the OP (losing field names) but also prevents the unwanted conversion of all fields to factors. For me, those two problems go together. I wanted a solution in base R that doesn't involve writing extra code but preserves the two distinct operations: define the data frame, append the row(s)--which is what Georg's answer provides.
The first two examples below illustrate the problems and the third and fourth show Georg's solution.
Example 1: Append the new row as vector with rbind
Result: loses column names AND converts all variables to factors
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
c("Bob", 250)
)
my.df
X.Bob. X.250.
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ X.Bob.: Factor w/ 1 level "Bob": 1
$ X.250.: Factor w/ 1 level "250": 1
Example 2: Append the new row as a data frame inside rbind
Result: keeps column names but still converts character variables to factors.
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : Factor w/ 1 level "Bob": 1
$ score: num 250
Example 3: Append the new row inside rbind as a data frame, with stringsAsFactors=FALSE
Result: problem solved.
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(name="Bob", score=250, stringsAsFactors=FALSE)
)
my.df
name score
1 Bob 250
str(my.df)
'data.frame': 1 obs. of 2 variables:
$ name : chr "Bob"
$ score: num 250
Example 4: Like example 3, but adding multiple rows at once.
my.df <- data.frame(
name = character(0),
score = numeric(0),
stringsAsFactors=FALSE
)
my.df <- rbind(
my.df,
data.frame(
name=c("Bob", "Carol", "Ted"),
score=c(250, 124, 95),
stringsAsFactors=FALSE)
)
str(my.df)
'data.frame': 3 obs. of 2 variables:
$ name : chr "Bob" "Carol" "Ted"
$ score: num 250 124 95
my.df
name score
1 Bob 250
2 Carol 124
3 Ted 95

Instead of constructing the data.frame with numeric(0) I use as.numeric(0).
a<-data.frame(one=as.numeric(0), two=as.numeric(0))
This creates an extra initial row
a
# one two
#1 0 0
Bind the additional rows
a<-rbind(a,c(5,6))
a
# one two
#1 0 0
#2 5 6
Then use negative indexing to remove the first (bogus) row
a<-a[-1,]
a
# one two
#2 5 6
Note: it messes up the index (far left). I haven't figured out how to prevent that (anyone else?), but most of the time it probably doesn't matter.
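One way to reset the index after dropping the placeholder row is to clear the row names, which makes R renumber them from 1. A minimal sketch using the same data frame as above:

```r
a <- data.frame(one = as.numeric(0), two = as.numeric(0))
a <- rbind(a, c(5, 6))
a <- a[-1, ]          # remaining row is still labelled "2"

rownames(a) <- NULL   # reset to the default 1, 2, ... numbering
a
```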
