I need to read a CSV file in R, but some rows of the file contain text information instead of comma-separated values, so I cannot read it with read.csv(fileName).
The content of the file is as follows:
name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz
I need to store only the values for each name/date pair as a data frame. How can I read the file to do that?
Actually my required output is
>dataFrame1
abc,2,saa
anan,3,ds
ama,ds,az
>dataFrame2
snans,32,asa
asa,2,saz
You can read the data with scan and use grep and sub functions to extract the important values.
The text:
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"
These commands generate a data frame with name and date values.
# read the text
lines <- scan(text = text, what = character())
# find strings starting with 'name' or 'date'
nameDate <- grep("^name|^date", lines, value = TRUE)
# extract the values
values <- sub("^name:|^date:", "", nameDate)
# create a data frame
dat <- as.data.frame(matrix(values, ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("name", "date"))))
The result:
> dat
name date
1 russel 21-2-1991
2 rus 23-3-1998
Update
To extract the values from the strings, which do not contain name and date information, the following commands can be used:
# read data
lines <- readLines(textConnection(text))
# split lines
splitted <- strsplit(lines, ",")
# find positions of 'name' lines
idx <- grep("^name", lines)[-1]
# create grouping variable
grp <- cut(seq_along(lines), c(0, idx, length(lines)))
# extract values
values <- tapply(splitted, grp, FUN = function(x)
lapply(x, function(y)
if (length(y) == 3) y))
# create a list of data frames
dat <- lapply(values, function(x) as.data.frame(matrix(unlist(x),
ncol = 3, byrow = TRUE)))
The result:
> dat
$`(0,7]`
V1 V2 V3
1 abc 2 saa
2 anan 3 ds
3 ama ds az
$`(7,9]`
V1 V2 V3
1 snans 32 asa
2 asa 2 saz
I would first read the entire file as a character vector, i.e. one string per line in the file; this can be done using readLines. Next, you have to find the places where the data for a new date starts, i.e. look for the ,, separator lines (see grep for that). Then take the first entry of each data block, e.g. using str_extract from the stringr package. Finally, you need to split all the remaining data strings; see strsplit for that.
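The steps described above could be sketched roughly as follows. This is only one reading of the suggestion, not the answerer's own code: it locates each block by its name: header line rather than by the ,, separator, uses base R's sub in place of str_extract, and the variable names (is_header, block_id, blocks) are made up for illustration.

```r
# Sample input from the question; for a real file, pass its path to readLines() instead.
lines <- readLines(textConnection("name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"))

# Header lines mark the start of each block; cumsum() numbers the blocks
is_header <- grepl("^name:", lines)
block_id <- cumsum(is_header)

# Extract the name from each header line
block_names <- sub("^name:(\\S+).*$", "\\1", lines[is_header])

# Split every line on commas and keep only lines with exactly three fields
fields <- strsplit(lines, ",")
is_data <- lengths(fields) == 3

# Group the row indices of the data lines by block and build one data frame per block
idx <- split(which(is_data), block_id[is_data])
blocks <- lapply(idx, function(i) {
  as.data.frame(do.call(rbind, fields[i]), stringsAsFactors = FALSE)
})
names(blocks) <- block_names[as.integer(names(idx))]

blocks
```

Naming the list elements after the header lines makes it possible to look up a block as blocks[["russel"]]. All columns stay as character, since the sample mixes numbers and text in the same column.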
I have a set of data below and I would like to separate the first three characters from the bm_id column into a separate column with the rest of the characters in another column.
  bm_id
1 popCL20TE
2 agrST20
3 agrST20-09SE
I have tried using solutions to a similar question asked on stack, however I end up making extra empty columns with my data remaining together.
bm_id[c('species', 'id')] <- tstrsplit(bm_id$bm_id, '(?<=.{3})', perl = TRUE)
The same happens with this code:
bm_id2 <- tidyr::separate(bm_id, bm_id, into = c("species", "id"), sep = 3)
How about substr?
df <- data.frame(vec= c("popCL20TE", "agrST20"))
df$first3 <- substr(df$vec, 1, 3)
df$last <- substr(df$vec, 4, nchar(df$vec))
df
vec first3 last
1 popCL20TE pop CL20TE
2 agrST20 agr ST20
I have files that contain multiple rows. I want to add two new columns that I create by extracting variables from the filename and combining them with the existing columns.
For example, I have a bunch of files that are named something like this:
file1[1000,1001].txt
file1[2000,1001].txt
Between the [] there are always 2 numbers separated by a comma.
The file itself has multiple columns, for example column1 & column2.
For each file, I want to extract the 2 values in the name of the file and then use them as variables to make 2 new columns that use those variables to modify the values.
for example
file1[1000,2000]
the file contains two columns
column1 column2
1 2
2 4
I want at the end to add the first file name value to column 1 to create column3 and add the second file name value to column 2 to create column 4, ending up with something like this
column1 column2 column3 column4
1 2 1001 2002
2 4 1002 2004
Thanks for the help. I am almost there, with just a few more issues.
The original file has 2 columns, "X_Parameter" and "Y_Parameter"; the file name is "test(64084,4224).txt".
Your code works great at extracting the two values V1 "64084" and V2 "4224" from the file name. I then add these values to the original data set. This yields 4 columns: "X_Parameter" "Y_Parameter" "V1" "V2".
setwd("~/Desktop/txt/")
txt_names = list.files(pattern = ".txt")
for (i in 1:length(txt_names)){assign(txt_names[i], read.delim(txt_names[i]))}
DS1 <- read.delim(file = txt_names[i], header = TRUE, stringsAsFactors = TRUE)
require(stringr)
remove_text <- str_extract(txt_names, pattern = "\\[[0-9,0-9]+\\]")
step1 <- gsub("(\\[)", "", remove_text)
step2 <- gsub("(\\])", "", step1)
DS2<-as.data.frame(do.call("rbind", (str_split(step2, ","))))
DS1$V1<-DS2$V1
DS1$V2<-DS2$V2
My issue arises when trying to sum "X_Parameter" and "V1" to make "absoluteX", and sum "Y_Parameter" with "V2" to make "absoluteY", for each row.
Below are the two ways I have tried, with the errors:
DS1$absoluteX<-DS1$X_Parameter+DS1$V1
error
In Ops.factor(DS1$X_Parameter, DS1$V1) : ‘+’ not meaningful for factors
other try was
DS1$absoluteX<-rowSums(DS1[,c("X_Parameter","V1")])
error
Error in rowSums(DS1[, c("X_Parameter", "V1")]) : 'x' must be numeric
I have tried using
as.numeric(DS1$V1)
that causes all values to become 1
Any thoughts? Thanks
You can extract the numbers from a vector of file names as follows (not sure it is the shortest possible code, but it seems to work)
fnams<-c("file1[1000,2000].txt","file1[1500,2500].txt")
opsqbr<-regexpr("\\[",fnams)
comm<-regexpr(",",fnams)
clsqbr<-regexpr("\\]",fnams)
reslt<-data.frame(col1=as.numeric(substring(fnams,opsqbr+1,comm-1)),
col2=as.numeric(substring(fnams,comm+1,clsqbr-1)))
reslt
Which yields
col1 col2
1 1000 2000
2 1500 2500
Once you have this data frame, it is easy to read the files sequentially and do the addition.
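A minimal sketch of that last step, under some assumptions: the files are tab-separated with a header row and columns named column1 and column2 (as in the question's example), and the temporary folder, dummy files, and the names tmp, demo and results exist only so the example runs on its own.

```r
# Dummy files in a temporary folder, standing in for the real data
tmp <- file.path(tempdir(), "bracket_demo")
dir.create(tmp, showWarnings = FALSE)
demo <- data.frame(column1 = c(1, 2), column2 = c(2, 4))
for (nm in c("file1[1000,2000].txt", "file1[1500,2500].txt")) {
  write.table(demo, file.path(tmp, nm), sep = "\t", row.names = FALSE, quote = FALSE)
}

fnams <- list.files(tmp, pattern = "\\.txt$")

# Same extraction as above: positions of "[", "," and "]" in each file name
opsqbr <- regexpr("\\[", fnams)
comm   <- regexpr(",", fnams)
clsqbr <- regexpr("\\]", fnams)
reslt <- data.frame(col1 = as.numeric(substring(fnams, opsqbr + 1, comm - 1)),
                    col2 = as.numeric(substring(fnams, comm + 1, clsqbr - 1)))

# Read each file and add the matching file name values to its columns
results <- lapply(seq_along(fnams), function(i) {
  dat <- read.delim(file.path(tmp, fnams[i]))
  dat$column3 <- dat$column1 + reslt$col1[i]
  dat$column4 <- dat$column2 + reslt$col2[i]
  dat
})
names(results) <- fnams

results[["file1[1000,2000].txt"]]
```

Keeping the results in a named list, rather than creating one variable per file with assign, makes it easier to loop over them afterwards.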
## set path to wherever your files are
setwd("path")
## make a vector with names of your files
txt_names <- list.files(pattern = ".txt") # use this to make a complete list of names
## read your files in
for (i in 1:length(txt_names)) assign(txt_names[i], read.csv(txt_names[i], sep = "whatever your separator is"))
## for now I'm making a dummy vector and data frame
txt_names <- c("[1000,2000]")
ds1 <- data.frame(column1 = c(1,2), column2 = c(2,4))
## grab the text you require from the file names
require(stringr)
remove_text <- str_extract(txt_names, pattern = "\\[[0-9,0-9]+\\]")
step1 <- gsub("(\\[)", "", remove_text)
step2 <- gsub("(\\])", "", step1)
## step2 should look like this
> step2
[1] "1000,2000"
## split each string and convert to data frame with two columns
ds2 <- as.data.frame(do.call("rbind", (str_split(step2, ","))))
## cbind with the file
df <- cbind(ds1, ds2)
## coerce factor columns to numeric
df$V1 <- as.numeric(as.character(df$V1))
df$V2 <- as.numeric(as.character(df$V2))
## perform the operation to change the columns
df$V1 <- df$column1 + df$V1
df$V2 <- df$column2 + df$V2
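The as.character step above matters when the columns are factors, which is the case with stringsAsFactors = TRUE (and by default in R versions before 4.0). Calling as.numeric directly on a factor returns its internal level codes rather than the numbers shown, which would explain values that all turn into 1. A small illustration with made-up values (f, codes and vals are hypothetical names):

```r
# A factor whose only level is "64084"
f <- factor(c("64084", "64084", "64084"))

# as.numeric() on a factor gives the integer level codes
codes <- as.numeric(f)
codes   # 1 1 1

# Converting to character first recovers the actual numbers
vals <- as.numeric(as.character(f))
vals    # 64084 64084 64084
```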
Now you have a data.frame with two columns, each containing the file name parts you need. Just repeat them (using rep) to the number of rows of each of your data frames and cbind.
I have a large number of CSV files that look like this:
var val1 val2
a 2 1
b 2 2
c 3 3
d 9 2
e 1 1
I would like to:
1. Read them in
2. Take the top 3 from each CSV
3. Make a list of the variable names only (3 x number of files)
4. Keep only the unique names on the list
I think I have managed to get to point 3 by doing this:
csvList <- list.files(path = "mypath", pattern = "*.csv", full.names = T)
bla <- lapply(lapply(csvList, read.csv), function(x) x[order(x$val1, decreasing=T)[1:3], ])
lapply(bla,"[", , 1, drop=FALSE)
Now, I have a list of the top 3 variables in each CSV. However, I don't know how to convert this list to a string and keep only the unique values.
Any help is welcome.
Thank you!
The issue is in extracting the first columns of bla with drop=FALSE. This preserves each result as a one-column data frame (where each row has a name) instead of coercing it to its lowest dimension, which is a vector. Use drop=TRUE instead, and then unlist followed by unique, as @Frank suggests:
unique(unlist(lapply(bla,"[", , 1, drop=TRUE)))
As you know, drop=TRUE is the default, so you don't even have to include it.
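As a small self-contained check of this, here is a sketch with a made-up bla holding two "top 3" data frames. The values are invented, and the extraction is written as an explicit function(x) x[, 1], which should be equivalent to the "[" form above with the default drop=TRUE.

```r
# Two made-up "top 3" tables, standing in for the result of the lapply over the CSV files
bla <- list(
  data.frame(var = c("d", "c", "a"), val1 = c(9, 3, 2), stringsAsFactors = FALSE),
  data.frame(var = c("d", "b", "e"), val1 = c(7, 5, 4), stringsAsFactors = FALSE)
)

# x[, 1] drops to a plain vector, so unlist() gives one character vector
all_vars <- unlist(lapply(bla, function(x) x[, 1]))

# unique() keeps the first occurrence of each name, in order
top_vars <- unique(all_vars)
top_vars
```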
Update to new requirements in comments.
To keep the first two columns var and var1 and remove duplicates in var (keep only the unique vars), do the following:
## unlist each column in turn and form a data frame
res <- data.frame(lapply(c(1,2), function(x) unlist(lapply(bla,"[", , x))))
colnames(res) <- c("var","var1") ## restore the two column names
## remove duplicates
res <- res[!duplicated(res[,1]),]
Note that this will only keep the first row for each unique var. This is the definition of removing duplicates here.
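To make the "first row wins" behaviour concrete, here is a tiny example with invented values (res_demo and kept are hypothetical names):

```r
# "d" appears twice; only its first row should survive
res_demo <- data.frame(var = c("d", "c", "d", "b"),
                       var1 = c(9, 3, 7, 5),
                       stringsAsFactors = FALSE)

# duplicated() flags the second and later occurrences of each value
kept <- res_demo[!duplicated(res_demo[, 1]), ]
kept
```

The row with var "d" and var1 7 is dropped because an earlier row already has var "d".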
Hope this helps.
I have a folder with about 160 files that are formatted with three columns: onset time, variable1 'x', and variable 2 'y'. Onset is listed in R as a string, but it is a time variable which is Hour:Minute:Second:FractionalSecond. I need to remove the fractional second. If I could round that would be great, but it would be okay to just remove the fractional second using something like substr(file$onset,1,8).
My files are named in a format similar to File001 File002 File054 File1001
onset X Y
00:55:17:95 3 3
00:55:29:66 3 4
00:55:31:43 3 3
01:00:49:24 3 3
01:02:00:03
I am trying to use lapply. lapply seems simple, but I'm having a hard time figuring it out. The code written below returns an error that the final line doesn't have 3 elements. For my final output it is important that my last line only have the value for onset.
lapply(files, function(x) {
t <- read.table(x, header=T) # load file
t$onset<-substr(t$onset,1,8)
out <- function(t)
# write to file
write.table(out, "filepath", sep="\t", quote=F, row.names=F, col.names=T)
})
First create a data frame of all text files, then you can apply strptime and format functions for the same vector to remove the fractional second.
filelist <- list.files(pattern = "\\.txt")
alltxt.files <- list() # create a list to populate with table data (if you want to bind all the rows together)
count <- 1
for (file in filelist) {
dat <- read.table(file, header = TRUE, fill = TRUE) # fill = TRUE pads a short last line with NA instead of raising an error
alltxt.files[[count]] <- dat # create a list of data frames from the txt files
count <- count + 1
}
allfiles <- do.call(rbind.data.frame, alltxt.files)
allfiles$onset <- strptime(allfiles$onset,"%H:%M:%S")
allfiles$onset <- format(allfiles$onset,"%H:%M:%S")
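If each file should instead be processed and written back separately, as in the question, a sketch along the following lines may be closer. It is an assumption-laden example rather than a tested solution for the real data: the folders in_dir and out_dir and the single dummy file are invented so the code runs on its own, and it truncates the fractional part with substr rather than rounding.

```r
# One small example file in a temporary folder (stand-in for the real files)
in_dir  <- file.path(tempdir(), "onset_in")
out_dir <- file.path(tempdir(), "onset_out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
writeLines(c("onset X Y",
             "00:55:17:95 3 3",
             "00:55:29:66 3 4",
             "01:02:00:03"),
           file.path(in_dir, "File001.txt"))

files <- list.files(in_dir, full.names = TRUE)

invisible(lapply(files, function(f) {
  # fill = TRUE pads the short last line with NA instead of raising an error
  dat <- read.table(f, header = TRUE, fill = TRUE, stringsAsFactors = FALSE)
  # keep only HH:MM:SS
  dat$onset <- substr(dat$onset, 1, 8)
  # na = "" writes the padded cells as empty fields
  write.table(dat, file.path(out_dir, basename(f)), sep = "\t",
              quote = FALSE, row.names = FALSE, na = "")
}))

readLines(file.path(out_dir, "File001.txt"))
```

Note that the last line of each output file still carries the tab separators for the empty X and Y fields; if it must contain the onset value alone, the trailing tabs would need to be stripped in a further step.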
I have a variable in my R data frame that has 18 characters. When I use write.csv(out2, file="ddd.csv", row.names=FALSE), I get this variable's values in scientific format. When I export it as txt, it keeps exactly the structure I want, but I need it in CSV format. What can I do to maintain the exact format of my R data frame when I export it to CSV?
Thank you,
Ron
R will write that column as a number if it thinks that it is a number, rather than a categorical variable. Compare, for example, the columns in this dataset
n <- 5
ids <- replicate(
n,
paste0(
sample(0:9, 18, replace = TRUE),
collapse = ""
)
)
out2 <- data.frame(
CategoricalId = factor(ids),
NumericId = as.numeric(ids)
)
out2
## CategoricalId NumericId
## 1 097572748411056439 9.757275e+16
## 2 455782786931417422 4.557828e+17
## 3 046986020739330140 4.698602e+16
## 4 384292451744509872 3.842925e+17
## 5 787170367185951077 7.871704e+17
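One way to check that the identifiers survive the export is to keep them as character (or factor), write the CSV, and read it back as text. The sketch below uses two of the ids printed above and a temporary file; csv_file and back are illustrative names. It only verifies the file contents, so a spreadsheet program may still apply its own number formatting when it opens the file.

```r
# Keep the 18-digit ids as character strings, not numbers
ids <- c("097572748411056439", "455782786931417422")
out2_chr <- data.frame(id = ids, stringsAsFactors = FALSE)

csv_file <- tempfile(fileext = ".csv")
write.csv(out2_chr, file = csv_file, row.names = FALSE)

# Inspect the raw file: character values are written as quoted strings
readLines(csv_file)

# Read back as text so leading zeros and all digits are preserved
back <- read.csv(csv_file, colClasses = "character")
identical(back$id, ids)
```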
The Excel number formatting dialog options, with output: