How can I read a double-semicolon-separated .txt in r? - r

I have this problem, but in R:
How can I read a double-semicolon-separated .csv with quoted values using pandas?
The solution there is to drop the additional columns that get generated. I'd like to know if there's a way to read a file separated by ;; without generating those additional columns.
Thanks!

Read it in normally using read.csv2 (or whichever variant you prefer, including read.table, read.delim, readr::read_csv2, data.table::fread, etc), and then remove the even-numbered columns.
dat <- read.csv2(text = "a;;b;;c;;d\n1;;2;;3;;4")
dat
# a X b X.1 c X.2 d
# 1 1 NA 2 NA 3 NA 4
dat[,-seq(2, ncol(dat), by = 2)]
# a b c d
# 1 1 2 3 4

It is usually recommended to clean your data properly before attempting to parse it, instead of cleaning it WHILE parsing or, worse, AFTER. Either use Notepad++ to replace all occurrences of ;; or do it in R itself, but do not delete the original files (as a rule of thumb, never delete sources of data).
my.text <- readLines('d:/tmp/readdelim-r.csv')
cleaned <- gsub(';;', ';', my.text)
writeLines(cleaned, 'd:/tmp/cleaned.csv')
my.cleaned <- read.delim('d:/tmp/cleaned.csv', header=FALSE, sep=';')
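If an intermediate file is not wanted, the same replacement can be done in memory. This is a minimal sketch, assuming that no quoted value itself contains ;; (the sample lines are made up for illustration):

```r
# Hypothetical sample; with a real file use raw_lines <- readLines("your_file.txt")
raw_lines <- c("a;;b;;c;;d", "1;;2;;3;;4")

# fixed = TRUE treats ";;" as a literal string rather than a regular expression
single_sep <- gsub(";;", ";", raw_lines, fixed = TRUE)

# read.csv2 uses ";" as the separator; text= parses the lines without touching disk
dat <- read.csv2(text = single_sep)
dat
```

The original file is never modified, which fits the advice about keeping sources of data intact.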

Related

Skipping rows gets rid of necessary colnames?

I have a data frame with some metadata in the first 3 rows that I need to skip. But doing so also affects the colnames of the value columns.
What can I do to avoid opening every CSV in Excel and deleting these rows manually?
This is how the CSV looks when opened in Excel:
In R, I'm using this command to open it:
android_per <- fread("...\\Todas las adquisiciones de dispositivos de Versión de Android PE.csv",
                     skip = 3)
And it looks like this:
UPDATE 1:
Similar logic to @G5W, but I think there needs to be a step that squashes the header, which is spread over 2 rows, back into one. E.g.:
txt <- "Some, utter, rubbish,,
Even more rubbish,,,,
,,Col_3,Col_4,Col_5
Col_1,Col_2,,,
1,2,3,4,5
6,7,8,9,0"
## below line writes a file - uncomment if you're happy to do so
##cat(txt, file="testfile.csv", "\n")
header <- apply(read.csv("testfile.csv", nrows=2, skip=2, header=FALSE),
2, paste, collapse="")
read.csv("testfile.csv", skip=4, col.names=header, header=FALSE)
Output:
# Col_1 Col_2 Col_3 Col_4 Col_5
#1 1 2 3 4 5
#2 6 7 8 9 0
Here is one way to do it. Read the file simply as lines of text. Eliminate the lines that you don't want, then read the remaining good part into a data.frame.
Sample csv file (I saved it as "Temp/Temp.csv")
Col_1,Col_2,Col_3,Col_4,Col_5
Some utter rubbish,,,,
Presumably documentation,,,,
1,2,3,4,5
6,7,8,9,0
Code
CSV_Lines = readLines("Temp/Temp.csv")
CSV_Lines = CSV_Lines[-(2:3)]
DF = read.csv(text=CSV_Lines)
Col_1 Col_2 Col_3 Col_4 Col_5
1 1 2 3 4 5
2 6 7 8 9 0
It skipped the unwanted lines and got the column names.
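The same idea can be wrapped in a small function so that it can be applied to many files. This is only a sketch: the function name and its arguments are hypothetical, and it assumes the header is on one line and the data start on a known later line.

```r
# Hypothetical helper: keep the header line, drop the metadata lines after it
read_csv_keep_header <- function(path, header_line = 1, data_start = 4) {
  lines <- readLines(path)
  keep <- c(header_line, seq(data_start, length(lines)))
  read.csv(text = lines[keep])
}

# Demo on a temporary file with the same layout as the sample file above
tmp <- tempfile(fileext = ".csv")
writeLines(c("Col_1,Col_2,Col_3,Col_4,Col_5",
             "Some utter rubbish,,,,",
             "Presumably documentation,,,,",
             "1,2,3,4,5",
             "6,7,8,9,0"), tmp)
DF2 <- read_csv_keep_header(tmp)
DF2
```

For a whole folder, something like lapply(list.files(pattern = "\\.csv$"), read_csv_keep_header) should work, provided every file has the same layout.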
If you use skip = 3, you definitely lose the column names, with no option to get them back in R. An ugly hack is to use skip = 2, which ensures that the names of all columns except the first 2 are correct.
df <- read.table('csv_name.csv', skip = 2, header = TRUE, sep = ',')
The headers of the first 2 columns are in the first data row, so you can do
names(df)[1:2] <- as.character(unlist(df[1, 1:2]))
You probably also need to shift all the rows 1 step up (i.e. drop that first row) to get the data frame as intended.
If you set header to FALSE instead, you can use the code below:
df<-fread("~/Book1.csv", header = F, skip = 2)
shift_up <- function(x, n){
c(x[-(seq(n))], rep(NA, n))
}
df[1,1]<-df[2,1]
df[1,2]<-df[2,2]
df<-df[-2,]
names(df)<-as.character(df[1,])
df<-df[-1,]

Problems with column names in R

I am new to R. I am trying to use the write.csv command to write a CSV file in R. Unfortunately, when I read the file back in, the resulting data frame has colnames with a prefix X in them, even though the file already has column names.
It produces X_name1, X_name2.
Please kindly share your suggestions.
I have added an example code similar to my data.
a<- c("1","2")
b <- c("3","4")
df <- rbind(a,b)
df <- as.data.frame(df)
names(df) <- c("11_a","12_b")
write.csv(df,"mydf.csv")
a <- read.csv("mydf.csv")
a
#Result
X X11_a X12_b
1 a 1 2
2 b 3 4
All I need is to have only "11_a" and "12_b" as column names, but it includes the prefix X as well.
Use check.names=FALSE when reading your data back in - names starting with numbers are not generally acceptable in R:
read.csv(text="11_a,12_b
a,1,2
b,3,4", check.names=FALSE)
# 11_a 12_b
#a 1 2
#b 3 4
read.csv(text="11_a,12_b
a,1,2
b,3,4", check.names=TRUE)
# X11_a X12_b
#a 1 2
#b 3 4
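For the full round trip from the question, a sketch along these lines may help. It assumes the row names are not needed in the file and uses a temporary file as a stand-in for mydf.csv:

```r
# Same values as in the question, built directly as a data frame
df <- data.frame(`11_a` = c("1", "3"), `12_b` = c("2", "4"),
                 check.names = FALSE)

tmp <- tempfile(fileext = ".csv")
# row.names = FALSE avoids the extra unnamed first column (shown as "X" on re-read)
write.csv(df, tmp, row.names = FALSE)

# check.names = FALSE keeps the names exactly as they are in the file
back <- read.csv(tmp, check.names = FALSE)
names(back)

# This is the repair that check.names = TRUE would apply to the names
make.names(c("11_a", "12_b"))
```

Names such as 11_a then have to be quoted with backticks when used with $, e.g. back$`11_a`.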
All you have to do is add header=TRUE to your code when you read in the .csv file. It would look like:
a <- read.csv("mydf.csv", header=TRUE)

Using R to list and mark multiple csv files with characters from the title of those files, and put those in a dataframe

I have a large number of files that are all numbered and labeled from a CTD cast. These files all contain 3 columns, for bottle number fired, Depth, and Conductivity, and 3 rows, one for each water bottle fired.
1,68.93,0.2123
2,14.28,0.3139
3,8.683,0.3547
These files are named after the cast number, as in "OS1505xxx.csv", where xxx is the cast number. I would like to take the data from multiple casts, label the data with the cast number (which I presume would go in another column for each bottle sample), and then merge that data together in one data frame.
1,68.93,0.2123,001
2,14.28,0.3139,001
3,8.683,0.3547,001
1,109.5,0.2062,002
2,27.98,0.4842,002
3,5.277,0.3705,002
One other thing: some files only have 1 or 2 bottles fired, while others have 4 bottles fired. I tried finding files with only 3 rows, making a list of the filenames repeated three times, and then merging that with the bound CSV files that had three rows into a data frame, but I am very new to R and couldn't figure it out. Any help is appreciated.
This gets all of them into one data frame in order (001-100), and from there you can export it however you want.
# Start from an empty data frame so no all-NA placeholder row ends up in the result
df <- data.frame(V1 = numeric(0), V2 = numeric(0), V3 = numeric(0), file = character(0))
for (i in 1:100) {
  # Build the zero-padded file name, e.g. "OS1505001.csv"
  file_name <- paste0("OS1505", sprintf("%03d", i), ".csv")
  if (file.exists(file_name)) {
    print("match found")
    df_tmp <- read.csv(file_name, header = FALSE, sep = ",", fill = TRUE)
    df_tmp$file <- sprintf("%03d", i)
    df <- rbind(df, df_tmp)
  }
}
Try this:
files <- list.files(pattern="OS1505")
lst <- lapply(files, read.csv)
ids <- substr(files, 7,9)
for(i in 1:length(lst)) lst[[i]][,4] <- ids[i]
do.call(rbind, lst)
# X V1 V2 V3
#1 1 1 68.930 001
#2 2 2 14.280 001
#3 3 3 8.683 001
#4 1 1 109.500 002
#5 2 2 27.980 002
#6 3 3 5.277 002
We start by first creating two dummy files to try and save them as csv files to test. I named them in a way to match your files. (i.e. "OS1505001.csv"):
file1 <- read.table(text="
1,68.93,0.2123
2,14.28,0.3139
3,8.683,0.3547", sep=',')
file2 <- read.table(text="
1,109.5,0.2062
2,27.98,0.4842
3,5.277,0.3705", sep=',')
write.csv(file1, "OS1505001.csv")
write.csv(file2, "OS1505002.csv")
Going through the code, files checks the directory for any files that have OS1505 in them. There are two files that match that description "OS1505001.csv" "OS1505002.csv". We bring those two files into R with read.csv. It is wrapped in lapply so that the process can happen to all of the files in the files vector at once and saved in a list called lst. Now ids is a way to grab the id numbers from the file names. In a for loop we assign each id to the 4th column of the data frames. Lastly, do.call brings it all together with the rbind function.
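If the real files have no header row and no row-name column, as in the question, a variant along these lines might be closer to what is needed. It is a sketch only: the column names bottle, depth, conductivity and cast are made up, and the demo files are written to a temporary directory.

```r
# Create two header-less demo files with different numbers of bottles
demo_dir <- file.path(tempdir(), "ctd_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("1,68.93,0.2123", "2,14.28,0.3139", "3,8.683,0.3547"),
           file.path(demo_dir, "OS1505001.csv"))
writeLines(c("1,109.5,0.2062", "2,27.98,0.4842"),
           file.path(demo_dir, "OS1505002.csv"))

files <- list.files(demo_dir, pattern = "^OS1505\\d{3}\\.csv$", full.names = TRUE)

# Read one file and tag every row with the cast number taken from the file name
read_cast <- function(path) {
  d <- read.csv(path, header = FALSE,
                col.names = c("bottle", "depth", "conductivity"))
  d$cast <- sub("^OS1505(\\d{3})\\.csv$", "\\1", basename(path))
  d
}

casts <- do.call(rbind, lapply(files, read_cast))
casts
```

Because the cast number is added per file before binding, it does not matter whether a file has 1, 2, 3 or 4 rows.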

Looping a series of variables in R while doing a cbind()

I have already read a dataset into R through read.csv and, after doing some calculations, created the following 10 variables with similar names: PI_1, PI_2, ..., PI_10.
Now I combine the newly formed variables with my existing dataset (TempData).
x<-cbind(TempData,PI_1,PI_2,PI_3,PI_4,PI_5,PI_6,PI_7,PI_8,PI_9,PI_10)
Is there any smarter way of doing this (maybe with a loop)? Any help is greatly appreciated.
Assuming that the files are in the working directory and all of them start with PI_ followed by some digits (\\d+), we can use list.files with the pattern argument in case there are also other files in the directory. To check the working directory, use getwd().
files <- list.files(pattern='^PI_\\d+')
This will give the file names. Now we can use lapply to read those files into a list using read.table. Once we are done with that part, use do.call(cbind, ...) to bind all the dataset columns together.
res <- do.call(cbind,lapply(files, function(x)
read.table(x, header=TRUE)))
Update
I guess you need to create 10 variables based on some PI. In the code that was provided in the comments, PI seems to be an object with some value inside it. Since it is not clear what that value is, I am using the string 'PI' as the value here. I created a dummy dataset.
TempData[paste0('PI_', 1:10)] <- Map(function(x,y) c('', 'PI')[(x==y)+1],
1:10, list(TempData$Concept))
head(TempData,3)
# Concept Val PI_1 PI_2 PI_3 PI_4 PI_5 PI_6 PI_7 PI_8 PI_9 PI_10
#1 10 -0.4304691 PI
#2 10 -0.2572694 PI
#3 3 -1.7631631 PI
You could use write.table to save the results
data
set.seed(42)
dat <- data.frame(Concept=sample(1:10,50, replace=TRUE), Val=rnorm(50))
write.csv(dat, 'Concept.csv', row.names=FALSE,quote=FALSE)
TempData <- read.csv('Concept.csv')
str(TempData)
#'data.frame': 50 obs. of 2 variables:
# $ Concept: int 10 10 3 9 7 6 8 2 7 8 ...
# $ Val : num -0.43 -0.257 -1.763 0.46 -0.64 ...
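If PI_1 to PI_10 already exist as objects in the workspace, as the question describes, they can be collected by name instead of being typed out. A minimal sketch, with a made-up TempData and stand-in vectors created only for the demo:

```r
# Stand-ins for the existing data and the ten computed variables
TempData <- data.frame(id = 1:3)
for (k in 1:10) assign(paste0("PI_", k), k * (1:3))

# mget() looks the objects up by name and returns them as a named list
pi_names <- paste0("PI_", 1:10)
x <- do.call(cbind, c(list(TempData), mget(pi_names)))
names(x)
```

Storing the ten results in a list or directly as columns of TempData from the start would avoid the separate variables altogether.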

Read data with space character in R

Usually, read.table solves many of my data input problems, like this one:
China 2 3
USA 1 4
Sometimes, though, the data can be maddening, like:
Chia 2 3
United States 3 4
In that case read.table does not work; any assistance is appreciated.
P.S. The data file is in .dat format.
First set up some test data:
# create test data
cat("Chia 2 3
United States 3 4
", file = "file.with.spaces.txt")
1) Using the above, read in the data, insert commas between fields and re-read:
L <- readLines("file.with.spaces.txt")
L2 <- sub("^(.*) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L) # 1
DF <- read.table(text = L2, sep = ",")
giving:
> DF
V1 V2 V3
1 Chia 2 3
2 United States 3 4
2) Another approach. Using L from above, replace the last string of spaces with comma twice (since there are three fields):
L2 <- L
for(i in 1:2) L2 <- sub(" +(\\S+)$", ",\\1", L2) # 2
DF <- read.table(text = L2, sep = ",")
If the column separator sep is indeed whitespace, read.table logically cannot differentiate between spaces within a name and spaces that actually separate columns. I'd suggest changing your country names to single strings, i.e., strings without spaces. Alternatively, use semicolons to separate your data columns and use:
data <- read.table("foo.dat", sep = ";")
If you have many rows in your .dat file, you can consider using regular expressions to find spaces between the columns and replace them with semicolons.
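One way such a replacement could look is sketched below. It assumes that every line ends in exactly two space-free fields and that the names contain no semicolons; the sample lines are the ones from the question.

```r
lines <- trimws(c("Chia 2 3", "United States 3 4"))

# Replace a run of spaces only when at most two space-free fields follow it
# up to the end of the line (perl = TRUE is needed for the lookahead)
with_sep <- gsub(" +(?=\\S+( +\\S+)?$)", ";", lines, perl = TRUE)

DF <- read.table(text = with_sep, sep = ";", stringsAsFactors = FALSE)
DF
```

For a different number of trailing columns the lookahead would have to be adjusted accordingly.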
