R - extract data which changes position from file to file (txt) - r

I have a folder with a large number of txt files from which I have to extract specific data. The problem is that the file format changed at one point, and the position of the data I need changed with it, so I have to deal with files in different formats.
To make it clearer: column 4 holds the name of the variable and column 5 holds its value, but the row they appear in differs between files. Is there a way to find the row that contains the variable name and then extract its value?
Thanks in advance
EDIT
In some files I will have the data like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Current--------28
But at some point the software was changed to add another variable, and the new files look like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
So I need to deal with these 2 types of data, extracting the same variables which are in different rows.

If these files can't be read with read.table, use readLines and then find the lines that start with the keyword you need.
For example:
Sample file 1 (with the dashes included and extra line breaks):
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
Sample file2 (with a comma as separator):
Column 1,Column 2.
Device ID,A.
Current,555
Voltage, 500.
Error,5.
For both cases do:
# Pass the path directly so that no connection is left open
text <- readLines("your filename here")
curr <- text[grepl("^Current", text, ignore.case = TRUE)]
Which returns:
for file 1:
[1] "Current--------28"
for file 2:
[1] "Current,555"
Then use gsub to remove anything that is not a number.
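As a minimal sketch of that last step, the keyword lookup and the gsub clean-up can be combined in one small function. The name extract_value and the in-memory sample vectors are assumptions made for illustration; with real files the lines would come from readLines. Note that stripping every non-digit also drops decimal points and minus signs, so the pattern would need adjusting for such values.

```r
# Sample lines mirroring the two layouts shown above
file1 <- c("Column 1-------Column 2.", "Device ID------A.",
           "Voltage------- 500.", "Error------------5.", "Current--------28")
file2 <- c("Column 1,Column 2.", "Device ID,A.", "Current,555",
           "Voltage, 500.", "Error,5.")

# extract_value() is a hypothetical helper, not part of any package.
# It finds the first line starting with `keyword` and keeps only its digits.
extract_value <- function(lines, keyword) {
  hit <- grep(paste0("^", keyword), lines, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_real_)   # keyword not present in this file
  as.numeric(gsub("[^0-9]", "", hit[1]))   # drop everything that is not a digit
}

extract_value(file1, "Current")
extract_value(file2, "Current")
extract_value(file1, "Voltage")
```

Because the lookup is by keyword rather than by row number, the same call works whether or not the extra Error row is present.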

Related

R: Why am I getting an extra column titled "X.1" in my dataframe after reading my .txt file?

I have got this .txt file output by a microscope to process.
#read the .txt file generated by microscope, skipping the first 9 lines of garbage information
df <- read.csv("Objects_Population - AllCells.txt", sep="\t", skip = 9, header=TRUE, fill = TRUE)
Then I started looking at the structure of the data frame. Everything seems fine, except that I found an extra column at the end of the data frame named "X.1", and all rows of it are NA. I don't see this column when I open the .txt file in Excel. I suspect the problem has something to do with the column names generated by the microscope, which contain quite a few special characters.
Below is the data as read by Excel (only showing the last 2 columns, since I have 132 columns and their names are disgustingly long):
AllCells - Cell Contact Area with Neighbors [%]    AllCells - Nucleus Nearest Neighbor Distance [µm]
0                                                  4.82083
21.9512                                            0
15.7895                                            0
29.4118                                            0.584611
0                                                  4.21569
0                                                  1.99599
0                                                  3.50767
...
This has happened to me before, but I never took it too seriously because I was always interested in only a subset of my data frame. Now that I'm looking at all columns, it is starting to bother me.
Is there any way I can read the file correctly without R attaching that additional "X.1" column at the end? Preferably without manually deleting or subsetting out the last column...
Cheers,
ML
If all other column names are correct, you probably have a trailing \t in the text file. R reads it as one more (empty) column and gives it the generic column name X.1.
You could read the file as plain text first, remove the trailing \t, and only then use read.csv:
file_connection <- file("Objects_Population - AllCells.txt")
content <- readLines(file_connection)
close(file_connection)
Now we try to get rid of these trailing \t (this might need some testing to fit your needs)
sanitized <- gsub("\\t$", "", content)
And then we read this sanitized string as if it were a file (using the argument text)
df <- read.csv(text=paste0(sanitized, collapse="\n"), sep="\t", skip = 9, header=TRUE, fill = TRUE)
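To see the effect in isolation, here is a small self-contained sketch. The three-line content vector is made up for illustration and stands in for the lines of the microscope file; each line ends in a tab, which is the situation assumed above.

```r
# Hypothetical tab-separated content where every line ends with a trailing tab
content <- c("a\tb\t", "1\t2\t", "3\t4\t")

# Read as-is: the trailing tab is treated as a third, empty column
raw <- read.csv(text = paste(content, collapse = "\n"), sep = "\t", header = TRUE)
ncol(raw)

# Strip one trailing tab per line, then read again
sanitized <- sub("\t$", "", content)
clean <- read.csv(text = paste(sanitized, collapse = "\n"), sep = "\t", header = TRUE)
ncol(clean)
names(clean)
```

In this toy case the extra column is named X rather than X.1, because no other column has an empty or clashing name.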
I had that problem too. I fixed it by saving the file as "CSV (MS-DOS) (*.csv)" instead of "CSV (Comma delimited) (*.csv)", which is what I originally had.
This is almost certainly because you've got an extra empty column in your spreadsheet.
In Excel, open your sheet and press Ctrl-End. If you end up in an empty cell outside the range of your data, there's the problem. Select the column (Ctrl-Space), right-click, and choose Delete.
I also encountered a similar problem. Three extra columns (X, X.1, X.2) were created after I loaded a dataset from an Excel sheet into RStudio.
Steps I followed:
a) In the Excel sheet, I selected the three extra columns to the right of the last column with actual values by clicking on the column headers, then right-clicked and chose Delete.
b) I loaded the Excel sheet into R again. The three columns were gone.

write_csv - Exporting trailing spaces (no elimination)

I am trying to export a table to CSV format, but one of my columns is special - it's like a number string except that the length of the string needs to be the same every time, so I add trailing spaces to shorter numbers to get it to a certain length (in this case I make it length 5).
library(dplyr)
library(readr)
library(stringr)  # str_pad() comes from stringr
df <- read.table(text="ID Something
22 Red
55555 Red
123 Blue
",header=TRUE)
df <- mutate(df,ID=str_pad(ID,5,"right"," "))
df
     ID Something
1 22          Red
2 55555       Red
3 123        Blue
Unfortunately, when I try to do write_csv somewhere, the trailing spaces disappear which is not good for what I want to use this for. I think it's because I am downloading the csv from the R server and then opening it in Excel, which messes around with the data. Any tips?
str_pad() appears to be a function from the stringr package, which is not currently available for R 3.5.0, the version I am using. This may be the cause of your issues as well. If the function actually works for you, please ignore the next step and skip straight to my Excel comments below.
Adding spaces. Here is how I have accomplished this task with base R
# a custom function to add arbitrary number of trailing spaces
SpaceAdd <- function(x, desiredLength = 5) {
  additionalSpaces <- ifelse(nchar(x) < desiredLength,
                             paste(rep(" ", desiredLength - nchar(x)), collapse = ""), "")
  paste(x, additionalSpaces, sep="")
}
# use the function on your df
df$ID <- mapply(df$ID, FUN = SpaceAdd)
# write csv normally
write.csv(df, "df.csv")
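A shorter base R alternative, offered as a sketch rather than a replacement for the function above, is to pad with sprintf and then inspect the written file directly. The temporary file and the small data frame are assumptions for illustration. Reading the file back with readLines shows whether the spaces are actually in the CSV, which helps separate what R writes from what Excel displays.

```r
# Hypothetical data frame matching the example in the question
df <- data.frame(ID = c(22L, 55555L, 123L),
                 Something = c("Red", "Red", "Blue"),
                 stringsAsFactors = FALSE)

# "%-5s" left-justifies the value in a field of width 5 (pads on the right)
df$ID <- sprintf("%-5s", as.character(df$ID))

# Write to a temporary file and look at the raw lines
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
readLines(tmp)
```

If the padded values show up inside the quotes in the raw lines, the spaces were exported correctly and any loss happens when the file is opened in another program.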
NOTE When you import into Excel, you should use the 'import from text' wizard rather than just opening the .csv. This is because you need to mark your 'ID' column as text in order to keep the spaces.
NOTE 2 I learned today that having your first column named 'ID' might actually cause further problems with Excel, since it may misinterpret the nature of the file and treat it as a SYLK file instead. So it may be best to avoid this column name if possible.
Here is a wiki tl;dr:
A commonly encountered (and spurious) 'occurrence' of the SYLK file happens when a comma-separated value (CSV) format is saved with an unquoted first field name of 'ID', that is the first two characters match the first two characters of the SYLK file format. Microsoft Excel (at least to Office 2016) will then emit misleading error messages relating to the format of the file, such as "The file you are trying to open, 'x.csv', is in a different format than specified by the file extension..."
details: https://en.wikipedia.org/wiki/SYmbolic_LinK_(SYLK)

data frame accessing specific rows and col from csv file in R programming

I have a csv file that contains an iPhone device roadmap: version number, name of model, release of model, price, etc. I have done the following:
I imported the data set in RStudio into a variable named iphonedetail with the following command: iphonedetail <- read.csv("iphodedata.csv")
Then I changed the attribute "name of model" to character using the following: iphonedetail$nameofmodel <- as.character(iphonedetail$nameofmodel)
Now I need to access the first 5 model names and store them in a vector.
I tried this to achieve it: iphonesubset <- data.frame(iphonedetail$nameofmodel)
Then I typed iphonesubset on the console, but it gave 0 columns and rows.
Could someone tell me whether the first 2 steps are correct, and also suggest how to fix the 3rd step?
If you want to extract the first five (not necessarily unique):
iphonedf1to5 <- df[1:5,]
That gives you the first 5 rows and all columns. If you instead want only the unique rows among the first five, it should be:
iphonedf1to5 <- unique(df[1:5,])
Edit:
df means the data frame you read from the csv, iphonedetail in your case.
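Since the question asks for the first five model names as a vector rather than as rows of a data frame, a minimal sketch of that variant follows. The data frame below is a made-up stand-in for the result of read.csv, and the model names in it are placeholders.

```r
# Hypothetical stand-in for the data read from the csv file
iphonedetail <- data.frame(version = 1:6,
                           nameofmodel = paste("Model", LETTERS[1:6]),
                           stringsAsFactors = FALSE)

# Take the column as a vector with $, then keep its first five elements
first_five <- head(iphonedetail$nameofmodel, 5)
first_five
```

Selecting the column with $ (or with iphonedetail[1:5, "nameofmodel"]) returns a plain vector, whereas wrapping it in data.frame() keeps it as a one-column data frame.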

Exporting data frame columns into separate txt files

I chunked several novels into a data frame called documents. I want to export each chunk as a separate .txt file.
The data frame consists of two columns. The first column has the file names for each chunk, and the second column has the actual text that would go into the file.
documents[1,1]
[1] "Beloved.txt_1"
documents[1,2]
[1] "124 was spiteful full of a baby's venom the women......"
class(documents)
[1] "data.frame"
I'm trying to write a for loop that takes each row, writes the second column to a .txt file, and uses the first column as the name of the file, repeating this for every row. I've been working with something like this:
for (i in 1:ncol(documents)) {
write(tagged_text, paste("data/taggedCorpus/",
documents[i], ".txt", sep=""))
I've also been reading that maybe the cat function would work well here?
I'm not positive this will work for you (a little more of an example of your input and desired output would help), but one issue you've got is that your for loop runs over columns rather than rows. If you want to do this once for every row, then it needs to be for (i in 1:nrow(documents)) rather than ncol.
Assuming that "documents" is the name of your data.frame and that the column containing the text you want to save is called "tagged_text" and the column with the file name is called "file", try this:
for (i in 1:nrow(documents)) {
write(documents$tagged_text[i], paste0("data/taggedCorpus/",
documents$file[i], ".txt"))
}
Note that you don't need to specify the path every time if you already set it before you start the loop.
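For a version that can be run end to end, here is a sketch that writes into a temporary directory. The two-row documents data frame, the column names file and tagged_text, and the out_dir location are assumptions for illustration; writeLines is used here in place of write, and seq_len(nrow(...)) avoids a stray iteration when the data frame is empty.

```r
# Hypothetical data frame with one row per chunk
documents <- data.frame(file = c("Beloved.txt_1", "Beloved.txt_2"),
                        tagged_text = c("first chunk of text", "second chunk of text"),
                        stringsAsFactors = FALSE)

# Create the output directory once, before the loop
out_dir <- file.path(tempdir(), "taggedCorpus")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# One file per row: the first column names the file, the second is its content
for (i in seq_len(nrow(documents))) {
  out_path <- file.path(out_dir, paste0(documents$file[i], ".txt"))
  writeLines(documents$tagged_text[i], out_path)
}

list.files(out_dir)
```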

How to create a table in R from a csv file?

I have a csv file and am unsure how to get R to interpret it as a table because all the title info is in one cell and all the data relating to the titles is in a separate cell. So all the info I need is in 2 cells but it actually needs to be split up.
The cell A3 has a value called 'Team', which corresponds to the part in cell A4 that says 'Visitor'. Then each part after that corresponds to the bit below it. Sorry, I don't know how to describe it, but ultimately it would look like this …
Looks like the field separator in your data is a ;
read.csv has a parameter sep to change the field separator and another parameter header to tell it there is an initial line containing the column names. Use read.csv like this:
data = read.csv(file="/mydir/myfile.csv", sep=";", header=TRUE)
To test you can print out the first 5 lines of the data table with:
head(data,5)
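To try this without a file on disk, the same call can be pointed at a string through the text argument. The semicolon-separated content below is invented for illustration and only loosely follows the 'Team' and 'Visitor' labels mentioned in the question.

```r
# Hypothetical semicolon-separated content: one header line, two data lines
csv_text <- "Team;Goals\nVisitor;2\nHome;1"

# sep = ";" splits each line into fields; header = TRUE uses the first line as names
data <- read.csv(text = csv_text, sep = ";", header = TRUE)
head(data, 5)
str(data)
```

If everything still lands in a single column after reading, the separator in the file is probably something other than the one passed to sep.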
