On a UNIX system, I can easily do a global change in a file. Eg, let's say I have a year value of /2021 and it will be unique in the file and associated with the date, so I can do a global change and insert a comma after the /2021. This then lets me read the file into R using comma delimiters. Is there any way I can read a string eg
7/06/2021 23:45 and change that to 7/06/2021, 23:45 in R running on Windows?
Thanks.
The data is as follows with previous columns removed and linefeeds inserted to show the data as a list.
ReadingDate Units Read.Type
08/06/2021 0:00 0 Actual
07/06/2021 23:45 0 Actual
07/06/2021 23:30 0 Actual
07/06/2021 23:15 0 Actual
07/06/2021 23:00 0 Actual
07/06/2021 22:45 0 Actual
ReadingDate is the date and time, so there are three columns. I would like four with time separated from date via a comma.
If your input would always be a date and time component, separated by a single space, then just use sub here in fixed mode:
date <- "7/06/2021 23:45"
output <- sub(" ", ", ", date, fixed=TRUE)
output
[1] "7/06/2021, 23:45"
To apply the above logic to a data frame column, use:
df$ReadingDate <- sub(" ", ", ", df$ReadingDate, fixed=TRUE)
Related
I'm trying to replace some of the columns in my data frame with extracted strings from each column name. This is my current data frame:
Date Time Temp ActivityLevelActivity ExplainActivityvalues4 AppetiteLevelAppetite
10/22/21 10:26 76 4 Activity was low 8
10/23/21 8:42 79 3 Activity was low again 7
I would like to replace the "ActivityLevelActivity" and "AppetiteLevelAppetite" column names with just "Activity" and "Appetite". I would like to change the "ExplainActivityvalues4" to "Activity_Comments".
I have tried:
gsub("Level", "[^L]+", names(df))
gsub("Explain", "(?<=\\n)[[:alpha:]]+(?<=\\v)", names(df))
I used "Level" and "Explain" as the patterns because the word "Level" is included in every column name where I would just like to take the first word. "Explain" is included for every column name where I would like to take the middle word and add "_Comments".
Essentially, I would like the new data frame to look like this:
Date Time Temp Activity Activity_Comments Appetite
10/22/21 10:26 76 4 Activity was low 8
10/23/21 8:42 79 3 Activity was low again 7
EDIT:
To explain further, here are all of my column names:
names(df) <- c(“Date”, “Time”, “Temp”, “ActivityLevelActivity”, “ExplainActivityvalues4”, “AppetiteLevelAppetite”, “ExplainAppetitevalues4”, “ComfortLevelComfort”, “ExplainComfortvalues4”, “DemeanorLevelDemeanor”, “ExplainDemeanorvalues4”, CooperationLevelCooperation”, “ExplainCooperationvalues4”, “HygieneLevelHygiene”, “ExplainHygienevalues4”, “MobilityLevelMobility”, “ExplainMobilityvalues4”)
Since you only have three columns and there's not really much of a shared pattern here, it would just be easier to more directly use rename.
# library(dplyr)
dd %>%
rename(
Activity = ActivityLevelActivity,
Appetite = AppetiteLevelAppetite,
Activity_Comments = ExplainActivityvalues4)
I have a very large csv file (1.4 million rows). It is supposed to have 22 fields and 21 commas in each row. It was created by taking quarterly text files and compiling them into one large text file so that I could import into SQL. In the past, one field was not in the file. I don't have the time to go row by row and check for this.
In R, is there a way to verify that each row has 22 fields or 21 commas? Below is a small sample data set. The possibly missing field is the 0 in the 10th slot.
32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1
you can use the base R function count.fields to do this:
count.fields(tmp, sep=",")
[1] 22 22
The input for this function is the name of a file or a connection. Below, I supplied a textConnection. For large files, you would probably want to feed this into table:
table(count.fields(tmp, sep=","))
Note that this can also be used to count the number of rows in a file using length, similar to the output of wc -l in the *nix OSs.
data
tmp <- textConnection(
"32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1"
)
Assuming df is your dataframe
apply(df, 1, length)
This will give you the length of each row.
I have a very large DAT file (16 GB). It contains some information of let's say, 1000 customers. This data is sorted like below that the first column is representing the customer IDs:
9909814 246766 0 31/07/2012 7:00 0.03 0 0 0 0
8211675 262537 0 8/04/2013 3:00 0.52 0 0 0 0
However, the data of customers are not stored in an organized way. So, I want to extract the data of each customer and store it in a separate file. (I have a file that contains the customer IDs. )
For just one customer, I wrote the following code that can search through the file and extract data. However, my problem is to how to do this for all the customers when I'm reading this big file into R.
con<-file('D:/CD_INTERVAL_READING.DAT')
open(con)
n=20
nk=100000
B=9909814 #customer ID for customer no.1
customer1 <- read.table(con, sep=",", nrow=1)
for (i in 1:n) {
conn <- read.table(con,sep=",",skip=(i-1)*nk, nrow=nk)
## extracts just those rows that belong to a specific customer ID
temp1 <-conn[conn$V1==B,]
customer1 <-rbind(customer1,temp1)
}
customer1 <- customer1 [-1,]
library(xlsx)
write.xlsx(customer1, "D:/customer1.xlsx")
The optimal solution would probably be to import the data into a proper database but if you really want to split the file into multiple files based on the first token then you can use awk with this one-liner.
awk '/^/ {ofn=$1 ".txt"} ofn {print > ofn}' filetosplit.txt
It works by
/^/ matching start of line
{ofn=$1 ".txt"} sets the ofn variable to the first word (split by white space) with .txt appended.
Print each line to the file set by ofn.
It takes me just under two minutes on my laptop to split a 1 GB file with the same format as you listed above into multiple text files. I have no idea how well that scales or if it's fast enough for you. If you want an R solution you can always wrap it into a system() call ;o)
Addendum:
Oh ... I'm guessing you are on windows based on the path you mentioned. Then you may need to install Cygwin to get awk.
I have fixed width data files (.dbf) that don't have line separators. Here is what two lines of that datafile looks like:
20141101 77h 3.210 0 3 20141102 76h 3.090 0 3
The widths of one line is c(8,4,7,41) for date (8), some time measure (4), the data point (7), and some other columns that i can summarize in one "rest" column (41). After one line there is no separator and the next line is just appended to the first line. All time steps are basically written consecutively in one massive line. There is exclusively numbers, characters and white space in this file.
With read.fwf('filepath', widths = c(8,4,7,41)) R stops reading after the first line due to lack of line separator.
Is there an argument to tell read.fwf() when to start reading the new line when there is no line separator? Or should i use a different read command?
Thanks in advance.
Maybe not the best idea but this should work:
content <- scan('filepath','character',sep='~') # Warning choose a sep not appearing in datas to get the whole file.
# Split content in lines:
lines <- regmatches(content,gregexpr('.{60}',content))[[1]]
x <- tempfile()
write(lines,x)
data <- read.fwf(x, widths = c(8,4,7,41))
unlink(x)
The idea is to read the whole file, get each occurence of 60 chars into a single entry, write this to a tempfile, and read the data from this tempfile before deleting the temporary file.
Another approach is doable with regexes and package stringr (still with content resulting from scan above):
library(stringr)
d <- data.frame( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5], stringsAsFactors=FALSE)
which gives:
V1 V2 V3 V4
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
str_match_all return a list, here with 1 element because there's only one line as input, so we remove it with [[1]].
Now the return is 5 columns, the first one being the full match, others being the capture groups so we subset the matrix on columns 2 to 5 to get only the 4 columns we need and wrap it in as.data.frame to get a data.frame at end.
you can then name the columns with colnames(d) <- c('date','time','data_point','rest')
If you wish to clean up the white spaces you can wrap the str_extract_all result in trimws (thanks to #jaap for the remind of this function) like this:
td <- data.frame( trimws( str_match_all( content, "(.{8})(.{4})(.{7})(.{41})")[[1]][,2:5] ), stringsAsFactors=FALSE)
Output:
X1 X2 X3 X4
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
A different, and probably less elegant, solution with readLines, substr, trimws, separate (tidyr) and mutate_all (dplyr):
txt <- readLines('filepath')
dfx <- data.frame(V1 = sapply(seq(from=1, to=nchar(txt), by=60),
function(x) substr(txt, x, x+59)))
library(dplyr)
library(tidyr)
dfx %>%
separate(V1, c(paste0("V",LETTERS[1:5])), c(8,12,19,55)) %>%
mutate_all(trimws)
which gives:
VA VB VC VD VE
1 20141101 77h 3.210 0 3
2 20141102 76h 3.090 0 3
To get different column names , just replace c(paste0("V",LETTERS[1:5]) with a vector of columnnames you want.
If you want to transform the columns into the correct classes instead of into character, you can use funs(ul = type.convert(trimws(.))) inside mutate_all.
In addition to the other answers, some general info about dbf files:
Unless this is a one time read of a static file, it would be best to check the file/fields structure first in case that changes over time. See here for the internal structure of a dbf file.
But maybe even more important:
Each record in a dbf file is preceded by one byte for the delete flag. If this is a space, the record is not deleted, if it's an asterisk * the record is marked for deletion (records are not removed from a dbf file until the file is packed), and you probably want to skip those records. The first part of the data could also be overwritten with "DELETED" for example.
So, in your record c(8,4,7,41), the last byte of the rest column (41) is actually the delete flag of the record that follows it - and the last record in the file will only have 40 bytes for that field (but if you're lucky, the file has an EOF marker (0x1a), so maybe you didn't have a problem with the size there).
Thus, your record should actually be: c(1,8,4,7,40), where the 1 is the delete flag, and starting one byte sooner.
I've had some temperature measurements in .csv format and am trying to analyse them in R. For some reason the data files contain temperature with degree C following the numeric value. Is there a way to remove the degree C symbol and return the numeric value? I though of producing an example here but did not know how to generate a degree symbol in a string in R. Anyhow, this is what the data looks like:
> head(mm)
dateTime Temperature
1 2009-04-23 17:01:00 15.115 °C
2 2009-04-23 17:11:00 15.165 °C
3 2009-04-23 17:21:00 15.183 °C
where the class of mm[,2] is 'factor'
Can anyone suggest a method for converting the second column to 15.115 etc?
You can remove the unwanted part and convert the rest to numeric all at the same time with scan(). Setting flush = TRUE treats the last field (after the last space) as a comment and it gets discarded (since sep expects whitespace separators by default).
mm <- read.table(text = "dateTime Temperature
1 '2009-04-23 17:01:00' '15.115 °C'
2 '2009-04-23 17:11:00' '15.165 °C'
3 '2009-04-23 17:21:00' '15.183 °C'", header = TRUE)
replace(mm, 2, scan(text = as.character(mm$Temp), flush = TRUE))
# dateTime Temperature
# 1 2009-04-23 17:01:00 15.115
# 2 2009-04-23 17:11:00 15.165
# 3 2009-04-23 17:21:00 15.183
Or you can use a Unicode general category to match the unicode characters for the degree symbol.
type.convert(sub("\\p{So}C", "", mm$Temp, perl = TRUE))
# [1] 15.115 15.165 15.183
Here, the regular expression \p{So} matches various symbols that are not math symbols, currency signs, or combining characters. C matches the character C literally (case sensitive). And type.convert() takes care of the extra whitespace.
If all of your temperature values have the same number of digits you can make left and right functions (similar to those in Excel) to select the digits that you want. Such as in this answer from a different post: https://stackoverflow.com/a/26591121/4459730
First make the left function:
left = function (string,char){
substr(string,1,char)
}
Then recreate your Temperature string using just the digits you want:
mm$Temperature<-left(mm$Temperature,6)
degree symbol is represented as \u00b0, hence following code should work:
df['Temperature'] = df['Temperature'].replace('\u00b0','', regex=True)