Control text justification - r

I am trying to create a space-delimited input file for another program. I'm pasting together the contents of multiple columns and having problems when the numbers have different lengths, due to what appears to be a default right-justification in R. For example:
row_id monthly_spend
123 4.55
567 24.64
678 123.09
becomes:
row_id:123 monthly_spend: 4.55
row_id:567 monthly_spend: 24.64
row_id:678 monthly_spend:123.09
while what I need is this:
row_id:123 monthly_spend:4.55
row_id:567 monthly_spend:24.64
row_id:678 monthly_spend:123.09
The code I'm using is derived from this question and looks like this:
paste(row_id, monthly_spend, sep=":", collapse=" ")
I've tried formatting the columns as numeric or integer without any change.
Any suggestions?

If you put your vectors into a data.frame (if they are not already), you can use:
apply(sapply(names(myDF), function(x) paste(x, myDF[, x], sep = ":")),
      1, paste, collapse = " ")
# [1] "row_id:123 monthly_spend:4.55"
# [2] "row_id:567 monthly_spend:24.64"
# [3] "row_id:678 monthly_spend:123.09"
or alternatively:
do.call(paste, lapply(names(myDF), function(x) paste0(x, ":", myDF[, x])))
sprintf is also an option (see the sketch after the sample data); you've got many ways of going about it.
sample data used:
myDF <- read.table(header=TRUE, text=
"row_id monthly_spend
123 4.55
567 24.64
678 123.09")

With your data snippet:
df <- read.table(text = "row_id monthly_spend
123 4.55
567 24.64
678 123.09", header = TRUE)
Then we can paste together, employing the format function with trim = TRUE to strip the spaces you don't want:
with(df, paste("row_id:", row_id,
"monthly_spend:", format(monthly_spend, trim = TRUE)))
Which gives:
> with(df, paste("row_id:", row_id,
+ "monthly_spend:", format(monthly_spend, trim = TRUE)))
[1] "row_id: 123 monthly_spend: 4.55" "row_id: 567 monthly_spend: 24.64"
[3] "row_id: 678 monthly_spend: 123.09"
If you need this in a data frame before writing out to file, use:
newdf <- with(df, data.frame(foo = paste("row_id:", row_id,
                                         "monthly_spend:",
                                         format(monthly_spend, trim = TRUE))))
newdf
> newdf
foo
1 row_id: 123 monthly_spend: 4.55
2 row_id: 567 monthly_spend: 24.64
3 row_id: 678 monthly_spend: 123.09
When you write this out, the columns will be justified as you want.
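If helpful, a minimal sketch of that final write-out step (the filename output.txt is just a placeholder) could be:
# write the single combined column to disk, space-delimited, with no quoting
write.table(newdf, file = "output.txt",
            quote = FALSE, row.names = FALSE, col.names = FALSE)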

Here is a general answer (any number of variables), assuming your data is in a data.frame dat:
x <- mapply(names(dat), dat, FUN = paste, sep = ":")
write.table(x, file = stdout(),
quote = FALSE, row.names = FALSE, col.names = FALSE)
And you can replace stdout() with a filename.

Assuming the data frame is called df:
write.table(as.data.frame(sapply(1:ncol(df), FUN = function(x)
  paste(rep(colnames(df)[x], nrow(df)), df[, x], sep = ":"))),
  "someFileName", row.names = FALSE, col.names = FALSE, sep = " ")
This is equivalent to the following substeps:
# generate the colon-separated records
df_cp <- sapply(1:ncol(df), FUN = function(x)
  paste(rep(colnames(df)[x], nrow(df)), df[, x], sep = ":"))
# cast to data frame
df_cp <- as.data.frame(df_cp)
# write out to disk
write.table(df_cp, "someFileName", row.names = FALSE, col.names = FALSE, sep = " ")

Related

reading txt file and converting it to dataframe

I have a .txt file containing some investment data. I want to convert the data in the file to a data frame with three columns. The data in the .txt file looks like this:
Date:
06-04-15, 07-04-15, 08-04-15, 09-04-15, 10-04-15
Equity :
-237.79, -170.37, 304.32, 54.19, -130.5
Debt :
16318.49, 9543.76, 6421.67, 3590.47, 2386.3
If you are going to use read.table(), then the following may help:
Assuming dat.txt contains the above contents:
dat <- read.table("dat.txt", fill = TRUE, sep = ",")
df <- as.data.frame(t(dat[seq(2, nrow(dat), by = 2), ]))
rownames(df) <- seq(nrow(df))
colnames(df) <- trimws(gsub(":", "", dat[seq(1, nrow(dat), by = 2), 1]))
yielding:
> df
Date Equity Debt
1 06-04-15 -237.79 16318.49
2 07-04-15 -170.37 9543.76
3 08-04-15 304.32 6421.67
4 09-04-15 54.19 3590.47
5 10-04-15 -130.5 2386.3
Assuming the text file name is demo.txt, here is one way to do this:
#Read the file line by line
all_vals <- readLines("demo.txt")
#Since the column names and data are in alternate lines
#We first gather column names together and clean them
column_names <- trimws(sub(":", "", all_vals[c(TRUE, FALSE)]))
#we can then paste the data part together and assign column names to it
df <- setNames(data.frame(t(read.table(text = paste0(all_vals[c(FALSE, TRUE)],
                                                     collapse = "\n"),
                                       sep = ",")),
                          row.names = NULL),
               column_names)
#Since most of the data is read as factors, we use type.convert to
#convert each column to its appropriate type.
type.convert(df)
# Date Equity Debt
#1 06-04-15 -237.79 16318.49
#2 07-04-15 -170.37 9543.76
#3 08-04-15 304.32 6421.67
#4 09-04-15 54.19 3590.47
#5 10-04-15 -130.50 2386.30

Splitting a column in a data frame by an nth instance of a character

I have a data frame with several columns, and one of those columns is populated by pipe-delimited ("|") values containing the information I am trying to obtain.
For example:
View(Table$Column)
"|1||KK|12|Gold||4K|"
"|1||Rst|E|Silver||13||"
"|1||RST|E|Silver||18||"
"|1||KK|Y|Iron|y|12||"
"|1||||Copper|Cpr|||E"
"|1||||Iron|||12|F"
And so on for about 120K rows.
What I am trying to excavate is everything between the 5th pipe and the 6th pipe in this series, but in its own column vector, so the end result looks like this:
View(Extracted)
Gold
Silver
Silver
Iron
Copper
Iron
I don't want to use regex. My tools are limited to R here. Would you happen to have any advice on how to overcome this?
Thank you.
1) Assuming Table as defined reproducibly in the Note at the end, use read.table as shown. No regular expressions or packages are used.
read.table(text = Table$Column, sep = "|", header = FALSE,
           as.is = TRUE, fill = TRUE)[6]
giving:
V6
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
2) This alternative does use a regular expression (which the question asked to avoid), but just in case, here is a tidyr solution. Note that it requires tidyr 0.8.2 or later, since earlier versions of tidyr did not support NA in the into= argument.
library(dplyr)
library(tidyr)
Table %>%
  separate(Column, into = c(rep(NA, 5), "commodity"), sep = "\\|", extra = "drop")
giving:
commodity
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
3) This is another base solution. It is probably not the one you want given that (1) is so much simpler but I wanted to see if we could come up with a second approach in base that did not use regexes. Note that if the split= argument of strsplit is "" then it is treated specially and so is not a regex. It creates a list each of whose components is a vector of single characters. Each such vector is passed to the anonymous function which labels | and the characters in the field after it with its ordinal number. We then take the characters corresponding to 5 (except the first as it is |) and collapse them together using paste.
data.frame(commodities = sapply(strsplit(Table$Column, ""), function(chars) {
  wx <- which(cumsum(chars == "|") == 5)
  paste(chars[seq(wx[2], tail(wx, 1))], collapse = "")
}), stringsAsFactors = FALSE)
giving:
commodities
1 Gold
2 Silver
3 Silver
4 Iron
5 Copper
6 Iron
Note
Table <- data.frame(Column = c("|1||KK|12|Gold||4K|",
"|1||Rst|E|Silver||13||",
"|1||RST|E|Silver||18||",
"|1||KK|Y|Iron|y|12||",
"|1||||Copper|Cpr|||E",
"|1||||Iron|||12|F"), stringsAsFactors = FALSE)
You can try this:
df <- data.frame(x = c("|1||KK|12|Gold||4K|", "|1||Rst|E|Silver||13||"), stringsAsFactors = FALSE)
library(stringr)
stringr::str_split(df$x, "\\|", simplify = TRUE)[, 6]
1) We can use strsplit from base R on the delimiter | and extract the 6th element from the list of vectors
sapply(strsplit(Table$Column, "|", fixed = TRUE), `[`, 6)
#[1] "Gold" "Silver" "Silver" "Iron" "Copper" "Iron"
2) Or using regex (again from base R), use sub to extract the 6th word
sub("^([|][^|]+){4}[|]([^|]*).*", "\\2",
gsub("(?<=[|])(?=[|])", "and", Table$Column, perl = TRUE))
#[1] "Gold" "Silver" "Silver" "Iron" "Copper" "Iron"
data
Table <- structure(list(Column = c("|1||KK|12|Gold||4K|",
"|1||Rst|E|Silver||13||",
"|1||RST|E|Silver||18||", "|1||KK|Y|Iron|y|12||", "|1||||Copper|Cpr|||E",
"|1||||Iron|||12|F")), class = "data.frame", row.names = c(NA,
-6L))

r - Replace String In Specific Column

I have a dataset of approximately 2 million rows and 45 columns. I would like to replace a list of values in one specific column within this dataset.
I have tried gsub but it is proving to take a prohibitive length of time. I need to perform 16 replacements.
To give you an example of what I've done:
setwd("C:/RStudio")
dat2 <- read.csv("2016 new.csv", stringsAsFactors=FALSE)
dat3 <- read.csv("2017 new.csv", stringsAsFactors=FALSE)
dat4 <- read.csv("2018 new.csv", stringsAsFactors=FALSE)
myfulldata <- rbind(dat2, dat3)
myfulldata <- rbind(myfulldata, dat4)
myfulldata <- myfulldata[, -c(1,5,10,11,12,13,15,20,21,22,41,42,43,44,48,50,51,52,59,61,62,64,65,66,67,68,69,70,71,72)]
gc()
myfulldata[is.na(myfulldata)] <- ""
gc()
myfulldata <- gsub("Text Being Replaced","CS1",myfulldata, fixed=TRUE)
I've bound several files together and then removed the columns I don't need. The last line is where I begin the string-replacement section. I only want to replace values in one specific column. With this in mind, can I use something other than gsub (or whatever works best) so that I'm only replacing values in column number 36, named Waypoint?
Many thanks,
Eoghan
Answer, with credit going to phiver:
set.seed(123)
# data simulation
n <- 10  # 2e6 in the real data
m <- 45
myfulldata <- as.data.frame(matrix(paste0("Text", 1:(n * m)), ncol = m),
                            stringsAsFactors = FALSE)
names(myfulldata)[36] <- "Waypoint"
myfulldata$Waypoint[sample(seq.int(nrow(myfulldata)), 5)] <- "Text Being Replaced"
# data replacement, restricted to the Waypoint column
myfulldata$Waypoint <- gsub("Text Being Replaced", "CS1", myfulldata$Waypoint, fixed = TRUE)
myfulldata$Waypoint
# [1] "Text351" "Text352" "CS1"     "CS1"     "Text355" "CS1"     "CS1"     "CS1"
# [9] "Text359" "Text360"
myfulldata
Output:
V33 V34 V35 Waypoint V37 V38
1 Text321 Text331 Text341 Text351 Text361 Text371
2 Text322 Text332 Text342 Text352 Text362 Text372
3 Text323 Text333 Text343 CS1 Text363 Text373
4 Text324 Text334 Text344 CS1 Text364 Text374
5 Text325 Text335 Text345 Text355 Text365 Text375
6 Text326 Text336 Text346 CS1 Text366 Text376
7 Text327 Text337 Text347 CS1 Text367 Text377
8 Text328 Text338 Text348 CS1 Text368 Text378
9 Text329 Text339 Text349 Text359 Text369 Text379
10 Text330 Text340 Text350 Text360 Text370 Text380
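Since the question mentions needing 16 replacements, here is a hedged sketch of one way to apply a whole set of them to just the Waypoint column; the named vector repl_map and its contents are hypothetical, and exact matches (not patterns) are assumed:
# hypothetical lookup table: old value -> new value (extend to all 16 pairs)
repl_map <- c("Text Being Replaced" = "CS1",
              "Another Old Value"   = "CS2")
# replace only exact matches, and only in the Waypoint column
hit <- myfulldata$Waypoint %in% names(repl_map)
myfulldata$Waypoint[hit] <- unname(repl_map[myfulldata$Waypoint[hit]])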

r String split and merge

My dataset looks like this below
Id Col1
--------------------
133 Mary 7E
281 Feliz 2D
437 Albert 4C
What I am trying to do is take the first two characters from the first word in Col1 and the whole second word, and then merge them.
My final expected dataset should look like this below
Id Col1
--------------------
133 MA7E
281 FE2D
437 AL4C
Any suggestions on how to accomplish this is much appreciated.
You can do
my_data$Col1 <- sub("(\\w{2})(\\w* )(\\b\\w+\\b)", "\\1\\3", my_data$Col1)
my_data$Col1 <- toupper(my_data$Col1)
my_data
# Id Col1
# 1 133 MA7E
# 2 281 FE2D
# 3 437 AL4C
The parentheses mark the groups that are matched; only the first and the third are retained. \\w matches word characters (letters, digits, and underscore) and \\b matches a word boundary.
We can also paste0 together the output of substr and str_split_fixed within a dplyr pipe chain:
df <- data.frame(id = c(133, 281, 437),
                 Col1 = c("Mary 7E", "Feliz 2D", "Albert 4C"))
library(dplyr)
library(stringr)
df %>%
  mutate(Col1 = toupper(paste0(substr(Col1, 1, 2),
                               stringr::str_split_fixed(Col1, ' ', 2)[, 2])))
You can do this in several steps. First split by space, take the first two letters of the name and capitalize them, then paste that together with the second part. The result is in the column final. You could keep all these intermediate steps or chain the commands into fewer statements, whatever floats your boat.
xy <- data.frame(id = c(133, 281, 437),
                 name = c("Mary 7E", "Feliz 2D", "Albert 4C"),
                 stringsAsFactors = FALSE)
xy$first <- sapply(strsplit(xy$name, " "), "[", 1)
xy$second <- sapply(strsplit(xy$name, " "), "[", 2)
xy$first_upper <- toupper(substr(x = xy$first, start = 1, stop = 2))
xy$final <- paste(xy$first_upper, xy$second, sep = "")
xy
id name first second first_upper final
1 133 Mary 7E Mary 7E MA MA7E
2 281 Feliz 2D Feliz 2D FE FE2D
3 437 Albert 4C Albert 4C AL AL4C
Here is another variation using sub. We can use lookarounds in Perl mode to selectively remove everything except for the first two, and last two, characters. Then, make a call to toupper() to capitalize all letters.
df$Col1 <- toupper(sub("(?<=^..).*(?=..$)", "", df$Col1, perl = TRUE))
[1] "MA7E" "FE2D" "AL4C"
Rather than a one-line solution, this is easy to interpret and modify:
library(dplyr)
library(stringi)
xx_df <- data.frame(id = c(133, 281, 437),
                    Col1 = c("Mary 7E", "Feliz 2D", "Albert 4C"))
xx_df %>%
  mutate(xpart1 = stri_split_fixed(Col1, " ", simplify = TRUE)[, 1]) %>%
  mutate(xpart2 = stri_split_fixed(Col1, " ", simplify = TRUE)[, 2]) %>%
  mutate(Col1_new = paste0(substr(xpart1, 1, 2), substr(xpart2, 1, 2))) %>%
  select(id, Col1 = Col1_new) %>%
  mutate(Col1 = toupper(Col1))
result is
id Col1
1 133 MA7E
2 281 FE2D
3 437 AL4C
This solution uses substr to take the first two characters from each string and the last two. For selecting the last two we need nchar (here wrapped in sapply). The pieces are combined with paste0, and toupper gives capital letters.
l2 <- sapply(df$Col1, function(x) nchar(x))
paste0(toupper(substr(df$Col1, 1, 2)), substr(df$Col1, l2 - 1, l2))
[1] "MA7E" "FE2D" "AL4C"

Merging Two Headings Into One

Very simple question. I am using an Excel sheet that has two rows of column headings; how can I combine these two header rows into one? Further, these headings don't start at the top of the sheet.
Thus, I have DF1
Temp Press Reagent Yield A Conversion etc
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
and I want,
Temp degC Press bar Reagent /g Yield A % Conversion etc
1 2 3 4 5
6 7 8 9 10
Using colnames(DF1) returns the upper names, but getting the second line to merge with the upper one keeps eluding me.
Using your data, modified to quote text fields that contain the separator (get whatever tool you used to generate the file to quote text fields for you!)
txt <- "Temp Press Reagent 'Yield A' 'Conversion etc'
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
"
This snippet of code reads the file in two steps:
First we read the data; skip = 2 means skip the first 2 lines.
Next we read the file again, but only the first two lines; this output is then further processed by sapply(), where we paste(x, collapse = " ") the strings in the columns of the labs data frame. These are assigned to the names of dat.
Here is the code:
dat <- read.table(text = txt, skip = 2)
labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
names(dat) <- sapply(labs, paste, collapse = " ")
dat
names(dat)
The code, when run, produces:
> dat <- read.table(text = txt, skip = 2)
> labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
> names(dat) <- sapply(labs, paste, collapse = " ")
>
> dat
Temp degC Press bar Reagent /g Yield A % Conversion etc %
1 1 2 3 4 5
2 6 7 8 9 10
> names(dat)
[1] "Temp degC" "Press bar" "Reagent /g"
[4] "Yield A %" "Conversion etc %"
In your case, you'll want to modify the read.table() calls to point at the file on your file system, so use file = "foo.txt" in place of text = txt in the code chunk, where "foo.txt" is the name of your file.
Also, if these headings don't start at the top of the file, then increase skip to 2+n where n is the number of lines before the two header rows. You'll also need to add skip = n to the second read.table() call which generates labs, where n is again the number of lines before the header lines.
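For instance, a minimal sketch with a hypothetical foo.txt that has 3 junk lines before the two header rows might look like this:
n <- 3  # hypothetical number of lines preceding the two header rows
dat  <- read.table(file = "foo.txt", skip = 2 + n)
labs <- read.table(file = "foo.txt", skip = n, nrows = 2,
                   stringsAsFactors = FALSE)
names(dat) <- sapply(labs, paste, collapse = " ")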
This should work. You only need to set stringsAsFactors=FALSE when reading the data.
data <- structure(list(Temp = c("degC", "1", "6"), Press = c("bar", "2",
"7"), Reagent = c("/g", "3", "8"), Yield.A = c("%", "4", "9"),
Conversion = c("%", "5", "10")), .Names = c("Temp", "Press",
"Reagent", "Yield.A", "Conversion"), class = "data.frame", row.names = c(NA,
-3L)) # Your data
colnames(data) <- paste(colnames(data), data[1, ]) # Set new names
data <- data[-1, ]                                  # Remove first line
data <- data.frame(apply(data, 2, as.numeric))      # Correct the classes (works only if all columns are numeric)
Just load your file with read.table(file, header = FALSE, stringsAsFactors = FALSE). Then you can grep to find the position where the header rows occur.
df <- data.frame(V1 = c(sample(10), "Temp", "degC"),
                 V2 = c(sample(10), "Press", "bar"),
                 V3 = c(sample(10), "Reagent", "/g"),
                 V4 = c(sample(10), "Yield_A", "%"),
                 V5 = c(sample(10), "Conversion", "%"),
                 stringsAsFactors = FALSE)
idx <- unique(c(grep("Temp", df$V1), grep("degC", df$V1)))
df2 <- df[-(idx), ]
names(df2) <- sapply(df[idx, ], function(x) paste(x, collapse=" "))
Here, if you want, you can then convert all the columns to numeric as follows:
df2 <- as.data.frame(sapply(df2, as.numeric))
