I want to read a text file into R, but the row names end up fused with the first number of each row in the first column.
Data text file
revenues 4118000000.0, 4315000000.0, 4512000000.0, 4709000000.0, 4906000000.0, 5103000000.0
cost_of_revenue-1595852945.4985902, -1651829192.2662954, -1705945706.6237037, -1758202488.5708148, -1808599538.1076286, -1857136855.234145
gross_profit 2522147054.5014095, 2663170807.7337046, 2806054293.376296, 2950797511.429185, 3097400461.892371, 3245863144.765855
R Code:
data.predicted_values = read.table("predicted_values.txt", sep=",")
Output:
V1 V2 V3 V4 V5 V6
1 revenues 4118000000.0 4315000000 4512000000 4709000000 4906000000 5103000000
2 cost_of_revenue-1595852945.4985902 -1651829192 -1705945707 -1758202489 -1808599538 -1857136855
3 gross_profit 2522147054.5014095 2663170808 2806054293 2950797511 3097400462 3245863145
How can I split the first column into two parts? I want V1 to hold the names (revenues, cost_of_revenue, gross_profit), V2 to hold the first values (4118000000.0, -1595852945.4985902, 2522147054.5014095), and so on.
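This is along the same lines of thinking as @DWin's answer, and it keeps the minus sign on the negative values in the second row.
# Collapse the file into one string so a single set of matches covers every row
TEXT <- paste(readLines("predicted_values.txt"), collapse = " ")
A <- gregexpr("[A-Za-z_]+", TEXT)
# The text between the matched names is the numeric part of each row
B <- read.table(text = regmatches(TEXT, A, invert = TRUE)[[1]], sep = ",")
C <- cbind(FirstCol = regmatches(TEXT, A)[[1]], B)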
C
# FirstCol V1 V2 V3 V4 V5 V6
# 1 revenues 4118000000 4315000000 4512000000 4709000000 4906000000 5103000000
# 2 cost_of_revenue -1595852945 -1651829192 -1705945707 -1758202489 -1808599538 -1857136855
# 3 gross_profit 2522147055 2663170808 2806054293 2950797511 3097400462 3245863145
Since you have no commas between the row names and the values, you need to add them back in:
txt <- "revenues 4118000000.0, 4315000000.0, 4512000000.0, 4709000000.0, 4906000000.0, 5103000000.0
cost_of_revenue-1595852945.4985902, -1651829192.2662954, -1705945706.6237037, -1758202488.5708148, -1808599538.1076286, -1857136855.234145
gross_profit 2522147054.5014095, 2663170807.7337046, 2806054293.376296, 2950797511.429185, 3097400461.892371, 3245863144.765855"
Lines <- readLines( textConnection(txt) )
# to read from disk instead, use readLines("predicted_values.txt")
# put a comma after the leading name; a minus sign, if any, stays with the number
res <- read.csv( text=sub( "^([[:alpha:]_]+)\\s*" ,
"\\1,", Lines) ,
header=FALSE, row.names=1 )
res
The decimal fractions may not print but they are there.
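To check that claim, here is a minimal sketch on a shortened inline copy of the data (two values per row, already comma-separated; both are assumptions made to keep the example small). Raising digits in print() shows the fractions that the default seven significant digits hide.

```r
# Shortened, already comma-separated copy of the data (illustrative only)
txt <- c("revenues,4118000000.0,4315000000.0",
         "cost_of_revenue,-1595852945.4985902,-1651829192.2662954")
res <- read.csv(text = txt, header = FALSE, row.names = 1)

print(res)               # default of 7 significant digits: fractions not shown
print(res, digits = 15)  # more digits: fractions become visible

# The stored value still carries its fractional part
x <- res["cost_of_revenue", 1]
frac <- abs(x) - trunc(abs(x))
frac
```

Only the printed representation is rounded; arithmetic on res uses the full double-precision values.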
You want the row.names argument of read.table. Then you can simply transpose your data:
data.predicted_values = read.table("predicted_values.txt", sep=",", row.names=1)
data.predicted_values <- t(data.predicted_values)
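One thing to keep in mind with this approach: t() returns a matrix, not a data frame. A minimal sketch, assuming a file whose names are already separated from the values by commas (the inline text below stands in for such a file and is made up):

```r
# Made-up, properly comma-separated stand-in for the file
txt <- "revenues,1.5,2.5,3.5
cost_of_revenue,-1,-2,-3"

# The first field of each line becomes the row name
wide <- read.table(text = txt, sep = ",", row.names = 1)

# t() gives a numeric matrix; wrap it to get a data frame
# with one column per original row
long <- as.data.frame(t(wide))
long
```

After this, long$revenues and long$cost_of_revenue are ordinary numeric columns.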
Related
I have a .txt file that contains some investment data. I want to convert the data in the file to a data frame with three columns. The data in the .txt file looks like this:
Date:
06-04-15, 07-04-15, 08-04-15, 09-04-15, 10-04-15
Equity :
-237.79, -170.37, 304.32, 54.19, -130.5
Debt :
16318.49, 9543.76, 6421.67, 3590.47, 2386.3
If you are going to use read.table(), then the following may help.
Assuming dat.txt contains the contents above:
dat <- read.table("dat.txt", fill = TRUE, sep = ",")
df <- as.data.frame(t(dat[seq(2,nrow(dat),by=2),]))
rownames(df) <- seq(nrow(df))
colnames(df) <- trimws(gsub(":","",dat[seq(1,nrow(dat),by=2),1]))
yielding:
> df
Date Equity Debt
1 06-04-15 -237.79 16318.49
2 07-04-15 -170.37 9543.76
3 08-04-15 304.32 6421.67
4 09-04-15 54.19 3590.47
5 10-04-15 -130.5 2386.3
Assuming the text file name is demo.txt here is one way to do this
#Read the file line by line
all_vals <- readLines("demo.txt")
#Since the column names and data are in alternate lines
#We first gather column names together and clean them
column_names <- trimws(sub(":", "", all_vals[c(TRUE, FALSE)]))
#we can then paste the data part together and assign column names to it
df <- setNames(data.frame(t(read.table(text = paste0(all_vals[c(FALSE, TRUE)],
collapse = "\n"), sep = ",")), row.names = NULL), column_names)
#Since the values are read as character strings (factors before R 4.0), we use
#type.convert to convert each column to its natural type.
type.convert(df, as.is = TRUE)
# Date Equity Debt
#1 06-04-15 -237.79 16318.49
#2 07-04-15 -170.37 9543.76
#3 08-04-15 304.32 6421.67
#4 09-04-15 54.19 3590.47
#5 10-04-15 -130.50 2386.30
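Either way, the Date column is still text at this point. If real dates are needed, a minimal sketch follows; it assumes the values are day-month-year with a two-digit year, which the question does not state, and it uses a shortened copy of the data.

```r
# Shortened copy of the result above (illustrative only)
df <- data.frame(Date = c("06-04-15", "07-04-15", "08-04-15"),
                 Equity = c(-237.79, -170.37, 304.32),
                 stringsAsFactors = FALSE)

# Assumption: dd-mm-yy. Use "%m-%d-%y" instead if the file is month-first.
df$Date <- as.Date(df$Date, format = "%d-%m-%y")
str(df)
```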
I have a dataset of approximately 2 million rows and 45 columns. I would like to replace a list of values in one specific column within this dataset.
I have tried gsub but it is proving to take a prohibitive length of time. I need to perform 16 replacements.
To give you an example of what I've done :
setwd("C:/RStudio")
dat2 <- read.csv("2016 new.csv", stringsAsFactors=FALSE)
dat3 <- read.csv("2017 new.csv", stringsAsFactors=FALSE)
dat4 <- read.csv("2018 new.csv", stringsAsFactors=FALSE)
myfulldata <- rbind(dat2, dat3)
myfulldata <- rbind(myfulldata, dat4)
myfulldata <- myfulldata[, -c(1,5,10,11,12,13,15,20,21,22,41,42,43,44,48,50,51,52,59,61,62,64,65,66,67,68,69,70,71,72)]
gc()
myfulldata[is.na(myfulldata)] <- ""
gc()
myfulldata <- gsub("Text Being Replaced","CS1",myfulldata, fixed=TRUE)
I've bound several files, then removed the columns I don't need. The last line is where the string replacement begins. I only want to replace values in one specific column. With this in mind, is there something faster than gsub, or a better way to use it, so that I only replace values in column number 36, named Waypoint?
Many thanks,
Eoghan
Credit for this answer goes to phiver:
set.seed(123)
# data simulation
n = 10 #2e6
m = 45 #45
myfulldata <- as.data.frame(matrix(paste0("Text", 1:(n * m)), ncol = m), stringsAsFactors = FALSE)
names(myfulldata)[36] <- "Waypoint"
myfulldata$Waypoint[sample(seq.int(nrow(myfulldata)), 5)] <- "Text Being Replaced"
# data replacement, applied to the Waypoint column only
myfulldata$Waypoint <- gsub("Text Being Replaced", "CS1", myfulldata$Waypoint, fixed = TRUE)
myfulldata$Waypoint
# [1] "Text351" "Text352" "CS1" "CS1" "Text355" "CS1" "CS1" "CS1"
# "Text359" "Text360"
myfulldata
Output:
V33 V34 V35 Waypoint V37 V38
1 Text321 Text331 Text341 Text351 Text361 Text371
2 Text322 Text332 Text342 Text352 Text362 Text372
3 Text323 Text333 Text343 CS1 Text363 Text373
4 Text324 Text334 Text344 CS1 Text364 Text374
5 Text325 Text335 Text345 Text355 Text365 Text375
6 Text326 Text336 Text346 CS1 Text366 Text376
7 Text327 Text337 Text347 CS1 Text367 Text377
8 Text328 Text338 Text348 CS1 Text368 Text378
9 Text329 Text339 Text349 Text359 Text369 Text379
10 Text330 Text340 Text350 Text360 Text370 Text380
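Since the question mentions 16 replacements, one possible way to avoid 16 separate gsub calls is a named lookup vector with match(). This is only a sketch: it assumes each cell should be replaced when it equals an old value exactly (gsub would also replace substrings), and the second lookup entry is invented for illustration.

```r
# Old value = new value; "Another Old Label" is a made-up second entry
lookup <- c("Text Being Replaced" = "CS1",
            "Another Old Label"   = "CS2")

# Small stand-in for myfulldata$Waypoint
waypoint <- c("Text351", "Text Being Replaced", "Another Old Label", "Text354")

# Position of each cell in the lookup table, NA where there is no exact match
hit <- match(waypoint, names(lookup))

# Overwrite only the matched cells, in a single vectorised step
waypoint[!is.na(hit)] <- unname(lookup[hit[!is.na(hit)]])
waypoint
```

On the real data the same three lines would be applied to myfulldata$Waypoint, so the other 44 columns are never touched.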
I have a script that generates a data.table, and I want to divide some of its columns by other columns and store the results in new columns. Here's an example.
library(data.table)
dt <- data.table(V1 = c( 5.553465, 4.989168, 2.563682, 6.987971, 19.220936),
V2 = c(4.248335, 19.768138, 3.840026, 17.411003, 17.939368),
V3 = c(9.683953, 15.344424, 11.729091, 7.534210, 5.404000),
V4 = c(5.949093, 4.553023, 9.765656, 11.211069, 4.085964),
V5 = c(11.814671, 5.460138, 2.492230, 1.48792, 8.164280))
list1 <- list(c("V1", "V2", "V3"))
list2 <- list(c("V2", "V4", "V5"))
listRatio <- list(c("rat1","rat2","rat3"))
I have tried a variety of approaches to dividing the values in the list1 elements by the values in the list2 elements, unsuccessfully. Two are below; neither works.
dt[, (listRatio) := list1/list2]
dt[, c("rat1","rat2","rat3") := mapply(dt, function(x,y) x / y, x = c(V1, V2, V3), y = c(V2, V4, V5))]
We need to extract the character vector from each list with [[, fetch the corresponding columns as lists with mget, divide them pairwise with Map and `/`, and assign the result to the new column names (listRatio[[1]]).
dt[, (listRatio[[1]]) := Map(`/`, mget(list1[[1]]), mget(list2[[1]]))]
dt
# V1 V2 V3 V4 V5 rat1 rat2 rat3
#1: 5.553465 4.248335 9.683953 5.949093 11.814671 1.3072098 0.7141147 0.8196549
#2: 4.989168 19.768138 15.344424 4.553023 5.460138 0.2523843 4.3417611 2.8102630
#3: 2.563682 3.840026 11.729091 9.765656 2.492230 0.6676210 0.3932174 4.7062635
#4: 6.987971 17.411003 7.534210 11.211069 1.487920 0.4013537 1.5530190 5.0635854
#5: 19.220936 17.939368 5.404000 4.085964 8.164280 1.0714389 4.3904861 0.6619077
NOTE: As @Frank mentioned in the comments, it is better to store the variable names in a character vector rather than a list.
The same can be done with a plain data.frame:
dt <- data.frame(V1 = c( 5.553465, 4.989168, 2.563682, 6.987971, 19.220936),
V2 = c(4.248335, 19.768138, 3.840026, 17.411003, 17.939368),
V3 = c(9.683953, 15.344424, 11.729091, 7.534210, 5.404000),
V4 = c(5.949093, 4.553023, 9.765656, 11.211069, 4.085964),
V5 = c(11.814671, 5.460138, 2.492230, 1.48792, 8.164280))
list1 <- list(dt[,c("V1", "V2", "V3")])
list2 <- list(dt[,c("V2", "V4", "V5")])
dt$rat3 <- dt$rat2 <- dt$rat1 <- ""
dt[, c("rat1","rat2","rat3")] <- unlist(list1)/unlist(list2)
V1 V2 V3 V4 V5 rat1 rat2 rat3
1 5.553465 4.248335 9.683953 5.949093 11.814671 1.3072098 0.7141147 0.8196549
2 4.989168 19.768138 15.344424 4.553023 5.460138 0.2523843 4.3417611 2.8102630
3 2.563682 3.840026 11.729091 9.765656 2.492230 0.6676210 0.3932174 4.7062635
4 6.987971 17.411003 7.534210 11.211069 1.487920 0.4013537 1.5530190 5.0635854
5 19.220936 17.939368 5.404000 4.085964 8.164280 1.0714389 4.3904861 0.6619077
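Following the note about keeping the column names in plain character vectors, the data.frame version can be written without unlist() or pre-creating empty columns. A minimal base R sketch on the first two rows of the example (the names num, den and ratio are made up here):

```r
# First two rows of the example data
df <- data.frame(V1 = c(5.553465, 4.989168),
                 V2 = c(4.248335, 19.768138),
                 V3 = c(9.683953, 15.344424),
                 V4 = c(5.949093, 4.553023),
                 V5 = c(11.814671, 5.460138))

# Character vectors of names instead of lists
num   <- c("V1", "V2", "V3")      # numerators
den   <- c("V2", "V4", "V5")      # denominators
ratio <- c("rat1", "rat2", "rat3")

# Divide the columns pairwise; unname() drops the V1..V3 labels
# so the new columns are named by ratio only
df[ratio] <- unname(Map(`/`, df[num], df[den]))
df
```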
I have the following data frame
Loci p-value chromosome start end geneDescription
A 2.046584849E-2 1 98542 98699 tyrosine kinase
B 5.67849483E-20 2 8958437 8958437 endocytosis
...
However, when I want to print the data frame with the following code:
write.table(table,"~/Desktop/genes.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE, append = FALSE)
I get the following:
Loci p-value chromosome start end geneDescription
A 2.046584849E-20 1 98542 98699 tyrosine kinase
B 5.67849483E-20 2 8958437 8958437 endocytosis
I know that it has to do with the "\t", but can R automatically adjust the column widths when writing, so that the file looks like the original data frame above?
Thank you.
No. This is a tab-formatting issue, which can be partially solved by increasing the tab width in your editor. Otherwise, try padding the column names to a common length.
max.name <- max(sapply(colnames(table), nchar))
colnames(table) <- sapply(colnames(table), function(name) paste0(c(name, rep(" ", max.name - nchar(name))), collapse = ''))
Perhaps you're just looking for capture.output or sink.
In the following examples, replace x with an actual file name. This is just done for illustrative purposes.
x <- tempfile()
capture.output(mydf, file=x)
readLines(x)
# [1] " Loci p.value chromosome start end geneDescription"
# [2] "1 A 2.046585e-02 1 98542 98699 tyrosinekinase"
# [3] "2 B 5.678495e-20 2 8958437 8958437 endocytosis"
x <- tempfile()
sink(file = x)
mydf
sink()
readLines(x)
# [1] " Loci p.value chromosome start end geneDescription"
# [2] "1 A 2.046585e-02 1 98542 98699 tyrosinekinase"
# [3] "2 B 5.678495e-20 2 8958437 8958437 endocytosis"
The readLines step is just to show you what was written to your "file".
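If the row numbers that print() adds are not wanted, another option is to pad every column yourself and write the lines with writeLines(). This is a sketch on a cut-down version of the data frame, not the only way to do it; the two-space separator is an arbitrary choice.

```r
# Cut-down stand-in for the data frame in the question
mydf <- data.frame(Loci = c("A", "B"),
                   chromosome = c(1, 2),
                   start = c(98542, 8958437),
                   geneDescription = c("tyrosine kinase", "endocytosis"))

# For each column, pad the header and the values to one common width
padded <- lapply(names(mydf), function(nm) {
  format(c(nm, as.character(mydf[[nm]])))
})

# Paste the padded columns together row by row
rows_txt <- do.call(paste, c(padded, sep = "  "))

out <- tempfile()   # replace with an actual file name
writeLines(rows_txt, out)
readLines(out)
```

Because spaces are used instead of tabs, the columns line up in any editor with a fixed-width font.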
Very simple question. I am using an Excel sheet that has two rows of column headings; how can I merge these two heading rows into one? Further, these headings don't start at the top of the sheet.
Thus, I have DF1
Temp Press Reagent Yield A Conversion etc
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
and I want,
Temp degC Press bar Reagent /g Yield A % Conversion etc
1 2 3 4 5
6 7 8 9 10
Using colnames(DF1) returns the upper names, but getting the second line to merge with the upper one keeps eluding me.
Using your data, modified to quote text fields that contain the separator (get whatever tool you used to generate the file to quote text fields for you!)
txt <- "Temp Press Reagent 'Yield A' 'Conversion etc'
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
"
The snippet of code below reads the file in two steps.
First we read the data; skip = 2 skips the first two lines (the headings).
Next we read the file again, but only the first two lines. The result is processed by sapply(), which applies paste(x, collapse = " ") to each column of the labs data frame, and the pasted strings are assigned to the names of dat.
Here is the code:
dat <- read.table(text = txt, skip = 2)
labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
names(dat) <- sapply(labs, paste, collapse = " ")
dat
names(dat)
When run, the code produces:
> dat <- read.table(text = txt, skip = 2)
> labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
> names(dat) <- sapply(labs, paste, collapse = " ")
>
> dat
Temp degC Press bar Reagent /g Yield A % Conversion etc %
1 1 2 3 4 5
2 6 7 8 9 10
> names(dat)
[1] "Temp degC" "Press bar" "Reagent /g"
[4] "Yield A %" "Conversion etc %"
In your case, you'll want to modify the read.table() calls to point at the file on your file system, so use file = "foo.txt" in place of text = txt in the code chunk, where "foo.txt" is the name of your file.
Also, if these headings don't start at the top of the file, then increase skip to 2+n where n is the number of lines before the two header rows. You'll also need to add skip = n to the second read.table() call which generates labs, where n is again the number of lines before the header lines.
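The adjustment described above can be sketched as follows. The inline text, its single preamble line, and n = 1 are assumptions for illustration; the column set is also shortened.

```r
# One junk line before the two heading rows (made-up example)
txt <- "Report generated by instrument
Temp Press 'Yield A'
degC bar %
1 2 3
6 7 8
"
n <- 1  # number of lines before the heading rows

# Headings: skip the preamble, then read exactly two lines
labs <- read.table(text = txt, skip = n, nrows = 2, stringsAsFactors = FALSE)

# Data: skip the preamble and both heading rows
dat <- read.table(text = txt, skip = n + 2)

# Paste the two heading rows together, column by column
names(dat) <- sapply(labs, paste, collapse = " ")
dat
```

With a real file, text = txt becomes file = "foo.txt" in both calls, as described above.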
This should work. You only need to set stringsAsFactors=FALSE when reading the data.
data <- structure(list(Temp = c("degC", "1", "6"), Press = c("bar", "2",
"7"), Reagent = c("/g", "3", "8"), Yield.A = c("%", "4", "9"),
Conversion = c("%", "5", "10")), .Names = c("Temp", "Press",
"Reagent", "Yield.A", "Conversion"), class = "data.frame", row.names = c(NA,
-3L)) # Your data
colnames(data) <- paste(colnames(data), data[1,]) # Set new names
data <- data[-1,] # Remove first line
# Correct the classes (works only if all columns are numbers); check.names = FALSE keeps the spaces in the new names
data <- data.frame(apply(data,2,as.numeric), check.names = FALSE)
Just load your file with read.table(file, header = FALSE, stringsAsFactors = FALSE). Then you can use grep to find the rows where the headings occur.
df <- data.frame(V1=c(sample(10), "Temp", "degC"),
V2=c(sample(10), "Press", "bar"),
V3 = c(sample(10), "Reagent", "/g"),
V4 = c(sample(10), "Yield_A", "%"),
V5 = c(sample(10), "Conversion", "%"),
stringsAsFactors=FALSE)
idx <- unique(c(grep("Temp", df$V1), grep("degC", df$V1)))
df2 <- df[-(idx), ]
names(df2) <- sapply(df[idx, ], function(x) paste(x, collapse=" "))
Here, if you want, you can then convert all the columns to numeric as follows:
df2 <- as.data.frame(sapply(df2, as.numeric))