Read only lines from a large text file which fulfill a specific condition in R

I have a large file (data.txt, 35 GB) which has 3 columns.
Some example part of the file would look like the following:
... ... ...
5 701565 8679.56
8 1.16201e+006 3193.18
1 1.16173e+006 4457.85
14 1.16173e+006 4457.85
9 1.77942e+006 7208.73
4 1.78011e+006 8239.88
14 1.78019e+006 8195.57
9 2.00206e+006 8858.55
4 2.00199e+006 7924
... ... ...
I want to plot a histogram of the 3rd column for the rows where the values in the 2nd column are between 0 and 50'000.
Then I want to do another histogram where the values of the 2nd column are between 50'000 and 100'000, and so on and so forth.
I don't know how to load/read only the data I need at a time. Any help would be appreciated!
If I should use the sqldf package, then my question would be how I can say that the value of the 2nd column should be smaller than, e.g., 50'000.
The difference from How do i read only lines that fulfil a condition from a csv into R? is that I don't have any column names, so I cannot do what they propose in their solution:
sql = "select * from file where Sepal.Length > 5"

I think recent versions of readr support this sort of thing. The following is adapted from the help for readr::read_csv_chunked:
library(readr)
# keep only the rows whose 2nd column (X2) falls in the current bin
f <- function(x, pos) subset(x, X2 > 0 & X2 < 50000)
df <- read_csv_chunked(
  'test.csv',
  DataFrameCallback$new(f),
  chunk_size = 100000,
  col_names = FALSE
)
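Since data.txt is space-delimited rather than comma-separated, a minimal sketch using readr::read_delim_chunked may fit better; the delimiter and the bin edges here are assumptions to adapt to the real file:
library(readr)

# read the 35 GB file in chunks, keeping only rows with column 2 in [0, 50000)
f <- function(x, pos) subset(x, X2 >= 0 & X2 < 50000)
df <- read_delim_chunked(
  'data.txt',
  DataFrameCallback$new(f),
  delim = " ",
  chunk_size = 100000,
  col_names = FALSE
)
hist(df$X3)  # histogram of the 3rd column for this bin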


Is there an R function equivalent to Excel's $ for "keep reference cell constant" [duplicate]

This question already has answers here: Divide each data frame row by vector in R (5 answers). Closed 2 years ago.
I'm new to R and I've done my best googling for the answer to the question below, but nothing has come up so far.
In Excel you can keep a specific column or row constant when using a reference by putting $ before the row number or column letter. This is handy when performing operations across many cells when all cells are referring to something in a single other cell. For example, take a dataset with grades in a course: Row 1 has the total number of points per class assignment (each column is an assignment), and Rows 2:31 are the raw scores for each of 30 students. In Excel, to calculate percentage correct, I take each student's score for that assignment and refer it to the first row, holding row constant in the reference so I can drag down and apply that operation to all 30 rows below Row 1. Most importantly, in Excel I can also drag right to do this across all columns, without having to type a new operation.
What is the most efficient way to perform this operation--holding a reference row constant while performing an operation to all other rows, then applying this across columns while still holding the reference row constant--in R? So far I had to slice the reference row to a new dataframe, remove that row from the original dataframe, then type one operation per column while manually going back to the new dataframe to look up the reference number to apply for that column's operation. See my super-tedious code below.
For reference, each column is an assignment, and Row 1 had the number of points possible for that assignment. All subsequent rows were individual students and their grades.
# Extract number of points possible
outof <- slice(grades, 1)
# Now remove that row (Row 1)
grades <- grades[-c(1),]
# Turn number correct into percentage. The divided by
# number is from the sliced Row 1, which I had to
# look up and type one-by-one. I'm hoping there is
# code to do this automatically in R.
grades$ExamFinal <- (grades$ExamFinal / 34) * 100
grades$Exam3 <- (grades$Exam3 / 26) * 100
grades$Exam4 <- (grades$Exam4 / 31) * 100
grades$q1.1 <- grades$q1.1 / 6
grades$q1.2 <- grades$q1.2 / 10
grades$q1.3 <- grades$q1.3 / 6
grades$q2.2 <- grades$q2.2 / 3
grades$q2.4 <- grades$q2.4 / 12
grades$q3.1 <- grades$q3.1 / 9
grades$q3.2 <- grades$q3.2 / 8
grades$q3.3 <- grades$q3.3 / 12
grades$q4.1 <- grades$q4.1 / 13
grades$q4.2 <- grades$q4.2 / 5
grades$q6.1 <- grades$q6.1 / 5
grades$q6.2 <- grades$q6.2 / 6
grades$q6.3 <- grades$q6.3 / 11
grades$q7.1 <- grades$q7.1 / 7
grades$q7.2 <- grades$q7.2 / 8
grades$q8.1 <- grades$q8.1 / 7
grades$q8.3 <- grades$q8.3 / 13
grades$q9.2 <- grades$q9.2 / 13
grades$q10.1 <- grades$q10.1 / 8
grades$q12.1 <- grades$q12.1 / 12
You can use sweep
100*sweep(grades, 2, outof, "/")
# ExamFinal EXam3 EXam4
#1 100.00 76.92 32.26
#2 88.24 84.62 64.52
#3 29.41 100.00 96.77
Data:
grades
ExamFinal EXam3 EXam4
1 34 20 10
2 30 22 20
3 10 26 30
outof
[1] 34 26 31
grades <- data.frame(ExamFinal=c(34,30,10),
EXam3=c(20,22,26),
EXam4=c(10,20,30))
outof <- c(34,26,31)
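Note that outof above is a plain numeric vector. If, as in the question, it was produced by slice(grades, 1), it is a one-row data frame; a hedged variant flattens it first:
# assuming outof came from slice(grades, 1) and is a one-row data frame
outof_vec <- unlist(outof)
pct <- 100 * sweep(grades, 2, outof_vec, "/")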
You can use mapply on the original grades dataframe (don't remove the first row) to divide rows by the first row. Then convert the result back to a dataframe.
as.data.frame(mapply("/", grades[2:31, ], grades[1, ]))
The easiest way is to use some type of loop. In this case I am using the sapply function to divide all of the elements in each column by the corresponding total score.
# Example data
outof <- data.frame(q1 = c(3), q2 = c(5))
grades <- data.frame(q1 = c(1, 2, 3), q2 = c(4, 4, 5))

answermatrix <- sapply(1:ncol(grades), function(i) {
  # grades[, i] / outof[i] # use this if "outof" is a vector
  grades[, i] / outof[, i]
})
answermatrix
A loop would probably be your best bet.
First extract the maximum number of points possible, listed in the first row, then use that number to calculate the percentage in the remaining rows, per column:
a <- df[1, ]                        # total points per assignment (row 1)
for (i in 1:ncol(df)) {
  j <- 2                            # start at the first student row
  while (j <= nrow(df)) {
    b <- df[j, i]                   # one student's raw score
    df[j, i] <- (b / a[[i]]) * 100  # replace it with the percentage
    j <- j + 1                      # go to the next row
  }
}
The only drawback to this approach is that data frames modified inside a function aren't copied to the global environment, but that can be fixed by wrapping the loop in a function like so:
f1 <- function(x, y) {  # x: the data frame; y: the name you want the completed data frame to have
  a <- x[1, ]                         # total points per assignment
  for (i in 1:ncol(x)) {
    j <- 2
    while (j <= nrow(x)) {
      b <- x[j, i]
      x[j, i] <- (b / a[[i]]) * 100
      j <- j + 1
    }
  }
  arg_name <- deparse(substitute(y))       # gets the argument name
  assign(arg_name, x, envir = .GlobalEnv)  # produces the global data frame
}
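Hypothetical usage: the second argument is never evaluated and only supplies the name under which the result is written, so it need not exist beforehand.
f1(grades, grades_pct)  # creates grades_pct in the global environment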

R: combine two csv files with spark

I have two very large csv files and I'm using spark with R. My first file was uploaded this way:
data <- spark_read_csv(sc, "D:/my_file.csv")
After working with the first file I have these variables:
Name | Number
The second csv file that has these variables:
Name | Number | Surname
You can also see that the second file has one more variable than the first. I would like to ignore the Surname column of the second file when loading with spark. How can I combine the two files so that the second is the continuation of the first?
From what I gather, you want to get rid of the Surname column in your second dataframe and make a union with the first.
spark_read_csv seems to come from sparklyr, which I have never used, but in plain SparkR we could read the data as below. I am pretty sure the rest of the code would work the same way regardless of how the data is read.
> d1 = read.df(".../f1.csv", "csv", header="true")
> head(d1)
Name Number
1 x 7
2 y 8
> d2 = read.df(".../f2.csv", "csv", header="true")
> head(d2)
Name Number Surname
1 z 5 zz
2 w 6 ww
Then, it is pretty straightforward:
> trimmed_d2 = select(d2, "Name", "Number")
> all_the_data = union(d1, trimmed_d2)
> head(all_the_data)
Name Number
1 x 7
2 y 8
3 z 5
4 w 6
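For what it's worth, a sketch of the same steps in sparklyr itself (the package the question uses): the second file path is a placeholder, and sdf_bind_rows() is used on the assumption that a plain row-stacking union is what's wanted.
library(sparklyr)
library(dplyr)

d1 <- spark_read_csv(sc, "d1", path = "D:/my_file.csv")        # first file
d2 <- spark_read_csv(sc, "d2", path = "D:/my_second_file.csv") # hypothetical path

# drop Surname, then stack the rows of the second table under the first
all_the_data <- sdf_bind_rows(d1, select(d2, Name, Number))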

How to run Chisq test for multiple rows FASTER in R?

I have managed to do the chisq-test using a loop in R, but it is very slow for large data, and I wonder if you could help me do it faster with something like dplyr. I've tried with dplyr, but I kept getting an error and I am not sure why.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is to run a chisq-test between each row of df and cs, and get back the statistics and p-values along with the row names.
Here is my code for the loop:
value = matrix(nrow = ncol(df), ncol = 3)
for (i in 1:ncol(df)) {
  tst <- chisq.test(df[i, ], cs)
  value[i, 1] <- tst$p.value
  value[i, 2] <- tst$statistic
  value[i, 3] <- rownames(df)[i]
}
Thanks for your help.
I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w) would have helped greatly. Even better would have been an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
value = data.frame(p_val = NA, stat = NA, exprs = rownames(df))
for (i in 1:ncol(df)) {
  # tbl <- table(df[i, ], cs) ### No use seen for this
  # The indexing in the next line compares columns to the standard `cs`.
  tst <- chisq.test(df[, i], cs)  # chisq.test is not vectorized; some sort of loop is needed
  value[i, 1:2] <- tst[c('p.value', 'statistic')]  # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name, since there is also a df function) to Biobase::exprs(PANCAN_w).
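If the explicit loop feels clunky, the same column-by-column tests can be collected with sapply; this is a sketch of the same idea in compact form, not a true vectorization of chisq.test (each test still runs one at a time), assuming as above that the columns of df are tested against cs:
res <- t(sapply(seq_len(ncol(df)), function(i) {
  tst <- chisq.test(df[, i], cs)
  c(p.value = tst$p.value, statistic = unname(tst$statistic))
}))
res <- data.frame(column = colnames(df), res)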

Filter rows based on ID over multiple data frames with for loop

How can I filter 180 .csv files from my global directory based on a matching ID in another df named 'Camera' in R? When I tried to incorporate my one-by-one file filtering code (see step 3b) into a for loop (see step 3a), I got the error:
Error in paste("i")$SegmentID : $ operator is invalid for atomic vectors.
I'm quite new to for loops, so I really appreciate your help! All 180 files have a unique name and differ in length, but they have the same column structure & names. They look like:
df 'File1':
ID Speed Location
1  30    4
2  35    5
3  40    6
4  30    7
5  35    8

df 'Camera':
ID Time
1  10
3  11
5  12

Filtered df 'File1':
ID Speed Location
1  30    4
3  40    6
5  35    8
These are some samples of my code:
# STEP 1: read files
filenames <- list.files(path = "06-06-2017_0900-1200uur",
                        pattern = "*.csv")

# STEP 2: import files
for (i in filenames) {
  filepath <- file.path("06-06-2017_0900-1200uur", paste(i))
  assign(i, read.csv2(filepath, header = TRUE, skip = "1"))
}

# STEP 3a: delete rows that do not match ID in df 'Cameras'
for (i in filesnames) {
  paste("i") <- paste("i")[paste("i")$ID %in% Cameras$ID, ]
}

# STEP 3b: filtering one by one
File1 <- File1[File1$ID %in% Camera$ID, ]
Here is an approach that makes use of lists (generally a better way to go). First, use the full.names argument in list.files() so the returned names include the path:
fns <- list.files(
  path = "06-06-2017_0900-1200uur",
  pattern = "*.csv",
  full.names = TRUE
)
Now you have a list of your filenames. Next, apply read.csv2 to each of the filenames in your list:
dat <- lapply(fns, read.csv2, header = TRUE, skip = 1)
Now you have a list of data frames (the output from calling read.csv2). Finally, apply subset() to each data frame to keep only those rows whose ID matches an ID in Camera:
out <- lapply(dat, function(x) subset(x, ID %in% Camera$ID))
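If a single combined data frame is wanted rather than a list, the filtered pieces can then be stacked (assuming, as stated, that all files share the same columns):
out_all <- do.call(rbind, out)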
If I understand the question, the output should be a data frame from file1 where the ID for all rows matches one of the rows in the Camera file.
This is easily accomplished with the sqldf() package and structured query language.
rawFile1 <- "ID Speed Location
1 30 4
2 35 5
3 40 6
4 30 7
5 35 8
"
rawCamera <- " ID Time
1 10
3 11
5 12
"
file1 <- read.table(textConnection(rawFile1),header=TRUE)
Camera <- read.table(textConnection(rawCamera),header=TRUE)
library(sqldf)
sqlStmt <- "select * from file1 where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")
...and the output:
ID Speed Location
1 1 30 4
2 3 40 6
3 5 35 8
To extend this logic to a number of csv files, first we obtain the list of files from the subdirectory where they are stored using the list.files() function. For example, if the files were in a data subdirectory of the R working directory, one might use the following function call.
theFiles <- list.files("./data/",".csv",full.names=TRUE)
We can read these files with read.table() to create a list() of data frames.
theData <- lapply(theFiles,function(x) {
read.table(x,header=TRUE)})
To combine the files into a single data frame, we execute do.call().
combinedData <- do.call(rbind,theData)
Now we can read the camera data and use sqldf to keep only the IDs matching the camera data.
Camera <- read.table(...,header=TRUE)
library(sqldf)
sqlStmt <- "select * from combinedData where ID in(select ID from Camera)"
sqldf(sqlStmt,drv="SQLite")

Average across some rows in R

I have not found a way to take an average across SOME columns in R when working with a data frame table. Basically, I want to take the average of the 3 controls (CTR_R1+CTR_R2+CTR_R3) and insert that value as another column right after CTR_R3 (see below). The same for the TRT.
Is there a way to take the average and insert it in a specific location?
GeneID|CTR_R1|CTR_R2|CTR_R3|CTR_AVG|TRT_R1| TRT_R2| TRT_R3|TRT_AVG|pValue
How about
df$CTR_AVG <- rowMeans(df[,2:4])
df$TRT_AVG <- rowMeans(df[,6:8])
This code should work for you, if your data.frame is named df:
df$CTR_AVG <- ( df$CTR_R1 + df$CTR_R2 + df$CTR_R3 ) / 3
That is assuming that the CTR_AVG column already exists, as shown in your question. If it does not, the code will put the column at the end of the data.frame. To move it to the right spot, you will need to select the columns in the correct order, like so:
df[, c('GeneID', 'CTR_R1', 'CTR_R2', 'CTR_R3', 'CTR_AVG', 'TRT_R1', 'TRT_R2', 'TRT_R3', 'TRT_AVG', 'pValue')]
The code below should work even if there are many CTR or TRT columns (e.g. hundreds), but I am guessing @beginneR's solution will be faster.
indx <- grep("^CTR", colnames(df1), value=TRUE)
indxT <- grep("^TRT", colnames(df1), value=TRUE)
df1[, c('CTR_Avg', 'TRT_Avg')] <- lapply(list(indx, indxT),
                                         function(x) Reduce(`+`, df1[, x]) / length(x))
or you can use rowMeans in the above step.
df2 <- df1[,c('GeneID', indx, 'CTR_Avg', indxT, 'TRT_Avg', 'pValue')]
head(df2,2)
# GeneID CTR_R1 CTR_R2 CTR_R3 CTR_Avg TRT_R1 TRT_R2 TRT_R3 TRT_Avg pValue
#1 1 6 2 10 6.000000 10 11 15 12 0.091
#2 2 5 12 8 8.333333 5 3 13 7 0.051
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(1:20,20*6, replace=TRUE), ncol=6))
colnames(df1) <- c("CTR_R1", "CTR_R2", "CTR_R3", "TRT_R1", "TRT_R2", "TRT_R3")
df1 <- cbind(GeneID = 1:20, df1,
             pValue = sample(seq(0.001, 0.10, by = 0.01), 20, replace = TRUE))
Make some dummy data:
df <- data.frame(CTR_R1 = 1:10, CTR_R2 = 1:10, CTR_R3 = 1:10, somethingelse = 1:10)
Get a new column:
df$CTR_AVG <- apply(df[c("CTR_R1", "CTR_R2", "CTR_R3")], 1, mean)
Thanks so much for your replies. I am sorry I did not phrase my original question better: I meant to ask how to write one script that takes the averages and places them in the right spots. My table does not have the "CTR_AVG" or "TRT_AVG" columns to begin with.
I was wondering if I could do it more 'elegantly' than what I did below (which works too).
Many thanks.
names(edgeR_table)
# "GeneID" "CTR_R1" "CTR_R2" "CTR_R3" "TRT_R1" "TRT_R2" "TRT_R3" "logFC" "logCPM" "LR" "PValue" "FDR"

edgeR_table$CTR_AVG <- rowMeans(edgeR_table[, 2:4])
edgeR_table$TRT_AVG <- rowMeans(edgeR_table[, 5:7])
edgeR_table <- edgeR_table[, c(1, 2, 3, 4, 13, 5, 6, 7, 14, 8, 9, 10, 11, 12)]
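For the record, a more compact sketch of the same add-and-reorder using dplyr (assumes dplyr >= 1.0.0 for across() and relocate(), and the column names printed above):
library(dplyr)

edgeR_table <- edgeR_table %>%
  mutate(CTR_AVG = rowMeans(across(CTR_R1:CTR_R3)),
         TRT_AVG = rowMeans(across(TRT_R1:TRT_R3))) %>%
  relocate(CTR_AVG, .after = CTR_R3) %>%   # drop each average in right after its group
  relocate(TRT_AVG, .after = TRT_R3)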
