How Can I Compile Counts of Bigrams from Several .csv Files into One .csv File using R

I have a series of .csv files labeled Trigrams1A, Trigrams1B, ... Trigrams66A, Trigrams66B consisting of the counts of trigrams from a text file. My goal is to compile counts of specific trigrams from each of these files into one table. My thought was to try to create a row with the specific counts in each file and then stack the rows, but it does not seem to be working. A sample code follows:
z <- 1
while (z <= 66) {
document <- paste("Trigrams", z, "B", ".csv", sep="")
mytable <- read.csv(document, header = T, sep=",")
a <- length(mytable[which(mytable=="came here from")])
b <- length(mytable[which(mytable=="go home with")])
c <- length(mytable[which(mytable=="going to split")])
d <- length(mytable[which(mytable=="i m gonna")])
e <- length(mytable[which(mytable=="a lot of")])
f <- length(mytable[which(mytable=="lot of money")])
g <- length(mytable[which(mytable=="all the way")])
h <- length(mytable[which(mytable=="i m going")])
i <- length(mytable[which(mytable=="i promise you")])
j <- length(mytable[which(mytable=="i want to")])
k <- length(mytable[which(mytable=="i trust you")])
rows <- c(a,c,d,e,f,g,h,i,j,k)
columns <- rbind(rows)
z <- z+1
}
How can I effectively get the counts from the tables, combine them, and then write this combination into a new table?
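A note on why the loop above does not accumulate anything: `columns <- rbind(rows)` is overwritten on every pass, and `rows` leaves out `b`. One possible approach, as a minimal sketch only, is to look up the target trigrams in each file and stack one row per file. The column names `trigram` and `count`, the temporary directory, and the helper `count_row` are assumptions made for illustration; the real files may be laid out differently.

```r
# Hypothetical setup: two small trigram files in a temp directory so the
# sketch runs on its own. Column names "trigram" and "count" are assumed.
tmp_dir <- tempfile("trigrams")
dir.create(tmp_dir)
write.csv(data.frame(trigram = c("a lot of", "i want to"), count = c(3, 5)),
          file.path(tmp_dir, "Trigrams1B.csv"), row.names = FALSE)
write.csv(data.frame(trigram = c("a lot of", "all the way"), count = c(2, 7)),
          file.path(tmp_dir, "Trigrams2B.csv"), row.names = FALSE)

targets <- c("a lot of", "all the way", "i want to")
files <- file.path(tmp_dir, paste0("Trigrams", 1:2, "B.csv"))

# Return one count per target trigram; trigrams absent from a file get 0
count_row <- function(f) {
  tab <- read.csv(f, stringsAsFactors = FALSE)
  counts <- tab$count[match(targets, tab$trigram)]
  counts[is.na(counts)] <- 0
  counts
}

# sapply() yields one column per file, so transpose to get one row per file
result <- t(sapply(files, count_row))
dimnames(result) <- list(basename(files), targets)

# Write the combined table to a single .csv file
write.csv(result, file.path(tmp_dir, "combined.csv"))
```

For the real data, `files` would be built with something like `paste0("Trigrams", 1:66, "B.csv")` and `targets` would hold the eleven trigrams from the question.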

Related

Appending every nth column using loop in R

I have a data frame which consists of paired columns of ratings given by participants and the reasons for giving their ratings. I would like to insert a blank column after each pair of columns, so that after column 1 and 2 there's a new column. I managed to do this manually by creating a vector, inserting them all at the end, and then reorganizing myself. Here's the code for that so it is clear what I am trying to achieve:
v <- rep(NA, 184)
Scheme1$Code1.1 <- v
Scheme1$Code2.1 <- v
Scheme1$Code1.2 <- v
Scheme1$Code2.2 <- v
Scheme1$Code1.3 <- v
Scheme1$Code2.3 <- v
Scheme1$Code1.4 <- v
Scheme1$Code2.4 <- v
Scheme1$Code1.5 <- v
Scheme1$Code2.5 <- v
Scheme1$Code1.6 <- v
Scheme1$Code2.6<- v
Scheme1$Code1.7 <- v
Scheme1$Code2.7 <- v
# Reorganize
Scheme1 <- Scheme1[,c(1,2,15,16,3,4,17,18,5,6,19,20,7,8,21,22,9,10,23,24
,11,12,25,26,13,14,27,28)]
I wanted to see how this could be achieved by using a for loop.
Thanks!
Based on the description, maybe this helps:
lst1 <- split.default(Scheme1, as.integer(gl(ncol(Scheme1), 2, ncol(Scheme1))))
do.call(cbind, unname(Map(function(x, i) {x[paste0(names(x), ".", i)] <- NA;x}, lst1, names(lst1))))
Data:
set.seed(24)
Scheme1 <- as.data.frame(matrix(rnorm(14 * 5), ncol = 14))
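Since the question asked specifically about a for loop, here is a minimal sketch of that route on the same sample data. It assumes the columns come in consecutive pairs and reuses the `Code1.i` / `Code2.i` naming from the question; `pieces` and `out` are just illustrative names.

```r
set.seed(24)
Scheme1 <- as.data.frame(matrix(rnorm(14 * 5), ncol = 14))

pieces <- list()
# Step through the first column of each pair: 1, 3, 5, ...
for (k in seq(1, ncol(Scheme1), by = 2)) {
  i <- (k + 1) / 2                       # index of the pair (1, 2, 3, ...)
  pair <- Scheme1[, k:(k + 1), drop = FALSE]
  pair[[paste0("Code1.", i)]] <- NA      # blank coding column for the rating
  pair[[paste0("Code2.", i)]] <- NA      # blank coding column for the reason
  pieces[[i]] <- pair
}

# Bind the pieces side by side; the new columns sit right after each pair
out <- do.call(cbind, pieces)
names(out)
```

Building the pieces in order avoids the manual reordering step from the question, because each blank column is created next to the pair it belongs to.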

subsetting a data.frame using a for loop

I have a data.frame, and I want to subset it every 10 rows, apply a function to each subset, save the resulting object, and remove the previous object. Here is what I have so far:
L3 <- LETTERS[1:20]
df <- data.frame(1:391, "col", sample(L3, 391, replace = TRUE))
names(df) <- c("a", "b", "c")
b <- seq(from=1, to=391, by=10)
nsamp <- 0
for(i in seq_along(b)){
a <- i+1
nsamp <- nsamp+1
df_10 <- df[b[nsamp]:b[a], ]
res <- lapply(seq_along(df_10$b), function(x){...})
saveRDS(res, file="res.rds")
rm(res)
}
My problem is that the for loop crashes when it reaches the last element of my sequence b.
When partitioning data, split is your friend. It will create a list with each data subset as an item which is then easy to iterate over.
dfs = split(df, (seq_len(nrow(df)) - 1) %/% 10)
Then your for loop can be simplified to something like this (untested... I'm not exactly sure what you're doing because example data seems to switch from df to sc2_10 and I only hope your column named b is different from your vector named b):
for(i in seq_along(dfs)){
res <- lapply(seq_along(dfs[[i]]$b), function(x){...})
saveRDS(res, file = sprintf("res_%s.rds", i))
rm(res)
}
I also modified your save file name so that you aren't overwriting the same file every time.
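For reference, the original loop fails on the last pass because `b[a]` indexes one past the end of `b` and returns NA, so `b[nsamp]:b[a]` cannot be evaluated. Below is a self-contained sketch of the split-based version. The per-chunk computation is a placeholder (it only counts rows) because the real function was elided in the question, and `out_dir` is a hypothetical temporary directory.

```r
set.seed(1)
df <- data.frame(a = 1:391, b = "col",
                 c = sample(LETTERS[1:20], 391, replace = TRUE))

# Subtracting 1 before the integer division puts rows 1-10 in the first
# chunk, rows 11-20 in the second, and so on
chunks <- split(df, (seq_len(nrow(df)) - 1) %/% 10)

out_dir <- tempfile("chunks")   # hypothetical output location
dir.create(out_dir)

for (i in seq_along(chunks)) {
  res <- nrow(chunks[[i]])      # placeholder for the real per-chunk work
  saveRDS(res, file.path(out_dir, sprintf("res_%s.rds", i)))
  rm(res)
}

length(chunks)   # number of chunks, including the short final one
```

Because the loop runs over the chunks themselves, there is no end index to compute and the final, shorter chunk is handled like any other.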

Compute the average for a 2D matrix in a loop

I wrote a script that reads 15 data files, computes the difference between pairs of files, and writes the results to 5 different files. These 5 files are matrices of 10x259 values.
I need to make a matrix in which each element is the average of the elements at the same position in those 5 matrices. I can’t make the average work.
I tried the classic way of “sum=sum+i” inside the loop, but R gives an error for the recursive sum.
I tried making a 3-dimensional array and filling it with 5 “pages” containing 2D matrices, but I get errors for trying to fill the array with content of another size.
I tried with rowMeans(), but can’t get it to do the job as I need to get the mean of 5 iterations of the same variable.
The only way I could do it is reading all the resulting files again into separate variables, adding them and dividing by 5. But this only works for a few files. I will need to extend to many files so I need to make it work in a loop somehow.
Can anyone give me a better idea?
I’m new to R. The script is probably very inefficient, but it only needs to do the job.
Below is my code:
MAM <- c("M","N","O","P","R")
S <-c("a","b","c","d","e")
T<-c("a","b","c","d","e")
V<-c("a","b","c","d","e")
Min2000<- array(3,dim=c(259,10,5))
Min2010<- array(5,dim=c(259,10,5))
# this will be done 5 times
for (i in 1:5) {
# preparing file names to be read
S[i] <- paste(MAM [i],"2000.txt",sep="_")
T[i] <- paste(MAM [i],"2150.txt",sep="_")
V[i] <- paste(MAM [i],"2250.txt",sep="_")
# import data from the files
file1 <- read.table(S[i], header=TRUE,sep="\t")
file2 <- read.table(T[i], header=TRUE,sep="\t")
file3 <- read.table(V[i], header=TRUE,sep="\t")
# delete the first column
file1[,2:11]
file2[,2:11]
file3[,2:11]
file1a <- file1[,c(2:11)]
file2a <- file2[,c(2:11)]
file3a <- file3[,c(2:11)]
# compute data
Min2000<- (file2a-file1a)/file1a
Min2010<- (file3a-file1a)/file1a
colMeans(Min2000)
# cub[,,i] = Min2000 # doesn't work
# rowMeans(datamonth, dims = 2) # doesn't work
}
Try this; see the comments in the code for an explanation:
# load library
library(dplyr)
# Create vectors of names to be read in
MAM <- c("M","N","O","P","R")
S.Name <- paste(MAM,"2000.txt",sep="_")
T.Name <- paste(MAM,"2150.txt",sep="_")
V.Name <- paste(MAM,"2250.txt",sep="_")
# Read in data into list and drop first column
# header = TRUE is spelled out because T is reassigned to a list below
S = lapply(S.Name, read.table, header = TRUE, sep = "\t") %>% lapply(function(x) x[,-1])
T = lapply(T.Name, read.table, header = TRUE, sep = "\t") %>% lapply(function(x) x[,-1])
V = lapply(V.Name, read.table, header = TRUE, sep = "\t") %>% lapply(function(x) x[,-1])
# Sum up the files, then divide to find mean.
# This does (matrix1 + matrix2 + matrix3 + matrix4 + matrix5) / # of matrices
S = S %>% {Reduce("+", .) / length(S)}
T = T %>% {Reduce("+", .) / length(T)}
V = V %>% {Reduce("+", .) / length(V)}
Here are two possibilities:
Min2000<- array(NA,dim=c(259,10,5))
Min2010<- array(NA,dim=c(259,10,5))
# this will be done 5 times
for (i in 1:5) {
# import data from the files
file1 <- matrix(sample(1:10,2849,replace=TRUE),259,11)
file2 <- matrix(sample(1:10,2849,replace=TRUE),259,11)
file3 <- matrix(sample(1:10,2849,replace=TRUE),259,11)
# delete the first column
file1a <- file1[,-1]
file2a <- file2[,-1]
file3a <- file3[,-1]
# compute data
Min2000[,,i] <- (file2a-file1a)/file1a
Min2010[,,i] <- (file3a-file1a)/file1a
}
A2 <- apply(Min2000,1:2,"mean")
A3 <- apply(Min2010,1:2,"mean")
The second possibility keeps a running sum instead of a 3D array:
Sum2 <- matrix(0,259,10)
Sum3 <- matrix(0,259,10)
# this will be done 5 times
for (i in 1:5) {
# import data from the files
file1 <- matrix(sample(1:10,2849,replace=TRUE),259,11)
file2 <- matrix(sample(1:10,2849,replace=TRUE),259,11)
file3 <- matrix(sample(1:10,2849,replace=TRUE),259,11)
# delete the first column
file1a <- file1[,-1]
file2a <- file2[,-1]
file3a <- file3[,-1]
# compute data
Sum2 <- Sum2 + (file2a-file1a)/file1a
Sum3 <- Sum3 + (file3a-file1a)/file1a
}
B2 <- Sum2/5
B3 <- Sum3/5
I replaced the files with random matrices.
The results are almost identical:
> max((A2-B2)^2)
[1] 7.888609e-31
> max((A3-B3)^2)
[1] 7.888609e-31
Thank you both for your answers.
mra68, I tried both solutions. The first one (with A2 <- apply(Min2000,1:2,"mean") and A3 <- apply(Min2010,1:2,"mean")) gives me an error.
The second solution works perfectly (with: B2 <- Sum2/5, B3 <- Sum3/5). I obtained the mean that I needed.
Vlo, I tried the library but with your code I can make only the sum between the 10 columns and the mean of these columns. I wanted the mean between the 5 excels, not between the 10 columns. The code does more than I managed to do myself in the beginning, but it’s not exactly what I needed.
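For anyone landing here later, the element-wise mean over several equally sized matrices can be sketched in two equivalent ways. The random matrices below stand in for the real files, and `mats`, `mean_reduce` and `mean_array` are illustrative names. If the inputs come from read.table they are data frames, so converting them with as.matrix() before filling a 3D array may be what the first solution above needs; that is a guess, since the error message was not posted.

```r
set.seed(42)
# Stand-ins for the five 259 x 10 result matrices
mats <- lapply(1:5, function(i) matrix(runif(259 * 10), nrow = 259, ncol = 10))

# Option 1: add the matrices element by element, then divide by their number
mean_reduce <- Reduce(`+`, mats) / length(mats)

# Option 2: stack into a 259 x 10 x 5 array and average over the third dimension
arr <- array(unlist(mats), dim = c(259, 10, length(mats)))
mean_array <- rowMeans(arr, dims = 2)

# Both give a 259 x 10 matrix of position-wise means
all.equal(mean_reduce, mean_array)
```

Either form scales to more files by extending the list, with no need to read the results back into separate variables.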

Filtering multiple csv files while importing into data frame

I have a large number of csv files that I want to read into R. All the column headings in the csvs are the same. But I want to import only those rows from each file into the data frame for which a variable is within a given range (above min threshold & below max threshold), e.g.
v1 v2 v3
1 x q 2
2 c w 4
3 v e 5
4 b r 7
Filtering for v3 (v3>2 & v3<7) should result in:
v1 v2 v3
1 c w 4
2 v e 5
So far I import all the data from all csvs into one data frame and then do the filtering:
#Read the data files
fileNames <- list.files(path = workDir)
mergedFiles <- do.call("rbind", sapply(fileNames, read.csv, simplify = FALSE))
fileID <- row.names(mergedFiles)
fileID <- gsub(".csv.*", "", fileID)
#Combining data with file IDs
combFiles=cbind(fileID, mergedFiles)
#Filtering the data according to criteria
resultFile <- combFiles[combFiles$v3 > min & combFiles$v3 < max, ]
I would rather apply the filter while importing each single csv file into the data frame. I assume a for loop would be the best way of doing it, but I am not sure how.
I would appreciate any suggestion.
Edit
After testing the suggestion from mnel, which worked, I ended up with a different solution:
fileNames = list.files(path = workDir)
mzList = list()
for(i in 1:length(fileNames)){
tempData = read.csv(fileNames[i])
mz.idx = which(tempData[ ,1] > minMZ & tempData[ ,1] < maxMZ)
mz1 = tempData[mz.idx, ]
mzList[[i]] = data.frame(mz1, filename = rep(fileNames[i], length(mz.idx)))
}
resultFile = do.call("rbind", mzList)
Thanks for all the suggestions!
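The same filter-on-import idea can also be written without an explicit loop. This is only a sketch in base R: the two small files, the directory `work_dir`, and the helper `read_filtered` are made up for illustration, and the thresholds 2 and 7 are taken from the example in the question.

```r
# Hypothetical sample files so the sketch runs on its own
work_dir <- tempfile("csvs")
dir.create(work_dir)
write.csv(data.frame(v1 = c("x", "c", "v", "b"),
                     v2 = c("q", "w", "e", "r"),
                     v3 = c(2, 4, 5, 7)),
          file.path(work_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(v1 = c("k", "m"), v2 = c("t", "y"), v3 = c(6, 9)),
          file.path(work_dir, "b.csv"), row.names = FALSE)

# Read one file, keep only rows strictly inside (lower, upper), tag the source
read_filtered <- function(f, lower, upper) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  d <- d[d$v3 > lower & d$v3 < upper, , drop = FALSE]
  d$fileID <- rep(sub("\\.csv$", "", basename(f)), nrow(d))
  d
}

# full.names = TRUE avoids depending on the current working directory
files <- list.files(work_dir, pattern = "\\.csv$", full.names = TRUE)
result <- do.call(rbind, lapply(files, read_filtered, lower = 2, upper = 7))
result
```

Each file is still read in full, but only the filtered rows are kept before the pieces are bound together, so the combined data frame never holds the unwanted rows.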
Here is an approach using data.table, which lets you use fread (faster than read.csv) and rbindlist, a very fast implementation of do.call(rbind, list(..)) that is well suited to this situation. It also has a between function.
library(data.table)
fileNames <- list.files(path = workDir)
alldata <- rbindlist(lapply(fileNames, function(x, min, max) {
# read one file, record which file it came from, keep rows with min < v3 < max
xx <- fread(x, sep = ',')
xx[, fileID := gsub(".csv.*", "", x)]
xx[between(v3, lower=min, upper = max, incbounds = FALSE)]
}, min = 2, max = 7))
If the individual files are large and v3 always holds integer values, it might be worth setting v3 as a key and then using a binary search. It may also be quicker to import everything and then run the filtering.
If you want to do the "filtering" before importing the data, try read.csv.sql from the sqldf package.
If you are really stuck for memory then the following solution might work. It uses LaF to read only the column needed for filtering, then calculates the total number of lines that will be read, initializes the complete data.frame, and finally reads the required lines from the files. (It's probably not faster than the other solutions.)
library("LaF")
colnames <- c("v1","v2","v3")
colclasses <- c("character", "character", "numeric")
fileNames <- list.files(pattern = "*.csv")
# First determine which lines to read from each file and the total number of lines
# to be read
lines <- list()
for (fn in fileNames) {
laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
d <- laf$v3[]
lines[[fn]] <- which(d > 2 & d < 7)
}
nlines <- sum(sapply(lines, length))
# Initialize data.frame
df <- as.data.frame(lapply(colclasses, do.call, list(nlines)),
stringsAsFactors=FALSE)
names(df) <- colnames
# Read the lines from the files
i <- 0
for (fn in names(lines)) {
laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
n <- length(lines[[fn]])
df[seq_len(n) + i, ] <- laf[lines[[fn]], ]
i <- i + n
}

R create a matrix

I have to read some external files, extract some columns, and fill in the missing values with zeros. So if the first file has a, b, c, d in column$Name, with discrete values in column$Area, and the second file has b, d, e, f in the same column, and so on for the other files, I need to create a data frame such as this:
      a     b     c     d     e     f
File1 value value value value 0     0
File2 0     value 0     value value value
This is the dummy code I wrote to try to better explain my problem:
listDFs <- list()
for(i in 1:10){
listDFs[[i]] <-
data.frame(Name=c(
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse="")),
c(paste(sample(letters,size=2,replace=TRUE),collapse=""))),
Area=runif(7))
}
lComposti <- sapply(listDFs, FUN = "[","Name")
dfComposti <- data.frame(matrix(unlist(lComposti),byrow=TRUE))
colnames(dfComposti) <- "Name"
dfComposti <- unique(dfComposti)
#
## The CORE of the code
lArea <- list()
for(i in 1:10){
lArea[[i]] <-
ifelse(dfComposti$Name %in% listDFs[[i]]$Name, listDFs[[i]]$Area, 0)}
#
mtxArea <- (matrix(unlist(lArea),nrow=c(10),ncol=dim(dfComposti)[1],byrow=TRUE))
The problem is the "synchronization" between the column names and the values.
Do you have any suggestions?
If my code is unclear I can also upload the files I work with.
Best
The safest approach is never to lose track of the names; otherwise they could be put back in the wrong order.
You can concatenate all your data.frames into a tall data.frame, with do.call(rbind, ...), and then convert it to a wide data.frame with dcast.
# Add a File column to the data.frames
names( listDFs ) <- paste( "File", 1:length(listDFs) )
for(i in seq_along(listDFs)) {
listDFs[[i]] <- data.frame( listDFs[[i]], file = names(listDFs)[i] )
}
# Concatenate them
d <- do.call( rbind, listDFs )
# Convert this tall data.frame to a wide one
# ("sum" is only needed if some names appear several times
# in the same file: since you used "replace=TRUE" for the
# sample data, it is likely to happen)
library(reshape2)
d <- dcast( d, file ~ Name, sum, value.var="Area" )
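If an extra package is not wanted, the same tall-to-wide step can be sketched in base R with xtabs, which sums Area for each file and name and fills in 0 for combinations that do not occur. The two small data frames below are invented to mirror the a-to-f example in the question, and `tall`, `wide` and `wide_df` are illustrative names.

```r
# Hypothetical input mirroring the example in the question
listDFs <- list(
  data.frame(Name = c("a", "b", "c", "d"), Area = c(1, 2, 3, 4)),
  data.frame(Name = c("b", "d", "e", "f"), Area = c(5, 6, 7, 8))
)
names(listDFs) <- paste0("File", seq_along(listDFs))

# Tag every row with its file name, then stack into one tall data frame
tall <- do.call(rbind, Map(function(d, f) cbind(d, file = f),
                           listDFs, names(listDFs)))

# Cross-tabulate: rows are files, columns are names, cells are summed areas;
# combinations that never occur come out as 0
wide <- xtabs(Area ~ file + Name, data = tall)
wide_df <- as.data.frame.matrix(wide)
wide_df
```

Because the names travel with the values all the way through, the columns cannot end up out of step with the data, which was the original concern.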
