I have a matrix of information that I import from tab-separated files. Once I import this data, I consolidate it into a data frame and perform some editing on it to make it usable.
The last step is to convert all the numbers to numeric. To do this, I use
as.numeric(as.character()). Unfortunately, the numbers do not change to numeric; they are still of chr type.
Here is my code:
options(stringsAsFactors=FALSE) # a bare stringsAsFactors=F assignment does nothing; set it via options() or pass it to read.table()
filelist <- list.files(path="C:\\Users\\LocalAdmin\\Desktop\\Correlation Project\\test", full.names=TRUE, recursive=FALSE)
temp <- data.frame()
TSV <- data.frame()
for (i in seq_along(filelist))
{
temp <- read.table(file=filelist[i], header=TRUE, sep="\t")
TSV <- rbind(TSV,temp)
}
for (i in seq(15,1,-1)) #getting rid of extraneous dataframe entries
{
TSV <- TSV[-i,] #deleting by row
}
for(i in seq(1,ncol(TSV),1))
{
TSV[,i] <- as.numeric(as.character(TSV[,i]))
}
Thank you for your help!
You can use
TSV[] <- as.numeric(as.matrix(TSV))
The empty-bracket assignment keeps the data-frame shape; wrapping the result in as.data.frame() directly would collapse everything into a single column. This will only work if all values can be transformed into numbers.
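A minimal sketch of this on a toy data frame (the real TSV is assumed to hold character columns after import): assigning into the empty brackets converts every cell while keeping the original shape and column names.

```r
# Toy data frame with character columns, mimicking the imported TSV
TSV <- data.frame(a = c("1", "2"), b = c("3.5", "4.5"),
                  stringsAsFactors = FALSE)

# Convert every cell at once; the empty-bracket assignment keeps
# the data-frame shape and column names
TSV[] <- as.numeric(as.matrix(TSV))

str(TSV)  # both columns are now numeric
```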
A couple of things here:
Prefer vector operations whenever possible. There is no need to read the files in a for loop:
TSV <- do.call(rbind, lapply(filelist, read.delim))
Your loop to get rid of extraneous info can be reduced to a vector operation:
TSV <- TSV[-(1:15),]
I'm assuming you are getting factors and integers that you want as numeric:
oldClasses <- sapply(TSV, class)
int2numeric <- oldClasses == "integer"
factor2numeric <- oldClasses == "factor"
TSV[int2numeric] <- lapply(TSV[int2numeric], as.numeric)
TSV[factor2numeric] <- lapply(TSV[factor2numeric], function(x) as.numeric(as.character(x)))
Note the lapply in the assignments: calling as.numeric() directly on a multi-column data frame would fail. You could arguably reduce the two conversions above to one, but I think this makes your intent clear.
And that should be it.
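The reason for the as.character() step in the factor branch, shown on a toy factor: calling as.numeric() on a factor returns the internal level codes, not the printed values.

```r
f <- factor(c("10", "20", "10"))

# Direct coercion yields the level codes, not the values
codes <- as.numeric(f)                 # 1 2 1

# Going through character first recovers the actual values
values <- as.numeric(as.character(f))  # 10 20 10
```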
@JPC I finally managed to get this to work. Here is my code:
TSVnew<-apply(TSV,2,as.numeric)
rownames(TSVnew)<-rownames(TSV)
TSV<-TSVnew
However, I still don't understand why my previous attempt using this didn't work:
for(i in seq(1,ncol(TSV),1))
{
TSV[,i] <- as.numeric(as.character(TSV[,i]))
}
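One difference between the two approaches worth knowing, sketched on toy data: apply() first coerces the data frame to a matrix and returns a numeric matrix (hence the rownames copying above), while the column-by-column loop keeps the object a data frame throughout.

```r
TSV <- data.frame(a = c("1", "2"), b = c("3", "4"),
                  stringsAsFactors = FALSE)

m <- apply(TSV, 2, as.numeric)  # returns a numeric *matrix*, not a data frame

for (i in seq_len(ncol(TSV))) {
  TSV[, i] <- as.numeric(as.character(TSV[, i]))
}
# TSV is still a data frame, now with numeric columns
```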
Related
I have a lot of data frames in my R environment, and I want to apply as.numeric() to all of the variables in those data frames and overwrite them. I do not know how to address all of them.
The following is my attempt, but ls() seemingly just writes the name to x:
for (i in 1:length(ls())){
x <- ls()[i]
for (i in 1:length(x)){
x[i] <- as.numeric(x[i])
}
}
There were two helpful answers to my question: one that was later deleted, and another by @Henrik.
The deleted one followed my approach of converting every data frame in the global environment (every one with a "V" in its name, in my example) to numeric. This is the code:
res <- lapply(mget(ls(pattern = 'V')), \(x) {
x[] <- lapply(x, as.numeric)
return(x)
})
list2env(res, .GlobalEnv)
# Check
str(VA01.000306__ft2)
The second approach uses a list instead of multiple objects, storing the imported csv files in a single list. This is the csv-to-list import:
F_EB_names <- list.files(pattern="*.csv") # store the data in a list
F_EB <- lapply(F_EB_names, read.csv2)
names(F_EB) <- gsub(".wav.csv","_ft2",F_EB_names)
And this is the conversion to numerics:
F_EB <- type.convert(F_EB, as.is = TRUE) # Conversion; as.is = TRUE keeps character columns as character
str(F_EB) # Check
Thank you both for the help.
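What type.convert() does, sketched on a toy data frame (the real F_EB is a list of such frames, and recent R versions dispatch type.convert over lists and data frames column by column): each column is re-read and given an appropriate class, and as.is = TRUE prevents character columns from becoming factors.

```r
df <- data.frame(num = c("1.5", "2.5"), txt = c("a", "b"),
                 stringsAsFactors = FALSE)

# Re-guess a class for each column; "1.5","2.5" becomes numeric,
# "a","b" stays character because of as.is = TRUE
converted <- type.convert(df, as.is = TRUE)

str(converted)
```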
I have a small number of csv files, each containing two columns with numeric values. I want to write a for loop that reads the files, sums the columns, and stores the sum totals for each csv in a numeric vector. This is the closest I've come:
allfiles <- list.files()
for (i in seq(allfiles)) {
total <- numeric()
total[i] <- sum(subset(read.csv(allfiles[i]), select=Gift.1), subset(read.csv(allfiles[i]), select=Gift.2))
total
}
My result is all NAs save a value for the last file. I understand that I'm overwriting total each time the for loop executes, and I think I need to do something with indexing.
The first problem is that you are not pre-allocating total to the right length (or properly appending to it): you reset it to an empty vector inside the loop. Regardless, I recommend against that method.
There are several ways to do this, but the R-onic (my term, based on pythonic ... I know, it doesn't flow well) is based on vectors/lists.
alldata <- sapply(allfiles, read.csv, simplify = FALSE)
totals <- sapply(alldata, function(a) sum(subset(a, select=Gift.1), subset(a, select=Gift.2)))
I often like to do that: keep the "raw/unaltered" data in one list and then repeatedly extract from it. For instance, if the files are huge and reading them takes a non-trivial amount of time, then if you realize you also need Gift.3 and did it your way, you'd need to re-read the entire dataset. Using my method, however, you just update the second sapply to include the change and rerun it on the already-loaded data. (Most of my rationale is based on untrusted data, portions that are typically unused, or other factors that may not apply to you.)
If you really wanted to reduce the code to a single line, something like:
totals <- sapply(allfiles, function(fn) {
x <- read.csv(fn)
sum(subset(x, select=Gift.1), subset(x, select=Gift.2))
})
allfiles <- list.files()
total <- numeric()
for (i in seq(allfiles)) {
total[i] <- sum(subset(read.csv(allfiles[i]), select=Gift.1), subset(read.csv(allfiles[i]), select=Gift.2))
}
total
If possible, give total a known length beforehand, i.e. total <- numeric(length(allfiles)).
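The corrected loop with preallocation, made self-contained here by writing two hypothetical csv files (with the assumed Gift.1 and Gift.2 columns) to a temporary directory:

```r
# Hypothetical example files in a temporary directory
dir <- tempfile()
dir.create(dir)
write.csv(data.frame(Gift.1 = 1:2, Gift.2 = 3:4),
          file.path(dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(Gift.1 = 5, Gift.2 = 6),
          file.path(dir, "b.csv"), row.names = FALSE)

allfiles <- list.files(dir, full.names = TRUE)

# Preallocate once, outside the loop, with the final length
total <- numeric(length(allfiles))
for (i in seq_along(allfiles)) {
  x <- read.csv(allfiles[i])
  total[i] <- sum(x$Gift.1, x$Gift.2)
}
total  # 10 11
```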
Low level R user here.
I have 3 population data frames (low.proj, med.proj, high.proj), with the exact same number of rows and columns, that I'm trying to clean and reshape.
I want to eliminate some extra commas in the Country column in all three frames, so I'm trying this loop with gsub:
for(i in c("low.proj", "med.proj", "high.proj")){
i$Country <- gsub(",","",i[,"Country"])
}
When I run this I get the error "Error in i[, "Country"] : incorrect number of dimensions"
When I run the code without the loop:
low.proj$Country <- gsub(",","",low.proj[,"Country"])
It works. What causes this error and how do I fix it?
To retrieve the contents of the object named by the string in i, use get(); to put new data into that object, use assign():
for(i in c("low.proj", "med.proj", "high.proj")){
tmp <- get(i)
tmp$Country <- gsub(",","",tmp[,"Country"])
assign(i, tmp)
}
You're indexing the wrong variable:
i$Country <- gsub(",","",i[,"Country"])
i is a string, so i$Country doesn't have any meaning.
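A self-contained illustration of the get()/assign() pattern, with two toy data frames standing in for the projection data:

```r
low.proj <- data.frame(Country = c("Chad,", "Peru,"), stringsAsFactors = FALSE)
med.proj <- data.frame(Country = "Mali,", stringsAsFactors = FALSE)

for (i in c("low.proj", "med.proj")) {
  tmp <- get(i)                          # fetch the object named by the string
  tmp$Country <- gsub(",", "", tmp$Country)
  assign(i, tmp)                         # write it back under the same name
}

low.proj$Country  # "Chad" "Peru" -- commas removed
```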
I have a perhaps basic question, and I have searched the web. I have a problem reading files, though I managed to read them by following @Konrad's suggestions, which I appreciate: How to get R to read in files from multiple subdirectories under one large directory?
It is a similar problem; however, I have not resolved it.
My problem:
I have a large number of files with the same name ("tempo.out") in different folders. Each tempo.out has the same format: 1048 lines and 5 columns/headers:
id X Y time temp
setwd("~/Documents/ewat")
dat.files <- list.files(path="./ress",
recursive=T,
pattern="tempo.out"
,full.names=T)
readDatFile <- function(f) {
dat.fl <- read.table(f)
}
data.filesf <- sapply(dat.files, readDatFile)
# I might not have the right syntax in subs5:
subs5 <- sapply(data.filesf,`[`,5)
matr5 <- do.call(rbind, subs5)
probs <- c(0.05,0.1,0.16,0.25,0.5,0.75,0.84,0.90,0.95,0.99)
q <- rowQuantiles(matr5, probs=probs)
print(q)
I want to extract the fifth column (temp) of each of those thousands of files and make calculations such as quantiles.
I first tried to read all the subfiles in "ress". The latter gave no error, but my main problem is that data.filesf is not a matrix but a list, and the 5th column is actually not what I expected. Then the following:
matr5 <- do.call(rbind, subs5)
is also not giving the required values/results.
What could be the best way to get columns into what will become a huge matrix?
Try
lapply(data.filesf, function(x) x[, 5])
Hope this will help.
Consider extending your defined function, readDatFile, to extract the fifth column, temp, and assign directly to a matrix with sapply or vapply (since you know the needed structure ahead of time: a numeric matrix with as many rows as each file, 1048). Then run the needed rowQuantiles:
setwd("~/Documents/ewat")
dat.files <- list.files(path="./ress",
recursive=T,
pattern="tempo.out",
full.names=T)
readDatFile <- function(f) read.table(f)$temp # OR USE read.csv(f)[[5]]
matr5 <- sapply(dat.files, readDatFile, USE.NAMES=FALSE)
# matr5 <- vapply(dat.files, readDatFile, numeric(1048), USE.NAMES=FALSE)
probs <- c(0.05,0.1,0.16,0.25,0.5,0.75,0.84,0.90,0.95,0.99)
q <- matrixStats::rowQuantiles(matr5, probs=probs) # rowQuantiles comes from the matrixStats package
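If matrixStats is not available, the same row-wise quantiles can be computed in base R with apply(), sketched here on a small random matrix standing in for the real per-file temp columns:

```r
set.seed(1)
matr5 <- matrix(rnorm(1048 * 3), nrow = 1048)  # stand-in: 1048 rows, 3 files

probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
# apply over rows returns one column per row, so transpose
# to get a 1048 x length(probs) matrix of quantiles
q <- t(apply(matr5, 1, quantile, probs = probs))

dim(q)  # 1048 rows, 5 quantile columns
```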
setwd("C:\\Users\\DATA")
temp = list.files(pattern="*.dta")
for (i in 1:length(temp)) assign(temp[i], read.dta13(temp[i], nonint.factors = TRUE)) # read.dta13 is from the readstata13 package
grep(pattern="_m", temp, value=TRUE)
Here I create a list of my datasets and read them into R. I then attempt to use grep to find all variable names with the pattern _m. Obviously this doesn't work, because it simply returns all filenames containing _m. So essentially what I want is for my code to loop through the list of databases, find variables ending with _m, and return a list of the databases that contain those variables.
I'm quite unsure how to do this; I'm quite new to coding and R.
Apart from needing to know in which databases these variables are, I also need to be able to make changes (reshape them) to these variables.
First, assign will not work as you think, because it expects a string (or character, as it is called in R) as its first argument: it will use that string as the variable name (see here for more info).
What you can do depends on the structure of your data. read.dta13 will load each file as a data.frame.
If you look for column names, you can do something like that:
myList <- character()
for (i in seq_along(temp)) {
# save the content of your file in a data frame
df <- read.dta13(temp[i], nonint.factors = TRUE)
# identify the names of the columns matching your pattern
varMatch <- grep(pattern="_m", colnames(df), value=TRUE)
# check if at least one of the columns match the pattern
if (length(varMatch)) {
myList <- c(myList, temp[i]) # save the name if match
}
}
If you look for the content of a column, you can have a look at the dplyr package, which is very useful when it comes to data frames manipulation.
A good introduction to dplyr is available in the package vignette here.
Note that in R, appending to a vector can become very slow (see this SO question for more details).
Here is one way to figure out which files have variables with names ending in "_m":
# setup
setwd("C:\\Users\\DATA")
temp = list.files(pattern="*.dta")
# logical vector to be filled in
inFileVec <- logical(length(temp))
# loop through each file
for (i in seq_along(temp)) {
# read file
fileTemp <- read.dta13(temp[i], nonint.factors = TRUE)
# fill in vector with TRUE if any variable ends in "_m"
inFileVec[i] <- any(grepl("_m$", names(fileTemp)))
}
In the final line, names returns the variable names, grepl returns a logical vector for whether each variable name matches the pattern, and any returns a logical vector of length 1 indicating whether or not at least one TRUE was returned from grepl.
# print out these file names
temp[inFileVec]
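The pattern logic on its own, with hypothetical variable names in place of a real .dta file's columns: the "$" anchor ensures only names that end in _m match, not names merely containing it.

```r
vars <- c("age", "income_m", "height", "margin_total")

grepl("_m$", vars)       # FALSE TRUE FALSE FALSE: only the name ending in _m
any(grepl("_m$", vars))  # TRUE: this file would be flagged
```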