In all honesty, I'm not quite sure what the issue is, but I've had a similar problem in R in the past. I've written code to extract the variables I want from .dat files (specifically the Current Population Survey, CPS). I have CSV files that contain the positions of each variable by year, since the positions change from year to year. For example, HRFS12M1 is at 1173-1174 in 2010 and at 1223-1224 in 2019 (that part of the code works, so I didn't include it). I have two separate directories, one with the position files and one with the .dat files. I first loop through the position files and create a data frame of positions for each year (2010-2019). Once those are generated, I run the code below to pull the variables I want into one large merged data frame.
The code works as intended when I select 4 or fewer variables in varList. However, the moment I use more variables, the data frame starts to contain values that aren't from those columns. Does anyone know why? I've tried several different variables to confirm that the problem is the number of variables and not the position files.
#Loop through list of .dat files (lst2 contains name of files example:2010dec.dat)
for(i in 1:length(lst2)){
#Import the data cps data set
temp_cps<-readLines(lst2[i])
#Get the positions of the relevant year
temp_pos<- get(paste("Year", i, sep = "."))
#List of Variables we are looking at (can't use more than 4)
varList=c("HRYEAR4","GESTFIPS","HESP1","HRFS12M1")
#Get positions only for the variables selected
temp_pos=temp_pos[grep(paste(varList, collapse="|"), temp_pos$Variable),]
#Create the dataframe
df<-NULL
for(j in 1:length(varList)){
df<-cbind(df,substr(temp_cps,temp_pos$Pos1[j],temp_pos$Pos2[j]))
}
df<-as.data.frame(df)
names(df)<-varList
assign(paste("CPS", i, sep = "."), df)
}
#AutoMate appending each year
for (k in 1:(length(lst2)-1)){
if(k==1){
CPS1 <- get(paste("CPS", k, sep = "."))
CPS2 <- get(paste("CPS", k+1, sep = "."))
#Append to keep only rows of second data set
merged_data=rbind(CPS1,CPS2)
}
else{
CPS_C <- get(paste("CPS", k+1, sep = "."))
merged_data=rbind(merged_data,CPS_C)
}
if(k==length(lst2)-1){
#Clear the workspace, keeping only merged_data
rm(list=setdiff(ls(), "merged_data"))
}
}
This is what it looks like before it breaks
This is what happens after adding more than 4 variables
I think I figured it out. I need to run a few more variables to confirm, but the program currently won't work if my list of variables is not in the same order as their positions. For example, if "HRYEAR4" is at 82-84 and "GESTFIPS" is at 93-94, the program fails if I put GESTFIPS before HRYEAR4 in varList; if HRYEAR4 comes first, it runs as intended. Does anyone have a quick idea for how to replace the line df<-cbind(df,substr(temp_cps,temp_pos$Pos1[j],temp_pos$Pos2[j])) to make it more dynamic and avoid this issue? If not, it's not a big deal; for the moment I'll just put the variables in order and look for a better solution later. Thanks to anyone who tried to help.
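One possible way to make the lookup independent of the order in varList, sketched under assumptions. The likely cause is that grep() returns rows in the order they occur in temp_pos, not in the order of varList, and that an unanchored pattern such as "HESP1" would also match a longer name such as "HESP10". match() looks up each name exactly and in varList order. The toy temp_pos and temp_cps below, including the name HESP10 and all positions, are made-up stand-ins; only the column names Variable, Pos1 and Pos2 come from the question.

```r
# Hypothetical positions table with the same columns as the question's temp_pos
temp_pos <- data.frame(
  Variable = c("HRYEAR4", "GESTFIPS", "HESP1", "HESP10"),
  Pos1 = c(1, 5, 7, 8),
  Pos2 = c(4, 6, 7, 9),
  stringsAsFactors = FALSE
)
# Hypothetical fixed-width records standing in for readLines() output
temp_cps <- c("201001312", "201948245")

# Deliberately not in positional order
varList <- c("GESTFIPS", "HRYEAR4", "HESP1")

# Exact lookup: idx[j] is the row of temp_pos that describes varList[j]
idx <- match(varList, temp_pos$Variable)
stopifnot(!anyNA(idx))  # fail early if a requested variable is missing

# One substr() call per requested variable, in varList order
cols <- lapply(idx, function(r) substr(temp_cps, temp_pos$Pos1[r], temp_pos$Pos2[r]))
names(cols) <- varList
df <- as.data.frame(cols, stringsAsFactors = FALSE)
```

With this, the column names and the extracted positions stay paired no matter how varList is ordered.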
Related
I have a dataframe with one column containing a bunch of sound file filenames (.wav). I would like to measure something within each sound file, and have each measurement listed in a second new column next to the corresponding filename. In my code, each measurement is not being put into the second column. If I take each line of the for-loop and run it independently with the values 1, 2, 3 etc substituted for i, the resulting dataframe has the output measurements correctly entered.
I have written a toy example below, which cannot be run without some wav files, but perhaps the problem can be spotted based on the code alone:
library(seewave); library(tuneR)
setwd("D:/wavs")
#make a dataframe containing a column of wav filenames
file_list <-data.frame(c("wav1.wav", "wav2.wav", "wav3.wav"))
colnames(file_list)[1] <-"filelist" #give it a sensible column name
file_list$filelist <-as.character(file_list$filelist) #convert from factor
file_list$mx <-NA #a new empty vector for the measurement results
str(file_list) #this is how it will look before adding measurements
#now read in each wav file, measure something,
#and put the outcome in the empty cell next to that corresponding filename.
for (i in length(file_list$filelist)){
a <-as.character(file_list$filelist[[i]]) #this seemed to be a requirement, that 'wav1.wav' etc be a character
temp <-readWave(a) #read the file using tuneR package
mx <-max(range(temp@left)) #take some measurement from the left channel
file_list$mx[[i]] <-mx #put it in a new column next to the original filename
rm(mx); rm(temp); rm(a) #kill unnecessary things before starting again, for just in case
}
Needless to say, I have scoured the web and Stack Overflow for guidance without success, and tried a bunch of things (e.g. using }next{). Perhaps I need something similar to: file_list2$mx[i,] <-mx. Maybe some easy points for someone? Thank you, I am always grateful for help on SO.
The only problem with your code is that the for loop never iterates. for (i in length(file_list$filelist)) loops over the single value 3, so the body runs only once, with i = 3. The corrected code is given below:
for (i in 1:length(file_list$filelist)){
a <-as.character(file_list$filelist[[i]]) #this seemed to be a requirement, that 'wav1.wav' etc be a character
temp <-readWave(a) #read the file using tuneR package
mx <-max(range(temp@left)) #take some measurement from the left channel
file_list$mx[[i]] <-mx #put it in a new column next to the original filename
rm(mx); rm(temp); rm(a) #kill unnecessary things before starting again, for just in case
}
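To see the difference without any wav files, here is a small sketch (visited_wrong and visited_right are illustrative names, not from the question). It uses seq_along(), which is generally a safer habit than 1:length(x) because it gives an empty sequence for an empty vector, whereas 1:0 counts down through 1 and 0.

```r
files <- c("wav1.wav", "wav2.wav", "wav3.wav")

# Original form: length(files) is the single number 3, so the body runs once
visited_wrong <- integer(0)
for (i in length(files)) {
  visited_wrong <- c(visited_wrong, i)
}

# Corrected form: seq_along(files) is 1, 2, 3, so every file is visited
visited_right <- integer(0)
for (i in seq_along(files)) {
  visited_right <- c(visited_right, i)
}
```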
Hope this helps!
How about a tidyverse solution?
library(dplyr)
library(purrr)
library(tuneR) # for readWave
data.frame(filelist=c("wav1.wav", "wav2.wav", "wav3.wav"), stringsAsFactors = FALSE) %>%
mutate(mx = map_dbl(filelist, ~ readWave(.x)@left %>% max))
This question is pretty simple and maybe even dumb, but I can't find an answer on Google. I'm trying to read a .txt file into R using this command:
data <- read.csv("perm2test.txt", sep="\t", header=FALSE, row.names=1, col.names=paste("V", seq_len(max(count.fields("perm2test.txt", sep="\t"))), sep=""), fill=TRUE)
The reason I have the col.names command is because every line in my .txt file has a different number of observations. I've tested this on a much smaller file and it works. However, when I run it on my actual dataset (which is only 48MB), I'm not sure if it is working... The reason I'm not sure is because I haven't received an error message, yet it has been "running" for over 24 hours at this point (just the read.csv command above). Is it possible that it has run out of memory and it just doesn't output a warning?
I've looked around and I know people say there are functions out there to reduce the size and remove lines that aren't needed, etc. but to be honest I don't think this file is THAT big, and unfortunately I do need every line in the file... (it's actually only 70 lines, but some lines contain as much as 100k entries, while others may only have say 100). Any ideas what is happening?
Obviously untested but should give you some code to modify:
datL <- readLines("perm2test.txt") # one line per group
# may want to exclude some lines but question is unclear
listL <- lapply(datL, function(L) read.delim(text=L, header=FALSE, colClasses="numeric") )
# This is a list of values by group (one single-row data frame per line)
dfL <- data.frame( vals = unlist(listL),
# Now build a grouping vector that is associated with each bundle of values
# (seq_along rather than LETTERS, which only has 26 entries)
groups= factor( rep( seq_along(listL) ,
sapply(listL, length) ) ) )
# Might have been able to do that last maneuver with `stack`.
library(lattice)
bwplot( vals ~ groups, data=dfL)
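As an alternative sketch (not the answerer's code), ragged tab-delimited lines can be split directly with strsplit(), which avoids padding every row out to the width of the longest one. This assumes the first field on each line is a row label, as row.names=1 in the question suggests. The temporary file and its contents are made up for illustration.

```r
# Made-up ragged file: a label followed by a varying number of values
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\t1\t2\t3", "b\t4", "c\t5\t6"), tmp)

lines <- readLines(tmp)                          # one string per line
fields <- strsplit(lines, "\t", fixed = TRUE)    # list of character vectors

ids <- vapply(fields, `[`, character(1), 1)      # first field = label
vals <- lapply(fields, function(f) as.numeric(f[-1]))
names(vals) <- ids

# Long format: one row per value, with a factor column naming its group
long <- stack(vals)
```

The resulting long data frame has columns values and ind, which is the shape a call like bwplot(values ~ ind, data = long) expects.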
I have uploaded a data set called "Obtained Dataset". It usually has 16 header rows containing numeric and character variables; some other files of a similar nature have fewer than 16. Each of these rows is the header of one column of the data, which starts from the 17th row onwards in this specific file.
Obtained dataset & Required Dataset
In the data block, the 1st column is the x-axis, the 2nd column is the y-axis and the 3rd column is depth (these are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so on, as given in the first 16 rows of the file.
Now I want R code that converts it into the format shown in the required data set. Also, if a different data set has fewer than 16 lines of names, say GR 1 LIN and RHOB 1 LIN are missing, I want it to still create those columns, filled with NA for rows 1:nrow.
Currently I export the file to Excel, manually clean the data, rename the columns, save it as CSV and then use read.csv("filename"), but it is simply not possible to do this for 400 files.
Any advice on how to proceed would be a great help.
I have noticed that you have probably posted this question before, in a different format. This is a public forum, and people are happy to help. However, it's your job to make life easier for the people helping you, so please put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Lets start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in the first 6 lines, followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, and we want to skip the first 6 rows with undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check that the solution works. If it does, then move on to the next step; otherwise make the necessary changes.
Step4: Automating:
First we need to create a list of all the 400 files.
The easiest way (also the easiest to explain) is to copy the 400 files into a directory and then set that as the working directory (using setwd).
Now first we'll create a vector with all file names:
fileNameList <- dir()
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
temp <- read.table(file=fileName, sep ="\t", skip = 6)
names(temp) <- namesVec
# name the output after the input; a fixed name would be overwritten on every call
write.table(temp,file=paste("clean",fileName,sep="-"),row.names = FALSE,sep = "\t",quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
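The question also asks for columns that are missing from a file to be created and filled with NA. A minimal sketch of that step, assuming you keep one vector of all expected column names. The vector expected and the names X, Y and DEPTH are hypothetical; GR 1 LIN, CAL 1 LIN and RHOB 1 LIN are taken from the question.

```r
# Full set of columns every cleaned file should end up with (assumed)
expected <- c("X", "Y", "DEPTH", "GR 1 LIN", "CAL 1 LIN", "RHOB 1 LIN")

# Toy file that only contains four of them
temp <- data.frame(X = 1:3, Y = 4:6, DEPTH = 7:9, check.names = FALSE)
temp[["CAL 1 LIN"]] <- c(0.1, 0.2, 0.3)

# Add whichever expected columns are absent, filled with NA
missing_cols <- setdiff(expected, names(temp))
temp[missing_cols] <- NA

# Put the columns into the standard order
temp <- temp[expected]
```

Inside convertFiles, these lines could go just before write.table, once names(temp) has been set from the header rows of that particular file.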
Hope this helps!
I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column you want to calculate the mean (inside the data frames) and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
setwd("C:/Users/cw/Documents")
setwd(directory)
files <<- list.files()
First of all, set the wd and get a list of all files
x <- id[1]
x
get the starting point of the user-specified ID.
Problem
for (i in x:length(id)) {
df <- rep(NA, length(id))
df[i] <- lapply(files[i], read.csv, header=T)
result <- do.call(rbind, df)
return(df)
}
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint as to how I could proceed?
Based on your example, that is 16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern="\\.csv$")[10:25] #here [10:25] ... in production use your function parameter here
file_list <- vector('list', length=length(csvFiles))
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles #OPTIONAL: if you want to rename (later rows) to the csv list
df <- do.call("rbind", df_list)
mean(df[ ,"columnName"])
These code snippets should be easy to adapt and incorporate into your routine.
You can aggregate your csv files into one big table like this :
bigtable<-NULL # start empty so that rbind has something to append to
for(i in 100:250)
{
infile<-paste("C:/Users/cw/Documents/",i,".csv",sep="")
newtable<-read.csv(infile)
newtable<-cbind(newtable,rep(i,dim(newtable)[1])) # if you want to be able to identify tables after they are aggregated
bigtable<-rbind(bigtable,newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't work for files 001 to 099; you'll have to distinguish those from the others because of the leading zeros, but that's fixable with a little extra handling.
Why do you have lapply inside a for loop? Just do lapply(files[files %in% paste0(id, ".csv")], read.csv, header=T).
They should also teach you to never use <<-.
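One way to handle the zero-padded names mentioned above is sprintf("%03d.csv", id), which turns 7 into "007.csv" and 120 into "120.csv" (note that paste0(id, ".csv") would produce "7.csv"). The sketch below is only a hint at the file-handling part, not a full solution. The function name read_monitors, the demo directory and the column name sulfate are hypothetical.

```r
# Read the files for the requested ids and stack them into one data frame
read_monitors <- function(directory, id) {
  paths <- file.path(directory, sprintf("%03d.csv", id))
  do.call(rbind, lapply(paths, read.csv, header = TRUE))
}

# Build a tiny demo directory with 001.csv, 002.csv and 003.csv
dir_demo <- file.path(tempdir(), "pm_demo")
dir.create(dir_demo, showWarnings = FALSE)
for (k in 1:3) {
  write.csv(data.frame(ID = k, sulfate = c(k, k + 1)),
            file.path(dir_demo, sprintf("%03d.csv", k)),
            row.names = FALSE)
}

combined <- read_monitors(dir_demo, 2:3)
mean_sulfate <- mean(combined$sulfate, na.rm = TRUE)
```

Building full paths with file.path() also means neither setwd() nor <<- is needed inside the function.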
I'm facing a challenge in R. I'm writing a code that incorporates another code written in C++ called MHX.
MHX is used for chemical data analysis by inputting some concentrations, etc. The integration between R and MHX works fine. So I'm able to write my MHX code definitions in the form of cat(CODE HERE) then calling a bash command to run MHX from terminal.
Now the results from MHX are given as tab delimited data tables that I am able to read without a problem in R. The problem is that I use R to simulate a large number of MHX calculations using loops.
Hence the need to write dynamic variables, and here is where I'm stuck. Let me give you more information with examples of my R code:
for (i in 1:100) {
fin <- file.create("input/ex1") #MHX input file
fout <- file.create("output/ex1.out") #MHX output file
FNM <- paste0("table_data/pH", i, ".txt") #filename used inside MHX definition
file.create(FNM) #this is used to create FNM table in R
fXY <- file.create(paste0("table_data/ECOMXY", i, ".txt"))
ifelse (HERE SOME MATHEMATICAL DEFINITIONS OF SOME VARIABLES)
ksource(MHXCode) #THIS CALLS MY MHX CODE which is inside another R code called `MHXCode` using a custom function KSOURCE. No problem here.
Up to here I don't have major problems. Now I need to setup the dynamic variables:
First I am creating variables PHL1 to PHL100
assign(paste("PHL", i, sep=""), read.table(paste0("table_data/pH", i, ".txt") ,skip=0, sep="\t", head=TRUE, na.strings = "-Inf"))
Each PHL table contains two rows and about 20 columns. Now I am interested in creating data frames from the second row of each column. Take for example column number 1, which is called EMF; ideally I need to do the following for all tables from PHL1 to PHL100, which is very tedious:
EMFT <- cbind(PHL1$EMF[2], PHL2$EMF[2], PHL3$EMF[2], PHL4$EMF[2], PHL5$EMF[2], PHL6$EMF[2],PHL7$EMF[2], PHL8$EMF[2], PHL9$EMF[2], PHL10$EMF[2], ....... etc up to PHL100! )
I tried many things to achieve the above, but I was not successful, including:
XX <- assign(paste0("PHL", i, "$EMF[2]"), cat(paste0("PHL", i, "$EMF[2]")))
I will need to do the same for other variables in order to be able to create some complicated plots. I hope anyone would be able to help.
I must mention that the main problem with assign is that I get quoted names of variables and hence cannot return their values. Also, cat cannot be used to return a value; you get NULL in the example above. Simply put, I am stuck!!
Please help.
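For variables that already exist under generated names, mget() turns a vector of names back into a list of the objects themselves, which avoids the quoted-name problem. A small sketch, with three made-up tables standing in for PHL1 to PHL100 (the values and the pH column are invented):

```r
# Hypothetical stand-ins for the tables created with assign() in the question
for (i in 1:3) {
  assign(paste0("PHL", i), data.frame(EMF = c(i, i * 10), pH = c(7, 7.5)))
}

# Look the objects up by name; the result is a named list of data frames
tables <- mget(paste0("PHL", 1:3))

# Second EMF value from every table, as one named vector
EMFT <- sapply(tables, function(tb) tb$EMF[2])
```

The same sapply() call works for any other column, so the long cbind(PHL1$EMF[2], PHL2$EMF[2], ...) expression is not needed.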
Thanks to Justin, who gave me a clue to answer my question. Here is what I have done:
files <- list.files(path="table_data", pattern=".dat", full.names=T); files
FRM <- NULL
for (f in files){
dat <- read.table(f, skip=0, header=TRUE, sep="\t", na.strings="",quote="", colClasses="character")[2,]
Note that the [2, ] index keeps only row number 2 of each table while preserving the header, which is exactly what I was looking for.
Now I can bind it all into one table for my plots.
FRM <- rbind(FRM, dat)
}
This is a short answer and I think it is neat, sorted!
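A variation on the same idea, as a sketch: collect the rows in a list with lapply() and bind them once at the end, which avoids growing FRM inside the loop. The demo directory, file names and values below are made up so that the example runs on its own.

```r
# Create three small tab-delimited tables to read back in
dir_demo <- file.path(tempdir(), "table_data_demo")
dir.create(dir_demo, showWarnings = FALSE)
for (i in 1:3) {
  write.table(data.frame(EMF = c(i, i + 0.5), pH = c(7, 8)),
              file.path(dir_demo, paste0("pH", i, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

files <- list.files(dir_demo, pattern = "\\.txt$", full.names = TRUE)

# Keep only row 2 of each table, then bind all rows in a single call
rows <- lapply(files, function(f) read.table(f, header = TRUE, sep = "\t")[2, ])
FRM <- do.call(rbind, rows)
```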