I am trying to use the load() function in MATLAB to read in data from a text file. However, every line of the text file ends with '...'. The data file is not produced by MATLAB, so I have no control over the source of the ellipses.
The data file I'm loading in looks something like this:
11191425 NaN NaN 0.0 ...
11191426 NaN NaN 0.0 ...
11191427 NaN NaN 0.0 ...
11191428 NaN NaN 0.0 ...
11191429 2280.5 1910.1 455.0 ...
11191430 2280.5 1910.1 455.0 ...
11191431 2298.0 1891.1 454.0 ...
11191432 2317.3 1853.7 453.0 ...
11191433 2335.6 1811.1 458.0 ...
11191434 2350.6 1769.8 466.0 ...
11191435 2365.3 1729.7 475.0 ...
11191436 2379.5 1691.2 485.0 ...
11191437 2378.3 1647.6 492.0 ...
11191438 2375.4 1621.3 499.0 ...
11191439 2372.7 1598.5 499.0 ...
11191440 2372.7 1598.5 499.0 ...
11191441 NaN NaN 0.0 ...
11191442 294.9 1283.5 1163.0 ...
11191443 294.9 1283.5 1163.0 ...
Its actual length is in excess of 100,000 rows, but you get the idea. Using the load() command throws an error because of the '...'s at the end of each line. All I'm looking for is to read in those first four columns.
What would be the most efficient way of loading the data in, whilst completely omitting the rogue column of ellipses at the end? A method that doesn't involve making the system parse the whole text file twice would be preferable, though not necessary.
This is pretty easy if, instead of load, you use textscan. You can treat that last column as a string column and then just ignore it:
fid = fopen('data.txt');
data = textscan(fid, '%f %f %f %f %s');
fclose(fid);
Note that the first column is read with %f rather than %d: textscan returns an int32 array for %d, and concatenating it with the double columns would coerce everything to integers.
You can then concatenate the columns you want to keep into a single matrix:
data = [data{1:4}];
The fifth cell just contains the '...' strings, so you can ignore it. Alternatively, skip that field entirely with the %*s specifier, i.e. textscan(fid, '%f %f %f %f %*s'), and it never appears in the output at all.
I have the following custom function that I am using to create a table of summary statistics in R.
regression.stats <- function(fit){
  formula <- fit$call
  data <- eval(getCall(fit)$data)
  abserror <- abs(exp(fit$fitted.values) - data$bm)/exp(fit$fitted.values)
  QMLE <- exp((sigma(fit)^2)/2)
  smear <- sum(exp(fit$residuals))/nrow(data)
  RE <- mean(data$bm)/mean(exp(fit$fitted.values))
  CF <- (RE + smear + QMLE)/3
  adjPE <- mean(abs((exp(fit$fitted.values)*CF) - data$bm)
                /(exp(fit$fitted.values)*CF))
  SEE <- exp(sigma(fit) + 4.6052) - 100
  summary <- summary(fit)
  statistics <- data.frame("df" = fit$df.residual,
                           "r2" = round(summary(fit)$r.squared, 4),
                           "adjr2" = round(summary(fit)$adj.r.squared, 4),
                           "AIC" = AIC(fit), "BIC" = BIC(fit),
                           "logLik" = logLik(fit),
                           "PE" = round(mean(abserror)*100, 2),
                           QMLE = round(QMLE, 3),
                           smear = round(smear, 3), RE = round(RE, 3),
                           CF = round(CF, 3),
                           "adjPE" = round(mean(adjPE)*100, 2),
                           "SEE" = round(SEE, 2),
                           row.names = print(substitute(fit)))
  return(statistics)
}
I want to bind the resulting rows into a data.frame in order to produce a table of comparison statistics between regression analyses. For example, using the data from the mtcars dataset...
data(mtcars)
lm1 <- lm(cyl~mpg, data=mtcars)
lm2 <- lm(cyl~disp, data=mtcars)
lm3 <- lm(disp~mpg, data=mtcars)
rbind(regression.stats(lm1),regression.stats(lm2),regression.stats(lm3))
I am creating this for an R Markdown HTML file, and I want readers to be able to tell which regression equation produced which statistics. However, when I run the code, it also prints a list of the model names in addition to the regression statistics in the resulting HTML document.
I have managed to track the problem down to the line row.names = print(substitute(fit)) in my function. If I remove that line, the model name is no longer printed when the function runs; however, my rows are then no longer associated with the correct model name. How can I adjust my function so that the model name is used only as the row name of the summary, rather than also being printed?
The line
row.names = print(substitute(fit))
should be
row.names = deparse(substitute(fit))
or simply substitute(fit), since the symbol is coerced to character. The extra output appears because print() writes its argument to the console as a side effect (while returning it invisibly), so every call to regression.stats() also printed the model name.
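For illustration, a stripped-down sketch of the deparse(substitute(...)) idiom (the function name here is made up for the example):

```r
# substitute() captures the unevaluated argument as a symbol;
# deparse() turns that symbol into a character string without printing
name.of <- function(fit) deparse(substitute(fit))

lm1 <- lm(mpg ~ cyl, data = mtcars)
name.of(lm1)
# [1] "lm1"
```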
After the change, the function gives:
rbind(regression.stats(lm1),regression.stats(lm2),regression.stats(lm3))
# df r2 adjr2 AIC BIC logLik PE QMLE smear RE CF adjPE SEE
#lm1 30 0.7262 0.7171 91.46282 95.86003 -42.73141 NaN 1.570 1.443000e+00 NA NA NaN 1.585700e+02
#lm2 30 0.8137 0.8075 79.14552 83.54273 -36.57276 NaN 1.359 1.317000e+00 NA NA NaN 1.189600e+02
#lm3 30 0.7183 0.7090 363.71635 368.11356 -178.85818 NaN Inf 1.861805e+65 NA NA NaN 1.092273e+31
I'm trying to prepare a dataset to use as training data for a deep neural network. It consists of 13 .txt files, each between 500 MB and 2 GB in size. However, when I try to run a "data_prepare.py" file, I get the ValueError in this post's title.
Reading answers from previous posts, I have loaded my data into R and checked both for NaN and infinite numbers, but the commands used tell me there appears to be nothing wrong with my data. I have done the following:
I load my data as one single dataframe using the magrittr, data.table and purrr packages (there are about 300 million rows, all with 7 variables):
txt_fread <-
list.files(pattern="*.txt") %>%
map_df(~fread(.))
I have used sapply to check for finite and NaN values:
> any(sapply(txt_fread, is.finite))
[1] TRUE
> any(sapply(txt_fread, is.nan))
[1] FALSE
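For what it's worth, here is a minimal sketch of how is.finite and is.nan behave on a small vector that deliberately contains bad values, so the checks above can be sanity-tested:

```r
x <- c(1, 2, Inf, NaN)
any(is.finite(x))  # TRUE  -- at least one value is finite
all(is.finite(x))  # FALSE -- catches the Inf and the NaN
any(is.nan(x))     # TRUE
```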
I have also tried loading each data frame into a Jupyter notebook and checking individually for those values using the following commands:
file1= pd.read_csv("File_name_xyz_intensity_rgb.txt", sep=" ", header=None)
np.any(np.isnan(file1))
False
np.all(np.isfinite(file1))
True
And when I use print(file1.info()), this is what I get as info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22525176 entries, 0 to 22525175
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 0 float64
1 1 float64
2 2 float64
3 3 int64
4 4 int64
5 5 int64
6 6 int64
dtypes: float64(3), int64(4)
memory usage: 1.2 GB
None
I know the file containing the code (data_prepare.py) works because it runs properly with a similar dataset. I therefore know it must be a problem with the new data I mention here, but I don't know what I have missed or done wrong while checking for NaNs and infinites. I have also tried reading and checking the .txt files individually, but it also hasn't helped much.
Any help is really appreciated!!
Btw: the R code with map_df came from a post by leerssej in How to import multiple .csv files at once?
I'm having a problem understanding how R functions interact with variable names. If you pass a variable name into a function, it seems to behave differently than if you pass the variable value to the function, which confuses me.
I have tried searching the forums, but would appreciate some clarification, as I think there is something fundamentally wrong with my understanding of R.
The following code produces the desired effect:
library(MASS)
hist(Boston$crim,xlab='Crime Rate',ylab='Frequency', main='Frequency plot of Crime Rate')
Expected Behaviour
The histogram titles and labels are all as defined in the function.
The problem arises when I try to do this in a loop to produce multiple plots, taking the labels and titles from vectors. It seems that passing the strings by indexing into a vector doesn't get through to the histogram function.
sectors =c('crim','tax','ptratio')
xlabels =c('Crime Rate','Property Tax Rate', 'Pupil Teacher Ratio')
titles =c('Frequency plot of Crime Rate', 'Frequency plot of Tax Rate', 'Frequency Plot of Pupil:Teacher')
hist(Boston[sectors[1]],ylab='Frequency',xlab=as.character(xlabels[1]),main=as.character(titles[1]))
This produces the wrong plot: the titles and labels are not the ones I specified.
Unexpected behaviour
I'm not observing any error messages, and I'm not entirely sure what to call this effect to google it correctly. I apologize if this has been answered before and would appreciate any and all help.
Thanks in advance
Notice the difference:
str(Boston[sectors[1]])
# 'data.frame': 506 obs. of 1 variable:
#  $ crim: num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
str(Boston[, sectors[1]])
# num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
str(Boston[[sectors[1]]])
# num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
A data frame is a special kind of list so it can get confusing. The first example extracts the first element of the list (which here is also a column) and returns it as a list (a data frame). You should get an error message because the hist function does not take a list/data frame but a vector. The second uses the matrix/data frame way of extracting a column so you get a numeric vector. The third example treats Boston as a list and extracts just the first element and returns it without wrapping it in a list. The ?Extract manual page talks about this but it can take multiple readings to begin to figure it out. Also using str() can help you figure out what you are getting.
Also, your vectors are character so you do not need as.character():
i <- 3
hist(Boston[, sectors[i]], ylab='Frequency', xlab=xlabels[i], main=titles[i])
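Putting that together, the loop the question describes could then be written as (a sketch, assuming the vectors defined in the question):

```r
library(MASS)  # for the Boston data set

sectors <- c('crim', 'tax', 'ptratio')
xlabels <- c('Crime Rate', 'Property Tax Rate', 'Pupil Teacher Ratio')
titles  <- c('Frequency plot of Crime Rate', 'Frequency plot of Tax Rate',
             'Frequency Plot of Pupil:Teacher')

for (i in seq_along(sectors)) {
  # [[ extraction returns the numeric vector that hist() expects
  hist(Boston[[sectors[i]]], ylab = 'Frequency',
       xlab = xlabels[i], main = titles[i])
}
```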
I have this file test.csv, which I have read in with
test <- read.csv("test.csv", check.names=FALSE)
I used check.names=FALSE because the column headers contain brackets; without it they are turned into periods, which causes problems in my code.
I have then done this:
sink(file='interest.txt')
print((test["test$log(I)">=1 & test$number >= 6 , "Name"]),)
My aim is to create a sink file that captures the print output. I want to print the value in the Name column whenever the values in two other columns (log(I) and Number) meet certain thresholds.
log(I) Number Name
1.00 6 LAMP1
3.50 6 MND1
1.20 2 GGD3
0.98 7 KLP1
So in this example, the code would output just LAMP1 and MND1 to the sink file I created.
My issue is that I don't think R is recognising log(I) as the header name, since I seem to get the same result with or without that part of the condition.
If I don't use check.names=FALSE, the column name is turned into log.I. instead. How can I get around this issue?
Thanks
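For reference, a self-contained sketch of the backtick approach, using a reconstruction of the sample table above:

```r
# Rebuild the example table; check.names = FALSE keeps "log(I)" intact
test <- data.frame(`log(I)` = c(1.00, 3.50, 1.20, 0.98),
                   Number   = c(6, 6, 2, 7),
                   Name     = c("LAMP1", "MND1", "GGD3", "KLP1"),
                   check.names = FALSE)

# Non-syntactic column names need backticks (or test[["log(I)"]])
test[test$`log(I)` >= 1 & test$Number >= 6, "Name"]
# [1] "LAMP1" "MND1"
```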
I tried to order a csv file, but the rank() function acts strangely on numbers written in E notation.
> comparison = read.csv("e:/thesis/comparison/output.csv", header=TRUE)
> comparison$proxygeneld_full.txt[0:20]
[1] 9.34E-07 4.04E-06 4.16E-06 7.17E-06 2.08E-05 3.00E-05
[7] 3.59E-05 4.16E-05 7.75E-05 9.50E-05 0.0001116 0.00012452
[13] 0.00015494 0.00017892 0.00017892 0.00018345 0.0002232 0.000231775
[19] 0.00023241 0.0002666
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
> rank(comparison$proxygeneld_full.txt[0:20])
[1] 19.0 14.0 16.0 17.0 11.0 12.0 13.0 15.0 18.0 20.0 1.0 2.0 3.0 4.5 4.5
[16] 6.0 7.0 8.0 9.0 10.0
#It should be 1-20 in order ....
It seems to just ignore the E notation there. Things turn out fine if I'm not using data read from the file:
> rank(c(9.34E-07, 4.04E-06, 7.17E-06))
[1] 1 2 3
Am I missing something ? Thanks.
I guess you have some non-numeric data in your csv file.
What happens if you do the following?
as.numeric(comparison$proxygeneld_full.txt)
If this produces different numbers than you expected, you certainly have some text in this column.
Yep - $proxygeneld_full.txt[0:20] isn't even numeric. It is a factor:
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
So rank() is ranking the numeric codes that lay behind the factor representation, and the E-0X "numbers" sort after the non-E numbers in the levels.
Look at str(comparison) and you'll see that proxygeneld_full.txt is a factor.
I'm struggling to replicate the behaviour you are seeing with E numbers in a csv file; R reads them properly as numeric. Check your CSV to make sure you don't have non-numeric values in that column, and that the E numbers are not quoted.
Ahh! Looking again at the levels you quoted: there is an adjP lurking at the end of the output you show. Check your data again, as that adjP is in there somewhere, and it is forcing R to code the variable as a factor, hence the ranking behaviour I described above.
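Once the stray adjP is cleaned up (or if the column must stay a factor for some reason), the usual idiom, sketched here on a toy factor, is to convert through character before ranking:

```r
# A factor whose levels sort lexically, not numerically
f <- factor(c("9.34E-07", "4.04E-06", "0.0001116"))

# rank() sees the internal level codes, so the order is backwards here
rank(f)
# [1] 3 2 1

# Going factor -> character -> numeric recovers the true values
rank(as.numeric(as.character(f)))
# [1] 1 2 3
```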