What are the equivalent ways of viewing the beginning (or end) of a data set in Octave / MATLAB?
These functions are incredibly useful for avoiding printing the entire dataset to the console while still getting an idea of the headings and the type of data.
It would be great to also have an equivalent for str() along the same lines...
There is no built-in, but you can easily grab the first N or last M rows.
A = rand(10000, 2);
% First 10 rows
A(1:10, :)
% Last 10 rows
A((end-9):end, :)
The same will work if you are using a table to store your data.
t = table(rand(10000,1), rand(10000,1));
t(1:10,:)
t((end-9):end,:)
Or a dataset:
d = dataset(rand(10000,1), rand(10000,1));
d(1:10,:)
d((end-9):end,:)
You could also create the following head() and tail() anonymous functions to do this:
tail = @(data) disp(data(max(size(data, 1)-9, 1):end, :));
head = @(data) disp(data(1:min(10, size(data, 1)), :));
And use them like normal functions:
head(d)
The Variables editor can be useful for quickly inspecting your data. There's also a handy keyboard shortcut to open a variable in the editor: select the variable name (either in the editor or in the command window) and press Ctrl+D. It also displays structure arrays quite nicely; often that's much easier than inspecting them through the command window.
I'm trying to rename a specific column in my R script using the colnames function, but with no success so far.
I'm kind of new to programming, so it may be something simple to solve.
Basically, I'm trying to rename a column called Reviewer Overall Notes to Nota Final in a data frame called notas with this code:
colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
and it returns to me:
> colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
Error: object 'Nota Final' not found
I also found in [this post][1] code that goes:
colnames(notas) [13] <- `Nota Final`
But it also returns the same message.
What am I doing wrong?
P.S.: Sorry for any misspelling; English is not my primary language.
You probably want
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"
(@Whatif's answer shows how you can do this with the numeric index, but it's probably better practice to do it this way; working with strings rather than column indices makes your code both easier to read [you can see what you're renaming] and more robust [in case the order of columns changes in the future].)
Alternatively,
notas <- notas %>% dplyr::rename(`Nota Final` = `Reviewer Overall Notes`)
Here you do use backticks, because the tidyverse (of which dplyr is a part) prefers its arguments to be passed as symbols rather than strings.
Why use backticks? Use normal quotation marks instead:
colnames(notas)[13] <- 'Nota Final'
This seems to matter:
df <- data.frame(a = 1:4)
colnames(df)[1] <- `b`
Error: object 'b' not found
You should not use single or double quotes in naming:
I have learned that we should not use spaces in names. A name with spaces still works, but it is called a non-syntactic name, and according to Hadley Wickham's description in the Advanced R book, allowing quotes to create such names is due to historical reasons:
"You can also create non-syntactic bindings using single or double quotes (e.g. "_abc" <- 1) instead of backticks, but you shouldn’t, because you’ll have to use a different syntax to retrieve the values. The ability to use strings on the left hand side of the assignment arrow is an historical artefact, used before R supported backticks."
To get an overview of what syntactic names are, use ?make.names:
make.names("Nota Final")
[1] "Nota.Final"
This is the code I am trying to run:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')%>%
mutate(Normalized_value = ((1.8^(data_table$Ubb - data_table$Ct_adj))*10000))
I want to first add the new column ("Ubb") from "new_table" and then add a calculated column using that new column. However, I get an error saying that the Ubb column does not exist. So it's not performing merge before running mutate? When I separate the functions, everything works fine:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')
data_table<-data_table%>%
mutate(Normalized_value = ((1.8^(data_table$Ubb - data_table$Ct_adj))*10000))
I would like to keep everything together just for style, but I'm also just curious: shouldn't R perform merge first and then mutate? How does the order of operations during piping work?
Thank you!
You don't need to refer to the column name with the $ sign, i.e. use Normalized_value = (1.8^(Ubb - Ct_adj)) * 10000,
because the table is merged at that point in the pipe. With the $ sign, R still looks at the original data_table in memory: the assignment has not happened yet and only takes place after all the piped operations complete.
Try running the code like this:
data_table<-data_table%>%
merge(new_table, by = 'Sample ID')%>%
mutate(Normalized_value = ((1.8^(Ubb - Ct_adj))*10000))
Notice that I'm not using the table name with the $ within the pipe. Your way tells mutate to look at a vector pulled from the original data_table, and it may have trouble matching the length of that vector after the merge. Just use the bare variable name within the pipe. It's easiest to think of %>% as 'and then': data_table, and then merge(), and then mutate(). You might also want to think about a left_join instead of a merge.
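For reference, a minimal sketch of that suggestion with left_join in place of merge, using the column names from the question:
library(dplyr)
data_table <- data_table %>%
  left_join(new_table, by = 'Sample ID') %>%
  mutate(Normalized_value = (1.8^(Ubb - Ct_adj)) * 10000)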
Trying to learn the ropes in R and already struggling to find a replacement for SAS macros.
I'm trying to run a piece of code several times, but I'm having a hard time and came here for help.
First, I'm working with this example file, with a variable that gives me the quantity of rows that I have previously analyzed in another file (qtde_registros), followed by three variables that give me the quantity of rows that had different types of errors.
file <- readRDS(file="file.Rda")
file
qtde_registros error1 error2 error3
1 1175 0 0 0
After that, I created a list with the errors and another one with the description of each one of them.
Then, using those lists and the file mentioned initially, I wish to create several files (one for each error) that will later be bound together into one last file to form a final report.
As I said, I'm struggling with it, so I made example code of how the first file would be formed:
error_list <- list("Error1","Error2","Error3")
description_list <- list("Code not found",
"Invalid date.",
"Negative value.")
error1 <- file
error1$file_name <- "Clients"
error1$error <- error_list[1]
error1$qtde <- error1$error1
error1$desc <- description_list[1]
error1 <- select(error1, file_name, error, qtde, desc)
error1
file_name error qtde desc
1 Clients Error1 0 Code not found
And that leads to my question: how can I make the code above run several times, one for each error on my list?
I'm aware that the whole mentality may not be the best, as the approach to certain things differs depending on the language used, but I have to work with the knowledge I have at the moment.
I'm thinking of using the apply family of functions, but I haven't managed to work it out.
Thanks in advance for the help, and sorry for any errors in typing or grammar (English is not my first language).
EDIT: forgot to say that I don't intend to do this via a for or while loop.
In R (and many other languages) you'll be using a form of for-loop. In R there are also several wrappers for loops with specific outcomes: the *apply family. Here's a short (incomplete) list of the *apply family and their input/output:
lapply -> list output
sapply -> list or atomic output (integer vector, numeric vector, etc.)
mapply -> similar to sapply but can take more than one input to iterate over (if, for example, you have two things to loop over simultaneously)
tapply -> loop over groups defined by INDEX
apply -> loop over an array (either rows or columns), returning a matrix/vector
And so on.
I am guessing that your example is incomplete, but I'll show 3 examples to get you started. One using a for-loop, one using lapply and one using mapply.
for-loop
A for-loop is the classic method (found in most programming languages). It works by having for(---), where --- is replaced by something to iterate over. This could be error_list, or it could be a numeric vector seq(1, n) or 1:n. Here you have more than one thing to iterate over, so a numeric vector makes sense (and we use it to subset the data):
errors <- list() # <== Somewhere to put our results
for(i in 1:length(error_list)){
error_i <- list(file = file,
file_name = "Clients",
error = error_list[[i]], # Use i to subset error_list
qtde = error_list[[i]], # Maybe this should be something else in your case
desc = description_list[[i]]
)
# Put into our errors list. Create "error1" using paste and our index
errors[[paste0('error', i)]] <- error_i
}
And by the end, all of your results will be in the errors list, to be extracted using errors[1] or errors[["error1"]] (change the number or name to the error you want). This can then be combined using do.call(rbind, errors) and saved using write.table (or write.csv or similar).
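For instance, a hedged sketch of that combine-and-save step, assuming each element of errors can be row-bound ("error_report.csv" is a hypothetical file name):
combined <- do.call(rbind, errors)                         # stack the per-error results
write.csv(combined, "error_report.csv", row.names = FALSE) # save the final report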
lapply
For the *apply family, the *apply takes care of the looping, but instead we have to provide a function to execute (a macro in SAS terms) in each iteration. So we wrap the contents of the loop in a function:
macro <- function(i){
list(file = file,
file_name = "Clients",
error = error_list[[i]], # Use i to subset error_list
qtde = error_list[[i]], # Maybe this should be something else in your case?
desc = description_list[[i]]
)
}
errors <- lapply(1:length(error_list), macro)
#set names afterwards
names(errors) <- paste0("error", 1:length(error_list))
And once again we have the data ready to be extracted, saved, etc. This is equivalent to:
errors <- list()
for(i in 1:length(error_list))
errors[[i]] <- macro(i)
names(errors) <- paste0("error", 1:length(error_list))
mapply
Now in your case you have more than one thing to iterate over. An alternative is to use mapply and add these as parameters to your function instead. This way we remove error_list[[i]] and description_list[[i]] from the function and instead add them as parameters:
macro_mapply <- function(error, description){
list(file = file,
file_name = "Clients",
error = error, # No need to use i to subset here anymore
qtde = error, # Maybe this should be something else in your case?
desc = description
)
}
errors <- mapply(macro_mapply,
# parameters to iterate over comes after function
error = error_list,
description = description_list,
# Avoid simplification (if we want a list returned)
SIMPLIFY = FALSE)
names(errors) <- paste0("error", 1:length(error_list))
Note that "mapply" will try to return a vector if possible, so I set SIMPLIFY = FALSE to avoid this.
Things to note:
In the above 3 examples I have not taken into account whether you read multiple files, or any other parameters changing. So if you have to read a file in each iteration, it will make sense to go with the first 2 examples and add readRDS to the loop or function with appropriate file naming. Also, I have used your data, but I am guessing qtde and error should be different in your specific case; this is not clear from your example.
I hope this helps get you started. Once you've gotten the hang of your first loops and somewhat understand how the *applys work, I would suggest checking out the tidyverse, which provides what many find to be a more user-friendly and intuitive interface for data transformation.
I have a list of identifiers as follows:
url_num <- c('85054655', '85023543', '85001177', '84988480', '84978776', '84952756', '84940316', '84916976', '84901819', '84884081', '84862066', '84848942', '84820189', '84814935', '84808144')
And from each of these I'm creating a unique variable:
for (id in url_num){
assign(paste('test_', id, sep = ""), FUNCTION GOES HERE)
}
This leaves me with my variables which are:
test_85054655, test_85023543, etc., etc.
Each of them holds the correct output from the function (I've checked), but my next step is to combine them into one big vector which holds each of these created variables as a separate element. This is easy enough via:
c(test_85054655,test_85023543,test_85001177,test_84988480,test_84978776,test_84952756,test_84940316,test_84916976,test_84901819,test_84884081,test_84862066,test_84848942,test_84820189,test_84814935,test_84808144)
However, as I update the original 'url_num' vector with new identifiers, I'd also have to come down to the above chunk and update this too!
Surely there's a more automated way I can setup the above chunk?
Maybe some sort of concat() function in the original for-loop which just adds each created variable straight into an empty vector right then and there?
So far I've just been trying to list all the variable names and somehow get the output to be in an acceptable format to get thrown straight into the c() function.
for (id in url_num){
cat(as.name(paste('test_', id, ",", sep = "")))
}
...which results in:
test_85054655,test_85023543,test_85001177,test_84988480,test_84978776,test_84952756,test_84940316,test_84916976,test_84901819,test_84884081,test_84862066,test_84848942,test_84820189,test_84814935,test_84808144,
This is close to the output I'm looking for, but because it's using the cat() function it's essentially a print statement, and its output can't really be put anywhere. Not to mention I feel like this method is wrong to begin with and there must be something simpler I'm missing.
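For what it's worth, base R's mget() looks closer to what I mean: it fetches existing variables by name into a named list (a sketch; I'm not sure it's the right approach):
# mget() returns a named list of the variables whose names are given;
# do.call(c, ...) then splices that list into a single vector
all_tests <- do.call(c, mget(paste0('test_', url_num)))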
Thanks in advance for any help you guys can give me!
Troy
I have a file with 15 million lines (will not fit in memory). I also have a small vector of line numbers - the lines that I want to extract.
How can I read out those lines in one pass?
I was hoping for a C function that does it in one pass.
The trick is to use a connection AND open it before read.table:
con <- file('filename')
open(con)
read.table(con, skip = 5, nrow = 1)   # 6th line
read.table(con, skip = 20, nrow = 1)  # 27th line (skip counts from the current position)
...
close(con)
You may also try scan; it is faster and gives more control.
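If you need many scattered lines, the same connection idea generalizes to a single pass with readLines: a minimal sketch, assuming wanted is a sorted vector of line numbers:
con <- file('filename')
open(con)
out <- character(length(wanted))
prev <- 0
for (i in seq_along(wanted)) {
  gap <- wanted[i] - prev - 1
  if (gap > 0) readLines(con, n = gap)  # skip the lines in between
  out[i] <- readLines(con, n = 1)       # read the wanted line
  prev <- wanted[i]
}
close(con)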
If it's a binary file
Some discussion is here:
Reading in only part of a Stata .DTA file in R
If it's a CSV or other text file
If they are contiguous and at the top of the file, just use the nrows argument to read.csv or any of the read.table family. If not, you can combine the nrows and skip arguments to repeatedly call read.csv (reading in a new row or group of contiguous rows with each call) and then rbind the results together.
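A minimal sketch of that skip/nrows approach, assuming wanted holds sorted line numbers and the file has no header (note that each call re-reads from the top, so this is slow for many lines):
rows <- lapply(wanted, function(n) read.csv('filename', skip = n - 1, nrows = 1, header = FALSE))
result <- do.call(rbind, rows)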
If your file has fixed line lengths then you can use 'seek' to jump to any character position. So just jump to N * line_length for each N you want, and read one line.
However, from the R docs:
Use of seek on Windows is discouraged. We have found so many errors in the Windows implementation of file positioning that users are advised to use it only at their own risk, and asked not to waste the R developers' time with bug reports on Windows' deficiencies.
You can also use fseek from the standard C library in C, but I don't know whether the above warning also applies!
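For reference, a minimal sketch of the fixed-width idea in R, assuming every line is exactly line_length bytes including the newline (and keeping the Windows caveat above in mind):
con <- file('filename', 'rb')
for (n in wanted) {
  seek(con, where = (n - 1) * line_length)  # jump to the start of line n
  print(readLines(con, n = 1))
}
close(con)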
Before I was able to get an R solution, I did it in Ruby:
#!/usr/bin/env ruby
NUM_SEQS = 14024829
linenumbers = (1..10).collect{(rand * NUM_SEQS).to_i}
File.open("./data/uniprot_2011_02.tab") do |f|
while line = f.gets
print line if linenumbers.include? f.lineno
end
end
It runs fast (as fast as my storage can read the file).
I compiled a solution based on the discussion here:
scan(filename, what = list(NULL), sep = '\n', blank.lines.skip = FALSE)
This will only show you the number of lines, but will read in nothing. If you really want to skip the blank lines, you could just set the last argument to TRUE.