I deal with large datasets frequently. Unfortunately, RStudio IDE's environment shows row count of data.frame like 156212811 and it'll be really useful if it shows in readable notation (like 156,212,811).
Is there any way how I can force it to use commas in row count?
Attached screenshot for reference
Related
I have had this problem for ages. Sometimes I am looking for the index of a column in a dataset, with View() in Rstudio. I already know I have to watch out with this though. After working on the datasets (adding and removing columns), the column number shown in Rstudio, does not match the actual index of the column. Often Rstudio even gives me a column that is out of range (i.e. 240 when the dataset only has 145 columns). No amount of updates or refreshing has made this issue any better. Even making a copy does not solve the issue. Does anyone know where this problem stems from, whether there is a fix or is going to be a fix?
The Issue:
.
The actual columns:
I have the question that is linked to the financial data of stock (open price, close close, high, low). Since the data which we download are not always the similar one, it's the problem to automize the code where this data are used.
F.E. sometimes I download the data that have the next columns:
open close high low
Sometimes this columns may be names as:
open_ask close_bid high low
Is there function in R which allows to work with data, where the columns may be named similar but not exactly same name? F.e. I want to plot the candle chart, and it's required that R may use the necessary column, where the open and close price are.
You could try identifying columns in your data frame using a regex which provides a logical match. For example, to match the open or open_ask columns, you could use:
open_col <- df[, grepl("open", names(df))]
If the names cannot be correlated in any meaningful way, then you might be able to go by position. But this runs the risk of error should columns shift position, whereas a regex works regardless of where a potentially matching column is positioned.
My goal of this code is to create a loop that aggregates each company's word frequency by a certain principle vector I created and adds it to a list. The problem is, after I run this, it only prints the 7 principles that I have rather than the word frequencies along side them. The word frequencies being the certain column of the FREQBYPRINC.AG data frame. Individually, running this code without the loop and just testing out a certain column, it works no problem. For some reason, the loop doesn't want to give me the correct data frames for the list. Any suggestions?
list.agg<-vector("list",ncol(FREQBYPRINC.AG)-2)
for (i in 1:14){
attach(FREQBYPRINC.AG)
list.agg[i]<-aggregate(FREQBYPRINC.AG[,i+1],by=list(Type=principle),FUN=sum,na.rm=TRUE)
}
I really wish I could help. After reading your statement, It seems that to you , you feel that the code should be working and it is not. Well maybe there exists a glitch.
Since you had previously specified list. agg as a list, you need to subset it with double square brackets. Try this one out:
list.agg<-vector("list",ncol(FREQBYPRINC.AG)-2)
for (i in 1:14){
list.agg[[i]]<-aggregate(FREQBYPRINC.AG[,i+1],by=list
(Type=principle),FUN=sum,na.rm=TRUE)}
So I'm trying to manipulate a simple Qualtrics CSV, and I want to use colSums on certain columns of data, given a certain filter.
For example: within the .csv file called data, I want to get the sum of a few columns, and print them with certain labels (say choice1, choice2 etc). That is easy enough by itself:
firstqn<-data.frame(choice1=data$Q7_2,choice2=data$Q7_3,choice3=data$Q7_4);
secondqn<-data.frame(choice1=data$Q8_6,choice2=data$Q8_7,choice3=data$Q8_8)
print colSums(firstqn); print colSums(secondqn)
The problem comes when I want to repeat the above steps with different filters, - say, only the rows where gender==2.
The only way I know how is to create a new dataset data2 and replace data$ with data2$ in every line of the above code, such as:
data2<-(data[data$Q2==2,])
firstqn<-data.frame(choice1=data2$Q7_2,choice2=data2$Q7_3,choice3=data2$Q7_4);
however i have 6 choices for each of 5 questions and am planning to apply about 5-10 different filters, and I don't relish the thought of copy/pasting data2 and `data3' etc hundreds of times.
So my question is: Is there any way of getting R to reference data by default without using data$ in front of every variable name?
I can probably use attach() to achieve this, but i really don't want to:
data2<-(data[data$Q2==2,])
attach(data2)
firstqn<-data.frame(choice1=Q7_2,choice2=Q7_3,choice3=Q7_4);
detach(data2)
is there a command like attach() that would allow me to avoid using data$ in front of every variable, for a specified amount of code? Then whenever I wanted to create a new filter, I could just copy/paste the same code and change the first command (defining a new dataset).
I guess I'm looking for some command like with(data2, *insert multiple commands here*)
Alternatively, if anyone has a better way to do the above in an entirely different way please enlighten me - i'm not very proficient at R (yet).
I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data = subset(old_data, select = ColumnA >10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new data = old_data[old_data$ColumnA >10, c(1, 4, 89)]
where I am passing the column indices instead.
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.
I am not sure what exactly you are referring to when you write that "the read/write and data management speed of SPSS being significantly better" than R. Your question itself demonstrates how flexible R is at data management! And, a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file created that has just three columns, and where no value in the "lactatio" column is greater than 3.
So, the basic steps are:
Load the data you want to work with.
Subset for all columns from all the cases you're interested in.
Save a new file with only the variables you're interested in.
Load that new file before you proceed.
With R, the basic steps are:
Load the data you want to work with.
Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If you're using the right tools with R, you can also directly read in the specific subset you are interested in without first loading the whole dataset if speed really is an issue.
In spss you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.