Can someone provide equivalent code in SPSS that merges datasets, replicating the rbind and cbind commands available in R? Many thanks!
To add rows from dataset1 to dataset2, you can use ADD FILES. This requires that both datasets hold the same variables, with matching variable names and formats.
To add columns from dataset1 to dataset2, use MATCH FILES. This command matches the values for the desired variables in dataset2 to the right rows in dataset1 using keys present in both files (such as a respondent id). The keys are defined in the BY subcommand.
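For instance, a minimal sketch in SPSS syntax, assuming two open datasets named dataset1 and dataset2 and a hypothetical key variable respondent_id:

* rbind equivalent: stack the cases of the two open datasets.
ADD FILES /FILE=dataset1 /FILE=dataset2.
EXECUTE.

* cbind equivalent: join columns by a key; both files must be sorted by it first.
DATASET ACTIVATE dataset1.
SORT CASES BY respondent_id.
DATASET ACTIVATE dataset2.
SORT CASES BY respondent_id.
MATCH FILES /FILE=dataset1 /FILE=dataset2 /BY respondent_id.
EXECUTE.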
Please note that R and SPSS work in totally different ways. In short, SPSS (mainly) works with datasets in which variables are defined and formatted, while R can handle single values, vectors, matrices, data frames, etc. Simply copying columns from one dataset to another (without paying attention to how the files are sorted), or adding rows without matching the variable names and types in the existing dataset, is very unusual in SPSS.
If you post an example of what you are trying to achieve, I could give you a more useful answer...
I have to deal with data organized by row, so R reads observations as variables and variables as observations. I have tried to transpose the data using the t() function, but R changed all the data to character.
The original file is a .csv one.
Thank you.
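In case it helps, a minimal sketch of one workaround: t() returns a matrix, which can hold only one type, so any character column forces everything to character. The file name and the assumption that the first column holds the variable names are illustrative, not from the question.

# "data.csv" and the first-column layout are assumptions.
dat <- read.csv("data.csv", stringsAsFactors = FALSE)
labels <- dat[[1]]                     # variable names stored row-wise
out <- as.data.frame(t(dat[-1]), stringsAsFactors = FALSE)
names(out) <- labels
# t() may have gone through a character matrix, so restore the types:
out[] <- lapply(out, function(col) type.convert(as.character(col), as.is = TRUE))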
Apologies for what is probably an already-answered question; I couldn't seem to find what I was looking for in the archives.
I'm currently trying to merge multiple Excel files into one df for data analysis.
It's experimental data across different versions, and the column positions of the variables are inconsistent across the Excel files (i.e., in Version 1, ReactionTime is in column AB; in Version 2 it's in AG). I need to merge the values of specified variables from the (~24) data files with different column structures into one long-format df.
I've only ever used an Excel macro to merge files before, and I'm unsure how to go about specifying the variable names for merging. Any help you could provide would be appreciated!
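Assuming the variable names (rather than their positions) are consistent across files, a sketch with readxl and dplyr might look like this; the folder, file pattern, and variable names are illustrative:

library(readxl)
library(dplyr)

files  <- list.files("data", pattern = "\\.xlsx$", full.names = TRUE)
wanted <- c("Participant", "ReactionTime", "Accuracy")   # hypothetical names

merged <- bind_rows(lapply(files, function(f) {
  dat <- read_excel(f)
  dat <- dat[, intersect(wanted, names(dat)), drop = FALSE]
  dat$source <- basename(f)   # record which file each row came from
  dat
}))

bind_rows() matches columns by name and fills anything missing with NA, so the differing column positions across versions do not matter.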
This is the first time I've dealt with Matlab files in R.
The rationale for saving the information as a .mat file was the size (the dataset contains 226,518 rows). We were worried that Excel (and then a csv) would not take them.
I can upload the original file if necessary
So I have my Matlab file, and when I open it in Matlab, all is good.
There are various arrays and the one I want is called "allPoints"
I can open it and then see that it contains values around 0.something.
What I want to do is to extract the same data in R.
library(R.matlab)
# readMat() returns a named list with one element per MATLAB variable
df <- readMat("170314_Col_HD_R20_339-381um_DNNhalf_PPP1-EN_CellWallThickness.mat")
str(df)
And here I get stuck. How do I pull out "allPoints" from it? $ does not seem to work.
I will have multiple files that need to be put together in one single data frame in R, so the plan is to mutate each extracted df to add a column for the sample name, and then rbind them all together.
Could anybody help?
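For reference, a minimal sketch of the extraction step. readMat() returns a named list, and the element names may not match the MATLAB ones exactly (R.matlab can translate characters such as underscores to dots), so checking names() first is worthwhile:

library(R.matlab)

mat <- readMat("170314_Col_HD_R20_339-381um_DNNhalf_PPP1-EN_CellWallThickness.mat")
names(mat)                      # list the MATLAB variables actually present
allPoints <- mat[["allPoints"]]
str(allPoints)                  # typically a numeric matrix or array
df <- as.data.frame(allPoints)
df$sample <- "sample_170314"    # hypothetical label, for the later rbind()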
I'm trying to use the DESeq2 package in R for differential gene expression, but I'm having trouble creating the required RangedSummarizedExperiment object from my input data. I have found several tutorials and vignettes, but they all seem to apply to raw data sets that differ from mine. My data has gene names as row names and patient ids as column names, and the values are simply integer counts. There must be a simple way to create the RangedSummarizedExperiment object from this type of input, but I haven't found one yet. Can anybody help? Thanks.
I had a similar problem understanding how to use this data structure. I eventually managed to do without it by using DESeqDataSetFromMatrix. You can see an example in the first code block of Modify r object with rpy2 (this code is pure R, rpy2 stuff comes after). In this example, I have genes as rows and samples as columns, so it is likely you will be able to adopt the same approach.
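For a count matrix like the one described (genes as rows, patients as columns), a sketch might look like this; the condition labels are hypothetical placeholders for whatever grouping the design actually uses:

library(DESeq2)

# counts: integer matrix with gene names as rownames and patient ids as
# colnames, as described in the question.
coldata <- data.frame(
  condition = factor(rep(c("tumor", "normal"), length.out = ncol(counts))),
  row.names = colnames(counts)
)

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)
res <- results(dds)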
I am currently working with the SparkR and sparklyr package and I think that they are not suitable for high-dimensional sparse data sets.
Both packages follow the paradigm that you select/filter rows and columns of a data frame by simple logical conditions on a few columns or rows. But this is often not what you do on such large data sets: there you need to select rows and columns based on the values of hundreds of row or column entries. Often you first have to calculate statistics on each row or column and then use those values for the selection. Or you want to address only certain values in the data frame.
For example,
1. How can I select all rows or columns that have less than 75% missing values?
2. How can I impute missing values with column- or row-specific values derived from each column or row?
To solve (#2), I need to execute functions on each row or column of a data frame separately. However, even functions like dapplyCollect of SparkR do not really help, as they are far too slow.
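For concreteness, a sparklyr sketch of the column-wise variant of both points; the connection and the local data frame are illustrative, and the row-wise variant has no similarly direct counterpart:

library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
tbl <- sdf_copy_to(sc, my_local_df, name = "dat")   # my_local_df is hypothetical

# (1) fraction of missing values per column, computed on the cluster
miss <- tbl %>%
  summarise_all(~ mean(as.integer(is.na(.)))) %>%
  collect()
keep <- names(miss)[unlist(miss) < 0.75]
tbl_kept <- select(tbl, all_of(keep))

# (2) impute the remaining NAs with each column's mean via Spark's Imputer
# (assumes the kept columns are numeric)
tbl_imputed <- ft_imputer(tbl_kept,
                          input_cols  = keep,
                          output_cols = paste0(keep, "_imp"),
                          strategy    = "mean")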
Maybe I am missing something, but I would say that SparkR and sparklyr do not really help in these situations. Am I wrong?
As a side note, I do not understand how libraries like MLlib or H2O could be integrated with sparklyr if there are such severe limitations, e.g. in handling missing values.