I have a data set in a wide format, consisting of two rows: one with the variable names and one with the corresponding values. The variables represent characteristics of individuals from a sample of size 1000. For instance, I have 1000 variables for the size of each individual, then 1000 variables for the height, then 1000 variables for the weight, etc. Now I would like to run simple regressions (say, weight on calorie consumption). The only way I can think of doing this is to declare a vector that contains the 1000 observations of each variable, for instance:
regressor1=c(mydata$height0, mydata$height1, mydata$height2, mydata$height3, ... mydata$height1000)
But given that I have a few dozen variables, each containing 1000 observations, this will become cumbersome. Is there a way to do this with a loop?
I have also thought about the reshape options of R, but this again would put me in a position where I have to type 1000 variable names a few dozen times.
Thank you for your help.
Here is how I would go about your issue. t() will transpose the data for you from many columns to many rows.
Note: t() works on a matrix as well as a data frame; I simply coerced to a data frame to show that my example will work with your data.
# Many columns, 2 rows
x <- as.data.frame(matrix(1:2000, nrow = 2, ncol = 1000))
# 2 columns, many rows
t(x)
Based on your comments you are looking to generate vectors.
If you have transposed (e.g. xt <- t(x)):
regressor1 <- xt[, 1]
regressor2 <- xt[, 2]
If you have not transposed:
regressor1 <- x[1,]
regressor2 <- x[2,]
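If the real columns follow a naming pattern such as height0 ... height1000, an alternative sketch is to pick them out by name instead of typing each one (this assumes a one-row data frame called mydata with those column names):
# collect every height* column into a single numeric vector
height_cols <- grep("^height", names(mydata), value = TRUE)
regressor1  <- unlist(mydata[1, height_cols], use.names = FALSE)
The same grep() pattern works for the weight and calorie columns, so a short loop over the variable prefixes can build all the regressors.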
I am plotting data that consists of intervals that are more or less constant, plus spikes that originate from the data being a quotient of two parameters. The very large and very small quotients aren't relevant for my purpose, so I have been looking for a way to filter these out. The dataset contains 40k+ values, so I cannot remove the high/low quotients manually.
Is there any function that can trim/filter out the very large/small quotients?
You can use the filter() function from dplyr to create a new data frame without the outliers, which you can then plot. For example:
library(dplyr)
no_spikes <- filter(original_df, x > -100 & x < 100)
This creates a new data frame, no_spikes, that only contains observations where the variable x lies between -100 and 100.
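If you would rather not hard-code the cutoffs, a small variation is to derive them from the data; this sketch assumes the quotient column is named x and arbitrarily trims at the 1% and 99% quantiles:
# compute data-driven limits, then keep only the values between them
lims <- quantile(original_df$x, c(0.01, 0.99), na.rm = TRUE)
no_spikes <- filter(original_df, x > lims[1] & x < lims[2])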
I have a .csv file of 39 variables and 713 rows, each containing a count of plastic items. There is another column with the survey length, and I want to standardise each count of items to a survey length of 100. I am unsure how to create a loop that runs through each row and cell individually to do this, and many cells also contain NA values.
Any ideas would be great.
Thank you.
Consider applying the formula directly to the columns, without the need for a loop:
# RETRIEVE ALL COLUMN NAMES (MINUS SURVEY LENGTH)
vars <- names(df)[!grepl("survey_length", names(df))]
# EXPAND SINGLE COLUMN TO EQUAL DIMENSION OF DATA FRAME
survey_length_mat <- matrix(df$survey_length, ncol=length(vars), nrow=nrow(df))
# APPLY FORMULA
df[vars] <- (df[vars] / survey_length_mat) * 100
df
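An equivalent sketch uses sweep() to divide every row by its own survey length; it assumes the same df and vars as above, and NA counts simply stay NA after the division:
# divide each row of the count columns by that row's survey_length, then scale to 100
df[vars] <- sweep(df[vars], 1, df$survey_length, "/") * 100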
My data frame has a first column of factors, and all the other columns are numeric.
Origin spectrum_740.0 spectrum_741.0 spectrum_742.0 etc....
1 Warthog 0.6516295 0.6520196 0.6523843
2 Tiger 0.4184067 0.4183569 0.4183805
3 Sperm whale 0.9028763 0.9031688 0.9034069
I would like to convert the data frame into two variables, a vector (the first column) and a matrix (all the numeric columns), so that I can do calculations on the matrix, such as applying msc from the pls package. Basically, I want the data frame to be like the gasoline data set from pls, which has one variable as a numeric vector, and a second variable called NIR as a matrix with 401 columns.
Alternatively, if you have any suggestions for applying calculations to the numeric data while keeping the Origin column connected, that would work too, but all the examples I have seen use gasoline or similarly formatted data frames to do the calculations on the NIR matrix.
Thank you!
M <- as.matrix(df[, -1])
row.names(M) <- df[, 1]
M
spectrum_740.0 spectrum_741.0 spectrum_742.0
Warthog 0.6516295 0.6520196 0.6523843
Tiger 0.4184067 0.4183569 0.4183805
Sperm_whale 0.9028763 0.9031688 0.9034069
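If you also want the gasoline-style layout, a data frame can hold the matrix as a single column so the Origin labels stay attached; this sketch assumes df is the original data frame, M is the matrix built above, and the pls package is installed:
library(pls)                              # for msc()
spectra <- data.frame(Origin = df[, 1])
spectra$NIR <- M                          # a matrix stored as one column, like gasoline$NIR
corrected <- msc(spectra$NIR)             # multiplicative scatter correction on the matrix part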
I'm relatively new to R, so excuse me if I'm not even posting this question the right way.
I have a matrix generated from combination function.
double_expression_combinations <- combn(marker_column_vector,2)
This matrix has 2 rows and x columns. Each column holds 2 numbers that represent column numbers in my main data frame, named initial; these column numbers are the combinations of columns to be tested. The initial data frame has 27 columns (and thousands of rows) with values of 1 and 0. The test consists of taking the 2 numbers given by double_expression_combinations as column numbers of initial, adding those 2 columns row by row, and counting how many times the sum equals 2.
I believe I can come up with the counting part; I just don't know how to use the data from the double_expression_combinations data frame to select which columns to test from the "initial" data frame.
Edited to apply the corrections made by commenters.
When using R, it's important to keep your terminology precise. double_expression_combinations is not a data frame but rather a matrix. It's easy to loop over the columns of a matrix with apply(). I'm a bit unclear about the exact test, but this might succeed:
apply(double_expression_combinations, 2,   # the 2 selects each column in turn
      function(cols) sum(initial[, cols[1]] + initial[, cols[2]] == 2))
Both the '+' and '==' operators are vectorised so no additional loop is needed inside the call to sum.
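A quick self-contained check of that idea; the 0/1 data here is made up and marker_column_vector is assumed to be the 27 column numbers:
set.seed(42)
initial <- as.data.frame(matrix(rbinom(27 * 1000, 1, 0.5), ncol = 27))
marker_column_vector <- 1:27
double_expression_combinations <- combn(marker_column_vector, 2)
counts <- apply(double_expression_combinations, 2,
                function(cols) sum(initial[, cols[1]] + initial[, cols[2]] == 2))
length(counts)   # one count per pair of columns: choose(27, 2) = 351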
I have 5 categorical variables: age(5 levels), sex(2 levels), zone(4 levels), qmat(5 levels), and qsoc(5 levels) for a total of 1000 unique combinations. Each unique combination has a corresponding data value (e.g. population size). I would like to assign this data to a 1000 x 6 table where the first five columns contain the indices of age, sex, zone, qmat, qsoc and the 6th column holds the data value.
I would like to avoid using nested for loops which are inefficient in R (some of my datasets will have more than 1000 unique combinations). I know there exist many tools in R for parallel operations (but am not familiar with them). Is there an efficient way to perform the above variable assignment using parallel/vector operations? Any suggestions or references would be appreciated.
It's hard to tell what your original data looks like, but assuming you have it in a data frame, you may want to use aggregate().
# simulating a data frame
set.seed(1)
N <- 9000
df <- data.frame(pop = rnorm(N),
                 age = sample(1:5, N, replace = TRUE),
                 sex = sample(1:2, N, replace = TRUE))
# 'aggregate' this data frame by 'age' and 'sex'
newData <- aggregate(pop ~ age + sex, data = df, FUN = sum)
The R function expand.grid() solves my problem, e.g.
expand.grid(list(age,sex,zone,qmat,qsoc))
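For reference, naming the arguments gives the resulting columns meaningful names; this sketch assumes each variable is simply the vector of its level indices:
grid <- expand.grid(age = 1:5, sex = 1:2, zone = 1:4, qmat = 1:5, qsoc = 1:5)
nrow(grid)   # 5 * 2 * 4 * 5 * 5 = 1000 unique combinations
# the data value can then be attached as a 6th column, e.g. by merging on the five keys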
Thanks for all the responses and I apologize for any possible vagueness in the wording of my question.