R - table of a feature in a data frame, but only if x occurrences

I have a data frame in R, and when I do something such as:
table(data$brand)
I get about a hundred factor levels (many with 0 after cleaning the data), and many with only 1 or 2 occurrences. I only care about the levels with more than 50 occurrences. Is there a way to get a table like that instead of reading through the long list?

We can subset the table, keeping only the entries above the threshold:
tbl <- table(data$brand)
tbl[tbl > 50]
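For a self-contained illustration (with a toy vector standing in for data$brand), the filtered table can also be sorted, and names() pulls out just the level labels:

```r
# Toy vector standing in for data$brand (hypothetical values)
brand <- c(rep("acme", 60), rep("omni", 55), rep("zeta", 3))

counts <- table(brand)
big <- counts[counts > 50]      # keep only levels with more than 50 occurrences
sort(big, decreasing = TRUE)    # most frequent level first
names(big)                      # just the level labels, e.g. for later filtering
```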

Related

R - Performance between subset and creating DF from vectors

I was wondering whether there is a large performance difference/impact for big datasets when you subset the data.
In my scenario, I have a data frame with just under 29,000 records.
When I had to subset the data, I thought of 2 ways to do it.
The data is read from a CSV file inside a reactive expression.
option 1
long_lat_df <- reactive({
  long_lat <- subset(readFile(), select = c(Latitude..deg., Longitude..deg.))
  return(long_lat)
})
option 2
What I had in mind was to extract the 2 columns and assign each to its own variable, long and lat. From there I can combine the 2 columns into a new data frame that I can use for spatial analysis.
Would there be a potential performance impact between the 2 options?
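For what it's worth, at this scale the difference is negligible; here is a minimal sketch of both options outside of the reactive (the column names are taken from the question, and the toy data frame below stands in for the CSV read by readFile()):

```r
# Toy stand-in for the CSV contents read by readFile()
df <- data.frame(Latitude..deg.  = runif(29000),
                 Longitude..deg. = runif(29000),
                 Other           = 1)

# Option 1: subset() with select
opt1 <- subset(df, select = c(Latitude..deg., Longitude..deg.))

# Option 2: pull each column out, then recombine into a new data frame
lat  <- df$Latitude..deg.
long <- df$Longitude..deg.
opt2 <- data.frame(Latitude..deg. = lat, Longitude..deg. = long)

identical(opt1$Latitude..deg., opt2$Latitude..deg.)  # the data is the same either way
```

At ~29,000 rows either option runs in milliseconds, so readability rather than performance is the deciding factor; a third common idiom is plain column indexing, readFile()[, c("Latitude..deg.", "Longitude..deg.")].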

Using data frame values to select columns of a different data frame

I'm relatively new in R so excuse me if I'm not even posting this question the right way.
I have a matrix generated by the combn function.
double_expression_combinations <- combn(marker_column_vector,2)
This matrix has x columns and 2 rows. Each column holds 2 numbers that represent column numbers in my main data frame, named initial. These pairs are the combinations of columns to be tested. The initial data frame has 27 columns (and thousands of rows) with values of 1 and 0. The test uses the 2 numbers given by double_expression_combinations as column numbers into initial: add each row of those 2 columns and count how many times the sum equals 2.
I believe I can come up with the counting part; I just don't know how to use the values from double_expression_combinations to select the columns to test from the "initial" data frame.
Edited to fix corrections made by commenters
When using R, it's important to keep your terminology precise: double_expression_combinations is not a data frame but rather a matrix. It's easy to loop over the columns of a matrix with apply. I'm a bit unclear about the exact test, but this might succeed:
apply(double_expression_combinations, 2,  # the 2 selects each column in turn
      function(cols) { sum(initial[, cols[1]] + initial[, cols[2]] == 2) })
Both the '+' and '==' operators are vectorised, so no additional loop is needed inside the call to sum.
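A quick way to sanity-check this is with toy data in place of initial (the row count below is made up, but the column count matches the question):

```r
set.seed(42)
# Toy stand-in for the 27-column 0/1 data frame "initial"
initial <- as.data.frame(matrix(rbinom(27 * 100, 1, 0.5), ncol = 27))
marker_column_vector <- 1:4
double_expression_combinations <- combn(marker_column_vector, 2)  # a 2 x 6 matrix

counts <- apply(double_expression_combinations, 2,
                function(cols) { sum(initial[, cols[1]] + initial[, cols[2]] == 2) })
counts  # one "both columns equal 1" count per column pair
```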

missing values for each participant in the study

I am working in R. What I want to do is make a table or a graph that represents, for each participant, their missing values. I.e., I have 4700+ participants, and for each question there are between 20 and 40 missings. I would like to represent the missings in such a way that I can see who the people are that did not answer the questions, and possibly look at whether there is a pattern in the missing values. I have done the following:
# Count of complete cases in a data frame named 'mydata'
sum(complete.cases(mydata))
# Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
# Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers (that I am not quite sure how to interpret; at first I thought these were the patient numbers, but then I noticed that this is not the case).
I also tried making subsets with only the missings, but then I literally only see how many missings there are, not who they are from.
Could somebody help me? Thanks!
Zas
If there is a column that can uniquely identify a row in the data frame mydata, say a patient number patient_no, then you can easily find the patient numbers of the missing people:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which users have missed a particular question, this might be useful:
Assumption: except column 1, all other columns correspond to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attaches numbers to the observations in your data set. For example, if your data has 20 observations (20 rows), R attaches the numbers 1 to 20; these are row numbers, not part of your original data. The results produced by which(!complete.cases(mydata$Variable1)) are those numbers: the rows of your data set that have at least one missing value.
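To go from row numbers to a per-patient overview (e.g. to start looking for a pattern), one sketch, assuming a patient_no column as above, is to count the NAs in each row:

```r
mydata <- data.frame(patient_no = 1:5,
                     q1 = c(NA, NA, 1, 2, 3),
                     q2 = c(1, NA, NA, 2, 3))

# Number of unanswered questions per patient (all columns except patient_no)
n_missing <- rowSums(is.na(mydata[, -1]))
overview  <- data.frame(patient_no = mydata$patient_no, n_missing = n_missing)
overview[overview$n_missing > 0, ]  # only patients with at least one missing
```

With 4700+ participants, sorting this overview (or plotting n_missing) makes the heavy non-responders easy to spot.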

Transpose/Reshape Data in R

I have a data set in a wide format consisting of two rows: one with the variable names and one with the corresponding values. The variables represent characteristics of individuals from a sample of size 1000. For instance, I have 1000 variables for the size of each individual, then 1000 variables for the height, then 1000 variables for the weight, etc. Now I would like to run simple regressions (say, weight on calorie consumption). The only way I can think of doing this is to declare a vector that contains the 1000 observations of each variable, for instance:
regressor1=c(mydata$height0, mydata$height1, mydata$height2, mydata$height3, ... mydata$height1000)
But given that I have a few dozen variables, each with 1000 observations, this will become cumbersome. Is there a way to do this with a loop?
I have also thought about the reshape options in R, but that again would put me in a position where I have to type 1000 variable names a few dozen times.
Thank you for your help.
Here is how I would go about your issue. t() will transpose the data for you from many columns to many rows.
Note: t() returns a matrix even when given a data frame; I simply coerced back to a data frame to show that my example will work with your data.
# Many columns, 2 rows
x <- as.data.frame(matrix(1:2000, nrow = 2, ncol = 1000))
# 2 columns, many rows
t(x)
Based on your comments you are looking to generate vectors.
If you have transposed:
regressor1 <- x[,1]
regressor2 <- x[,2]
If you have not transposed:
regressor1 <- x[1,]
regressor2 <- x[2,]
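Putting it together for the regression use case: transpose once, name the columns, and every variable becomes an ordinary column you can pass to lm(). The height/weight names below are assumptions about your data, and the toy values are constructed so the fit is exact:

```r
# Toy wide data: row 1 holds the 1000 height values, row 2 the weight values
x <- as.data.frame(matrix(1:2000, nrow = 2, ncol = 1000, byrow = TRUE))

long <- as.data.frame(t(x))   # now 1000 rows, 2 columns
names(long) <- c("height", "weight")

fit <- lm(weight ~ height, data = long)
coef(fit)  # here weight = height + 1000 exactly, so slope 1, intercept 1000
```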

Vector of vectors in R?

I have a large set of data and I'm trying to group different rows together. I will know how to group the rows by using an ID. In the dataset, these IDs are sequential.
So what I want to do is iterate through this set of data and then place the data contained in these rows into a vector of vectors for processing later. The data contained in these rows of identical ID are going to be compared with one another to categorize the groupings.
I would like my data structure to look like something like this.
1 -> 1 -> 1
|
V
2 -> 2
So the first element would contain only data from one ID, and the next element in the vector would be a vector for another ID. How would I go about doing this in R? In C++ it would just be a vector of vectors, but I haven't been able to figure out how to do the same in R.
Is this even the right way to be approaching this problem? Is there a better way to do what I'm trying to do?
You would want to work with data frames rather than simple matrices; the closest R analogue to a C++ vector of vectors is a list of vectors. Have a look at the R-tutor documentation on data frames.
It is doable. Best!
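Concretely, split() builds such a list of vectors directly from an ID column. A minimal sketch with made-up data:

```r
# Toy data: rows tagged with sequential IDs, as in the question
dat <- data.frame(id    = c(1, 1, 1, 2, 2),
                  value = c(10, 11, 12, 20, 21))

groups <- split(dat$value, dat$id)  # a list of vectors, one per ID

groups[["1"]]   # all values for ID 1: 10 11 12
length(groups)  # number of distinct IDs: 2
```

You can then iterate over groups with lapply() to run your comparison within each ID's group.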
