I have 34 subsets with a number of variables each, and I am building a new data frame with summary information about each variable for the subsets.
- Example: A10, T2, and V2 are all subsets with ~10 variables and 14 observations, where one variable is population.
I want my new data frame to have a column that says how many times variable 2 hit zero in each subset.
I've looked at a number of different count functions, but they all seem to build separate tables and count the occurrences of every value. I'm not interested in how many times each unique value shows up, because most of the values are unique; I just want to know how many times population hit zero in each subset of 14 observations.
I realize this is probably a simple thing to do but I'm not very good at creating my own solutions from other R code yet. Thanks for the help.
I've done something similar with a different dataset where I counted how many times 'NA' occurred in a vector where all the other values were numerical. For that I used:
na.tmin <- c(sum(is.na(s1997$TMIN)), sum(is.na(s1998$TMIN)), sum(is.na(s1999$TMIN))...
which created a column (na.tmin) holding the number of times each subset recorded NA instead of a number. I'd like to count the number of times the value 0 occurred in the same way, but is.0 is of course not a function, since 0 is an ordinary numeric value rather than a special value like NA. Is there a function that will simply count the number of times a specific value shows up? If there isn't, should I use the count-occurrences-of-unique-values function instead?
Perhaps:
sum( abs( s1997$TMIN ) < 0.00000001 )
It's safer to use a tolerance unless you are sure your values are stored as integers; comparing floating-point numbers for exact equality is unreliable. See R FAQ 7.31.
sum( abs( pi - (355/113+seq(-0.001, 0.001, length=1000 ) ) )< 0.00001 )
[1] 10
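Applied to the question, a minimal sketch (assuming the subsets are data frames named A10, T2, and V2, each with a numeric population column; extend the list to cover all 34 subsets):

subsets <- list(A10 = A10, T2 = T2, V2 = V2)  # hypothetical subset objects
# count zeros per subset, with a tolerance to guard against floating-point noise
zero.pop <- sapply(subsets, function(s) sum(abs(s$population) < 1e-8))

If population is stored as integers, sum(s$population == 0) works just as well.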
I want to find the series of consecutive rows in a dataset where a condition is met most often.
I have two columns I can use for this: either one with ones and zeros that alternate based on the presence or absence of a condition, or a column that increments for as long as the desirable condition is present. I imagine I will need subset(), filter(), and/or rle() to make this happen, but I am at a loss as to how to get them to work.
In the example, I want to find the 6 sequential rows that maximize the number of times happens occurs.
Given the input:
df <- data.frame(time = 1:10,
                 happens = c(1, 1, 0, 0, 1, 1, 1, 0, 1, 1),
                 count = c(1, 2, 0, 0, 1, 2, 3, 0, 1, 2))
I would like to see as the output the rows 5 through 10, inclusive, as the data subset output, using either the happens or count columns since this sequence of rows would yield the highest output of happens occurrences on 6 consecutive rows.
library(zoo)
which.max(rollapply(df$happens, 6, sum))
#[1] 5
The fifth rolling window of 6 rows holds the maximum sum of df$happens, so the answer is rows 5:10.
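To extract the actual rows rather than just the window index, a small follow-up using the same df and zoo code as above:

start <- which.max(rollapply(df$happens, 6, sum))  # first row of the best window
df[start:(start + 5), ]                            # rows 5 through 10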
I have a time-series/panel-data data frame with a unique ID in the first column and a weekly employment status: unemployed (1) or employed (0).
I have 261 variables (one per week) and 1,000,000 observations.
I would like to count, for every row, the maximum number of times '1' occurs consecutively.
I have looked a bit at rowSums and rle(), but the sum of a row is not what I am after, since it is essential that the values are consecutive.
We can write a little helper function that returns the maximum number of times a given value is consecutively repeated in a vector, with a convenient default value of 1 for this use case:
most_consecutive_val <- function(x, val = 1) {
  # rle() encodes runs of equal values; keep the lengths of runs equal to `val`
  with(rle(x), max(lengths[values == val]))
}
Then we can apply this function to the rows of your data frame, dropping the first column (and any other columns that shouldn't be included):
apply(your_data_frame[-1], MARGIN = 1, most_consecutive_val)
If you share some easily imported sample data, I'll be happy to help debug in case there are issues. dput is an easy way to share a copy/pasteable subset of data, for example dput(your_data[1:5, 1:10]) would be a great way to share the first 5 rows and 10 columns of your data.
If you want to avoid warnings and -Inf results in the case where there are no 1s, use Ryan's suggestion from the comments:
most_consecutive_val <- function(x, val = 1) {
  # return 0 instead of -Inf (with a warning) when `val` never occurs in x
  with(rle(x), if (all(values != val)) 0 else max(lengths[values == val]))
}
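A quick check on hypothetical toy data (two IDs, four weeks of status):

toy <- data.frame(id = 1:2,
                  w1 = c(1, 0), w2 = c(1, 0), w3 = c(0, 0), w4 = c(1, 1))
apply(toy[-1], MARGIN = 1, most_consecutive_val)
#[1] 2 1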
I have a large dataset with more than 10 million records and 20 variables. I need every possible combination of 11 of those 20 variables, and for each combination its frequency should also be displayed.
I have tried count() in the plyr package and the table() function, but neither can produce all possible combinations, since the number of combinations is very high (greater than 2^32) and the result is huge.
Assume a dataset with 5 variables and 6 observations, where I want all combinations of the first three variables whose frequency is greater than 0.
Is there any other function to achieve this? I am only interested in combinations whose frequency is non-zero.
Thanks!
OK, I think I have an idea of what you require. If you want the count of rows in your table by N categories, you can do so with the data.table package. It will give you the count of all combinations that actually exist in the table. Simply list the required categories in the by argument:
library(data.table)

DT <- data.table(val  = rnorm(1e7),
                 cat1 = sample.int(10, 1e7, replace = TRUE),
                 cat2 = sample.int(10, 1e7, replace = TRUE),
                 cat3 = sample.int(10, 1e7, replace = TRUE))
DT_count <- DT[, .N, by = .(cat1, cat2, cat3)]
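Since .N tabulates only the combinations that actually occur, every row of DT_count already has a non-zero frequency, so no filtering is needed. A possible follow-up to inspect the most frequent combinations, plus a pattern for listing many grouping columns (the three names here stand in for your 11 actual variables):

setorder(DT_count, -N)   # sort by frequency, most frequent first
head(DT_count)

# with many grouping columns, passing the names as a character vector is handier
DT_count <- DT[, .N, by = c("cat1", "cat2", "cat3")]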
I'm importing a large dataset in R and curious if there's a way to quickly go through the columns and identify whether the column has categorical values, numeric, date, etc. When I use str(df) or class(df), the columns mostly come back mislabeled.
For example, some columns are labeled as numeric, but there are only 10 unique values in the column (ranging from 1-10), indicating that it should really be a factor. There are other columns that only have 11 unique values representing a rating, from 0-5 in 0.5 increments. Another column has country codes (172 values), which range from 1-230.
Is there a way to quickly identify if a column should be a factor without going through each of the columns to understand the nature of variable? (there are many columns in the dataset)
Thanks!
At the moment, I've been using variations of the following code to catch the first two cases:
df[,51] <- as.numeric(df[,51])      #convert the column to numeric
len = length(unique(df[,51]))       #find number of unique values
diff = max(df[,51]) - min(df[,51])  #calculate difference between min and max
ord = diff / (len - 1)              #calculate the increment implied if equally spaced
#subtract the second-largest unique value from the largest to find the actual increment (only uses last two values)
step = sort(unique(df[,51]), partial = len)[len] -
    sort(unique(df[,51]), partial = len - 1)[len - 1]
ord == step                         #check if the last increment equals the implied increment
However, this approach assumes that the values are equally spaced (for example, incremented by 0.5) and only tests the spacing between the last two values. It wouldn't catch a column containing c(1,2,3.5,4.5,5,6), which has 6 unique values but uneven spacing in the middle (not that this is common in my dataset).
It is not obvious how many distinct values would indicate a factor vs a numeric variable, but you can examine all variables to see what is in your data with
table(sapply(df, function(x) length(unique(x))))
and if you decide that the boundary between factor and numeric is k you can identify the factors with
which(sapply(df, function(x) {length(unique(x)) < k}))
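Once you have chosen a cutoff, one possible sketch of the conversion (k = 15 is a hypothetical value to tune for your data):

k <- 15  # hypothetical cutoff: columns with fewer than k unique values become factors
factor_cols <- which(sapply(df, function(x) length(unique(x)) < k))
df[factor_cols] <- lapply(df[factor_cols], factor)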
I am working in R. What I want to do is make a table or a graph that represents, for each participant, their missing values. That is, I have 4700+ participants, and for each question there are between 20 and 40 missing values. I would like to represent the missing values in such a way that I can see which people did not answer which questions, and possibly look for a pattern in the missing values. I have done the following:
# Count of complete cases in the data frame 'mydata'
sum(complete.cases(mydata))

# Count of incomplete cases for one variable
sum(!complete.cases(mydata$Variable1))

# Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers (that I am not quite sure how to interpret; at first I thought these were the patient numbers, but then I noticed that this is not the case).
I also tried making subsets with only the missing values, but then I literally only see how many missing values there are, not who they belong to.
Could somebody help me? Thanks!
Zas
If there is a column that can distinguish rows in the data.frame mydata, say a patient number patient_no, then you can easily find the patient numbers of people with missing values by:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: except for column 1, all other columns correspond to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attaches numbers to the observations in your data set. For example, if your data has 20 observations (20 rows), R attaches the numbers 1 to 20, which are not actually part of your original data; they are the row numbers. The results produced by which(!complete.cases(mydata$Variable1)) correspond to those numbers: they are the rows of your data set that have at least one missing value.
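Using the small example above, a sketch of translating those row numbers back into patient numbers:

rows <- which(!complete.cases(mydata$variable1))  # row numbers, not patient IDs
mydata$patient_no[rows]                           # the corresponding patient numbers
#[1] 1 2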