I have not found a clear answer to this question, so hopefully someone can point me in the right direction!
I have a nested data frame (panel data), with multiple observations within multiple individuals. I want to subset my data frame by those individuals (id) which have at least 20 rows of data.
I have tried the following:
subset1 = subset(df, table(df$id)[df$id] >= 20)
However, I still find individuals with fewer than 20 rows of data.
Can anyone supply a solution?
Thanks in advance
subset1 = subset(df, as.logical(table(df$id)[df$id] >= 20))
Now it should work.
The subset function expects its condition to evaluate to a series of TRUE and FALSE values, one per row, indicating whether that row meets the condition and should be kept.
However, if you run table(df$id)[df$id] >= 20 in the console, you will see that it returns a named array (a table object) rather than a plain logical vector. In this case the fix is straightforward: convert it with as.logical(), which strips those attributes. Then it works.
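If id is stored as a number, another possible cause is worth checking: indexing a table with a numeric vector selects by position, not by name, so table(df$id)[df$id] can look up the wrong count. The following is a minimal sketch on made-up data (the ids and the value column are hypothetical) that counts rows per id with ave() and, as an alternative, indexes the table by name:

```r
# Toy panel data (hypothetical): ids 101 and 309 have at least 20 rows,
# id 205 has only 5.
df <- data.frame(
  id    = c(rep(101, 25), rep(205, 5), rep(309, 20)),
  value = seq_len(50)
)

# ave() returns, for every row, the number of rows that share its id.
n_per_id <- ave(seq_along(df$id), df$id, FUN = length)
subset1 <- df[n_per_id >= 20, ]

# Same idea with table(), but indexing by name via as.character()
# so that numeric ids are not treated as positions.
counts <- table(df$id)
subset2 <- df[as.logical(counts[as.character(df$id)] >= 20), ]

# Check the remaining group sizes.
table(subset1$id)
```

Both subsets should contain only the ids with 20 or more rows. The ave() version avoids the name-versus-position question entirely.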
Related
I have a dataset with the weekly number of lucky days. Some of those weekly values are greater than 7, which must be a mistake.
Therefore, I want to delete every row that has a value greater than 7 in any of several columns, namely columns 21 to 68. What I have tried so far is this:
new_df <- subset(df, 21:68 <= 7)
This leaves me with a completely empty new_df.
I know there is an option that goes like this:
new_df <- subset(df, b != 7 & d != 7)
But I feel like there must be a more elegant way than naming every single column I want to refer to. Do I need to use square brackets or something like that?
Running the command above produces no error message.
The referred values are numerical.
Can someone help?
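The attempt above returns nothing because 21:68 <= 7 compares the numbers 21 to 68 themselves with 7, which is FALSE every time, rather than comparing the contents of those columns. One possible approach, sketched here on a small made-up data frame in which columns 3 to 5 stand in for columns 21 to 68 (all names and values are hypothetical), is to compare the selected columns as a block and count offending cells per row with rowSums():

```r
# Toy data (hypothetical): columns 3:5 play the role of columns 21:68.
df <- data.frame(
  id   = 1:4,
  week = c(1, 1, 2, 2),
  a    = c(3, 8, 0, 7),
  b    = c(7, 2, 9, 1),
  c    = c(0, 1, 2, NA)
)
cols <- 3:5  # replace with 21:68 for the real data

# df[, cols] > 7 is a logical matrix; rowSums() counts the TRUEs per row.
# na.rm = TRUE means a missing value is not treated as "greater than 7".
has_bad_value <- rowSums(df[, cols] > 7, na.rm = TRUE) > 0

# Keep only rows without any value above 7 in the selected columns.
new_df <- df[!has_bad_value, ]
new_df
```

Whether missing values should count as errors is a design choice; this sketch assumes they should not.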
I currently have a list of 27 correlation matrices with 7 variables each, for social science research.
Some correlations are "NA" due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In one particular case, I would like to keep one of the variables conditionally: only if it has at least one meaningful value. Since there are 7 variables, that means keeping it unless its correlations consist of 6 NAs plus the correlation with itself, which is 1. This is the tricky part, because 1 is a value, but it is meaningless to me in a correlation matrix.
I would appreciate it if anyone could enlighten me regarding the code.
I am rather new to R, and the only idea I have is to use an if statement to set the condition. I have been trying for hours to no avail, as this is my first real coding experience.
Thanks a lot.
Since you didn't provide sample data, I will first convert your matrix into a data frame and then assume you want to check whether a variable var in your data frame df has at least one value other than NA or 1.
df <- as.data.frame(as.table(matrix)) should convert your matrix into a data frame (here matrix stands for your correlation matrix).
table(df$var) will show you the distribution of values in that variable. From there you can make a judgement call on whether to keep the variable or not.
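The check can also be done directly on the matrices without converting them. The following is only a sketch under stated assumptions: the matrices carry variable names as row and column names, and the toy matrices, the variable names a, b, c, and the helper has_information() are all hypothetical. It ignores the diagonal (the correlation of a variable with itself) and asks whether any other correlation is non-NA:

```r
# Two toy 3x3 correlation matrices (hypothetical values).
vars <- c("a", "b", "c")
m1 <- matrix(c(1,   0.4, NA,
               0.4, 1,   NA,
               NA,  NA,  1),
             nrow = 3, byrow = TRUE, dimnames = list(vars, vars))
m2 <- matrix(c(1,   0.2, 0.5,
               0.2, 1,   0.1,
               0.5, 0.1, 1),
             nrow = 3, byrow = TRUE, dimnames = list(vars, vars))
mats <- list(m1, m2)

# Hypothetical helper: TRUE if `var` has at least one non-NA correlation
# with a variable other than itself.
has_information <- function(m, var) {
  others <- setdiff(colnames(m), var)
  any(!is.na(m[var, others]))
}

# Apply the check for variable "c" to every matrix in the list.
keep_c <- vapply(mats, has_information, logical(1), var = "c")

# Matrices in which "c" is worth keeping.
mats_with_c <- mats[keep_c]
```

Here keep_c marks, for each matrix in the list, whether the variable has anything beyond its self-correlation.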
I would like to create a new column in my dataframe that assigns a categorical value based on a condition to the other observations.
In detail, I have a column that contains timestamps for all observations. The rows are ordered ascending by timestamp.
Now, I'd like to calculate the difference between each consecutive timestamp and if it exceeds a certain threshold the factor should be increased by 1 (see Desired Output).
Desired Output
I tried to solve it with a for loop; however, that takes a lot of time because the dataset is huge.
After searching for a bit I found this approach and tried to adapt it: R - How can I check if a value in a row is different from the value in the previous row?
ind <- with(df, c(TRUE, timestamp[-1L] > (timestamp[-length(timestamp)]-7200)))
However, I can not make it work for my dataset.
Thanks for your help!
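One vectorised way to get such a counter, sketched here under the assumption that timestamp is numeric (for example seconds) and already sorted, is to flag every gap larger than the threshold with diff() and accumulate the flags with cumsum(). The data are made up, and the threshold of 7200 is taken from the attempt above:

```r
# Toy data (hypothetical): sorted numeric timestamps.
df <- data.frame(timestamp = c(0, 100, 200, 8000, 8100, 20000))
threshold <- 7200

# diff() gives the gap to the previous row; the first row has no
# predecessor, so it always starts a new group (TRUE).
new_group <- c(TRUE, diff(df$timestamp) > threshold)

# cumsum() increases the counter by 1 at every flagged row.
df$group <- factor(cumsum(new_group))
df
```

For date-time columns the gaps would first need converting to plain numbers in a known unit, which this sketch does not cover.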
I have a large data set with over a thousand participants. Each participant has a unique ID. Each time a participant was tested, their data was entered on a separate row. Participants were tested under two conditions coded "1" and "2". Some participants were always tested under condition 1. Some were always tested under condition 2. Still other participants were tested under both conditions 1 and 2.
For this analysis, I want to eliminate participants that were tested under two different conditions, retaining only participants that were always tested under the same condition.
I have to find rows with identical IDs (showing the same participant) but different condition codes and eliminate those rows. I am familiar with subset, but I am not sure how to create the data subset I need in this case.
Any help would be appreciated.
In data.table
library(data.table)
setDT(old_data)
new_data <- old_data[ , if (uniqueN(condition_code) == 1) .SD, by = participant_id]
setDT adds the data.table class to your data.frame so it can be passed to data.table methods. uniqueN is equivalent to (but faster than) length(unique()) and this statement ensures there is exactly one unique condition code associated with a given participant (as identified by their participant_id).
.SD is a temporary data set created within each group. Without further modification, .SD simply represents the full set of columns and rows associated with a particular participant_id, so the construction says to return all data associated with participant_ids passing your condition. For those that don't pass, nothing is returned (technically NULL is returned, and those rows are then dropped in clean-up).
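For comparison, the same filter can be sketched in base R without data.table. This is a swapped-in technique, not the approach above: it uses ave() to count the distinct condition codes per participant. The column names are taken from the answer, and the toy data are hypothetical:

```r
# Toy data (hypothetical): participant 3 was tested under both conditions.
old_data <- data.frame(
  participant_id = c(1, 1, 2, 2, 3, 3),
  condition_code = c(1, 1, 2, 2, 1, 2)
)

# For every row, count the distinct condition codes of that participant.
n_conditions <- ave(
  old_data$condition_code,
  old_data$participant_id,
  FUN = function(x) length(unique(x))
)

# Keep participants who were only ever tested under one condition.
new_data <- old_data[n_conditions == 1, ]
new_data
```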
I am working in R. What I want to do is make a table or a graph that shows the missing values for each participant. I have 4700+ participants, and for each question there are between 20 and 40 missing values. I would like to represent the missing values in such a way that I can see which people did not answer the questions, and possibly check whether there is a pattern in the missing values. I have done the following:
Count of complete cases in a data frame named 'mydata'
sum(complete.cases(mydata))
Count of incomplete cases
sum(!complete.cases(mydata$Variable1))
Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers that I am not quite sure how to interpret. At first I thought these were the patient numbers, but then I noticed that this is not the case.
I also tried making subsets with only the missing values, but then I literally only see how many missing values there are, not who they belong to.
Could somebody help me? Thanks!
Zas
If there is a column in the data.frame mydata that identifies each row, say a patient number patient_no, then you can easily find the patient numbers of the people with missing values:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: Except Column 1, all other columns represent the columns related to questions.
> lapply(mydata[,-1],function(x){mydata[!complete.cases(x),'patient_no']})
Remember that R automatically attaches numbers to the observations in your data set. For example, if your data has 20 observations (20 rows), R attaches the numbers 1 to 20, which are not part of your original data. They are the row numbers. The results produced by which(!complete.cases(mydata$Variable1)) correspond to those numbers: they are the rows of your data set in which Variable1 is missing.
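The difference between row numbers and patient numbers can be seen in a small sketch. The data are made up, with patient numbers deliberately chosen to differ from the row numbers, and the second question column variable2 is hypothetical:

```r
# Toy data (hypothetical): patient numbers differ from row numbers.
mydata <- data.frame(
  patient_no = c(101, 102, 103, 104, 105),
  variable1  = c(NA, NA, 1, 2, 3),
  variable2  = c(1, NA, NA, 4, 5)
)

# which() returns row numbers, not patient numbers.
missing_rows <- which(!complete.cases(mydata$variable1))

# Use the row numbers to look up the corresponding patients.
missing_patients <- mydata$patient_no[missing_rows]

# Number of unanswered questions per patient (all columns except the id).
n_missing <- rowSums(is.na(mydata[, -1]))

# For each question, the patients who did not answer it.
missing_by_question <- lapply(
  mydata[, -1],
  function(x) mydata$patient_no[is.na(x)]
)
```

The per-patient counts in n_missing give a starting point for looking at patterns, for example by tabulating them or sorting the patients by how many answers they skipped.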