G'day All,
I have a data.frame with multiple columns, with a count corresponding to each column and row. The data looks like the following:
d <- data.frame(replicate(10,sample(0:15,10,rep=TRUE)))
I want to count the number of rows which have a value greater than 0, then greater than 1, and so on up to greater than 14. I then want to do the same thing for every column in the data.frame. So I will end up with data that tells me, for each column, how many rows have values greater than each threshold from 0 to 14.
I can do this one at a time with the following code:
length(which(d[,1]>0))
length(which(d[,1]>1))
length(which(d[,1]>2))
.
.
.
length(which(d[,10]>14))
This is really slow, and my real data set is much larger than this, so I don't want to have to do it this way. I thought I might be able to do it as follows:
lrows <- data.frame('cnt' = rep(0:14, 10), 'column' = rep(1:10, each = 15),
                    't' = 0)
lrows$t <- length(which(d[, lrows[, 2]] > lrows[, 1]))
But this doesn't evaluate the condition row by row; it just returns a single count. I have tried a couple of other things unsuccessfully and was hoping someone here might be able to point me in the right direction.
I know this is going to be obvious once it is pointed out but I just can't figure it out, sorry. Thank you for your time and help!
All the best
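For what it's worth, one vectorised way to get all the counts at once (a sketch, not from the original thread) is to nest `sapply` over thresholds and columns; `sum(col > k)` is equivalent to `length(which(col > k))`:

```r
d <- data.frame(replicate(10, sample(0:15, 10, rep = TRUE)))

# One row per threshold (0 to 14), one column per column of d;
# entry [k, j] is the number of rows of column j strictly greater than k - 1
counts <- sapply(d, function(col) sapply(0:14, function(k) sum(col > k)))
dim(counts)  # 15 x 10
```

This replaces the 150 individual `length(which(...))` calls with a single expression and scales to any number of columns.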
I have the following data frame in R:
df <- data.frame(time = c("10:01", "10:05", "10:11", "10:21"),
                 power = c(30, 32, 35, 36))
Problem: I want to calculate the energy consumption, so I need the sum of the time differences multiplied by the power. But every row has only one timestamp, meaning I need to do subtraction between two different rows. That is the part I cannot figure out. I guess I would need some kind of function, but I couldn't find any hints online.
Example: It has to subtract row1$time from row2$time, and then multiply the difference by row1$power.
As said, I do not know how to implement this step in one call; I am confused about the subtraction part since it takes values from different rows.
Expected output: E=662
Try this:
tmp <- strptime(df$time, format = "%H:%M")
# minute gaps between consecutive timestamps; NA for the last row
df$interval <- c(as.numeric(diff(tmp), units = "mins"), NA)
sum(df$interval * df$power, na.rm = TRUE)
I got 662 back.
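The arithmetic behind that 662 can be checked by hand: the gaps between consecutive timestamps are 4, 6 and 10 minutes, each weighted by the power at the start of its interval:

```r
# 4*30 + 6*32 + 10*35 = 120 + 192 + 350
sum(c(4, 6, 10) * c(30, 32, 35))  # 662
```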
I am working in R. What I want to do is make a table or a graph that represents, for each participant, their missing values. I.e. I have 4700+ participants, and for each question there are between 20-40 missings. I would like to represent the missings in such a way that I can see which people did not answer the questions, and possibly look for a pattern in the missing values. I have done the following:
# Count of complete cases in a data frame named 'mydata'
sum(complete.cases(mydata))
# Count of incomplete cases for a single variable
sum(!complete.cases(mydata$Variable1))
# Which cases (row numbers) are incomplete?
which(!complete.cases(mydata$Variable1))
I then got a list of numbers (that I am not quite sure how to interpret; at first I thought these were the patient numbers, but then I noticed that this is not the case).
I also tried making subsets with only the missings, but then I literally only see how many missings there are, not who the missings are from.
Could somebody help me? Thanks!
Zas
If there is a column that can distinguish a row in the data.frame mydata, say patient numbers patient_no, then you can easily find the patient numbers of the people with missing values by:
> mydata <- data.frame(patient_no = 1:5, variable1 = c(NA,NA,1,2,3))
> mydata[!complete.cases(mydata$variable1),'patient_no']
[1] 1 2
If you want to consider the pattern in which the users have missed a particular question, then this might be useful for you:
Assumption: except for column 1, all other columns correspond to questions.
> lapply(mydata[, -1], function(x) { mydata[!complete.cases(x), 'patient_no'] })
Remember that R automatically attaches numbers to the observations in your data set. For example, if your data has 20 observations (20 rows), R numbers them from 1 to 20; these numbers are not part of your original data, they are the row numbers. The results produced by which(!complete.cases(mydata$Variable1)) are those numbers: the rows of your data set that have a missing value in that column.
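To look at the pattern across all questions at once, here is a small sketch (with made-up data; your column names will differ) that tabulates the missings per participant:

```r
# Hypothetical data: patient_no plus two question columns with some NAs
mydata <- data.frame(patient_no = 1:6,
                     q1 = c(NA, 2, 3, NA, 5, 6),
                     q2 = c(1, NA, 3, 4, 5, NA))

miss <- is.na(mydata[, -1])          # TRUE where an answer is missing
mydata$n_missing <- rowSums(miss)    # how many questions each patient skipped
mydata[mydata$n_missing > 0, "patient_no"]  # who skipped anything: 1 2 4 6
```

The `miss` matrix itself (patients in rows, questions in columns) is a convenient starting point for spotting patterns.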
I have not found a clear answer to this question, so hopefully someone can put me in the right direction!
I have a nested data frame (panel data), with multiple observations within multiple individuals. I want to subset my data frame by those individuals (id) which have at least 20 rows of data.
I have tried the following:
subset1 = subset(df, table(df$id)[df$id] >= 20)
However, I still find individuals with less than 20 rows of data.
Can anyone supply a solution?
Thanks in advance
subset1 = subset(df, as.logical(table(df$id)[df$id] >= 20))
Now, it should work.
The condition passed to subset should be a logical vector (a series of TRUE/FALSE values) indicating which rows meet the condition and should be kept.
However, if you run table(df$id)[df$id] >= 20 in the console, you will see that it returns a (named) array rather than a plain logical vector. In this case, you just need to convert it with as.logical(), and then it works.
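A base-R alternative that avoids the as.logical() step (a sketch, with a made-up df) is to compute the group sizes with ave(), which already returns a plain vector:

```r
# Hypothetical panel: individual "a" has 25 rows, "b" only 5
df <- data.frame(id = rep(c("a", "b"), c(25, 5)), x = seq_len(30))

# group size for each row's individual, aligned with the rows of df
keep <- ave(seq_along(df$id), df$id, FUN = length) >= 20
subset1 <- df[keep, ]
nrow(subset1)  # 25 -- only individual "a" survives
```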
So this is a quick question.
I have a dataframe of panel data, in which I have a column of identifications/names/IDs for each individual. Let's say there are n levels to this column, that is, n individuals in the panel over a certain timeframe.
I want to add a column N to the dataframe that numbers these levels: each ID/name/level gets assigned a number from 1 through n.
Here is code that does what I want:
i <- 1
for (l in levels(data$ID)) {
  data[data$ID == l, "N"] <- i
  i <- i + 1
}
So far so good. The issue: my dataset is very large, far too large for this to be practical, and the above loop takes too much time.
Since this is a loop, my guess is that there is a faster way to do this in R using vectorised operations.
Anyone know a computationally quick way to do this?
Just use data$N <- as.integer(data$ID). Factor variables are stored as integers internally, so it is easy to turn them into integer variables.
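A quick illustration of that, with made-up IDs (note that the numbering follows the factor's level order, which is alphabetical by default, not the order of appearance):

```r
data <- data.frame(ID = factor(c("b", "a", "b", "c")))
data$N <- as.integer(data$ID)
data$N  # 2 1 2 3: "a" is level 1, "b" level 2, "c" level 3
```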
I am trying to exclude some rows from a data.table based on, let's say, day and month: excluding, for example, summer holidays that always begin on the 15th of June and end on the 15th of the next month. I can extract those days based on Date, but as the as.Date function is awfully slow to operate with, I have separate integer columns for Month and Day and I want to do it using only them.
It is easy to select the given entries by
DT[Month==6][Day>=15]
DT[Month==7][Day<=15]
Is there any way to take the "difference" of the two data.tables (the original one and the ones I selected)? (Why not subset? Maybe I am missing something simple, but I don't want to exclude days like 10/6 or 31/7.)
I am aware of a way to do it with join, but only day by day
setkey(DT, Month, Day)
DT[-DT[J(Month,Day), which= TRUE]]
Can anyone help how to solve it in more general way?
Great question. I've edited the question title to match the question.
A simple approach avoiding as.Date which reads nicely:
DT[!(Month*100L+Day) %between% c(0615L,0715L)]
That's probably fast enough in many cases. If you have a lot of different ranges, then you may want to step up a gear:
DT[, mmdd := Month*100L + Day]
setkey(DT, mmdd)  # the joins below need DT keyed by mmdd
from = DT[J(0615L), mult = "first", which = TRUE]
to = DT[J(0715L), mult = "first", which = TRUE]
DT[-(from:to)]
That's a bit long and error prone because it's DIY. So one idea is that a list column in an i table would represent a range query (FR#203, like a binary search %between%). Then a not-join (also not yet implemented, FR#1384) could be combined with the list column range query to do exactly what you asked :
setkey(DT,mmdd)
DT[-J(list(0615,0715))]
That would extend to multiple different ranges, or the same range for many different ids, in the usual way; i.e., more rows added to i.
Based on the answer here, you might try something like
# Sample data
DT <- data.table(Month = sample(c(1, 3:12), 100, replace = TRUE),
                 Day = sample(1:30, 100, replace = TRUE), key = "Month,Day")
# Dates that you want to exclude
excl <- as.data.table(rbind(expand.grid(6, 15:30), expand.grid(7, 1:15)))
DT[-na.omit(DT[excl, which = TRUE])]
If your data contain at least one entry for each day you want to exclude, na.omit might not be required.
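As a side note, the not-join mentioned above has since been implemented: in recent versions of data.table (1.9.4+), the exclusion can be written directly with ! and on=, with no key or which = TRUE needed (a sketch with the same shape of data):

```r
library(data.table)

DT <- data.table(Month = sample(c(1, 3:12), 100, replace = TRUE),
                 Day = sample(1:28, 100, replace = TRUE))
excl <- rbind(data.table(Month = 6, Day = 15:28),
              data.table(Month = 7, Day = 1:15))

# not-join: keep only the rows of DT that match no row of excl
kept <- DT[!excl, on = c("Month", "Day")]
```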