R counting a field in data.table [duplicate]

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 4 years ago.
I have a data table which could be reduced to this:
library(data.table)
set.seed(1)
dt <- data.table(form = c(1,1,1,2,3,3,3,4,4,5),
                 mx   = c("a","b","c","d","e","f","g","e","g","b"),
                 vr   = runif(10, 100, 200),
                 usr  = c("l","l","l","m","o","o","o","l","l","m"),
                 type = c("A","A","A","C","C","C","C","C","C","A"))
I can generate a table with:
dt[, list(n.form = length(unique(form)),
          n.mx   = length(unique(mx)),
          tot.vr = sum(vr)),
   by = usr]
What I haven't been able to do is count the number of formulas of type A (each row is an observation; form is the formula number). I've tried:
dt[, list(n.form = length(unique(form)),
          n.mx   = length(unique(mx)),
          tot.vr = sum(vr),
          n.A    = sum(type == "A")),
   by = usr]
and also:
dt[, list(n.form = length(unique(form)),
          n.mx   = length(unique(mx)),
          tot.vr = sum(vr),
          n.A    = length(unique(type == "A"))),
   by = usr]
but neither of those takes into account that the number of "A" rows found needs to be tied to the unique formula (form) numbers.
What I'd like to have as a result is:
usr n.form n.mx tot.vr n.A
1: l 2 5 750.0398 1
2: m 2 2 296.9994 1
3: o 1 3 504.4747 0
but I can't find a way to achieve it. Any light shed is much appreciated.
Thanks,
======= EDIT TO ADD ========
I want to know how many of the formulas (unique numbers in dt$form) are of type "A", so I can calculate a proportion out of the total formulas. The direct sum gives the total number of observations of type A, while any() only tells me whether there was at least one formula of type "A", not the number of formulas of that type (which is what I want). Please note that any given formula will always be either of type "A" or "C" (there are no mixed types within one formula).
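To make the distinction concrete with the dt defined above, here is what the three quantities give for user "l":
dt[usr == "l", sum(type == "A")]                     # observations of type A: 3
dt[usr == "l", any(type == "A")]                     # at least one type-A row: TRUE
dt[usr == "l", length(unique(form[type == "A"]))]    # distinct type-A formulas: 1 (what I want)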

In the devel version of data.table, you can use uniqueN() instead of length(unique(..)):
library(data.table) # v1.9.5+
dt[, list(n.form = uniqueN(form), n.mx = uniqueN(mx), tot.vr = sum(vr),
          n.A = uniqueN(form[type == 'A'])), by = usr]
# usr n.form n.mx tot.vr n.A
#1: l 2 5 750.0398 1
#2: m 2 2 296.9994 1
#3: o 1 3 504.4747 0
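On data.table versions before 1.9.5, where uniqueN() is not available, the same grouping spelled out with length(unique()) should produce the identical table:
dt[, list(n.form = length(unique(form)), n.mx = length(unique(mx)), tot.vr = sum(vr),
          n.A = length(unique(form[type == 'A']))), by = usr]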

Related

R: Dropping variables using number of observations

I have a large dataset, and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my dataframe where n < 3 (total observations for that variable is less than 3). Since R can count observations for each variable using describe, can't I use that number to subset the data instead of having to type in each variable name each time I pull in a new version? (Each version has different variables that will have low n's, and there are over 40 variables.) Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
1 3 NA 4 1 NA
2 NA NA 2 1 NA
3 4 NA 6 2 3
4 1 NA 1 1 NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1 respectively. However, instead of telling R to drop them by variable name, it would be much more convenient if it was possible to tell R to drop any variable where n < 3 (or whatever number I choose) as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out) but it's still pretty tedious when I have to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
For some reason, my example looks different when I'm editing versus when I save so I also included an image of it. Sorry. This is the first time I've ever used stack overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
This line did not work (DF being your dataframe):
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
This function did the trick:
valid <- function(x) { sum(!is.na(x)) }      # number of non-missing values in a column
N <- apply(UIRCorrelation, 2, valid)         # per-column counts
UIRCorrelation2 <- UIRCorrelation[N > 3]     # keep only columns with more than 3 observations
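A compact equivalent of the same idea, sketched with an assumed data frame df and a minimum of 3 non-missing observations per column:
keep <- colSums(!is.na(df)) >= 3          # non-missing count per column
df_reduced <- df[, keep, drop = FALSE]    # drop the low-n variables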

R - counting adjacent duplicate items

New to R and would like to do the following operation:
I have a set of numbers e.g. (1,1,0,1,1,1,0,0,1) and need to count adjacent duplicates as they occur. The result I am looking for is:
2,1,3,2,1
as in 2 ones, 1 zero, 3 ones, etc.
Thanks.
We can use rle():
rle(v1)$lengths
#[1] 2 1 3 2 1
data
v1 <- c(1,1,0,1,1,1,0,0,1)
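If the run values are wanted next to the lengths (2 ones, 1 zero, 3 ones, ...), rle() returns both components:
with(rle(v1), data.frame(value = values, length = lengths))
#  value length
#1     1      2
#2     0      1
#3     1      3
#4     0      2
#5     1      1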

R: Compare a column of a data.table with a vector

I have a column of a data.table:
library(data.table)
DT <- data.table(R = c(3, 8, 5, 4, 6, 7))
Further on I have a vector of upper cluster limits for the cluster 1, 2, 3 and 4:
CP=c(2,4,6,8)
Now I want to compare each entry of R with the elements of CP considering the order of CP. The result
DT[,NoC:=c(2,4,3,2,3,4)]
shall be a column NoC in DT, whose entries are just the number of that cluster, which the element of R belongs to.
(I need the cluster number to choose a factor out of another data.table.)
For example take the 1st entry of R: 3 is not smaller than 2 (out of CP), but smaller than 4 (out of CP). So, 3 belongs to cluster 2.
Another example, take the 6th entry of R: 7 is neither smaller than 2, 4 nor 6 (out of CP), but smaller than 8 (out of CP). So, 7 belongs to cluster 4.
How can I do that without using if-clauses?
You can accomplish this using rolling joins:
data.table(CP, key="CP")[DT, roll=-Inf, which=TRUE]
# [1] 2 4 3 2 3 4
roll=-Inf performs a NOCB rolling join - Next Observation Carried Backward. That is, in the event of a value falling in a gap, the next observation will be rolled backward. Ex: 7 falls between 6 and 8; the next value, 8, will be rolled backward. We simply get the corresponding index of each match using which=TRUE.
You can just add this as a column to DT using := as you've shown.
Note that this will return the indices after ordering CP. In your example, CP is already ordered, so it returns the result as intended. If CP is not already ordered, you'll have to add an additional column and extract that column instead of using which=TRUE. But I'll leave it to you to work it out.
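For completeness, a minimal sketch of that adjustment: carry the cluster number along as a column and extract it after the join, instead of using which=TRUE. With the CP above this reproduces the same result, and it stays correct even when CP is supplied unsorted, because NoC records each limit's original position.
lookup <- data.table(CP, NoC = seq_along(CP), key = "CP")   # NoC = cluster number before sorting
DT[, NoC := lookup[DT, roll = -Inf]$NoC]
DT$NoC
# [1] 2 4 3 2 3 4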
From your description, this would seem to be the code to deliver the correct answers. But Arun, a most skillful data.tablist, seems to have come up with a completely different way to fit your expectations, so I think there must be a different way of reading your requirements.
> DT[, NoC := findInterval(R, c(0, 2, 4, 6, 8), rightmost.closed = TRUE)]
> DT
R NoC
1: 3 2
2: 8 4
3: 5 3
4: 4 3
5: 6 4
6: 7 4
I'm also very puzzled that findInterval is assigning the 5th item to the 4th interval since 6 is not greater than the upper boundary of the third interval (6).
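A likely resolution of that puzzle: findInterval() treats the breakpoints as left-closed intervals [x_i, x_{i+1}), so 6 starts the 4th interval rather than closing the 3rd. In newer versions of R, the left.open argument flips this behaviour, which (as a sketch) reproduces the clustering the question asks for:
findInterval(6, c(0, 2, 4, 6, 8), rightmost.closed = TRUE)
#[1] 4    # 6 falls in [6, 8)
findInterval(6, c(0, 2, 4, 6, 8), rightmost.closed = TRUE, left.open = TRUE)
#[1] 3    # intervals become (2, 4], (4, 6], (6, 8]
DT[, NoC := findInterval(R, c(0, 2, 4, 6, 8), rightmost.closed = TRUE, left.open = TRUE)]
DT$NoC
#[1] 2 4 3 2 3 4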

finding "almost" duplicates indices in a data table and calculate the delta

I have a smallish (2k) data set that contains questionnaire answers filled out by students who were sampled twice a year. Not all the students that were present for the first wave were there for the second wave, and vice versa. For each student, a unique id was created that consists of the school code, the class code, the student number and the wave as a decimal point. For example, 100612.1 is a student from school 10, grade 6, number 12 on the names list, and this was the first wave. The idea behind the decimal point was to have a way to identify the same student again in the data set (the only value which differs by less than abs(1) from a given id is the same student in the other wave). At least, that was the idea.
I was thinking of a script that would do the following:
- find the rows whose unique id is less than abs(1) from one another
- for those rows, generate a new row (in a new table) that consists of the student id and the delta of the measured variables (i.e. value in wave 2 - value in wave 1).
I am new to R but I have a tiny bit of background in other OOP. I thought about creating a for loop that runs from 1 to length(df) and just looks for its "brother". My gut feeling tells me that this is not the way things are done in R. Any ideas?
All I need is a quick way of sifting through the data looking for the second wave row. I think the rest should be straightforward from there.
Thank you for helping.
PS. Since this is my first post here, I apologize beforehand for any wrongdoings in this post... :)
The question alludes to data.table, so here is a way to adapt @jed's answer using that package.
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
Example data as before; now, instead of data.frame and tapply, you can do this:
library(data.table)
surveyDT <- data.table(ids, answers)
surveyDT[, `:=` (child = substr(ids, 1, 6), wave = substr(ids, 8, 8))] # split ID's
# note multiple assign-by-reference := syntax above
setkey(surveyDT, child, wave) # order data
# calculate delta on keyed data, grouping by child
# (children present in only one wave get NA; diff() alone would return a zero-length vector for them)
surveyDT[, delta := if (.N > 1L) diff(answers) else NA_real_, by = child]
unique(surveyDT[, delta, by = child]) # list results
child delta
1: 100612 -1
2: 100613 1
3: 110714 NA
4: 201802 NA
To remove rows with NA values for delta:
unique(surveyDT[, .SD[(!is.na(delta))], by = child])
child ids answers wave delta
1: 100612 100612.1 5 1 -1
2: 100613 100613.1 3 1 1
Use .SDcols to output only specific columns (in addition to the by columns), for example:
unique(surveyDT[, .SD[(!is.na(delta))], by = child, .SDcols = 'delta'])
child delta
1: 100612 -1
2: 100613 1
It took me some time to get acquainted with data.table syntax, but now I find it more intuitive, and it's fast for big data.
There are two ways that come to mind. The easiest is to use the function floor(), which returns the integer part of a number. For example:
floor(100612.1)
#[1] 100612
floor(9.9)
#[1] 9
Alternatively, you could write a fairly simple regular expression to get rid of the decimal place too. Then you can use unique() to find the rows that are or are not duplicated entries.
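A rough, self-contained sketch of that route, using the same example ids as below:
ids <- c(100612.1, 100612.2, 100613.1, 100613.2, 110714.1, 201802.2)
child <- sub("\\..*$", "", ids)    # drop everything from the decimal point on: "100612" "100612" ...
child[duplicated(child)]           # children that appear in both waves
#[1] "100612" "100613"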
Let's make some fake data so we can see our problem easily:
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
survey <- data.frame(ids,answers)
Now let's split our ids into two different columns:
survey$child_id <- substr(survey$ids,1,6)
survey$wave_id <- substr(survey$ids,8,8)
Then we'll order by child and wave (note the reordered result has to be assigned back), and compute differences within each child:
survey <- survey[order(survey$child_id, survey$wave_id), ]
survey$delta <- unlist(tapply(survey$answers, survey$child_id, function(x) c(NA, diff(x))))
Output:
ids answers child_id wave_id delta
1 100612.1 5 100612 1 NA
2 100612.2 4 100612 2 -1
3 100613.1 3 100613 1 NA
4 100613.2 4 100613 2 1
5 110714.1 1 110714 1 NA
6 201802.2 0 201802 2 NA

Read multidimensional group data in R

I have done a lot of googling but I didn't find a satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group id (the data have 3 groups A, B, C) while the other columns are values.
I want to read this file in R so that I can apply different functions on the data.
For example, I tried to read the file and to get the column means:
dt<-read.table(file_name,head=T) #gives warnings
apply(dt,2,mean) #gives NA NA NA
I want to read this file and get the column means. Then I want to separate the data into 3 groups (according to Tag A, B, C) and calculate the column-wise means for each group. Any help?
apply(dt,2,mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
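A quick way to see that coercion in action (dt as read above; the character Tag column forces the whole matrix to character, so mean() returns NA with a warning):
m <- as.matrix(dt)
typeof(m)         # "character"
mean(m[, "v1"])   # NA, with a warning that the argument is not numeric or logical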
Try this instead:
sapply(dt,mean) # works because data.frames are lists
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))
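aggregate() from base R is another common route to the same per-group means; a sketch, following the grpMeans naming above:
# using base aggregate()
grpMeans3 <- aggregate(cbind(v1, v2, v3) ~ Tag, data = dt, FUN = mean)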
