Using transform and plyr to add a counting column in R - r

I have a two-level dataset (let's say classes nested within schools) and the dataset was coded
like this:
School Class
A 1
A 1
A 2
A 2
B 1
B 1
B 2
B 2
But to run an analysis I need the data to have a unique Class ID, regardless of school membership.
School Class NewClass
A 1 1
A 1 1
A 2 2
A 2 2
B 1 3
B 1 3
B 2 4
B 2 4
I tried using transform and ddply, but I'm not sure how to keep NewClass continually incrementing for each new combination of School and Class. I can think of a few inelegant ways to do this, but I'm sure there are much easier solutions I just can't think of right now. Any help would be appreciated!

Using interaction to create a factor, and then coercing it to an integer:
transform(dat,nn = as.integer(interaction(Class,School)))
School Class nn
1 A 1 1
2 A 1 1
3 A 2 2
4 A 2 2
5 B 1 3
6 B 1 3
7 B 2 4
8 B 2 4
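If you would rather have the IDs follow the order in which each School/Class combination first appears (instead of factor level order), a small base R sketch (my addition, not part of the original answer):
key <- paste(dat$School, dat$Class)      # combine the two grouping columns into one key
dat$NewClass <- match(key, unique(key))  # number the keys by order of first appearance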

Using data.table:
library(data.table)
dt = as.data.table(your_df)
dt[, NewClass := .GRP, by = list(School, Class)]
dt
# School Class NewClass
#1: A 1 1
#2: A 1 1
#3: A 2 2
#4: A 2 2
#5: B 1 3
#6: B 1 3
#7: B 2 4
#8: B 2 4
.GRP is simply a group counter. Also note that you don't really need to create NewClass at all: you can keep using the combination list(School, Class) directly in whatever by operation you need to do.
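For example, a minimal sketch of such a by operation (the summary column name n_rows is mine):
dt[, .(n_rows = .N), by = .(School, Class)]  # one row per School/Class combination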
Note that from data.table versions >= 1.9.0, a function setDT is exported that converts a data.frame to data.table by reference (no copy is made), in case you'd want to stick to data.tables.
require(data.table) ## >= 1.9.0
setDT(your_df) ## your_df is now a data.table, changed by reference.

Related

Quick way to use the row element as name and the value as column in R

I currently do this in several steps that are neither elegant nor safe, but I'm sure there is an easier and faster way.
I need help finding a quick way to go from dataframe_1 to dataframe_2.
#from this
a<-c("A","A","B","B","C","C")
b<-c(1,2,12,2,4,5)
dataframe_1<-cbind.data.frame(a,b)
a b
1 A 1
2 A 2
3 B 12
4 B 2
5 C 4
6 C 5
#to this
a<-c(1,2)
b<-c(12,2)
c<-c(4,5)
dataframe_2<-cbind.data.frame(A=a,B=b,C=c)
A B C
1 1 12 4
2 2 2 5
Try unstack
> unstack(rev(dataframe_1))
A B C
1 1 12 4
2 2 2 5
One option, if the number of elements in each group is constant (see the caveat illustrated after the output below):
data.frame(do.call(cbind, split(dataframe_1$b, dataframe_1$a)))
A B C
1 1 12 4
2 2 2 5
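The constant-size condition matters because cbind recycles shorter groups silently; a quick illustration (my example, not from the original answer):
do.call(cbind, split(c(1, 2, 3), c("A", "A", "B")))
#      A B
# [1,] 1 3
# [2,] 2 3   <- B's single value is recycled with no warning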
This can also be done with dcast and rowid from data.table:
dcast(as.data.table(dataframe_1), rowid(a) ~ a, value.var = 'b')[, -1]
A B C
1: 1 12 4
2: 2 2 5
Here, [, -1] removes the first column (which is rowid(a)).
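To see what rowid(a) contributes on its own (a sketch; rowid is exported by data.table):
library(data.table)
rowid(dataframe_1$a)
# [1] 1 2 1 2 1 2
It is a within-group row counter, which becomes the row index of the wide result.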

Using another data table to condition on columns in a primary data table r

Suppose I have two data tables, and I want to use the second one, which contains a row with some column values, to condition the first one.
Specifically, I want to use d2 to select the rows of d1 whose values in those columns are less than or equal to the corresponding values in d2.
d1 = data.table('d'=1,'v1'=1:10, 'v2'=1:10)
d2 = data.table('v1'=5, 'v2'=5)
So I would want the output to be
d v1 v2
1: 1 1 1
2: 1 2 2
3: 1 3 3
4: 1 4 4
5: 1 5 5
But I want to do this without referencing specific names unless it's in a very general way, e.g. names(d2).
You could do it with a bit of text manipulation and a join:
d2[d1, on=sprintf("%1$s>=%1$s", names(d2)), nomatch=0]
# v1 v2 d
#1: 1 1 1
#2: 2 2 1
#3: 3 3 1
#4: 4 4 1
#5: 5 5 1
It works because the sprintf expands to:
sprintf("%1$s>=%1$s", names(d2))
#[1] "v1>=v1" "v2>=v2"
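If you prefer plain subsetting over a join, a rough equivalent built only from names(d2) (my sketch, assuming d2 has a single row and all of its columns exist in d1):
keep <- Reduce(`&`, lapply(names(d2), function(v) d1[[v]] <= d2[[v]]))
d1[keep]  # the same five rows, with column d kept in its original position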

for loop & if function in R

I was writing a loop with an if statement in R. The table is like the one below:
ID category
1 a
1 b
1 c
2 a
2 b
3 a
3 b
4 a
5 a
I want to use a for loop with an if statement to add another column that counts the rows within each ID group, like the Count column below:
ID category Count
1 a 1
1 b 2
1 c 3
2 a 1
2 b 2
3 a 1
3 b 2
4 a 1
5 a 1
My code is (output1 is the table name):
for (i in 2:nrow(output1)){
if(output1[i,1] == output[i-1,1]){
output1[i,"rn"]<- output1[i-1,"rn"]+1
}
else{
output1[i,"rn"]<-1
}
}
But in the result, the Count column values are all "1".
ID category Count
1 a 1
1 b 1
1 c 1
2 a 1
2 b 1
3 a 1
3 b 1
4 a 1
5 a 1
Please help me out... Thanks
There are packages and vectorized ways to do this task, but if you are practicing with loops, try:
output1$rn <- 1
for (i in 2:nrow(output1)){
if(output1[i,1] == output1[i-1,1]){
output1[i,"rn"]<- output1[i-1,"rn"]+1
}
else{
output1[i,"rn"]<-1
}
}
With your original code, when you made the call output1[i-1,"rn"]+1 inside your loop, you were referencing a column that didn't exist on the first pass. By first creating the column and filling it with the value 1, you give the loop something explicit to refer to.
output1
# ID category rn
# 1 1 a 1
# 2 1 b 2
# 3 1 c 3
# 4 2 a 1
# 5 2 b 2
# 6 3 a 1
# 7 3 b 2
# 8 4 a 1
# 9 5 a 1
With the package dplyr you can accomplish it quickly with:
library(dplyr)
output1 %>% group_by(ID) %>% mutate(rn = 1:n())
Or with data.table:
library(data.table)
setDT(output1)[,rn := 1:.N, by=ID]
With base R you can also use:
output1$rn <- with(output1, ave(as.character(category), ID, FUN=seq))
There are vignettes and tutorials for the two packages mentioned, and you can read about the last approach by running ?ave in the R console.
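If you would rather get a numeric counter out of ave (the version above returns character values because the input vector is character), a variant I would suggest (not from the original answer):
output1$rn <- with(output1, ave(seq_along(ID), ID, FUN = seq_along))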
A looping solution will be painfully slow for bigger data. Here is a one-line solution using data.table:
require(data.table)
a<-data.table(ID=c(1,1,1,2,2,3,3,4,5),category=c('a','b','c','a','b','a','b','a','a'))
a[,':='(category_count = 1:.N),by=.(ID)]
What you want is actually the factor level as a column. Do this:
df$count=as.numeric(df$category)
this will give out put as
ID category count
1 1 a 1
2 1 b 2
3 1 c 3
4 2 a 1
5 2 b 2
6 3 a 1
7 3 b 2
8 4 a 1
9 5 a 1
provided your category is already a factor. If not, first convert it to a factor:
df$category=as.factor(df$category)
df$count=as.numeric(df$category)
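One caveat worth adding (my note, not part of the original answer): the factor code only matches a per-ID running count here because every ID's categories start at "a" and follow level order. With a group that starts elsewhere, the two diverge:
x <- factor("c", levels = c("a", "b", "c"))  # hypothetical new ID whose first category is "c"
as.numeric(x)
# [1] 3   # but a per-ID running count would start at 1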

R data.table not preserving factor when applying function by group [duplicate]

The data comes from another question I was playing around with:
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
country=c(rep(1,4),rep(2,6)),
event=1:10, key="user")
# user country event
#1: 3 1 1
#2: 3 1 2
#3: 3 1 3
#4: 3 1 4
#5: 3 2 5
#6: 4 2 6
#7: 4 2 7
#8: 4 2 8
#9: 4 2 9
#10: 4 2 10
And here's the surprising behavior:
dt[user == 3, as.data.frame(table(country))]
# country Freq
#1 1 4
#2 2 1
dt[user == 4, as.data.frame(table(country))]
# country Freq
#1 2 5
dt[, as.data.frame(table(country)), by = user]
# user country Freq
#1: 3 1 4
#2: 3 2 1
#3: 4 1 5
# ^^^ - why is this 1 instead of 2?!
Thanks mnel and Victor K. The natural follow-up is: shouldn't it be 2, i.e. is this a bug? I expected
dt[, blah, by = user]
to return identical result to
rbind(dt[user == 3, blah], dt[user == 4, blah])
Is that expectation incorrect?
The idiomatic data.table approach is to use .N
dt[ , .N, by = list(user, country)]
This will be far quicker and it will also retain country as the same class as in the original.
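For this data it should give (output sketched by hand):
dt[, .N, by = list(user, country)]
#    user country N
# 1:    3       1 4
# 2:    3       2 1
# 3:    4       2 5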
As mnel noted in comments, as.data.frame(table(...)) produces a data frame where the first variable is a factor. For user == 4, there is only one level in the factor, which is stored internally as 1.
What you want is factor levels, but what you get is how factors are stored internally (as integers, starting from 1). The following provides the expected result:
> dt[, lapply(as.data.frame(table(country)), as.character), by = user]
user country Freq
1: 3 1 4
2: 3 2 1
3: 4 2 5
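Alternatively (a sketch not shown in the original answer), as.data.frame() on a table accepts a stringsAsFactors argument, which avoids creating the factor in the first place:
dt[, as.data.frame(table(country), stringsAsFactors = FALSE), by = user]  # country stays character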
Update. Regarding your second question: no, I think data.table behaviour is correct. The same thing happens in plain R when you concatenate two factors with different levels:
> a <- factor(3:5)
> b <- factor(6:8)
> a
[1] 3 4 5
Levels: 3 4 5
> b
[1] 6 7 8
Levels: 6 7 8
> c(a,b)
[1] 1 2 3 1 2 3
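If the goal is to combine the two factors while keeping their labels, one way (my sketch) is to go through character first:
factor(c(as.character(a), as.character(b)))
# [1] 3 4 5 6 7 8
# Levels: 3 4 5 6 7 8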

Using R: Make a new column that counts the number of times 'n' conditions from 'n' other columns occur

I have columns 1 and 2 (ID and Value). Next I would like a Count column that lists the number of times the same value occurs per ID. If it occurs more than once, it will obviously repeat the value. There are other variables in this data set, but the new count variable needs to be conditional on only 2 of them. I have scoured this site, but I can't find a way to make the new variable conditional on more than one variable.
ID Value Count
1 a 2
1 a 2
1 b 1
2 a 2
2 a 2
3 a 1
3 b 3
3 b 3
3 b 3
Thank you in advance!
You can use ave:
df <- within(df, Count <- ave(ID, list(ID, Value), FUN=length))
You can use ddply from the plyr package:
library(plyr)
df1<-ddply(df,.(ID,Value), transform, count1=length(ID))
>df1
ID Value Count count1
1 1 a 2 2
2 1 a 2 2
3 1 b 1 1
4 2 a 2 2
5 2 a 2 2
6 3 a 1 1
7 3 b 3 3
8 3 b 3 3
9 3 b 3 3
> identical(df1$Count,df1$count1)
[1] TRUE
Update: as suggested by @Arun, you can replace transform with mutate if you are working with a large data.frame.
Of course, data.table also has a solution!
data[, Count := .N, by = list(ID, Value)]
The built-in special symbol .N is an integer of length 1 holding the number of rows in the current group.
Because := adds the Count column by reference, the original dimensions are retained automatically; if you instead computed a summary with data[, .N, by = list(ID, Value)], you would need to join that result back to your initial data.frame.
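A minimal sketch of that summary-plus-merge route, in case you only want the per-group totals (the object names are mine):
library(data.table)
data <- data.table(ID = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                   Value = c("a", "a", "b", "a", "a", "a", "b", "b", "b"))
counts <- data[, .(Count = .N), by = .(ID, Value)]   # one row per (ID, Value) group
merged <- merge(data, counts, by = c("ID", "Value")) # back to one row per original observation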
