R - How to reformat a wide dataset to a wider dataset by changing the primary unique row variable

Sorry if this has been asked already, and if the title is confusing. I did look, but only found questions on reformatting where the values of one of the columns were used as column headings in the output dataset.
My dataset is organized so that the Filter is the unique value for each row. I want to change it so that the id of the individual within each sampling season is unique for each row, since individuals had multiple filters. Basically, I want to reformat Table 1 so it looks like Table 2.
Table 1
   id season FilterI
1:  1      1       A
2:  1      1       B
3:  2      1       C
4:  2      1       D
5:  1      2       E
6:  1      2       F
Table 2
   id season FilterI1 FilterI2
1:  1      1        A        B
2:  1      2        E        F
3:  2      1        C        D
Reshape does not seem to work because none of the columns in the first dataset contain the column headings for the second dataset.

Using dcast with rowid, change from 'long' to 'wide' (assuming the example data is a data.table). The counter has to restart within each id/season group, so rowid(id, season) is used to build the column headings:
library(data.table)
dcast(Table1, id + season ~ paste0("FilterI", rowid(id, season)), value.var = "FilterI")
#    id season FilterI1 FilterI2
#1:  1      1        A        B
#2:  1      2        E        F
#3:  2      1        C        D
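A self-contained sketch of the approach, reconstructing Table 1 as a data.table from the values shown above:

```r
library(data.table)

# Reconstruct Table 1 from the question
Table1 <- data.table(
  id      = c(1, 1, 2, 2, 1, 1),
  season  = c(1, 1, 1, 1, 2, 2),
  FilterI = c("A", "B", "C", "D", "E", "F")
)

# rowid(id, season) numbers each row within its id/season group,
# producing the FilterI1/FilterI2 headings that dcast spreads into columns
wide <- dcast(Table1, id + season ~ paste0("FilterI", rowid(id, season)),
              value.var = "FilterI")
wide
```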

data.table: How to indicate first occurrence of unique column value by group

I have a large data.table ~ 18*10^6 rows filled with columns ID and CLASS and I want to create a new binary column that indicates the occurrence of a new CLASS value by ID.
DT <- data.table::data.table(ID = c("1", "1", "1", "2", "2"),
                             CLASS = c("a", "a", "b", "c", "b"))
### Starting
ID CLASS
1 a
1 a
1 b
2 c
2 b
### Desired
ID CLASS NEWCLS
1 a 1
1 a 0
1 b 1
2 c 1
2 b 1
I originally initialized the NEWCLS variable and used the data.table::shift() function to lag a 1 by ID and CLASS
DT[,NEWCLS:=0]
DT[,NEWCLS:=data.table::shift(NEWCLS, n = 1L, fill = 1, type = "lag"),by=.(ID,CLASS)]
This creates the desired output but with ~18*10^6 rows it takes quite some time, even for data.table.
Would someone know how to create the NEWCLS variable in a quicker and more efficient way using only data.table?
One possibility could be:
DT[, NEWCLS := as.integer(!duplicated(CLASS)), by = ID]
ID CLASS NEWCLS
1: 1 a 1
2: 1 a 0
3: 1 b 1
4: 2 c 1
5: 2 b 1
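For reference, a runnable sketch of the duplicated() approach on the example data (duplicated() is vectorized within each group, which avoids the per-group shift overhead):

```r
library(data.table)

DT <- data.table(ID    = c("1", "1", "1", "2", "2"),
                 CLASS = c("a", "a", "b", "c", "b"))

# First occurrence of each CLASS within an ID gets 1, repeats get 0
DT[, NEWCLS := as.integer(!duplicated(CLASS)), by = ID]
DT
```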

Using another data table to condition on columns in a primary data table

Suppose I have two data tables, and I want to use the second one, which contains a single row of column values, to filter the first one.
Specifically, I want to use d2 to select the rows of d1 where each variable is less than or equal to the corresponding value in d2.
d1 = data.table('d'=1,'v1'=1:10, 'v2'=1:10)
d2 = data.table('v1'=5, 'v2'=5)
So I would want the output to be
d v1 v2
1: 1 1 1
2: 1 2 2
3: 1 3 3
4: 1 4 4
5: 1 5 5
But I want to do this without referencing specific names unless it's in a very general way, e.g. names(d2).
You could do it with a bit of text manipulation and a join:
d2[d1, on=sprintf("%1$s>=%1$s", names(d2)), nomatch=0]
# v1 v2 d
#1: 1 1 1
#2: 2 2 1
#3: 3 3 1
#4: 4 4 1
#5: 5 5 1
It works because the sprintf expands to:
sprintf("%1$s>=%1$s", names(d2))
#[1] "v1>=v1" "v2>=v2"
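A self-contained sketch of the whole join, using the example data from the question (nomatch=0 drops rows with no qualifying match):

```r
library(data.table)

d1 <- data.table(d = 1, v1 = 1:10, v2 = 1:10)
d2 <- data.table(v1 = 5, v2 = 5)

# Build the non-equi join conditions from d2's column names:
# "v1>=v1" "v2>=v2", i.e. keep d1 rows where d2$v1 >= d1$v1 and d2$v2 >= d1$v2
res <- d2[d1, on = sprintf("%1$s>=%1$s", names(d2)), nomatch = 0]
res
```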

Frequency of Characters in Strings as columns in data frame using R

I have a data frame initial of the following format
> head(initial)
Strings
1 A,A,B,C
2 A,B,C
3 A,A,A,A,A,B
4 A,A,B,C
5 A,B,C
6 A,A,A,A,A,B
and the data frame I want is final
> head(final)
Strings A B C
1 A,A,B,C 2 1 1
2 A,B,C 1 1 1
3 A,A,A,A,A,B 5 1 0
4 A,A,B,C 2 1 1
5 A,B,C 1 1 1
6 A,A,A,A,A,B 5 1 0
To generate the data frames with a large number of rows, the following code can be used:
initial <- data.frame(Strings = rep(c("A,A,B,C", "A,B,C", "A,A,A,A,A,B"), 100))
final <- data.frame(Strings = rep(c("A,A,B,C", "A,B,C", "A,A,A,A,A,B"), 100),
                    A = rep(c(2, 1, 5), 100),
                    B = rep(c(1, 1, 1), 100),
                    C = rep(c(1, 1, 0), 100))
What is the fastest way I can achieve this? Any help will be greatly appreciated
We can use base R methods for this task: split the 'Strings' column (strsplit(...)), set the names of the resulting list to the sequence of rows, stack to convert to a data.frame with key/value columns, get the frequencies with table, convert to a data.frame, and cbind with the original dataset (the output below uses the answer's own example data, which includes a 'D' category):
cbind(df1,
      as.data.frame.matrix(
        table(
          stack(
            setNames(strsplit(as.character(df1$Strings), ','),
                     1:nrow(df1)))[2:1])))
# Strings A B C D
#1 A,B,C,D 1 1 1 1
#2 A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2
or we can use mtabulate after splitting the column.
library(qdapTools)
cbind(df1, mtabulate(strsplit(as.character(df1$Strings), ',')))
# Strings A B C D
#1 A,B,C,D 1 1 1 1
#2 A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2
Update
For the new dataset 'initial', the second method works as-is. To use the first method while keeping the correct row order, convert 'ind' to factor class with levels specified as the unique elements of 'ind'.
df1 <- stack(setNames(strsplit(as.character(initial$Strings), ','),
seq_len(nrow(initial))))
df1$ind <- factor(df1$ind, levels=unique(df1$ind))
cbind(initial, as.data.frame.matrix(table(df1[2:1])))
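Putting the update together, a runnable sketch on the initial data from the question:

```r
initial <- data.frame(Strings = rep(c("A,A,B,C", "A,B,C", "A,A,A,A,A,B"), 100))

# Split each string, keep track of the originating row, and tabulate
df1 <- stack(setNames(strsplit(as.character(initial$Strings), ','),
                      seq_len(nrow(initial))))
df1$ind <- factor(df1$ind, levels = unique(df1$ind))  # preserve row order
final <- cbind(initial, as.data.frame.matrix(table(df1[2:1])))
head(final, 3)
```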

Adding a counter column for a set of similar rows in R [duplicate]

This question already has answers here:
How can I rank observations in-group faster?
(4 answers)
Closed 9 years ago.
I have a data frame in R with two columns. The first column contains the subjectID and the second column contains the ID of the trial that subject has done.
A specific subjectID might have done a trial more than once. I want to add a counter column that starts at 1 for each unique subject-trial combination and increments by 1 until it reaches the last row with that occurrence.
More precisely, I have this table:
ID T
A 1
A 1
A 2
A 2
B 1
B 1
B 1
B 1
and I want the following output
ID T Index
A 1 1
A 1 2
A 2 1
A 2 2
B 1 1
B 1 2
B 1 3
B 1 4
I really like the simple syntax of data.table for this (not to mention speed)...
# Load package
require( data.table )
# Turn data.frame into a data.table
dt <- data.table( df )
# Get running count by ID and T
dt[ , Index := 1:.N , by = c("ID" , "T") ]
# ID T Index
#1: A 1 1
#2: A 1 2
#3: A 2 1
#4: A 2 2
#5: B 1 1
#6: B 1 2
#7: B 1 3
#8: B 1 4
.N is an integer equal to the number of rows in each group. The groups are defined by the column names in the by argument, so 1:.N gives a vector as long as the group.
As data.table inherits from data.frame, any function that takes a data.frame as input will also take a data.table, and you can easily convert back if you wish (df <- data.frame(dt)).
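In more recent data.table versions (1.9.8+), rowid() gives the same running count as a one-liner; a sketch on the question's data:

```r
library(data.table)

df <- data.frame(ID = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 T  = c(1, 1, 2, 2, 1, 1, 1, 1))
dt <- as.data.table(df)

# rowid() numbers rows within each (ID, T) group, equivalent to 1:.N by group
dt[, Index := rowid(ID, T)]
dt
```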

Using R: Make a new column that counts the number of times 'n' conditions from 'n' other columns occur

I have columns 1 and 2 (ID and value). Next, I would like a count column that lists the number of times the same value occurs per ID. If a value occurs more than once, the count will obviously repeat. There are other variables in this data set, but the new count variable needs to be conditional on only 2 of them. I have scoured this blog, but I can't find a way to make the new variable conditional on more than one variable.
ID Value Count
1 a 2
1 a 2
1 b 1
2 a 2
2 a 2
3 a 1
3 b 3
3 b 3
3 b 3
Thank you in advance!
You can use ave:
df <- within(df, Count <- ave(ID, list(ID, Value), FUN=length))
You can use ddply from plyr package:
library(plyr)
df1<-ddply(df,.(ID,Value), transform, count1=length(ID))
>df1
ID Value Count count1
1 1 a 2 2
2 1 a 2 2
3 1 b 1 1
4 2 a 2 2
5 2 a 2 2
6 3 a 1 1
7 3 b 3 3
8 3 b 3 3
9 3 b 3 3
> identical(df1$Count,df1$count1)
[1] TRUE
Update: As suggested by #Arun, you can replace transform with mutate if you are working with large data.frame
Of course, data.table also has a solution!
data[, Count := .N, by = list(ID, Value)]
The built-in constant, ".N", is a length 1 vector reporting the number of observations in each group.
Since := adds the column by reference, the result keeps your original rows and needs no join back; the only extra step is converting the initial data.frame to a data.table first.
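An end-to-end check of the data.table version, reconstructing the table from the question (a sketch):

```r
library(data.table)

data <- data.table(ID    = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                   Value = c("a", "a", "b", "a", "a", "a", "b", "b", "b"))

# .N is the number of rows in each (ID, Value) group;
# := adds the Count column by reference
data[, Count := .N, by = list(ID, Value)]
data
```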
