unique and unlist function alternative R - r

I have a huge (More than 10 Million rows), and I need to get the count as mentioned below.
For this, I will check first all the unique ones like this.
lev<-(unique(unlist(mtcars[,8:11])))
Then count using the table function.
as.data.frame(sapply(mtcars[,8:11], function(x) table(factor(x, levels = lev))))
But the above will work only for small datasets. Most of the time, R will kill this command if I use it for a large dataset.
Is there any suggestion/way to improve speed for large datasets, for example, for using dplyr?

perhaps a data.table approach might work for you
it first melts the data to a long format, and then casts to wide again. This automatically gets the unique values (=rows), and the length of these values (the default fun.aggregate for dcast.data.table) by column (i.e. variable).
DT <- as.data.table(mtcars) # or setDT(mydata)
dcast(melt(DT[,8:11], measure.vars = names(DT)[8:11]),
value ~ variable)
# value vs am gear carb
# 1: 0 18 19 0 0
# 2: 1 14 13 0 7
# 3: 2 0 0 0 10
# 4: 3 0 0 15 3
# 5: 4 0 0 12 10
# 6: 5 0 0 5 0
# 7: 6 0 0 0 1
# 8: 8 0 0 0 1

Related

Transforming longitudinal data for time-to-event analysis in R

I am trying to reformat longitudinal data for a time to event analysis. In the example data below, I simply want to find the earliest week that the result was “0” for each ID.
The specific issue I am having is how to patients that don't convert to 0, and had either all 1's or 2's. In the example data, patient J has all 1's.
#Sample data
have<-data.frame(patient=rep(LETTERS[1:10], each=9),
week=rep(0:8,times=10),
result=c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
1,rep(0,8),rep(1,9)))
patient week result
A 0 1
A 1 0
A 2 2
A 3 0
A 4 0
A 5 0
A 6 0
A 7 0
A 8 0
B 0 1
B 1 0
... .....
J 6 1
J 7 1
J 8 1
I am able to do this relatively straightforward process with the following code:
want<-aggregate(have$week, by=list(have$patient,have$result), min)
want<-want[which(want[2]==0),]
but realize if someone does not convert to 0, it excludes them (in this example, patient J is excluded). Instead, J should be present with a 1 in the second column and an 8 in the third column. Instead it of course is omitted
print(want)
Group.1 Group.2 x
A 0 1
B 0 4
C 0 2
D 0 1
E 0 6
F 0 3
G 0 3
H 0 2
I 0 1
#But also need
J 1 8
Pursuant to guidelines on posting here, I did work to solve this, am able to get what I need very inelegantly:
mins<-aggregate(have$week, by=list(have$patient,have$result), min)
maxs<-aggregate(have$week, by=list(have$patient,have$result), max)
want<-rbind(mins[which(mins[2]==0),],maxs[which(maxs[2]==1&maxs[3]==8),])
This returns the correct desired dataset, but the coding is terrible and not sustainable as I work with other datasets (i.e. datasets with different timeframes since I have to manually put in maxsp[3]==8, etc).
Is there a more elegant or systematic way to approach this data manipulation issue?
We can write a function to select a row from the group.
select_row <- function(result, week) {
if(any(result == 0)) which.max(result == 0) else which.max(week)
}
This function returns the index of first 0 value if it is present or else returns index of maximum value of week.
and apply it to all groups.
library(dplyr)
have %>% group_by(patient) %>% slice(select_row(result, week))
# patient week result
# <fct> <int> <dbl>
# 1 A 1 0
# 2 B 4 0
# 3 C 2 0
# 4 D 1 0
# 5 E 6 0
# 6 F 3 0
# 7 G 3 0
# 8 H 2 0
# 9 I 1 0
#10 J 8 1

How to convert the result of xtabs() into dataframe in R? [duplicate]

This question already has answers here:
How to convert a table to a data frame
(5 answers)
Closed 4 years ago.
I have data like dataframe df_a, and want to have it converted to the format as in dataframe df_b.
xtabs() gives similar result, but I did not find a way to access elements as in the example code below. Accessing through xa[1,1] gives no advantage since there is a weak correlation between indexing by numbers ("1") and names ("A"). As you can see there is a sort difference in the xtabs() result, so xa[2,2]=2 and not 0 as on the df_b listing.
> df_a
ItemName Feature Amount
1 First A 2
2 First B 3
3 First A 4
4 Second C 3
5 Second C 2
6 Third D 1
7 Fourth B 2
8 Fourth D 3
9 Fourth D 2
> df_b
ItemName A B C D
1 First 6 3 0 0
2 Second 0 0 5 0
3 Third 0 0 0 1
4 Fourth 0 2 0 5
> df_b$A
[1] 6 0 0 0
> xa<-xtabs(df_a$Amount~df_a$ItemName+df_a$Feature)
> xa
df_a$Feature
df_a$ItemName A B C D
First 6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third 0 0 0 1
> xa$A
Error in xa$A : $ operator is invalid for atomic vectors
There is a way of iterative conversion with for() loops, but totally inefficient in my case because my data has millions of records.
For the purpose of further processing my required output format is dataframe.
If anyone solved similar problem please share.
You can just use as.data.frame.matrix(xa)
# output
A B C D
First 6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third 0 0 0 1
## or
df_b <- as.data.frame.matrix(xa)[unique(df_a$ItemName), ]
data.frame(ItemName = row.names(df_b), df_b, row.names = NULL)
# output
ItemName A B C D
1 First 6 3 0 0
2 Second 0 0 5 0
3 Third 0 0 0 1
4 Fourth 0 2 0 5
Without using xtabs you can do something like this:
df %>%
dplyr::group_by(ItemName, Feature) %>%
dplyr::summarise(Sum=sum(Amount, na.rm = T)) %>%
tidyr::spread(Feature, Sum, fill=0) %>%
as.data.frame()
This will transform as you require and it stays as a data.frame
Or, you can just as.data.frame(your_xtabs_result) and that should work too

incidence matrix for bipartite groups with multiple joins

I was wondering if there's a fast way to get an incidence matrix for this such a problem. I've got two data frames with three columns (the join keys)
df1 <- data.frame(K1=c(1,1,0,1,3,2,2),K2=c(1,2,1,0,2,0,1),K3=c(0,0,3,2,1,3,0))
df2 <- data.frame(K1=c(1,2,0,3),K2=c(0,1,2,0),K3=c(2,0,3,1))
and I need to obtain the corresponding incidence matrix
# IM:
# 1 2 3 4
# 1 1 1 0 0
# 2 1 0 1 0
# 3 0 1 1 0
# 4 1 0 0 0
# 5 0 0 1 1
# 6 0 1 1 0
# 7 0 1 0 0
where it's set 1 if there's a match between the corresponding KEY (column value) of rows of the two data frames.
I would do by using multiple loops
for (j in seq_len(nrow(df2)))
for (k in seq_len(ncol(df2))) {
if (df2[j,k])
m[which(df1[,k] == df2[j,k]),j] <- 1
}
but it's a C approach and maybe there's something faster in R. Do you have any other ideas? Besides, when the data.frame are quite big (around 50k and 20k rows), I cannot allocate the matrix as it seems too big.

R: How to sum two separate values of two variables?

I have data 7320 obs of 3 variables: age groups and contact number between them. Ex:
ageGroup ageGroup1 mij
0 0 0.012093847617507
0 1 0.00510485237464309
0 2 0.00374919082969427
0 3 0.00307241431437433
0 4 0.00254487083293498
0 5 0.00213734013959765
0 6 0.00182565778959543
0 7 0.00159036659169942
1 0 0.00475097494199872
1 1 0.00748329237103462
1 2 0.00427123298868537
1 3 0.00319622224196792
1 4 0.00287522072903812
1 5 0.00257773394696414
1 6 0.00230322568677366
1 7 0.00205265986733139
and so on until 86. I have to calculate mean of contact number (mij) between ageGroups so that, for example, ageGroup = 0 contacts with ageGroup1 =1 with mij and ageGroup = 1 contacts with ageGroup1 = 0 with mij. I need to sum this values and divide by 2 to get an average between then. Would you be so kind to give me a hint how to do that all over the data?
Use ddply from plyr package (assuming your dataframe is data)
ddply(data,.(ageGroup,ageGroup1),summarize,sum.mij=sum(mij))
ageGroup ageGroup1 sum.mij
1 0 0 0.012093848
2 0 1 0.005104852
3 0 2 0.003749191
4 0 3 0.003072414
5 0 4 0.002544871
6 0 5 0.002137340
7 0 6 0.001825658
8 0 7 0.001590367
9 1 0 0.004750975
10 1 1 0.007483292
11 1 2 0.004271233
12 1 3 0.003196222
13 1 4 0.002875221
14 1 5 0.002577734
15 1 6 0.002303226
16 1 7 0.002052660
I think I see what you're trying to do here. You want to treat interactions between the two ageGroup columns as being non-directional and get the mean interaction? The code below should do this using base R functions.
Note that since the example dataset is truncated, it will only give a correct answer for the group with index 01. However if you run with the full dataset, it should work for all interactions.
# Create the data frame
df=read.table(header=T,text="
ageGroup,ageGroup1,mij
0,0,0.012093848
0,1,0.005104852
0,2,0.003749191
0,3,0.003072414
0,4,0.002544871
0,5,0.00213734
0,6,0.001825658
0,7,0.001590367
1,0,0.004750975
1,1,0.007483292
1,2,0.004271233
1,3,0.003196222
1,4,0.002875221
1,5,0.002577734
1,6,0.002303226
1,7,0.00205266
",sep=",")
df
# Using the strSort function from this SO answer:
# http://stackoverflow.com/questions/5904797/how-to-sort-letters-in-a-string-in-r
strSort <- function(x)
sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")
# Label each of the i-j interactions and j-i interactions with an index ij
# e.g. anything in ageGroup=1 interacting with ageGroup1=0 OR ageGroup=0 interacting with ageGroup1=1
# are labelled with index 01
df$ind=strSort(paste(df$ageGroup,df$ageGroup1,sep=""))
# Use the tapply function to get mean interactions for each group as suggested by Paul
tapply(df$mij,df$ind,mean)

Filtering data without loops in R

I have quite big data frame (few millions of records).
I need to filter it due to following rule:
- For each product delete all records which are before the fifth record after the first record with x>0.
So, We are interested only in two columns - ID and x. Data frame is sorted by ID.
It is fairly easy to do it using loops, but loops doesn't perform well on such big data frame.
How to do it in 'vector style'?
Example:
BEFORE FILTERING
ID x
1 0
1 0
1 5 # First record with x>0
1 0
1 3
1 4
1 0
1 9
1 0 # Delete all earlier records of that product
1 0
1 6
2 0
2 1 # First record with x>0
2 0
2 4
2 5
2 8
2 0 # Delete all earlier records of that product
2 1
2 3
After filtering:
ID x
1 9
1 0
1 0
1 6
2 0
2 1
2 3
For these split, apply, combine problems - I like using plyr. There are alternatives if speed becomes an issue, but for most things - plyr is easy to understand and use. I wrote a function that implements the logic you described above and then fed that to ddply() to operate on each chunk of the data based on ID.
fun <- function(x, column, threshold, numplus){
whichcol <- which(x[column] > threshold)[1]
rows <- seq(from = (whichcol + numplus), to = nrow(x))
return(x[rows,])
}
And then feed this to ddply()
require(plyr)
ddply(dat, "ID", fun, column = "x", threshold = 0, numplus = 5)
#-----
ID x
1 1 9
2 1 0
3 1 0
4 1 6
5 2 0
6 2 1
7 2 3

Resources