For loop to iterate through columns in data.table [duplicate] - r

I am trying to write a "for" loop that iterates through each column in a data.table and returns a frequency table. However, the following code keeps throwing an error:
library(data.table)
library(datasets)
data(cars)
cars <- as.data.table(cars)
for (i in names(cars)){
print(table(cars[,i]))
}
Error in `[.data.table`(cars, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[, ..i]. This difference to data.frame is deliberate and explained in FAQ 1.1.
When I use each column individually like below, I do not have any problem:
> table(cars[,dist])
2 4 10 14 16 17 18 20 22 24 26 28 32 34 36 40 42 46 48 50 52 54 56 60 64 66
1 1 2 1 1 1 1 2 1 1 4 2 3 3 2 2 1 2 1 1 1 2 2 1 1 1
68 70 76 80 84 85 92 93 120
1 1 1 1 1 1 1 1 1
My data is quite large (8921483 x 52), which is why I want to use a "for" loop, run everything at once, and then look at the results.
I included the cars dataset (which is easier to run) to demonstrate my code.
If I convert the dataset to a data.frame, the "for" loop runs without problems. But I want to know why this does not work with data.table, because I am learning it and believe it works better with large datasets.
If someone has already seen a post with an answer, please let me know; I have been searching for one for several hours.

Some solutions can be found here.
My personal preference is the apply function, though:
library(data.table)
library(datasets)
data(cars)
cars <- as.data.table(cars)
apply(cars,2,table)
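To make your loop work, look the column up by the name stored in i instead of passing the bare symbol i:
library(data.table)
library(datasets)
data(cars)
cars <- as.data.table(cars)
for (i in names(cars)){
print(table(cars[[i]]))  # cars[[i]] extracts the column named by i as a plain vector
}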
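The error arises because data.table evaluates j inside the table, so a bare i is treated as a column name rather than as the loop variable. A minimal sketch of ways around this, using the cars data from the question (the object names dt, via_brackets, via_get and freq are only illustrative):

```r
library(data.table)

dt <- as.data.table(datasets::cars)

for (i in names(dt)) {
  # dt[[i]] is list-style extraction: returns the column named by i as a vector
  via_brackets <- table(dt[[i]])
  # get(i) resolves the string stored in i to a column inside j
  via_get <- table(dt[, get(i)])
  # both approaches give the same counts
  stopifnot(all(via_brackets == via_get))
  print(via_brackets)
}

# A data.table is also a list of columns, so lapply() avoids the explicit loop
freq <- lapply(dt, table)
```

Keeping the result in a named list such as freq may be more convenient than printing when there are 52 columns to inspect.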

Related

R countif and sum on multiple columns matching elements in specified vector

I am applying this function to column DL1 of my dataset, matching it against another vector, and I get the expected result:
table(df$DL1[df$DL1 %in% undefined_dl_codes])
Result:
0 10 30 3B 4 49 54 5A 60 7 78 8 90
24 366 4 3 665 40 1 1 14 8 4 87 1
However, I also have columns DL2, DL3 and DL4, which hold the same kind of data. How can I apply the function to all four columns and get a single summary result?
Any help is highly appreciated!
This may not be the best method, but you could do the following:
table(c(df$DL1[df$DL1 %in% undefined_dl_codes],
df$DL2[df$DL2 %in% undefined_dl_codes],
df$DL3[df$DL3 %in% undefined_dl_codes],
df$DL4[df$DL4 %in% undefined_dl_codes]
)
)
Using Raghuveer's solution, I simplified further:
attach(df)
table(c(DL1,DL2,DL3,DL4)[c(DL1,DL2,DL3,DL4) %in% undefined_dl_codes])
detach(df)
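attach() can mask other objects in the session, so a variant that avoids it may be safer. A minimal sketch with made-up data (the values in df and undefined_dl_codes are hypothetical; only the column names come from the question):

```r
# Hypothetical example data: column names follow the question, values are invented
df <- data.frame(
  DL1 = c("0", "10", "4", "99"),
  DL2 = c("4", "4", "77", "10"),
  DL3 = c("8", "0", "99", "4"),
  DL4 = c("77", "8", "8", "0"),
  stringsAsFactors = FALSE
)
undefined_dl_codes <- c("0", "10", "4", "8")

# Stack the four columns into one vector, then count only the listed codes
all_codes <- unlist(df[c("DL1", "DL2", "DL3", "DL4")], use.names = FALSE)
code_counts <- table(all_codes[all_codes %in% undefined_dl_codes])
print(code_counts)
```

Selecting the columns by name also makes it easy to extend the summary to more DL columns later.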

Reassigning one column according to another using data.table

I want to replace the value -11 in one column, "contra_end", with the corresponding value from another column, "current_age". -11 is a code indicating current activity, and I want to replace it with the actual age of each individual, stored in "current_age". Age has ~500,000 values and only ~4,000 values in the first column are -11. When I run the following code to assign the age values to the -11 entries in "contra_end", I get the error below. Can I make this work without creating a new age variable?
biobank[contra_end == -11, contra_end := biobank[,"current_age", with=FALSE]]
Error in `[.data.table`(biobank, contra_end == -11, `:=`(contra_end, biobank[, :
Supplied 500000 items to be assigned to 4919 items of column 'contra_end'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.
I used a small example dataset, created with this code:
biobank <- data.frame(contra_end = c(0,13,15,109,-11,23,45),
current_age = c(34,35,36,46,43,56,23))
which gives
contra_end current_age
1 0 34
2 13 35
3 15 36
4 109 46
5 -11 43
6 23 56
7 45 23
Using dplyr::mutate:
library(dplyr)
biobank_2 <- biobank %>%
mutate(contra_end = ifelse(contra_end == -11, current_age, contra_end))
Or using base R:
biobank$contra_end[biobank$contra_end==-11] <- biobank$current_age[biobank$contra_end==-11]
Both options give:
contra_end current_age
1 0 34
2 13 35
3 15 36
4 109 46
5 43 43
6 23 56
7 45 23
EDIT: I didn't notice that you were looking for a data.table solution until after I posted. With about 500,000 records, though, either of the solutions above should be efficient enough.
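For a data.table version: when a row filter is given, the right-hand side of := is evaluated only on the selected rows, so naming the column directly supplies exactly one age per matching row. A minimal sketch using the small example dataset above:

```r
library(data.table)

biobank <- data.table(
  contra_end  = c(0, 13, 15, 109, -11, 23, 45),
  current_age = c(34, 35, 36, 46, 43, 56, 23)
)

# current_age is subset to the rows where contra_end == -11 before assignment,
# so the lengths on both sides of := match
biobank[contra_end == -11, contra_end := current_age]
print(biobank)
```

The update happens by reference, so no new age variable or copy of the table is created.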

Printing only certain panels in R lattice

I am drawing a quantile-quantile plot for some data that I have. I would like to print only those panels that satisfy a condition I wrap around panel.qq(x,y,...).
Here is an example. The following is my code:
qq(y ~ x|cond,data=test.df,panel=function(x,y,subscripts,...){
if(length(unique(test.df[subscripts,2])) > 3 ){panel.qq(x,y,subscripts,...)}})
Here y is the factor and x is the variable that will be plotted on the x and y axes. cond is the conditioning variable. I would like only those panels to be printed that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts,2])) > 3).
I hope this information helps. Thanks in advance.
Added Sample data,
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. I would like to have no panel for 123, because the number of unique x values for 123 is 3, while for the others it is 4. Thanks again.
Yeah, I think it is a subset problem, not a lattice one. You don't include an example, but it looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
Where c2 is that variable in the second column. The last line of code first adds a variable, N, for the count of rows (.N) for each value of c2, then subsets for N>3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset
I did something very similar to DaveTurek.
My sample data frame above is test.df.
test.df.list <- split(test.df,test.df$cond,drop=F)
final.test.df <- do.call("rbind",lapply(test.df.list,function(r){
if(length(unique(r$x)) > 3){r}}))
So, here I split test.df into a list of data frames by the conditioning variable. Next, in the lapply I check the number of unique values of x in each subset. If this number is greater than 3, the data frame is returned; otherwise it is dropped (the function returns NULL). Finally, do.call binds the remaining data frames back into one big data frame to run the quantile-quantile plot on.
In case anyone wants to know the qq function call once the data has been filtered, it is:
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.
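The same filter on the number of distinct x values per cond can also be written without splitting and re-binding. A minimal base R sketch using the sample data from the question (n_unique and kept are illustrative names, not from the original posts):

```r
# Sample data from the question: 10 rows for cond 125, 8 for 124, 10 for 123
test.df <- data.frame(
  y    = rep(1:2, 14),
  x    = c(6, 5, 5, 6, 3, 8, 8, 3, 5, 6,
           5, 6, 6, 5, 5, 6, 4, 7,
           0, 11, 0, 11, 0, 11, 0, 11, 0, 2),
  cond = rep(c(125, 124, 123), times = c(10, 8, 10))
)

# ave() returns, for every row, the number of distinct x values in its cond group
n_unique <- ave(test.df$x, test.df$cond, FUN = function(v) length(unique(v)))

# Keep only the groups with more than 3 distinct x values
kept <- test.df[n_unique > 3, ]
print(unique(kept$cond))
```

The resulting data frame can then be passed as the data argument to qq in the same way as final.test.df.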

R - create new vectors based on elements of existing vector

Thanks in advance for your help. I am very new to R and am having some trouble with code that, to me, looks like it should work but doesn't. I have a data frame like the one below:
studentID classNumber classRating
7 1 4
7 2 4
7 4 3
79 1 5
79 2 3
116 1 5
116 2 4
134 1 5
134 3 5
134 4 5
And I want it to read like this:
Student ID class1 class2 class3 class4
7 4 4 NA 3
79 5 3 NA NA
116 5 4 NA NA
134 5 NA 5 5
I've tried to piece together different things that I've come across and it seemed like the best approach was to create a new data frame and matrix and then populate it from the current data frame. I came up with the broken code below:
classRatings = data.frame(matrix(NA,4,5))
for(i in 1:nrow(classDB)){
#Find ratings by each student
rowsToReplace = classDB$studentID==classRatings$studentID[i]
#Make a row for each unique studentID in classRatings
classDB$studentID[rowsToReplace] = classRatings$studentID[i]
#for each studentID, find put the given rating for each unique class into
#it's own vector
for(j in classDB$classNumber){
if(classDB$classNumber==1){classRatings$class1==classDB$classRating}[j]
if(classDB$classNumber==2){classRatings$class2==classDB$classRating}[j]
if(classDB$classNumber==3){classRatings$class3==classDB$classRating}[j]
if(classDB$classNumber==4){classRatings$class4==classDB$classRating}[j]
if(classDB$classNumber==5){classRatings$class5==classDB$classRating}[j]
}
}
I'm getting an error that says:
the condition has length > 1 and only the first element will be used
and I am beyond my skill level to figure it out. Any help is appreciated.
The tidyr package can spread this long table into a wider one:
library(tidyr)
spread(classDB,classNumber,classRating,fill=NA)
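On the error itself: if() expects a single TRUE or FALSE, but classDB$classNumber==1 compares the whole column and returns a vector; the loop body also uses == (comparison) where an assignment was intended. As an alternative that needs no packages and yields the class1 to class4 column names, here is a minimal base R sketch with the data from the question (ratings and wide are illustrative names):

```r
classDB <- data.frame(
  studentID   = c(7, 7, 7, 79, 79, 116, 116, 134, 134, 134),
  classNumber = c(1, 2, 4, 1, 2, 1, 2, 1, 3, 4),
  classRating = c(4, 4, 3, 5, 3, 5, 4, 5, 5, 5)
)

# Cross-tabulate ratings by student and class; combinations with no rating become NA
ratings <- with(classDB,
                tapply(classRating, list(studentID, classNumber),
                       FUN = function(v) v[1]))
colnames(ratings) <- paste0("class", colnames(ratings))

# Turn the matrix into a data frame with studentID as a regular column
wide <- data.frame(studentID = as.numeric(rownames(ratings)),
                   ratings, row.names = NULL)
print(wide)
```

This sketch assumes at most one rating per student and class; with duplicates, only the first rating would be kept.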

Sum multiple columns [duplicate]

I am trying to write a function that will sum the column(s) in the data frame according to the values in the first two columns. For example, I have a matrix M:
Crs gr P_7 P_8
38 1 3 16
38 1 12 45
38 1 9 28
40 2 3 9
40 2 14 29
40 1 4 3
40 2 8 2
I want to sum the columns grouped first by column 1 (Crs) and then by column 2 (gr). The result will be:
Crs gr P_7 P_8
38 1 24 89
40 2 25 40
40 1 4 3
Currently I am using,
M <- M[, list(sum(P_7),sum(P_8)), by=list(Crs,gr)]
The problem with this is that I have to spell out the column names, which won't be fixed. How can I do this without naming the columns explicitly?
Thanks in advance!
You're looking for this:
M[, lapply(.SD, sum), by = list(Crs, gr)]
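.SD holds every column not named in by, so additional P_ columns are summed automatically. A runnable sketch with the data from the question; .SDcols is shown as an optional way to restrict which columns are summed (totals, totals_p and p_cols are illustrative names):

```r
library(data.table)

M <- data.table(
  Crs = c(38, 38, 38, 40, 40, 40, 40),
  gr  = c(1, 1, 1, 2, 2, 1, 2),
  P_7 = c(3, 12, 9, 3, 14, 4, 8),
  P_8 = c(16, 45, 28, 9, 29, 3, 2)
)

# Sum every non-grouping column within each (Crs, gr) combination
totals <- M[, lapply(.SD, sum), by = .(Crs, gr)]

# Optionally restrict the summed columns by a name pattern
p_cols <- grep("^P_", names(M), value = TRUE)
totals_p <- M[, lapply(.SD, sum), by = .(Crs, gr), .SDcols = p_cols]
print(totals)
```

Groups appear in order of first occurrence, which matches the expected result in the question.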
The package plyr has some magic for situations just like this. Use a combination of ddply and numcolwise, like this:
library(plyr)
ddply(dat, .(Crs, gr), numcolwise(sum))
results in:
Crs gr P_7 P_8
1 38 1 24 89
2 40 1 4 3
3 40 2 25 40
