R: set colnames to numeric

Is it possible to set the colnames of a matrix in R to numeric?
I know that they are character or NULL by default.
But if there's a way to transform them to numeric, it would be very helpful to me.
Any clarification would be welcome.
EDIT
I'll explain myself more clearly:
I have a data frame that contains two numeric columns, for example:
> xy
x y
[1,] 1 1
[2,] 2 2
[3,] 3 5
[4,] 4 7
> library(data.table)
> xy = as.data.table(xy)
> xy_cast = dcast(xy, x~y, value.var="y", fun.aggregate=length)
> xy_cast
x 1 2 5 7
1 1 1 0 0 0
2 2 0 1 0 0
3 3 0 0 1 0
4 4 0 0 0 1
> xy_cast = xy_cast[-1]
> xy_cast
1 2 5 7
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
> class(colnames(xy_cast))
[1] "character"
As you can see, my colnames are numbers, but they are coerced to character. If I could transform the colnames to numeric, it would reduce the execution time of the rest of the algorithm.
But, I'm not sure that's possible.
SOLUTION
I tried to approach the problem from another angle and thought about it differently:
which( colnames(df)=="b" )
This R expression let me work with my colnames by looking up the column number for a given column name, which helped me reduce execution time.
The first answer to this question helped me resolve my problem:
Get the column number in R given the column name
Thank you for your responses.

You can by default refer to the columns by their respective number:
df[,3]
will return the third column, or you can also use
df[,"155"]
which will return the column whose name is "155" (the name itself being a character string).
If you're after getting the column name as a numeric, you can extract it and then coerce it, say
as.numeric(names(df)[3])
will return the name of the third column as a numeric, provided it is a numeric-looking character string. So you can easily go back and forth.
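To make the back-and-forth concrete, here is a minimal sketch; the data frame df with numeric-looking column names is made up for illustration:
df <- data.frame(`155` = 1:3, `210` = 4:6, check.names = FALSE)
as.numeric(names(df))                # 155 210 -- every column name as a number
which(colnames(df) == "155")         # 1       -- position of the column named "155"
df[, which(colnames(df) == "155")]   # 1 2 3   -- the same column, addressed by position
This is essentially the which() trick from the SOLUTION above: keep the names as character, but do the lookups by position.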

Related

For loop in R takes forever to run

people_id activity_id success totl_act success_rate cum_success cum_act cum_success_rate success_rate_trend
(fctr) (fctr) (int) (int) (dbl) (int) (int) (dbl) (dbl)
1 ppl_100 act2_1734928 0 1 0 0 1 0 NA
2 ppl_100 act2_2434093 0 1 0 0 2 0 0
3 ppl_100 act2_3404049 0 1 0 0 3 0 0
4 ppl_100 act2_3651215 0 1 0 0 4 0 0
5 ppl_100 act2_4109017 0 1 0 0 5 0 0
6 ppl_100 act2_898576 0 1 0 0 6 0 0
7 ppl_100002 act2_1233489 1 1 1 1 1 1 1
8 ppl_100002 act2_1623405 1 1 1 2 2 1 0
9 ppl_100003 act2_1111598 1 1 1 1 1 1 0
10 ppl_100003 act2_1177453 1 1 1 2 2 1 0
I have this sample data frame. I want to create a variable success_rate_trend using the cum_success_rate variable. The challenge is that I want to compute it for every activity_id except the first activity of every unique people_id, i.e. I want to capture the success trend for each unique people_id. I'm using the code below:
success_rate_trend <- vector(mode = "numeric", length = nrow(succ_rate_df) - 1)
for (i in 2:nrow(succ_rate_df)) {
  if (succ_rate_df[i, 1] != succ_rate_df[i - 1, 1]) {
    success_rate_trend[i] <- NA
  } else {
    success_rate_trend[i] <- succ_rate_df[i, 8] - succ_rate_df[i - 1, 8]
  }
}
It takes forever to run. I have close to a million rows in the succ_rate_df data frame. Can anyone suggest how to simplify the code and reduce the run time?
Use vectorization:
success_rate_trend <- diff(succ_rate_df$cum_success_rate)
success_rate_trend[diff(as.integer(succ_rate_df$people_id)) != 0] <- NA_real_
Note:
people_id is a factor variable (fctr). To use diff() we must use as.integer() or unclass() to strip the factor class.
You do not have an ordinary data frame but a tbl_df from dplyr, so matrix-like indexing does not work. Use succ_rate_df$people_id or succ_rate_df[["people_id"]] instead of succ_rate_df[, 1].
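For illustration, a quick sanity check of the vectorised approach above on a made-up miniature of the data (the column names mirror the question's; the values are invented):
succ_rate_df <- data.frame(
  people_id        = factor(c("p1", "p1", "p1", "p2", "p2")),
  cum_success_rate = c(0, 0.5, 1/3, 1, 1)
)
success_rate_trend <- diff(succ_rate_df$cum_success_rate)
success_rate_trend[diff(as.integer(succ_rate_df$people_id)) != 0] <- NA_real_
success_rate_trend
# [1]  0.5000000 -0.1666667         NA  0.0000000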
You should be able to do this calculation using a vectorised approach. This will be orders of magnitude quicker.
n <- nrow(succ_rate_df)
# TRUE where consecutive rows belong to the same person (use $, since this is a tbl_df)
same_person <- succ_rate_df$people_id[2:n] == succ_rate_df$people_id[1:(n - 1)]
success_rate_trend <- rep(NA_real_, n - 1)   # positions where the person changes stay NA
is_true <- which(same_person)
success_rate_trend[is_true] <- succ_rate_df$cum_success_rate[is_true + 1] -
  succ_rate_df$cum_success_rate[is_true]
The answer by Zheyuan Li is neater.
I'm going to offer an answer based on a data frame version of this data. You SHOULD learn to post the output of dput, so that objects with special properties, like the tibble you have printed above, can be copied into other users' consoles without loss of attributes. I'm also going to name my data frame dat. The ave function is appropriate for calculating numeric vectors when you want them to be the same length as an input vector but want the calculations restricted to grouping vector(s). I only used one grouping factor, although your English-language description of the problem suggested you wanted two. There are worked examples on SO of grouping with two factors with ave.
success_rate_trend <- with(dat,
  ave(cum_success_rate, people_id, FUN = function(x) c(NA, diff(x))))
success_rate_trend
[1] NA 0 0 0 0 0 NA 0 NA 0
# not a very interesting result
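Since no dput output was posted, dat was never actually shown; a minimal stand-in (with invented values) to make the ave() call reproducible could look like this:
dat <- data.frame(
  people_id        = c("p1", "p1", "p1", "p2", "p2"),
  cum_success_rate = c(0, 0.5, 1/3, 1, 1)
)
with(dat, ave(cum_success_rate, people_id, FUN = function(x) c(NA, diff(x))))
# [1]         NA  0.5000000 -0.1666667         NA  0.0000000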

Creating a 2 by 2 table with no values in one column or row in R

I need to tabulate my data as a 2x2 table; however, because there are no values in some cells, the table command in R does not produce a column or row, depending on the data. For example:
a<-matrix(c(0,1,1,1,1,1,1,1),4,2)
table(a[,1],a[,2])
This is how it presents:
    1
  0 1
  1 3
However, I need it to be like
    0 1
  0 0 1
  1 0 3
Any suggestion?
The problem is that your matrix a contains plain numbers, and from the numbers alone R has no way of knowing which rows and columns should be shown. The solution is easy, though: transform your data into factors, providing all potential values as levels:
table(factor(a[,1], levels = unique(c(a))),
      factor(a[,2], levels = unique(c(a))))
# 0 1
# 0 0 1
# 1 0 3
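If the set of possible values is known in advance, a small variant of the same idea is to hard-code the levels instead of deriving them from the data:
a <- matrix(c(0, 1, 1, 1, 1, 1, 1, 1), 4, 2)
table(factor(a[, 1], levels = 0:1),
      factor(a[, 2], levels = 0:1))
#     0 1
#   0 0 1
#   1 0 3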

Matching values in a column of one data frame with subsets of a column in another data frame

I am trying to match the values in a column of one data frame with the values in a column of a second data frame. The tricky part is that I would like to do the matching using subsets of the second data frame (designated by a distinct column in the second data frame from the one that is being matched). This is different from the commonly posted problem of trying to subset based on matching between data frames.
My problem is the opposite - I want to match data frames based on subsets. To be specific, I would like to match subsets of the column in the second data frame with the entire column of the first data frame, and then create new columns in the first data frame that show whether or not a match has been made for each subset.
These subsets can have varying number of rows. Using the two dummy data frames below...
DF1 <- data.frame(number=1:10)
DF2 <- data.frame(category = rep(c("A","B","C"), c(5,7,3)),
number = sample(10, size=15, replace=T))
...the objective would be to create three new columns (DF1$A, DF1$B, and DF1$C) that show whether the values in DF1$number match the values in DF2$number for each of the respective subsets of DF2$category. Ideally the rows in these new columns would show a '1' if a match has been made and a '0' if it has not. With this dummy data I would end up with DF1 having 4 columns (DF1$number, DF1$A, DF1$B, and DF1$C) of 10 rows each.
Note that in my actual second data frame I have a huge number of categories, so I don't want to have to type them out individually for whatever operation is needed to accomplish this objective. I hope that makes sense! Sorry if I'm missing something obvious and thanks very much for any help you might be able to provide.
This should work:
sapply(split(DF2$number, DF2$category), function(x) DF1$number %in% x + 0)
A B C
[1,] 0 0 1
[2,] 1 1 0
[3,] 1 1 1
[4,] 0 1 0
[5,] 0 0 1
[6,] 0 1 0
[7,] 1 1 0
[8,] 1 0 0
[9,] 1 0 0
[10,] 0 1 0
You can add this back to DF1 like:
data.frame(
  DF1,
  sapply(split(DF2$number, DF2$category), function(x) DF1$number %in% x + 0)
)
number A B C
1 1 0 0 1
2 2 1 1 0
3 3 1 1 1
4 4 0 1 0
5 5 0 0 1
6 6 0 1 0
7 7 1 1 0
8 8 1 0 0
9 9 1 0 0
10 10 0 1 0
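For what it's worth, a roughly equivalent sketch that loops over the category values explicitly (same made-up DF1 and DF2 as above), in case you prefer to keep the subsetting visible:
cats <- sort(unique(as.character(DF2$category)))   # "A" "B" "C"
ind  <- sapply(cats, function(ct)
  as.integer(DF1$number %in% DF2$number[DF2$category == ct]))
cbind(DF1, ind)
The split() + sapply() version in the answer above avoids the repeated subsetting, so it should scale better when there are many categories.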

R bins are percentages of column length

I have a table of several columns, with values from 1 to 8. The columns have different lengths, so I have filled them with NAs at the end. I would like to transform each column of the data so that I get something like this for each column:
1 2 3 4 5 6 7 8
0-25 1 0 0 0 0 1 0 2
25-50 5 1 2 0 0 0 0 1
50-75 12 2 2 3 0 1 1 1
75-100 3 25 1 1 1 0 0 0
where the row names are percentages of the actual length of the original column (i.e. without the NAs), the column names are the original 1 to 8 values, and the new values are the number of occurrences of the original values within each percentage bin. Any ideas will be appreciated.
Best,
Lince
PS/ I realize that my original message was very confusing. The data I want to transform contain a number of columns from time series like this:
1
1
8
1
3
4
1
5
1
6
2
7
1
NA
NA
and I need to calculate the frequency of occurrences of each value (1 to 8) in the 0-25%, 25-50%, et cetera portions of the series. Joris' answer is very useful. I can work on it. Thanks!
Given the lack of some information, I can offer you this:
Say 0 is no occurrence and 1 is an occurrence. Then you can use the following little script for the results of one column. Wrap it in a function, apply it over the columns, and you get what you need.
x <- c(1,0,0,1,1,0,1,0,0,0,1,0,1,1,1,NA,NA,NA,NA,NA,NA)
prop <- which(x==1) / sum(!is.na(x))*100
result <- cut(prop,breaks=c(0,25,50,75,100))
table(result)
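To generalize that idea to the edited question (values 1 to 8 binned by their position in the series), a sketch along these lines might work; x is the example column from the edit:
x <- c(1, 1, 8, 1, 3, 4, 1, 5, 1, 6, 2, 7, 1, NA, NA)
x <- x[!is.na(x)]                                # drop the NA padding
pos  <- seq_along(x) / length(x) * 100           # position as a % of the real series length
bins <- cut(pos, breaks = c(0, 25, 50, 75, 100))
table(bins, factor(x, levels = 1:8))             # rows: % bins, columns: values 1 to 8
As in the answer above, wrapping this in a function and applying it over the columns gives one such table per column.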

R concatenating two factors

This is making me feel dumb, but I am trying to produce a single vector/df/list/etc. (anything but a matrix) by concatenating two factors. Here's the scenario: I have a 100k-line dataset. I used the top half to predict the bottom half and vice versa using knn, so now I have 2 objects created by knn predict().
> head(pred11)
[1] 0 0 0 0 0 0
Levels: 0 1
> head(pred12)
[1] 0 1 1 0 0 0
Levels: 0 1
> class(pred11)
[1] "factor"
> class(pred12)
[1] "factor"
Here's where my problem starts:
> pred13 <- rbind(pred11, pred12)
> class(pred13)
[1] "matrix"
There are 2 problems. First, it changes the 0's and 1's to 1's and 2's, and second, it seems to create a huge matrix that eats all my memory. I've tried messing with as.numeric(), data.frame(), etc., but I can't get it to just combine the 2 50k factors into 1 100k one. Any suggestions?
#James presented one way, I'll chip in with another (shorter):
set.seed(42)
x1 <- factor(sample(0:1,10,replace=T))
x2 <- factor(sample(0:1,10,replace=T))
unlist(list(x1,x2))
# [1] 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 0 0 1
#Levels: 0 1
...This might seem a bit like magic, but unlist has special support for factors for this particular purpose! All elements in the list must be factors for this to work.
rbind will create a 2 x 50000 matrix in your case, which isn't what you want. c is the correct function to combine 2 vectors into a single longer vector. However, when you use rbind or c on a factor, it works on the underlying integer codes that map to the levels. In general you need to combine as character before re-factoring:
x1 <- factor(sample(0:1,10,replace=T))
x2 <- factor(sample(0:1,10,replace=T))
factor(c(as.character(x1),as.character(x2)))
[1] 1 1 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 0 0 0
Levels: 0 1
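A tiny illustration of that pitfall. Note the integer-code behaviour applies to older R versions; as far as I recall, since R 4.1.0 c() combines factors on their labels, so the character round-trip is mainly a safeguard for older installations:
f1 <- factor(c(0, 0, 1))
f2 <- factor(c(1, 0, 1))
c(f1, f2)
# older R:     1 1 2 2 1 2  -- the underlying integer codes (the "0/1 became 1/2" problem)
# R >= 4.1.0:  0 0 1 1 0 1, Levels: 0 1
factor(c(as.character(f1), as.character(f2)))
# [1] 0 0 1 1 0 1
# Levels: 0 1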
