How to keep values when converting character to factor in R

So I am working with this matrix (see below) in R, which lists individuals and the number of times they fought on the Left, on the Right, and in total. I would like to run an ANOVA to see the difference in the number of fights per individual. However, the names are only row names, not a column I can use, so I need to add them as a column, and that's where I have a problem:
Left Right Total
DarkMale 0 1 1
Melman 5 2 7
Polp 0 12 12
Sun 10 1 11
Kevin 0 11 11
McFly 0 30 30
Lovely 36 0 36
Aquarius 0 30 30
Kenny 0 23 23
Lethabo 16 0 16
Charlie 0 3 3
Indv=rbind("DarkMale","Melman","Polp","Sun","Kevin","McFly","Lovely","Aquarius","Kenny","Lethabo","Charlie")
tab=cbind(tab,Total,Indv)
colnames(tab)=c("Left","Right","Total","Individuals")
I did this, but it converts the rest of the table to character, which I cannot use either.
I have tried testtab=as.data.frame(tab,stringsAsFactors=FALSE)
which got rid of the quotation marks in the table but still keeps all values as character.
How can I convert the table so that it keeps these values (see below) but stores them as integer or factor, so that I can use them for ANOVA?
Left Right Total Individuals
DarkMale 0 1 1 DarkMale
Melman 5 2 7 Melman
Polp 0 12 12 Polp
Sun 10 1 11 Sun
Kevin 0 11 11 Kevin
McFly 0 30 30 McFly
Lovely 36 0 36 Lovely
Aquarius 0 30 30 Aquarius
Kenny 0 23 23 Kenny
Lethabo 16 0 16 Lethabo
Charlie 0 3 3 Charlie
Cheers

We first need to convert the matrix to a data.frame and then create a column from the row names:
d1 <- transform(as.data.frame(m1), Individuals = row.names(m1))
Using cbind on a matrix with a character element converts the whole matrix to character, because a matrix can hold only a single class. If we then convert it to a data.frame, the columns stay character or become factor depending on whether stringsAsFactors is FALSE or TRUE; the numeric values are not restored either way.
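To make the coercion concrete, here is a minimal sketch. It assumes m1 is the numeric matrix from the question, rebuilt here from the printed table; the names ids and bad, and the factor() call, are additions for illustration only.

```r
# Rebuild the matrix from the table printed in the question
ids <- c("DarkMale", "Melman", "Polp", "Sun", "Kevin", "McFly",
         "Lovely", "Aquarius", "Kenny", "Lethabo", "Charlie")
m1 <- cbind(Left  = c(0, 5, 0, 10, 0, 0, 36, 0, 0, 16, 0),
            Right = c(1, 2, 12, 1, 11, 30, 0, 30, 23, 0, 3))
rownames(m1) <- ids
m1 <- cbind(m1, Total = rowSums(m1))

# Binding a character column onto the matrix turns every cell into text
bad <- cbind(m1, Individuals = rownames(m1))
typeof(bad)

# Converting to a data frame first keeps the counts numeric;
# factor() is an assumption here, added so the names can act as a grouping variable
d1 <- transform(as.data.frame(m1), Individuals = factor(row.names(m1)))
sapply(d1, class)
```

The first call reports a character matrix, while the data frame keeps one class per column.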

Here is another way to do it. I generated a matrix to match what you are starting with, then converted it into a data frame. For a more compact solution, use transform as mentioned in akrun's solution.
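tab <- matrix(data = 1:33, nrow = 11, ncol = 3) # placeholder values standing in for the real counts
df <- as.data.frame(tab) # the numeric columns stay numeric
Indv <- c("DarkMale","Melman","Polp","Sun","Kevin","McFly","Lovely","Aquarius","Kenny","Lethabo","Charlie")
col_names <- c("Left","Right","Total","Individuals") # named col_names to avoid masking colnames()
df[4] <- Indv # add the names as a fourth column
rownames(df) <- Indv
colnames(df) <- col_names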
#
# Left Right Total Individuals
# DarkMale 1 12 23 DarkMale
# Melman 2 13 24 Melman
# Polp 3 14 25 Polp
# Sun 4 15 26 Sun
# Kevin 5 16 27 Kevin
# McFly 6 17 28 McFly
# Lovely 7 18 29 Lovely
# Aquarius 8 19 30 Aquarius
# Kenny 9 20 31 Kenny
# Lethabo 10 21 32 Lethabo
# Charlie 11 22 33 Charlie

Related

Creating Groups by Matching Values of Different Columns

I would like to create groups from a base by matching values.
I have the following data table:
now<-c(1,2,3,4,24,25,26,5,6,21,22,23)
before<-c(0,1,2,3,23,24,25,4,5,0,21,22)
after<-c(2,3,4,5,25,26,0,6,0,22,23,24)
df<-as.data.frame(cbind(now,before,after))
which reproduces the following data:
now before after
1 1 0 2
2 2 1 3
3 3 2 4
4 4 3 5
5 24 23 25
6 25 24 26
7 26 25 0
8 5 4 6
9 6 5 0
10 21 0 22
11 22 21 23
12 23 22 24
I would like to get:
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
5 5 4 6 A
6 6 5 0 A
7 21 0 22 B
8 22 21 23 B
9 23 22 24 B
10 24 23 25 B
11 25 24 26 B
12 26 25 0 B
I would like to get to this answer without using a "for" loop because the real data is too large.
Any help you could provide will be appreciated.
Here is one way. It is hard to avoid a for-loop as this is quite a tricky algorithm. The objection to them is often on the grounds of elegance rather than speed, but sometimes they are entirely appropriate.
df$group <- seq_len(nrow(df)) #assign each row to its own group
stop <- FALSE #indicates convergence
while(!stop){
  pre <- df$group #group column at start of loop
  for(i in seq_len(nrow(df))){
    matched <- which(df$before==df$now[i] | df$after==df$now[i]) #check matches in before and after columns
    group <- min(df$group[i], df$group[matched]) #identify smallest group no of matching rows
    df$group[i] <- group #set to smallest group
    df$group[matched] <- group #set to smallest group
  }
  if(identical(df$group, pre)) stop <- TRUE #stop if no change
}
df$group <- LETTERS[match(df$group, sort(unique(df$group)))] #convert groups to letters
#(just use match(...) to keep them as integers - e.g. if you have more than 26 groups)
df <- df[order(df$group, df$now),] #reorder as required
df
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
8 5 4 6 A
9 6 5 0 A
10 21 0 22 B
11 22 21 23 B
12 23 22 24 B
5 24 23 25 B
6 25 24 26 B
7 26 25 0 B
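If each row has at most one predecessor (its before value matches at most one now value, and now values are unique), the grouping can also be sketched without looping over rows. This is an alternative under those assumptions, not the method above: every row points at its predecessor, and the pointers are repeatedly shortened until each row points at the first row of its chain. The names parent and root are introduced here for illustration.

```r
now    <- c(1, 2, 3, 4, 24, 25, 26, 5, 6, 21, 22, 23)
before <- c(0, 1, 2, 3, 23, 24, 25, 4, 5, 0, 21, 22)
after  <- c(2, 3, 4, 5, 25, 26, 0, 6, 0, 22, 23, 24)
df <- data.frame(now, before, after)

# Row index of each row's predecessor; NA marks the start of a chain
parent <- match(df$before, df$now)

# Chain starts point at themselves, every other row at its predecessor
root <- ifelse(is.na(parent), seq_len(nrow(df)), parent)

# Pointer jumping: each pass lets every row skip ahead to its pointer's pointer
repeat {
  nxt <- root[root]
  if (identical(nxt, root)) break  # stop once nothing changes
  root <- nxt
}

# Relabel the chain-start rows as letters (works for up to 26 groups)
df$group <- LETTERS[match(root, sort(unique(root)))]
df[order(df$group, df$now), ]
```

Each pass is a single vectorised indexing step, and the number of passes grows with the logarithm of the longest chain rather than with the number of rows.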

R- How many times do values from one column of a dataframe appear in others? (preferably without the use of for loop)

I've been struggling with this problem for a while now, so I hope someone can help me find a more time efficient solution.
So, I have a dataframe of ID's like this:
IDinsurer<-c(rep(11,3),rep(12,2),rep(11,2),rep(13,2),11)
ClaimFileNum<-c(rep('AA',3),rep('BB',2),rep('CC',2),rep('DD',2),'EE')
IDdriver<-c(rep(11,3),rep(12,2),rep(21,2),rep(13,2),11)
IDclaimant<-c(31,11,32,12,33,11,34,13,11,11)
IDclaimdriver<-c(41,11,32,12,11,21,34,13,12,11)
dt<-data.frame(ClaimFileNum,IDinsurer,IDdriver,IDclaimant,IDclaimdriver)
ClaimFileNum IDinsurer IDdriver IDclaimant IDclaimdriver
1 AA 11 11 31 41
2 AA 11 11 11 11
3 AA 11 11 32 32
4 BB 12 12 12 12
5 BB 12 12 33 11
6 CC 11 21 11 21
7 CC 11 21 34 34
8 DD 13 13 13 13
9 DD 13 13 11 12
10 EE 11 11 11 11
What I'd like to do is count the number of different claim files (ClaimFileNum) on which each IDinsurer has appeared in other roles (i.e. not as an insurer). So for each IDinsurer I only want the count of claim files, where his ID appeared in either IDdriver, IDclaimant or IDclaimdriver while at the same time he isn't the IDinsurer of the given claimfile. For example, IDinsurer==11 appeared on all ClaimFileNums, but only on "BB" and "DD" was he not also the IDinsurer, meaning I'd want my program to return 2.
So this is how I'd like my final data frame to look like:
ClaimFileNum IDinsurer IDdriver IDclaimant IDclaimdriver N
1 AA 11 11 31 41 2
2 AA 11 11 11 11 2
3 AA 11 11 32 32 2
4 BB 12 12 12 12 1
5 BB 12 12 33 11 1
6 CC 11 21 11 21 2
7 CC 11 21 34 34 2
8 DD 13 13 13 13 0
9 DD 13 13 11 12 0
10 EE 11 11 11 11 2
So this is what I was able to come up with so far:
1)
For each of the three other roles (IDdriver, IDclaimant, IDclaimdriver) I individually calculated a new column with numbers revealing how many claim files the specific ID's appeared on IN THAT ROLE ONLY, excluding the cases of claim files, where they were also the insurers (for IDclaimdriver however it made more sense to exclude the cases where the ID matched either IDclaimant or IDdriver instead) . This is the code for the IDdriver counts:
count.duplicates <- function(dt){ #removing duplicated columns and adding a column with the frequency of duplications
x <- do.call('paste', c(dt[,c("ClaimFileNum","IDdriver")], sep = '\r'))
ox <- order(x)
rl <- rle(x[ox])
cbind(dt[ox[cumsum(rl$lengths)],,drop=FALSE],count = rl$lengths)
}
dt<-count.duplicates(dt)
dt<-data.table(dt)
dt[,same:=ifelse(dt$IDinsurer==dt$IDdriver,0,1)]
dt[,N_IDdriver:=sum(same,na.rm = T),by=list(IDdriver)]
dt[,same:=NULL]
setorder(dt,ClaimFileNum)
dt<-expandRows(dt,"count")
dt<-as.data.frame(dt)
And this is the output for my example after all three counts:
ClaimFileNum IDinsurer IDdriver IDclaimant IDclaimdriver N_IDdriver N_IDclaimant N_IDclaimdriver
1 AA 11 11 31 41 0 1 1
2 AA 11 11 11 11 0 1 1
3 AA 11 11 32 32 0 1 0
4 BB 12 12 12 12 0 0 1
5 BB 12 12 33 11 0 1 1
6 CC 11 21 11 21 1 1 0
7 CC 11 21 34 34 1 1 0
8 DD 13 13 13 13 0 0 0
9 DD 13 13 11 12 0 1 1
10 EE 11 11 11 11 0 1 1
2) I now used a for loop over an entire IDinsurer column first to check if the insurerID[i] has appeared in any of the other three roles ID's using match function. If the match was found I simply added the count from the corresponding N_ column to the overall count.
Here is my for loop:
total<-length(dt$IDinsurer)
for(i in 1:total) {
j<-match(dt$IDinsurer[i],dt$IDdriver,nomatch=0);
k<-match(dt$IDinsurer[i],dt$IDclaimant,nomatch=0);
l<-match(dt$IDinsurer[i],dt$IDclaimdriver,nomatch=0);
dt$N[i]<-ifelse(j==0,0,N_IDdriver[j])+ifelse(k==0,0,N_IDclaimant[k])+ifelse(l==0,0,N_IDclaimdriver[l]);
}
Now while this approach gives me all the information I need, it's unfortunately incredibly sluggish, especially on a dataset with over 2 million cases like the one I'll have to work with. I'm sure there must be a more elegant solution and I've been trying to figure out how to do it with some more efficient tools (like data.table) but I just can't get the grasp of it.
EDIT: I decided to try both of the answers to my question on my example and compare them with my attempt so here are the calculation times:
Thom Quinn's for loop: 0.15sec,
my for loop: 0.25 sec,
bounyball's approach: 0.35 sec.
Using my loop on a 1,042,000 row dataset took just under 10 hours.
Match is notoriously slow and not needed in this case. In fact, you already solved the problem in English; you just need to translate it to computer lingo!
So for each IDinsurer I only want the count of claim files, where his ID appeared in either IDdriver, IDclaimant or IDclaimdriver while at the same time he isn't the IDinsurer of the given claimfile
So, let's do just that. In pseudo-code:
for each unique IDinsurer:
    count when IDdriver OR IDclaimant OR IDclaimdriver AND NOT IDinsurer
In R, this is:
for(i in unique(dt$IDinsurer)){
  index <- dt$IDinsurer != i & (dt$IDdriver == i | dt$IDclaimant == i | dt$IDclaimdriver == i)
  dt[dt$IDinsurer == i, "N"] <- sum(index)
}
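The question asks for the number of different claim files, while sum(index) counts matching rows. The two agree on the example but could differ if an ID appears on several rows of the same file. Here is a hedged variant of the loop above that counts distinct ClaimFileNum values instead; the name other_role is an illustrative addition.

```r
# Example data from the question
IDinsurer     <- c(rep(11, 3), rep(12, 2), rep(11, 2), rep(13, 2), 11)
ClaimFileNum  <- c(rep("AA", 3), rep("BB", 2), rep("CC", 2), rep("DD", 2), "EE")
IDdriver      <- c(rep(11, 3), rep(12, 2), rep(21, 2), rep(13, 2), 11)
IDclaimant    <- c(31, 11, 32, 12, 33, 11, 34, 13, 11, 11)
IDclaimdriver <- c(41, 11, 32, 12, 11, 21, 34, 13, 12, 11)
dt <- data.frame(ClaimFileNum, IDinsurer, IDdriver, IDclaimant, IDclaimdriver)

dt$N <- 0L
for (i in unique(dt$IDinsurer)) {
  # Rows where i shows up in another role and is not the insurer
  other_role <- dt$IDinsurer != i &
    (dt$IDdriver == i | dt$IDclaimant == i | dt$IDclaimdriver == i)
  # Count each claim file once, however many of its rows match
  dt$N[dt$IDinsurer == i] <- length(unique(dt$ClaimFileNum[other_role]))
}
dt
```

The loop still runs once per unique insurer, so its cost depends on the number of distinct IDs rather than the number of rows.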
We can use lapply to compute the count for each ID, do.call to bind the results together, and merge to attach them to the original data.
For each unique insurer ID, we first drop any rows where that ID is the IDinsurer. Within the remaining rows, we count those where any of the other ID columns equals the ID we're working with. Then we combine the per-ID counts into a data frame and join it back onto dt using merge.
res.df <-
  do.call('rbind.data.frame',
          lapply(unique(dt$IDinsurer), function(x)
            c(
              x, sum(apply(dt[dt$IDinsurer != x, 3:5] == x, 1, function(y) any(y)))
            )
          )
  )
names(res.df) <- c('ID', 'Count')
merge(dt, res.df, by.x = 'IDinsurer', by.y = 'ID')
IDinsurer ClaimFileNum IDdriver IDclaimant IDclaimdriver Count
1 11 AA 11 31 41 2
2 11 AA 11 11 11 2
3 11 AA 11 32 32 2
4 11 CC 21 11 21 2
5 11 CC 21 34 34 2
6 11 EE 11 11 11 2
7 12 BB 12 12 12 1
8 12 BB 12 33 11 1
9 13 DD 13 13 13 0
10 13 DD 13 11 12 0

Create a new variable based on whether other variables contain a specific value in R

I have several time series variables and I want to create two new dummy variables.
Variable one: if the other variables contain a specific value, then variable one equals 1.
Variable two: if the other variables contain that specific value in consecutive years, then variable two equals 1.
My data looks like
ID score_2011 score_2012 score_2013 score_2014 score_2015
1 12 15 96 96 16
2 12 15 15 15 16
3 12 96 20 15 16
4 12 15 18 15 16
5 12 15 96 15 16
I want to get the new variables as follows:
IF score_2011~2015 contain 96 then with_96=1
IF score_2011~2015 contain consecutive 96s then back_to_back_96=1
I want the result to look like this:
ID score_2011 score_2012 score_2013 score_2014 score_2015 with_96 back_to_back_96
1 12 15 96 96 16 1 1
2 12 15 15 15 16 0 0
3 12 96 20 15 16 1 0
4 12 15 18 15 16 0 0
5 12 15 96 15 16 1 0
Thanks in advance
One option is to loop through the rows with apply. For each row, check whether any value is 96 ('x1'), run-length encode the comparison, and check whether any run of 'TRUE' values is longer than 1 ('x2'). Concatenate both, transpose the result, and assign it to two new columns.
df1[c("with_96", "back_to_back_96")] <- t(apply(df1[-1], 1, FUN= function(x) {
x1 <- as.integer(any(x==96))
rl <- rle(x==96)
x2 <- any(rl$lengths[rl$values]>1)
c(x1, x2)}))
df1
# ID score_2011 score_2012 score_2013 score_2014 score_2015 with_96 back_to_back_96
#1 1 12 15 96 96 16 1 1
#2 2 12 15 15 15 16 0 0
#3 3 12 96 20 15 16 1 0
#4 4 12 15 18 15 16 0 0
#5 5 12 15 96 15 16 1 0
Or another option is using rowSums
df1["with_96"] <- +(!!rowSums(df1[-1]==96))
df1["back_to_back_96"] <- rowSums((df1[-c(1, ncol(df1))]==96) +
(df1[-c(1,2)]==96)>1)
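The shifted-column comparison behind the rowSums option can be spelled out step by step. This is a sketch on the example data, selecting the score columns by name; the intermediate names is96 and adjacent are illustrative additions.

```r
df1 <- data.frame(ID = 1:5,
                  score_2011 = c(12, 12, 12, 12, 12),
                  score_2012 = c(15, 15, 96, 15, 15),
                  score_2013 = c(96, 15, 20, 18, 96),
                  score_2014 = c(96, 15, 15, 15, 15),
                  score_2015 = c(16, 16, 16, 16, 16))

# Logical matrix: TRUE wherever a score equals 96
is96 <- df1[grep("^score_", names(df1))] == 96

# At least one 96 anywhere in the row
df1$with_96 <- as.integer(rowSums(is96) > 0)

# Compare every year with the following year: drop the last column on one
# side and the first column on the other, then look for a TRUE/TRUE pair
adjacent <- is96[, -ncol(is96), drop = FALSE] & is96[, -1, drop = FALSE]
df1$back_to_back_96 <- as.integer(rowSums(adjacent) > 0)
df1
```

Picking the score columns by name keeps the result independent of how many helper columns have already been added to df1.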
You can do some fanciness with data.table if you are so inclined. Working on a long format, melted dataset might make the logic of some of these comparisons a bit simpler.
library(data.table)
setDT(dat)
melt(dat, id="ID")[, .(with96=any(value==96), b2b96=any(diff(which(value==96))==1)), by=ID]
# ID with96 b2b96
#1: 1 TRUE TRUE
#2: 2 FALSE FALSE
#3: 3 TRUE FALSE
#4: 4 FALSE FALSE
#5: 5 TRUE FALSE

Split data when time intervals exceed a defined value

I have a data frame of GPS locations with a column of seconds. How can I create a new column based on time gaps? For example, for this data.frame:
df <- data.frame(secs=c(1,2,3,4,5,6,7,10,11,12,13,14,20,21,22,23,24,28,29,31))
I would like to cut the data frame wherever there is a time gap of 3 or more seconds between locations, and create a new column entitled 'bouts' that gives a running tally of the sections, so that the data frame looks like this:
id secs bouts
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 10 2
9 11 2
10 12 2
11 13 2
12 14 2
13 20 3
14 21 3
15 22 3
16 23 3
17 24 3
18 28 4
19 29 4
20 31 4
Use cumsum and diff:
df$bouts <- cumsum(c(1, diff(df$secs) >= 3))
Remember that logical values get coerced to numeric values 0/1 automatically and that diff output is always one element shorter than its input.
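The intermediate values make the one-liner easier to follow. Here is a short sketch on the question's data; the names gaps and pieces, and the final split() step, are additions for illustration.

```r
df <- data.frame(secs = c(1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14,
                          20, 21, 22, 23, 24, 28, 29, 31))

# Time difference between each location and the previous one (19 values)
gaps <- diff(df$secs)

# TRUE marks the start of a new bout; prepend 1 so the first row is bout 1
df$bouts <- cumsum(c(1, gaps >= 3))

# Optionally cut the data frame into one piece per bout
pieces <- split(df, df$bouts)
sapply(pieces, nrow)
```

The leading 1 both restores the length lost by diff and sets the label of the first bout.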

Combining 2 columns into 1 column many times in a very large dataset in R

The clumsy solutions I am working on are not going to be very fast even if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for 1) at this point, although I have some code for 2) and 3).
Here is a toy example of the data structure:
n <- 10
pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
age = round(rnorm(n, mean=40, 10)), disType = rbinom(n, 1, .2),
rs123=c(1,3,1,3,3,1,1,1,3,1), rs123.1=rep(1, n), rs157=c(2,4,2,2,2,4,4,4,2,2),
rs157.1=c(4,4,4,2,4,4,4,4,2,2), rs132=c(4,4,4,4,4,4,4,4,2,2),
rs132.1=c(4,4,4,4,4,4,4,4,4,4))
Thus, there are a few columns of basic demographic info and then the rest of the columns are biallelic SNP info. Ex: rs123 is allele 1 of rs123 and rs123.1 is the second allele of rs123.
1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):
11
31
11
31
31
11
11
11
31
11
2) I need to identify the least frequent SNP value (in the above example it is 31).
3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.
Do you mean 'merge' or 'rearrange' or simply concatenate? If it is the latter then
R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""),
+ rs157=paste(pop[,7],pop[,8],sep=""),
+ rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
status sex age disType rs123 rs157 rs132
1 0 0 42 0 11 24 44
2 1 1 37 0 31 44 44
3 1 0 38 0 11 24 44
4 0 1 45 0 31 22 44
5 1 1 25 0 31 24 44
6 0 1 31 0 11 44 44
7 1 0 43 0 11 44 44
8 0 0 41 0 11 44 44
9 1 1 57 0 31 22 24
10 1 1 40 0 11 22 24
and now you can do counts and whatnot on pop2:
R> sapply(pop2[,5:7], table)
$rs123
11 31
6 4
$rs157
22 24 44
3 3 4
$rs132
24 44
2 8
R>
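Steps 2) and 3) can be sketched on top of the pasted columns. The following assumes every second-allele column is named like the first with a ".1" suffix, as in the toy data; recode_rare is a hypothetical helper, and ties for the least frequent value are resolved by taking the first one in table order.

```r
# SNP columns from the toy example (demographic columns left out)
pop <- data.frame(
  rs123 = c(1, 3, 1, 3, 3, 1, 1, 1, 3, 1), rs123.1 = rep(1, 10),
  rs157 = c(2, 4, 2, 2, 2, 4, 4, 4, 2, 2), rs157.1 = c(4, 4, 4, 2, 4, 4, 4, 4, 2, 2),
  rs132 = c(4, 4, 4, 4, 4, 4, 4, 4, 2, 2), rs132.1 = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4)
)

# First-allele columns are the ones without the ".1" suffix
snps <- grep("\\.1$", names(pop), value = TRUE, invert = TRUE)

# Hypothetical helper: paste the two alleles, find the least frequent
# combination, and code it as 1 (everything else as 0)
recode_rare <- function(snp) {
  genotype <- paste0(pop[[snp]], pop[[paste0(snp, ".1")]])
  counts <- table(genotype)
  rare <- names(counts)[which.min(counts)]
  as.integer(genotype == rare)
}

coded <- as.data.frame(lapply(setNames(snps, snps), recode_rare))
coded
```

For a dataset of the size mentioned in the question this runs one table() per SNP; whether that is fast enough would need to be measured on the real data.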
