I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 1 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1) - 1) / 2   # number of column pairs after IID
for (i in 1:pairs) {
  # one shared set of reference values across both columns of the pair
  refs <- unique(c(df1[, 2*i], df1[, 2*i + 1]))
  df1[, 2*i]     <- match(df1[, 2*i], refs)
  df1[, 2*i + 1] <- match(df1[, 2*i + 1], refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
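Since the full data set has many more column pairs, the same idea can be wrapped in a small helper so it scales without editing the loop bounds by hand. A sketch (the name recode_pairs is just illustrative, not an existing function):
# Sketch: recode every consecutive column pair after the ID column
recode_pairs <- function(dat, id_col = 1) {
  n_pairs <- (ncol(dat) - id_col) / 2
  for (i in seq_len(n_pairs)) {
    cols <- id_col + c(2 * i - 1, 2 * i)     # the two columns of pair i
    refs <- unique(unlist(dat[cols]))        # shared lookup across the pair
    dat[cols] <- lapply(dat[cols], match, table = refs)
  }
  dat
}
df2 <- recode_pairs(df1)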
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(setDT(df1), id.vars = "IID")
df_m[, id := .GRP, by = .(sub("\\..*$", "", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 4 5 9 8
This should also extend easily to more groups of columns, i.e. if you had an L3 pair it would work with that as well, since the grouping key is taken from the column-name prefix (see the sketch below).
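For instance, if df1 also contained a hypothetical third pair L3.1/L3.2 (the values below are made up purely for illustration), the same three lines would assign a shared set of ids to that pair too:
library(data.table)
# hypothetical extra pair, for illustration only
df1$L3.1 <- c("10A1", "11A1", "10A2", "12A1")
df1$L3.2 <- c("11A1", "10A1", "12A1", "10A2")
df_m <- melt(setDT(df1), id.vars = "IID")
df_m[, id := .GRP, by = .(sub("\\..*$", "", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")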
I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, after a group_by and summarize I have a data frame that I want to do some further manipulation on by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Where I inserted the fraction into the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table to get a total for each run of df:
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
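One hedged caveat: total[df$run, 'total'] works here because run is a factor whose underlying integer codes (1, 2) happen to match the row numbers of total. If the run labels were something like 5 and 7, or total were sorted differently, that positional indexing would silently pick the wrong rows. A match()-based lookup is a safer sketch of the same idea:
# look each run's total up by value rather than by row position
df$percent <- df$sum / total$total[match(df$run, total$run)]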
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
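If the run labels are not simply 1..n in order, a named vector indexed by label is a slightly safer variant of the same idea (a sketch):
# named divisor vector, indexed by run label instead of factor code
divisor <- c("1" = 45, "2" = 47)
df$percent <- df$sum / divisor[as.character(df$run)]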
First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then call mutate (note that the compound-assignment pipe %<>% comes from magrittr, not dplyr):
library(dplyr)
library(magrittr)
df2 %<>% mutate(percent = sum / total)
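If you would rather stay entirely within dplyr, the merge and the mutate can be written as one pipeline (a sketch using left_join; the intermediate df2 is then unnecessary):
library(dplyr)
df <- df %>%
  left_join(total, by = "run") %>%
  mutate(percent = sum / total) %>%
  select(-total)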
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766
I have a data frame like this:
id no age
1 1 7 23
2 1 2 23
3 2 1 25
4 2 4 25
5 3 6 23
6 3 1 23
and I hope to aggregate the data frame by id into a form like this (just sum no for rows that share the same id, but keep age there):
id no age
1 1 9 23
2 2 5 25
3 3 7 23
How to achieve this using R?
Assuming that your data frame is named df.
aggregate(no~id+age, df, sum)
# id age no
# 1 1 23 9
# 2 3 23 7
# 3 2 25 5
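If you want the column order and row order of your desired output, a small follow-up step reorders the aggregate result (sketch):
res <- aggregate(no ~ id + age, df, sum)
res[order(res$id), c("id", "no", "age")]   # sorted by id, columns back to id, no, age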
Even better, data.table:
library(data.table)
# convert your object to a data.table (by reference) to unlock data.table syntax
setDT(df)
df[ , .(sum_no = sum(no), unq_age = unique(age)), by = id]
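One hedged caveat: unique(age) assumes age really is constant within each id; if it ever were not, that group would return more than one row. Taking the first value makes the assumption explicit (a sketch using data.table's first()):
# first() keeps exactly one age per id even if age were not constant
df[ , .(sum_no = sum(no), age = first(age)), by = id]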
Alternatively, you could use ddply from the plyr package:
require(plyr)
ddply(df,.(id,age),summarise,no = sum(no))
In this particular example the results are identical. However, this is not always the case; the difference between the two functions is outlined here. Both functions have their uses and are worth exploring, which is why I felt this alternative should be mentioned.
I have a data frame in R similar to the one that follows. My real ’df’ data frame is much bigger than this one, but I don't want to confuse anybody, so I have simplified it as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’ (and regarding only those observations which have number ’2’ in column ’id’) we can say that number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset (and then collect these values into a data frame). I know it is not a difficult task, but the PROBLEM is that I'm going to have to change the input ’df’ data frame on a regular basis, and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, "automatic" way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '1's in column 'a' for id group '3' (the table's dimensions are ordered id, variable, value)
you could just do
> dftab[3,'a',1]
[1] 4
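Since you mentioned collecting the counts into a data frame, the contingency table also converts directly into long format (a small follow-up sketch):
# one row per (id, variable, value) combination with its count
counts_df <- as.data.frame(dftab, responseName = "count")
head(counts_df)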
A combination of tapply and apply can create the data you want:
tapply(df$id, df$id, function(x) apply(df[df$id %in% x, -1], 2, table))
However, when a group doesn't contain all the possible values, as with column a in id group 1 (which has no 2s), the result for that id group will be a list rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
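A hedged workaround for the ragged id group 1 above: converting each column to a factor with a common set of levels before tabulating forces every group to return tables of the same length, so each id group simplifies to a matrix (a sketch; lvls is just the set of values seen anywhere in the data):
# all value columns share the same levels, so every table has the same length
lvls <- sort(unique(unlist(df[-1])))
tapply(df$id, df$id,
       function(x) sapply(df[df$id %in% x, -1],
                          function(col) table(factor(col, levels = lvls))))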
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
ColTables <- function(df) {
counts <- list()
for(a in names(df)[names(df) != "id"]) {
counts[[a]] <- table(df[a])
}
return(counts)
}
library(plyr)
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your data frame:
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
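Since the number of value columns may change over time, one hedged sketch to run that helper for every non-id column at once:
# one frequency table per value column, collected in a named list
value_cols <- setdiff(names(df), "id")
all_freqs <- lapply(value_cols, function(col) frequency(df, c(col, "id")))
names(all_freqs) <- value_cols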
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
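For example, to pull the counts for column a within the first id group from that result (a usage sketch):
res <- by(df, df$id, function(x) lapply(x[, -1], table))
res[[1]]$a   # value 1 appears 3 times, value 3 appears 7 times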