Generating unique ids and group ids using dplyr and concatenation - r

I have a problem that I suspect has arisen from a dplyr update combined with my hacky code. Given a data frame in which every row is duplicated, I want to assign each row a unique id by combining the entries of two columns with either "_" or "a_" in the middle. I also want to assign a group id by combining the entries of one column with either "" or "a". Because these formats are important for lining up with another data frame, I can't use the solutions based on interaction and factor that I've seen in other posts.
So I want to go from this:
Generation Identity
1 1 X
2 1 Y
3 1 Z
4 2 X
5 2 Y
6 2 Z
7 3 X
8 3 Y
9 3 Z
10 1 X
11 1 Y
12 1 Z
13 2 X
14 2 Y
15 2 Z
16 3 X
17 3 Y
18 3 Z
to this:
Generation Identity Unique_id Group_id
1 1 X 1_X X
2 1 Y 1_Y Y
3 1 Z 1_Z Z
4 2 X 2_X X
5 2 Y 2_Y Y
6 2 Z 2_Z Z
7 3 X 3_X X
8 3 Y 3_Y Y
9 3 Z 3_Z Z
10 1 X 1a_X Xa
11 1 Y 1a_Y Ya
12 1 Z 1a_Z Za
13 2 X 2a_X Xa
14 2 Y 2a_Y Ya
15 2 Z 2a_Z Za
16 3 X 3a_X Xa
17 3 Y 3a_Y Ya
18 3 Z 3a_Z Za
The minimal example below is based on code that previously worked for me and others in setting the unique id but that now causes RStudio to crash with a seg fault (Exception Type: EXC_BAD_ACCESS (SIGSEGV)). When I call a function containing this code it generates the message
Error in match(vector, df$Unique_id) : 'translateCharUTF8' must be
called on a CHARSXP
which I've read can be symptomatic of memory issues.
library(dplyr)
dff <- data.frame(Generation = rep(1:3, each = 3),
Identity = rep(LETTERS[24:26], times = 3))
dff <- rbind(dff, dff) # duplicate rows
dff <- group_by_(dff, ~Generation, ~Identity) %>%
mutate(Unique_id = c(paste0(Identity[1], "_", Generation[1]), paste0(Identity[1], "a", "_", Generation[1]))) %>%
ungroup
I think the problem is related to an update of dplyr (I'm using the latest release versions of RStudio and all packages, on OSX Sierra). In any case, my solution above is something of a hack. I'd very much appreciate suggestions for improved code, preferably using either base R or dplyr (since the code is part of a package that currently depends on dplyr).

Here is how you can approach the problem:
First find the duplicates of your data. I called my data A
dup=duplicated(A)
Then add a counter column:
A$count=1:nrow(A)
n=ncol(A) # index of the count column just added
Now build the two new columns and cbind them to the original data frame:
B=data.frame(t(apply(A,1,function(x)
if(dup[as.numeric(x[n])]) c(paste0(x["Identity"],"a"),paste(x[-n],collapse="a_"))
else c(x["Identity"],paste(x[-n],collapse="_")))))
`names<-`(cbind(A[-n],B),c(names(A)[-n],"Group_ID","Unique_ID"))
Generation Identity Group_ID Unique_ID
1 1 X X 1_X
2 1 Y Y 1_Y
3 1 Z Z 1_Z
4 2 X X 2_X
5 2 Y Y 2_Y
6 2 Z Z 2_Z
7 3 X X 3_X
8 3 Y Y 3_Y
9 3 Z Z 3_Z
10 1 X Xa 1a_X
11 1 Y Ya 1a_Y
12 1 Z Za 1a_Z
13 2 X Xa 2a_X
14 2 Y Ya 2a_Y
15 2 Z Za 2a_Z
16 3 X Xa 3a_X
17 3 Y Ya 3a_Y
18 3 Z Za 3a_Z

Here's my amended version of Onyambu's solution, which refers to columns by name rather than number (and so can handle data frames that have additional columns):
dup <- duplicated(dff) # identify duplicates
dff$count <- 1:nrow(dff) # add count column to the dataframe
# create a new dataframe containing the unique and group ids:
B <- data.frame(t(apply(dff, 1, function(x)
if(dup[as.numeric(x["count"])]) c(paste0(x["Identity"], "a"),
paste(x["Identity"], x["Generation"], sep = "a_"))
else c(x["Identity"], paste(x["Identity"], x["Generation"], sep = "_")))))
# combine the dataframes:
colnames(B) <- c("Group_id", "Unique_id")
dff <- cbind(dff[-ncol(dff)], B) # drop the count column, then append the ids
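For comparison, here is a vectorised base R sketch of the same idea that avoids apply entirely. It assumes, as in the question, that the second copy of each Generation/Identity pair is the one that should get the "a" marker, and it puts Generation first to match the target table in the question. The helper names dup and suffix are only for illustration.

```r
# Rebuild the example data from the question
dff <- data.frame(Generation = rep(1:3, each = 3),
                  Identity = rep(LETTERS[24:26], times = 3),
                  stringsAsFactors = FALSE)
dff <- rbind(dff, dff)  # duplicate rows

# TRUE for the second occurrence of each Generation/Identity pair
dup <- duplicated(dff[c("Generation", "Identity")])

# "a" for duplicated rows, "" otherwise
suffix <- ifelse(dup, "a", "")

# Build both ids with vectorised paste0
dff$Unique_id <- paste0(dff$Generation, suffix, "_", dff$Identity)
dff$Group_id  <- paste0(dff$Identity, suffix)
```

Computing dup before the new columns are added matters here, because duplicated() compares whole rows of whatever it is given.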

Related

Is there a methodology to assign integer values to factors in R

I am quite new to R, but was wondering if there is a specific way to group/analyze integer values from my data frame i.e.,
Sample X : int 1 2 3 4 5
Sample Y : int 6 7 8 9 10
Sample Z : int 11 12 13 14 15
and assign these to my factor variable which has the corresponding number of levels (5 in this example) which are called in this example lvl 1, lvl 2, lvl 3, lvl 4, lvl 5. The goal is to be able to graph the observations at each level, for example lvl 1 had the observations 1, 6, and 11/ lvl 2 had 2, 7, and 12, etc.
I've found no clean way to do this. Other attempts have included individually typing out the name of each sample and manually linking this to the factor levels, but that has not gone well.
Any advice would be appreciated!
If I understood correctly, you want each x, y and z observation to be associated with a level, and then to plot by level.
library(ggplot2)
library(reshape2)
df = data.frame(x = 1:5, y = 6:10, z = 11:15)
df$level = factor(paste0("lvl",1:5))
df
# x y z level
# 1 1 6 11 lvl1
# 2 2 7 12 lvl2
# 3 3 8 13 lvl3
# 4 4 9 14 lvl4
# 5 5 10 15 lvl5
It's easier to plot long-format data with the ggplot2 package. I use reshape2::melt here, but you could get an equivalent result with tidyr::pivot_longer.
df <- reshape2::melt(df, id.vars = "level")
df
level variable value
1 lvl1 x 1
2 lvl2 x 2
3 lvl3 x 3
4 lvl4 x 4
5 lvl5 x 5
6 lvl1 y 6
7 lvl2 y 7
8 lvl3 y 8
9 lvl4 y 9
10 lvl5 y 10
11 lvl1 z 11
12 lvl2 z 12
13 lvl3 z 13
14 lvl4 z 14
15 lvl5 z 15
Finally, you can plot. Let's say you want points for each level:
ggplot(df, aes(x = level, y = value)) + geom_point()
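As a rough sketch of the tidyr route mentioned above, the same long format can be produced with pivot_longer. This assumes tidyr (version 1.0.0 or later) is installed. The column names variable and value are chosen here only to mirror the melt output, and the rows come out ordered by level rather than by variable.

```r
library(tidyr)

# Same wide data as in the answer
df <- data.frame(x = 1:5, y = 6:10, z = 11:15)
df$level <- factor(paste0("lvl", 1:5))

# Gather x, y and z into name/value pairs, keeping level as the id column
long <- pivot_longer(df,
                     cols = c("x", "y", "z"),
                     names_to = "variable",
                     values_to = "value")
```

The resulting long data frame can be passed to the same ggplot call as the melted one.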

How to calculate the sum of each column on merging two data frames without specify the column name

I have two data frames as below:
> d1
v x y
1 X 1 5
2 X 2 6
3 X 3 7
4 X 4 8
> d2
v x y
1 X 1 5
2 X 2 6
3 X 3 7
4 X 4 8
I want to merge them and sum each x and y. Below command works fine for me:
> ddply(merge(d1,d2, all=TRUE), .(v), summarise, x=sum(x), y=sum(y))
v x y
1 X 10 26
In the command above, I have to specify the column names x and y. I am looking for a way to calculate the sums without specifying each column name. Because I have a data frame with more than twenty columns, I don't want to list each of them. Is there an automatic way to calculate the sum for all columns?
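One possible approach, sketched here in base R rather than plyr, is the formula interface of aggregate, where the dot stands for every column other than the grouping column. The data below is rebuilt from the printed d1 and d2. The names merged and totals are just for illustration.

```r
# Rebuild the two example data frames from the question
d1 <- data.frame(v = "X", x = 1:4, y = 5:8)
d2 <- d1

# Same merge as in the question; identical rows collapse into one
merged <- merge(d1, d2, all = TRUE)

# Sum every remaining column within each value of v
totals <- aggregate(. ~ v, data = merged, FUN = sum)
totals
```

This gives x = 10 and y = 26 for the example, matching the ddply output. Note that the formula method of aggregate drops rows containing NA by default, which may matter for real data.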

deleting first row based on column variable

How do I delete the first row of each new variable? For example, here is some data:
m <- c("a","a","a","a","a","b","b","b","b","b")
n <- c('x','y','x','y','x','y',"x","y",'x',"y")
o <- c(1:10)
z <- data.frame(m,n,o)
I want to delete the first entry for a and b in column m. I have a very large data frame so I want to do this based on the change from a to b and so on.
Here is what I want the data frame to look like.
m n o
1 a y 2
2 a x 3
3 a y 4
4 a x 5
5 b x 7
6 b y 8
7 b x 9
8 b y 10
Thanks.
Just use duplicated:
z[duplicated(z$m),]
# m n o
#2 a y 2
#3 a x 3
#4 a y 4
#5 a x 5
#7 b x 7
#8 b y 8
#9 b x 9
#10 b y 10
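Why does this work? duplicated returns FALSE for the first occurrence of each value and TRUE for every later one, so indexing with it drops the first row of each group. Consider: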
duplicated("a")
#[1] FALSE
duplicated(c("a","a"))
#[1] FALSE TRUE
data.table is often a good choice for large datasets in R. setDT converts the z data frame to a data.table by reference. Then group by m and remove the first row of each group.
library('data.table')
setDT(z)[, .SD[-1], by = "m"]
Using group_by and row_number from package dplyr:
library(dplyr)
z %>%
group_by(m) %>%
filter(row_number() != 1) # drop the first row of each group, whatever the order of o

understanding apply and outer function in R

Suppose I have data that looks like this:
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other: if an ID has changed its value of A over the period given by B (which runs from 1 to 4), it goes into data frame K, and if it hasn't, it goes into data frame L.
so in this data set K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of nested loops and if/else statements, it can be solved like the following:
K <- L <- df[0, ] # empty data frames with the same columns as df
for (id in unique(df$ID)) {
  rows <- df[df$ID == id, ]
  m <- 0 # number of changes in A within this ID
  for (j in seq_len(nrow(rows) - 1)) {
    if (rows$A[j] != rows$A[j + 1]) m <- m + 1
  }
  if (m == 0) L <- rbind(L, rows) else K <- rbind(K, rows)
}
I have read in some posts that nested loops in R can be replaced by the apply and outer functions. Could someone help me understand how they can be used in a case like this?
So basically you don't need a loop with conditions here. All you need is to check, for each ID, whether A varies over the values of B. One way is to convert A to numeric (I'm assuming it's a factor in your real data set), take the variance within each ID, and negate it with !, which turns a variance of 0 into TRUE and anything else into FALSE. If A is not a factor, you can use FUN = function(x) length(unique(x)) within ave instead. In base R, ave handles this kind of grouped computation, for example:
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or, if A is character rather than a factor:
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
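To make the steps above reproducible, here is a self-contained sketch that rebuilds the example data and applies the length(unique(x)) variant through ave. It assumes A is stored as character and goes through as.integer(factor(...)) so that ave returns plain integers. The names n_unique, constant and parts are only for illustration.

```r
# Rebuild the example data from the question
df <- data.frame(ID = rep(1:3, each = 4),
                 A  = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"),
                 B  = rep(1:4, times = 3),
                 C  = c(10, 10, 15, 12, 15, 13, 13, 13, 16, 18, 19, 10),
                 stringsAsFactors = FALSE)

# Number of distinct A values within each ID, repeated for every row of that ID
n_unique <- ave(as.integer(factor(df$A)), df$ID,
                FUN = function(x) length(unique(x)))

# TRUE where A never changes within the ID
constant <- n_unique == 1

# Split into the two data frames asked for in the question
parts <- split(df, constant)
K <- parts[["FALSE"]]  # IDs whose A changed
L <- parts[["TRUE"]]   # IDs whose A stayed the same
```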
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(as.numeric(A)), by = ID] # as.numeric assumes A is a factor, as above
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(indx = !var(as.numeric(A))) %>% # as.numeric assumes A is a factor, as above
ungroup() %>%
split(., .$indx)
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In plainer English: go through df$A grouped by df$ID, and apply the function to the values of df$A within each group (i.e. the x in the embedded function). If the number of unique values is more than 1, the result is "K", otherwise it's "L".
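The tapply result has one label per ID, so it still has to be mapped back onto the rows before splitting. A minimal sketch of that last step, assuming the same example data as in the question (the names label, row_label and groups are only for illustration):

```r
# Rebuild the example data from the question
df <- data.frame(ID = rep(1:3, each = 4),
                 A  = c("X", "X", "Z", "Y", "Y", "X", "X", "Y", "Y", "Y", "Y", "Y"),
                 B  = rep(1:4, times = 3),
                 C  = c(10, 10, 15, 12, 15, 13, 13, 13, 16, 18, 19, 10),
                 stringsAsFactors = FALSE)

# One label per ID: "K" if A takes more than one value, "L" otherwise
label <- tapply(df$A, df$ID,
                function(x) if (length(unique(x)) > 1) "K" else "L")

# Look up each row's label by its ID (the tapply result is named by ID)
row_label <- as.vector(label[as.character(df$ID)])

# Split the rows into a list with elements K and L
groups <- split(df, row_label)
```

Here groups$K holds the rows for IDs 1 and 2, and groups$L holds the rows for ID 3.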
We can do this using data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1 and create a column 'ind' from the result. We can then split the dataset on that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1', i.e. 'ID' and 'A'. Double negating (!!) the output converts the '0' counts to 'FALSE' and all other frequencies to 'TRUE'. rowSums of that logical table gives the number of distinct 'A' values per 'ID', and comparing it with 1 gives 'indx'. We match the 'ID' column in 'df1' with the names of 'indx', use that to replace each 'ID' with TRUE/FALSE, and split the dataset with that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
If we need to get individual datasets on the global environment, change the names of the list elements to the object names we wanted and use list2env (not recommended though)
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)

In R, is there a simple way to convert two data frame columns for a formula with a grouping factor?

I have two vectors -- really columns in a data table -- and I want to compare the means with wilcox_test from coin.
With wilcox.test or t.test I can just do this:
wilcox.test(data$x,data$y)
But I need to use wilcox_test, which requires a formula like this:
wilcox_test(outcome ~ grp, data=myData)
I came up with this solution, which works:
outcome <- c(data$x,data$y)
grp <- c(c(rep(0, length(data$x))),c(rep(1, length(data$y))))
grp <- as.factor(grp)
wilcox_test(outcome ~ grp)
But I'm wondering - is there a simpler way to do this? Or is this the best way?
You can use function melt from the package reshape2:
> library(reshape2)
> melt(data.frame(x=1:10,y=11:20))
Using as id variables
variable value
1 x 1
2 x 2
3 x 3
4 x 4
5 x 5
6 x 6
7 x 7
8 x 8
9 x 9
10 x 10
11 y 11
12 y 12
13 y 13
14 y 14
15 y 15
16 y 16
17 y 17
18 y 18
19 y 19
20 y 20
And then use wilcox_test(value ~ variable,data=melt(data.frame(x=1:10,y=11:20)))
You can use stack. Here is an example
dat <- data.frame(x = 1:3, y = 4:6)
# x y
# 1 1 4
# 2 2 5
# 3 3 6
dat2 <- stack(dat)
# values ind
# 1 1 x
# 2 2 x
# 3 3 x
# 4 4 y
# 5 5 y
# 6 6 y
Now, the outcome variable is in column values and the grouping variable is in column ind.
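To show how the stacked data plugs into a formula, here is a small sketch using wilcox.test from base R's stats package as a stand-in. I'm assuming that wilcox_test from coin accepts the same values ~ ind formula, since ind is already a factor after stack, but I have not run it here.

```r
# Two columns to compare
dat <- data.frame(x = 1:10, y = 11:20)

# Long format: 'values' holds the outcome, 'ind' is a factor naming the source column
long <- stack(dat)

# Formula interface of the base R test; with coin loaded, the analogous call
# would be wilcox_test(values ~ ind, data = long) (assumed, not run here)
res <- wilcox.test(values ~ ind, data = long)
res$p.value
```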
