R: append columns in different ways

I have a dataset with multiple columns:
Value1 Value3 Annotation Value4 Value2
1 4 s 9 4
2 5 t 0 4
3 6 q 4 4
The order of the values and the position of Annotation is unknown.
Now I'm trying to append multiple columns while keeping the Annotation correct. The result should be a 2-column matrix,
e.g. when I try to append Value1 and Value3:
Annotation Valuelist
s 1
t 2
q 3
s 4
t 5
q 6
I found the method append() but I can't figure out how to make sure the annotation is correct.

The stack function might come in handy for what you want to do in base R:
cbind(mydf["Annotation"], stack(mydf[c("Value1", "Value3")])["values"])
# Annotation values
# 1 s 1
# 2 t 2
# 3 q 3
# 4 s 4
# 5 t 5
# 6 q 6
stack normally creates two columns, "values" and "ind" (in that order), so we select just the first to match what you describe.
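For reference, here is what stack() returns on its own for the same two columns (a quick illustration, assuming the sample data above is stored in mydf):
stack(mydf[c("Value1", "Value3")])
#   values    ind
# 1      1 Value1
# 2      2 Value1
# 3      3 Value1
# 4      4 Value3
# 5      5 Value3
# 6      6 Value3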
A similar approach can be taken with melt from "reshape2":
library(reshape2)
melt(mydf, id.vars = "Annotation",
     measure.vars = c("Value1", "Value3"))[c("Annotation", "value")]
# Annotation value
# 1 s 1
# 2 t 2
# 3 q 3
# 4 s 4
# 5 t 5
# 6 q 6

You could construct this with something like:
columns <- c("Value1", "Value3")
data.frame(Annotation = rep(dat$Annotation, length(columns)),
           Valuelist = as.vector(as.matrix(dat[columns])))
# Annotation Valuelist
# 1 s 1
# 2 t 2
# 3 q 3
# 4 s 4
# 5 t 5
# 6 q 6
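If you need to do this for different sets of columns repeatedly, the same idea can be wrapped in a small helper (a sketch with a hypothetical name append_values, assuming the annotation column is literally called "Annotation" and the sample data is stored in mydf):
append_values <- function(df, value_cols, annot_col = "Annotation") {
  # repeat the annotation once per requested column and stack the values
  data.frame(Annotation = rep(df[[annot_col]], length(value_cols)),
             Valuelist = unlist(df[value_cols], use.names = FALSE))
}
append_values(mydf, c("Value1", "Value3"))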

Related

R - How to create multiple datasets based on levels of factor in multiple columns?

I'm kinda new to R and still looking for ways to make my code more elegant. I want to create multiple datasets in a more efficient way, each based on a particular value over different columns.
This is my dataset:
df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3),
                 B = c(4,4,2,3,4,2,1,5,2,2),
                 C = c(3,3,3,3,4,2,5,1,2,3),
                 D = c(1,2,5,5,5,4,5,5,2,3),
                 E = c(1,4,2,3,4,2,5,1,2,3),
                 dummy1 = c("yes","yes","no","no","no","no","yes","no","yes","yes"),
                 dummy2 = c("high","low","low","low","high","high","high","low","low","high"))
And I need each column to be a factor:
df[colnames(df)] <- lapply(df[colnames(df)], factor)
Now, what I want to obtain is one dataframe called "Likert_rank_yes" that contains all the observations that in the column "dummy1" have "yes", one dataframe called "Likert_rank_no" that contains all the observations that in the column "dummy1" have "no", one dataframe called "Likert_rank_high" that contains all the observations that in the column "dummy2" have "high" and so on for all my other dummies.
I want to loop or streamline the process in some way, so that there are few commands to run to get all the datasets I need.
The first two dataframes should look something like this:
Dataframe called "Likert_rank_yes" that contains all the observations that in the column "dummy1" have "yes"
Dataframe called "Likert_rank_no" that contains all the observations that in the column "dummy1" have "no"
I have to do this with several dummies with multiple levels and would like to automate/loop the process or make it more efficient, so that I don't have to subset and rename every dataframe for each dummy level. Ideally I would also need to drop the last column in each df created (the one containing the dummy considered).
I tried splitting as below, but it seems it is not possible with multiple grouping columns; I just get 4 data frames (yes AND high observations, yes AND low obs, no AND high obs, etc.):
Splitting with a list of columns doesn't work
list_df <- split(df[c(1:5)], list(df$dummy1,df$dummy2), sep=".")
Can you help? Thanks in advance!
You need two lapplys:
vals <- colnames(df)[1:5]
dummies <- colnames(df)[-(1:5)]
step1 <- lapply(dummies, function(x) df[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
step2
# $dummy1
# $dummy1$no
# A B C D E dummy1
# 3 2 2 3 5 2 no
# 4 3 3 3 5 3 no
# 5 4 4 4 5 4 no
# 6 5 2 2 4 2 no
# 8 1 5 1 5 1 no
#
# $dummy1$yes
# A B C D E dummy1
# 1 1 4 3 1 1 yes
# 2 2 4 3 2 4 yes
# 7 1 1 5 5 5 yes
# 9 2 2 2 2 2 yes
# 10 3 2 3 3 3 yes
#
#
# $dummy2
# $dummy2$high
# A B C D E dummy2
# 1 1 4 3 1 1 high
# 5 4 4 4 5 4 high
# 6 5 2 2 4 2 high
# 7 1 1 5 5 5 high
# 10 3 2 3 3 3 high
#
# $dummy2$low
# A B C D E dummy2
# 2 2 4 3 2 4 low
# 3 2 2 3 5 2 low
# 4 3 3 3 5 3 low
# 8 1 5 1 5 1 low
# 9 2 2 2 2 2 low
For the first data set ("dummy1" and "no") use step2$dummy1$no or step2[[1]][[1]] or step2[["dummy1"]][["no"]].
For programming purposes it is usually better to keep the list intact since it makes it simple to write code that processes all of the data frames in the list without having to specify them individually.
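For example, since you also wanted to drop the dummy column from each subset, you can do that for every data frame in one pass while the list is still intact (a small sketch, assuming step2 was built as above and the dummy is always column 6):
step3 <- lapply(step2, function(l) lapply(l, function(d) d[, -6]))
step3$dummy1$no   # same rows as before, but without the dummy1 column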
If you do want them as separate data frames in your workspace, you are very close:
tbls <- unlist(step2, recursive=FALSE)
list2env(tbls, envir=.GlobalEnv)
ls()
# [1] "df" "dummies" "dummy1.no" "dummy1.yes" "dummy2.high" "dummy2.low" "step1" "step2" "tbls" "vals"
This will create the same set of tables.

Recoding specific column values using reference list

My dataframe looks like this
data = data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
                  Gender = c('Male','Female','Female','Female','Male','Female','Male','Male','Female','Female'))
And I have a reference list that looks like this -
ref=list(Male=1,Female=2)
I'd like to replace values in the Gender column using this reference list, without adding a new column to my dataframe.
Here's my attempt
do.call(dplyr::recode, c(list(data), ref))
Which gives me the following error -
no applicable method for 'recode' applied to an object of class
"data.frame"
Any inputs would be greatly appreciated
An option would be to do a left_join after stacking the 'ref' list into a two-column data.frame:
library(dplyr)
left_join(data, stack(ref), by = c('Gender' = 'ind')) %>%
  select(ID, Gender = values)
A base R approach would be
unname(unlist(ref)[as.character(data$Gender)])
#[1] 1 2 2 2 1 2 1 1 2 2
In base R:
data$Gender = sapply(data$Gender, function(x) ref[[x]])
You can use factor, i.e.
factor(data$Gender, levels = names(ref), labels = ref)
#[1] 1 2 2 2 1 2 1 1 2 2
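Note that factor() returns a factor rather than numbers; if you need an actual numeric vector you can convert it afterwards (a small follow-up sketch, not part of the original answer):
as.numeric(as.character(factor(data$Gender, levels = names(ref), labels = ref)))
#[1] 1 2 2 2 1 2 1 1 2 2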
You can unlist ref to give you a named vector of codes, and then index this with your data:
transform(data,Gender=unlist(ref)[as.character(Gender)])
ID Gender
1 1 1
2 2 2
3 3 2
4 4 2
5 5 1
6 6 2
7 7 1
8 8 1
9 9 2
10 10 2
Surprisingly, that one works as well:
data$Gender <- ref[as.character(data$Gender)]
#> data
# ID Gender
# 1 1 1
# 2 2 2
# 3 3 2
# 4 4 2
# 5 5 1
# 6 6 2
# 7 7 1
# 8 8 1
# 9 9 2
# 10 10 2
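One caveat: single-bracket indexing of a list returns a list, so this actually stores a list column. If you want a plain numeric vector, you can flatten it afterwards (a small follow-up sketch, assuming the assignment above has been run):
data$Gender <- unname(unlist(data$Gender))
data$Gender
# [1] 1 2 2 2 1 2 1 1 2 2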

reshaping data with time represented as spells

I have a dataset in which time is represented as spells (i.e. from time 1 to time 2), like this:
d <- data.frame(id = c("A","A","B","B","C","C"),
                t1 = c(1,3,1,3,1,3),
                t2 = c(2,4,2,4,2,4),
                value = 1:6)
I want to reshape this into a panel dataset, i.e. one row for each unit and time period, like this:
result <- data.frame(id = c("A","A","A","A","B","B","B","B","C","C","C","C"),
                     t = c(1:4,1:4,1:4),
                     value = c(1,1,2,2,3,3,4,4,5,5,6,6))
I am attempting to do this with tidyr and gather but not getting the desired result. I am trying something like this which is clearly wrong:
gather(d, 't1', 't2', key=t)
In the actual dataset the spells are irregular.
You were almost there.
Code
library(dplyr)
library(tidyr)

d %>%
  # Gather the needed variables. Explanation:
  # t_type: what do we call the column that will hold the former
  #         variable names?
  # t:      what do we call the column that will hold the values of
  #         those variables?
  # -id, -value: which columns should stay as they are and NOT be
  #         gathered under t_type (key) and t (value)?
  gather(t_type, t, -id, -value) %>%
  # Select the right columns in the right order.
  # Watch out: we did not select t_type, so it gets dropped.
  select(id, t, value) %>%
  # Arrange / sort the data by the following columns.
  # For a descending order put a "-" in front of the column name.
  arrange(id, t)
Result
id t value
1 A 1 1
2 A 2 1
3 A 3 2
4 A 4 2
5 B 1 3
6 B 2 3
7 B 3 4
8 B 4 4
9 C 1 5
10 C 2 5
11 C 3 6
12 C 4 6
So, the goal is to melt the t1 and t2 columns and to drop the key column that will appear as a result. There are a couple of options. Base R's reshape seems tedious here. We may, however, use melt:
library(reshape2)
melt(d, measure.vars = c("t1", "t2"), value.name = "t")[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
where -3 drops the key column. We may also use gather, as in
gather(d, "key", "t", t1, t2)[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
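Both of the answers above only keep the two endpoints of each spell, which happens to be enough for this example. Since the real spells are irregular, you may instead want one row for every time point covered by a spell; here is a base R sketch of that (an assumption about what the panel should look like, not part of the answers above):
# expand each spell (t1..t2) into one row per time point it covers
panel <- do.call(rbind, lapply(seq_len(nrow(d)), function(i) {
  data.frame(id = d$id[i], t = seq(d$t1[i], d$t2[i]), value = d$value[i])
}))
panel <- panel[order(panel$id, panel$t), ]
For the example data this reproduces the result data frame shown in the question.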

Identifying unique duplicates in vector in R

I am trying to identify duplicates based on a match of elements in two vectors. Using duplicated() provides a vector of all matches; however, I would like to index which rows are matches with each other. Using the following code as an example:
x <- c(1,6,4,6,4,4)
y <- c(3,2,5,2,5,5)
frame <- data.frame(x,y)
matches <- duplicated(frame) | duplicated(frame, fromLast = TRUE)
matches
[1] FALSE TRUE TRUE TRUE TRUE TRUE
Ultimately, I would like to create a vector that identifies that elements 2 and 4 are matches with each other, as are 3, 5 and 6. Any thoughts are greatly appreciated.
Another data.table answer, using the group counter .GRP to assign every distinct element a label:
library(data.table)
d <- data.table(frame)
d[, z := .GRP, by = list(x, y)]
d
# x y z
# 1: 1 3 1
# 2: 6 2 2
# 3: 4 5 3
# 4: 6 2 2
# 5: 4 5 3
# 6: 4 5 3
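The z column is then the grouping vector the question asks for (a quick usage note):
d$z
# [1] 1 2 3 2 3 3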
How about this with plyr::ddply():
library(plyr)
ddply(cbind(index = 1:nrow(frame), frame), .(x, y), summarise,
      count = length(index), elems = paste0(index, collapse = ","))
x y count elems
1 1 3 1 1
2 4 5 3 3,5,6
3 6 2 2 2,4
NB: the expression cbind(index = 1:nrow(frame), frame) just adds an element index to each row.
Using merge against the unique possibilities for each row, you can get a result:
labls <- data.frame(unique(frame),num=1:nrow(unique(frame)))
result <- merge(transform(frame,row = 1:nrow(frame)),labls,by=c("x","y"))
result[order(result$row),]
# x y row num
#1 1 3 1 1
#5 6 2 2 2
#2 4 5 3 3
#6 6 2 4 2
#3 4 5 5 3
#4 4 5 6 3
The result$num vector gives the groups.
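To list which rows belong to the same group, you can split the row indices by that grouping vector (a short usage sketch based on the result above):
split(result$row, result$num)
# $`1`
# [1] 1
#
# $`2`
# [1] 2 4
#
# $`3`
# [1] 3 5 6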

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the following. Actually my real 'df' dataframe is much bigger than this one here, but I really do not want to confuse anybody, so I try to simplify things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a, b, c, d, e) and for each id group (1, 2, 3) (for this latter grouping see my column 'id').
So, for column 'a' and for id number '1' (for the latter see column 'id') the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column 'a' (and regarding only those records which have number '1' in column 'id') we can say that number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show you another example. For column 'a' and for id number '2' (for the latter grouping see again column 'id'):
as.numeric(table(df[11:20,2]))
##After running the code the results are:
[1] 4 3 3
Let me explain a little again: in column 'a' (and regarding only those observations which have number '2' in column 'id') we can say that number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do: calculating the occurrences of numbers for each custom-defined subset (and then collecting these values into a data frame). I know it is not a difficult task, but the PROBLEM is that I'm going to have to change the input 'df' dataframe on a regular basis, and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the 'df' dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, "automatic" way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab['1', 'a', '3']
[1] 7
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[id==x,-1],2,table))
However, when a grouping doesn't contain every possible value, as with column 'a' in id group 1 (which has no 2s), the result for that group will be a list rather than a nice table (matrix); see the sketch after the output below for one way around this.
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
library(plyr)

ColTables <- function(df) {
  counts <- list()
  for (a in names(df)[names(df) != "id"]) {
    counts[[a]] <- table(df[a])
  }
  return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
One way to do it is using the aggregate function, but you have to add a column to your data frame:
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
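For example, to pull out the counts for column 'a' in id group 2 (a quick usage sketch):
res <- by(df, df$id, function(x) lapply(x[,-1], table))
res[[2]]$a   # tables for id group 2, column 'a'
# 1 2 3
# 4 3 3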
