Imagine I have a data.frame (or matrix) with a few different values, such as this:
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
test2 <- test
If I want to add extra columns with counts I could do:
test2$good <- apply(test,1, function(x) sum(x==1))
test2$bad <- apply(test,1, function(x) sum(x==-1))
test2$neutral <- apply(test,1, function(x) sum(x==0))
But if I had many possible values instead, I would have to write many such lines, which wouldn't be elegant.
I've tried with table(), but the output is not easily usable:
apply(test,1, function(x) table(x))
There is also a big problem: if a row doesn't contain any occurrence of some value, the result generated by table() doesn't have the same length as the others, so the results can't be bound together.
Is there a way to force table() to take that value into account, telling it that there are zero occurrences?
I've also thought of using do.call or lapply with merge, but it's too difficult for me.
I've read about dplyr's count() as well, but I have no clue how to apply it here.
Could anyone provide a solution with dplyr or tidyr?
PS: What about a data.table solution?
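As an aside on the table() part of the question: table() will report zero counts if each row is converted to a factor with explicit levels before tabulating. A minimal sketch, assuming the set of possible values (-1, 0, 1) is known in advance:
# each table() call now always has length 3, so the rows can be bound
counts <- t(apply(test, 1, function(x) table(factor(x, levels = c(-1, 0, 1)))))
colnames(counts) <- c("bad", "neutral", "good")  # rename the -1/0/1 columns
cbind(test, counts)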
We could melt the dataset to long format after converting it to a matrix, get the frequencies using table, and cbind with the original dataset:
library(reshape2)
cbind(test2, as.data.frame.matrix(table(melt(as.matrix(test2))[-2])))
Or use mtabulate on the transpose of 'test2' and cbind with the original dataset.
library(qdapTools)
cbind(test2, mtabulate(as.data.frame(t(test2))))
Or we can use gather/spread from tidyr after creating a row id with add_rownames from dplyr:
library(dplyr)
library(tidyr)
add_rownames(test2) %>%
  gather(Var, Val, -rowname) %>%
  group_by(rn = as.numeric(rowname), Val) %>%
  summarise(N = n()) %>%
  spread(Val, N, fill = 0) %>%
  bind_cols(test2, .)
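With current versions of tidyr (>= 1.1, so that values_fill = 0 is accepted), the same reshaping can be written with pivot_longer/pivot_wider instead of gather/spread; a sketch under that assumption:
library(dplyr)
library(tidyr)
test2 %>%
  mutate(rn = row_number()) %>%                                  # row id
  pivot_longer(-rn, names_to = "Var", values_to = "Val") %>%     # long format
  count(rn, Val) %>%                                             # frequency per row and value
  pivot_wider(names_from = Val, values_from = n, values_fill = 0) %>%
  select(-rn) %>%
  bind_cols(test2, .)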
You can use rowSums():
test2 <- cbind(test2, sapply(c(-1, 0, 1), function(x) rowSums(test==x)))
This is similar to the code in the comment from etienne, but without the call to apply().
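If named count columns are wanted (as in the question), the same idea works with a named vector, since sapply() carries the names through to the result's columns; a small sketch:
counts <- sapply(c(bad = -1, neutral = 0, good = 1), function(x) rowSums(test == x))
test2 <- cbind(test2, counts)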
Here is an answer using base R.
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
testCopy <- test
# find all unique values, note that data frame is a list
uniqVal <- unique(unlist(test))
# the new column names start with Y
for (val in uniqVal) {
  test[paste0("Y", val)] <- apply(testCopy, 1, function(x) sum(x == val))
}
head(test)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Y-1 Y1 Y0
# 1 -1 0 1 1 1 0 -1 -1 1 1 3 5 2
# 2 1 -1 0 1 1 -1 -1 0 0 1 3 4 3
# 3 -1 0 1 0 1 1 1 1 -1 1 2 6 2
# 4 1 1 1 1 0 1 1 0 1 0 0 7 3
# 5 0 -1 1 -1 -1 0 0 1 0 0 3 2 5
# 6 1 1 0 1 1 1 1 1 1 1 0 9 1
I have a data frame where each observation is spread across two columns: columns 1 and 2 represent individual 1, columns 3 and 4 individual 2, and so on.
Basically, what I want to do is add every two contiguous columns so that I get each individual's real score.
In this example V1 and V2 represent individual I and V3 and V4 represent individual II. The resulting data frame will therefore have half the columns, the same number of rows, and each value will be the sum of the two contiguous columns.
Data
V1 V2 V3 V4
1 0 0 1 1
2 1 0 0 0
3 0 1 1 1
4 0 1 0 1
Desired output
I II
1 0 2
2 1 0
3 1 2
4 1 1
I tried something like this:
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
b <- as.data.frame(matrix(NA, nrow = nrow(a), ncol = ncol(a)))
for (i in seq(2, ncol(a), by = 2)){
  for (k in 1:nrow(a)){
    b[k, i] <- a[k, i] + a[k, i - 1]
  }
}
b <- b[, -seq(1, length(b), by = 2)]
Is there a way to make it simpler?
We could use split.default to split the data and then do rowSums by looping over the list
sapply(split.default(a, as.integer(gl(ncol(a), 2, ncol(a)))), rowSums)
1 2
[1,] 0 2
[2,] 1 0
[3,] 1 2
[4,] 1 1
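To attach these sums to the original data with readable names, something along these lines should work (a sketch; the Roman-numeral column names are just illustrative):
res <- sapply(split.default(a, as.integer(gl(ncol(a), 2, ncol(a)))), rowSums)
colnames(res) <- c("I", "II")
cbind(a, res)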
You can use vector recycling to select columns and add them.
res <- a[c(TRUE, FALSE)] + a[c(FALSE, TRUE)]
names(res) <- paste0('col', seq_along(res))
res
# col1 col2
#1 0 2
#2 1 0
#3 1 2
#4 1 1
A dplyr approach with row-wise operations (rowwise() is a special kind of per-row grouping):
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
library(dplyr)
a %>%
  rowwise() %>%
  transmute(I = sum(c(V1, V2)),
            II = sum(c(V3, V4)))
Or, alternatively, with rowSums(), a built-in row-wise variant of sum:
a %>% transmute(I = rowSums(across(1:2)),
II = rowSums(across(3:4)))
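With dplyr >= 1.0, rowwise() also pairs with c_across(), which selects columns within the current row; an equivalent sketch under that assumption:
a %>%
  rowwise() %>%
  transmute(I = sum(c_across(V1:V2)),
            II = sum(c_across(V3:V4))) %>%
  ungroup()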
I have a dataframe that is similar to a simplified version below:
MO1<-c("0","1","2","3")
MO2<-c("1","0","3","2")
MO3<-c("3","2","1","0")
df<-data.frame(MO1,MO2,MO3)
df
I am trying to create a new variable that scans through the observations looking for all the 1 values. I would then like each observation of this new variable to take on the name of the column that the 1 was found in, see below:
MO1<-c("0","1","2","3")
MO2<-c("1","0","3","2")
MO3<-c("3","2","1","0")
MOTIVATION<-c("MO2","MO1","MO3","")
df2<-data.frame(MO1,MO2,MO3,MOTIVATION)
df2
Sorry, I do not know how to show just the resulting data frame df2 from above.
I have 989 observations and 19 different MO.. variables in my dataset.
Another option
> ind <- which(df==1, arr.ind = TRUE)
> df2 <- df # just cloning df
> df2$MOTIVATION <- NA
> df2$MOTIVATION[ind[,1]] <- names(df) [ind[,2]]
> df2
MO1 MO2 MO3 MOTIVATION
1 0 1 3 MO2
2 1 0 2 MO1
3 2 3 1 MO3
4 3 2 0 <NA>
An option is to use apply in combination with which, as follows:
df$MOTIVATION <- apply(df,1,function(x)names(df)[which(x==1)])
df
# MO1 MO2 MO3 MOTIVATION
# 1 0 1 3 MO2
# 2 1 0 2 MO1
# 3 2 3 1 MO3
# 4 3 2 0
1) Try max.col like this. Insert a 1 in front of each row and then find the column of the last 1. Subtract 1 so that the result corresponds to the original column numbers and a row with no 1 gives 0. Then replace all zeros with NA and look up the corresponding column names.
ix <- max.col(cbind(1, df) == 1, "last") - 1
transform(df, MOTIVATION = names(df)[replace(ix, ix == 0, NA)])
giving:
MO1 MO2 MO3 MOTIVATION
1 0 1 3 MO2
2 1 0 2 MO1
3 2 3 1 MO3
4 3 2 0 <NA>
2) A variation would be the following. We compute max.col and then multiply each result by 1 if there is a 1 in that row or NA if not.
df1 <- df == 1
transform(df, MOTIVATION = names(df)[max.col(df1) * match(rowSums(df1), 1)])
The following does the trick (note that it supports the case where two columns contain a "1" in the same row; not sure whether that's a valid edge case for you).
I slightly modified the data, adding MO4, so that it would contain two "1" values:
MO1<-c("0","1","2","3")
MO2<-c("1","2","3","2")
MO3<-c("3","2","1","0")
MO4<-c("3","2","1","1")
df<-data.frame(MO1,MO2,MO3,MO4)
df
findx <- function(dfx) {
  idx <- which(dfx == "1")
  res <- lapply(idx, function(x) paste0('MO', x))
  res
}
found <- apply(df,2,findx)
newdf <- unlist(found)
newdf
With an output of:
"MO2" "MO1" "MO3" "MO3" "MO4"
I have a data frame that has a bunch of data that's joined with commas in certain elements of the rows. Something that looks like:
df <- data.frame(
c(2012,2012,2012,2013,2013,2013,2014,2014,2014)
,c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i")
)
names(df) <- c("year", "type")
I want to get it into the form that dcast almost gives me, with year, a, b, c, etc. as the columns and the frequencies across the data frame as the cells of the resulting data frame. I first tried colsplit on df and then dcast, but that seems to work only if I want to aggregate on one of the levels instead of all of them.
df2 <- data.frame( df$year, colsplit(df$type, ',' , c('v1','v2','v3','v4','v5')) )
df3 <- dcast(df2, df.year ~ v1)
This only gives me the result for the first level of the colsplit, instead of all of them. Am I close to a solution, or should I be using a different approach entirely?
Here is a single-line option with base R: split the 'type' column with strsplit, set the names of the list output to 'year', stack it into a single data.frame, and get the frequency counts using table.
table(stack(setNames(strsplit(as.character(df$type), ","), df$year))[2:1])
# values
#ind a b c d e f g h i
# 2012 2 1 3 2 1 1 0 0 0
# 2013 4 1 1 1 0 0 0 0 0
# 2014 1 1 0 0 1 0 2 1 1
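If a regular data.frame (with year as a column) is preferred over the table object, the result can be wrapped; a small sketch:
tab <- table(stack(setNames(strsplit(as.character(df$type), ","), df$year))[2:1])
data.frame(year = rownames(tab), as.data.frame.matrix(tab))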
You are close to the solution; you just need one more step: melt all the values into one column before dcast. See the example:
require(reshape2)
df <- data.frame(c(2012,2012,2012,2013,2013,2013,2014,2014,2014),
c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i"))
names(df) <- c("year", "type")
df
df2 <- data.frame(df$year, colsplit(df$type, ',', c('v1','v2','v3','v4','v5')))
df2
df3 <- melt(df2, id.vars = "df.year", na.rm = T)
df3
df4 <- dcast(df3[df3$value != "", ], df.year ~ value, fun.aggregate = length)
df4
Here's a data.table approach:
library(data.table)
setDT(df)
dcast(df[, .(unlist(strsplit(as.character(type), ",", fixed=TRUE))), by = year],
year ~ V1, value.var = "V1", fun.aggregate = length)
# year a b c d e f g h i
#1: 2012 2 1 3 2 1 1 0 0 0
#2: 2013 4 1 1 1 0 0 0 0 0
#3: 2014 1 1 0 0 1 0 2 1 1
We first split the type column on the commas within each year group to get a long format, then dcast to wide with length as the aggregate function.
Maybe something like this could work?
# extract unique values and years
vals <- unique(do.call(c, strsplit(x = as.vector(df$type), "[[:punct:]]")))
years <- unique(df$year)
# count
df4 <- data.frame(sapply(vals, (function(vl) {sapply(years, (function(ye){
sum(do.call(c, strsplit(as.vector(df$type[df$year == ye]) , "[[:punct:]]")) == vl)
}))})))
df4 <- cbind(years, df4)
df4
#result
years a b c d e f g h i
1 2012 2 1 3 2 1 1 0 0 0
2 2013 4 1 1 1 0 0 0 0 0
3 2014 1 1 0 0 1 0 2 1 1
I am trying to mutate colSums onto the bottom of a data frame, but the first column of the table is a character vector containing labels.
For example,
df=data.frame(
label = c("A","B","C","D","E","F","G","H","I","J"),
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
Without the label col,
df %>% mutate(Total = colSums(df[, 1:5], na.rm = TRUE))
should work fine... but I tried
df %>% mutate(Total = colSums(df[, 2:6], na.rm = TRUE))
which gave me an error message
Error in mutate_impl(.data, dots) : wrong result size (5), expected
10 or 1
How can I ignore that first column and still mutate colSums into the bottom of my data frame?
Thank you.
mutate adds a new column to a data.frame. You indicate you're trying to add a new row to the bottom. Thus the error message: in trying to create a new column, mutate expects a vector of length 10 (or a single value that it can fill the entire column with).
If you want to add a totals row to a data.frame, try janitor::adorn_totals("row"):
library(janitor)
df %>%
adorn_totals("row")
label x1 x2 x3 x4 x5
A 1 1 0 1 1
B 0 1 1 0 1
C 0 NA 0 NA NA
D NA 1 1 1 1
E 0 1 1 0 1
F 1 0 0 0 1
G 1 NA NA NA NA
H NA NA NA 0 1
I 0 0 0 0 0
J 1 1 1 1 1
Total 4 5 4 3 7
Self-promotion disclaimer, I wrote the janitor package and this function - posting this answer because the function addresses precisely this situation.
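If adding a dependency isn't desirable, a plain dplyr alternative along these lines should also work (a sketch, assuming dplyr >= 1.0 for across() and a character label column):
library(dplyr)
df %>%
  bind_rows(summarise(., label = "Total",
                      across(where(is.numeric), ~ sum(.x, na.rm = TRUE))))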
How do you add a variable to a dataset using the aggregate and by commands? For example, I have:
num x1
1 1
1 0
2 0
2 0
And I'm looking to create a variable that identifies, for each value of num, whether any x1 in that group is 1, for example:
num x1 x2
1 1 1
1 0 1
2 0 0
2 0 0
or
num x1 x2
1 1 TRUE
1 0 TRUE
2 0 FALSE
2 0 FALSE
I've tried to use
df$x2 <- aggregate(df$x1, by = list(df$num), FUN = sum)
But I'm getting an error that says the replacement has a different number of rows than the data. Can anyone help?
This can be done by grouping by 'num' and checking whether there is any 1 among the 'x1' values. ave from base R is convenient for this, instead of aggregate:
df1$x2 <- with(df1, ave(x1==1, num, FUN = any))
df1$x2
#[1] 1 1 0 0
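Depending on the version, the result may come back as a logical (TRUE/FALSE) vector; if a numeric 0/1 column is wanted, a unary plus (or as.integer) converts it. A small sketch:
df1$x2 <- with(df1, +ave(x1 == 1, num, FUN = any))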
Or, using dplyr, we group by 'num' and create 'x2' by checking whether any 'x1' is equal to 1. It will be a logical vector if we don't wrap it in as.integer to convert it to binary:
library(dplyr)
df1 %>%
group_by(num) %>%
mutate(x2 = as.integer(any(x1==1)))
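Since the question started from aggregate(): the aggregated result has one row per num, so it has to be merged back onto the original rows rather than assigned directly. A sketch of that route (the column rename is just for illustration):
agg <- aggregate(x1 ~ num, data = df1, FUN = function(x) as.integer(any(x == 1)))
names(agg)[2] <- "x2"
merge(df1, agg, by = "num")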