I'm quite new to R and using lapply. I have a large dataframe and I'm attempting to use lapply to output the sum of some subsets of this dataframe.
group_a  group_b  n_variants_a  n_variants_b
1        NA       1             2
NA       2        5             4
1        2        2             0
I want to look at subsets based on multiple different groups (group_a, group_b) and sum each column of n_variants.
Running this over just one group and n_variant set works:
sum(subset(df, !is.na(group_a))$n_variants_a)
However I want to sum every n_variant column based on every grouping. My lapply script for this outputs values of 0 for each sum.
summed_variants <- lapply(list_of_groups, function(g) {
  lapply(list_of_variants, function(v) {
    sum(subset(df, !(is.na(g)))$v)
  })
})
I was wondering if I need to use paste0 to paste the list of variants in, but I couldn't get this to work.
Thanks for your help!
We can use Map/mapply for this: loop over the group names and their corresponding 'n_variants' columns (assuming they are in the same order), extract the columns by name, keep the 'n_variants' values where the group is not NA (!is.na), and take the sum.
mapply(function(x, y) sum(df1[[y]][!is.na(df1[[x]])]),
names(df1)[1:2], names(df1)[3:4])
group_a group_b
3 4
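If every 'n_variants' column should be summed under every grouping (the full cross that the nested lapply in the question attempts), the following is a minimal sketch. It assumes list_of_groups and list_of_variants hold column names as strings, which the question does not show. The original returned 0 because `$v` looks for a column literally named v (that is NULL, and sum(NULL) is 0), and is.na(g) tests the name string rather than the column. `[[` looks the column up by the value of the string instead.

```r
df1 <- data.frame(group_a = c(1L, NA, 1L), group_b = c(NA, 2L, 2L),
                  n_variants_a = c(1L, 5L, 2L), n_variants_b = c(2L, 4L, 0L))

# Assumed contents of the question's lists: column names as strings
list_of_groups   <- c("group_a", "group_b")
list_of_variants <- c("n_variants_a", "n_variants_b")

# Outer loop over groups, inner loop over variant columns;
# df1[[v]] and df1[[g]] fetch the columns named by the strings v and g
summed_variants <- sapply(list_of_groups, function(g) {
  sapply(list_of_variants, function(v) {
    sum(df1[[v]][!is.na(df1[[g]])])
  })
})
summed_variants   # rows: variant columns, columns: groups
```

The diagonal of this matrix reproduces the paired result from mapply above (3 and 4).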
Another option uses the tidyverse. Loop across the 'n_variants' columns, take the current column name (cur_column()), replace the substring 'n_variants' with 'group', get() that group column, and sum the values where it is not NA.
library(stringr)
library(dplyr)
df1 %>%
  summarise(across(contains('variants'),
    ~ sum(.x[!is.na(get(str_replace(cur_column(), 'n_variants', 'group')))])))
-output
n_variants_a n_variants_b
1 3 4
data
df1 <- structure(list(group_a = c(1L, NA, 1L), group_b = c(NA, 2L, 2L
), n_variants_a = c(1L, 5L, 2L), n_variants_b = c(2L, 4L, 0L)),
class = "data.frame", row.names = c(NA,
-3L))
Hi, I have been trying for a while to match two large columns of names, several of which have different spellings etc. So far I have written some code to practice on a smaller dataset:
examples%>% mutate(new_ID = case_when(mapply (adist, example_1 , example_2) <= 3 ~ example_1, TRUE ~ example_2))
This creates a new column containing the name from example_1 if it is within an edit distance of 3. However, it does not give the name from example_2 when this criterion is not met, which I need it to do.
This code also only compares names in the same row of each column, whereas I need it to work on a dataset with two columns of different lengths (one is larger, so they can't be put in the same order).
It also needs to skip the NAs in the smaller column of names (they are there only to pad it to the same length as the other one).
Anyone know how to do something like this?
dput(head(examples))
structure(list(. = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("grarryfieldsred","harroldfrankknight", "sandramaymeres", "sheilaovensnew", "terrifrank"), class = "factor"), example_2 = structure(c(4L, 2L, 3L, 1L,
5L), .Label = c(" grarryfieldsred", "candramymars", "haroldfranrinight",
"sheilowansknew", "terryfrenk"), class = "factor")), row.names = c(NA,
5L), class = "data.frame")
The problem is that your columns have become factors rather than character vectors. When you try to combine two columns together with different factor levels, unexpected results can happen.
First convert your columns to character:
library(dplyr)
examples %>%
  mutate(across(contains("example"), as.character)) %>%
  mutate(new_ID = case_when(mapply(adist, example_1, example_2) <= 3 ~ example_1,
                            TRUE ~ example_2))
# example_1 example_2 new_ID
#1 sheilaovensnew sheilowansknew sheilowansknew
#2 sandramaymeres candramymars candramymars
#3 harroldfrankknight haroldfranrinight harroldfrankknight
#4 grarryfieldsred grarryfieldsred grarryfieldsred
#5 terrifrank terryfrenk terrifrank
In your dput output, somehow the name of example_1 was changed. I ran this first:
names(examples)[1] <- "example_1"
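For the second part of the question (columns of different lengths, not aligned row by row, with NA padding), one possible approach is to compute the full distance matrix with adist() and pick the closest candidate for each name. This is only a sketch: the data below are hypothetical (modelled on the question, with one extra name and one NA), and the threshold of 3 is carried over from the question.

```r
# Hypothetical data: the shorter column is padded with NA
example_1 <- c("sheilaovensnew", "sandramaymeres", "harroldfrankknight",
               "grarryfieldsred", "terrifrank", "extraname")
example_2 <- c("sheilowansknew", "candramymars", "haroldfranrinight",
               " grarryfieldsred", "terryfrenk", NA)

candidates <- example_2[!is.na(example_2)]   # drop the NA padding before matching
d <- adist(example_1, candidates)            # rows: example_1, columns: candidates
best <- apply(d, 1, which.min)               # index of the closest candidate per name
best_dist <- d[cbind(seq_along(example_1), best)]

matches <- data.frame(example_1,
                      closest  = candidates[best],
                      distance = best_dist,
                      # same rule as the question: keep example_1 if within 3 edits
                      new_ID   = ifelse(best_dist <= 3, example_1, candidates[best]))
matches
```

Note that adist() builds a length(example_1) by length(candidates) matrix, so for very large columns it may need to be computed in chunks.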
I have a data.frame that looks like this.
x a 1
x b 2
x c 3
y a 3
y b 3
y c 2
I want this in matrix form so I can feed it to heatmap to make a plot. The result should look something like:
a b c
x 1 2 3
y 3 3 2
I have tried cast from the reshape package and I have tried writing a manual function to do this but I do not seem to be able to get it right.
There are many ways to do this. This answer starts with what is quickly becoming the standard method, but also includes older methods and various other methods from answers to similar questions scattered around this site.
tmp <- data.frame(x=gl(2,3, labels=letters[24:25]),
y=gl(3,1,6, labels=letters[1:3]),
z=c(1,2,3,3,3,2))
Using the tidyverse:
The new cool new way to do this is with pivot_wider from tidyr 1.0.0. It returns a data frame, which is probably what most readers of this answer will want. For a heatmap, though, you would need to convert this to a true matrix.
library(tidyr)
pivot_wider(tmp, names_from = y, values_from = z)
## # A tibble: 2 x 4
## x a b c
## <fct> <dbl> <dbl> <dbl>
## 1 x 1 2 3
## 2 y 3 3 2
The old cool new way to do this is with spread from tidyr. It similarly returns a data frame.
library(tidyr)
spread(tmp, y, z)
## x a b c
## 1 x 1 2 3
## 2 y 3 3 2
Using reshape2:
One of the first steps toward the tidyverse was the reshape2 package.
To get a matrix use acast:
library(reshape2)
acast(tmp, x~y, value.var="z")
## a b c
## x 1 2 3
## y 3 3 2
Or to get a data frame, use dcast, as here: Reshape data for values in one column.
dcast(tmp, x~y, value.var="z")
## x a b c
## 1 x 1 2 3
## 2 y 3 3 2
Using plyr:
In between reshape2 and the tidyverse came plyr, with the daply function, as shown here: https://stackoverflow.com/a/7020101/210673
library(plyr)
daply(tmp, .(x, y), function(x) x$z)
## y
## x a b c
## x 1 2 3
## y 3 3 2
Using matrix indexing:
This is kinda old school but is a nice demonstration of matrix indexing, which can be really useful in certain situations.
with(tmp, {
out <- matrix(nrow=nlevels(x), ncol=nlevels(y),
dimnames=list(levels(x), levels(y)))
out[cbind(x, y)] <- z
out
})
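As a small illustration of the matrix indexing used above: when a matrix is indexed by a two-column matrix, each row of the index is read as a (row, column) pair. In the answer's code, cbind(x, y) on two factors yields their integer codes, which is what makes the assignment work. The values below are made up for the illustration.

```r
m <- matrix(0, nrow = 2, ncol = 3,
            dimnames = list(c("x", "y"), c("a", "b", "c")))

# Each row of idx is one (row, column) position: (1, 3) and (2, 1)
idx <- cbind(c(1, 2), c(3, 1))
m[idx] <- c(10, 20)   # assigns m[1, 3] <- 10 and m[2, 1] <- 20
m
```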
Using xtabs:
xtabs(z~x+y, data=tmp)
Using a sparse matrix:
There's also sparseMatrix within the Matrix package, as seen here: R - convert BIG table into matrix by column names
library(Matrix)
with(tmp, sparseMatrix(i = as.numeric(x), j = as.numeric(y), x = z,
                       dimnames = list(levels(x), levels(y))))
## 2 x 3 sparse Matrix of class "dgCMatrix"
## a b c
## x 1 2 3
## y 3 3 2
Using reshape:
You can also use the base R function reshape, as suggested here: Convert table into matrix by column names, though you have to do a little manipulation afterwards to remove the extra column and get the names right (not shown).
reshape(tmp, idvar="x", timevar="y", direction="wide")
## x z.a z.b z.c
## 1 x 1 2 3
## 4 y 3 3 2
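One way the follow-up manipulation might look, as a sketch only (the names wide and m are introduced here for illustration): drop the id column, convert to a matrix, and strip the "z." prefix that reshape adds.

```r
tmp <- data.frame(x = gl(2, 3, labels = letters[24:25]),
                  y = gl(3, 1, 6, labels = letters[1:3]),
                  z = c(1, 2, 3, 3, 3, 2))

wide <- reshape(tmp, idvar = "x", timevar = "y", direction = "wide")

# Drop the id column, then use it for the row names
m <- as.matrix(wide[-1])
dimnames(m) <- list(as.character(wide$x),
                    sub("^z\\.", "", colnames(m)))   # "z.a" -> "a"
m
```

The result is a plain numeric matrix, which is the form heatmap() expects.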
The question is some years old but maybe some people are still interested in alternative answers.
If you don't want to load any packages, you might use this function:
#' Converts three columns of a data.frame into a matrix -- e.g. to plot
#' the data via image() later on. Two of the columns form the row and
#' col dimensions of the matrix. The third column provides values for
#' the matrix.
#'
#' @param data data.frame: input data
#' @param rowtitle string: row-dimension; name of the column in data, which distinct values should be used as row names in the output matrix
#' @param coltitle string: col-dimension; name of the column in data, which distinct values should be used as column names in the output matrix
#' @param datatitle string: name of the column in data, which values should be filled into the output matrix
#' @param rowdecreasing logical: should the row names be in ascending (FALSE) or in descending (TRUE) order?
#' @param coldecreasing logical: should the col names be in ascending (FALSE) or in descending (TRUE) order?
#' @param default_value numeric: default value of matrix entries if no value exists in data.frame for the entries
#' @return matrix: matrix containing values of data[[datatitle]] with rownames data[[rowtitle]] and colnames data[[coltitle]]
#' @author Daniel Neumann
#' @date 2017-08-29
data.frame2matrix = function(data, rowtitle, coltitle, datatitle,
                             rowdecreasing = FALSE, coldecreasing = FALSE,
                             default_value = NA) {
  # check, whether titles exist as columns names in the data.frame data
  if ( (!(rowtitle%in%names(data)))
       || (!(coltitle%in%names(data)))
       || (!(datatitle%in%names(data))) ) {
    stop('data.frame2matrix: bad row-, col-, or datatitle.')
  }
  # get number of rows in data
  ndata = dim(data)[1]
  # extract rownames and colnames for the matrix from the data.frame
  rownames = sort(unique(data[[rowtitle]]), decreasing = rowdecreasing)
  nrows = length(rownames)
  colnames = sort(unique(data[[coltitle]]), decreasing = coldecreasing)
  ncols = length(colnames)
  # initialize the matrix
  out_matrix = matrix(NA,
                      nrow = nrows, ncol = ncols,
                      dimnames=list(rownames, colnames))
  # iterate rows of data
  for (i1 in 1:ndata) {
    # get matrix-row and matrix-column indices for the current data-row
    iR = which(rownames==data[[rowtitle]][i1])
    iC = which(colnames==data[[coltitle]][i1])
    # throw an error if the matrix entry (iR,iC) is already filled.
    if (!is.na(out_matrix[iR, iC])) stop('data.frame2matrix: double entry in data.frame')
    out_matrix[iR, iC] = data[[datatitle]][i1]
  }
  # set empty matrix entries to the default value
  out_matrix[is.na(out_matrix)] = default_value
  # return matrix
  return(out_matrix)
}
How it works:
myData = as.data.frame(list('dim1'=c('x', 'x', 'x', 'y','y','y'),
'dim2'=c('a','b','c','a','b','c'),
'values'=c(1,2,3,3,3,2)))
myMatrix = data.frame2matrix(myData, 'dim1', 'dim2', 'values')
myMatrix
> a b c
> x 1 2 3
> y 3 3 2
base R, unstack
unstack(df, V3 ~ V2)
# a b c
# 1 1 2 3
# 2 3 3 2
This may not be a general solution but works well in this case.
data
df<-structure(list(V1 = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("x",
"y"), class = "factor"), V2 = structure(c(1L, 2L, 3L, 1L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), V3 = c(1L,
2L, 3L, 3L, 3L, 2L)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-6L))
For the sake of completeness, there's also a tapply() solution.
with(d, tapply(z, list(x, y), sum))
# a b c
# x 1 2 3
# y 3 3 2
Data
d <- structure(list(x = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("x",
"y"), class = "factor"), y = structure(c(1L, 2L, 3L, 1L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), z = c(1, 2,
3, 3, 3, 2)), class = "data.frame", row.names = c(NA, -6L))
From tidyr 0.8.3.9000, a new function called pivot_wider() is introduced. It is basically an upgraded version of the previous spread() function (which is, moreover, no longer under active development). From the pivoting vignette:
This vignette describes the use of the new pivot_longer() and
pivot_wider() functions. Their goal is to improve the usability of
gather() and spread(), and incorporate state-of-the-art features found
in other packages.
For some time, it’s been obvious that there is something fundamentally
wrong with the design of spread() and gather(). Many people don’t find
the names intuitive and find it hard to remember which direction
corresponds to spreading and which to gathering. It also seems
surprisingly hard to remember the arguments to these functions,
meaning that many people (including me!) have to consult the
documentation every time.
How to use it (using the data from @Aaron):
pivot_wider(data = tmp, names_from = y, values_from = z)
x a b c
<fct> <dbl> <dbl> <dbl>
1 x 1 2 3
2 y 3 3 2
Or in a "full" tidyverse fashion:
tmp %>%
pivot_wider(names_from = y, values_from = z)
The tidyr package from the tidyverse has an excellent function that does this.
Assuming your variables are named v1, v2 and v3, left to right, and your data frame is named dat:
library(tidyr)
dat %>%
  spread(key = v2,
         value = v3)
Ta da!
My objective is to count how many duplicates there are in a column. I have a column of 3516 obs. of 1 variable; they are all dates, with about 144 duplicates each, from 1/4/16 to 7/3/16. Example (I put 1 duplicate each for the example's sake):
1/4/16
1/4/16
31/3/16
31/3/16
30/3/16
30/3/16
29/3/16
29/3/16
28/3/16
28/3/16
So I used the function date = count(date), where date is my df date. But once I execute it, my date sequence is not in order anymore. Hope someone can solve my problem.
If we need to count the total number of duplicates
sum(table(df1$date)-1)
#[1] 5
Suppose, we need the count of each date, one option would be to group by 'date' and get the number of rows. This can be done with data.table.
library(data.table)
setDT(df1)[, .N, date]
If you want the number of duplicates in your column, you can use duplicated:
sum(duplicated(df$V1))
#[1] 5
Assuming V1 as your column name.
EDIT
As per the update, if you want the count of each date, you can use the table function, which will give you exactly that:
table(df$V1)
#1/4/16 28/3/16 29/3/16 30/3/16 31/3/16
# 2 2 2 2 2
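Since the question also complains that the dates come back out of order, here is a base R sketch that keeps them in order of first appearance. table() orders its output by factor levels, so the levels are set explicitly with unique(). The vector dates is a stand-in for the asker's column, using the ten example values from the question.

```r
dates <- c("1/4/16", "1/4/16", "31/3/16", "31/3/16", "30/3/16",
           "30/3/16", "29/3/16", "29/3/16", "28/3/16", "28/3/16")

# Levels in order of first appearance, so table() does not re-sort them
counts <- table(factor(dates, levels = unique(dates)))
counts

# Number of entries that repeat an earlier entry
n_dupes <- sum(duplicated(dates))
n_dupes
```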
library(dplyr)
library(janitor)
df%>% get_dupes(Variable) %>% tally()
You can add group_by in the pipe too if you want.
One way is to create a data frame with unique values of your initial data which will preserve the order and then use left_join from dplyr package to join the two data frames. Note that the name of your column should be the same.
Initial_data <- structure(list(V1 = structure(c(1L, 1L, 5L, 5L, 4L, 4L, 3L, 3L,
2L, 2L, 2L), .Label = c("1/4/16", "28/3/16", "29/3/16", "30/3/16",
"31/3/16"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-11L))
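df1 <- unique(Initial_data)
# count() here is plyr::count (it returns the 'freq' column shown below);
# count the full data, not the de-duplicated df1
count1 <- plyr::count(Initial_data)
dplyr::left_join(df1, count1, by = 'V1')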
# V1 freq
#1 1/4/16 2
#2 31/3/16 2
#3 30/3/16 2
#4 29/3/16 2
#5 28/3/16 3
If you want to count the number of duplicated records, use:
sum(duplicated(df))
and when you want to calculate the proportion of duplicates, use:
mean(duplicated(df))
I am transitioning from using data.frame in R to data.table for better performance. One of the main parts of converting the code is moving custom functions that were applied with apply on a data.frame over to data.table.
Say I have a simple data table, dt1.
x y z
1 9 j
4 1 n
7 1 n
I am trying to calculate a new column in dt1, based on the values of x, y and z.
I tried 2 ways. Both give the correct result, but the faster one spits out a warning, so I want to make sure the warning is nothing serious before I use the faster version in converting my existing code.
(1) dt1[,a:={if((x<1) & (y>3) & (j == "n")){6} else {7}}]
(2) dt1[,a:={if((x<1) & (y>3) & (j == "n")){6} else {7}}, by = 1:nrow(x)]
Version 1 runs faster than version 2, but spits out the warning "the condition has length > 1 and only the first element will be used".
But the result is good.
The second version is slightly slower but doesn't give that warning.
I wanted to make sure version one doesn't give erratic results once I start writing complicated functions.
Please treat the question as a generic one with the view to run a user defined function which wants to access different column values in a given row and calculate the new column value for that row.
Thanks for your help.
If 'x', 'y', and 'z' are the columns of 'dt1', try either the vectorized ifelse
dt1[, a:=ifelse(x<1 & y >3 & z=='n', 6, 7)]
Or create 'a' with 7, then assign 6 to 'a' based on the logical index.
dt1[, a := 7][x<1 & y >3 & z=='n', a:=6][]
Using a function
getnewvariable <- function(v1, v2, v3){
ifelse(v1 <1 & v2 >3 & v3=='n', 6, 7)
}
dt1[, a:=getnewvariable(x,y,z)][]
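To see why the warning in version 1 matters, here is a base R sketch using plain vectors with the same values as the data below (no data.table needed). Without by, the columns arrive in j as whole vectors, so if() receives a condition of length greater than one. Older R versions warned and used only the first element, which means every row silently gets the same value; from R 4.2.0 this is an error. The helper scalar_version is hypothetical and stands for any user function written for one row at a time.

```r
x <- c(0L, 1L, 4L, 7L, -2L)
y <- c(4L, 9L, 1L, 1L, 5L)
z <- c("n", "j", "n", "n", "n")

# Vectorized: one TRUE/FALSE per row, so ifelse() gives one result per row
cond <- x < 1 & y > 3 & z == "n"
a <- ifelse(cond, 6, 7)

# A function written for a single row (scalar inputs) ...
scalar_version <- function(xi, yi, zi) {
  if (xi < 1 && yi > 3 && zi == "n") 6 else 7
}
# ... has to be applied element by element, which is what by = 1:nrow(dt1) does
a_rowwise <- unname(mapply(scalar_version, x, y, z))

a
a_rowwise
```

Both approaches agree here; the vectorized form is usually the faster one, and the row-by-row form is the fallback when a function cannot be vectorized.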
data
df1 <- structure(list(x = c(0L, 1L, 4L, 7L, -2L), y = c(4L, 9L, 1L,
1L, 5L), z = c("n", "j", "n", "n", "n")), .Names = c("x", "y",
"z"), class = "data.frame", row.names = c(NA, -5L))
dt1 <- as.data.table(df1)