Recreating a data frame in R [duplicate] - r

I have a data frame where I would like to add an additional row that totals up the values for each column. For example, Let's say I have this data:
x <- data.frame(Language=c("C++", "Java", "Python"),
                Files=c(4009, 210, 35),
                LOC=c(15328, 876, 200),
                stringsAsFactors=FALSE)
Data looks like this:
Language Files LOC
1 C++ 4009 15328
2 Java 210 876
3 Python 35 200
My instinct is to do this:
y <- rbind(x, c("Total", colSums(x[,2:3])))
And this works, it computes the totals:
> y
Language Files LOC
1 C++ 4009 15328
2 Java 210 876
3 Python 35 200
4 Total 4254 16404
The problem is that the Files and LOC columns have all been converted to strings:
> y$LOC
[1] "15328" "876" "200" "16404"
I understand that this is happening because I created a vector c("Total", colSums(x[,2:3])) whose inputs are a mix of numbers and strings, and R coerces all the elements of a vector to a common type so that every element has the same type. The same coercion then carries over to the Files and LOC columns.
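For instance, the coercion is easy to see on its own (illustrative check):
class(c("Total", colSums(x[, 2:3])))
#> [1] "character"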
What's a better way to do this?

See adorn_totals() from the janitor package:
library(janitor)
x %>%
adorn_totals("row")
#> Language Files LOC
#> C++ 4009 15328
#> Java 210 876
#> Python 35 200
#> Total 4254 16404
The numeric columns remain of class numeric.
Disclaimer: I created this package, including adorn_totals() which is made for precisely this task.
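A quick check of that claim (sketch; y is just a throwaway name here):
library(janitor)
y <- x %>% adorn_totals("row")
class(y$LOC)
#> [1] "numeric"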

A tidyverse way to do this would be to use bind_rows (or possibly add_row) together with summarise to compute the sums. The catch is that we want sums for every column except the first, so one trick is:
summarise_all(x, ~if(is.numeric(.)) sum(.) else "Total")
In one line:
x %>%
bind_rows(summarise_all(., ~if(is.numeric(.)) sum(.) else "Total"))
Edit with dplyr >=1.0
One can also use across(), which is slightly more verbose in this case:
x %>%
  bind_rows(summarise(.,
                      across(where(is.numeric), sum),
                      across(where(is.character), ~"Total")))

Here's a way that gets you what you want, but there may very well be a more elegant solution.
rbind(x, data.frame(Language = "Total", t(colSums(x[, -1]))))
For the record, I prefer Chase's answer if you don't absolutely need the Language column.

Do you need the Language column in your data, or is it more appropriate to think of that column as the row.names? That would change your data.frame from 4 observations of 3 variables to 4 observations of 2 variables (Files & LOC).
x <- data.frame(Files = c(4009, 210, 35), LOC = c(15328,876, 200),
row.names = c("C++", "Java", "Python"), stringsAsFactors = FALSE)
x["Total" ,] <- colSums(x)
> x
Files LOC
C++ 4009 15328
Java 210 876
Python 35 200
Total 4254 16404

Try this
y[4,] = c("Total", colSums(y[,2:3]))

If (1) we don't need the "Language" heading on the first column, then we can represent it using row names, and if (2) it is OK to label the last row as "Sum" rather than "Total", then we can use addmargins like this:
rownames(x) <- x$Language
addmargins(as.table(as.matrix(x[-1])), 1)
giving:
Files LOC
C++ 4009 15328
Java 210 876
Python 35 200
Sum 4254 16404
If we do want the first column labelled "Language" and the total row labelled "Total", then it's a bit longer:
rownames(x) <- x$Language
Total <- sum
xa <- addmargins(as.table(as.matrix(x[-1])), 1, FUN = Total)
data.frame(Language = rownames(xa), as.matrix(xa[]), row.names = NULL)
giving:
Language Files LOC
1 C++ 4009 15328
2 Java 210 876
3 Python 35 200
4 Total 4254 16404

Extending Nicolas Ratto's answer: if you had many more columns, you could use
x %>% add_row(Language = "Total", summarise(., across(where(is.numeric), sum)))

Try this
library(tibble)
x %>% add_row(Language = "Total", Files = sum(.$Files), LOC = sum(.$LOC))

df %>% bind_rows(purrr::map_dbl(.,sum))
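Note that this one-liner assumes every column is numeric; with the question's data, which has a character Language column, it could be applied to the numeric columns only (a sketch, dropping Language):
library(dplyr)
library(purrr)
x[, -1] %>% bind_rows(map_dbl(., sum))
#   Files   LOC
# 1  4009 15328
# 2   210   876
# 3    35   200
# 4  4254 16404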

Are you sure you really want to have the column totals in your data frame? To me, the data frame's interpretation now depends on the row. For example,
Rows 1-(n-1): how many files are associated with a particular language
Row n: how many files are associated with all languages
This gets more confusing if you start to subset your data. For example, suppose you want to know which languages have more than 100 Files:
> x = data.frame(Files=c(4009, 210, 35),
LOC=c(15328,876, 200),
row.names=c("C++", "Java", "Python"),
stringsAsFactors=FALSE)
> x["Total" ,] = colSums(x)
> x[x$Files > 100,]
Files LOC
C++ 4009 15328
Java 210 876
Total 4254 16404   # But this refers to all languages!
The Total row is now wrong!
Personally I would work out the column sums and store them in a separate vector.
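For example (sketch):
totals <- colSums(x[, c("Files", "LOC")])
totals
#> Files   LOC
#>  4254 16404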

Since you mention this is a last step before exporting for presentation, you may have column names that include spaces for clarity (e.g., "Grand Total"). If so, the following will ensure that the created data.frame will rbind to the original dataset without an error caused by mismatched column names:
dfTotals <- data.frame(Language = "Total", t(colSums(x[, -1])))
colnames(dfTotals) <- names(x)
rbind(x, dfTotals)

Your original instinct would work if you coerced your columns to numeric:
y$LOC <- as.numeric(y$LOC)
y$Files <- as.numeric(y$Files)
And then apply colSums() and rbind().
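A minimal sketch of that fix, continuing from the question's y:
y <- rbind(x, c("Total", colSums(x[, 2:3])))  # as in the question; columns become character
y$Files <- as.numeric(y$Files)
y$LOC   <- as.numeric(y$LOC)
class(y$LOC)
#> [1] "numeric"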

Related

Split columns in a dataframe into a column that contains text not numbers and a column that contains numbers not text in R

Here is a simplified version of data I am working with:
a<-c("There are 5 programs", "2 - adult programs, 3- youth programs","25", " ","there are a number of programs","other agencies run our programs")
b<-c("four", "we don't collect this", "5 from us, more from others","","","")
c<-c(2,6,5,8,2,"")
df<-cbind.data.frame(a,b,c)
df$c<-as.numeric(df$c)
I want to keep both the text and the numbers from the data because some of the text is important
expected output:
What I think makes sense is the following:
1. identify all columns that have text in them, perhaps in a list (because some columns are just numbers)
2. subset the columns from step 1 into a new data frame, let's call this df1
3. delete the subsetted columns of df1 from df
4. split all the columns in df1 into 2 columns, one that keeps the text and one that has the number
5. bind the new split columns from df1 back into the original df
What I am struggling with is steps 1-2 and 4. I am okay with the characters (e.g., - and ') being excluded or included. There is additional processing I have to do after (e.g., when there are multiple numbers in a column after splitting I will need to split and add these and also address the written numbers), but those are things I can do.
Here's a dplyr solution using regular expressions:
library(stringr)
library(dplyr)
df %>%
  mutate(
    a.text = gsub("(^|\\s)\\d+", "", a),
    a.num = str_extract_all(a, "\\d+"),
    b.text = gsub("(^|\\s)\\d+", "", b),
    b.num = str_extract_all(b, "\\d+")
  ) %>%
  select(c(4:7, 3))
a.text a.num b.text b.num c
1 There are programs 5 four 2
2 - adult programs,- youth programs 2, 3 we don't collect this 6
3 25 from us, more from others 5 5
4 8
5 there are a number of programs 2
6 other agencies run our programs NA
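Since a.num and b.num come back as list-columns (str_extract_all() can return several matches per row), the per-row totals mentioned at the end of the question could be computed like this (a sketch, adding a hypothetical a.total column):
df %>%
  mutate(a.total = sapply(str_extract_all(a, "\\d+"),
                          function(v) sum(as.numeric(v))))
# a.total is 5, 5, 25, 0, 0, 0 for the example data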
Here is what I would do with my preferred tools. The solution will work with arbitrary numbers of arbitrarily named character and non-character columns.
library(data.table) # development version 1.14.3 used here
library(magrittr) # piping used to improve readability
num <- \(x) stringr::str_extract_all(x, "\\d+", simplify = TRUE) %>%
  apply(1L, \(x) sum(as.integer(x), na.rm = TRUE))
txt <- \(x) stringr::str_remove_all(x, "\\d+") %>%
  stringr::str_squish()
setDT(df)[, lapply(
  .SD, \(x) if (is.character(x)) data.table(txt = txt(x), num = num(x)) else x)]
which returns
a.txt a.num b.txt b.num c
<char> <int> <char> <int> <num>
1: There are programs 5 four 0 2
2: - adult programs, - youth programs 5 we don't collect this 0 6
3: 25 from us, more from others 5 5
4: 0 0 8
5: there are a number of programs 0 0 2
6: other agencies run our programs 0 0 NA
Explanation
num() is a function which uses the regular expression \\d+ to extract all strings which consist of contiguous digits (aka integer numbers), coerces them to type integer, and computes the rowwise sum of the extracted numbers (as requested in OP's last sentence).
txt() is a function which removes all strings which consist of contiguous digits (aka integer numbers), removes whitespace from start and end of the strings and reduces repeated whitespace inside the strings.
\(x) is a new shortcut for function(x) introduced with R version 4.1
The next steps implement OP's proposed approach in data.table syntax, by and large:
lapply(.SD, ...) loops over each column of df.
if the column is character both functions txt() and num() are applied. The two resulting vectors are turned into a data.table as a partial result. Note that cbind() cannot be used here as it would return a character matrix (illustrated below).
if the column is non-character it is returned as is.
The final result is a data.table where the column names have been renamed automagically.
This approach keeps the relative position of columns.
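The cbind() note above can be checked quickly (sketch):
# Mixing character and integer in cbind() coerces everything to a common mode,
# so the result is a character matrix rather than a two-column table:
m <- cbind(txt = "adult programs", num = 5L)
typeof(m)     # "character"
is.matrix(m)  # TRUE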

How to combine similar elements in a data frame in R

I have a data frame consisting of
Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9
.... ...
I'd like to consolidate the dataframe into this
Lancaster001 111
Lancaster002 55
And so remove the finer-grained categorisation. I couldn't find a way to do this with merge; is there a general function that can group rows by similarity?
Here is a base R solution using a regex to remove all characters after three numeric characters:
DF <- read.table(text = "Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9")
setNames(aggregate(V2 ~ gsub("(?<=\\d{3}).*", "", V1, perl = TRUE),
                   DF, FUN = sum),
         c("V1", "V2"))
# V1 V2
#1 Lancaster001 111
#2 Lancaster002 55
It would be trivial to use data.table if the aggregation is too slow on a large dataset.
Adjust the regex as needed if the structure of your data is different.
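A data.table version of the same idea might look like this (a sketch; the grouping expression is the same regex):
library(data.table)
DT <- as.data.table(DF)
DT[, .(V2 = sum(V2)), by = .(V1 = gsub("(?<=\\d{3}).*", "", V1, perl = TRUE))]
#>              V1  V2
#> 1: Lancaster001 111
#> 2: Lancaster002  55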
Let's assume these names for your columns, and let's assume the 'smaller categorising' means one letter at the end.
id value
Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9
.... ...
I use dplyr for everything. Install dplyr, make sure your column names are correct, and then try:
library(dplyr)
mydata %>%
  mutate(id = substr(id, 1, nchar(id) - 1)) %>%  # removes last character
  group_by(id) %>%
  summarize(sum = sum(value))
Edit: An even simpler data.table solution from #Arun's helpful tip:
library(data.table)
dt[, list(sum=sum(value)), by = substr(as.character(id),1,nchar(as.character(id)) - 1)]
id sum
1: Lancaster001 111
2: Lancaster002 55

dplyr gives me different answers depending on how I select columns

I may be having trouble understanding some of the basics of dplyr, but it appears that R behaves very differently depending on whether you subset columns as one-column data frames or as traditional vectors. Here is an example:
mtcarsdf <- tbl_df(mtcars)
example <- function(x, y) {
  df <- tbl_df(data.frame(x, y))
  df %>% group_by(x) %>% summarise(total = sum(y))
}
#subsetting to cyl this way gives integer vector
example(mtcars$gear,mtcarsdf$cyl)
# 3 112
# 4 56
# 5 30
#subsetting this way gives a one column data table
example(mtcars$gear,mtcarsdf[,"cyl"])
# 3 198
# 4 198
# 5 198
all(mtcarsdf$cyl==mtcarsdf[,"cyl"])
# TRUE
Since my inputs are technically equal, the fact that I am getting different outputs tells me I am misunderstanding how the two objects behave. Could someone please enlighten me on how to improve the example function so that it can handle different objects more robustly?
Thanks
First, the items that you are comparing with == are not really the same. This could be identified using all.equal instead of ==:
all.equal(mtcarsdf$cyl, mtcarsdf[, "cyl"])
## [1] "Modes: numeric, list"
## [2] "Lengths: 32, 1"
## [3] "names for current but not for target"
## [4] "Attributes: < target is NULL, current is list >"
## [5] "target is numeric, current is tbl_df"
With that in mind, you should be able to get the behavior you want by using [[ to extract the column instead of [.
mtcarsdf <- tbl_df(mtcars)
example <- function(x, y) {
  df <- tbl_df(data.frame(x, y))
  df %>% group_by(x) %>% summarise(total = sum(y))
}
example(mtcars$gear, mtcarsdf[["cyl"]])
However, a safer approach might be to integrate the renaming of the columns as part of your function, like this:
example2 <- function(x, y) {
  df <- tbl_df(setNames(data.frame(x, y), c("x", "y")))
  df %>% group_by(x) %>% summarise(total = sum(y))
}
Then, any of the following should give you the same results.
example2(mtcars$gear, mtcarsdf$cyl)
example2(mtcars$gear, mtcarsdf[["cyl"]])
example2(mtcars$gear, mtcarsdf[, "cyl"])

Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
Data <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
Data <- subset( Data, select = -a )
and to remove the b and d columns you could do
Data <- subset( Data, select = -c(d, b ) )
You can remove all columns between d and b with:
Data <- subset( Data, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when, say, the column names are determined programmatically (i.e., assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
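To illustrate the programmatic case mentioned above (a sketch with a hypothetical col_to_drop variable):
col_to_drop <- "d"
# this does not drop column d as one might hope; because select is evaluated
# non-standardly, the call errors or misbehaves instead:
# subset(Data, select = -col_to_drop)
# a programmatic alternative using standard subsetting:
Data[, setdiff(names(Data), col_to_drop), drop = FALSE]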
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = F]
Including drop = F ensures that the result will still be a data.frame even if only one column remains.
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
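Several columns can also be removed by reference in one step, e.g. (a sketch):
library(data.table)
dt <- data.table(a = 1, b = 1, c = 1)
dt[, c("a", "b") := NULL]  # deletes both columns in place, without copying dt
dt
#>    c
#> 1: 1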
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10
With this you can remove the column and store the result in another variable:
df = subset(data, select = -c(genome) )
Using dplyr, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)
I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
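For example (sketch; the pattern here is purely illustrative):
df <- df[ , !grepl('genome|region', names(df)), drop = FALSE]
# grep() works too, but if nothing matches it returns integer(0),
# and df[ , -integer(0)] drops every column, so the grepl() form is safer.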

Resources