How to combine similar elements in a data frame in R

I have a data frame consisting of
Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9
.... ...
I'd like to consolidate the dataframe into this
Lancaster001 111
Lancaster002 55
And so remove the smaller categorising. I couldn't find a way to do this with merge. Is there a general function that combines rows based on similarity?

Here is a base R solution using a regex to remove all characters after three numeric characters:
DF <- read.table(text = "Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9")
setNames(aggregate(V2 ~ gsub("(?<=\\d{3}).*", "", V1, perl = TRUE),
DF, FUN = sum),
c("V1", "V2"))
# V1 V2
#1 Lancaster001 111
#2 Lancaster002 55
It would be trivial to use data.table if the aggregation is too slow on a large dataset.
Adjust the regex as needed if the structure of your data is different.
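To check what the regex keeps before aggregating, it can help to run it on a plain vector first. A minimal base R sketch, assuming the data is held in two vectors named `ids` and `values` (both names are made up for illustration):

```r
# Sample data from the question, held in two plain vectors (names assumed).
ids <- c("Lancaster001A", "Lancaster001B", "Lancaster002A", "Lancaster002D")
values <- c(76, 35, 46, 9)

# Drop everything that follows the first run of three digits.
keys <- gsub("(?<=\\d{3}).*", "", ids, perl = TRUE)

# Sum the values within each shortened key.
totals <- tapply(values, keys, sum)
print(totals)
```

If the shortened keys look wrong for your data, this is the place to adjust the pattern before running the full aggregate() call.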

Let's assume these names for your columns, and let's assume the 'smaller categorising' means one letter at the end.
id value
Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9
.... ...
I use dplyr for everything. Install dplyr, make sure your column names are correct, and then try:
library(dplyr)
mydata %>%
  mutate(id = substr(id, 1, nchar(id) - 1)) %>% # removes last character
  group_by(id) %>%
  summarize(sum = sum(value))

Edit: An even simpler data.table solution from @Arun's helpful tip:
library(data.table)
# dt is assumed to be the data as a data.table, e.g. dt <- as.data.table(mydata)
dt[, list(sum = sum(value)), by = list(id = substr(as.character(id), 1, nchar(as.character(id)) - 1))]
id sum
1: Lancaster001 111
2: Lancaster002 55

Related

Merge named vectors in different sizes into data frame

I have some different named vectors, and I want to combine them into one data frame that sums the actions.
adjust balance drive idle other pick putdown replace sort wait
4 9 16 82 4 350 61 16 26 18
walk
14
adjust balance drive idle pick putdown replace sort unload walk
1 42 14 47 385 118 4 83 19 7
I want it to be this way:
adjust balance drive
5 51 30
and so on.
I find it very challenging because these are named vectors.
I would be grateful for your help, thank you!

We can use aggregate + stack like below
aggregate(. ~ ind, rbind(stack(vec1), stack(vec2)), sum)
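For a runnable version of this, the two vectors from the question can be typed in by hand. This is only a sketch: the names `vec1`, `vec2`, `long` and `res` are assumed, since the question does not name its vectors.

```r
# The two named vectors from the question (variable names assumed).
vec1 <- c(adjust = 4, balance = 9, drive = 16, idle = 82, other = 4, pick = 350,
          putdown = 61, replace = 16, sort = 26, wait = 18, walk = 14)
vec2 <- c(adjust = 1, balance = 42, drive = 14, idle = 47, pick = 385,
          putdown = 118, replace = 4, sort = 83, unload = 19, walk = 7)

# stack() turns each named vector into a data frame with the columns
# `values` and `ind` (the names); rbind() puts both into one long table.
long <- rbind(stack(vec1), stack(vec2))

# Sum the values for every name; names found in only one vector are kept.
res <- aggregate(values ~ ind, long, sum)
print(res)
```

The result is a data frame with one row per action, including actions such as `other` and `unload` that appear in only one of the vectors.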

You could convert to a data.frame and use the dplyr package to group by the names and sum the numbers together.
library(dplyr)
vec <- c(4, 9, 16, 1, 42, 14)
names(vec) <- c("adjust", "balance", "drive", "adjust", "balance", "drive")
data.frame(values = vec, name = names(vec)) %>% group_by(name) %>% summarise(values = sum(values))

If we want to add all elements that match between the two vectors:
# Resolve the matching names of the vectors:
# vec_nm_order => character vector
vec_nm_order <- intersect(
names(vec1),
names(vec2)
)
# Add the related scalars together:
# named integer vector => stdout(console)
vec1[vec_nm_order] + vec2[vec_nm_order]
If we only want to add values for adjust, balance, drive:
# Choose the names (keys) of elements we want to add together:
# vec_nm_order => character vector
vec_nm_order <- c(
"adjust", "balance", "drive"
)
# Add the related scalars together:
# named integer vector => stdout(console)
vec1[vec_nm_order] + vec2[vec_nm_order]
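Note that the intersect() approach silently drops names that occur in only one vector. If those should be kept, one possible alternative (a sketch, not part of the answer above) is to concatenate the vectors and sum by name with tapply(). The vectors below are trimmed versions of the question's data, and all variable names are assumed:

```r
# Trimmed versions of the question's vectors (variable names assumed).
vec1 <- c(adjust = 4, balance = 9, drive = 16, other = 4)
vec2 <- c(adjust = 1, balance = 42, drive = 14, unload = 19)

# Intersect approach: only names present in both vectors survive.
common <- intersect(names(vec1), names(vec2))
sum_common <- vec1[common] + vec2[common]

# tapply approach: concatenate, then sum by name, keeping every name.
both <- c(vec1, vec2)
sum_all <- tapply(both, names(both), sum)

print(sum_common)
print(sum_all)
```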

Recreating a data frame in R [duplicate]

I have a data frame where I would like to add an additional row that totals up the values for each column. For example, Let's say I have this data:
x <- data.frame(Language=c("C++", "Java", "Python"),
Files=c(4009, 210, 35),
LOC=c(15328,876, 200),
stringsAsFactors=FALSE)
Data looks like this:
Language Files LOC
1 C++ 4009 15328
2 Java 210 876
3 Python 35 200
My instinct is to do this:
y <- rbind(x, c("Total", colSums(x[,2:3])))
And this works, it computes the totals:
> y
Language Files LOC
1 C++ 4009 15328
2 Java 210 876
3 Python 35 200
4 Total 4254 16404
The problem is that the Files and LOC columns have all been converted to strings:
> y$LOC
[1] "15328" "876" "200" "16404"
I understand that this is happening because I created a vector c("Total", colSums(x[,2:3])) with inputs that are both numbers and strings, and it's converting all the elements to a common type so that all of the vector elements are the same. Then the same thing happens to the Files and LOC columns.
What's a better way to do this?
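The coercion described in the question can be reproduced directly. A minimal sketch (the name `new_row` is assumed for illustration):

```r
x <- data.frame(Language = c("C++", "Java", "Python"),
                Files = c(4009, 210, 35),
                LOC = c(15328, 876, 200),
                stringsAsFactors = FALSE)

# c() can only hold one type, so mixing a string with numbers
# yields a character vector.
new_row <- c("Total", colSums(x[, 2:3]))
print(class(new_row))

# Binding that character vector converts the numeric columns as well.
y <- rbind(x, new_row)
print(sapply(y, class))
```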

See adorn_totals() from the janitor package:
library(janitor)
x %>%
adorn_totals("row")
#> Language Files LOC
#> C++ 4009 15328
#> Java 210 876
#> Python 35 200
#> Total 4254 16404
The numeric columns remain of class numeric.
Disclaimer: I created this package, including adorn_totals() which is made for precisely this task.

A tidyverse way to do this would be to use bind_rows (or eventually add_row) and summarise to compute the sums. Here the issue is that we want sums for all but one, so a trick would be:
summarise_all(x, ~if(is.numeric(.)) sum(.) else "Total")
In one line:
x %>%
bind_rows(summarise_all(., ~if(is.numeric(.)) sum(.) else "Total"))
Edit with dplyr >=1.0
One can also use across(), which is slightly more verbose in this case:
x %>%
bind_rows(summarise(.,
across(where(is.numeric), sum),
across(where(is.character), ~"Total")))

Here's a way that gets you what you want, but there may very well be a more elegant solution.
rbind(x, data.frame(Language = "Total", t(colSums(x[, -1]))))
For the record, I prefer Chase's answer if you don't absolutely need the Language column.

Do you need the Language column in your data, or is it more appropriate to think of that column as the row.names? That would change your data.frame from 4 observations of 3 variables to 4 observations of 2 variables (Files & LOC).
x <- data.frame(Files = c(4009, 210, 35), LOC = c(15328,876, 200),
row.names = c("C++", "Java", "Python"), stringsAsFactors = FALSE)
x["Total" ,] <- colSums(x)
> x
Files LOC
C++ 4009 15328
Java 210 876
Python 35 200
Total 4254 16404

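Try this, which assigns the new row as a list so that the numeric columns keep their type:
x[nrow(x) + 1, ] <- list("Total", sum(x$Files), sum(x$LOC))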

If (1) we don't need the "Language" heading on the first column then we can represent it using row names and if (2) it is ok to label the last row as "Sum" rather than "Total" then we can use addmargins like this:
rownames(x) <- x$Language
addmargins(as.table(as.matrix(x[-1])), 1)
giving:
Files LOC
C++ 4009 15328
Java 210 876
Python 35 200
Sum 4254 16404
If we do want the first column labelled "Language" and the total row labelled "Total" then it's a bit longer:
rownames(x) <- x$Language
Total <- sum
xa <- addmargins(as.table(as.matrix(x[-1])), 1, FUN = Total)
data.frame(Language = rownames(xa), as.matrix(xa[]), row.names = NULL)
giving:
Language Files LOC
1 C++ 4009 15328
2 Java 210 876
3 Python 35 200
4 Total 4254 16404

Extending the answer of Nicolas Ratto, if you were to have a lot more columns you could use
x %>% add_row(Language = "Total", summarise(., across(where(is.numeric), sum)))

Try this
library(tibble)
x %>% add_row( Language="Total",Files = sum(.$Files),LOC = sum(.$LOC) )

df %>% bind_rows(purrr::map_dbl(.,sum))

Are you sure you really want to have the column totals in your data frame? To me, the data frame's interpretation now depends on the row. For example,
Rows 1-(n-1): how many files are associated with a particular language
Row n: how many files are associated with all languages
This gets more confusing if you start to subset your data. For example, suppose you want to know which languages have more than 100 Files:
> x = data.frame(Files=c(4009, 210, 35),
LOC=c(15328,876, 200),
row.names=c("C++", "Java", "Python"),
stringsAsFactors=FALSE)
> x["Total" ,] = colSums(x)
> x[x$Files > 100,]
Files LOC
C++ 4009 15328
Java 210 876
Total 4254 16404 # But this refers to all languages!
The Total row is now wrong!
Personally I would work out the column sums and store them in a separate vector.
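A minimal sketch of that suggestion, assuming the totals are kept in a vector named `totals` (the names `totals` and `big` are made up here):

```r
x <- data.frame(Files = c(4009, 210, 35),
                LOC = c(15328, 876, 200),
                row.names = c("C++", "Java", "Python"),
                stringsAsFactors = FALSE)

# Keep the totals outside the data frame.
totals <- colSums(x)

# Subsetting now only ever returns real languages.
big <- x[x$Files > 100, ]

print(big)
print(totals)
```

The totals can still be bound on as a final row at export time, once all filtering is done.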

Since you mention this is a last step before exporting for presentation, you may have column names that include spaces for clarity (e.g. "Grand Total"). If so, the following will ensure that the created data.frame will rbind to the original dataset without an error caused by mismatched column names:
dfTotals <- data.frame(Language = "Total", t(colSums(x[, -1])))
colnames(dfTotals) <- names(x)
rbind(x, dfTotals)

Your original instinct would work if you coerced your columns to numeric:
y$LOC <- as.numeric(y$LOC)
y$Files <- as.numeric(y$Files)
And then apply colSums() and rbind().
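One reading of this suggestion is to keep the original rbind() call and convert the affected columns back to numeric afterwards. A sketch under that assumption:

```r
x <- data.frame(Language = c("C++", "Java", "Python"),
                Files = c(4009, 210, 35),
                LOC = c(15328, 876, 200),
                stringsAsFactors = FALSE)

# The original attempt: every column ends up as character.
y <- rbind(x, c("Total", colSums(x[, 2:3])))

# Convert the two count columns back to numeric in one step.
y[c("Files", "LOC")] <- lapply(y[c("Files", "LOC")], as.numeric)

print(sapply(y, class))
```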

Filter rows where any values in a vector are contained in a column

I have a dataset with a single column that contains multiple ICD-10 codes separated by spaces, e.g.
Identifier Codes
1 A14 R17
2 R069 D136 B08
3 C11 K71 V91
I have a vector with the ICD-10 codes that are relevant to my analysis, e.g. goodcodes<-c("C11","A14","R17","O80"). I want to select rows from my dataset where the Codes column contains any of the codes in my vector; the column as a whole does not need to exactly match a code in my vector.
Using medicalinfo<-filter(medicalinfo, Codes %in% goodcodes) returns only rows where a single matching code is listed in the Codes column. I could also filter on a partial string, but I only know how to do that for a single partial string, not for all of those in my codes vector.
Is there a way to get all the rows where any of these codes are present in the column?

One trick is to combine the goodcodes into a regular expression:
library(dplyr)
ptn <- paste0("\\b(", paste(goodcodes, collapse = "|"), ")\\b")
ptn
# [1] "\\b(C11|A14|R17|O80)\\b"
FYI, the \\b( and )\\b are absolutely necessary if there's a chance that you will have codes A10 and A101; without \\b(...)\\b, then grepl("A10", "A101") will be a false-positive. See
grepl("A10|B20", "A101")
# [1] TRUE
grepl("\\b(A10|B20)\\b", "A101")
# [1] FALSE
Finally, let's use that ptn:
dat %>%
filter(grepl(ptn, Codes))
# Identifier Codes
# 1 1 A14 R17
# 2 3 C11 K71 V91
Another way is to split the Codes column into a list of individual codes, and look for membership with %in%:
sapply(strsplit(trimws(dat$Codes), "\\s+"), function(a) any(a %in% goodcodes))
# [1] TRUE FALSE TRUE
Depending on how complex things are, a third way is to "unnest" Codes and look for matches.
dat %>%
mutate(Codes = strsplit(trimws(Codes), "\\s+")) %>%
tidyr::unnest(Codes) %>%
group_by(Identifier) %>%
filter(any(Codes %in% goodcodes)) %>%
ungroup()
# # A tibble: 5 x 2
# Identifier Codes
# <dbl> <chr>
# 1 1 A14
# 2 1 R17
# 3 3 C11
# 4 3 K71
# 5 3 V91
(If you really prefer them combined into a single space-delimited string as before, that's easy enough to do with group_by(Identifier) %>% summarize(Codes = paste(Codes, collapse = " ")). I don't recommend it, per se, since I prefer to have that type of information broken out like this, but there is likely context I don't know.)

With subset from base R: loop over the 'goodcodes' vector, use each element as the pattern in grepl, and Reduce the list of logical vectors into a single logical vector to subset the rows.
subset(dat, Reduce(`|`, lapply(goodcodes, function(x) grepl(x, Codes))))
# Identifier Codes
#1 1 A14 R17
#3 3 C11 K71 V91
data
dat <- structure(list(Identifier = 1:3, Codes = c("A14 R17", "R069 D136 B08",
"C11 K71 V91")), class = "data.frame", row.names = c(NA, -3L))
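The plain grepl() loop matches substrings, so a longer code such as A141 would also count as a hit for A14. A sketch of the difference, using a hypothetical fourth row that is not in the question's data (the names `dat2`, `loose` and `strict` are likewise assumed):

```r
dat <- data.frame(Identifier = 1:3,
                  Codes = c("A14 R17", "R069 D136 B08", "C11 K71 V91"),
                  stringsAsFactors = FALSE)
goodcodes <- c("C11", "A14", "R17", "O80")

# Hypothetical extra row: A141 is not in goodcodes but contains "A14".
dat2 <- rbind(dat, data.frame(Identifier = 4L, Codes = "A141 B08",
                              stringsAsFactors = FALSE))

# Substring matching: row 4 is picked up as a false positive.
loose <- Reduce(`|`, lapply(goodcodes, function(x) grepl(x, dat2$Codes)))

# Whole-code matching: wrap each code in word boundaries.
strict <- Reduce(`|`, lapply(goodcodes, function(x) {
  grepl(paste0("\\b", x, "\\b"), dat2$Codes)
}))

print(subset(dat2, strict))
```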

Read CSV in R with first column as dataframe header

I have a simple text file where the first column is names (strings) and the second column is values (floats). As an example, names and ages:
Name, Age
John, 32
Heather, 46,
Jake, 23
Sally, 19
I'd like to read this in as a dataframe (call this df) but transposed so that I can access ages by names such that df$John would return 32. How can I do this?
Previously I tried creating a new dataframe, tdf, looping through the data in a for loop, assigning each name and age, and then inserting into the empty dataframe as tdf[name] = age, but this did not work as I expected.

You can read your data using read.table().
Then you can transpose it using t() and set colnames after.
Example:
If df is:
df=read.table("dummydata", header=T, sep=",")
df
Name Age
1 John 32
2 Heather 46
3 Jake 23
4 Sally 19
You transpose the age and then transform them into a dataframe:
tdf=as.data.frame(t(df$Age))
colnames(tdf)=t(df$Name)
So tdf will return:
tdf
John Heather Jake Sally
1 32 46 23 19
And, as you asked, tdf$John will return:
tdf$John
[1] 32
Now, if you have more than two columns you can do the same but instead of indicating the name of the column you can simply indicate the position using brackets.
df=read.table("dummydata", header=T, sep=",")
With t(df[2:ncol(df)]) you transpose the whole table starting from the second column, no matter the number of columns. The first column will be the names after the transpose.
tdf=as.data.frame(t(df[2:ncol(df)]))
Then you set the column names.
colnames(tdf)=t(df[1])
tdf$John
[1] 32

This should use the first row as the header when you read from the file (read.csv rather than read.csv2, since the file is comma-separated):
read.csv(filename, as.is = TRUE, header = TRUE)

Read the data into a data frame, DF (see Note).
1) Assign the names to the rows of DF in which case this will give John's age without having to create a new data structure:
rownames(DF) <- DF$Name
DF["John", "Age"]
## [1] 32
2) Alternatively, split DF into a named list in which case you can get the precise syntax requested:
ages <- with(DF, split(Age, Name))
ages$John
## [1] 32
3) This alternative would also create the same list:
ages <- with(DF, setNames(as.list(Age), Name))
Note: DF in reproducible form is as follows. (We have removed the trailing comma on one line in the question but if it is really there add fill = TRUE to the read.csv line.)
Lines <- "Name, Age
John, 32
Heather, 46
Jake, 23
Sally, 19"
DF <- read.csv(text = Lines)

A bit late but hopefully helpful. The "row.names" parameter lets you use the desired column (here the first) as the row names:
read.csv("df.csv", header = TRUE, row.names = 1)
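To get from there to the df$John syntax asked for in the question, the result can be transposed. A minimal sketch that reads the sample data from a string instead of a file (the trailing comma from the question is left out, and the names `Lines` and `tdf` are assumed):

```r
# Sample data from the question, without the stray trailing comma.
Lines <- "Name,Age
John,32
Heather,46
Jake,23
Sally,19"

# The first column becomes the row names.
df <- read.csv(text = Lines, header = TRUE, row.names = 1)
print(df["John", "Age"])

# Transposing turns the row names into column names.
tdf <- as.data.frame(t(df))
print(tdf$John)
```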
