Merge named vectors in different sizes into data frame - r

I have several named vectors of different lengths, and I want to combine them into one data frame that sums the actions.
 adjust balance   drive    idle   other    pick putdown replace    sort    wait
      4       9      16      82       4     350      61      16      26      18
   walk
     14
 adjust balance   drive    idle    pick putdown replace    sort  unload    walk
      1      42      14      47     385     118       4      83      19       7
I want it to be this way:
 adjust balance   drive
      5      51      30
and so on.
I find it very challenging because those are named vectors.
I would be grateful for your help, thank you!

We can use aggregate + stack, as below:
aggregate(. ~ ind, rbind(stack(vec1), stack(vec2)), sum)
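For a self-contained run of the aggregate + stack idea, here is a minimal sketch. It assumes the two vectors from the question are called vec1 and vec2 (the question does not name them), and the names combined and totals are only illustrative.

```r
# The two named vectors from the question (the names vec1/vec2 are assumed)
vec1 <- c(adjust = 4, balance = 9, drive = 16, idle = 82, other = 4,
          pick = 350, putdown = 61, replace = 16, sort = 26, wait = 18,
          walk = 14)
vec2 <- c(adjust = 1, balance = 42, drive = 14, idle = 47, pick = 385,
          putdown = 118, replace = 4, sort = 83, unload = 19, walk = 7)

# stack() turns each named vector into a two-column data frame:
# 'values' holds the numbers, 'ind' holds the names
combined <- rbind(stack(vec1), stack(vec2))

# Sum the values within each name
totals <- aggregate(values ~ ind, data = combined, FUN = sum)
totals
```

Names that occur in only one vector (other, wait, unload) keep their single value, so nothing is dropped.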

You could convert to a data.frame and use the dplyr package to group by the names and sum the numbers together.
library(dplyr)
vec <- c(4, 9, 16, 1, 42, 14)
names(vec) <- c("adjust", "balance", "drive", "adjust", "balance", "drive")
data.frame(values = vec, name = names(vec)) %>% group_by(name) %>% summarise(values = sum(values))

If we want to add all elements that match between the two vectors:
# Resolve the matching names of the vectors:
# vec_nm_order => character vector
vec_nm_order <- intersect(
  names(vec1),
  names(vec2)
)
# Add the related scalars together:
# named integer vector => stdout(console)
vec1[vec_nm_order] + vec2[vec_nm_order]
If we only want to add values for adjust, balance, drive:
# Choose the names (keys) of elements we want to add together:
# vec_nm_order => character vector
vec_nm_order <- c(
  "adjust", "balance", "drive"
)
# Add the related scalars together:
# named integer vector => stdout(console)
vec1[vec_nm_order] + vec2[vec_nm_order]

Related

Function to return the first five columns in R

I wrote a function in R which is supposed to return the first five developers who made the most input:
developer.busy <- function(x){
  bus.dev <- sort(table(test2$devf), decreasing = TRUE)
  return(bus.dev)
}
bus.dev(test2)
ericb shields mdejong cabbey lord elliott-oss jikesadmin coar
3224 1432 998 470 241 179 77 1
At the moment it just prints out all developers sorted in decreasing order. I just want the first 5 to be shown. How can I make this possible? Any suggestion is welcome.
If we want the first five, we can either index with [ or use head. I modified the function to take three inputs: the data object name, the column name ('colnm') and the number of elements to extract ('n').
developer.busy <- function(data, colnm, n){
  # Option 1: index with [
  # sort(table(data[[colnm]]), decreasing = TRUE)[seq_len(n)]
  # Option 2: use head
  head(sort(table(data[[colnm]]), decreasing = TRUE), n)
}
developer.busy(test2, "devf", n = 5)
Using a reproducible example with the mtcars dataset:
data(mtcars)
developer.busy(mtcars, 'carb', 5)
#  2  4  1  3  6
# 10 10  7  3  1

R : Extract a Specific Number out of a String

I have a vector as below
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")
Here I want to extract the first number before the "X" for each of the elements.
When there are two "X"s, e.g. "6X2X75ML", the number 12 (6 multiplied by 2) should be calculated.
expected output
6, 24, 12, 168
Thank you for the help...
Here's a possible solution using regular expressions:
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")
# this regular expression finds every group of digits followed
# by an upper-case 'X' in each string and returns a list of the matches
tokens <- regmatches(data,gregexpr('[[:digit:]]+(?=X)',data,perl=TRUE))
res <- sapply(tokens,function(x)prod(as.numeric(x)))
> res
[1] 6 24 12 168
Here is a method using base R:
dataList <- strsplit(data, split="X")
sapply(dataList, function(x) Reduce("*", as.numeric(head(x, -1))))
[1] 6 24 12 168
strsplit breaks up each element of the vector along "X". The resulting list is fed to sapply, which then performs an operation on all but the final element of each vector in the list. The operation is to convert the elements to numeric and then multiply them. The final element is dropped using head(x, -1).
As @zheyuan-li comments, prod can fill in for Reduce and will probably be a bit faster:
sapply(dataList, function(x) prod(as.numeric(head(x, -1))))
[1] 6 24 12 168
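To make the intermediate steps of the strsplit approach visible, here is a small sketch that runs them one at a time on the question's data. The names pieces, multipliers and res are illustrative, and fixed = TRUE and vapply are my additions (they treat "X" as a literal string and pin the result type).

```r
data <- c("6X75ML", "24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")

# Step 1: split every string on the literal "X"
pieces <- strsplit(data, split = "X", fixed = TRUE)
pieces[[3]]   # "6" "2" "75ML"

# Step 2: drop the last piece (the volume) and convert the rest to numbers
multipliers <- lapply(pieces, function(x) as.numeric(head(x, -1)))

# Step 3: multiply what is left for each element
res <- vapply(multipliers, prod, numeric(1))
res           # 6 24 12 168
```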
We can also use str_extract_all
library(stringr)
sapply(str_extract_all(data, "\\d+(?=X)"), function(x) prod(as.numeric(x)))
#[1] 6 24 12 168
# number before the first "X"
ind=regexpr("X",data)
val=as.integer(substr(data, 1, ind-1))
# remainder after the first "X"; check whether it starts with another "<digits>X"
data2=substring(data,ind+1)
ind2=regexpr("[0-9]+X", data2)
if (any(ind2==1)) {
  val2 = as.integer(substr(data2[ind2==1], 1, attr(ind2,"match.length")[ind2==1]-1))
  val[ind2==1] = val[ind2==1] * val2
}
val

How to combine similar elements in a data frame in R

I have a data frame consisting of
Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9
.... ...
I'd like to consolidate the dataframe into this
Lancaster001 111
Lancaster002 55
And so remove the smaller categorising. I couldn't find a way to do with merge, is there a general function that can be used using similarity?
Here is a base R solution using a regex to remove all characters after three numeric characters:
DF <- read.table(text = "Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9")
setNames(aggregate(V2 ~ gsub("(?<=\\d{3}).*", "", V1, perl = TRUE),
DF, FUN = sum),
c("V1", "V2"))
# V1 V2
#1 Lancaster001 111
#2 Lancaster002 55
It would be trivial to use data.table if the aggregation is too slow on a large dataset.
Adjust the regex as needed if the structure of your data is different.
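As one example of adjusting the regex, the sketch below strips any run of trailing letters instead of relying on a three-digit block. It assumes the suffix is always made of letters at the very end of the name; the column name group and the object totals are illustrative.

```r
DF <- data.frame(
  V1 = c("Lancaster001A", "Lancaster001B", "Lancaster002A", "Lancaster002D"),
  V2 = c(76, 35, 46, 9),
  stringsAsFactors = FALSE
)

# Remove the trailing letter(s) to get the grouping key
DF$group <- sub("[A-Za-z]+$", "", DF$V1)

# Sum V2 within each key
totals <- aggregate(V2 ~ group, data = DF, FUN = sum)
totals
```

With the four rows above, this gives 111 for Lancaster001 and 55 for Lancaster002, matching the desired output in the question.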
Let's assume these names for your columns, and let's assume the 'smaller categorising' means one letter at the end.
id value
Lancaster001A 76
Lancaster001B 35
Lancaster002A 46
Lancaster002D 9
.... ...
I use dplyr for everything. Install dplyr, make sure your column names are correct, and then try:
library(dplyr)
mydata %>%
  mutate(id = substr(id, 1, nchar(id) - 1)) %>% # removes last character
  group_by(id) %>%
  summarize(sum = sum(value))
Edit: An even simpler data.table solution from @Arun's helpful tip:
library(data.table)
dt[, list(sum=sum(value)), by = .(id = substr(as.character(id),1,nchar(as.character(id)) - 1))]
id sum
1: Lancaster001 111
2: Lancaster002 55

R: subsetting dataframe using elements from a vector

I have a data frame which includes a vector of individual identifiers (which are 6 letters) and vectors of numbers
I would like to subset it using a vector of elements (again 6-letters identifiers) taken from another dataframe
Here is what I did (in a simplified version, my dataframe has over 200 columns and 64 rows)
n = c(2, 3, 5, 7, 8, 1)
i = c("abazzz", "bbaxxx", "ccbeee","dddfre", "sdtyuo", "loatvz" )
c = c(10, 2, 10, 2, 12, 34)
df1 = data.frame(n, i, c)
attach(example)
This is the vector whose elements I want to use for subsetting:
v<- c("abazzz", "ccbeee", "lllaaa")
This is what I do to subset
df2<-example[, i==abazzz | ccbeee | lllaaa]
This does not work, the error I get is "abazzz" not found ( I tried with and without "", I tried using the command subset, same error appears)
Moreover I would like to avoid the or operator as the vector I need to use for subsetting has about 50 elements. So, in words, what I would like to do is to subset df2 in order to extract only those individuals who already appear in df1 using their identifiers (column in df1)
Writing this makes me think this must be very easy to do, but I can't figure it out by myself, I tried looking up similar questions but could not find what I was looking for. I hope someone can help me, suggest other posts or manuals so I can learn. Thanks!
Here's another nice option using data.table's binary search (for efficiency):
library(data.table)
setkey(setDT(df1), i)[J(v), nomatch = 0]
# n i c
# 1: 2 abazzz 10
# 2: 5 ccbeee 10
Or, if you don't want to reorder the data set and want to keep the syntax similar to base R, you could set a secondary key instead (contributed by @Arun). Older data.table releases called this set2key; current releases use setindex:
setindex(setDT(df1), i)
df1[i %in% v]
Or dplyr (for simplicity)
library(dplyr)
df1 %>% filter(i %in% v)
# n i c
# 1: 2 abazzz 10
# 2: 5 ccbeee 10
As a side note: as mentioned in comments, never use attach
(1)
Instead of
attach(df1)
df2<-df1[, i==abazzz | ccbeee | lllaaa]
detach(df1)
try
df2 <- with(df1, df1[i=="abazzz" | i=="ccbeee" | i=="lllaaa", ])
(2)
with(df1, df1[i %in% v, ])
Both yield
# n i c
# 1 2 abazzz 10
# 3 5 ccbeee 10
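Putting option (2) together as a self-contained base R sketch, without attach or with: it rebuilds the question's df1 and v, and stringsAsFactors = FALSE is an assumption added here so that the identifiers stay plain character strings.

```r
n <- c(2, 3, 5, 7, 8, 1)
i <- c("abazzz", "bbaxxx", "ccbeee", "dddfre", "sdtyuo", "loatvz")
c <- c(10, 2, 10, 2, 12, 34)
df1 <- data.frame(n, i, c, stringsAsFactors = FALSE)

# Identifiers to keep; "lllaaa" does not occur in df1 and is simply ignored
v <- c("abazzz", "ccbeee", "lllaaa")

# Keep the rows (note the trailing comma) whose identifier is in v
df2 <- df1[df1$i %in% v, ]
df2
```

The logical test goes before the comma because it selects rows; putting it after the comma, as in the question, would try to select columns instead.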

How to repetitively replace substrings in variables in R

I've got the following task
Treatment$V010 <- as.numeric(substr(Treatment$V010,1,2))
Treatment$V020 <- as.numeric(substr(Treatment$V020,1,2))
[...]
Treatment$V1000 <- as.numeric(substr(Treatment$V1000,1,2))
I have 100 variables from $V010, $V020, $V030... to $V1000. Those are numbers of different length. I want to "extract" just the first two digits of the numbers and replace the old number with the new number which is two digits long.
My data frame "Treatment" has 80 more variables which i did not mention here, so it is my goal that this function will just be applied to the 100 variables mentioned.
How can I do that? I could write that command 100 times but I am sure there is a better solution.
Alright, let's do it. First things first: as you want to get specific columns of your data frame, you need to specify their names to access them:
cnames = paste0('V',formatC(seq(10,1000,by=10), width = 3, format = "d", flag = "0"))
(cnames is a vector containing c('V010','V020', ..., 'V1000'))
Next, we will get their indexes:
coli=unlist(sapply(cnames, function (x) which(colnames(Treatment)==x)))
(coli is a vector containing the indexes in Treatment of the relevant columns)
Finally, we will apply your function over these columns:
Treatment[coli] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[coli])
Does it work?
PS: if anyone has a better/more concise way to do it, please tell me :)
EDIT:
The intermediate step is not useful, as you can already use the column names cnames to get the relevant columns, i.e.
Treatment[cnames] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[cnames])
(the only advantage of converting column names to column indexes shows up when some columns are missing from the data frame: in that case, Treatment['non existing column'] fails with the error "undefined columns selected")
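A possible way to get the same protection against missing columns while still indexing by name is to intersect cnames with the names that actually exist. The sketch below is only an illustration on a made-up two-column Treatment; the variable present and the sample values are assumptions, and lapply is used in place of mapply so each column stays a separate vector.

```r
# Toy stand-in for the real data: only V010 and V020 exist here
Treatment <- data.frame(
  V010 = c(1234, 56789),
  V020 = c(987, 6543),
  other = c("a", "b"),
  stringsAsFactors = FALSE
)

# All 100 expected names: "V010", "V020", ..., "V1000"
cnames <- paste0("V", formatC(seq(10, 1000, by = 10),
                              width = 3, format = "d", flag = "0"))

# Keep only the names that are really present, so nothing fails
present <- intersect(cnames, names(Treatment))

# Replace each of those columns by its first two digits
Treatment[present] <- lapply(Treatment[present],
                             function(x) as.numeric(substr(x, 1, 2)))
Treatment
```

Columns that are not in cnames, such as other, are left untouched.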
A solution where relevant columns are selected based on a pattern that can be described with a regular expression.
Regex explanation:
^: Start of string
V: Literal V
\\d{2}: Exactly 2 digits
Treatment <- data.frame(V010 = c(120, 130), x010 = c(120, 130), xV1000 = c(111, 222), V1000 = c(111, 222))
Treatment
# V010 x010 xV1000 V1000
# 1 120 120 111 111
# 2 130 130 222 222
# columns with a name that matches the pattern (logical vector)
idx <- grepl(x = names(Treatment), pattern = "^V\\d{2}")
# substr the relevant columns
Treatment[ , idx] <- sapply(Treatment[ , idx], FUN = function(x){
  as.numeric(substr(x, 1, 2))
})
Treatment
# V010 x010 xV1000 V1000
# 1 12 120 111 11
# 2 13 130 222 22
