How to repetitively replace substrings in variables in R - r

I've got the following task
Treatment$V010 <- as.numeric(substr(Treatment$V010,1,2))
Treatment$V020 <- as.numeric(substr(Treatment$V020,1,2))
[...]
Treatment$V1000 <- as.numeric(substr(Treatment$V1000,1,2))
I have 100 variables from $V010, $V020, $V030... to $V1000. Those are numbers of different length. I want to "extract" just the first two digits of the numbers and replace the old number with the new number which is two digits long.
My data frame "Treatment" has 80 more variables which i did not mention here, so it is my goal that this function will just be applied to the 100 variables mentioned.
How can I do that? I could write that command 100 times but I am sure there is a better solution.

Alright, let's do it. First thing first: as you want to get specific columns of your dataframe, you need to specify their names to access them:
cnames = paste0('V',formatC(seq(10,1000,by=10), width = 3, format = "d", flag = "0"))
(cnames is a vector containing c('V010','V020', ..., 'V1000'))
Next, we will get their indexes:
coli=unlist(sapply(cnames, function (x) which(colnames(Treatment)==x)))
(coli is a vector containing the indexes in Treatment of the relevant columns)
Finally, we will apply your function over these columns:
Treatment[coli] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[coli])
Does it work?
PS: if anyone has a better/more concise way to do it, please tell me :)
EDIT:
The intermediate step is not useful, as you can already use the column names cnames to get the relevant columns, i.e.
Treatment[cnames] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[cnames])
(the only advantage of doing the conversion from column names to column indexes is when there are some missing columns in the dataframe - in this case, Treatment['non existing column'] crashes with undefined columns selected)

A solution where relevant columns are selected based on a pattern that can be described with a regular expression.
Regex explanation:
^: Start of string
V: Literal V
\\d{2}: Exactly 2 digits
Treatment <- data.frame(V010 = c(120, 130), x010 = c(120, 130), xV1000 = c(111, 222), V1000 = c(111, 222))
Treatment
# V010 x010 xV1000 V1000
# 1 120 120 111 111
# 2 130 130 222 222
# columns with a name that matches the pattern (logical vector)
idx <- grepl(x = names(Treatment), pattern = "^V\\d{2}")
# substr the relevant columns
Treatment[ , idx] <- sapply(Treatment[ , idx], FUN = function(x){
as.numeric(substr(x, 1, 2))
})
Treatment
# V010 x010 xV1000 V1000
# 1 12 120 111 11
# 2 13 130 222 22

Related

Flatten output of table whilst retaining names

How can I flatten the output of R's base::table function (applied to columns of a dataframe) in order to return an integer vector with names according to the convention specified below?
set.seed(123); N <- 100
df <- data.frame(A = sample(c("GOOD", "BAD"), N, TRUE, c(0.4, 0.6)),
B = sample(c("GOOD", "BAD"), N, TRUE, c(0.7, 0.3)))
table(df)
# B
# A BAD GOOD
# BAD 16 44
# GOOD 10 30
# Desired output (naming convention: A-B)
# BAD-BAD GOOD-BAD BAD-GOOD GOOD-GOOD
# 16 10 44 30
Using as.vector(table(df)) drops names, which I'd like to retain in one step for use downstream for example, in case the ordering of factors changes or to enable "toggling" of useNA = "always" when calling table whilst not having to manually track the positions of the respective cells in the output vector.
From memory (currently no access to R):
table(interaction(df$A, df$B, sep = "-"))
Could be that you need to specify the correct separator to interaction

Why won't R recognize data frame column names within lists?

HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when I substitute the vector for a dataframe column and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables varying in lengths onto respective dataframe columns, and so keeping them in a list that I could iterate through would be very helpful. I've tried everything I could think of, but it may be the case that it's just not possible in R, especially given that I cannot find any posts with solutions to the issue.
You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6
If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(d[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10

Merge named vectors in different sizes into data frame

I have some different named vectors, and I want to combine them into one date frame that sums the actions.
adjust balance drive idle other pick putdown replace sort wait
4 9 16 82 4 350 61 16 26 18
walk
14
adjust balance drive idle pick putdown replace sort unload walk
1 42 14 47 385 118 4 83 19 7
i want it to be this way:
adjust balance drive
5 51 30
and etc..
i find it very challenging because those are named vectors
would be grateful for your help, thank you!
We can use aggregate + stack like below
aggregate(. ~ ind, rbind(stack(vec1), stack(vec2)), sum)
You could convert to a data.frame and use the dplyr package to group by the names and sum the numbers together.
library(dplyr)
vec <- c(4, 9, 16, 1, 42, 14)
names(vec) <- c("adjust", "balance", "drive", "adjust", "balance", "drive")
data.frame(values = vec, name = names(vec)) %>% group_by(name) %>% summarise(values = sum(values))
If we want to add all elements that match between the two vectors:
# Resolve the matching names of the vectors:
# vec_nm_order => character vector
vec_nm_order <- intersect(
names(vec1),
names(vec2)
)
# Add the related scalars together:
# named integer vector => stdout(console)
vec1[vec_nm_order] + vec2[vec_nm_order]
If we only want to add values for adjust, balance, drive:
# Choose the names (keys) of elements we want to add together:
# vec_nm_order => character vector
vec_nm_order <- c(
"adjust", "balance", "drive"
)
# Add the related scalars together:
# named integer vector => stdout(console)
vec1[vec_nm_order] + vec2[vec_nm_order]

R : Extract a Specific Number out of a String

I have a vector as below
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")
here i want to extract the first number before the "X" for each of the elements.
In case of situations with 2 "X" i.e. "6X2X75CL" the number 12 (6 multiplied by 2) should be calculated.
expected output
6, 24, 12, 168
Thank you for the help...
Here's a possible solution using regular expressions :
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")
# this regular expression finds any group of digits followed
# by a upper-case 'X' in each string and returns a list of the matches
tokens <- regmatches(data,gregexpr('[[:digit:]]+(?=X)',data,perl=TRUE))
res <- sapply(tokens,function(x)prod(as.numeric(x)))
> res
[1] 6 24 12 168
Here is a method using base R:
dataList <- strsplit(data, split="X")
sapply(dataList, function(x) Reduce("*", as.numeric(head(x, -1))))
[1] 6 24 12 168
strplit breaks up the vector along "X". The resulting list is fed to sapply which the performs an operation on all but the final element of each vector in the list. The operation is to transform the elements into numerics and the multiply them. The final element is dropped using head(x, -1).
As #zheyuan-li comments, prod can fill in for Reduce and will probably be a bit faster:
sapply(dataList, function(x) prod(as.numeric(head(x, -1))))
[1] 6 24 12 168
We can also use str_extract_all
library(stringr)
sapply(str_extract_all(data, "\\d+(?=X)"), function(x) prod(as.numeric(x)))
#[1] 6 24 12 168
ind=regexpr("X",data)
val=as.integer(substr(data, 1, ind-1))
data2=substring(data,ind+1)
ind2=regexpr("[0-9]+X", data2)
if (!all(ind2!=1)) {
val2 = as.integer(substr(data2[ind2==1], 1, attr(ind2,"match.length")[ind2==1]-1))
val[ind2==1] = val[ind2==1] * val2
}

Appending list to data frame in R

I have created an empty data frame in R with two columns:
d<-data.frame(id=c(), numobs=c())
I would like to append this data frame (in a loop) with a list, d1 that has output:
[1] 1 100
I tried using rbind:
d<-rbind(d, d2)
and merge:
d<-merge(d, d2)
And I even tried just making a list of lists and then converting it to a data frame, and then giving that data frame names:
d<-rbind(dlist1, dlist2)
dframe<-data.frame(d)
names(dframe)<-c("id","numobs")
But none of these seem to meet the standards of a routine checker (this is for a class), which gives the error:
Error: all(names(cc) %in% c("id", "nobs")) is not TRUE
Even though it works fine in my workspace.
This is frustrating since the error does not reveal where the error is occurring.
Can anyone help me to either merge 2 data frames or append a data frame with a list?
I think you are confusing the purpose of rbind and merge. rbind appends data.frames or named lists, or both vertically. While merge combines data.frames horizontally.
You seem to be also confused by vector's and list's. In R, list can take different datatypes for each element, while vector has to have all elements the same type. Both list and vector are one-dimensional. When you use rbind you want to append a named list, not a named/unnamed vector.
Unnamed Vectors and Lists
The way you define a vector is with the c() function. The way you define an unnamed list is with the list() function, like so:
vec1 = c(1, 10)
# > vec1
# [1] 1 10
list1 = list(1, 10)
# > list1
# [[1]]
# [1] 1
#
# [[2]]
# [1] 10
Notice that both vec1 and list1 have two elements, but list1 is storing the two numbers as two separate vectors (element [[1]] the vector c(1) and [[2]] the vector c(10))
Named Vectors and Lists
You can also create named vectors and lists. You do this by:
vec2 = c(id = 1, numobs = 10)
# > vec2
# id numobs
# 1 10
list2 = list(id = 1, numobs = 10)
# > list2
# $id
# [1] 1
#
# $numobs
# [1] 10
Same data structure for both, but the elements are named.
Dataframes as Lists
Notice that list2 has a $ in front of each element name. This might give you some clue that data.frame's are actually list's with each column an element of the list, since df$column is often used to extract a column from a dataframe. This makes sense since both list's and data.frame's can take different datatypes, unlike vectors's.
The rbind function
When your first element is a dataframe, rbind requires that what you are appending has the same names as the columns of the dataframe. Now, a named vector would not work, because the elements of a vector are not treated as columns of a dataframe, whereas a named list matches elements with columns if the names are the same:
To demonstrate:
d<-data.frame(id=c(), numobs=c())
rbind(d, c(1, 10))
# X1 X10
# 1 1 10
rbind(d, c(id = 1, numobs = 10))
# X1 X10
# 1 1 10
rbind(d, list(1, 10))
# X1 X10
# 1 1 10
rbind(d, list(id = 1, numobs = 10))
# id numobs
# 1 1 10
Knowing the above, it is obvious that you can most certainly also rbind two dataframes with column names that match:
df2 = data.frame(id = 1, numobs = 10)
rbind(d, df2)
# id numobs
# 1 1 10
For starters, the routine checker appears to be looking for columns labeled "id" and "nobs". If that doesn't match your file output, you'll get that error.
I'm taking what is probably the same class and had the same error; correcting my column names made that go away (I'd labeled the 2nd one "nob" not "nobs"!) Now I've gotten the routine checker to complete correctly, or so it seems... but it outputs three data files, and the first and last files are correct but the second one yields "Sorry, that is incorrect." No further feedback. Maddening!
No point posting my code here as it runs fine locally with all the course examples, and it's kinda hard to debug when you don't know what the script is asking for. Sigh.
That d2 object is being printed as an atomic vector would be. Maybe if you showed us either dput(d2) or str(d2) you would havea better understanding of R lists. Furthermore that first bit of code does not produce a two column dataframe, either.
> d<-data.frame(id=1, numobs=1)[0, ] # 2-cl dataframe with 0 rows
> dput(d)
structure(list(id = numeric(0), numobs = numeric(0)), .Names = c("id",
"numobs"), row.names = integer(0), class = "data.frame")
> d2 <- list(id="fifty three", numobs=6) # names that match names(d)
> rbind(d,d2)
id numobs
2 fifty three 6

Resources