I have the following dataframe
df <- data.frame(V1 = c(1, 2), V2 = c(10, 20), V3=c("9,1", "13,3,4"))
> df
V1 V2 V3
1 1 10 9,1
2 2 20 13,3,4
Now I want to create a new column 'V4' that takes the value after the first ',' from V3, divides it by the value in V2 and multiplies that by 100
In my example this would be:
(1 divided by 10) * 100 = 10
(3 divided by 20) * 100 = 15
So the output would look like this
df_new
V1 V2 V3 V4
1 1 10 9,1 10
2 2 20 13,3,4 15
How can this be achieved?
We can use regex to extract the number after the first comma, divide it by V2, and multiply by 100.
transform(df, V4 = as.integer(sub("\\d+,(\\d+).*", "\\1", V3))/V2 * 100)
# V1 V2 V3 V4
#1 1 10 9,1 10
#2 2 20 13,3,4 15
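If you'd rather avoid regex, an alternative (a sketch, assuming V3 always contains at least one comma) is to split on commas and take the second piece:

```r
# sample data from the question
df <- data.frame(V1 = c(1, 2), V2 = c(10, 20), V3 = c("9,1", "13,3,4"))

# split V3 on commas and take the second element of each split
second <- sapply(strsplit(as.character(df$V3), ","), `[`, 2)
df$V4 <- as.numeric(second) / df$V2 * 100
df  # V4 is now 10 and 15
```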
Related
I have a data.frame like the following:
V1 V2 V3 V4 V5
1 a a b a a
2 a a a
3 b b b b
4 a c d
I want to keep the rows where all non-empty values are the same character (in my example, rows 2 and 3). Is there a function that can help me achieve this?
Here is a base R option using apply
df[apply(df, 1, function(x) length(unique(x[x != ''])) == 1), ]
#V1 V2 V3 V4 V5
#2 a a a
#3 b b b b
Explanation: length(unique(x[x != ''])) == 1 checks if the non-empty elements of a vector x contain only a single unique value. apply with MARGIN = 1 means that we loop through the rows of the data.frame.
Sample data
df <- read.table(text = " V1 V2 V3 V4 V5
1 a a b a a
2 a a a '' ''
3 b b b b ''
4 a c d '' ''", header = T)
So I have a data.table where I need to fill in values based on the index of the column and then also based on the placeholder character. Example:
V1 V2 V3 V4
Row1 1 1 a d
Row2 1 1 a d
Row3 1 1 a d
Row4 1 2 a h
Row5 1 2 a h
Row6 1 2 a h
Row7 2 1 b i
Row8 2 1 b i
Row9 2 1 b i
Row10 2 2 b t
Row11 2 2 b t
Row12 2 2 b t
....
Row350k ...
What I need to figure out is how to write a for loop with an assignment-by-reference statement that slides along column 1's index. Basically:
For each column index, one at a time:
For each V1 = 1 and V2 = 1, replace character 'a' with one
iteration of 0.0055 + rnorm(1, 0.0055, 0.08).
For each V1 = 1 and V2 = 2, replace character 'a' with one iteration of 0.0055 +
rnorm(1, 0.0055, 0.08) (same variation but with another iteration of
the rnorm).
For each V1 = 2 and V2 = 1, replace character 'b' with
one iteration of 0.0055 + rnorm(1, 0.001, 0.01).
For each V1 = 2 and V2 = 2, replace character 'b' with one iteration of 0.0055 +
rnorm(1, 0.001, 0.01) (same variation but with another iteration of
the rnorm).
And so on for each incrementing value of Col1 and Col2. In actuality it's 20+ values instead of just 2 for the second index.
Desired output then is:
Col1 Col2 Col3 Col4
Row1 1 1 0.00551 d
Row2 1 1 0.00551 d
Row3 1 1 0.00551 d
Row4 1 2 0.00553 h
Row5 1 2 0.00553 h
Row6 1 2 0.00555 h
Row7 2 1 0.0011 i
Row8 2 1 0.0011 i
Row9 2 1 0.0011 i
Row10 2 2 0.0010 t
Row11 2 2 0.0010 t
Row12 2 2 0.0010 t
....
Row350k ...
Just not sure how to do this with a loop, since the values in col1 are repeated a certain number of times. Column 1 has 300k-plus values, so the sliding loop needs to be dynamically scalable.
Here's what I have tried:
for (i in seq(1, 4000, 1)) {
  for (ii in seq(1, 2, 1)) {
    data.table[V3 == "a", V3 := 0.0055 + rnorm(1, 0.0055, 0.08)]
    data.table[V3 == "b", V3 := 0.0055 + rnorm(1, 0.001, 0.01)]
  }
}
Thanks!
If I understand your problem correctly this might be of help.
library(data.table)
dt <- data.table(V1 = c(rep(1, 6), rep(2, 6)),
V2 = rep(c(rep(1, 3), rep(2, 3)), 2),
V3 = c(rep("a", 6), rep("b", 6)),
V4 = c(rep("d", 3), rep("h", 3), rep("i", 3), rep("t", 3)))
# define a catalog to join on V3 which contains the parameters for the random number generation
catalog <- data.table(V3 = c("a", "b"),
const = 0.0055,
mean = c(0.0055, 0.001),
std = c(0.08, 0.01))
# for each value of V3 generate .N (number of observations of the current V3 value) random numbers with the specified parameters
dt[catalog, V5 := i.const + rnorm(.N, i.mean, i.std), on = "V3", by = .EACHI]
dt[, V3 := V5]
dt[, V5 := NULL]
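One thing to be aware of: rnorm(.N, ...) above draws a different random number for every row. If you instead want a single draw shared by all rows of a (V1, V2) group, which parts of the desired output suggest, you can group by those keys. A sketch (the params lookup list here is a hypothetical stand-in for the catalog, using the parameters from the question):

```r
library(data.table)

dt <- data.table(V1 = c(rep(1, 6), rep(2, 6)),
                 V2 = rep(c(rep(1, 3), rep(2, 3)), 2),
                 V3 = c(rep("a", 6), rep("b", 6)))

# rnorm parameters keyed by the V3 value
params <- list(a = c(mean = 0.0055, sd = 0.08),
               b = c(mean = 0.001,  sd = 0.01))

# one draw per (V1, V2) group, recycled across the group's rows
dt[, V5 := 0.0055 + rnorm(1, params[[V3[1]]]["mean"], params[[V3[1]]]["sd"]),
   by = .(V1, V2)]
```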
Ok, so I figured out that I wasn't incrementing my counters properly. For a matrix/data.table with 4000 scenarios in the 1st column, each with 11 repeats in the 2nd column, I used the following:
for (col1counter in 1:4000) {
  for (col2counter in 1:11) {
    test1[V1 == col1counter & V2 == col2counter & V3 == "a",
          V55 := 0.00558 + rnorm(1, 0.00558, 2)]
  }
}
Using both indices in the conditional statement ensures that it crawls accurately through the rows.
I have a text file in the following form:
x1, y1, z1, x2, y2, z2, x3, y3, z3
If I import it with read.csv I get a single observation with nine variables (in the example; the number of triplets in the real file is unknown).
I want to rearrange the data so that I have many observations with three variables:
x1 y1 z1
x2 y2 z2
x3 y3 z3
So I can perform operations on each triplet.
For example I want to transform this
fileData <- read.table(text = "1 2 3 10 20 30 100 200 300")
> fileData
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 1 2 3 10 20 30 100 200 300
to this:
> fileData
V1 V2 V3
1 1 2 3
2 10 20 30
3 100 200 300
How can I split it?
Not sure what your actual goal is, but using base R:
data.frame(matrix(unlist(fileData), ncol = 3, byrow = TRUE))
This should get what you want
X1 X2 X3
1 1 2 3
2 10 20 30
3 100 200 300
akash gave a great answer, but it may not work if you have mixed data types (numeric and character), because the matrix will coerce everything to one type. An alternative is something like the following, where we lapply across an index based on the number of columns desired.
fileData <- read.table(text = "m 2 3 a 20 30 cat 200 300")
rows <- lapply(seq(3, ncol(fileData), by = 3),
               function(x) {
                 range <- paste("V", (x - 2):x, sep = "")
                 output <- fileData[, range]
                 names(output) <- c("x", "y", "z")
                 output
               })
do.call(rbind, rows)
do.call(rbind,rows)
#> x y z
#> 1 m 2 3
#> 2 a 20 30
#> 3 cat 200 300
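If the chunk width might not always be 3, the same idea can be parameterized. A sketch (chunk_df is a hypothetical helper name, not from the original answer):

```r
fileData <- read.table(text = "m 2 3 a 20 30 cat 200 300")

# stack every `width` consecutive columns of a one-row data frame as rows
chunk_df <- function(d, width = 3) {
  starts <- seq(1, ncol(d), by = width)
  rows <- lapply(starts, function(s) {
    out <- d[, s:(s + width - 1)]
    names(out) <- paste0("V", seq_len(width))
    out
  })
  do.call(rbind, rows)
}

chunk_df(fileData)  # three rows: (m, 2, 3), (a, 20, 30), (cat, 200, 300)
```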
I have a data frame (df) with 4 columns of values (V1 to V4 columns) that I need to select based on two other columns (max and min columns). My aim is to assign NAs to those values outside of the range set by the max and min columns for each row and calculate the mean of the remaining values.
V1 V2 V3 V4 max min
1 3 6 8 7 5
23 30 5 17 30 16
The expected output would be:
V1 V2 V3 V4 max min mean
NA NA 6 NA 7 5 6
23 30 NA 17 30 16 35
So far, I can only do this by using the following script to assign NAs...
df$V1 <- ifelse(df$V1 > df$max | df$V1 < df$min, NA, df$V1)
df$V2 <- ifelse(df$V2 > df$max | df$V2 < df$min, NA, df$V2)
df$V3 <- ifelse(df$V3 > df$max | df$V3 < df$min, NA, df$V3)
df$V4 <- ifelse(df$V4 > df$max | df$V4 < df$min, NA, df$V4)
...and then the following to calculate the mean:
df$mean <- rowMeans(df[, 1:4], na.rm = TRUE)
The problem is that the number of columns in the real data will be much larger than 4 and this method seems to require far too much repetition. Is there a better way of doing this in R?
I have tried using data.table to subset the valid values to then use the apply function without success:
df <- df[df[,1:4] <= df$max | df[,1:4] >= df$min, ]
apply(df[,1:4], 1, function(x) mean(x))
Thank you.
For instance, you could try the following, which works by melting your data first.
# getting your data:
df <- read.table(text="V1 V2 V3 V4 max min
1 3 6 8 7 5
23 30 5 17 30 16", header=T)
# melting the data:
library(reshape2)
df2 <- melt(df, id.vars = c("max", "min"))
df2
max min variable value
1 7 5 V1 1
2 30 16 V1 23
3 7 5 V2 3
4 30 16 V2 30
5 7 5 V3 6
6 30 16 V3 5
7 7 5 V4 8
8 30 16 V4 17
# I create a new vector with NAs, but you could easily just overwrite the values:
df2$val <- with(df2, ifelse(value > max | value < min, NA, value))
# Cast the data into the old form again.
df3 <- dcast(df2, max + min ~ variable, value.var = "val")
# calculate the rowMeans:
df3$mean <- rowMeans(df3[, 3:6], na.rm = TRUE)
# Doing some cosmetics here to get the same column ordering. Choose your preferred way of rearranging the columns, if required at all.
df3 <- df3[, c(paste0("V", 1:4),"max", "min", "mean") ]
df3
V1 V2 V3 V4 max min mean
1 NA NA 6 NA 7 5 6.00000
2 23 30 NA 17 30 16 23.33333
Note that the only difference is that the mean of the second row is lower. I am not sure how you got a value of 35 there.
Try:
df <- read.table(header=TRUE, text="V1 V2 V3 V4 max min
1 3 6 8 7 5
23 30 5 17 30 16")
df.new <- apply(df[, 1:4], 2, function(x) ifelse(x > df[, 5] | x < df[, 6], NA, x))
df.new <- cbind(df.new, df[, 5:6])
df.new$mean <- rowMeans(df.new[1:4], na.rm = TRUE)
df.new
Here is a simple solution with a for loop to fill in the NAs and rowMeans to calculate the mean of each row.
# loop through rows and fill in NA for values outside of min/max
for(i in 1:nrow(df))
is.na(df[i, 1:4]) <- df[i, 1:4] < df[i, "min"] | df[i, 1:4] > df[i, "max"]
# calculate mean of each row
df$mean <- rowMeans(df[, 1:4], na.rm=TRUE)
this returns
df
V1 V2 V3 V4 max min mean
1 NA NA 6 NA 7 5 6.00000
2 23 30 NA 17 30 16 23.33333
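The loop can also be avoided entirely: when a block of columns is compared against a vector of row-wise limits, the vector recycles down each column and so lines up with the rows, letting you build the whole NA mask in one vectorized step (a sketch using the same sample data):

```r
df <- read.table(header = TRUE, text = "V1 V2 V3 V4 max min
1 3 6 8 7 5
23 30 5 17 30 16")

# TRUE wherever a value falls outside its row's [min, max] range
outside <- df[, 1:4] < df$min | df[, 1:4] > df$max
is.na(df[, 1:4]) <- outside
df$mean <- rowMeans(df[, 1:4], na.rm = TRUE)
```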
I have a list of data.tables that I need to cbind, however, I only need the last X columns.
My data is structured as follows:
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)
>DT.list
[[1]]
x y v1 v2
1: 1 a 1 3
2: 1 a 2 4
[[2]]
x y v3
1: 1 a 5
2: 1 a 6
[[3]]
x y v4 v5 v6
1: 1 a 7 9 11
2: 1 a 8 10 12
Columns x and y are the same for each of the data.tables but the amount of columns differs. The output should not include duplicate x, and y columns. It should look as follows:
x y v1 v2 v3 v4 v5 v6
1: 1 a 1 3 5 7 9 11
2: 1 a 2 4 6 8 10 12
I want to avoid using a loop. I am able to bind the data.tables using do.call("cbind", DT.list) and then remove the duplicates manually, but is there a way where the duplicates aren't created in the first place? Also, efficiency is important since the lists can be long with large data.tables.
Thanks!
Here's another way:
Reduce(
  function(x, y) {
    newcols = setdiff(names(y), names(x))
    x[, (newcols) := y[, ..newcols]]
    x
  },
  DT.list,
  init = copy(DT.list[[1]][, c("x", "y")])
)
# x y v1 v2 v3 v4 v5 v6
# 1: 1 a 1 3 5 7 9 11
# 2: 1 a 2 4 6 8 10 12
This avoids modifying the list (as #bgoldst's <- NULL assignment does) or making copies of every element of the list (as, I think, the lapply approach does). I would probably do the <- NULL thing in most practical applications, though.
Here's how it could be done in one shot, using lapply() to remove columns x and y from second-and-subsequent data.tables before calling cbind():
do.call(cbind,c(DT.list[1],lapply(DT.list[2:length(DT.list)],`[`,j=-c(1,2))));
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
Another approach is to remove columns x and y from second-and-subsequent data.tables before doing a straight cbind(). I think there's nothing wrong with using a for loop for this:
for (i in seq_along(DT.list)[-1]) DT.list[[i]][,c('x','y')] <- NULL;
DT.list;
## [[1]]
## x y v1 v2
## 1: 1 a 1 3
## 2: 1 a 2 4
##
## [[2]]
## v3
## 1: 5
## 2: 6
##
## [[3]]
## v4 v5 v6
## 1: 7 9 11
## 2: 8 10 12
##
do.call(cbind,DT.list);
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
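One caveat worth noting: because data.tables are modified by reference, the `<- NULL` loop above also strips x and y from the original DT.2 and DT.3. If you need those intact, you could work on copies first (a sketch using data.table::copy, shown on two of the example tables):

```r
library(data.table)

DT.1 <- data.table(x = c(1, 1), y = c("a", "a"), v1 = c(1, 2), v2 = c(3, 4))
DT.2 <- data.table(x = c(1, 1), y = c("a", "a"), v3 = c(5, 6))
DT.list <- list(DT.1, DT.2)

# deep-copy each table so column removal doesn't touch the originals
DT.copies <- lapply(DT.list, copy)
for (i in seq_along(DT.copies)[-1]) DT.copies[[i]][, c("x", "y") := NULL]
res <- do.call(cbind, DT.copies)
```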
Another option would be to use the `[` indexing function inside lapply on the list of data.tables and exclude the "unwanted" columns (in your case x and y). This way, duplicate columns are not created in the first place.
# your given test data
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)
A) using a character vector to indicate which columns to exclude
# cbind a list of subsetted data.tables
exclude.col <- c("x","y")
myDT <- do.call(cbind, lapply(DT.list, `[`,,!exclude.col, with = FALSE))
myDT
## v1 v2 v3 v4 v5 v6
## 1: 1 3 5 7 9 11
## 2: 2 4 6 8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
B) same as above but using the character vector directly in lapply
myDT <- do.call(cbind, lapply(DT.list, `[`,,!c("x","y")))
myDT
## v1 v2 v3 v4 v5 v6
## 1: 1 3 5 7 9 11
## 2: 2 4 6 8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
C) same as above, but all in one line
do.call( cbind, c(list(DT.list[[1]][,.(x,y)]), lapply(DT.list, `[`,,!c("x","y"))) )
# way too many brackets...but I think it works
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12