I have the following data frame:
df <- data.frame(V1 = c(1, 2), V2 = c(10, 20), V3=c("9,1", "13,3,4"))
> df
V1 V2 V3
1 1 10 9,1
2 2 20 13,3,4
Now I want to create a new column 'V4' that takes the value after the first ',' in V3, divides it by the value in V2, and multiplies the result by 100.
In my example this would be:
(1 divided by 10) * 100 = 10
(3 divided by 20) * 100 = 15
So the output would look like this:
df_new
V1 V2 V3 V4
1 1 10 9,1 10
2 2 20 13,3,4 15
How can this be achieved?
We can use a regex to extract the number after the first comma, divide it by V2, and multiply by 100:
transform(df, V4 = as.integer(sub("\\d+,(\\d+).*", "\\1", V3))/V2 * 100)
# V1 V2 V3 V4
#1 1 10 9,1 10
#2 2 20 13,3,4 15
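If the regex feels opaque, an alternative sketch (not from the original answer) splits V3 on the comma with strsplit() and takes the second piece:
# split V3 on "," and keep the value directly after the first comma
second <- sapply(strsplit(as.character(df$V3), ","), `[`, 2)
transform(df, V4 = as.numeric(second)/V2 * 100)
# V1 V2 V3 V4
#1 1 10 9,1 10
#2 2 20 13,3,4 15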
Let's say I have the following dataframe df containing weights:
df <- as.data.frame(t(matrix(seq(1,9), nrow = 3, ncol = 3)))
> df
V1 V2 V3
1 1 2 3
2 4 5 6
3 7 8 9
I would like to produce a new data frame df_2 with normalised weights (each row must sum to 1) as below:
> df_2
V1 V2 V3
1 0.1666667 0.3333333 0.5
2 0.2666667 0.3333333 0.4
3 0.2916667 0.3333333 0.375
Note that the way I normalise a vector w is the following:
w_normalised <- w/sum(w)
We can divide each row by its row-wise sum:
df/rowSums(df)
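As a quick check (not part of the original answer), the rows of the result now sum to 1; and sweep() gives the column-wise variant if that is ever needed:
df_2 <- df/rowSums(df)
rowSums(df_2)                    # each element equals 1
sweep(df, 2, colSums(df), "/")   # variant: make each column sum to 1 instead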
I have a list of data.tables that I need to cbind; however, I only need the last X columns.
My data is structured as follows:
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)
> DT.list
[[1]]
x y v1 v2
1: 1 a 1 3
2: 1 a 2 4
[[2]]
x y v3
1: 1 a 5
2: 1 a 6
[[3]]
x y v4 v5 v6
1: 1 a 7 9 11
2: 1 a 8 10 12
Columns x and y are the same for each of the data.tables, but the number of columns differs. The output should not include duplicate x and y columns. It should look as follows:
x y v1 v2 v3 v4 v5 v6
1: 1 a 1 3 5 7 9 11
2: 1 a 2 4 6 8 10 12
I want to avoid using a loop. I am able to bind the data.tables using do.call("cbind", DT.list) and then remove the duplicates manually, but is there a way to avoid creating the duplicates in the first place? Also, efficiency is important, since the lists can be long and the data.tables large.
Thanks.
Here's another way:
Reduce(
  function(x, y) {
    newcols <- setdiff(names(y), names(x))    # columns of y not yet present in the accumulator
    x[, (newcols)] <- y[, ..newcols]          # append them
    x
  },
  DT.list,
  init = copy(DT.list[[1]][, c("x","y")])     # start from a copy of the shared key columns
)
# x y v1 v2 v3 v4 v5 v6
# 1: 1 a 1 3 5 7 9 11
# 2: 1 a 2 4 6 8 10 12
This avoids modifying the list (as @bgoldst's <- NULL assignment does) or making copies of every element of the list (as, I think, the lapply approach does). I would probably do the <- NULL thing in most practical applications, though.
Here's how it could be done in one shot, using lapply() to remove columns x and y from second-and-subsequent data.tables before calling cbind():
do.call(cbind,c(DT.list[1],lapply(DT.list[2:length(DT.list)],`[`,j=-c(1,2))));
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
Another approach is to remove columns x and y from second-and-subsequent data.tables before doing a straight cbind(). I think there's nothing wrong with using a for loop for this:
for (i in seq_along(DT.list)[-1]) DT.list[[i]][,c('x','y')] <- NULL;
DT.list;
## [[1]]
## x y v1 v2
## 1: 1 a 1 3
## 2: 1 a 2 4
##
## [[2]]
## v3
## 1: 5
## 2: 6
##
## [[3]]
## v4 v5 v6
## 1: 7 9 11
## 2: 8 10 12
##
do.call(cbind,DT.list);
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
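If you prefer not to modify DT.list in place, a small variant (my own sketch, assuming the data.table package is loaded) deep-copies each element first and drops x and y by reference on the copies:
DT.copy <- lapply(DT.list, copy);   ## copy() so the originals stay untouched
for (i in seq_along(DT.copy)[-1]) DT.copy[[i]][, c("x","y") := NULL];
do.call(cbind, DT.copy);
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12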
Another option would be to use the [ indexing function inside lapply() on the list of data.tables and exclude the "unwanted" columns (in your case x and y). This way, duplicate columns are not created in the first place.
# your given test data
DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))
DT.list <- list(DT.1, DT.2, DT.3)
A) using a character vector to indicate which columns to exclude
# cbind a list of subsetted data.tables
exclude.col <- c("x","y")
myDT <- do.call(cbind, lapply(DT.list, `[`,,!exclude.col, with = FALSE))
myDT
## v1 v2 v3 v4 v5 v6
## 1: 1 3 5 7 9 11
## 2: 2 4 6 8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
B) same as above but using the character vector directly in lapply
myDT <- do.call(cbind, lapply(DT.list, `[`,,!c("x","y")))
myDT
## v1 v2 v3 v4 v5 v6
## 1: 1 3 5 7 9 11
## 2: 2 4 6 8 10 12
# join x & y columns for final results
cbind(DT.list[[1]][,.(x,y)], myDT)
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
C) same as above, but all in one line
do.call( cbind, c(list(DT.list[[1]][,.(x,y)]), lapply(DT.list, `[`,,!c("x","y"))) )
# way too many brackets...but I think it works
## x y v1 v2 v3 v4 v5 v6
## 1: 1 a 1 3 5 7 9 11
## 2: 1 a 2 4 6 8 10 12
I am trying to generate 0 and 1 for absence and presence. My data is line segments, and I have to plot 0 or 1 at an interval of 0.1 for points that lie within a segment and points outside the segment.
V1 V2 V3 V4 V5 V6 V7
3 17 26.0 26.0 0 12-Jun-84 1 0
4 17 48.0 48.0 1 12-Jun-84 3 0
5 17 56.7 56.7 0 12-Jun-84 1 0
143 17 16.3 16.3 0 19-Jun-84 1 8
144 17 17.7 17.7 0 19-Jun-84 1 8
145 17 22.0 22.0 0 19-Jun-84 1 8
V2 and V3 are the start and end points, and V4 is the separation between them.
I have tried:
tran17 <- seq(0, 80, by=0.1)
tran17.date1 <- rep(0, length(tran17))
##
sub1 <- which(tran17 >= c$V2[i] & tran17 <= c$V3[i])
tran17.date1[sub1] <- 1
Thank you.
Ignoring your data example and focusing on your question, I think this solves the problem. Also, if V1 is a grouping factor, you can apply PAmatrix per group (e.g. with split()/lapply() or tapply()); a sketch is given after the plotting code below.
# test data
set.seed(1104)
dat = data.frame(V1 = 17, V2 = runif(200, 10, 60))
dat$V3 = dat$V2 + runif(200, 0, 20)
dat$V4 = dat$V3 - dat$V2
V1 V2 V3 V4
1 17 37.25826 45.54194 8.2836734
2 17 17.44098 22.86841 5.4274331
3 17 49.78488 55.51627 5.7313965
4 17 51.66640 52.54813 0.8817293
5 17 21.84276 39.38477 17.5420079
6 17 53.39457 54.51613 1.1215530
# functions to solve the problem
isInside = function(limits, tran) as.numeric(tran >= limits[1] & tran <= limits[2])  # 1 if a point of tran lies in [start, end]
PAmatrix = function(data, tran) t(apply(data, 1, isInside, tran = tran))             # one row per segment, one column per point of tran
# calculate the PA matrix
tran17 = seq(0, 80, by=0.1)
PA17 = PAmatrix(data=dat[,c("V2","V3")], tran=tran17)
# plot the results
image(seq(nrow(dat)), tran17, PA17, col=c("blue", "red"))
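As noted above, a minimal sketch of the per-group idea (assuming V1 identifies the groups; not part of the original code):
# one presence/absence matrix per value of V1 (here dat has a single V1, so the list has one element)
PA.by.group <- lapply(split(dat[, c("V2", "V3")], dat$V1), PAmatrix, tran = tran17)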
tran17 <- seq(0, 80, by = 0.1)
tran17.date1 <- rep(0, length(tran17))
dm <- which(c$V5 == "31-Jul-84")   # rows for the chosen date; `c` is the data frame from the question
for (i in dm) {
  print(i)
  sub1 <- which(tran17 >= c$V2[i] & tran17 <= c$V3[i])
  tran17.date1[sub1] <- 1
}
plot(tran17, tran17.date1)
I am a new user of R. I have this kind of data. How can I separate the different types of variables (e.g. binary or counts, as opposed to others that are continuous)?
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0.17 0 0 12 22 2 1 1 240.215 65.049 1.478 114
0.15 1 0 13 22 2 1 1 247.133 66.315 1.474 120
0.16 0 0 12 22 2 0 1 233.329 58.163 1.353 110
0.07 0 0 12 20 2 0 1 219.660 56.162 1.370 114
0.11 0 0 12 26 2 0 2 289.294 70.844 1.389 134
Thanks in advance!
You can use the function typeof to determine the storage mode of an object.
An example data frame:
dat <- data.frame(a = 1:2,
                  b = c(0.5, -1.3),
                  c = c("a", "b"),
                  d = c(TRUE, FALSE), stringsAsFactors = FALSE)
With lapply you can apply the function to all columns:
lapply(dat, typeof)
The result:
$a
[1] "integer"
$b
[1] "double"
$c
[1] "character"
$d
[1] "logical"
If you want to select, for example, all character columns, you can use:
dat[sapply(dat, typeof) == "character"] # possibility 1
dat[sapply(dat, is.character)] # possibility 2
# both commands will return the same result
c
1 a
2 b
PS: You should also have a look at the functions mode and storage.mode.
In addition to typeof, str and summary are other possibilities. These can also be applied directly to the data frame, i.e. no lapply or looping is required.
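For example, applied to the small data frame from above, str() prints roughly this:
str(dat)
'data.frame':   2 obs. of  4 variables:
 $ a: int  1 2
 $ b: num  0.5 -1.3
 $ c: chr  "a" "b"
 $ d: logi  TRUE FALSE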