How to convert a string into an assignment statement in R?

The list contains many columns, and I'd like to copy them automatically rather than by hand.
I have a data.frame, myData.
It has no home_player_X columns yet. I'm adding them one by one manually.
If I do it manually, the code looks like this:
myData$home_player_1 <- lDataFrames[[3]]$home_player_1
myData$home_player_2 <- lDataFrames[[3]]$home_player_2
...
myData$home_player_11 <- lDataFrames[[3]]$home_player_11
If we only consider the part after <-, I can convert it into an expression:
eval(parse(text=paste("lDataFrames[[3]]$home_player_",i,sep="")))
But I want to convert the whole string, which is:
paste("myData$home_player_",i," <- lDataFrames[[3]]$home_player_", i,sep="")
I want to convert this string into an assignment statement so that I can run it in a for loop.

Instead of playing with strings, you can directly copy the required columns into myData.
cols <- grep("^home_player", names(lDataFrames[[3]]), value = TRUE)
myData[cols] <- lDataFrames[[3]][cols]
Using a reproducible example:
df <- data.frame(home_player_1 = 1:5, home_player_2 = 6:10, home_player_3 = 11:15)
cols <- grep("^home_player", names(df), value = TRUE)
mydata <- data.frame(matrix(nrow = nrow(df), ncol = length(cols),
dimnames = list(NULL, cols)))
mydata[cols] <- df[cols]
mydata
# home_player_1 home_player_2 home_player_3
#1 1 6 11
#2 2 7 12
#3 3 8 13
#4 4 9 14
#5 5 10 15

Instead of using the $ notation, just use the variable name as an index. I am substituting Y for your lDataFrames[[3]], but it should be easy to translate.
myData = data.frame(Var1 = 1:10)
Y = data.frame(home_player_1 = 11:20,
home_player_2 = 21:30, home_player_3 = 31:40)
for(i in 1:3) {
VarName = paste0("home_player_", i)
myData[ ,VarName] = Y[ ,VarName]
}
myData
Var1 home_player_1 home_player_2 home_player_3
1 1 11 21 31
2 2 12 22 32
3 3 13 23 33
4 4 14 24 34
5 5 15 25 35
6 6 16 26 36
7 7 17 27 37
8 8 18 28 38
9 9 19 29 39
10 10 20 30 40
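For completeness, the literal question (turning the whole string into an assignment) can also be answered with eval(parse()). The sketch below reuses myData and Y from the example above and is only meant to show that the string approach works. The indexing approach is generally easier to read and debug.

```r
myData <- data.frame(Var1 = 1:10)
Y <- data.frame(home_player_1 = 11:20,
                home_player_2 = 21:30,
                home_player_3 = 31:40)

for (i in 1:3) {
  # Build the full statement as text,
  # e.g. "myData$home_player_1 <- Y$home_player_1"
  stmt <- paste0("myData$home_player_", i, " <- Y$home_player_", i)
  # parse() turns the text into an expression; eval() runs it
  # in the environment the loop is running in
  eval(parse(text = stmt))
}

names(myData)
```
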

Related

How can I divide and multiply each row/column (cell) value in R?

I have this dataframe called mydf.
mydf<- structure(list(length = 18:21, A = c(40889L, 42585L, 60586L,
73374L), C = c(24283L, 66371L, 30027L, 40899L), G = c(38245L,
29170L, 37877L, 49023L), T = c(92544L, 159373L, 326940L, 654364L
)), .Names = c("length", "A", "C", "G", "T"), row.names = c(NA,
4L), class = "data.frame")
I want to perform a calculation like the one below on the mydf data frame. In other words, I want to divide each value by a number (say x) and multiply by 100.
length A C G T
18 (40889/x)*100 (24283/x)*100 (38245/x)*100 (92544/x)*100
19 (42585/x)*100 (66371/x)*100 (29170/x)*100 (159373/x)*100
20 (60586/x)*100 (30027/x)*100 (37877/x)*100 (326940/x)*100
21 (73374/x)*100 (40899/x)*100 (49023/x)*100 (654364/x)*100
for column A:
mydf$A <- ((mydf$A / x)*100)
You should be able to google this as it is basic.
dplyr solution
mydf %>%
mutate_at(vars(c("A","C","G","T")), ~ . / x * 100)
data.table solution
DT <- setDT(mydf)
DT[, .SD / x * 100, .SDcols = c("A","C","G","T")]
and base
mydf[c("A","C","G","T")] <- mydf[c("A","C","G","T")] / x * 100
x <- 4
mydf <- cbind(mydf[1], ( mydf[2:ncol(mydf)] / x) * 100)
Result:
length A C G T
1 18 1022225 607075 956125 2313600
2 19 1064625 1659275 729250 3984325
3 20 1514650 750675 946925 8173500
4 21 1834350 1022475 1225575 16359100
To apply to specific columns:
specific_cols <- which(names(mydf) %in% c('A', 'G', 'T'))
specific_result <- cbind(mydf[-specific_cols], ( mydf[specific_cols] / x) * 100)
Result:
length C A G T
1 18 24283 1022225 956125 2313600
2 19 66371 1064625 729250 3984325
3 20 30027 1514650 946925 8173500
4 21 40899 1834350 1225575 16359100
Something like this will do the job:
x = 5
cbind(length = mydf[,1], (mydf[,-1]/x)*100)
With dplyr,
library(dplyr)
mydf %>%
mutate_at(vars(-1), ~ . / x * 100)
Result:
length A C G T
1 18 817780 485660 764900 1850880
2 19 851700 1327420 583400 3187460
3 20 1211720 600540 757540 6538800
4 21 1467480 817980 980460 13087280
mydf
length A C G T
1 18 40889 24283 38245 92544
2 19 42585 66371 29170 159373
3 20 60586 30027 37877 326940
4 21 73374 40899 49023 654364
x<-2
mydf[,-1]<-(mydf[-1]/x)*100
mydf
length A C G T
1 18 2044450 1214150 1912250 4627200
2 19 2129250 3318550 1458500 7968650
3 20 3029300 1501350 1893850 16347000
4 21 3668700 2044950 2451150 32718200
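The base R answers above can be wrapped in a small function. This is a minimal sketch: scale_cols is a hypothetical helper name, not something from the answers, and it assumes the selected columns are numeric. Unlike the cbind() variants, it keeps the original column order.

```r
# Hypothetical helper: divide the chosen columns by x and multiply by 100,
# returning a modified copy and leaving all other columns untouched.
scale_cols <- function(data, cols, x) {
  stopifnot(all(cols %in% names(data)), is.numeric(x), x != 0)
  data[cols] <- data[cols] / x * 100
  data
}

mydf <- data.frame(length = 18:21,
                   A = c(40889L, 42585L, 60586L, 73374L),
                   C = c(24283L, 66371L, 30027L, 40899L),
                   G = c(38245L, 29170L, 37877L, 49023L),
                   T = c(92544L, 159373L, 326940L, 654364L))

res <- scale_cols(mydf, c("A", "C", "G", "T"), x = 4)
res$A      # 1022225 1064625 1514650 1834350, as in the x <- 4 result above
names(res) # column order is unchanged
```
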

Adding Proportionate Na's in a column [duplicate]

I have a complete dataframe. I want 20% of the values in the dataframe to be replaced by NAs to simulate random missing data.
A <- c(1:10)
B <- c(11:20)
C <- c(21:30)
df<- data.frame(A,B,C)
Can anyone suggest a quick way of doing that?
df <- data.frame(A = 1:10, B = 11:20, c = 21:30)
head(df)
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 15 25
## 6 6 16 26
as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 14 24
## 5 5 NA 25
## 6 6 16 26
## 7 NA 17 27
## 8 8 18 28
## 9 9 19 29
## 10 10 20 30
It's a random process, so it might not give 15% every time.
You can unlist the data.frame, set a random sample of positions to NA, and then put the values back in a data.frame.
vals <- unlist(df)
n <- round(length(vals) * 0.15)
vals[sample(length(vals), n)] <- NA # sample positions, not values
as.data.frame(matrix(vals, ncol=3))
It can be done a bunch of different ways using sample().
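One of those ways gives an exact share of missing cells rather than an approximate one. The sketch below assumes a hypothetical helper name, add_na_exact, and uses a logical mask of the same shape as the data frame, which is the same kind of index the binomial answer relies on.

```r
# Hypothetical helper: set an exact share of all cells to NA.
add_na_exact <- function(data, prop) {
  stopifnot(prop >= 0, prop <= 1)
  n_cells <- nrow(data) * ncol(data)
  n_na <- round(n_cells * prop)
  # Logical mask with the same dimensions as the data frame
  mask <- matrix(FALSE, nrow = nrow(data), ncol = ncol(data))
  # Sample cell positions (not values) without replacement
  mask[sample.int(n_cells, n_na)] <- TRUE
  data[mask] <- NA
  data
}

set.seed(1)  # only to make the sketch reproducible
df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
out <- add_na_exact(df, 0.20)
sum(is.na(out))  # 6 of the 30 cells
```

Because positions are drawn without replacement, the number of NAs is the same on every run. Only their location changes.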
If you are in the mood to use purrr instead of lapply, you can also do it like this:
> library(purrr)
> df <- data.frame(A = 1:10, B = 11:20, C = 21:30)
> df
A B C
1 1 11 21
2 2 12 22
3 3 13 23
4 4 14 24
5 5 15 25
6 6 16 26
7 7 17 27
8 8 18 28
9 9 19 29
10 10 20 30
> map_df(df, function(x) {x[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(x), replace = TRUE)]})
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 1 11 21
2 2 12 22
3 NA 13 NA
4 4 14 NA
5 5 15 25
6 6 16 26
7 7 17 27
8 8 NA 28
9 9 19 29
10 10 20 30
Same result, using binomial distribution:
dd=dim(df)
nna=20/100 #overall
df1<-df
df1[matrix(rbinom(prod(dd), size=1,prob=nna)==1,nrow=dd[1])]<-NA
df1
May I suggest a first function (ggNAadd) designed to do this, and improve it with a second function (ggNA) that plots the distribution of the NAs created.
What is neat is the possibility to input either a proportion or a fixed number of NAs.
ggNAadd = function(data, amount, plot=F){
temp <- data
amount2 <- ifelse(amount<1, round(prod(dim(data))*amount), amount)
if (amount2 >= prod(dim(data))) stop("exceeded data size")
for (i in 1:amount2) temp[sample.int(nrow(temp), 1), sample.int(ncol(temp), 1)] <- NA
if (plot) print(ggNA(temp))
return(temp)
}
And the plotting function:
ggNA = function(data, alpha=0.5){
require(ggplot2)
DF <- data
if (!is.matrix(data)) DF <- as.matrix(DF)
to.plot <- cbind.data.frame('y'=rep(1:nrow(DF), each=ncol(DF)),
'x'=as.logical(t(is.na(DF)))*rep(1:ncol(DF), nrow(DF)))
size <- 20 / log( prod(dim(DF)) ) # size of point depend on size of table
g <- ggplot(data=to.plot) + aes(x,y) +
geom_point(size=size, color="red", alpha=alpha) +
scale_y_reverse() + xlim(1,ncol(DF)) +
ggtitle("location of NAs in the data frame") +
xlab("columns") + ylab("lines")
pc <- round(sum(is.na(DF))/prod(dim(DF))*100, 2) # % NA
print(paste("percentage of NA data: ", pc))
return(g)
}
Which gives (using ggplot2 as graphical output):
ggNAadd(df, amount=0.20, plot=TRUE)
## [1] "percentage of NA data: 20"
## A B c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
## 4 4 NA 24
## ..
Of course, as mentioned earlier, if you ask for too many NAs the actual percentage will drop because of repetitions.
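The effect of repetitions can be illustrated without the data frame at all. The following is a rough sketch with made-up sizes (30 cells, 20 requested NAs): picking a random row and a random column in each iteration amounts to drawing cells with replacement, so the same cell can be hit more than once.

```r
set.seed(42)    # arbitrary seed, only for reproducibility
n_cells <- 30   # e.g. a 10 x 3 data frame
amount  <- 20   # number of NAs requested

# Drawing with replacement can hit the same cell more than once,
# so fewer distinct cells end up as NA.
with_repl <- sample.int(n_cells, amount, replace = TRUE)
length(unique(with_repl))     # at most 20, usually less

# Expected number of distinct cells when drawing with replacement
expected_distinct <- n_cells * (1 - (1 - 1 / n_cells)^amount)
expected_distinct             # roughly 14.8

# Drawing without replacement always yields `amount` distinct cells.
without_repl <- sample.int(n_cells, amount, replace = FALSE)
length(unique(without_repl))  # exactly 20
```
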
A mutate_all approach:
df %>%
dplyr::mutate_all(~ifelse(sample(c(TRUE, FALSE), size = length(.), replace = TRUE, prob = c(0.8, 0.2)),
as.character(.), NA))

Removing columns from a dataframe with a common string but keeping the first

In my data frame I have columns with names like this:
stockA_1,stockA_2,stockA_3,stockA_4
I would like to delete all columns from my df whose names contain the common string "stockA_", but keep the first of them, "stockA_1".
How can I do this?
This can be done with base functions:
d <- as.data.frame(matrix(1:35, 5, 7))
names(d) <- c("AA", "stockA_1", "BBB", "stockA_2", "stockA_3", "CCCCC", "stockA_4")
d[,-which(grepl("^stockA_", names(d)))[-1]]
The result is:
> d[,-which(grepl("^stockA_", names(d)))[-1]]
AA stockA_1 BBB CCCCC
1 1 6 11 26
2 2 7 12 27
3 3 8 13 28
4 4 9 14 29
5 5 10 15 30
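One caveat with the -which(...)[-1] idiom: if there is at most one matching column, the index becomes -integer(0), which selects zero columns instead of keeping everything. A minimal sketch of a variant based on logical indexing avoids this. The helper name keep_first_prefixed is hypothetical.

```r
# Hypothetical helper: drop every column whose name starts with `prefix`
# except the first such column (in column order).
keep_first_prefixed <- function(d, prefix) {
  is_match <- startsWith(names(d), prefix)
  # Mark matching columns after the first one for removal
  drop <- is_match & cumsum(is_match) > 1
  d[, !drop, drop = FALSE]
}

d <- as.data.frame(matrix(1:35, 5, 7))
names(d) <- c("AA", "stockA_1", "BBB", "stockA_2", "stockA_3", "CCCCC", "stockA_4")
names(keep_first_prefixed(d, "stockA_"))
# "AA" "stockA_1" "BBB" "CCCCC"

# With a single matching column nothing is dropped
names(keep_first_prefixed(d[, c("AA", "stockA_1", "BBB")], "stockA_"))
# "AA" "stockA_1" "BBB"
```
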
If you want to keep the column "stockA_1" specifically (which may not be the first of the "stockA_" columns), then you can do
d <- as.data.frame(matrix(1:35, 5, 7))
names(d) <- c("AA", "stockA_11", "BBB", "stockA_2", "stockA_1", "CCCCC", "stockA_4")
i <- (!grepl("^stockA_", names(d))) | grepl("^stockA_1$", names(d))
d[,i]
with the result:
> d[,i]
AA BBB stockA_1 CCCCC
1 1 11 21 26
2 2 12 22 27
3 3 13 23 28
4 4 14 24 29
5 5 15 25 30
Using data.table (if I have understood the problem correctly, that is):
library(data.table)
data <- data.table(stockA_1 = c(1, 2, 3), stockA_2 = c(3, 4, 7), stockA_3 = c(4, 5, 6))
columns <- setdiff(grep("^stockA_", names(data), value = TRUE), "stockA_1")
data[, (columns) := NULL]
