Let's say I have 3 vectors (each of length 10):
X <- c(1,1,0,1,0, 1,1, 0, NA,NA)
H <- c(0,0,1,0,NA,1,NA,1, 1, 1 )
I <- c(0,0,0,0,0, 1,NA,NA,NA,1 )
Data.frame Y contains 10 columns and 6 rows:
1 2 3 4 5 6 7 8 9 10
0 1 0 0 1 1 1 0 1 0
1 1 1 0 1 0 1 0 0 0
0 0 0 0 1 0 0 1 0 1
1 0 1 1 0 1 1 1 0 0
0 0 0 0 0 0 1 0 0 0
1 1 0 1 0 0 0 0 1 1
I'd like to use the vectors X, H and I to make column selections in data.frame Y, using the 1's in each vector as the selection criterion.
So the result for vector X, using the 1's as selection criterion, should be:
X <- c(1,1,0,1,0, 1,1, 0, NA,NA)
1 2 4 6 7
0 1 0 1 1
1 1 0 0 1
0 0 0 0 0
1 0 1 1 1
0 0 0 0 1
1 1 1 0 0
For vector H, using the 1's as selection criterion:
H <- c(0,0,1,0,NA,1,NA,1, 1, 1 )
3 6 8 9 10
0 1 0 1 0
1 0 0 0 0
0 0 1 0 1
1 1 1 0 0
0 0 0 0 0
0 0 0 1 1
For vector I, using the 1's as selection criterion:
I <- c(0,0,0,0,0, 1,NA,NA,NA,1 )
6 10
1 0
0 0
0 1
1 0
0 0
0 1
For convenience and speed I'd like to use a loop. It might be something like this:
all.ones <- lapply[,function(x) x %in% 1]
In the outcome (all.ones), the result for each vector should stay separate. For example:
X 1,2,4,6,7
H 3,6,8,9,10
I 6,10
The standard way of doing this is using the %in% operator:
Y[, X %in% 1]
To do this for multiple vectors (assuming you want an AND operation):
mylist = list(X, H, I, D, E, K)
Y[, Reduce(`&`, lapply(mylist, function(x) x %in% 1))]
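For example, to keep the result for each vector separate (as asked in the question), a small sketch using the X, H and I defined above could be:
all.ones <- lapply(list(X = X, H = H, I = I), function(x) which(x %in% 1))
all.ones
# $X
# [1] 1 2 4 6 7
#
# $H
# [1]  3  6  8  9 10
#
# $I
# [1]  6 10
sel <- lapply(all.ones, function(i) Y[, i])   # one column selection of Y per vector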
The problem is the NA; use which() to get around it. Consider the following:
x <- c(1,0,1,NA)
x[x==1]
[1] 1 1 NA
x[which(x==1)]
[1] 1 1
How about this?
idx <- which(X==1)
Y[,idx]
EDIT: For six vectors, do
idx <- which(X==1 & H==1 & I==1 & D==1 & E==1 & K==1)
Y[,idx]
Replace & with | if you want all columns of Y where at least one of the lists has a 1.
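For example, with just the three vectors from the question, the OR version would be:
idx <- which(X==1 | H==1 | I==1)
Y[,idx]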
I want to create a new column based on some conditions imposed on several columns. For example, here is an example dataset:
a <- data.frame(x=c(1,0,1,0,0), y=c(0,0,0,0,0), z=c(1,1,0,0,0))
a
x y z
1 1 0 1
2 0 0 1
3 1 0 0
4 0 0 0
5 0 0 0
Specifically, if a 1 is present anywhere in a particular row, the new column returns 1; if all values are 0, it returns 0. So the dataset with the new column will be
x y z w
1 1 0 1 1
2 0 0 1 1
3 1 0 0 1
4 0 0 0 0
5 0 0 0 0
My initial thought was to use %in%, but I couldn't get the result I want. Thank you for your help!
If your data frame consists of binary values, i.e., only 0 and 1, you can try the code below with rowSums:
a$w <- +(rowSums(a)>0)
such that
> a
x y z w
1 1 0 1 1
2 0 0 1 1
3 1 0 0 1
4 0 0 0 0
5 0 0 0 0
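As a side note, the leading + simply coerces the logical vector rowSums(a) > 0 to integer; an equivalent, more explicit spelling would be:
a$w <- as.integer(rowSums(a) > 0)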
We can use rowMaxs from matrixStats
library(matrixStats)
a$w <- rowMaxs(as.matrix(a))
a$w
#[1] 1 1 1 0 0
You can find the max of each row:
a$w <- do.call(pmax, a)
a
# x y z w
#1 1 0 1 1
#2 0 0 1 1
#3 1 0 0 1
#4 0 0 0 0
#5 0 0 0 0
which can also be done with apply:
a$w <- apply(a, 1, max)
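Note that apply(a, 1, max) coerces the data frame to a matrix and loops over the rows, so on larger purely numeric data the vectorised do.call(pmax, a) above will usually be faster.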
I just started using R for a psych class, so please go easy on me. I watched a bunch of YouTube videos on for loops, but none answered my question. I have 4 data frames (A, B, C, D), each with 25 columns. I want to combine the nth column from each data frame and save the result as an object, like so:
Q1 <- cbind(A[1], B[1], C[1], D[1])
Q2 <- cbind(A[2], B[2], C[2], D[2])
How can I set a loop to do this for all 25 so I don’t have to do it manually?
Thanks in advance
Each of my data frames looks like this (with column headings reflecting the letter of the data frame, i.e. B has QB1, QB2, etc.):
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15
1 1 2 2 0 0 2 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
3 1 0 0 0 0 0 1 0 0 2 1 1 0 0 0
4 1 0 0 0 0 0 1 1 0 1 0 2 0 0 0
In order to do it in a for loop, you need to use assign() from base R and eval_tidy() and sym() from rlang. Basically, you will need to evaluate strings as variables.
Create simulation data
library(rlang)
nrows = 10
ncols = 25
df_names <- c("A","B","C","D")
for(df_name in df_names){
# assign value to a string as variable
assign(
df_name,
as.data.frame(
matrix(
data = sample(
c(0,1),
size = nrows * ncols,
replace = TRUE
),
ncol = 25
)
)
)
# rename columns
assign(
df_name,
setNames(eval_tidy(sym(df_name)),paste0("Q",df_name,1:ncols))
)
}
Show A
> head(A)
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15 QA16 QA17 QA18 QA19 QA20 QA21 QA22 QA23 QA24 QA25
1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 1 1
2 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0
3 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1
4 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1
5 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1
6 1 1 0 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0
To answer your question:
This should create 25 variables from Q1 to Q25:
# assign dataframes from Q1 to Q25
for(i in 1:25){
new_df_name <- paste0("Q",i)
# initialize Qi with the same number of rows as A,B,C,D ...
assign(
new_df_name,
data.frame(tmp = matrix(NA, nrow = nrows))
)
# loop A,B,C,D ... and bind them
for(df_name in df_names){
assign(
new_df_name,
cbind(
eval_tidy(sym(new_df_name)),
eval_tidy(sym(df_name))[,i,drop = FALSE]
)
)
}
# drop tmp to clean up
assign(
new_df_name,
eval_tidy(sym(new_df_name))[,-1]
)
}
Show result:
> Q25
QA25 QB25 QC25 QD25
1 1 0 1 1
2 0 1 0 0
3 1 1 0 0
4 1 0 1 1
5 1 1 0 0
6 0 1 1 1
7 1 0 0 0
8 0 0 0 1
9 1 1 1 0
10 0 0 1 1
The code would be much simpler if you saved the results in a list using map(); the main complexity here comes from assigning values to separate variables.
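For instance, a minimal list-based sketch with base lapply() (assuming A, B, C, D exist as created above; purrr::map() would be analogous):
Q <- lapply(1:25, function(i) cbind(A[i], B[i], C[i], D[i]))
names(Q) <- paste0("Q", 1:25)
# Q[["Q1"]] is the same as Q1 <- cbind(A[1], B[1], C[1], D[1]), and so on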
You can combine some dplyr verbs in a for loop to combine the columns from each data set and assign them to 25 new objects.
# merge data, gather, split by var numbers, assign each df to environment
library(dplyr)
library(tidyr)
for (i in 1:25) {
df <- cbind(q1,q2,q3,q4) %>% mutate(id=row_number()) %>%
gather(k,v,-id) %>%
mutate(num=sub('A|B|C|D','',k)) %>%
filter(num==i) %>% select(-num) %>% spread(k,v)
assign(paste0('df',i),df)
}
ls(pattern = 'df')
[1] "df1" "df10" "df11" "df12" "df13" "df14" "df15" "df16" "df17" "df18" "df19" "df2"
[13] "df20" "df21" "df22" "df23" "df24" "df25" "df3" "df4" "df5" "df6" "df7" "df8"
[25] "df9"
Code to create initial 4 toy data frames.
# create four toy data frames
q1 <- data.frame(matrix(runif(100),ncol=25))
q2 <- data.frame(matrix(runif(100),ncol=25))
q3 <- data.frame(matrix(runif(100),ncol=25))
q4 <- data.frame(matrix(runif(100),ncol=25))
# set var names for each toy data
names(q1) <- sub('X','A',names(q1))
names(q2) <- sub('X','B',names(q2))
names(q3) <- sub('X','C',names(q3))
names(q4) <- sub('X','D',names(q4))
My data consists of 6 strings per element; each string has 6 characters. The data contains whitespace too.
I want to know how many times each string is repeated across all columns.
For example, P67809 is repeated 2 times, in column a and column d,
so the output should look like:
string No columns
P67809 2 a,d
Based on this function I can assign a row number to each string
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
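For illustration, on a made-up cell value (mydata itself isn't shown here), the function returns the row index for every string it finds:
normalize(c("P67809;Q9Y2T7", "P42356"), ";")
# P67809 Q9Y2T7 P42356
#      1      1      2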
Then I apply the function to each column like this:
myS <- lapply(mydata, normalize, ";")
but I don't know how to then search the result and get the output shown above.
We could melt the data from 'wide' to 'long' format, split the 'value' column on ';' to get a list output, and set the names of the list to the 'variable' column of 'dM'. Then we stack the list to a two-column output and get the frequency count with table(), stored as 'tbl'. It may be easier to understand the result from the 'tbl' output.
library(reshape2)
dM <- melt(mydata, id.var=NULL)
lst1 <- setNames(strsplit(dM$value, ";"), dM$variable)
tbl <- table(stack(lst1)[2:1])
tbl
values
#ind A4QPH2 O60814 P0CG47 P0CG48 P14923 P15924 P19338 P35908 P42356 P57053 P58876 P62750 P62807 P62851 P62979 P63241 P67809 Q02413 Q06830 Q07955 Q16658 Q5QNW6 Q6IS14 Q8N8J0 Q93079 Q969S3
# a 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
# b 3 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0
# c 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 0 0 0
# d 0 0 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 1
# values
#ind Q99877 Q99879 Q9Y2T7
# a 0 0 1
# b 0 0 0
# c 0 0 0
# d 1 1 1
We get the total number of each element with colSums.
cS <- colSums(tbl)
If we need to get the output as in the OP's post, we can melt the list output to create a 2-column data.frame. From this, we convert to 'data.table' (setDT()); grouped by the 'value' column, we get the number of unique elements of 'variable' (No) and also paste together those unique elements (columns).
library(data.table)
res <- setDT(melt(lst1))[, list(No = uniqueN(L1),
    columns = toString(unique(L1))), .(string = value)]
head(res,2)
# string No columns
#1: P67809 2 a, d
#2: Q9Y2T7 2 a, d
One approach might be:
res <- apply(mydata, 2, function(x) unlist(strsplit(x, ";")))
un <- unique(unlist(res))
res2 <- sapply(un, function(x) lapply(res, function(y) as.numeric(x %in% y)))
res2
P67809 Q9Y2T7 P42356 Q8N8J0 A4QPH2 P35908 P19338 P15924 P14923 Q02413 P63241 Q6IS14
a 1 1 1 1 1 1 1 1 1 0 0 0
b 0 0 0 0 0 0 0 0 0 1 1 1
c 0 0 0 0 0 0 0 0 0 1 1 1
d 1 1 0 0 0 0 0 0 0 0 0 0
P62979 P0CG47 P0CG48 Q16658 P62851 Q07955 Q06830 P62807 O60814 P57053 Q99879 Q99877
a 0 0 0 0 0 0 0 0 0 0 0 0 0
b 1 1 1 1 0 0 0 0 0 0 0 0 0
c 1 1 1 1 1 1 0 0 0 0 0 0 0
d 1 1 1 0 0 0 1 1 1 1 1 1 1
Q93079 Q5QNW6 P58876 P62750 Q969S3
a 0 0 0 0 0
b 0 0 0 0 0
c 0 0 0 0 0
d 1 1 1 1 1
as.data.frame(t(apply(t(res2), 1, function(x) cbind(sum(as.numeric(x)), paste(names(x)[which(as.logical(x))], collapse = ",")))))
V1 V2
P67809 2 a,d
Q9Y2T7 2 a,d
P42356 1 a
Q8N8J0 1 a
A4QPH2 1 a
P35908 1 a
P19338 1 a
P15924 1 a
P14923 1 a
Q02413 2 b,c
P63241 2 b,c
Q6IS14 2 b,c
P62979 3 b,c,d
P0CG47 3 b,c,d
P0CG48 3 b,c,d
2 b,c
Q16658 1 c
P62851 1 c
Q07955 1 d
Q06830 1 d
P62807 1 d
O60814 1 d
P57053 1 d
Q99879 1 d
Q99877 1 d
Q93079 1 d
Q5QNW6 1 d
P58876 1 d
P62750 1 d
Q969S3 1 d
An alternative approach with cSplit from splitstackshape and gather from tidyr.
library(splitstackshape)
library(tidyr)
library(dplyr)
splitted <- cSplit(mydata, splitCols = names(mydata), sep = ";") %>% gather() # Split cols and melt data
splitted$key <- substring(splitted$key, 1, 1) # Lose irrelevant string
table(splitted) # Generate frequency table
I have a txt file (data5.txt):
1 0 1 0 0
1 1 1 0 0
0 0 1 0 0
1 1 1 0 1
0 0 0 0 1
0 0 1 1 1
1 0 0 0 0
1 1 1 1 1
0 1 0 0 1
1 1 0 0 0
I need to count the frequency of ones and zeros in each column.
If the frequency of ones >= the frequency of zeros, then I will print 1 after the last row for that column.
I'm new to R, but I tried this and got an error:
Error in if (z >= d) data[n, i] = 1 else data[n, i] = 0 :
missing value where TRUE/FALSE needed
my code:
data<-read.table("data5.txt", sep="")
m =length(data)
d=length(data[,1])/2
n=length(data[,1])+1
for(i in 1:m)
{
z=sum(data[,i])
if (z>=d) data[n,i]=1 else data[n,i]=0
}
You may try this:
rbind(df, ifelse(colSums(df == 1) >= colSums(df == 0), 1, NA))
# V1 V2 V3 V4 V5
# 1 1 0 1 0 0
# 2 1 1 1 0 0
# 3 0 0 1 0 0
# 4 1 1 1 0 1
# 5 0 0 0 0 1
# 6 0 0 1 1 1
# 7 1 0 0 0 0
# 8 1 1 1 1 1
# 9 0 1 0 0 1
# 10 1 1 0 0 0
# 11 1 1 1 NA 1
Update, thanks to a nice suggestion from @Arun:
rbind(df, ifelse(colSums(df == 1) >= ceiling(nrow(df)/2), 1, NA))
or even:
rbind(df, ifelse(colSums(df == 1) >= nrow(df)/2, 1, NA))
Thanks to @SvenHohenstein.
Possibly I misinterpreted your intended results. If you want 0 when the frequency of ones is not equal to or larger than the frequency of zeros, then this suffices:
rbind(df, colSums(df) >= nrow(df) / 2)
Again, thanks to @SvenHohenstein for his useful comments!
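As a side note on the error in the posted loop: assigning data[n, 1] on the first iteration grows the data frame to 11 rows and fills the other columns of that new row with NA, so on the next iteration sum(data[, i]) is NA and the if() condition fails. A minimal fix of the original approach (just a sketch, reusing the m, n and d from the posted code) is to sum only the original rows:
for (i in 1:m) {
  z <- sum(data[1:(n - 1), i])   # exclude the newly added row
  if (z >= d) data[n, i] <- 1 else data[n, i] <- 0
}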
Step 1: I have a simplified dataframe like this:
df1 = data.frame (B=c(1,0,1), C=c(1,1,0)
, D=c(1,0,1), E=c(1,1,0), F=c(0,0,1)
, G=c(0,1,0), H=c(0,0,1), I=c(0,1,0))
B C D E F G H I
1 1 1 1 1 0 0 0 0
2 0 1 0 1 0 1 0 1
3 1 0 1 0 1 0 1 0
Step 2: I want to do row-wise subtraction, i.e. (row1 - row2), (row1 - row3) and (row2 - row3):
row1-row2 1 0 1 0 0 -1 0 -1
row1-row3 0 1 0 1 -1 0 -1 0
row2-row3 -1 1 -1 1 -1 1 -1 1
Step 3: replace all -1 with 0:
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Would you mind teaching me how to do so?
I like using the plyr library for things like this, with the combn function to generate all possible pairs of rows/columns.
require(plyr)
combos <- combn(nrow(df1), 2)
adply(combos, 2, function(x) {
out <- data.frame(df1[x[1] , ] - df1[x[2] , ])
out[out == -1] <- 0
return(out)
}
)
Results in:
X1 B C D E F G H I
1 1 1 0 1 0 0 0 0 0
2 2 0 1 0 1 0 0 0 0
3 3 0 1 0 1 0 1 0 1
If necessary, you can drop the first column; plyr spits that out automagically for you.
Similar questions:
Sum pairwise rows with R?
Chi Square Analysis using for loop in R
Compare one row to all other rows in a file using R
For the record, I would do this:
cmb <- combn(seq_len(nrow(df1)), 2)
out <- df1[cmb[1,], ] - df1[cmb[2,], ]
out[out < 0] <- 0
rownames(out) <- apply(cmb, 2,
function(x) paste("row", x[1], "-row", x[2], sep = ""))
This yields (the last line above is a bit of sugar, and may not be needed):
> out
B C D E F G H I
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Which is fully vectorised and exploits indices to extend/extract the elements of df1 required for the row-by-row operation.
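To see the indexing at work on the 3-row example:
cmb
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    3
so df1[cmb[1,], ] stacks rows 1, 1, 2 and df1[cmb[2,], ] stacks rows 2, 3, 3, and the two data frames are then subtracted element-wise.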
> df2 <- rbind(df1[1,]-df1[2,], df1[1,]-df1[3,], df1[2,]-df1[3,])
> df2
B C D E F G H I
1 1 0 1 0 0 -1 0 -1
2 0 1 0 1 -1 0 -1 0
21 -1 1 -1 1 -1 1 -1 1
> df2[df2==-1] <- 0
> df2
B C D E F G H I
1 1 0 1 0 0 0 0 0
2 0 1 0 1 0 0 0 0
21 0 1 0 1 0 1 0 1
If you'd like to change the name of the rows to those in your example:
> rownames(df2) <- c('row1-row2', 'row1-row3', 'row2-row3')
> df2
B C D E F G H I
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1
Finally, if the number of rows is not known ahead of time, the following should do the trick:
df1 = data.frame (B=c(1,0,1), C=c(1,1,0), D=c(1,0,1), E=c(1,1,0), F=c(0,0,1), G=c(0,1,0), H=c(0,0,1), I=c(0,1,0))
n <- length(df1[,1])
ret <- data.frame()
for (i in 1:(n-1)) {
for (j in (i+1):n) {
diff <- df1[i,] - df1[j,]
rownames(diff) <- paste('row', i, '-row', j, sep='')
ret <- rbind(ret, diff)
}
}
ret[ret==-1] <- 0
print(ret)