Using a for() loop to fill one dataframe with another in R

I have two dataframes and I wish to insert the values of one dataframe into another (let's call them DF1 and DF2).
DF1 consists of 2 columns 1 and 2. Column 1 (col1) contains characters a to z and col2 has values associated with each character (from a to z)
DF2 is a dataframe with 3 columns. The first two consist of every combination of DF1$col1 so: aa ab ac ad etc; where the first letter is in col1 and the second letter is in col2
I want to create a simple mathematical model utilizing the values in DF1$col2 to see the outcomes of every possible combination of objects in DF1$col1
The first step I wanted to do is to transfer values from DF1$col2 to DF2$col3 (values in DF2$col3 should correspond to the values in DF2$col1), but that's where I'm stuck. I currently have
for(j in 1:length(DF2$col1))
{
## this part is to use the characters in DF2$col1 as an input
## to yield the output for DF2$col3--
input=c(DF2$col1)[j]
## This is supposed to use the values found in DF1$col2 to fill in DF2$col3
g=DF1[(DF1$col2==input),"pred"]
## This is so that the values will fill in DF2$col3--
DF2$col3=g
}
When I run this, DF2$col3 will be filled up with the same value for a specific character from DF1 (e.g. DF2$col3 will have all the rows filled with the value associated with character "a" from DF1)
What exactly am I doing wrong?
Thanks a bunch for your time

You should really use merge for this, as @Aaron suggested in his comment above, but if you insist on writing your own loop, then the problem is in your last line, where you assign the value of g to the whole col3 column. You should use the j index there as well, like:
for(j in 1:length(DF2$col1))
{
DF2$col3[j] = DF1[which(DF1$col2 == DF2$col1[j]), "pred"]
}
If this does not work, then please also post some sample data so that we can help in more detail (I do not know, and can only guess, what "pred" could be).
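For readers who want to see the merge() route mentioned above, here is a minimal sketch. The toy data (letters with the values 1:26) and the name lookup are assumptions for illustration, not the asker's data. The value column is named col3 in the lookup table so that it does not clash with DF2$col2. Note that merge() sorts the result by the key by default, so the row order can differ from DF2.

```r
# Toy stand-ins for the asker's data (assumed, not from the question)
DF1 <- data.frame(col1 = letters, col2 = 1:26, stringsAsFactors = FALSE)
DF2 <- expand.grid(col1 = letters, col2 = letters, stringsAsFactors = FALSE)

# Lookup table: key column plus the value, named col3 to avoid a name clash
lookup <- data.frame(col1 = DF1$col1, col3 = DF1$col2, stringsAsFactors = FALSE)

# Join on col1; every row of DF2 picks up the value for its col1 letter
merged <- merge(DF2, lookup, by = "col1")
head(merged)
```

If the original row order of DF2 matters, the match() approach keeps it, whereas merge() would need an explicit reordering step afterwards.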

It sounds like what you are trying to do is a simple join, that is, match DF1$col1 to DF2$col1 and copy the corresponding value from DF1$col2 into DF2$col3. Try this:
DF1 <- data.frame(col1=letters, col2=1:26, stringsAsFactors=FALSE)
DF2 <- expand.grid(col1=letters, col2=letters, stringsAsFactors=FALSE)
DF2$col3 <- DF1$col2[match(DF2$col1, DF1$col1)]
This uses the function match(), which, as the documentation states, "returns a vector of the positions of (first) matches of its first argument in its second." The values you have in DF1$col1 are unique, so there will not be any problem with this method.
As a side note, in R it is usually better to vectorize your work rather than using explicit loops.

Not sure I fully understood your question, but you can try this:
df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2
The last command uses recycling (it could be written as rep(df1$col2, 26) as well).
The results are shown below:
> head(df1, n=3)
col1 col2
1 a 68
2 b 73
3 c 45
> tail(df1, n=3)
col1 col2
24 x 22
25 y 4
26 z 17
> head(df2, n=3)
col1 col2 col3
1 a a 68
2 b a 73
3 c a 45
> tail(df2, n=3)
col1 col2 col3
674 x z 22
675 y z 4
676 z z 17
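The recycling shortcut only gives the right answer because expand.grid() varies its first argument fastest, so df2$col1 cycles through the letters in the same order as df1. A small check of that assumption, comparing recycling against an explicit lookup with match(), might look like this (the seed and the variable names by_recycling and by_lookup are arbitrary choices for the sketch):

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df1 <- data.frame(col1 = letters, col2 = sample(1:100, 26),
                  stringsAsFactors = FALSE)
df2 <- expand.grid(col1 = df1$col1, col2 = df1$col1,
                   stringsAsFactors = FALSE)

# Recycling: repeat the 26 values until all 676 rows are filled
by_recycling <- rep(df1$col2, times = nrow(df2) / nrow(df1))

# Explicit lookup: find each df2$col1 letter in df1$col1
by_lookup <- df1$col2[match(df2$col1, df1$col1)]

# TRUE only while df2$col1 follows the row order of df1
same <- identical(by_recycling, by_lookup)
same
```

If df2 were sorted or built differently, the lookup version would still be correct while the recycled one would silently misalign.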

Related

Nested for loops, different in R

d3:
Col1 Col2
PBR569 23
PBR565 22
PBR565 22
PBR565 22
I am using this loop:
for ( i in 1:(nrow (d3)-1) ){
for (j in (i+1):nrow(d3)) {
if(c(i) == c(j)) {
print(c(j))
# d4 <- subset.data.frame(c(j))
}
}
}
I want to compare all the rows in Col1 and eliminate the ones that are not the same. Then I want to output a data frame with only the ones that have the same values in col1.
Expected Output:
Col1 Col2
PBR565 22
PBR565 22
PBR565 22
Not sure what's up with my nested loop. Is it because I don't specify the column names?
The OP has requested to compare all the rows in Col1 and eliminate the ones that are not the same.
If I understand correctly, the OP wants to remove all rows where the value in Col1 appears only once and to keep only those rows where the value appears two or more times.
This can be accomplished by finding duplicated values in Col1. The duplicated() function marks the second and subsequent appearances of a value as duplicated. Therefore, we need to scan forward and backward and combine both results:
d3[duplicated(d3$Col1) | duplicated(d3$Col1, fromLast = TRUE), ]
Col1 Col2
2 PBR565 22
3 PBR565 22
4 PBR565 22
The same can be achieved by counting the appearances using the table() function as suggested by Ryan. Here, the counts are filtered to keep only those entries which appear two or more times.
t <- table(d3$Col1)
d3[d3$Col1 %in% names(t)[t >= 2], ]
Please, note that this is different from Ryan's solution which keeps only the rows whose value appears most often. Only one value is picked, even in case of ties. (For the given small sample dataset both approaches return the same result.)
Ryan's answer can be re-written in a slightly more concise way
d3[d3$Col1 == names(which.max(t)), ]
Data
d3 <- data.table::fread(
"Col1 Col2
PBR569 23
PBR565 22
PBR565 22
PBR565 22", data.table = FALSE)
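To make the difference between the two filters concrete, here is a small sketch on made-up data with a tie (the data frame d and its values are assumptions for illustration). The "two or more" filter keeps both tied values, while which.max() picks only the first one.

```r
# Made-up data: "A" and "B" both appear twice, "C" once
d <- data.frame(Col1 = c("A", "A", "B", "B", "C"),
                Col2 = 1:5,
                stringsAsFactors = FALSE)

counts <- table(d$Col1)

# Keep every value that appears at least twice
keep_dups <- d[d$Col1 %in% names(counts)[counts >= 2], ]

# Keep only the single most frequent value (first one wins on ties)
keep_top <- d[d$Col1 == names(which.max(counts)), ]

nrow(keep_dups)  # rows for both "A" and "B"
nrow(keep_top)   # rows for "A" only
```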

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that was converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag" that differentiates the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my dataframe), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
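Applied to the two cases from the question, a minimal base R sketch could look like the following. The small pos and neg data frames are invented stand-ins for the positive and negative term matrices; only mtcars is real.

```r
# Case 1: append "_sample" to every column name of mtcars (on a copy)
cars_tagged <- mtcars
names(cars_tagged) <- paste0(names(mtcars), "_sample")

# Case 2: tag term columns by the field they came from
# (pos and neg are made-up examples, not the asker's data)
pos <- data.frame(hello = c(1, 0), great = c(0, 2))
neg <- data.frame(hello = c(0, 1), awful = c(3, 0))
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))

# The shared word "hello" now yields two distinct columns
combined <- cbind(pos, neg)
names(combined)
```

Because paste0() is vectorized over the names, no sapply() or lapply() call is needed.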

Using apply() and If() statement to sum() two columns

I have a dataframe with 2 columns and I want to use an if/else condition when using the apply function to sum() the rows in each column. Specifically, for all the rows where Col1 >= Col2, take the sum() of Col1 and store it in variable a, and for all the rows where Col1 < Col2, take the sum() of Col1 and store it in variable b.
For example
df<-data.frame(Col1=c(1,2,3,4,5),Col2=c(5,4,3,2,1))
df
Col1 Col2
1 5
2 4
3 3
4 2
5 1
There are three instances in which Col1 >= Col2, so in Col1 I take the sum() of 3+4+5, which is 12. There are two instances in which Col1 < Col2, so in Col1 I take the sum() of 1+2, which is 3. So
>a
12
>b
3
This is the code I created, but it's still in the works:
apply(df, 1, function(x)
if(df$Col1 >= df$Col2)
a<-sum(df$Col1 >= df$Col2)
else
b<-sum(df$Col1 < df$Col2)
)
The code here doesn't work because it simply adds the number of times the condition is true and not the actual values.
There's really no need for any *apply() functions here, as these are fully vectorized operations. Here's how I might go about it, putting both results into a nice list.
with(df, {
x <- Col1 >= Col2
list(a = sum(Col1[x]), b = sum(Col1[!x]))
})
# $a
# [1] 12
#
# $b
# [1] 3
I'm not sure why you would want to tackle this problem using -apply-. It seems like overkill. Also note that your -apply- statement lacks the margin argument with which you indicate whether you want to apply the function to rows, columns or both (also, the line defining df needs another closing parenthesis).
A simple two-line solution would be this:
df<-data.frame(Col1=c(1,2,3,4,5),Col2=c(5,4,3,2,1))
a <- sum(df$Col1[df$Col1 >= df$Col2])
b <- sum(df$Col1[df$Col1 < df$Col2])
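The question's attempt returned counts rather than totals because sum() applied to a logical vector counts the TRUE entries. A short sketch of the difference, using the data from the question (the names cond and n_true are just illustrative):

```r
df <- data.frame(Col1 = c(1, 2, 3, 4, 5), Col2 = c(5, 4, 3, 2, 1))

# Logical vector: TRUE where Col1 >= Col2
cond <- df$Col1 >= df$Col2

# Summing the logical vector counts how often the condition holds
n_true <- sum(cond)

# Indexing Col1 with the logical vector sums the actual values
a <- sum(df$Col1[cond])    # rows where Col1 >= Col2
b <- sum(df$Col1[!cond])   # rows where Col1 <  Col2

c(n_true = n_true, a = a, b = b)
```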

Average across some rows in R

I have not found a way to take an average across SOME columns in R when working with a data frame table. Basically, I want to take the average of the 3 controls (CTR_R1+CTR_R2+CTR_R3) and insert that value as another column right after CTR_R3 (see below). The same for the TRT.
Is there a way to take the average and insert it in a specific location?
GeneID|CTR_R1|CTR_R2|CTR_R3|CTR_AVG|TRT_R1| TRT_R2| TRT_R3|TRT_AVG|pValue
How about
df$CTR_AVG <- rowMeans(df[,2:4])
df$TRT_AVG <- rowMeans(df[,6:8])
This code should work for you, if your data.frame is named df:
df$CTR_AVG <- ( df$CTR_R1 + df$CTR_R2 + df$CTR_R3 ) / 3
That is assuming that the CTR_AVG column already exists, as you showed in your question. If it does not, the code will put the column at the end of the data.frame. To move it to the right spot, you will need to select the columns in the correct order, like so:
df[ , c( 'GeneID', 'CTR_R1', 'CTR_R2', 'CTR_R3', 'CTR_AVG', 'TRT_R1', 'TRT_R2', 'TRT_R3','TRT_AVG','pValue' ) ]
The code below should work even if there are many CTR or TRT columns (i.e. 100s). But I am guessing @beginneR's solution is faster.
indx <- grep("^CTR", colnames(df1), value=TRUE)
indxT <- grep("^TRT", colnames(df1), value=TRUE)
df1[,c('CTR_Avg', 'TRT_Avg')] <- lapply(list(indx, indxT),
function(x) Reduce(`+`, df1[,x])/length(x))
or you can use rowMeans in the above step.
df2 <- df1[,c('GeneID', indx, 'CTR_Avg', indxT, 'TRT_Avg', 'pValue')]
head(df2,2)
# GeneID CTR_R1 CTR_R2 CTR_R3 CTR_Avg TRT_R1 TRT_R2 TRT_R3 TRT_Avg pValue
#1 1 6 2 10 6.000000 10 11 15 12 0.091
#2 2 5 12 8 8.333333 5 3 13 7 0.051
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(1:20,20*6, replace=TRUE), ncol=6))
colnames(df1) <- c("CTR_R1", "CTR_R2", "CTR_R3", "TRT_R1", "TRT_R2", "TRT_R3")
df1 <- cbind(GeneID=1:20, df1,
pValue=sample(seq(0.001, 0.10, by=0.01), 20, replace=TRUE))
make some dummy data
df=data.frame(CTR_R1=1:10,CTR_R2=1:10,CTR_R3=1:10,somethingelse=1:10)
get a new column
df$CTR_AVG=apply(df[c("CTR_R1","CTR_R2","CTR_R3")],1,mean)
Thanks so much for your replies. I am sorry I did not phrase my original question better. I meant to ask how to write one script to take the average and place that value in the right place. I do not have in my table the column that says "CTR_AVG", nor the column "TRT_AVG".
I was wondering if i could do it more 'elegantly' than doing what i did below (which works too).
Many thanks.
#
names (edgeR_table)
"GeneID" "CTR_R1" "CTR_R2" "CTR_R3" "TRT_R1" "TRT_R2" "TRT_R3" "logFC" "logCPM" "LR" "PValue" "FDR"
#
edgeR_table$CTR_AVG <- rowMeans(edgeR_table[,2:4])
edgeR_table$TRT_AVG <- rowMeans(edgeR_table[,5:7])
edgeR_table <- edgeR_table[, c(1,2,3,4,13,5,6,7,14,8,9,10,11,12)]
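One way to avoid hard-coding column numbers is a small helper that places a new column directly after a named one. The function insert_after() below is a hypothetical helper written for this sketch (it is not part of base R or edgeR), the toy table tab is invented, and the helper assumes the new column name does not already exist.

```r
# Hypothetical helper: add `values` as column `new_name`,
# positioned right after the existing column `after_name`
insert_after <- function(df, new_name, values, after_name) {
  pos <- match(after_name, names(df))
  stopifnot(!is.na(pos))
  df[[new_name]] <- values
  ord <- append(setdiff(names(df), new_name), new_name, after = pos)
  df[, ord, drop = FALSE]
}

# Invented toy table with the same layout as in the question
tab <- data.frame(GeneID = 1:3,
                  CTR_R1 = c(1, 2, 3), CTR_R2 = c(2, 3, 4), CTR_R3 = c(3, 4, 5),
                  TRT_R1 = c(4, 4, 4), TRT_R2 = c(5, 5, 5), TRT_R3 = c(6, 6, 6),
                  PValue = c(0.01, 0.02, 0.03))

# Pick the replicate columns by name pattern instead of by position
ctr <- grep("^CTR_R", names(tab), value = TRUE)
trt <- grep("^TRT_R", names(tab), value = TRUE)

tab <- insert_after(tab, "CTR_AVG", rowMeans(tab[, ctr]), "CTR_R3")
tab <- insert_after(tab, "TRT_AVG", rowMeans(tab[, trt]), "TRT_R3")
names(tab)
```

Selecting by name keeps the script working if columns are added or reordered upstream, which fixed indices like c(1,2,3,4,13,...) do not.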

Filtering a data frame

I have read in a csv file in matrix form (having m rows and n columns). I want to filter the matrix by conducting a filter in verbal form:
Select all values from column x where the values of an another column in this row is equal to "blabla".
It is like a select statement in database where I say I am interested in a subset of the matrix where these constraints need to be satisfied.
How can I do it in R? I have the data as a dataframe and can access it by the headers. data["column_values" = "15"] does not give me back only the rows where the column named column_values has the value 15.
Thanks
You said you just wanted the column x values where column_values was 15, right?
subset(dat, column_values==15, select=x)
I think this may come back as a dataframe, so it's possible you may need to unlist() it and maybe even "unfactor" it.
> dat
Subject Product
1 1 ProdA
2 1 ProdB
3 1 ProdC
4 2 ProdB
5 2 ProdC
6 2 ProdD
7 3 ProdA
8 3 ProdB
> subset(dat, Subject==2, Product)
Product
4 ProdB
5 ProdC
6 ProdD
> unlist( subset(dat, Subject==2, Product) )
Product1 Product2 Product3
ProdB ProdC ProdD
Levels: ProdA ProdB ProdC ProdD
> as.character( unlist( subset(dat, Subject==2, Product) ) )
[1] "ProdB" "ProdC" "ProdD"
If you want all of the columns you can drop the third argument (the select= argument):
subset(dat, Subject==2 )
Subject Product
4 2 ProdB
5 2 ProdC
6 2 ProdD
Assuming that dat is the data frame in question, col is the name of the column and "value" is the value that you want, you can do
dat[dat$col=="value",]
That fetches all of the rows of dat for which dat$col=="value", and all of the columns.
First, note that a matrix and a data.frame are different things in R. I imagine you have a data.frame (as that is what is returned by read.csv()). data.frames have named columns (if you don't give them names, generic ones are created for you).
You can subset a data.frame by indicating both what rows you want and/or what columns you want. The easiest way to specify which rows is with a logical vector, often built out of comparisons using specific columns of the data.frame. For example data[["column values"]] == "15" would make a logical vector which is TRUE if the corresponding entry in the column column values is the string "15" (since it is in quotes, it is a string, not a number). You can make as complicated a selection criteria as you like (combining logical vectors with & and |) to specify the rows you want. This vector becomes the first argument in the indexing.
A list of column names or numbers can be the second argument. If either argument is missing, all rows (or columns) are assumed.
Putting this all together, you get examples like
data[data[["column values"]] == "15", ]
or using an actual data set (mtcars)
mtcars[mtcars$am == 1, ]
mtcars[mtcars$am == 1 & mtcars$hp > 100, "mpg"]
mtcars[mtcars$am == 1 & mtcars$hp > 100, "mpg", drop=FALSE]
mtcars[mtcars$hp > 100, c("mpg", "carb")]
Take a look at what each of the conditionals (first arguments, e.g. mtcars$am == 1 & mtcars$hp > 100) return to get a better sense of how indexing works.
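As a quick way to follow that suggestion, the sketch below inspects the logical vector and compares the return types of the indexing variants with the subset() form, all on the built-in mtcars data. The variable names are arbitrary.

```r
# The condition on its own: one TRUE/FALSE per row of mtcars
cond <- mtcars$am == 1 & mtcars$hp > 100
str(cond)

# A single column name without drop = FALSE gives a plain vector
mpg_vec <- mtcars[cond, "mpg"]

# With drop = FALSE the result stays a one-column data frame
mpg_df <- mtcars[cond, "mpg", drop = FALSE]

# subset() with select= also returns a data frame
mpg_sub <- subset(mtcars, am == 1 & hp > 100, select = mpg)

class(mpg_vec)
class(mpg_df)
class(mpg_sub)
```

One difference worth knowing: subset() drops rows where the condition is NA, while plain logical indexing returns NA-filled rows for them.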
