How to compare two dataframe columns without using for loop? - r

I am used to C++ style coding and am having trouble understanding how to convert code that compares the values of two dataframe columns and creates a new dataframe based on the result, without using for loops. My sample code is given below.
for(i in seq(1,nrow(DF1))){
  for(j in seq(1,nrow(DF2))){
    if(DF1$some_col1[i]==DF2$some_col1[j] && DF2$some_col2[i]!=all_df$some_col2[j]){
      DF3[nrow(DF3)+1,]<- c(DF1$some_col1[i],DF1$some_col2[i],DF2$some_colm[j])
    }
  }
}

I will assume that you want to compare a column (let's call it col1) of a dataframe (let's call it df) to another column of the same dataframe (let's call it col2).
I think you are testing several conditions on several columns of the same dataframe, and if all conditions are met for a row, you want to insert the values from that row into a new data frame.
df = data.frame(col1 = c(1,2,3,8),
                col2 = c(1,3,0,8),
                col3 = c(TRUE,FALSE,TRUE,TRUE))
Now we do what I think you wanted:
newDF = df[(df$col1 == df$col2) & df$col3, ]
Now newDF will be a subset of your dataframe:
  col1 col2 col3
1    1    1 TRUE
4    8    8 TRUE
Modifying it will not alter the original df.
Some clarification:
In R you rarely need index variables, because you rarely need a loop. R vectors support vectorized operations, so instead of writing a for loop whose index walks through the entire column to check a condition, you can just give the condition/operator and the vectors, and R will do the rest:
> VEC1 = c(TRUE,TRUE,FALSE,TRUE)
> VEC2 = c(TRUE,FALSE,FALSE,TRUE)
> VEC1 & VEC2
[1]  TRUE FALSE FALSE  TRUE
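The question's loop compares rows of two different data frames rather than two columns of one. A minimal sketch of a loop-free version of that, assuming the intent is "rows whose some_col1 matches but whose some_col2 differs", is to join the frames with merge and then filter. The column names come from the question; the sample values and the ".DF1"/".DF2" suffixes are made up for illustration.

```r
# Hypothetical sample data; only the column names come from the question.
DF1 <- data.frame(some_col1 = c("a", "b", "c"),
                  some_col2 = c(1, 2, 3),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(some_col1 = c("a", "b", "d"),
                  some_col2 = c(1, 5, 3),
                  stringsAsFactors = FALSE)

# Pair up every row of DF1 with every row of DF2 that has the same some_col1.
# The suffixes keep the two some_col2 columns apart after the join.
joined <- merge(DF1, DF2, by = "some_col1", suffixes = c(".DF1", ".DF2"))

# Keep only the pairs whose some_col2 values differ.
DF3 <- joined[joined$some_col2.DF1 != joined$some_col2.DF2, ]
print(DF3)
```

The merge step replaces the two nested loops, and the logical index replaces the if; whether this matches the original intent depends on what the loop was meant to compare.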

Related

R selecting rows from dataframe using logical indexing: accessing columns by `$` vs `[]`

I have a simple R data.frame object df. I am trying to select rows from this dataframe based on logical indexing from a column col in df.
I am coming from the Python world, where for similar operations I can select using either df[df[col] == 1] or df[df.col == 1] with the same end result.
However, in R, df[df$col == 1] gives an incorrect result compared to df[df[,col] == 1] (confirmed by the summary command). I do not understand this difference, since links like http://adv-r.had.co.nz/Subsetting.html suggest that either way is OK. Also, the str command shows the same output for df$col and df[, col].
Are there any guidelines about when to use the $ operator vs the [] operator?
Edit:
Digging a little deeper and using this question as a reference, it seems that the following code works correctly:
df[which(df$col == 1), ]
However, it is not clear how to guard against NA and when to use which.
You have confused several things.
In
df[,col]
col should be a variable holding the column number (a column name stored as a character string also works). For example,
col = 2
x = df[,col]
would select the second column and store it in x.
In
df$col
col should be the column name. For example,
df=data.frame(aa=1:5,bb=10:14)
x = df$bb
would select the second column and store it in x. But you cannot write df$2.
Finally,
df[[col]]
is the same as df[,col] if col is a number. If col is a character ("character" in R means the same as string in other languages), then it selects the column with this name. Example:
df=data.frame(aa=1:5,bb=10:14)
foo = "bb"
x = df[[foo]]
y = df[[2]]
z = df[["bb"]]
Now x, y, and z all contain a copy of the second column of df.
The notation foo[[bar]] is from lists. The notation foo[,bar] is from matrices. Since dataframe has features of both matrix and list, it can use both.
Use $ when you want to select one specific column by name df$col_name.
Use [] when you want to select one or more columns by number:
df[,1] # select column with index 1
df[,1:3]# select columns with indexes 1 to 3
df[,c(1,3:5,7)] # select columns with indexes 1, 3 to 5 and 7.
[[]] is mostly for lists.
EDIT: df[which(df$col == 1), ] works because the which function returns the integer positions of the elements for which df$col == 1 is TRUE, dropping both FALSE and NA. These row positions are passed to df[ , ], so only the matching rows are returned.
See "Remove rows with NAs (missing values) in data.frame" to find out more about how to deal with missing values. It is usually good practice to deal with missing values in a dataset explicitly.
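To make the NA point concrete, here is a small sketch on a made-up data frame (the column names aa and bb follow the earlier example; the values are hypothetical). It also shows why the comma matters: with a comma the logical vector selects rows, while without it `[` treats the index as a column selector, as for a list.

```r
# Hypothetical data with one missing value in the column being tested.
df <- data.frame(aa = c(1, NA, 1, 2), bb = 10:13)

# Logical row index: the NA in the condition produces an all-NA row.
rows_logical <- df[df$aa == 1, ]

# which() keeps only the positions where the condition is TRUE, so NA is dropped.
rows_which <- df[which(df$aa == 1), ]

# The same result without which(), by excluding NA explicitly.
rows_explicit <- df[!is.na(df$aa) & df$aa == 1, ]

# Without the comma the index selects columns, not rows.
first_column_only <- df[c(TRUE, FALSE)]

print(rows_logical)
print(rows_which)
```

Either which() or an explicit !is.na() guard avoids the spurious NA row; which one to prefer is largely a matter of taste.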

"for" loop not working

I am trying to isolate some values from a data frame
example:
test_df0<- data.frame('col1'= c('string1', 'string2', 'string1'),
'col2' = c('value1', 'value2', 'value3'),
'col3' = c('string3', 'string4', 'string3'))
I want to obtain a new dataframe with only the unique strings from col1 and the corresponding strings from col3 (which will be identical for rows with identical col1).
This is the loop I wrote, but I must be making some obvious mistake:
test_df1<- as.data.frame(matrix(ncol= 2, nrow=0))
colnames(test_df1)<- c('col1', 'col3')
for (i in unique(test_df0$col1)){
  first_matching_row<- match(x = i, table = test_df0$col1)
  temp_df<-
    data.frame('col1'= i,
               'col3'= test_df0[first_matching_row, 'col3'])
  rbind(test_df1, temp_df)}
The resulting test_df1 is empty, though. I cannot spot the mistake in the loop and would be grateful for any suggestion.
Edit: the for loop itself works. If its last line is print(temp_df) instead of the rbind command, I get the correct results. I am not sure why the rbind is not working.
An easier and faster way to do this is to use the duplicated() function. duplicated() looks through an input vector and returns TRUE for each value that has already been seen at an earlier index in the vector. For example:
> duplicated(c(0,0,0,1,2,3,0,3))
[1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
For the first value of 0 it hadn't seen one before, but for the next two it had. Then for 1, 2, and the first 3 it hadn't seen those numbers before, but it had already seen the last two numbers, 0 and 3. This means that !duplicated() will return TRUE for the first occurrence of each distinct value in the data.
We can use this to index into the data frame to get the rows of test_df0 with unique values of col1 as follows:
test_df0[!duplicated(test_df0[["col1"]]), ]
But this returns all columns of the data frame. If we just want col1 and col3 we can index into the columns as well using:
test_df0[!duplicated(test_df0[["col1"]]), c("col1", "col3")]
As for why the loop isn't working, as @Jacob mentions, you aren't assigning the result of rbind to a variable, so the data frame you create disappears after the function call.
You aren't actually assigning the result of rbind to anything! Presumably you need something like:
test_df1 <- rbind(test_df1, temp_df)
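Putting the two answers together, a sketch of the corrected loop next to the duplicated() one-liner, using the sample data from the question. The only assumptions are stringsAsFactors = FALSE (so the strings stay character on older R versions) and the variable name vectorized.

```r
# Sample data from the question; stringsAsFactors = FALSE is an added assumption.
test_df0 <- data.frame(col1 = c("string1", "string2", "string1"),
                       col2 = c("value1", "value2", "value3"),
                       col3 = c("string3", "string4", "string3"),
                       stringsAsFactors = FALSE)

# Loop version: the result of rbind() is assigned back to test_df1.
test_df1 <- data.frame(col1 = character(0), col3 = character(0),
                       stringsAsFactors = FALSE)
for (i in unique(test_df0$col1)) {
  first_matching_row <- match(i, test_df0$col1)
  temp_df <- data.frame(col1 = i,
                        col3 = test_df0[first_matching_row, "col3"],
                        stringsAsFactors = FALSE)
  test_df1 <- rbind(test_df1, temp_df)
}

# Vectorized version: keep the first row for each distinct col1 value.
vectorized <- test_df0[!duplicated(test_df0$col1), c("col1", "col3")]

print(test_df1)
print(vectorized)
```

Both versions give the same two rows; the vectorized one also avoids growing a data frame inside a loop, which gets slow for large inputs.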

R select subset of data

I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only one column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects rows based on conditions for specific columns, whereas I want any of the three/two columns to fulfill my criteria.
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
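As a small worked illustration of the rowSums idea, on a hand-made data frame rather than the random sample from the question (the values and the names ones_per_row and by_count are assumptions for this sketch): data == 1 yields a logical matrix, and rowSums counts the TRUE entries in each row.

```r
# Hypothetical 4-row data set with 3, 2, 1 and 0 ones per row.
data <- data.frame(x = c(1, 1, 1, 2),
                   y = c(1, 1, 3, 2),
                   z = c(1, 2, 3, 3))

# Number of columns equal to 1 in each row.
ones_per_row <- rowSums(data == 1)

# One data frame per count (exactly one, two, or three 1s), kept in a list.
by_count <- lapply(1:3, function(n) data[ones_per_row == n, ])
names(by_count) <- c("one", "two", "three")

print(ones_per_row)
print(by_count[["two"]])
```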
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list, especially when there is some structure that guides this storage, as is the case here. Following @richard-scriven's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
  data[as.integer(names(rowCounts)[rowCounts == i]),])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
You can then pull out the data.frames using either their index, df.oneCountList[[1]], or their name, df.oneCountList[["df.oneOne"]].
@eddi below suggests a nice shortcut to my method of pulling out the table names, using tabulate and the arr.ind argument of which. When which is applied to a multidimensional object such as an array or a data.frame, setting arr.ind = TRUE produces the row and column indices where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the row indices where a 1 is found in any of the variables. The tabulate function is then applied to these row indices and returns a vector in which each element holds the number of 1s in the corresponding row; rows without a 1 are filled in with a 0.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1], nbins = nrow(data))
returns a vector from which you might immediately pull the values. Passing nbins = nrow(data) keeps the vector as long as the data even when the last rows contain no 1. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")

In R, I have two columns and would like to take the sum if a condition is met

I am trying to write a script in R that takes the sum of the values corresponding to a condition on another column.
Say I have two columns, fakeVector and fakeVector1, of a table "total":
fakeVector = c('NTC.H3','NTC.F2','NTC.F22','abc123','sample1')
fakeVector1 = c('1','2','3','4','5')
total=rbind(fakeVector, fakeVector1)
I want to get the values of fakeVector1 where fakeVector equals a specific value.
For example, I would like to grab the fakeVector1 value where fakeVector is "NTC.H3".
How would I do that?
We can try
sum(as.numeric(total["fakeVector1",][total["fakeVector",]=="NTC.H3"]))
total[2,][which(total[1,] == "NTC.H3")]
#[1] "1"
v1 <- c('NTC.H3', 'NTC.F22', 'abc123')
sum(as.numeric(total[2,][which(total[1,] %in% v1)]))
#[1] 8
If your data set is organized as a data.frame and you want to know the sum of one column for every distinct value in another column, you can use the fast data.table package.
# load library
library(data.table)
# get your data
fakeVector = c('NTC.H3','NTC.F2','NTC.F22','abc123','sample1')
fakeVector1 = c('1','2','3','4','5')
total=cbind(fakeVector, fakeVector1)
total <- as.data.table(total)
total$fakeVector1 <- as.numeric(total$fakeVector1)
# Solution
total[, .(mysum = sum(fakeVector1)), by=.(fakeVector)]
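If installing data.table is not an option, the same sums can be sketched in base R. This assumes the data is stored as a data.frame with fakeVector1 already numeric, which differs from the rbind matrix in the question; the names one_sum, some_sum and per_group are made up for the example.

```r
# Same values as in the question, stored column-wise with numeric fakeVector1.
total <- data.frame(fakeVector = c("NTC.H3", "NTC.F2", "NTC.F22", "abc123", "sample1"),
                    fakeVector1 = c(1, 2, 3, 4, 5),
                    stringsAsFactors = FALSE)

# Sum of fakeVector1 where fakeVector equals one specific value.
one_sum <- sum(total$fakeVector1[total$fakeVector == "NTC.H3"])

# Sum over several values at once with %in%.
v1 <- c("NTC.H3", "NTC.F22", "abc123")
some_sum <- sum(total$fakeVector1[total$fakeVector %in% v1])

# Sum for every distinct value of fakeVector.
per_group <- tapply(total$fakeVector1, total$fakeVector, sum)

print(one_sum)
print(some_sum)
print(per_group)
```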

Getting a subset in R

I have a dataframe with 14 columns, and I want to extract a subset with the same columns but keeping only the rows whose ID is repeated (for example, if ID = 2 appears more than once, I want to keep those rows).
To begin, I applied table to my dataframe to see the frequencies of the IDs:
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 appears twice, so I want to see the two observations for this ID.
Afterward, I tried x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the results are wrong (x has 0 observations of 14 variables, and hsb6 contains only NA).
Can you help me? Thanks.
PS: IMSI is a numeric value.
And x <- subset(call.dat, Handset.Manufacturer == "LG") is another example, which works perfectly...
You can use duplicated, a function that returns a logical vector that is TRUE for every record whose value has already appeared earlier in the vector.
isDuplicated <- duplicated(call.dat$IMSI)
Then you can extract the rows flagged as duplicates (the first occurrence of each value is not flagged).
call.dat.duplicated <- call.dat[isDuplicated, ]
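Since the question asks to see both observations of a repeated ID, a sketch that also keeps the first occurrence may be closer to what is wanted. It combines duplicated() with its fromLast = TRUE variant. The column names come from the question; the small call.dat below and the names later_only and all_repeats are made up for illustration.

```r
# Hypothetical stand-in for call.dat with two repeated IMSI values.
call.dat <- data.frame(IMSI = c(101, 202, 101, 303, 202, 404),
                       Handset.Manufacturer = c("LG", "A", "LG", "B", "A", "C"),
                       stringsAsFactors = FALSE)

# duplicated() alone flags only the second and later occurrences.
later_only <- call.dat[duplicated(call.dat$IMSI), ]

# Scanning from both ends flags every row whose IMSI occurs more than once.
is_repeated <- duplicated(call.dat$IMSI) | duplicated(call.dat$IMSI, fromLast = TRUE)
all_repeats <- call.dat[is_repeated, ]

print(later_only)
print(all_repeats)
```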
