I am trying to isolate some values from a data frame
example:
test_df0 <- data.frame('col1' = c('string1', 'string2', 'string1'),
                       'col2' = c('value1', 'value2', 'value3'),
                       'col3' = c('string3', 'string4', 'string3'))
I want to obtain a new data frame with only the unique strings from col1, and the corresponding strings from col3 (which will be identical for rows with identical col1).
This is the loop I wrote, but I must be doing some blunt mistake:
test_df1<- as.data.frame(matrix(ncol= 2, nrow=0))
colnames(test_df1)<- c('col1', 'col3')
for (i in unique(test_df0$col1)) {
  first_matching_row <- match(x = i, table = test_df0$col1)
  temp_df <- data.frame('col1' = i,
                        'col3' = test_df0[first_matching_row, 'col3'])
  rbind(test_df1, temp_df)
}
The resulting test_df1, though, is empty. I cannot spot the mistake in the loop; I would be grateful for any suggestion.
Edit: the for loop itself is working; if its last line is print(temp_df) instead of the rbind command, I get the correct results. I am not sure why the rbind is not working.
An easier and faster way to do this is with the duplicated() function. duplicated() looks through an input vector and returns TRUE for each value that has already been seen at an earlier index in the vector. For example:
> duplicated(c(0,0,0,1,2,3,0,3))
[1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
For the first 0 it hadn't seen one before, but the next two 0s are repeats. Likewise, 1, 2, and the first 3 are new, while the final 0 and 3 had been seen previously. This means that !duplicated() returns TRUE exactly at the first occurrence of each unique value in the data.
We can use this to index into the data frame to get the rows of test_df0 with unique values of col1 as follows:
test_df0[!duplicated(test_df0[["col1"]]), ]
But this returns all columns of the data frame. If we just want col1 and col3 we can index into the columns as well using:
test_df0[!duplicated(test_df0[["col1"]]), c("col1", "col3")]
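Applied to the question's test_df0, this keeps the first occurrence of each col1 value; a quick check:

```r
test_df0 <- data.frame(col1 = c('string1', 'string2', 'string1'),
                       col2 = c('value1', 'value2', 'value3'),
                       col3 = c('string3', 'string4', 'string3'))

# keep the first row for each distinct col1 value, and only col1/col3
result <- test_df0[!duplicated(test_df0[["col1"]]), c("col1", "col3")]
# result has two rows: (string1, string3) and (string2, string4)
```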
As for why the loop isn't working: as @Jacob mentions, you aren't assigning the value you create with rbind to a variable, so it disappears after each iteration.
You aren't actually assigning the rbind result to anything! Presumably you need something like:
test_df1 <- rbind(test_df1, temp_df)
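With that assignment added, the complete loop from the question produces the expected two-row data frame:

```r
# Reproduce the question's data
test_df0 <- data.frame(col1 = c('string1', 'string2', 'string1'),
                       col2 = c('value1', 'value2', 'value3'),
                       col3 = c('string3', 'string4', 'string3'))

test_df1 <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(test_df1) <- c('col1', 'col3')

for (i in unique(test_df0$col1)) {
  first_matching_row <- match(x = i, table = test_df0$col1)
  temp_df <- data.frame(col1 = i,
                        col3 = test_df0[first_matching_row, 'col3'])
  test_df1 <- rbind(test_df1, temp_df)  # assign the combined result back
}
# test_df1 now has rows ('string1', 'string3') and ('string2', 'string4')
```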
Related
I have a large csv data frame ("mydata") and need to find out whether a value ("10295") is in the data frame, and in which columns. Here is my code:
any(mydata==10295)
which(apply(mydata, 2, function(x) any(grepl("10295", x))))
By doing so, I get TRUE for the first expression, and "1, 2, 5, 39" as the columns containing the searched value. However, if I run
any(mydata$col1==10295) # col1 is the name of the first column
I get FALSE.
I am sorry that I cannot upload the data, but it is a very large dataset. Does anyone have an idea where the mistake could be?
To find the columns which contain the value 10295, you can try colSums:
cols <- which(colSums(mydata == 10295, na.rm = TRUE) > 0)
cols will contain the numbers of all columns that have at least one occurrence of 10295.
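A small reproducible sketch of this approach (mydata here is a made-up stand-in for the asker's large CSV). Note that this tests exact equality, whereas the grepl() call in the question matches "10295" as a substring, which can flag extra columns:

```r
mydata <- data.frame(a = c(1, 10295, 3),
                     b = c(4, 5, 6),
                     c = c(10295, 8, 10295))

# mydata == 10295 gives a logical matrix; colSums counts the TRUEs per column
cols <- which(colSums(mydata == 10295, na.rm = TRUE) > 0)
# cols is a named integer vector: a = 1, c = 3
```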
I am used to C++-style coding and am having problems understanding how to compare two data frame columns and create a new data frame based on that comparison without using for loops. My sample code is given below.
for (i in seq(1, nrow(DF1))) {
  for (j in seq(1, nrow(DF2))) {
    if (DF1$some_col1[i] == DF2$some_col1[j] && DF2$some_col2[i] != all_df$some_col2[j]) {
      DF3[nrow(DF3) + 1, ] <- c(DF1$some_col1[i], DF1$some_col2[i], DF2$some_colm[j])
    }
  }
}
I will assume what you meant is that you want to compare columns of the same data frame (let's call it df, with columns col1, col2, and col3), testing several conditions at once, and, for the rows where all conditions are met, copy the values into a new data frame.
df = data.frame(col1 = c(1, 2, 3, 8),
                col2 = c(1, 3, 0, 8),
                col3 = c(TRUE, FALSE, TRUE, TRUE))
Now we do what I think you wanted:
newDF = df[df$col1 == df$col2 & df$col3, ]
now newDF will be a subset of your dataframe :
col1 col2 col3
1 1 1 TRUE
4 8 8 TRUE
Modifying it will not alter the original df.
Some clarification:
In R you rarely need index variables unless you are writing a loop. The reason is that R vectors support vectorized operations, so instead of a for loop with an index walking through an entire column to check a condition, you just write the condition with the vectors and R does the rest:
>VEC1 = c(TRUE,TRUE,FALSE,TRUE)
>VEC2 = c(TRUE,FALSE,FALSE,TRUE)
>VEC1 & VEC2
[1]  TRUE FALSE FALSE  TRUE
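Applied back to the original two-data-frame problem, the nested loops can usually be replaced by merge() plus one vectorized filter. This is only a sketch, assuming some_col1 is the join key and some_col2 is the value being compared (the column names are taken from the question's pseudocode; the data is made up):

```r
DF1 <- data.frame(some_col1 = c(1, 2, 3), some_col2 = c("a", "b", "c"))
DF2 <- data.frame(some_col1 = c(1, 2, 4), some_col2 = c("x", "b", "z"))

# join rows with equal some_col1, then keep those whose some_col2 values differ
joined <- merge(DF1, DF2, by = "some_col1", suffixes = c(".1", ".2"))
DF3 <- joined[joined$some_col2.1 != joined$some_col2.2, ]
# DF3 holds the single row where some_col1 matches (1) but some_col2 differs
```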
I have a data frame in R, and for each of its rows I want to check whether any record in a vector matches the string in the data frame. I can't seem to get it to work exactly right.
exampledf=as.data.frame(c("PIT","SLC"))
colnames(exampledf)="Column1"
examplevector=c("PITTPA","LAXLAS","JFKIAH")
This gets me close, but the result is a vector of (1,0,0) instead of a 0 or 1 for each row
exampledf$match=by(exampledf,1:nrow(exampledf),function(row) ifelse(grepl(exampledf$Column1,examplevector),1,0))
Expected result:
exampledf$match=c("1","0")
grepl returns a logical vector the same length as your examplevector. You can wrap it with the any() function to collapse it to a single TRUE/FALSE per pattern (using sum(...) > 0 would be equivalent).
Here's a slightly modified form of your code:
exampledf$match = vapply(exampledf$Column1, function(x) any(grepl(x, examplevector)), 1L)
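Running this on the question's data:

```r
exampledf <- data.frame(Column1 = c("PIT", "SLC"))
examplevector <- c("PITTPA", "LAXLAS", "JFKIAH")

# any() collapses the three grepl() results for each pattern into one flag;
# FUN.VALUE = 1L makes vapply promote the logical result to integer (0/1)
exampledf$match <- vapply(exampledf$Column1,
                          function(x) any(grepl(x, examplevector)),
                          1L)
# exampledf$match is now c(1L, 0L): "PIT" occurs inside "PITTPA", "SLC" nowhere
```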
So here is my solution:
library(dplyr)
exampledf=as.data.frame(c("PIT","SLC"))
colnames(exampledf)="Column1"
examplevector=c("PITTPA","LAXLAS","JFKIAH")
pmatch does what you want and tells you which element of examplevector each value matches. Use duplicates.ok = TRUE because you want multiple matches to show up; if you don't want that, set the argument to FALSE. I just used dplyr to create the new column, but you can do this however you like.
exampledf %>%
  mutate(match_flag = ifelse(is.na(pmatch(Column1, examplevector, duplicates.ok = T)),
                             0,
                             pmatch(Column1, examplevector, duplicates.ok = T)))
Column1 match_flag
1 PIT 1
2 SLC 0
I have a simple R data.frame object df. I am trying to select rows from this dataframe based on logical indexing from a column col in df.
I am coming from the Python world, where in similar operations I can choose either df[df[col] == 1] or df[df.col == 1], with the same end result.
However, in R, df[df$col == 1] gives an incorrect result compared to df[df[,col] == 1] (confirmed by the summary command). I am not able to understand this difference, as from links like http://adv-r.had.co.nz/Subsetting.html it seems that either way is OK. Also, the str command on df$col and df[, col] shows the same output.
Are there any guidelines about when to use the $ vs [] operator?
Edit:
digging a little deeper and using this question as reference, it seems like the following code works correctly
df[which(df$col == 1), ]
however, it is not clear how to guard against NA and when to use which
You have confused several things.
In
df[,col]
col should be the column number. For example,
col = 2
x = df[,col]
would select the second column and store it to x.
In
df$col
col should be the column name. For example,
df=data.frame(aa=1:5,bb=10:14)
x = df$bb
would select the second column and store it to x. But you cannot write df$2.
Finally,
df[[col]]
is the same as df[,col] if col is a number. If col is a character ("character" in R means the same as string in other languages), then it selects the column with this name. Example:
df=data.frame(aa=1:5,bb=10:14)
foo = "bb"
x = df[[foo]]
y = df[[2]]
z = df[["bb"]]
Now x, y, and z all contain a copy of the second column of df.
The notation foo[[bar]] comes from lists. The notation foo[,bar] comes from matrices. Since a data frame has features of both a matrix and a list, it can use both.
Use $ when you want to select one specific column by name df$col_name.
Use [] when you want to select one or more columns by number:
df[, 1]            # select column with index 1
df[, 1:3]          # select columns with indexes 1 to 3
df[, c(1, 3:5, 7)] # select columns with indexes 1, 3 to 5, and 7
[[]] is mostly for lists.
EDIT: df[which(df$col == 1), ] works because df$col == 1 creates a logical vector marking the rows where col equals 1, and which() converts it to the integer positions of the TRUE values (dropping NAs along the way). Those row numbers are then passed to df[ , ], so only the matching rows are returned.
See Remove rows with NAs (missing values) in data.frame to find out more about how to deal with missing values. It is always good practice to exclude missing values from a dataset.
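To see why which() guards against NA, compare the two subsetting styles on a small made-up data frame whose column contains a missing value:

```r
df <- data.frame(col = c(1, 2, NA, 1), other = c("a", "b", "c", "d"))

df[df$col == 1, ]        # col == 1 is TRUE FALSE NA TRUE, so an all-NA row sneaks in
df[which(df$col == 1), ] # which() keeps only the TRUE positions: rows 1 and 4

# an NA-safe, purely logical alternative, without which():
df[!is.na(df$col) & df$col == 1, ]
```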
It might be a trivial question (I am new to R), but I could not find an answer to it, either here on SO or anywhere else. My scenario is the following.
I have a data frame df and I want to update a subset of its df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the column tag on subsets of the data frame, using a second data frame (e.g., aux) that contains two columns, key and value. The subsets are defined by id == n, for each n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for (i in unique(df$id)) {
  indexer = df$id == i
  # here is how I tried to update the data frame:
  df[indexer, ]$tag <- aux[match(df[indexer, ]$tag, aux$key), ]$value
}
The expected result was the df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag filled with NAs. I got no errors, but the following warning message:
In '[<-.factor'('tmp', df$id == i, value = c(NA, :
invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but duplicated df$tag values made match() produce misplaced updates in a number of rows. I also simulated the subsetting, and it works fine. Can someone suggest a solution for this update?
UPDATE (how the final dataset should look like?):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
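Using the question's data to illustrate (tag is built as character here, which also avoids the "invalid factor level" warning from the question):

```r
df <- data.frame(id = rep(1:4, 3),
                 tag = rep(c("aaa", "bbb", "rrr", "fff"), 3),
                 stringsAsFactors = FALSE)
aux <- data.frame(key = c("aaa", "bbb", "rrr", "fff"),
                  value = c("valueAA", "valueBB", "valueRR", "valueFF"),
                  stringsAsFactors = FALSE)

# match() returns, for every tag, the row of aux holding the matching key;
# indexing aux$value with those positions does the whole lookup in one step
df$tag <- aux$value[match(df$tag, aux$key)]
# df$tag is now "valueAA", "valueBB", "valueRR", "valueFF" repeated three times
```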
It turned out that my data was breaking all the available built-in functions, giving me a wrong dataset in the end. So my (preliminary) solution was the following:
to process each subset individually;
add each data frame to a list;
use rbindlist(a.list, use.names = TRUE) (from the data.table package) to get a complete data frame with the results.
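That workaround, sketched in base R (the per-subset processing is a placeholder; the answer's rbindlist(pieces, use.names = TRUE) from data.table is a faster drop-in for the final do.call(rbind, ...) step):

```r
df <- data.frame(id = rep(1:4, 3),
                 tag = rep(c("aaa", "bbb", "rrr", "fff"), 3))

# 1. process each subset individually, 2. collecting the results in a list
pieces <- lapply(unique(df$id), function(i) {
  subset_df <- df[df$id == i, ]
  # ... per-subset processing would go here ...
  subset_df
})

# 3. combine the list back into one data frame
result <- do.call(rbind, pieces)
```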