Select non NA data from Single Row - r

This should be pretty easy, not sure what I'm missing here. I want to select a single row from a data frame, let's say row 1000, and get all columns where that specific row is not NA.
This works
df<- df[1000,]
df<- df[, !is.na(df)]
This fails
df<- df[1000, !is.na(df)]
ERROR "undefined columns selected"

You missed indexing the part concerning with is.na, here's an approach:
df <- df[1000, !is.na(df[1000, ])]

Related

Merge dataframes with unequal rows, and no matching column names R

I am trying to take df1 (a summary table), and merge it into df2 (master summary table).
This is a snapshot of df2, ignore the random 42, just the answer to the ultimate question.
This is an example of what df1, looks like.
Lastly, I have a vector called Dates. This matches the dates that are the column names for df2.
I am trying to cycle through 20 file, and gather the summary statistics of that file. I then want to enter that data into df2 to be stored permanently. I only need to enter the Earned column.
I have tried to use merge but since they do not have shared column names, I am unable to.
My next attempt was to try this. But it gave an error, because of unequal row numbers.
df2[,paste(Dates[i])] <- cbind(df2,df1)
Then I thought that maybe if I specified the exact location, it might work.
df2[1:length(df1$Earned),Dates[i]] <- df1$Earned
But that gave and error "New columns would leave holes after existing columns"
So then I thought of trying that again, but with cbind.
df2[1:length(df1$Earned),Dates[i]] <- cbind(df2, df1$Earned)
##This gave an error for differing row numbers
df2 <- cbind(df2[1:length(df1$Earned),Dates[i]],df1$earned)
## This "worked" but it replaced all of df2 with df1$earned, so I basically lost the rest of the master table
Any ideas would be greatly appreciated. Thank you.
Something like this might work:
df1[df1$TreatyYear %in% df2$TreatyYear, Dates] <- df2$Earned
Example
df <- data.frame(matrix(NA,4,4))
df$X1 <- 1:4
df[df$X1 %in% c(1,2),c("X3","X4")] <- c(1,2)
The only solution that I have found so far is to force df1$Earned into a vector. Then append the vector to be the exact length of the df2. Then I am able to insert the values into df2 by the specific column.
temp_values <- append(df1$Earned,rep(0,(length(df2$TreatyYear)-length(df1$TreatyYear))),after=length(df1$Earned))
df2[,paste(Dates[i])] <- temp_values
This is kind of a roundabout way to fix it, but not a very pleasant way. Any better ideas would be appreciated.

Count values per rows in a data frame R

I know, there is other questions like this one but none of them answer my specific problem.
On my data frame, I need to count the number of values in each rows between cols 3 and 8.
I want a simple NB.VAL like in Excel..
base_graphs$NB <- rowSums(!is.na(base_graphs)) # with this code, I count all values except NAs but I can't select specific columns
How to create this new column "NB" on my data frame "base_graphs" ?
You were really close:
base_graphs$NB <- rowSums(!is.na(base_graphs[, 3:8]))
The [, 3:8] subsets and selects columns 3 through 8.
apply can apply a function to each row of a data frame. Try:
base_graphs$NB <- apply(base_graphs[3:8], 1, function (x) sum(is.na(x)))

R: Leaving out undefined columns when subsetting a dataframe

I have a dataframe df1 with 300+ Columns, and I am trying to subset it to df2 by column names based on a couple dozen inputs in list1. However, some of the items in list1 are not actually column names in df1. So, I get an "undefined columns" error:
df2 <- df1[,list1]
Error in [.data.frame(df1, , list1) :
undefined columns selected
I realize the reason for this error. BUT I can't go through list1 and see which ones occur in df1 and which ones don't. Because I have to do this many many times with many many different lists. Is there a way to simply ignore those (and not include those null column name values from list1 in the subset df2)?
Thanks for your kind help.
How about something like: df2 <- df1[,list1[list1 %in% names(df1)]]
I was hoping for a nicer/shorter) solution for my own use.
Here is another
df1[,intersect(names(df1), list1)]

How to remove rows where columns satisfy certain condition in data frame

I have a data frame that looks like this
df <- data.frame(cbind(1:10, sample(c(1:5), 10, replace=TRUE)))
# in real case the columns could be more than two
# and the column name could be anything.
What I want to do is to remove all rows where the value of all its columns
is smaller than 5.
What's the way to do it?
df[!apply(df,1,function(x)all(x<5)),]
First of all ...please stop using cbind to create data.frames. You will be sorry if you continue. R will punish you.
df[ !rowSums(df <5) == length(df), ]
(The length() function returns the number of columns in a dataframe.)

Is there a more elegant way to find duplicated records?

I've got 81,000 records in my test frame, and duplicated is showing me that 2039 are identical matches. One answer to Find duplicated rows (based on 2 columns) in Data Frame in R suggests a method for creating a smaller frame of just the duplicate records. This works for me, too:
dup <- data.frame(as.numeric(duplicated(df$var))) #creates df with binary var for duplicated rows
colnames(dup) <- c("dup") #renames column for simplicity
df2 <- cbind(df, dup) #bind to original df
df3 <- subset(df2, dup == 1) #subsets df using binary var for duplicated`
But it seems, as the poster noted, inelegant. Is there a cleaner way to get the same result: a view of just those records that are duplicates?
In my case I'm working with scraped data and I need to figure out whether the duplicates exist in the original or were introduced by me scraping.
duplicated(df) will give you a logical vector (all values consisting of either T/F), which you can then use as an index to your dataframe rows.
# indx will contain TRUE values wherever in df$var there is a duplicate
indx <- duplicated(df$var)
df[indx, ] #note the comma
You can put it all together in one line
df[duplicated(df$var), ] # again, the comma, to indicate we are selected rows
doops <- which(duplicated(df$var)==TRUE)
uniques <- df[-doops,]
duplicates <- df[doops,]
Is the logic I generally use when I am trying to remove the duplicate entrys from a data frame.

Resources