Trying to get rid of a data frame row - r

I am trying to get rid of a data frame row. I read the data with
temp_data <- read.table(blablabla)
and then when I try to get rid of the first row with
temp_data <- temp_data[-1,]
it turns temp_data into a vector. Why is this happening?

As others have commented, [ uses drop = TRUE by default here, so a one-column result is simplified to a vector. From ?"[":
drop: For matrices and arrays. If TRUE the result is coerced to the
lowest possible dimension (see the examples). This only works for
extracting elements, not for the replacement. See drop for further
details.
So, we need
temp_data[-1, , drop=FALSE]
If we convert to a data.table, drop = FALSE is not needed when subsetting rows:
library(data.table)
setDT(temp_data)[-1] # setDT() converts temp_data to a data.table by reference
data
temp_data <- data.frame(Col1 = 1:5)
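A quick way to see the difference, using the one-column example data above (a minimal sketch; the names dropped and kept are only for illustration):

```r
temp_data <- data.frame(Col1 = 1:5)

# With a single column, row subsetting simplifies the result to a vector
dropped <- temp_data[-1, ]
class(dropped)   # "integer"

# drop = FALSE keeps the data frame structure
kept <- temp_data[-1, , drop = FALSE]
class(kept)      # "data.frame"
dim(kept)        # 4 1
```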

Related

Why can I not rename columns of a tbl?

I came across some weird behaviour in dplyr's tbl:
df <- as_tibble(iris)
i <- colnames(df)[5]
df$new <- df[,i]
For some reason the newly created column new is named new.Species (at least when I View(df)), although it should be named just new.
I do not understand why this happens. An obvious fix is to simply save df as a data.frame, but I would still like to understand what happens here.
Because df[, i] is still a tibble with one column. We need df[[i]]:
df$new <- df[[i]]
With a data.frame, [ uses drop = TRUE by default when a single column is selected (see ?Extract), but a tibble never drops dimensions to create a vector. We need [[ to extract the column itself.
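The same effect can be reproduced in base R by forcing drop = FALSE, which is roughly how a tibble behaves in this situation. This is only a sketch on a plain data.frame, not on a tibble, and the names nested and flat are made up for illustration:

```r
i <- "Species"

# Assigning a one-column data frame stores a data frame inside the new column
nested <- iris
nested$new <- iris[, i, drop = FALSE]
is.data.frame(nested$new)   # TRUE
names(nested$new)           # "Species", the inner name behind the new.Species label

# [[ extracts the column itself, so the new column is an ordinary factor
flat <- iris
flat$new <- iris[[i]]
is.factor(flat$new)         # TRUE
```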

How to conditionally change colnames from certain rows?

I have a problem that arises often working with excel survey data: the first 10 or so column names are appropriate in a data set, the remaining x:ncol need to be renamed to the values of first row of the data set, starting at x+1. (The colnames are correct until x, after which point the colnames become empty, with the values that I would like to have as the colname being in the first row).
I have been doing this manually, writing them out one by one using dplyr::select(). How can I automate this in a tidy workflow? I imagine using set_names() or rename_at() but can't get the syntax. Thank you in advance
mtcars %>%
select(miles_per_gallon = "mpg", everything()) %>% #etc. keep some names
rename_at(vars(3:ncol(.)), funs(mtcars[,1]))
Error: `nm` must be `NULL` or a character vector the same length as `x`
The error isn't surprising, but to illustrate the point - how to have the names from x:ncol() replaced by the first row's values starting from x+1?
I think this should do it for you -
x <- 10 # means 1:10 column names are appropriate
names(df)[(x+1):ncol(df)] <- as.character(unlist(df[1, (x+1):ncol(df)])) # first-row values as a plain character vector
df <- df[-1, ] # removing 1st row assuming it's bad data
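As an illustration on made-up data (the data frame survey, its contents, and the cut-off x = 2 are assumptions, not taken from the question), with every column read in as character:

```r
# Hypothetical survey export: the first 2 names are fine, the rest are placeholders
survey <- data.frame(
  id  = c("id_label", "1", "2"),
  age = c("age_label", "34", "51"),
  V3  = c("q1", "yes", "no"),
  V4  = c("q2", "no", "no"),
  stringsAsFactors = FALSE
)

x <- 2                          # columns 1:x already have good names
idx <- (x + 1):ncol(survey)

# Take the replacement names from the first row, as a plain character vector
names(survey)[idx] <- as.character(unlist(survey[1, idx]))

# Drop the header row; drop = FALSE guards against a one-column result
survey <- survey[-1, , drop = FALSE]
names(survey)   # "id" "age" "q1" "q2"
```

The as.character(unlist(...)) step matters mostly when the columns are factors, where the raw row would otherwise not reliably yield the labels.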

Square brackets and dataframes in R

Okay I am rather puzzled by the different behaviours of dataframes and xtses in R and I'm hoping someone can explain it to me.
df = as.data.frame(x = c(1,2),row.names = c("2012-12-12","2012-12-13"))
xts = as.xts(x=c(1,2),order.by = as.POSIXct(c("2012-12-12","2012-12-13")))
I have two different datasets here. When you print them, they look almost similar. When I want the first row of the xts, xts[1,] returns the row with colnames and the index. But when you do df[1,] it only returns a vector.
Is there a way to return the first row of the dataframe, complete with the rownames and colname? I'm aware that I can hack it by doing as.data.frame(as.xts(df)[1,]) but is there a more elegant solution?
This is a particular case: the data frame has only one column, so subsetting a single row returns just ONE cell, which is simplified to a vector.
In that case, you need to specify drop = FALSE:
df[1, , drop = FALSE]
I'd also recommend that when you create a data frame from scratch, you use the data.frame() function instead of as.data.frame().
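For instance, a minimal sketch that builds the question's data with data.frame() (the column name x and the name first_row are chosen here for illustration):

```r
# Build the data frame directly, with the dates as row names
df <- data.frame(x = c(1, 2), row.names = c("2012-12-12", "2012-12-13"))

# drop = FALSE keeps the row name and the column name
first_row <- df[1, , drop = FALSE]
rownames(first_row)   # "2012-12-12"
colnames(first_row)   # "x"
```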

renaming subset of columns in r with paste0

I have a data frame (my_df) with columns named after individual county numbers. I melted/cast the data from a much larger set to get to this point. The first column name is year and it is a list of years from 1970-2011. The next 3010 columns are counties. However, I'd like to rename the county columns to be "county_"+county number.
This code executes in R but for whatever reason doesn't update the column names. They remain solely the numbers... any help?
new_col_names = paste0("county_",colnames(my_df[,2:ncol(my_df)]))
colnames(my_df[,2:ncol(my_df)]) = new_col_names
The problem is the subsetting within the colnames call.
Try names(my_df) <- c(names(my_df)[1], new_col_names) instead.
Note: names and colnames are interchangeable for data.frame objects.
EDIT: alternate approach suggested by flodel, subsetting outside the function call:
names(my_df)[-1] <- new_col_names
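Why the original line runs but changes nothing: R renames a temporary copy of the subset and then assigns that copy back into the same columns, and assigning into existing columns keeps their existing names. A small sketch on made-up data (the data frame d, its county numbers, and before_fix are assumptions for illustration):

```r
d <- data.frame(year = 1970:1972, `1001` = 1:3, `1003` = 4:6,
                check.names = FALSE)
new_col_names <- paste0("county_", names(d)[-1])

# Runs without error, but the names of d are left unchanged
colnames(d[, 2:ncol(d)]) <- new_col_names
before_fix <- names(d)   # "year" "1001" "1003"

# Subsetting the names vector instead updates d itself
names(d)[-1] <- new_col_names
names(d)                 # "year" "county_1001" "county_1003"
```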
colnames() is meant for a matrix (or matrix-like object); for a data.frame, try simply names().
Example:
my_df <- data.frame(a=c(1,2,3,4,5), b=rnorm(5), c=rnorm(5), d=rnorm(5))
new_col_names <- paste0("county_", names(my_df)[-1]) # build the new names after my_df exists
names(my_df) <- c(names(my_df)[1], new_col_names)

Is there a more elegant way to find duplicated records?

I've got 81,000 records in my test frame, and duplicated is showing me that 2039 are identical matches. One answer to Find duplicated rows (based on 2 columns) in Data Frame in R suggests a method for creating a smaller frame of just the duplicate records. This works for me, too:
dup <- data.frame(as.numeric(duplicated(df$var))) #creates df with binary var for duplicated rows
colnames(dup) <- c("dup") #renames column for simplicity
df2 <- cbind(df, dup) #bind to original df
df3 <- subset(df2, dup == 1) #subsets df using binary var for duplicated
But it seems, as the poster noted, inelegant. Is there a cleaner way to get the same result: a view of just those records that are duplicates?
In my case I'm working with scraped data and I need to figure out whether the duplicates exist in the original or were introduced by me scraping.
duplicated(df) will give you a logical vector (every value is either TRUE or FALSE), which you can then use as an index to your data frame rows.
# indx will contain TRUE values wherever in df$var there is a duplicate
indx <- duplicated(df$var)
df[indx, ] #note the comma
You can put it all together in one line
df[duplicated(df$var), ] # again, the comma, to indicate we are selecting rows
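doops <- duplicated(df$var) # logical index of the duplicated entries
uniques <- df[!doops, ]
duplicates <- df[doops, ]
This is the logic I generally use when I am trying to remove duplicate entries from a data frame. Negating the logical index, rather than using -which(...), also gives the right result when there are no duplicates at all.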
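Note that duplicated() flags only the second and later occurrences of a value. If the goal is to see every record involved in a duplicate, including the first occurrence, one option is to combine it with a reverse pass using fromLast = TRUE. A sketch on made-up data (the columns var and id and all result names are assumptions):

```r
df <- data.frame(var = c("a", "b", "a", "c", "b", "a"), id = 1:6)

# Second and later occurrences only
later_copies <- df[duplicated(df$var), ]

# Every row whose value of var appears more than once
in_dup_group <- duplicated(df$var) | duplicated(df$var, fromLast = TRUE)
all_copies <- df[in_dup_group, ]

later_copies$id   # 3 5 6
all_copies$id     # 1 2 3 5 6
```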
