Top rows of count of a specific column in DF in R - r

This is my dataset - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
I've been trying to create a histogram of the busiest host IDs in the dataset.
So first I create the count
hostcount <-plyr::count(nycab2$host_id)
Then I try to keep just the top 10:
hostcounttop <- head(arrange(hostcount, decreasing = TRUE), n = 10)
But I get this error
Error: Length of ordering vectors don't match data frame size
What's wrong with my code, given that it's the same data frame size?

arrange() does not have a decreasing = TRUE argument. You can use desc() on the column name instead.
However, plyr has been retired and most of its functions are available in dplyr. You can do:
library(dplyr)
hostcounttop <- nycab2 %>% count(host_id) %>% slice_max(n, n = 10)
# If you have dplyr version < 1.0.0, use top_n instead:
# hostcounttop <- nycab2 %>% count(host_id) %>% top_n(10, n)
In base R, we can do this as:
mat <- stack(table(nycab2$host_id))
mat <- mat[order(mat$values), ]
#to get top 10 values use tail
tail(mat, 10)
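For the chart itself, here is a hedged base R sketch (assuming nycab2 holds the Kaggle data): sort() puts the table in decreasing order and barplot() draws the counts directly.
# top 10 hosts by number of listings, base R only
tab <- sort(table(nycab2$host_id), decreasing = TRUE)[1:10]
barplot(tab, las = 2, xlab = "host_id", ylab = "number of listings")  # las = 2 rotates the ID labels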

You could try using order() instead of arrange() and subsetting the data frame with brackets. E.g., assuming the problem isn't with your data frame, try (plyr::count names its columns x and freq):
hostcount[order(hostcount$freq, decreasing = TRUE)[1:10], ]
You can change the 1:10 to get as many rows as you want.
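Putting the pieces together, here is a minimal end-to-end sketch, assuming nycab2 is AB_NYC_2019.csv (the file name on the Kaggle page) read into a data frame. Since host IDs are discrete, a bar chart via geom_col() is the usual stand-in for a histogram:
library(dplyr)
library(ggplot2)

nycab2 <- read.csv("AB_NYC_2019.csv")  # path assumed

# count listings per host and keep the ten busiest hosts
hostcounttop <- nycab2 %>%
  count(host_id) %>%
  slice_max(n, n = 10)

# bar chart, hosts ordered by listing count
ggplot(hostcounttop, aes(x = reorder(factor(host_id), -n), y = n)) +
  geom_col() +
  labs(x = "host_id", y = "number of listings")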


Want to write a loop to find reflectance values for each column

So I have this code in R that I'm using on a data frame df where each row is a wavelength (823 rows/wavelengths) and each column is a pixel (named V1-V2554).
I have code to normalise each reflectance value for each spectrum/pixel:
# Define function to find vector length
veclen <- function(vec) {
  sqrt(sum(vec^2))
}
# Find vector length for spectrum of each pixel
df_vecV6 <- df %>%
  group_by(Wavelength) %>%
  summarise(veclengthV6 = veclen(V6))
# Join new variable "veclength"
df <- df %>%
  left_join(df_vecV6, by = "Wavelength")
# Define function that return normalized vector
vecnorm <- function(vector) {
  vector / veclen(vector)
}
# Normalize by dividing each reflectance value by the vector’s length
df$refl_normV6 <- vecnorm(df$V6)
but I want to create a loop to do this for all 2553 columns. I started writing it but ran into problems. In this case df is finaldatat, and I wanted to create a list svec to store the vector lengths before the next steps:
for (i in 1:ncol(finaldatat)) {
  svec[[i]] <- finaldatat %>%
    # group_by(Wavelength) %>%
    summarise(x = veclen(finaldatat[, i]))
}
That first step runs, but the vector lengths come out far larger than they should, so I already know there's a problem. Any help is appreciated!
Ideally, the final data frame would contain only the normalised results in the same 2554x824 format.
You can use dplyr's across() to apply vecnorm to all columns from V1 to V2554:
result <- df %>%
  group_by(Wavelength) %>%
  summarise(across(V1:V2554, vecnorm))
# In older versions of dplyr, replace the across() line with summarise_at:
# summarise_at(vars(V1:V2554), vecnorm)
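To see the behaviour on something small, here is a self-contained toy sketch (4 wavelengths x 3 pixels standing in for the 823 x 2554 frame). Note it uses mutate() rather than group_by(Wavelength) plus summarise(): without grouping, each column is scaled by its full-spectrum length, and the row-per-wavelength layout the question asks for is preserved.
library(dplyr)

# toy stand-in for finaldatat
df <- data.frame(Wavelength = 1:4,
                 V1 = c(1, 2, 2, 4),
                 V2 = c(3, 0, 4, 0),
                 V3 = c(1, 1, 1, 1))

veclen <- function(vec) sqrt(sum(vec^2))
vecnorm <- function(vector) vector / veclen(vector)

# scale every pixel column to unit length over all wavelengths
result <- df %>% mutate(across(V1:V3, vecnorm))

# sanity check: each normalised column should now have length 1
sapply(result[-1], veclen)
# V1 V2 V3
#  1  1  1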

Filter rows based on vector index instead of column name or index

I have a very simple sample data frame df_test as:
df_test <- data.frame("A" = 1:5)
I would like to select the row containing 5. I know I can achieve it by using the filter() command as:
df_analysis <- df_test %>%
  filter(A == 5)
However, I want to run a for loop (the actual data set has many variables and is complex), so instead of filtering columns manually one by one, I would like to loop over the columns, picking one variable at a time and filtering rows accordingly. For this example, I create a character vector v as v <- c("A").
Now to filter, instead of using the column name, when I try to use this vector index as:
df_analysis <- df_test %>%
  filter(v[1] == 5)
It produces 0 rows instead of 1.
How can I filter rows using vector index instead of column index or name?
Thanks!
With the addition of purrr, you can do:
map(.x = v,
    ~ df_test %>%
        filter(across(all_of(.x)) == 5))
[[1]]
A
1 5
Note that newer versions of dplyr deprecate across() inside filter() in favour of if_any()/if_all().
We can use base R
df_test[df_test[[v]] == 5, , drop = FALSE]
Or with dplyr, by converting to a symbol and evaluating it with !!:
library(dplyr)
df_test %>%
  filter(!!rlang::sym(v) == 5)
# A
#1 5
Or with .data
df_test %>%
  filter(.data[[v]] == 5)
In its current form, your filter operation compares the literal string "A" (i.e., the contents of v[1]) to the numeric 5, which is of course always false and therefore can't return any valid rows. Instead, you'd need to pass the variable A (contained in df_test) as the first argument to filter(). You can do this by using get() like so:
df_analysis <- df_test %>%
  filter(get(v[1]) == 5)
The other solution here using purrr is honestly much better, but I wanted to point out why your code didn't work as expected.
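Since the question explicitly asks for a for loop, here is a hedged sketch that collects one filtered subset per column name in v (the list name results is my own choice):
library(dplyr)

df_test <- data.frame(A = 1:5)
v <- c("A")  # add more column names here as needed

results <- vector("list", length(v))
names(results) <- v

for (i in seq_along(v)) {
  results[[i]] <- df_test %>%
    filter(.data[[v[i]]] == 5)  # .data[[...]] looks the column up by its name
}

results[["A"]]
#   A
# 1 5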

Making quick calculations on subsets with R

Thanks to all in advance.
I have the following data:
set.seed(123)
data <- data.frame(name = LETTERS[sample(1:26, 500, replace = TRUE)],
                   present = sample(0:1, 500, replace = TRUE))
And I want to quickly calculate the percentage of present observations (1's) for each letter. I can do it manually, but I believe there is an easier way to do this:
library(dplyr)
A <- filter(data, name == "A" & present == 1)
A2 <- filter(data, name == "A")
data$Percentage[data$name == "A"] <- nrow(A) / nrow(A2)
And so on until I arrive at "Z".
Can I do this automatically, without having to change the values of the "name" column manually?
Best regards,
We can use prop.table with table to get the proportion of present observations for each letter:
prop.table(table(data), 1)[,2]
To add it as a column, we can expand it by matching against the name values:
data$Percentage <- prop.table(table(data), 1)[,2][as.character(data$name)]
Or, as @Lars Lau Raket suggested, we don't need to convert to character:
prop.table(table(data), 1)[,2][data$name]
If we need to create a column
library(dplyr)
data %>%
  group_by(name) %>%
  mutate(Percentage = mean(present == 1))
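As a quick hedged sanity check that both routes agree on the seeded data above:
set.seed(123)
data <- data.frame(name = LETTERS[sample(1:26, 500, replace = TRUE)],
                   present = sample(0:1, 500, replace = TRUE))

# route 1: contingency table, proportion of present == 1 per letter
p1 <- prop.table(table(data), 1)[, 2]

# route 2: dplyr, collapsed to one row per letter
library(dplyr)
p2 <- data %>%
  group_by(name) %>%
  summarise(Percentage = mean(present == 1))

all.equal(unname(p1), p2$Percentage)  # expected: TRUE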

How can I remove all duplicates so that NONE are left in a data frame?

There is a similar question for PHP, but I'm working with R and am unable to translate the solution to my problem.
I have this data frame with 10 rows and 50 columns, where some of the rows are absolutely identical. If I use unique on it, I get one row per - let's say - "type", but what I actually want is to get only those rows which only appear once. Does anyone know how I can achieve this?
I can have a look at clusters and heatmaps to sort it out manually, but I have bigger data frames than the one mentioned above (with up to 100 rows) where this gets a bit tricky.
This will extract the rows which appear only once (assuming your data frame is named df):
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]
How it works: duplicated() tests whether a row has already appeared earlier, scanning from the first row. With fromLast = TRUE, it scans from the last row instead.
Both boolean results are combined with | (logical 'or') into a new vector that flags every row appearing more than once. Negating this with ! yields a boolean vector marking the rows that appear only once.
A possibility involving dplyr could be:
df %>%
  group_by_all() %>%
  filter(n() == 1)
Or:
df %>%
  group_by_all() %>%
  filter(!any(row_number() > 1))
Since dplyr 1.0.0, the preferable way would be:
df %>%
  group_by(across(everything())) %>%
  filter(n() == 1)
Try it with a reproducible example:
DF1 <- data.frame(Part = c(1, 2, 3, 4, 5), Age = c(23, 34, 23, 25, 24), B.P = c(87, 76, 75, 75, 78))
DF2 <- data.frame(Part = c(3, 5), Age = c(23, 24), B.P = c(75, 78))
DF3 <- rbind(DF1, DF2)
DF3 <- DF3[!(duplicated(DF3) | duplicated(DF3, fromLast = TRUE)), ]
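For reference: rows 3 and 5 of DF1 reappear in DF2, so after the rbind they occur twice and are dropped entirely, leaving only the rows that occur exactly once:
DF3
#   Part Age B.P
# 1    1  23  87
# 2    2  34  76
# 4    4  25  75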
