This problem has to be solved in R only, not SQL.
I have been given the dataset below.
Data Dictionary
UserID – 4848 customers who provided a rating for each movie - (Row)
Movie 1 to Movie 206 – 206 movies for which ratings are provided by 4848 distinct users (Columns)
1) Which movies have the maximum views/ratings?
2) Define the top 5 movies with the least audience.
I was able to get the maximum rating for each movie (column) with the code below. But how do I then limit this result to the movies with the highest rating? What kind of filter or function can be used?
I used this:
dataset <- read.csv("Amazon - Movies and TV Ratings.csv", row.names = 1)
sapply(dataset,max,na.rm=TRUE)
This gives me one row with the max value for each column (5, 5, 2, 5, 3, etc.).
Sample dataset:
Movie1 Movie2 Movie3 Movie4 Movie5 Movie6
USer1 5 5 NA NA NA NA
USer2 NA NA 2 NA NA NA
USer3 NA NA NA 5 NA NA
USer4 NA NA NA 5 NA NA
USer5 NA NA NA NA 5 NA
USer6 NA NA NA NA 2 NA
USer7 NA NA NA NA 5 NA
USer8 NA NA NA NA 2 NA
USer9 NA NA NA NA 5 NA
USer10 NA NA NA NA 5 NA
For your first question,
# All three vectors must have the same length; a shorter one is recycled with a warning
data <- cbind(c(1,5,NA,2,3,5,2,3), c(3,NA,4,1,2,1,3,2), c(NA,1,1,3,4,3,NA,1))
data <- as.data.frame(data)
colnames(data) <- c("Movie1","Movie2","Movie3")
data
apply(data,2,max,na.rm=TRUE)
#Movie1 Movie2 Movie3
#5 4 4
For the second question, I believe you need to specify the criterion by which a movie counts as a top one. For example, do you want to compare each rating with the average rating of that movie?
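If "views" is taken to mean the number of users who rated a movie (an assumption, since the question does not define it), counting the non-NA entries per column answers both questions. The sketch below rebuilds the sample data from the question; the names `ratings`, `views`, `most_viewed` and `least_viewed` are only illustrative.

```r
# Sample ratings copied from the question (rows = users, columns = movies).
ratings <- data.frame(
  Movie1 = c(5, rep(NA, 9)),
  Movie2 = c(5, rep(NA, 9)),
  Movie3 = c(NA, 2, rep(NA, 8)),
  Movie4 = c(NA, NA, 5, 5, rep(NA, 6)),
  Movie5 = c(rep(NA, 4), 5, 2, 5, 2, 5, 5),
  Movie6 = rep(NA_real_, 10)
)

# Number of ratings per movie = number of non-NA cells in each column.
views <- colSums(!is.na(ratings))

# Movie(s) with the most ratings; ties are all kept.
most_viewed <- names(views)[views == max(views)]

# The five movies with the fewest ratings (ascending order).
least_viewed <- head(sort(views), 5)

print(most_viewed)
print(least_viewed)
```

On the real file the same three lines should work on the data frame returned by `read.csv(..., row.names = 1)`. If "highest rating" means the best average score instead, `colMeans(ratings, na.rm = TRUE)` could replace `colSums(!is.na(ratings))`.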
Related
I have a 996x12 dataset of categorical variables. All of them are dummy variables (1, 0). One of the variables indicates whether or not they disclose on the environment, and the other eleven indicate whether or not the observation belongs to each of the different sectors.
My intention is for R to return a table with the correlation between whether they disclose or not and each sector. In other words, I want to compare one variable with the other eleven variables.
How would it be done?
I have tried the cor() function, but I get missing values (NA).
DISCL Energy Materials Industrials Consumer.discretionary Consumer.staples .......
DISCL 1
Energy NA 1
Materials NA NA 1
Industrials NA NA NA 1
Consumer.discretionary NA NA NA NA 1
Consumer.staples NA NA NA NA NA 1
Health.care NA NA NA NA NA NA
Financials NA NA NA NA NA NA
Information.technology NA NA NA NA NA NA
Communication.services NA NA NA NA NA NA
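One common cause of an all-NA result from `cor()` is missing values in the data, because the default `use = "everything"` propagates them. The cause here is not confirmed by the question, so the following is only a sketch under that assumption. It uses a small made-up data frame (`firms`, with three sector columns instead of eleven) and correlates `DISCL` with every other column at once.

```r
# Made-up dummy data for illustration only; the real data has 996 rows and 12 columns.
firms <- data.frame(
  DISCL       = c(1, 0, 1, 1, 0, NA, 1, 0),
  Energy      = c(1, 0, 0, 1, 0, 0, 0, 0),
  Materials   = c(0, 1, 0, 0, NA, 1, 0, 0),
  Industrials = c(0, 0, 1, 0, 1, 0, 1, 1)
)

# Default behaviour: any NA in a column makes its correlation NA.
cor(firms$DISCL, firms[, -1])

# Use only the rows that are complete for each pair of variables.
sector_cor <- cor(firms$DISCL, firms[, -1], use = "pairwise.complete.obs")
print(sector_cor)
```

The result is a one-row matrix with one correlation per sector. If NAs remain after this, a column with zero variance (for example a sector that nobody belongs to) is another possible cause; `cor()` emits a warning in that case.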
I want to have a column's values equal another column's values if the first column's value is NA in this row. So I want to change something like this
A B
3 NA
NA NA
NA NA
5 NA
NA NA
NA NA
7 5
to something like this
A B
3 3
NA NA
NA NA
5 5
NA NA
NA NA
7 5
I am fairly new to R and any other kind of programming.
As per OP's description:
equal another column's values if the first column's value is NA in
this row
Could you please try following and let me know if this helps you.
# Index both sides with the same NA mask so that only missing B values are filled
df21223$B[is.na(df21223$B)] <- df21223$A[is.na(df21223$B)]
Output will be as follows for data frame's B part:
> df21223$B
[1] 3 NA NA 5 NA NA 7
Where Sample data is:
> df21223$A
[1] 3 NA NA 5 NA NA 7
> df21223$B
[1] NA NA NA NA NA NA NA
try:
df$B[is.na(df$B)] <- df$A[is.na(df$B)]
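Applied to the data from the question, the idea looks as follows. This is a minimal sketch; `df` and `missing_b` are illustrative names. The important detail is that the right-hand side is subset with the same logical mask as the left-hand side, so each missing `B` gets the `A` value from its own row and the existing `5` in the last row is left alone.

```r
# Data from the question.
df <- data.frame(
  A = c(3, NA, NA, 5, NA, NA, 7),
  B = c(NA, NA, NA, NA, NA, NA, 5)
)

# Rows where B is missing.
missing_b <- is.na(df$B)

# Fill only those rows with the value of A from the same row.
df$B[missing_b] <- df$A[missing_b]

print(df)
```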
So basically I have a dataframe that kinda looks like this:
Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
Akutan city NA NA NA NA NA NA 71
Alcan Border NA NA 2 NA NA NA NA
Alcan Border NA NA NA NA NA 2 NA
Alcan Border NA NA NA NA 5 NA NA
Ambler City 224 NA NA NA NA NA NA
Ambler City NA NA NA 17 NA NA NA
Is there a simple way to combine multiple rows based on multiple column data? I've seen a few scripts that say you can combine one duplicate variable in a column based on one or two data columns, but I need to do it on a larger scale (I have ~400 rows with duplicates and ~30 columns, and each column has a long name).
Ideally it would look like:
Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
Akutan city NA NA NA NA NA NA 71
Alcan Border NA NA 2 NA 5 2 NA
Ambler City 224 NA NA 17 NA NA NA
I'm very new at R. Thank you!
Edit - I used the following code; however, a lot of column data went missing when I collapsed it. The data in rows after the first duplicate community name disappeared (e.g. the Alcan Border values for 10-14 and 15-19 became NA). Ideas?
library(dplyr)
census8 <- census7 %>%
group_by(Community) %>%
summarise_each(funs(sum))
To keep the NAs in there the way you want you could use data.table:
library(data.table)
# Per community and column: NA if every value is NA, otherwise the sum of the non-NA values
setDT(df)[, lapply(.SD, function(x) ifelse(all(is.na(x)), NA_integer_, sum(x, na.rm = TRUE))),
          by = Community]
# Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
#1: Akutan_city NA NA NA NA NA NA 71
#2: Alcan_Border NA NA 2 NA 5 2 NA
#3: Ambler_City 224 NA NA 17 NA NA NA
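The values disappeared in the dplyr attempt because `sum()` returns NA as soon as any value in the group is NA. For comparison, here is a base R sketch of the same "NA only if everything is NA" rule using `aggregate()`. The data frame `census` holds a subset of the columns from the question, with the age columns renamed to valid R names (`Age_5_9` and so on); those names and the helper `sum_or_na` are assumptions made for the example.

```r
# Subset of the question's data; age column names are changed to valid R names.
census <- data.frame(
  Community = c("Akutan city", "Alcan Border", "Alcan Border",
                "Alcan Border", "Ambler City", "Ambler City"),
  Pop_Total = c(NA, NA, NA, NA, 224, NA),
  Under_5   = c(NA, 2, NA, NA, NA, NA),
  Age_5_9   = c(NA, NA, NA, NA, NA, 17),
  Age_10_14 = c(NA, NA, NA, 5, NA, NA),
  Age_15_19 = c(NA, NA, 2, NA, NA, NA),
  Age_20_24 = c(71, NA, NA, NA, NA, NA)
)

# Hypothetical helper: NA when a group has no data at all, otherwise the sum.
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)

# Collapse to one row per community; every non-grouping column is summarised.
collapsed <- aggregate(census[-1],
                       by = list(Community = census$Community),
                       FUN = sum_or_na)

print(collapsed)
```

Summing is only meaningful for count columns. A column such as `Median_Age` would need a different rule, for example taking the first non-NA value.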
I am trying to extract a table from the url: http://gnomad.broadinstitute.org/variant/9-34647855-C-T
I do the following:
library(rvest)
url<-"http://gnomad.broadinstitute.org/variant/9-34647855-C-T"
frq_table <- read_html(url) %>% html_nodes("#frequency_table") %>% html_table()
I got the "#frequency_table" selector by using Inspect Element in Chrome and copying the selector corresponding to the table. However, the table I get does not contain any values, just NAs.
frq_table
[[1]]
Population Allele Count Allele Number Number of Homozygotes Allele Frequency
1 European (Non-Finnish) NA NA NA NA
2 Ashkenazi Jewish* NA NA NA NA
3 East Asian NA NA NA NA
4 Other NA NA NA NA
5 African NA NA NA NA
6 Latino NA NA NA NA
7 South Asian NA NA NA NA
8 European (Finnish) NA NA NA NA
9 Total NA NA NA NA
I must be assigning the wrong path .... can't figure out how to extract the values.
Any help is much appreciated!
I have a data frame with only one column. The column contains some names. I need to change this data frame.
I created a vector with some places:
voos_inter <- c("PUJ","SCL","EZE","MVD","ASU","VVI")
How can I add columns to this data frame, named according to the names in the list?
Is your one-column data frame a vector? You can convert a vector to a data.frame and add columns. I usually add the columns filled with NA and add the values later. Check this example:
vtr <-c(1:6)
df <- as.data.frame(vtr)
voos_inter <- c("PUJ","SCL","EZE","MVD","ASU","VVI")
df[,2:(length(voos_inter)+1)] <- NA
names(df)[2:(length(voos_inter)+1)] <- voos_inter
df
vtr PUJ SCL EZE MVD ASU VVI
1 1 NA NA NA NA NA NA
2 2 NA NA NA NA NA NA
3 3 NA NA NA NA NA NA
4 4 NA NA NA NA NA NA
5 5 NA NA NA NA NA NA
6 6 NA NA NA NA NA NA
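A shorter variant of the same idea, as a sketch: assigning NA to a character vector of new column names creates all the columns in one step, so the index arithmetic is not needed. `vtr` is the same placeholder column as in the example above.

```r
# Placeholder one-column data frame, as in the example above.
df <- data.frame(vtr = 1:6)
voos_inter <- c("PUJ", "SCL", "EZE", "MVD", "ASU", "VVI")

# Create one NA-filled column per name in voos_inter.
df[voos_inter] <- NA

print(df)
```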