How do you collapse multiple rows based on multiple columns in R?

So basically I have a dataframe that kinda looks like this:
Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
Akutan city NA NA NA NA NA NA 71
Alcan Border NA NA 2 NA NA NA NA
Alcan Border NA NA NA NA NA 2 NA
Alcan Border NA NA NA NA 5 NA NA
Ambler City 224 NA NA NA NA NA NA
Ambler City NA NA NA 17 NA NA NA
Is there a simple way to combine multiple rows based on the data in multiple columns? I've seen a few scripts that combine duplicates of one variable based on one or two data columns, but I need to do it on a larger scale (I have ~400 rows with duplicates and ~30 columns, and each column has a long name).
Ideally it would look like:
Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
Akutan city NA NA NA NA NA NA 71
Alcan Border NA NA 2 NA 5 2 NA
Ambler City 224 NA NA 17 NA NA NA
I'm very new at R. Thank you!
Edit - I used the following code, but a lot of column data went missing when I collapsed it: the values in rows after the first duplicate community name disappeared (for example, the Alcan Border values for 10-14 and 15-19 became NA). Ideas?
library(dplyr)
census8 <- census7 %>%
group_by(Community) %>%
summarise_each(funs(sum))

To keep the NAs in there the way you want you could use data.table:
library(data.table)
setDT(df)[, lapply(.SD, function(x) ifelse(all(is.na(x)), NA_integer_, sum(x, na.rm = TRUE))),
          by = Community]
# Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
#1: Akutan_city NA NA NA NA NA NA 71
#2: Alcan_Border NA NA 2 NA 5 2 NA
#3: Ambler_City 224 NA NA 17 NA NA NA
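The values went missing in the dplyr attempt because sum() returns NA as soon as any value in a group is NA. A minimal base R sketch of the same idea as the data.table answer, assuming simplified column names (census, Age_5_9 and so on are made up for illustration) and a hypothetical helper sum_or_na():

```r
# Toy data mirroring the question; the column names are simplified assumptions.
census <- data.frame(
  Community = c("Akutan city", "Alcan Border", "Alcan Border",
                "Alcan Border", "Ambler City", "Ambler City"),
  Pop_Total = c(NA, NA, NA, NA, 224, NA),
  Under_5   = c(NA, 2, NA, NA, NA, NA),
  Age_5_9   = c(NA, NA, NA, NA, NA, 17),
  Age_10_14 = c(NA, NA, NA, 5, NA, NA),
  Age_15_19 = c(NA, NA, 2, NA, NA, NA),
  Age_20_24 = c(71, NA, NA, NA, NA, NA)
)

# Hypothetical helper: keep NA when a group has no values at all,
# otherwise sum the values that are present.
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)

# Apply the helper to every data column, one group per Community.
collapsed <- aggregate(census[-1], by = census["Community"], FUN = sum_or_na)
collapsed
```

Because the helper is applied to every column except the first, it does not matter how many columns there are or how long their names are.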

Related

How to find columns with maximum ratings

This problem has to be done in R only not SQL.
I have a problem where I am given below dataset.
Data Dictionary
UserID – 4848 customers who provided a rating for each movie - (Row)
Movie 1 to Movie 206 – 206 movies for which ratings are provided by 4848 distinct users (Columns)
1) I need to find which movies have the most views/ratings.
2) I need to find the 5 movies with the smallest audience.
I was able to get the maximum rating for each movie (column) with the code below. But how do I then narrow this result down to the movies with the highest rating? What kind of filter or function can be used?
I used this:
dataset <- read.csv("Amazon - Movies and TV Ratings.csv", row.names = 1)
sapply(dataset,max,na.rm=TRUE)
This gives me one row with the max value for each column (5, 5, 2, 5, 3, etc.).
Sample dataset:
Movie1 Movie2 Movie3 Movie4 Movie5 Movie6
USer1 5 5 NA NA NA NA
USer2 NA NA 2 NA NA NA
USer3 NA NA NA 5 NA NA
USer4 NA NA NA 5 NA NA
USer5 NA NA NA NA 5 NA
USer6 NA NA NA NA 2 NA
USer7 NA NA NA NA 5 NA
USer8 NA NA NA NA 2 NA
USer9 NA NA NA NA 5 NA
USer10 NA NA NA NA 5 NA
For your first question,
# All three vectors have length 8, so cbind() does not need to recycle values
data <- cbind(c(1,5,NA,2,3,5,2,3), c(3,NA,4,1,2,1,3,2), c(NA,1,1,3,4,3,NA,1))
data <- as.data.frame(data)
colnames(data) <- c("Movie1","Movie2","Movie3")
data
apply(data,2,max,na.rm=TRUE)
#Movie1 Movie2 Movie3
#5 4 4
For the second question, I believe you need to specify the criterion by which a movie counts as a top one. For example, do you want to compare each rating with the average rating of that movie?
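If "views" is read as the number of users who rated a movie, one possible criterion is the count of non-NA entries per column. A small sketch on made-up data (ratings, n_ratings, most_rated and least_rated are names chosen here for illustration, not from the question):

```r
# Made-up ratings: rows are users, columns are movies, NA means "not rated".
ratings <- data.frame(
  Movie1 = c(5, NA, NA, 4),
  Movie2 = c(5, NA, NA, NA),
  Movie3 = c(NA, 2, 3, 1)
)

# Number of ratings (non-NA entries) each movie received.
n_ratings <- colSums(!is.na(ratings))

# Movie(s) with the most ratings; ties are all returned.
most_rated <- names(n_ratings)[n_ratings == max(n_ratings)]

# The (up to) 5 movies with the fewest ratings, smallest first.
least_rated <- head(sort(n_ratings), 5)

most_rated
least_rated
```

On the real data the same calls would be applied to dataset instead of ratings.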

Remove columns in a dataframe by partial columns characters recognition R

I would like to subset my data frame by selecting columns through partial matching of their names, which works when I have a single "name" to match.
where the data frame is:
ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
1908 NA NA NA NA NA NA NA NA
1909 NA NA NA NA NA NA NA NA
1910 NA NA NA NA NA NA NA NA
1911 NA NA NA NA NA NA NA NA
1912 NA NA NA NA NA NA NA NA
1913 NA NA NA NA NA NA NA NA
library(stringr)
df[str_detect(names(df), "ABBA" )]
works, and returns:
ABBA01A ABBA01B ABBA02A ABBA02B
1908 NA NA NA NA
So, I would like to create a dataframe for each of my species:
Speciesnames=unique ( substring (names(df),0, 4))
Speciesnames
[1] "ABBA" "ACRU" "ARCU" "PIAB" "PIGL"
I have tried to write a loop and use [i] as the species name, but the str_detect function does not recognise it.
I would also like to add additional calculations in the loop:
for ( i in seq_along(Speciesnames)){
df=df[str_detect(names(df), pattern =[i])]
print(df)
#my function for the subsetted dataframe
}
thank you for your help!
Using your data you could do the following:
- create a list to hold the data.frames to be created
- filter the data.frames and store them in the list
- give each data.frame the name of its species
- bring all the data.frames out of the list into the global environment
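library(dplyr)  # provides %>%, select() and starts_with()
# Species codes are the first 4 characters of each column name
Speciesnames <- unique(substring(names(df), 1, 4))
data <- vector("list", length(Speciesnames))
for (i in seq_along(Speciesnames)) {
  # Keep only the columns whose names start with the current species code
  data[[i]] <- df %>% select(starts_with(Speciesnames[i]))
}
names(data) <- Speciesnames
# Copy each list element into the global environment under its species name
list2env(data, envir = globalenv())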
The end result after list2env is two data.frames called "ABBA" and "ACRU", which you can then access. If further manipulation is needed, you might leave everything in the list and do it there.
Another option is to use mapply with SIMPLIFY=FALSE to return a list of data frames, one per species. The startsWith function from base R lets you subset the columns whose names start with the species name.
# First find the species by taking the unique first 4 characters of the column names
species <- unique(gsub("([A-Z]{4}).*", "\\1",names(df)))
# Pass each species
listOfDFs <- mapply(function(x){
df[,startsWith(names(df),x)] # Return only columns starting with species
}, species, SIMPLIFY=FALSE)
listOfDFs
# $ABBA
# ABBA01A ABBA01B ABBA02A ABBA02B
# 1908 NA NA NA NA
# 1909 NA NA NA NA
# 1910 NA NA NA NA
# 1911 NA NA NA NA
# 1912 NA NA NA NA
# 1913 NA NA NA NA
#
# $ACRU
# ACRU01A ACRU01B ACRU02A ACRU02B
# 1908 NA NA NA NA
# 1909 NA NA NA NA
# 1910 NA NA NA NA
# 1911 NA NA NA NA
# 1912 NA NA NA NA
# 1913 NA NA NA NA
Data:
df <- read.table(text =
"ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
1908 NA NA NA NA NA NA NA NA
1909 NA NA NA NA NA NA NA NA
1910 NA NA NA NA NA NA NA NA
1911 NA NA NA NA NA NA NA NA
1912 NA NA NA NA NA NA NA NA
1913 NA NA NA NA NA NA NA NA",
header = TRUE, stringsAsFactors = FALSE)
I think that you should select all matching columns first, and then subselect your data.frame.
patterns <- c("ABB", "CDC")
res <- lapply(patterns, function(x) grep(x, colnames(df), value=TRUE))
df[, unique(unlist(res))]
The res object is a list of the matched columns for each pattern.
The next step is to take the unique set of columns, unique(unlist(res)), and subset the data.frame with it.
If you are writing production code, this is probably not the best answer.
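Another way to get one data frame per species, sketched here on a small made-up data frame, is base R's split.default(), which splits the columns of a data frame by a grouping vector. The names by_species and row_means, and the rowMeans step standing in for "my function", are assumptions for illustration:

```r
# Small made-up data frame; the first field of each row becomes the row name.
df <- read.table(text =
"ABBA01A ABBA01B ACRU01A ACRU01B
1908 1 2 3 4
1909 5 6 7 8",
header = TRUE)

# Group columns by the first 4 characters of their names.
# split.default() treats the data frame as a list of columns.
by_species <- split.default(df, substr(names(df), 1, 4))

# Apply a per-species calculation; rowMeans is only a placeholder.
row_means <- lapply(by_species, rowMeans, na.rm = TRUE)

names(by_species)
row_means
```

Keeping the results in a named list avoids creating one global variable per species and makes it easy to run the same calculation on each subset.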

How to use regular expression to modify a text file in r and obtain the desired file?

I have dataframe like this:
Patient_ID=c(10001,10002,10002,10003,10001,10004,10005,10005,10006,10006)
Diagnosis_Codes=c(1,16,5,55,28,1,1,12,14,83)
Diag_Index= c(1,2,3,4,5,6,7,8,9,10)
df=data.frame(Patient_ID,Diag_Index,Diagnosis_Codes)
df
I used the following code to have wide-format data:
library(reshape2)
wide_df <- dcast(df, Patient_ID ~ Diag_Index, value.var='Diagnosis_Codes')
wide_df
Patient_ID 1 2 3 4 5 6 7 8 9 10
1 10001 1 NA NA NA 28 NA NA NA NA NA
2 10002 NA 16 5 NA NA NA NA NA NA NA
3 10003 NA NA NA 55 NA NA NA NA NA NA
4 10004 NA NA NA NA NA 1 NA NA NA NA
5 10005 NA NA NA NA NA NA 1 12 NA NA
6 10006 NA NA NA NA NA NA NA NA 14 83
Now I need to convert this data frame to a text file in which the NAs are removed and the columns are separated by ,0, except between the first and second columns, where I need only a comma as the separator. Each line should end with ,0
The desired text file should look like this:
10001,1,0,28,0
10002,16,0,5,0
10003,55,0
10004,1,0
10005,1,0,12,0
10006,14,0,83,0
Using the following code, I converted the df to a text file and use ,0, as a separator.
write.table(wide_df, file = "raw_file.txt", row.names=FALSE, col.names=FALSE, sep=",0,")
Then I tried to edit the file with regular expressions to omit the NAs and make the other required changes, but I don't know much about regular expressions and could not get it done. Are regular expressions the right method for this problem, or should I do something else? Thanks for your help.
It would be better to go from your long data.frame to your desired output.
Here's a possibility:
library(data.table)
out <- as.data.table(df)[, sprintf("%s,0",
paste(Diagnosis_Codes, collapse = ",0,")), Patient_ID]
out
# Patient_ID V1
# 1: 10001 1,0,28,0
# 2: 10002 16,0,5,0
# 3: 10003 55,0
# 4: 10004 1,0
# 5: 10005 1,0,12,0
# 6: 10006 14,0,83,0
# quote = FALSE keeps fwrite from wrapping V1 in quotes because it contains commas
fwrite(out, file = "your_file.csv", quote = FALSE, row.names = FALSE, col.names = FALSE)
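The same long-to-text idea can also be sketched in base R without data.table. The names codes, lines and out_file are made up for illustration, and the file is written to a temporary path:

```r
# Long-format data from the question.
df <- data.frame(
  Patient_ID      = c(10001, 10002, 10002, 10003, 10001, 10004, 10005, 10005, 10006, 10006),
  Diag_Index      = 1:10,
  Diagnosis_Codes = c(1, 16, 5, 55, 28, 1, 1, 12, 14, 83)
)

# One vector of diagnosis codes per patient, in their original order.
codes <- split(df$Diagnosis_Codes, df$Patient_ID)

# Build "ID,code,0,code,0" for each patient.
lines <- paste0(
  names(codes), ",",
  vapply(codes, function(x) paste0(paste(x, collapse = ",0,"), ",0"), character(1))
)

# Write one line per patient; a temporary file is used here as a placeholder.
out_file <- tempfile(fileext = ".txt")
writeLines(lines, out_file)
lines
```

Because each line is assembled as a single string before writing, no quoting or NA handling is needed at the output step.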

web scraping a table with rvest

I am trying to extract a table from the url: http://gnomad.broadinstitute.org/variant/9-34647855-C-T
I do the following:
library(rvest)
url<-"http://gnomad.broadinstitute.org/variant/9-34647855-C-T"
frq_table <- read_html(url) %>% html_nodes("#frequency_table") %>% html_table()
I got that "#frequency_table" bit by using inspect element in Chrome and
copying the selector corresponding to the table. However, the table I get does not contain any values, just NAs.
frq_table
[[1]]
Population Allele Count Allele Number Number of Homozygotes Allele Frequency
1 European (Non-Finnish) NA NA NA NA
2 Ashkenazi Jewish* NA NA NA NA
3 East Asian NA NA NA NA
4 Other NA NA NA NA
5 African NA NA NA NA
6 Latino NA NA NA NA
7 South Asian NA NA NA NA
8 European (Finnish) NA NA NA NA
9 Total NA NA NA NA
I must be assigning the wrong path .... can't figure out how to extract the values.
Any help is much appreciated!

Create a new data frame

I have a data frame with only one column. The column contains some names. I need to change this data frame.
I created a list with some places:
voos_inter <- c("PUJ","SCL","EZE","MVD","ASU","VVI")
How can I add columns to this data frame, one for each name in the list?
Is your one-column data frame a vector? You can convert a vector to a data.frame and add columns. I usually add the columns filled with NA and add the values later. Check this example:
vtr <-c(1:6)
df <- as.data.frame(vtr)
voos_inter <- c("PUJ","SCL","EZE","MVD","ASU","VVI")
df[,2:(length(voos_inter)+1)] <- NA
names(df)[2:(length(voos_inter)+1)] <- voos_inter
df
vtr PUJ SCL EZE MVD ASU VVI
1 1 NA NA NA NA NA NA
2 2 NA NA NA NA NA NA
3 3 NA NA NA NA NA NA
4 4 NA NA NA NA NA NA
5 5 NA NA NA NA NA NA
6 6 NA NA NA NA NA NA
