R - How can I create an "empty" dataframe after transposing? [duplicate] - r

This question already has answers here:
Initialize an empty tibble with column names and 0 rows
(6 answers)
Closed 1 year ago.
After transposing my data I am right now at this stage:
Alex Aro   Billie Piper   Chris Fe   Daron Chlim   Erik Fuc   (3000 more names)
Only headers, but no data inside. Now I want to populate the empty dataframe like this:
Alex Aro        Billie Piper   Chris Fe   Daron Chlim   Erik Fuc   (3000 more names)
NA              NA             NA         NA            NA         NA
NA              NA             NA         NA            NA         NA
NA              NA             NA         NA            NA         NA
NA              NA             NA         NA            NA         NA
NA              NA             NA         NA            NA         NA
NA              NA             NA         NA            NA         NA
(18 000 rows)   NA             NA         NA            NA         NA
It does not matter if I have zeros or NA in the end. As you can see lots of rows and columns, so code is only useful if I do not have to type every row and column by myself. Thanks in advance!

You can subset the empty data frame by row index: asking for rows that do not exist returns rows filled with NA. If the data frame that has only headers is called df, you can create 18000 rows of NA values like this:
out <- df[1:18000, ]
rownames(out) <- NULL
out
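As a minimal sketch of why this works, the example below builds a small zero-row data frame and subsets it. The three column names and the row count of 5 are stand-ins chosen for illustration; the real data would have about 3000 columns and 18000 rows.

```r
# Hypothetical stand-in for the transposed data: headers only, zero rows.
df <- data.frame(matrix(nrow = 0, ncol = 3))
names(df) <- c("Alex Aro", "Billie Piper", "Chris Fe")

# Requesting row indices that do not exist yields rows filled with NA.
out <- df[1:5, ]

# The generated row names are not useful, so reset them to 1..n.
rownames(out) <- NULL

dim(out)
all(is.na(out))
```

If zeros are preferred over NA, `out[is.na(out)] <- 0` can be applied afterwards.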

Related

R: Combining columns under precondition

I want to have a column's values equal another column's values if the first column's value is NA in this row. So I want to change something like this
A B
3 NA
NA NA
NA NA
5 NA
NA NA
NA NA
7 5
to something like this
A B
3 3
NA NA
NA NA
5 5
NA NA
NA NA
7 5
I am fairly new to R and any other kind of programming.
As per OP's description:
equal another column's values if the first column's value is NA in
this row
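Could you please try the following and let me know if it helps. Note that both sides of the assignment are indexed with the same condition, so each NA in B is replaced by the value of A from the same row.
df21223$B[is.na(df21223$B)] <- df21223$A[is.na(df21223$B)]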
Output will be as follows for data frame's B part:
> df21223$B
[1] 3 NA NA 5 NA NA 7
Where Sample data is:
> df21223$A
[1] 3 NA NA 5 NA NA 7
> df21223$B
[1] NA NA NA NA NA NA NA
Try:
df$B[is.na(df$B)] <- df$A[is.na(df$B)]
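A small worked example with the data from the question may help. It is only a sketch; the name `missing_b` is introduced here for illustration. Indexing both sides with the same logical vector keeps the rows aligned, so the existing value 5 in the last row of B is left untouched.

```r
# Data from the question: B is NA everywhere except the last row.
df <- data.frame(
  A = c(3, NA, NA, 5, NA, NA, 7),
  B = c(NA, NA, NA, NA, NA, NA, 5)
)

# Rows where B is missing.
missing_b <- is.na(df$B)

# Copy A into B only for those rows; the right-hand side is indexed
# with the same condition so values come from the matching rows.
df$B[missing_b] <- df$A[missing_b]

df$B
```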

Remove columns in a dataframe by partial columns characters recognition R

I would like to subset my data frame by selecting columns with partial characters recognition, which works when I have a single "name" to recognize.
where the data frame is:
ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
1908 NA NA NA NA NA NA NA NA
1909 NA NA NA NA NA NA NA NA
1910 NA NA NA NA NA NA NA NA
1911 NA NA NA NA NA NA NA NA
1912 NA NA NA NA NA NA NA NA
1913 NA NA NA NA NA NA NA NA
library(stringr)
df[str_detect(names(df), "ABBA" )]
works, and returns:
ABBA01A ABBA01B ABBA02A ABBA02B
1908 NA NA NA NA
So, I would like to create a dataframe for each of my species:
Speciesnames=unique ( substring (names(df),0, 4))
Speciesnames
[1] "ABBA" "ACRU" "ARCU" "PIAB" "PIGL"
I have tried to write a loop that uses [i] as the species name, but the str_detect function does not recognise it.
I would also like to add further calculations inside the loop:
for ( i in seq_along(Speciesnames)){
df=df[str_detect(names(df), pattern =[i])]
print(df)
#my function for the subsetted dataframe
}
thank you for your help!
Using your data you could do the following:
create a list to hold the data.frames to be created
filter the data.frames and store them in the list
give each data.frame the name of its species
bring all the data.frames out of the list into the global environment
library(dplyr)
Speciesnames <- unique(substring(names(df), 1, 4))
data <- vector("list", length(Speciesnames))
for(i in seq_along(Speciesnames)) {
data[[i]] <- df %>% select(starts_with(Speciesnames[i]))
}
names(data) <- Speciesnames
list2env(data, envir = globalenv())
The end result after list2env is two data.frames called "ABBA" and "ACRU", which you can then access by name. If further manipulation is needed, you might leave everything in the list and do it there.
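The idea of leaving everything in the list and working on it there could look roughly like the base R sketch below. The small numeric data frame and the use of `rowMeans` as the per-species calculation are assumptions made for illustration; any function that takes a data frame could be used instead.

```r
# Hypothetical example data with two species prefixes.
df <- data.frame(
  ABBA01A = c(1, 2), ABBA01B = c(3, 4),
  ACRU01A = c(5, 6), ACRU01B = c(7, 8)
)

# Species codes are the first four characters of the column names.
species <- unique(substr(names(df), 1, 4))

# One data frame per species; drop = FALSE keeps a data frame
# even when a species has only a single column.
by_species <- lapply(species, function(sp) {
  df[, startsWith(names(df), sp), drop = FALSE]
})
names(by_species) <- species

# Apply the same calculation to every species without leaving the list.
row_means <- lapply(by_species, rowMeans, na.rm = TRUE)
row_means
```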
Another option is to use mapply with SIMPLIFY=FALSE to return a list of data frames, one per species. The base R function startsWith can be used to select the columns whose names start with the species name.
# First find the species by taking the unique first 4 characters of the column names
species <- unique(gsub("([A-Z]{4}).*", "\\1",names(df)))
# Pass each species
listOfDFs <- mapply(function(x){
df[,startsWith(names(df),x)] # Return only columns starting with species
}, species, SIMPLIFY=FALSE)
listOfDFs
# $ABBA
# ABBA01A ABBA01B ABBA02A ABBA02B
# 1908 NA NA NA NA
# 1909 NA NA NA NA
# 1910 NA NA NA NA
# 1911 NA NA NA NA
# 1912 NA NA NA NA
# 1913 NA NA NA NA
#
# $ACRU
# ACRU01A ACRU01B ACRU02A ACRU02B
# 1908 NA NA NA NA
# 1909 NA NA NA NA
# 1910 NA NA NA NA
# 1911 NA NA NA NA
# 1912 NA NA NA NA
# 1913 NA NA NA NA
Data:
df <- read.table(text =
"ABBA01A ABBA01B ABBA02A ABBA02B ACRU01A ACRU01B ACRU02A ACRU02B
1908 NA NA NA NA NA NA NA NA
1909 NA NA NA NA NA NA NA NA
1910 NA NA NA NA NA NA NA NA
1911 NA NA NA NA NA NA NA NA
1912 NA NA NA NA NA NA NA NA
1913 NA NA NA NA NA NA NA NA",
header = TRUE, stringsAsFactors = FALSE)
I think that you should select all matching columns first, and then subselect your data.frame.
patterns <- c("ABB", "CDC")
res <- lapply(patterns, function(x) grep(x, colnames(df), value=TRUE))
df[, unique(unlist(res))]
The res object is a list of the matched column names for each pattern.
The next step is to take the unique set of columns, unique(unlist(res)), and subset the data.frame with it.
This is probably not the best approach for production code.

How do you collapse multiple rows based on multiple columns in r?

So basically I have a dataframe that kinda looks like this:
Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
Akutan city NA NA NA NA NA NA 71
Alcan Border NA NA 2 NA NA NA NA
Alcan Border NA NA NA NA NA 2 NA
Alcan Border NA NA NA NA 5 NA NA
Ambler City 224 NA NA NA NA NA NA
Ambler City NA NA NA 17 NA NA NA
Is there a simple way to combine multiple rows based on multiple columns of data? I've seen a few scripts that show how to collapse duplicates in one column based on one or two data columns, but I need to do it on a larger scale (I have ~400 rows with duplicates and ~30 columns, and each column has a long name).
Ideally it would look like:
Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
Akutan city NA NA NA NA NA NA 71
Alcan Border NA NA 2 NA 5 2 NA
Ambler City 224 NA NA 17 NA NA NA
I'm very new at R. Thank you!
Edit - I used the following code, but a lot of column data went missing when I collapsed it: the data in rows after the first duplicate community name disappeared (for example, the Alcan Border values for 10-14 and 15-19 became NA). Ideas?
library(dplyr)
census8 <- census7 %>%
group_by(Community) %>%
summarise_each(funs(sum))
To keep the NAs in there the way you want you could use data.table:
library(data.table)
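# For each column within a community: NA if every value is NA, otherwise the sum of the non-NA values
setDT(df)[, lapply(.SD, function(x) ifelse(all(is.na(x)), NA_integer_, sum(x, na.rm = TRUE))),
by = Community]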
# Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
#1: Akutan_city NA NA NA NA NA NA 71
#2: Alcan_Border NA NA 2 NA 5 2 NA
#3: Ambler_City 224 NA NA 17 NA NA NA
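For readers without data.table, the same rule (NA when a group has no values, otherwise the sum ignoring NA) can be sketched in base R with aggregate. The helper name `collapse_na` and the shortened, syntactic column names are assumptions made for this illustration and are not part of the original data.

```r
# Cut-down version of the census data with hypothetical column names.
df <- data.frame(
  Community = c("Akutan city", "Alcan Border", "Alcan Border",
                "Alcan Border", "Ambler City", "Ambler City"),
  Pop_Total = c(NA, NA, NA, NA, 224, NA),
  Under_5   = c(NA, 2, NA, NA, NA, NA),
  Age_10_14 = c(NA, NA, NA, 5, NA, NA),
  stringsAsFactors = FALSE
)

# Return NA when every value in the group is missing;
# otherwise sum the values that are present.
collapse_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

# Apply the helper to every data column, grouped by Community.
out <- aggregate(df[-1], by = list(Community = df$Community), FUN = collapse_na)
out
```

A plain `sum` inside the grouping returns NA as soon as any row in the group is NA, which is why values appeared to go missing in the dplyr attempt.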

web scraping a table with rvest

I am trying to extract a table from the url: http://gnomad.broadinstitute.org/variant/9-34647855-C-T
I do the following:
library(rvest)
url<-"http://gnomad.broadinstitute.org/variant/9-34647855-C-T"
frq_table <- read_html(url) %>% html_nodes("#frequency_table") %>% html_table()
I got the "#frequency_table" bit by using Inspect Element in Chrome and
copying the selector corresponding to the table. However, the table I get does not contain any values, just NAs.
frq_table
[[1]]
Population Allele Count Allele Number Number of Homozygotes Allele Frequency
1 European (Non-Finnish) NA NA NA NA
2 Ashkenazi Jewish* NA NA NA NA
3 East Asian NA NA NA NA
4 Other NA NA NA NA
5 African NA NA NA NA
6 Latino NA NA NA NA
7 South Asian NA NA NA NA
8 European (Finnish) NA NA NA NA
9 Total NA NA NA NA
I must be assigning the wrong path .... can't figure out how to extract the values.
Any help is much appreciated!

Create a new data frame

I have a data frame with only one column, which contains some names. I need to change this data frame.
I created a list with some places:
voos_inter <- c("PUJ","SCL","EZE","MVD","ASU","VVI")
How can I add columns to this data frame, one for each name in the list?
Is your one-column data frame essentially a vector? You can convert a vector to a data.frame and then add columns. I usually add the columns filled with NA and set the values later. Check this example:
vtr <-c(1:6)
df <- as.data.frame(vtr)
voos_inter <- c("PUJ","SCL","EZE","MVD","ASU","VVI")
df[,2:(length(voos_inter)+1)] <- NA
names(df)[2:(length(voos_inter)+1)] <- voos_inter
df
vtr PUJ SCL EZE MVD ASU VVI
1 1 NA NA NA NA NA NA
2 2 NA NA NA NA NA NA
3 3 NA NA NA NA NA NA
4 4 NA NA NA NA NA NA
5 5 NA NA NA NA NA NA
6 6 NA NA NA NA NA NA
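A slightly shorter variant, offered as a sketch rather than a replacement for the answer above, assigns NA to the new column names directly, so the column positions do not need to be computed. It reuses the `vtr` and `voos_inter` values from the example.

```r
# Same starting point as the example above.
df <- data.frame(vtr = 1:6)
voos_inter <- c("PUJ", "SCL", "EZE", "MVD", "ASU", "VVI")

# Assigning to column names that do not exist yet creates them,
# each filled with NA.
df[voos_inter] <- NA

names(df)
```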
