I am trying to use the dplyr pipe and left_join() to clean up some metadata. Setting up the variables:
library(openxlsx)
library(tidyverse)
mdat <- read.xlsx("https://journals.plos.org/plospathogens/article/file?type=supplementary&id=info:doi/10.1371/journal.ppat.1005511.s011",
startRow = 3, fillMergedCells = TRUE) %>%
mutate(sample=Accession.Number)
dge$samples$sample contains:
[1] "SRR1346026" "SRR1346027" "SRR1346028" "SRR1346029" "SRR1346030" "SRR1346031" "SRR1346032" "SRR1346033" "SRR1346034"
[10] "SRR1346035" "SRR1346036" "SRR1346037" "SRR1346038" "SRR1346039" "SRR1346040" "SRR1346041" "SRR1346042" "SRR1346043"
[19] "SRR1346044" "SRR1346045" "SRR1346046" "SRR1346047" "SRR1346049" "SRR1346048" "SRR1346050" "SRR1346051" "SRR1346052"
I am trying to pipe in dge$samples$sample, which is a character vector. It needs to become a data frame with one column named sample so I can left_join it with mdat and drop all the metadata rows I don't have a sample for. If you run dim(mdat) you will find it is 35 by 15; I want to reduce it to the 19 samples I actually have data for, which are given in dge$samples$sample. The code below is my progress so far: it first tries to convert dge$samples$sample into a data frame with one column titled sample, then joins the two to remove all the metadata that is not of interest to me. I think I am failing to understand how the pipe works.
test = data.frame(dge$samples$sample) %>%
  colnames(.) = c("sample") %>%
  left_join(
    .,
    mdat,
    by = sample,
    copy = FALSE,
    suffix = c(".x", ".y"),
    keep = FALSE,
    na_matches = c("na", "never")
  )
Why not just check whether they're in there and filter them:
mdat %>% filter( sample %in% dge$samples$sample )
It's easier to understand and control than a join, and performance shouldn't be an issue.
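If you do prefer a join-based equivalent, a semi_join() keeps only the mdat rows that have a match in the sample list (a sketch, assuming mdat and dge$samples$sample as set up above):

library(dplyr)

# keep only the mdat rows whose sample appears in dge$samples$sample
mdat_kept <- semi_join(mdat,
                       data.frame(sample = dge$samples$sample),
                       by = "sample")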
I think your code can be reduced to
library(dplyr)
test <- data.frame(sample = dge$samples$sample) %>%
left_join(mdat, by = 'sample')
Or an inner join should work as well, using base R:
test <- merge(data.frame(sample = dge$samples$sample), mdat, by = 'sample')
Using collapse
library(collapse)
sbt(mdat, sample %in% dge$samples$sample)
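Here sbt() is collapse's shorthand for fsubset(); if the abbreviation is unclear, the long form is equivalent:

fsubset(mdat, sample %in% dge$samples$sample)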
I tried to run a panel VAR on a dataset I got from Statistics Sweden, and here is what I get:
library(readxl)    # read_excel()
library(panelvar)  # pvarfeols()

df <- read_excel("Inkfördelning per kommun.xlsx")

nujavlar <- pvarfeols(dependent_vars = c("Kvintil-1", "Kvintil-4", "Kvintil-5"),
                      lags = 1,
                      transformation = "demean",
                      data = df,
                      panel_identifier = c("Kommun", "Year"))
Error: Can't subset columns that don't exist.
x Column `Kvintil-1` doesn't exist.
I often get this message too:
Warning in xtfrm.data.frame(x) : cannot xtfrm data frames
Error: Can't subset columns that don't exist.
x Location 2 doesn't exist.
ℹ There are only 1 column.
I have made sure that all the data is numeric. I have also tried cleaning my workspace and restarting the program. I also tried converting it into a panel data frame with the plm package, and converting my entity variable "Kommun" (municipality) into a factor, but it still doesn't work.
Here's the data if someone wants to give it a go.
https://docs.google.com/spreadsheets/d/16Ak_Z2n6my-5wEw69G29_NLryQKcrYZC/edit?usp=sharing&ouid=113164216369677216623&rtpof=true&sd=true
The column names in your data frame are Kvintil 1, not Kvintil-1, so the variable you are referring to really does not exist. Be aware that in R, syntactic variable names cannot contain hyphens (a hyphen is parsed as a minus sign), and it is good practice to avoid spaces as well, because names with spaces have to be wrapped in backticks every time you refer to them. I have included a reproducible example below.
library(tidyverse)
library(gsheet)
library(panelvar)
url <- 'docs.google.com/spreadsheets/d/16Ak_Z2n6my-5wEw69G29_NLryQKcrYZC'
df <- gsheet2tbl(url) %>%
rename(Kvintil1 = `Kvintil 1`) %>%
rename(Kvintil2 = `Kvintil 2`) %>%
rename(Kvintil3 = `Kvintil 3`) %>%
rename(Kvintil4 = `Kvintil 4`) %>%
rename(Kvintil5 = `Kvintil 5`) %>%
as.data.frame()
nujavlar <- pvarfeols(
dependent_vars = c("Kvintil1", "Kvintil4", "Kvintil5"),
lags = 1,
transformation = "demean",
data = df,
panel_identifier = c("Kommun", "Year"))
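As a side note, the five rename() calls above could be collapsed into a single rename_with() that strips the space from every Kvintil column (a sketch, assuming dplyr >= 1.0.0 and the column names shown in the sheet):

df <- gsheet2tbl(url) %>%
  rename_with(~ gsub(" ", "", .x), starts_with("Kvintil")) %>%
  as.data.frame()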
I am trying to unite first and last names in each data frame in a list of data frames. The problem is that purrr doesn't seem to recognize the column names within each data frame.
Each df in data$authors_list looks something like this:
authid   surname   given-name
12345    Smith     John
85858    Scott     Jane
I want to unite "surname" and "given-name" into a column called AuN.
data <- data %>%
  mutate(authors_list = map(authors_list,
                            unite(col = AuN,
                                  c(`given-name`, surname),
                                  sep = " ")))
However, I get the following error.
Error in unite(col = AuN, c(`given-name`, surname), sep = " ") :
object 'given-name' not found
I am new to using purrr, and I haven't been able to find solutions to a similar problem online. Any help would be appreciated!
I think this is what you're after. You need to use a formula (~) with .x in the unite() call to stand in for each data frame in the list. For each one, it will unite with the parameters you specified.
library(tidyverse)
#Set up the data (but please in the future give us data so we don't have to set it up)
df <- tibble(authid = c(12345, 85858), surname = c("Smith", "Scott"), `given-name` = c("John","Jane"))
list_df <- list(df, df, df)
list_df_unite <- map(list_df, ~ unite(.x, AuN, c(`given-name`,surname), sep = " "))
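To put the united data frames back into the nested column from the question, the same formula goes inside mutate() (a sketch, assuming data$authors_list is a list-column of data frames as described):

data <- data %>%
  mutate(authors_list = map(authors_list,
                            ~ unite(.x, col = AuN, c(`given-name`, surname), sep = " ")))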
I have a wide spark data frame of a few thousand columns by about a million rows, for which I would like to calculate the row totals. My solution so far is below. I used:
dplyr - sum of multiple columns using regular expressions and
https://github.com/tidyverse/rlang/issues/116
library(sparklyr)
library(DBI)
library(dplyr)
library(rlang)
sc1 <- spark_connect(master = "local")
wide_df = as.data.frame(matrix(ceiling(runif(2000, 0, 20)), 10, 200))
wide_sdf = sdf_copy_to(sc1, wide_df, overwrite = TRUE, name = "wide_sdf")
col_eqn = paste0(colnames(wide_df), collapse = "+" )
# build up the SQL query and send to spark with DBI
query = paste0("SELECT (",
col_eqn,
") as total FROM wide_sdf")
dbGetQuery(sc1, query)
# Equivalent approach using dplyr instead
col_eqn2 = quo(!! parse_expr(col_eqn))
wide_sdf %>%
transmute("total" := !!col_eqn2) %>%
collect() %>%
as.data.frame()
The problems come when the number of columns is increased. In Spark SQL the sum seems to be calculated one element at a time, i.e. (((V1 + V2) + V3) + V4)..., and this leads to errors due to very deep recursion.
Does anyone have an alternative more efficient approach? Any help would be much appreciated.
You're out of luck here. One way or another you are going to hit some recursion limits (even if you get around the SQL parser, a sufficiently large sum of expressions will crash the query planner). There are some slow solutions available:
Use spark_apply (at the cost of conversion to and from R):
wide_sdf %>% spark_apply(function(df) { data.frame(total = rowSums(df)) })
Convert to long format and aggregate (at the cost of explode and shuffle):
key_expr <- "monotonically_increasing_id() AS key"
value_expr <- paste(
"explode(array(", paste(colnames(wide_sdf), collapse=","), ")) AS value"
)
wide_sdf %>%
spark_dataframe() %>%
# Add id and explode. We need a separate invoke so id is applied
# before "lateral view"
sparklyr::invoke("selectExpr", list(key_expr, "*")) %>%
sparklyr::invoke("selectExpr", list("key", value_expr)) %>%
sdf_register() %>%
# Aggregate by id
group_by(key) %>%
summarize(total = sum(value)) %>%
arrange(key)
To get something more efficient, you should consider writing a Scala extension and applying the sum directly on the Row objects, without exploding:
package com.example.sparklyr.rowsum
import org.apache.spark.sql.{DataFrame, Encoders}
object RowSum {
def apply(df: DataFrame, cols: Seq[String]) = df.map {
row => cols.map(c => row.getAs[Double](c)).sum
}(Encoders.scalaDouble)
}
and
invoke_static(
  sc, "com.example.sparklyr.rowsum.RowSum", "apply",
  wide_sdf %>% spark_dataframe(),
  as.list(colnames(wide_sdf))  # cols argument expected by RowSum.apply; may need conversion to a Scala Seq
) %>% sdf_register()
Short version: when executing the following command qtm(World, "amount") I get the following error message:
Error in `$<-.data.frame`(`*tmp*`, "SHAPE_AREAS", value = c(653989.801201595, :
  replacement has 177 rows, data has 175
Disclaimer: this is the same problem I used to have in this question, but if I'm not wrong, in that one the problem was that I had one variable in the left data frame that matched several variables in the right one, and hence I needed to group the variables in the right data frame. In this case, I am pretty sure I do not have the same problem, as can be seen from the code below:
library(tmap)
library(tidyr)
# Read tmap's world map.
data("World")
# Load my dataframe.
df = read.csv("https://gist.githubusercontent.com/ccamara/ad106eda807f710a6f331084ea091513/raw/dc9b51bfc73f09610f199a5a3267621874606aec/tmap.sample.dataframe.csv",
na = "")
# Compare the countries in df that do not match with World's
# SpatialPolygons.
df$iso_a3 %in% World$iso_a3
# Return rows which do not match
selected.countries = df$iso_a3[!df$iso_a3 %in% World$iso_a3]
df.f = filter(df, !(iso_a3 %in% selected.countries))
# Verification.
df.f$iso_a3[!df.f$iso_a3 %in% World$iso_a3]
World@data = World@data %>%
left_join(df.f, by = "iso_a3") %>%
mutate(iso_a3 = as.factor(iso_a3)) %>%
filter(complete.cases(iso_a3))
qtm(World, "amount")
My guess is that the clue may be that the column I am using when joining the two data frames has different levels (hence it is converted to a string), but I'm ashamed to admit that I still don't understand the error I am getting here. I'm assuming there is something wrong with my data frame, although it didn't work even with a smaller data frame:
selected.countries2 = c("USA", "FRA", "ITA", "ESP")
df.f2 = filter(df, iso_a3 %in% selected.countries2)
df.f2$iso_a3 = droplevels(df.f2$iso_a3)
World@data = World@data %>%
left_join(df.f2, by = "iso_a3") %>%
mutate(iso_a3 = as.factor(iso_a3)) %>%
filter(complete.cases(iso_a3))
World$iso_a3 = droplevels(World$iso_a3)
qtm(World, "amount")
Can anyone help me by pointing out what's causing this error? (Providing a solution would also be much appreciated.)
Edited: it is again your data. Some iso_a3 codes appear more than once in your data frame, so joining it onto the 175 polygons in World produces 177 rows, which is exactly what the error is complaining about. You can see the duplicates with:
table(df$iso_a3)
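A quick dplyr way to list only the offending codes (a sketch, assuming df as read in the question):

library(dplyr)

df %>%
  count(iso_a3) %>%
  filter(n > 1)   # iso_a3 codes occurring more than once; these duplicate rows in the join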
I was using this code to create a new Group column based on partial strings found inside the column var for two groups, Sui and Swe. I had to add another group, TRD, and I've been trying to tweak the ifelse() call to do this, but with no success. Is this doable? Are there any other solutions or functions that might help me do this?
m.df <- molten.df %>%
  mutate(Group = ifelse(str_detect(variable, "Sui"), "Sui", "Swedish"))
Current m.df:
var value
ADHD_iFullSuiTrim.Threshold1 0.00549427
ADHD_iFullSuiTrim.Threshold1 0.00513955
ADHD_iFullSweTrim.Threshold1 0.00466352
ADHD_iFullSweTrim.Threshold1 0.00491633
ADHD_iFullTRDTrim.Threshold1 0.00658535
ADHD_iFullTRDTrim.Threshold1 0.00609122
Desired Result:
var value Group
ADHD_iFullSuiTrim.Threshold1 0.00549427 Sui
ADHD_iFullSuiTrim.Threshold1 0.00513955 Sui
ADHD_iFullSweTrim.Threshold1 0.00466352 Swedish
ADHD_iFullSweTrim.Threshold1 0.00491633 Swedish
ADHD_iFullTRDTrim.Threshold1 0.00658535 TRD
ADHD_iFullTRDTrim.Threshold1 0.00609122 TRD
Any help or suggestion would be appreciated even if the result can be accomplished using other functions.
No ifelse() is needed. I'd use Group = str_extract(var, pattern = "(Sui)|(TRD)|(Swe)").
You could do fancier regex with a lookbehind for "iFull" and a lookahead for "Trim", but I can never remember how to do that.
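For what it's worth, a lookaround version might be the following (a sketch, relying on stringr's ICU engine supporting fixed-length lookbehind):

Group = str_extract(var, pattern = "(?<=iFull).*?(?=Trim)")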
A little more roundabout, but general (if you want whatever sits between "iFull" and "Trim"), would be a replacement:
str_replace_all(var, pattern = "(.*iFull)|(Trim.*)", "")
Try using multiple ifelse() calls:
library(dplyr)
library(stringr)
m.df <- molten.df %>%
mutate(Group = ifelse(str_detect(var, "Sui"), "Sui",
ifelse(str_detect(var, "Swe"), "Swedish", "TRD")))
Or case_when
m.df <- molten.df %>%
  mutate(Group = case_when(
    str_detect(var, "Sui") ~ "Sui",
    str_detect(var, "Swe") ~ "Swedish",
    TRUE ~ "TRD"
  ))
Data Preparation
molten.df <- read.table(text = "var value
'ADHD_iFullSuiTrim.Threshold1' 0.00549427
'ADHD_iFullSuiTrim.Threshold1' 0.00513955
'ADHD_iFullSweTrim.Threshold1' 0.00466352
'ADHD_iFullSweTrim.Threshold1' 0.00491633
'ADHD_iFullTRDTrim.Threshold1' 0.00658535
'ADHD_iFullTRDTrim.Threshold1' 0.00609122",
header = TRUE, stringsAsFactors = FALSE)
For future reference, provide all the necessary components for reproducing the analysis, e.g., packages and example data:
# load ----
library(dplyr)
library(stringr)
# data ----
df=data.frame(var=c('ADHD_iFullSuiTrim.Threshold1',
'ADHD_iFullSuiTrim.Threshold1',
'ADHD_iFullSweTrim.Threshold1',
'ADHD_iFullSweTrim.Threshold1',
'ADHD_iFullTRDTrim.Threshold1',
'ADHD_iFullTRDTrim.Threshold1'),
value = c(0.00549427, 0.00513955, 0.00466352, 0.00491633, 0.00658535, 0.00609122))
df %>%
mutate(Group = case_when(str_detect(var, "Sui")~"Sui",
str_detect(var, "Swe")~"Swedish",
str_detect(var, "TRD")~"TRD"))