I recently updated to R 3.5.0 and RStudio 1.1.453, and now the kable function is not working properly.
When I copy and paste the code and its output, this is what you see:
library(knitr)
fakedata <- data.frame(Species = c(1:8),
                       Sites = sample(1:25, 8, replace = TRUE),
                       Positives = sample(1:100, 8, replace = TRUE))
kable(fakedata)
| Species| Sites| Positives|
|-------:|-----:|---------:|
| 1| 22| 79|
| 2| 25| 97|
| 3| 19| 28|
| 4| 15| 22|
| 5| 9| 97|
| 6| 14| 71|
| 7| 1| 70|
| 8| 21| 83|
I get no error at all, and the output is the same with or without R Markdown.
I also reinstalled MiKTeX.
When I knit it to a document, the output is simply blank, again with no errors.
Is anyone else having this problem?
Is there another update I am missing?
Thanks
Have you tried kable(format = "markdown")?
Or kable(format = "html")?
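For what it's worth, here is a minimal sketch of that suggestion, reusing the fakedata frame from the question; passing format explicitly rules out any problem with kable's automatic format detection:

library(knitr)

fakedata <- data.frame(Species = c(1:8),
                       Sites = sample(1:25, 8, replace = TRUE),
                       Positives = sample(1:100, 8, replace = TRUE))

# Force the output format instead of relying on auto-detection
kable(fakedata, format = "markdown")
kable(fakedata, format = "html")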
I have tried a search and seen that there is apparently a group-by-n-max tag, but the answers don't seem applicable to the problem I have.
I have a large set of data recording scores, attempts (and, of course, a load of other crud) against a timestamp; the timestamp is almost like a bucket in itself.
What I currently have is a relatively simplistic
select sum(score), sum(attempts), time from records group by time order by time asc;
This works well, except when the number of people changes per timestamp. So I need to limit the number summed within the group-by to a consistent value, say 40, and to make matters worse, it is an ordered list that I would like to limit by (although if the limit were achievable, the ordering should follow in much the same way).
The timestamps can be obtained with a select against the table, so I guess it would be possible to do a join with a limit; however, it feels like there should be an easier method. Unfortunately it is not an average that I want, otherwise I could of course just add a count to the group.
Edit: Yes, I should have included example input and output.
Input table on the left; note that for times 4 and 8 there are four people (a, b, c, d), though only a and d are in all times. Take limiting to 3 people as an example. On the right is the calculation of each person's rank within each time, so that at time 4 person d, and at time 8 person c, falls outside the top 3 by score.
[picture of the input data and the example rank calculation]
So the basic sum() grouped by time gives too large a result for the times where there are 4 people, i.e. times 4 and 8.
[image showing the calculation of the group-by, and the desired output]
Input:
| score | attempts | time | user |
|------:|---------:|-----:|:-----|
|    10 |        4 |    4 | a    |
|     9 |        6 |    5 | a    |
|    12 |        7 |    6 | a    |
|     4 |        8 |    7 | a    |
|     6 |        9 |    8 | a    |
|    13 |        1 |    4 | b    |
|     5 |        3 |    6 | b    |
|     6 |        5 |    7 | b    |
|     7 |        7 |    8 | b    |
|    24 |        2 |    4 | c    |
|     2 |        5 |    5 | c    |
|     1 |        7 |    7 | c    |
|     5 |        6 |    8 | c    |
|     5 |        3 |    4 | d    |
|     3 |        4 |    5 | d    |
|     5 |        6 |    6 | d    |
|     7 |        2 |    8 | d    |
Desired output (see images for a better idea)
| score | attempts | time |
|------:|---------:|-----:|
|    47 |        7 |    4 |
|    14 |       15 |    5 |
|    22 |       16 |    6 |
|    11 |       20 |    7 |
|    20 |       18 |    8 |
I want to use pyspark StandardScaler on 6 out of 10 columns in my dataframe. This will be part of a pipeline.
The inputCol parameter seems to expect a vector, which I can pass in after using VectorAssembler on all my features, but this scales all 10 features. I don’t want to scale the other 4 features because they are binary and I want unstandardized coefficients for them.
Am I supposed to use vector assembler on the 6 features, scale them, then use vector assembler again on this scaled features vector and the remaining 4 features? I would end up with a vector within a vector and I’m not sure this will work.
What’s the right way to do this? An example is appreciated.
You can do this by using VectorAssembler. The key is that you have to extract the columns from the assembler output. See the code below for a working example:
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
import pandas as pd
import random

# Build a small pandas frame of random integers and convert it to a Spark DataFrame
df = pd.DataFrame()
df['a'] = random.sample(range(100), 10)
df['b'] = random.sample(range(100), 10)
df['c'] = random.sample(range(100), 10)
df['d'] = random.sample(range(100), 10)
df['e'] = random.sample(range(100), 10)

sdf = spark.createDataFrame(df)
sdf.show()
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 51| 13| 6| 5| 26|
| 18| 29| 19| 81| 28|
| 34| 1| 36| 57| 87|
| 56| 86| 51| 52| 48|
| 36| 49| 33| 15| 54|
| 87| 53| 47| 89| 85|
| 7| 14| 55| 13| 98|
| 70| 50| 32| 39| 58|
| 80| 20| 25| 54| 37|
| 40| 33| 44| 83| 27|
+---+---+---+---+---+
cols_to_scale = ['c', 'd', 'e']
cols_to_keep_unscaled = ['a', 'b']
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
assembler = VectorAssembler().setInputCols(cols_to_scale).setOutputCol("features")
sdf_transformed = assembler.transform(sdf)
scaler_model = scaler.fit(sdf_transformed.select("features"))
sdf_scaled = scaler_model.transform(sdf_transformed)
sdf_scaled.show()
+---+---+---+---+---+----------------+--------------------+
| a| b| c| d| e| features| scaledFeatures|
+---+---+---+---+---+----------------+--------------------+
| 51| 13| 6| 5| 26| [6.0,5.0,26.0]|[0.39358015146628...|
| 18| 29| 19| 81| 28|[19.0,81.0,28.0]|[1.24633714630991...|
| 34| 1| 36| 57| 87|[36.0,57.0,87.0]|[2.36148090879773...|
| 56| 86| 51| 52| 48|[51.0,52.0,48.0]|[3.34543128746345...|
| 36| 49| 33| 15| 54|[33.0,15.0,54.0]|[2.16469083306459...|
| 87| 53| 47| 89| 85|[47.0,89.0,85.0]|[3.08304451981926...|
| 7| 14| 55| 13| 98|[55.0,13.0,98.0]|[3.60781805510765...|
| 70| 50| 32| 39| 58|[32.0,39.0,58.0]|[2.09909414115354...|
| 80| 20| 25| 54| 37|[25.0,54.0,37.0]|[1.63991729777620...|
| 40| 33| 44| 83| 27|[44.0,83.0,27.0]|[2.88625444408612...|
+---+---+---+---+---+----------------+--------------------+
# Helper to turn each row back into a flat tuple: the unscaled columns plus the scaled values
def extract(row):
    return (row.a, row.b,) + tuple(row.scaledFeatures.toArray().tolist())

sdf_scaled = sdf_scaled.select(*cols_to_keep_unscaled, "scaledFeatures").rdd \
    .map(extract).toDF(cols_to_keep_unscaled + cols_to_scale)
sdf_scaled.show()
+---+---+------------------+-------------------+------------------+
| a| b| c| d| e|
+---+---+------------------+-------------------+------------------+
| 51| 13|0.3935801514662892|0.16399957083190683|0.9667572801316145|
| 18| 29| 1.246337146309916| 2.656793047476891|1.0411232247571234|
| 34| 1|2.3614809087977355| 1.8695951074837378|3.2349185912096337|
| 56| 86|3.3454312874634584| 1.7055955366518312|1.7847826710122114|
| 36| 49| 2.164690833064591|0.49199871249572047| 2.007880504888738|
| 87| 53| 3.083044519819266| 2.9191923608079415|3.1605526465841245|
| 7| 14|3.6078180551076513| 0.4263988841629578| 3.643931286649932|
| 70| 50|2.0990941411535426| 1.2791966524888734|2.1566123941397555|
| 80| 20| 1.639917297776205| 1.7711953649845937| 1.375769975571913|
| 40| 33|2.8862544440861213| 2.7223928758096534| 1.003940252444369|
+---+---+------------------+-------------------+------------------+
I am an R programmer moving into the pyspark world and have gotten a lot of the basic tricks down, but something I am still struggling with is the kind of thing I would use applys or basic for loops for.
In this case I am trying to calculate the "anti-groupby" for an ID. Basically the idea is to look at the population for that ID and then the population for everything that is not that ID, and have both those values on the same row. Getting the population for that ID is easy using a groupBy and then joining it to a dataset with new_id as the only column.
This is how I would do it in R:
anti_group <- function(id){
  tr <- sum(subset(df1, new_id != id)$total_1)
  to <- sum(subset(df1, new_id != id)$total_2)
  54 * tr / to
}
test$other.RP54 <- sapply(test$new_id, anti_group)
How would I do it in pyspark?
Thanks!
Edit:
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
Then some function that creates a final dataframe that looks like this:
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
I think you can do that in two steps: first you sum by id, then you take the overall total and subtract this id's value.
My idea is a little bit like a group_by(id) %>% summarise(x = sum(x)) %>% mutate(y = sum(x) - x) in dplyr.
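Spelled out in dplyr itself (purely for illustration, using the sample data from the question; the column names mirror the desired output), that pipeline would be:

library(dplyr)

df <- data.frame(id    = c(1, 1, 2, 2, 3, 3, 4, 4),
                 value = c(40, 30, 10, 90, 20, 10, 2, 5))

df %>%
  group_by(id) %>%
  summarise(grouped_total = sum(value)) %>%
  mutate(anti_grouped_total = sum(grouped_total) - grouped_total)
# gives the grouped_total / anti_grouped_total values shown in the desired output above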
The solution I propose is based on a Window function. It is untested:
Let's first create the data
import pyspark.sql.functions as psf
import pyspark.sql.window as psw
df = spark.createDataFrame([(1,40),(1,30),(2,10),(2,90),(3,20),(3,10),(4,2),(4,5)], ['id','value'])
df.show(2)
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
+---+-----+
only showing top 2 rows
and then apply that approach:
w = psw.Window.orderBy()
df_id = df.groupBy("id").agg(psf.sum("value").alias("grouped_total"))
df_id = (df_id
    .withColumn("anti_grouped_total", psf.sum("grouped_total").over(w))
    .withColumn("anti_grouped_total", psf.col("anti_grouped_total") - psf.col("grouped_total"))
)
df_id.show(2)
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 3| 30| 177|
| 1| 70| 137|
+---+-------------+------------------+
only showing top 2 rows
So there's no built-in function that would replicate that anti-groupBy directly, but you could easily do it by creating a new column with a case expression (when/otherwise clause) to mark your group and anti-group, and then grouping by that new column.
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
from pyspark.sql import functions as F
df.withColumn("anti_id_1", F.when(F.col("id")==1, F.lit('1')).otherwise(F.lit('Not_1')))\
.groupBy("anti_id_1").agg(F.sum("value").alias("sum")).show()
+---------+---+
|anti_id_1|sum|
+---------+---+
| 1| 70|
| Not_1|137|
+---------+---+
UPDATE:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w1=Window().partitionBy("id")
w=Window().partitionBy()
df.withColumn("grouped_total",F.sum("value").over(w1))\
.withColumn("anti_grouped_total", (F.sum("value").over(w))-F.col("grouped_total"))\
.groupBy("id").agg(F.first("grouped_total").alias("grouped_total"),\
F.first("anti_grouped_total").alias("anti_grouped_total"))\
.drop("value").orderBy("id").show()
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
A less verbose, more concise way to achieve the same output:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy()
df.groupBy("id").agg(F.sum("value").alias("grouped_total"))\
.withColumn("anti_grouped_total",F.sum("grouped_total").over(w)-F.col("grouped_total")).orderBy("id"),show()
For 2 value columns:
df.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
| 1| 40| 50|
| 1| 30| 70|
| 2| 10| 91|
| 2| 90| 21|
| 3| 20| 42|
| 3| 10| 4|
| 4| 2| 23|
| 4| 5| 12|
+---+------+------+
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window().partitionBy()
df.groupBy("id").agg(F.sum("value1").alias("grouped_total_1"),F.sum("value2").alias("grouped_total_2"))\
.withColumn("anti_grouped_total_1",F.sum("grouped_total_1").over(w)-F.col("grouped_total_1"))\
.withColumn("anti_grouped_total_2",F.sum("grouped_total_2").over(w)-F.col("grouped_total_2")).orderBy("id").show()
+---+---------------+---------------+--------------------+--------------------+
| id|grouped_total_1|grouped_total_2|anti_grouped_total_1|anti_grouped_total_2|
+---+---------------+---------------+--------------------+--------------------+
| 1| 70| 120| 137| 193|
| 2| 100| 112| 107| 201|
| 3| 30| 46| 177| 267|
| 4| 7| 35| 200| 278|
+---+---------------+---------------+--------------------+--------------------+
Okay, I have a bit of a noob question, so please excuse me. I have a data frame object as follows:
| order_id| department_id|department | n|
|--------:|-------------:|:-------------|--:|
| 1| 4|produce | 4|
| 1| 15|canned goods | 1|
| 1| 16|dairy eggs | 3|
| 36| 4|produce | 3|
| 36| 7|beverages | 1|
| 36| 16|dairy eggs | 3|
| 36| 20|deli | 1|
| 38| 1|frozen | 1|
| 38| 4|produce | 6|
| 38| 13|pantry | 1|
| 38| 19|snacks | 1|
| 96| 1|frozen | 2|
| 96| 4|produce | 4|
| 96| 20|deli | 1|
This is the code I've used to arrive at this object:
temp5 <- opt %>%
  left_join(products, by = "product_id") %>%
  left_join(departments, by = "department_id") %>%
  group_by(order_id, department_id, department) %>%
  tally() %>%
  group_by(department_id)

kable(head(temp5, 14))
As you can see, the object contains the departments present in each order_id. Now what I want to do is count the number of departments for each order_id.
I tried using the summarise() function from the dplyr package, but it throws the following error:
Error in summarise_impl(.data, dots) :
Evaluation error: no applicable method for 'groups' applied to an object of class "factor".
It seems so simple, but I can't figure out how to do it. Any help will be appreciated.
Edit: This is the code that I tried to run. After that I read about the count() function in the plyr package and tried to use it as well, but it is of no use, as it needs a data frame as input, whereas I only want to count the number of occurrences within the data frame.
temp5 <- opt %>%
  left_join(products, by = "product_id") %>%
  left_join(departments, by = "department_id") %>%
  group_by(order_id, department_id, department) %>%
  tally() %>%
  group_by(department_id) %>%
  summarise(count(department))
In the output, I need to know the number of departments ordered from in each order_id, so I need something like this:
| order_id| no. of departments|
|--------:|------------------:|
|        1|                  3|
|       36|                  4|
|       38|                  4|
|       96|                  3|
And then I should be able to plot, using ggplot, the no. of orders vs the no. of departments in each order. I hope this is clear.
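For what it's worth, here is a minimal sketch of one way to get that count with dplyr, assuming the temp5 object built above; n_distinct() is used for the count, and the column name n_departments is just an illustrative choice:

library(dplyr)
library(ggplot2)

# Count the distinct departments appearing in each order
dept_counts <- temp5 %>%
  ungroup() %>%                 # drop the department_id grouping set earlier
  group_by(order_id) %>%
  summarise(n_departments = n_distinct(department))

dept_counts

# Number of orders for each "number of departments" value
ggplot(dept_counts, aes(x = n_departments)) +
  geom_bar()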
I am currently writing a meta-analysis, using a pairwise random-effects meta-analysis to compare the complication rates of 5 treatment modalities, with one as the gold standard. I was able to get RRs out of the analysis but not p-values. How do I get the p-values from this model? I tried to use pval.random but that didn't work, and I couldn't find any other code on CRAN that would help me. Can someone help me with the R code?
drf <- read.csv("drf zonder moroni.csv", sep = ";", header = TRUE, as.is = TRUE)
##
drf <- drf[, 1:5]
names(drf) <- c("study", "type", "treat", "events", "n")
compl <- subset(drf, type == "Complications")
library(netmeta)
p.compl <- pairwise(treat = treat, event = events, n = n,
                    studlab = study, data = compl)
n.compl <- netmeta(p.compl, reference = "PC", comb.random = TRUE)
n.compl
netgraph(n.compl, iterate = TRUE, number = TRUE)
Part of dataset
| Study       | Event Type    | Treatment | Number of Events (n) |  N | n/N       |
|:------------|:--------------|:----------|---------------------:|---:|:----------|
| Kumaravel   | Complications | EF        |                    3 | 23 | 0,1304348 |
| Franck      | Complications | EF        |                    2 | 20 | 0,1       |
| Schonnemann | Complications | EF        |                    8 | 30 | 0,2666667 |
| Aita        | Complications | EF        |                    1 | 16 | 0,0625    |
| Hove        | Complications | EF        |                   31 | 39 | 0,7948718 |
| Andersen    | Complications | EF        |                   26 | 75 | 0,3466667 |
| Krughaug    | Complications | EF        |                   22 | 75 | 0,2933333 |
| Moroni      | Complications | EF        |                    0 | 20 | 0         |
| Plate       | Complications | IMN       |                    3 | 30 | 0,1       |
| Chappuis    | Complications | IMN       |                    4 | 16 | 0,25      |
| Gradl       | Complications | IMN       |                   12 | 66 | 0,1818182 |
| Schonnemann | Complications | IMN       |                    6 | 31 | 0,1935484 |
| Aita        | Complications | IMN       |                    1 | 16 | 0,0625    |
| Dremstrop   | Complications | IMN       |                   17 | 44 | 0,3863636 |
| Wong        | Complications | PC        |                    1 | 30 | 0,0333333 |
| Kumaravel   | Complications | PC        |                    4 | 25 | 0,16      |
P values can be derived from n.compl$pval.random.
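For completeness, a minimal sketch of how that might be inspected, reusing the n.compl object fitted above and assuming pval.random is the usual treatment-by-treatment matrix stored on netmeta objects:

# Treatment-by-treatment matrix of random-effects p-values
pvals <- n.compl$pval.random
round(pvals, 4)

# p-values for comparisons against the reference treatment "PC"
# ("PC" is one of the treatment labels in the data above)
pvals[, "PC"]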