Reshape a ragged, wide array with repeated variables to long in R

I have a table like
+------+---------+---------+---------+----------+---------+
| Code | Display | Synonym | Synonym | Synonym | Synonym |
+------+---------+---------+---------+----------+---------+
| 1 | A | Cat | Dog | Lion | |
| 2 | B | Horse | Penguin | | |
| 3 | C | Donkey | Giraffe | Mongoose | Rabbit |
+------+---------+---------+---------+----------+---------+
I want to output a table like
+------+---------+----------+
| Code | Display | Synonym |
+------+---------+----------+
| 1 | A | Cat |
| 1 | A | Dog |
| 1 | A | Lion |
| 2 | B | Horse |
| 2 | B | Penguin |
| 3 | C | Donkey |
| 3 | C | Giraffe |
| 3 | C | Mongoose |
| 3 | C | Rabbit |
+------+---------+----------+
In other words, I want to pair off Code and Display with each Synonym that is presented, and each Code can have 1 to several synonyms. I've seen examples of reshape used in other contexts, but haven't been able to figure out how to apply it here.

You can use standard reshaping on a ragged array. With melt() from reshape2, the na.rm argument removes NAs as you go; otherwise, you can remove them afterward:
library(reshape2)
dat.m <- melt(dat, id.vars = c("Code", "Display"), value.name = "Synonym", na.rm = TRUE)
# Code Display variable Synonym
#1 1 A Synonym Cat
#2 2 B Synonym Horse
#3 3 C Synonym Donkey
#4 1 A Synonym.1 Dog
#5 2 B Synonym.1 Penguin
#6 3 C Synonym.1 Giraffe
#7 1 A Synonym.2 Lion
#9 3 C Synonym.2 Mongoose
#12 3 C Synonym.3 Rabbit
You can drop the variable column if you like:
dat.m$variable <- NULL

Here are two base R approaches.
stack
cbind(mydf[1:2], stack(lapply(mydf[-c(1:2)], as.character)))
# Code Display values ind
# 1 1 A Cat Synonym
# 2 2 B Horse Synonym
# 3 3 C Donkey Synonym
# 4 1 A Dog Synonym.1
# 5 2 B Penguin Synonym.1
# 6 3 C Giraffe Synonym.1
# 7 1 A Lion Synonym.2
# 8 2 B Synonym.2
# 9 3 C Mongoose Synonym.2
# 10 1 A Synonym.3
# 11 2 B Synonym.3
# 12 3 C Rabbit Synonym.3
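The stack output above still contains the blank cells from the ragged input. A minimal sketch of one way to drop them and recover the requested three-column layout, assuming the blanks arrive as empty strings or NA (the data frame below is rebuilt by hand from the question's table, so its construction is an assumption):

```r
# Rebuild the example input; R renames the repeated "Synonym" columns on import
mydf <- data.frame(
  Code = 1:3,
  Display = c("A", "B", "C"),
  Synonym = c("Cat", "Horse", "Donkey"),
  Synonym.1 = c("Dog", "Penguin", "Giraffe"),
  Synonym.2 = c("Lion", "", "Mongoose"),
  Synonym.3 = c("", "", "Rabbit"),
  stringsAsFactors = FALSE
)

# Same stack() call as above: id columns are recycled against the stacked values
long <- cbind(mydf[1:2], stack(lapply(mydf[-(1:2)], as.character)))

# Drop blank or missing synonyms, then tidy up columns and row order
long <- long[!is.na(long$values) & long$values != "", ]
long <- long[order(long$Code), c("Code", "Display", "values")]
names(long)[3] <- "Synonym"
rownames(long) <- NULL
long
```

The same filter can be applied to the reshape() result further down by testing its Synonym column instead of values.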
reshape
Make life easier by renaming your columns first to a pattern like "Synonym_1", "Synonym_2" and so on. Actually, R likes "Synonym.1", "Synonym.2" and so on better....
A <- grep("Synonym", names(mydf))
names(mydf)[A] <- paste0("Synonym_", seq_along(A))
Now, reshape...
reshape(mydf, direction = "long", varying = A, sep = "_")
# Code Display time Synonym id
# 1.1 1 A 1 Cat 1
# 2.1 2 B 1 Horse 2
# 3.1 3 C 1 Donkey 3
# 1.2 1 A 2 Dog 1
# 2.2 2 B 2 Penguin 2
# 3.2 3 C 2 Giraffe 3
# 1.3 1 A 3 Lion 1
# 2.3 2 B 3 2
# 3.3 3 C 3 Mongoose 3
# 1.4 1 A 4 1
# 2.4 2 B 4 2
# 3.4 3 C 4 Rabbit 3

I figured out a possibly indirect way to do this shortly after asking the question:
allergies_output <- reshape(allergies_input,varying=list(grep('Synonym',names(allergies_input),value=TRUE)),direction='long',idvar=c('Code','Display'),v.names='Synonym',names(allergies_input))
This gives some wonky results, but nothing that can't be fixed by dropping some column names.

Related

R, Friedman's test 'not an unreplicated complete block design' error?

I am trying to run a Friedman test. My data is repeated-measures and nonparametric.
The data is organized like this in the CSV. I loaded it with RStudio's Import Dataset function, so it is a table in RStudio:
score| treatment | day
10 | 1 | 1
20 | 1 | 1
40 | 1 | 1
7 | 2 | 1
100| 2 | 1
58 | 2 | 1
98 | 3 | 1
89 | 3 | 1
40 | 3 | 1
70 | 4 | 1
10 | 4 | 1
28 | 4 | 1
86 | 5 | 1
200| 5 | 1
40 | 5 | 1
77 | 1 | 2
100| 1 | 2
90 | 1 | 2
33 | 2 | 2
15 | 2 | 2
25 | 2 | 2
23 | 3 | 2
54 | 3 | 2
67 | 3 | 2
1 | 4 | 2
2 | 4 | 2
400| 4 | 2
16 | 5 | 2
10 | 5 | 2
90 | 5 | 2
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment) #setting treatment as categorical independent variable
sample_data$day <- as.factor(sample_data$day) #setting day as categorical independent variable
summary(sample_data)
#attach(sample_data) #not sure if this should be used only because according to https://www.sheffield.ac.uk/polopoly_fs/1.714578!/file/stcp-marquier-FriedmanR.pdf it says to use attach for R to use the variables directly
friedman3 <- friedman.test(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day)
summary(friedman3)
I am interested in day and score, using Friedman's test.
This is the error I get:
>Error in friedman.test.default(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day, :
not an unreplicated complete block design
I am not sure what is wrong.
Before writing the Friedman part of the code, I only set day and treatment as categorical using as.factor.
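One likely reading of the error: friedman.test expects exactly one observation for every combination of group and block, and the data above has three scores per treatment and day. The sketch below rebuilds the posted data by hand, checks the cell counts, and then collapses each cell to its median before testing. Collapsing to the median is only an assumption about what the analysis should do, and the names cell_counts and cell_medians are illustrative.

```r
# Rebuild the posted data (30 rows: 5 treatments x 2 days x 3 replicates)
sample_data <- data.frame(
  score = c(10, 20, 40, 7, 100, 58, 98, 89, 40, 70, 10, 28, 86, 200, 40,
            77, 100, 90, 33, 15, 25, 23, 54, 67, 1, 2, 400, 16, 10, 90),
  treatment = factor(rep(rep(1:5, each = 3), times = 2)),
  day = factor(rep(1:2, each = 15))
)

# An unreplicated complete block design needs a count of 1 in every cell;
# here every treatment/day cell holds 3 observations
cell_counts <- table(sample_data$treatment, sample_data$day)
cell_counts

# Assumption: summarise the replicates with the median so each cell has one value
cell_medians <- aggregate(score ~ treatment + day, data = sample_data, FUN = median)

# Formula interface: response ~ groups | blocks
friedman_result <- friedman.test(score ~ treatment | day, data = cell_medians)
friedman_result
```

Printing the result directly is enough here; summary() on the returned object only lists its components.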

Product calculation by group in R data.table

I'm currently working on transforming a dataset to take the running product of all previous observations in a data.table. This is easy to implement in Excel, but I am struggling to find a non-recursive solution in data.table. The data is shown below in short form; in the real data, ID has thousands more levels and there are thousands of X values per ID. Each ID has the same number of X values.
| index | ID | X |
|-------|----|------|
| 1 | 1 | 0.8 |
| 2 | 1 | 0.75 |
| 3 | 1 | 0.72 |
| 4 | 2 | 0.9 |
| 5 | 2 | 0.5 |
| 6 | 2 | 0.45 |
What I want to end up with is the following
| index | ID | X | product |
|-------|----|------|---------|
| 1 | 1 | 0.8 | 0.8 |
| 2 | 1 | 0.75 | 0.6 |
| 3 | 1 | 0.72 | 0.432 |
| 4 | 2 | 0.9 | 0.9 |
| 5 | 2 | 0.5 | 0.45 |
| 6 | 2 | 0.45 | 0.2025 |
Where product is equal to X multiplied by all previous values of X for that particular ID. This can be done in a for loop; however, I am looking for a solution that leverages data.table so that it can be run on a cluster.
Reproducible data:
df <- fread('
index ID X
1 1 0.8
2 1 0.75
3 1 0.72
4 2 0.9
5 2 0.5
6 2 0.45
')
You can use cumprod
# If data.table not already loaded, these steps are required first
# library(data.table)
# setDT(df)
df[, Xprod := cumprod(X), ID][]
# index ID X Xprod
# 1: 1 1 0.80 0.8000
# 2: 2 1 0.75 0.6000
# 3: 3 1 0.72 0.4320
# 4: 4 2 0.90 0.9000
# 5: 5 2 0.50 0.4500
# 6: 6 2 0.45 0.2025
If you need a cumulative version of a function other than prod, you can use frollapply with an adaptive window. For example, the code below gives the same result as the code above.
df[, Xprod := frollapply(X, 1:.N, prod, adaptive = TRUE), by = ID]
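As a cross-check that does not depend on data.table, the same grouped running product can be sketched in base R with ave(). The second variant uses Reduce() with accumulate = TRUE, which generalises to any binary function. The helper name running is hypothetical, and this is a small-data sketch, not a tuned replacement for the data.table code.

```r
# Same example data as a plain data frame
df <- data.frame(
  index = 1:6,
  ID = c(1, 1, 1, 2, 2, 2),
  X = c(0.8, 0.75, 0.72, 0.9, 0.5, 0.45)
)

# Grouped cumulative product: ave() applies FUN within each ID
df$product <- ave(df$X, df$ID, FUN = cumprod)

# Generic running accumulation for an arbitrary binary function f
running <- function(x, f) unlist(Reduce(f, x, accumulate = TRUE))
df$product_reduce <- ave(df$X, df$ID, FUN = function(x) running(x, `*`))

df
```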

r data.table groupby join in pyspark 1.6

I have the following datatables (R code):
accounts <- fread("ACC_ID | DATE | RATIO | VALUE
1 | 2017-12-31 | 2.00 | 8
2 | 2017-12-31 | 2.00 | 12
3 | 2017-12-31 | 6.00 | 20
4 | 2017-12-31 | 1.00 | 5 ", sep='|')
timeline <- fread(" DATE
2017-12-31
2018-12-31
2019-12-31
2020-12-31", sep="|")
In R, I know I can join on DATE, by ACC_ID, RATIO and VALUE:
accounts[, .SD[timeline, on='DATE'], by=c('ACC_ID', 'RATIO', 'VALUE')]
This way, I can "project" ACC_ID, RATIO and VALUE values over timeline dates, getting the following data table:
ACC_ID | RATIO | VALUE | DATE
1 | 2 | 8 |2017-12-31
2 | 2 | 12 |2017-12-31
3 | 6 | 20 |2017-12-31
4 | 1 | 5 |2017-12-31
1 | 2 | 8 |2018-12-31
2 | 2 | 12 |2018-12-31
3 | 6 | 20 |2018-12-31
4 | 1 | 5 |2018-12-31
1 | 2 | 8 |2019-12-31
2 | 2 | 12 |2019-12-31
3 | 6 | 20 |2019-12-31
4 | 1 | 5 |2019-12-31
1 | 2 | 8 |2020-12-31
2 | 2 | 12 |2020-12-31
3 | 6 | 20 |2020-12-31
4 | 1 | 5 |2020-12-31
I've been trying hard to find something similar with PySpark, but I've not been able to. What should be the appropriate way to solve this?
Thanks very much for your time. I greatly appreciate any help you can give me, this one is important for me.
It looks like you're trying to do a cross join?
spark.sql('''
select ACC_ID, RATIO, VALUE, timeline.DATE
from accounts, timeline
''')
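To see that the projection in the question really is a cross join, here is a base R sketch of the same result using merge() with by = NULL, which returns the Cartesian product of two data frames. The data frames are rebuilt by hand from the question. DATE is left out of accounts so the two inputs share no column name, which is an assumption made to keep the output tidy.

```r
# Account attributes, without the DATE column
accounts <- data.frame(
  ACC_ID = 1:4,
  RATIO = c(2, 2, 6, 1),
  VALUE = c(8, 12, 20, 5)
)

# Dates to project every account onto
timeline <- data.frame(
  DATE = as.Date(c("2017-12-31", "2018-12-31", "2019-12-31", "2020-12-31"))
)

# by = NULL pairs every row of accounts with every row of timeline
projected <- merge(accounts, timeline, by = NULL)
projected
```

Every account appears once per timeline date, giving 4 x 4 = 16 rows, which is what the SQL query without a join condition is meant to produce.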

R data.table check if a row exists in another data.table

I have two data.tables like so:
tests
id | test | score
=================
1 | 1 | 90
1 | 2 | 100
2 | 1 | 70
2 | 2 | 80
3 | 1 | 100
3 | 2 | 95
cheaters
id | test | score
=================
1 | 2 | 100
3 | 1 | 100
3 | 2 | 95
Say I now want to include a boolean column in tests to tell whether that particular test was cheated on, so the output would be like this:
tests
id | test | score | cheat
=========================
1 | 1 | 90 | FALSE
1 | 2 | 100 | TRUE
2 | 1 | 70 | FALSE
2 | 2 | 80 | FALSE
3 | 1 | 100 | TRUE
3 | 2 | 95 | TRUE
Is there an easy way to do this? The tables are keyed on id and test.
Create the cheat column with an initial value of FALSE, then join with cheaters and update the cheat column to TRUE where there is a match:
library(data.table)
setkey(setDT(tests), id, test)
setkey(setDT(cheaters), id, test)
tests[, cheat := FALSE][cheaters, cheat := TRUE]
tests
# id test score cheat
#1: 1 1 90 FALSE
#2: 1 2 100 TRUE
#3: 2 1 70 FALSE
#4: 2 2 80 FALSE
#5: 3 1 100 TRUE
#6: 3 2 95 TRUE
Or, without setting the keys, use the on parameter to specify the columns to join on:
setDT(tests)
setDT(cheaters)
tests[, cheat := FALSE][cheaters, cheat := TRUE, on = .(id, test)]
tests
# id test score cheat
#1: 1 1 90 FALSE
#2: 1 2 100 TRUE
#3: 2 1 70 FALSE
#4: 2 2 80 FALSE
#5: 3 1 100 TRUE
#6: 3 2 95 TRUE
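For comparison, the same membership check can be sketched in base R by building a composite key from id and test and testing it with %in%. The helper row_key is hypothetical, and the separator is assumed not to occur in the key values. This only illustrates the logic; it is not a substitute for the keyed update join above on large tables.

```r
# Example tables rebuilt as plain data frames
tests <- data.frame(
  id = c(1, 1, 2, 2, 3, 3),
  test = c(1, 2, 1, 2, 1, 2),
  score = c(90, 100, 70, 80, 100, 95)
)
cheaters <- data.frame(
  id = c(1, 3, 3),
  test = c(2, 1, 2),
  score = c(100, 100, 95)
)

# Composite key over the join columns
row_key <- function(d) paste(d$id, d$test, sep = "_")

# TRUE where the (id, test) pair also appears in cheaters
tests$cheat <- row_key(tests) %in% row_key(cheaters)
tests
```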

Conditionally remove rows from H2O frame object in R

I have an H2O frame R object like this
h2odf
A | B | C | D
--|---|---|---
1 | NA| 2 | 0
2 | 1 | 2 | 0
3 | NA| 2 | 0
4 | 3 | 2 | 0
I want to remove all those rows where B is NA (1st and 3rd row). I have tried
na <- is.na(h2odf[,"b"])
h2odf <- h2odf[!na,]
and
h2odf <- h2odf[!is.na(h2odf$B),]
and
h2odf <- subset(h2odf, B!=NA)
This works for an R data frame but not for an H2O frame. It gives this error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
ERROR MESSAGE:
DistributedException from localhost/127.0.0.1:54321: 'Cannot set illegal UUID value'
Desired output is
h2odf
A | B | C | D
--|---|---|---
2 | 1 | 2 | 0
4 | 3 | 2 | 0
One option I have is to convert it into an R data frame, remove the rows, and convert it back to an H2O frame. But that takes a long time because the input file is close to 4.5 GB. Is it possible to do this on the H2O frame hex object itself?
I am running RStudio on an AWS cluster.
> class(h2odf)
[1] "H2OFrame"
> h2odf
A B C D
1 1 NA 2 0
2 2 1 2 0
3 3 NA 2 0
4 4 3 2 0
[4 rows x 4 columns]
> h2odf[!is.na(as.numeric(as.character(h2odf$B))),]
A B C D
1 2 1 2 0
2 4 3 2 0
[2 rows x 4 columns]
