I have data that looks like the table below, and I would like to remove the two rows that come after the last occurrence of certain types (3 and 4). For example, there are two 4s in my table, but I only need to remove the two rows after the second 4. Same for 3: I only need to remove the two rows after the third 3.
-----------------
| grade | type |
-----------------
| 93    | 2    |
| 90    | 2    |
| 54    | 2    |
| 36    | 4    |
| 31    | 4    |
| 94    | 1    |
| 57    | 1    |
| 16    | 3    |
| 11    | 3    |
| 12    | 3    |
| 99    | 1    |
| 99    | 1    |
| 9     | 3    |
| 10    | 3    |
| 97    | 1    |
| 96    | 1    |
-----------------
The desired output would be:
-----------------
| grade | type |
-----------------
| 93    | 2    |
| 90    | 2    |
| 54    | 2    |
| 36    | 4    |
| 31    | 4    |
| 16    | 3    |
| 11    | 3    |
| 12    | 3    |
| 9     | 3    |
| 10    | 3    |
-----------------
Here is the code for my example:
data <- data.frame(grade = c(93,90,54,36,31,94,57,16,11,12,99,99,9,10,97,96), type = c(2,2,2,4,4,1,1,3,3,3,1,1,3,3,1,1))
Could anyone give me some hints on how to approach this in R? Thanks a bunch in advance for your help and your time!
In base R, max(which(...)) finds the last row where each type occurs; adding 1:2 to that index gives the two rows to drop:
data[-c(max(which(data$type==3))+1:2, max(which(data$type==4))+1:2),]
# grade type
# 1 93 2
# 2 90 2
# 3 54 2
# 4 36 4
# 5 31 4
# 8 16 3
# 9 11 3
# 10 12 3
Using some indexing:
data[-(nrow(data) - match(c(3,4), rev(data$type)) + 1 + rep(1:2, each=2)),]
# grade type
#1 93 2
#2 90 2
#3 54 2
#4 36 4
#5 31 4
#8 16 3
#9 11 3
#10 12 3
Or more generically:
vals <- c(3,4)
data[-(nrow(data) - match(vals, rev(data$type)) + 1 + rep(1:2, each=length(vals))),]
The logic is to match the first instance of each value in the reversed column (i.e., the last occurrence in the original), convert that back to an original row index via nrow(data), then add 1 and 2 to those indexes and drop the resulting rows.
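To make the index arithmetic concrete, here is what each step returns for the example data above:
match(c(3, 4), rev(data$type))                    # 3 12 (positions counted from the end)
nrow(data) - match(c(3, 4), rev(data$type)) + 1   # 14  5 (original row numbers of the last 3 and last 4)
# adding rep(1:2, each = 2), with recycling, drops rows 6, 7, 15, and 16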
Similar to Ric, but I find it a bit easier to read (way more verbose, though):
library(dplyr)
idx <- data %>% mutate(id = row_number()) %>%
  filter(type %in% 3:4) %>% group_by(type) %>% filter(id == max(id)) %>% pull(id)
data[-c(idx + 1, idx + 2),]
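If this comes up more than once, the base R logic can be wrapped in a small helper. A minimal sketch; the function name drop_after_last and its arguments are purely illustrative:
# drop the n rows that follow the last occurrence of each value in df[[col]]
drop_after_last <- function(df, col, vals, n = 2) {
  last <- sapply(vals, function(v) max(which(df[[col]] == v)))
  drop <- unlist(lapply(last, function(i) i + seq_len(n)))
  drop <- drop[drop <= nrow(df)]            # guard against running past the end
  if (length(drop) == 0) return(df)
  df[-drop, ]
}

drop_after_last(data, "type", c(3, 4))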
I am trying to run a Friedman test: yes, my data is repeated measures, but nonparametric.
The data is organized like this in the CSV; I loaded it with RStudio's Import Dataset function, so it is a table in RStudio:
score | treatment | day
10    | 1         | 1
20    | 1         | 1
40    | 1         | 1
7     | 2         | 1
100   | 2         | 1
58    | 2         | 1
98    | 3         | 1
89    | 3         | 1
40    | 3         | 1
70    | 4         | 1
10    | 4         | 1
28    | 4         | 1
86    | 5         | 1
200   | 5         | 1
40    | 5         | 1
77    | 1         | 2
100   | 1         | 2
90    | 1         | 2
33    | 2         | 2
15    | 2         | 2
25    | 2         | 2
23    | 3         | 2
54    | 3         | 2
67    | 3         | 2
1     | 4         | 2
2     | 4         | 2
400   | 4         | 2
16    | 5         | 2
10    | 5         | 2
90    | 5         | 2
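For a self-contained example, sample_data can also be built directly in R instead of going through the import dialog (values transcribed from the table above):
sample_data <- data.frame(
  score = c(10, 20, 40, 7, 100, 58, 98, 89, 40, 70, 10, 28, 86, 200, 40,
            77, 100, 90, 33, 15, 25, 23, 54, 67, 1, 2, 400, 16, 10, 90),
  treatment = rep(rep(1:5, each = 3), times = 2),
  day = rep(1:2, each = 15)
)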
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment) #setting treatment as categorical independent variable
sample_data$day <- as.factor(sample_data$day) #setting day as categorical independent variable
summary(sample_data)
#attach(sample_data)  # not sure if this is needed; https://www.sheffield.ac.uk/polopoly_fs/1.714578!/file/stcp-marquier-FriedmanR.pdf uses attach() so that R can refer to the variables directly
friedman3 <- friedman.test(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day)
summary(friedman3)
I am interested in day and score using Friedman's test.
This is the error I get:
Error in friedman.test.default(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day) :
  not an unreplicated complete block design
Not sure what is wrong. Prior to writing the Friedman part of the code, I only specified day and treatment as categorical using as.factor().
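For context: friedman.test() expects an unreplicated complete block design, i.e. exactly one observation per groups × blocks cell, while the data above has three scores per treatment × day combination, which is what triggers the error. A hedged sketch of one way around it, assuming that averaging the replicates is acceptable for this analysis (an assumption, not something stated in the question):
library(dplyr)
# collapse the three replicate scores per treatment x day cell to their mean
agg <- sample_data %>%
  group_by(treatment, day) %>%
  summarise(score = mean(score), .groups = "drop")
friedman.test(score ~ treatment | day, data = agg)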
I have two dataframes: one with the original information, and a second one with corrections to some of those observations. I would like to create a function, or find a way, to replace the information in multiple columns of my first dataframe with the new information I received. I have an ID to identify the observations that need to be replaced, but since so many columns change for certain IDs, I don't know the appropriate way to change them.
My first data frame has 500 columns and 1000 observations; my second data frame has 100 columns and 800 observations that will change the original dataframe. I don't know how to efficiently replace those values according to the ID.
Here is an example of what the two dataframes look like. I need to replace just some values in multiple columns, and a merge is not the most efficient option since at least 100 columns will need changes in some of the observations.
I just need to insert the new info where it exists and keep the old values everywhere else.
Dataframe 1
|ID | X1 | X2 | X3 | X4 | XN |
|a1 | 1 | 1 | 1 | 1 | 1 |
|a2 | 2 | 2 | 2 | 2 | 2 |
|a3 | 3 | 3 | 3 | 3 | 3 |
|a4 | 4 | 4 | 4 | 4 | 4 |
|a5 | 5 | 5 | 5 | 5 | 5 |
|an | 6 | 6 | 6 | 6 | 6 |
Dataframe 2
|ID | X1 | X2 | X4|
|a1 | 8 | | 4 |
|a3 | | | 2 |
|a4 | 2 | 9 | |
|an | 1 | | 3 |
The outcome should keep the old values of Dataframe 1, with just the replacements from Dataframe 2 applied:
outcome
|ID | X1 | X2 | X3 | X4 | XN |
|a1 | 8 | 1 | 1 | 4 | 1 |
|a2 | 2 | 2 | 2 | 2 | 2 |
|a3 | 3 | 3 | 3 | 2 | 3 |
|a4 | 2 | 9 | 4 | 4 | 4 |
|a5 | 5 | 5 | 5 | 5 | 5 |
|an | 1 | 6 | 6 | 3 | 6 |
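A hedged sketch of one base R approach, assuming the two frames are called df1 and df2 (hypothetical names, reconstructed below from the example tables) and that the blank cells in Dataframe 2 are NA: match the rows on ID, then overwrite only the non-NA cells of the columns the two frames share.
# hypothetical reconstruction of the two example frames
df1 <- data.frame(ID = c("a1", "a2", "a3", "a4", "a5", "an"),
                  X1 = 1:6, X2 = 1:6, X3 = 1:6, X4 = 1:6, XN = 1:6)
df2 <- data.frame(ID = c("a1", "a3", "a4", "an"),
                  X1 = c(8, NA, 2, 1), X2 = c(NA, NA, 9, NA), X4 = c(4, 2, NA, 3))

cols <- setdiff(intersect(names(df2), names(df1)), "ID")  # shared data columns
rows <- match(df2$ID, df1$ID)                             # where each correction lands
for (col in cols) {
  keep <- !is.na(df2[[col]])                              # only overwrite real corrections
  df1[rows[keep], col] <- df2[[col]][keep]
}
df1
This touches each shared column once, so it should stay cheap even with 100+ columns to update.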
I have the following data.tables (R code):
accounts <- fread("ACC_ID | DATE | RATIO | VALUE
1 | 2017-12-31 | 2.00 | 8
2 | 2017-12-31 | 2.00 | 12
3 | 2017-12-31 | 6.00 | 20
4 | 2017-12-31 | 1.00 | 5 ", sep='|')
timeline <- fread(" DATE
2017-12-31
2018-12-31
2019-12-31
2020-12-31", sep="|")
In R, I know I can join on DATE, by ACC_ID, RATIO and VALUE:
accounts[, .SD[timeline, on='DATE'], by=c('ACC_ID', 'RATIO', 'VALUE')]
This way, I can "project" ACC_ID, RATIO and VALUE values over timeline dates, getting the following data table:
ACC_ID | RATIO | VALUE | DATE
1      | 2     | 8     | 2017-12-31
2      | 2     | 12    | 2017-12-31
3      | 6     | 20    | 2017-12-31
4      | 1     | 5     | 2017-12-31
1      | 2     | 8     | 2018-12-31
2      | 2     | 12    | 2018-12-31
3      | 6     | 20    | 2018-12-31
4      | 1     | 5     | 2018-12-31
1      | 2     | 8     | 2019-12-31
2      | 2     | 12    | 2019-12-31
3      | 6     | 20    | 2019-12-31
4      | 1     | 5     | 2019-12-31
1      | 2     | 8     | 2020-12-31
2      | 2     | 12    | 2020-12-31
3      | 6     | 20    | 2020-12-31
4      | 1     | 5     | 2020-12-31
I've been trying hard to find something similar in PySpark, but I haven't been able to. What would be the appropriate way to solve this?
Thanks very much for your time. I greatly appreciate any help you can give; this one is important to me.
It looks like you're trying to do a cross join?
# assuming the DataFrames are registered as SQL views first:
# accounts.createOrReplaceTempView('accounts')
# timeline.createOrReplaceTempView('timeline')
spark.sql('''
select ACC_ID, RATIO, VALUE, timeline.DATE
from accounts cross join timeline
''')
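For reference, the DataFrame API gives the same result without registering views: something like accounts.drop('DATE').crossJoin(timeline). crossJoin has been available since Spark 2.1; on older versions an implicit comma join may also require spark.sql.crossJoin.enabled=true.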
I have a database like this:
ID   | familysize | age | gender
-----+------------+-----+-------
1001 | 4 | 26 | 1
1001 | 4 | 38 | 2
1001 | 4 | 30 | 2
1001 | 4 | 7 | 1
1002 | 3 | 25 | 2
1002 | 3 | 39 | 1
1002 | 3 | 10 | 2
1003 | 5 | 60 | 1
1003 | 5 | 50 | 2
1003 | 5 | 26 | 2
1003 | 5 | 23 | 1
1003 | 5 | 20 | 1
1004 | ....
I want to order this dataframe by the age of the people within each ID, so I use this command:
library(plyr)
b2 <- ddply(b, "ID", function(x) head(x[order(x$age, decreasing = TRUE), ], ))
but when I use this command I lose some of the observations. What should I do to order this dataframe?
b2 <- b[order(b$ID, -b$age), ]
should do the trick.
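Note that the minus sign works because age is numeric; for a descending sort on a non-numeric column, the base R idiom documented in ?order is order(b$ID, -xtfrm(b$age)).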
The arrange function in plyr does a great job here: order by ID, and after that by age in descending order.
arrange(b, ID, desc(age))
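The same call also works in dplyr (plyr's successor), if that is already part of your workflow; a minimal sketch:
library(dplyr)
# sort by ID ascending, then by age descending within each ID
b2 <- b %>% arrange(ID, desc(age))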