I would like to add a column repeating the numbers 1 to 577 to a dataframe of over 15,000 rows.
This is the dataframe:
> head(corr_dat_cond1)
participant search_difficulty key_resp.corr key_resp.rt target_position distractor1_colour
1 1010 difficult 1 1.0820000 left [0.82,0.31,0]
2 1010 no_search 1 0.5400000 left [-1,-1,-1]
3 1010 difficult 1 0.5119998 down [0.82,0,0.31]
4 1010 no_search 1 0.7079999 right [-1,-1,-1]
5 1010 difficult 1 1.0249999 up [0.82,0.31,0]
6 1010 no_search 1 0.4889998 left [-1,-1,-1]
distractor2_colour non_target_colour non_target_pos cue_uposition target_char non_target_char cue_time
1 [0.82,0,0.31] [0.82,0.31,0] [0.328,0] up = x 1.1
2 [-1,-1,-1] [-1,-1,-1] [0.328,0] right x = 1.2
3 [0.82,0.31,0] [0.82,0,0.31] [0.328,0] down x = 1.0
4 [-1,-1,-1] [-1,-1,-1] [0,0.328] left = x 1.4
5 [0.82,0,0.31] [0.82,0.31,0] [0,-0.328] left x = 1.4
6 [-1,-1,-1] [-1,-1,-1] [0,-0.328] up x = 1.0
cue_colour n cue_validity mrt stdev low_cutoff high_cutoff cond trial_num
1 Mismatch (Onset) cue 577 FALSE 0.7639095 0.2481090 0.0195825 1.5082365 1 1
2 Mismatch (Onset) cue 577 FALSE 0.5530880 0.1243826 0.1799402 0.9262358 1 2
3 Mismatch (Onset) cue 577 TRUE 0.7639095 0.2481090 0.0195825 1.5082365 1 3
4 Match (Color) cue 577 FALSE 0.5530880 0.1243826 0.1799402 0.9262358 1 4
5 Match (Color) cue 577 FALSE 0.7639095 0.2481090 0.0195825 1.5082365 1 5
6 Mismatch (Onset) cue 577 FALSE 0.5530880 0.1243826 0.1799402 0.9262358 1 6
The trial_num column is my initial attempt at adding a column of sequential numbers. This is the code I used:
corr_dat_cond1$trial_num <- 1:nrow(corr_dat_cond1)
However, I'd like the numbers to repeat every 577 rows instead of counting all the way up to the number of rows in the dataframe.
Any help would be appreciated! Thank you.
You can use the rep_len function:
corr_dat_cond1$trial_num <- rep_len(1:577, nrow(corr_dat_cond1))
This is the same as calling rep with length.out specified.
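Spelled out with rep() itself, for comparison; length.out truncates the last cycle so the result is exactly as long as the dataframe:
corr_dat_cond1$trial_num <- rep(1:577, length.out = nrow(corr_dat_cond1))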
I have one dataframe that contains my results and another dataframe that contains, for example, only values. I want to add a new column to the first dataframe that holds data from the second dataframe. However, the second dataframe neither has a tidy format nor the same rows as the first one; the position of the value I want to pull from the second dataframe is given in two columns of the first dataframe.
library(tidyverse)
df1 <- data.frame(Row_no=c(1,2,3,4, 1,2,3,4), Col_no=c(1,1,2,2,3,3,4,4), Size=c(sample(200:300, 8)))
> df1
Row_no Col_no Size
1 1 1 226
2 2 1 208
3 3 2 297
4 4 2 211
5 1 3 209
6 2 3 296
7 3 4 273
8 4 4 261
df2=cbind(rnorm(8), rnorm(8), rnorm(8), rnorm(8), rnorm(8), rnorm(8), rnorm(8), rnorm(8))
> df2
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] *1.4568994* -0.3324945 *-0.2885171* -0.79393545 -0.02439371 1.4216918 0.07288639 -0.2441228
[2,] *0.3648703* 0.7494033 *-0.9974556* -0.33820023 -0.30235757 1.5094486 -0.10982881 1.9349127
[3,] 0.5044991 *1.2208453* -0.8748034 *-0.86325341* 0.10462120 -0.3674390 -0.04107733 1.1815123
[4,] -1.2792906 *0.7408320* -0.2711479 *-0.07350530* -0.92132461 -0.7753123 0.99841815 1.5802167
[5,] -0.8801507 0.2580448 0.3099108 0.66716720 -0.01144132 -0.9353671 0.44608715 -0.6729589
[6,] 0.4809844 0.6349390 1.9900160 0.62358533 0.35075449 2.4124712 -1.45171943 0.4409148
[7,] -0.5146914 0.9115070 -0.3971806 -0.06477066 0.46028331 0.7067722 -0.44562194 1.9545829
[8,] -0.4299626 1.8211741 0.3272991 0.06177976 1.25383361 -0.7770162 -0.49841279 0.5098795
The desired result would be something like the following (I put asterisks around the values in df2 to show which ones I want):
Row_no Col_no Size Value
1 1 1 226 1.4568994
2 2 1 208 0.3648703
3 3 2 297 1.2208453
4 4 2 211 0.7408320
5 1 3 209 -0.2885171
6 2 3 296 -0.9974556
7 3 4 273 -0.86325341
8 4 4 261 -0.07350530
However, when I try to run the code
df1%>%
mutate(value=df2[Row_no, Col_no])
I get the error message
Error: Column `value` must be length 8 (the number of rows) or one, not 64
which is to be expected. However, when I try to index the positions individually, I get
df1%>%
mutate(value=df2[Row_no[1], Col_no[1]])
Row_no Col_no Size value
1 1 1 226 1.456899
2 2 1 208 1.456899
3 3 2 297 1.456899
4 4 2 211 1.456899
5 1 3 209 1.456899
6 2 3 296 1.456899
7 3 4 273 1.456899
8 4 4 261 1.456899
> df1%>%
+ mutate(value[1]=df2[Row_no[1], Col_no[1]])
Error: Unexpected '=' in:
"df1%>%
mutate(value[1]="
So how would I get my desired result? I would prefer to have a tidy solution. Also, the given example is just a minimal reproducible example; my real files are really large, which is why I need a clean solution that scales.
Thanks!
Thanks to @Yuriy Barvinchenko, I was able to figure out a solution:
df1%>%
mutate(value=df2[cbind(Row_no, Col_no)])
> df1%>%
+ mutate(value=df2[cbind(Row_no, Col_no)])
Row_no Col_no Size value
1 1 1 226 1.4568994
2 2 1 208 0.3648703
3 3 2 297 1.2208453
4 4 2 211 0.7408320
5 1 3 209 -0.2885171
6 2 3 296 -0.9974556
7 3 4 273 -0.8632534
8 4 4 261 -0.0735053
The important part was the cbind() in the indexing brackets.
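For context: when the index inside [ is a two-column matrix, R treats each row of that matrix as a (row, column) pair and returns one element per pair. A tiny sketch with a throwaway matrix m (not from the question):
m <- matrix(1:9, nrow = 3)
m[cbind(c(1, 3), c(2, 1))]  # picks m[1, 2] and m[3, 1], i.e. 4 3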
Based on the answer here:
df1$value <- with( df1, df2[ cbind(Row_no, Col_no) ] )
Using purrr::pmap:
df1$Value <- unlist(pmap(list(df1$Row_no, df1$Col_no, list(df2)), ~ ..3[..1,..2]))
and with piping (using pmap_dbl, which returns a plain numeric vector rather than a list column, so no unlist is needed):
df1 %>%
  mutate(Value = pmap_dbl(list(Row_no, Col_no, list(df2)), ~ ..3[..1, ..2]))
The problem is that when you try mutate(value=df2[Row_no, Col_no]), you are actually generating a square matrix of length(Row_no) * length(Col_no) elements, equivalent to df2[df1$Row_no, df1$Col_no]. When you think about it, this is a stack of the 8 "correct" rows, where the correct columns are numbered 1 to 8. The correct elements can therefore be found at [1, 1], [2, 2], [3, 3]...[n, n], i.e. the diagonal of the matrix. One way to get these into a single column is to multiply the matrix by the identity matrix and take the row sums.
I have replicated your random data here to give a complete solution that matches your example.
library(tidyverse)
df1 <- data.frame(Row_no = rep(1:4, 2),
Col_no = rep(1:4, each = 2),
Size = c(sample(200:300, 8)))
df2 <- cbind(c( 1.4568994, -0.3324945, -0.2885171, -0.79393545,
-0.02439371, 1.4216918, 0.07288639, -0.2441228),
c( 0.3648703, 0.7494033, -0.9974556, -0.33820023,
-0.30235757, 1.5094486, -0.10982881, 1.9349127),
c( 0.5044991, 1.2208453, -0.8748034, -0.86325341,
0.10462120, -0.3674390, -0.04107733, 1.1815123),
c(-1.2792906, 0.7408320, -0.2711479, -0.07350530,
-0.92132461, -0.7753123, 0.99841815, 1.5802167),
c(-0.8801507, 0.2580448, 0.3099108, 0.66716720,
-0.01144132, -0.9353671, 0.44608715, -0.6729589),
c( 0.4809844, 0.6349390, 1.9900160, 0.62358533,
0.35075449, 2.4124712, -1.45171943, 0.4409148),
c(-0.5146914, 0.9115070, -0.3971806, -0.06477066,
0.46028331, 0.7067722, -0.44562194, 1.9545829),
c(-0.4299626, 1.8211741, 0.3272991, 0.06177976,
1.25383361, -0.7770162, -0.49841279, 0.5098795))
df1 %>% mutate(value = rowSums(df2[Row_no, Col_no] * diag(8))) %>% print
# Row_no Col_no Size value
# 1 1 1 267 1.4568994
# 2 2 1 283 0.3648703
# 3 3 2 259 1.2208453
# 4 4 2 235 0.7408320
# 5 1 3 212 -0.2885171
# 6 2 3 263 -0.9974556
# 7 3 4 251 -0.8632534
# 8 4 4 200 -0.0735053
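One caveat: diag(8) is tied to this example's eight rows, so real data would need diag(nrow(df1)), and building and multiplying an n x n matrix gets expensive as n grows. The cbind() indexing in the accepted answer picks out the same diagonal elements without ever materialising the square matrix.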
I want to sort a data frame by the values of a column (the first column, called Initial). This is my data frame, which I called t2:
Initial Final Changes
1 1 200
1 3 500
3 1 250
24 25 175
21 25 180
1 5 265
3 3 147
I am trying this code:
t2 <- t2[order(t2$Initial, t2$Final, decreasing=False),]
But the result is of this type:
Initial Final Changes
3 1 250
3 3 147
21 25 180
24 25 175
1 5 265
1 1 200
1 3 500
And when I try this code:
t2 <- t2[order(t2$Initial, t2$Final, decreasing=TRUE),]
The result is:
Initial Final Changes
1 5 265
1 1 200
1 3 500
24 25 175
21 25 180
3 1 250
3 3 147
I don't understand what is happening. Can you help me, please?
It is possible that the columns are factors; in that case, convert them to numeric and it should work:
library(dplyr)
t2 %>%
arrange_at(1:2, ~ desc(as.numeric(as.character(.))))
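As a side note, assuming dplyr >= 1.0 (where the arrange_at() family is superseded), the same idea can be written with across(); a sketch:
t2 %>%
  arrange(across(1:2, ~ desc(as.numeric(as.character(.)))))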
Or with base R
t2[1:2] <- lapply(t2[1:2], function(x) as.numeric(as.character(x)))
t2[do.call(order, c(t2[1:2], decreasing = TRUE)), ]
Or the OP's code should work as well. Note the decreasing = False in the first option the OP tried (maybe a typo): in R the logical constant is upper case, FALSE.
t2[order(t2$Initial, t2$Final, decreasing=FALSE),]
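To see why factors cause this, a small sketch with made-up data, assuming Initial was read in as text and became a factor:
x <- factor(c("1", "3", "24", "21"))
levels(x)                    # "1" "21" "24" "3" -- lexicographic, not numeric
as.numeric(x)                # 1 4 3 2 -- the internal level codes
as.numeric(as.character(x))  # 1 3 24 21 -- the safe conversion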
I have a time-series data.frame where all values are stacked below each other, and on every date there are several cases that come back regularly. Based on the time series I am adding a column with some calculations. These calculations are case-specific, and for them I need the value of the previous date of that case. I have no idea which function to use. Can anybody point me to a function or an example somewhere on the net? Thanks!!
To be clear, this is what I mean. On date 1 the old value (before the score) for case 'a' is 1200. Based on the score of 1, the new value becomes 1250. On date 2 I want this new value, 1250, placed in the column 'old_value' (and then some calculations are done to arrive at the new value, which in turn has to be placed in the old_value column on date 4, et cetera). The same for case 'b': the new value after the score on date 1 is 1190 and has to be placed in the correct row on date 3 (there is no case 'b' on date 2), et cetera, for thousands of cases and dates.
date name_case score old_value new_value
1 a 1 1200 1250
1 b 2 1275 1190
1 c 1 1300 1310
2 a 3 1250
2 c 1 1310
3 B 1 1190
Maybe this will do it. Assuming that we start with:
> dat
date name_case score old_value new_value
1 1 a 1 1200 1250
2 1 b 2 1275 1190
3 1 c 1 1300 1310
4 2 a 3 NA NA
5 2 c 1 NA NA
6 3 b 1 NA NA # note: capitalisation fixed (B -> b)
And then make a subset with values for new_value:
dat1 <- dat[ !is.na(dat$old_value), ]
And then replace the NA old_values with the new_values from that subset by matching on name_case:
dat[ is.na(dat$old_value) , "old_value" ] <-
dat1$new_value[ match(dat[ is.na(dat$old_value) ,"name_case" ],
dat1$name_case)]
match generates a numeric vector that is used to index the new_values.
> dat
date name_case score old_value new_value
1 1 a 1 1200 1250
2 1 b 2 1275 1190
3 1 c 1 1300 1310
4 2 a 3 1250 NA
5 2 c 1 1310 NA
6 3 b 1 1190 NA
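To see what match() contributes here: with the corrected code, dat1$name_case is c("a", "b", "c") and the rows still missing old_value are for cases a, c, b, so
match(c("a", "c", "b"), dat1$name_case)  # 1 3 2
dat1$new_value[c(1, 3, 2)]               # 1250 1310 1190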
My problem has to do with finding row differences in a data frame by group. I've tried to do this a few ways. Here's an example. The real data set is several million rows long.
set.seed(314)
df = data.frame("group_id"=rep(c(1,2,3),3),
"date"=sample(seq(as.Date("1970-01-01"),Sys.Date(),by=1),9,replace=F),
"logical_value"=sample(c(T,F),9,replace=T),
"integer"=sample(1:100,9,replace=T),
"float"=runif(9))
df = df[order(df$group_id,df$date),]
I ordered it by group_id and date so that the diff function can find the sequential differences, which results in time-ordered differences of the logical, integer, and float variables. I could easily do some sort of apply(df,2,diff), but I need it by group_id; applied to the whole frame, it would also produce unneeded differences across group boundaries.
df
group_id date logical_value integer float
1 1 1974-05-13 FALSE 4 0.03472876
4 1 1979-12-02 TRUE 45 0.24493995
7 1 1980-08-18 TRUE 2 0.46662253
5 2 1978-12-08 TRUE 56 0.60039164
2 2 1981-12-26 TRUE 34 0.20081799
8 2 1986-05-19 FALSE 60 0.43928929
6 3 1983-05-22 FALSE 25 0.01792820
9 3 1994-04-20 FALSE 34 0.10905326
3 3 2003-11-04 TRUE 63 0.58365922
So I thought I could break up my data frame into chunks by group_id, and pass each chunk into a user defined function:
create_differences = function(data_group){
apply(data_group, 2, diff)
}
But I get errors using the code:
diff_df = lapply(split(df,df$group_id),create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
by(df,df$group_id,create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
As a side note, the data is nice, no NAs, nulls, blanks, and every group_id has at least 2 rows associated with it.
Edit 1: User alexis_laz correctly pointed out that my function needs to be sapply(data_group, diff); apply() first coerces each chunk to a matrix (a character matrix here, because of the Date and logical columns), which is what triggers the non-numeric errors above.
Using this edit, I get a list of data frames (one list entry per group).
Edit 2:
The expected output would be a combined data frame of differences. Ideally, I would like to keep the group_id, but if not, it's not a big deal. Here is what the sample output should be like:
diff_df
group_id date logical_value integer float
[1,] 1 2029 1 41 0.2102112
[2,] 1 260 0 -43 0.2216826
[1,] 2 1114 0 -22 -0.3995737
[2,] 2 1605 -1 26 0.2384713
[1,] 3 3986 0 9 0.09112507
[2,] 3 3485 1 29 0.47460596
Given that you have millions of rows, I think you can move to data.table, which is well suited to by-group operations.
library(data.table)
DT <- as.data.table(df)
## this will order by group and by date
setkeyv(DT,c('group_id','date'))
## apply diff to every column, by group
DT[,lapply(.SD,diff),group_id]
# group_id date logical_value integer float
# 1: 1 2029 days 1 41 0.21021119
# 2: 1 260 days 0 -43 0.22168257
# 3: 2 1114 days 0 -22 -0.39957366
# 4: 2 1604 days -1 26 0.23847130
# 5: 3 3987 days 0 9 0.09112507
# 6: 3 3485 days 1 29 0.47460596
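The date column comes back as a difftime, hence the "days" suffix. If plain numbers are wanted, as in the expected output, a small variation converts inside the call:
DT[, lapply(.SD, function(x) as.numeric(diff(x))), by = group_id]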
It certainly won't be as quick as data.table, but below is an only slightly ugly base solution using aggregate:
result <- aggregate(. ~ group_id, data=df, FUN=diff)
result <- cbind(result[1],lapply(result[-1], as.vector))
result[order(result$group_id),]
# group_id date logical_value integer float
#1 1 2029 1 41 0.21021119
#4 1 260 0 -43 0.22168257
#2 2 1114 0 -22 -0.39957366
#5 2 1604 -1 26 0.23847130
#3 3 3987 0 9 0.09112507
#6 3 3485 1 29 0.47460596
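For completeness, here is a base sketch that finishes the OP's own split/sapply approach from Edit 1 by binding the per-group pieces back together (group_id comes back as character, and at least 2 rows per group are assumed, as stated):
diff_list <- lapply(split(df, df$group_id), function(g) sapply(g[-1], diff))
diff_df <- data.frame(group_id = rep(names(diff_list), sapply(diff_list, nrow)),
                      do.call(rbind, diff_list))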
I came across a lot of posts asking about window joins (rolling median, rolling regression, and so on). Since data.table 1.8.8 and its roll parameter, my understanding is that we can do those things. Say we have X and Y keyed on the same columns, say x, y and a time column; we want, for each row of X,
all the rows of Y where the (x, y) of Y match those of X AND where Y's time u is in [X$t - w1, X$t + w2].
Here is an example with (w1,w2)=(1,5)
library(data.table)
A <- data.table(x=c(1,1,1,2,2),y=c(F,F,T,T,T),t=c(407,286,788,882,942),key='x,y,t')
X <- copy(A)
Y <- data.table(x=c(1,1,1,2,2,2,2),y=c(F,F,T,T,T,T,T),u=c(417,285,788,882,941,942,945),IDX=1:7,key='x,y,u')
R) X
x y t
1: 1 FALSE 286
2: 1 FALSE 407
3: 1 TRUE 788
4: 2 TRUE 882
5: 2 TRUE 942
R) Y
x y u IDX
1: 1 FALSE 285 2 # match line 1 as (x,y) ok and 285 in [286-1,286+5]
2: 1 FALSE 417 1 # match no line as (x,y) ok against X[c(1,2),] but 417 is too big
3: 1 TRUE 788 3 # match row 3
4: 2 TRUE 882 4 # match row 4
5: 2 TRUE 941 5 # match row 5
6: 2 TRUE 942 6 # match row 5
7: 2 TRUE 945 7 # match row 5
We cannot do Y[setkey(X[,list(x,y,t)],x,y,t),roll=1] because if we have a perfect match on (x,y,t), data.table will discard the potential partial matches with Y$u in [X$t - w1, X$t[.
#get the lower bounds and upper bounds for t
X[,`:=`(lowT=t-1,upT=t+5)]
#we get the first line where Y$u >= X$t-1 but Y$u <= X$t+5
X <- setnames(copy(Y),c('u','IDX'),c('lowT','lowIDX'))[setkey(X,x,y,lowT),roll=-6,rollends=T]
#we get the last line where Y$u <= X$t+5 ...
X <- setnames(copy(Y),c('u','IDX'),c('upT','upIDX'))[setkey(X,x,y,upT),roll=6]
#we get the matching IDX
X[!is.na(lowIDX) & !is.na(upIDX), allIDX:=mapply(`seq`,from=lowIDX,to=upIDX)]
R) X
x y upT upIDX lowT lowIDX t allIDX
1: 1 FALSE 291 2 285 2 286 2
2: 1 FALSE 412 NA 406 NA 407
3: 1 TRUE 793 3 787 3 788 3
4: 2 TRUE 887 4 881 4 882 4
5: 2 TRUE 947 7 941 5 942 5,6,7
My questions are:
Am I correct to think that window joins could not be achieved easily before roll?
Can we solve the problem if we want Y$u in ]X$t - w1, X$t + w2[, i.e. an open interval (no longer a compact set)?
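For later readers: data.table 1.9.8 added non-equi joins, which express this window join directly (this postdates the question, so check your version). A sketch, starting from the original X and Y above:
X[, `:=`(lowT = t - 1, upT = t + 5)]
# one output row per matching (X row, Y row) pair; IDX is NA where nothing matches
Y[X, on = .(x, y, u >= lowT, u <= upT), .(x, y, t = i.t, IDX)]
For the open interval in the second question, strict inequalities do it:
Y[X, on = .(x, y, u > lowT, u < upT), .(x, y, t = i.t, IDX)]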