Sort list on numeric values stored as factor - r

I have 4 data frames with data from different experiments, where each row represents a trial. The participant's ID (SID) is stored as a factor. Each of the data frames looks like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the IDs of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant IDs (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...

If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial),] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
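To illustrate why the attempts in the question return small integers, here is a minimal sketch. The vectors sid1 and sid2 are made-up stand-ins for df1$SID and df2$SID, and sorted_ids is a hypothetical helper name, not something from the original post. Calling as.numeric() directly on a factor returns the internal level codes, so the IDs have to go through as.character() first.

```r
# Hypothetical stand-ins for df1$SID and df2$SID: IDs stored as factors
sid1 <- factor(c("5402", "5402", "25403", "25403", "902"))
sid2 <- factor(c("22081", "310", "22081", "22069", "310"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(unique(sid1))

# Convert through character first to recover the actual ID values,
# then sort them as numbers (hypothetical helper)
sorted_ids <- function(f) sort(as.numeric(as.character(unique(f))))

# Works as a data frame only because both experiments have 3 unique IDs here
ids <- data.frame(Exp1 = sorted_ids(sid1), Exp2 = sorted_ids(sid2))
print(ids)
```

If the experiments have different numbers of participants, keeping the result as a list (for example list(Exp1 = sorted_ids(sid1), Exp2 = sorted_ids(sid2))) avoids the equal-length requirement of data.frame().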

Suppose x and y represent the Exp1 SID and Exp2 SID. You can create an ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = FALSE),exp2=sort(x = unique(y),decreasing = FALSE))

Related

Multiply various subsets of a data frame by different elements of a vector R

I have a data frame:
df<-data.frame(id=rep(1:10,each=10),
Room1=rnorm(100,0.4,0.5),
Room2=rnorm(100,0.3,0.5),
Room3=rnorm(100,0.7,0.5))
And a vector:
vals <- sample(7:100, 10)
I want to multiply cols Room1, Room2 and Room3 by a different element of the vector for every unique ID number and output a new data frame (df2).
I managed to multiply each column per id by EVERY element of the vector using the following:
samp_func <- function(x) {
x*vals[i]
}
for (i in vals) {
df2 <- df %>% mutate_at(c("Room1", "Room2", "Room3"), samp_func)
}
But in the resulting data frame (df2), each Room column is multiplied by the same element of the vector (vals) for all of the different ids. What I want is for each Room column to be multiplied by a different element of vals for each id. Sorry in advance if this is not clear; I am a beginner and still getting to grips with the terminology.
Thanks!
EDIT: The desired output should look like the below, where the columns for each ID have been multiplied by a different element of the vector vals.
id Room1 Room2 Room3
1 1 24.674826880 60.1942571 46.81276141
2 1 21.970270107 46.0461779 35.09928150
3 1 26.282357614 -3.5098880 38.68400541
4 1 29.614182061 -39.3025587 25.09146592
5 1 33.030886472 46.0354881 42.68209027
6 1 41.362699668 -23.6624632 26.93845129
7 1 5.429031042 26.7657577 37.49086963
8 1 18.733422977 -42.0620572 23.48992138
9 1 -17.144070723 9.9627315 55.43999326
10 1 45.392182468 20.3959968 -16.52166621
11 2 30.687978299 -11.7194020 27.67351631
12 2 -4.559185345 94.9256561 9.26738357
13 2 86.165076849 -1.2821515 29.36949423
14 2 -12.546711562 47.1763755 152.67588456
15 2 18.285856423 60.5679496 113.85971720
16 2 72.074929648 47.6509398 139.69051486
17 2 -12.332519694 67.8890324 20.73189965
18 2 80.889634991 69.5703581 98.84404415
19 2 87.991093995 -20.7918559 106.13610773
20 2 -2.685594148 71.0611693 47.40278949
21 3 4.764445589 -7.6155681 12.56546664
22 3 -1.293867841 -1.1092243 13.30775785
23 3 16.114831628 -5.4750642 8.58762550
24 3 -0.309470950 7.0656088 10.07624289
25 3 11.225609780 4.2121241 16.59168866
26 3 -3.762529113 6.4369973 15.82362705
27 3 -5.103277731 0.9215625 18.20823042
28 3 -10.623165177 -5.2896293 33.13656839
29 3 -0.002517872 5.0861361 -0.01966699
30 3 -2.183752881 24.4644310 13.55572730
This should solve your problem. You can build a lookup data frame that pairs each id with its own value from vals, join it to df on id, and then use mutate to overwrite the Room columns with the multiplied values.
Also, in the future I'd recommend setting a seed when asking questions with random data as it's easier for someone to replicate your output.
library(dplyr)
set.seed(0)
df<-data.frame(id=rep(1:10,each=10),
Room1=rnorm(100,0.4,0.5),
Room2=rnorm(100,0.3,0.5),
Room3=rnorm(100,0.7,0.5))
vals <- sample(7:100, 10)
# one row per id, each paired with its own element of vals
other_df <- data.frame(id = 1:10, val = vals)
df2 <- inner_join(other_df, df, by = "id")
df2 <- df2 %>%
mutate(Room1 = Room1*val,
Room2 = Room2*val,
Room3 = Room3*val)
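As an alternative sketch that avoids the join, the multiplier can be looked up by position. This assumes the ids are the integers 1 to 10, so that vals[df$id] picks the element of vals belonging to each row; the names rooms and multiplier are made up for illustration.

```r
set.seed(0)
df <- data.frame(id = rep(1:10, each = 10),
                 Room1 = rnorm(100, 0.4, 0.5),
                 Room2 = rnorm(100, 0.3, 0.5),
                 Room3 = rnorm(100, 0.7, 0.5))
vals <- sample(7:100, 10)

rooms <- c("Room1", "Room2", "Room3")

# vals[df$id] expands vals to one multiplier per row (id 1 -> vals[1], etc.)
multiplier <- vals[df$id]

# Multiply each Room column by the per-row multiplier, leaving id untouched
df2 <- df
df2[rooms] <- lapply(df[rooms], function(col) col * multiplier)
head(df2)
```

If the ids were not consecutive integers, a named vector or match(df$id, unique_ids) would be needed to do the lookup instead.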

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column data frame recording the bilateral trade data between 161 countries. The data are in dyadic format with 19687 rows and three columns: reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year. rid or pid takes a value from 1 to 161, and a country is assigned the same rid and pid. For any given pair of (rid, pid) in which rid != pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
The data were sourced from the UN Comtrade database. Each rid is paired with multiple pid values to get their bilateral trade data, but not every pid has a numeric id, because I only assigned a rid or pid to a country if a list of relevant economic indicators is available for that country. That is why there are NAs in the data even though a TradeValue exists between that country and the reporting country (rid). The same applies when a country becomes a "reporter": if that country did not report any TradeValue with partners, its id number is absent from the rid column. (Hence the rid column begins with 2, because country 1, i.e. Afghanistan, did not report any bilateral trade data with partners.) A quick check with summary statistics helps confirm this:
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. Hence, I want to preserve the complete list of 161 countries and transform this example_data data frame into a 161 x 161 adjacency matrix in which:
for those countries that are absent from the rid column (e.g., rid == 1), a row is created for each of them and the entire row (in the 161 x 161 matrix) is set to 0;
for those countries (pid) that do not share TradeValue entries with a particular rid, the corresponding cells are set to 0.
For example, suppose that in a 5 x 5 adjacency matrix, country 1 did not report any trade statistics with partners, and the other four reported their bilateral trade statistics with each other (but not with country 1). The original data frame looks like this:
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
I want to convert it to a 5 x 5 adjacency matrix (as a data.frame). The desired output should look like this:
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
I then want to use the same method on example_data to create a 161 x 161 adjacency matrix. However, after some trial and error with reshape and other methods, I still could not get the conversion to work, not even the first step.
I would really appreciate it if anyone could enlighten me on this.
I cannot read the dropbox file but have tried to work off of your 5-country example dataframe -
library(reshape2) # provides dcast()
country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
# use the first missing pid, or 1 as a placeholder if no pid is missing
pid_miss = ifelse(length(setdiff(1:country_num, example_data$pid)) == 0,
1, setdiff(1:country_num, example_data$pid))
# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular matrix
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?
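For comparison, here is a base R sketch of the same idea using matrix indexing instead of dcast. It is an alternative under stated assumptions, not the method from the answer above: the edges data frame is typed in from the 5-country example, with one extra made-up row whose pid is NA to mimic the partners without an id, and n is the assumed total number of countries.

```r
# Toy edge list from the 5-country example, plus one made-up row with an NA partner
edges <- data.frame(
  rid = c(2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 2),
  pid = c(3, 4, 5, 2, 4, 5, 2, 3, 5, 2, 3, 4, NA),
  TradeValue = c(223, 13, 9, 223, 57, 28, 13, 57, 82, 9, 28, 82, 999)
)
n <- 5  # total number of countries (161 for the full data)

# Start from an all-zero n x n matrix so absent reporters/partners stay 0
adj <- matrix(0, nrow = n, ncol = n)

# Keep rows where both ids are present, then fill cells by (row, column) index
ok <- complete.cases(edges[, c("rid", "pid")])
adj[cbind(edges$rid[ok], edges$pid[ok])] <- edges$TradeValue[ok]

adj_df <- as.data.frame(adj)  # columns are named V1..V5 by default
print(adj_df)
```

Because the matrix starts at 0, countries that never appear as rid keep an all-zero row without any dummy rows having to be appended.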

Merge with replacement based on multiple non-unique columns

I have two data frames. The first one contains the original state of an image with all the data available to reconstruct the image from scratch (the entire coordinate set and their color values).
I then have a second data frame. This one is smaller and contains only data about the differences (the changes made) between the updated state and the original state. Sort of like video encoding with key frames.
Unfortunately I don't have a unique id column to help me match them. I have an x column and I have a y column which, combined, can make up a unique id.
My question is this: What is an elegant way of merging these two data sets, replacing the values in the original dataframe with the values in the "differenced" data frame whose x and y coordinates match.
Here's some example data to illustrate:
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
x y value
1 1 23 120
2 2 24 121
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 127
9 9 31 128
10 10 32 129
And the dataframe with updated differences:
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)
x y value
1 1 2 50
2 2 24 51
3 3 17 52
4 4 23 53
5 8 30 54
The desired final output should contain all the rows in the original data frame. However, the rows in original where the x and y coordinates both match the corresponding coordinates in update, should have their value replaced with the values in the update data frame. Here's the desired output:
original_updated <- data.frame(x = 1:10, y = 23:32,
value = c(120, 51, 122:126, 54, 128:129))
x y value
1 1 23 120
2 2 24 51
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 54
9 9 31 128
10 10 32 129
I've tried to come up with a vectorised solution with indexing for some time, but I can't figure it out. Usually I'd use %in% if it were just one column with unique ids. But the two columns are non unique.
One solution would be to treat them as strings or tuples and combine them to one column as a coordinate pair, and then use %in%.
But I was curious whether there were any solution to this problem involving indexing with boolean vectors. Any suggestions?
First merge in a way which guarantees all values from the original will be present:
merged = merge(original, update, by = c("x","y"), all.x = TRUE)
Then use dplyr to choose update's values where possible, and original's value otherwise:
library(dplyr)
middle = mutate(merged, value = ifelse(is.na(value.y), value.x, value.y))
final = select(middle, x, y, value)
Alternatively, the match function can be used to generate indices. It needs a nomatch argument to prevent NA on the left-hand side of [<-.data.frame. I don't think it is as transparent as a merge followed by a replacement, but I'm guessing it will be faster:
original[ match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)] ,
"value"] <-
update[ which( match(update$x, original$x) == match(update$y, original$y)),
"value"]
You can see the difference:
> match(update$x, original$x)[
match(update$x, original$x) ==
match(update$y, original$y) ]
[1] NA 2 NA 8
> match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)]
[1] 2 8
The "interior" match functions are returning:
> match(update$y, original$y)
[1] NA 2 NA 1 8
> match(update$x, original$x)
[1] 1 2 3 4 8
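The combined-key idea mentioned in the question can be sketched as follows. This is only one possible reading, and the names key_orig, key_upd, idx and hit are made up for illustration: each coordinate pair is pasted into a single string, and match() then gives the row of original that each update row refers to.

```r
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)

# Build one key per row from the coordinate pair
key_orig <- paste(original$x, original$y, sep = "_")
key_upd  <- paste(update$x, update$y, sep = "_")

# Row of `original` that each update row refers to (NA when the pair is absent)
idx <- match(key_upd, key_orig)
hit <- !is.na(idx)

# Overwrite only the matched rows; everything else keeps its original value
original_updated <- original
original_updated$value[idx[hit]] <- update$value[hit]
print(original_updated)
```

Unlike comparing two separate match() results, this keeps working when x or y values repeat across rows, as long as each (x, y) pair is unique in original.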

How to find the group of rows of a data frame where an error occurs

I have a two-column data frame containing thousands of IDs, where each ID has hundreds of data rows; in other words, a data frame of about 6 million rows. I group this data frame by ID (using either dplyr or data.table) and apply a "tso" (outlier detection) function to each group. The problem is that after hours of computation it returns an error related to the ARIMA specification of one of the IDs. How can I identify the ID (or the row number) where my function returns the error? (If I can detect it, I can remove that ID from the data frame.)
I tried to manually perform my function on subgroups of this dataframe however I cannot reach the erroneous ID because there are thousands of IDs so it takes me weeks to find them this way.
outlier.detection <- function(x,iter) {
y <- as.ts(x)
out2 <- tso(y,maxit.iloop = iter,tsmethod = "auto.arima",remove.method = "bottom-up",cval=3)
y[out2$outliers$ind] <- NA
return(y)}
df <- data.table(outlying1);setkey(df,id)
test <- df[,list(new.weight = outlier.detection(weight,iter=1)),by=id]
The above function finds the anomalies and replaces them with NAs. Here is an example:
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b 90
10 b 82
After the function runs, it will look like the following:
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b NA
10 b 82
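One way to locate the failing group, sketched under assumptions: wrap the per-group call in tryCatch() so an error is returned as a value instead of aborting the whole run. Here risky_fun is a hypothetical stand-in that simulates the failure (it is not tso), and the small data frame is the example from the question.

```r
# Hypothetical stand-in for the real outlier function: fails on one group
risky_fun <- function(x) {
  if (any(x > 85)) stop("simulated ARIMA specification error")
  x
}

df <- data.frame(
  ID = rep(c("a", "b"), each = 5),
  weight = c(50, 50, 51, 51.5, 52, 80, 81, 81.5, 90, 82)
)

# Run the function per ID; on error, return the condition object instead of stopping
results <- lapply(split(df$weight, df$ID), function(w) {
  tryCatch(risky_fun(w), error = function(e) e)
})

# IDs whose result is an error condition rather than data
failed <- vapply(results, inherits, logical(1), what = "error")
failed_ids <- names(results)[failed]
print(failed_ids)
```

The same tryCatch() wrapper could be placed inside outlier.detection itself, returning NA for the group on error, so that the data.table call finishes and the failing IDs can be inspected afterwards.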

Assigning logical value to values higher than given threshold for each case across each year

I have a data frame resembling the extract below:
set.seed(1)
smpl_df <- data.frame(year = c(1500:2011), case = LETTERS[1:4])
smpl_df$var_one <- sample(100, size = nrow(smpl_df), replace = TRUE)
I would like to add one more column to this data frame. The column should take the value 1 if the values in the column var_one were higher than a given threshold for all of the consecutive years represented in the data set. For example, in its present format the table looks like this:
head(smpl_df)
year case var_one
1 1500 A 27
2 1501 B 38
3 1502 C 58
4 1503 D 91
5 1504 A 21
6 1505 B 90
I would like to add a column to the data table (values for the new column are not right, just introduced as a way of example):
year case var_one var_one_higher_than_80_for_all_yrs_for_this_case
1 1500 A 27 0
2 1501 B 38 0
3 1502 C 58 0
4 1503 D 91 1
5 1504 A 21 0
6 1505 B 90 1
Edit
To add to the post, following the useful points made in the comments below: the long table that I'm currently working with could be obtained from the wide table below. In the example below, I added a column NewColumn that takes the value Yes if, for a given case, the value was higher than 2, and No if the value was lower than or equal to 2, for all the years. I want to achieve the same effect but on my long table (smpl_df).
Edit 2
Following the useful comments concerning the desired final output, my intention is to generate a column that would correspond to the last column in the table below.
An ifelse structure may be helpful:
smpl_df$var_one_higher <- ifelse("your func",1,0) # replace "your func" with your logical condition
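A minimal sketch of how that placeholder might be filled in, assuming a threshold of 80 (taken from the example column name) and assuming that "for all years" means every row of the same case. The column names above and all_above are made up for illustration.

```r
set.seed(1)
smpl_df <- data.frame(year = 1500:2011, case = LETTERS[1:4])
smpl_df$var_one <- sample(100, size = nrow(smpl_df), replace = TRUE)

threshold <- 80  # assumed threshold, taken from the example column name

# Row-wise flag: 1 when this row's value exceeds the threshold
smpl_df$above <- ifelse(smpl_df$var_one > threshold, 1, 0)

# Per-case flag: 1 only when every year of that case exceeds the threshold
smpl_df$all_above <- ave(smpl_df$above, smpl_df$case,
                         FUN = function(v) as.numeric(all(v == 1)))
head(smpl_df)
```

ave() applies the function within each case and writes the result back to every row of that case, so the flag lines up with the long table without a separate merge.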
