How to sort a data frame by a column?

I want to sort a data frame by the values of a column (the first column, called Initial). My data frame, which I called t2, looks like this:
Initial Final Changes
1 1 200
1 3 500
3 1 250
24 25 175
21 25 180
1 5 265
3 3 147
I tried this code:
t2 <- t2[order(t2$Initial, t2$Final, decreasing=False),]
But the result comes out like this:
Initial Final Changes
3 1 250
3 3 147
21 25 180
24 25 175
1 5 265
1 1 200
1 3 500
And when I try this code:
t2 <- t2[order(t2$Initial, t2$Final, decreasing=TRUE),]
The result is:
Initial Final Changes
1 5 265
1 1 200
1 3 500
24 25 175
21 25 180
3 1 250
3 3 147
I don't understand what is happening.
Can you help me, please?

It is possible that the columns are factors; in that case, convert them to numeric and it should work:
library(dplyr)
t2 %>%
  arrange_at(1:2, ~ desc(as.numeric(as.character(.))))
Or with base R
t2[1:2] <- lapply(t2[1:2], function(x) as.numeric(as.character(x)))
t2[do.call(order, c(t2[1:2], decreasing = TRUE)), ]
Or the OP's code should work as well. Note that the first option the OP tried has decreasing = False (probably a typo); in R, the logical constant is upper case, FALSE:
t2[order(t2$Initial, t2$Final, decreasing=FALSE),]
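As an illustration (my own sketch, assuming Initial and Final were read in as factors), this is why factor columns sort in an unexpected order: order() ranks a factor by its levels, and character levels are sorted lexicographically.
x <- factor(c("1", "1", "3", "24", "21", "1", "3"))
levels(x)                  # "1" "21" "24" "3": character levels sort lexicographically
x[order(x)]                # 1 1 1 21 24 3 3: ordered by level, not by numeric value
sort(as.numeric(as.character(x)))  # 1 1 1 3 3 21 24: the intended numeric order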

Related

Mapping dataframe column values to an n by n matrix

I'm trying to map column values of a data.frame object (consisting of a large number of bilateral trade records among 161 countries) to a 161 x 161 adjacency matrix (also of data.frame class) such that each cell represents the dyadic trade flow between two countries.
The data looks like this
# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is the reporter id and pid is the (trade) partner id; a country's rid and pid are the same. Each id in the rid column is matched with multiple rows in the pid column, each carrying a TradeValue.
However, there are some problems with this data. First, countries (usually developing countries) that did not report trade statistics have no data to extract, so their ids are absent from the rid column (such as country 1). On the other hand, those ids may still enter the pid column through other countries' reporting (in which case the reporters tend to be developed countries). Hence, the rid column contains only some of the country ids (139 out of 161), while the pid column has all 161.
What I'm attempting to do is map this example_data data frame to a 161 x 161 adjacency matrix, using rid for rows and pid for columns, where each cell represents the TradeValue between two country ids. To this end, there are a couple of things I need to tackle:
Fill in the country ids that are missing from the rid column of example_data and, temporarily, set all cell values in their rows to 0.
After the previous step, impute those "0" cells using the bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave the "0" cells as they are.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
pid_1 pid_2 pid_3 pid_4 pid_5
rid_1 0 50 24 0 27
rid_2 50 0 45 7 18
rid_3 24 45 0 88 12
rid_4 0 7 88 0 92
rid_5 27 18 12 92 0
but I could not figure out how off the top of my head. It would be really appreciated if someone could help me with this.
# make rid and pid factors over all five ids so unobserved levels are kept
df1$rid <- factor(df1$rid, levels = 1:5, labels = paste("rid", 1:5, sep = "_"))
df1$pid <- factor(df1$pid, levels = 1:5, labels = paste("pid", 1:5, sep = "_"))
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
# rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1 0 0 0 0 0
#2 rid_2 50 0 45 7 18
#3 rid_3 24 45 0 88 12
#4 rid_4 0 0 0 0 0
#5 rid_5 27 18 12 92 0
The tricks:
use factor variables to tell R which values are possible, as well as their order.
in data.table's dcast, use fill = 0 (fill zero where you have nothing) and drop = FALSE (make entries for factor levels that aren't observed).
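The above handles step 1 of the question (missing reporters appear as all-zero rows). For step 2, here is a hedged sketch of one possible post-processing step; it assumes trade flows are symmetric, so a missing reporter cell can be copied from the mirrored cell reported by the partner:
library(data.table)
# continuing from df1 with its factor rid/pid columns
wide <- dcast(as.data.table(df1), rid ~ pid, fill = 0, drop = FALSE,
              value.var = "TradeValue")
m <- as.matrix(wide[, -1])
rownames(m) <- as.character(wide$rid)
# wherever a cell is still 0, take the partner-reported mirror value;
# cells that are 0 on both sides simply stay 0
m[m == 0] <- t(m)[m == 0]
m
On the 5-country example this reproduces the desired output shown in the question.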

Reordering rows based on sequence

With the following data I would like to rearrange rows within each unique ProjectID.
Data <- data.frame(ProjectID=c(123,123,123,124,125,125,126,126,126,126,128,129),
Value=c(1,4,7,3,8,9,2,5,3,7,2,6),
Sequence=c(1,3,2,1,3,2,1,1,4,2,2,4))
Within each unique ProjectID, if the Sequence contains either "1,3,2" or "1,4,2" I want to reorder the rows to follow "3,1,2" or "4,1,2", respectively.
In ProjectID 126, there are four observations with Sequence 1,1,4,2, so the expected result should show 1,4,1,2.
How can I accomplish this?
Result should be as follows
Result <- data.frame(ProjectID=c(123,123,123,124,125,125,126,126,126,126,128,129),
Value=c(4,1,7,3,8,9,2,3,5,7,2,6),
Sequence=c(3,1,2,1,3,2,1,4,1,2,2,4))
Split by ProjectID, look for the triple as a contiguous run inside each group's Sequence (so the 1,1,4,2 case of ProjectID 126 is handled too), and swap the first two rows of the run:
do.call(rbind,
        lapply(split(Data, Data$ProjectID), function(a){
          s <- a$Sequence
          if (length(s) >= 3) {
            for (trip in list(c(1, 3, 2), c(1, 4, 2))) {
              for (start in seq_len(length(s) - 2)) {
                if (identical(s[start + 0:2], trip)) {
                  # a run 1,x,2 becomes x,1,2: swap its first two rows
                  idx <- seq_along(s)
                  idx[start + 0:1] <- idx[start + 1:0]
                  return(a[idx, ])
                }
              }
            }
          }
          a
        }))
# ProjectID Value Sequence
#123.2 123 4 3
#123.1 123 1 1
#123.3 123 7 2
#124 124 3 1
#125.5 125 8 3
#125.6 125 9 2
#126.7 126 2 1
#126.9 126 3 4
#126.8 126 5 1
#126.10 126 7 2
#128 128 2 2
#129 129 6 4

Printing only certain panels in R lattice

I am plotting a quantile-quantile plot for some data that I have. I would like to print only those panels that satisfy a condition that I put in for panel.qq(x,y,...).
Let me give you an example. The following is my code:
qq(y ~ x | cond, data = test.df, panel = function(x, y, subscripts, ...){
  if(length(unique(test.df[subscripts, 2])) > 3){panel.qq(x, y, subscripts, ...)}})
Here y is the factor and x is the variable that will be plotted on the X and Y axes. cond is the conditioning variable. What I would like is for only those panels to be printed that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts, 2])) > 3).
I hope this information helps. Thanks in advance.
Added sample data:
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. What I would like is to not have a panel for 123, since the number of unique x values for 123 is 3, while for the others it is 4. Thanks again.
Yeah, I think it is a subsetting problem, not a lattice one. It looks like you want to keep only the rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
where c2 is that variable in the second column. The last line of code first adds a variable, N, holding the count of rows (.N) for each value of c2, then subsets for N > 3.
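Note that the OP's condition is actually on the number of unique x values per cond rather than on the row count; a hedged variant of the same data.table idea, using uniqueN, might be:
library(data.table)
test.dt <- as.data.table(test.df)
# keep only the panels (cond groups) with more than 3 distinct x values
test.dt[, N := uniqueN(x), by = cond][N > 3]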
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset
I did something very similar to DaveTurek's answer.
My sample data frame above is test.df:
test.df.list <- split(test.df, test.df$cond, drop = FALSE)
final.test.df <- do.call("rbind", lapply(test.df.list, function(r){
  if(length(unique(r$x)) > 3){r}}))
So, here I am breaking test.df into a list of data frames by the conditioning variable. Next, in the lapply, I check the number of unique x values in each subset data frame. If this number is greater than 3, the data frame is returned; otherwise it is dropped (the lapply yields NULL, which rbind ignores). Finally, a do.call binds all the data frames back into one big data frame to run the quantile-quantile plot on.
In case anyone wants to know the qq call after getting the specific data, it is:
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.
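For what it's worth, a compact base R alternative (my own sketch, not from the answers above) computes the per-group count of unique x values with ave and subsets in one step:
# keep rows whose cond group has more than 3 unique x values
final.test.df <- test.df[ave(test.df$x, test.df$cond,
                             FUN = function(v) length(unique(v))) > 3, ]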

Assign industry codes according to ranges in R

I would like to assign overall industry/parent codes to a data.frame (df below) containing more detailed/child codes (called ChildCodes below). The following data serves to illustrate my data.frame containing the detailed codes:
> df <- as.data.frame(cbind(c(1,2,3,4,5,6),c(110,101,200,2041,3651,2102)))
> names(df) <- c('Id','ChildCodes')
> df
Id ChildCodes
1 1 110
2 2 101
3 3 200
4 4 2041
5 5 3651
6 6 2102
The industry/parent codes are in the .csv file here: https://www.dropbox.com/s/5qtb7ysys1ar0lj/IndustryCodes.csv
The problem for me is the format of the .csv file. The file shows the parent/industry code in column 1 and ranges of child/detailed codes in the next two columns. Here is a subset:
> IndustryCodes <- as.data.frame(cbind(c(1,1,2,5,6),c(100,200,2040,2100,3650),c(199,299,2046,2199,3651)))
> names(IndustryCodes) <- c('IndustryGroup','LowerRange','UpperRange')
> IndustryCodes
IndustryGroup LowerRange UpperRange
1 1 100 199
2 1 200 299
3 2 2040 2046
4 5 2100 2199
5 6 3650 3651
So ChildCode 110 corresponds to industry group 1, 2041 to industry group 2, etc. What is the best way to assign the industry/parent codes (IndustryGroup) to df in R?
Thanks!
You can use sapply to get the Industry code for every child code:
sapply(df$ChildCodes,
function(x) IndustryCodes$IndustryGroup[IndustryCodes$LowerRange <= x &
x <= IndustryCodes$UpperRange])
# [1] 1 1 1 2 6 5
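For a large df, a vectorized alternative (a sketch that assumes IndustryCodes is sorted by LowerRange and that every ChildCode is at least the smallest LowerRange) is findInterval:
# index of the last LowerRange at or below each ChildCode
i <- findInterval(df$ChildCodes, IndustryCodes$LowerRange)
# guard against codes that fall in the gaps between ranges
ok <- df$ChildCodes <= IndustryCodes$UpperRange[i]
ifelse(ok, IndustryCodes$IndustryGroup[i], NA)
# [1] 1 1 1 2 6 5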

Row Differences in Dataframe by Group

My problem has to do with finding row differences in a data frame by group. I've tried to do this a few ways. Here's an example. The real data set is several million rows long.
set.seed(314)
df = data.frame("group_id"=rep(c(1,2,3),3),
"date"=sample(seq(as.Date("1970-01-01"),Sys.Date(),by=1),9,replace=F),
"logical_value"=sample(c(T,F),9,replace=T),
"integer"=sample(1:100,9,replace=T),
"float"=runif(9))
df = df[order(df$group_id,df$date),]
I ordered it by group_id and date so that diff can find the sequential, time-ordered differences of the logical, integer, and float variables. I could easily do some sort of apply(df, 2, diff), but I need it by group_id; applied to the whole frame, it would also produce unneeded differences across group boundaries.
df
group_id date logical_value integer float
1 1 1974-05-13 FALSE 4 0.03472876
4 1 1979-12-02 TRUE 45 0.24493995
7 1 1980-08-18 TRUE 2 0.46662253
5 2 1978-12-08 TRUE 56 0.60039164
2 2 1981-12-26 TRUE 34 0.20081799
8 2 1986-05-19 FALSE 60 0.43928929
6 3 1983-05-22 FALSE 25 0.01792820
9 3 1994-04-20 FALSE 34 0.10905326
3 3 2003-11-04 TRUE 63 0.58365922
So I thought I could break up my data frame into chunks by group_id, and pass each chunk into a user defined function:
create_differences = function(data_group){
apply(data_group, 2, diff)
}
But I get errors using the code:
diff_df = lapply(split(df,df$group_id),create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
by(df,df$group_id,create_differences)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator
As a side note, the data is nice, no NAs, nulls, blanks, and every group_id has at least 2 rows associated with it.
Edit 1: User alexis_laz correctly pointed out that my function needs to be sapply(data_group, diff).
Using this edit, I get a list of data frames (one list entry per group).
Edit 2:
The expected output would be a combined data frame of differences. Ideally, I would like to keep the group_id, but if not, it's not a big deal. Here is what the sample output should look like:
diff_df
group_id date logical_value integer float
[1,] 1 2029 1 41 0.2102112
[2,] 1 260 0 -43 0.2216826
[1,] 2 1114 0 -22 -0.3995737
[2,] 2 1605 -1 26 0.2384713
[1,] 3 3986 0 9 0.09112507
[2,] 3 3485 1 29 0.47460596
Given that you have millions of rows, I think you can move to data.table, which is well suited to by-group operations.
library(data.table)
DT <- as.data.table(df)
## this will order per group and per day
setkeyv(DT,c('group_id','date'))
## for all column apply diff
DT[,lapply(.SD,diff),group_id]
# group_id date logical_value integer float
# 1: 1 2029 days 1 41 0.21021119
# 2: 1 260 days 0 -43 0.22168257
# 3: 2 1114 days 0 -22 -0.39957366
# 4: 2 1604 days -1 26 0.23847130
# 5: 3 3987 days 0 9 0.09112507
# 6: 3 3485 days 1 29 0.47460596
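Note that diff on the Date column yields a difftime ("2029 days" and so on). If plain numbers are wanted, as in the expected output, a small variation coerces each result:
# same grouped diff, with difftime results coerced to plain numbers
DT[, lapply(.SD, function(x) as.numeric(diff(x))), by = group_id]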
It certainly won't be as quick as data.table, but below is an only slightly ugly base solution using aggregate (note that the cbind/as.vector step relies on every group yielding the same number of differences):
result <- aggregate(. ~ group_id, data=df, FUN=diff)
result <- cbind(result[1],lapply(result[-1], as.vector))
result[order(result$group_id),]
# group_id date logical_value integer float
#1 1 2029 1 41 0.21021119
#4 1 260 0 -43 0.22168257
#2 2 1114 0 -22 -0.39957366
#5 2 1604 -1 26 0.23847130
#3 3 3987 0 9 0.09112507
#6 3 3485 1 29 0.47460596
