I have created a contingency table with several variables/columns by 510 categories/factor levels. I want the factor levels ordered in descending order of the sum across all columns.
I tried converting the table back to a data frame and using rowSums(), but with no luck.
Is it possible to sort while using the table() function?
DF structure
'data.frame': 2210 obs. of 7 variables:
$ Paddock_ID: num 1 1 1 1 1 1 1 1 1 1 ...
$ Year : num 2010 2011 2011 2012 2012 ...
$ LandUse : chr "Wheat" "Wheat" "Wheat" "Wheat" ...
$ LUT : chr "Cer" "Cer" "Cer" "Cer" ...
$ LUG : chr "Wheat" "Wheat" "Wheat" "Wheat" ...
$ Tmix : Factor w/ 6 levels "6","5","4","3",..: 6 5 6 4 6 5 4 5 6 6 ...
$ combo : Factor w/ 510 levels "","GLYPHOSATE",..: 416 6 59 119 30 22 510 2 2 509
My table:
a <- table(DF$combo, DF$LUG)
I get the table fine, but I would like it ordered by the sum across all columns, i.e. GLYPHOSATE = 124, then CLETHODIM = 69, then PARAQUAT = 53, and so on, descending for all 510 categories (rows).
                                    Barley Canola Lupin Other Pasture Wheat
GLYPHOSATE                               4     46     6     5      23    40
TRALKOXYDIM                              0      0     0     0       0     8
MCPA; GLYPHOSATE; METSULFURON            0      0     0     0       0     1
METSULFURON                              1      0     0     0       0     1
BUTROXYDIM; METSULFURON                  1      0     0     0       0     0
GLYPHOSATE; METSULFURON; PYRAFLUFEN      0      0     0     0       0     1
PARAQUAT                                 2      7     7     2      28     7
CLETHODIM                                0     41    15     3       0     0
Using an example dataset:
grades <- c(1,1,1,2,2,1,1,2,1,1,1,2,3)
credits <- c(4,4,4,8,4,4,8,4,4,4,8,4,4)
df <- cbind(grades, credits)
You can find the row sums using rowSums().
One possible solution is to create another column holding the row sums and then sort with decreasing = TRUE:
df <- as.data.frame(df)
df$sum <- rowSums(df)
df <- df[order(df$sum, decreasing = TRUE), ]
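Applied back to the original question, the same ordering can be done on the table object itself, since a table can be indexed like a matrix. A minimal sketch, with hypothetical stand-in data for DF:

```r
# Hypothetical stand-in for the original DF (combo x LUG)
DF <- data.frame(
  combo = c("GLYPHOSATE", "CLETHODIM", "PARAQUAT", "GLYPHOSATE", "GLYPHOSATE"),
  LUG   = c("Wheat", "Canola", "Pasture", "Canola", "Wheat")
)
a <- table(DF$combo, DF$LUG)
# Reorder the rows of the table by their row sums, largest first
a_sorted <- a[order(rowSums(a), decreasing = TRUE), ]
```

With the real data this should put GLYPHOSATE first, then CLETHODIM, and so on.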
I have to subset this "plt" data frame.
"plt" is a data frame of GPS points, with date and time.
"labels" is a data frame of all trips in the day, with start and end times.
I would take the value in row 1 of labels$Start and the value in row 1 of labels$End, search for these values in the plt$Data_Time column, and subset all rows between the Start value and the End value.
> str(labels)
'data.frame': 10 obs. of 8 variables:
$ Date_ST: Factor w/ 5 levels "2008/04/28","2008/04/29",..: 1 1 2 2 3 3 4 4 5 5
$ Time_ST: Factor w/ 15 levels "01:27:05","01:33:29",..: 13 15 4 10 1 7 8 12 2 11
$ Date_ET: Factor w/ 5 levels "2008/04/28","2008/04/29",..: 1 1 2 2 3 3 4 4 5 5
$ Time_ET: Factor w/ 15 levels "01:35:25","01:41:11",..: 13 15 3 10 1 5 6 12 2 9
$ Mode : Factor w/ 2 levels "subway","walk": 2 2 2 2 2 2 2 2 2 2
$ ID : int 1 3 4 6 7 9 10 12 13 15
$ Start : chr "2008/04/28 11:27:42" "2008/04/28 11:42:56" "2008/04/29 01:38:21" "2008/04/29 01:57:55" ...
$ End : chr "2008/04/28 11:27:58" "2008/04/28 11:50:10" "2008/04/29 01:41:28" "2008/04/29 02:03:28" ...
> str(plt)
'data.frame': 4377 obs. of 9 variables:
$ Lat : num 40.1 40.1 40.1 40.1 40.1 ...
$ Long : num 116 116 116 116 116 ...
$ X0 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Alt : int 492 492 491 491 491 490 490 490 489 489 ...
$ n.days : num 39589 39589 39589 39589 39589 ...
$ Date : Factor w/ 5 levels "2008-05-21","2008-04-28",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Time : Factor w/ 2955 levels "01:33:29","01:33:30",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Data_Time: chr "2008-05-21 01:33:29" "2008-05-21 01:33:30" "2008-05-21 01:33:31" "2008-05-21 01:33:33" ...
head(plt)
Lat Long X0 Alt n.days Date Time ID Data_Time
1 40.07045 116.3130 0 492 39589.06 2008-05-21 01:33:29 1 2008-05-21 01:33:29
2 40.07045 116.3133 0 492 39589.06 2008-05-21 01:33:30 2 2008-05-21 01:33:30
3 40.07050 116.3131 0 491 39589.06 2008-05-21 01:33:31 3 2008-05-21 01:33:31
4 40.07052 116.3130 0 491 39589.06 2008-05-21 01:33:33 4 2008-05-21 01:33:33
5 40.07050 116.3129 0 491 39589.06 2008-05-21 01:33:35 5 2008-05-21 01:33:35
6 40.07047 116.3129 0 490 39589.07 2008-05-21 01:33:37 6 2008-05-21 01:33:37
labels
Date_ST Time_ST Date_ET Time_ET Mode ID Start End
1 2008/04/28 11:27:42 2008/04/28 11:27:58 walk 1 2008/04/28 11:27:42 2008/04/28 11:27:58
3 2008/04/28 11:42:56 2008/04/28 11:50:10 walk 3 2008/04/28 11:42:56 2008/04/28 11:50:10
4 2008/04/29 01:38:21 2008/04/29 01:41:28 walk 4 2008/04/29 01:38:21 2008/04/29 01:41:28
6 2008/04/29 01:57:55 2008/04/29 02:03:28 walk 6 2008/04/29 01:57:55 2008/04/29 02:03:28
7 2008/05/12 01:27:05 2008/05/12 01:35:25 walk 7 2008/05/12 01:27:05 2008/05/12 01:35:25
9 2008/05/12 01:51:11 2008/05/12 01:55:35 walk 9 2008/05/12 01:51:11 2008/05/12 01:55:35
I need to do this for each row, so I thought of using a for loop.
In the end, I want to keep only columns 1 and 2 (Lat and Long).
for(i in 1:nrow(labels)) {
a = labels$Start[i] # take the start/end coordinates of the trip
b = labels$End[i]
k = plt[plt$Data_Time >= a & plt$Data_Time < b, ]
LatLong = k[1:2]
head(LatLong)
write.table(LatLong, "~/Desktop/LatLongTrip.txt", sep="\t")
}
Unfortunately, the result is:
> k = plt[plt$Data_Time >= b & plt$Data_Time < a, ]
> k
[1] Lat Long X0 Alt n.days Date Time ID Data_Time
<0 rows> (or 0-length row.names)
In reality there are rows between these two values. Could you help me, please?
You don't need a for loop :)
Here:
First, make sure the sqldf library is loaded:
library(sqldf)
Then, setting up a mock data example:
fechasInicioYFin <- data.frame(
fechasInicio = as.POSIXct(c('2016-08-19 10:00','2016-08-25 15:00','2016-09-15 15:00','2016-07-20 11:00')),
fechasFin = as.POSIXct(c('2016-08-19 14:00','2016-08-25 18:00','2016-09-15 19:00','2016-07-20 16:00'))
)
dataConFecha <- data.frame(num1 = c(1,2,3,4,5,6), num2 = c(11:16),
fechas = as.POSIXct(c('2016-08-19 12:00','2016-08-25 16:00','2016-09-15 16:00','2016-07-20 13:00',
'2016-08-19 13:00','2016-09-15 17:00'))
)
Now just join them on the date column and select only the columns that you are interested in:
sqldf("select a.*,b.fechasInicio,b.fechasFin from dataConFecha as a join fechasInicioYFin as b on
a.fechas between b.fechasInicio and b.fechasFin")
(Using the SQL BETWEEN operator instead of >= and <=, as suggested by G. Grothendieck.)
In the output, each row of dataConFecha is paired with its enclosing interval, so the data is now effectively grouped by beginning date and ending date.
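For comparison, the same interval subsetting can be sketched in base R without sqldf, assuming the timestamp columns are POSIXct (mock data as above; values hypothetical):

```r
fechasInicioYFin <- data.frame(
  fechasInicio = as.POSIXct(c('2016-08-19 10:00', '2016-08-25 15:00')),
  fechasFin    = as.POSIXct(c('2016-08-19 14:00', '2016-08-25 18:00'))
)
dataConFecha <- data.frame(
  num1   = 1:3,
  fechas = as.POSIXct(c('2016-08-19 12:00', '2016-08-25 16:00', '2016-08-25 20:00'))
)
# For each interval, keep the rows of dataConFecha that fall inside it,
# then stack the per-interval results into one data frame
res <- do.call(rbind, lapply(seq_len(nrow(fechasInicioYFin)), function(i) {
  dataConFecha[dataConFecha$fechas >= fechasInicioYFin$fechasInicio[i] &
               dataConFecha$fechas <= fechasInicioYFin$fechasFin[i], ]
}))
```

The third row (20:00) falls outside both intervals, so it is dropped.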
I have a dataset that contains time series information regarding soil elevation from several sampling stations. I have modeled the change in soil elevation over time for each station using ggplot. Now I would like to add a line to my graph that depicts a linear model fit to other geological data over time from a different dataset but I have been unable to do so. I know that I can add the slope and the intercept to my functions manually but I would rather not.
My data is as follows:
str(SETdata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1620 obs. of 6 variables:
$ Observation : num 1 2 3 4 5 6 7 8 9 10 ...
$ Plot_Name : Factor w/ 3 levels "1900-01-01","1900-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PipeDirectionCode: chr "001°" "001°" "001°" "001°" ...
$ Pin : num 1 2 3 4 5 6 7 8 9 1 ...
$ EventDate : num 0 0 0 0 0 0 0 0 0 0 ...
$ PinHeight_mm : num 221 207 192 220 212 212 206 209 203 222 ...
str(FeldsparData)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 540 obs. of 4 variables:
$ Benchmark : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 2 ...
$ Plot : Factor w/ 12 levels "1a","1b","1c",..: 1 1 1 2 2 2 3 3 3 5 ...
$ TotalChange: num 0 0 0 0 0 0 0 0 0 0 ...
$ Day : num 0 0 0 0 0 0 0 0 0 0 ...
The graph I have is
SETdata %>%
ggplot()+
aes(x = EventDate, y = PinHeight_mm, color = Plot_Name, group = Plot_Name)+
stat_summary(fun.y = mean, geom = "point")+
stat_summary(fun.y = mean, geom = "line")
And I would like it to include this line
reg <- lm(TotalChange ~ Day, data = FeldsparData)
My attempts seem to have been thwarted because R does not like that I am using two different datasets.
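One common way around this, sketched below with hypothetical stand-in data (assuming ggplot2 is installed), is to give the regression layer its own data argument via geom_smooth(), so no manual slope or intercept is needed:

```r
library(ggplot2)

# Hypothetical stand-ins for the two datasets described above
SETdata <- data.frame(
  EventDate    = c(0, 1, 2, 0, 1, 2),
  PinHeight_mm = c(210, 212, 215, 208, 209, 213),
  Plot_Name    = rep(c("A", "B"), each = 3)
)
FeldsparData <- data.frame(Day = c(0, 1, 2), TotalChange = c(0, 2, 4))

# inherit.aes = FALSE stops the smooth layer from looking for
# EventDate/PinHeight_mm inside FeldsparData
p <- ggplot(SETdata,
            aes(x = EventDate, y = PinHeight_mm,
                color = Plot_Name, group = Plot_Name)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line") +
  geom_smooth(data = FeldsparData, aes(x = Day, y = TotalChange),
              method = "lm", se = FALSE, inherit.aes = FALSE)
```

(Recent ggplot2 versions use fun = rather than the deprecated fun.y =.)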
Here is my data
> str(myData)
'data.frame': 500 obs. of 12 variables:
$ PassengerId: int 1 2 5 6 7 8 9 10 11 12 ...
$ Survived : int 0 1 0 0 0 0 1 1 1 1 ...
$ Pclass : int 3 1 3 3 1 3 3 2 3 1 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 16 559 520 629 417 581 732 96 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 1 1 1 1 ...
$ Age : num 22 38 35 NA 54 2 27 14 4 58 ...
$ SibSp : int 1 1 0 0 0 3 0 1 1 0 ...
$ Parch : int 0 0 0 0 0 1 2 0 1 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 473 276 86 396 345 133 617 39 ...
$ Fare : num 7.25 71.28 8.05 8.46 51.86 ...
$ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA NA 130 NA NA NA 146 50 ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 2 3 3 3 1 3 3 ...
I have to generate two results:
1. a table of counts grouped by Title and Pclass of each passenger;
2. a table of missing-age counts grouped by Title and Pclass.
But when I used what I know, both results came out as below:
> myData$Name = as.character(myData$Name)
> table_words = table(unlist(strsplit(myData$Name, "\\s+")))
> sort(table_words [grep('\\.',names(table_words))], decreasing=TRUE)
Mr. Miss. Mrs. Master. Dr. Rev. Col. Capt. Countess. Don.
289 99 76 20 5 3 2 1 1 1
L. Mlle. Mme. Sir.
1 1 1 1
> library(stringr)
> tb = cbind(myData$Age, str_match(myData$Name, "[a-zA-Z]+\\."))
> table(tb[is.na(tb[,1]),2])
Dr. Master. Miss. Mr. Mrs.
1 3 18 62 7
Basically, instead of returning the totals as I did above, I need each table split into three rows by the Pclass integer (1, 2, 3), with the three rows still summing to the same totals (myTitle = Pclass int 1 / 2 / 3 in 'myData').
So, for example, the first result should show that Capt. occurs only once, under Pclass 1.
How should I split the totals by Pclass 1, 2, and 3?
It is hard to tell with no data provided (though I think that it comes from the Titanic dataset on Kaggle).
I think the first thing to do is to create a new factor with Title as you want to make analysis with it. I'd do something like:
# Extract title from name and make it a factor
dat$Title <- gsub(".* (.*)\\. .*$", "\\1", as.character(dat$Name))
dat$Title <- factor(dat$Title)
You'll need to check that it works with your data.
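For instance, a quick sanity check on a few hypothetical names:

```r
# The regex captures the word that directly precedes ". " in each name
nm <- c("Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina")
titles <- gsub(".* (.*)\\. .*$", "\\1", nm)
```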
Once you have the Title factor you can use ddply from the plyr library and make the first table (grouped by Title and Pclass of each passenger):
library(plyr)
# Number of occurences
classTitle <- ddply(dat, c('Pclass', 'Title'), summarise,
count=length(Name))
# Convert to wide format
classTitle <- reshape(classTitle, idvar = "Title", timevar = "Pclass",
direction = "wide")
# Fill NA's with 0
classTitle[is.na(classTitle)] <- 0
Almost the same thing for your second requirement (display table of missing age counts grouped by Title and Pclass):
# Number of NA in Age
countNA <- ddply(dat, c('Pclass', 'Title'), summarise,
na=sum(is.na(Age)))
# Convert to wide format
countNA <- reshape(countNA, idvar = "Title", timevar = "Pclass",
direction = "wide")
# Fill NA's with 0
countNA[is.na(countNA)] <- 0
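The same two tables can also be sketched in base R with table() and xtabs(), shown here on hypothetical mock data in place of the Titanic file:

```r
# Hypothetical mock data with the Title factor already extracted
dat <- data.frame(
  Title  = c("Mr", "Mrs", "Mr", "Miss", "Mr"),
  Pclass = c(1, 1, 2, 3, 3),
  Age    = c(40, NA, 30, NA, NA)
)
# Counts of passengers by Title and Pclass
counts <- table(dat$Title, dat$Pclass)
# Counts of missing ages by Title and Pclass: xtabs() sums the
# logical is.na(Age) within each Title/Pclass cell
missing_age <- xtabs(is.na(Age) ~ Title + Pclass, data = dat)
```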
I have the following data frame in R:
> head(df)
date x y z n t
1 2012-01-01 1 1 1 0 52
2 2012-01-01 1 1 2 0 52
3 2012-01-01 1 1 3 0 52
4 2012-01-01 1 1 4 0 52
5 2012-01-01 1 1 5 0 52
6 2012-01-01 1 1 6 0 52
> str(df)
'data.frame': 4617600 obs. of 6 variables:
$ date: Date, format: "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" ...
$ x : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ y : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ z : Factor w/ 111 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ n : int 0 0 0 0 0 0 0 0 29 0 ...
$ t : num 52 52 52 52 52 52 52 52 52 52 ...
What I want to do is split this large df into smaller data frames as follows:
1) I want to have 45 data frames for each factor value of 'x'. 2) I want to further split these 45 data frames for each factor value of 'z'. So I want a total of 45*111=4995 data frames.
I've seen plenty online about splitting data frames, which turns them into lists. However, I'm not seeing how to further split lists. Another concern I have is with computer memory. If I split the data frame into lists, will it not still take up as much computer memory? If I then want to run some prediction models on the split data, it seems impossible to do. Ideally I would split the data into many data frames, run prediction models on the first split data frame, get the results I need, and then delete it before moving on to the next one.
Here's what I would do. Your data already fits in memory, so just leave it in one piece:
require(data.table)
setDT(df)
df[,{
sum(t*n) # or whatever you're doing for "prediction models"
},by=list(x,z)]
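A runnable sketch of that per-group pattern, with mock data and sum(t*n) standing in for whatever the real per-group computation is (assuming data.table is installed):

```r
library(data.table)

# Mock data: two levels of x crossed with two levels of z
df <- data.frame(
  x = factor(c(1, 1, 2, 2)),
  z = factor(c(1, 2, 1, 2)),
  n = c(0, 1, 2, 3),
  t = c(52, 52, 10, 10)
)
setDT(df)  # convert in place to a data.table
# One result row per (x, z) group, no physical split required
res <- df[, .(score = sum(t * n)), by = .(x, z)]
```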
I'm working with a data frame of stock information, here is what it looks like:
> str(test)
'data.frame': 211717 obs. of 19 variables:
$ Symbol : Factor w/ 3378 levels "AACC","AACE",..: 1 1 1 1 1 1 1 1 1 1 ...
$ MktCategory : Factor w/ 3 levels "","NNM","SCM": 2 2 2 2 2 2 2 2 2 2 ...
$ TSO : num 37205115 37205115 37205115 37205115 37205115 ...
$ TSO_Date : Factor w/ 200 levels "","1/1/2006",..: 137 137 137 137 137 137 137 137 137 137 ...
$ X.OfMP : int 56 56 56 56 56 56 56 56 56 56 ...
$ MPID : Factor w/ 670 levels "","ABLE","ABNA",..: 608 459 533 618 550 635 307 146 387 482 ...
$ MP_type : Factor w/ 4 levels "","C","M","NR": 2 3 4 3 3 3 3 4 3 4 ...
$ Total_Vol : int 32900 0 2949 758522 41316 706131 29300 16898 362569 1490 ...
$ Total_Rank : int 18 0 35 2 17 3 21 26 5 40 ...
$ Total_Pct : int 0 0 0 14 0 13 0 0 7 0 ...
$ Block_Vol : int 0 0 0 60800 20000 34900 19200 16600 0 0 ...
$ Block_Rank : int 0 0 0 2 6 4 7 9 0 0 ...
$ Block_Pct : int 0 0 0 15 5 9 5 4 0 0 ...
$ YTD_Total_Vol : num 81615 2929 10684 1949230 190874 ...
$ YTD_Total_Rank: int 28 59 44 3 17 5 30 27 12 67 ...
$ YTD_Total_Pct : int 0 0 0 9 0 7 0 0 2 0 ...
$ YTD_Block_Vol : int 0 0 0 197420 80000 390600 60900 73787 55994 0 ...
$ YTD_Block_Rank: int 0 0 0 5 13 3 16 14 17 0 ...
$ YTD_Block_Pct : int 0 0 0 6 3 12 2 2 2 0 ...
So I know how to sum total volume (Total_Vol) by Symbol with the aggregate function:
volbystock<-aggregate(test$Total_Vol,by=list(test$Symbol),FUN=sum)
but I am trying to analyze volume of only a few MPID values. I want to only add the Total_Vol of a Symbol when the MPID is one of the MPIDs in another list. In other words, I only want the Total_Vol of a certain Symbol added if the corresponding MPID is one of the following:
> use_MPID<-c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
Using dplyr you can do something like this:
# load dplyr
library(dplyr)
# create a vector of MPIDs you are interested on
use_MPID <- c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
# create a fake dataset just for representation
test <- data.frame(Symbol = c("ci", "di", "bi", "bi"),
                   MPID = c("GSCO", "LATS", "TACT", "INCA"),
                   TotalVol = c(35, 110, 201, 435))
# use dplyr to summarise your dataset
volbystock <- test %>%
  filter(MPID %in% use_MPID) %>%
  group_by(Symbol) %>%
  summarise(TotalVol = sum(TotalVol))
It looks like you could just subset your data.frame, by using:
use_MPID <- c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
relevant.symbols <- which(test$MPID %in% use_MPID)
volbystock <- aggregate(test$Total_Vol[relevant.symbols],
by=list(test$Symbol[relevant.symbols]),
FUN=sum)
Does this solve your problem?
edit
Even better, you could use the optional subset argument together with the formula interface:
use_MPID <- c("GSCO","LATS","TACT","INCA","LATS","LQNT","ITGI")
volbystock <- aggregate(Total_Vol ~ Symbol, data = test,
                        subset = MPID %in% use_MPID,
                        FUN = sum)
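A self-contained sketch of that formula + subset form, with hypothetical mock data in place of the real test data frame:

```r
# Mock stand-in for the real test data frame
test <- data.frame(
  Symbol    = c("AACC", "AACC", "AACE"),
  MPID      = c("GSCO", "XXXX", "LATS"),
  Total_Vol = c(100, 50, 70)
)
use_MPID <- c("GSCO", "LATS", "TACT", "INCA", "LQNT", "ITGI")
# Rows whose MPID is not in use_MPID are dropped before summing
volbystock <- aggregate(Total_Vol ~ Symbol, data = test,
                        subset = MPID %in% use_MPID, FUN = sum)
```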