How to sum a variable and extract its related variable to appear in the answer - r

I am new to R and to Stack Overflow and I need assistance in sorting and extracting information from a data frame I created. I need to find which IATA and NAME received the most commission. The result should print: 3301, you pay, 12. I can subset each and every IATA, but that is a long process. What is the best function in R to sort all this information and print the result?
IATA NAME TICKET_NUM PAX FARE TAX COMM NET
3300 pay more 700 john cohen 10 1.1 2 8
3300 pay more 701 james levy 11 1.2 2 9
3300 pay more 702 jonathan arbel 12 1.2 3 9
3300 pay more 703 gil matan 9 1.0 2 7
3301 you pay 704 ron natan 19 2.0 6 9
3301 you pay 705 don horvitz 18 2.0 6 9
3302 pay by ticket 706 lutter kaplan 9 1.2 0 9
3303 enjoy 707 lutter omega 12 1.2 0 12
3303 enjoy 708 graig daniel 14 1.3 1 13
3303 enjoy 730 orly rotenberg 15 1.0 1 14
3303 enjoy 731 yohan bach 12 1.0 1 11

This seems to return what you requested (using Jeremy's code for the second part):
comm <- read.table(text = '
IATA NAME TICKET_NUM PAX FARE TAX COMM NET
3300 pay.more 700 john.cohen 10 1.1 2 8
3300 pay.more 701 james.levy 11 1.2 2 9
3300 pay.more 702 jonathan.arbel 12 1.2 3 9
3300 pay.more 703 gil.matan 9 1.0 2 7
3301 you.pay 704 ron.natan 19 2.0 6 9
3301 you.pay 705 don.horvitz 18 2.0 6 9
3302 pay.by.ticket 706 lutter.kaplan 9 1.2 0 9
3303 enjoy 707 lutter.omega 12 1.2 0 12
3303 enjoy 708 graig.daniel 14 1.3 1 13
3303 enjoy 730 orly.rotenberg 15 1.0 1 14
3303 enjoy 731 yohan.bach 12 1.0 1 11
', header=TRUE, stringsAsFactors = FALSE)
comm2 <- with(comm, aggregate(COMM ~ IATA + NAME, FUN = function(x) sum(x, na.rm = TRUE)))
comm2
max_comm <- comm2[comm2$COMM == max(comm2$COMM),]
max_comm
IATA NAME COMM
4 3301 you.pay 12
Here is an explanation of the first statement:
The with function identifies the data set to use (here comm). The function aggregate is a general function for performing operations on groups. You want to operate on COMM by IATA and NAME. You write that: COMM ~ IATA + NAME. Next you specify the desired function to perform on COMM (here sum). You do that with FUN = function(x) sum(x). In case there are any missing observations in COMM I added na.rm = TRUE within the sum(x) function.
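As a minimal sketch of the same idea end to end, the totals can be computed with aggregate() and the top row picked with which.max(). The small data frame below is a cut-down copy of the question's data, and the names totals and top are only illustrative.

```r
# Cut-down copy of the question's data (only the columns needed here)
comm <- data.frame(
  IATA = c(3300, 3300, 3300, 3300, 3301, 3301, 3302, 3303, 3303, 3303, 3303),
  NAME = c(rep("pay more", 4), rep("you pay", 2), "pay by ticket", rep("enjoy", 4)),
  COMM = c(2, 2, 3, 2, 6, 6, 0, 0, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Total commission per IATA/NAME pair
totals <- aggregate(COMM ~ IATA + NAME, data = comm, FUN = sum)

# Row with the largest total (which.max returns the first one if there are ties)
top <- totals[which.max(totals$COMM), ]

# Print in the "IATA, NAME, COMM" form asked for in the question
cat(top$IATA, top$NAME, top$COMM, sep = ", ")
cat("\n")
```

Unlike the == max(...) filter, which.max() keeps only one row when several agencies tie for the largest total.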

Call that table comm
max_comm <- comm[comm$COMM == max(comm$COMM),]
or just sort it and look at the head
head(comm[order(-comm$COMM),])
Edit: If you want to sum by IATA first then use data.table
library(data.table)
comm2 <- data.table(comm)
sum_comm <- comm2[, list(COMM_SUM=sum(COMM)), by = c("IATA","NAME")]
data.table has an unusual syntax; you could also try dplyr, which is supposed to be roughly as good as data.table now.

Related

Removing cases and corresponding controls

I have a dataset that looks like this:
patid age gender group pracid matched_id match_eventdate BMI
1 10 M case 100 1 23-05-20 NA
111 12 M control 100 1 23-05-20 20.8
222 9 M case 100 222 23-05-20 15.7
333 8 M control 100 222 23-05-20 21.8
555 8 M control 100 222 23-05-20 19.5
Each case can have up to 3 controls (some will have 1, some 2, some 3). Say I need to remove the cases that don't have BMI recorded (e.g. patid 1). I then also need to remove the corresponding controls matched to 1 (patid 111). It can be any number (not only 111 as in the example above). How would I do that?
I know I need a for loop to go through the BMI, then save the ID cases that don't match that criteria, then remove those and corresponding controls.
If I understand you correctly, you want to remove all cases and controls when a case has a missing BMI value (NA). You can do this simply in base R by indexing on those conditions.
Code
df[!(df$matched_id %in% df$patid[is.na(df$BMI)]),]
# patid age gender group pracid matched_id match_eventdate BMI
# 6 222 9 M case 100 222 23-05-20 15.7
# 7 333 8 M control 100 222 23-05-20 21.8
# 8 555 8 M control 100 222 23-05-20 19.5
Data - note I am expanding your dataset a bit to include an extra control for patid == 1 and also an additional case with patient ID "5" to ensure validity.
df <-read.table(text = " patid age gender group pracid matched_id match_eventdate BMI
1 10 M case 100 1 23-05-20 NA
111 12 M control 100 1 23-05-20 20.8
111 12 M control 100 1 23-05-20 17.8
5 50 M case 500 5 23-05-20 NA
585 52 M control 500 5 23-05-20 20.8
222 9 M case 100 222 23-05-20 15.7
333 8 M control 100 222 23-05-20 21.8
555 8 M control 100 222 23-05-20 19.5", header = TRUE)
If I misunderstood and this is not the output you want, let me know and I can modify my answer. Good luck!
This is a two-step process, but it does not involve loops. I’m using the ‘dplyr’ package in the following. There are other solutions.
First, you identify which cases you want to remove. In this case, those where BMI is NA:
excluded_patients = data |>
  filter(group == 'case', is.na(BMI)) |>
  pull(patid)
And the second step is to exclude those patients from the data:
filtered_data = data |>
  filter(! patid %in% excluded_patients)
Or maybe you need the following (it isn’t clear from your question):
filtered_data = data |>
  filter(! matched_id %in% excluded_patients)
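The same two steps can also be sketched in base R. This assumes, as in the sample data, that matched_id ties every control to its case; the names bad_ids and kept are only illustrative.

```r
# Sample data from the question
df <- read.table(text = "patid age gender group pracid matched_id match_eventdate BMI
1 10 M case 100 1 23-05-20 NA
111 12 M control 100 1 23-05-20 20.8
222 9 M case 100 222 23-05-20 15.7
333 8 M control 100 222 23-05-20 21.8
555 8 M control 100 222 23-05-20 19.5", header = TRUE, stringsAsFactors = FALSE)

# Step 1: matched_id of every case row whose BMI is missing
bad_ids <- df$matched_id[df$group == "case" & is.na(df$BMI)]

# Step 2: drop those cases together with all controls matched to them
kept <- df[!df$matched_id %in% bad_ids, ]
kept
```

Because only case rows are checked in step 1, a control with a missing BMI does not remove its whole matched set.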

Store values in a cell dataframe

I am trying to store data in multiple cells of a dataframe, but my code stores the data only in the last cell (of the dd array). Please see my output below.
Can somebody please correct me? I cannot figure out what I am doing wrong.
Thanks in advance,
MyData <- read.csv(file="Pat_AR_035.csv", header=TRUE, sep=",")
dd <- unique(MyData$POLICY_NUM)
for (j in length(dd)) {
myDF <- data.frame(i=1:length(dd), m=I(vector('list', length(dd))))
myDF$m[[j]] <- data.frame(j,MyData[which(MyData$POLICY_NUM==dd[j] & MyData$ACRES), ],ncol(MyData),nrow(MyData))
}
[[60]]
NULL
[[61]]
NULL
[[62]]
NULL
[[63]]
j OBJECTID DIVISION POLICY_SYM POLICY_NUM YIELD_ID LINE_ID RH_CLU_ID ACRES PLANT_DATE ACRE_TYPE CLU_DETERM STATE COUNTY FARM_SERIA TRACT
1646 63 1646 8 MP 754033 3 20 39565604 8.56 5/3/2014 PL A 3 35 109 852
1647 63 1647 8 MP 754033 1 10 39565605 30.07 4/19/2014 PL A 3 35 109 852
1648 63 1648 8 MP 754033 1 10 39565606 56.59 4/19/2014 PL A 3 35 109 852
CLU_NUMBER FIELD_ACRE RMA_CLU_ID UPDATE_DAT Percent_Ar RHCLUID Field1 OBJECTID_1 DIVISION_1 STATE_1 COUNTY_1
1646 3 8.56 F68E591A-ECC2-470B-A012-201C3BB20D7F 9/21/2014 63.4990 39565604 1646 1646 8 3 35
1647 1 30.07 eb04cfc0-e78b-415f-b447-9595c81ef09e 9/21/2014 100.0000 39565605 1647 1647 8 3 35
1648 2 56.59 5922d604-e31c-4b9d-b846-9f38e2d18abe 9/21/2014 92.1442 39565606 1648 1648 8 3 35
POLICY_N_1 YIELD_ID_1 RH_CLU_ID_ short_dist coords_x1 coords_x2 optional SHAPE_Leng SHAPE_Area ncol.MyData. nrow.MyData.
1646 754033 3 39565604 5.110837 516747.8 -221751.4 TRUE 831.3702 34634.73 35 1757
1647 754033 1 39565605 5.606284 515932.1 -221702.0 TRUE 1469.4800 121611.46 35 1757
1648 754033 1 39565606 5.325399 516380.1 -221640.9 TRUE 1982.8757 228832.22 35 1757
for (j in length(dd))
This doesn’t iterate over dd — it iterates over a single number: the length of dd. Not much of an iteration. You probably meant to write the following or something similar:
for (j in seq_along(dd))
However, there are more issues with your code. For instance, the myDF variable is continuously overwritten inside your loop, which probably isn’t what you intended at all. Instead, you should probably create objects in an lapply statement and forego the loop.
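A minimal sketch of the lapply approach, assuming the goal is one data frame per POLICY_NUM stored in a list column. The small MyData below is a hypothetical stand-in for the CSV file, and pieces is just an illustrative name.

```r
# Hypothetical stand-in for MyData (the real file is not available here)
MyData <- data.frame(
  POLICY_NUM = c(754033, 754033, 754040, 754051),
  ACRES      = c(8.56, 30.07, 12.50, 4.20)
)
dd <- unique(MyData$POLICY_NUM)

# One data frame per policy number, built without a for loop
pieces <- lapply(dd, function(p) MyData[MyData$POLICY_NUM == p, ])

# Create the container once, with the list stored as a column
myDF <- data.frame(i = seq_along(dd), m = I(pieces))

# Every cell of m now holds the rows for one policy
myDF$m[[1]]
```

Since myDF is created only once, after all the pieces exist, nothing is overwritten along the way.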

Selecting pairs of odd even values in R

I have a large dataset as follows:
head(humic)
SUERC.No GU.Number d13.C Age.(BP) error Batch.Number AMS.USED Year Type Sampletype
400 32691 535 -28 3382.981 34.74480 1 S3 2011 2 ha
401 32701 536 -28 3375.263 34.86087 1 S3 2011 2 ha
402 32711 537 -28 3308.103 34.83100 1 S3 2011 2 ha
403 32721 538 -28 3368.721 31.58641 1 S3 2011 2 ha
404 32731 539 -28 3368.604 34.72326 1 S3 2011 2 ha
405 32741 540 -28 3314.713 32.83147 1 S3 2011 2 ha
tail(humic)
SUERC.No GU.Number d13.C Age.(BP) error Batch.Number AMS.USED Year Type Sampletype
5445 70880 3962 -28.4 3390.458 29.12815 34 S4 2016 2 ha
5446 70890 3963 -28.5 3358.861 37.14896 34 S4 2016 2 ha
5447 70900 3964 -28.5 3363.626 26.71573 34 S4 2016 2 ha
5448 70910 3965 -28.5 3408.907 26.69665 34 S4 2016 2 ha
5449 70920 3966 -28.5 3348.463 29.01492 34 S4 2016 2 ha
5450 70930 3967 -28.4 3375.247 26.78261 34 S4 2016 2 ha
I am looking to create a variable to identify pairs of odd and even values based on the variable GU.Number. These numbers identify duplicates of the same object, which have the same d13.C values.
For example,
535 - 536
537 - 538
3963 - 3964
3965-3966 are pairs.
Note, the column of GU.Number is not a sequence, some numbers are missing.
even.rows <- which(!(humic$GU.Number %% 2))
has.pair <- rep(0,nrow(humic))
for(i in even.rows){
has.pair[i] <- max((humic$GU.Number[i] + c(1,-1)) %in% humic$GU.Number)
}
# add as column of data
humic$has.pair <- has.pair
The has.pair column will be 1 if the GU.Number is even and there exists an odd GU.Number one less or one greater than the given GU.Number. Otherwise it will be 0. As a one-liner:
humic$has.pair <- sapply(1:nrow(humic),
function(x) with(humic,(!(GU.Number[x] %% 2))*max((GU.Number[x] + c(1,-1)) %in% GU.Number)))
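A vectorised alternative can be sketched under a slightly different assumption than the code above: a pair is always an odd GU.Number n together with n + 1, as in the question's examples, and GU.Number has no repeated values. The column name pair.id and the toy data are illustrative only.

```r
# Toy data with some numbers missing, as in the real GU.Number column
humic <- data.frame(GU.Number = c(535, 536, 537, 538, 540, 3963, 3964, 3966))

# An odd n and the even n + 1 map to the same key: (535, 536) -> 268
pair_key <- (humic$GU.Number + 1) %/% 2

# A row has a partner when its key occurs more than once
humic$has.pair <- pair_key %in% pair_key[duplicated(pair_key)]

# Shared identifier for the two members of a pair, NA for unpaired rows
humic$pair.id <- ifelse(humic$has.pair, pair_key, NA)
humic
```

Here both members of a pair are flagged, whereas the loop above only marks the even-numbered row.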

groups of different size randomly selected within different classes

I have such a difficult question (at least for me) that I spent 2 hours just writing it. It is completely impossible for me to program it by myself. I will try to be very clear, and I'm sorry if I am not. I'm doing this in a very rustic way in Excel, but I really need to program it.
I have a data.frame like this
id_pix id_lote clase f1 f2
45 4 Sg 2460 2401
46 4 Sg 2620 2422
47 4 Sg 2904 2627
48 5 M 2134 2044
49 5 M 2180 2104
50 5 M 2127 2069
83 11 S 2124 2062
84 11 S 2189 2336
85 11 S 2235 2162
86 11 S 2162 2153
87 11 S 2108 2124
with 17451 "id_pixel"(rows), 2080 "id_lote" and 9 "clase"
this is the "id_lote" count per "clase" (v1 is the id_lote count)
clase v1
1: S 1099
2: P 213
3: Sg 114
4: M 302
5: Alg 27
6: Az 77
7: Po 228
8: Cit 13
9: Ma 7
I need to split the "id_lote" randomly within each "clase". I mean, I have 1099 "id_lote" for the "S" "clase", which are 9339 "id_pixel" (rows), and I want to randomly select 50 % of the "id_lote", which are x "id_pixel" (rows). I want to do this for every "clase", considering that the size (number of "id_lote") of every "clase" is different. I would also like to be able to change the size of the selection (50 %, 30 %, etc.), and I want to keep the non-selected set of "id_lote". I hope someone can help me with this!
here is the reproducible example
this is the data with 2 clase (S and Az), with 6 id_lote and 13 id_pixel
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
30 2 S 3021 2985
31 2 S 3020 2596
71 9 S 4725 4404
72 9 S 4759 4943
75 11 S 2728 2225
218 21 Az 4830 3007
219 21 Az 4574 2761
220 21 Az 5441 3092
1155 126 Az 7209 2449
1156 126 Az 7035 2932
and one result could be:
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
75 11 S 2728 2225
1155 126 Az 7209 2449
1156 126 Az 7035 2932
where 50% of the id_lote were randomly selected in clase "S" (2 of 4 id_lote) but all the id_pixel in the selected id_lote were kept. The same goes for clase "Az": one id_lote was randomly selected (1 of 2 in this case) and all the id_pixel in the selected id_lote were kept.
What colemand77 proposed helped a lot. I think the dplyr package is useful for this, but I think that if I do
df %>%
group_by(clase, id_lote) %>%
sample_frac(.3, replace = FALSE)
I get 30 % of the data of each clase, but not grouped by id_lote like I need! I mean 30 % of the rows (id_pixel) were selected instead of 30 % of the id_lote.
I hope this example helps to explain what I want to do and makes it useful for everybody. I'm sorry if I wasn't clear enough the first time.
Thanks a lot!
At first glance I'd say the dplyr package is your friend here.
df %>%
group_by(clase, id_lote) %>%
sample_frac(.3, replace = FALSE)
so you first use group_by() and include the grouping levels you want to sample from, then you use sample_frac to sample the fraction of the results you want for each group.
As near as I can tell this is what you are asking for. If not, please consider re-stating your question to include either a reproducible example or clarify. Cheers.
To "keep" the not-selected members, I would add a column of unique ids and use an anti-join, anti_join() (also from the dplyr package), to find the ids that are not in common between the two data.frames (the results of the sampling and the original).
## Update ##
I'm understanding better now, I believe. Think about this as a two step process...
1) you want to select x% (50 in example) of the id_lote from each clase and return those id_lote #s (i'm assuming that a given id_lote does not exist for multiple clase?)
2) you want to see all of the id_pixels that correspond to each id_lote, all in one data.frame
I've broken this down into multiple steps for illustration, not because it is the fastest / prettiest.
raw data: (couldn't read your data into R.)
df<-data.frame(id_pix = c(1:200),
id_lote = sample(1:20,200, replace = TRUE),
clase = sample(letters[seq_along(1:10)], 200, replace = TRUE),
f1 = sample(1000:2000,200, replace = TRUE),
f2 = sample(2000:3000,200, replace = TRUE))
1) figure out which id_lote correspond to which clase - for this we use the dplyr summarise function and store it in a variable
summary<-df %>%
ungroup() %>%
group_by(clase, id_lote) %>%
summarise()
returns:
Source: local data frame [125 x 2]
Groups: clase
clase id_lote
1 a 1
2 a 2
3 a 4
4 a 5
5 a 6
6 a 7
7 a 8
8 a 9
9 a 11
10 a 12
.. ... ...
then we sample to get the 30% of the id_lote for each clase..
sampled_summary <- summary %>%
group_by(clase) %>%
sample_frac(.3,replace = FALSE)
so the result of this is a data table with two columns, (clase and id_lote) with 30% of the id_lotes shown for each clase.
2) ok so now we have the id_lotes randomly selected from each class but not the id_pix that are associated with that class. To accomplish this we do a join to get the corresponding full data set including the id_pix, etc.
result <- sampled_summary %>%
left_join(df)
The above makes a copy of the data set a bunch, so if you have a substantial data set you could just do it all at one go:
result <- df %>%
ungroup() %>%
group_by(clase, id_lote) %>%
summarise() %>%
group_by(clase) %>%
sample_frac(.5,replace = FALSE) %>%
left_join(df)
if this doesn't get you what you want, let me know and we'll take another crack at it.
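For comparison, the same two-step idea can be sketched in base R, including the not-selected set the question asks to keep. This is only a sketch: it assumes each id_lote belongs to a single clase, rounds the group size down with floor(), and uses illustrative names (frac, lotes, picked).

```r
# Reproducible example from the question
df <- read.table(text = "id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
30 2 S 3021 2985
31 2 S 3020 2596
71 9 S 4725 4404
72 9 S 4759 4943
75 11 S 2728 2225
218 21 Az 4830 3007
219 21 Az 4574 2761
220 21 Az 5441 3092
1155 126 Az 7209 2449
1156 126 Az 7035 2932", header = TRUE, stringsAsFactors = FALSE)

set.seed(1)   # only to make the random draw repeatable
frac <- 0.5   # share of id_lote to select within each clase

# Step 1: one row per (clase, id_lote), then sample ids inside each clase
lotes <- unique(df[, c("clase", "id_lote")])
picked <- unlist(lapply(split(lotes$id_lote, lotes$clase), function(ids) {
  # sample positions, not values, so a single id is handled correctly
  ids[sample.int(length(ids), floor(length(ids) * frac))]
}))

# Step 2: keep every row (id_pix) of the selected id_lote, and the rest
selected     <- df[df$id_lote %in% picked, ]
not_selected <- df[!df$id_lote %in% picked, ]
```

Changing frac to 0.3 or any other share changes the size of the selection; classes with very few id_lote can end up with none selected because of the rounding.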

R Read CSV file that has timestamp

I have a CSV file that has a timestamp column as a string
15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N
I use read.csv, but the problem is that once the import finishes the date column looks like:
1 15 1035 4530 3502 2 892 482 0 2.006011e+13 2 N
2 15 1034 7828 3501 3 263 256 0 2.007112e+13 3 N
3 15 1035 7832 4530 2 1974 1082 0 2.007112e+13 7 N
4 15 2346 8381 8155 3 2684 649 0 2.008021e+13 9 N
Is there a way to extract the date from the string as it gets read? (The csv file does have headers; they are removed here to keep the data anonymous.) If it can't be extracted as it gets read, what is the best way to do it afterwards?
Here are 2 methods:
Using the zoo package. Personally I prefer this one. I treat your data as a time series.
library(zoo)
read.zoo(text='15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N',
index=9,tz='',format='%Y%m%d%H%M%S',sep=',')
V1 V2 V3 V4 V5 V6 V7 V8 V10 V11
2006-01-08 08:16:08 15 1035 4530 3502 2 892 482 0 2 N
2007-11-24 17:55:19 15 1034 7828 3501 3 263 256 0 3 N
2007-11-24 19:38:18 15 1035 7832 4530 2 1974 1082 0 7 N
2008-02-07 13:10:02 15 2346 8381 8155 3 2684 649 0 9 N
Using colClasses argument in read.table, as mentioned in the comment :
dat <- read.table(text='15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N',
colClasses=c(rep('numeric',8),
'character','numeric','character')
,sep=',')
strptime(dat$V9,'%Y%m%d%H%M%S')
[1] "2006-01-08 08:16:08" "2007-11-24 17:55:19"
[3] "2007-11-24 19:38:18" "2008-02-07 13:10:02"
As Ricardo says, you can set the column classes with read.csv. In this case I recommend importing these as characters and once the csv is loaded, converting them to dates with strptime().
for example:
test <- '20080207131002'
strptime(x = test, format = "%Y%m%d%H%M%S")
Which will return a POSIXlt object w/ the date/time info.
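Putting the two steps together, a minimal sketch could look like this. It reads the sample lines without headers (so the columns get the default names V1 to V11), and the tz = "UTC" setting is an assumption, since the question does not say which time zone the timestamps are in.

```r
# Two of the sample lines from the question, without the header row
csv_text <- "15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N"

# Read the timestamp column (9th) as character so it is not turned into 2.006011e+13
dat <- read.csv(text = csv_text, header = FALSE,
                colClasses = c(rep("numeric", 8), "character", "numeric", "character"))

# Convert after loading; as.POSIXct stores well in a data frame column
dat$V9 <- as.POSIXct(dat$V9, format = "%Y%m%d%H%M%S", tz = "UTC")
str(dat$V9)
```

For a real file, replace text = csv_text with the file name and set header = TRUE if the file has a header row.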
You can use the lubridate package:
test <- '20080207131002'
lubridate::as_datetime(test)
You can also specify a format for each case, depending on your needs.
