Creating a vector from an 'if' function using apply in R

I'm trying to create a new vector in R using an 'if' function to pull out only certain values for the new array. Basically, I want to segregate the data by day of week for each of several cities. How do I use the apply function to get only, say, Tuesdays in a new array for each city? Thanks

It sounds as though you don't want if or apply at all. The solution is simpler:
Suppose that your data frame is data. Then subset(data, Weekday == 3) should work.

You don't want to use R's if here. Instead, use the subsetting function [:
dat <- read.table(text=" Date Weekday Holiday Atlanta Chicago Houston Tulsa
1 1/1/2008 3 1 313 313 361 123
2 1/2/2008 4 0 735 979 986 310
3 1/3/2008 5 0 690 904 950 286
4 1/4/2008 6 0 610 734 822 281
5 1/5/2008 7 0 482 633 622 211
6 1/6/2008 1 0 349 421 402 109", header=TRUE)
dat[ dat$Weekday==3, ]
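For completeness, the subset() call from the first answer gives the same rows on this example data:
subset(dat, Weekday == 3)   # same result as dat[dat$Weekday == 3, ]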

Remove row with specific value

I have the following data:
library(data.table)
sales <- data.table(Customer = c(192,964,929,345,898,477,705,804,188,231,780,611,420,816,171,212,504,526,471,979,524,410,557,152,417,359,435,820,305,268,763,194,757,475,351,933,805,687,813,880,798,327,602,710,785,840,446,891,165,662),
Producttype = c(1,2,3,2,3,3,2,1,3,3,1,1,2,2,1,3,1,3,3,1,1,1,1,3,3,3,3,2,1,1,3,3,3,3,1,1,3,3,3,2,3,2,3,3,3,2,1,2,3,1),
Price = c(469,721,856,956,554,188,429,502,507,669,427,582,574,992,418,835,652,983,149,917,370,617,876,337,663,252,599,949,915,556,313,842,892,724,415,307,900,114,439,456,541,261,881,757,199,308,958,374,409,738),
Quarter = c(2,3,3,4,4,1,4,4,3,3,1,1,1,1,1,1,4,1,2,1,3,1,2,3,3,4,4,1,1,4,1,1,3,2,1,3,3,2,2,2,1,4,3,3,1,1,1,3,1,1))
How can I remove (let's say) the row in which Customer = 891?
And then I have another question:
If I want to manipulate the data I use data[row, column]. But when I want to use only the rows in which Quarter equals (for example) 4, I use data[Quarter == 4, ]. Why is it not data[, Quarter == 4], since Quarter is a column and not a row?
I did not find an appropriate answer on the internet that really explains the why.
Thank you.
You have used the data.table function to import your data, so you could write:
sales[Customer != 891,]
data[Quarter == 4, ] returns all columns for the rows where Quarter equals 4. The comma (,) is necessary to make this a row selection, not a selection of a column called Quarter.
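To make the difference concrete, here is a small hedged illustration on the sales table above (data.table syntax; the comments describe the expected shapes):
sales[Quarter == 4, ]   # a data.table: all columns, only the rows where Quarter is 4
sales[, Quarter == 4]   # not a filtered table: the expression after the comma is evaluated as j and returns a logical vector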
When you use indexing, i.e. data[row, column], you are telling R to look for either a specific row or column index.
Rows: sales[sales$Customer %in% c(192,964),] translates to "search the column Customer in the data frame (or table) for any rows whose values contain 192 or 964 and isolate them". Note that data.table allows sales[Customer %in% c(192, 964),], but data frames can't (use sales[sales$Customer %in% c(192,964),]).
Customer Producttype Price Quarter
1: 192 1 469 2
2: 964 2 721 3
Columns: sales[, "Customer"] translates to "search the data frame (or table) for the column named Customer and isolate all of its rows".
Customer
1: 192
2: 964
3: 929
4: 345
5: 898
...
Note this returns a data table with one column. If you use sales[,Customer] (data table) or sales$Customer (data frame), it will return a vector:
# [1] 192 964 929 345 898 477 705 804 188 231 780 611 420 816 171 212 504 526 471 979 524
# [22] 410 557 152 417 359 435 820 305 268 763 194 757 475 351 933 805 687 813 880 798 327
# [43] 602 710 785 840 446 891 165 662
You can of course combine the two: sales[sales$Quarter %in% 1:2, c("Customer", "Producttype")] isolates all values of Customer and Producttype that fall in quarters 1 and 2:
Customer Producttype
1: 192 1
2: 477 3
3: 780 1
4: 611 1
5: 420 2
...

Calculate mean by decile in Svydesign object

So, I'm working with the ENIGH database, which stands for "National Survey of Household Income and Expenses" in Spanish. This is an exercise conducted by the Mexican government and, like most surveys of its kind, it works with weights.
What I'm trying to do is calculate the mean, maximum and minimum household income by decile. In other words: what is the income of each 10%, grouping households based on their income?
To be honest, I haven't gotten that far, but this is what I have so far:
I need my svydesign object
Convert that into a table using svytable
Arrange using desc() on my income variable
ENIGH_design <-svydesign(id=~upm, strata=~est_dis, weights=~factor_hog, data = ENIGH)
ENIGH_table <- svytable(ing_cor, ENIGH_design)
Here is where it gets tricky. Supposing I have 100 rows, I can't take the first 10 of them because in reality, once the weights are taken into account, they might be 9% or 20% (I'm just throwing numbers) of the actual population.
I could use cut() on my income variable, but I would be forgetting about the weights and the results would only be representative of the sample, not the total population.
I think that the best approach would be to use a combination of:
mutate() to create a new variable,
if() in conjunction with mutate() to define which decile each row falls into,
group_by() and mean() to calculate what I'm aiming for.
This way I will have an extra variable that I can use to calculate whatever I want with whatever other variable I wish. But again, I haven't defined my groups, so it's pretty much useless.
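A hedged sketch of that mutate/group_by outline, using the weight column factor_hog (as in the svydesign call above) to define the deciles on the weighted population rather than on the raw rows; dplyr is assumed, and this gives point summaries only, not survey-corrected variances:
library(dplyr)
ENIGH %>%
  arrange(ing_cor) %>%                                        # sort households by income
  mutate(cum_share = cumsum(factor_hog) / sum(factor_hog),    # weighted cumulative share of households
         decile    = pmin(10, ceiling(cum_share * 10))) %>%   # assign decile 1..10
  group_by(decile) %>%
  summarise(mean_income = weighted.mean(ing_cor, factor_hog),
            min_income  = min(ing_cor),
            max_income  = max(ing_cor))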
Thank you for reading. Thank you for your help.
Database available: https://www.inegi.org.mx/programas/enigh/nc/2016/default.html#Datos_abiertos
Here is a glimpse of how my DB looks:
folioviv foliohog ubica_geo est_dis upm factor ing_cor
100587003 1 10010000 2 610 180 22,723
100587004 1 10010000 2 610 180 17,920
100587005 1 10010000 2 610 180 27,506
100587006 1 10010000 2 610 180 56,236
100605201 1 10010000 2 620 178 41,587
100605202 1 10010000 2 620 178 135,437
100605203 1 10010000 2 620 178 62,386
100605205 1 10010000 2 620 178 103,502
100605206 1 10010000 2 620 178 27,323
100606301 1 10010000 3 630 223 68,042
100606302 1 10010000 3 630 223 98,537
100606305 1 10010000 3 630 223 53,237
100606306 1 10010000 3 630 223 132,861
100609801 1 10010000 3 640 232 190,033
100609802 1 10010000 3 640 232 28,654
100609805 1 10010000 3 640 232 74,408
100631401 1 10010000 1 650 171 80,761
100711503 1 10010000 1 770 184 38,640
100711504 1 10010000 1 770 184 81,672
There are many more columns, but they aren't necessary for this exercise.
Make a table (dataframe or data.table or tibble) that looks like this:
> dt
folioviv factor ing_tri
1 247 30000
2 200 15000
3 150 50000
incomes <- rep(dt$ing_tri, times = dt$factor)
deciles <- quantile(incomes, probs = seq(0.1, 1, by = 0.1), names = TRUE)
If I were you, I would try names = FALSE to make the result easier to manipulate. Otherwise, it will be a named vector and that's a bit annoying.
Oh, and in case you want to compute the mean, just do mean(incomes).
PS: The column folioviv is not actually necessary, but you may want to put it there just in case.
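If the goal is specifically the mean, minimum and maximum per decile, one hedged way to continue from the incomes vector above is to label each weight-expanded income with its decile via cut() and then summarise (names other than incomes are illustrative):
breaks <- c(-Inf, quantile(incomes, probs = seq(0.1, 1, by = 0.1), names = FALSE))
decile <- cut(incomes, breaks = breaks, labels = 1:10)
tapply(incomes, decile, mean)   # mean income per decile
tapply(incomes, decile, min)    # minimum income per decile
tapply(incomes, decile, max)    # maximum income per decile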

adding and subtracting values in multiple data frames of different lengths - flow analysis

Thank you jakub and Hack-R!
Yes, these are my actual data. The data I am starting from are the following:
[A] #first, longer dataset
CODE_t2 VALUE_t2
111 3641
112 1691
121 1271
122 185
123 522
124 0
131 0
132 0
133 0
141 626
142 170
211 0
212 0
213 0
221 0
222 0
223 0
231 95
241 0
242 0
243 0
244 0
311 129
312 1214
313 0
321 0
322 0
323 565
324 0
331 0
332 0
333 0
334 0
335 0
411 0
412 0
421 0
422 0
423 0
511 6
512 0
521 0
522 0
523 87
In the above table, we can see the 44 land use CODES (which I inappropriately named "class" in my first entry) for a certain city. Some values are just 0, meaning that there are no land uses of that type in that city.
Starting from this table, which displays all the land use types for t2 and their corresponding values ("VALUE_t2"), I have to reconstruct the previous amount of land use ("VALUE_t1") for each type.
To do so, I have to add and subtract the value per each land use (if not 0) by using the "change land use table" from t2 to t1, which is the following:
[B] #second, shorter dataset
CODE_t2 CODE_t1 VALUE_CHANGE1
121 112 2
121 133 12
121 323 0
121 511 3
121 523 2
123 523 4
133 123 3
133 523 4
141 231 12
141 511 37
So, in order to get VALUE_t1 from VALUE_t2, I have, for instance, to subtract 2 + 12 + 0 + 3 + 2 hectares (the first 5 values of the second, shorter table) from the value of land use type/code 121 in the first, longer table (1271 ha), and to add 2 hectares to land type 112, 12 hectares to land type 133, 3 hectares to land type 511 and 2 hectares to land type 523. I have to do that for all the land use types different from 0, and later also from t1 to t0.
What I have to do is a sort of loop that would both add and subtract, for each land use type/code, the values from VALUE_t2 to VALUE_t1, and then from VALUE_t1 to VALUE_t0.
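In base R, that add-and-subtract step might look roughly like the following hedged sketch (column names as in tables [A] and [B] above; codes with no entry in [B] are treated as unchanged):
out_change <- aggregate(VALUE_CHANGE1 ~ CODE_t2, data = B, FUN = sum)   # area leaving each t2 code
in_change  <- aggregate(VALUE_CHANGE1 ~ CODE_t1, data = B, FUN = sum)   # area entering each t1 code
minus <- out_change$VALUE_CHANGE1[match(A$CODE_t2, out_change$CODE_t2)]
plus  <- in_change$VALUE_CHANGE1[match(A$CODE_t2, in_change$CODE_t1)]
minus[is.na(minus)] <- 0   # codes that lose nothing
plus[is.na(plus)]   <- 0   # codes that gain nothing
A$VALUE_t1 <- A$VALUE_t2 + plus - minus   # keeps all 44 rows of A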
Once I estimated VALUE_t1 and VALUE_t0, I will put the values in a simple table showing the relative variation (here the values are not real):
CODE VALUE_t0 VALUE_t2 % VAR t2-t0
code1 50 100 ((100-50)/50)*100
code2 70 80 ((80-70)/70)*100
code3 45 34 ((34-45)/45)*100
What I could do so far is:
land_code <- names(A)[-1]
land_code
A$VALUE_t1 <- for(code in land_code){
cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
}
If I use the loop I get an error, while if I take it away:
A$VALUE_t1 <- cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
it works, but I don't really get what I want... So far I have been working on how to get a new column that would contain the new "add & subtract" values, but I haven't succeeded yet. So I worked on how to get a new column that would at least match the land use types first, so that I can then include the "add and subtract" formula.
Another problem is that, by using "match", I get a shorter A$VALUE_t1 table (13 rows instead of 44), while I would like to keep all the land use types in dataset A, because I will then have to match it with the table including VALUE_t0 (which I haven't shown here).
Sorry that I cannot do better than this at the moment... I hope to have explained better what I have to do. I am extremely grateful for any help you can provide.
Thanks a lot

lmList - loss of group information

I am using lmList to do linear models on many subsets of a data frame:
res <- lmList(Rds.on.fwd~Length | Wafer, data=sub, na.action=na.omit, pool=F)
This works fine, and I get the desired output (full output not shown):
(Intercept) Length
2492 5816.726 1571.260
2493 2520.311 1361.317
2494 3058.408 1286.516
2502 4727.328 1344.728
2564 3790.942 1576.223
2567 2350.296 1290.396
I have subsetted by "Wafer" (first column above). However, within my data frame ("sub"), the data is grouped by another factor "ERF" (there are many other factors but I am only concerned with "ERF"):
head(sub):
ERF Wafer Device Row Col Width Length Date Von.fwd Vth.fwd STS.fwd On.Off.fwd Ion.fwd Ioff.fwd Rds.on.fwd
1 474 2492 11.06E 11 6 100 5 09/10/2014 12:05 0.596747 3.05655 0.295971 7874420 0.000104 1.32e-11 9626.54
3 474 2492 11.08E 11 8 100 5 09/10/2014 12:05 0.581131 3.08380 0.299050 7890780 0.000109 1.38e-11 9193.62
5 474 2492 11.09E 11 9 100 5 09/10/2014 12:05 0.578171 3.06713 0.298509 8299740 0.000107 1.29e-11 9337.86
7 474 2492 11.10E 11 10 100 5 09/10/2014 12:05 0.565504 2.95532 0.298349 8138320 0.000109 1.34e-11 9173.15
9 474 2492 11.11E 11 11 100 5 09/10/2014 12:05 0.581289 2.97091 0.297885 8463620 0.000109 1.29e-11 9178.50
11 474 2492 11.12E 11 12 100 5 09/10/2014 12:05 0.578003 3.05802 0.294260 9326360 0.000112 1.20e-11 8955.51
I do not want ERF included in my lm, but I do want to keep the factor "ERF" with the lm results for colouring graphs later, i.e. I want this:
ERF Wafer (Intercept) Length
474 2492 5816.726 1571.260
474 2493 2520.311 1361.317
474 2494 3058.408 1286.516
475 2502 4727.328 1344.728
475 2564 3790.942 1576.223
476 2567 2350.296 1290.396
I know I could do this manually later by just adding a column to the results with a vector containing the correct sequence of ERF. However, I regularly add data to the set and don't want to do this every time. I'm sure there is a more elegant way?
Thanks
Edit - data added for solution:
res <- ddply(sub, c("ERF", "Wafer"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))
head(res)
ERF Wafer (Intercept) Length
1 474 2492 5816.726 1571.260
2 474 2493 2520.311 1361.317
3 474 2494 3058.408 1286.516
4 474 2502 4727.328 1344.728
5 479 2564 3790.942 1576.223
6 479 2567 2350.296 1290.396
If I drop ERF:
res <- ddply(sub, c("Wafer"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))
head(res)
Wafer (Intercept) Length
1 2492 5816.726 1571.260
2 2493 2520.311 1361.317
3 2494 3058.408 1286.516
4 2502 4727.328 1344.728
5 2564 3790.942 1576.223
6 2567 2350.296 1290.396
Does this make sense? Did I ask the question incorrectly?
Ah, with a bit more research I've answered my own question, based on this answer:
Regression on subset of data set
Must look harder next time. I used ddply instead of lmList (which makes me wonder why anyone uses lmList... maybe I should ask another question?):
res1 <- ddply(sub, c("ERF", "Wafer"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))
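For the record, a hedged alternative that keeps lmList would be to take its coefficient table and merge in the ERF that belongs to each Wafer (lmList is assumed here to come from nlme, though lme4 provides a similar one; the as.data.frame()/as.character() steps are only there to make the pieces easy to join):
library(nlme)
fit <- lmList(Rds.on.fwd ~ Length | Wafer, data = sub, na.action = na.omit, pool = FALSE)
cf <- as.data.frame(coef(fit))               # one row per Wafer; the Wafer IDs are the rownames
cf$Wafer <- rownames(cf)
lookup <- unique(sub[, c("ERF", "Wafer")])   # one ERF per Wafer
lookup$Wafer <- as.character(lookup$Wafer)   # match the character rownames
merge(lookup, cf, by = "Wafer")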

Basic for loop not working

I am trying to get my head around for loops in R and I have what seems to me a very basic example which isn't working.
I have data in a table:
Author ev.ctrl n.ctrl ev.trt n.trt year
1 Cammu 8 56 7 54 1994
2 Eckert 49 137 46 137 2001
3 Kuusela 1 15 1 18 1998
4 Ohlisson 205 625 183 612 2001
5 Rush 259 392 235 393 1996
6 Woodward 7 20 6 40 2004
I want to calculate the sum of the column n.trt. I know I could do sum(epidural$n.trt), but I want to try to use a for loop.
I have:
for (i in 1:6){
sum(epidural$n.trt[i])
}
This is not giving me anything, not a number nor an error. Any idea what the problem is?
Thanks
Do this instead... we don't need no steenking loops:
> treats <- sum(epidural['n.trt']); treats
[1] 1254
You need to declare a sum variable outside of the for loop and add values to it. There is no need to call the sum function, since you have only one value at a time, not a vector.
s <- 0
for (i in 1:6){
s <- s + epidural$n.trt[i]
}
s
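The original loop appears to do nothing because expressions inside a for loop are not auto-printed, and each sum(...) result was discarded rather than stored. A slightly more general form of the accumulating loop above, assuming the data frame is called epidural as in the question:
s <- 0
for (i in seq_along(epidural$n.trt)) {   # works for any number of rows, not just 6
s <- s + epidural$n.trt[i]
}
s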
