R: loop or filter over a data frame and stop when a specific condition is met

UPDATE:
I have a data frame df that looks like this; it is ordered chronologically by year.
id status
601 4
601 4
601 2
601 2
601 2
601 4
601 4
601 2
601 2
601 4
601 2
990 4
990 2
990 4
The first output I want is:
Per id, the status should always start with 2, so rows where an id starts with 4 should be deleted from df.
I want to loop over the ids and stop, per id, once the status 4 occurs for the first time,
so that it looks like this at the end:
id status
601 2
601 2
601 2
601 4
601 4
990 2
990 4
And the second output I want is:
Each id should end with 4, no matter how often 4 occurs in the original dataset. Nothing should come after the last 4.
id status
601 2
601 2
601 2
601 4
601 4
601 2
601 2
601 4
990 2
990 4
I do not know how to do this. Maybe there is also a way to do it with filtering?
I would really appreciate your help.

If I understand your question correctly, you can use {dplyr} to get the first 4 rows of each id:
df %>% dplyr::group_by(id) %>% slice_head(n = 4)
How are your two questions different? Try adding some data that we can run and illustrate if the above is insufficient.
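For what it's worth, here is a possible sketch of the two outputs described in the question, using grouped filters with cumsum() instead of an explicit loop. It assumes df has the columns id and status in chronological order; the data is retyped from the question, and out1 and out2 are just illustrative names.

```r
library(dplyr)

# Data retyped from the question
df <- data.frame(
  id     = c(rep(601, 11), rep(990, 3)),
  status = c(4, 4, 2, 2, 2, 4, 4, 2, 2, 4, 2, 4, 2, 4)
)

# Output 1: drop leading 4s, then stop at the end of the first run of 4s
out1 <- df %>%
  group_by(id) %>%
  # keep rows from the first 2 onwards
  filter(cumsum(status == 2) > 0) %>%
  # drop everything from the first 2 that directly follows a 4
  filter(cumsum(status == 2 & lag(status, default = 2) == 4) == 0) %>%
  ungroup()

# Output 2: drop leading 4s and everything after the last 4
out2 <- df %>%
  group_by(id) %>%
  filter(
    cumsum(status == 2) > 0,             # at or after the first 2
    rev(cumsum(rev(status == 4))) > 0    # at or before the last 4
  ) %>%
  ungroup()
```

Both filters work per id because of group_by(). An id that never reaches status 2 is dropped entirely by this logic, which may or may not be what is wanted.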

Related

Calculate mean by decile in Svydesign object

So, I'm working with the ENIGH database, which stands for "National Survey of Household Income and Expenses" in Spanish. It is an exercise conducted by the Mexican government and, like most surveys of its kind, it works with weights.
What I'm trying to do is calculate the mean, maximum and minimum household income by decile. In other words: what is the income of each 10% of households, grouping households based on their income?
To be honest, I haven’t gone that far but this is what I got until now:
I need my svydesign object
Convert that into a table using svytable
Arrange using desc() on my income variable
ENIGH_design <-svydesign(id=~upm, strata=~est_dis, weights=~factor_hog, data = ENIGH)
ENIGH_table <- svytable(~ing_cor, ENIGH_design)
Here is where it gets tricky: supposing I have 100 rows, I can't simply take the first 10 of them, because once the weights are taken into account they might represent 9% or 20% (I'm just throwing out numbers) of the actual population.
I could use cut() on my income variable, but I would be ignoring the weights and the results would only be representative of the sample, not of the total population.
I think that the best approach would be to use a combination of:
mutate() to create a new variable
if() in conjunction with mutate() to define which decile each row falls into
group_by() and mean() to calculate what I'm aiming for
This way I would have an extra variable that I could use to calculate whatever I want with any other variable. But again, I haven't defined my groups, so it's pretty much useless.
Thank you for reading. Thank you for your help.
Database available: https://www.inegi.org.mx/programas/enigh/nc/2016/default.html#Datos_abiertos
Here is a glimpse of how my DB looks:
folioviv foliohog ubica_geo est_dis upm factor ing_cor
100587003 1 10010000 2 610 180 22,723
100587004 1 10010000 2 610 180 17,920
100587005 1 10010000 2 610 180 27,506
100587006 1 10010000 2 610 180 56,236
100605201 1 10010000 2 620 178 41,587
100605202 1 10010000 2 620 178 135,437
100605203 1 10010000 2 620 178 62,386
100605205 1 10010000 2 620 178 103,502
100605206 1 10010000 2 620 178 27,323
100606301 1 10010000 3 630 223 68,042
100606302 1 10010000 3 630 223 98,537
100606305 1 10010000 3 630 223 53,237
100606306 1 10010000 3 630 223 132,861
100609801 1 10010000 3 640 232 190,033
100609802 1 10010000 3 640 232 28,654
100609805 1 10010000 3 640 232 74,408
100631401 1 10010000 1 650 171 80,761
100711503 1 10010000 1 770 184 38,640
100711504 1 10010000 1 770 184 81,672
There are many more columns but they aren't necessary for this exercise.
Make a table (dataframe or data.table or tibble) that looks like this:
> dt
folioviv factor ing_tri
1 247 30000
2 200 15000
3 150 50000
incomes <- rep(dt$ing_tri, times = dt$factor)
deciles <- quantile(incomes, probs = seq(0.1, 1, by = 0.1), names = TRUE)
If I were you, I would try names = FALSE to make the result easier to manipulate. Otherwise it will be a named vector, and that's a bit annoying.
Oh, and in case you want to compute the mean, just do mean(incomes).
PS: The column folioviv is not actually necessary, but you may want to put it there just in case.
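To get from the expanded incomes to per-decile statistics, one possible sketch is to sort them and assign deciles by rank, which avoids cut() failing on duplicated break points. The toy values are the ones from the table above; the names sorted, decile and stats are hypothetical. This only gives point estimates and ignores the strata and PSU information of the survey design, so it says nothing about standard errors.

```r
# Toy data from the answer above (weights in `factor`, income in `ing_tri`)
dt <- data.frame(
  folioviv = 1:3,
  factor   = c(247, 200, 150),
  ing_tri  = c(30000, 15000, 50000)
)

# One entry per represented household
incomes <- rep(dt$ing_tri, times = dt$factor)

# Rank-based deciles: the lowest 10% of households get 1, the next 10% get 2, ...
sorted <- sort(incomes)
decile <- ceiling(10 * seq_along(sorted) / length(sorted))

# Mean, minimum and maximum income within each decile
stats <- data.frame(
  decile = 1:10,
  mean   = as.numeric(tapply(sorted, decile, mean)),
  min    = as.numeric(tapply(sorted, decile, min)),
  max    = as.numeric(tapply(sorted, decile, max))
)
stats
```

Note that rep() creates one element per represented household, so with the full ENIGH weights the expanded vector can get large.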

Rotate Row Cell Values to a Single Column in R

I can't seem to find anything relevant to my specific problem, so I am asking here.
I have my original dataframe here:
Sample#, Fert_A_Mean, Fert_B_Mean, Fert_C_Mean, Fert_D_Mean
1 987, 384, 672, 364
2 567, 845, 398, 243
And I'd like to be able to restructure it like this:
Sample#, Fert_Mean
1 987
1 384
1 672
1 364
2 567
2 845
2 398
2 243
I've found some similar topics on Stack Exchange, but using t() in this case doesn't seem to work, or I am doing something wrong. Hopefully one of you folks can help me out. Thanks so much. I'm using R 3.4.1 through RStudio. Any packages you recommend for your methods are fine.
You could use something like gather from the tidyr package:
library(tidyr)
df2 <- gather(df, key=Fert_Mean, value=value, -Sample)
df2
Sample Fert_Mean value
1 1 Fert_A_Mean 987
2 2 Fert_A_Mean 567
3 1 Fert_B_Mean 384
4 2 Fert_B_Mean 845
5 1 Fert_C_Mean 672
6 2 Fert_C_Mean 398
7 1 Fert_D_Mean 364
8 2 Fert_D_Mean 243
You can remove the Fert_Mean column if you don't want it, and sort by Sample to get the ordering in your example.
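In newer versions of tidyr, gather() is superseded by pivot_longer(). A minimal sketch, assuming the first column is named Sample as in the answer above (the column name Fertilizer and the object name long are made up for illustration):

```r
library(tidyr)

# Data retyped from the question
df <- data.frame(
  Sample      = c(1, 2),
  Fert_A_Mean = c(987, 567),
  Fert_B_Mean = c(384, 845),
  Fert_C_Mean = c(672, 398),
  Fert_D_Mean = c(364, 243)
)

# Stack all columns except Sample into one value column
long <- pivot_longer(
  df,
  cols      = -Sample,
  names_to  = "Fertilizer",
  values_to = "Fert_Mean"
)

# Keep only the two columns asked for in the question
long[, c("Sample", "Fert_Mean")]
```

pivot_longer() walks through the data row by row, so the result is already ordered by Sample and no extra sorting is needed.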

How to replace number with text for each row in dataframe?

I have a data frame like this:
id status
241 1
451 3
748 3
469 2
102 3
100 1
203 2
Now what I want to do is this:
1 corresponds to 'good' , 2 corresponds to 'moderate', 3 corresponds to 'bad'.
So my output should be like this:
id status
241 good
451 bad
748 bad
469 moderate
102 bad
100 good
203 moderate
How can I do this? I tried using if/else but it is getting complicated.
It sounds like you want a labelled factor. You can try:
df$status <- factor(df$status, labels=c('Good','Moderate','Bad'))
> df
id status
1 241 Good
2 451 Bad
3 748 Bad
4 469 Moderate
5 102 Bad
6 100 Good
7 203 Moderate
It depends on the type of the status column. If it is not a factor, you can do this (as pointed out by Pascal):
level<-c("Good","Moderate","Bad")
df$status<-level[df$status]
data
df<-data.frame(item=c("apple","banana","orange","papaya","mango"),grade=c(1,3,2,1,3),stringsAsFactors=FALSE)
And if it is a factor, you can do this (as pointed out by Jay):
df$status<-factor(df$status, labels=c('Good','Moderate','Bad'))
data
df<-data.frame(item=c("apple","banana","orange","papaya","mango"),grade=c(1,3,2,1,3),stringsAsFactors=TRUE)
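A small sketch of the same idea on the data from the question, passing levels explicitly. When only labels is given, factor() takes the levels from the values that actually occur, so a status code missing from the data would break the mapping. The column name status_label is made up for illustration.

```r
# Data retyped from the question
df <- data.frame(
  id     = c(241, 451, 748, 469, 102, 100, 203),
  status = c(1, 3, 3, 2, 3, 1, 2)
)

# Map codes to labels; levels fixes which code gets which label
df$status_label <- factor(
  df$status,
  levels = 1:3,
  labels = c("good", "moderate", "bad")
)
df
```

If plain character values are wanted instead of a factor, as.character(df$status_label) converts them.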

Splitting a column that contains a double in R

I am currently working on a project in R and I have a column that receives output from a kmeans model that chooses the mode cluster that a particular store belongs to. Unfortunately there is a tie so one of the instances in the column is getting assigned to two clusters. See example output below. The columns are rownumber, Store, and Cluster, respectively.
row store cluster
759 759 3
760 760 3
761 761 3
762 762 1, 3
763 763 3
764 764 1
I need to separate the 1 from the ", 3" and keep only the 1 in the column.
You could just do something like this:
my_data <- dplyr::data_frame("row" = 759:764, "store" = 759:764, "cluster" = c("3", "3", "3", "1, 3", "3", "1"))
my_data
Source: local data frame [6 x 3]
row store cluster
1 759 759 3
2 760 760 3
3 761 761 3
4 762 762 1, 3
5 763 763 3
6 764 764 1
my_data$cluster <- my_data$cluster %>% stringr::str_extract("[^,]+")
my_data
Source: local data frame [6 x 3]
row store cluster
1 759 759 3
2 760 760 3
3 761 761 3
4 762 762 1
5 763 763 3
6 764 764 1
The line of code that sets my_data$cluster tells R to extract everything from a string that is not a comma; once it hits a comma it stops. Since we use stringr::str_extract instead of stringr::str_extract_all, it only returns the first value.
If the column 'cluster' contains strings, we could do this using sub from base R. We match the comma followed by any characters up to the end of the string and replace the match with ''.
df1$cluster <- sub(',.*$', '', df1$cluster)
If the column is a list, we use sapply to extract the first element
df1$cluster <- sapply(df1$cluster, `[`,1)
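Both variants can be tried on the data from the question with the sketch below. The object names df1, cluster_list and first_of_each are only for illustration, and the conversion to integer is an extra step that assumes the cluster should end up numeric.

```r
# Case 1: cluster stored as strings, with a tie written as "1, 3"
df1 <- data.frame(
  row     = 759:764,
  store   = 759:764,
  cluster = c("3", "3", "3", "1, 3", "3", "1"),
  stringsAsFactors = FALSE
)

# Remove the first comma and everything after it, then convert to integer
df1$cluster <- as.integer(sub(",.*$", "", df1$cluster))

# Case 2: cluster stored as a list, where a tie is a vector of length 2
cluster_list  <- list(3L, 3L, 3L, c(1L, 3L), 3L, 1L)
first_of_each <- sapply(cluster_list, `[`, 1)
```

Either way the tie is resolved by simply keeping the first cluster listed, which is what the question asks for.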

creating vector from 'if' function using apply in R

I'm trying to create a new vector in R using an 'if' function to pull out only certain values for the new array. Basically, I want to segregate data by day of week for each of several cities. How do I use the apply function to get only, say, Tuesdays in a new array for each city? Thanks
It sounds as though you don't want if or apply at all. The solution is simpler:
Suppose that your data frame is data. Then subset(data, Weekday == 3) should work.
You don't want to use the R if. Instead use the subsetting function [
dat <- read.table(text=" Date Weekday Holiday Atlanta Chicago Houston Tulsa
1 1/1/2008 3 1 313 313 361 123
2 1/2/2008 4 0 735 979 986 310
3 1/3/2008 5 0 690 904 950 286
4 1/4/2008 6 0 610 734 822 281
5 1/5/2008 7 0 482 633 622 211
6 1/6/2008 1 0 349 421 402 109", header=TRUE)
dat[ dat$Weekday==3, ]
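If the Weekday column were not available, it could be derived from Date. The sketch below uses a cut-down copy of the data and assumes the coding Sunday = 1, which is inferred from the sample (1/1/2008 was a Tuesday and is coded 3). The names dates, wday and tuesdays are made up for illustration.

```r
# Cut-down copy of the data above (two cities only)
dat <- data.frame(
  Date    = c("1/1/2008", "1/2/2008", "1/3/2008",
              "1/4/2008", "1/5/2008", "1/6/2008"),
  Weekday = c(3, 4, 5, 6, 7, 1),
  Atlanta = c(313, 735, 690, 610, 482, 349),
  Chicago = c(313, 979, 904, 734, 633, 421)
)

# Parse the month/day/year strings
dates <- as.Date(dat$Date, format = "%m/%d/%Y")

# POSIXlt counts weekdays from 0 (Sunday); shift so that Sunday = 1
wday <- as.POSIXlt(dates)$wday + 1

# Rows that fall on a Tuesday, with the city columns of interest
tuesdays <- dat[wday == 3, c("Date", "Atlanta", "Chicago")]
tuesdays
```

Unlike weekdays(), the numeric wday field does not depend on the locale.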
