For an analysis of the European Social Survey (ESS) I am trying to calculate the share of respondents who have a higher education than their parents. I intend to use a for loop for the calculation, but I am not able to calculate the shares for each country and year separately. The rows in the data frame are the individual observations (about 400k), and there are columns indicating the country (cntry) and year (essround) of each respondent. My code looks like this:
for (i in 1:nrow(ESS_cleann)) {
ESS_cleann$abs_mobility[i] <- ESS_cleann[ESS_cleann[cntry]==i && ESS_cleann[essround]==i] length(ESS_cleann$educ_mobility[i] [ESS_clean$educ_mobility [i] == "U"])/ESS_cleann[ESS_cleann[cntry]==i&& ESS_cleann[essround]==i] length(ESS_cleann$educ_mobility[i])
}
I am well aware that this is wrong, but I cannot manage to tell R to calculate the share for each country and year separately. Help is much appreciated!
To give you an idea of the data structure, these are the heads of the three relevant columns:
ESS_cleann.cntry ESS_cleann.essround ESS_cleann.educ_mobility
1 AT 2 D
2 AT 2 D
3 AT 3 U
4 AT 3 U
5 AT 1 N
6 AT 3 N
I'm not quite sure I understand, but are you trying to do something like this?
library(dplyr)
set.seed(2020)
cntry <- sample(c("AT", "UK"), 100, replace = TRUE)
essround <- sample(1:3, 100, replace = TRUE)
mobility <- sample(c("D", "U", "N"), 100, replace = TRUE)
ESS <- data.frame(cntry, essround, mobility)
ESS %>%
  group_by(cntry, essround, mobility, .drop = FALSE) %>%
  summarise(counts = n()) %>%
  mutate(perc = counts / sum(counts))
#> # A tibble: 18 x 5
#> # Groups: cntry, essround [6]
#> cntry essround mobility counts perc
#> <chr> <int> <chr> <int> <dbl>
#> 1 AT 1 D 6 0.429
#> 2 AT 1 N 4 0.286
#> 3 AT 1 U 4 0.286
#> 4 AT 2 D 3 0.273
#> 5 AT 2 N 5 0.455
#> 6 AT 2 U 3 0.273
#> 7 AT 3 D 5 0.333
#> 8 AT 3 N 4 0.267
#> 9 AT 3 U 6 0.4
#> 10 UK 1 D 7 0.318
#> 11 UK 1 N 6 0.273
#> 12 UK 1 U 9 0.409
#> 13 UK 2 D 4 0.25
#> 14 UK 2 N 7 0.438
#> 15 UK 2 U 5 0.312
#> 16 UK 3 D 7 0.318
#> 17 UK 3 N 10 0.455
#> 18 UK 3 U 5 0.227
Created on 2020-05-11 by the reprex package (v0.3.0)
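If you only need the share of upward movers ("U") rather than the shares of all three categories, you can also compute it directly as the mean of a logical condition. Using the same simulated data as above (with your real data the column would be educ_mobility rather than mobility):
ESS %>%
  group_by(cntry, essround) %>%
  summarise(share_up = mean(mobility == "U"))  # TRUE counts as 1, FALSE as 0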
data.table sounds like the package you need. You do not provide any data to reproduce the issue, but something like this should work:
DT[, .(share = .SD[education.level > parent.education.level, .N] / nrow(.SD)), by = c("country", "year")]
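Since a share is just the mean of a logical condition, an equivalent and arguably more readable version (same hypothetical column names) would be:
DT[, .(share = mean(education.level > parent.education.level)), by = .(country, year)]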
If you want to do this with a for loop, I guess something like this would work:
for (yr in years) {
  for (ctry in countries) {
    subtable <- table[table$year == yr & table$country == ctry, ]
    store.in.some.variable.or.table.or.something <- nrow(subtable[subtable$education > subtable$parental.education, ]) / nrow(subtable)
  }
}
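For the "store it somewhere" part, a common base-R pattern is to collect each result in a list and bind the pieces together at the end. A minimal sketch, again with hypothetical table and column names:
results <- list()
for (yr in unique(table$year)) {
  for (ctry in unique(table$country)) {
    subtable <- table[table$year == yr & table$country == ctry, ]
    if (nrow(subtable) == 0) next  # skip country/year combinations with no rows
    results[[paste(ctry, yr)]] <- data.frame(
      country = ctry,
      year    = yr,
      share   = nrow(subtable[subtable$education > subtable$parental.education, ]) / nrow(subtable)
    )
  }
}
shares <- do.call(rbind, results)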
hope this helps.
Best regards
JA.
Related
I have an issue regarding a certain kind of mean() calculation. I use a panel data set with two identifiers, "ID" and "year" (using the plm package).
I want to calculate the groupwise mean of a variable "y", but omit the first year's entry from the calculation and then fill in the calculated mean only in the years that were used to calculate it. In other words, I want to have NA in every ID's first entry of this variable.
The panel data is unbalanced, so people come and go at different points in time. Some stay from beginning to end; for others I only have data for three years.
library(tidyverse)
library(plm)
ID <- c("a","a","a","a","a","b","b","b","b","c","c","c")
y <- c(9,2,5,3,3,9,1,2,3,9,2,5)
year<- c(2001,2002,2003,2004,2005,2001,2002,2003,2004,2002,2003,2004)
dt <- data.frame(ID,y,year)
dt <- pdata.frame(dt, index = c("ID","year"))
I first tried a filter over periods like so:
dt <- dt %>% group_by(ID) %>%
filter(year %in% first(year)+1:last(year)) %>%
mutate(mean.y = mean(y))
But that doesn't work, and I am not surprised to be honest, but I hope you can see what I want to achieve. In the final result, the first entry of y = 9 for "a-2001" should be left out so that it does not affect the mean of individual a's other y entries: (2+5+3+3)/4 = 3.25.
I hope this is understandable. I would massively appreciate any help.
Bye
We could work with an ifelse() inside mutate(). It's more code, but I think it's quite readable and easy to understand what's going on.
library(tidyverse)
library(plm)
dt %>%
  group_by(ID) %>%
  mutate(mean.y = ifelse(year == first(year),
                         NA,
                         mean(y[year != first(year)], na.rm = TRUE)))
#> # A tibble: 12 x 4
#> # Groups: ID [3]
#> ID y year mean.y
#> <fct> <dbl> <fct> <dbl>
#> 1 a 9 2001 NA
#> 2 a 2 2002 3.25
#> 3 a 5 2003 3.25
#> 4 a 3 2004 3.25
#> 5 a 3 2005 3.25
#> 6 b 9 2001 NA
#> 7 b 1 2002 2
#> 8 b 2 2003 2
#> 9 b 3 2004 2
#> 10 c 9 2002 NA
#> 11 c 2 2003 3.5
#> 12 c 5 2004 3.5
Created on 2022-01-23 by the reprex package (v0.3.0)
Here is a dplyr solution. You can calculate the mean of all values except the first one and then use the `is.na<-` replacement function to set the first element of mean.y to NA.
library(dplyr)
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L]), mean.y = `is.na<-`(mean.y, 1L))
Output
# A tibble: 12 x 4
# Groups: ID [3]
ID y year mean.y
<chr> <dbl> <dbl> <dbl>
1 a 9 2001 NA
2 a 2 2002 3.25
3 a 5 2003 3.25
4 a 3 2004 3.25
5 a 3 2005 3.25
6 b 9 2001 NA
7 b 1 2002 2
8 b 2 2003 2
9 b 3 2004 2
10 c 9 2002 NA
11 c 2 2003 3.5
12 c 5 2004 3.5
More compactly,
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L])[n():1 %/% n() + 1L])
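To see what the indexing trick does, take group a, which has n() = 5 rows: n():1 %/% n() + 1L evaluates to c(2, 1, 1, 1, 1), and indexing a length-one vector at position 2 gives NA while position 1 returns the value itself:
n <- 5
n:1 %/% n + 1L                          # 2 1 1 1 1
mean(c(2, 5, 3, 3))[c(2, 1, 1, 1, 1)]   # NA 3.25 3.25 3.25 3.25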
I have a data frame that includes information about schools. The code below produces a toy example.
library(tidyverse)

df <- tibble(grade_range = c('1-3', '2-5', '5-12'),
             school = c('AAA', 'BBB', 'CCC'),
             score = c(100, 110, 150))
The current data has one row per school, with a single character variable indicating the range of grade levels. I'd like to have a longer dataset, with one row per school-by-grade combination. The code below does the job, but it feels like a clumsy workaround, and I'm wondering if there's a more efficient way to produce the same output.
df_long <- df %>%
  mutate(low_grade = as.numeric(str_remove(str_extract(grade_range, '[[:digit:]]+-'), '-')),
         high_grade = as.numeric(str_remove(str_extract(grade_range, '-[[:digit:]]+'), '-')),
         fake_join_var = 1) %>%
  left_join(data.frame(grade_level = c(1:12), fake_join_var = rep(1, 12))) %>%
  select(-fake_join_var) %>%
  filter(grade_level >= low_grade &
           grade_level <= high_grade)
(To be clear, df_long is exactly the output I want, I'm just wondering if there's a simpler way of producing it, maybe with purrr somehow?)
Since your approach relies on the low_grade and high_grade values, you still have to extract the numeric values from the string.
However, after that, you can simply unnest() the sequence between the two.
Here is the code:
library(tidyverse)
df <- tibble(grade_range = c('1-3', '2-5', '5-12'),
             school = c('AAA', 'BBB', 'CCC'),
             score = c(100, 110, 150))

x <- df %>%
  mutate(
    low_grade = as.numeric(str_remove(str_extract(grade_range, '\\d+-'), '-')),
    high_grade = as.numeric(str_remove(str_extract(grade_range, '-\\d+'), '-')),
    grade_level = map2(low_grade, high_grade, seq)
  ) %>%
  unnest(grade_level)
x
#> # A tibble: 15 x 6
#> grade_range school score low_grade high_grade grade_level
#> <chr> <chr> <dbl> <dbl> <dbl> <int>
#> 1 1-3 AAA 100 1 3 1
#> 2 1-3 AAA 100 1 3 2
#> 3 1-3 AAA 100 1 3 3
#> 4 2-5 BBB 110 2 5 2
#> 5 2-5 BBB 110 2 5 3
#> 6 2-5 BBB 110 2 5 4
#> 7 2-5 BBB 110 2 5 5
#> 8 5-12 CCC 150 5 12 5
#> 9 5-12 CCC 150 5 12 6
#> 10 5-12 CCC 150 5 12 7
#> 11 5-12 CCC 150 5 12 8
#> 12 5-12 CCC 150 5 12 9
#> 13 5-12 CCC 150 5 12 10
#> 14 5-12 CCC 150 5 12 11
#> 15 5-12 CCC 150 5 12 12
waldo::compare(df_long, x)
#> v No differences
Created on 2021-10-01 by the reprex package (v2.0.0)
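As a side note, the two str_extract()/str_remove() calls can be replaced by a single tidyr::separate(), which splits grade_range on the hyphen and converts the pieces to numbers in one step. This should give the same rows (possibly with integer rather than double grade columns):
df %>%
  separate(grade_range, into = c("low_grade", "high_grade"),
           sep = "-", convert = TRUE, remove = FALSE) %>%
  mutate(grade_level = map2(low_grade, high_grade, seq)) %>%
  unnest(grade_level)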
I am fitting a linear model to this data:
data <- data.frame(
  Student_ID = c(1,1,1,2,2,3,3,3,3,3,4,4,4,5,6,6,7,7,7,8,8),
  Years_Attended = c(1991,1992,1995,1992,1993,1991,1992,1993,1994,1995,1993,1994,1995,1995,1993,1995,1990,1995,2000,1995,1996),
  Class = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","C","C","C","C","C"),
  marks = c(50,55,46,44,60,66,67,80,91,90,70,75,76,77,77,82,89,88,88,64,65)
)
The purpose is to create a new column that determines change in marks. I call this column marks.change and I fit the model as follows:
data2 <- data %>%
  group_by(Student_ID) %>%
  summarise(
    Good.marks = length(marks[!is.na(marks)]),
    marks.change = ifelse(Good.marks > 1,
                          summary(lm(marks ~ Years_Attended))$coefficients[2, 1], 0),
    Student_ID = unique(Student_ID),
    Class = unique(Class)
  )
This code works fine. However, instead of considering all the years at once, I would like to fit the model above (i.e., the part where I say "marks.change = …") for every interval between consecutive years and then average the results. That is, I would like to fit the model between 1991 and 1992 only, then between 1992 and 1993, then between 1993 and 1994, and so on up to the final year, and then put the average of these estimates in a new column called marks.change.part2.
Is there an easier way to automate this?
You may simplify your existing code a bit
data %>%
  group_by(Student_ID, Class) %>%
  summarise(
    Good.marks = sum(!is.na(marks)),
    marks.change = ifelse(Good.marks > 1,
                          summary(lm(marks ~ Years_Attended))$coefficients[2, 1], 0)
  )
# A tibble: 8 x 4
# Groups: Student_ID [8]
Student_ID Class Good.marks marks.change
<dbl> <chr> <int> <dbl>
1 1 A 3 -1.46
2 2 A 2 16.
3 3 A 5 7.2
4 4 B 3 3.
5 5 B 1 0
6 6 B 2 2.50
7 7 C 3 -0.1
8 8 C 2 1.00
Now to your actual question: if I am understanding you correctly, perhaps you want this. A linear model on two data points is nothing but the slope between them, which you can calculate directly with simple vector maths.
data %>%
  group_by(Student_ID, Class) %>%
  summarise(
    Good.marks = sum(!is.na(marks)),
    marks.change = ifelse(Good.marks > 1,
                          summary(lm(marks ~ Years_Attended))$coefficients[2, 1], 0),
    marks.change.part2 = ifelse(Good.marks > 1, mean(diff(marks) / diff(Years_Attended)), 0)
  )
# A tibble: 8 x 5
# Groups: Student_ID [8]
Student_ID Class Good.marks marks.change marks.change.part2
<dbl> <chr> <int> <dbl> <dbl>
1 1 A 3 -1.46 1
2 2 A 2 16. 16
3 3 A 5 7.2 6
4 4 B 3 3. 3
5 5 B 1 0 0
6 6 B 2 2.50 2.5
7 7 C 3 -0.1 -0.1
8 8 C 2 1.00 1
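As a quick sanity check of the "two points is just a slope" point, take Student 2, who has exactly two observations (44 marks in 1992 and 60 marks in 1993); the lm() coefficient and the manual slope agree:
yrs <- c(1992, 1993)
mks <- c(44, 60)
coef(lm(mks ~ yrs))[["yrs"]]  # 16
diff(mks) / diff(yrs)         # 16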
Let's say I have a data frame which looks something like this:
A <- c(1:100)
B <- c(0.5:100)
df <- data.frame(A,B)
And I want to get 25 random rows out of this data frame with
df[sample(nrow(df), size = 25, replace = FALSE),]
But now I want to repeat this sample function 100 times and save every result individually.
I've tried to use the repeat function but I can't find a way to save every result.
Thank you.
As mentioned in the comments, replicate() can achieve this, i.e.,
res <- replicate(100,df[sample(nrow(df), size = 25, replace = FALSE),],simplify = F)
An alternative is to use sapply (or lapply), i.e.,
res <- sapply(1:100, function(k) df[sample(nrow(df), size = 25, replace = FALSE),],simplify = F)
or
res <- lapply(1:100, function(k) df[sample(nrow(df), size = 25, replace = FALSE),])
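One small addition: if the 100 samples need to be reproducible, set a seed once before drawing them, e.g.:
set.seed(42)  # any fixed value works; it just pins down the random draws
res <- replicate(100, df[sample(nrow(df), size = 25, replace = FALSE), ], simplify = FALSE)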
replicate() is a great option for this problem.
If you would like your final results in a single table with a column for the ID variable, you can use bind_rows() from the dplyr package. Here is a smaller example (3 samples from a data set of 10 rows) that may allow easier understanding of replicate()'s behavior:
library(dplyr, warn.conflicts = FALSE)
# make a smaller data set of 10 rows
d <- data.frame(
  A = 1:10,
  B = LETTERS[1:10]
) %>% print
#> A B
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 4 D
#> 5 5 E
#> 6 6 F
#> 7 7 G
#> 8 8 H
#> 9 9 I
#> 10 10 J
# create 3 samples, with each sample containing 4 rows
reps <- replicate(3, d[sample(nrow(d), 4, FALSE), ], simplify = FALSE) %>% print
#> [[1]]
#> A B
#> 2 2 B
#> 5 5 E
#> 6 6 F
#> 1 1 A
#>
#> [[2]]
#> A B
#> 3 3 C
#> 2 2 B
#> 5 5 E
#> 8 8 H
#>
#> [[3]]
#> A B
#> 4 4 D
#> 9 9 I
#> 3 3 C
#> 8 8 H
# bind the list elements into a single tibble, with an ID column for the sample
bind_rows(reps, .id = "sample_id")
#> sample_id A B
#> 1 1 2 B
#> 2 1 5 E
#> 3 1 6 F
#> 4 1 1 A
#> 5 2 3 C
#> 6 2 2 B
#> 7 2 5 E
#> 8 2 8 H
#> 9 3 4 D
#> 10 3 9 I
#> 11 3 3 C
#> 12 3 8 H
Created on 2019-12-02 by the reprex package (v0.3.0)
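As an aside, purrr::map_dfr() collapses the replicate-and-bind steps into a single call; with the same toy data d and sample size as above, the .id argument is passed straight through to bind_rows():
library(purrr)
map_dfr(1:3, ~ d[sample(nrow(d), 4, FALSE), ], .id = "sample_id")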
I'm new to R and trying to run some statistical tests.
My data looks like this:
Name Freqeunce Target Total
Steve 1 A 11
Marcel 1 A 11
Marie 1 A 11
John 2 A 11
Max 2 A 11
Alice 4 A 11
Mariane 1 B 1
Rose 1 C 3
Carla 1 C 3
Happy 1 C 3
I want to run a chi-squared test of homogeneity for each target type (A, B and C).
I want to know whether it is possible in R to run a loop that writes the p-value for each name into a column, or do I have to extract the information first and then run the chi-squared test?
The objective is to identify which names are under-represented in their group according to the frequencies. There are more than 2000 groups, which is why I want a loop.
Thank you for your answer
Baptiste
I think this will answer your question. I don't know if this is the type of chi-squared test you want, but you can always change the function. I use group_by() and mutate() from the dplyr package and write a function that performs the chi-squared test and extracts the p-value.
library(dplyr)

df <- read.table("test2.txt", header = TRUE)

c2_all <- function(x, y) {
  mat <- matrix(c(x, y), nrow = 2)
  c2 <- chisq.test(mat)
  return(c2$p.value)
}

result <- df %>% group_by(Target) %>% mutate(pvalue = c2_all(Name, Freqeunce))
result
# A tibble: 11 x 5
# Groups: Target [3]
Name Freqeunce Target Total pvalue
<fct> <int> <fct> <int> <dbl>
1 Steve 1 A 11 0.285
2 Marcel 1 A 11 0.285
3 Marie 1 A 11 0.285
4 John 2 A 11 0.285
5 Max 2 A 11 0.285
6 Alice 4 A 11 0.285
7 Sarah 2 B 3 1.00
8 Mariane 1 B 3 1.00
9 Rose 1 C 5 0.223
10 Carla 3 C 5 0.223
11 Happy 1 C 5 0.223