Transpose and calculate Pearson correlation in R

I am really new to coding, and I need to run a number of statistics on a dataset, for example the Pearson correlation, but I am having some trouble manipulating the data.
From what I understood, I need to transpose my data in order to calculate the Pearson correlation, but here's where I'm having some problems. For starters, the column names turn into a new row instead of becoming the new column names. Then I get a message that my values are not numeric.
I also have some NAs, and I am trying to calculate the correlation with this code:
cor(cr, use = "complete.obs", method = "pearson")
Error in cor(cr, use = "complete.obs", method = "pearson") :
'x' must be numeric
I need to know the correlation between Victoria and Nuria, which should yield 0.3651484.
Here is the dput of my dataset:
> dput(cr)
structure(list(User = structure(c(8L, 10L, 2L, 17L, 11L, 1L,
18L, 9L, 7L, 5L, 3L, 14L, 13L, 4L, 20L, 6L, 16L, 12L, 15L, 19L
), .Label = c("Ana", "Anton", "Bernard", "Carles", "Chris", "Ivan",
"Jim", "John", "Marc", "Maria", "Martina", "Nadia", "Nerea",
"Nuria", "Oriol", "Rachel", "Roger", "Sergi", "Valery", "Victoria"
), class = "factor"), Star.Wars.IV...A.New.Hope = c(1L, 5L, NA,
NA, 4L, 2L, NA, 4L, 5L, 4L, 2L, 3L, 2L, 3L, 4L, NA, NA, 4L, 5L,
1L), Star.Wars.VI...Return.of.the.Jedi = c(5L, 3L, NA, 3L, 3L,
4L, NA, NA, 1L, 2L, 1L, 5L, 3L, NA, 4L, NA, NA, 5L, 1L, 2L),
Forrest.Gump = c(2L, NA, NA, NA, 4L, 4L, 3L, NA, NA, NA,
5L, 2L, NA, 3L, NA, 1L, NA, 1L, NA, 2L), The.Shawshank.Redemption = c(NA,
2L, 5L, NA, 1L, 4L, 1L, NA, 4L, 5L, NA, NA, 5L, NA, NA, NA,
NA, 5L, NA, 4L), The.Silence.of.the.Lambs = c(4L, 4L, 2L,
NA, 4L, NA, 1L, 3L, 2L, 3L, NA, 2L, 4L, 2L, 5L, 3L, 4L, 1L,
NA, 5L), Gladiator = c(4L, 2L, NA, 1L, 1L, NA, 4L, 2L, 4L,
NA, 5L, NA, NA, NA, 5L, 2L, NA, 1L, 4L, NA), Toy.Story = c(2L,
1L, 4L, 2L, NA, 3L, NA, 2L, 4L, 4L, 5L, 2L, 4L, 3L, 2L, NA,
2L, 4L, 2L, 2L), Saving.Private.Ryan = c(2L, NA, NA, 3L,
4L, 1L, 5L, NA, 4L, 3L, NA, NA, 5L, NA, NA, 2L, NA, NA, 1L,
3L), Pulp.Fiction = c(NA, NA, NA, 4L, NA, 4L, 2L, 3L, NA,
4L, NA, 1L, NA, NA, 3L, NA, 2L, 5L, 3L, 2L), Stand.by.Me = c(3L,
4L, 1L, NA, 1L, 4L, NA, NA, 1L, NA, NA, NA, NA, 4L, 5L, 1L,
NA, NA, 3L, 2L), Shakespeare.in.Love = c(2L, 3L, NA, NA,
5L, 5L, 1L, NA, 2L, NA, NA, 3L, NA, NA, NA, 5L, 2L, NA, 3L,
1L), Total.Recall = c(NA, 2L, 1L, 4L, 1L, 2L, NA, 2L, 3L,
NA, 3L, NA, 2L, 1L, 1L, NA, NA, NA, 1L, NA), Independence.Day = c(5L,
2L, 4L, 1L, NA, 4L, NA, 3L, 1L, 2L, 2L, 3L, 4L, 2L, 3L, NA,
NA, NA, NA, NA), Blade.Runner = c(2L, NA, 4L, 3L, 4L, NA,
3L, 2L, NA, NA, NA, NA, NA, 2L, NA, NA, NA, 4L, NA, 5L),
Groundhog.Day = c(NA, 2L, 1L, 5L, NA, 1L, NA, 4L, 5L, NA,
NA, 2L, 3L, 3L, 2L, 5L, NA, NA, NA, 5L), The.Matrix = c(4L,
NA, 1L, NA, 3L, NA, 1L, NA, NA, 2L, 1L, 5L, NA, 5L, NA, 2L,
4L, NA, 2L, 4L), Schindler.s.List = c(2L, 5L, 2L, 5L, 5L,
NA, NA, 1L, NA, 5L, NA, NA, NA, 1L, 3L, 2L, NA, 2L, NA, 3L
), The.Sixth.Sense = c(5L, 1L, 3L, 1L, 5L, 3L, NA, 3L, NA,
1L, 2L, NA, NA, NA, NA, 4L, NA, 1L, NA, 5L), Raiders.of.the.Lost.Ark = c(NA,
3L, 1L, 1L, NA, NA, 5L, 5L, NA, NA, 1L, NA, 5L, NA, 3L, 3L,
NA, 2L, NA, 3L), Babe = c(NA, NA, 3L, 2L, NA, 2L, 2L, NA,
5L, NA, 4L, 2L, NA, NA, 1L, 4L, NA, 5L, NA, NA)), .Names = c("User",
"Star.Wars.IV...A.New.Hope", "Star.Wars.VI...Return.of.the.Jedi",
"Forrest.Gump", "The.Shawshank.Redemption", "The.Silence.of.the.Lambs",
"Gladiator", "Toy.Story", "Saving.Private.Ryan", "Pulp.Fiction",
"Stand.by.Me", "Shakespeare.in.Love", "Total.Recall", "Independence.Day",
"Blade.Runner", "Groundhog.Day", "The.Matrix", "Schindler.s.List",
"The.Sixth.Sense", "Raiders.of.the.Lost.Ark", "Babe"), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Can someone help me?

This code should give you the correlation matrix between all users.
cr2 <- t(cr[, 2:21])                    # Transpose (first column contains names)
colnames(cr2) <- cr$User                # Assign names to columns (cr$User, not cr[, 1]: cr is a tibble, so cr[, 1] is not a plain vector)
cor(cr2, use = "complete.obs")          # Gives an error because there are no complete obs
# Error in cor(cr2, use = "complete.obs") : no complete element pairs
cor(cr2, use = "pairwise.complete.obs") # Use pairwise deletion instead
The correlation between Victoria and Nuria is 0.36514837 (using pairwise deletion).
Edit: To get just the correlation between Victoria and Nuria with listwise deletion, run the above and then
cr2 <- as.data.frame(cr2)
with(cr2, cor(Victoria, Nuria, use = "complete.obs", method = "pearson"))
[1] 0.3651484

As a summary, in addition to @Niek's answer: first transpose the data frame with t(), excluding the first column (which contains the names, is not numeric, and thus cannot be used in correlation calculations), and assign these names to the new columns in the same step. Then calculate the specific correlations. The solution in one piece would be:
cr2 <- setNames(as.data.frame(t(cr[, -1])), cr$User)
with(cr2, cor(Victoria, Nuria, use = "complete.obs"))
[1] 0.3651484
Or for the whole correlation matrix:
cor(cr2, use = "pairwise.complete.obs")
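A note on the `use` argument, since it trips people up: "complete.obs" (listwise deletion) first drops every row containing any NA, while "pairwise.complete.obs" computes each pairwise correlation from the rows that are complete for just that pair. A minimal self-contained toy example (made-up numbers, not the movie data) showing why the first can fail while the second still works:

```r
# Toy matrix with NAs. Listwise deletion keeps only rows 1-2 (the only rows
# complete in all three columns); pairwise deletion uses, for each pair of
# columns, every row that is complete for just that pair.
m <- cbind(a = c(1, 2, 3, 4, NA),
           b = c(2, 4, 6, NA, 10),
           c = c(5, 3, NA, 1, 0))

cor(m, use = "complete.obs")           # based on rows 1-2 only
cor(m, use = "pairwise.complete.obs")  # e.g. the a-b entry uses rows 1-3
```

With only two fully complete rows, the listwise matrix degenerates to ±1 entries, which is why pairwise deletion (or restricting listwise deletion to just the two columns of interest, as in the `with()` call above) is usually the better choice for sparse rating data.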

Related

How to get right proportions based on a subset of the data set in R

I want to calculate the proportion of plan with respect to school type in the dataset below. Thing is, I have to first subset the dataset a couple of times. In this case, for example, I need to subset the dataset so that I only get schools which offer from level 0 up to level 04. How can I do that?
Edit:
The approach below works, but I'm not filtering for schools that have all levels from level 0 up to level 04; I'm getting the schools that have any of them. Ideas on that would be much appreciated.
data description:
# SCHOOL = name of the school
# Q9 = type of school
# Q11 = levels that the schools offer
# Q40 = types of planning that the school offers related to each level.
Note: All plannings are related to a school level. Hence, if I don't filter by the levels that each school offers and just calculate this column's proportions, my results will be misleading, because each school can offer different level options (all of them, 1 or 2 of them, etc.).
my attempt:
### D) filter schools which only offer UP TO Level 04 ###
##################################### Quest40_2 ##########################
### check vector's names:
unique(quest40_2$Q11)
unwanted_4_2 <- quest40_2 %>%
  filter(Q11 %in% c('level05', 'level06'))  ### FILTER UNWANTED SCHOOL LEVELS
### create a vector with UNWANTED SCHOOL NAMES:
filter_vec_4_2 <- unique(unwanted_4_2$SCHOOL)  ### get unique names (each school has 1 name)
### assign the original dataframe to a dummy data frame:
filtered_df_4_2 <- quest40_2
### loop over unwanted schools' names to remove them:
for (i in 1:length(filter_vec_4_2)) {
  filtered_df_4_2 <- filtered_df_4_2[!filtered_df_4_2$SCHOOL == filter_vec_4_2[i], ]
}
I need to count how many times EACH 'plan' occurs for each type of 'school' (and get its proportion). The problem: each school can have more than one 'plan' type, so if we want the proportion of plan types per school type, we cannot simply divide by n.
b <- filtered_df_4_2 %>%
  drop_na(Q40) %>%
  count(Q40, Q9) %>%
  group_by(Q9)
Which leads me to:
### counting the unique schools by each type of 'curriculo':
b2 <- filtered_df_4_2 %>%
  select(SCHOOL, Q9) %>%
  unique() %>%
  count(Q9)
## join dataframes and get the proportion of schools that have each type of 'planejamento'
## within the curriculum types
c <- b %>%
  full_join(b2, by = 'Q9') %>%
  mutate(prop = round((n.x / n.y * 100), digits = 2)) %>%
  select(-n.x, -n.y)
Q1 = I don't think I'm filtering correctly, since I'm not exclusively filtering the schools that offer all levels from level 0 up to level 04. I guess I'm doing an 'or', not an 'and'.
Q2 = Is there a way to avoid the loop with the tidyverse? Thanks in advance.
Ultimately, I'm trying to get the stats to answer: 'do the schools plan their agenda for the levels they offer or not, and how does the type of school impact this?' (later this can be modeled, but for now I just need percentages).
data:
dput(quest40_2)
structure(list(SCHOOL = structure(c("School1", "School1", "School1",
"School1", "School1", "School1", "School1", "School2", "School2",
"School2", "School2", "School2", "School2", "School2", "School3",
"School3", "School3", "School3", "School3", "School3", "School3",
"School3", "School3", "School4", "School4", "School4", "School4",
"School4", "School4", "School5", "School5", "School5", "School5",
"School5", "School5", "School6", "School6", "School6", "School6",
"School6", "School7", "School7", "School7", "School7", "School7",
"School7", "School8", "School8", "School8", "School8", "School9",
"School9", "School9", "School9", "School9", "School10", "School10",
"School10", "School10", "School10", "School11", "School11", "School11",
"School11", "School11", "School11", "School11", "School11", "School11",
"School12", "School12", "School12", "School12", "School12", "School12",
"School12", "School13", "School13", "School13", "School13", "School13",
"School13", "School13", "School13", "School13", "School13", "School14",
"School14", "School14", "School15", "School15", "School15", "School15",
"School15", "School15", "School16", "School16", "School16", "School16",
"School16", "School16", "School16", "School17", "School17", "School17",
"School17", "School17", "School18", "School18", "School18", "School18",
"School18", "School19", "School19", "School19", "School19", "School19",
"School19", "School20", "School20", "School20", "School21", "School21",
"School21", "School21", "School21", "School22", "School22", "School22",
"School22", "School22", "School23", "School23", "School23", "School23",
"School23", "School23", "School24", "School24", "School24", "School24",
"School24", "School24", "School25", "School25", "School25", "School25",
"School25", "School26", "School26", "School26", "School26", "School26",
"School26", "School26", "School26", "School26", "School27", "School27",
"School27", "School27", "School27", "School27", "School27", "School28",
"School28", "School28", "School28", "School28", "School28", "School28",
"School28", "School29", "School29", "School29", "School29", "School29",
"School29", "School30", "School30", "School30", "School30", "School30",
"School30", "School30", "School30", "School30", "School30", "School30",
"School31", "School31", "School31", "School31", "School31", "School31",
"School31", "School31", "School31", "School31", "School31", "School32",
"School32", "School32", "School32", "School32", "School32", "School32",
"School32", "School32", "School32", "School32", "School33", "School33",
"School33", "School33", "School33", "School33", "School33", "School34",
"School34", "School34", "School34", "School34", "School34", "School34",
"School34", "School35", "School35", "School35", "School35", "School36",
"School36", "School36", "School36", "School36", "School37", "School37",
"School37", "School37", "School37", "School37", "School37", "School37",
"School38", "School38", "School38", "School38", "School39", "School39",
"School39", "School39", "School39", "School39"), class = c("glue",
"character")), Q9 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L,
3L, 3L, 3L, 3L, 3L), .Label = c("typeA", "typeB", "typeC", "typeD"
), class = "factor"), Q11 = structure(c(3L, 7L, 4L, 2L, NA, NA,
NA, 7L, 4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, NA, NA, NA,
NA, 4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, NA, 7L, 4L, 2L,
1L, NA, 4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, NA, 4L, 2L, 1L, 5L,
NA, 4L, 2L, 1L, 5L, NA, 4L, 2L, 1L, 5L, NA, NA, NA, NA, NA, 7L,
4L, 2L, 1L, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA, NA,
5L, 6L, NA, 4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, 5L, NA, NA, NA,
7L, 4L, 2L, 1L, NA, 7L, 4L, 2L, 1L, NA, 4L, 2L, 1L, NA, NA, NA,
2L, 1L, NA, 4L, 2L, 1L, 5L, NA, 4L, 2L, 1L, 5L, NA, 7L, 4L, 2L,
1L, NA, NA, 4L, 2L, 1L, NA, NA, NA, 4L, 2L, 1L, 5L, NA, 7L, 4L,
2L, 1L, 5L, NA, NA, NA, NA, 2L, 1L, 5L, NA, NA, NA, NA, 3L, 7L,
4L, 2L, 1L, 5L, 6L, NA, 4L, 2L, 1L, 5L, NA, NA, 7L, 4L, 2L, 1L,
5L, 6L, NA, NA, NA, NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA,
NA, NA, 7L, 4L, 2L, 1L, 5L, 6L, NA, NA, NA, NA, NA, 7L, 4L, 2L,
1L, 5L, 6L, NA, 7L, 4L, 2L, 1L, 5L, NA, NA, NA, 3L, 7L, 4L, NA,
7L, 4L, 2L, NA, NA, 4L, 2L, 1L, 5L, NA, NA, NA, NA, 4L, 2L, 1L,
NA, 7L, 4L, 2L, 1L, 5L, NA), .Label = c("level04", "level03",
"level0", "level02", "level05", "level06", "level01"), class = "factor"),
Q40 = structure(c(NA, NA, NA, NA, 2L, 6L, 5L, NA, NA, NA,
NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA,
NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, 1L, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 1L, NA,
NA, NA, NA, 2L, 6L, 5L, NA, NA, NA, NA, NA, NA, 2L, 6L, 4L,
3L, NA, NA, 1L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 2L,
6L, 5L, NA, NA, NA, NA, 2L, NA, NA, NA, NA, 1L, NA, NA, NA,
2L, 6L, 5L, NA, NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, NA,
1L, NA, NA, NA, NA, 2L, 1L, NA, NA, NA, 2L, 6L, 5L, NA, NA,
NA, NA, 1L, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA, NA,
2L, 6L, 5L, 4L, NA, NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA,
NA, 2L, 1L, NA, NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 3L, NA,
NA, NA, NA, NA, NA, 2L, 6L, 5L, 4L, 3L, NA, NA, NA, NA, NA,
NA, 2L, 6L, 5L, 4L, 3L, NA, NA, NA, NA, NA, NA, 1L, NA, NA,
NA, NA, NA, 2L, 6L, 5L, NA, NA, NA, 2L, NA, NA, NA, 2L, 6L,
NA, NA, NA, NA, 2L, 6L, 5L, 4L, NA, NA, NA, 1L, NA, NA, NA,
NA, NA, 1L), .Label = c("none", "plan1_level0upto02", "plan5_level_05",
"plan4_level_05", "plan3_level_04", "plan2_level_03"), class = "factor")), row.names = c(NA,
-253L), class = c("tbl_df", "tbl", "data.frame"))
This might be what you need, or at least give you a good start. It shows the number of plans per school type for levels 0-4, along with the percentages.
left_join(na.omit(quest40_2[, c("SCHOOL", "Q9", "Q11")]),
          na.omit(quest40_2[, c("SCHOOL", "Q40")]), c("SCHOOL"),
          multiple = "all") %>%
  group_by(Q9, Q40) %>%
  filter(sub(".*(\\d+)$", "\\1", Q11) <= 4) %>%
  summarize(n = n(), .groups = "drop") %>%
  mutate(percentage = n / sum(n) * 100) %>%
  print(n = Inf)
# A tibble: 23 × 4
Q9 Q40 n percentage
<fct> <fct> <int> <dbl>
1 typeA none 3 0.974
2 typeA plan1_level0upto02 6 1.95
3 typeA plan5_level_05 4 1.30
4 typeA plan4_level_05 6 1.95
5 typeA plan3_level_04 6 1.95
6 typeA plan2_level_03 6 1.95
7 typeB none 21 6.82
8 typeB plan1_level0upto02 47 15.3
9 typeB plan5_level_05 4 1.30
10 typeB plan4_level_05 14 4.55
11 typeB plan3_level_04 31 10.1
12 typeB plan2_level_03 31 10.1
13 typeC none 32 10.4
14 typeC plan1_level0upto02 21 6.82
15 typeC plan4_level_05 4 1.30
16 typeC plan3_level_04 15 4.87
17 typeC plan2_level_03 18 5.84
18 typeD none 3 0.974
19 typeD plan1_level0upto02 8 2.60
20 typeD plan5_level_05 8 2.60
21 typeD plan4_level_05 8 2.60
22 typeD plan3_level_04 4 1.30
23 typeD plan2_level_03 8 2.60
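To address Q1/Q2 directly: one way to express the "and" condition without the explicit loop is to group by school and keep a group only when every required level is present. A sketch under the assumption that your frame looks like the dput above, demonstrated on a hypothetical toy frame (`required` is whatever set of levels you consider mandatory):

```r
library(dplyr)

# Hypothetical toy frame mirroring SCHOOL/Q11; S1 offers all required levels,
# S2 does not.
required <- c("level0", "level02", "level03", "level04")
toy <- data.frame(
  SCHOOL = c("S1", "S1", "S1", "S1", "S2", "S2"),
  Q11    = c("level0", "level02", "level03", "level04", "level0", "level05")
)

kept <- toy %>%
  group_by(SCHOOL) %>%
  filter(all(required %in% Q11)) %>%  # TRUE only if EVERY required level is offered
  ungroup()
kept  # only S1's rows remain
```

The original "drop schools offering level05/level06" rule can be written the same way, also loop-free: `filter(!any(Q11 %in% c('level05', 'level06'), na.rm = TRUE))` inside the same grouped pipeline.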

How to set placeholders in R

df1 <-
structure(c(3L, NA, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 3L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 3L, 1L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 2L, 1L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 1L, 3L,
2L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 3L,
3L, 1L, 3L, 2L, 2L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 3L, 3L,
3L, 3L, 2L, 1L, 2L), levels = c("aaa", "bbb",
"ccc"), class = c("ordered", "factor"))
df2 <-
structure(c(1L, NA, 3L, 1L, 1L, 3L, 2L, 1L, 3L, 2L, 2L, 3L, 1L,
1L, 2L, 3L, 1L, 3L, 2L, 2L, 1L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 3L,
1L, 2L, 1L, 3L, 2L, 3L, 1L, 2L, 3L, 2L, 3L, 2L, 1L, 3L, 3L, 3L,
2L, 2L, 1L, 3L, 2L, 1L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 3L, 2L, 3L,
1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 3L, 3L, 3L, 1L, 2L, 1L, 2L, 2L,
3L, 3L, 1L, 3L, 3L), levels = c("aaa", "bbb",
"ccc"), class = c("ordered", "factor"))
df3 <-
structure(c(3L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 3L, 2L, 2L,
3L, 1L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 2L, 3L, 2L, 3L,
1L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 2L, 3L, 3L), levels = c("ddd", "eee", "fff"
), class = c("ordered", "factor"))
dftest1 <-
structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA, NA, 2L, NA, NA, NA,
1L, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 1L, NA, 2L), levels = c("AAA", "BBB"
), class = "factor")
dftest2 <-
structure(c(NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA,
1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA,
NA, NA, 1L, 1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, NA, NA,
1L, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 1L, 1L, 1L), levels = "CCC", class = "factor")
I want to put df1, df2 and df3 (in my case factors) together as a placeholder var2use, so to speak:
var2use <- c(var1, var2, var3)
This placeholder should then be combined with other factors (dftest1, dftest2) into a data set.
This implementation works as expected:
df <- data.frame(dftest1, dftest2, df1, df2, df3)
I was hoping that it could also be implemented in this form:
df <- data.frame(dftest1, dftest2, var2use)
But I get an error:
Error in data.frame(dftest1, dftest2, var2use):
arguments imply differing number of rows: 82, 3
The background is that I would like to work with placeholders of this type in different places. Does anyone have an idea how to solve this?
The more direct way to do this is to just put your values into a separate data.frame and then cbind() the values when you need them. For example:
vars2use <- data.frame(df1, df2, df3)
df <- cbind(data.frame(dftest1, dftest2), vars2use)
# df <- data.frame(dftest1, dftest2) |> cbind(vars2use) # alternative syntax
If for some reason those names really do need to be a character vector, you can use mget() to fetch the values before using cbind():
varnames <- c("df1", "df2", "df3")
df <- cbind(data.frame(dftest1, dftest2), mget(varnames))
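A small base-R check of the mget() idea on toy objects (x, y, fixed, varnames are made up for illustration): the character vector holds only the names, mget() fetches the objects from the current environment as a named list, and cbind() attaches them to an existing data.frame:

```r
# Toy factors standing in for df1/df2; 'fixed' stands in for
# data.frame(dftest1, dftest2).
x <- factor(c("a", "b", "a"))
y <- factor(c("c", "c", "d"))
fixed <- data.frame(id = 1:3)

varnames <- c("x", "y")            # placeholder as a character vector
out <- cbind(fixed, mget(varnames))  # mget returns a named list of the objects
names(out)                           # "id" "x" "y"
```

Because mget() returns a named list, the new columns automatically get the original object names, and each column keeps its own class (here, factor).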

Improving multivariate regression model in R

I recently conducted a survey within an IT company concerning user satisfaction with a specific data management solution. There was one question about overall satisfaction (the dependent variable for my regression), and then various questions about more specific aspects like data quality etc. (the independent variables in my regression).
With the help of R, I created a multivariate regression in order to figure out which of the various aspects are most important for customer satisfaction. However, I believe my results are not 100% correct, since some of them don't make sense. For instance, according to the standardized coefficient, increasing data quality results in less user satisfaction. From my point of view, the coefficient should be positive for all variables.
Maybe somebody here can help me / give me some tips on how to improve my model. Below you can find my code and the results (anonymized). The rows labeled M-AV are my independent variables. In the columns to the right you can find the standardized coefficient, the standard error, t value and p-value.
#https://www.youtube.com/watch?v=EUbujtw5Azc
# Load libraries
library(lmtest)
library(car)
library(sandwich)
# Read in the data
daten <- read.csv(file.choose(), header = T, sep=";")
# Transform column K (detected as chr, but is actually numeric)
daten <- transform(daten, K = as.numeric(K))
str(daten)
# Regression model
#modell <- lm(H ~ M + N + O + P + X + Y + Z + AA + AB + AE + AF + AG + AJ + AL + AM + AN + AQ + AR + AS + AU + AV, daten)
modell <- lm(C ~ M + N + O + P + X + Y + Z + AA + AB + AE + AF + AG + AJ + AL + AM + AN + AQ + AR + AS + AU + AV, daten)
# Assumptions
# 1 Normality of the residuals
# Points in the plot should lie roughly on the line (i.e. normally distributed). Deviation at the start and end is OK.
plot(modell, 2)
# 2a Homoscedasticity (residuals scatter evenly)
plot(modell, 1) # should lie roughly on the reference line
# Breusch-Pagan test, null hypothesis: homoscedasticity holds
# if p-value > 0.05, the null hypothesis is retained
bptest(modell)
# 3 No multicollinearity (independent variables correlating too strongly)
# VIF should definitely be below 10, more conservatively below 6
vif(modell)
# 4 Outliers / influential cases
#https://bjoernwalther.com/cook-distanz-in-r-ermitteln-und-interpretieren-ausreisser-erkennen/
plot(modell, 4)
# Robust standard errors
coeftest(modell, vcov=vcovHC(modell, type ="HC3"))
# Evaluation
summary(modell)
# The F-statistic's null hypothesis is that the model contributes no explanatory power --> here < .05, so it is rejected!
# R² value --> ~60% of the variance is explained by the variables (really 40%, see adjusted R²)
# standardized coefficients to find the most influential variable
zmodell <- lm(scale(C) ~ scale(M)+ scale(N) + scale(O) + scale(P) + scale(X) + scale(Y) + scale(Z) + scale(AA) + scale(AB) + scale(AE) + scale(AF) + scale(AG) + scale(AJ) + scale(AL) + scale(AM) + scale(AN) + scale(AQ) + scale(AR) + scale(AS) + scale(AU) + scale(AV), data = daten)
summary(zmodell)
dput(head(j, 20))
structure(list(A = c(6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L), B = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), C = c(10L, 5L, 9L, 9L, 7L, 10L, 10L, 5L, 10L, 8L,
1L, 8L, 10L, 7L, 8L, 10L, 8L, 2L, 8L, 3L), D = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA), E = c(5L, 3L, 4L, 5L, 4L, 4L, 6L, 3L, 5L, 3L, 4L, 2L, 4L,
2L, 3L, 5L, 3L, 4L, 3L, 2L), F = c(5L, 2L, 6L, 5L, 4L, 2L, 6L,
4L, 5L, 6L, 4L, 4L, 6L, 5L, 5L, 6L, 4L, 3L, 5L, 5L), G = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA), H = c(6L, 3L, 5L, 4L, 5L, 4L, 5L, 4L, 5L, 4L, 2L,
5L, 5L, 4L, 4L, 6L, 4L, 5L, 4L, 1L), I = c(6L, 2L, 5L, 4L, 4L,
4L, 5L, 3L, 5L, 4L, 2L, 5L, 5L, 3L, 4L, 5L, 3L, 2L, 4L, 1L),
J = c(3L, 6L, 6L, 5L, 6L, 2L, 5L, 4L, 6L, 6L, 5L, 2L, 5L,
5L, 2L, 6L, 5L, 5L, 6L, 6L), K = c(5, 3.67, 5.33, 4.33, 5,
3.33, 5, 3.67, 5.33, 4.67, 3, 4, 5, 4, 3.33, 5.67, 4, 4,
4.67, 2.67), L = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), M = c(4L, 2L, 6L,
6L, 5L, 6L, 6L, 4L, 6L, 6L, 5L, 6L, 5L, 5L, 5L, 6L, 6L, 6L,
6L, 3L), N = c(6L, 5L, 5L, 5L, 6L, 6L, 6L, 5L, 6L, 6L, 4L,
4L, 4L, 3L, 5L, 5L, 4L, 5L, 5L, 2L), O = c(5L, 1L, 5L, 4L,
6L, 6L, 5L, 2L, 6L, 6L, 1L, 5L, 5L, 3L, 4L, 5L, 4L, 2L, 5L,
3L), P = c(6L, 1L, 4L, 4L, 4L, 6L, 6L, 2L, 5L, 3L, 2L, 5L,
5L, 3L, 5L, 5L, 4L, 5L, 2L, 1L), Q = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), R = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), S = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), T = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), U = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), V = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), W = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), X = c(4L, 1L, 3L, 4L, 5L, 6L, 5L, 3L, 5L, 4L, 1L, 5L,
4L, 1L, 4L, 1L, 5L, 2L, 4L, 1L), Y = c(5L, 1L, 3L, 3L, 3L,
6L, 5L, 2L, 6L, 4L, 1L, 3L, 4L, 1L, 5L, 5L, 3L, 2L, 3L, 2L
), Z = c(5L, 1L, 3L, 4L, 3L, 6L, 5L, 2L, 5L, 4L, 2L, 3L,
5L, 3L, 5L, 3L, 2L, 1L, 4L, 1L), AA = c(6L, 4L, 4L, 5L, 5L,
6L, 5L, 3L, 4L, 5L, 3L, 4L, 4L, 3L, 5L, 6L, 5L, 3L, 6L, 2L
), AB = c(6L, 6L, 4L, 4L, 3L, 6L, 5L, 3L, 5L, 3L, 2L, 6L,
5L, 6L, 5L, 5L, 5L, 5L, 6L, 2L), AC = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), AD = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), AE = c(5L, 1L, 6L, 4L, 6L,
5L, 4L, 3L, 5L, 5L, 2L, 2L, 4L, 1L, 5L, 3L, 3L, 4L, 4L, 1L
), AF = c(4L, 1L, 6L, 2L, 5L, 5L, 4L, 3L, 6L, 4L, 2L, 4L,
5L, 4L, 5L, 4L, 3L, 4L, 6L, 2L), AG = c(4L, 1L, 5L, 2L, 5L,
5L, 4L, 4L, 4L, 4L, 2L, 4L, 5L, 5L, 4L, 2L, 3L, 2L, 6L, 2L
), AH = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), AI = c(0L, 0L, 1L, 1L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L
), AJ = c(3L, 2L, 5L, 3L, 4L, 4L, 6L, 3L, 5L, 5L, 2L, 5L,
5L, 3L, 5L, 5L, 4L, 2L, 5L, 1L), AK = c(NA, NA, 5L, 3L, 4L,
4L, 5L, NA, 6L, 5L, NA, NA, 6L, NA, NA, NA, 4L, NA, NA, NA
), AL = c(4L, 4L, 6L, 4L, 6L, 5L, 5L, 3L, 6L, 5L, 4L, 6L,
5L, 3L, 5L, 4L, 5L, 3L, 6L, 1L), AM = c(5L, 1L, 6L, 4L, 5L,
2L, 4L, 2L, 6L, 4L, 2L, 2L, 6L, 1L, 5L, 3L, 2L, 1L, 4L, 3L
), AN = c(1L, 1L, 6L, 3L, 2L, 6L, 4L, 1L, 6L, 2L, 1L, 4L,
5L, 2L, 5L, 5L, 4L, 4L, 5L, 1L), AO = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), AP = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), AQ = c(3L, 1L, 6L, 3L, 6L,
1L, 5L, 2L, 6L, 5L, 6L, 3L, 6L, 1L, 5L, 3L, 2L, 2L, 4L, 2L
), AR = c(1L, 4L, 4L, 3L, 6L, 1L, 5L, 1L, 6L, 5L, 5L, 4L,
6L, 2L, 5L, 4L, 2L, 2L, 4L, 2L), AS = c(1L, 1L, 6L, 4L, 6L,
1L, 5L, 3L, 6L, 5L, 6L, 5L, 6L, 5L, 5L, 5L, 4L, 2L, 5L, 2L
), AT = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), AU = c(5L, 3L, 4L, 4L, 6L,
3L, 5L, 3L, 6L, 5L, 4L, 4L, 4L, 6L, 5L, 6L, 5L, 6L, 5L, 2L
), AV = c(6L, 3L, 5L, 4L, 6L, 2L, 6L, 2L, 6L, 4L, 4L, 4L,
4L, 6L, 4L, 6L, 3L, 6L, 2L, 3L), AW = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), AX = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), AY = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), AZ = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), BA = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), BB = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), BC = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), BD = c(5.25, 2.25, 5, 4.75, 5.25, 6, 5.75, 3.25, 5.75,
5.25, 3, 5, 4.75, 3.5, 4.75, 5.25, 4.5, 4.5, 4.5, 2.25),
BE = c(5.2, 2.6, 3.4, 4, 3.8, 6, 5, 2.6, 5, 4, 1.8, 4.2,
4.4, 2.8, 4.8, 4, 4, 2.6, 4.6, 1.6), BF = c(4.333333333,
1, 5.666666667, 2.666666667, 5.333333333, 5, 4, 3.333333333,
5, 4.333333333, 2, 3.333333333, 4.666666667, 3.333333333,
4.666666667, 3, 3, 3.333333333, 5.333333333, 1.666666667),
BG = c(3.25, 2, 5.75, 3.5, 4.25, 4.25, 4.75, 2.25, 5.75,
4, 2.25, 4.25, 5.25, 2.25, 5, 4.25, 3.75, 2.5, 5, 1.5), BH = c(1.666666667,
2, 5.333333333, 3.333333333, 6, 1, 5, 2, 6, 5, 5.666666667,
4, 6, 2.666666667, 5, 4, 2.666666667, 2, 4.333333333, 2),
BI = c(5.5, 3, 4.5, 4, 6, 2.5, 5.5, 2.5, 6, 4.5, 4, 4, 4,
6, 4.5, 6, 4, 6, 3.5, 2.5)), row.names = c(NA, 20L), class = "data.frame")

Convert multiple columns from numeric to factor

I thought this task was simple; then I was surprised that it wasn't.
I have multiple selected columns with coded responses (Likert scales). I want to transform them into factor variables with factor levels (some of which were never chosen). The questionnaire is in German, which is why you probably won't be able to understand the labels.
df[,c(3:21,23:25)] <- apply(df[,c(3:21,23:25)],2,
function (x) factor(x,
levels = c(0,1,2,3,4),
labels = c("gar nicht",
"gering",
"eher schwach",
"eher stark",
"sehr stark")))
df[,22] <- apply(df[,22],1,
function (x) factor(x,
levels = c(0,1,2,3),
labels = c("gar nicht",
"sofort",
"mittelfristig",
"langfristig")))
I will need to split those data frames because of the different scales. Nevertheless,
it does not transform my data correctly: the outcome is character, not factor.
Here is my test data:
structure(list(ï..lfdNr = 1:20, company = c("Nationalpark Thayathal",
"Naturpark Heidenreichsteiner Moor", "Naturpark Hohe Wand", "Tierpark Stadt Haag",
"Ötscher Tropfsteinhöhle", "Carnuntum", "Stift Heiligenkreuz",
"Ruine Kollmitz", "Schlosshof", "Retzer Erlebniskeller", "LOISIUM Weinwelt",
"Bio Imkerei Stögerer", "Amethyst Welt Maissau", "Donau Niederösterreich tourismus",
"Niederösterreich Bahnen", "Benediktinerstift Melk", "Kunstmeile Krems",
"Die Garten Tulln", "Winzer Krems ", "Domäne Wachau"), A2_1_hitz = c(4L,
NA, NA, 3L, NA, NA, 3L, 2L, 3L, NA, 3L, NA, 3L, NA, 2L, 3L, 3L,
4L, 2L, 3L), A2_2_trock = c(3L, NA, NA, 3L, NA, NA, 3L, NA, 3L,
NA, 2L, NA, 1L, NA, 2L, 4L, 3L, 4L, 2L, 3L), A2_3_reg = c(2L,
NA, NA, 2L, NA, NA, 3L, 2L, 3L, NA, 3L, NA, 2L, NA, 3L, 4L, 2L,
3L, 4L, 2L), A2_4_schnee = c(4L, NA, NA, 3L, NA, NA, NA, 3L,
3L, NA, 1L, NA, 0L, NA, 4L, NA, 3L, 4L, 4L, 1L), B1_1_hitz = c(4L,
NA, NA, 1L, NA, NA, NA, 3L, 3L, NA, 2L, NA, NA, NA, 2L, 3L, 2L,
4L, 0L, 2L), B1_2_trock = c(3L, NA, NA, 2L, NA, NA, NA, NA, 3L,
NA, 0L, NA, NA, NA, 2L, 3L, 2L, 4L, 3L, 1L), B1_3_reg = c(2L,
NA, NA, 1L, NA, NA, NA, NA, 3L, NA, 3L, NA, NA, NA, 3L, 3L, 1L,
2L, 3L, 3L), B1_4_schnee = c(1L, NA, NA, 0L, NA, NA, 0L, 0L,
1L, NA, NA, NA, NA, NA, 4L, 1L, 0L, 4L, 0L, 0L), B2_1_nZuk = c(3L,
NA, NA, 0L, NA, NA, NA, 0L, 0L, NA, 0L, NA, 0L, 3L, 3L, 0L, 3L,
2L, 0L, 0L), B2_2_mZuk = c(3L, NA, NA, 0L, NA, NA, NA, 0L, 2L,
NA, 2L, NA, 0L, 2L, 3L, 0L, 3L, 2L, 3L, 0L), B2_3_fZuk = c(3L,
NA, NA, 2L, NA, NA, NA, NA, 2L, NA, 2L, NA, 0L, 2L, 3L, 0L, 3L,
NA, 3L, 0L), C1_1_aktEin = c(2L, NA, NA, 1L, NA, NA, NA, NA,
2L, NA, NA, NA, NA, NA, NA, 0L, 1L, 3L, 2L, 3L), C1_2_zukEin = c(3L,
NA, NA, 2L, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA, 0L, 2L,
4L, 3L, 3L), C2_1_bisVer = c(2L, NA, NA, 1L, NA, NA, NA, NA,
2L, NA, NA, NA, NA, NA, 2L, 2L, 1L, 3L, 2L, 2L), C2_2_zukVer = c(3L,
NA, NA, 2L, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, 2L, 2L, 2L,
3L, 3L, 2L), C3_1_bisVer = c(NA, NA, NA, 1L, NA, NA, 2L, NA,
3L, NA, NA, NA, NA, NA, 1L, 1L, 1L, NA, 2L, 2L), C3_2_zukVer = c(NA,
NA, NA, 2L, NA, NA, 3L, NA, 3L, NA, NA, NA, NA, NA, 1L, 2L, 2L,
NA, 3L, 2L), C4_1_EinKlim = c(NA, NA, NA, 2L, NA, NA, NA, NA,
2L, NA, 2L, NA, NA, NA, 3L, 0L, 1L, NA, 3L, 1L), D1a_1_StÃ.rke = c(NA,
NA, NA, 3L, NA, NA, NA, NA, 3L, NA, NA, NA, 3L, NA, 2L, 3L, 2L,
3L, 3L, 3L), D1b_1_Dring = c(NA, NA, NA, NA, NA, NA, 2L, 3L,
NA, NA, NA, NA, 2L, NA, 1L, 1L, 1L, 1L, 1L, 1L), D5_1_bestBed = c(NA,
NA, NA, 0L, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA, 2L, 1L,
NA, 3L, 3L), E1_1_zuBesuch = c(NA, NA, NA, 2L, NA, NA, NA, NA,
3L, NA, NA, NA, NA, NA, 4L, 1L, 4L, NA, 4L, NA), E1_2_wirtBed = c(NA,
NA, NA, 3L, NA, NA, 3L, NA, 2L, NA, NA, NA, NA, NA, 1L, 1L, 4L,
NA, 3L, NA)), row.names = c(NA, 20L), class = "data.frame")
Thanks,
nadine
We need lapply() and not apply() here: apply() converts the data frame to a matrix, and a matrix can have only a single class, so the factors are coerced back to character. Also note that the 0-4 columns need five levels/labels; the 0-3 scale with its four labels applies only to column 22.
df[, c(3:21, 23:25)] <- lapply(df[, c(3:21, 23:25)],
                               function(x) factor(x,
                                                  levels = c(0, 1, 2, 3, 4),
                                                  labels = c("gar nicht",
                                                             "gering",
                                                             "eher schwach",
                                                             "eher stark",
                                                             "sehr stark")))

Return an average of last or first two rows from a different group (indicated by a variable)

This is a follow-up to this question. With data like below:
data <- structure(list(seq = c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L,
7L, 7L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L), new_seq = c(2, 2,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
2, 2, 2, 2, NA, NA, NA, NA, NA, 4, 4, 4, 4, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 6, 6, 6, 6, 6, NA, NA, 8, 8, 8, NA, NA, NA), value = c(2L,
0L, 0L, 3L, 0L, 5L, 5L, 3L, 0L, 3L, 2L, 3L, 2L, 3L, 4L, 1L, 0L,
0L, 0L, 1L, 1L, 0L, 2L, 5L, 3L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 3L,
5L, 3L, 1L, 1L, 1L, 0L, 1L, 0L, 4L, 3L, 0L, 3L, 1L, 3L, 0L, 0L,
1L, 0L, 0L, 3L, 4L, 5L, 3L, 5L, 3L, 5L, 0L, 1L, 1L, 3L, 2L, 1L,
0L, 0L, 0L, 0L, 5L, 1L, 1L, 0L, 4L, 1L, 5L, 0L, 3L, 1L, 2L, 1L,
0L, 3L, 0L, 1L, 1L, 3L, 0L, 1L, 1L, 2L, 2L, 1L, 0L, 4L, 0L, 0L,
3L, 0L, 0L)), row.names = c(NA, -100L), class = c("tbl_df", "tbl",
"data.frame"))
For every non-NA value of new_seq, I need to calculate the mean of 2 observations from the respective group in seq (a value of new_seq refers to a value of seq). The issue is that:
for those rows where new_seq refers to a value of seq which appears after (rows 1:2 in the example), it should be the mean of the 2 FIRST rows of the respective group,
for those rows where new_seq refers to a value of seq which appears before, it should be the mean of the 2 LAST rows of the respective group.
@Z.Lin provided an excellent solution for the second case, but how can it be tweaked to handle both cases? Or maybe there is another solution with the tidyverse?
I think I got it, so I'm posting an answer for anybody who comes here from a search.
lookup_backwards <- data %>%
  group_by(seq) %>%
  mutate(rank = seq(n(), 1)) %>%
  filter(rank <= 2) %>%
  summarise(backwards = mean(value)) %>%
  ungroup()
lookup_forwards <- data %>%
  group_by(seq) %>%
  mutate(rank = seq(1, n())) %>%
  filter(rank <= 2) %>%
  summarise(forwards = mean(value)) %>%
  ungroup()
data %>%
  left_join(lookup_backwards, by = c('new_seq' = 'seq')) %>%
  left_join(lookup_forwards, by = c('new_seq' = 'seq')) %>%
  replace_na(list(backwards = 0, forwards = 0)) %>%
  mutate(new_column = ifelse(new_seq > seq, forwards, backwards))
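An alternative sketch that builds both lookups in a single summarise() using head()/tail(), demonstrated on a hypothetical toy frame with the same seq/new_seq/value structure (note it leaves NA rather than 0 where new_seq is NA):

```r
library(dplyr)

# Toy frame: groups 1 and 3 both point at group 2 via new_seq.
toy <- data.frame(
  seq     = c(1, 1, 2, 2, 2, 3, 3),
  new_seq = c(2, 2, NA, NA, NA, 2, 2),
  value   = c(10, 20, 1, 2, 3, 40, 50)
)

lookup <- toy %>%
  group_by(seq) %>%
  summarise(forwards  = mean(head(value, 2)),  # mean of the FIRST two rows
            backwards = mean(tail(value, 2)),  # mean of the LAST two rows
            .groups = "drop")

res <- toy %>%
  left_join(lookup, by = c("new_seq" = "seq")) %>%
  mutate(new_column = ifelse(new_seq > seq, forwards, backwards))
res$new_column
# rows 1-2 look FORWARD to group 2: (1 + 2) / 2 = 1.5
# rows 6-7 look BACK to group 2:    (2 + 3) / 2 = 2.5
```

One join instead of two keeps the pipeline shorter, and head()/tail() replace the explicit rank columns.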
