Loop across rows and columns with nested data - r

I have the following nested data structure: meetings within persons within groups. The groups met a different number of times, and the number of group members present varied from meeting to meeting.
$ GroupID : chr "1" "1" "1" "1" ...
$ groupnames : chr "A&M" "A&M" "A&M" "A&M" ...
$ MeetiID : chr "1" "1" "2" "2" ...
$ Date_Meetings : chr "43293" "43293" "43298" "43298" ...
$ PersonID : num 171 185 171 185 185 113 135 113 135 113 ...
$ v_165 : chr "3" "3" "4" "3" ...
$ v_166 : chr "2" "2" "3" "3" ...
$ v_167 : chr "2" "4" "4" "3" ...
$ v_168 : chr "6" "7" "4" "5" ...
$ problemtypes_categories: chr "Knowledgeproblem" "Knowledgeproblem" "Motivationalproblem" "Coordinationproblem" ...
$ v_165_dicho : num 0 0 0 0 1 1 1 0 0 1 ...
$ v_166_dicho : num 0 0 0 0 0 0 0 0 0 0 ...
$ v_167_dicho : num 0 0 0 0 1 1 0 0 0 0 ...
Now I have to create a new binary (0/1) variable named agreement_levels. Whenever all persons of a group have the same problem type category for the same meeting, every one of them (two, three or four learners, depending on the group size at that meeting) should get the value 1 on the agreement variable; otherwise they should all get 0. As soon as one person (e.g., among four learners) has a different problem category than the others, the agreement variable is 0 for all of them.
If only one person is in the data set for a given meeting, the agreement variable must be NA. If one person has NA on the problem type variable and there are two people in the data set for that meeting, both get 0. But if there are four people in the data set for that meeting and one of them has NA on the problem type variable, then only this person, and not the others, gets NA on the agreement variable.
I have already written some code, but it does not work yet and does not handle the NAs:
GroupID1 <- df$GroupID[1:nrow,]
TreffID1 <- df$TreffID[1:nrow,]
for(i in 1:(GroupID1 -1){
for(j in 1:(TreffID1 -1){
if(df[i, 3] == df[i+1, 3]-1){
if(df[i, 15] == df[i+1, 15]-1){
df[c(i, i+1), 28] <- 1,
df[c(i, i+1), 28] <- 0
Many thanks in advance.
dput(head(df))
structure(list(GroupID = c("1", "1", "1", "1", "1", "2"), TreffID = c("1", "1",
"2", "2", "3", "1"), PersonID = c(171, 185, 171, 185,
185, 113), problemtypen_oberkategorien = c("Verständnisprobleme",
"Verständnisprobleme", "Motivationsprobleme", "Motivationsprobleme",
"Motivationsprobleme", "Motivationsprobleme"), passung.exkl = c("0",
"0", "0", "0", "1", "1")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))

Instead of loops, I used dplyr. I'm not sure I got all of your logic right, since there was a lot of it; for example, you didn't specify what should happen with an NA problem type and 3 people. But here is a starting point. group_by restricts each calculation to the rows that share the same GroupID and TreffID. mutate with case_when then assigns values to a new column according to the criteria. n() counts the rows in the current group and n_distinct counts the distinct values, so if it is == 1 we know they are all the same.
library(tidyverse)
df <- df %>%
  group_by(GroupID, TreffID) %>%  # one group per meeting of a group
  mutate(
    # -1 is a temporary placeholder that is turned into NA below
    agreement_levels = case_when(
      n() == 1 ~ -1,  # only one person at this meeting
      is.na(problemtypen_oberkategorien) & n() == 2 ~ 0,
      is.na(problemtypen_oberkategorien) & n() > 2 ~ -1,
      n_distinct(problemtypen_oberkategorien, na.rm = FALSE) == 1 ~ 1,  # everyone agrees
      n_distinct(problemtypen_oberkategorien, na.rm = FALSE) > 1 ~ 0,
      TRUE ~ -1
    ),
    agreement_levels = na_if(agreement_levels, -1)
  ) %>%
  select(GroupID, TreffID, problemtypen_oberkategorien, agreement_levels, everything())
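To check how these rules treat the edge cases, here is a small sketch on made-up data. The data frame toy and its category labels "A", "B", "C" are hypothetical and not from the question. The rules are the same as above, except that NA_real_ is written directly instead of the -1 placeholder.

```r
library(dplyr)

# Hypothetical toy data: one row per person per meeting
toy <- data.frame(
  GroupID  = c("1", "1", "1", "1", "1", "2", "2", "2"),
  TreffID  = c("1", "1", "2", "2", "3", "1", "1", "1"),
  PersonID = c(171, 185, 171, 185, 185, 113, 135, 140),
  problemtypen_oberkategorien = c("A", "A", "A", NA, "B", "C", "C", NA),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  group_by(GroupID, TreffID) %>%
  mutate(agreement_levels = case_when(
    n() == 1 ~ NA_real_,                                  # single attendee
    is.na(problemtypen_oberkategorien) & n() == 2 ~ 0,    # NA in a pair
    is.na(problemtypen_oberkategorien) ~ NA_real_,        # NA in a larger group
    n_distinct(problemtypen_oberkategorien, na.rm = FALSE) == 1 ~ 1,
    TRUE ~ 0
  )) %>%
  ungroup()

toy$agreement_levels
```

Note that with na.rm = FALSE an NA counts as its own distinct value, so in the three-person meeting the two people who both have "C" still get 0. If they should be compared only among themselves, that line would need to change; the question leaves this open.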

Related

Converting columns from character to integer, but only if the title of the column/variable name includes the word "Average"

Basically what the title says! I have columns like name, age, year, average_points, average_steals, average_rebounds etc. But all the average columns (there are a lot) are stored as characters. Thanks!
First I created some sample data. You can mutate across the columns whose names start with "average" (selected with starts_with) and convert them with as.integer. You can use the following code:
df <- data.frame(name = c("A", "B"),
age = c(10, 51),
year = c(2001, 1980),
average_points = c("3", "5"),
average_steals = c("4","6"),
average_bounds = c("6","7"))
str(df)
#> 'data.frame': 2 obs. of 6 variables:
#> $ name : chr "A" "B"
#> $ age : num 10 51
#> $ year : num 2001 1980
#> $ average_points: chr "3" "5"
#> $ average_steals: chr "4" "6"
#> $ average_bounds: chr "6" "7"
library(dplyr)
library(tidyr)
result <- df %>%
mutate(across(starts_with("average"), as.integer))
str(result)
#> 'data.frame': 2 obs. of 6 variables:
#> $ name : chr "A" "B"
#> $ age : num 10 51
#> $ year : num 2001 1980
#> $ average_points: int 3 5
#> $ average_steals: int 4 6
#> $ average_bounds: int 6 7
Created on 2022-07-20 by the reprex package (v2.0.1)
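If the word can appear anywhere in the column name, or with a capital letter as in the title, contains() may fit better than starts_with(); by default it ignores case. A minimal sketch, where the data frame stats and its column names are made up for illustration:

```r
library(dplyr)

# Hypothetical data: "average" appears in different positions and cases
stats <- data.frame(
  name = c("A", "B"),
  Average_points = c("3", "5"),
  season_average_steals = c("4", "6"),
  stringsAsFactors = FALSE
)

# contains() matches the literal string anywhere in the name, ignoring case
converted <- stats %>%
  mutate(across(contains("average"), as.integer))

# Base R version of the same idea
avg_cols <- grepl("average", names(stats), ignore.case = TRUE)
base_converted <- stats
base_converted[avg_cols] <- lapply(stats[avg_cols], as.integer)

str(converted)
```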

Error barplot after running library (plyr)

I have the following dataset
set.seed(42)
cancer <- sample(c("yes", "no"), 200, replace=TRUE)
agegroup <- sample(c("35-39", "40-44", "45-49"), 200, replace=TRUE)
agefirstchild <- sample(c("Age < 30", "Age 30 or greater", "nullipareous"), 200, replace=TRUE)
dat <- data.frame(cancer, agegroup, agefirstchild)
I am running the code below to create a bar chart, and I have two questions.
1. I would now like to have the chart for the whole data set, not only for cancer = yes.
2. After I ran library(plyr), I received a warning that it does not work well with a specific package.
The code below was working before, but after loading this library it no longer does. This is the error message: "Error in print.default(m, ..., quote = quote, right = right, max = max) :
invalid 'na.print' specification"
riskwoinvasivetrain%>%
group_by(Agegroup) %>%
summarize(prop_cancer = mean(Cancer == 'yes')) %>%
print(n=1000)
I would also just like to have a simple frequency table telling me the size (n) of each subgroup, e.g. the size of age group 35-39.
'data.frame': 159093 obs. of 12 variables:
$ Menopause : chr "Postmenopausal" "Postmenopausal" "Postmenopausal" "Postmenopausal" ...
$ Agegroup : chr "45-49" "45-49" "45-49" "45-49" ...
$ Density : chr "Almost entirely fat" "Almost entirely fat" "Almost entirely fat" "Almost entirely fat" ...
$ Race : chr "white" "white" "white" "white" ...
$ BMI : chr "10-24.99" "10-24.99" "10-24.99" "10-24.99" ...
$ AgeFirstBirth : chr "< 30" "< 30" "< 30" "< 30" ...
$ NumberRelativesCancer : chr "zero" "zero" "zero" "zero" ...
$ PreviousBreastProcedure: int 0 0 0 0 0 0 0 0 0 0 ...
$ LastMammogram : int 0 0 0 0 0 0 0 0 0 0 ...
$ SurgicalMenopause : int 0 0 0 0 0 0 0 0 0 0 ...
$ HRT : chr "no" "no" "no" "no" ...
$ Cancer : chr "no" "no" "no" "no" ...
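We can count each agegroup/cancer combination, divide by the sum of 'n' to get each combination's share of the whole data set, and then do the plotting with ggplot (the plot uses the raw counts n on the y axis):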
library(dplyr)
library(ggplot2)
dat %>%
count(agegroup, cancer) %>%
mutate(prop_cancer = n/sum(n)) %>%
ggplot(aes(x = agegroup, y = n, fill = cancer)) +
geom_col()
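For the second question, a likely explanation (an assumption, not confirmed in the thread) is that attaching plyr after dplyr masks dplyr's summarize(), so the grouping is ignored and a plain data frame comes back, whose print method does not accept n = 1000. A sketch that sidesteps the masking by calling the dplyr functions with their namespace, and that also produces the subgroup sizes:

```r
library(dplyr)

set.seed(42)
dat <- data.frame(
  cancer   = sample(c("yes", "no"), 200, replace = TRUE),
  agegroup = sample(c("35-39", "40-44", "45-49"), 200, replace = TRUE)
)

# dplyr::summarise() is used explicitly, so it does not matter
# which package was attached last
prop_by_age <- dat %>%
  dplyr::group_by(agegroup) %>%
  dplyr::summarise(n = dplyr::n(), prop_cancer = mean(cancer == "yes"))

# Simple frequency table: size of each age group
sizes <- dplyr::count(dat, agegroup)

prop_by_age
sizes
```

Loading plyr before dplyr, or not attaching plyr at all and calling its functions as plyr::..., are other common ways to avoid the clash.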

R spread function (error in ... undefined columns selected)

I googled my error, but that didn't help me.
I have a data frame with a column x.
unique(df$x)
The result is:
[1] "fc_social_media" "fc_banners" "fc_nat_search"
[4] "fc_direct" "fc_paid_search"
When I try this:
df <- spread(data = df, key = x, value = x, fill = "0")
I got the error:
Error in `[.data.frame`(data, setdiff(names(data), c(key_var, value_var))) :
undefined columns selected
But that is very weird, because I have used the spread function several times in the same script.
So I googled, saw some "solutions":
I removed all the "special" characters. As you can see, my unique
values do not contain special characters (cleaned it). But this didn't
help.
I checked if there are any columns with the same name. But all column names
are unique.
#Gregor, #Akrun:
> str(df)
'data.frame': 100 obs. of 22 variables:
$ visitor_id : chr "321012312666671237877-461170125342559040419" "321012366667112237877-461121705342559040419" "321012366661271237877-461170534255901240419" "321012366612671237877-461170534212559040419" ...
$ visit_num : chr "1" "1" "1" "1" ...
$ ref_domain : chr "l.facebook.com" "X.co.uk" "x.co.uk" "" ...
$ x : chr "fc_social_media" "fc_social_media" "fc_social_media" "fc_social_media" ...
$ va_closer_channel : chr "Social Media" "Social Media" "Social Media" "Social Media" ...
$ row : int 1 2 3 4 5 6 7 8 9 10 ...
$ : chr "0" "0" "0" "0" ...
$ Hard Drive : chr "0" "0" "0" "0" ...
The error could be due to a column without a name, i.e. "". Using a reproducible example:
library(tidyr)
spread(df, x, x)
Error in `[.data.frame`(data, setdiff(names(data), c(key_var,
value_var))) : undefined columns selected
We could make it work by changing the column name
names(df) <- make.names(names(df))
spread(df, x, x, fill = "0")
# X fc_banners fc_direct fc_nat_search fc_paid_search fc_social_media
#1 1 0 0 0 0 fc_social_media
#2 2 fc_banners 0 0 0 0
#3 3 0 0 fc_nat_search 0 0
#4 4 0 fc_direct 0 0 0
#5 5 0 0 0 fc_paid_search 0
data
df <- data.frame(x = c("fc_social_media", "fc_banners",
"fc_nat_search", "fc_direct", "fc_paid_search"), x1 = 1:5, stringsAsFactors = FALSE)
names(df)[2] <- ""
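To find out which columns are affected before reshaping, the names can be checked directly. A minimal sketch in base R; the replacement name prefix "unnamed_" is an arbitrary choice for illustration:

```r
# Same kind of data as above, with one blank column name
df <- data.frame(x = c("fc_social_media", "fc_banners"),
                 x1 = 1:2, stringsAsFactors = FALSE)
names(df)[2] <- ""

# Positions of columns whose name is missing or empty
bad <- which(is.na(names(df)) | names(df) == "")
bad

# Give them explicit names instead of relying on make.names()
names(df)[bad] <- paste0("unnamed_", bad)
names(df)
```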

R: Sort 2-D array in ascending numerical order by attributes

I have created an array "duration" of time durations using the tapply function in R. The names attached to the array are of class "character", and I believe this is why they sort as "1" "10" "100" "2" "20" "200"... (example in the code below).
These names are trip numbers, and I would like to sort by this number in ascending order (1, 2, 3, ...). I have tried various approaches using order, sort, converting to a data.frame, etc., but without success. Please help!
My code is below.
tripDur <- function (aDate) {
difftime(max(aDate), min(aDate), units = "hours")
}
tmp<- tapply(gps$D_DATE, gps$trip, tripDur)
duration <- tmp; duration
> duration
1 10 100 101 102 103 104 105
14.8155556 4.6188889 1.6166667 15.9366667 27.4000000 18.1200000 16.8522222 16.9066667
> str(duration)
num [1:158(1d)] 14.82 4.62 1.62 15.94 27.4 ...
- attr(*, "dimnames")=List of 1
..$ : chr [1:158] "1" "10" "100" "101" ...
Try something along the lines of this.
> my.vec <- letters[1:5]
> names(my.vec) <- c("1", "10", "5", "100", "13")
> my.vec
1 10 5 100 13
"a" "b" "c" "d" "e"
> my.vec[order(as.numeric(names(my.vec)))]
1 5 10 13 100
"a" "c" "b" "e" "d"

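The same idea applied to a tapply() result might look like the sketch below. The data frame gps here is a made-up stand-in with an hours column, not the asker's data, and the duration is computed as a plain difference rather than with difftime.

```r
# Hypothetical stand-in for the GPS data: trip numbers stored as character
gps <- data.frame(
  trip  = c("1", "1", "10", "10", "2", "2", "100", "100"),
  hours = c(0, 14.8, 0, 4.6, 0, 3.2, 0, 1.6),
  stringsAsFactors = FALSE
)

# tapply() orders its result by the sorted character labels
duration <- tapply(gps$hours, gps$trip, function(h) max(h) - min(h))
names(duration)

# Reorder by the numeric value of the names
sorted <- duration[order(as.numeric(names(duration)))]
sorted
```

Another option is to convert the trip column to numeric before calling tapply(), so that the groups come out in numeric order in the first place.
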
What is R table function max size?

I'm using the R table() function, but it only gives me 4222 rows. Is there some kind of configuration to accept more rows?
The table function is not limited to 4222 rows. Most likely it is the printing limit that is causing the trouble.
Try:
options(max.print = 20000)
Also, check the "real" number of rows:
tbl <- table(state.division, state.region)
nrow(tbl)
Nothing wrong with larger tables? What gave you that impression?
> set.seed(123)
> fac <- factor(sample(10000, 10000, rep = TRUE))
> fac2 <- factor(sample(10000, 10000, rep = TRUE))
> tab <- table(fac, fac2)
> str(tab)
'table' int [1:6282, 1:6279] 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ fac : chr [1:6282] "1" "5" "7" "9" ...
..$ fac2: chr [1:6279] "1" "2" "3" "4" ...
Printing tab will cause problems - it takes a while to generate and then you'll get this message:
[ reached getOption("max.print") -- omitted 6267 rows ]
You can alter that by changing options(max.print = XXXXX), where XXXXX is some large number, but I don't see what is gained by printing such a large table. If you were trying to do this to check that the correct table had been produced, size-wise, then
> dim(tab)
[1] 6282 6279
> str(tab)
'table' int [1:6282, 1:6279] 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ fac : chr [1:6282] "1" "5" "7" "9" ...
..$ fac2: chr [1:6279] "1" "2" "3" "4" ...
help with that.
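If the goal is to look at the contents rather than the size, it is usually easier to inspect a slice or only the non-empty cells than to raise max.print. A sketch on a smaller table than the one above, with sizes chosen arbitrarily to keep it quick:

```r
set.seed(123)
fac  <- factor(sample(500, 2000, replace = TRUE))
fac2 <- factor(sample(500, 2000, replace = TRUE))
tab  <- table(fac, fac2)

dim(tab)                 # full dimensions, independent of max.print
getOption("max.print")   # current printing limit

# Look at a small corner instead of printing everything
corner <- tab[1:5, 1:5]
corner

# Long format, keeping only the combinations that actually occur
nonzero <- as.data.frame(tab, stringsAsFactors = FALSE)
nonzero <- nonzero[nonzero$Freq > 0, ]
head(nonzero)
```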
