Create new variables based on the names of other variables - r
I have a dataset that looks something like this:
  region region_a_price_raw region_b_price_raw region_c_price_raw
1      C  0.691277900885566  -1.12168419402904  -1.80708124084338
2      B -0.917718229751991   0.35628937645658  0.587525813366388
3      B    1.2846493277039   0.13190349180679   1.26024317859471
   region_a_adjusted region_b_adjusted  region_c_adjusted
1 -0.823054962637259 -1.56205680347623   2.39150423647063
2  0.839040270582852 0.240455566072964 -0.281641015285604
3 -0.971360861843787 0.257888869705433 -0.979961536031851
  region_a_pct_chng region_b_pct_chng region_c_pct_chng
1                94                43               100
2                27                48                21
3                92                64                82
What I need to do is create new variables that hold, for each row, the value from the column matching that row's region. I need one such variable for each of the raw, adjusted and pct_chng sets of columns.
I know how to do this manually. However, there are a lot of regions (far more than the three in the example), as well as multiple percent change variables (I only included one here for the sake of brevity).
So what I'm hoping is that, since each relevant price variable includes the region name in its own variable name, there is some way to write a function that automatically detects the region from the variable name. Unfortunately, I don't know how to do this elegantly at present.
library(dplyr)
#creating sample data
df1 <- data.frame(region = sample(LETTERS[1:3],15,replace = TRUE), region_a_price_raw = rnorm(15),region_b_price_raw=rnorm(15),region_c_price_raw=rnorm(15))
df2 <- data.frame(region_a_adjusted=rnorm(15),region_b_adjusted=rnorm(15),region_c_adjusted=rnorm(15))
df3 <- data.frame(region_a_pct_chng=sample(1:100,15,replace = TRUE),region_b_pct_chng=sample(1:100,15,replace = TRUE),region_c_pct_chng=sample(1:100,15,replace = TRUE))
sample <- cbind(df1,df2,df3)
#here's how it would work manually. this would take forever in the actual dataset though
sample <- sample %>%
mutate(price_raw=case_when(region=="A"~region_a_price_raw,
region=="B"~region_b_price_raw,
region=="C"~region_c_price_raw)) %>%
mutate(price_adjusted=case_when(region=="A"~region_a_adjusted,
region=="B"~region_b_adjusted,
region=="C"~region_c_adjusted)) %>%
mutate(pct_chng=case_when(region=="A"~region_a_pct_chng,
region=="B"~region_b_pct_chng,
region=="C"~region_c_pct_chng))
I'm hoping someone has a way to do this that won't have me manually doing this across each region and price variable.
(I think there's a more direct way than this for combining the last three lines into one using a little regex...)
library(dplyr); library(tidyr)
sample %>%
mutate(row = row_number()) %>%
pivot_longer(-c(row, region)) %>%
separate(name, c("drop", "region", "type"), sep = "_", extra = "merge") %>%
pivot_wider(names_from = type, values_from = value)
Result
# A tibble: 45 × 6
     row drop   region price_raw adjusted pct_chng
   <int> <chr>  <chr>      <dbl>    <dbl>    <dbl>
 1     1 region a         0.222    -0.869       92
 2     1 region b         0.149    -0.972       19
 3     1 region c         1.04      0.116       94
 4     2 region a        -0.844    -0.755       13
 5     2 region b        -0.963    -0.547       81
 6     2 region c         0.198     1.38        61
 7     3 region a         0.444    -0.130       48
 8     3 region b        -0.0665   -1.69        13
 9     3 region c        -1.63      0.574       56
10     4 region a         0.0558   -1.00         7
# … with 35 more rows
You never gave a seed for your data, so I will use the three rows of data shown above:
library(dplyr); library(tidyr)
sample %>%
  pivot_longer(-region, names_to = c('grp', '.value'),
               names_pattern = 'region_([^_]+)_(.+)$') %>%
  filter(tolower(region) == grp)
  region grp   price_raw adjusted pct_chng
  <chr>  <chr>     <dbl>    <dbl>    <int>
1 C      c        -1.81     2.39       100
2 B      b         0.356    0.240       48
3 B      b         0.132    0.258       64
Data
sample <- structure(list(region = c("C", "B", "B"), region_a_price_raw = c(0.691277900885566,
-0.917718229751991, 1.2846493277039), region_b_price_raw = c(-1.12168419402904,
0.35628937645658, 0.13190349180679), region_c_price_raw = c(-1.80708124084338,
0.587525813366388, 1.26024317859471), region_a_adjusted = c(-0.823054962637259,
0.839040270582852, -0.971360861843787), region_b_adjusted = c(-1.56205680347623,
0.240455566072964, 0.257888869705433), region_c_adjusted = c(2.39150423647063,
-0.281641015285604, -0.979961536031851), region_a_pct_chng = c(94L,
27L, 92L), region_b_pct_chng = c(43L, 48L, 64L), region_c_pct_chng = c(100L,
21L, 82L)), class = "data.frame", row.names = c(NA, 3L))
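For comparison, the same row-wise lookup can be sketched in base R without reshaping. This is only an illustration under stated assumptions: pick_by_region() is a hypothetical helper (it does not come from either answer), the values are rounded copies of the three-row data, and the columns are assumed to be named region_<lowercase region>_<suffix>.

```r
# Rounded copy of the three-row example data (illustrative values only)
dat <- data.frame(
  region = c("C", "B", "B"),
  region_a_price_raw = c(0.69, -0.92, 1.28),
  region_b_price_raw = c(-1.12, 0.36, 0.13),
  region_c_price_raw = c(-1.81, 0.59, 1.26),
  region_a_adjusted = c(-0.82, 0.84, -0.97),
  region_b_adjusted = c(-1.56, 0.24, 0.26),
  region_c_adjusted = c(2.39, -0.28, -0.98)
)

# Hypothetical helper: for every row, build the name of the column that
# belongs to that row's region and pull the value from it.
pick_by_region <- function(data, suffix) {
  wanted <- paste0("region_", tolower(data$region), "_", suffix)
  col_idx <- match(wanted, names(data))
  if (anyNA(col_idx)) stop("no column found for some regions")
  vapply(seq_len(nrow(data)),
         function(i) data[[col_idx[i]]][i],
         numeric(1))
}

dat$price_raw <- pick_by_region(dat, "price_raw")
dat$price_adjusted <- pick_by_region(dat, "adjusted")
print(dat[c("region", "price_raw", "price_adjusted")])
```

Adding another set of columns (for example a second percent change variable) would only need one more call with a different suffix, which is the point of deriving the column name from the region value.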
Related
How to discretize a numeric column and summarize by it, with boundaries that don't overlap (equivalent to Google Sheets' "Pivot Group Rule")?
I'm trying to find the R procedure that is equivalent to Google Sheets' Pivot Group Rule. That is, I want to summarize that data by discretizing a numerical column with a fixed interval size that I decide on. I am almost getting the desired output, but am having a trouble with the "(a,b]" interval notation. Example df <- data.frame( num_col = c(1400,9000,15000,17350,20000,22000, 25000,40000,42000,45000,50000,60000,65000,70000,75000, 1e+05,120000,125000,150000,168000,180000,2e+05,225000, 250000,270000,290000,3e+05,350000,4e+05,427000,450000,5e+05, 550000,560000,6e+05,625000,650000,7e+05,750000,8e+05, 850000,9e+05,913000,930000,950000,990000,1e+06,1066167, 1100000,1200000,1250000,1300000,1400000,1420000,1500000, 1700000,1750000,1800000,1900000,1950000,2e+06,2100000, 2300000,2400000,2450000,2500000,3e+06,3150000,3200000, 3300000,3400000,3440000,3500000,3660000,3800000,3850000, 4e+06,4400000,4500000,4600000,4700000,4800000,4900000,5e+06, 5500000,6e+06,6400000,6500000,6600000,6800000,6900000, 7e+06,7200000,7217600,7400000,7500000,7700000,8e+06, 8200000,8495000,8500000,8700000,8900000,9e+06,9200000,9500000, 9600000,1e+07,10500000,10818775,1.1e+07,11500000, 1.2e+07,12500000,12620000,1.3e+07,13200000,13400000,13500000, 1.4e+07,14500000,14800000,1.5e+07,1.6e+07,1.7e+07,17500000, 1.8e+07,18026148,18500000,1.9e+07,19500000,19800000, 19900000,2e+07,2.1e+07,2.2e+07,22500000,2.3e+07,2.4e+07, 2.5e+07,25500000,2.6e+07,2.7e+07,27220000,2.8e+07,2.9e+07, 3e+07,30300000,3.1e+07,31500000,3.2e+07,32500000,3.3e+07, 3.4e+07,3.5e+07,3.6e+07,3.7e+07,3.8e+07,38600000,3.9e+07, 39200000,4e+07,4.1e+07,4.2e+07,4.3e+07,4.4e+07,44500000, 4.5e+07,4.6e+07,4.7e+07,4.8e+07,4.9e+07,49900000,5e+07, 50100000,50200000,5.2e+07,5.3e+07,5.5e+07,5.6e+07,5.7e+07, 5.8e+07,58800000,6e+07,6.1e+07,6.3e+07,6.5e+07,6.6e+07, 6.8e+07,68005000,6.9e+07,7e+07,7.3e+07,7.4e+07,7.5e+07, 7.6e+07,7.8e+07,7.9e+07,8e+07,81200000,8.2e+07,8.4e+07, 8.5e+07,8.8e+07,9e+07,9.2e+07,9.3e+07,9.4e+07,9.5e+07, 
9.9e+07,1e+08,1.02e+08,1.03e+08,1.05e+08,1.08e+08,1.1e+08, 1.12e+08,1.15e+08,1.17e+08,1.2e+08,1.25e+08,1.27e+08, 1.3e+08,1.32e+08,1.35e+08,1.4e+08,1.44e+08,1.45e+08,1.5e+08, 1.55e+08,1.6e+08,1.65e+08,1.7e+08,1.75e+08,1.76e+08, 1.78e+08,1.8e+08,1.85e+08,1.9e+08,1.95e+08,2e+08,2.09e+08, 2.1e+08,2.15e+08,2.2e+08,2.25e+08,2.3e+08,2.45e+08,2.5e+08, 2.6e+08,263700000,6e+08), val = c(1,1,1,1,2,1,1,1,1,1,4,3,1,2,2, 8,1,4,4,1,1,7,1,11,1,1,6,2,2,1,3,21,1,1,3, 1,3,1,3,1,1,3,1,1,2,1,24,1,6,8,1,3,2,1,13, 1,1,4,1,1,22,3,1,1,1,13,27,1,2,3,2,1,12,1,1, 1,20,2,3,1,2,1,1,44,2,12,1,4,1,1,1,21,1,1,1, 3,1,15,1,1,5,1,1,8,1,2,1,43,1,1,11,1,24,2, 1,15,1,1,2,8,1,1,34,9,16,1,15,1,1,6,1,1,1,55, 3,11,1,4,5,40,1,9,3,1,14,3,38,1,3,1,7,1,2, 3,34,5,6,6,1,1,1,38,1,6,1,3,1,8,1,1,1,1,1, 25,1,1,3,1,11,1,1,5,1,18,4,1,12,2,4,1,2,11,1, 2,9,1,2,2,14,1,1,1,5,1,9,2,1,1,5,1,16,1,1, 3,1,8,1,2,1,8,7,1,8,1,8,4,1,6,14,2,4,6,8,4, 1,2,3,2,5,2,12,1,1,2,1,3,1,2,6,1,1,1) ) look at the data tibble::as_tibble(df) #> # A tibble: 252 x 2 #> num_col val #> <dbl> <dbl> #> 1 1400 1 #> 2 9000 1 #> 3 15000 1 #> 4 17350 1 #> 5 20000 2 #> 6 22000 1 #> 7 25000 1 #> 8 40000 1 #> 9 42000 1 #> 10 45000 1 #> # ... with 242 more rows desired output desired_output <- tibble::tribble( ~num_col_interval, ~val_sum, "0 - 49999999", 962L, "50000000 - 99999999", 164L, "100000000 - 149999999", 78L, "150000000 - 199999999", 53L, "200000000 - 249999999", 23L, "250000000 - 299999999", 8L, "600000000 - 649999999", 1L ) My attempt library(dplyr) library(ggplot2) df |> group_by(num_col_interval = ggplot2::cut_interval(num_col, length = 50000000 - 1, dig.lab = 10)) |> summarise(across(val, sum)) #> # A tibble: 7 x 2 #> num_col_interval val #> <fct> <dbl> #> 1 [0,49999999] 962 #> 2 (49999999,99999998] 164 #> 3 (99999998,149999997] 78 #> 4 (149999997,199999996] 53 #> 5 (199999996,249999995] 23 #> 6 (249999995,299999994] 8 #> 7 (599999988,649999987] 1 You can see that the interval boundaries overlap. 
In the first row, the interval ranges from 0 to 49999999, and in the second row, from 49999999 to 99999998. I do understand the difference between ] and ( in the breaks notation. Nevertheless, I want the ranges in the num_col_interval column to be as in desired_output.
How can I programmatically format the num_col_interval values to be as in desired_output? I'm mostly looking for a straightforward dplyr solution.
Here's how I would do it with Google Sheets, getting the desired output: (screenshot not preserved)
Several SO posts are relevant, but none of them answered my question:
How does cut with breaks work in R
Cut function in R - exclusive or am I double counting?
Cut by Defined Interval
Try this (with df defined exactly as in the question):
library(tidyverse)
df |>
  group_by(num_col_interval = cut_width(num_col, width = 50000000, dig.lab = 10, closed = "left", boundary = 0)) |>
  summarise(across(val, sum)) |>
  separate(num_col_interval, into = c("left", "right"), sep = ",") |>
  mutate(across(-val, parse_number),
         right = if_else(right < max(right), right - 1L, right),
         across(-val, ~ format(., scientific = FALSE)),
         val = as.integer(val)) |>
  unite(num_col_interval, left:right, sep = " - ")
#> # A tibble: 7 × 2
#>   num_col_interval          val
#>   <chr>                   <int>
#> 1 "        0 -  49999999"   962
#> 2 " 50000000 -  99999999"   164
#> 3 "100000000 - 149999999"    78
#> 4 "150000000 - 199999999"    53
#> 5 "200000000 - 249999999"    23
#> 6 "250000000 - 299999999"     8
#> 7 "550000000 - 600000000"     1
Created on 2022-12-18 with reprex v2.0.2
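A different route, sketched here in base R, avoids parsing the labels produced by cut altogether: compute each value's lower bin edge with integer division and build the label from it. This is a minimal sketch, assuming non-negative values and a fixed bin width; the six-element vectors are made up for illustration and the names lower, fmt and out are not from the answer.

```r
# Toy data (hypothetical values chosen to sit on the bin edges)
num_col <- c(1400, 49999999, 50000000, 99999999, 100000000, 600000000)
val     <- c(1, 2, 3, 4, 5, 6)
width   <- 50000000

# Lower edge of the bin that each value falls into
lower <- floor(num_col / width) * width

# Sum val within each bin; aggregate() returns the bins sorted by lower edge
out <- aggregate(val ~ lower,
                 data = data.frame(lower = lower, val = val),
                 FUN = sum)

# Plain, unpadded number formatting without scientific notation
fmt <- function(x) formatC(x, format = "f", digits = 0)

# Inclusive, non-overlapping labels: lower edge to (next edge - 1)
out$num_col_interval <- paste(fmt(out$lower),
                              fmt(out$lower + width - 1),
                              sep = " - ")
out <- out[c("num_col_interval", "val")]
print(out)
```

Because every label is derived from its own lower edge, the top bin here reads "600000000 - 649999999", as in desired_output. Empty bins are simply absent, as in the answer above.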
dplyr: How to rearrange this dataframe and create new columns by extracting parts of other columns
Let's say I have this dataframe:
> a
  T..Gene.names Intensity.Mut_125 Intensity.Mut_250 Intensity.Mut.1000 Intensity.Mut.500
1          NCAN               NaN           25.6628            23.8427               NaN
2          AMBP           22.8276           27.0801            25.4740           23.5596
3          CHGB           25.4463           30.0065            27.8181           27.3170
4           APP           25.0346           29.7784            27.0848           24.7314
I need to rearrange my dataframe so that each gene in a$T..Gene.names becomes a new column.
Then, I need a new column called a$sample that extracts the word between Intensity and the number (either 125, 250, 500, 1000 or 2000). One issue is that this word and the following number are separated by either . or _
Finally, I need a column named a$volume that corresponds to the number. Missing values (NaN) should be converted to 0.
I made several attempts with pivot_longer and pivot_wider, but this is above my current skill level.
Expected output
sample volume    NCAN    AMBP    CHGB     APP
Mut       125       0 22.8276 25.4463 25.0346
Mut       250 25.6628 27.0801 30.0065 29.7784
Mut       500       0 23.5596 27.3170 24.7314
Mut      1000 23.8427 25.4740 27.8181 27.0848
I would prefer a dplyr solution.
a <- structure(list(T..Gene.names = c("NCAN", "AMBP", "CHGB", "APP"),
    Intensity.Mut_125 = c(NaN, 22.8276, 25.4463, 25.0346),
    Intensity.Mut_250 = c(25.6628, 27.0801, 30.0065, 29.7784),
    Intensity.Mut.1000 = c(23.8427, 25.474, 27.8181, 27.0848),
    Intensity.Mut.500 = c(NaN, 23.5596, 27.317, 24.7314)),
    row.names = c(NA, 4L), class = "data.frame")
library(tidyr)  # for separate() and %>%
reshape2::recast(a, variable~T..Gene.names, fill = 0) %>%
  separate(variable, c('type', 'sample', 'volume'))
       type sample volume    AMBP     APP    CHGB    NCAN
1 Intensity    Mut    125 22.8276 25.0346 25.4463  0.0000
2 Intensity    Mut    250 27.0801 29.7784 30.0065 25.6628
3 Intensity    Mut   1000 25.4740 27.0848 27.8181 23.8427
4 Intensity    Mut    500 23.5596 24.7314 27.3170  0.0000
Another possible solution using tidyr:
library(tidyr)
pivot_longer(a, cols = !"T..Gene.names", names_to = c('sample', 'volume'),
             names_prefix = "Intensity.", names_sep = '_|\\.',
             values_drop_na = TRUE) %>%
  pivot_wider(names_from = "T..Gene.names", values_fill = list(value = 0))
# A tibble: 4 × 6
  sample volume  NCAN  AMBP  CHGB   APP
  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl>
1 Mut    250     25.7  27.1  30.0  29.8
2 Mut    1000    23.8  25.5  27.8  27.1
3 Mut    125      0    22.8  25.4  25.0
4 Mut    500      0    23.6  27.3  24.7
names_prefix = "Intensity." removes "Intensity." from the column names.
names_sep = '_|\\.' splits the column names at either . or _
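To see what the prefix removal and the '_|\\.' separator do to the column names, the name parsing can be reproduced on its own in base R. This is a small sketch rather than part of either answer; the key data frame is a hypothetical lookup table. It also converts volume to integer, which gives a numeric sort order (125, 250, 500, 1000) instead of the character order seen above.

```r
# Column names from the question
nms <- c("Intensity.Mut_125", "Intensity.Mut_250",
         "Intensity.Mut.1000", "Intensity.Mut.500")

# Drop the common prefix, then split on either "_" or "."
stripped <- sub("^Intensity\\.", "", nms)
parts <- strsplit(stripped, "[_.]")

# Hypothetical lookup table: original column -> sample and volume
key <- data.frame(
  column = nms,
  sample = vapply(parts, `[`, character(1), 1),
  volume = as.integer(vapply(parts, `[`, character(1), 2))
)

# Numeric volume sorts as numbers, not as text
key <- key[order(key$volume), ]
print(key)
```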
Wide to long with many different columns
I have used pivot_longer before, but this time I have a much more complex wide dataframe and I cannot sort it out. The example code below provides a reproducible dataframe. I haven't dealt with such a thing before, so I'm not sure whether it's correct to try to put this type of df into long format.
df <- data.frame(
  ID = as.numeric(c("7","8","10","11","13","15","16")),
  AGE = as.character(c("45 – 54","25 – 34","25 – 34","25 – 34","25 – 34","18 – 24","35 – 44")),
  GENDER = as.character(c("Female","Female","Male","Female","Other","Male","Female")),
  SD = as.numeric(c("3","0","0","0","3","2","0")),
  GAMING = as.numeric(c("0","0","0","0","2","2","0")),
  HW = as.numeric(c("2","2","0","2","2","2","2")),
  R1_1 = as.numeric(c("10","34","69","53","79","55","28")),
  M1_1 = as.numeric(c("65","32","64","53","87","55","27")),
  P1_1 = as.numeric(c("65","38","67","54","88","44","26")),
  R1_2 = as.numeric(c("15","57","37","54","75","91","37")),
  M1_2 = as.numeric(c("90","26","42","56","74","90","37")),
  P1_2 = as.numeric(c("90","44","33","54","79","95","37")),
  R1_3 = as.numeric(c("5","47","80","27","61","19","57")),
  M1_3 = as.numeric(c("30","71","80","34","71","15","57")),
  P1_3 = as.numeric(c("30","36","81","35","62","8","56")),
  R2_1 = as.numeric(c("10","39","75","31","71","80","59")),
  M2_1 = as.numeric(c("90","51","74","15","70","75","61")),
  P2_1 = as.numeric(c("90","52","35","34","69","83","60")),
  R2_2 = as.numeric(c("10","45","31","54","39","95","77")),
  M2_2 = as.numeric(c("60","70","40","78","5","97","75")),
  P2_2 = as.numeric(c("60","40","41","58","9","97","76")),
  R2_3 = as.numeric(c("5","38","78","45","25","16","22")),
  M2_3 = as.numeric(c("30","34","84","62","33","52","20")),
  P2_3 = as.numeric(c("30","34","82","45","32","16","22")),
  R3_1 = as.numeric(c("10","40","41","42","62","89","41")),
  M3_1 = as.numeric(c("90","67","37","40","27","89","42")),
  P3_1 = as.numeric(c("90","34","51","44","38","84","43")),
  R3_2 = as.numeric(c("10","37","20","54","8","93","69")),
  M3_2 = as.numeric(c("60","38","21","62","5","95","71")),
  P3_2 = as.numeric(c("60","38","23","65","14","92","69")),
  R3_3 = as.numeric(c("5","30","62","11","60","32","52")),
  M3_3 = as.numeric(c("30","67","34","55","45","25","45")),
  P3_3 = as.numeric(c("30","28","41","24","53","23","52")),
  R1_4 = as.numeric(c("10","40","61","17","39","72","25")),
  M1_4 = as.numeric(c("45","20","63","25","62","70","23")),
  P1_4 = as.numeric(c("45","52","56","16","26","72","27")),
  R2_4 = as.numeric(c("5","21","70","33","80","68","30")),
  M2_4 = as.numeric(c("35","21","69","27","85","69","23")),
  P2_4 = as.numeric(c("35","32","34","25","79","63","29")),
  R3_4 = as.numeric(c("10","29","68","21","8","71","41")),
  M3_4 = as.numeric(c("50","37","66","28","33","65","41")),
  P3_4 = as.numeric(c("50","38","47","28","24","71","41"))
)
I would like to reshape it as in the following table: (table not preserved)
The new column names are extracted from the old ones such that, for example, in R1_1:
R is the name of the column containing the value previously stored in R1_1
1 (the first character after 'R' in R1_1) is the value used in column Speed
1 (the last character of 'R1_1') is the value used in column Sound
Basically, each row corresponds to one question answered by one person, and each question was answered through three different ratings (R, M, P).
Thank you!
If I understood you correctly, the following should work:
df %>%
  pivot_longer(
    cols = matches('[RMP]\\d_\\d'),
    names_to = c('RMP', 'Speed', 'Sound'),
    values_to = 'Data',
    names_pattern = '([RMP])(\\d)_(\\d)'
  ) %>%
  pivot_wider(names_from = RMP, values_from = Data)
This assumes that both “speed” and “sound” are single-digit values. If there’s the possibility of multiple digits, the occurrences of \\d in the patterns above need to be replaced by \\d+.
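To make the capture groups in names_pattern concrete, the same kind of pattern can be applied to a few column names in base R. This is an illustrative sketch only: the name vector is a hand-picked subset, key is a hypothetical lookup table, and the pattern uses [0-9]+ so that multi-digit speeds or sounds would also match.

```r
# A few column names from the question, including two that should not match
nms <- c("ID", "AGE", "R1_1", "M2_3", "P3_4")

# Group 1: rating letter, group 2: speed, group 3: sound
pattern <- "^([RMP])([0-9]+)_([0-9]+)$"

# regexec() + regmatches() return the full match plus each capture group;
# names that do not match yield character(0) and are dropped here
hits <- regmatches(nms, regexec(pattern, nms))
hits <- hits[lengths(hits) > 0]
parsed <- do.call(rbind, hits)

# Hypothetical lookup table: original column -> its three components
key <- data.frame(
  column = parsed[, 1],
  RMP    = parsed[, 2],
  Speed  = as.integer(parsed[, 3]),
  Sound  = as.integer(parsed[, 4])
)
print(key)
```

Each capture group corresponds to one entry of names_to in the answer above, in the same order.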
A solution using our good ol' workhorse reshape. First we grep the names with a "Wd_d" pattern, as well as their "d_d" suffixes, for use in reshape. The value columns are passed to varying as a list with one element per rating letter, so that each group of columns is paired with the matching name in v.names (a plain vector of columns is regrouped alphabetically by reshape, which attaches the R, M and P labels to the wrong values).
nm <- names(df[grep("_\\d", names(df))])
times <- unique(substr(nm, 2, 4))
prefixes <- unique(substr(nm, 1, 1))
varying <- lapply(prefixes, function(p) nm[substr(nm, 1, 1) == p])
res <- reshape(df, idvar="ID", varying=varying, v.names=prefixes, times=times, direction="long")
This gets us close to the result; we just need to strsplit the newly created "time" variable at the "_" and cbind the pieces to the former.
res <- cbind(res, setNames(type.convert(do.call(rbind.data.frame, strsplit(res$time, "_")), as.is=TRUE), c("Speed", "Sound")))
res <- res[order(res$AGE), ] ## some ordering
Result
head(res)
#        ID     AGE GENDER SD GAMING HW time  R  M  P Speed Sound
# 15.1_1 15 18 – 24   Male  2      2  2  1_1 55 55 44     1     1
# 15.1_2 15 18 – 24   Male  2      2  2  1_2 91 90 95     1     2
# 15.1_3 15 18 – 24   Male  2      2  2  1_3 19 15  8     1     3
# 15.2_1 15 18 – 24   Male  2      2  2  2_1 80 75 83     2     1
# 15.2_2 15 18 – 24   Male  2      2  2  2_2 95 97 97     2     2
# 15.2_3 15 18 – 24   Male  2      2  2  2_3 16 52 16     2     3
Unable to run Two-way repeated measures ANOVA; 0 (non-NA) cases
I am trying to follow the tutorial by Datanovia for two-way repeated measures ANOVA.
A quick overview of my dataset: I have measured the number of different bacterial species in 12 sampling units over time. I have 16 time points and 2 groups. I have organised my data as a tibble called "richness":
# A tibble: 190 x 4
   id    selection.group Day   value
   <fct> <fct>           <fct> <dbl>
 1 KRH1  KR              2      111.
 2 KRH2  KR              2      141.
 3 KRH3  KR              2      110.
 4 KRH1  KR              4      126
 5 KRH2  KR              4      144
 6 KRH3  KR              4      135.
 7 KRH1  KR              6      115.
 8 KRH2  KR              6      113.
 9 KRH3  KR              6      107.
10 KRH1  KR              8      119.
The id refers to each sampling unit, and the selection group is a factor with two levels (KR and RK).
richness <- tibble(
  id = factor(c("KRH1", "KRH3", "KRH2", "RKH2", "RKH1", "RKH3")),
  selection.group = factor(c("KR", "KR", "KR", "RK", "RK", "RK")),
  Day = factor(c(2,2,4,2,4,4)),
  value = c(111, 110, 144, 92, 85, 69)) # subset of original data
My tibble appears to be in the same format as the one in the tutorial:
> str(selfesteem2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 72 obs. of 4 variables:
 $ id       : Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ treatment: Factor w/ 2 levels "ctr","Diet": 1 1 1 1 1 1 1 1 1 1 ...
 $ time     : Factor w/ 3 levels "t1","t2","t3": 1 1 1 1 1 1 1 1 1 1 ...
 $ score    : num 83 97 93 92 77 72 92 92 95 92 ..
Before I can run the repeated measures ANOVA, I must check my data for normality. I copied the framework proposed in the tutorial.
#my code
richness %>% group_by(selection.group, Day) %>% shapiro_test(value)
#tutorial code
selfesteem2 %>% group_by(treatment, time) %>% shapiro_test(score)
But I get the error message "Error: Column variable is unknown" when I try to run the code. Does anyone know why this happens?
I tried to continue without assurance that my data is normally distributed and tried to run the ANOVA:
res.aov <- rstatix::anova_test(
  data = richness, dv = value, wid = id,
  within = c(selection.group, Day)
)
But I get this error message:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases
I have checked for NA values with any(is.na(richness)), which returns FALSE. I have also checked table(richness$selection.group, richness$Day) to be sure my setup is correct:
     2 4 6 8 12 16 20 24 28 29 30 32 36 40 44 50
  KR 6 6 6 6  6  6  6  6  6  6  6  5  6  6  6  6
  RK 6 6 6 6  6  5  6  6  6  6  6  6  6  6  6  6
And the setup appears correct. I would be very grateful for tips on solving this.
Best regards, Madeleine
Below is a subset of my dataset in a reproducible format:
library(tidyverse)
library(rstatix)
library(tibble)
richness_subset = data.frame(
  id = c("KRH1", "KRH3", "KRH2", "RKH2", "RKH1", "RKH3"),
  selection.group = c("KR", "KR", "KR", "RK", "RK", "RK"),
  Day = c(2,2,4,2,4,4),
  value = c(111, 110, 144, 92, 85, 69))
richness_subset$Day = factor(richness$Day)
richness_subset$selection.group = factor(richness$selection.group)
richness_subset$id = factor(richness$id)
richness_subset = tibble::as_tibble(richness_subset)
richness_subset %>% group_by(selection.group, Day) %>% shapiro_test(value)
# gives Error: Column `variable` is unknown
res.aov <- rstatix::anova_test(
  data = richness, dv = value, wid = id,
  within = c(selection.group, Day)
)
# gives Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# 0 (non-NA) cases
I created something like the design of your data:
set.seed(111)
richness = data.frame(id=rep(c("KRH1","KRH2","KRH3"),6),
  selection.group=rep(c("KR","RK"),each=9),
  Day=rep(c(2,4,6),each=3,times=2),value=rpois(18,100))
richness$Day = factor(richness$Day)
richness$id = factor(richness$id)
First, shapiro_test: there's a bug in the function, and the column you want to test cannot be named "value":
# gives error Error: Column `variable` is unknown
richness %>% shapiro_test(value)
#works
richness %>% mutate(X = value) %>% shapiro_test(X)
# A tibble: 1 x 3
  variable statistic     p
  <chr>        <dbl> <dbl>
1 X            0.950 0.422
1 X            0.963 0.843
Second, for the ANOVA, this works for me:
rstatix::anova_test(
  data = richness, dv = value, wid = id,
  within = c(selection.group, Day)
)
In my example every term can be estimated. What I suspect is that one of your terms is a linear combination of the others. Using my example:
set.seed(111)
richness = data.frame(id=rep(c("KRH1","KRH2","KRH3","KRH4","KRH5","KRH6"),3),
  selection.group=rep(c("KR","RK"),each=9),
  Day=rep(c(2,4,6),each=3,times=2),value=rpois(18,100))
richness$Day = factor(richness$Day)
richness$id = factor(richness$id)
rstatix::anova_test(
  data = richness, dv = value, wid = id,
  within = c(selection.group, Day)
)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases
This gives exactly the same error. It can be checked using:
lm(value~id+Day:selection.group,data=richness)
Call:
lm(formula = value ~ id + Day:selection.group, data = richness)
Coefficients:
           (Intercept)                     id1                     id2
               101.667                  -3.000                  -6.000
                   id3                     id4                     id5
                -6.000                   1.889                  11.556
Day2:selection.groupKR  Day4:selection.groupKR  Day6:selection.groupKR
                 1.667                 -12.000                   9.333
Day2:selection.groupRK  Day4:selection.groupRK  Day6:selection.groupRK
                -1.667                      NA                      NA
The Day4:selection.groupRK and Day6:selection.groupRK terms are not estimable because they are covered by a linear combination of the preceding terms.
The solution for running the Shapiro_test proposed above worked. And I figured out I have some linear combination by running lm(value~id+Day:selection.group,data=richness). However, I don't understand why? I know I have data points for each group (see graph). Where does this linear combination come from? Repeated measure ANOVA appears so appropriate for me as I am following sampling units over time.
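One possible explanation, offered as an assumption rather than a diagnosis of the real data: the ids shown in the question (KRH1 to KRH3 in KR, RKH1 to RKH3 in RK) suggest that each sampling unit belongs to only one selection group. In that case selection.group varies between units rather than within them, so once id is in the model the group effect carries no extra information. The sketch below checks this with a cross-tabulation on made-up toy data; richness_toy and the two flags are hypothetical names.

```r
# Toy design (hypothetical): each id belongs to exactly one selection group,
# but is measured on every day
richness_toy <- data.frame(
  id = rep(c("KRH1", "KRH2", "RKH1", "RKH2"), each = 2),
  selection.group = rep(c("KR", "RK"), each = 4),
  Day = factor(rep(c(2, 4), times = 4))
)

# Count observations per id within each level of the candidate factor
by_group <- table(richness_toy$id, richness_toy$selection.group)
by_day   <- table(richness_toy$id, richness_toy$Day)
print(by_group)

# A factor is a within-subject factor only if every id appears at every level
group_is_within <- all(by_group > 0)
day_is_within   <- all(by_day > 0)
print(c(group = group_is_within, day = day_is_within))
```

If the same pattern shows up in the real data, it may be worth trying selection.group in the between argument of rstatix::anova_test and keeping only Day in within. Whether that design matches the experiment is a judgment the data owner has to make.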
I had the same issue and couldn't find a solution. In the end, the following worked: install the "ez" package and use ezANOVA.
newModel <- ezANOVA(data = dataFrame, dv = .(outcome variable), wid = .(variable that identifies participants), within = .(repeated measures predictors), between = .(between-group predictors), detailed = FALSE, type = 2)
Example:
bushModel <- ezANOVA(data = longBush, dv = .(Retch), wid = .(Participant), within = .(Animal), detailed = TRUE, type = 3)
R unnest_tokens and calculate positions (start and end location) of each token
How can I get the position of all the tokens after using unnest_tokens? Here is a simple example:
df<-data.frame(id=1, doc=c("Patient: [** Name **], [** Name **] Acct.#: [** Medical_Record_Number **] MR #: [** Medical_Record_Number **] Location: [** Location **] "))
Tokenize by white space using tidytext:
library(tidytext)
tokens_df<-df %>%
  unnest_tokens(tokens,doc,token = stringr::str_split, pattern = "\\s", to_lower = F, drop = F)
How can I get the position of all the tokens?
id tokens   start end
1  Patient:     1   8
1               9   9
1  [**         12  14
1  Name        16  19
Here is the non-tidy approach to the problem.
library(stringr)   # str_extract_all(), str_locate_all()
library(magrittr)  # %>%
regex = "([^\\s]+)"
df_i = str_extract_all(df$doc, regex)
df_ii = str_locate_all(df$doc, regex)
output1 = Map(function(x, y, z){
  # documents without any token still get one row, filled with NA
  if(length(y) == 0){
    y = NA
  }
  if(nrow(z) == 0){
    z = rbind(z, list(start = NA, end = NA))
  }
  data.frame(id = x, token = y, z)
}, df$id, df_i, df_ii) %>%
  do.call(rbind,.) %>%
  merge(df, .)
I think the first answerer here has the right idea: if tokens split on whitespace and their character positions are the output you want, the best approach is to use string handling rather than tokenization and NLP. If you also want to use tidy data principles and end up with a data frame, try something like this:
library(tidyverse)
df <- tibble(id=1, doc=c("Patient: [** Name **], [** Name **] Acct.#: [** Medical_Record_Number **] "))
df %>%
  mutate(tokens = str_extract_all(doc, "([^\\s]+)"),
         locations = str_locate_all(doc, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-doc) %>%
  unnest(c(tokens, locations))
#> # A tibble: 11 x 4
#>       id tokens                start   end
#>    <dbl> <chr>                 <int> <int>
#>  1  1.00 Patient:                  1     8
#>  2  1.00 [**                      12    14
#>  3  1.00 Name                     16    19
#>  4  1.00 **],                     21    24
#>  5  1.00 [**                      26    28
#>  6  1.00 Name                     30    33
#>  7  1.00 **]                      35    37
#>  8  1.00 Acct.#:                  39    45
#>  9  1.00 [**                      50    52
#> 10  1.00 Medical_Record_Number    54    74
#> 11  1.00 **]                      76    78
This will work for multiple documents with an id for each string, and it removes the actual whitespace from the output because of the way the regex is constructed.
EDITED:
In a comment, the original poster asked for an approach that would allow tokenizing by sentence while also keeping track of the position of each word. The following code does that, in the sense that we get the start and end position of each token within each sentence. Could you use a combination of the sentenceID column with the start and end columns to find what you're looking for?
library(tidyverse)
library(tidytext)
james <- paste0(
  "The question thus becomes a verbal one\n",
  "again; and our knowledge of all these early stages of thought and feeling\n",
  "is in any case so conjectural and imperfect that farther discussion would\n",
  "not be worth while.\n",
  "\n",
  "Religion, therefore, as I now ask you arbitrarily to take it, shall mean\n",
  "for us _the feelings, acts, and experiences of individual men in their\n",
  "solitude, so far as they apprehend themselves to stand in relation to\n",
  "whatever they may consider the divine_. Since the relation may be either\n",
  "moral, physical, or ritual, it is evident that out of religion in the\n",
  "sense in which we take it, theologies, philosophies, and ecclesiastical\n",
  "organizations may secondarily grow.\n"
)
d <- tibble(txt = james)
d %>%
  unnest_tokens(sentence, txt, token = "sentences") %>%
  mutate(sentenceID = row_number(),
         tokens = str_extract_all(sentence, "([^\\s]+)"),
         locations = str_locate_all(sentence, "([^\\s]+)"),
         locations = map(locations, as.data.frame)) %>%
  select(-sentence) %>%
  unnest(c(tokens, locations))
#> # A tibble: 112 x 4
#>    sentenceID tokens   start   end
#>         <int> <chr>    <int> <int>
#>  1          1 the          1     3
#>  2          1 question     5    12
#>  3          1 thus        14    17
#>  4          1 becomes     19    25
#>  5          1 a           27    27
#>  6          1 verbal      29    34
#>  7          1 one         36    38
#>  8          1 again;      40    45
#>  9          1 and         47    49
#> 10          1 our         51    53
#> # ... with 102 more rows
Notice that these aren't quite "tokenized" in the normal sense of unnest_tokens(); each word still has its trailing punctuation, such as commas and periods, attached. It seemed like you wanted that from your original question.
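The same start and end bookkeeping can also be done with base R alone, which may help when stringr is not available. This is a minimal sketch on a shortened, made-up document with single spaces; tokens_df is a hypothetical name and the pattern [^[:space:]]+ plays the role of "([^\\s]+)" above.

```r
# Shortened example document (single spaces between tokens)
doc <- "Patient: [** Name **]"

# gregexpr() returns the start of every match; the match lengths are
# stored in the "match.length" attribute
hits <- gregexpr("[^[:space:]]+", doc)[[1]]

tokens_df <- data.frame(
  token = regmatches(doc, list(hits))[[1]],
  start = as.integer(hits),
  end   = as.integer(hits) + attr(hits, "match.length") - 1L
)
print(tokens_df)
```

For several documents, the same steps could be wrapped in a function and applied per row, adding the id column to each result before binding them together.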