R: Labels with 0 Count - r

I am looking for a way in R (I work in RStudio) to tabulate labelled variables so that labels with 0 counts are shown as well. I am using the cro and calc_cro functions.
My sample code:
S4 <- cro(data$S4, list(total(), data$S3A_1, data$User, data$S4, data$S8rev, data$S5rev))
and
S9 <- calc_cro(data, mrset(S9_1 %to% S9_993_1),list(total(), S3A_1, User, S4, S8rev, S5rev))
For example, S4 is a variable code for size, i.e. (1 - Small, 2 - Medium, 3 - Large).
However, the sample survey results contain only Small and Large respondents, so my code produces a table like this:
Size     Total   Male   Female   ...
Small    15      8      7        ...
Large    15      8      7        ...
Can someone help me modify my code so that it also shows labels with 0 counts, like this:
Size     Total   Male   Female   ...
Small    15      8      7        ...
Medium   0       0      0        ...
Large    15      8      7        ...
My current guess is that this is not possible, because R cannot determine the range of the labels from the data alone (in my example the codes are 1 to 3, but how would R know that there is no 4, 5, ..., x?).
However, if I could define the range of codes in my code, would it then be possible to make this work?
Here's my dput...
structure(list(RespID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30), S4 = c(1, 3, 3, 3, 3, 3, 3, 1, 3, 1, 1, 1, 3, 3,
1, 3, 1, 1, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1), Gender = c(1,
1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 2,
2, 1, 2, 1, 1, 2, 2, 2), Area = c(2, 2, 2, 4, 1, 1, 3, 4, 1,
1, 1, 4, 2, 3, 1, 3, 3, 1, 3, 4, 1, 3, 3, 4, 4, 3, 1, 1, 4, 1
)), row.names = c(NA, -30L), class = c("tbl_df", "tbl", "data.frame"
))
With data labels:
S4 = {1 - Small, 2 - Medium, 3 - Large}
Gender = {1 - Male, 2 - Female}
Area = {1 - North, 2 - East, 3 - South, 4 - North}

I think you have just misunderstood how the function is used. Try:
library("expss")
S4 <- cro(data$S4, col_vars=list("S3A_1", "User", "S4", "S8rev", "S5rev"))
S4
# | | | S3A_1 | User | S4 | S8rev | S5rev |
# | ------- | ------------ | ----- | ---- | -- | ----- | ----- |
# | data$S4 | 1 | 15 | 15 | 15 | 15 | 15 |
# | | 3 | 15 | 15 | 15 | 15 | 15 |
# | | #Total cases | 30 | 30 | 30 | 30 | 30 |
I'm not very sure, though, what you're actually asking.
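On the question of whether declaring the range of codes helps: a minimal base R sketch, assuming the codes are converted to factors with all levels listed explicitly. It uses only table() and addmargins() from base R, not cro, and I have not checked how cro treats such factors. The label names come from the question; the object names size, gender and tab are just for illustration.

```r
# S4 and Gender values copied from the dput in the question
S4 <- c(1, 3, 3, 3, 3, 3, 3, 1, 3, 1, 1, 1, 3, 3,
        1, 3, 1, 1, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1)
Gender <- c(1, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1,
            1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2)

# Declaring every level up front tells R that code 2 (Medium) exists,
# even though no respondent has it
size   <- factor(S4, levels = 1:3, labels = c("Small", "Medium", "Large"))
gender <- factor(Gender, levels = 1:2, labels = c("Male", "Female"))

# table() keeps unused factor levels, so Medium appears with zero counts
tab <- table(Size = size, Gender = gender)
tab <- addmargins(tab, margin = 2)  # append a "Sum" column across genders
print(tab)
```

Because the levels are stated explicitly, R no longer has to guess the range from the observed values, which is the concern raised in the question.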

Related

as.factor not working with INT values on R

Could you please help me? I have this dataset:
q1 q2 q3 m1 m2 b1 b2
A 78 150 2887 4 4 0 1
B 74 142 2904 4 4 1 1
C 79 137 1564 4 4 1 0
D 80 164 4522 2 2 0 0
E 74 173 5025 2 3 0 1
F 73 140 1971 3 3 0 1
I want to convert columns m1 to b2 into factors. If I do
data[,4:7] <- as.factor(data[,4:7])
it doesn't work: the values turn into character strings, and the data frame ends up like this:
q1 q2 q3 m1 m2 b1
A 78 150 2887 c(4, 4, 4, 2, 2, 3) c(0, 1, 1, 0, 0, 0) c(4, 4, 4, 2, 2, 3)
B 74 142 2904 c(4, 4, 4, 2, 3, 3) c(1, 1, 0, 0, 1, 1) c(4, 4, 4, 2, 3, 3)
C 79 137 1564 c(0, 1, 1, 0, 0, 0) c(4, 4, 4, 2, 2, 3) c(0, 1, 1, 0, 0, 0)
D 80 164 4522 c(1, 1, 0, 0, 1, 1) c(4, 4, 4, 2, 3, 3) c(1, 1, 0, 0, 1, 1)
E 74 173 5025 c(4, 4, 4, 2, 2, 3) c(0, 1, 1, 0, 0, 0) c(4, 4, 4, 2, 2, 3)
F 73 140 1971 c(4, 4, 4, 2, 3, 3) c(1, 1, 0, 0, 1, 1) c(4, 4, 4, 2, 3, 3)
b2
A c(0, 1, 1, 0, 0, 0)
B c(1, 1, 0, 0, 1, 1)
C c(4, 4, 4, 2, 2, 3)
D c(4, 4, 4, 2, 3, 3)
E c(0, 1, 1, 0, 0, 0)
F c(1, 1, 0, 0, 1, 1)
But if I use lapply it works fine. Can you explain why? I have used as.factor(d[]) on other occasions and it worked just fine with other data.frame objects. Thank you.
Checking the documentation for as.factor (by typing ?as.factor), you'll see it says that the first argument x is "a vector of data, usually taking a small number of distinct values". If you supply several columns of a data frame at once, they are passed as one object rather than column by column, and as.factor does not loop over the columns for you. In your example each of columns 4 through 7 ends up collapsed into a single string such as "c(4, 4, 4, 2, 2, 3)", which is exactly what appears in the cells of your output.
You should use:
data[4:7] <- lapply(data[4:7], as.factor)
or (requiring tidyverse packages)
data <- data %>% mutate_at(4:7, as.factor)
Both of these solutions treat each supplied column (here columns 4, 5, 6, and 7) as its own vector. Each one is converted to a factor separately and assigned back to its original position.
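For reference, a small self-contained sketch of the lapply approach on the dataset from the question. The data frame is rebuilt by hand from the printed table, so the construction step is an assumption about how the original object looks.

```r
# Dataset rebuilt from the table in the question
data <- data.frame(
  q1 = c(78, 74, 79, 80, 74, 73),
  q2 = c(150, 142, 137, 164, 173, 140),
  q3 = c(2887, 2904, 1564, 4522, 5025, 1971),
  m1 = c(4, 4, 4, 2, 2, 3),
  m2 = c(4, 4, 4, 2, 3, 3),
  b1 = c(0, 1, 1, 0, 0, 0),
  b2 = c(1, 1, 0, 0, 1, 1),
  row.names = c("A", "B", "C", "D", "E", "F")
)

# lapply() calls as.factor() once per column and returns a list of factors,
# which is assigned back column by column
data[4:7] <- lapply(data[4:7], as.factor)

# Check the resulting column classes
print(sapply(data, class))
```

Using data[4:7] (list-style indexing) on both sides keeps the assignment column-wise; the first three columns stay numeric.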

Creating a new variable based on numeric differences between two other variables in r

Here's an example dataset.
structure(list(vector1 = c(1, 4, 4, 2, 1, 3, 2, 3, 4, 5, 3, 5,
5, 1, 4, 2, 4, 5, 2, 5), vector2 = c(4, 2, 3, 5, 3, 5, 2, 2,
3, 3, 4, 1, 4, 1, 2, 1, 2, 1, 1, 2)), class = "data.frame", row.names = c(NA,
-20L))
Basically, what I'm trying to do is create a new variable 'Direction' based on the differences between these numbers. I want to say something like:
if vector2 == vector1 or vector2 == vector1 +/- 1 then Direction == 'NS'
if vector2 < vector1 - 1 or vector2 > vector1 + 1 then Direction == 'EW'
Hopefully this makes sense. Thanks!
A similar solution is this (slightly simpler):
Data:
df <- data.frame(
vector1 = c(1, 4, 4, 2, 1, 3, 2, 3, 4, 5, 3, 5, 5, 1, 4, 2, 4, 5, 2, 5),
vector2 = c(4, 2, 3, 5, 3, 5, 2, 2, 3, 3, 4, 1, 4, 1, 2, 1, 2, 1, 1, 2)
)
Desired new column:
df$direction <- ifelse(df$vector1 == df$vector2 |
df$vector1 == df$vector2 + 1 |
df$vector1 == df$vector2 - 1, "NS", "EW")
Outcome:
df
vector1 vector2 direction
1 1 4 EW
2 4 2 EW
3 4 3 NS
4 2 5 EW
5 1 3 EW
6 3 5 EW
7 2 2 NS
8 3 2 NS
9 4 3 NS
10 5 3 EW
11 3 4 NS
12 5 1 EW
13 5 4 NS
14 1 1 NS
15 4 2 EW
16 2 1 NS
17 4 2 EW
18 5 1 EW
19 2 1 NS
20 5 2 EW
You can try this:
df <- structure(list(vector1 = c(1, 4, 4, 2, 1, 3, 2, 3, 4, 5, 3, 5,
5, 1, 4, 2, 4, 5, 2, 5), vector2 = c(4, 2, 3, 5, 3, 5, 2, 2,
3, 3, 4, 1, 4, 1, 2, 1, 2, 1, 1, 2)), class = "data.frame", row.names = c(NA,
-20L))
df$direction <- with(df,ifelse((vector2 == vector1) | (vector2 == (vector1 + 1)) | (vector2 == (vector1 - 1)), "NS",
ifelse(vector2 < (vector1-1) | (vector2 > (vector1 + 1)),"EW", NA)))
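Since both conditions only depend on how far apart the two values are, the same rule can also be sketched with abs(). This is an alternative formulation, not what either answer above wrote, and it assumes the columns contain no missing values.

```r
df <- data.frame(
  vector1 = c(1, 4, 4, 2, 1, 3, 2, 3, 4, 5, 3, 5, 5, 1, 4, 2, 4, 5, 2, 5),
  vector2 = c(4, 2, 3, 5, 3, 5, 2, 2, 3, 3, 4, 1, 4, 1, 2, 1, 2, 1, 1, 2)
)

# "NS" when the two values are equal or differ by exactly 1, "EW" otherwise
df$direction <- ifelse(abs(df$vector1 - df$vector2) <= 1, "NS", "EW")

print(table(df$direction))
```

With NA values in either column, ifelse() would return NA for that row, which matches the behaviour of the nested version above.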

Can I create many categories of one variable based in two other conditions in r? [duplicate]

I am doing a statistical analysis on a big data frame (more than 48,000,000 rows) in R. Here is an example of the data:
structure(list(herd = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), cows = c(1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16), `date` = c("11/03/2013",
"12/03/2013", "13/03/2013", "14/03/2013", "15/03/2013", "16/03/2013",
"13/05/2012", "14/05/2012", "15/05/2012", "16/05/2012", "17/05/2012",
"18/05/2012", "10/07/2016", "11/07/2016", "12/07/2016", "13/07/2016",
"11/03/2013", "12/03/2013", "13/03/2013", "14/03/2013", "15/03/2013",
"16/03/2013", "13/05/2012", "14/05/2012", "15/05/2012", "16/05/2012",
"17/05/2012", "18/05/2012", "10/07/2016", "11/07/2016", "12/07/2016",
"13/07/2016", "11/03/2013", "12/03/2013", "13/03/2013", "14/03/2013",
"15/03/2013", "16/03/2013", "13/05/2012", "14/05/2012", "15/05/2012",
"16/05/2012", "17/05/2012", "18/05/2012", "10/07/2016", "11/07/2016",
"12/07/2016", "13/07/2016"), glicose = c(240666, 23457789, 45688688,
679, 76564, 6574553, 78654, 546432, 76455643, 6876, 7645432,
876875, 98654, 453437, 98676, 9887554, 76543, 9775643, 986545,
240666, 23457789, 45688688, 679, 76564, 6574553, 78654, 546432,
76455643, 6876, 7645432, 876875, 98654, 453437, 98676, 9887554,
76543, 9775643, 986545, 240666, 23457789, 45688688, 679, 76564,
6574553, 78654, 546432, 76455643, 6876)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -48L))
I need to count how many cows fall into each of the following categories of glicose, by herd and by date:
<=100000
>100000 and <=150000
>150000 and <=200000
>200000 and <=250000
>250000 and <=400000
>400000
I tried to use the functions filter() and select() but could not categorize the variable like that.
I also tried to make a vector for each category, but it did not work:
ht <- df %>% group_by(herd, date) %>%
filter(glicose < 100000)
Actually I do not have a clue of how I could do this. Please help!
I expect to get the number of cows in each category of each herd based on each date in a table like this:
Calling your data df,
library(dplyr)
df %>%
mutate(glicose_group = cut(glicose, breaks = c(0, seq(1e5, 2.5e5, by = 0.5e5), 4e5, Inf)),
date = as.Date(date, format = "%d/%m/%Y")) %>%
group_by(herd, date, glicose_group) %>%
count()
# # A tibble: 48 x 4
# # Groups: herd, date, glicose_group [48]
# herd date glicose_group n
# <dbl> <date> <fct> <int>
# 1 1 2012-05-13 (0,1e+05] 1
# 2 1 2012-05-14 (4e+05,Inf] 1
# 3 1 2012-05-15 (4e+05,Inf] 1
# 4 1 2012-05-16 (0,1e+05] 1
# 5 1 2012-05-17 (4e+05,Inf] 1
# 6 1 2012-05-18 (4e+05,Inf] 1
# 7 1 2013-03-11 (2e+05,2.5e+05] 1
# 8 1 2013-03-12 (4e+05,Inf] 1
# 9 1 2013-03-13 (4e+05,Inf] 1
# 10 1 2013-03-14 (0,1e+05] 1
# # ... with 38 more rows
I also threw in a conversion to Date class, which is probably a good idea.
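To make the binning step easier to see in isolation, here is a small base R sketch of cut() with the same breaks. The five glucose values and the readable labels are made up for illustration; only the breaks come from the answer above.

```r
# Hypothetical glucose values; none falls between 150000 and 200000
glicose <- c(679, 120000, 240666, 300000, 500000)

# Same break points as in the answer above
breaks <- c(0, seq(1e5, 2.5e5, by = 0.5e5), 4e5, Inf)

# Illustrative labels, one per interval
labels <- c("<=100000", "100000-150000", "150000-200000",
            "200000-250000", "250000-400000", ">400000")

# Intervals are closed on the right by default: (lower, upper]
glicose_group <- cut(glicose, breaks = breaks, labels = labels)

# table() also reports the empty category
counts <- table(glicose_group)
print(counts)
```

Note that with a lower break of 0 and the default settings, a value of exactly 0 falls outside the first interval and becomes NA; include.lowest = TRUE would keep it if that matters for the real data.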

Check row by row and highlight mismatches in row/column when it occurred

I have a data frame with 3 months of data containing information on individuals. Each individual's information should stay fixed over the whole period, but in my real data set that is not the case. I would like to check row by row and highlight the dates on which something went wrong during data entry.
Here is a sample of my dataset (the real dataset has more variables):
input <- data.frame(stringsAsFactors=FALSE,
date = c(20190218, 20190219, 20190220, 20190221, 20190222,
20190223, 20190101, 20190103, 20190105, 20190110,
20190112, 20190218, 20190219, 20190220, 20190221, 20190222,
20190223),
id = c("18105265-ab", "18105265-ab", "18105265-ab",
"18105265-ab", "18105265-ab", "18105265-ab",
"18161665-aa", "18161665-aa", "18161665-aa", "18161665-aa",
"18161665-aa", "18502020-aa", "18502020-aa", "18502020-aa",
"18502020-aa", "18502020-aa", "18502020-aa"),
size = c(3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1),
type = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 2, 2),
county = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5),
member_p10 = c(3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1),
youngest_age = c(5, 5, 5, 5, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7),
sex = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1),
position = c(5, 5, 5, 5, 5, 5, 4, 4, 4, 0, 0, 3, 3, 3, 3, 0, 0))
Is there any way to do this type of operation? I would like to end up with this output:
date id size type county member_p10 youngest_age sex position
1 20190221 18105265-ab 3 4 1 3 5 1 5
2 20190222 18105265-ab 2 4 1 2 7 1 5
3 20190105 18161665-aa 2 4 1 2 7 2 4
4 20190110 18161665-aa 1 2 1 1 7 2 0
5 20190221 18502020-aa 2 4 5 2 7 1 3
6 20190222 18502020-aa 1 2 5 1 7 1 0
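One possible reading of the desired output is: for each id, keep every row whose values differ from the previous row of the same id, together with the row just before it. A rough base R sketch under that assumption, using a reduced, made-up version of the data (ids "a" and "b", two checked columns) and assuming the rows are already sorted by id and date:

```r
# Hypothetical, reduced version of the data
input <- data.frame(
  date = c(20190218, 20190219, 20190220, 20190221,
           20190101, 20190103, 20190105),
  id   = c("a", "a", "a", "a", "b", "b", "b"),
  size = c(3, 3, 2, 2, 2, 2, 1),
  type = c(4, 4, 4, 4, 4, 4, 2),
  stringsAsFactors = FALSE
)

# Columns that are supposed to stay fixed within an id
check_cols <- setdiff(names(input), c("date", "id"))
n <- nrow(input)

# TRUE where a row has the same id as the row before it
same_id <- c(FALSE, input$id[-1] == input$id[-n])

# TRUE where any checked column differs from the row before it
cur  <- as.matrix(input[-1, check_cols, drop = FALSE])
prev <- as.matrix(input[-n, check_cols, drop = FALSE])
differs <- c(FALSE, unname(rowSums(cur != prev) > 0))

# A mismatch is a change inside the same id
changed <- same_id & differs

# Keep each changed row and the row immediately before it
keep <- changed | c(changed[-1], FALSE)
mismatches <- input[keep, ]
print(mismatches)
```

On the real data, check_cols picks up every column except date and id automatically; missing values in those columns would need separate handling.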

How to sum remaining values after using gsub?

I have not been able to solve this problem myself, so I'm asking for a little help.
This is part of my data:
rfam[1:20,]
id name
1 RF00001 LL_skoljka_r41782307_x1
2 RF00001 LL_skoljka_r9950955_x1
3 RF00001 LL_skoljka_r49323482_x1
4 RF00001 LL_skoljka_r14141437_x1
5 RF00001 LL_skoljka_r16457227_x3
6 RF00002 LL_skoljka_r40347558_x1
7 RF00002 LL_skoljka_r44415149_x1
8 RF00002 LL_skoljka_r13145032_x1
9 RF00002 LL_skoljka_r29248915_x42
10 RF00003 LL_skoljka_r15936986_x1
11 RF00003 LL_skoljka_r28953530_x1
12 RF00003 LL_skoljka_r32665758_x1
13 RF00003 LL_skoljka_r32835489_x1
14 RF00003 LL_skoljka_r32835498_x1
15 RF04051 LL_skoljka_r33254611_x1
16 RF04051 LL_skoljka_r29761867_x12
17 RF04051 LL_skoljka_r45123665_x2
18 RF04051 LL_skoljka_r34837827_x15
19 RF08595 LL_skoljka_r38900754_x1
20 RF08595 LL_skoljka_r22016530_x1
In the first step I want to remove everything before the x in the variable name, so I use:
rfam$name<- as.data.frame(sapply(rfam$name, gsub, pattern='^.*?x', replacement=""))
Result:
rfam[1:20,]
id name
1 RF00001 1
2 RF00001 1
3 RF00001 1
4 RF00001 1
5 RF00001 3
6 RF00002 1
7 RF00002 1
8 RF00002 1
9 RF00002 42
10 RF00003 1
11 RF00003 1
12 RF00003 1
13 RF00003 1
14 RF00003 1
15 RF04051 1
16 RF04051 12
17 RF04051 2
18 RF04051 15
19 RF08595 1
20 RF08595 1
In the second step I would like to sum the values that remain in the variable name for each id.
The result should look like this:
view(rfam)
id name
1 RF00001 7
2 RF00002 45
3 RF00003 5
4 RF04051 30
5 RF08595 2
To sum the values, the variable has to be numeric, but both of my variables are factors. So I converted id to character using rfam[,1]=as.character(rfam[,1]) and tried to convert name to numeric with rfam[,2]=as.numeric(levels(rfam[,2])[rfam[,2]]). The conversion of id was successful, while name returns NAs.
I've also tried rfam[,2]=as.numeric(as.character(rfam[,2])), but the result was the same.
I've also tried to export the data to a txt file in order to do the rest of the analysis in Excel, but the exported data looks like this:
"id" "name"
"1" "RF00001" c(1, 1, 1, 1, 9, 1, 1, 1, 11, 1, 1, 1, 1, 1, 1, 3, 7, 5, 1, 1, 1, 9, 1, 14, 10, 7, 1, 5, 1, 1, 1, 1, 1, 7, 1, 2, 1, 1, 1, 9, 1, 7, 1, 1, 1, 1, 1, 1, 10, 7, 1, 10, 7, 1, 1, 1, 1, 1, 7, 1, 10, 1, 1, 1, 1, 1, 1, 1, 7, 1,...)
"2" "RF00001" c(1, 1, 1, 1, 9, 1, 1, 1, 11, 1, 1, 1, 1, 1, 1, 3, 7, 5, 1, 1, 1, 9, 1, 14, 10, 7, 1, 5, 1, 1, 1, 1, 1, 7, 1, 2, 1, 1, 1, 9, 1, 7, 1, 1, 1, 1, 1, 1, 10, 7, 1, 10, 7, 1, 1, 1, 1, 1, 7, 1, 10, 1, 1, 1, 1, 1, 1, 1, 7, 1,...)
"3" "RF00001" c(1, 1, 1, 1, 9, 1, 1, 1, 11, 1, 1, 1, 1, 1, 1, 3, 7, 5, 1, 1, 1, 9, 1, 14, 10, 7, 1, 5, 1, 1, 1, 1, 1, 7, 1, 2, 1, 1, 1, 9, 1, 7, 1, 1, 1, 1, 1, 1, 10, 7, 1, 10, 7, 1, 1, 1, 1, 1, 7, 1, 10, 1, 1, 1, 1, 1, 1, 1, 7, 1,...)
This is where I am stuck. I don't understand what is happening and I would appreciate it if you could help me out.
Update
Having realized that your question is not about the grouping part: the problem is that your sapply() call, wrapped in as.data.frame(), creates a data.frame inside rfam instead of a vector.
You can use the following data.table solution to convert the rfam$name column to a numeric vector that can then be grouped and summed.
library(data.table)
setDT(rfam)[,name:= as.numeric(gsub('^.*?x', replacement="",name))]
Now we can use dplyr to get the desired output:
library(dplyr)
as.data.frame(rfam) %>% group_by(id) %>% summarise(name=sum(name))
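If you would rather avoid extra packages, the same two steps can be sketched in base R. This assumes the count always follows the final "_x" in name, as in the rows shown; the four sample rows are taken from the question and the object name totals is just for illustration.

```r
# Four rows from the question, stored as factors like the original data
rfam <- data.frame(
  id   = c("RF00001", "RF00001", "RF00002", "RF00002"),
  name = c("LL_skoljka_r41782307_x1", "LL_skoljka_r16457227_x3",
           "LL_skoljka_r40347558_x1", "LL_skoljka_r29248915_x42"),
  stringsAsFactors = TRUE
)

# Step 1: strip everything up to and including the last "_x",
# then convert the remaining digits to numeric (result is a plain vector)
rfam$name <- as.numeric(sub("^.*_x", "", as.character(rfam$name)))

# Step 2: sum the counts per id
totals <- aggregate(name ~ id, data = rfam, FUN = sum)
print(totals)
```

Because sub() is vectorised over the whole column, no sapply() or as.data.frame() is needed, so name stays an ordinary numeric column.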
