Is there an easy way to get the frequencies column wise? - r

I have a list of Likert values, the values range from 1 to 5. Each possible response may occur once, more than once or not at all per column. I have several columns and rows, each row corresponds to a participant, each column to a question. There is no NA data.
Example:
c1 c2 c3
1 1 5
2 2 5
3 3 4
3 4 3
2 5 1
1 3 1
1 5 1
The goal is to count the frequencies of the answer options column-wise so that I can then compare them.
So the resulting table should look like this:
- c1 c2 c3
1 3 1 3
2 2 1 0
3 2 2 1
4 0 1 1
5 0 2 2
I know how to do this for one column, and I can look at the frequencies per column with apply(ds, 2, table), but I cannot manage to put the result into a table that I can work with further.
Thanks!

This should do it, using plyr:
count_df = setNames(data.frame(t(plyr::ldply(apply(df, 2, table), rbind)[2:6])), colnames(df))
count_df[is.na(count_df)] = 0

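You may use table in sapply. Converting each column to a factor with levels 1:5 keeps answer options that never occur in a column, so they show up with a count of 0.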
sapply(df, function(x) table(factor(x, 1:5)))
# c1 c2 c3
#1 3 1 3
#2 2 1 0
#3 2 2 1
#4 0 1 1
#5 0 2 2
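This approach can also be used in dplyr if you prefer that. Note that recent dplyr versions may warn when summarise() returns more than one row per group and suggest reframe() instead.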
library(dplyr)
df %>% summarise(across(.fns = ~table(factor(., 1:5))))

We may use a vectorized option in base R
table(data.frame(v1 = unlist(df1), v2 = names(df1)[col(df1)]))
# v2
# v1 c1 c2 c3
# 1 3 1 3
# 2 2 1 0
# 3 2 2 1
# 4 0 1 1
# 5 0 2 2
data
df1 <- structure(list(c1 = c(1L, 2L, 3L, 3L, 2L, 1L, 1L), c2 = c(1L,
2L, 3L, 4L, 5L, 3L, 5L), c3 = c(5L, 5L, 4L, 3L, 1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-7L))
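If the end goal is to compare the questions with each other, the counts can also be turned into within-column proportions. The following is only a minimal base R sketch, assuming the same values as df1 above; the names counts and shares are illustrative and do not come from the answers.

```r
# Example data (same values as df1 above)
df1 <- data.frame(
  c1 = c(1L, 2L, 3L, 3L, 2L, 1L, 1L),
  c2 = c(1L, 2L, 3L, 4L, 5L, 3L, 5L),
  c3 = c(5L, 5L, 4L, 3L, 1L, 1L, 1L)
)

# Count each answer option per column; fixing the levels to 1:5
# keeps options that never occur as zero counts.
counts <- sapply(df1, function(x) table(factor(x, levels = 1:5)))

# Share of each answer option within a column (each column sums to 1).
shares <- prop.table(counts, margin = 2)

print(counts)
print(round(shares, 2))
```

Because counts is an ordinary matrix, it can be passed to as.data.frame() if a data frame is needed for further work.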

Related

R - delete rows according to the value of another row

I am quite a beginner in R but thanks to the community of Stackoverflow I am improving!
However, I am stuck with a problem:
I have a dataset with 5 variables, including:
id_house, which is the id of each household
id_ind, which is an id that takes the value 1 for the first individual in the household, 2 for the next, 3 for the third, and so on
indicator_tb_men, which indicates whether the first person has answered the survey (1 = yes, 0 = no). All the other members of the household take the value 0.
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
3 1 0
3 2 0
3 3 0
4 1 1
5 1 0
I would like to delete all members of households where the first individual has not answered the survey.
So it would give:
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
4 1 1
Using dplyr, here is one way:
library(dplyr)
df %>%
arrange(id_house, id_ind) %>%
group_by(id_house) %>%
filter(first(indicator_tb_men) != 0)
# id_house id_ind indicator_tb_men
# <int> <int> <int>
#1 1 1 1
#2 1 2 NA
#3 2 1 1
#4 4 1 1
data
df <- structure(list(id_house = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L),
id_ind = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 1L), indicator_tb_men = c(1L,
NA, 1L, 0L, NA, NA, 1L, 0L)), class = "data.frame", row.names = c(NA, -8L))
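In base R we can use nested subsetting: first find the households whose first individual answered, then keep every row that belongs to one of them.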
df[df$id_house %in% df$id_house[df$id_ind == 1 & df$indicator_tb_men == 1],]
id_house id_ind indicator_tb_men
1 1 1 1
2 1 2 NA
3 2 1 1
7 4 1 1
Data: Using Ronak Shah's data
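The same idea can also be written by broadcasting the first individual's answer to the whole household with ave. This is just a sketch under the assumption stated in the question, namely that indicator_tb_men is 0 (not NA) for everyone except the first individual; head_flag, house_flag and kept are made-up names.

```r
# Data as described in the question (0 instead of NA for other members)
df <- data.frame(
  id_house = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L),
  id_ind = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 1L),
  indicator_tb_men = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L)
)

# Answer flag of the first individual, zero for everyone else
head_flag <- df$indicator_tb_men * (df$id_ind == 1)

# Broadcast that flag to every member of the same household
house_flag <- ave(head_flag, df$id_house, FUN = max)

# Keep only households whose first individual answered
kept <- df[house_flag == 1, ]
print(kept)
```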

R: reordering columns based on order of different column

I have the following data:
x y id
1 2
2 2 1
3 4
5 6 2
3 4
2 1 3
The blanks in column id should have the same values as the next id value. Meaning my data should actually look like this:
x y id
1 2 1
2 2 1
3 4 2
5 6 2
3 4 3
2 1 3
I also have a list:
list[[1]] = 1 3 2
Or alternatively a column:
c(1,3,2) = 1, 3, 2
Now I would like to reorder my data by column id, according to the order given in the list. My data should then look like this:
x y id
1 2 1
2 2 1
3 4 3
2 1 3
3 4 2
5 6 2
Is there an efficient way to do this?
EDIT: I don't think it is a duplicate of "in R Sorting by absolute value without changing the data", because I do not want to sort by absolute value but by a specific order that is given in a list.
A base R option would be (assuming that the blanks in the 'id' column are NA)
i1 <- !is.na(df1$id)
df1[i1,][match(df1$id[i1], list[[1]]),] <- df1[i1, ]
df1
# x y id
#1 1 2 NA
#2 2 2 1
#3 3 4 NA
#4 2 1 3
#5 3 4 NA
#6 5 6 2
If we need to change the NA to succeeding non-NA element
library(zoo)
df1$id <- na.locf(df1$id, fromLast = TRUE)
data
df1 <- structure(list(x = c(1L, 2L, 3L, 5L, 3L, 2L), y = c(2L, 2L, 4L,
6L, 4L, 1L), id = c(NA, 1L, NA, 2L, NA, 3L)), class = "data.frame",
row.names = c(NA, -6L))
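An alternative that avoids the extra package is to fill the blanks first and then sort by the position of each id in the wanted order. This is a rough sketch, not the answer's method: fill_backward and wanted are hypothetical names, and the blanks are again assumed to be NA.

```r
df1 <- data.frame(
  x = c(1L, 2L, 3L, 5L, 3L, 2L),
  y = c(2L, 2L, 4L, 6L, 4L, 1L),
  id = c(NA, 1L, NA, 2L, NA, 3L)
)
wanted <- c(1, 3, 2)  # desired order of the ids

# Replace each NA with the next non-NA value below it
# (hypothetical helper; walks from the second-to-last element upwards).
fill_backward <- function(v) {
  for (i in rev(seq_along(v))[-1]) {
    if (is.na(v[i])) v[i] <- v[i + 1]
  }
  v
}

df1$id <- fill_backward(df1$id)

# match() gives each row the position of its id in 'wanted';
# order() is stable, so rows with the same id keep their original order.
out <- df1[order(match(df1$id, wanted)), ]
print(out)
```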

Conditionally remove rows from a database using R

ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6
Each person has several records.
Each person has exactly one record whose Number is 1; for all the other records, Number is 2.
The variable Var has different values for the same person.
When the Number equals to 1, the corresponding Var (we call it P) is different for different persons.
Now, I want to delete the rows whose Var > P for every person.
At the end, I want this
ID Number Var
1 2 6
1 2 7
1 1 8
2 2 3
2 2 4
2 1 5
You can use dplyr::first where Number == 1 to get the first Var value
library(dplyr)
df %>% group_by(ID) %>% mutate(Flag=first(Var[Number==1])) %>%
filter(Var <= Flag) %>% select(-Flag)
# short version, if you are sure there is exactly one Number == 1 per ID
df %>% group_by(ID) %>% filter(Var <= Var[Number==1])
Here is a solution with data.table:
library(data.table)
dt <- fread(
"ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6")
dt[, .SD[Var <= Var[Number==1]], ID]
# ID Number Var
# 1: 1 2 6
# 2: 1 2 7
# 3: 1 1 8
# 4: 2 2 3
# 5: 2 2 4
# 6: 2 1 5
A base R option would be
df1[with(df1, Var <= ave(Var * (Number == 1), ID, FUN = function(x) x[x!=0])),]
# ID Number Var
#1 1 2 6
#2 1 2 7
#3 1 1 8
#6 2 2 3
#7 2 2 4
#8 2 1 5
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Number = c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), Var = c(6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L, 6L)), row.names = c(NA, -9L), class = "data.frame")
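For readers who find the ave call hard to follow, the threshold P can be looked up per ID explicitly. This is a minimal sketch assuming exactly one row with Number == 1 per ID, as the question states; the names p and kept are illustrative.

```r
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Number = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L),
  Var = c(6L, 7L, 8L, 9L, 10L, 3L, 4L, 5L, 6L)
)

# Threshold P for each ID: the Var of the row where Number == 1,
# stored as a vector named by ID.
p <- with(df1[df1$Number == 1, ], setNames(Var, ID))

# Look up each row's threshold by ID and keep rows with Var <= P
kept <- df1[df1$Var <= p[as.character(df1$ID)], ]
print(kept)
```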

How can I exclude zeros when finding a frequency and seperate into 4 categories

I have a data frame data.2016 and am trying to find the frequency with which "DIPL" is non-zero. "DIPL" is the number of worm parasites found in a fish.
Data looks something like this:
data.2016
Site DIPL
1 0
1 1
1 1
2 6
2 8
2 1
2 1
3 0
3 0
3 0
4 1258
4 501
I want the output to look like this:
Site freq
1 2
2 4
3 0
4 2
From this I can interpret, out of the 3 fish found in site #1 (from the data frame), 2 of them had worm parasites.
I've tried
aggregate(DIPL~Site, data=data.2016, frequency) #and get:
Site DIPL
1 1 1
2 2 1
3 3 1
4 4 1
Is there a way to count the number of fish with worms from the DIPL column (meaning the value in the column is higher than zero) per site?
Just use a custom function that removes the zeros.
aggregate(DIPL ~ Site, data.2016, function(x) length(x[x != 0])) # or sum(x != 0)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
Another option would be to temporarily transform the DIPL column then just take the sum.
aggregate(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0), sum)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
xtabs() is fun too ...
xtabs(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0))
# Site
# 1 2 3 4
# 2 4 0 2
By the way, frequency is for use on time-series data.
Data:
data.2016 <- structure(list(Site = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L), DIPL = c(0L, 1L, 1L, 6L, 8L, 1L, 1L, 0L, 0L, 0L, 1258L,
501L)), .Names = c("Site", "DIPL"), class = "data.frame", row.names = c(NA,
-12L))
Might something like this be what you're looking for?
# first some fake data
site <- c("A","A","A","B","B","B")
numworms <- c(1,0,3,0,0,42)
data.frame(site,numworms)
site numworms
1 A 1
2 A 0
3 A 3
4 B 0
5 B 0
6 B 42
tapply(numworms, site, function(x) sum(x>0))
A B
2 1
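Since the question reads the result as "2 of the 3 fish at site 1 had worms", it may help to compute the share alongside the count. This is only a sketch using the question's data; infected, share and summary_df are made-up names.

```r
data.2016 <- data.frame(
  Site = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
  DIPL = c(0L, 1L, 1L, 6L, 8L, 1L, 1L, 0L, 0L, 0L, 1258L, 501L)
)

has_worms <- data.2016$DIPL > 0  # TRUE for fish with at least one worm

# Number of infected fish per site (sum of TRUE values)
infected <- tapply(has_worms, data.2016$Site, sum)

# Share of infected fish per site (mean of TRUE values)
share <- tapply(has_worms, data.2016$Site, mean)

summary_df <- data.frame(
  Site = names(infected),
  infected = as.vector(infected),
  share = as.vector(share)
)
print(summary_df)
```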

Deleting Rows per ID when value gets greater than... minus 2

I have the following data frame
id<-c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3)
time<-c(0,1,2,3,4,5,6,7,0,1,2,3,0,1,2,3)
value<-c(1,1,6,1,2,0,0,1,2,6,2,2,1,1,6,1)
d<-data.frame(id, time, value)
The value 6 appears only once for each id. For every id, I would like to remove all rows after the row with the value 6, except for the first two rows that follow it.
I've searched and found a similar problem, but I couldn't adapt the solution myself, so I am using the code from this thread.
In the above case the final data frame should be
id time value
1 0 1
1 1 1
1 2 6
1 3 1
1 4 2
2 0 2
2 1 6
2 2 2
2 3 2
3 0 1
3 1 1
3 2 6
3 3 1
One of the solutions given there seems to come very close to what I need, but I didn't manage to adapt it. Could you help me?
library(plyr)
ddply(d, "id",
function(x) {
if (any(x$value == 6)) {
subset(x, time <= x[x$value == 6, "time"])
} else {
x
}
}
)
Thank you very much.
We could use data.table. Convert the 'data.frame' to 'data.table' (setDT(d)). Grouped by the 'id' column, we get the position of 'value' that is equal to 6. Add 2 to it. Find the min of the number of elements for that group (.N) and the position, get the seq, and use that to subset the dataset. We can also add an if/else condition to check whether there are any 6 in the 'value' column or else to return .SD without any subsetting.
library(data.table)
setDT(d)[, if(any(value==6)) .SD[seq(min(c(which(value==6) + 2, .N)))]
else .SD, by = id]
# id time value
# 1: 1 0 1
# 2: 1 1 1
# 3: 1 2 6
# 4: 1 3 1
# 5: 1 4 2
# 6: 2 0 2
# 7: 2 1 6
# 8: 2 2 2
# 9: 2 3 2
#10: 3 0 1
#11: 3 1 1
#12: 3 2 6
#13: 3 3 1
#14: 4 0 1
#15: 4 1 2
#16: 4 2 5
Or as @Arun mentioned in the comments, we can use head to subset, which would be faster
setDT(d)[, if(any(value==6)) head(.SD, which(value==6L)+2L) else .SD, by = id]
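Or using dplyr, we group by 'id', get the position of 'value' 6 with which, add 2, cap it at the group size (n()) with min, and use the resulting sequence as the numeric index within slice to extract the rows. Because min falls back to n() when which returns nothing, groups without a 6 are kept whole.
library(dplyr)
d %>%
group_by(id) %>%
slice(seq_len(min(which(value == 6) + 2, n())))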
# id time value
#1 1 0 1
#2 1 1 1
#3 1 2 6
#4 1 3 1
#5 1 4 2
#6 2 0 2
#7 2 1 6
#8 2 2 2
#9 2 3 2
#10 3 0 1
#11 3 1 1
#12 3 2 6
#13 3 3 1
#14 4 0 1
#15 4 1 2
#16 4 2 5
data
d <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), time = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 3L, 0L, 1L, 2L, 3L, 0L, 1L, 2L), value = c(1L, 1L, 6L, 1L,
2L, 2L, 6L, 2L, 2L, 1L, 1L, 6L, 1L, 1L, 2L, 5L)), .Names = c("id",
"time", "value"), class = "data.frame", row.names = c(NA, -16L))
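The same logic can be sketched in base R with split and lapply, without any packages. This uses the data from the question (ids 1 to 3); keep_through is a hypothetical helper, and its target and extra arguments are assumptions made for illustration.

```r
id <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
time <- c(0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 0, 1, 2, 3)
value <- c(1, 1, 6, 1, 2, 0, 0, 1, 2, 6, 2, 2, 1, 1, 6, 1)
d <- data.frame(id, time, value)

# Keep the rows of one group up to 'extra' rows after the first 'target'.
keep_through <- function(g, target = 6, extra = 2) {
  pos <- match(target, g$value)          # row of the first target, NA if absent
  if (is.na(pos)) return(g)              # no target: keep the whole group
  g[seq_len(min(pos + extra, nrow(g))), ]  # never index past the group
}

out <- do.call(rbind, lapply(split(d, d$id), keep_through))
rownames(out) <- NULL  # drop the "1.1", "1.2", ... names created by rbind
print(out)
```

This assumes the rows are already sorted by time within each id, as they are in the example.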
