I have a dataset with responses to a Likert item on a 9-point scale. I would like to create a frequency table (and barplot) of the data, but some values on the scale never occur in my dataset, so table() removes those values from the frequency table. I would like it instead to present each such value with a frequency of 0. That is, given the following dataset
# Assume a 5pt Likert scale for ease of example
data <- c(1, 1, 2, 1, 4, 4, 5)
I would like to get the following frequency table without having to manually insert a column named 3 with the value 0.
1 2 3 4 5
3 1 0 2 1
I'm new to R, so maybe I've overlooked something basic, but I haven't come across a function or option that gives the desired result.
EDIT:
tabulate() produces frequency counts, while table() produces contingency tables. However, to get zero frequencies in a one-dimensional contingency table as in the above example, the code below still works, of course.
This question provided the missing link. By converting the Likert item to a factor and explicitly specifying the levels, levels with a frequency of 0 are still counted:
data <- factor(data, levels = 1:5)
table(data)
produces the desired output
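For completeness, running those two lines on the example vector prints the table from the question, and barplot(table(data)) will then draw a zero-height bar for category 3:
data
1 2 3 4 5
3 1 0 2 1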
table() produces a contingency table, while tabulate() produces a frequency table that includes zero counts:
tabulate(data)
# [1] 3 1 0 2 1
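One caveat worth adding (not from the original answer): tabulate() only counts bins up to the largest observed value, so an unused point at the top of the scale would be dropped silently unless you pass nbins explicitly:
tabulate(c(1, 1, 2, 1, 4, 4))            # no 5s observed, so only 4 bins
# [1] 3 1 0 2
tabulate(c(1, 1, 2, 1, 4, 4), nbins = 5) # keep all five scale points
# [1] 3 1 0 2 0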
Another way (if you have integers starting from 1, though it is easily modifiable for other cases):
setNames(tabulate(data), 1:max(data)) # to make the output easier to read
# 1 2 3 4 5
# 3 1 0 2 1
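As a hedged sketch of those "other cases", e.g. a scale that starts at 0 (the 0-4 data below is made up for illustration), tabulating an explicit factor keeps every level:
data04 <- c(0, 0, 1, 0, 3, 3, 4)  # hypothetical 0-4 scale
f <- factor(data04, levels = 0:4)
setNames(tabulate(f, nbins = nlevels(f)), levels(f))
# 0 1 2 3 4
# 3 1 0 2 1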
If you want to quickly calculate the counts or proportions for multiple Likert items and get your output in a data.frame, you may like the function psych::response.frequencies from the psych package.
Let's create some data (note that there are no 9s):
df <- data.frame(item1 = sample(1:7, 2000, replace = TRUE),
item2 = sample(1:7, 2000, replace = TRUE),
item3 = sample(1:7, 2000, replace = TRUE))
If you want to calculate the proportion in each category
psych::response.frequencies(df, max = 1000, uniqueitems = 1:9)
you get the following:
1 2 3 4 5 6 7 8 9 miss
item1 0.1450 0.1435 0.139 0.1325 0.1380 0.1605 0.1415 0 0 0
item2 0.1535 0.1315 0.126 0.1505 0.1535 0.1400 0.1450 0 0 0
item3 0.1320 0.1505 0.132 0.1465 0.1425 0.1535 0.1430 0 0 0
If you want counts, you can multiply by the sample size:
psych::response.frequencies(df, max = 1000, uniqueitems = 1:9) * nrow(df)
You get the following:
1 2 3 4 5 6 7 8 9 miss
item1 290 287 278 265 276 321 283 0 0 0
item2 307 263 252 301 307 280 290 0 0 0
item3 264 301 264 293 285 307 286 0 0 0
A few notes:
the default max is 10, so if you have more than 10 response options you'll have issues; in your case, and in many Likert-item cases, you could omit the max argument.
uniqueitems specifies the possible values. If all your values were present in at least one item, then this would be inferred from the data.
I think the function only works with numeric data, so if you have your Likert categories coded "Strongly disagree", etc., it won't work.
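If your categories are coded as text, one hedged workaround (the label set below is assumed for illustration) is to convert them to numeric codes before calling response.frequencies:
labs <- c("Strongly disagree", "Disagree", "Neutral",
          "Agree", "Strongly agree")
resp <- c("Agree", "Agree", "Neutral", "Strongly disagree")
as.integer(factor(resp, levels = labs))  # map labels to 1:5 codes
# [1] 4 4 3 1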
Related
I am quite a beginner at R and I was hoping someone could guide me with this question.
I have a data frame that tells me whether 2,000 individuals voted or not. I have to sample 100 individuals and then find what proportion of them voted.
To do that, I decided to assign a number to each individual to differentiate them and draw the sample. After that, however, I don't know how to attach the variable that says whether they voted or not. Here is what I did:
vote$assignment <- c(1:2000)
sample <- sample(vote$assignment, 100, replace=F, set.seed(100))
sample100 <- as.data.frame(sample)
First lines of the dataframe:
vote assignment
   1          1
   1          2
   0          3
   1          4
   1          5
   1          6
   0          7
   1          8
   1          9
   1         10
Any ideas of how I get to that dataframe the information of whether they voted or not?
Thank you!
I am assuming you have 1/0 values in the vote column, where 1 is for people who voted and 0 for people who did not.
You can randomly select 100 individuals and take the mean of the vote column to get the percentage of people who voted.
# Assign unique IDs to each row
vote$assignment <- seq(nrow(vote))
# Select 100 random rows
selected_rows <- sample(nrow(vote), 100)
# Get the percentage of people who voted
percent_voted <- mean(vote$vote[selected_rows]) * 100
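One aside, since the question passed set.seed(100) as the fourth argument of sample() (where it gets matched to prob): to make the draw reproducible, set the seed in a separate call first. A minimal sketch:
set.seed(100)  # set before, not inside, sample()
selected_rows <- sample(nrow(vote), 100)
percent_voted <- mean(vote$vote[selected_rows]) * 100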
You can use the dplyr solution suggested by Phil, or you can consider this base R solution.
# the data
df <- data.frame(id = 1:2000)
set.seed(123)  # set the seed separately, not inside sample()
df$vote <- sample(c(1, 0), 2000, replace = TRUE)
# the sampling
samp_id <- sample(df$id, 100, replace = FALSE)
df_vote <- df[samp_id, ]
       id vote
225   225    0
1279 1279    0
1585 1585    1
946   946    1
1578 1578    0
1481 1481    0
651   651    1
1601 1601    0
354   354    1
203   203    0
prop <- mean(df_vote$vote)
prop
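Phil's dplyr suggestion is not quoted in the thread; purely as a hedged reconstruction of what it might look like (slice_sample() and the column name are my assumptions):
library(dplyr)
# hypothetical sketch, not Phil's actual code
df %>%
  slice_sample(n = 100) %>%      # 100 rows without replacement
  summarise(prop = mean(vote))   # proportion who voted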
What I want is to apply weights to the observations in my data frame; I already have an entire column with the weights that I want to apply to my data.
This is how my data frame looks:
weight count
     3     67
     7    355
     8     25
     7      2
Basically, I want to weight each value of my count column by the respective weight in my weight column. For example, the value 67 in my count column should be weighted by 3, the value 355 should be weighted by 7, and so on.
I tried to use this code from the questionr package:
wtd.table(data1$count, weights = data1$weight)
But this code altered my data frame and ended up reducing my 1447 rows to just 172 entries. What I want is to keep my exact number of entries.
The output that I want would be something like this; I just want to add another column to my data frame with the weighted values.
count count applying weights
   67 ####
  355 ###
I am still not sure how to apply weights to the count data in the way you want.
I just want to show that you can create a new column based on the previous column in a convenient way by using dplyr. For example:
mydf
# weight count
# 1 3 67
# 2 7 355
# 3 8 25
# 4 7 2
library(dplyr)

mydf %>% mutate(weightedCount = weight*count,
                percentRank = percent_rank(weightedCount),
                cumDist = cume_dist(weightedCount))
# weight count weightedCount percentRank cumDist
# 1 3 67 201 0.6666667 0.75
# 2 7 355 2485 1.0000000 1.00
# 3 8 25 200 0.3333333 0.50
# 4 7 2 14 0.0000000 0.25
Here, weightedCount is the product of weight and count, percentRank shows the percentile rank of each value of weightedCount, and cumDist shows the cumulative distribution of the values in weightedCount.
This is just an example; you can create other columns and apply other functions in a similar way.
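For reference, the weightedCount column alone can also be added without dplyr; a minimal base R equivalent:
mydf$weightedCount <- mydf$weight * mydf$count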
I have a tibble with a column of different numbers. I wish to calculate for every one of them how many others before them are within a certain range.
For example, let's say that range is 200; in the tibble below, the result for the 5th number would be 2, that is, the cardinality of the set {816, 705} of numbers that are above 872 - 1 - 200 = 671 but below 872.
I have thought of something along the lines of:
for every row of the tibble, compute a logical vector over theTibble$number_list with between(X, Y);
then sum the returned logical vector.
I have been told that using loops is less efficient.
Is there a clean way to do this within a pipe without using loops?
Not the way you asked for it, but you can use a bit of linear algebra. It should be more efficient and simpler than a loop.
number_list <- c(248,650,705,816,872,991,1156,1157,1180,1277)
# replicate number_list down every column, then take pairwise differences:
# d[i, j] = number_list[j] - number_list[i]
m <- matrix(number_list, nrow = length(number_list), ncol = length(number_list))
d <- t(m) - number_list
cutoff <- 200
# I used setNames to name the result, but you do not need to
# We count inclusive of 0 in case of ties
setNames(colSums(d >= 0 & d < cutoff) - 1, number_list)
Which gives you the following named vector.
248 650 705 816 872 991 1156 1157 1180 1277
0 0 1 2 2 2 1 2 3 3
Here is another way that is pipe-able using rollapply().
library(dplyr)
library(zoo)
cutoff <- 200
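# the tibble df is assumed but never constructed in the original answer;
# a minimal construction from the number_list vector defined earlier:
df <- tibble::tibble(number_list)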
df %>%
mutate(count = rollapply(number_list,
width = seq_along(number_list),
function(x) sum((tail(x, 1) - head(x, -1)) <= cutoff),
align = "right"))
Which gives you another column.
# A tibble: 10 x 2
number_list count
<int> <int>
1 248 0
2 650 0
3 705 1
4 816 2
5 872 2
6 991 2
7 1156 1
8 1157 2
9 1180 3
10 1277 3
Looking to fill a matrix with a reverse cumsum. There are multiple breaks that must be maintained.
I have provided a sample matrix for what I want to accomplish. The first column is the data; the second column is what I want. You will see that column 2 is updated to reflect the number of items that are left. When there are 0s, the previous number must be carried through.
update <- matrix(c(rep(0,4),rep(1,2),2,rep(0,2),1,3,
rep(10,4), 9,8,6, rep(6,2), 5, 2),ncol=2)
I have tried multiple ways to create a sequence and to loop using numerous packages (e.g. zoo). What is difficult is that the numbers in column 1 can be anywhere between 0, 1, ..., X, as long as they never exceed the remaining total in column 2.
Any help or tips would be appreciated.
EDIT: Column 2 starts with a given value, which can represent any starting value (e.g. inventory at the beginning of a month). Column 1 would then represent "purchases" made; thus, column 2 should reflect the total number of remaining items available.
The following will report the purchases and the remaining inventory balance as described:
starting_inventory <- 100
df <- data.frame(purchases=c(rep(0,4),rep(1,2),2,rep(0,2),1,3))
df$cum_purchases <- cumsum(df$purchases)
df$remaining_inventory <- starting_inventory - df$cum_purchases
Result:
purchases cum_purchases remaining_inventory
1 0 0 100
2 0 0 100
3 0 0 100
4 0 0 100
5 1 1 99
6 1 2 98
7 2 4 96
8 0 4 96
9 0 4 96
10 1 5 95
11 3 8 92
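Applying the same idea to the update matrix from the question, with a starting value of 10, reproduces its second column exactly:
starting_inventory <- 10
purchases <- update[, 1]
remaining <- starting_inventory - cumsum(purchases)
all(remaining == update[, 2])
# [1] TRUE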
I am using two large data files, each having more than 2 million records. The sample data frames are:
x <- data.frame("ItemID" = c(1,2,1,1,3,4,2,3,4,1),
                "SessionID" = c(111,112,111,112,113,114,114,115,115,115),
                "Avg" = c(1.0,0.45,0.5,0.5,0.46,0.34,0.5,0.6,0.10,0.15),
                "Category" = c(0,0,0,0,0,0,0,0,0,0))
y <- data.frame("ItemID" = c(1,2,3,4,3,4,5,7),"Category" = c("1","0","S","120","S","120","512","621"))
I successfully filled x$Category using the following command:
x$Category <- y$Category[match(x$ItemID,y$ItemID)]
but
x$Category
gave me
[1] 1 0 1 1 S 120 0 S 120 1
Levels: 0 1 120 512 621 S
In x there are only four distinct categories, but Levels shows six. Similarly, the frequency table shows 512 and 621 with 0 frequency. I am using the same data for classification, where it shows six classes instead of four, which negatively affects the F-measure, recall, etc.
table(x$Category)
0 1 120 512 621 S
2 4 2 0 0 2
while I want
table(x$Category)
0 1 120 S
2 4 2 2
I tried to merge this and this with a number of other questions, but it gives me an error message. I found here, in Practical limits of R data frame, that this is a limitation of R.
I would omit the Category column from your x data.frame, since it seems to serve only as a placeholder until the values from the y data.frame are filled in. Then you can use left_join from dplyr with ItemID as the key variable, followed by droplevels() as suggested by TingITangIBob.
This gets you close, but my table does not exactly match yours:
dplyr::select(x, -Category) %>%
dplyr::left_join(y, by = "ItemID") %>%
droplevels()
0 1 120 S
2 4 4 4
I think this may have to do with the repeated ItemIDs in x?
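For what it's worth, since x$Category already holds the right values and only carries unused levels, dropping them directly also yields the table the question asked for; a minimal sketch using the droplevels() idea credited above:
x$Category <- droplevels(x$Category)  # drop the unused 512 and 621 levels inherited from y
table(x$Category)
0 1 120 S
2 4 2 2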