R - table() returns repeated factors

I'm using FiveThirtyEight's Star Wars survey.
On $Anakin I've assigned 0 (very unfavourably) to 5 (very favourably) as categorical values for the respondent's view of Anakin. "N/A" on the survey was assigned "" (I did that step in MS Excel).
$Startrek contains whether the respondent's seen Star Trek or not.
starwars <- read.csv2("starsurvey.csv", header = TRUE, stringsAsFactors = FALSE)
as.factor(starwars$Anakin)
as.factor(starwars$Startrek)
tbl <- table(starwars$Anakin, starwars$Startrek)
The table() function returns this:
No Yes
1 0 20 19
2 2 31 50
3 0 68 67
4 1 140 128
5 5 101 139
I'm wondering why the function returns 0, 2, 0, 1, 5 for the factors in $Anakin, since it contains:
starwars$Anakin
[1] 5 <NA> 4 5 2 5 4 3 4 5 <NA> <NA> 4 4
[15] 4 2 3 5 5 5 4 3 3 2 5 <NA> 4 4
[29] 1 1 3 5 2 <NA> <NA> 5 5 4 4 4 3 4
[43] 4 4 4 4 <NA> 2 3 <NA> 4 4 5 4 4 <NA>

The output of table here is confusing because your factor levels (1 to 5) look like row numbers, and there are some blank ("") responses to the Startrek variable which makes it appear like the data is only under the No and Yes columns.
So, the data here is a 5 by 3 table, with the rows representing the score from Anakin (1 to 5) and the columns representing 3 types of response to Startrek ("", No, Yes).
Note that where there are NAs in Anakin, those rows are ignored in the table. To count these too, use addNA:
table(addNA(starwars$Anakin), starwars$Startrek)
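Alternatively, table() can count the NAs itself through its useNA argument, without modifying the factor:
table(starwars$Anakin, starwars$Startrek, useNA = "ifany")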

Imputing missing data conditional on other values in R

I'm working on a longitudinal patient dataset with some missing data. I'm trying to replicate a missing-data imputation approach used by a published study. A snapshot of the first 18 rows of this dataset is below. Briefly, there are 6 patients belonging to 3 different groups. Each person has been assessed over 3 years across a variety of tests. There is also information on Age, disease severity and a functional capacity score:
ID Group Time Age Severity Func.score Test1 Test2 Test3 Test4
1 A 1 60 5 50 -4 888 5 4
1 A 2 61 6 45 3 3 4 4
1 A 3 62 7 40 2 2 888 4
2 A 1 59 5 50 5 3 6 3
2 A 2 60 6 40 4 2 5 3
2 A 3 61 7 35 3 1 888 2
3 B 1 59 6 40 -4 -4 7 5
3 B 2 59 7 40 3 3 7 5
3 B 3 60 8 30 1 888 888 2
4 B 1 55 7 50 5 888 7 4
4 B 2 56 8 NA 3 1 6 3
4 B 3 57 9 NA 1 -4 6 888
5 C 1 54 7 40 6 6 5 5
5 C 2 55 8 40 4 5 5 4
5 C 3 56 8 35 2 888 5 3
6 C 1 60 6 50 6 7 4 4
6 C 2 61 6 40 5 6 4 888
6 C 3 62 7 30 3 5 4 888
Missing data in this dataset is coded in 3 possible ways. If NA, then the measure was not administered. If -4, the person could not complete the test due to a cognitive problem (i.e., they have poor memory etc.). If 888, then the person couldn't complete the test because of a physical problem (i.e., they have difficulty writing, drawing etc.).
My aim is to impute this missing data using two strategies.
If the missing data are because of a cognitive problem (i.e., where -4), then I want to impute the lowest possible score, given their specific time point and group membership. For example, for Test1 for ID1, I want the -4 substituted with 5 (as that is the only other Test1 score at Time 1 in Group A).
If the missing data are because of a physical problem (i.e., where 888), I want to impute this using a regression equation using Age, Severity, and Functional score (Func.score) and all other available Test scores to predict that missing data point.
How can I build this conditional imputing into a dplyr::mutate or an ifelse or case_when function?
In tidymodels, you would have to set them to NA and not use coded values (I really do wish that we had different types of missing values).
No guarantees on this since we don't have a reproducible example but this might work for you:
some_recipe %>%
  # for case 1 in your post
  step_mutate(Test1 = ifelse(Test1 == -4, 5, Test1)) %>%
  # for case 2
  step_mutate(
    # probably better to do this with across()
    Test1 = ifelse(Test1 == 888, NA_integer_, Test1),
    Test2 = ifelse(Test2 == 888, NA_integer_, Test2),
    Test3 = ifelse(Test3 == 888, NA_integer_, Test3),
    Test4 = ifelse(Test4 == 888, NA_integer_, Test4)
  ) %>%
  step_impute_linear(starts_with("Test"),
                     impute_with = vars(Age, Severity, Func.score,
                                        starts_with("Test")))

Reading in space separated dataset in R for frequent items

I have a .txt file that consists of numbers separated by spaces. Each row has a different amount of numbers in it. I need to do market basket analysis on the data, however I can't seem to properly load the data (especially because there is a different number of items in each 'basket'). What is the best way to store the data so I can find the frequent items and then check for frequent items in each basket?
Example of data:
1 2 4 3 67 43 154
4 5 3 21 2
2 4 5 32 145
2 6 7 8 23 456 32 21 34 54
You should be able to read the lines with readLines and then use lapply to convert each line into numerics. Assume the data is in a file named txt.txt:
dat <- lapply( readLines("txt.txt"), function(Line) scan(text=Line) )
The reason I didn't suggest read.table with fill=TRUE (which would give you something similar to the other answer that has appeared) is that the column structure was not needed, unless there was information encoded in the position of those numbers. I'm wondering whether there might be additional information encoded in the individual lines, such as regions or stores or some other entity as the source of particular numbered items. That would be a reason for keeping it in a list structure with an uneven count. You can get a global enumeration just with table:
table( unlist(dat) )
1 2 3 4 5 6 7 8 21 23 32 34 43 54 67 145 154 456
1 4 2 3 2 1 1 1 2 1 2 1 1 1 1 1 1 1
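Keeping the list structure, you can then check which of the globally frequent items occur in each basket; the frequency threshold of 2 below is only an illustrative assumption:
freq <- table(unlist(dat))
frequent_items <- as.numeric(names(freq[freq >= 2]))  # items seen in at least 2 baskets
lapply(dat, function(basket) intersect(basket, frequent_items))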
my_text = '1 2 4 3 67 43 154
4 5 3 21 2
2 4 5 32 145
2 6 7 8 23 456 32 21 34 54'
library(dplyr)  # for the pipe
library(tidyr)  # for separate()

my_text2 <- strsplit(my_text, split = '\n')
my_text2 <- lapply(my_text2, trimws)
my_text2 %>%
  do.call('rbind', .) %>%
  t %>%
  as.data.frame() %>%
  separate(V1, sep = ' ', into = paste('col_', 1:10))
col_ 1 col_ 2 col_ 3 col_ 4 col_ 5 col_ 6 col_ 7 col_ 8 col_ 9 col_ 10
1 1 2 4 3 67 43 154 <NA> <NA> <NA>
2 4 5 3 21 2 <NA> <NA> <NA> <NA> <NA>
3 2 4 5 32 145 <NA> <NA> <NA> <NA> <NA>
4 2 6 7 8 23 456 32 21 34 54
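Since the stated goal is market basket analysis, another route worth mentioning (not covered in the answers above, so take it as a hedged suggestion) is the arules package, which reads whitespace-separated "basket" files directly and can enumerate frequent itemsets:
library(arules)
trans <- read.transactions("txt.txt", format = "basket", sep = " ")
itemFrequency(trans, type = "absolute")                   # how often each item appears
freq_sets <- eclat(trans, parameter = list(supp = 0.25))  # the support threshold is an assumption
inspect(freq_sets)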

R: Sum column from table 2 based on value in table 1, and store result in table 1

I am a R noob, and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The location are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want to loop over the grid data, and sum the population of the grid IDs close to the store's grid ID.
I.e. basically SUMIF the grid population variable, with the condition that
grid(x) < store(x) + 1 &
grid(x) > store(x) - 1 &
grid(y) < store(y) + 1 &
grid(y) > store(y) - 1
How can I accomplish that? My own take has been trying to use different things like merge, sapply, etc, but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 479
Store2 5 2 119
I.e. for Store1 it is the sum of TOT_P for grid fields X=[2-4] and Y=[5-7]
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
#create example population data (values taken from the table in the question)
df.pop <- data.frame(
  TOT_P = c(8, 7, 3, 3, 22, 20, 9,
            28, 8, 3, 12, 12, 15, 7,
            3, 3, 3, 4, 13, 18, 3,
            61, 25, 5, 20, 23, 72, 14,
            178, 407, 26, 167, 58, 113, 73,
            76, 3, 3, 3, 4, 13, 18,
            3, 61, 25, 26, 167, 58, 113),
  GridX = rep(1:7, times = 7),
  GridY = rep(1:7, each = 7)
)
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
  inner_join(df.pop, by = 'k') %>%
  mutate(x.diff = abs(StoreX - GridX), y.diff = abs(StoreY - GridY)) %>%
  filter(x.diff <= 1, y.diff <= 1) %>%
  group_by(StoreName) %>%
  summarise(StoreX = max(StoreX), StoreY = max(StoreY), tot.pop = sum(TOT_P))
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119
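For comparison, a base-R sketch of the same logic (using the df.store and df.pop objects defined above) loops over stores with sapply instead of building a cross join:
df.store$SUM_P <- sapply(seq_len(nrow(df.store)), function(i) {
  # keep grid cells within one unit of the store in both directions
  keep <- abs(df.pop$GridX - df.store$StoreX[i]) <= 1 &
          abs(df.pop$GridY - df.store$StoreY[i]) <= 1
  sum(df.pop$TOT_P[keep])
})
df.store[, c("StoreName", "StoreX", "StoreY", "SUM_P")]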

How to merge data correctly

I'm trying to merge 7 complete data frames into one great wide data frame. I figured I have to do this stepwise and merge 2 frames into 1 and then that frame into another so forth until all 7 original frames becomes one.
fil2005: "ID" "abr_2005" "lop_2005" "ins_2005"
fil2006: "ID" "abr_2006" "lop_2006" "ins_2006"
But the variables "abr_2006", "lop_2006", "ins_2006" (and their 2005 counterparts) are all either 0 or 1.
Now the thing is, I want to either merge or do a dcast of some sort (I think) to turn these two long data frames into one wide data frame where both "abr_2005" "lop_2005" "ins_2005" and "abr_2006" "lop_2006" "ins_2006" are in that final file.
When I try
fil_2006.1 <- merge(x=fil_2005, y=fil_2006, by="ID__", all.y=T)
all the variables ending in _2005 are saved to fil_2006.1, but the variables ending in _2006 aren't.
I'm apparently doing something wrong. Any idea?
Is there a reason you put those underscores after ID__? Otherwise, the code you provided will work.
An example:
dat1 <- data.frame("ID"=seq(1,20,by=2),"varx2005"=1:10, "vary2005"=2:11)
dat2 <- data.frame("ID"=5:14,"varx2006"=1:20, "vary2006"=21:40)
# create data frames of differing lengths
head(dat1)
ID varx2005 vary2005
1 1 1 2
2 3 2 3
3 5 3 4
4 7 4 5
5 9 5 6
6 11 6 7
head(dat2)
ID varx2006 vary2006
1 5 1 21
2 6 2 22
3 7 3 23
4 8 4 24
5 9 5 25
6 10 6 26
merged <- merge(dat1,dat2,by="ID",all=T)
head(merged)
ID varx2006 vary2006 varx2005 vary2005
1 1 NA NA 1 2
2 3 NA NA 2 3
3 5 1 21 3 4
4 5 11 31 3 4
5 7 13 33 4 5
6 7 3 23 4 5
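To extend this to all seven yearly files without writing six separate merge() calls, Reduce() can chain them; the object names below are hypothetical, following the _2005/_2006 naming in the question:
all_years <- list(fil_2005, fil_2006, fil_2007, fil_2008,
                  fil_2009, fil_2010, fil_2011)
fil_wide <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), all_years)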

R: grouped data table with proportions

I have copied my code below. I start with a list of 50 small integers, representing the number of televisions owned by 50 families. My objective is shown in the object 'tv.final' below. My effort seems very wordy and inefficient.
Question: is there a better way to start with a list of 50 integers and end with a grouped data table with proportions? (Just taking my first baby steps with R, sorry for such a stupid question, but inquiring minds want to know.)
tv.data <- read.table("Tb02-08.txt",header=TRUE)
str(tv.data)
# 'data.frame': 50 obs. of 1 variable:
# $ TVs: int 1 1 1 2 6 3 3 4 2 4 ...
tv.table <- table(tv.data)
tv.table
# tv.data
# 0 1 2 3 4 5 6
# 1 16 14 12 3 2 2
tv.prop <- prop.table(tv.table)*100
tv.prop
# tv.data
# 0 1 2 3 4 5 6
# 2 32 28 24 6 4 4
tvs <- rbind(tv.table,tv.prop)
tvs
# 0 1 2 3 4 5 6
# tv.table 1 16 14 12 3 2 2
# tv.prop 2 32 28 24 6 4 4
tv.final <- t(tvs)
tv.final
# tv.table tv.prop
# 0 1 2
# 1 16 32
# 2 14 28
# 3 12 24
# 4 3 6
# 5 2 4
# 6 2 4
You can treat the object returned by table() as any other vector/matrix:
tv.table <- table(tv.data)
round(100 * tv.table/sum(tv.table))
That will give you the proportions in rounded percentage points.
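If you also want the two-column layout of tv.final in one step, a compact sketch using the same objects is to bind the counts and rounded percentages directly:
tv.table <- table(tv.data)
cbind(Count = tv.table, Percent = round(100 * prop.table(tv.table)))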
