Reading in space seperated dataset in R for frequent items

Reading in space seperated dataset in R for frequent items - r

I have a .txt file that consists of numbers separated by spaces. Each row has a different amount of numbers in it. I need to do market basket analysis on the data, however I can't seem to properly load the data (especially because there is a different number of items in each 'basket'). What is the best way to store the data so I can find the frequent items and then check for frequent items in each basket?
Example of data:
1 2 4 3 67 43 154
4 5 3 21 2
2 4 5 32 145
2 6 7 8 23 456 32 21 34 54

You should be able to input with readLines and then use lapply to separate into numerics. Assume that is in a file named txt.txt:
dat <- lapply( readLines("txt.txt"), function(Line) scan(text=Line) )
The reason I didn't suggest read.table with fill=TRUE (which would give yiu something similar to the otehr answer that has appeared is that the column stucture was not needed. unless there was information encoded in the position of those numbers. I'm wondering whether the might be additional information encoded in the individual lines such as regions or stores or some other entity as the source of particular numbered items. This would be the reason for keeping it in a list structure with an uneven count. You can get a global enumerations just with table:
table( unlist(dat) )
1 2 3 4 5 6 7 8 21 23 32 34 43 54 67 145 154 456
1 4 2 3 2 1 1 1 2 1 2 1 1 1 1 1 1 1

my_text = '1 2 4 3 67 43 154
4 5 3 21 2
2 4 5 32 145
2 6 7 8 23 456 32 21 34 54'
my_text2 <- strsplit(my_text, split = '\n')
my_text2 <- lapply(my_text2, trimws)
my_text2 %>%
do.call('rbind',.) %>%
t %>%
as.data.frame() %>%
separate(V1, sep = ' ',into = paste('col_', 1:10))
col_ 1 col_ 2 col_ 3 col_ 4 col_ 5 col_ 6 col_ 7 col_ 8 col_ 9 col_ 10
1 1 2 4 3 67 43 154 <NA> <NA> <NA>
2 4 5 3 21 2 <NA> <NA> <NA> <NA> <NA>
3 2 4 5 32 145 <NA> <NA> <NA> <NA> <NA>
4 2 6 7 8 23 456 32 21 34 54

Related

Get the average of the values of one column for the values in another

I was not so sure how to ask this question. i am trying to answer what is the average tone when an initiative is mentioned and additionally when a topic, and a goal( or achievement) are mentioned. My dataframe (df) has many mentions of 70 initiatives (rows). meaning my df has 500+ rows of data, but only 70 Initiatives.
My data looks like this
> tabmean
Initiative Topic Goals Achievements Tone
1 52 44 2 2 2
2 294 42 2 2 2
3 103 31 2 2 2
4 52 41 2 2 2
5 87 26 2 1 1
6 52 87 2 2 2
7 136 81 2 2 2
8 19 7 2 2 1
9 19 4 2 2 2
10 0 63 2 2 2
11 0 25 2 2 2
12 19 51 2 2 2
13 52 51 2 2 2
14 108 94 2 2 1
15 52 89 2 2 2
16 110 37 2 2 2
17 247 25 2 2 2
18 66 95 2 2 2
19 24 49 2 2 2
20 24 110 2 2 2
I want to find what is the mean or average Tone when an Initiative is mentioned. as well as what is the Tone when an Initiative, a Topic and a Goal are mentioned at the same time. The code options for Tone are : positive(coded: 1), neutral(2), negative (coded:3), and both positive and negative(4). Goals and Achievements are coded yes(1) and no(2).
I have used this code:
GoalMeanTone <- tabmean %>%
group_by(Initiative,Topic,Goals,Tone) %>%
summarize(averagetone = mean(Tone))
With Solution output :
GoalMeanTone
# A tibble: 454 x 5
# Groups: Initiative, Topic, Goals [424]
Initiative Topic Goals Tone averagetone
<chr> <chr> <chr> <chr> <dbl>
1 0 104 2 0 NA
2 0 105 2 0 NA
3 0 22 2 0 NA
4 0 25 2 0 NA
5 0 29 2 0 NA
6 0 30 2 1 NA
7 0 31 1 1 NA
8 0 42 1 0 NA
9 0 44 2 0 NA
10 0 44 NA 0 NA
# ... with 444 more rows
note that for Initiative Value 0 means "other initiative".
and I've also tried this code
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
with solution output
> GoalMeanTone2
Initiative V1
1 0 NA
2 1 NA
3 101 NA
4 102 NA
5 103 NA
6 104 NA
7 105 NA
8 107 NA
9 108 NA
10 110 NA
Note that in both instances, I do not get an average for Tone but instead get NA's
I have removed the NAs in the df from the column "Tone" also have tried to remove all the other mission values in the df ( its only about 30 values that i deleted).
and I have also re-coded the values for Tone :
tabmean<-Meantable %>% mutate(Tone=recode(Tone,
`1`="1",
`2`="0",
`3`="-1",
`4`="2"))
I still cannot manage to get the average tone for an initiative. Maybe the solution is more obvious than i think, but have gotten stuck and have no idea how to proceed or solve this.
i'd be super grateful for a better code to get this. Thanks!

I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but let's say that you'd want to get the average tone for when initiative=1, you could try the following:
tabmean %>% filter(initiative==1) %>% summarise(avg_tone=mean(tone, na.rm=TRUE)
Note that (1) you have to add na.rm==TRUE to the summarise call if you have missing values in the column that you are summarizing, otherwise it will only produce NA's, and (2) check that the columns are of type numeric (you could check that with str(tabmean) and for example change tone to numeric with tabmean <- tabmean %>% mutate(tone=as.numeric(tone)).

R: how to use expand.grid to generate combinations based on group

I am trying to get all combinations of values per group. I want to prevent combination of values between different groups.
To create all combinations of values (no matter which group the value belongs) vaI can use:
expand.grid(value, value)
Awaited result should be the subset of result of previous command.
Example:
#base data
value = c(1,3,5, 1,5,7,9, 2)
group = c("a", "a", "a","b","b","b","b", "c")
base <- data.frame(value, group)
#creating ALL combinations of value
allComb <- expand.grid(base$value, base$value)
#awaited result is subset of allComb.
#Note: first colums shows the number of row from allComb.
#Empty rows are separating combinations per group and are shown only for clarification.
Var1 Var2
1 1 1
2 3 1
3 5 1
11 1 3
12 3 3
13 5 3
21 1 5
22 3 5
23 5 5
34 1 1
35 5 1
36 7 1
37 9 1
44 1 5
45 5 5
46 7 5
47 9 5
54 1 7
55 5 7
56 7 7
57 9 7
64 1 9
65 5 9
66 7 9
67 9 9
78 2 2

R: Sum column from table 2 based on value in table 1, and store result in table 1

I am a R noob, and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The location are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want loop over the grid date, and sum the population of the grid ids close to the store grid id.
I.e basically SUMIF the grid population variable, with the condition that
grid(x) < store(x) + 1 &
grid(x) > store(x) - 1 &
grid(y) < store(y) + 1 &
grid(y) > store(y) - 1
How can I accomplish that? My own take has been trying to use different things like merge, sapply, etc, but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 479
Store2 5 2 119
I.e for store1 it is the sum of TOT_P for Grid fields X=[2-4] and Y=[5-7]

One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
#create example population data (copied example table from OP)
df.pop
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
inner_join(df.pop, by='k') %>%
mutate(x.diff = abs(StoreX-GridX), y.diff=abs(StoreY-GridY)) %>%
filter(x.diff<=1, y.diff<=1) %>%
group_by(StoreName) %>%
summarise(StoreX=max(StoreX), StoreY=max(StoreY), tot.pop = sum(TOT_P) )
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119

Using tidyverse gather() to output multiple value vectors with a single key in a data frame

Despite the conventions of R, data collection and entry is for me most easily done in vertical columns. Therefore, I have a question about efficiently converting to horizontal rows with the gather() function in the tidyverse library. I find myself using gather() over and over which seems inefficient. Is there a more efficient way? And can an existing vector serve as the key? Here is an example:
Let's say we have the following health metrics on baby birds.
bird day_1_mass day_2_mass day_1_heart_rate day_3_heart_rate
1 1 5 6 60 55
2 2 6 8 62 57
3 3 3 3 45 45
Using the gather function I can reorganize the mass data into rows.
horizontal.data <- gather(vertical.data,
key = age,
value = mass,
day_1_mass:day_2_mass,
factor_key=TRUE)
Giving us
bird day_1_heart_rate day_3_heart_rate age mass
1 1 60 55 day_1_mass 5
2 2 62 57 day_1_mass 6
3 3 45 45 day_1_mass 3
4 1 60 55 day_2_mass 6
5 2 62 57 day_2_mass 8
6 3 45 45 day_2_mass 3
And use the same function again to similarly reorganize heart rate data.
horizontal.data.2 <- gather(horizontal.data,
key = age2,
value = heart_rate,
day_1_heart_rate:day_3_heart_rate,
factor_key=TRUE)
Producing a new dataframe
bird age mass age2 heart_rate
1 1 day_1_mass 5 day_1_heart_rate 60
2 2 day_1_mass 6 day_1_heart_rate 62
3 3 day_1_mass 3 day_1_heart_rate 45
4 1 day_2_mass 6 day_1_heart_rate 60
5 2 day_2_mass 8 day_1_heart_rate 62
6 3 day_2_mass 3 day_1_heart_rate 45
7 1 day_1_mass 5 day_3_heart_rate 55
8 2 day_1_mass 6 day_3_heart_rate 57
9 3 day_1_mass 3 day_3_heart_rate 45
10 1 day_2_mass 6 day_3_heart_rate 55
11 2 day_2_mass 8 day_3_heart_rate 57
12 3 day_2_mass 3 day_3_heart_rate 45
So it took two steps, but it worked. The questions are 1) Is there a way to do this in one step? and 2) Can it alternatively be done with one key (the "age" vector) that I can then simply replace as numeric data?

if I get the question right, you could do that by first gathering everything together, and then "spreading" on mass and heart rate:
library(forcats)
library(dplyr)
mass_levs <- names(vertical.data)[grep("mass", names(vertical.data))]
hearth_levs <- names(vertical.data)[grep("heart", names(vertical.data))]
horizontal.data <- vertical.data %>%
gather(variable, value, -bird, factor_key = TRUE) %>%
mutate(day = stringr::str_sub(variable, 5,5)) %>%
mutate(variable = fct_collapse(variable,
"mass" = mass_levs,
"hearth_rate" = hearth_levs)) %>%
spread(variable, value)
, giving:
bird day mass hearth_rate
1 1 1 5 60
2 1 2 6 NA
3 1 3 NA 55
4 2 1 6 62
5 2 2 8 NA
6 2 3 NA 57
7 3 1 3 45
8 3 2 3 NA
9 3 3 NA 45
we can see how it works by going through the pipe one pass at a time.
First, we gather everyting on a long format:
horizontal.data <- vertical.data %>%
gather(variable, value, -bird, factor_key = TRUE)
bird variable value
1 1 day_1_mass 5
2 2 day_1_mass 6
3 3 day_1_mass 3
4 1 day_2_mass 6
5 2 day_2_mass 8
6 3 day_2_mass 3
7 1 day_1_heart_rate 60
8 2 day_1_heart_rate 62
9 3 day_1_heart_rate 45
10 1 day_3_heart_rate 55
11 2 day_3_heart_rate 57
12 3 day_3_heart_rate 45
then, if we want to keep a "proper" long table, as the OP suggested we have to create a single key variable. In this case, it makes sense to use the day (= age). To create the day variable, we can extract it from the character strings now in variable:
%>% mutate(day = stringr::str_sub(variable, 5,5))
here, str_sub gets the substring in position 5, which is the day (note that if in the full dataset you have multiple-digits days, you'll have to tweak this a bit, probably by splitting on _):
bird variable value day
1 1 day_1_mass 5 1
2 2 day_1_mass 6 1
3 3 day_1_mass 3 1
4 1 day_2_mass 6 2
5 2 day_2_mass 8 2
6 3 day_2_mass 3 2
7 1 day_1_heart_rate 60 1
8 2 day_1_heart_rate 62 1
9 3 day_1_heart_rate 45 1
10 1 day_3_heart_rate 55 3
11 2 day_3_heart_rate 57 3
12 3 day_3_heart_rate 45 3
now, to finish we have to "spread " the table to have a mass and a heart rate column.
Here we have a problem, because currently there are 2 levels each corresponding to mass and hearth rate in the variable column. Therefore, applying spread on variable would give us again four columns.
To prevent that, we need to aggregate the four levels in variable into two levels. We can do that by using forcats::fc_collapse, by providing the association between the new level names and the "old" ones. Outside of a pipe, that would correspond to:
horizontal.data$variable <- fct_collapse(horizontal.data$variable,
mass = c("day_1_mass", "day_2_mass",
heart = c("day_1_hearth_rate", "day_3_heart_rate")
However, if you have many levels it is cumbersome to write them all. Therefore, I find beforehand the level names corresponding to the two "categories" using
mass_levs <- names(vertical.data)[grep("mass", names(vertical.data))]
hearth_levs <- names(vertical.data)[grep("heart", names(vertical.data))]
mass_levs
[1] "day_1_mass" "day_2_mass"
hearth_levs
[1] "day_1_heart_rate" "day_3_heart_rate"
therefore, the third line of the pipe can be shortened to:
%>% mutate(variable = fct_collapse(variable,
"mass" = mass_levs,
"hearth_rate" = hearth_levs))
, after which we have:
bird variable value day
1 1 mass 5 1
2 2 mass 6 1
3 3 mass 3 1
4 1 mass 6 2
5 2 mass 8 2
6 3 mass 3 2
7 1 hearth_rate 60 1
8 2 hearth_rate 62 1
9 3 hearth_rate 45 1
10 1 hearth_rate 55 3
11 2 hearth_rate 57 3
12 3 hearth_rate 45 3
, so that we are now in the condition to "spread" the table again according to variable using:
%>% spread(variable, value)
bird day mass hearth_rate
1 1 1 5 60
2 1 2 6 NA
3 1 3 NA 55
4 2 1 6 62
5 2 2 8 NA
6 2 3 NA 57
7 3 1 3 45
8 3 2 3 NA
9 3 3 NA 45
HTH

If you insist on a single command , i can give you one
setup the data.frame
c1<-c(1,2,3)
c2<-c(5,6,3)
c3<-c(6,8,3)
c4<-c(60,62,45)
c5<-c(55,57,45)
dt<-as.data.table(cbind(c1,c2,c3,c4,c5))
colnames(dt)<-c("bird","day_1_mass","day_2_mass","day_1_heart_rate","day_3_heart_rate")
Now use this single command to get the final outcome
merge(melt(dt[,c("bird","day_1_mass","day_2_mass")],id.vars = c("bird"),variable.name = "age",value.name="mass"),melt(dt[,c("bird","day_1_heart_rate","day_3_heart_rate")],id.vars = c("bird"),variable.name = "age2",value.name="heart_rate"),by = "bird")
The final outcome is
bird age mass age2 heart_rate
1: 1 day_1_mass 5 day_1_heart_rate 60
2: 1 day_1_mass 5 day_3_heart_rate 55
3: 1 day_2_mass 6 day_1_heart_rate 60
4: 1 day_2_mass 6 day_3_heart_rate 55
5: 2 day_1_mass 6 day_1_heart_rate 62
6: 2 day_1_mass 6 day_3_heart_rate 57
7: 2 day_2_mass 8 day_1_heart_rate 62
8: 2 day_2_mass 8 day_3_heart_rate 57
9: 3 day_1_mass 3 day_1_heart_rate 45
10: 3 day_1_mass 3 day_3_heart_rate 45
11: 3 day_2_mass 3 day_1_heart_rate 45
12: 3 day_2_mass 3 day_3_heart_rate 45

Though already answered, I have a different solution in which you save a list of the gather parameters you would like to run, and then run the gather_() command for each set of parameters in the list.
# Create a list of gather parameters
# Format is key, value, columns_to_gather
gather.list <- list(c("age", "mass", "day_1_mass", "day_2_mass"),
c("age2", "heart_rate", "day_1_heart_rate", "day_3_heart_rate"))
# Run gather command for each list item
for(i in gather.list){
df <- gather_(df, key_col = i[1], value_col = i[2], gather_cols = c(i[3:length(i)]), factor_key = TRUE)
}

Merge two datasets in R

I have two different datasets arranged in column format as follows:
Dataset 1:
A B C D E
13 1 1.7 2 1
13 2 5.3 2 1
13 2 2 2 1
13 2 1.8 2 1
1 6 27 9 1
1 6 6.6 9 1
1 7 17 9 1
1 7 7.1 9 1
1 7 8.5 9 1
Dataset 2:
A B F G
13 1 42 1002
13 2 42 1002
13 2 42 1002
13 2 42 1002
13 3 42 1002
13 4 42 1002
13 5 42 1002
1 2 27 650
1 3 27 650
1 4 27 650
1 6 27 650
1 7 27 650
1 7 27 650
1 7 27 650
1 8 27 650
Row numbers of both datasets are variable but they contain data for two samples (for example, column A: 13 and 1 of both datasets). I want C D and E values of dataset 1 to be placed in dataset 2, those having the same values of A and B in both datasets. So, joining should be based on A and B. I need to do this for about 47560 rows.
I am new in R so should be thankful if I could get code for saving the new merged dataset in R.

Use the merge function in R.
Reference from : http://www.statmethods.net/management/merging.html
Edit:
So first you'd need to read in the datasets, CSV is a good format.
> dataset1 <- read.csv(file="dataset1.csv", head=TRUE, sep=",")
> dataset2 <- read.csv(file="dataset2.csv", head=TRUE, sep=",")
If you just type the variable names now and hit enter you should see a read-out of your datasets. So...
> dataset1
should read out your data above. Then I believe the following should occur...I may be wrong...
> dataset1_2 <- merge(dataset1, dataset2, by=c("A","B"))
EDIT 2 :
> write.table(dataset1_2, "c:/dataset1_2.txt", sep=" ")
Reference : http://www.statmethods.net/input/exportingdata.html

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Reading in space seperated dataset in R for frequent items - r

Related

Get the average of the values of one column for the values in another

R: how to use expand.grid to generate combinations based on group

R: Sum column from table 2 based on value in table 1, and store result in table 1

Using tidyverse gather() to output multiple value vectors with a single key in a data frame

Merge two datasets in R

Categories

Resources