I have the following formulas in Excel, but the calculation takes forever, so I would like to find a way to do the same calculations in R.
I'm counting the number of times an item shows up in each location (Central 1, Central 2, and External) with these formulas:
=SUMPRODUCT(($N:$N=$A2)*(RIGHT($C:$C)="1"))
=SUMPRODUCT(($N:$N=$A2)*(RIGHT($C:$C)="2"))
=SUMPRODUCT(($N:$N=$A2)*(LEFT($C:$C)="E"))
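(In R terms, each of these formulas is just a conditional count over two columns. As a rough sketch, assuming hypothetical vectors item_col standing in for column N, cart_col for column C, and a value id for A2, the same checks would be:)
# the three conditional counts, assuming cart_col is a character vector
sum(item_col == id & endsWith(cart_col, "1"))
sum(item_col == id & endsWith(cart_col, "2"))
sum(item_col == id & startsWith(cart_col, "E"))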
Here is the data frame to which the columns with these values will be added:
> str(FinalPars)
'data.frame': 10038 obs. of 3 variables:
$ ID: int 11 13 18 22 39 181 182 183 191 192 ...
$ Minimum : num 15 6 1.71 1 1 4.39 2.67 5 5 2 ...
$ Maximum : num 15 6 2 1 1 5.48 3.69 6.5 5 2 ...
and here is the dataset to which the ItemID will be matched (This is a master list of all locations each item is stored in):
> str(StorageLocations)
'data.frame': 14080 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ CLASSIFICATION : Factor w/ 3 levels "Central 1","Central 2",..: 3 3 3 1 2 3 3 1 2 3 ...
$ Cart Descr : Factor w/ 145 levels "Closet1",..: 36 41 110 1 99 58 60 14 99 60 ...
Sample of Storage Location Data Frame:
ID Classification Cart Descr
123 Central 1 Main Store Room
123 Central 2 Secondary Store Room
123 External Closet 1
123 External Closet 2
123 External Closet 3
So the output for the above would be added to the FinalPars data frame as the new columns Central 1, Central 2, and External, counting the number of times the item was identified as being in each of those locations:
ID Minimum Maximum Central 1 Central 2 External
123 10 15 1 1 3
This was my output in Excel: a count of the number of times an item was identified as Central 1, Central 2, or External.
If anyone knows the comparable formula in R it would be great!
It's hard to know what you are really asking for without example data. I produced an example below.
Location <- c(rep(1,4), rep(2,4), rep(3,4))
Item_Id <- c(rep(1,2),rep(2,3),rep(1,2),rep(2,2),rep(1,3))
Item_Id_Want_to_Match <- 1
df <- data.frame(Location, Item_Id)
> df
Location Item_Id
1 1 1
2 1 1
3 1 2
4 1 2
5 2 2
6 2 1
7 2 1
8 2 2
9 3 2
10 3 1
11 3 1
12 3 1
sum(ifelse(df$Location == 1 & df$Item_Id == Item_Id_Want_to_Match, df$Item_Id*df$Location,0))
> sum(ifelse(df$Location == 1 & df$Item_Id == Item_Id_Want_to_Match, df$Item_Id*df$Location,0))
[1] 2
EDIT:
ID <- rep(123,5)
Classification <- c("Central 1", "Central 2", rep("External",3))
df <- data.frame(ID, Classification)
df$count <- 1
ID2 <- 123
Min <- 10
Max <- 15
df2 <- data.frame(ID2, Min, Max)
library(dplyr)
count_df <- df %>%
group_by(ID, Classification) %>%
summarise(count= sum(count))
> count_df
Source: local data frame [3 x 3]
Groups: ID
ID Classification count
1 123 Central 1 1
2 123 Central 2 1
3 123 External 3
library(reshape)
new_df <- recast(count_df, ID~Classification, id.var=c("ID", "Classification"))
> new_df
ID Central 1 Central 2 External
1 123 1 1 3
merge(new_df, df2, by.x="ID", by.y="ID2")
> merge(new_df, df2, by.x="ID", by.y="ID2")
ID Central 1 Central 2 External Min Max
1 123 1 1 3 10 15
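Applied to the objects named in the question, a compact base-R sketch (assuming StorageLocations$ID and StorageLocations$CLASSIFICATION as shown in the str() output, and FinalPars$ID) could be:
# cross-tabulate ID against CLASSIFICATION, then merge the counts onto FinalPars
loc_counts <- as.data.frame.matrix(table(StorageLocations$ID, StorageLocations$CLASSIFICATION))
loc_counts$ID <- as.integer(rownames(loc_counts))
FinalParsWithCounts <- merge(FinalPars, loc_counts, by = "ID", all.x = TRUE)
IDs with no rows for a classification get a 0 from table(), and IDs that never appear in StorageLocations come through as NA after the merge.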
I have a big data frame with many variables and their options (values), and I want a count of every variable and each of its options; for example, see the data frame below.
I also have another, similar data frame, and when I merge the two I want to check whether the column names are the same and, if not, get the names of the columns that differ.
Exclude the c(uniqueid, name) columns.
The objective is to use the counts to find any misspelled words, or words that have accents.
df11 <- data.frame(uniqueid=c(9143,2357,4339,8927,9149,4285,2683,8217,3702,7857,3255,4262,8501,7111,2681,6970),
name=c("xly,mnn","xab,Lan","mhy,mun","vgtu,mmc","ftu,sdh","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","sghu,njui","sgyu,hytb","vdti,kula","mftyu,huta","mhuk,ghul","cday,bhsue","ajtu,nudj"),
city=c("A","B","C","C","D","F","S","C","E","S","A","B","W","S","C","A"),
age=c(22,45,67,34,43,22,34,43,34,52,37,44,41,40,39,30),
country=c("usa","USA","AUSTRALI","AUSTRALIA","uk","UK","SPAIN","SPAIN","CHINA","CHINA","BRAZIL","BRASIL","CHILE","USA","CANADA","UK"),
language=c("ENGLISH(US)","ENGLISH(US)","ENGLISH","ENGLISH","ENGLISH(UK)","ENGLISH(UK)","SPANISH","SPANISH","CHINESE","CHINESE","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH(US)"),
gender=c("MALE","FEMALE","male","m","f","MALE","FEMALE","f","FEMALE","MALE","MALE","MALE","FEMALE","FEMALE","MALE","MALE"))
The output should be a summary of counts for each variable and its options, a kind of pivot table; for example, for city.
So it should take all available columns in the data frame and give a summary of counts for every option available in each column.
I am not quite sure what you mean by "option", but here is something to start with using only base R functions.
Note: this only addresses the first part of the question, "I want the count of all variables and their options".
res <- do.call(rbind, lapply(df11[, 3:ncol(df11)], function(option) as.data.frame(table(option)))) # apply table() to the selected columns and gather the output in a dataframe
res$variable <- gsub("[.](.*)","", rownames(res)) # recover the name of the variable from the row names with a regular expression
rownames(res) <- NULL # just to clean
res <- res[, c(3,1,2)] # ordering columns
res <- res[order(-res$Freq), ] # sorting by decreasing Freq
The output:
> res
variable option Freq
34 language ENGLISH 7
42 gender MALE 7
39 gender FEMALE 5
3 city C 4
1 city A 3
7 city S 3
11 age 34 3
36 language ENGLISH(US) 3
2 city B 2
9 age 22 2
16 age 43 2
27 country CHINA 2
28 country SPAIN 2
30 country UK 2
32 country USA 2
33 language CHINESE 2
35 language ENGLISH(UK) 2
37 language SPANISH 2
38 gender f 2
4 city D 1
5 city E 1
6 city F 1
8 city W 1
10 age 30 1
12 age 37 1
13 age 39 1
14 age 40 1
15 age 41 1
17 age 44 1
18 age 45 1
19 age 52 1
20 age 67 1
21 country AUSTRALI 1
22 country AUSTRALIA 1
23 country BRASIL 1
24 country BRAZIL 1
25 country CANADA 1
26 country CHILE 1
29 country uk 1
31 country usa 1
40 gender m 1
41 gender male 1
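For the second part of the question (checking whether the column names of two data frames match), a minimal sketch, assuming the second data frame is called df12:
intersect(names(df11), names(df12))  # column names present in both
setdiff(names(df11), names(df12))    # columns of df11 missing from df12
setdiff(names(df12), names(df11))    # columns of df12 missing from df11
setequal(names(df11), names(df12))   # TRUE if the two sets of names are identical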
You could calculate the number of unique values per group with aggregate().
res <- aggregate(. ~ city, df11, function(x) length(unique(x)))
res
# city uniqueid name age country language gender
# 1 A 3 3 3 3 2 1
# 2 B 2 2 2 2 2 2
# 3 C 4 4 4 4 2 4
# 4 D 1 1 1 1 1 1
# 5 E 1 1 1 1 1 1
# 6 F 1 1 1 1 1 1
# 7 S 3 3 3 3 3 2
# 8 W 1 1 1 1 1 1
Imagine this is the structure of my data hrd:
'data.frame': 14999 obs. of 2 variables:
$ left : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2
$ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
I want to know the percentage of people who have left (0 = stayed, 1 = left) for each level of sales.
This is the closest I come:
hrd %>% group_by(sales) %>% count(left)
However, the output is this:
sales left n
<fctr> <fctr> <int>
1 accounting 0 563
2 accounting 1 204
3 hr 0 524
4 hr 1 215
5 IT 0 954
6 IT 1 273
7 management 0 539
8 management 1 91
9 marketing 0 655
10 marketing 1 203
11 product_mng 0 704
12 product_mng 1 198
13 RandD 0 666
14 RandD 1 121
15 sales 0 3126
16 sales 1 1014
17 support 0 1674
18 support 1 555
19 technical 0 2023
20 technical 1 697
I'm trying something like this:
hrd %>% group_by(sales)
%>% summarise(count = n() )
%>% mutate( leaving_rate = count(left == 1 )/ count )
But the error message is saying
Error: object 'left' not found
Don't use summarise() first, because it truncates your data frame to a summary version, dropping the column "left" (and any other columns that aren't mentioned or aren't grouping variables) and keeping only "sales" (the grouping variable) and "count" (the mentioned variable).
You can do it in one summarise() call like this (left is a factor, so compare against the level "1"):
hrd %>% group_by(sales) %>%
  summarise(percent_left = sum(left == "1") / n())
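Alternatively, building on the count() output the asker already had, a hedged sketch that computes each level's share within its sales group and then keeps only the leavers:
hrd %>%
  count(sales, left) %>%
  group_by(sales) %>%
  mutate(leaving_rate = n / sum(n)) %>%
  filter(left == "1")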
I have a data frame (labels) that I would like to use as a reference or lookup table of the form:
V1 V2
1 1 WALKING
2 2 WALKING_UPSTAIRS
3 3 WALKING_DOWNSTAIRS
4 4 SITTING
5 5 STANDING
6 6 LAYING
The data frame that will use the reference table is test (ncol = 564, nrow = 2947); its first three column names are test_subject, test_label (numbers 1-6), and data_set, where the test_label values 1-6 correspond to the strings referenced above.
Could someone help me figure out how to use my lookup table to insert a new column called "activity_label", where each observation of that column corresponds to the string equivalent of the referenced number from the lookup table?
E.g., if test_label row 1 equals 5 then activity_label row 1 would equal "Standing"
Thanks so much for all of your help!
After using the merge method:
> test2[1:10, 564: 565]
angle(Z,gravityMean) activity_label
1 0.04404283 walking
2 0.04134032 walking
3 0.04295217 walking
4 0.03611571 walking
5 -0.09080307 walking
6 -0.08602478 walking
7 -0.07997668 walking
8 0.04372663 walking
9 0.19900166 walking
10 0.20350821 walking
Analyzing the structure of the remaining data frames:
> str(test1)
'data.frame': 2947 obs. of 565 variables:
$ test_labels : int 1 1 1 1 1 1 1 1 1 1 ...
$ test_subject : int 12 12 12 12 4 4 4 12 9 9 ...
$ observ_set : Factor w/ 1 level "test": 1 1 1 1 1 1 1 1 1 1 ...
$ tBodyAcc-mean()-X : num 0.228 0.303 0.237 0.306 0.29 ...
> str(train1)
'data.frame': 7352 obs. of 565 variables:
$ train_labels : int 1 1 1 1 1 1 1 1 1 1 ...
$ V1 : int 27 7 7 26 7 26 6 6 6 7 ...
$ observ_set : Factor w/ 1 level "train": 1 1 1 1 1 1 1 1 1 1 ...
$ tBodyAcc-mean()-X : num 0.262 0.354 0.344 0.292 0.314 ...
One way is to use nested ifelse() calls. Assuming the data frame is test and the activity number column is activitynum:
test$activitylabel <- ifelse(test$activitynum == 1, "walking",
                      ifelse(test$activitynum == 2, "walking_upstairs",
                      ifelse(test$activitynum == 3, "walking_downstairs",
                      ifelse(test$activitynum == 4, "sitting",
                      ifelse(test$activitynum == 5, "standing",
                      ifelse(test$activitynum == 6, "laying", NA))))))
Another way is to create a look-up table and then do a merge, as suggested by @Jaehyeon:
lookup <- data.frame(activitynum = c(1,2,3,4,5,6), activity = c("walking", "walking_upstairs", "walking_downstairs", "sitting", "standing", "laying"))
survey <- data.frame(id = c(seq(1:10)), activitynum = floor(runif(10, 1, 7)), var1 = runif(10, 1, 100))
merge(survey, lookup, by = "activitynum", all.x = TRUE)
> str(lookup)
'data.frame': 6 obs. of 2 variables:
$ activitynum: num 1 2 3 4 5 6
$ activity : Factor w/ 6 levels "laying","sitting",..: 4 6 5 2 3 1
> str(survey)
'data.frame': 10 obs. of 3 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ activitynum: num 1 2 4 1 4 6 2 4 2 2
$ var1 : num 52.3 60.5 53.3 49.8 73.1 ...
I'd do it as follows. Mapping is done by 'test_label' and 'id', and they are merged using merge(). If you want to keep all values in df, use all.x = T; otherwise remove it.
set.seed(1237)
lookup <- data.frame(id = 1:6, activity = LETTERS[1:6])
df <- data.frame(test_label = sample(1:6, 10, replace = T))
merge(df, lookup, by.x = "test_label", by.y ="id", all.x = T)
test_label activity
1 1 A
2 1 A
3 2 B
4 2 B
5 3 C
6 5 E
7 5 E
8 6 F
9 6 F
10 6 F
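A base-R alternative worth noting (a sketch, assuming the lookup table is the labels data frame with columns V1 and V2 from the question): match() keeps the original row order, which merge() does not guarantee.
# the question's str() output shows the column as test_labels; adjust the name if needed
test$activity_label <- as.character(labels$V2)[match(test$test_label, labels$V1)]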
I have an R script that allows me to select a sample size and take fifty individual random samples with replacement. Below is an example of this code:
## Creates a data.table from the data
library(data.table)
df = as.data.table(data)
## Select sample size
sample.size = 5
## Creates Sample 1 (Size 5)
Sample.1<-df[,
Dollars[sample(.N, size=sample.size, replace=TRUE)], by = Num]
Sample.1$Sample <- c("01")
According to the R script above, I first created a data frame. I then select my sample size, which in this case is 5. This represents just one sample. Due to my lack of experience with R, I repeat this code 49 more times. The last piece of code looks like this:
## Creates Sample 50 (Size 5)
Sample.50<-df[,
Dollars[sample(.N, size=sample.size, replace=TRUE)], by = Num]
Sample.50$Sample <- c("50")
The sample output would look something like this (Sample Range 1 - 50):
Num Dollars Sample
1 85000 01
1 4900 01
1 18000 01
1 6900 01
1 11000 01
1 8800 50
1 3800 50
1 10400 50
1 2200 50
1 29000 50
It should be noted that the variable 'Num' was created for grouping purposes and has little to no influence on my overall question (which is posted below).
Instead of repeating this code fifty times, to get me fifty individual samples (with a size of 5), is there a loop I can create to help me limit my code? I have been recently asked to create ten thousand random samples, each of a size of 5. I obviously cannot repeat this code ten thousand times so I need some sort of loop.
A sample of my final output should look something like this (Sample Range 1 - 10,000):
Num Dollars Sample
1 85000 01
1 4900 01
1 18000 01
1 6900 01
1 11000 01
1 9900 10000
1 8300 10000
1 10700 10000
1 6800 10000
1 31000 10000
Thank you all in advance for your help, it's greatly appreciated.
Here is some sample data if needed:
Num Dollars
1 31002
1 13728
1 23526
1 80068
1 86244
1 9330
1 27169
1 13694
1 4781
1 9742
1 20060
1 35230
1 15546
1 7618
1 21604
1 8738
1 5299
1 12081
1 7652
1 16779
A very simple method would be to use a for loop and store the results in a list:
lst <- list()
for(i in seq_len(3)){
lst[[i]] <- df[sample(seq_len(nrow(df)), 5, replace = TRUE),]
lst[[i]]["Sample"] <- i
}
> lst
[[1]]
Num Dollars Sample
20 1 16779 1
1 1 31002 1
12 1 35230 1
14 1 7618 1
14.1 1 7618 1
[[2]]
Num Dollars Sample
9 1 4781 2
13 1 15546 2
12 1 35230 2
17 1 5299 2
12.1 1 35230 2
[[3]]
Num Dollars Sample
1 1 31002 3
7 1 27169 3
17 1 5299 3
5 1 86244 3
6 1 9330 3
Then, to create a single data.frame, use do.call to rbind the list elements together:
do.call(rbind, lst)
Num Dollars Sample
20 1 16779 1
1 1 31002 1
12 1 35230 1
14 1 7618 1
14.1 1 7618 1
9 1 4781 2
13 1 15546 2
121 1 35230 2
17 1 5299 2
12.1 1 35230 2
11 1 31002 3
7 1 27169 3
171 1 5299 3
5 1 86244 3
6 1 9330 3
It's worth noting that if you're sampling with replacement, then drawing 50 (or 10,000) samples of size 5 is equivalent to drawing one sample of size 250 (or 50,000). Thus I would do it like this (you'll see I stole a line from @beginneR's answer):
library(data.table)
df = as.data.table(data)
## Select sample size
sample.size = 5
n.samples = 10000
# Sample and assign groups
draws <- df[sample(seq_len(nrow(df)), sample.size * n.samples, replace = TRUE), ]
draws[, Sample := rep(1:n.samples, each = sample.size)]
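If you need to keep the by = Num grouping from the asker's original code, a sketch along the same lines (assuming df, sample.size, and n.samples as defined above) is:
# draw sample.size values of Dollars per Num, n.samples times, and stack the results
all.samples <- rbindlist(lapply(seq_len(n.samples), function(i) {
  s <- df[, .(Dollars = Dollars[sample(.N, size = sample.size, replace = TRUE)]), by = Num]
  s[, Sample := i]
  s
}))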
I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal of observations to be based on the following rules. The variables below are id, the sex of the household head (1 = male, 2 = female), and the age of the household head. The rules are as follows: if a household has both male and female household heads, remove the female household head observation; if a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so the desired entry for each id is first, and then removing the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
With data.table, this is easy with "compound queries". To order the data, set the "key" to "id,sex" when you create the data.table (required in case any female values would come before male values for a given ID).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45
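For completeness, the same ordering logic in dplyr (a sketch, assuming dplyr >= 0.5 for distinct(..., .keep_all = TRUE)):
library(dplyr)
# order so the preferred row per id comes first (males before females, older first),
# then keep only the first row of each id
data %>%
  arrange(id, sex, desc(age)) %>%
  distinct(id, .keep_all = TRUE)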