I have a big data frame with many variables and their options, and I want the count of every variable's options. For example, see the data frame below.
I also have another, similar data frame. Before merging the two, I want to check whether the column names are the same and, if not, get the names of the columns that differ.
The uniqueid and name columns should be excluded.
The objective is to use the counts to spot misspelled words, or words carrying accents.
df11 <- data.frame(uniqueid=c(9143,2357,4339,8927,9149,4285,2683,8217,3702,7857,3255,4262,8501,7111,2681,6970),
name=c("xly,mnn","xab,Lan","mhy,mun","vgtu,mmc","ftu,sdh","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","sghu,njui","sgyu,hytb","vdti,kula","mftyu,huta","mhuk,ghul","cday,bhsue","ajtu,nudj"),
city=c("A","B","C","C","D","F","S","C","E","S","A","B","W","S","C","A"),
age=c(22,45,67,34,43,22,34,43,34,52,37,44,41,40,39,30),
country=c("usa","USA","AUSTRALI","AUSTRALIA","uk","UK","SPAIN","SPAIN","CHINA","CHINA","BRAZIL","BRASIL","CHILE","USA","CANADA","UK"),
language=c("ENGLISH(US)","ENGLISH(US)","ENGLISH","ENGLISH","ENGLISH(UK)","ENGLISH(UK)","SPANISH","SPANISH","CHINESE","CHINESE","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH(US)"),
gender=c("MALE","FEMALE","male","m","f","MALE","FEMALE","f","FEMALE","MALE","MALE","MALE","FEMALE","FEMALE","MALE","MALE"))
The output should be a summary of counts grouped by variable and option, a kind of pivot (e.g. for city).
So it should take all the remaining columns in the data frame and give a summary of counts for every option available in each column.
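For the second part of the question (checking whether two data frames have the same column names before a merge), a base R sketch using setdiff and intersect; df_a and df_b below are hypothetical stand-ins for the two frames:

```r
# Hypothetical stand-ins for the two data frames to be merged
df_a <- data.frame(uniqueid = 1, name = "a", country = "USA", language = "ENGLISH")
df_b <- data.frame(uniqueid = 2, name = "b", country = "UK",  lang     = "ENGLISH")

only_in_a <- setdiff(names(df_a), names(df_b))  # "language"
only_in_b <- setdiff(names(df_b), names(df_a))  # "lang"
common    <- intersect(names(df_a), names(df_b))
```

If both setdiff calls return character(0), the column names match and the merge can proceed.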
I am a bit confused by what you call an "option", but here is something to start with, using only base R functions.
Note: it only addresses the first part of the question, "I want the count of all variables and their options".
# apply table() to the selected columns and gather the output in a data frame
res <- do.call(rbind, lapply(df11[, 3:ncol(df11)], function(option) as.data.frame(table(option))))
res$variable <- gsub("[.](.*)", "", rownames(res)) # recover the variable name from the row names
rownames(res) <- NULL                              # just to clean up
res <- res[, c(3, 1, 2)]                           # reorder columns
res <- res[order(-res$Freq), ]                     # sort by decreasing Freq
The output:
> res
variable option Freq
34 language ENGLISH 7
42 gender MALE 7
39 gender FEMALE 5
3 city C 4
1 city A 3
7 city S 3
11 age 34 3
36 language ENGLISH(US) 3
2 city B 2
9 age 22 2
16 age 43 2
27 country CHINA 2
28 country SPAIN 2
30 country UK 2
32 country USA 2
33 language CHINESE 2
35 language ENGLISH(UK) 2
37 language SPANISH 2
38 gender f 2
4 city D 1
5 city E 1
6 city F 1
8 city W 1
10 age 30 1
12 age 37 1
13 age 39 1
14 age 40 1
15 age 41 1
17 age 44 1
18 age 45 1
19 age 52 1
20 age 67 1
21 country AUSTRALI 1
22 country AUSTRALIA 1
23 country BRASIL 1
24 country BRAZIL 1
25 country CANADA 1
26 country CHILE 1
29 country uk 1
31 country usa 1
40 gender m 1
41 gender male 1
You could count the number of unique values per group with aggregate:
res <- aggregate(. ~ city, df11, function(x) length(unique(x)))
res
# city uniqueid name age country language gender
# 1 A 3 3 3 3 2 1
# 2 B 2 2 2 2 2 2
# 3 C 4 4 4 4 2 4
# 4 D 1 1 1 1 1 1
# 5 E 1 1 1 1 1 1
# 6 F 1 1 1 1 1 1
# 7 S 3 3 3 3 3 2
# 8 W 1 1 1 1 1 1
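For comparison, the variable/option counting from the first part can also be sketched with dplyr and tidyr; the frame below is a cut-down stand-in for df11 (the full frame works the same way):

```r
library(dplyr)
library(tidyr)

# Cut-down stand-in for df11 (the full frame works identically)
df <- data.frame(uniqueid = 1:4, name = letters[1:4],
                 city = c("A", "B", "C", "C"),
                 age  = c(22, 45, 67, 34))

counts <- df %>%
  select(-uniqueid, -name) %>%                   # exclude the id columns
  mutate(across(everything(), as.character)) %>% # allow mixed column types
  pivot_longer(everything(),
               names_to = "variable", values_to = "option") %>%
  count(variable, option, sort = TRUE)
```

The as.character step is needed because pivot_longer cannot stack numeric and character columns into one value column otherwise.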
I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. A large group of people is followed for 25 years, recording changes in region (a categorical variable, 1-4), urban (a dummy), wage and educ.
How do I count the number of times region or urban changed (e.g. from region 1 to region 3, or from urban 0 to 1) during the 25-year observation period within each subject? There are also some NAs in the data, which should be ignored.
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across the answers "Count number of changes in categorical variables during repeated measurements" and "Identify change in categorical data across datapoints in R", but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that marks each change in region with a 1, by comparing the current region value with the previous one (using lag). Note that if the previous value is NA (i.e. at the first value for a given i), it is counted as no change.
The same approach is taken for urban. Then summarize, totalling up all the changes for each i. The temporary variables are left in so you can check whether you are getting the desired results.
Edit: if you wish to remove rows that have NA for region or urban, you can apply drop_na first.
library(dplyr)
library(tidyr)

df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change   = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban  == lag(urban)  | is.na(lag(urban)),  0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban  = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both the tot_region and tot_urban columns, you can use colSums. (Store the earlier result as df_tot, as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
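If you prefer to stay in base R, here is a sketch of the same idea: a small helper that drops NAs and counts consecutive changes, shown on the region and urban values for i = 4 transcribed from the question:

```r
# Helper: drop NAs, then count positions where the value differs
# from the previous (non-NA) value
count_changes <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 2) return(0L)
  sum(x[-1] != x[-length(x)])
}

# region and urban values for i = 4, transcribed from the question
region_i4 <- c(1, NA, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 4)
urban_i4  <- c(1, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1)

count_changes(region_i4)  # 3
count_changes(urban_i4)   # 0
```

On the full data frame it could then be applied per individual, e.g. with tapply(df$region, df$i, count_changes).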
I start with a data frame we will call DF1:
Team Stat1 Stat2 Stat3 Stat4 Pod
Georgia 1 3 3 6 1
Nevada 2 2 2 7 2
Clemson 3 1 2 4 2
Texas 5 4 2 3 1
I want to use only Stats 1,2,3 (not 4). Based on the value in "Pod" I want to create a row with two teams. Each team would have Stats 1, 2, and 3. It should look something like this:
Team1 Stat1A Stat2A Stat3A Team2 Stat1B Stat2B Stat3B
Georgia 1 3 3 Texas 5 4 2
Nevada 2 2 2 Clemson 3 1 2
This indicates that Georgia is playing Texas, Nevada is playing Clemson, and so on. For every round in the tourney I would have to re-assign pods to the matchups in order to progress through the bracket. So, in this very simplified bracket example, the winners of those games would then meet; say Georgia faces Clemson in the final, giving:
Team1 Stat1A Stat2A Stat3A Team2 Stat1B Stat2B Stat3B
Georgia 1 3 3 Clemson 3 1 2
We can use dcast from data.table
library(data.table)
dcast(setDT(df1[-5]), Pod ~ LETTERS[rowid(Pod)],
value.var = names(df1)[1:4], sep="")
# Pod TeamA TeamB Stat1A Stat1B Stat2A Stat2B Stat3A Stat3B
#1: 1 Georgia Texas 1 5 3 4 3 2
#2: 2 Nevada Clemson 2 3 2 1 2 2
We can use aggregate from base R:
aggregate(.~Pod,df[-5],I)
Pod Team.1 Team.2 Stat1.1 Stat1.2 Stat2.1 Stat2.2 Stat3.1 Stat3.2
1 1 Georgia Texas 1 5 3 4 3 2
2 2 Nevada Clemson 2 3 2 1 2 2
If you need an exact match then:
s=do.call(data.frame,aggregate(.~Pod,df[-5],I))
s[c(grep("\\.1",names(s)),grep("\\.2",names(s)))]
Team.1 Stat1.1 Stat2.1 Stat3.1 Team.2 Stat1.2 Stat2.2 Stat3.2
1 Georgia 1 3 3 Texas 5 4 2
2 Nevada 2 2 2 Clemson 3 1 2
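Another base R sketch, without aggregate: split the frame by Pod and bind the two rows of each pod side by side (this assumes exactly two teams per pod; df1 below is a stand-in for DF1 without Stat4):

```r
# Stand-in for DF1 with the columns the question uses (Stat4 dropped)
df1 <- data.frame(Team  = c("Georgia", "Nevada", "Clemson", "Texas"),
                  Stat1 = c(1, 2, 3, 5),
                  Stat2 = c(3, 2, 1, 4),
                  Stat3 = c(3, 2, 2, 2),
                  Pod   = c(1, 2, 2, 1))

# Split by Pod, cbind the two rows of each pod, then stack the pods
pairs <- do.call(rbind, lapply(split(df1[-5], df1$Pod), function(g) {
  out <- cbind(g[1, ], g[2, ])
  names(out) <- c(paste0(names(g), "A"), paste0(names(g), "B"))
  out
}))
```

pairs then has one row per pod with columns TeamA, Stat1A, ..., TeamB, Stat1B, ..., matching the requested layout.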
I'm trying to add the values from one column of a data frame to another data frame by matching on either of two columns.
For example:
I have 2 df's with different length and df2 does not have all the pairs listed in df1:
df1
Year Territory Pair_ID
1 1999 BGD 1 5
2 2000 TAR 6 2
3 2001 JAM 3 7
4 2002 TER 9 2
df2
ID1 ID2 pair pair1 type detail
1 1 5 1 5 5 1 PO N/A
2 2 6 2 6 6 2 SB N/A
3 3 7 3 7 7 3 PO N/A
4 4 8 4 8 8 4 SB N/A
5 4 3 4 3 3 4 SB N/A
I want this:
Year Territory Pair_ID type
1 1999 BGD 1 5 PO
2 2000 TAR 6 2 SB
3 2001 JAM 3 7 PO
4 2002 TER 9 2 N/A
I don't want to completely merge the 2 dataframes. I just want to add the "type" column from df2 to df1 by matching the "Pair" column from df1 to either the "pair" column or "pair1" column in df2. I would also like it to fill in with "N/A" for Pairs that are not found in df2.
I could not find anything that addresses this specific problem.
I've tried this:
df1$type <- df2$type[match(df1$Pairs, c(df2$pair,df2$pair1))]
But it only matches with the "pair" column and ignores the "pair1" column.
Good case for sqldf:
library(sqldf)
sqldf("select df1.Year
,df1.Territory
,df1.Pair_ID
,df2.type
from df1
left join df2
on df1.Pair_ID = df2.pair
or df1.Pair_ID = df2.pair1
")
Results
Year Territory Pair_ID type
1 1999 BGD 1 5 PO
2 2000 TAR 6 2 SB
3 2001 JAM 3 7 PO
4 2002 TER 9 2 <NA>
Try something like
typeA <- df2$type[match(df1$Pairs, df2$pair)]
typeB <- df2$type[match(df1$Pairs, df2$pair1)]
df1$type <- ifelse(is.na(typeA), typeB, typeA)
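The ifelse fallback in the last line can also be written in one step with dplyr::coalesce, which returns, element-wise, the first non-NA value; the vectors below are illustrative stand-ins for the two lookup results:

```r
library(dplyr)

# Illustrative stand-ins for the two match() results above
typeA <- c("PO", "SB", NA, NA)
typeB <- c(NA, NA, "PO", NA)

# coalesce() keeps typeA where it is non-NA, otherwise falls back to typeB
coalesce(typeA, typeB)  # "PO" "SB" "PO" NA
```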
I have a table called "merged" like:
Nationality CustomerID_count ClusterId
1 argentina 1 1
2 ARGENTINA 26 1
3 ARGENTINO 1 1
4 argentona 1 1
5 boliviana 14 2
6 paragauy 1 3
7 paraguay 1 3
8 PARAGUAY 1 3
I need to create a new Nationality column by looking up, within each cluster, the Nationality of the row with the maximum CustomerID_count.
I did this other table with the following code:
merged1<-data.table(merged)
merged2<-merged1[, which.max(CustomerID), by = ClusterId]
And I got:
ClusterId V1
1: 1 2
2: 2 1
3: 3 1
After that I did a merge:
tot<-merge(x=merged, y=merged2, by= "ClusterId", all.x=TRUE)
And I got the following table:
ClusterId Nationality CustomerID V1
1 1 argentina 1 2
2 1 ARGENTINA 26 2
3 1 ARGENTINO 1 2
4 1 argentona 1 2
5 2 boliviana 14 1
6 3 paragauy 1 1
7 3 paraguay 1 1
8 3 PARAGUAY 1 1
But I didn't know how to finish. I tried this:
tot[,5]=tot[V1,5]
because I want each row to get the Nationality found on the row number given in column V1, but this didn't work.
How can I do the last part? and also is there a better way to solve this?
Thanks!
Note that you may have more than one CustomerID_count that matches the maximum value (e.g. all versions of "paraguay" have CustomerID_count == 1, which is the max for that cluster).
It's very easy using the plyr package:
library(plyr)
ddply(merged, .(ClusterId), mutate, Nationality2 = Nationality[CustomerID_count == max(CustomerID_count)])
This could be a good use case for dplyr:
library(dplyr)
merged <- merged %>%
  group_by(ClusterId) %>%
  mutate(newNat = Nationality[CustomerID_count == max(CustomerID_count)]) %>%
  ungroup()
print(merged)
## Source: local data frame [8 x 4]
##
## Nationality CustomerID_count ClusterId newNat
## 1 argentina 1 1 ARGENTINA
## 2 ARGENTINA 26 1 ARGENTINA
## 3 ARGENTINO 1 1 ARGENTINA
## 4 argentona 1 1 ARGENTINA
## 5 boliviana 14 2 boliviana
## 6 paragauy 1 3 paragauy
## 7 paraguay 1 3 paraguay
## 8 PARAGUAY 1 3 PARAGUAY
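As the note above warns, the subscript Nationality[CustomerID_count == max(CustomerID_count)] yields more than one value when the maximum is tied (as in cluster 3). If a single, deterministic winner is wanted, which.max can be substituted, since it always returns the position of the first maximum; a minimal sketch on the tied cluster:

```r
library(dplyr)

# The tied cluster from the question (all counts equal 1)
merged <- data.frame(Nationality = c("paragauy", "paraguay", "PARAGUAY"),
                     CustomerID_count = c(1, 1, 1),
                     ClusterId = 3)

res <- merged %>%
  group_by(ClusterId) %>%
  mutate(newNat = Nationality[which.max(CustomerID_count)]) %>%
  ungroup()

res$newNat  # "paragauy" for every row: which.max() picks the first maximum
```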