Merge data frames from a list with each other - r

What I need:
I have a huge data frame with the following columns (and some more, but these are not important). Here's an example:
user_id video_id group_id x y
1 1 0 0 39 108
2 1 0 0 39 108
3 1 10 0 135 180
4 2 0 0 20 123
User, video and group IDs are factors, of course. For example, there are 20 videos, but each of them has several "observations" for each user and group.
I'd like to transform this data frame into the following format, where there are as many x.N, y.N as there are users (N).
video_id x.1 y.1 x.2 y.2 …
0 39 108 20 123
So, for video 0, the x and y values from user 1 are in columns x.1 and y.1, respectively. For user 2, their values are in columns x.2, y.2, and so on.
What I've tried:
I made myself a list of data frames that are solely composed of all the x, y observations for each video_id:
summaryList = dlply(allData, .(user_id), function(x) unique(x[c("video_id","x","y")]) )
That's how it looks like:
List of 15
$ 1 :'data.frame': 20 obs. of 3 variables:
..$ video_id: Factor w/ 20 levels "0","1","2","3",..: 1 11 8 5 12 9 20 13 7 10 ...
..$ x : int [1:20] 39 135 86 122 28 167 203 433 549 490 ...
..$ y : int [1:20] 108 180 164 103 187 128 185 355 360 368 ...
$ 2 :'data.frame': 20 obs. of 3 variables:
..$ video_id: Factor w/ 20 levels "0","1","2","3",..: 2 14 15 4 20 6 19 3 13 18 ...
..$ x : int [1:20] 128 688 435 218 528 362 299 134 83 417 ...
..$ y : int [1:20] 165 117 135 179 96 328 332 563 623 476 ...
Where I'm stuck:
What's left to do is:
Merge each data frame from the summaryList with each other, based on the video_id. I can't find a nice way to access the actual data frames in the list, which are summaryList[1]$`1`, summaryList[2]$`2`, et cetera.
#James found out a partial solution:
Reduce(function(x,y) merge(x,y,by="video_id"),summaryList)
Ensure the column names are renamed after the user ID and not kept as-is. Right now my summaryList doesn't contain any info about the user ID, and the output of Reduce has duplicate column names like x.x y.x x.y y.y x.x y.x and so on.
How do I go about doing this? Or is there any easier way to get to the result than what I'm currently doing?

I am still somewhat confused. However, I guess you simply want to melt and dcast.
library(reshape2)
d <- melt(allData,id.vars=c("user_id","video_id"), measure.vars=c("x","y"))
dcast(d,video_id~user_id+variable,value.var="value",fun.aggregate=mean)
Resulting in:
video_id 1_x 1_y 2_x 2_y 3_x 3_y 4_x 4_y 5_x 5_y 6_x 6_y 7_x 7_y 8_x 8_y 9_x 9_y 10_x 10_y 11_x 11_y 12_x 12_y 14_x 14_y 15_x 15_y 16_x 16_y
1 0 39 108 899 132 61 357 149 298 1105 415 148 208 442 200 210 134 58 244 910 403 152 52 1092 617 1012 114 1105 424 548 394
2 1 1125 70 128 165 1151 390 171 587 623 623 80 643 866 310 994 114 854 129 781 306 672 -1 1096 354 525 524 150

Reduce does the trick:
reducedData <- Reduce(function(x,y) merge(x,y,by="video_id"),summaryList)
… but you need to fix the names afterwards:
names(reducedData)[-1] <- do.call(function(...) paste(...,sep="."),expand.grid(letters[24:25],names(summaryList)))
The result is:
video_id x.1 y.1 x.2 y.2 x.3 y.3 x.4 y.4 x.5 y.5 x.6 y.6 x.7 y.7 x.8
1 0 39 108 899 132 61 357 149 298 1105 415 148 208 442 200 210
2 1 1125 70 128 165 1151 390 171 587 623 623 80 643 866 310 994

Related

Convert data. frame column character to numeric

I am working on data frame which has information like this.
df<- as.data.frame(read.table("headen.bed",header = FALSE, sep="\t",stringsAsFactors=FALSE, quote=""))
C1 C2 C3
33 12249 0,300,3900,400,4500,400,4200
83 9213 0,49,66,75,158,160,170,183,218
146 680 0,3,13,129,274,278,383,481,482,496
I want to do addition of C1 into each element of C3, it would be something like.
C1 C2 C3
33 12249 33,333,3933,433,4533,433,433
83 9213 83 132 149 158 241 243 253 266 301
146 680 146 149 159 275 420 424 529 627 628 642
but somehow its showing that the C3 is a character class, I tried. different ways to convert into numeric type using as.numeric type.convert, character to factor and then numeric
. But still didn't can anyone suggest best way to perform this?
You can try,
mapply(function(x, y)paste(x + as.numeric(y), collapse = ','),df$C1 ,strsplit(df$C3, ','))
[1] "33,333,3933,433,4533,433,4233" "83,132,149,158,241,243,253,266,301" "146,149,159,275,420,424,529,627,628,642"
DATA
df <- data.frame(C1 = c(33, 83, 146),
C2 = c(1, 2, 3),
C3 = c('0,300,3900,400,4500,400,4200', '0,49,66,75,158,160,170,183,218', '0,3,13,129,274,278,383,481,482,496'),
stringsAsFactors = FALSE)
EDIT
To make C3 into numeric you will have to split it into many columns. There are a bunch of ways to do it as shown here. I like the splitstackshape approach, i.e.
library(splitstackshape)
df1 <- cSplit(df, 'C3', sep = ',')
#C1 C2 C3_01 C3_02 C3_03 C3_04 C3_05 C3_06 C3_07 C3_08 C3_09 C3_10
#1: 33 1 33 333 3933 433 4533 433 4233 NA NA NA
#2: 83 2 83 132 149 158 241 243 253 266 301 NA
#3: 146 3 146 149 159 275 420 424 529 627 628 642
str(df1)
Classes ‘data.table’ and 'data.frame': 3 obs. of 12 variables:
$ C1 : num 33 83 146
$ C2 : num 1 2 3
$ C3_01: int 33 83 146
$ C3_02: int 333 132 149
$ C3_03: int 3933 149 159
$ C3_04: int 433 158 275
$ C3_05: int 4533 241 420
$ C3_06: int 433 243 424
$ C3_07: int 4233 253 529
$ C3_08: int NA 266 627
$ C3_09: int NA 301 628
$ C3_10: int NA NA 642

Viewing dataset in RStudio shows different number of observations compared to R commands

I am currently studying data science with R. To practice, I am using the Auto data of the ISLR package. However, I am encountering a confusing situation when viewing the data. When I view the dataset Auto.df in RStudio, I get the following:
However, when I use dim(Auto.df), I get the following:
> dim(Auto.df)
[1] 392 9
And when I use nrow(Auto.df), I get the following:
> nrow(Auto.df)
[1] 392
And when I use str(Auto.df), I get the following:
> str(Auto.df)
'data.frame': 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
$ weight : num 3504 3693 3436 3433 3449 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : num 70 70 70 70 70 70 70 70 70 70 ...
$ origin : num 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
And I have the following in my RStudio "Global Environment" tab:
So why does viewing the dataset in RStudio show 397 rows (observations), whilst everything else says that there are 392 observations?
There are 392 observations in the data. What you are viewing are the rownames of the data. You can set rownames as anything and they do not represent row number in the data.
If you check the rownames of Auto dataset you'll realise they are not sequential and some rownames jump by 2. For example, after 32 you don't have 33 but 34. Similarly after 126 there is 128. I don't know why the data is like that but that makes row number at the end to go till 397.

read.csv() parses numeric columns as factors [duplicate]

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 5 years ago.
Have the below dataframe where all the columns are factors which I want to use them as numeric columns. I tried different ways but it is changing to different values when I try as.numeric(as.character(.))
The data comes in a semicolon separated format. A subset of data to reproduce the problem is:
rawData <- "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
21/12/2006;11:23:00;?;?;?;?;?;?;
21/12/2006;11:24:00;?;?;?;?;?;?;
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
16/12/2006;17:29:00;3.520;0.522;235.020;15.000;0.000;2.000;17.000
16/12/2006;17:30:00;3.702;0.520;235.090;15.800;0.000;1.000;17.000
16/12/2006;17:31:00;3.700;0.520;235.220;15.800;0.000;1.000;17.000
16/12/2006;17:32:00;3.668;0.510;233.990;15.800;0.000;1.000;17.000
"
hpc <- read.csv(text=rawData,sep=";")
str(hpc)
When run against the full data file after dropping the date and time variables, the output from str() looks like:
> str(hpc)
'data.frame': 2075259 obs. of 7 variables:
$ Global_active_power : Factor w/ 4187 levels "?","0.076","0.078",..: 2082 2654 2661 2668 1807 1734 1825 1824 1808 1805 ...
$ Global_reactive_power: Factor w/ 533 levels "?","0.000","0.046",..: 189 198 229 231 244 241 240 240 235 235 ...
$ Voltage : Factor w/ 2838 levels "?","223.200",..: 992 871 837 882 1076 1010 1017 1030 907 894 ...
$ Global_intensity : Factor w/ 222 levels "?","0.200","0.400",..: 53 81 81 81 40 36 40 40 40 40 ...
$ Sub_metering_1 : Factor w/ 89 levels "?","0.000","1.000",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Sub_metering_2 : Factor w/ 82 levels "?","0.000","1.000",..: 3 3 14 3 3 14 3 3 3 14 ...
$ Sub_metering_3 : num 17 16 17 17 17 17 17 17 17 16 ...
Can anyone help me in getting the expected output?
expected output:
> str(hpc)
'data.frame': 2075259 obs. of 7 variables:
$ Global_active_power : num "?","0.076","0.078",..: 2082 2654 2661 2668 1807 1734 1825 1824 1808 1805 ...
$ Global_reactive_power: num "?","0.000","0.046",..: 189 198 229 231 244 241 240 240 235 235 ...
$ Voltage : num "?","223.200",..: 992 871 837 882 1076 1010 1017 1030 907 894 ...
$ Global_intensity : num "?","0.200","0.400",..: 53 81 81 81 40 36 40 40 40 40 ...
$ Sub_metering_1 : num "?","0.000","1.000",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Sub_metering_2 : num "?","0.000","1.000",..: 3 3 14 3 3 14 3 3 3 14 ...
$ Sub_metering_3 : num 17 16 17 17 17 17 17 17 17 16 ...
Not able to test your data frame, but hopefully this will work. I notice that in the output of str(hpc) not all columns are factors. mutate_if can apply a function to those meet the requirement of a predictive function.
library(dplyr)
hpc2 <- hpc %>%
mutate_if(is.factor, funs(as.numeric(as.character(.))))

How to overwrite a factor in R

I have a dataset:
> k
EVTYPE FATALITIES INJURIES
198704 HEAT 583 0
862634 WIND 158 1150
68670 WIND 116 785
148852 WIND 114 597
355128 HEAT 99 0
67884 WIND 90 1228
46309 WIND 75 270
371112 HEAT 74 135
230927 HEAT 67 0
78567 WIND 57 504
The variables are as follows. As per the first answer by joran, unused levels can be dropped by droplevels, so no worry about the 898 levels, the illustrative k I'm showing is the complete dataset obtained from k <- d1[1:10, 3:4] where d1 is the original dataset.
> str(k)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 898 levels " HIGH SURF ADVISORY",..: 243 NA NA NA 243 NA NA 243 243 NA
$ FATALITIES: num 583 158 116 114 99 90 75 74 67 57
$ INJURIES : num 0 1150 785 597 0 ...
I'm trying to overwrite the WIND factor:
> k[k$EVTYPE==factor("WIND"), ]$EVTYPE <- factor("AFDAF")
> k[k$EVTYPE=="WIND", ]$EVTYPE <- factor("AFDAF")
But both commands give me error messages: level sets of factors are different or invalid factor level, NA generated.
How should I do this?
Try this instead:
k <- droplevels(d1[1:10, 3:5])
Factors (as per the documentation) are simply a vector of integer codes and then a simple vector of labels for each code. These are called the "levels". The levels are an attribute, and persist with your data even when subsetting.
This is a feature, since for many statistical procedures it is vital to keep track of all the possible values that variable could have, even if they don't appear in the actual data.
Some people find this irritation and run R using options(stringsAsFactors = FALSE).
To simply change the levels, you can do something like this:
d <- read.table(text = " EVTYPE FATALITIES INJURIES
198704 HEAT 583 0
862634 WIND 158 1150
68670 WIND 116 785
148852 WIND 114 597
355128 HEAT 99 0
67884 WIND 90 1228
46309 WIND 75 270
371112 HEAT 74 135
230927 HEAT 67 0
78567 WIND 57 504",header = TRUE,sep = "",stringsAsFactors = TRUE)
> str(d)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 2 levels "HEAT","WIND": 1 2 2 2 1 2 2 1 1 2
$ FATALITIES: int 583 158 116 114 99 90 75 74 67 57
$ INJURIES : int 0 1150 785 597 0 1228 270 135 0 504
> levels(d$EVTYPE) <- c('A','B')
> str(d)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 2 levels "A","B": 1 2 2 2 1 2 2 1 1 2
$ FATALITIES: int 583 158 116 114 99 90 75 74 67 57
$ INJURIES : int 0 1150 785 597 0 1228 270 135 0 504
Or to just change one:
levels(d$EVTYPE)[2] <- 'C'

R subsetting a data frame based on a factor variable formatted like a range (xx-xx)

I am facing this problem for many hours now, but I know I am missing something obvious.
Here is my problem:
I have a data-frame in .xlsx file that can be downloaded here.
I loaded this data-frame into R using RStudio on MAc and called it demoData.
There are 5 variables (AgeRange, Women, Men, Total, and Year).
I am not able to subset this data frame with a condition on the AgeRange. The format of this variable is as follow: xx-xx (00-04 meaning people between 00 and 04 years old). The message I have when I try to do that is that there is no row filling this condition.
The class of the variable "AgeRange" is factor.
Here is my code:
demoData[demoData$AgeRange=="00-04",]
Thank you for your help.
Edit: from Arun. Here's input from head(demoData):
Age Feminin Masculin. Ensemble Annee
1 00-04 720 745 1465 2004
2 05-09 745 767 1512 2004
3 10-14 813 830 1643 2004
4 15-19 824 820 1644 2004
5 20-24 839 823 1662 2004
6 25-29 752 699 1450 2004
# str(demoData)
'data.frame': 272 obs. of 5 variables:
$ Age : Factor w/ 16 levels "00-04 ","05-09 ",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Feminin : Factor w/ 216 levels "138 ","139 ",..: 112 124 164 165 174 130 106 86 78 66 ...
$ Masculin.: Factor w/ 201 levels "120 ","122 ",..: 132 141 174 169 170 124 111 89 90 75 ...
$ Ensemble : Factor w/ 242 levels "1041 ","1044 ",..: 53 66 115 116 119 50 38 14 9 238 ...
$ Annee : Factor w/ 17 levels "2004 ","2005",..: 1 1 1 1 1 1 1 1 1 1 ...
I read in your xlsx file with the xlsx package:
df<-read.xlsx("C:/Users/swatson1/Downloads/Evolution_Population_2004_2020.xlsx",1)
and it looked like this:
> df
Age Feminin MasculinÂ. Ensemble Annee
1 00-04Â 720Â 745Â 1465Â 2004Â
2 05-09Â 745Â 767Â 1512Â 2004Â
You could replace each column, getting rid of the extra character with something like:
df$Age<-substr(df$Age,1,5)
Alternatively, use gsub as this will work on any column regardless of the length of the entry:
df$Age<-gsub("Â\\s","",df$Age)
Then your code would work:
df[df$Age=="00-04",]
#coppied from the Excel file
str1 <- "00-04 "
utf8ToInt(str1)
#[1] 48 48 45 48 52 160
There seems to be a no-break space at the end of the string. Sanitize your file.
You should be able to remove the no-break spaces using
df$Age <- gsub(intToUtf8(160),"",df$Age)

Resources