Transform single row into rows and columns - r

I have a list of 170 items, each with 12 variables. This data is currently organised in one continuous row (1 observations of 2040 variables), e.g.:
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
but I want it to be organised into 170 columns with 12 rows as follows:
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
I have tried the following:
list2=lapply(list1, function(x) t(x))
but this doesn't alter the data in any way. Is there something else I can do to transform the data?

We convert the string to a vector of numeric elements with scan, split the vector by itself to create a list and convert it to a data.frame
v2 <- scan(text=v1, what=numeric(), quiet=TRUE)
data.frame(split(v2, v2))

If your data is already converted into a vector (as #akrun showed with using scan) you could also do:
data <- 1:2040 # your data
breaks <- seq(1, 2040, 170)
result <- lapply(breaks, function(x) data[x : (x + 169)])
Results in
> str(result)
List of 12
$ : int [1:170] 1 2 3 4 5 6 7 8 9 10 ...
$ : int [1:170] 171 172 173 174 175 176 177 178 179 180 ...
$ : int [1:170] 341 342 343 344 345 346 347 348 349 350 ...
$ : int [1:170] 511 512 513 514 515 516 517 518 519 520 ...
$ : int [1:170] 681 682 683 684 685 686 687 688 689 690 ...
$ : int [1:170] 851 852 853 854 855 856 857 858 859 860 ...
$ : int [1:170] 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 ...
$ : int [1:170] 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 ...
$ : int [1:170] 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 ...
$ : int [1:170] 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 ...
$ : int [1:170] 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 ...
$ : int [1:170] 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 ...

Related

How do I fix a summarise function error in dplyr package?

I have some problems summarise function in "dplyr" package.
This is the code.
library("dplyr")
a <- read.csv("Number of subway passengers.csv",header = T, stringsAsFactor = F)
a <- a[,c(-2,-3,-4,-5)]
colnames(a)=c("Date","4-5","5-6","6-7","7-8","8-9","9-10","10-11","11-12","12-13","13-14","14-15","15-
16","16-17","17-18","18-19","19-20","20-21","21-22","22-23","23-24","0-1","1-2","2-3","3-4","Total")
b <- summarise(a,mean_passenger=mean("Total",na.rm=TRUE))
After running the last code I have some error in summarise.
In mean.default("Total", na.rm = TRUE) : argument is not numeric or logical:returning NA
Why does this error occur?
I attach the result of using the function str.
> str(a)
'data.frame': 16501 obs. of 26 variables:
$ Date : chr "2019-11-01" "2019-11-01" "2019-11-01" "2019-11-01" ...
$ 4-5 : int 32 2 3 0 5 0 11 1 2 0 ...
$ 5-6 : int 438 353 89 182 143 211 187 127 83 175 ...
$ 6-7 : int 529 2019 152 852 161 1078 154 477 115 622 ...
$ 7-8 : int 1612 4520 289 2926 288 4395 302 1044 219 1817 ...
$ 8-9 : int 3405 9906 435 9348 482 13000 386 3662 366 5234 ...
$ 9-10 : int 2360 6525 481 4124 631 6669 550 3510 494 3292 ...
$ 10-11 : int 2377 3571 716 2064 768 2964 841 2593 843 2292 ...
$ 11-12 : int 2853 2951 1090 1889 1359 2501 1686 2813 1262 2349 ...
$ 12-13 : int 3334 3190 1073 1538 1531 2127 1781 2646 1583 2160 ...
$ 13-14 : int 3545 3348 1367 1751 1937 2108 2059 2718 1868 2159 ...
$ 14-15 : int 2850 3179 1782 1403 2466 1926 2405 2579 2303 2071 ...
$ 15-16 : int 4606 3265 2235 1431 2821 1718 3125 2103 2479 1559 ...
$ 16-17 : int 4915 3575 2345 1218 3403 1778 3241 2010 2656 1777 ...
$ 17-18 : int 7472 4191 3627 1249 5807 2396 3796 2033 3583 1599 ...
$ 18-19 : int 11107 5445 7462 1486 10738 3746 4836 2582 5246 1776 ...
$ 19-20 : int 5754 3882 2943 816 4680 2557 3192 1682 2709 1261 ...
$ 20-21 : int 3920 2596 2249 439 3670 935 2107 675 1782 548 ...
$ 21-22 : int 3799 2177 2199 288 4495 510 2452 512 1565 341 ...
$ 22-23 : int 3369 1624 1460 296 4118 384 2407 380 1094 260 ...
$ 23-24 : int 1678 912 640 202 2366 299 1394 323 596 153 ...
$ 0-1 : int 228 478 62 47 271 75 236 143 66 73 ...
$ 1-2 : int 2 39 0 1 1 0 6 10 1 1 ...
$ 2-3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 3-4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Total : int 70185 67748 32699 33550 52141 51377 37154 34623 30915 31519 ...
"Total" is interpreted as a string. We can reproduce the same error with
mean("Total")
#[1] NA
Warning message:
In mean.default("Total") : argument is not numeric or logical: returning NA
We need to use Total without quotes to be interpreted as column.
b <- dplyr::summarise(a, mean_passenger = mean(Total,na.rm=TRUE))

Isolate Data Frame from Data Frame for Multiple Time Series Plot

I want to extract temperature (temp_c) at specific pressure level (press_hpa). As I am filtering my data (dat) using dplyr, I'm creating another data frame which contains the same columns numbers (15) and different length of observation. There were so many solution to plot multiple time series from column but I cant match the solution.. How to plot a multiple time series showing temperature at different level(x = date, y = temp_c, legend = Press_1000, Press_925, Press_850, Press_700)? Kindly help.. Thank you..
library(ggplot2),
library(dplyr)
library(reshape2)
setwd("C:/Users/Hp/Documents/yr/climatology/")
dat <- read.csv("soundingWMKD.csv", head = TRUE, stringsAsFactors = F)
str(dat)
'data.frame': 6583 obs. of 15 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ pres_hpa : num 1006 1000 993 981 1005 ...
$ hght_m : int 16 70 132 238 16 62 141 213 302 329 ...
$ temp_c : num 24 23.6 23.2 24.6 24.2 24.2 24 23.8 23.3 23.2 ...
$ dwpt_c : num 23.4 22.4 21.5 21.6 23.6 23.1 22.9 22.7 22 21.8 ...
$ relh_pct : int 96 93 90 83 96 94 94 94 92 92 ...
$ mixr_g_kg: num 18.4 17.4 16.6 16.9 18.6 ...
$ drct_deg : int 0 0 NA NA 190 210 212 213 215 215 ...
$ sknt_knot: int 0 0 NA NA 1 3 6 8 11 11 ...
$ thta_k : num 297 297 297 299 297 ...
$ thte_k : num 350 347 345 349 351 ...
$ thtv_k : num 300 300 300 302 300 ...
$ date : chr "2017-11-02" "2017-11-02" "2017-11-02" "2017-11-02" ...
$ from_hr : int 0 0 0 0 0 0 0 0 0 0 ...
$ to_hr : int 0 0 0 0 0 0 0 0 0 0 ...
Press_1000 <- filter(dat,dat$pres_hpa == 1000)
Press_925 <- filter(dat,dat$pres_hpa == 925)
Press_850 <- filter(dat,dat$pres_hpa == 850)
Press_700 <- filter(dat,dat$pres_hpa == 700)
date <- as.Date(dat$date, "%m-%d-%y")
str(Press_1000)
'data.frame': 80 obs. of 15 variables:
$ X : int 2 6 90 179 267 357 444 531 585 675 ...
$ pres_hpa : num 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
$ hght_m : int 70 62 63 63 62 73 84 71 74 78 ...
$ temp_c : num 23.6 24.2 24.4 24.2 25.4 24 23.8 24 23.8 24 ...
$ dwpt_c : num 22.4 23.1 23.2 22.3 23.9 23.1 23.4 23 23 23.1 ...
$ relh_pct : int 93 94 93 89 91 95 98 94 95 95 ...
$ mixr_g_kg: num 17.4 18.2 18.3 17.3 19.1 ...
$ drct_deg : int 0 210 240 210 210 340 205 290 315 0 ...
$ sknt_knot: int 0 3 2 3 3 2 4 1 1 0 ...
$ thta_k : num 297 297 298 297 299 ...
$ thte_k : num 347 350 351 348 354 ...
$ thtv_k : num 300 301 301 300 302 ...
$ date : chr "2017-11-02" "2017-11-03" "2017-11-04" "2017-11-05" ...
$ from_hr : int 0 0 0 0 0 0 0 0 0 0 ...
$ to_hr : int 0 0 0 0 0 0 0 0 0 0 ...
str(Press_925)
'data.frame': 79 obs. of 15 variables:
$ X : int 13 96 187 272 365 450 537 593 681 769 ...
$ pres_hpa : num 925 925 925 925 925 925 925 925 925 925 ...
$ hght_m : int 745 747 746 748 757 764 757 758 763 781 ...
$ temp_c : num 21.8 22 22.4 23.2 22.2 20.6 22.4 22 22.4 22.2 ...
$ ... 'truncated'
all_series = rbind(date,Press_1000,Press_925,Press_850,Press_700)
meltdf <- melt(all_series,id.vars ="date")
ggplot(meltdf,aes(x=date,y=value,colour=variable,group=variable)) +
geom_line()
There are two ways of approaching this. What you go for may depend on the bedrock question (which we don't know).
1) For each data.frame, you have all the necessary columns and you can plot each source (data.frame) using e.g.
ggplot()... +
geom_line(data = Press_1000, aes(...)) +
geom_line(data = Press_925, aes(...)) ...
Note that you will have to specify color for each source and having a legend with that is PITA.
2) Combine all data.frames into one big object and create an additional column indicating the origin of the data (from which data.frame the observation is from). This would be your mapping variable (e.g. color, fill, group)in your current aes call. Instant legend.

R find the index of a charactor in dataframe

I have a dataframe with SAT scores for all states in US.
'data.frame': 51 obs. of 7 variables:
$ X2010.rank : int 1 2 3 4 5 6 7 8 9 10 ...
$ state : chr "Iowa " "Minnesota " "Wisconsin " "Missouri " ...
$ reading : int 603 594 595 593 585 592 585 590 585 580 ...
$ math : int 613 607 604 595 605 603 600 595 593 594 ...
$ writing : int 582 580 579 580 576 571 577 567 568 559 ...
$ combined : int 1798 1781 1778 1768 1766 1766 1762 1752 1746 1733 ...
$ participation: chr "3%" "7%" "4%" "4%" ...
I need to find the index of a particular state. I tried the which command but its returning integer(0)
> which(sat$state=="California")
integer(0)
However this command is working for other rows and getting me the index:
> which(sat$combined==1781)
[1] 2
where am I going wrong. Please help.

non meaningful operation for fractor error when storing new value in data frame: R

I am trying to update a a value in a data frame but am getting--what seems to me--a weird error about operation that I don't think I am using.
Here's a summary of the data:
> str(us.cty2015#data)
'data.frame': 3108 obs. of 15 variables:
$ STATEFP : Factor w/ 52 levels "01","02","04",..: 17 25 33 46 4 14 16 24 36 42 ...
$ COUNTYFP : Factor w/ 325 levels "001","003","005",..: 112 91 67 9 43 81 7 103 72 49 ...
$ COUNTYNS : Factor w/ 3220 levels "00023901","00025441",..: 867 1253 1600 2465 38 577 690 1179 1821 2104 ...
$ AFFGEOID : Factor w/ 3220 levels "0500000US01001",..: 976 1472 1879 2813 144 657 795 1395 2098 2398 ...
$ GEOID : Factor w/ 3220 levels "01001","01003",..: 976 1472 1879 2813 144 657 795 1395 2098 2398 ...
$ NAME : Factor w/ 1910 levels "Abbeville","Acadia",..: 1558 1703 1621 688 856 1075 148 1807 1132 868 ...
$ LSAD : Factor w/ 9 levels "00","03","04",..: 5 5 5 5 5 5 5 5 5 5 ...
$ ALAND : num 1.66e+09 1.10e+09 3.60e+09 2.12e+08 1.50e+09 ...
$ AWATER : num 2.78e+06 5.24e+07 3.50e+07 2.92e+08 8.91e+06 ...
$ t_pop : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_wht : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_free_blk: num 0 0 0 0 0 0 0 0 0 0 ...
$ n_slv : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_blk : num 0 0 0 0 0 0 0 0 0 0 ...
$ n_free : num 0 0 0 0 0 0 0 0 0 0 ...
> str(us.cty1860#data)
'data.frame': 2126 obs. of 29 variables:
$ DECADE : Factor w/ 1 level "1860": 1 1 1 1 1 1 1 1 1 1 ...
$ NHGISNAM : Factor w/ 1236 levels "Abbeville","Accomack",..: 1142 1218 1130 441 812 548 1144 56 50 887 ...
$ NHGISST : Factor w/ 41 levels "010","050","060",..: 32 13 9 36 16 36 16 30 23 39 ...
$ NHGISCTY : Factor w/ 320 levels "0000","0010",..: 142 206 251 187 85 231 131 12 6 161 ...
$ ICPSRST : Factor w/ 37 levels "1","11","12",..: 5 13 21 26 22 26 22 10 15 17 ...
$ ICPSRCTY : Factor w/ 273 levels "10","1010","1015",..: 25 93 146 72 247 122 12 10 228 45 ...
$ ICPSRNAM : Factor w/ 1200 levels "ABBEVILLE","ACCOMACK",..: 1108 1184 1097 432 791 535 1110 55 49 860 ...
$ STATENAM : Factor w/ 41 levels "Alabama","Arkansas",..: 32 13 9 36 16 36 16 30 23 39 ...
$ ICPSRSTI : int 14 31 44 49 45 49 45 24 34 40 ...
$ ICPSRCTYI : int 1210 1970 2910 1810 710 2450 1130 110 50 1450 ...
$ ICPSRFIP : num 0 0 0 0 0 0 0 0 0 0 ...
$ STATE : Factor w/ 41 levels "010","050","060",..: 32 13 9 36 16 36 16 30 23 39 ...
$ COUNTY : Factor w/ 320 levels "0000","0010",..: 142 206 251 187 85 231 131 12 6 161 ...
$ PID : num 1538 735 306 1698 335 ...
$ X_CENTROID : num 1348469 184343 1086494 -62424 585888 ...
$ Y_CENTROID : num 556680 588278 -229809 -433290 -816852 ...
$ GISJOIN : Factor w/ 2126 levels "G0100010","G0100030",..: 1585 627 319 1769 805 1788 823 1425 1079 2006 ...
$ GISJOIN2 : Factor w/ 2126 levels "0100010","0100030",..: 1585 627 319 1769 805 1788 823 1425 1079 2006 ...
$ SHAPE_AREA : num 2.35e+09 1.51e+09 8.52e+08 2.54e+09 6.26e+08 ...
$ SHAPE_LEN : num 235777 155261 166065 242608 260615 ...
$ t_pop : int 25043 653 4413 8184 174491 1995 4324 17187 4649 8392 ...
$ n_wht : int 24974 653 4295 6892 149063 1684 3001 17123 4578 2580 ...
$ n_free_blk : int 69 0 2 0 10939 2 7 64 12 409 ...
$ n_slv : int 0 0 116 1292 14484 309 1316 0 59 5403 ...
$ n_blk : int 69 0 118 1292 25423 311 1323 64 71 5812 ...
$ n_free : num 25043 653 4297 6892 160007 ...
$ frac_free : num 1 1 0.974 0.842 0.917 ...
$ frac_free_blk: num 1 NA 0.0169 0 0.4303 ...
$ frac_slv : num 0 0 0.0263 0.1579 0.083 ...
> str(overlap)
'data.frame': 15266 obs. of 7 variables:
$ cty2015 : Factor w/ 3108 levels "0","1","10","100",..: 1 1 2 2 2 2 2 1082 1082 1082 ...
$ cty1860 : Factor w/ 2126 levels "0","1","10","100",..: 1047 1012 1296 1963 2033 2058 2065 736 1413 1569 ...
$ area_inter : num 1.66e+09 2.32e+05 9.81e+04 1.07e+09 7.67e+07 ...
$ area1860 : num 1.64e+11 1.81e+11 1.54e+09 2.91e+09 2.32e+09 ...
$ frac_1860 : num 1.01e-02 1.28e-06 6.35e-05 3.67e-01 3.30e-02 ...
$ sum_frac_1860 : num 1 1 1 1 1 ...
$ scaled_frac_1860: num 1.01e-02 1.28e-06 6.35e-05 3.67e-01 3.30e-02 ...
I am trying to multiply a vector of variables vars <- c("t_pop", "n_wht", "n_free_blk", "n_slv", "n_blk", "n_free") in the us.cty1860#data data frame by a scalar overlap$scaled_frac_1860[i], then add it to the same vector of variables in the us.cty2015#data data frame, and finally overwrite the variables in the us.cty2015#data data frame.
When I make the following call, I get an error that seems to be saying that I am trying to preform invalid operations on factors (which is not the case (you can confirm from the str output)).
> us.cty2015#data[overlap$cty2015[1], vars] <- us.cty2015#data[overlap$cty2015[1], vars] + (overlap$scaled_frac_1860[1] * us.cty1860#data[overlap$cty1860[1], vars])
Error in Summary.factor(1L, na.rm = FALSE) :
‘max’ not meaningful for factors
In addition: Warning message:
In Ops.factor(i, 0L) : ‘>=’ not meaningful for factors
However, when I don't attempt to overwrite the old value, the operation works fine.
> us.cty2015#data[overlap$cty2015[1], vars] + (overlap$scaled_frac_1860[1] * us.cty1860#data[overlap$cty1860[1], vars])
t_pop n_wht n_free_blk n_slv n_blk n_free
0 118.3889 113.6468 0.1317233 4.610316 4.742039 113.7785
I'm sure there are better ways of accomplishing what I am trying to do but does anyone have any idea what is going on?
Edit:
I am using the following libraries: rgdal, rgeos, and maptools
The all the data/object are coming from NHGIS shapefiles 1860 and 2015 United States Counties.

Merge data frames from a list with each other

What I need:
I have a huge data frame with the following columns (and some more, but these are not important). Here's an example:
user_id video_id group_id x y
1 1 0 0 39 108
2 1 0 0 39 108
3 1 10 0 135 180
4 2 0 0 20 123
User, video and group IDs are factors, of course. For example, there are 20 videos, but each of them has several "observations" for each user and group.
I'd like to transform this data frame into the following format, where there are as many x.N, y.N as there are users (N).
video_id x.1 y.1 x.2 y.2 …
0 39 108 20 123
So, for video 0, the x and y values from user 1 are in columns x.1 and y.1, respectively. For user 2, their values are in columns x.2, y.2, and so on.
What I've tried:
I made myself a list of data frames that are solely composed of all the x, y observations for each video_id:
summaryList = dlply(allData, .(user_id), function(x) unique(x[c("video_id","x","y")]) )
That's how it looks like:
List of 15
$ 1 :'data.frame': 20 obs. of 3 variables:
..$ video_id: Factor w/ 20 levels "0","1","2","3",..: 1 11 8 5 12 9 20 13 7 10 ...
..$ x : int [1:20] 39 135 86 122 28 167 203 433 549 490 ...
..$ y : int [1:20] 108 180 164 103 187 128 185 355 360 368 ...
$ 2 :'data.frame': 20 obs. of 3 variables:
..$ video_id: Factor w/ 20 levels "0","1","2","3",..: 2 14 15 4 20 6 19 3 13 18 ...
..$ x : int [1:20] 128 688 435 218 528 362 299 134 83 417 ...
..$ y : int [1:20] 165 117 135 179 96 328 332 563 623 476 ...
Where I'm stuck:
What's left to do is:
Merge each data frame from the summaryList with each other, based on the video_id. I can't find a nice way to access the actual data frames in the list, which are summaryList[1]$`1`, summaryList[2]$`2`, et cetera.
#James found out a partial solution:
Reduce(function(x,y) merge(x,y,by="video_id"),summaryList)
Ensure the column names are renamed after the user ID and not kept as-is. Right now my summaryList doesn't contain any info about the user ID, and the output of Reduce has duplicate column names like x.x y.x x.y y.y x.x y.x and so on.
How do I go about doing this? Or is there any easier way to get to the result than what I'm currently doing?
I am still somewhat confused. However, I guess you simply want to melt and dcast.
library(reshape2)
d <- melt(allData,id.vars=c("user_id","video_id"), measure.vars=c("x","y"))
dcast(d,video_id~user_id+variable,value.var="value",fun.aggregate=mean)
Resulting in:
video_id 1_x 1_y 2_x 2_y 3_x 3_y 4_x 4_y 5_x 5_y 6_x 6_y 7_x 7_y 8_x 8_y 9_x 9_y 10_x 10_y 11_x 11_y 12_x 12_y 14_x 14_y 15_x 15_y 16_x 16_y
1 0 39 108 899 132 61 357 149 298 1105 415 148 208 442 200 210 134 58 244 910 403 152 52 1092 617 1012 114 1105 424 548 394
2 1 1125 70 128 165 1151 390 171 587 623 623 80 643 866 310 994 114 854 129 781 306 672 -1 1096 354 525 524 150
Reduce does the trick:
reducedData <- Reduce(function(x,y) merge(x,y,by="video_id"),summaryList)
… but you need to fix the names afterwards:
names(reducedData)[-1] <- do.call(function(...) paste(...,sep="."),expand.grid(letters[24:25],names(summaryList)))
The result is:
video_id x.1 y.1 x.2 y.2 x.3 y.3 x.4 y.4 x.5 y.5 x.6 y.6 x.7 y.7 x.8
1 0 39 108 899 132 61 357 149 298 1105 415 148 208 442 200 210
2 1 1125 70 128 165 1151 390 171 587 623 623 80 643 866 310 994

Resources